In [1]:
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
station_path = "../data/raw/dot_traffic_stations_2015.txt.gz"
traffic_path = "../data/raw/dot_traffic_2015.txt.gz"

In [3]:
station_df = pd.read_csv(station_path, compression="gzip")

In [4]:
traffic_df = pd.read_csv(traffic_path, compression="gzip")

# Data Cleaning

We take a first look at the data and clean both datasets before conducting any EDA. We do this methodically, cleaning the station dataset first and the traffic dataset second.

## Station Data

In [5]:
station_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28466 entries, 0 to 28465
Data columns (total 55 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   algorithm_of_vehicle_classification               18576 non-null  object 
 1   algorithm_of_vehicle_classification_name          17335 non-null  object 
 2   calibration_of_weighing_system                    8165 non-null   object 
 3   calibration_of_weighing_system_name               6681 non-null   object 
 4   classification_system_for_vehicle_classification  28466 non-null  int64  
 5   concurrent_route_signing                          28466 non-null  int64  
 6   concurrent_signed_route_number                    13592 non-null  object 
 7   direction_of_travel                               28466 non-null  int64  
 8   direction_of_travel_name                          28466 non-null  object 
 9   fips_county_code 

In [6]:
station_df.isnull().sum()

algorithm_of_vehicle_classification                  9890
algorithm_of_vehicle_classification_name            11131
calibration_of_weighing_system                      20301
calibration_of_weighing_system_name                 21785
classification_system_for_vehicle_classification        0
concurrent_route_signing                                0
concurrent_signed_route_number                      14874
direction_of_travel                                     0
direction_of_travel_name                                0
fips_county_code                                        0
fips_state_code                                         0
functional_classification                               0
functional_classification_name                          0
hpms_sample_identifier                              15248
hpms_sample_type                                        0
lane_of_travel                                          0
lane_of_travel_name                                     0
latitude      

In [7]:
station_df.head(10)

Unnamed: 0,algorithm_of_vehicle_classification,algorithm_of_vehicle_classification_name,calibration_of_weighing_system,calibration_of_weighing_system_name,classification_system_for_vehicle_classification,concurrent_route_signing,concurrent_signed_route_number,direction_of_travel,direction_of_travel_name,fips_county_code,fips_state_code,functional_classification,functional_classification_name,hpms_sample_identifier,hpms_sample_type,lane_of_travel,lane_of_travel_name,latitude,longitude,lrs_identification,lrs_location_point,method_of_data_retrieval,method_of_data_retrieval_name,method_of_traffic_volume_counting,method_of_traffic_volume_counting_name,method_of_truck_weighing,method_of_truck_weighing_name,method_of_vehicle_classification,method_of_vehicle_classification_name,national_highway_system,number_of_lanes_in_direction_indicated,number_of_lanes_monitored_for_traffic_volume,number_of_lanes_monitored_for_truck_weight,number_of_lanes_monitored_for_vehicle_class,posted_route_signing,posted_signed_route_number,previous_station_id,primary_purpose,primary_purpose_name,record_type,sample_type_for_traffic_volume,sample_type_for_traffic_volume_name,sample_type_for_truck_weight,sample_type_for_truck_weight_name,sample_type_for_vehicle_classification,sample_type_for_vehicle_classification_name,second_type_of_sensor,shrp_site_identification,station_id,station_location,type_of_sensor,type_of_sensor_name,year_of_data,year_station_discontinued,year_station_established
0,,,,,13,3,91.0,7,West,59,6,2U,Urban: Principal Arterial - Other Freeways or ...,,N,4,Other lanes,33.850898,117.814391,00000000091R,,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),0,,0,,Y,5,5,0,0,3,91,,P,Planning or traffic statistics purposes,S,T,Station used for Traffic Volume Trends,,,N,Station not used for Heavy Vehicle Travel Info...,N,,129130,LAKEVIEW AVENUE ORA91R10.091,L,Inductance loop,15,0,97
1,,,,,13,3,99.0,5,South,77,6,3R,Rural: Principal Arterial - Other,,N,1,Outside (rightmost) lane,37.874697,121.21959,00000000099R,248336.0,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),0,,0,,Y,2,2,0,0,3,99,,P,Planning or traffic statistics purposes,S,T,Station used for Traffic Volume Trends,,,N,Station not used for Heavy Vehicle Travel Info...,N,,100190,LITTLE JOHN CREEK SJ9912.526,L,Inductance loop,15,0,97
2,G,Axle spacing with Scheme F modified,,,15,1,5.0,1,North,93,6,1R,Rural: Principal Arterial - Interstate,,N,2,Other lanes,41.441777,122.43501,00000000005R,750293.0,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),0,,3,Permanent vehicle classification device,Y,2,2,0,2,1,5,,P,Planning or traffic statistics purposes,S,T,Station used for Traffic Volume Trends,,,H,Station used for Heavy Vehicle Travel Informat...,N,,022940,EDGEWOOD SIS5R22.999,P,Piezoelectric,15,0,69
3,D,Vehicle length classification,M,Moving average of the steering axle of 3S2s,13,0,,5,South,35,49,1U,Urban: Principal Arterial - Interstate,A00015293910,Y,1,Outside (rightmost) lane,40.5165,111.89152,000000001500,290600.0,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),4,Portable weigh-in-motion system,3,Permanent vehicle classification device,Y,5,5,5,5,1,15,,P,Planning or traffic statistics purposes,S,T,Station used for Traffic Volume Trends,B,Station used for TMG sample and Strategic High...,N,Station not used for Heavy Vehicle Travel Info...,,,000302,I 15 12900 South M.P. 290.6,X,Radio wave,15,0,11
4,G,Axle spacing with Scheme F modified,0,,14,1,0.0,7,West,27,34,1U,Urban: Principal Arterial - Interstate,,N,4,Other lanes,40.892373,74.484206,,,2,Automated (telemetry),2,Portable traffic recording device,0,,3,Permanent vehicle classification device,Y,4,4,4,4,1,80,,P,Planning or traffic statistics purposes,S,T,Station used for Traffic Volume Trends,N,Station not used for any of the above,N,Station not used for Heavy Vehicle Travel Info...,,,W01136,E. of Franklin Rd Underpass,L,Inductance loop,15,0,95
5,,,,,13,0,,3,East,27,16,1U,Urban: Principal Arterial - Interstate,EHUBBWQBVKPC,Y,1,Outside (rightmost) lane,43.5993,116.558919,001010,35701.0,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),0,,0,,Y,2,2,0,0,1,84,,P,Planning or traffic statistics purposes,S,T,Station used for Traffic Volume Trends,,,N,Station not used for Heavy Vehicle Travel Info...,,,000276,I-84 300 Ft. W of Beg EB Off,L,Inductance loop,15,0,13
6,F,Axle spacing with Scheme F,,,13,0,,5,South,49,28,1U,Urban: Principal Arterial - Interstate,,N,1,Outside (rightmost) lane,32.348988,90.245834,25_0220P1,,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),0,,3,Permanent vehicle classification device,Y,2,2,0,2,1,220,,P,Planning or traffic statistics purposes,S,T,Station used for Traffic Volume Trends,,,H,Station used for Heavy Vehicle Travel Informat...,L,,252560,AVC 222 - 0.5 mi N of Industri,P,Piezoelectric,15,0,8
7,G,Axle spacing with Scheme F modified,Z,Other method,15,2,2.0,7,West,63,53,3R,Rural: Principal Arterial - Other,500232110754,Y,2,Other lanes,47.845357,117.354419,000000000200,301400.0,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),5,Permanent weigh-in-motion system,3,Permanent vehicle classification device,Y,2,2,2,2,2,2,,R,Research purposes (e.g. LTPP),S,T,Station used for Traffic Volume Trends,T,Station used for TMG sample (but not SHRP/LTPP...,H,Station used for Heavy Vehicle Travel Informat...,P,,P28AAA,4.15_MI._N._OF_SR_206,P,Piezoelectric,15,0,93
8,F,Axle spacing with Scheme F,T,Test trucks only,13,0,,7,West,141,28,3R,Rural: Principal Arterial - Other,711000010000,Y,1,Outside (rightmost) lane,34.7742,88.147575,71_0072P1,,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),5,Permanent weigh-in-motion system,3,Permanent vehicle classification device,Y,2,2,2,2,2,72,,L,Load data for pavement design or pavement mana...,S,T,Station used for Traffic Volume Trends,B,Station used for TMG sample and Strategic High...,H,Station used for Heavy Vehicle Travel Informat...,L,3019.0,710107,WIM 118 - 2.0 mi W of AL State,Q,Quartz piezoelectric - NEW,15,0,80
9,,,,,13,2,17.0,5,South,177,51,1U,Urban: Principal Arterial - Interstate,,N,0,Data with lanes combined,38.24732,77.50721,000000000000,0.0,2,Automated (telemetry),3,Permanent automatic traffic recorder (ATR),0,,0,,Y,3,3,0,0,1,95,0.0,O,Operations purposes but not ITS,S,T,Station used for Traffic Volume Trends,N,Station not used for any of the above,N,Station not used for Heavy Vehicle Travel Info...,X,,160004,I-95 SOUTH SHOULDER @ MM 126.8,X,Radio wave,15,0,9


Based on the first 10 rows of the dataset, we can already see that some columns contain missing values. How we deal with these missing values and columns will depend on further analysis.

Also note that the station data has 55 columns.

In [8]:
station_df.dtypes

algorithm_of_vehicle_classification                  object
algorithm_of_vehicle_classification_name             object
calibration_of_weighing_system                       object
calibration_of_weighing_system_name                  object
classification_system_for_vehicle_classification      int64
concurrent_route_signing                              int64
concurrent_signed_route_number                       object
direction_of_travel                                   int64
direction_of_travel_name                             object
fips_county_code                                      int64
fips_state_code                                       int64
functional_classification                            object
functional_classification_name                       object
hpms_sample_identifier                               object
hpms_sample_type                                     object
lane_of_travel                                        int64
lane_of_travel_name                     

Many columns appear to be functionally the same and can therefore be mapped to one another, for instance direction_of_travel and direction_of_travel_name. We will create a helper function to print this mapping information.

In [9]:
def print_mapped_value(df: pd.DataFrame, column1: str, column2: str) -> None:
    """
    Print mapped values from one column to another
    
    Args:
        df (DataFrame): DataFrame that contains column1 and column2
        column1 (str): Column of the dataframe that can be mapped to column2
        column2 (str): Column of the dataframe that can be mapped to column1
        
    Returns:
        None
    """
    column1_unique_values = df[column1].unique()
    for value in column1_unique_values:
        # Filter on the df argument (not the global station_df) so the helper works on any frame.
        # NaN never compares equal to itself, so a NaN value selects no rows and prints [].
        mapped_name = df.loc[df[column1] == value, column2].unique()

        if len(mapped_name) > 1:
            print(f"{value} maps to more than 1 value! they are {mapped_name}")
        else:
            print(f"{value} maps to {mapped_name}")
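
The same pairing can also be collected into a small table instead of being printed. The sketch below uses a hypothetical helper, `mapping_table`, which is not part of this notebook, and runs it on toy data rather than on station_df. Unlike the equality filter above, it keeps rows whose code is NaN, so those pairs stay visible.

```python
import pandas as pd


def mapping_table(df: pd.DataFrame, code_col: str, name_col: str) -> pd.DataFrame:
    """Return the distinct (code, name) pairs found in df, NaN included."""
    return (
        df[[code_col, name_col]]
        .drop_duplicates()
        .sort_values(code_col, na_position="last")
        .reset_index(drop=True)
    )


# Toy data standing in for station_df (illustrative values only).
toy = pd.DataFrame(
    {
        "code": ["G", "G", "0", None, "D"],
        "name": ["Scheme F modified", "Scheme F modified", None, None, "Vehicle length"],
    }
)
pairs = mapping_table(toy, "code", "name")
print(pairs)
```

On the real data the call would look like `mapping_table(station_df, "direction_of_travel", "direction_of_travel_name")`.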

For this initial round of cleaning, we will be mainly focusing on remapping columns that are functionally the same and dropping one of them. We will then do EDA and try to make sense of missing values to see if there is any pattern to these missing values. Dropping any columns or rows will be our **last resort**.

### Algorithm of vehicle classification & Algorithm of vehicle classification name

In [10]:
station_df["algorithm_of_vehicle_classification"].isnull().sum()

9890

In [11]:
station_df["algorithm_of_vehicle_classification_name"].isnull().sum()

11131

Both columns contain null values, but the counts differ (9890 versus 11131). If the two columns were equivalent, they would be missing in exactly the same rows, so we need to look more closely at how they relate.
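
One way to check whether two columns are missing in the same rows is to cross-tabulate their null masks. The following is a minimal sketch on toy data (the values are illustrative, not taken from station_df); the off-diagonal cells count rows where only one of the two columns is missing.

```python
import pandas as pd

# Toy data standing in for the code and name columns (illustrative values only).
toy = pd.DataFrame(
    {
        "code": ["G", "0", None, "1", "D"],
        "name": ["Scheme F modified", None, None, None, "Vehicle length"],
    }
)

# Rows: is the code missing? Columns: is the name missing?
null_pattern = pd.crosstab(toy["code"].isnull(), toy["name"].isnull())
print(null_pattern)

# Codes that are present even though the name is missing.
codes_without_name = toy.loc[toy["name"].isnull() & toy["code"].notnull(), "code"]
print(codes_without_name.value_counts())
```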

In [12]:
station_df["algorithm_of_vehicle_classification"].unique()

array([nan, 'G', 'D', 'F', '0', 'N', 'K', 'L', 'Z', 'M', 'H', 'A', '1',
       'C', 'E'], dtype=object)

In [13]:
station_df["algorithm_of_vehicle_classification_name"].unique()

array([nan, 'Axle spacing with Scheme F modified',
       'Vehicle length classification', 'Axle spacing with Scheme F',
       'Axle spacing and other input(s) not specified above',
       'Axle spacing and weight algorithm',
       'Axle spacing and vehicle length algorithm',
       'Other means not specified above',
       'Axle spacing weight and vehicle length algorithm',
       'Other axle spacing algorithm',
       'Human observation on site (manual)',
       'Automated interpretation of vehicle image or signature (e.g. video microwave sonic)',
       'Axle spacing with ASTM Standard E1572'], dtype=object)

In [14]:
len(station_df["algorithm_of_vehicle_classification"].unique()) - len(station_df["algorithm_of_vehicle_classification_name"].unique())

2

The two columns have different numbers of unique values, which could explain why one of them has more missing values.

In [15]:
len(station_df["algorithm_of_vehicle_classification"].unique())

15

In [16]:
len(station_df["algorithm_of_vehicle_classification_name"].unique())

13

The algorithm_of_vehicle_classification column has two more unique values than the algorithm_of_vehicle_classification_name column.
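
Note that `len(series.unique())` counts NaN as one of the values, so both counts above include it. A small sketch on a toy Series (illustrative values only) shows how this compares with `nunique`:

```python
import pandas as pd

# Toy column with one missing value and one repeated code.
toy = pd.Series(["G", "D", None, "G", "0"])

with_nan = len(toy.unique())            # NaN counted as a value
without_nan = toy.nunique()             # NaN ignored by default
with_nan_alt = toy.nunique(dropna=False)  # NaN counted as a value

print(with_nan, without_nan, with_nan_alt)
```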

In [17]:
unique_classification_value = station_df["algorithm_of_vehicle_classification"].unique()

In [18]:
print_mapped_value(station_df, "algorithm_of_vehicle_classification", "algorithm_of_vehicle_classification_name")

nan maps to []
G maps to ['Axle spacing with Scheme F modified']
D maps to ['Vehicle length classification']
F maps to ['Axle spacing with Scheme F']
0 maps to [nan]
N maps to ['Axle spacing and other input(s) not specified above']
K maps to ['Axle spacing and weight algorithm']
L maps to ['Axle spacing and vehicle length algorithm']
Z maps to ['Other means not specified above']
M maps to ['Axle spacing weight and vehicle length algorithm']
H maps to ['Other axle spacing algorithm']
A maps to ['Human observation on site (manual)']
1 maps to [nan]
C maps to ['Automated interpretation of vehicle image or signature (e.g. video microwave sonic)']
E maps to ['Axle spacing with ASTM Standard E1572']


We notice that the values 0 and 1 both map to NaN in the algorithm_of_vehicle_classification_name column. This reconciles the difference in the number of unique values.

Since the algorithm_of_vehicle_classification column and the algorithm_of_vehicle_classification_name column are functionally the same, we will drop the algorithm_of_vehicle_classification_name column as it has more missing values and therefore less information.

In [19]:
station_df.drop("algorithm_of_vehicle_classification_name", axis=1, inplace=True)

### Calibration of weighing system & Calibration of weighing system name

In [20]:
station_df["calibration_of_weighing_system"].isnull().sum()

20301

In [21]:
station_df["calibration_of_weighing_system_name"].isnull().sum()

21785

In [22]:
len(station_df)

28466

Both columns contain null values, but again the counts differ. As with the algorithm_of_vehicle_classification columns, if the two columns were equivalent they would be missing in exactly the same rows, so we need to look more closely at how they relate.

The dataset has only 28466 rows, so roughly 71% of calibration_of_weighing_system and 77% of calibration_of_weighing_system_name are missing. That may be too much missing information to work with. We might ultimately drop both columns, but let's explore the data further first.
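
The missing share can be worked out directly from the counts reported in the cells above. This is a minimal sketch using those numbers as plain constants; on the DataFrame itself, `station_df[columns].isnull().mean()` would give the same fractions.

```python
# Null counts and row count as reported by the cells above.
n_rows = 28466
null_counts = {
    "calibration_of_weighing_system": 20301,
    "calibration_of_weighing_system_name": 21785,
}

# Fraction of rows that are missing in each column.
fractions = {column: n_null / n_rows for column, n_null in null_counts.items()}

for column, fraction in fractions.items():
    print(f"{column}: {fraction:.1%} missing")
```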

In [23]:
station_df["calibration_of_weighing_system"].unique()

array([nan, 'M', '0', 'Z', 'T', 'C', 'A', 'U', 'D', 'P', '2', 'B', 'S'],
      dtype=object)

In [24]:
station_df["calibration_of_weighing_system_name"].unique()

array([nan, 'Moving average of the steering axle of 3S2s', 'Other method',
       'Test trucks only',
       'Combination of test trucks and trucks from the traffic stream (but not ASTM E1318)',
       'ASTM Standard E1318', 'Uncalibrated',
       'Other sample of trucks from the traffic stream',
       'Subset of ASTM Standard E1318', 'Static calibration'],
      dtype=object)

In [25]:
len(station_df["calibration_of_weighing_system_name"].unique()) - len(station_df["calibration_of_weighing_system"].unique())

-3

The calibration_of_weighing_system column has 3 more unique values than the calibration_of_weighing_system_name column, which could explain why it has fewer null values. We need to reconcile this difference, and we can use the same process as for the algorithm_of_vehicle_classification column.

In [26]:
print_mapped_value(station_df, "calibration_of_weighing_system", "calibration_of_weighing_system_name")

nan maps to []
M maps to ['Moving average of the steering axle of 3S2s']
0 maps to [nan]
Z maps to ['Other method']
T maps to ['Test trucks only']
C maps to ['Combination of test trucks and trucks from the traffic stream (but not ASTM E1318)']
A maps to ['ASTM Standard E1318']
U maps to ['Uncalibrated']
D maps to ['Other sample of trucks from the traffic stream']
P maps to [nan]
2 maps to [nan]
B maps to ['Subset of ASTM Standard E1318']
S maps to ['Static calibration']


The values 0, P and 2 map to NaN in the calibration_of_weighing_system_name column, which reconciles the difference in unique values between the two columns. We will drop the calibration_of_weighing_system_name column as it has more null values (although, as mentioned earlier, we might still drop the remaining column later).

Normally in situations like these, it would be important to ask the data provider about the cause of the difference. Since that is not possible here, we will exercise some judgement and handle the data in the most appropriate manner.

In [27]:
station_df.drop("calibration_of_weighing_system_name", axis=1, inplace=True)

### Direction of Travel & Direction of Travel Name

In [28]:
station_df["direction_of_travel"].isnull().sum()

0

In [29]:
station_df["direction_of_travel_name"].isnull().sum()

0

Both of these columns have no null values, awesome!

In [30]:
station_df["direction_of_travel"].unique()

array([7, 5, 1, 3, 9, 0, 2, 6, 8, 4], dtype=int64)

In [31]:
station_df["direction_of_travel_name"].unique()

array(['West', 'South', 'North', 'East',
       'North-South or Northeast-Southwest combined (ATR stations only)',
       'East-West or Southeast-Northwest combined (ATR stations only)',
       'Northeast', 'Southwest', 'Northwest', 'Southeast'], dtype=object)

In [32]:
len(station_df["direction_of_travel"].unique()) - len(station_df["direction_of_travel_name"].unique())

0

In [33]:
print_mapped_value(station_df, "direction_of_travel", "direction_of_travel_name")

7 maps to ['West']
5 maps to ['South']
1 maps to ['North']
3 maps to ['East']
9 maps to ['North-South or Northeast-Southwest combined (ATR stations only)']
0 maps to ['East-West or Southeast-Northwest combined (ATR stations only)']
2 maps to ['Northeast']
6 maps to ['Southwest']
8 maps to ['Northwest']
4 maps to ['Southeast']


In [34]:
station_df.drop("direction_of_travel", inplace=True, axis=1)

### functional_classification & functional_classification_name

In [35]:
len(station_df["functional_classification"].unique()) - len(station_df["functional_classification_name"].unique())

0

Both of these columns have the same number of unique values

In [36]:
print_mapped_value(station_df, "functional_classification", "functional_classification_name")

2U maps to ['Urban: Principal Arterial - Other Freeways or Expressways']
3R maps to ['Rural: Principal Arterial - Other']
1R maps to ['Rural: Principal Arterial - Interstate']
1U maps to ['Urban: Principal Arterial - Interstate']
3U maps to ['Urban: Principal Arterial - Other']
4R maps to ['Rural: Minor Arterial']
4U maps to ['Urban: Minor Arterial']
5U maps to ['Urban: Collector']
5R maps to ['Rural: Major Collector']
6R maps to ['Rural: Minor Collector']
7U maps to ['Urban: Local System']
7R maps to ['Rural: Local System']


This shows that there is a perfect mapping of the two columns and we should probably drop one of the columns to reduce dimensionality. We will be keeping functional_classification_name as it is more informative.

In [37]:
station_df.drop("functional_classification", axis=1, inplace=True)

### lane_of_travel & lane_of_travel_name

In [38]:
station_df["lane_of_travel"].isnull().sum()

0

In [39]:
station_df["lane_of_travel_name"].isnull().sum()

0

In [40]:
len(station_df["lane_of_travel"].unique()) - len(station_df["lane_of_travel_name"].unique())

7

Even though lane_of_travel and lane_of_travel_name have no null values, they have different cardinalities (lane_of_travel has 7 more unique values), which suggests that several codes share the same name.

In [41]:
print_mapped_value(station_df, "lane_of_travel", "lane_of_travel_name")

4 maps to ['Other lanes']
1 maps to ['Outside (rightmost) lane']
2 maps to ['Other lanes']
0 maps to ['Data with lanes combined']
3 maps to ['Other lanes']
6 maps to ['Other lanes']
5 maps to ['Other lanes']
7 maps to ['Other lanes']
8 maps to ['Other lanes']
9 maps to ['Other lanes']


The relationship from lane_of_travel to lane_of_travel_name is many-to-one: lanes 2 through 9 all map to 'Other lanes'. To preserve as much information as possible, we will keep the lane_of_travel column and drop lane_of_travel_name.
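
The direction of such a mapping can be checked by counting distinct values on each side with `groupby(...).nunique()`. The sketch below uses toy rows that mirror part of the printed mapping; it is an illustration, not the notebook's own check.

```python
import pandas as pd

# Toy rows mirroring part of the mapping printed above (illustrative only).
toy = pd.DataFrame(
    {
        "lane_of_travel": [0, 1, 2, 3, 4, 2],
        "lane_of_travel_name": [
            "Data with lanes combined",
            "Outside (rightmost) lane",
            "Other lanes",
            "Other lanes",
            "Other lanes",
            "Other lanes",
        ],
    }
)

# Distinct names per code: a maximum of 1 means each code has a single name.
names_per_code = toy.groupby("lane_of_travel")["lane_of_travel_name"].nunique()
# Distinct codes per name: values above 1 mean the name merges several codes.
codes_per_name = toy.groupby("lane_of_travel_name")["lane_of_travel"].nunique()

print(names_per_code.max())
print(codes_per_name)
```

If every code has one name but some names cover several codes, dropping the code column loses information, while dropping the name column does not.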

In [42]:
station_df.drop("lane_of_travel_name", axis=1, inplace=True)

### method_of_data_retrieval & method_of_data_retrieval_name

In [43]:
station_df["method_of_data_retrieval"].isnull().sum()

0

In [44]:
station_df["method_of_data_retrieval_name"].isnull().sum()

440

In [45]:
len(station_df["method_of_data_retrieval"].unique()) - len(station_df["method_of_data_retrieval_name"].unique())

0

In [46]:
print_mapped_value(station_df, "method_of_data_retrieval", "method_of_data_retrieval_name")

2 maps to ['Automated (telemetry)']
0 maps to [nan]
1 maps to ['Not automated (manual)']


The mapping is one-to-one, with code 0 corresponding to NaN in the name column. Since method_of_data_retrieval has no missing values, we will keep it and drop method_of_data_retrieval_name.

In [47]:
station_df.drop("method_of_data_retrieval_name", inplace=True, axis=1)

### method_of_traffic_volume_counting & method_of_traffic_volume_counting_name

In [48]:
station_df["method_of_traffic_volume_counting"].isnull().sum()

0

In [49]:
station_df["method_of_traffic_volume_counting_name"].isnull().sum()

880

In [50]:
len(station_df["method_of_traffic_volume_counting"].unique()) - len(station_df["method_of_traffic_volume_counting_name"].unique())

1

In [51]:
print_mapped_value(station_df, "method_of_traffic_volume_counting", "method_of_traffic_volume_counting_name")

3 maps to ['Permanent automatic traffic recorder (ATR)']
2 maps to ['Portable traffic recording device']
0 maps to [nan]
1 maps to ['Human observation (manual)']
4 maps to [nan]


We will keep method_of_traffic_volume_counting as it has higher cardinality (codes 0 and 4 both map to NaN in the name column) and therefore gives our predictive model more information to work with.

In [52]:
station_df.drop("method_of_traffic_volume_counting_name", inplace=True, axis=1)

### method_of_truck_weighing & method_of_truck_weighing_name

In [53]:
station_df["method_of_truck_weighing"].isnull().sum()

0

In [54]:
station_df["method_of_truck_weighing_name"].isnull().sum()

22580

In [55]:
len(station_df["method_of_truck_weighing"].unique()) - len(station_df["method_of_truck_weighing_name"].unique())

0

In [56]:
print_mapped_value(station_df, "method_of_truck_weighing", "method_of_truck_weighing_name")

0 maps to [nan]
4 maps to ['Portable weigh-in-motion system']
5 maps to ['Permanent weigh-in-motion system']
1 maps to ['Portable static scale']
2 maps to ['Chassis-mounted towed static scale']


It would be preferable to keep the name of the method, as EDA is more informative with the actual name. However, we have to fill the NaN values in the name column with 0, the code they correspond to in the mapping above. We will then drop the method_of_truck_weighing column.

In [57]:
station_df["method_of_truck_weighing_name"] = station_df["method_of_truck_weighing_name"].fillna("0")

In [58]:
station_df.drop("method_of_truck_weighing", axis=1, inplace=True)

### method_of_vehicle_classification & method_of_vehicle_classification_name

In [59]:
station_df["method_of_vehicle_classification"].isnull().sum()

0

In [60]:
station_df["method_of_vehicle_classification_name"].isnull().sum()

11180

In [61]:
len(station_df["method_of_vehicle_classification"].unique()) - len(station_df["method_of_vehicle_classification_name"].unique())

1

In [62]:
print_mapped_value(station_df, "method_of_vehicle_classification", "method_of_vehicle_classification_name")

0 maps to [nan]
3 maps to ['Permanent vehicle classification device']
2 maps to ['Portable vehicle classification device']
1 maps to ['Human observation (manual) vehicle classification']
4 maps to [nan]


Although we would prefer to keep method_of_vehicle_classification_name, we would lose information by doing so (codes 0 and 4 both map to NaN), so we will keep method_of_vehicle_classification instead.

In [63]:
station_df.drop("method_of_vehicle_classification_name", inplace=True, axis=1)

### primary_purpose & primary_purpose_name

In [64]:
station_df["primary_purpose"].isnull().sum()

210

In [65]:
station_df["primary_purpose_name"].isnull().sum()

648

In [66]:
len(station_df["primary_purpose"].unique()) - len(station_df["primary_purpose_name"].unique())

2

In [67]:
print_mapped_value(station_df, "primary_purpose", "primary_purpose_name")

P maps to ['Planning or traffic statistics purposes']
R maps to ['Research purposes (e.g. LTPP)']
L maps to ['Load data for pavement design or pavement management purposes']
O maps to ['Operations purposes but not ITS']
0 maps to [nan]
nan maps to []
I maps to ['Operations purposes in support of ITS initiatives']
4 maps to [nan]
E maps to ['Enforcement purposes (e.g. speed or weight enforcement)']


Since the primary_purpose column carries more information than the primary_purpose_name column (codes 0 and 4 both map to NaN in the name column), we will keep primary_purpose.

In [68]:
station_df.drop("primary_purpose_name", axis=1, inplace=True)

### sample_type_for_traffic_volume & sample_type_for_traffic_volume_name

In [69]:
station_df["sample_type_for_traffic_volume"].isnull().sum()

812

In [70]:
station_df["sample_type_for_traffic_volume_name"].isnull().sum()

1050

In [71]:
len(station_df["sample_type_for_traffic_volume"].unique()) - len(station_df["sample_type_for_traffic_volume_name"].unique())

2

In [72]:
print_mapped_value(station_df, "sample_type_for_traffic_volume", "sample_type_for_traffic_volume_name")

T maps to ['Station used for Traffic Volume Trends']
nan maps to []
N maps to ['Station not used for Traffic Volume Trends']
Y maps to [nan]
t maps to [nan]


Whether a station is used for traffic volume trends appears to be binary. We can therefore map "N" and NaN to 0 ('Station not used for Traffic Volume Trends') and map "Y", "t" and "T" to 1 ('Station used for Traffic Volume Trends').

This is what we will do, with the caveat that we are assuming the column is binary. With this in mind, we modify the sample_type_for_traffic_volume column, consolidating the "used" codes under 1 and everything else, including NaN, under 0.

In [73]:
# Assign the result back rather than calling inplace=True on a column selection.
col = "sample_type_for_traffic_volume"
station_df[col] = station_df[col].replace("N", "0").replace(["t", "Y", "T"], "1").fillna("0")

In [74]:
station_df["sample_type_for_traffic_volume"] = pd.to_numeric(station_df["sample_type_for_traffic_volume"])

In [75]:
station_df.drop("sample_type_for_traffic_volume_name", axis=1, inplace=True)
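
The recoding above can also be expressed as a single membership test. This is a sketch on a toy Series, under the same assumption that the flag is binary; the set name `used_codes` is introduced here for illustration only.

```python
import pandas as pd

# Toy column with the same kinds of raw values as sample_type_for_traffic_volume.
raw = pd.Series(["T", "N", None, "Y", "t", "N"])

# Assumption carried over from the text: "T", "t" and "Y" all mean "used".
used_codes = {"T", "t", "Y"}

# isin() returns False for NaN, so missing values fall into the 0 group.
flag = raw.isin(used_codes).astype(int)
print(flag.tolist())
```

One consequence of this form is that any unexpected code is silently mapped to 0, so it is worth inspecting the unique values first, as we did with the mapping table.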

### sample_type_for_truck_weight & sample_type_for_truck_weight_name

In [76]:
station_df["sample_type_for_truck_weight"].isnull().sum()

12062

In [77]:
station_df["sample_type_for_truck_weight_name"].isnull().sum()

12812

In [78]:
len(station_df["sample_type_for_truck_weight"].unique()) - len(station_df["sample_type_for_truck_weight_name"].unique())

3

In [79]:
print_mapped_value(station_df, "sample_type_for_truck_weight", "sample_type_for_truck_weight_name")

nan maps to []
B maps to ['Station used for TMG sample and Strategic Highway Research Program (SHRP) Long Term Pavement Performance (LTPP) sample']
N maps to ['Station not used for any of the above']
T maps to ['Station used for TMG sample (but not SHRP/LTPP sample)']
0 maps to [nan]
5 maps to [nan]
L maps to ['Station used for SHRP/LTPP sample (but not TMG sample)']
1 maps to [nan]


With this mapping table in mind, we will keep sample_type_for_truck_weight as it retains the most information.

In [80]:
station_df.drop("sample_type_for_truck_weight_name", axis=1, inplace=True)

### sample_type_for_vehicle_classification & sample_type_for_vehicle_classification_name

In [81]:
station_df["sample_type_for_vehicle_classification"].isnull().sum()

3487

In [82]:
station_df["sample_type_for_vehicle_classification_name"].isnull().sum()

4414

In [83]:
len(station_df["sample_type_for_vehicle_classification"].unique()) - len(station_df["sample_type_for_vehicle_classification_name"].unique())

4

In [84]:
print_mapped_value(station_df, "sample_type_for_vehicle_classification", "sample_type_for_vehicle_classification_name")

N maps to ['Station not used for Heavy Vehicle Travel Information System']
H maps to ['Station used for Heavy Vehicle Travel Information System']
0 maps to [nan]
nan maps to []
Y maps to [nan]
2 maps to [nan]
T maps to [nan]


Similar to sample_type_for_traffic_volume, we will assume that this column is binary, i.e. a station is either used for the Heavy Vehicle Travel Information System or it is not.

We will map N, 0 and NaN to 0, and Y, 2, T and H to 1.

In [85]:
# Assign the result back rather than calling inplace=True on a column selection.
col = "sample_type_for_vehicle_classification"
station_df[col] = station_df[col].replace(["H", "Y", "2", "T"], "1").replace("N", "0").fillna("0")

In [86]:
station_df["sample_type_for_vehicle_classification"] = pd.to_numeric(station_df["sample_type_for_vehicle_classification"])

In [87]:
station_df.drop("sample_type_for_vehicle_classification_name", axis=1, inplace=True)

### type_of_sensor & type_of_sensor_name

In [88]:
station_df["type_of_sensor"].isnull().sum()

352

In [89]:
station_df["type_of_sensor_name"].isnull().sum()

352

In [90]:
len(station_df["type_of_sensor"].unique()) - len(station_df["type_of_sensor_name"].unique())

0

In [91]:
print_mapped_value(station_df, "type_of_sensor", "type_of_sensor_name")

L maps to ['Inductance loop']
P maps to ['Piezoelectric']
X maps to ['Radio wave']
Q maps to ['Quartz piezoelectric - NEW']
W maps to ['Microwave']
nan maps to []
R maps to ['Road tube']
H maps to ['Human observation (manual)']
B maps to ['Bending plate']
U maps to ['Ultrasonic']
Z maps to ['Other']
S maps to ['Sonic/acoustic']
I maps to ['Infrared']
G maps to ['Strain gauge on bridge beam']
V maps to ['Video image']
E maps to ['Hydraulic load cells']
A maps to ['Automatic vehicle identification (AVI)']
K maps to ['Laser/lidar']
M maps to ['Magnetometer']
F maps to ['Fiber optic - NEW']


Since there is a perfect mapping, we can keep the more verbose column, i.e. type_of_sensor_name.

In [92]:
station_df.drop("type_of_sensor", inplace=True, axis=1)

We are now done with the station data (for now), moving on to the traffic data

## Traffic Data

In [93]:
traffic_df.isnull().sum()

date                                               0
day_of_data                                        0
day_of_week                                        0
direction_of_travel                                0
direction_of_travel_name                           0
fips_state_code                                    0
functional_classification                          0
functional_classification_name                     0
lane_of_travel                                     0
month_of_data                                      0
record_type                                        0
restrictions                                 7140391
station_id                                         0
traffic_volume_counted_after_0000_to_0100          0
traffic_volume_counted_after_0100_to_0200          0
traffic_volume_counted_after_0200_to_0300          0
traffic_volume_counted_after_0300_to_0400          0
traffic_volume_counted_after_0400_to_0500          0
traffic_volume_counted_after_0500_to_0600     

Most of the data in the traffic_df dataframe is filled in; only the restrictions column is missing a huge chunk of data. We will investigate this particular column further.

We will also drop the functional_classification and direction_of_travel columns, since we dropped them from station_df in our prior analysis.

In [94]:
traffic_df.drop(["functional_classification", "direction_of_travel"], axis=1, inplace=True)

### Restrictions

In [95]:
traffic_df["restrictions"].unique()

array([nan])

The entire column consists of NaN values, so we will drop it.

In [96]:
traffic_df.drop("restrictions", axis=1, inplace=True)
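
Entirely empty columns can also be found and removed without inspecting them one by one. The sketch below runs on a toy frame (illustrative values only) and relies on `isnull().all()` and `dropna(axis=1, how="all")`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for traffic_df (illustrative values only).
toy = pd.DataFrame(
    {
        "station_id": ["000276", "000302", "022940"],
        "restrictions": [np.nan, np.nan, np.nan],
        "lane_of_travel": [1, 1, 2],
    }
)

# Columns in which every value is missing.
empty_columns = toy.columns[toy.isnull().all()].tolist()
print(empty_columns)

# dropna(axis=1, how="all") removes exactly those columns.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```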

In [97]:
traffic_df.to_csv("../data/interim/dot_traffic_2015.csv", index=False)
station_df.to_csv("../data/interim/dot_traffic_stations_2015.csv", index=False)

We will be combining the two datasets, which share several columns. After combining them, it will be worth checking that these intersecting columns agree with each other.
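
A check of this kind might look like the sketch below. It runs on toy frames, and the choice of `fips_state_code` and `station_id` as join keys is an assumption made for illustration; the real key may need more columns.

```python
import pandas as pd

# Toy frames; the join keys and the rows are illustrative assumptions.
stations = pd.DataFrame(
    {
        "fips_state_code": [6, 6, 49],
        "station_id": ["129130", "100190", "000302"],
        "functional_classification_name": [
            "Urban: Principal Arterial - Interstate",
            "Rural: Principal Arterial - Other",
            "Urban: Principal Arterial - Interstate",
        ],
    }
)
traffic = pd.DataFrame(
    {
        "fips_state_code": [6, 6, 49],
        "station_id": ["129130", "100190", "000302"],
        "functional_classification_name": [
            "Urban: Principal Arterial - Interstate",
            "Rural: Principal Arterial - Other",
            "Rural: Principal Arterial - Other",
        ],
    }
)

# Shared non-key columns get a suffix so both versions survive the merge.
merged = traffic.merge(
    stations,
    on=["fips_state_code", "station_id"],
    how="left",
    suffixes=("_traffic", "_station"),
)

# Rows where the two sources disagree on the shared column.
disagree = merged[
    merged["functional_classification_name_traffic"]
    != merged["functional_classification_name_station"]
]
print(disagree[["fips_state_code", "station_id"]])
```

Because NaN never compares equal to NaN, rows where both sides are missing would also show up as disagreements with this comparison and may need to be handled separately.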

# Summary

These are the columns that were dropped from station_df based on the analysis done in this notebook:
1. algorithm_of_vehicle_classification_name
2. calibration_of_weighing_system_name
3. direction_of_travel
4. functional_classification
5. lane_of_travel_name
6. method_of_data_retrieval_name
7. method_of_traffic_volume_counting_name
8. method_of_truck_weighing
9. method_of_vehicle_classification_name
10. primary_purpose_name
11. sample_type_for_traffic_volume_name
12. sample_type_for_truck_weight_name
13. sample_type_for_vehicle_classification_name
14. type_of_sensor

These are the columns that were dropped from traffic_df based on the analysis that was done:

1. functional_classification
2. direction_of_travel
3. restrictions

We will now move on to the EDA portion. Data cleaning is by no means finished: we might change the way we clean the data based on what the EDA shows.