# Capstone Two - Data Wrangling

The dataset used is the 2017 National Household Travel Survey. This notebook focuses on cleaning the data.

## Import packages

In [1]:
#import packages
import pandas as pd
import os
import tabula
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Set directories

In [2]:
os.chdir("../..")
cw = os.getcwd()

## Read datasets

The National Household Travel Survey has 4 datasets. 

1. The Person dataset 
2. The Household dataset
3. The Vehicle dataset
4. The Trip dataset

In [3]:
#import person data
data_person=pd.read_sas(os.path.join(cw,'Capstone_Two_Other_Material/Data/sas/perpub.sas7bdat'), format = 'sas7bdat', encoding="ISO-8859-1")



In [4]:
#Look at first few rows
data_person.head()

Unnamed: 0,HOUSEID,PERSONID,R_AGE,EDUC,R_HISP,R_RELAT,R_SEX,R_RACE,PRMACT,PAYPROF,...,SMPLSRCE,WTPERFIN,HBHUR,HTHTNRNT,HTPPOPDN,HTRESDN,HTEEMPDN,HBHTNRNT,HBPPOPDN,HBRESDN
0,30000007,1,67.0,3,2,1,2,2,6,2,...,2,206.690153,T,50,1500,750,750,20,750,300
1,30000007,2,66.0,3,2,2,1,2,1,-1,...,2,197.075742,T,50,1500,750,750,20,750,300
2,30000007,3,28.0,2,2,3,2,2,5,2,...,2,219.51421,T,50,1500,750,750,20,750,300
3,30000008,1,55.0,5,2,1,1,1,1,-1,...,2,63.185911,R,5,300,300,150,5,300,300
4,30000008,2,49.0,4,2,2,2,1,1,-1,...,2,58.665911,R,5,300,300,150,5,300,300


In [5]:
#Look at dimension
data_person.shape

(264234, 121)

In [6]:
#Look at data info
data_person.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264234 entries, 0 to 264233
Columns: 121 entries, HOUSEID to HBRESDN
dtypes: float64(31), object(90)
memory usage: 243.9+ MB


In [7]:
#Select string variables
data_person_obj = data_person.select_dtypes(['object'])
print (data_person_obj.head())

    HOUSEID PERSONID EDUC R_HISP R_RELAT R_SEX R_RACE PRMACT PAYPROF GT1JBLWK  \
0  30000007       01   03     02      01    02     02     06      02       -1   
1  30000007       02   03     02      02    01     02     01      -1       02   
2  30000007       03   02     02      03    02     02     05      02       -1   
3  30000008       01   05     02      01    01     01     01      -1       02   
4  30000008       02   04     02      02    02     01     01      -1       02   

   ... HH_CBSA SMPLSRCE HBHUR HTHTNRNT HTPPOPDN HTRESDN HTEEMPDN HBHTNRNT  \
0  ...   XXXXX       02     T       50     1500     750      750       20   
1  ...   XXXXX       02     T       50     1500     750      750       20   
2  ...   XXXXX       02     T       50     1500     750      750       20   
3  ...   33460       02     R       05      300     300      150       05   
4  ...   33460       02     R       05      300     300      150       05   

  HBPPOPDN HBRESDN  
0      750     300  
1      7

In [8]:
#Remove trailing or leading spaces
data_person[data_person_obj.columns] = data_person_obj.apply(lambda x: x.str.strip())
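
The same select-and-strip step is repeated for each of the four datasets, so it could be wrapped in a small helper. The sketch below is only an illustration on a toy DataFrame: the function name `strip_object_columns` and the toy values are hypothetical, and the columns are built with `dtype=object` to mimic what `pd.read_sas` returned here.

```python
import pandas as pd


def strip_object_columns(df):
    """Return a copy of df with leading/trailing spaces removed from object columns."""
    out = df.copy()
    obj_cols = out.select_dtypes(include=["object"]).columns
    if len(obj_cols) > 0:
        out[obj_cols] = out[obj_cols].apply(lambda s: s.str.strip())
    return out


# Toy data (hypothetical values) standing in for one of the NHTS tables
toy = pd.DataFrame({
    "HOUSEID": pd.Series([" 30000007", "30000008 "], dtype=object),
    "HHSIZE": [3.0, 2.0],
})
clean = strip_object_columns(toy)
print(clean["HOUSEID"].tolist())
```

Because the helper works on a copy, the input DataFrame is left unchanged, which makes it easy to compare shapes before and after.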

In [9]:
#Look at dimension
data_person.shape

(264234, 121)

In [10]:
#Look at data info
data_person.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264234 entries, 0 to 264233
Columns: 121 entries, HOUSEID to HBRESDN
dtypes: float64(31), object(90)
memory usage: 243.9+ MB


In [11]:
#import household data
data_hh=pd.read_sas(os.path.join(cw,'Capstone_Two_Other_Material/Data/sas/hhpub.sas7bdat'), format = 'sas7bdat', encoding="ISO-8859-1")

In [12]:
#Look at first few rows
data_hh.head()

Unnamed: 0,HOUSEID,TRAVDAY,SAMPSTRAT,HOMEOWN,HHSIZE,HHVEHCNT,HHFAMINC,PC,SPHONE,TAB,...,SMPLSRCE,WTHHFIN,HBHUR,HTHTNRNT,HTPPOPDN,HTRESDN,HTEEMPDN,HBHTNRNT,HBPPOPDN,HBRESDN
0,30000007,2,3,1,3.0,5.0,7,2,1,2,...,2,187.31432,T,50,1500,750,750,20,750,300
1,30000008,5,2,1,2.0,4.0,8,1,1,2,...,2,69.513032,R,5,300,300,150,5,300,300
2,30000012,5,3,1,1.0,2.0,10,1,1,3,...,2,79.419586,C,80,17000,17000,5000,60,17000,7000
3,30000019,5,3,1,2.0,2.0,3,1,5,5,...,2,279.143588,S,40,300,300,150,50,750,300
4,30000029,3,3,1,2.0,2.0,5,2,5,1,...,2,103.240304,S,40,1500,750,750,40,1500,750


In [13]:
#Look at dimension
data_hh.shape

(129696, 58)

In [14]:
#Look at data info
data_hh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129696 entries, 0 to 129695
Data columns (total 58 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   HOUSEID     129696 non-null  object 
 1   TRAVDAY     129696 non-null  object 
 2   SAMPSTRAT   129696 non-null  object 
 3   HOMEOWN     129696 non-null  object 
 4   HHSIZE      129696 non-null  float64
 5   HHVEHCNT    129696 non-null  float64
 6   HHFAMINC    129696 non-null  object 
 7   PC          129696 non-null  object 
 8   SPHONE      129696 non-null  object 
 9   TAB         129696 non-null  object 
 10  WALK        129696 non-null  object 
 11  BIKE        129696 non-null  object 
 12  CAR         129696 non-null  object 
 13  TAXI        129696 non-null  object 
 14  BUS         129696 non-null  object 
 15  TRAIN       129696 non-null  object 
 16  PARA        129696 non-null  object 
 17  PRICE       129696 non-null  object 
 18  PLACE       129696 non-null  object 
 19  WA

In [15]:
#Select string variables
data_hh_obj = data_hh.select_dtypes(['object'])
print (data_hh_obj.head())

    HOUSEID TRAVDAY SAMPSTRAT HOMEOWN HHFAMINC  PC SPHONE TAB WALK BIKE  ...  \
0  30000007      02        03      01       07  02     01  02   05   05  ...   
1  30000008      05        02      01       08  01     01  02   04   04  ...   
2  30000012      05        03      01       10  01     01  03   02   05  ...   
3  30000019      05        03      01       03  01     05  05   02   05  ...   
4  30000029      03        03      01       05  02     05  01   -9   -9  ...   

  WEBUSE17 SMPLSRCE HBHUR HTHTNRNT HTPPOPDN HTRESDN HTEEMPDN HBHTNRNT  \
0       01       02     T       50     1500     750      750       20   
1       01       02     R       05      300     300      150       05   
2       01       02     C       80    17000   17000     5000       60   
3       01       02     S       40      300     300      150       50   
4       01       02     S       40     1500     750      750       40   

  HBPPOPDN HBRESDN  
0      750     300  
1      300     300  
2    17000    700

In [16]:
#Remove trailing or leading spaces
data_hh[data_hh_obj.columns] = data_hh_obj.apply(lambda x: x.str.strip())

In [17]:
#Look at dimension
data_hh.shape

(129696, 58)

In [18]:
#Look at data info
data_hh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129696 entries, 0 to 129695
Data columns (total 58 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   HOUSEID     129696 non-null  object 
 1   TRAVDAY     129696 non-null  object 
 2   SAMPSTRAT   129696 non-null  object 
 3   HOMEOWN     129696 non-null  object 
 4   HHSIZE      129696 non-null  float64
 5   HHVEHCNT    129696 non-null  float64
 6   HHFAMINC    129696 non-null  object 
 7   PC          129696 non-null  object 
 8   SPHONE      129696 non-null  object 
 9   TAB         129696 non-null  object 
 10  WALK        129696 non-null  object 
 11  BIKE        129696 non-null  object 
 12  CAR         129696 non-null  object 
 13  TAXI        129696 non-null  object 
 14  BUS         129696 non-null  object 
 15  TRAIN       129696 non-null  object 
 16  PARA        129696 non-null  object 
 17  PRICE       129696 non-null  object 
 18  PLACE       129696 non-null  object 
 19  WA

In [19]:
#import trip data
data_trip=pd.read_sas(os.path.join(cw,'Capstone_Two_Other_Material/Data/sas/trippub.sas7bdat'), format = 'sas7bdat', encoding="ISO-8859-1")



In [20]:
#Look at first few rows
data_trip.head()

Unnamed: 0,HOUSEID,PERSONID,TDTRPNUM,STRTTIME,ENDTIME,TRVLCMIN,TRPMILES,TRPTRANS,TRPACCMP,TRPHHACC,...,OBHTNRNT,OBPPOPDN,OBRESDN,DTHTNRNT,DTPPOPDN,DTRESDN,DTEEMPDN,DBHTNRNT,DBPPOPDN,DBRESDN
0,30000007,1,1,1000,1015,15.0,5.244,3,0.0,0.0,...,20,750,300,50,750,300,350,30,300,300
1,30000007,1,2,1510,1530,20.0,5.149,3,0.0,0.0,...,30,300,300,50,1500,750,750,20,750,300
2,30000007,2,1,700,900,120.0,84.004,6,0.0,0.0,...,40,1500,750,50,1500,750,750,20,750,300
3,30000007,2,2,1800,2030,150.0,81.628,6,0.0,0.0,...,20,750,300,40,1500,750,750,40,1500,750
4,30000007,3,1,845,900,15.0,2.25,3,0.0,0.0,...,20,750,300,50,750,300,350,60,750,300


In [21]:
#Look at dimension
data_trip.shape

(923572, 115)

In [22]:
#Look at data info
data_trip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923572 entries, 0 to 923571
Columns: 115 entries, HOUSEID to DBRESDN
dtypes: float64(22), object(93)
memory usage: 810.3+ MB


In [23]:
#Select string variables
data_trip_obj = data_trip.select_dtypes(['object'])
print (data_trip_obj.head())

    HOUSEID PERSONID TDTRPNUM STRTTIME ENDTIME TRPTRANS VEHID DROP_PRK  \
0  30000007       01       01     1000    1015       03    03       -1   
1  30000007       01       02     1510    1530       03    03       -1   
2  30000007       02       01     0700    0900       06    05       -1   
3  30000007       02       02     1800    2030       06    05       -1   
4  30000007       03       01     0845    0900       03    01       -1   

  WHODROVE WHYFROM  ... OBHTNRNT OBPPOPDN OBRESDN DTHTNRNT DTPPOPDN DTRESDN  \
0       01      01  ...       20      750     300       50      750     300   
1       01      19  ...       30      300     300       50     1500     750   
2       02      03  ...       40     1500     750       50     1500     750   
3       02      01  ...       20      750     300       40     1500     750   
4       03      01  ...       20      750     300       50      750     300   

  DTEEMPDN DBHTNRNT DBPPOPDN DBRESDN  
0      350       30      300     300  
1 

In [24]:


#Remove trailing or leading spaces
data_trip[data_trip_obj.columns] = data_trip_obj.apply(lambda x: x.str.strip())

In [25]:
#Look at dimension
data_trip.shape

(923572, 115)

In [26]:
#Look at dataset info
data_trip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923572 entries, 0 to 923571
Columns: 115 entries, HOUSEID to DBRESDN
dtypes: float64(22), object(93)
memory usage: 810.3+ MB


In [27]:
#import vehicle data
data_veh=pd.read_sas(os.path.join(cw,'Capstone_Two_Other_Material/Data/sas/vehpub.sas7bdat'), format = 'sas7bdat', encoding="ISO-8859-1")

In [28]:
#Look at first few rows
data_veh.head()

Unnamed: 0,HOUSEID,VEHID,VEHYEAR,VEHAGE,MAKE,MODEL,FUELTYPE,VEHTYPE,WHOMAIN,OD_READ,...,HTEEMPDN,HBHTNRNT,HBPPOPDN,HBRESDN,GSYRGAL,GSTOTCST,FEGEMPG,FEGEMPGA,GSCOST,FEGEMPGF
0,30000007,1,2007.0,10.0,49,49032,1,1,3,69000.0,...,750,20,750,300,487.064221,1126.457778,30.0,-9.0,2.31275,1
1,30000007,2,2004.0,13.0,49,49442,1,2,-8,164000.0,...,750,20,750,300,250.899523,580.267873,19.0,-9.0,2.31275,1
2,30000007,3,1998.0,19.0,19,19014,1,1,1,120000.0,...,750,20,750,300,444.462475,1027.930589,18.0,-9.0,2.31275,1
3,30000007,4,1997.0,20.0,19,19021,1,1,2,-88.0,...,750,20,750,300,40.329575,93.272224,18.0,-9.0,2.31275,1
4,30000007,5,1993.0,24.0,20,20481,1,4,2,300000.0,...,750,20,750,300,888.404197,2054.656806,14.0,-9.0,2.31275,1


In [29]:
#Look at dimension
data_veh.shape

(256115, 60)

In [30]:
#Look at data info
data_veh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256115 entries, 0 to 256114
Data columns (total 60 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   HOUSEID    256115 non-null  object 
 1   VEHID      256115 non-null  object 
 2   VEHYEAR    256115 non-null  float64
 3   VEHAGE     256115 non-null  float64
 4   MAKE       256115 non-null  object 
 5   MODEL      256115 non-null  object 
 6   FUELTYPE   256115 non-null  object 
 7   VEHTYPE    256115 non-null  object 
 8   WHOMAIN    256115 non-null  object 
 9   OD_READ    256115 non-null  float64
 10  HFUEL      256115 non-null  object 
 11  VEHOWNED   256115 non-null  object 
 12  VEHOWNMO   256115 non-null  object 
 13  ANNMILES   256115 non-null  float64
 14  HYBRID     256115 non-null  object 
 15  PERSONID   256115 non-null  object 
 16  TRAVDAY    256115 non-null  object 
 17  HOMEOWN    256115 non-null  object 
 18  HHSIZE     256115 non-null  float64
 19  HHVEHCNT   256115 non-n

In [31]:
#Select string variables
data_veh_obj = data_veh.select_dtypes(['object'])
print (data_veh_obj.head())

    HOUSEID VEHID MAKE  MODEL FUELTYPE VEHTYPE WHOMAIN HFUEL VEHOWNED  \
0  30000007    01   49  49032       01      01      03    -1       01   
1  30000007    02   49  49442       01      02      -8    -1       01   
2  30000007    03   19  19014       01      01      01    -1       01   
3  30000007    04   19  19021       01      01      02    -1       01   
4  30000007    05   20  20481       01      04      02    -1       01   

  VEHOWNMO  ... BEST_OUT HBHUR HTHTNRNT HTPPOPDN HTRESDN HTEEMPDN HBHTNRNT  \
0       -1  ...       04     T       50     1500     750      750       20   
1       -1  ...       -1     T       50     1500     750      750       20   
2       -1  ...       -1     T       50     1500     750      750       20   
3       -1  ...       -1     T       50     1500     750      750       20   
4       -1  ...       -1     T       50     1500     750      750       20   

  HBPPOPDN HBRESDN FEGEMPGF  
0      750     300       01  
1      750     300       01  
2 

In [32]:
#Remove trailing or leading spaces
data_veh[data_veh_obj.columns] = data_veh_obj.apply(lambda x: x.str.strip())

In [33]:
#Look at dimension
data_veh.shape

(256115, 60)

In [34]:
#Look at data info
data_veh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256115 entries, 0 to 256114
Data columns (total 60 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   HOUSEID    256115 non-null  object 
 1   VEHID      256115 non-null  object 
 2   VEHYEAR    256115 non-null  float64
 3   VEHAGE     256115 non-null  float64
 4   MAKE       256115 non-null  object 
 5   MODEL      256115 non-null  object 
 6   FUELTYPE   256115 non-null  object 
 7   VEHTYPE    256115 non-null  object 
 8   WHOMAIN    256115 non-null  object 
 9   OD_READ    256115 non-null  float64
 10  HFUEL      256115 non-null  object 
 11  VEHOWNED   256115 non-null  object 
 12  VEHOWNMO   256115 non-null  object 
 13  ANNMILES   256115 non-null  float64
 14  HYBRID     256115 non-null  object 
 15  PERSONID   256115 non-null  object 
 16  TRAVDAY    256115 non-null  object 
 17  HOMEOWN    256115 non-null  object 
 18  HHSIZE     256115 non-null  float64
 19  HHVEHCNT   256115 non-n

## Merge Datasets

Based on the User Guide documentation provided, many of the variables are repeated across multiple table file levels.

In [35]:
#Look at similar variables between datasets we want to merge and save them in variables
data_hh_columns = set(data_hh.columns)
data_veh_columns = set(data_veh.columns)
data_person_columns = set(data_person.columns)
data_trip_columns = set(data_trip.columns)

data_hh_veh_columns = list(data_hh_columns.intersection(data_veh_columns))
data_hh_veh_columns_u = list(data_hh_columns.union(data_veh_columns))

data_hh_veh_person_columns = list(set(data_hh_veh_columns_u).intersection(data_person_columns))
data_hh_veh_person_columns_u = list(set(data_hh_veh_columns_u).union(data_person_columns))

data_hh_veh_person_trip_columns = list(set(data_hh_veh_person_columns_u).intersection(data_trip_columns))

In [36]:
#merge household and vehicle data
data_hh_veh = pd.merge(data_hh,data_veh,on=data_hh_veh_columns)
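
`pd.merge` defaults to an inner join, so any row without a match on every shared column is dropped silently. A quick way to see what an inner join would keep is to run the same merge as an outer join with `indicator=True` and count the `_merge` categories. The sketch below uses two toy tables with hypothetical values, not the NHTS files.

```python
import pandas as pd

# Toy tables (hypothetical values): household "3" owns no vehicle
hh = pd.DataFrame({"HOUSEID": ["1", "2", "3"], "HHSIZE": [3.0, 2.0, 1.0]})
veh = pd.DataFrame({
    "HOUSEID": ["1", "1", "2"],
    "HHSIZE": [3.0, 3.0, 2.0],
    "VEHID": ["01", "02", "01"],
})

# Columns present in both tables become the merge keys
shared = sorted(set(hh.columns) & set(veh.columns))

# Outer join with an indicator column shows which side each row came from
check = pd.merge(hh, veh, on=shared, how="outer", indicator=True)
n_both = int((check["_merge"] == "both").sum())
n_left_only = int((check["_merge"] == "left_only").sum())
n_right_only = int((check["_merge"] == "right_only").sum())

# The default inner join keeps only the rows flagged "both"
inner = pd.merge(hh, veh, on=shared)
print(shared, n_both, n_left_only, n_right_only, len(inner))
```

Rows flagged `left_only` or `right_only` are the ones an inner join discards, so these counts help explain why the merged row count differs from the inputs.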

In [37]:
#look at first 5 rows
data_hh_veh.head()

Unnamed: 0,HOUSEID,TRAVDAY,SAMPSTRAT,HOMEOWN,HHSIZE,HHVEHCNT,HHFAMINC,PC,SPHONE,TAB,...,BESTMILE,BEST_FLG,BEST_EDT,BEST_OUT,GSYRGAL,GSTOTCST,FEGEMPG,FEGEMPGA,GSCOST,FEGEMPGF
0,30000007,2,3,1,3.0,5.0,7,2,1,2,...,14611.926637,1,-1,4,487.064221,1126.457778,30.0,-9.0,2.31275,1
1,30000007,2,3,1,3.0,5.0,7,2,1,2,...,4767.090946,3,-1,-1,250.899523,580.267873,19.0,-9.0,2.31275,1
2,30000007,2,3,1,3.0,5.0,7,2,1,2,...,8000.324552,1,-1,-1,444.462475,1027.930589,18.0,-9.0,2.31275,1
3,30000007,2,3,1,3.0,5.0,7,2,1,2,...,725.932347,2,-1,-1,40.329575,93.272224,18.0,-9.0,2.31275,1
4,30000007,2,3,1,3.0,5.0,7,2,1,2,...,12437.658757,1,-1,-1,888.404197,2054.656806,14.0,-9.0,2.31275,1


In [38]:
#Dimension of data
data_hh_veh.shape

(256115, 83)

In [39]:
#merge household and vehicle data to person data
data_hh_veh_person = pd.merge(data_hh_veh,data_person,on=data_hh_veh_person_columns)

In [40]:
#look at first 5 rows
data_hh_veh_person.head()

Unnamed: 0,HOUSEID,TRAVDAY,SAMPSTRAT,HOMEOWN,HHSIZE,HHVEHCNT,HHFAMINC,PC,SPHONE,TAB,...,BIKE_DFR,BIKE_GKP,CONDTRAV,CONDRIDE,CONDNIGH,CONDRIVE,CONDPUB,CONDSPEC,CONDTAX,WTPERFIN
0,30000007,2,3,1,3.0,5.0,7,2,1,2,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,219.51421
1,30000007,2,3,1,3.0,5.0,7,2,1,2,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,206.690153
2,30000007,2,3,1,3.0,5.0,7,2,1,2,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,197.075742
3,30000007,2,3,1,3.0,5.0,7,2,1,2,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,197.075742
4,30000008,5,2,1,2.0,4.0,8,1,1,2,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,58.665911


In [41]:
#Dimension of data
data_hh_veh_person.shape

(250200, 168)

In [42]:
#merge person, household data, vehicle data to trip data
data_hh_veh_person_trip = pd.merge(data_hh_veh_person,data_trip,on=data_hh_veh_person_trip_columns)

In [43]:
#review first few rows
data_hh_veh_person_trip.head()

Unnamed: 0,HOUSEID,TRAVDAY,SAMPSTRAT,HOMEOWN,HHSIZE,HHVEHCNT,HHFAMINC,PC,SPHONE,TAB,...,OBHTNRNT,OBPPOPDN,OBRESDN,DTHTNRNT,DTPPOPDN,DTRESDN,DTEEMPDN,DBHTNRNT,DBPPOPDN,DBRESDN
0,30000007,2,3,1,3.0,5.0,7,2,1,2,...,20,750,300,50,750,300,350,60,750,300
1,30000007,2,3,1,3.0,5.0,7,2,1,2,...,60,750,300,50,1500,750,750,20,750,300
2,30000007,2,3,1,3.0,5.0,7,2,1,2,...,20,750,300,50,750,300,350,30,300,300
3,30000007,2,3,1,3.0,5.0,7,2,1,2,...,30,300,300,50,1500,750,750,20,750,300
4,30000007,2,3,1,3.0,5.0,7,2,1,2,...,40,1500,750,50,1500,750,750,20,750,300


In [44]:
#review dimension
data_hh_veh_person_trip.shape

(573829, 244)

In [45]:
#rename the data
data = data_hh_veh_person_trip

In [46]:
#Look at data info
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 573829 entries, 0 to 573828
Data columns (total 244 columns):
 #    Column      Dtype  
---   ------      -----  
 0    HOUSEID     object 
 1    TRAVDAY     object 
 2    SAMPSTRAT   object 
 3    HOMEOWN     object 
 4    HHSIZE      float64
 5    HHVEHCNT    float64
 6    HHFAMINC    object 
 7    PC          object 
 8    SPHONE      object 
 9    TAB         object 
 10   WALK        object 
 11   BIKE        object 
 12   CAR         object 
 13   TAXI        object 
 14   BUS         object 
 15   TRAIN       object 
 16   PARA        object 
 17   PRICE       object 
 18   PLACE       object 
 19   WALK2SAVE   object 
 20   BIKE2SAVE   object 
 21   PTRANS      object 
 22   HHRELATD    object 
 23   DRVRCNT     float64
 24   CNTTDHH     float64
 25   HHSTATE     object 
 26   HHSTFIPS    object 
 27   NUMADLT     float64
 28   YOUNGCHILD  float64
 29   WRKCOUNT    float64
 30   TDAYDATE    object 
 31   HHRESP      object 
 32 

In [47]:
#reorder variables
first_cols = ['PERSONID','VEHID']
last_cols = [col for col in data.columns if col not in first_cols]
len(last_cols)

242

In [48]:
#reorder variables
data1 = data[first_cols+last_cols]

In [49]:
#Get first few rows
data1.head()

Unnamed: 0,PERSONID,VEHID,HOUSEID,TRAVDAY,SAMPSTRAT,HOMEOWN,HHSIZE,HHVEHCNT,HHFAMINC,PC,...,OBHTNRNT,OBPPOPDN,OBRESDN,DTHTNRNT,DTPPOPDN,DTRESDN,DTEEMPDN,DBHTNRNT,DBPPOPDN,DBRESDN
0,3,1,30000007,2,3,1,3.0,5.0,7,2,...,20,750,300,50,750,300,350,60,750,300
1,3,1,30000007,2,3,1,3.0,5.0,7,2,...,60,750,300,50,1500,750,750,20,750,300
2,1,3,30000007,2,3,1,3.0,5.0,7,2,...,20,750,300,50,750,300,350,30,300,300
3,1,3,30000007,2,3,1,3.0,5.0,7,2,...,30,300,300,50,1500,750,750,20,750,300
4,2,5,30000007,2,3,1,3.0,5.0,7,2,...,40,1500,750,50,1500,750,750,20,750,300


In [50]:
#Dimension of data
data1.shape

(573829, 244)

## Duplicates and NAs

In [51]:
#Any duplicate rows?
data1 = data1.drop_duplicates()
#Dimension
data1.shape

(573829, 244)

There are no duplicate rows: the shape is unchanged after `drop_duplicates()`.

In [52]:
#Check for NAs
data1.isna().values.any()

False

There are no NAs in the dataset, which agrees with the dataset documentation.

There are a few values that should be reviewed further.

1. -1 : Appropriate Skip
2. -9 : Not Ascertained
3. -7 : I prefer not to answer (Selected by participant (available when no answer given)) 
4. -77 : I prefer not to answer (Selected by participant (always available)) 
5. -8 : I don’t know (Selected by participant (available when no answer given))
6. -88 : I don’t know (Selected by participant (always available))

Let's check whether any variable consists entirely of these values.

In [53]:
#Flag variables where every value is one of these codes
data1_check_val = data1.isin([-1.0,-9.0,-7.0,-77.0,-8.0,-88.0, 
                              '-1','-9','-7','-77','-8','-88']).all()
data1_check_val

PERSONID     False
VEHID        False
HOUSEID      False
TRAVDAY      False
SAMPSTRAT    False
             ...  
DTRESDN      False
DTEEMPDN     False
DBHTNRNT     False
DBPPOPDN     False
DBRESDN      False
Length: 244, dtype: bool

In [54]:
#check one variable
data1['TRACC_BUS'].unique()

array(['-1'], dtype=object)

In [55]:
#Get variable names
var_na = list(data1_check_val[data1_check_val==True].index)
var_na

['LSTTRDAY17',
 'SAMEPLC',
 'TRACC_WLK',
 'TRACC_POV',
 'TRACC_BUS',
 'TRACC_CRL',
 'TRACC_SUB',
 'TRACC_OTH',
 'TREGR_WLK',
 'TREGR_POV',
 'TREGR_BUS',
 'TREGR_CRL',
 'TREGR_SUB',
 'TREGR_OTH']

In [56]:
#Number of variables
len(var_na)

14

In [57]:
#Drop the variables whose values are all non-response codes
data1.drop(var_na,axis=1, inplace=True)

In [58]:
#Get first few rows
data1.head()

Unnamed: 0,PERSONID,VEHID,HOUSEID,TRAVDAY,SAMPSTRAT,HOMEOWN,HHSIZE,HHVEHCNT,HHFAMINC,PC,...,OBHTNRNT,OBPPOPDN,OBRESDN,DTHTNRNT,DTPPOPDN,DTRESDN,DTEEMPDN,DBHTNRNT,DBPPOPDN,DBRESDN
0,3,1,30000007,2,3,1,3.0,5.0,7,2,...,20,750,300,50,750,300,350,60,750,300
1,3,1,30000007,2,3,1,3.0,5.0,7,2,...,60,750,300,50,1500,750,750,20,750,300
2,1,3,30000007,2,3,1,3.0,5.0,7,2,...,20,750,300,50,750,300,350,30,300,300
3,1,3,30000007,2,3,1,3.0,5.0,7,2,...,30,300,300,50,1500,750,750,20,750,300
4,2,5,30000007,2,3,1,3.0,5.0,7,2,...,40,1500,750,50,1500,750,750,20,750,300


In [59]:
#Dimension of data
data1.shape

(573829, 230)

My intention was to recode these values as missing, but since I need the original codes for the next section, I did not recode them.
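
One way to get both is to write the recoded values to a separate DataFrame and leave the original untouched. The sketch below is a minimal illustration on toy data, assuming the same twelve codes checked earlier; the names `MISSING_CODES`, `toy`, and `recoded` are hypothetical.

```python
import pandas as pd

# Numeric and string forms of the non-response codes listed above
MISSING_CODES = [-1.0, -9.0, -7.0, -77.0, -8.0, -88.0,
                 "-1", "-9", "-7", "-77", "-8", "-88"]

# Toy data (hypothetical values): one character and one numeric variable
toy = pd.DataFrame({
    "PHYACT": pd.Series(["01", "-9", "02"], dtype=object),
    "PUBTIME": [15.0, -1.0, -88.0],
})

# mask() returns a new DataFrame with NaN wherever the condition is True
recoded = toy.mask(toy.isin(MISSING_CODES))
print(recoded)
```

`toy` still holds the original codes for later steps, while `recoded` can be used wherever missing values should be treated as NaN.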

In [60]:
#Check a character variable
#data1['PHYACT'].unique()

In [61]:
#Check a numeric variable
#data1['PUBTIME'].unique()

In [62]:
#recode variables
#data1.iloc[:,1:] = data1.iloc[:,1:].replace(dict.fromkeys([-1.0,-9.0,-7.0,-77.0,-8.0,-88.0], np.nan))
#data1.iloc[:,1:] = data1.iloc[:,1:].replace(dict.fromkeys(['-1','-9','-7','-77','-8','-88'], np.nan))

In [63]:
#Check a character variable
#data1['PHYACT'].unique()

In [64]:
#Check a numeric variable
#data1['PUBTIME'].unique()

## Keep Only Latest Vehicle Information

In [65]:
data1.groupby(['VEHOWNED','VEHOWNMO'])['VEHOWNMO'].count()

VEHOWNED  VEHOWNMO
-7        -1              71
-8        -1              29
-9        -1              77
01        -1          482439
02        -7             413
          -8             329
          -9               7
          0             3721
          01            9943
          02            8565
          03            8966
          04            8399
          05            7309
          06           10461
          07            6179
          08            7442
          09            6537
          10            7182
          11            5760
Name: VEHOWNMO, dtype: int64

1. VEHOWNED : Owned Vehicle Longer than a Year
2. VEHOWNMO : Months of Vehicle Ownership

My intention was to keep the latest vehicle for each person. What should I do if a person has multiple vehicles, some coded as "I don't know" or "Not ascertained" and others with a reported time frame? How do I choose the latest vehicle?
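
One possible rule, offered only as a sketch and not as a settled answer: convert ownership to months, treat the non-response codes as unknown, and pick the vehicle with the fewest known months per person. The assumptions are labeled in the code: `VEHOWNED == "01"` is read as "owned longer than a year" (only the `02` rows report months in the table above), 12 months is an arbitrary placeholder for those vehicles, vehicles with unknown duration are left out of the comparison, and `OWN_MONTHS` is a hypothetical column name. The toy values are made up.

```python
import pandas as pd

# Toy data (hypothetical values): one person per household, several vehicles
toy = pd.DataFrame({
    "HOUSEID": ["1", "1", "1", "2", "2"],
    "PERSONID": ["01", "01", "01", "01", "01"],
    "VEHID": ["01", "02", "03", "01", "02"],
    "VEHOWNED": ["01", "02", "02", "02", "-9"],
    "VEHOWNMO": ["-1", "03", "-8", "-8", "-1"],
})

# Months of ownership; negative codes (-1, -7, -8, -9) become NaN
months = pd.to_numeric(toy["VEHOWNMO"], errors="coerce")
months = months.where(months >= 0)

# Assumption: vehicles owned longer than a year get a 12-month placeholder
months = months.mask(toy["VEHOWNED"] == "01", 12.0)
toy["OWN_MONTHS"] = months

# Assumption: vehicles with unknown duration cannot be ranked, so drop them
known = toy.dropna(subset=["OWN_MONTHS"])

# Keep the vehicle with the fewest months of ownership for each person
idx = known.groupby(["HOUSEID", "PERSONID"])["OWN_MONTHS"].idxmin()
latest = known.loc[idx]
print(latest[["HOUSEID", "PERSONID", "VEHID", "OWN_MONTHS"]])
```

Under this rule a person whose vehicles all have unknown duration drops out entirely, and ties go to the first row encountered, so both cases would need an explicit decision before applying it to the full data.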


## Analyze the outcome variable

In [66]:
data1['FUELTYPE'].value_counts()

01    544413
03     18271
02     10779
97       276
-8        81
-7         9
Name: FUELTYPE, dtype: int64

Only 18271/573739 ≈ 3% of the rows with a valid FUELTYPE (that is, excluding the 90 rows coded -7 or -8) involve a hybrid, electric, or alternative fuel vehicle (code 03).
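
The share can be reproduced directly from the counts printed above. This is a small worked check that types in those counts by hand; the variable names are hypothetical.

```python
import pandas as pd

# FUELTYPE counts copied from the value_counts() output above
counts = pd.Series({"01": 544413, "03": 18271, "02": 10779,
                    "97": 276, "-8": 81, "-7": 9})

# Exclude the non-response codes before computing the share
valid = counts.drop(["-8", "-7"])
share_03 = valid["03"] / valid.sum()
print(valid.sum(), round(share_03 * 100, 1))
```

On the full data, `data1['FUELTYPE'].value_counts(normalize=True)` gives the same kind of proportions, although it keeps the non-response codes in the denominator.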

In [67]:
data1['HFUEL'].value_counts()

-1    555558
04     14818
03      1349
02      1293
97       409
-9       333
01        59
-8        10
Name: HFUEL, dtype: int64

Since my analysis focuses on electric vehicles, I will use the HFUEL variable to distinguish vehicle categories.

The value labels for HFUEL are given below.

1. -9: Not ascertained
2. -8: I don't know
3. -1: Appropriate skip
4. 01: Biodiesel
5. 02: Plug-in Hybrid (gas/electric e.g., Chevy Volt)
6. 03: Electric (e.g. Nissan Leaf)
7. 04: Hybrid (gas/electric, not plug-in e.g., Toyota Prius)
8. 97: Some other fuel

I plan to code "02" and "03" as electric vehicles and the rest as conventional vehicles.
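
A minimal sketch of that coding on toy data follows. The column name `ELECTRIC` is hypothetical, and every code other than "02" and "03", including the non-response codes, is treated as conventional, as described above.

```python
import pandas as pd

# Toy data covering each HFUEL code listed above
toy = pd.DataFrame({"HFUEL": ["-1", "04", "03", "02", "97", "-9", "01", "-8"]})

# 1 = electric (plug-in hybrid or electric), 0 = everything else
toy["ELECTRIC"] = toy["HFUEL"].isin(["02", "03"]).astype(int)
print(toy)
```

Since "-9" and "-8" end up as 0 under this rule, it may be worth deciding separately whether those rows should instead be set aside as unknown.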

## Convert Format for Variables

In [68]:
#Get these variables
#data_1_int = data1[["BIKE4EX","BIKESHARE","CARRODE","CARSHARE","CNTTDHH","CNTTDTR","DELIVER","DRVRCNT","HHSIZE","HHVEHCNT","LPACT","MCUSED","NBIKETRP","NUMADLT","NUMONTRP","NUMTRANS","NWALKTRP","PTUSED","RESP_CNT","RIDESHARE","TRPACCMP","TRPHHACC","VEHYEAR","VPACT","WALK4EX","WKFMHMXX","WRKCOUNT","YOUNGCHILD","YRTOUS"]]

In [69]:
#Convert type
#data1[data_1_int.columns]=data_1_int.astype('int')

In [70]:
#Info of data
#data1.info()

In [71]:
#check first few rows
#data1.head()

In [72]:
#check dimension
#data1.shape

## Create Dummy Variables

In [73]:
#data2 = pd.get_dummies(data1)

I am waiting for feedback on the code above and am only trying out this step for now.

## Treating NAs

Reference: <br>
U.S. Department of Transportation, Federal Highway Administration, 2017 National Household Travel Survey. URL: http://nhts.ornl.gov.