# Capstone Two - Data Wrangling

The dataset used is the 2017 National Household Travel Survey (NHTS). This notebook focuses on cleaning the data.

## Import packages

In [1]:
#import packages
import pandas as pd
import os
import tabula
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Set directories

In [2]:
os.chdir("../..")
cw = os.getcwd()

## Read datasets

The National Household Travel Survey has four datasets:

1. The Person dataset
2. The Household dataset
3. The Vehicle dataset
4. The Trip dataset
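
Each of the four files is read and cleaned the same way in the cells that follow, so the steps could be wrapped in one helper. The sketch below is only an illustration: `strip_object_columns` and `load_nhts_table` are hypothetical names that are not part of this notebook, and the loader assumes the same file format and encoding used in the cells of this section.

```python
import os
import pandas as pd


def strip_object_columns(df):
    """Return a copy of df with leading/trailing spaces removed from string columns."""
    out = df.copy()
    obj_cols = out.select_dtypes(include=["object"]).columns
    for col in obj_cols:
        out[col] = out[col].str.strip()
    return out


def load_nhts_table(data_dir, file_name):
    """Hypothetical helper: read one SAS file and strip its string columns."""
    path = os.path.join(data_dir, file_name)
    df = pd.read_sas(path, format="sas7bdat", encoding="ISO-8859-1")
    return strip_object_columns(df)


# In-memory check of the stripping step (toy values, not NHTS data)
toy = pd.DataFrame({
    "HOUSEID": pd.Series([" 30000007", "30000008 "], dtype="object"),
    "HHSIZE": [3.0, 2.0],
})
clean = strip_object_columns(toy)
print(clean["HOUSEID"].tolist())
```

With such a helper, each dataset would be loaded with one call, passing the data folder and a file name such as `perpub.sas7bdat`.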

In [None]:
#import person data
data_person=pd.read_sas(os.path.join(cw,'Capstone_Two_Other_Material/Data/sas/perpub.sas7bdat'), format = 'sas7bdat', encoding="ISO-8859-1")

In [None]:
#Look at first few rows
data_person.head()

In [None]:
#Look at dimension
data_person.shape

In [None]:
#Look at data info
data_person.info()

In [None]:
#Select string variables
data_person_obj = data_person.select_dtypes(['object'])
print (data_person_obj.head())

In [None]:
#Remove trailing or leading spaces
data_person[data_person_obj.columns] = data_person_obj.apply(lambda x: x.str.strip())

In [None]:
#Look at dimension
data_person.shape

In [None]:
#Look at data info
data_person.info()

In [None]:
#import household data
data_hh=pd.read_sas(os.path.join(cw,'Capstone_Two_Other_Material/Data/sas/hhpub.sas7bdat'), format = 'sas7bdat', encoding="ISO-8859-1")

In [None]:
#Look at first few rows
data_hh.head()

In [None]:
#Look at dimension
data_hh.shape

In [None]:
#Look at data info
data_hh.info()

In [None]:
#Select string variables
data_hh_obj = data_hh.select_dtypes(['object'])
print (data_hh_obj.head())

In [None]:
#Remove trailing or leading spaces
data_hh[data_hh_obj.columns] = data_hh_obj.apply(lambda x: x.str.strip())

In [None]:
#Look at dimension
data_hh.shape

In [None]:
#Look at data info
data_hh.info()

In [None]:
#import trip data
data_trip=pd.read_sas(os.path.join(cw,'Capstone_Two_Other_Material/Data/sas/trippub.sas7bdat'), format = 'sas7bdat', encoding="ISO-8859-1")

In [None]:
#Look at first few rows
data_trip.head()

In [None]:
#Look at dimension
data_trip.shape

In [None]:
#Look at data info
data_trip.info()

In [None]:
#Select string variables
data_trip_obj = data_trip.select_dtypes(['object'])
print (data_trip_obj.head())

In [None]:
#Remove trailing or leading spaces
data_trip[data_trip_obj.columns] = data_trip_obj.apply(lambda x: x.str.strip())

In [None]:
#Look at dimension
data_trip.shape

In [None]:
#Look at dataset info
data_trip.info()

In [None]:
#import vehicle data
data_veh=pd.read_sas(os.path.join(cw,'Capstone_Two_Other_Material/Data/sas/vehpub.sas7bdat'), format = 'sas7bdat', encoding="ISO-8859-1")

In [None]:
#Look at first few rows
data_veh.head()

In [None]:
#Look at dimension
data_veh.shape

In [None]:
#Look at data info
data_veh.info()

In [None]:
#Select string variables
data_veh_obj = data_veh.select_dtypes(['object'])
print (data_veh_obj.head())

In [None]:
#Remove trailing or leading spaces
data_veh[data_veh_obj.columns] = data_veh_obj.apply(lambda x: x.str.strip())

In [None]:
#Look at dimension
data_veh.shape

In [None]:
#Look at data info
data_veh.info()

## Merge Datasets

Based on the User Guide documentation provided, many of the variables are repeated across multiple table file levels.
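
Because the repeated variables double as merge keys, it can help to confirm how two tables relate before merging them. A minimal sketch with toy tables (made-up values; only the column names come from the survey files), using the `validate` argument of `pd.merge` to assert that each household row matches one or more vehicle rows:

```python
import pandas as pd

# Toy household and vehicle tables (not NHTS data)
hh = pd.DataFrame({"HOUSEID": ["A", "B"], "HHSIZE": [2, 3], "HHVEHCNT": [2, 1]})
veh = pd.DataFrame({
    "HOUSEID": ["A", "A", "B"],
    "VEHID": ["01", "02", "01"],
    "HHVEHCNT": [2, 2, 1],
})

# Columns present in both tables become the merge keys
shared = sorted(set(hh.columns) & set(veh.columns))

# validate raises pandas.errors.MergeError if the keys repeat on the household side
merged = pd.merge(hh, veh, on=shared, how="inner", validate="one_to_many")
print(shared)
print(merged.shape)
```

Note that `pd.merge` defaults to an inner join, so a household with no matching row in the vehicle table would be dropped from the result.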

In [None]:
#Look at similar variables between datasets we want to merge and save them in variables
data_hh_columns = set(data_hh.columns)
data_veh_columns = set(data_veh.columns)
data_person_columns = set(data_person.columns)
data_trip_columns = set(data_trip.columns)

data_hh_veh_columns = list(data_hh_columns.intersection(data_veh_columns))
data_hh_veh_columns_u = list(data_hh_columns.union(data_veh_columns))

data_hh_veh_person_columns = list(set(data_hh_veh_columns_u).intersection(data_person_columns))
data_hh_veh_person_columns_u = list(set(data_hh_veh_columns_u).union(data_person_columns))

data_hh_veh_person_trip_columns = list(set(data_hh_veh_person_columns_u).intersection(data_trip_columns))

In [None]:
#merge household and vehicle data
data_hh_veh = pd.merge(data_hh,data_veh,on=data_hh_veh_columns)

In [None]:
#look at first 5 rows
data_hh_veh.head()

In [None]:
#Dimension of data
data_hh_veh.shape

In [None]:
#merge household and vehicle data to person data
data_hh_veh_person = pd.merge(data_hh_veh,data_person,on=data_hh_veh_person_columns)

In [None]:
#look at first 5 rows
data_hh_veh_person.head()

In [None]:
#Dimension of data
data_hh_veh_person.shape

In [None]:
#merge person, household data, vehicle data to trip data
data_hh_veh_person_trip = pd.merge(data_hh_veh_person,data_trip,on=data_hh_veh_person_trip_columns)

In [None]:
#review first few rows
data_hh_veh_person_trip.head()

In [None]:
#review dimension
data_hh_veh_person_trip.shape

In [None]:
#rename the data
data = data_hh_veh_person_trip

In [None]:
#Look at data info
data.info(verbose=True)

In [None]:
#build the new column order: ID variables first, then the rest
first_cols = ['PERSONID','VEHID']
last_cols = [col for col in data.columns if col not in first_cols]
len(last_cols)

In [None]:
#reorder variables
data1 = data[first_cols+last_cols]

In [None]:
#Get first few rows
data1.head()

In [None]:
#Dimension of data
data1.shape

## Duplicates and NAs

In [None]:
#Drop duplicate rows, if any
data1 = data1.drop_duplicates()
#Dimension (unchanged if there were no duplicates)
data1.shape

There are no duplicate rows.
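
Comparing the shape before and after `drop_duplicates` works, but the number of duplicates can also be read directly. A small sketch on toy data (not the survey data):

```python
import pandas as pd

# Toy frame in which the second row repeats the first
toy = pd.DataFrame({"PERSONID": ["01", "01", "02"], "VEHID": ["01", "01", "01"]})

# duplicated() marks every repeat of an earlier row; the sum is the count
n_dupes = int(toy.duplicated().sum())
print(n_dupes)
```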

In [None]:
#Check for NAs
data1.isna().values.any()

There are no NAs in the dataset, which is consistent with the dataset documentation.

However, a few coded values should be reviewed further:

1. -1 : Appropriate Skip
2. -9 : Not Ascertained
3. -7 : I prefer not to answer (Selected by participant (available when no answer given)) 
4. -77 : I prefer not to answer (Selected by participant (always available)) 
5. -8 : I don’t know (Selected by participant (available when no answer given))
6. -88 : I don’t know (Selected by participant (always available))

Let's check whether any variable consists entirely of these values.

In [None]:
#variables where all values are these
data1_check_val = data1.isin([-1.0,-9.0,-7.0,-77.0,-8.0,-88.0, 
                              '-1','-9','-7','-77','-8','-88']).all()
data1_check_val
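
The `.all()` check only flags variables made up entirely of these codes. As a complementary view, the share of coded values per variable could be computed by taking `.mean()` of the same boolean frame. A sketch on toy data, where `codes` and the toy column names are illustrative:

```python
import pandas as pd

# Same codes as in the check above, as floats and as strings
codes = [-1.0, -9.0, -7.0, -77.0, -8.0, -88.0,
         "-1", "-9", "-7", "-77", "-8", "-88"]

toy = pd.DataFrame({
    "ALL_CODED": [-1.0, -9.0, -1.0, -8.0],
    "PARTLY_CODED": pd.Series(["01", "-9", "02", "-1"], dtype="object"),
    "VALID": [3.0, 1.0, 2.0, 5.0],
})

is_code = toy.isin(codes)
share = is_code.mean()     # fraction of coded values in each column
all_coded = is_code.all()  # True only where every value is a code
print(share)
```

A variable with a share close to, but below, 1 would not be dropped by the `.all()` rule, so this view can show which variables carry little usable information.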

In [None]:
#check one variable
data1['TRACC_BUS'].unique()

In [None]:
#Get variable names
var_na = list(data1_check_val[data1_check_val].index)
var_na

In [None]:
#Number of variables
len(var_na)

In [None]:
#drop the variables where everything seems to be not valid.
data1.drop(var_na,axis=1, inplace=True)

In [None]:
#Get first few rows
data1.head()

In [None]:
#Dimension of data
data1.shape

My intention was to recode these values, but since I need them for the next section, I did not recode them.
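
One way to keep the original codes while still having a recoded version is to write the recoded values to a separate frame. The sketch below uses toy data and `DataFrame.mask`, which returns a new frame and leaves the input untouched. It assumes that none of these codes is a legitimate measurement in the columns being recoded; `recoded` is an illustrative name.

```python
import pandas as pd

num_codes = [-1.0, -9.0, -7.0, -77.0, -8.0, -88.0]
str_codes = ["-1", "-9", "-7", "-77", "-8", "-88"]

# Toy values (not NHTS data) for one numeric and one character variable
toy = pd.DataFrame({
    "PUBTIME": [10.0, -9.0, 25.0],
    "PHYACT": pd.Series(["01", "-7", "02"], dtype="object"),
})

# mask() puts NaN wherever the condition is True and returns a new frame
recoded = toy.mask(toy.isin(num_codes + str_codes))
print(recoded)
```

The original frame still holds the codes for any later step that needs them.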

In [None]:
#Check a character variable
#data1['PHYACT'].unique()

In [None]:
#Check a numeric variable
#data1['PUBTIME'].unique()

In [None]:
#recode variables
#data1.iloc[:,1:] = data1.iloc[:,1:].replace(dict.fromkeys([-1.0,-9.0,-7.0,-77.0,-8.0,-88.0], np.nan))
#data1.iloc[:,1:] = data1.iloc[:,1:].replace(dict.fromkeys(['-1','-9','-7','-77','-8','-88'], np.nan))

In [None]:
#Check a character variable
#data1['PHYACT'].unique()

In [None]:
#Check a numeric variable
#data1['PUBTIME'].unique()

## Keep Only Latest Vehicle Information

In [None]:
data1.groupby(['VEHOWNED','VEHOWNMO'])['VEHOWNMO'].count()

1. VEHOWNED : Owned Vehicle Longer than a Year
2. VEHOWNMO : Months of Vehicle Ownership

My intention was to keep the latest vehicle for each person. What should I do if a person has multiple vehicles, some coded as Don't Know or Not Ascertained and another with the time frame provided? How do I choose the latest vehicle?
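
One possible rule, offered only as a sketch and not as a settled answer: treat Don't Know and Not Ascertained as missing, use the reported months for vehicles owned under a year, use 12 months as a floor for vehicles owned longer than a year, and keep the vehicle with the shortest known ownership. The code below runs on toy rows and rests on several assumptions: that `VEHOWNED` is stored as the strings "01" (yes) and "02" (no), that `VEHOWNMO` is numeric, and that a person is identified by `HOUSEID` plus `PERSONID`. `OWN_MONTHS` is a hypothetical helper column.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); the column names follow the survey files
toy = pd.DataFrame({
    "HOUSEID": ["A"] * 5,
    "PERSONID": ["01", "01", "01", "02", "02"],
    "VEHID": ["01", "02", "03", "01", "02"],
    "VEHOWNED": ["01", "02", "-9", "-8", "01"],
    "VEHOWNMO": [-1.0, 5.0, -9.0, -8.0, -1.0],
})

# Assumption: "01" = owned longer than a year, "02" = owned a year or less
over_year = toy["VEHOWNED"] == "01"
under_year = (toy["VEHOWNED"] == "02") & (toy["VEHOWNMO"] > 0)

# Reported months, 12 as a floor for "longer than a year", NaN otherwise
toy["OWN_MONTHS"] = np.select(
    [under_year, over_year], [toy["VEHOWNMO"], 12.0], default=np.nan
)

# Shortest known ownership first; unknown durations sort last
latest = (
    toy.sort_values(["HOUSEID", "PERSONID", "OWN_MONTHS"], na_position="last")
       .drop_duplicates(["HOUSEID", "PERSONID"], keep="first")
)
print(latest[["PERSONID", "VEHID", "OWN_MONTHS"]])
```

Under this rule a person whose vehicles all have unknown durations still keeps one row, and ties between several vehicles owned longer than a year are not resolved, so both cases would need a separate decision.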


## Analyze the outcome variable

In [None]:
data1['FUELTYPE'].value_counts()

Only 18,271 of 573,739 rows (about 3%) involve a hybrid, electric, or alternative fuel vehicle.
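
The share quoted above can be checked with a line of arithmetic. The two counts are the ones stated in the text; the variable names are illustrative.

```python
n_alt_fuel = 18271   # hybrid, electric or alternative fuel count quoted in the text
n_total = 573739     # total count quoted in the text

share = n_alt_fuel / n_total
print(f"{share:.1%}")
```

On a pandas Series, `value_counts(normalize=True)` returns the same kind of proportion for every category at once.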

In [None]:
data1['HFUEL'].value_counts()

Since my analysis focuses on electric vehicles, I will use the HFUEL variable to distinguish vehicle categories.

The value labels for HFUEL are given below.

1. -9: Not ascertained
2. -8: I don't know
3. -1: Appropriate skip
4. 01: Biodiesel
5. 02: Plug-in Hybrid (gas/electric e.g., Chevy Volt)
6. 03: Electric (e.g. Nissan Leaf)
7. 04: Hybrid (gas/electric, not plug-in e.g., Toyota Prius)
8. 97: Some other fuel

I plan to code "02" and "03" as electric vehicles and the rest as conventional vehicles.
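
The planned coding could look like the sketch below. It runs on toy values and assumes that `HFUEL` is stored as zero-padded strings such as "02"; `ELECTRIC` is a hypothetical name for the new indicator.

```python
import pandas as pd

# Toy HFUEL values (not NHTS data), stored as strings
toy = pd.DataFrame({
    "HFUEL": pd.Series(["02", "03", "04", "01", "-1", "97"], dtype="object")
})

electric_codes = ["02", "03"]  # plug-in hybrid and electric

# 1 for electric vehicles, 0 for everything else
toy["ELECTRIC"] = toy["HFUEL"].isin(electric_codes).astype(int)
print(toy)
```

With this rule, rows coded -1, -8 or -9 also fall into the conventional group, in line with "the rest" in the plan above.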

## Convert Format for Variables

In [None]:
#Get these variables
#data_1_int = data1[["BIKE4EX","BIKESHARE","CARRODE","CARSHARE","CNTTDHH","CNTTDTR","DELIVER","DRVRCNT","HHSIZE","HHVEHCNT","LPACT","MCUSED","NBIKETRP","NUMADLT","NUMONTRP","NUMTRANS","NWALKTRP","PTUSED","RESP_CNT","RIDESHARE","TRPACCMP","TRPHHACC","VEHYEAR","VPACT","WALK4EX","WKFMHMXX","WRKCOUNT","YOUNGCHILD","YRTOUS"]]

In [None]:
#Convert type
#data1[data_1_int.columns]=data_1_int.astype('int')

In [None]:
#Info of data
#data1.info()

In [None]:
#check first few rows
#data1.head()

In [None]:
#check dimension
#data1.shape

## Make the Cells Dummies

In [None]:
#data2 = pd.get_dummies(data1)

I am still waiting for feedback on the code above, so this cell is only a trial.
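
Calling `pd.get_dummies` on the whole frame expands every string column, which would include identifier columns if they are stored as strings. A more targeted sketch, on toy data, passes the categorical variables explicitly through the `columns` argument; the choice of `HFUEL` here is only an example.

```python
import pandas as pd

# Toy frame (not NHTS data) with one numeric and one categorical variable
toy = pd.DataFrame({
    "HHSIZE": [2, 3, 1],
    "HFUEL": pd.Series(["02", "04", "02"], dtype="object"),
})

# Only the listed columns are expanded; the others pass through unchanged
dummies = pd.get_dummies(toy, columns=["HFUEL"], dtype=int)
print(dummies.columns.tolist())
```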

## Treating NAs

Reference: <br>
U.S. Department of Transportation, Federal Highway Administration, 2017 National Household Travel Survey. URL: http://nhts.ornl.gov.