# National Household Travel Survey (NHTS) 2017

### Koral Buch
### Kenneth Broadhead

## Questions of interest

* Predict market penetration of alternative fuel vehicles.
* Identify the demographic and socioeconomic attributes of alternative fuel vehicle owners.

## Scope of analysis

* California
    * Divide the state into 34 Core Based Statistical Areas (CBSAs)
* Household-based analysis (and not individual-based)
    * Divide households into two categories: 1-vehicle-household and 2-or-more-vehicle-household
* Alternative fuel vehicle:
    * Battery electric vehicle (BEV)
    * Plug-in hybrid electric vehicle (PHEV)
    * Hybrid electric vehicle (HEV)
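
The household-based scope above means vehicle-level records have to be collapsed to one row per household. The following is a minimal sketch of that step on made-up toy data. The names `veh_toy`, `hh_toy`, `is_afv`, `has_afv`, `n_vehicles` and `veh_category` are illustrative assumptions, not columns of the NHTS files.

```python
import pandas as pd

# Toy vehicle-level records; values are invented for illustration only.
veh_toy = pd.DataFrame({
    "houseid": [1, 1, 2, 3, 3, 3],
    "fuel": ["ICE-G", "HEV", "ICE-G", "BEV", "ICE-D", "ICE-G"],
})

# Alternative fuel categories taken from the scope list.
AFV_FUELS = {"BEV", "PHEV", "HEV"}
veh_toy["is_afv"] = veh_toy["fuel"].isin(AFV_FUELS)

# One row per household: number of vehicles and whether any of them is an AFV.
hh_toy = veh_toy.groupby("houseid").agg(
    n_vehicles=("fuel", "size"),
    has_afv=("is_afv", "any"),
).reset_index()

# Two household categories: exactly one vehicle vs. two or more.
hh_toy["veh_category"] = hh_toy["n_vehicles"].map(
    lambda n: "1-vehicle" if n == 1 else "2-or-more-vehicle"
)
print(hh_toy)
```

With a household-level flag such as `has_afv`, the prediction target becomes binary (owns an alternative fuel vehicle or not), which may be easier to model than the five-way fuel label.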

## Methods

* logistic regression
* support vector machine
* random forest
* neural nets
* other methods? (open question)

__Household Variables__

* HOUSEID - Household Identifier
* HOMEOWN - Home Ownership
* HHSIZE - Count of household members
* HHVEHCNT - Count of household vehicles
* HHFAMINC - Household income
* DRVRCNT - Number of drivers in household
* NUMADLT - Count of adult household members at least 18 years old
* WRKCOUNT - Number of workers in household
* LIF_CYC - Life Cycle classification for the household, derived by attributes pertaining to age, relationship, and work status
* URBRUR - Household in urban/rural area
* HH_RACE - Race of household respondent
* HH_CBSA - Core Based Statistical Area (CBSA) FIPS code for the respondent’s home address
* HTPPOPDN - Category of population density (persons per square mile) in the census tract of the household’s home location

__Vehicle Variables__

* HOUSEID - Household Identifier
* VEHID - Vehicle Identifier
* VEHYEAR - Vehicle Year
* VEHAGE - Age of vehicle, based on model year
* MAKE - Vehicle Make
* MODEL - Vehicle Model
* FUEL - Fuel Type
* VEHTYPE - Vehicle Type
* VEHMILES - Count of Miles Driven in Vehicle Over Last Year

In [None]:
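# household file columns of interest
hh_cols = ['houseid', 'homeown', 'hhsize', 'hhvehcnt', 'hhfaminc', 'drvrcnt', 'numadlt',
           'wrkcount', 'lif_cyc', 'urbrur', 'hh_race', 'hh_hisp', 'hh_cbsa', 'htppopdn']

# vehicle file columns of interest
veh_cols = ['houseid', 'vehid', 'vehyear', 'vehage', 'make', 'model',
            'fueltype', 'fueltype_o', 'hfuel', 'hfuel_o', 'vehtype', 'vehmiles']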

In [25]:
import csv
import pandas as pd 
import numpy as np

In [84]:
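# read the California vehicle and household files
# (forward slashes avoid the invalid '\C' escape sequence and work on every OS)
veh = pd.read_excel('data/CA_Vehicle.xlsx')
hh = pd.read_excel('data/CA_Household.xlsx')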

In [80]:
# recode fuel from the fueltype and hfuel codes
# note: np.where casts the NaN placeholder to text, so rows matching no rule end up as the string 'nan'
veh['fuel'] = np.nan
veh['fuel'] = np.where(veh.fueltype == 1, 'ICE-G', veh['fuel'])  # gasoline
veh['fuel'] = np.where(veh.fueltype == 2, 'ICE-D', veh['fuel'])  # diesel
veh['fuel'] = np.where((veh.fueltype == 3) & (veh.hfuel == 2), 'PHEV', veh['fuel'])  # plug-in hybrid
veh['fuel'] = np.where((veh.fueltype == 3) & (veh.hfuel == 3), 'BEV', veh['fuel'])  # battery electric
veh['fuel'] = np.where((veh.fueltype == 3) & (veh.hfuel == 4), 'HEV', veh['fuel'])  # hybrid, not plug-in

# recode race: respondents coded 97 on hh_race who report Hispanic origin (hh_hisp == 1)
# are assigned the new category 7
hh['race'] = hh['hh_race']
hh['race'] = np.where((hh.hh_race == 97) & (hh.hh_hisp == 1), 7, hh['race'])
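
Chaining `np.where` on a column initialised with `np.nan` converts unmatched rows to the text `'nan'`, which is why `nan` shows up as a category in the fuel counts rather than as a true missing value. One possible alternative, sketched here on made-up toy data, starts from an object column of `None` and assigns labels with boolean masks. `veh_toy` and `rules` are hypothetical names, and the code 97 in the last row is only an example of a value that matches no rule.

```python
import pandas as pd

# Toy vehicle records; the last row matches none of the recoding rules.
veh_toy = pd.DataFrame({
    "fueltype": [1, 2, 3, 3, 3, 97],
    "hfuel":    [-1, -1, 2, 3, 4, -1],
})

# (condition, label) pairs mirroring the recoding rules in the cell above.
is_alt = veh_toy["fueltype"] == 3
rules = [
    (veh_toy["fueltype"] == 1, "ICE-G"),
    (veh_toy["fueltype"] == 2, "ICE-D"),
    (is_alt & (veh_toy["hfuel"] == 2), "PHEV"),
    (is_alt & (veh_toy["hfuel"] == 3), "BEV"),
    (is_alt & (veh_toy["hfuel"] == 4), "HEV"),
]

# Start from an all-missing object column so unmatched rows stay missing.
veh_toy["fuel"] = pd.Series([None] * len(veh_toy), index=veh_toy.index, dtype="object")
for condition, label in rules:
    veh_toy.loc[condition, "fuel"] = label

print(veh_toy["fuel"].value_counts(dropna=False))
print("unclassified rows:", veh_toy["fuel"].isna().sum())
```

Keeping unmatched rows as real missing values lets `isna()` and `dropna()` find them later without comparing against a string.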

In [81]:
# merge
veh_cols = ['houseid', 'vehid', 'vehyear', 'vehage', 'make', 'model', 'vehtype', 'fuel', 'vehmiles']
hh_cols = ['houseid', 'homeown', 'hhsize', 'hhvehcnt', 'hhfaminc', 'drvrcnt', 'numadlt',
           'wrkcount', 'lif_cyc', 'urbrur', 'race', 'hh_cbsa', 'htppopdn']
data = pd.merge(veh[veh_cols], hh[hh_cols], on='houseid')
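
Because each household can own several vehicles, the merge is many-to-one on `houseid`. As a hedged sketch on toy data, pandas can check that assumption and report unmatched vehicles through the `validate` and `indicator` arguments of `pd.merge`. The frames `veh_toy` and `hh_toy` are made up, and `how="left"` is used here only so that unmatched vehicles remain visible; the notebook's own merge uses the default inner join.

```python
import pandas as pd

# Toy frames: household 1 owns two vehicles, household 3 owns none.
veh_toy = pd.DataFrame({"houseid": [1, 1, 2], "vehid": [1, 2, 1]})
hh_toy = pd.DataFrame({"houseid": [1, 2, 3], "hhsize": [4, 2, 1]})

# validate raises MergeError if houseid is duplicated in the household frame;
# indicator adds a _merge column recording where each row came from.
merged = pd.merge(
    veh_toy, hh_toy,
    on="houseid", how="left",
    validate="many_to_one", indicator=True,
)

# Vehicles with no household record would be flagged 'left_only'.
unmatched = (merged["_merge"] == "left_only").sum()
print(len(merged), "vehicle rows,", unmatched, "without a household match")
```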

In [82]:
print(data.shape)
data.head()

(52228, 21)


,houseid,vehid,vehyear,vehage,make,model,vehtype,fuel,vehmiles,homeown,...,hhvehcnt,hhfaminc,drvrcnt,numadlt,wrkcount,lif_cyc,urbrur,race,hh_cbsa,htppopdn
0,30027253,2,2012,5,49,49032,1,ICE-G,17000,1,...,4,6,4,4,1,2,1,3,31080,7000
1,30027253,3,2010,7,37,37031,1,ICE-G,10500,1,...,4,6,4,4,1,2,1,3,31080,7000
2,30027253,4,2007,10,49,49032,1,ICE-G,12500,1,...,4,6,4,4,1,2,1,3,31080,7000
3,30027253,1,2013,4,37,37402,3,ICE-G,12000,1,...,4,6,4,4,1,2,1,3,31080,7000
4,30027721,1,2015,2,37,37031,1,ICE-G,5000,2,...,2,11,2,2,2,2,1,6,41940,17000


## Unbalanced data

In [83]:
data.fuel.value_counts(dropna=False)

ICE-G    47686
ICE-D     1919
HEV       1866
BEV        326
PHEV       258
nan        173
Name: fuel, dtype: int64
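
To put numbers on the imbalance, the counts above can be turned into class shares and inverse-frequency weights. This is a minimal sketch: the counts are copied from the output above with the 173 unclassified rows left out, and the weight formula `n_samples / (n_classes * class_count)` is one common "balanced" heuristic, not a choice the analysis has settled on.

```python
import pandas as pd

# Class counts copied from the value_counts() output (unclassified rows excluded).
counts = pd.Series({"ICE-G": 47686, "ICE-D": 1919, "HEV": 1866, "BEV": 326, "PHEV": 258})

# Share of each fuel category among the classified vehicles.
shares = counts / counts.sum()

# Inverse-frequency weights: rare classes get proportionally larger weights.
weights = counts.sum() / (len(counts) * counts)

summary = pd.DataFrame({
    "count": counts,
    "share": shares.round(4),
    "weight": weights.round(2),
})
print(summary)
```

Weights of this kind can be passed to many classifiers as per-class or per-sample weights; resampling the minority classes is another option worth comparing.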

## Questions

* There are many NaN values ("I don't know", "I prefer not to answer", etc.). How should they be handled?
* Can unsupervised learning tools be applied here?
* What supervised ML tools should we use?
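
For the first question, one possible starting point is to convert non-response codes to real missing values and measure how much data would be lost by dropping them. The sketch below assumes that non-responses are stored as negative codes, which should be checked against the NHTS codebook. The toy values and the names `hh_toy`, `hh_clean` and `code_cols` are illustrative.

```python
import pandas as pd

# Toy household records; negative values stand in for non-response codes (assumption).
hh_toy = pd.DataFrame({
    "houseid": [1, 2, 3, 4],
    "hhfaminc": [6, -7, 11, -8],
    "homeown": [1, 2, -9, 1],
})

code_cols = ["hhfaminc", "homeown"]

# Keep non-negative codes and turn everything else into NaN.
hh_clean = hh_toy.copy()
hh_clean[code_cols] = hh_clean[code_cols].where(hh_clean[code_cols] >= 0)

# Share of missing values per column, and the rows that survive listwise deletion.
missing_share = hh_clean[code_cols].isna().mean()
complete = hh_clean.dropna(subset=code_cols)

print(missing_share)
print(len(complete), "of", len(hh_clean), "households have complete answers")
```

If the missing share is small, dropping incomplete rows may be acceptable; otherwise treating "no answer" as its own category for categorical variables is an alternative to consider.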