# Decode VIN


This notebook uses the truncated VIN (Vehicle Identification Number) in the MV104 vehicle dataset and NHTSA's VIN API to get vehicle body type information. 

The MV104 data from the DMV has a variable VEHBDYT_ID that refers to the vehicle body type, but it doesn't match what we see from NHTSA's decoding of the VIN.

In particular, the DMV has a vehicle type 'Suburban'. It is not immediately clear what that refers to. It appears to be any light-duty vehicle that is not a car (e.g. SUVs and pickups).

NHTSA's vehicle type is more descriptive than the DMV variable.

In [21]:
import pandas as pd
pd.options.display.max_columns = 130
pd.options.display.max_rows = 130


import databuild as db
import sys
sys.path.insert(0,'/home/deena/Documents/data_munge/ModaCode/')
import moda


In [2]:
# read in DMV MV104 data. There are 3 tables. 
crash,ind,veh = db.readDMV()

# we're interested in the vehicle table which has VIN
print(veh.shape)

The minimum supported version is 2.4.6



full crash table (522108, 26)
full person table (1502797, 22)
full vehicle table (1092922, 20)
(1092922, 20)


In [3]:
# Remove null vins
veh = veh[pd.notnull(veh['VIN'])].copy()

veh['CV_VEH_YEAR'] = veh['CV_VEH_YEAR'].astype(str)
veh['VIN'] = veh['VIN'].astype(str)

print(veh.shape)

(611697, 20)
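
Before sending VINs to the API, it may be worth checking that the non-null values actually look like truncated VINs. The following is a minimal sketch, assuming the truncated VINs are 11 characters long and relying on the fact that VINs never contain the letters I, O, or Q. The function name `flag_valid_truncated_vins` and the last two toy values are hypothetical and not part of the MV104 data.

```python
import pandas as pd

# Assumption: a truncated VIN is exactly 11 characters.
# VINs use digits and capital letters, excluding I, O and Q.
TRUNCATED_VIN_PATTERN = r'^[A-HJ-NPR-Z0-9]{11}$'


def flag_valid_truncated_vins(vins):
    """Return a boolean Series that is True where a value looks like
    an 11-character VIN prefix."""
    return vins.astype(str).str.upper().str.match(TRUNCATED_VIN_PATTERN)


# The first two values appear in this notebook; the last two are made up.
toy_vins = pd.Series(['2T1BR12E2WC', '2B4FP2532XR', 'UNKNOWN', '1GDJG31F82O'])
valid = flag_valid_truncated_vins(toy_vins)

# Keep only the rows that pass the check
clean_vins = toy_vins[valid]
```

On the real table the same mask could be used as `veh[flag_valid_truncated_vins(veh['VIN'])]`.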


In [11]:
veh.head()

Unnamed: 0,CS_ID,CV_ID,VEHBDYT_ID,REGT_CDE,PACCACTT_ID,TBCT_DMV_CDE,DIRCTT_CDE,FT_ID,CV_VEH_YEAR,CV_REG_STATE_CDE,CV_WEIGHT_LBS,CV_PSGR_NUM,CV_CYLNDR,CV_VEHMAKE_DESCR,CFT_CDE1,CFT_CDE2,SHZMTT_ID,VEH_EVNTT_ID,VIN,DMV_VIN_NUM
0,32045885,5979599,6.0,16,-1,-3,-1,1,1998,NY,,-1,4,TOYOT,,,,,2T1BR12E2WC,
2,32045886,5979601,5.0,16,1,-3,-1,1,1999,NY,,2,6,DODGE,,,,,2B4FP2532XR,
3,32045887,5979602,60.0,56,1,-3,-1,2,2002,NY,,-1,8,GMC,,,,,1GDJG31F821,
6,32046208,5980144,5.0,29,10,-3,-1,1,1996,NY,,-1,4,SUBAR,,,,,4S3BK6657T7,
7,32046209,5980145,5.0,16,1,-3,W,1,1997,NY,,1,8,MERCU,-1.0,,,,4M2DU55P6VU,


In [5]:
# Take a random sample of vehicles
veh_sample = veh.sample(n=5000)

# collect one single-row dataframe per API call
vin_rows = []

nhtsaURL1 = 'https://vpic.nhtsa.dot.gov/api/vehicles/decodevinvalues/'
nhtsaURL2 = '*BA?format=json&modelyear='


for index, row in veh_sample.iterrows():
    vin = row['VIN']
    year = row['CV_VEH_YEAR']
    vin_data = pd.read_json(nhtsaURL1+vin+nhtsaURL2+year)
    vin_norm = pd.DataFrame(vin_data['Results'][0],index=[0])
    vin_rows.append(vin_norm)

# create the dataframe of vehicle information obtained from the
# NHTSA vehicle api with one concat instead of one concat per row
vin_api_data = pd.concat(vin_rows, axis=0)

print(vin_api_data.shape)

(5000, 146)
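
With 5000 network calls, a single failed request stops the whole loop. The sketch below shows one way the URL construction and the request could be separated so that a failure returns `None` instead of raising. The helper names `build_decode_url` and `decode_vin`, and the `fetch` parameter, are hypothetical; only the URL pieces come from the cell above.

```python
import pandas as pd

NHTSA_URL = 'https://vpic.nhtsa.dot.gov/api/vehicles/decodevinvalues/'
VIN_SUFFIX = '*BA'  # suffix appended to the truncated VIN in this notebook


def build_decode_url(vin, year):
    """Build the same request URL as nhtsaURL1 + vin + nhtsaURL2 + year."""
    return '{0}{1}{2}?format=json&modelyear={3}'.format(
        NHTSA_URL, vin, VIN_SUFFIX, year)


def decode_vin(vin, year, fetch=pd.read_json):
    """Return a one-row dataframe of API results, or None if the call fails.

    `fetch` defaults to pd.read_json; it is a parameter only so the
    function can be exercised without network access.
    """
    try:
        vin_data = fetch(build_decode_url(vin, year))
    except (IOError, ValueError):
        # network errors and malformed JSON are skipped, not fatal
        return None
    return pd.DataFrame(vin_data['Results'][0], index=[0])


url = build_decode_url('2T1BR12E2WC', '1998')
```

Inside the loop, rows for which `decode_vin` returns `None` could simply be skipped before the final `pd.concat`.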


In [6]:
# join back to the vehicle sample
# Drop the last 3 characters of the API result's VIN (the '*BA' suffix
# added to the request), leaving the 11-character truncated VIN
vin_api_data['VIN'] = vin_api_data['VIN'].str[:-3]

# Use that 11-character VIN to join with the original vehicle table
veh_sample = veh_sample.merge(vin_api_data,how='left',on='VIN')
veh_sample.head()

Unnamed: 0,CS_ID,CV_ID,VEHBDYT_ID,REGT_CDE,PACCACTT_ID,TBCT_DMV_CDE,DIRCTT_CDE,FT_ID,CV_VEH_YEAR,CV_REG_STATE_CDE,...,Turbo,ValveTrainDesign,VehicleType,WheelBaseLong,WheelBaseShort,WheelBaseType,WheelSizeFront,WheelSizeRear,Wheels,Windows
0,34733611,11036207,5.0,16,2,-3,E,1,2001,NY,...,,,TRUCK,,,,,,,
1,33385538,8505537,6.0,16,14,-3,E,1,2006,NY,...,,,PASSENGER CAR,,,,,,,
2,32633044,7349171,6.0,16,11,-3,W,1,2001,NY,...,,Single Overhead Cam (SOHC),PASSENGER CAR,,,,,,,
3,32537313,6913053,5.0,16,1,-3,W,1,1999,NY,...,,Dual Overhead Cam (DOHC),MULTIPURPOSE PASSENGER VEHICLE (MPV),,,,,,,
4,34630872,10851608,5.0,16,7,-3,W,1,2000,NY,...,,,MULTIPURPOSE PASSENGER VEHICLE (MPV),,,,,,,
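
One possible caveat with this join: because the VIN is truncated, several sampled vehicles can share the same 11-character value. Each of them then matches every API row with that VIN, so the merged table can end up with more rows than the sample. The toy example below (made-up VINs and IDs, not MV104 data) illustrates the effect and one way to avoid it, by keeping a single API row per VIN before merging.

```python
import pandas as pd

# Toy stand-ins for veh_sample and vin_api_data; all values are made up
left = pd.DataFrame({'CV_ID': [1, 2, 3],
                     'VIN': ['AAA', 'AAA', 'BBB']})
api = pd.DataFrame({'VIN': ['AAA', 'AAA', 'BBB'],
                    'VehicleType': ['TRUCK', 'TRUCK', 'PASSENGER CAR']})

# Naive join: the two 'AAA' vehicles each match both 'AAA' API rows
naive = left.merge(api, how='left', on='VIN')

# Keep one API row per VIN so that every vehicle matches at most once
api_unique = api.drop_duplicates(subset='VIN')
deduped = left.merge(api_unique, how='left', on='VIN')
```

Comparing `len(veh_sample)` before and after the merge is a quick way to see whether this matters for a given sample.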


The VIN API has two variables that refer to the type of vehicle: VehicleType and BodyClass.

Here's how they relate to each other:

In [7]:
veh_sample.groupby(['VehicleType','BodyClass'])[['VIN']].count()

VehicleType,BodyClass,VIN
,,19
BUS,,1
BUS,Bus,38
BUS,Bus - School Bus,2
BUS,Incomplete - School Bus Chassis,2
BUS,Truck,8
BUS,Van,15
BUS,Wagon,22
INCOMPLETE VEHICLE,,1
INCOMPLETE VEHICLE,Cargo Van,2


The DMV vehicle file has a column called 'VEHBDYT_ID' which contains the DMV encoding of the body type.

First, let's make the variable more descriptive by replacing the numeric codes with their descriptions from the data dictionary.

Then let's compare to the NHTSA body type.

In [17]:
veh_sample = db.formatVehTypeDMV(veh_sample)
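
The body of `db.formatVehTypeDMV` is not shown in this notebook. As a rough sketch of what such a recoding step might look like, the snippet below maps the numeric codes to labels with a dictionary. The function name `label_dmv_body_type` is hypothetical, and the only code-to-label pair taken from this notebook is 5 = 'Suburban'; the remaining entries would have to come from the DMV data dictionary.

```python
import pandas as pd

# Only 5 -> 'Suburban' is stated in this notebook. The other codes are
# deliberately left out and would be filled in from the data dictionary.
DMV_BODY_TYPES = {5.0: 'Suburban'}


def label_dmv_body_type(df, code_col='VEHBDYT_ID', out_col='DMV_VehType'):
    """Return a copy of df with a descriptive body type column.

    Codes missing from DMV_BODY_TYPES become NaN.
    """
    out = df.copy()
    out[out_col] = out[code_col].map(DMV_BODY_TYPES)
    return out


demo = label_dmv_body_type(pd.DataFrame({'VEHBDYT_ID': [5.0, 6.0, None]}))
```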

In [18]:
# compare counts of DMV vehicle type with NHTSA's VIN decoding 'VehicleType'

veh_sample.groupby(['DMV_VehType','VehicleType'])[['VIN']].count()

DMV_VehType,VehicleType,VIN
2 Door Sedan,,1
2 Door Sedan,MULTIPURPOSE PASSENGER VEHICLE (MPV),2
2 Door Sedan,PASSENGER CAR,208
4 Door Sedan,,4
4 Door Sedan,INCOMPLETE VEHICLE,1
4 Door Sedan,MULTIPURPOSE PASSENGER VEHICLE (MPV),59
4 Door Sedan,PASSENGER CAR,2832
Ambulance,INCOMPLETE VEHICLE,11
Bus (Omnibus),BUS,53
Bus (Omnibus),INCOMPLETE VEHICLE,22


In [22]:
# compare counts of DMV vehicle type with NHTSA's VIN decoding 'BodyClass'


veh_sample.groupby(['DMV_VehType','BodyClass'])[['VIN']].count()

DMV_VehType,BodyClass,VIN
2 Door Sedan,,5
2 Door Sedan,Cabriolet/Convertible,6
2 Door Sedan,Coupe,137
2 Door Sedan,Hatchback/Liftback/Notchback,45
2 Door Sedan,Sedan/Saloon,16
2 Door Sedan,Sport Utility Vehicle (SUV)/Multi Purpose Vehicle (MPV),1
2 Door Sedan,Van,1
4 Door Sedan,,15
4 Door Sedan,Coupe,6
4 Door Sedan,Hatchback/Liftback/Notchback,53


VEHBDYT_ID = 5 ("Suburban") matches primarily to SUVs, but almost as many are vans and wagons.
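
Shares are easier to compare than raw counts for a statement like this. A minimal sketch, using made-up rows rather than the actual sample, of how a row-normalized cross-tabulation gives the fraction of each DMV type that falls in each NHTSA body class:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the MV104 sample
toy = pd.DataFrame({
    'DMV_VehType': ['Suburban', 'Suburban', 'Suburban', '4 Door Sedan'],
    'BodyClass': ['Sport Utility Vehicle (SUV)/Multi Purpose Vehicle (MPV)',
                  'Van', 'Wagon', 'Sedan/Saloon'],
})

# normalize='index' divides each row by its total, so every row sums to 1
shares = pd.crosstab(toy['DMV_VehType'], toy['BodyClass'], normalize='index')

suburban_shares = shares.loc['Suburban']
```

On the real data, `shares.loc['Suburban'].sort_values(ascending=False)` would list the body classes that make up the 'Suburban' category, largest first.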
