# Table of Contents 
1. [Research questions](#id0)
2. [Import Packages](#id1)
3. [Load the data](#id2)
4. [Test Train Split for Total](#id3)
5. [Test Train Split for Rural](#id4)
6. [Test Train Split for Urban](#id5)
7. [Saving progress](#id6)

# Preprocessing and Training Data Development for Capstone 2
This notebook handles the preprocessing and development of the training data for the FARS dataset.

<a id="id0"></a>

## Research questions

As rural areas of the US are transformed into non-rural areas, which factors most strongly predict fatal crashes in each context for the year under consideration, and what can therefore be expected as this transformation continues? Will there be more fatal crashes if rural areas become urban? Do the same factors correlate with fatalities in rural and urban areas?


<a id="id1"></a>

## Import packages

In [33]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
import ppscore as pps

<a id="id2"></a>

## Load the data

In [34]:
# Load the rural, urban, and total datasets separately; each one gets its own train/test split.

rural = pd.read_csv('../data/processed/new_rural.csv')
urban = pd.read_csv('../data/processed/new_urban.csv')
total = pd.read_csv('../data/processed/new_total.csv')

In [35]:
rural.head(5)

| | Unnamed: 0 | Unnamed: 0.1 | STATE | STATENAME | ST_CASE | VE_TOTAL | VE_FORMS | PVH_INVL | PEDS | PERSONS | ... | FATALS | DRUNK_DR | F_RATIO | REGION | WEEK_END | DAY_NIGHT | DN_Avg | WEEK_END_Avg | State_Avg | Day_Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 1 | Alabama | 10003 | 3 | 3 | 0 | 0 | 4 | ... | 1 | 0 | 25.0 | South | Weekday | Night_Time | 1.085287 | 1.078926 | 1.086449 | 1.079063 |
| 1 | 3 | 3 | 1 | Alabama | 10004 | 1 | 1 | 0 | 1 | 1 | ... | 1 | 0 | 50.0 | South | Weekday | Day_Time | 1.086942 | 1.078926 | 1.086449 | 1.079063 |
| 2 | 5 | 5 | 1 | Alabama | 10006 | 2 | 2 | 0 | 0 | 2 | ... | 1 | 0 | 50.0 | South | Weekday | Night_Time | 1.085287 | 1.078926 | 1.086449 | 1.073489 |
| 3 | 6 | 6 | 1 | Alabama | 10007 | 1 | 1 | 0 | 0 | 5 | ... | 1 | 0 | 20.0 | South | Weekday | Night_Time | 1.085287 | 1.078926 | 1.086449 | 1.082433 |
| 4 | 7 | 7 | 1 | Alabama | 10008 | 1 | 1 | 0 | 0 | 1 | ... | 1 | 1 | 100.0 | South | Weekend | Night_Time | 1.085287 | 1.100046 | 1.086449 | 1.09961 |


In [36]:
urban.head(5)

| | Unnamed: 0 | Unnamed: 0.1 | STATE | STATENAME | ST_CASE | VE_TOTAL | VE_FORMS | PVH_INVL | PEDS | PERSONS | ... | FATALS | DRUNK_DR | F_RATIO | REGION | WEEK_END | DAY_NIGHT | DN_Avg | WEEK_END_Avg | State_Avg | Day_Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | Alabama | 10001 | 2 | 2 | 0 | 0 | 3 | ... | 1 | 1 | 33.33 | South | Weekday | Night_Time | 1.085287 | 1.078926 | 1.086449 | 1.073477 |
| 1 | 1 | 1 | 1 | Alabama | 10002 | 2 | 2 | 0 | 0 | 2 | ... | 1 | 0 | 50.0 | South | Weekday | Night_Time | 1.085287 | 1.078926 | 1.086449 | 1.082433 |
| 2 | 4 | 4 | 1 | Alabama | 10005 | 1 | 1 | 0 | 0 | 1 | ... | 1 | 1 | 100.0 | South | Weekday | Day_Time | 1.086942 | 1.078926 | 1.086449 | 1.085082 |
| 3 | 8 | 8 | 1 | Alabama | 10009 | 1 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 100.0 | South | Weekday | Day_Time | 1.086942 | 1.078926 | 1.086449 | 1.085082 |
| 4 | 9 | 9 | 1 | Alabama | 10010 | 1 | 1 | 0 | 1 | 1 | ... | 1 | 0 | 50.0 | South | Weekday | Day_Time | 1.086942 | 1.078926 | 1.086449 | 1.073477 |


In [37]:
total.head(5)

| | Unnamed: 0 | Unnamed: 0.1 | STATE | STATENAME | ST_CASE | VE_TOTAL | VE_FORMS | PVH_INVL | PEDS | PERSONS | ... | FATALS | DRUNK_DR | F_RATIO | REGION | WEEK_END | DAY_NIGHT | DN_Avg | WEEK_END_Avg | State_Avg | Day_Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | Alabama | 10001 | 2 | 2 | 0 | 0 | 3 | ... | 1 | 1 | 33.33 | South | Weekday | Night_Time | 1.085287 | 1.078926 | 1.086449 | 1.073477 |
| 1 | 1 | 1 | 1 | Alabama | 10002 | 2 | 2 | 0 | 0 | 2 | ... | 1 | 0 | 50.0 | South | Weekday | Night_Time | 1.085287 | 1.078926 | 1.086449 | 1.082433 |
| 2 | 2 | 2 | 1 | Alabama | 10003 | 3 | 3 | 0 | 0 | 4 | ... | 1 | 0 | 25.0 | South | Weekday | Night_Time | 1.085287 | 1.078926 | 1.086449 | 1.079063 |
| 3 | 3 | 3 | 1 | Alabama | 10004 | 1 | 1 | 0 | 1 | 1 | ... | 1 | 0 | 50.0 | South | Weekday | Day_Time | 1.086942 | 1.078926 | 1.086449 | 1.079063 |
| 4 | 4 | 4 | 1 | Alabama | 10005 | 1 | 1 | 0 | 0 | 1 | ... | 1 | 1 | 100.0 | South | Weekday | Day_Time | 1.086942 | 1.078926 | 1.086449 | 1.085082 |


In [38]:
print("rural shape is", rural.shape, 'urban shape is', urban.shape, 'total shape is', total.shape)

rural shape is (14615, 32) urban shape is (18734, 32) total shape is (33457, 32)
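
The previews above include `Unnamed: 0` and `Unnamed: 0.1` columns, which usually appear when a DataFrame is saved to CSV together with its index and then read back. Left in place, they are treated as numeric features in the later steps. The following is a minimal sketch of one way to drop such columns; the small in-memory CSV is made up for illustration and is not taken from the FARS files.

```python
import io
import pandas as pd

# Hypothetical three-row CSV that mimics the leftover index columns seen above.
csv_text = """Unnamed: 0,Unnamed: 0.1,STATE,FATALS,F_RATIO
2,2,1,1,25.0
3,3,1,1,50.0
5,5,1,1,50.0
"""
demo = pd.read_csv(io.StringIO(csv_text))

# Keep every column whose name does not start with "Unnamed".
demo_clean = demo.loc[:, ~demo.columns.str.startswith("Unnamed")]
print(demo_clean.columns.tolist())
```

Whether these columns should be dropped in this project depends on how the processed files were written, so treat this as an option to consider rather than a required step.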


<a id="id3"></a>

## Test train split for Total


In [39]:
### Sorting out the categorical variables
dfo = total.select_dtypes(include=['object']) # select object type columns
total_ready = pd.concat([total.drop(dfo, axis=1), pd.get_dummies(dfo)], axis=1) ### concatenating the dummy filled columns
total_ready.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'STATE', 'ST_CASE', 'VE_TOTAL',
       'VE_FORMS', 'PVH_INVL', 'PEDS', 'PERSONS', 'DAY',
       ...
       'WEATHERNAME_Sleet or Hail', 'WEATHERNAME_Snow', 'REGION_Midwest',
       'REGION_Northeast', 'REGION_South', 'REGION_West', 'WEEK_END_Weekday',
       'WEEK_END_Weekend', 'DAY_NIGHT_Day_Time', 'DAY_NIGHT_Night_Time'],
      dtype='object', length=205)

In [40]:
### Sorting out the categorical variables for rural
dfo = rural.select_dtypes(include=['object']) # select object type columns
rural_ready = pd.concat([rural.drop(dfo, axis=1), pd.get_dummies(dfo)], axis=1) ### concatenating the dummy filled columns
rural_ready.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'STATE', 'ST_CASE', 'VE_TOTAL',
       'VE_FORMS', 'PVH_INVL', 'PEDS', 'PERSONS', 'DAY',
       ...
       'WEATHERNAME_Sleet or Hail', 'WEATHERNAME_Snow', 'REGION_Midwest',
       'REGION_Northeast', 'REGION_South', 'REGION_West', 'WEEK_END_Weekday',
       'WEEK_END_Weekend', 'DAY_NIGHT_Day_Time', 'DAY_NIGHT_Night_Time'],
      dtype='object', length=198)

In [41]:
### Sorting out the categorical variables for urban
dfo = urban.select_dtypes(include=['object']) # select object type columns
urban_ready = pd.concat([urban.drop(dfo, axis=1), pd.get_dummies(dfo)], axis=1) ### concatenating the dummy filled columns
urban_ready.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'STATE', 'ST_CASE', 'VE_TOTAL',
       'VE_FORMS', 'PVH_INVL', 'PEDS', 'PERSONS', 'DAY',
       ...
       'WEATHERNAME_Sleet or Hail', 'WEATHERNAME_Snow', 'REGION_Midwest',
       'REGION_Northeast', 'REGION_South', 'REGION_West', 'WEEK_END_Weekday',
       'WEEK_END_Weekend', 'DAY_NIGHT_Day_Time', 'DAY_NIGHT_Night_Time'],
      dtype='object', length=199)
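
The three encoded frames have different widths (205, 198 and 199 columns), most likely because `pd.get_dummies` only creates columns for the categories present in each subset. That is fine when each subset gets its own model, but the columns would need to match if a model fit on one subset were ever applied to another. A minimal sketch of aligning two encoded frames with `reindex`, using toy frames whose names and values are purely illustrative:

```python
import pandas as pd

# Toy stand-ins for two subsets; values are illustrative only.
rural_demo = pd.DataFrame({"PERSONS": [4, 1], "WEATHERNAME": ["Clear", "Snow"]})
urban_demo = pd.DataFrame({"PERSONS": [3, 2], "WEATHERNAME": ["Clear", "Rain"]})

rural_enc = pd.get_dummies(rural_demo)
urban_enc = pd.get_dummies(urban_demo)

# Union of all columns seen in either subset, in a stable order.
all_cols = sorted(set(rural_enc.columns) | set(urban_enc.columns))

# Add the missing dummy columns as zeros so both frames share one layout.
rural_aligned = rural_enc.reindex(columns=all_cols, fill_value=0)
urban_aligned = urban_enc.reindex(columns=all_cols, fill_value=0)

print(rural_aligned.columns.tolist())
```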

In [42]:
### Test Train Split for total
X = total_ready.loc[:, total_ready.columns!='F_RATIO']
y = total_ready.F_RATIO

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

In [43]:
# For Total, fit the scaler on the training data only, then apply it to both the training and test data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)

In [44]:
# For Total, check the shapes of X_train, y_train, X_test and y_test to confirm the 70/30 split.
print('Total accident training data', X_train_scaled.shape, 'Total accident test data', X_test_scaled.shape, "\n", 'total fatality training data', y_train.shape, 'total fatality testing data', y_test.shape)

Total accident training data (23419, 204) Total accident test data (10038, 204) 
 total fatality training data (23419,) total fatality testing data (10038,)
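
The printed sizes can be checked by hand. As far as I understand `train_test_split`, when only a fractional `test_size` is given it rounds the test count up and assigns the remaining rows to training. The sketch below applies that arithmetic to the row counts reported earlier; `expected_split` is a hypothetical helper, not part of scikit-learn.

```python
import math

def expected_split(n_rows, test_size=0.3):
    """Return (n_train, n_test) assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

# Row counts taken from the shapes printed after loading the data.
for name, n_rows in [("total", 33457), ("rural", 14615), ("urban", 18734)]:
    print(name, expected_split(n_rows))
```

For the total dataset this gives 23419 training rows and 10038 test rows, matching the output above.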


<a id="id4"></a>

## Test train split for rural


In [45]:
### Test Train Split for Rural
X_R = rural_ready.loc[:, rural_ready.columns!='F_RATIO']
y_R = rural_ready.F_RATIO

X_train_R, X_test_R, y_train_R, y_test_R = train_test_split(X_R, y_R, test_size=.3, random_state=42)

In [46]:
# For Rural, fit the scaler on the training data only, then apply it to both the training and test data
scaler = preprocessing.StandardScaler().fit(X_train_R)
X_train_R_scaled=scaler.transform(X_train_R)
X_test_R_scaled=scaler.transform(X_test_R)

In [47]:
# For Rural, check the shapes of X_train, y_train, X_test and y_test to confirm the 70/30 split.
print('Rural accident training data', X_train_R_scaled.shape, 'Rural accident test data', X_test_R_scaled.shape, "\n", 'Rural fatality training data', y_train_R.shape, 'Rural fatality testing data', y_test_R.shape)



Rural accident training data (10230, 197) Rural accident test data (4385, 197) 
 Rural fatality training data (10230,) Rural fatality testing data (4385,)


<a id="id5"></a>

## Test train split for urban


In [48]:
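### Test Train Split for Urban
# Exclude the target F_RATIO from the features.
# Note: the shapes recorded below (199 feature columns) come from a run that filtered on 'Rural',
# which matched no column and left F_RATIO in X_U; with this filter the width should be 198.
X_U = urban_ready.loc[:, urban_ready.columns!='F_RATIO']
y_U = urban_ready.F_RATIO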

X_train_U, X_test_U, y_train_U, y_test_U = train_test_split(X_U, y_U, test_size=.3, random_state=42)

In [49]:
# For Urban, fit the scaler on the training data only, then apply it to both the training and test data
scaler = preprocessing.StandardScaler().fit(X_train_U)
X_train_U_scaled=scaler.transform(X_train_U)
X_test_U_scaled=scaler.transform(X_test_U)

In [50]:
# For Urban, check the shapes of X_train, y_train, X_test and y_test to confirm the 70/30 split.
print('Urban accident training data', X_train_U_scaled.shape, 'Urban accident test data', X_test_U_scaled.shape, "\n", 'Urban fatality training data', y_train_U.shape, 'Urban fatality testing data', y_test_U.shape)



Urban accident training data (13113, 199) Urban accident test data (5621, 199) 
 Urban fatality training data (13113,) Urban fatality testing data (5621,)
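
A boolean column filter compared against a name that does not exist removes nothing and raises no error, so a mistyped target name can silently leave the target among the features. The sketch below shows that behaviour on a toy frame and one possible guard; `split_features_target` and the demo values are hypothetical and only meant as an illustration.

```python
import pandas as pd

def split_features_target(df, target):
    """Return (X, y), failing loudly if the target column is missing."""
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found")
    X = df.drop(columns=[target])
    y = df[target]
    # Sanity check: the target must not remain among the features.
    assert target not in X.columns
    return X, y

demo = pd.DataFrame(
    {"PERSONS": [3, 2, 1], "FATALS": [1, 1, 1], "F_RATIO": [33.33, 50.0, 100.0]}
)

# A mask built from a misspelled name keeps every column, target included.
leaky = demo.loc[:, demo.columns != "F_RATI0"]
print(leaky.shape)

X_demo, y_demo = split_features_target(demo, "F_RATIO")
print(X_demo.columns.tolist())
```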


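Each dataset is first split into training and test sets. The scaler is then fit on the training set only and applied to both sets, so that no information from the test set leaks into the preprocessing.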
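
Since the same split-then-scale steps are repeated for each dataset, they could be wrapped in one function. The following is a minimal sketch under that assumption; `split_and_scale` is a hypothetical helper, and the random demo frame only borrows column names from the data for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_scale(df, target="F_RATIO", test_size=0.3, random_state=42):
    """Split first, then fit the scaler on the training features only."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler().fit(X_train)  # test rows never influence the fit
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test, scaler

# Made-up demo data; not drawn from the FARS files.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "PERSONS": rng.integers(1, 6, size=20),
    "DRUNK_DR": rng.integers(0, 2, size=20),
    "F_RATIO": rng.uniform(20, 100, size=20),
})

X_tr, X_te, y_tr, y_te, fitted = split_and_scale(demo)
print(X_tr.shape, X_te.shape)
```

Returning the fitted scaler keeps it available for transforming new data later with the same training-set statistics.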