# Project Title

**Authors:** Carlos McCrum, Micheal Lee, Doug Mill
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import plot_confusion_matrix, roc_auc_score, plot_roc_curve
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, StandardScaler, label_binarize
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import CategoricalNB

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [21]:
df1 = pd.read_csv('data/cleaned_crash_data.csv')

In [109]:
df2 = pd.read_csv('data/Traffic_Crashes_Vehicle.csv')

In [110]:
df3 = pd.read_csv('data/Traffic_Crashes_People.csv')

In [39]:
df_crash = df1.copy()
df_vehicle = df2.copy()
df_people = df3.copy()

***
### Cleaning People

In [None]:
df_people.drop(['PERSON_ID', 'PERSON_TYPE', 'RD_NO', 'VEHICLE_ID', 'SEAT_NO', 'CITY', 'STATE', 'ZIPCODE', 'SEX', 
         'DRIVERS_LICENSE_STATE', 'DRIVERS_LICENSE_CLASS', 'SAFETY_EQUIPMENT', 'AIRBAG_DEPLOYED', 'EJECTION', 
         'INJURY_CLASSIFICATION', 'HOSPITAL', 'EMS_AGENCY', 'EMS_RUN_NO', 'PEDPEDAL_ACTION', 'PEDPEDAL_VISIBILITY', 
         'PEDPEDAL_LOCATION', 'BAC_RESULT', 'BAC_RESULT VALUE', 'CELL_PHONE_USE'], axis=1, inplace=True)
df_people.columns

In [None]:
# Drop rows with missing values in any of the columns used downstream
df_people.dropna(subset=['AGE', 'DRIVER_ACTION', 'DRIVER_VISION', 'PHYSICAL_CONDITION'], inplace=True)

In [None]:
# Remove rows where any of these three columns is recorded as 'UNKNOWN'
df_people = df_people[df_people['DRIVER_VISION'] != 'UNKNOWN']
df_people = df_people[df_people['DRIVER_ACTION'] != 'UNKNOWN']
df_people = df_people[df_people['PHYSICAL_CONDITION'] != 'UNKNOWN']

In [None]:
# Clean and bin the AGE column. 15 is the youngest age at which someone can legally drive
# in Illinois (with a learner's permit), so ages 10-14 go into a separate 'Underage' bin
# and ages 9 and under are dropped.
df_people = df_people[df_people['AGE'] >= 10].copy()  # .copy() avoids SettingWithCopyWarning below
bins = [9, 14, 19, 29, 39, 49, 59, 69, np.inf]
names = ['Underage 10-14', '15-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']
df_people['AGE_RANGES'] = pd.cut(df_people['AGE'], bins, labels=names)
print(df_people['AGE_RANGES'].value_counts())
# df_people.drop('AGE', axis=1, inplace=True)
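
The bin edges above rely on `pd.cut` treating each interval as open on the left and closed on the right, which is its default behavior. The sketch below checks that on a handful of made-up ages placed on the edges; the `ages` series is illustrative and not taken from the crash data.

```python
import numpy as np
import pandas as pd

# Toy ages sitting on the bin edges (illustrative values, not from the dataset)
ages = pd.Series([10, 14, 15, 19, 20, 69, 70, 95])

bins = [9, 14, 19, 29, 39, 49, 59, 69, np.inf]
names = ['Underage 10-14', '15-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']

# right=True (the default) makes every interval (left, right],
# so 14 falls in 'Underage 10-14' and 15 falls in '15-19'
age_ranges = pd.cut(ages, bins, labels=names)
print(pd.DataFrame({'AGE': ages, 'AGE_RANGES': age_ranges}))
```

Starting the first edge at 9 rather than 10 is what keeps age 10 inside the first bin.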

In [None]:
# Value counts for ages 10-14 (the 'Underage' bin); ages 9 and under were removed above.
df_people[df_people['AGE_RANGES']=='Underage 10-14']['AGE'].value_counts()

In [None]:
df_people.info()

***
### Cleaning Vehicles

In [None]:
df_vehicle.info()

In [None]:
# Keep only the columns of interest; .copy() so later in-place edits do not act on a view
df_vehicle = df_vehicle[['CRASH_RECORD_ID', 'NUM_PASSENGERS', 'MAKE', 'MODEL', 'VEHICLE_DEFECT']].copy()

In [None]:
df_vehicle.dropna(subset=['MODEL', 'MAKE', 'NUM_PASSENGERS'], inplace=True)
df_vehicle.info()

In [None]:
main_df = df_crash.merge(df_vehicle, on='CRASH_RECORD_ID', how='inner').merge(df_people, on='CRASH_RECORD_ID', how='inner')
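
Because a single crash can have several vehicle rows and several people rows, chaining two inner merges on `CRASH_RECORD_ID` pairs every vehicle with every person from the same crash. The sketch below uses small hypothetical tables (the values are made up) to show how the row count grows, and how the `validate` argument of `merge` can guard the expected key relationship.

```python
import pandas as pd

# Hypothetical toy tables: one crash with two vehicles and three people
crash = pd.DataFrame({'CRASH_RECORD_ID': ['a'], 'Target1': [0]})
vehicle = pd.DataFrame({'CRASH_RECORD_ID': ['a', 'a'], 'MAKE': ['FORD', 'HONDA']})
people = pd.DataFrame({'CRASH_RECORD_ID': ['a', 'a', 'a'], 'AGE': [25, 40, 63]})

merged = (crash
          .merge(vehicle, on='CRASH_RECORD_ID', how='inner')
          .merge(people, on='CRASH_RECORD_ID', how='inner'))

# Each vehicle row is paired with each person row: 1 * 2 * 3 rows
print(len(merged))

# validate= raises pandas.errors.MergeError if the left keys are not unique
checked = crash.merge(vehicle, on='CRASH_RECORD_ID', how='inner', validate='one_to_many')
```

Comparing `len(main_df)` with the row counts of the three inputs is a quick way to see whether the same effect is present in the real data.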

In [None]:
main_df.info()

In [None]:
main_df['DRIVER_ACTION'].value_counts()

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
df_crash['Target1'].value_counts(normalize = True)

In [None]:
dt = DecisionTreeClassifier(random_state = 1)

X = df_crash.drop(['Target1', 'CRASH_RECORD_ID'], axis = 1)
y = df_crash['Target1']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

dt.fit(X_train, y_train)
dt.score(X_test, y_test)

In [None]:
plot_confusion_matrix(dt, X_test, y_test)
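
The class proportions printed earlier give a natural baseline: a model that always predicts the most common class. A minimal sketch of that comparison follows, using scikit-learn's `DummyClassifier` on synthetic data; `X_demo` and `y_demo` are placeholders invented for the example, not the crash features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a 75% / 25% class split (assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 150 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print('Baseline accuracy: {:.3f}'.format(baseline_acc))
```

A decision tree score is only informative relative to this number; on imbalanced targets the baseline can already look high.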

In [None]:
bayes = CategoricalNB()
ohe = OneHotEncoder()
# LinearSVC is a linear support vector classifier, not logistic regression, so name it accordingly
svc = LinearSVC()
rf = RandomForestClassifier()
ovr = OneVsRestClassifier(svc)

In [None]:
main_df.info()

In [None]:
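dt = DecisionTreeClassifier(random_state=1, max_depth=100)
X = main_df.drop(['Target1', 'CRASH_RECORD_ID'], axis=1)
y = main_df.Target1

# Split first, then fit the encoder on the training rows only so no test information leaks in
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
ohe = OneHotEncoder(handle_unknown='ignore')  # categories unseen in training encode as all zeros
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)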

In [None]:
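ovr.fit(X_train, y_train)
dt.fit(X_train, y_train)
# Report accuracy on both splits; a large train/test gap points to overfitting
print('One vs Rest train/test accuracy: {:.3f} / {:.3f}'.format(ovr.score(X_train, y_train), ovr.score(X_test, y_test)))
print('Decision Tree train/test accuracy: {:.3f} / {:.3f}'.format(dt.score(X_train, y_train), dt.score(X_test, y_test)))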
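
One way to keep encoding and modeling together is to wrap them in a single pipeline, so the encoder is always fit on training rows only. The sketch below is one possible arrangement using `make_column_transformer`; the `WEATHER` column and all values are hypothetical, and only `NUM_PASSENGERS` and `Target1` are names borrowed from this notebook.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up frame; WEATHER is a hypothetical categorical column
toy = pd.DataFrame({
    'WEATHER': ['CLEAR', 'RAIN', 'CLEAR', 'SNOW', 'RAIN', 'CLEAR'],
    'NUM_PASSENGERS': [1, 2, 1, 3, 1, 2],
    'Target1': [0, 1, 0, 1, 1, 0],
})
X_toy = toy.drop('Target1', axis=1)
y_toy = toy['Target1']

# One-hot encode the categorical column, pass numeric columns through unchanged
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['WEATHER']),
    remainder='passthrough',
)
model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=1))

# Fit on the first four rows, predict the remaining two
model.fit(X_toy.iloc[:4], y_toy.iloc[:4])
preds = model.predict(X_toy.iloc[4:])
print(preds)
```

With this layout numeric columns such as `NUM_PASSENGERS` are not split into one indicator per distinct value, which is what happens when the whole frame is passed to `OneHotEncoder`.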

In [None]:
plot_confusion_matrix(dt, X_test, y_test);

In [None]:
# train_pred = ovr.predict(X_train)
# label_binarize expects the list of class labels, not the number of rows
# train_pred = label_binarize(train_pred, classes=np.unique(main_df.Target1))
# y_test = label_binarize(y_test, classes=np.unique(main_df.Target1))


In [None]:
# roc_auc_score(y_train, train_pred, multi_class='ovo', average='macro')
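
For a multiclass target, `roc_auc_score` with `multi_class='ovo'` is designed to take per-class probability estimates rather than hard predictions, so no manual binarizing is needed. The sketch below shows the call on synthetic three-class data; the dataset and the shallow tree are assumptions made for the example, not the models fitted above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the real target
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)

# predict_proba returns one column per class; each row sums to 1
proba = clf.predict_proba(X_te)
auc = roc_auc_score(y_te, proba, multi_class='ovo', average='macro')
print('Macro one-vs-one ROC AUC: {:.3f}'.format(auc))
```

`LinearSVC` has no `predict_proba`, so applying the same idea to the one-vs-rest model would need a classifier that exposes probabilities or a calibration step.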

In [None]:
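X2 = main_df.drop(['Target1', 'CRASH_RECORD_ID', 'MAKE', 'MODEL', 'CRASH_DATE', 'AGE'], axis=1)
y2 = main_df.Target1

# Split first, then fit the encoder on the training rows only so no test information leaks in
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=1)
ohe = OneHotEncoder(handle_unknown='ignore')  # categories unseen in training encode as all zeros
X_train2 = ohe.fit_transform(X_train2)
X_test2 = ohe.transform(X_test2)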


In [None]:
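ovr.fit(X_train2, y_train2)
dt.fit(X_train2, y_train2)
# Report accuracy on both splits; a large train/test gap points to overfitting
print('One vs Rest train/test accuracy: {:.3f} / {:.3f}'.format(ovr.score(X_train2, y_train2), ovr.score(X_test2, y_test2)))
print('Decision Tree train/test accuracy: {:.3f} / {:.3f}'.format(dt.score(X_train2, y_train2), dt.score(X_test2, y_test2)))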

### Pipelines

In [None]:
# with_mean=False lets the scaler accept the sparse one-hot matrices produced above
pipeline_1 = Pipeline([('ss', StandardScaler(with_mean=False)),
                       ('RF', RandomForestClassifier(random_state=1, max_depth=100))])

In [None]:
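grid = [{'RF__max_depth': [4, 5, 6],
         'RF__min_samples_split': [0.1, 1.0, 10],
         'RF__min_samples_leaf': [0.1, 0.5, 5]}]
# 'precision' alone assumes a binary target; 'precision_macro' also works when Target1 has more than two classes
GS = GridSearchCV(estimator=pipeline_1,
                  param_grid=grid,
                  scoring='precision_macro',
                  cv=5)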

In [None]:
#GS.fit()

In [None]:
#GS.cv_results_

In [None]:
#GS.best_estimator_.score()

In [None]:
#GS.best_params_
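
The grid search cells above are left commented out. As a rough illustration of how the fit and inspection steps fit together, the sketch below runs a smaller grid on synthetic data; the dataset, the reduced grid, `n_estimators=25`, and `cv=3` are all assumptions chosen to keep the example quick, not settings from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the encoded crash features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

pipe = Pipeline([('ss', StandardScaler()),
                 ('RF', RandomForestClassifier(n_estimators=25, random_state=1))])

# Step name + double underscore + parameter name addresses a parameter inside the pipeline
param_grid = {'RF__max_depth': [4, 6],
              'RF__min_samples_leaf': [1, 5]}

search = GridSearchCV(pipe, param_grid, scoring='precision_macro', cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
# score() applies the same scorer (macro precision) to the held-out split
test_score = search.score(X_te, y_te)
print(test_score)
```

Values set in the grid override whatever was passed to the estimator's constructor, so a `max_depth` given when building the pipeline has no effect once `RF__max_depth` is searched.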

## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***