# Team Introduction
Our group consists of Braden Anderson, Hien Lam, and Tavin Weeda.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import pickle 

from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, multilabel_confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# Data Preparation Part 1
- Define and prepare your class variables. 
- Use proper variable representations (int, float, one-hot, etc.). 
- Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. 
- Remove variables that are not needed/useful for the analysis.
(10)

## Read, clean the data

In [109]:
# Read in data from github
url_accident = "https://github.com/BradenAnderson/Accident_Severity_Prediction/blob/main/Data/accident.csv.gz?raw=tr"
url_vehicle = "https://github.com/BradenAnderson/Accident_Severity_Prediction/blob/main/Data/vehicle.csv.gz?raw=tr"
url_person = "https://github.com/BradenAnderson/Accident_Severity_Prediction/blob/main/Data/person.csv.gz?raw=tr"

accident = pd.read_csv(url_accident,compression='gzip')
vehicle = pd.read_csv(url_vehicle, compression='gzip', encoding="ISO-8859-1")
person = pd.read_csv(url_person, compression='gzip', encoding="ISO-8859-1")



In [110]:
# Filter accidents where driver is present and vehicle is involved
person = person.loc[(person.VEH_NO==1) & (person.PER_NO==1)]
vehicle = vehicle.loc[vehicle.VEH_NO==1]

In [111]:
# Left join person with vehicle and accident
# Duplicated CASENUM are dropped
df = person.merge(vehicle.drop_duplicates(subset=['CASENUM']), on='CASENUM', how='left')
df = df.merge(accident.drop_duplicates(subset=['CASENUM']),on='CASENUM',how='left')

In [113]:
# Comprehensive list of variables used in this analysis
# ORIGINAL features from lab 1 / EDA: regionname, urbanicityname, body_typname, makename, mod_yearname, vtrafwayname, vnum_lanname, vsurcondname, vtrafconname, 
                                # typ_intname, int_hwyname, weathername, wkdy_imname, reljct1_imname, lgtcon_imname, maxsev_imname, alchl_imname, age_im, sex_imname, trav_sp
# DERIVED features: hour_binned, speeding_status
# NEW features post-lab 1 / EDA: rest_usename, pcrash1_imname, weather_binned (binning of `weathername`), body_type_binned (binning of `body_typname`), intersection_binned (binning of `typ_intname`)
# DISCARDED features that were not found to be useful: hour_imname (unnecessary with `hour_binned`), vspd_lim (unnecessary with `speeding_status`), makename (too many levels), wrk_zonename

df = df[['REGIONNAME','URBANICITYNAME','BODY_TYPNAME_x', 'MOD_YEARNAME_x','VTRAFWAYNAME','VNUM_LANNAME','VSURCONDNAME','VTRAFCONNAME','TYP_INTNAME','INT_HWYNAME','WEATHERNAME',
        'WKDY_IMNAME', 'RELJCT1_IMNAME','LGTCON_IMNAME','MAXSEV_IMNAME','ALCHL_IMNAME','AGE_IM','SEX_IMNAME','TRAV_SP','REST_USENAME','PCRASH1_IMNAME','HOUR_IMNAME','VSPD_LIM']]

df = df.rename(columns=str.lower)
df.shape

(54473, 23)

In [101]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54473 entries, 0 to 54472
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   regionname      54473 non-null  object 
 1   urbanicityname  54473 non-null  object 
 2   body_typname_x  54473 non-null  object 
 3   mod_yearname_x  54473 non-null  object 
 4   vtrafwayname    54427 non-null  object 
 5   vnum_lanname    54427 non-null  object 
 6   vsurcondname    54427 non-null  object 
 7   vtrafconname    54427 non-null  object 
 8   typ_intname     54473 non-null  object 
 9   int_hwyname     54473 non-null  object 
 10  weathername     54473 non-null  object 
 11  wkdy_imname     54473 non-null  object 
 12  reljct1_imname  54473 non-null  object 
 13  lgtcon_imname   54473 non-null  object 
 14  maxsev_imname   54473 non-null  object 
 15  alchl_imname    54473 non-null  object 
 16  age_im          54473 non-null  int64  
 17  sex_imname      54473 non-null 

In [136]:
df.head()

Unnamed: 0,regionname,urbanicityname,body_typname_x,mod_yearname_x,vtrafwayname,vnum_lanname,vsurcondname,vtrafconname,typ_intname,int_hwyname,...,lgtcon_imname,maxsev_imname,alchl_imname,age_im,sex_imname,trav_sp,rest_usename,pcrash1_imname,hour_imname,vspd_lim
0,"West (MT, ID, WA, OR, CA, NV, NM, AZ, UT, CO, ...",Rural Area,"4-door sedan, hardtop",2018,"Two-Way, Not Divided",Five lanes,Snow,Traffic control signal(on colors) not known wh...,Four-Way Intersection,No,...,Daylight,No Apparent Injury (O),No Alcohol Involved,61,Female,25.0,Shoulder and Lap Belt Used,Going Straight,8:00am-8:59am,98.0
1,"South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA,...",Urban Area,"4-door sedan, hardtop",2013,"Two-Way, Not Divided",Two lanes,Dry,No Controls,Not an Intersection,No,...,Dark - Not Lighted,Suspected Minor Injury (B),No Alcohol Involved,23,Male,45.0,Shoulder and Lap Belt Used,Going Straight,1:00am-1:59am,25.0
2,"South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA,...",Urban Area,Other or Unknown automobile type,Unknown,"Two-Way, Divided, Unprotected Median",Four lanes,Dry,No Controls,T-Intersection,No,...,Daylight,No Apparent Injury (O),No Alcohol Involved,27,Female,15.0,Not Reported,Going Straight,1:00pm-1:59pm,45.0
3,"West (MT, ID, WA, OR, CA, NV, NM, AZ, UT, CO, ...",Rural Area,"Compact Utility (Utility Vehicle Categories ""S...",2015,"Two-Way, Divided, Positive Median Barrier",Two lanes,Snow,No Controls,Not an Intersection,Yes,...,Daylight,No Apparent Injury (O),No Alcohol Involved,20,Male,65.0,Shoulder and Lap Belt Used,Going Straight,2:00pm-2:59pm,80.0
4,"Northeast (PA, NJ, NY, NH, VT, RI, MA, ME, CT)",Rural Area,Station Wagon (excluding van and truck based),2004,Not Reported,Not Reported,Snow,Warning Sign,Not an Intersection,No,...,Dark - Not Lighted,No Apparent Injury (O),No Alcohol Involved,23,Male,998.0,Shoulder and Lap Belt Used,Negotiating a Curve,5:00pm-5:59pm,50.0


In [135]:
# Check for NA values
df.isnull().sum()

regionname         0
urbanicityname     0
body_typname_x     0
mod_yearname_x     0
vtrafwayname      46
vnum_lanname      46
vsurcondname      46
vtrafconname      46
typ_intname        0
int_hwyname        0
weathername        0
wkdy_imname        0
reljct1_imname     0
lgtcon_imname      0
maxsev_imname      0
alchl_imname       0
age_im             0
sex_imname         0
trav_sp           46
rest_usename       0
pcrash1_imname    46
hour_imname        0
vspd_lim          46
dtype: int64

#### Region where crash occurred

- Northeast (PA, NJ, NY, NH, VT, RI, MA, ME, CT)
- West (MT, ID, WA, OR, CA, NV, NM, AZ, UT, CO, WY, AK, HI)
- Midwest (OH, IN, IL, MI, WI, MN, ND, SD, NE, IA, MO, KS)
- South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA, FL, AL, MS, LA, AR, OK, TX)


In [138]:
# regionname
df["regionname"] = df.loc[:,"regionname"].apply(lambda string: string.split()[0])
df.regionname.value_counts()

South        28393
Midwest      10307
West          9270
Northeast     6503
Name: regionname, dtype: int64

#### Geographical area of the crash

- Urban
- Rural

In [139]:
# urbanicityname
df["urbanicityname"] = df.loc[:,"urbanicityname"].apply(lambda string: string.split()[0])
df.urbanicityname.value_counts()

Urban    40716
Rural    13757
Name: urbanicityname, dtype: int64

#### Year of vehicle **(Continuous variable)**

In [132]:
# mod_yearname_x
# Remove years below 1980 and unknown
df.mod_yearname_x.value_counts()

2016       3809
2017       3724
2019       3491
2018       3360
Unknown    2938
           ... 
1972          1
1932          1
1953          1
1959          1
1969          1
Name: mod_yearname_x, Length: 67, dtype: int64
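
The comment in the cell above mentions removing model years below 1980 and unknown years, but the cell itself only prints counts. A minimal sketch of that filter follows, run on a toy frame rather than the real `df`. It assumes `mod_yearname_x` holds year labels plus non-numeric entries such as `Unknown`; the `mod_year` column name is hypothetical.

```python
import pandas as pd

# Toy stand-in for the real `mod_yearname_x` column (object dtype, per df.info()).
toy = pd.DataFrame({"mod_yearname_x": ["2016", "Unknown", "1972", "1980", "2019"]})

# Non-numeric labels such as "Unknown" become NaN instead of raising an error.
toy["mod_year"] = pd.to_numeric(toy["mod_yearname_x"], errors="coerce")

# Keep rows with a known model year of 1980 or later (NaN >= 1980 is False).
toy = toy.loc[toy["mod_year"] >= 1980].copy()
toy["mod_year"] = toy["mod_year"].astype(int)
```

Converting to an integer column also gives the continuous representation that the heading calls for.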

#### Trafficway flow just prior to crash

In [144]:
# vtrafwayname
# did we use skl to bin this?
df[['vtrafwayname']].value_counts()

vtrafwayname                                         
Two-Way, Not Divided                                     23100
Not Reported                                              9142
Two-Way,  Divided, Positive  Median Barrier               8961
Two-Way, Divided, Unprotected Median                      6879
Two-Way, Not Divided With a Continuous Left-Turn Lane     1966
Non-Trafficway or Driveway Access                         1939
Entrance/Exit Ramp                                        1296
One-Way Trafficway                                        1115
Reported as Unknown                                         29
dtype: int64

#### Number of travel lanes just prior to crash. 
- Median: lanes in opposite directions are additive. 
- No median: lanes in traveling direction counts

In [143]:
# vnum_lanname
# did we use skl to bin this?
df[['vnum_lanname']].value_counts()

vnum_lanname                     
Two lanes                            19168
Not Reported                         15557
Three lanes                           6189
Four lanes                            5078
Five lanes                            3779
Non-Trafficway or Driveway Access     1939
Six lanes                             1295
One lane                               970
Seven or more lanes                    433
Reported as Unknown                     19
dtype: int64

#### Roadway surface condition just prior to crash

In [142]:
# vsurcondname
# did we use skl to bin this?
df[['vsurcondname']].value_counts()

vsurcondname                     
Dry                                  40702
Wet                                   7085
Not Reported                          3420
Non-Trafficway or Driveway Access     1939
Snow                                   512
Ice/Frost                              336
Reported as Unknown                    110
Water (Standing or Moving)             104
Slush                                   98
Mud, Dirt or Gravel                     93
Sand                                    16
Other                                    8
Oil                                      4
dtype: int64

#### Traffic controls in the vehicle’s environment just prior to crash

In [141]:
# vtrafconname
# did we use skl to bin this?
df[['vtrafconname']].value_counts()

vtrafconname                                                                
No Controls                                                                     27471
Traffic control signal(on colors) not known whether or not Pedestrian Signal    11300
Not Reported                                                                     8458
Stop Sign                                                                        5049
Yield Sign                                                                        593
Traffic control signal (on colors) with Pedestrian Signal                         581
Traffic control signal (on colors) without Pedestrian Signal                      191
Flashing Traffic Control Signal                                                   135
Other Regulatory Sign                                                             101
Other                                                                              70
Reported as Unknown                                            

#### Did crash occur on an interstate highway

In [140]:
# int_hwyname
df[['int_hwyname']].value_counts()

int_hwyname
No             49471
Yes             4995
Unknown            7
dtype: int64

In [145]:
# Drop observations with Unknown
df.drop(df[df['int_hwyname'] == 'Unknown'].index, inplace = True)
df[['int_hwyname']].value_counts()

int_hwyname
No             49471
Yes             4995
dtype: int64

#### Day of the week on which the crash occurred

In [116]:
# wkdy_imname
df.wkdy_imname.value_counts()

Friday       8998
Thursday     8238
Wednesday    8120
Tuesday      7787
Saturday     7579
Monday       7466
Sunday       6285
Name: wkdy_imname, dtype: int64

#### Relation to junction (whether the crash occurred within an interchange area)

In [120]:
# reljct1_imname
df.reljct1_imname.value_counts()

No     50837
Yes     3636
Name: reljct1_imname, dtype: int64

#### Lighting condition during time of crash

In [119]:
# lgtcon_imname
df.lgtcon_imname.value_counts()

Daylight                   36245
Dark - Lighted              9520
Dark - Not Lighted          6072
Dusk                        1401
Dawn                         837
Dark - Unknown Lighting      385
Other                         13
Name: lgtcon_imname, dtype: int64

#### Alcohol state of driver

In [118]:
# alchl_imname
df.alchl_imname.value_counts()

No Alcohol Involved    50263
Alcohol Involved        4210
Name: alchl_imname, dtype: int64

#### Age of driver **(Continuous)**

In [None]:
# age_im
# did we do any cleaning here?
# I think we removed observations below 15 years old?
# Should we bin 80+ or leave as is?

#### Sex of driver

In [117]:
# sex_imname
df.sex_imname.value_counts()

Male      33902
Female    20571
Name: sex_imname, dtype: int64

#### Traveling speed of vehicle **(Continuous)**

In [None]:
# trav_sp
# Did we do any cleaning here? 

### Newly derived features or careful binning of existing features

#### Speeding status of driver (Braden) (Derived feature)

In [None]:
# code here
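
The cell above is still a placeholder. One way the derived feature could be built is to compare `trav_sp` with `vspd_lim`; the sketch below is not the team's actual implementation. It assumes that 998 and 999 in `trav_sp` and 98 and 99 in `vspd_lim` are "not reported / unknown" codes (values of 998.0 and 98.0 do appear in `df.head()`), and the three category labels are hypothetical.

```python
import math

# Assumed sentinel codes for "not reported" / "unknown"; check the data dictionary.
TRAV_SP_UNKNOWN = {998, 999}
SPD_LIM_UNKNOWN = {98, 99}

def speeding_status(trav_sp, vspd_lim):
    """Label a driver as Speeding, Not Speeding, or Unknown."""
    # Missing values (None or NaN) cannot be compared, so they map to Unknown.
    if any(v is None or math.isnan(v) for v in (trav_sp, vspd_lim)):
        return "Unknown"
    # Sentinel codes are not real speeds and must not be compared numerically.
    if trav_sp in TRAV_SP_UNKNOWN or vspd_lim in SPD_LIM_UNKNOWN:
        return "Unknown"
    return "Speeding" if trav_sp > vspd_lim else "Not Speeding"
```

It could be applied row by row, for example `df["speeding_status"] = [speeding_status(s, l) for s, l in zip(df["trav_sp"], df["vspd_lim"])]`. Handling the sentinel codes first matters because a raw comparison would label every `998` travel speed as speeding.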

#### Time of accident, hours binned (Braden) (Derived feature)

- Morning (6am-noon)
- Afternoon (noon-6pm)
- Evening (6pm-midnight)
- Night (midnight-6am)

In [None]:
# code here
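
The four bins listed above can be sketched as a small parsing function. This is an illustration under assumptions, not the team's code: it assumes every `hour_imname` label follows the `8:00am-8:59am` pattern seen in `df.head()`, and the function name `hour_bin` is hypothetical.

```python
def hour_bin(label):
    """Map an hour label such as "8:00am-8:59am" to one of four time-of-day bins."""
    start = label.split("-")[0].strip().lower()   # e.g. "8:00am"
    hour = int(start.split(":")[0]) % 12          # 12 -> 0 so am/pm can be handled uniformly
    if start.endswith("pm"):
        hour += 12                                # convert to a 24-hour clock
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if hour >= 18:
        return "Evening"
    return "Night"                                # hours 0 through 5
```

Usage would be `df["hour_binned"] = df["hour_imname"].apply(hour_bin)`. Any label that does not match the assumed pattern would raise a `ValueError`, which is preferable to silently assigning a bin.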

#### Body type of vehicle (Tavin) (Careful binning)

In [None]:
# code here

#### Type of intersection (Careful binning)

In [161]:
df.typ_intname.value_counts()

Not an Intersection        30234
Four-Way Intersection      13458
T-Intersection              5569
Not Reported                4484
Y-Intersection               170
Five Point, or More          160
Roundabout                   143
Traffic Circle                37
L-Intersection                22
Other Intersection Type        6
Reported as Unknown            2
Name: typ_intname, dtype: int64

In [162]:
def intersection_category(row):
    if row == 'Not an Intersection':
        result = 'No'
    elif row == 'Reported as Unknown':
        result = 'Other'
    elif row == 'Not Reported':
        result = 'Other'
    else:
        result = 'Yes'
    return result

df['intersection_binned'] = df['typ_intname'].apply(intersection_category)
df.intersection_binned.value_counts()

No       30234
Yes      19565
Other     4486
Name: intersection_binned, dtype: int64

#### Weather during time of crash (Careful binning)

In [None]:
# weathername
df.weathername.value_counts()

Clear                       38171
Cloudy                       7151
Rain                         4949
Not Reported                 2745
Snow                          775
Fog, Smog, Smoke              227
Reported as Unknown           113
Severe Crosswinds              48
Blowing Snow                   32
Sleet or Hail                  31
Other                          21
Freezing Rain or Drizzle       19
Blowing Sand, Soil, Dirt        3
Name: weathername, dtype: int64

In [147]:
def weather_cat(row):
    if row == 'Cloudy':
        return 'Not Clear'
    elif row == 'Fog, Smog, Smoke':
        return 'Not Clear' 
    elif row == 'Snow':
        return 'Wintery'
    elif row == 'Blowing Snow':
        return 'Wintery'
    elif row == 'Sleet or Hail':
        return 'Wintery'
    elif row == 'Freezing Rain or Drizzle':
        return 'Wintery'
    elif row == 'Severe Crosswinds':
        return 'Windy'
    elif row == 'Blowing Sand, Soil, Dirt':
        return 'Windy'
    elif row == 'Clear':
        return 'Clear'
    elif row == 'Rain':
        return 'Rain'
    else:
        return 'Other'
df['weather_binned'] = df['weathername'].apply(weather_cat)
df.weather_binned.value_counts()

Clear        38313
Not Clear     7387
Rain          4966
Other         2887
Wintery        862
Windy           51
Name: weather_binned, dtype: int64

#### Restraint use by driver (Newly added feature)

In [106]:
# rest_use
df[['rest_usename']].value_counts()

rest_usename                 
Shoulder and Lap Belt Used       40893
None Used/Not Applicable          4979
Reported as Unknown               4897
Not Reported                      2806
Lap Belt Only Used                 429
Shoulder Belt Only Used            336
Restraint Used - Type Unknown      120
Other                                5
Child Restraint Type Unknown         1
dtype: int64

In [148]:
# rest_use: bin into none, minimal, full, other
def restraint_category(row):
    if row == 'Shoulder and Lap Belt Used':
        result = 'Full'
    elif row == 'None Used/Not Applicable':
        result = 'None'    
    elif row == 'Lap Belt Only Used':
        result = 'Minimal'
    elif row == 'Shoulder Belt Only Used':
        result = 'Minimal'
    elif row == 'Restraint Used - Type Unknown':
        result = 'Minimal'
    else:
        result = 'Other'
    return result

df['restraint_binned'] = df['rest_usename'].apply(restraint_category)
df[['restraint_binned']].value_counts()

restraint_binned
Full                40893
Other                7709
None                 4979
Minimal               885
dtype: int64

#### What driver was doing right before crash (Newly added feature)

In [115]:
# pcrash1_imname
df.pcrash1_imname.value_counts()

Going Straight                                                30314
Turning Left                                                   7462
Negotiating a Curve                                            4344
Changing Lanes                                                 3135
Turning Right                                                  2875
Stopped in Roadway                                             1288
Decelerating in Road                                           1136
Backing Up (other than for Parking Position)                   1032
Passing or Overtaking Another Vehicle                           827
Starting in Road                                                659
Making a U-turn                                                 487
Merging                                                         286
Leaving a Parking Position                                      200
Entering a Parking Position                                     123
Accelerating in Road                            

#### **Response variable:** Maximum injury severity of driver (Careful binning)

In [66]:
# maxsev_imname
df.maxsev_imname.value_counts()

No Apparent Injury (O)          24059
Possible Injury (C)             12590
Suspected Minor Injury (B)       9782
Suspected Serious Injury (A)     6446
Fatal Injury (K)                 1415
Injured, Severity Unknown         180
Died Prior to Crash*                1
Name: maxsev_imname, dtype: int64

In [149]:
# Remove Injured, Severity Unknown and Died Prior to Crash 
remove = ['Injured, Severity Unknown', 'Died Prior to Crash*']
df = df[~df.maxsev_imname.isin(remove)].copy()
# Strip the trailing severity code, e.g. " (O)", from each remaining label
df['maxsev_imname'] = df['maxsev_imname'].str[:-4]
df.maxsev_imname.value_counts()

No Apparent Injury          24057
Possible Injury             12589
Suspected Minor Injury       9781
Suspected Serious Injury     6445
Fatal Injury                 1413
Name: maxsev_imname, dtype: int64

In [87]:
# Combine No Apparent Injury, Possible Injury, Suspected Minor Injury into one bin
# Combine Suspected Serious Injury and Fatal Injury into one bin
df['maxsev_binned'] = df['maxsev_imname'].replace({
    'No Apparent Injury': 'Not Fatal Injury',
    'Possible Injury': 'Not Fatal Injury',
    'Suspected Minor Injury': 'Not Fatal Injury',
    'Suspected Serious Injury': 'Fatal Injury',
})
df.maxsev_binned.value_counts()


Not Fatal Injury    46431
Fatal Injury         7861
Name: maxsev_binned, dtype: int64

### End of data cleaning. Review final dataset

In [164]:
# Remove features that are not useful for modeling or redundant post binning
df = df[df.columns.difference(['body_typname_x', 'typ_intname', 'weathername', 'rest_usename', 'hour_imname', 'vspd_lim'])]

In [153]:
# Check for duplicates
# Recall duplicated casenum was taken care of during merge
# Here we check for duplicates of instances excluding `casenum`
df.duplicated().sum()

0

In [134]:
# Confirm there are zero NA values
df.isnull().sum()

In [None]:
df.info()

In [None]:
display(df)

## Begin data preprocessing

In [None]:
# code to scale features

In [None]:
# code for one hot encode
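
The scaling and one-hot cells above are placeholders. A minimal sketch of both steps in a single `ColumnTransformer` follows, using a toy frame instead of the real `df`. The split into numeric and categorical columns is an assumption for illustration; the final feature list is not settled in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows that mimic a few of the cleaned columns.
toy = pd.DataFrame({
    "age_im": [61, 23, 27, 20],
    "trav_sp": [25.0, 45.0, 15.0, 65.0],
    "urbanicityname": ["Rural", "Urban", "Urban", "Rural"],
    "weather_binned": ["Clear", "Rain", "Clear", "Other"],
})

numeric_cols = ["age_im", "trav_sp"]                      # assumed continuous features
categorical_cols = ["urbanicityname", "weather_binned"]   # assumed nominal features

preprocess = ColumnTransformer([
    # Standardize continuous features to zero mean and unit variance.
    ("scale", StandardScaler(), numeric_cols),
    # Expand each category level into its own indicator column.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocess.fit_transform(toy)   # 2 scaled columns + 2 + 3 indicator columns
```

Wrapping the transformer in a `Pipeline` with the estimator would let the scaler and encoder be fit on the training folds only, which avoids leaking test-set statistics during cross validation.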

In [None]:
# list of features with representation (int, float, one-hot)

In [None]:
# code for feature selection: random forest feature importance
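
As a hedged sketch of the random forest step, the block below fits a forest on synthetic data where only the first feature carries signal, then ranks features by impurity-based importance. The data and hyperparameters are illustrative assumptions, not results from the crash dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)        # only feature 0 determines the label

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_      # impurity-based, sums to 1
ranking = np.argsort(importances)[::-1]        # feature indices, most important first
```

On the real data the same two attributes would be read after fitting on the preprocessed training set. Impurity-based importances tend to favor features with many levels, so they are best read alongside the other selection methods.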

In [None]:
# code for feature selection: selectkbest

In [None]:
# code for feature selection: mutual information
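
The two placeholders above can share one sketch, since `SelectKBest` accepts `mutual_info_classif` as its scoring function. The example uses synthetic data in which only feature 2 is informative; `k=2` is an arbitrary choice for illustration.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 2] > 0).astype(int)        # only feature 2 carries signal

# Fix the random seed of the mutual information estimator for repeatability.
score_func = partial(mutual_info_classif, random_state=0)

selector = SelectKBest(score_func=score_func, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)   # indices of the retained columns
```

For one-hot encoded inputs, `mutual_info_classif` has a `discrete_features` argument that would likely need to be set, because its default treats dense input as continuous.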

# Data Preparation Part 2
Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
(5)

In [None]:
df.info()

In [None]:
display(df)

In [None]:
# table of feature description

# Model and Evaluation 1
- Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). 
- Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
(10)

**Classification:** 
The evaluation metric for classifying driver's injury severity is F1. Injury severity is a categorical response, so a classification metric is required, and its classes are imbalanced. Accuracy is therefore not desirable, because the no-information rate is high: a model that always chooses the most populous category would be correct most of the time without identifying any severe injuries. F1 combines precision and recall, so it penalizes that behavior.
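
To make the imbalance argument concrete, the sketch below scores a model that always predicts the majority class, using the class counts printed for `maxsev_binned` earlier in this notebook. Treating `Fatal Injury` as the positive class is an assumption.

```python
# Class counts taken from the maxsev_binned value_counts() output.
n_not_fatal = 46431
n_fatal = 7861

# A no-information model predicts "Not Fatal Injury" for every driver.
tp, fp = 0, 0                 # it never predicts the positive class
fn, tn = n_fatal, n_not_fatal

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")   # accuracy=0.855, f1=0.000
```

The same useless model earns roughly 86% accuracy and an F1 of zero, which is why F1 is the more informative metric here.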

**Regression:**
The evaluation metric for predicting driver's age is MSE. This is an appropriate metric because the response, age, is a continuous variable.

# Model and Evaluation 2
- Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). 
- Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.
(10)

**Classification:** 
The method used to divide the data into train and test split is xxx because xxx.

**Regression:**
The method used to divide the data into train and test split is xxx because xxx. 

In [None]:
# code to split data here
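
The splitting method is still undecided above. One common option for an imbalanced classification response is a stratified hold-out split, sketched here on synthetic labels with an 85/15 class ratio. The test fraction and seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an 85/15 imbalance, similar in spirit to maxsev_binned.
y = np.array([0] * 85 + [1] * 15)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Without `stratify`, a random split could leave the test set with too few minority-class cases for a stable F1 estimate. For the regression task, where the response is continuous, a plain shuffled split or K-fold cross validation would be the analogous choice.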

# Model and Evaluation 3
- Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). 
- Two modeling techniques must be new (but the third could be SVM or logistic regression). 
- Adjust parameters as appropriate to increase generalization performance using your chosen metric. 
- You must investigate different parameters of the algorithms!
(20)

## Classification

### Random Forest for Classification

### kNN for Classification

### SVM for Classification

## Regression

### Random Forest for Regression

### Model 2 for Regression

### Model 3 for Regression

# Model and Evaluation 4
- Analyze the results using your chosen method of evaluation. 
- Use visualizations of the results to bolster the analysis. 
- Explain any visuals and analyze why they are interesting to someone that might use this model.
(10)

# Model and Evaluation 5
- Discuss the advantages of each model for each classification task, if any. 
- If there are not advantages, explain why. Is any model better than another? 
- Is the difference significant with 95% confidence? Use proper statistical comparison methods. 
- You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.
(10)

# Model and Evaluation 6 
- Which attributes from your analysis are most important? 
- Use proper methods discussed in class to evaluate the importance of different attributes. 
- Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
(10)

# Deployment
- How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? 
- How would you measure the model's value if it was used by these parties? 
- How would you deploy your model for interested parties? 
- What other data should be collected? How often would the model need to be updated, etc.? 
(5)

# Exceptional Work
- You have free rein to provide additional analyses. 
- One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. 
- Which parameters are most significant for making a good model for each classification algorithm?
(10)

In [None]:
# grid search code
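
A minimal sketch of a parallelized grid search follows, using `GridSearchCV` with `n_jobs=-1` so candidates are evaluated across all available cores. The synthetic data, the parameter grid, and the choice of random forest are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary problem standing in for the crash data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",     # matches the metric chosen for classification
    cv=5,             # stratified folds are used for classifiers
    n_jobs=-1,        # evaluate candidates in parallel on all cores
)
search.fit(X, y)

best_params = search.best_params_
n_candidates = len(search.cv_results_["params"])   # 2 x 2 = 4 combinations
```

The `cv_results_` dictionary holds the mean test score for every combination, so it can be loaded into a `DataFrame` to plot performance across parameter values.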