# Cleaning

This notebook documents the loading and cleaning of the Chicago crash data.

Our problem focuses on which factors contribute to **severe** traffic incidents for drivers at **night**.

* **Severe** traffic incidents are those classified as `FATAL` or `INCAPACITATING INJURY` in the `INJURY_CLASSIFICATION` column.

* **Night** is defined as the hours from 10 pm to 5 am, i.e. hours `22` through `5` in the `CRASH_HOUR` column.

* The final output is `final_df`, which is used in the following notebook(s).

### Loading the Necessary Packages and CSV Files

In [1]:
#Importing the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix, plot_confusion_matrix,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report, plot_roc_curve
# from sklearn_pandas import DataFrameMapper
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as imb_Pipeline

In [2]:
crash_df = pd.read_csv('data/Traffic_Crashes_-_crashes.csv')
people_df = pd.read_csv('data/Traffic_Crashes_-_people.csv', low_memory=False)
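
Both files are large (the outputs further down show roughly 554,000 crash rows and 1.2 million people rows), so one option is to parse only the columns that the later cells keep. The following is a minimal sketch of that idea, not what the notebook does: it assumes the column names used below, and `toy_csv` is a made-up in-memory stand-in for the real people file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/Traffic_Crashes_-_people.csv.
toy_csv = io.StringIO(
    "PERSON_ID,CRASH_RECORD_ID,AGE,BAC_RESULT VALUE,INJURY_CLASSIFICATION,PERSON_TYPE\n"
    "O1,abc,34,,NO INDICATION OF INJURY,DRIVER\n"
    "O2,abc,7,,NO INDICATION OF INJURY,PASSENGER\n"
)

people_cols = ["CRASH_RECORD_ID", "AGE", "BAC_RESULT VALUE",
               "INJURY_CLASSIFICATION", "PERSON_TYPE"]

# usecols tells pandas to parse only the listed columns.
people_small = pd.read_csv(toy_csv, usecols=people_cols)
print(people_small.columns.tolist())
```

With the real file, the same `usecols` list would be passed alongside the path.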

### Dropping Unnecessary Columns

#### Justification for dropping `crash_df` columns:

* `RD_NO` - Police Department report number, a second identifier for each record; we kept `CRASH_RECORD_ID` as the key for joining the dataframes.
* `CRASH_DATE_EST_I` - used when a crash is reported to police days after it occurred; this dataframe includes the crash day of week, hour, and month, so we can drop it.
* `CRASH_DATE` - this dataframe includes the crash day of week, hour, and month, so we can drop the specific date.
* `REPORT_TYPE` - administrative report type, not a factor relevant to causing a crash.
* `HIT_AND_RUN_I` - not a factor relevant to causing a crash.
* `DATE_POLICE_NOTIFIED` - not a factor relevant to causing a crash.
* `STREET_NO` - of the location-related data, we initially chose to keep only latitude and longitude (both are now dropped as well; see the end of this list).
* `BEAT_OF_OCCURENCE` - not a factor relevant to causing a crash.
* `PHOTOS_TAKEN_I` - not a factor relevant to causing a crash.
* `STATEMENTS_TAKEN` - not a factor relevant to causing a crash.
* `MOST_SEVERE_INJURY` - we base injury severity on information from the `people_df` dataframe; including this and the other injury-related columns would cause multicollinearity in our modeling.
* `INJURIES_FATAL`
* `INJURIES_NON_INCAPACITATING`
* `INJURIES_REPORTED_NOT_EVIDENT`
* `INJURIES_NO_INDICATION`
* `INJURIES_UNKNOWN`
* `LONGITUDE`
* `LATITUDE`

In [3]:
crash_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 554228 entries, 0 to 554227
Data columns (total 49 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   CRASH_RECORD_ID                554228 non-null  object 
 1   RD_NO                          549766 non-null  object 
 2   CRASH_DATE_EST_I               41950 non-null   object 
 3   CRASH_DATE                     554228 non-null  object 
 4   POSTED_SPEED_LIMIT             554228 non-null  int64  
 5   TRAFFIC_CONTROL_DEVICE         554228 non-null  object 
 6   DEVICE_CONDITION               554228 non-null  object 
 7   WEATHER_CONDITION              554228 non-null  object 
 8   LIGHTING_CONDITION             554228 non-null  object 
 9   FIRST_CRASH_TYPE               554228 non-null  object 
 10  TRAFFICWAY_TYPE                554228 non-null  object 
 11  LANE_CNT                       198970 non-null  float64
 12  ALIGNMENT                     

In [4]:
crash_df['CRASH_HOUR'].value_counts()

16    42587
15    42326
17    41653
14    37575
18    34490
13    34297
12    32896
8     28751
11    28532
9     25573
10    25327
19    25290
7     23174
20    20244
21    18065
22    16648
23    14129
6     12268
0     11546
1      9847
2      8404
5      7521
3      6842
4      6243
Name: CRASH_HOUR, dtype: int64

In [5]:
crash_df_cleaned = crash_df[['CRASH_RECORD_ID', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION', 
                             #'LATITUDE', 'LONGITUDE',
                             'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND', 
                             'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH']]

#### Justification for dropping `people_df` columns:

* `PERSON_ID` - unique ID for each person record.

... do we need to fill in reasons for all these? 

In [6]:
people_df_cleaned = people_df[['CRASH_RECORD_ID', 'AGE', 
                               'BAC_RESULT VALUE', 'INJURY_CLASSIFICATION', 'PERSON_TYPE']]

## Subsetting crash records between 10 pm and 6 am

In [7]:
night_time_df = crash_df_cleaned.copy()
#night_time_df = night_time_df[(night_time_df['CRASH_HOUR'] >= 22) | (night_time_df['CRASH_HOUR'] <= 6)]
night_time_df.columns

Index(['CRASH_RECORD_ID', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND',
       'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH'],
      dtype='object')
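
In the cell above the hour filter is commented out, so `night_time_df` still holds crashes from every hour. The introduction defines night as hours `22` through `5`, while the commented line uses `<= 6`. Going by the `CRASH_HOUR` counts printed earlier, hours 22 through 5 cover 81,180 of the 554,228 crash records (about 14.6%). A minimal sketch of a filter for a window that wraps past midnight follows; `night_mask` is a hypothetical helper, the toy frame is made up, and the bounds are parameters so either definition can be tried.

```python
import pandas as pd


def night_mask(hours: pd.Series, start: int = 22, end: int = 5) -> pd.Series:
    """True for hours in a window that wraps past midnight (start..23 and 0..end)."""
    return (hours >= start) | (hours <= end)


toy = pd.DataFrame({"CRASH_HOUR": [21, 22, 23, 0, 5, 6, 14]})

# Keep only the rows whose hour falls inside the night window.
night_only = toy[night_mask(toy["CRASH_HOUR"])]
print(night_only["CRASH_HOUR"].tolist())  # [22, 23, 0, 5]
```

The two conditions are joined with `|` rather than `&` because no single hour is both at least 22 and at most 5.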

### change1

In [8]:
people_df['DRIVER_ACTION'].value_counts()

NONE                                 353351
UNKNOWN                              234396
FAILED TO YIELD                       89917
OTHER                                 84309
FOLLOWED TOO CLOSELY                  62148
IMPROPER BACKING                      30836
IMPROPER TURN                         25741
IMPROPER LANE CHANGE                  25627
IMPROPER PASSING                      21591
DISREGARDED CONTROL DEVICES           16374
TOO FAST FOR CONDITIONS               15674
IMPROPER PARKING                       3758
WRONG WAY/SIDE                         3754
CELL PHONE USE OTHER THAN TEXTING      1608
EVADING POLICE VEHICLE                 1596
OVERCORRECTED                          1198
EMERGENCY VEHICLE ON CALL               924
TEXTING                                 425
STOPPED SCHOOL BUS                      113
LICENSE RESTRICTIONS                     44
Name: DRIVER_ACTION, dtype: int64

### change1end

## Joining the two data sets

In [9]:
#checking the shape
night_time_df.shape, people_df_cleaned.shape

((554228, 9), (1226112, 5))

In [10]:
merge = pd.merge(night_time_df, people_df_cleaned, how='left', on='CRASH_RECORD_ID')
merge.shape

(1225784, 13)
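
Each crash can involve several people, so this is a one-to-many join: 554,228 crash rows become 1,225,784 merged rows, and the 9 + 5 columns share one key, giving 13. In a left join, people rows whose `CRASH_RECORD_ID` has no match in the crash table are dropped, and crashes without any people rows are kept with missing person fields. The sketch below shows how `validate` and `indicator` could be used to check both assumptions; the toy frames are made up for illustration.

```python
import pandas as pd

crashes = pd.DataFrame({"CRASH_RECORD_ID": ["a", "b", "c"],
                        "CRASH_HOUR": [23, 2, 14]})
people = pd.DataFrame({"CRASH_RECORD_ID": ["a", "a", "b", "z"],
                       "PERSON_TYPE": ["DRIVER", "PASSENGER", "DRIVER", "DRIVER"]})

# validate raises if the key is duplicated on the crash side;
# indicator adds a _merge column saying where each row came from.
joined = pd.merge(crashes, people, how="left", on="CRASH_RECORD_ID",
                  validate="one_to_many", indicator=True)

print(joined[["CRASH_RECORD_ID", "PERSON_TYPE", "_merge"]])
```

Crash `c` survives as a `left_only` row with a missing `PERSON_TYPE`, and the person attached to the unknown crash `z` does not appear at all.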

In [11]:
merge['AGE'].value_counts()

 25.0     25220
 26.0     24959
 27.0     24753
 28.0     24208
 24.0     24187
          ...  
-47.0         1
-49.0         1
 106.0        1
-177.0        1
-40.0         1
Name: AGE, Length: 116, dtype: int64

## Further Exploring Columns

#### `INJURY_CLASSIFICATION` target variable: covers everyone involved in an incident (cyclists, passengers, drivers, etc.)

In [12]:
merge['INJURY_CLASSIFICATION'].value_counts()

NO INDICATION OF INJURY     1122230
NONINCAPACITATING INJURY      57052
REPORTED, NOT EVIDENT         32978
INCAPACITATING INJURY         11118
FATAL                           682
Name: INJURY_CLASSIFICATION, dtype: int64

In [13]:
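# NOTE: every reported injury (fatal, incapacitating, non-incapacitating,
# or reported-but-not-evident) = 1, which is broader than the
# FATAL / INCAPACITATING definition of "severe" given in the introduction
injury_labels = ['FATAL', 'INCAPACITATING INJURY',
                 'NONINCAPACITATING INJURY', 'REPORTED, NOT EVIDENT']
merge.loc[merge['INJURY_CLASSIFICATION'].isin(injury_labels), 'INJURY_CLASSIFICATION'] = 1

# no indication of injury = 0
merge.loc[(merge['INJURY_CLASSIFICATION'] == 'NO INDICATION OF INJURY'), 'INJURY_CLASSIFICATION'] = 0

# missing classifications are treated as 0; cast to int so the target is numeric
merge['INJURY_CLASSIFICATION'] = merge['INJURY_CLASSIFICATION'].fillna(0).astype(int)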

In [14]:
merge["INJURY_CLASSIFICATION"].value_counts()

0    1123954
1     101830
Name: INJURY_CLASSIFICATION, dtype: int64

In [15]:
merge["INJURY_CLASSIFICATION"].value_counts(normalize=True)

0    0.916927
1    0.083073
Name: INJURY_CLASSIFICATION, dtype: float64
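
The recoding above assigns 1 to four injury labels, which is why class 1 has 101,830 rows (57,052 + 32,978 + 11,118 + 682). Under the introduction's narrower definition of severe, only `FATAL` and `INCAPACITATING INJURY` would count: 11,118 + 682 = 11,800 rows, just under 1% of the merged frame. A minimal sketch of that narrower target, on a made-up frame with a hypothetical `SEVERE` column:

```python
import pandas as pd

SEVERE_LABELS = ["FATAL", "INCAPACITATING INJURY"]

toy = pd.DataFrame({"INJURY_CLASSIFICATION": [
    "FATAL", "INCAPACITATING INJURY", "NONINCAPACITATING INJURY",
    "REPORTED, NOT EVIDENT", "NO INDICATION OF INJURY", None,
]})

# isin() is False for missing values, so they fall into class 0.
toy["SEVERE"] = toy["INJURY_CLASSIFICATION"].isin(SEVERE_LABELS).astype(int)
print(toy["SEVERE"].tolist())  # [1, 1, 0, 0, 0, 0]
```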

In [17]:
# Rows where the binary injury flag is 1
Crash_Injury = merge.loc[merge["INJURY_CLASSIFICATION"] == 1]

In [None]:
merge = merge.drop(columns=['CRASH_RECORD_ID'])

### changing traffic control device

In [None]:
merge.loc[merge['TRAFFIC_CONTROL_DEVICE'] == 'NO CONTROLS', 'TRAFFIC_CONTROL_DEVICE'] = 0
merge.loc[merge['TRAFFIC_CONTROL_DEVICE'] != 0, 'TRAFFIC_CONTROL_DEVICE'] = 1

merge.loc[merge.DEVICE_CONDITION == 'FUNCTIONING PROPERLY', 'DEVICE_CONDITION'] = 1
merge.loc[merge.DEVICE_CONDITION != 1, 'DEVICE_CONDITION'] = 0

merge['DEVICE_CONDITION'] = merge['DEVICE_CONDITION'].astype(float)
merge['TRAFFIC_CONTROL_DEVICE'] = merge['TRAFFIC_CONTROL_DEVICE'].astype(float)
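
The same two flags can be written as single vectorized comparisons. This is a sketch on a made-up frame; the category strings other than `NO CONTROLS` and `FUNCTIONING PROPERLY` are illustrative. It reproduces how the cell above treats missing values: a missing control device counts as 1, and a missing device condition counts as 0.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TRAFFIC_CONTROL_DEVICE": ["NO CONTROLS", "TRAFFIC SIGNAL", "STOP SIGN", np.nan],
    "DEVICE_CONDITION": ["NO CONTROLS", "FUNCTIONING PROPERLY", "OTHER", np.nan],
})

# 1.0 when any control device is recorded (NaN != "NO CONTROLS" is True).
has_control = (toy["TRAFFIC_CONTROL_DEVICE"] != "NO CONTROLS").astype(float)

# 1.0 only when the device is recorded as functioning properly.
device_ok = (toy["DEVICE_CONDITION"] == "FUNCTIONING PROPERLY").astype(float)

print(has_control.tolist(), device_ok.tolist())
```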

### changing weather

In [None]:
# 1 is clear
merge.loc[merge['WEATHER_CONDITION'] == 'CLEAR', 'WEATHER_CONDITION'] = 1

# 0 is not clear
merge.loc[merge['WEATHER_CONDITION'] != 1, 'WEATHER_CONDITION'] = 0

merge['WEATHER_CONDITION'] = merge['WEATHER_CONDITION'].astype(float)

### changing lighting condition

In [None]:
# ohe this during train test split

In [None]:
merge['LIGHTING_CONDITION'].value_counts()

### changing roadway surface cond

In [None]:
merge.loc[merge['ROADWAY_SURFACE_COND'] == 'OTHER', 'ROADWAY_SURFACE_COND'] = 'UNKNOWN'

In [None]:
merge['ROADWAY_SURFACE_COND'].value_counts()

### changing age

In [None]:
merge.info()

In [None]:
merge.loc[merge['AGE'] <= 0, 'AGE'] = None

In [None]:
merge.dropna(subset=['AGE'], inplace=True)
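
The `AGE` counts printed earlier include negative values such as `-47.0` and `-177.0`, which the two cells above blank out and then drop together with rows that never had an age. A compact sketch of the same two steps on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"AGE": [25.0, -47.0, 0.0, None, 106.0]})

# where() keeps values that satisfy the condition and sets the rest to NaN.
toy["AGE"] = toy["AGE"].where(toy["AGE"] > 0)

# Drop rows whose age is missing or was just invalidated.
cleaned = toy.dropna(subset=["AGE"])
print(cleaned["AGE"].tolist())  # [25.0, 106.0]
```

No upper bound is applied, so an age such as 106 is kept, as in the notebook.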

In [None]:
merge.info()

In [None]:
# keep drivers only; .copy() avoids SettingWithCopyWarning on the in-place edits below
merge = merge.loc[merge['PERSON_TYPE'] == 'DRIVER'].copy()

In [None]:
merge.info()

In [None]:
fig, ax = plt.subplots()

ax.bar(list(merge['AGE'].value_counts().index), merge['AGE'].value_counts().values)

In [None]:
merge.drop(columns=['PERSON_TYPE'], inplace=True)

### changing bac_result_value

In [None]:
#merge.rename(columns={'BAC_RESULT VALUE':'BAC_RESULT_VALUE'})

# missing BAC results are treated as 0
merge['BAC_RESULT VALUE'] = merge['BAC_RESULT VALUE'].fillna(0)

# 1 = at or above the 0.08 threshold (drunk)
merge.loc[merge['BAC_RESULT VALUE'] >= 0.08, 'BAC_RESULT VALUE'] = 1

# 0 = below 0.08 (not drunk)
merge.loc[merge['BAC_RESULT VALUE'] < 0.08, 'BAC_RESULT VALUE'] = 0

### changing day of week

In [None]:
# binning weekend and weekday nights

# 1 = weekend night: codes 6 and 7 are set to 1
merge.loc[merge['CRASH_DAY_OF_WEEK'] >= 6, 'CRASH_DAY_OF_WEEK'] = 1

# 0 = weekday night: everything that is not 1 becomes 0
# (rows originally coded 1 stay 1, so they also count as weekend)
merge.loc[merge['CRASH_DAY_OF_WEEK'] != 1, 'CRASH_DAY_OF_WEEK'] = 0
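
Because the first step writes 1 into the same column that the second step tests, rows that started out as code 1 end up in the weekend group along with codes 6 and 7. Assuming the dataset codes Sunday as 1 and Saturday as 7 (an assumption, not confirmed in this notebook), that means Friday, Saturday, and Sunday. The sketch below compares the two-step recode with an explicit `isin` version on a made-up series.

```python
import pandas as pd

dow = pd.Series([1, 2, 3, 4, 5, 6, 7], name="CRASH_DAY_OF_WEEK")

# Two-step recode, mirroring the cell above.
two_step = dow.copy()
two_step.loc[two_step >= 6] = 1
two_step.loc[two_step != 1] = 0

# Explicit version: name the codes that end up as 1.
explicit = dow.isin([1, 6, 7]).astype(int)

print(two_step.tolist())  # [1, 0, 0, 0, 0, 1, 1]
print(explicit.tolist())  # [1, 0, 0, 0, 0, 1, 1]
```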

### changing lane count (deprecated)

`LANE_CNT` is dropped because it has too many null values (only 198,970 of 554,228 rows are non-null): we don't want to skew the data by imputing the mean or median, and we don't want to assume a distribution to generate synthetic values.
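
To put a number on "too many": the `crash_df.info()` output lists 198,970 non-null `LANE_CNT` values out of 554,228 rows. The sketch below works out the missing share from those two counts and shows the equivalent pandas call on a made-up series.

```python
import numpy as np
import pandas as pd

non_null, total = 198970, 554228  # LANE_CNT counts from crash_df.info()
missing_share = 1 - non_null / total
print(f"{missing_share:.1%} missing")  # 64.1% missing

# The same measurement on a column: isna().mean() is the fraction of NaN values.
toy_lanes = pd.Series([2.0, np.nan, 4.0, np.nan, np.nan])
print(toy_lanes.isna().mean())  # 0.6
```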

In [None]:
# index = merge[merge['LANE_CNT'] > 12].index

# merge.drop(index, inplace=True)

In [None]:
# merge['LANE_CNT'].value_counts()

In [None]:
# merge['LANE_CNT'].value_counts().sum()

In [None]:
# fig, ax = plt.subplots()

# ax.bar(list(merge['LANE_CNT'].value_counts().index), merge['LANE_CNT'].value_counts().values)

In [None]:
# merge['LANE_CNT'].fillna(merge['LANE_CNT'].median(), inplace=True)

In [None]:
# fig, ax = plt.subplots()

# ax.bar(list(merge['LANE_CNT'].value_counts().index), merge['LANE_CNT'].value_counts().values)

## compile final df

In [None]:
final_df = merge.copy()
final_df.info()

In [None]:
# final_df.to_csv('final_df.csv')

#### Exporting the `final_df` into csv file

In [None]:
#clean_data = final_df.to_csv('clean_data.csv', index = False)

## first model

### smote oversampling

In [None]:
X = final_df.drop(columns=['INJURY_CLASSIFICATION'])
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)
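
Passing `stratify=y` keeps the class proportions the same in the training and test parts, which matters when one class is rare. A small sketch with made-up labels (90 zeros, 10 ones):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=11)

# Both parts keep the 10% share of class 1.
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```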

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
#under = RandomUnderSampler(sampling_strategy=0.5)

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('classifier', DecisionTreeClassifier(random_state=11))
])
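
Placing `SMOTE` inside an `imblearn` pipeline means resampling is applied only when the pipeline is fitted, so during cross-validation the held-out fold is scored without synthetic rows. SMOTE creates new minority samples by interpolating between a minority point and one of its minority-class neighbours. The sketch below illustrates only that interpolation step with NumPy. It is a simplification (it always uses the single nearest neighbour, whereas SMOTE picks among the k nearest) and is not `imblearn`'s implementation; `synth_point` and the toy points are made up.

```python
import numpy as np

rng = np.random.default_rng(11)

# Four made-up minority-class points on the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])


def synth_point(points: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Interpolate between a random point and its nearest other point."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                      # exclude the point itself
    j = int(np.argmin(dists))              # nearest neighbour
    lam = rng.random()                     # position along the segment, in [0, 1)
    return points[i] + lam * (points[j] - points[i])


new_points = np.array([synth_point(minority, rng) for _ in range(5)])
print(new_points.shape)  # (5, 2)
```

Every synthetic point lies on a segment between two existing minority points, so it never leaves the region those points span.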

In [None]:
param_grid = [{'classifier__max_depth':[1, 3, 5]}]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted = grid_search.best_score_
test_score_smoted = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted, test_score_smoted
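
When reading these accuracy scores, it helps to compare them with a constant prediction. Before the driver and age filters, about 91.7% of rows were class 0 (the ratio in `final_df` itself is not printed), so a model that always predicts 0 would already look accurate. A sketch with made-up labels at roughly that ratio:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 92 + [1] * 8)   # made-up labels, about 92% class 0
y_const = np.zeros_like(y_true)          # always predict class 0

print(accuracy_score(y_true, y_const))   # 0.92
print(recall_score(y_true, y_const))     # 0.0
```

High accuracy here coexists with zero recall on the injury class, which is why recall, F1, or ROC AUC may be more informative choices for `scoring`.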

In [None]:
confusion_matrix(y_test, y_pred)  # (y_true, y_pred): rows = true class, columns = predicted
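
scikit-learn documents the argument order as `confusion_matrix(y_true, y_pred)`, with rows for the true class and columns for the predicted class. Swapping the arguments transposes the matrix, which exchanges false positives and false negatives. A sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 1, 0, 0])

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm.tolist())     # [[3, 1], [2, 0]]
print(tn, fp, fn, tp)  # 3 1 2 0

# Swapped arguments give the transpose.
swapped = confusion_matrix(y_hat, y_true)
print(swapped.tolist())  # [[3, 2], [1, 0]]
```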

### smote over and undersampling

In [None]:
X = final_df.drop(columns=['INJURY_CLASSIFICATION'])
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('classifier', DecisionTreeClassifier(random_state=11))
])

In [None]:
param_grid = [{'classifier__max_depth':[1, 3, 5]}]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
print(grid_search.best_params_)
cv_score_smoted = grid_search.best_score_
test_score_smoted = grid_search.score(X_test, y_test)

In [None]:
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('classifier', DecisionTreeClassifier(random_state=11, max_depth=5))
])

fig, ax = plt.subplots(figsize=(40, 40))

pipeline.fit(X_train, y_train)
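# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
feature_list = list(pipeline['col_transformer'].get_feature_names_out())
plot_tree(pipeline['classifier'], ax=ax, feature_names=feature_list)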

In [None]:
cv_score_smoted, test_score_smoted

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)  # y_true first; hard labels give a single-threshold AUC
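
`roc_auc_score` expects the true labels first and a score second. The score is usually a probability or decision value rather than a hard 0/1 prediction, because the ROC curve is traced by sweeping a threshold over it. A sketch with made-up values; with the fitted search, `grid_search.predict_proba(X_test)[:, 1]` could supply the scores, since the decision tree provides `predict_proba`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # made-up positive-class probabilities

auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75
```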

### no smote

In [None]:
X = final_df.drop(columns=['INJURY_CLASSIFICATION'])
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline as imb_Pipeline

# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

# Create a pipeline containing the column transformer and model
pipeline = Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('classifier', DecisionTreeClassifier(random_state=11))
])

In [None]:
param_grid = [{'classifier__max_depth':[1, 3, 5]}]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_no_smote = grid_search.best_score_
test_score_no_smoted = grid_search.score(X_test, y_test)

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
cv_score_no_smote, test_score_no_smoted

### smote logistic regression (just traffic control device)

In [None]:
X = final_df[['TRAFFIC_CONTROL_DEVICE']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
#    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)
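
The cost of this search is easy to count: four `max_iter` values times two penalties gives eight candidates, and with `cv=5` that is 40 cross-validation fits plus one refit of the best candidate on the full training set. The sketch below counts the candidates with `ParameterGrid`, using the grid values copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [{"logistic_regressor__max_iter": [50, 100, 250, 500],
               "logistic_regressor__penalty": ["none", "l2"]}]

n_candidates = len(ParameterGrid(param_grid))
n_cv_fits = n_candidates * 5   # cv=5
n_total_fits = n_cv_fits + 1   # plus the final refit (refit=True is the default)

print(n_candidates, n_cv_fits, n_total_fits)  # 8 40 41
```

Each of those fits also reruns SMOTE, because the sampler is a step of the pipeline.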

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

### smote knn

In [None]:
X = final_df.drop(columns=['INJURY_CLASSIFICATION'])
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
X_t, X_val, y_t, y_val = train_test_split(X_train, y_train,
                                          random_state=42,
                                          test_size=0.2)
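
Two successive 80/20 splits leave 64% of the rows for fitting, 16% for validation, and 20% for the final test. Note that the second split above does not pass `stratify`, so the class ratio in the validation part can drift slightly. A sketch of the resulting sizes on 100 made-up rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)

train_part, test_part = train_test_split(rows, test_size=0.2, random_state=11)
fit_part, val_part = train_test_split(train_part, test_size=0.2, random_state=42)

print(len(fit_part), len(val_part), len(test_part))  # 64 16 20
```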

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('knn_classifier', KNeighborsClassifier())
])

In [None]:
param_grid = [{'knn_classifier__n_neighbors': [3,5,9],
               'knn_classifier__metric': ['minkowski']}]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_t, y_t)

y_hat = grid_search.predict(X_val)
print(grid_search.best_params_)
cv_score_smoted_knn = grid_search.best_score_
test_score_smoted_knn = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_knn, test_score_smoted_knn

In [None]:
confusion_matrix(y_val, y_hat)

In [None]:
accuracy_score(y_val, y_hat)

In [None]:
precision_score(y_val, y_hat)

In [None]:
f1_score(y_val, y_hat)

In [None]:
roc_auc_score(y_val, y_hat)
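
The validation cells above report accuracy, precision, F1, and ROC AUC; `recall_score` is imported but not shown. All of the threshold metrics follow directly from the four confusion-matrix counts, as the sketch below shows. `binary_metrics` is a hypothetical helper and the counts are made up.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts: 80 true negatives, 10 false positives,
# 4 false negatives, 6 true positives.
metrics = binary_metrics(tn=80, fp=10, fn=4, tp=6)
print(metrics)
```

With these counts, accuracy is 0.86 while precision is only 0.375 and recall 0.6, which illustrates how accuracy can hide weak performance on the rare class.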

In [None]:
plot_roc_curve(grid_search, X_test, y_test)

### smote logistic regression (all features)

In [None]:
X = final_df.drop(columns=['INJURY_CLASSIFICATION'])
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

### smote logistic regression (weather, road cond, age, traffic control device, lighting cond, crash hour)

In [None]:
X = final_df[['WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 'AGE', 'TRAFFIC_CONTROL_DEVICE', 'LIGHTING_CONDITION',
             'CRASH_HOUR']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

### smote logistic regression (without age)

In [None]:
X = final_df[['WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 'TRAFFIC_CONTROL_DEVICE', 'LIGHTING_CONDITION',
             'CRASH_HOUR']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

### smote logistic regression (traffic control device, surface cond, day of week)

In [None]:
X = final_df[['ROADWAY_SURFACE_COND', 'TRAFFIC_CONTROL_DEVICE', 'CRASH_DAY_OF_WEEK']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

### smote logistic regression (traffic control device, crash day of week, roadway cond, weather cond)

In [None]:
X = final_df[['ROADWAY_SURFACE_COND', 'TRAFFIC_CONTROL_DEVICE', 'CRASH_DAY_OF_WEEK']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
confusion_matrix(y_pred, y_test)

In [None]:
roc_auc_score(y_pred, y_test)

In [None]:
print(classification_report(y_pred, y_test))

### smote logistic regression (traffic control device, crash day of week, roadway cond, weather cond)

In [None]:
X = final_df[['ROADWAY_SURFACE_COND', 'TRAFFIC_CONTROL_DEVICE', 'CRASH_DAY_OF_WEEK']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
# scikit-learn metrics expect (y_true, y_pred)
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
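
The grid search selects hyperparameters with `scoring='accuracy'`. Samplers in an imblearn pipeline run only during fitting, so the validation folds and the test set keep their original class balance. A sketch with made-up counts (the 95/5 split is an assumption for illustration, not a figure from the crash data) shows how accuracy can look strong while recall on the rare class is zero:

```python
# Made-up test set: 95 non-severe (0) and 5 severe (1) crashes
y_true = [0] * 95 + [1] * 5

# A baseline that always predicts the majority class
y_hat = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

If the severe class is the one of interest, a scorer such as `'recall'` or `'f1'` may be more informative than accuracy, depending on how the target is encoded.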

### smote logistic regression (traffic control device, crash day of week, roadway cond, weather cond)

In [None]:
X = final_df[['ROADWAY_SURFACE_COND', 'TRAFFIC_CONTROL_DEVICE', 'CRASH_DAY_OF_WEEK', 'WEATHER_CONDITION']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['WEATHER_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
# scikit-learn metrics expect (y_true, y_pred)
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
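
Each section reassigns `cv_score_smoted_log` and `test_score_smoted_log`, so earlier scores are lost once a later cell runs. One option is to collect them in a dictionary keyed by feature set. In this sketch, `record_scores` is a hypothetical helper and the score values are placeholders, not results from this notebook:

```python
def record_scores(results, name, cv_score, test_score):
    """Store one model's scores under a descriptive name and return the dict."""
    results[name] = {"cv": cv_score, "test": test_score}
    return results

results = {}
# Placeholder numbers for illustration only
record_scores(results, "roadway + control + day", 0.61, 0.60)
record_scores(results, "roadway + control + day + weather", 0.63, 0.62)

# Rank feature sets by cross-validation score, best first
ranked = sorted(results.items(), key=lambda item: item[1]["cv"], reverse=True)
for name, scores in ranked:
    print(f"{name}: cv={scores['cv']:.3f}, test={scores['test']:.3f}")
```

In the notebook, the call would take `grid_search.best_score_` and `grid_search.score(X_test, y_test)` in place of the placeholder numbers.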

### smote logistic regression (weather, roadway, traffic control, lighting, crash hour)

In [None]:
X = final_df[['WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 'TRAFFIC_CONTROL_DEVICE', 'LIGHTING_CONDITION',
             'CRASH_HOUR']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
# scikit-learn metrics expect (y_true, y_pred)
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
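
The ROC AUC in this notebook is computed from the hard labels returned by `predict`. AUC is a ranking metric, so it is usually computed from scores such as `grid_search.predict_proba(X_test)[:, 1]`; hard labels collapse the ROC curve to a single operating point. The sketch below uses a hypothetical pairwise `auc` helper and made-up scores to show that the two inputs can give different values:

```python
def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 0, 1, 1, 1]
probabilities = [0.10, 0.30, 0.45, 0.40, 0.70, 0.90]  # made-up model scores
hard_labels = [int(p >= 0.5) for p in probabilities]  # thresholded at 0.5

auc_from_scores = auc(y_true, probabilities)
auc_from_labels = auc(y_true, hard_labels)

print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```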

### smote logistic regression (weather, roadway, device cond, lighting, crash hour)

In [None]:
X = final_df[['WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 'LIGHTING_CONDITION',
             'CRASH_HOUR', 'DEVICE_CONDITION']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{'logistic_regressor__max_iter': [50, 100, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
# scikit-learn metrics expect (y_true, y_pred)
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

### smote logistic regression (weather, roadway, device cond, lighting, crash hour) tuning

In [None]:
X = final_df[['WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 'LIGHTING_CONDITION',
             'CRASH_HOUR', 'DEVICE_CONDITION']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42))
])

In [None]:
param_grid = [{
                'logistic_regressor__max_iter': [100, 150, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2'],
               'logistic_regressor__solver': ['newton-cg', 'lbfgs', 'sag']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
print(grid_search.best_params_)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
# scikit-learn metrics expect (y_true, y_pred)
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
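
The tuning grid in this section crosses four `max_iter` values, two penalties, and three solvers, and `cv=5` fits each combination on five folds. A quick count using only the standard library (the grid values are copied from the tuning cell; `cv_folds` is a name introduced here for illustration) shows how much work the search performs:

```python
from itertools import product

param_grid = {
    "logistic_regressor__max_iter": [100, 150, 250, 500],
    "logistic_regressor__penalty": ["none", "l2"],
    "logistic_regressor__solver": ["newton-cg", "lbfgs", "sag"],
}
cv_folds = 5

# Every combination of values the grid search will try
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds  # excludes the final refit on the full training set

print(n_candidates, n_fits)
```

Because the resampling steps rerun inside every fit, trimming the grid (for example, fixing `max_iter` as the later sections do) reduces the runtime proportionally.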

### smote logistic regression (weather, roadway, device cond, lighting, crash hour, crash month) 

In [None]:
X = final_df[['WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 'LIGHTING_CONDITION',
             'CRASH_HOUR', 'DEVICE_CONDITION', 'CRASH_MONTH']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42, max_iter=250))
])

In [None]:
param_grid = [{
              # 'logistic_regressor__max_iter': [100, 150, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2'],
               'logistic_regressor__solver': ['newton-cg', 'lbfgs', 'sag']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
print(grid_search.best_params_)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
# scikit-learn metrics expect (y_true, y_pred)
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

### smote logistic regression (weather, roadway, device cond, lighting, crash hour, bac_result value)

In [None]:
X = final_df[['WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 'LIGHTING_CONDITION',
             'CRASH_HOUR', 'DEVICE_CONDITION', 'BAC_RESULT VALUE']]
y = final_df['INJURY_CLASSIFICATION']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=11)

In [None]:
# Create a column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND'])
], remainder='passthrough')

over = SMOTE(sampling_strategy='minority')
under = RandomUnderSampler(sampling_strategy='not minority')

# Create a pipeline containing the column transformer and model
pipeline = imb_Pipeline(steps=[
    ('col_transformer', col_transformer),
    ('o', over),
    ('u', under),
    ('scaler', StandardScaler()),
    ('logistic_regressor', LogisticRegression(random_state=42, max_iter=250))
])

In [None]:
param_grid = [{
              # 'logistic_regressor__max_iter': [100, 150, 250, 500],
              # 'logistic_regressor__C': [1e-10, 1e-100],
               'logistic_regressor__penalty': ['none', 'l2'],
               'logistic_regressor__solver': ['newton-cg', 'lbfgs', 'sag']
              }]

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5
                           )

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)
print(grid_search.best_params_)
cv_score_smoted_log = grid_search.best_score_
test_score_smoted_log = grid_search.score(X_test, y_test)

In [None]:
cv_score_smoted_log, test_score_smoted_log

In [None]:
# scikit-learn metrics expect (y_true, y_pred)
confusion_matrix(y_test, y_pred)

In [None]:
roc_auc_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))