In [1]:
%reload_ext nb_black

## Day 24 Lecture 2 Assignment

In this assignment, we will build a more complex logistic regression model, this time on both numeric and categorical data. We will use the Chicago traffic crashes dataset loaded below and analyze the model fitted to it.

In [3]:
import numpy as np
import pandas as pd

import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
from sklearn.metrics import confusion_matrix

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
def missingness_summary(df, print_log=False, sort="none"):
    summary = df.apply(lambda x: x.isna().sum() / x.shape[0])

    if print_log:
        if sort == "none":
            print(summary)
        elif sort == "ascending":
            print(summary.sort_values())
        elif sort == "descending":
            print(summary.sort_values(ascending=False))
        else:
            print("Invalid value for sort parameter.")

    return summary

In [4]:
crash_data = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/traffic_crashes_chicago.csv"
)

In [5]:
crash_data.head()

Unnamed: 0,RD_NO,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,LANE_CNT,...,WORKERS_PRESENT_I,NUM_UNITS,MOST_SEVERE_INJURY,INJURIES_TOTAL,INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN
0,JC334993,7/4/2019 22:33,45,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,DIVIDED - W/MEDIAN BARRIER,,...,,,,,,,,,,
1,JC370822,7/30/2019 10:22,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,DIVIDED - W/MEDIAN (NOT RAISED),,...,,,,,,,,,,
2,JC387098,8/10/2019 17:00,25,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,ONE-WAY,,...,,1.0,,,,,,,,
3,JC395195,8/16/2019 16:53,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,,...,,1.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,JC396604,8/17/2019 16:04,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,PARKING LOT,,...,,1.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,1.0,0.0


First, create a binary response column by modifying the "DAMAGE" column. Consider "OVER \$1,500" to be the positive class, and all other damage categories (\$1,500 or less) to be the negative class.

In [6]:
crash_data["DAMAGE"]

0           OVER $1,500
1           OVER $1,500
2         $501 - $1,500
3         $501 - $1,500
4         $501 - $1,500
              ...      
372580    $501 - $1,500
372581      OVER $1,500
372582      OVER $1,500
372583      OVER $1,500
372584      OVER $1,500
Name: DAMAGE, Length: 372585, dtype: object

In [12]:
# answer goes here
crash_data["big_dmg"] = crash_data["DAMAGE"] == "OVER $1,500"

crash_data["big_dmg"] = crash_data["big_dmg"].astype(int)
crash_data["big_dmg"].value_counts(normalize=True)

1    0.563418
0    0.436582
Name: big_dmg, dtype: float64

Using the code from Day 21, Lecture 1 as a starting point, devise an appropriate way to address missing values. You have a lot of freedom here; we will proceed by taking the following steps:

- Dropping all columns with more than 5% missing data
- Imputing the median for numeric columns with less than 5% missing data (except for STREET_NO; imputing it in this manner would not make any sense)
- Dropping rows with missing data for categorical columns that have less than 5% missing data

In [23]:
# answer goes here
# Fraction of missing values per column
missing_ratio = crash_data.isna().mean()

# Drop every column with more than 5% missing values
crash_data_filtered = crash_data.drop(
    columns=missing_ratio[missing_ratio > 0.05].index
)

# Print the dtype of each remaining column that still has missing values,
# to see which ones need median imputation and which need row dropping
for col in crash_data_filtered.columns:
    if missing_ratio[col] > 0:
        print(crash_data[col].dtype)
crash_data_filtered.isna().mean().sort_values(ascending=False)

object
object
object
float64
float64
object
float64
float64
float64
float64
float64
float64
float64


REPORT_TYPE                      0.023012
MOST_SEVERE_INJURY               0.005795
INJURIES_UNKNOWN                 0.005776
INJURIES_NO_INDICATION           0.005776
INJURIES_REPORTED_NOT_EVIDENT    0.005776
INJURIES_NON_INCAPACITATING      0.005776
INJURIES_INCAPACITATING          0.005776
INJURIES_FATAL                   0.005776
INJURIES_TOTAL                   0.005776
NUM_UNITS                        0.003755
BEAT_OF_OCCURRENCE               0.000011
STREET_DIRECTION                 0.000005
STREET_NAME                      0.000003
LIGHTING_CONDITION               0.000000
WEATHER_CONDITION                0.000000
DEVICE_CONDITION                 0.000000
big_dmg                          0.000000
TRAFFIC_CONTROL_DEVICE           0.000000
POSTED_SPEED_LIMIT               0.000000
CRASH_DATE                       0.000000
TRAFFICWAY_TYPE                  0.000000
FIRST_CRASH_TYPE                 0.000000
DATE_POLICE_NOTIFIED             0.000000
ALIGNMENT                        0
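
The output above still shows missing values in several columns, so the imputation and row-dropping steps remain to be applied. A minimal sketch of all three steps follows. The helper name `handle_missing`, its parameters, and the toy frame are assumptions made for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def handle_missing(df, threshold=0.05, skip_impute=("STREET_NO",)):
    """Hypothetical helper: apply the three missing-data steps to a copy of df."""
    missing_ratio = df.isna().mean()

    # 1. Drop columns whose missing fraction exceeds the threshold.
    out = df.drop(columns=missing_ratio[missing_ratio > threshold].index)

    # 2. Median-impute the remaining numeric columns, except the skipped ones.
    numeric_cols = [
        col
        for col in out.select_dtypes(include="number").columns
        if col not in skip_impute
    ]
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # 3. Drop rows that are still missing a value in any non-numeric column.
    categorical_cols = list(out.select_dtypes(exclude="number").columns)
    return out.dropna(subset=categorical_cols)


# Toy data (made up) with 40 rows, so one missing value is 2.5% of a column.
toy = pd.DataFrame(
    {
        "LANE_CNT": [2.0] + [np.nan] * 39,  # mostly missing: column dropped
        "NUM_UNITS": [np.nan] + [2.0] * 39,  # 2.5% missing: median-imputed
        "REPORT_TYPE": ["ON SCENE"] * 39 + [None],  # 2.5% missing: row dropped
        "WEATHER_CONDITION": ["CLEAR"] * 40,
    }
)
clean = handle_missing(toy)
print(clean.isna().sum())
```

On the real data this would be called as `handle_missing(crash_data)`; numeric columns listed in `skip_impute` are left untouched and may still contain missing values.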

Finally, choose a few numeric and categorical features (2-3 of each) to include in the model. (You can include more than this, but too many features, especially categorical ones, will most likely lead to convergence issues.) One-hot encode the chosen categorical features, being sure to omit one category of each (which will serve as the "reference" level) to avoid perfect multicollinearity.

Again, you have a lot of freedom here; we will proceed with the following features, dropping the most commonly occurring category for the two categorical variables ("CLEAR" for weather, "REAR END" for first crash type):
POSTED_SPEED_LIMIT, WEATHER_CONDITION, INJURIES_TOTAL, FIRST_CRASH_TYPE
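
One possible way to encode the categorical features while dropping the most frequent level is sketched below. The toy rows are made up for illustration; only the column names and the "CLEAR" and "REAR END" reference levels come from the text above.

```python
import pandas as pd

# Toy stand-in for the four chosen features (values are illustrative only).
toy = pd.DataFrame(
    {
        "POSTED_SPEED_LIMIT": [30, 30, 25, 45, 30, 35],
        "INJURIES_TOTAL": [0.0, 1.0, 0.0, 2.0, 0.0, 0.0],
        "WEATHER_CONDITION": ["CLEAR", "RAIN", "CLEAR", "SNOW", "CLEAR", "RAIN"],
        "FIRST_CRASH_TYPE": [
            "REAR END",
            "TURNING",
            "REAR END",
            "ANGLE",
            "PARKED MOTOR VEHICLE",
            "REAR END",
        ],
    }
)

categorical = ["WEATHER_CONDITION", "FIRST_CRASH_TYPE"]

# One indicator column per level, named "<column>_<level>".
X = pd.get_dummies(toy, columns=categorical, dtype=int)

# Drop the most frequent level of each variable so it becomes the reference.
reference_cols = [f"{col}_{toy[col].mode()[0]}" for col in categorical]
X = X.drop(columns=reference_cols)

print(reference_cols)
print(X.columns.tolist())
```

`pd.get_dummies(..., drop_first=True)` would also avoid perfect multicollinearity, but it drops the alphabetically first level rather than the most common one, which is why the reference columns are dropped by name here.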

In [0]:
# answer goes here





Split the data into train and test sets, with 80% for training and 20% for testing. By default, a statsmodels LR model does not include an intercept term; add a constant column to the training data so that an intercept term is estimated for the LR model (hint: sm.add_constant() is a useful function to accomplish this).

In [0]:
# answer goes here





Fit the logistic regression model using the statsmodels package and print out the coefficient summary. Which variables (in particular, which categories of our categorical variables) appear to be the most important, and what effect do they have on the probability of a crash resulting in more than \$1,500 in damage?

In [0]:
# answer goes here





As we did in the previous exercise, make predictions on the test set and join them to the corresponding true outcomes, then use the *calibration_curve* function in scikit-learn to plot a calibration curve. Is the model well-calibrated?
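
A sketch of the calibration plot, using simulated probabilities in place of real model output (the arrays `y_prob` and `y_true` and the choice of 10 bins are assumptions; with a fitted statsmodels result, the probabilities would come from its `predict` method applied to the test features):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Simulated predicted probabilities and outcomes drawn to match them, so this
# toy "model" is well calibrated by construction.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=5000)
y_true = (rng.random(5000) < y_prob).astype(int)

# Observed fraction of positives and mean predicted probability in each bin.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.show()
```

A well-calibrated model tracks the dashed diagonal. Points below the diagonal indicate predicted probabilities that are too high, and points above it indicate probabilities that are too low.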

In [0]:
# answer goes here



