# Imports

In [1]:
#import libraries
import pandas as pd
import numpy as np

#import our helper functions
import prepare, explore, evaluate


<p style="text-align:center;"><img src="https://raw.githubusercontent.com/RAXR-Capstone/project_danger_zone/master/work_space/Pictures/danger_banner.png" alt="Logo"></p>

# Project Danger Zone 
Brought to you by data scientists:
 - Xavier Carter
 - Robert Murphy
 - Anna Vu
 - Ray Zapata
<br>
<br>

## Our Mission
San Antonio is the 7th most populous city in the U.S. and one of the fastest growing. In Bexar County alone, there were nearly 50,000 car crashes in 2020. In those crashes, 16,780 people were injured and 200 died. With an increasing number of drivers on the roads, there is an ongoing need to keep people safe. Using 2021 San Antonio car accident data, Project Danger Zone looks into features that are likely to lead to casualties, so that insight can be delivered to entities such as TxDOT, Bexar County Public Works, insurance companies, and the general public.


## Executive Summary
- Location may matter for where an accident occurs, but location alone does not play a role in whether an accident results in injury.
- Where the car is damaged on impact and the type of car driven are key factors in predicting accident injury.
- SA Local Events: Spurs home games do not increase the risk of injury.
- SA Local Events: During Fiesta dates, there appears to be statistical evidence of an increase in accident injuries.
- SA Local Events: Over the 4th of July (late July 4th through early July 5th), more accidents were caused by intoxication than on normal days, and there was evidence to suggest a greater likelihood of being injured in a car accident.
- While optimizing for recall, we utilized ______ and were able to get an accuracy of _______ and a recall of _______, beating the baseline by _______

## Acquire 
- Using open-source information from https://app.myaccident.org/, we extracted the relevant information for car accidents.
- We also used the NHTSA API to extract the vehicle type from the broken vehicle identification number (VIN).
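
The VIN lookup can be sketched roughly as follows. This is a minimal illustration, not the project's acquisition code: it assumes the public NHTSA vPIC `DecodeVinValues` endpoint and a `VehicleType` field in the response, and the helper names (`build_vin_url`, `parse_vehicle_type`, `fetch_vehicle_type`) are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint: the public NHTSA vPIC "DecodeVinValues" route.
VPIC_URL = "https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValues/{vin}?format=json"


def build_vin_url(vin: str) -> str:
    """Return the decode URL for a VIN (whitespace stripped, upper-cased)."""
    return VPIC_URL.format(vin=quote(vin.strip().upper(), safe="*"))


def parse_vehicle_type(payload: dict) -> Optional[str]:
    """Pull the vehicle type out of a decoded-VIN payload, if present."""
    results = payload.get("Results") or []
    if not results:
        return None
    # "VehicleType" is the field name we assume the API returns.
    vehicle_type = (results[0].get("VehicleType") or "").strip()
    return vehicle_type or None


def fetch_vehicle_type(vin: str, timeout: float = 10.0) -> Optional[str]:
    """Query the API for one VIN. This makes a network request."""
    with urlopen(build_vin_url(vin), timeout=timeout) as response:
        return parse_vehicle_type(json.load(response))
```

Keeping the parsing separate from the request makes it possible to check the parsing logic against a saved response without calling the API.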

In [None]:
#bring in our accident data into a pandas dataframe
df = pd.read_csv('accident_data.csv')

In [None]:
#look at a sample of 3 entries
df.sample(3)

In [None]:
#info about our data
df.info()

In [None]:
#check the number of rows and columns
df.shape

In [None]:
#check for columns with nulls/missing values
df.isna().sum()

### Takeaways
- Drop any columns not relevant to the model
- Handle the accident factor column by extracting valuable information from the accident description
- Extract the time of day and the day name of the week from crash_date
- One-hot encoding variables may be useful

## Prepare / Feature Engineering 
- Building on what was seen above, we also:
    * removed duplicate observations
    * converted crash_date to an appropriate datetime format
    * used 24-hour time
    * dropped unnecessary columns
    * encoded variables, such as where the car was impacted in the crash, and accident factors
    * extracted features from crash_date
    * split the data into train and test sets for modeling
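
The steps above can be sketched on a toy frame. This is only an illustration of the kind of work `prepare.collision_data()` does, not its actual implementation: the `damage_zone` column, the date format, the 80/20 split, and the `prepare_sketch` helper are all assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the accident data (values are made up).
raw = pd.DataFrame({
    "crash_id": [1, 1, 2, 3, 4, 5],
    "crash_date": ["2021-07-04 11:15 PM", "2021-07-04 11:15 PM",
                   "2021-07-05 01:30 AM", "2021-04-20 05:45 PM",
                   "2021-03-01 08:05 AM", "2021-06-12 02:10 PM"],
    "damage_zone": ["front", "front", "rear", "left", "front", "rear"],
    "injury_class": [1, 1, 0, 0, 1, 0],
})


def prepare_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, parse dates, extract time features, and one-hot encode."""
    df = df.drop_duplicates().copy()
    # Parse the 12-hour strings; .dt.hour then yields 24-hour values.
    df["crash_date"] = pd.to_datetime(df["crash_date"], format="%Y-%m-%d %I:%M %p")
    df["crash_hour"] = df["crash_date"].dt.hour
    df["crash_day"] = df["crash_date"].dt.day_name()
    # One-hot encode where the car was impacted.
    return pd.get_dummies(df, columns=["damage_zone"], prefix="damage")


clean = prepare_sketch(raw)
train_demo, test_demo = train_test_split(clean, test_size=0.2, random_state=19)
print(clean[["crash_hour", "crash_day"]])
```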

In [None]:
#prepare our data for use, and split into train and test sets
train, test = prepare.collision_data()

In [None]:
#assure the shapes are reasonable
train.shape, test.shape

In [None]:
#look at a sample of our train data
train.sample(2)

In [None]:
#look at all of our columns, including new ones
train.columns

# Explore
- Looking at univariate distributions

### Univariate

In [None]:
#look at distributions of single variables
explore.get_distribution(train.drop(columns=['crash_date','crash_id','crash_latitude', 'crash_longitude','vehicle_id','fault_narrative']))

#### Takeaways
- The most prominent accident type involves only 2 cars.
- Each car contains 1 person the majority of the time.
- The most frequent accident cause is driver inattention, followed by distraction, then faulty maneuvers.
- There is a variety of car makes and colors within the data set. White cars, followed by black cars, appear in the most accidents over the last 6 months.
- In the last 6 months, roads with a 45 MPH speed limit, followed by 35 and 65 MPH, had the most accidents.
- Cars appear in the most accidents, followed by MPVs (multipurpose passenger vehicles, e.g. minivans and crossovers), then trucks.

### Bivariate

In [None]:
#look at how our variables compare to our target, injury_class
explore.compare_to_target(train.drop(columns=['crash_date','crash_id','crash_latitude', 'crash_longitude','vehicle_id','driver_age_bin','vehicle_year_bin','fault_narrative']),'injury_class')

#### Takeaways
- The data set is unbalanced with respect to the target variable, injury class
- Car impact location and airbag deployment still show possible significance; the other columns tend to be uniform across their respective categories
- Almost all motorcycle accidents end in injury.
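
Because the target is unbalanced, accuracy alone can look good while missing every injury. The sketch below uses made-up labels (8 no-injury crashes, 2 injury crashes) and hypothetical helper functions to show how a majority-class guess scores on accuracy versus recall.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Illustrative labels only, not project data.
toy_true = [0] * 8 + [1] * 2
majority_preds = [0] * len(toy_true)  # always predict "no injury"

print(accuracy(toy_true, majority_preds))  # 0.8
print(recall(toy_true, majority_preds))    # 0.0
```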

### Multivariate 
- Using a folium map (seen within the notebook), we could see visually whether location alone played a role in injury

In [None]:
#use folium map to visualize where accidents occur in San Antonio
m = explore.plot_map(train)
m

In [None]:
#heatmap to see the correlation between numeric variables and our target, injury_class
explore.get_heatmap(train, 'injury_class')

#### Takeaways
- The variables that correlated with accident injury were whether the air bag deployed and the vehicle occupant count
- Many of the variables did not correlate on their own; after clustering some of the variables together, we found more correlation with the target variable
- The clusters that take speed and damage into account tend to correlate better, which makes them more useful for modeling

## Does time of day play a role in accident injury?

In [None]:
#plot injury according to time of day, over the 7 days of the week
explore.plot_hour(train)

**Visual Takeaways**

- Early evenings on Monday, Tuesday, and Saturday show a decrease in injury rate before a sudden upward trend into the early hours of the following day
- Wednesday and Thursday seem to stay more consistently near the mean rate of injury than other days
- Sunday early morning has the highest rate of traffic injuries in the data
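
The rates behind a plot like this can be sketched with a pivot table. The frame below is toy data, and we are only assuming that the plotted quantity is the mean of `injury_class` per day and hour; `explore.plot_hour` may compute it differently.

```python
import pandas as pd

# Toy data using the crash_day, crash_hour, and injury_class column names.
toy = pd.DataFrame({
    "crash_day": ["Sunday", "Sunday", "Sunday", "Monday", "Monday", "Monday"],
    "crash_hour": [3, 3, 14, 3, 14, 14],
    "injury_class": [1, 1, 0, 0, 0, 1],
})

# Injury rate for each hour (rows) and day (columns).
rate_by_day_hour = toy.pivot_table(index="crash_hour", columns="crash_day",
                                   values="injury_class", aggfunc="mean")
# Overall mean rate, useful as a reference line.
overall_rate = toy["injury_class"].mean()

print(rate_by_day_hour)
print(overall_rate)
```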

### Hour

In [None]:
#look at hours
explore.time_breakdown(train)

**Hypothesis Testing**

H$_{0}$: The hour of the motor vehicle collision (MVC) is independent of whether an injury occurs or not.

In [None]:
# run chi_test to test for significance between crash_hour and injury_class
evaluate.chi_test(train.crash_hour, train.injury_class)

**Takeaways**

In the plot of injury percentage by hour of accident, there is a marked increase at 0300 hours, to 21%. $\chi^2$ testing shows a statistically significant relationship between the two categories, hour and whether injury occurs.
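
For readers without the helper module, a $\chi^2$ test of independence can be sketched as below. This is an assumption about what a helper like `evaluate.chi_test` might do, not its actual code: `chi_test_sketch` is a hypothetical name, the 0.05 significance level is our choice, and the data are made up.

```python
import pandas as pd
from scipy import stats


def chi_test_sketch(cat_a: pd.Series, cat_b: pd.Series, alpha: float = 0.05) -> dict:
    """Chi-square test of independence between two categorical series."""
    observed = pd.crosstab(cat_a, cat_b)  # contingency table of counts
    chi2, p_value, dof, _expected = stats.chi2_contingency(observed)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        "reject_null": bool(p_value < alpha),
    }


# Toy data: injuries are far more common at hour 3 than at hour 14.
toy_hours = pd.Series([3] * 20 + [14] * 20, name="crash_hour")
toy_injury = pd.Series([1] * 15 + [0] * 5 + [1] * 3 + [0] * 17, name="injury_class")

result = chi_test_sketch(toy_hours, toy_injury)
print(result)
```

A small p-value leads us to reject the null hypothesis of independence; it does not by itself say which hours drive the difference.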

### Day of Week

In [None]:
#plot the days of the week and injuries
explore.plot_dow(train)

**Hypothesis Testing**

H$_{0}$: The day of the week is independent of whether or not injury occurs as a result of an MVC.

In [None]:
#run chi_test to test for significance between crash_day and injury_class
evaluate.chi_test(train.crash_day, train.injury_class)

**Takeaways**

Despite the visual percentage of injuries across days of the week showing increased rates on certain days over others, the statistical testing failed to reject the null hypothesis of independence between the two variables.

## Do special events play a role in car accident injury?

### Spurs Home Games (20-21 season) 

In [None]:
#plot where accidents occur during times after noon on Spurs home game days
explore.plot_spurs(train)

**Hypothesis Testing**

H$_{0}$: Injury occurring as a result of MVCs is independent of being near the AT&T Center on days of Spurs home games.

In [None]:
#run chi_test to see if injury is independent of a crash being closer to the AT&T Center (att_1) on Spurs home game days
evaluate.chi_test(train[train.spurs].att_1, train[train.spurs].injury_class)

In [None]:
#run chi_test to see if injury is independent of a crash being around the AT&T Center (att_2) on Spurs home game days
evaluate.chi_test(train[train.spurs].att_2, train[train.spurs].injury_class)

In [None]:
#run chi_test to see if there is a significant difference in injuries reported on spurs home game days
evaluate.chi_test(train.spurs, train.injury_class)

In [None]:
#what about intoxication on these days?
evaluate.chi_test(train[train.spurs].fault_intoxication, train[train.spurs].injury_class)

**Takeaways**
- The statistical testing failed to reject the null hypothesis of independence between the two variables.

### Fiesta 2021

In [None]:
#let's look at accidents on Fiesta days
explore.plot_fiesta(train)

In [None]:
#run chi_test to see if fiesta accidents are independent of reported injuries
evaluate.chi_test(train.fiesta, train.injury_class)

In [None]:
#among intoxicated drivers, is Fiesta independent of reported injuries?
intoxicated = train[train.fault_intoxication == 1]
evaluate.chi_test(intoxicated.fiesta, intoxicated.injury_class)

In [None]:
#Are intoxicated drivers independent of Fiesta?
evaluate.chi_test(train.fiesta, train.fault_intoxication)

### 4th of July Weekend

In [None]:
#accidents that happen July 4th - July 5th
explore.plot_the_4th(train)

In [None]:
#testing for independence between July 4th and reported injuries?
evaluate.chi_test(train.jul_fourth, train.injury_class)

In [None]:
# Are July 4th and intoxicated drivers independent?
evaluate.chi_test(train.jul_fourth, train.fault_intoxication)

In [None]:
# Are July 4th and speeding drivers independent?
evaluate.chi_test(train.jul_fourth, train.fault_speed)

## Does accident injury depend on what side of town you are on? 
- This question is best approached with clustering, using latitude and longitude to plot the location of accidents
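
As a rough illustration of location clustering, the sketch below groups synthetic coordinates with k-means. The choice of `KMeans`, the two clusters, and the made-up centers are all assumptions for the example; `explore.create_scatter_plot` may use a different algorithm or number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(19)

# Synthetic (longitude, latitude) points around two made-up centers.
centers = np.array([[-98.60, 29.55], [-98.40, 29.35]])
coords = np.vstack([c + rng.normal(scale=0.01, size=(50, 2)) for c in centers])

# Assign each point to one of two location clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=19)
labels = kmeans.fit_predict(coords)

print(np.bincount(labels))
```

The resulting labels can then be treated as a categorical variable and tested against the target for independence.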

In [None]:
#plot crash longitude on x axis, latitude on y axis and cluster them
X = train[['crash_longitude', 'crash_latitude']]
train = explore.create_scatter_plot('crash_longitude', 'crash_latitude', train, X)

In [None]:
# is there a difference in reported injuries between any of the clusters of the town?
evaluate.chi_test(train.location_cluster, train.injury_class)

In [None]:
#we won't use it anymore
train = train.drop(columns='location_cluster')

### Other significant clusters that were created

#### Does speeding, depending on the speed limit, play a role in accident Injury?

In [None]:
#test for independence with our speed_speed_lm cluster and injury class
evaluate.chi_test(train.speed_speed_lm, train.injury_class)

#### Does speeding, when failing to yield, play a role in accident injury?

In [None]:
#test for independence with our speed_yield_occu cluster and injury class
evaluate.chi_test(train.speed_yield_occu, train.injury_class)

#### Do where the car was struck, and whether or not the airbags deployed, play a role in accident injury?

In [None]:
#test for independence with our damage_air cluster and injury class
evaluate.chi_test(train.damage_air, train.injury_class)

# Modeling 

In [None]:
#bring in our modeling
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from imblearn.over_sampling import SMOTE

In [None]:
# split train data to X, y
X_train = train.select_dtypes(np.number).drop(columns=['injury_class', 'injury_crash_total','day_num'])
y_train = train.injury_class

In [None]:
# split test data to X, y
X_test = test.select_dtypes(np.number).drop(columns=['injury_class', 'injury_crash_total'])
y_test = test.injury_class

In [None]:
# create dummy class object
dummy = DummyClassifier(strategy='most_frequent')
# fit dummy for most frequent target class
dummy.fit(X_train, y_train)
# create baseline array
baseline = pd.Series(dummy.predict(X_train), index=X_train.index)
print('Baseline Score')
dummy.score(X_train, y_train)

In [None]:
#print scores for baseline
evaluate.classifier_scores(y_train, baseline)

#### SMOTE
- Utilizing the Synthetic Minority Oversampling Technique (SMOTE) in order to rebalance the dataset.
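
The core idea behind SMOTE is to create new minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbors. The sketch below is a simplified NumPy illustration of that idea, not imblearn's implementation; `smote_like_oversample` is a hypothetical helper and assumes at least two minority rows.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=19):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbors than other rows
    # Pairwise Euclidean distances among the minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base row and one of its k nearest neighbors.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the synthetic row at a random point on the segment between them.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(minority, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.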

In [None]:
# create smote object
sm = SMOTE(random_state=19)
# fit and resample train data
X_sm, y_sm = sm.fit_resample(X_train, y_train)

### Logistic Regression

In [None]:
# create model with gridsearch params
logit = LogisticRegression(class_weight=None,
                           dual=True,
                           multi_class='auto',
                           penalty='l2',
                           random_state=19,
                           solver='liblinear')

In [None]:
# get top 20 RFE recommended features
rfe_selected = evaluate.get_rfe_selected(X_sm, y_sm, logit).Feature.tolist()
# fit classifier with recommended features
logit.fit(X_sm[rfe_selected], y_sm)
# create array of predictions
y_preds = logit.predict(X_sm[rfe_selected])

In [None]:
# evaluate logistic 
evaluate.classifier_scores(y_sm, y_preds)

### K-nearest Neighbors

In [None]:
from sklearn.feature_selection import SelectKBest
# use SelectKBest for recommended features (defaults: k=10, f_classif)
kbest = SelectKBest()
kbest.fit(X_sm, y_sm)
kbest_selected = X_sm.columns[kbest.get_support()]
# create model with gridsearch params
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# fit classifier with recommended features
knn.fit(X_sm[kbest_selected], y_sm)
# create array of predictions
y_preds = knn.predict(X_sm[kbest_selected])

In [None]:
#evaluate knn
evaluate.classifier_scores(y_sm, y_preds)

### Random Forest

In [None]:
forest = RandomForestClassifier(class_weight='balanced_subsample',
                                criterion='gini',
                                max_depth=10,
                                min_samples_leaf=1,
                                n_estimators=100,
                                random_state=19)

In [None]:
# get top 20 RFE recommended features
rfe_selected = evaluate.get_rfe_selected(X_sm, y_sm, forest).Feature.tolist()
# fit classifier with recommended features
forest.fit(X_sm[rfe_selected], y_sm)
# create array of predictions
y_preds = forest.predict(X_sm[rfe_selected])

In [None]:
#evaluate random forest
evaluate.classifier_scores(y_sm, y_preds)

### Best Model on Test

In [None]:
#predict on test, and evaluate
y_preds = forest.predict(X_test[rfe_selected])
evaluate.classifier_scores(y_test, y_preds)