# Team Introduction
Our group consists of Braden Anderson, Hien Lam, and Tavin Weeda.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Read in data from github
url_accident = "https://github.com/BradenAnderson/Accident_Severity_Prediction/blob/main/Data/accident.csv.gz?raw=tr"
url_vehicle = "https://github.com/BradenAnderson/Accident_Severity_Prediction/blob/main/Data/vehicle.csv.gz?raw=tr"
url_person = "https://github.com/BradenAnderson/Accident_Severity_Prediction/blob/main/Data/person.csv.gz?raw=tr"

accident = pd.read_csv(url_accident,compression='gzip')
vehicle = pd.read_csv(url_vehicle, compression='gzip', low_memory=False, encoding="ISO-8859-1")
person = pd.read_csv(url_person, compression='gzip', low_memory=False, encoding="ISO-8859-1")

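# Keep the first-listed person (PER_NO == 1) in the first vehicle (VEH_NO == 1),
# treated here as the driver, so each case contributes one person and one vehicle
person = person.loc[(person.VEH_NO == 1) & (person.PER_NO == 1)]
vehicle = vehicle.loc[vehicle.VEH_NO == 1]

# Left join person with vehicle, then with accident, on CASENUM.
# Duplicate CASENUM rows are dropped from the right-hand tables first,
# so the joins cannot add rows.
df = person.merge(vehicle.drop_duplicates(subset=['CASENUM']), on='CASENUM', how='left')
df = df.merge(accident.drop_duplicates(subset=['CASENUM']), on='CASENUM', how='left')

# Select the columns used in the analysis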
df = df[['REGIONNAME','URBANICITYNAME','BODY_TYPNAME_x', 'MOD_YEARNAME_x','VTRAFWAYNAME','VNUM_LANNAME','VSURCONDNAME','VTRAFCONNAME','TYP_INTNAME','INT_HWYNAME','WEATHERNAME',
        'WKDY_IMNAME', 'RELJCT1_IMNAME','LGTCON_IMNAME','MAXSEV_IMNAME','ALCHL_IMNAME','AGE_IM','SEX_IMNAME','TRAV_SP','REST_USENAME','PCRASH1_IMNAME','HOUR_IMNAME','VSPD_LIM',
        'HOUR_IM']]

df = df.rename(columns=str.lower)
df.shape

In [None]:
# Check for NA values
df.isnull().sum()

In [None]:
# Drop rows with NA values, since they make up only a small share of the dataset
df.dropna(inplace=True)

In [None]:
# Confirm NA values are removed
df.isnull().sum().sum()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.head()
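
The filter-then-join logic above can be checked on a toy example. The sketch below uses small hypothetical tables (all values are made up; only the `CASENUM`, `VEH_NO`, and `PER_NO` column names come from the cells above) and the `validate` argument of `DataFrame.merge` to confirm that the left joins do not multiply rows. The `toy_*` names are illustrative and chosen so the real `person`, `vehicle`, and `accident` frames are not overwritten.

```python
import pandas as pd

# Hypothetical toy tables; only the key column names mirror the cells above.
toy_person = pd.DataFrame({
    "CASENUM": [1, 1, 2, 3],
    "VEH_NO":  [1, 2, 1, 1],
    "PER_NO":  [1, 1, 1, 1],
    "AGE_IM":  [34, 51, 27, 62],
})
toy_vehicle = pd.DataFrame({
    "CASENUM": [1, 1, 2, 3],
    "VEH_NO":  [1, 2, 1, 1],
    "TRAV_SP": [45, 30, 55, 25],
})
toy_accident = pd.DataFrame({
    "CASENUM": [1, 2, 3],
    "WEATHERNAME": ["Clear", "Rain", "Clear"],
})

# Same filters as above: first person of the first vehicle in each case.
toy_drivers = toy_person.loc[(toy_person.VEH_NO == 1) & (toy_person.PER_NO == 1)]
toy_first_vehicle = toy_vehicle.loc[toy_vehicle.VEH_NO == 1]

# validate="many_to_one" raises MergeError if CASENUM is not unique on the right side.
toy = toy_drivers.merge(
    toy_first_vehicle.drop_duplicates(subset=["CASENUM"]),
    on="CASENUM", how="left", validate="many_to_one",
)
toy = toy.merge(
    toy_accident.drop_duplicates(subset=["CASENUM"]),
    on="CASENUM", how="left", validate="many_to_one",
)

# A left join against unique keys keeps exactly one row per driver.
assert len(toy) == len(toy_drivers)
print(toy.shape)
```

Because both toy tables carry `VEH_NO`, pandas suffixes the overlapping column as `VEH_NO_x` and `VEH_NO_y`, which is the same mechanism that produces names such as `BODY_TYPNAME_x` in the real data.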

# Business Understanding 1
- Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). 
- How will you measure the effectiveness of a good algorithm? 
- Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?  
(10)

# Data Understanding 1
- Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. 
- Verify data quality: Are there missing values? 
- Duplicate data? Outliers? Are those mistakes? 
- How do you deal with these problems?  
(10)
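
One possible way to answer the data-quality questions above is a small per-column report. The sketch below is illustrative only: `quality_report` is a hypothetical helper, the 1.5 × IQR rule is just one common outlier heuristic (an assumption, not necessarily the rule used in this lab), and the toy values are made up. Only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value counts, and IQR-based outlier counts."""
    report = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean() * 100,
    })
    # Outliers are only defined for numeric columns.
    numeric = frame.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    # Non-numeric columns get NaN here because they have no entry in `outside`.
    report["n_iqr_outliers"] = outside.sum()
    return report

# Made-up values; column names follow the dataset.
toy = pd.DataFrame({
    "age_im": [22, 35, 41, 38, 29, 120],
    "trav_sp": [30.0, np.nan, 45.0, 50.0, 40.0, 35.0],
    "sex_imname": ["Male", "Female", "Female", "Male", None, "Male"],
})

toy_report = quality_report(toy)
n_duplicate_rows = int(toy.duplicated().sum())
print(toy_report)
print("duplicate rows:", n_duplicate_rows)
```

A flagged value is only a candidate for review: whether an age of 120 is a recording mistake or a coded value has to be checked against the data dictionary.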

## Section Summary:
- words
- words
- words
- words

## Comprehensive Data Cleaning Steps
1. Read, merged, and filtered the data (54427 rows, 24 columns), then handled missing values, resulting in (54473, 24)
2. Inspected each feature, noted cleaning steps if necessary (features from lab 1 only)
3. Carefully binned existing features (`body_type_binned`, `intersection_binned`, `weather_binned`)
4. Derived new features (`speeding_status`, `hour_binned`, `restraint_binned`, `pcrash1_imname`)
5. Binned and visualized response variable for classification task, `maxsev_binned`
6. Visualized response variable for regression task, `age_im`
7. Confirmed there are zero duplicates and NA values (50535, 24)
8. Conducted data preprocessing:
  - Split the full dataset into 80% train / 20% test, then performed 10-fold stratified cross-validation within the 80% train set.
  - Scaled `age_im` and `mod_yearname_x` in both the train and test sets (the transformer was fit on the train set only to prevent data leakage)
  - Ordinal-encoded and one-hot-encoded features in both the train and test sets
  - Imputed and scaled `trav_sp` in both the train and test sets (the transformer was fit on the train set only to prevent data leakage)
9. Removed variables that were not needed/useful: 
  - `makename` (too many levels)
  - `wrk_zonename` (99% of one class)
  - `hour_imname` (unnecessary with `hour_binned`)
  - `vspd_lim` (unnecessary with `speeding_status`)
  - `body_typname_x` (unnecessary with `body_type_binned`)
  - `weathername` (unnecessary with `weather_binned`)
  - `typ_intname` (unnecessary with `intersection_binned`)
  - `maxsev_imname` (unnecessary with `maxsev_binned`)
  - `trav_sp` (unnecessary with `trav_sp_imputed`)
10. Final dataset dimension 50535 rows, 119 columns (prior to train/test split)
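
Step 8 can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual preprocessing code: the median imputation strategy, the stratified hold-out split, the `random_state` values, and the category labels are all assumptions. Only the column names, the 80/20 split, and the 10-fold stratified cross-validation come from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names follow the list above, values are made up.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age_im": rng.integers(16, 90, size=n),
    "mod_yearname_x": rng.integers(1995, 2021, size=n),
    "trav_sp": rng.choice([25.0, 45.0, 65.0, np.nan], size=n),
    "weather_binned": rng.choice(["clear", "rain", "other"], size=n),
    "maxsev_binned": rng.choice(["no_injury", "injury"], size=n),
})
X = toy.drop(columns="maxsev_binned")
y = toy["maxsev_binned"]

# 80% train / 20% test; stratifying the hold-out split is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["age_im", "mod_yearname_x"]),
        ("impute_scale",
         Pipeline([("impute", SimpleImputer(strategy="median")),  # assumed strategy
                   ("scale", StandardScaler())]),
         ["trav_sp"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["weather_binned"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the train set only, then apply the fitted transformers to the test set.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

# 10-fold stratified cross-validation inside the 80% train set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X_train, y_train)]
print(X_train_prepared.shape, X_test_prepared.shape, fold_sizes)
```

When cross-validating a model, wrapping `preprocess` and the estimator in a single `Pipeline` lets scikit-learn refit the transformers on each training fold, which keeps the validation folds out of the fit as well.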

### Read, clean the data

# Data Understanding 2
- Visualize any important attributes appropriately. 
- Important: Provide an interpretation for any charts or graphs.  
(10)

# Model and Evaluation 1
- Train and adjust parameters  
(10)

# Model and Evaluation 2
- Evaluate and compare  
(10)

# Model and Evaluation 3
- Visualize results  
(10)

# Model and Evaluation 4
- Summarize the ramifications  
(20)

# Deployment
- Be critical of your performance and tell the reader how your current model might be usable by other parties. 
- Did you achieve your goals? If not, can you rein in the utility of your modeling? 
- How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)? 
- How would you deploy your model for interested parties? 
- What other data should be collected? 
- How often would the model need to be updated, etc.?  
(10)

# Exceptional Work
- You have free rein to provide additional analyses or combine analyses.  
(10)

# Appendix
Data dictionary referenced from [CRSS Analytical Users Manual](https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813236)