# Introduction
Road accidents are dangerous: approximately 1.35 million people die in them each year, according to the Association for Safe International Road Travel (ASIRT) [1].

Given a large dataset and machine learning, it should be possible to predict the severity of car accidents. Once deployed, such a model could help policy makers decide on measures that make road travel safer (and that push more people toward public transport).

# Data
The City of Seattle has collected data on road incidents over a long period of time. The dataset contains almost 200,000 data points, each including a measure of severity. From this dataset, several features are selected to predict the severity.

Because of how the dataset is structured, many of its features describe information that cannot be known before an accident happens. For instance, an accident involving many cars is often severe, but the number of cars involved is not known until the accident has occurred. Such a variable could serve as another prediction target, but it cannot be used as a feature in a deployed model, which must rely only on information available beforehand.

For this reason, the following features were selected:
- Place of accident, since accidents at intersections might be more severe than a simple rear-end collision.
- Road condition, since weather can make roads slippery and some roads are in need of repair.
- Light condition, since low visibility reduces the ability to mitigate the severity.
- Weather, which is probably correlated with road condition but could still be useful.
- Incident time, which will be used to engineer features for the time of day and for the prevalence of accidents in different months.

Most of these are categorical variables, which unfortunately makes deployment harder: something has to decide which category applies, for example which light condition holds at a given moment. A sensor measurement of light would be a more useful input, but no such data is available yet; this model could be transferred to such a setting later.

# Methodology
First, features will be engineered from the raw data, such as the weekday and a dummy variable for the weekend.
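
As a rough illustration of this step, the sketch below derives time-based features with pandas. The column name `INCDTTM` and the three timestamps are assumptions made for the example, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the incident data; the column name "INCDTTM" is an assumption.
df = pd.DataFrame({
    "INCDTTM": [
        "2019-03-02 14:30:00",
        "2019-03-04 08:05:00",
        "2019-12-29 23:45:00",
    ],
})

# Parse the incident time into proper datetime values.
incident_time = pd.to_datetime(df["INCDTTM"])

# Derive calendar features from the timestamp.
df["hour"] = incident_time.dt.hour
df["weekday"] = incident_time.dt.dayofweek  # Monday=0 ... Sunday=6
df["month"] = incident_time.dt.month

# Dummy variable: 1 for Saturday or Sunday, 0 otherwise.
df["is_weekend"] = (df["weekday"] >= 5).astype(int)

print(df)
```

The weekday and month columns are still categorical in nature, so they would likely go through the same encoding as the other categorical features.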

The data will then be split into a training set and a testing set, and the following models will be trained on the training set:
1. Random Forest
1. Support Vector Machine
1. Logistic Regression

These models are used because the task is predictive classification. They perform well with little to no hyperparameter tuning and don't require much computing power.
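
A minimal sketch of the split and the three candidate models, assuming scikit-learn is the library of choice. The synthetic data, the 80/20 split ratio, and the random seeds are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 200 rows, 4 numeric features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out 20% for testing; stratify keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Three candidate classifiers with (mostly) default hyperparameters.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the training set only.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}

# Mean accuracy of each model on the held-out testing set.
test_accuracy = {name: model.score(X_test, y_test) for name, model in fitted.items()}
for name, score in test_accuracy.items():
    print(f"{name}: {score:.3f}")
```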

During training, a pipeline that one-hot encodes the categorical variables will be applied. All rows containing NaN values are dropped, since the data still contains plenty of data points afterwards.
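
The following is one possible way to combine these two steps with scikit-learn's `Pipeline` and `ColumnTransformer`. The column names, category values, and the choice of Logistic Regression as the final step are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data; column names and category values are illustrative assumptions.
df = pd.DataFrame({
    "road_condition": ["Dry", "Wet", np.nan, "Wet", "Dry", "Ice"],
    "light_condition": ["Daylight", "Dark", "Dark", np.nan, "Daylight", "Dark"],
    "severity": [1, 2, 2, 1, 1, 2],
})
categorical = ["road_condition", "light_condition"]

# Drop every row with a missing value in a feature or in the target.
clean = df.dropna(subset=categorical + ["severity"])

# One-hot encode the categorical columns, then fit a classifier on the result.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)]
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(clean[categorical], clean["severity"])

predictions = pipeline.predict(clean[categorical])
print(len(df), "rows before,", len(clean), "rows after dropping NaN")
print(predictions)
```

Keeping the encoder inside the pipeline means it is fitted on the training data only and then reused unchanged on the testing set.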

Each trained model is then evaluated on the testing set that was set aside.

## Evaluation
From the testing set, the following evaluation metrics will be used:

#### All
- Accuracy, the fraction of classifications that are correct
- Jaccard similarity index, the overlap between predicted and true labels divided by their union
- F1 score, the harmonic mean of precision and recall

Using several metrics also provides tiebreakers: if two models score the same on one metric, another metric can separate them.
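
The three metrics and the tiebreak rule can be sketched as follows with scikit-learn. The labels, predictions, model names, and scores are all hypothetical, and the `weighted` averaging and the priority order (accuracy, then Jaccard, then F1) are assumptions rather than choices stated above.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Hypothetical true labels and predictions for a two-class severity target.
y_true = [1, 1, 2, 2, 1, 2, 1, 1]
y_pred = [1, 1, 2, 1, 1, 2, 2, 1]

accuracy = accuracy_score(y_true, y_pred)
# "weighted" averages the per-class scores, weighted by class frequency.
jaccard = jaccard_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")
print(f"accuracy={accuracy:.3f} jaccard={jaccard:.3f} f1={f1:.3f}")

# Tiebreaking: store scores as (accuracy, jaccard, f1) tuples. Python compares
# tuples element by element, so a later metric only matters when the earlier
# ones are equal. The model names and numbers below are made up.
scores = {
    "model_a": (0.75, 0.60, 0.74),
    "model_b": (0.75, 0.62, 0.75),
}
best = max(scores, key=scores.get)
print("best:", best)
```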

#### Just Logistic Regression
- Log Loss

Log loss is computed from predicted probabilities, which Logistic Regression produces directly, so it will be reported as well. Since it is not reported for the other models, it serves mostly for transparency, and as a final reference should Logistic Regression be tied with another model on all other metrics.
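
A small sketch of how log loss could be computed from the predicted probabilities, assuming scikit-learn. The synthetic data and the 150/50 split are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic stand-in data: 200 rows, 3 numeric features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Train on the first 150 rows, evaluate on the remaining 50.
model = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])

# predict_proba returns one probability per class for every row.
proba = model.predict_proba(X[150:])

# Log loss penalizes confident wrong predictions; lower is better.
loss = log_loss(y[150:], proba, labels=model.classes_)
print(f"log loss: {loss:.3f}")
```

For a two-class problem, always predicting 0.5 gives a log loss of ln 2 (about 0.693), which is a useful baseline when reading the number.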

### Preprocessing pipeline
All categorical variables will be one-hot encoded.


## Sources
[1] https://www.asirt.org/safe-travel/road-safety-facts/

[2] https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf
