# Accident Severity Prediction to reduce Road Traffic Accidents in Seattle, WA - IBM Capstone Project
<br>
<br>
<div><img src="https://www.bekinsmovingandstorage.com/wp-content/uploads/2016/03/SeattleCity2-1024x682.jpg"
     alt="Seattle" align="center"
     style="float: left; margin-right: 10px;" />
</div>

<br clear="all" />
<br>

## Introduction / Business Problem
<br>
<br>
Studying the factors that influence traffic accidents is an important research direction in the field of traffic safety. The increasing number of crashes is a major public safety concern with various related costs. To help reduce the frequency of such collisions in the community, this project develops a model that predicts the severity of an accident given the current weather, road and visibility conditions. With our application, the user will be alerted to be more careful when conditions are bad.

Our main objective in this project is to build a supervised learning model that predicts the severity of an accident under given circumstances (the current weather, road and visibility conditions) and alerts the end user appropriately.
<br>
<br>
## Data
<br>
<br>
This project uses Jupyter Notebooks to analyze a dataset containing a rating of accident severity, street location, collision address type, weather condition, road condition, vehicle count, injuries, fatalities, and whether the driver at fault was under the influence. The data was originally provided by the Seattle Department of Transportation (SDOT) Traffic Management Division, Traffic Records Group, and was modified to meet the project criteria. It is a .csv file named 'Data-Collisions'. Our target variable is 'SEVERITYCODE', which, per the metadata, rates the severity of an accident from 0 to 3 (including a "2b"). The attributes used here to weigh the severity of an accident are 'WEATHER', 'ROADCOND' and 'LIGHTCOND'. The entire dataset originally had 194,673 rows (instances) and 38 columns (features). The metadata of the dataset can be found <a href="https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf">here</a>. <br><br>In its original form, this data is not fit for analysis, and many of its columns are not used for this model. The data must therefore be cleaned and preprocessed before it is fed to the machine learning algorithms that produce our final model.
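As a rough illustration of the column selection described above, the sketch below reads a few CSV rows and keeps only the target and the three attributes. The sample rows are hypothetical, and dropping rows with a blank value is an assumed cleaning rule rather than the notebook's documented one. Only the standard library is used so the snippet runs anywhere.

```python
import csv
import io

# Hypothetical sample rows; the real 'Data-Collisions' file has 38 columns.
SAMPLE_CSV = """SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND,VEHCOUNT
1,Clear,Dry,Daylight,2
2,Raining,Wet,Dark - Street Lights On,2
1,,Dry,Daylight,1
2,Overcast,Wet,Daylight,3
"""

# Target variable plus the three attributes named in this section.
KEEP = ["SEVERITYCODE", "WEATHER", "ROADCOND", "LIGHTCOND"]


def load_selected(handle, columns=KEEP):
    """Read CSV rows, keep only `columns`, and skip rows with a blank value.

    Skipping incomplete rows is an assumption made for this sketch.
    """
    rows = []
    for record in csv.DictReader(handle):
        selected = {name: record[name].strip() for name in columns}
        if all(selected.values()):
            rows.append(selected)
    return rows


rows = load_selected(io.StringIO(SAMPLE_CSV))
print(len(rows), "complete rows kept")
```

To try this on the real file, an open file handle could be passed in place of the `io.StringIO` object.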
<br>
 

<br>



## Methodology

<br>
<br>

### Exploratory Data Analysis <br>
The correlation heat map of the dataset was explored first. It did not provide much insight into the problem, as our independent variables were shown to be negatively correlated with the dependent variable. The Pearson coefficient and p-value were then explored, which showed that Road Condition and Light Condition had a strong relationship with Collision Severity. The initial decision to include Weather Condition along with Road and Light Condition was not changed.
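For readers unfamiliar with the Pearson coefficient, here is a minimal sketch of how it is computed. The encoded values are hypothetical and do not come from the dataset, and the p-value is not computed here. In practice a library routine such as `scipy.stats.pearsonr`, which returns both the coefficient and a p-value, would likely be used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical label-encoded road condition (0 = dry, 1 = wet) and severity code.
road = [0, 0, 1, 1, 0, 1, 0, 1]
severity = [1, 1, 2, 2, 1, 1, 2, 2]
r = pearson_r(road, severity)
print(f"Pearson r = {r:.2f}")
```

One caveat worth keeping in mind: when a category with more than two levels is label encoded, the coefficient depends on the arbitrary integer order assigned to the levels.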


<br>

### Machine Learning Algorithms & Evaluation <br>

**1. K-Nearest Neighbor (KNN)** <br>
KNN predicts the severity code of an observation by finding the k most similar data points in the training set and taking the most common class among them.
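The idea can be sketched in a few lines of plain Python. The encoded feature tuples and labels below are hypothetical, and Euclidean distance with k = 3 is an assumption made for illustration, not necessarily the notebook's setting.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to `query`."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical label-encoded (WEATHER, ROADCOND, LIGHTCOND) tuples
# with their severity codes.
train_X = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 1, 0), (1, 0, 1)]
train_y = [1, 1, 2, 2, 1, 2]

print(knn_predict(train_X, train_y, (1, 1, 0)))
print(knn_predict(train_X, train_y, (0, 0, 0)))
```

An odd k avoids tied votes when there are only two classes.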

<br><br>
**2. Decision Tree** <br>
A decision tree model gives us a layout of all possible outcomes so we can fully analyze the consequences of a decision. In this context, the decision tree observes all possible outcomes of different weather conditions.
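A tree is grown by repeatedly choosing the question that best separates the classes. The sketch below scores candidate "feature equals value" splits with Gini impurity, which is one common criterion. The source does not state which criterion the notebook used, and the rows are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, feature, value):
    """Weighted Gini impurity after splitting on `row[feature] == value`."""
    left = [y for x, y in zip(rows, labels) if x[feature] == value]
    right = [y for x, y in zip(rows, labels) if x[feature] != value]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Return the (feature, value) pair with the lowest weighted impurity."""
    candidates = {(feature, row[feature]) for row in rows for feature in row}
    # The candidate itself is the tie-breaker, so the result is deterministic.
    return min(candidates, key=lambda c: (split_gini(rows, labels, *c), c))


# Hypothetical observations and their severity codes.
rows = [
    {"WEATHER": "Clear", "ROADCOND": "Dry"},
    {"WEATHER": "Clear", "ROADCOND": "Dry"},
    {"WEATHER": "Raining", "ROADCOND": "Wet"},
    {"WEATHER": "Overcast", "ROADCOND": "Wet"},
    {"WEATHER": "Clear", "ROADCOND": "Wet"},
    {"WEATHER": "Overcast", "ROADCOND": "Dry"},
]
labels = [1, 1, 2, 2, 2, 1]

print(best_split(rows, labels))
```

A full tree would apply `best_split` recursively to each side until a stopping rule such as a maximum depth is reached.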

<br><br>
**3. Logistic Regression** <br>
Because our dataset only provides us with two severity code outcomes, our model will only predict one of those two classes. This makes our target binary, which is well suited to logistic regression.
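To make the binary setup concrete, here is a minimal logistic regression fitted by batch gradient descent. Everything in it is an assumption for illustration: the toy data, the two 0/1 features, the mapping of severity code 1 to 0 and code 2 to 1, and the learning rate and epoch count. It omits the regularization that the hyperparameter C controls in library implementations.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class (here: severity code 2)."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, target in zip(X, y):
            error = predict_proba(weights, bias, features) - target
            grad_b += error
            for j, f in enumerate(features):
                grad_w[j] += error * f
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical features: (road is wet, light is dark), each 0 or 1.
X = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
# Assumed mapping: 0 = property damage (class 1), 1 = injury (class 2).
y = [0, 0, 1, 0, 1, 0, 1, 1]

weights, bias = fit_logistic(X, y)
p_wet_dark = predict_proba(weights, bias, (1, 1))
p_dry_day = predict_proba(weights, bias, (0, 0))
print(f"P(injury | wet, dark) = {p_wet_dark:.2f}")
print(f"P(injury | dry, day)  = {p_dry_day:.2f}")
```

A prediction is then made by thresholding the probability, typically at 0.5.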

<br>
<br>

## Results 
<br>

The accuracy of the three models is shown below:
![image.png](attachment:image.png)
<br>



## Discussion <br>
At the beginning of this notebook, we had categorical data of type 'object'. This is not a data type that we could have fed through an algorithm, so label encoding was used to create new classes of type int (a numerical data type).
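Label encoding simply assigns an integer to each distinct category. The sketch below is a plain Python version with hypothetical weather values. Assigning codes in sorted order is a choice made here for reproducibility (scikit-learn's `LabelEncoder` also orders its classes this way).

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace each category by its integer code."""
    return [mapping[value] for value in values]


# Hypothetical 'WEATHER' column values.
weather = ["Clear", "Raining", "Overcast", "Clear", "Raining"]
weather_codes = fit_label_encoder(weather)
encoded = encode(weather, weather_codes)
print(weather_codes)
print(encoded)
```

The integers carry no real ordering, which is worth remembering for distance-based models such as KNN.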

After solving that issue we were presented with another: imbalanced data. Class 1 was nearly three times larger than class 2. The solution was to downsample the majority class so that it matched the minority class exactly, with 57,052 samples each.
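Downsampling can be sketched as drawing, without replacement, as many rows from each class as the smallest class contains. The records, the `id` field and the seed below are hypothetical, and the notebook may well have used a library helper instead.

```python
import random
from collections import Counter, defaultdict


def downsample_to_minority(records, label_key, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for record in records:
        by_class[record[label_key]].append(record)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Hypothetical imbalanced data: nine rows of class 1, three rows of class 2.
records = [{"SEVERITYCODE": 1, "id": i} for i in range(9)]
records += [{"SEVERITYCODE": 2, "id": i} for i in range(9, 12)]

balanced = downsample_to_minority(records, "SEVERITYCODE")
counts = Counter(record["SEVERITYCODE"] for record in balanced)
print(dict(counts))
```

The trade-off is that the discarded majority-class rows are no longer available for training.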

Once we had analyzed and cleaned the data, it was fed through three ML models: K-Nearest Neighbor, Decision Tree and Logistic Regression. Although the first two are ideal for this project, logistic regression made the most sense because of its binary nature.

The evaluation metrics used to test the accuracy of our models were the Jaccard index and F1 score, plus log loss for logistic regression. Trying different values of k, max depth and the hyperparameter C helped to improve the accuracy of each model.
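For reference, the three metrics can be written out directly. This sketch treats class 2 as the positive class and uses hypothetical labels and probabilities. The notebook's scores may have been averaged across classes, so these functions are illustrative rather than a reproduction of its numbers.

```python
import math


def _counts(y_true, y_pred, positive):
    """True positives, false positives and false negatives for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def jaccard_index(y_true, y_pred, positive):
    """Intersection over union of predicted and actual positives."""
    tp, fp, fn = _counts(y_true, y_pred, positive)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall."""
    tp, fp, fn = _counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def log_loss(y_true01, probabilities, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true01, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true01)


# Hypothetical true and predicted severity codes.
y_true = [1, 2, 2, 1, 2, 1]
y_pred = [1, 2, 1, 1, 2, 2]
# Hypothetical predicted probabilities of class 2, with class 2 mapped to 1.
y_true01 = [0, 1, 1, 0, 1, 0]
probs = [0.2, 0.8, 0.4, 0.1, 0.9, 0.6]

jac = jaccard_index(y_true, y_pred, positive=2)
f1 = f1_score(y_true, y_pred, positive=2)
loss = log_loss(y_true01, probs)
print(f"Jaccard = {jac:.3f}, F1 = {f1:.3f}, log loss = {loss:.3f}")
```

Higher is better for the Jaccard index and F1 score, while lower is better for log loss.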

**Future Scope** <br>
Many more kinds of analysis and modeling can be done with this dataset. By refining the data preparation and trying other algorithms, we can also try to improve the accuracy of our model in the future.

## Conclusion <br>
Based on historical data on weather, road and light conditions pointing to certain classes, we can conclude that particular conditions have some impact on whether travel results in property damage (class 1) or injury (class 2).
