# IBM Data Science Capstone: Car Accident Severity Report

## Introduction | Business Understanding

As part of an effort to reduce the frequency of vehicular accidents in a city, an algorithm must be developed to predict the likelihood and severity of a road accident given the current weather, road, and visibility conditions. When conditions are bad, the model will alert drivers, reminding them to be more careful or to take a different route to their destination.

In most cases, inattentive driving, drug abuse, or speeding are the main causes of serious accidents, many of which could be prevented by enacting harsher regulations. Uncontrollable factors such as weather, visibility, and road conditions also contribute to a number of road accidents. These accidents can be reduced by revealing hidden patterns in the data and passing warnings on to local government, the police, and drivers traveling on the affected roads. If these patterns are discovered early, local government will know when to alert the public and the respective authorities so that people drive more carefully or avoid those roads entirely.

The target audience of the project is the local Seattle government, police, rescue groups, and, last but not least, car insurance institutes. The model and its results will provide insights that help this audience make decisions aimed at reducing the number of road accidents occurring in the city.

## Data Understanding

The data used in this project is taken from collision and accident reports in Seattle from 2004 to the present. It was collected by the Seattle Police Department (SPD) and the Traffic Records department of Seattle. The data has 37 independent variables and 194,673 records.

We will use this data to identify the key variables that may cause car accidents. For example, the 'WEATHER' column can show the types and number of accidents that occur under different weather conditions at the time of the accident. Furthermore, the 'INTKEY' column can be grouped to count the accidents that happened at each intersection. Sorting these counts in descending order identifies the most dangerous intersections, and the respective authorities can use that list to provide better facilities and monitoring there. Finally, a supervised learning model will be used to predict the severity of an accident based on the inputs.
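The intersection ranking described above can be sketched with pandas. This is a minimal illustration on a made-up frame: only the `INTKEY` column name comes from the dataset, and the key values and the variable name `accidents_per_intersection` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the INTKEY column name comes from the dataset.
# None stands for a collision that was not recorded at an intersection.
df = pd.DataFrame({"INTKEY": [101, 202, 101, 303, 101, 202, None]})

# Count collisions per intersection and list the most dangerous first.
accidents_per_intersection = (
    df.dropna(subset=["INTKEY"])      # ignore rows without an intersection key
      .astype({"INTKEY": int})        # the missing value had forced a float column
      .groupby("INTKEY")
      .size()
      .sort_values(ascending=False)
)

print(accidents_per_intersection.head(3))
```

On the real data, the same chain would be applied to the full collisions frame instead of the toy `df`.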

Our target variable will be 'SEVERITYCODE', because it is used to measure the severity of an accident within the dataset. The attributes used to weigh the severity of an accident are 'WEATHER', 'ROADCOND' and 'LIGHTCOND'. 'SEVERITYCODE' contains numbers from 0 to 4 that correspond to different levels of severity, as follows:

0. Little to no Probability (Clear Conditions)
1. Very Low Probability: Chance of Property Damage
2. Low Probability: Chance of Injury
3. Mild Probability: Chance of Serious Injury
4. High Probability: Chance of Fatality

## Extracting Data and Pre-Processing

In its original form, this data is not fit for analysis. For one, there are many columns that we will not use for this model. Also, most of the features are of type object, when they should be of a numerical type.

We must use label encoding to convert the features to our desired data type.
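The notebook's encoding code is not shown in this report, so the following is only a sketch of one way to label-encode the three features. It assumes pandas category codes were used, which would be consistent with the `category` and `int8` dtypes reported further down. The toy rows are copied from the sample output in this report.

```python
import pandas as pd

# Toy rows copied from the sample output shown in this report.
df = pd.DataFrame({
    "WEATHER": ["Overcast", "Raining", "Overcast", "Clear", "Raining"],
    "ROADCOND": ["Wet", "Wet", "Dry", "Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight",
                  "Daylight", "Daylight"],
})

# Assumption: convert each feature to a pandas category and store its
# integer code in a new *_CAT column.
for col in ["WEATHER", "ROADCOND", "LIGHTCOND"]:
    df[col] = df[col].astype("category")
    df[col + "_CAT"] = df[col].cat.codes

print(df.dtypes)
```

The codes depend on which categories are present, so this five-row sample will not reproduce the code values of the full dataset (for example, `Overcast` is 1 here rather than 4).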


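|   | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND | WEATHER_CAT | ROADCOND_CAT | LIGHTCOND_CAT |
|---|---|---|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight | 4 | 8 | 5 |
| 1 | 1 | Raining | Wet | Dark - Street Lights On | 6 | 8 | 2 |
| 2 | 1 | Overcast | Dry | Daylight | 4 | 0 | 5 |
| 3 | 1 | Clear | Dry | Daylight | 1 | 0 | 5 |
| 4 | 2 | Raining | Wet | Daylight | 6 | 8 | 5 |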


With the new columns, we can now use this data in our analysis and ML models!

Now let's check the data types of the new columns in our dataframe. Moving forward, we will only use the new columns for our analysis.

| Column | dtype |
|---|---|
| SEVERITYCODE | int64 |
| WEATHER | category |
| ROADCOND | category |
| LIGHTCOND | category |
| WEATHER_CAT | int8 |
| ROADCOND_CAT | int8 |
| LIGHTCOND_CAT | int8 |

### Balancing Dataset

Our target variable SEVERITYCODE is imbalanced: the minority class (2) is only about 42% the size of the majority class (1), so class 1 is more than twice the size of class 2.

We can fix this by downsampling the majority class.
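The Discussion section notes that sklearn's resample tool was used for this step. Since the original cell is not shown, the following is a hedged sketch of how such a downsampling could look on a hypothetical toy frame; the seed and the toy data are assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced toy frame: 7 rows of class 1, 3 rows of class 2.
df = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "WEATHER_CAT": list(range(10)),
})

majority = df[df["SEVERITYCODE"] == 1]
minority = df[df["SEVERITYCODE"] == 2]

# Sample the majority class without replacement down to the minority size.
majority_down = resample(
    majority,
    replace=False,
    n_samples=len(minority),
    random_state=0,  # assumption: fixed seed, only for reproducibility
)

balanced = pd.concat([majority_down, minority])
print(balanced["SEVERITYCODE"].value_counts())
```

Downsampling discards majority-class rows, which keeps training fast but throws information away; that trade-off is worth keeping in mind when reading the scores later in the report.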

| SEVERITYCODE | Count |
|---|---|
| 2 | 58188 |
| 1 | 58188 |

Now it's balanced!

## Methodology

Our data is now ready to be fed into machine learning models. Statistical testing was not performed because the data revolved around categorical variables, not numerical ones.

Key variables such as pedestrian right of way, inattentive drivers, and whether the car was speeding had a majority of null values. Therefore, they were dropped and were not part of the analysis. However, it is likely that these variables play a key role in vehicle accidents.

We will use the following models:

 - K-Nearest Neighbor (KNN)
KNN predicts the severity code of a data point from the classes of its k most similar (nearest) data points.

 - Decision Tree
A decision tree lays out all possible outcomes so that we can fully analyze the consequences of a decision. In this context, the decision tree observes all possible outcomes of different weather conditions.

 - Logistic Regression
Because our dataset only provides two severity code outcomes, our model will only predict one of those two classes. This makes the target binary, which is well suited to logistic regression.
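The three models listed above can be sketched with scikit-learn as follows. The hyperparameters (k = 25, max depth 7, C = 6) are the values reported in the results section; everything else, including the synthetic data, the train/test split, and the default solver and split criterion, is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three encoded features and the binary target.
rng = np.random.default_rng(0)
X = rng.integers(0, 9, size=(400, 3))   # plays the role of the *_CAT columns
y = rng.integers(1, 3, size=400)        # severity classes 1 and 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyperparameters taken from the report; other settings are library defaults.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=25),
    "Decision Tree": DecisionTreeClassifier(max_depth=7, random_state=0),
    "Logistic Regression": LogisticRegression(C=6),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Because the synthetic labels are random, the predictions here carry no meaning; the sketch only shows how the three estimators are configured and fitted.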

## Results & Evaluation

### K-Nearest Neighbor

- Jaccard Similarity Score = 0.563973305072609
- F1-SCORE                 = 0.540128347154051

The model is most accurate when k is 25.

### Decision Tree

- Jaccard Similarity Score = 0.5664365709048206
- F1-SCORE                 = 0.5450597937389444

The model is most accurate with a max depth of 7.

### Logistic Regression

- Jaccard Similarity Score = 0.5260218256809784
- F1-SCORE                 = 0.511602093963383
- LOGLOSS                  = 0.6849535383198887

The model is most accurate when hyperparameter C is 6.
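The report does not show how the three metrics were computed, so the following is only a sketch of their textbook binary definitions in plain Python, on made-up labels and probabilities. Library implementations may average across classes differently, so these functions are not guaranteed to reproduce the numbers above.

```python
import math

def confusion_counts(y_true, y_pred, positive):
    """Return (true positives, false positives, false negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn

def jaccard(y_true, y_pred, positive):
    # Jaccard index: TP / (TP + FP + FN)
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp + fn)

def f1(y_true, y_pred, positive):
    # F1 score: 2*TP / (2*TP + FP + FN)
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    return 2 * tp / (2 * tp + fp + fn)

def log_loss(y_true, prob_positive, positive):
    # Mean negative log-likelihood of the true class.
    total = 0.0
    for t, p in zip(y_true, prob_positive):
        total += math.log(p) if t == positive else math.log(1 - p)
    return -total / len(y_true)

# Made-up example: true classes, predicted classes, predicted P(class 2).
y_true = [1, 1, 2, 2, 2, 1]
y_pred = [1, 2, 2, 2, 1, 1]
prob_2 = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]

print(jaccard(y_true, y_pred, positive=2))   # 0.5
print(f1(y_true, y_pred, positive=2))
print(log_loss(y_true, prob_2, positive=2))
```

For reference, a model that always predicts a probability of 0.5 has a log loss of ln 2 ≈ 0.693, so the reported 0.685 is only marginally better than that baseline.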

## Discussion



At the beginning of this notebook, we had categorical data of type 'object'. This is not a data type that we could have fed through an algorithm, so label encoding was used to create new columns of type int8, a numerical data type.

After solving that issue we were presented with another: imbalanced data. As mentioned earlier, class 1 was more than twice the size of class 2. The solution was to downsample the majority class with sklearn's resample tool. We downsampled to match the minority class exactly, with 58188 values in each class.

Once we had analyzed and cleaned the data, it was fed through three ML models: K-Nearest Neighbor, Decision Tree and Logistic Regression. Although the first two are well suited to this project, logistic regression made the most sense because the target is binary.

The evaluation metrics used to test the accuracy of our models were the Jaccard index and F1-score, plus log loss for logistic regression. Trying different values of k, max depth and hyperparameter C helped us find the most accurate setting for each model.
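The search over k mentioned above could look like the following sketch. The candidate range, the weighted F1 selection criterion, and the synthetic data are all assumptions; the report does not say how the values were searched.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded features and the binary target.
rng = np.random.default_rng(0)
X = rng.integers(0, 9, size=(400, 3))
y = rng.integers(1, 3, size=400)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumption: try odd k from 1 to 29 and keep the best validation F1.
scores = {}
for k in range(1, 31, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = f1_score(y_val, knn.predict(X_val), average="weighted")

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The same loop works for the decision tree's max depth or for C in logistic regression by swapping the estimator and the candidate values.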


## Conclusion

Based on the dataset provided for this capstone and the models built on weather, road, and light conditions, we can conclude that particular conditions have some impact on whether travelling in them results in property damage (class 1) or injury (class 2). That said, the current conditions barely succeed in describing the probability or the severity of an accident. Going by the raw counts alone, most accidents occurred in dry road conditions (124,510 cases), followed by wet road conditions (47,474 cases). This may simply be because people prefer to travel when it is dry outside, and because these two conditions prevail for most of the year.