## __Car Accident Severity__

The objective of the project is to develop a machine learning model that makes it possible to identify the main factors affecting the severity of an accident. For this purpose, a dataset is provided with information on the weather, the direction of the vehicles, the lighting conditions, and the number of people involved in the accident, among other variables.

The target variable is severity, which takes the following codes according to the metadata:

- 3: Fatal, at least one death.
- 2b: Serious Injury
- 2: Injury
- 1: Property damage
- 0: Unknown
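
For readable tables or plots, the codes above can be mapped to their labels. The following is a minimal sketch; the names `SEVERITY_LABELS` and `describe_severity` are illustrative and not part of the original notebook.

```python
# Hypothetical lookup table built from the code list above.
SEVERITY_LABELS = {
    "3": "Fatal, at least one death",
    "2b": "Serious injury",
    "2": "Injury",
    "1": "Property damage",
    "0": "Unknown",
}


def describe_severity(code):
    """Return the label for a severity code; codes are compared as strings."""
    return SEVERITY_LABELS.get(str(code), "Unrecognized code")


print(describe_severity(2))
print(describe_severity("2b"))
```

Comparing codes as strings lets the same helper handle both numeric codes and the mixed code `2b`.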

## __Presentation and Discussion__

The objective of this data science project is to identify the factors that are most decisive in predicting the severity of a car accident. The available dataset covers the Seattle area, and because the latitude and longitude of each accident are provided, accidents can even be located by zone on a map. However, it is not possible to build a district-level map from a GeoJSON file, because the dataset does not include district identifiers.

The dataset includes the number of people involved and the characteristics of the crash, for example whether it was a frontal or a side collision, and whether a parked car or a pedestrian was involved.
Information on weather and road conditions is also available.

### __Import Modules__

The modules used are the basic ones needed to implement a classification model. They come from the scikit-learn (`sklearn`) library, in addition to the now classic NumPy and pandas.

Because it is sometimes essential to have a lightweight model that can run in any environment, the simplest scikit-learn models are used here. However, other classifiers or algorithms are also worth trying, for example those that are popular in Kaggle competitions, such as LightGBM or XGBoost.

In [267]:
import warnings
warnings.simplefilter(action='ignore')

In [268]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn import metrics
import pandas as pd 
import matplotlib.pyplot as plt

### __Data Cleaning and Data Engineering__

For reasons of space, the exploratory preprocessing logic, that is, the working method used to identify the most useful variables, is omitted here. In short, a correlation map was used to identify the variables most strongly related to one another and to the target. In addition, the categorical variables were encoded with the pandas `get_dummies` function, which expands the dataset and gives the algorithm a better representation to work with.
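
The two steps described above can be illustrated on a toy table. This is only a sketch under stated assumptions: the values are made up, the `toy` DataFrame stands in for the real dataset, and the `columns=` form of `get_dummies` is used for brevity instead of the per-column calls in the cell below.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1, 2],
    "PERSONCOUNT": [2, 4, 2, 5, 1, 3],
    "WEATHER": ["Clear", "Raining", "Clear", "Raining", "Overcast", "Clear"],
})

# Correlation of each numeric column with the target.
numeric = toy.select_dtypes("number")
correlations = numeric.corr()["SEVERITYCODE"].drop("SEVERITYCODE")
print(correlations)

# One-hot encode the categorical column; the prefix keeps the new names unambiguous.
encoded = pd.get_dummies(toy, columns=["WEATHER"], prefix="WEATHER")
print(encoded.columns.tolist())
```

Each category of `WEATHER` becomes its own indicator column, which is what lets a model that expects numeric inputs use the categorical information.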

In [261]:
# Load the data and check for NaNs.
# A column is considered invalid if more than 40% of its values are NaN.
variables = [
    "SEVERITYCODE", "X", "Y", "ADDRTYPE", "HITPARKEDCAR", "COLLISIONTYPE", "PERSONCOUNT", "PEDCOUNT",
    "PEDCYLCOUNT", "VEHCOUNT", "UNDERINFL", "WEATHER", "ROADCOND", "LIGHTCOND", "SPEEDING", "ST_COLCODE"
]
data = pd.read_csv("datasets/Data-Collisions.csv")[variables]

data["SPEEDING"] = data["SPEEDING"].apply(lambda x: 1 if x == "Y" else 0 )
# Keep only the columns where at most 40% of the values are missing.
valid_columns = [col for col in data.columns if data[col].isna().mean() <= 0.4]

# Drop the remaining rows that still contain missing values.
data = data[valid_columns].dropna()

# Add dummies: convert the categorical variables to one-hot columns.
# A prefix keeps the new column names unique when categories share labels.
weather = pd.get_dummies(data.WEATHER, prefix="WEATHER")
road_con = pd.get_dummies(data.ROADCOND, prefix="ROADCOND")
collision_d = pd.get_dummies(data.COLLISIONTYPE, prefix="COLLISIONTYPE")
light_d = pd.get_dummies(data.LIGHTCOND, prefix="LIGHTCOND")
add_d = pd.get_dummies(data.ADDRTYPE, prefix="ADDRTYPE")

data = pd.concat([data, weather, road_con, collision_d, light_d, add_d], axis=1)

data["HITPARKEDCAR"] = data["HITPARKEDCAR"].apply(lambda x: 0 if x== "N" else 1)

# UNDERINFL mixes "N"/"Y" with "0"/"1"; map all of them to 0/1.
# Casting to str first also covers values that were parsed as integers.
data["UNDERINFL"] = data["UNDERINFL"].astype(str).map({"N": 0, "0": 0, "Y": 1, "1": 1})

data.drop(columns = ["WEATHER", "ROADCOND", "COLLISIONTYPE", "LIGHTCOND", "ADDRTYPE"], inplace =True) 

### __Model Training__

A decision tree classifier was chosen because, besides being one of the easiest models to implement, its internal logic is both powerful and easy to understand. The main idea is that the data are classified by successively splitting them on feature values, level by level, until reaching terminal nodes known as "leaves". This makes it possible to follow closely what is happening under the hood of the model.
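
To make the "under the hood" point concrete, scikit-learn can print the splits a fitted tree has learned. The sketch below uses made-up toy data rather than the collision dataset, and the feature names are chosen only for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy features: [PERSONCOUNT, SPEEDING]; labels are severity codes.
X_toy = [[1, 0], [2, 0], [2, 1], [4, 1], [5, 1], [3, 0]]
y_toy = [1, 1, 2, 2, 2, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Print the learned splits as indented if/else rules.
rules = export_text(tree, feature_names=["PERSONCOUNT", "SPEEDING"])
print(rules)
```

Each indented line is one split, and each `class:` line is a leaf, so the printout can be read as the sequence of questions the model asks before assigning a severity code.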

In [263]:
y = data["SEVERITYCODE"]
X = data.loc[:, data.columns != 'SEVERITYCODE']

In [264]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [265]:
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
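
Since the stated goal is to identify the most decisive factors, one natural follow-up is to rank the features by the tree's impurity-based importances. This step is not in the original notebook; the sketch below uses made-up toy data, and with the real model the fitted `clf` and the columns of `X_train` would take the place of `model` and `X_toy`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data for illustration only.
X_toy = pd.DataFrame({
    "PERSONCOUNT": [1, 2, 2, 4, 5, 3],
    "SPEEDING": [0, 0, 1, 1, 1, 0],
    "VEHCOUNT": [2, 2, 2, 2, 2, 2],
})
y_toy = [1, 1, 2, 2, 2, 1]

model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Impurity-based importances sum to 1; a higher value means the feature
# contributed more to reducing impurity across the tree's splits.
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances are a quick first look rather than a final answer; they can favor features with many distinct values, so they are best read as a ranking to investigate further.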

### __Results__

The model reaches an acceptable level of accuracy for a prediction (close to 70%), and the solution is stable and easy to scale, so an increase in the amount of data available for training could substantially improve the model's ability to predict the class. It is also worth remembering that the classification task consists of identifying the severity level of a crash.

In [266]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.6929405376052131
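
An accuracy figure is easier to interpret next to a trivial baseline: if one class dominates, always predicting it can already yield a high accuracy. The sketch below is an addition, not part of the original notebook, and uses made-up labels; with the real data, `y_test` and `y_pred` would be passed instead.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels for illustration only.
y_true = [1, 1, 1, 2, 2, 1, 1, 2, 1, 1]
y_pred = [1, 1, 2, 2, 1, 1, 1, 2, 1, 1]

print("Model accuracy:   ", accuracy(y_true, y_pred))
print("Baseline accuracy:", majority_baseline_accuracy(y_true))
```

The gap between the two numbers, rather than the model accuracy alone, indicates how much the model has actually learned.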


### __Conclusion__

Data science models should have two main focuses: 
   1. Customer; 
   2. Infrastructure.

This means that the most important step in data science is identifying the business problem we are trying to solve. This implies a continuous, iterative process of interaction with the client, whether internal or external, to ensure that the problem is being solved in the way the client actually needs. It is also highly recommended to maintain constant communication with the owners of the MVP in order to guarantee usability.

On the other hand, it is also important to take into account the limitations of the infrastructure where the solution will ultimately be deployed. A trained model with the best performance is useless if it weighs more than 1 GB and cannot be served through queries to an API, for example. Personally, I consider Watson the best option in terms of scalability and ease of use; for capacity, power, and availability, however, AWS takes the crown. AWS also has an extremely useful billing alert system, which can help keep implementation costs under control.
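
One simple way to check the size constraint before deployment is to serialize the model and measure the result. This is a minimal sketch, assuming the model is shipped as a pickle; the toy data and the `SIZE_LIMIT_MB` threshold are hypothetical, and the trained `clf` would replace `model` in practice.

```python
import pickle

from sklearn.tree import DecisionTreeClassifier

# Made-up toy data; a real check would use the trained `clf` from the notebook.
X_toy = [[1, 0], [2, 0], [2, 1], [4, 1]]
y_toy = [1, 1, 2, 2]
model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Serialize in memory and measure the payload that would have to be deployed.
payload = pickle.dumps(model)
size_mb = len(payload) / 1024 ** 2
print(f"Serialized model size: {size_mb:.4f} MB")

# Hypothetical deployment limit, chosen here only for illustration.
SIZE_LIMIT_MB = 1024
print("Fits within limit:", size_mb <= SIZE_LIMIT_MB)
```

For decision trees, parameters such as `max_depth` or `min_samples_leaf` limit how large the tree, and therefore the serialized file, can grow.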