# 1. Introduction
<p>
The Seattle Department of Transportation (SDOT) is looking for ways to reduce the number of car accidents and the fatalities, injuries and damage they cause. To support this, the department records all relevant information about each accident and makes it available to researchers. This data is necessary for identifying the locations and causes of crashes, for planning and implementing countermeasures, for operational management and control, and for evaluating highway safety programs and improvements.
</p>

<p>
The problem is that accidents differ in severity, and countermeasures need to be prioritized accordingly. SDOT classifies each registered collision into one of four categories: Property Damage Only Collision, Injury Collision, Serious Injury Collision and Fatality Collision. This project uses these records to analyze all collisions registered since 2004 and to build a machine learning model that classifies the severity of a new collision from its characteristics.
</p>


# 2. Data Understanding

<p> The data set used in this project is available as a comma-separated values (CSV) file. It was downloaded from the <a href="https://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0?geometry=-122.326%2C47.592%2C-122.318%2C47.594" target="_blank">Seattle Open GeoData Portal</a> and includes all types of collisions from 2004 to the present. The data set contains 221,738 rows (records) and 40 columns (fields), with three value types: float64, object and int64. </p>
<p> The data set metadata is published by the <a href="https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf" target="_blank">Seattle Department of Transportation</a>. </p>
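
The row, column and type counts quoted above can be checked with a few pandas calls. The sketch below runs on a small made-up frame (the name `toy` and all of its values are illustrative, not taken from the Seattle file); on the real data the same calls would be applied to the loaded DataFrame.

```python
import pandas as pd

# Hypothetical miniature stand-in for the collisions table (values are made up).
toy = pd.DataFrame({
    "X": [-122.32, -122.31, None],          # float column with a missing value
    "PERSONCOUNT": [2, 3, 1],               # integer column
    "WEATHER": ["Clear", "Raining", None],  # text column
})

n_rows, n_cols = toy.shape                             # number of records and fields
dtype_counts = toy.dtypes.astype(str).value_counts()   # how many columns share each dtype
print(n_rows, n_cols)
print(dtype_counts)
```

The name of the text dtype reported in `dtype_counts` may differ between pandas versions, so only the numeric dtypes are worth relying on here.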


# 3. Data Preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline

!conda install seaborn -y
import seaborn as sns

#### Loading Data

In [None]:
#!wget -O Collisions.csv "https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv"

In [None]:
df_raw = pd.read_csv('Collisions.csv')

## Data Cleaning
<p> We tried to retain as many features as possible during cleaning. The following steps were performed: </p>

<p> a) The data set is part of a Seattle government information system and is related to other tables. Columns holding foreign keys to those tables were discarded, along with the primary key and other reference columns. </p>

<p> b) Boolean-like columns, where Y corresponds to TRUE and N or a null value (NaN) corresponds to FALSE, were standardized to the numeric values 1 (TRUE) and 0 (FALSE, including NaN). </p>

<p> c) The day of the week and the month were extracted from the column containing the date of occurrence, and the original column was discarded. </p>

<p> d) The target variable is represented by two columns. SEVERITYCODE holds the severity code with the values (1, 2, 3, 2b, 0) and was converted to type int64 with the values (1, 2, 3, 4, None). SEVERITYDESC holds the severity description and was kept in the data set to make the graphs easier to interpret. </p>

<p> e) Some columns use the values "Unknown" and "Other" to represent missing data, so these values were replaced by NaN. </p>

<p> f) Columns containing location or address information were discarded, since the geographic coordinate columns already carry this information. </p>

<p> g) All records containing at least one NaN value were discarded from the DataFrame. </p>

<p> After data cleaning, 148,171 of the 221,738 records remained (about 67%), and the 40 columns were reduced to 23 features (about 58%). </p>

In [None]:
# Keep only the columns used in the analysis; .copy() avoids modifying a view of df_raw.
df = df_raw[['X','Y','SEVERITYCODE','COLLISIONTYPE','PERSONCOUNT','PEDCOUNT','PEDCYLCOUNT',
             'VEHCOUNT','INJURIES','SERIOUSINJURIES','FATALITIES','WEATHER','ROADCOND','SPEEDING',
             'JUNCTIONTYPE','UNDERINFL','LIGHTCOND','HITPARKEDCAR','PEDROWNOTGRNT','INATTENTIONIND',
             'INCDATE','SEVERITYDESC']].copy()

In [None]:
missing_df = df.isnull().sum(axis=0).reset_index()
missing_df= missing_df[missing_df[0]>0]
missing_df

Applying the fixes


In [None]:
# Recode the target: '0' (unknown severity) becomes missing, '2b' (serious injury) becomes 4.
df["SEVERITYCODE"] = df["SEVERITYCODE"].replace({'0': np.nan, '2b': 4})
df["SEVERITYCODE"] = pd.to_numeric(df["SEVERITYCODE"], errors='coerce')

# In these indicator columns a missing value means the flag was not set.
flag_cols = ["SPEEDING", "PEDROWNOTGRNT", "INATTENTIONIND"]
df[flag_cols] = df[flag_cols].fillna(0)

# Standardize Y/N and '1'/'0' to 1/0, and treat "Unknown"/"Other" as missing.
df = df.replace({"Y": 1, '1': 1, "N": 0, '0': 0})
df = df.replace(["Unknown", "Other"], np.nan)
df = df.infer_objects()  # let the recoded columns take a numeric dtype

# Derive month and day of week from the incident date, then drop the date.
incdate = pd.to_datetime(df['INCDATE'])
df['MONTH'] = incdate.dt.month
df['DAYOFWEEK'] = incdate.dt.dayofweek
df = df.drop(['INCDATE'], axis=1)


In [None]:
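print('Before', df.shape)
df = df.dropna(axis=0)
# With the missing codes gone, the target can be stored as an integer.
df["SEVERITYCODE"] = df["SEVERITYCODE"].astype('int64')
print('After', df.shape)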

In [None]:
df_clean = df
df_clean.head(10)

## Exploratory Data Analysis

Distribution of Accidents by Severity

In [None]:
sns.set(style='darkgrid')
fig, ax = plt.subplots(figsize=(11,8))
# Bars are ordered from the most to the least frequent severity class.
sns.countplot(x='SEVERITYDESC', data=df_clean, ax=ax, order=df_clean['SEVERITYDESC'].value_counts().index)
ax.set_title('Distribution of Severity')
plt.show()

### Features
The features were split into two groups: categorical features and continuous features.

#### Continuous Features
Continuous features include all independent variables of type float64 and int64. Here, the mean value of each feature was compared across the severity classes of the accidents in order to identify trends.

In [None]:
# All numeric columns (float64 and int64 in this data set).
continuous_features = df_clean.select_dtypes(include='number').columns.tolist()

#### Distribution of Severity by Location

In [None]:
sns.set_theme()
# Scatter the collisions by coordinates, colored by severity class.
g = sns.jointplot(x='X', y='Y', data=df_clean, hue='SEVERITYDESC', height=10)
g.set_axis_labels('Longitude', 'Latitude', fontsize=12)
#plt.savefig('severity_geo_distribuition.png')
plt.show()

Looking at the geodistribution map, the severity classes appear in roughly the same proportions across locations. We can therefore assume that location has only a weak influence on the severity of an accident.
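
One way to check this visual impression numerically is to bin the coordinates into a coarse grid and compare the share of each severity class per cell. The following is a minimal sketch on made-up points; the `toy` frame, the two-bin split on longitude and the `cell` column are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical points: three in the west of the city, three in the east.
toy = pd.DataFrame({
    "X": [-122.40, -122.39, -122.30, -122.29, -122.41, -122.31],
    "Y": [47.60, 47.61, 47.60, 47.62, 47.61, 47.61],
    "SEVERITYDESC": ["Property Damage Only Collision", "Injury Collision",
                     "Property Damage Only Collision", "Injury Collision",
                     "Property Damage Only Collision", "Property Damage Only Collision"],
})

# Split longitude into two equal-width bins (west / east).
toy["cell"] = pd.cut(toy["X"], bins=2, labels=["west", "east"])

# Row-normalized crosstab: share of each severity class inside each cell.
shares = pd.crosstab(toy["cell"], toy["SEVERITYDESC"], normalize="index")
print(shares)
```

If the rows of `shares` are close to each other, as they are in this toy example, location carries little information about severity.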

#### Distribution of Severity by Continuous Features

In [None]:
# Drop the coordinates and the target; the comprehension is safe to re-run.
continuous_features = [col for col in continuous_features if col not in ('X', 'Y', 'SEVERITYCODE')]

sns.set(style='darkgrid')
fig, axs = plt.subplots(ncols=5, nrows=3, figsize=(30,25))
plt.subplots_adjust(hspace=.9, wspace=0.5)
for ax, feature in zip(axs.flat, continuous_features):
    # barplot shows the mean of the feature within each severity class
    sns.barplot(x='SEVERITYDESC', y=feature, data=df_clean, ax=ax)
    plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right', fontweight='light')
    ax.set_title('SEVERITY IN {}'.format(feature), size=14, y=1.05)
    ax.set_xlabel('')
fig.suptitle('SEVERITY VS CONTINUOUS VARIABLES', y=.02)
#plt.savefig('severity_distribuition_continuous.png')
plt.show()


<p>
Looking at the figure above, it is clear that each feature influences the severity of the accident in its own way. In the SEVERITY IN FATALITIES relationship, for example, any accident involving a fatality tends to be classified as Fatality Collision. In the SEVERITY IN HITPARKEDCAR relationship, accidents with a high share of parked-car hits tend to be classified as <i>Property Damage Only Collision</i>, while a relatively low share goes with a more serious class such as Serious Injury Collision.</p>
<p>
In relationships like SEVERITY IN MONTH and SEVERITY IN DAYOFWEEK it is very difficult to classify the accident because the distribution of severity is almost equal.
</p>


#### Categorical Variables

In [None]:
categorical_features = [col for col in df_clean.columns if df_clean[col].dtype=='object']
categorical_features.remove('SEVERITYDESC')
categorical_features

Let's explore the distribution of each categorical variable across the accident severity classes.

In [None]:
fig, axs = plt.subplots(ncols=2, nrows=3, figsize=(30, 24))
plt.subplots_adjust(hspace=1,wspace = 0.5)
for i, feature in enumerate(categorical_features, 1):    
    plt.subplot(3, 2, i)
    sns.set(style='darkgrid')
    ax = sns.countplot(x=feature, hue = 'SEVERITYDESC', data = df_clean)
    plt.xticks(rotation=30, horizontalalignment='right',fontweight='light') 
    plt.title('SEVERITY IN {}'.format(feature), size=14, y=1.05)
    plt.legend(bbox_to_anchor=(1.0, 0.7),borderaxespad=0.5)
    ax.set_xlabel('')
fig.suptitle('SEVERITY VS CATEGORICAL VARIABLES',y=.95, fontsize=16)
#plt.savefig('severity_distribuition_categorical.png')
plt.show()

<p>The figure above explores the relationship between accident severity and the categorical features. Some values of these variables do not contain enough information to classify the severity of the accident. These values will be discarded from the model because they do not show any tendency towards the target.</p>

<p>After identifying the relevant numerical and categorical variables, the categorical and continuous features are pooled together, and the values of the categorical variables are then transformed into dummy variables.</p>
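
As a small illustration of the dummy-variable step, the sketch below encodes a made-up `WEATHER` column; the `toy` frame and its values are invented and only the mechanism matters. With `drop_first=True`, the first category in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "VEHCOUNT": [2, 1, 3],
    "WEATHER": ["Clear", "Raining", "Overcast"],
})

# One indicator column per category, minus the first one ("Clear").
encoded = pd.get_dummies(toy, columns=["WEATHER"], drop_first=True)
print(encoded.columns.tolist())
```

A row where every `WEATHER_*` indicator is zero therefore stands for the baseline category, which avoids a redundant column.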

In [None]:
feautures = ['SEVERITYCODE']+continuous_features+categorical_features

df_cleaned = pd.get_dummies(df_clean[feautures], columns=categorical_features, drop_first=True)

Converting categorical features into dummy variables adds many columns to the data set. As we saw when inspecting the categorical variables, not every category value has a meaningful impact on classifying accident severity. Therefore, every variable with -0.03 < correlation < 0.03 will be dropped, because there is not enough historical data to help classify an accident based on that value.
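
The thresholding rule described above can be written compactly with an absolute value. A minimal sketch on a made-up correlation table (`toy_corr`, `threshold` and the numbers are hypothetical, not results from this data set):

```python
import pandas as pd

# Hypothetical correlations of five features with the target.
toy_corr = pd.Series(
    {"INJURIES": 0.62, "SPEEDING": 0.04, "MONTH": 0.01,
     "HITPARKEDCAR": -0.10, "DAYOFWEEK": -0.02},
    name="correlation",
)

threshold = 0.03
# Keep a feature when its correlation is outside the band (-0.03, 0.03).
kept = toy_corr[toy_corr.abs() > threshold].index.tolist()
print(kept)
```

Negative correlations are kept as well as positive ones, since both carry information about the target.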

In [None]:
df_corr = pd.DataFrame(df_cleaned.corr()['SEVERITYCODE'].sort_values(ascending=False)).reset_index()
df_corr.columns = ['features','correlation']

# Keep only features whose absolute correlation with the target exceeds 0.03.
df_corr = df_corr.loc[df_corr['correlation'].abs() > 0.03]

# Skip the first row: it is SEVERITYCODE correlated with itself.
a = df_corr[1:]
ax = a.plot(kind='barh', figsize=(14,18))
ax.set_yticklabels(a["features"], size=14)
plt.title('Severity Correlations with |coef| > 0.03', fontsize=18)
plt.show()
#plt.savefig('severity_correlactions.png')

feautures = df_corr['features'].values[1:].tolist()

Setting the Features

In [None]:
X = df_cleaned[feautures]
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[:5]

Setting the Target

In [None]:
y = df_cleaned['SEVERITYCODE']
y[:5]

# 4. Modeling and Evaluation

<p>
After processing the data set and selecting the features, it is time to build a model that predicts accident severity from the historical data. A train/test split is used, with Support Vector Machine, Decision Tree Classification and K-Nearest Neighbors (KNN) as classification models.

<code>accuracy_score</code> and <code>f1_score</code> are used to evaluate the models.
</p>
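
To make the two metrics concrete, here is a minimal sketch on six made-up labels (`y_true` and `y_hat` are hypothetical, not taken from the models in this notebook). It uses `average='weighted'`, which weights each class's F1 by its number of true samples.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted severity codes for six collisions.
y_true = [1, 1, 1, 1, 2, 2]
y_hat = [1, 1, 1, 2, 2, 2]

acc = accuracy_score(y_true, y_hat)                 # share of exact matches
f1_w = f1_score(y_true, y_hat, average="weighted")  # per-class F1, weighted by class support
print(round(acc, 3), round(f1_w, 3))
```

Here class 1 has F1 = 6/7 and class 2 has F1 = 0.8, so the weighted score is (4 × 6/7 + 2 × 0.8) / 6 ≈ 0.838, while accuracy is 5/6 ≈ 0.833.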

###  Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

### Support Vector Machine

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score


svm_ = svm.SVC(kernel='rbf')
svm_.fit(X_train, y_train) 
svm_pred = svm_.predict(X_test)

svm_acc = accuracy_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred, average='weighted')
 
print("SVM test accuracy: %.1f%%"% (svm_acc*100))
print("SVM weighted F1 score: %.1f%%"% (svm_f1*100))
print('\n',classification_report(y_test, svm_pred))

list_result = []
list_result.append(['Support Vector Machine',svm_acc,svm_f1])


### Decision Tree Classification

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
dtc.fit(X_train,y_train)

dtc_pred = dtc.predict(X_test)
dtc_acc = accuracy_score(y_test, dtc_pred)
dtc_f1 = f1_score(y_test, dtc_pred, average='weighted')

print("Decision Tree Classification test accuracy: %.1f%%"% (dtc_acc*100))
print("Decision Tree Classification weighted F1 score: %.1f%%"% (dtc_f1*100))
print ('\n',classification_report(y_test, dtc_pred))
list_result.append(['Decision Tree Classification',dtc_acc,dtc_f1])

### K-Nearest Neighbors - KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = 1, n_jobs=-1)
neigh.fit(X_train,y_train)
    
knn_pred = neigh.predict(X_test)
knn_acc = accuracy_score(y_test, knn_pred)
knn_f1 = f1_score(y_test, knn_pred, average='weighted')
print("KNN test accuracy: %.1f%%"% (knn_acc*100))
print("KNN weighted F1 score: %.1f%%"% (knn_f1*100))
print('\n', classification_report(y_test, knn_pred))
list_result.append(['K-Nearest Neighbors ',knn_acc,knn_f1])

#### Accuracy Score

In [None]:
my_dict = {"Algorithm":[],"F1 Score":[],"Classification Accuracy":[]}
for resul in list_result:
    # resul holds [name, accuracy, weighted F1]
    my_dict["Algorithm"].append(resul[0])
    my_dict["Classification Accuracy"].append(resul[1])
    my_dict["F1 Score"].append(resul[2])

results = pd.DataFrame(my_dict)
results




# 5. Discussion

<p>
The data set used in this project contained 221,738 collision records. After data cleaning, only 148,171 records were used, mainly because of missing information in some fields: about 33% of the records have at least one missing value.
</p>
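
The share of incomplete records can be measured directly rather than inferred from row counts. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical four-record frame with gaps in two categorical columns.
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining", "Clear"],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
    "VEHCOUNT": [2, 1, 2, 3],
})

rows_with_gap = toy.isna().any(axis=1)   # True where a record has at least one missing field
share_incomplete = rows_with_gap.mean()  # fraction of incomplete records
missing_per_column = toy.isna().sum()    # which columns drive the loss
print(share_incomplete)
print(missing_per_column)
```

Looking at `missing_per_column` alongside the overall share shows which fields would benefit most from a more standardized collection process.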

<p>
The columns with missing data are mostly categorical variables. During the analysis, some values of these variables had too little data to show any tendency, so they were dropped because the accident could not be classified from them. </p>
<p>
It cannot be stated categorically that the dropped values are the ones with missing data, but a data set without missing data clearly enables richer models, so it is important to standardize the data collection process.
</p>



# 6. Conclusion
<p>The purpose of this project was to analyze Seattle collision data and build a machine learning model that predicts the severity class of an accident from its characteristics. By splitting the variables into categorical and continuous, we identified and selected as features the independent variables that have a significant impact on accident classification. Several classification models were built and tested, obtaining an average result of 100% accuracy across the accuracy metrics used.</p>

<p>Implementing this model can help the Seattle Department of Transportation classify accident severity more accurately and automatically from the data in the accident record. The model can be adjusted to include new variables if necessary.</p>
