# IBM Data Science Professional Certificate - Capstone Project

## 1. Introduction
### 1.1 Background
Although the number of car accidents in Switzerland has declined over recent decades, it still ran into the thousands in 2020. Because the direct and indirect consequences of such events (injuries, death, psychological damage, material damage, etc.) are sizeable, there is value in identifying the causes of accidents so that adequate prevention measures can be put in place. Moreover, it would be valuable to society, not least from a resource-planning standpoint, to understand when accidents are most likely to occur and what outcome severity (light injuries, heavy injuries, fatal outcome) can be expected depending on when and under what circumstances an accident takes place. Since 1992, the Swiss Federal Statistics Office (OFS) has been collecting data on car accidents country-wide and making this information available to the public. This analysis leverages that data.

### 1.2 Problem
The objective is to explore a 2010-2019 dataset from the Swiss Federal Statistics Office (OFS) and determine the **key factors** that drive the **outcome** of an accident for the occupants of the car(s) involved: light injuries, heavy injuries, or a fatal outcome. Additionally, the results of this analysis can be used as a prescriptive tool to:

(1) Allocate the appropriate medical emergency resources to the times, locations and circumstances in which accidents are most likely to occur, with particular emphasis on severe and life-threatening cases.

(2) Design prevention measures based on the drivers identified as having the largest influence on accident outcomes.

### 1.3 Interest
By being able to allocate medical emergency resources more efficiently and by being able to reduce injuries and deaths through prevention campaigns, society as a whole will reduce the economic impact of road hazards. This analysis is therefore aimed at decision-makers of the Swiss Confederation, notably those in charge of Transportation and Medical Affairs. Beyond economic considerations, there is also a moral value in reducing the suffering and deaths of the thousands of people affected by road accidents. 

## 2. Data Sources and Cleaning
### 2.1 Data Sources
The dataset used in this analysis was obtained from the website of the Swiss Federal Statistics Office (OFS). A data browser allows the user to select the dimensions and time range of interest, within the limits of the Office's data structure. The extracted time range is the years 2010-2019. No other data source was used. 

### 2.2 Data Cleaning
Data cleaning first consisted of the following basic steps:

(1) Translating the labels from French to English. This was performed in Excel directly on the CSV file. 

(2) Shortening the labels, for easier coding purposes. This was performed in Excel directly on the CSV file. 

(3) Transforming the non-numerical (categorical) variables into dummy variables usable by the Machine Learning model discussed below. This was performed in the Jupyter Notebook.

The data could only be extracted in a form where each row represents a unique combination of explanatory variables (such as time, day of the week, type of road, etc.). Ten further columns correspond to the years 2010-2019, each reporting how many accidents occurred in that year for the row's combination of explanatory variables.

Because the evolution over time is not the primary concern of this analysis, it was decided to:

(4) Focus the analysis on the period 2010-2019 (as opposed to using the full dataset tracing back to 1992), as it is most likely more representative of the current road conditions and car technology. This was performed in the Jupyter Notebook.

(5) Eliminate the time dimension of the dataset by summing the number of accidents across the years 2010-2019. This was performed in the Jupyter Notebook.
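
Steps (4) and (5) can be sketched in pandas as follows. This is only an illustration on a toy table: the explanatory column names (`DAY`, `TYPE_ROAD`), the year labels and the `N_ACCIDENTS` result column are assumptions, and the actual labels in the OFS extract may differ.

```python
import pandas as pd

# Toy stand-in for the OFS extract: one row per combination of explanatory
# variables, one column per year. All column names and values are hypothetical.
raw = pd.DataFrame({
    "DAY": ["Monday", "Monday", "Saturday"],
    "TYPE_ROAD": ["Highway", "Urban", "Urban"],
    "2009": [4, 7, 2],
    "2010": [3, 6, 1],
    "2011": [5, 8, 0],
})

# Step (4): keep only the year columns inside the period of interest.
# (The real period is 2010-2019; the toy table only contains 2010 and 2011.)
all_year_cols = [c for c in raw.columns if c.isdigit()]
kept_year_cols = [c for c in all_year_cols if 2010 <= int(c) <= 2019]

# Step (5): collapse the time dimension by summing across the kept years.
summed = raw.drop(columns=all_year_cols)
summed["N_ACCIDENTS"] = raw[kept_year_cols].sum(axis=1)

print(summed)
```

Selecting the year columns by their numeric label, rather than by position, keeps the sketch independent of the column order in the extract.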


### 2.3 Feature Selection
To simplify the analysis and improve readability, certain variables were consolidated as follows:

`TYPE_ROAD`: Highways, semi-highways and highways with special features were grouped into a single "Highway" category. This was performed in Excel directly on the CSV file.
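
The same grouping could also be expressed in pandas instead of Excel, which would keep the step reproducible inside the notebook. The sketch below is an illustration under assumptions: the road-type labels are hypothetical stand-ins for the translated OFS labels. It also shows how the resulting categories could be turned into dummy variables, as described in cleaning step (3).

```python
import pandas as pd

# Hypothetical road-type labels; the real OFS labels may differ.
df = pd.DataFrame({
    "TYPE_ROAD": [
        "Highway",
        "Semi-highway",
        "Highway with special features",
        "Main road",
    ],
})

# Map the highway-like labels onto a single "Highway" category;
# labels not listed in the mapping are left unchanged.
highway_like = {
    "Semi-highway": "Highway",
    "Highway with special features": "Highway",
}
df["TYPE_ROAD"] = df["TYPE_ROAD"].replace(highway_like)

# Cleaning step (3): one dummy (0/1) column per remaining category.
dummies = pd.get_dummies(df, columns=["TYPE_ROAD"], dtype=int)

print(dummies)
```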


## 3. Methodology
### 3.1 Exploratory Data Analysis
### 3.2 Machine Learning Approach
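A decision tree was chosen because the task is a classification problem (predicting the outcome category of an accident) and because the explanatory variables include non-numerical values.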
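
A minimal sketch of this approach on a toy table is given here. It assumes the summed counts sit in a hypothetical `N_ACCIDENTS` column and the outcome in a hypothetical `SEVERITY` column; none of the names or numbers come from the OFS data. Because each row of the aggregated table stands for several accidents, the rows are first repeated according to their count, so that every accident becomes one training example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy aggregated table; every name and number here is hypothetical.
agg = pd.DataFrame({
    "TYPE_ROAD": ["Highway", "Highway", "Urban", "Urban"],
    "DAY": ["Weekday", "Weekend", "Weekday", "Weekend"],
    "SEVERITY": ["heavy", "fatal", "light", "light"],
    "N_ACCIDENTS": [3, 1, 5, 2],
})

# Itemize: repeat each row N_ACCIDENTS times so that one row is one accident.
items = agg.loc[agg.index.repeat(agg["N_ACCIDENTS"])].reset_index(drop=True)

# One-hot encode the categorical explanatory variables.
X = pd.get_dummies(items[["TYPE_ROAD", "DAY"]], dtype=int)
y = items["SEVERITY"]

# Same tree settings as in the notebook code (entropy criterion, depth 4).
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(X, y)

# Predict the outcome for the first toy combination (Highway, Weekday).
print(clf.predict(X.iloc[[0]]))
```

Instead of repeating rows, the counts could be passed to `fit` through its `sample_weight` argument, which avoids inflating the table.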

### 3.3 Code

```python
# Importing modules
import numpy as np  # NumPy
import pandas as pd  # pandas
from sklearn.tree import DecisionTreeClassifier  # scikit-learn

# Downloading the data in CSV format

# Placing the data in a pandas DataFrame
my_data = pd.read_csv("URL URL URL data.csv", delimiter=",")
# Displaying the first 5 rows
my_data.head()

# Removing values from 1992 to 2009

my_data.head()

# Eliminating the time dimension by summing across years

my_data.head()

# Itemizing the rows where occurrence > 1

my_data.head()
```

```python
# Declaring the feature matrix X by selecting the explanatory columns
# (the dependent variable y is declared separately below).
# NOTE: the column names used here ('Age', ..., 'Drug') are placeholders that
# do not correspond to the accident dataset described in Section 2.
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]

# Encoding the non-numerical variables as integer labels
from sklearn import preprocessing

le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])

le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])

le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])

X[0:5]

# Declaring the y dependent variable vector
y = my_data["Drug"]
y[0:5]
```

```python
# Separating the total dataset into a training set and a test set
from sklearn.model_selection import train_test_split

# 25% of the rows are allocated to the test set
X_trainset, X_testset, y_trainset, y_testset = train_test_split(
    X, y, test_size=0.25, random_state=3
)

# Creating the tree
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)

# Fitting the tree on the training set
drugTree.fit(X_trainset, y_trainset)

# Predicting on the test set with the fitted tree
predTree = drugTree.predict(X_testset)

# Computing the accuracy of the tree on the test set
from sklearn import metrics
import matplotlib.pyplot as plt

print("Decision tree accuracy: ", metrics.accuracy_score(y_testset, predTree))
```

```python
# Visualizing the tree
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline

dot_data = StringIO()
filename = "drugtree.png"
featureNames = my_data.columns[0:5].tolist()
targetNames = my_data["Drug"].unique().tolist()
out = tree.export_graphviz(
    drugTree,
    feature_names=featureNames,
    out_file=dot_data,
    class_names=np.unique(y_trainset).tolist(),
    filled=True,
    special_characters=True,
    rotate=False,
)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img, interpolation='nearest')
```
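
As an alternative to the Graphviz route, scikit-learn also provides `sklearn.tree.plot_tree`, which draws a fitted tree directly with matplotlib and so avoids the `pydotplus` dependency. The sketch below fits a tiny tree on made-up data purely to show the call; the feature names, class labels and output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Made-up, already-encoded data: two binary features, two outcome classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["light", "light", "heavy", "heavy"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(X, y)

# Draw the fitted tree onto a matplotlib axis; one annotation per tree node.
fig, ax = plt.subplots(figsize=(6, 4))
annotations = plot_tree(
    clf,
    feature_names=["IS_HIGHWAY", "IS_WEEKEND"],  # hypothetical names
    class_names=list(clf.classes_),
    filled=True,
    ax=ax,
)
fig.savefig("tree_sketch.png")
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so that the figure renders inline.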



## 4. Results

## 5. Discussion of the Results

## 6. Conclusion