# IBM Data Science Professional Certificate - Capstone Project

## 1. Introduction
### 1.1 Background
While a reduction in number has been observed over the last decades, car accidents are still counted in thousands in Switzerland in 2020. Because the direct and indirect consequences of such events (injuries, death, psychological damages, material damages, etc.) are sizeable, there is value in identifying what are the causes of the accidents so that adequate prevention measures can be put in place. Moreover, it would be valuable to society - not the least from a resource planning standpoint - to understand when accidents are most likely to occur, and respectively what outcome severity (light injuries, severe injuries, fatal outcome) can be expected depending on when and under what circumstances the accident took place. Since 1992, the Swiss Federal Statistics Office (OFS) is collecting data on car accidents country-wide and making such information available to the public. This analysis will leverage this data. 

### 1.2 Problem
The objective is to explore a 2010-2019 dataset from the Swiss Federal Statistics Office (OFS) and determine what are the **key factors** that drive the **outcome** of an accident for the involved car(s)' passengers: light injuries, severe injuries, fatal outcome. Additionally, the outcomes of this analysis can be used as a prescriptive tool to :
(1) Have the appropriate medical emergency resources allocated for the times, locations and circumstances when accidents are most likely to occur, with a particular emphasis on the severe and life-threatening cases. 
(2) Design prevention measures based on those accident factors identified as having the largest influence on accident outcomes. 

### 1.3 Interest
By being able to allocate medical emergency resources more efficiently and by being able to reduce injuries and deaths through prevention campaigns, society as a whole will reduce the economic impact of road hazards. This analysis is therefore aimed at decision-makers of the Swiss Confederation, notably those in charge of Transportation and Medical Affairs. Beyond economic considerations, there is also a moral value in reducing the suffering and deaths of the thousands of people affected by road accidents. 

## 2. Data Sources and Cleaning
### 2.1 Data Sources
The dataset used here is defined as "Road accidents where at least one of the parties was injured or worse". As a result, this dataset does not report on material damages or other consequences than bodily injuries. It is also worth noting that this dataset does not distinguish between what exact type of vehicle was involved, whether a car, bicycle, motorcycle, tractor, pedestrian, skater, etc. 
The dataset used in this analysis was obtained from the website of the Swiss Federal Statistics Office (OFS). A data browser allows the user to select the dimensions and time range of interest, within the limits of the Office's data structure. The extracted time range is the years 2010-2019. No other data source was used. 

**The variables in this dataset are :**  
Types of accidents : TYPE_ACCIDENT  
Type of road: TYPE_ROAD  
Severity of the accident: SEVERITY (the dependent variable)  
Month of the accident: MONTH  
Day of the week: DAY  
Time of the accident: TIME_ACCIDENT  
  
More details are provided below:  
  
**Types of accidents**  
SKID: The vehicle went into a skid/sideslid and/or the driver lost control of the car.  
OVERTAKE : While trying to overtake or changing lanes. This also includes the variable state where the accident happened when the vehicle was returning to its original lane.   
TURNING: While the vehicle was turning to change directions, ie. enter a new road.   
INTERSECTION: Accident taking place at a crossroad or junction of two roads with the two implied vehicles staying on their respective roads.   
BACK: The vehicle crashed into the back of another vehicle that was either mobile or immobile.  
PARKING: While getting in or out of a parking spot.   
ANIMAL: Accident created by an animal.   
FRONTAL: Frontal collisions. 
PEDESTRIANS: Accidents involving one or several pedestrians.  
  
**Time of the accident**  
NIGHT : Between midnight and 6am  
MORNING: Between 6am and noon  
AFTNOON: Between noon and 6pm  
EVENING: Between 6pm and midnight  
  
**Severity of the accident**  
LIGHT_INJURIES: Light injuries to at least one of the involved parties.   
SEVERE_INJURIES: Severe injuries to at least one of the involved parties.   
DEATH: Death of at least one of the involved parties.  
  
**Type of road**  
HWY: Highways, semi-highways and similars. The speed limit is typically 120 km/h, respectively 100 km/h for the semis.  
MAINROAD: Main road, where the speed limit is typically 80 km/h.  
SCDRYROAD: Secondary road, typically narrower and with a speed limit of 80km/h maximum.  
OTHERROAD: Other types of road, for instance private access roads or country paths.  
  
 
### 2.2 Data Cleaning
Data cleaning consisted first in the following basic steps :

(1) Translating the labels from French to English. This was performed in Excel directly on the CSV file. 

(2) Shortening the labels, for easier coding purposes. This was performed in Excel directly on the CSV file. 

(3) Transforming the non-numerical classifiers into dummy variables usable by the Machine Learning model discussed below. This was performed in the Jupyter Notebook. 

The data was extractable only in a form where each row represents a unique combination of explanatory variables (such as time, day of the week, type of road, etc.). There are then 10 columns corresponding to the years 2010-2019, where it is reported in each column how many accidents happened in the said year for the given set of explanatory variables of the row. 
Because a time-evolution is not the primary concern of this analysis, it was decided to :

(4) Focus the analysis on the period 2010-2019 (as opposed to using the full dataset tracing back to 1992), as it is most likely more representative of the current road conditions and car technology. This was performed in the Jupyter Notebook.

(5) Eliminate the time dimension of the dataset by performing the following operation : Sum the number of accidents across the years 2010-2019. This was performed in the Jupyter Notebook.


### 2.3 Feature Selection
To enable a simpler analysis and higher readability, certain variables were simplified as follows :
TYPE_ROAD: Highways, semi-highways and highways with special features were grouped into a single "Highway" category. This was performed in Excel directly on the CSV file. 


## 3. Methodology
### 3.1 Exploratory Data Analysis
### 3.2 Machine Learning Approach
Decision tree because need to classify. And also because non-numerical values.

### 3.3 Code

In [26]:
# Importing modules

import numpy as np # Numpy
import pandas as pd # Pandas
from sklearn.tree import DecisionTreeClassifier # Scikit Learn

# Downloading the data in CSV format from a **raw** URL

import io
import requests
url="https://raw.githubusercontent.com/CGIBM/Coursera_Capstone/master/dataset.csv"
s=requests.get(url).content

# Placing the data in a Pandas dataframe
my_data=pd.read_csv(io.StringIO(s.decode('utf-8')))
# Displaying the first 5 rows
my_data[0:5]
# Removing values from 1992 to 2009
my_data.drop(my_data.columns[[6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]],axis=1,inplace=True)
my_data[0:5]
# Eliminating the time dimension by summing across years
occu=my_data.iloc[:,6:16]
sumoccu=occu.sum(axis=1)
del occu
my_data["OCCU"]=sumoccu
my_data.drop(my_data.columns[[6,7,8,9,10,11,12,13,14,15]],axis=1,inplace=True)

# Eliminating rows with zero occurences between 2010-2019
my_data.drop(my_data[my_data.OCCU < 1].index, inplace=True)
newdata = pd.DataFrame(columns=my_data.columns)

# Itemizing the rows where occurence > 1
for x in range(10):
    occs=my_data.iloc[x,6]
    for n in range(occs):
        newdata=newdata.append(my_data.iloc[x,:], sort=False, ignore_index=True)
     
    

# Drop the column containing the number of occurences
newdata= newdata.drop(['OCCU'], axis=1)

# Declaring the Feature Matrix X and removing column 1 which is the Y dependent variable 
X = newdata[['TYPE_ACCIDENT', 'MONTH', 'DAY', 'TIME_ACCIDENT', 'TYPE_ROAD']].values 
X[0:5]


# here transform into dummies see below


# Declaring the Y dependent variable vector
y = newdata["SEVERITY"]
y[0:5]

# Separation of the total dataset into a train and a test set
from sklearn.model_selection import train_test_split
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.25, random_state=3) # allowing a 25% partition to the test set

# Creation of the tree
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)

#Fitting of the tree on the training dataset
drugTree.fit(X_trainset,y_trainset)

# Running a prediction on the test dataset using the trained calibration
predTree = drugTree.predict(X_testset)

# Computing the accuracy of the tree on the test set
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))

# Visualizing the tree
from sklearn.externals.six import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline 

dot_data = StringIO()
filename = "drugtree.png"
featureNames = my_data.columns[0:5]
targetNames = my_data["Drug"].unique().tolist()
out=tree.export_graphviz(drugTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_trainset), filled=True,  special_characters=True,rotate=False)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')



ValueError: could not convert string to float: 'SKID'

## 4. Results

## 5. Discussion of the Results

## 6. Conclusion

In [8]:
my_data[0:10]

Unnamed: 0,SEVERITY,TYPE_ACCIDENT,MONTH,DAY,TIME_ACCIDENT,TYPE_ROAD,OCCU
8,DEATH,SKID,JAN,MON,MORNING,MAINROAD,3
9,DEATH,SKID,JAN,MON,MORNING,SCDRYROAD,2
14,DEATH,SKID,JAN,MON,AFTNOON,MAINROAD,1
18,DEATH,SKID,JAN,MON,EVENING,HWY,1
20,DEATH,SKID,JAN,MON,EVENING,MAINROAD,1
21,DEATH,SKID,JAN,MON,EVENING,SCDRYROAD,1
33,DEATH,SKID,JAN,TUE,MORNING,SCDRYROAD,2
38,DEATH,SKID,JAN,TUE,AFTNOON,MAINROAD,1
39,DEATH,SKID,JAN,TUE,AFTNOON,SCDRYROAD,2
44,DEATH,SKID,JAN,TUE,EVENING,MAINROAD,2


In [25]:
occu[0:10]

NameError: name 'occu' is not defined

In [23]:
sumoccu[0:30]

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     3
9     2
10    0
11    0
12    0
13    0
14    1
15    0
16    0
17    0
18    1
19    0
20    1
21    1
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
dtype: int64

In [9]:
my_data.iloc[0,6]

3

In [46]:
astra=7
for x in range(astra):
    print (x)

0
1
2
3
4
5
6


In [24]:
newdata[0:10]

Unnamed: 0,SEVERITY,TYPE_ACCIDENT,MONTH,DAY,TIME_ACCIDENT,TYPE_ROAD
0,DEATH,SKID,JAN,MON,MORNING,MAINROAD
1,DEATH,SKID,JAN,MON,MORNING,MAINROAD
2,DEATH,SKID,JAN,MON,MORNING,MAINROAD
3,DEATH,SKID,JAN,MON,MORNING,SCDRYROAD
4,DEATH,SKID,JAN,MON,MORNING,SCDRYROAD
5,DEATH,SKID,JAN,MON,AFTNOON,MAINROAD
6,DEATH,SKID,JAN,MON,EVENING,HWY
7,DEATH,SKID,JAN,MON,EVENING,MAINROAD
8,DEATH,SKID,JAN,MON,EVENING,SCDRYROAD
9,DEATH,SKID,JAN,TUE,MORNING,SCDRYROAD


In [3]:
my_data.iloc[x,:]

SEVERITY            DEATH
TYPE_ACCIDENT        SKID
MONTH                 JAN
DAY                   TUE
TIME_ACCIDENT     EVENING
TYPE_ROAD        MAINROAD
OCCU                    2
Name: 44, dtype: object

In [None]:
# Transforming non-numerical classifiers into dummy variables ##### CHECK IF REALLY NECESSARY ******************************
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1]) 


le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])


le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3]) 

X[0:5]