
# Machine Learning Simulation with Python from A-Z
## Author: René Eber

**Confidential - Do not duplicate or distribute without written permission from the author**




# The Process


![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/process_empty_sh.png?alt=media&token=1168de47-e528-4dbd-aafe-58d2fe992110)

# Step 1: Start Line - Introduction to Business Case / Data-Set


![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/ambulance.png?alt=media&token=1494f989-3014-495f-a69b-974275793bda)

&emsp;

**Background:** You are the management / chief data scientist of an emergency response company that is responsible for all UK road-crash emergencies. It is your responsibility to:
  -  send an ambulance to the location of the crash (e.g. a car crash) 
  - transport the injured people to the closest hospital
  - provide medical emergency services to any people injured 

Generally, there are two different types of ambulances. They vary in the spectrum of medical emergency services they can facilitate:

&emsp;

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/ambulances.png?alt=media&token=79e882b8-ad9f-4ebb-91df-ec06eeb1eb06)

&emsp;

The ambulance on the *left* only has the most basic medical equipment on board and is used for patients that are not critically injured.

The ambulance on the *right* can be understood as an extension of the hospital, as there are doctors on board as well as sophisticated medical equipment. Even advanced medical procedures such as surgery can be performed while on the way to the hospital.

&emsp;

**Objective:** You want to send the right type of ambulance as:
  - you only have a limited number of ambulances of each type
  - sending an ambulance that includes a medical team is only needed for severe or fatal accidents and is much more costly

&emsp;

**Problem:** You have to decide which ambulance to send, before you know about the severity of the accident. 

&emsp;

**Solution idea:** We want to predict the accident severity from the data available at the time the accident is reported, so that we can make a smart decision on which type of ambulance to send as soon as we receive the emergency call.

We will use the *publicly accessible dataset of UK crash data* to create the prediction model.


**Question: Which type of machine learning is this?**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/types_sh.png?alt=media&token=db812890-c7f9-4856-b0ed-4db12f30c57a)

**Where we stand in the process & key tasks for you as a manager:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step1_sh.png?alt=media&token=01334c65-67de-4398-9067-ab3706caa87f)

# Step 2: Pre-Processing

**Where we stand in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step2_before_sh.png?alt=media&token=8a3c8adb-04fd-491f-b0c7-a158026f5e69)

First we import the "libraries" (e.g. pandas) that we are going to use in this workshop. These libraries contain all the functionalities that we are going to use (e.g. to calculate a mean, visualize data).


In [None]:
# Installing libraries
!pip install pandas
!pip install tensorflow
!pip install keras
!pip install seaborn
!pip install scikit-learn
!pip install matplotlib
!pip install imbalanced-learn
!pip install scipy
!pip install xlrd
!pip install jupyterlab-language-pack-zh-CN
!pip install ipywidgets

# Loading libraries
import os
import operator
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import uniform, randint
import sklearn
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn import metrics
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from imblearn.over_sampling import SMOTE
import warnings; warnings.simplefilter('ignore')
import time
sns.set()

# Connect to data source
! git clone https://github.com/Rebero/Quantgeist
time.sleep(10) 

&emsp;

Now we are ready to import the dataset 

In [None]:
# import data from a comma-separated values (csv) file
crash_df = pd.read_csv('/content/Quantgeist/accidents_dataset_corrected.csv')

crash_df["Accident_Severity_numerical"] = crash_df["Accident_Severity"]
# Read from file with number descriptions and replace number with description where possible
xls = pd.ExcelFile('/content/Quantgeist/Road-Accident-Safety-Data-Guide.xls')
for name in xls.sheet_names: 
  if name != "Introduction" and name != "Export Variables":
    names_df = pd.read_excel(xls, name)
    if name == "Ped Cross - Human":
      name = "Pedestrian_Crossing-Human_Control"
    if name == "Ped Cross - Physical":
      name = "Pedestrian_Crossing-Physical_Facilities"
    if name == "Weather":
      name = "Weather_Conditions"
    if name == "Road Surface":
      name = "Road_Surface_Conditions"
    if name == "Urban Rural":
      name = "Urban_or_Rural_Area"
    if name == "Police Officer Attend":
      name = "Did_Police_Officer_Attend_Scene_of_Accident"
    name = name.replace(" ", "_")
    if name in list(crash_df.columns):
      rename_dict = names_df.set_index('code').to_dict()['label']
      crash_df[name] = crash_df[name].replace(rename_dict)

# remove irrelevant columns
crash_df = crash_df.drop(columns=['Location_Easting_OSGR', 'Location_Northing_OSGR', "Local_Authority_(District)", "Local_Authority_(Highway)", "LSOA_of_Accident_Location", "Police_Force", "Accident_Index", "1st_Road_Number",	"2nd_Road_Number", "Date",	"Time",	"Special_Conditions_at_Site", "Carriageway_Hazards", "Pedestrian_Crossing-Human_Control"])


&emsp;

To get a first overview of the dataset, let us look at its dimensions, i.e. the number of crashes (rows) and the number of columns.



In [None]:
# dataframe.shape returns the dimensions of our dataset as (rows, columns)
crash_df.shape
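
As a small illustration on a made-up table (the name `toy_df` and its values are assumptions, not part of the crash data), `shape` reports rows and columns separately, while `size` reports the total number of cells:

```python
import pandas as pd

# Hypothetical toy table: 3 crashes (rows) described by 2 columns
toy_df = pd.DataFrame({
    "Speed_limit": [30, 60, 70],
    "Number_of_Vehicles": [2, 1, 3],
})

n_rows, n_cols = toy_df.shape   # (number of rows, number of columns)
n_cells = toy_df.size           # total number of cells = rows * columns

print(n_rows, n_cols, n_cells)
```

For a first overview, the number of rows and columns is usually the more informative of the two.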

&emsp;

**Task: Here I want you to write the code yourself.** The objective is to look at the actual data. The method that you want to use is called **head()**. Remember that our dataset is called **crash_df**. This works analogously to how we looked at the dimensions of our dataset in the cell right above; note that, unlike the attribute used there, **head()** is a method and therefore needs parentheses.

In [None]:
#Your code goes here


&emsp;

Next, we map the longitude and latitude values of each accident onto an empty coordinate system. **Question: What do you expect to see?**

In [None]:
# let's map the longitude and latitude
fig = crash_df.plot(kind="scatter", x="Longitude", y="Latitude", alpha=0.6,
                   figsize=(18,11),c="Accident_Severity_numerical", cmap=plt.get_cmap("inferno"), 
                   colorbar=True,)

# remove irrelevant columns
crash_df = crash_df.drop(columns=['Longitude', 'Latitude', 'Accident_Severity_numerical'])

&emsp;

**Question: What does -1 refer to? Why does it occur so often?**

In [None]:
# Change -1 values in 2nd_Road_Class to "Not a junction" if the "Junction_Detail" column lets us see that there was no junction
crash_df.loc[(crash_df["Junction_Detail"] == 'Not at junction or within 20 metres') & (crash_df["2nd_Road_Class"] == -1), '2nd_Road_Class'] = 'Not a junction'  
crash_df.loc[(crash_df["Junction_Detail"] == 'Other junction') & (crash_df["2nd_Road_Class"] == -1), '2nd_Road_Class'] = 'Not a junction'  
crash_df.loc[(crash_df["Junction_Detail"] == 'Private drive or entrance') & (crash_df["2nd_Road_Class"] == -1), '2nd_Road_Class'] = 'Not a junction'  
crash_df.loc[(crash_df["Junction_Detail"] == 'T or staggered junction') & (crash_df["2nd_Road_Class"] == -1), '2nd_Road_Class'] = 'Not a junction'

# Plot the value distribution once more for the "2nd_Road_Class" column
ax = sns.countplot(x=crash_df['2nd_Road_Class'], palette="pastel", edgecolor=".6")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
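
The conditional replacement used in this cell can be tried out on a tiny made-up table. This is only a sketch: `toy_df`, `no_second_road` and the row values are illustrative assumptions, and `isin` is used here to check several junction descriptions in one condition instead of one line per description.

```python
import pandas as pd

# Hypothetical toy rows; the real crash_df has many more rows and columns
toy_df = pd.DataFrame({
    "Junction_Detail": [
        "Not at junction or within 20 metres",
        "Roundabout",
        "Private drive or entrance",
    ],
    "2nd_Road_Class": [-1, -1, "Unclassified"],
})

# Junction descriptions for which a missing second road (-1) is plausible (assumed list)
no_second_road = ["Not at junction or within 20 metres", "Private drive or entrance"]

# Select rows where both conditions hold, then overwrite only the 2nd_Road_Class cell
mask = toy_df["Junction_Detail"].isin(no_second_road) & (toy_df["2nd_Road_Class"] == -1)
toy_df.loc[mask, "2nd_Road_Class"] = "Not a junction"

print(toy_df)
```

Only the first row is changed: the second row has a junction, and the third row already carries a valid road class.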

&emsp;

Great! For now we are done with the Pre-Processing and we can start with the Feature Engineering

**Main points learned:**

1. Reading data may come with format problems
2. Most of the time, the dataset needs explanation
3. There will be inconsistencies / errors in the data; they can be partly fixed with common sense
4. There will be no perfect dataset, unless it comes from a textbook
5. When leading Data Scientists, be aware that machine learning requires a lot of Data Cleaning / Pre-Processing, which can take up to 90% of the total time needed for the project

**Where we stand in the process & key tasks for you as a manager:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step2_sh.png?alt=media&token=48e29cfc-7af7-4f8f-b561-23d404562a0f)

# Step 3: Feature engineering / Exploratory Data Analysis (EDA)

**Where we stand in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step3_before_sh.png?alt=media&token=e0c2a3a2-4f12-4f8f-8dce-51c07cf40423)

Feature Engineering / EDA is an approach to 
- analyze datasets to summarize their main characteristics 
- use domain knowledge of the data to create features that make machine learning algorithms work

&emsp;

"*Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.*"

— Andrew Ng, Machine Learning and AI via Brain simulations

&emsp;

## 3.1 Target Variable

Let us start by looking at our target variable - the severity of the crash. Based on the severity we would send a different type of ambulance.


In [None]:
# Display count of crashes for each severity
ax = sns.countplot(x=crash_df['Accident_Severity'], palette="pastel", edgecolor=".6", order = crash_df["Accident_Severity"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

&emsp;

## 3.2 Features

Diving deeper, we now compare different predictors (e.g. *speed*) to the target variable (the *accident severity* in our case) to see if we intuitively think that there is a correlation or even causation. 

In [None]:
#@title After running this cell manually, it will auto-run if you change the selected value. { run: "auto" }

crash_details = "Junction_Detail" #@param ["Speed_limit", "Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Physical_Facilities", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident"]

accident_counts = (crash_df.groupby([crash_details])["Accident_Severity"]
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values(crash_details))
p = sns.barplot(x="Accident_Severity", y="percentage", hue=crash_details, palette="pastel", data=accident_counts, edgecolor=".6", order = crash_df["Accident_Severity"].value_counts().index)
p.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
p = plt.setp(p.get_xticklabels(), rotation=40)  
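
To see what the `groupby` / `value_counts(normalize=True)` chain in this cell computes, here is a minimal sketch on made-up rows (the name `toy_df` and its values are assumptions). For each value of the feature it returns the percentage of accidents per severity, so the percentages add up to 100 within each group:

```python
import pandas as pd

# Hypothetical toy data: 4 urban and 2 rural accidents
toy_df = pd.DataFrame({
    "Urban_or_Rural_Area": ["Urban", "Urban", "Urban", "Urban", "Rural", "Rural"],
    "Accident_Severity":   ["Slight", "Slight", "Slight", "Serious", "Slight", "Serious"],
})

shares = (toy_df.groupby("Urban_or_Rural_Area")["Accident_Severity"]
                .value_counts(normalize=True)   # share of each severity within a group
                .rename("percentage")
                .mul(100)
                .reset_index())

print(shares)
```

In this toy example 75% of urban accidents but only 50% of rural accidents are slight, which is the kind of difference the bar chart makes visible.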


&emsp;

**Question: Form your own hypotheses. First, reason about which of Speed_limit, Urban_or_Rural_Area or Did_Police_Officer_Attend_Scene_of_Accident you intuitively think is the best predictor.**

**Task: Now we want you to test your hypotheses by exploring the data reality and finding the best predictor** 

It's time for you to test your hypotheses and find the best predictor. You can explore the correlation of each feature with the target variable in the code below.

In [None]:
#@title After running this cell manually, it will auto-run if you change the selected value. { run: "auto" }

crash_details = "Urban_or_Rural_Area" #@param ["Speed_limit", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident"]

accident_counts = (crash_df.groupby([crash_details])["Accident_Severity"]
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values(crash_details))
p = sns.barplot(x="Accident_Severity", y="percentage", hue=crash_details, palette="pastel", data=accident_counts, edgecolor=".6", order = crash_df["Accident_Severity"].value_counts().index)
p.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
p = plt.setp(p.get_xticklabels(), rotation=40)  




&emsp;

As we can see, there are definitely better and worse features for predicting our target, but there is no single feature that is sufficient on its own.

**Question: What other features can you think of that could be important but that we cannot find in the dataset?**


&emsp;

**Main points learned:**
1. During the analysis, additional questions and problems always come up
2. Analyzing the data without a guiding business question can take an almost unlimited amount of time

&emsp;

**Where we stand in the process & key tasks for you as a manager:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step3_sh.png?alt=media&token=322e5986-5327-4920-95c4-7802d373ac76)

&emsp;

**Question: One more thing! What about our target variable? How is it different from what we want to predict?**

In [None]:
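# Merge "Serious" into "Fatal" so that the target has two classes, one per ambulance type
crash_df["Accident_Severity"] = crash_df["Accident_Severity"].replace("Serious", "Fatal")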
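
A minimal sketch of what this merge does, on a made-up series (the names `severity`, `merged` and `needs_medical_team` are hypothetical): the three recorded severities collapse into the two outcomes that matter for the ambulance decision.

```python
import pandas as pd

# Hypothetical toy target column with the three recorded severities
severity = pd.Series(["Slight", "Serious", "Fatal", "Slight", "Slight"],
                     name="Accident_Severity")

# Merge "Serious" into "Fatal": two classes remain
merged = severity.replace("Serious", "Fatal")

# Binary view of the target: 1 = send the ambulance with a medical team
needs_medical_team = (merged == "Fatal").astype(int)

print(merged.value_counts().to_dict())
print(needs_medical_team.tolist())
```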


&emsp;

# Step 4: Modelling

So far, we have done no machine learning at all, but have prepared ourselves and the dataset for the machine learning step of the process. The actual machine learning makes up a rather small portion of the overall effort.

**Where we stand in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step4_before_sh.png?alt=media&token=7a366b1b-ba05-480b-876b-6aa5cf770d5a)


## 4.1 Data Transformation

### 4.1.1 Data Preparation

In [None]:
#OneHotEncoded: "Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Human_Control", "Pedestrian_Crossing-Physical_Facilities", "Special_Conditions_at_Site", "Carriageway_Hazards", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident"
#crash_df = pd.get_dummies(crash_df, prefix=["Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Physical_Facilities", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident", "Weather_Conditions"], columns=["Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Physical_Facilities", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident", "Weather_Conditions"])

#Categorize Strings: Accident_Severity
#le = LabelEncoder()
#crash_df['Accident_Severity'] = le.fit_transform(crash_df['Accident_Severity'].tolist())

# Normalize numerical values - squeeze them between range 0 and 1
#scaler = MinMaxScaler()
#scaler.fit(crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]])
#crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]] = scaler.transform(crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]])

# Separate dataset into predictors and target set
#y = crash_df["Accident_Severity"]
#X = crash_df.drop(columns=['Accident_Severity'])
X = pd.read_csv('/content/Quantgeist/X.csv')
y = pd.read_csv('/content/Quantgeist/y.csv')
X_colnames = list(X.columns)

# Oversample
#sm = SMOTE(random_state=42)
#X, y = sm.fit_resample(X, y)


1.   **Encoding of variables**

Encoding is the transformation of categorical variables to binary or numerical counterparts.


2.  **Feature scaling**

Many machine learning algorithms are sensitive to the scale of their input features, for example because they use the Euclidean distance between two data points in their computations. As our features vary in magnitude, unit and range, features with large values would weigh in a lot more than features with small values. In our use case this applies to Number_of_Vehicles, which ranges from 1 to 23, while Speed_limit ranges from 20 to 90.

There are several ways to deal with feature scaling, such as normalization and standardization. We will use a simple MinMax scaling that squeezes all numerical features into the range 0 to 1.


3.   **Balancing Dataset**

As we have far fewer datapoints with serious or fatal accident severity (only 18%) than datapoints with slight severity, a model that predicts all crashes as slight severity accidents would already achieve 82% accuracy. We will go into more detail later in the section **Model Evaluation**.
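
The three preparation steps can be sketched on a tiny made-up table. This is an illustration under assumptions, not the notebook's actual preparation: `toy_df` and its values are invented, and the scaling is written out by hand as (x - min) / (max - min), which is the formula behind min-max scaling to the range 0 to 1.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the crash data, values are made up
toy_df = pd.DataFrame({
    "Light_Conditions": ["Daylight", "Darkness", "Daylight", "Daylight", "Darkness"],
    "Speed_limit": [20, 30, 60, 70, 30],
    "Accident_Severity": ["Slight", "Slight", "Fatal", "Slight", "Slight"],
})

# 1. Encoding: one binary column per category of Light_Conditions
encoded = pd.get_dummies(toy_df, columns=["Light_Conditions"], prefix=["Light_Conditions"])

# 2. Feature scaling: squeeze Speed_limit into the range 0 to 1
speed = toy_df["Speed_limit"]
encoded["Speed_limit"] = (speed - speed.min()) / (speed.max() - speed.min())

# 3. Balancing: accuracy of a "model" that always predicts the majority class
majority_class = toy_df["Accident_Severity"].mode()[0]
baseline_accuracy = (toy_df["Accident_Severity"] == majority_class).mean()

print(encoded)
print(majority_class, baseline_accuracy)
```

The last step shows why accuracy alone is misleading on an imbalanced dataset: always answering with the majority class already scores highly without detecting a single severe accident.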





&emsp;

### 4.1.2 Train Test Split

This is the golden rule of machine learning: **Never train your model on your test data**. Unfortunately, only few people adhere to this strictly enough. A common choice is to hold out about 20% of your data for testing.
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/train_test_sh.png?alt=media&token=594998cc-2f41-4cfd-a336-5dd4fbadedcc)

In [None]:
# Split data 80:20 into training and test data, keeping the class proportions (stratified)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_index, test_index = next(sss.split(X, y))
# The split returns row positions, so we select the rows with iloc
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
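
To check what "stratified" means in practice, here is a minimal sketch on synthetic data (`X_toy`, `y_toy` and the 20% positive rate are assumptions chosen for illustration). Both parts of the split should contain the same share of positive cases, and no row should appear in both:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 100 rows, 20% of them in the positive class
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([1] * 20 + [0] * 80)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_index, test_index = next(sss.split(X_toy, y_toy))

X_train_toy, X_test_toy = X_toy.iloc[train_index], X_toy.iloc[test_index]
y_train_toy, y_test_toy = y_toy.iloc[train_index], y_toy.iloc[test_index]

print(len(X_train_toy), len(X_test_toy))     # sizes of the two parts
print(y_train_toy.mean(), y_test_toy.mean()) # share of positives in each part
```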


&emsp;

## 4.2 Building the Model

In this chapter, we will compare different types of machine learning algorithms. To do that we set up and train each of the different algorithms on the same data.

&emsp;

### 4.2.1 Deep Neural Network

In [None]:
# Set up and train model
model = Sequential()
model.add(Dense(90,input_shape=(80,)))
model.add(Dropout(0.3))
model.add(Activation('relu'))
model.add(Dense(32))
model.add(Dropout(0.3))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc', ])

model.fit(X_train, y_train ,batch_size=128,epochs=25,validation_split=0.2)


&emsp;

### 4.2.2 Random Forest



First, let's set up the algorithm with its parameters.

In [None]:
# Set up
rf = RandomForestClassifier(verbose=2, random_state=42, n_jobs = -1, class_weight="balanced_subsample", n_estimators=250, max_depth=None, min_samples_split=2, min_samples_leaf=1, bootstrap=True, max_features="sqrt")

&emsp;

**Task: Now we want you to write the code to actually build a machine learning model, in this case a random forest. Please write the Python code below for training the model. As a hint, you can look at how we did this for the Neural Network. The object you are calling upon is called rf (short for random forest) and the method you want to use is fit(). You will see that this follows the same syntax logic you used at the very beginning in your first coding challenge.**

In [None]:
#Your code goes here
trained_rf = rf.

**Where we stand in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step4_sh.png?alt=media&token=807883aa-c115-4d0d-ad5b-1d61ccfc825e)


&emsp;

# Step 5: Model Evaluation

**Where we stand in the process:**
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step5_before_sh.png?alt=media&token=ed203fd7-1441-4a54-8d7d-bd3abad8424a)

Now let us compare how well our models actually performed by evaluating the error.


&emsp;

## 5.1 Metrics

In [None]:
#Performances on test data
def get_measures(y_true, y_pred, th=0.5):
  yhat_classes = np.where(y_pred > th, 1, 0)
  acc = metrics.accuracy_score(y_true, yhat_classes)
  bal_acc = metrics.balanced_accuracy_score(y_true, yhat_classes)
  sen = metrics.recall_score(y_true, yhat_classes) 
  f1 = metrics.f1_score(y_true, yhat_classes)
  return [acc, bal_acc, sen, f1]

#Run model on test data
y_nn = model.predict(X_test)
y_rf = rf.predict(X_test)

dataset = "Test"
overall_performance_df = pd.DataFrame(columns=['model', 'dataset', 'accuracy', 'balanced accuracy', 'sensitivity','F1'])
overall_performance_df.loc[len(overall_performance_df)] = ['Neural network', dataset] + get_measures(y_test, y_nn)
overall_performance_df.loc[len(overall_performance_df)] = ['Random Forest', dataset] + get_measures(y_test, y_rf)
overall_performance_df
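
To make the four measures less abstract, the sketch below computes them by hand from the counts of true/false positives and negatives. The labels and probabilities are made up for illustration (1 stands for a severe or fatal accident, 0 for a slight one); the thresholding mirrors the one in `get_measures`.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for 10 accidents
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.2, 0.7, 0.1, 0.3, 0.2, 0.4, 0.1, 0.05])

# Turn probabilities into classes with a threshold of 0.5
y_pred = np.where(y_prob > 0.5, 1, 0)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # severe, predicted severe
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # slight, predicted slight
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # slight, predicted severe
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # severe, predicted slight

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)                     # share of severe accidents detected
specificity = tn / (tn + fp)                     # share of slight accidents detected
balanced_accuracy = (sensitivity + specificity) / 2
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, balanced_accuracy, sensitivity, f1)
```

Accuracy treats every accident equally, whereas balanced accuracy and sensitivity show how well the rare severe cases are handled.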


&emsp;

Can we already pick a winner? Let us look more specifically at the confusion matrix for each model.
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/confusion_042020.png?alt=media&token=5d4c0c03-50ef-406a-87e1-780b70326559)

In [None]:
# print confusion matrix
#@title After running this cell manually, it will auto-run if you change the selected value. { run: "auto" }

model_details = "Random Forest" #@param ["Deep Neural Network", "Random Forest"]
if model_details == "Deep Neural Network":
  model_pred = y_nn
else:
  model_pred = y_rf
model_pred = model_pred.round()

print(" Confusion matrix \n", metrics.confusion_matrix(y_test, model_pred))
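
The raw matrix printed above has no labels: in scikit-learn's `confusion_matrix`, rows are the true classes and columns the predicted classes. As a sketch with made-up predictions (the label names and arrays are assumptions), a labelled version can be built with `pd.crosstab`:

```python
import numpy as np
import pandas as pd

# Hypothetical true and predicted classes: 1 = severe/fatal, 0 = slight
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 1, 0])

labels = {0: "Slight", 1: "Severe/Fatal"}
confusion = pd.crosstab(
    pd.Series(y_true, name="actual").map(labels),
    pd.Series(y_pred, name="predicted").map(labels),
)

# Severe accidents that the model predicted as slight (false negatives)
missed_severe = confusion.loc["Severe/Fatal", "Slight"]
# Slight accidents that the model predicted as severe (false positives)
false_alarms = confusion.loc["Slight", "Severe/Fatal"]

print(confusion)
print(missed_severe, false_alarms)
```

In our business case the two error types are not equally bad: a missed severe accident means the basic ambulance is sent to critically injured people, while a false alarm means the costly ambulance is sent without need.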


&emsp;

**Question: Which model is better? The Deep Neural Network or the Random Forest?**


&emsp;

Now let's try to better understand what the model learned. Even though we cannot see how exactly the models arrived at their prediction, we can see which features were deemed most important by the models. 
So, let's check if you were right in picking the best predictor in the question at the beginning.



In [None]:
feat_importances = pd.Series(rf.feature_importances_, index=X_colnames)
feat_importances.nlargest(10).plot(kind='barh')
#feat_importances.nsmallest(10).plot(kind='barh')


&emsp;

We can see that most of the features that the model deemed important are in line with our intuition, which is a good sign that the model learned something useful. For example, Did_Police_Officer_Attend_Scene_of_Accident_No is one of the important features.


&emsp;

BUT WAIT! **Question: What is the catch with this feature?**


We do not have this information at the time the prediction is made, which is when the emergency call arrives. This means we are not allowed to use this feature as a predictor! Domain expertise is required to know this, so it would have been the task of the manager to spot this mistake.

Usually, we would have to drop this feature at the beginning and run the whole process again, with an unforeseeable outcome.
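
Removing such a feature from an already encoded predictor table could look like the sketch below. The table `X_toy` is made up, and the `_Yes` column name is an assumption about how the one-hot encoded columns are named; after dropping the columns, the models would have to be retrained and evaluated again.

```python
import pandas as pd

# Hypothetical slice of a one-hot encoded predictor table
X_toy = pd.DataFrame({
    "Speed_limit": [0.2, 0.8],
    "Did_Police_Officer_Attend_Scene_of_Accident_No": [1, 0],
    "Did_Police_Officer_Attend_Scene_of_Accident_Yes": [0, 1],
})

# All encoded columns derived from the feature that is unknown at call time
leaky_prefix = "Did_Police_Officer_Attend_Scene_of_Accident"
leaky_columns = [col for col in X_toy.columns if col.startswith(leaky_prefix)]

X_clean = X_toy.drop(columns=leaky_columns)

print(leaky_columns)
print(X_clean.columns.tolist())
```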


&emsp;

## 5.2 Final Decision

**Question: Even more important than the actual metric: is the model good enough for us to use it in production?**

The answer is: this depends on our baseline, i.e. the current process we use to decide which ambulance to send, and on how good that process is.

**Where we stand in the process & key tasks for you as a manager:**
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step5_sh.png?alt=media&token=c9fdcbcf-9f78-467c-8797-26d8110ace9d)