
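# End-to-End Machine Learning Simulation Tutorial in Python
## Author: 雷内·埃伯 (transliterated name)

**Confidential: do not copy or forward without the author's written permission.**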




# Process Flowchart


![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/process_empty_sh.png?alt=media&token=1168de47-e528-4dbd-aafe-58d2fe992110)

# Step 1: Getting Started - Introduction to the Business Case and Dataset


![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/ambulance.png?alt=media&token=1494f989-3014-495f-a69b-974275793bda)

&emsp;

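**Background:** An emergency response company handles all road accidents in the UK. As a manager or lead data scientist at this company, you are responsible for:
  - dispatching ambulances to the scene (e.g. a vehicle collision);
  - transporting the injured to the hospital nearest the scene;
  - providing emergency medical care to the injured.

There are two types of ambulance, and they offer different levels of emergency medical care.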

&emsp;

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/ambulances.png?alt=media&token=79e882b8-ad9f-4ebb-91df-ec06eeb1eb06)

&emsp;

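The ambulance on the left carries only basic medical equipment and is suited to patients with minor injuries.

The ambulance on the right can be seen as an extension of the hospital: it carries a doctor and advanced medical equipment. Even complex procedures, such as surgery, can be performed on the way to the hospital.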

&emsp;

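**Goal:** You need to dispatch the right type of ambulance, because:
  - Both types of ambulance are in limited supply.
  - The ambulance with a medical team is far more expensive and should only be dispatched to serious or fatal accidents.

&emsp;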

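**Problem:** You must decide which type of ambulance to dispatch before you know how severe the accident is.

&emsp;

**Approach:** We want to predict accident severity from the data available in the accident report. From these data points we will build a severity predictor, so that the right type of ambulance can be chosen as soon as the emergency call comes in.

We will build the predictive model on the publicly accessible UK road accident database.


**Question: What type of machine learning is this?**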

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/types_sh.png?alt=media&token=db812890-c7f9-4856-b0ed-4db12f30c57a)

**Where we stand in the process and key tasks for you as a manager:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step1_sh.png?alt=media&token=01334c65-67de-4398-9067-ab3706caa87f)

# Step 2: Preprocessing

**Where we stand in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step2_before_sh.png?alt=media&token=8a3c8adb-04fd-491f-b0c7-a158026f5e69)

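First, we import the libraries we will use in this workshop (e.g. pandas). These libraries contain all the functionality we need (e.g. computing averages, visualising data).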


In [None]:
# Loading libraries
import os
import operator
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn import metrics
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
import warnings; warnings.simplefilter('ignore')
import ipywidgets as widgets
from IPython.display import clear_output
import time
sns.set()

# Connect to data source
! git clone https://github.com/Rebero/Quantgeist
time.sleep(10) 

&emsp;

We are now ready to import the dataset:

In [None]:
# Import data from a comma-separated values (CSV) file
crash_df = pd.read_csv('Quantgeist/accidents_dataset_corrected.csv')

crash_df["Accident_Severity_numerical"] = crash_df["Accident_Severity"]
# Read from file with number descriptions and replace number with description where possible
xls = pd.ExcelFile('Quantgeist/Road-Accident-Safety-Data-Guide.xls')
for name in xls.sheet_names: 
  if name != "Introduction" and name != "Export Variables":
    names_df = pd.read_excel(xls, name)
    if name == "Ped Cross - Human":
      name = "Pedestrian_Crossing-Human_Control"
    if name == "Ped Cross - Physical":
      name = "Pedestrian_Crossing-Physical_Facilities"
    if name == "Weather":
      name = "Weather_Conditions"
    if name == "Road Surface":
      name = "Road_Surface_Conditions"
    if name == "Urban Rural":
      name = "Urban_or_Rural_Area"
    if name == "Police Officer Attend":
      name = "Did_Police_Officer_Attend_Scene_of_Accident"
    name = name.replace(" ", "_")
    if name in list(crash_df.columns):
      rename_dict = names_df.set_index('code').to_dict()['label']
      crash_df[name] = crash_df[name].replace(rename_dict)

# remove irrelevant columns
crash_df = crash_df.drop(columns=['Location_Easting_OSGR', 'Location_Northing_OSGR', "Local_Authority_(District)", "Local_Authority_(Highway)", "LSOA_of_Accident_Location", "Police_Force", "Accident_Index", "1st_Road_Number",	"2nd_Road_Number", "Date",	"Time",	"Special_Conditions_at_Site", "Carriageway_Hazards", "Pedestrian_Crossing-Human_Control"])


&emsp;

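To get a first overview of the dataset, we can look at its dimensions, for example the number of accidents and the number of attributes reported per accident.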



In [None]:
# dataframe.shape returns the dimensions of our dataset as (rows, columns)
crash_df.shape
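
pandas exposes two related attributes for a first size check, and they are easy to mix up. The sketch below uses a small made-up table (`toy_df` is a hypothetical stand-in, not the workshop data) to show the difference between them.

```python
import pandas as pd

# Toy stand-in for crash_df: 3 accidents (rows) described by 2 attributes (columns).
toy_df = pd.DataFrame({
    "Speed_limit": [30, 60, 70],
    "Number_of_Vehicles": [2, 1, 3],
})

n_rows, n_cols = toy_df.shape  # dimensions as (rows, columns)
n_cells = toy_df.size          # total number of cells, i.e. rows * columns

print("rows:", n_rows, "columns:", n_cols, "cells:", n_cells)
```

`shape` answers "how many accidents and how many attributes", while `size` only gives the product of the two.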

&emsp;

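**Task: I want you to write some code yourselves** so that you can see the actual data. One method you can use is called `head`. Remember that our dataset is named `crash_df`. The syntax is similar to the cell right above, where we looked up the size of `crash_df`.

In [None]:
# Your code goes here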


&emsp;

Suppose we plot the longitude and latitude of every accident on an empty coordinate system. **Question: What will you see?**

In [None]:
# lets map the longitude and latitude 
fig = crash_df.plot(kind="scatter", x="Longitude", y="Latitude", alpha=0.6,
                   figsize=(18,11),c="Accident_Severity_numerical", cmap=plt.get_cmap("inferno"), 
                   colorbar=True,)

# remove irrelevant columns
crash_df = crash_df.drop(columns=['Longitude', 'Latitude', 'Accident_Severity_numerical'])

&emsp;

**Question: What does -1 mean, and why does it occur so often?**

In [None]:
# Change -1 values in "2nd_Road_Class" to "Not a junction" where the "Junction_Detail" column lets us see that there was no junction
junction_details = ['Not at junction or within 20 metres', 'Other junction', 'Private drive or entrance', 'T or staggered junction']
no_junction = crash_df["Junction_Detail"].isin(junction_details) & (crash_df["2nd_Road_Class"] == -1)
crash_df.loc[no_junction, '2nd_Road_Class'] = 'Not a junction'

# Plot the value distribution once more for the "2nd_Road_Class" column
ax = sns.countplot(x=crash_df['2nd_Road_Class'], palette="pastel", edgecolor=".6")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
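
Before cleaning a placeholder value column by column, it can help to measure how widespread it is. The following is a minimal sketch on made-up data, assuming that -1 is used as a placeholder code; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical excerpt: -1 appears as a placeholder in two of the three columns.
toy_df = pd.DataFrame({
    "2nd_Road_Class": [-1, 3, -1, 6],
    "Junction_Control": [-1, 4, 2, 4],
    "Speed_limit": [30, 30, 60, 70],
})

# Comparing the whole frame to -1 gives a boolean table;
# the column-wise mean of that table is the share of placeholder rows.
placeholder_share = (toy_df == -1).mean()
print(placeholder_share)
```

Columns with a high share deserve a closer look, because the placeholder may carry meaning (as with roads that have no junction) rather than being a plain data error.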

&emsp;

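Great! Now that preprocessing is done, we can move on to feature engineering.

**Summary:**

1. Reading data can run into formatting problems.
2. Most of the time, the dataset needs to be explored before it can be used.
3. Data contains mismatches and errors; some of them can be resolved with common sense.
4. No dataset is perfect, except the idealised ones described in textbooks.
5. When leading data scientists, be aware that machine learning requires a lot of data cleaning and preprocessing, which can take up as much as 90% of the total project time.

**Where we stand in the process and key tasks for you as a manager:**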

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step2_sh.png?alt=media&token=48e29cfc-7af7-4f8f-b561-23d404562a0f)

# Step 3: Feature Engineering / Exploratory Data Analysis (EDA)

**Where we stand in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step3_before_sh.png?alt=media&token=e0c2a3a2-4f12-4f8f-8dce-51c07cf40423)

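Feature engineering / EDA is an approach for
- analysing a dataset and summarising its main characteristics;
- using domain knowledge about the data to make machine learning algorithms work.

&emsp;

"*Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering.*"

Andrew Ng, *Machine Learning and AI via Brain Simulations*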

&emsp;

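## 3.1 Target Variable

Let us first look at the target variable: the severity of the accident.
Depending on the severity, we dispatch a different type of ambulance.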


In [None]:
# Show the number of accidents per severity level
ax = sns.countplot(x=crash_df['Accident_Severity'], palette="pastel", edgecolor=".6", order = crash_df["Accident_Severity"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
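
A bar chart shows the imbalance between severity levels visually; the same information can be read off as percentages. This is a small sketch on invented labels (the ten values in `severity` are hypothetical and do not reflect the real distribution).

```python
import pandas as pd

# Ten made-up accidents, labelled with the three severity levels.
severity = pd.Series(["Slight", "Slight", "Slight", "Serious", "Fatal",
                      "Slight", "Slight", "Serious", "Slight", "Slight"])

# normalize=True turns counts into fractions; multiply by 100 for percentages.
share = severity.value_counts(normalize=True).mul(100)
print(share.round(1))
```

The percentages make it easy to see how strongly one class dominates, which matters later when judging a model's accuracy.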

&emsp;

## 3.2 Features

Diving deeper, we now compare different predictors (e.g. *speed*) to the target variable (the *accident severity* in our case) to see whether there appears to be a correlation, or even a plausible causal link.

In [None]:
feature_selector = widgets.Dropdown(
    options=["Speed_limit", "Day_of_Week","1st_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Physical_Facilities", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident"],
    description='Feature:',
    disabled=False,
)

crash_details = "Speed_limit"
feature_selected = widgets.Output()

def feature_selector_handler(change):
    clear_output(wait=True)
    feature_selected.clear_output()
    display(feature_selector) 
    with feature_selected:
        crash_details = feature_selector.value
        accident_counts = (crash_df.groupby([crash_details])["Accident_Severity"]
                             .value_counts(normalize=True)
                             .rename('percentage')
                             .mul(100)
                             .reset_index()
                             .sort_values(crash_details))
        p = sns.barplot(x="Accident_Severity", y="percentage", hue=crash_details, palette="pastel", data=accident_counts, edgecolor=".6", order = crash_df["Accident_Severity"].value_counts().index)
        p.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
        p = plt.setp(p.get_xticklabels(), rotation=40) 
        
feature_selector.observe(feature_selector_handler, names="value")
display(feature_selector) 
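
The core of the cell above is a table of severity percentages within each value of the selected feature. As an alternative view of the same idea, the sketch below uses `pd.crosstab` instead of the `groupby` chain, on a made-up six-row table (`toy_df` is hypothetical).

```python
import pandas as pd

# Hypothetical accidents: four urban, two rural.
toy_df = pd.DataFrame({
    "Urban_or_Rural_Area": ["Urban", "Urban", "Urban", "Urban", "Rural", "Rural"],
    "Accident_Severity": ["Slight", "Slight", "Slight", "Serious", "Slight", "Fatal"],
})

# normalize="index" makes every row (feature value) sum to 1,
# so each cell is the share of that severity within the feature value.
table = pd.crosstab(toy_df["Urban_or_Rural_Area"],
                    toy_df["Accident_Severity"],
                    normalize="index").mul(100)
print(table.round(1))
```

Normalising within each feature value matters: comparing raw counts would mostly show that some feature values are simply more common.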

&emsp;

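**Question: Form your own hypothesis. Which of speed limit, urban versus rural area, and whether a police officer attended the scene do you intuitively expect to be the best predictor?**

**Task: Now we want you to test your hypothesis by exploring the data and determining which feature actually predicts severity best.**

Now you can test your hypothesis and find the best predictor. In the code below you can examine how each feature relates to the target variable.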

In [None]:
feature_selector = widgets.Dropdown(
    options=["Speed_limit", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident"],
    description='Feature:',
    disabled=False,
)

crash_details = "Urban_or_Rural_Area"
feature_selected = widgets.Output()

def feature_selector_handler(change):
    clear_output(wait=True)
    feature_selected.clear_output()
    display(feature_selector) 
    with feature_selected:
        crash_details = feature_selector.value
        accident_counts = (crash_df.groupby([crash_details])["Accident_Severity"]
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values(crash_details))
        p = sns.barplot(x="Accident_Severity", y="percentage", hue=crash_details, palette="pastel", data=accident_counts, edgecolor=".6", order = crash_df["Accident_Severity"].value_counts().index)
        p.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
        p = plt.setp(p.get_xticklabels(), rotation=40)  
        
feature_selector.observe(feature_selector_handler, names="value")
display(feature_selector) 



&emsp;

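As we can see, some features are clearly better or worse than others at predicting our target, but none of them is truly convincing on its own.

**Question: Which important features can you think of that are missing from the dataset?**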


&emsp;

**Key learnings:**
1. Further questions can always come up during the analysis.
2. Analysing data without a guiding business question takes a great deal of time.

&emsp;

**Where we stand in the process and key tasks for you as a manager:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step3_sh.png?alt=media&token=322e5986-5327-4920-95c4-7802d373ac76)

&emsp;

**Question: One more thing! What is our target variable, and how does it differ from what we actually want to predict?**

In [None]:
# Merge "Serious" into "Fatal" so that the target has only two classes
crash_df["Accident_Severity"] = crash_df["Accident_Severity"].replace("Serious", "Fatal")
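
Merging two severity levels turns a three-class target into a two-class one that matches the dispatch decision. A minimal sketch on invented labels follows; the name `needs_medical_team` and the 0/1 coding are assumptions made for illustration.

```python
import pandas as pd

# Four made-up accidents with the original three severity levels.
severity = pd.Series(["Slight", "Serious", "Fatal", "Slight"])

# Step 1: treat "Serious" like "Fatal", leaving two classes.
merged = severity.replace("Serious", "Fatal")

# Step 2 (assumed coding): 1 = send the ambulance with a medical team, 0 = basic ambulance.
needs_medical_team = (merged == "Fatal").astype(int)
print(needs_medical_team.tolist())
```

Assigning the result of `replace` back, rather than relying on `inplace=True` on a selected column, keeps the update explicit.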


&emsp;

# Step 4: Modelling

So far we have done no machine learning at all; we have only prepared ourselves and the dataset for the machine learning step of the process. The actual machine learning makes up a rather small portion of the overall effort.

**Where we stand in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step4_before_sh.png?alt=media&token=7a366b1b-ba05-480b-876b-6aa5cf770d5a)


## 4.1 Data Transformation

### 4.1.1 Data Preparation

In [None]:
#OneHotEncoded: "Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Human_Control", "Pedestrian_Crossing-Physical_Facilities", "Special_Conditions_at_Site", "Carriageway_Hazards", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident"
#crash_df = pd.get_dummies(crash_df, prefix=["Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Physical_Facilities", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident", "Weather_Conditions"], columns=["Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Physical_Facilities", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident", "Weather_Conditions"])

#Categorize Strings: Accident_Severity
#le = LabelEncoder()
#crash_df['Accident_Severity'] = le.fit_transform(crash_df['Accident_Severity'].tolist())

# Normalize numerical values - squeeze them into the range 0 to 1
#scaler = MinMaxScaler()
#scaler.fit(crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]])
#crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]] = scaler.transform(crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]])

# Separate dataset into predictors and target set
#y = crash_df["Accident_Severity"]
#X = crash_df.drop(columns=['Accident_Severity'])
X = pd.read_csv('Quantgeist/X.csv')
y = pd.read_csv('Quantgeist/y.csv')
X_colnames = list(X.columns)

# Oversample
#sm = SMOTE(random_state=42)
#X, y = sm.fit_resample(X, y)


1.   **Encoding of variables**

Encoding is the transformation of categorical variables to binary or numerical counterparts.


2.  **Feature scaling**

Many machine learning algorithms use the Euclidean distance between two data points in their computations. Because our features differ in magnitude, unit, and range, features with large magnitudes would weigh far more heavily in those calculations than features with small magnitudes. In our use case, `Number_of_Vehicles` ranges from 1 to 23 while `Speed_limit` ranges from 20 to 90.

There are several ways to deal with feature scaling, like normalization and standardization. We will use a simple MinMax scaling that squeezes all numerical features within a 0 to 1 range.


3.   **Balancing Dataset**

Because we have far fewer data points with serious or fatal accident severity (only 18%) than with slight severity, a model that predicted every crash as slight would already achieve 82% accuracy. We will go into more detail later in the section **Model Evaluation**.





&emsp;

### 4.1.2 Train Test Split

This is the golden rule of machine learning: **Never train your model on your test data**. Unfortunately, only a few people adhere to this strictly enough. A common choice is to hold out 20% of your data for testing.
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/train_test_sh.png?alt=media&token=594998cc-2f41-4cfd-a336-5dd4fbadedcc)

In [None]:
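# Split data 80:20 into training and test data, keeping the class proportions in both parts
# Note: each iteration overwrites the variables, so only the last of the 4 splits is kept
sss = StratifiedShuffleSplit(n_splits=4, test_size=0.2, random_state=1)
for train_index, test_index in sss.split(X, y):
    # split() yields positional indices, so select rows by position with iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]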
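
When only a single split is needed, scikit-learn's `train_test_split` with the `stratify` argument is a compact alternative. The sketch below uses invented data (`X_toy`, `y_toy`) with 20% positive labels to show that the class proportion is preserved and that the two parts do not overlap.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 made-up rows, 20 of them labelled 1 (serious or fatal, assumed coding).
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("positive share train:", y_tr.mean(), "test:", y_te.mean())
print("overlap between train and test:", len(set(X_tr.index) & set(X_te.index)))
```

Stratifying matters most for imbalanced targets like ours, where a purely random split could leave the test set with too few serious accidents to evaluate on.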


&emsp;

## 4.2 Building the Model

In this chapter, we will compare different types of machine learning algorithms. To do that we set up and train each of the different algorithms on the same data.

&emsp;

### 4.2.1 Deep Neural Network

In [None]:
# Set up and train model
model = Sequential()
model.add(Dense(90,input_shape=(80,)))
model.add(Dropout(0.3))
model.add(Activation('relu'))
model.add(Dense(32))
model.add(Dropout(0.3))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc', ])

model.fit(X_train, y_train ,batch_size=128,epochs=25,validation_split=0.2)


&emsp;

### 4.2.2 Random Forest



First, let's set up the algorithm with its parameters.

In [None]:
# Set up
rf = RandomForestClassifier(verbose=2, random_state=42, n_jobs = -1, class_weight="balanced_subsample", n_estimators=150, max_depth=None, min_samples_split=2, min_samples_leaf=1, bootstrap=True, max_features="sqrt")

&emsp;

**Task: Now we want you to write the code that actually builds a machine learning model, in this case a random forest. Please write the Python code for training the model below. As a hint, look at how we did this for the neural network. The object you are calling is named `rf` (short for random forest) and the method you want to use is `fit()`. You will see that this follows the same syntax you used in your first coding challenge at the very beginning.**

In [None]:
#Your code goes here
trained_rf = rf.

**Where we stand in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step4_sh.png?alt=media&token=807883aa-c115-4d0d-ad5b-1d61ccfc825e)


&emsp;

# Step 5: Model Evaluation

**Where we stand in the process:**
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step5_before_sh.png?alt=media&token=ed203fd7-1441-4a54-8d7d-bd3abad8424a)

Now let us compare how well our models actually performed by evaluating the error.


&emsp;

## 5.1 Metrics

In [None]:
#Performances on test data
def get_measures(y_true, y_pred, th=0.5):
  yhat_classes = np.where(y_pred > th, 1, 0)
  acc = metrics.accuracy_score(y_true, yhat_classes)
  bal_acc = metrics.balanced_accuracy_score(y_true, yhat_classes)
  sen = metrics.recall_score(y_true, yhat_classes) 
  f1 = metrics.f1_score(y_true, yhat_classes)
  return [acc, bal_acc, sen, f1]

#Run model on test data
y_nn = model.predict(X_test)
y_rf = rf.predict(X_test)

dataset = "Test"  
overall_peformance_df = pd.DataFrame(columns=['model', 'dataset', 'accuracy', 'balanced accuracy', 'sensitivity','F1'])
overall_peformance_df.loc[len(overall_peformance_df)] = ['Neural network', dataset] + get_measures(y_test, y_nn)
overall_peformance_df.loc[len(overall_peformance_df)] = ['Random Forest', dataset] + get_measures(y_test, y_rf)
overall_peformance_df
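
To make the four metrics in the table less of a black box, the sketch below recomputes them by hand from the counts of correct and incorrect predictions. The labels and probabilities are invented, and label 1 is assumed to mean "serious or fatal".

```python
import numpy as np
from sklearn import metrics

# Made-up ground truth and predicted probabilities for ten accidents.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.2, 0.7, 0.3, 0.2, 0.1, 0.1, 0.05])
y_hat = np.where(y_prob > 0.5, 1, 0)  # same thresholding as in get_measures

tp = int(((y_hat == 1) & (y_true == 1)).sum())  # serious, predicted serious
tn = int(((y_hat == 0) & (y_true == 0)).sum())  # slight, predicted slight
fp = int(((y_hat == 1) & (y_true == 0)).sum())  # slight, predicted serious
fn = int(((y_hat == 0) & (y_true == 1)).sum())  # serious, predicted slight

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)                    # also called recall
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2
f1 = 2 * tp / (2 * tp + fp + fn)

print("by hand:", accuracy, balanced_accuracy, sensitivity, f1)
print("sklearn:", metrics.accuracy_score(y_true, y_hat),
      metrics.balanced_accuracy_score(y_true, y_hat),
      metrics.recall_score(y_true, y_hat),
      metrics.f1_score(y_true, y_hat))
```

Accuracy rewards getting the many slight accidents right, whereas sensitivity and balanced accuracy show how well the rare serious accidents are caught.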


&emsp;

Can we already pick a winner? Let us look more specifically at the confusion matrix for each model.
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/confusion_042020.png?alt=media&token=5d4c0c03-50ef-406a-87e1-780b70326559)

In [None]:
# print confusion matrix
#@title After running this cell manually, it will auto-run if you change the selected value. { run: "auto" }

model_details = "Random Forest" #@param ["Deep Neural Network", "Random Forest"]
if model_details == "Deep Neural Network":
  model_pred = y_nn
else:
  model_pred = y_rf
model_pred = model_pred.round()

print(" Confusion matrix \n", metrics.confusion_matrix(y_test, model_pred))
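
The four cells of a binary confusion matrix map directly onto dispatch outcomes. The sketch below unpacks them on invented predictions, assuming label 1 means "serious or fatal"; the business interpretation in the comments follows from that assumption.

```python
import numpy as np
from sklearn import metrics

# Made-up ground truth and predicted classes for ten accidents.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For labels 0/1, scikit-learn orders the matrix as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

print("slight accident, basic ambulance sent (correct):", tn)
print("slight accident, medical team sent (costly):", fp)
print("serious accident, basic ambulance sent (dangerous):", fn)
print("serious accident, medical team sent (correct):", tp)
```

The two kinds of error are not equally bad in this business case, which is why a single accuracy figure is not enough to pick a winner.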


&emsp;

**Question: Which model is better? The one created by the Deep Neural Network or by the Random Forest?**


&emsp;

Now let's try to better understand what the model learned. Even though we cannot see how exactly the models arrived at their prediction, we can see which features were deemed most important by the models. 
So, let's check if you were right in picking the best predictor in the question at the beginning.



In [None]:
feat_importances = pd.Series(rf.feature_importances_, index=X_colnames)
feat_importances.nlargest(10).plot(kind='barh')
#feat_importances.nsmallest(10).plot(kind='barh')


&emsp;

We can see that most of the features that the model deemed important are in line with our intuition, which is a good sign that the model learned something useful. For example, `Did_Police_Officer_Attend_Scene_of_Accident_No` is one of the important features.


&emsp;

BUT WAIT! **Question: What is the catch with this feature?**


We do not have this information at the time the prediction is made, which is when the emergency call arrives. This means we are not allowed to use this feature as a predictor. Spotting this requires domain expertise, so it would have been the manager's task to catch the mistake.

Usually, we would have to drop this feature at the beginning and repeat the whole process, with an unforeseeable outcome.
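
Dropping such a feature after one-hot encoding means removing every column derived from it. A minimal sketch on a made-up feature table follows; `X_toy` and the `_Yes` column name are assumptions (only the `_No` column is mentioned above).

```python
import pandas as pd

# Hypothetical encoded feature table with two columns derived from the leaky feature.
X_toy = pd.DataFrame({
    "Speed_limit": [0.1, 0.6],
    "Did_Police_Officer_Attend_Scene_of_Accident_No": [1, 0],
    "Did_Police_Officer_Attend_Scene_of_Accident_Yes": [0, 1],
})

# Collect all one-hot columns that stem from the feature unknown at call time.
leaky_prefix = "Did_Police_Officer_Attend_Scene_of_Accident"
leaky_cols = [col for col in X_toy.columns if col.startswith(leaky_prefix)]

X_clean = X_toy.drop(columns=leaky_cols)
print(list(X_clean.columns))
```

After removing the columns, the split, training, and evaluation steps would all need to be rerun, since every downstream result depended on the leaked information.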


&emsp;

## 5.2 Final Decision

**Question: Even more important than the metrics themselves: is the model good enough for us to use in production?**

The answer is: This depends on our baseline, the current process we use to judge which ambulance to send and how good this current process is.

**Where we stand in the process & key tasks for you as a manager:**
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/step5_sh.png?alt=media&token=c9fdcbcf-9f78-467c-8797-26d8110ace9d)