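# End-to-End Machine Learning Walkthrough in Python
## Author: 雷内·埃伯 (transliterated)

**Confidential: do not copy or forward without the author's written permission**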




# Flowchart


![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture1.png?alt=media&token=9cf26360-4603-4df3-aa09-07f3ced3a3ea)

# Step 1: Getting started - introducing the business case and dataset


![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/ambulance.png?alt=media&token=1494f989-3014-495f-a69b-974275793bda)

&emsp;

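**Background:** An emergency rescue company handles the emergency response to all road accidents in the UK. As a manager or chief data scientist at this company, you are responsible for:
  - Dispatching an ambulance to the scene (for example, a vehicle collision).
  - Taking the injured to the hospital nearest to the scene.
  - Providing emergency medical care to the injured.

There are two types of ambulance, and they provide different levels of emergency medical care.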

&emsp;

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/ambulances.png?alt=media&token=79e882b8-ad9f-4ebb-91df-ec06eeb1eb06)

&emsp;

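The ambulance on the left carries only basic medical equipment and is suited to treating patients with minor injuries.

The ambulance on the right can be seen as an extension of the hospital: it carries a doctor and advanced medical equipment. Even complex procedures, such as surgery, can be performed on the way to the hospital.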

&emsp;

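**Goal:** You need to dispatch the right type of ambulance, because:
  - Both types of ambulance are in limited supply.
  - The ambulance with a medical team on board is far more expensive and should only be dispatched to serious or fatal accidents.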
&emsp;

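**Problem:** You have to decide which type of ambulance to dispatch before you know how severe the accident is.

&emsp;

**Approach:** We want to predict the severity of an accident from the data available when it is reported. Using these data points, we will build a prediction mechanism for accident severity, so that after an emergency call we can decide accurately which type of ambulance to dispatch.

We will build the predictive model from the publicly accessible UK road accident database.


**Question: which type of machine learning is this?**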

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FML_types.png?alt=media&token=21816d1e-5070-43b6-9ecd-6ff1f11f3d0f)

**As a manager, your current position in the process and your key tasks are:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture3.png?alt=media&token=18b18435-15bf-49c3-81ed-c77b085112f1)

# Step 2: Preprocessing

**Where we are in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture4.png?alt=media&token=2669ce41-519c-49aa-8a89-405d98e90918)

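First, we import the libraries used in this workshop (for example, pandas). These libraries provide all the functionality we will use, such as computing averages and visualizing data.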


In [19]:
import ipywidgets as widgets
from IPython.display import Javascript, display
def run_all(ev):
    display(Javascript("Jupyter.notebook.execute_cells([Jupyter.notebook.get_cell_elements().index(this.element.parents('.cell'))+1]);"))
button = widgets.Button(button_style='info',description="Run")
button.on_click(run_all)
display(button)

Button(button_style='info', description='Run', style=ButtonStyle())

In [None]:
# Loading libraries
#!pip install keras
#!pip install tensorflow
#!pip install seaborn
#!pip install scikit-learn
#!pip install matplotlib

import os
import operator
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sklearn
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn import metrics
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
import warnings; warnings.simplefilter('ignore')
import ipywidgets as widgets
from IPython.display import clear_output
import time
sns.set()

# Connect to data source
! git clone https://github.com/Rebero/Quantgeist
time.sleep(10) 

&emsp;

We are now ready to load the dataset:

In [None]:
# Import data from a comma-separated values (CSV) file
crash_df = pd.read_csv('Quantgeist/ch_accidents_dataset_corrected.csv')

crash_df["事故严重程度 Accident_Severity_numerical"] = crash_df["事故严重程度 Accident_Severity"]
# Read from file with number descriptions and replace number with description where possible
xls = pd.ExcelFile('Quantgeist/ch_Road-Accident-Safety-Data-Guide.xls')
for name in xls.sheet_names: 
  if name != "Introduction" and name != "Export Variables":
    names_df = pd.read_excel(xls, name)
    if name == "Ped Cross - Human":
      name = "Pedestrian_Crossing-Human_Control"
    if name == "Ped Cross - Physical":
      name = "人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities"
    if name == "Weather":
      name = "Weather_Conditions"
    if name == "Road Surface":
      name = "Road_Surface_Conditions"
    if name == "Urban Rural":
      name = "Urban_or_Rural_Area"
    if name == "Police Officer Attend":
      name = "警察是否到达事故现场 Did_Police_Officer_Attend_Scene_of_Accident"
    #name = name.replace(" ", "_")
    if name in list(crash_df.columns):
      rename_dict = names_df.set_index('code').to_dict()['label']
      crash_df[name] = crash_df[name].replace(rename_dict)

# remove irrelevant columns
crash_df = crash_df.drop(columns=['Location_Easting_OSGR', 'Location_Northing_OSGR', "Local_Authority_(District)", "Local_Authority_(Highway)", "LSOA_of_Accident_Location", "Police_Force", "Accident_Index", "1st_Road_Number",	"2nd_Road_Number", "Date",	"Time",	"Special_Conditions_at_Site", "Carriageway_Hazards", "Pedestrian_Crossing-Human_Control"])


&emsp;

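To get a first overview of the dataset, look at its dimensions, that is, how many accidents were recorded and how many fields were reported for each of them.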



In [None]:
# DataFrame.size returns the total number of cells (rows x columns); crash_df.shape gives (rows, columns)
crash_df.size
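
Note that `size` counts cells rather than rows. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up and are not part of the dataset) showing how `shape` and `size` relate:

```python
import pandas as pd

# Hypothetical toy data standing in for crash_df (3 accidents, 2 fields).
toy_df = pd.DataFrame({
    "Speed_limit": [30, 60, 70],
    "Number_of_Vehicles": [2, 1, 3],
})

n_rows, n_cols = toy_df.shape   # (rows, columns)
n_cells = toy_df.size           # rows * columns

print(f"{n_rows} rows x {n_cols} columns = {n_cells} cells")
```

On the real dataset, the same two attributes tell you how many accidents were recorded and how many fields describe each one.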

&emsp;

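**Task: write the code yourself** so that you can look at the actual data. One method you can use is called `head`. Remember that our dataset is named `crash_df`. Follow the same pattern as in the cell above, where `size` was accessed on `crash_df`; unlike `size`, `head` is a method, so it needs parentheses.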

In [None]:
# Your code goes here
crash_df.

&emsp;

Suppose we plot the longitude and latitude of every accident on an empty coordinate system. **Question: what will you see?**

In [None]:
# Plot the longitude and latitude of every accident, colored by numeric severity
fig = crash_df.plot(kind="scatter", x="经度 Longitude", y="纬度 Latitude", alpha=0.6,
                   figsize=(18,11),c="事故严重程度 Accident_Severity_numerical", cmap=plt.get_cmap("inferno"), 
                   colorbar=True,)

# remove irrelevant columns
crash_df = crash_df.drop(columns=['经度 Longitude', '纬度 Latitude', '事故严重程度 Accident_Severity_numerical'])

&emsp;

**Question: what does -1 mean, and why does it occur so often?**

In [None]:
# Replace -1 in "2nd_Road_Class" with "不在交汇处或20米范围内" (not at a junction or within 20 metres) when the "Junction_Detail" column lets us see that there was no junction
crash_df.loc[(crash_df["路口详细信息 Junction_Detail"] == 'Not at junction or within 20 metres 不在交汇处或20米范围内') & (crash_df["2级公路 2nd_Road_Class"] == -1), '2级公路 2nd_Road_Class'] = '不在交汇处或20米范围内'  
crash_df.loc[(crash_df["路口详细信息 Junction_Detail"] == 'Other junction 其他岔路') & (crash_df["2级公路 2nd_Road_Class"] == -1), '2级公路 2nd_Road_Class'] = '不在交汇处或20米范围内'  
crash_df.loc[(crash_df["路口详细信息 Junction_Detail"] == 'Private drive or entrance 私人车道或入口') & (crash_df["2级公路 2nd_Road_Class"] == -1), '2级公路 2nd_Road_Class'] = '不在交汇处或20米范围内'  
crash_df.loc[(crash_df["路口详细信息 Junction_Detail"] == 'T or staggered junction T字路口汇合处') & (crash_df["2级公路 2nd_Road_Class"] == -1), '2级公路 2nd_Road_Class'] = '不在交汇处或20米范围内'

# Plot the value distribution once more for the "2nd_Road_Class" column
ax = sns.countplot(x=crash_df['2级公路 2nd_Road_Class'], palette="pastel", edgecolor=".6")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

&emsp;

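Great! Preprocessing is done, so we can move on to feature engineering.

**Summary:**

1. Reading data can run into formatting problems.
2. Most of the time, the dataset needs to be analyzed first.
3. Data will contain mismatches and errors; some can be resolved with common sense.
4. No dataset is perfect, except the idealized ones described in textbooks.
5. When leading data scientists, be aware that machine learning requires a great deal of data cleaning and preprocessing, which can take up as much as 90% of a project's total time.

**As managers, where we are in the process and our key tasks:**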

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture5.png?alt=media&token=b775f03d-9edf-4112-8455-8edacc755779)

# Step 3: Feature engineering / exploratory data analysis (EDA)

**We are at this step of the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture6.png?alt=media&token=1692ab82-67ef-495c-875d-1e5da20207d5)

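Feature engineering / EDA is an approach that:
- analyzes the dataset and summarizes its main characteristics;
- uses domain knowledge about the data to make machine learning algorithms work.

&emsp;

"*Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering.*"

Andrew Ng, *Machine Learning and AI via Brain Simulations*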

&emsp;

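## 3.1 Target variable

Let's first look at the target variable: the severity of the accident.
We dispatch different types of ambulance depending on the severity.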


In [None]:
# Show the number of accidents per severity level
ax = sns.countplot(x=crash_df['事故严重程度 Accident_Severity'], palette="pastel", edgecolor=".6", order = crash_df["事故严重程度 Accident_Severity"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

&emsp;

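## 3.2 Features

Going one step further, we now compare different predictors (for example, the speed limit) with the target variable (in our case, accident severity) to see whether the correlation, or even causal relationship, that we intuitively expect between them shows up. Keep in mind that a correlation in the data does not by itself establish causation.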

In [None]:
feature_selector = widgets.Dropdown(
    options=["速度限制 Speed_limit", "星期几 Day_of_Week","1级公路 1st_Road_Class","道路类型 Road_Type", "路口详细信息 Junction_Detail", "路口控制 Junction_Control", "路灯情况 Light_Conditions", "路面情况 Road_Surface_Conditions", "人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities", "市区或郊区 Urban_or_Rural_Area", "警察是否到达事故现场 Did_Police_Officer_Attend_Scene_of_Accident"],
    description='特征 Feature:',
    disabled=False,
)

crash_details = "Speed_limit"
feature_selected = widgets.Output()

def feature_selector_handler(change):
    clear_output(wait=True)
    feature_selected.clear_output()
    display(feature_selector) 
    with feature_selected:
        crash_details = feature_selector.value
        accident_counts = (crash_df.groupby([crash_details])["事故严重程度 Accident_Severity"]
                             .value_counts(normalize=True)
                             .rename('percentage')
                             .mul(100)
                             .reset_index()
                             .sort_values(crash_details))
        p = sns.barplot(x="事故严重程度 Accident_Severity", y="percentage", hue=crash_details, palette="pastel", data=accident_counts, edgecolor=".6", order = crash_df["事故严重程度 Accident_Severity"].value_counts().index)
        p.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
        p = plt.setp(p.get_xticklabels(), rotation=40) 
        
feature_selector.observe(feature_selector_handler, names="value")
display(feature_selector) 

&emsp;

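**Question: form your own hypothesis. Among the speed limit, the urban/rural split, and whether a police officer attended the scene, which do you intuitively expect to be the most accurate predictor?**

**Task: now test your hypothesis by exploring the actual data and identifying the most accurate predictor of accident severity.**

This is the moment to test your hypothesis and find the best predictor. The code below lets you examine how each feature relates to the target variable.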

In [None]:
feature_selector = widgets.Dropdown(
    options=["速度限制 Speed_limit", "市区或郊区 Urban_or_Rural_Area", "警察是否到达事故现场 Did_Police_Officer_Attend_Scene_of_Accident"],
    description='特征 Feature:',
    disabled=False,
)

crash_details = "市区或郊区 Urban_or_Rural_Area"
feature_selected = widgets.Output()

def feature_selector_handler_next(change):
    clear_output(wait=True)
    feature_selected.clear_output()
    display(feature_selector) 
    with feature_selected:
        crash_details = feature_selector.value
        accident_counts = (crash_df.groupby([crash_details])["事故严重程度 Accident_Severity"]
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values(crash_details))
        p = sns.barplot(x="事故严重程度 Accident_Severity", y="percentage", hue=crash_details, palette="pastel", data=accident_counts, edgecolor=".6", order = crash_df["事故严重程度 Accident_Severity"].value_counts().index)
        p.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
        p = plt.setp(p.get_xticklabels(), rotation=40)  
        
feature_selector.observe(feature_selector_handler_next, names="value")
display(feature_selector) 
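
The bar charts above are built from a table of severity shares per feature value. As a minimal sketch, the same `groupby` / `value_counts(normalize=True)` chain can be run on a hypothetical toy frame (the `area` and `severity` columns and their values are made up for illustration) to see what that table looks like:

```python
import pandas as pd

# Hypothetical toy data: 4 urban and 2 rural accidents.
toy_df = pd.DataFrame({
    "area": ["Urban", "Urban", "Urban", "Urban", "Rural", "Rural"],
    "severity": ["Slight", "Slight", "Slight", "Serious", "Slight", "Serious"],
})

# Share of each severity level within each area, expressed in percent.
shares = (toy_df.groupby("area")["severity"]
                .value_counts(normalize=True)
                .rename("percentage")
                .mul(100)
                .reset_index())

print(shares)
```

Because the shares are normalized within each group, a feature value with few accidents can be compared directly with one that has many.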



&emsp;

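As we can see, some features are clearly better or worse than others at predicting our target, but none of them is convincing enough on its own.

**Question: which important features can you think of that are not in the dataset?**


&emsp;

**Key takeaways:**
1. Further questions can always come up during the analysis.
2. Analyzing data without a guiding business question takes a great deal of time.

&emsp;

**As managers, where are we in the process, and what are our key tasks right now:**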

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture7.png?alt=media&token=8fb537fc-4361-482a-9f3f-c9db5318f2ac)

&emsp;

**Question: one more thing! What is our target variable, and how does it differ from what we actually want to predict?**

In [None]:
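# Merge "Serious" into "Fatal" so the target has two classes: slight vs. serious-or-fatal
crash_df["事故严重程度 Accident_Severity"] = crash_df["事故严重程度 Accident_Severity"].replace("Serious 严重的", "Fatal 致命的")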


&emsp;

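# Step 4: Modeling

So far we have not done any machine learning; we have only prepared for the machine learning part of the process. The actual machine learning makes up only a small part of the overall process.

**We are at this step of the process:**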

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture8.png?alt=media&token=378cc173-0da7-456c-a392-6f6f1725edcf)


## 4.1 Data transformation

### 4.1.1 Data preparation

In [None]:
#OneHotEncoded: "Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Human_Control", "Pedestrian_Crossing-Physical_Facilities", "Special_Conditions_at_Site", "Carriageway_Hazards", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident"
#crash_df = pd.get_dummies(crash_df, prefix=["Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Physical_Facilities", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident", "Weather_Conditions"], columns=["Day_of_Week","1st_Road_Class", "2nd_Road_Class","Road_Type", "Junction_Detail", "Junction_Control", "Light_Conditions", "Road_Surface_Conditions", "Pedestrian_Crossing-Physical_Facilities", "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident", "Weather_Conditions"])

#Categorize Strings: Accident_Severity
#le = LabelEncoder()
#crash_df['Accident_Severity'] = le.fit_transform(crash_df['Accident_Severity'].tolist())

# Normalize numerical values - squeeze them into the range 0 to 1
#scaler = MinMaxScaler()
#scaler.fit(crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]])
#crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]] = scaler.transform(crash_df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]])

# Separate dataset into predictors and target set
#y = crash_df["Accident_Severity"]
#X = crash_df.drop(columns=['Accident_Severity'])
X = pd.read_csv('Quantgeist/X.csv')
y = pd.read_csv('Quantgeist/y.csv')
X.columns = [
    '汽车数量 Number_of_Vehicles', '伤员数量 Number_of_Casualties', '速度限制 Speed_limit',
    '星期几 Day_of_Week_1', '星期几 Day_of_Week_2', '星期几 Day_of_Week_3', '星期几 Day_of_Week_4', '星期几 Day_of_Week_5', '星期几 Day_of_Week_6', '星期几 Day_of_Week_7',
    '1级公路 1st_Road_Class_1', '1级公路 1st_Road_Class_2', '1级公路 1st_Road_Class_3', '1级公路 1st_Road_Class_4', '1级公路 1st_Road_Class_5', '1级公路 1st_Road_Class_6',
    '2级公路 2nd_Road_Class_-1', '2级公路 2nd_Road_Class_1', '2级公路 2nd_Road_Class_2', '2级公路 2nd_Road_Class_3', '2级公路 2nd_Road_Class_4', '2级公路 2nd_Road_Class_5', '2级公路 2nd_Road_Class_6',
    '道路类型 Road_Type_1', '道路类型 Road_Type_2', '道路类型 Road_Type_3', '道路类型 Road_Type_6', '道路类型 Road_Type_7', '道路类型 Road_Type_9',
    '路口详细信息 Junction_Detail_-1', '路口详细信息 Junction_Detail_0', '路口详细信息 Junction_Detail_1', '路口详细信息 Junction_Detail_2', '路口详细信息 Junction_Detail_3', '路口详细信息 Junction_Detail_5', '路口详细信息 Junction_Detail_6', '路口详细信息 Junction_Detail_7', '路口详细信息 Junction_Detail_8', '路口详细信息 Junction_Detail_9',
    '路口控制 Junction_Control_-1', '路口控制 Junction_Control_1', '路口控制 Junction_Control_2', '路口控制 Junction_Control_3', '路口控制 Junction_Control_4',
    '路灯情况 Light_Conditions_-1', '路灯情况 Light_Conditions_1', '路灯情况 Light_Conditions_4', '路灯情况 Light_Conditions_5', '路灯情况 Light_Conditions_6', '路灯情况 Light_Conditions_7',
    '路面情况 Road_Surface_Conditions_-1', '路面情况 Road_Surface_Conditions_1', '路面情况 Road_Surface_Conditions_2', '路面情况 Road_Surface_Conditions_3', '路面情况 Road_Surface_Conditions_4', '路面情况 Road_Surface_Conditions_5',
    '人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities_-1', '人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities_0', '人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities_1', '人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities_4', '人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities_5', '人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities_7', '人行横道-物理设施 Pedestrian_Crossing-Physical_Facilities_8',
    '市区或郊区 Urban_or_Rural_Area_1', '市区或郊区 Urban_or_Rural_Area_2', '市区或郊区 Urban_or_Rural_Area_3',
    '警察是否到达事故现场 Did_Police_Officer_Attend_Scene_of_Accident_1', '警察是否到达事故现场 Did_Police_Officer_Attend_Scene_of_Accident_2', '警察是否到达事故现场 Did_Police_Officer_Attend_Scene_of_Accident_3',
    '天气情况 Weather_Conditions_-1', '天气情况 Weather_Conditions_1', '天气情况 Weather_Conditions_2', '天气情况 Weather_Conditions_3', '天气情况 Weather_Conditions_4', '天气情况 Weather_Conditions_5', '天气情况 Weather_Conditions_6', '天气情况 Weather_Conditions_7', '天气情况 Weather_Conditions_8', '天气情况 Weather_Conditions_9',
]
y.columns = ['事故严重程度 Accident_Severity']
X_colnames = list(X.columns)

# Oversample
#sm = SMOTE(random_state=42)
#X, y = sm.fit_resample(X, y)


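1.   **Variable encoding**

Encoding converts categorical variables into binary or numeric counterparts.

2.  **Feature scaling**

Many machine learning algorithms use the Euclidean distance between two data points in their computations. Because our features differ in magnitude, unit, and range, the results depend heavily on the units used.

In our use case, Number_of_Vehicles ranges from 1 to 23, while Speed_limit ranges from 20 to 90. In such computations, high-magnitude features weigh far more than low-magnitude ones.

There are several ways to handle feature scaling, such as normalization and standardization. We will use simple MinMax scaling to squeeze all numeric features into the range 0 to 1.



3.   **Balancing the dataset**

Because we have far fewer data points for serious or fatal accidents (only 18%) than for slight ones, a model that predicts every accident as slight already reaches 82% accuracy. We will cover this in more detail in the model evaluation section.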
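
The first two steps above can be sketched on a hypothetical toy frame. The column names mirror the dataset, but the values are made up, and the scaling is written out with plain pandas arithmetic rather than the `MinMaxScaler` shown in the commented-out notebook code; treat it as an illustration of the formula, not as the notebook's actual pipeline:

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Speed_limit": [20, 30, 60, 90],
    "Number_of_Vehicles": [1, 2, 3, 23],
    "Road_Type": [1, 3, 6, 3],
})

# 1. Encoding: one binary column per category of Road_Type.
encoded = pd.get_dummies(toy_df, columns=["Road_Type"], prefix=["Road_Type"])

# 2. Min-max scaling: (x - min) / (max - min) maps each numeric column to [0, 1].
numeric_cols = ["Speed_limit", "Number_of_Vehicles"]
col_min = encoded[numeric_cols].min()
col_max = encoded[numeric_cols].max()
encoded[numeric_cols] = (encoded[numeric_cols] - col_min) / (col_max - col_min)

print(encoded)
```

After scaling, a speed limit of 90 and a vehicle count of 23 both map to 1, so neither feature dominates a distance computation purely because of its unit.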





&emsp;

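### 4.1.2 Train/test split

The golden rule of machine learning: **never train a model on the test data.** Unfortunately, only a few people follow this rule strictly enough. As a rule of thumb, hold out about 20% of the data for testing.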
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2Fch_train_test.png?alt=media&token=ee87ea30-8e45-45c2-8964-46fe8c1dff4b)

In [None]:
# Split data 80:20 into training and test data
# Note: n_splits=4 generates four splits; the loop below keeps only the last one
sss = StratifiedShuffleSplit(n_splits=4, test_size=0.2, random_state=1)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
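
To see what "stratified" means in practice, here is a minimal pandas-only sketch on hypothetical data. It is not the `StratifiedShuffleSplit` call used above; the `severe` column and all values are assumptions made for illustration. It draws 20% of each class for testing so that both parts keep the same class balance:

```python
import pandas as pd

# Hypothetical toy data: 100 accidents, 20% of them severe (label 1).
toy_df = pd.DataFrame({
    "Speed_limit": [30, 60] * 50,
    "severe": [1] * 20 + [0] * 80,
})

# Draw 20% of each class for testing, so both parts keep the 20% severe share.
test_df = toy_df.groupby("severe").sample(frac=0.2, random_state=1)
train_df = toy_df.drop(test_df.index)

print(len(train_df), len(test_df))
print(train_df["severe"].mean(), test_df["severe"].mean())
```

The test rows are removed from the training frame by index, so no accident appears in both parts.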


&emsp;

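## 4.2 Modeling

In this section we compare different machine learning algorithms. To do so, we build and train each of them on the same data.

&emsp;

### 4.2.1 Deep neural network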

In [None]:
# Set up and train model
model = Sequential()
model.add(Dense(90,input_shape=(79,)))
model.add(Dropout(0.3))
model.add(Activation('relu'))
model.add(Dense(32))
model.add(Dropout(0.3))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc', ])

model.fit(X_train, y_train ,batch_size=128,epochs=25,validation_split=0.2)


&emsp;

### 4.2.2 Random forest



First, we configure the algorithm with its parameters.

In [None]:
# Set up
# max_features="sqrt" is what "auto" meant for classifiers; "auto" was removed in scikit-learn 1.3
rf = RandomForestClassifier(verbose=2, random_state=42, n_jobs = -1, class_weight="balanced_subsample", n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, bootstrap=True, max_features="sqrt")

&emsp;

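**Task: train a machine learning model by writing the code yourself. Type the Python code that trains the model; in this case we use a random forest. You can look back at what we did for the neural network for hints. The object you are calling is named `rf` (short for random forest), and the method you need is `fit()`. You will find that the syntax follows the same logic as in your first coding challenge.**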

In [None]:
#Your code goes here
trained_rf = rf.

**Where we are in the process:**

![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture9.png?alt=media&token=64ea81b4-bc08-4204-af61-604a8c6c75e1)


&emsp;

# Step 5: Model evaluation

**Where we are in the process:**
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture10.png?alt=media&token=e49784f6-eb3f-4e74-9ce2-6c8f4bad92e2)

Now let's compare how the models actually perform by evaluating their errors.


&emsp;

### 5.1 Metrics

In [None]:
#Performances on test data
def get_measures(y_true, y_pred, th=0.5):
  yhat_classes = np.where(y_pred > th, 1, 0)
  acc = metrics.accuracy_score(y_true, yhat_classes)
  bal_acc = metrics.balanced_accuracy_score(y_true, yhat_classes)
  sen = metrics.recall_score(y_true, yhat_classes) 
  f1 = metrics.f1_score(y_true, yhat_classes)
  return [acc, bal_acc, sen, f1]

#Run model on test data
y_nn = model.predict(X_test)
y_rf = rf.predict(X_test)

dataset = "Test"  
overall_peformance_df = pd.DataFrame(columns=['model', 'dataset', 'accuracy', 'balanced accuracy', 'sensitivity','F1'])
overall_peformance_df.loc[len(overall_peformance_df)] = ['深度神经网络 Neural network', dataset] + get_measures(y_test, y_nn)
overall_peformance_df.loc[len(overall_peformance_df)] = ['随机森林 Random Forest', dataset] + get_measures(y_test, y_rf)
overall_peformance_df
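
Why report balanced accuracy and sensitivity next to plain accuracy? The sketch below uses hypothetical labels with the same 18% / 82% class split mentioned earlier and a "model" that always predicts a slight accident. The metrics are computed by hand in plain Python to make the definitions visible; the variable names are illustrative only:

```python
# Hypothetical labels: 18 serious-or-fatal accidents (1) and 82 slight ones (0).
y_true = [1] * 18 + [0] * 82
# A "model" that always predicts the majority class (slight).
y_pred = [0] * 100

# Accuracy: share of all accidents that were classified correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Sensitivity (recall): share of severe accidents that were detected.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
sensitivity = true_pos / y_true.count(1)

# Specificity: share of slight accidents correctly identified as slight.
true_neg = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
specificity = true_neg / y_true.count(0)

# Balanced accuracy is the mean of sensitivity and specificity.
balanced_accuracy = (sensitivity + specificity) / 2

print(accuracy, sensitivity, balanced_accuracy)
```

The accuracy looks respectable, yet the model never detects a single severe accident, which is exactly what sensitivity and balanced accuracy expose.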


&emsp;

Can we pick a winner at this point? Let's take a closer look at each model's confusion matrix.
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2Fch_conf.png?alt=media&token=ab872303-1e2a-405d-91e2-b7e5686d14cb)

In [None]:
# print confusion matrix
#@title After running this cell manually, it will auto-run if you change the selected value. { run: "auto" }

model_details = "Random Forest" #@param ["Deep Neural Network", "Random Forest"]
if model_details == "Deep Neural Network":
  model_pred = y_nn
else:
  model_pred = y_rf
model_pred = model_pred.round()

print(" Confusion matrix \n", metrics.confusion_matrix(y_test, model_pred))
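
For reading the output, it helps to know that scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]`, with rows for the actual class and columns for the predicted class. The sketch below rebuilds that layout by hand on hypothetical labels and probabilities (all values are made up) and shows how lowering the decision threshold trades false negatives for false positives:

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical labels (1 = severe) and predicted probabilities of a severe accident.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.6, 0.2, 0.9, 0.45, 0.3, 0.05]

for threshold in (0.5, 0.25):
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    print(threshold, confusion_matrix_2x2(y_true, y_pred))
```

In the ambulance case a false negative means a basic ambulance is sent to a severe accident, so the bottom-left cell deserves particular attention when comparing models.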


&emsp;

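**Question: which model performs better, the deep neural network or the random forest?**


&emsp;

Now let's get a better understanding of what the model has learned. Although we cannot see exactly how the model arrives at its predictions, we can see which features it considers most important. So let's check whether the predictor you picked as the best one at the start was right.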


In [None]:
feat_importances = pd.Series(rf.feature_importances_, index=X_colnames)
feat_importances.nlargest(10).plot(kind='barh')
#feat_importances.nsmallest(10).plot(kind='barh')


&emsp;

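We can see that the features the model considers most important match our intuition. This is a sign that the model has learned something useful. For example, Did_Police_Officer_Attend_Scene_of_Accident_No is one of the important features.


&emsp;

Wait a moment. **Here is the question: what good is this feature?**


At the time we make the prediction, that is, when the emergency call comes in, this information does not exist yet, which means we cannot use this feature as a predictor! Spotting this takes domain knowledge, so it is the manager's job to catch this mistake.

Normally, we would have to drop this feature right at the start and rerun the whole process.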
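
Dropping such a feature can be sketched as follows. The frame `toy_X` is a hypothetical two-row slice whose column names follow the notebook's naming scheme; the values are made up, and the substring match is just one possible way to select the affected columns:

```python
import pandas as pd

# Hypothetical slice of the feature matrix; values are invented for illustration.
toy_X = pd.DataFrame({
    "速度限制 Speed_limit": [0.1, 0.6],
    "警察是否到达事故现场 Did_Police_Officer_Attend_Scene_of_Accident_1": [1, 0],
    "警察是否到达事故现场 Did_Police_Officer_Attend_Scene_of_Accident_2": [0, 1],
})

# Columns that are only known after the dispatch decision must not be used as predictors.
leaky_marker = "Did_Police_Officer_Attend_Scene_of_Accident"
leaky_cols = [col for col in toy_X.columns if leaky_marker in col]
clean_X = toy_X.drop(columns=leaky_cols)

print(list(clean_X.columns))
```

After removing columns, anything that depends on the column count has to be updated as well, for example the `input_shape` of the neural network's first layer, before the models are retrained.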



&emsp;

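### 5.2 Final decision

**Question: more important than the actual metrics: can this model be used in the real business?**

The answer depends on our baseline: the process currently used to decide which ambulance to dispatch, and how well that process performs.

**As managers, where are we in the process, and what are our key tasks right now:**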
![alt text](https://firebasestorage.googleapis.com/v0/b/heccoding.appspot.com/o/chinese_sustech%2FPicture11.png?alt=media&token=e74a486c-7dcb-4189-99b4-5c3a0cd38e53)