# Exploring Titanic Passenger Survival

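Each passenger record has the following features:
- **Survived**: survival outcome (0 = did not survive; 1 = survived)
- **Pclass**: socio-economic class (1 = upper; 2 = middle; 3 = lower)
- **Name**: passenger's name
- **Sex**: passenger's sex
- **Age**: passenger's age (some entries are `NaN`)
- **SibSp**: number of siblings and spouses aboard with the passenger
- **Parch**: number of parents and children aboard with the passenger
- **Ticket**: passenger's ticket number
- **Fare**: fare the passenger paid
- **Cabin**: passenger's cabin number (some entries are `NaN`)
- **Embarked**: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)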
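
Two of the features above are flagged as containing `NaN`, and **Survived** is a 0/1 label. The following is a minimal sketch of how one might check both properties. It uses a hypothetical three-row `sample` DataFrame rather than the real `titanic_data.csv`, so the numbers it prints say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample that mimics a few of the columns listed above.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

# Count the missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# The mean of a 0/1 column is the fraction of ones, i.e. the share of survivors.
survival_rate = sample["Survived"].mean()
print(survival_rate)
```

Running the same two calls on `full_data` would show how many `Age` and `Cabin` entries are missing before any imputation is applied.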

In [14]:
import random
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

%matplotlib inline

# Set a random seed
random.seed(42)
full_data = pd.read_csv('titanic_data.csv')
display(full_data.head())

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [15]:
# Extract the survival outcome (Survived) into outcomes as the prediction target,
# and keep all remaining columns in features_raw as the feature data.
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)
display(features_raw.head())

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


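The same Titanic sample now shows that the **Survived** feature has been removed from the DataFrame. Note that `features_raw` (the passenger data) and `outcomes` (the survival outcomes) are now *paired*: for any passenger `features_raw.loc[i]`, the corresponding survival outcome is `outcomes[i]`.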
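
The pairing described above holds because dropping a column and selecting a column both keep the original row index. A small sketch of that idea follows, using a hypothetical `toy` DataFrame (the names `toy`, `toy_features` and `toy_outcomes` are illustrative and do not appear in the notebook).

```python
import pandas as pd

# Hypothetical miniature dataset, not the real file.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})

toy_outcomes = toy["Survived"]               # prediction target
toy_features = toy.drop("Survived", axis=1)  # everything else

# Both objects share the same index, so row i and label i belong together.
same_index = toy_features.index.equals(toy_outcomes.index)
print(same_index)

i = 1
print(toy_features.loc[i].to_dict(), "->", toy_outcomes.loc[i])
```

The pairing would break only if one of the two objects were reordered or filtered without applying the same operation to the other.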

## Preprocessing the Data

In [3]:
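# One-hot encode every non-numeric feature (Name, Ticket and Cabin included,
# so each distinct value of those columns becomes its own indicator column)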
features = pd.get_dummies(features_raw)
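
Because `pd.get_dummies` is applied to the whole frame, identifier-like text columns such as **Name** produce one indicator column per distinct value. The sketch below illustrates that effect on a hypothetical three-row `toy` frame and shows one possible alternative, dropping the identifier column before encoding. This alternative is an assumption for illustration, not what the notebook does.

```python
import pandas as pd

# Hypothetical rows with made-up names, only to show how the columns multiply.
toy = pd.DataFrame({
    "Name": ["Doe, Mr. A", "Roe, Miss. B", "Poe, Mr. C"],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "S", "C"],
    "Fare": [7.25, 7.925, 8.05],
})

# Encoding everything: 1 numeric column + 3 names + 2 sexes + 2 ports.
encoded_all = pd.get_dummies(toy)
print(encoded_all.shape)

# Dropping the identifier-like column first keeps the feature matrix compact.
encoded_compact = pd.get_dummies(toy.drop(columns=["Name"]))
print(encoded_compact.shape)
```

Indicator columns for unique identifiers carry no information that generalizes to unseen passengers, which is why some analyses drop such columns. Whether that helps here would need to be checked on the real data.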

Now fill any missing values with 0.

In [4]:
features = features.fillna(0.0)
display(features.head())

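| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Name_Abbing, Mr. Anthony | Name_Abbott, Mr. Rossmore Edward | Name_Abbott, Mrs. Stanton (Rosa Hunt) | Name_Abelson, Mr. Samuel | ... | Cabin_F G73 | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 3 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 5 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |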
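
Filling with 0 means a passenger whose **Age** is unknown is treated as a newborn. As a point of comparison only, here is a minimal sketch of median imputation on a hypothetical `ages` Series. The notebook itself uses `fillna(0.0)`, and the values below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan], name="Age")

# What the notebook does: missing ages become 0.
filled_with_zero = ages.fillna(0.0)

# Alternative (assumption, not from the notebook): use the median of the known ages.
filled_with_median = ages.fillna(ages.median())

print(filled_with_zero.mean(), filled_with_median.mean())
```

A decision tree can still isolate the zero-filled rows with a split such as `Age <= 0.5`, so the simpler choice is not necessarily harmful for this model. The difference matters more for models that treat age as a magnitude.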


## Training the Model

In [16]:
# Split the data into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)
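
To make the effect of `test_size=0.2` concrete, the sketch below runs `train_test_split` on hypothetical arrays (`X`, `y` and the 60/40 class balance are made up). It also passes `stratify=y`, an option the notebook does not use, to show how the class ratio can be kept identical in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, 60 negatives and 40 positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# stratify=y is an addition for illustration; it preserves the 60/40 ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print(int(y_tr.sum()), int(y_te.sum()))  # positives in each subset
```

Without `stratify`, the split is purely random, so the share of survivors in the test set can drift slightly from the share in the full dataset.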

In [None]:
# Define the model and fit it to the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
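
The notebook fits this baseline tree but does not score it. A rough sketch of how one might do so is shown below on synthetic data from `make_classification` (an assumption standing in for the Titanic features, so the accuracies printed are not the notebook's). A tree with no depth or leaf-size limit typically memorizes its training set, which is the motivation for constraining it.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Unconstrained tree, same settings as the baseline model above.
baseline = DecisionTreeClassifier(random_state=42)
baseline.fit(X_tr, y_tr)

# accuracy_score expects the true labels first, then the predictions.
train_acc = accuracy_score(y_tr, baseline.predict(X_tr))
test_acc = accuracy_score(y_te, baseline.predict(X_te))
print("train:", train_acc, "test:", test_acc)
```

A large gap between the two numbers is the usual sign of overfitting. Applying the same two `accuracy_score` calls to `model` with `X_train` and `X_test` would give the baseline to compare the tuned model against.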

## Improving the Model

Use grid search to find the best parameters.

In [13]:
# Define the parameter grid to search
parameters = {
    "max_depth"           :  [1, 2, 4, 6, 8, 10],
    "min_samples_leaf"    :  [2, 4, 6, 8, 10],
    "min_samples_split"   :  [2, 4, 6, 8, 10]
}

# Build the scorer
scorer = make_scorer(accuracy_score)

# Create the GridSearchCV object from the parameters and the scorer,
# then run the grid search on the classifier to find the best parameters.
grid_obj = GridSearchCV(model, parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best estimator
best_clf = grid_fit.best_estimator_

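# Fit the new classifier to the training data. GridSearchCV's default
# refit=True has already done this, so the call is redundant but harmless.
best_clf.fit(X_train, y_train)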

# Predict with the new model
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

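# Compute the evaluation metric for the new model
# (accuracy_score takes the true labels first, then the predictions)
print('The training Score is', accuracy_score(y_train, best_train_predictions))
print('The testing Score is', accuracy_score(y_test, best_test_predictions))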

# Show the parameters the final model uses
best_clf

The training Score is 0.8707865168539326
The testing Score is 0.8547486033519553


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=6, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
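
Beyond printing the estimator, a fitted `GridSearchCV` object exposes the winning parameters and the cross-validated scores directly. The sketch below shows those attributes on synthetic data with a deliberately small grid (`X_demo`, `y_demo`, `small_grid` and `cv=3` are assumptions for illustration). It uses the string `scoring="accuracy"`, the built-in equivalent of `make_scorer(accuracy_score)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# A reduced grid so the sketch runs quickly: 2 x 2 = 4 candidates.
small_grid = {"max_depth": [2, 4], "min_samples_leaf": [2, 6]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    small_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # winning parameter combination
print(search.best_score_)   # its mean cross-validated accuracy

# One mean score per candidate; the best score is the largest of them.
mean_scores = search.cv_results_["mean_test_score"]

# With the default refit=True, best_estimator_ is already fitted and usable.
refit_predictions = search.best_estimator_.predict(X_demo)
```

Note that `best_score_` is a cross-validated estimate computed on the training folds only, so it is a different quantity from the test-set accuracy printed above.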