集成学习：选择多个模型，将模型的整体结果进行评估

随机森林：
随机抽样（bootstrap随机有放回抽样）重复N次、随机提取m << M个特征建立决策树

任务: 泰坦尼克号乘客生存预测

流程: 
```
1. 数据读取与清洗
2. 特征工程（提取特征与细化）
3. 模型训练
4. 模型评估
```

In [1]:
import pandas as pd
import numpy as np

数据读取

In [2]:
data = pd.read_csv("/data/ys_data/titanic/train.csv")
test = pd.read_csv("/data/ys_data/titanic/test.csv")

数据清洗

In [3]:
# 合并两个数据集查看整体数据集信息，合并数据集并忽略缺失值
full = data._append(test, ignore_index=True)

In [4]:
# 查看数据集整体情况，查看缺失数据并规划数据清洗
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


数据清洗的方法：

A如果是数值型，用平均数取代

B如果是分类数据，用最常见的类别取代

C使用模型预测缺失值，例如：K-NN

In [5]:
# 用均值填充Age和Fare
full['Age'] = full['Age'].fillna(full['Age'].mean())
full['Fare'] = full['Fare'].fillna(full['Fare'].mean())

In [6]:
# 查看Embarked常见类别
full['Embarked'].value_counts()

Embarked
S    914
C    270
Q    123
Name: count, dtype: int64

In [7]:
# 用常见的类别填充Embarked
full['Embarked'] = full['Embarked'].fillna('S')

In [8]:
# Cabin一列缺失较多，暂填充为u，代表未知
full['Cabin'] = full['Cabin'].fillna('u')

In [9]:
# 填充完成后查看数据集
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        1309 non-null   object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


特征提取

In [10]:
# 对性别进行替换（male->1, female->2）
sex_mapDict = { "male" : 1, 
               "female" : 0}
full['Sex'] = full['Sex'].map(sex_mapDict)

In [11]:
full['Sex']

0       1
1       0
2       0
3       0
4       1
       ..
1304    1
1305    0
1306    1
1307    1
1308    1
Name: Sex, Length: 1309, dtype: int64

In [12]:
# Embarked转化为one-hot编码
EmbarkedDf = pd.DataFrame()
EmbarkedDf = pd.get_dummies(full['Embarked'], prefix='Embarked', dtype=int)

# 转化完成后查看
EmbarkedDf.head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [13]:
# 虚拟变量导入full, 将Embarked转化为one-hot编码
full = pd.concat([full, EmbarkedDf], axis=1)
full.drop("Embarked", axis=1, inplace=True)

# 转化完成后查看
full.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
0,1,0.0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,u,0,0,1
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,1,0,0
2,3,1.0,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,u,0,0,1
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,0,0,1
4,5,0.0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,u,0,0,1


In [14]:
# 对pclass也进行操作，转化为one-hot编码
pclassDf = pd.DataFrame()
pclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass', dtype=int)

# 转化完成后查看
pclassDf.head()

Unnamed: 0,Pclass_1,Pclass_2,Pclass_3
0,0,0,1
1,1,0,0
2,0,0,1
3,1,0,0
4,0,0,1


In [15]:
# 处理后的pclass加入full，并且替换full中的pclass行
full = pd.concat([full, pclassDf], axis=1)
full.drop('Pclass', axis=1, inplace=True)

# 处理后查看full
full.head()

Unnamed: 0,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
0,1,0.0,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,u,0,0,1,0,0,1
1,2,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,1,0,0,1,0,0
2,3,1.0,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,u,0,0,1,0,0,1
3,4,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,0,0,1,1,0,0
4,5,0.0,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,u,0,0,1,0,0,1


In [16]:
# 获取name中隐含的Title，可能对是否生存有影响
def getTitle(name: str) -> str:
    str1 = name.split(',')[1]
    str2 = str1.split('.')[0]
    str3 = str2.strip()
    return str3

# 对所有的name对象进行getTitle，提取头衔
titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(getTitle)
titleDf.head()

Unnamed: 0,Title
0,Mr
1,Mrs
2,Miss
3,Mrs
4,Mr


In [17]:
# 查看titleDf内的类别，便于映射对应关系
titleDf.value_counts()

Title       
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Ms                2
Major             2
Mlle              2
Sir               1
Capt              1
Mme               1
Lady              1
Jonkheer          1
Dona              1
Don               1
the Countess      1
Name: count, dtype: int64

In [18]:
# 映射头衔的对应关系
title_mapDict = {
    "Capt" : "Officer",
    "Col" : "Officer",
    "Major" : "Officer",
    "Jonkeer" : "Royalty",
    "Don" : "Royalty",
    "Sir" : "Royalty",
    "Dr" : "Officer",
    "Rev" : "Officer",
    "the Countess" : "Royalty", 
    "Dona" : "Royalty", 
    "Mme" : "Mrs", 
    "Mlle" : "Miss", 
    "Ms" : "Mrs", 
    "Mr" : "Mr", 
    "Mrs" : "Mrs", 
    "Miss" : "Miss", 
    "Master" : "Master", 
    "Lady" : "Royalty"
}
titleDf['Title'] = titleDf['Title'].map(title_mapDict)

# 提取成为one-hot编码
titleDf = pd.get_dummies(titleDf['Title'], dtype=int)

# 提取完成后查看结果
titleDf.head()

Unnamed: 0,Master,Miss,Mr,Mrs,Officer,Royalty
0,0,0,1,0,0,0
1,0,0,0,1,0,0
2,0,1,0,0,0,0
3,0,0,0,1,0,0
4,0,0,1,0,0,0


In [19]:
# 处理后的Name行替换掉full中的Name行
full = pd.concat([full, titleDf], axis=1)
full.drop('Name', axis=1, inplace=True)

# 填充完成后查看
full.head()

Unnamed: 0,PassengerId,Survived,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,...,Embarked_S,Pclass_1,Pclass_2,Pclass_3,Master,Miss,Mr,Mrs,Officer,Royalty
0,1,0.0,1,22.0,1,0,A/5 21171,7.25,u,0,...,1,0,0,1,0,0,1,0,0,0
1,2,1.0,0,38.0,1,0,PC 17599,71.2833,C85,1,...,0,1,0,0,0,0,0,1,0,0
2,3,1.0,0,26.0,0,0,STON/O2. 3101282,7.925,u,0,...,1,0,0,1,0,1,0,0,0,0
3,4,1.0,0,35.0,1,0,113803,53.1,C123,0,...,1,1,0,0,0,0,0,1,0,0
4,5,0.0,1,35.0,0,0,373450,8.05,u,0,...,1,0,0,1,0,0,1,0,0,0


In [20]:
# 处理Cabin，按舱位分类
cabinDf = pd.DataFrame()

# 匿名函数Lambda，取第一位
full['Cabin'] = full['Cabin'].map(lambda c : c[0])

# 转化为one-hot编码
cabinDf = pd.get_dummies(full['Cabin'], prefix='Cabin', dtype=int)

# 填充完成后查看
cabinDf.head()

Unnamed: 0,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Cabin_u
0,0,0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1
3,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1


In [21]:
# 处理后的Cabin替换full中的Cabin
full = pd.concat([full, cabinDf], axis=1)
full.drop('Cabin', axis=1, inplace=True)

# 处理后查看full结构
full.head()

Unnamed: 0,PassengerId,Survived,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked_C,Embarked_Q,...,Royalty,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Cabin_u
0,1,0.0,1,22.0,1,0,A/5 21171,7.25,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1.0,0,38.0,1,0,PC 17599,71.2833,1,0,...,0,0,0,1,0,0,0,0,0,0
2,3,1.0,0,26.0,0,0,STON/O2. 3101282,7.925,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1.0,0,35.0,1,0,113803,53.1,0,0,...,0,0,0,1,0,0,0,0,0,0
4,5,0.0,1,35.0,0,0,373450,8.05,0,0,...,0,0,0,0,0,0,0,0,0,1


In [22]:
# 家庭人数总结与家庭大小分类，家庭人数等于直系亲属（parch）+非直系亲属（Sibsp）+本人（1）
familyDf = pd.DataFrame()
familyDf['Family'] = full['Parch'] + full['SibSp'] + 1

# 对人数进行分类，分为小家庭（1）中等家庭（2-4）以及大家庭（>=5）
familyDf['F_S'] = familyDf['Family'].map(lambda s:1 if s == 1 else 0)
familyDf['F_M'] = familyDf['Family'].map(lambda s:1 if 2 <= s <= 4 else 0)
familyDf['F_L'] = familyDf['Family'].map(lambda s:1 if s >= 5 else 0)

# 处理完成查看表
familyDf.head()

Unnamed: 0,Family,F_S,F_M,F_L
0,2,0,1,0
1,2,0,1,0
2,1,1,0,0
3,2,0,1,0
4,1,1,0,0


In [23]:
# 将处理完成的表加入full
full = pd.concat([full, familyDf], axis=1)

# 加入完成查看full
full.head()

Unnamed: 0,PassengerId,Survived,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked_C,Embarked_Q,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Cabin_u,Family,F_S,F_M,F_L
0,1,0.0,1,22.0,1,0,A/5 21171,7.25,0,0,...,0,0,0,0,0,1,2,0,1,0
1,2,1.0,0,38.0,1,0,PC 17599,71.2833,1,0,...,0,0,0,0,0,0,2,0,1,0
2,3,1.0,0,26.0,0,0,STON/O2. 3101282,7.925,0,0,...,0,0,0,0,0,1,1,1,0,0
3,4,1.0,0,35.0,1,0,113803,53.1,0,0,...,0,0,0,0,0,0,2,0,1,0
4,5,0.0,1,35.0,0,0,373450,8.05,0,0,...,0,0,0,0,0,1,1,1,0,0


相关系数法筛选特征值

In [24]:
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr

# 提取出数据集内的数据，并且进行特征提取
source_row = 891
feature_cols = [i for i in full.columns if i not in ['Ticket', 'Survived']]
features = full[feature_cols]
features = features[0:source_row]
target = full['Survived'][0:source_row]

In [25]:
# 采用pearson系数计算相关系数并选择，降序输出
rate_dict = {}
for i in features.columns:
    rate_dict[i] = pearsonr(features[i], target)[0]
result = sorted(rate_dict.items(), key=lambda s:s[1], reverse=True)
for each in result:
    print(each)

('Mrs', 0.3449349674862876)
('Miss', 0.33279543489730845)
('Pclass_1', 0.28590376778374266)
('F_M', 0.2798545470332841)
('Fare', 0.2573065223849624)
('Cabin_B', 0.17509503365047552)
('Embarked_C', 0.16824043121823315)
('Cabin_D', 0.1507156442304825)
('Cabin_E', 0.14532144323642832)
('Cabin_C', 0.11465211543263729)
('Pclass_2', 0.09334857241192887)
('Master', 0.08522056083929419)
('Parch', 0.08162940708348344)
('Cabin_F', 0.057934947020803915)
('Royalty', 0.05056144542486076)
('Cabin_A', 0.022286953811301812)
('Family', 0.016638989282745254)
('Cabin_G', 0.016040182686507552)
('Embarked_Q', 0.0036503826839722055)
('PassengerId', -0.005006660767066498)
('Cabin_T', -0.026456468796962354)
('Officer', -0.03131567043773626)
('SibSp', -0.035322498885735534)
('Age', -0.07032267528829973)
('F_L', -0.12514712398530678)
('Embarked_S', -0.14968272327068555)
('F_S', -0.2033670856998919)
('Cabin_u', -0.3169115231122962)
('Pclass_3', -0.3223083573729699)
('Sex', -0.5433513806577553)
('Mr', -0.54919918

In [26]:
# 抽取相关系数比较大的特征值 -> titleDf, Sex, pclassDf, familyDf, Fare, cabinDf, EmbarkedDf
full_X = pd.concat([titleDf, full['Sex'], pclassDf, familyDf, full['Fare'], cabinDf, EmbarkedDf], axis=1)

# 抽取结束后查看简介
full_X.head()

Unnamed: 0,Master,Miss,Mr,Mrs,Officer,Royalty,Sex,Pclass_1,Pclass_2,Pclass_3,...,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Cabin_u,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1,0,0,0,1,0,0,1,...,0,0,0,0,0,0,1,0,0,1
1,0,0,0,1,0,0,0,1,0,0,...,1,0,0,0,0,0,0,1,0,0
2,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,1
3,0,0,0,1,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,1
4,0,0,1,0,0,0,1,0,0,1,...,0,0,0,0,0,0,1,0,0,1


划分训练集

In [27]:
# 提取训练集和测试集
source_X = full_X[0:source_row]
source_y = full['Survived'][0:source_row]
pred_X = full_X[source_row:]

In [28]:
# 划分训练集
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(source_X, source_y, random_state=10, train_size=0.8)

模型训练与调优（加入网格搜索与交叉验证）

In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
estimator = RandomForestClassifier()

param_dict = {"n_estimators" : [120, 200, 300, 500, 800, 1200], "max_depth" : [5, 8, 15, 25, 30]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)

In [30]:
estimator.fit(x_train, y_train)

模型评估

In [31]:
score = estimator.score(x_test, y_test)
print("准确率为:\n", score)

# 打印最佳参数
print("最佳参数: \n", estimator.best_params_)
# 打印最佳结果
print("最佳结果: \n", estimator.best_score_)
# 打印最佳估计器
print("最佳估计器: \n", estimator.best_estimator_)
# 打印交叉验证结果
print("交叉验证结果: \n", estimator.cv_results_)

准确率为:
 0.8603351955307262
最佳参数: 
 {'max_depth': 5, 'n_estimators': 120}
最佳结果: 
 0.8314358372456965
最佳估计器: 
 RandomForestClassifier(max_depth=5, n_estimators=120)
交叉验证结果: 
 {'mean_fit_time': array([0.12416501, 0.19957078, 0.30220056, 0.51047444, 0.8180249 ,
       1.20790415, 0.13095956, 0.22236137, 0.32944875, 0.55156913,
       0.87378671, 1.306988  , 0.14190795, 0.23717716, 0.37455385,
       0.60101376, 0.95377655, 1.45309365, 0.14253354, 0.24004772,
       0.35511129, 0.60332727, 0.96573591, 1.43205826, 0.14702783,
       0.2425025 , 0.36266739, 0.60592697, 0.96198351, 1.45327008]), 'std_fit_time': array([0.00147493, 0.00153508, 0.00418183, 0.01955471, 0.01732095,
       0.03409748, 0.00074352, 0.01097889, 0.01502828, 0.01564978,
       0.02242332, 0.0120086 , 0.00529125, 0.00708996, 0.01622983,
       0.01350317, 0.03214033, 0.0343545 , 0.00231513, 0.00497056,
       0.00755768, 0.02383878, 0.01564256, 0.03298954, 0.00204778,
       0.00542388, 0.01107927, 0.02036605, 0.02443297, 

数据预测

In [32]:
pred_Y = estimator.predict(pred_X)
pred_Y = pred_Y.astype(int)

# 接入passenger_id
passenger_id = full['PassengerId'][source_row:]
predDf = pd.DataFrame({'PassengerId' : passenger_id, 
                     'Survived' : pred_Y})

# 预测完成后查看
predDf.head()

Unnamed: 0,PassengerId,Survived
891,892,0
892,893,1
893,894,0
894,895,0
895,896,1


保存结果

In [33]:
predDf.to_csv('/data/ys_data/titanic/titanic_pred.csv', index=False)