# Project: Titanic Survival Prediction Competition

## Introduction

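### Project Overview
This project is an entry in the Kaggle competition [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic). I mined the Titanic passenger dataset and used several predictive models to predict whether each passenger survived. The final result was good: it ranked in the top 3% of the Kaggle leaderboard.

In this experiment I tried multiple models, parameter settings, and random states; the best result came from a tuned random forest. For convenience, I submit the prediction file together with this write-up, which describes everything I tried. Because the random seed and the cross-validation splits differ considerably between runs, the score changes from one submission to the next, so I attach the score of the best result and its prediction file, my_best_submission.csv.
![Results](./res.png)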

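### Project Workflow
This experiment follows the classic data-mining workflow: dataset analysis and visualization, data cleaning and feature extraction, and model training.
* Data analysis and visualization: the important parts of the dataset are analyzed with Python visualization libraries.
* Data cleaning: the data is read with Pandas and cleaned, including missing-value imputation, feature engineering, and splitting the data.
* Model training: many models were tried, including random forest, decision tree, support vector machine (SVM), KNN, logistic regression, gradient-boosted trees, and neural networks.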

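### Main Methods and Experiments
The main methods and models used in this project are:
* Classification models:
  * GBDT: Gradient Boosting Decision Tree.
  * XGBoost: eXtreme Gradient Boosting, an optimized gradient-boosted tree library.
  * Random Forest.
  * SVMC: Support Vector Machine Classifier.
* Ensemble model: a voting classifier combines the classification models above into an ensemble.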
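
The voting ensemble described above can be sketched roughly as follows. This is a minimal illustration and not the exact configuration used in the competition: the data is synthetic (`make_classification` stands in for the engineered Titanic features), the hyperparameters are placeholders, and XGBoost is left out so that the sketch depends only on scikit-learn. Since `xgboost.XGBClassifier` follows the scikit-learn estimator interface, it could be added as one more entry in `estimators`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the engineered Titanic features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

voting_clf = VotingClassifier(
    estimators=[
        ('gbdt', GradientBoostingClassifier(random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
        # SVMs are sensitive to feature scale, so the SVC gets its own scaler.
        ('svc', make_pipeline(StandardScaler(),
                              SVC(probability=True, random_state=0))),
    ],
    voting='soft',  # average the predicted class probabilities
)
voting_clf.fit(X_tr, y_tr)

val_accuracy = voting_clf.score(X_val, y_val)
print(f'validation accuracy: {val_accuracy:.3f}')
```

Soft voting needs probability estimates from every member, which is why the `SVC` is created with `probability=True`; with `voting='hard'` the ensemble would take a majority vote over predicted labels instead.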

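### Dataset
The dataset for this competition contains information about the Titanic's passengers. The attributes are:
* PassengerId: the passenger's ID.
* Pclass: the passenger's ticket class: 1 is first class, 2 is second class, 3 is third class.
* Survived: whether the passenger survived: 0 means died, 1 means survived.
* Name: the passenger's name.
* Sex: the passenger's sex.
* Age: the passenger's age.
* SibSp: the number of the passenger's siblings and spouses aboard.
* Parch: the number of the passenger's parents and children aboard.
* Ticket: the ticket number.
* Fare: the ticket fare.
* Cabin: the passenger's cabin number.
* Embarked: the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).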

## Data Cleaning

In [220]:
import numpy as np 
import pandas as pd 

### Loading the Dataset
Reload the training dataset in case its variables were modified during the earlier data analysis.

In [221]:
test_data = pd.read_csv("kaggle/input/titanic/test.csv")
train_data = pd.read_csv("kaggle/input/titanic/train.csv")

To simplify the subsequent cleaning, the test set and the training set are first merged.

In [222]:
# join the two datasets
titanic = pd.concat([train_data, test_data])
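
One detail of this merge is worth seeing on a small example: `pd.concat` keeps each frame's own row labels, so the index restarts at 0 for the test rows, and the test rows receive `NaN` in the `Survived` column they do not have. The sketch below uses two tiny made-up frames, not the real CSV files; `ignore_index=True` is shown only as an option for cases where unique row labels are needed.

```python
import pandas as pd

# Tiny made-up stand-ins for train.csv and test.csv.
train_demo = pd.DataFrame({'PassengerId': [1, 2, 3], 'Survived': [0, 1, 1]})
test_demo = pd.DataFrame({'PassengerId': [4, 5]})

combined = pd.concat([train_demo, test_demo])
print(combined.index.tolist())               # row labels repeat: [0, 1, 2, 0, 1]
print(combined['Survived'].isnull().sum())   # the 2 test rows have no label

# Optional: renumber the rows 0..n-1 so every label is unique.
combined_unique = pd.concat([train_demo, test_demo], ignore_index=True)
print(combined_unique.index.tolist())        # [0, 1, 2, 3, 4]
```

The missing `Survived` values are what later allows the merged frame to be split back into training and test rows.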

The merged dataset looks like this:

In [223]:
titanic.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


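### Handling Missing Values
In the merged dataset, four attributes have missing values that need handling: Age, Fare, Cabin, and Embarked. (The 418 missing Survived values are simply the test-set rows, which carry no label.)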


In [224]:
titanic.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

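Embarked and Cabin are both categorical, and the data analysis showed no clear relationship between them and the other attributes, so I start with the following treatment:
* Embarked has very few missing values, so I fill them with the most frequent class.
* For Cabin, missing values are marked as U (Unknown), and only the first letter is kept as the value.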

In [225]:
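# 'S' is the most frequent port; assign the result back instead of calling inplace on a column
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic['Cabin'] = titanic['Cabin'].fillna('U')
titanic['Cabin'] = titanic['Cabin'].map(lambda x: x[0])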
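
As a small variation, the most frequent port can be computed from the data rather than typed in by hand. The sketch below runs on a made-up five-row frame (the values are illustrative, not taken from the dataset) and uses the `.str` accessor to keep the first letter of the cabin.

```python
import pandas as pd

# Made-up rows that mimic the two columns being cleaned.
demo = pd.DataFrame({
    'Embarked': ['S', 'C', None, 'S', 'Q'],
    'Cabin': ['C85', None, 'E46', None, 'B28'],
})

# Most frequent port, derived from the column itself.
most_common_port = demo['Embarked'].mode()[0]
demo['Embarked'] = demo['Embarked'].fillna(most_common_port)

# Missing cabins become 'U' (Unknown); only the leading letter is kept.
demo['Cabin'] = demo['Cabin'].fillna('U').str[0]
print(demo)
```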

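Fare is a numeric attribute, but only one value is missing, and the data analysis already concluded that Fare is highly correlated with Pclass, so the gap is filled with the median fare of the same Pclass.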

In [226]:
# Fill the missing Fare with the median fare of the passenger's Pclass
titanic['Fare'] = titanic['Fare'].fillna(titanic.groupby('Pclass')['Fare'].transform('median'))
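
To see what the group-wise fill does, here is the same idea on a made-up five-row frame: `transform('median')` returns one value per row (the median fare of that row's class, ignoring missing values), and `fillna` takes from it only where `Fare` is missing.

```python
import numpy as np
import pandas as pd

# Made-up fares; the last third-class fare is missing.
demo = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Fare': [80.0, 60.0, 7.0, 9.0, np.nan],
})

# Per-row median of the row's own class: 70.0 for class 1, 8.0 for class 3.
class_median = demo.groupby('Pclass')['Fare'].transform('median')
demo['Fare'] = demo['Fare'].fillna(class_median)
print(demo)
```

Existing fares are left untouched; only the missing third-class fare is replaced by the third-class median.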

In [227]:
print(titanic.isnull().sum())

PassengerId      0
Survived       418
Pclass           0
Name             0
Sex              0
Age            263
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         0
dtype: int64


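### Feature Extraction

The first attribute to process is Name. By inspection, each name contains a title that carries occupation or gender information, so the title is extracted by splitting the string.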

In [228]:
titanic['Title'] = titanic['Name'].map(lambda x: x.split(',')[1].split('.')[0].strip())
titanic.drop('Name', axis=1, inplace=True)
titanic['Title'].value_counts()

Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Major             2
Ms                2
Lady              1
Sir               1
Mme               1
Don               1
Capt              1
the Countess      1
Jonkheer          1
Dona              1
Name: Title, dtype: int64
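
The same extraction can also be written with a regular expression, which some readers may find easier to follow than chained `split` calls. This is an alternative sketch, not the code used above; the pattern captures everything between the comma and the first period, and the names are the ones shown in the preview earlier.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Allen, Mr. William Henry',
])

# Capture the text between ', ' and the first '.', e.g. 'Mr' or 'Miss'.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.value_counts())
```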

In [229]:
print(titanic.isnull().sum())

PassengerId      0
Survived       418
Pclass           0
Sex              0
Age            263
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         0
Title            0
dtype: int64


In [230]:
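# Merge rare titles into the common groups (assigning back rather than using inplace on a column)
titanic['Title'] = titanic['Title'].replace(['Mme', 'Ms', 'Lady', 'Mlle', 'the Countess', 'Dona'], 'Miss')
titanic['Title'] = titanic['Title'].replace(['Major', 'Col', 'Capt', 'Don', 'Sir', 'Jonkheer'], 'Mr')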

titanic['Title'].value_counts()

Mr        767
Miss      268
Mrs       197
Master     61
Rev         8
Dr          8
Name: Title, dtype: int64

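Next come SibSp and Parch, which together reflect the number of family members aboard. For decision-tree models a fine-grained count is unnecessary and can even hurt the prediction, so passengers are grouped into four types by family size.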

In [231]:
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1
titanic['Family'] = pd.cut(titanic['FamilySize'], [0, 1, 4, 7, 11], labels=['Single', 'Small', 'Medium', 'Large'])
titanic.drop(['SibSp', 'Parch', 'FamilySize'], axis=1, inplace=True)
titanic['Family'].value_counts()

Single    790
Small     437
Medium     63
Large      19
Name: Family, dtype: int64
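
The bin edges passed to `pd.cut` are right-inclusive by default, so `[0, 1, 4, 7, 11]` means sizes 1, 2 to 4, 5 to 7, and 8 to 11. The short sketch below applies the same bins to a few hand-picked family sizes to make the boundaries explicit.

```python
import pandas as pd

# Hand-picked sizes that sit on or next to each bin edge.
family_size = pd.Series([1, 2, 4, 5, 7, 8, 11])
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, 7, 11],   # intervals (0, 1], (1, 4], (4, 7], (7, 11]
    labels=['Single', 'Small', 'Medium', 'Large'],
)
print(pd.DataFrame({'FamilySize': family_size, 'Family': family_type}))
```

A family size above 11 would fall outside every bin and come back as `NaN`, so the upper edge has to cover the largest value in the data.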

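For now, no obviously useful information could be mined from Ticket itself, so in this experiment I keep only two simple derived features, the length of the ticket string (TicketNumber) and its first two characters (TicketHead), and drop the original column.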

In [232]:
titanic['TicketNumber'] = titanic.Ticket.apply(lambda x: len(x))
titanic['TicketHead'] = titanic.Ticket.apply(lambda x: x[:2])
titanic.drop(['Ticket'], axis=1, inplace=True)
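
For reference, the two ticket features can equally be derived with pandas' vectorized `.str` accessor instead of `apply` with a lambda. The sketch below is an equivalent formulation on three ticket strings taken from the preview above.

```python
import pandas as pd

tickets = pd.Series(['A/5 21171', 'PC 17599', '113803'])

ticket_features = pd.DataFrame({
    'TicketNumber': tickets.str.len(),   # length of the ticket string
    'TicketHead': tickets.str[:2],       # first two characters
})
print(ticket_features)
```

Note that `TicketNumber` is the length of the ticket string, not the numeric part of the ticket.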

After the processing above, the resulting features are:

In [233]:
titanic.head()

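|   | PassengerId | Survived | Pclass | Sex | Age | Fare | Cabin | Embarked | Title | Family | TicketNumber | TicketHead |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | male | 22.0 | 7.25 | U | S | Mr | Small | 9 | A/ |
| 1 | 2 | 1.0 | 1 | female | 38.0 | 71.2833 | C | C | Mrs | Small | 8 | PC |
| 2 | 3 | 1.0 | 3 | female | 26.0 | 7.925 | U | S | Miss | Single | 16 | ST |
| 3 | 4 | 1.0 | 1 | female | 35.0 | 53.1 | C | S | Mrs | Small | 6 | 11 |
| 4 | 5 | 0.0 | 3 | male | 35.0 | 8.05 | U | S | Mr | Single | 6 | 37 |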


With all transformations complete, the dataset now looks like this:

In [235]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   PassengerId   1309 non-null   int64   
 1   Survived      891 non-null    float64 
 2   Pclass        1309 non-null   int64   
 3   Sex           1309 non-null   object  
 4   Age           1046 non-null   float64 
 5   Fare          1309 non-null   float64 
 6   Cabin         1309 non-null   object  
 7   Embarked      1309 non-null   object  
 8   Title         1309 non-null   object  
 9   Family        1309 non-null   category
 10  TicketNumber  1309 non-null   int64   
 11  TicketHead    1309 non-null   object  
dtypes: category(1), float64(3), int64(3), object(5)
memory usage: 124.2+ KB


In [237]:
print(titanic.isnull().sum())

PassengerId       0
Survived        418
Pclass            0
Sex               0
Age             263
Fare              0
Cabin             0
Embarked          0
Title             0
Family            0
TicketNumber      0
TicketHead        0
dtype: int64


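## Model Prediction
With data cleaning and feature extraction done, the models can be trained and used for prediction. First, split the data back into the training and test sets.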

In [238]:
# .copy() creates independent frames, so the assignment below does not trigger SettingWithCopyWarning
traindata = titanic[titanic['Survived'].notnull()].copy()
testdata = titanic[titanic['Survived'].isnull()].copy()

In [239]:
# Convert the Survived column to int
traindata['Survived'] = traindata['Survived'].astype(int)
traindata

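|   | PassengerId | Survived | Pclass | Sex | Age | Fare | Cabin | Embarked | Title | Family | TicketNumber | TicketHead |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 7.2500 | U | S | Mr | Small | 9 | A/ |
| 1 | 2 | 1 | 1 | female | 38.0 | 71.2833 | C | C | Mrs | Small | 8 | PC |
| 2 | 3 | 1 | 3 | female | 26.0 | 7.9250 | U | S | Miss | Single | 16 | ST |
| 3 | 4 | 1 | 1 | female | 35.0 | 53.1000 | C | S | Mrs | Small | 6 | 11 |
| 4 | 5 | 0 | 3 | male | 35.0 | 8.0500 | U | S | Mr | Single | 6 | 37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | male | 27.0 | 13.0000 | U | S | Rev | Single | 6 | 21 |
| 887 | 888 | 1 | 1 | female | 19.0 | 30.0000 | B | S | Miss | Single | 6 | 11 |
| 888 | 889 | 0 | 3 | female | NaN | 23.4500 | U | S | Miss | Small | 10 | W. |
| 889 | 890 | 1 | 1 | male | 26.0 | 30.0000 | C | C | Mr | Single | 6 | 11 |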


In [240]:
testdata

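|   | PassengerId | Survived | Pclass | Sex | Age | Fare | Cabin | Embarked | Title | Family | TicketNumber | TicketHead |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | NaN | 3 | male | 34.5 | 7.8292 | U | Q | Mr | Single | 6 | 33 |
| 1 | 893 | NaN | 3 | female | 47.0 | 7.0000 | U | S | Mrs | Small | 6 | 36 |
| 2 | 894 | NaN | 2 | male | 62.0 | 9.6875 | U | Q | Mr | Single | 6 | 24 |
| 3 | 895 | NaN | 3 | male | 27.0 | 8.6625 | U | S | Mr | Single | 6 | 31 |
| 4 | 896 | NaN | 3 | female | 22.0 | 12.2875 | U | S | Mrs | Small | 7 | 31 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | NaN | 3 | male | NaN | 8.0500 | U | S | Mr | Single | 9 | A. |
| 414 | 1306 | NaN | 1 | female | 39.0 | 108.9000 | C | C | Miss | Single | 8 | PC |
| 415 | 1307 | NaN | 3 | male | 38.5 | 7.2500 | U | S | Mr | Single | 18 | SO |
| 416 | 1308 | NaN | 3 | male | NaN | 8.0500 | U | S | Mr | Single | 6 | 35 |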


### Random Forest

In [241]:
y = traindata['Survived']
values = ['Pclass', 'Fare', 'Title', 'Embarked', 'Family', 'TicketNumber', 'TicketHead']
X = traindata[values]
X_test = testdata[values]

In [242]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

preprocessor = ColumnTransformer(
    transformers=[
        (   'num', 
            SimpleImputer(strategy='median'), 
            ['Fare']
        ),
        (   'cat', 
            Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='most_frequent')),
                ('onehot', OneHotEncoder(handle_unknown='ignore'))
            ]), 
            ['Pclass', 'Title', 'Embarked', 'Family', 'TicketNumber', 'TicketHead']
        )
    ])

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=0, n_estimators=500, max_depth=5))
])

model.fit(X, y)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='median'),
                                                  ['Fare']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Pclass', 'Title',
                                                   'Embarked', 'Family',
                                                   'TicketNumber',
                                                   'TicketHead'])])),
                ('model',
                 RandomFores

In [243]:
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': testdata.PassengerId, 'Survived': predictions})
output.to_csv('submission3.csv', index=False)
print('Your submission was successfully saved!')

Your submission was successfully saved!
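
Since the introduction notes that the score moves with the random seed and the cross-validation split, it can help to estimate that spread locally before submitting. The sketch below is one possible way to do so and is not part of the original experiment: it uses synthetic data from `make_classification` in place of the Titanic features, and the number of seeds and folds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the engineered Titanic features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

mean_scores = []
for seed in range(3):
    # The same seed drives both the forest and the fold assignment.
    clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=seed)
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X_demo, y_demo, cv=folds, scoring='accuracy')
    mean_scores.append(scores.mean())
    print(f'seed={seed}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')
```

With the real data, the `Pipeline` fitted above could be passed to `cross_val_score` in place of `clf`, so that imputation and one-hot encoding are refit inside each fold.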
