# Titanic Machine Learning

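### Task

Given information about the passengers aboard the Titanic, train a model on the train.csv dataset and use it to predict survival for the passengers in test.csv.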

### Import packages

In [1]:
import tensorflow as tf
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import matplotlib as mpl
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

%matplotlib inline

### Load the training and test sets

In [2]:
titan_train = pd.read_csv('data/train.csv')
titan_test = pd.read_csv('data/test.csv')

### Inspect the data format

In [3]:
print(titan_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None


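| Column | Meaning |
|-- | --|
| PassengerId | Passenger ID |
| Survived | Survival flag (0 = did not survive, 1 = survived) |
| Pclass | Ticket class |
| Name | Name |
| Sex | Sex |
| Age | Age |
| SibSp | Number of siblings and spouses aboard |
| Parch | Number of parents and children aboard |
| Ticket | Ticket number |
| Fare | Passenger fare |
| Cabin | Cabin number |
| Embarked | Port of embarkation |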

In [4]:
titan_train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|--|--|--|--|--|--|--|--|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [5]:
titan_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


The Cabin column contains NaN (missing) values.
Let's check whether other columns have missing values as well.

In [6]:
# Combine train and test so that missing values are handled consistently.
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
# sort=True keeps the alphabetical column order shown in the outputs.
dataset = pd.concat([titan_train, titan_test], ignore_index=True, sort=True)
total = dataset.isnull().sum()
percent = dataset.isnull().sum() / dataset.isnull().count()
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'percent'])
print(missing_data)

             Total   percent
Age            263  0.200917
Cabin         1014  0.774637
Embarked         2  0.001528
Fare             1  0.000764
Name             0  0.000000
Parch            0  0.000000
PassengerId      0  0.000000
Pclass           0  0.000000
Sex              0  0.000000
SibSp            0  0.000000
Survived       418  0.319328
Ticket           0  0.000000



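The share of missing values differs a lot between columns: Cabin is mostly missing (about 77%), Age is partly missing (about 20%), and Embarked and Fare are missing only a handful of values. (The 418 missing Survived values are simply the unlabeled rows from test.csv.)  
Missing values must be either filled in or dropped, otherwise they will affect the final predictions. Columns with different missing rates call for different treatments. Fare and Embarked can be filled with the mean or the mode, or the three affected rows can simply be dropped, with little effect on the final result.  
The other two columns have too many missing values for the rows to be dropped or the gaps to be filled naively, so the missing value itself can be treated as a feature. For a discrete column such as Cabin this is easy: fill every gap with one placeholder category. Age, however, is numeric, so the choice of fill value changes the result.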
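
To make the last point concrete, here is a minimal sketch on made-up toy data (not rows from train.csv). It compares mean and median filling for a numeric column, records missingness as its own indicator, and fills a categorical column with a placeholder. The column values and the names `age_was_missing` and `cabin_filled` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from train.csv).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan, 80.0],
    "Cabin": ["C85", None, None, "E46", None],
})

# Numeric column: the fill value changes the resulting distribution.
age_mean_filled = toy["Age"].fillna(toy["Age"].mean())
age_median_filled = toy["Age"].fillna(toy["Age"].median())

# Optionally keep "was this value missing?" as a separate 0/1 feature.
age_was_missing = toy["Age"].isnull().astype(int)

# Categorical column: treat "missing" as one more category.
cabin_filled = toy["Cabin"].fillna("U")

print(age_mean_filled.tolist())
print(age_median_filled.tolist())
print(age_was_missing.tolist())
print(cabin_filled.tolist())
```

With a skewed column the mean is pulled toward the outlier (80 here) while the median is not, which is one reason the two fills can lead to different models.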

### Fill in missing values

In [7]:
# Fill Age first, using simple mean imputation for now; this may be too crude
print('Before:')
dataset.info()
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].mean())
print('After:')
dataset.info()

Before:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
After:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

In [8]:
dataset['Embarked'].value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

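S is the most common value, so we use it to fill the gaps.  
Cabin has too many missing values and is non-numeric, so we introduce a placeholder category 'U' (unknown) instead.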
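
Instead of reading the most common value off the counts and hard-coding it, the mode can be computed directly. The sketch below uses a made-up Series, not the real Embarked column; `most_common` and `embarked_filled` are names introduced here for illustration.

```python
import pandas as pd

# Made-up column; None marks a missing port of embarkation.
embarked = pd.Series(["S", "C", "S", None, "Q", "S", None])

# mode() returns a Series because ties are possible; take the first entry.
most_common = embarked.mode()[0]
embarked_filled = embarked.fillna(most_common)

print(most_common)
print(embarked_filled.value_counts().to_dict())
```

Computing the fill value this way keeps the cell correct if the data changes, at the cost of hiding which value was actually used unless it is printed.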

In [9]:
dataset['Embarked'] = dataset['Embarked'].fillna('S')
dataset['Cabin'] = dataset['Cabin'].fillna('U')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


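### Feature extraction

Strings are not suitable as classification features. Non-numeric columns with a limited set of values should be replaced by numbers and one-hot encoded.  
The following columns have clear categories:  
Sex: male, female  
Embarked: S = Southampton, England (port of departure); C = Cherbourg, France (first port of call); Q = Queenstown, Ireland (second port of call)  
Pclass: 1 = first class, 2 = second class, 3 = third class  
The following columns rarely repeat a value, so they can be treated as having no categories:  
Name  
Cabin  
Ticket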
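
As a minimal sketch of what one-hot encoding does, the toy example below (made-up values) turns one categorical column into one indicator column per category. Passing `dtype=int` is a choice made here so the result is 0/1 on every pandas version; newer versions return booleans by default.

```python
import pandas as pd

# Made-up values for a single categorical column.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category, named <prefix>_<category>.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Replace the original column with its indicator columns.
encoded = pd.concat([toy.drop(columns="Embarked"), dummies], axis=1)
print(encoded)
```

Each row has exactly one indicator set, so no artificial ordering between the categories is introduced.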

In [10]:
# Sex: male -> 1, female -> 0; the feature is binary, so no one-hot encoding is needed
sex_mapDict={'male':1,
            'female':0}
dataset['Sex']=dataset['Sex'].map(sex_mapDict)
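
One thing to watch with dictionary-based `map`: any value that is not a key in the dictionary silently becomes NaN. The sketch below uses made-up values (the capitalized "Male" is deliberately not in the mapping) to show a quick check; `unmapped` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up values; "Male" (capitalized) is deliberately absent from the mapping.
sex = pd.Series(["male", "female", "Male"])
mapped = sex.map({"male": 1, "female": 0})

# Count the values that the mapping did not cover.
unmapped = mapped.isnull().sum()
print(mapped.tolist())
print("unmapped values:", unmapped)
```

The same check is worth running after any dictionary mapping in this notebook, since NaN values would otherwise surface only when the model is fitted.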

In [11]:
# Embarked: one-hot encode, turning one feature into three indicator columns
embarkedDf = pd.DataFrame()
embarkedDf = pd.get_dummies( dataset['Embarked'] , prefix='Embarked' )
embarkedDf.head()
dataset = pd.concat([dataset,embarkedDf],axis=1)
dataset.drop('Embarked',axis=1,inplace=True)
dataset.head()

| | Age | Cabin | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 0 | 22.0 | U | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 |
| 1 | 38.0 | C85 | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | 0 |
| 2 | 26.0 | U | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 |
| 3 | 35.0 | C123 | 53.1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | 0 | 1 | 1.0 | 113803 | 0 | 0 | 1 |
| 4 | 35.0 | U | 8.05 | Allen, Mr. William Henry | 0 | 5 | 3 | 1 | 0 | 0.0 | 373450 | 0 | 0 | 1 |


In [12]:
pclassDf = pd.DataFrame()

# One-hot encode with get_dummies; the column prefix is Pclass
pclassDf = pd.get_dummies( dataset['Pclass'] , prefix='Pclass' )
pclassDf.head()
# Add the dummy variables produced by one-hot encoding to the Titanic dataset
dataset = pd.concat([dataset,pclassDf],axis=1)
# Drop the original Pclass column
dataset.drop('Pclass',axis=1,inplace=True)
dataset.head()

| | Age | Cabin | Fare | Name | Parch | PassengerId | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 0 | 22.0 | U | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 38.0 | C85 | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 26.0 | U | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 35.0 | C123 | 53.1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 0 | 1 | 1.0 | 113803 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 35.0 | U | 8.05 | Allen, Mr. William Henry | 0 | 5 | 1 | 0 | 0.0 | 373450 | 0 | 0 | 1 | 0 | 0 | 1 |


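The Name feature is different: it contains a form of address, a title that hints at social status. Extracting the title as a new feature is far more meaningful than using the name itself. The titles are grouped into these categories:  
Officer: military officers and, in the mapping used here, also doctors and clergy  
Royalty: royalty and nobility  
Mr: adult man  
Mrs: married woman  
Miss: unmarried woman  
Master: young boy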

In [13]:
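def getTitle(name):
    str1=name.split( ',' )[1] # e.g. " Mr. Owen Harris"
    str2=str1.split( '.' )[0] # e.g. " Mr"
    # strip() removes the given characters (whitespace by default) from both ends of the string
    str3=str2.strip()
    return str3
# Holds the extracted feature
titleDf = pd.DataFrame()
# map: apply a custom function to every element of the Series
titleDf['Title'] = dataset['Name'].map(getTitle)

# Mapping from the title string found in the name to the title category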
title_mapDict = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"
                    }

# map with a dict: replace each extracted title with its category
titleDf['Title'] = titleDf['Title'].map(title_mapDict)

# One-hot encode with get_dummies
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
dataset = pd.concat([dataset,titleDf],axis=1)
# Drop the Name column
dataset.drop('Name',axis=1,inplace=True)
dataset.head()

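| | Age | Cabin | Fare | Parch | PassengerId | Sex | SibSp | Survived | Ticket | Embarked_C | ... | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | Master | Miss | Mr | Mrs | Officer | Royalty |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 0 | 22.0 | U | 7.25 | 0 | 1 | 1 | 1 | 0.0 | A/5 21171 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 38.0 | C85 | 71.2833 | 0 | 2 | 0 | 1 | 1.0 | PC 17599 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 26.0 | U | 7.925 | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 35.0 | C123 | 53.1 | 0 | 4 | 0 | 1 | 1.0 | 113803 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 35.0 | U | 8.05 | 0 | 5 | 1 | 0 | 0.0 | 373450 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |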
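
The title extraction can also be written as a single vectorized call. This is only a sketch of an alternative to `getTitle`, assuming every name follows the "Surname, Title. Given names" pattern; the three names are taken from the `head()` output earlier, and the regular expression is an assumption introduced here.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
])

# Capture the text between the first comma and the next period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Names that do not match the pattern come back as NaN rather than raising an error, so they are easy to count afterwards.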


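Cabin numbers all start with a capital letter. Cabins that share the same first letter likely have something in common (the letter identifies the deck), so keeping only the first letter should give a more useful feature.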

In [14]:
cabinDf = pd.DataFrame()
dataset[ 'Cabin' ] = dataset[ 'Cabin' ].map( lambda c : c[0] )
cabinDf = pd.get_dummies( dataset['Cabin'] , prefix = 'Cabin' )
cabinDf.head()
dataset = pd.concat([dataset,cabinDf],axis=1)
# Drop the Cabin column
dataset.drop('Cabin',axis=1,inplace=True)
dataset.head()

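| | Age | Fare | Parch | PassengerId | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | ... | Royalty | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 0 | 22.0 | 7.25 | 0 | 1 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 38.0 | 71.2833 | 0 | 2 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 26.0 | 7.925 | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 35.0 | 53.1 | 0 | 4 | 0 | 1 | 1.0 | 113803 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35.0 | 8.05 | 0 | 5 | 1 | 0 | 0.0 | 373450 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |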


In [15]:
familyDf = pd.DataFrame()
familyDf[ 'FamilySize' ] = dataset[ 'Parch' ] + dataset[ 'SibSp' ] + 1
familyDf['Family_Single'] = familyDf['FamilySize'].map(lambda p: 1 if p==1 else 0)
familyDf['Family_Small'] = familyDf['FamilySize'].map(lambda p: 1 if 2 <= p <= 4 else 0)
familyDf['Family_Large'] = familyDf['FamilySize'].map(lambda p: 1 if p>4 else 0)
familyDf.drop('FamilySize', axis=1,inplace=True)
familyDf.head()
dataset = pd.concat([dataset,familyDf],axis=1)
dataset.head()

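| | Age | Fare | Parch | PassengerId | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | ... | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | Family_Single | Family_Small | Family_Large |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 0 | 22.0 | 7.25 | 0 | 1 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 38.0 | 71.2833 | 0 | 2 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 26.0 | 7.925 | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | 35.0 | 53.1 | 0 | 4 | 0 | 1 | 1.0 | 113803 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 35.0 | 8.05 | 0 | 5 | 1 | 0 | 0.0 | 373450 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |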
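
The three family-size indicators can also be produced by binning. The sketch below is an alternative written with `pd.cut`, assuming the same thresholds as the lambdas above (1, 2 to 4, more than 4); the sample sizes are made up.

```python
import pandas as pd

# Made-up family sizes (Parch + SibSp + 1).
family_size = pd.Series([1, 2, 4, 5, 11])

# Right-inclusive bins: (0, 1] -> Single, (1, 4] -> Small, (4, inf] -> Large.
labels = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["Single", "Small", "Large"],
)

# One indicator column per bin, in the order the labels were given.
family_dummies = pd.get_dummies(labels, prefix="Family", dtype=int)
print(family_dummies)
```

Keeping the thresholds in one `bins` list makes it easier to try other cut points than editing three separate lambdas.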


### Feature selection

We compute correlation coefficients to decide which features to use for training.

In [16]:
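# Correlation matrix over the numeric (and boolean) columns only;
# text columns such as Ticket cannot be correlated
corrDf = dataset.select_dtypes(include=['number', 'bool']).corr()
# Correlation of every feature with survival (Survived),
# sorted in descending order (ascending=False)
corrDf['Survived'].sort_values(ascending=False)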

Survived         1.000000
Mrs              0.344935
Miss             0.332795
Pclass_1         0.285904
Family_Small     0.279855
Fare             0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Pclass_2         0.093349
Master           0.085221
Parch            0.081629
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
SibSp           -0.035322
Age             -0.070323
Family_Large    -0.125147
Embarked_S      -0.149683
Family_Single   -0.203367
Cabin_U         -0.316912
Pclass_3        -0.322308
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
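
When reading such a ranking, strongly negative coefficients (Sex and Mr above) are as informative as strongly positive ones, so it can help to rank by absolute value. The sketch below does this on made-up toy data; the values and the names `corr_with_target` and `ranked` are assumptions for illustration.

```python
import pandas as pd

# Made-up data; Sex uses 1 = male, 0 = female, like the mapping in this notebook.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": [1, 0, 0, 1, 0, 1],
    "Fare": [7.25, 71.28, 7.93, 8.05, 53.10, 8.46],
    "Ticket": ["A1", "B2", "C3", "D4", "E5", "F6"],  # text column, excluded below
})

# Keep only columns that a correlation can be computed on.
numeric = toy.select_dtypes(include=["number", "bool"])

# Correlation of each feature with the target, ranked by strength.
corr_with_target = numeric.corr()["Survived"].drop("Survived")
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Note that a Pearson coefficient only captures linear association, so a low value does not prove a feature is useless.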

In [17]:
# Feature selection
dataset_X = pd.concat( [titleDf,# title
                     pclassDf,# ticket class
                     familyDf,# family size
                     dataset['Fare'],# fare
                     cabinDf,# cabin deck
                     embarkedDf,# port of embarkation
                     dataset['Sex']# sex
                    ] , axis=1 )
dataset_X.head()

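| | Master | Miss | Mr | Mrs | Officer | Royalty | Pclass_1 | Pclass_2 | Pclass_3 | Family_Single | ... | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | Embarked_C | Embarked_Q | Embarked_S | Sex |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |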


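### Build the model

We hold out part of the train dataset as our own test set for evaluation. The original test.csv has no labels, so it is used only to generate the predictions that get submitted.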

In [22]:
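# Number of labeled rows, i.e. the rows that came from train.csv
sourceRow = titan_train.shape[0]
# Source dataset: features
source_X = dataset_X.loc[0:sourceRow-1,:]
# Source dataset: labels
source_y = dataset.loc[0:sourceRow-1,'Survived']

# Prediction dataset: features
pred_X = dataset_X.loc[sourceRow:,:]
# Make sure the source dataset is exactly the first 891 rows,
# otherwise the model will fail later on
# Number of rows in the source dataset
print('Rows in the source dataset:',source_X.shape[0])
# Number of rows in the prediction dataset
print('Rows in the prediction dataset:',pred_X.shape[0])

Rows in the source dataset: 891
Rows in the prediction dataset: 418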


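Next, split the source dataset into a training set (used to fit the model) and a test set (used to evaluate it).  
train_test_split is a commonly used scikit-learn helper for hold-out validation: it randomly splits the samples into training and test data in a given proportion.  
First argument (source_X here): the feature set to split  
Second argument (source_y here): the labels to split  
test_size: the fraction of samples assigned to the test set; if it is an integer, the absolute number of test samples. train_size works the same way for the training set.  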

In [24]:
from sklearn.model_selection import train_test_split
# Training and test sets used to build the model
train_X, test_X, train_y, test_y = train_test_split(source_X ,
                                                    source_y,
                                                    train_size=.8)

# Print the dataset sizes
print ('Source features:',source_X.shape, 
       'Training features:',train_X.shape ,
      'Test features:',test_X.shape)

print ('Source labels:',source_y.shape, 
       'Training labels:',train_y.shape ,
      'Test labels:',test_y.shape)

Source features: (891, 26) Training features: (712, 26) Test features: (179, 26)
Source labels: (891,) Training labels: (712,) Test labels: (179,)
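
The split above is random and unseeded, so the rows in each part (and the score later on) change from run to run. As a hedged sketch on made-up data, the example below shows two optional arguments of `train_test_split`: `random_state` to make the split repeatable and `stratify` to keep the class ratio the same in both parts. The seed 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, balanced binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

# Repeating the call with the same seed yields the same split.
X_train2, X_test2, _, _ = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("test labels:", sorted(y_test.tolist()))
print("same split:", np.array_equal(X_test, X_test2))
```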




### Choose a machine learning algorithm

In [25]:
# Step 1: import the algorithm
from sklearn.linear_model import LogisticRegression
# Step 2: create the model: logistic regression
model = LogisticRegression()

### Train the model

In [26]:
model.fit( train_X , train_y )



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

### Evaluate the model

In [27]:
# For a classifier, score returns the mean accuracy on the given data
model.score(test_X , test_y )

0.7877094972067039
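
A single 80/20 split yields one accuracy figure that depends on which rows happened to land in the test set. Cross-validation averages over several splits and gives a rough idea of the spread. The sketch below uses synthetic data from `make_classification`, not the Titanic features; `cv_model`, the fold count of 5, and `max_iter=1000` are choices made here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a feature matrix and binary labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A higher max_iter than the default gives the solver more room to converge.
cv_model = LogisticRegression(max_iter=1000)

# Five folds: each fold is used once as the test set; scoring is accuracy.
scores = cross_val_score(cv_model, X, y, cv=5)
print("mean accuracy:", scores.mean())
print("std across folds:", scores.std())
```

On the notebook's data the same call would take `source_X` and `source_y` in place of the synthetic arrays.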

### Applying the model

Save the predictions to a CSV file and upload it to Kaggle.

In [28]:
# Use the trained model to predict survival for the prediction dataset
pred_Y = model.predict(pred_X)

# The predictions are floats (0.0, 1.0) because Survived was stored as float,
# but Kaggle expects integers (0, 1), so convert the dtype
pred_Y=pred_Y.astype(int)
# Passenger IDs
passenger_id = dataset.loc[sourceRow:,'PassengerId']
# DataFrame with the passenger ID and the predicted survival value
predDf = pd.DataFrame( 
    { 'PassengerId': passenger_id , 
     'Survived': pred_Y } )
predDf.shape
predDf.head()
# Save the result
predDf.to_csv( 'titanic_pred.csv' , index = False )
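
Before uploading, a few quick checks on the submission frame can catch format problems early. The sketch below uses made-up stand-ins for `passenger_id` and `pred_Y` and writes to an in-memory buffer instead of a file; the specific checks are suggestions made here, not a statement of Kaggle's validation rules.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins for passenger_id and pred_Y.
ids = pd.Series([892, 893, 894], index=[891, 892, 893])
preds = np.array([0.0, 1.0, 0.0])

submission = pd.DataFrame({
    "PassengerId": ids.to_numpy(),
    "Survived": preds.astype(int),
})

# Basic checks before writing the file.
assert list(submission.columns) == ["PassengerId", "Survived"]
assert submission["Survived"].isin([0, 1]).all()
assert submission["PassengerId"].is_unique

# Write to an in-memory buffer here; a real run would pass a file path.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```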