# Titanic
Given information about the passengers aboard the Titanic, design a model that predicts whether a passenger survived the sinking.

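## Data description
### Input format
The data files train.csv and test.csv contain information on a number of passengers. Each passenger record has the following fields:

PassengerId : unique ID of the passenger  
Survived : whether the passenger survived (0 = No, 1 = Yes; present only in train.csv)  
Pclass : ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)  
Name : passenger name  
Sex : passenger sex (male, female)  
Age : passenger age (in years)  
SibSp : number of siblings/spouses aboard  
Parch : number of parents/children aboard  
Ticket : ticket number  
Fare : ticket fare  
Cabin : cabin number  
Embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)  
The training set train.csv has 12 columns, one for each field above. The test set test.csv has 11 columns; it omits Survived.  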

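### Output format
Submit a submission.csv file in the same format as gender_submission.csv: one row per passenger in the test set, containing the PassengerId and the predicted survival outcome. For example, if you predict that the first passenger survives and the second and third do not, the file looks like this: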

PassengerId,Survived  
1,1  
2,0  
3,0   
(415 more lines)  
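
The format described above can be checked mechanically before uploading. The following is a minimal sketch, not part of the original notebook; `check_submission` is a hypothetical helper, and the three-row CSV text is a made-up example.

```python
import io

import pandas as pd


def check_submission(csv_text, expected_rows):
    """Return True if csv_text looks like a valid submission (hypothetical helper)."""
    sub = pd.read_csv(io.StringIO(csv_text))
    # The header must be exactly PassengerId,Survived.
    has_columns = list(sub.columns) == ["PassengerId", "Survived"]
    # One row is expected per passenger in the test set.
    has_rows = len(sub) == expected_rows
    # Predictions must be 0 or 1.
    is_binary = has_columns and sub["Survived"].isin([0, 1]).all()
    return bool(has_columns and has_rows and is_binary)


example = "PassengerId,Survived\n1,1\n2,0\n3,0\n"
print(check_submission(example, expected_rows=3))
```

For the real test set, `expected_rows` would be 418.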

# 1. Importing dependencies

In [3]:
import numpy as np  # numerical computing
import pandas as pd
from sklearn import datasets  # sample datasets to learn with, or to feed into TensorFlow for practice
from sklearn.model_selection import train_test_split # split data into training and test subsets
from sklearn.neighbors import KNeighborsClassifier   # predicts from the labels of the k nearest training points
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings("ignore")
import seaborn as sns
import os
%matplotlib inline

# 2. Loading the data and merging the training and test sets

In [24]:
trainDF = pd.read_csv('./data/train.csv')
testDF = pd.read_csv('./data/test.csv')

print(trainDF.shape, testDF.shape)

df = pd.concat([trainDF, testDF])  # merging lets both sets be preprocessed together
df = df.reset_index() # rebuild the index; this adds an extra 'index' column
df

(891, 12) (418, 11)


Unnamed: 0,index,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
0,0,22.0,,S,7.2500,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171
1,1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599
2,2,26.0,,S,7.9250,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282
3,3,35.0,C123,S,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803
4,4,35.0,,S,8.0500,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450
5,5,,,Q,8.4583,"Moran, Mr. James",0,6,3,male,0,0.0,330877
6,6,54.0,E46,S,51.8625,"McCarthy, Mr. Timothy J",0,7,1,male,0,0.0,17463
7,7,2.0,,S,21.0750,"Palsson, Master. Gosta Leonard",1,8,3,male,3,0.0,349909
8,8,27.0,,S,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",2,9,3,female,0,1.0,347742
9,9,14.0,,C,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",0,10,2,female,1,1.0,237736
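
As an alternative to calling `reset_index()` and later dropping the extra `index` column, `pd.concat` can renumber the rows itself. This is a sketch on two tiny made-up frames (`train_demo` and `test_demo` are stand-ins, not the real data):

```python
import pandas as pd

# Tiny stand-ins for trainDF and testDF (made-up rows, not the real data).
train_demo = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Age": [22.0, 38.0]})
test_demo = pd.DataFrame({"PassengerId": [3, 4], "Age": [26.0, 35.0]})

# ignore_index=True renumbers rows 0..n-1 without adding an 'index' column;
# sort=False keeps the column order of the first frame.
merged = pd.concat([train_demo, test_demo], ignore_index=True, sort=False)

print(merged.columns.tolist())
print(merged.index.tolist())
print(merged["Survived"].isnull().sum())  # rows that came from the test set
```

Rows from the test frame end up with a missing `Survived`, which is what the later train/test separation relies on.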


# 3. Inspecting the dataset

In [25]:
df = df.drop('index', axis=1)  
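df = df.reindex(columns=trainDF.columns)  # concat sorted the columns alphabetically; restore the original order (reindex_axis is no longer available in recent pandas)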
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0.0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0.0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0.0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


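## Analysis
Survived: 891 non-null values, matching the size of the training set, as expected  
Age, Fare and Embarked: contain missing values  
Cabin: too many missing values (only 295 of 1309 present), so the column can be dropped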
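
The same judgement can be made from the fraction of missing values per column rather than by reading the `info()` output. This is a sketch on a made-up frame; the 0.5 threshold is an assumption chosen for illustration, not a value from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the pattern above: Cabin is mostly missing.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isnull() gives a boolean frame; its column mean is the fraction missing.
missing_ratio = demo.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

threshold = 0.5  # assumed cut-off for "too sparse to keep"
too_sparse = missing_ratio[missing_ratio > threshold].index.tolist()
print(too_sparse)
```

In the real data, Cabin has 295 of 1309 values present, so roughly 77% are missing.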

In [27]:
# drop Cabin
df = df.drop('Cabin', axis = 1)

In [28]:
# Age and Fare are numeric, so fill missing values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())  # assign back instead of chained indexing
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

In [31]:
# Embarked is categorical, so fill missing values with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # mode() returns a Series (see the cells below), so take its first value

In [33]:
df['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [34]:
df['Embarked'].mode()

0    S
dtype: object

In [38]:
type(df['Age'].mode())

pandas.core.series.Series

In [39]:
df['Age'].mode().values

array([ 28.])
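
The cells above show that `mode()` returns a Series rather than a scalar, because several values can tie for most frequent. This sketch uses made-up data to show the tie case and one way to pick a single fill value from it.

```python
import numpy as np
import pandas as pd

# A tie: 'S' and 'C' both appear twice, so mode() has two entries.
tied = pd.Series(["S", "C", "S", "C", "Q"])
print(tied.mode().tolist())

demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 26.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Take the first mode as a scalar, then assign the filled column back.
embarked_mode = demo["Embarked"].mode().iloc[0]
demo["Age"] = demo["Age"].fillna(demo["Age"].median())
demo["Embarked"] = demo["Embarked"].fillna(embarked_mode)
print(demo)
```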

# 4. Feature engineering
'PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked'

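PassengerId and Name are treated as uninformative and are dropped later  
Ticket might be useful, since the ticket number may relate to where the passenger was berthed (undecided)  
Fare: first assumed to be uninformative, but this is unclear; it is kept as a feature in the code below  
SibSp and Parch can be merged into a new feature, FamilySize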

In [42]:
features = ['Survived','Pclass','Sex','Age','SibSp','Parch','Ticket','Fare','Embarked']
for feature in features:
    print(feature, len(df[feature].unique()))

Survived 3
Pclass 3
Sex 2
Age 98
SibSp 7
Parch 8
Ticket 929
Fare 281
Embarked 3


# 4-1. Outlier handling

In [44]:
df.info()
# Sex, Ticket, Embarked and Name are non-numeric and need encoding (Name has too many distinct values, so it is dropped)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1309 non-null float64
Embarked       1309 non-null object
dtypes: float64(3), int64(4), object(4)
memory usage: 112.6+ KB


# 4-2. Processing features

In [45]:
# SibSp and Parch can be merged into a new feature, FamilySize
df['FamilySize'] = df['SibSp'] + df['Parch']
# drop Name and PassengerId, plus SibSp and Parch now that FamilySize replaces them
df = df.drop(['Name', 'PassengerId', 'SibSp', 'Parch'], axis = 1)

In [46]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Ticket,Fare,Embarked,FamilySize
0,0.0,3,male,22.0,A/5 21171,7.25,S,1
1,1.0,1,female,38.0,PC 17599,71.2833,C,1
2,1.0,3,female,26.0,STON/O2. 3101282,7.925,S,0
3,1.0,1,female,35.0,113803,53.1,S,1
4,0.0,3,male,35.0,373450,8.05,S,0


In [50]:
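# convert categorical columns to integer codes
df['Sex'] = pd.factorize(df['Sex'])[0]   # factorize returns a tuple (codes, uniques); keep only the codes
df['Embarked'] = pd.factorize(df['Embarked'])[0]
# Ticket values come in inconsistent formats (one would expect a uniform format), so drop the column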
df = df.drop('Ticket', axis = 1)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,FamilySize
0,0.0,3,0,22.0,7.25,0,1
1,1.0,1,1,38.0,71.2833,1,1
2,1.0,3,1,26.0,7.925,0,0
3,1.0,1,1,35.0,53.1,0,1
4,0.0,3,0,35.0,8.05,0,0
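
To make the encoding above easier to read back, this sketch shows both elements of the tuple that `pd.factorize` returns. The four-row Series is made up for illustration.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])

# codes: one integer per row, numbered in order of first appearance.
# uniques: the label that each code stands for.
codes, uniques = pd.factorize(sex)
print(codes)
print(uniques)

# The original labels can be recovered by indexing uniques with the codes.
decoded = uniques.take(codes)
print(list(decoded))
```

Because codes follow order of first appearance, the notebook's data ends up with male = 0 and female = 1, and with S = 0, C = 1, Q = 2 for Embarked.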


# 4-3. Checking the data

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
Survived      891 non-null float64
Pclass        1309 non-null int64
Sex           1309 non-null int64
Age           1309 non-null float64
Fare          1309 non-null float64
Embarked      1309 non-null int64
FamilySize    1309 non-null int64
dtypes: float64(3), int64(4)
memory usage: 71.7 KB


In [54]:
df.describe(include = 'all').T    # every column is numeric, so unique, top and freq are not reported

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,1309.0,2.294882,0.837836,1.0,2.0,3.0,3.0,3.0
Sex,1309.0,0.355997,0.478997,0.0,0.0,0.0,1.0,1.0
Age,1309.0,29.503186,12.905241,0.17,22.0,28.0,35.0,80.0
Fare,1309.0,33.281086,51.7415,0.0,7.8958,14.4542,31.275,512.3292
Embarked,1309.0,0.394194,0.653499,0.0,0.0,0.0,1.0,2.0
FamilySize,1309.0,0.883881,1.583639,0.0,0.0,0.0,1.0,10.0


# 5. Separating the training and test sets

In [56]:
train = df[df['Survived'].notnull()].values  # .values converts the DataFrame to a NumPy array
test = df[df['Survived'].isnull()].values  
train

array([[  0.    ,   3.    ,   0.    , ...,   7.25  ,   0.    ,   1.    ],
       [  1.    ,   1.    ,   1.    , ...,  71.2833,   1.    ,   1.    ],
       [  1.    ,   3.    ,   1.    , ...,   7.925 ,   0.    ,   0.    ],
       ..., 
       [  0.    ,   3.    ,   1.    , ...,  23.45  ,   0.    ,   3.    ],
       [  1.    ,   1.    ,   0.    , ...,  30.    ,   1.    ,   0.    ],
       [  0.    ,   3.    ,   0.    , ...,   7.75  ,   2.    ,   0.    ]])

In [58]:
#type(train)
len(train)

891

In [59]:
y_train = train[:, 0]     # label
x_train = train[:, 1:]    # feature

# 6. Building models: random forest

In [74]:
from sklearn.ensemble import RandomForestClassifier  
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20

RF = RandomForestClassifier(n_estimators=1000, random_state=520, min_samples_leaf=3).fit(x_train, y_train)  
print(cross_val_score(RF, x_train, y_train).mean())  
print(df.columns)  
print(RF.feature_importances_)  

0.824915824916
Index(['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'FamilySize'], dtype='object')
[ 0.11410596  0.36909351  0.17583256  0.23087716  0.030596    0.07949482]
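
The printed column index includes `Survived`, while `feature_importances_` has one entry per feature only, so the two lines do not align position by position. This sketch pairs each importance with its feature name, using the values copied from the output above.

```python
# Feature names in the column order of x_train (Survived is the label, so it is skipped).
feature_names = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize"]
# Importances copied from the output above.
importances = [0.11410596, 0.36909351, 0.17583256, 0.23087716, 0.030596, 0.07949482]

# Sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{:<10s} {:.4f}".format(name, score))
```

Read this way, Sex carries the most weight in this forest, followed by Fare and Age, with Embarked contributing least.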


## Logistic regression

In [65]:
from sklearn.linear_model import LogisticRegression  # import the logistic regression model
LR = LogisticRegression().fit(x_train, y_train)
print(cross_val_score(LR, x_train, y_train).mean()) 

0.791245791246


## SVM

In [66]:
from sklearn import svm

# create an SVC and set its kernel; the default is 'rbf', other options include 'linear', 'poly' and 'sigmoid'
SVM = svm.SVC(kernel='linear').fit(x_train, y_train) 
print(cross_val_score(SVM, x_train, y_train).mean()) 

0.785634118967


## GNB

In [71]:
from sklearn.naive_bayes import GaussianNB

GNB = GaussianNB().fit(x_train, y_train)
print(cross_val_score(GNB, x_train, y_train).mean())  

0.793490460157


## KN

In [73]:
from sklearn.neighbors import KNeighborsClassifier

KN = KNeighborsClassifier().fit(x_train, y_train)
print(cross_val_score(KN, x_train, y_train).mean())  

0.705948372615


## DT

In [72]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier().fit(x_train, y_train)
print(cross_val_score(DT, x_train, y_train).mean()) 

0.763187429854
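
The per-model cells above repeat the same pattern, so the comparison can be written as one loop. This is a sketch on synthetic data from `make_classification`, not the Titanic features, so its scores will differ from those above. `cv=5` is set explicitly as an assumption, since the default number of folds has changed between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_train / y_train (not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # cross_val_score clones and refits the estimator on each fold,
    # so calling .fit() beforehand is not needed for scoring.
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Print the models from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, round(score, 3))
```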


## A trick I used to use

In [70]:
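maxScore = 0     # split 5 times and keep the best score (random_state is fixed at 0, so every split is identical, as the output shows)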
for i in range(5):
    X_train, X_test, Y_train, Y_test = train_test_split(x_train, y_train, test_size = 0.2, random_state = 0)
    RF = RandomForestClassifier(n_estimators=1000, random_state=520, min_samples_leaf=3).fit(X_train, Y_train)
    RF1 = RandomForestClassifier(n_estimators=1000, random_state=520, min_samples_leaf=3).fit(x_train, y_train)
    print(cross_val_score(RF1, x_train, y_train).mean())  
    tmp = np.mean(Y_test == RF.predict(X_test))
    print(tmp)
    if tmp > maxScore:
        maxScore = tmp
print(maxScore)

0.824915824916
0.854748603352
0.824915824916
0.854748603352
0.824915824916
0.854748603352
0.824915824916
0.854748603352
0.824915824916
0.854748603352
0.854748603352
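
The five iterations above print identical numbers because `train_test_split` is called with the same `random_state` each time. This sketch, on made-up arrays, compares a fixed seed with a seed that changes per iteration; `held_out_rows` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # made-up features
y_demo = np.array([0, 1] * 10)          # made-up labels


def held_out_rows(seed):
    """Return the held-out feature rows for a given seed (hypothetical helper)."""
    _, X_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=seed)
    return X_test.tolist()


fixed = [held_out_rows(0) for _ in range(3)]     # same seed every time
varied = [held_out_rows(i) for i in range(3)]    # a different seed per iteration

print(fixed[0] == fixed[1] == fixed[2])      # identical splits
print(varied[0] == varied[1] == varied[2])   # normally different splits
```

Keeping only the best of several splits also tends to give an optimistic estimate; the mean over splits is usually the more honest number.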


# 7. Predicting on the test set

In [81]:
res = RF.predict(test[:, 1:]).astype(int)   # column 0 is Survived
res

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0,

# 8. Writing the results to a CSV file in the required format

In [82]:
import datetime

testDF = pd.read_csv('./data/test.csv')
ids = testDF['PassengerId']

now = datetime.datetime.now()
now = now.strftime('%m-%d-%H-%M')

res = pd.DataFrame({
        'PassengerId':ids,
        'Survived':res
        })
res.to_csv("./result/RF_%s.csv" % now, index=False)
print('done!')

done!
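
`to_csv` does not create missing directories, so the cell above fails if `./result` does not exist yet. This sketch wraps the same steps in a function that creates the directory first. `save_submission` is a hypothetical helper, and the ids and predictions are made-up demo values written to a temporary directory.

```python
import datetime
import os
import tempfile

import pandas as pd


def save_submission(ids, predictions, out_dir, prefix="RF"):
    """Write a submission CSV and return its path (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)  # create the output directory if needed
    stamp = datetime.datetime.now().strftime("%m-%d-%H-%M")
    path = os.path.join(out_dir, "{}_{}.csv".format(prefix, stamp))
    frame = pd.DataFrame({"PassengerId": ids, "Survived": predictions})
    frame.to_csv(path, index=False)  # index=False keeps only the two required columns
    return path


# Demo with made-up values, written under a temporary directory.
demo_dir = os.path.join(tempfile.mkdtemp(), "result")
demo_path = save_submission([892, 893, 894], [0, 1, 0], demo_dir)
print(pd.read_csv(demo_path).shape)
```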


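# 9. Evaluating the model

Both Kaggle and LintCode host the Titanic problem; the RF predictions were submitted to both.

Kaggle: 0.77990  
LintCode: 0.77512

The two platforms score the submission almost identically, so LintCode is a reasonable choice.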

# 10. Timer

In [9]:
import timeit
import time

start = timeit.default_timer()
time.sleep(2)
print("Running Time: {:.2f} seconds".format(timeit.default_timer() - start))
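
The start/stop timing pattern can be wrapped so it is reusable around any block of code. This is a sketch using only the standard library; `timer` is a hypothetical helper, and `time.perf_counter` is used in place of `timeit.default_timer`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label="Running Time"):
    """Print the wall-clock time spent inside the with-block (hypothetical helper)."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        # Record and report the elapsed time even if the block raised.
        result["seconds"] = time.perf_counter() - start
        print("{}: {:.2f} seconds".format(label, result["seconds"]))


with timer() as elapsed:
    time.sleep(0.1)  # placeholder for the work being timed
```

After the block, `elapsed["seconds"]` holds the measured duration for further use.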

# 11. Progress bar

In [4]:
data = pd.read_csv('./data/train.csv')
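
The cell above only loads the data. As an illustration of what a text progress bar could look like, here is a sketch using only the standard library. It is not the author's implementation; `progress_bar` is a hypothetical helper and `total_rows` stands in for `len(data)`.

```python
import sys
import time


def progress_bar(done, total, width=30):
    """Return a one-line text bar such as '[#####-----] 50%' (hypothetical helper)."""
    filled = int(width * done / total)
    percent = int(100 * done / total)
    return "[{}{}] {}%".format("#" * filled, "-" * (width - filled), percent)


total_rows = 20  # stand-in for len(data)
for i in range(1, total_rows + 1):
    time.sleep(0.01)  # placeholder for per-row work
    # '\r' returns to the start of the line so the bar redraws in place.
    sys.stdout.write("\r" + progress_bar(i, total_rows))
    sys.stdout.flush()
sys.stdout.write("\n")
```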