## Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

In [75]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

In [76]:
train_data = pd.read_csv("train.csv")

In [127]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [109]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


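#### Processing plan
- Separate the label: Survived
- Drop columns: PassengerId, Name
- Numeric features: Pclass, Age, SibSp, Parch, Fare, Sex (categorical, encoded as a number)
- Categorical features: Embarked
- Special case: Cabin has far too many distinct values to use as a category directly; extracting the first letter simplifies it
- Special case: Ticket, like Name, seems only weakly related to survival. Is there a way to simplify it, or should it be ignored for now?
- Features with missing values: Age, Cabin, Embarked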
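The last bullet can be checked against the `train_data.info()` output above. The following is a minimal sketch that only reuses the non-null counts printed there (891 rows; Age 714, Cabin 204, Embarked 889); the variable names are illustrative and not part of the notebook.

```python
# Non-null counts copied from the train_data.info() output above.
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing count and missing fraction per column.
missing = {col: total_rows - n for col, n in non_null.items()}
missing_frac = {col: round(m / total_rows, 3) for col, m in missing.items()}

print(missing)
print(missing_frac)
```

Cabin is missing for most passengers, while Embarked lacks only two values, so the three columns probably call for different handling.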

In [78]:
# Separate the label: Survived
y_train = train_data["Survived"].copy()
x_train = train_data.drop(columns=["Survived"])

In [132]:
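# Numeric features: Pclass, Age, SibSp, Parch, Fare, plus Sex.
# Survived is kept here only so it can be used in the correlation check later.
x_train_num = train_data[["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare", "Sex"]].copy()
# Reverse the class codes so that a higher number means a higher class
x_train_num["Pclass"] = x_train_num["Pclass"].replace({1: 3, 3: 1})
# Encode the Sex category as a number
x_train_num["Sex"] = x_train_num["Sex"].map({"male": 0, "female": 1})
x_train_num.head()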

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex
0,0,1,22.0,1,0,7.25,0
1,1,3,38.0,1,0,71.2833,1
2,1,1,26.0,0,0,7.925,1
3,1,3,35.0,1,0,53.1,1
4,0,1,35.0,0,0,8.05,0


In [133]:
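# Fill in the missing numeric values with each column's median.
# sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and later removed;
# sklearn.impute.SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(x_train_num)
x_train_num_full = imputer.transform(x_train_num)
x_train_num = pd.DataFrame(x_train_num_full, columns=x_train_num.columns)
x_train_num.info()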

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Survived    891 non-null float64
Pclass      891 non-null float64
Age         891 non-null float64
SibSp       891 non-null float64
Parch       891 non-null float64
Fare        891 non-null float64
Sex         891 non-null float64
dtypes: float64(7)
memory usage: 48.8 KB
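To make the effect of the median strategy concrete, here is a minimal sketch of the same idea on a single column using plain pandas. The toy ages are illustrative, and `fillna` with the column median is swapped in here for the scikit-learn imputer used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (the ages are illustrative).
ages = pd.Series([22.0, 38.0, np.nan, 35.0], name="Age")

# Series.median() skips NaN by default: the median of 22, 35, 38 is 35.
median_age = ages.median()

# Replace the missing entry with that median.
filled = ages.fillna(median_age)
print(filled.tolist())
```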


In [134]:
# Add derived features
x_train_num["Parch_b"] = x_train_num["Parch"] > 0
x_train_num["SibSp_b"] = x_train_num["SibSp"] > 0
# x_train_num["single_dog"] = (x_train_num["Parch"] == 0) & (x_train_num["SibSp"] == 0)
x_train_num["has_family"] = (x_train_num["Parch"] > 0) | (x_train_num["SibSp"] > 0)

In [135]:
# Inspect how each feature correlates with Survived
corr_mtx = x_train_num.corr()
corr_mtx["Survived"].sort_values(ascending=False)

Survived      1.000000
Sex           0.543351
Pclass        0.338481
Fare          0.257307
has_family    0.203367
Parch_b       0.147408
SibSp_b       0.115867
Parch         0.081629
SibSp        -0.035322
Age          -0.064910
Name: Survived, dtype: float64

In [136]:
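# Categorical feature: Embarked
x_train_cat = x_train[["Embarked"]].copy()
x_train_cat = pd.get_dummies(x_train_cat)
# Add the label back so the correlation with Survived can be computed in the next cell
x_train_cat["Survived"] = train_data["Survived"]
x_train_cat.head()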

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S,Survived
0,0,0,1,0
1,1,0,0,1
2,0,0,1,1
3,0,0,1,1
4,0,0,1,0


In [137]:
corr_mtx = x_train_cat.corr()
corr_mtx["Survived"].sort_values(ascending=False)

Survived      1.00000
Embarked_C    0.16824
Embarked_Q    0.00365
Embarked_S   -0.15566
Name: Survived, dtype: float64

In [140]:
# Relatively few passengers boarded at Q, so the correlation computed for it comes out low
x_train["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64
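Correlation coefficients and raw counts can be read side by side by grouping on the port. The sketch below is a minimal example that uses only the five rows shown by `train_data.head()` earlier; the name `sample` is hypothetical, and five rows are far too few to draw conclusions from.

```python
import pandas as pd

# First five rows of the training data, as shown by train_data.head() above.
sample = pd.DataFrame({
    "Embarked": ["S", "C", "S", "S", "S"],
    "Survived": [0, 1, 1, 1, 0],
})

# Passenger count and survival rate (mean of the 0/1 label) per port.
by_port = sample.groupby("Embarked")["Survived"].agg(["count", "mean"])
print(by_port)
```

Running the same `groupby` on `train_data` instead of `sample` would give the per-port counts and survival rates for the full training set.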

In [156]:
# Special case: Cabin has far too many distinct values to use as a category; keep only the first letter
x_cabin = x_train["Cabin"].copy()

In [158]:
x_cabin.value_counts()

C23 C25 C27        4
B96 B98            4
G6                 4
C22 C26            3
D                  3
E101               3
F33                3
F2                 3
C124               2
B35                2
F4                 2
B28                2
E33                2
B51 B53 B55        2
C92                2
B22                2
B20                2
C2                 2
D20                2
D36                2
C126               2
E25                2
C68                2
E8                 2
C83                2
D33                2
B57 B59 B63 B66    2
E24                2
D26                2
F G73              2
                  ..
A26                1
F38                1
B37                1
C106               1
E63                1
B41                1
D30                1
A10                1
B82 B84            1
B19                1
C95                1
C7                 1
B50                1
A7                 1
D48                1
C45                1
B79          

In [163]:
# Keep only the first character of each cabin string (the deck letter A-G or T).
# Missing values stay NaN.
x_cabin = x_cabin.str[0]

In [162]:
x_cabin.value_counts()

C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Cabin, dtype: int64

In [164]:
x_cabin_cat = pd.get_dummies(x_cabin)
x_cabin_cat["Survived"]=train_data["Survived"]
x_cabin_cat.head()

Unnamed: 0,A,B,C,D,E,F,G,T,Survived
0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,1
3,0,0,1,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0
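Rows 0, 2 and 4 above are all zeros because their Cabin value is missing, and `pd.get_dummies` ignores NaN by default. One option, sketched here as an assumption rather than as the notebook's approach, is to pass `dummy_na=True` so that missing cabins get their own indicator column. The series `decks` is a hypothetical stand-in for the first five values of `x_cabin`.

```python
import numpy as np
import pandas as pd

# Deck letters for the first five passengers; NaN marks a missing Cabin.
decks = pd.Series([np.nan, "C", np.nan, "C", np.nan], name="Cabin")

# dummy_na=True adds one extra column that flags the missing values.
deck_dummies = pd.get_dummies(decks, dummy_na=True)
print(deck_dummies.sum())
```

With the extra column, every row has exactly one active indicator, so a missing cabin can be told apart from a known deck.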


In [167]:
corr_mtx = x_cabin_cat.corr()
corr_mtx["Survived"].sort_values(ascending=False)

Survived    1.000000
B           0.175095
D           0.150716
E           0.145321
C           0.114652
F           0.057935
A           0.022287
G           0.016040
T          -0.026456
Name: Survived, dtype: float64