# Titanic Passenger Survival Prediction
Project page: https://www.kaggle.com/competitions/titanic

The dataset has the following features:

Attribute | Definition
--- | ---
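PassengerId | Passenger ID
Survived | Survival (0 = died, 1 = survived)
Pclass | Cabin class (1 = Upper, 2 = Middle, 3 = Lower)
Name | Passenger name
Sex | Passenger sex
Age | Passenger age
SibSp | Number of siblings/spouses aboard
Parch | Number of parents/children aboard
Ticket | Ticket number
Fare | Ticket fare
Cabin | Cabin number
Embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)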

In [2]:
import pandas as pd
# Load the data
train_data = pd.read_csv('D:/DocumentFile/data/titanic/train.csv')
test_data = pd.read_csv('D:/DocumentFile/data/titanic/test.csv')
print("Training set:", train_data.shape, "Test set:", test_data.shape)

Training set: (891, 12) Test set: (418, 11)


**The test set has one column fewer than the training set: `Survived`, which is the value to predict.**

In [3]:
# Preview the training data
train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Inspect column types and non-null counts
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


**`Age`, `Cabin`, and `Embarked` contain missing values.**

In [5]:
# View summary statistics of the numeric columns
train_data.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
# Count survivors and deaths
total_survived_num = train_data['Survived'].sum()
total_no_survived_num = train_data['Survived'].count() - total_survived_num
print('Survivors: %d; deaths: %d' % (total_survived_num, total_no_survived_num))

Survivors: 342; deaths: 549


In [7]:
# Analyze the relationship between Pclass and survival
pclass_before = train_data[['Pclass', 'Survived']].groupby(['Pclass']).count()
pclass_before

Pclass,Survived
1,216
2,184
3,491


In [8]:
pclass_after = train_data[train_data['Survived'] == 1][['Pclass', 'Survived']].groupby(['Pclass']).count()
pclass_after

Pclass,Survived
1,136
2,87
3,119


**So the survival rate is 63% in first class, 47% in second class, and 24% in third class. `The higher the cabin class, the higher the survival rate.`**
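
The rates quoted here are survivors divided by passengers in each class. Because `Survived` is coded 0/1, the same figures can be obtained with a single `groupby(...).mean()`. The sketch below rebuilds a stand-in frame from the counts printed above rather than reading `train.csv`; `toy` and `rate_by_class` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Passenger and survivor counts per class, copied from the two tables above.
totals = {1: 216, 2: 184, 3: 491}
survivors = {1: 136, 2: 87, 3: 119}

# Rebuild one row per passenger so the example runs without train.csv.
rows = []
for pclass, total in totals.items():
    alive = survivors[pclass]
    rows += [(pclass, 1)] * alive + [(pclass, 0)] * (total - alive)
toy = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class.round(2))
```

On the real data the equivalent call would be `train_data.groupby('Pclass')['Survived'].mean()`.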

In [9]:
# Analyze the relationship between Sex and survival
sex_before = train_data[['Sex', 'Survived']].groupby('Sex').count()
sex_before

Sex,Survived
female,314
male,577


In [10]:
sex_after = train_data[train_data['Survived'] == 1][['Sex', 'Survived']].groupby('Sex').count()
sex_after

Sex,Survived
female,233
male,109


**So the survival rate is 19% for men and 74% for women.**
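
A cross-tabulation gives both the died and survived shares in one table. This is a minimal sketch on stand-in data rebuilt from the counts above (314 women of whom 233 survived, 577 men of whom 109 survived); the names `toy` and `rates` are illustrative.

```python
import pandas as pd

# Stand-in data rebuilt from the counts above.
sex = ["female"] * 314 + ["male"] * 577
survived = [1] * 233 + [0] * (314 - 233) + [1] * 109 + [0] * (577 - 109)
toy = pd.DataFrame({"Sex": sex, "Survived": survived})

# normalize="index" turns each row of counts into row-wise proportions.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates.round(2))
```

Column `1` of `rates` is the survival rate and column `0` is its complement.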

In [12]:
# Analyze the relationship between Age and survival
# Age has missing values, so fill them with the mean age first.
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
train_data['Age'].describe()

count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64

In [13]:
# Split passengers by age into children, teenagers, adults, and seniors.
children_df = train_data[train_data['Age'] <= 12]
teenager_df = train_data[(train_data['Age'] > 12) & (train_data['Age'] < 18)]
adults_df = train_data[(train_data['Age'] >= 18) & (train_data['Age'] < 65)]
agedness_df = train_data[train_data['Age'] >= 65]
print('Before the disaster: children %d, teenagers %d, adults %d, seniors %d.' % (children_df['Survived'].count(), teenager_df['Survived'].count(), adults_df['Survived'].count(), agedness_df['Survived'].count()))

Before the disaster: children 69, teenagers 44, adults 767, seniors 11.


In [14]:
children_survived = children_df['Survived'].sum()
teenager_survived = teenager_df['Survived'].sum()
adults_survived = adults_df['Survived'].sum()
agedness_survived = agedness_df['Survived'].sum()
print('Survivors: children %d, teenagers %d, adults %d, seniors %d.' % (children_survived, teenager_survived, adults_survived, agedness_survived))

Survivors: children 40, teenagers 21, adults 280, seniors 1.


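**So the survival rate is 58% for children, 48% for teenagers, 37% for adults, and 9% for seniors.** Note that the 177 passengers whose missing age was filled with the mean (about 29.7) all fall into the adult group.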
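
The four boolean filters can also be expressed as one labelling step followed by a `groupby`. The sketch below assumes ages contain no missing values (as is the case after the mean fill) and uses made-up rows; `label_age_group` and the label strings are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

def label_age_group(age: pd.Series) -> pd.Series:
    """Label ages with the same cut-offs as the filters above (assumes no NaN)."""
    # np.select takes the first condition that holds, so the order matters.
    conditions = [age <= 12, age < 18, age < 65]
    labels = ["child", "teenager", "adult"]
    return pd.Series(np.select(conditions, labels, default="senior"), index=age.index)

# Made-up rows that sit on and around each boundary.
toy = pd.DataFrame({
    "Age": [5, 12, 15, 18, 40, 65, 80],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})
toy["AgeGroup"] = label_age_group(toy["Age"])
summary = toy.groupby("AgeGroup")["Survived"].agg(["count", "sum", "mean"])
print(summary)
```

The `count`, `sum`, and `mean` columns correspond to the "before", "survivors", and rate figures reported for each group.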

In [15]:
# Analyze the relationship between SibSp and survival
sibsp_df = train_data[train_data['SibSp'] != 0]
no_sibsp_df = train_data[train_data['SibSp'] == 0]
print('Before the disaster: %d passengers with siblings or a spouse aboard, %d without.' % (sibsp_df['Survived'].count(), no_sibsp_df['Survived'].count()))

Before the disaster: 283 passengers with siblings or a spouse aboard, 608 without.


In [16]:
sibsp_survived = sibsp_df['Survived'].sum()
no_sibsp_survived = no_sibsp_df['Survived'].sum()
print('Survivors: %d with siblings or a spouse aboard, %d without.' % (sibsp_survived, no_sibsp_survived))

Survivors: 132 with siblings or a spouse aboard, 210 without.


**So the survival rate is 47% for passengers with siblings or a spouse aboard and 35% for those without.**

In [17]:
# Analyze the relationship between Parch and survival
parch_df = train_data[train_data['Parch'] != 0]
no_parch_df = train_data[train_data['Parch'] == 0]
print('Before the disaster: %d passengers with parents or children aboard, %d without.' % (parch_df['Survived'].count(), no_parch_df['Survived'].count()))

Before the disaster: 213 passengers with parents or children aboard, 678 without.


In [18]:
parch_survived = parch_df['Survived'].sum()
no_parch_survived = no_parch_df['Survived'].sum()
print('Survivors: %d with parents or children aboard, %d without.' % (parch_survived, no_parch_survived))

Survivors: 109 with parents or children aboard, 233 without.


**So the survival rate is 51% for passengers with parents or children aboard and 34% for those without.**

In [19]:
# Analyze the relationship between Fare and survival
train_data['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [20]:
# There are three cabin classes, so split fares into three bands as well
lower_df = train_data[train_data['Fare'] <= 170]
middle_df = train_data[(train_data['Fare'] > 170) & (train_data['Fare'] <= 340)]
upper_df = train_data[train_data['Fare'] > 340]
print('Before the disaster: low fare %d, middle fare %d, high fare %d.' % (lower_df['Survived'].count(), middle_df['Survived'].count(), upper_df['Survived'].count()))

Before the disaster: low fare 871, middle fare 17, high fare 3.


In [21]:
lower_survived = lower_df['Survived'].sum()
middle_survived = middle_df['Survived'].sum()
upper_survived = upper_df['Survived'].sum()
print('Survivors: low fare %d, middle fare %d, high fare %d.' % (lower_survived, middle_survived, upper_survived))

Survivors: low fare 328, middle fare 11, high fare 3.


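**So the survival rate is 38% for low fares, 65% for middle fares, and 100% for high fares.** The middle and high bands contain only 17 and 3 passengers, so those two rates rest on very small samples.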
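
The same three bands can be produced with `pd.cut`, whose right-closed bins match the `<= 170`, `<= 340`, and `> 340` filters used above. This is a sketch on made-up fares; the band labels and the names `toy` and `band_summary` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up fares, including values exactly on the 170 and 340 boundaries.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 170.0, 200.0, 340.0, 512.33],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# Right-closed bins: (-inf, 170], (170, 340], (340, inf).
toy["FareBand"] = pd.cut(
    toy["Fare"],
    bins=[-np.inf, 170, 340, np.inf],
    labels=["low", "middle", "high"],
)
band_summary = toy.groupby("FareBand", observed=True)["Survived"].agg(["count", "mean"])
print(band_summary)
```

If more evenly populated bands are wanted, `pd.qcut(toy["Fare"], q=3)` splits at quantiles instead of fixed thresholds; that would be a different analysis from the one in this notebook.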

In [22]:
# Cabin has too many missing values to analyze.
# Analyze the relationship between Embarked and survival
# Embarked has two missing values; fill them with the mode.
train_data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [23]:
train_data['Embarked'] = train_data['Embarked'].fillna('S')
train_data['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

In [24]:
train_data[train_data['Survived'] == 1]['Embarked'].value_counts()

S    219
C     93
Q     30
Name: Embarked, dtype: int64

**So the survival rate is 34% for port S, 55% for port C, and 39% for port Q.**

In [1]:
import pandas as pd
# Load the data
train_data = pd.read_csv('D:/DocumentFile/data/titanic/train.csv')
test_data = pd.read_csv('D:/DocumentFile/data/titanic/test.csv')
# Concatenate the training and test data so both go through the same preprocessing
full = pd.concat([train_data, test_data], axis=0, ignore_index=True)
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


In [2]:
full.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,1309.0,891.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,655.0,0.383838,2.294882,29.881138,0.498854,0.385027,33.295479
std,378.020061,0.486592,0.837836,14.413493,1.041658,0.86556,51.758668
min,1.0,0.0,1.0,0.17,0.0,0.0,0.0
25%,328.0,0.0,2.0,21.0,0.0,0.0,7.8958
50%,655.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,982.0,1.0,3.0,39.0,1.0,0.0,31.275
max,1309.0,1.0,3.0,80.0,8.0,9.0,512.3292


In [3]:
# Handle missing values
full['Age'] = full['Age'].fillna(full['Age'].mean())
full['Fare'] = full['Fare'].fillna(full['Fare'].mean())
full['Embarked'] = full['Embarked'].fillna('S')
# Cabin has too many missing values to impute, so mark them with U (Unknown)
full['Cabin'] = full['Cabin'].fillna('U')
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        1309 non-null   object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
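
The effect of these `fillna` calls can be checked on a tiny frame. The sketch below uses made-up rows, not the Kaggle data, and reads the mode from the data instead of hard-coding `'S'`; that substitution is an assumption of this example rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up rows with one gap per column type.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0],
    "Fare": [7.25, 71.28, np.nan],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
})

toy["Age"] = toy["Age"].fillna(toy["Age"].mean())       # numeric column: mean
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].mean())    # numeric column: mean
most_common_port = toy["Embarked"].mode()[0]            # categorical column: mode
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)
toy["Cabin"] = toy["Cabin"].fillna("U")                 # placeholder for unknown

# Every column should now report zero missing values.
print(toy.isna().sum())
```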


In [4]:
# Encode categorical features
# Map sex to a number
full['Sex'] = full['Sex'].map({'male': 1, 'female': 0})

In [5]:
# One-hot encode the port of embarkation
embarked_df = pd.get_dummies(full['Embarked'], prefix='Embarked')
full = pd.concat([full, embarked_df], axis=1)
full = full.drop('Embarked', axis=1)

In [6]:
# One-hot encode the cabin class
pclass_df = pd.get_dummies(full['Pclass'], prefix='Pclass')
full = pd.concat([full, pclass_df], axis=1)
full = full.drop('Pclass', axis=1)
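
For reference, `pd.get_dummies` can perform the encode, concatenate, and drop steps in a single call through its `columns` argument. This is a sketch on a made-up three-row frame; `toy` and `encoded` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": [1, 0, 0],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
})

# columns= encodes the listed columns, drops the originals, and uses each
# column name as the prefix. dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Embarked", "Pclass"], dtype=int)
print(encoded.columns.tolist())
```

The notebook's step-by-step version keeps `embarked_df` and `pclass_df` as separate frames, which it reuses later when assembling the feature matrix, so the one-call form is not a drop-in replacement here.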

In [7]:
# Extract the title from the name
def getTitle(name):
    str1 = name.split(',')[1]
    str2 = str1.split('.')[0]
    # strip() removes the given characters from both ends of a string (whitespace by default)
    str3 = str2.strip()
    return str3
title_df = pd.DataFrame()
title_df['Title'] = full['Name'].map(getTitle)
title_df.head()

,Title
0,Mr
1,Mrs
2,Miss
3,Mrs
4,Mr
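
`getTitle` relies on every name containing a comma followed by a period; a name without a comma would raise an `IndexError`. A regular expression can express the same rule with an explicit fallback. The sketch below is an alternative, not the notebook's method, and the `"Unknown"` fallback and the names `TITLE_PATTERN` and `extract_title` are assumptions made for illustration.

```python
import re

# Capture the text between the first comma and the next period.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name: str) -> str:
    match = TITLE_PATTERN.search(name)
    # Assumption of this sketch: names without a title yield "Unknown".
    return match.group(1).strip() if match else "Unknown"

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "No title here",  # made-up malformed name
]
titles = [extract_title(n) for n in names]
print(titles)
```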


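**Title categories and their meanings:**

Category | Meaning
--- | ---
Officer | Official or officer titles (Rev, Col, Major, Capt, and Don in the mapping used here)
Royalty | Royalty or nobility (Sir, Lady, Jonkheer, Dona, and the Countess in the mapping used here)
Mr | Adult man
Mrs | Married woman
Miss | Unmarried woman or girl
Master | Young boy (the mapping used here also places Dr in this group)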

In [8]:
title_df.value_counts()

Title       
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Ms                2
Major             2
Mlle              2
Sir               1
Capt              1
Mme               1
Lady              1
Jonkheer          1
Dona              1
Don               1
the Countess      1
dtype: int64

In [9]:
# Mapping from raw titles to title categories
title_map = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Rev': 'Officer',
    'Dr': 'Master',
    'Col': 'Officer',
    'Ms': 'Mrs',
    'Major': 'Officer',
    'Mlle': 'Miss',
    'Sir': 'Royalty',
    'Capt': 'Officer',
    'Mme': 'Mrs',
    'Lady': 'Royalty',
    'Jonkheer': 'Royalty',
    'Dona': 'Royalty',
    'Don': 'Officer',
    'the Countess': 'Royalty'
}
title_df['Title'] = title_df['Title'].map(title_map)
title_df = pd.get_dummies(title_df)
title_df.head()

,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Officer,Title_Royalty
0,0,0,1,0,0,0
1,0,0,0,1,0,0
2,0,1,0,0,0,0
3,0,0,0,1,0,0
4,0,0,1,0,0,0
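
`Series.map` with a dictionary returns `NaN` for any value that is not a key, and `get_dummies` then encodes such a row as all zeros without any warning. A quick check for unmapped titles guards against that. The mapping below is deliberately incomplete and is not the notebook's `title_map`; all names are illustrative.

```python
import pandas as pd

# Deliberately incomplete mapping, for illustration only.
partial_map = {"Mr": "Mr", "Mrs": "Mrs", "Capt": "Officer"}
raw_titles = pd.Series(["Mr", "Capt", "Dr", "Mrs"])

mapped = raw_titles.map(partial_map)

# Titles that had no entry in the mapping show up as NaN after map().
unmapped = sorted(raw_titles[mapped.isna()].unique())
print(unmapped)
```

An empty `unmapped` list means every raw title was covered by the mapping.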


In [10]:
full = pd.concat([full, title_df], axis=1)
full = full.drop('Name', axis=1)

In [11]:
# Build family size and family category
# Family size = SibSp + Parch + 1 (the 1 is the passenger)
# Family categories: small (size = 1), middle (size 2-4), large (size >= 5).
family_df = pd.DataFrame()
family_df['FamilySize'] = full['SibSp'] + full['Parch'] + 1
family_df['Family_Small'] = family_df['FamilySize'].map(lambda s:1 if s==1 else 0)
family_df['Family_Middle'] = family_df['FamilySize'].map(lambda s:1 if 2<=s<=4 else 0)
family_df['Family_Large'] = family_df['FamilySize'].map(lambda s:1 if s>=5 else 0)
family_df.head()

,FamilySize,Family_Small,Family_Middle,Family_Large
0,2,0,1,0
1,2,0,1,0
2,1,1,0,0
3,2,0,1,0
4,1,1,0,0
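
The three lambda-based indicator columns can also be derived by binning `FamilySize` once and one-hot encoding the result. This is a sketch on made-up `SibSp` and `Parch` values; `category` and `family_dummies` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 3, 0], "Parch": [0, 0, 2, 1]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1  # +1 counts the passenger

# Right-closed bins: (0, 1] small, (1, 4] middle, (4, inf) large.
category = pd.cut(
    toy["FamilySize"],
    bins=[0, 1, 4, np.inf],
    labels=["Small", "Middle", "Large"],
)
family_dummies = pd.get_dummies(category, prefix="Family", dtype=int)
print(pd.concat([toy["FamilySize"], family_dummies], axis=1))
```

Binning once guarantees that each passenger lands in exactly one category, which the three independent lambdas only ensure if their ranges are kept in sync by hand.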


In [12]:
full = pd.concat([full, family_df], axis=1)
full = full.drop(['SibSp', 'Parch'], axis=1)

In [13]:
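# Feature selection
# Correlation matrix
# numeric_only=True skips the text columns Ticket and Cabin (needed on pandas 2.x)
corr_df = full.corr(numeric_only=True)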
corr_df

,PassengerId,Survived,Sex,Age,Fare,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,...,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Officer,Title_Royalty,FamilySize,Family_Small,Family_Middle,Family_Large
PassengerId,1.0,-0.005007,0.013406,0.025731,0.031416,0.048101,0.011585,-0.049836,0.026495,0.022714,...,9e-06,-0.050027,0.014116,0.033299,-0.004508,0.025269,-0.031437,0.028546,0.002975,-0.063415
Survived,-0.005007,1.0,-0.543351,-0.070323,0.257307,0.16824,0.00365,-0.149683,0.285904,0.093349,...,0.082177,0.332795,-0.549199,0.344935,-0.052177,0.050561,0.016639,-0.203367,0.279855,-0.125147
Sex,0.013406,-0.543351,1.0,0.057397,-0.185484,-0.066564,-0.088651,0.115193,-0.107371,-0.028862,...,0.168245,-0.672819,0.870678,-0.571176,0.082707,-0.031555,-0.188583,0.284537,-0.255196,-0.077748
Age,0.025731,-0.070323,0.057397,1.0,0.171521,0.076179,-0.012718,-0.059153,0.362587,-0.014193,...,-0.317838,-0.254146,0.165476,0.198091,0.148409,0.055386,-0.196996,0.116675,-0.038189,-0.16121
Fare,0.031416,0.257307,-0.185484,0.171521,1.0,0.286241,-0.130054,-0.169894,0.599956,-0.121372,...,0.021493,0.092051,-0.192192,0.139235,0.012098,0.03004,0.226465,-0.274826,0.197281,0.170853
Embarked_C,0.048101,0.16824,-0.066564,0.076179,0.286241,1.0,-0.164166,-0.778262,0.325722,-0.134675,...,-0.010411,-0.014351,-0.065538,0.098379,0.012024,0.060256,-0.036553,-0.107874,0.159594,-0.092825
Embarked_Q,0.011585,0.00365,-0.088651,-0.012718,-0.130054,-0.164166,1.0,-0.491656,-0.166101,-0.121973,...,-0.005666,0.198804,-0.080224,-0.100374,-0.011996,-0.019941,-0.08719,0.127214,-0.122491,-0.018423
Embarked_S,-0.049836,-0.149683,0.115193,-0.059153,-0.169894,-0.778262,-0.491656,1.0,-0.1818,0.196532,...,0.012798,-0.113886,0.108924,-0.02295,-0.002978,-0.040498,0.087771,0.014246,-0.062909,0.093671
Pclass_1,0.026495,0.285904,-0.107371,0.362587,0.599956,0.325722,-0.166101,-0.1818,1.0,-0.296526,...,-0.047785,-0.011733,-0.099725,0.141102,0.065344,0.108189,-0.029656,-0.126551,0.165965,-0.067523
Pclass_2,0.022714,0.093349,-0.028862,-0.014193,-0.121372,-0.134675,-0.121973,0.196532,-0.296526,1.0,...,-0.013402,-0.02544,-0.038595,0.071103,0.078541,-0.032081,-0.039976,-0.035075,0.09727,-0.118495


In [15]:
# Correlation of each feature with Survived; ascending=False sorts in descending order
corr_df['Survived'].sort_values(ascending=False)

Survived         1.000000
Title_Mrs        0.344935
Title_Miss       0.332795
Pclass_1         0.285904
Family_Middle    0.279855
Fare             0.257307
Embarked_C       0.168240
Pclass_2         0.093349
Title_Master     0.082177
Title_Royalty    0.050561
FamilySize       0.016639
Embarked_Q       0.003650
PassengerId     -0.005007
Title_Officer   -0.052177
Age             -0.070323
Family_Large    -0.125147
Embarked_S      -0.149683
Family_Small    -0.203367
Pclass_3        -0.322308
Sex             -0.543351
Title_Mr        -0.549199
Name: Survived, dtype: float64
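
A signed sort places `Sex` and `Title_Mr` at the bottom of this list even though they have the largest magnitudes (about 0.54 and 0.55). Ranking by absolute value makes the strongest relationships easier to spot. The sketch below uses a made-up frame; `corr_with_target` and `ranked` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 1, 0, 1, 1, 0],
    "Sex": [1, 0, 1, 0, 0, 1],      # perfectly anti-correlated with Survived here
    "Noise": [1, 2, 3, 4, 5, 6],
})

corr_with_target = toy.corr()["Survived"].drop("Survived")

# Order by magnitude but keep the sign in the values that are displayed.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```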

In [16]:
full_F = pd.concat([title_df, pclass_df, family_df, full['Fare'], embarked_df, full['Age'], full['Sex']], axis=1)
full_F.head()

,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Officer,Title_Royalty,Pclass_1,Pclass_2,Pclass_3,FamilySize,Family_Small,Family_Middle,Family_Large,Fare,Embarked_C,Embarked_Q,Embarked_S,Age,Sex
0,0,0,1,0,0,0,0,0,1,2,0,1,0,7.25,0,0,1,22.0,1
1,0,0,0,1,0,0,1,0,0,2,0,1,0,71.2833,1,0,0,38.0,0
2,0,1,0,0,0,0,0,0,1,1,1,0,0,7.925,0,0,1,26.0,0
3,0,0,0,1,0,0,1,0,0,2,0,1,0,53.1,0,0,1,35.0,0
4,0,0,1,0,0,0,0,0,1,1,1,0,0,8.05,0,0,1,35.0,1


In [17]:
# Build the training and prediction datasets
# train_data has 891 rows
# Source (labeled) dataset
source_X = full_F.loc[0: 890, :]
source_y = full.loc[0: 890, 'Survived']
# Prediction dataset (no labels)
pred_X = full_F.loc[891:, :]
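
The boundaries `890` and `891` are hard-coded from the size of the training set, and `loc` slices include their end label. A sketch that derives the boundary from the data and slices by position is shown below, using made-up stand-ins for `train_data` and `test_data`; `n_train`, `labeled`, and `to_predict` are illustrative names.

```python
import pandas as pd

# Made-up stand-ins for train_data and test_data.
train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
test = pd.DataFrame({"PassengerId": [4, 5]})

full = pd.concat([train, test], axis=0, ignore_index=True)
n_train = len(train)

# iloc slices by position and excludes the end, unlike the label-based loc[0:890].
labeled = full.iloc[:n_train]
to_predict = full.iloc[n_train:]
print(labeled.shape, to_predict.shape)
```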

In [18]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(source_X, source_y, train_size=0.8, random_state=0)
# Print the dataset sizes
print('Source features:', source_X.shape, 'training features:', train_X.shape, 'test features:', test_X.shape)
print('Source labels:', source_y.shape, 'training labels:', train_y.shape, 'test labels:', test_y.shape)

Source features: (891, 19) training features: (712, 19) test features: (179, 19)
Source labels: (891,) training labels: (712,) test labels: (179,)
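
With roughly 38% survivors, a purely random split can leave the two parts with slightly different class balances. `train_test_split` accepts a `stratify` argument that preserves the label proportions in both parts. This is an optional variation rather than what the notebook does, shown on synthetic arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with 40% positives.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

# stratify=y keeps the share of positives the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)
print(y_train.mean(), y_test.mean())
```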


In [20]:
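# Choose a machine learning algorithm
# Titanic survival prediction is a binary classification problem: given a passenger's features, predict whether they survived or died.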
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=10000)
clf.fit(train_X, train_y)
# For a classifier, score returns the accuracy on the given data
clf.score(test_X, test_y)

0.8156424581005587
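
The score is the fraction of the 179 held-out rows that were classified correctly, so it depends on which rows ended up in that split. The sketch below, on synthetic data from `make_classification` rather than the Titanic features, shows that `score` equals the mean of correct predictions and how five-fold cross-validation gives an estimate averaged over several splits. All names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# score() is the fraction of test rows whose predicted label matches the true one.
manual_accuracy = np.mean(clf.predict(X_test) == y_test)
holdout_accuracy = clf.score(X_test, y_test)

# Five-fold cross-validation averages over five different hold-out sets.
cv_scores = cross_val_score(LogisticRegression(max_iter=10000), X, y, cv=5)
print(holdout_accuracy, cv_scores.mean())
```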

In [21]:
# Use the model to predict survival for the prediction dataset
pred_y = clf.predict(pred_X)
# predict returns floats here (Survived became float64 after the concat), so convert to int
pred_y = pred_y.astype(int)
passenger_id = full.loc[891:, 'PassengerId']
pred_df = pd.DataFrame({
    'PassengerId': passenger_id,
    'Survived': pred_y
})
pred_df.head()

,PassengerId,Survived
891,892,0
892,893,1
893,894,0
894,895,0
895,896,1


In [22]:
# Save the results
pred_df.to_csv('D:/DocumentFile/data/titanic/titanic_pred.csv', index=False)
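
Before uploading, it can help to confirm that the frame has the expected shape: one row per test passenger (418 here), exactly the columns `PassengerId` and `Survived`, and only 0/1 predictions. The helper below is a hypothetical sketch, not part of the notebook or of any Kaggle tooling, and it is exercised on made-up predictions.

```python
import pandas as pd

def check_submission(df: pd.DataFrame, expected_rows: int = 418) -> None:
    """Raise ValueError if df does not look like a valid submission frame."""
    if list(df.columns) != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(df)}")
    if not df["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 and 1")
    if df["PassengerId"].duplicated().any():
        raise ValueError("PassengerId values must be unique")

# Made-up predictions with the same shape as pred_df (IDs 892 to 1309).
demo = pd.DataFrame({"PassengerId": range(892, 1310), "Survived": [0, 1] * 209})
check_submission(demo)
print("submission frame looks valid")
```

In the notebook this would be called as `check_submission(pred_df)` just before `to_csv`.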