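# Project: Predicting Titanic Survival with Logistic Regression

## Analysis Objective

The goal of this report is to fit a logistic regression model of survival on attributes of Titanic passengers, such as sex and cabin class. The fitted model is then used to predict, from those attributes, whether passengers with unknown outcomes survived the sinking.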
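
For reference, the standard form of a logistic regression model (stated here as background, not taken from the notebook) expresses the survival probability for predictors $x_1, \dots, x_k$ as:

$$
P(\text{Survived} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
$$

Equivalently, the log-odds $\log\frac{P}{1-P} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ are linear in the predictors, which is why each fitted coefficient can be read as a change in log-odds.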

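## Introduction

> RMS Titanic was an Olympic-class ocean liner that sank after striking an iceberg on her maiden voyage in April 1912. She was the second of the three liners in her class and, together with her sister ships Olympic and Britannic, was built to carry White Star Line passengers across the Atlantic.

> Titanic was built by the Harland & Wolff shipyard in Belfast, Northern Ireland, and was the largest passenger ship of her time. Because she was comparable in size to a modern aircraft carrier, she was billed as a giant liner that "even God could not sink". On her maiden voyage she left Southampton, England, called at Cherbourg-Octeville in France and Queenstown in Ireland, and was bound for New York City across the Atlantic. Owing to human error, she struck an iceberg at 11:40 p.m. ship's time on 14 April 1912. Two hours and 40 minutes later, at 2:20 a.m. on 15 April, she broke in two and sank into the Atlantic. More than 1,500 people died, making it one of the worst maritime disasters of the 20th century and one of the best-known shipwrecks.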

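The dataset consists of two tables: `titanic_train.csv` and `titanic_test.csv`.

`titanic_train.csv` records whether each of more than 800 Titanic passengers survived the sinking, along with information about each passenger, including cabin class, sex, age, number of siblings/spouses aboard, number of parents/children aboard, and so on.

`titanic_test.csv` contains only passenger information (for passengers who are not in `titanic_train.csv`). It can be used to predict whether those passengers survived.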

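The columns of `titanic_train.csv` have the following meanings:
- PassengerId: passenger ID
- Survived: whether the passenger survived
   - 0: no
   - 1: yes
- Pclass: cabin class
   - 1: first class
   - 2: second class
   - 3: third class
- Name: passenger name
- Sex: sex
- Age: age
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Fare: fare paid
- Cabin: cabin number
- Embarked: port of embarkation
   - C: Cherbourg
   - Q: Queenstown
   - S: Southampton

`titanic_test.csv` has the same columns, except that it lacks the `Survived` variable, i.e. the survival outcome.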

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [24]:
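# Load the training data (path is relative to the notebook).
# NOTE: the name `list` shadows Python's built-in list type; later cells rely on it.
list = pd.read_csv("../datas_mechine/titanic_train.csv")
list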

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [25]:
list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [26]:
# Fill the 177 missing Age values (891 - 714) with the mean of the observed ages.
list['Age'] = list['Age'].fillna(list['Age'].mean())
list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
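
The cell above fills the missing `Age` entries with the column mean. The sketch below repeats that step on a small made-up Series so the effect is easy to check by hand. The values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values) with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# Series.mean() skips NaN by default, so only observed values are averaged.
observed_mean = ages.mean()

# Replace each missing entry with that mean.
ages_filled = ages.fillna(observed_mean)

print(observed_mean)
print(ages_filled.isna().sum())
```

Mean imputation leaves the column mean unchanged but reduces its variance, which is worth keeping in mind when reading the `Age` coefficient later on.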


In [27]:
# PassengerId is an identifier rather than a quantity, so store it as a string.
list['PassengerId'] = list['PassengerId'].astype('str')
# Treat the discrete variables as categorical.
list['Sex'] = list['Sex'].astype('category')
list['Survived'] = list['Survived'].astype('category')
list['Pclass'] = list['Pclass'].astype('category')
list['Embarked'] = list['Embarked'].astype('category')

In [28]:
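# NOTE: these calls return new Categorical objects that are never assigned, so `list` is unchanged.
# If assigned as written, labels that do not occur in the data ('no'/'yes', 'one'/'two'/'three')
# would turn the values into NaN, and 'C' (Cherbourg) is missing from the Embarked categories.
pd.Categorical(list['Sex'],categories=['male','female'])
pd.Categorical(list['Survived'],categories=['no','yes'])
pd.Categorical(list['Pclass'],categories=['one','two','three'])
pd.Categorical(list['Embarked'],categories=['S','Q','N'])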
list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    object  
 1   Survived     891 non-null    category
 2   Pclass       891 non-null    category
 3   Name         891 non-null    object  
 4   Sex          891 non-null    category
 5   Age          891 non-null    float64 
 6   SibSp        891 non-null    int64   
 7   Parch        891 non-null    int64   
 8   Ticket       891 non-null    object  
 9   Fare         891 non-null    float64 
 10  Cabin        204 non-null    object  
 11  Embarked     889 non-null    category
dtypes: category(4), float64(2), int64(2), object(4)
memory usage: 59.8+ KB
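
The `pd.Categorical(...)` calls in In [28] build new objects without assigning them, and labels such as `'no'`/`'yes'` do not match the stored values `0`/`1`. The sketch below uses made-up data to show what a mismatched category list does, and one possible way to attach readable labels with `rename_categories`. The label names are illustrative only.

```python
import pandas as pd

survived = pd.Series([0, 1, 1, 0])

# Categories that never occur in the data turn every value into NaN.
mismatched = pd.Categorical(survived, categories=['no', 'yes'])
print(mismatched.isna().all())

# Renaming maps the existing categories 0/1 to labels and keeps the data.
labeled = survived.astype('category').cat.rename_categories({0: 'no', 1: 'yes'})
print(labeled.tolist())
```

For the regression itself the numeric 0/1 coding is what matters, so relabeling is only useful for display or plotting.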


In [29]:
# Family size aboard: siblings/spouses plus parents/children.
list['family_num'] = list['SibSp'] + list['Parch']
list

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,family_num
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,,S,0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,30.0000,B42,S,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,,S,3
889,890,1,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,30.0000,C148,C,0


In [30]:
import statsmodels.api as sm

In [31]:
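# One-hot encode Pclass and Sex; drop_first=True drops the first level of each (Pclass 1, female) as the baseline.
list = pd.get_dummies(list,columns=["Pclass","Sex"],drop_first=True,dtype=int)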
list

Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,family_num,Pclass_2,Pclass_3,Sex_male
0,1,0,"Braund, Mr. Owen Harris",22.000000,1,0,A/5 21171,7.2500,,S,1,0,1,1
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.000000,1,0,PC 17599,71.2833,C85,C,1,0,0,0
2,3,1,"Heikkinen, Miss. Laina",26.000000,0,0,STON/O2. 3101282,7.9250,,S,0,0,1,0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.000000,1,0,113803,53.1000,C123,S,1,0,0,0
4,5,0,"Allen, Mr. William Henry",35.000000,0,0,373450,8.0500,,S,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,"Montvila, Rev. Juozas",27.000000,0,0,211536,13.0000,,S,0,1,0,1
887,888,1,"Graham, Miss. Margaret Edith",19.000000,0,0,112053,30.0000,B42,S,0,0,0,0
888,889,0,"Johnston, Miss. Catherine Helen ""Carrie""",29.699118,1,2,W./C. 6607,23.4500,,S,3,0,1,0
889,890,1,"Behr, Mr. Karl Howell",26.000000,0,0,111369,30.0000,C148,C,0,0,0,1
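
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a four-row toy frame (made-up rows, same column names as the notebook). The first level of each variable is dropped and becomes the reference category that the remaining dummy coefficients are compared against.

```python
import pandas as pd

toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 3],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Levels are sorted, so Pclass 1 and 'female' are the dropped baselines.
encoded = pd.get_dummies(toy, columns=['Pclass', 'Sex'], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A first-class woman is therefore the row with all three dummies equal to 0, and the model intercept refers to that group.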


In [34]:
# Drop identifier, free-text, and other columns that are not used as predictors.
list = list.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)
list

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,family_num,Pclass_2,Pclass_3,Sex_male
0,0,22.000000,1,0,7.2500,1,0,1,1
1,1,38.000000,1,0,71.2833,1,0,0,0
2,1,26.000000,0,0,7.9250,0,0,1,0
3,1,35.000000,1,0,53.1000,1,0,0,0
4,0,35.000000,0,0,8.0500,0,0,1,1
...,...,...,...,...,...,...,...,...,...
886,0,27.000000,0,0,13.0000,0,1,0,1
887,1,19.000000,0,0,30.0000,0,0,0,0
888,0,29.699118,1,2,23.4500,3,0,1,0
889,1,26.000000,0,0,30.0000,0,0,0,1


In [35]:
# Split into predictors (x) and response (y).
x = list.drop(['Survived'], axis=1)
y = list['Survived']
print(x)
print(y)

           Age  SibSp  Parch     Fare  family_num  Pclass_2  Pclass_3  \
0    22.000000      1      0   7.2500           1         0         1   
1    38.000000      1      0  71.2833           1         0         0   
2    26.000000      0      0   7.9250           0         0         1   
3    35.000000      1      0  53.1000           1         0         0   
4    35.000000      0      0   8.0500           0         0         1   
..         ...    ...    ...      ...         ...       ...       ...   
886  27.000000      0      0  13.0000           0         1         0   
887  19.000000      0      0  30.0000           0         0         0   
888  29.699118      1      2  23.4500           3         0         1   
889  26.000000      0      0  30.0000           0         0         0   
890  32.000000      0      0   7.7500           0         0         1   

     Sex_male  
0           1  
1           0  
2           0  
3           0  
4           1  
..        ...  
886        

In [37]:
# Flag predictor pairs whose absolute correlation exceeds 0.8 (possible multicollinearity).
x.corr().abs() > 0.8

Unnamed: 0,Age,SibSp,Parch,Fare,family_num,Pclass_2,Pclass_3,Sex_male
Age,True,False,False,False,False,False,False,False
SibSp,False,True,False,False,True,False,False,False
Parch,False,False,True,False,False,False,False,False
Fare,False,False,False,True,False,False,False,False
family_num,False,True,False,False,True,False,False,False
Pclass_2,False,False,False,False,False,True,False,False
Pclass_3,False,False,False,False,False,False,True,False
Sex_male,False,False,False,False,False,False,False,True


In [38]:
# SibSp and family_num are strongly correlated; keep family_num and drop its two components.
x = x.drop(['SibSp','Parch'], axis=1)
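
The boolean matrix in In [37] has to be scanned by eye. As an alternative, the sketch below lists only the offending pairs by name. The helper `high_corr_pairs` and the toy values are hypothetical; the 0.8 threshold is the one used in the notebook.

```python
from itertools import combinations

import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |r|) for each distinct column pair above the threshold."""
    corr = df.corr().abs()
    return [
        (a, b, float(corr.loc[a, b]))
        for a, b in combinations(corr.columns, 2)
        if corr.loc[a, b] > threshold
    ]


# Made-up data with the same construction as the notebook's family_num.
toy = pd.DataFrame({
    'SibSp': [1, 0, 3, 0, 4, 1],
    'Parch': [0, 0, 1, 0, 1, 2],
})
toy['family_num'] = toy['SibSp'] + toy['Parch']

print(high_corr_pairs(toy))
```

On this toy frame only the `SibSp`/`family_num` pair is reported, mirroring the pattern seen in the notebook output.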

In [39]:
# sm.Logit does not add an intercept automatically, so add a constant column.
x = sm.add_constant(x)
x

Unnamed: 0,const,Age,Fare,family_num,Pclass_2,Pclass_3,Sex_male
0,1.0,22.000000,7.2500,1,0,1,1
1,1.0,38.000000,71.2833,1,0,0,0
2,1.0,26.000000,7.9250,0,0,1,0
3,1.0,35.000000,53.1000,1,0,0,0
4,1.0,35.000000,8.0500,0,0,1,1
...,...,...,...,...,...,...,...
886,1.0,27.000000,13.0000,0,1,0,1
887,1.0,19.000000,30.0000,0,0,0,0
888,1.0,29.699118,23.4500,3,0,1,0
889,1.0,26.000000,30.0000,0,0,0,1


In [40]:
# Fit the logistic regression by maximum likelihood.
model = sm.Logit(y,x).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.443547
         Iterations 6


Dep. Variable:,Survived,No. Observations:,891.0
Model:,Logit,Df Residuals:,884.0
Method:,MLE,Df Model:,6.0
Date:,"Thu, 24 Apr 2025",Pseudo R-squ.:,0.3339
Time:,09:50:19,Log-Likelihood:,-395.2
converged:,True,LL-Null:,-593.33
Covariance Type:,nonrobust,LLR p-value:,1.786e-82

,coef,std err,z,P>|z|,[0.025,0.975]
const,3.8097,0.445,8.568,0.000,2.938,4.681
Age,-0.0388,0.008,-4.963,0.000,-0.054,-0.023
Fare,0.0032,0.002,1.311,0.190,-0.002,0.008
family_num,-0.2430,0.068,-3.594,0.000,-0.376,-0.110
Pclass_2,-1.0003,0.293,-3.416,0.001,-1.574,-0.426
Pclass_3,-2.1324,0.289,-7.373,0.000,-2.699,-1.566
Sex_male,-2.7759,0.199,-13.980,0.000,-3.165,-2.387


In [41]:
# Fare is not statistically significant in the model above (p = 0.190), so drop it.
x = x.drop(['Fare'], axis = 1)
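
`Fare` is the only predictor in the first summary with a large p-value (0.190). The notebook does not state a cut-off, so the conventional 0.05 level is an assumption here. The sketch below applies that rule to the p-values copied from the summary table. With a fitted statsmodels result the same numbers are available as `model.pvalues`.

```python
import pandas as pd

# P>|z| values copied from the first model summary.
pvalues = pd.Series({
    'const': 0.000, 'Age': 0.000, 'Fare': 0.190, 'family_num': 0.000,
    'Pclass_2': 0.001, 'Pclass_3': 0.000, 'Sex_male': 0.000,
})

alpha = 0.05  # assumed significance level; not stated in the notebook

# The intercept is excluded from the check.
candidates = pvalues.drop('const')
to_drop = candidates[candidates > alpha].index.tolist()

print(to_drop)
```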

In [42]:
# Refit the model without Fare.
model = sm.Logit(y,x).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.444623
         Iterations 6


Dep. Variable:,Survived,No. Observations:,891.0
Model:,Logit,Df Residuals:,885.0
Method:,MLE,Df Model:,5.0
Date:,"Thu, 24 Apr 2025",Pseudo R-squ.:,0.3323
Time:,09:50:52,Log-Likelihood:,-396.16
converged:,True,LL-Null:,-593.33
Covariance Type:,nonrobust,LLR p-value:,4.927e-83

,coef,std err,z,P>|z|,[0.025,0.975]
const,4.0620,0.404,10.049,0.000,3.270,4.854
Age,-0.0395,0.008,-5.065,0.000,-0.055,-0.024
family_num,-0.2186,0.065,-3.383,0.001,-0.345,-0.092
Pclass_2,-1.1798,0.261,-4.518,0.000,-1.692,-0.668
Pclass_3,-2.3458,0.242,-9.676,0.000,-2.821,-1.871
Sex_male,-2.7854,0.198,-14.069,0.000,-3.173,-2.397
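
Logit coefficients are on the log-odds scale, so they are easier to interpret after exponentiation. The sketch below converts the coefficients from the second summary into odds ratios. The numbers are copied from the table above and nothing is refitted.

```python
import math

# Coefficients copied from the second model summary.
coefs = {
    'Age': -0.0395,
    'family_num': -0.2186,
    'Pclass_2': -1.1798,
    'Pclass_3': -2.3458,
    'Sex_male': -2.7854,
}

# exp(coef) is the multiplicative change in the odds of survival for a
# one-unit increase in the predictor, with the other predictors held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

For example, the `Sex_male` ratio of roughly 0.06 means the estimated odds of survival for a man are about 6% of those for a woman of the same age, class, and family size, and each additional year of age multiplies the odds by about 0.96.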


In [65]:
# Load the test data, which has no Survived column.
test_list = pd.read_csv("../datas_mechine/titanic_test.csv")
test_list

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [66]:
test_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


In [67]:
# Fill missing ages with the mean of the observed ages in the test set.
test_list['Age'] = test_list['Age'].fillna(test_list['Age'].mean())
test_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB
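
In [67] fills missing test ages with the test set's own mean (about 30.27), while the training data was filled with the training mean (about 29.70). A common alternative is to reuse the statistic learned from the training data, so that both sets are preprocessed identically. The sketch below illustrates this on made-up frames; `train_df` and `test_df` are hypothetical names, not objects from the notebook.

```python
import numpy as np
import pandas as pd

train_df = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
test_df = pd.DataFrame({'Age': [50.0, np.nan]})

# Learn the fill value from the training data only.
train_age_mean = train_df['Age'].mean()

# Apply the same value to both sets.
train_df['Age'] = train_df['Age'].fillna(train_age_mean)
test_df['Age'] = test_df['Age'].fillna(train_age_mean)

print(test_df['Age'].tolist())
```

With means this close the predictions barely change, so this is a matter of consistency rather than a correction to the results.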


In [68]:
test_list['PassengerId'] = test_list['PassengerId'].astype('str')
test_list['Sex'] = test_list['Sex'].astype('category')
test_list['Pclass'] = test_list['Pclass'].astype('category')
test_list['Embarked'] = test_list['Embarked'].astype('category')


In [69]:
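# NOTE: as with the training data, these results are not assigned, so `test_list` is unchanged.
# Values outside the listed categories (for example 'C' in Embarked) appear as NaN
# in the displayed result of the last call.
pd.Categorical(test_list['Sex'], categories=['male', 'female'])
pd.Categorical(test_list['Pclass'], categories=['one', 'two', 'three'])
pd.Categorical(test_list['Embarked'], categories=['S', 'Q', 'N'])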

['Q', 'S', 'Q', 'S', 'S', ..., 'S', NaN, 'S', 'S', NaN]
Length: 418
Categories (3, object): ['S', 'Q', 'N']

In [70]:
test_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  418 non-null    object  
 1   Pclass       418 non-null    category
 2   Name         418 non-null    object  
 3   Sex          418 non-null    category
 4   Age          418 non-null    float64 
 5   SibSp        418 non-null    int64   
 6   Parch        418 non-null    int64   
 7   Ticket       418 non-null    object  
 8   Fare         417 non-null    float64 
 9   Cabin        91 non-null     object  
 10  Embarked     418 non-null    category
dtypes: category(3), float64(2), int64(2), object(4)
memory usage: 27.9+ KB


In [71]:
# Apply the same one-hot encoding as for the training data.
test_list = pd.get_dummies(test_list, drop_first=True, columns=['Pclass', 'Sex'], dtype=int)
test_list

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Pclass_2,Pclass_3,Sex_male
0,892,"Kelly, Mr. James",34.50000,0,0,330911,7.8292,,Q,0,1,1
1,893,"Wilkes, Mrs. James (Ellen Needs)",47.00000,1,0,363272,7.0000,,S,0,1,0
2,894,"Myles, Mr. Thomas Francis",62.00000,0,0,240276,9.6875,,Q,1,0,1
3,895,"Wirz, Mr. Albert",27.00000,0,0,315154,8.6625,,S,0,1,1
4,896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.00000,1,1,3101298,12.2875,,S,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,"Spector, Mr. Woolf",30.27259,0,0,A.5. 3236,8.0500,,S,0,1,1
414,1306,"Oliva y Ocana, Dona. Fermina",39.00000,0,0,PC 17758,108.9000,C105,C,0,0,0
415,1307,"Saether, Mr. Simon Sivertsen",38.50000,0,0,SOTON/O.Q. 3101262,7.2500,,S,0,1,1
416,1308,"Ware, Mr. Frederick",30.27259,0,0,359309,8.0500,,S,0,1,1
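
`pd.get_dummies` only creates columns for the categories present in the frame it receives, so a test set lacking a category would yield fewer columns than the training matrix. That does not happen here, but a defensive sketch is to reindex the encoded test frame to the training columns. The toy frame and the name `train_columns` below are illustrative.

```python
import pandas as pd

# Dummy columns the model was trained on (same names as in the notebook).
train_columns = ['Pclass_2', 'Pclass_3', 'Sex_male']

# Hypothetical test data with no second-class passengers.
toy_test = pd.DataFrame({'Pclass': [1, 3, 3], 'Sex': ['male', 'female', 'male']})
encoded = pd.get_dummies(toy_test, columns=['Pclass', 'Sex'], drop_first=True, dtype=int)
print(encoded.columns.tolist())  # Pclass_2 is absent

# Add any missing dummy column as all zeros and fix the column order.
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned.columns.tolist())
```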


In [72]:
model.params

const         4.061982
Age          -0.039495
family_num   -0.218627
Pclass_2     -1.179763
Pclass_3     -2.345823
Sex_male     -2.785398
dtype: float64

In [73]:
# Build the same family-size feature as in the training data.
test_list['family_num'] = test_list['SibSp'] + test_list['Parch']
test_list

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Pclass_2,Pclass_3,Sex_male,family_num
0,892,"Kelly, Mr. James",34.50000,0,0,330911,7.8292,,Q,0,1,1,0
1,893,"Wilkes, Mrs. James (Ellen Needs)",47.00000,1,0,363272,7.0000,,S,0,1,0,1
2,894,"Myles, Mr. Thomas Francis",62.00000,0,0,240276,9.6875,,Q,1,0,1,0
3,895,"Wirz, Mr. Albert",27.00000,0,0,315154,8.6625,,S,0,1,1,0
4,896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.00000,1,1,3101298,12.2875,,S,0,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,"Spector, Mr. Woolf",30.27259,0,0,A.5. 3236,8.0500,,S,0,1,1,0
414,1306,"Oliva y Ocana, Dona. Fermina",39.00000,0,0,PC 17758,108.9000,C105,C,0,0,0,0
415,1307,"Saether, Mr. Simon Sivertsen",38.50000,0,0,SOTON/O.Q. 3101262,7.2500,,S,0,1,1,0
416,1308,"Ware, Mr. Frederick",30.27259,0,0,359309,8.0500,,S,0,1,1,0


In [74]:
# Select the predictors in the same order as model.params, then add the intercept column.
x_test = test_list[['Age', 'family_num', 'Pclass_2', 'Pclass_3', 'Sex_male']]
x_test = sm.add_constant(x_test)
x_test

Unnamed: 0,const,Age,family_num,Pclass_2,Pclass_3,Sex_male
0,1.0,34.50000,0,0,1,1
1,1.0,47.00000,1,0,1,0
2,1.0,62.00000,0,1,0,1
3,1.0,27.00000,0,0,1,1
4,1.0,22.00000,2,0,1,0
...,...,...,...,...,...,...
413,1.0,30.27259,0,0,1,1
414,1.0,39.00000,0,0,0,0
415,1.0,38.50000,0,0,1,1
416,1.0,30.27259,0,0,1,1


In [75]:
# Predicted probability of survival for each test passenger.
predict_value = model.predict(x_test)
predict_value

0      0.080778
1      0.411265
2      0.086917
3      0.105684
4      0.601091
         ...   
413    0.094075
414    0.925647
415    0.069798
416    0.094075
417    0.062849
Length: 418, dtype: float64
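
As a sanity check, the first predicted probability can be reproduced by hand from the printed coefficients and the first row of `x_test`. The sketch below applies the logistic function to the linear predictor using only the rounded values shown in the outputs, so small rounding differences are expected.

```python
import math

# Coefficients as printed by model.params.
params = {
    'const': 4.061982, 'Age': -0.039495, 'family_num': -0.218627,
    'Pclass_2': -1.179763, 'Pclass_3': -2.345823, 'Sex_male': -2.785398,
}

# First row of x_test: a 34.5-year-old man in third class with no family aboard.
row = {'const': 1.0, 'Age': 34.5, 'family_num': 0, 'Pclass_2': 0, 'Pclass_3': 1, 'Sex_male': 1}

# Linear predictor (log-odds), then the logistic function.
log_odds = sum(params[name] * row[name] for name in params)
probability = 1.0 / (1.0 + math.exp(-log_odds))

print(round(probability, 4))
```

The result agrees with the first entry of `predict_value` (0.080778) to about four decimal places.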

In [76]:
# Number of passengers predicted to survive at a 0.5 probability threshold.
predict_value[predict_value > 0.5].count()

np.int64(158)

In [77]:
predict_value.count()

np.int64(418)

In [78]:
predict_value

0      0.080778
1      0.411265
2      0.086917
3      0.105684
4      0.601091
         ...   
413    0.094075
414    0.925647
415    0.069798
416    0.094075
417    0.062849
Length: 418, dtype: float64
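
The probabilities still need to be turned into a yes/no prediction. A minimal sketch, assuming the same 0.5 cut-off used in In [76], is shown below on the first five probabilities from the output. Applying the same expression to the full `predict_value` Series would label all 418 passengers.

```python
import pandas as pd

# First five predicted probabilities from the output.
probs = pd.Series([0.080778, 0.411265, 0.086917, 0.105684, 0.601091])

threshold = 0.5  # the cut-off used in In [76]

# 1 = predicted to survive, 0 = predicted not to survive.
labels = (probs > threshold).astype(int)

print(labels.tolist())
```

With this threshold the notebook counts 158 of 418 test passengers, about 38%, as predicted survivors.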