# 项目：用逻辑回归预测泰坦尼克号幸存情况

## 分析目标

此数据分析报告的目的是，基于泰坦尼克号乘客的性别和船舱等级等属性，对幸存情况进行逻辑回归分析，从而能利用得到的模型，对未知幸存情况的乘客，根据属性预测是否从沉船事件中幸存。

## 简介

> 泰坦尼克号（英语：RMS Titanic）是一艘奥林匹克级邮轮，于1912年4月首航时撞上冰山后沉没。泰坦尼克号是同级的3艘超级邮轮中的第2艘，与姐妹船奥林匹克号和不列颠号为白星航运公司的乘客们提供大西洋旅行。

> 泰坦尼克号由位于北爱尔兰贝尔法斯特的哈兰·沃尔夫船厂兴建，是当时最大的客运轮船，由于其规模相当一艘现代航空母舰，因而号称“上帝也沉没不了的巨型邮轮”。在泰坦尼克号的首航中，从英国南安普敦出发，途经法国瑟堡-奥克特维尔以及爱尔兰昆士敦，计划横渡大西洋前往美国纽约市。但因为人为错误，于1912年4月14日船上时间夜里11点40分撞上冰山；2小时40分钟后，即4月15日凌晨02点20分，船裂成两半后沉入大西洋，死亡人数超越1500人，堪称20世纪最大的海难事件，同时也是最广为人知的海难之一。

数据集包括两个数据表：`titianic_train.csv`和`titanic_test.csv`。

`titianic_train.csv`记录了超过八百位泰坦尼克号乘客在沉船事件后的幸存情况，以及乘客的相关信息，包括所在船舱等级、性别、年龄、同乘伴侣/同胞数量、同乘父母/孩子数量，等等。

`titanic_test.csv`只包含乘客（这些乘客不在`titianic_train.csv`里）相关信息，此文件可以被用于预测乘客是否幸存。

`titianic_train.csv`每列的含义如下：
- PassengerId：乘客ID
- survival：是否幸存
   - 0	否
   - 1	是
- pclass：船舱等级
   - 1	一等舱
   - 2	二等舱
   - 3  三等舱
- sex：性别
- Age：年龄
- sibsp：同乘伴侣/同胞数量
- parch：同乘父母/孩子数量
- ticket：船票号
- fare：票价金额
- cabin：船舱号
- embarked：登船港口
   - C  瑟堡
   - Q  皇后镇
   - S  南安普敦
   
   
`titianic_test.csv`每列的含义和上面相同，但不具备survival变量的数据，即是否幸存。

## 引入数据 ##

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

我们计划先利用`titanic_train.csv`训练预测模型，因此读取数据方面，当前只需要导入`titanic_train.csv`。

通过Pandas的`read_csv`函数，将原始数据文件`titanic_train.csv`里的数据内容

In [2]:
or_titianic_train=pd.read_csv("titanic_train.csv")
or_titianic_test=pd.read_csv("titanic_test.csv")
titianic_train=or_titianic_train.copy()
titianic_test=or_titianic_test.copy()

In [3]:
titianic_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [4]:
titianic_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


## 数据评估、清洗和整理 ##

在这一部分中，我们将对在上一部分建立的DataFrame所包含的数据进行评估和清理。

主要从两个方面进行：结构和内容，即整齐度和干净度。

数据的结构性问题指不符合“每个变量为一列，每个观察值为一行，每种类型的观察单位为一个表格”这三个标准；数据的内容性问题包括存在丢失数据、重复数据、无效数据等。

1，首先先看结构性问题

根据上面的结果显示，两个列表都没有结构性问题，不予处理。

2，然后看内容性问题

2.1 空缺值问题

In [5]:
titianic_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


可以看到年龄和船舱号有空缺值。
看一下船舱号是什么数值类

In [6]:
titianic_train['Cabin'].value_counts()

Cabin
B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: count, Length: 147, dtype: int64

已经有了船舱登记，船舱号不影响分析，因此不予理会。缺失的年龄可能对分析有影响，要去掉。

In [7]:
titianic_train=titianic_train.dropna(subset=["Age"])
titianic_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


接下来看titianic_test

In [8]:
titianic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


同样是cabin和age有问题，把age有空缺的值去掉

In [9]:
titianic_test=titianic_test.dropna(subset=["Age"])
titianic_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
409,1301,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.7750,,S
411,1303,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q
412,1304,3,"Henriksson, Miss. Jenny Lovisa",female,28.0,0,0,347086,7.7750,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C


2.2 重复值问题

Passengerld作为乘客ID，只要它不重复就好

In [10]:
titianic_train['PassengerId'].duplicated().sum()

0

In [11]:
titianic_test['PassengerId'].duplicated().sum()

0

两个列表都没有重复的乘客id，不予处理。

2.3 不一致数值问题

查看分类变量中的不一致数值。titianic_train： Survived	Pclass	Sex	Embarked；  titianic_test：Pclass  Embarked

In [12]:
titianic_train['Survived'].value_counts()

Survived
0    424
1    290
Name: count, dtype: int64

In [13]:
titianic_train['Pclass'].value_counts()

Pclass
3    355
1    186
2    173
Name: count, dtype: int64

In [14]:
titianic_train['Sex'].value_counts()

Sex
male      453
female    261
Name: count, dtype: int64

In [15]:
titianic_train['Embarked'].value_counts()

Embarked
S    554
C    130
Q     28
Name: count, dtype: int64

In [16]:
titianic_test['Pclass'].value_counts()

Pclass
3    146
1     98
2     88
Name: count, dtype: int64

In [17]:
titianic_test['Embarked'].value_counts()

Embarked
S    228
C     82
Q     22
Name: count, dtype: int64

没有不一致数值问题，不予处理。

2.4 无效数值问题

In [18]:
titianic_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,714.0,714.0,714.0,714.0,714.0,714.0,714.0
mean,448.582633,0.406162,2.236695,29.699118,0.512605,0.431373,34.694514
std,259.119524,0.49146,0.83825,14.526497,0.929783,0.853289,52.91893
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,222.25,0.0,1.0,20.125,0.0,0.0,8.05
50%,445.0,0.0,2.0,28.0,0.0,0.0,15.7417
75%,677.75,1.0,3.0,38.0,1.0,1.0,33.375
max,891.0,1.0,3.0,80.0,5.0,6.0,512.3292


In [19]:
titianic_test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,332.0,332.0,332.0,332.0,332.0,331.0
mean,1100.063253,2.144578,30.27259,0.481928,0.39759,40.982087
std,122.763173,0.846283,14.181209,0.874084,0.810651,61.228558
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,992.75,1.0,21.0,0.0,0.0,8.05
50%,1099.5,2.0,27.0,0.0,0.0,16.0
75%,1210.25,3.0,39.0,1.0,1.0,40.63335
max,1307.0,3.0,76.0,8.0,6.0,512.3292


2.5 数值类型问题

PassengerId	,Pclass	,Survived 是分类数值，应该改成category

In [20]:

titianic_train['Pclass']=titianic_train['Pclass'].astype("category")
titianic_train['Survived']=titianic_train['Survived'].astype("category")
titianic_train['Sex']=titianic_train['Sex'].astype("category")
titianic_train['Embarked']=titianic_train['Embarked'].astype("category")

titianic_test['Pclass']=titianic_test['Pclass'].astype("category")

titianic_test['Sex']=titianic_test['Sex'].astype("category")
titianic_test['Embarked']=titianic_test['Embarked'].astype("category")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  titianic_train['Pclass']=titianic_train['Pclass'].astype("category")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  titianic_train['Survived']=titianic_train['Survived'].astype("category")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  titianic_train['Sex']=titianic_train['Sex'].astype("category")


In [21]:
titianic_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  714 non-null    int64   
 1   Survived     714 non-null    category
 2   Pclass       714 non-null    category
 3   Name         714 non-null    object  
 4   Sex          714 non-null    category
 5   Age          714 non-null    float64 
 6   SibSp        714 non-null    int64   
 7   Parch        714 non-null    int64   
 8   Ticket       714 non-null    object  
 9   Fare         714 non-null    float64 
 10  Cabin        185 non-null    object  
 11  Embarked     712 non-null    category
dtypes: category(4), float64(2), int64(3), object(3)
memory usage: 53.5+ KB


成功转化数值

数据评估和清理完成，接下来是数据分析。titianic_train用于建立模型，titianic_test用来预测。

In [22]:
#可以drop掉name,ticket,cabin,因为无关紧要
titianic_clean=titianic_train.drop(["Name","Ticket","Cabin","PassengerId"],axis=1)

## 数据分析 ##

在分析步骤中，我们将利用以上清理后到的数据，进行逻辑回归分析，目标是得到一个可以根据泰坦尼克号乘客各个属性，对沉船事件后幸存情况进行预测的数学模型。



In [23]:
##首先把分类变量设置成虚拟变量
titianic_clean=pd.get_dummies(titianic_clean,drop_first=True,columns=["Survived","Pclass","Sex","Embarked"],dtype=int)

In [24]:
#查看变量之间的相关系数
titianic_clean.corr().abs()>0.8

Unnamed: 0,Age,SibSp,Parch,Fare,Survived_1,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
Age,True,False,False,False,False,False,False,False,False,False
SibSp,False,True,False,False,False,False,False,False,False,False
Parch,False,False,True,False,False,False,False,False,False,False
Fare,False,False,False,True,False,False,False,False,False,False
Survived_1,False,False,False,False,True,False,False,False,False,False
Pclass_2,False,False,False,False,False,True,False,False,False,False
Pclass_3,False,False,False,False,False,False,True,False,False,False
Sex_male,False,False,False,False,False,False,False,True,False,False
Embarked_Q,False,False,False,False,False,False,False,False,True,False
Embarked_S,False,False,False,False,False,False,False,False,False,True


没有绝对值高过0.8的系数，不予处理。

In [25]:
titianic_clean

Unnamed: 0,Age,SibSp,Parch,Fare,Survived_1,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,22.0,1,0,7.2500,0,0,1,1,0,1
1,38.0,1,0,71.2833,1,0,0,0,0,0
2,26.0,0,0,7.9250,1,0,1,0,0,1
3,35.0,1,0,53.1000,1,0,0,0,0,1
4,35.0,0,0,8.0500,0,0,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...
885,39.0,0,5,29.1250,0,0,1,0,1,0
886,27.0,0,0,13.0000,0,1,0,1,0,1
887,19.0,0,0,30.0000,1,0,0,0,0,1
889,26.0,0,0,30.0000,1,0,0,1,0,0


In [26]:
#把自变量和因变量划分出来
y=titianic_clean['Survived_1']
X=titianic_clean.drop('Survived_1',axis=1)

In [27]:
#输入截距
X=sm.add_constant(X)

In [28]:
#建立逻辑回归模型，fit方法拟合
model=sm.Logit(y,X).fit()
result=sm.Logit(y,X).fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.443090
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.443090
         Iterations 6


0,1,2,3
Dep. Variable:,Survived_1,No. Observations:,714.0
Model:,Logit,Df Residuals:,704.0
Method:,MLE,Df Model:,9.0
Date:,"Fri, 26 Jan 2024",Pseudo R-squ.:,0.344
Time:,10:53:18,Log-Likelihood:,-316.37
converged:,True,LL-Null:,-482.26
Covariance Type:,nonrobust,LLR p-value:,4.647e-66

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,4.4451,0.536,8.296,0.000,3.395,5.495
Age,-0.0432,0.008,-5.192,0.000,-0.059,-0.027
SibSp,-0.3636,0.129,-2.809,0.005,-0.617,-0.110
Parch,-0.0618,0.124,-0.498,0.618,-0.305,0.181
Fare,0.0015,0.003,0.559,0.576,-0.004,0.007
Pclass_2,-1.1931,0.329,-3.623,0.000,-1.839,-0.548
Pclass_3,-2.3985,0.344,-6.982,0.000,-3.072,-1.725
Sex_male,-2.6452,0.223,-11.874,0.000,-3.082,-2.209
Embarked_Q,-0.8338,0.600,-1.389,0.165,-2.010,0.343


从图上可知道：Parch，Fare,Embarked_Q,Embarked_S，超过显著性水平，需要去除掉，进行新一轮建模。

In [29]:
#把自变量和因变量划分出来
y=titianic_clean['Survived_1']
X=titianic_clean.drop(['Survived_1','Parch','Fare','Embarked_Q','Embarked_S'],axis=1)

In [30]:
#输入截距
X=sm.add_constant(X)

In [31]:
#建立逻辑回归模型，fit方法拟合
model=sm.Logit(y,X).fit()
result=sm.Logit(y,X).fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.445774
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.445774
         Iterations 6


0,1,2,3
Dep. Variable:,Survived_1,No. Observations:,714.0
Model:,Logit,Df Residuals:,708.0
Method:,MLE,Df Model:,5.0
Date:,"Fri, 26 Jan 2024",Pseudo R-squ.:,0.34
Time:,10:53:18,Log-Likelihood:,-318.28
converged:,True,LL-Null:,-482.26
Covariance Type:,nonrobust,LLR p-value:,9.745000000000001e-69

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,4.3342,0.451,9.617,0.000,3.451,5.218
Age,-0.0448,0.008,-5.442,0.000,-0.061,-0.029
SibSp,-0.3802,0.122,-3.129,0.002,-0.618,-0.142
Pclass_2,-1.4144,0.285,-4.967,0.000,-1.972,-0.856
Pclass_3,-2.6526,0.286,-9.280,0.000,-3.213,-2.092
Sex_male,-2.6277,0.215,-12.235,0.000,-3.049,-2.207


## 解读数据 ##

In [32]:
#age
np.exp(-0.0448)

0.9561887004506909

说明每多一岁，生还概率下降4%

In [33]:
#SibSp
np.exp(-0.3802)

0.6837246506068299

说明每多一个同乘伴侣，生还概率下降32%

In [34]:
#Pclass_2
np.exp(-1.4144)

0.24307141255216874

比起Pclass1，生还概率下降76%

In [35]:
#Pclass_3
np.exp(-2.6526)

0.07046775850074612

比起Pclass1，生还概率下降93%

In [36]:
#Sex_male
np.exp(-2.6277)

0.07224443349569508

比起female，生还概率下降93%

## 预测数据 ##

首先要处理test中的自变量，使其和model里面完全一样

In [37]:
titianic_test=titianic_test[["Age","SibSp","Pclass","Sex"]]

In [38]:
titianic_test
result_to_predict=titianic_test.copy()

In [39]:
#把分类变量的类型转换为Category，并且通过categories参数，让程序知道所有可能的分类值。
result_to_predict["Pclass"] = pd.Categorical(result_to_predict['Pclass'], categories=['1','2','3'])
result_to_predict["Sex"] = pd.Categorical(result_to_predict['Sex'], categories=['female','male'])

In [40]:
#把分类变量转成虚拟变量
result_to_predict=pd.get_dummies(result_to_predict,drop_first=True,columns=["Pclass","Sex"],dtype=int)
result_to_predict = sm.add_constant(result_to_predict)

我们获得了逻辑回归模型预测的titanic_test.csv里，泰坦尼克号乘客的幸存概率。我们可以把概率大于等于0.5的预测为幸存，小于0.5的预测为遇难，输出一下这个最终的预测结果。

In [41]:
#看一下哪些人会遇难，那些人不会
rate=result.predict(result_to_predict)

In [42]:
rate

0      0.540489
1      0.864163
2      0.255673
3      0.621992
4      0.951168
         ...   
409    0.978537
411    0.908703
412    0.956099
414    0.930122
415    0.495818
Length: 332, dtype: float64

In [43]:
rate[rate>0.5].count()

259

KeyError: 'PassengerId'