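This report runs a logistic regression on the survival of Titanic passengers, using attributes such as sex and cabin class. The fitted model can then predict, from those attributes, whether a passenger whose outcome is unknown survived the sinking.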


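The dataset consists of two tables: titanic_train.csv and titanic_test.csv  
titanic_train.csv records whether each passenger survived the sinking (712 passengers in the copy loaded below), together with passenger information such as cabin class, sex, age, number of spouses/siblings aboard and number of parents/children aboard.  
titanic_test.csv contains only passenger information, for passengers who do not appear in titanic_train.csv; this file can be used to predict whether those passengers survived.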

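The columns of titanic_train.csv are:  
-- PassengerId: passenger ID  
-- Survived: whether the passenger survived  
            0 no  
            1 yes  
-- Pclass: cabin class  
            1 first class  
            2 second class  
            3 third class  
-- Name: passenger name  
-- Sex: sex  
-- Age: age  
-- SibSp: number of spouses/siblings aboard  
-- Parch: number of parents/children aboard  
-- Ticket: ticket number  
-- Fare: fare paid  
-- Cabin: cabin number  
-- Embarked: port of embarkation  
            C Cherbourg  
            Q Queenstown  
            S Southampton  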

The columns of titanic_test.csv have the same meaning, except that the Survived column (whether the passenger survived) is absent.

# Load the data

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
original_data = pd.read_csv("train.csv")
original_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,493,0,1,"Molson, Mr. Harry Markland",male,55.0,0,0,113787,30.5,C30,S
1,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
2,388,1,2,"Buss, Miss. Kate",female,36.0,0,0,27849,13.0,,S
3,192,0,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S
4,687,0,3,"Panula, Mr. Jaako Arnold",male,14.0,4,1,3101295,39.6875,,S


# Clean the data: structure and content

In [10]:
cleaned_data =  original_data.copy()
cleaned_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,493,0,1,"Molson, Mr. Harry Markland",male,55.0,0,0,113787,30.5,C30,S
1,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
2,388,1,2,"Buss, Miss. Kate",female,36.0,0,0,27849,13.0,,S
3,192,0,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S
4,687,0,3,"Panula, Mr. Jaako Arnold",male,14.0,4,1,3101295,39.6875,,S


## Structure

In [12]:
cleaned_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,493,0,1,"Molson, Mr. Harry Markland",male,55.0,0,0,113787,30.5,C30,S
1,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
2,388,1,2,"Buss, Miss. Kate",female,36.0,0,0,27849,13.0,,S
3,192,0,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S
4,687,0,3,"Panula, Mr. Jaako Arnold",male,14.0,4,1,3101295,39.6875,,S
5,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S
6,228,0,3,"Lovell, Mr. John Hall (""Henry"")",male,20.5,0,0,A/5 21173,7.25,,S
7,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
8,168,0,3,"Skoog, Mrs. William (Anna Bernhardina Karlsson)",female,45.0,1,4,347088,27.9,,S
9,752,1,3,"Moor, Master. Meier",male,6.0,0,1,392096,12.475,E121,S


## Content

## Missing values and data types

In [13]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Name         712 non-null    object 
 4   Sex          712 non-null    object 
 5   Age          566 non-null    float64
 6   SibSp        712 non-null    int64  
 7   Parch        712 non-null    int64  
 8   Ticket       712 non-null    object 
 9   Fare         712 non-null    float64
 10  Cabin        168 non-null    object 
 11  Embarked     710 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 66.9+ KB


Age, Cabin and Embarked have missing values.  
PassengerId should be a string.  
The categorical variables Survived, Pclass, Sex and Embarked should be converted to the category dtype.

In [14]:
cleaned_data["PassengerId"] = cleaned_data["PassengerId"].astype("str")

In [16]:
cleaned_data["Survived"] = cleaned_data["Survived"].astype("category")
cleaned_data["Pclass"] = cleaned_data["Pclass"].astype("category")
cleaned_data["Sex"] = cleaned_data["Sex"].astype("category")
cleaned_data["Embarked"] = cleaned_data["Embarked"].astype("category")

In [17]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  712 non-null    object  
 1   Survived     712 non-null    category
 2   Pclass       712 non-null    category
 3   Name         712 non-null    object  
 4   Sex          712 non-null    category
 5   Age          566 non-null    float64 
 6   SibSp        712 non-null    int64   
 7   Parch        712 non-null    int64   
 8   Ticket       712 non-null    object  
 9   Fare         712 non-null    float64 
 10  Cabin        168 non-null    object  
 11  Embarked     710 non-null    category
dtypes: category(4), float64(2), int64(2), object(4)
memory usage: 47.9+ KB


Inspect the rows with a missing Age

In [19]:
cleaned_data[cleaned_data["Age"].isna()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
17,612,0,3,"Jardin, Mr. Jose Neto",male,,0,0,SOTON/O.Q. 3101305,7.0500,,S
22,230,0,3,"Lefebre, Miss. Mathilde",female,,3,1,4133,25.4667,,S
27,257,1,1,"Thorne, Mrs. Gertrude Maybelle",female,,0,0,PC 17585,79.2000,,C
28,829,1,3,"McCormack, Mr. Thomas Joseph",male,,0,0,367228,7.7500,,Q
37,548,1,2,"Padro y Manent, Mr. Julian",male,,0,0,SC/PARIS 2146,13.8625,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
689,769,0,3,"Moran, Mr. Daniel J",male,,1,0,371110,24.1500,,Q
692,160,0,3,"Sage, Master. Thomas Henry",male,,8,2,CA. 2343,69.5500,,S
699,426,0,3,"Wiseman, Mr. Phillippe",male,,0,0,A/4. 34244,7.2500,,S
708,65,0,1,"Stewart, Mr. Albert A",male,,0,0,PC 17605,27.7208,,C


Too many rows lack Age (146 of 712) to simply drop them, so the missing values are filled with the mean age.

In [20]:
average_age = cleaned_data["Age"].mean()
cleaned_data["Age"]=cleaned_data["Age"].fillna(average_age)
cleaned_data["Age"].isna().sum()

0
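
The fill step can also be written so that the mean is computed once and then reused on any later table. This is a minimal sketch on toy values rather than the real file; the helper name `fill_age` and both toy frames are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fill_age(df: pd.DataFrame, fill_value: float) -> pd.DataFrame:
    """Return a copy of df with missing Age values replaced by fill_value."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(fill_value)
    return out

# Toy frames standing in for the training and test tables (illustrative values).
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 10.0]})

train_mean = toy_train["Age"].mean()   # mean() skips NaN: (22 + 40) / 2
filled_train = fill_age(toy_train, train_mean)
filled_test = fill_age(toy_test, train_mean)
```

Reusing the training mean keeps both tables on the same reference value; filling each table with its own mean is another workable choice.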

Cabin and Embarked are not used in this analysis, so their missing values are left as they are.

## Duplicates (checked on columns that must be unique)

In [24]:
cleaned_data['PassengerId'].duplicated().sum()

0

## Inconsistent data (usually checked on categorical columns)

In [25]:
cleaned_data["Survived"].value_counts()

Survived
0    433
1    279
Name: count, dtype: int64

In [26]:
cleaned_data["Pclass"].value_counts()

Pclass
3    389
1    178
2    145
Name: count, dtype: int64

In [27]:
cleaned_data["Sex"].value_counts()

Sex
male      456
female    256
Name: count, dtype: int64

In [28]:
cleaned_data["Embarked"].value_counts()

Embarked
S    517
C    131
Q     62
Name: count, dtype: int64
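
Reading each `value_counts()` output by eye works for four columns. The same check can be automated by comparing the observed labels with the ones listed in the column descriptions. The sketch below runs on a toy frame; the helper `unexpected_values` and the `EXPECTED` mapping are illustrative names, not part of pandas.

```python
import pandas as pd

# Expected labels per column, taken from the column descriptions in this report.
EXPECTED = {
    "Survived": {0, 1},
    "Pclass": {1, 2, 3},
    "Sex": {"male", "female"},
    "Embarked": {"C", "Q", "S"},
}

def unexpected_values(df: pd.DataFrame, expected: dict) -> dict:
    """Map each column to the set of non-null values outside its expected set."""
    report = {}
    for column, allowed in expected.items():
        seen = set(df[column].dropna().unique())
        extra = seen - allowed
        if extra:
            report[column] = extra
    return report

# Toy data with one deliberate inconsistency ("Female") and one missing port.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [1, 3, 2],
    "Sex": ["male", "Female", "female"],
    "Embarked": ["S", None, "C"],
})
problems = unexpected_values(toy, EXPECTED)
```

An empty dictionary means every label is one of the expected ones, which is what the four outputs above show for the real table.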

## Invalid data (checked on numeric columns)

In [29]:
cleaned_data.describe()

Unnamed: 0,Age,SibSp,Parch,Fare
count,712.0,712.0,712.0,712.0
mean,29.782102,0.502809,0.386236,32.922173
std,12.934477,1.031156,0.837572,52.102027
min,0.42,0.0,0.0,0.0
25%,22.0,0.0,0.0,7.8958
50%,29.782102,0.0,0.0,14.4583
75%,35.0,1.0,0.0,31.275
max,80.0,8.0,6.0,512.3292


# Organize the data

Combine SibSp and Parch into FamilyNum

In [30]:
cleaned_data["FamilyNum"] = cleaned_data["SibSp"] + cleaned_data["Parch"]

In [31]:
cleaned_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilyNum
0,493,0,1,"Molson, Mr. Harry Markland",male,55.0,0,0,113787,30.5,C30,S,0
1,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C,1
2,388,1,2,"Buss, Miss. Kate",female,36.0,0,0,27849,13.0,,S,0
3,192,0,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S,0
4,687,0,3,"Panula, Mr. Jaako Arnold",male,14.0,4,1,3101295,39.6875,,S,5


# Visualize the data

# Analyze the data

In [33]:
import statsmodels.api as sm

In [41]:
lr_data = cleaned_data.copy()
lr_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilyNum
0,493,0,1,"Molson, Mr. Harry Markland",male,55.0,0,0,113787,30.5,C30,S,0
1,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C,1
2,388,1,2,"Buss, Miss. Kate",female,36.0,0,0,27849,13.0,,S,0
3,192,0,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S,0
4,687,0,3,"Panula, Mr. Jaako Arnold",male,14.0,4,1,3101295,39.6875,,S,5


## Encode the categorical variables (already converted to the category dtype)

In [42]:
lr_data = pd.get_dummies(lr_data,columns=["Pclass","Sex"],drop_first=True,dtype=int)
lr_data

Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilyNum,Pclass_2,Pclass_3,Sex_male
0,493,0,"Molson, Mr. Harry Markland",55.000000,0,0,113787,30.5000,C30,S,0,0,0,1
1,53,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",49.000000,1,0,PC 17572,76.7292,D33,C,1,0,0,0
2,388,1,"Buss, Miss. Kate",36.000000,0,0,27849,13.0000,,S,0,1,0,0
3,192,0,"Carbines, Mr. William",19.000000,0,0,28424,13.0000,,S,0,1,0,1
4,687,0,"Panula, Mr. Jaako Arnold",14.000000,4,1,3101295,39.6875,,S,5,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,859,1,"Baclini, Mrs. Solomon (Latifa Qurban)",24.000000,0,3,2666,19.2583,,C,3,0,1,0
708,65,0,"Stewart, Mr. Albert A",29.782102,0,0,PC 17605,27.7208,,C,0,0,0,1
709,130,0,"Ekstrom, Mr. Johan",45.000000,0,0,347061,6.9750,,S,0,0,1,1
710,21,0,"Fynney, Mr. Joseph J",35.000000,0,0,239865,26.0000,,S,0,1,0,1


## Drop the columns of the DataFrame that are not used

In [52]:
lr_data = lr_data.drop(["PassengerId","Name","Ticket","Cabin","Embarked"],axis=1)

## Split into X and y

In [53]:
y=lr_data["Survived"]
X=lr_data.drop(["Survived"],axis=1)

## Check the pairwise correlation among the independent variables

In [54]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        712 non-null    float64
 1   SibSp      712 non-null    int64  
 2   Parch      712 non-null    int64  
 3   Fare       712 non-null    float64
 4   FamilyNum  712 non-null    int64  
 5   Pclass_2   712 non-null    int32  
 6   Pclass_3   712 non-null    int32  
 7   Sex_male   712 non-null    int32  
dtypes: float64(2), int32(3), int64(3)
memory usage: 36.3 KB


## Check the correlation between independent variables; when the absolute value exceeds 0.8, one of the pair should be dropped

In [58]:
X.corr().abs() > 0.8

Unnamed: 0,Age,SibSp,Parch,Fare,FamilyNum,Pclass_2,Pclass_3,Sex_male
Age,True,False,False,False,False,False,False,False
SibSp,False,True,False,False,True,False,False,False
Parch,False,False,True,False,False,False,False,False
Fare,False,False,False,True,False,False,False,False
FamilyNum,False,True,False,False,True,False,False,False
Pclass_2,False,False,False,False,False,True,False,False
Pclass_3,False,False,False,False,False,False,True,False
Sex_male,False,False,False,False,False,False,False,True
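
A boolean matrix gets harder to scan as the number of columns grows. As an alternative, the pairs above the threshold can be listed directly. The sketch below uses a toy frame built the same way as FamilyNum; `high_corr_pairs` is a hypothetical helper name and the toy values are made up.

```python
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return (col_a, col_b, |r|) for each distinct pair whose |r| exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only: skips the diagonal
            r = corr.iloc[i, j]
            if r > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy columns; FamilyNum is the sum of the other two, as in the report.
toy = pd.DataFrame({
    "SibSp": [0, 1, 4, 0, 1, 3],
    "Parch": [0, 0, 1, 2, 0, 1],
})
toy["FamilyNum"] = toy["SibSp"] + toy["Parch"]
pairs = high_corr_pairs(toy)
```

On this toy frame only the SibSp/FamilyNum pair crosses 0.8, mirroring the pattern in the matrix above.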


In [61]:
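# SibSp is highly correlated with FamilyNum (|r| > 0.8), and FamilyNum is the exact sum of SibSp and Parch, so both source columns are dropped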
X = X.drop(["SibSp","Parch"],axis=1)

In [64]:
X.corr().abs()

Unnamed: 0,Age,Fare,FamilyNum,Pclass_2,Pclass_3,Sex_male
Age,1.0,0.094065,0.238867,0.015344,0.261767,0.082147
Fare,0.094065,1.0,0.218309,0.121457,0.411341,0.171444
FamilyNum,0.238867,0.218309,1.0,0.044624,0.067369,0.200124
Pclass_2,0.015344,0.121457,0.044624,1.0,0.554966,0.078966
Pclass_3,0.261767,0.411341,0.067369,0.554966,1.0,0.15794
Sex_male,0.082147,0.171444,0.200124,0.078966,0.15794,1.0


## Add a constant column for the intercept

In [65]:
X = sm.add_constant(X)

In [66]:
X

Unnamed: 0,const,Age,Fare,FamilyNum,Pclass_2,Pclass_3,Sex_male
0,1.0,55.000000,30.5000,0,0,0,1
1,1.0,49.000000,76.7292,1,0,0,0
2,1.0,36.000000,13.0000,0,1,0,0
3,1.0,19.000000,13.0000,0,1,0,1
4,1.0,14.000000,39.6875,5,0,1,1
...,...,...,...,...,...,...,...
707,1.0,24.000000,19.2583,3,0,1,0
708,1.0,29.782102,27.7208,0,0,0,1
709,1.0,45.000000,6.9750,0,0,1,1
710,1.0,35.000000,26.0000,0,1,0,1


## Fit the Logit regression

In [67]:
model = sm.Logit(y,X).fit()

Optimization terminated successfully.
         Current function value: 0.450334
         Iterations 6


In [68]:
model.summary()

0,1,2,3
Dep. Variable:,Survived,No. Observations:,712.0
Model:,Logit,Df Residuals:,705.0
Method:,MLE,Df Model:,6.0
Date:,"Fri, 31 Jan 2025",Pseudo R-squ.:,0.3274
Time:,18:15:28,Log-Likelihood:,-320.64
converged:,True,LL-Null:,-476.73
Covariance Type:,nonrobust,LLR p-value:,1.9940000000000002e-64

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,3.6779,0.492,7.478,0.000,2.714,4.642
Age,-0.0423,0.009,-4.775,0.000,-0.060,-0.025
Fare,0.0036,0.003,1.407,0.159,-0.001,0.009
FamilyNum,-0.2266,0.075,-3.023,0.002,-0.373,-0.080
Pclass_2,-0.8069,0.328,-2.462,0.014,-1.449,-0.165
Pclass_3,-1.8163,0.316,-5.754,0.000,-2.435,-1.198
Sex_male,-2.7483,0.217,-12.665,0.000,-3.174,-2.323


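Looking at the p-values: Fare has p > 0.05, so it is dropped and the model is refitted.  
Looking at coef: every variable other than Fare has a negative coefficient, so an increase in it lowers the odds of survival.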

In [71]:
X = X.drop(columns=["Fare"], errors="ignore")  # errors="ignore" keeps the cell safe to run more than once

In [72]:
model=sm.Logit(y,X).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.451898
         Iterations 6


0,1,2,3
Dep. Variable:,Survived,No. Observations:,712.0
Model:,Logit,Df Residuals:,706.0
Method:,MLE,Df Model:,5.0
Date:,"Fri, 31 Jan 2025",Pseudo R-squ.:,0.3251
Time:,18:19:55,Log-Likelihood:,-321.75
converged:,True,LL-Null:,-476.73
Covariance Type:,nonrobust,LLR p-value:,7.209e-65

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,3.9722,0.449,8.838,0.000,3.091,4.853
Age,-0.0430,0.009,-4.875,0.000,-0.060,-0.026
FamilyNum,-0.1990,0.072,-2.766,0.006,-0.340,-0.058
Pclass_2,-1.0227,0.292,-3.501,0.000,-1.595,-0.450
Pclass_3,-2.0684,0.264,-7.844,0.000,-2.585,-1.552
Sex_male,-2.7575,0.217,-12.735,0.000,-3.182,-2.333
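
As a cross-check on dropping Fare, the two log-likelihoods reported in the summaries can be compared with a likelihood-ratio test. This is a sketch using only the printed values; the test itself is an addition for illustration and was not part of the original analysis.

```python
import math

# Log-likelihoods copied from the two summaries in this report.
ll_with_fare = -320.64      # model including Fare (Df Model = 6)
ll_without_fare = -321.75   # model after dropping Fare (Df Model = 5)

# Likelihood-ratio statistic for the single dropped parameter.
lr_stat = 2 * (ll_with_fare - ll_without_fare)

# Survival function of a chi-square with 1 degree of freedom:
# P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))
```

A p-value above 0.05 here would agree with the Wald p-value in the first summary: removing Fare does not make the fit significantly worse.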


## Interpret the coefficients

In [77]:
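np.exp(-0.0430)
# Each additional year of age multiplies the odds of survival by about 0.958,
# i.e. lowers the odds (not the probability) by roughly 4%
# The other coefficients are interpreted the same way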

0.9579113900670306
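
The same conversion can be applied to every coefficient at once. The sketch below uses the rounded coefficients from the refitted summary; the dictionary names are illustrative.

```python
import math

# Coefficients copied from the refitted model's summary (log-odds scale).
coefficients = {
    "Age": -0.0430,
    "FamilyNum": -0.1990,
    "Pclass_2": -1.0227,
    "Pclass_3": -2.0684,
    "Sex_male": -2.7575,
}

# exp(coef) is the factor by which the odds change per one-unit increase.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

# Percentage change in the odds; a negative value means the odds fall.
percent_change = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}
```

For the dummy variables, the one-unit increase is the switch from the baseline category (first class, female) to the named one.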

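What the results indicate:  
- Younger passengers were more likely to survive  
- Women were more likely to survive  
- Passengers in a higher cabin class were more likely to survive  
- Passengers with fewer family members aboard were more likely to survive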

# Predict

In [154]:
test_data = pd.read_csv("test.csv")
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,181,3,"Sage, Miss. Constance Gladys",female,,8,2,CA. 2343,69.55,,S
1,405,3,"Oreskovic, Miss. Marija",female,20.0,0,0,315096,8.6625,,S
2,635,3,"Skoog, Miss. Mabel",female,9.0,3,2,347088,27.9,,S
3,701,1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",female,18.0,1,0,PC 17757,227.525,C62 C64,C
4,470,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C


In [155]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179 entries, 0 to 178
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  179 non-null    int64  
 1   Pclass       179 non-null    int64  
 2   Name         179 non-null    object 
 3   Sex          179 non-null    object 
 4   Age          148 non-null    float64
 5   SibSp        179 non-null    int64  
 6   Parch        179 non-null    int64  
 7   Ticket       179 non-null    object 
 8   Fare         179 non-null    float64
 9   Cabin        36 non-null     object 
 10  Embarked     179 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 15.5+ KB


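## The input must have no missing values, and its variables must match the ones the model was fitted on, in both name and dtype

Drop the variables that are not needed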

In [156]:
test_data = test_data.drop(["PassengerId","Name","Ticket","Cabin","Embarked"],axis=1)
test_data

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare
0,3,female,,8,2,69.5500
1,3,female,20.00,0,0,8.6625
2,3,female,9.00,3,2,27.9000
3,1,female,18.00,1,0,227.5250
4,3,female,0.75,2,1,19.2583
...,...,...,...,...,...,...
174,3,female,23.00,0,0,7.5500
175,3,male,,0,0,7.5500
176,2,female,34.00,0,0,13.0000
177,3,male,,0,0,8.0500


Fill the missing Age values

In [157]:
test_data["Age"] = test_data["Age"].fillna(test_data["Age"].mean())
test_data["Age"].isna().sum()

0

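Set Pclass and Sex to the category dtype. This has essentially the same effect as the `astype("category")` conversion used earlier, which would also work here. One difference matters: the order of the listed categories decides which dummy column `drop_first` removes. With `["male","female"]` the remaining column is Sex_female, whereas the fitted model expects Sex_male; listing the categories as `["female","male"]` would reproduce the training encoding.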

In [158]:
test_data["Pclass"] = pd.Categorical(test_data["Pclass"],categories=[1,2,3])
test_data["Sex"] = pd.Categorical(test_data["Sex"],categories=["male","female"])

In [159]:
test_data

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare
0,3,female,29.381757,8,2,69.5500
1,3,female,20.000000,0,0,8.6625
2,3,female,9.000000,3,2,27.9000
3,1,female,18.000000,1,0,227.5250
4,3,female,0.750000,2,1,19.2583
...,...,...,...,...,...,...
174,3,female,23.000000,0,0,7.5500
175,3,male,29.381757,0,0,7.5500
176,2,female,34.000000,0,0,13.0000
177,3,male,29.381757,0,0,8.0500


In [160]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179 entries, 0 to 178
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Pclass  179 non-null    category
 1   Sex     179 non-null    category
 2   Age     179 non-null    float64 
 3   SibSp   179 non-null    int64   
 4   Parch   179 non-null    int64   
 5   Fare    179 non-null    float64 
dtypes: category(2), float64(2), int64(2)
memory usage: 6.3 KB


In [161]:
test_data = pd.get_dummies(test_data,columns=["Pclass","Sex"],drop_first=True,dtype=int)
test_data

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_female
0,29.381757,8,2,69.5500,0,1,1
1,20.000000,0,0,8.6625,0,1,1
2,9.000000,3,2,27.9000,0,1,1
3,18.000000,1,0,227.5250,0,0,1
4,0.750000,2,1,19.2583,0,1,1
...,...,...,...,...,...,...,...
174,23.000000,0,0,7.5500,0,1,1
175,29.381757,0,0,7.5500,0,1,0
176,34.000000,0,0,13.0000,1,0,1
177,29.381757,0,0,8.0500,0,1,0
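
The encoded table above ends with a Sex_female column, while the model was fitted with Sex_male. The small sketch below shows how the category order passed to `pd.Categorical` decides which dummy survives `drop_first`. The helper `dummy_columns` and the toy frame are illustrative.

```python
import pandas as pd

sex = pd.DataFrame({"Sex": ["female", "male", "female"]})

def dummy_columns(frame: pd.DataFrame, categories: list) -> list:
    """Encode Sex with the given category order and return the resulting column names."""
    encoded = frame.copy()
    encoded["Sex"] = pd.Categorical(encoded["Sex"], categories=categories)
    encoded = pd.get_dummies(encoded, columns=["Sex"], drop_first=True, dtype=int)
    return list(encoded.columns)

as_in_test_cell = dummy_columns(sex, ["male", "female"])   # first category (male) is dropped
as_in_training = dummy_columns(sex, ["female", "male"])    # first category (female) is dropped
```

`astype("category")` sorts string labels alphabetically, which is why the training table ended up with female as the baseline.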


Add the FamilyNum column

In [162]:
test_data["FamilyNum"] = test_data["SibSp"] + test_data["Parch"]

test_data.head()


Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_female,FamilyNum
0,29.381757,8,2,69.55,0,1,1,10
1,20.0,0,0,8.6625,0,1,1,0
2,9.0,3,2,27.9,0,1,1,5
3,18.0,1,0,227.525,0,0,1,1
4,0.75,2,1,19.2583,0,1,1,3


In [163]:
model.params

const        3.972211
Age         -0.042998
FamilyNum   -0.199039
Pclass_2    -1.022660
Pclass_3    -2.068411
Sex_male    -2.757546
dtype: float64

In [164]:
# Drop SibSp, Parch and Fare
test_data = test_data.drop(["SibSp","Parch","Fare"],axis=1)

In [167]:
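test_data = sm.add_constant(X)
test_data
# Caution: this call passes the training matrix X, not test_data, so the table
# and the 712 predictions below describe the training passengers. To score the
# test set, pass test_data itself once its columns match model.params
# (Sex_male rather than Sex_female).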

Unnamed: 0,const,Age,FamilyNum,Pclass_2,Pclass_3,Sex_male
0,1.0,55.000000,0,0,0,1
1,1.0,49.000000,1,0,0,0
2,1.0,36.000000,0,1,0,0
3,1.0,19.000000,0,1,0,1
4,1.0,14.000000,5,0,1,1
...,...,...,...,...,...,...
707,1.0,24.000000,3,0,1,0
708,1.0,29.782102,0,0,0,1
709,1.0,45.000000,0,0,1,1
710,1.0,35.000000,0,1,0,1


## Feed the data to the model

In [170]:
predicted_value = model.predict(test_data)
predicted_value 

0      0.240450
1      0.841079
2      0.802442
3      0.348656
4      0.079372
         ...   
707    0.568255
708    0.483531
709    0.057941
710    0.211999
711    0.483531
Length: 712, dtype: float64
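
The 712 values above are predictions for the training rows. To score actual test rows, the printed coefficients can be applied by hand. This sketch copies the five rows shown by `test_data.head()` and the values from `model.params`; the 0.5 cutoff for turning a probability into a label is an assumption, not something the report specifies.

```python
import numpy as np
import pandas as pd

# Coefficients as printed by model.params in this report.
params = pd.Series({
    "const": 3.972211,
    "Age": -0.042998,
    "FamilyNum": -0.199039,
    "Pclass_2": -1.022660,
    "Pclass_3": -2.068411,
    "Sex_male": -2.757546,
})

# First five encoded test rows, copied from the test_data.head() output.
test_head = pd.DataFrame({
    "Age": [29.381757, 20.0, 9.0, 18.0, 0.75],
    "FamilyNum": [10, 0, 5, 1, 3],
    "Pclass_2": [0, 0, 0, 0, 0],
    "Pclass_3": [1, 1, 1, 0, 1],
    "Sex_female": [1, 1, 1, 1, 1],
})

# Re-express sex the way the model was trained, then add the intercept column.
design = test_head.drop(columns="Sex_female")
design["Sex_male"] = 1 - test_head["Sex_female"]
design.insert(0, "const", 1.0)
design = design[params.index]   # same column order as the coefficients

# Logistic function applied to the linear predictor.
log_odds = design.to_numpy(dtype=float) @ params.to_numpy()
survival_probability = 1.0 / (1.0 + np.exp(-log_odds))
predicted_label = (survival_probability >= 0.5).astype(int)  # assumed cutoff
```

With a fitted statsmodels result, `model.predict(design)` computes the same probabilities; the manual version only makes the arithmetic visible.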

# Summary


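## How logistic regression differs from linear regression
The workflow for logistic regression is the same as for linear regression, with two differences   
1. The fitting call
- `sm.OLS(y,X).fit()` is linear regression  
- `sm.Logit(y,X).fit()` is logistic regression  
2. How coef is interpreted
- Linear regression: the coef in model.summary() is directly the change in y per one-unit increase in the variable  
- Logistic regression: the coef in model.summary() is on the log-odds scale; `np.exp(coef)` gives the factor by which the odds change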
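
The need to exponentiate follows from the form of the model. In a logistic regression the linear combination of predictors equals the log-odds, so raising one predictor by one unit multiplies the odds by a fixed factor. The symbols p (survival probability), x_j (the j-th predictor) and β_j (its coef) are notation introduced here for illustration.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longrightarrow\quad
\frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} = e^{\beta_j},
\qquad \mathrm{odds} = \frac{p}{1-p}
```

For Age, β = -0.0430 gives e^β ≈ 0.958, the value computed with `np.exp` earlier in the report.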

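## What logistic regression and linear regression share

1. Get the data, clean it, organize it (and optionally explore it)  
2. `import statsmodels.api as sm`  
3. Apply get_dummies to the categorical variables (convert them to the category dtype beforehand)  
`dataFrame = pd.get_dummies(dataFrame, columns=[""], drop_first=True, dtype=int)`  
`drop_first=True` is there to avoid collinearity among the dummy columns  
4. Separate the dependent variable from the independent variables  
`  
y = dataFrame["price"]  
X = dataFrame.drop("price", axis=1)   
X = dataFrame.drop(["price","column1"], axis=1)   
X = dataFrame[["column1","column2"]]  
`  
Then check the pairwise correlation among the columns of X, to avoid collinearity     
- Between two independent variables: `X["column1"].corr(X["column2"])`
- Among several independent variables: `X.corr().abs()`; an absolute value above 0.8 indicates strong correlation, and only one of the pair should be kept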
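5. Add a constant column to X for the intercept:  
`X=sm.add_constant(X)`, with the independent variables as the argument  
6. Fit the model  
`model = sm.OLS(y,X).fit()`, `model.summary()` for linear regression  
`model = sm.Logit(y,X).fit()`, `model.summary()` for logistic regression  
Then use the p-values to judge significance. If an independent variable has a p-value above the chosen level (0.05), drop that variable and refit the model  
7. Prepare the data to predict on  
The data passed to the model must go through the same cleaning, organizing and pre-fit processing as the X used to build it: 
- the variables must match in name and dtype  
- no column may contain missing values    
In particular, convert the categorical columns of test_data to the category dtype before calling get_dummies, listing the categories in the same order as in training so that `drop_first` leaves the same dummy columns:
`test_data["Pclass"] = pd.Categorical(test_data["Pclass"],categories=[1,2,3])  
test_data["Sex"] = pd.Categorical(test_data["Sex"],categories=["female","male"])` 
8. Predict: `model.predict(test_data)`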
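
Steps 3, 5 and 7 can be gathered into one function, so that the training and test tables pass through identical preprocessing. This is a sketch on toy rows, not the author's code: the function name `prepare_features` and the choice to fill Age with the training mean are assumptions.

```python
import numpy as np
import pandas as pd

def prepare_features(df: pd.DataFrame, age_fill: float) -> pd.DataFrame:
    """Apply the report's preprocessing steps and return the model's input columns."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_fill)
    out["FamilyNum"] = out["SibSp"] + out["Parch"]
    # Explicit category lists keep the dummy columns identical across tables,
    # even when a table happens to lack one of the categories.
    out["Pclass"] = pd.Categorical(out["Pclass"], categories=[1, 2, 3])
    out["Sex"] = pd.Categorical(out["Sex"], categories=["female", "male"])
    out = pd.get_dummies(out, columns=["Pclass", "Sex"], drop_first=True, dtype=int)
    out.insert(0, "const", 1.0)   # intercept column
    return out[["const", "Age", "FamilyNum", "Pclass_2", "Pclass_3", "Sex_male"]]

# Toy rows standing in for the two tables (illustrative values).
toy_train = pd.DataFrame({
    "Pclass": [1, 3], "Sex": ["male", "female"],
    "Age": [55.0, np.nan], "SibSp": [0, 4], "Parch": [0, 1],
})
toy_test = pd.DataFrame({
    "Pclass": [3], "Sex": ["female"],
    "Age": [np.nan], "SibSp": [8], "Parch": [2],
})

age_fill = toy_train["Age"].mean()
X_train = prepare_features(toy_train, age_fill)
X_test = prepare_features(toy_test, age_fill)
```

Because both calls return the same columns in the same order, `X_test` could be passed to `model.predict` of a model fitted on `X_train` without further alignment.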