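#### Case Study: Titanic Passenger Survival Classification
- The sinking of the Titanic is one of the worst maritime disasters in history. Here we use a classification tree to predict which passengers were likely to survive. The dataset is data.csv.
- Feature descriptions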

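  - Survived: whether the passenger survived
  - Pclass: ticket class, a proxy for the passenger's socioeconomic status
  - Name, Sex, Age
  - SibSp: number of siblings/spouses aboard the Titanic
  - Parch: number of parents/children aboard the Titanic
  - Ticket: ticket number
  - Fare: ticket fare
  - Cabin: cabin number
  - Embarked: port of embarkation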


In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [15]:
data = pd.read_csv(r"./datasets/data.csv",index_col = 'PassengerId')
data.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [17]:
# Drop features that have many missing values or are irrelevant to the prediction
data.drop(labels=['Cabin','Name','Ticket'],inplace=True,axis=1)

In [18]:
# Fill missing values in the Age column with the column mean (note that mean() must be called)
data['Age'] = data['Age'].fillna(data['Age'].mean())
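Mean imputation of a numeric column can be sketched on a toy Series. The values below are invented for illustration and are not taken from data.csv; the sketch relies on `Series.mean()` skipping NaN entries by default.

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration (not from data.csv)
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 36.0])

# mean() ignores NaN entries by default, so only the three known ages count
mean_age = ages.mean()

# Replace every NaN with the computed mean; the column stays float64
filled = ages.fillna(mean_age)

print(mean_age)
print(filled.tolist())
```

Because the fill value is a plain float, the column keeps a numeric dtype and needs no later conversion.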

In [19]:
# Drop the remaining rows that contain missing values
data.dropna(inplace=True)

In [20]:
# Convert the Sex column to numeric data (male -> True, female -> False)
data['Sex'] = (data['Sex']== 'male')

In [24]:
data['Sex'] =data['Sex'].astype('int')

In [25]:
data.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
1,0,3,1,22.0,1,0,7.25,S
2,1,1,0,38.0,1,0,71.2833,C
3,1,3,0,26.0,0,0,7.925,S
4,1,1,0,35.0,1,0,53.1,S
5,0,3,1,35.0,0,0,8.05,S


In [26]:
# Convert the three-category Embarked variable to integer codes
labels = data['Embarked'].unique().tolist()
data['Embarked'] = data['Embarked'].map(lambda x:labels.index(x))

In [27]:
labels

['S', 'C', 'Q']
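The encoding step above assigns each port the position of its first appearance in the column. A minimal sketch on a toy Series (values invented for illustration) shows the resulting codes, and that `pd.factorize` produces the same codes in one call; using `pd.factorize` here is a suggested alternative, not what the notebook does.

```python
import pandas as pd

# Toy port codes, invented for illustration (not from data.csv)
embarked = pd.Series(['S', 'C', 'S', 'Q', 'C'])

# unique() preserves order of first appearance, so 'S' -> 0, 'C' -> 1, 'Q' -> 2
labels = embarked.unique().tolist()
codes = embarked.map(lambda x: labels.index(x))

# pd.factorize also numbers categories by order of first appearance
fact_codes, uniques = pd.factorize(embarked)

print(labels)
print(codes.tolist())
print(fact_codes.tolist())
```

Note that the integers are arbitrary labels: a decision tree splits on thresholds, so it will treat the codes as if they were ordered.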

In [45]:
data.drop([890],axis=0,inplace = True)

In [49]:
data.tail()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
884,0,2,1,28.0,0,0,10.5,0
885,0,3,1,25.0,0,0,7.05,0
886,0,3,0,39.0,0,5,29.125,2
887,0,2,1,27.0,0,0,13.0,0
891,0,3,1,32.0,0,0,7.75,2


In [67]:
# Convert Age to integers; errors='coerce' turns values that cannot be parsed into NaN, which are then filled with 0
data['Age'] = pd.to_numeric(data['Age'], errors='coerce').fillna(0).astype('int32')

In [68]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 886 entries, 1 to 891
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  886 non-null    int64  
 1   Pclass    886 non-null    int64  
 2   Sex       886 non-null    int32  
 3   Age       886 non-null    int32  
 4   SibSp     886 non-null    int64  
 5   Parch     886 non-null    int64  
 6   Fare      886 non-null    float64
 7   Embarked  886 non-null    int64  
dtypes: float64(1), int32(2), int64(5)
memory usage: 55.4 KB


In [69]:
feature = data.iloc[:,data.columns != 'Survived']
target = data.iloc[:,data.columns == 'Survived']

In [70]:
x_train, x_test, y_train, y_test = train_test_split(feature,target,test_size=0.3)
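The split above is random, so each run yields a different partition and a different score. As a hedged sketch, passing `random_state` to `train_test_split` makes the partition repeatable; the toy DataFrame and the seed value 0 are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows, invented for illustration (not from data.csv)
toy = pd.DataFrame({
    'Pclass':   [3, 1, 3, 1, 3, 2, 1, 3, 2, 3],
    'Sex':      [1, 0, 0, 0, 1, 1, 1, 0, 0, 1],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
})
feature = toy.drop(columns='Survived')
target = toy['Survived']

# With a fixed random_state, two calls return the identical partition
a_train, a_test, _, _ = train_test_split(feature, target, test_size=0.3, random_state=0)
b_train, b_test, _, _ = train_test_split(feature, target, test_size=0.3, random_state=0)

print(a_train.shape, a_test.shape)
print(a_test.index.tolist() == b_test.index.tolist())
```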

In [71]:
clf = DecisionTreeClassifier(random_state=25)
clf.fit(x_train,y_train)
clf.score(x_test, y_test)

0.7481203007518797
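A single hold-out score depends on which rows happen to land in the test set. One common way to get a steadier estimate is k-fold cross-validation. The sketch below uses `cross_val_score` on synthetic data from `make_classification`, since data.csv is not available here; the sample count, feature count, and `cv=10` are assumptions for illustration, and the resulting numbers say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 7 Titanic features (illustrative only)
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

clf = DecisionTreeClassifier(random_state=25)

# 10-fold cross-validation: fit on 9 folds, score on the held-out fold, repeat
scores = cross_val_score(clf, X, y, cv=10)

print(scores.mean())
```

On the notebook's own data, the same call would take `feature` and `target['Survived']` in place of `X` and `y`.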