Task: Titanic passenger survival prediction

Workflow:
```
1. Data loading and cleaning
2. Feature engineering (feature extraction and refinement)
3. Model training
4. Model evaluation
```

In [33]:
import pandas as pd
import numpy as np

Data loading

In [34]:
data = pd.read_csv("/data/ys_data/titanic/train.csv")
test = pd.read_csv("/data/ys_data/titanic/test.csv")

Data cleaning

In [35]:
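# Concatenate the training and test sets so they can be inspected and cleaned together;
# ignore_index=True renumbers the rows 0..1308 instead of keeping each file's own index
full = pd.concat([data, test], ignore_index=True)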
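
The effect of the merge can be sketched on toy data with the public `pd.concat` API. The two small frames below are made-up stand-ins for train.csv and test.csv, not the real files.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up for illustration)
train_toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
test_toy = pd.DataFrame({"PassengerId": [3]})  # the test set has no Survived column

# ignore_index=True renumbers the rows 0..n-1; it does not drop or skip missing values
full_toy = pd.concat([train_toy, test_toy], ignore_index=True)

print(full_toy)
# Survived is upcast to float because the test row holds NaN
print(full_toy["Survived"].dtype)
```

This is why Survived shows up as float64 with only 891 non-null values in the combined frame.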

In [36]:
# Inspect the combined data set, find the missing values, and plan the cleaning
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


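Ways to handle missing values:

A. For numeric columns, fill with the mean

B. For categorical columns, fill with the most frequent category

C. Predict the missing values with a model, for example K-NN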
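
This notebook only uses options A and B. Option C could look roughly like the sketch below, which assumes scikit-learn's `KNNImputer`. The toy values and `n_neighbors=2` are illustrative choices, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame with one missing Age (values are made up for illustration)
toy = pd.DataFrame({
    "Age":  [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Each missing value is replaced by the mean of that column over the
# n_neighbors rows that are closest on the features that are present
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Because the distance is computed on raw values, columns with large ranges (such as Fare) dominate unless the features are scaled first.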

In [37]:
# Fill the missing Age and Fare values with their column means
full["Age"] = full["Age"].fillna(full["Age"].mean())
full["Fare"] = full["Fare"].fillna(full["Fare"].mean())

In [38]:
# Check which Embarked categories are most common
full["Embarked"].value_counts()

Embarked
S    914
C    270
Q    123
Name: count, dtype: int64

In [39]:
# Fill the missing Embarked values with the most common category, S
full["Embarked"] = full["Embarked"].fillna("S")
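
Instead of hard-coding "S", the most frequent category can be looked up with `Series.mode()`. A minimal sketch on a made-up toy series:

```python
import pandas as pd

# Toy Embarked column with one missing value (made up for illustration)
embarked_toy = pd.Series(["S", "C", None, "S", "Q"], name="Embarked")

# mode() ignores missing values and returns the most frequent value(s);
# [0] picks the first one if there is a tie
most_common = embarked_toy.mode()[0]
filled = embarked_toy.fillna(most_common)

print(most_common)
print(filled.tolist())
```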

In [40]:
# Cabin is mostly missing (only 295 of 1309 values present); fill with "u" for unknown for now
full["Cabin"] = full["Cabin"].fillna("u")

In [41]:
# Inspect the data set again after filling
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        1309 non-null   object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
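
In the summary above, Survived is the only column that still has missing values: 1309 - 891 = 418 rows, which are the test rows, since test.csv carries no labels. A compact way to check this is `isna().sum()`. The sketch below runs on a made-up toy frame rather than on `full`.

```python
import numpy as np
import pandas as pd

# Toy frame after cleaning: only the label column still has a gap (values made up)
toy = pd.DataFrame({
    "Survived": [0.0, 1.0, np.nan],
    "Age": [22.0, 29.9, 38.0],
    "Cabin": ["u", "C85", "u"],
})

missing = toy.isna().sum()                   # missing count per column
features_missing = missing.drop("Survived")  # ignore the label column

print(missing)
print("features fully filled:", bool((features_missing == 0).all()))
```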


Feature extraction

In [42]:
# Encode sex as integers (male -> 1, female -> 0)
sex_mapDict = { "male" : 1, 
               "female" : 0}
full["Sex"] = full["Sex"].map(sex_mapDict)

In [43]:
full["Sex"]

0       1
1       0
2       0
3       0
4       1
       ..
1304    1
1305    0
1306    1
1307    1
1308    1
Name: Sex, Length: 1309, dtype: int64
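
`Series.map` with a dict turns any value that is not a key into NaN, so it is worth checking that nothing was left unmapped. A minimal sketch on a made-up toy series, where "Female" is a deliberately mis-cased value:

```python
import pandas as pd

# Toy column with one value that is not a key of the mapping (made up for illustration)
sex_toy = pd.Series(["male", "female", "Female"])

mapped = sex_toy.map({"male": 1, "female": 0})

# Values missing from the dict become NaN, which also turns the dtype into float
unmapped = sex_toy[mapped.isna()]

print(mapped.tolist())
print(unmapped.tolist())
```

In the output above the dtype is int64, which indicates that every row of Sex was either "male" or "female".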

In [45]:
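# One-hot encode Embarked (dictionary feature extraction)
# DictVectorizer expects an iterable of dicts, one per row. Passing the Series
# full["Embarked"] directly raises:
#   AttributeError: 'str' object has no attribute 'items'
from sklearn.feature_extraction import DictVectorizer

transfer = DictVectorizer(sparse=False)  # return a dense array instead of a sparse matrix
embarked = transfer.fit_transform(full[["Embarked"]].to_dict(orient="records"))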