## A. Preparation Stage
Data acquisition -> data exploration -> data cleaning -> feature selection

### A1. Data Acquisition

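* Obtain the data by web scraping or other means;
* Use head() to view the first few rows (the first 5 by default);
* Use tail() to view the last few rows (the last 5 by default).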
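
A minimal sketch of the head() and tail() behaviour described above, using a small made-up DataFrame (the column names `id` and `value` are assumptions for illustration, not part of the Titanic data):

```python
import pandas as pd

# Toy frame standing in for a real dataset (values are illustrative only)
df = pd.DataFrame({"id": range(1, 11), "value": [x * x for x in range(1, 11)]})

first_rows = df.head()   # no argument: the first 5 rows
last_rows = df.tail(3)   # explicit argument: the last 3 rows

print(first_rows)
print(last_rows)
```

Passing an integer to either method changes how many rows are returned, which is what `train_data.head(3)` relies on.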

In [9]:
import pandas as pd
import numpy as np

In [10]:
train_data = pd.read_csv('./titanic/train.csv', index_col = 0)
test_data = pd.read_csv('./titanic/test.csv', index_col = 0)

In [11]:
train_data.head(3)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


In [12]:
test_data.head(3)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |


### A2. Data Exploration

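* Use info() to get an overview of the table: number of rows, number of columns, the data type of each column, and how complete the data is;
* Use describe() to get summary statistics: count, mean, standard deviation, minimum, maximum, and so on;
* Use describe(include=['O']) to summarize the string (non-numeric) columns.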
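
As a complement to the three calls above, the sketch below counts missing values per column and summarizes a string column on a tiny made-up frame. The frame `toy` and its values are assumptions for illustration; the column is forced to `object` dtype so that `include=['O']` selects it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the passenger table, with one gap in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": pd.Series(["male", "female", "female", None], dtype="object"),
})

# Number of missing entries in each column (complements the Non-Null Count of info())
missing_per_column = toy.isnull().sum()

# For object columns, describe() reports count, unique, top and freq
text_summary = toy.describe(include=["O"])

print(missing_per_column)
print(text_summary)
```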

In [13]:
print(train_data.info())
print('-'*50)
print(train_data.describe())
print('-'*50)
print(train_data.describe(include=['O']))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
None
--------------------------------------------------
         Survived      Pclass         Age       SibSp       Parch        Fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.

In [14]:
print(test_data.info())
print('-'*50)
print(test_data.describe())
print('-'*50)
print(test_data.describe(include=['O']))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Name      418 non-null    object 
 2   Sex       418 non-null    object 
 3   Age       332 non-null    float64
 4   SibSp     418 non-null    int64  
 5   Parch     418 non-null    int64  
 6   Ticket    418 non-null    object 
 7   Fare      417 non-null    float64
 8   Cabin     91 non-null     object 
 9   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB
None
--------------------------------------------------
           Pclass         Age       SibSp       Parch        Fare
count  418.000000  332.000000  418.000000  418.000000  417.000000
mean     2.265550   30.272590    0.447368    0.392344   35.627188
std      0.841838   14.181209    0.896760    0.981429   55.907576
min      1.000000    0.170000    0.000000

### A3. Data Cleaning

In [15]:
# Age has missing values in both train_data and test_data; fill them with the mean
# (choose the fill strategy that suits the actual problem)
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())

In [16]:
# Fare (ticket price) has a missing value in test_data; fill it with the mean
# (choose the fill strategy that suits the actual problem)
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
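
The comments above note that the fill strategy depends on the problem. One common alternative, sketched here on made-up frames (the names `train`, `test` and `age_fill` are assumptions), is to compute a single statistic on the training set, here the median, and reuse it for the test set so that both sets are filled consistently:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the training and test tables
train = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Compute the fill value once, from the training data only
age_fill = train["Age"].median()

# Assign the result back instead of calling fillna(..., inplace=True) on a column
train["Age"] = train["Age"].fillna(age_fill)
test["Age"] = test["Age"].fillna(age_fill)

print(train)
print(test)
```

Whether the mean, the median, or a group-wise value works better is a modelling decision; this sketch only shows the mechanics.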

In [17]:
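# Cabin is the cabin number; it has a large number of missing values and cannot be filled in for now
# Embarked is the port of embarkation; it has a few missing values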
train_data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [18]:
# Fill the two missing values with the most frequent port, 'S'
train_data['Embarked'] = train_data['Embarked'].fillna('S')
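
Rather than hard-coding the most frequent value, it can be looked up from the data. The sketch below does this on a made-up Series (the values and the names `ports`, `most_common` and `filled` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical port column with one missing entry
ports = pd.Series(["S", "C", "S", None, "Q", "S"], dtype="object", name="Embarked")

# dropna=False makes the missing entries show up in the counts
counts_with_missing = ports.value_counts(dropna=False)

# mode() ignores missing values; take the first mode as the fill value
most_common = ports.mode()[0]
filled = ports.fillna(most_common)

print(counts_with_missing)
print(filled)
```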

In [19]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [20]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Name      418 non-null    object 
 2   Sex       418 non-null    object 
 3   Age       418 non-null    float64
 4   SibSp     418 non-null    int64  
 5   Parch     418 non-null    int64  
 6   Ticket    418 non-null    object 
 7   Fare      418 non-null    float64
 8   Cabin     91 non-null     object 
 9   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB


In [21]:
# Cabin is the cabin number; it has a large number of missing values
train_data['Cabin'].value_counts()

B96 B98        4
C23 C25 C27    4
G6             4
F33            3
E101           3
              ..
C47            1
C90            1
B19            1
D48            1
C45            1
Name: Cabin, Length: 147, dtype: int64

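### A4. Feature Selection
Select the features (columns) that influence the prediction and build the classifier:

Columns such as PassengerId, Name, Ticket, and Cabin contribute little to the classifier and can be dropped;
the remaining columns, Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked, are fed to the classifier.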

In [28]:
# Feature selection
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
test_features = test_data[features]

# Training labels (the target column)
train_labels = train_data['Survived']

train_features.head(6)

| PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 1 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 2 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 3 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 4 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 5 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |
| 6 | 3 | male | 29.699118 | 0 | 0 | 8.4583 | Q |


In [23]:
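# Convert the string-valued features to numeric values for later computation;
# when a feature has only a few categories, one-hot encoding is a reasonable choice
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse = False)
# Fit the encoder on the training features, then apply the same mapping to the test features
train_features = vec.fit_transform(train_features.to_dict(orient = 'records'))
test_features = vec.transform(test_features.to_dict(orient = 'records'))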

In [27]:
# Feature matrix
train_features

array([[22.        ,  0.        ,  0.        , ...,  0.        ,
         1.        ,  1.        ],
       [38.        ,  1.        ,  0.        , ...,  1.        ,
         0.        ,  1.        ],
       [26.        ,  0.        ,  0.        , ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [29.69911765,  0.        ,  0.        , ...,  1.        ,
         0.        ,  1.        ],
       [26.        ,  1.        ,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [32.        ,  0.        ,  1.        , ...,  0.        ,
         1.        ,  0.        ]])

In [25]:
# Inspect the feature names
vec.feature_names_

['Age',
 'Embarked=C',
 'Embarked=Q',
 'Embarked=S',
 'Fare',
 'Parch',
 'Pclass',
 'Sex=female',
 'Sex=male',
 'SibSp']
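
To make the encoding easier to follow, here is a minimal sketch of DictVectorizer on two made-up records (the frame `toy` and the names `toy_vec`, `encoded` and `unseen` are assumptions for illustration). Numeric values pass through unchanged, while each string value becomes its own 0/1 column named `feature=value`:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical two-row table with one numeric and one string column
toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Sex": ["male", "female"],
})

toy_vec = DictVectorizer(sparse=False)
encoded = toy_vec.fit_transform(toy.to_dict(orient="records"))

# A category that was not seen during fit is silently ignored at transform time
unseen = toy_vec.transform([{"Age": 30.0, "Sex": "other"}])

print(toy_vec.feature_names_)
print(encoded)
print(unseen)
```

Because the columns are fixed when the vectorizer is fitted, the test set must be passed through transform, not fit_transform, to get the same column layout.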

## B. Classification Modeling Stage
Modeling -> model evaluation -> data visualization -> report

### B1. Building the Model

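The three steps of modeling with sklearn:
* Instantiate: create the estimator object (set its parameters)
* Train: fit the model through its interface (data attributes, data interface)
* Return: retrieve the information you need through the interface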
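
The three steps can be sketched on a tiny made-up dataset (the data and the names `model`, `train_accuracy` and `predictions` are assumptions for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up data: one feature, and the label is 1 when the feature exceeds 2
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]

model = DecisionTreeClassifier(random_state=0)  # 1. instantiate with parameters
model.fit(X, y)                                 # 2. train on features and labels
train_accuracy = model.score(X, y)              # 3. query the fitted model
predictions = model.predict([[1], [4]])

print(train_accuracy)
print(predictions)
```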

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion = 'entropy', random_state = 30, splitter = 'random')
clf = clf.fit(train_features, train_labels)

In [None]:
# Predict with the decision tree
pred_labels = clf.predict(test_features)

In [None]:
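# Accuracy of the decision tree on the training set
# (an optimistic estimate, since the model has already seen this data)
acc_decision_tree = round(clf.score(train_features, train_labels), 6)
print('Training-set accuracy (score): %.4f' % acc_decision_tree)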
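
Since this score is computed on the same data the tree was trained on, an unpruned tree will typically score close to 1.0. A hold-out split gives a second, less optimistic number. The sketch below uses synthetic data from make_classification rather than the Titanic features, and all variable names are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real feature matrix
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold back 25% of the rows; stratify keeps the class balance in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

tree_clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree_clf.fit(X_fit, y_fit)

fit_acc = tree_clf.score(X_fit, y_fit)  # accuracy on data the tree has seen
val_acc = tree_clf.score(X_val, y_val)  # accuracy on held-out data

print("fit accuracy: %.4f" % fit_acc)
print("validation accuracy: %.4f" % val_acc)
```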

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score
# Estimate the decision tree's accuracy with 10-fold cross-validation
print('Mean cross_val_score accuracy: %.4f' % np.mean(cross_val_score(clf, train_features, train_labels, cv=10)))
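
cross_val_score returns one score per fold, so the spread across folds can be inspected alongside the mean. A minimal sketch on synthetic data (the data, the explicit StratifiedKFold splitter, and all names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real feature matrix and labels
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# Explicit splitter: 10 stratified folds, shuffled with a fixed seed
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=folds)

print("per-fold accuracy:", np.round(scores, 3))
print("mean: %.4f, std: %.4f" % (scores.mean(), scores.std()))
```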

In [None]:
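# export_graphviz expects class names in ascending label order:
# 0 = did not survive, 1 = survived
class_name = ['Died', 'Survived']

from sklearn import tree
import graphviz

# Export the fitted tree as DOT source and wrap it for rendering
dot_data = tree.export_graphviz(clf
                               ,feature_names = vec.feature_names_
                               ,class_names = class_name
                               ,filled = True
                               ,rounded = True
                               ,out_file = None)
graph = graphviz.Source(dot_data)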

In [None]:
graph
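
If Graphviz is not available, sklearn's export_text offers a plain-text view of the same kind of tree. The sketch below uses synthetic data and a shallow tree; the feature names `f0` to `f3` and the other names are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with four features, standing in for the real feature matrix
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
toy_names = ["f0", "f1", "f2", "f3"]

# A depth limit keeps the printed rules short
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per split or leaf, indented by depth
rules = export_text(toy_tree, feature_names=toy_names)
print(rules)
```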

In [None]:
clf.feature_importances_  # importance of each feature

In [None]:
[*zip(vec.feature_names_,clf.feature_importances_)]
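
Pairing names with importances is easier to read when sorted. A minimal sketch using a pandas Series on synthetic data (the feature names and all variables are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with four features, two of them informative
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Label each importance with its feature name and sort from most to least important
ranked = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(ranked)
```

For a fitted tree with at least one split, the importances sum to 1, so each value can be read as a share of the total impurity reduction.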

### B2. Model Evaluation

### B3. Data Visualization

### B4. Report