# Decision Tree, Random Forest, Gradient Tree Boosting
## Patrick 🌰


In [1]:
# Import pandas under the alias pd.
import pandas as pd

# Read the Titanic passenger records over the internet and store them in the variable titanic.
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')


In [2]:
# Manually select pclass, age and sex as the features for predicting whether a passenger survived.
X = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']


In [3]:
# Replace missing ages with the mean age of all passengers, so that the model can be trained while distorting the prediction task as little as possible.
# Work on an explicit copy of the selected columns so that the assignment does not trigger pandas' SettingWithCopyWarning.
X = X.copy()
X['age'] = X['age'].fillna(X['age'].mean())
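The imputation step can be sketched on a small frame. This is a minimal illustration, assuming hypothetical toy values (not the Titanic data) and the hypothetical names `toy`, `features` and `mean_age`; it takes an explicit copy of the selected columns, which is one common way to avoid pandas' SettingWithCopyWarning.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame.
toy = pd.DataFrame({
    'pclass': ['1st', '2nd', '3rd', '3rd'],
    'age': [20.0, np.nan, 40.0, np.nan],
    'sex': ['female', 'male', 'male', 'female'],
})

# Take an explicit copy so the assignment below modifies an independent frame.
features = toy[['pclass', 'age', 'sex']].copy()

# Series.mean() skips NaN by default, so only the known ages are averaged.
mean_age = features['age'].mean()
features['age'] = features['age'].fillna(mean_age)

print(mean_age)
print(features['age'].tolist())
```

The original frame `toy` keeps its missing values; only the copy is filled.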
  


In [5]:
# Split the original data, holding out 25% of the passengers for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
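To make the effect of `test_size=0.25` concrete, here is a small sketch on hypothetical synthetic arrays (`X_demo`, `y_demo`). The `stratify` argument is an optional addition that the notebook itself does not use; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, 60 negatives and 40 positives.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo is an assumption for illustration, not part of the notebook's call.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=33, stratify=y_demo
)

print(X_tr.shape, X_te.shape)      # sizes of the training and test parts
print(y_tr.mean(), y_te.mean())    # share of positive labels in each part
```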


In [6]:
# Convert the categorical features into feature vectors.
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
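# Use the full spelling orient='records'; recent pandas versions reject abbreviations such as 'record'.
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))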
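What `DictVectorizer` produces is easier to see on two hand-written records. The sketch below assumes hypothetical passenger values and the hypothetical names `records`, `demo_vec`, `encoded` and `unseen`: string-valued fields are one-hot encoded as `name=value` columns, while numeric fields pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records in the same shape as DataFrame.to_dict(orient='records').
records = [
    {'pclass': '1st', 'age': 29.0, 'sex': 'female'},
    {'pclass': '3rd', 'age': 30.0, 'sex': 'male'},
]

demo_vec = DictVectorizer(sparse=False)
encoded = demo_vec.fit_transform(records)

# Column names learned during fitting, in sorted order.
print(demo_vec.feature_names_)
print(encoded)

# A category that was not seen during fitting ('2nd') gets no column of its own,
# so every pclass column is 0 for this row.
unseen = demo_vec.transform([{'pclass': '2nd', 'age': 5.0, 'sex': 'male'}])
print(unseen)
```

This is why the vectorizer is fitted on the training split only and then reused, via `transform`, on the test split: both matrices share the same columns.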


In [7]:
# Train a single decision tree and use it to predict on the test set.
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_pred = dtc.predict(X_test)


In [8]:
# Train a random forest classifier, an ensemble model, and use it to predict on the test set.
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)




In [9]:
# Train a gradient boosted decision tree classifier, an ensemble model, and use it to predict on the test set.
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)


In [11]:
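# Import classification_report from sklearn.metrics.
from sklearn.metrics import classification_report

# Print the accuracy of the single decision tree on the test set, along with the more detailed precision, recall and F1 metrics.
# Caution: classification_report expects (y_true, y_pred). The call below passes them reversed, as in the recorded run,
# so the precision and recall columns are swapped and support counts predicted labels rather than true labels.
# Use classification_report(y_test, dtc_y_pred) for the conventional report.
print('The accuracy of decision tree is', dtc.score(X_test, y_test))
print(classification_report(dtc_y_pred, y_test))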


The accuracy of decision tree is 0.7811550151975684
              precision    recall  f1-score   support

           0       0.91      0.78      0.84       236
           1       0.58      0.80      0.67        93

   micro avg       0.78      0.78      0.78       329
   macro avg       0.74      0.79      0.75       329
weighted avg       0.81      0.78      0.79       329
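scikit-learn's metric functions take the true labels first and the predictions second, while the report call above passes the predictions first. The sketch below uses hypothetical toy labels (`y_true_demo`, `y_pred_demo`) to show the consequence: reversing the two arguments exchanges precision and recall.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: one negative sample is wrongly predicted as positive.
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1]

# Conventional order: (y_true, y_pred).
p_correct = precision_score(y_true_demo, y_pred_demo)
r_correct = recall_score(y_true_demo, y_pred_demo)

# Reversed order: the roles of "truth" and "prediction" are exchanged.
p_swapped = precision_score(y_pred_demo, y_true_demo)
r_swapped = recall_score(y_pred_demo, y_true_demo)

print(p_correct, r_correct)
print(p_swapped, r_swapped)
```

Accuracy and the per-class F1 score are symmetric in the two arguments, so they are unaffected by the order.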



In [12]:
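# Print the accuracy of the random forest classifier on the test set, along with the more detailed precision, recall and F1 metrics.
# Caution: the arguments are passed as (y_pred, y_true), as in the recorded run, so the precision and recall columns are swapped.
# Use classification_report(y_test, rfc_y_pred) for the conventional report.
print('The accuracy of random forest classifier is', rfc.score(X_test, y_test))
print(classification_report(rfc_y_pred, y_test))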


The accuracy of random forest classifier is 0.7872340425531915
              precision    recall  f1-score   support

           0       0.90      0.79      0.84       230
           1       0.61      0.79      0.69        99

   micro avg       0.79      0.79      0.79       329
   macro avg       0.76      0.79      0.76       329
weighted avg       0.81      0.79      0.79       329



In [13]:
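# Print the accuracy of the gradient boosted decision trees on the test set, along with the more detailed precision, recall and F1 metrics.
# Caution: the arguments are passed as (y_pred, y_true), as in the recorded run, so the precision and recall columns are swapped.
# Use classification_report(y_test, gbc_y_pred) for the conventional report.
print('The accuracy of gradient tree boosting is', gbc.score(X_test, y_test))
print(classification_report(gbc_y_pred, y_test))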


The accuracy of gradient tree boosting is 0.790273556231003
              precision    recall  f1-score   support

           0       0.92      0.78      0.84       239
           1       0.58      0.82      0.68        90

   micro avg       0.79      0.79      0.79       329
   macro avg       0.75      0.80      0.76       329
weighted avg       0.83      0.79      0.80       329



* Ensemble models are arguably the most common choice in real-world applications. Unlike a single learner, an ensemble combines several different models, or many models of the same type fitted repeatedly. Because parameter estimation is itself subject to randomness and therefore uncertain, the combined model tends to perform better and more stably than a single model, at the cost of longer training time.
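One way to probe the stability claim is to refit each model with different random seeds and compare how much the test accuracy moves. The following is a rough sketch on synthetic data from `make_classification`, not on the Titanic data; the helper `accuracy_over_seeds` and all variable names are hypothetical, and the outcome will differ from dataset to dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with some label noise (an assumption).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)


def accuracy_over_seeds(make_model, seeds):
    """Fit one model per seed on the same split and return the test accuracies."""
    scores = []
    for seed in seeds:
        model = make_model(seed)
        model.fit(X_a, y_a)
        scores.append(model.score(X_b, y_b))
    return np.array(scores)


seeds = range(5)
tree_scores = accuracy_over_seeds(
    lambda s: DecisionTreeClassifier(random_state=s), seeds
)
forest_scores = accuracy_over_seeds(
    lambda s: RandomForestClassifier(n_estimators=50, random_state=s), seeds
)

# Mean and spread of the test accuracy across seeds for each model family.
print('tree   mean=%.3f std=%.3f' % (tree_scores.mean(), tree_scores.std()))
print('forest mean=%.3f std=%.3f' % (forest_scores.mean(), forest_scores.std()))
```

Fixing `random_state` on the estimators, as done here, also makes a run reproducible; the notebook's cells leave it unset, so their scores can vary slightly between runs.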