This note pulls together resources on decision trees into a single tutorial whose code runs from start to finish.

Main reference:
Hugo Bowne-Anderson
January 3rd, 2018
Kaggle Tutorial: Your First Machine Learning Model
https://www.datacamp.com/community/tutorials/kaggle-tutorial-machine-learning

- To get the data, first install the `kaggle` command-line tool: `pip install kaggle`
- Then download the data: `kaggle competitions download -c titanic`
- For documentation, see [kaggle](https://github.com/Kaggle/kaggle-api)

In [112]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Figures inline and set visualization style
%matplotlib inline
sns.set()

In [113]:
import os
path_orig = r"C:\Users\lijiaxiang\.kaggle"  # raw string, so the backslashes are never read as escapes
path_orig2 = os.path.join(path_orig,"competitions","titanic")

In [114]:
# Import data
df_train = pd.read_csv(os.path.join(path_orig2,"train.csv"))
df_test = pd.read_csv(os.path.join(path_orig2,"test.csv"))

In [115]:
survived_train = df_train.Survived

Save the target `y` of the training set.

In [116]:
data = pd.concat([df_train.drop("Survived",axis =1),df_test])

Concatenating the two sets means the preprocessing is applied identically to both.

In [117]:
print data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB
None


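There are missing values: `Age` (1046 of 1309 non-null), `Fare` (1308), `Cabin` (295) and `Embarked` (1307).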

In [118]:
data.Age = data.Age.fillna(data.Age.median())
data.Fare = data.Fare.fillna(data.Fare.median())
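As a small illustration of what the median fill does, here is a sketch on a made-up column (the numbers are invented for the example and are not Titanic data):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up numbers, not Titanic data).
age = pd.Series([22.0, 38.0, np.nan, 35.0, 26.0])

# median() skips NaN by default, so it is computed from the 4 known values.
median_age = age.median()

# fillna() returns a new Series; the original `age` is left unchanged.
age_filled = age.fillna(median_age)

print(median_age)
print(age_filled.isnull().sum())
```

Note that the notebook computes the median over the combined training and test rows, which is the simplest choice but lets the test rows influence the fill value.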

In [119]:
print data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1309 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB
None


`Cabin` and `Embarked` still have missing values. Neither is used as a feature below, so they are left as they are.

In [120]:
data = pd.get_dummies(data,columns=["Sex"],drop_first=True)

`drop_first=True` drops the first level, which becomes the reference group. Here that level is `female`, so only `Sex_male` remains.
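A minimal sketch of the effect of `drop_first` on a toy frame (the four rows are made up for illustration):

```python
import pandas as pd

# Toy frame with the same two levels as the Titanic `Sex` column.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Without drop_first: one indicator column per level.
full = pd.get_dummies(toy, columns=["Sex"])

# With drop_first=True: the first level (alphabetically, "female") is dropped
# and acts as the reference group.
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With two levels the dropped column carries no extra information: `Sex_female` is always the complement of `Sex_male`.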

In [121]:
print data.head()

   PassengerId  Pclass                                               Name  \
0            1       3                            Braund, Mr. Owen Harris   
1            2       1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2            3       3                             Heikkinen, Miss. Laina   
3            4       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4            5       3                           Allen, Mr. William Henry   

    Age  SibSp  Parch            Ticket     Fare Cabin Embarked  Sex_male  
0  22.0      1      0         A/5 21171   7.2500   NaN        S         1  
1  38.0      1      0          PC 17599  71.2833   C85        C         0  
2  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S         0  
3  35.0      1      0            113803  53.1000  C123        S         0  
4  35.0      0      0            373450   8.0500   NaN        S         1  


To keep things simple, we train on only a subset of the variables.

In [122]:
data.columns

Index([u'PassengerId', u'Pclass', u'Name', u'Age', u'SibSp', u'Parch',
       u'Ticket', u'Fare', u'Cabin', u'Embarked', u'Sex_male'],
      dtype='object')

In [123]:
data = data[['Sex_male','Fare','Age','Pclass','SibSp']]

In [124]:
print data.head()

   Sex_male     Fare   Age  Pclass  SibSp
0         1   7.2500  22.0       3      1
1         0  71.2833  38.0       1      1
2         0   7.9250  26.0       3      0
3         0  53.1000  35.0       1      1
4         1   8.0500  35.0       3      0


In [125]:
print data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 5 columns):
Sex_male    1309 non-null uint8
Fare        1309 non-null float64
Age         1309 non-null float64
Pclass      1309 non-null int64
SibSp       1309 non-null int64
dtypes: float64(2), int64(2), uint8(1)
memory usage: 52.4 KB
None


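The concatenation preserved row order: the first 891 rows are the training set and the rest are the test set, so we split at position 891.

Next comes scikit-learn, whose estimators work on `np.array` data (a DataFrame would be converted internally), so we extract `.values` explicitly.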

In [126]:
data_train = data[:891]
data_test = data[891:]

In [127]:
X = data_train.values
y = survived_train
test = data_test.values
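The hard-coded 891 is the number of training rows. Below is a minimal sketch of the same split that reads the boundary from the data instead. The frames `toy_train` and `toy_test` and their values are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for df_train / df_test.
toy_train = pd.DataFrame({"Fare": [7.25, 71.28, 7.92], "Survived": [0, 1, 1]})
toy_test = pd.DataFrame({"Fare": [8.05, 53.10]})

# Remember how many rows belong to the training set before concatenating.
n_train = len(toy_train)
combined = pd.concat([toy_train.drop("Survived", axis=1), toy_test])

# concat keeps each frame's own index, so labels repeat (0, 1, 2, 0, 1).
# Split by position with iloc rather than by label.
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]

print(list(combined.index))
print(len(train_part), len(test_part))
```

The repeated index is also why `data.info()` in the notebook reports `1309 entries, 0 to 417`.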

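The tree here is constrained by depth, in the level-wise sense, rather than grown leaf-wise. We cap it at `max_depth = 3`.

No tuning is involved at this point, so don't worry yet about why the value is 3. It is a rule of thumb, and other values are compared during hyperparameter tuning.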

In [128]:
clf = tree.DecisionTreeClassifier(max_depth=3)
print clf.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')


Many of these parameters may be unfamiliar. That is fine: they are the library's default settings.
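One default worth unpacking is `criterion='gini'`. Gini impurity of a node is $1 - \sum_k p_k^2$, where $p_k$ is the share of class $k$ in the node, and a split is scored by the size-weighted impurity of its children. The sketch below implements this textbook definition in plain Python. The helper names `gini_impurity` and `weighted_split_impurity` are hypothetical, the labels are made up, and this is not scikit-learn's internal code.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / float(n)) ** 2 for c in counts.values())


def weighted_split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = float(len(left) + len(right))
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Made-up node: 3 survivors (1) and 5 non-survivors (0).
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

parent_gini = gini_impurity(parent)                 # 1 - (9/64 + 25/64)
split_gini = weighted_split_impurity(left, right)   # 0.5 * 0.375 + 0.5 * 0.0
print(parent_gini, split_gini)
```

The split lowers impurity from 0.46875 to 0.1875, which is the kind of reduction a tree looks for when choosing a split.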

The last step is to call `.predict()`.

In [129]:
Y_pred = clf.predict(test)
df_test["Survived"] = Y_pred

In [130]:
df_test[["PassengerId","Survived"]].to_csv("titanic_18032201.csv",index = False)

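All of these results are saved so that past submissions can later be combined in an ensemble.

The rest of this note covers some decision tree theory and visualization. With a model this simple, we should be able to explain it to others in plain terms.

`max_depth` controls overfitting, so let's compare it against values $\neq 3$.

We hold out a random part of the training set for testing.

`train_test_split` returns a `list`.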

In [131]:
print type(train_test_split(X,y,test_size = 0.33, random_state = 42,stratify = y))

<type 'list'>


In [132]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)
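To make the returned list and the `stratify` argument concrete, here is a small sketch on made-up data (the arrays `X_toy` and `y_toy` are illustrative, with a 60/40 class balance):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, 12 of class 0 and 8 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# The call returns one list holding four arrays, in this order:
# X train, X test, y train, y test.
parts = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)
X_tr, X_te, y_tr, y_te = parts

# stratify=y_toy keeps the 60/40 class balance in both parts.
print(type(parts), len(parts))
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set can end up with a noticeably different survival rate from the training part, which makes the accuracy comparison noisier.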

In [133]:
dep = np.arange(1,9)

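Unlike `range`, `np.arange` returns an `array`.

Allocate arrays of the same length as `dep` to hold the accuracies. `np.empty` does not initialize its contents, so the values printed next are arbitrary leftovers in memory, not results.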

In [134]:
train_accuracy = np.empty(len(dep))
test_accuracy = np.empty(len(dep))

In [135]:
print train_accuracy, test_accuracy

[ 0.78305085  0.77627119  0.79661017  0.8         0.79322034  0.8         0.8
  0.79322034] [ 0.7885906   0.80536913  0.83053691  0.84731544  0.86744966  0.87919463
  0.88758389  0.90771812]
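Because `np.empty` leaves whatever was in memory, an unfilled slot cannot be told apart from a real score. One alternative, shown here as a sketch rather than as the notebook's method, is to pre-fill with `NaN` (the name `scores` and the value 0.78 are made up):

```python
import numpy as np

depths = np.arange(1, 9)

# Pre-fill with NaN so that any slot not yet written is obvious.
scores = np.full(len(depths), np.nan)

# Pretend only the first depth has been evaluated so far.
scores[0] = 0.78

unfilled = int(np.isnan(scores).sum())
print(scores)
print(unfilled)
```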


In [136]:
for i, k in enumerate(dep):
    print i,k

0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8


Here `i` serves as the index into the accuracy arrays, and `k` is the `max_depth` value being tried.

In [137]:
for i, k in enumerate(dep):
    clf = tree.DecisionTreeClassifier(max_depth=k)
    clf.fit(X_train,y_train)
    train_accuracy[i] = clf.score(X_train,y_train)
    test_accuracy[i]  = clf.score(X_test,y_test)
print train_accuracy,test_accuracy

[ 0.7885906   0.80536913  0.83053691  0.84731544  0.86744966  0.87919463
  0.88758389  0.90771812] [ 0.78305085  0.77627119  0.79661017  0.79661017  0.79322034  0.8
  0.80338983  0.79322034]


```
plt.title(u"训练集和测试集的Acc比较")
plt.plot(dep,train_accuracy,label = u"训练集Acc")
plt.plot(dep,test_accuracy,label = u"测试集Acc")
plt.legend()
plt.xlabel(u"level-wise决策树深度选择")
plt.ylabel("Acc")
plt.show()
```
![](http://p24kaozv6.bkt.clouddn.com/decisiontreeacc.png)

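Note: for the Chinese labels to display, remember to prefix each string with `u`.
In the accuracies printed above, test accuracy is highest at `max_depth` 7 (0.803), with 3 (0.797) close behind, so we choose 3.
Why not 7? Experience: at depth 7, training accuracy (0.888) is far above test accuracy, a sign of overfitting, and the shallower tree is easier to explain.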
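The comparison can be checked numerically. The sketch below copies the accuracies printed by the loop above into arrays and computes the best depth and the train/test gap. The variable names `best_depth` and `gap` are introduced for illustration. Since the tree is fitted without a fixed `random_state`, a rerun may give slightly different numbers.

```python
import numpy as np

dep = np.arange(1, 9)

# Accuracies copied from the printed output of the max_depth loop.
train_accuracy = np.array([0.7885906, 0.80536913, 0.83053691, 0.84731544,
                           0.86744966, 0.87919463, 0.88758389, 0.90771812])
test_accuracy = np.array([0.78305085, 0.77627119, 0.79661017, 0.79661017,
                          0.79322034, 0.8, 0.80338983, 0.79322034])

# Depth with the highest held-out accuracy.
best_depth = int(dep[np.argmax(test_accuracy)])

# Train/test gap per depth; a growing gap suggests overfitting.
gap = train_accuracy - test_accuracy

print(best_depth)
print(np.round(gap, 3))
```

Depth 7 wins on test accuracy by less than one percentage point over depth 3, while its gap to training accuracy is more than twice as large (about 0.084 against 0.034).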

Next up is random forests, because a single decision tree without bagging is bound to run into trouble sooner or later.