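# Four steps to build a model
Building a machine learning model in Scikit-learn is straightforward. The workflow is roughly the following four steps:

1. Load the data and check its shape (how many samples (rows), how many features (columns), and what type the label is).  
**Read a .csv file with pandas: pd.read_csv  
Read a .txt file with numpy: np.loadtxt  
Use a dataset bundled with Scikit-learn: sklearn.datasets.load_xxx  
Check the data size: data.shape (data should be an np.array or a dataframe)**  
2. Split the data into training (train) and test (test) sets  
train_test_split(data)  
3. Create the model and fit it to the training data  
clf = DecisionTreeClassifier()  
clf.fit(x_train, y_train)  
4. Feed the test features into the trained model to get predictions, then evaluate them against the test labels (y_test)  
clf.predict(x_test)  
accuracy_score(y_test, y_pred)  
f1_score(y_test, y_pred)  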
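
Step 1 is the only step the cells in this notebook do not demonstrate in full, so here is a minimal sketch of the three loading routes and the shape check. The CSV and text contents are made-up, in-memory stand-ins for real files (an assumption for illustration); with real data you would pass a file path instead of `io.StringIO(...)`.

```python
import io

import numpy as np
import pandas as pd
from sklearn import datasets

# Hypothetical CSV content standing in for a real .csv file on disk.
csv_text = "height,weight,label\n1.70,65,0\n1.82,80,1\n1.60,52,0\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (rows, columns); the label column is counted too
print(df.dtypes)  # type of every column, including the label

# Hypothetical whitespace-delimited content standing in for a .txt file.
txt_text = "1.0 2.0 3.0\n4.0 5.0 6.0\n"
arr = np.loadtxt(io.StringIO(txt_text))
print(arr.shape)

# Bundled dataset: features and labels come as separate arrays.
iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)
print(iris.target.dtype, np.unique(iris.target))  # label type and classes
```

Checking `np.unique` on the label array is a quick way to tell whether the task is classification (a few discrete values) or regression (many continuous values).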

In [1]:
from sklearn import datasets, metrics
# Use DecisionTreeClassifier for classification problems and DecisionTreeRegressor for regression problems
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

In [11]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Create the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

print(iris.feature_names)

print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.         0.01796599 0.05992368 0.92211033]
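
The step list also names `f1_score`, which the cell above does not call. A minimal sketch follows, with two additions that are not in the original: because iris has three classes, `f1_score` needs an explicit `average` argument (the default `'binary'` raises an error on multiclass labels), and `random_state=0` is set on the tree so the printout is repeatable. It also pairs each importance with its feature name, which is easier to read than the raw array.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# random_state on the tree is an addition for repeatable results.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Macro averaging: compute F1 per class, then take the unweighted mean.
macro_f1 = metrics.f1_score(y_test, y_pred, average="macro")
print("Macro F1:", macro_f1)

# Pair feature names with importances and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:20s} {importance:.3f}")
```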


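# Homework
1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.  
2. Switch to other datasets (boston, wine) and compare the results with those of the regression model.

**Adjusting the parameters of DecisionTreeClassifier(...)**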

In [6]:
import numpy as np
from IPython.display import Image  
from sklearn import tree
from sklearn.tree import export_graphviz
import pydotplus

In [18]:
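# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Create the model
clf = DecisionTreeClassifier(
        criterion = 'entropy',   # 'gini' (default) or 'entropy'
        max_depth = 3,           # maximum tree depth; unlimited if unset. Can be left unset with few samples or features; with many, a value between 10 and 100 is suggested.
        min_samples_split = 20,  # minimum number of samples an internal node needs before it can be split; can be left unset for small datasets. default=2
        min_samples_leaf = 5,    # minimum number of samples at a leaf node. default=1
)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)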

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.         0.         0.06058349 0.93941651]


→ Accuracy is unchanged, but the feature importances differ.
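
Both runs misclassify one of the 38 test samples, so a single parameter setting says little about how sensitive the model is. A small sketch that sweeps `max_depth` makes the effect easier to see. The depth values and `random_state=0` are choices made here for illustration, not part of the original homework.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

results = {}
for depth in (1, 2, 3, None):  # None means no depth limit
    clf = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    clf.fit(x_train, y_train)
    acc = accuracy_score(y_test, clf.predict(x_test))
    # get_depth() reports the depth the fitted tree actually reached.
    results[depth] = (acc, clf.get_depth())
    print(f"max_depth={depth}: accuracy={acc:.4f}, actual depth={clf.get_depth()}")
```

A depth-1 tree can only separate two groups, so it should score noticeably lower on a three-class problem. Beyond a certain depth the test accuracy typically stops changing on a dataset this small.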

**Switching to other datasets (boston, wine) and comparing with the regression model results**

In [33]:
Breast_cancer = datasets.load_breast_cancer()
wine = datasets.load_wine()

**Breast_cancer**  -  DecisionTreeClassifier

In [36]:
print(Breast_cancer.data.shape)
print(Breast_cancer.feature_names)

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(Breast_cancer.data, Breast_cancer.target, test_size=0.25, random_state=4)

# Create the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

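print(Breast_cancer.feature_names)
print("Feature importance: ", clf.feature_importances_)

(569, 30)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Accuracy:  0.9090909090909091
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']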
Feature importance:  [0.         0.         0.         0.         0.0092534  0.
 0.         0.         0.00991435 0.         0.00660957 0.
 0.         0.00494089 0.         0.         0.         0.
 0.00941052 0.         0.         0.059601   0.75416545 0.04898212
 0.05147346 0.00793148 0.         0.03771776 0.         0.        ]

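Compared with the linear regression results in [Day_038_HW.ipynb](https://github.com/Erincatcat/ML100-Days/blob/master/Homework/Day_038_HW.ipynb), accuracy drops from 0.96 to 0.89 (0.909 in the run shown above; the tree is created without a fixed random_state, so the score can vary between runs).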
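
Comparing against a number from another notebook depends on both runs using the same split and settings. A side-by-side sketch in one script removes that dependence. The linked notebook is not reproduced here, so the baseline is an assumption: `LogisticRegression` with feature scaling stands in for the Day 38 model, and 5-fold cross-validation replaces the single train/test split.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

Breast_cancer = datasets.load_breast_cancer()
X, y = Breast_cancer.data, Breast_cancer.target

# Assumed baseline: scaled logistic regression (not taken from Day_038_HW).
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# random_state fixed so the tree's scores are repeatable.
tree_clf = DecisionTreeClassifier(random_state=0)

# Same folds for both models, accuracy as the metric.
logit_scores = cross_val_score(logit, X, y, cv=5, scoring="accuracy")
tree_scores = cross_val_score(tree_clf, X, y, cv=5, scoring="accuracy")

print("Logistic regression:", logit_scores.mean(), "+/-", logit_scores.std())
print("Decision tree:      ", tree_scores.mean(), "+/-", tree_scores.std())
```

Reporting the mean and spread over folds shows whether a gap between the two models is larger than the run-to-run variation.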

**Wine**

In [37]:
print(wine.data.shape)
print(wine.feature_names)

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

# Create the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

print(wine.feature_names)
print("Feature importance: ", clf.feature_importances_)

(178, 13)
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Accuracy:  0.9111111111111111
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.04440705 0.         0.         0.         0.         0.06142526
 0.         0.         0.         0.38107601 0.         0.12444169
 0.38865   ]


Compared with the linear regression results in [Day_038_HW.ipynb](https://github.com/Erincatcat/ML100-Days/blob/master/Homework/Day_038_HW.ipynb), accuracy drops from 1 to 0.911.
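
The homework also names boston, a regression dataset, and the import cell brings in `DecisionTreeRegressor`, but no cell uses either. `load_boston` was removed in scikit-learn 1.2, so the sketch below substitutes `load_diabetes` (an assumption made here, not the dataset the homework names). It compares a tree regressor with `LinearRegression` on the same split.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset; load_boston is no longer shipped.
diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=4
)

# random_state fixed on the tree for repeatable results.
tree_reg = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)
lin_reg = LinearRegression().fit(x_train, y_train)

# Regression targets are continuous, so use MSE and R^2, not accuracy.
tree_pred = tree_reg.predict(x_test)
lin_pred = lin_reg.predict(x_test)
tree_mse, tree_r2 = mean_squared_error(y_test, tree_pred), r2_score(y_test, tree_pred)
lin_mse, lin_r2 = mean_squared_error(y_test, lin_pred), r2_score(y_test, lin_pred)

print(f"DecisionTreeRegressor: MSE={tree_mse:.1f}, R2={tree_r2:.3f}")
print(f"LinearRegression:      MSE={lin_mse:.1f}, R2={lin_r2:.3f}")
```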

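# Supplementary notes:

You can install the extra package graphviz to draw the decision tree, which helps in understanding the rules the model uses to classify.  
For details, see the article [Creating and Visualizing Decision Trees with Python](https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176)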
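
As a rough sketch of that workflow, `export_graphviz` with `out_file=None` returns the tree as a DOT-format string, and `export_text` prints the same rules as plain text with no extra packages. Rendering the DOT string to an image needs the graphviz binary plus a wrapper such as pydotplus, so that part is left as comments. `max_depth=3` and `random_state=0` are choices made here to keep the tree small and repeatable.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

iris = datasets.load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# DOT description of the tree; out_file=None returns it as a string.
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
)
print(dot_data[:200])

# Plain-text view of the split rules, no graphviz required.
tree_text = export_text(clf, feature_names=list(iris.feature_names))
print(tree_text)

# With graphviz and pydotplus installed, the image can be shown in a notebook:
# import pydotplus
# from IPython.display import Image
# graph = pydotplus.graph_from_dot_data(dot_data)
# Image(graph.create_png())
```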