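## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels. Keep in mind that in any future project the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each hyperparameter means, and try adjusting them to see how they affect the final predictions.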

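## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.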
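
One way to approach the first task is to loop over a few hyperparameter values and record the test accuracy for each. The sketch below assumes the same wine split used later in this notebook. The grid values and the fixed `random_state=0` on the tree are illustrative choices, not part of the assignment.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# Hypothetical grid: these values are only examples to show the pattern.
results = {}
for max_depth in (1, 2, 3, None):
    for min_samples_leaf in (1, 3, 5):
        clf = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf,
            random_state=0,  # fixed so that repeated runs give the same tree
        )
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        results[(max_depth, min_samples_leaf)] = metrics.accuracy_score(y_test, y_pred)

for (max_depth, min_samples_leaf), acc in results.items():
    print(f"max_depth={max_depth}, min_samples_leaf={min_samples_leaf}: accuracy={acc:.3f}")
```

Fixing `random_state` on the tree makes it easier to attribute a change in accuracy to the hyperparameter rather than to random tie-breaking between splits.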

In [1]:
from sklearn import datasets, metrics

# Use DecisionTreeClassifier for classification problems and DecisionTreeRegressor for regression problems
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

In [2]:
# Load the dataset
wine = datasets.load_wine()

In [5]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [6]:
wine.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')
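
Before training, it can help to confirm the shape of the feature matrix and count the labels directly rather than relying on the header line of the description text. A minimal sketch, assuming only NumPy and the bundled wine dataset:

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()

# Feature matrix: one row per sample, one column per feature.
print("data shape:", wine.data.shape)

# Count how many samples carry each integer label (0, 1, 2).
class_counts = np.bincount(wine.target)
for name, count in zip(wine.target_names, class_counts):
    print(f"{name}: {count} samples")
```

If the classes turn out to be unevenly sized, passing `stratify=wine.target` to `train_test_split` is one option for keeping the class proportions similar in the training and test sets.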

In [7]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

# Build the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [8]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9111111111111111


In [9]:
print(wine.feature_names)
print("Feature importance: ", clf.feature_importances_)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.01364138 0.04922508 0.         0.         0.         0.04296585
 0.08158611 0.         0.         0.38107601 0.         0.04285558
 0.38865   ]
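
The raw importance array is easier to read when each value is paired with its feature name and sorted. The sketch below refits a tree on the same split. The fixed `random_state=0` on the tree is an assumption added for reproducibility, so the exact numbers may differ from the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)
clf = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# Sort feature indices from most to least important.
order = np.argsort(clf.feature_importances_)[::-1]
ranked = [(wine.feature_names[i], float(clf.feature_importances_[i])) for i in order]

# Print only the features the tree actually used for a split.
for name, importance in ranked:
    if importance > 0:
        print(f"{name:30s} {importance:.3f}")
```

A feature with importance 0 was never chosen for a split in this particular tree. That does not by itself mean the feature carries no information.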


In [10]:
# Build model 2
clf2 = DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=None, min_samples_split=2,
                              min_samples_leaf=3, min_weight_fraction_leaf=0.0, max_features=None,
                              random_state=None, max_leaf_nodes=None, class_weight=None)

In [11]:
# Train the model
clf2.fit(x_train, y_train)
# Predict on the test set
y_pred2 = clf2.predict(x_test)
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred2))

Accuracy:  0.9555555555555556
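
With 45 test samples, the gap between the two accuracies above amounts to two samples, and both trees were built with `random_state=None`, so a single split is a noisy basis for comparison. One hedged way to check whether the difference holds up is 5-fold cross-validation over the whole dataset. The fold count and the fixed `random_state=0` below are assumptions, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()

# Two configurations mirroring the default tree and the tuned tree.
candidates = {
    "default": DecisionTreeClassifier(random_state=0),
    "entropy, min_samples_leaf=3": DecisionTreeClassifier(
        criterion="entropy", min_samples_leaf=3, random_state=0
    ),
}

cv_scores = {}
for label, model in candidates.items():
    # For classifiers, an integer cv uses stratified folds.
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    cv_scores[label] = scores
    print(f"{label}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the two means lie within roughly one standard deviation of each other, the single-split difference is probably not meaningful.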


In [12]:
print(wine.feature_names)
print("Feature importance: ", clf2.feature_importances_)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.         0.         0.         0.         0.07873204 0.
 0.42020182 0.         0.         0.14366333 0.         0.
 0.35740281]


# Comparison with the regression model's results

In [13]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [18]:
# load_boston is only available in scikit-learn versions before 1.2
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=6)

In [19]:
model = linear_model.LinearRegression()
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print("r2_score:",r2_score(y_test,y_pred))
print("MSE:",mean_squared_error(y_test,y_pred))

r2_score: 0.7023602830180025
MSE: 25.65683332692814


In [20]:
regr = DecisionTreeRegressor()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)

In [21]:
print("r2_score:",r2_score(y_test,y_pred))
print("MSE:",mean_squared_error(y_test,y_pred))

r2_score: 0.5389989394646111
MSE: 39.73874015748032
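
A default `DecisionTreeRegressor` grows until its leaves are pure, which usually means it fits the training set far better than the test set. The sketch below compares training and test R² for a few `min_samples_leaf` values. It uses the bundled diabetes dataset as a stand-in, because `load_boston` is not available in recent scikit-learn releases, so its numbers are not comparable with the Boston results in this notebook. The leaf sizes and `random_state=0` are illustrative assumptions.

```python
from sklearn import datasets
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in dataset (assumption): diabetes ships with scikit-learn and needs no download.
data = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=6
)

scores = {}
for min_samples_leaf in (1, 3, 10):
    regr = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, random_state=0)
    regr.fit(x_train, y_train)
    train_r2 = r2_score(y_train, regr.predict(x_train))
    test_r2 = r2_score(y_test, regr.predict(x_test))
    scores[min_samples_leaf] = (train_r2, test_r2)
    print(f"min_samples_leaf={min_samples_leaf}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```

A large gap between training and test R² at `min_samples_leaf=1` points to overfitting. Requiring more samples per leaf trades some training fit for a smoother model.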


The supported criteria are “mse” for the mean squared error, which is equivalent to variance reduction as a feature selection criterion and minimizes the L2 loss by predicting the mean of each terminal node; “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits; and “mae” for the mean absolute error, which minimizes the L1 loss by predicting the median of each terminal node. In scikit-learn 1.0 and later, “mse” and “mae” are named “squared_error” and “absolute_error”.
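
The claim that the mean minimizes the L2 loss and the median minimizes the L1 loss within a leaf can be checked numerically. The sketch below uses made-up target values for a single hypothetical leaf and searches a grid of candidate predictions. It illustrates the principle and is not how scikit-learn computes leaf values internally.

```python
import numpy as np

# Hypothetical target values falling into one leaf; 50.0 acts as an outlier.
leaf_targets = np.array([10.0, 12.0, 13.0, 15.0, 50.0])

# Candidate constant predictions for the leaf, spaced 0.01 apart.
candidates = np.linspace(leaf_targets.min(), leaf_targets.max(), 4001)

# Mean squared error and mean absolute error of each candidate prediction.
errors = leaf_targets[None, :] - candidates[:, None]
l2_loss = (errors ** 2).mean(axis=1)
l1_loss = np.abs(errors).mean(axis=1)

best_l2 = candidates[l2_loss.argmin()]
best_l1 = candidates[l1_loss.argmin()]

print("L2 minimizer:", round(best_l2, 2), "| mean:  ", leaf_targets.mean())
print("L1 minimizer:", round(best_l1, 2), "| median:", np.median(leaf_targets))
```

The outlier pulls the mean up to 20 while the median stays at 13, which is why the absolute-error criterion tends to be less sensitive to extreme target values.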

In [22]:
# criterion='mae' is spelled 'absolute_error' in scikit-learn 1.0 and later
regr2 = DecisionTreeRegressor(criterion='mae', splitter='best', max_depth=None, min_samples_split=2,
                              min_samples_leaf=3, min_weight_fraction_leaf=0.0, max_features=None,
                              random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0)
regr2.fit(x_train, y_train)
y_pred2 = regr2.predict(x_test)

print("r2_score:",r2_score(y_test,y_pred2))
print("MSE:",mean_squared_error(y_test,y_pred2))

r2_score: 0.7847120425249876
MSE: 18.558031496062995
