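## [Assignment Highlights]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels. Whatever the project, remember that the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces decision trees, a very important model. Make sure you understand what each hyperparameter means, and try adjusting them to see how they affect the final predictions.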
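
One quick way to see which hyperparameters exist, and what their defaults are, is to inspect the estimator's `get_params()` dictionary. The short sketch below prints only a few of them; the selection in `keys_of_interest` is an illustrative choice, not part of the assignment.

```python
from sklearn.tree import DecisionTreeClassifier

# Instantiate with defaults and read back the hyperparameters.
clf = DecisionTreeClassifier()
params = clf.get_params()

# Illustrative subset of hyperparameters that commonly affect tree size.
keys_of_interest = ["criterion", "max_depth", "min_samples_split", "min_samples_leaf"]
for name in keys_of_interest:
    print(f"{name} = {params[name]}")
```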

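## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to another dataset (boston, wine) and compare the results with those of a regression model.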

## Assignment 1

In [1]:
from sklearn import datasets, metrics
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

In [2]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

# Evaluate accuracy on the test set
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158


In [3]:
# Change the min_samples_split parameter of DecisionTreeClassifier() (default=2)

x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)
clf = DecisionTreeClassifier(min_samples_split=100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.6842105263157895
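
The drop in accuracy is likely explained by the size of the training set. With `test_size=0.25`, only 112 of the 150 iris samples are used for training, so with `min_samples_split=100` only the root node is large enough to be split. A minimal sketch to check this uses the fitted tree's `get_depth()` and `get_n_leaves()` methods (`random_state=0` on the classifier is an added assumption for reproducibility):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Same setting as above; random_state is fixed only for reproducibility.
clf = DecisionTreeClassifier(min_samples_split=100, random_state=0)
clf.fit(x_train, y_train)

print("training samples:", x_train.shape[0])
print("tree depth:", clf.get_depth())
print("number of leaves:", clf.get_n_leaves())
```

A tree with only two leaves can predict at most two of the three classes, which is consistent with the lower accuracy.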


## Assignment 2

In [4]:
# Use the wine classification dataset with a decision tree classifier
wine = datasets.load_wine()
wine

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [5]:
wine_data = wine['data']
target = wine['target']
print(f'wine_data shape: {wine_data.shape}')
print(f'target shape: {target.shape}')

wine_data shape: (178, 13)
target shape: (178,)


In [6]:
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(wine_data, target, test_size=0.1, random_state=2)

# Compare how different tree depths affect model accuracy
for i in range(1, 16):
    # Instantiate the decision tree model
    clf = DecisionTreeClassifier(max_depth=i)

    # Train the model
    clf.fit(x_train, y_train)

    # Predict on the test set
    y_pred = clf.predict(x_test)

    # Evaluate prediction accuracy
    acc = metrics.accuracy_score(y_test, y_pred)
    print(f'max_depth={i}')
    print(f"Accuracy: {round(acc,5)}", end='\n\n')


max_depth=1
Accuracy: 0.66667

max_depth=2
Accuracy: 0.88889

max_depth=3
Accuracy: 0.94444

max_depth=4
Accuracy: 0.94444

max_depth=5
Accuracy: 0.94444

max_depth=6
Accuracy: 0.94444

max_depth=7
Accuracy: 0.94444

max_depth=8
Accuracy: 0.94444

max_depth=9
Accuracy: 0.94444

max_depth=10
Accuracy: 0.94444

max_depth=11
Accuracy: 0.94444

max_depth=12
Accuracy: 0.94444

max_depth=13
Accuracy: 0.94444

max_depth=14
Accuracy: 0.94444

max_depth=15
Accuracy: 0.94444
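
Because the test set here contains only 18 samples, each misclassified sample shifts the accuracy by about 5.6 percentage points, so the plateau at 0.94444 corresponds to a single error. One possible way to get a steadier comparison is k-fold cross-validation. The sketch below uses `cross_val_score` with 5 folds and a fixed `random_state`, both of which are assumptions added here rather than part of the original assignment.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()

mean_scores = {}
for depth in range(1, 6):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 5-fold cross-validated accuracy on the full dataset
    scores = cross_val_score(clf, wine.data, wine.target, cv=5, scoring="accuracy")
    mean_scores[depth] = scores.mean()
    print(f"max_depth={depth}  mean accuracy={scores.mean():.3f}  std={scores.std():.3f}")
```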



In [7]:
print('Feature importance:', end='\n\n')
for i, j in zip(wine.feature_names, clf.feature_importances_):
    print(f'{i} - {round(j, 3)}')

Feature importance:

alcohol - 0.017
malic_acid - 0.0
ash - 0.0
alcalinity_of_ash - 0.0
magnesium - 0.0
total_phenols - 0.0
flavanoids - 0.396
nonflavanoid_phenols - 0.0
proanthocyanins - 0.0
color_intensity - 0.388
hue - 0.019
od280/od315_of_diluted_wines - 0.036
proline - 0.145
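
The importances are easier to read when sorted. The following sketch refits a tree on the same split and lists the features from most to least important. `random_state=0` on the classifier is an added assumption, so the exact values may differ from the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.1, random_state=2
)

clf = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# Indices of features ordered from highest to lowest importance
order = np.argsort(clf.feature_importances_)[::-1]
for idx in order:
    print(f"{wine.feature_names[idx]:<30} {clf.feature_importances_[idx]:.3f}")

# Importances are normalized, so they sum to 1
print("sum:", clf.feature_importances_.sum())
```

Features with an importance of 0.0 were never used for a split in this particular tree; that does not necessarily mean they carry no information.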


In [8]:
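# Train a regression decision tree on the same dataset
x_train, x_test, y_train, y_test = train_test_split(wine_data, target, test_size=0.1, random_state=2)


max_depth = 15
# Instantiate the decision tree regressor
reg = DecisionTreeRegressor(max_depth=max_depth)

# Train the model
reg.fit(x_train, y_train)

# Predict on the test set
y_pred = reg.predict(x_test)

# Evaluate prediction accuracy.
# Note: accuracy_score is a classification metric; it works here only because
# the regressor's predictions are whole-number class labels.
acc = metrics.accuracy_score(y_test, y_pred)
print(f'max_depth={max_depth}')
print(f"Accuracy: {round(acc,5)}", end='\n\n')


max_depth=15
Accuracy: 1.0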
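
Since `DecisionTreeRegressor` outputs continuous values, regression metrics are usually a better fit than accuracy. A minimal sketch follows, assuming the same split and using `mean_squared_error` and `r2_score` (the `random_state=0` argument and the variable name `reg` are additions for illustration):

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.1, random_state=2
)

# Fit a regression tree on the class labels treated as numeric targets
reg = DecisionTreeRegressor(max_depth=15, random_state=0)
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)

# Regression metrics: mean squared error and coefficient of determination
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)
print(f"MSE: {mse:.5f}")
print(f"R^2: {r2:.5f}")
```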

