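## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including features and labels. Whatever the project, remember that the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, and try adjusting them to see how they affect the final predictions.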
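
One way to get a feel for the hyperparameters is to train the same tree under a few different settings and compare the test accuracy. The sketch below does this on the iris data; the parameter values in `param_grid` are illustrative assumptions, not tuned recommendations, and `random_state=0` is added only so that repeated runs give the same tree.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Hypothetical settings chosen for illustration only
param_grid = [
    {"max_depth": 1},                            # a single split (decision stump)
    {"max_depth": 3},                            # shallow tree
    {"max_depth": None, "min_samples_leaf": 5},  # unlimited depth, larger leaves
    {"criterion": "entropy", "max_depth": 3},    # information gain instead of Gini
]

results = []
for params in param_grid:
    # Fix the seed so ties between equally good splits are broken the same way
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(x_train, y_train)
    acc = accuracy_score(y_test, clf.predict(x_test))
    results.append((params, clf.get_depth(), acc))
    print(params, "depth:", clf.get_depth(), "accuracy:", round(acc, 4))
```

Smaller `max_depth` and larger `min_samples_leaf` both restrict how finely the tree can partition the training data, which is why they are the usual first knobs to try against overfitting.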

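## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.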
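
For item 2, a regression target needs a regression tree rather than a classifier. The sketch below compares `DecisionTreeRegressor` with `LinearRegression`. Note that `load_boston` was removed from scikit-learn in version 1.2, so this example substitutes `load_diabetes` as a stand-in; that substitution, the 0.3 test fraction, and `random_state=0` on the tree are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes stands in for the removed load_boston
dataset = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.3, random_state=4
)

models = {
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Linear Regression": LinearRegression(),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    # Mean squared error (lower is better) and R^2 (higher is better)
    scores[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
    print(f"{name}: MSE={scores[name]['mse']:.2f}, R2={scores[name]['r2']:.4f}")
```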

In [1]:
from sklearn import datasets, metrics, linear_model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import warnings

# Suppress library warnings to keep the notebook output short
warnings.filterwarnings('ignore')

def process_DecisionTreeClassifier(dataset, ratio):
    """Train a decision tree on `dataset` and print its test accuracy; `ratio` is the test-set fraction."""
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=ratio, random_state=4)
    # Build the model with default hyperparameters
    clf = DecisionTreeClassifier()
    # Train the model
    clf.fit(x_train, y_train)
    # Predict on the test set
    y_pred = clf.predict(x_test)
    # Evaluate
    acc = metrics.accuracy_score(y_test, y_pred)
    print("Accuracy by Decision Tree Classifier: ", acc)
    print("Feature names: ", dataset.feature_names)
    print("Feature importance: ", clf.feature_importances_)

def process_LogisticRegression(dataset, ratio):
    """Train a logistic regression on `dataset` and print its test accuracy; `ratio` is the test-set fraction."""
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=ratio, random_state=4)
    # Build the model with default hyperparameters
    logreg = linear_model.LogisticRegression()
    # Train the model
    logreg.fit(x_train, y_train)
    # Predict on the test set
    y_pred = logreg.predict(x_test)
    # Evaluate
    acc = metrics.accuracy_score(y_test, y_pred)
    print("Accuracy by Logistic Regression: ", acc)
    print("Feature names: ", dataset.feature_names)


In [2]:
# Load the iris dataset
print('Iris dataset:\n')
dataset = datasets.load_iris()
process_DecisionTreeClassifier(dataset, 0.25)
print('\n')
process_DecisionTreeClassifier(dataset, 0.3)


Iris dataset:

Accuracy by Decision Tree Classifier:  0.9736842105263158
Feature names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.         0.01796599 0.52229134 0.45974266]


Accuracy by Decision Tree Classifier:  0.9777777777777777
Feature names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.01920966 0.         0.51762268 0.46316766]


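> Raising the test-set fraction slightly, from test_size=0.25 to 0.3, increases accuracy from 0.9737 to 0.9778. Each run misclassifies exactly one test sample (37/38 vs. 44/45 correct), so the gain comes from the larger denominator rather than from fewer errors.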
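
Because a single split can differ by one sample just by chance, it can help to average over several random splits before comparing test-set fractions. The following is a minimal sketch of that idea; the helper name `mean_accuracy`, the 20 seeds, and the fixed `random_state=0` on the tree are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

def mean_accuracy(test_size, seeds=range(20)):
    """Return mean and std of test accuracy over several random splits."""
    accs = []
    for seed in seeds:
        # A different seed gives a different train/test partition
        x_train, x_test, y_train, y_test = train_test_split(
            iris.data, iris.target, test_size=test_size, random_state=seed
        )
        clf = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
        accs.append(accuracy_score(y_test, clf.predict(x_test)))
    return np.mean(accs), np.std(accs)

for test_size in (0.25, 0.3):
    mean, std = mean_accuracy(test_size)
    print(f"test_size={test_size}: mean accuracy {mean:.4f} (std {std:.4f})")
```

If the two means differ by less than the spread across seeds, the test-set fraction is unlikely to be what drives the difference.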

In [3]:
# Load the wine recognition dataset
print('Wine recognition dataset:\n')
dataset = datasets.load_wine()
process_DecisionTreeClassifier(dataset, 0.3)
print('\n')
process_LogisticRegression(dataset, 0.3)

Wine recognition dataset:

Accuracy by Decision Tree Classifier:  0.9259259259259259
Feature names:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.         0.         0.         0.         0.         0.
 0.3966316  0.         0.         0.39192945 0.         0.
 0.21143895]


Accuracy by Logistic Regression:  0.9444444444444444
Feature names:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']


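> On the Wine Recognition Dataset, logistic regression (accuracy 0.9444) outperforms the decision tree (0.9259) on this split, a difference of one test sample (51/54 vs. 50/54 correct).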
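
Since the two models differ by a single test sample here, a cross-validated comparison gives a steadier picture than one split. The sketch below uses 5-fold `cross_val_score` on the wine data. Standardizing the features and raising `max_iter` for logistic regression are added assumptions (the cell above used raw features and default settings), so the numbers it prints are not directly comparable to the output shown.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    # Assumption: scale features first, since the wine features span very different ranges
    "Logistic Regression (scaled)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

cv_scores = {}
for name, model in models.items():
    # With an integer cv and a classifier, the folds are stratified by class
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.4f} (std {scores.std():.4f})")
```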