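## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels. Whatever the project, remember that the data must be cleaned into this same format before it can be passed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each hyperparameter means, then try adjusting them and observe how they affect the final predictions.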

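## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.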

# Ans
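1. When the data is relatively complex, changing the parameters affects the results. When the data is simple, the model changes very little, so growing a deeper tree does not improve the results.
2. Compared with the iris dataset, the boston dataset is more complex: the model has more factors to consider, so parameter tuning requires more care.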
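
The first answer can be checked with a small sweep. The sketch below is only an illustration: the list of `max_depth` values and the fixed `random_state=0` are assumptions chosen for the example, not settings from the assignment.

```python
from sklearn import datasets, metrics, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Hypothetical grid of depths; None lets the tree grow until its leaves are pure.
depths = [1, 2, 3, 5, None]
results = {}
for depth in depths:
    clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(x_train, y_train)
    results[depth] = metrics.accuracy_score(y_test, clf.predict(x_test))

for depth, acc in results.items():
    print(f"max_depth={depth}: test accuracy = {acc:.3f}")
```

On a small, easily separable dataset such as iris, the accuracy would be expected to stop changing after a shallow depth, which is the behaviour the answer describes.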

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn import datasets, metrics, linear_model, tree
import warnings
warnings.filterwarnings('ignore')

In [36]:
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.2, random_state = 42)
dt_clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=2)
dt_clf.fit(x_train,y_train)
y_pred = dt_clf.predict(x_test)
y_pred

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])


In [24]:
df = pd.DataFrame(iris.data)
df.columns = iris.feature_names
df

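| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |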


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


In [14]:
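# accuracy_score expects (y_true, y_pred); the value is the same either way, but this order is the convention
accuracy = metrics.accuracy_score(y_test, y_pred)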
accuracy

1.0

In [3]:
# Draw the fitted tree as a diagram (requires the GraphViz executables to be installed)
import io
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

dot_data = io.StringIO()
export_graphviz(dt_clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

InvocationException: GraphViz's executables not found
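
The error above means that the GraphViz binaries are not on the system path. If installing them is not an option, one possible workaround is `sklearn.tree.export_text`, which prints the learned rules as plain text. This is a sketch of an alternative, not the notebook's original approach; the tree is refit here with an assumed `random_state=0` so the block runs on its own.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
dt_clf = tree.DecisionTreeClassifier(min_samples_leaf=2, random_state=0)
dt_clf.fit(x_train, y_train)

# export_text returns the split rules as a string; no GraphViz is needed.
rules = tree.export_text(dt_clf, feature_names=list(iris.feature_names))
print(rules)
```

For a graphical view without GraphViz, `sklearn.tree.plot_tree(dt_clf, filled=True)` draws a comparable diagram using matplotlib.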

In [18]:
# Fit a regression tree on iris, treating the integer class labels as numeric targets
dt_clr = tree.DecisionTreeRegressor(max_depth = 5)
dt_clr.fit(x_train,y_train)
y_pred_reg = dt_clr.predict(x_test)
mse = metrics.mean_squared_error
mse(y_test,y_pred_reg)

0.0

In [19]:
dt_clr.feature_importances_

array([0.        , 0.00851154, 0.19875328, 0.79273517])
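
The raw importance array is easier to read when each value is paired with its feature name. The following is a minimal sketch, assuming a regression tree refit on the same split with an added `random_state=0`; the exact values may differ slightly from the output above.

```python
import pandas as pd
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
reg = tree.DecisionTreeRegressor(max_depth=5, random_state=0)
reg.fit(x_train, y_train)

# Label each importance with its column name and sort from most to least important.
importances = pd.Series(reg.feature_importances_, index=iris.feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as the share of the total impurity reduction attributed to that feature.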

In [30]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()
df_boston = pd.DataFrame(boston.data)
df_boston.columns = boston.feature_names
df_boston.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


In [31]:
df_boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
dtypes: float64(13)
memory usage: 51.5 KB


In [33]:
df_boston.shape

(506, 13)

In [35]:
boston.data.shape

(506, 13)

In [34]:
boston.target.shape

(506,)

In [39]:
# DecisionTreeClassifier on Boston (fails: the target is a continuous value, not a class label)
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size = 0.2, random_state = 42)
dt_clf.fit(X_train, Y_train)
y_pred_boston = dt_clf.predict(X_test)
acc = metrics.accuracy_score(Y_test,y_pred_boston)
acc

ValueError: Unknown label type: 'continuous'
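
The error arises because a classifier expects discrete class labels, while the Boston target is a continuous value. One way to check this before fitting is `sklearn.utils.multiclass.type_of_target`. In the sketch below, `pick_tree` is a hypothetical helper written for this illustration, and the price array holds made-up values standing in for a continuous target.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.utils.multiclass import type_of_target


def pick_tree(y):
    """Hypothetical helper: regressor for continuous targets, classifier otherwise."""
    if type_of_target(y).startswith("continuous"):
        return tree.DecisionTreeRegressor(max_depth=5)
    return tree.DecisionTreeClassifier()


iris_kind = type_of_target(datasets.load_iris().target)
prices = np.array([10.5, 22.3, 31.7, 18.9])  # made-up continuous values
price_kind = type_of_target(prices)

print(iris_kind, type(pick_tree(datasets.load_iris().target)).__name__)
print(price_kind, type(pick_tree(prices)).__name__)
```

With a continuous target, accuracy is also not a meaningful metric; an error measure such as mean squared error is the usual choice.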

In [38]:
# DecisionTree Regression on Boston
dt_clr.fit(X_train,Y_train)
y_pred_boston_reg = dt_clr.predict(X_test)
mse(Y_test,y_pred_boston_reg)

20.356259525823308

In [40]:
dt_clr.feature_importances_

array([3.55038879e-02, 0.00000000e+00, 2.62746873e-03, 2.53914343e-16,
       3.56104039e-03, 6.45919336e-01, 6.17612617e-03, 7.27222195e-02,
       0.00000000e+00, 0.00000000e+00, 4.39109751e-03, 1.71668872e-03,
       2.27382135e-01])
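
The assignment also names the wine dataset, which this notebook does not cover. The sketch below shows one possible comparison between a decision tree and a regression-style model on it. Using `LogisticRegression` as the comparison model, standardizing the features first, and fixing `random_state` are all assumptions made for this example rather than choices from the assignment.

```python
from sklearn import datasets, metrics, tree
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

models = {
    "decision tree": tree.DecisionTreeClassifier(random_state=0),
    # Scaling helps the linear model converge; trees do not need it.
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = metrics.accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Because wine is a classification dataset, accuracy is used here; for a regression target such as Boston house prices, the equivalent comparison would pair `DecisionTreeRegressor` with `linear_model.LinearRegression` and report mean squared error.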