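## [Assignment focus]
By now you should have a clear picture of what the data in a dataset looks like, including the features and the labels. Keep in mind that in any future project the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment moves on to decision trees, a very important model. Make sure you understand what each hyperparameter of the model means, then try adjusting them to see how they affect the final predictions.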
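
As a quick illustration of the two points above, the sketch below loads the wine dataset (chosen here only as an example), shows the shape of the features and labels, and prints a few hyperparameters of `DecisionTreeClassifier` with their default values. The selection of four hyperparameters is an arbitrary starting point, not a complete list.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

# Features X form a 2-D table (samples x features); labels y form a 1-D array.
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)  # (178, 13) (178,)

# Look up a few hyperparameters and their default values.
clf = DecisionTreeClassifier()
params = clf.get_params()
for name in ("criterion", "max_depth", "min_samples_split", "min_samples_leaf"):
    print(name, "=", params[name])
```

`get_params()` returns every constructor argument, so it is a convenient way to see what can be tuned before changing anything.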

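## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.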
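
For the wine dataset, one possible way to set up both tasks is sketched below. Because the wine labels are classes, the "regression model" is assumed here to be `LogisticRegression`; the split ratio, the `max_depth=2` variant, and the scaling step in front of the logistic regression are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Two tree settings (task 1) next to a linear baseline (task 2).
models = {
    "tree (default)": DecisionTreeClassifier(random_state=0),
    "tree (max_depth=2)": DecisionTreeClassifier(max_depth=2, random_state=0),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Fixing `random_state` on the trees keeps the comparison repeatable, so any change in accuracy comes from the hyperparameter and not from run-to-run variation.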

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets, metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

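# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on scikit-learn versions older than 1.2.
boston = datasets.load_boston()
data = pd.DataFrame(boston['data'], columns=boston['feature_names'])
target = pd.DataFrame(boston['target'], columns=['Target'])
# Append the target as the last column so features and label sit in one table
data = pd.concat([data, target], axis=1)
del target
data.head()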

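| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |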


In [2]:
# Default setting: a node may be split as long as it holds at least 2 samples
x_train, x_test, y_train, y_test = train_test_split(data.iloc[:,:-1], data['Target'], test_size=0.1, random_state=42)
TreeR = DecisionTreeRegressor(min_samples_split=2)
TreeR.fit(x_train, y_train)
y_pred = TreeR.predict(x_test)

print('MSE :',metrics.mean_squared_error(y_test, y_pred))
print('R2 score :',metrics.r2_score(y_test, y_pred))

MSE : 35.63294117647059
R2 score : 0.4292722397860971
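
To make the two printed metrics concrete, here is a small hand-written sketch of how MSE and R2 are defined, using made-up numbers (not the Boston data). MSE is the mean of the squared errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values.

```python
def mse(y_true, y_pred):
    # Mean of the squared differences between truth and prediction.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    # 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values chosen only for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse_value = mse(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print("MSE :", mse_value)      # 0.5
print("R2 score :", r2_value)  # 0.9
```

A lower MSE and an R2 closer to 1 both mean the predictions sit closer to the true values.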


In [3]:
# Require at least 4 samples in a node before it may be split
x_train, x_test, y_train, y_test = train_test_split(data.iloc[:,:-1], data['Target'], test_size=0.1, random_state=42)
TreeR = DecisionTreeRegressor(min_samples_split=4)
TreeR.fit(x_train, y_train)
y_pred = TreeR.predict(x_test)

print('MSE :',metrics.mean_squared_error(y_test, y_pred))
print('R2 score :',metrics.r2_score(y_test, y_pred))

MSE : 8.761879084967319
R2 score : 0.8596622265711116
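
The gap between the two runs above can come from two sources at once: the change in `min_samples_split` and the tree's own randomness, since no `random_state` is set. A sketch that separates them is to fix `random_state` and sweep the hyperparameter. It uses `load_diabetes` as a stand-in dataset, which is an assumption made here because `load_boston` is not available in recent scikit-learn releases; the list of values swept is also arbitrary.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

results = {}
for n in (2, 4, 8, 16, 32):
    # A fixed random_state means only min_samples_split changes between runs.
    tree = DecisionTreeRegressor(min_samples_split=n, random_state=0)
    tree.fit(x_train, y_train)
    y_pred = tree.predict(x_test)
    results[n] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
        "leaves": tree.get_n_leaves(),
    }
    print(n, results[n])
```

Larger values of `min_samples_split` stop the tree from splitting small nodes, which typically yields fewer leaves and a smoother, less overfitted model.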


In [4]:
pd.DataFrame(TreeR.feature_importances_, columns=['Important'], index=boston['feature_names'])\
                .sort_values(by='Important', ascending=False)

| | Important |
|---|---|
| RM | 0.574372 |
| LSTAT | 0.201809 |
| DIS | 0.079915 |
| CRIM | 0.064632 |
| PTRATIO | 0.026291 |
| NOX | 0.012617 |
| B | 0.011948 |
| AGE | 0.011756 |
| TAX | 0.00866 |
| INDUS | 0.007422 |
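
As a reading aid for the table above: `feature_importances_` is impurity-based, and the values are normalized so that they sum to 1 across all features. The sketch below checks this on `load_diabetes`, used here as a stand-in dataset (an assumption, since `load_boston` is missing from recent scikit-learn releases).

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

dataset = load_diabetes(as_frame=True)
X, y = dataset.data, dataset.target

tree = DecisionTreeRegressor(min_samples_split=4, random_state=0).fit(X, y)

# One importance value per feature, sorted from most to least important.
importances = pd.Series(
    tree.feature_importances_, index=X.columns, name="Importance"
).sort_values(ascending=False)

print(importances)
print("sum =", importances.sum())
```

A feature with an importance of 0 was never used in any split of this particular tree, which does not by itself mean the feature is useless for other models.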


In [5]:
lasso = Lasso()
lasso.fit(x_train, y_train)
y_pred = lasso.predict(x_test)

print('MSE :',metrics.mean_squared_error(y_test, y_pred))
print('R2 score :',metrics.r2_score(y_test, y_pred))

MSE : 18.645326946116246
R2 score : 0.7013604452769768


In [6]:
ridge = Ridge()
ridge.fit(x_train, y_train)
y_pred = ridge.predict(x_test)

print('MSE :',metrics.mean_squared_error(y_test, y_pred))
print('R2 score :',metrics.r2_score(y_test, y_pred))

MSE : 14.775452511215358
R2 score : 0.7633436747163265


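>The tree model can reach very good scores (R2 of about 0.86 with min_samples_split=4, versus about 0.76 for Ridge and 0.70 for Lasso), but its score is not the same on every run because no random_state is set on the tree, so it needs to be tested several times.  
In contrast, the results of the basic linear models are deterministic!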
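
A minimal sketch of the point above, assuming `load_diabetes` as a stand-in dataset: fixing `random_state` makes the tree repeatable, different seeds can produce different scores, and `Ridge` with its default settings gives the same coefficients on every fit.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)


def tree_predictions(seed):
    # Fit a fresh tree with the given seed and predict on the test set.
    tree = DecisionTreeRegressor(min_samples_split=4, random_state=seed)
    return tree.fit(x_train, y_train).predict(x_test)


# Same seed twice: the predictions are identical.
same_seed_equal = np.array_equal(tree_predictions(0), tree_predictions(0))
print("same seed, same predictions:", same_seed_equal)

# Different seeds: the scores are allowed to differ from one another.
tree_scores = [r2_score(y_test, tree_predictions(seed)) for seed in range(5)]
print("R2 per seed:", [round(s, 4) for s in tree_scores])

# Ridge with default settings: two fits give the same coefficients.
ridge_a = Ridge().fit(x_train, y_train).coef_
ridge_b = Ridge().fit(x_train, y_train).coef_
ridge_equal = np.allclose(ridge_a, ridge_b)
print("ridge repeatable:", ridge_equal)
```

Reporting the mean and spread of the tree's score over several seeds gives a fairer comparison with the linear models than a single run.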

----
Feature scaling test

In [7]:
# x_train, x_test, y_train, y_test = train_test_split(data.iloc[:,:-1], data['Target'], test_size=0.1, random_state=42)

# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler().fit(x_train)   # learn min/max from the training set only
# x_train = scaler.transform(x_train)
# x_test = scaler.transform(x_test)      # reuse the same scaler instead of fitting a new one on the test set

# TreeR = DecisionTreeRegressor()
# TreeR.fit(x_train, y_train)
# y_pred = TreeR.predict(x_test)

# print('MSE :',metrics.mean_squared_error(y_test, y_pred))
# print('R2 score :',metrics.r2_score(y_test, y_pred))

In [8]:
# x_train, x_test, y_train, y_test = train_test_split(data.iloc[:,:-1], data['Target'], test_size=0.1, random_state=42)

# from sklearn.preprocessing import RobustScaler
# scaler = RobustScaler().fit(x_train)   # learn median/IQR from the training set only
# x_train = scaler.transform(x_train)
# x_test = scaler.transform(x_test)      # reuse the same scaler instead of fitting a new one on the test set

# TreeR = DecisionTreeRegressor()
# TreeR.fit(x_train, y_train)
# y_pred = TreeR.predict(x_test)

# print('MSE :',metrics.mean_squared_error(y_test, y_pred))
# print('R2 score :',metrics.r2_score(y_test, y_pred))

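The results were... not good XD. Scaling is not expected to help here: a tree splits on per-feature thresholds, so a consistent rescaling of the features leaves the splits essentially unchanged. Scaling matters more for models such as Lasso and Ridge, whose penalty depends on the size of the coefficients.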
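
One thing worth checking in a scaling experiment like this is whether the training and test sets share a single scaler. The toy sketch below, in plain Python with made-up numbers and hypothetical helper names (`fit_min_max`, `transform`), shows how min-max scaling the test set on its own statistics maps the same raw value to a different scaled value.

```python
def fit_min_max(values):
    # "Fit" step: remember the minimum and maximum of the given values.
    return min(values), max(values)


def transform(values, lo, hi):
    # "Transform" step: map values to [0, 1] using the remembered min and max.
    return [(v - lo) / (hi - lo) for v in values]


# Toy one-feature data, chosen only for illustration.
train = [0.0, 2.0, 4.0, 10.0]
test = [4.0, 6.0]

lo, hi = fit_min_max(train)
train_scaled = transform(train, lo, hi)
test_shared = transform(test, lo, hi)                # scaler fitted on train
test_separate = transform(test, *fit_min_max(test))  # scaler refitted on test

print(train_scaled)   # [0.0, 0.2, 0.4, 1.0]
print(test_shared)    # [0.4, 0.6]
print(test_separate)  # [0.0, 1.0]
```

With the shared scaler, the raw value 4.0 becomes 0.4 in both sets, so a split threshold learned during training keeps its meaning. With the separately fitted scaler it becomes 0.0, and the same threshold would send it down a different branch.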