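## [Assignment focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how adjusting the hyperparameters affects the results.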
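
As a starting point for exploring the hyperparameters, the following minimal sketch lists a few of them with their default values. It relies only on `get_params()`, which every scikit-learn estimator provides; the selection of names printed here is my own choice, not part of the assignment.

```python
from sklearn.ensemble import RandomForestClassifier

# A default-constructed classifier; get_params() returns its hyperparameters as a dict.
clf = RandomForestClassifier()
params = clf.get_params()

# Print a handful of commonly tuned hyperparameters (an illustrative selection).
for name in ("n_estimators", "max_depth", "min_samples_split",
             "min_samples_leaf", "max_features"):
    print(f"{name}: {params[name]}")
```

The default values can differ between scikit-learn releases, so it is worth running this against the installed version rather than relying on memory.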

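## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results against a regression model and a decision tree.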

## Assignment 1

In [1]:
import pandas as pd
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import mean_squared_error

In [2]:
# Load the iris dataset
iris = datasets.load_iris()

x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model (20 trees, each with a maximum depth of 4)
clf = RandomForestClassifier(n_estimators=20, max_depth=4)

clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print('Features: ',iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.06623742 0.02578816 0.41368244 0.49429198]
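
The importance array printed above is easier to read when each value is paired with its feature name. The sketch below does that with a pandas `Series`. Note that `random_state=0` is an assumption added here for repeatability; the original cell does not fix the forest's seed, so its numbers vary from run to run.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# random_state=0 is an arbitrary choice (not in the original cell) to make the run repeatable.
clf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
clf.fit(x_train, y_train)

# Label each importance with its feature name and sort from most to least important.
importances = pd.Series(clf.feature_importances_, index=iris.feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```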


In [3]:
'''
Vary the number of trees and the maximum depth
'''

# Load the iris dataset
iris = datasets.load_iris()

x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

Accuracy = []
TreeNum = []
depth = []
# Build the models (20 to 50 trees in steps of 5, each tree with a maximum depth of 1 to 5)
for i in range(20, 51, 5):
    for j in range(1,6):

        clf = RandomForestClassifier(n_estimators=i, max_depth=j)

        clf.fit(x_train, y_train)

        y_pred = clf.predict(x_test)

        acc = metrics.accuracy_score(y_test, y_pred)
        print(f'Number of Trees: {i}, max_depth: {j}')
        print("Accuracy: ", round(acc, 3))
        print('Features: ',iris.feature_names)
        print("Feature importance: ", clf.feature_importances_, end='\n\n')
        Accuracy.append(round(acc,4))
        TreeNum.append(i)
        depth.append(j)

        

Number of Trees: 20, max_depth: 1
Accuracy:  0.974
Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.1  0.   0.35 0.55]

Number of Trees: 20, max_depth: 2
Accuracy:  0.974
Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.0896585  0.02646532 0.30365948 0.5802167 ]

Number of Trees: 20, max_depth: 3
Accuracy:  0.974
Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.13251189 0.02727049 0.50523927 0.33497835]

Number of Trees: 20, max_depth: 4
Accuracy:  0.974
Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.08252702 0.01655179 0.42446436 0.47645683]

Number of Trees: 20, max_depth: 5
Accuracy:  0.974
Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.125

In [4]:
df_Acc = pd.DataFrame({'TreeNum':TreeNum,'depth':depth, 'Accuracy': Accuracy})
df_Acc

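| | TreeNum | depth | Accuracy |
|---|---|---|---|
| 0 | 20 | 1 | 0.9737 |
| 1 | 20 | 2 | 0.9737 |
| 2 | 20 | 3 | 0.9737 |
| 3 | 20 | 4 | 0.9737 |
| 4 | 20 | 5 | 0.9737 |
| 5 | 25 | 1 | 1.0 |
| 6 | 25 | 2 | 0.9737 |
| 7 | 25 | 3 | 0.9737 |
| 8 | 25 | 4 | 0.9737 |
| 9 | 25 | 5 | 0.9737 |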
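
Because the test set holds only 38 samples, a single misclassification moves the accuracy by about 0.026, which makes the table above hard to interpret. One possible alternative is to score each setting with cross-validation on the training data. The sketch below uses `GridSearchCV` for that; the smaller grid, `cv=5`, and `random_state=0` are assumptions chosen to keep the example quick and repeatable, not settings from the notebook.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# A reduced grid (assumption) covering the same ranges as the nested loops.
param_grid = {"n_estimators": [20, 35, 50], "max_depth": [1, 3, 5]}

# Each combination is scored by 5-fold cross-validation on the training split only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(x_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", round(search.best_score_, 3))
# The held-out test set is used once, after the hyperparameters are chosen.
print("Test accuracy:", round(search.score(x_test, y_test), 3))
```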


## Assignment 2
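> Solve a regression problem with three different models: linear regression, decision tree, and random forest<br>
> Model accuracy is judged by comparing the mean squared error (MSE) on the test set<br>
> The best model was the random forest (MSE 10.96), followed by linear regression (22.16) and then the decision tree (31.903)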

In [5]:
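# Train on the Boston housing dataset and compare the accuracy of different models
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release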
boston = datasets.load_boston()
boston

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 3

In [6]:
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=2)

In [7]:
# Ordinary linear regression
reg = LinearRegression()
reg.fit(x_train,y_train)
y_pred = reg.predict(x_test)

print(f"Mean squared error: {round(mean_squared_error(y_test, y_pred), 3)}")

Mean squared error: 22.16
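
The mean squared error reported here is the sum of squared residuals divided by the number of test samples, so on a fixed test set it ranks models in the same order as the sum of squares. The sketch below checks this by hand on toy values, which are made up for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values for illustration only; they are not taken from the notebook.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

residuals = y_true - y_hat
sse = float(np.sum(residuals ** 2))  # sum of squared errors
mse = sse / len(y_true)              # mean squared error = SSE / n

print("SSE:", sse)
print("MSE:", mse)
print("sklearn MSE:", mean_squared_error(y_true, y_hat))
```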


In [8]:
# Decision tree
DTR = DecisionTreeRegressor(max_depth=5)
DTR.fit(x_train,y_train)
y_pred = DTR.predict(x_test)

print(f"Mean squared error: {round(mean_squared_error(y_test, y_pred), 3)}")

Mean squared error: 31.903


In [9]:
# Random forest
RFR = RandomForestRegressor(max_depth=5)
RFR.fit(x_train,y_train)
y_pred = RFR.predict(x_test)

print(f"Mean squared error: {round(mean_squared_error(y_test, y_pred), 3)}")

Mean squared error: 10.96
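
The assignment also names the wine dataset, which is a classification problem, so the same three-way comparison can be sketched with classifiers and accuracy instead of MSE. Several choices below are assumptions of mine rather than anything from the notebook: logistic regression (with feature scaling) stands in for the "regression model", and `max_depth=5`, `n_estimators=100`, and the `random_state` values are arbitrary.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# Three classifiers mirroring the regression comparison (all settings are assumptions).
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=0
    ),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

With only 45 test samples, small differences between the three scores should be read with caution.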
