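## [Assignment Focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.

## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model and a decision tree.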
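
As a starting point for item 1, the following is a minimal sketch of how a few hyperparameter settings could be compared with cross-validation on the wine dataset. The three settings in `param_grid` and the name `results` are hypothetical choices for illustration only; they are not recommended values.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Features and class labels of the wine dataset
X, y = load_wine(return_X_y=True)

# Hypothetical settings, chosen only to illustrate the comparison
param_grid = [
    {"n_estimators": 5, "max_depth": 2},
    {"n_estimators": 20, "max_depth": 4},
    {"n_estimators": 100, "max_depth": None},
]

results = {}
for params in param_grid:
    # A fixed random_state keeps the comparison reproducible
    clf = RandomForestClassifier(random_state=42, **params)
    # Mean accuracy over 5 cross-validation folds
    mean_acc = cross_val_score(clf, X, y, cv=5).mean()
    results[(params["n_estimators"], params["max_depth"])] = mean_acc
    print(f"n_estimators={params['n_estimators']}, "
          f"max_depth={params['max_depth']}: mean accuracy = {mean_acc:.3f}")
```

Fixing `random_state` means that differences between rows come from the hyperparameters rather than from the randomness of the forest itself.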

In [1]:
import pandas as pd
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

In [2]:
# read data
# (datasets.load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
boston = datasets.load_boston()
wine = datasets.load_wine()

df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_wine = pd.DataFrame(wine.data, columns=wine.feature_names)

# print('boston: ', df_boston.head())
print(df_boston.describe())
# print('\n\nwine: ', df_wine.head())
print(df_wine.describe())

             CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000   
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000   

              AGE         DIS         RAD         TAX     PTRATIO           B  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean    68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   
std     28.148861    2.1057

In [3]:
# check whether each target calls for regression or classification
print(boston.target)    # continuous values -> regression
print(wine.target)      # class labels -> classification

[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 25.3 24.7 21.2 19.3 20.  16.6 14.4 19.4 19.7 20.5 25.  23.4 18.9 35.4
 24.7 31.6 23.3 19.6 18.7 16.  22.2 25.  33.  23.5 19.4 22.  17.4 20.9
 24.2 21.7 22.8 23.4 24.1 21.4 20.  20.8 21.2 20.3 28.  23.9 24.8 22.9
 23.9 26.6 22.5 22.2 23.6 28.7 22.6 22.  22.9 25.  20.6 28.4 21.4 38.7
 43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8
 18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22.  20.3 20.5 17.3 18.8 21.4
 15.7 16.2 18.  14.3 19.2 19.6 23.  18.4 15.6 18.1 17.4 17.1 13.3 17.8
 14.  14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4
 17.  15.6 13.1 41.3 24.3 23.3 27.  50.  50.  50.  22.7 25.  50.  23.8
 23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2
 37.9 32.5 26.4 29.6 50.  32.  29.8 34.9 37.  30.5 36.4 31.1 29.1 50.
 33.3 3
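
Besides printing the targets, the same check can be done programmatically. The sketch below assumes scikit-learn's `type_of_target` helper from `sklearn.utils.multiclass`; the `prices` array is a handful of values copied from the printed `boston.target` output, used here only as a stand-in for the full target.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.utils.multiclass import type_of_target

wine = load_wine()

# A few Boston prices copied from the printed output (stand-in for boston.target)
prices = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

price_type = type_of_target(prices)        # 1-D float values
label_type = type_of_target(wine.target)   # integer class labels
feature_type = type_of_target(wine.data)   # 2-D float feature matrix

print(f"prices:       {price_type}")
print(f"wine.target:  {label_type}")
print(f"wine.data:    {feature_type}")
```

A continuous type points to a regressor, while a multiclass type points to a classifier. The type reported for the feature matrix is the one a classifier rejects with an "Unknown label type" error when features are passed in place of labels.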

In [18]:
# linear model
from sklearn.linear_model import LinearRegression

# The label must be boston.target (the house price). Passing df_boston.values,
# i.e. the features themselves, makes the model reproduce its own inputs and
# report a meaningless near-zero MSE (about 1.2e-26).
x_train, x_test, y_train, y_test = train_test_split(df_boston, boston.target, test_size=0.25, random_state=42)
reg = LinearRegression().fit(x_train, y_train)
y_pred = reg.predict(x_test)
# print(f'cross_val_score: {cross_val_score(reg, df_boston, boston.target, cv=10).mean()}')
print(f'MSE: {metrics.mean_squared_error(y_test, y_pred)}')
print(f'features_names: {df_boston.columns}')

features_names: Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')


In [17]:
# split
# The label must be boston.target. Fitting against df_boston.values instead trains
# a 13-output model, so its scores (R^2 of about 0.70, MSE of about 75.7) describe
# how well the features are reconstructed, not how well the price is predicted.
x_train, x_test, y_train, y_test = train_test_split(df_boston, boston.target, test_size=0.25, random_state=42)

# model: 20 trees, each with a maximum depth of 4
reg = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=42).fit(x_train, y_train)

# predict
y_pred = reg.predict(x_test)

# evaluate: cross_val_score reports R^2 for a regressor; accuracy does not apply here
print(f'cross_val_score: {cross_val_score(reg, df_boston, boston.target, cv=10).mean()}')
print(f'MSE: {metrics.mean_squared_error(y_test, y_pred)}')
print(f'features_names: {df_boston.columns}')
print("Feature importance: ", reg.feature_importances_)

features_names: Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')

In [12]:
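# Passing the feature matrix df_wine.values as the target raises
# "ValueError: Unknown label type: 'continuous-multioutput'".
# Multiclass labels are not the cause; the label must be wine.target.

# split
x_train, x_test, y_train, y_test = train_test_split(df_wine, wine.target, test_size=0.25, random_state=42)

# build the model (20 trees, each with a maximum depth of 4)
clf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=42)

# train the model
clf.fit(x_train, y_train)

# predict with the classifier (y_pred must not be left over from the regressor)
y_pred = clf.predict(x_test)

# evaluate
acc = metrics.accuracy_score(y_test, y_pred)
print(f'acc: {acc}')
print(f'features_names: {df_wine.columns}')
print("Feature importance: ", clf.feature_importances_)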
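
For item 2 of the assignment, a rough sketch of the comparison on the wine dataset might look like the following. It assumes logistic regression as the "regression model" for this classification task, and it scales the features for that model only. The dictionary names `models` and `accuracies`, and the specific hyperparameters, are illustrative choices rather than part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Features and class labels (wine.target, not the feature matrix)
X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Illustrative models; the hyperparameters mirror the notebook where possible
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random_forest": RandomForestClassifier(
        n_estimators=20, max_depth=4, random_state=42
    ),
}

accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # Accuracy on the held-out 25% of the data
    accuracies[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
```

All three models share the same train/test split, so their accuracies can be compared directly. With only 178 samples, a single split is noisy, and `cross_val_score` would give a steadier comparison.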