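## [Assignment Focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.

## Assignment

1. Try adjusting the parameters in RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model and a decision tree.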
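
One way to approach item 1 is a small sweep over two of the hyperparameters. The sketch below is only an illustration: the grid values are hypothetical, and it reuses the wine dataset with the same split settings as the notebook (`test_size=0.1`, `random_state=4`).

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same split settings as the notebook's helper function.
wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.1, random_state=4)

# Hypothetical grid, chosen only for illustration.
param_grid = [
    {'n_estimators': n, 'max_depth': d}
    for n in (1, 5, 20, 100)
    for d in (1, 3, None)
]

results = []
for params in param_grid:
    # random_state is fixed so that only the swept parameters change between runs.
    clf = RandomForestClassifier(random_state=0, **params)
    clf.fit(x_train, y_train)
    results.append((params, clf.score(x_test, y_test)))

for params, acc in results:
    print(params, 'accuracy: %.2f' % acc)
```

With a 10% test split the wine test set has only 18 samples, so small differences in accuracy between settings should be read with caution.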

In [19]:
from sklearn import datasets, linear_model, metrics
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from IPython.display import Image
import pydotplus 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [40]:
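def data(dataset, is_regression, title=None):
    """Split a scikit-learn dataset 90/10, fit a random forest, and print test metrics and feature importances."""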
    if title is not None:
        print(title.upper())
        print()
    print('data shape:', dataset.data.shape)
    print('target shape:', dataset.target.shape)
    
    X = dataset.data
    x_train, x_test, y_train, y_test = train_test_split(X, dataset.target, test_size=0.1, random_state=4)
    print('x_train', x_train[0])
    print('y_train', y_train[0])
    print('x_test', x_test[0])
    print('y_test', y_test[0])
    print()
    
    if is_regression:
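        # criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0, and the old name was removed in 1.2.
        model = RandomForestRegressor(n_estimators=20, criterion='mse', max_depth=30, min_samples_split=2, min_samples_leaf=1, random_state=0)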
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        print('RandomForestRegressor:')
        # For a regressor, score() returns R^2 of the whole forest on the test set, not of a single tree.
        print('tree score:', model.score(x_test,y_test))
        print("Mean squared error: %.2f"% mean_squared_error(y_test, y_pred))
        df = pd.DataFrame(model.feature_importances_, index=dataset.feature_names, columns=['importance'])
        print("Feature importance: ", df.sort_values('importance', ascending=False))
        print()
    else:
        model = RandomForestClassifier(n_estimators=20, criterion='entropy', max_depth=30, min_samples_split=2, min_samples_leaf=1, random_state=0)
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        print('RandomForestClassifier:')
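        # For a classifier, score() returns mean accuracy, so it matches accuracy_score below.
        print('tree score:', model.score(x_test,y_test))
        # r2_score treats the class labels as numbers, so it is not a meaningful metric for classification.
        print("r2_score: %.2f"% r2_score(y_test, y_pred))
        print('accuracy_score: %.2f'% accuracy_score(y_test, y_pred))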
        df = pd.DataFrame(model.feature_importances_, index=dataset.feature_names, columns=['importance'])
        print("Feature importance: ", df.sort_values('importance', ascending=False))
        print()
    print('\n-----------------------------\n')
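
The helper above fits only random forests. For the comparison asked for in item 2, a minimal sketch could fit several models on the same split. It assumes the wine dataset and uses `LogisticRegression` as the "regression model" for a classification task; that choice, and `max_iter=10000`, are assumptions rather than anything the notebook specifies.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same split settings as the notebook's helper function.
wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.1, random_state=4)

# Models to compare; the logistic regression settings are an assumption.
models = {
    'LogisticRegression': LogisticRegression(max_iter=10000),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
    'RandomForestClassifier': RandomForestClassifier(n_estimators=20, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # score() returns mean accuracy on the test set for classifiers.
    scores[name] = model.score(x_test, y_test)
    print('%s accuracy: %.2f' % (name, scores[name]))
```

For a regression dataset such as diabetes, the same loop would work with `LinearRegression`, `DecisionTreeRegressor`, and `RandomForestRegressor`, in which case `score()` reports R^2 instead of accuracy.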

In [41]:
diabetes = datasets.load_diabetes()
data(diabetes, True, 'diabetes')

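breast_cancer = datasets.load_breast_cancer()
# breast_cancer has a binary class target; passing True fits RandomForestRegressor on the 0/1 labels.
data(breast_cancer, True, 'breast_cancer')

# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call needs an older release.
boston = datasets.load_boston()
data(boston, True, 'boston')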

iris = datasets.load_iris()
data(iris, False, 'iris')

wine = datasets.load_wine()
data(wine, False, 'wine')

DIABETES

data shape: (442, 10)
target shape: (442,)
x_train [-0.04547248 -0.04464164 -0.04824063 -0.01944209 -0.00019301 -0.01603186
  0.06704829 -0.03949338 -0.02479119  0.01963284]
y_train 111.0
x_test [-0.04183994 -0.04464164 -0.04931844 -0.03665645 -0.00707277 -0.02260797
  0.08545648 -0.03949338 -0.06648815  0.00720652]
y_test 128.0

RandomForestRegressor:
tree score: 0.30181463546752285
Mean squared error: 3733.72
Feature importance:       importance
s5     0.320573
bmi    0.261497
bp     0.105012
s6     0.072432
age    0.061350
s2     0.052513
s3     0.047870
s1     0.046926
s4     0.020295
sex    0.011532


-----------------------------

BREAST_CANCER

data shape: (569, 30)
target shape: (569,)
x_train [1.026e+01 1.471e+01 6.620e+01 3.216e+02 9.882e-02 9.159e-02 3.581e-02
 2.037e-02 1.633e-01 7.005e-02 3.380e-01 2.509e+00 2.394e+00 1.933e+01
 1.736e-02 4.671e-02 2.611e-02 1.296e-02 3.675e-02 6.758e-03 1.088e+01
 1.948e+01 7.089e+01 3.571e+02 1.360e-01 1.636e-01 7.162e-02 4.074