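## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model and a decision tree.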

In [1]:
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


In [2]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = RandomForestClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

print(iris.feature_names)

print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.05902046 0.01731169 0.45826962 0.46539824]
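
The importance array is printed in the same order as `iris.feature_names`, so the two can be paired and sorted to make the ranking easier to read. The following is a minimal sketch, not part of the original run: `n_estimators=100` and `random_state=0` are assumptions added here so that the ranking is reproducible, and the exact values will differ from the output above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# n_estimators=100 and random_state=0 are assumptions, not the original settings
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(x_train, y_train)

# Pair each feature name with its importance and sort from most to least important
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```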


In [3]:
# Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
# Parameters to adjust: n_estimators, criterion, max_features, max_depth, min_samples_split, min_samples_leaf.

clf = []
y_pred = []
acc = []

# case-1:n_estimators
n_estimators_list = [3, 10, 100]

for i in n_estimators_list:
    clf.append(RandomForestClassifier(n_estimators = i))

# Train each model, predict on the test set, and report the results
for i in np.arange(0, 3):
    clf[i].fit(x_train, y_train)
    y_pred.append(clf[i].predict(x_test))
    acc.append(metrics.accuracy_score(y_test, y_pred[i]))
    print(f"Accuracy_estimator_{n_estimators_list[i]}: {acc[i]}")
    print(f"Feature importance of {n_estimators_list[i]} estimator:{clf[i].feature_importances_}\n")

Accuracy_estimator_3: 0.9210526315789473
Feature importance of 3 estimator:[0.05364942 0.02493972 0.57058711 0.35082375]

Accuracy_estimator_10: 0.9736842105263158
Feature importance of 10 estimator:[0.09172838 0.04114777 0.39751336 0.46961048]

Accuracy_estimator_100: 0.9736842105263158
Feature importance of 100 estimator:[0.11269652 0.02914429 0.36418094 0.49397825]



In [4]:
clf = []
y_pred = []
acc = []

# case-2:criterion
criterion_list = ["gini", "entropy"]

for i in criterion_list:
    clf.append(RandomForestClassifier(criterion = i))
    
for i in np.arange(0, 2):
    clf[i].fit(x_train, y_train)
    y_pred.append(clf[i].predict(x_test))
    acc.append(metrics.accuracy_score(y_test, y_pred[i]))
    print(f"Accuracy_{criterion_list[i]}: {acc[i]}")
    print(f"Feature importance of {criterion_list[i]}:{clf[i].feature_importances_}\n")

Accuracy_gini: 0.9473684210526315
Feature importance of gini:[0.07866104 0.01152948 0.45117321 0.45863627]

Accuracy_entropy: 0.9736842105263158
Feature importance of entropy:[0.23362791 0.05231795 0.2601374  0.45391674]



In [5]:
clf = []
y_pred = []
acc = []

# case-3:max_features
max_features_list = [2, 4]

for i in max_features_list:
    clf.append(RandomForestClassifier(max_features = i))
    
for i in np.arange(0, 2):
    clf[i].fit(x_train, y_train)
    y_pred.append(clf[i].predict(x_test))
    acc.append(metrics.accuracy_score(y_test, y_pred[i]))
    print(f"Accuracy_{max_features_list[i]}: {acc[i]}")
    print(f"Feature importance of {max_features_list[i]} features:{clf[i].feature_importances_}\n")

Accuracy_2: 0.9736842105263158
Feature importance of 2 features:[0.09232627 0.01780708 0.40371113 0.48615552]

Accuracy_4: 0.9736842105263158
Feature importance of 4 features:[0.01813436 0.01707169 0.484292   0.48050195]



In [6]:
clf = []
y_pred = []
acc = []

# case-4:max_depth
max_depth_list = [2, 10, 100]

for i in max_depth_list:
    clf.append(RandomForestClassifier(max_depth = i))
    
for i in np.arange(0, 3):
    clf[i].fit(x_train, y_train)
    y_pred.append(clf[i].predict(x_test))
    acc.append(metrics.accuracy_score(y_test, y_pred[i]))
    print(f"Accuracy_{max_depth_list[i]}: {acc[i]}")
    print(f"Feature importance of {max_depth_list[i]} depth:{clf[i].feature_importances_}\n")

Accuracy_2: 0.9736842105263158
Feature importance of 2 depth:[0.07065076 0.00145008 0.44247422 0.48542493]

Accuracy_10: 0.9736842105263158
Feature importance of 10 depth:[0.09789122 0.02877368 0.35356636 0.51976874]

Accuracy_100: 0.9473684210526315
Feature importance of 100 depth:[0.14254955 0.07890961 0.41157722 0.36696362]



In [7]:
clf = []
y_pred = []
acc = []

# case-5:min_samples_split
min_samples_split_list = [2, 10, 100]

for i in min_samples_split_list:
    clf.append(RandomForestClassifier(min_samples_split = i))
    
for i in np.arange(0, 3):
    clf[i].fit(x_train, y_train)
    y_pred.append(clf[i].predict(x_test))
    acc.append(metrics.accuracy_score(y_test, y_pred[i]))
    print(f"Accuracy_{min_samples_split_list[i]}: {acc[i]}")
    print(f"Feature importance of {min_samples_split_list[i]} min_samples_split:{clf[i].feature_importances_}\n")

Accuracy_2: 0.9736842105263158
Feature importance of 2 min_samples_split:[0.29101009 0.0644555  0.37903054 0.26550387]

Accuracy_10: 0.9736842105263158
Feature importance of 10 min_samples_split:[0.08101213 0.03382717 0.53879044 0.34637026]

Accuracy_100: 0.21052631578947367
Feature importance of 100 min_samples_split:[0. 0. 0. 0.]
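
The all-zero importances for min_samples_split=100 suggest that the trees in that forest never split at all. One way to check this is to count the nodes in each fitted tree, as in the rough sketch below. It is not part of the original run: `random_state=0` is an assumption added for reproducibility, and the block reports the counts without assuming what they will be.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

node_counts = {}
for min_split in [2, 10, 100]:
    # random_state=0 is an assumption added for reproducibility
    clf = RandomForestClassifier(min_samples_split=min_split, random_state=0)
    clf.fit(x_train, y_train)
    # tree_.node_count is 1 when a tree is a single root leaf (no split was made)
    counts = [est.tree_.node_count for est in clf.estimators_]
    node_counts[min_split] = counts
    print(f"min_samples_split={min_split}: "
          f"nodes per tree min={min(counts)}, max={max(counts)}")
```

If every tree reports a single node, the forest can only predict one class, which would be consistent with the very low accuracy above.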



In [8]:
clf = []
y_pred = []
acc = []

# case-6:min_samples_leaf
min_samples_leaf_list = [2, 10, 30]

for i in min_samples_leaf_list:
    clf.append(RandomForestClassifier(min_samples_leaf = i))
    
for i in np.arange(0, 3):
    clf[i].fit(x_train, y_train)
    y_pred.append(clf[i].predict(x_test))
    acc.append(metrics.accuracy_score(y_test, y_pred[i]))
    print(f"Accuracy_{min_samples_leaf_list[i]}: {acc[i]}")
    print(f"Feature importance of {min_samples_leaf_list[i]} min_samples_leaf:{clf[i].feature_importances_}\n")

Accuracy_2: 0.9736842105263158
Feature importance of 2 min_samples_leaf:[0.09056879 0.02316977 0.53880815 0.34745328]

Accuracy_10: 0.9736842105263158
Feature importance of 10 min_samples_leaf:[0.25784083 0.00316397 0.39486116 0.34413404]

Accuracy_30: 0.868421052631579
Feature importance of 30 min_samples_leaf:[0.1 0.  0.7 0.2]



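* None of the adjusted settings beat the default model's accuracy of 0.974 in these runs. Most matched it, and several did worse: n_estimators=3, max_depth=100, min_samples_leaf=30, and especially min_samples_split=100, where every feature importance is 0. This may be because it is unclear how these parameters should be tuned for this data, so the adjustments backfired. The test set also has only 38 samples and the models are built without a fixed random_state, so a gap such as 0.974 versus 0.947 is a single test sample and may be run-to-run variation.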
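
One way to make such comparisons less sensitive to a single small test split is cross-validation with a fixed seed. The sketch below is an illustration rather than the method used above: it assumes 5-fold cross-validation on the full iris data and `random_state=0`, neither of which appears in the original cells, and it only covers n_estimators.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

results = {}
for n in [3, 10, 100]:
    # random_state=0 is an assumption so that repeated runs give the same scores
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    # 5-fold cross-validated accuracy; each fold serves as the test set once
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    results[n] = (scores.mean(), scores.std())
    print(f"n_estimators={n}: mean accuracy={scores.mean():.3f} (std {scores.std():.3f})")
```

The same loop can be repeated for the other parameters by swapping the keyword argument passed to the classifier.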

In [9]:
from sklearn import linear_model

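# Load the wine dataset
wine = datasets.load_wine()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

# Build the model (the class labels 0/1/2 are treated as a numeric regression target)
clf = linear_model.LinearRegression()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

# Mean squared error between the true labels and the predictions
mse = metrics.mean_squared_error(y_test, y_pred)

print("Mean Squared error: %.3f" % mse)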

Mean Squared error: 0.065


In [10]:
clf = RandomForestRegressor()

clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

# Mean squared error between the true labels and the predictions
mse = metrics.mean_squared_error(y_test, y_pred)

print("Mean Squared error: %.3f" % mse)

Mean Squared error: 0.032
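
The assignment also asks for a comparison with a decision tree, which the cells above do not include. A possible sketch, following the same setup (wine labels treated as a regression target, same split, MSE as the metric), is shown below. `DecisionTreeRegressor` and the `random_state=0` arguments are additions for illustration, so the numbers it prints will not match the outputs above exactly.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# random_state=0 on the tree-based models is an assumption for reproducibility
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}

mse_by_model = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    # Lower MSE means predictions are closer to the true labels
    mse_by_model[name] = metrics.mean_squared_error(y_test, y_pred)
    print(f"{name}: MSE = {mse_by_model[name]:.3f}")
```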


In [11]:
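# Ratio of the two (rounded) MSE values printed above
print("%.3f" % (0.032 / 0.065))

0.492


* With Random Forest Regressor the MSE (0.032) is about 49% of the Linear Regression MSE (0.065), that is, roughly 51% lower on this split.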
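
The ratio of two errors and the relative reduction in error are easy to confuse, so the arithmetic can be spelled out. This is a small sketch using the two rounded MSE values printed above; `relative_reduction` is a hypothetical helper written for this illustration, not a library function.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the fractional decrease of `new` relative to `baseline`."""
    return (baseline - new) / baseline


mse_linear = 0.065  # rounded MSE printed for LinearRegression
mse_forest = 0.032  # rounded MSE printed for RandomForestRegressor

ratio = mse_forest / mse_linear
reduction = relative_reduction(mse_linear, mse_forest)

print(f"MSE ratio: {ratio:.3f}")
print(f"Relative reduction: {reduction:.1%}")
```

Because the inputs are rounded to three decimals, the resulting percentages are approximate.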