## Bagging with RandomForest

In this case we will use RandomForest, one of the bagging methods, to classify tumor types. In this exercise you will train on the [Wisconsin Breast Cancer Dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) from the UCI Machine Learning Repository. The goal is to predict whether a tumor is malignant or benign.

We will compare the performance of the Decision Tree and RandomForest algorithms on this case.
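To make the idea of bagging concrete before using the library class, here is a minimal hand-rolled sketch: each tree is trained on a bootstrap sample (rows drawn with replacement) and the predictions are combined by majority vote. The synthetic data from `make_classification` and all variable names are assumptions for illustration only. scikit-learn's `RandomForestClassifier` combines trees by averaging predicted probabilities rather than by a hard vote, so this is a simplified picture, not its actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the tumor dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" considers a random feature subset at each split
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(0, 1_000_000)))
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Majority vote over the individual tree predictions
votes = np.stack([t.predict(X_te) for t in trees])  # shape (n_trees, n_test)
y_vote = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_acc = (y_vote == y_te).mean()
print(f"Bagged accuracy: {bagged_acc:.3f}")
```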

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # import DT
from sklearn.ensemble import RandomForestClassifier # import RandomForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

### Data Preparation

In [None]:
# Load data
df = pd.read_csv('data/wbc.csv')

df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [None]:
# Check each column for null values
df.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

In [None]:
# Feature selection

# Slice the dataframe from column 'radius_mean' through 'fractal_dimension_worst'
X = df.iloc[:,2:-1]
y = df['diagnosis']
y = y.map({'M':1, 'B':0}) # Encode labels: malignant (M) = 1, benign (B) = 0

# Check the number of instances and features
X.shape

(569, 30)

### Split Training and Testing Data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Training Decision Tree

In [None]:
# By default, scikit-learn's DecisionTreeClassifier uses "gini" as the split criterion
# Several other hyperparameters are available; see the documentation for details
# In this case we use the default parameters
dt = DecisionTreeClassifier()

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the test set labels
y_pred_dt = dt.predict(X_test)

# Compute the test set accuracy
acc_dt = accuracy_score(y_test, y_pred_dt)
print("Test set accuracy: {:.2f}".format(acc_dt))
print(f"Test set accuracy: {acc_dt}")

Test set accuracy: 0.95
Test set accuracy: 0.9473684210526315


### Training RandomForest

In [None]:
# In this case we set the number of estimators (trees) used by RandomForest
# For details on the other parameters (hyperparameters), see the documentation

rf = RandomForestClassifier(n_estimators=10, random_state=1)

# Fit rf to the training set
rf.fit(X_train, y_train)

# Predict the test set labels
y_pred_rf = rf.predict(X_test)

# Compute the test set accuracy
acc_rf = accuracy_score(y_test, y_pred_rf)
print("Test set accuracy: {:.2f}".format(acc_rf))
print(f"Test set accuracy: {acc_rf}")

Test set accuracy: 0.96
Test set accuracy: 0.956140350877193
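Accuracy alone hides which class the errors fall on, and `classification_report` is imported above but never called. The sketch below shows one way it could be used together with a confusion matrix. It relies on scikit-learn's bundled `load_breast_cancer` copy of the dataset instead of `data/wbc.csv` (an assumption made so the example is self-contained); note that the bundled copy encodes malignant as 0 and benign as 1, the reverse of the mapping used in this notebook, so the numbers are not guaranteed to match the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Bundled copy of the Wisconsin breast cancer data (0 = malignant, 1 = benign)
data = load_breast_cancer()
X_wbc, y_wbc = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(X_wbc, y_wbc, test_size=0.2, random_state=1)

rf_demo = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_tr, y_tr)
y_hat = rf_demo.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
# Per-class precision, recall and F1
report = classification_report(y_te, y_hat, target_names=data.target_names)

print(cm)
print(report)
```

For a tumor classifier, recall on the malignant class is usually the number to watch, since a missed malignant case is the costlier error.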


### Assignment

The data folder contains the mushroom dataset we used in the Decision Tree material. Using the same dataset, compare the performance of the DT and RandomForest algorithms. Use hyperparameter tuning to find the best parameters and accuracy.

Name : Kevin Natanael Wijaya  
Class : TI-3B  
NIM : 2041720091

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # import DT
from sklearn.ensemble import RandomForestClassifier # import RandomForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
mush=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning/JS09_Ensemble_Method/data/mushrooms.csv')
mush.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [8]:
mush.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

In [9]:
from sklearn.preprocessing import LabelEncoder
labelencoder=LabelEncoder()
for col in mush.columns:
    mush[col]=labelencoder.fit_transform(mush[col])
mush.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
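`LabelEncoder` turns each category into an integer, which implies an order among categories that does not really exist. Tree-based models usually cope with this, but one-hot encoding is a common alternative for nominal features. The sketch below uses a tiny hand-made frame (an assumption, not the real `mushrooms.csv`) whose columns mimic the dataset, and encodes the target with the same `e` = 0, `p` = 1 mapping that `LabelEncoder` produced above.

```python
import pandas as pd

# Tiny stand-in frame shaped like the mushroom data (values are illustrative)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "x", "b", "x"],
    "odor": ["p", "a", "l", "p"],
})

# One-hot encode the features: one indicator column per category
X_toy = pd.get_dummies(toy.drop(columns="class"))

# Encode the target explicitly: edible (e) = 0, poisonous (p) = 1
y_toy = toy["class"].map({"e": 0, "p": 1})

print(X_toy.columns.tolist())
print(y_toy.tolist())
```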


In [10]:
X = mush.drop(columns='class')
y = mush['class']

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [12]:
dt = DecisionTreeClassifier()

dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)

acc_dt = accuracy_score(y_test, y_pred_dt)
print("Test set accuracy: {:.2f}".format(acc_dt))
print(f"Test set accuracy: {acc_dt}")

Test set accuracy: 1.00
Test set accuracy: 1.0


In [13]:
rf = RandomForestClassifier(n_estimators=10, random_state=1)

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

acc_rf = accuracy_score(y_test, y_pred_rf)
print("Test set accuracy: {:.2f}".format(acc_rf))
print(f"Test set accuracy: {acc_rf}")

Test set accuracy: 1.00
Test set accuracy: 1.0
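The assignment also asks for hyperparameter tuning of DT and RandomForest. Both models already reach a test accuracy of 1.0 here, so tuning cannot improve that number, but a search could still be run to find simpler settings. Below is a minimal sketch of such a search with `GridSearchCV`. The synthetic data and the small grid are assumptions chosen to keep the example fast; they are not values recommended by the source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption; not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Small illustrative grid: 2 x 2 = 4 candidate settings
param_rf = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=1), param_rf,
                       cv=3, scoring="accuracy")
grid_rf.fit(X_tr, y_tr)

print(grid_rf.best_params_)
print(grid_rf.best_score_)

# The refitted best estimator is evaluated once on the held-out test set
test_acc = grid_rf.score(X_te, y_te)
print(f"Test set accuracy: {test_acc:.2f}")
```

The same pattern applies to `DecisionTreeClassifier` by swapping the estimator and the grid keys.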


## Boosting Method

In [14]:
from sklearn.ensemble import AdaBoostClassifier # import AdaBoost

In [15]:
ada = AdaBoostClassifier(n_estimators=10)

ada.fit(X_train, y_train)

y_pred_ada = ada.predict(X_test)

acc_ada = accuracy_score(y_test, y_pred_ada)
print("Test set accuracy: {:.2f}".format(acc_ada))
print(f"Test set accuracy: {acc_ada}")

Test set accuracy: 0.96
Test set accuracy: 0.9643076923076923


In [16]:
from sklearn.model_selection import GridSearchCV

ada = AdaBoostClassifier()

param_ada = {
    'n_estimators': [10, 50, 100, 200],
    'learning_rate': [0.1, 0.5, 1, 2]
}

grid_ada = GridSearchCV(ada, param_ada, cv=5, scoring='accuracy')

grid_ada.fit(X_train, y_train)

print(grid_ada.best_params_)
print(grid_ada.best_score_)
print(grid_ada.best_estimator_)

{'learning_rate': 0.5, 'n_estimators': 100}
1.0
AdaBoostClassifier(learning_rate=0.5, n_estimators=100)


In [17]:
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5)

# Fit ada to the training set
ada.fit(X_train, y_train)

# Predict the test set labels
y_pred_ada = ada.predict(X_test)

# Compute the test set accuracy
acc_ada = accuracy_score(y_test, y_pred_ada)
print("Test set accuracy: {:.2f}".format(acc_ada))
print(f"Test set accuracy: {acc_ada}")

Test set accuracy: 1.00
Test set accuracy: 1.0
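The jump from 0.96 with 10 estimators to 1.0 with 100 suggests that accuracy grows as boosting rounds are added. One way to inspect that is `staged_score`, which yields the score after each boosting iteration. The sketch below runs on synthetic data (an assumption; it does not reproduce the mushroom results), and the number of stages may be lower than `n_estimators` if boosting stops early.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption; not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

ada_demo = AdaBoostClassifier(n_estimators=30, learning_rate=0.5, random_state=0)
ada_demo.fit(X_tr, y_tr)

# Test accuracy after 1, 2, ... boosting rounds
stage_scores = list(ada_demo.staged_score(X_te, y_te))

for i, score in enumerate(stage_scores, start=1):
    if i % 10 == 0 or i == 1:
        print(f"after {i:2d} estimators: {score:.3f}")
```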


## Stacking Method with Voting

In [18]:
dbt = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning/JS09_Ensemble_Method/data/diabetes.csv')

dbt.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [19]:
dbt.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [20]:
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
for column in feature_columns:
    print("============================================")
    print(f"{column} ==> Missing zeros : {len(dbt.loc[dbt[column] == 0])}")

Pregnancies ==> Missing zeros : 111
Glucose ==> Missing zeros : 5
BloodPressure ==> Missing zeros : 35
SkinThickness ==> Missing zeros : 227
Insulin ==> Missing zeros : 374
BMI ==> Missing zeros : 11
DiabetesPedigreeFunction ==> Missing zeros : 0
Age ==> Missing zeros : 0


In [21]:
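from sklearn.impute import SimpleImputer

# Treat zeros as missing values and replace them with the column mean
# Note: this applies to every feature column, including Pregnancies, where 0 can be a valid value
fill_values = SimpleImputer(missing_values=0, strategy="mean", copy=False)

dbt[feature_columns] = fill_values.fit_transform(dbt[feature_columns])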

In [22]:
X_dbt = dbt[feature_columns]
y_dbt = dbt.Outcome

X_train_dbt, X_test_dbt, y_train_dbt, y_test_dbt = train_test_split(X_dbt, y_dbt, test_size=0.3, random_state=42)

In [23]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train_std = sc.fit_transform(X_train_dbt)
X_test_std = sc.transform(X_test_dbt)
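In the cells above the imputer is fitted on the full dataset before the split, so the column means include test rows. A common way to avoid this is to put imputation and scaling in a `Pipeline`, so each step is fitted only on the training portion of every fold. The following is a minimal sketch on synthetic data with zero-coded gaps. The data, names and the choice of `LogisticRegression` are assumptions for illustration, not a reproduction of the diabetes results.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption; not diabetes.csv)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)

# Overwrite roughly 10% of the entries with 0 to mimic zero-coded missing values
mask = rng.random(X_demo.shape) < 0.1
X_demo[mask] = 0

# Imputer and scaler are refitted on the training part of each CV fold
pipe = make_pipeline(
    SimpleImputer(missing_values=0, strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores.mean())
```

A pipeline like this can also be passed directly to `GridSearchCV`, with parameter names prefixed by the step name (for example `logisticregression__C`).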

In [24]:
# Model Logistic Regression
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train_std, y_train_dbt)

y_pred_logreg = logreg.predict(X_test_std)

acc_logreg = accuracy_score(y_test_dbt, y_pred_logreg)

print("Test set accuracy: {:.2f}".format(acc_logreg))
print(f"Test set accuracy: {acc_logreg}")

Test set accuracy: 0.74
Test set accuracy: 0.7359307359307359


In [25]:
# Tuning
from sklearn.model_selection import GridSearchCV

logreg = LogisticRegression()

param_logreg = {
    'C': [0.1, 0.5, 1, 2, 5, 10],
    'max_iter': [100, 200, 300, 400, 500],
}

grid_logreg = GridSearchCV(logreg, param_logreg, cv=5, scoring='accuracy')

grid_logreg.fit(X_train_std, y_train_dbt)

print(grid_logreg.best_params_)
print(grid_logreg.best_score_)
print(grid_logreg.best_estimator_)

{'C': 0.5, 'max_iter': 100}
0.778331602630668
LogisticRegression(C=0.5)


In [26]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=0.5, max_iter=100)

logreg.fit(X_train_std, y_train_dbt)

y_pred_logreg = logreg.predict(X_test_std)

acc_logreg = accuracy_score(y_test_dbt, y_pred_logreg)

print("Test set accuracy: {:.2f}".format(acc_logreg))
print(f"Test set accuracy: {acc_logreg}")

Test set accuracy: 0.74
Test set accuracy: 0.7402597402597403


In [27]:
# Model SVM Kernel Poly
from sklearn.svm import SVC

svm = SVC(kernel='poly')

svm.fit(X_train_std, y_train_dbt)

y_pred_svm = svm.predict(X_test_std)

acc_svm = accuracy_score(y_test_dbt, y_pred_svm)

print("Test set accuracy: {:.2f}".format(acc_svm))

print(f"Test set accuracy: {acc_svm}")

Test set accuracy: 0.70
Test set accuracy: 0.696969696969697


In [28]:
# Tuning
from sklearn.model_selection import GridSearchCV

svm = SVC(kernel='poly')

param_svm = {
    'C': [0.1, 0.5, 1, 2, 5, 10],
    'degree': [2, 3, 4, 5, 6, 7, 8, 9, 10],
}

grid_svm = GridSearchCV(svm, param_svm, cv=5, scoring='accuracy')

grid_svm.fit(X_train_std, y_train_dbt)

print(grid_svm.best_params_)
print(grid_svm.best_score_)
print(grid_svm.best_estimator_)

{'C': 0.5, 'degree': 3}
0.7393042575285567
SVC(C=0.5, kernel='poly')


In [29]:
# Model SVM Kernel Poly
from sklearn.svm import SVC

svm = SVC(kernel='poly', C=0.5, degree=3)

svm.fit(X_train_std, y_train_dbt)

y_pred_svm = svm.predict(X_test_std)

acc_svm = accuracy_score(y_test_dbt, y_pred_svm)

print("Test set accuracy: {:.2f}".format(acc_svm))

print(f"Test set accuracy: {acc_svm}")

Test set accuracy: 0.71
Test set accuracy: 0.70995670995671


In [30]:
# Model Decision Tree
dt = DecisionTreeClassifier()

dt.fit(X_train_std, y_train_dbt)

y_pred_dt = dt.predict(X_test_std)

acc_dt = accuracy_score(y_test_dbt, y_pred_dt)

print("Test set accuracy: {:.2f}".format(acc_dt))

print(f"Test set accuracy: {acc_dt}")

Test set accuracy: 0.71
Test set accuracy: 0.70995670995671


In [31]:
# Decision Tree
from sklearn.model_selection import GridSearchCV

dt = DecisionTreeClassifier()

param_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}

grid_dt = GridSearchCV(dt, param_dt, cv=5, scoring='accuracy')

grid_dt.fit(X_train_std, y_train_dbt)

print(grid_dt.best_params_)
print(grid_dt.best_score_)
print(grid_dt.best_estimator_)

{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 10, 'min_samples_split': 4}
0.7561093804084458
DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=10,
                       min_samples_split=4)


In [32]:
# Model with Decision Tree
dt = DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=10)

dt.fit(X_train_std, y_train_dbt)

y_pred_dt = dt.predict(X_test_std)

acc_dt = accuracy_score(y_test_dbt, y_pred_dt)

print("Test set accuracy: {:.2f}".format(acc_dt))

print(f"Test set accuracy: {acc_dt}")

Test set accuracy: 0.72
Test set accuracy: 0.7186147186147186


In [35]:
from sklearn.ensemble import VotingClassifier # import the Voting model
# Ensemble model with voting
# Note: C=0.6 and degree=1 differ from the grid-search results above (C=0.5, degree=3)
clf1 = LogisticRegression(C=0.6, max_iter=100)
clf2 = SVC(kernel='poly', C=0.6, degree=1)
clf3 = DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=10)

# Hard voting model
voting = VotingClassifier(estimators=[('LogReg', clf1), ('SVM-Poly', clf2), ('DT', clf3)], voting='hard')

# Fit the model
voting.fit(X_train_std, y_train_dbt)

# Predict
y_pred_vt1 = voting.predict(X_test_std)

# Evaluate accuracy on the test data
acc_vt1 = accuracy_score(y_test_dbt, y_pred_vt1)

# Print the evaluation results
print('Voting Hard')
print("Test set accuracy: {:.2f}".format(acc_vt1))
print(f"Test set accuracy: {acc_vt1}")

Voting Hard
Test set accuracy: 0.74
Test set accuracy: 0.7402597402597403
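Hard voting counts class labels. Soft voting instead averages the predicted class probabilities, which requires every base model to expose `predict_proba`; for `SVC` that means setting `probability=True`. Below is a minimal sketch of the soft variant on synthetic data. The data and the hyperparameters are assumptions for illustration, and no claim is made about how it would score on the diabetes data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not diabetes.csv)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Scale using statistics from the training split only
sc_demo = StandardScaler()
X_tr_std = sc_demo.fit_transform(X_tr)
X_te_std = sc_demo.transform(X_te)

soft = VotingClassifier(
    estimators=[
        ("LogReg", LogisticRegression()),
        # probability=True enables predict_proba, needed for soft voting
        ("SVM-Poly", SVC(kernel="poly", probability=True, random_state=0)),
        ("DT", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="soft",
)
soft.fit(X_tr_std, y_tr)

# Averaged class probabilities, one row per test sample
proba = soft.predict_proba(X_te_std)
soft_acc = soft.score(X_te_std, y_te)

print("Voting Soft")
print(f"Test set accuracy: {soft_acc:.2f}")
```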
