## Bagging with RandomForest

In this case we will use RandomForest, a bagging method, to classify tumor types. In this exercise you will train a model on the [Wisconsin Breast Cancer Dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) from the UCI Machine Learning Repository and predict whether a tumor is malignant or benign.

We will compare the performance of the Decision Tree and RandomForest algorithms on this task.
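
To make the bagging idea concrete, here is a minimal sketch in plain Python. It is not how scikit-learn implements RandomForest: the one-feature threshold classifier (`fit_stump`), the toy data, and the choice of 11 estimators are all assumptions made for illustration. Each model is trained on a bootstrap sample (drawn with replacement) and the ensemble predicts by majority vote. A real random forest also samples a random subset of features at each split, which this sketch omits.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold classifier (decision stump) on (x, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for above in (0, 1):  # label predicted when x > threshold
            errors = sum(
                (above if x > threshold else 1 - above) != label
                for x, label in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x > threshold else 1 - above

def bagging_fit(points, n_estimators, seed):
    """Train one stump per bootstrap sample of the training points."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        # Bootstrap: draw len(points) items with replacement
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Majority vote over the individual model predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: larger x tends to be class 1, with one noisy point at x=5.0
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
models = bagging_fit(data, n_estimators=11, seed=1)  # odd count avoids tied votes
predictions = [bagging_predict(models, x) for x, _ in data]
print(predictions)
```

Because every stump sees a slightly different resample, a single noisy point influences only some of the models, and the vote tends to smooth it out.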

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import Library

In [None]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # import DT
from sklearn.ensemble import RandomForestClassifier # import RandomForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

### Data Preparation

In [None]:
# Load data
df = pd.read_csv('/content/drive/MyDrive/Data/PembelajaranMesin/P10/wbc.csv')

df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [None]:
# Check for null values in each column
df.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

In [None]:
# Feature selection

# Slice the dataframe from column 'radius_mean' through 'fractal_dimension_worst'
X = df.iloc[:,2:-1]
y = df['diagnosis']
y = y.map({'M':1, 'B':0}) # Encode labels: malignant (M) = 1, benign (B) = 0

# Check the number of instances and features
X.shape

(569, 30)

### Split training and testing data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Training Decision Tree

In [None]:
# By default, scikit-learn's DecisionTreeClassifier uses the "gini" criterion
# Several hyperparameters are available; see the documentation
# In this case we use the default parameters
dt = DecisionTreeClassifier()

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the test set labels
y_pred_dt = dt.predict(X_test)

# Compute the test set accuracy
acc_dt = accuracy_score(y_test, y_pred_dt)
print("Test set accuracy: {:.2f}".format(acc_dt))
print(f"Test set accuracy: {acc_dt}")

Test set accuracy: 0.94
Test set accuracy: 0.9385964912280702


### Training RandomForest

In [None]:
# Here we set n_estimators=10 and random_state=1 and keep the other RandomForest parameters at their defaults
# For parameter (hyperparameter) details, see the documentation

rf = RandomForestClassifier(n_estimators=10, random_state=1)

# Fit rf to the training set
rf.fit(X_train, y_train)

# Predict the test set labels
y_pred_rf = rf.predict(X_test)

# Compute the test set accuracy
acc_rf = accuracy_score(y_test, y_pred_rf)
print("Test set accuracy: {:.2f}".format(acc_rf))
print(f"Test set accuracy: {acc_rf}")

Test set accuracy: 0.96
Test set accuracy: 0.956140350877193
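
The two accuracies are easier to compare as counts of correct predictions. The sketch below works backwards from the numbers printed above, assuming the test set size is the 20% share of the 569 rows rounded up, which is how `train_test_split` sizes the test set when given a fractional `test_size`.

```python
import math

n_samples = 569                        # rows in X, from X.shape above
n_test = math.ceil(0.2 * n_samples)    # test_size=0.2, rounded up

acc_dt = 0.9385964912280702            # Decision Tree accuracy printed above
acc_rf = 0.956140350877193             # RandomForest accuracy printed above

# Accuracy is (correct predictions) / (test samples), so invert it
correct_dt = round(acc_dt * n_test)
correct_rf = round(acc_rf * n_test)

print(n_test)       # 114
print(correct_dt)   # 107
print(correct_rf)   # 109
```

On this split, RandomForest classifies two more test samples correctly than the single tree (109 versus 107 out of 114). A gap that small on one split is suggestive rather than conclusive, especially since `DecisionTreeClassifier()` is created without a `random_state` and may give slightly different results between runs.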


### Assignment

The data folder contains the mushroom dataset that we used in the Decision Tree material. Using the same dataset, compare the performance of the DT and RandomForest algorithms. Use hyperparameter tuning to find the best parameters and accuracy.
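
One common way to do hyperparameter tuning in scikit-learn is `GridSearchCV`, which cross-validates every combination in a parameter grid. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` as a stand-in for the encoded mushroom features, and the grid values are arbitrary examples, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); replace with the encoded mushroom X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Illustrative grid: every combination is evaluated with 5-fold cross-validation
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)        # best combination found on the training folds
print(search.best_score_)         # mean cross-validated accuracy of that combination
print(search.score(X_te, y_te))   # accuracy of the refit best model on the held-out test set
```

The same pattern applies to `DecisionTreeClassifier` with a grid over parameters such as `max_depth` or `min_samples_leaf`. Keeping the test set out of the search means the final score is not biased by the tuning itself.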

**Import Library**

In [3]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # import DT
from sklearn.ensemble import RandomForestClassifier # import RandomForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

**Prepare Data**

In [4]:
# Load data
mushroom = pd.read_csv('/content/drive/MyDrive/Data/PembelajaranMesin/P10/mushrooms.csv')

mushroom.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


**Encoding Data**

In [5]:
from sklearn.preprocessing import LabelEncoder
encode = LabelEncoder()

# Label-encode every column (the 'class' target and all 22 features),
# re-fitting the encoder on each column in turn
for column in mushroom.columns:
    mushroom[column] = encode.fit_transform(mushroom[column])
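
To see what the encoding step does to each column, here is a small plain-Python sketch of the same idea: the distinct values are sorted and each is replaced by its position in that sorted list. The helper name `label_encode` is hypothetical, and the input is the `class` column of the five rows shown by `mushroom.head()`.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# 'class' values of the first five rows: p = poisonous, e = edible
codes, classes = label_encode(["p", "e", "e", "p", "e"])
print(classes)  # ['e', 'p']
print(codes)    # [1, 0, 0, 1, 0]
```

So in the `class` column, `e` becomes 0 and `p` becomes 1. Note that re-fitting one encoder on every column keeps only the mapping of the last column fitted, so decoding an earlier column later would need a separate encoder per column. The integer codes also imply an order between categories that does not really exist, which tree-based models usually tolerate.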

In [6]:
# Check for null values in each column
mushroom.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

In [7]:
# Feature selection

# Slice the dataframe from column 'cap-shape' through 'habitat' (every column except 'class')
X = mushroom.iloc[:,1:]
y = mushroom['class']

# Check the number of instances and features
X.shape

(8124, 22)

**Split training and testing data**

In [8]:
from sklearn.model_selection import train_test_split

X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.2, random_state=1)

**Training Decision Tree**

In [9]:
# By default, scikit-learn's DecisionTreeClassifier uses the "gini" criterion
# Several hyperparameters are available; see the documentation
# In this case we use the default parameters
dt = DecisionTreeClassifier()

# Fit dt to the training set
dt.fit(X1_train, y1_train)

# Predict the test set labels
y1_pred_dt = dt.predict(X1_test)

# Compute the test set accuracy
acc1_dt = accuracy_score(y1_test, y1_pred_dt)
print("Test set accuracy: {:.2f}".format(acc1_dt))
print(f"Test set accuracy: {acc1_dt}")

Test set accuracy: 1.00
Test set accuracy: 1.0


**Training Random Forest**

In [10]:
# Here we set n_estimators=10 and random_state=1 and keep the other RandomForest parameters at their defaults
# For parameter (hyperparameter) details, see the documentation

rf = RandomForestClassifier(n_estimators=10, random_state=1)

# Fit rf to the training set
rf.fit(X1_train, y1_train)

# Predict the test set labels
y1_pred_rf = rf.predict(X1_test)

# Compute the test set accuracy
acc1_rf = accuracy_score(y1_test, y1_pred_rf)
print("Test set accuracy: {:.2f}".format(acc1_rf))
print(f"Test set accuracy: {acc1_rf}")

Test set accuracy: 1.00
Test set accuracy: 1.0


### AdaBoost with scikit-learn

In [14]:
# Here we set n_estimators=10 and random_state=1 and keep the other AdaBoost parameters at their defaults
# For parameter (hyperparameter) details, see the documentation
from sklearn.ensemble import AdaBoostClassifier
ad = AdaBoostClassifier(n_estimators=10, random_state=1)

# Fit ad to the training set
ad.fit(X1_train, y1_train)

# Predict the test set labels
y1_pred_ad = ad.predict(X1_test)

# Compute the test set accuracy
acc1_ad = accuracy_score(y1_test, y1_pred_ad)
print("Test set accuracy: {:.2f}".format(acc1_ad))
print(f"Test set accuracy: {acc1_ad}")

Test set accuracy: 0.96
Test set accuracy: 0.9643076923076923


In [15]:
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Two-layer stacking on the mushroom data (X, y from the cells above)
# Layer one: base estimators whose predictions feed the final estimator
layer_one_estimators = [
                        ('rf_1', RandomForestClassifier(n_estimators=10, random_state=42)),
                        ('knn_1', KNeighborsClassifier(n_neighbors=5))
                       ]
# Layer two: a second stack that combines its own estimators with LogisticRegression
layer_two_estimators = [
                        ('dt_2', DecisionTreeClassifier()),
                        ('rf_2', RandomForestClassifier(n_estimators=50, random_state=42)),
                       ]
layer_two = StackingClassifier(estimators=layer_two_estimators, final_estimator=LogisticRegression())

# The layer-two stack acts as the final estimator for layer one
clf = StackingClassifier(estimators=layer_one_estimators, final_estimator=layer_two)

# Stratified split (default test_size of 0.25), then fit and report test accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf.fit(X_train, y_train).score(X_test, y_test)

1.0