## **Question No. 2**

A typical machine learning workflow involves trying several models on a dataset and selecting the one that performs best. Accurate predictions require not only a suitable machine learning model but also suitable hyperparameters (the parameters that control the learning process). The search for those values is known as hyperparameter tuning.

Grid Search Cross Validation is a method for selecting a combination of model and hyperparameters by trying each combination in turn and validating every one of them. The goal is to identify the combination that gives the best model performance, which can then be used as the prediction model. Using the same dataset, use the GridSearchCV function for hyperparameter tuning and model selection to find the best combination of model and hyperparameters, subject to the requirements below.
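To make the "try every combination and validate each one" idea concrete, here is a minimal sketch that runs the loop by hand with `ParameterGrid` and `cross_val_score`, then compares it with `GridSearchCV`. The data is a synthetic stand-in created with `make_classification` (an assumption, not the Haberman dataset), and the small grid and the names `X_demo`, `y_demo`, and `manual_scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 3 numeric features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

param_grid = {"n_neighbors": [3, 5, 7], "p": [1, 2]}

# Manual grid search: score every combination with 5-fold cross-validation.
manual_scores = {}
for params in ParameterGrid(param_grid):
    model = KNeighborsClassifier(**params)
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    manual_scores[tuple(sorted(params.items()))] = fold_scores.mean()

best_key = max(manual_scores, key=manual_scores.get)
print("Manual best:", dict(best_key), round(manual_scores[best_key], 3))

# GridSearchCV runs the same loop and then refits the best combination.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_demo, y_demo)
print("GridSearchCV best:", search.best_params_, round(search.best_score_, 3))
```

Both approaches use the same deterministic stratified folds here, so the best mean score should agree. `GridSearchCV` additionally keeps the per-candidate results in `cv_results_` and the refitted model in `best_estimator_`.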


In [None]:
#Import Library

import pandas as pd

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt

#### **A. kNN Model**

In [None]:
#Import Dataset

columns = ['age', 'year', 'nodes', 'class']
df = pd.read_csv("/content/haberman.data", names=columns)
df

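| | age | year | nodes | class |
|---|---|---|---|---|
| 0 | 30 | 64 | 1 | 1 |
| 1 | 30 | 62 | 3 | 1 |
| 2 | 30 | 65 | 0 | 1 |
| 3 | 31 | 59 | 2 | 1 |
| 4 | 31 | 65 | 4 | 1 |
| ... | ... | ... | ... | ... |
| 301 | 75 | 62 | 1 | 1 |
| 302 | 76 | 67 | 0 | 1 |
| 303 | 77 | 65 | 3 | 1 |
| 304 | 78 | 65 | 1 | 2 |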


In [None]:
#Collecting features

X = df[["age", "year", "nodes"]].values
y = df["class"].values

In [None]:
#Split dataset into train and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
#Feature Scaling

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
from sklearn.model_selection import GridSearchCV

#List Hyperparameters that we want to tune.
leaf_size = list(range(1,50))
n_neighbors = list(range(1,30))
p=[1,2]

#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)

#Create new KNN object
knn_2 = KNeighborsClassifier(p=2)

#Use GridSearch
clf = GridSearchCV(knn_2, hyperparameters, cv=10)

#Fit the model
best_model = clf.fit(X_train,y_train)

#Print The value of best Hyperparameters
print('Best leaf_size:', best_model.best_estimator_.get_params()['leaf_size'])
print('Best p:', best_model.best_estimator_.get_params()['p'])
print('Best n_neighbors:', best_model.best_estimator_.get_params()['n_neighbors'])

Best leaf_size: 1
Best p: 1
Best n_neighbors: 15
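The same information is available more directly through `best_params_`, and because `refit=True` is the default, the fitted search object can predict on its own. The sketch below also leaves `leaf_size` out of the grid: scikit-learn documents it as affecting the speed and memory of the tree-based neighbor search rather than the model itself (tie-breaking among equidistant neighbors aside), so dropping it shrinks the grid from 49 × 29 × 2 = 2842 candidates to 58. The data is a synthetic stand-in (an assumption), and `grid_small` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), split the same way as in the notebook.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Grid without leaf_size: 29 * 2 = 58 candidates.
grid_small = {"n_neighbors": list(range(1, 30)), "p": [1, 2]}
search = GridSearchCV(KNeighborsClassifier(), grid_small, cv=10)
search.fit(X_tr, y_tr)

# best_params_ holds only the tuned values.
print(search.best_params_)

# With refit=True (the default) the search object delegates to the best
# estimator, refitted on all of X_tr.
y_pred_demo = search.predict(X_te)
print("Test accuracy:", search.score(X_te, y_te))
```

Using `search.predict` avoids copying the best values by hand into a second `KNeighborsClassifier`, which removes one place where a transcription mistake could creep in.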


In [None]:
# Create and train the K Nearest Neighbor model

classifier = KNeighborsClassifier(leaf_size = 1, n_neighbors=15, p=1)
classifier.fit(X_train, y_train)

KNeighborsClassifier(leaf_size=1, n_neighbors=15, p=1)

In [None]:
# Predict the test set results with the tuned KNN

y_pred = classifier.predict(X_test)

In [None]:
# Calculate Confusion Matrix, Accuracy Score, and Classification Report

cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test,y_pred)
cr = classification_report(y_test,y_pred)

In [None]:
print(cm)
print(ac)
print(cr)

[[36  1]
 [23  2]]
0.6129032258064516
              precision    recall  f1-score   support

           1       0.61      0.97      0.75        37
           2       0.67      0.08      0.14        25

    accuracy                           0.61        62
   macro avg       0.64      0.53      0.45        62
weighted avg       0.63      0.61      0.51        62
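The figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall, and F1 from the matrix printed above (rows are true classes 1 and 2, columns are predicted classes 1 and 2). The variable name `cm_knn` is illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm_knn = np.array([[36, 1],
                   [23, 2]])

true_positives = np.diag(cm_knn)

accuracy = true_positives.sum() / cm_knn.sum()      # (36 + 2) / 62
precision = true_positives / cm_knn.sum(axis=0)     # per predicted-class column
recall = true_positives / cm_knn.sum(axis=1)        # per true-class row
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", round(accuracy, 4))
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("f1       :", f1.round(2))
```

The recall of 0.08 for class 2 means only 2 of the 25 class 2 samples in the test set were identified, even though overall accuracy is 0.61.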



#### **B. SVM Model**

In [None]:
#Import Dataset

columns = ['age', 'year', 'nodes', 'class']
df = pd.read_csv("/content/haberman.data", names=columns)
df

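| | age | year | nodes | class |
|---|---|---|---|---|
| 0 | 30 | 64 | 1 | 1 |
| 1 | 30 | 62 | 3 | 1 |
| 2 | 30 | 65 | 0 | 1 |
| 3 | 31 | 59 | 2 | 1 |
| 4 | 31 | 65 | 4 | 1 |
| ... | ... | ... | ... | ... |
| 301 | 75 | 62 | 1 | 1 |
| 302 | 76 | 67 | 0 | 1 |
| 303 | 77 | 65 | 3 | 1 |
| 304 | 78 | 65 | 1 | 2 |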


In [None]:
#Collecting features

X = df[["age", "year", "nodes"]].values
y = df["class"].values

In [None]:
#Split dataset into train and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
#Feature Scaling

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# Fit an SVM with default hyperparameters to the training set

from sklearn.svm import SVC

classifier = SVC()
classifier.fit(X_train, y_train)

SVC()

In [None]:
from sklearn.model_selection import GridSearchCV
 
# defining parameter range
param_grid = {'C': [0.1, 1, 10],
              'gamma': [1, 0.1, 0.001],
              'kernel': ['rbf','linear']}
 
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)
 
# fitting the model for grid search
grid.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.776 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.776 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.776 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.755 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.771 total time=   0.0s
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.735 total time=   0.0s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.776 total time=   0.0s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.776 total time=   0.0s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.755 total time=   0.0s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.688 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.776 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.001],
                         'kernel': ['rbf', 'linear']},
             verbose=3)

In [None]:
# print best parameter after tuning
print(grid.best_params_)
 
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)

{'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}
SVC(C=0.1, gamma=1)
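In `SVC`, `gamma` is the kernel coefficient for the `rbf`, `poly`, and `sigmoid` kernels and is not used by the `linear` kernel, so the flat grid above evaluates each linear candidate three times. A possible alternative, sketched here, passes `param_grid` as a list of dictionaries so that `gamma` is varied only for `rbf`. The names `flat_grid` and `split_grid` and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Flat grid as used above: every kernel is crossed with every gamma.
flat_grid = {"C": [0.1, 1, 10],
             "gamma": [1, 0.1, 0.001],
             "kernel": ["rbf", "linear"]}

# Split grid: gamma is only varied for the rbf kernel.
split_grid = [
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [1, 0.1, 0.001]},
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
]

print(len(ParameterGrid(flat_grid)))   # 18 candidates
print(len(ParameterGrid(split_grid)))  # 12 candidates

# GridSearchCV accepts the list form directly (synthetic stand-in data).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
search = GridSearchCV(SVC(), split_grid, cv=5).fit(X_demo, y_demo)
print(search.best_params_)
```

With the default 5 folds this reduces the work from 90 fits to 60 without removing any distinct model from the search.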


In [None]:
# Refit the SVM on the training set with the best hyperparameters

from sklearn.svm import SVC

classifier = SVC(C=0.1, gamma=1)
classifier.fit(X_train, y_train)

SVC(C=0.1, gamma=1)

In [None]:
# Predicting the Test set results

y_pred = classifier.predict(X_test)

In [None]:
# Display Confusion Matrix

cm = confusion_matrix(y_test, y_pred)
cm

array([[37,  0],
       [25,  0]])

In [None]:
# Display Accuracy Score

ac = accuracy_score(y_test,y_pred)
ac

0.5967741935483871

In [None]:
# Display Classification Report

cr = classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           1       0.60      1.00      0.75        37
           2       0.00      0.00      0.00        25

    accuracy                           0.60        62
   macro avg       0.30      0.50      0.37        62
weighted avg       0.36      0.60      0.45        62
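The confusion matrix shows that the tuned SVC assigns every test sample to class 1, so its accuracy equals the share of class 1 in the test set. The sketch below reconstructs labels from the support column (37 and 25; the order of samples does not affect these metrics) and compares against a majority-class baseline. The training labels passed to `DummyClassifier` are hypothetical and only need class 1 as the majority.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Test labels rebuilt from the support column: 37 of class 1, 25 of class 2.
y_true = np.array([1] * 37 + [2] * 25)

# Predictions implied by the confusion matrix: everything is class 1.
y_pred_all_ones = np.ones_like(y_true)

print("accuracy         :", accuracy_score(y_true, y_pred_all_ones))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred_all_ones))

# Majority-class baseline, fitted on hypothetical labels where class 1 dominates.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((10, 1)), [1] * 7 + [2] * 3)
baseline_pred = baseline.predict(np.zeros((len(y_true), 1)))
print("baseline accuracy:", accuracy_score(y_true, baseline_pred))
```

A balanced accuracy of 0.5 indicates performance no better than always guessing one class. Passing `scoring="balanced_accuracy"` or `class_weight="balanced"` to the search are options one might try, although neither is used in this notebook.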





#### **C. Naive Bayes Model**

In [None]:
#Import Dataset

columns = ['age', 'year', 'nodes', 'class']
df = pd.read_csv("/content/haberman.data", names=columns)
df

#Collecting features

X = df[["age", "year", "nodes"]].values
y = df["class"].values

In [None]:
#Split dataset into train and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
#Feature Scaling

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# Load the Naive Bayes classifier model
from sklearn.naive_bayes import GaussianNB

# Create the classifier and perform the training
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()

In [None]:
import numpy as np
np.logspace(0,-9, num=10)

array([1.e+00, 1.e-01, 1.e-02, 1.e-03, 1.e-04, 1.e-05, 1.e-06, 1.e-07,
       1.e-08, 1.e-09])

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold

cv_method = RepeatedStratifiedKFold(n_splits=5, 
                                    n_repeats=3, 
                                    random_state=999)
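`RepeatedStratifiedKFold` repeats a stratified k-fold split with different randomization in each repetition, so 5 splits repeated 3 times give 15 train/validation pairs per candidate. A small sketch on placeholder data (the arrays `X_placeholder` and `y_placeholder` are assumptions) shows the count:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)

# Placeholder data: 100 samples, 60 of class 1 and 40 of class 2.
X_placeholder = np.zeros((100, 3))
y_placeholder = np.array([1] * 60 + [2] * 40)

n_pairs = cv_demo.get_n_splits()
test_sizes = [len(test_idx)
              for _, test_idx in cv_demo.split(X_placeholder, y_placeholder)]

print(n_pairs)     # 15 = 5 splits * 3 repeats
print(test_sizes)  # each validation fold holds 20 of the 100 samples
```

Every sample appears in exactly one validation fold per repetition, so the validation sizes sum to 3 × 100.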

In [None]:
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import GridSearchCV

params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}

nb = GridSearchCV(estimator=classifier, 
                     param_grid=params_NB, 
                     cv=cv_method,
                     verbose=1, 
                     scoring='accuracy')

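# NOTE: the search is fitted on the test split (X_test, y_test), not on the
# training split, so best_score_ is not an independent estimate.
Data_transformed = PowerTransformer().fit_transform(X_test)

nb.fit(Data_transformed, y_test);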

Fitting 15 folds for each of 100 candidates, totalling 1500 fits


In [None]:
nb.best_params_

{'var_smoothing': 0.533669923120631}

In [None]:
nb.best_score_

0.6132478632478632
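The search above is fitted on `PowerTransformer().fit_transform(X_test)` and `y_test`, so the test split takes part in tuning, and the final classifier is later trained without the power transform. One possible leak-free arrangement, sketched below, places the transformer and `GaussianNB` in a `Pipeline` and tunes on the training split only. The data is a synthetic stand-in, and the step names `power` and `nb` and the reduced grid of 20 values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

# The transformer is refitted inside every training fold, so validation
# folds never influence it.
pipe = Pipeline([("power", PowerTransformer()), ("nb", GaussianNB())])
param_grid = {"nb__var_smoothing": np.logspace(0, -9, num=20)}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_tr, y_tr)  # tuning sees the training split only

print(search.best_params_)
print("CV accuracy  :", round(search.best_score_, 3))
print("Test accuracy:", round(search.score(X_te, y_te), 3))
```

With this layout the test split is used once, for the final score, and the same preprocessing is applied during tuning and prediction.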

In [None]:
# Load the Naive Bayes classifier model
from sklearn.naive_bayes import GaussianNB

# Create the classifier and perform the training
classifier = GaussianNB(var_smoothing= 0.533669923120631)
classifier.fit(X_train, y_train)

GaussianNB(var_smoothing=0.533669923120631)

In [None]:
# Load the metrics module from sklearn
from sklearn.metrics import confusion_matrix,accuracy_score

# Predict the test dataset
y_pred = classifier.predict(X_test)

In [None]:
# Calculate Accuracy

ac = accuracy_score(y_test,y_pred)
ac

0.5967741935483871

In [None]:
# Calculate Confusion Matrix

cm = confusion_matrix(y_test, y_pred)
cm

array([[36,  1],
       [24,  1]])

In [None]:
# Display Classification Report

cr = classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           1       0.60      0.97      0.74        37
           2       0.50      0.04      0.07        25

    accuracy                           0.60        62
   macro avg       0.55      0.51      0.41        62
weighted avg       0.56      0.60      0.47        62
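Since the task is to pick the best model, the test-set results reported in sections A to C can be placed side by side. The sketch below only copies the accuracies and confusion-matrix counts already shown above. The dictionary name `results` is illustrative.

```python
# Test-set results copied from sections A, B and C above.
# recall_class_2 = correctly predicted class 2 samples / 25 class 2 samples.
results = {
    "kNN": {"accuracy": 0.6129032258064516, "recall_class_2": 2 / 25},
    "SVM": {"accuracy": 0.5967741935483871, "recall_class_2": 0 / 25},
    "Naive Bayes": {"accuracy": 0.5967741935483871, "recall_class_2": 1 / 25},
}

for name, metrics in results.items():
    print(f"{name:<12} accuracy={metrics['accuracy']:.3f} "
          f"recall(class 2)={metrics['recall_class_2']:.2f}")

best_name = max(results, key=lambda name: results[name]["accuracy"])
print("Highest test accuracy:", best_name)
```

On these numbers kNN has the highest test accuracy, although all three models identify at most 2 of the 25 class 2 samples, so the differences between them are small.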

