<a href="https://colab.research.google.com/github/Iqbal18062002/HepatitisCPred/blob/main/3_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Prediksi Hepatitis C**

## **Modelling**

Disini akan kita bangun sebuah classifier yang akan membantu kita mengklasifikasi status seorang pasien, apakah ia merupakan seorang donor darah, kemungkinan donor darah, hepatitis C, Fibrosis, atau Sirosis.

Akan kita coba 3 algoritma disini, yaitu:
- **K-Nearest Neighbors**
- **Gaussian Naive Bayes**
-**Random Forest**

## **Import libraries**

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as sp
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler


## **Load Dataset**

In [None]:
df = pd.read_csv('cleaned_Data.csv')

Kolom Sex akan kita drop karena tidak memiliki korelasi kategorikal yang baik dengan Category.

In [None]:
df.drop(columns = ['Unnamed: 0', 'Sex'], inplace = True)
df

Unnamed: 0,Category,Age,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,0,32.0,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,0,32.0,38.5,70.3,18.0,24.7,3.9,11.17,4.80,74.0,15.6,76.5
2,0,32.0,46.9,74.7,36.2,52.6,6.1,8.84,5.20,86.0,33.2,79.3
3,0,32.0,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,0,32.0,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
...,...,...,...,...,...,...,...,...,...,...,...,...
607,4,62.0,32.0,416.6,5.9,110.3,50.0,5.57,6.30,55.7,650.9,68.5
608,4,64.0,24.0,102.8,2.9,44.4,20.0,1.54,3.02,63.0,35.9,71.3
609,4,64.0,29.0,87.3,3.5,99.0,48.0,1.66,3.63,66.7,64.2,82.0
610,4,46.0,33.0,66.2,39.0,62.0,20.0,3.56,4.20,52.0,50.0,71.0


## **Split Dataset**

Akan kita pecah data, menjadi

In [None]:
X=df.drop(columns = ['Category'])
y=df['Category']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)

## **KNN**

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, weights='distance' ,metric="euclidean")
knn.fit(X_train, y_train.values)

In [None]:
y_pred=knn.predict(X_test)
df1=pd.DataFrame({'Actual Status':y_test,'Predicted Status':y_pred})
df1

Unnamed: 0,Actual Status,Predicted Status
493,0,0
472,0,0
107,0,0
558,2,4
535,1,0
...,...,...
587,4,4
42,0,0
334,0,0
305,0,0


In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       161
           1       0.00      0.00      0.00         2
           2       0.40      0.25      0.31         8
           3       0.50      0.25      0.33         4
           4       0.50      0.44      0.47         9

    accuracy                           0.90       184
   macro avg       0.47      0.39      0.42       184
weighted avg       0.88      0.90      0.89       184



In [None]:
print(confusion_matrix(y_test,y_pred))

[[159   0   1   0   1]
 [  1   0   0   0   1]
 [  3   0   2   1   2]
 [  2   0   1   1   0]
 [  3   1   1   0   4]]


In [None]:
print(accuracy_score(y_test,y_pred)*100)

90.21739130434783


## **Gaussian Naive Bayes**

In [None]:
model = GaussianNB()
model.fit(X_train,y_train)

In [None]:
y_pred=model.predict(X_test)
df1=pd.DataFrame({'Actual Status':y_test,'Predicted Status':y_pred})
df1

Unnamed: 0,Actual Status,Predicted Status
493,0,0
472,0,0
107,0,0
558,2,4
535,1,1
...,...,...
587,4,4
42,0,0
334,0,0
305,0,0


In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.95      0.96       161
           1       0.50      0.50      0.50         2
           2       0.43      0.38      0.40         8
           3       0.17      0.25      0.20         4
           4       0.67      0.89      0.76         9

    accuracy                           0.90       184
   macro avg       0.55      0.59      0.56       184
weighted avg       0.91      0.90      0.91       184



In [None]:
print(confusion_matrix(y_test,y_pred))

[[153   1   2   4   1]
 [  0   1   0   0   1]
 [  3   0   3   0   2]
 [  1   0   2   1   0]
 [  0   0   0   1   8]]


In [None]:
print(accuracy_score(y_test,y_pred)*100)

90.21739130434783


## **Random Forest**

In [None]:
Forest = RandomForestClassifier(n_estimators = 5, criterion = 'entropy')
Forest.fit(X_train, y_train)

In [None]:
y_pred=Forest.predict(X_test)
df1=pd.DataFrame({'Actual Status':y_test,'Predicted Status':y_pred})
df1

Unnamed: 0,Actual Status,Predicted Status
493,0,0
472,0,0
107,0,0
558,2,4
535,1,0
...,...,...
587,4,4
42,0,0
334,0,0
305,0,0


In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       161
           1       0.00      0.00      0.00         2
           2       0.00      0.00      0.00         8
           3       0.00      0.00      0.00         4
           4       0.70      0.78      0.74         9

    accuracy                           0.90       184
   macro avg       0.33      0.35      0.34       184
weighted avg       0.86      0.90      0.88       184



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print(confusion_matrix(y_test,y_pred))

[[159   0   0   1   1]
 [  2   0   0   0   0]
 [  4   0   0   2   2]
 [  2   0   2   0   0]
 [  1   0   0   1   7]]


In [None]:
print(accuracy_score(y_test,y_pred)*100)

90.21739130434783


Tanpa melakukan Tuning, kita bisa lihat bahwa **ketiga model memiliki akurasi yang kurang lebih sama, yaitu 90%.**

## **Model Tuning - Oversampling**

Data yang kita miliki tidak seimbang, sehingga akan kita lakukan Oversampling, sehingga untuk semua kategori selain Donor dan Kemungkinan Donor akan meningkat.

In [None]:
OVS = RandomOverSampler(random_state=0)
X_resampled, y_resampled = OVS.fit_resample(X, y)

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X_resampled,y_resampled,
                                               test_size=0.3,random_state=1)

In [None]:
knn.fit(X_train, y_train.values)

In [None]:
y_pred=knn.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.94      0.97       148
           1       1.00      1.00      1.00       157
           2       0.98      1.00      0.99       182
           3       0.98      1.00      0.99       152
           4       0.99      1.00      0.99       161

    accuracy                           0.99       800
   macro avg       0.99      0.99      0.99       800
weighted avg       0.99      0.99      0.99       800



In [None]:
print(accuracy_score(y_test,y_pred)*100)

98.875


In [None]:
model.fit(X_train,y_train)

In [None]:
y_pred=model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.72      0.89      0.80       148
           1       0.98      1.00      0.99       157
           2       0.78      0.48      0.60       182
           3       0.66      0.79      0.72       152
           4       0.91      0.90      0.90       161

    accuracy                           0.80       800
   macro avg       0.81      0.81      0.80       800
weighted avg       0.81      0.80      0.79       800



In [None]:
print(accuracy_score(y_test,y_pred)*100)

80.25


In [None]:
Forest.fit(X_train, y_train)

In [None]:
y_pred=Forest.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       148
           1       1.00      1.00      1.00       157
           2       1.00      1.00      1.00       182
           3       1.00      1.00      1.00       152
           4       1.00      1.00      1.00       161

    accuracy                           1.00       800
   macro avg       1.00      1.00      1.00       800
weighted avg       1.00      1.00      1.00       800



In [None]:
print(accuracy_score(y_test,y_pred)*100)

100.0


Dengan melakukan Oversampling, **Random Forest meningkat menjadi 100% (overfit), KNN 98%, dan Naive Bayes 80%.**

## **Model Tuning - Box-Cox Transformation**

Kita akan gunakan Box-Cox Transformation untuk mengubah distribusi data menjadi normal.

In [None]:
df.columns

Index(['Category', 'Age', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL',
       'CREA', 'GGT', 'PROT'],
      dtype='object')

In [None]:
numerik = ['Age', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL',
       'CREA', 'GGT', 'PROT']

for i in numerik:
  a,b = sp.boxcox(df[i])
  df[i] = a

In [None]:
df

Unnamed: 0,Category,Age,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,0,8.050270,80.789519,4.905053,2.254976,0.924113,1.475380,7.016395,1.687381,4.051092,1.540478,460923.934490
1,0,8.050270,80.789519,5.353280,3.331252,0.928330,1.098287,12.674980,2.575485,3.778756,1.624367,652376.873888
2,0,8.050270,104.057377,5.448209,4.285043,0.947059,1.364834,9.521606,2.781427,3.893378,1.826565,736306.255046
3,0,8.050270,93.655545,4.890592,4.049744,0.925001,1.887843,7.534286,2.543993,3.838351,1.830619,629691.376053
4,0,8.050270,82.680001,5.435564,4.137942,0.928474,1.598024,9.935428,2.318753,3.799174,1.802243,454211.672279
...,...,...,...,...,...,...,...,...,...,...,...,...
607,4,11.486702,63.697153,8.401117,1.935164,0.955484,2.206263,5.286174,3.316426,3.559149,2.211123,449775.196880
608,4,11.677724,43.942946,5.957530,1.121034,0.944036,1.909263,0.556757,1.556131,3.654826,1.844039,514721.626008
609,4,11.677724,56.110803,5.694716,1.331219,0.954620,2.194840,0.684405,1.928011,3.698916,1.957314,824161.649488
610,4,9.813568,66.270101,5.259905,4.390574,0.949524,1.909263,2.834240,2.252759,3.505431,1.911989,507466.519997


In [None]:
X=df.drop(columns = ['Category'])
y=df['Category']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)

In [None]:
knn.fit(X_train, y_train.values)

In [None]:
y_pred=knn.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.95      0.92       161
           1       0.50      0.50      0.50         2
           2       0.00      0.00      0.00         8
           3       0.20      0.25      0.22         4
           4       0.33      0.11      0.17         9

    accuracy                           0.85       184
   macro avg       0.39      0.36      0.36       184
weighted avg       0.81      0.85      0.82       184



In [None]:
print(accuracy_score(y_test,y_pred)*100)

84.78260869565217


In [None]:
model.fit(X_train,y_train)

In [None]:
y_pred=model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.98      0.94       161
           1       1.00      1.00      1.00         2
           2       0.00      0.00      0.00         8
           3       0.00      0.00      0.00         4
           4       0.50      0.44      0.47         9

    accuracy                           0.89       184
   macro avg       0.48      0.48      0.48       184
weighted avg       0.82      0.89      0.85       184



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print(accuracy_score(y_test,y_pred)*100)

88.58695652173914


In [None]:
Forest.fit(X_train, y_train)

In [None]:
y_pred=Forest.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       161
           1       1.00      1.00      1.00         2
           2       0.50      0.25      0.33         8
           3       0.00      0.00      0.00         4
           4       0.75      0.67      0.71         9

    accuracy                           0.92       184
   macro avg       0.64      0.58      0.60       184
weighted avg       0.90      0.92      0.91       184



In [None]:
print(accuracy_score(y_test,y_pred)*100)

92.3913043478261


Dengan menggunakan Box-Cox Transformation, **Akurasi Random Forest meningkat menjadi 92%, dengan KNN menurun menjadi 84% dan Gaussian Naive Bayes menjadi 88.58%.**

## **Conclusion**

**Classifier terbaik** yang didapatkan dalam projek ini adalah dengan menggunakan Algoritma **K-Nearest Neighbor** pada data yang telah diberikan perlakuan Oversampling, dengan **akurasi 98%**.