## Hepatitis C using Random Forest Classification

[Dataset](https://www.kaggle.com/datasets/fedesoriano/hepatitis-c-dataset)

[Google Colab Notebook](https://colab.research.google.com/drive/1uQcQpgq5SqZHKGNKw2Bi2VR03C1uoWEn)

[GitHub](https://github.com/z5208980/machine-learning-health/tree/main/hep_c/)

Hepatitis C (Hep C) is an inflammation of the liver caused by a virus. It is classified as a serious liver disease and is usually blood-borne, meaning that an individual can contract it through contact with the blood of an infected person. In Australia, it affects over 115 000 individuals and is often chronic. Symptoms are usually absent until the liver is already damaged, so a blood test is how individuals find out whether they have Hep C.
The dataset can be used to help classify which type of liver damage an individual may have, based on their laboratory measurements.
The target for this dataset is `Category`. Running this code,
```python
for target in df['Category'].unique():
  print(target)
```
we get a categorical result,
- Blood Donor
- Suspect Blood Donor
- Hepatitis
- Fibrosis
- Cirrhosis
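
Listing the category names says nothing about how many rows fall into each one. A minimal sketch of how the counts could be inspected with `value_counts`, using a small made-up `Category` column (the `toy` frame and its counts are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for df['Category']; the counts are invented.
toy = pd.DataFrame({
    "Category": ["Blood Donor"] * 6 + ["Hepatitis"] * 2 + ["Cirrhosis"] * 1
})

# Absolute number of rows per class, most frequent first
counts = toy["Category"].value_counts()

# Share of each class, useful for spotting imbalance
shares = toy["Category"].value_counts(normalize=True).round(3)

print(counts)
print(shares)
```

In the notebook, the same two calls could be run on `df["Category"]` directly.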

The categories fall into two groups: blood-related and liver-related. Blood Donor and Suspect Blood Donor refer to donors, who may be at risk if they have donated or received blood carrying Hep C. Hepatitis, Fibrosis and Cirrhosis describe damage to the liver, here due to *"Hepatitis"*, although in other cases the cause could be alcoholism or injury.
The features of this dataset are mostly taken from laboratory values and include,
- Age: Age
- Sex: Gender (M|F)
- ALB: Albumin (a protein produced by the liver)
- ALP: Alkaline Phosphatase (Blood test of ALP enzyme in the blood)
- ALT: Alanine Transaminase (Blood test of ALT enzyme in the blood)
- AST: Aspartate Aminotransferase (Blood test of AST enzyme in the blood)
- BIL: Bilirubin
- CHE: Cholinesterase (Blood test of CHE enzyme in the blood)
- CHOL: Cholesterol
- CREA: Creatinine
- GGT: Gamma-Glutamyl Transferase
- PROT: Protein


In [1]:
import numpy as np
import pandas as pd
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import binarize, LabelEncoder, MinMaxScaler, StandardScaler
from sklearn import metrics
from sklearn.metrics import accuracy_score, mean_squared_error, precision_recall_curve
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

In [7]:
# Loading and inspecting the data

# Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/z5208980/machine-learning-health/main/hep_c/data/raw.csv')

print(f"There are {df.shape[0]} rows with {df.shape[1]} columns including the target")

# Preview the first rows
df.head(5)

There are 615 rows with 14 columns including the target


Unnamed: 0.1,Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,3,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,4,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7


In [8]:
# Processing the data

# Remove unnecessary features
df.drop("Unnamed: 0", axis=1, inplace=True)  # Drop the redundant index column

# Format data: strip the numeric prefix (e.g. "0=Blood Donor" -> "Blood Donor")
df["Category"] = df["Category"].str.replace(".+=", '', regex=True)
df["Category"] = df["Category"].str.title()

# Encode f (female) and m (male) as integers (f -> 0, m -> 1)
encoder = LabelEncoder()
df["Sex"] = encoder.fit_transform(df["Sex"])

# Fill in missing data with each numeric column's median
# (numeric_only=True skips the string column Category, which has no median)
df.fillna(df.median(numeric_only=True), inplace=True)  # Not sure if appropriate for medical values
# df.fillna(0, inplace=True)
df.head()

# Save
# filename = '/content/sample_data/processed.csv'
# df.to_csv(filename, index=False)

Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,Blood Donor,32,1,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,Blood Donor,32,1,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,Blood Donor,32,1,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,Blood Donor,32,1,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,Blood Donor,32,1,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7


In [16]:
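# Separate the target from the features
y = df.Category
X = df.drop('Category', axis=1)

# Standardize every feature to zero mean and unit variance.
# Note: fitting on the full table lets the test rows influence the scaling.
scaler = StandardScaler()
X = scaler.fit_transform(X)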

In [17]:
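# Hold out 25% of the rows for testing; random_state fixes the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=200)

# Alternative: fit the scaler on the training split only, then apply it to the test split
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)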
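
The commented-out lines point at fitting the scaler after the split. A minimal sketch of that idea using a scikit-learn pipeline, which fits the scaler on the training rows only and reuses it at prediction time. The data comes from `make_classification` as a synthetic stand-in, and `pipe`, `X_toy` and `y_toy` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 12 features; this is not the real dataset.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=200
)

# fit() scales using statistics from X_tr only; score() applies the same scaling to X_te
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

Tree-based models such as random forests are largely insensitive to feature scaling, so the main benefit here is keeping the preprocessing and the model together in one object.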

The chosen model is **RandomForestClassifier**, which yields about 92% accuracy on the test split. No hyperparameters are tuned; the model uses the scikit-learn defaults.
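
An accuracy figure is easier to interpret next to a trivial baseline, because accuracy can look high whenever one class dominates. The sketch below uses invented labels (90 of one class, 10 of another) to show how a classifier that always predicts the majority class scores on plain and balanced accuracy. The label counts are an assumption for illustration, not the dataset's real distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 90 of class 0, 10 of class 1.
y_true = np.array([0] * 90 + [1] * 10)
X_dummy = np.zeros((100, 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

# Plain accuracy rewards the majority class; balanced accuracy averages per-class recall
base_acc = accuracy_score(y_true, y_base)
base_bal = balanced_accuracy_score(y_true, y_base)

print("Baseline accuracy:", base_acc)
print("Baseline balanced accuracy:", base_bal)
```

Comparing the model's score against such a baseline on the real `y_test` would show how much of the 92% comes from the majority class alone.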

## Using the model

The input list should follow the feature order `[Age, Sex, ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT]`, with the values standardized by the same `StandardScaler` that was fitted on the training data.
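
Because the features were standardized before training, a raw sample has to pass through the same fitted scaler before `predict` is called. A sketch of that step, assuming the column order of `X` after preprocessing (`Age, Sex, ALB, ..., PROT`). The training rows and the labelling rule are invented; only the sample values are copied from the first row of the table shown earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

FEATURES = ["Age", "Sex", "ALB", "ALP", "ALT", "AST",
            "BIL", "CHE", "CHOL", "CREA", "GGT", "PROT"]

# Invented training data standing in for the real table
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(60, len(FEATURES)))
# Toy labelling rule on the ALT column, purely for illustration
y_raw = np.where(X_raw[:, 4] > 50.0, "Hepatitis", "Blood Donor")

scaler = StandardScaler().fit(X_raw)
model = RandomForestClassifier(random_state=0).fit(scaler.transform(X_raw), y_raw)

# A new, unscaled sample keyed by feature name so the order cannot be mixed up
sample = dict(zip(FEATURES, [32, 1, 38.5, 52.5, 7.7, 22.1, 7.5, 6.93, 3.23, 106.0, 12.1, 69.0]))
row = np.array([[sample[name] for name in FEATURES]])

# Scale with the fitted scaler, then predict
prediction = model.predict(scaler.transform(row))
print(prediction[0])
```

Keying the sample by feature name is a design choice to guard against the ordering mistakes that a bare list invites.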

In [36]:
def print_accuracy(model):
  # Report accuracy on the held-out test split
  y_pred_class = model.predict(X_test)

  print('RESULT')
  print('Accuracy:', metrics.accuracy_score(y_test, y_pred_class))

def LR():
  # Logistic regression with default settings
  model = LogisticRegression()
  model.fit(X_train, y_train)
  print_accuracy(model)
  return model

def RFC():
  # Random forest with default settings (no hyperparameter tuning)
  model = RandomForestClassifier()
  model.fit(X_train, y_train)
  print_accuracy(model)
  return model

def GBC():
  # Gradient boosting with default settings
  model = GradientBoostingClassifier()
  model.fit(X_train, y_train)
  print_accuracy(model)
  return model


model = RFC()

# filename = '/content/sample_data/model.sav'
# pickle.dump(model, open(filename, 'wb'))

RESULT
Accuracy: 0.922077922077922
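
A single train/test split gives one accuracy number per model. The three candidates could also be compared with cross-validation, which the notebook imports (`cross_val_score`) but does not use. A sketch on synthetic stand-in data, where `candidates` and `cv_means` are illustrative names and the printed scores say nothing about the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the notebook, the real X and y would be used instead
X_toy, y_toy = make_classification(
    n_samples=200, n_features=12, n_informative=6, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

# Mean accuracy over 5 folds for each candidate
cv_means = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```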


In [None]:
# Load the saved model (assumes it was written with the commented-out pickle.dump above)
with open('/content/sample_data/model.sav', 'rb') as f:
  model = pickle.load(f)

# Take one already-scaled training row as the sample to predict
row = 200
sample = [X_train[row].tolist()]
output = model.predict(sample)

print("X=%s, Predicted=%s, Actually=%s" % (sample[0], output[0], y_train.iloc[row]))

X=[-0.7052232780919925, 0.8040078180145834, -0.045733332806742175, -0.03147139495730363, 1.1785683229506227, 0.05812158875788527, -0.2219969538910292, -1.8083413098012011, 0.6218495773044013, -0.12303655751846336, 0.933390836209473, 0.07987049747609182], Predicted=Blood Donor, Actually=Blood Donor
