<a href="https://colab.research.google.com/github/Biokatzen/Hepatitis-C-Prediction-Dataset/blob/main/HepatitisC_Prediction_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hepatitis C Prediction Dataset

The dataset used in this study contains data from female and male blood donors and from patients with hepatitis C, fibrosis and cirrhosis, aged 19 to 77. Diagnosis category and sex are the categorical variables that will be used for classification and clustering later on. The numerical laboratory variables are ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT and PROT, which are traditional diagnostic tests for liver disease.

The table below summarizes all the attributes found in this dataset:


| **Attribute** | **Value** | **Value Label** | **Type** | **Description** |
|----------------|----------------|-----------------|-----------|-----------------|
| **Patient ID/No.** | Numeric | – | Integer | Patient identification |
| **Category** | 0 | Blood Donor | Categorical | Diagnosis of the patient. Blood donor vs Hepatitis C including its progress to Fibrosis and Cirrhosis |
|  | 0s | Suspect Blood Donor |  |  |
|  | 1 | Hepatitis |  |  |
|  | 2 | Fibrosis |  |  |
|  | 3 | Cirrhosis |  |  |
| **Age** | Numeric | – | Integer | Age of the patient (years) |
| **Sex** | f | Female | Binary | Sex of the patient |
|  | m | Male |  |  |
| **ALB** | Numeric | – | Continuous | Albumin Blood Test (g/L) |
| **ALP** | Numeric | – | Continuous | Alkaline Phosphatase (U/L) |
| **ALT** | Numeric | – | Continuous | Alanine Transaminase (U/L) |
| **AST** | Numeric | – | Continuous | Aspartate Transaminase (U/L) |
| **BIL** | Numeric | – | Continuous | Bilirubin (µmol/L) |
| **CHE** | Numeric | – | Continuous | Acetylcholinesterase (U/mL) |
| **CHOL** | Numeric | – | Continuous | Cholesterol (mmol/L) |
| **CREA** | Numeric | – | Continuous | Creatinine (µmol/L) |
| **GGT** | Numeric | – | Continuous | Gamma-Glutamyl Transferase (U/L) |
| **PROT** | Numeric | – | Continuous | Total Protein (g/L) |
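
The Category values in the table combine a code and a label, and the printed data stores them as a single string such as `0=Blood Donor`. As a minimal sketch, the code and label could be separated as shown below. The sample series is hypothetical: only `0=Blood Donor` and `3=Cirrhosis` appear in the printed output, and the other strings are assumed from the table.

```python
import pandas as pd

# Hypothetical sample; the exact strings for codes 0s, 1 and 2 are assumed from the table.
categories = pd.Series(
    [
        "0=Blood Donor",
        "0s=Suspect Blood Donor",
        "1=Hepatitis",
        "2=Fibrosis",
        "3=Cirrhosis",
    ]
)

# Split each entry at the first "=" into a code column and a label column.
parts = categories.str.split("=", n=1, expand=True)
parts.columns = ["code", "label"]

# The code "0s" is not a valid integer, so the code column stays as strings.
print(parts)
```

Keeping the code as a string avoids a conversion error on `0s`; whether suspect donors should be merged with regular donors is a modelling decision left open here.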


In [4]:
import pandas as pd
import os

First, the dataset is loaded into the Google Colab environment and printed for a first look. As shown below, it has 615 rows and 14 columns.

In [33]:
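# Move to the Google Drive folder that holds the CSV file
location = '/content/drive/MyDrive/Colab Notebooks'
os.chdir(location)
# The file is semicolon-separated and its first row is the header
df = pd.read_csv('HepatitisCdata.csv', header=0, sep=';')
df.shape  # (615, 14); not displayed because it is not the last expression in the cell
print(df)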

     Unnamed: 0       Category  Age Sex   ALB    ALP    ALT    AST   BIL  \
0             1  0=Blood Donor   32   m  38.5   52.5    7.7   22.1   7.5   
1             2  0=Blood Donor   32   m  38.5   70.3   18.0   24.7   3.9   
2             3  0=Blood Donor   32   m  46.9   74.7   36.2   52.6   6.1   
3             4  0=Blood Donor   32   m  43.2   52.0   30.6   22.6  18.9   
4             5  0=Blood Donor   32   m  39.2   74.1   32.6   24.8   9.6   
..          ...            ...  ...  ..   ...    ...    ...    ...   ...   
610         611    3=Cirrhosis   62   f  32.0  416.6    5.9  110.3  50.0   
611         612    3=Cirrhosis   64   f  24.0  102.8    2.9   44.4  20.0   
612         613    3=Cirrhosis   64   f  29.0   87.3    3.5   99.0  48.0   
613         614    3=Cirrhosis   46   f  33.0    NaN   39.0   62.0  20.0   
614         615    3=Cirrhosis   59   f  36.0    NaN  100.0   80.0  12.0   

       CHE  CHOL   CREA    GGT  PROT  
0     6.93  3.23  106.0   12.1  69.0  
1    11.1

As we can see, the first column should be Patient ID/No., but it appears in the dataset as an unnamed column, so it has to be renamed. Additionally, the Sex column would be more informative with the values male and female instead of m and f.
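
A likely reason for the unnamed column is that the first field of the header row is empty, in which case pandas assigns the name `Unnamed: 0`. The sketch below reproduces this on a hypothetical two-row excerpt held in memory; the layout of the header row is an assumption about the original file, not something confirmed by it.

```python
import io

import pandas as pd

# Hypothetical excerpt; the empty first header field is an assumption.
raw = ";Category;Age;Sex\n1;0=Blood Donor;32;m\n611;3=Cirrhosis;62;f\n"

sample = pd.read_csv(io.StringIO(raw), sep=";")
original_columns = sample.columns.tolist()
print(original_columns)  # the first name is filled in by pandas

# Give the first column a meaningful name and spell out the sex labels.
sample = sample.rename(columns={"Unnamed: 0": "Patient ID"})
sample["Sex"] = sample["Sex"].map({"m": "male", "f": "female"})
print(sample)
```

Note that `map` turns any value missing from the dictionary into NaN, whereas `replace` leaves such values unchanged, so `map` makes unexpected labels easier to spot.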


In [37]:
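# Give the unnamed first column a meaningful name
df.rename(columns={'Unnamed: 0': 'Patient ID'}, inplace=True)
# Replace the single-letter codes with full labels
df['Sex'] = df['Sex'].replace({'m': 'male', 'f': 'female'})
print(df)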

     Patient ID       Category  Age     Sex   ALB    ALP    ALT    AST   BIL  \
0             1  0=Blood Donor   32    male  38.5   52.5    7.7   22.1   7.5   
1             2  0=Blood Donor   32    male  38.5   70.3   18.0   24.7   3.9   
2             3  0=Blood Donor   32    male  46.9   74.7   36.2   52.6   6.1   
3             4  0=Blood Donor   32    male  43.2   52.0   30.6   22.6  18.9   
4             5  0=Blood Donor   32    male  39.2   74.1   32.6   24.8   9.6   
..          ...            ...  ...     ...   ...    ...    ...    ...   ...   
610         611    3=Cirrhosis   62  female  32.0  416.6    5.9  110.3  50.0   
611         612    3=Cirrhosis   64  female  24.0  102.8    2.9   44.4  20.0   
612         613    3=Cirrhosis   64  female  29.0   87.3    3.5   99.0  48.0   
613         614    3=Cirrhosis   46  female  33.0    NaN   39.0   62.0  20.0   
614         615    3=Cirrhosis   59  female  36.0    NaN  100.0   80.0  12.0   

       CHE  CHOL   CREA    GGT  PROT  


Continuing with the pre-processing steps, the categorical variables 'Category' and 'Sex' need to be converted into a numerical format. Since neither of them is treated as having ordinal categories, the method chosen is One-Hot Encoding.

In [38]:
df_encoded = pd.get_dummies(df, columns=['Category', 'Sex'], prefix=['Category', 'Sex'], dtype=int)
# Print the encoded frame (df_encoded.head() would show only the first five rows)
print(df_encoded)

     Patient ID  Age   ALB    ALP    ALT    AST   BIL    CHE  CHOL   CREA  \
0             1   32  38.5   52.5    7.7   22.1   7.5   6.93  3.23  106.0   
1             2   32  38.5   70.3   18.0   24.7   3.9  11.17  4.80   74.0   
2             3   32  46.9   74.7   36.2   52.6   6.1   8.84  5.20   86.0   
3             4   32  43.2   52.0   30.6   22.6  18.9   7.33  4.74   80.0   
4             5   32  39.2   74.1   32.6   24.8   9.6   9.15  4.32   76.0   
..          ...  ...   ...    ...    ...    ...   ...    ...   ...    ...   
610         611   62  32.0  416.6    5.9  110.3  50.0   5.57  6.30   55.7   
611         612   64  24.0  102.8    2.9   44.4  20.0   1.54  3.02   63.0   
612         613   64  29.0   87.3    3.5   99.0  48.0   1.66  3.63   66.7   
613         614   46  33.0    NaN   39.0   62.0  20.0   3.56  4.20   52.0   
614         615   59  36.0    NaN  100.0   80.0  12.0   9.07  5.30   67.0   

       GGT  PROT  Category_0=Blood Donor  Cat

To obtain binary integers (0 and 1) instead of booleans, it is important to specify dtype=int in the get_dummies function.
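
The effect of the dtype argument can be checked on a small example. The sketch below uses a hypothetical three-row frame rather than the real dataset and compares the column types with and without `dtype=int`. The default type depends on the installed pandas version, so it is printed rather than assumed.

```python
import pandas as pd

# Hypothetical toy frame, not taken from the dataset.
toy = pd.DataFrame({"Sex": ["male", "female", "male"]})

# Default behaviour: the dtype of the indicator columns depends on the pandas version.
default_dummies = pd.get_dummies(toy, columns=["Sex"])
print(default_dummies.dtypes)

# With dtype=int the indicator columns hold the integers 0 and 1.
int_dummies = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(int_dummies.dtypes)
print(int_dummies)
```

Integer indicators are convenient when the encoded frame is later passed to code that expects purely numeric input.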