# Introduction to Artificial Intelligence Project Assignment

### Fall 2024/2025
### Professor: Ibrahim Abaker
### Group 10
## Group members:

|    Name                                     | ID         |
|------------------------------------------|------------|
| ABDALLA SAYED MOHAMMAD RASHED ALI [`Main Contributor for the conference.`]       | U22103623  |
| ABDULRAHMAN RASHED ABDULRAHMAN ALSHARHAN ALNUAIMI | U22100774  |
| MOHAMMAD RA'ED MOHAMMAD HARDAN           | U22105630  |
| OMAR MOHAMMAD FAROUQ ABDULSALAM          | U22100742  |


![UOS logo](uos.png)


---

# <u>**Contents**</u>

* [<font size=4>1. Problem Statement: Liver Cancer Detection Using Biomarkers</font>](#problem-statement)
  * [<font size=4>1.1 Introduction</font>](#introduction)
  * [<font size=4>1.2 Objective</font>](#objective)
  * [<font size=4>1.3 Algorithms</font>](#algorithms)
    * [<font size=4>1.3.1 Why did we choose these algorithms for liver cancer detection?</font>](#algorithms-reason)
  * [<font size=4>1.4 Dataset Configuration</font>](#db-c)
* [<font size=4>2.0 Related Works</font>](#work)
* [<font size=4>3.0 Data Preprocessing</font>](#preprocess)
* [<font size=4>4.0 Algorithms Selection</font>](#selection) 
* [<font size=4>5.0 && 6.0 Model Development and Model Evaluation</font>](#model) 
  * [<font size=4>5.1 && 6.1 XGBoost Classifier</font>](#classifier) 
  * [<font size=4>5.2 && 6.2 RF Classifier</font>](#rf)   
  * [<font size=4>5.3 && 6.3 SVC Classifier</font>](#svc) 
* [<font size=4>7.0 Analysis</font>](#analysis) 
* [<font size=4>8.0 Conclusion and Recommendations</font>](#conclusion)
* [<font size=4>9.0 References</font>](#ref) 
* [<font size=4>10.0 Explainable AI Approach: LIME for Local Interpretability [Extra]</font>](#ai-approach) 



---
<a id="problem-statement"></a>
# <u> **1.0 Problem Statement: Liver Cancer Detection Using Biomarkers** </u>
<a id="introduction"></a>
## <u> **1.1 Introduction** </u> 

Liver cancer, particularly hepatocellular carcinoma (HCC), is a leading cause of cancer-related deaths worldwide. Early detection is crucial for improving patient outcomes, as treatment efficacy is higher in the disease's early stages. However, traditional diagnostic methods like imaging and biopsy can be invasive, expensive, or difficult to access.

Biomarkers have recently emerged as a promising non-invasive, cost-effective alternative for early cancer detection and monitoring. These measurable biological molecules—such as proteins, genes, or metabolites—indicate specific biological conditions. For liver cancer, biomarkers like alpha-fetoprotein (AFP), des-gamma carboxyprothrombin (DCP), and certain microRNAs show potential in distinguishing between healthy individuals and those with liver cancer. However, the complexity and variability of biomarker data require advanced machine learning methods for accurate patient classification.
<a id="objective"></a>
## <u> **1.2 Objective** </u>

This project aims to develop a machine learning-based binary classification system to detect liver cancer using biomarker data. Our dataset includes various biomarkers as features for the classification task. The original target variable takes one of 5 diagnostic classes, which we later re-encode into two labels indicating the presence or absence of liver cancer-related conditions. By employing machine learning algorithms, we seek to enhance the accuracy and reliability of liver cancer detection.
<a id="algorithms"></a>
## <u> **1.3 Algorithms** </u>

To tackle this classification problem, we will implement and compare the performance of these machine learning algorithms:

1. **XGBoost (Extreme Gradient Boosting)**:
    - A powerful ensemble learning algorithm based on decision trees
    - Known for its efficiency, scalability, and ability to handle imbalanced datasets
2. **Random Forest (RF)**:
    - A robust ensemble method based on bagging
    - Offers good generalization and is less prone to overfitting than individual decision trees
3. **Support Vector Machine (SVM)**:
    - A versatile algorithm that finds an optimal hyperplane to separate classes
    - Particularly effective in high-dimensional spaces and for datasets with non-linear decision boundaries
<a id="algorithms-reason"></a>
### <u> **1.3.1 Why did we choose these machine learning algorithms for liver cancer detection?** </u>
- XGBoost: XGBoost is highly effective at handling datasets with complex feature interactions, making it ideal for biomarker data where multiple biomarkers may jointly contribute to liver cancer detection.

- RF: Random Forest is robust against noise in the dataset and excels in identifying important features, which is valuable for determining the most significant biomarkers in liver cancer classification.

- Support Vector Machine (SVM): SVM is well-suited for high-dimensional datasets, such as biomarker data, where the number of features (biomarkers) may exceed the number of samples, ensuring precise classification even with limited data.

<a id="db-c"></a>
## <u> **1.4 Dataset Configuration** </u>

Our dataset `Hepatitis C Prediction Dataset` focuses on liver disease/cancer biomarkers and contains critical information for binary classification. Below are the key aspects of the dataset:


### **Rows and Columns**: The dataset consists of **615** rows and **14** columns.


### **Features**: It includes **10** features representing various biomarkers, which are used to predict the presence or absence of liver cancer. Each patient's set of biomarker values is associated with a class that describes their health status.


### **Data Types**:
  - 10 **Numerical** columns [ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT] representing biomarker expressions and levels from laboratory data (Attributes 5 to 14).

  - 2 **Discrete Integer** columns: the patient ID (Attribute 1) and the age of the patient (Attribute 3).

  - 2 **String** columns: the category/diagnosis of the patient [values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis'] (Attribute 2) and the sex of the patient [m, f] (Attribute 4).
  
### **Variability**:
  - The dataset may have variability due to the natural differences in biomarker levels across individuals.

  - Some features may be highly correlated, requiring feature selection or dimensionality reduction during preprocessing.
  
  - A preliminary examination indicates potential issues such as class imbalance, which can impact the performance of classification algorithms.
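
One way to check the correlation concern above is to compute pairwise Pearson coefficients and flag pairs above a threshold. The sketch below is a minimal illustration in plain Python. The column names `A`, `B`, `C`, their values, and the 0.9 threshold are assumptions chosen for the example, not values taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy columns (illustrative values only, not real biomarker readings)
features = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.1, 9.8],   # roughly 2 * A
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}

# Flag every pair of columns whose absolute correlation exceeds the threshold
names = list(features)
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(features[a], features[b])) > 0.9
]
print(flagged)
```

On a pandas DataFrame the same check is usually done with `DataFrame.corr()`; flagged pairs are candidates for feature selection rather than automatic removal.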

### **Explanation of the dataset labels/classes**:
- **`0 / Blood Donor`**: Healthy individuals who have donated blood and show no signs of liver disease or abnormal biomarker levels.

- **`0s / Suspect Blood Donor`**: Blood donors who exhibit certain abnormalities or biomarker expressions that suggest possible liver issues, requiring further investigation.

- **`1 / Hepatitis`**: Individuals diagnosed with hepatitis, a liver condition typically caused by viral infections or other factors that lead to liver inflammation.

- **`2 / Fibrosis`**: Individuals with liver fibrosis, characterized by excessive scar tissue buildup in the liver due to chronic damage.

- **`3 / Cirrhosis`**: Individuals with cirrhosis, a severe liver condition where the organ is permanently scarred and its function is significantly impaired—often resulting from chronic diseases like hepatitis or fibrosis.


##### In summary, these labels represent the health statuses in the dataset, which are encoded in the **Category** column.


![Preview of the dataset](ourdata.png)

---

<a id="work"></a>
# <u> 2.0 Related Works </u>

This section explores recent studies related to the early detection and diagnosis of liver cancer / hepatocellular carcinoma (HCC) using machine learning and bioinformatics techniques.

<a id="summaries"></a>
### <u> Paper Summaries </u>

#### **[1] Early Warning and Diagnosis of Liver Cancer**
The authors in [1] sought to improve early detection of HCC by combining dynamic network biomarker (DNB) analysis with graph convolutional neural networks (GCN) on multi-omics data. They studied HCC development in mice using transcriptomic data and gene expression levels, drawing on a dataset of five-year relative survival rates of HCC patients from the NCBI together with RNA-seq data uploaded to the NCBI database (multi-omics gene expression data). The DNB model identified a critical transition point at 7 weeks of age, achieving 100% accuracy in classifying healthy and cancerous mice while also accurately predicting the health status of newly introduced mice. Our approach differs in that it uses feature selection to identify the biomarkers used for HCC detection.


#### **[2] Deep Transfer Learning for Liver Cancer Gene Recognition**
Researchers in [2] introduced a novel approach for detecting cancer genes from DNA sequences using a VGG16 deep learning model. Using an NCBI dataset providing sequences of four healthy genes and four HCC genes, they applied three numerical mapping techniques to digitize the gene sequences, then examined them in both one-dimensional and two-dimensional forms using convolutional neural networks (CNN). They achieved 80.36% accuracy with the one-dimensional CNN model, 98.86% with the combined VGG16 and SVM model, and 100% accuracy with fine-tuned VGG16 layers. This method effectively extracted features to distinguish HCC from normal liver gene sequences, allowing for broader applications with larger datasets and different cancer types. One critical limitation of this approach was the small number of genes in the dataset, which was not large enough to provide a sufficient training set for designing a new CNN model; we aimed to avoid this limitation in our approach.


#### **[3] Machine Learning Models for Early HCC Detection**
The study in [3] proposed a new approach for detecting HCC patients, evaluating 10 ML algorithms in two different experiments on a dataset of 165 HCC survival patients from CHUH in Portugal, with results reported as F1-scores. The RF algorithm achieved the highest F1-score, 1.0 (the maximum possible), while the highest score for the linear SVM was 0.7804 during testing. One limitation of this design was the lack of features, which we aimed to overcome in our project.


#### **[4] ML for Metabolomics-Based Early HCC Diagnosis**
The authors in [4] showed a review of how machine learning is used to analyze mass spectrometry-based metabolomics data for early HCC diagnosis. This technique helps capture metabolic changes associated with cancer and enables screening out physiological biomarkers of cancer risk and clinical biomarkers of cancer, which can help in identifying it in early stages. Methodologies used were mass spectrometry, machine learning with different models like Random Forests (RF), Principal Component Analysis (PCA), Support Vector Machines (SVM), Partial Least Squares-Discriminant Analysis (PLS-DA), and Neural Networks (NN), with the help of a feature selection technique called correlation-based feature selection (CFS) with logistic regression and Lasso regression. Certain limitations were noted, like the exponentially increased data volume and complexity of the mass spectrometry data and the lack of clinical validation. For liver cancer detection, supervised ML models such as linear SVM and logistic regression demonstrated high accuracy (>85%) for different studies on tissue/serum.


#### **[5] Machine Learning Using RNA Signatures**
Researchers in [5] aimed to leverage machine learning techniques to diagnose early HCC using RNA signatures with laboratory parameters. The study uses a comprehensive dataset that contains RNA expression levels and clinical parameters from 267 subjects, including 102 with malignant HCC, 67 with benign liver conditions, and 98 healthy controls, in addition to the GSE14520 dataset. The RNA signatures include mRNAs (RAB11A, STAT1, ATG12), miRNAs (miR-1262, miR-1298, miR-106b-3p), and lncRNAs (RP11-513I15.6, WRAP53), selected based on their known involvement in HCC pathogenesis (in key biological pathways). Several machine learning models were evaluated, including KNN, RF, SVM, LGBM, and DNN, trained and tested using a 70/30 dataset split. LGBM was the best-performing model, achieving an accuracy of 98.75%, superior to the other classifiers. The study also employed feature selection techniques, and the model included 22 features (e.g., age, sex, smoking, cirrhosis, albumin, ALT, AST, bilirubin). Suggested future work includes further validation and testing of the model and exploring its potential for predicting disease progression and response to treatment. The stated limitations were small and unreliable data sizes.


#### **[6] Multi-Platform Meta-Analysis for HCC Biomarker Detection**
The study in [6] aimed to identify key mRNAs that can serve as biomarkers for diagnosing HCC. It used a non-fusion integrative multi-platform meta-analysis to integrate gene expression data from different platforms, then applied machine learning methods to develop diagnostic and prognostic models. Data was used from many gene expression platforms like Illumina and Affymetrix datasets, with a total of 939 samples, comprising 493 tumor and 446 non-tumor samples. The datasets included GSE57957, GSE39791, GSE36376, GSE84005, and others, sourced from GEO and TCGA. The study utilized the Linear Models for Microarray Data (LIMMA) R package to identify differentially expressed genes (DEGs). A Bayesian approach was employed to assess statistical significance, with a p-value threshold of 0.05 and a fold-change cutoff of 1. One limitation noted could be potential biases due to batch effects despite correction efforts, and for future work, they suggested enhancing the robustness of the identified biomarkers and exploring additional molecular signatures to improve diagnostic and prognostic accuracy.


#### **[7] Deep Neural Networks for MicroRNA Data**
In study [7], the researchers used Deep Neural Networks to classify cancer, with training using MicroRNA data analysis. Three types of data normalization (Min-Max, Sigmoid, Softmax) and three activation functions (ReLU, Sigmoid, TanH) were compared. The best accuracy (98.33%) out of four experiments (with no feature selection) in classifying HCC with MicroRNA data was achieved using Min-Max data normalization, ReLU activation function, and batch normalization with specific parameters. The microRNA dataset was obtained from the GDC Data Portal of the National Cancer Institute (NCI, 2018). The data used in this study consisted of 600 data divided into 300 data on HCC class and 300 data on normal class with a total of 328 MicroRNA features.


#### **[8] DenseGCN for Multi-Omics HCC Detection**
Authors in [8] aimed to create a patient similarity network using three types of HCC omics data and introduced a novel diagnosis method. This method combined similarity network fusion, denoising autoencoder, and dense graph convolutional neural network to use patient similarity networks and multi-omics data (DenseGCN). Comparisons with other machine learning methods on the TCGA-LIHC dataset demonstrated that the proposed approach outperforms them in all metrics. The proposed method achieved an accuracy of up to 0.9857, using Liver Hepatocellular Carcinoma (LIHC) omics dataset. Like our approach, they have used feature selection techniques, but they used denoising autoencoder.


### References
1. Y. Han *et al.*, “Early warning and diagnosis of liver cancer based on dynamic network biomarker and deep learning,” *Computational and Structural Biotechnology Journal*, 2023. doi: 10.1016/j.csbj.2023.07.002.

2. B. Das *et al.*, “Deep transfer learning for automated liver cancer gene recognition using spectrogram images of digitized DNA sequences,” *Biomedical Signal Processing and Control*, 2022. doi: 10.1016/j.bspc.2021.103317.

3. W. Książek *et al.*, “A novel machine learning approach for early detection of hepatocellular carcinoma patients,” *Cognitive Systems Research*, 2019. doi: 10.1016/j.cogsys.2018.12.001.

4. H. L. Ngan *et al.*, “Machine learning facilitates the application of mass spectrometry-based metabolomics to clinical analysis,” *TrAC - Trends in Analytical Chemistry*, 2023. doi: 10.1016/j.trac.2023.117333.

5. M. Matboli *et al.*, “Machine learning based identification of key feature RNA-signature linked to diagnosis of Hepatocellular Carcinoma,” *Journal of Clinical and Experimental Hepatology*, 2024. doi: 10.1016/j.jceh.2024.101456.

6. M. Gholizadeh *et al.*, “Detection of key mRNAs in liver tissue of hepatocellular carcinoma patients based on machine learning and bioinformatics analysis,” *MethodsX*, 2023. doi: 10.1016/j.mex.2023.102021.

7. O. H. Purba *et al.*, “Classification of liver cancer with microrna data using the deep neural network (DNN) method,” *Journal of Physics: Conference Series*, 2020. doi: 10.1088/1742-6596/1524/1/012129.

8. G. Zhang *et al.*, “A novel liver cancer diagnosis method based on patient similarity network and DenseGCN,” *Scientific Reports*, 2022. doi: 10.1038/s41598-022-10441-3.



---
 <a id="preprocess"></a>
# <u> 3.0 Data Preprocessing </u>
Data preprocessing is a crucial step in preparing datasets for machine learning. This process includes cleaning data, addressing missing values, encoding categorical variables, and scaling numerical features. These steps help algorithms identify patterns and relationships within the data and avoid the garbage-in, garbage-out problem.

##### Necessary Imports for the entire program

In [None]:
# - Data Manipulation and Preprocessing
import pandas as pd                                           # For handling datasets and data manipulation
from sklearn.preprocessing import MinMaxScaler, LabelEncoder  # For scaling and encoding data
import numpy as np                                            # For manipulating numeric values

# - Model Building

# For splitting data and hyperparameter tuning
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score    

from sklearn.metrics import (
    accuracy_score,         # For evaluating model accuracy
    classification_report,  # For detailed classification metrics
    confusion_matrix,       # For confusion matrix calculation
    ConfusionMatrixDisplay, # For visualizing the confusion matrix
    roc_auc_score,          # For calculating ROC-AUC score
    roc_curve               # For generating ROC curve data
)

# - Machine Learning Models
import xgboost as xgb                                # For XGBoost classifier
from sklearn.svm import SVC                          # For Support Vector Classifier (SVM)
from sklearn.ensemble import RandomForestClassifier  # For Random Forest Classifier

# - Visualization
import matplotlib.pyplot as plt            # For plotting static visualizations
import plotly.express as px                # For simple, expressive plotting
import seaborn as sns                      # Another plotting library

##### Loading the dataset and viewing it

In [None]:
LiverCancer_df = pd.read_csv('HepatitisCdata.csv')

LiverCancer_df

# We can notice some null values being present in the dataset.

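##### We notice an extra unwanted ID column at the leftmost position; let's remove it using the `iloc` indexer

```python
import pandas as pd
# General form: df.iloc[row_selection, column_selection]
example_df = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})  # toy frame for illustration
example_df = example_df.iloc[:, 1:]  # all rows, every column except the first
```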

In [None]:
LiverCancer_df = LiverCancer_df.iloc[:, 1:]
LiverCancer_df.head()

##### Removal of duplicates

In [None]:
duplicates = LiverCancer_df.duplicated()
print("Duplicate rows present in our liver cancer dataset:")
print(LiverCancer_df[duplicates])

LiverCancer_df = LiverCancer_df.drop_duplicates()

print("Our dataframe object after removing duplicates:")
print(LiverCancer_df)

#### We can see that there were no duplicate rows in our dataset

##### Let's observe a general description of the data

In [None]:
LiverCancer_df.describe()

In [None]:
LiverCancer_df.info()

##### Let's begin the cleaning process.

##### **Checking null values using `isnull()` function**

In [None]:
LiverCancer_df.isnull().sum()

##### **Replacing null values using the `fillna()` function**

```python
df.fillna(0, inplace=True) # For replacing NaN values with 0
```

In [None]:
LiverCancer_df.fillna(0, inplace=True)

# Check whether any null values still exist in the dataset
null_counts = LiverCancer_df.isnull().sum()
print(null_counts[null_counts > 0])

##### No null values remain, so we get an empty Series.

##### **Since our 'Category' labels/classes column is non-numerical, it's best we encode the labels into something the model would understand**

##### Re-encoding Labels for Binary Classification

In the original dataset, the `Category` column contained five distinct classes representing various liver conditions. Since the goal of this project is to classify whether a patient has liver cancer or not, we re-encoded the labels into **binary classes**: 

- **`0` (Non-cancerous)**: Includes `0 = Blood Donor` and `0s = Suspect Blood Donor`, as these represent individuals without liver cancer.
- **`1` (Cancer-related)**: Includes `1 = Hepatitis`, `2 = Fibrosis`, and `3 = Cirrhosis`, as these are associated with liver disease or conditions that could lead to cancer.

This transformation simplifies the problem into a binary classification task, making it more aligned with our project objectives.

```python
# General form: map each old label to its new value
encoded_labels = df['Category'].replace({
    'old_label_1': new_label_1,
    'old_label_2': new_label_2,
    'other_old_labels': new_label
})
```

In [None]:
LiverCancer_df['Category'] = LiverCancer_df['Category'].replace({
    '0=Blood Donor': 0,             # Non-cancerous (Blood Donor)
    '0s=suspect Blood Donor': 0,    # Non-cancerous (Suspect Blood Donor)
    '1=Hepatitis': 1,               # Cancer-related (Hepatitis)
    '2=Fibrosis': 1,                # Cancer-related (Fibrosis)
    '3=Cirrhosis': 1,               # Cancer-related (Cirrhosis)
}).astype(int)                      # Keep the label column explicitly integer-typed

##### **Note that we didn't keep the old Category labels in order to avoid redundancy**

#### <u> **Encoded Labels Mapping** </u>
- **`0 = Blood Donor`** → `0`
- **`0s = Suspect Blood Donor`** → `0`
- **`1 = Hepatitis`** → `1`
- **`2 = Fibrosis`** → `1`
- **`3 = Cirrhosis`** → `1`

This showcases the mapping applied during the encoding process.


##### **Apply encoding for the sex as well (Attribute 4)** 

In [None]:
encoder = LabelEncoder()

LiverCancer_df['Sex'] = encoder.fit_transform(LiverCancer_df['Sex'])

LiverCancer_df

#### <u> **Encoded Sex Mapping** </u>
- **`f`** → `0`
- **`m`** → `1`
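
As a quick check of how such a mapping arises: scikit-learn's `LabelEncoder` numbers the distinct labels in sorted order, starting from 0. The sketch below mimics that rule in plain Python. The helper `encode_sorted` is a hypothetical stand-in written for illustration and is not part of scikit-learn.

```python
def encode_sorted(labels):
    """Mimic LabelEncoder: codes follow the sorted order of the distinct labels."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

# Toy values using the same two labels as the 'Sex' column
codes, mapping = encode_sorted(['m', 'f', 'm', 'f'])
print(mapping)
print(codes)
```

With the fitted encoder from the cell above, `encoder.classes_` lists the labels in code order, which is a direct way to confirm the mapping on the real data.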

##### <u> **Scaling of our numerical features (Attributes 5 to 14)** </u> 

##### Current dataset before scaling: 

In [None]:
LiverCancer_df.head(n = 15) # Display the first 15 rows of our dataframe object

##### Next, we will use `MinMaxScaler()` to rescale the numerical features to a common range [0, 1] (min-max normalization, as opposed to z-score standardization), so that features measured on different scales contribute comparably to the model, which can also help convergence during training.

```python
from sklearn.preprocessing import MinMaxScaler
scaled_features = MinMaxScaler().fit_transform([[10], [20], [30]])  # -> [[0.0], [0.5], [1.0]]
```

In [None]:
# Note that we select only the numerical columns for scaling, excluding the label and the age
numerical_columns = LiverCancer_df.select_dtypes(include=['float64', 'int64']).columns.difference(['Category', 'Age'])

scaler = MinMaxScaler()

LiverCancer_df[numerical_columns] = scaler.fit_transform(LiverCancer_df[numerical_columns])

##### Dataset after scaling:

In [None]:
LiverCancer_df.head(n = 15)
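
For reference, min-max scaling applies `x' = (x - min) / (max - min)` to each column. The sketch below spells this out in plain Python and separates learning the bounds from applying them. The helpers `fit_min_max` and `apply_min_max` and the sample values are assumptions for illustration. Learning the bounds on the training portion only is a common practice to avoid leaking test information; it is shown here as an option and is not what the cell above does, since the cell scales the full dataset.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the reference (e.g. training) values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Rescale with x' = (x - lo) / (hi - lo); a constant column maps to 0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

train = [10.0, 20.0, 30.0]   # illustrative values, not taken from the dataset
test = [15.0, 40.0]

lo, hi = fit_min_max(train)                   # bounds come from the training values only
scaled_train = apply_min_max(train, lo, hi)
scaled_test = apply_min_max(test, lo, hi)     # values beyond the training range leave [0, 1]
print(scaled_train)
print(scaled_test)
```

With scikit-learn, the equivalent pattern is calling `fit` on the training features and `transform` on both splits.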

##### <u> **Checking for outliers** </u>

In [None]:
# # Detect outliers using the inter-quartile (IQR) method

# Q1 = LiverCancer_df[numerical_columns].quantile(0.25)
# Q3 = LiverCancer_df[numerical_columns].quantile(0.75)
# IQR = Q3 - Q1

# # Define bounds for outliers (the lower and upper fences, 1.5 * IQR beyond the first and third quartiles)
# lower_bound = Q1 - 1.5 * IQR
# upper_bound = Q3 + 1.5 * IQR

# # Filter out rows with outliers in any numerical column
# LiverCancer_df = LiverCancer_df[~((LiverCancer_df[numerical_columns] < lower_bound) | 
#                                   (LiverCancer_df[numerical_columns] > upper_bound)).any(axis=1)]

# print(f"Dataset after removing outliers: {LiverCancer_df.shape[0]} rows remain.")



```python
# Detect outliers using the inter-quartile (IQR) method

Q1 = LiverCancer_df[numerical_columns].quantile(0.25)
Q3 = LiverCancer_df[numerical_columns].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for outliers (the lower and upper fences, 1.5 * IQR beyond the first and third quartiles)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out rows with outliers in any numerical column
LiverCancer_df = LiverCancer_df[~((LiverCancer_df[numerical_columns] < lower_bound) | 
                                  (LiverCancer_df[numerical_columns] > upper_bound)).any(axis=1)]

print(f"Dataset after removing outliers: {LiverCancer_df.shape[0]} rows remain.")
```


##### <u>Poor results! When dealing with medical data, it is often best not to remove outliers of this kind.</u>

##### **During preprocessing, we initially implemented code to detect and remove outliers using statistical methods like the Interquartile Range (IQR). However, we noticed that removing outliers negatively affected the model's accuracy. In medical data, outliers often represent rare but clinically significant cases rather than errors or noise. Removing such data could result in a loss of critical information and diminish the model's ability to generalize effectively. Consequently, we decided to keep the outliers and commented out the corresponding code to ensure our analysis remains both accurate and medically relevant.**
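
If the goal is to inspect extreme readings without discarding them, the same IQR rule can be used to flag values instead of dropping rows. The sketch below is a minimal plain-Python illustration. The helper `iqr_fences` and the sample values are assumptions for the example, and `method='inclusive'` is chosen so that quartiles are interpolated between observed values.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # illustrative, with one extreme reading
lower, upper = iqr_fences(values)

# Flag values outside the fences for review; nothing is removed
flagged = [v for v in values if v < lower or v > upper]
print(lower, upper, flagged)
```

Flagged rows can then be reviewed individually, which keeps rare but clinically meaningful cases in the training data.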

### <u> **Univariate Analysis** </u>

In [None]:
LiverCancer_df.Category.value_counts()

In [None]:
# Get the count of each category; the index holds the category values in the same order as the counts
counts = LiverCancer_df['Category'].value_counts()

# Create a bar chart
fig = px.bar(
    x=counts.index.astype(str),
    y=counts.values,
    title='Values by Category (Non-cancerous: 0, Cancerous: 1)',
    labels={"x": "Category", "y": "Number of Cases"}  # Axis label mapping
)

fig.show()

### **Preprocessing is complete; below is a preview of the resulting dataset:**

In [None]:
LiverCancer_df

---

<a id="selection"></a>
# <u> **4.0 Algorithms Selection** </u>

For this project, we've chosen three machine learning algorithms: **XGBoost**, **Random Forest (RF)**, and **Support Vector Machine (SVM)**. These algorithms excel at **binary classification**, making them ideal for our liver cancer detection task using biomarkers. Here's why we selected each algorithm:

### **1. XGBoost**
XGBoost, also known as Extreme Gradient Boosting, is a strong machine learning technique that is commonly employed for regression, classification, and ranking purposes. It utilizes the concepts of decision trees and gradient boosting, merging numerous weak models to form a powerful predictive model. XGBoost enhances the gradient boosting procedure by parallelizing the tree construction, resulting in quicker and more efficient performance compared to conventional models, and is particularly effective for managing extensive and intricate datasets. 
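
To make the boosting idea concrete, the sketch below implements plain gradient boosting on squared error with one-split decision stumps: each round fits a weak learner to the current residuals and adds a damped version of it to the ensemble. This is a simplified illustration, not XGBoost itself, which adds regularisation, second-order gradient information, and engineering optimisations not shown here. The function names and toy data are assumptions for the example.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=20, learning_rate=0.3):
    """Additive model: start from the mean, then repeatedly fit stumps to residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]        # toy single feature
ys = [0, 0, 0, 1, 0, 1, 1, 1]        # toy targets
baseline = boost(xs, ys, rounds=0)   # predicts the mean only
model = boost(xs, ys, rounds=20)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

The training error of the boosted model falls below the mean-only baseline, which is the "many weak models combine into a stronger one" behaviour described above.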

XGBoost plays a crucial role in medical practices for tasks such as predicting diseases, analyzing medical images, and making personalized treatment recommendations. Healthcare professionals can make more precise and quick decisions due to its capacity to process massive amounts of data such as electronic health records, genetic information, and medical imaging. XGBoost's ability to prevent overfitting and its high prediction accuracy have established it as a popular algorithm for analyzing medical data, especially in situations requiring comprehension of intricate variable relationships. 

The algorithm is especially beneficial in clinical environments where timely predictions and risk evaluations are crucial. One illustration is the use of XGBoost to forecast patient results, like the probability of acquiring long-term illnesses, or the efficiency of a treatment strategy determined by patient information. XGBoost has become essential in healthcare for its capacity to rapidly analyze large quantities of patient data with precision, leading to advancements in patient care through data-driven insights. 

### **2. Random Forest (RF)**
Random Forest is a machine learning algorithm that combines multiple decision trees to enhance accuracy. The model is made resistant to overfitting by training each tree on random subsets of the data. It is effective for classification and regression tasks, and is popular for its capability to manage large, intricate datasets with minimal preprocessing. 
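
The two ingredients mentioned above, random subsets and combining trees, can be sketched in a few lines: each tree sees a bootstrap sample (rows drawn with replacement), and the forest's class prediction is the majority vote of its trees. The helpers below are hypothetical and leave out the trees themselves; they only illustrate the sampling and voting steps.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))                  # stand-in for the row indices of a training set
sample = bootstrap_sample(rows, rng)
print(sample)                           # indices may repeat or be left out
print(majority_vote([1, 0, 1, 1, 0]))   # three of five hypothetical trees predict class 1
```

`RandomForestClassifier` additionally samples a random subset of features at each split, which this sketch does not show.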

Random Forest is widely used in the healthcare industry for various purposes, including forecasting disease outcomes, categorizing medical images, and pinpointing key genetic characteristics. It is highly skilled at recognizing connections in medical information, offering valuable perspectives for healthcare providers. 

Random Forest is a versatile tool in the medical field because it can handle missing data, estimate feature importance, and offer reliable predictions. Utilizing extensive datasets assists researchers and healthcare professionals in making data-informed decisions, enhancing patient results, and furthering precision medicine. 

### **3. Support Vector Machine (SVM)**
Support Vector Machine is a supervised machine learning algorithm that works by finding the optimal hyperplane that best separates data into different classes. The objective of SVM is to maximize the margin between the classes for the best possible separation. If the data is not linearly separable, then SVM applies the kernel trick, which maps the data into a higher-dimensional space where a linear separation becomes possible. This flexibility allows the SVM to be highly effective in both simple and complex classification tasks, which therefore makes it an incredibly strong tool for a large number of machine learning applications.
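
The kernel trick can be checked numerically on a small case. For 2-D inputs, the degree-2 polynomial kernel `k(x, y) = (x . y)^2` equals an ordinary dot product after the explicit mapping `phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)`, so the higher-dimensional space never has to be built. The kernel choice and sample points below are assumptions for illustration. scikit-learn's `SVC` uses an RBF kernel by default, where the same principle applies.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) into 3-D space."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    """Plain dot product of two equal-length tuples."""
    return sum(p * q for p, q in zip(a, b))

x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, y)                      # computed in the original 2-D space
explicit = dot(feature_map(x), feature_map(y))    # computed in the mapped 3-D space
print(implicit, explicit)
```

Both routes give the same similarity value, which is why an SVM can learn a non-linear boundary while only ever evaluating the kernel on the original features.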

In the medical sector, SVM has wide application owing to its ability to deal with high-dimensional data such as that from medical images, genetic information, and clinical datasets. For instance, SVM has been successful in classifying images of tumors as either benign or malignant by leveraging histopathological data for diagnosis in cases of breast cancer. Its ability to work with noisy and imbalanced data makes this algorithm particularly useful in health care, where the datasets are often incomplete or skewed, such as in the diagnosis of a rare disease or when dealing with limited clinical samples. The support vectors are the crucial data points closest to the boundary that help the model concentrate on the most informative parts of the data, improving its prediction accuracy even under difficult conditions.

SVM's versatility in handling linear and non-linear data is another advantage in medical settings, particularly for tasks such as medical image segmentation and disease prediction using complex datasets like MRI scans. The kernel trick allows SVM to handle complex datasets by projecting them into higher-dimensional spaces where linear separation can be more viable. In addition, the robustness of SVM against overfitting is an important aspect of medical applications where sample sizes are typically small. Beyond these, its efficiency in handling imbalanced data allows SVM to cover a broad spectrum of applications, from cancer diagnosis to predictions about patient outcomes in less prevalent diseases.

#### **Why These Algorithms Are Suitable**
We chose these algorithms for their complementary strengths in binary classification:
- **XGBoost** handles complex interactions and imbalanced data efficiently
- **Random Forest** provides robustness to noise and valuable insights into feature relevance
- **SVM** models non-linear relationships and ensures precise classification


In summary, by using these algorithms, we aim to compare their performance in liver cancer detection and identify the most effective model. We'll use evaluation metrics such as accuracy, precision, recall, and F1-score for comparison.


---
<a id="model"></a>
# <u> **5.0 Model Development && 6.0 Model Evaluation** </u>

##### **First, let's split the dataset into features (X) and target values (Y)** 

In [None]:
# Split the dataset into features (X) and target (Y)

X = LiverCancer_df.drop(['Category'], axis=1)  
Y = LiverCancer_df['Category']  

##### 'Category' is our encoded label column. We drop it from the features (X), keeping all other columns, and use it as the classification target (Y).

##### **Next, we split the dataset into 80% training data and 20% testing data.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

##### **Why This Split?**
We used an **80-20 split** to ensure a balance between training the model with sufficient data (80%) and evaluating its performance on unseen data (20%). This split is widely accepted as it provides enough data for learning while retaining a representative portion for testing.

##### **Why Not Other Splits?**
- **70-30 Split**: Reduces the amount of training data, which may hurt the model’s ability to generalize for smaller datasets.
- **90-10 Split**: Leaves too little data for testing, making the evaluation less reliable and less representative of the overall population.

<u>These trade-offs matter because our dataset has only 615 rows.</u>
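
Because the classes are imbalanced, one optional variant of the split is to stratify on the target so that both partitions keep the same class ratio. The sketch below is not the split used in this notebook; it uses synthetic stand-in data with an assumed 20% positive rate, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 615 rows, assumed 20% positive labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(615, 5))
y_demo = np.array([1] * 123 + [0] * 492)

# stratify=y_demo keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", round(y_tr.mean(), 3))
print("Test positive rate: ", round(y_te.mean(), 3))
```

Without `stratify`, a random 20% sample of a small dataset can end up with noticeably more or fewer minority-class rows than the training set.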
 

<a id="classifier"></a>
### <u>**5.1 && 6.1 XGBoost Classifier**</u>

##### **Initialization of the XGBoost classifier model**

The initialization and commonly configured parameters of `XGBClassifier`:

```python
# Example: initializing XGBClassifier with customizable parameters
# (values shown are the library defaults unless noted otherwise)
import xgboost as xgb

xgb_model = xgb.XGBClassifier(
    n_estimators=100,         # Number of trees (default: 100)
    max_depth=6,              # Maximum depth of trees (default: 6)
    learning_rate=0.3,        # Step size shrinkage (default: 0.3)
    eval_metric='logloss',    # Evaluation metric for optimization
    use_label_encoder=False,  # Deprecated label-encoding switch; newer releases may ignore it
    subsample=1.0,            # Fraction of rows sampled for each tree (default: 1.0)
    colsample_bytree=1.0,     # Fraction of features sampled for each tree (default: 1.0)
    random_state=42           # Random seed for reproducibility (not a default)
)
```


In [None]:
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

### **Why these initial parameters in `XGBClassifier` ?**
1. **`use_label_encoder=False`**:
   - Disables the deprecated label encoding feature to avoid warnings and ensure compatibility with newer versions of XGBoost.
   - We set it to `False` because the label encoding functionality is unnecessary for this dataset, as the labels are already encoded.

2. **`eval_metric='logloss'`**:
   - Specifies the evaluation metric for optimization. `logloss` is particularly suitable for binary classification tasks as it measures the accuracy of predicted probabilities.
   - We chose `logloss` because it is the default and widely used metric for classification problems in XGBoost.

3. **`random_state=42`**:
   - Sets a random seed to ensure consistent results across multiple runs by fixing the randomness in model training.
   - We chose `42` as it is commonly used as a default value for reproducibility in experiments.

These parameters are customizable and can be changed later during our experiments in the hope of achieving better results.


##### **Hyperparameter tuning and Grid Search `GridSearchCV` with Cross-Validation**

In [None]:
XGBoost_parameters = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(estimator=xgb_model, param_grid=XGBoost_parameters, scoring='accuracy', cv=5, verbose=1)
grid_search.fit(X_train, y_train)

best_xgb = grid_search.best_estimator_ # an extra step to get the best model from the grid search for our project

### **Hyperparameter tuning, and our grid search parameters**

- **Parameters tuned**:

1. **`param_grid`**:
   - Defines the hyperparameters to tune:
     - **`n_estimators`**: Number of trees: Values `[50, 100, 150]` test model complexity
     - **`learning_rate`**: Step size during optimization: Values `[0.01, 0.1, 0.2]` balance precision and convergence speed
     - **`max_depth`**: Tree depth: Values `[3, 5, 7]` compare shallower (simpler) trees with deeper (more complex) ones

2. **Grid Search**:
   - Systematically tests all combinations in `param_grid` using **5-fold cross-validation** for robust evaluation
   - **`scoring='accuracy'`**: Selects the configuration with the highest mean cross-validated accuracy, which we treat as our primary metric

3. **`grid_search.best_estimator_`**:
   - Retrieves the model refit on the full training set with the best-scoring configuration, which we then use for predictions
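
To make the cost of "all combinations" concrete, the following sketch counts the candidates in the grid above with scikit-learn's `ParameterGrid`. The variable names are illustrative and not part of the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as XGBoost_parameters in the notebook
param_grid = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
}

n_candidates = len(ParameterGrid(param_grid))  # every combination of the three lists
n_folds = 5
total_fits = n_candidates * n_folds            # models trained during the search

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

With its default `refit=True`, `GridSearchCV` then trains the winning configuration one more time on the whole training set.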


##### **Evaluating the predictions and accuracy on the test set for our XGBoost model**

In [None]:
y_pred = best_xgb.predict(X_test)

# Evaluation of the model using sklearn.metrics

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

##### **Displaying Confusion Matrix**

In [None]:
print(confusion_matrix(y_test, y_pred))

labels = ["Non-cancerous", "Cancer-related"] # Our binary labels, cancer vs. no cancer

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()

### <u> **XGBoost Classification Final Metrics** </u>

### **Accuracy: 0.983739837398374 ≈ 98%**


| Class Label | Precision | Recall | F1-Score | Support |
|-------------|-----------|--------|----------|---------|
| **0**       | 0.98      | 1.00   | 0.99     | 99      |
| **1**       | 1.00      | 0.92   | 0.96     | 24      |



| **Metric**        | **Score** |
|--------------------|-----------|
| **Accuracy**       | 0.98      |
| **Macro Avg Recall** | 0.96     |
| **Macro Avg F1**   | 0.97      |
| **Weighted Avg F1**| 0.98      |
| **Weighted Avg Recall** | 0.98  |


#### **Confusion Matrix**

| Actual \ Predicted | **0** | **1** |
|---------------------|-------|-------|
| **0**              | 99    | 0     |
| **1**              | 2     | 22    |
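
The class 1 metrics in the report can be recomputed by hand from the confusion matrix above. The sketch below does this in plain Python; the variable names are illustrative.

```python
# Counts copied from the XGBoost confusion matrix above
tn, fp = 99, 0   # actual class 0: predicted 0, predicted 1
fn, tp = 2, 22   # actual class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted cancer cases, how many were correct
recall = tp / (tp + fn)      # of actual cancer cases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The two false negatives are what bring the recall for the cancer-related class down to 0.92 even though no non-cancerous sample was misclassified.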

### **Calculation and plotting of ROC curves**

In [None]:
# Predict probabilities for the positive class (1 = Cancer-related)
y_prob = best_xgb.predict_proba(X_test)[:, 1]

# calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)

# compute fpr, tpr
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

In [None]:
# Plotting the ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f'XGBoost (AUC = {roc_auc:.2f})', color='orange')  # Plot ROC
plt.plot([0, 1], [0, 1], 'k--', label='Chance (AUC = 0.50)')  # Diagonal line
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve for XGBoost')
plt.legend(loc="lower right")
plt.show()

# Print the ROC-AUC score
print("\nROC-AUC Score:", roc_auc)
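
ROC-AUC measures how well the predicted probabilities rank positive cases above negative ones, so it should be computed from `predict_proba` output rather than from hard class labels. The toy values below are illustrative assumptions, not project output.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.35, 0.40, 0.90]

# Hard labels at a 0.5 threshold discard the ranking information
y_label = [int(p >= 0.5) for p in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

print(f"AUC from probabilities: {auc_from_scores:.2f}")
print(f"AUC from hard labels:   {auc_from_labels:.2f}")
```

In this toy case every positive is ranked above every negative, so the probability-based AUC is perfect even though two of the three positives fall below the 0.5 threshold. A high AUC can therefore coexist with a recall below 1.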

### **Calculation and plotting of Standard Deviation for `XGBClassifier`**

In [None]:
# Cross-validation with the baseline (untuned) xgb_model to calculate standard deviation of accuracy;
# pass best_xgb instead to score the tuned model
xgb_scores = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring='accuracy')
xgb_std_dev = np.std(xgb_scores)

print(f"XGBoost Accuracy Scores: {xgb_scores}")
print(f"XGBoost Standard Deviation: {xgb_std_dev:.4f}")

plt.figure(figsize=(6, 4))
plt.bar(['XGBoost'], [xgb_std_dev], color='purple')
plt.title('Standard Deviation of XGBoost Accuracy Scores', fontsize=14)
plt.ylabel('Standard Deviation', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
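
One detail worth knowing when reading these numbers: `np.std` computes the population standard deviation (`ddof=0`) by default, which is slightly smaller than the sample standard deviation (`ddof=1`) when there are only five folds. The fold scores below are hypothetical and serve only to show the difference.

```python
import numpy as np

# Hypothetical fold accuracies for illustration only (not results from this project)
fold_scores = np.array([0.90, 0.92, 0.94, 0.96, 0.98])

mean_acc = fold_scores.mean()
std_population = np.std(fold_scores)      # ddof=0, what np.std uses by default
std_sample = np.std(fold_scores, ddof=1)  # ddof=1, sample standard deviation

print(f"mean={mean_acc:.3f}  std(ddof=0)={std_population:.4f}  std(ddof=1)={std_sample:.4f}")
```

Either convention works for comparing models, as long as the same one is used for all three.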

<a id="rf"></a>
### <u>**5.2 && 6.2 RF Classifier**</u>

##### **Initialization of the RF classifier model**

```python
# Example: initializing RandomForestClassifier with customizable parameters
# (values shown are the scikit-learn defaults unless noted otherwise)
from sklearn.ensemble import RandomForestClassifier

RF_model = RandomForestClassifier(
    n_estimators=100,         # Number of trees in the forest (default: 100)
    max_depth=None,           # Maximum depth of each tree (default: None)
    min_samples_split=2,      # Minimum samples required to split an internal node (default: 2)
    min_samples_leaf=1,       # Minimum samples required to be a leaf node (default: 1)
    max_features='sqrt',      # Features considered per split (default: 'sqrt' in recent releases)
    bootstrap=True,           # Whether bootstrap samples are used when building trees (default: True)
    random_state=42           # Random seed for reproducibility (default: None)
)
```


In [None]:
RF_model = RandomForestClassifier(n_estimators=100, random_state=42)

### **Why these initial parameters in `RandomForestClassifier`?**

1. **n_estimators=100**:
    - Controls the total number of decision trees in the forest
    - Selected `100`, the scikit-learn default, as a reasonable baseline that balances accuracy and computational efficiency
    - Higher values potentially increase accuracy at the cost of longer training times
    
2. **random_state=42**:
    - Ensures reproducible results across different runs
    - Value `42` is a conventional choice in machine learning experiments
    - Facilitates consistent model evaluation

As with XGBoost, these parameters serve as starting points and can be tuned through experimentation to optimize model performance.

#### **Hyperparameter tuning and Grid Search `GridSearchCV` with Cross-Validation**

In [None]:
# Define hyperparameter grid for tuning [altering this frequently to see what gives best results... not finalized]

RF_parameters = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# use our RF initialized model as the estimator

grid_search = GridSearchCV(estimator=RF_model, param_grid=RF_parameters, cv=5, scoring='accuracy', verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train) # fit the gridsearch with the model

### **Hyperparameter tuning, and our grid search parameters**

- **Parameters tuned**:

  1. **`n_estimators`**: Determines the number of trees in the forest. A higher value typically improves performance but increases computational cost.

  2. **`max_depth`**: Limits the depth of each tree to prevent overfitting.

  3. **`min_samples_split`**: Specifies the minimum number of samples required to split a node, controlling the model's complexity.

  4. **`min_samples_leaf`**: Sets the minimum number of samples per leaf node, affecting tree granularity.
  
  5. **`max_features`**: Determines the number of features to consider at each split, balancing accuracy and diversity in the trees.

  6.  **`scoring`**: Accuracy is a widely used and interpretable metric for classification problems.

  7.  **`cv`**: Cross-validation ensures the model’s robustness across different data splits.

  8.  **`verbose`**: Helps track the progress of Grid Search, which can be computationally intensive.
  
  9.  **`n_jobs`**: Significantly reduces computation time by leveraging parallelism, especially when testing multiple parameter combinations.

In [None]:
# Get the best parameters and the best model for our project
best_rf_model = grid_search.best_estimator_
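
The sketch below shows, on a small synthetic dataset, what a fitted `GridSearchCV` object exposes: the winning parameters, their mean cross-validated score, and an estimator that is ready to predict. The data, the reduced grid, and the `demo_` names are assumptions made so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset and a deliberately tiny grid so the sketch runs quickly
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0), demo_grid, cv=3, scoring="accuracy"
)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
print("Mean CV accuracy of best candidate:", round(demo_search.best_score_, 3))

# With the default refit=True, best_estimator_ is already trained on all of X_demo
demo_best = demo_search.best_estimator_
demo_preds = demo_best.predict(X_demo[:5])
```

Because the best estimator is refit automatically, a further call to `fit` on the same training data is optional.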

##### **Evaluating the predictions and accuracy on the test set for our Random Forest model**

In [None]:
# Train the best model found in the previous step (GridSearchCV already refits it when refit=True, the default)
best_rf_model.fit(X_train, y_train)

# predictions calculation
y_pred = best_rf_model.predict(X_test)
y_pred_proba = best_rf_model.predict_proba(X_test)[:, 1]

In [None]:
# Evaluation of the model using sklearn.metrics

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

##### **Displaying Confusion Matrix**

In [None]:
print(confusion_matrix(y_test, y_pred))

labels = ["Non-cancerous", "Cancer-related"] # Our binary labels, cancer vs. no cancer

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()

### <u> **Random Forest Classification Final Metrics** </u>

### **Accuracy: 0.959349593495935 ≈ 96%**


| Class Label | Precision | Recall | F1-Score | Support |
|-------------|-----------|--------|----------|---------|
| 0           | 0.95      | 1.00   | 0.98     | 99      |
| 1           | 1.00      | 0.79   | 0.88     | 24      |

| Metric                | Score |
|-----------------------|-------|
| Macro Avg Recall      | 0.90  |
| Macro Avg F1          | 0.93  |
| Weighted Avg F1       | 0.96  |
| Weighted Avg Recall   | 0.96  |


#### **Confusion Matrix**

| Actual \ Predicted | 0   | 1   |
|--------------------|-----|-----|
| 0                  | 99  | 0   |
| 1                  | 5   | 19  |


### **Calculation and plotting of ROC curves (Same code as before)**

In [None]:
# calculating the ROC-AUC for binary classification from the positive-class probabilities
y_prob = best_rf_model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_prob)


# computing false positive rates, and true positive rates
fpr, tpr, _ = roc_curve(y_test, y_prob)

# plotting rounded to 2 decimal places

plt.figure()
plt.plot(fpr, tpr, label=f'RF (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line for random chance
plt.xlabel('False Positive Rate (fpr)')
plt.ylabel('True Positive Rate (tpr)')
plt.title('ROC Curve')
plt.legend()
plt.show()

print("ROC-AUC Score:", roc_auc)

### **Calculation and plotting of Standard Deviation for `RandomForestClassifier` (Same code as before)**

In [None]:
rf_scores = cross_val_score(best_rf_model, X_train, y_train, cv=5, scoring='accuracy')
rf_std_dev = np.std(rf_scores)

print(f"Random Forest Accuracy Scores: {rf_scores}")
print(f"Random Forest Standard Deviation: {rf_std_dev:.4f}")

plt.figure(figsize=(6, 4))
plt.bar(['Random Forest'], [rf_std_dev], color='green')
plt.title('Standard Deviation of Random Forest Accuracy Scores', fontsize=14)
plt.ylabel('Standard Deviation', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

<a id="svc"></a>
### <u>**5.3 && 6.3 SVC Classifier**</u>

#### Initialization of the SVC classifier model

```python
# Example: initializing SVC with customizable parameters
from sklearn.svm import SVC

svc_model = SVC(
    C=1.0,                 # Regularization parameter (default: 1.0)
    kernel='linear',       # Specifies the kernel type (default: 'rbf')
    degree=3,              # Degree of the polynomial kernel, used only by 'poly' (default: 3)
    gamma='scale',         # Kernel coefficient for 'rbf', 'poly', 'sigmoid' (default: 'scale')
    random_state=42        # Seed used only when probability=True (default: None)
)
```

In [None]:
svc_model = SVC(kernel='linear', C=1.0, gamma='scale', random_state=None)  # Initial SVM model setup

### **Why These Initial Parameters in `SVC`?**

1. **`C=1.0`:**
    - Controls the balance between minimizing training errors and maximizing the hyperplane margin.
    - We selected 1.0 as a balanced baseline.
2. **`kernel='linear'`:**
    - Linear kernel works best with linearly separable data.
    - Can be changed to 'rbf', 'poly', or 'sigmoid' based on dataset characteristics.
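3. **`gamma='scale'`:**
    - Sets the kernel coefficient for the non-linear kernels ('rbf', 'poly', 'sigmoid'); it has no effect on the linear kernel.
    - With 'scale', the value is derived from the number of features and the variance of the data.
4. **`random_state=None`:**
    - Leaves the random seed unset. In `SVC` the seed only affects the probability estimates (`probability=True`), so it does not change this baseline model.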

Like XGBoost and RF models, these parameters are starting points that can be tuned through experimentation.
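
For reference, the two string options for `gamma` correspond to simple formulas documented by scikit-learn: `'scale'` uses `1 / (n_features * X.var())` and `'auto'` uses `1 / n_features`. The sketch below evaluates both on a tiny made-up matrix.

```python
import numpy as np

# Tiny illustrative feature matrix: 2 samples, 2 features
X_demo = np.array([[0.0, 0.0],
                   [4.0, 4.0]])
n_features = X_demo.shape[1]

gamma_scale = 1.0 / (n_features * X_demo.var())  # formula documented for gamma='scale'
gamma_auto = 1.0 / n_features                    # formula documented for gamma='auto'

print(f"gamma='scale' -> {gamma_scale}")
print(f"gamma='auto'  -> {gamma_auto}")
```

Since `'scale'` depends on the variance of the data, its value changes when the features are rescaled, while `'auto'` does not.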

#### **Hyperparameter tuning and Grid Search `GridSearchCV` with Cross-Validation**

In [None]:
# Define hyperparameter grid for tuning [altering this frequently to see what gives best results... not finalized]

SVM_parameters = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3]
}

# use a fresh SVC with probability=True as the estimator so that predict_proba is available for the ROC curve

grid_search = GridSearchCV(estimator=SVC(probability=True), param_grid=SVM_parameters, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

### **Hyperparameter Tuning and Grid Search Parameters**

- **Parameters tuned**:

    1. **`C`**: Balances between minimizing training errors and maximizing the decision boundary margin.

    2. **`kernel`**: Defines the kernel type used in the algorithm ('linear', 'rbf', 'poly', etc.).
    
    3. **`gamma`**: Sets the kernel coefficient for non-linear boundaries, typically using 'scale' or 'auto'.

    4. **`degree`**: Sets the polynomial kernel's degree when using 'poly' kernel type.

    5. **`scoring`**: Explained before. We are still using accuracy.

    6. **`cv`**: Explained before.

    7. **`verbose`**: Explained before.
    
    8. **`n_jobs`**: Explained before.

In [None]:
# Get the best model for our project
best_svm = grid_search.best_estimator_

# Train the best model found from the previous step above
best_svm.fit(X_train, y_train)

##### **Evaluating the predictions and accuracy on the test set for our SVC model**

In [None]:
y_pred = best_svm.predict(X_test)

# Evaluation of the model using sklearn.metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

##### **Displaying Confusion Matrix**

In [None]:
print(confusion_matrix(y_test, y_pred))

labels = ["Non-cancerous", "Cancer-related"]  # Our binary labels, cancer vs. no cancer

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()

### <u> **SVC Final Metrics** </u>

### **Accuracy: 0.943089430894309 ≈ 94%**


| **Class Label** | **Precision** | **Recall** | **F1-Score** | **Support** |
|------------------|---------------|------------|--------------|-------------|
| 0               | 0.94          | 0.99       | 0.97         | 99          |
| 1               | 0.95          | 0.75       | 0.84         | 24          |

| **Metric**           | **Score** |
|-----------------------|-----------|
| **Macro Avg Recall**  | 0.87      |
| **Macro Avg F1**      | 0.90      |
| **Weighted Avg F1**   | 0.94      |
| **Weighted Avg Recall** | 0.94    |


#### **Confusion Matrix**

| **Actual \ Predicted** | **Class 0** | **Class 1** |
|-------------------------|-------------|-------------|
| **Class 0**            | 98          | 1           |
| **Class 1**            | 6           | 18          |


### **Calculation and plotting of ROC curves (Same code as before)**

In [None]:
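# Predict probabilities for the positive class with the tuned SVM
y_prob = best_svm.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_prob)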

fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.figure()
plt.plot(fpr, tpr, label=f'SVM (AUC = {roc_auc:.2f})', color='purple')
plt.plot([0, 1], [0, 1], 'k--', label='Chance (AUC = 0.50)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for SVM')
plt.legend(loc="lower right")
plt.show()

print("\nROC-AUC Score:", roc_auc)

### **Calculation and plotting of Standard Deviation for `SVC` (Same code as before)**

In [None]:
# Cross-validation with SVM to calculate standard deviation of accuracy
svc_scores = cross_val_score(best_svm, X_train, y_train, cv=5, scoring='accuracy')
svc_std_dev = np.std(svc_scores)

print(f"SVM Accuracy Scores: {svc_scores}")
print(f"SVM Standard Deviation: {svc_std_dev:.4f}")

plt.figure(figsize=(6, 4))
plt.bar(['SVM'], [svc_std_dev], color='blue')
plt.title('Standard Deviation of SVM Accuracy Scores', fontsize=14)
plt.ylabel('Standard Deviation', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

---
<a id="analysis"></a>
# <u> **7.0 Analysis** </u>

### Overview of Results

We evaluated three machine learning models, XGBoost, Random Forest, and Support Vector Machines (SVM), for hepatocellular carcinoma (HCC) classification. Each model showed distinct performance characteristics in accuracy, precision, recall, and ROC-AUC scores, influenced by the dataset's properties.

### Model Performance Comparison

1. **XGBoost:**
    - XGBoost showed the strongest performance, with **98%** test accuracy, precision of **1.00** and recall of **0.92** for cancer-related cases, and a ROC-AUC of **1.00**.

    - Its tree-based ensemble method excelled at feature selection and handling imbalanced data. Careful hyperparameter tuning of estimators, learning rate, and maximum depth enhanced its generalization ability.

    - The near-perfect scores suggest possible overfitting, highlighting the need for external dataset validation.

2. **Random Forest:**
    - Random Forest achieved strong results with **96%** accuracy and a **0.96** weighted F1 score, handling missing and irrelevant features effectively through its bagging mechanism.

    - Its recall for cancer-related cases (**0.79**) fell below XGBoost's (**0.92**), indicating some difficulty in positive case detection.

    - While the model's diverse decision trees reduced variance, this may have limited its sensitivity to subtle patterns.

3. **SVM:**
    - SVM reached **94%** accuracy, with high precision (**0.95**) but lower recall (**0.75**) for cancer-related cases, meaning it made few false positives but missed some true cases.

    - The grid search compared linear, RBF, and polynomial kernels; the low recall suggests that further tuning of the kernel and its parameters could improve results.
    
    - The model's effectiveness was limited by its sensitivity to noise and dependence on proper feature scaling.
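
The dependence on feature scaling can be sketched with a scikit-learn pipeline that standardizes the inputs before the SVM sees them. The data below is synthetic, one informative small-scale feature plus one large-scale noise feature, and is an assumption chosen to exaggerate the effect; it is not the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: one informative small-scale feature, one large-scale noise feature
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
informative = y_demo + rng.normal(0, 0.3, size=400)
noise = rng.normal(0, 1000, size=400)
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Without scaling, the large-scale noise feature dominates the RBF distances
raw_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

# The pipeline learns the scaling on the training data only, then fits the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_acc = scaled_svc.fit(X_tr, y_tr).score(X_te, y_te)

print(f"Unscaled SVC accuracy: {raw_acc:.2f}")
print(f"Scaled SVC accuracy:   {scaled_acc:.2f}")
```

A pipeline like this can also be passed to `GridSearchCV` as the estimator, in which case the parameter names need the step prefix, for example `svc__C`.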

### Dataset Properties and Implications

- **Class Imbalance:**
The dataset contained more non-cancerous than cancerous samples (99 versus 24 in the test set). XGBoost handled this imbalance best, while Random Forest and SVM showed lower recall for the minority class (0.79 and 0.75).

- **Feature Importance:**
The biomarkers that ranked highest in the tree-based models (AST, ALP, ALT, and CHE) strengthened the models' interpretability and predictions. XGBoost particularly benefited from feature selection and engineering.

- **Data Variability:**
The dataset's diverse biomarker measurements created complexity that required algorithms capable of capturing relationships between features.

### Standard Deviation Analysis

- Cross-validation revealed how consistently each model performed across different data splits:
    - **XGBoost** maintained remarkably stable predictions across all folds.
    - **Random Forest** showed slight variations due to its random feature selection process.
    - **SVM** displayed moderate variation, reflecting its sensitivity to hyperparameters and feature scaling.

### Insights and Recommendations

1. **XGBoost** emerged as the top performer but needs external validation to confirm its reliability.
2. **Random Forest** offered perfect precision but lower recall (**0.79**) for cancer-related cases; its robustness makes it a reasonable choice for noisy datasets.
3. **SVM** shows promise but needs optimization of kernel choice and hyperparameters to improve minority class detection.

### What features are most influential in each model's predictions?

#### **XGBoost Feature Importance**

In [None]:
# Feature importance for XGBoost
xgb_importances = best_xgb.feature_importances_
xgb_indices = np.argsort(xgb_importances)[::-1]

plt.figure(figsize=(6, 8))
plt.title("XGBoost Feature Importance", fontsize=14)
plt.bar(range(X_train.shape[1]), xgb_importances[xgb_indices], align="center")
plt.xticks(range(X_train.shape[1]), [X_train.columns[i] for i in xgb_indices], rotation=90)
plt.xlabel("Features", fontsize=12)
plt.ylabel("Importance", fontsize=12)
plt.tight_layout()
plt.show()

**The top 3 features/biomarkers influencing XGBoost's predictions are:**

* AST
* CHE
* ALP

#### **RF Feature Importance**

In [None]:
# Feature importance for Random Forest
rf_importances = best_rf_model.feature_importances_
rf_indices = np.argsort(rf_importances)[::-1]

plt.figure(figsize=(10, 6))
plt.title("Random Forest Feature Importance", fontsize=14)
plt.bar(range(X_train.shape[1]), rf_importances[rf_indices], align="center", color="green")
plt.xticks(range(X_train.shape[1]), [X_train.columns[i] for i in rf_indices], rotation=90)
plt.xlabel("Features", fontsize=12)
plt.ylabel("Importance", fontsize=12)
plt.tight_layout()
plt.show()

**The top 3 features/biomarkers influencing RandomForestClassifier's predictions are:**

* AST
* ALP
* ALT

#### **SVM Feature Importance**

<u> Not implemented: `SVC` exposes per-feature coefficients (`coef_`) only for a linear kernel, and our tuned model did not use one. </u>
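
One model-agnostic alternative that works with any kernel is permutation importance, available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below is a possible direction and not something run in this project; it uses synthetic data in which only the first feature carries signal.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

# Synthetic data: only feature 0 carries signal, features 1 and 2 are noise
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = np.column_stack([
    y_demo + rng.normal(0, 0.3, size=300),
    rng.normal(size=300),
    rng.normal(size=300),
])

svc_demo = SVC(kernel="rbf").fit(X_demo, y_demo)

# Shuffle one column at a time and record how much the accuracy drops
result = permutation_importance(
    svc_demo, X_demo, y_demo, scoring="accuracy", n_repeats=10, random_state=0
)

for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Applied to this notebook, the same call would take `best_svm` together with `X_test` and `y_test`, so the importances reflect held-out data.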

### Clinical Validation and Feature Analysis

Our models showed strong performance metrics, but we must validate them further in real-world medical settings. This validation requires testing the models on diverse, independent datasets to confirm their reliability and broad applicability.

The **feature importance analysis** revealed key predictors in each model. These features need examination for their **biological and clinical relevance to liver cancer**. By studying how these features correlate with known biomarkers and risk factors for hepatocellular carcinoma (HCC), we can better understand disease mechanisms and make our models more interpretable.

Moving forward, we plan to **collaborate with clinical experts** to interpret these findings and integrate them into diagnostic protocols. This will help us maintain the right balance between algorithmic accuracy and practical clinical use.

### Correlation matrix

A correlation matrix is a table that shows the pairwise correlation coefficients between multiple variables in a dataset. It quantifies the strength and direction of linear relationships, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), while 0 indicates no correlation.

We calculate the correlation matrix using the pandas `.corr()` method:

```python
df.corr()
```

In [None]:
# calculate the correlation matrix
correlation_matrix = LiverCancer_df.corr()

# Create the heatmap
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12, 12))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='viridis')


plt.title('Correlation Matrix Heatmap')
plt.show()
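
To make the range from -1 to 1 concrete, the toy example below builds a small frame in which one column rises with another and a third falls. The column names and values are made up for illustration.

```python
import pandas as pd

# Toy frame: b rises with a, c falls as a rises
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
})

toy_corr = toy_df.corr()  # Pearson correlation by default
print(toy_corr.round(2))
```

Here `a` and `b` are perfectly positively correlated and `a` and `c` perfectly negatively correlated, the two extremes of the scale shown in the heatmap.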

---
<a id="conclusion"></a>
# <u> **8.0 Conclusion and Recommendations** </u>

This project evaluated three machine learning algorithms for liver cancer classification. XGBoost and Random Forest demonstrated strong accuracy and F1-scores, with XGBoost achieving the highest metrics overall. SVM was still effective but showed lower performance, particularly in recall for cancer-related cases, which may reflect its sensitivity to data scaling.

### Key Findings:

- XGBoost emerged as the top performer, excelling in handling non-linear relationships and feature importance evaluation.

- Random Forest showed strong results with high accuracy, though lower recall than XGBoost for cancer-related cases (0.79 versus 0.92).

- SVM performed adequately but had more difficulty detecting cancer-related cases.

- Feature importance analysis highlighted the biomarkers AST, ALP, ALT, and CHE as key indicators for liver cancer detection.

### Recommendations:

1. XGBoost as the preferred algorithm: XGBoost's superior performance across all metrics, combined with its ability to handle complex patterns and imbalanced datasets, makes it ideal for liver cancer classification.

2. Further clinical validation: Additional validation using larger, more diverse datasets is essential to ensure model reliability in real-world applications.

3. Feature importance analysis: Further research should examine the biological relationship between identified significant features and liver cancer to uncover new insights into disease mechanisms.

4. Incorporation of Explainable AI (XAI): The implemented LIME method enhances prediction transparency and should be developed further to improve trust and interpretability in clinical settings.

In conclusion, combining advanced machine learning models with explainable AI techniques offers a promising approach to improve early liver cancer detection. Future work should focus on dataset expansion, biomarker exploration, and model refinement for clinical deployment.

---
<a id="ref"></a>
# <u> **9.0 References** </u>

## Research papers:
1. Y. Han *et al.*, “Early warning and diagnosis of liver cancer based on dynamic network biomarker and deep learning,” *Computational and Structural Biotechnology Journal*, 2023. doi: 10.1016/j.csbj.2023.07.002.

2. B. Das *et al.*, “Deep transfer learning for automated liver cancer gene recognition using spectrogram images of digitized DNA sequences,” *Biomedical Signal Processing and Control*, 2022. doi: 10.1016/j.bspc.2021.103317.

3. W. Książek *et al.*, “A novel machine learning approach for early detection of hepatocellular carcinoma patients,” *Cognitive Systems Research*, 2019. doi: 10.1016/j.cogsys.2018.12.001.

4. H. L. Ngan *et al.*, “Machine learning facilitates the application of mass spectrometry-based metabolomics to clinical analysis,” *TrAC - Trends in Analytical Chemistry*, 2023. doi: 10.1016/j.trac.2023.117333.

5. M. Matboli *et al.*, “Machine learning based identification of key feature RNA-signature linked to diagnosis of Hepatocellular Carcinoma,” *Journal of Clinical and Experimental Hepatology*, 2024. doi: 10.1016/j.jceh.2024.101456.

6. M. Gholizadeh *et al.*, “Detection of key mRNAs in liver tissue of hepatocellular carcinoma patients based on machine learning and bioinformatics analysis,” *MethodsX*, 2023. doi: 10.1016/j.mex.2023.102021.

7. O. H. Purba *et al.*, “Classification of liver cancer with microrna data using the deep neural network (DNN) method,” *Journal of Physics: Conference Series*, 2020. doi: 10.1088/1742-6596/1524/1/012129.

8. G. Zhang *et al.*, “A novel liver cancer diagnosis method based on patient similarity network and DenseGCN,” *Scientific Reports*, 2022. doi: 10.1038/s41598-022-10441-3.

## Links & Official documentation:
1. https://www.kaggle.com/datasets/fedesoriano/hepatitis-c-dataset/code
2. https://www.kaggle.com/code/mohamedtarek111/virus-c-detection-with-95-acc
3. https://www.kaggle.com/code/gallo33henrique/model-ml-tuning-hepatitis-c
4. https://www.kaggle.com/code/mennatullahelzarqa/hepatits-c-prediction
5. https://xgboost.readthedocs.io/en/stable/
6. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
7. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
8. https://www.geeksforgeeks.org/xgboost/
9. https://www.nvidia.com/en-us/glossary/xgboost/
10. https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/
11. https://www.ibm.com/topics/random-forest
12. https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/
13. https://www.javatpoint.com/machine-learning-random-forest-algorithm
14. https://www.geeksforgeeks.org/support-vector-machine-algorithm/
15. https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
16. https://scikit-learn.org/1.5/modules/svm.html


---
<a id="ai-approach"></a>
# <u> **10.0 Explainable AI Approach: LIME for Local Interpretability [Extra]** </u>

To make our results easier to understand, we use **LIME (Local Interpretable Model-Agnostic Explanations)**. LIME shows how the model makes individual predictions by identifying which features matter most in each case. This approach offers several benefits:

1. Understand specific predictions for liver cancer classification (non-cancerous vs. cancerous)

2. Explain how the model works for individual cases in clear terms

3. Work with medical experts to confirm that important features match known disease markers

Here's the LIME code adapted for our binary classification task:

In [None]:
from lime.lime_tabular import LimeTabularExplainer

classes = ['Non-cancerous', 'Cancerous']

feature_names = list(X_train.columns)  # same columns, in the same order, that the models were trained on

# Initialize the LIME explainer
explainer = LimeTabularExplainer(
    X_train.values, 
    feature_names=feature_names, 
    class_names=classes, 
    mode='classification'
)

# Generate LIME explanations for the first 20 instances of the test set (X_test)
for i in range(20):  
    exp = explainer.explain_instance(
        X_test.iloc[i].values,  
        best_xgb.predict_proba,  # in this case I am testing the xgb model
        num_features=10  
    )
    print(f"\nExplanation for instance {i}:")
    exp.show_in_notebook(show_table=True, show_all=False)


#### Will not be documenting this further unless we can make it a research direction.