<a href="https://colab.research.google.com/github/Laaliji/Colon-Cancer-Gene-Expression-Data-Classification-Analysis/blob/main/Colon_Cancer_Gene_Expression_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Analysis of Colon Cancer Gene Expression Activity**

Colon cancer remains a major global health challenge, requiring precise diagnostic tools. Gene expression profiling, combined with machine learning, enables the identification of cancer biomarkers. This project analyzes the "Gene Expression of Colon Cancer" dataset (60 genes) to:
1. Compare five models—logistic regression, SVM, k-NN, decision tree, and random forest—for classification.
2. Identify influential genes based on each model’s mechanism (e.g., coefficients, feature importance).
3. Conduct a comparative analysis to determine which model gives the most reliable prediction for a new patient.

The results will support an article for the Nordic Machine Intelligence (NMI) journal.

*   **Supervision:** Dr. O. BANOUAR, Faculty of Sciences and Techniques, Cadi Ayyad University, Marrakech.
*   **Author:** Zakariae LAALIJI.



## **Libraries and Configurations**

This section mounts Google Drive, imports the necessary Python libraries, and configures the environment for analyzing the "Gene Expression of Colon Cancer" dataset. pandas and NumPy handle data processing, scikit-learn provides the models and evaluation metrics, and matplotlib and seaborn handle visualization. Together they support model training, gene importance analysis, and comparative evaluation. Configurations ensure reproducibility and consistency throughout the analysis.

In [1]:
from google.colab import drive # Import the necessary library for Google Drive integration
drive.mount('/content/drive') # Mount your Google Drive to the '/content/drive' directory

Mounted at /content/drive


In [5]:
import os # For operating system tasks such as listing files

# Specifying the path
datasets_folder_path = '/content/drive/My Drive/Datasets_For_Research'

# Printing the contents of the 'Datasets_For_Research' folder
print(os.listdir(datasets_folder_path))  # to list files and directories

['colon_cancer.csv']


In [6]:
import pandas as pd # For data manipulation and analysis using DataFrames
from sklearn.model_selection import train_test_split, cross_val_score # For train/test splitting and cross-validation
from sklearn.preprocessing import StandardScaler # For feature scaling (standardization)
from sklearn.linear_model import LogisticRegression # For logistic regression model
from sklearn.neighbors import KNeighborsClassifier # For k-nearest neighbors model
from sklearn.svm import SVC # For Support Vector Machine model
from sklearn.tree import DecisionTreeClassifier # For decision tree model
from sklearn.ensemble import RandomForestClassifier # For random forest model
from sklearn.inspection import DecisionBoundaryDisplay # For visualizing decision boundaries
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # For model evaluation metrics
import seaborn as sns # For data visualization
import numpy as np # For numerical computations
import matplotlib.pyplot as plt # For plotting graphs
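
The reproducibility configuration mentioned at the start of this section could look roughly like the sketch below. The constant `RANDOM_STATE`, its value, and the helper `make_rng` are assumptions introduced for illustration; they do not appear in the original notebook.

```python
import random

import numpy as np

# Hypothetical seed; any fixed integer gives repeatable runs.
RANDOM_STATE = 42

random.seed(RANDOM_STATE)     # Seed Python's built-in random module
np.random.seed(RANDOM_STATE)  # Seed NumPy's legacy global random state


def make_rng(seed=RANDOM_STATE):
    """Return a NumPy Generator that produces the same stream for the same seed."""
    return np.random.default_rng(seed)
```

Passing the same value as `random_state=RANDOM_STATE` to `train_test_split` and to estimators that accept it (for example `LogisticRegression` or `SVC`) keeps data splits and model fits repeatable across runs.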

## **Data Preparation**

This section prepares the "Gene Expression of Colon Cancer" dataset for analysis. Tasks include loading the dataset, handling missing values, standardizing the 60 gene expression features, and splitting the data into training and test sets. These steps ensure the data is clean, consistent, and ready for training machine learning models and evaluating gene importance.
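
The preparation steps described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it uses a randomly generated stand-in for `colon_cancer.csv`, and the column names (`gene_1` … `gene_60`, `label`), the median imputation strategy, the 80/20 split, and the seed are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42  # Assumed seed for repeatable splits
rng = np.random.default_rng(RANDOM_STATE)

# Synthetic stand-in for the real CSV: 100 samples x 60 genes (random values, not real data)
gene_cols = [f"gene_{i}" for i in range(1, 61)]  # Hypothetical column names
df = pd.DataFrame(rng.normal(size=(100, 60)), columns=gene_cols)
df["label"] = np.tile([0, 1], 50)  # Hypothetical binary target column
df.iloc[0, 0] = np.nan             # Inject one missing value for the demo

X = df[gene_cols]
y = df["label"]

# Split first so that test samples never influence imputation or scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

# Handle missing values with per-gene medians computed on the training set only
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)

# Standardize: fit on the training set, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the imputation values and the scaler on the training split only is a design choice made here to avoid leaking test-set statistics into training; scaling the full dataset before splitting would be simpler but slightly optimistic.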