<a href="https://colab.research.google.com/github/RegenMedandAI/Machine-Learning-and-cancer-studies/blob/main/Comparative_machine_learning_analysis_of_breast_cancer_type_and_tumour_grade.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Breast Cancer Gene Expression Analysis: Multi-Model Approach
Overview
This project implements a machine learning analysis for breast cancer classification from gene expression data. It compares three modeling approaches on two tasks, predicting cancer type and predicting tumor grade:

Random Forest (RF)
Support Vector Machine (SVM)
Deep Neural Network (DNN)

Data Preprocessing

Feature extraction from gene expression values
Label encoding for cancer types and tumor grades
Train-test split (80-20)
StandardScaler application for feature normalization
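
The four steps above can be sketched on a small synthetic matrix. The array sizes and class names below are illustrative assumptions, not values from the dataset; the point of the sketch is that the scaler is fitted on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an expression matrix: 20 samples x 5 probes (hypothetical sizes)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=8.0, scale=2.0, size=(20, 5))
labels_demo = np.array(['basal', 'luminal_A'] * 10)  # hypothetical class names

# Label encoding: classes are sorted and mapped to integers 0..n_classes-1
encoder = LabelEncoder()
y_demo = encoder.fit_transform(labels_demo)

# 80-20 train-test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply the same transform to the test split
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(encoder.classes_)
print(X_tr_scaled.shape, X_te_scaled.shape)
```

Fitting the scaler on the full matrix before splitting would let test-set statistics leak into training, which is why `fit_transform` is called on the training split alone.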

Model Architectures
1. Random Forest Classifier

Non-parametric, ensemble learning method
Uses 100 decision trees (n_estimators=100)
Advantages:

Built-in feature importance
Handles non-linearity effectively
Less prone to overfitting than a single decision tree
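
As a rough illustration of the built-in feature importance, the sketch below fits a forest on synthetic data in which only the first three columns carry signal. The data set and its sizes are assumptions made for the example and do not come from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 200 samples, 20 features, of which the first 3 are informative
# (shuffle=False keeps the informative columns at indices 0-2)
X_syn, y_syn = make_classification(n_samples=200, n_features=20, n_informative=3,
                                   n_redundant=0, shuffle=False, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_syn, y_syn)

# Impurity-based importances: one non-negative value per feature, summing to 1
importances = forest.feature_importances_
top3 = np.argsort(importances)[::-1][:3]
print("Top 3 feature indices:", top3)
print("Sum of importances:", importances.sum())
```

These importances are relative scores within one fitted model, so they are best read as a ranking rather than as absolute effect sizes.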



2. Support Vector Machine

Kernels: linear (used for the coefficient-based gene ranking) and RBF (Radial Basis Function, used in the final model comparison)
Hyperparameters:

probability=True for probability estimates
random_state=42 for reproducibility


Advantages:

Effective in high-dimensional spaces
Memory efficient
Versatile through different kernel functions
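
The memory-efficiency point can be illustrated with a small sketch: after fitting, an SVM keeps only its support vectors for prediction. The synthetic data and its dimensions are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data: 200 samples x 50 features (hypothetical sizes)
X_syn, y_syn = make_classification(n_samples=200, n_features=50, n_informative=10,
                                   random_state=42)

support_counts = {}
for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel, random_state=42)
    clf.fit(X_syn, y_syn)
    # n_support_ holds the number of support vectors per class
    support_counts[kernel] = int(clf.n_support_.sum())
    print(f"{kernel}: {support_counts[kernel]} support vectors "
          f"out of {len(X_syn)} training samples")
```

How many samples end up as support vectors depends on the kernel, the regularization strength `C`, and how separable the classes are.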



3. Deep Neural Network

Architecture:

Input layer: Matches feature dimensionality
Hidden layers: 256 → 128 → 64 neurons
Output layer: Softmax activation


Training parameters:

Epochs: 100
Batch size: 32
Early stopping with patience=10
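
The idea behind early stopping with a patience value can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Keras implementation: the helper `early_stopping_epoch` and the loss values are hypothetical, and a small patience is used so the trace is easy to follow.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has failed to
    improve for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best_loss = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.64]  # hypothetical validation losses
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

With `restore_best_weights=True`, as used in the notebook, the model is rolled back to the weights from the best epoch rather than keeping those from the epoch where training stopped.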


Regularization techniques:

Dropout (0.3)
BatchNormalization


Optimizer: Adam
Loss function: Categorical crossentropy
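
To make the output layer and loss concrete, here is a minimal NumPy sketch of softmax followed by categorical crossentropy for a single sample. The logits and the one-hot label are made-up values chosen for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(y_true_onehot, y_prob):
    # Negative log-probability assigned to the true class
    return -np.sum(y_true_onehot * np.log(y_prob))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for 3 classes
y_true = np.array([1.0, 0.0, 0.0])   # one-hot label: the true class is index 0

probs = softmax(logits)
loss = float(categorical_crossentropy(y_true, probs))
print(probs.round(3))   # [0.659 0.242 0.099]
print(round(loss, 3))   # 0.417
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is what the Adam optimizer pushes the network toward during training.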

Model Comparison
Similarities

RF and SVM:

Both are traditional machine learning algorithms
Work well with high-dimensional data
Less computational resources compared to DNN


SVM and DNN:

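Both separate classes with hyperplanes: an SVM in the feature space induced by its kernel, a DNN in the representation learned by its last hidden layer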
Can model non-linear relationships



Key Differences

Random Forest:

Ensemble method using multiple decision trees
Provides feature importance out-of-the-box
May struggle when uninformative features far outnumber informative ones, as is common in gene expression data


SVM:

Creates optimal hyperplane for class separation
Kernel trick for non-linear classification
Can be computationally intensive for large datasets
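
The effect of the kernel trick can be seen on a toy problem where no straight line separates the classes. The concentric-ring data below is an assumption chosen for illustration and is unrelated to the gene expression data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable in the input space
X_ring, y_ring = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_ring, y_ring, test_size=0.2, random_state=42)

scores = {}
for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel, random_state=42)
    clf.fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)
    print(f"{kernel} kernel test accuracy: {scores[kernel]:.2f}")
```

On data like this the RBF kernel should clearly outperform the linear one. On real expression data, where features far outnumber samples, the gap may be much smaller, which is one reason to try both.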


Deep Neural Network:

Most complex model with multiple layers
Requires more data for optimal performance
Computationally intensive training
Can automatically learn feature representations



Comparative Efficacy
The most appropriate model depends on specific requirements:

For interpretability: Random Forest
For balanced performance: SVM
For complex pattern recognition: Deep Neural Network

Implementation Notes

Both cancer type and tumor grade classifications are performed
All models use consistent preprocessing for fair comparison
Early stopping in the DNN helps limit overfitting
Multiple evaluation metrics: accuracy, classification report, confusion matrix
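
The three evaluation outputs can be illustrated on a tiny hand-made example. The label vectors below are hypothetical and exist only to show how the numbers relate to each other.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true_demo = [0, 0, 1, 1, 2, 2]   # hypothetical true classes
y_pred_demo = [0, 1, 1, 1, 2, 0]   # hypothetical predictions

acc = accuracy_score(y_true_demo, y_pred_demo)
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(f"Accuracy: {acc:.3f}")  # 0.667 (4 of 6 correct)
print(cm)                      # rows are true classes, columns are predicted classes
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

Accuracy alone hides which classes are confused with each other; the confusion matrix and the per-class precision and recall in the report show that.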



In [None]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv('Breast_GSE45827p.csv')
print(df.head()) # Use head() to display the first few rows of the dataframe

In [None]:
# Corrected data preprocessing for breast cancer gene expression analysis
# Uses the dataframe df loaded above
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

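# Layout assumed by the slicing below: row 0 holds the cancer type and row 1 the
# tumor grade of each sample; columns 0 and 1 hold the probe ID and gene symbol;
# columns 2-152 hold the samples.

# Extract features (gene expression values)
# The label rows make these columns object-typed, so cast the values to float explicitly
X = df.iloc[2:, 2:153].T.astype(float)  # Transpose to get samples as rows
X.index = df.columns[2:153]  # Set sample names as index
X.columns = df.iloc[2:, 0]  # Set probe_id as column names

# Extract labels
y_type = df.iloc[0, 2:153]
y_grade = df.iloc[1, 2:153].astype(str)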

# Create a mapping of probe_id to gene symbol
gene_map = dict(zip(df.iloc[2:, 0], df.iloc[2:, 1]))

# Encode cancer types and grades
le_type = LabelEncoder()
le_grade = LabelEncoder()
y_type_encoded = le_type.fit_transform(y_type)
y_grade_encoded = le_grade.fit_transform(y_grade)

# Split the data
X_train, X_test, y_type_train, y_type_test, y_grade_train, y_grade_test = train_test_split(
    X, y_type_encoded, y_grade_encoded, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data preprocessed and split into training and test sets.")
print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
print(f"Cancer types: {le_type.classes_}")
print(f"Tumor grades: {le_grade.classes_}")

In [None]:
# Machine Learning Analysis for Breast Cancer Gene Expression Data
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Using the preprocessed data from previous code
# Assuming we have: X_train_scaled, X_test_scaled, y_type_train, y_type_test, y_grade_train, y_grade_test

# 1. Train and evaluate cancer type classifier
rf_type = RandomForestClassifier(n_estimators=100, random_state=42)
rf_type.fit(X_train_scaled, y_type_train)

# Predict and evaluate
y_type_pred = rf_type.predict(X_test_scaled)

print("Cancer Type Classification Results:")
print("Accuracy:", accuracy_score(y_type_test, y_type_pred))
print("\nClassification Report:")
print(classification_report(y_type_test, y_type_pred, target_names=le_type.classes_))

# Confusion Matrix for cancer types
plt.figure(figsize=(10, 8))
cm_type = confusion_matrix(y_type_test, y_type_pred)
sns.heatmap(cm_type, annot=True, fmt='d', cmap='Blues',
            xticklabels=le_type.classes_,
            yticklabels=le_type.classes_)
plt.title('Confusion Matrix - Cancer Types')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# 2. Train and evaluate tumor grade classifier
rf_grade = RandomForestClassifier(n_estimators=100, random_state=42)
rf_grade.fit(X_train_scaled, y_grade_train)

# Predict and evaluate
y_grade_pred = rf_grade.predict(X_test_scaled)

print("\nTumor Grade Classification Results:")
print("Accuracy:", accuracy_score(y_grade_test, y_grade_pred))
print("\nClassification Report:")
print(classification_report(y_grade_test, y_grade_pred, target_names=le_grade.classes_))

# Confusion Matrix for tumor grades
plt.figure(figsize=(10, 8))
cm_grade = confusion_matrix(y_grade_test, y_grade_pred)
sns.heatmap(cm_grade, annot=True, fmt='d', cmap='Blues',
            xticklabels=le_grade.classes_,
            yticklabels=le_grade.classes_)
plt.title('Confusion Matrix - Tumor Grades')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# 3. Feature Importance Analysis
# For cancer types
feature_importance_type = pd.DataFrame({
    'gene': X.columns,
    'importance': rf_type.feature_importances_
})
feature_importance_type = feature_importance_type.sort_values('importance', ascending=False)

# For tumor grades
feature_importance_grade = pd.DataFrame({
    'gene': X.columns,
    'importance': rf_grade.feature_importances_
})
feature_importance_grade = feature_importance_grade.sort_values('importance', ascending=False)

# Print top 10 important genes for each classification
print("\nTop 10 Important Genes for Cancer Type Classification:")
for i, row in feature_importance_type.head(10).iterrows():
    gene_symbol = gene_map.get(row['gene'], row['gene'])
    print(f"{gene_symbol}: {row['importance']:.4f}")

print("\nTop 10 Important Genes for Tumor Grade Classification:")
for i, row in feature_importance_grade.head(10).iterrows():
    gene_symbol = gene_map.get(row['gene'], row['gene'])
    print(f"{gene_symbol}: {row['importance']:.4f}")

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Data Loading and Preprocessing
# The SVM models reuse X, gene_map, le_type, le_grade, X_train_scaled, X_test_scaled,
# and the encoded label splits created in the preprocessing cell above.

# 2. Cancer Type Classification with SVM
print("\nTraining SVM for Cancer Type Classification...")
svm_type = SVC(kernel='linear', random_state=42)
svm_type.fit(X_train_scaled, y_type_train)

# Predict and evaluate
y_type_pred_svm = svm_type.predict(X_test_scaled)

print("\nCancer Type Classification Results (SVM):")
print("Accuracy:", accuracy_score(y_type_test, y_type_pred_svm))
print("\nClassification Report:")
print(classification_report(y_type_test, y_type_pred_svm, target_names=le_type.classes_))

# Confusion Matrix for cancer types (SVM)
plt.figure(figsize=(10, 8))
cm_type_svm = confusion_matrix(y_type_test, y_type_pred_svm)
sns.heatmap(cm_type_svm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le_type.classes_,
            yticklabels=le_type.classes_)
plt.title('Confusion Matrix - Cancer Types (SVM)')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# 3. Tumor Grade Classification with SVM
print("\nTraining SVM for Tumor Grade Classification...")
svm_grade = SVC(kernel='linear', random_state=42)
svm_grade.fit(X_train_scaled, y_grade_train)

# Predict and evaluate
y_grade_pred_svm = svm_grade.predict(X_test_scaled)

print("\nTumor Grade Classification Results (SVM):")
print("Accuracy:", accuracy_score(y_grade_test, y_grade_pred_svm))
print("\nClassification Report:")
print(classification_report(y_grade_test, y_grade_pred_svm, target_names=le_grade.classes_))

# Confusion Matrix for tumor grades (SVM)
plt.figure(figsize=(10, 8))
cm_grade_svm = confusion_matrix(y_grade_test, y_grade_pred_svm)
sns.heatmap(cm_grade_svm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le_grade.classes_,
            yticklabels=le_grade.classes_)
plt.title('Confusion Matrix - Tumor Grades (SVM)')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# 4. Feature Importance Analysis
# SVC handles multiclass problems with one-vs-one classifiers, so coef_ has one row
# per pair of classes (ordered like itertools.combinations), not one row per class.
from itertools import combinations

def get_svm_feature_importance(svm_model, feature_names, class_labels):
    importance_per_pair = {}
    class_pairs = combinations(class_labels, 2)
    for coef, (class_a, class_b) in zip(svm_model.coef_, class_pairs):
        importance = pd.DataFrame({
            'gene': feature_names,
            'importance': np.abs(coef)
        })
        # Key each ranking by the pair of classes its hyperplane separates
        importance_per_pair[f"{class_a} vs {class_b}"] = importance.sort_values('importance', ascending=False)
    return importance_per_pair

# Get feature importance for cancer types
type_importance_svm = get_svm_feature_importance(svm_type, X.columns, le_type.classes_)

print("\nTop 10 Important Genes for Cancer Type Classification (SVM):")
for class_label, importance_df in type_importance_svm.items():
    print(f"\nFor {class_label}:")
    for i, row in importance_df.head(10).iterrows():
        gene_symbol = gene_map.get(row['gene'], row['gene'])
        print(f"{gene_symbol}: {row['importance']:.4f}")

In [None]:
# Neural Network Analysis for Breast Cancer Gene Expression Data
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Convert labels to categorical format for Keras
y_type_train_cat = to_categorical(y_type_train)
y_type_test_cat = to_categorical(y_type_test)
y_grade_train_cat = to_categorical(y_grade_train)
y_grade_test_cat = to_categorical(y_grade_test)

# 1. Define function to create model
def create_model(input_dim, output_dim):
    model = Sequential([
        Dense(256, activation='relu', input_dim=input_dim),
        BatchNormalization(),
        Dropout(0.3),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(0.3),
        Dense(64, activation='relu'),
        BatchNormalization(),
        Dropout(0.3),
        Dense(output_dim, activation='softmax')
    ])

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# 2. Train and evaluate cancer type classifier
print("Training Neural Network for Cancer Type Classification...")
nn_type = create_model(X_train_scaled.shape[1], len(le_type.classes_))

early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

history_type = nn_type.fit(X_train_scaled, y_type_train_cat,
                          validation_split=0.2,
                          epochs=100,
                          batch_size=32,
                          callbacks=[early_stopping],
                          verbose=1)

# Evaluate cancer type model
y_type_pred_nn = nn_type.predict(X_test_scaled)
y_type_pred_classes = np.argmax(y_type_pred_nn, axis=1)

print("\nCancer Type Classification Results (Neural Network):")
print("Accuracy:", accuracy_score(y_type_test, y_type_pred_classes))
print("\nClassification Report:")
print(classification_report(y_type_test, y_type_pred_classes, target_names=le_type.classes_))

# Plot training history for cancer type
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history_type.history['accuracy'])
plt.plot(history_type.history['val_accuracy'])
plt.title('Cancer Type Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.subplot(1, 2, 2)
plt.plot(history_type.history['loss'])
plt.plot(history_type.history['val_loss'])
plt.title('Cancer Type Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()

# 3. Train and evaluate tumor grade classifier
print("\nTraining Neural Network for Tumor Grade Classification...")
nn_grade = create_model(X_train_scaled.shape[1], len(le_grade.classes_))

history_grade = nn_grade.fit(X_train_scaled, y_grade_train_cat,
                            validation_split=0.2,
                            epochs=100,
                            batch_size=32,
                            callbacks=[early_stopping],
                            verbose=1)

# Evaluate tumor grade model
y_grade_pred_nn = nn_grade.predict(X_test_scaled)
y_grade_pred_classes = np.argmax(y_grade_pred_nn, axis=1)

print("\nTumor Grade Classification Results (Neural Network):")
print("Accuracy:", accuracy_score(y_grade_test, y_grade_pred_classes))
print("\nClassification Report:")
print(classification_report(y_grade_test, y_grade_pred_classes, target_names=le_grade.classes_))

# Plot training history for tumor grade
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history_grade.history['accuracy'])
plt.plot(history_grade.history['val_accuracy'])
plt.title('Tumor Grade Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.subplot(1, 2, 2)
plt.plot(history_grade.history['loss'])
plt.plot(history_grade.history['val_loss'])
plt.title('Tumor Grade Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()

# 4. Compare with previous models
def compare_all_models(y_true, y_pred_rf, y_pred_svm, y_pred_nn, model_type):
    print(f"\nComparison for {model_type}:")
    print(f"Random Forest Accuracy: {accuracy_score(y_true, y_pred_rf):.4f}")
    print(f"SVM Accuracy: {accuracy_score(y_true, y_pred_svm):.4f}")
    print(f"Neural Network Accuracy: {accuracy_score(y_true, y_pred_nn):.4f}")

# Compare all models
compare_all_models(y_type_test, y_type_pred, y_type_pred_svm, y_type_pred_classes, "Cancer Type Classification")
compare_all_models(y_grade_test, y_grade_pred, y_grade_pred_svm, y_grade_pred_classes, "Tumor Grade Classification")

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Keras Model
model_type_keras = Sequential([
    Dense(256, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(len(le_type.classes_), activation='softmax')
])

model_type_keras.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

y_type_train_cat = to_categorical(y_type_train)
y_type_test_cat = to_categorical(y_type_test)

history_type_keras = model_type_keras.fit(X_train_scaled, y_type_train_cat, epochs=100, batch_size=32,
                                          validation_split=0.2, verbose=1)

# Random Forest
rf_type = RandomForestClassifier(n_estimators=100, random_state=42)
rf_type.fit(X_train_scaled, y_type_train)

# SVM
svm_type = SVC(kernel='rbf', probability=True, random_state=42)
svm_type.fit(X_train_scaled, y_type_train)

# Evaluate models
print("Keras Model - Cancer Type Prediction:")
y_pred_keras = np.argmax(model_type_keras.predict(X_test_scaled), axis=1)
print(classification_report(y_type_test, y_pred_keras, target_names=le_type.classes_))

print("\nRandom Forest - Cancer Type Prediction:")
y_pred_rf = rf_type.predict(X_test_scaled)
print(classification_report(y_type_test, y_pred_rf, target_names=le_type.classes_))

print("\nSVM - Cancer Type Prediction:")
y_pred_svm = svm_type.predict(X_test_scaled)
print(classification_report(y_type_test, y_pred_svm, target_names=le_type.classes_))

In [None]:
# Tumor Grade Prediction Models

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Keras Model
model_grade_keras = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(len(le_grade.classes_), activation='softmax')
])

model_grade_keras.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

y_grade_train_cat = to_categorical(y_grade_train)
y_grade_test_cat = to_categorical(y_grade_test)

history_grade_keras = model_grade_keras.fit(X_train_scaled, y_grade_train_cat, epochs=100, batch_size=32,
                                            validation_split=0.2, verbose=1)

# Random Forest
rf_grade = RandomForestClassifier(n_estimators=100, random_state=42)
rf_grade.fit(X_train_scaled, y_grade_train)

# SVM
svm_grade = SVC(kernel='rbf', probability=True, random_state=42)
svm_grade.fit(X_train_scaled, y_grade_train)

# Evaluate models
print("Keras Model - Tumor Grade Prediction:")
y_pred_keras = np.argmax(model_grade_keras.predict(X_test_scaled), axis=1)
print(classification_report(y_grade_test, y_pred_keras, target_names=le_grade.classes_))

print("\nRandom Forest - Tumor Grade Prediction:")
y_pred_rf = rf_grade.predict(X_test_scaled)
print(classification_report(y_grade_test, y_pred_rf, target_names=le_grade.classes_))

print("\nSVM - Tumor Grade Prediction:")
y_pred_svm = svm_grade.predict(X_test_scaled)
print(classification_report(y_grade_test, y_pred_svm, target_names=le_grade.classes_))