# Predicting Cancer Types Using Machine Learning Techniques

# # Data Description

## 1. Data Import

In [1]:
# importing necessary libraries and functions
import pandas as pd
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

train = pd.read_csv(r"Y:\IIT KGP Class Materials(Y3)\Sem 5 Study Material\Application of Machine Learning in Biological Systems\Asgn2_21IM30013\data\data_train.csv")
test = pd.read_csv(r"Y:\IIT KGP Class Materials(Y3)\Sem 5 Study Material\Application of Machine Learning in Biological Systems\Asgn2_21IM30013\data\data_test.csv")
actual = pd.read_csv(r"Y:\IIT KGP Class Materials(Y3)\Sem 5 Study Material\Application of Machine Learning in Biological Systems\Asgn2_21IM30013\data\Copy of actual.csv")

## * Data Description 

The dataset comprises gene expression data stored in RES format, a commonly used format for gene pattern data ([more about RES format here](https://www.genepattern.org/file-formats-guide#RES)). The dataset encompasses 7129 distinct gene features, with columns representing various samples.

For each gene, the numerical entries denote its expression levels within a given sample. Additionally, an accompanying `call` column indicates whether the gene is classified as Absent (A), Marginal (M), or Present (P) in that particular sample.

The dataset is divided into two files:

- `train`: Contains data from 38 samples.
- `test`: Contains data from 34 samples.

This totals to 72 samples in the entire dataset.

### Cancer Types

The dataset focuses on two types of cancer:

1. **Acute Myeloid Leukemia (AML):** AML affects myeloid cells, which are responsible for generating certain types of white blood cells.

2. **Acute Lymphocytic Leukemia (ALL):** ALL is a form of cancer that impacts lymphocytes, a crucial type of white blood cell involved in the immune response. ([source](https://www.healthline.com/health/leukemia/aml-vs-all))

### Patient Information

The `actual` file provides information about individual patients, including their unique identifiers and the specific type of cancer they have been diagnosed with (AML or ALL).


In [2]:
print(train.shape, test.shape)

(7129, 78) (7129, 70)


In [3]:
train.head()

Unnamed: 0,Gene Description,Gene Accession Number,1,call,2,call.1,3,call.2,4,call.3,...,29,call.33,30,call.34,31,call.35,32,call.36,33,call.37
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-214,A,-139,A,-76,A,-135,A,...,15,A,-318,A,-32,A,-124,A,-135,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-153,A,-73,A,-49,A,-114,A,...,-114,A,-192,A,-49,A,-79,A,-186,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,-58,A,-1,A,-307,A,265,A,...,2,A,-95,A,49,A,-37,A,-70,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,88,A,283,A,309,A,12,A,...,193,A,312,A,230,P,330,A,337,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-295,A,-264,A,-376,A,-419,A,...,-51,A,-139,A,-367,A,-188,A,-407,A


In [4]:
actual.head()

Unnamed: 0,patient,cancer
0,1,ALL
1,2,ALL
2,3,ALL
3,4,ALL
4,5,ALL


In [5]:
actual.cancer.unique()

array(['ALL', 'AML'], dtype=object)

## 2. Data Manipulation

In [6]:
# Select columns for training and testing
train_columns = [str(i) for i in range(1, 39)]
test_columns = [str(i) for i in range(39, 73)]

# Transpose the dataframes to have rows as samples
train = train[train_columns].transpose()
test = test[test_columns].transpose()

# Add the target value column from 'actual' file
train['target'] = list(actual.cancer.iloc[:38])
test['target'] = list(actual.cancer.iloc[38:])

In [7]:
# Define train and test sets
X_train = train.iloc[:, :-1]
y_train = train['target']
X_test = test.iloc[:, :-1]
y_test = test['target']

## 3. Train the model

### SVM Model (with different kernels)

In [8]:
# Define a list of kernel functions
kernels = ['rbf', 'linear', 'poly', 'sigmoid']

for kernel in kernels:
    # Create and train the SVM model
    svc_model = SVC(kernel=kernel, C=10)
    svc_model.fit(X_train, y_train)
    
    # Make predictions and calculate accuracy
    y_pred = svc_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Print the accuracy for each kernel
    print(f'Accuracy ({kernel} kernel): {accuracy:.2f}')

Accuracy (rbf kernel): 0.97
Accuracy (linear kernel): 0.97
Accuracy (poly kernel): 0.97
Accuracy (sigmoid kernel): 0.91


### Random Forest Model

In [9]:
randfmodel = RandomForestClassifier(random_state=42)
randfmodel.fit(X_train, y_train)

print('Accuracy: ', randfmodel.score(X_test, y_test))

Accuracy:  0.8529411764705882


### Neural Network Classification Model

In [10]:
# Define the parameter grid for grid search
param_grid = {
    'activation': ['relu', 'tanh'],
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate': ['constant', 'invscaling', 'adaptive']
}

# Initialize the MLPClassifier
nnmodel = MLPClassifier(random_state=42, early_stopping=True)

# Initialize GridSearchCV
grid_search = GridSearchCV(nnmodel, param_grid, cv=3, n_jobs=-1, scoring='accuracy')

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters and the best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Evaluate the model on the test set
accuracy = best_estimator.score(X_test, y_test)

print(f"Best Parameters: {best_params}")
print(f"Test Accuracy: {accuracy:.2f}")

KeyboardInterrupt: 

## 4. Result Storage

In [None]:
# # Uncomment the lines to store files from the predictions of the above-mentioned models on the test data
# # The outputs generated can be found in 'output' folder [Note that the following code doesn't store it in that folder]
# pd.DataFrame({'target':y_test,'prediction':svcrbf.predict(X_test)}).to_csv('svm_rcf_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':svclin.predict(X_test)}).to_csv('svm_lin_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':svcpoly.predict(X_test)}).to_csv('svm_poly_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':svcsig.predict(X_test)}).to_csv('svm_sig_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':rfmodel.predict(X_test)}).to_csv('random_forest_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':nnmodel.predict(X_test)}).to_csv('neural_networks_preds.csv')

## SVM's role in cancer type prediction

We used Support Vector Machines (SVMs) to forecast cancer kinds based on RES format gene data in our study. SVMs proved essential in determining the best decision boundary, efficiently differentiating distinct cancer kinds using genetic information.

The adaptability of SVMs was proved by the usage of different kernels:

- **Linear Kernel**: Effective for linearly separable data.
- **RBF Kernel**: Excelled in capturing non-linear relationships in genetic data.
- **Polynomial Kernel**: Valuable for representing polynomial decision boundaries.

Each kernel was carefully examined. The linear, RBF, and polynomial kernels obtained an amazing 97.06% accuracy. This demonstrates the robustness of SVMs in discriminating between AML and ALL instances based on the gene data presented.

Notably, while the sigmoid kernel continues to perform well,achieved a slightly lower accuracy of approximately 91.18%. This indicates that, for our specific dataset and problem, the sigmoid kernel might not be as well-suited as the other kernels. This emphasizes the critical importance of selecting the most appropriate kernel tailored to the data's characteristics.

### Handling High-Dimensional Data

SVMs excel in scenarios with numerous features, as seen in our study with 7129 features. Here's how SVMs handled this high-dimensional data:

1. **Feature Selection**: SVMs find optimal decision boundaries in high-dimensional spaces, crucial for distinguishing cancer types based on genetic data.

2. **Margin Maximization**: They maximize the margin, enhancing generalization to new data points.

3. **Kernel Trick for Complexity**: Various kernels handle non-linearities in genetic data effectively.

4. **Identifying Important Features**: SVMs pinpoint crucial genes for accurate predictions.

5. **Effective Generalization**: Despite high dimensionality, SVMs generalize well to new, unseen data.

# Neural Network Regression Analysis

We employed neural network regression to predict cancer kinds using gene data in our study. This strategy is critical for the following reasons:

- **Complex Relationship Modeling**: In high-dimensional genetic data, neural networks excel in capturing subtle, non-linear relationships. This reveals subtle patterns that are difficult to detect using typical methods.

- **Feature Extraction and Abstraction**: Neural networks extract significant characteristics automatically, with the potential to find critical genetic markers for cancer classification.

- **Adaptability to High-Dimensional Data**: Neural networks handle large feature sets, as demonstrated with our 7129-feature gene dataset.

In order to optimize the neural network parameters, we also used a grid search:

- **Grid Search Process**:

  This approach iteratively investigates a given hyperparameter grid in order to find the best-performing combination. Hidden layers and neurons per layer are important neural network hyperparameters, activation functions, and regularization terms.
  Grid search fine-tunes the model, preventing overfitting and underfitting. Our analysis resulted in an approximately 94.12% accuracy, indicating effective hyperparameter selection for accurate cancer type predictions using the provided gene dataset.

# Comparison between different models

Based on gene data, we tested three models for cancer type prediction: neural networks, support vector machines (SVM), and random forest.

- **Neural Network Accuracy**: ~94.12%
- **SVM Accuracy**: ~97.06%
- **Random Forest Accuracy**: ~85.29% (no hyperparameter tuning)

## Model Strengths:

- **SVM**:
  -Works well in three-dimensional spaces.
  - Supports both linear and non-linear relationships.
  - Resistant to overfitting.
- **Neural Networks**:
  - Excellent at capturing complex relationships. 
  - Extracts key features automatically.

- **Random Forest**:
  - Effectively handles high-dimensional data.
  - Provides feature significance.

## Model Weaknesses:

- **SVM**:
  - Can be computationally costly.
  - Vulnerable to kernel and hyperparameter changes.

- **Neural Networks**:
  - Susceptible to overfitting due to insufficient data.
  - Extensive computational requirements.

- **Random Forest**:
 - When there are many trees, it can be computationally expensive. 
 - Capturing intricate relationships is limited.

## Model Selection:

Given our situation, the SVM model with an accuracy of 97.06% is the best choice. It performs effectively in three-dimensional spaces and generalizes nicely. While neural networks worked admirably, they may be computationally demanding. Random Forest could be improved with hyperparameter adjustment.

# Discussion

## Broader Implications of Accurate Cancer Type Prediction

Accurate cancer type prediction is crucial in clinical practice and medical research. It can:

- **Guide Treatment Strategies**: Precise cancer type influences treatment decisions, allowing for more customized therapy and improved patient outcomes.

- **Facilitate Early Detection**: Early cancer type identification can lead to more effective and less intrusive therapies.

- **Enable Personalized Medicine**: Understanding a tumor's individual genetic traits enables focused therapy with fewer adverse effects.

- **Advance Research and Drug Development**: Classification accuracy aids in the identification of prospective pharmacological targets and the development of new medicines.

## Real-World Applications

The models' performance has some practical applications in real world:

- **Clinical Settings**: Doctors can utilize these models to help with their diagnostic procedure, adding a layer of assurance to cancer typing.

- **Research Institutes**:Scientists can use these models for genetic investigations, which will speed up oncology research.

- **Drug Development**:Accurate cancer typing can help pharmaceutical companies streamline clinical trials and generate more effective treatments.

# Conclusion:

## Summary of work

Following careful consideration, the Support Vector Machine (SVM) was determined to be the best-performing model, with an accuracy of roughly 97.06%. Its accuracy in high-dimensional spaces and robust generalization make it the best choice for cancer type prediction.

## Importance of selecting a proper model and tuning:

This study emphasizes the need of careful model selection and parameter adjustment in machine learning. The effectiveness of the SVM emphasizes the significance of matching the model to the properties of the data. Furthermore, parameter optimization ensures that the model is fine-tuned for maximum performance.

These concerns are critical in the context of cancer prediction for producing clinically relevant results and advancing cancer research and treatment.
