# Week 7 - Applying Machine Learning

This week, we will focus on a case study that applies machine learning (ML) to a genetics dataset. The objective of this module is for you to use an ML pipeline to identify and classify different types of cancer using patient data.




## ML Pipeline

As a reminder, a machine learning pipeline typically involves four steps:

1. Data Preparation: In this step, we obtain the relevant data needed for the task we intend to perform.
2. Data Exploration: We then analyze the data at hand to manually identify potentially interesting patterns.
3. Model Training: After exploring the data and manually identifying potential trends and patterns, we train (or fit) a machine learning model. Ideally, the ML model will pick up patterns we may have missed and outperform manually discovered rules.
4. Model Evaluation: Finally, we use various metrics to assess how well the model performs at our task and confirm if it has identified useful trends.

![fcall](fcall.png)

## Gene Expression Data for Acute Leukemia Patients
In this module, we are going to analyze gene expression data for acute leukemia patients. Leukemia is a form of cancer that impacts the blood and bone marrow, interfering with the normal production and function of healthy blood cells. Within the bone marrow, blood stem cells, known as hematopoietic stem cells, generate different types of blood cells. Leukemia arises when these stem cells produce abnormal white blood cells, crowding out healthy cells and disrupting their functions. As a result, individuals with leukemia may become more prone to infections, experience anemia, or bleed easily.

Due to the severity of leukemia and the need for personalized treatment, accurate diagnosis of the specific type of leukemia is crucial. To facilitate this, [Golub et al. (1999)](https://pubmed.ncbi.nlm.nih.gov/10521349/) collected gene expression data from patients with acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). Their objective was to identify key genes that can be used for classification and to provide insights for designing targeted treatments. The dataset includes gene expression levels for over 7000 genes from patients with AML and ALL. Your task is to apply an ML pipeline to this dataset and develop a classifier to distinguish between AML and ALL.






## Data Preparation

![fc1](fc1.png)

As with our previous datasets, the data has already been prepared for us. Detailed preparation steps can be found at [this link](https://www.kaggle.com/code/selinyang/gene-expression-data-for-acute-leukemia-patients).




---

##### **Q1: Read in `gene_expression_data.csv`. Assign the `cancer` column to `y` and the rest of the columns to `X`.**

In [None]:
import pandas as pd
import numpy as np

### YOUR CODE HERE
# load the dataset with pandas
data = ...

# split into features (X) and target (y)
X = ...
y = ...


Each column in `X` holds the expression level of one gene, and each row is a patient. `y` contains the type of cancer the patient has: `ALL` or `AML`.
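
As a rough illustration of the layout described above, the sketch below builds a tiny made-up table with the same shape: one row per patient, one column per gene, plus a `cancer` label column. The gene names (`gene_a`, `gene_b`) and all values are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy table: 4 patients, 2 made-up genes, and the label column
toy = pd.DataFrame({
    "gene_a": [120, 95, 400, 380],
    "gene_b": [15, 22, 18, 30],
    "cancer": ["ALL", "ALL", "AML", "AML"],
})

# Features are every column except the label; the label is its own Series
toy_X = toy.drop(columns="cancer")
toy_y = toy["cancer"]

print(toy_X.shape)           # (number of patients, number of genes)
print(toy_y.value_counts())  # number of patients per cancer type
```

Checking the shape and the label counts right after loading is a cheap way to confirm that the split into features and target did what you expected.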


---

##### **Q2: Binarize `y` by assigning `ALL` to 1 and `AML` to 0.**
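
The cell for this question hints at `replace`. As one possible alternative, the sketch below uses `Series.map` with a dictionary on a made-up label Series; the variable names are hypothetical and the labels are invented.

```python
import pandas as pd

toy_y = pd.Series(["ALL", "AML", "AML", "ALL"])  # made-up labels

# Map each string label to an integer code: ALL -> 1, AML -> 0
label_codes = {"ALL": 1, "AML": 0}
toy_y_binary = toy_y.map(label_codes)

print(toy_y_binary.tolist())
# Any label missing from the dictionary would become NaN, so count them
print(toy_y_binary.isna().sum())
```

Because `map` turns unmapped values into `NaN`, counting missing values afterwards doubles as a check for unexpected labels.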

In [None]:
### YOUR CODE HERE

# hint use df.replace()

## Data Exploration

![fc2](fc2.png)


Before we dive into training an ML model, we will first manually explore the data to see if we can identify ways to distinguish between the two classes.




---
##### **Q3: How many genes (features) are in the dataset?**

> Hint: Look at week 3 Question 5 on how to subset dataframes.

In [None]:
### YOUR CODE HERE

# hint: use df.shape (an attribute, so no parentheses)


---
##### **Q4: Subset the data into two different dataframes, one with all AML cases and one with ALL cases. What percentage of patients have AML?**

> Hint: Look at week 3 Question 5 on how to subset dataframes.
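
A minimal sketch of boolean-mask subsetting, assuming the labels have already been binarized (1 = ALL, 0 = AML). The data and the `toy_` names are made up for illustration.

```python
import pandas as pd

toy_X = pd.DataFrame({"gene_a": [120, 95, 400, 380, 410]})
toy_y = pd.Series([1, 1, 0, 0, 0])  # 1 = ALL, 0 = AML

# A boolean mask is True for the rows we want to keep
aml_mask = toy_y == 0
toy_X_aml = toy_X[aml_mask]    # rows where the patient has AML
toy_X_all = toy_X[~aml_mask]   # ~ flips the mask: rows with ALL

print(len(toy_X_aml), len(toy_X_all))
```

The mask and the dataframe must share the same index for the rows to line up, which holds here because both were built in the same order.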

In [None]:
### YOUR CODE HERE
# Subset the data
X_aml = ...
X_all = ...

# Calculate the percentage of patients with AML
percentage_aml = X_aml.shape[0] / X.shape[0] * 100
print(percentage_aml)


---
##### **Q5: Without using ML, visualize and analyse the data and try to construct 1-2 manual rules that may distinguish between the two types of cancer. What is the accuracy of your rules?**
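
To make the idea of a "manual rule" concrete, here is a sketch on invented numbers: a single hypothetical gene, a hand-picked threshold, and the resulting accuracy. Neither the gene nor the threshold comes from the real dataset.

```python
import numpy as np

# Made-up expression values for one hypothetical gene, with true labels
expression = np.array([120, 95, 400, 380, 150, 410])
y_true = np.array([1, 1, 0, 0, 0, 0])   # 1 = ALL, 0 = AML

# Hypothetical rule: predict ALL (1) when expression is below the threshold
threshold = 200
y_rule = (expression < threshold).astype(int)

# Accuracy is the fraction of patients where the rule matches the true label
rule_accuracy = np.mean(y_rule == y_true)
print(rule_accuracy)
```

Here the rule gets five of six patients right; the patient with expression 150 is an AML case that the rule mislabels as ALL.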



In [None]:
### YOUR CODE HERE

# subset the dataframe for your rule 
y_rule_aml = ...
y_rule_all = ...

# calculate the following to help you derive accuracy
true_positives = np.sum(y_rule_aml == 1) # as an example
false_positives = ...
true_negatives = ...
false_negatives = ...

accuracy = (true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives)
print(accuracy)

## Model Training

![fc3](fc3.png)

Now, let's train an ML model to perform this classification task.


---
##### **Q6: Split the data into a training set (`X_train` and `y_train`) and testing set (`X_test` and `y_test`).**
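
The sketch below shows `train_test_split` on a small made-up array. The `test_size`, `stratify`, and `random_state` values are assumptions chosen for illustration; pick values that suit the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(16).reshape(8, 2)         # 8 made-up samples, 2 features
toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # balanced made-up labels

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.25,    # assumed: hold out a quarter of the samples
    stratify=toy_y,    # keep the class balance the same in both sets
    random_state=42,   # fixed seed so the split is reproducible
)
print(X_tr.shape, X_te.shape)
```

Stratifying can matter when a dataset is small or imbalanced, because a purely random split may leave one class under-represented in the test set.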


In [None]:
## YOUR CODE HERE

# import the correct function
from sklearn.model_selection import train_test_split

# split the data
# hint use train_test_split() and set all the necessary parameters

X_train, X_test, y_train, y_test = ...

---


In the previous modules, you learned about 4 different models. In the sections below, you will tune and train two different models. The first one will be a Decision Tree, and the second one will be your choice.



---
##### **Q7: Fill in the function below to test a hyperparameter configuration for a Decision Tree. This function should accept the training set as well as at least 3 hyperparameters (but feel free to add more) and perform K-Fold Cross-Validation.**
> HINT: Look at Week 6
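
Before filling in the function, it may help to see what `KFold.split` actually yields. This sketch uses ten made-up row positions in place of the real training set and prints the index arrays for each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_samples = np.arange(10)   # stand-in for 10 training rows
kf = KFold(n_splits=5, shuffle=True, random_state=42)

seen_in_validation = []
for fold_number, (train_idx, val_idx) in enumerate(kf.split(toy_samples)):
    # Each fold gives two arrays of row positions: training and validation
    print(f"fold {fold_number}: train={train_idx}, val={val_idx}")
    seen_in_validation.extend(val_idx.tolist())
```

Every row lands in the validation set exactly once across the five folds, which is why the indices are used with `.iloc` (position-based selection) in the function below.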

In [None]:
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
def test_hyperparameter_DT(X_train, y_train,
                                max_depth,
                                min_samples_split,
                                min_impurity_decrease,
                                ):
  accuracies = []

  # create folds
  kf = KFold(n_splits=5, random_state=42, shuffle=True)
  fold_indices = kf.split(X_train)
  # loop through the folds
  for fold_index in fold_indices:
    # Get what samples should be in each split
    train_fold_indices, val_fold_indices = fold_index
    # select the samples
    X_train_fold = X_train.iloc[train_fold_indices]
    y_train_fold = y_train.iloc[train_fold_indices]
    X_val_fold = X_train.iloc[val_fold_indices]
    y_val_fold = y_train.iloc[val_fold_indices]

    # initialize a new model
    dt = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_impurity_decrease=min_impurity_decrease
    )

    ## YOUR CODE HERE
    # Fit the model on the train fold
    
    
    # Make predictions on the validation fold
    val_predictions  = ...

    # calculate the accuracy of the val predictions
    fold_accuracy = ...

    accuracies.append(fold_accuracy)
  return np.mean(accuracies) # return the average accuracy across all folds


---
##### **Q8: Using the above function, perform a Grid Search for the best set of Hyperparameters for the Decision Tree. What is the best CV accuracy achieved?**
> HINT: Look at Week 6
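
A grid search is a loop over every combination of candidate values that keeps track of the best score. The sketch below shows that bookkeeping with `itertools.product`; `toy_cv_score` is a hypothetical stand-in for a real cross-validation function, and the candidate values are arbitrary.

```python
from itertools import product

def toy_cv_score(max_depth, min_samples_split):
    """Hypothetical stand-in for a CV function; peaks at depth 3, split 4."""
    return 1.0 - 0.05 * abs(max_depth - 3) - 0.01 * abs(min_samples_split - 4)

max_depth_options = [1, 3, 5]
min_samples_split_options = [2, 4, 8]

best_score, best_params = -1.0, None
# product() yields every (max_depth, min_samples_split) combination
for max_depth, min_samples_split in product(max_depth_options,
                                            min_samples_split_options):
    score = toy_cv_score(max_depth, min_samples_split)
    if score > best_score:
        best_score, best_params = score, (max_depth, min_samples_split)

print(best_params, best_score)
```

Storing the best configuration as you go saves you from scanning the printed output by eye once the grid gets large.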

In [None]:
# Loop through the HP configurations, 
# calling the above function for each config.

## YOUR CODE HERE
# set a list of max depths
max_depth_list = ...
# set a list of min samples split
min_samples_split_list = ...
# set a list of impurity
impurity_list = ... 

# perform the grid search
for max_depth in max_depth_list:
  for min_samples_split in min_samples_split_list:
    for impurity in impurity_list:
      cv_acc = test_hyperparameter_DT(X_train, y_train,
                                   max_depth = max_depth,
                                   min_samples_split = min_samples_split,
                                   min_impurity_decrease = impurity)
      print(f"max_depth: {max_depth}, min_samples_split: {min_samples_split}, min_impurity_decrease: {impurity}, cv_acc: {cv_acc:.3f}")


---
##### **Q9: Using the best hyperparameters found, train a new Decision Tree on all the training data.**


In [None]:
## YOUR CODE HERE

# create a new model
dt = DecisionTreeClassifier()

# train the model using the training data


## Model Evaluation

![fc4](fc4.png)

Now that we have trained our model, let's see how well it performs.


---
##### **Q10: Make predictions on the training and test sets for the Decision Tree.**


In [None]:
# make prediction for the training data
train_pred = ...

# make predictions for the testing data
test_pred = ...


---
##### **Q11: What are the train and test accuracy, precision, and recall? Do you think the model is underfitting, overfitting, or neither and why?**
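
As a small worked example of the three metrics, the sketch below uses eight made-up labels and predictions (1 = ALL is treated as the positive class) and computes each metric from the confusion counts.

```python
import numpy as np

# Made-up true labels and predictions for eight patients
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted 1, truly 1
tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted 0, truly 0
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted 1, truly 0
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted 0, truly 1

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the true positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tp, tn, fp, fn)
print(precision, recall, accuracy)
```

When comparing train and test results, a large gap (high train scores, much lower test scores) is a common sign of overfitting, while low scores on both sets suggest underfitting.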


In [None]:

# calculate true positives, true negatives, false positives, false negatives for train

true_positives_train = np.sum((train_pred == 1) & (y_train == 1))
true_negatives_train = np.sum((train_pred == 0) & (y_train == 0))
false_positives_train = np.sum((train_pred == 1) & (y_train == 0))
false_negatives_train = np.sum((train_pred == 0) & (y_train == 1))

# Calculate precision, recall, and accuracy

train_precision = true_positives_train / (true_positives_train + false_positives_train)
train_recall = true_positives_train / (true_positives_train + false_negatives_train)

# accuracy computed by hand from the confusion counts
train_accuracy = (true_positives_train + true_negatives_train) / (true_positives_train + true_negatives_train + false_positives_train + false_negatives_train)
# the same accuracy using scikit-learn, as a cross-check
train_accuracy = accuracy_score(y_train, train_pred)

## YOUR CODE HERE

# repeat for test
# calculate true positives, true negatives, false positives, false negatives for test
true_positives_test = ...
true_negatives_test = ...
false_positives_test = ...
false_negatives_test = ...

# Calculate precision, recall, and accuracy
test_precision = ...
test_recall = ...

# accuracy by hand from the confusion counts
test_accuracy = ...
# accuracy with accuracy_score, as a cross-check
test_accuracy = ...


---
##### **Q12: Visualize the best decision tree. What feature is in the root node? How many leaves does your tree have?**

> HINT: use the `plot_tree` function from `sklearn.tree`
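
Alongside the plot, a fitted tree can also be inspected programmatically. The sketch below fits a tree on made-up data with two hypothetical genes, then reads the root feature and leaf count and prints a text view with `export_text`. This is one possible cross-check for what you read off the figure, not a substitute for plotting.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: only gene_b separates the two classes with one threshold
toy_X = pd.DataFrame({
    "gene_a": [1, 6, 3, 4, 2, 5],
    "gene_b": [10, 12, 11, 50, 52, 51],
})
toy_y = [1, 1, 1, 0, 0, 0]

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Node 0 is the root; tree_.feature holds the column index used at each node
root_feature = toy_X.columns[toy_tree.tree_.feature[0]]
n_leaves = toy_tree.get_n_leaves()

print(export_text(toy_tree, feature_names=list(toy_X.columns)))
print(root_feature, n_leaves)
```

On this toy data the tree needs a single split on `gene_b`, so it has two leaves.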

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

## YOUR CODE HERE
plt.figure(figsize=(20,10))
plot_tree() # fill in the required parameters
plt.show()

##### **Q13: According to the decision tree, what genes and what level of expression determine if someone will have AML?**
> HINT: Look at the paths to leaves that predict AML

Your answer here

## Graded Questions

---
##### **GQ1: Repeat Questions 7-11 for a new model (e.g. LR, XGBoost, SVM, or any model you choose).**
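
If you pick a model other than a tree, the hyperparameters change but the cross-validation idea stays the same. As a hedged sketch, the code below runs scikit-learn's `cross_val_score` for a `LogisticRegression` over a few values of `C` on made-up data. The data, the `C` grid, and `max_iter` are assumptions for illustration; this can serve as a cross-check on a hand-written K-Fold loop rather than a replacement for it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Made-up data: 40 samples, 5 features; the label depends on the first feature
toy_X = rng.normal(size=(40, 5))
toy_y = (toy_X[:, 0] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    # One accuracy per fold; the mean is the CV accuracy for this value of C
    scores = cross_val_score(model, toy_X, toy_y, cv=kf, scoring="accuracy")
    print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")
```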


In [None]:
# installs and imports for XGBoost, SVM, and LR
!pip install xgboost
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

###### GQ.7: Fill in the function below to test a hyperparameter configuration for your model. This function should accept the training set as well as at least 3 hyperparameters (but feel free to add more) and perform K-Fold Cross-Validation (3pt).


In [None]:
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
def test_hyperparameter_YOUR_MODEL(X_train, y_train,
                                max_depth,
                                learning_rate,
                                n_estimators,
                                ):
  accuracies = []

  # create folds
  kf = KFold(n_splits=5, random_state=42, shuffle=True)
  fold_indices = kf.split(X_train)
  # loop through the folds
  for fold_index in fold_indices : ## YOUR CODE HERE
    # Get what samples should be in each split
    train_fold_indices, val_fold_indices = fold_index
    # select the samples
    X_train_fold = X_train.iloc[train_fold_indices]
    y_train_fold = y_train.iloc[train_fold_indices]
    X_val_fold = X_train.iloc[val_fold_indices]
    y_val_fold = y_train.iloc[val_fold_indices]

    # initialize a new model
    model = XGBClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
    )
    
    ## YOUR CODE HERE

    # Fit the model on the train fold
    

    # Make predictions on the validation fold
    val_predictions  = ...

    # calculate the accuracy of the val predictions
    fold_accuracy = ...

    accuracies.append(fold_accuracy)
  return np.mean(accuracies) # return the average accuracy across all folds


###### GQ.8: Using the above function, perform a Grid Search for the best set of Hyperparameters for your model (3pt). What is the best CV accuracy achieved (2pt)?
> HINT: Look at Week 6


In [None]:
## YOUR CODE HERE
# set a list of max depths
max_depth_list = ...
# set a list of numbers of estimators
n_estimators_list = ...
# set a list of learning rates
learning_rate_list = ...

for max_depth in max_depth_list:
  for n_estimators in n_estimators_list:
    for learning_rate in learning_rate_list:
      cv_acc = test_hyperparameter_YOUR_MODEL(X_train, y_train,
                                   max_depth = max_depth,
                                   n_estimators = n_estimators,
                                   learning_rate = learning_rate)
      print(f"max_depth: {max_depth}, n_estimators: {n_estimators}, learning_rate: {learning_rate}, cv_acc: {cv_acc:.3f}")

###### GQ.9: Using the best hyperparameters found, train a new model on all the training data (1pt).

In [None]:
## YOUR CODE HERE

# create a new model
model = XGBClassifier()

# train the model using the training data


###### GQ.10: Make predictions on the training and test sets for your model (2pt). 


In [None]:
# make prediction for the training data
train_pred = ...

# make predictions for the testing data
test_pred = ...

###### GQ.11: What are the train accuracy, precision, and recall (3pt)? Do you think the model is underfitting, overfitting, or neither and why (2pt)?

*Your Answer Here*

---

##### **GQ2: Comparing the Decision Tree built in the first half of this module with the model of your choice built in the second half, which model performed better? (1 mark) Justify your reasoning using model evaluation metrics (1 mark) and suggest why there is this difference (1 mark).**

*Your Answer Here*

---

## Conclusion

In this module, we have gone through the entire machine learning pipeline to classify the type of cancer a patient has. We began by exploring the data and manually identifying patterns, followed by training a Decision Tree model for classification. Upon evaluating the model, we found it was overfitting. Subsequently, we trained and evaluated a new model. Finally, we compared the two models to determine which one performed better.
While this module focused on a specific dataset, the ML pipeline can be applied to a wide variety of datasets and problems. The key steps are understanding the data, training a model, and evaluating its performance. By following this process, you can make data-driven predictions in many other settings.
