# Week 7 - Applying Machine Learning

This week, we will work through a case study that applies ML to a genetics dataset. The goal of this module is to take you through the ML pipeline to classify types of cancer from patient data.




## ML Pipeline

As a reminder, a machine learning pipeline typically involves four steps:

1. Data Preparation: In this step, we obtain the relevant data for the task we are trying to perform.
2. Data Exploration: We then analyze the data at hand to manually find potentially interesting patterns.
3. Model Training: Once we have explored our data and manually identified potential trends and patterns, we can train (aka fit) a machine learning model. Ideally, the ML model will pick up patterns we have missed and will be able to outperform rules we discover.
4. Model Evaluation: To confirm if the model picked up useful trends, we will use a variety of metrics to evaluate how well the model does at our task.

![fcall](fcall.png)
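
To make the four steps concrete, here is a minimal sketch of the whole pipeline on synthetic data. It is only an illustration: the data comes from scikit-learn's `make_classification` rather than the leukemia dataset, the k-nearest-neighbours classifier is an arbitrary stand-in model, and all variable names (`X_demo`, `demo_model`, and so on) are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Data preparation: synthetic stand-in data (NOT the leukemia dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# 2. Data exploration: check the shape and the class balance
n_samples, n_features = X_demo.shape
positive_fraction = y_demo.mean()
print(f"{n_samples} samples, {n_features} features, "
      f"{positive_fraction:.0%} in class 1")

# 3. Model training: hold out a test set, then fit on the remainder
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
demo_model = KNeighborsClassifier(n_neighbors=3)  # arbitrary stand-in model
demo_model.fit(X_tr, y_tr)

# 4. Model evaluation: score the model on data it has not seen
demo_accuracy = accuracy_score(y_te, demo_model.predict(X_te))
print(f"Test accuracy: {demo_accuracy:.2f}")
```

The rest of this module walks through the same four steps in more detail on the real data.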

## Gene Expression Data for Acute Leukemia Patients
In this module, we are going to analyze gene expression data for acute leukemia patients. Leukemia is a form of cancer that affects the blood and bone marrow, interfering with the normal production and function of healthy blood cells. Within the bone marrow, blood stem cells, known as hematopoietic stem cells, generate different types of blood cells. Leukemia arises when these stem cells produce abnormal white blood cells, which crowd out healthy cells and disrupt their functions. As a result, individuals with leukemia may become more prone to infections, experience anemia, or bleed easily.

Given its severity, it is valuable to be able to identify the type of leukemia a patient has. To do this, [Golub et al. (1999)](https://pubmed.ncbi.nlm.nih.gov/10521349/) collected gene expression data from patients and aimed to use it to identify whether a patient had acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). Our goal is to use this data to create a classifier.






## Data Preparation

![fc1](fc1.png)

Like in our previous datasets, the data has already been prepared for us. Detailed preparation steps can be found at [this link](https://www.kaggle.com/code/selinyang/gene-expression-data-for-acute-leukemia-patients).




---

##### **Q1: Read in `gene_expression_data.csv`. Assign the `cancer` column to `y` and the rest of the columns to `X`.**

In [None]:
import pandas as pd
import numpy as np

### YOUR CODE HERE


Each column in `X` corresponds to the expression level of a particular gene, and each row is a patient. `y` contains the type of cancer the patient has: `ALL` or `AML`.
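
As an illustration of this layout, the sketch below builds a tiny toy table and separates the label column from the feature columns, then recodes the string labels as integers. The gene column names and all values are invented for the example; only the `cancer` column name and the `ALL`/`AML` labels come from the dataset described above.

```python
import pandas as pd

# Hypothetical toy table: two made-up gene columns plus the label column
toy = pd.DataFrame({
    "gene_a": [120, 35, 98, 40],
    "gene_b": [7, 310, 15, 290],
    "cancer": ["ALL", "AML", "ALL", "AML"],
})

# Labels are one column; features are everything else
toy_y = toy["cancer"]
toy_X = toy.drop(columns="cancer")

# Recode the string labels as integers with an explicit mapping
toy_y_binary = toy_y.map({"ALL": 1, "AML": 0})

print(toy_X.shape)            # (4, 2): 4 patients, 2 genes
print(toy_y_binary.tolist())  # [1, 0, 1, 0]
```

An explicit dictionary keeps the encoding visible, so it is always clear which class is the positive one when computing precision and recall later.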


---

##### **Q2: Binarize `y` by assigning `ALL` to 1 and `AML` to 0.**

In [None]:
### YOUR CODE HERE


## Data Exploration

![fc2](fc2.png)


Before we dive into training an ML model, we will first manually explore the data to see if we can identify ways to distinguish between the two classes.
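
One possible way to approach this kind of manual exploration is sketched below on invented toy data: subset the patients by class with boolean masks, compare per-class averages, and score a simple threshold rule. The gene names, values, and the threshold of 80 are all hypothetical and chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical toy data: 6 patients, 2 made-up genes
toy_X = pd.DataFrame({
    "gene_a": [120, 35, 98, 40, 110, 60],
    "gene_b": [7, 310, 15, 290, 20, 25],
})
toy_y = pd.Series([1, 0, 1, 0, 1, 1])  # 1 = ALL, 0 = AML

# Boolean masks select the rows belonging to each class
all_cases = toy_X[toy_y == 1]
aml_cases = toy_X[toy_y == 0]
aml_percentage = 100 * len(aml_cases) / len(toy_X)

# Per-class means hint at which genes separate the classes
class_means = toy_X.groupby(toy_y).mean()
print(class_means)

# A manual rule: predict ALL (1) when gene_a is above a chosen threshold
rule_predictions = (toy_X["gene_a"] > 80).astype(int)
rule_accuracy = (rule_predictions == toy_y).mean()

print(f"AML: {aml_percentage:.1f}% of patients")
print(f"Rule accuracy: {rule_accuracy:.2f}")
```

On this toy data the rule misclassifies one of the six patients, which is typical of hand-written rules: they capture the main trend but miss exceptions.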




---
##### **Q3: How many genes (features) are in the dataset?**

> Hint: Look at week 3 Question 5 on how to subset dataframes.

In [None]:
### YOUR CODE HERE


---
##### **Q4: Subset the data into two different dataframes, one with all AML cases and one with ALL cases. What percentage of patients have AML?**

> Hint: Look at week 3 Question 5 on how to subset dataframes.

In [None]:
### YOUR CODE HERE
# Subset the data

# Calculate the percentage of patients with AML


---
##### **Q5: Without using ML, visualize and analyse the data and try to construct 1-2 manual rules that may distinguish between the two types of cancer. What is the accuracy of your rules?**



In [None]:
### YOUR CODE HERE

## Model Training

![fc3](fc3.png)

Now, let's train an ML model to perform this classification task.


---
##### **Q6: Split the data into a training set (`X_train` and `y_train`) and testing set (`X_test` and `y_test`).**


In [None]:
## YOUR CODE HERE

# import the correct function

# split the data

---


In the previous modules, you learned about 4 different models. In the sections below, you will tune and train two different models. The first one will be a Decision Tree, and the second one will be your choice.
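
Tuning here relies on K-fold cross-validation. As a refresher, the sketch below shows the general shape of a K-fold loop on synthetic data. It deliberately uses a k-nearest-neighbours classifier with a single hyperparameter as a stand-in, so it is not a solution to the questions that follow; the function name `cv_accuracy_knn` and the data are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier


def cv_accuracy_knn(X_arr, y_arr, n_neighbors, n_splits=5):
    """Mean validation accuracy of a k-NN stand-in model across K folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, val_idx in kf.split(X_arr):
        # A fresh model per fold, so no information leaks between folds
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X_arr[train_idx], y_arr[train_idx])
        val_pred = model.predict(X_arr[val_idx])
        fold_scores.append(np.mean(val_pred == y_arr[val_idx]))
    return float(np.mean(fold_scores))


# Synthetic stand-in data (NumPy arrays, not the leukemia dataset)
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)
demo_score = cv_accuracy_knn(X_demo, y_demo, n_neighbors=3)
print(f"Mean CV accuracy: {demo_score:.2f}")
```

Note that the sketch indexes NumPy arrays directly with the fold indices. With a pandas DataFrame or Series, positional selection would use `.iloc[train_idx]` instead.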



---
##### **Q7: Fill in the function below to test a hyperparameter configuration for a Decision Tree. This function should accept the training set as well as at least 3 hyperparameters (but feel free to add more) and perform K-Fold Cross-Validation.**
> HINT: Look at Week 6

In [None]:
def test_hyperparameter_DT(X_train, y_train,
                                HP_1, # CHANGE THIS TO YOUR HyperParameter
                                HP_2, # CHANGE THIS TO YOUR HyperParameter
                                HP_3, # CHANGE THIS TO YOUR HyperParameter
                                ):
  accuracies = []

  # create folds
  ## YOUR CODE HERE

  # loop through the folds
  for ... : ## YOUR CODE HERE
    # Get what samples should be in each split
    ## YOUR CODE HERE

    # select the samples
    ## YOUR CODE HERE

    # initialize a new model
    dt = ... ## YOUR CODE HERE

    # Train the model on the train fold
    ## YOUR CODE HERE

    # Make predictions on the validation fold
    val_predictions  = ... ## YOUR CODE HERE

    # calculate the accuracy of the val predictions

    fold_accuracy = ... ## YOUR CODE HERE

    accuracies.append(fold_accuracy)
  return np.mean(accuracies) # return the average accuracy across all folds



---
##### **Q8: Using the above function, perform a Grid Search for the best set of Hyperparameters for the Decision Tree. What is the best CV accuracy achieved?**
> HINT: Look at Week 6
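
A grid search simply tries every combination of candidate values and keeps the best one. The sketch below shows that loop in isolation. The function `demo_cv_score` is a hypothetical placeholder with a made-up formula; in practice it would be replaced by a call to a cross-validation function, and the hyperparameter names and values here are illustrative only.

```python
from itertools import product


def demo_cv_score(depth, min_leaf):
    """Hypothetical stand-in for a cross-validation call (made-up formula)."""
    return 1.0 - 0.1 * abs(depth - 3) - 0.05 * abs(min_leaf - 2)


# Candidate values for each (illustrative) hyperparameter
grid = {"depth": [1, 3, 5], "min_leaf": [1, 2, 4]}

best_score, best_config = -1.0, None
# product() yields every combination of the candidate values
for depth, min_leaf in product(grid["depth"], grid["min_leaf"]):
    score = demo_cv_score(depth, min_leaf)
    if score > best_score:
        best_score = score
        best_config = {"depth": depth, "min_leaf": min_leaf}

print(best_config, best_score)  # {'depth': 3, 'min_leaf': 2} 1.0
```

The number of configurations grows multiplicatively with each added hyperparameter (here 3 × 3 = 9), so keep each list of candidate values short.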

In [None]:
# Loop through the HP configurations, calling the above function for each config.


---
##### **Q9: Using the best hyperparameters found, train a new Decision tree on all the training data.**


In [None]:
## YOUR CODE HERE

# create a new model

# train the model using the training data


## Model Evaluation

![fc4](fc4.png)

Now that we have trained our model, let's see how well it performs.


---
##### **Q10: Make predictions on the training and test sets for the Decision Tree.**


In [None]:
# make prediction for the training data

# make predictions for the testing data



---
##### **Q11: What are the train and test accuracy, precision, and recall? Do you think the model is underfitting, overfitting, or neither and why?**
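
As a reminder of how these metrics are defined, here is a small worked example on made-up labels and predictions, treating class 1 as the positive class. The arrays `y_true` and `y_pred` are invented for illustration and are unrelated to any model in this notebook.

```python
import numpy as np

# Made-up labels and predictions for 8 patients (1 = positive class)
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# Count each cell of the confusion matrix
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted positives that are right
recall = tp / (tp + fn)                     # share of actual positives that are found

print(tp, tn, fp, fn)               # 4 2 1 1
print(accuracy, precision, recall)  # 0.75 0.8 0.8
```

When comparing train and test metrics, a large drop from train to test typically points to overfitting, while low scores on both typically point to underfitting.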


In [None]:
## YOUR CODE HERE
# Train metrics

# calculate true positives, true negatives, false positives, false negatives for train

# Calculate precision, recall, and accuracy

# repeat for test.


---
##### **Q12: Visualize the best decision tree. What feature is in the root node? How many leaves does your tree have?**

> HINT: use the `plot_tree` function from `sklearn.tree`
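
Besides the plot, a fitted tree can also be inspected programmatically. The sketch below is an illustration on synthetic data with invented feature names: it reads the root feature from the fitted tree's `tree_` attribute, counts leaves with `get_n_leaves()`, and prints the rules with `export_text`, a text-based alternative to `plot_tree` from the same `sklearn.tree` module.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with invented feature names
X_demo, y_demo = make_classification(n_samples=80, n_features=4, random_state=0)
feature_names = [f"gene_{i}" for i in range(X_demo.shape[1])]

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Node 0 is the root; tree_.feature holds the feature index used at each split
root_feature = feature_names[demo_tree.tree_.feature[0]]
n_leaves = demo_tree.get_n_leaves()
print(f"Root splits on {root_feature}; the tree has {n_leaves} leaves")

# Text rendering of the same structure that plot_tree would draw
print(export_text(demo_tree, feature_names=feature_names))
```

Each indented line in the text output is one split condition, so following a branch from the top down to a `class:` line gives the full rule for that leaf.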

##### **Q13: According to the decision tree, what genes and what level of expression determine if someone will have AML?**
> HINT: Look at the paths to leaves that predict AML

## Graded Questions

---
##### **GQ1: Repeat Questions 7-11 for a new model (e.g. LR, XGBoost, SVM, or any model you choose).**


In [None]:
# installs and imports for XGBoost, SVM, and LR
!pip install xgboost
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

###### GQ.7: Fill in the function below to test a hyperparameter configuration for your model. This function should accept the training set as well as at least 3 hyperparameters (but feel free to add more) and perform K-Fold Cross-Validation (3pt).


In [None]:
def test_hyperparameter_YOUR_MODEL(X_train, y_train,
                                HP_1, # CHANGE THIS TO YOUR HyperParameter
                                HP_2, # CHANGE THIS TO YOUR HyperParameter
                                HP_3, # CHANGE THIS TO YOUR HyperParameter
                                ):
  accuracies = []

  # create folds
  ## YOUR CODE HERE

  # loop through the folds
  for ... : ## YOUR CODE HERE
    # Get what samples should be in each split
    ## YOUR CODE HERE

    # select the samples
    ## YOUR CODE HERE

    # initialize a new model
    model = ... ## YOUR CODE HERE

    # Train the model on the train fold
    ## YOUR CODE HERE

    # Make predictions on the validation fold
    val_predictions  = ... ## YOUR CODE HERE

    # calculate the accuracy of the val predictions

    fold_accuracy = ... ## YOUR CODE HERE

    accuracies.append(fold_accuracy)
  return np.mean(accuracies) # return the average accuracy across all folds


###### GQ.8: Using the above function, perform a Grid Search for the best set of Hyperparameters for your model (3pt). What is the best CV accuracy achieved (2pt)?
> HINT: Look at Week 6


In [None]:
# Your code here

###### GQ.9: Using the best hyperparameters found, train a new model on all the training data (1pt).

In [None]:
# Your code here

###### GQ.10: Make predictions on the training and test sets for your model (2pt). 


In [None]:
# Your code here

###### GQ.11: What are the train accuracy, precision, and recall (3pt)? Do you think the model is underfitting, overfitting, or neither and why (2pt)?

*Your Answer Here*

---

##### **GQ2: Comparing the decision tree model built in the first half of this module with the model of your choice built in the second half, which model performed better? (1 mark) Justify your reasoning using model evaluation metrics (1 mark) and suggest why there is this difference (1 mark).**

*Your Answer Here*

---

## Conclusion

In this module, we have gone through the entire ML pipeline to classify the type of cancer a patient has. We started by exploring the data and manually identifying patterns, then trained a Decision Tree model to classify the data. We evaluated that model and checked whether it was underfitting or overfitting. We then trained and evaluated a second model. Finally, we compared the two models and determined which one was better.

While we focused on a specific dataset in this module, the ML pipeline can be applied to a wide variety of datasets and problems. The key is to understand the data, train a model, and evaluate its performance. By following this process, you can apply ML to a wide range of problems and make predictions based on data.