# Week 8 - Applying Machine Learning
# Tutorial Module

This week, we will be working on a case study that utilizes machine learning (ML) on a genetics dataset. The goal of this module is to have you go through the ML pipeline to identify and classify types of cancer in patient data.


### ML Pipeline

As a reminder, a Machine Learning pipeline typically involves 4 steps:

1. Data Preparation: In this step, we obtain the relevant data for the task we are trying to perform.
2. Data Exploration: We then analyze the data at hand to manually find potentially interesting patterns.
3. Model Training: Once we have explored our data and manually identified potential trends and patterns, we can train (aka fit) a machine learning model. Ideally, the ML model will pick up patterns we have missed and will be able to outperform the rules we discover.
4. Model Evaluation: To confirm whether the model picked up useful trends, we will use a variety of metrics to evaluate how well the model performs at our task.
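
The four steps above can be sketched end to end on synthetic data. This is only an illustration: the dataset comes from scikit-learn's `make_classification`, not from the leukemia study, and the names `X_demo`, `y_demo`, and `clf` as well as the split size are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Data preparation: a synthetic stand-in for an already prepared dataset
features, labels = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(5)])
y_demo = pd.Series(labels, name="label")

# 2. Data exploration: compare the average of each feature per class
class_means = X_demo.groupby(y_demo).mean()
print(class_means)

# 3. Model training: fit a classifier on the training portion only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# 4. Model evaluation: score predictions on data the model has not seen
test_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"Test accuracy: {test_accuracy:.2f}")
```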

![fcall](fcall.png)

### Gene Expression Data for Acute Leukemia Patients
In this module, we will analyze gene expression data for acute leukemia patients. Leukemia is a form of cancer that impacts the blood and bone marrow, interfering with the normal production and function of healthy blood cells. Within the bone marrow, blood stem cells, known as hematopoietic stem cells, generate different types of blood cells. Leukemia arises when these stem cells produce abnormal white blood cells, crowding out healthy cells and disrupting their functions. As a result, individuals with leukemia may become more prone to infections, experience anemia, or bleed easily.

Given its severity, it is valuable to identify which type of leukemia a patient has. To this end, [Golub et al. (1999)](https://pubmed.ncbi.nlm.nih.gov/10521349/) collected gene expression data from patients and aimed to distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL). Our goal is to use this data to build a classifier that separates the two.


### Data Preparation

![fc1](fc1.png)

As with our previous datasets, the data has already been prepared for us. Detailed preparation steps can be found at [this link](https://www.kaggle.com/code/selinyang/gene-expression-data-for-acute-leukemia-patients).



---
**Q*1: Read in `'gene_expression_data.csv'`. Assign the `'cancer'` column to `y` and the rest of the columns to `X`.**

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
import pandas as pd
import numpy as np

# YOUR CODE HERE
data = 
y = 
X = 

---
Each column in `X` holds the expression level of one gene, and each row is a patient. `y` contains the type of cancer the patient has: `"ALL"` or `"AML"`.
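
To make the row and column layout concrete, here is a minimal sketch on a tiny made-up table. The CSV text, the gene names, and the `label` column are hypothetical and do not come from the course file; the pattern is simply one target column plus one column per gene.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for a gene expression file (values are not real)
csv_text = """gene_a,gene_b,gene_c,label
10.5,3.2,7.1,typeA
9.8,4.0,6.5,typeB
11.2,2.9,7.7,typeA
"""
toy = pd.read_csv(io.StringIO(csv_text))

toy_y = toy["label"]                 # target: one entry per patient
toy_X = toy.drop(columns=["label"])  # features: every remaining column

# shape is (number of rows, number of columns) = (patients, genes)
print(toy_X.shape)
print(toy_y.tolist())
```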

---
**Q*2: Binarize `y` by assigning `ALL` to 0 and `AML` to 1.**

> Hint: consider using the `map()` function, which has been used a few times since Week 5.
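
As a reminder of how `map()` behaves on a pandas Series, here is a small sketch with made-up labels (`"benign"` and `"malignant"` are placeholders, not values from this dataset). Any label missing from the dictionary becomes `NaN`, so a quick check afterwards is a reasonable habit.

```python
import pandas as pd

toy_labels = pd.Series(["benign", "malignant", "malignant", "benign"])

# Each value is looked up in the dictionary and replaced by its match
encoded = toy_labels.map({"benign": 0, "malignant": 1})
print(encoded.tolist())  # [0, 1, 1, 0]

# Labels that are not keys of the dictionary would show up as NaN here
print(encoded.isna().sum())  # 0
```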

<span style="background-color: #FFD700">**Write your code below**</span> 

In [None]:
# YOUR CODE HERE
y =
y

---
### Data Exploration

![fc2](fc2.png)


Before we dive into training an ML model, we will first manually explore the data to see if we can identify ways to distinguish between the two classes.


---
**Q*3: How many genes (features) are in the dataset?**

<!-- > Hint: Look at week 3 Question 5 on how to subset dataframes. -->

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE



---
**Q*4: Subset the data into two different dataframes, one with all "AML" cases and one with "ALL" cases. What percentage of patients have "AML"?**

> Hint: Look at week 3 Question 5 on how to subset dataframes.
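
One common way to subset is a boolean mask: compare the label Series with a value, then use the result to select rows. The sketch below uses a made-up five-row table with hypothetical labels `"pos"` and `"neg"`.

```python
import pandas as pd

toy_X = pd.DataFrame({
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [5.0, 3.0, 4.0, 1.0, 2.0],
})
toy_y = pd.Series(["pos", "neg", "neg", "pos", "neg"])

is_pos = toy_y == "pos"     # boolean mask with one True/False per row
toy_X_pos = toy_X[is_pos]   # rows where the mask is True
toy_X_neg = toy_X[~is_pos]  # ~ inverts the mask

percentage_pos = 100 * len(toy_X_pos) / len(toy_X)
print(f"{percentage_pos:.1f}% of rows are 'pos'")  # 40.0% of rows are 'pos'
```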

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE
# Subset the data
X_aml = 
X_all = 

# Calculate the percentage of patients with AML
percentage_aml = 
print(percentage_aml)

---
**Q*5: Without using ML, visualize and analyse the data and try to construct 1-2 manual rules that may distinguish between the two types of cancer. What is the accuracy of your rules?**

> Hint: Refer to Week 5 Pre-module on how to make rules.
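
As an illustration of how a manual rule turns into an accuracy number, the sketch below applies a hand-picked threshold to one made-up feature and counts the four outcomes. The data, the `gene_a` column, and the threshold of 4.0 are all invented for the example; accuracy is the share of correct predictions, (TP + TN) / (TP + FP + TN + FN).

```python
import pandas as pd

toy_X = pd.DataFrame({"gene_a": [1.0, 2.0, 6.0, 7.0, 3.0, 8.0]})
toy_y = pd.Series([0, 0, 1, 1, 1, 0])

# Hypothetical rule: predict class 1 when gene_a exceeds the threshold
threshold = 4.0
rule_pred = (toy_X["gene_a"] > threshold).astype(int)

# Count each combination of predicted and true label
tp = int(((rule_pred == 1) & (toy_y == 1)).sum())
fp = int(((rule_pred == 1) & (toy_y == 0)).sum())
tn = int(((rule_pred == 0) & (toy_y == 0)).sum())
fn = int(((rule_pred == 0) & (toy_y == 1)).sum())

rule_accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, fp, tn, fn)          # 2 1 2 1
print(f"{rule_accuracy:.3f}")  # 0.667
```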

<span style="background-color: #FFD700">**Write your code below**</span> 

In [None]:
# YOUR CODE HERE
# Write your rules here and complete the rest of the code



true_positives = 
false_positives = 
true_negatives = 
false_negatives = 

accuracy = 
print(accuracy)


---
### Model Training

![fc3](fc3.png)

Now, let's train an ML model to perform this classification task.

---
**Q*6: Split the data into training sets (`X_train` and `y_train`) and testing sets (`X_test` and `y_test`).**
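
For reference, here is a minimal sketch of a train/test split on made-up arrays using scikit-learn's `train_test_split`. The 25% test size, the `random_state` value, and the use of `stratify` (which keeps class proportions similar in both parts) are assumptions for the example, not requirements of this module.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(40).reshape(20, 2)  # 20 samples, 2 features
toy_y = np.array([0] * 14 + [1] * 6)  # imbalanced toy labels

X_part_a, X_part_b, y_part_a, y_part_b = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42
)

print(X_part_a.shape, X_part_b.shape)  # (15, 2) (5, 2)
```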

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE

# Import the correct function (refer to Pre-module)


# Split the data



---
In the previous modules, you learned about 4 different models. In the sections below, you will tune and train two different models. The first one will be Logistic Regression, and the second will be Random Forest.

---
**Q*7: Use grid search to test combinations of at least 2 hyperparameters (but feel free to add more) for Logistic Regression. Fit using the training set, predict using the test set, and calculate accuracy.**

> Hint: Refer to Week 6.
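
The nested-loop pattern for a manual grid search can be sketched as follows. To keep the exercise open, the example deliberately swaps in a different model (k-nearest neighbours) and synthetic data; the hyperparameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

features, labels = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

n_neighbors_values = [1, 5, 15]
weights_values = ["uniform", "distance"]

best_accuracy = 0
best_params = None
results = []

# Try every (n_neighbors, weights) pair and remember the best one
for n_neighbors in n_neighbors_values:
    for weights in weights_values:
        model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
        model.fit(X_tr, y_tr)
        accuracy = accuracy_score(y_te, model.predict(X_te))
        results.append((n_neighbors, weights, accuracy))
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {"n_neighbors": n_neighbors, "weights": weights}

print("Best hyperparameters:", best_params)
print(f"Best accuracy: {best_accuracy:.4f}")
```

Storing the winning combination as a dictionary makes it easy to rebuild the model later.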

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
from sklearn.linear_model import LogisticRegression

# Step 1: Initialize hyperparameter arrays 
# YOUR CODE HERE
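C_values = ...  # Inverse of regularization strength (smaller C = stronger regularization), from 10^-4 to 10^4
penalty_values = ...     # L1 (Lasso) or L2 (Ridge) regularization; L1 needs a solver that supports it (e.g. 'liblinear')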

# Step 2: Manually iterate over hyperparameter combinations and keep the one with the best accuracy
best_accuracy = 0
best_params = None
best_model = None

# Loop over all combinations of hyperparameters
# YOUR CODE HERE
for ...
    for ...
        print(f"Training model with penalty={penalty}, C={C}")
        
        # Initialize the Logistic Regression model with current hyperparameters
        model = ...
        
        # Train the model
        ...
        
        # Predict on the test set
        ...
        
        # Evaluate accuracy
        accuracy = ...
        print(f"Accuracy: {accuracy:.4f}")
        
        # Update the best model if we find a better one
        if ...
            


# Step 3: Output the Best Hyperparameters and Model Performance
print("\nBest hyperparameters:", best_params)
print(f"Best accuracy: {best_accuracy:.4f}")


---
**Q*8: Using the best hyperparameters found, train a new Logistic Regression model on all the training data.**
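
If the best hyperparameters are stored in a dictionary, one way to rebuild a model from them is dictionary unpacking with `**`. The sketch below uses a decision tree, synthetic data, and a hypothetical `found_params` dictionary; none of these values come from the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

features, labels = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Pretend a search returned this dictionary (hypothetical values)
found_params = {"max_depth": 3, "min_samples_leaf": 2}

# ** passes each key/value pair as a keyword argument to the constructor
final_model = DecisionTreeClassifier(**found_params, random_state=0)
final_model.fit(features, labels)

print(final_model.get_params()["max_depth"])  # 3
```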

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE
# Create a new model
lr = ...

# Train the model using the training data



---
### Model Evaluation

![fc4](fc4.png)

Now that we have trained our model, let's see how well it performs.

---
**Q*9: Make predictions on the training and test sets for the Logistic Regression model.**

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE 
# Make predictions for the training data


# Make predictions for the testing data



---
**Q*10: What are the train and test accuracy, precision, recall, and F1-score? Do you think the model is underfitting, overfitting, or neither? Why?**
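
One way to reason about overfitting is to place train and test metrics side by side and look at the gap. The sketch below does this for an unrestricted decision tree on noisy synthetic data, a setup chosen because such a tree can memorize its training set; the `summarize` helper and the `gap` column are names invented for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which makes memorization easy to spot
features, labels = make_classification(
    n_samples=300, n_features=20, n_informative=3, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

def summarize(y_true, y_pred):
    """Collect the four metrics for one set of predictions."""
    return pd.Series({
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1-Score": f1_score(y_true, y_pred),
    })

report = pd.DataFrame({
    "train": summarize(y_tr, tree.predict(X_tr)),
    "test": summarize(y_te, tree.predict(X_te)),
})
report["gap"] = report["train"] - report["test"]
print(report.round(3))
```

A train score near 1.0 combined with a clearly lower test score usually points to overfitting; low scores on both sets usually point to underfitting.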

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# YOUR CODE HERE
# Evaluation metrics for the train set
acc = ...            #TODO: complete this line
precision = ...      #TODO: complete this line
recall = ...         #TODO: complete this line
f1 = ...             #TODO: complete this line

eval_train = pd.Series({
    "Accuracy": acc,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1
})

# Evaluation metrics for test set
acc = ...        #TODO: complete this line
precision = ...  #TODO: complete this line
recall = ...     #TODO: complete this line
f1 = ...         #TODO: complete this line

eval_test = pd.Series({
    "Accuracy": acc,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1
})

print("Model: LogisticRegression")
print("\nEvaluation metrics for train set:\n", eval_train)

print("Model: LogisticRegression")
print("\nEvaluation metrics for test set:\n", eval_test)

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

## **Graded Exercise: (9 marks)**

**GQ*1: Repeat Questions 7-10 for Random Forest. (8 marks)**

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# Import for Random Forest
# YOUR CODE HERE



##### **GQ*1-7: Use grid search to test combinations of at least 2 hyperparameters (but feel free to add more) for Random Forest. Fit using the training set, predict using the test set, and calculate accuracy (3pt).**
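
Random Forest training involves randomness (bootstrap samples and random feature subsets), so results can change from run to run unless a seed is fixed. The sketch below, on synthetic data with arbitrary hyperparameter values, illustrates that two forests built with the same `random_state` make identical predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

features, labels = make_classification(n_samples=150, n_features=10, random_state=0)

# Same data, same hyperparameters, same seed
forest_a = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=42)
forest_b = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=42)
forest_a.fit(features, labels)
forest_b.fit(features, labels)

same = np.array_equal(forest_a.predict(features), forest_b.predict(features))
print(same)  # True
```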

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# Step 1: Initialize hyperparameter arrays 
# YOUR CODE HERE
n_estimators_values = ...  # Number of trees in the forest
max_depth_values = ...     # Maximum depth of each tree

# Step 2: Manually iterate over hyperparameter combinations and keep the one with the best accuracy
best_accuracy = 0
best_params = None
best_model = None

# Loop over all combinations of hyperparameters
# YOUR CODE HERE
for ...
    for ...
        print(f"Training model with n_estimators={n_estimators}, max_depth={max_depth}")
        
        # Initialize the Random Forest model with current hyperparameters
        model = ...  # Add a random_state for reproducibility
        
        # Train the model
        ...
        
        # Predict on the test set
        ...
        
        # Evaluate accuracy
        accuracy = ...
        print(f"Accuracy: {accuracy:.4f}")
        
        # Update the best model if we found a better one
        if ...



# Step 3: Output the Best Hyperparameters and Model Performance
print("\nBest hyperparameters:", best_params)
print(f"Best accuracy: {best_accuracy:.4f}")


##### **GQ*1-8: Using the best hyperparameters found, train a new Random Forest model on all the training data (1pt).**

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE

# Create a new model
rf = ...

# Train the model using the training data



##### **GQ*1-9: Make predictions on the training and test sets for the Random Forest model (2pt).**

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE
# Make predictions for the training data


# Make predictions for the testing data



##### **GQ*1-10: What are the train and test accuracy, precision, recall, and F1-score? Do you think the model is underfitting, overfitting, or neither? Why? (2pt)**

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Evaluation metrics for the train set
acc = ...            #TODO: complete this line
precision = ...      #TODO: complete this line
recall = ...         #TODO: complete this line
f1 = ...             #TODO: complete this line

eval_train = pd.Series({
    "Accuracy": acc,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1
})

# Evaluation metrics for the test set
acc = ...            #TODO: complete this line
precision = ...      #TODO: complete this line
recall = ...         #TODO: complete this line
f1 = ...             #TODO: complete this line

eval_test = pd.Series({
    "Accuracy": acc,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1
})

print("Model: Random Forest")
print("\nEvaluation metrics for train set:\n", eval_train)

print("Model: Random Forest")
print("\nEvaluation metrics for test set:\n", eval_test)

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

**GQ*2: Which model (Logistic Regression or Random Forest) performed better? (1pt)**
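
When comparing two models, it can help to collect their test metrics in one table. The sketch below shows one possible layout using two stand-in models (k-nearest neighbours and a shallow decision tree) on synthetic data; the models, names, and settings are assumptions for illustration and say nothing about the results on the leukemia data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

features, labels = make_classification(n_samples=200, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=1
)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=1),
}

# Fit each model on the same split and record its test metrics
rows = {}
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows[name] = {
        "Accuracy": accuracy_score(y_te, pred),
        "F1-Score": f1_score(y_te, pred),
    }

# One row per model, one column per metric
comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison.round(3))
```

Evaluating both models on the same test split keeps the comparison fair.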

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---