# Week 5: Introduction to Supervised Machine Learning
# Pre-module

## Learning Objectives
This week, we will introduce another tool that can be used to analyse data: Machine Learning (ML). By the end of this module, you should be able to:
1. Define Supervised Machine Learning and understand its importance in biology.
2. Explain the typical procedure to train an ML model.
3. Implement a program to create an ML model using a Python framework called `sklearn`.
4. Understand appropriate ways to evaluate the performance of an ML model.

### What is Machine Learning?
First, you have probably heard both the terms machine learning (ML) and artificial intelligence (AI) before and may be wondering what the difference is.

**Artificial Intelligence (AI)** is the field of computer science that focuses on creating systems capable of performing tasks that would typically require human intelligence. These tasks include recognizing patterns, learning from data, making decisions, and solving complex problems.

**Machine Learning (ML)** is a subset of AI. Rather than explicitly programming a computer to follow a fixed set of rules, in ML, we "teach" computers to learn from data. This is particularly useful in fields like biology, where there is an abundance of data, but the patterns and relations in the data are complex and hard for a human to parse through.
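
To make the contrast between a fixed rule and a learned rule concrete, here is a minimal sketch on made-up toy data. The function names (`hand_written_rule`, `learn_threshold`) and all numbers are hypothetical and chosen only for illustration. Real ML models are far more flexible than this single-threshold search.

```python
# Toy data (made up for illustration): one measurement per sample and a 0/1 label.
sizes = [1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def hand_written_rule(size):
    """A fixed rule chosen by a person: the threshold 2.0 is hard-coded."""
    return 1 if size > 2.0 else 0


def learn_threshold(xs, ys):
    """Pick the threshold that classifies the most training samples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(xs):
        correct = sum((1 if x > candidate else 0) == y for x, y in zip(xs, ys))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def accuracy(predict, xs, ys):
    """Fraction of samples for which the prediction matches the label."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


learned = learn_threshold(sizes, labels)
rule_acc = accuracy(hand_written_rule, sizes, labels)
learned_acc = accuracy(lambda x: 1 if x > learned else 0, sizes, labels)

print(f"Hand-written rule accuracy: {rule_acc:.3f}")
print(f"Learned threshold: {learned}, accuracy: {learned_acc:.3f}")
```

The hand-written rule depends on a person guessing a good threshold, while the learned rule is derived from the labelled examples themselves.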


![aiml_hierarchy](aiml_hierarchy.png)

AI and ML are becoming essential tools in biology, with applications such as:

* Genomics: Identifying gene sequences associated with specific diseases.
* Drug Discovery: Predicting which compounds may act effectively on certain targets, helping to speed up the drug discovery process.
* Imaging and Diagnostics: Analyzing medical images (like MRI or histology slides) to identify patterns that might indicate disease.
* Ecology and Evolution: Analyzing large-scale environmental data or understanding population dynamics and evolutionary patterns.


Many of you have heard of recent developments in ML, and these advancements continue to change the way we work, learn, and coexist with technology. You may be surprised to find that machine learning is also becoming increasingly important in biological research. A study by [Walsh et al. (2021)](https://www.nature.com/articles/s41592-021-01205-4#citeas) reports an exponential increase in ML publications in biology since the 1990s, as shown in the figure below.

![ml_in_bio](ml_in_bio.png)

The scientific community is rapidly adopting ML as a powerful tool for discovery. However, bridging the knowledge gap between computer scientists/engineers and researchers in the scientific community remains a challenge. Our goal is to provide you with a high-level understanding of what machine learning entails and how it may be applied to biological research.


### ML Pipeline

A typical Machine Learning pipeline consists of four key steps:
1. **Data Preparation**: This involves collecting relevant data for the task and cleaning it up by removing or correcting inaccurate or inconsistent records.
2. **Data Exploration**: Next, we analyze the data to manually identify potential trends and patterns through inspection and visualization.
3. **Model Training**: Once we have explored our data, we train (or “fit”) a machine learning model. Ideally, the ML model will pick up patterns we have missed and will be able to outperform rules we discover.
4. **Model Evaluation**: Finally, we assess the model’s performance using various metrics to determine how well it accomplishes the task at hand.

In this course, Week 3 covered Steps 1 and 2. We will practice Step 2 below, while Steps 3 and 4 will be covered in this week's main module.

![mlp](ml_pipeline.png)
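
The four steps above can be sketched end to end in a few lines of `sklearn`. This is only an illustrative sketch under stated assumptions: it uses the copy of the Wisconsin breast cancer data bundled with `sklearn` (whose column names differ from `bc_data.csv`), and the choice of a small decision tree, the 25% test split, and the `_demo` variable names are all assumptions made here, not necessarily what the main module uses.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Data preparation: load an already-cleaned dataset bundled with sklearn.
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

# 2. Data exploration: summary statistics for every feature.
summary = X_demo.describe()

# 3. Model training: hold out a test set, then fit the model on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
model = DecisionTreeClassifier(max_depth=3, random_state=0)  # assumed model choice
model.fit(X_train, y_train)

# 4. Model evaluation: compare predictions with the held-out labels.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
```

Evaluating on samples the model never saw during training gives a fairer picture of how it would perform on new data.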

### Problem statement: Breast cancer

Hopefully, you enjoyed the heart failure analysis from earlier weeks, because we have yet another task:

You have recently moved to a new research lab within the Canadian Cancer Society, studying breast cancer cells. The Canadian Cancer Society estimates that 1 in 8 women will develop breast cancer, with breast cancer constituting about 14% of cancer deaths among women. However, not all breast tumours are malignant. If there is a way to accurately detect early on whether a particular sample is benign or malignant, medical resources could be allocated more effectively to improve patient care. The team would like to find out whether machine learning could be used to predict which samples are benign and which are malignant, knowing that ML methods have been shown to be feasible in other cancer cell analyses in the industry.

You have been given a public dataset from the Diagnostic Wisconsin Breast Cancer Database*, containing features extracted from digitized images of cell nuclei in breast mass samples. In each image, researchers measured various characteristics of the cell nuclei, such as radius, texture, and symmetry. For every characteristic, three summary values were computed per image:
- **Mean** (`_mean`): the average value across the nuclei in the image
- **Standard error** (`_se`): an estimate of the variability in the measurement
- **Worst** (`_worst`): the mean of the three largest values observed

Each combination of a characteristic and a summary value (mean, standard error, or worst) is one feature of this dataset.

For example, the radius of a nucleus is the mean distance from its center to points on its perimeter. The `radius_mean` feature is the average of these radii, and `radius_worst` is the mean of the three largest radii.
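
As a worked example of the three summary values, here is a minimal sketch on a hypothetical list of measurements for one characteristic. The numbers are made up, and the standard error formula used here (sample standard deviation divided by the square root of the number of values) is an assumption about how such a value is typically computed, not a statement of the dataset authors' exact procedure.

```python
import numpy as np

# Hypothetical measured values of one characteristic (e.g. radius), arbitrary units.
values = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 12.0])

# Mean: the average of all values.
value_mean = values.mean()

# Standard error of the mean: sample standard deviation divided by sqrt(n).
value_se = values.std(ddof=1) / np.sqrt(values.size)

# Worst: the mean of the three largest values.
value_worst = np.sort(values)[-3:].mean()

print(f"mean={value_mean:.3f}, se={value_se:.3f}, worst={value_worst:.3f}")
```

Note that the worst value is never smaller than the mean, because it averages only the largest measurements.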

Take a moment to view the dataset **bc_data.csv** (using Excel or another .csv file viewer). 

While you are excited to contribute to your team's research, there are a few gaps in your knowledge, starting with... how does machine learning even work?

*https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data

### Data Preparation
![dp](data_prep_pipeline.png)

In most real-world cases, you (or your team) would need to collect the data yourselves, but here we will use a dataset that has already been prepared for us. Below, we load the dataset from `bc_data.csv`, where `X` holds the measurements taken by the researchers and `y` holds the diagnosis for each sample. Each row in the data represents one sample.

In [None]:
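import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the data, using the first column (`id`) as the row index
df = pd.read_csv('bc_data.csv', index_col=0)
# Drop the extra `Unnamed: 32` column, which holds no measurements
df = df.drop('Unnamed: 32', axis=1)

# Separate the label (`y`) from the measurements (`X`)
y = df["diagnosis"]
X = df.drop("diagnosis", axis=1)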

Below we print out the first few rows of both `X` (the measurements taken by the researchers) and `y` (the diagnosis for breast cancer).

In [None]:
X.head()

In [None]:
y.head()
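
The loading code drops a column named `Unnamed: 32`. One common cause of such a column is a trailing comma at the end of every line in the CSV file, which pandas reads as an extra, empty column. Whether that is the cause in `bc_data.csv` is an assumption. The sketch below reproduces the effect on a made-up toy CSV and shows one way to find and drop fully empty columns without hard-coding their names.

```python
import io

import pandas as pd

# Toy CSV text (made-up values). Every line ends with a trailing comma,
# which pandas reads as an extra column with no name and no data.
toy_csv = io.StringIO(
    "id,diagnosis,radius_mean,\n"
    "101,M,17.99,\n"
    "102,B,13.54,\n"
)
toy = pd.read_csv(toy_csv, index_col=0)

# Find the columns in which every value is missing.
empty_columns = [col for col in toy.columns if toy[col].isna().all()]
print("Empty columns:", empty_columns)

# Drop all fully empty columns in one step.
toy_clean = toy.dropna(axis=1, how="all")
print(toy_clean)
```

Checking for missing values like this is a quick sanity test to run on any newly loaded dataset.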

### Data Exploration

![de](data_explore_pipeline.png)

Before we dive into training an ML model, we will first manually explore the data to see if we can identify features that distinguish malignant samples from benign ones.

Let's analyze the positive and negative cases in the breast cancer dataset. Under the column `"diagnosis"`, there are 2 possible values: `'M'` (malignant) and `'B'` (benign). Here, a positive case is a malignant (`'M'`) sample and a negative case is a benign (`'B'`) one.

---
**Q*1: Subset `X` into two different dataframes, one with all positive cases and one with negative cases. What percentage of patients have breast cancer?**

> Hint 1: Look at Week 3 Question 5 on how to subset dataframes.
> 
> Hint 2: `X` and `y` still have the same `id` for corresponding rows!
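
As a sketch of the idea in Hint 2, the example below uses made-up toy data rather than the real dataset. The names `features` and `labels` are hypothetical stand-ins for a dataframe and a series that share the same index.

```python
import pandas as pd

# Toy data (made-up values). `features` and `labels` share the same index,
# just as `X` and `y` share `id`.
features = pd.DataFrame(
    {"radius_mean": [17.99, 13.54, 20.57, 11.42]},
    index=pd.Index([101, 102, 103, 104], name="id"),
)
labels = pd.Series(["M", "B", "M", "B"], index=features.index, name="diagnosis")

# A boolean mask built from the labels selects the matching rows of the features.
is_malignant = labels == "M"
malignant_rows = features[is_malignant]
other_rows = features[~is_malignant]

print(malignant_rows)
print(other_rows)
```

Because the mask and the dataframe are aligned on the same index, each row is matched to its own label.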

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
positive_cases = ...
negative_cases = ...
n_positive = ...
n_negative = ...

print(f"{n_positive / (n_positive + n_negative) * 100:.2f}% of the cases are positive")

---
**Q*2: Calculate the mean and standard deviation of each attribute for patients with breast cancer and for patients without breast cancer. What differences do you notice between the two groups?**

<span style="background-color: #FFD700">**Write your code below**</span> 


<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

**Q*3: Create histograms of the patient attributes for both the positive and negative groups. What differences do you notice between the two groups?**

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:

plt.show()

In [None]:

plt.show()

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

## **Graded Question: (3 marks)**

**GQ*1. (3 marks) Using the analysis above, do your best to create a few rules (at least one or two) that doctors may use to identify patients with breast cancer, then subset `X` and `y` based on these rules (1 mark). How many patients with breast cancer did you identify (1 mark)? How many did you miss (1 mark)?**

> Rule Examples: 'Patients have breast cancer if `radius_mean` > 5' or 'Patients don't have breast cancer if `texture_mean` >= 30'. These thresholds are arbitrary and do not do well; please make your own rules.


<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# Rule here
...

In [None]:
n_patients_captured = ...
n_patients_missed = ...

print(f"Patients captured: {n_patients_captured}, Patients missed: {n_patients_missed}")