# Week 5: Introduction to Supervised Machine Learning

This week we will introduce another tool that can be used to analyse data: Machine Learning. By the end of this module, you should be able to:
1. Define Supervised Machine Learning and understand its importance in biology.
2. Explain the typical procedure to train an ML model.
3. Implement a program to create an ML Model using a framework called `sklearn`.
4. Understand appropriate ways to evaluate the performance of an ML Model.


## What is Machine Learning?
First, you have probably heard both the terms machine learning (ML) and artificial intelligence (AI) before and may be wondering what the difference is.


**Artificial Intelligence (AI)** is the field of computer science that focuses on creating systems capable of performing tasks that would typically require human intelligence. These tasks include recognizing patterns, learning from data, making decisions, and solving complex problems.

**Machine Learning (ML)** is a subset of AI. Rather than explicitly programming a computer to follow a fixed set of rules, in ML, we "teach" computers to learn from data. This is particularly useful in fields like biology, where there is an abundance of data, but the patterns and relations in the data are complex and hard for a human to parse through.





![aiml_hierarchy](aiml_hierarchy.png)


AI and ML are becoming essential tools in biology, with applications such as:

* Genomics: Identifying gene sequences associated with specific diseases.
* Drug Discovery: Predicting which compounds may act effectively on certain targets, helping to speed up the drug discovery process.
* Imaging and Diagnostics: Analyzing medical images (like MRI or histology slides) to identify patterns that might indicate disease.
* Ecology and Evolution: Analyzing large-scale environmental data or understanding population dynamics and evolutionary patterns.


Many of you have heard of recent developments in ML, and these advancements continue to change the way we work, learn, and coexist with technology. You may be surprised to find that machine learning is also becoming increasingly important in biological research. A study by [Walsh et al. (2021)](https://www.nature.com/articles/s41592-021-01205-4#citeas) shows an exponential increase in ML publications in biology since the 1990s, as shown in the figure below.

![ml_in_bio](ml_in_bio.png)

The scientific community is clearly adopting ML techniques at a rapid pace as a means to drive new findings. However, bridging the knowledge gap between computer scientists/engineers and researchers in the scientific community is still very much a work in progress. Our goal is to learn, at a very high level, what machine learning entails and how it may be used in biological research.


## ML Pipeline

A machine learning pipeline typically involves four steps:

1. Data Preparation: In this step, we obtain the relevant data for the task we are trying to perform and clean up records that seem incorrect.
2. Data Exploration: We then analyze the data at hand to manually find potentially interesting patterns.
3. Model Training: Once we have explored our data and manually identified potential trends and patterns, we can train (aka fit) a machine learning model. Ideally, the ML model will pick up patterns we have missed and will be able to outperform rules we discover.
4. Model Evaluation: To confirm whether the model picked up useful trends, we use a variety of metrics to evaluate how well the model performs on our task.

In this course, Week 3 covered Steps 1 and 2 (though we will practice Step 2 again below). We will cover Steps 3 and 4 in the main module.
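
To give a rough feel for how the four steps fit together, here is a minimal sketch on synthetic data. The data from `make_classification`, the choice of `LogisticRegression`, the 75/25 split, and all variable names are assumptions made purely for illustration; they are not necessarily the model or settings used later in this module.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1 - Data Preparation: synthetic stand-in for real measurements
features, labels = make_classification(n_samples=200, n_features=4, random_state=0)
data = pd.DataFrame(features, columns=["f1", "f2", "f3", "f4"])
data["label"] = labels
data = data.dropna()  # drop incomplete records (none here, shown for completeness)

# Step 2 - Data Exploration: compare the average of each feature per class
summary = data.groupby("label").mean()
print(summary)

# Step 3 - Model Training: fit a model on one part of the data
toy_X = data.drop(columns="label")
toy_y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0, stratify=toy_y
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 4 - Model Evaluation: score the model on data it has not seen
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on held-out data: {accuracy:.2f}")
```

Holding back part of the data for Step 4 is the key design choice in this sketch: a model is only useful if it does well on records it was not trained on.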

![fc](fcall.png)

### Case Study: Predicting Diabetes Risk in Pima Indian Women

Hopefully you enjoyed the heart failure analysis from earlier weeks, because we have yet another task:

The Pima Indians Diabetes Dataset was created by the National Institute of Diabetes and Digestive and Kidney Diseases to investigate the correlation between health-related attributes and the onset of diabetes. The study population consists exclusively of adult females of Pima Indian heritage, selected because of their higher-than-average prevalence of diabetes, providing a focused case for understanding factors contributing to diabetes risk.

The dataset comprises 768 records, with each instance containing eight attributes and a target variable. Each record represents one patient and includes the following attributes:

* Pregnancies: Number of times the patient has been pregnant.
* Glucose: Plasma glucose concentration over two hours in an oral glucose tolerance test.
* Blood Pressure: Diastolic blood pressure (mm Hg).
* Skin Thickness: Triceps skinfold thickness (mm).
* Insulin: Two-hour serum insulin (mu U/ml).
* BMI: Body mass index (weight in kg/(height in m)^2).
* Diabetes Pedigree Function: A function that scores likelihood of diabetes based on family history.
* Age: Patient's age (years).

The primary objective of this case study is to identify diabetic individuals based on the features provided, and to understand which factors most strongly correlate with diabetes risk. This could ultimately help in developing preventive measures for at-risk populations.



## Data Preparation
![fc1](fc1.png)
In most real-world cases, you (or your team) would need to go out and collect data, but here we will use a dataset that has already been prepared for us. Below, we load the dataset from an online source, where `X` contains the measurements taken by the researchers and `y` records whether or not each patient received a positive diagnosis. Each row in the data represents one patient.

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd
import numpy as np

# Fetch the dataset from https://www.openml.org/search?type=data&status=active&id=37
# as_frame=True returns pandas objects; return_X_y=True splits features (X) from labels (y)
X, y = fetch_openml(data_id=37, as_frame=True, return_X_y=True, parser="auto")


Below we print the first few rows of both `X` (the measurements taken by the researchers) and `y` (the diabetes diagnosis).

In [None]:
X.head()

In [None]:
y.head()
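
The column names in `X` are abbreviated and may not match the attribute names listed in the case study. Assuming the OpenML copy uses the short names `preg`, `plas`, `pres`, `skin`, `insu`, `mass`, `pedi`, and `age` (check `X.columns` to confirm), a sketch of mapping them to readable names could look like the following. The two-row `demo` frame is a made-up stand-in so the block runs without downloading anything.

```python
import pandas as pd

# Assumed mapping from OpenML's short column names to the attribute names
# listed in the case study. Verify against X.columns before relying on it.
readable_names = {
    "preg": "Pregnancies",
    "plas": "Glucose",
    "pres": "Blood Pressure",
    "skin": "Skin Thickness",
    "insu": "Insulin",
    "mass": "BMI",
    "pedi": "Diabetes Pedigree Function",
    "age": "Age",
}

# Made-up stand-in rows with the assumed column names (not real patient records)
demo = pd.DataFrame(
    [[1, 100.0, 70.0, 20.0, 80.0, 25.0, 0.3, 30],
     [3, 150.0, 80.0, 30.0, 120.0, 32.0, 0.6, 45]],
    columns=list(readable_names),
)

# rename returns a new DataFrame and leaves the original untouched
demo_renamed = demo.rename(columns=readable_names)
print(demo_renamed.columns.tolist())
```

Because `rename` returns a copy, the original short names remain available; the example rule later in this notebook uses `X['age']`, so renaming `X` itself in place would require updating that code.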

## Data Exploration

![fc2](fc2.png)

Before we dive into training an ML model, we will first manually explore the data to see if we can identify patterns associated with diabetes.

---
##### **Q1: Subset `X` into two different dataframes, one with all positive cases and one with negative cases. What percentage of patients have diabetes?**

> Hint: Look at week 3 Question 5 on how to subset dataframes.
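
As a quick refresher on the subsetting technique, here is a small sketch on toy data that is unrelated to the diabetes dataset. The names `measurements`, `outcome`, and `height_cm` are made up for illustration.

```python
import pandas as pd

# Toy data, unrelated to the diabetes dataset
measurements = pd.DataFrame({"height_cm": [150, 172, 181, 165]})
outcome = pd.Series(["yes", "no", "yes", "no"])

# A boolean mask built from one object can subset another that shares its index
is_yes = outcome == "yes"
yes_rows = measurements[is_yes]
no_rows = measurements[~is_yes]  # ~ flips True and False

print(len(yes_rows), len(no_rows))
```

The same pattern applies whenever the labels and the measurements are stored separately but share the same row index.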


*Your Answer Here.*

---


##### **Q2: Calculate the mean and standard deviation of each attribute for patients with diabetes and for patients without diabetes. What differences do you notice between the two groups?**


*Your Answer Here.*

---

##### **Q3: Create histograms of the patient attributes for both the positive and negative groups. What differences do you notice between the two groups?**


*Your Answer Here.*

---

## Graded Question

##### **GQ1: Using the analysis above, do your best to create a few rules (at least one or two) that doctors may use to identify patients with diabetes, then subset `X` and `y` based on these rules (1pt). How many patients with diabetes did you identify (1pt)? How many did you miss (1pt)?**

> Rule Examples: `Patients have diabetes if Age > 10` or `Patients don't have diabetes if Pregnancies >= 2`. These example rules perform poorly, so please create your own.


In [None]:
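# Example rule: "patients have diabetes if age > 10"
# The mask X['age'] > 10 selects the true diagnoses (from y) of every patient the rule flags
example_rule_positive_cases = y[X['age'] > 10]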

In [None]:
## Your Code Here