# Week 5: Introduction to Supervised Machine Learning


This week will introduce you to fundamental machine learning (ML) concepts. By the end of this module, you will understand basic machine learning terminology and know how to train and evaluate an ML model. In the pre-module, you learned what Machine Learning is and were introduced to the Pima Diabetes case study. In this module, we will continue analysing the causes of diabetes using *Supervised Machine Learning*.





## Supervised ML

Supervised Machine Learning is a type of machine learning where the model is trained on a **labeled dataset**, meaning each sample is an input-output pair. The model learns to map inputs (predictor variables, AKA features) to the correct output (labels, AKA response variables). This type of learning is common in tasks like medical diagnosis, where patient data is labeled with specific health conditions.

There are two primary supervised machine learning tasks, classification and regression:

* **Classification** is a supervised learning task that involves predicting discrete labels for input data. For example, a classification model might be trained to identify images of animals as either "cat," "dog," or "bird." Classification is widely used in applications like medical diagnostics, where the goal might be to classify cells as cancerous or non-cancerous.

* **Regression** is another type of supervised learning task, but unlike classification, it involves predicting continuous, numerical outputs rather than discrete categories. For instance, it might predict a patient's blood pressure based on their age, weight, and lifestyle factors.

For the following notebooks, we will primarily focus on classification tasks, but know that the concepts presented carry over to regression tasks as well. Below, we focus on the model training and model evaluation steps of the machine learning pipeline.

![fcall](fcall.png)



### Case Study: Predicting Diabetes Risk in Pima Indian Women

As a reminder, the Pima Diabetes case study consists of a dataset with 768 records, with each instance containing eight attributes and a target variable. Each record represents one patient and includes the following attributes:

* Pregnancies: Number of times the patient has been pregnant.
* Glucose: Plasma glucose concentration over two hours in an oral glucose tolerance test.
* Blood Pressure: Diastolic blood pressure (mm Hg).
* Skin Thickness: Triceps skinfold thickness (mm).
* Insulin: Two-hour serum insulin (mu U/ml).
* BMI: Body mass index (weight in kg/(height in m)^2).
* Diabetes Pedigree Function: A function that scores likelihood of diabetes based on family history.
* Age: Patient's age (years).


**Using these pieces of information, our goal is to predict whether or not the patient had diabetes. In other words, this is a *classification* problem.**


Below, we load in the dataset from an online source, where `X` contains the measurements taken by the researchers and `y` records whether or not each patient received a positive diagnosis. Each row in the data represents one sample. For this module, we will convert `y` into a binary variable where 1 means the patient had diabetes and 0 means the patient did not have diabetes.

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd
import numpy as np

# we fetch the dataset from https://www.openml.org/search?type=data&status=active&id=37
X,y = fetch_openml(data_id = 37, as_frame = True, return_X_y = True)

# convert tested_positive and tested_negative to 1 and 0
y = (y == 'tested_positive').astype(int)


As stated before, our goal is to accurately predict whether or not a given patient developed diabetes. If we can do this task *accurately*, then in the future we may be able to help patients proactively rather than treat them after they develop diabetes. We measure this with accuracy, the fraction of predictions that are correct:

$$
\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{All Predictions}}
$$
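As a small worked example of the formula above, the sketch below computes accuracy for eight made-up patients. The arrays `y_true` and `y_pred` are hypothetical values chosen for illustration, not taken from the Pima dataset.

```python
import numpy as np

# Hypothetical true labels for 8 patients (1 = diabetes, 0 = no diabetes).
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
# Hypothetical predictions from some rule or model.
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# A prediction is correct when it matches the true label.
correct = y_pred == y_true

# Accuracy = correct predictions / all predictions.
accuracy = correct.sum() / len(y_true)
print(f"Correct: {correct.sum()} of {len(y_true)}")
print(f"Accuracy: {accuracy:.2f}")
```

Here six of the eight predictions match the true labels, so the accuracy is 6 / 8 = 0.75.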

---
##### **Q1: Copy your rules from the pre-module. What is the accuracy of your rules? Do you think this is acceptable for clinical usage?**

*Your Answer Here*

---

Chances are that the manual rules you created do not adequately predict whether or not someone has diabetes. This is understandable: there are a lot of features and samples, which makes manually identifying potential reasons for diabetes difficult. **Therefore, we will train an ML model to do this for us.**


## Model Training
![fc3](fc3.png)
### What is a model?

In the world of machine learning, a **model** is a **computer algorithm** that calculates an output from the input(s) in a dataset using learned **parameters**. These parameters allow it to make predictions on previously unseen data.

A machine learning model learns to perform such tasks by being **trained** on a large dataset (a process also known as **fitting** a model). Training a model is the process of teaching it to recognize patterns or rules by using data. During training, the model makes predictions from inputs and then calculates the error between its predictions and the true answers. It then adjusts its internal parameters to make better predictions. More specifically, the training process is as follows:

The model begins with a set of initial weights, typically assigned randomly. These weights represent the model's initial state before any learning occurs, similar to a "blank slate" or initial knowledge base.

1. We pass all points from the dataset, $x$, into the model.
2. The model processes this input, performing computations based on $x$ and its current weights, to produce predictions, denoted $\hat{y}$.
3. We then calculate the error (the opposite of accuracy) of all the predictions the model made.
4. By looking at what the model got right and wrong, we adjust the relevant parameters of the model so that the next set of predictions is more accurate.
5. We then repeat steps 1-4 until the error stops decreasing, or we reach some stopping condition (e.g., we hit a limit on the number of iterations).

This iterative process allows the model to gradually refine its weights, thereby improving its predictions as it learns from the data. Think of it like a student. By doing a lot of homework questions, a student can learn what answers are correct to given questions, and what answers are incorrect.




#### Fitting a Logistic Regression Model

One machine learning algorithm that we can use is *Logistic Regression*. You have already seen this model in Week 3, but as a refresher, Logistic Regression makes predictions by adding the inputs together with different weights:

$$
\hat{y} = b + w_1x_1 + w_2x_2 + \dots + w_nx_n
$$

Here:
- $b$ is the intercept or bias term,
- $w_1, w_2, \dots, w_n$ are the weights associated with each feature, and
- $x_1, x_2, \dots, x_n$ are the feature values.

If the result of the sum is above zero, we predict the positive class (1). Otherwise we predict the negative class (0).
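The weighted sum and the decision rule above can be sketched in a few lines of NumPy. The values of `b`, `w`, and `X_toy` below are hypothetical numbers chosen for illustration; they are not parameters learned from the Pima data.

```python
import numpy as np

# Hypothetical parameters for a model with three features.
b = -1.0                          # intercept (bias)
w = np.array([0.5, -0.25, 1.0])   # one weight per feature

# Two hypothetical samples, one per row.
X_toy = np.array([
    [2.0, 4.0, 1.5],
    [1.0, 2.0, 0.0],
])

# y_hat = b + w_1*x_1 + w_2*x_2 + w_3*x_3, computed for every row at once.
scores = b + X_toy @ w

# Predict the positive class (1) when the sum is above zero, otherwise 0.
labels = (scores > 0).astype(int)
print(scores)
print(labels)
```

The first sample has a sum of 0.5, which is above zero, so it is assigned class 1. The second has a sum of -1.0, so it is assigned class 0.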

In this model the different weights as well as the bias term are the *parameters* that are learned in the model. To put it in context of the training process above:

1. Given our datapoints $x$, we calculate $\hat{y}$ using the above equation.
2. We then check whether or not each prediction was correct.
3. Depending on our error (e.g., did we overpredict or underpredict the true values?), we adjust each weight either up or down.
4. We repeat steps 1-3 multiple times, stopping when the model stops improving (or if we hit the limit of iterations we set).
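To make the loop above concrete, here is a minimal sketch of one common way to carry it out: gradient descent on synthetic data. This is an illustration under stated assumptions, not the procedure `sklearn` uses internally (its solvers rely on more sophisticated optimizers). The data, the `learning_rate`, the zero initial weights, and the sigmoid step that turns the sum into a probability are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 2 features; the true rule is x1 + x2 > 0.
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Initial parameters (zeros here, so the run is reproducible).
w = np.zeros(2)
b = 0.0
learning_rate = 0.5   # how far to move the parameters on each step
max_iter = 100        # stopping condition: a limit on the number of passes

for step in range(max_iter):
    # Step 1: compute the weighted sum for every sample.
    scores = b + X_toy @ w
    # Squash the sum into a probability between 0 and 1 (sigmoid).
    probs = 1.0 / (1.0 + np.exp(-scores))
    # Step 2: compare with the true labels (positive error = overpredicted).
    error = probs - y_toy
    # Step 3: nudge each weight against its average error.
    w -= learning_rate * (X_toy.T @ error) / len(y_toy)
    b -= learning_rate * error.mean()

# Final predictions: positive class when the weighted sum is above zero.
predictions = (b + X_toy @ w > 0).astype(int)
train_accuracy = (predictions == y_toy).mean()
print(f"Learned weights: {w}, bias: {b:.3f}")
print(f"Training accuracy: {train_accuracy:.3f}")
```

Because the synthetic labels depend equally on both features, both learned weights should come out positive and of similar size.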


![training_flowchart](model_training_workflow.png)

#### scikit-learn
Now luckily, we don't need to manually code the training procedure for many models. Instead, we can use a package called scikit-learn, AKA `sklearn`. This is a Python library that contains many useful machine learning tools such as:
* Loading in common datasets (seen above).
* Tools to process your data.
* Code to train various models, including code to fit a Logistic Regression model.
* Code to evaluate your trained models.


In the cell below, we train and evaluate a Logistic Regression model. This is a little different from the way we fit Logistic Regression in Week 3.

In [None]:
# import the LogisticRegression class.
# This contains all code needed to train LR

from sklearn.linear_model import LogisticRegression

# create an instance of the LR model.
# By setting max_iter to 30, we say that we want to do steps 1-3 at most 30 times.
model = LogisticRegression(solver = 'liblinear', max_iter = 30)

# fit the model to the data
model.fit(X, y)

# make predictions
predictions = model.predict(X)

# measure what proportion of the predictions we got correct.

accuracy = np.sum(predictions == y) / len(y)
print(f'Accuracy: {accuracy*100:.4f}%')

We can also use functions from `sklearn` to calculate the accuracy of our model. The `accuracy_score` function from `sklearn.metrics` calculates the accuracy of the model by comparing the predicted values to the true values. The function takes in two arguments: the true values and the predicted values. 

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y, predictions)
print(f"Accuracy: {accuracy * 100:.4f}%")
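After fitting, the learned parameters can be inspected directly. The sketch below uses a small synthetic dataset from `make_classification` as a stand-in (an assumption made so the example runs on its own; it is not the Pima data). It reads the weights from `coef_` and the bias from `intercept_`, then checks that the hand-computed weighted sum reproduces the model's own predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 4 features (not the Pima dataset).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

toy_model = LogisticRegression(solver="liblinear", max_iter=100)
toy_model.fit(X_toy, y_toy)

# Learned parameters: one weight per feature, plus the intercept.
weights = toy_model.coef_[0]
bias = toy_model.intercept_[0]
print("weights:", weights)
print("bias:", bias)

# Rebuild the weighted sum by hand and threshold it at zero.
manual_scores = bias + X_toy @ weights
manual_predictions = (manual_scores > 0).astype(int)

# Both checks should agree with what the fitted model reports.
print(np.allclose(manual_scores, toy_model.decision_function(X_toy)))
print(np.array_equal(manual_predictions, toy_model.predict(X_toy)))
```

The same attributes are available on any fitted `LogisticRegression` instance, so a similar check could be run on the model trained in the cell above.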

---

##### **Q2: Compare the accuracy of Logistic Regression to your hand-crafted rules. Which did better?**

*Your Answer Here*


---

##### **Q3: Try running Logistic Regression with 10 different values of `max_iter` that are smaller than 30 and 10 that are larger than 30. Plot out the model accuracy by the number of iterations. What do you observe as you increase the number of iterations? Does the training accuracy stop improving at some point?**

> NOTE: If you get a `ConvergenceWarning`, ignore it for now. You're doing fine :).


In [None]:
# Your code here.

*Your Answer Here*

---

## Model Evaluation

![fc4](fc4.png)

Imagine you're a doctor trying to learn if someone might have diabetes based on things like their age, weight, or blood sugar levels. To get really good at predicting this, you practice with records from patients you've already seen before, where you know if they ended up having diabetes or not.

But if you only checked how good you were using those same patients, you might not really know if you're good at guessing for new people you haven't met. Maybe you're just getting used to those specific patients and learning things about them rather than truly understanding the signs of diabetes.

Our machine learning models can fall into the same trap. Evaluating a model on the same data that was used to train it often gives an overly optimistic view of its performance. **To handle this, we will withhold some data from the model** by randomly splitting the data into two sets:

1. Training set: This is the dataset we give the model to learn from. Think of it as homework given to students. The model can practice getting these answers correct (aka training).
2. Test set: This is data the model has not seen. Think of it as the exam. Students used homework questions to learn the concepts (training); now we evaluate how well they learned patterns and trends by giving them questions they have not seen before.

Ideally, we want a model that does well on both the training AND testing set.

In the cell below, we split the data using the `train_test_split` function provided by `sklearn`. We allocate 80% of the data to the training set, and 20% of the data to the testing set (aka the exam).



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
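To see what the split produces, the sketch below runs `train_test_split` on stand-in arrays with the same number of rows as the Pima dataset. The arrays `X_toy` and `y_toy` are made up for illustration. The `random_state` argument is an optional addition not used in the cell above; it fixes the shuffle so the same split is produced on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the Pima dataset (768).
X_toy = np.arange(768 * 2).reshape(768, 2)
y_toy = np.arange(768) % 2

# random_state fixes the shuffle so the split is the same on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)

# Every row of X_toy has a unique first value, so we can use it as a row ID
# to confirm that no row appears in both sets.
train_rows = set(X_tr[:, 0])
test_rows = set(X_te[:, 0])
print(train_rows.isdisjoint(test_rows))
```

With 768 rows and `test_size=0.2`, the split yields 614 training rows and 154 test rows, and the two sets share no rows.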


---

##### **Q4: Train a model on the train set, make predictions for both the train and test set, and report the accuracy on both sets.**

In [None]:
# Your code here

# create model class

# Train model on the train set

# make predictions for train set

# make predictions for the test set

# Calculate accuracy


---

### Measures beyond Accuracy

While accuracy is a good measure of how many predictions the model is getting correct, we often care about the *type* of correct and incorrect predictions. More specifically, we care about True Positives, True Negatives, False Positives, and False Negatives:

* False positive (FP): predicted positive, but the true label was actually negative. (Type I error)
* False negative (FN): predicted negative, but the true label was actually positive. (Type II error)
* True positive (TP): predicted positive, and the true label was indeed positive.
* True negative (TN): predicted negative, and the true label was indeed negative.

Using this terminology, we can improve our definition of Accuracy:

$$
\text{Accuracy} =\frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$
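As a toy illustration of these four counts, the sketch below tallies them for eight made-up predictions and checks that the formula above gives the same accuracy as counting matches directly. The arrays are hypothetical, not model output.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Count each combination of predicted and true label.
tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted 1, truly 1
tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted 0, truly 0
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted 1, truly 0
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted 0, truly 1
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

# Accuracy from the four counts, compared with the direct calculation.
accuracy_from_counts = (tp + tn) / (tp + tn + fp + fn)
accuracy_direct = np.mean(y_pred == y_true)
print(accuracy_from_counts, accuracy_direct)
```

In this toy case there are 2 true positives, 4 true negatives, 1 false positive, and 1 false negative, giving an accuracy of 6 / 8 = 0.75 either way.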




---
##### **Q5: Calculate the True Positives, True Negatives, False Positives, and False Negatives of your model predictions on the train set.**

---


Apart from accuracy, other common ways to measure the performance of your classification model are *Precision* and *Recall*.


In machine learning, "precision" refers to a metric that measures the **proportion of positive predictions made by a model that are actually correct**. This is basically asking, "What portion of the people predicted to have diabetes actually have diabetes?"

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

"Recall" is a metric that measures how well a model can identify all relevant positive instances within a dataset. This is basically asking, "Of all the people who have diabetes, how many did the model actually identify?"

$$
\text{Recall} = \frac{TP}{TP+FN}
$$
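The two formulas can give quite different answers for the same predictions. The sketch below uses ten made-up labels (hypothetical values, not model output) for a cautious predictor that rarely predicts the positive class, and compares the hand-computed values with `precision_score` and `recall_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 6 patients truly positive, 4 truly negative.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
# A cautious predictor: it flags only 4 patients as positive.
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the true positives, how many were found
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")

# sklearn's built-in functions take the true labels first, then predictions.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Here precision is 3 / 4 = 0.75 while recall is only 3 / 6 = 0.50: most of the flagged patients really are positive, but half of the positive patients were missed.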


---

##### **Q6: Calculate the train and test precision and recall of your model**

---

## Conclusion

In this module, we introduced the fundamentals of training and evaluating a machine learning (ML) model. By iteratively making guesses and adjusting itself based on those guesses, an ML model can learn rules and patterns in the data that are useful in making predictions. In the next module, we will learn about more models and further our knowledge of how to use machine learning for biological applications.


## Graded Questions

In this set of graded questions, you will train a Logistic Regression model on the Heart Failures dataset from the prior weeks.

---

##### **GQ1: Read in the heart failures dataset (`hf_data_tut.csv`) (1pt) and split the predictor variables (features) from the labels (response variable) (1pt).**

> HINT: Look at your work from Week 3.



In [None]:
# Your Code here

---
##### **GQ2: Train a Logistic Regression model on *all* the data (1pt). What is the accuracy (1pt)?**

In [None]:
# Your Code here

---

##### **GQ3: Split the dataset into a train and test set and retrain your model (1pt). What are the train and test accuracy (2pt), precision (2pt), and recall (2pt)?**