<left><img width=100% height=100% src="img/itu_logo.png"></left>

## Lecture 01: Components of a Supervised Learning Problem

### __Gül İnan__<br><br>Istanbul Technical University

## A Supervised Machine Learning Problem

Let’s start with a simple example of a **supervised learning problem**: predicting a diabetes progression (risk) score.

Remember that we have a dataset of diabetes patients. 
* For each patient we have access to measurements from their medical record and an estimate of diabetes risk.
* We are interested in understanding how the measurements affect an individual's diabetes risk.

## Three Components of a Supervised Machine Learning Problem

At a high level, a `supervised machine learning problem` has the following structure:

$$ \text{Dataset} + \text{Algorithm} \to \text{Predictive Model}. $$

The **predictive model** is constructed to **model** the relationship between inputs and targets and then it can **predict** future targets.

### A Supervised Learning Dataset: Notation

We say that a `training dataset` of size $n$ (e.g., $n$ patients) is a set:

$$
\mathcal{D} = \{(\textbf{x}_i, y_i) \mid i = 1,2,...,n\}.
$$

Each $\textbf{x}_i=(x_{i1}, x_{i2},\ldots, x_{id})^T$ is a vector of $d$ `features` (e.g., the measurements for patient $i$) and each $y_i$ is a `target` (e.g., the diabetes risk). 

#### Feature Matrix

It is useful to represent the features as a single matrix $\textbf{X} \in \mathbb{R}^{n \times d}$, whose $i$-th row is $\textbf{x}_i^T$:

$$ 
\textbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \ldots & x_{1d} \\
x_{21} & x_{22} & \ldots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \ldots & x_{nd} \\
\end{bmatrix}.
$$

Similarly, we can vectorize the target variables into a vector $\textbf{y} \in \mathbb{R}^n$ of the form:

$$ 
\textbf{y} = \begin{bmatrix}
y_{1} \\
y_{2} \\
\vdots \\
y_{n}
\end{bmatrix}.
$$

We can look at the diabetes dataset in this form.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the diabetes dataset and return X and y separately
diabetes_X, diabetes_y = load_diabetes(return_X_y=True, as_frame=True) #return data frames

# Print a portion of the feature matrix
diabetes_X.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|-----|-----|-----|----|----|----|----|----|----|----|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


In [2]:
# Print a portion of target vector
diabetes_y.head()

0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64

### A Supervised Learning Algorithm: Notation

We can also define the high-level structure of a `supervised learning algorithm` as consisting of three components:
* A `model class`: the set of possible models we consider.
* An `objective function`: which defines how good a model is.
* An `optimizer`: which finds the best predictive model in the model class according to the objective function.

### Model: Notation

We'll say that a `model is a function`:

$$ 
f : \mathcal{X} \to \mathcal{Y} 
$$
that maps inputs $\textbf{x} \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

Often, models have `parameters` $\boldsymbol{\theta} \in \Theta$ living in a set $\Theta$. We will then write the model as:

$$ 
f_\theta : \mathcal{X} \to \mathcal{Y} 
$$
to denote that it's parametrized by $\boldsymbol{\theta}$.

### Model Class: Notation

Formally, the `model class is a set`:

$$
\mathcal{M} = \{f \mid f : \mathcal{X} \to \mathcal{Y} \}
$$

of `possible models` that map input features to targets.

When the models $f_\theta$ are parametrized by *parameters* $\boldsymbol{\theta} \in \Theta$ living in some set $\Theta$, we can also write:

$$
\mathcal{M} = \{f_\theta \mid f_\theta : \mathcal{X} \to \mathcal{Y}; \; \boldsymbol{\theta} \in \Theta \}.
$$
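To make the idea of a parametrized model class concrete, the sketch below builds two members of a small model class in Python. The helper `make_model` and the one-feature linear form $f_\theta(x) = \theta_0 + \theta_1 x$ are assumptions chosen purely for illustration; any other parametrized family could be used in the same way.

```python
def make_model(theta0, theta1):
    """Return the member f_theta of an (assumed) model class of straight lines."""
    def f_theta(x):
        return theta0 + theta1 * x
    return f_theta

# Two different parameter vectors give two different models in the same class.
f_a = make_model(1.0, 2.0)
f_b = make_model(0.0, -1.0)

print(f_a(3.0))  # 1 + 2 * 3 = 7.0
print(f_b(3.0))  # 0 - 1 * 3 = -3.0
```

Each choice of $\boldsymbol{\theta}$ picks out one function, so ranging over all $\boldsymbol{\theta} \in \Theta$ traces out the whole set $\mathcal{M}$.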

### Model Class Example: Linear Regression Model Family

One simple approach is to assume that $\textbf{x}_i$ and $y_i$ are related by a `linear function` of the form:


$$
y_i  = \theta_0 + \theta_1 \cdot x_{i1} + \theta_2 \cdot x_{i2} + ... + \theta_d \cdot x_{id}+ \epsilon_i,
$$

where $i=1,\ldots,n$, $\boldsymbol{\theta}=(\theta_0,\theta_1,\ldots,\theta_d)^T$ is the vector of **parameters** of the model, and $\epsilon_i \sim N(0,\sigma^2)$ is the **noise** term, i.e., the variation in $y_i$ not captured by $\textbf{x}_i$. 

We call this model family the `multiple linear regression model`.

By using the notation $x_{i0} = 1$, we can write:

$$ 
f_\theta(\textbf{x}_i) = \sum_{j=0}^d \theta_j \cdot x_{ij} = \boldsymbol{\theta}^\top \textbf{x}_i,
$$

where $\boldsymbol{\theta}=(\theta_0,\theta_1,\ldots,\theta_d)^T$ and $\textbf{x}_i=(1,x_{i1}, \ldots,x_{id})^T$. We can then represent the `multiple linear regression model` in a **vectorized form**:

$$
y_i  = f_\theta(\textbf{x}_i)+ \epsilon_i = \boldsymbol{\theta}^\top \textbf{x}_i + \epsilon_i,
$$

where $i=1,\ldots,n$.
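The equivalence between the sum form and the vectorized form can be checked numerically. The following is a minimal sketch on a made-up toy feature matrix (the values of `X` and `theta` are assumptions, not taken from the diabetes data): it prepends the constant feature $x_{i0}=1$ and compares $\boldsymbol{\theta}^\top \textbf{x}_i$ with the explicit sum.

```python
import numpy as np

# Toy feature matrix with n = 3 observations and d = 2 features (made-up values).
X = np.array([[0.5, 1.0],
              [1.5, -2.0],
              [2.0, 0.0]])
theta = np.array([1.0, 2.0, -1.0])  # (theta_0, theta_1, theta_2)

# Prepend the constant feature x_{i0} = 1 to every row.
X_aug = np.column_stack([np.ones(X.shape[0]), X])

# Vectorized form: entry i of X_aug @ theta equals theta^T x_i.
f_vectorized = X_aug @ theta

# Explicit sum form, theta_0 + sum_j theta_j * x_ij, for comparison.
f_sum = np.array([
    theta[0] + sum(theta[j + 1] * X[i, j] for j in range(X.shape[1]))
    for i in range(X.shape[0])
])

print(f_vectorized)                      # values 1, 6 and 5
print(np.allclose(f_vectorized, f_sum))  # True
```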



<!-- By using the notation $x_0 = 1$, we can represent the model in a vectorized form
$$ y = \sum_{j=0}^d \beta_j \cdot x_j = \vec \beta \cdot \vec x. $$
where $\vec x$ is a vector of features. -->

For example, we can investigate the relationship between BMI and diabetes risk by assuming that risk is a linear function of BMI:

$$
y_{{Riskscore}_i}  = f_\theta(x_{{BMI}_i})+ \epsilon_i= \theta_0 + \theta_1 \cdot x_{{BMI}_i} +  \epsilon_i,
$$

<!--where $x$ is the BMI (also called the dependent variable), and $y$ is the diabetes risk score (the independent variable). -->

where $\theta_0$ and $\theta_1$ are the intercept and the slope parameters of the line that relates $x_{{BMI}_i}$ to $f_\theta(x_{{BMI}_i})=E(y_{{Riskscore}_i})=\theta_0 + \theta_1 \cdot x_{{BMI}_i}$.

### Objective Function: Notation

<!-- Given a training set, how do we pick the parameters $\theta$ for the model? A natural approach is to select $\theta$ such that $f_\theta(x^{(i)})$ is close to $y^{(i)}$ on a training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$. -->

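Given a training set, how do we pick a model $f$ from the model class? A natural approach is to choose $f$ such that $f(\textbf{x}_i)$ is close to $y_i$ on the training dataset. To capture this intuition, we define an `objective function` (also called a `loss function`):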

$$
J : \mathcal{M} \to [0, \infty), 
$$

which describes the extent to which $f$ "fits" the training data $\mathcal{D} = \{(\textbf{x}_i, y_i) \mid i = 1,2,...,n\}$.

When $f$ is parametrized by $\boldsymbol{\theta} \in \Theta$, the objective function becomes:

$$
J : \Theta \to [0, \infty).
$$

### Objective Function: Examples

What are some possible objective functions? We will see many, but here are a few examples:
* Mean squared error (scaled by $\tfrac{1}{2}$, a common convention that simplifies derivatives): $$J(\boldsymbol{\theta}) = \frac{1}{2n} \sum_{i=1}^n \left(y_{i}-f_\theta(\textbf{x}_i)  \right)^2.$$
* Mean absolute (L1) error: $$J(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^n \left|y_{i} -f_\theta(\textbf{x}_i) \right|.$$

These are defined for the training dataset $\mathcal{D} = \{(\textbf{x}_i, y_i) \mid i = 1,2,...,n\}$.
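As a small sketch, the two objectives can be written directly with NumPy. The function names `half_mse` and `mean_abs_error` and the toy arrays are assumptions introduced here for illustration; the formulas follow the definitions above, including the $\frac{1}{2n}$ scaling of the squared error.

```python
import numpy as np

def half_mse(y, y_pred):
    """Squared-error objective with the 1/(2n) scaling."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sum((y - y_pred) ** 2) / (2 * len(y))

def mean_abs_error(y, y_pred):
    """Absolute (L1) error objective with the 1/n scaling."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sum(np.abs(y - y_pred)) / len(y)

# Toy targets and model outputs f_theta(x_i) (made-up values).
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 3.0, 5.0])

print(half_mse(y_true, y_hat))        # (0 + 1 + 4) / 6
print(mean_abs_error(y_true, y_hat))  # (0 + 1 + 2) / 3
```

The squared error penalizes the largest residual (here, 2) more heavily than the absolute error does, which is one reason the two objectives can prefer different models.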

### Optimizer: Notation

At a high level, an `optimizer` takes an objective $J$ and a model class $\mathcal{M}$ and finds a model $f \in \mathcal{M}$ with the smallest value of the objective $J$:

\begin{align*}
\min_{f \in \mathcal{M}} J(f).
\end{align*}

Intuitively, this is the function that best "fits" the training dataset.

When $f$ is parametrized by $\boldsymbol{\theta} \in \Theta$, the optimizer minimizes a function $J(\boldsymbol{\theta})$ over all $\boldsymbol{\theta} \in \Theta$:

\begin{align*}
\hat{\theta}_{optimum} = \arg\min_{\boldsymbol{\theta} \in \Theta} J(\boldsymbol{\theta}).
\end{align*}
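To illustrate what "finding the minimizer over $\Theta$" means, here is a deliberately naive sketch: it restricts $\Theta$ to a small finite grid and evaluates $J$ at every candidate. The toy data, the grid, and the brute-force search are all assumptions made for illustration; practical optimizers do not search this way.

```python
import itertools
import numpy as np

# Noise-free toy data generated from theta = (1, 2) (made-up values).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

def J(theta):
    """Squared-error objective (1/(2n) scaling) for a one-feature linear model."""
    theta0, theta1 = theta
    residuals = y - (theta0 + theta1 * x)
    return np.sum(residuals ** 2) / (2 * len(y))

# A finite stand-in for Theta: all pairs on the grid 0, 0.5, ..., 3.
grid = np.linspace(0.0, 3.0, 7)
Theta = list(itertools.product(grid, grid))

# The "optimizer": pick the candidate with the smallest objective value.
theta_best = min(Theta, key=J)
print(theta_best, J(theta_best))
```

Because the data were generated without noise from $\boldsymbol{\theta} = (1, 2)$ and that point lies on the grid, the search recovers it with objective value zero.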

### Optimizer: Example

We will see that behind the scenes, the [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/linear_model.html#linear-model) algorithm minimizes the sum of squared errors:

\begin{align*}
\hat{\theta}_{optimum} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^{d+1}}\sum_{i=1}^n \left(y_{i} - f_\theta(\textbf{x}_i)\right)^2.
\end{align*}
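One way to check this claim empirically is to compare the fitted coefficients with an independent least-squares solve. The sketch below assumes scikit-learn and NumPy are installed and uses only the `bmi` feature. The comparison against `np.linalg.lstsq` is an illustration added here, not a description of how scikit-learn is implemented internally.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Use BMI as the single feature.
X_df, y_sr = load_diabetes(return_X_y=True, as_frame=True)
X = X_df[["bmi"]].to_numpy()
y = y_sr.to_numpy()

# Fit with scikit-learn and collect (theta_0, theta_1).
reg = LinearRegression().fit(X, y)
theta_sklearn = np.array([reg.intercept_, reg.coef_[0]])

# Solve the same least-squares problem directly on the augmented matrix [1, x].
X_aug = np.column_stack([np.ones(len(X)), X])
theta_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(theta_sklearn)
print(theta_lstsq)
```

If both approaches minimize the same sum of squared errors, the two parameter vectors should agree up to numerical precision.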

### Predictive Model: Predictions Using Supervised Learning

After **obtaining** $\hat{\theta}_{optimum}$ from **training data**, given a `new` $\textbf{x}_{new}$, we can output a `predicted` $\hat{y}$ as:

$$ 
\hat{y} = \widehat{E(y)}=\hat{f}(\textbf{x}_{new}) = \widehat{\boldsymbol{\theta}}^\top \textbf{x}_{new}. 
$$

For instance, for a given new value of $x_{BMI}$, we can predict the diabetes risk score as:

$$ 
\hat{y}_{{Riskscore}} = \hat{f}(x_{BMI}) = \hat{\theta}_0 + \hat{\theta}_1 \cdot x_{BMI}. 
$$
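The prediction formula can be sketched in code as follows, assuming scikit-learn is available. The new value `x_bmi_new = 0.05` is hypothetical. Note that the features in this dataset are already centered and scaled, so it is a value on that scaled axis rather than a raw BMI.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Fit the one-feature model: risk ~ theta_0 + theta_1 * BMI.
X_df, y_sr = load_diabetes(return_X_y=True, as_frame=True)
x_bmi = X_df[["bmi"]].to_numpy()
reg = LinearRegression().fit(x_bmi, y_sr.to_numpy())
theta_hat = np.array([reg.intercept_, reg.coef_[0]])

# A hypothetical new patient, on the dataset's scaled BMI axis.
x_bmi_new = 0.05
x_new = np.array([1.0, x_bmi_new])  # prepend the constant feature

# Prediction by hand (theta_hat^T x_new) and via the fitted estimator.
y_hat_manual = theta_hat @ x_new
y_hat_sklearn = reg.predict(np.array([[x_bmi_new]]))[0]

print(y_hat_manual, y_hat_sklearn)
```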

### Metrics and scoring: quantifying the quality of predictions

We can measure the `quality of the predictions` on the `test set`, that is, the `model's predictive performance on the test set`, using several metrics. The appropriate metrics depend on the type of supervised learning problem: there are regression metrics, classification metrics, and so on.
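As a minimal sketch of this workflow for a regression problem, the code below holds out part of the diabetes data as a test set and computes two common regression metrics. The 80/20 split, the `random_state`, and the choice of mean squared error and $R^2$ are assumptions made for illustration, not recommendations from the lecture.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the observations as a test set (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the training set only, then predict on the unseen test set.
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Regression metrics evaluated on the test set.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(mse, r2)
```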

# References

- https://github.com/kuleshov/cornell-cs5785-2020-applied-ml/tree/main/notebooks

In [3]:
import session_info
session_info.show()