In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Introduction to Machine Learning

## What is Machine Learning?
* Its rise is tied to big data: data is now collected and stored at an unprecedented rate
* ML uses computers to automatically detect patterns in data and make predictions or decisions
* It is a subset of Artificial Intelligence (AI)

* Machine Learning Algorithms are able to learn information directly from data without relying on a predetermined mathematical model
![](MLvenn.jpeg)

Another image incorporating Data Science/Big data
![](AIvenn.webp)

#### There are 3 main Learning Paradigms:
1. **Supervised Learning**: Given a set of input patterns $X$ and a corresponding set of labels $Y$, the underlying mapping function $f:X\rightarrow Y$ is learned. The model operates in two distinct phases: a training phase (model tuning) and an operating phase, in which the model is kept fixed and tested on new data.<br>
2. **Unsupervised Learning**: Given a set of data $X$ with no labels, the underlying structure in the data is discovered to provide an internal representation.<br>
3. **Reinforcement Learning**: A network (the actor) controls a system by generating actions according to the current system state. The actions chosen by the actor are evaluated by another network (the critic), which generates a reinforcement signal (a reward or punishment) so that the actor can improve its performance.

![](supunsup.png)

## Motivating Example
* The number of columns, also called the *features*, in the table below is $d=3$
* The number of datapoints, also called *training examples* is $n=20$

- Features:
    * Inhabitants: Number of inhabitants
    * BelowIncome: the percentage of families with incomes below 5000 dollars
    * PercentageUnemployed: the percentage unemployed
- Target:
    * Murders: the number of murders per 1 million inhabitants per annum

In [6]:
df = pd.DataFrame(index = np.arange(0,20),
                 data = {
                     'inhabitants':[687000,123445,127345,867345,283457,123456,235869,783753,875934,456789,687000,123445,127345,867345,283457,123456,235869,783753,875934,456789],
                     'BelowIncome':[16.5,20.5,26.3,16.5,16.2,18.5,20.2,21.2,17.2,15.2,18.1,23.1,19.1,25.7,18.6,24.7,17.0,22.5,20.2,16.9],
                     'PercentageUnemployed':[6.2,6.4,9.3,5.3,7.3,5.8,6.4,7.4,4.9,6.4,6.0,7.4,6.8,8.8,6.5,8.3,8.7,8.0,8.4,6.7],
                     'Murders':[11.2,13.4,40.7,5.3,14.8,12.7,20.8,35.7,8.7,9.5,15.5,26.9,15.2,25.2,18.1,26.8,15.3,25.8,21.7,25.7]
                 })
df.head(10)

| |inhabitants|BelowIncome|PercentageUnemployed|Murders|
|-|-----------|-----------|--------------------|-------|
|0|687000|16.5|6.2|11.2|
|1|123445|20.5|6.4|13.4|
|2|127345|26.3|9.3|40.7|
|3|867345|16.5|5.3|5.3|
|4|283457|16.2|7.3|14.8|
|5|123456|18.5|5.8|12.7|
|6|235869|20.2|6.4|20.8|
|7|783753|21.2|7.4|35.7|
|8|875934|17.2|4.9|8.7|
|9|456789|15.2|6.4|9.5|


## Motivating Example
Let's consider the first 3 training examples from the table

|Index| Inhabitants|BelowIncome|PercentageUnemployed|Murders|
|-----|------------|-----------|--------------------|-------|
|0|687000|16.5|6.2|11.2|
|1|123445|20.5|6.4|13.4|
|2|127345|26.3|9.3|40.7|

We can then define the *feature vectors* and the *targets* as:<br>
$x_1=[687000,16.5,6.2];\hspace{10pt}y_1=11.2$<br>
$x_2=[123445,20.5,6.4];\hspace{10pt}y_2=13.4$<br>
$x_3=[127345,26.3,9.3];\hspace{10pt}y_3=40.7$
* $x_i$ is the *feature vector* for example $i$
* $y_i$ is the *target* for example $i$
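
As a minimal sketch, the feature vectors and targets can be pulled out of a DataFrame as NumPy arrays. The block rebuilds the first three rows of the `df` defined in the code cell above so that it runs on its own; the names `df_small`, `feature_cols`, `X`, and `y` are illustrative assumptions, not part of the original notebook. Note that the table index starts at 0 while the math notation starts at $x_1$.

```python
import numpy as np
import pandas as pd

# First three rows of `df`, repeated here so the block is self-contained.
df_small = pd.DataFrame({
    "inhabitants": [687000, 123445, 127345],
    "BelowIncome": [16.5, 20.5, 26.3],
    "PercentageUnemployed": [6.2, 6.4, 9.3],
    "Murders": [11.2, 13.4, 40.7],
})

feature_cols = ["inhabitants", "BelowIncome", "PercentageUnemployed"]

X = df_small[feature_cols].to_numpy(dtype=float)  # feature matrix, shape (n, d)
y = df_small["Murders"].to_numpy(dtype=float)     # target vector, shape (n,)

x_1, y_1 = X[0], y[0]  # first training example (row index 0 in the table)
print(X.shape, y.shape)
print(x_1, y_1)
```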

Our goal is to find the mapping $\mathbf{\mathcal{f}}$ such that:<br>
$\hspace{20pt} y_i\approx\mathbf{\mathcal{f}}(x_i)\hspace{10pt}$for $i\in\{1,\dots,20\}$

## Supervised Learning
When $y_i$ is continuous, this is a regression problem; when $y_i$ is a discrete class label, it is a classification problem.

### Supervised Learning: Regression
Let's consider the **linear regression** problem.
___
Feature vector: $\hspace{10pt}x_i=[x_{i1},x_{i2},\dots,x_{id}]$

Target: $\hspace{10pt}y_i$

Prediction: $\hspace{10pt}\hat{y}_i = w_1x_{i1}+w_2x_{i2}+\dots+w_dx_{id}\hspace{10pt}$where,<br>

* $w_j$ is the weight of feature $j$
* $x_{ij}$ is the $j$-th feature of the $i$-th datapoint
* $\hat{y}_i$ is the prediction for the $i$-th datapoint, the **weighted sum of its features**
___
To bring the prediction close to the target value, we try to make the difference $y_i-\hat{y}_i$ as small as possible in magnitude
___
We can expand the *residual* $y_i-\hat{y}_i$ as:

$\hspace{20pt}y_i-\hat{y}_i = y_i-\sum_{j=1}^d w_jx_{ij}$

We need to tune the weights such that the prediction error, or residual, is minimized.
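
The prediction and residual can be sketched in a few lines of NumPy. The two datapoints are the first two rows of `df`; the weight vector `w` is a hypothetical choice made only for illustration and is not fitted to anything.

```python
import numpy as np

# Two datapoints with d = 3 features each (first two rows of `df`).
X = np.array([[687000, 16.5, 6.2],
              [123445, 20.5, 6.4]])
y = np.array([11.2, 13.4])

# Hypothetical weights, chosen only for illustration (not fitted).
w = np.array([1e-5, 0.5, 1.0])

# Vectorized prediction: y_hat[i] = sum_j w[j] * X[i, j]
y_hat = X @ w

# The same weighted sum written out with explicit loops over i and j.
y_hat_loop = np.array([
    sum(w[j] * X[i, j] for j in range(X.shape[1]))
    for i in range(X.shape[0])
])

residual = y - y_hat  # y_i - y_hat_i for every datapoint
print(y_hat, residual)
```

Changing `w` changes every residual at once, which is why the weights are tuned against all datapoints together rather than one at a time.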


### Minimizing the Least Squares Error
What is the classic way to minimize this error when we have $n$ training examples?
___
We can minimize the **Least Squares Error**
___
We find $\bar{w}$ such that it minimizes $\sum_{i=1}^n(y_i-\hat{y}_i)^2$,<br>where $\bar{w}=[w_1,w_2,\dots,w_d]^T$

Writing the summation in matrix form, we get:

$\sum_{i=1}^n(y_i-\hat{y}_i)^2 = \sum_{i=1}^n(y_i-\bar{w}^Tx_i)^2=\lVert\mathbf{X}\bar{w}-\bar{y}\rVert^2$
* $\bar{w}$ is an unknown $d\times 1$ column vector, called the *weight vector*
* $\bar{y}$ is a known $n\times 1$ column vector, called the *target vector*
* $\mathbf{X}$ is a known $n\times d$ matrix, called the *feature matrix*, whose $i$-th row is $x_i^T$
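
The equality between the sum of squared residuals and the squared norm can be checked numerically. This is a small sketch on hypothetical random data (fixed seed); the sizes `n = 20` and `d = 3` mirror the example, but the values themselves are made up.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data below is hypothetical
n, d = 20, 3
X = rng.normal(size=(n, d))     # feature matrix, one datapoint per row
y = rng.normal(size=n)          # target vector
w = rng.normal(size=d)          # arbitrary weight vector

# Left-hand side: sum over datapoints of (y_i - w^T x_i)^2
sse_sum = sum((y[i] - w @ X[i]) ** 2 for i in range(n))

# Right-hand side: squared Euclidean norm of X w - y
sse_norm = np.linalg.norm(X @ w - y) ** 2

print(np.isclose(sse_sum, sse_norm))
```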
---

We find the solution by solving the optimization problem:<br>
<p style="text-align: center;"> 
$\text{minimize}\hspace{4pt} \lVert \mathbf{X}\bar{w}-\bar{y}\rVert^2\hspace{4pt}\text{w.r.t.}\hspace{4pt}\bar{w}\in\mathbb{R}^d$
</p>

* We must find the gradient, set it to zero, and solve for $\bar{w}$

* The gradient, which is just a vector of partial derivatives, is given by $\nabla f(\bar{w}) = [\frac{\partial f}{\partial w_1},\frac{\partial f}{\partial w_2},\dots,\frac{\partial f}{\partial w_d}]^T$

* We have framed the *Least Squares problem* as: $\text{minimize}\hspace{4pt}\lVert \mathbf{X}\bar{w}-\bar{y}\rVert^2\hspace{4pt}\text{w.r.t.}\hspace{4pt}\bar{w}\in\mathbb{R}^d$

* An equivalent problem, with the same minimizer but a simpler gradient, is: $\text{minimize}\hspace{4pt}\frac{1}{2}\lVert \mathbf{X}\bar{w}-\bar{y}\rVert^2\hspace{4pt}\text{w.r.t.}\hspace{4pt}\bar{w}\in\mathbb{R}^d$

* Therefore, we get (**NOTE**: the bars on the vectors are omitted for readability)

$\mathcal{f}(\bar{w}) =\frac{1}{2}\lVert \mathbf{X}\bar{w}-\bar{y}\rVert^2 = \frac{1}{2}(\mathbf{X}w-y)^T(\mathbf{X}w-y)$

$\hspace{79pt}=\frac{1}{2}(w^T\mathbf{X}^T-y^T)(\mathbf{X}w-y)$

$\hspace{79pt}=\frac{1}{2}(w^T\mathbf{X}^T(\mathbf{X}w-y)-y^T(\mathbf{X}w-y))$

$\hspace{79pt}=\frac{1}{2}(w^T\mathbf{X}^T\mathbf{X}w-w^T\mathbf{X}^Ty-y^T\mathbf{X}w+y^Ty)$

$\hspace{79pt}=\frac{1}{2}(w^T\mathbf{X}^T\mathbf{X}w-\underbrace{((\mathbf{X}^Ty)^Tw)^T}_{1\times 1}-\underbrace{(\mathbf{X}^Ty)^Tw}_{1\times 1}+y^Ty)$

Since a $1\times 1$ quantity equals its own transpose, the two middle terms are equal and combine:

$\hspace{79pt}=\frac{1}{2}(w^T\mathbf{X}^T\mathbf{X}w-2(\mathbf{X}^Ty)^Tw+y^Ty)$

This boils the objective function down to:
* A quadratic term (with symmetric matrix $\mathbf{X}^T\mathbf{X}$), $\frac{1}{2}(\mathbf{X}\bar{w})^T\mathbf{X}\bar{w}$
* A linear term, $-(\mathbf{X}^T\bar{y})^T\bar{w}$
* A constant, $\frac{1}{2}\bar{y}^T\bar{y}$

We now compute the gradient as:

$\frac{\partial}{\partial \bar{w}}\mathcal{f}(\bar{w}) = \frac{\partial}{\partial\bar{w}}\frac{1}{2}(\bar{w}^T\mathbf{X}^T\mathbf{X}\bar{w}-2(\mathbf{X}^T\bar{y})^T\bar{w}+\bar{y}^T\bar{y})$

Using $\nabla_{w}(w^T\mathbf{A}w)=2\mathbf{A}w$ for symmetric $\mathbf{A}$ and $\nabla_{w}(b^Tw)=b$, we get:

$\hspace{5pt}\nabla \mathcal{f}(\bar{w})=\mathbf{X}^T\mathbf{X}\bar{w}-\mathbf{X}^T\bar{y}$
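
One way to gain confidence in the gradient formula is to compare it against a finite-difference approximation. The sketch below does this on hypothetical random data; the helper names `f`, `grad_f`, and `numerical_grad` are assumptions introduced for illustration.

```python
import numpy as np

def f(w, X, y):
    """Objective 0.5 * ||X w - y||^2."""
    r = X @ w - y
    return 0.5 * (r @ r)

def grad_f(w, X, y):
    """Analytic gradient X^T X w - X^T y."""
    return X.T @ X @ w - X.T @ y

def numerical_grad(w, X, y, eps=1e-6):
    """Central-difference approximation of each partial derivative."""
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (f(w + step, X, y) - f(w - step, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(1)  # fixed seed; the data below is hypothetical
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

print(np.allclose(grad_f(w, X, y), numerical_grad(w, X, y), atol=1e-5))
```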

Setting the gradient to zero to find the minimizer:

$\mathbf{X}^T\mathbf{X}\bar{w}-\mathbf{X}^T\bar{y}=0$

$(\mathbf{X}^T\mathbf{X})\bar{w} = \mathbf{X}^T\bar{y}$

$(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{X})\bar{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\bar{y}$

**NOTE**: Here, we assume that $\mathbf{X}^T\mathbf{X}$ is invertible, which holds when the columns of $\mathbf{X}$ are linearly independent

$\therefore\bar{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\bar{y}$

This is the **Least Squares Solution**.
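
A minimal sketch of the closed-form solution, using the first ten rows of `df`. Two choices here are assumptions and not part of the derivation: inhabitants are rescaled to millions to keep $\mathbf{X}^T\mathbf{X}$ well conditioned, and the linear system is solved with `np.linalg.solve` instead of forming the inverse explicitly. As in the text, the model has no intercept term.

```python
import numpy as np

# First ten rows of `df`; inhabitants are rescaled to millions (an assumption
# made here only to keep X^T X well conditioned).
inhabitants = np.array([687000, 123445, 127345, 867345, 283457,
                        123456, 235869, 783753, 875934, 456789]) / 1e6
below_income = np.array([16.5, 20.5, 26.3, 16.5, 16.2, 18.5, 20.2, 21.2, 17.2, 15.2])
unemployed = np.array([6.2, 6.4, 9.3, 5.3, 7.3, 5.8, 6.4, 7.4, 4.9, 6.4])
y = np.array([11.2, 13.4, 40.7, 5.3, 14.8, 12.7, 20.8, 35.7, 8.7, 9.5])

X = np.column_stack([inhabitants, below_income, unemployed])  # shape (10, 3)

# Solve (X^T X) w = X^T y without computing the inverse explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(np.allclose(w_normal, w_lstsq))
```

At the solution, the gradient $\mathbf{X}^T\mathbf{X}\bar{w}-\mathbf{X}^T\bar{y}$ should be numerically zero, which can be checked with `X.T @ (X @ w_normal - y)`.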