<a href="https://colab.research.google.com/github/stepyt/machine_learning_notes/blob/master/lectures/Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INTRODUCTION

Let's start by answering the principal question: *what is Machine Learning (ML)?*

One of the earliest and most widely used definitions comes from Mitchell (1997), who stated:

> "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

We can simplify this definition by saying that an ML system is one that improves itself as it collects data: more experience $\rightarrow$ better performance.

ML is a subfield of Artificial Intelligence (AI) based on *induction*: we start from observations and try to extract general rules from them. It is not a perfect field, since wrong observations unavoidably lead to wrong models. It relies on extracting information from data, but it cannot extract information that is not contained in the data, and it cannot extract all of the information that is there! There is nothing magic behind it, contrary to what people tend to think.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/funny/aicat.JPG?raw=1" width="400">
</p>

The approach of ML to programming is completely different from the standard one:
<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_1.png?raw=1" width="400">
</p>

We rely on ML because it is sometimes really difficult, or even impossible, to hard-code the rule behind the phenomenon under consideration. ML makes it possible to automatically extract relevant information from data $\rightarrow$ *automating the automation*.
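
As a minimal sketch of this difference, consider a toy spam filter whose only feature is the number of links in a message. Everything here is an illustrative assumption (the feature, the data, the hand-picked threshold and the function names); the point is only that in the first function the programmer writes the rule, while in the second the rule is extracted from labelled examples.

```python
# Standard programming: the programmer writes the rule by hand.
def is_spam_handwritten(num_links):
    return num_links > 3  # threshold chosen by the programmer (hypothetical)


# Machine learning: the rule (here, a threshold) is extracted from labelled data.
def learn_threshold(samples):
    """Return the threshold that makes the fewest mistakes on the samples.

    samples: list of (num_links, is_spam) pairs.
    """
    values = sorted({x for x, _ in samples})
    # Candidate thresholds: one below the smallest value, then each observed value.
    candidates = [values[0] - 1] + values

    def mistakes(threshold):
        return sum((x > threshold) != label for x, label in samples)

    return min(candidates, key=mistakes)


data = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]
threshold = learn_threshold(data)


def is_spam_learned(num_links):
    return num_links > threshold


print(threshold, is_spam_learned(4))
```

With different data the learned threshold changes and no code has to be rewritten, which is the sense in which the automation itself is automated.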

The main goal is to obtain models that generalize to unseen data. This is very difficult because we cannot measure the objective function that we actually want to optimize (the one on unseen data), so we minimize a different one, measured on the data we have, hoping that it is similar to the true one.
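
One common way to write this down, using notation that is assumed here rather than taken from the text ($h$ for a candidate model, $x$ and $t$ for an input and its output, $\ell$ for a per-sample loss, $p(x,t)$ for the unknown data distribution and $N$ for the number of training samples), is to distinguish the loss we would like to minimize from the one we can actually compute:

$$
L_{\text{true}}(h) = \mathbb{E}_{(x,t)\sim p}\big[\ell(h(x),t)\big],
\qquad
L_{\text{train}}(h) = \frac{1}{N}\sum_{n=1}^{N}\ell(h(x_n),t_n)
$$

Only $L_{\text{train}}$ can be evaluated, because $p$ is unknown; the hope is that a model with a small $L_{\text{train}}$ also has a small $L_{\text{true}}$.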

ML is becoming widespread in many fields, among them:
* Computer vision and robotics
* Biology and medicine
* Space exploration
* Manufacturing
* Finance

Just to give an idea of how much hype surrounds it, look at the pictures below:

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_2.png?raw=1" width="500" hspace="20"> <img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_3.png?raw=1" width="500">
</p>

NeurIPS (Conference on Neural Information Processing Systems) is one of the oldest and most important ML conferences in the world. NeurIPS 2018 sold out in about $11'38''$ (11 minutes and 38 seconds).

ML is divided into 3 main subareas:
* *Supervised Learning*: the goal is to learn a model that is able to explain the relationship between input and output.
* *Unsupervised Learning*: the goal is to learn the best possible representation of data. In this learning paradigm just the inputs are used.
* *Reinforcement Learning*: the goal is to learn how to make decisions in order to control and optimize.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_5.png?raw=1" width="400">
</p>

### **Supervised learning**
Given a training set $\mathcal{D}=\{x,t\}$ of known inputs $x$ and known outputs $t$, we want to estimate the unknown model that maps inputs to outputs $\rightarrow t=f(x)$.

It can be used for:
* *Classification*: the targets take values in a finite, discrete and usually small set.
* *Regression*: the outputs are continuous.
* *Probability estimation*: this is similar to regression but with a constraint on the function since it must be a probability function.

The input variables are called *features*, predictors or attributes.

The output variables are called *targets*, responses or labels.
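
To make the classification and regression settings concrete, here is a small sketch that estimates $f$ from a training set of $(x, t)$ pairs. It uses a 1-nearest-neighbor rule, which is just one simple choice of model picked for illustration; the data sets and names are hypothetical.

```python
def nearest_neighbor_model(training_set):
    """Return h(x) that outputs the target of the closest training input.

    training_set: list of (x, t) pairs with numeric inputs x.
    """
    def h(x):
        _, closest_t = min(training_set, key=lambda pair: abs(pair[0] - x))
        return closest_t
    return h


# Classification: the targets come from a small discrete set of labels.
fruit_by_weight = [(110.0, "lemon"), (120.0, "lemon"), (900.0, "melon"), (1100.0, "melon")]
classify = nearest_neighbor_model(fruit_by_weight)
print(classify(150.0), classify(950.0))

# Regression: the targets are continuous values.
price_by_size = [(50.0, 100.5), (80.0, 161.0), (120.0, 238.9)]
predict_price = nearest_neighbor_model(price_by_size)
print(predict_price(85.0))
```

The same learning procedure covers both cases; what changes is only the type of the targets $t$.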

### **Unsupervised learning**
Given a training set $\mathcal{D}=\{x\}$ we want to learn a more efficient representation of a set of unknown inputs $\rightarrow$ $?=f(x)$.

It can be used for:
* Compression
* Clustering
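
As an illustration of clustering, the sketch below runs k-means, one widely used clustering algorithm, on a handful of scalar inputs. Only the inputs are used and no targets are involved. The data, the initial centers and the function name are assumptions made for the example.

```python
def kmeans_1d(points, centers, max_iters=100):
    """Plain k-means on scalars: alternate an assignment step and a mean update."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the centers would not move either
        assignment = new_assignment
        # Update step: each center becomes the mean of the points assigned to it.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return centers, assignment


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, labels = kmeans_1d(points, centers=[min(points), max(points)])
print(centers, labels)
```

Replacing every point by the index of its center is also a crude form of compression: six numbers are summarized by two centers and six small labels.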

### **Reinforcement learning**
Given a training set $\mathcal{D}=\{x,u,x',r\}$, where $x$ is the current state, $u$ is the control, $x'$ is the state reached after applying the control and $r$ is the reward of the action, we want to learn an optimal policy for choosing the action $\rightarrow \pi^*(x) = \arg\max_u Q^*(x,u)$.

In reinforcement learning we want to optimize the reward over a long horizon, possibly performing an action that looks bad now in order to achieve a better reward in the future.
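
The sketch below illustrates both ideas on a tiny, hypothetical, deterministic problem: it estimates $Q$ from a batch of $(x, u, x', r)$ transitions and then reads off the greedy policy $\pi(x) = \arg\max_u Q(x,u)$. The discount factor `gamma`, the states, the actions and the rewards are all assumptions for illustration, and the update rule (each $Q(x,u)$ is overwritten with $r + \gamma \max_{u'} Q(x',u')$) is only adequate because the toy problem is deterministic.

```python
from collections import defaultdict

# Batch of transitions (x, u, x_next, r); x_next = None marks the end of an episode.
dataset = [
    ("s0", "safe", None, 1.0),     # small reward now, then the episode ends
    ("s0", "invest", "s1", -1.0),  # looks bad now...
    ("s1", "collect", None, 10.0), # ...but enables a large reward later
]


def batch_q_iteration(dataset, gamma=0.9, sweeps=20):
    """Repeatedly apply Q(x,u) = r + gamma * max_u' Q(x',u') over the batch."""
    q = defaultdict(dict)
    for x, u, _, _ in dataset:
        q[x][u] = 0.0
    for _ in range(sweeps):
        for x, u, x_next, r in dataset:
            future = max(q[x_next].values()) if x_next in q else 0.0
            q[x][u] = r + gamma * future
    return q


def greedy_policy(q):
    """pi(x) = argmax_u Q(x, u) for every state seen in the batch."""
    return {x: max(q_x, key=q_x.get) for x, q_x in q.items()}


q = batch_q_iteration(dataset)
policy = greedy_policy(q)
print(policy)
```

In state `s0` the immediate reward favors `safe`, but the estimated $Q$ values favor `invest`, because the discounted future reward outweighs the initial loss.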

In this series of lectures we will mainly focus on supervised and reinforcement learning.


## **Overview of Supervised Learning**
We want to approximate the real model $f$ given the data set $\mathcal{D}$ and in order to do this we must follow three steps:
1. Define the *loss function* $\mathcal{L}$
2. Define the *hypothesis space* $\mathcal{H}$
3. Optimize to find an *approximate model* $h$
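
The three steps can be sketched on a toy regression problem. The specific choices below (mean squared error as the loss, lines through the origin as the hypothesis space, gradient descent as the optimizer, and the data itself) are assumptions made for illustration, not the only possible ones.

```python
# Toy data set D = {(x, t)}: hypothetical numbers, roughly t = 2x plus noise.
xs = [1.0, 2.0, 3.0]
ts = [2.1, 3.9, 6.2]


# Step 1: loss function L, here the mean squared error on D.
def loss(h):
    return sum((h(x) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)


# Step 2: hypothesis space H, here all lines through the origin h_w(x) = w * x.
def make_h(w):
    return lambda x: w * x


# Step 3: optimize over H, here with plain gradient descent on the parameter w.
def fit(steps=200, learning_rate=0.05):
    w = 0.0
    for _ in range(steps):
        # Derivative of the mean squared error with respect to w.
        grad = sum(2 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)
        w -= learning_rate * grad
    return w


w = fit()
h = make_h(w)
print(w, loss(h))
```

Changing any one of the three ingredients (a different loss, a richer hypothesis space, another optimizer) generally leads to a different approximate model $h$.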

In the pictures below the value of the loss function is represented in grey scale. From white to black the value of the loss function increases. Obviously, the loss is at its minimum where the true function is located. When we define the first hypothesis space $\mathcal{H}_1$ we start searching for the best possible approximation of $f$ inside it, which in this case is $h_1$. If we knew the true $f$, we could calculate the error made by $h_1$, but this is not the case: we don't know the real model.

If we knew the true function, the problem of finding a good approximation of it would be called function approximation. In ML we don't know the real $f$, so we can't calculate the error exactly. All we have is a set of samples, so we can obtain only a noisy measure of the true underlying function. Since in supervised learning we only have partial and noisy information about $f$, it is possible to learn a wrong behavior of the system.

Ideally, by enlarging the hypothesis space as in $\mathcal{H}_2$, we should be able to reach the real minimum of the loss and so the true $f$. However, because we only have samples of the real function, the minimum of the loss is not precisely at $f$ but moves somewhere else. Consider the last picture, where the noisy loss function is taken into account instead of the real one: the minimum moves to the new $h_2$. Now compare the best approximations inside the two hypothesis spaces, $h_1$ and $h_2$. If we base our choice on the noisy loss function, $h_2$ looks like the best solution, since it sits exactly at the minimum of the loss. In this case we may think we have captured the true behavior, but the approximation given by $h_1$ is actually still better.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_6.png?raw=1" width="200" hspace="20"> <img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_7.png?raw=1" width="200"> <img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_8.png?raw=1" width="200" hspace="20"> <img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_9.png?raw=1" width="200"> <img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_10.png?raw=1" width="200" hspace="20"> <img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_11.png?raw=1" width="200" hspace> 
</p>
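
The same effect can be reproduced numerically in a toy setting where, unlike in practice, the true $f$ is known. In this sketch $f(x) = 2x$, the noise values are hand-picked, $\mathcal{H}_1$ is the set of lines through the origin and $\mathcal{H}_2$ is the larger set of piecewise-linear functions with knots at the training inputs; all of these are assumptions chosen for illustration.

```python
def f(x):
    return 2.0 * x  # the "true" model; unknown in practice, assumed here


# Training samples: f(x) plus hand-picked noise (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0]
noise = [0.8, -0.6, 0.9, -0.7]
ts = [f(x) + e for x, e in zip(xs, noise)]


def fit_h1(xs, ts):
    """Least-squares fit in H1 = lines through the origin."""
    w = sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)
    return lambda x: w * x


def fit_h2(xs, ts):
    """Fit in H2 = piecewise-linear functions with knots at the training inputs.

    H2 contains H1 and is rich enough to pass through every sample.
    """
    pts = sorted(zip(xs, ts))

    def h(x):
        for (x0, t0), (x1, t1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return t0 + (t1 - t0) * (x - x0) / (x1 - x0)
        raise ValueError("x is outside the training range")

    return h


def mse(h, inputs, targets):
    return sum((h(x) - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)


h1, h2 = fit_h1(xs, ts), fit_h2(xs, ts)

# Noisy loss: measured on the samples we actually have.
noisy_h1, noisy_h2 = mse(h1, xs, ts), mse(h2, xs, ts)

# True loss: measured against f itself on a grid (only possible in a toy example).
grid = [0.5 * i for i in range(7)]  # 0.0, 0.5, ..., 3.0
true_h1 = mse(h1, grid, [f(x) for x in grid])
true_h2 = mse(h2, grid, [f(x) for x in grid])

print(noisy_h1, noisy_h2)  # h2 has the lower noisy loss
print(true_h1, true_h2)    # h1 has the lower true loss
```

Judged by the noisy loss, $h_2$ wins because it fits the samples essentially perfectly, noise included; judged against $f$, $h_1$ is closer.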


> *The most important and difficult aspect of supervised learning is to find the proper hypothesis space.*

We need to be careful and try to obtain a good trade-off between model complexity and accuracy. Remember, the most important thing is the data!

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/funny/goddata.JPG?raw=1" width="400">
</p>

> **DICHOTOMIES IN ML**
>
> **Parametric & Nonparametric** :: in the first case the model has a fixed, finite set of parameters, while in the second case the number of parameters depends on the training set.
>
> **Frequentist & Bayesian** :: in the first case probability is used to model the sampling process, while in the second case it is used to model the uncertainty about the estimate.
>
> **Generative & Discriminative** :: in the first case the model learns the joint probability distribution of inputs and outputs. In the second case the algorithm learns the conditional distribution of the outputs given the inputs. The generative approach is the more difficult of the two: the joint distribution contains more information, and in fact from the joint we can recover the conditional.
>
> **Empirical Risk Minimization & Structural Risk Minimization** :: in the first case we care only about the error on the training set, while in the second case it is important to balance the error on the training set against the model complexity.
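
The remark in the generative versus discriminative entry, that the conditional can be recovered from the joint, follows from $p(t \mid x) = p(x,t) / \sum_{t'} p(x,t')$. A minimal sketch with a made-up discrete joint distribution (the variables, values and probabilities are hypothetical):

```python
# Hypothetical joint distribution p(x, t) over weather x and commute choice t.
joint = {
    ("sunny", "walk"): 0.4,
    ("sunny", "drive"): 0.1,
    ("rainy", "walk"): 0.1,
    ("rainy", "drive"): 0.4,
}


def conditional(joint, x):
    """Return p(t | x) = p(x, t) / p(x), where p(x) sums p(x, t) over t."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {t: p / p_x for (xi, t), p in joint.items() if xi == x}


print(conditional(joint, "sunny"))
print(conditional(joint, "rainy"))
```

The reverse direction does not work: knowing only $p(t \mid x)$ says nothing about how the inputs $x$ themselves are distributed.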