# <span style='font-family:Marker Felt;font-weight:bold'>INTRODUCTION</span>

---

Let's start by answering the main question: *what is Machine Learning (ML)*?

One of the earliest and most widely used definitions comes from Mitchell (1997), who stated:

> "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

We can simplify this definition by saying that an ML system is one that improves as it collects data: more experience $\rightarrow$ better performance.
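As a rough illustration of "more experience $\rightarrow$ better performance", the sketch below learns the slope of a noisy line from a growing number of samples. Everything in it is an assumption made for the example: the task (estimating a slope), the true slope `true_w`, the noise level, and the helper names `fit_slope` and `mean_slope_error`. The error is averaged over many random trials, so that the trend is not hidden by a single lucky or unlucky draw.

```python
import random

def fit_slope(xs, ts):
    """Least-squares slope of a line through the origin: w = sum(x*t) / sum(x*x)."""
    return sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)

def mean_slope_error(n_samples, n_trials=200, true_w=2.0, noise=1.0, seed=0):
    """Average |w_hat - true_w| (performance P) when learning from n_samples points (experience E)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Inputs are drawn away from zero so the denominator in fit_slope is never zero
        xs = [rng.uniform(1.0, 2.0) for _ in range(n_samples)]
        ts = [true_w * x + rng.gauss(0.0, noise) for x in xs]
        total += abs(fit_slope(xs, ts) - true_w)
    return total / n_trials

# Task T: estimate the slope; experience E: number of samples; measure P: mean absolute error
errors = {n: mean_slope_error(n) for n in (5, 50, 500)}
for n, err in errors.items():
    print(f"{n:4d} samples -> mean absolute error {err:.3f}")
```

In this setting the average error should shrink as the number of samples grows, which is exactly the behavior Mitchell's definition asks for.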

ML is a subfield of Artificial Intelligence (AI) based on **induction**: we start from observations and try to extract general rules from them. It is not a perfect field, since wrong observations unavoidably lead to wrong models. It extracts information from data, but it cannot extract information that the data do not contain and, more importantly, it cannot extract all the information that they do contain!

The approach of ML to programming is completely different from the standard one:

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_1.png?raw=1" width="700">
</p>

We rely on ML because it is sometimes very difficult, or even impossible, to hard-code the rules behind the phenomenon under consideration. ML makes it possible to extract relevant information from data automatically $\rightarrow$ *automating the automation*.

The main goal is to obtain models that generalize to unseen data. This is very difficult because we cannot measure the objective function that we actually want to optimize (the one on unseen data), so we minimize a different one, computed on the available data, hoping that it is similar to the true one.
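One common way to write this down is sketched here. The notation is an assumption introduced for illustration and is not taken from the course material: $\ell$ is a pointwise loss, $p(x,t)$ is the unknown distribution that generates the data, $h$ is the learned model and $(x_n,t_n)$, $n=1,\dots,N$, are the available samples. The quantity we would like to minimize and the one we can actually compute are then

$$
L_{\text{true}}(h) = \mathbb{E}_{(x,t)\sim p}\big[\ell(t, h(x))\big],
\qquad
L_{\text{train}}(h) = \frac{1}{N}\sum_{n=1}^{N} \ell\big(t_n, h(x_n)\big).
$$

The second is only a finite-sample estimate of the first, so a model that minimizes $L_{\text{train}}$ does not necessarily minimize $L_{\text{true}}$.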

ML is becoming widespread in many fields, among them:

* Computer vision and robotics

* Biology and medicine

* Space exploration

* Manufacturing

* Finance

To get an idea of how much it is in the spotlight, we can look at the pictures below:

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_7.png?raw=1" width="700">
</p>

NeurIPS (Conference on Neural Information Processing Systems) is one of the oldest and most important ML conferences in the world. NeurIPS 2018 sold out in about $11'38''$ (11 minutes and 38 seconds).

ML is divided into three main sub-areas:

* <span style='color:SlateBlue;font-weight:bold'>Supervised Learning</span>: the goal is to learn a model that is able to explain the relationship between inputs and outputs.

* <span style='color:SlateBlue;font-weight:bold'>Unsupervised Learning</span>: the goal is to learn the best possible representation of data. In this learning paradigm just the inputs are used.

* <span style='color:SlateBlue;font-weight:bold'>Reinforcement Learning</span>: the goal is to learn how to make decisions in order to control and optimize.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_5.png?raw=1" width="300">
</p>

### <span style='font-family:Marker Felt;font-weight:bold'>Supervised learning</span>

Given a training set $\mathcal{D}=\{x,t\}$ of known inputs $x$ and known outputs $t$, we want to estimate the unknown model that maps inputs to outputs $\rightarrow t=f(x)$.

It can be used for:

* *Classification*: the targets take values in a finite, discrete set.

* *Regression*: the outputs are continuous variables.

* *Probability estimation*: this is similar to regression, but the function is constrained to be a valid probability function.

The input variables are called **features**, predictors or attributes.

The output variables are called **targets**, responses or labels.
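The following minimal sketch shows the two most common settings on toy data. The data sets, the class names and the two learners (a closed-form least-squares line for regression and a nearest-centroid rule for classification) are all assumptions chosen for brevity; many other models could play the role of $f$.

```python
def fit_line(data):
    """Regression: fit t ≈ w0 + w1 * x by closed-form least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_t = sum(t for _, t in data) / n
    w1 = (sum((x - mean_x) * (t - mean_t) for x, t in data)
          / sum((x - mean_x) ** 2 for x, _ in data))
    w0 = mean_t - w1 * mean_x
    return lambda x: w0 + w1 * x

def fit_nearest_centroid(data):
    """Classification: assign x to the class whose mean feature value is closest."""
    by_class = {}
    for x, t in data:
        by_class.setdefault(t, []).append(x)
    centroids = {t: sum(xs) / len(xs) for t, xs in by_class.items()}
    return lambda x: min(centroids, key=lambda t: abs(x - centroids[t]))

# Training sets D = {x, t}: features x paired with targets t (toy values)
regression_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]      # continuous targets
classification_data = [(0.5, "small"), (1.0, "small"),
                       (4.0, "large"), (5.0, "large")]                   # discrete targets

regressor = fit_line(regression_data)
classifier = fit_nearest_centroid(classification_data)

print(regressor(4.0))     # prediction for an unseen input
print(classifier(0.8))    # predicted label for an unseen input
```

In both cases the learner only sees pairs $(x, t)$ and returns a function that can be evaluated on inputs it has never seen.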

### <span style='font-family:Marker Felt;font-weight:bold'>Unsupervised learning</span>

Given a training set $\mathcal{D}=\{x\}$ made of inputs only, with no known outputs, we want to learn a more efficient representation of those inputs $\rightarrow$ $?=f(x)$.

It can be used for:

* Compression

* Clustering
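As an example of both uses at once, here is a minimal sketch of k-means clustering on scalar inputs. The choice of k-means, the initialization rule and the helper names `assign` and `kmeans_1d` are assumptions made for illustration. Each input is replaced by the index of its cluster, which is also a crude form of compression.

```python
def assign(xs, centroids):
    """Index of the closest centroid for every input."""
    return [min(range(len(centroids)), key=lambda j: abs(x - centroids[j])) for x in xs]

def kmeans_1d(xs, k, n_iters=20):
    """Lloyd's algorithm on scalar inputs; returns (centroids, labels)."""
    ordered = sorted(xs)
    # Assumed initialization: the middle element of each of k equal-sized chunks
    centroids = [ordered[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    for _ in range(n_iters):
        labels = assign(xs, centroids)
        for j in range(k):
            members = [x for x, a in zip(xs, labels) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = sum(members) / len(members)
    return centroids, assign(xs, centroids)

# Training set D = {x}: inputs only, no targets (toy values)
inputs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, labels = kmeans_1d(inputs, k=2)

# "Compressed" representation: every input is replaced by its cluster centroid
compressed = [centroids[a] for a in labels]
print(centroids, labels)
```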

### <span style='font-family:Marker Felt;font-weight:bold'>Reinforcement learning</span>

Given a training set $\mathcal{D}=\{x,u,x',r\}$, where $x$ is the current state, $u$ the control action, $x'$ the state reached after applying the control and $r$ the reward of the action, we want to learn an optimal policy $\rightarrow \pi^*(x) = \arg\max_u Q^*(x,u)$, where $Q^*(x,u)$ is the optimal action-value function.

In reinforcement learning we want to optimize the reward over a long horizon, which may require taking an action that looks bad now in order to obtain a better reward in the future.

In this series of lectures we will mainly focus on supervised and reinforcement learning.
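The sketch below makes the long-horizon idea concrete on a tiny, made-up decision problem. The states, actions, rewards, discount factor `GAMMA` and learning rate `ALPHA` are all assumptions, and tabular Q-learning is used only as one possible way to estimate $Q^*$ from transitions $(x,u,x',r)$.

```python
GAMMA = 0.9   # discount factor (assumed)
ALPHA = 0.5   # learning rate (assumed)

# Transitions (x, u, x_next, r); x_next = None marks the end of the episode
transitions = [
    ("start", "cash_out", None, 1.0),     # small reward now, then stop
    ("start", "invest", "later", -1.0),   # pay a cost now...
    ("later", "collect", None, 10.0),     # ...to collect a large reward afterwards
]

def q_learning(transitions, n_sweeps=100):
    """Repeated Q-learning updates over a fixed batch of transitions."""
    q = {(x, u): 0.0 for x, u, _, _ in transitions}
    for _ in range(n_sweeps):
        for x, u, x_next, r in transitions:
            # Value of the best action available in the next state (0 at the end)
            next_values = [v for (s, _), v in q.items() if s == x_next]
            target = r + GAMMA * max(next_values, default=0.0)
            q[(x, u)] += ALPHA * (target - q[(x, u)])
    return q

def greedy_policy(q, x):
    """pi(x) = argmax_u Q(x, u) over the actions seen in state x."""
    actions = [u for (s, u) in q if s == x]
    return max(actions, key=lambda u: q[(x, u)])

q = q_learning(transitions)
print(greedy_policy(q, "start"))
```

Although `invest` has a negative immediate reward, its estimated long-term value is higher than that of `cash_out`, so the greedy policy prefers it.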

## <span style='font-family:Marker Felt;font-weight:bold'>Overview of Supervised Learning</span>

---

We want to approximate the real model $f$ given the data set $\mathcal{D}$. In order to do this, we must follow three steps:

1. Define the *loss function* $\mathcal{L}$

2. Define the *hypothesis space* $\mathcal{H}$

3. Optimize to find an *approximate model* $h$
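The three steps can be sketched as follows. The specific choices are assumptions made for illustration: the mean squared error as loss, straight lines as hypothesis space, plain gradient descent as optimizer, and a small toy data set.

```python
def loss(h, data):
    """Step 1: loss function, here the mean squared error of h on the data set."""
    return sum((t - h(x)) ** 2 for x, t in data) / len(data)

def make_hypothesis(w0, w1):
    """Step 2: hypothesis space, here all lines h(x) = w0 + w1 * x."""
    return lambda x: w0 + w1 * x

def optimize(data, lr=0.05, n_steps=2000):
    """Step 3: gradient descent on (w0, w1) to find an approximate model h."""
    w0, w1 = 0.0, 0.0
    n = len(data)
    for _ in range(n_steps):
        # Gradients of the mean squared error with respect to w0 and w1
        g0 = sum(-2 * (t - (w0 + w1 * x)) for x, t in data) / n
        g1 = sum(-2 * (t - (w0 + w1 * x)) * x for x, t in data) / n
        w0, w1 = w0 - lr * g0, w1 - lr * g1
    return w0, w1

data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]   # toy data set D
w0, w1 = optimize(data)
h = make_hypothesis(w0, w1)
print(w0, w1, loss(h, data))
```

Changing any of the three ingredients (a different loss, a richer hypothesis space, another optimizer) gives a different learning algorithm within the same overall scheme.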

In the pictures below the values of the loss function are represented in grayscale: the loss increases from white to black. By construction, the loss is at its minimum where the true function is located. Once we define the first hypothesis space $\mathcal{H}_1$, we search inside it for the best possible approximation of $f$, which in this case is $h_1$. If we knew the true $f$, we could calculate the error made by $h_1$, but this is not the case: we don't know the real model.

If we know the true function, the problem of finding a good approximation of it is called function approximation. In ML we don't know the real $f$, so we can't calculate the error exactly. All we have is a set of samples, from which we can obtain only a noisy measure of the true underlying function. Since in supervised learning we have only partial and noisy information about $f$, we may learn a wrong behavior of the system.

Ideally, by enlarging the hypothesis space as in $\mathcal{H}_2$, we should be able to reach the real minimum of the loss and thus the true $f$. However, because we only have a finite number of samples of the real function, the minimum of the loss is not located precisely at $f$ but somewhere else. Consider the last picture, where the noisy loss function is used instead of the real one: the minimum moves to the new $h_2$. With the same two hypothesis spaces we now have two new solutions, $h_1$ and $h_2$. Judging by the noisy loss function, $h_2$ is the better solution, since it sits exactly at the minimum of that loss. In this case we may think we have captured the true behavior, but the approximation given by $h_1$ is actually still better. This is a typical problem encountered when designing an ML algorithm.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict1/figure1_6.png?raw=1" width="600">
</p>

> <span style='font-style:italic'>The most important and difficult aspect of supervised learning is to find the proper hypothesis space.</span>

We need to be careful and look for a good trade-off between model complexity (the size of the hypothesis space) and accuracy (the value of the loss function).
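The effect can be reproduced numerically with a small experiment. The setup is entirely an assumption made for illustration: the true model `true_f` is a line, $\mathcal{H}_1$ is the set of lines, and $\mathcal{H}_2$ is the set of polynomials of degree 7, which contains every line. The $\mathcal{H}_2$ solution is the Lagrange polynomial through the 8 noisy samples, so its loss on the samples is zero. Errors are averaged over many noisy data sets.

```python
import random

def true_f(x):
    return 1.0 + 2.0 * x   # assumed "real" model, unknown to the learner

def fit_line(xs, ts):
    """Best line in H1 by closed-form least squares."""
    n = len(xs)
    mx, mt = sum(xs) / n, sum(ts) / n
    w1 = sum((x - mx) * (t - mt) for x, t in zip(xs, ts)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: (mt - w1 * mx) + w1 * x

def fit_interpolant(xs, ts):
    """Polynomial in H2 (degree len(xs) - 1) passing through every sample."""
    def h(x):
        total = 0.0
        for i, (xi, ti) in enumerate(zip(xs, ts)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += ti * basis
        return total
    return h

def mse(h, xs, ts):
    return sum((t - h(x)) ** 2 for x, t in zip(xs, ts)) / len(xs)

def average_errors(n_trials=200, n_samples=8, noise=0.3, seed=0):
    rng = random.Random(seed)
    xs = [i / (n_samples - 1) for i in range(n_samples)]
    grid = [i / 100 for i in range(101)]            # dense grid to measure the true error
    truth = [true_f(x) for x in grid]
    sums = {"h1_train": 0.0, "h1_true": 0.0, "h2_train": 0.0, "h2_true": 0.0}
    for _ in range(n_trials):
        ts = [true_f(x) + rng.gauss(0.0, noise) for x in xs]   # noisy samples of f
        for name, h in (("h1", fit_line(xs, ts)), ("h2", fit_interpolant(xs, ts))):
            sums[name + "_train"] += mse(h, xs, ts)     # noisy loss, measurable
            sums[name + "_true"] += mse(h, grid, truth)  # error against f, not measurable in practice
    return {key: value / n_trials for key, value in sums.items()}

errors = average_errors()
print(errors)
```

On the noisy samples the richer space always wins, yet on average its distance from the true function is larger than that of the simple line. This is the situation described above, where $h_2$ looks better but $h_1$ is actually closer to $f$.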

<div class="alert alert-block alert-info">
<span style='color:DodgerBlue;font-weight:bold'>DICHOTOMIES IN ML</span>

* *Parametric & Nonparametric*: in the first case the model has a fixed, finite number of parameters, while in the second case the number of parameters depends on the training set.

* *Frequentist & Bayesian*: in the first case probability is used to model the sampling process, while in the second case it is used to quantify the uncertainty about the estimate.

* *Generative & Discriminative*: in the first case the model learns the joint probability distribution of inputs and outputs. In the second case the algorithm learns the conditional distribution of the outputs given the inputs. The generative approach is harder; indeed, from the joint distribution we can also recover the conditional one.

* *Empirical Risk Minimization & Structural Risk Minimization*: in the first case the focus is on the error on the training set, while in the second case the error on the training set is balanced against the model complexity.
</div>
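To make the generative versus discriminative remark concrete, the sketch below recovers the conditional distribution from a joint one using $p(t \mid x) = p(x,t) / \sum_{t'} p(x,t')$. The joint table, its labels and its values are made up for the example.

```python
# Hypothetical joint distribution p(x, t) over a weather input and a punctuality output
joint = {
    ("rain", "late"): 0.20, ("rain", "on_time"): 0.10,
    ("sun", "late"): 0.07, ("sun", "on_time"): 0.63,
}

def conditional(joint, x):
    """p(t | x) = p(x, t) / sum over t' of p(x, t')."""
    marginal = sum(p for (xi, _), p in joint.items() if xi == x)
    return {t: p / marginal for (xi, t), p in joint.items() if xi == x}

p_given_rain = conditional(joint, "rain")
p_given_sun = conditional(joint, "sun")
print(p_given_rain, p_given_sun)
```

The reverse is not possible: knowing only $p(t \mid x)$ says nothing about how likely each input $x$ is, which is why a generative model carries strictly more information than a discriminative one.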

<span style='font-family:Marker Felt;font-size:20pt;color:DarkCyan;font-weight:bold'>References:</span>

1. *Restelli M., Machine Learning - course slides*