## <span style="color:blue">  Supervised Learning Prediction</span>

The aim of these exercises is to use machine learning techniques to predict the behaviour of a function: from a set of sampled points, a neural network should be able to reconstruct the general trend of a generic function $f(x)$. Furthermore, to simulate real experimental data, we'll add some noise to our data: after generating the inputs $x_i$, our outputs will be $y_i = f(x_i) + \eta_i$, with $\eta_i$ a Gaussian random variable such that

1. $\left< \eta_i \right> = 0 $
2. $\left< \eta_i \eta_j \right> = \delta_{ij} \sigma $

With a neural network, our M.L. algorithm should in principle be able to approximate any Lebesgue-integrable function to arbitrary accuracy, given enough hidden units (Cybenko, 1989, Universal approximation theorem). So let's start from a very simple function, a linear one such as

$$ f(x) = 2x + 1 $$
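
The noisy sampling described above, applied to this linear function, might look like the following minimal sketch. The helper name `make_noisy_data`, the sampling interval $[-1, 1]$ and the seed are assumptions made for illustration, not taken from the exercise. Note that `noise_std` is the standard deviation handed to NumPy: if $\sigma$ is read as a variance, $\sqrt{\sigma}$ should be passed instead.

```python
import numpy as np

def make_noisy_data(f, n_points, noise_std, x_range=(-1.0, 1.0), seed=0):
    """Sample x uniformly in x_range and return (x, f(x) + Gaussian noise)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(x_range[0], x_range[1], size=n_points)
    # Independent draws with zero mean; noise_std is a standard deviation (assumption).
    eta = rng.normal(loc=0.0, scale=noise_std, size=n_points)
    return x, f(x) + eta

def linear(x):
    """Target function f(x) = 2x + 1."""
    return 2.0 * x + 1.0

x_train, y_train = make_noisy_data(linear, n_points=1000, noise_std=1.0)
residuals = y_train - linear(x_train)  # this is just eta_i
print(f"noise mean: {residuals.mean():+.3f}, noise std: {residuals.std():.3f}")
```

Setting `noise_std=0.0` reproduces the noiseless case, where every $y_i$ lies exactly on the line.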

We'll now study how our predictions change when tuning hyperparameters such as the noise level $\sigma$, the size of the training dataset $N_{train}$ and the number of epochs $N_{ep}$, i.e. the number of complete passes the algorithm makes over the training data, each time comparing its predictions with the real values batch by batch; to evaluate the gap between predictions and observations, we've used the mean squared error (MSE). The batch size (i.e. the number of data points compared with the predictions at each weight update) has been set to 32.
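
To make the roles of $N_{ep}$, the batch size and the MSE concrete, here is a minimal NumPy sketch of mini-batch gradient descent on a single-neuron linear model $y = wx + b$. It is an illustration under assumptions, not the code used for the plots: the learning rate, the seeds, the sampling interval $[-1, 1]$ and the helper names `mse` and `train_linear` are all hypothetical.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error between predictions and observations."""
    return float(np.mean((y_pred - y_true) ** 2))

def train_linear(x, y, n_epochs, batch_size=32, lr=0.1, seed=0):
    """Fit y = w*x + b by mini-batch gradient descent on the MSE."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    history = []
    for _ in range(n_epochs):                 # one epoch = one full pass over the data
        order = rng.permutation(len(x))       # reshuffle the data at every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            err = w * x[idx] + b - y[idx]     # prediction minus observation on this batch
            w -= lr * 2.0 * np.mean(err * x[idx])   # d(MSE)/dw
            b -= lr * 2.0 * np.mean(err)            # d(MSE)/db
        history.append(mse(w * x + b, y))     # training loss at the end of the epoch
    return w, b, history

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)          # N_train = 100
y = 2.0 * x + 1.0                             # sigma = 0: no noise
w, b, history = train_linear(x, y, n_epochs=200)
print(f"w = {w:.4f}, b = {b:.4f}, final MSE = {history[-1]:.2e}")
```

With $N_{train}=100$ and a batch size of 32, each epoch performs four weight updates (three full batches and one of four points), so the number of updates grows with both $N_{ep}$ and $N_{train}$.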

Let's start from $\sigma=0$: with $N_{train}$=100 and $N_{ep}$=50 we obtain:

<img src="es1/sigma_0/100_train/50_epochs/Model_loss.png" />
<img src="es1/sigma_0/100_train/50_epochs/Predicted_vs_True.png" />

As we see, our prediction is not yet accurate, and both the train and test losses have to be minimized further. Let's try with $N_{ep}$=100:

<img src="es1/sigma_0/100_train/100_epochs/Model_loss.png" />
<img src="es1/sigma_0/100_train/100_epochs/Predicted_vs_True.png" />

Much better. The slope is almost right, even though slightly flatter, while the loss tends asymptotically to zero. Can we do better? Let's try with $N_{ep}$=200:

<img src="es1/sigma_0/100_train/200_epochs/Model_loss.png" />
<img src="es1/sigma_0/100_train/200_epochs/Predicted_vs_True.png" />

Here we go: our model has learnt to fit correctly. It seems that the more epochs we use, the better the prediction; is this always true? Let's check: try with $N_{train}$=1000 and $N_{ep}$=30:

<img src="es1/sigma_0/1000_train/30_epochs/Model_loss.png" />
<img src="es1/sigma_0/1000_train/30_epochs/Predicted_vs_True.png" />

Even though the previous parameters gave good results, these new plots suggest the following rule: a large training dataset is the most important condition for fitting with M.L. techniques.

Let's now move on to $\sigma$=1: this time we'll start from $N_{train}$=1000, with $N_{ep}$=50 and 200:

<img src="es1/sigma_1/1000_training/50_epochs/Model_loss.png" />
<img src="es1/sigma_1/1000_training/50_epochs/Predicted_vs_True.png" />
for $N_{ep}$=50

<img src="es1/sigma_1/1000_training/50_epochs/Model_loss.png" />
<img src="es1/sigma_1/1000_training/50_epochs/Predicted_vs_True.png" />
for $N_{ep}$=200

This shows that, once noise is introduced in the data, there's an unavoidable bias between the prediction and the real model; raising the number of epochs doesn't seem to close this gap. Let's then try to increase, once again, the number of training points to $N_{train}$=10000:

<img src="es1/sigma_1/10000_training/50_epochs/Model_loss.png" />
<img src="es1/sigma_1/10000_training/50_epochs/Predicted_vs_True.png" />

As we see, there's still a bias, but here we get better results: the higher the number of points, the more the fluctuations compensate each other, averaging out to zero (central limit theorem); this is why we have a lower loss on the test data than on the training data.
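
The averaging-out of fluctuations can be checked without a neural network. The sketch below is a stand-in under stated assumptions: it replaces the network with an ordinary least-squares line fit (`np.polyfit`) and measures how the RMS error on the fitted slope shrinks as $N_{train}$ grows, for noise with standard deviation 1. The helper name `slope_rms_error`, the number of repetitions and the sampling interval $[-1, 1]$ are illustrative choices.

```python
import numpy as np

def slope_rms_error(n_train, noise_std=1.0, n_repeats=200, seed=0):
    """RMS error of the least-squares slope over many independent noisy datasets."""
    rng = np.random.default_rng(seed)
    errors = np.empty(n_repeats)
    for k in range(n_repeats):
        x = rng.uniform(-1.0, 1.0, size=n_train)
        y = 2.0 * x + 1.0 + rng.normal(0.0, noise_std, size=n_train)
        slope, intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
        errors[k] = slope - 2.0                     # deviation from the true slope
    return float(np.sqrt(np.mean(errors ** 2)))

rms = {n: slope_rms_error(n) for n in (100, 1000, 10000)}
for n, value in rms.items():
    print(f"N_train = {n:5d}  ->  RMS slope error = {value:.4f}")
```

The error is expected to fall roughly as $1/\sqrt{N_{train}}$, so each tenfold increase in the training set should reduce it by about a factor of three.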

**Cubic Function Fit**

Let's make things harder: this time, we'll try to infer the behaviour of a cubic function, given some (more or less noisy) data. We have

$$ f(x)=4-3x-2x^2+3x^3
$$


for $x \in [-1,1]$. For a non-linear function we need a more complex neural network architecture: we added three more layers, the first one with a ReLU activation, the others with a linear one. Here is a scheme of the model:

<img src="es2/model_plot.png" />
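
As a rough companion to the scheme, the following NumPy sketch shows the forward pass of a stack of dense layers in which only the first layer uses ReLU and the remaining ones are linear. The layer widths (`[1, 32, 16, 1]`), the weight initialization and the helper names are placeholders chosen for illustration; the actual architecture is the one in the figure.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(z, 0.0)

def init_layers(sizes, seed=0):
    """Random weights and zero biases for consecutive dense layers (hypothetical init)."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(x, layers):
    """Propagate a batch x of shape (n_samples, n_inputs) through the stack."""
    a = x
    for i, (weights, bias) in enumerate(layers):
        z = a @ weights + bias
        a = relu(z) if i == 0 else z   # ReLU on the first layer only, linear afterwards
    return a

layers = init_layers([1, 32, 16, 1])            # placeholder widths, not from the source
x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)    # five sample points in [-1, 1]
y_pred = forward(x, layers)
print(y_pred.shape)  # (5, 1)
```

With this structure the output is a piecewise-linear function of $x$, whose kinks come only from the ReLU layer.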

Furthermore, as we'll need more data to reproduce the actual function, we've enlarged our batch size from 32 to 64, in order to process more data at each update. As optimizer we've chosen SGD (Stochastic Gradient Descent), and as loss MSE (Mean Squared Error); we've also tried Adam as optimizer, but with worse results.

First of all, we've fitted some models with $\sigma$=0: let's see what happens for $N_{train}$=1000 and $N_{ep}$=1000:

<img src="es2/sigma_0/1000train_1000epochs/Model_loss.png" />
<img src="es2/sigma_0/1000train_1000epochs/Predicted_vs_True.png" />

As we see, our fit isn't very precise, in particular at the end of the curve. As we may notice from the loss graph, the loss reaches its asymptotic value long before the end of training; it would be better, for computational performance, to reduce the number of epochs. Let's try another set of hyperparameters, say $N_{train}$=5000 and $N_{ep}$=500:

<img src="es2/sigma_0/5000train_500ep/Model_loss.png" />
<img src="es2/sigma_0/5000train_500ep/Predicted_vs_True.png" />
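
The earlier remark about training well past the plateau of the loss suggests choosing $N_{ep}$ from the loss curve itself. A minimal sketch of such a check follows; the function `plateau_epoch`, its `patience` and `min_delta` defaults, and the synthetic exponential loss curve are all assumptions for illustration, not part of the original exercise. If Keras is used, its `EarlyStopping` callback offers arguments with the same names and a similar purpose.

```python
import math

def plateau_epoch(losses, patience=20, min_delta=1e-4):
    """Return the first epoch at which the loss has failed to improve by more
    than min_delta for `patience` consecutive epochs, or None if it never does."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss            # significant improvement: remember it
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch           # no significant improvement for `patience` epochs
    return None

# Synthetic loss curve (assumption): exponential decay onto a floor of 0.01.
losses = [math.exp(-epoch / 50.0) + 0.01 for epoch in range(1000)]
stop = plateau_epoch(losses)
print(f"loss plateaus around epoch {stop} out of {len(losses)}")
```

On a curve like this one, the check stops well before the last epoch, which is the saving in computation time the text refers to.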

Here we have a better result: the tail is still reproduced poorly, but the overall trend is better. Finally, let's try with $N_{train}$=10000, $N_{ep}$=1000:

<img src="es2/sigma_0/10000train_1000epoch/Model_loss.png" />
<img src="es2/sigma_0/10000train_1000epoch/Predicted_vs_True.png" />

The description of the tail is slowly improving. Now let's try with Gaussian noise with $\sigma=0.5$, $N_{train}$=1000 and $N_{ep}$=1000:

<img src="es2/sigma_.5/1000T_1000E/Model_loss.png" />
<img src="es2/sigma_.5/1000T_1000E/Predicted_vs_True.png" />

This performance is not the best: we could increase the amount of training data, while the number of epochs seems sufficient. However, looking at the distribution of the original data points, this result doesn't seem so bad:

<img src="es2/sigma_.5/1000T_1000E/Ideal_target.png" />

The blue line is the function we want to fit; the red points are the training data, distributed around $f(x)$ with variance $\sigma=0.5$. All things considered, it's not such a bad job! Let's increase the training data: this time we'll use $N_{train}$=10000 and $N_{ep}$=1000. Here is our starting training set:

<img src="es2/sigma_.5/10000T_1000E/Ideal_Target.png" />

which is quite scattered. Now let's see the actual results:

<img src="es2/sigma_.5/10000T_1000E/Model_loss.png" />
<img src="es2/sigma_.5/10000T_1000E/Predicted_vs_True.png" />

A fairly good result; however, we can see huge fluctuations in the test loss during the first epochs, which means that this time we need a higher number of epochs for good predictions.

Finally, we show some results for $\sigma$=1, $N_{train}$=1000, $N_{ep}$=100:

<img src="es2/sigma_1/1000train_100epoch/Ideal_target.png" />
<img src="es2/sigma_1/1000train_100epoch/Model_loss.png" />
<img src="es2/sigma_1/1000train_100epoch/Predicted_vs_True.png" />

**2D Function Fitting**

Now we'll try to extend our techniques to 2D functions: here we're going to reproduce a sinusoidal one, namely

$$ f(x,y) = \sin(x^2+y^2) $$

with $x$, $y$ in the range $[-3/2, 3/2]$. We report the plot of the true function for future comparison with our results:

<img src="es3/True.png" />

This time our neural network has two inputs, one neuron for each free variable (i.e. $x$ and $y$). Apart from this, the approach we've followed is very similar to that of the previous exercises. We've tried a small number of epochs, $N_{ep}$=20, and a large amount of training data (as covering a 2D domain requires the square of the number of points needed in 1D), $N_{train}$=10000. Taking sections of the graph above, we find:

<img src="es3/Predicted_vs_True_x.png" />
for the section at $y$=0

<img src="es3/Predicted_vs_True_y.png" />
for the section at $x$=0
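
The two-input setup and the sections described above can be sketched in NumPy as follows. Only the data side is shown: the uniform sampling, the seed, the number of grid points and the variable names are assumptions for illustration, and a trained model would be evaluated on `section_y0` and `section_x0` to produce the two comparison plots.

```python
import numpy as np

def target(x, y):
    """True 2D function f(x, y) = sin(x^2 + y^2)."""
    return np.sin(x ** 2 + y ** 2)

rng = np.random.default_rng(0)
n_train = 10000
# Each training input is a pair (x, y) drawn uniformly in [-3/2, 3/2]^2 (assumption).
inputs = rng.uniform(-1.5, 1.5, size=(n_train, 2))
outputs = target(inputs[:, 0], inputs[:, 1])

# Points along the two sections used for the comparison plots.
grid = np.linspace(-1.5, 1.5, 101)
section_y0 = np.column_stack([grid, np.zeros_like(grid)])  # points (x, 0)
section_x0 = np.column_stack([np.zeros_like(grid), grid])  # points (0, y)
true_y0 = target(section_y0[:, 0], section_y0[:, 1])       # sin(x^2)
true_x0 = target(section_x0[:, 0], section_x0[:, 1])       # sin(y^2)

print(inputs.shape, outputs.shape, true_y0.shape)
```

Since $f$ depends only on $x^2 + y^2$, the true curves along the two sections coincide, so any difference between the two predicted curves reflects the model rather than the target.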