<h1 class="intro_title" style="text-align:center; font-size: 45px;">Python course 2021</h1>
<h2 class="intro_subtitle" style="text-align:center; font-size: 30px;">Introduction to machine learning<br/> with Keras</h2>

<img class="intro_logo" style="width:400px" src="https://static.poul.org/assets/logo/logo_text_g.svg" alt="POuL logo"/>

<p class="intro_author" style="text-align: center; font-size: 18px;">Roberto Bochet &lt;avrdudo@poul.org&gt;</p>

# What is machine learning?
##### (as basic as possible)
<small style="font-size: 0.5em;">Engineers, mathematicians and scientists, have mercy on me!</small>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import math
np.random.seed(0)

In [None]:
f = lambda x: np.real(-0.5j*(math.e**(1j*x) - math.e**(-1j*x)) + \
              0.15*(math.e**(10j*x) + math.e**(-10j*x))) + \
              1e-4*np.e**x + np.random.normal(0,0.1,len(x))
x = np.random.uniform(-5,5, 200)
y = f(x) 
plt.title("What is this?")
plt.plot(x, y, "o");

In [None]:
import scipy as sp
from scipy import signal
x_fin = np.arange(-5, 5, 0.01)
f_tri = lambda x: 2*np.abs(sp.signal.sawtooth(x - np.pi/2)) - 1
plt.title("Is it a triangle wave?")
plt.plot(x, y.real, "o")
plt.plot(x_fin, f_tri(x_fin), "g");

In [None]:
f_sin = lambda x: np.sin(x)
plt.title("...or is it a sine?")
plt.plot(x, y.real, "o")
plt.plot(x_fin, f_sin(x_fin), "r");

In [None]:
f_poly = lambda x: x - x**3/6 + x**5/120 - x**7/5040 + x**9/362880 - x**11/39916800
plt.title("...maybe a polynomial?")
plt.plot(x, y, "o")
plt.plot(x_fin, f_poly(x_fin), "m");

#### What are `triangle wave`, `sine` or `polynomial`?
##### (Recap)

We had some data in the form of `(x, y)` pairs.

We noticed that there is some kind of relation between `x` and `y`: they are (mostly) not random.

So we asked ourselves what value `y` takes for a generic value of `x` (one not present in the original dataset).

We answered with some mathematical functions that seem to approximate the data quite well.

#### So, which of the suggested functions generated the dataset?

### Short answer: none of them

In a real scenario it is unrealistic to completely identify the "real process" behind a dataset.

A **mathematical system can provide nothing more than an approximation of a real system**,  
and this is true for every real system.

### The mathematical functions  
### we considered are called **mathematical models**
An alternative to mathematical models could be **physical models**.

So, a reasonable question to ask is  
**"Which mathematical model best approximates the behaviour of our real system?"**
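
One simple way to make this question concrete is to give each candidate a numerical score. The sketch below is only an illustration under assumptions: it uses a simplified stand-in dataset (a noisy sine, not the exact data plotted above), a hand-written triangle wave, and the mean squared error as the score. The names `candidates` and `mse` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in dataset (assumption): a sine with a little Gaussian noise
x = rng.uniform(-5, 5, 200)
y = np.sin(x) + rng.normal(0, 0.1, len(x))

# candidate mathematical models
candidates = {
    "triangle wave": lambda x: 2 / np.pi * np.arcsin(np.sin(x)),
    "sine": np.sin,
    "polynomial": lambda x: x - x**3/6 + x**5/120 - x**7/5040
                            + x**9/362880 - x**11/39916800,
}

def mse(model, x, y):
    """Mean squared error between the model output and the data."""
    return np.mean((model(x) - y) ** 2)

errors = {name: mse(model, x, y) for name, model in candidates.items()}
for name, err in errors.items():
    print(f"{name}: {err:.4f}")
```

The lower the error, the better the candidate approximates this particular dataset; it says nothing about which process really generated the data.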

# Machine learning

>is the study of **computer algorithms** that  
    improve **automatically** through **experience**  
    and by the use of data.  
>
>    &#91;...&#93;  
>
>Machine learning algorithms build a model based  
    on sample data, &#91;...&#93; in order to make **prediction**  
    or **decisions** without being  
    **explicitly programmed to do so**.  
>
>[from wikipedia](https://en.wikipedia.org/wiki/Machine_learning)

## ML branches

ML is split into three macro areas

*(incredibly simplified summary)*

### Reinforcement learning
The model is trained like you would train a pet:  
when it does a good job it is rewarded,  
when it makes a mistake it is punished.

The model tries to maximize the reward and avoid the punishments,  
so it learns to do a good job without making mistakes.

Some applications:  
[song suggestions](https://medium.com/analytics-vidhya/emotion-based-music-recommendation-system-using-a-deep-reinforcement-learning-approach-6d23a24d3044),
[autonomous driving](https://towardsdatascience.com/do-you-want-to-train-a-simplified-self-driving-car-with-reinforcement-learning-be1263622e9e)

### Supervised learning
The model is given the input data together with the result we expect from it.

The model should learn and generalize the relation between input and output,  
so that it can provide a reasonable output for an input it has never seen.

Some applications:  
[text translation](https://towardsdatascience.com/language-translation-with-rnns-d84d43b40571),
[image classification](https://developers.google.com/machine-learning/practica/image-classification)

### Unsupervised learning
The model is given only the input data, without the output we expect;  
it is up to the model to identify patterns and regularities in the data.

Some applications:  
[painting style transfer](https://github.com/jcjohnson/neural-style),
[word embeddings](https://nlp.stanford.edu/projects/glove/)

For complex problems, these approaches may be used together.

#### However, today we will talk only about **supervised learning**!

# Feed-forward Neural Network
is a really simple model inspired by the functioning of the brain

## Let us see how to compose it

### Neuron
is the basic unit that composes the FNN
![a neuron](./images/neuron.svg)

Let us start with a really simple model, a linear combination of the input

$ x = w_0 + w_1 u_1 + w_2 u_2 + \dots + w_k u_k $

*where $w_0$ is a parameter called bias, $u_i$ is the i-th input and $w_i$ is an arbitrary multiplication factor*

So, the input data are linearly combined to get the value $x$ 

> Setting $x=0$, the function can be seen as the equation  
    of a [hyperplane](https://en.wikipedia.org/wiki/Hyperplane) in the k-dimensional input space, with parameters $w_0, w_1, \dots, w_k$

Then we can transform the value $x$ into an output by applying an arbitrary function

$y = g(x)$

where $g(\cdot)$ is called the [**activation function**](https://en.wikipedia.org/wiki/Activation_function) (and it is generally a non-linear one)
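
To make the idea of $g(\cdot)$ more tangible, here is a minimal NumPy sketch of a few common activation functions. The function names are chosen for this example only; they follow the usual textbook definitions and are not meant to reproduce any library's implementation.

```python
import numpy as np

def linear(x):
    """Identity: the output is the input."""
    return x

def sigmoid(x):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Keeps positive values, replaces negative ones with zero."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
for g in (linear, np.tanh, sigmoid, relu):
    print(g.__name__, g(x))
```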

In [None]:
import tensorflow.keras as kr
x_act = np.arange(-6,6,0.01)
fig, axs = plt.subplots(2,2)
fig.suptitle("Some examples of activation functions")
axs[0,0].set_title("Linear")
axs[0,0].plot(x_act, kr.activations.linear(x_act))
axs[0,1].set_title("tanh")
axs[0,1].plot(x_act, kr.activations.tanh(x_act))
axs[1,0].set_title("Sigmoid")
axs[1,0].plot(x_act, kr.activations.sigmoid(x_act))
axs[1,1].set_title("ReLU")
axs[1,1].plot(x_act, kr.activations.relu(x_act));

$ x = w_0 + w_1 u_1 + w_2 u_2 + \dots + w_k u_k $  
$ y = g(x) $

This pair of equations entirely defines the concept of a **Neuron** in an **FNN**
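
As a minimal sketch, the two equations translate almost literally into NumPy. The function name `neuron` and the numeric values are made up for this illustration.

```python
import numpy as np

def neuron(u, w, w0, g):
    """Single neuron: linear combination of the inputs, then activation."""
    x = w0 + np.dot(w, u)
    return g(x)

u = np.array([1.0, 2.0])       # inputs u_1, u_2
w = np.array([0.5, -0.25])     # weights w_1, w_2
w0 = 0.1                       # bias

y = neuron(u, w, w0, np.tanh)
print(y)
```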

A single **neuron** is the whole model of the simplest possible **FNN**, which is the basis of the [**Perceptron**](https://en.wikipedia.org/wiki/Perceptron), a supervised algorithm invented in 1958 by [*Frank Rosenblatt*](https://en.wikipedia.org/wiki/Frank_Rosenblatt).

It is a binary classifier:  
given an input, it decides whether it belongs to a first class or a second one  
(you are in or you are out)
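
A rough sketch of such a classifier, assuming a step activation and the classic perceptron update rule on a toy dataset (the logical AND of two inputs, chosen only because it is linearly separable). All names here are illustrative.

```python
import numpy as np

def predict(u, w, w0):
    """Step activation: class 1 if the linear combination is positive, else class 0."""
    return int(w0 + np.dot(w, u) > 0)

# toy dataset (assumption): the logical AND of two binary inputs
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 0, 0, 1])

w = np.zeros(2)
w0 = 0.0
for epoch in range(20):
    for u, t in zip(inputs, targets):
        error = t - predict(u, w, w0)   # -1, 0 or +1
        w += error * u                  # the weights change only on mistakes
        w0 += error

predictions = [predict(u, w, w0) for u in inputs]
print(predictions)
```

Each mistake nudges the separating hyperplane towards the misclassified point; on linearly separable data the updates eventually stop.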

### Layer
Anyway, a lone Neuron is rather useless, so Neurons are arranged in structures called layers.
An arbitrary number of Neurons can be placed side by side to create a layer with $m$ outputs, where $m$ is the number of Neurons in the layer.

This kind of layer is called **Dense layer** or **Fully-connected layer**

![a single layer FNN](./images/layer.svg)

Layers, in turn, can be stacked to increase the complexity of the final model.

A single **Neuron** of a **Dense layer** takes as inputs all the outputs of the previous layer  
(hence the name **Fully-connected layer**) 

![a multi layers FNN](./images/multi_layers.svg)

n.b. signals propagate only from the input to the output, **there are no loops**!  
Hence the name **Feed-forward Neural Network**!
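
A minimal sketch of this forward propagation, assuming each Dense layer is written as a matrix product followed by an activation. The helper `dense`, the layer sizes and the random (untrained) weights are all assumptions made for the example.

```python
import numpy as np

def dense(u, W, b, g):
    """Dense layer: every neuron sees every input. W has shape (inputs, neurons)."""
    return g(u @ W + b)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# a 2-layer FNN: 3 inputs -> 4 hidden neurons -> 2 outputs (random, untrained weights)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

u = np.array([0.2, -1.0, 0.5])            # one input sample
hidden = dense(u, W1, b1, sigmoid)        # output of the first layer
y = dense(hidden, W2, b2, lambda x: x)    # linear output layer
print(hidden.shape, y.shape)
```

The output of one layer is simply the input of the next; nothing ever flows backwards.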

An **FNN** model is defined by the **number of inputs**, the **number of layers**, the **number of neurons per layer** (each layer can have a different number of them) and the **neurons' activation functions**; we call these parameters [**Hyperparameters**](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)).

It is also defined by the values **$w_{i,j}$**, which are called **weights**.
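
To get a feeling for how many weights a given choice of Hyperparameters implies, here is a small sketch that counts them for a fully-connected network. The helper `count_weights` and the example sizes are made up for this illustration.

```python
def count_weights(layer_sizes):
    """Count weights and biases of a fully-connected FNN.

    layer_sizes lists the number of inputs followed by the neurons of each layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # one weight per connection plus one bias per neuron
    return total

# 1 input, then layers of 30, 10 and 1 neurons
print(count_weights([1, 30, 10, 1]))
```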

Once the **Hyperparameters** are defined, we have a working **FNN** model, but it is a dummy!  
##### Hardware without software!

## How to "train" a FNN?

Up to now we only saw a **mathematical model**,  
based on arranging an arbitrary number of **Neurons**,  
but we still have no idea how to use the data to **teach** a **behavior** to our model.

The **FNN** training is treated as an optimization problem (as always...).

So, once the **Hyperparameters** are defined, we should ask ourselves which values of the **weights** are the best possible ones for the **FNN** to simulate the behavior of our "real system".

#### [Optimization problem](https://en.wikipedia.org/wiki/Optimization_problem) (in a nutshell)

Given a **mathematical model** (one or more parametric equations) and  
a [**loss function (or cost function)**](https://en.wikipedia.org/wiki/Loss_function) (a function that scores the model for a specific set of parameters),  
we want to find the set of parameters that **minimizes** (or **maximizes**) it.

We already have a **mathematical model** (the **FNN** model)
#### so we only need a **loss function**!

### Loss function
The literature proposes several loss functions, depending on the context in which the **FNN** is used.

**Mean Squared Error**, **Binary Crossentropy**, **Categorical Crossentropy** are some of the most widely used.
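
As a rough sketch, two of these losses can be written in a few lines of NumPy following their textbook definitions. The function names and the small `eps` clipping constant are assumptions of this example, not a library implementation.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences (typical for regression)."""
    return np.mean((y_true - y_pred) ** 2)

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Penalizes confident wrong probabilities (typical for two-class problems)."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
print(binary_crossentropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
```

In both cases a lower value means the model output is closer to the expected one.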

### Training

Unluckily (or not), **FNN** models are overall non-linear models,  
so choosing an algorithm to find the optimal **weights** is not a trivial decision.

We could talk about them for hours, but (luckily for you) we have little time.  
You only need to know that this choice also matters for the model result,  
and that all the algorithms for training **FNN**s are based on a piece of magic called [**Backpropagation**](https://en.wikipedia.org/wiki/Backpropagation).
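
To give a taste of what such an algorithm does, here is a deliberately tiny sketch: plain gradient descent on a single linear neuron, with the gradients of the mean squared error written by hand. It is an assumption-laden toy (noise-free data, fixed learning rate); real optimizers are more elaborate, and Backpropagation is what computes these gradients through many layers using the chain rule.

```python
import numpy as np

# toy data (assumption): generated by y = 1 + 2*u, without noise
u = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * u

w0, w1 = 0.0, 0.0        # initial weights
learning_rate = 0.1

for step in range(500):
    error = (w0 + w1 * u) - y
    # gradients of the mean squared error with respect to w0 and w1
    grad_w0 = 2 * np.mean(error)
    grad_w1 = 2 * np.mean(error * u)
    # move the weights a small step against the gradient
    w0 -= learning_rate * grad_w0
    w1 -= learning_rate * grad_w1

print(w0, w1)
```

Starting from zero, the weights drift towards the values that generated the data, because every step reduces the loss a little.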

We can include in the **FNN** **Hyperparameters** also the **loss function** and the **optimization algorithm**

Now we should have a general idea of how **FNN**s work.

...of course several concepts were omitted in the previous content,  
we could talk for hours about **FNN**...  

but now we have introduced all the topics required to create our first model based on **FNN**

# Keras

<img alt="keras logo" src="./images/keras_logo.svg" style="width:15%"/>

is an **open source** library designed to simplify the prototyping of [**deep neural networks**](https://en.wikipedia.org/wiki/Deep_learning).

It provides a high-level interface to the [**TensorFlow**](https://it.wikipedia.org/wiki/TensorFlow) back-end  
(previously some other back-ends were supported 😕)

**n.b.**  
developing directly on **TensorFlow** with **Python** is possible,  
but **Keras** provides a high level of abstraction that drastically speeds up and simplifies prototyping with **deep neural networks**.

**Keras** was considered so important that it is now also shipped within the **TensorFlow** libraries (`tensorflow.keras`) 

#### Let us start with Keras!

# Our first FNN

As a first experiment, let us go back to the dataset we saw at the start of this talk  
and try to build and train an FNN on it.

In [None]:
plt.title("Our dataset")
plt.plot(x, y, "o");

In particular we are looking for a model that characterizes the relation between `x` and `y` 

## Model

Let us start by defining a simple **FNN** model,  
also known as "choosing the **Hyperparameters**".

In this simple model we will use a 3-layer **FNN** with a single input (`x`).

The first two layers will have some **Neurons** with a [**Sigmoid**](https://en.wikipedia.org/wiki/Sigmoid_function) **activation function**.

The last one (the output layer) will instead have only one **Neuron** with a **Linear** activation function,  
because we need a single continuous value (`y`) as output.

In [None]:
# import the keras module
import tensorflow.keras as kr

# create a new empty model
model = kr.Sequential()

# set the input size
model.add(kr.layers.InputLayer(input_shape=(1,)))
# add the three Dense layers to the model
model.add(kr.layers.Dense(30, activation=kr.activations.sigmoid))
model.add(kr.layers.Dense(10, activation=kr.activations.sigmoid))
model.add(kr.layers.Dense(1, activation=kr.activations.linear))

# print a summary of the model
model.summary()

## Optimization
The missing **Hyperparameters** are the **optimization algorithm** and the **loss function**.  
For the first we will use the [**Adam**](https://keras.io/api/optimizers/adam/) algorithm,  
and since we want to do a [nonlinear regression](https://en.wikipedia.org/wiki/Nonlinear_regression) we can use the [**Mean Squared Error**](https://en.wikipedia.org/wiki/Mean_squared_error) as loss function.

We will use these choices to configure the FNN for training.

In [None]:
# complete the FNN setup with optimizer algorithm and loss function 
model.compile(
    optimizer=kr.optimizers.Adam(learning_rate=1e-3),
    loss=kr.losses.mean_squared_error
)

##### The **FNN** is ready to be trained!!!

## Training

We will use the `fit` method to start the model training.  
We have to give it the input data, the expected output data,  
and the number of times (epochs) the training algorithm has to iterate over the given dataset.

In [None]:
history = model.fit(x, y, epochs=750)

#### The training is over!
Let us check how it went!

In [None]:
plt.ylabel('loss')
plt.xlabel('epoch')
plt.plot(history.history["loss"]);

From our trained model we are now able to get an estimate of the relation between `x` and `y` values

In [None]:
x_pred = np.arange(-5, 5, 0.01)
y_pred = model.predict(x_pred)

plt.plot(x, y, "o")
plt.plot(x_pred, y_pred, "r")
plt.show()

We got an (extremely limited) model of our real system that we can reuse to make predictions

# The "Hello world" of ML

So, we saw a first (useless) model built on garbage data; it is time to build our first model on a real challenge.

In the amateur machine learning world there is a famous site where you can challenge yourself with several ML problems.  
It is [Kaggle](https://www.kaggle.com), and it offers a first challenge for getting started in this world.

Before starting we need to authenticate with **Kaggle**;  
to do this we need to request an API token.

In [None]:
# insert here your authentication data
%env KAGGLE_USERNAME={YOUR_USERNAME}
%env KAGGLE_KEY={YOUR_KEY}

# The Titanic challenge

We are given two **lists of Titanic passengers** with some information:  
the first one (**training dataset**) indicates **whether each passenger survived**,  
the second one (**test dataset**) does not include the survival information.

Using the training dataset we have to build a model in order to **predict** which passengers in the test dataset survived.

## 🥳 only positive vibes

## Dataset
First, let us retrieve the dataset; to do this we will use the [**Kaggle API**](https://github.com/Kaggle/kaggle-api)

In [None]:
%%bash
# this is a special cell, the code below is passed to bash (directly to the os)
kaggle competitions download -c titanic # download the compressed datasets
unzip -o titanic.zip -d titanic # unzip the datasets in `titanic` folder

Now we have to load the dataset in Python; there are several ways to do it.

A powerful library designed to work with complex datasets is [**Pandas**](https://pandas.pydata.org/); I think you will be fine with it.

In [None]:
import pandas as pd
ds = pd.read_csv("./titanic/train.csv") # load the training dataset
ds # leaving the variable as the last line of the cell makes jupyter render it as a table

This dataset is really heterogeneous: it contains data in the form of `string`, `integer`, `float` and also `enum`.

So, we have to pre-process the data before using it to train our model  
##### The data pre-processing is not an optional step in ML
##### It could be the critical point!

In [None]:
# drop (maybe) useless columns
ds.drop(columns=["Embarked", "Cabin", "SibSp", "Parch", "Ticket", "PassengerId"], inplace=True)
ds

Well, now we have less data to work with; let us look at the `Age` data.

The age is **not available for all** the dataset entries, so we have to solve this issue.

We have two options (IMO): set the missing ones to an **arbitrary or random value**,  
or **calculate the average age** of the other passengers and use it as an **estimate** for the missing ones.
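
The second option can be sketched on a handful of made-up ages with plain NumPy. The values below are invented for the example and are not taken from the dataset.

```python
import numpy as np

# made-up ages (assumption), with NaN marking the missing values
ages = np.array([22.0, np.nan, 38.0, np.nan, 30.0])

mean_age = np.nanmean(ages)                          # average over the known ages only
filled = np.where(np.isnan(ages), mean_age, ages)    # replace each NaN with the average
print(mean_age, filled)
```

The known values stay untouched, and the overall average of the column does not change.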

In [None]:
age_average = ds["Age"].mean(skipna=True) # compute the average age from the non-NaN values only
age_average = np.around(age_average, decimals=2) # preserve only two decimals
print(f"The estimated average age is {age_average}")

ds.loc[ds["Age"].isna(), "Age"] = age_average # override NaN ages with the average age
ds

`Sex` and `Name` are `string`s, so we cannot use them directly!

The first will be converted to `integer`.

From the `Name` we can identify the title (e.g. `Mrs.`, `Miss.`, etc.)
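
One possible way to pull the title out of a name is a regular expression. This is only a sketch: it assumes names are written as `Surname, Title. Given names`, and the helper `extract_title` is a hypothetical name introduced here.

```python
import re

def extract_title(name):
    """Return the text between the comma and the first period, or None if absent."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1) if match else None

# sample names written in the `Surname, Title. Given names` format
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
titles = [extract_title(name) for name in names]
print(titles)
```

Anchoring the match to the comma avoids confusing a title with a similar-looking piece of a surname.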

In [None]:
# enumerate sex
ds.loc[ds["Sex"] == "female", "Sex"] = 0
ds.loc[ds["Sex"] == "male", "Sex"] = 1

# enumerate title
# n.b. regex=False matches the text literally (by default `.` is a regex wildcard)
ds.loc[:,"Title"] = 0 # set default value for title to 0
ds.loc[ds["Name"].str.contains("Rev.", regex=False), "Title"] = 1
ds.loc[ds["Name"].str.contains("Miss.", regex=False), "Title"] = 2
ds.loc[ds["Name"].str.contains("Mr.", regex=False), "Title"] = 3
ds.loc[ds["Name"].str.contains("Mrs.", regex=False), "Title"] = 4
ds.loc[ds["Name"].str.contains("Master.", regex=False), "Title"] = 5
ds.loc[ds["Name"].str.contains("Dr.", regex=False), "Title"] = 6
ds.drop(columns="Name", inplace=True) # delete the `Name` column, it is no longer useful
ds

We have defined arbitrary values for the titles, but titles have no intrinsic hierarchical order (`Miss.` < `Mrs.`?),  
yet we erroneously introduced one.

[**One-hot encoding**](https://it.wikipedia.org/wiki/One-hot) is a common way to solve this issue: for each category a binary encoding with a single `1` is created. Let us see it. 
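
To see what the encoding looks like independently of any library helper, here is a minimal NumPy sketch. The function `one_hot` and the three example labels are assumptions made for this illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into rows containing a single 1."""
    return np.eye(num_classes, dtype=int)[labels]

labels = np.array([0, 2, 1])
encoded = one_hot(labels, 3)
print(encoded)
```

Every category gets its own column, so no category is "greater" than another.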

In [None]:
title_onehot = pd.get_dummies(ds["Title"], prefix="Title") # create the one-hot encoding based on `Title`

ds = pd.concat([ds, title_onehot], axis=1) # merge the one-hot table into the original dataset

ds.drop(columns="Title", inplace=True) # drop the no longer required `Title` column
ds

The dataset is almost ready, so we have to do some final operations.

Let us separate inputs and desired output 

In [None]:
x = ds.drop(columns="Survived").to_numpy(dtype=np.float16) # our inputs
y = ds["Survived"].to_numpy() # our desired output
x.shape, y.shape # let us check the shape of them

### Wait a moment!
### Are we sure we can behave like in the regression problem?
# 🤔

In the first problem we saw simple data and we immediately realized that there was some kind of mathematical relation between $x$ and $y$... maybe with a little noise, but the relation was obvious.

Can we say the same thing in this case?

Even in noise, if we search hard enough we will be able to find some kind of pattern we can memorize, but this cannot help us improve our model; indeed, these "conclusions" can deceive our **FNN**.

This is a well known problem of any kind of model identification system.

## [Overfitting](https://en.wikipedia.org/wiki/Overfitting)

When it occurs, our model stops learning useful patterns and starts memorizing the training dataset.

This kind of mechanism is not simple to avoid, because if during training we look only at the **loss** value we will see it tend to zero,  
so we can erroneously think that our model is improving, while it is losing generality!

#### So, how can we detect overfitting?

### [**cross validation**](https://en.wikipedia.org/wiki/Cross-validation_(statistics))
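
The idea is to keep part of the data aside and compare the loss on it with the training loss. The sketch below illustrates this with a stand-in that is much simpler than an FNN: polynomials of two different degrees fitted to made-up noisy data. Everything in it (the data, the degrees, the helper names) is an assumption chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # toy data (assumption): a smooth curve plus Gaussian noise
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.2, n)

train_x, train_y = make_data(30)
val_x, val_y = make_data(30)     # data never used for fitting

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

results = {}
for degree in (3, 12):
    coeffs = np.polyfit(train_x, train_y, degree)   # fit on the training data only
    results[degree] = (mse(coeffs, train_x, train_y), mse(coeffs, val_x, val_y))
    print(f"degree {degree}: train loss {results[degree][0]:.4f}, "
          f"validation loss {results[degree][1]:.4f}")
```

The more flexible model always reaches a training loss at least as low as the simpler one, so the training loss alone cannot reveal overfitting; a validation loss that stops improving, or gets worse, is the warning sign to look for.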


In [None]:
import sklearn as skl
import sklearn.model_selection
train_x, val_x, train_y, val_y = skl.model_selection.train_test_split(x, y,
                                                                      train_size=0.7,
                                                                      random_state=0)

<h1 class="outro_title" style="text-align:center; font-size: 35px;">Thank you!</h1>

<img class="outro_logo" style="width: 20%;" src="https://static.poul.org/assets/logo/logo_g.svg" alt="POuL logo">

<a class="outro_license" style="display: block; margin: 20px auto;" rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>
<p class="outro_license_text" style="font-size: 15px; text-align: center;">Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International<br/>
    Notebook source code available in <a href="https://gitlab.poul.org/corsi/Python/keras/-/tree/2021">this repo</a></p>

<p class="outro_author" style="text-align: center; font-size: 18px;">Roberto Bochet &lt;avrdudo@poul.org&gt;</p>