<a href="https://colab.research.google.com/github/Demosthene-OR/Student-AI-and-Data-Management/blob/main/151.1_1_Introduction_Deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src="https://prof.totalenergies.com/wp-content/uploads/2024/09/TotalEnergies_TPA_picto_DegradeRouge_RVB-1024x1024.png" height="150" width="150">
<hr style="border-width:2px;border-color:#75DFC1">
<center><h1> Introduction to Deep Learning with Keras </h1></center>

<hr style="border-width:2px;border-color:#75DFC1">

>Since 2012, deep learning algorithms seem ready to solve many problems: recognizing faces as proposed by DeepFace, defeating poker players, enabling autonomous cars to drive, and even searching for cancer cells.
>
>
>
>However, the foundations of these methods are not so recent. Deep learning was formalized in 2007, building on the neural network models first proposed by McCulloch and Pitts in 1943. Along the way came numerous developments such as the perceptron (artificial neuron), the convolutional neural networks of Yann LeCun and Yoshua Bengio in 1998, and the deep neural networks that emerged in 2012. This work paved the way for numerous fields of application such as image processing (computer vision), language processing (NLP), and speech recognition.
>
>Furthermore, before the advent of deep learning, mathematical analysis methods already existed to address these issues. In computer vision, for example, Haar features and image gradients were used.
<br>

## Deep Learning vs. Machine Learning

<br>

> <img src="https://assets-datascientest.s3-eu-west-1.amazonaws.com/notebooks/masterclass_deeplearning_debutant_MLetDeep.png" width="300" height="200">
>
>
>
> 
> **Deep learning** algorithms can be considered both a sophisticated and mathematically complex evolution of **machine learning** algorithms. The field has been the subject of much attention lately, and for two good reasons:
> 
>
>* Recent developments have led to results that were previously unthinkable. These new machine learning techniques take advantage of the massive increase in available data, as well as the phenomenal computing power of graphics processors. Unlike other machine learning models, which tend to plateau, a deep learning model generally keeps improving as more data becomes available.

> <img src= "https://datascientest.com/wp-content/uploads/2020/06/DL5.png" >

>* Deep learning describes algorithms that analyze data with a logical structure similar to the way a human would draw conclusions. Note that this can occur through both supervised and unsupervised learning. To achieve this, deep learning applications use a layered structure of algorithms called an **artificial neural network** (ANN). The design of such an ANN is inspired by the biological neural network of the human brain, leading to a much more efficient learning process than standard machine learning models.
>



## **What are the differences?**

>First, while traditional machine learning algorithms, such as linear regression or a decision tree, have a fairly simple structure, deep learning is based on artificial neural networks, which, like the neurons of the human brain, are complex and interconnected.
>
>
>Second, deep learning algorithms require much less human intervention. In machine learning, a data scientist must specify the model parameters, train the model, observe the results, and then readjust the model. In deep learning, **learning takes place continuously** through a number of iterations defined in advance.
>
>This continuous learning through a complex network makes it possible to more effectively handle problems where there are **more parameters than observations**, and for which explicit mathematical solutions often do not exist.
>
> This video illustrates and complements the differences outlined above (click on the image to be redirected to the video):
>
>
> [![This video illustrates and expands on the differences outlined above:](https://img.youtube.com/vi/q6kJ71tEYqM/0.jpg)](https://www.youtube.com/watch?v=q6kJ71tEYqM "Machine Learning vs. Deep Learning: What are the Differences? ")

## Some examples of applications


As mentioned above, deep learning has many potential applications, from image processing to language processing and speech recognition.

> * The interpreting profession is being disrupted by the advent of audio and handwriting recognition and text translation models: 
>
>
><img src="https://assets-datascientest.s3-eu-west-1.amazonaws.com/notebooks/masterclass_deeplearning_debutant_translate.jpeg" width="400" height="200"> 


> * Autonomous cars:
>
><img src="https://assets-datascientest.s3-eu-west-1.amazonaws.com/notebooks/masterclass_deeplearning_debutant_intro_cars.gif" alt="animated" width="400" height="200">


<br>
<br>

<hr style="border-width:2px;border-color:#75DFC1">
<h2 style="text-align:center">How Deep Learning works </h2>
<hr style="border-width:2px;border-color:#75DFC1">

<br>
<br>

## What is a perceptron?


> Deep learning algorithms are based on neural networks, which are **layered architectures**: one layer takes our data as input, hidden layers process this data, and an output layer returns the expected result based on our problem. The GIF below illustrates how they work: 
>
><img src="https://assets-datascientest.s3-eu-west-1.amazonaws.com/notebooks/masterclass_deeplearning_debutant_intro_dense.gif" style="width:400px">
>
> Each layer is composed of one or more neurons, called **perceptrons**. A perceptron works as follows: it takes a vector of input values, applies a few transformations to it, and returns an output value. It generally contains:
>
>
>* A weight vector $w$
>
>* A bias $b$ 
>
>* An activation function $f$
>
>The scalar product between the weight vector $w$ and a vector $x$ is denoted as:
>
>$$w^\top x := w_1 x_1 + ... + w_n x_n$$
>
>
>For an input vector $x$, the output of the model's perceptron $h$ is:
>
>$$ h = \mathrm{Perceptron}(x) = f(w^\top x + b)$$
> 
>Here is an illustration: 
><img src = "https://assets-datascientest.s3-eu-west-1.amazonaws.com/notebooks/perceptron1.png" width="900" height="200">

> It is important to note that if the activation function chosen is the identity function, the perceptron only calculates a linear operation. In this specific case, the task performed by the neuron is nothing more than **linear regression**. This is useful when the data is linearly separable, as you can see in several geometric representations below.
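
The formula $h = f(w^\top x + b)$ can be sketched in a few lines of NumPy. This is only an illustration: the function name `perceptron` and the numerical values of `w`, `b` and `x` are hypothetical and do not come from the notebook's widgets.

```python
import numpy as np

def identity(z):
    """Identity activation: the perceptron reduces to a linear model."""
    return z

def perceptron(x, w, b, f=np.tanh):
    """Return f(w^T x + b) for a single input vector x."""
    return f(np.dot(w, x) + b)

# Hypothetical values chosen for illustration only
w = np.array([0.5, -1.0])   # weight vector
b = 0.25                    # bias
x = np.array([2.0, 1.0])    # input vector

h_linear = perceptron(x, w, b, f=identity)  # w^T x + b
h_tanh = perceptron(x, w, b)                # tanh(w^T x + b)
print(h_linear, h_tanh)
```

With the identity activation the output is simply $w^\top x + b$; swapping in another activation only changes the last step of the computation.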

## The scalar product


>As seen previously, the scalar product is the basis of how a perceptron works.
>
><br>**What is its purpose?** 
>
> It is used for classification thanks to its geometric properties. The result produced by the scalar product between two vectors is very easy to interpret.
>
> In the following interactive figure, you can see a point and a vector. The blue point **x** with coordinates **${(x_1, x_2)}$** is the point we need to classify. The green vector **w** with coordinates **${(w_1, w_2)}$** is the vector that will allow us to classify it.
>
> Classification with the scalar product is done as follows:
>
>
> * If the scalar product between **x** and **w** is **positive**, **x will be classified as 1** (blue).
>
>
> * If the scalar product between **x** and **w** is **negative**, **x will be classified as 0** (red).
>
>
> The **green line perpendicular to w** represents all points in the domain such that **the scalar product with w is 0**. This line is called the **decision boundary** of the classification problem.
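
The classification rule above can be sketched as follows. The vector `w` and the test points are hypothetical values, and we assume that a point lying exactly on the decision boundary (scalar product equal to 0) is assigned to class 1, which is the convention of the indicator function used in the Loss Function section.

```python
import numpy as np

def classify(x, w):
    """Return 1 if <x, w> >= 0 and 0 otherwise."""
    return int(np.dot(x, w) >= 0)

w = np.array([1.0, 2.0])  # hypothetical classification vector

points = [
    np.array([2.0, 1.0]),    # <x, w> = 4  -> same side as w
    np.array([-1.0, -1.0]),  # <x, w> = -3 -> opposite side
    np.array([2.0, -1.0]),   # <x, w> = 0  -> on the decision boundary
]
labels = [classify(x, w) for x in points]
print(labels)
```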

* **(a)** Run the following cell to display the interaction

In [1]:
from interaction_dense import show_dotProduct
show_dotProduct()




## Linear Separability

> The concept of linear separability of a dataset is fundamental to classification with the scalar product.
>
> The following figure corresponds to the **Iris** dataset, restricted here to two variables, `Sepal Width` and `Sepal Length` (the width and length of the sepals), and to two different species of iris. The classification problem is as follows:
>
#### Can we determine the species of a flower based on these two variables?
>
>
> The **green points** in the figure correspond to flowers of the species **iris setosa**, and the **orange points** correspond to flowers of the species **iris virginica**. (The data has been normalized for better visualization.)

> To solve this problem **geometrically** with scalar product classification, we can reformulate the question as follows:

* **(c)** **Is there a linear decision boundary that would allow us to separate the two species?**

* **(d)** Using the following interactive figure, find a vector w that defines a decision boundary **separating the green points from the orange points**.



In [2]:
from interaction_dense import show_data
show_data()




> One possible solution is the vector `w = (-1.8, 0.95)`, which defines a **linear** decision boundary that **perfectly** separates the two groups of individuals. We then say that the dataset is **linearly separable**.
>
> In this particular case, the decision boundary is called a **separating hyperplane**. We use this name because the points defining the decision boundary are the points satisfying the **plane equation** ${\{ x = (x_1,x_2)\in \mathbb{R}^2 : \langle x, w \rangle = x_1 w_1 + x_2 w_2 = 0\} }$
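
In two dimensions, the equation $x_1 w_1 + x_2 w_2 = 0$ describes a line through the origin. As a small sketch, assuming $w_2 \neq 0$, we can solve it for $x_2$ and check that the resulting points have a zero scalar product with the vector `w = (-1.8, 0.95)` quoted above. The helper name `boundary_x2` is hypothetical.

```python
import numpy as np

def boundary_x2(x1, w):
    """x2 coordinate of the boundary point with abscissa x1 (requires w[1] != 0)."""
    w1, w2 = w
    return -w1 / w2 * x1

w = np.array([-1.8, 0.95])

x1 = np.linspace(-2.0, 2.0, 5)           # a few abscissas
x2 = boundary_x2(x1, w)                  # matching ordinates on the boundary
on_boundary = np.column_stack([x1, x2])  # points (x1, x2) of the hyperplane

residual = on_boundary @ w               # scalar products with w, all ~0
print(residual)
```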

## Loss Function

> We have seen that we can find a hyperplane for the Iris dataset that perfectly separates the two species of flowers. However, this solution was found visually.
>
><br> **How can we find a separating hyperplane mathematically?**
>
> First, we need to find a way to quantify the quality of the separation of the species. One of the simplest possibilities is to **count the number of classification errors we would make using a specific vector**.
>
> Suppose that our dataset contains $n$ points $X = (x_i)_{i = 1, 2, \dots, n}$ with each $x_i \in \mathbb{R}^d$, and that each of these points is associated with a label $y_i \in \{0,1\}$, with $Y = (y_i)_{i = 1, 2, \dots, n}$, corresponding to the group to which the point $x_i$ belongs.
>
> In our example, class 1 would correspond to the species *iris setosa* and class 0 to the species *iris virginica*.
>
> Mathematically, the classification of an individual $x_i$ by a vector $w$ would be done by a function $f$ defined as follows:
>
>$${f(x_i, w) = \mathbb{1}_{\langle x_i,w \rangle \geq 0} = \begin{cases} 1 & \text{if } \langle x_i,w \rangle \geq 0 \\  0 & \text{if } \langle x_i,w \rangle < 0 \end{cases}}$$
>
> Thus, the number of errors can be calculated by a function of ${w = (w_1, w_2)}$, $X$ and $Y$ that would be written as follows:
>
> $${ g(w,X,Y) = \sum_{i = 1}^{n}\mathbb{1}_{f(x_i, w)\neq y_i} }$$
>
> This function allows us to define a criterion for determining the best solution to our classification problem. Functions of this type are called **loss functions**. 
>
> The lower the value of this function, the more effective our classification function is. **Minimizing this loss function with respect to vector w is therefore equivalent to finding a separating hyperplane**.
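
The error-counting loss $g(w, X, Y)$ can be sketched directly from the two formulas above. The four points and the two candidate vectors below are hypothetical toy values, not the Iris data used in the widgets.

```python
import numpy as np

def predict(X, w):
    """Apply f(x_i, w) = 1 if <x_i, w> >= 0 else 0 to every row of X."""
    return (X @ w >= 0).astype(int)

def zero_one_loss(w, X, Y):
    """g(w, X, Y): number of points whose predicted class differs from y_i."""
    return int(np.sum(predict(X, w) != Y))

# Toy data (hypothetical): two points per class
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
Y = np.array([1, 1, 0, 0])

w_good = np.array([1.0, 1.0])   # separates the two classes
w_bad = np.array([1.0, -1.0])   # misclassifies two points

print(zero_one_loss(w_good, X, Y), zero_one_loss(w_bad, X, Y))
```

A vector with a loss of 0 defines a separating hyperplane for this toy dataset.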
>
> There are different loss functions depending on the problem to be solved. Here are a few examples (which you don't need to remember for now) with their names in the TensorFlow/Keras library that you will be using in this module:
>
| TensorFlow/Keras name     | Application | Meaning |
| :-------------: |:-------------:|:-------------:|
| `BinaryCrossentropy`   | Binary classification | Cross entropy in the case of a binary classification problem |
| `CategoricalCrossentropy`   | Multiclass classification | Cross entropy with `y_true` in the form of a one-hot vector |
| `SparseCategoricalCrossentropy` | Multiclass classification | Cross entropy with `y_true` in the form of a class index |
| `Hinge` | Binary/multiclass classification | Loss function that only penalizes points that are misclassified or inside the margin (zero loss for points that are predicted correctly with a sufficient margin) |
| `MeanSquaredError` (alias `mse`)   | Regression | Mean squared error |
| `MeanAbsoluteError` (alias `mae`)   | Regression | Mean absolute error |
| `MeanAbsolutePercentageError` (alias `mape`)   | Regression | Mean absolute percentage error |
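
To give an idea of what some of these names compute, here is a NumPy sketch of the textbook formulas for four of them. These are not the Keras implementations, whose details (clipping constant, reduction over batches) may differ; the `eps` clipping value and the example arrays are assumptions made for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Cross entropy for labels in {0, 1} and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def hinge(y_true, score):
    """Hinge loss for labels in {-1, 1}: zero beyond the margin."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * score))

y01 = np.array([1.0, 0.0, 1.0])      # labels in {0, 1}
proba = np.array([0.9, 0.2, 0.6])    # predicted probabilities
ypm = np.array([1.0, -1.0, 1.0])     # labels in {-1, 1}
scores = np.array([2.0, -0.5, 0.3])  # raw scores

print(mse(y01, proba), mae(y01, proba))
print(binary_crossentropy(y01, proba), hinge(ypm, scores))
```

In the hinge example, the first point is classified correctly with a score beyond the margin and therefore contributes nothing to the loss.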

* Run the following cell to display some loss functions for our Iris classification problem.


In [3]:
from dl_widgets import show_losses
show_losses()




## Review of the Perceptron algorithm and activation function
> A special feature of the Perceptron algorithm is the use of an **activation function**, which transforms the output of the perceptron so that we can use a loss function better suited to our problem.
>
>Among the most commonly used **activation functions** are:
>- *Sigmoid*
>- *Tanh*
>- *ReLU (Rectified Linear Unit)*
>- *Leaky ReLU (Leaky Rectified Linear Unit)*
> 
> Their graphical representation can be found in the figure below:
>
><img src="https://assets-datascientest.s3-eu-west-1.amazonaws.com/notebooks/masterclass_deeplearning_activation_functions.png" style="width:800px">
> <p style="text-align:center"> <i>Graphical representation of the main activation functions, <a href="https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092">source</a></i></p>
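
The four activation functions listed above can be written in a few lines of NumPy. This is a minimal sketch; the slope `alpha=0.01` used for Leaky ReLU is an assumed default, since this value is a hyperparameter that can be chosen freely.

```python
import numpy as np

def sigmoid(z):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Maps any real number into (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Keeps positive values and replaces negative ones with 0."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative values are scaled by alpha instead of zeroed."""
    return np.where(z >= 0, z, alpha * z)

z = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(z))
```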
>
>
> Let's assume that the labels satisfy $y_i \in \{-1, 1\}$ for $i = 1, \dots, n$ and that we use the *tanh* function as activation. With $b$ denoting the bias, the classification function becomes:
>
> $$ f(x_i, w) = \tanh(\langle x_i, w \rangle + b) $$
>
> The loss function then becomes the sum of the squared differences between the perceptron output values after the activation function and the expected labels (-1 or 1):
>
> $$ g(w, X, Y) = \sum_{i = 1}^{n} (f(x_i, w) - y_i)^2 $$
>
> The classification function and the loss function **no longer contain indicator functions** and are now **differentiable at every point**. This distinction is very important for what follows. Furthermore, the loss function in our example is **convex** and has a single minimum. (This is true if the data is linearly separable, which is almost never the case in practice).
>
> Another reason we would use the *tanh* function is that
>
> $$ \tanh(\langle x, w \rangle + b) = 0 \iff \langle x, w \rangle + b = 0 $$
>
> because $\tanh(z) = 0 \iff z = 0$.
>
> That is, the equation of the optimal separating hyperplane is the same as if we were not using an activation function.
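
Because the tanh perceptron and its squared loss are differentiable, their gradient can be written explicitly with the chain rule, using $\tanh'(z) = 1 - \tanh^2(z)$. The sketch below implements the loss and its gradient with respect to the weights and the bias; the toy data and the values of `w` and `b` are hypothetical.

```python
import numpy as np

def f(X, w, b):
    """tanh perceptron applied to every row of X."""
    return np.tanh(X @ w + b)

def g(w, b, X, Y):
    """Sum of squared differences between outputs and labels in {-1, 1}."""
    return np.sum((f(X, w, b) - Y) ** 2)

def grad_g(w, b, X, Y):
    """Gradient of g with respect to w and b (chain rule)."""
    out = f(X, w, b)
    delta = 2.0 * (out - Y) * (1.0 - out ** 2)  # derivative w.r.t. <x_i, w> + b
    return X.T @ delta, np.sum(delta)

# Toy data (hypothetical)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.3, -0.2])
b = 0.1

grad_w, grad_b = grad_g(w, b, X, Y)
print(g(w, b, X, Y), grad_w, grad_b)
```

This gradient is exactly what a gradient descent algorithm needs in order to update $w$ and $b$.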

## Gradient Descent Training

> Thanks to the activation function, the loss function is differentiable. We can use a **gradient descent algorithm** to find the optimal vector $w$.
>
> The gradient descent algorithm is very simple. The easiest case to illustrate is the one-dimensional case. In the following figure, the function represented is $f(x) = x^2$ and its derivative is $f'(x) = 2x$.

* **(e)** Run the following cell to display the interaction



In [4]:
from interaction_dense import show_optimization_square
show_optimization_square()




> When $f'(x) < 0$ (in red), then $f$ is **decreasing** in the neighborhood of $x$.
>
> When $f'(x) > 0$ (in green), then $f$ is **increasing** in the neighborhood of $x$.
>
> Thus, a point $x_{min}$ is a minimum of a differentiable **convex** function if and only if $f'(x_{min}) = 0$, i.e., $f$ must be increasing in the neighborhood of any point $x > x_{min}$ and decreasing in the neighborhood of any point $x < x_{min}$.
>
> Let $x_0$ be an arbitrary starting point in the domain of $f$. The gradient descent algorithm then consists of moving from $x_0$ in the direction **opposite** to the gradient. That is:
>
> * If $f'(x_0) < 0$, $f$ is decreasing in the neighborhood of $x_0$, which means that the minimum $x_{min}$ is necessarily greater than $x_0$.
>
>
> * If $f'(x_0) > 0$, $f$ is increasing in the neighborhood of $x_0$, which means that the minimum $x_{min}$ is necessarily less than $x_0$.
>
> We then define $x_1 = x_0 - \lambda f'(x_0)$, where $\lambda$ is called **the descent step**. In the context of the Perceptron algorithm and deep learning in general, $\lambda$ is called the **learning rate**.
>
>
> We repeat the operation until we obtain a point $x_k$ such that $|f'(x_k)| < \mathrm{tol}$, where $\mathrm{tol}$ is the **tolerance**, a very small positive constant.
>
> > Step 0: Choose an initial point $x_0$, a descent step $\lambda$ and a tolerance $\mathrm{tol}$.
> >
> > Step k: As long as $|f'(x_k)| \geq \mathrm{tol}$: $x_{k+1} = x_k - \lambda f'(x_k)$.
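
The two steps above translate almost line by line into code. The sketch below applies them to $f(x) = x^2$; the starting point, the descent step and the `max_iter` safeguard (which stops the loop if the tolerance is never reached) are assumptions added for illustration.

```python
def gradient_descent(df, x0, lr, tol=1e-3, max_iter=10_000):
    """Iterate x_{k+1} = x_k - lr * f'(x_k) while |f'(x_k)| >= tol."""
    x = x0
    steps = 0
    while abs(df(x)) >= tol and steps < max_iter:
        x = x - lr * df(x)
        steps += 1
    return x, steps

def df(x):
    """Derivative of f(x) = x^2."""
    return 2 * x

x_min, n_steps = gradient_descent(df, x0=-5.0, lr=0.1)
print(x_min, n_steps)
```

The function returns both the final point and the number of steps performed, which makes it easy to compare several descent steps.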

* **(f)** What happens when the descent step is too small? When it is too large?


* **(g)** For the initialization $x_0 = -10$, find the smallest descent step such that $|f'(x_k)| \leq 0.001$ after 20 steps.


* **(h)** Find a descent step such that the descent algorithm converges to the minimum in 1 step for any initialization $x_0$.



In [5]:
from interaction_dense import show_gradient_descent
show_gradient_descent()




## Limitations of gradient descent

> As you can see, the gradient descent algorithm with the right descent step size is very effective at finding the global minimum of a function. However, this algorithm has one major weakness: **it is only guaranteed to reach the global minimum when the function to be minimized is convex, which is not always the case (as we will see below)**.
>
> In the following interactive figure, we have plotted the function $f(x) = \left(\frac{x}{5}\right)^4 + \left(\frac{x}{5}\right)^3 - 6\left(\frac{x}{5}\right)^2 + 1$.
>
> This function contains a global minimum (the one we want to approach) and a local minimum (the one we want to avoid).

* **(i)** What happens if we apply the gradient descent algorithm with initialization $x_0 = 11$ and a descent step of $0.1$?


* **(j)** What happens if we apply the gradient descent algorithm with initialization $x_0 = 0$ and any descent step?


* **(k)** What happens if we apply the gradient descent algorithm with initialization $x_0 = -1$ and a descent step of $0.1$?


* **(l)** What happens if we apply the gradient descent algorithm with initialization $x_0 = -1$ and a descent step greater than $0.54$?



In [6]:
from interaction_dense import show_optimization
show_optimization()




> When the function to be minimized is not convex, the gradient descent algorithm becomes unpredictable and does not produce consistent results. **The results of the algorithm are very sensitive to variations in the gradient step size**.
>
> In the vast majority of cases in deep learning, **the loss function to be minimized is not convex, and the gradient descent algorithm may converge to a local minimum** rather than to the global one.
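
The dependence on the starting point can be reproduced numerically for the function $f(x) = (x/5)^4 + (x/5)^3 - 6(x/5)^2 + 1$ plotted above. The sketch below runs a fixed number of gradient descent steps from two hypothetical initializations, $x_0 = 3$ and $x_0 = -3$, with a descent step of $0.1$; the derivative is computed by hand with the chain rule, and the number of steps is an arbitrary choice.

```python
def f(x):
    u = x / 5
    return u**4 + u**3 - 6 * u**2 + 1

def df(x):
    """Derivative of f, using d/dx = (1/5) d/du with u = x / 5."""
    u = x / 5
    return (4 * u**3 + 3 * u**2 - 12 * u) / 5

def gradient_descent(df, x0, lr, n_steps=2000):
    """Run a fixed number of steps x <- x - lr * f'(x)."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * df(x)
    return x

x_right = gradient_descent(df, x0=3.0, lr=0.1)   # starts right of x = 0
x_left = gradient_descent(df, x0=-3.0, lr=0.1)   # starts left of x = 0

print(x_right, f(x_right))
print(x_left, f(x_left))
```

Both runs stop at a point where the derivative vanishes, but only one of them reaches the lowest value of $f$.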
>
> Unfortunately, gradient descent and its variants are among the few optimization algorithms that can be used in practice, because they remain efficient at the scale of deep learning models given our computing capabilities.
>
>
> As you will see later in the practical modules, **the gradient step size is one of the hyperparameters with the greatest influence on the performance of a deep learning model**.
>
> In the next notebooks, you will see how we can use these loss and activation functions to solve problems using neural networks. We will also use objects that you are already familiar with and that allow us to evaluate models: **metrics**. You may be wondering why we don't just use metrics to optimize our perceptrons, in this case **accuracy** for example, since we are dealing with a classification problem. Take a look at the following widget, paying particular attention to the differentiability and convexity of the loss function with respect to the metric.
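
One way to see the difference between a metric and a loss is to compare them on two nearby weight vectors. In the sketch below, which uses hypothetical toy data, labels in $\{-1, 1\}$ and a perceptron without bias, the accuracy is identical for both vectors while the squared tanh loss changes. A small change in $w$ therefore gives gradient descent no information through the accuracy, whereas the loss still indicates which vector is better.

```python
import numpy as np

def accuracy(w, X, Y):
    """Share of points whose sign-based prediction matches the label."""
    pred = np.where(X @ w >= 0, 1, -1)
    return np.mean(pred == Y)

def squared_tanh_loss(w, X, Y):
    """Sum of squared differences between tanh outputs and labels."""
    return np.sum((np.tanh(X @ w) - Y) ** 2)

# Toy data (hypothetical), labels in {-1, 1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
Y = np.array([1, 1, -1, -1])

w_a = np.array([1.0, 1.0])
w_b = np.array([1.1, 1.0])  # small variation of w_a

print(accuracy(w_a, X, Y), accuracy(w_b, X, Y))
print(squared_tanh_loss(w_a, X, Y), squared_tanh_loss(w_b, X, Y))
```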



In [7]:
from dl_widgets import show_accuracy
show_accuracy()






<hr style="border-width:2px;border-color:#75DFC1">
<center><h1> Key points to remember </h1></center>

<hr style="border-width:2px;border-color:#75DFC1">


> * The scalar product is the main tool we use for classification. **This classification is purely geometric**.
> * The objective of a Perceptron is to **find a hyperplane that separates the different classes of individuals**. 
> * This objective is achieved by **minimizing the loss function using gradient descent**.
> * There are a number of **activation functions** that address different issues, and their importance will be even greater for multilayer Perceptrons.
> * If the dataset is not linearly separable, there is not necessarily a single global minimum.
> * **A gradient step that is too large or too small will prevent the algorithm from converging to a satisfactory solution**. Finding the right gradient step and initialization is one of the main challenges of deep learning.

<hr style="border-width:2px;border-color:#75DFC1">
<center><h1> Going further - The Perceptron algorithm </h1></center>

<hr style="border-width:2px;border-color:#75DFC1">


> In this exercise, we will introduce the operating principle of the perceptron algorithm. To do this, we will train a model based on the simple perceptron principle using the *Moons* dataset from the *scikit-learn* library.
>
> As mentioned earlier, the **choice** of activation function depends on **the desired output space** of the Perceptron model:
>
> * The **sigmoid** function maps $(-\infty, \infty)$ to $(0,1)$, making it a natural choice for outputting a probability in a classification problem. In this case, the Perceptron model is equivalent to **logistic regression**.
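
As a rough sketch of this equivalence, here is a sigmoid perceptron trained by gradient descent on the mean binary cross-entropy, written with NumPy only. It uses a small hand-made, linearly separable dataset rather than the Moons data of the exercise, and the learning rate, number of steps and zero initialization are assumed values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_perceptron(X, y, lr=0.5, n_steps=500):
    """Gradient descent on the mean binary cross-entropy of sigmoid(Xw + b)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)        # predicted probabilities of class 1
        error = p - y                 # derivative of the loss w.r.t. Xw + b (times n)
        w -= lr * (X.T @ error) / n   # update of the weights
        b -= lr * np.mean(error)      # update of the bias
    return w, b

# Toy data (hypothetical), labels in {0, 1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0],
              [-1.0, -2.0], [-2.0, -0.5], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

w, b = train_sigmoid_perceptron(X, y)
proba = sigmoid(X @ w + b)
pred = (proba >= 0.5).astype(int)
print(proba.round(3), pred)
```

The outputs can be read as the estimated probability of belonging to class 1, which is then thresholded at 0.5 to obtain a class.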
>
> Here is the link to the exercise on [colab](https://colab.research.google.com/drive/1QzbcVRvXCTjgtDoF6EKzYX_UM6OyglNt?usp=sharing).

