<img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="360" />

# Maths behind AI - Simplified

## Table of Contents

1. [Data Types Used in AI](#section1)<br>
  - 1.1 [Scalars (0D tensors)](#section101)<br>
  - 1.2 [Vectors (1D tensors)](#section102)<br>
  - 1.3 [Matrices (2D tensors)](#section103)<br>
  - 1.4 [3D tensors and Higher-Dimensional Tensors](#section104)<br>
  - 1.5 [Key attributes of Tensors](#section105)<br>
  - 1.6 [Manipulating tensors in Numpy](#section106)<br>
  - 1.7 [The Notion of Data Batches](#section107)<br>
  - 1.8 [Real-world examples of data tensors](#section108)<br><br>
2. [Vector Data](#section2)<br><br>
3. [Tensor Operations in a Nutshell](#section3)<br>
  - 3.1 [Element-wise Operations](#section301)<br>
  - 3.2 [Broadcasting](#section302)<br>
  - 3.3 [Tensor Dot](#section303)<br>
  - 3.4 [Tensor Reshaping](#section304)<br>
  - 3.5 [A Geometric Interpretation of Deep Learning](#section305)<br><br>
4. [Basic Maths for Gradient Descent](#section4)<br>
  - 4.1 [What’s a derivative?](#section401)<br>
  - 4.2 [Derivative of a Tensor Operation: the Gradient](#section402)<br>
  - 4.3 [Stochastic Gradient Descent](#section403)<br>
  - 4.4 [Chaining Derivatives: the Backpropagation Algorithm](#section404)<br>

<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/title.jpg" width="800" height="800"/>

<a id=section1></a>
## 1. Data Types Used in AI

- **Tensors**: **Data stored in multidimensional Numpy arrays** are called **tensors**. 


- In general, all **current machine-learning systems use tensors** as their **basic data structure**. 


- **Tensors** are **fundamental** to the **field**—so fundamental that **Google’s TensorFlow** was **named after them**. 


- So what’s a tensor? At its core, **a tensor is a container for data—almost always numerical data**. So, it’s a **container for numbers**. 


- You may be already familiar with **matrices, which are 2D tensors**: **tensors** are a **generalization of matrices** to an **arbitrary number of dimensions** (note that in the **context of tensors**, a **dimension** is often **called an axis**).


<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/tensor.png" width="600" height="600"/>

<a id=section101></a>
### 1.1 Scalars (0D tensors)

- A **tensor** that **contains** only **one number** is called a **scalar** (or **scalar tensor**, or **0-dimensional tensor**, or **0D tensor**).


- In **Numpy**, a **float32 or float64 number** is a **scalar tensor** (or **scalar array**). 


- You can **display** the **number of axes** of a **Numpy tensor** via the **ndim attribute**; a **scalar tensor has 0 axes (ndim == 0)**.


- The **number of axes** of a **tensor** is also **called its rank**. 


- Here’s a **Numpy scalar**:

In [0]:
import numpy as np

In [0]:
x = np.array(12)
x

array(12)

In [0]:
x.ndim

0

---

<a id=section102></a>
### 1.2 Vectors (1D tensors)

- An **array of numbers** is called a **vector**, or **1D tensor**. 


- A **1D tensor** is said to **have exactly one axis**. 


- Following is a **Numpy vector**:

In [0]:
x = np.array([12, 3, 6, 14, 30])
x

array([12,  3,  6, 14, 30])

In [0]:
x.ndim

1

- This **vector** has **five entries** and so is called a **5-dimensional vector**. 


- **Don’t confuse** a **5D vector** with a **5D tensor**! A **5D vector** has only **one axis** and has **five dimensions along its axis**, whereas a **5D tensor** has **five axes** (and may have **any number of dimensions along each axis**). 


- **Dimensionality** can **denote either** the **number of entries along a specific axis** (as **in** the **case of our 5D vector**) or the **number of axes in a tensor** (such as a **5D tensor**), which can be confusing at times. 


- In the latter case, it’s **technically more correct** to talk about a **tensor of rank 5** (the **rank of a tensor** being the **number of axes**), but the ambiguous notation 5D tensor is common regardless.
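
- As a minimal sketch of the distinction above, the snippet below builds a **5D vector** and a **rank-5 tensor** and compares their **ndim** and **shape**. The variable names and the axis sizes of the rank-5 tensor are arbitrary choices for illustration.

```python
import numpy as np

# A "5D vector": a single axis holding five entries.
vector_5d = np.array([12, 3, 6, 14, 30])

# A "5D tensor" (rank 5): five axes; the size of each axis is arbitrary here.
tensor_rank5 = np.zeros((2, 3, 4, 5, 6))

print(vector_5d.ndim, vector_5d.shape)        # 1 (5,)
print(tensor_rank5.ndim, tensor_rank5.shape)  # 5 (2, 3, 4, 5, 6)
```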

---

<a id=section103></a>
### 1.3 Matrices (2D tensors)

- An **array of vectors** is a **matrix**, or **2D tensor**. 


- A **matrix has two axes** (often referred to **rows and columns**). You can **visually interpret** a **matrix** as a **rectangular grid of numbers**.


- This is a **Numpy matrix**:

In [0]:
x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]])

In [0]:
x.ndim

2

- The **entries from** the **first axis** are called the **rows**, and the **entries from** the **second axis** are called the **columns**. 


- In the example, **[5, 78, 2, 34, 0]** is the **first row** of **x**, and **[5, 6, 7]** is the **first column**.
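
- The row and column mentioned above can be read directly from the array. This is a small sketch using plain Numpy indexing; the names `first_row` and `first_column` are only illustrative.

```python
import numpy as np

x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]])

first_row = x[0]        # index 0 along the first axis (rows)
first_column = x[:, 0]  # every row, index 0 along the second axis (columns)

print(first_row)     # [ 5 78  2 34  0]
print(first_column)  # [5 6 7]
```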

<a id=section104></a>
### 1.4 3D tensors and Higher-Dimensional Tensors

- If you **pack matrices** in a **new array**, you **obtain** a **3D tensor**, which you can **visually interpret** as a **cube of numbers**.


- Following is a **Numpy 3D tensor**:

In [0]:
x = np.array([[[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[15, 78, 32, 34, 80],
               [16, 79, 33, 35, 81],
               [17, 80, 34, 36, 82]],
              [[45, 78, 52, 34, 60],
               [46, 79, 53, 35, 71],
               [47, 80, 54, 36, 72]]])

In [0]:
x.ndim

3

- By **packing 3D tensors** in an **array**, you can **create a 4D tensor**, and so on. 


- In **deep learning**, you’ll generally **manipulate tensors** that are **0D to 4D**, although you may go up to **5D** if you **process video data**.

---

<a id=section105></a>
### 1.5 Key attributes of Tensors

A **tensor** is defined by **three key attributes**:

- **Number of axes (rank)** - For instance, a **3D tensor has three axes**, and a **matrix has two axes**. This is also **called the tensor’s ndim in** Python libraries such as **Numpy**.


- **Shape** - This is a **tuple of integers** that **describes** how many **dimensions** the **tensor has along each axis**. For instance, the previous **matrix example has shape (3, 5)**, and the **3D tensor example has shape (3, 3, 5)**. A **vector** has a **shape** with a **single element**, such as **(5,)**, whereas a **scalar** has an **empty shape, ()**.


- **Data type** (usually called **dtype** in Python libraries) - This is the **type of the data contained in the tensor**; for instance, a **tensor’s type** could be **float32, uint8, float64**, and so on. 

  - On **rare occasions**, you may see a **char tensor**. 
  
  - Note that **string tensors don’t exist in Numpy** (or in most other libraries), because **tensors live in preallocated, contiguous memory segments**: and **strings**, being **variable length**, would **preclude** the use of **this implementation**.

- To make this more concrete, let’s look back at the data we processed in the above examples.


- We will **display** the **number of axes** of the **tensor x**, the **ndim attribute**:

In [0]:
print(x.ndim)

3


- Here’s its **shape**:

In [0]:
print(x.shape)

(3, 3, 5)


- And this is its **data type**, the **dtype attribute**:

In [0]:
print(x.dtype)

int32


- So what we have here is a **3D tensor** of **32-bit integers** (the default integer width is platform dependent, so you may see **int64** instead). More precisely, it’s an **array of 3 matrices of 3 × 5 integers**.
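
- If you would rather not depend on the default, the **dtype** can be requested explicitly. The sketch below assumes a small made-up list of values and shows `dtype=` at creation time and `astype` for conversion.

```python
import numpy as np

values = [[5, 78, 2], [6, 79, 3]]   # made-up values for illustration

default_ints = np.array(values)                # integer width chosen by Numpy for the platform
small_ints = np.array(values, dtype=np.int32)  # explicitly 32-bit integers
as_floats = small_ints.astype(np.float32)      # converted copy with a new dtype

print(default_ints.dtype)  # platform dependent
print(small_ints.dtype)    # int32
print(as_floats.dtype)     # float32
```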

---

<a id=section106></a>
### 1.6 Manipulating tensors in Numpy

- **Selecting specific elements** in a **tensor** is called **tensor slicing**.

In [0]:
x[2]

array([[45, 78, 52, 34, 60],
       [46, 79, 53, 35, 71],
       [47, 80, 54, 36, 72]])

- In the previous example, we **selected** a **specific matrix along** the **first axis** using the **syntax x[i]**.


- Let’s look at the **tensor-slicing operations** you can do on **Numpy arrays**. The following example **selects matrices #0 to #2** (#2 isn’t included) and puts them in an **array of shape (2, 3, 5)**.

In [0]:
my_slice = x[0:2]
print(my_slice.shape)

(2, 3, 5)


- It’s **equivalent** to this **more detailed notation**, which **specifies** a **start index** and **stop index for** the **slice along each tensor axis**. 


- Note that **:** is **equivalent** to **selecting the entire axis**.

In [0]:
my_slice = x[0:2, :, :]
my_slice.shape

(2, 3, 5)

In [0]:
my_slice = x[0:2, 0:1, 0:4]
my_slice.shape

(2, 1, 4)

- In general, you may **select between any two indices along each tensor axis**. 


- For instance, in order **to select 2 × 2 matrix** in the **bottom right corner** of all the **parent matrices**, you do this:

In [0]:
x

array([[[ 5, 78,  2, 34,  0],
        [ 6, 79,  3, 35,  1],
        [ 7, 80,  4, 36,  2]],

       [[15, 78, 32, 34, 80],
        [16, 79, 33, 35, 81],
        [17, 80, 34, 36, 82]],

       [[45, 78, 52, 34, 60],
        [46, 79, 53, 35, 71],
        [47, 80, 54, 36, 72]]])

In [0]:
my_slice = x[:, 1:, 3:]
my_slice

array([[[35,  1],
        [36,  2]],

       [[35, 81],
        [36, 82]],

       [[35, 71],
        [36, 72]]])

- It’s also **possible** to **use negative indices**. 


- Much **like negative indices in Python lists**, they **indicate a position relative to the end of the current axis**. 


- In order to **select** the **central row** of **each parent matrix**, you do this:

In [0]:
my_slice = x[:, 1:-1, :]
my_slice

array([[[ 6, 79,  3, 35,  1]],

       [[16, 79, 33, 35, 81]],

       [[46, 79, 53, 35, 71]]])

---

<a id=section107></a>
### 1.7 The Notion of Data Batches

- In general, the **first axis (axis 0, because indexing starts at 0)** in all **data tensors** you’ll come across **in deep learning** will be the **samples axis** (sometimes called the **samples dimension**). 


- In addition, **deep-learning models don’t process** an **entire dataset at once**; rather, **they break the data into small batches**.



- Concretely, here’s **one batch of our x**, with **batch size of 1**:

In [0]:
batch = x[:1]
batch

array([[[ 5, 78,  2, 34,  0],
        [ 6, 79,  3, 35,  1],
        [ 7, 80,  4, 36,  2]]])

- And here’s the **next batch**:

In [0]:
batch = x[1:2]
batch

array([[[15, 78, 32, 34, 80],
        [16, 79, 33, 35, 81],
        [17, 80, 34, 36, 82]]])

- The **batch size** is the **difference between** the **two indices** used in the **slice above (batch_size = 2 - 1 = 1)**.


- And the **nth batch**, with a **batch size of 1**: **batch = x[1 * n:1 * (n + 1)]**


- Here the **value before :** is the **product of n and batch_size**, and the **value after :** is the **product of n + 1 and batch_size**.

In [0]:
# if n = 2
n = 2
batch = x[(1 * n):(1 * (n + 1))]
batch

array([[[45, 78, 52, 34, 60],
        [46, 79, 53, 35, 71],
        [47, 80, 54, 36, 72]]])

- When considering such a **batch tensor**, the **first axis (axis 0)** is **called** the **batch axis** or **batch dimension**. 


- This is a **term** you’ll **frequently encounter** when **using Keras** and **other deep-learning libraries**.
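
- The slicing rule above generalizes to any batch size. The helper below is a hypothetical sketch (the name `iterate_batches` is not from any library) that yields consecutive batches along **axis 0**; the last batch may be smaller when the number of samples is not a multiple of the batch size.

```python
import numpy as np

def iterate_batches(data, batch_size):
    """Yield consecutive slices of `data` along axis 0 (the samples axis)."""
    for start in range(0, data.shape[0], batch_size):
        # Equivalent to data[batch_size * n : batch_size * (n + 1)]
        yield data[start:start + batch_size]

# Stand-in data with the same shape as the 3D tensor used above.
x = np.arange(3 * 3 * 5).reshape((3, 3, 5))

batches = list(iterate_batches(x, batch_size=2))
for batch in batches:
    print(batch.shape)   # (2, 3, 5), then (1, 3, 5)
```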

---

<a id=section108></a>
### 1.8 Real-world examples of data tensors

- The **data** you’ll manipulate **will** almost always **fall into one** of the following **categories**:
<br><br>
  - **Vector data** - **2D tensors** of **shape (samples, features)**.
  <br><br>
  - **Timeseries data or Sequence data** - **3D tensors** of **shape (samples, timesteps, features)**.
  <br><br>
  - **Images** - **4D tensors** of **shape (samples, height, width, channels)** or **(samples, channels, height, width)**.
  <br><br>
  - **Video** - **5D tensors** of **shape (samples, frames, height, width, channels)** or **(samples, frames, channels, height, width)**.
  
<br>

Here **channels** refer to the **number of color channels**, for example, **gray scale images** have only a **single color channel**.

<br>
 
<table bgcolor="white">
  <tr text-align="left">
    <th style="font-weight:bold; font-size:14px; text-align:center">Timeseries Data</th>
    <th style="font-weight:bold; font-size:14px; text-align:center">Image Data</th>
  </tr>
  <tr>
    <td><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/timeseries.png" width="500" height="500"/></td>
    <td><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/imagedata.png" width="500" height="500"/></td>
  </tr>
</table>
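
- To make the four categories concrete, here is a sketch that allocates one placeholder tensor per category. All sizes (number of samples, timesteps, image resolution, and so on) are arbitrary assumptions chosen only to show the axis order.

```python
import numpy as np

# (samples, features)
vector_data = np.zeros((100, 3))

# (samples, timesteps, features)
timeseries_data = np.zeros((100, 50, 8))

# (samples, height, width, channels): "channels-last" RGB images
images = np.zeros((16, 32, 32, 3), dtype=np.uint8)

# (samples, frames, height, width, channels)
videos = np.zeros((2, 10, 32, 32, 3), dtype=np.uint8)

# The alternative "channels-first" layout: (samples, channels, height, width)
images_channels_first = np.transpose(images, (0, 3, 1, 2))

for name, tensor in [("vector", vector_data), ("timeseries", timeseries_data),
                     ("images", images), ("videos", videos)]:
    print(name, tensor.ndim, tensor.shape)
```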

---

<a id=section2></a>
## 2. Vector Data

- This is the **most common case**. 


- In such a dataset, **each single data point** can be **encoded as a vector**, and thus a **batch of data** will be **encoded as a 2D tensor** (that is, an **array of vectors**), where the **first axis is the samples axis** and the **second axis is the features axis**.


- Let’s take a look at two **examples**:
<br><br>
  - An **actuarial dataset of people**, where we consider each **person’s age, ZIP code**, and **income**. **Each person** can be **characterized** as a **vector of 3 values**, and thus an entire **dataset of 100,000 people** can be **stored** in a **2D tensor of shape (100000, 3)**.
<br><br>
  - A dataset of text documents, where we **represent each document by** the **counts** of **how many times each word appears** in it (out of a **dictionary of 20,000 common words**). **Each document** can be **encoded** as a **vector of 20,000 values (one count per word in the dictionary)**, and thus an entire **dataset of 500 documents** can be **stored** in a **tensor of shape (500, 20000)**.
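
- The word-count encoding can be sketched in a few lines. The tiny vocabulary and the two documents below are invented for illustration; a real dictionary would hold thousands of words, but the resulting tensor has the same **(samples, features)** layout.

```python
import numpy as np

# Hypothetical toy vocabulary and documents.
vocabulary = ["cat", "dog", "sat", "mat"]
documents = ["cat sat mat", "dog sat dog"]

word_index = {word: i for i, word in enumerate(vocabulary)}

# One row per document, one column per vocabulary word.
counts = np.zeros((len(documents), len(vocabulary)), dtype=np.int64)
for row, document in enumerate(documents):
    for word in document.split():
        if word in word_index:            # ignore out-of-vocabulary words
            counts[row, word_index[word]] += 1

print(counts.shape)  # (2, 4)
print(counts)
```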

---

<a id=section3></a>
## 3. Tensor Operations in a Nutshell

- Much as any **computer program** can be ultimately **reduced** to a **small set** of **binary operations on binary inputs** (**AND, OR, NOR,** and so on), all **transformations learned by deep neural networks** can be **reduced to** a handful of **tensor operations applied to tensors of numeric data**. 


- For instance, **it’s possible** to **add tensors**, **multiply tensors**, and so on.

<a id=section301></a>
### 3.1 Element-wise Operations

- **Element-wise operations**: **operations** that are **applied independently** to **each entry** in the **tensors** being considered. Examples are the **relu** operation and **addition**.


- This means these **operations** are **highly amenable to massively parallel implementations** (vectorized implementations, a term that comes from the vector processor supercomputer architecture from the 1970–1990 period). 

<br> 
- If you want **to write** a naive **Python implementation** of an **element-wise operation**, you **use a for loop**, as in this naive implementation of an **element-wise addition**:

# Element-wise addition

def naive_add(x, y):
    assert len(x.shape) == 2                  # x is a 2D Numpy tensor
    assert x.shape == y.shape
    x = x.copy()                              # Avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[i, j]
    return x

- On the **same principle**, you can do **element-wise multiplication**, **subtraction**, and **so on**. 


- In practice, **when dealing with Numpy arrays**, these **operations** are **available** as **well-optimized built-in Numpy functions**, which themselves **delegate** the **heavy lifting** to a **Basic Linear Algebra Subprograms** (BLAS) implementation if you have one installed (which you should).


- **BLAS** are **low-level, highly parallel, efficient tensor-manipulation routines** that are typically **implemented** in **Fortran or C**. 

<br> 
- So, in **Numpy**, you can do the **following element-wise operation**, and it will be **blazing fast**:

# Element-wise addition

z = x + y

# Element-wise maximum operation (relu)

z = np.maximum(z, 0.)
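
- As a sanity check, the sketch below writes a naive **relu** in the same loop style as `naive_add` and confirms that it agrees with the built-in Numpy version. The function name `naive_relu` and the random test data are assumptions made for this example.

```python
import numpy as np

def naive_relu(x):
    assert x.ndim == 2          # x is a 2D Numpy tensor
    x = x.copy()                # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0.)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 30))
y = rng.standard_normal((20, 30))

z_fast = np.maximum(x + y, 0.)   # vectorized Numpy version
z_naive = naive_relu(x + y)      # explicit Python loops

print(np.array_equal(z_fast, z_naive))  # True
```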

---

<a id=section302></a>
### 3.2 Broadcasting

- Our earlier naive implementation of **naive_add** only **supports** the **addition** of **2D tensors with identical shapes**. But **what happens** with **addition** when the **shapes** of the **two tensors being added differ**?


- When possible, and if there’s no ambiguity, the **smaller tensor** will be **broadcasted** to **match the shape** of the **larger tensor**. 

- **Broadcasting** consists of **two steps**:
  
  1. **Axes** (called **broadcast axes**) are **added to** the **smaller tensor** to **match** the **ndim** of the **larger tensor**.
<br><br>  
  2. The **smaller tensor** is **repeated alongside** these **new axes** to **match** the **full shape** of the **larger tensor**.

- Let’s look at a **concrete example**. 

  - Consider **X** with **shape (32, 10)** and **y** with **shape (10,)**. 
  
  - First, we **add** an **empty first axis to y**, whose **shape becomes (1, 10)**. 
  
  - Then, we **repeat y 32 times alongside** this **new axis**, so that we **end up** with a **tensor Y with shape (32, 10)**, where **Y[i, :] == y for i in range(0, 32)**. 
  
  - **At this point**, we can **proceed** to **add X and Y**, because **they have** the **same shape**.


- In **terms of implementation**, **no new 2D tensor is created**, because that **would be** terribly **inefficient**. The **repetition operation is entirely virtual**: it **happens at the algorithmic level rather** than at the **memory level**. But thinking of the vector being repeated 32 times alongside a new axis is a helpful mental model. 

<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/broadcasting.png" width="850" height="850"/>

<br> 
- Here’s what a **naive implementation** would look like:

def naive_add_matrix_and_vector(x, y):
    assert len(x.shape) == 2                  # x is a 2D Numpy tensor
    assert len(y.shape) == 1                  # y is a Numpy vector
    assert x.shape[1] == y.shape[0]
    x = x.copy()                              # Avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[j]
    return x

- **With broadcasting**, you can generally **apply two-tensor element-wise operations** if **one tensor has shape (a, b, … n, n + 1, … m)** and the **other has shape (n, n + 1, … m)**. The **broadcasting will** then **automatically happen** for **axes a through n - 1**. 

<br> 
- The following example applies the **element-wise maximum operation** to **two tensors of different shapes** via **broadcasting**:

x = np.random.random((64, 3, 32, 10))        # x is a random tensor with shape (64, 3, 32, 10)
y = np.random.random((32, 10))               # y is a random tensor with shape (32, 10)

z = np.maximum(x, y)                         # The output z has shape (64, 3, 32, 10) like x
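
- The two broadcasting steps can also be made explicit with `np.expand_dims` and `np.broadcast_to`. This is only a sketch of the mental model using the **(32, 10)** and **(10,)** shapes from the example above; in ordinary code you would simply write `X + y`. The zero stride on the new axis shows that the repetition is virtual rather than copied in memory.

```python
import numpy as np

X = np.ones((32, 10))
y = np.arange(10, dtype=np.float64)

# Step 1: add a broadcast axis so that y has the same ndim as X.
y_2d = np.expand_dims(y, axis=0)      # shape (1, 10)

# Step 2: "repeat" y along the new axis; this is a read-only view, not a copy.
Y = np.broadcast_to(y_2d, X.shape)    # shape (32, 10)

print(Y.shape)        # (32, 10)
print(Y.strides[0])   # 0: moving along axis 0 does not move in memory
print(np.array_equal(X + Y, X + y))   # True
```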

---

<a id=section303></a>
### 3.3 Tensor Dot

- The ___dot___ **operation**, also called a **tensor product** (**not** to be confused with an **element-wise product**) is the **most common**, **most useful tensor operation**. 


- **Contrary to element-wise operations**, it **combines entries in the input tensors**.


- An **element-wise product** is **done with** the __* operator__ in **Numpy, Keras, Theano**, and **TensorFlow**. 

<br> 
- ___dot___ uses a **different syntax** in **TensorFlow**, but **in both Numpy and Keras** it’s **done using** the **standard dot operator**:

z = np.dot(x, y)

- In **mathematical notation**, you’d note the **operation** with a **dot** (**.**): ```z = x . y```


- **Mathematically, what does the dot operation do?** 

<br> 
- Let’s **start with** the **dot product** of **two vectors x and y**. It’s computed as follows:

def naive_vector_dot(x, y):
    assert len(x.shape) == 1                  
    assert len(y.shape) == 1              # x and y are Numpy vectors    
    assert x.shape[0] == y.shape[0]
    z = 0.
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

- Note that the **dot product between two vectors** is a **scalar** and that **only vectors with** the **same number of elements** are **compatible for a dot product**. 

<br> 
- You can also take the **dot product between** a **matrix x** and a **vector y**, which **returns a vector** where the **coefficients are** the **dot products between y and the rows of x**. You **implement** it as follows:

def naive_matrix_vector_dot(x, y):
    assert len(x.shape) == 2
    assert len(y.shape) == 1
    assert x.shape[1] == y.shape[0]
    z = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            z[i] += x[i, j] * y[j]
    return z

- Note that as soon as **one of the two tensors** has an **ndim greater than 1**, **dot is no longer symmetric**, which is to say that **dot(x, y) isn’t the same as dot(y, x)**.
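
- A short sketch of this asymmetry, using small made-up arrays: swapping the arguments either fails on shape compatibility or gives a different result.

```python
import numpy as np

x = np.arange(6.).reshape((2, 3))   # matrix of shape (2, 3)
y = np.arange(3.)                   # vector of shape (3,)

print(np.dot(x, y).shape)           # (2,)

# Swapping the arguments is not even shape-compatible here.
try:
    np.dot(y, x)
    swapped_ok = True
except ValueError:
    swapped_ok = False
print(swapped_ok)                   # False

# With two square matrices both orders are valid, but the results differ.
a = np.array([[1., 2.], [3., 4.]])
b = np.array([[0., 1.], [1., 0.]])
print(np.array_equal(np.dot(a, b), np.dot(b, a)))   # False
```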


- Of course, a **dot product generalizes** to **tensors with an arbitrary number of axes**.


- The **most common application** may be the **dot product between two matrices**. 
  
  - You can take the **dot product** of **two matrices x** and **y (dot(x, y))** if and only **if x.shape[1] == y.shape[0]**. 
  
  - The **result** is a **matrix** with **shape (x.shape[0], y.shape[1])**, where the **coefficients are** the **dot products between the rows of x and the columns of y**. 

<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/tensordot2.png" width="400" height="300"/>


- Here’s the **naive implementation**:

def naive_matrix_dot(x, y):
    assert len(x.shape) == 2
    assert len(y.shape) == 2
    assert x.shape[1] == y.shape[0]
    z = np.zeros((x.shape[0], y.shape[1]))
    for i in range(x.shape[0]):
        for j in range(y.shape[1]):
            row_x = x[i, :]
            column_y = y[:, j]
            z[i, j] = naive_vector_dot(row_x, column_y)
    return z

- To **understand dot-product shape compatibility**, it helps to **visualize the input** and **output tensors** by **aligning them** as shown in the figure:

<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/tensordot3.png" width="250" height="250"/>
<br> 

- **x**, **y**, and **z** are **pictured as rectangles** (literal boxes of coefficients). 


- Because the **rows of x** and the **columns of y must have** the **same size**, it follows that the **width of x must match** the **height of y**. 

- You can take the **dot product between higher-dimensional tensors**, **following** the same **rules for shape compatibility** as outlined earlier for the 2D case:
<br><br>
 - _(a, b, c, d)_ **.** _(d,)_ **-->** _(a, b, c)_
<br><br>
 - _(a, b, c, d)_ **.** _(d, e)_ **-->** _(a, b, c, e)_
<br><br>
 - And so on.
 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/tensordot.png" width="500" height="500"/>
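
- The shape rules above can be checked with `np.dot`. The concrete sizes assigned to **a, b, c, d, e** below are arbitrary assumptions.

```python
import numpy as np

a, b, c, d, e = 2, 3, 4, 5, 6        # arbitrary axis sizes

x = np.ones((a, b, c, d))

out_vector = np.dot(x, np.ones(d))        # (a, b, c, d) . (d,)   --> (a, b, c)
out_matrix = np.dot(x, np.ones((d, e)))   # (a, b, c, d) . (d, e) --> (a, b, c, e)

print(out_vector.shape)   # (2, 3, 4)
print(out_matrix.shape)   # (2, 3, 4, 6)
```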

---

<a id=section304></a>
### 3.4 Tensor Reshaping

- A **third type** of **tensor operation** that’s essential to understand is **tensor reshaping**.


- **Reshaping a tensor** means **rearranging its rows and columns** to **match a target shape**.


- Naturally, the **reshaped tensor has** the **same total number of coefficients** as the **initial tensor**. 


- **Reshaping** is best understood via simple **examples**:

In [0]:
x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])

In [0]:
print(x.shape)

(3, 2)


In [0]:
x = x.reshape((6, 1))
x

array([[0.],
       [1.],
       [2.],
       [3.],
       [4.],
       [5.]])

In [0]:
x = x.reshape((2, 3))
x

array([[0., 1., 2.],
       [3., 4., 5.]])

- A **closely related operation** that’s commonly encountered is **transposition**. 


- **Transposing a matrix** means **exchanging its rows** and its **columns**, so that **x[i, :] becomes x[:, i]**:

In [0]:
x = np.zeros((300, 20))                   # Creates an all-zeros matrix of shape (300, 20)
x = np.transpose(x)
print(x.shape)

(20, 300)
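
- Note that `reshape` and `transpose` can produce the **same shape** while arranging the entries differently. The sketch below reuses the small **(3, 2)** matrix from the reshaping example to show the difference.

```python
import numpy as np

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])

reshaped = x.reshape((2, 3))    # keeps the entries in reading order
transposed = np.transpose(x)    # swaps the two axes: x[i, j] -> result[j, i]

print(reshaped)     # rows: [0. 1. 2.] and [3. 4. 5.]
print(transposed)   # rows: [0. 2. 4.] and [1. 3. 5.]
```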


---

<a id=section305></a>
### 3.5 A Geometric Interpretation of Deep Learning

- You just learned that **neural networks consist entirely of chains of tensor operations** and that all of these **tensor operations** are just **geometric transformations of the input data**. 


- It follows that you can **interpret a neural network** as a **very complex geometric transformation in a high-dimensional space**, **implemented via** a **long series of simple steps**. 


- **In 3D**, the **following mental image** may **prove useful**. 

 - **Imagine two sheets** of **colored paper**: **one red** and **one blue**. 

 - **Put one on top of** the **other**. 
 
 - Now **crumple them together** into a **small ball**. 
 
 - That **crumpled paper ball** is your **input data**, and **each sheet of paper** is a **class of data in a classification problem**. 
 
 - What a **neural network** (or any other machine-learning model) is **meant to do** is **figure out a transformation** of the **paper ball** that would **uncrumple it**, so as **to make** the **two classes cleanly separable again**. 
 
 - With **deep learning**, this would be **implemented** as a **series** of **simple transformations** of the **3D space**, such as **those you could apply** on the **paper ball with your fingers**, **one movement at a time**.
<br><br>  
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/deep_learning_interpretation.png" width="800" height="800"/>
<br><br>

- **Uncrumpling paper balls** is what **machine learning is about**: **finding neat representations** for **complex, highly folded data manifolds**. 


- At this point, you should have a pretty good intuition as to **why deep learning excels** at this: **it takes** the **approach** of **incrementally decomposing** a **complicated geometric transformation** into a **long chain of elementary ones**, which is pretty much the **strategy** a **human would follow** to **uncrumple a paper ball**. 


- **Each layer** in a **deep network applies** a **transformation** that **disentangles the data a little** and a **deep stack of layers makes tractable** an **extremely complicated disentanglement process**.

---

<a id=section4></a>
## 4. Basic Maths for Gradient Descent

- **Each neural layer transforms** its **input data** as follows:

output = relu(dot(W, input) + b)

- In this expression, **W** and **b** are **tensors** that are **attributes of the layer**. 

  - They’re **called** the **weights** or **trainable parameters** of the **layer** (the **kernel** and **bias attributes, respectively**). 
 
  - These **weights contain** the **information learned** by the **network** from **exposure to training data**.

- Initially, these **weight matrices** are **filled** with **small random values** (a step **called random initialization**). 


- Of course, there’s **no reason to expect** that **relu(dot(W, input) + b)**, when **W and b are random**, will **yield** any **useful representations**. The **resulting representations** are **meaningless**, **but** they’re a **starting point**. 
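
- Here is a minimal Numpy sketch of such a layer with **randomly initialized weights**. The function names `relu` and `dense`, the layer sizes, and the initialization scale are assumptions made for illustration and are not taken from any particular library.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.)

def dense(inputs, W, b):
    # output = relu(dot(W, input) + b)
    return relu(np.dot(W, inputs) + b)

rng = np.random.default_rng(0)
input_size, output_size = 4, 3                              # hypothetical sizes

W = rng.normal(scale=0.1, size=(output_size, input_size))   # small random values
b = np.zeros(output_size)

inputs = rng.random(input_size)
output = dense(inputs, W, b)

print(output.shape)   # (3,)
print(output)         # not meaningful yet: W and b are random
```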


- What comes **next** is to **gradually adjust** these **weights**, **based** on a **feedback signal**. This **gradual adjustment**, also **called training**, **is** basically **the learning that machine learning is all about**. 

<br> 
- This **happens within** what’s called **a training loop**, which works as follows. **Repeat these steps in a loop**, **as long as necessary**:

 1. **Draw** a **batch of training samples x** and **corresponding targets y**.
<br><br>
 2. **Run** the **network on x** (a step **called** the **forward pass**) to **obtain predictions y_pred**.
<br><br>
 3. **Compute** the **loss of** the **network on** the **batch**, a **measure of** the **mismatch between y_pred and y**.
<br><br>
 4. **Update all weights** of the **network in** a **way** that **slightly reduces** the **loss on** this **batch**.
 
 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/gradient_descent.png" width="700" height="700"/>
<br>

- You’ll **eventually end up** with a **network that has** a **very low loss on** its **training data**: a **low mismatch between predictions y_pred** and **expected targets y**. 


- The **network** has **“learned” to map** its **inputs to correct targets**. 


- **From afar**, it may **look like magic**, but **when** you **reduce it to elementary steps**, it **turns out to be simple**.

- **Step 1** sounds easy enough - just **I/O code**. **Steps 2 and 3** are merely the **application of** a handful of **tensor operations**, so you could **implement these steps** purely **from what you learned** in the **previous section**. 


- The **difficult part** is **step 4**: **updating the network's weights**. **Given** an **individual weight coefficient** in the network, **how** can you **compute whether** the **coefficient should be increased** or **decreased**, and **by how much?**

<br>
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/GD.png" width="500" height="500"/>
<br> 

- One **naive solution** would be to **freeze all weights** in the **network except** the **one scalar coefficient** being **considered**, and **try different values** for this **coefficient**. **But such** an **approach** would be **horribly inefficient**, because **you’d need** to **compute two forward passes** (which are **expensive**) for **every individual coefficient** (of which there are many, usually **thousands** and sometimes **up to millions**). 

- A much **better approach** is to **take advantage** of the fact that **all operations used** in the **network are differentiable**, and **compute** the **gradient of** the **loss with regard** to the **network’s coefficients**. 


- You can **then move the coefficients in** the **opposite direction from** the **gradient**, thus **decreasing the loss**.
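
- The four steps of the training loop can be sketched end to end on a toy problem. Everything below is an assumption made for illustration: a single linear layer without relu, synthetic noise-free data, a mean-squared-error loss, and a gradient written out by hand for that loss. For a whole batch, `x @ W.T` applies `dot(W, input)` to every sample at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: targets are an exact linear function of the inputs.
true_W = np.array([[2.0, -1.0, 0.5]])
inputs = rng.standard_normal((256, 3))
targets = inputs @ true_W.T                  # shape (256, 1)

W = rng.normal(scale=0.1, size=(1, 3))       # random initialization
step = 0.1
batch_size = 32
losses = []

for iteration in range(200):
    # 1. Draw a batch of training samples x and corresponding targets y.
    idx = rng.integers(0, len(inputs), size=batch_size)
    x, y = inputs[idx], targets[idx]

    # 2. Forward pass: obtain predictions y_pred.
    y_pred = x @ W.T

    # 3. Compute the loss on the batch (mean squared error).
    loss = np.mean((y_pred - y) ** 2)
    losses.append(loss)

    # 4. Move W a little in the opposite direction from the gradient.
    grad_W = 2.0 * (y_pred - y).T @ x / batch_size
    W -= step * grad_W

print(losses[0], losses[-1])   # the loss shrinks as W approaches true_W
print(W)
```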

<a id=section401></a>
### 4.1 What’s a derivative?

- Consider a continuous, smooth function **f(x) = y**, mapping a real number x to a new real number y. 
<br> 
  - Because the **function** is **continuous**, a **small change in x** can only **result in** a **small change in y**, that’s the intuition behind continuity. 
<br><br>  
  - Let’s say you **increase x by** a **small factor epsilon_x**: this **results in** a **small epsilon_y change to y**:

f(x + epsilon_x) = y + epsilon_y

- In addition, because the **function is smooth** (its **curve doesn’t have any abrupt angles**), **when epsilon_x** is **small enough**, **around** a certain **point p**, it’s **possible to approximate f** as a **linear function** of **slope a**, so that **epsilon_y becomes a * epsilon_x**:

f(x + epsilon_x) = y + a * epsilon_x

- Obviously, this **linear approximation** is **valid** only **when x** is **close enough to p**. 
<br><br> 
  - The **slope a** is **called** the **derivative of f in p**. 
<br><br> 
  - If **a** is **negative**, it **means** a **small change of x around p** will **result** in a **decrease of f(x)**. 
<br><br> 
  - And if **a** is **positive**, a **small change in x** will **result in** an **increase of f(x)**. 
<br><br>   
  - Further, the **absolute value of a** (the **magnitude of** the **derivative**) **tells** you **how quickly** this **increase or decrease will happen**.
  
  
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/derivative.png" width="600" height="600"/>
<br>

- For every **differentiable function f(x)** (differentiable means **“can be differentiated”**: for example, smooth, continuous functions can be differentiated), there **exists** a **derivative function f'(x)** that **maps values** of **x** to the **slope of** the **local linear approximation of f** in those **points**. 


- For instance, the **derivative of cos(x)** is **-sin(x)**, the **derivative** of **f(x) = a * x** is **f'(x) = a**, and so on.


- If you’re trying **to update x** by a **factor epsilon_x** in order **to minimize f(x)**, and you **know** the **derivative of f**, then your job is done: the **derivative completely describes how f(x) evolves** as you **change x**. If you want **to reduce** the **value of f(x)**, you just **need to move x** a little in the **opposite direction from the derivative**.
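
- The slope can be estimated numerically straight from the definition above. This sketch uses a forward difference with a hypothetical helper `numerical_derivative`; the point `p`, the value of `epsilon_x`, and the step size are arbitrary choices.

```python
import math

def numerical_derivative(f, x, epsilon_x=1e-6):
    # Slope a such that f(x + epsilon_x) is approximately f(x) + a * epsilon_x
    return (f(x + epsilon_x) - f(x)) / epsilon_x

p = 1.0
slope = numerical_derivative(math.cos, p)
print(slope, -math.sin(p))      # the two values are close

# Moving x a little in the opposite direction from the derivative reduces f(x).
step = 0.1
x_new = p - step * slope
print(math.cos(x_new) < math.cos(p))   # True
```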

---

<a id=section402></a>
### 4.2 Derivative of a Tensor Operation: the Gradient

- A **gradient** is the **derivative** of a **tensor operation**. 


- It’s the **generalization** of the **concept of derivatives** to **functions of multidimensional inputs**: that is, to **functions** that **take tensors as inputs**.


- **Consider** an **input vector x**, a **matrix W**, a **target y**, and a **loss function loss**. You can **use W to compute** a **target candidate y_pred**, and **compute** the **loss**, or **mismatch**, **between** the **target candidate y_pred** and the **target y**:

y_pred = dot(W, x)
loss_value = loss(y_pred, y)

- If the **data inputs x** and **y are frozen**, then **this** can be **interpreted** as a **function mapping values of W to loss values**:

loss_value = f(W)

- Let’s say the **current value** of **W is W0**. 
<br><br>
 - Then the **derivative of f** at the **point W0** is a **tensor gradient(f)(W0)** with the **same shape as W**, where **each coefficient gradient(f)(W0)[i, j] indicates** the **direction and magnitude** of the **change in loss_value** you **observe when modifying W0[i, j]**. 
<br><br> 
 - That **tensor gradient(f)(W0)** is the **gradient** of the **function f(W) = loss_value at W0**.

<br> 
- The **derivative** of a **function f(x)** of a **single coefficient** can be **interpreted** as the **slope of** the **curve of f**. Likewise, **gradient(f)(W0)** can be **interpreted as** the **tensor describing** the **direction of steepest ascent of f(W) around W0**, as well as the **slope of this ascent**.


- For this reason, in much the same way that, **for** a **function f(x)**, you can **reduce** the **value of f(x)** by **moving x a little** in the **opposite direction from** the **derivative**, with a **function f(W) of a tensor**, you can **reduce f(W)** by **moving W** in the **opposite direction from the gradient**: 
<br><br>
  - For example, **W1 = W0 - step * gradient(f)(W0)** (where **step** is a **small scaling factor**). 
<br><br>  
  - That **means going against** the **direction of steepest ascent**, **which** intuitively should **put** you **lower on the curve**. 
<br><br>  
  - Note that the **scaling factor step** is **needed because gradient(f)(W0)** only **approximates** the **slope of f when** you’re **close to W0**, so you **don’t want** to **get too far from W0**.
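
- To make this concrete, here is a small **NumPy sketch** built on assumptions that are not in the text: the **loss** is taken to be the **mean squared error**, for which the gradient with respect to W has a simple closed form, and the shapes and values of `W0`, `x`, and `y` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector (frozen)
y = rng.normal(size=2)            # target (frozen)
W0 = rng.normal(size=(2, 3))      # current value of the weights

def f(W):
    """loss_value = f(W): mean squared error between dot(W, x) and y."""
    y_pred = np.dot(W, x)
    return np.mean((y_pred - y) ** 2)

def gradient_f(W):
    """Closed-form gradient of the mean squared error with respect to W."""
    y_pred = np.dot(W, x)
    return 2.0 / y.size * np.outer(y_pred - y, x)

grad = gradient_f(W0)             # same shape as W0
step = 0.01                       # small scaling factor
W1 = W0 - step * grad             # move against the gradient

print("gradient shape:", grad.shape)
print("loss at W0:", f(W0))
print("loss at W1:", f(W1))
```

- Each entry `grad[i, j]` says how `loss_value` reacts to a small change of `W0[i, j]`, which is why the gradient has the same shape as `W0`.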

---

<a id=section403></a>
### 4.3 Stochastic Gradient Descent

- Given a **differentiable function**, it’s **theoretically possible** to **find its minimum analytically**: it’s known that a **function’s minimum** is a **point where** the **derivative is 0**, so all you have to do is **find all** the **points where** the **derivative goes to 0** and **check for which** of these **points** the **function has** the **lowest value**.


- **Applied to** a **neural network**, that **means finding analytically** the **combination of weight values** that **yields** the **smallest possible loss value**. 
<br><br>
  - This can be **done** by **solving** the **equation gradient(f)(W) = 0 for W**. 
<br><br>  
  - This is a **system of N equations** in **N variables**, where **N** is the **number of coefficients** in the **network**. 
<br><br>  
  - Although it would be **possible to solve** such an **equation for N = 2** or **N = 3**, **doing** so **is intractable** for **real neural networks**, where the **number of parameters** is **never less than** a **few thousand** and **can often be** several **tens of millions**.
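
- For a sense of what **solving gradient(f)(W) = 0** looks like when it is tractable, the sketch below uses a **linear model with a mean squared error loss**. This special case is assumed here for illustration and is not taken from the text: the equation is then linear in the weights and can be solved directly. The data, `true_w`, and the noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # 50 samples, N = 2 weights
true_w = np.array([1.5, -0.5])               # made-up weights used to generate targets
y = X @ true_w + 0.1 * rng.normal(size=50)   # noisy targets

def gradient(w):
    """Gradient of mean((X @ w - y) ** 2) with respect to w."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

# Setting the gradient to zero gives the linear system (X^T X) w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print("analytic minimum:", w_star)
print("gradient at the minimum:", gradient(w_star))
```

- With nonlinear layers and millions of weights, no such direct solution is available, which is what motivates the iterative approach.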

- Instead, you can **use** the **four-step algorithm** outlined at the beginning of this section: **modify** the **parameters little by little based on** the **current loss value on** a **random batch of data**. 


- Because you’re **dealing with** a **differentiable function**, you can **compute its gradient**, which **gives** you **an efficient way** to **implement step 4**. 

<br> 
- If you **update the weights in** the **opposite direction from** the **gradient**, the **loss will be** a **little less every time**:
<br> 
  1. **Draw** a **batch of training samples x** and **corresponding targets y**.
<br><br>    
  2. **Run** the **network on x** to **obtain predictions y_pred**.
<br><br>    
  3. **Compute** the **loss of** the **network on** the **batch**, a **measure of** the **mismatch between y_pred and y**.
<br><br>    
  4. **Compute** the **gradient of** the **loss** with **regard to** the **network’s parameters** (a **backward pass**).
<br><br>   
  5. **Move** the **parameters** a **little in** the **opposite direction from** the **gradient**. 
<br><br>   
    - For example **W -= step * gradient**, thus **reducing** the **loss on** the **batch a bit**.
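
- The five steps above can be sketched in **NumPy** as follows. This is a minimal illustration, not a full training loop: the "network" is assumed to be a **single linear layer**, the loss is the **mean squared error**, and the synthetic data, `true_w`, `step`, and `batch_size` are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data for a linear model y = X @ w (an assumption for this sketch).
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.05 * rng.normal(size=1000)

w = np.zeros(3)        # parameters to learn
step = 0.1             # learning rate
batch_size = 32
losses = []

for iteration in range(200):
    # 1. Draw a random batch of samples and targets.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    x_batch, y_batch = X[idx], y[idx]
    # 2. Run the model on the batch to obtain predictions.
    y_pred = x_batch @ w
    # 3. Compute the loss on the batch (mean squared error).
    loss = np.mean((y_pred - y_batch) ** 2)
    losses.append(loss)
    # 4. Compute the gradient of the loss with regard to the parameters.
    gradient = 2.0 / batch_size * x_batch.T @ (y_pred - y_batch)
    # 5. Move the parameters a little in the opposite direction.
    w -= step * gradient

print("first batch loss:", losses[0])
print("last batch loss:", losses[-1])
print("learned w:", w)
```

- Because each batch is different, the recorded loss does not decrease strictly at every iteration, but it trends downward as `w` approaches `true_w`.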

<br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/SGD.png" width="700" height="700"/>
<br>

- This is called **mini-batch stochastic gradient descent** (**mini-batch SGD**). 


- The **term stochastic refers** to the fact that **each batch of data** is **drawn at random** (**stochastic** is a **scientific synonym** of **random**).

- It’s **important to pick** a **reasonable value for** the **step factor (a.k.a. learning rate)**. 
<br><br> 
  - **If** it’s **too small**, the **descent down the curve** will **take many iterations**, and it **could get stuck in** a **local minimum**. 
<br><br>   
  - **If step** is **too large**, your **updates may end up taking** you to **completely random locations on** the **curve**. 
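
- The effect of the **step factor** can be seen on a toy problem. The sketch below runs plain gradient descent on **f(w) = w ** 2**, a function chosen for this example because its derivative is simply `2 * w`; the three step values are arbitrary. Since this function has a single minimum, the sketch only illustrates slow progress and overshooting, not local minima.

```python
def descend(step, w=5.0, iterations=20):
    """Run plain gradient descent on f(w) = w ** 2 and return the final w."""
    for _ in range(iterations):
        gradient = 2 * w          # derivative of w ** 2
        w = w - step * gradient
    return w

w_small = descend(step=0.001)     # too small: barely moves in 20 iterations
w_good = descend(step=0.1)        # reasonable: approaches the minimum at 0
w_large = descend(step=1.1)       # too large: overshoots further at each step

print("small step:", w_small)
print("reasonable step:", w_good)
print("large step:", w_large)
```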
  
<br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/learning_rate.png" width="500" height="500"/>
<br>

- Note that a **variant of** the **mini-batch SGD algorithm** would be to **draw a single sample** and **target at each iteration**, **rather** than **drawing a batch of data**. This **would be true SGD** (as **opposed to mini-batch SGD**). 


- **Alternatively**, going to **the opposite extreme**, you could **run every step on all data available**, which is **called batch SGD**. **Each update** would then be **more accurate**, **but far more expensive**. 


- The **efficient compromise between** these **two extremes is** to **use mini-batches** of **reasonable size**.
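
- One way to see this trade-off is to compare **gradient estimates** computed on batches of different sizes with the gradient computed on **all the data**. The sketch below does this for a linear model with a mean squared error loss; the data, the point `w`, and the helper names `batch_gradient` and `estimate_error` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=2000)
w = np.zeros(3)                          # point at which gradients are estimated

def batch_gradient(indices):
    """Gradient of the mean squared error on the selected samples."""
    xb, yb = X[indices], y[indices]
    return 2.0 / len(indices) * xb.T @ (xb @ w - yb)

full_gradient = batch_gradient(np.arange(len(X)))   # uses all the data

def estimate_error(batch_size, trials=200):
    """Average distance between random-batch gradients and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        errors.append(np.linalg.norm(batch_gradient(idx) - full_gradient))
    return float(np.mean(errors))

for size in (1, 32, 2000):
    print(f"batch size {size:4d}: mean gradient error {estimate_error(size):.4f}")
```

- A single sample gives a cheap but noisy estimate, the full dataset gives the exact gradient at the highest cost per update, and a mini-batch sits in between.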

<br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/SGD_2D5.jpg" width="800" height="800"/>
<br>

- Additionally, **there exist multiple variants of SGD** that **differ** by **taking into account previous weight updates** when **computing** the **next weight update**, **rather than** just **looking at** the **current value of** the **gradients**. 


- **There is**, for instance, **SGD with momentum**, as well as **Adagrad**, **RMSProp**, and several others. **Such variants** are **known as optimization methods** or **optimizers**. 


- In particular, the **concept of momentum**, which is **used in many** of these **variants**, deserves your attention. 

<br> 
- **Momentum addresses two issues** with **SGD**: **convergence speed** and **local minima**.

  
<br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/minimas2.png" width="600" height="600"/>
<br><br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/minimas.png" width="700" height="700"/>
<br>

<br><br> 
  - As you can see, **around a certain parameter value**, there is a **local minimum**: **around that point**, **moving left** would **result in** the **loss increasing**, **but so would moving right**. 
<br><br>   
  - **If** the **parameter under consideration** were being **optimized via SGD with** a **small learning rate**, **then** the **optimization process** would **get stuck at** the **local minimum instead** of **making its way to** the **global minimum**.

- You can **avoid such issues** by **using momentum**, which **draws inspiration from physics**. 
<br><br> 
  - A useful mental image here is to **think of** the **optimization process as** a **small ball rolling down** the **loss curve**. 

<br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/momentum.png" width="600" height="600"/>
<br>

  - **If it has enough momentum**, the **ball won’t get stuck in** a **ravine** and **will end up at** the **global minimum**. 
<br><br>  
  - **Momentum** is **implemented** by **moving the ball** at **each step based not only** on the **current slope value** (**current acceleration**) **but also** on **the current velocity** (**resulting from past acceleration**). 
<br><br>   
  - **In practice**, this **means updating** the **parameter w based not only** on the **current gradient value but also** on the **previous parameter update**, such as in this **naive implementation**:

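past_velocity = 0.
momentum = 0.1                                                           # Constant momentum factor
learning_rate = 0.01                                                     # Assumed step size (not given in the original)
loss = float("inf")                                                      # Ensures the loop is entered
while loss > 0.01:                                                       # Optimization loop 
    w, loss, gradient = get_current_parameters()                         # Placeholder helper, not defined here
    velocity = past_velocity * momentum - learning_rate * gradient       # Velocity points downhill
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    update_parameter(w)                                                  # Placeholder helper, not defined here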
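
- A **self-contained variant** of this idea can be run on a toy problem. Everything specific here is an assumption for illustration: the one-dimensional loss `loss_fn`, the starting point, the learning rate, and a momentum factor of 0.9. The sketch uses the common **velocity form** of the update (`w = w + velocity`), and it only illustrates the **convergence speed** benefit, since this loss has no local minimum to escape from.

```python
def loss_fn(w):
    """Toy loss with its minimum at w = 0 (an assumption for this sketch)."""
    return 0.5 * w ** 2

def gradient_fn(w):
    """Derivative of loss_fn."""
    return w

def optimize(momentum, learning_rate=0.01, w=5.0, iterations=100):
    """Gradient descent with a velocity term; momentum=0 gives plain descent."""
    velocity = 0.0
    for _ in range(iterations):
        gradient = gradient_fn(w)
        # Blend the previous update with the current downhill step.
        velocity = momentum * velocity - learning_rate * gradient
        w = w + velocity
    return w

w_plain = optimize(momentum=0.0)
w_momentum = optimize(momentum=0.9)

print("plain descent:  w =", w_plain, "loss =", loss_fn(w_plain))
print("with momentum:  w =", w_momentum, "loss =", loss_fn(w_momentum))
```

- With the same learning rate and number of iterations, the run with momentum ends much closer to the minimum, because successive downhill steps accumulate in `velocity`.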

<br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/momentum1.png" width="500" height="500"/>
<br>
<br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/momentum2.png" width="600" height="600"/>
<br>

---

<a id=section404></a>
### 4.4 Chaining Derivatives: the Backpropagation Algorithm

- **In** the **previous algorithm**, we casually **assumed that** because a **function is differentiable**, we can **explicitly compute its derivative**. 


- **In practice**, a **neural network function consists** of **many tensor operations chained together**, **each** of which **has a simple**, **known derivative**. 


- For instance, this is a **network f composed** of **three tensor operations**, **a**, **b**, and **c**, with **weight matrices W1**, **W2**, and **W3**:

f(W1, W2, W3) = a(W1, b(W2, c(W3)))

- **Calculus** tells us that **such** a **chain of functions** can be **derived using** the following **identity**, **called** the **chain rule**: 

(f(g(x)))' = f'(g(x)) * g'(x)

- **Applying** the **chain rule to** the **computation of** the **gradient values** of a **neural network gives rise** to an **algorithm called Backpropagation** (also sometimes called **reverse-mode differentiation**). 


- **Backpropagation starts with** the **final loss value** and **works backward from** the **top layers to** the **bottom layers**, **applying** the **chain rule to compute** the **contribution that each parameter had in** the **loss value**.
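
- As a hand-written illustration of this backward sweep, the sketch below chains **three tensor operations** (a matrix product, a `tanh`, and another matrix product) and applies the **chain rule** from the loss back to the bottom weights. The architecture, the squared-error loss, and all names (`forward`, `backward`, `W1`, `W2`) are assumptions chosen for the example; the result is checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)             # input
y = rng.normal(size=2)             # target
W1 = rng.normal(size=(4, 3))       # bottom layer weights
W2 = rng.normal(size=(2, 4))       # top layer weights

def forward(W1, W2):
    """Forward pass: returns the loss and the intermediate values."""
    z = W1 @ x                     # first tensor operation
    h = np.tanh(z)                 # second tensor operation
    y_pred = W2 @ h                # third tensor operation
    loss = np.sum((y_pred - y) ** 2)
    return loss, (h, y_pred)

def backward(W1, W2):
    """Backward pass: apply the chain rule from the loss down to W1."""
    _, (h, y_pred) = forward(W1, W2)
    d_y_pred = 2.0 * (y_pred - y)          # d loss / d y_pred
    d_W2 = np.outer(d_y_pred, h)           # d loss / d W2
    d_h = W2.T @ d_y_pred                  # propagate to the hidden values
    d_z = d_h * (1.0 - h ** 2)             # tanh'(z) = 1 - tanh(z) ** 2
    d_W1 = np.outer(d_z, x)                # d loss / d W1
    return d_W1, d_W2

d_W1, d_W2 = backward(W1, W2)

# Check one coefficient against a finite-difference estimate.
eps = 1e-6
W1_shifted = W1.copy()
W1_shifted[0, 0] += eps
numeric = (forward(W1_shifted, W2)[0] - forward(W1, W2)[0]) / eps
print("backprop:", d_W1[0, 0], "finite difference:", numeric)
```

- Each line of `backward` multiplies the gradient received from the layer above by the local derivative of one operation, which is the chain rule applied one step at a time.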

<br> 
<img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/backpropagation.png" width="500" height="500"/>
<br>

- Nowadays, and for years to come, **people** will **implement networks in modern frameworks** that are **capable of symbolic differentiation**, such as **TensorFlow**. 
<br><br> 
  - This means that, **given a chain of operations with** a **known derivative**, **they can compute** a **gradient function for** the **chain** (by **applying** the **chain rule**) **that maps network parameter values** to **gradient values**. 
<br><br> 
  - **When** you **have access to such** a **function**, the **backward pass** is **reduced to** a **call to** this **gradient function**. 
<br><br> 
  - **Thanks** to **symbolic differentiation**, you’ll **never have to implement** the **Backpropagation algorithm by hand**. 
<br><br>  
  - For this reason, we **won’t waste your time** and your **focus on deriving** the **exact formulation of** the **Backpropagation algorithm** in these pages.
<br><br>  
  - **All you need** is a **good understanding** of **how gradient-based optimization works**.
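
- To give a feel for how a framework can **derive gradients automatically** from a chain of operations with known derivatives, here is a **toy sketch** in plain Python. It is not how TensorFlow is implemented: the `Value` class and its methods are hypothetical names invented for this example, and only scalar addition and multiplication are supported.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self) / d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        """Apply the chain rule from this value back to every input."""
        # Build an order in which every node comes after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, accumulating gradients.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

# loss = (w * x + b) * (w * x + b), with made-up values
w, x, b = Value(2.0), Value(3.0), Value(1.0)
pred = w * x + b
loss = pred * pred
loss.backward()

print("loss:", loss.data)
print("d loss / d w:", w.grad)
print("d loss / d b:", b.grad)
```

- Once every operation knows its own local derivative, the **backward pass** becomes a single call (`loss.backward()` here), which mirrors the idea that you rarely need to write Backpropagation by hand.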