# Lecture 1A - The fundamentals of deep learning in the broader context of AI.

## Artificial intelligence
**Artificial intelligence (AI)** can be described as the effort to automate intellectual tasks normally performed by humans. As such, AI is a general field that encompasses machine learning and deep learning, but that also includes many more approaches that may not involve any learning. 
One such approach is so-called **symbolic AI**, which was the dominant paradigm in AI from the 1950s to the late 1980s and reached its peak popularity during the expert systems boom of the 1980s. Most experts working on symbolic AI believed that human-level artificial intelligence could be achieved with a sufficiently large set of explicit rules stored in explicit databases.

<img src="1.png" alt="The relation between artificial intelligence, machine learning, and deep learning"/>

## Machine learning 
### ML and symbolic AI
The usual way to make a computer do useful work is to have a human programmer write down a computer program to be followed to turn input data into appropriate answers. Machine learning turns this around: the machine looks at the input data and the corresponding answers, and figures out what the rules should be.  A machine learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task.

<img src="2.png"/>

### ML and statistics
Machine learning started to flourish only in the 1990s, but it has quickly become the most popular and most successful subfield of AI, thanks to faster hardware and larger datasets. Machine learning is related to mathematical statistics. However, unlike statistics, machine learning tends to deal with large, complex datasets for which classical statistical analysis would be impractical. Moreover, machine learning, especially deep learning, exhibits comparatively little mathematical theory and is fundamentally an engineering discipline driven by empirical findings and deeply reliant on advances in software and hardware.

### Each machine learning algorithm needs 3 elements:
- Input data points
- Examples of the expected output
- A way to measure whether the algorithm is doing a good job

### The central problem in machine learning
A machine learning model transforms its input data into meaningful outputs, a process that is “learned” from exposure to known examples of inputs and outputs. Therefore, the central problem in machine learning is to meaningfully transform data: in other words, to learn useful representations of the input data at hand, that is, representations that get us closer to the expected output. Machine learning algorithms usually are not creative in finding these transformations; they merely search through a predefined set of operations, called a *hypothesis space*. So ML can be defined as searching for useful representations and rules over some input data, within a predefined space of possibilities, using guidance from a feedback signal.
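
As a minimal sketch of this definition, the toy example below searches a predefined hypothesis space and uses accuracy as the feedback signal. The data, the form of the rules (`x > t`), and all names are assumptions made for illustration only.

```python
import numpy as np

# Toy data (hypothetical): one feature per example and a binary answer.
inputs = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
answers = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Hypothesis space: rules "predict 1 if x > t" for a fixed grid of thresholds t.
candidate_thresholds = np.linspace(0.0, 5.0, 11)

def accuracy(threshold):
    """Feedback signal: fraction of examples the rule classifies correctly."""
    predictions = (inputs > threshold).astype(int)
    return float(np.mean(predictions == answers))

# "Learning" here is a search for the rule with the best feedback.
scores = [accuracy(t) for t in candidate_thresholds]
best_threshold = float(candidate_thresholds[int(np.argmax(scores))])
best_accuracy = max(scores)
print(best_threshold, best_accuracy)
```

The search cannot find a rule outside the grid of thresholds, which is the sense in which the hypothesis space is predefined.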

<img src="3.png"/>

## Other ML methods
Most of the machine learning algorithms used in the industry today aren’t deep learning algorithms. Deep learning isn’t always the right tool for the job. 
### Probabilistic modeling 
is the application of the principles of statistics to data analysis. It is one of the earliest forms of machine learning, and it’s still widely used to this day. Two of the best-known algorithms in this category are the Naive Bayes algorithm and logistic regression.
### Kernel methods
are a group of classification algorithms, the best known of which is the Support Vector Machine (SVM). The modern formulation of an SVM was developed by Vladimir Vapnik and Corinna Cortes in the early 1990s at Bell Labs and published in 1995. SVM finds decision boundaries separating two classes in two steps:
1. The data is mapped to a new high-dimensional representation where the decision boundary can be expressed as a hyperplane.
2. A good decision boundary (a separation hyperplane) is computed by trying to maximize the distance between the hyperplane and the closest data points from each class, a step called maximizing the margin.
<img src="4.png"/>

The technique of mapping data to a high-dimensional representation where a classification problem becomes simpler is often computationally intractable. This is why kernel functions emerged. They are computationally tractable operations that map any two points in the initial space to the distance between these points in the target representation space, completely bypassing the explicit computation of the new representation.
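
A minimal numerical sketch of this idea follows, assuming a degree-2 polynomial kernel (a choice made here for illustration, not taken from the text above). The kernel returns the inner product of two points in the target space, from which their distance follows, without ever constructing the new representation. The function names are hypothetical.

```python
import numpy as np

def explicit_map(p):
    """Explicit degree-2 feature map for a 2D point (chosen for illustration)."""
    x1, x2 = p
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(p, q):
    """Degree-2 polynomial kernel, computed entirely in the initial space."""
    return float(np.dot(p, q)) ** 2

p = np.array([1.0, 2.0])
q = np.array([3.0, -1.0])

# Inner product in the target space, with and without building the representation.
explicit_inner = float(np.dot(explicit_map(p), explicit_map(q)))
kernel_inner = poly_kernel(p, q)

# Squared distance in the target space, using kernel evaluations only.
kernel_sq_dist = poly_kernel(p, p) - 2.0 * poly_kernel(p, q) + poly_kernel(q, q)
explicit_sq_dist = float(np.sum((explicit_map(p) - explicit_map(q)) ** 2))
print(kernel_inner, explicit_inner, kernel_sq_dist, explicit_sq_dist)
```

Both routes give the same numbers, but the kernel route only ever touches the original 2D points.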
### Decision trees
are flowchart-like structures that let you classify input data points or predict output values given inputs.
They’re easy to visualize and interpret.

<img src="5.png"/>

### Random forests
introduced a robust, practical take on decision-tree learning that involves building a large number of specialized decision trees and then ensembling their outputs. They’re almost always the second-best algorithm for any shallow machine learning task.
### Gradient boosting machine
is a machine learning technique based on ensembling weak prediction models, generally decision trees. It uses gradient boosting, a way to improve any machine learning model by iteratively training new models that specialize in addressing the weak points of the previous models. It is one of the best, if not the best, algorithms for dealing with nonperceptual data today. Alongside deep learning, it’s one of the most commonly used techniques in <a href="http://kaggle.com">Kaggle</a> competitions.
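
The idea of iteratively training weak models on the weak points of the previous ones can be sketched in plain NumPy. This is a toy illustration under stated assumptions, not a production implementation: the weak models are one-split regression stumps, the loss is the squared error (so the "weak points" are simply the residuals), and the data, the `shrinkage` value, and all function names are hypothetical.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single-split regression stump that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold].mean()
        right = residuals[x > threshold].mean()
        fitted = np.where(x <= threshold, left, right)
        error = np.sum((fitted - residuals) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left, right)
    return best[1:]

def predict_stump(stump, x):
    threshold, left, right = stump
    return np.where(x <= threshold, left, right)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 6.0, 200))
y = np.sin(x)

shrinkage = 0.5                          # hypothetical step size
prediction = np.full_like(y, y.mean())   # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)
for _ in range(50):
    residuals = y - prediction           # weak points of the current ensemble
    stump = fit_stump(x, residuals)
    prediction += shrinkage * predict_stump(stump, x)
final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

Each stump on its own is a very poor model of the sine curve; the ensemble improves because every new stump is fitted to what the previous ones still get wrong.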

## Deep learning
Deep learning is a subfield of machine learning based on the idea of successive *layers* of representations. The number of layers that contribute to a model of the data is called the *depth* of the model. Modern deep learning often involves tens or even hundreds of successive layers of representations, while other machine learning methods tend to focus on learning only one or two layers of representations of the data. These layered representations are learned via models called *neural networks*. In summary, deep neural networks map inputs to targets via a deep sequence of simple data transformations (layers), and this mapping is learned by observing many examples of inputs and targets.

<img src="6.png"/>

### Weights
The specification of what a layer does to its input data is stored in the layer’s *weights*, which are also called the *parameters* of the layer. In other words, the transformation implemented by a layer is parameterized by its weights. In this context, *learning* means finding a set of values for the weights of all layers in a network, such that the network correctly maps example inputs to their associated targets.

### Loss function
To control the output of a neural network, you need to be able to measure how far this output is from what you expected. This is the job of the *loss function* of the network, also called the *objective function* or *cost function*. The loss function takes the predictions of the network and the true target and computes a distance score, capturing how well the network has done on this specific example.
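
As an illustrative sketch, the snippet below computes one common distance score, the categorical crossentropy for a single example with an integer class label. The function is a hand-written NumPy stand-in written for this example, not a library implementation, and the prediction vectors are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(probabilities, true_class):
    """Negative log of the probability assigned to the true class."""
    return float(-np.log(probabilities[true_class]))

true_class = 2
good_prediction = np.array([0.05, 0.05, 0.90])   # most mass on the true class
bad_prediction = np.array([0.70, 0.20, 0.10])    # most mass on a wrong class

good_loss = sparse_categorical_crossentropy(good_prediction, true_class)
bad_loss = sparse_categorical_crossentropy(bad_prediction, true_class)
print(good_loss, bad_loss)
```

The better the prediction matches the target, the closer the score is to zero.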

### Optimizer
The *optimizer* implements the *Backpropagation algorithm*. It uses the loss score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score for the current example. Initially, the weights of the network are assigned random values. With every example the network processes, the weights are adjusted a little in the correct direction, and the loss score decreases. This is the training loop, which, repeated a sufficient number of times (typically tens of iterations over thousands of examples), yields weight values that minimize the loss function.

# Lecture 1B - mathematical aspects of deep learning.

## Tensors
All current machine learning systems use *tensors* as their basic data structure. A tensor is a container for data—usually numerical data. Tensors are a generalization of matrices (rank-2 tensors) to an arbitrary number of dimensions. In Python, tensors can be represented by NumPy arrays. In particular, scalars are rank-0 tensors and vectors are rank-1 tensors. 

A tensor is defined by three key attributes:
- Number of axes (rank). This is also called the tensor’s ndim in Python libraries such as NumPy or TensorFlow.
- Shape—This is a tuple of integers that describes how many dimensions the tensor has along each axis.
- Data type (usually called dtype in Python libraries)—This is the type of the data contained in the tensor; for instance, a tensor’s type could be float16, float32, float64, uint8, and so on.
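
The three attributes can be inspected on small NumPy arrays. The arrays below are made-up examples chosen only to show one tensor of each rank from 0 to 3.

```python
import numpy as np

scalar = np.array(12)                          # rank-0 tensor
vector = np.array([12, 3, 6, 14, 7])           # rank-1 tensor
matrix = np.array([[5, 78, 2], [6, 79, 3]])    # rank-2 tensor
cube = np.zeros((4, 2, 3), dtype=np.float32)   # rank-3 tensor

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, t.ndim, t.shape, t.dtype)
```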

### Example of a tensor: array of 60,000 matrices of 28 × 28 integers:

In [None]:
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In [None]:
train_images.ndim #rank

In [None]:
train_images.shape #shape

In [None]:
train_images.dtype #data type

Each such matrix is a grayscale image, with coefficients between 0 and 255:

In [None]:
import matplotlib.pyplot as plt
for i in range(10):
    digit = train_images[i]
    plt.imshow(digit, cmap=plt.cm.binary)
    plt.show()

### Slicing
Selecting specific elements in a tensor is called tensor *slicing*. 

In [None]:
my_slice = train_images[15:200, :, :]  # images 15..199; the name avoids shadowing the built-in slice
my_slice.shape
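
Slicing works along every axis, not only the first one. The sketch below uses a small array of zeros as a stand-in for an image tensor (the sizes are assumptions chosen for illustration) so that it runs without downloading any data.

```python
import numpy as np

# Stand-in for an image tensor (hypothetical data): 300 images of 28 x 28 pixels.
images = np.zeros((300, 28, 28), dtype=np.uint8)

first_hundred = images[:100]            # samples 0..99, all pixels
bottom_right = images[:, 14:, 14:]      # 14 x 14 bottom-right corner of every image
center_patch = images[:, 7:-7, 7:-7]    # 14 x 14 patch centered in every image

print(first_hundred.shape, bottom_right.shape, center_patch.shape)
```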

### Batches
Deep learning models don’t process an entire dataset at once; rather, they break the data into *batches*. 
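
A minimal sketch of batching, assuming a made-up dataset of 1000 samples and a batch size of 128: each batch is a slice along the first axis (the samples axis), and the last batch is smaller when the dataset size is not a multiple of the batch size.

```python
import numpy as np

samples = np.arange(1000 * 4, dtype=np.float32).reshape(1000, 4)
batch_size = 128

batch_shapes = []
for start in range(0, len(samples), batch_size):
    batch = samples[start:start + batch_size]   # slice along axis 0
    batch_shapes.append(batch.shape)

print(len(batch_shapes), batch_shapes[0], batch_shapes[-1])
```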

### Examples of data tensors
The data we manipulate almost always falls into one of the following categories:

#### Vector data
Rank-2 tensors of shape (samples, features), where each sample is a vector of numerical attributes (“features”).

#### Timeseries data or sequence data
Rank-3 tensors of shape (samples, timesteps, features), where each sample is a sequence (of length timesteps) of feature vectors.

<br/><img src="7.png"/><br/>

#### Images 
Rank-4 tensors of shape (samples, height, width, channels), where each sample is a 2D grid of pixels, and each pixel is represented by a vector of values (“channels”).

<br/><img src="8.png"/><br/>

#### Video 
Rank-5 tensors of shape (samples, frames, height, width, channels), where each sample is a sequence (of length frames) of images.
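
The four categories can be illustrated with placeholder arrays. All sizes below are hypothetical and kept deliberately small; only the number and order of the axes matter.

```python
import numpy as np

vector_data = np.zeros((1000, 3))              # (samples, features)
timeseries_data = np.zeros((250, 390, 3))      # (samples, timesteps, features)
image_data = np.zeros((16, 32, 32, 3))         # (samples, height, width, channels)
video_data = np.zeros((2, 10, 32, 32, 3))      # (samples, frames, height, width, channels)

ranks = [t.ndim for t in (vector_data, timeseries_data, image_data, video_data)]
print(ranks)
```

In every case the first axis is the samples axis, which is the axis along which batches are taken.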

### Tensor operations
All transformations learned by deep neural networks can be reduced to a handful of tensor operations (or tensor functions) applied to tensors of numeric data. 

The most common tensor operations:
- a dot product of tensors
- an addition  between a tensor and another tensor.
- tensor reshaping

Some tensor operations, such as addition, are element-wise operations: they are applied independently to each entry in the tensors being considered, which makes them highly amenable to massively parallel implementations (vectorized implementations). Others, such as the dot product, combine entries of their inputs. Both kinds are available as well-optimized built-in NumPy functions, which themselves delegate the heavy lifting to a Basic Linear Algebra Subprograms (BLAS) implementation. BLAS are low-level, highly parallel, efficient tensor-manipulation routines that are typically implemented in Fortran or C.

In [None]:
def naive_add(x, y):
    assert len(x.shape) == 2
    assert x.shape == y.shape
    x = x.copy()
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[i, j]
    return x

import time
import numpy as np
x = np.random.random((50, 100))
y = np.random.random((50, 100))
  
t0 = time.time() 
for _ in range(1000):
    z = x + y
t1 = time.time() - t0

t0 = time.time() 
for _ in range(1000):
    z = naive_add(x, y)
t2 = time.time() - t0

t1,t2,t2/t1

In [None]:
def naive_vector_dot(x, y):
    assert len(x.shape) == 1         
    assert len(y.shape) == 1         
    assert x.shape[0] == y.shape[0]
    z = 0. 
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z
def naive_matrix_dot(x, y):
    assert len(x.shape) == 2                  
    assert len(y.shape) == 2                  
    assert x.shape[1] == y.shape[0]           
    z = np.zeros((x.shape[0], y.shape[1]))    
    for i in range(x.shape[0]):               
        for j in range(y.shape[1]):           
            row_x = x[i, :]
            column_y = y[:, j]
            z[i, j] = naive_vector_dot(row_x, column_y)
    return z

X = np.random.random((4, 6))
Y = np.random.random((6, 7))

X,Y,X.dot(Y),naive_matrix_dot(X, Y)

In [None]:
X = np.random.random((40, 60))
Y = np.random.random((60, 70))
t0 = time.time() 
for _ in range(10**3):
    X.dot(Y)
t1 = time.time() - t0

t0 = time.time() 
for _ in range(10**3):
    naive_matrix_dot(X, Y)
t2 = time.time() - t0

t1,t2,t2/t1

In [None]:
x = np.array(range(12))
x.shape

In [None]:
x.reshape(3,4)

In [None]:
x.reshape(6,2)

Another mechanism used in tensor operations is *broadcasting*. When possible, and if there’s no ambiguity, the smaller tensor will be broadcast to match the shape of the larger tensor. Broadcasting consists of two steps:

- Axes (called broadcast axes) are added to the smaller tensor to match the ndim of the larger tensor.
- The smaller tensor is repeated alongside these new axes to match the full shape of the larger tensor.

In [None]:
X = np.random.choice(range(11),(2,5))
y = np.random.choice(range(11),(5,))
X,y,X+y
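
The two broadcasting steps can also be carried out by hand and compared with what NumPy does implicitly. This is only a sketch with made-up arrays; in practice no repeated copy of the smaller tensor is materialized.

```python
import numpy as np

X = np.arange(10).reshape(2, 5)      # shape (2, 5)
y = np.arange(5) * 10                # shape (5,)

y_expanded = np.expand_dims(y, axis=0)          # step 1: add a broadcast axis -> (1, 5)
y_repeated = np.repeat(y_expanded, 2, axis=0)   # step 2: repeat along it -> (2, 5)

explicit_sum = X + y_repeated
broadcast_sum = X + y                # NumPy performs both steps implicitly
print(np.array_equal(explicit_sum, broadcast_sum))
```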

## Affine transform
An affine transform is the combination of a linear transform (achieved via a dot product with some matrix) and a translation (achieved via a vector addition):  $$y = W \cdot  x + b$$
A Dense layer without an activation function is an affine layer.
<br/><img src="9.png"/><br/>

## Activation functions
If you apply many affine transforms repeatedly, you still end up with an affine transform (so you could just have applied that one affine transform in the first place):
$$y = W_1 \cdot  (W_2 \cdot  x + b_2) + b_1 = W_1 \cdot W_2 \cdot x + W_1 \cdot b_2 + b_1$$
As a consequence, a multilayer neural network made entirely of Dense layers without activations would be equivalent to a single Dense layer. This “deep” neural network would just be a linear model in disguise! This is why we need activation functions.
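
The identity above can be checked numerically. The sketch below uses random matrices and vectors of arbitrary, made-up sizes and follows the same naming as the formula ($W_2, b_2$ applied first, then $W_1, b_1$).

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)
x = rng.normal(size=5)

# Two affine layers applied one after the other.
two_layers = W1 @ (W2 @ x + b2) + b1

# The single affine layer they collapse into.
W = W1 @ W2
b = W1 @ b2 + b1
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```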

Three major types of activation functions are used to introduce nonlinearities:

### The sigmoid
$$f(x) = \frac{1}{1+\exp(-x)}  $$
So the output of the sigmoid ranges from 0 to 1.

In [None]:
x = np.array(range(-1000,1000))/100
y = 1/(1+np.exp(-x))
plt.plot(x,y)

### The tanh function 
The tanh function uses a similar kind of S-shaped nonlinearity, but instead of ranging from 0 to 1, its output ranges from −1 to 1:
$$f(x) = \tanh(x)$$

In [None]:
z = np.tanh(x)
plt.plot(x,z)

### Rectified Linear Unit (ReLU) 
It uses the function 
$$f(x) = \max(0,x)$$
resulting in a characteristic hockey-stick-shaped response. The ReLU has recently become popular for many tasks, especially in computer vision.

In [None]:
t = np.maximum(x, 0)  # element-wise max(0, x)
plt.plot(x,t)

### Softmax Output Layers
Sometimes we want our output vector to be a probability distribution over a set of mutually exclusive labels. As a result, the desired output vector is of the form $p=[p_1,p_2,\ldots,p_n]$, where $\sum_{i=1}^n p_i = 1$. This can be achieved by a *softmax layer*. Unlike in other kinds of layers, the output of a neuron in a softmax layer depends on the outputs of all the other neurons in its layer. This is because we require the sum of all the outputs to be equal to 1. Letting $x_i$ be the logit of the $i$-th softmax neuron, we can achieve this normalization by setting its output to
$$y_i=\frac{\exp(x_i)}{\sum_{j=1}^n \exp(x_j)}$$
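
A minimal NumPy sketch of this normalization follows. Subtracting the largest logit before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result, because the same factor cancels in the numerator and the denominator.

```python
import numpy as np

def softmax(logits):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probabilities = softmax(logits)
print(probabilities, probabilities.sum())
```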

## Training
Initially, weight matrices are filled with small random values. What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called *training*, is the learning that machine learning is all about. This happens within a *training loop*, which works as follows. Repeat these steps in a loop, until the loss seems sufficiently low:
- Draw a batch of training samples, x, and corresponding targets, y_true.
- Run the model on x (a step called the forward pass) to obtain predictions, y_pred.
- Compute the loss of the model on the batch, a measure of the mismatch between y_pred and y_true.
- Update all weights of the model in a way that slightly reduces the loss on this batch.

The most difficult is step 4. One naive solution would be to freeze all weights in the model except the one scalar coefficient being considered, and try different values for this coefficient. But such an approach would be horribly inefficient, because you’d need to compute two forward passes for every individual coefficient. 

## Example
We create a model consisting of a chain of two dense layers:
- the first with $n$ units and the `relu` activation function;
- the second with $m$ units and the `softmax` activation function:

`model = keras.Sequential([layers.Dense(n, activation="relu"), layers.Dense(m, activation="softmax")])`

Next, we compile the model using the `sparse_categorical_crossentropy` loss function, the `accuracy` metric, and the `rmsprop` optimizer:

`model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])`

## Gradient descent
Gradient descent is a much better technique than adjusting one coefficient at a time. All of the functions used in deep learning models transform their input in a smooth and continuous way. Mathematically, these functions are differentiable. If you chain together such functions, the bigger function you obtain is still differentiable. This enables us to use the gradient to describe how the loss varies as you move the model’s coefficients in different directions. If we compute this gradient, we can use it to move the coefficients (all at once in a single update, rather than one at a time) in a direction that decreases the loss.
The *mini-batch stochastic gradient descent algorithm (mini-batch SGD)*:
- Draw a batch of training samples, x, and corresponding targets, y_true.
- Run the model on x to obtain predictions, y_pred (this is called the forward pass).
- Compute the loss of the model on the batch, a measure of the mismatch between y_pred and y_true.
- Compute the gradient of the loss with regard to the model’s parameters (this is called the backward pass).
- Move the parameters a little in the opposite direction from the gradient—for example, W -= learning_rate * gradient—thus reducing the loss on the batch a bit. The learning rate (learning_rate here) would be a scalar factor modulating the “speed” of the gradient descent process.
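
The five steps above can be sketched in plain NumPy for a very small model. This is a toy illustration, not how a deep learning framework implements training: the model is a single linear layer, the loss is the mean squared error, the gradients are derived by hand, and the data and hyperparameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): y = 3 * x + 1 plus a little noise.
x_all = rng.uniform(-1.0, 1.0, size=(1000, 1))
y_all = 3.0 * x_all + 1.0 + 0.01 * rng.normal(size=(1000, 1))

W = np.zeros((1, 1))
b = np.zeros(1)
learning_rate = 0.1
batch_size = 32

for step in range(2000):
    # 1. Draw a batch of training samples and targets.
    idx = rng.integers(0, len(x_all), size=batch_size)
    x, y_true = x_all[idx], y_all[idx]
    # 2. Forward pass.
    y_pred = x @ W + b
    # 3. Loss on the batch (mean squared error).
    loss = np.mean((y_pred - y_true) ** 2)
    # 4. Backward pass: gradients of the loss, derived by hand for this model.
    grad_pred = 2.0 * (y_pred - y_true) / batch_size
    grad_W = x.T @ grad_pred
    grad_b = grad_pred.sum(axis=0)
    # 5. Move the parameters against the gradient.
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(W, b, loss)
```

After training, `W` and `b` should be close to the values 3 and 1 used to generate the data.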

<br/><img src="10.png"/><br/>
It’s important to pick a reasonable value for the learning_rate factor. If it’s too small, the descent down the curve will take many iterations, and it could get stuck in a local minimum. If learning_rate is too large, your updates may end up taking you to completely random locations on the curve.
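
The effect of the learning rate can be seen on a one-dimensional toy loss. The sketch below minimizes the made-up function $w^2$ (gradient $2w$) for a fixed number of steps with three hypothetical learning rates.

```python
def run_gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize loss(w) = w**2, whose gradient is 2 * w, and return the final w."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

too_small = run_gradient_descent(0.001)   # barely moves in 20 steps
reasonable = run_gradient_descent(0.1)    # ends close to the minimum at w = 0
too_large = run_gradient_descent(1.5)     # overshoots further at every step

print(too_small, reasonable, too_large)
```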

A variant of the mini-batch SGD algorithm would be to draw a single sample and target at each iteration, rather than drawing a batch of data. This would be *true SGD* (as opposed to mini-batch SGD). Alternatively, going to the opposite extreme, we could run every step on all data available, which is called *batch gradient descent*. Each update would then be more accurate, but far more expensive. The efficient compromise between these two extremes is to use mini-batches of reasonable size.

There exist many variants of SGD that differ by taking into account previous weight updates when computing the next weight update, rather than just looking at the current value of the gradients:
- SGD with momentum, which draws inspiration from physics: If a ball rolling down has enough momentum, the ball won’t get stuck in a ravine and will end up at the global minimum. Momentum is implemented by moving the ball at each step based not only on the current slope value (current acceleration) but also on the current velocity (resulting from past acceleration). Momentum addresses two issues with SGD: convergence speed and local minima. 
<img src="14.png"/>
- Adagrad
- RMSprop

All such variants of SGD are known as *optimization methods* or *optimizers*.

In SGD with momentum the parameter $w$ is updated based not only on the current gradient value but also on the previous parameter update:

```python
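# Toy problem (an assumption for illustration): minimize loss(w) = (w - 3)**2
w = 0.0
learning_rate = 0.1
momentum = 0.1
past_velocity = 0.0
loss = (w - 3.0) ** 2
while loss > 0.01:
    gradient = 2.0 * (w - 3.0)  # derivative of the loss with respect to w
    velocity = past_velocity * momentum - learning_rate * gradient
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    loss = (w - 3.0) ** 2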
```

## The Backpropagation algorithm
**Backpropagation** is a way to use the derivatives of simple operations (such as addition, relu, or tensor product) to easily compute the gradient of arbitrarily complex combinations of these atomic operations based on the *chain rule*:
$$(f(g(x)))' = f'(g(x)) \cdot g'(x)$$

Backpropagation is the application of the chain rule to a computation graph.
<br/><img src="11.png"/><br/>

Computation graphs have been an extremely successful abstraction in computer science because they enable us to treat computation as data: a computable expression is encoded as a machine-readable data structure that can be used as the input or output of another program. 

<br/><img src="12.png"/><br/>
<br/><img src="13.png"/><br/>

By applying the chain rule to our graph, we obtain the following gradients:
- grad(loss_val, w) = 1 * 1 * 2 = 2
- grad(loss_val, b) = 1 * 1 = 1
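
The same gradients can be reproduced by hand in a few lines. The sketch assumes that the graph computes `loss_val = |y_true - (w * x + b)|` and uses made-up input values chosen so that the results match the gradients listed above; a finite-difference estimate is added as an independent check.

```python
# Assumed values for the graph (hypothetical, chosen to match the gradients above).
x, w, b, y_true = 2.0, 3.0, 1.0, 4.0

# Forward pass, one node at a time.
x1 = w * x                    # multiplication node
x2 = x1 + b                   # addition node
loss_val = abs(y_true - x2)   # loss node

# Backward pass: local derivatives of each node, multiplied along the path.
grad_loss_x2 = 1.0 if x2 > y_true else -1.0   # derivative of |y_true - x2| w.r.t. x2
grad_x2_x1 = 1.0
grad_x2_b = 1.0
grad_x1_w = x
grad_loss_w = grad_loss_x2 * grad_x2_x1 * grad_x1_w
grad_loss_b = grad_loss_x2 * grad_x2_b

# Finite-difference check of both gradients.
def loss(w_, b_):
    return abs(y_true - (w_ * x + b_))

eps = 1e-6
numeric_w = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
numeric_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
print(grad_loss_w, grad_loss_b, numeric_w, numeric_b)
```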

The *GradientTape* is the API that leverages TensorFlow’s automatic differentiation capabilities. 

In [None]:
import tensorflow as tf
x = tf.Variable(0.)                      
with tf.GradientTape() as tape:          
    y = 2 * x + 3                        
grad_of_y_wrt_x = tape.gradient(y, x)  
grad_of_y_wrt_x

In [None]:
x = tf.Variable(tf.random.uniform((2, 2)))     
with tf.GradientTape() as tape:
    y = 2 * x + 3 
grad_of_y_wrt_x = tape.gradient(y, x)    
grad_of_y_wrt_x

In [None]:
W = tf.Variable(tf.random.uniform((2, 2)))
b = tf.Variable(tf.zeros((2,)))
x = tf.random.uniform((2, 2)) 
with tf.GradientTape() as tape:
    y = tf.matmul(x, W) + b                         
grad_of_y_wrt_W_and_b = tape.gradient(y, [W, b])    
x, grad_of_y_wrt_W_and_b

## Example
Next, the model starts to iterate on the training data in mini-batches of $s$ samples, $I$ times over (each iteration over all the training data is called an *epoch*). For each batch, the model will compute the gradient of the loss with regard to the weights using the Backpropagation algorithm, and move the weights in the direction that will reduce the value of the loss for this batch:

`model.fit(train_images, train_labels, epochs=I, batch_size=s)`

After training we can use the model to make predictions:

`predictions = model.predict(test_images)`

and evaluate the model on new data:

`test_loss, test_acc = model.evaluate(test_images, test_labels)`

# Laboratory 1

## Task 1. 
Import necessary libraries: `from tensorflow import keras` and `from tensorflow.keras import layers`.

Load `train_images`, `train_labels`, `test_images`, `test_labels` from data set `mnist` from `tensorflow.keras.datasets`. What are the shape and type of `train_images` and `test_images`? 

Preprocess the data by reshaping it into the shape the model expects and scaling it so that all values are in the [0, 1] interval: 
Change tensors `train_images` and `test_images` into rank-2 tensors of shape $(60000, 28*28)$ and $(10000, 28*28)$, respectively. Change the type of these tensors to `float32` and scale them: divide them by their maximum value.

## Task 2
Using the `Sequential` class from Keras, create a model consisting of a sequence of two densely connected neural layers. The first layer should consist of 512 units and its activation function should be ReLU. The second (and last) layer is to be a 10-way softmax classification layer, which means it will return an array of 10 probability scores (summing to 1). Each score will be the probability that the current digit image belongs to one of 10 digit classes.

## Task 3
Compile the model using the `rmsprop` algorithm as the optimizer, `sparse_categorical_crossentropy` as the loss function, and `accuracy` (the fraction of the images that were correctly classified) as the metric.

## Task 4
Fit the model to its training data using 5 epochs and batch_size=128.

## Task 5
Use the trained model to predict class probabilities for the digits in `test_images`, and for each digit compare the index with the greatest probability with the appropriate element of `test_labels`. Find the first digit in `test_images` for which the label predicted by the model differs from the true label from `test_labels`. What is the true digit and what is its prediction? Plot this digit.

## Task 6
Evaluate the model on training and test data. Compare the test-set accuracy/loss with the training-set accuracy/loss.

## Task 7
Do tasks 4 and 6 using:
- 100 epochs and batch_size=len(train_labels)
- 1 epoch and batch_size=1.

## Task 8
Create a model consisting of a sequence of three densely connected neural layers. The first two layers should each consist of 256 units and their activation function should be ReLU. The last layer is to be a 10-way softmax classification layer. Next, compile the model in the same way as in Task 3 and fit the model to its training data using 10 epochs and batch_size=64. Finally, evaluate the model on training and test data and compare the training-set/test-set accuracy with the training-set/test-set accuracy of the model from Task 2. Which model has better accuracy on the training set and on the test set?

## Task 9
Check whether further increasing the number of layers influences the accuracy of a deep learning model. Experiment with various numbers of units, batch sizes, and numbers of epochs.