# Deep Learning Theoretical Aspects

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import sklearn
%matplotlib inline

### Q1: Nonlinearity (15 points)

Much of the power of neural networks comes from the nonlinearity that is inherent in activation functions.  
Show that a network of N layers that uses a linear activation function can be reduced to a network with just an input layer and an output layer.

(Write down the output of two layers and use induction to extend the claim to all layers.)


<u>Answer</u>

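Suppose the input $x_1$ passes through the following layers:

1. Layer 1 with a linear activation function:
$z_1(x_1) = w_1 * x_1 + b_1$

2. It then passes through layer 2, also with a linear activation function:
$z_2(z_1(x_1)) = w_2 * z_1(x_1) + b_2$

Now let's calculate the output $y$:

$y = w_2 * (w_1 * x_1 + b_1) + b_2$

Expanding this expression gives another linear function:
$y = (w_2 * w_1)x_1 + (w_2 * b_1 + b_2)$

Let $W = w_2 * w_1$ and $B = w_2 * b_1 + b_2$, so that $y = W * x_1 + B$. (For matrix-valued weights the order of the product matters, hence $w_2 * w_1$.)

Induction step: assume the first $k$ layers reduce to a single linear map $z_k(x_1) = W_k * x_1 + B_k$. Adding layer $k+1$ gives $z_{k+1}(x_1) = w_{k+1} * (W_k * x_1 + B_k) + b_{k+1} = (w_{k+1} * W_k)x_1 + (w_{k+1} * B_k + b_{k+1})$, which is again linear in $x_1$.

Hence, no matter how many layers with a linear activation function are stacked, they can be represented as a single layer with a linear function, i.e. a network with just an input layer and an output layer.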
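The collapse of two linear layers into one can also be checked numerically. The snippet below is a minimal sketch: the layer sizes (4 inputs, 3 hidden units, 2 outputs) and the random weights are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs -> 3 hidden units -> 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

def two_layers(x):
    """Two stacked layers with an identity (linear) activation."""
    return W2 @ (W1 @ x + b1) + b2

# Collapsed single layer: W = W2 W1 and B = W2 b1 + b2.
W = W2 @ W1
B = W2 @ b1 + b2

x = rng.normal(size=4)
max_gap = np.max(np.abs(two_layers(x) - (W @ x + B)))
print(max_gap)  # difference is at the level of floating-point rounding
```

Repeating the same merge layer by layer mirrors the induction argument: each merge yields one linear layer that is then combined with the next.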

### Q2: Derivatives of Activation Functions (15 points)
Compute the derivative of these activation functions:

1 Sigmoid
<img src="https://cdn-images-1.medium.com/max/1200/1*Vo7UFksa_8Ne5HcfEzHNWQ.png" width="150">

The derivative is:
$f'(t) = \frac{e^{-t}}{(1+e^{-t})^2} = f(t)\,(1 - f(t))$
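A quick way to gain confidence in the sigmoid derivative is to compare it with a central finite difference. This is a sketch only; the evaluation grid and the step size `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_grad(t):
    """Analytic derivative e^{-t} / (1 + e^{-t})^2."""
    return np.exp(-t) / (1.0 + np.exp(-t)) ** 2

t = np.linspace(-5.0, 5.0, 11)  # arbitrary test points
h = 1e-5                        # arbitrary finite-difference step

# Central difference approximation of the derivative.
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
sigmoid_err = np.max(np.abs(numeric - sigmoid_grad(t)))
print(sigmoid_err)
```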

2 ReLU

<img src="https://cloud.githubusercontent.com/assets/14886380/22743194/73ca0834-ee54-11e6-903f-a7efd247406b.png" width="200">

The derivative is:

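$f'(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x < 0 \end{cases}$

The derivative is undefined at $x = 0$; in practice it is usually set to 0 there.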

3 Softmax
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/e348290cf48ddbb6e9a6ef4e39363568b67c09d3" width="250">

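Every output of the softmax depends on every input, so the derivative is a Jacobian:

$\frac{\partial \sigma(z)_j}{\partial z_i} = \sigma(z)_j\big(\delta_{ij} - \sigma(z)_i\big)$, where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise.

For $i = j$ this gives the diagonal term:
$\frac{\partial \sigma(z)_j}{\partial z_j} = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}\bigg(1-\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}\bigg)$

For $i \neq j$ it gives $\frac{\partial \sigma(z)_j}{\partial z_i} = -\sigma(z)_j\,\sigma(z)_i$.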
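The softmax derivative with respect to all inputs forms a matrix with entries $\sigma_j(\delta_{ij} - \sigma_i)$, whose diagonal is the $\sigma_j(1 - \sigma_j)$ term. The sketch below builds that matrix and compares it with finite differences; the test vector `z` and step `h` are arbitrary assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    """J[j, i] = d sigma_j / d z_i = sigma_j * (delta_ij - sigma_i)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5])  # arbitrary test input
J = softmax_jacobian(z)

# Finite-difference Jacobian, perturbing one input coordinate at a time.
h = 1e-6
J_num = np.zeros((3, 3))
for i in range(3):
    dz = np.zeros(3)
    dz[i] = h
    J_num[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * h)

jac_err = np.max(np.abs(J - J_num))
print(jac_err)
```

Each column of the Jacobian sums to zero because the softmax outputs always sum to 1.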

### Q3: Back Propagation (30 points)
Use the chain rule and backprop (also called the generalized delta rule) to compute the partial derivatives for these computations (i.e., dz/dx1, dz/dx2, dz/dx3):

```
z = x1 + 5*x2 - 3*x3^2
```

Answer:

$\frac{dz}{dx_1} = 1$

$\frac{dz}{dx_2} = 5$

$\frac{dz}{dx_3} = -6x_3$

```
z = x1*(x2-4) + exp(x3^2) / 5*x4^2
```

Answer (reading the second term as $\frac{e^{x_3^2}}{5x_4^2}$, i.e. with $5x_4^2$ in the denominator):

$\frac{dz}{dx_1} = x_2-4$

$\frac{dz}{dx_2} = x_1$

$\frac{dz}{dx_3} = \frac{2x_3e^{x_3^2}}{5x_4^2}$

$\frac{dz}{dx_4} = -\frac{2e^{x_3^2}}{5x_4^3}$
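These four partial derivatives can be sanity-checked against central finite differences. The sketch assumes the second term is read as `exp(x3^2) / (5 * x4^2)`, which is the reading the derivatives above correspond to; the evaluation point and step size are arbitrary.

```python
import math

def f(x1, x2, x3, x4):
    # Assumed reading: the second term is exp(x3^2) / (5 * x4^2).
    return x1 * (x2 - 4.0) + math.exp(x3 ** 2) / (5.0 * x4 ** 2)

def analytic_grad(x1, x2, x3, x4):
    e = math.exp(x3 ** 2)
    return [
        x2 - 4.0,                        # dz/dx1
        x1,                              # dz/dx2
        2.0 * x3 * e / (5.0 * x4 ** 2),  # dz/dx3
        -2.0 * e / (5.0 * x4 ** 3),      # dz/dx4
    ]

def numeric_grad(func, point, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grads = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grads.append((func(*up) - func(*down)) / (2.0 * h))
    return grads

point = [1.5, -0.5, 0.7, 2.0]  # arbitrary evaluation point
grad_gap = max(
    abs(a - n) for a, n in zip(analytic_grad(*point), numeric_grad(f, point))
)
print(grad_gap)
```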

```
z = 1/x3 + exp( (x1+5*(x2+3)) ^2 )
```

Answer:

$\frac{dz}{dx_1} = (2x_1 + 10x_2 + 30)e^{(x_1+5(x_2+3))^2}$

$\frac{dz}{dx_2} = (10x_1 + 50x_2 + 150)e^{(x_1+5(x_2+3))^2}$

$\frac{dz}{dx_3} = -\frac{1}{x_3^2}$
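The same result can be obtained mechanically by splitting the expression into elementary steps and propagating derivatives backwards through them. The sketch below does this by hand; the intermediate names (`a`, `u`, `v`, `e`, `r`) and the evaluation point are illustrative assumptions.

```python
import math

def forward_backward(x1, x2, x3):
    # Forward pass: one elementary operation per line.
    a = x2 + 3.0
    u = x1 + 5.0 * a
    v = u ** 2
    e = math.exp(v)
    r = 1.0 / x3
    z = r + e

    # Backward pass: apply the chain rule from the output to the inputs.
    dz_de = 1.0
    dz_dr = 1.0
    dz_dv = dz_de * e            # d exp(v) / dv = exp(v)
    dz_du = dz_dv * 2.0 * u      # d u^2 / du = 2u
    dz_dx1 = dz_du * 1.0         # du / dx1 = 1
    dz_da = dz_du * 5.0          # du / da = 5
    dz_dx2 = dz_da * 1.0         # da / dx2 = 1
    dz_dx3 = dz_dr * (-1.0 / x3 ** 2)
    return z, (dz_dx1, dz_dx2, dz_dx3)

# Arbitrary point chosen so that u = x1 + 5 * (x2 + 3) is about 1.
z, grads = forward_backward(0.5, -2.9, 2.0)
print(z, grads)
```

At this point the gradients are approximately $2e$, $10e$ and $-0.25$, matching the closed-form expressions with $u = 1$ and $x_3 = 2$.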

### Q4: Puppy or bagel? (20 points)
We've seen in class the funny examples of challenging images (Chihuahua or muffin, puppy or bagel etc.).

Let's say you were asked by someone to find more examples like that. You are able to call the 3 neural networks that won the recent ImageNet challenges, and get their predictions (the entire vector of probabilities for the 1000 classes).  

Describe methods that might assist you in finding more examples.

<u>Answer:</u>

<b>1. High Entropy in Predictions</b>: High entropy values suggest that the network is unsure, possibly because the image contains elements of different classes that are hard to distinguish. Calculate the entropy of the prediction vector for each image and find images where the neural network's prediction distribution has high entropy.

<b>2. Close Probabilities for Target Categories</b>: Search for images where the probability difference between the two classes of interest (e.g., "puppy" and "bagel") is below a certain threshold. A small gap shows that the network is almost equally likely to classify the image as either category.

<b>3. Disagreement Among Models</b>: For each image, compare the top predictions of each model. If there's significant disagreement among the models regarding the top predicted class for an image, it may indicate an image that fits the "puppy or bagel" pattern.

<b>4. Analyzing Misclassified Images</b>: For each model, sort the incorrectly predicted images by prediction confidence. The top of this list contains images for which the model was confident about a wrong prediction. Look specifically at images that were strongly misclassified by any of the models and check whether the true and predicted classes form a pair known to be visually similar.

<b>5. Clustering of Prediction Vectors</b>: Perform clustering on the space of prediction vectors, then analyze the clusters to find those that contain a high diversity of classes. Images that are central to such clusters might be visually ambiguous.

<b>6. Visual Similarity Analysis</b>: Use a feature extraction layer from one of the neural networks to get feature vectors for images. Compute similarity scores between images within ambiguous pairs of classes (like "puppy" vs. "bagel"). High similarity scores within such pairs may indicate images that are visually similar across different classes.
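The first three criteria (entropy, close probabilities, model disagreement) can be computed directly from the prediction vectors. The sketch below is illustrative only: the function names are hypothetical, and the `preds` array is a toy stand-in (3 models, 2 images, 4 classes) for the real 1000-class outputs.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of each row of a probability matrix."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def pair_margin(p, class_a, class_b):
    """Absolute probability gap between two classes of interest."""
    return np.abs(p[:, class_a] - p[:, class_b])

def disagreement(preds):
    """True where the models' top-1 classes are not all identical.

    preds has shape (n_models, n_images, n_classes).
    """
    top1 = preds.argmax(axis=-1)            # shape (n_models, n_images)
    return np.any(top1 != top1[0], axis=0)  # compare against model 0

# Toy data: image 0 is clear-cut, image 1 is ambiguous between classes 0 and 1.
preds = np.array([
    [[0.97, 0.01, 0.01, 0.01], [0.45, 0.44, 0.06, 0.05]],
    [[0.95, 0.03, 0.01, 0.01], [0.40, 0.48, 0.07, 0.05]],
    [[0.96, 0.02, 0.01, 0.01], [0.46, 0.42, 0.07, 0.05]],
])

mean_pred = preds.mean(axis=0)          # average over the three models
H = entropy(mean_pred)                  # higher for the ambiguous image
margin = pair_margin(mean_pred, 0, 1)   # smaller for the ambiguous image
flags = disagreement(preds)             # True only for the ambiguous image
print(H, margin, flags)
```

Ranking a large image pool by high entropy, low margin, or flagged disagreement would then surface candidates for manual inspection.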

### Q5: Convolution (20 points)
Consider the following convolution filters:
```python
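# 3x3 filters as NumPy arrays (np is imported at the top of the notebook)
k1 = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
k2 = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])
k3 = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
k4 = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]]) / 9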
```

Can you guess what each of them computes?

<u>Answer: </u>

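k1 - this filter returns the center pixel, so it leaves the image unchanged (identity filter).

k2 - this filter returns the pixel to the right of the center, so it shifts the image one pixel to the left. This assumes the filter is applied as a cross-correlation, as in most deep learning frameworks; a true convolution flips the kernel and shifts the image to the right instead.

k3 - this filter multiplies the center pixel by 8 and subtracts the eight surrounding pixels. The weights sum to zero, so flat regions map to 0 and only intensity changes remain: it emphasizes edges.

k4 - this filter computes the average pixel value of the 3x3 neighborhood, so it blurs the image (box blur).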
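These interpretations can be checked on a small example. The sketch below applies each filter as a cross-correlation with zero padding, using a hand-written helper (`correlate2d_same` is a hypothetical name, not a library function) and a toy 5x5 image with a vertical step edge.

```python
import numpy as np

def correlate2d_same(image, kernel):
    """Cross-correlation with zero padding; output has the input's shape."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

k1 = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
k2 = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])
k3 = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
k4 = np.ones((3, 3)) / 9

# Toy image: columns 0-2 are 0, columns 3-4 are 1 (a vertical step edge).
image = np.zeros((5, 5))
image[:, 3:] = 1.0

identity = correlate2d_same(image, k1)  # same as the input
shifted = correlate2d_same(image, k2)   # content moves one pixel to the left
edges = correlate2d_same(image, k3)     # nonzero only around the step
blurred = correlate2d_same(image, k4)   # step becomes a gradual ramp
print(edges[2], blurred[2])
```

Zero padding affects the values along the image border, so the interior rows and columns give the cleanest picture of what each filter does.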