# Deep Learning Theoretical Aspects

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import sklearn
%matplotlib inline


Much of the power of neural networks comes from the nonlinearity that is inherent in activation functions.  
Show that a network of N layers that uses a linear activation function can be reduced to a network with just an input layer and an output layer.

(Write down what is the output of two layers and use induction to claim for all layers).


In [2]:
# Write your answer here
'''Each layer in the neural network computes a linear combination of its inputs, the weights and the biases. 
So if a layer uses only linear activation functions, the activation function can be replaced with a 
layer that represents the same linear equation as the activation functions.'''


In an N-layer neural network with a linear activation function, we can represent the output of each layer as a linear combination of its inputs:

$z_i^{(l)} = \sum_{j=1}^{n^{(l-1)}} w_{ij}^{(l)} a_j^{(l-1)}$



where $z_i^{(l)}$ is the input to the i-th neuron in the l-th layer, $n^{(l-1)}$ is the number of neurons in the (l-1)-th layer, $w_{ij}^{(l)}$ is the weight connecting the j-th neuron in the (l-1)-th layer to the i-th neuron in the l-th layer, and $a_j^{(l-1)}$ is the output of the j-th neuron in the (l-1)-th layer.

Since the activation function is linear, we can take it to be the identity function (a constant scaling factor would simply be absorbed into the weights). This means that the output of the l-th layer is simply:

$a_i^{(l)} = z_i^{(l)}$

Now, we can substitute the expression for $z_i^{(l)}$ into the expression for $a_i^{(l)}$ to get:

$a_i^{(l)} = \sum_{j=1}^{n^{(l-1)}} w_{ij}^{(l)} a_j^{(l-1)}$

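Applying this to two consecutive layers shows that they collapse into one:

$a_i^{(2)} = \sum_{j=1}^{n^{(1)}} w_{ij}^{(2)} \sum_{k=1}^{n^{(0)}} w_{jk}^{(1)} a_k^{(0)} = \sum_{k=1}^{n^{(0)}} \left( \sum_{j=1}^{n^{(1)}} w_{ij}^{(2)} w_{jk}^{(1)} \right) a_k^{(0)}$

In matrix form, $a^{(2)} = W^{(2)} W^{(1)} a^{(0)}$, which is a single linear layer with weight matrix $W^{(2)} W^{(1)}$. By induction, if the first $l-1$ layers are equivalent to one layer with weights $W^{(l-1)} \cdots W^{(1)}$, then adding the l-th layer gives $a^{(l)} = W^{(l)} W^{(l-1)} \cdots W^{(1)} a^{(0)}$, which is again a single linear layer. For the output layer this gives:

$a^{(N)} = W^{(N)} W^{(N-1)} \cdots W^{(1)} a^{(0)}$

At this point, we have a network with just an input and an output layer. The input layer has $n^{(0)}$ neurons, and the output layer has $n^{(N)}$ neurons. The weights connecting the input layer to the output layer are given by the product $W^{(N)} W^{(N-1)} \cdots W^{(1)}$. Bias terms, if present, collapse in the same way into a single bias vector. Therefore, we have reduced the original N-layer neural network into a single-layer neural network.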
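The reduction can also be checked numerically. The following is a minimal sketch, not part of the original answer: the layer widths and random weights are made up for illustration, and bias terms are included even though the derivation above omits them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer widths: 4 inputs, two hidden layers, 2 outputs.
sizes = [4, 5, 3, 2]
weights = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
biases = [rng.normal(size=sizes[l + 1]) for l in range(len(sizes) - 1)]

def forward(x):
    """Run the full network with identity (linear) activations."""
    a = x
    for W, b in zip(weights, biases):
        a = W @ a + b
    return a

# Collapse the layers one at a time: W (W_eff x + b_eff) + b.
W_eff = np.eye(sizes[0])
b_eff = np.zeros(sizes[0])
for W, b in zip(weights, biases):
    W_eff = W @ W_eff
    b_eff = W @ b_eff + b

x = rng.normal(size=sizes[0])
# The deep linear network and the single collapsed layer should agree.
print(np.allclose(forward(x), W_eff @ x + b_eff))
```

The collapsed matrix `W_eff` has shape `(2, 4)`, so it maps the input layer directly to the output layer.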

### Derivatives of Activation Functions
Compute the derivative of these activation functions:

1 Sigmoid
<img src="https://cdn-images-1.medium.com/max/1200/1*Vo7UFksa_8Ne5HcfEzHNWQ.png" width="150">

In [3]:
# Write your answer here

def sigmoid(x):
  return 1/(1 + np.exp(-x))

def sigmoid_derivative(x):
  sig = sigmoid(x)
  return sig*(1 - sig)






$f'(x) = \frac{d}{dx} \frac{1}{1+e^{-x}}$ $ = \frac{d}{dx} (1+e^{-x})^{-1}$

Using the chain rule, we can express this as:



$f'(x) = -(1+e^{-x})^{-2} \cdot \frac{d}{dx} (1+e^{-x})$

$f'(x) = \frac{e^{-x}}{(1+e^{-x})^{2}}$

$f'(x) = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}$

$f'(x) = f(x) \cdot \frac{e^{-x}}{1+e^{-x}}$

Since $\frac{e^{-x}}{1+e^{-x}} = 1 - \frac{1}{1+e^{-x}} = 1 - f(x)$, the derivative of the sigmoid function is:

$f'(x) = f(x) \cdot (1-f(x))$
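As a sanity check, the closed form can be compared against a central finite difference. This is only a sketch; the step size `h` and the sample points are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
h = 1e-5
# Central difference approximation of f'(x).
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)

# Largest absolute gap between the numeric and analytic derivatives.
print(np.max(np.abs(numeric - sigmoid_derivative(x))))
```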

2 Relu 

<img src="https://cloud.githubusercontent.com/assets/14886380/22743194/73ca0834-ee54-11e6-903f-a7efd247406b.png" width="200">

In [4]:
# Write your answer here

def ReLu_derivative(x):
  # Works for scalars and NumPy arrays; the value at x == 0 is set to 0 by convention
  return np.where(np.asarray(x) > 0, 1.0, 0.0)
    


To find its derivative, we need to distinguish between two cases: when the input is positive and when the input is negative.

When $x > 0$, the function is just $f(x) = x$. Therefore, its derivative is:

$f'(x) = \frac{d}{dx} x = 1$

When $x < 0$, the function is just $f(x) = 0$. Therefore, its derivative is:

$f'(x) = \frac{d}{dx} 0 = 0$

At $x = 0$ the left and right slopes differ (0 and 1), so the derivative is not defined there. Any value in $[0, 1]$ is a valid subgradient, and by convention we pick either 0 or 1. Choosing 0, as in the code above, the derivative of the ReLU function is:

$f'(x) = \begin{cases}
0 & x \leq 0 \\
1 & x > 0
\end{cases}$
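The kink at $x = 0$ can be seen numerically by comparing one-sided difference quotients. A small sketch, with an arbitrary step size `h`:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

h = 1e-6
left = (relu(0.0) - relu(0.0 - h)) / h    # slope approaching 0 from the left
right = (relu(0.0 + h) - relu(0.0)) / h   # slope approaching 0 from the right
print(left, right)
```

The two one-sided slopes disagree, which is why a convention is needed at zero.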

3 Softmax
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/e348290cf48ddbb6e9a6ef4e39363568b67c09d3" width="250">

In [5]:
# Write your answer here
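def Softmax(z):
  # Subtracting the max does not change the result and avoids overflow in exp
  e = np.exp(z - np.max(z))
  return e/np.sum(e)

def delta(i,j):
  if i == j:
    return 1
  else:
    return 0

def softmax_derivative(z):
  s = Softmax(z)  # evaluate the softmax once on the whole vector
  n = len(z)
  DS = np.zeros((n, n))
  for i in range(n):
    for j in range(n):
      DS[i][j] = s[i]*(delta(i,j) - s[j])
  return DS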

The Softmax activation function is defined as:

$f_j(z) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$ for $j=1,2,\ldots,K$

where $K$ is the number of output classes.

To find the derivative of the Softmax function, we need to compute the partial derivative of $f_j(z)$ with respect to $z_k$ for all $j$ and $k$. Writing the summation index as $m$ to keep it distinct from $k$, the quotient rule gives:

$\frac{\partial f_j}{\partial z_k} = \frac{\frac{\partial e^{z_j}}{\partial z_k} \sum_{m=1}^{K} e^{z_m} - e^{z_j} \frac{\partial}{\partial z_k} \sum_{m=1}^{K} e^{z_m}}{(\sum_{m=1}^{K} e^{z_m})^2}$

Here $\frac{\partial}{\partial z_k} \sum_{m=1}^{K} e^{z_m} = e^{z_k}$, and $\frac{\partial e^{z_j}}{\partial z_k}$ equals $e^{z_j}$ if $j = k$ and 0 otherwise.

If $j = k$, we can simplify this expression to:

$\frac{\partial f_j}{\partial z_k} = \frac{e^{z_j} (\sum_{m=1}^{K} e^{z_m} - e^{z_j})}{(\sum_{m=1}^{K} e^{z_m})^2}$

$\frac{\partial f_j}{\partial z_k} = f_j(z) (1 - f_j(z))$ for $j = k$

If $j \neq k$, we can simplify the expression to:

$\frac{\partial f_j}{\partial z_k} = \frac{- e^{z_j} e^{z_k}}{(\sum_{m=1}^{K} e^{z_m})^2}$

$\frac{\partial f_j}{\partial z_k} = -f_j(z) f_k(z)$ for $j \neq k$

Therefore, the derivative of the Softmax function is:

$\frac{\partial f_j}{\partial z_k} = \begin{cases}
f_j(z) (1 - f_j(z)) & j = k \\
-f_j(z) f_k(z) & j \neq k
\end{cases}$

This can also be written in matrix form as $J = \mathrm{diag}(f) - f f^{\top}$, where $J_{jk} = \frac{\partial f_j}{\partial z_k}$.
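The matrix form of this derivative, $\mathrm{diag}(f) - f f^{\top}$, can be sketched in a few lines and checked against finite differences. The input vector `z` below is an arbitrary example, and the function names are illustrative rather than taken from the original answer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    # J[j, k] = s[j] * (delta_jk - s[k])
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# Central finite differences, one input coordinate (column) at a time.
h = 1e-6
J_num = np.zeros_like(J)
for k in range(len(z)):
    step = np.zeros_like(z)
    step[k] = h
    J_num[:, k] = (softmax(z + step) - softmax(z - step)) / (2.0 * h)

print(np.allclose(J, J_num, atol=1e-8))
```

Because the softmax outputs always sum to 1, each column of the Jacobian sums to 0, which is a quick additional check.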

### Back Propagation
Use the chain rule and backprop (also called the generalized delta rule) to compute the partial derivatives for these computations (i.e., dz/dx1, dz/dx2, dz/dx3):

```
z = x1 + 5*x2 - 3*x3^2
```

In [6]:
# Write your answer here, using Markdown, image or any other suitable format



$\frac{\partial z}{\partial x_1}  = 1 \cdot 1 = 1$


Similarly, we have:

$\frac{\partial z}{\partial x_2} = 1 \cdot 5 = 5$



Finally, we have:

$\frac{\partial z}{\partial x_3} = 1 \cdot (-6x_3) = -6x_3$
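The factors of 1 in these results are the upstream gradients passed back through the addition node. One way to make that explicit is to write the forward and backward passes by hand; the input values below are hypothetical and chosen only for illustration.

```python
# Hypothetical input values, chosen only for illustration.
x1, x2, x3 = 2.0, -1.0, 3.0

# Forward pass: break z into elementary nodes.
a = 5.0 * x2        # a = 5*x2
b = x3 ** 2         # b = x3^2
c = -3.0 * b        # c = -3*b
z = x1 + a + c

# Backward pass: start from dz/dz = 1 and multiply by local derivatives.
dz = 1.0
dx1 = dz * 1.0          # the sum passes the gradient through unchanged
da = dz * 1.0
dx2 = da * 5.0          # da/dx2 = 5
dc = dz * 1.0
db = dc * -3.0          # dc/db = -3
dx3 = db * 2.0 * x3     # db/dx3 = 2*x3

print(dx1, dx2, dx3)
```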



```
z = x1*(x2-4) + exp(x3^2) / 5*x4^2
```

In [None]:
# Write your answer here

Reading the last term as $\frac{\exp(x_3^2)}{5x_4^2}$, we can use the chain rule to compute the partial derivatives of the function $z = x_1(x_2 - 4) + \frac{\exp(x_3^2)}{5x_4^2}$ as follows:

$\frac{\partial z}{\partial x_1} = x_2 - 4$

$\frac{\partial z}{\partial x_2} = x_1$



$\frac{\partial z}{\partial x_3} = 2x_3\frac{\exp(x_3^2)}{5x_4^2}$

$\frac{\partial z}{\partial x_4} = -\frac{2\exp(x_3^2)}{5x_4^3}$
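These four expressions can be verified with a numerical gradient check. The sketch below assumes the same reading of the formula as the answer, with $5x_4^2$ in the denominator; the evaluation point and step size are arbitrary.

```python
import numpy as np

def z(x):
    x1, x2, x3, x4 = x
    return x1 * (x2 - 4.0) + np.exp(x3 ** 2) / (5.0 * x4 ** 2)

def grad(x):
    """Analytic partial derivatives, in the order x1, x2, x3, x4."""
    x1, x2, x3, x4 = x
    e = np.exp(x3 ** 2)
    return np.array([
        x2 - 4.0,
        x1,
        2.0 * x3 * e / (5.0 * x4 ** 2),
        -2.0 * e / (5.0 * x4 ** 3),
    ])

def numeric_grad(f, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return g

x = np.array([1.5, 0.5, 0.7, 2.0])
print(grad(x))
print(numeric_grad(z, x))
```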



```
z = 1/x3 + exp( (x1+5*(x2+3)) ^2 )
```

In [None]:
# Write your answer here

Using the chain rule, we can compute the partial derivatives of the function $z = \frac{1}{x_3} + \exp((x_1+5(x_2+3))^2)$ as follows:

$\frac{\partial z}{\partial x_1} = 2(x_1+5(x_2+3))\exp((x_1+5(x_2+3))^2)$

$\frac{\partial z}{\partial x_2} = 10(x_1+5(x_2+3))\exp((x_1+5(x_2+3))^2)$

$\frac{\partial z}{\partial x_3} = -\frac{1}{x_3^2}$
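The shared factor in the first two derivatives comes from a single intermediate node, which a hand-written backward pass makes visible. This is a sketch with hypothetical input values; the node names `u`, `v`, `w` are introduced here for illustration.

```python
import math

# Hypothetical input values, chosen only for illustration.
x1, x2, x3 = 0.2, -3.1, 2.0

# Forward pass through the intermediate nodes.
u = x1 + 5.0 * (x2 + 3.0)   # inner linear term
v = u ** 2
w = math.exp(v)
z = 1.0 / x3 + w

# Backward pass: multiply local derivatives from the output back to the inputs.
dw = 1.0                    # dz/dw
dv = dw * w                 # dw/dv = exp(v)
du = dv * 2.0 * u           # dv/du = 2u
dx1 = du * 1.0              # du/dx1 = 1
dx2 = du * 5.0              # du/dx2 = 5
dx3 = -1.0 / x3 ** 2        # only the 1/x3 term depends on x3

print(dx1, dx2, dx3)
```

Computing `du` once and reusing it for both `dx1` and `dx2` is the saving that backpropagation provides over differentiating each input separately.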



### Puppy or bagel?
We've seen in class the (hopefully) funny examples of challenging images (Chihuahua or muffin, puppy or bagel etc.). 

Let's say you were asked by someone to find more examples like that. You are able to call the 3 neural networks that won the recent ImageNet challenges, and get their predictions (the entire vector of probabilities for the 1000 classes).  

Describe methods that might assist you in finding more examples.

In [None]:
# Write your answer here

The following methods might help in finding similar examples:

1) Adversarial examples: adding small, imperceptible perturbations to an image that cause the model to misclassify it. This method can be used to create images that are difficult to classify and could be mistaken for other objects.

2) Image augmentation: adding variations to the original image to increase the size of the training set. By applying different transformations such as rotations, scaling, and cropping, it is possible to generate images that are similar to the original but difficult to classify. These images can be used to test the robustness of the model and identify areas for improvement.

3) Data collection: collecting real-world data with similar misleading examples.

4) Feature visualization: Feature visualization techniques can help identify what a deep neural network is looking for in an image. This can be used to identify images that have similar features but are different objects. For example, images of Chihuahuas and muffins might have similar features that make them difficult to distinguish, such as round shapes and similar coloration.
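Since the question provides the full probability vectors from three networks, one further option is to rank candidate images by how undecided the models are. The sketch below is only an illustration under assumptions: `ambiguity_scores` is a hypothetical helper, and the tiny `probs` array stands in for real model outputs (3 models, 2 images, 4 classes instead of 1000).

```python
import numpy as np

def ambiguity_scores(probs):
    """probs has shape (n_models, n_images, n_classes)."""
    mean_p = probs.mean(axis=0)                   # average prediction over models
    top2 = np.sort(mean_p, axis=1)[:, -2:]        # two largest probabilities per image
    margin = top2[:, 1] - top2[:, 0]              # small margin = two competing classes
    votes = probs.argmax(axis=2)                  # top-1 class per model and image
    disagree = (votes != votes[0]).any(axis=0)    # True if the models do not all agree
    return margin, disagree

# Toy stand-in for real model outputs.
probs = np.array([
    [[0.90, 0.05, 0.03, 0.02], [0.48, 0.46, 0.04, 0.02]],
    [[0.85, 0.10, 0.03, 0.02], [0.44, 0.50, 0.04, 0.02]],
    [[0.88, 0.06, 0.04, 0.02], [0.47, 0.47, 0.04, 0.02]],
])
margin, disagree = ambiguity_scores(probs)
ranking = np.argsort(margin)   # most ambiguous images first
print(ranking, disagree)
```

Images with a small top-2 margin, or on which the three networks disagree, would be the first candidates to inspect by eye.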

### Convolution
Consider the following convolution filters:
```python
k1 = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
k2 = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])
k3 = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
k4 = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]]) / 9
```

Can you guess what each of them computes?

In [None]:
# Write your answer here

Each of the convolution filters has a different purpose and computes different operations on an input image:

k1 = [ [0 0 0], [0 1 0], [0 0 0] ]
This filter is called an identity filter because it passes the input image through unchanged. It assigns a weight of 1 to the center pixel and 0 to all other pixels. 

k2 = [ [0 0 0], [0 0 1], [0 0 0] ]
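This filter is a shift (translation) filter. It assigns a weight of 1 to the middle-right pixel and 0 to all other pixels, so each output pixel is a copy of a horizontal neighbor. The image is therefore shifted by one pixel horizontally: to the left under cross-correlation (the operation most deep learning libraries call convolution), and to the right under true convolution, which flips the kernel. It does not detect edges, since it contains no negative weights.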

k3 = [ [-1 -1 -1], [-1 8 -1], [-1 -1 -1] ]
This filter is called a Laplacian filter or edge detection filter. It detects edges by computing a discrete approximation of the second derivative of the image intensity. It assigns a weight of -1 to all eight neighboring pixels and a weight of 8 to the center pixel, so the weights sum to zero and flat regions give an output of 0. It responds strongly to edges in every direction, but it also amplifies noise.

k4 = [ [1 1 1], [1 1 1], [1 1 1] ] / 9
This filter is called a mean filter or smoothing filter (box blur). It assigns a weight of 1/9 to each of the nine pixels in the 3x3 window and computes their average value. It smooths the image and reduces noise.
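The behavior of the four kernels can be tried on a small synthetic image. The following is a minimal sketch: `correlate2d_same` is a hypothetical helper that applies cross-correlation (no kernel flip) with zero padding, and the test image is a made-up vertical step edge.

```python
import numpy as np

def correlate2d_same(img, k):
    """Cross-correlate img with a 3x3 kernel k, zero-padding the border."""
    padded = np.pad(img, 1)
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * k)
    return out

k1 = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
k2 = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])
k3 = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
k4 = np.ones((3, 3)) / 9

# Toy image: a vertical step edge between a dark half and a bright half.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

same = correlate2d_same(img, k1)      # identity: unchanged image
shifted = correlate2d_same(img, k2)   # each pixel takes its right neighbor's value
edges = correlate2d_same(img, k3)     # nonzero only around the step
blurred = correlate2d_same(img, k4)   # step is smeared over neighboring columns

print(edges[2])
print(blurred[2])
```

Values along the image border are affected by the zero padding, so the interior rows and columns are the ones to look at.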