# EE P 590: Deep Learning for Embedded Real-time Intelligence 


# Homework 3  

In [1]:
import torch 

In [2]:
from eep590_utils import * 

# 0. Warm-up (40 points) 

###  Answer the following in your own words (read and understand the concepts, do not copy answers from the internet): 
**(a.) Explain why depth-wise separable convolutions can achieve reasonable results while using fewer parameters than traditional convolutions. Also compare the representation power of a depth-wise separable convolution to that of a standard convolution if both had the same number of parameters.**

**(b.) Comment on the quantization bit-widths (for a deep learning model) required to achieve good performance for speech/audio inputs, and compare them to those required for image inputs. Which type of input requires a higher number of bits to achieve reasonable performance, and why?**

**(c.) From an optimization perspective, why is it challenging to train a deep learning model with quantized weights and activations?** 

**(d.) What is the impact of low-rank projections on the generalization ability of a deep learning model? Explain your reasoning.**
 
**(e.) What is the motivation for deploying deep learning models on embedded/IoT devices? List the problems associated with sending the data from the IoT device to the cloud over Wi-Fi and performing the inference in the cloud.**

**(f.) In one or more sentences, and using sketches as appropriate, explain SqueezeNet and MobileNet. What are the defining characteristics of these architectures? How do they improve upon their predecessors in terms of latency and memory consumption?**


**(g.) Compare and contrast the different pruning techniques discussed in the lecture. Which technique would produce the best results? Which technique is the most practical? Explain your reasoning.**


**(h.) Explain how Scree plots are useful in resource allocation and in determining each layer's target rank.**

**(i.) Compare and contrast the different quantization techniques discussed in the lectures, and comment on the strengths and limitations of each technique.**
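
As a starting point for question (a.), the parameter counts of the two layer types can be compared directly. The sketch below is only an illustration: the channel counts and kernel size are arbitrary assumptions, and `count_params` is a hypothetical helper, not part of `eep590_utils.py`. A standard convolution has $k^2 C_{in} C_{out}$ weights, while the depth-wise plus point-wise pair has $k^2 C_{in} + C_{in} C_{out}$.

```python
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    """Hypothetical helper: total number of learnable parameters."""
    return sum(p.numel() for p in module.parameters())


# Illustrative sizes only (assumption, not taken from the assignment)
c_in, c_out, k = 32, 64, 3

# Standard convolution: every output channel mixes all input channels spatially
standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# Depth-wise separable convolution: per-channel spatial filter, then 1x1 channel mixing
depthwise_separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # depth-wise
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # point-wise
)

n_standard = count_params(standard)
n_separable = count_params(depthwise_separable)
print(n_standard, n_separable)
```

The count alone does not answer the question about representation power; it only quantifies the saving for one choice of sizes.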

# 1. Low-rank approximation (30 points) 

In this section we will use low-rank approximations of the dense layers to reduce the memory consumption of a simple LeNet architecture during inference. To make it simpler, we have divided this into subtasks:

- **(8 points) Subtask 1: Train a standard LeNet architecture on the Fashion MNIST dataset. Use the `load_data_fashion_mnist` function provided in the `eep590_utils.py` file to download and load the dataset. The model architecture is defined below for your convenience, but you are free to use any other implementation of the same architecture. Use a set of reasonable hyper-parameters to obtain an accuracy of at least 75% on the training and test sets.**

In [3]:
import torch.nn as nn
import torch.nn.functional as F


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square conv kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        # 6 input channels, 16 output channels, 3x3 square conv kernel
        self.conv2 = nn.Conv2d(6, 16, 3)
        # 16 channels of 5x5 feature maps (for a 28x28 input) feed the dense layers
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Each stage: 3x3 convolution, ReLU, then 2x2 max pooling
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        # Flatten every dimension except the batch dimension
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
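
The `16 * 5 * 5` input size of `fc1` can be sanity-checked by pushing a dummy tensor through the two convolution and pooling stages. This is a minimal sketch that assumes the data loader yields 28x28 single-channel images without resizing; if `load_data_fashion_mnist` is called with a resize option, the flattened size changes accordingly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One dummy image with the assumed Fashion MNIST shape: (batch, channels, H, W)
x = torch.zeros(1, 1, 28, 28)

conv1 = nn.Conv2d(1, 6, 3)
conv2 = nn.Conv2d(6, 16, 3)

x = F.max_pool2d(F.relu(conv1(x)), 2)  # 28 -> 26 (3x3 conv) -> 13 (2x2 pool)
x = F.max_pool2d(F.relu(conv2(x)), 2)  # 13 -> 11 (3x3 conv) -> 5 (2x2 pool)

feature_shape = tuple(x.shape)
print(feature_shape)
```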

- **(10 points) Subtask 2: Project the `self.fc1` and `self.fc2` weights to subspaces with ranks `r1` and `r2`.** 
- **(4 points) Plot a graph with the X-axis representing the % improvement in memory usage of the architecture, and the Y-axis representing the test accuracy of the model. (Hint: play with `r1` and `r2`, and run the inference multiple times. You need not re-train the model.)** 
- **(4 points) Plot a graph of the % improvement in memory usage vs. $\beta$, which is the percentage of total variation explained in the low-rank subspace. (Hint: Refer to slide 42 of the lecture.)**
- **(4 points) Plot a graph of test accuracy vs. $\beta$.** 
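
One common way to project a weight matrix onto a rank-`r` subspace is a truncated SVD. The sketch below illustrates that idea on a random matrix with the shape of `fc1.weight`; it is not necessarily the method from the lecture. The helper name `low_rank_factors` is hypothetical, and $\beta$ is assumed here to be the fraction of squared singular values retained, which should be checked against the definition on the lecture slide.

```python
import torch


def low_rank_factors(weight: torch.Tensor, rank: int):
    """Hypothetical helper: return (A, B, beta) where A @ B is the
    rank-`rank` truncated-SVD approximation of `weight`."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # (out_features, rank), singular values folded in
    B = Vh[:rank, :]            # (rank, in_features)
    # Assumed definition of beta: share of squared singular values kept
    beta = ((S[:rank] ** 2).sum() / (S ** 2).sum()).item()
    return A, B, beta


torch.manual_seed(0)
W = torch.randn(120, 400)  # random stand-in with the shape of fc1.weight
r1 = 20
A, B, beta = low_rank_factors(W, r1)

dense_params = W.numel()
low_rank_params = A.numel() + B.numel()
saving = 100.0 * (1 - low_rank_params / dense_params)  # % fewer stored weights
print(dense_params, low_rank_params, saving)
```

Storing `A` and `B` instead of `W` only saves memory when `r * (out + in) < out * in`, so very large ranks can make the factored layer bigger than the dense one.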

# 2. Sparsification (30 points) 

In this section we will attempt to sparsify the model trained in Subtask 1 of the previous section. 

- **Subtask 1: Plot a histogram of the distribution of all the weights in the model.** 
- **Subtask 2: Implement a module that sets to zero all the weights whose magnitude is less than a threshold (say $\alpha$).**
- **Subtask 3: Run inference for different levels of sparsity by varying $\alpha$, and plot a graph of sparsity vs. test accuracy.** 
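
The thresholding step and the sparsity measurement can be sketched on a small hand-written tensor. The function names `hard_threshold` and `sparsity` are hypothetical, and sparsity is assumed here to mean the fraction of weights that are exactly zero.

```python
import torch


def hard_threshold(weight: torch.Tensor, alpha: float) -> torch.Tensor:
    """Hypothetical helper: zero out entries with magnitude below alpha."""
    return torch.where(weight.abs() >= alpha, weight, torch.zeros_like(weight))


def sparsity(weight: torch.Tensor) -> float:
    """Assumed definition: fraction of entries that are exactly zero."""
    return (weight == 0).float().mean().item()


w = torch.tensor([[0.50, -0.02, 0.10],
                  [-0.30, 0.01, -0.04]])
w_sparse = hard_threshold(w, alpha=0.05)
print(w_sparse)
print(sparsity(w_sparse))
```

Applied to a trained model, the same function would be run over each parameter tensor, typically under `torch.no_grad()`, before re-running inference.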

# 3. Improved RNN (35 points)

In this section we will implement a different type of RNN, one with weighted residual connections, and try to analyze its usefulness. 

Let $X = [x_1, \dots, x_T]$ be the input data, where $x_t \in \mathbb{R}^D$ denotes the $t$-th step feature vector. Then,
the goal of multi-class RNNs is to learn a function $F : \mathbb{R}^{D \times T} \to \{1, \dots, L\}$ that predicts one of $L$
classes for the given data point $X$. The standard RNN architecture has a provision to produce an output
at every time step, but we focus on the setting where each data point is associated with a single label
that is predicted at the end of the time horizon $T$. A standard RNN maintains a hidden state vector
$h_t \in \mathbb{R}^{\hat{D}}$ which captures temporal dynamics in the input data, i.e., $$h_t = f(W \, x_t + U \, h_{t-1} + b)$$

Learning $U$ and $W$ in the above architecture is difficult because the gradient
can have a condition number that is exponentially large in $T$, which leads to problems like exploding or vanishing gradients. Instead, ImprovedRNN will use a simple weighted residual connection to perform updates of the hidden state $h_t$ as follows: $$\tilde{h}_t = f(W x_t + U h_{t-1} + b)$$ $$h_t = \alpha \tilde{h}_t + \beta h_{t-1}$$

where $0 \le \alpha, \beta \le 1$ are trainable scalar weights.
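
The two update equations can be traced numerically with plain tensors before writing the cell class. This is a minimal sketch, not the expected `ImprovedRNNCell` implementation: `residual_step` is a hypothetical function, $f$ is assumed to be `tanh`, the sizes are arbitrary, and $\alpha$ and $\beta$ are fixed constants here rather than trainable parameters.

```python
import torch


def residual_step(x_t, h_prev, W, U, b, alpha, beta):
    """Hypothetical helper: one weighted-residual update of the hidden state."""
    h_tilde = torch.tanh(x_t @ W.T + h_prev @ U.T + b)  # candidate state
    return alpha * h_tilde + beta * h_prev              # weighted residual mix


torch.manual_seed(0)
D, D_hat, T = 4, 3, 5  # input size, hidden size, sequence length (illustrative)
W = torch.randn(D_hat, D)
U = torch.randn(D_hat, D_hat)
b = torch.zeros(D_hat)
alpha, beta = 0.1, 0.9  # fixed for illustration only

h = torch.zeros(1, D_hat)  # initial hidden state for a batch of one
for x_t in torch.randn(T, 1, D):
    h = residual_step(x_t, h, W, U, b, alpha, beta)
print(h)
```

In a trainable cell, one way to keep both scalars inside the stated range is to store unconstrained parameters and pass them through a sigmoid, as the docstring of the skeleton in Subtask 1 suggests.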


- **(20 points) Subtask 1: Implement the `ImprovedRNNCell` by extending the `RNNCell` defined in `eep590_utils.py`, which extends `nn.Module` in PyTorch.** 

In [4]:
class ImprovedRNNCell(RNNCell):
    '''
    ImprovedRNN Cell with both Full Rank and Low Rank Formulations
    Has multiple activation functions for the gates
    hidden_size = # hidden units
    update_nonlinearity = nonlinearity for final rnn update
    can be chosen from [tanh, sigmoid, relu, quantTanh, quantSigm]
    wRank = rank of W matrix (creates two matrices if not None)
    uRank = rank of U matrix (creates two matrices if not None)
    wSparsity = intended sparsity of W matrix(ces)
    uSparsity = intended sparsity of U matrix(ces)
    The cell will not automatically sparsify. The user must invoke .sparsify() to hard threshold.
    alphaInit = init for alpha, the update scalar
    betaInit = init for beta, the weight for previous state
    Basic architecture is like:
    h_t^ = update_nl(Wx_t + Uh_{t-1} + B_h)
    h_t = sigmoid(beta)*h_{t-1} + sigmoid(alpha)*h_t^
    W and U can further parameterised into low rank version by
    W = matmul(W_1, W_2) and U = matmul(U_1, U_2)
    '''

    def __init__(self, input_size, hidden_size,
                 update_nonlinearity="tanh", wRank=None, uRank=None,
                 wSparsity=1.0, uSparsity=1.0, alphaInit=-3.0, betaInit=3.0,
                 name="ImprovedRNN"):
        super(ImprovedRNNCell, self).__init__(input_size, hidden_size,
                                              None, update_nonlinearity,
                                              1, 1, 1, wRank, uRank, wSparsity,
                                              uSparsity)
        # TODO: Initialize all the required variables

    @property
    def name(self):
        return self._name

    @property
    def cellType(self):
        return "ImprovedRNN"

    def forward(self, input, state):
        # TODO: Implement the forward pass to compute the new_state
        pass
        return new_state

- **(10 points) Subtask 2: Explain how Improved RNN resolves some of the problems in traditional RNNs from an optimization perspective.** Assume that the label decision function is one-dimensional and is given by $f(X) = v^\top h_T$. Let $L(X, y; \theta) = L(f(X), y; \theta)$ be the logistic loss function for the given labeled data point $(X, y)$ and with parameters $\theta = (W, U, v)$. Then, check the gradients $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial U}$. A critical term in the expression of the gradients that determines whether they explode or vanish is $$M(U) = \prod_{k=t}^{T-1} \left(\alpha U^\top D_{k+1} + \beta I\right)$$ where $D_k = \mathrm{diag}(\sigma'(W x_k + U h_{k-1} + b))$ is the Jacobian matrix of the pointwise nonlinearity. **Hint: Analyze the range of possible values for the condition number $\kappa$ of the matrix $M(U)$.** 
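
The hint can be explored numerically before attempting the analysis. The sketch below is an illustration under strong assumptions, not a proof: $U$ is a random matrix rescaled to spectral norm 1, the nonlinearity is assumed to be `tanh`, and each $D_k$ is built from random pre-activations instead of an actual hidden-state trajectory. The function `gradient_product` is hypothetical. Setting $\alpha = 1, \beta = 0$ recovers the product that appears for a standard RNN.

```python
import torch

torch.manual_seed(0)
n, steps = 8, 20  # hidden size and number of factors T - t (illustrative)

U = torch.randn(n, n, dtype=torch.float64)
U = U / torch.linalg.matrix_norm(U, ord=2)  # rescale to spectral norm 1 (assumption)
I = torch.eye(n, dtype=torch.float64)


def gradient_product(alpha: float, beta: float) -> torch.Tensor:
    """Hypothetical helper: product of (alpha * U^T D_k + beta * I) over `steps` factors."""
    gen = torch.Generator().manual_seed(1)  # same D_k sequence for every call
    M = I.clone()
    for _ in range(steps):
        z = torch.randn(n, generator=gen, dtype=torch.float64)  # stand-in pre-activation
        D_k = torch.diag(1 - torch.tanh(z) ** 2)                # tanh'(z), entries in (0, 1]
        M = M @ (alpha * U.T @ D_k + beta * I)
    return M


kappa_standard = torch.linalg.cond(gradient_product(1.0, 0.0)).item()
kappa_residual = torch.linalg.cond(gradient_product(0.05, 0.95)).item()
print(kappa_standard, kappa_residual)
```

With these particular choices every residual factor has singular values between 0.90 and 1.00, so the product's condition number cannot exceed $(1/0.9)^{20} \approx 8.2$; how the bound depends on $\alpha$, $\beta$, and $T$ in general is the subject of the question.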


- **(5 points) Subtask 3: Explain why this result is useful in the context of embedded real-time applications. Comment on representation power, memory usage, and computation.** 

# 4. References & Misc. (15 points) 

**(a.) Approximately how many hours did you spend on this assignment in total? Document and explain your journey, including the issues/challenges you faced (both technical and non-technical).**

**(b.) Cite all the resources and references you consulted to complete this assignment.** 