###### Some of the content of this homework is adapted from [LSTM with PyTorch](https://www.deeplearningwizard.com/deep_learning/practical_pytorch/pytorch_lstm_neuralnetwork/)

<div class="alert alert-block alert-warning">
<h1><span style="color:green"> EE512: Machine Learning </span></h1>
<h2><span style="color:green"> Homework-4 </span></h2>
</div>

---
---

## Recurrent Neural Network (RNN)
**Recurrent Neural Networks (RNNs)** are a type of neural network in which the **output from the previous step is fed as input to the current step.** RNNs are used for problems that require knowledge of the past to make a prediction, such as predicting the next word in a sentence. An RNN does this by maintaining a "memory" in a vector called the **hidden state**.

At each time step $t$, an element of the sequence $x_t\in\mathbb{R}^D$ and the previous hidden state $h_{t-1}\in\mathbb{R}^H$ are passed to the RNN, which uses these two inputs to compute the updated hidden state $h_t$. The learnable parameters of the RNN are an input-to-hidden matrix $W_x\in\mathbb{R}^{H\times D}$, a hidden-to-hidden matrix $W_h\in\mathbb{R}^{H\times H}$ and a bias vector $b\in\mathbb{R}^{H}$.

$$ h_t = \tanh(W_{x}x_t + W_{h}h_{t-1}+b) $$


![RNN_folded_unfolded.png](attachment:RNN_folded_unfolded.png)
<center>A folded (left) and unfolded (right) version of an RNN. <b>A</b> is a feed-forward neural network layer followed by an activation function. </center> 

<center> <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/" title="colah's blog">
Image Source</a></center>
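
The update rule above can be sketched with plain tensors. This is only an illustration, not the homework solution: the sizes `D = 3`, `H = 4`, the random weights and the helper name `rnn_step_sketch` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)

D, H = 3, 4                      # input and hidden sizes (illustrative values)
W_x = torch.randn(H, D)          # input-to-hidden matrix
W_h = torch.randn(H, H)          # hidden-to-hidden matrix
b = torch.zeros(H)               # bias vector


def rnn_step_sketch(x_t, h_prev):
    """One vanilla-RNN update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return torch.tanh(W_x @ x_t + W_h @ h_prev + b)


# Unroll over a short sequence of T = 5 inputs, starting from h_0 = 0
sequence = torch.randn(5, D)
h = torch.zeros(H)
for x_t in sequence:
    h = rnn_step_sketch(x_t, h)

print(h.shape)  # torch.Size([4])
```

The same weights are reused at every time step; only the hidden state changes as the sequence is consumed.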


In [1]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as dsets

import numpy as np
import matplotlib.pyplot as plt

<div class="alert alert-block alert-warning">
<h1><span style="color:green"> Task-1 : Implement One Fully-Connected Layer using nn.Linear  </span></h1>
</div>

Input row-vector is $x=[1,2,3]$

In [2]:
x = torch.FloatTensor([1,2,3])

fc_layer = nn.Linear(3, 4*6)

output = fc_layer(x)

w = None
b = None
for name, param in fc_layer.named_parameters():
    if name == 'weight':
        w = param.data
    elif name == 'bias':
        b = param.data

print('Dimension of Input        : ' , x.shape) # we can also print the number of dimensions with x.ndimension()
print('Dimension of Output       : ' , output.shape)
print('Dimension of Weight Matrix: ',w.shape )
print('Dimension of Bias Vector  : ' ,b.shape)

print('\n\n\n')
print('Weight Matrix of Fully-Connected Layer:\n',w)
print('\n\n\nBias Vector of Fully-Connected Layer:\n',b)
print('\n\n\nOutput:\n',output)

Dimension of Input        :  torch.Size([3])
Dimension of Output       :  torch.Size([24])
Dimension of Weight Matrix:  torch.Size([24, 3])
Dimension of Bias Vector  :  torch.Size([24])




Weight Matrix of Fully-Connected Layer:
 tensor([[ 0.0620, -0.2227,  0.2560],
        [ 0.4110,  0.2781,  0.5542],
        [ 0.0268, -0.4913, -0.2040],
        [-0.2326, -0.2051, -0.5239],
        [ 0.0988, -0.0739, -0.1083],
        [ 0.3112, -0.0467, -0.2634],
        [-0.5627, -0.0148, -0.3369],
        [-0.0025,  0.2711,  0.2552],
        [-0.1124, -0.1506, -0.4798],
        [-0.0821,  0.5672,  0.0552],
        [-0.3041, -0.0383,  0.2553],
        [-0.3990,  0.4771, -0.4966],
        [-0.2871, -0.1807,  0.3783],
        [ 0.3518,  0.4003,  0.0272],
        [ 0.1394, -0.2831, -0.4631],
        [ 0.3804, -0.3814,  0.1181],
        [-0.5648,  0.5569, -0.0754],
        [-0.0027,  0.4436, -0.2137],
        [ 0.3736, -0.5466, -0.3141],
        [ 0.2314,  0.2771,  0.4165],
        [-0.3393,  0.2019,  0

In [14]:
input_dim = 3
hidden_dim = 6
input = torch.FloatTensor([3,4,5])  # length must equal input_dim
hidden = torch.zeros((1,hidden_dim))

In [15]:
W_x = nn.Linear(input_dim, 4 * hidden_dim, bias=True)
W_h = nn.Linear(hidden_dim, 4 * hidden_dim, bias=True)

In [16]:
for name, param in W_x.named_parameters():
    if name == 'weight':
        w = param.data
    elif name == 'bias':
        b = param.data
w.shape

torch.Size([24, 3])

In [17]:
W_x(input).shape

torch.Size([24])

In [8]:
activation = W_x(input) + W_h(hidden)
activation.shape

torch.Size([1, 24])

In [9]:
activation

tensor([[-0.1789, -3.7217,  3.4697,  0.6477, -0.0525, -4.2697,  0.9152, -5.4598,
          1.1937, -1.9445,  5.2034, -1.4306, -0.8855, -0.5019,  3.5420,  2.5287,
         -2.1952,  0.7701, -1.2765, -0.1209, -1.8376,  0.1420,  2.4654, -0.0465]],
       grad_fn=<AddBackward0>)

In [13]:
a, b, c, d = activation.chunk(4, dim=1)

In [18]:
print(f'shape of a is {a.shape}')
print(f'shape of b is {b.shape}')
print(f'shape of c is {c.shape}')
print(f'shape of d is {d.shape}')

shape of a is torch.Size([1, 6])
shape of b is torch.Size([1, 6])
shape of c is torch.Size([1, 6])
shape of d is torch.Size([1, 6])


In [15]:
a

tensor([[-0.1789, -3.7217,  3.4697,  0.6477, -0.0525, -4.2697]],
       grad_fn=<SplitBackward0>)

In [17]:
a1, b1, c1, d1 = activation[:,0:6], activation[:,6:12], activation[:,12:18], activation[:,18:24]


In [19]:
print(f'shape of a1 is {a1.shape}')
print(f'shape of b1 is {b1.shape}')
print(f'shape of c1 is {c1.shape}')
print(f'shape of d1 is {d1.shape}')

shape of a1 is torch.Size([1, 6])
shape of b1 is torch.Size([1, 6])
shape of c1 is torch.Size([1, 6])
shape of d1 is torch.Size([1, 6])


In [21]:
print(a == a1)
print(b == b1)
print(c == c1)
print(d == d1)

tensor([[True, True, True, True, True, True]])
tensor([[True, True, True, True, True, True]])
tensor([[True, True, True, True, True, True]])
tensor([[True, True, True, True, True, True]])
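
The checks in the cells above can be condensed into a short sketch. It assumes a freshly created layer with the same sizes as `fc_layer` (the names `layer`, `manual`, `parts` and `slices` are introduced only for this example) and verifies two things: that `nn.Linear` computes $xW^T + b$, and that `chunk(4)` returns the same blocks as manual slicing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(3, 24)                 # same sizes as fc_layer above
x = torch.tensor([1.0, 2.0, 3.0])
out = layer(x)

# nn.Linear stores its weight with shape (out_features, in_features),
# so the layer output is x @ W^T + b
manual = x @ layer.weight.T + layer.bias
same_as_layer = torch.allclose(out, manual)

# chunk(4) along the last dimension matches slicing into blocks of 6
parts = out.chunk(4, dim=-1)
slices = [out[k * 6:(k + 1) * 6] for k in range(4)]
chunks_match = all(torch.equal(p, s) for p, s in zip(parts, slices))

print(same_as_layer, chunks_match)
```

`torch.equal` returns a single Boolean per pair, which is easier to read than the element-wise `==` tensors printed above.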


<div class="alert alert-block alert-warning">
<h2><span style="color:blue"> Task-1  Comments  </span></h2>
</div>


I did this because:
* Abcd
* Abcd


I have observed the following points:
* Abcd
* Abcd

<div class="alert alert-block alert-warning">
<h1><span style="color:green"> Task-2  : Implement a Single-layer RNN  </span></h1>
</div>

In [None]:
class SimpleRNN(nn.Module):
    
    def __init__(self,input_dim, hidden_dim):

        super().__init__()

        self.input_dim  = input_dim
        self.hidden_dim = hidden_dim

        
        # Initialize W_x and W_h 
        # Bias terms are included in nn.Linear, so no need to initialize separate bias vector

        self.W_x = nn.Linear(??) 

        self.W_h = nn.Linear(??) 



    def rnn_step(self, inp, hidden):
        
        """ Implementation of Single-layer RNN ONE Time Step
            
            Args:
            
                  inp   : input of shape( batch_size, input_dim)
                  hidden: previous hidden state h_(t-1)
            
            
            Returns:
                     h_t : updated hidden state after ONE time step 
            """

        
        prev_h = hidden
        
        h_t = None
        
        
        # TODO: Implement one time step and update h

        h_t = ??


        return h_t


    def forward(self, inp):
        # Input should be of shape(seq_dim, batch_size, input_dim)

        seq_dim, batch_size , _ = inp.shape

        
        # Initialize hidden state with zeros and shape(batch_size, hidden_dim)
        # In case of batch gradient descent we have to maintain hidden vector 
        # for each training example in the batch
        
        h = ??

        
        # Loop through the whole sequence and update h_t at every time step
        for x in inp:
            # x is of shape (batch_size, input_dim)
            h = self.rnn_step(x, h)
        
        return h 


#### Output of RNN after T time steps

In [None]:
# Testing RNN: This code should run without any error

x = torch.randn(28, 1, 28) # (Seq_dim, batch_dim, inp_dim)

rnn = SimpleRNN(28, 100)

# Dimension of network parameters
for p in rnn.parameters():
    print(p.shape)

    
h_last_t = rnn(x)


<div class="alert alert-block alert-warning">
<h2><span style="color:blue"> Task-2  Comments  </span></h2>
</div>


I did this because:
* Abcd
* Abcd


I have observed the following points:
* Abcd
* Abcd

---
---

# LSTM
If you read recent papers, you'll see that many people use a variant of the vanilla RNN called the Long Short-Term Memory (LSTM) network. Vanilla RNNs can be tough to train on long sequences because they suffer from the vanishing/exploding gradient problem caused by repeated matrix multiplication. LSTMs mitigate this problem by replacing the simple update rule of the vanilla RNN with a gating mechanism, as follows.


Similar to the vanilla RNN, at each time step we receive an input $x_t\in\mathbb{R}^D$ and the previous hidden state $h_{t-1}\in\mathbb{R}^H$; the LSTM also maintains an $H$-dimensional *cell state*, so we also receive the previous cell state $c_{t-1}\in\mathbb{R}^H$. The learnable parameters of the LSTM are an *input-to-hidden* matrix $W_x\in\mathbb{R}^{4H\times D}$, a *hidden-to-hidden* matrix $W_h\in\mathbb{R}^{4H\times H}$ and a *bias vector* $b\in\mathbb{R}^{4H}$.

**Points to be noted**

* *Cell State* is the internal state of the LSTM. It provides a path for gradient flow that contains no matrix multiplication, which reduces the chance of vanishing/exploding gradients.

* *Hidden State* is the output of the LSTM.

![lstm.png](attachment:lstm.png)
<center> <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/" title="colah's blog">
Image Source</a></center>

At each time step:

**1.** An *activation vector* $a\in\mathbb{R}^{4H}$ is computed as $a=W_xx_t + W_hh_{t-1}+b$.

**2.** The activation vector is partitioned into four vectors $a_i,a_f,a_o,a_c\in\mathbb{R}^H$, where $a_i$ consists of the first $H$ elements of $a$, $a_f$ consists of the next $H$ elements of $a$, and so on. 

**3.** We then compute the *input gate* $(i\in\mathbb{R}^H)$, *forget gate* $(f\in\mathbb{R}^H)$, *output gate* $(o\in\mathbb{R}^H)$ and *new candidate for $c_t$* $(\tilde{c_t}\in\mathbb{R}^H)$ as

$$
\begin{align*}
i = \sigma(a_i) \hspace{2pc}
f = \sigma(a_f) \hspace{2pc}
o = \sigma(a_o) \hspace{2pc}
\tilde{c_t} = \tanh(a_c)
\end{align*}
$$

where $\sigma$ is the sigmoid function and $\tanh$ is the hyperbolic tangent, both applied element-wise.


  * The *new candidate vector* ($\tilde{c_t}$) provides candidate values for the new cell state ($c_t$).
  
  * The *input gate* ($i$) determines which positions of $\tilde{c_t}$ are written into the new cell state ($c_t$).

  * The *forget gate* ($f$) determines which entries of the previous cell state $c_{t-1}$ are kept and which are discarded.

  * The *output gate* ($o$) determines which parts of the current cell state $c_t$ are passed to the current hidden state $h_t$.
  


  
  

**4.** Finally, we update the next cell state $c_t$ and next hidden state $h_t$ as follows

$$
c_{t} = f\odot c_{t-1} + i\odot \tilde{c_t} \hspace{4pc}
h_t = o\odot\tanh(c_t)
$$

where $\odot$ is the element-wise (Hadamard) product of vectors.
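
Steps 1 to 4 above can be traced with plain tensors. This is a minimal sketch and not the homework solution: the sizes `D = 3`, `H = 4`, the random weights and the helper name `lstm_step_sketch` are assumptions for illustration. It works on a single unbatched vector and uses the block order $a_i, a_f, a_o, a_c$ from step 2.

```python
import torch

torch.manual_seed(0)

D, H = 3, 4                                   # illustrative sizes
W_x = torch.randn(4 * H, D) * 0.1             # input-to-hidden matrix
W_h = torch.randn(4 * H, H) * 0.1             # hidden-to-hidden matrix
b = torch.zeros(4 * H)                        # bias vector


def lstm_step_sketch(x_t, h_prev, c_prev):
    a = W_x @ x_t + W_h @ h_prev + b          # step 1: activation vector, shape (4H,)
    a_i, a_f, a_o, a_c = a.chunk(4)           # step 2: four blocks of H elements
    i = torch.sigmoid(a_i)                    # step 3: gates and candidate
    f = torch.sigmoid(a_f)
    o = torch.sigmoid(a_o)
    c_tilde = torch.tanh(a_c)
    c_t = f * c_prev + i * c_tilde            # step 4: new cell state
    h_t = o * torch.tanh(c_t)                 #         new hidden state
    return h_t, c_t


h, c = torch.zeros(H), torch.zeros(H)
for x_t in torch.randn(5, D):                 # a sequence of 5 inputs
    h, c = lstm_step_sketch(x_t, h, c)

print(h.shape, c.shape)
```

Note that the cell-state update uses only element-wise products and a sum, which is the matrix-free gradient path mentioned earlier.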


---
---

<div class="alert alert-block alert-warning">
<h1><span style="color:red"> Question  </span></h1>
</div>

Recall that in an LSTM the input gate $i$, forget gate $f$, and output gate $o$ are all outputs of a sigmoid function. Why don't we use the ReLU activation function instead of sigmoid to compute these values? Explain.

**Your Answer:** 


---
---

<div class="alert alert-block alert-warning">
<h1><span style="color:green"> Task-3 : Implement a Single-layer LSTM  </span></h1>
</div>

In [None]:
class SimpleLSTM(nn.Module):
    
    def __init__(self,input_dim, hidden_dim):

        super().__init__()

        
        self.input_dim  = input_dim
        self.hidden_dim = hidden_dim

        
        #TODO: Initialize W_x and W_h
        # Bias terms are included in nn.Linear, so no need to initialize separate bias vector
        
        self.W_x = nn.Linear(??)
        self.W_h = nn.Linear(??)



    def lstm_step(self, inp, prev_hidden_cell):
        """ Implementation of Single-layer LSTM ONE Time Step
            
            Args:
                    inp      : input of shape(batch_size, input_dim)
             prev_hidden_cell: tuple (previous_hidden_state, previous_cell_state)
            
            
            Returns:
                    Updated hidden state and cell state 
        """

        
        
        h_prev , c_prev = prev_hidden_cell
        
        
        # The activation vector
        activation = ??
        
        
        # The activation is split into four parts, in the same order as step 2 of the LSTM description
        ai, af, ao, ac = activation.chunk(4, 1)

        
        updated_h, updated_c = None, None
      
    
        # TODO: Implement the gates of lstm and update hidden state and cell state

        in_gate     = ??
        forget_gate = ??
        cell_gate   = ??
        out_gate    = ??

        
        updated_c  = ??
        updated_h  = ??
        


        return updated_h, updated_c


    def forward(self, inp):
        
        # input shape (seq_dim, batch_size, input_dim)

        # TODO: initialize hidden and cell state and loop through sequence

        # Initialize hidden state with zeros (batch_size, hidden_dim)
        h = ??
        
        # Initialize cell state with zeros (batch_size, hidden_dim)
        c = ??
        
        
        # Loop through the whole sequence and update h_t and c_t at every time step
        for x in inp:

            # shape of x is (batch_size, input_dim)

            h, c = self.lstm_step(x, (h, c))
        
        return h 


#### Output of LSTM Cell after T time steps
The final updated hidden state is the output of the LSTM. After the data has been filtered through the gates, the final hidden state provides the information relevant to our task.

In [None]:
# Testing LSTM: This code should run without any errors

#test input of shape(Sequence_len, Batch_size , Input_dim)
x = torch.randn(28, 1 ,28)

m = SimpleLSTM(28, 100)

out = m(x)


<div class="alert alert-block alert-warning">
<h2><span style="color:blue"> Task-3  Comments  </span></h2>
</div>


I did this because:
* Abcd
* Abcd


I have observed the following points:
* Abcd
* Abcd

<div class="alert alert-block alert-warning">
<h1><span style="color:orange"> Image Classification with LSTM  </span></h1>
</div>

In this section, you will use an LSTM network for MNIST image classification. We will use the **many-to-one** scenario for this task. Refer to the following figure to understand the *many-to-one* LSTM.
<br>
<br>

![lstm_shapes.png](attachment:lstm_shapes.png)
<center> <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/" title="Andrej Karpathy's blog">
Image Source</a></center>


### Loading MNIST Train Dataset
MNIST contains 70,000 images of handwritten digits: 60,000 for training and 10,000 for testing. The images are grayscale, 28x28 pixels, and centered to reduce preprocessing and get started quicker.

In [None]:
train_dataset = dsets.MNIST(root='./data', 
                            train=True, 
                            transform=transforms.ToTensor(),
                            download=True)  # downloads only if the files are not already in ./data

test_dataset = dsets.MNIST(root='./data', 
                           train=False, 
                           transform=transforms.ToTensor())

In [None]:
print('Train Data   : ',train_dataset.data.size())
print('Train Labels : ',train_dataset.targets.size())
print('Test Data    : ',test_dataset.data.size())
print('Test Labels  : ',test_dataset.targets.size())

### Make Dataset Iterable

In [None]:
batch_size = 100
n_iters = 3000

num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)
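
The epoch count computed in the cell above follows from simple arithmetic. A small worked example, using the training-set size of 60,000 stated earlier (the name `iters_per_epoch` is introduced only for this sketch):

```python
train_size = 60_000   # MNIST training images, as stated above
batch_size = 100
n_iters = 3000

iters_per_epoch = train_size // batch_size      # 600 parameter updates per epoch
num_epochs = n_iters // iters_per_epoch         # 5 epochs
print(iters_per_epoch, num_epochs)              # 600 5
```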

<div class="alert alert-block alert-warning">
<h1><span style="color:green"> Task-4 : Plot Images  </span></h1>
</div>

Plot 4 random images in a 2x2 grid using plt.imshow with cmap = 'gray'.

In [None]:
# TODO: Plot MNIST Images in 2x2 grid



<div class="alert alert-block alert-warning">
<h2><span style="color:blue"> Task-4  Comments  </span></h2>
</div>


I did this because:
* Abcd
* Abcd


I have observed the following points:
* Abcd
* Abcd

<div class="alert alert-block alert-warning">
<h1><span style="color:green"> Task-5 : Build LSTM network for MNIST image Classification  </span></h1>
</div>

### Initialize an LSTM Model Using PyTorch's nn.LSTM() Class
Create an LSTM model for classifying MNIST images. Follow the structure given below:

1. One LSTM Cell
2. One Linear Fully-Connected Layer


Read the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM) to learn about **nn.LSTM()**.

In [None]:
class LSTMModel(nn.Module):
    
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        
        super(LSTMModel, self).__init__()
        
        
        # Hidden dimensions: Number of features in hidden/cell state
        self.hidden_dim = hidden_dim
        
        
        
        # Number of LSTM cells/layers 
        self.layer_dim = layer_dim

        
        
        # Declaring attributes of model for LSTM cell and read-out linear fully-connected layer
        self.lstm = None
        self.fc   = None

        
        
        # TODO: Initialize LSTM and linear layer
             
        
        # Initialize LSTM Cell/s
        self.lstm = nn.LSTM(????,  batch_first=True)
        
        
        
        # Readout fully-connected layer. It takes hidden state from last time step and converts this hidden state to logits
        self.fc = nn.Linear(?????)                                     
        

    
    def forward(self, x):
        
        # Assuming batch_first=True
        # x       : input consists of a sequence or a batch of sequences.   shape: (batch, seq_len, features)
        #         : batch = number of sequences in one batch ,      seq_len = number of time steps       ,  features= number of features in one part sequence at each time step
        
        
        
        
        # Every time the model is called, the initial hidden state and cell state are initialized with zeros
        
        # Initialize hidden state with zeros
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim)     # shape: (num_layers * num_directions, batch, hidden_size)
        # Initialize cell state with zeros
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim)     # shape: (num_layers * num_directions, batch, hidden_size)
        
        
        # x       : input consists of a sequence or batch of sequences.   shape: (batch, seq_len, features)                         Assuming batch_first=True
        # out     : tensor containing hidden states from all time steps.  shape: (batch, seq_len,  num_directions * hidden_size)    Assuming batch_first=True
        # (hn,cn) : hidden state and cell state after last time-step.     shape: (num_layers * num_directions, batch, hidden_size)
        out, (hn, cn) = self.lstm(??????)
        
        
        # Take hidden state from last time step and convert it to logits using fully-connected layer
        out = self.fc(out[??????]) 
        
        
        return out

### Instantiate Model Class

In [None]:
input_dim  = 28      # Number of elements in each part of sequence. 
hidden_dim = 100     # Number of elements in hidden/cell state
layer_dim  = 1       # Number of LSTM Cells
output_dim = 10      # Number of classes,  There are 10 classes in MNIST dataset (Digits: 0,1,2,...9)



model = LSTMModel( ????????????)

### Instantiate Loss Class
We are going to use **Cross Entropy Loss** for this classification problem.   

In [None]:
criterion = nn.??

### Instantiate Optimizer Class


- Simplified equation
    - $ \boldsymbol{\theta} = \boldsymbol{\theta} - \eta \cdot \nabla_{\boldsymbol{\theta}} $
        - $ \boldsymbol{\theta} $: parameters (learnable variables/parameters)
        - $\eta $: learning rate (how fast we want to learn)
        - $\nabla_{\boldsymbol{\theta}}$: parameters' gradients
- Even simpler equation
    - `parameters = parameters - learning_rate * parameters_gradients`
    - **At every iteration, we update our model's parameters**
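
The update rule above can be checked against `torch.optim.SGD` on a toy problem. This is a sketch under stated assumptions: the two-element parameter `theta` and the quadratic loss are made up for the example and have nothing to do with the LSTM model.

```python
import torch

theta = torch.tensor([1.0, -2.0], requires_grad=True)   # toy parameters
eta = 0.1                                                # learning rate

loss = (theta ** 2).sum()          # toy loss; its gradient is 2 * theta
loss.backward()

# Manual update: parameters - learning_rate * parameters_gradients
with torch.no_grad():
    expected = theta - eta * theta.grad

# The same update performed by the optimizer
optimizer = torch.optim.SGD([theta], lr=eta)
optimizer.step()

print(torch.allclose(theta, expected))   # True
```

With no momentum or weight decay, one `optimizer.step()` is exactly the simplified equation applied once.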

In [None]:
learning_rate = 0.1

optimizer = torch.optim.SGD(?????)  

### Parameters In-Depth

In [None]:
len(list(model.parameters()))

In [None]:
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())

### Understanding LSTM Training

First, let's see how an image passes through the LSTM and how the loss is calculated.

1. First, we convert the 28x28 image into a sequence of 28 image strips of size 1x28.
2. At each time step, we feed the LSTM one 1x28 image strip and update the hidden state and cell state.
3. After passing all 28 strips, we take the final hidden state and pass it as input to a fully-connected layer.
4. The fully-connected layer outputs 10 logits (unnormalized scores), one per image class.
5. We use softmax to normalize the scores and calculate the cross-entropy loss.

In the backward pass, the gradient flows from the final time step back to the initial time step of the LSTM; because the same parameters are used at every time step, their gradients from all time steps are added together.
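
Step 5 can be made concrete with a small check. As a sketch on made-up logits and labels (both are assumptions for illustration), it shows that `nn.CrossEntropyLoss` applied to raw logits matches a log-softmax followed by the mean negative log-probability of the correct class, so the network itself does not need a softmax layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)               # 4 images, 10 unnormalized class scores each
labels = torch.tensor([3, 0, 7, 9])       # made-up target classes

# Built-in loss: takes raw logits and applies softmax internally
loss_builtin = nn.CrossEntropyLoss()(logits, labels)

# Manual version: log-softmax, then average negative log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), labels].mean()

print(torch.allclose(loss_builtin, loss_manual))
```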



### Train Model
- Process 
    1. Take one sequence or batch of sequences
    2. Clear gradient buffers
    3. Get output by forward-passing input through model/network
    4. Compute **cross-entropy** loss using labels and output of network
    5. Get gradients w.r.t. parameters
    6. Update parameters using gradients
        - `parameters = parameters - learning_rate * parameters_gradients`
    7. REPEAT
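
The process above can be sketched on a toy problem. Everything here is a stand-in chosen for illustration (a small linear classifier, random inputs and labels, 20 repetitions); it is not the homework model or its data, only the order of the calls is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (not the homework model): a linear classifier on random data
model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(16, 8)               # 1. one batch of inputs
labels = torch.randint(0, 3, (16,))

losses = []
for step in range(20):                    # 7. repeat
    optimizer.zero_grad()                 # 2. clear gradient buffers
    outputs = model(inputs)               # 3. forward pass
    loss = criterion(outputs, labels)     # 4. cross-entropy loss
    loss.backward()                       # 5. gradients w.r.t. parameters
    optimizer.step()                      # 6. parameter update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Gradients accumulate across `backward()` calls in PyTorch, which is why the buffers are cleared at the start of every iteration.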

In this homework, we will feed an LSTM cell an entire MNIST image in the form of a sequence of 28 rows and then classify the image. Data can be viewed as a set of $N$ examples $\{(\mathbf{x_i},y_i)\}_{i=1}^{N}$. In each pair $(\mathbf{x_i},y_i)$, $\mathbf{x_i} = \langle \mathbf{x_{i,1}}, \mathbf{x_{i,2}},\dots, \mathbf{x_{i,28}}\rangle$ is a sequence of 28 vectors, each of dimension $(1,28)$.<br>
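
The image-to-sequence view described above can be illustrated on a random tensor with the same shape as one MNIST batch. The tensor `images` here is a stand-in, not real data.

```python
import torch

images = torch.randn(100, 1, 28, 28)           # stand-in for one MNIST batch
seq_len, input_dim = 28, 28

inputs = images.view(-1, seq_len, input_dim)   # (batch, seq_len, input_features)
print(inputs.shape)                            # torch.Size([100, 28, 28])

# Time step t of sample n is row t of image n
print(torch.equal(inputs[0, 5], images[0, 0, 5]))   # True
```

Because MNIST has a single channel, dropping the channel dimension with `view` leaves each image row as one 28-dimensional time step.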


### Training Loop

In [None]:
# Number of time steps or Number of constituents of one sequence
seq_len = 28  



iteration = 0


for epoch in range(num_epochs):
    
    for i, (images, labels) in enumerate(train_loader):
        
        # Shape of images = (batch_size, num_channels, 28, 28)
        # Shape of images = (batch_size, 1           , 28, 28)
        
        # Reshape batch of images into batch of sequences,    
        inputs = images.view(-1, seq_len, input_dim)  # shape: (batch, seq_len, input_features)
        
        
        # Clear gradients w.r.t. parameters
        ??
        
        
        
        # Forward pass to get output/logits
        outputs = model(??)
        
        
        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(????)
        
        
        # Getting gradients w.r.t. parameters
        ????
        
        
        # Updating parameters
        ????
        
        
        
        iteration += 1
        
        
        # Test Accuracy on Test Data
        if iteration % 500 == 0:
            with torch.no_grad():   #  We do not want to add following operations in computation graph.
                
                
                correct = 0
                total = 0


                # Iterate through test dataset
                for images, labels in test_loader:

                    # Shape of images = (batch_size, num_channels, 28, 28)
                    # Shape of images = (batch_size, 1           , 28, 28)
                    
                    # Reshape batch of images into batch of sequences
                    inputs = images.view(-1, seq_len, input_dim)   # shape: (batch, seq_len, input_features)
        
        

                    # Forward pass only to get logits/output
                    outputs = model(???)

                    # Get predictions from outputs
                    predicted = ??

                    
                    # Total number of labels
                    total += labels.size(0)

                    
                    # Total correct predictions
                    correct += (predicted == labels).sum()

                accuracy = 100 * correct.item() / total

                # Print Loss
                print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iteration, loss.item(), accuracy))

### LSTM Network with Multiple Cells/Layers
In the above example, we used an LSTM with only one layer, but it is possible to stack multiple LSTM cells/layers on top of each other. Each LSTM cell/layer has its own set of parameters as well as its own hidden and cell states. The output (hidden state) of the first LSTM cell/layer is fed as input to the next LSTM cell/layer. 


* Increasing the number of layers allows the LSTM to learn more complex patterns, but it can also overfit.

* You have to maintain a hidden state and a cell state for each layer, and each layer adds a lot of network parameters.

* In practice, you will rarely come across more than 3 to 4 stacked LSTM layers/cells. 
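
The effect of stacking on state shapes and parameter counts can be inspected directly with `nn.LSTM`. This sketch uses a random input batch and the sizes from this homework (28 features, 100 hidden units); the list name `summary` is introduced only for the example, and the model is untrained.

```python
import torch
import torch.nn as nn

x = torch.randn(5, 28, 28)     # (batch, seq_len, features), random stand-in data

summary = []
for num_layers in (1, 2, 3):
    lstm = nn.LSTM(input_size=28, hidden_size=100,
                   num_layers=num_layers, batch_first=True)
    out, (h_n, c_n) = lstm(x)
    n_params = sum(p.numel() for p in lstm.parameters())
    summary.append((num_layers, tuple(out.shape), tuple(h_n.shape), n_params))

for row in summary:
    print(row)
# (1, (5, 28, 100), (1, 5, 100), 52000)
# (2, (5, 28, 100), (2, 5, 100), 132800)
# (3, (5, 28, 100), (3, 5, 100), 213600)
```

The output tensor keeps the same shape regardless of depth because it holds only the top layer's hidden states, while `h_n` and `c_n` gain one slice per layer. Each additional layer adds 80,800 parameters here, since its input is the 100-dimensional hidden state of the layer below.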

---
---

<div class="alert alert-block alert-warning">
<h1><span style="color:red"> Question  </span></h1>
</div>


#### LSTM with Multiple Cells/Layers
Use the PyTorch nn.LSTM class to initialize an LSTM model with 2 cells/layers and then with 3 cells/layers. Comment on the results.

**HINT**: You just have to change the layer_dim

**Comments**: 



---
---

<div class="alert alert-block alert-warning">
<h1><span style="color:red"> Question  </span></h1>
</div>


What does **.detach()** do in PyTorch? Describe **'Backward'** propagation in a single-cell LSTM network.

**Answer**: 



---
---

In [None]:
<div class="alert alert-block alert-warning">
<h2><span style="color:blue"> Task-5  Comments  </span></h2>
</div>


I did this because:
* Abcd
* Abcd


I have observed the following points:
* Abcd
* Abcd

### Convert .IPYNB to .HTML 

In [None]:
import os 
cwd = os.getcwd()
os.chdir(cwd)

!jupyter nbconvert --to html LSTM_Homework_StarterCodeNew.ipynb

[NbConvertApp] Converting notebook LSTM_Homework_StarterCodeNew.ipynb to html
[NbConvertApp] Writing 696890 bytes to LSTM_Homework_StarterCodeNew.html
