# **Mock Exam**

#### This notebook helps you prepare for the exam. It contains a task similar to the one in the first part of the exam. Although it is a simpler version of the actual exam, the techniques used to solve it cover the basic knowledge needed for part one. The solutions are included.

**Name**:

**Neptun code:**

## Task Description

#### Your task is to implement an Encoder-Decoder structure and the forward functions. 

#### Afterwards, run code cells 1.3 and 1.6 to check that your implementation is correct.

#### This task should be **SOLVED IN 1 HOUR** and submitted to Canvas (download the .ipynb file). Please note that after 1 hour the Canvas exam assignment will close and you will no longer be able to submit your solution.

In [None]:
import torch
from torch import nn
from torchvision import models
from torchsummary import summary

#### **NO GPU IS NEEDED for this task**. Neither training nor any other computationally expensive operation will be performed. This notebook runs on any computer with a CPU.

In [None]:
#device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # optional: GPU acceleration is not required for this notebook
#device

## 1. Architecture

#### Please keep in mind that this architecture is purely imagined and should not correspond to any existing model / architecture. You will not find it on the internet.

Right-click the image and choose "Open image in a new tab" to zoom in, or download it from here: https://drive.google.com/file/d/1up5D0ikCBUt5T_RVbChEsz3Xb96akDXO/view?usp=sharing

<br>
<br>

![](https://drive.google.com/uc?export=view&id=1up5D0ikCBUt5T_RVbChEsz3Xb96akDXO)

#### 1.1. Implement the encoder

In [None]:
class EncoderCNN(nn.Module):
    def __init__(self, bottleneck_size = 128):
        super(EncoderCNN, self).__init__()
        
        # GET the pretrained vgg16 model
        
        vgg16 = models.vgg16(pretrained = True)

        # REMOVE the "classifier" layers. In order to know how to solve this,
        # printing the architecture structure might be useful

        # children() yields features, avgpool, classifier; drop only the classifier
        modules = list(vgg16.children())[:-1]
        self.vgg16 = nn.Sequential(*modules)
        

        # DEFINE the convolutions. You need to know the output of the VGG16 without the last layer
        # in order to define the input channels. Again, printing the architecture might be useful

        # Each branch gets its own convolution so that the two branches do not share weights.
        # For 224x224 inputs, a 3x3 "valid" convolution maps (512, 7, 7) to (256, 5, 5).
        self.conv256_3x3_a = nn.Conv2d(in_channels=512, out_channels=256, kernel_size=3, stride=1, padding="valid")
        self.conv256_3x3_b = nn.Conv2d(in_channels=512, out_channels=256, kernel_size=3, stride=1, padding="valid")

        # DEFINE a fully connected layer that receives the combined convolutions' features as input features
        # and outputs the bottleneck_size.
        # You need to know the output size of the combined convolutions, reshaped into a vector (batch_size, combined features)
        self.fc = nn.Linear(12800, bottleneck_size)

        # DEFINE a dropout layer with probability 0.3
        self.dropOut = nn.Dropout(0.3)

        # DEFINE a ReLU activation layer
        self.relu = nn.ReLU(inplace=False)

    def forward(self, images):

        # GET the features from the VGG16
        features = self.vgg16(images)

        # SEND the features to the two branches
        branch_a = self.relu(self.conv256_3x3_a(features))
        branch_b = self.dropOut(self.conv256_3x3_b(features))

        # COMBINE the branches along the channel axis
        cat = torch.cat([branch_a, branch_b], dim=1)

        print(cat.shape)  # debug: (batch_size, 512, 5, 5)

        # RESHAPE the combined output of the convolutions so that it can be fed to the linear layer.
        # Alternatively you can FLATTEN it.
        x = torch.flatten(cat, 1)
        print(x.shape)  # debug: (batch_size, 12800)

        # GET the bottleneck from the fully connected layer
        bottleneck = self.fc(x)
        
        return bottleneck

encoder = EncoderCNN()
encoded = encoder(torch.randn(10,3,224,224))
encoded.shape

torch.Size([10, 512, 5, 5])
torch.Size([10, 12800])


torch.Size([10, 128])
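
The input size of the fully connected layer follows from the convolution output-size formula. The sketch below is an illustrative calculation only, assuming a 224x224 input (so the VGG16 feature map is 512x7x7) and the two 3x3 "valid" convolutions with 256 output channels each. `conv_out_size` is a hypothetical helper written for this example, not a PyTorch function.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


feature_map = 7        # assumed VGG16 feature map side for a 224x224 input
branch_channels = 256  # output channels of each convolution branch

# "valid" padding means padding=0, so a 3x3 kernel shrinks each side by 2
spatial = conv_out_size(feature_map, kernel_size=3)

# Concatenating the two branches along the channel axis doubles the channels
combined_channels = 2 * branch_channels

# Flattening keeps the batch axis and merges channels, height and width
flat_features = combined_channels * spatial * spatial

print(spatial, combined_channels, flat_features)
```

The last value is the number of input features the linear layer has to accept.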

#### 1.2. Solution

In [None]:
#@title
class EncoderCNN(nn.Module):
    def __init__(self, bottleneck_size = 128):
        super(EncoderCNN, self).__init__()
        
        # GET the pretrained vgg16 model
        vgg16 = models.vgg16(pretrained=True)
        
        # REMOVE the "classifier" layers. In order to know how to solve this,
        # printing the architecture structure might be useful
        modules = list(vgg16.children())[:-1]
        self.vgg16 = nn.Sequential(*modules)

        # DEFINE the convolutions. You need to know the output of the VGG16 without the last layer
        # in order to define the input channels. Again, printing the architecture might be useful
        self.conv2d_A = nn.Conv2d(in_channels=512, out_channels=256, kernel_size=(3, 3), stride=(1, 1), padding='valid')
        self.conv2d_B = nn.Conv2d(in_channels=512, out_channels=256, kernel_size=(3, 3), stride=(1, 1), padding='valid')

        # DEFINE a fully connected layer that receives the combined convolutions' features as input features
        # and outputs the bottleneck_size.
        # You need to know the output size of the combined convolutions, reshaped into a vector (batch_size, combined features)
        self.linear = nn.Linear(in_features=12800, out_features=bottleneck_size)
        
        # DEFINE a dropout layer with probability 0.3
        self.dropout = nn.Dropout(p=0.3)
        
        # DEFINE a ReLU activation layer
        self.relu = nn.ReLU()
        
    def forward(self, images):
        
        # GET the features from the VGG16
        features = self.vgg16(images)

        # SEND the features to the two branches
        branch_A = self.conv2d_A(features)
        branch_A = self.relu(branch_A)

        branch_B = self.conv2d_B(features)
        branch_B = self.dropout(branch_B)
        
        #COMBINE the branches
        combined = torch.cat((branch_A, branch_B), dim=1)

        # RESHAPE the combined output of the convolutions so that it can be fed to the linear layer.
        # Alternatively you can FLATTEN it.
        combined = torch.flatten(combined, start_dim=1) #or combined = combined.view(-1,512*5*5)

        # GET the bottleneck from the fully connected layer
        bottleneck = self.linear(combined)
        
        return bottleneck

#### 1.3. Test your implementation.
Expected output 

torch.Size( [10, 128] )

In [None]:
encoder = EncoderCNN()
encoded = encoder(torch.randn(10,3,224,224))
encoded.shape


#### 1.4. Implement the Decoder

In [None]:
class DecoderRNN(nn.Module):
    def __init__(self, embed_size=128, hidden_size=512, vocab_size=1000):
        super(DecoderRNN, self).__init__()
        
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        
        # DEFINE the LSTM Cell, it takes the size of the embedding as input size and has a hidden size 
        self.lstm_cell = nn.LSTMCell(embed_size, hidden_size) 
    
        # DEFINE the fully connected layer that takes as input features the output of the LSTM 
        # and outputs features of the size of the vocabulary
        self.fc = nn.Linear(hidden_size, vocab_size)
        
    
        # DEFINE the embedding layer with the size of the vocabulary and a dimension of embed size
        self.embedding = nn.Embedding(vocab_size, embed_size)
    
        # DEFINE a sigmoid activation
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, encoded, captions):
        # GET the batch size from the encoded input
        batch_size = encoded.size(0)
        
        # INITIALIZE the hidden and cell states to zeros taking into consideration the batch size
        hidden_state = torch.zeros(batch_size, self.hidden_size)
        cell_state = torch.zeros(batch_size, self.hidden_size)

        # DONE for you
        outputs = torch.empty((batch_size, captions.size(1), self.vocab_size))

        # APPLY embeddings to the captions. Please use captions.to(torch.int64) to avoid errors
        captions_embed = self.embedding(captions.to(torch.int64))
  
        # PASS THE CAPTION TO THE MODEL, WORD BY WORD
        for t in range(captions.size(1)):

            # for the first time step the INPUT IS THE ENCODED VECTOR.
            # don't forget to GET the hidden state and cell state
            if t == 0:
                hidden_state, cell_state = self.lstm_cell(encoded, (hidden_state, cell_state))
                
            # from the 2nd time step on, the INPUT IS THE EMBEDDED CAPTIONS.
            # index only the time-step axis (tokens), keeping the batch and embedding axes: [:,???,:]
            # don't forget to GET the hidden state and cell state
            else:
                hidden_state, cell_state = self.lstm_cell(captions_embed[:, t, :], (hidden_state, cell_state))
            
            # GET the output from the fully connected layer
            out = self.fc(hidden_state)

            # APPLY the activation function. Call it "out"
            out = self.sigmoid(out)
            
            # DONE for you
            outputs[:, t, :] = out
    
        return outputs

decoder = DecoderRNN()
captions = torch.abs(torch.randn(10,1000))
decoded = decoder(encoded,captions)
decoded.shape

torch.Size([10, 1000, 1000])
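
A single `nn.LSTMCell` step may help to see why the encoder bottleneck can be fed in at the first time step: the cell only requires an input of shape `(batch_size, input_size)`, so the bottleneck size has to equal the embedding size. This is a minimal sketch with an arbitrarily chosen batch size of 4; the sizes 128 and 512 mirror the defaults used above.

```python
import torch
from torch import nn

batch_size, embed_size, hidden_size = 4, 128, 512
cell = nn.LSTMCell(input_size=embed_size, hidden_size=hidden_size)

# Stand-in for either the encoder bottleneck or one embedded token per sample
x_t = torch.randn(batch_size, embed_size)

# Hidden and cell states start at zero, one row per sample in the batch
h = torch.zeros(batch_size, hidden_size)
c = torch.zeros(batch_size, hidden_size)

# One time step: the cell returns the updated hidden and cell states
h, c = cell(x_t, (h, c))
print(h.shape, c.shape)
```

Both returned states keep the shape `(batch_size, hidden_size)`, so they can be passed straight into the next step of the loop.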

#### 1.5. Solution

In [None]:
#@title
class DecoderRNN(nn.Module):
    def __init__(self, embed_size=128, hidden_size=512, vocab_size=1000):
        super(DecoderRNN, self).__init__()
        
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        
        # DEFINE the LSTM Cell, it takes the size of the embedding as input size and has a hidden size 
        self.lstm_cell = nn.LSTMCell(input_size=self.embed_size, hidden_size=self.hidden_size)
    
        # DEFINE the fully connected layer that takes as input features the output of the LSTM 
        # and outputs features of the size of the vocabulary
        self.fc_out = nn.Linear(in_features=self.hidden_size, out_features=self.vocab_size)
    
        # DEFINE the embedding layer with the size of the vocabulary and a dimension of embed size
        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.embed_size)
    
        # DEFINE a sigmoid activation
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, encoded, captions):
        # GET the batch size from the encoded input
        batch_size = encoded.size(0)
        
        # INITIALIZE the hidden and cell states to zeros taking into consideration the batch size
        hidden_state = torch.zeros((batch_size, self.hidden_size))
        cell_state = torch.zeros((batch_size, self.hidden_size))

        # DONE for you
        outputs = torch.empty((batch_size, captions.size(1), self.vocab_size))

        # APPLY embeddings to the captions. Please use captions.to(torch.int64) to avoid errors
        captions_embed = self.embed(captions.to(torch.int64))

        # PASS THE CAPTION TO THE MODEL, WORD BY WORD
        for t in range(captions.size(1)):

            # for the first time step the INPUT IS THE ENCODED VECTOR.
            # don't forget to GET the hidden state and cell state
            if t == 0:
                hidden_state, cell_state = self.lstm_cell(encoded, (hidden_state, cell_state))
                
            # from the 2nd time step on, the INPUT IS THE EMBEDDED CAPTIONS.
            # index only the time-step axis (tokens), keeping the batch and embedding axes: [:,???,:]
            # don't forget to GET the hidden state and cell state
            else:
                hidden_state, cell_state = self.lstm_cell(captions_embed[:, t, :], (hidden_state, cell_state))
            
            # GET the output from the fully connected layer
            out = self.fc_out(hidden_state)

            # APPLY the activation function. Call it "out"
            out = self.sigmoid(out)
            
            # DONE for you
            outputs[:, t, :] = out
    
        return outputs

#### 1.6. Test your implementation
This requires the Encoder to be defined and its output to be stored in an "encoded" variable (cell 1.3. does that).

Expected output

torch.Size( [10, 1000, 1000] )


In [None]:
decoder = DecoderRNN()
captions = torch.abs(torch.randn(10,1000))
decoded = decoder(encoded,captions)
decoded.shape
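
The test captions come from `torch.abs(torch.randn(...))`, so they are non-negative floats rather than token ids. The sketch below, using made-up caption values, illustrates what `captions.to(torch.int64)` does to them before the embedding lookup: the fractional part is dropped, which leaves small integer indices that fit inside the vocabulary.

```python
import torch
from torch import nn

vocab_size, embed_size = 1000, 128
embedding = nn.Embedding(vocab_size, embed_size)

# Made-up float "captions": 2 samples, 3 tokens each
captions = torch.tensor([[0.3, 1.7, 2.9], [4.2, 0.0, 3.5]])

# Casting to int64 truncates towards zero, giving valid embedding indices
token_ids = captions.to(torch.int64)
print(token_ids)

# Each index is replaced by its embedding vector: (batch, tokens, embed_size)
embedded = embedding(token_ids)
print(embedded.shape)
```

Indexing `embedded[:, t, :]` then selects the embedding of token `t` for every sample in the batch.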