
# **Final Exam - Deep Network Development**



# **Exam Information**

---

- **Name:** *<Enter your name here>*
- **Neptun ID:** *<Enter your Neptun ID here>*
- **Date:** *03/01/2025*
- **Duration:** *9:00 AM – 11:00 AM*
- *Please fill in your details above before starting the exam.*



## **General Rules**

This notebook contains the task you must complete to pass the exam and the course. The details are:
1. The required task is **implementing a network architecture**, including its **forward pass** function.
2. Additional **optional requirements** earn bonus points towards the final grade.
3. You have **2 hours** to complete the exam.
4. You may distribute the time as you see fit between the required and optional parts.
5. You may use any resource, including the internet, AI tools, practice notebooks, and more.
6. **It is strictly prohibited to use any form of communication** (e.g., Teams, WhatsApp, Messenger, etc.). **Violation will result in an immediate FAIL** of the exam.

---

### **Submission Guidelines**
- Submit your solution as a **`.ipynb` file** on **Canvas**.
- To **PASS**, you must:
  1. Submit a solution that **satisfies the minimum requirements** (i.e., a working implementation of the network architecture and forward pass).
  2. **Submit it on time**.
  3. Be prepared to **orally defend your code** after submission.

---

### **Exam Retake Policy**
- If you **FAIL**, you are allowed to do **one retake**.  
- If you **FAIL AGAIN**, sadly, you **fail the course**.  

---

### **Grading**
- If you **PASS**, your final grade will be the **weighted average** of your assignment defenses (theory and code).

---

Good luck, and ensure you follow all the rules!


## **Requirements**

---

### **Minimum Requirements – Sufficient to Pass the Exam**
1. **Implement the layers of the architecture:**  
   Complete the architecture outlined in Section 1 by filling in the missing parts.
2. **Implement the forward function:**  
   Ensure the input and output of the forward function are correctly implemented.  
   
   **Note:** To meet these requirements, your final output must match the expected output.

---

### **Extra Requirements – For Grade Improvement and AI Lab Access**

---

3. **Text-to-Image with Image-Guided Embeddings:**  
- The goal is to perform text-to-image generation using an existing image as a guide for editing. The input text specifies modifications to the existing image, preserving its content while applying specific changes as described by the text.

   ➡️ **Reward: +1 to final grade**

---

4. **Replacing Text Encoder with Transformer:**  
- Replace the text encoder with a Transformer model.
- Test the new architecture to ensure it performs text-to-image editing correctly, i.e., it satisfies the expected-output condition.

   ➡️ **Reward: Access to AI Lab**

---

Make sure to carefully follow the instructions provided in each cell to meet the requirements!


## **1. Required: Task Description**

Your task is to implement a custom neural network architecture along with its forward pass function.

This task is inspired by **text-to-image generation**, where a neural network maps a sequence of tokens representing textual information into a high-dimensional image. The text input is typically **tokenized** into a sequence of integers. This representation can be passed through an **encoder-decoder** architecture to generate images.

For this task, you will work with a simplified text input: a random integer tensor of shape **(1, 10)**:
- The 1 indicates that there is a single input sample.
- 10 corresponds to the sequence length of the input text tokens.

Your implemented model will:
- Take this text token tensor as input.
- Encode it into a latent representation.
- Decode the latent representation to produce an output **image tensor of shape (1, 3, 256, 256)**, where:
    - 1 represents the batch size.
    - 3 indicates the RGB color channels of the image.
    - 256 × 256 corresponds to the height and width of the output image.

The primary objective is to correctly implement the neural network architecture and its forward pass to achieve the desired functionality.
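
Reaching an exact target shape such as (1, 3, 256, 256) is mostly a matter of size arithmetic. The sketch below applies the standard output-size formulas for `nn.Conv2d` and `nn.ConvTranspose2d`, assuming dilation 1 and no output padding. The helper names `conv_out` and `conv_transpose_out` are hypothetical, introduced only for illustration, and the layer parameters shown are one possible route from a 32 × 32 latent map to a 256 × 256 image, not the only one.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of Conv2d or MaxPool2d (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0):
    """Spatial output size of ConvTranspose2d (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel


# One possible chain from a 32x32 latent map to a 256x256 image.
step1 = conv_transpose_out(32, kernel=12, stride=4)                # upsample
step2 = conv_transpose_out(step1, kernel=3, stride=2, padding=1)   # upsample again
final = conv_out(step2, kernel=16)                                 # trim to target
print(step1, step2, final)
```

Checking each layer this way before running the model makes shape mismatches at additions and concatenations easier to spot.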

To better view the architecture diagram:  
- **Right-click the image** and select **"Open image in a new tab"** to enable zoom for a clearer view.  
- Alternatively, you can **download the image** using the link below:  
  [Download Architecture Diagram](https://drive.google.com/file/d/1osVNVcsNGo-d9DCGVH1hJDw2nw4rMToR/view?usp=sharing)

---

### Diagram Preview:
![Architecture Diagram](https://drive.google.com/uc?export=view&id=1osVNVcsNGo-d9DCGVH1hJDw2nw4rMToR)


### Necessary Imports and Data Loading

In [None]:
# Cell 0.1
import torch
import torch.nn as nn
import numpy as np

In [None]:
# Cell 0.2 (GPU is not needed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

### Input

In [None]:
# Cell 0.3 -> INPUT (DO NOT EDIT THIS CELL!)
vocab_size = 10
input_tokens = torch.randint(0, vocab_size, (1, 10))
print(input_tokens.shape)

torch.Size([1, 10])


### Architecture

In [22]:
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(TextEncoder, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=128, num_layers=2, batch_first=True)
        # Input size is 10 time steps x 128 hidden units, flattened
        self.fc = nn.Linear(in_features=10 * 128, out_features=10 * 32 * 32)

    def forward(self, input_tokens):
        print("Text input:", input_tokens.shape)
        embeddings = self.embedding(input_tokens)
        print("After Embedding:", embeddings.shape)
        lstm_out, _ = self.lstm(embeddings)
        print("After LSTM:", lstm_out.shape)
        # Flatten all time steps into one vector per sample: (B, 10, 128) -> (B, 1280)
        reshape = lstm_out.reshape(lstm_out.size(0), -1)
        print("After Reshape:", reshape.shape)
        linear_out = self.fc(reshape)
        print("After Linear Layer:", linear_out.shape)
        reshaped_out = linear_out.view(-1, 10, 32, 32)
        print("After Reshape:", reshaped_out.shape)
        return reshaped_out

vocab_size = 10
embedding_dim = 256
text_encoder = TextEncoder(vocab_size, embedding_dim)
# input_tokens = torch.randint(0, vocab_size, (1, 5))
text_embedding = text_encoder(input_tokens)
print("Output shape:", text_embedding.shape)

Text input: torch.Size([1, 10])
After Embedding: torch.Size([1, 10, 256])
After LSTM: torch.Size([1, 10, 128])
After Reshape: torch.Size([1, 1280])
After Linear Layer: torch.Size([1, 10240])
After Reshape: torch.Size([1, 10, 32, 32])
Output shape: torch.Size([1, 10, 32, 32])


In [None]:
class ImageDecoder(nn.Module):
    def __init__(self):
        super(ImageDecoder, self).__init__()
        # ADD YOUR CODE HERE
        # Branch 1: upsample the text embedding directly, 32x32 -> 136x136
        self.convtrans = nn.ConvTranspose2d(in_channels=10, out_channels=16, kernel_size=12, stride=4, padding=0)
        # Branch 2: conv (32 -> 30) followed by two max-pools (30 -> 15 -> 7)
        self.conv1 = nn.Conv2d(in_channels=10, out_channels=32, kernel_size=3, stride=1, padding=0)
        self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        self.relu1 = nn.ReLU()
        self.batchnorm1 = nn.BatchNorm2d(32)
        self.maxpool2 = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        # Branch 3: second conv (30 -> 28) with ReLU and batch norm
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, padding=0)
        self.relu2 = nn.ReLU()
        self.batchnorm2 = nn.BatchNorm2d(32)
        # Branch 4: conv (28 -> 26), then transposed conv back to 28x28
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=0)
        self.convtrans2 = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=3, stride=1, padding=0)
        # Applied to the sum of branches 3 and 4: strided conv, 28x28 -> 7x7
        self.conv4 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=8, stride=3, padding=0)
        # Applied to the sum with branch 2: upsample 7x7 -> 136x136
        self.convtrans3 = nn.ConvTranspose2d(in_channels=32, out_channels=16, kernel_size=24, stride=19, padding=1)
        # Applied after concatenation with branch 1 (16 + 16 = 32 channels): 136 -> 271
        self.convtrans4 = nn.ConvTranspose2d(in_channels=32, out_channels=8, kernel_size=3, stride=2, padding=1)
        # Output conv: 271 -> 256, producing 3 RGB channels
        self.conv5 = nn.Conv2d(in_channels=8, out_channels=3, kernel_size=16, stride=1, padding=0)



    def forward(self, text_embedding):
        print("Text embedding:", text_embedding.shape)
        # ADD YOUR CODE HERE
        x1 = self.convtrans(text_embedding)
        print("After ConvTrans:", x1.shape)

        y = self.conv1(text_embedding)
        print("After Conv1:", y.shape)
        x2 = self.maxpool1(y)
        print("After MaxPool1:", x2.shape)
        x2 = self.relu1(x2)
        x2 = self.batchnorm1(x2)
        x2 = self.maxpool2(x2)
        print("After MaxPool2:", x2.shape)

        y1 = self.conv2(y)
        print("After Conv2:", y1.shape)
        x3 = self.relu2(y1)
        x3 = self.batchnorm2(x3)

        x4 = self.conv3(y1)
        print("After Conv3:", x4.shape)
        x4 = self.convtrans2(x4)
        # This second "After Conv3" line reports the shape after convtrans2.
        print("After Conv3:", x4.shape)

        x5 = x3 + x4
        print("After add: ",x5.shape)
        x5 = self.conv4(x5)
        print("After Conv4:", x5.shape)

        x6 = x2 + x5
        print("After add:", x6.shape)
        x6 = self.convtrans3(x6)
        print("After ConvTrans3:", x6.shape)

        x7 = torch.concat([x1, x6], dim=1)
        print("After concat:", x7.shape)
        x7 = self.convtrans4(x7)
        print("After ConvTrans4:", x7.shape)
        x8 = self.conv5(x7)
        print("After Conv5:", x8.shape)
        out = x8
        return out

image_decoder = ImageDecoder()
image = image_decoder(text_embedding)
print("Output shape:", image.shape)

Text embedding: torch.Size([1, 10, 32, 32])
After ConvTrans: torch.Size([1, 16, 136, 136])
After Conv1: torch.Size([1, 32, 30, 30])
After MaxPool1: torch.Size([1, 32, 15, 15])
After MaxPool2: torch.Size([1, 32, 7, 7])
After Conv2: torch.Size([1, 32, 28, 28])
After Conv3: torch.Size([1, 64, 26, 26])
After Conv3: torch.Size([1, 32, 28, 28])
After add:  torch.Size([1, 32, 28, 28])
After Conv4: torch.Size([1, 32, 7, 7])
After add: torch.Size([1, 32, 7, 7])
After ConvTrans3: torch.Size([1, 16, 136, 136])
After concat: torch.Size([1, 32, 136, 136])
After ConvTrans4: torch.Size([1, 8, 271, 271])
After Conv5: torch.Size([1, 3, 256, 256])
Output shape: torch.Size([1, 3, 256, 256])


#### Test your implementation

In [None]:
#DO NOT MODIFY THIS CELL

embedding_dim = 256
print("Encoder:")
text_encoder = TextEncoder(vocab_size, embedding_dim)
text_embedding = text_encoder(input_tokens)
print("Decoder:")
image_decoder = ImageDecoder()
image = image_decoder(text_embedding)

try:
    assert text_embedding.shape == (1, 10, 32, 32), "Encoded text shape is incorrect."
    assert image.shape == (1, 3, 256, 256), "Decoded image shape is incorrect."
    print("\n🎉 Congratulations! Your implementation is correct. You passed the minimum requirement! 🎉")
except AssertionError as e:
    print(f"\n❌ Error: {e}")

Encoder:
Text input: torch.Size([1, 10])
After Embedding: torch.Size([1, 10, 256])
After LSTM: torch.Size([1, 10, 128])
After Reshape: torch.Size([1, 1280])
After Linear Layer: torch.Size([1, 10240])
After Reshape: torch.Size([1, 10, 32, 32])
Decoder:
Text embedding: torch.Size([1, 10, 32, 32])
After ConvTrans: torch.Size([1, 16, 136, 136])
After Conv1: torch.Size([1, 32, 30, 30])
After MaxPool1: torch.Size([1, 32, 15, 15])
After MaxPool2: torch.Size([1, 32, 7, 7])
After Conv2: torch.Size([1, 32, 28, 28])
After Conv3: torch.Size([1, 64, 26, 26])
After Conv3: torch.Size([1, 32, 28, 28])
After add:  torch.Size([1, 32, 28, 28])
After Conv4: torch.Size([1, 32, 7, 7])
After add: torch.Size([1, 32, 7, 7])
After ConvTrans3: torch.Size([1, 16, 136, 136])
After concat: torch.Size([1, 32, 136, 136])
After ConvTrans4: torch.Size([1, 8, 271, 271])
After Conv5: torch.Size([1, 3, 256, 256])

🎉 Congratulations! Your implementation is correct. You passed the minimum requirement! 🎉


## **2. Optional: +1 to the Final Grade**
- Add another input tensor: a random tensor of size (1, 3, 256, 256).
- Implement an Image Encoder with a few layers to encode the tensor. The encoding process should follow the example image provided.
- Combine the encoded image embeddings with the text embeddings using cross-attention, following the example image provided.

Add only these new parts and reuse the ImageDecoder created previously. The final output shape should still match the required task: (1, 3, 256, 256).
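
As a rough illustration of the combination step, the sketch below flattens each (32, 32) channel into a 1024-dimensional token and applies `nn.MultiheadAttention`. It assumes the text embedding acts as the query and the image embedding as key and value; the architecture diagram may specify a different arrangement. The class name `CrossAttentionSketch` and the choice of 4 heads are hypothetical.

```python
import torch
import torch.nn as nn


class CrossAttentionSketch(nn.Module):
    """Text tokens attend over image tokens (illustrative only)."""

    def __init__(self, embed_dim=1024, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, text_embedding, image_embedding):
        b, c, h, w = text_embedding.shape
        # Treat each channel as one token of length H * W (= embed_dim).
        query = text_embedding.reshape(b, c, h * w)
        key_value = image_embedding.reshape(b, image_embedding.size(1), -1)
        out, _ = self.attn(query=query, key=key_value, value=key_value)
        # Restore the (B, C, H, W) layout expected by the decoder.
        return out.reshape(b, c, h, w)


text_emb_example = torch.randn(1, 10, 32, 32)
image_emb_example = torch.randn(1, 10, 32, 32)
combined_example = CrossAttentionSketch()(text_emb_example, image_emb_example)
print(combined_example.shape)
```

Because the output takes the query's layout, the combined tensor keeps the (1, 10, 32, 32) shape the decoder already accepts.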

To better view the architecture diagram:  
- **Right-click the image** and select **"Open image in a new tab"** to enable zoom for a clearer view.  
- Alternatively, you can **download the image** using the link below:  
  [Download Architecture Diagram](https://drive.google.com/file/d/1nIgqhyPq0eKWEvT7leqoa0ZeBApCCs6u/view?usp=sharing)

---

### Diagram Preview:
![Architecture Diagram](https://drive.google.com/uc?export=view&id=1nIgqhyPq0eKWEvT7leqoa0ZeBApCCs6u)


#### New Input - create a random tensor of size (1,3,256,256)

In [None]:
# ADD YOUR CODE HERE

#### Image Encoder - create the image encoder, following the example provided.
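
The example diagram is not reproduced here, so the following is only a minimal sketch of one layout that maps (1, 3, 256, 256) to (1, 10, 32, 32): three stride-2 convolutions that halve the resolution each time. The class name `ImageEncoderSketch`, the channel widths, and the activation choice are all assumptions, not the architecture from the diagram.

```python
import torch
import torch.nn as nn


class ImageEncoderSketch(nn.Module):
    """Three stride-2 convolutions: 256 -> 128 -> 64 -> 32 (hypothetical layout)."""

    def __init__(self, out_channels=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, new_image):
        return self.net(new_image)


new_image_example = torch.randn(1, 3, 256, 256)
image_embedding_example = ImageEncoderSketch()(new_image_example)
print(image_embedding_example.shape)
```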

In [None]:
class ImageEncoder(nn.Module):
    def __init__(self):
        super(ImageEncoder, self).__init__()
        # ADD YOUR CODE HERE

    def forward(self, new_image):
        print("New image:", new_image.shape)
        # ADD YOUR CODE HERE
        return out

#### Combine with Cross-Attention

In [None]:
class CrossAttention(nn.Module):
    def __init__(self, embed_dim=1024):
        super(CrossAttention, self).__init__()
        # ADD YOUR CODE HERE

    def forward(self, text_embedding, image_embedding):
        # ADD YOUR CODE HERE
        return out

#### Test your implementation

In [None]:
# DO NOT MODIFY THIS CELL
image_encoder = ImageEncoder()
image_embedding = image_encoder(new_image)

embed_dim = 1024
cross_attention = CrossAttention(embed_dim=embed_dim)
combine = cross_attention(text_embedding, image_embedding)

image = image_decoder(combine)

try:
    assert image_embedding.shape == (1, 10, 32, 32), "Encoded image shape is incorrect."
    assert combine.shape == (1, 10, 32, 32), "Combined cross attention shape is incorrect."
    assert image.shape == (1, 3, 256, 256), "Decoded image shape is incorrect."
    print("\n🎉 Congratulations! Your implementation is correct. You increased your final grade by 1! 🎉")
except AssertionError as e:
    print(f"\n❌ Error: {e}")

## **3. Optional: Access to AI Lab**
- Replace the text encoder with a Transformer model. This involves:
    - Setting the maximum sequence length to 16.
    - Using BertTokenizer.
    - Adding Positional Encoding.
    - Defining a Transformer Encoder.
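
The pieces listed above could fit together roughly as follows. This is a sketch under stated assumptions, not the architecture from the diagram: it uses fixed sinusoidal positional encodings, a vocabulary size of 30522 (that of the `bert-base-uncased` tokenizer), and random token ids as a stand-in for tokenizer output so that it runs without downloading anything. The class names are hypothetical. In the notebook, the ids would instead come from a `BertTokenizer` called with `padding="max_length"`, `truncation=True`, `max_length=16`, and `return_tensors="pt"`.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to token embeddings."""

    def __init__(self, embed_dim, max_len=16):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2) * (-math.log(10000.0) / embed_dim)
        )
        pe = torch.zeros(max_len, embed_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, embed_dim)

    def forward(self, x):
        return x + self.pe[:, : x.size(1)]


class TransformerTextEncoderSketch(nn.Module):
    """Embedding + positional encoding + Transformer encoder (illustrative)."""

    def __init__(self, embed_dim=1024, max_len=16, num_heads=4, num_layers=2,
                 vocab_size=30522):  # assumed: bert-base-uncased vocabulary size
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos = SinusoidalPositionalEncoding(embed_dim, max_len)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        x = self.pos(self.embedding(token_ids))  # (B, 16, 1024)
        x = self.encoder(x)
        b, seq_len, dim = x.shape
        side = math.isqrt(dim)  # 1024 -> 32
        return x.reshape(b, seq_len, side, side)


token_ids_example = torch.randint(0, 30522, (1, 16))  # stand-in for tokenizer ids
text_embedding_example = TransformerTextEncoderSketch()(token_ids_example)
print(text_embedding_example.shape)
```

With an embedding dimension of 1024, each of the 16 token vectors reshapes to a 32 × 32 map, which gives the (1, 16, 32, 32) layout.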


To better view the architecture diagram:  
- **Right-click the image** and select **"Open image in a new tab"** to enable zoom for a clearer view.  
- Alternatively, you can **download the image** using the link below:  
  [Download Architecture Diagram](https://drive.google.com/file/d/1bwrPryFGAAFF3OoJ3Z7UUzpVJaYvijUg/view?usp=sharing)

---

### Diagram Preview:
![Architecture Diagram](https://drive.google.com/uc?export=view&id=1bwrPryFGAAFF3OoJ3Z7UUzpVJaYvijUg)


In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_len=16):
        super(PositionalEncoding, self).__init__()
        # ADD YOUR CODE HERE

    def forward(self, x):
        # ADD YOUR CODE HERE

        return x

In [None]:
from transformers import BertTokenizer

# Transformer Encoder
class TransformerTextEncoder(nn.Module):
    def __init__(self, embed_dim, max_len=16, num_heads=4, num_layers=2):
        super(TransformerTextEncoder, self).__init__()
        # ADD YOUR CODE HERE

    def forward(self, input_text):
        # ADD YOUR CODE HERE

        return out

In [None]:

#DO NOT MODIFY THIS CELL
new_image = torch.randn((1,3,256,256))
embedding_dim = 1024 # 32x32
print("Encoder:")
text_encoder = TransformerTextEncoder(embedding_dim)
input_text = ["Generate an image based on this text"]
text_embedding = text_encoder(input_text)

image_encoder = ImageEncoder()
image_embedding = image_encoder(new_image)

embed_dim = 1024
cross_attention = CrossAttention(embed_dim=embed_dim)
combine = cross_attention(text_embedding, image_embedding)

print("Decoder:")
image_decoder = ImageDecoder()
image = image_decoder(combine)

try:
    assert text_embedding.shape == (1, 16, 32, 32), "Encoded text shape is incorrect."
    assert image_embedding.shape == (1, 16, 32, 32), "Encoded image shape is incorrect."
    assert combine.shape == (1, 16, 32, 32), "Combined cross attention shape is incorrect."
    assert image.shape == (1, 3, 256, 256), "Decoded image shape is incorrect."
    print("\n🎉 Congratulations! Your implementation is correct. You passed the requirement for the AI Lab! 🎉")
except AssertionError as e:
    print(f"\n❌ Error: {e}")