
# **Final Exam - Deep Network Development**



# **Exam Information**

---

- **Name:** *<Enter your name here>*
- **Neptun ID:** *<Enter your Neptun ID here>*
- **Date:** *03/01/2025*
- **Duration:** *9:00 AM – 11:00 AM*
- *Please fill in your details above before starting the exam.*



## **General Rules**

This notebook contains the task to be completed in order to pass the exam and the course. Below are the details:
1. **Implementing a network architecture**, including its **forward pass** function.
2. Additional **optional requirements** for bonus points towards the final grade.
3. You have **2 hours** to complete the exam.
4. You may distribute the time as you see fit between the required and optional parts.
5. You are allowed to use any resource including: the internet, AI tools, practice notebooks, and more.
6. **It is strictly prohibited to use any form of communication** (e.g., Teams, WhatsApp, Messenger, etc.). **Violation will result in an immediate FAIL** of the exam.

---

### **Submission Guidelines**
- Submit your solution as a **`.ipynb` file** on **Canvas**.
- To **PASS**, you must:
  1. **Satisfy the minimum requirements** (i.e., submit a working implementation of the network architecture and forward pass).
  2. **Submit on time**.
  3. Be prepared to **orally defend your code** after submission.

---

### **Exam Retake Policy**
- If you **FAIL**, you are allowed to do **one retake**.  
- If you **FAIL AGAIN**, sadly, you **fail the course**.  

---

### **Grading**
- If you **PASS**, your final grade will be the **weighted average** of your assignment defenses (theory and code).

---

Good luck, and ensure you follow all the rules!


## **Requirements**

---

### **Minimum Requirements – Sufficient to Pass the Exam**
1. **Implement the layers of the architecture:**  
   Complete the architecture outlined in Section 1 by filling in the missing parts.
2. **Implement the forward function:**  
   Ensure the input and output of the forward function are correctly implemented.  
   
   **Note:** To meet these requirements, your final output must match the expected output.

---

### **Extra Requirements – For Grade Improvement and AI Lab Access**

---

3. **Text-to-Image with Image-Guided Embeddings:**  
- The goal is to perform text-to-image generation using an existing image as a guide for editing. The input text specifies modifications to the existing image, preserving its content while applying specific changes as described by the text.

   ➡️ **Reward: +1 to final grade**

---

4. **Replacing Text Encoder with Transformer:**  
- Replace the text encoder with a Transformer model.
- Test the new architecture to ensure it performs text-to-image editing correctly by satisfying the expected output condition.

   ➡️ **Reward: Access to AI Lab**

---

Make sure to carefully follow the instructions provided in each cell to meet the requirements!


## **1. Required: Task Description**

Your task is to implement a custom neural network architecture along with its forward pass function.

This task is inspired by **text-to-image generation**, where a neural network maps a sequence of tokens representing textual information into a high-dimensional image. The text input is typically **tokenized** into a sequence of integers. This representation can be passed through an **encoder-decoder** architecture to generate images.

For this task, the text input is simplified to a random integer tensor of shape **(1, 10)**:
- The 1 indicates that there is a single input sample.
- 10 corresponds to the sequence length of the input text tokens.
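
To make the tokenization step concrete, here is a minimal sketch of how a sentence could be turned into a (1, 10) tensor of token ids. The vocabulary `toy_vocab`, the helper `toy_tokenize`, and the choice of padding with id 0 are illustrative assumptions; the exam itself simply uses random integers.

```python
import torch

# Hypothetical toy vocabulary (not part of the exam); id 0 doubles as
# the unknown-word id and the padding id to keep the sketch short.
toy_vocab = {"<unk>": 0, "a": 1, "red": 2, "car": 3, "on": 4, "the": 5, "road": 6}


def toy_tokenize(text, vocab, seq_len=10, pad_id=0):
    """Map whitespace-separated words to ids, then truncate/pad to seq_len."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    ids = ids[:seq_len] + [pad_id] * max(0, seq_len - len(ids))
    return torch.tensor([ids])  # leading dimension is the batch size of 1


toy_tokens = toy_tokenize("A red car on the road", toy_vocab)
print(toy_tokens.shape)
```

The resulting tensor has the same shape and integer dtype as `input_tokens` in Cell 0.3, so it could be fed to the same encoder as long as every id stays below `vocab_size`.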

Your implemented model will:
- Take this text token tensor as input.
- Encode it into a latent representation.
- Decode the latent representation to produce an output **image tensor of shape (1, 3, 256, 256)**, where:
    - 1 represents the batch size.
    - 3 indicates the RGB color channels of the image.
    - 256 × 256 corresponds to the height and width of the output image.

The primary objective is to correctly implement the neural network architecture and its forward pass to achieve the desired functionality.

To better view the architecture diagram:  
- **Right-click the image** and select **"Open image in a new tab"** to enable zoom for a clearer view.  
- Alternatively, you can **download the image** using the link below:  
  [Download Architecture Diagram](https://drive.google.com/file/d/1osVNVcsNGo-d9DCGVH1hJDw2nw4rMToR/view?usp=sharing)

---

### Diagram Preview:
![Architecture Diagram](https://drive.google.com/uc?export=view&id=1osVNVcsNGo-d9DCGVH1hJDw2nw4rMToR)


### Necessary Imports and Data Loading

In [None]:
# Cell 0.1
import torch
import torch.nn as nn
import numpy as np

In [None]:
# Cell 0.2 (GPU is not needed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

### Input

In [None]:
# Cell 0.3 -> INPUT (DO NOT EDIT THIS CELL!)
vocab_size = 10
input_tokens = torch.randint(0, vocab_size, (1, 10))
print(input_tokens.shape)

torch.Size([1, 10])


### Architecture

In [None]:
class TextEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(TextEncoder, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        # ADD YOUR CODE HERE
        # Token ids -> dense vectors: (B, 10) -> (B, 10, embedding_dim)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Two stacked LSTM layers: (B, 10, embedding_dim) -> (B, 10, 128)
        self.lstm = nn.LSTM(embedding_dim, 128, num_layers=2, batch_first=True)
        # Project every time step to 1024 = 32 * 32 features
        self.linear = nn.Linear(128, 1024)


    def forward(self, input_tokens):
        print("encoder started...")
        print("Text input:", input_tokens.shape)
        # ADD YOUR CODE HERE
        after_embedding = self.embedding(input_tokens)
        print("After embedding:", after_embedding.shape)

        after_lstm, _ = self.lstm(after_embedding)
        print("After LSTM:", after_lstm.shape)

        after_linear = self.linear(after_lstm)
        print("after linear: ", after_linear.shape)

        # Reshape to match expected output shape (1, 10, 32, 32)
        batch_size, seq_len, _ = after_linear.shape
        reshaped = after_linear.view(batch_size, seq_len, 32, 32)
        print("Reshaped:", reshaped.shape)

        return reshaped

In [None]:
class ImageDecoder(nn.Module):
    def __init__(self):
        super(ImageDecoder, self).__init__()

        #top layer
        self.T_convTrans2d = nn.ConvTranspose2d(10, 16, kernel_size=12, stride=4, padding=0)


        #Middle top
        self.MT_conv2d_1 = nn.Conv2d(10, 32, kernel_size=3, stride=1, padding=0)
        self.MT_maxPool_1 = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        #relu
        self.MT_batchNorm2d = nn.BatchNorm2d(32)
        self.MT_MaxPool_2 = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

        #middle bottom
        self.MB_conv2d_1 = nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=0)
        #relu
        self.MB_batchNorm2d = nn.BatchNorm2d(32)
        self.MB_conv2d_2 = nn.Conv2d(32, 32, kernel_size=8, stride=3, padding=0)

        #bottom
        self.B_conv2d_1 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=0)
        self.B_convTrans_1 = nn.ConvTranspose2d(64, 32, kernel_size=3, stride=1, padding=0)


        #final stretch
        self.final_convTrans_1 = nn.ConvTranspose2d(32, 16, kernel_size=24, stride=19, padding=1)  # 7x7 -> 136x136
        #concat
        self.final_convtrans2d_1 = nn.ConvTranspose2d(32, 8, kernel_size=3, stride=2, padding=1)  # 136x136 -> 271x271
        self.final_conv2d_2 = nn.Conv2d(8, 3, kernel_size=16, stride=1, padding=0)  # 271x271 -> 256x256


    def forward(self, text_embedding):
        print("decoder started")
        print("Text embedding:", text_embedding.shape)

        #top layer
        top_1 = self.T_convTrans2d(text_embedding)
        print("Top 1:", top_1.shape)

        #middle top layer
        middle_conv2d_1 = self.MT_conv2d_1(text_embedding)
        print("Middle conv2d 1:", middle_conv2d_1.shape)
        middle_maxPool_1 = self.MT_maxPool_1(middle_conv2d_1)
        print("Middle maxPool 1:", middle_maxPool_1.shape)
        middle_relu_1 = torch.relu(middle_maxPool_1)
        print("Middle relu 1:", middle_relu_1.shape)
        middle_batchNorm2d = self.MT_batchNorm2d(middle_relu_1)
        print("Middle batchNorm2d:", middle_batchNorm2d.shape)
        middle_maxPool_2 = self.MT_MaxPool_2(middle_batchNorm2d)
        print("Middle maxPool 2:", middle_maxPool_2.shape)


        #middle bottom
        middle_conv2d_2 = self.MB_conv2d_1(middle_conv2d_1)
        print("Middle conv2d 2:", middle_conv2d_2.shape)
        middle_relu_2 = torch.relu(middle_conv2d_2)
        print("Middle relu 2:", middle_relu_2.shape)
        middle_batchNorm2d_2 = self.MB_batchNorm2d(middle_relu_2)
        print("Middle batchNorm2d_2: ", middle_batchNorm2d_2.shape)
        #wait for add with bottom


        #bottom
        bottom_conv2d_1 = self.B_conv2d_1(middle_conv2d_2)
        print("Bottom conv2d 1:", bottom_conv2d_1.shape)
        bottom_convTrans_1 = self.B_convTrans_1(bottom_conv2d_1)
        print("Bottom convTrans 1:", bottom_convTrans_1.shape)

        #add bottom and middle bottom
        added_bottom_middle = bottom_convTrans_1 + middle_batchNorm2d_2
        print("Added bottom and middle:", added_bottom_middle.shape)

        #final middle bottom layer
        final_middleBottom = self.MB_conv2d_2(added_bottom_middle)
        print("Final middle bottom:", final_middleBottom.shape)

        #add with middle top
        added_BM_BT = final_middleBottom + middle_maxPool_2
        print("Added BM and BT:", added_BM_BT.shape)

        #final stretch
        final_stretch_1 = self.final_convTrans_1(added_BM_BT)
        print("Final stretch 1:", final_stretch_1.shape)

        #concat with top layer
        concat = torch.cat((top_1, final_stretch_1), dim=1)
        print("Concat:", concat.shape)


        #final stretch continues
        final_stretch_2 = self.final_convtrans2d_1(concat)
        print("Final stretch 2:", final_stretch_2.shape)

        output = self.final_conv2d_2(final_stretch_2)
        print("Output:", output.shape)




        return output
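
The layer hyperparameters in the decoder can be checked by hand with the standard output-size formulas for convolution, pooling, and transposed convolution (dilation 1, no output padding). The sketch below traces the spatial size along the top branch and the final stretch; the helper names `conv2d_out` and `conv_transpose2d_out` are illustrative and not part of the exam template.

```python
def conv2d_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of nn.Conv2d / nn.MaxPool2d with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_transpose2d_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of nn.ConvTranspose2d (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel_size


# Top branch: 32x32 text embedding -> transposed convolution.
top = conv_transpose2d_out(32, kernel_size=12, stride=4)

# Middle-top branch: 3x3 convolution followed by two 2x2 max-pools.
mid = conv2d_out(conv2d_out(conv2d_out(32, 3), 2, stride=2), 2, stride=2)

# Final stretch: upsample to the top-branch size, then refine to 256.
up = conv_transpose2d_out(mid, kernel_size=24, stride=19, padding=1)
out = conv2d_out(conv_transpose2d_out(up, 3, stride=2, padding=1), 16)

print(top, mid, up, out)  # 136 7 136 256
```

Because `top` and `up` agree, the two tensors can be concatenated along the channel dimension, and the last two layers bring the result to 256 × 256.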

#### Test your implementation

In [None]:
#DO NOT MODIFY THIS CELL

embedding_dim = 256
print("Encoder:")
text_encoder = TextEncoder(vocab_size, embedding_dim)
text_embedding = text_encoder(input_tokens)
print("Decoder:")
image_decoder = ImageDecoder()
image = image_decoder(text_embedding)

try:
    assert text_embedding.shape == (1, 10, 32, 32), "Encoded text shape is incorrect."
    assert image.shape == (1, 3, 256, 256), "Decoded image shape is incorrect."
    print("\n🎉 Congratulations! Your implementation is correct. You passed the minimum requirement! 🎉")
except AssertionError as e:
    print(f"\n❌ Error: {e}")

Encoder:
encoder started...
Text input: torch.Size([1, 10])
After embedding: torch.Size([1, 10, 256])
After LSTM: torch.Size([1, 10, 128])
after linear:  torch.Size([1, 10, 1024])
Reshaped: torch.Size([1, 10, 32, 32])
Decoder:
decoder started
Text embedding: torch.Size([1, 10, 32, 32])
Top 1: torch.Size([1, 16, 136, 136])
Middle conv2d 1: torch.Size([1, 32, 30, 30])
Middle maxPool 1: torch.Size([1, 32, 15, 15])
Middle relu 1: torch.Size([1, 32, 15, 15])
Middle batchNorm2d: torch.Size([1, 32, 15, 15])
Middle maxPool 2: torch.Size([1, 32, 7, 7])
Middle conv2d 2: torch.Size([1, 32, 28, 28])
Middle relu 2: torch.Size([1, 32, 28, 28])
Middle batchNorm2d_2:  torch.Size([1, 32, 28, 28])
Bottom conv2d 1: torch.Size([1, 64, 26, 26])
Bottom convTrans 1: torch.Size([1, 32, 28, 28])
Added bottom and middle: torch.Size([1, 32, 28, 28])
Final middle bottom: torch.Size([1, 32, 7, 7])
Added BM and BT: torch.Size([1, 32, 7, 7])
Final stretch 1: torch.Size([1, 16, 136, 136])
Concat: torch.Size([1, 32, 136, 136])
Final stretch 2: torch.Size([1, 8, 271, 271])
Output: torch.Size([1, 3, 256, 256])

🎉 Congratulations! Your implementation is correct. You passed the minimum requirement! 🎉

## **2. Optional: +1 to the Final Grade**
- Add another input tensor: a random tensor of size (1, 3, 256, 256).
- Implement an Image Encoder with a few layers to encode the tensor. The encoding process should follow the example image provided.
- Combine the encoded image embeddings with the text embeddings using cross-attention, following the example image provided.

Add only these new parts and reuse the ImageDecoder created earlier. The final output shape must remain the same as in the required task: (1, 3, 256, 256).

To better view the architecture diagram:  
- **Right-click the image** and select **"Open image in a new tab"** to enable zoom for a clearer view.  
- Alternatively, you can **download the image** using the link below:  
  [Download Architecture Diagram](https://drive.google.com/file/d/1nIgqhyPq0eKWEvT7leqoa0ZeBApCCs6u/view?usp=sharing)

---

### Diagram Preview:
![Architecture Diagram](https://drive.google.com/uc?export=view&id=1nIgqhyPq0eKWEvT7leqoa0ZeBApCCs6u)


#### New Input - create a random tensor of size (1,3,256,256)

In [None]:
# ADD YOUR CODE HERE

#### Image Encoder - create the image encoder, following the example provided.

In [None]:
class ImageEncoder(nn.Module):
    def __init__(self):
        super(ImageEncoder, self).__init__()
        # ADD YOUR CODE HERE

    def forward(self, new_image):
        print("New image:", new_image.shape)
        # ADD YOUR CODE HERE
        return out
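
As a rough illustration of what "a few layers" could look like, the sketch below downsamples a (1, 3, 256, 256) image to (1, 10, 32, 32) with three stride-2 convolutions. The class name `SketchImageEncoder`, the channel widths, and the layer count are assumptions made for this example; the architecture diagram remains the authoritative specification.

```python
import torch
import torch.nn as nn


class SketchImageEncoder(nn.Module):
    """Hypothetical encoder: (B, 3, 256, 256) -> (B, 10, 32, 32)."""

    def __init__(self, out_channels=10):
        super().__init__()
        # Each stride-2 convolution halves the resolution: 256 -> 128 -> 64 -> 32.
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, image):
        return self.net(image)


sketch_image = torch.randn(1, 3, 256, 256)
sketch_image_embedding = SketchImageEncoder()(sketch_image)
print(sketch_image_embedding.shape)
```

Matching the text embedding shape (10 channels of 32 × 32) keeps the two inputs directly compatible for the cross-attention step.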

#### Combine with Cross-Attention

In [None]:
class CrossAttention(nn.Module):
    def __init__(self, embed_dim=1024):
        super(CrossAttention, self).__init__()
        # ADD YOUR CODE HERE

    def forward(self, text_embedding, image_embedding):
        # ADD YOUR CODE HERE
        return out
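
One possible way to combine the two embeddings is sketched below using `nn.MultiheadAttention`: each 32 × 32 map is flattened into a 1024-dimensional token, the text tokens act as queries, and the image tokens act as keys and values. The class name `SketchCrossAttention`, the choice of four heads, and the query/key assignment are assumptions for illustration and may differ from the diagram.

```python
import torch
import torch.nn as nn


class SketchCrossAttention(nn.Module):
    """Hypothetical cross-attention: text queries attend to image keys/values."""

    def __init__(self, embed_dim=1024, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, text_embedding, image_embedding):
        b, c, h, w = text_embedding.shape
        # Flatten each h x w map into one token of size h * w (= embed_dim).
        query = text_embedding.flatten(start_dim=2)
        key_value = image_embedding.flatten(start_dim=2)
        attended, _ = self.attn(query, key_value, key_value)
        # Restore the spatial layout expected by the decoder.
        return attended.reshape(b, c, h, w)


sketch_text = torch.randn(1, 10, 32, 32)
sketch_img = torch.randn(1, 10, 32, 32)
sketch_combined = SketchCrossAttention()(sketch_text, sketch_img)
print(sketch_combined.shape)
```

The output keeps the (1, 10, 32, 32) layout, so it can be passed to the existing ImageDecoder without changes.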

#### Test your implementation

In [None]:
# DO NOT MODIFY THIS CELL
image_encoder = ImageEncoder()
image_embedding = image_encoder(new_image)

embed_dim = 1024
cross_attention = CrossAttention(embed_dim=embed_dim)
combine = cross_attention(text_embedding, image_embedding)

image = image_decoder(combine)

try:
    assert image_embedding.shape == (1, 10, 32, 32), "Encoded image shape is incorrect."
    assert combine.shape == (1, 10, 32, 32), "Combined cross attention shape is incorrect."
    assert image.shape == (1, 3, 256, 256), "Decoded image shape is incorrect."
    print("\n🎉 Congratulations! Your implementation is correct. You increased your final grade by 1! 🎉")
except AssertionError as e:
    print(f"\n❌ Error: {e}")

## **3. Optional: Access to AI Lab**
- Replace the text encoder with a Transformer model. This involves:
    - Setting the maximum sequence length to 16 (padding or truncating the tokenized text to that length).
    - Using BertTokenizer.
    - Adding Positional Encoding.
    - Defining a Transformer Encoder.


To better view the architecture diagram:  
- **Right-click the image** and select **"Open image in a new tab"** to enable zoom for a clearer view.  
- Alternatively, you can **download the image** using the link below:  
  [Download Architecture Diagram](https://drive.google.com/file/d/1bwrPryFGAAFF3OoJ3Z7UUzpVJaYvijUg/view?usp=sharing)

---

### Diagram Preview:
![Architecture Diagram](https://drive.google.com/uc?export=view&id=1bwrPryFGAAFF3OoJ3Z7UUzpVJaYvijUg)


In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_len=16):
        super(PositionalEncoding, self).__init__()
        # ADD YOUR CODE HERE

    def forward(self, x):
        # ADD YOUR CODE HERE

        return x
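
A common choice for this module is the fixed sinusoidal encoding, sketched below. The class name `SketchPositionalEncoding` is illustrative, and the sketch assumes an even `embed_dim` and inputs of shape (batch, sequence length, embed_dim); a learned positional embedding would be an equally valid alternative.

```python
import math

import torch
import torch.nn as nn


class SketchPositionalEncoding(nn.Module):
    """Hypothetical sinusoidal positional encoding for (B, L, embed_dim) inputs."""

    def __init__(self, embed_dim, max_len=16):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2) * (-math.log(10000.0) / embed_dim)
        )
        pe = torch.zeros(1, max_len, embed_dim)
        pe[0, :, 0::2] = torch.sin(position * div_term)  # even feature indices
        pe[0, :, 1::2] = torch.cos(position * div_term)  # odd feature indices
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe)

    def forward(self, x):
        # Add the encoding for the first x.size(1) positions.
        return x + self.pe[:, : x.size(1)]


sketch_pos = SketchPositionalEncoding(embed_dim=1024, max_len=16)
sketch_encoded = sketch_pos(torch.zeros(1, 16, 1024))
print(sketch_encoded.shape)
```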

In [None]:
from transformers import BertTokenizer

# Transformer Encoder
class TransformerTextEncoder(nn.Module):
    def __init__(self, embed_dim, max_len=16, num_heads=4, num_layers=2):
        super(TransformerTextEncoder, self).__init__()
        # ADD YOUR CODE HERE

    def forward(self, input_text):
        # ADD YOUR CODE HERE

        return out
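
The overall data flow of a Transformer-based text encoder can be sketched as follows. To keep the example self-contained, it takes already-tokenized ids instead of raw strings and uses a small hypothetical vocabulary of 1000 entries; in the exam the ids would come from `BertTokenizer`, which is omitted here so the sketch runs without downloading a vocabulary. The class name `SketchTransformerTextEncoder` and the learned position parameter are likewise assumptions.

```python
import torch
import torch.nn as nn


class SketchTransformerTextEncoder(nn.Module):
    """Hypothetical encoder: token ids (B, 16) -> embedding (B, 16, 32, 32)."""

    def __init__(self, vocab_size, embed_dim=1024, max_len=16, num_heads=4, num_layers=2):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        # Learned positions keep the sketch short; a sinusoidal encoding also works.
        self.position_embedding = nn.Parameter(torch.zeros(1, max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        x = self.token_embedding(token_ids) + self.position_embedding[:, :seq_len]
        x = self.encoder(x)  # (B, seq_len, embed_dim)
        # embed_dim = 1024 = 32 * 32, so each token becomes one 32x32 map.
        return x.reshape(x.size(0), seq_len, 32, 32)


sketch_ids = torch.randint(0, 1000, (1, 16))  # stand-in for tokenizer output
sketch_text_embedding = SketchTransformerTextEncoder(vocab_size=1000)(sketch_ids)
print(sketch_text_embedding.shape)
```

With a sequence length of 16, the embedding has 16 channels, so any module that consumes it downstream must accept 16 input channels rather than 10.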

In [None]:

#DO NOT MODIFY THIS CELL
new_image = torch.randn((1,3,256,256))
embedding_dim = 1024 # 32x32
print("Encoder:")
text_encoder = TransformerTextEncoder(embedding_dim)
input_text = ["Generate an image based on this text"]
text_embedding = text_encoder(input_text)

image_encoder = ImageEncoder()
image_embedding = image_encoder(new_image)

embed_dim = 1024
cross_attention = CrossAttention(embed_dim=embed_dim)
combine = cross_attention(text_embedding, image_embedding)

print("Decoder:")
image_decoder = ImageDecoder()
image = image_decoder(combine)

try:
    assert text_embedding.shape == (1, 16, 32, 32), "Encoded text shape is incorrect."
    assert image_embedding.shape == (1, 16, 32, 32), "Encoded image shape is incorrect."
    assert combine.shape == (1, 16, 32, 32), "Combined cross attention shape is incorrect."
    assert image.shape == (1, 3, 256, 256), "Decoded image shape is incorrect."
    print("\n🎉 Congratulations! Your implementation is correct. You passed the requirement for the AI Lab! 🎉")
except AssertionError as e:
    print(f"\n❌ Error: {e}")