
In [2]:
#Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Creating my project folder
!mkdir -p "/content/drive/MyDrive/Seq2SeqProject"
project_folder = "/content/drive/MyDrive/Seq2SeqProject"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
%cd /content/drive/MyDrive/Seq2SeqProject

/content/drive/MyDrive/Seq2SeqProject


In [7]:
# Problem 1: Execution of machine translation and code reading

#Running the Code
"""
Title: Character-level recurrent sequence-to-sequence model
Author: [fchollet](https://twitter.com/fchollet)
Date created: 2017/09/29
Last modified: 2023/11/22
Description: Character-level recurrent sequence-to-sequence model.
Accelerator: GPU
"""

"""
## Introduction

This example demonstrates how to implement a basic character-level
recurrent sequence-to-sequence model. We apply it to translating
short English sentences into short French sentences,
character-by-character. Note that it is fairly unusual to
do character-level machine translation, as word-level
models are more common in this domain.

**Summary of the algorithm**

- We start with input sequences from a domain (e.g. English sentences)
    and corresponding target sequences from another domain
    (e.g. French sentences).
- An encoder LSTM turns input sequences to 2 state vectors
    (we keep the last LSTM state and discard the outputs).
- A decoder LSTM is trained to turn the target sequences into
    the same sequence but offset by one timestep in the future,
    a training process called "teacher forcing" in this context.
    It uses as initial state the state vectors from the encoder.
    Effectively, the decoder learns to generate `targets[t+1...]`
    given `targets[...t]`, conditioned on the input sequence.
- In inference mode, when we want to decode unknown input sequences, we:
    - Encode the input sequence into state vectors
    - Start with a target sequence of size 1
        (just the start-of-sequence character)
    - Feed the state vectors and 1-char target sequence
        to the decoder to produce predictions for the next character
    - Sample the next character using these predictions
        (we simply use argmax).
    - Append the sampled character to the target sequence
    - Repeat until we generate the end-of-sequence character or we
        hit the character limit.
"""

"""
## Setup
"""

import numpy as np
import keras
import os
from pathlib import Path

"""
## Download the data
"""
from google.colab import files
uploaded = files.upload()

data_path = "fra.txt"

"""
## Configuration
"""

batch_size = 64  # Batch size for training.
epochs = 5  # Reduced from the original 100 epochs to shorten training time.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = "fra.txt"

"""
## Prepare the data
"""

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text, _ = line.split("\t")
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = "\t" + target_text + "\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype="float32",
)
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype="float32",
)
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype="float32",
)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0

"""
## Build the model
"""

# Define an input sequence and process it.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
encoder = keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Setting up the decoder, using `encoder_states` as initial state.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = keras.layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

"""
## Train the model
"""

model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
)
# Save model
model.save("s2s_model.keras")

"""
## Run inference (sampling)

1. encode input and retrieve initial decoder state
2. run one step of decoder with this initial state
and a "start of sequence" token as target.
Output will be the next target token.
3. Repeat with the current target token and current states
"""

# Define sampling models
# Restore the model and construct the encoder and decoder.
model = keras.models.load_model("s2s_model.keras")

encoder_inputs = model.input[0]  # input_1
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # lstm_1
encoder_states = [state_h_enc, state_c_enc]
encoder_model = keras.Model(encoder_inputs, encoder_states)

decoder_inputs = model.input[1]  # input_2
decoder_state_input_h = keras.Input(shape=(latent_dim,))
decoder_state_input_c = keras.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq, verbose=0)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value, verbose=0
        )

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    return decoded_sentence


"""
You can now generate decoded sentences as such:
"""

for seq_index in range(20):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index : seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print("-")
    print("Input sentence:", input_texts[seq_index])
    print("Decoded sentence:", decoded_sentence)

Saving fra.txt to fra.txt
Number of samples: 10000
Number of unique input tokens: 70
Number of unique output tokens: 91
Max sequence length for inputs: 14
Max sequence length for outputs: 59
Epoch 1/5
125/125 ━━━━━━━━━━━━━━━━━━━━ 53s 404ms/step - accuracy: 0.7064 - loss: 1.5486 - val_accuracy: 0.7166 - val_loss: 1.1665
Epoch 2/5
125/125 ━━━━━━━━━━━━━━━━━━━━ 52s 419ms/step - accuracy: 0.7463 - loss: 0.9671 - val_accuracy: 0.7214 - val_loss: 0.9667
Epoch 3/5
125/125 ━━━━━━━━━━━━━━━━━━━━ 79s 399ms/step - accuracy: 0.7631 - loss: 0.8646 - val_accuracy: 0.7492 - val_loss: 0.8769
Epoch 4/5
125/125 ━━━━━━━━━━━━━━━━━━━━ 80s 385ms/step - accuracy: 0.7841 - loss: 0.7773 - val_accuracy: 0.7718 - val_loss: 0.7831
Epoch 5/5
125/125 ━━━━━━━━━━━━━━━━━━━━ 49s 396ms/step - accuracy: 0.8041 - loss: 0.6852 - val_accuracy: 0.7897 - val_lo

**Summarizing What Each Part of This Sample Code Does**

This code builds and trains a character-level sequence-to-sequence (Seq2Seq) neural network that translates short English sentences into French, using Keras LSTM (Long Short-Term Memory) layers. In brief, each part of the script does the following:



* Lines 1–17: Metadata and introduction (the code author, creation date, and a high-level explanation of Seq2Seq translation)
* Lines 51–56: Importing libraries
* Lines 59–62: Downloading the data and defining the data path
* Lines 65–70: Configuring hyperparameters
* Lines 73–113: Data preparation and tokenization
* Lines 115–140: Performing one-hot encoding
* Lines 143–167: Building the model
* Lines 170–184: Compiling and training
* Lines 187–223: Inference model setup
* Lines 226–260: Sequence decoding function
* Lines 267–274: Generating sample translations
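
To make the one-hot encoding and the one-step offset between decoder input and decoder target more concrete, here is a minimal sketch on a single toy sentence. The sentence and the helper `to_text` are illustrative assumptions, not part of the original script, and the space padding used in the script is left out for brevity.

```python
import numpy as np

# Toy target sentence wrapped in the start ("\t") and end ("\n") markers.
target_text = "\tva!\n"
chars = sorted(set(target_text))
token_index = {ch: i for i, ch in enumerate(chars)}

seq_len = len(target_text)
decoder_input = np.zeros((seq_len, len(chars)), dtype="float32")
decoder_target = np.zeros((seq_len, len(chars)), dtype="float32")

for t, ch in enumerate(target_text):
    decoder_input[t, token_index[ch]] = 1.0
    if t > 0:
        # The target at step t-1 is the character fed in at step t.
        decoder_target[t - 1, token_index[ch]] = 1.0


def to_text(one_hot):
    # Hypothetical helper: rows that are all zero (no character) are skipped.
    return "".join(chars[row.argmax()] for row in one_hot if row.sum() > 0)


print(repr(to_text(decoder_input)))   # includes the start marker
print(repr(to_text(decoder_target)))  # same text, shifted left by one step
```

The target sequence is the input sequence without its start character, which is what teacher forcing needs: at each step the decoder sees the true previous character and is asked to predict the next one.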



In [10]:
# Problem 2: Execution of a trained model of image captioning
# I will use a Pre-trained PyTorch Model

import torch
import torch.nn as nn
import torchvision.transforms as transforms
from PIL import Image
import pickle
import os
import requests
import urllib.request
from io import BytesIO
import numpy as np
import argparse
import zipfile

# Installing required packages
!pip install torch torchvision pillow requests

# Uploading the already trained model files manually
from google.colab import files
uploaded = files.upload()

# Unzipping vocab.zip
with zipfile.ZipFile("vocap.zip", 'r') as zip_ref:
    zip_ref.extractall()

# Unzipping pretrained_model.zip
with zipfile.ZipFile("pretrained_model.zip", 'r') as zip_ref:
    zip_ref.extractall()

required_files = ['vocab.pkl', 'encoder-5-3000.pkl', 'decoder-5-3000.pkl']
for f in required_files:
    print(f"{f}: {'File Exists!' if os.path.exists(f) else 'File Missing'}")

# Defining the model architectures (from the original repository)
class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        """Load the pretrained ResNet-152 and replace top fc layer."""
        super(EncoderCNN, self).__init__()
        resnet = torch.hub.load('pytorch/vision:v0.10.0', 'resnet152', pretrained=True)
        modules = list(resnet.children())[:-1]      # deleting the last fc layer.
        self.resnet = nn.Sequential(*modules)
        self.linear = nn.Linear(resnet.fc.in_features, embed_size)
        self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)

    def forward(self, images):
        """Extract feature vectors from input images."""
        with torch.no_grad():
            features = self.resnet(images)
        features = features.reshape(features.size(0), -1)
        features = self.bn(self.linear(features))
        return features

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, max_seq_length=20):
        """Set the hyper-parameters and build the layers."""
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.max_seq_length = max_seq_length

    def forward(self, features, captions, lengths):
        """Decode image feature vectors and generates captions."""
        embeddings = self.embed(captions)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        packed = nn.utils.rnn.pack_padded_sequence(embeddings, lengths, batch_first=True)
        hiddens, _ = self.lstm(packed)
        outputs = self.linear(hiddens[0])
        return outputs

    def sample(self, features, states=None):
        """Generate captions for given image features using greedy search."""
        sampled_ids = []
        inputs = features.unsqueeze(1)
        for i in range(self.max_seq_length):
            hiddens, states = self.lstm(inputs, states)          # hiddens: (batch_size, 1, hidden_size)
            outputs = self.linear(hiddens.squeeze(1))            # outputs:  (batch_size, vocab_size)
            _, predicted = outputs.max(1)                        # predicted: (batch_size)
            sampled_ids.append(predicted)
            inputs = self.embed(predicted)                       # inputs: (batch_size, embed_size)
            inputs = inputs.unsqueeze(1)                         # inputs: (batch_size, 1, embed_size)
        sampled_ids = torch.stack(sampled_ids, 1)                # sampled_ids: (batch_size, max_seq_length)
        return sampled_ids

def load_image(image_path, transform=None):
    """Load and preprocess image"""
    if image_path.startswith('http'):
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_path).convert('RGB')

    if transform is not None:
        image = transform(image).unsqueeze(0)

    return image

# Minimal Vocabulary class definition to support unpickling
class Vocabulary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __call__(self, word):
        return self.word2idx.get(word, self.word2idx.get('<unk>', 0))

    def __len__(self):
        return len(self.word2idx)

    def items(self):
        return self.word2idx.items()

def load_models_and_vocab():
    """Load the pre-trained models and vocabulary"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Loading vocabulary
    with open('vocab.pkl', 'rb') as f:
        vocab = pickle.load(f)

    # Model parameters
    embed_size = 256
    hidden_size = 512
    num_layers = 1
    vocab_size = len(vocab)

    # Building models
    encoder = EncoderCNN(embed_size).eval()
    decoder = DecoderRNN(embed_size, hidden_size, vocab_size, num_layers)

    # Loading pre-trained weights
    encoder_path = 'encoder-5-3000.pkl'
    decoder_path = 'decoder-5-3000.pkl'

    # Loading encoder weights
    encoder.load_state_dict(torch.load(encoder_path, map_location=device))
    print(f"✓ Loaded encoder weights from {encoder_path}")

    # Loading decoder weights
    decoder.load_state_dict(torch.load(decoder_path, map_location=device))
    print(f"✓ Loaded decoder weights from {decoder_path}")

    # Moving models to device
    encoder = encoder.to(device)
    decoder = decoder.to(device)

    return encoder, decoder, vocab, device

def generate_caption(image_path, encoder, decoder, vocab, transform, device):
    """Generate caption for an image using the pre-trained model"""
    # Loading and preprocessing image
    image = load_image(image_path, transform)
    image_tensor = image.to(device)

    # Generating caption
    with torch.no_grad():
        feature = encoder(image_tensor)
        sampled_ids = decoder.sample(feature)
        sampled_ids = sampled_ids[0].cpu().numpy()

    # Converting word_ids to words
    vocab_inv = {v: k for k, v in vocab.items()}
    sampled_caption = []
    for word_id in sampled_ids:
        word = vocab_inv[word_id]
        sampled_caption.append(word)
        if word == '<end>':
            break

    sentence = ' '.join(sampled_caption)
    return sentence


def main():
    print("Image Captioning with Pre-trained PyTorch Model")

    # Loading models and vocabulary
    try:
        encoder, decoder, vocab, device = load_models_and_vocab()
        print(f"✓ Models loaded successfully on {device}")
        print(f"✓ Vocabulary size: {len(vocab)}")
    except Exception as e:
        print(f"Error loading models: {e}")
        return

    # Image preprocessing transform
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                           (0.229, 0.224, 0.225))
    ])

    # Testing with online images

    test_images = [
        'https://images.unsplash.com/photo-1552053831-71594a27632d?w=400',      # a dog with a flower
        'https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=400',   # cat
        'https://images.unsplash.com/photo-1507003211169-0a1dd7228f2d?w=400',   # A man
        'https://images.unsplash.com/photo-1549298916-b41d501d3772?w=400',      # A Nike shoe
        'https://images.unsplash.com/photo-1506905925346-21bda4d32df4?w=400'    # Mountains
    ]

    print("\nGenerating captions for test images:")


    for i, image_path in enumerate(test_images):
        try:
            print(f"\nProcessing Image {i+1}...")
            print(f"URL: {image_path}")

            caption = generate_caption(image_path, encoder, decoder, vocab, transform, device)
            print(f"Generated Caption: {caption}")

        except Exception as e:
            print(f"Error processing image {i+1}: {e}")


if __name__ == "__main__":
    main()

# Now I will test with my own uploaded images
def test_with_uploaded_images():

    #Loading images
    from google.colab import files
    uploaded = files.upload()

    # Loading models
    encoder, decoder, vocab, device = load_models_and_vocab()

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                           (0.229, 0.224, 0.225))
    ])

    # Replacing with my uploaded image paths
    uploaded_images = [
        '/content/drive/MyDrive/Seq2SeqProject/pexels-laker-6156582.jpg',              # A Bus
        '/content/drive/MyDrive/Seq2SeqProject/pexels-dariuskrs-2253938.jpg',          # Traffic Light
        '/content/drive/MyDrive/Seq2SeqProject/pexels-jamphotography-2626665.jpg',     # Motorcycle
    ]

    print("Testing with my own uploaded images:")

    for i, image_path in enumerate(uploaded_images):
        if os.path.exists(image_path):
            try:
                caption = generate_caption(image_path, encoder, decoder, vocab, transform, device)
                print(f"Image {i+1} ({image_path}): {caption}")
            except Exception as e:
                print(f"Error processing {image_path}: {e}")
        else:
            print(f"Image not found: {image_path}")


test_with_uploaded_images()



Saving pretrained_model.zip to pretrained_model.zip
Saving vocap.zip to vocap.zip
vocab.pkl: File Exists!
encoder-5-3000.pkl: File Exists!
decoder-5-3000.pkl: File Exists!
Image Captioning with Pre-trained PyTorch Model


Downloading: "https://github.com/pytorch/vision/zipball/v0.10.0" to /root/.cache/torch/hub/v0.10.0.zip
Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /root/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth
100%|██████████| 230M/230M [00:03<00:00, 69.7MB/s]


✓ Loaded encoder weights from encoder-5-3000.pkl
✓ Loaded decoder weights from decoder-5-3000.pkl
✓ Models loaded successfully on cpu
✓ Vocabulary size: 9956
\nGenerating captions for test images:
\nProcessing Image 1...
URL: https://images.unsplash.com/photo-1552053831-71594a27632d?w=400
Generated Caption: <start> a dog is sitting on a couch with a frisbee . <end>
\nProcessing Image 2...
URL: https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=400
Generated Caption: <start> a cat is sitting on a couch with a remote . <end>
\nProcessing Image 3...
URL: https://images.unsplash.com/photo-1507003211169-0a1dd7228f2d?w=400
Generated Caption: <start> a man with a beard and a tie in his mouth . <end>
\nProcessing Image 4...
URL: https://images.unsplash.com/photo-1549298916-b41d501d3772?w=400
Generated Caption: <start> a cat laying on a bed next to a keyboard . <end>
\nProcessing Image 5...
URL: https://images.unsplash.com/photo-1506905925346-21bda4d32df4?w=400
Generated Caption: <s

Saving pexels-dariuskrs-2253938.jpg to pexels-dariuskrs-2253938.jpg
Saving pexels-jamphotography-2626665.jpg to pexels-jamphotography-2626665.jpg
Saving pexels-laker-6156582.jpg to pexels-laker-6156582.jpg


Using cache found in /root/.cache/torch/hub/pytorch_vision_v0.10.0


✓ Loaded encoder weights from encoder-5-3000.pkl
✓ Loaded decoder weights from decoder-5-3000.pkl
Testing with my own uploaded images:
Image 1 (/content/drive/MyDrive/Seq2SeqProject/pexels-laker-6156582.jpg): <start> a red and white bus parked in a parking lot . <end>
Image 2 (/content/drive/MyDrive/Seq2SeqProject/pexels-dariuskrs-2253938.jpg): <start> a traffic light with a sign on top of it . <end>
Image 3 (/content/drive/MyDrive/Seq2SeqProject/pexels-jamphotography-2626665.jpg): <start> a motorcycle parked on a street next to a building . <end>
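
The generated captions above still carry the `<start>` and `<end>` marker tokens. A small post-processing step could strip them for display. The sketch below is one possible way to do it; `clean_caption` is a hypothetical helper and is not part of the original repository.

```python
def clean_caption(caption, start_token="<start>", end_token="<end>"):
    """Drop the marker tokens and tidy the spacing before punctuation."""
    words = []
    for word in caption.split():
        if word == start_token:
            continue
        if word == end_token:
            break  # ignore anything generated after the end marker
        words.append(word)
    text = " ".join(words)
    return text.replace(" .", ".").replace(" ,", ",")


raw = "<start> a red and white bus parked in a parking lot . <end>"
print(clean_caption(raw))
```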


In [11]:
# Problem 3: Complete Working Solution (with vocab loading)
"""
    Converting a PyTorch image captioning model to Keras involves several steps: translating the architecture, mapping the weights, and adapting the preprocessing.
    The conversion process is as follows:
    Step 1. Architecture Analysis
    The PyTorch model consists of:
        - Encoder: feature extraction CNN (ResNet)
        - Decoder: LSTM to generate sequences
        - Embedding: for vocabulary embedding
        - Linear Layers: for projecting output
"""

import torch
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding
from tensorflow.keras.models import Model
import numpy as np
import pickle

# Step 2. Implementing Keras Architecture
"""
      - In this step we replicate EncoderCNN and DecoderRNN in Keras.
"""

# First we will load the vocabulary to get vocab_size
with open('vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)
vocab_size = len(vocab)
print(f"Loaded vocabulary with {vocab_size} tokens")

# Second, we load the trained PyTorch weights (state dicts)
pt_encoder = torch.load('encoder-5-3000.pkl', map_location='cpu')
pt_decoder = torch.load('decoder-5-3000.pkl', map_location='cpu')

# Then we create the Keras equivalent of the encoder
"""We will use ResNet152."""

def create_keras_encoder():
    base_model = tf.keras.applications.ResNet152(
        include_top=False,
        weights='imagenet',
        input_shape=(224, 224, 3)
    )
    base_model.trainable = False

    inputs = Input(shape=(224, 224, 3))
    x = base_model(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = Dense(256, name='dense')(x)  # embed_size=256
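    # PyTorch's momentum (0.01) weights the new batch statistic, while Keras's momentum
    # weights the running average, so the equivalent Keras value is 1 - 0.01 = 0.99.
    # epsilon is set to PyTorch's default of 1e-5 (the Keras default is 1e-3).
    x = tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=1e-5, name='bn')(x)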

    return Model(inputs, x)

# Next we create the Keras equivalent of the decoder
""" We will use a Keras LSTM."""
def create_keras_decoder(vocab_size):
    # Inputs
    features = Input(shape=(256,))  # embed_size
    sequence = Input(shape=(None,))

    # Embedding
    embedding = Embedding(vocab_size, 256)(sequence)

    # LSTM (hidden_size=512)
    lstm = LSTM(512, return_sequences=True, name='lstm')
    lstm_out = lstm(embedding, initial_state=[
        Dense(512)(features),  # h_0
        Dense(512)(features)   # c_0
    ])

    # Output
    outputs = Dense(vocab_size, activation='softmax')(lstm_out)

    return Model([features, sequence], outputs)

# Step 3. Weight Conversion Process
"""
    - There are three main ways to do this:

      1. Weight transfer
      2. Using ONNX as an intermediate format
      3. Retraining from scratch in Keras

    - Because we want to reuse the weights already learned in PyTorch, we take the first approach.

    - ONNX offers a general-purpose conversion path, but manual weight transfer copies every parameter learned in PyTorch into Keras without numerical change.

    - This matters for sequence models, where small numerical errors can compound over timesteps and lead to entirely different outputs.

    - Our image captioning model also relies on two PyTorch patterns that ONNX may have trouble converting: (1) teacher forcing during training and (2) stateful sampling during inference.

    - Weight transfer therefore lets us replicate the LSTM state initialization, keep the exact vocabulary mapping in the embedding layer, and match the batch normalization settings used in PyTorch (momentum 0.01 in PyTorch's convention).

"""
def transfer_encoder_weights(pt_encoder, keras_encoder):
    # Dense layer
    keras_encoder.get_layer('dense').set_weights([
        pt_encoder['linear.weight'].numpy().T,  # Transpose for Keras
        pt_encoder['linear.bias'].numpy()
    ])

    # BatchNorm
    keras_encoder.get_layer('bn').set_weights([
        pt_encoder['bn.weight'].numpy(),
        pt_encoder['bn.bias'].numpy(),
        pt_encoder['bn.running_mean'].numpy(),
        pt_encoder['bn.running_var'].numpy()
    ])

def transfer_decoder_weights(pt_decoder, keras_decoder):
    """Fixed version handling automatic layer naming"""
    # 1. Finding the embedding layer by type (more robust than name)
    embedding_layer = None
    for layer in keras_decoder.layers:
        if isinstance(layer, Embedding):
            embedding_layer = layer
            break

    if embedding_layer is None:
        raise ValueError("No Embedding layer found in decoder")

    # 2. Transferring embedding weights
    embedding_layer.set_weights([pt_decoder['embed.weight'].numpy()])

    # 3. Transferring LSTM weights
    W_ih = pt_decoder['lstm.weight_ih_l0'].numpy()  # (4*hidden, embed)
    W_hh = pt_decoder['lstm.weight_hh_l0'].numpy()  # (4*hidden, hidden)
    # PyTorch keeps two bias vectors; Keras keeps one, so they are summed.
    bias = (pt_decoder['lstm.bias_ih_l0'] + pt_decoder['lstm.bias_hh_l0']).numpy()

    # Gate order: PyTorch packs the gates as (i, f, g, o) and the Keras LSTM
    # layer packs them as (i, f, c, o), where g and c are both the cell
    # candidate. The order is the same, so no reordering is needed; only the
    # (out, in) -> (in, out) transpose differs.
    lstm_layer = keras_decoder.get_layer('lstm')  # Using explicit name
    lstm_layer.set_weights([
        W_ih.T,  # kernel: (embed, 4*hidden)
        W_hh.T,  # recurrent_kernel: (hidden, 4*hidden)
        bias     # bias: (4*hidden,)
    ])

    # 4. Transferring output layer weights
    output_layer = keras_decoder.layers[-1]  # Last layer is Dense
    output_layer.set_weights([
        pt_decoder['linear.weight'].numpy().T,
        pt_decoder['linear.bias'].numpy()
    ])

# 5. Initializing the models and transferring the weights
keras_encoder = create_keras_encoder()
keras_decoder = create_keras_decoder(vocab_size)

transfer_encoder_weights(pt_encoder, keras_encoder)
transfer_decoder_weights(pt_decoder, keras_decoder)

# 6. Saving models
keras_encoder.save('keras_encoder.h5')
keras_decoder.save('keras_decoder.h5')

print("Model converted to Keras successfully!")

# Step 4. Inference Implementation
""" - In this step, we implement inference.
    - This shows how the trained model can generate captions for new images without changing any weights.
"""
def generate_caption(image_path, max_length=20):
    # Preprocessing image (Keras-style)
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(224, 224))
    img = tf.keras.preprocessing.image.img_to_array(img)
    img = tf.keras.applications.resnet.preprocess_input(img[np.newaxis, ...])

    # Getting features
    features = keras_encoder.predict(img)

    # Initialize sequence
    seq = [vocab['<start>']]

    # Generating caption
    for _ in range(max_length):
        # Preparing input
        seq_input = tf.keras.preprocessing.sequence.pad_sequences(
            [seq], maxlen=max_length, padding='post'
        )

        # Predicting next word
        preds = keras_decoder.predict([features, seq_input], verbose=0)
        next_id = np.argmax(preds[0, len(seq)-1])
        seq.append(next_id)

        if next_id == vocab['<end>']:
            break

    # Converting to text
    return ' '.join([vocab.idx2word[i] for i in seq[1:-1]])

Loaded vocabulary with 9956 tokens
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet152_weights_tf_dim_ordering_tf_kernels_notop.h5
234698864/234698864 ━━━━━━━━━━━━━━━━━━━━ 6s 0us/step




Model converted to Keras successfully!
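
The layout changes made during weight transfer (transposing `(out, in)` weight matrices and merging the two PyTorch LSTM bias vectors into one) can be sanity-checked without either framework. The sketch below uses small random NumPy arrays as stand-ins for the real tensors; the sizes and variable names are assumptions chosen for illustration. It compares only the pre-activation gate values, so it is independent of how the gates are ordered.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_size, hidden_size = 3, 2  # tiny stand-ins for 256 and 512

# PyTorch stores weights as (out_features, in_features) and keeps two biases.
W_ih = rng.normal(size=(4 * hidden_size, embed_size))
W_hh = rng.normal(size=(4 * hidden_size, hidden_size))
b_ih = rng.normal(size=4 * hidden_size)
b_hh = rng.normal(size=4 * hidden_size)

x = rng.normal(size=(1, embed_size))   # one input vector
h = rng.normal(size=(1, hidden_size))  # previous hidden state

# PyTorch-style pre-activation gate values.
gates_pt = x @ W_ih.T + b_ih + h @ W_hh.T + b_hh

# Keras-style layout: (in, out) kernels and a single summed bias.
kernel = W_ih.T
recurrent_kernel = W_hh.T
bias = b_ih + b_hh
gates_keras = x @ kernel + h @ recurrent_kernel + bias

print(kernel.shape, recurrent_kernel.shape, bias.shape)
print(np.allclose(gates_pt, gates_keras))
```

A check of the full converted model would go further and compare the PyTorch and Keras outputs on the same input, which this sketch does not attempt.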


In [15]:
# Problem 4: (Advanced assignment) Code reading and rewriting

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Concatenate, GlobalAveragePooling2D, Reshape
from tensorflow.keras.models import Model

def build_image_captioning_model(vocab_size, embed_size=256, hidden_size=512, max_length=20):
    """
    Keras model equivalent to model.py (PyTorch EncoderCNN + DecoderRNN).
    Uses machine translation seq2seq style as reference.
    """

    # Encoder: Image feature extractor
    image_input = Input(shape=(224, 224, 3), name='image_input')
    base_model = tf.keras.applications.ResNet152(include_top=False, weights='imagenet')
    base_model.trainable = False
    x = base_model(image_input)
    x = GlobalAveragePooling2D()(x)
    image_features = Dense(embed_size, activation='relu', name='image_features')(x)

    # Decoder: Caption sequence processor
    caption_input = Input(shape=(max_length,), name='caption_input')
    caption_embedding = Embedding(input_dim=vocab_size, output_dim=embed_size, mask_zero=True, name='word_embedding')(caption_input)

    # Reshaping image features so they behave like one token
    image_features_expanded = Reshape((1, embed_size))(image_features)

    # Combining image features with caption embeddings
    decoder_input = Concatenate(axis=1)([image_features_expanded, caption_embedding])

    lstm_out = LSTM(hidden_size, return_sequences=True, name='decoder_lstm')(decoder_input)
    output = Dense(vocab_size, activation='softmax', name='output_word')(lstm_out)

    model = Model(inputs=[image_input, caption_input], outputs=output, name="ImageCaptioningModel")
    return model

# Example usage
vocab_size = 10000  # hypothetical
max_length = 20
model = build_image_captioning_model(vocab_size, max_length=max_length)
model.summary()


**Problem 5: (Advanced assignment) Developmental survey**

**A. When translating into other languages**

To extend Problem 1, which translates between English and Japanese, to additional languages such as French, German, or Swahili, several aspects need careful planning.

Data preparation comes first: parallel corpora containing sentence pairs for the intended language pairs must be gathered. Examples include `Europarl` (European languages), the `UN Corpus` (multilingual), and `JW300` (low-resource languages). Once the data is collected, text normalization and tokenization should be applied in a way that suits each language so that the data is consistent.

Tokenizer and vocabulary design is another important step when handling multiple languages. A common approach is `subword tokenization`, using `Byte-Pair Encoding (BPE)` or `SentencePiece`, which copes with large vocabularies and rare words more effectively than word-level tokens. Depending on the complexity of the task, a separate tokenizer can be trained for each language pair, or a single shared tokenizer can be used in a multilingual model.
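To make the subword idea concrete, here is a minimal pure-Python sketch of how BPE merges are learned from word counts. The function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, and the toy corpus are illustrative assumptions; a real project would more likely train a tokenizer with an existing library such as SentencePiece.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair in a {symbols: count} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    # "</w>" is a hypothetical end-of-word marker used only in this sketch.
    words = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Toy corpus (hypothetical counts)
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

On this toy corpus the frequent suffix shared by "newest" and "widest" is merged into one symbol, which is how rare words end up represented by a few reusable pieces.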

Model adaptation follows. The encoder-decoder system can be extended to support several languages in two ways. The first is to build a single multilingual model that uses special language tokens such as `<2fr>` to indicate the target language. The second is to train a separate model for each language pair, which is simpler to train but more costly in total computation.
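A minimal sketch of the language-token approach, assuming the tag is simply prepended to the source sentence before tokenization. The helper name `tag_source` and the example pairs are hypothetical, and in a character-level model like the one in Problem 1 the tag would need to be treated as a single vocabulary symbol rather than as individual characters.

```python
def tag_source(sentence, target_lang):
    """Prepend a target-language token such as <2fr> to a source sentence."""
    return f"<2{target_lang}> {sentence.strip()}"

# Hypothetical mixed-language training pairs: (source, target, target language code)
pairs = [
    ("I am hungry.", "J'ai faim.", "fr"),
    ("I am hungry.", "Ich habe Hunger.", "de"),
    ("I am hungry.", "Nina njaa.", "sw"),
]

# One shared model sees the same English source with different tags
tagged = [(tag_source(src, lang), tgt) for src, tgt, lang in pairs]
for src, tgt in tagged:
    print(src, "->", tgt)
```

Because the only difference between the three inputs is the tag, the decoder has to rely on it to decide which language to produce.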

Transfer learning is also an option, using pretrained models such as `mBART`, `MarianMT`, or `mT5`, all of which offer pretrained checkpoints covering many languages. Fine-tuning these models on new language pairs makes adaptation efficient even in low-resource settings.

Lastly, evaluation must be considered. Translation quality can be measured automatically with `BLEU`, `METEOR`, and `chrF++`; in many cases, however, quality should also be judged by a person, because evaluating a translation requires understanding the context and taking linguistic nuances into account.
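As an illustration of what an automatic metric measures, the sketch below computes the clipped n-gram precision that BLEU is built on. It is only one component of BLEU (there is no brevity penalty and no averaging over n-gram orders), and the function names and example sentences are assumptions made for this sketch.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # A candidate n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = "the cat is on the mat".split()
candidate = "the the the cat mat".split()

p1 = clipped_precision(candidate, reference, 1)
p2 = clipped_precision(candidate, reference, 2)
print(p1, p2)
```

Clipping stops the candidate from being rewarded for repeating "the", and the low bigram precision shows that word order matters as well; neither number says anything about whether the meaning is preserved, which is why human judgment is still needed.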

**B. What are the advanced methods of machine translation**

Machine translation has changed tremendously since the earlier statistical approaches.

`Neural Machine Translation (NMT)` was a key breakthrough: it employs sequence-to-sequence models together with attention mechanisms, such as `Bahdanau` or `Luong attention`, which improve the handling of context.
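The following pure-Python sketch shows the core of attention, using the dot-product score (the "dot" variant of Luong attention; Bahdanau attention uses a small additive network for the score instead). The vectors are made-up toy values, and the function names are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoder step."""
    # Score each encoder state by its dot product with the decoder state
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder states
    dim = len(encoder_states[0])
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Toy values: three encoder time steps with 2-dimensional hidden states
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = dot_attention(decoder_state, encoder_states)
print(weights)
print(context)
```

Instead of relying only on the final encoder state, the decoder recomputes these weights at every output step, so each generated token can focus on a different part of the input.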

The next big innovation was `transformer-based` architectures, which popularized the self-attention mechanism. Implementations such as the `Google Transformer`, `OpenNMT`, and `Fairseq` achieve higher accuracy and support parallel training.

A more recent development is `pretrained multilingual models`. Models such as `mBART`, `MarianMT`, or `M2M-100` cover tens or even hundreds of languages, and jointly trained multilingual models can translate between language pairs they were never directly trained on, which is known as zero-shot translation.

Furthermore, `document-level translation` methods (`Context-Aware Translation`) have improved the use of context by taking into account whole paragraphs or documents rather than individual sentences, which helps with issues such as pronoun resolution and stylistic coherence.

`Transfer learning` for `low-resource language translation` transfers knowledge from high-resource languages to improve performance on low-resource ones. In addition, `unsupervised NMT` methods have been shown to work with monolingual data alone, without any parallel data.

**C. How to generate an image from text**

Whereas image captioning translates a visual input into text, text-to-image generation goes the other way. Historically, this was done either with retrieval-based systems, which returned existing images resembling the text description, or with early `Generative Adversarial Networks (GANs)`, which generated simple images conditioned on text embeddings.

Newer methods have produced much better results. GAN-based models such as `StackGAN` and `AttnGAN` refine images over multiple stages, with `AttnGAN` adding attention mechanisms so that the images show finer detail and match the textual description more closely. The strongest techniques, however, now rely on diffusion models, such as `DALL·E 2`, `Stable Diffusion`, and `Midjourney`. These models learn to progressively denoise random noise into a coherent image conditioned on a text prompt, and can therefore produce high-resolution, photorealistic, and highly creative images.
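To give a feel for the "noise to image" idea, here is a minimal sketch of the forward (noising) half of a DDPM-style diffusion process on a tiny list of numbers standing in for pixels. During training, a network is shown `x_t` and asked to predict the noise `eps`, and generation then runs this process in reverse. The linear schedule values, the function names, and the toy data are assumptions for illustration; this is not how any particular product is implemented, and there is no text conditioning here.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

alpha_bars = alpha_bar_schedule(1000)
rng = random.Random(0)
x0 = [0.5, -0.5, 1.0]  # toy "image" with three pixel values

x_early, eps_early = add_noise(x0, alpha_bars[0], rng)   # almost the original
x_late, eps_late = add_noise(x0, alpha_bars[-1], rng)    # almost pure noise
print(x_early)
print(x_late)
```

At the first step the sample is still close to `x0`, while at the last step it is essentially Gaussian noise, which is the starting point the model learns to denoise from.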

Text-to-image generation depends on natural language understanding, provided either by large language model embeddings or by vision-language models such as `CLIP`, to map the meaning of the text into the visual domain. Applications include concept art, product design, educational visualizations, and marketing materials. Nevertheless, obstacles remain: possible biases in the generated material, difficulty controlling fine-grained aspects of an image such as style and structure, and the high computational cost of training and inference.