
### Machine translation execution and code reading
The following sample code trains a character-level model that translates short English sentences into French. Run it to see the results.

In [None]:
"""
Title: Character-level recurrent sequence-to-sequence model
Author: [fchollet](https://twitter.com/fchollet)
Date created: 2017/09/29
Last modified: 2023/11/22
Description: Character-level recurrent sequence-to-sequence model.
Accelerator: GPU
"""

import numpy as np
import keras
import os
from pathlib import Path

# Download the data
data_path = keras.utils.get_file(
    fname="fra-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/fra-eng.zip",
    extract=True,
    cache_dir='.'
)
dirpath = Path(data_path).parent.absolute()
data_file_path = os.path.join(dirpath, 'fra-eng', 'fra.txt')

# Configuration
batch_size = 64
epochs = 100
latent_dim = 256
num_samples = 10000

# Prepare the data
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

with open(data_file_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")

# Bonus: Print first line for verification
print("Sample line from data file:", lines[0])

for line in lines[: min(num_samples, len(lines) - 1)]:
    parts = line.split("\t")
    if len(parts) >= 2:
        input_text, target_text = parts[0], parts[1]
        # Use "\t" as start sequence character for target, "\n" as end sequence character
        target_text = "\t" + target_text + "\n"
        input_texts.append(input_text)
        target_texts.append(target_text)
        for char in input_text:
            if char not in input_characters:
                input_characters.add(char)
        for char in target_text:
            if char not in target_characters:
                target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)

if not input_texts:
    print("Error: No valid input sentences found in the data file.")
else:
    max_encoder_seq_length = max([len(txt) for txt in input_texts])
    max_decoder_seq_length = max([len(txt) for txt in target_texts])

    print("Number of samples:", len(input_texts))
    print("Number of unique input tokens:", num_encoder_tokens)
    print("Number of unique output tokens:", num_decoder_tokens)
    print("Max sequence length for inputs:", max_encoder_seq_length)
    print("Max sequence length for outputs:", max_decoder_seq_length)

    input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
    target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

    encoder_input_data = np.zeros(
        (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
        dtype="float32",
    )
    decoder_input_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
        dtype="float32",
    )
    decoder_target_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
        dtype="float32",
    )

    for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
        for t, char in enumerate(input_text):
            encoder_input_data[i, t, input_token_index[char]] = 1.0
        encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
        for t, char in enumerate(target_text):
            decoder_input_data[i, t, target_token_index[char]] = 1.0
            if t > 0:
                decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
        decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
        decoder_target_data[i, t:, target_token_index[" "]] = 1.0

    # Build the model
    encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
    encoder = keras.layers.LSTM(latent_dim, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    encoder_states = [state_h, state_c]

    decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
    decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = keras.layers.Dense(num_decoder_tokens, activation="softmax")
    decoder_outputs = decoder_dense(decoder_outputs)

    model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

    # Train the model
    model.compile(
        optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
    )
    model.fit(
        [encoder_input_data, decoder_input_data],
        decoder_target_data,
        batch_size=batch_size,
        epochs=epochs,
        validation_split=0.2,
    )

    # Save model
    model.save("s2s_model.keras")

    # Run inference
    model = keras.models.load_model("s2s_model.keras")

    encoder_inputs = model.input[0]  # input_1
    encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # lstm_1
    encoder_states = [state_h_enc, state_c_enc]
    encoder_model = keras.Model(encoder_inputs, encoder_states)

    decoder_inputs = model.input[1]  # input_2
    decoder_state_input_h = keras.Input(shape=(latent_dim,))
    decoder_state_input_c = keras.Input(shape=(latent_dim,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_lstm = model.layers[3]
    decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
        decoder_inputs, initial_state=decoder_states_inputs
    )
    decoder_states = [state_h_dec, state_c_dec]
    decoder_dense = model.layers[4]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = keras.Model(
        [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
    )

    reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
    reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

    def decode_sequence(input_seq):
        states_value = encoder_model.predict(input_seq, verbose=0)
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, target_token_index["\t"]] = 1.0

        stop_condition = False
        decoded_sentence = ""
        while not stop_condition:
            output_tokens, h, c = decoder_model.predict(
                [target_seq] + states_value, verbose=0
            )
            sampled_token_index = np.argmax(output_tokens[0, -1, :])
            sampled_char = reverse_target_char_index[sampled_token_index]
            decoded_sentence += sampled_char

            if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
                stop_condition = True

            target_seq = np.zeros((1, 1, num_decoder_tokens))
            target_seq[0, 0, sampled_token_index] = 1.0

            states_value = [h, c]
        return decoded_sentence

    for seq_index in range(20):
        input_seq = encoder_input_data[seq_index : seq_index + 1]
        decoded_sentence = decode_sequence(input_seq)
        print("-")
        print("Input sentence:", input_texts[seq_index])
        print("Decoded sentence:", decoded_sentence)


Sample line from data file: Go.	Va !
Number of samples: 10000
Number of unique input tokens: 70
Number of unique output tokens: 93
Max sequence length for inputs: 16
Max sequence length for outputs: 59
Epoch 1/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 4s 17ms/step - accuracy: 0.6928 - loss: 1.6183 - val_accuracy: 0.6968 - val_loss: 1.1544
Epoch 2/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 4s 13ms/step - accuracy: 0.7341 - loss: 1.0054 - val_accuracy: 0.7086 - val_loss: 1.0375
Epoch 3/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step - accuracy: 0.7536 - loss: 0.8940 - val_accuracy: 0.7332 - val_loss: 0.9398
Epoch 4/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.7750 - loss: 0.8022 - val_accuracy: 0.7494 - val_loss: 0.8768
Epoch 5/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.7870 - loss: 0.7492 - val_accuracy: 0.77

### **Code Explanation by Section**

**Lines 51–55: Importing libraries**

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
```

These lines import NumPy for array operations and the Keras classes (`Model`, `Input`, `LSTM`, `Dense`) used to build the sequence-to-sequence model. The notebook cell above gets the same classes through `import keras`.

---

**Lines 57–62: Hyperparameter settings**

```python
batch_size = 64
epochs = 100
latent_dim = 256
num_samples = 10000
data_path = keras.utils.get_file(
    "fra-eng/fra.txt", "http://storage.googleapis.com/download.tensorflow.org/data/fra-eng.zip", extract=True
)
```

This block sets the hyperparameters and downloads the data:

* `batch_size`: Number of samples per gradient update
* `epochs`: Number of full passes over the training data
* `latent_dim`: Dimensionality of the LSTM hidden and cell states
* `num_samples`: Number of sentence pairs to read from the file
* `data_path`: Local path returned by `get_file`, which downloads and extracts the English-French dataset

---

**Lines 64–95: Reading and preprocessing the data**

```python
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
# Read and clean data...
```

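* Reads the tab-separated file and builds parallel lists of input (English) and target (French) sentences.
* Wraps each target sentence in `"\t"` (start-of-sequence) and `"\n"` (end-of-sequence) markers.
* Collects the set of distinct characters on each side, which becomes the vocabulary.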

---

**Lines 97–110: Sort characters and assign indices**

```python
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])
```

* Assigns a unique integer ID to each character in both languages.
* The ID is the position that is set to 1 in that character's one-hot vector.

---

**Lines 112–125: Vectorize input and output**

```python
encoder_input_data = np.zeros(...)
decoder_input_data = np.zeros(...)
decoder_target_data = np.zeros(...)
# Populate the arrays with one-hot encodings
```

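* Converts text into one-hot encoded arrays of shape `(num_samples, max_sequence_length, num_tokens)`.
* `decoder_target_data` is `decoder_input_data` shifted one timestep ahead, so at each step the decoder is trained to predict the next character (teacher forcing).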
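
The vectorization step can be sketched on a toy corpus. This is a minimal illustration, not the notebook's exact code: the two sentence pairs are made up, and padding positions are left as zeros rather than filled with the space character.

```python
import numpy as np

# Toy parallel data (hypothetical; the notebook reads fra.txt instead).
input_texts = ["Go.", "Hi."]
target_texts = ["\tVa !\n", "\tSalut !\n"]  # "\t" = start marker, "\n" = end marker

input_chars = sorted(set("".join(input_texts)))
target_chars = sorted(set("".join(target_texts)))
input_index = {ch: i for i, ch in enumerate(input_chars)}
target_index = {ch: i for i, ch in enumerate(target_chars)}

max_enc_len = max(len(t) for t in input_texts)
max_dec_len = max(len(t) for t in target_texts)

encoder_input = np.zeros((len(input_texts), max_enc_len, len(input_chars)), dtype="float32")
decoder_input = np.zeros((len(input_texts), max_dec_len, len(target_chars)), dtype="float32")
decoder_target = np.zeros_like(decoder_input)

for i, (src, tgt) in enumerate(zip(input_texts, target_texts)):
    for t, ch in enumerate(src):
        encoder_input[i, t, input_index[ch]] = 1.0
    for t, ch in enumerate(tgt):
        decoder_input[i, t, target_index[ch]] = 1.0
        if t > 0:
            # The target is the decoder input shifted one step ahead (teacher forcing).
            decoder_target[i, t - 1, target_index[ch]] = 1.0

print(encoder_input.shape, decoder_input.shape)
```

Each row of `decoder_target` holds the character that follows the corresponding row of `decoder_input`, which is what lets the decoder learn next-character prediction.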

---

**Lines 127–137: Build the encoder model**

```python
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]
```

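* The encoder reads the whole input sequence. With `return_state=True` the LSTM also returns its final hidden and cell states (`state_h`, `state_c`); the per-step output is discarded and only these states are handed to the decoder.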

---

**Lines 139–151: Build the decoder model**

```python
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)
```

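* The decoder LSTM starts from the encoder's final states (`initial_state=encoder_states`) and, with `return_sequences=True`, emits an output at every timestep.
* A dense softmax layer turns each output into a probability distribution over the target characters.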

---

**Lines 153–155: Define and compile the model**

```python
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
```

* Wraps the encoder and decoder into a single model and compiles it with categorical cross-entropy loss and an accuracy metric.

---

**Lines 157–159: Train the model**

```python
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=batch_size, epochs=epochs, validation_split=0.2)
```

* Trains the model on the one-hot arrays, holding out 20% of the samples for validation (`validation_split=0.2`).

---

**Lines 161–174: Define inference models (for translation)**

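* Builds separate encoder and decoder models for inference (translating new sentences), reusing the trained layers.
* This is necessary because the target sentence is unknown at inference time: the decoder has to be run one character at a time, feeding its own prediction and LSTM states back in at each step.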

---

**Lines 176–196: Decode sequence function**

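* Encodes the input sentence into state vectors, starts the decoder with the `"\t"` start character, then repeatedly picks the most probable next character (argmax) until it produces `"\n"` or exceeds the maximum output length.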
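
The control flow of the decoding loop can be shown without a trained model. In this sketch `toy_decoder_step` and its probability table are hypothetical stand-ins for `decoder_model.predict`; only the loop structure mirrors `decode_sequence`.

```python
import numpy as np

# Hypothetical 4-token vocabulary; "\t" starts and "\n" ends a sequence.
vocab = ["\t", "a", "b", "\n"]
token_index = {ch: i for i, ch in enumerate(vocab)}

# Stand-in for the trained decoder: a fixed table of next-token probabilities.
# Row = current token, column = next token.
transition = np.array([
    [0.0, 0.7, 0.2, 0.1],  # after "\t" the most likely token is "a"
    [0.0, 0.1, 0.8, 0.1],  # after "a" the most likely token is "b"
    [0.0, 0.1, 0.1, 0.8],  # after "b" the most likely token is "\n"
    [0.0, 0.0, 0.0, 1.0],  # after "\n" stay at the end marker
])


def toy_decoder_step(token_id, state):
    """Return next-token probabilities and the (unchanged) state."""
    return transition[token_id], state


def greedy_decode(initial_state, max_len=10):
    token_id = token_index["\t"]  # start-of-sequence character
    state = initial_state
    decoded = ""
    while True:
        probs, state = toy_decoder_step(token_id, state)
        token_id = int(np.argmax(probs))  # greedy choice
        decoded += vocab[token_id]
        # Stop at the end marker or when the output grows too long.
        if vocab[token_id] == "\n" or len(decoded) > max_len:
            break
    return decoded


print(repr(greedy_decode(initial_state=None)))
```

In the real function the state returned by each step changes and carries the sentence context; here it is passed through untouched to keep the example small.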

---

**Lines 198–202: Display sample translations**

* Decodes the first 20 training sentences and prints each input next to the model's output.

---

## Character-Level Tokenization with `CountVectorizer`

When using `sklearn.feature_extraction.text.CountVectorizer`, you can specify how text is tokenized using the `analyzer` argument.

### 🔸 `analyzer='char'`

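* **Character n-grams** taken over the whole string, so they can span the space between words.
* E.g., with `ngram_range=(3, 3)`, `"This movie"` → tokens like `'thi'`, `'his'`, `'is '`, `'s m'`, `' mo'`, etc. Text is lowercased by default, and the default `ngram_range=(1, 1)` yields single characters.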

### 🔸 `analyzer='char_wb'`

* **Character n-grams** only *within* word boundaries; each word is padded with a space on both sides.
* Avoids capturing patterns like `'s m'` (from `This movie`) which cross word boundaries.

Use case:

* **char**: Useful for language modeling and machine translation
* **char\_wb**: Better for morphology-sensitive tasks (e.g., spelling)

**Docs:**
[CountVectorizer – scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
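
The difference between the two analyzers can be illustrated in plain Python. The functions below are a simplified re-implementation written for this note, not scikit-learn's code; they assume lowercasing (the `CountVectorizer` default) and a single n-gram size, equivalent to `ngram_range=(3, 3)`.

```python
import re


def char_ngrams(text, n):
    """Character n-grams over the whole lowercased string, spaces included."""
    text = re.sub(r"\s\s+", " ", text.lower())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_wb_ngrams(text, n):
    """Character n-grams inside each word, padded with one space per side."""
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


print(char_ngrams("This movie", 3))
print(char_wb_ngrams("This movie", 3))
```

The first list contains `'s m'`, which straddles the two words; the second does not, and instead contains padded edge tokens such as `' th'` and `'ie '`.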

---

### Running a pre-trained model for image captioning

In [1]:
# set up the environment i.e. install dependencies
!pip install torch torchvision pillow numpy matplotlib nltk

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

#### Download NLTK Word Tokenizer (required for text preprocessing):

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
!git clone https://github.com/yunjey/pytorch-tutorial.git
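# Note: `!cd` runs in a subshell, so the notebook's working directory stays /content.
# Later cells use absolute paths or os.chdir() instead.
!cd pytorch-tutorial/tutorials/03-advanced/image_captioning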

Cloning into 'pytorch-tutorial'...
remote: Enumerating objects: 917, done.
remote: Total 917 (delta 0), reused 0 (delta 0), pack-reused 917 (from 1)
Receiving objects: 100% (917/917), 12.80 MiB | 40.46 MiB/s, done.
Resolving deltas: 100% (491/491), done.


In [6]:
# Downloading the pre-trained weights from Google Drive with gdown:
# 1. Upload the file to Google Drive and share it with "anyone with the link".
# 2. Copy the file ID from the shareable link.
# 3. Download it in a Colab cell with `!gdown <your_file_id>`, replacing
#    <your_file_id> with the actual file ID.
# This is often faster than direct downloads.
!gdown 1Wmq6aKkItmTufvachL9mFeMCT-3-g2qH
!gdown 1iegY6ZVt1dm8cYeHu7CA2QYupJY6kDiC

Downloading...
From (original): https://drive.google.com/uc?id=1Wmq6aKkItmTufvachL9mFeMCT-3-g2qH
From (redirected): https://drive.google.com/uc?id=1Wmq6aKkItmTufvachL9mFeMCT-3-g2qH&confirm=t&uuid=13a69996-ac62-4719-9526-5d759c6f3397
To: /content/encoder-5-3000.pkl
100% 235M/235M [00:02<00:00, 81.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1iegY6ZVt1dm8cYeHu7CA2QYupJY6kDiC
To: /content/decoder-5-3000.pkl
100% 36.9M/36.9M [00:00<00:00, 80.7MB/s]


In [None]:
#!cp -r /content/pytorch-tutorial/data .
#!cp -r /content/pytorch-tutorial/models .

In [8]:
import torch
import os

# Construct the full paths to the model files
encoder_path = '/content/encoder-5-3000.pkl'
decoder_path = '/content/decoder-5-3000.pkl'

# Load the checkpoints. Each file holds a state_dict (parameter name -> tensor),
# not a full model object.
encoder = torch.load(encoder_path, map_location='cpu')  # or 'cuda'
decoder = torch.load(decoder_path, map_location='cpu')

print("Encoder keys:", encoder.keys())  # parameter names of the encoder
print("Decoder keys:", decoder.keys())

Encoder keys: odict_keys(['resnet.0.weight', 'resnet.1.weight', 'resnet.1.bias', 'resnet.1.running_mean', 'resnet.1.running_var', 'resnet.4.0.conv1.weight', 'resnet.4.0.bn1.weight', 'resnet.4.0.bn1.bias', 'resnet.4.0.bn1.running_mean', 'resnet.4.0.bn1.running_var', 'resnet.4.0.conv2.weight', 'resnet.4.0.bn2.weight', 'resnet.4.0.bn2.bias', 'resnet.4.0.bn2.running_mean', 'resnet.4.0.bn2.running_var', 'resnet.4.0.conv3.weight', 'resnet.4.0.bn3.weight', 'resnet.4.0.bn3.bias', 'resnet.4.0.bn3.running_mean', 'resnet.4.0.bn3.running_var', 'resnet.4.0.downsample.0.weight', 'resnet.4.0.downsample.1.weight', 'resnet.4.0.downsample.1.bias', 'resnet.4.0.downsample.1.running_mean', 'resnet.4.0.downsample.1.running_var', 'resnet.4.1.conv1.weight', 'resnet.4.1.bn1.weight', 'resnet.4.1.bn1.bias', 'resnet.4.1.bn1.running_mean', 'resnet.4.1.bn1.running_var', 'resnet.4.1.conv2.weight', 'resnet.4.1.bn2.weight', 'resnet.4.1.bn2.bias', 'resnet.4.1.bn2.running_mean', 'resnet.4.1.bn2.running_var', 'resnet.4.1

In [10]:
import pickle
import os
import sys

# Path to the pytorch-tutorial directory
pytorch_tutorial_path = '/content/pytorch-tutorial/tutorials/03-advanced/image_captioning'

# Add the pytorch-tutorial directory to the system path to import necessary modules
sys.path.insert(0, pytorch_tutorial_path)

# Change the current working directory to the pytorch-tutorial directory
os.chdir(pytorch_tutorial_path)

# pickle needs the Vocabulary class to be importable when it unpickles vocab.pkl.
# It is assumed to live in build_vocab.py; data_loader is tried as a fallback.
try:
    from build_vocab import Vocabulary
except ImportError:
    try:
        from data_loader import Vocabulary
    except ImportError as err:
        # Stop here: loading the vocabulary below would fail without the class.
        raise ImportError(
            "Could not find the 'Vocabulary' class definition. Please check the file name."
        ) from err


# Path to the vocabulary file
vocab_path = os.path.join('/content/pytorch-tutorial/data', 'vocab.pkl')

# Load the vocabulary wrapper
# Add error handling for file not found just in case
try:
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)

    print(f"Vocabulary size: {len(vocab)}")
    # You can inspect some words in the vocabulary if needed
    # print(vocab.idx2word[:10])

except FileNotFoundError:
    print(f"Error: Vocabulary file not found at {vocab_path}")
except Exception as e:
    print(f"An error occurred while loading the vocabulary: {e}")

# It's generally good practice to change back to the original directory if needed
# os.chdir('/content') # Uncomment if you need to change back

Vocabulary size: 9956


In [14]:
!cp /content/decoder-5-3000.pkl /content/encoder-5-3000.pkl /content/pytorch-tutorial/tutorials/03-advanced/image_captioning/models/

In [21]:
# flower image - my_image.jpg
!python sample.py --image my_image.jpg > sample_predictions.txt



### Investigate what to do if you want to run it with Keras
The captioning model above was run in PyTorch. Investigate what steps would be needed to run it in Keras, and in particular how to make the weights trained in PyTorch usable in Keras.

Running **image captioning** in **Keras** instead of PyTorch involves several challenges, especially when it comes to using **pre-trained PyTorch weights in Keras**. PyTorch and Keras (TensorFlow backend) use different model definitions, serialization formats, and internal layer representations.

---
### Summary of Steps

#### 1. **Find or Build a Keras Image Captioning Model**

Since Keras doesn’t have an official end-to-end image captioning implementation with pre-trained weights like Yunjey’s PyTorch version, you have two choices:

* **Option A: Build from scratch in Keras**
* **Option B: Convert PyTorch model + weights to Keras** (complex and error-prone)

---

### Option A: Build Image Captioning Model in Keras

#### **High-level architecture**:

1. **Encoder**: Pre-trained CNN (e.g., InceptionV3 or ResNet50)
2. **Decoder**: RNN (typically LSTM) with an attention mechanism
3. **Output**: Word-by-word caption generation

#### Resources:

* **Keras example** (trained from scratch, without pre-trained decoder weights; it uses a CNN encoder with a Transformer decoder rather than an LSTM):
  [Image Captioning (Keras)](https://keras.io/examples/vision/image_captioning/)

#### Key Steps:

1. Use **InceptionV3 / ResNet50** as CNN encoder
2. Extract features from the image using the CNN
3. Use an **LSTM decoder** with attention to generate the caption
4. Train or fine-tune on a dataset like COCO or Flickr8k

---

### Option B: Convert PyTorch Weights to Keras

This requires deep familiarity with both frameworks.

### Conversion is difficult because:

* PyTorch and Keras store weights **differently**
* Layer names and architectures **don’t match 1:1**
* No official tool exists for **automatic weight translation**

###  Possible Workarounds:

1. **Manually port weights** layer-by-layer:

   * Export PyTorch weights (`.pt` or `.pkl`)
   * Convert them to NumPy
   * Load them into matching Keras layers using `set_weights()`
   * Painstaking and error-prone

2. **Use ONNX as an intermediate**:

   * Convert PyTorch → ONNX

   * Try ONNX → TensorFlow (via `onnx-tf`)

   * Then load model into Keras

   > But ONNX → Keras conversion is very fragile for custom models.
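
For the manual route (workaround 1), the main pitfall is tensor layout: PyTorch and Keras store the same parameters with different axis orders. The sketch below uses random NumPy arrays as stand-ins for weights; in practice each array would come from a `state_dict` entry converted with `.cpu().numpy()`, and the shapes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# nn.Linear stores weight as (out_features, in_features);
# a Keras Dense kernel is (in_features, out_features), so transpose it.
pt_linear_weight = rng.normal(size=(4, 3)).astype("float32")
pt_linear_bias = rng.normal(size=(4,)).astype("float32")
keras_dense_weights = [pt_linear_weight.T, pt_linear_bias]  # for dense.set_weights(...)

# nn.Conv2d stores weight as (out_ch, in_ch, kH, kW);
# a Keras Conv2D kernel is (kH, kW, in_ch, out_ch).
pt_conv_weight = rng.normal(size=(8, 3, 5, 5)).astype("float32")
keras_conv_kernel = np.transpose(pt_conv_weight, (2, 3, 1, 0))

# Sanity check for the dense layer: both layouts give the same output.
x = rng.normal(size=(2, 3)).astype("float32")
pt_out = x @ pt_linear_weight.T + pt_linear_bias                 # nn.Linear
keras_out = x @ keras_dense_weights[0] + keras_dense_weights[1]  # Dense
print(np.allclose(pt_out, keras_out), keras_conv_kernel.shape)
```

A similar check, comparing the outputs of the two frameworks on the same input after each ported layer, is the most reliable way to catch a wrong transpose early.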

### Example conversion path:

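```python
import torch
import tensorflow as tf

# 1. Convert PyTorch to ONNX (`model` and `dummy_input` are placeholders for
#    your nn.Module and an example input tensor)
torch.onnx.export(model, dummy_input, "model.onnx")

# 2. Convert ONNX to TensorFlow (shell command; `!` runs it from a notebook cell)
!onnx-tf convert -i model.onnx -o tf_model

# 3. Attempt to load the result. The export is a TensorFlow SavedModel rather than
#    a native Keras model, and Keras 3's load_model() only reads .keras and .h5 files.
tf_model = tf.saved_model.load("tf_model")
```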

This might work for **simple feed-forward models**, but often fails with **custom models like encoder-decoder architectures with attention**.

---

### Recommended Path

If aiming to work in **Keras**, the **best approach** is:

#### Use this official example from Keras:

 [Image Captioning (Keras)](https://keras.io/examples/vision/image_captioning/)

This implementation:

* Uses TensorFlow/Keras
* Extracts image features with a pre-trained CNN (EfficientNetB0)
* Generates the caption with a Transformer encoder and decoder rather than an LSTM
* Can be fine-tuned or extended

### If we want to reuse PyTorch-trained captions or vocabulary:

* Export the vocabulary (word2idx) as a JSON or pickle
* Load it into your Keras model
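
Exporting the vocabulary is the one part of the transfer that is framework-independent. A minimal sketch, assuming the mapping is a plain `word2idx` dictionary; the six entries and the file name `vocab.json` are made up for illustration.

```python
import json

# Hypothetical vocabulary; in the notebook this would come from vocab.word2idx.
word2idx = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3, "a": 4, "flower": 5}

# Export on the PyTorch side.
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(word2idx, f, ensure_ascii=False)

# Import on the Keras side and rebuild the reverse mapping for decoding.
with open("vocab.json", encoding="utf-8") as f:
    loaded_word2idx = json.load(f)
idx2word = {idx: word for word, idx in loaded_word2idx.items()}

print(len(loaded_word2idx), idx2word[5])
```

JSON avoids the need to import the original `Vocabulary` class, which unpickling `vocab.pkl` requires.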

---

### Summary Table

| Task            | PyTorch             | Keras                        | Notes                                                |
| --------------- | ------------------- | ---------------------------- | ---------------------------------------------------- |
| Encoder CNN     | Pre-trained ResNet  | Pre-trained InceptionV3      | Replaceable                                          |
| Decoder RNN     | Custom LSTM         | LSTM w/ attention            | Must build anew in Keras                             |
| Weight transfer | Pickle/pt           | HDF5                         | Manual conversion or retraining required             |
| Model loading   | `torch.load`        | `tf.keras.models.load_model` | Incompatible formats                                 |
| Best option     | Use PyTorch version | Use Keras example            | Start fresh in Keras if you want to stay in TF/Keras |

---


### (Advanced assignment) Code reading and rewriting
The model is defined in `model.py`. Work out how to write this model in Keras and write the actual code. The machine translation sample code above will be a helpful reference.

We’ll implement:

1. **EncoderCNN** using a pre-trained **ResNet152**
2. **DecoderRNN** using **Embedding + LSTM**
3. Greedy caption sampling

---

#### Architecture Notes

| PyTorch                             | Keras                                                 |
| ----------------------------------- | ----------------------------------------------------- |
| `models.resnet152(pretrained=True)` | `tf.keras.applications.ResNet152(weights='imagenet')` |
| `nn.Linear(...)`                    | `Dense(...)`                                          |
| `nn.LSTM(...)`                      | `tf.keras.layers.LSTM(...)`                           |
| `nn.Embedding(vocab_size, embed_size)` | `tf.keras.layers.Embedding(...)`                   |
| `sample()`                          | Greedy decoding in Keras with `predict()` and loops   |

---

In [None]:
### Keras Implementation of `model.py`
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet152
from tensorflow.keras.applications.resnet import preprocess_input


class EncoderCNN(tf.keras.Model):
    def __init__(self, embed_size):
        super(EncoderCNN, self).__init__()
        # Load ResNet152 without the top layer
        base_model = ResNet152(include_top=False, weights='imagenet', pooling='avg')
        base_model.trainable = False  # Freeze base model
        self.resnet = base_model
        self.fc = layers.Dense(embed_size)
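        # Keras momentum is the weight on the old moving average, so PyTorch's
        # momentum=0.01 corresponds to 0.99 here.
        self.bn = layers.BatchNormalization(momentum=0.99)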

    def call(self, images):
        x = preprocess_input(images)  # ResNet preprocessing
        features = self.resnet(x)
        features = self.fc(features)
        features = self.bn(features)
        return features  # shape: (batch_size, embed_size)


class DecoderRNN(tf.keras.Model):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1, max_seq_length=20):
        super(DecoderRNN, self).__init__()
        self.embed = layers.Embedding(vocab_size, embed_size)
        self.lstm = layers.LSTM(hidden_size, return_sequences=True, return_state=True)
        self.dense = layers.Dense(vocab_size)
        self.max_seq_length = max_seq_length

    def call(self, features, captions, lengths):
        # Remove the last token (e.g. <end>) during training
        captions_input = captions[:, :-1]
        embeddings = self.embed(captions_input)

        # Prepend image features as the first "word"
        features = tf.expand_dims(features, 1)
        inputs = tf.concat([features, embeddings], axis=1)

        # Pass through LSTM
        outputs, _, _ = self.lstm(inputs)
        logits = self.dense(outputs)
        return logits  # shape: (batch_size, caption_len, vocab_size)

    def sample(self, features, start_token, end_token):
        # Greedy decoding loop. The image feature vector is the first LSTM input,
        # so start_token is accepted for API symmetry but never fed in.
        caption = []

        state_h, state_c = None, None
        inputs = tf.expand_dims(features, 1)  # shape: (1, 1, embed_size)

        for _ in range(self.max_seq_length):
            if state_h is None:
                output, state_h, state_c = self.lstm(inputs)
            else:
                output, state_h, state_c = self.lstm(inputs, initial_state=[state_h, state_c])

            logits = self.dense(output)  # (1, 1, vocab_size)
            predicted_id = tf.argmax(logits[0, 0]).numpy()
            caption.append(predicted_id)

            if predicted_id == end_token:
                break

            inputs = tf.expand_dims(self.embed([predicted_id]), 1)  # shape: (1, 1, embed_size)

        return caption

## Example Usage

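Here is how the two models fit together at inference time. The snippet assumes that `word2idx` and `idx2word` mappings have already been loaded (for example from the vocabulary file). The decoder weights are freshly initialized here, so the printed caption only becomes meaningful after training or porting weights.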

In [None]:
# Define hyperparameters
EMBED_SIZE = 256
HIDDEN_SIZE = 512
VOCAB_SIZE = len(word2idx)
MAX_SEQ_LEN = 20

# Instantiate models
encoder = EncoderCNN(embed_size=EMBED_SIZE)
decoder = DecoderRNN(embed_size=EMBED_SIZE, hidden_size=HIDDEN_SIZE, vocab_size=VOCAB_SIZE, max_seq_length=MAX_SEQ_LEN)

# Load and preprocess image
img = tf.keras.preprocessing.image.load_img("my_image.jpg", target_size=(224, 224))
img = tf.keras.preprocessing.image.img_to_array(img)
img = tf.expand_dims(img, 0)  # batch dimension

# Extract features
features = encoder(img)

# Generate caption
start_token = word2idx['<start>']
end_token = word2idx['<end>']
generated_ids = decoder.sample(features, start_token, end_token)

# Convert tokens to words
caption = [idx2word[i] for i in generated_ids]
print(' '.join(caption))


### PART 1: Machine Translation – Translating Between Japanese and English

#### General Steps

When building a **machine translation system** (e.g., Japanese ⇄ English), here’s what you need:

1. **Data**

   * Parallel corpora: sentence pairs in Japanese and English (e.g., [JParaCrawl](https://opus.nlpl.eu/JParaCrawl.php), [Tatoeba](https://tatoeba.org/))
   * Preprocessing: tokenization, subword units (e.g., SentencePiece or Byte-Pair Encoding)

2. **Model Choices**

   * **Seq2Seq with Attention**
   * **Transformer models** (e.g., T5, MarianMT, mBART, mT5)

3. **Training**

   * Loss: Cross-entropy loss over vocabulary
   * Metrics: BLEU, METEOR, or COMET scores

4. **Inference**

   * Greedy decoding / Beam Search
   * Token to string conversion
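
The difference between greedy decoding and beam search in step 4 can be seen on a toy model. The probability table `NEXT` below is hypothetical and stands in for a trained decoder; the search itself is a simplified beam search that removes finished hypotheses from the beam.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.4, "a": 0.3, "b": 0.3},
    "b": {"</s>": 0.9, "a": 0.05, "b": 0.05},
}


def beam_search(beam_width, max_len=5):
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:  # keep only the best hypotheses
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[0])
    return best[1], math.exp(best[0])


print(beam_search(beam_width=1))  # greedy decoding is beam search with width 1
print(beam_search(beam_width=2))
```

With width 1 the search commits to `a` because it is the most likely first token, and ends with a sequence of probability 0.24. With width 2 it keeps `b` alive and finds the sequence of probability 0.36, which greedy decoding misses.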

---

### Advanced Methods of Machine Translation

#### 1. **Attention Mechanism**

* Introduced in [Bahdanau et al., 2014](https://arxiv.org/abs/1409.0473)
* Learns where to focus in the source sentence when generating each word in the target sentence.
* Led to significantly improved performance over vanilla Seq2Seq models.

#### 2. **Transformer Model (Vaswani et al., 2017)**

* Uses **multi-head self-attention** instead of RNNs.
* Faster training (parallelizable), better long-range dependency modeling.
* Foundation of modern language models (BERT, GPT, T5) and of current translation models such as MarianMT and mBART.
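
The core operation of the Transformer, scaled dot-product attention, fits in a few lines of NumPy. This is a single-head sketch on random data: in a real model the queries, keys, and values are learned linear projections of the token embeddings, and several heads run in parallel.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens with 8-dimensional embeddings
out, weights = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape, weights.sum(axis=-1))
```

Because every token attends to every other token in one matrix product, the whole sequence is processed in parallel, which is the source of the training speed-up over RNNs.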

#### 3. **Pre-trained Multilingual Models**

* **MarianMT**: A family of Transformer translation models, mostly one model per language pair, covering many pairs.
* **mBART**: Pre-trained on denoising multilingual text; fine-tuned for translation tasks.
* **mT5**: A multilingual variant of the T5 model by Google.
* The multilingual models (mBART, mT5) share parameters across languages, which helps when fine-tuning on language pairs with little parallel data.

---

### Evolutionary Approaches (Beyond Transformers)

Some **experimental or hybrid techniques** being researched:

* **Neuroevolution of Augmenting Topologies (NEAT)** for evolving encoder-decoder architectures
* **Reinforcement learning-based translation** where BLEU or human ratings are the reward
* **Meta-learning** for low-resource language pairs
* **Multimodal translation** (text + image or speech → text translation)

---

### PART 2: Generating Images from Text (Opposite of Image Captioning)

This falls under **text-to-image synthesis**, and has advanced rapidly in recent years.

### Key Technologies

| Model                                      | Description                                                                                  |
| ------------------------------------------ | -------------------------------------------------------------------------------------------- |
| **GANs (Generative Adversarial Networks)** | Early models like StackGAN, AttnGAN used GANs with text embeddings.                          |
| **VQ-VAE**                                 | Vector-quantized autoencoders used in early DALL·E.                                          |
| **CLIP + Diffusion**                       | Used by modern SOTA (e.g., DALL·E 2, Stable Diffusion). CLIP connects text and image spaces. |
| **Transformer-based Models**               | DALL·E and CogView use autoregressive transformers for image generation.                     |

---

### Modern Text-to-Image Models

| Model                | Publisher    | Highlights                                                                         |
| -------------------- | ------------ | ---------------------------------------------------------------------------------- |
| **DALL·E 2 / 3**     | OpenAI       | Uses CLIP + diffusion model. High-quality and coherent images.                     |
| **Stable Diffusion** | Stability AI | Open-source, supports custom models and fine-tuning.                               |
| **Midjourney**       | Independent  | Artistic, stylized images. Prompt-based.                                           |
| **Imagen**           | Google       | State-of-the-art text-to-image with unprecedented realism (not publicly released). |
| **DeepFloyd IF**     | DeepFloyd    | Modular diffusion-based image generation model.                                    |

---

### Workflow Overview

1. **Input**: Text prompt (e.g., “A robot reading a book under a cherry blossom tree”)
2. **Text encoder**: Convert prompt into embeddings (CLIP, T5, etc.)
3. **Image generator**: Use diffusion or autoregressive model to generate an image
4. **Upsampling**: Increase resolution using super-resolution models

---

### Practical Toolkits

* `diffusers` library by HuggingFace (for Stable Diffusion, DeepFloyd IF)
* OpenAI’s DALL·E API
* RunwayML (no-code tools)

---

### Connecting the Dots

| Captioning        | Translation         | Text-to-Image           |
| ----------------- | ------------------- | ----------------------- |
| Image → Text      | Text → Text         | Text → Image            |
| CNN + RNN         | Transformer         | Transformer + Diffusion |
| e.g., Show & Tell | e.g., MarianMT, mT5 | e.g., DALL·E, SD        |

---
