
In [None]:
"""
Title: Character-level recurrent sequence-to-sequence model
Author: [fchollet](https://twitter.com/fchollet)
Date created: 2017/09/29
Last modified: 2023/11/22
Description: Character-level recurrent sequence-to-sequence model.
Accelerator: GPU
"""

"""
## Introduction

This example demonstrates how to implement a basic character-level
recurrent sequence-to-sequence model. We apply it to translating
short English sentences into short French sentences,
character-by-character. Note that it is fairly unusual to
do character-level machine translation, as word-level
models are more common in this domain.

**Summary of the algorithm**

- We start with input sequences from a domain (e.g. English sentences)
    and corresponding target sequences from another domain
    (e.g. French sentences).
- An encoder LSTM turns input sequences into two state vectors
    (we keep the last LSTM state and discard the outputs).
- A decoder LSTM is trained to turn the target sequences into
    the same sequence but offset by one timestep in the future,
    a training process called "teacher forcing" in this context.
    It uses as initial state the state vectors from the encoder.
    Effectively, the decoder learns to generate `targets[t+1...]`
    given `targets[...t]`, conditioned on the input sequence.
- In inference mode, when we want to decode unknown input sequences, we:
    - Encode the input sequence into state vectors
    - Start with a target sequence of size 1
        (just the start-of-sequence character)
    - Feed the state vectors and 1-char target sequence
        to the decoder to produce predictions for the next character
    - Sample the next character using these predictions
        (we simply use argmax).
    - Append the sampled character to the target sequence
    - Repeat until we generate the end-of-sequence character or we
        hit the character limit.
"""

"""
## Setup
"""

import numpy as np
import keras
import os
from pathlib import Path

"""
## Download the data
"""

#fpath = keras.utils.get_file(origin="http://www.manythings.org/anki/fra-eng.zip")
#dirpath = Path(fpath).parent.absolute()
#os.system(f"unzip -q {fpath} -d {dirpath}")

data_path = 'fra.txt' # Assumes it's in the main /content/ directory
print(f"Using data path: {data_path}")

"""
## Configuration
"""

batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
#data_path = os.path.join(dirpath, "fra.txt")

"""
## Prepare the data
"""

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text, _ = line.split("\t")
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = "\t" + target_text + "\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Sets ignore duplicates, so no membership check is needed.
    input_characters.update(input_text)
    target_characters.update(target_text)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype="float32",
)
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype="float32",
)
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype="float32",
)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0

"""
## Build the model
"""

# Define an input sequence and process it.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
encoder = keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = keras.layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

"""
## Train the model
"""

model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
)
# Save model
model.save("s2s_model.keras")

"""
## Run inference (sampling)

1. Encode the input and retrieve the initial decoder state.
2. Run one step of the decoder with this initial state
    and a "start of sequence" token as target.
    The output will be the next target token.
3. Repeat with the current target token and current states.
"""

# Define sampling models
# Restore the model and construct the encoder and decoder.
model = keras.models.load_model("s2s_model.keras")

encoder_inputs = model.input[0]  # input_1
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # lstm_1
encoder_states = [state_h_enc, state_c_enc]
encoder_model = keras.Model(encoder_inputs, encoder_states)

decoder_inputs = model.input[1]  # input_2
decoder_state_input_h = keras.Input(shape=(latent_dim,))
decoder_state_input_c = keras.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq, verbose=0)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value, verbose=0
        )

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    return decoded_sentence


"""
You can now generate decoded sentences as such:
"""

for seq_index in range(20):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index : seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print("-")
    print("Input sentence:", input_texts[seq_index])
    print("Decoded sentence:", decoded_sentence)


Using data path: fra.txt
Number of samples: 10000
Number of unique input tokens: 70
Number of unique output tokens: 91
Max sequence length for inputs: 14
Max sequence length for outputs: 59
Epoch 1/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 4s 17ms/step - accuracy: 0.7076 - loss: 1.5533 - val_accuracy: 0.7137 - val_loss: 1.0884
Epoch 2/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step - accuracy: 0.7466 - loss: 0.9565 - val_accuracy: 0.7154 - val_loss: 1.0930
Epoch 3/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.7624 - loss: 0.8698 - val_accuracy: 0.7512 - val_loss: 0.8688
Epoch 4/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.7880 - loss: 0.7583 - val_accuracy: 0.7801 - val_loss: 0.7726
Epoch 5/100
125/125 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step - accuracy: 0.8040 - loss: 0.6894 - val_accuracy: 0.7868 - val_los

In [None]:
!git clone https://github.com/yunjey/pytorch-tutorial.git


Cloning into 'pytorch-tutorial'...
remote: Enumerating objects: 917, done.
remote: Total 917 (delta 0), reused 0 (delta 0), pack-reused 917 (from 1)
Receiving objects: 100% (917/917), 12.80 MiB | 16.87 MiB/s, done.
Resolving deltas: 100% (491/491), done.


In [None]:
%cd pytorch-tutorial/tutorials/03-advanced/image_captioning/


/content/pytorch-tutorial/tutorials/03-advanced/image_captioning


In [None]:
!pip install -r requirements.txt


Collecting argparse (from -r requirements.txt (line 5))
  Downloading argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Downloading argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


In [None]:
!pip install pycocotools



In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
!mkdir -p models
!wget https://www.dropbox.com/s/ne0ixz5d58ccbbz/pretrained_model.zip?dl=1 -O pretrained_model.zip
!unzip -q pretrained_model.zip -d models/
!rm pretrained_model.zip

--2025-04-24 14:55:19--  https://www.dropbox.com/s/ne0ixz5d58ccbbz/pretrained_model.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.81.18, 2620:100:6031:18::a27d:5112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.81.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.dropbox.com/scl/fi/5pbpnmdqarpl3im03e6sk/pretrained_model.zip?rlkey=t60qk1iyys5fejbbwgvx5p5hq&dl=1 [following]
--2025-04-24 14:55:19--  https://www.dropbox.com/scl/fi/5pbpnmdqarpl3im03e6sk/pretrained_model.zip?rlkey=t60qk1iyys5fejbbwgvx5p5hq&dl=1
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucf072f3cf11c73e21d8c30d0cf2.dl.dropboxusercontent.com/cd/0/inline/CoYXZENEe1t4BIvfaqNEOtW8CqNRG5ftxO16d58VBAjhEZ6j_41fDeHpXDXeUDIuoJvGyAiB2ZnloY77UagTClKBOgZvBbSELe2vzPq_p-jYPGUCYYnIrwtph71b1gADt5gbB6roEkgUafJ8ZRE9C99v/file?dl=1# [following]
--2025-04-24 14:55:20--  https://ucf072f3cf11c73e21d8c

In [None]:
!mkdir -p data
!wget https://www.dropbox.com/s/26adb7y9m98uisa/vocap.zip?dl=1 -O vocap.zip
!unzip -q vocap.zip -d data/
!rm vocap.zip

--2025-04-24 14:56:44--  https://www.dropbox.com/s/26adb7y9m98uisa/vocap.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.81.18, 2620:100:6031:18::a27d:5112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.81.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.dropbox.com/scl/fi/r7g8pbh36tmcpbyk0gabm/vocap.zip?rlkey=xl8bmroltgedbq7m7glk3i57z&dl=1 [following]
--2025-04-24 14:56:44--  https://www.dropbox.com/scl/fi/r7g8pbh36tmcpbyk0gabm/vocap.zip?rlkey=xl8bmroltgedbq7m7glk3i57z&dl=1
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc6e9f03ac8375872b515918ccbf.dl.dropboxusercontent.com/cd/0/inline/CobXLLGkopy26b6_QRV1NwzKMiNd1zMSj-pXD1thLyoxzpVW_IAfJ1Jbc8PToC8cFJlyWvUbbkv8HDDdadCBZ7yZCOMBbFar53rN-hjJ9wwGo5EEqJzilFLDv0Od6MdW_Qo2Snr1gv78iwt_XcXx1qpO/file?dl=1# [following]
--2025-04-24 14:56:45--  https://uc6e9f03ac8375872b515918ccbf.dl.dropboxusercontent.com

#### Execute Test Model: Generate sentences for images

In [None]:

from google.colab import files
print("Please choose the image file you want to caption:")
uploaded = files.upload()

for fn in uploaded.keys():
  print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')

image_filename = list(uploaded.keys())[0]
print(f"\nWill generate caption for: {image_filename}")

Please choose the image file you want to caption:


Saving giraffe.jpeg to giraffe.jpeg
User uploaded file "giraffe.jpeg" with length 5982 bytes

Will generate caption for: giraffe.jpeg


In [None]:
!python /content/pytorch-tutorial/tutorials/03-advanced/image_captioning/sample.py \
  --image={image_filename} \
  --encoder_path='./models/encoder-5-3000.pkl' \
  --decoder_path='./models/decoder-5-3000.pkl' \
  --vocab_path='./data/vocab.pkl'

<start> a giraffe standing in a field next to a tree . <end>


1. Replicate the Model Architecture in Keras

Build an equivalent model structure using Keras layers.

Encoder (CNN):

Use a Keras implementation of the same base CNN (e.g., tf.keras.applications.ResNet152).

Load pre-trained ImageNet weights (weights='imagenet').

Ensure the final classification layer is removed (include_top=False).

Add appropriate pooling (pooling='avg') or flattening layers to get a feature vector.

Add a Keras Dense layer (and potentially BatchNormalization if mimicking closely) to map the CNN features to the required embed_size.

Decoder (RNN):

Use Keras layers: keras.layers.Embedding, keras.layers.LSTM (or GRU), keras.layers.Dense.

Crucially: Ensure the dimensions (e.g., vocab_size, embed_size, hidden_size for LSTM units) exactly match the PyTorch model's parameters.

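Replicate the mechanism for injecting the image features. The Keras example code in this notebook passes the image features through Dense layers to generate the initial state for the LSTM decoder. This is only one possible mechanism: if the PyTorch decoder instead feeds the features to the LSTM as its first input step, those extra Dense layers have no pre-trained counterpart, and the Keras decoder must be changed to match before any weights can be transferred.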

2. Handle the Pre-trained PyTorch Weights

a) Manual Weight Mapping (Fundamental but Tedious)

Load PyTorch Weights: Use PyTorch to load the .pkl files and access the model's state_dict(), which is a dictionary mapping layer names to weight/bias tensors.

Build Keras Model Instance: Create an instance of your Keras model defined in Step 1.

Iterate and Match: Go through each layer in your Keras model (e.g., the Dense layer in the Encoder, the Embedding, LSTM, and Dense layers in the Decoder).

Extract & Convert: For each Keras layer, find the corresponding weight tensor(s) in the PyTorch state_dict. Convert the PyTorch tensor(s) to NumPy arrays (e.g., using .cpu().numpy()).

Handle Shape Differences: This is vital, because Keras and PyTorch store weights differently. The kernel of a Keras Dense layer has shape (input_dim, units), which is the transpose of a PyTorch Linear layer's weight matrix of shape (out_features, in_features), so you need .T on the NumPy array. Convolutional layers also use different dimension orders (channels_first vs. channels_last). Carefully inspect each NumPy array and reshape or transpose it to match the Keras layer's expectations.

Load into Keras: Use the Keras layer's set_weights([numpy_array_1, numpy_array_2, ...]) method to load the correctly shaped NumPy arrays. (Layers typically expect a list, e.g., [kernel_weights, bias_weights]).

This approach gives complete control, and it works as long as the architecture and weight shapes match exactly.
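
The shape handling in the manual mapping steps can be sketched with NumPy alone. The arrays below are random stand-ins for tensors you would pull from a PyTorch `state_dict()`; the helper names `linear_to_dense` and `lstm_to_keras` are hypothetical. The sketch assumes the usual layouts: `torch.nn.Linear` stores its weight as `(out_features, in_features)` while a Keras `Dense` kernel is `(in_features, units)`, and `torch.nn.LSTM` keeps two bias vectors per layer where a Keras `LSTM` keeps one.

```python
import numpy as np


def linear_to_dense(weight, bias):
    """Turn Linear-style arrays into a [kernel, bias] list for Dense.set_weights."""
    # Linear weight is (out_features, in_features); Dense kernel is the transpose.
    return [weight.T.copy(), bias.copy()]


def lstm_to_keras(weight_ih, weight_hh, bias_ih, bias_hh):
    """Turn single-layer LSTM arrays into [kernel, recurrent_kernel, bias]."""
    # PyTorch: weight_ih is (4*hidden, input), weight_hh is (4*hidden, hidden).
    # Keras:   kernel is (input, 4*units), recurrent_kernel is (units, 4*units).
    # The two PyTorch biases are added together into the single Keras bias.
    return [weight_ih.T.copy(), weight_hh.T.copy(), bias_ih + bias_hh]


rng = np.random.default_rng(0)
in_features, embed_size, hidden_size = 16, 8, 4  # small illustrative sizes

# Stand-ins for state_dict entries (random values, not real weights).
linear_weight = rng.normal(size=(embed_size, in_features))
linear_bias = rng.normal(size=(embed_size,))
kernel, bias = linear_to_dense(linear_weight, linear_bias)

# Both conventions must produce the same output for the same input.
x = rng.normal(size=(3, in_features))
torch_style = x @ linear_weight.T + linear_bias
keras_style = x @ kernel + bias
outputs_match = np.allclose(torch_style, keras_style)

lstm_weights = lstm_to_keras(
    rng.normal(size=(4 * hidden_size, embed_size)),
    rng.normal(size=(4 * hidden_size, hidden_size)),
    rng.normal(size=(4 * hidden_size,)),
    rng.normal(size=(4 * hidden_size,)),
)
print(outputs_match, [w.shape for w in lstm_weights])
```

With a real model, each returned list would be passed to the matching layer's `set_weights(...)`, and the two frameworks' outputs compared on the same input before trusting the transfer.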

3. Align Vocabulary/Tokenizer

You need to load the PyTorch vocab.pkl file.

Create a Keras Tokenizer (e.g., tf.keras.preprocessing.text.Tokenizer) or implement a custom mapping.

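Ensure this Keras tokenizer uses the exact same word-to-index mapping, including the specific indices for special tokens like <start>, <pad>, and <end>, as defined in the PyTorch vocab.pkl. The generate_caption function relies on this consistency. Note that the Tokenizer's default filters argument strips the < and > characters, so the special tokens survive only if you pass filters='' or assign the mapping yourself.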
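
As a minimal sketch of the custom-mapping option, the snippet below keeps a word-to-index dictionary as the single source of truth for both encoding and decoding. The contents of `word2idx` are made up for illustration (the real mapping and special-token indices must come from vocab.pkl), and the `encode` and `decode` helpers are hypothetical names.

```python
# Stand-in for the mapping stored in vocab.pkl (hypothetical contents).
word2idx = {
    "<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3,
    "a": 4, "giraffe": 5, "standing": 6,
}
# Reverse lookup used when turning predicted indices back into words.
idx2word = {index: word for word, index in word2idx.items()}


def encode(caption, word2idx):
    """Wrap a caption in <start>/<end> and map each word to its index."""
    tokens = ["<start>"] + caption.lower().split() + ["<end>"]
    unk = word2idx["<unk>"]
    # Words missing from the vocabulary fall back to the <unk> index.
    return [word2idx.get(token, unk) for token in tokens]


def decode(indices, idx2word, specials=("<pad>", "<start>", "<end>")):
    """Map indices back to words, dropping the structural special tokens."""
    words = [idx2word[index] for index in indices]
    return " ".join(word for word in words if word not in specials)


ids = encode("A giraffe standing tall", word2idx)
text = decode(ids, idx2word)
print(ids)   # [1, 4, 5, 6, 3, 2]
print(text)  # a giraffe standing <unk>
```

Keeping the mapping in a plain dictionary avoids any tokenizer-side filtering or re-indexing, at the cost of writing the small encode and decode helpers yourself.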

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class EncoderCNN(layers.Layer):
    """
    Keras implementation of the Encoder CNN.
    Loads a pre-trained ResNet-152 and maps its output features.
    """
    def __init__(self, embed_size):
        super(EncoderCNN, self).__init__()
        resnet = tf.keras.applications.ResNet152(
            include_top=False,
            weights='imagenet',
            pooling='avg'
        )
        self.resnet_base = resnet

        self.dense = layers.Dense(embed_size, activation='relu')

    def call(self, images):
        features = self.resnet_base(images)
        features = self.dense(features)

        return features


embed_size = 256
hidden_size = 512
vocab_size = 10000
max_seq_length = 20

encoder_features_input = layers.Input(shape=(embed_size,), name='encoder_features')
decoder_input = layers.Input(shape=(None,), name='decoder_input_indices')

decoder_embedding_layer = layers.Embedding(input_dim=vocab_size,
                                           output_dim=embed_size,
                                           mask_zero=True,
                                           name='decoder_embedding')

decoder_lstm_layer = layers.LSTM(hidden_size,
                                 return_sequences=True,
                                 return_state=True,
                                 name='decoder_lstm')

decoder_dense_layer = layers.Dense(vocab_size, activation='softmax', name='decoder_output')

initial_h_dense = layers.Dense(hidden_size, activation='relu', name='initial_h_dense')
initial_c_dense = layers.Dense(hidden_size, activation='relu', name='initial_c_dense')

initial_h_state = initial_h_dense(encoder_features_input)
initial_c_state = initial_c_dense(encoder_features_input)
initial_lstm_state = [initial_h_state, initial_c_state]

embedded_captions = decoder_embedding_layer(decoder_input)
lstm_outputs, _, _ = decoder_lstm_layer(embedded_captions,
                                        initial_state=initial_lstm_state)
decoder_outputs = decoder_dense_layer(lstm_outputs)

training_model = keras.Model(inputs=[encoder_features_input, decoder_input],
                             outputs=decoder_outputs,
                             name='image_captioning_training_model')


training_model.summary()


inf_decoder_input_index = layers.Input(shape=(1,), name='inf_word_index')
inf_prev_h_state = layers.Input(shape=(hidden_size,), name='inf_prev_h')
inf_prev_c_state = layers.Input(shape=(hidden_size,), name='inf_prev_c')
inf_prev_states = [inf_prev_h_state, inf_prev_c_state]

inf_embedded_word = decoder_embedding_layer(inf_decoder_input_index)
inf_lstm_outputs, inf_new_h, inf_new_c = decoder_lstm_layer(inf_embedded_word,
                                                            initial_state=inf_prev_states)
inf_decoder_outputs = decoder_dense_layer(inf_lstm_outputs)
inf_new_states = [inf_new_h, inf_new_c]

inference_decoder_model = keras.Model(
    inputs=[inf_decoder_input_index] + inf_prev_states,
    outputs=[inf_decoder_outputs] + inf_new_states,
    name='image_captioning_inference_decoder'
)


inference_decoder_model.summary()


def generate_caption(image_path, keras_encoder, inference_decoder_model, tokenizer, max_length):
    """Greedily decode a caption for the image stored at `image_path`."""
    # Load the image and preprocess it the way Keras ResNet models expect.
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    # 224x224 is the default ResNet input size (299x299 is the Inception size).
    img = tf.image.resize(img, (224, 224))
    img = tf.keras.applications.resnet.preprocess_input(img)
    img = tf.expand_dims(img, 0)

    features = keras_encoder(img)

    # Map the image features to the initial LSTM state (h, c).
    feature_to_state_model = keras.Model(encoder_features_input, initial_lstm_state)
    current_h, current_c = feature_to_state_model.predict(features, verbose=0)

    # Decoding starts from the '<start>' token.
    current_word_index = tf.constant([[tokenizer.word_index['<start>']]])
    generated_indices = []

    for _ in range(max_length):
        probs, current_h, current_c = inference_decoder_model.predict(
            [current_word_index, current_h, current_c], verbose=0
        )

        # Greedy choice: take the most probable next word.
        predicted_index = int(tf.argmax(probs[0, -1, :]).numpy())
        generated_indices.append(predicted_index)

        # Stop once the end-of-sequence token has been generated.
        if predicted_index == tokenizer.word_index['<end>']:
            break

        current_word_index = tf.constant([[predicted_index]])

    caption = tokenizer.sequences_to_texts([generated_indices])
    return caption[0]


1. Translating Between Different Languages (Japanese-English)

Dataset: A large, high-quality parallel corpus (sentence-aligned Japanese and English text) is essential. Examples include JParaCrawl, KFTT, or ASPEC.

Tokenization (Japanese): This is a key challenge due to the lack of spaces and multiple scripts (Hiragana, Katakana, Kanji).

Morphological Analyzers: Tools like MeCab or Sudachi break text into meaningful words based on dictionaries.

Subword Tokenization: Methods like SentencePiece (BPE or Unigram) are standard for NMT. They break words/characters into common sub-units, handling rare words and morphology effectively without relying solely on large dictionaries.
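
To make the subword idea concrete, here is a toy sketch of byte-pair-encoding (BPE) merge learning in plain Python. It is not SentencePiece and leaves out everything a real tokenizer needs (normalization, whitespace markers, sampling); the function names and the four-word corpus are invented for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["食べる", "食べる", "食べた", "飲む"]
merges = learn_bpe(corpus, num_merges=2)
print(merges)                       # [('食', 'べ'), ('食べ', 'る')]
print(apply_bpe("食べた", merges))   # ['食べ', 'た']
```

The frequent stem 食べ becomes a single unit while the rarer ending stays separate, which is how subword vocabularies cover inflected forms without storing every word.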

Tokenization (English): While simpler, using a compatible subword tokenizer (like SentencePiece) often yields better results than basic space splitting, especially when paired with Japanese subword tokenization.

Vocabulary: Build separate or joint vocabularies based on the chosen tokenization method. Subword methods help manage vocabulary size.

Model Architecture: While the core Encoder-Decoder structure (LSTM, GRU, or Transformer) applies, hyperparameters might need tuning. The significant difference in sentence structure (Japanese SOV vs. English SVO) is a major challenge that modern NMT models learn to handle through training data.

Handling Specifics: Japanese particles (like は, を, が), politeness levels, and omitted subjects require the model to learn complex grammatical mappings, which can be difficult. Neural Machine Translation (NMT) models have shown significant improvements over older methods in handling these complexities but still require large amounts of data.

2. Advanced Methods of Machine Translation

Attention Mechanisms:

Concept: Allows the decoder to dynamically focus on relevant parts of the entire input (encoder hidden states) when generating each output word, rather than relying solely on the final encoder state.

How: Calculates "attention scores" weighting the importance of each input word for the current output word. A context-specific vector is created as a weighted sum of encoder states.

Benefits: Significantly improves long-sentence translation and handles word alignment better. Common types include Bahdanau (additive) and Luong (multiplicative) attention.
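
The score, weight, and context computation can be sketched in a few lines of NumPy. This uses the simplest dot-product scoring (a Luong-style variant with no learned parameters); the random arrays stand in for real encoder and decoder states, and the function names are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = np.exp(scores - np.max(scores))
    return shifted / shifted.sum()


def dot_attention(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoder step."""
    # One score per source position: dot product with the decoder state.
    scores = encoder_states @ decoder_state   # shape (src_len,)
    weights = softmax(scores)                 # non-negative, sums to 1
    # Context vector: weighted sum of the encoder states.
    context = weights @ encoder_states        # shape (hidden,)
    return weights, context


rng = np.random.default_rng(0)
src_len, hidden = 5, 8
encoder_states = rng.normal(size=(src_len, hidden))  # stand-in encoder outputs
decoder_state = rng.normal(size=(hidden,))           # stand-in decoder state

weights, context = dot_attention(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```

In a full model the context vector is combined with the decoder state before predicting the next word, and additive (Bahdanau) attention replaces the dot product with a small feed-forward scoring network.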

Transformer Models:

Concept: Introduced in "Attention Is All You Need," Transformers discard recurrence (LSTMs/GRUs) entirely and rely solely on attention mechanisms, primarily self-attention.

Architecture: Uses stacked Encoder and Decoder layers. Key components include:

Self-Attention: Allows each input token to weigh the importance of all other tokens within the same sequence (input or output).

Multi-Head Attention: Runs multiple self-attention processes in parallel, allowing the model to capture different types of relationships simultaneously.

Positional Encoding: Explicitly adds information about word order, as there are no sequential processing steps like in RNNs.

Feed-Forward Networks: Standard layers applied independently at each position.

Benefits: State-of-the-art performance, highly parallelizable (faster training), excellent at capturing long-range dependencies. Forms the foundation of most modern large language models (LLMs).
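
Two of the components listed above, positional encoding and single-head self-attention, can be sketched with NumPy. The sinusoidal formula follows the scheme from "Attention Is All You Need"; the projection matrices here are random stand-ins for learned weights, `d_model` is assumed to be even, and masking, multiple heads, and the feed-forward sublayer are omitted.

```python
import numpy as np


def positional_encoding(seq_len, d_model):
    """Sinusoidal position encodings of shape (seq_len, d_model); d_model even."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)         # even dimensions
    encoding[:, 1::2] = np.cos(angles)         # odd dimensions
    return encoding


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
token_embeddings = rng.normal(size=(seq_len, d_model))  # stand-in embeddings
pe = positional_encoding(seq_len, d_model)
x = token_embeddings + pe                                # inject word order

# Random projections standing in for learned query/key/value weights.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output, attn_weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, attn_weights.sum(axis=-1))
```

Each row of `attn_weights` shows how much one position attends to every other position; because all rows are computed in one matrix product, the whole sequence is processed in parallel rather than step by step.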

3. Generating Images from Text (Text-to-Image Synthesis)

Generative Adversarial Networks (GANs):

How: Use a Generator network (creates images from text embeddings plus noise) and a Discriminator network (judges whether an image looks real and matches the text).

Examples: StackGAN, AttnGAN (used attention to link words to image regions).

Limitations: Training instability, difficulty achieving high fidelity and strong text alignment for complex prompts.

Diffusion Models (Current State-of-the-Art):

How: Learn to reverse a process of adding noise to images. To generate, they start with random noise and iteratively denoise it, guided by text embeddings.

Text Conditioning: Powerful text encoders (like CLIP or T5) convert the prompt into vector representations. These vectors guide the denoising process at each step, often using cross-attention within the denoising network (typically a U-Net). Techniques like classifier-free guidance enhance prompt adherence.

Examples: DALL-E 2/3, Stable Diffusion, Midjourney, Google Imagen.

Benefits: Produce high-resolution, coherent, diverse images that strongly align with complex prompts. They have largely surpassed GANs in quality and control.
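
The noising process and the guidance step can be sketched numerically. The snippet assumes a DDPM-style linear noise schedule (the values 1e-4 to 0.02 over 1000 steps are an illustrative choice, not taken from any of the systems named above) and uses random arrays in place of a real image and of a denoising network's predictions; all function names are hypothetical.

```python
import numpy as np


def cumulative_alphas(betas):
    """Cumulative product of (1 - beta_t): the fraction of signal kept at step t."""
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bars, noise):
    """Forward process: sample the noised image x_t directly from x_0."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise


def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Push the noise prediction from the unconditional toward the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear schedule
alpha_bars = cumulative_alphas(betas)

x0 = rng.normal(size=(8, 8))             # stand-in for an image
noise = rng.normal(size=(8, 8))
x_early = add_noise(x0, 10, alpha_bars, noise)    # still close to x0
x_late = add_noise(x0, 999, alpha_bars, noise)    # almost pure noise

# Stand-ins for a denoiser's predictions without and with the text prompt.
eps_uncond = rng.normal(size=(8, 8))
eps_cond = rng.normal(size=(8, 8))
eps_guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5)
print(alpha_bars[10], alpha_bars[999], eps_guided.shape)
```

A guidance scale of 1 reproduces the conditional prediction, while larger values trade sample diversity for closer adherence to the prompt. Generation runs the process in reverse, using a trained network to estimate and remove the noise step by step.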



