# Machine Translation with Encoder-Decoder Model

## Overview

The goal is to implement a machine translation system using an encoder-decoder model. It involves preparing a suitable dataset, defining the model architecture, training it to translate between languages, and performing inference to generate translations. The conclusion summarizes the results and insights from the implementation.

## Table of Contents
1. [Introduction](#introduction)
2. [Dataset Preparation](#dataset-preparation)
3. [Model Architecture](#model-architecture)
4. [Training the Model](#training-the-model)
5. [Inference](#inference)
6. [Conclusion](#conclusion)

## Introduction
In this notebook, we will implement an Encoder-Decoder model for machine translation. The Encoder-Decoder architecture is widely used for tasks involving sequential data, such as language translation. This architecture consists of two main components: an encoder that processes the input sequence and a decoder that generates the output sequence. We will train our model on a simple dataset and then demonstrate how to make predictions for new input sentences.
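To make the division of labor concrete, the sketch below mimics the control flow of an encoder-decoder at inference time: the encoder turns the source sequence into a state, and the decoder emits one token per step until it produces an end token. The `toy_encoder`, `toy_decoder_step`, and `greedy_decode` functions and the `TOY_TABLE` lookup are hypothetical stand-ins for a trained network; only the loop structure is meant to carry over.

```python
# Toy illustration of the encoder-decoder control flow (no neural network involved).
# toy_encoder and toy_decoder_step are hypothetical stand-ins for trained models.

START, END = '<start>', '<end>'

# A fixed lookup table that plays the role of learned weights.
TOY_TABLE = {
    ('hello',): ['bonjour'],
}


def toy_encoder(source_tokens):
    """Summarize the source sequence into a single state object."""
    return {'source': tuple(source_tokens), 'position': 0}


def toy_decoder_step(previous_token, state):
    """Return the next target token and the updated state.

    A real decoder would condition on previous_token; this toy ignores it
    and simply reads the next entry from the lookup table.
    """
    target = TOY_TABLE.get(state['source'], [])
    position = state['position']
    next_token = target[position] if position < len(target) else END
    return next_token, {'source': state['source'], 'position': position + 1}


def greedy_decode(source_tokens, max_length=10):
    """Generate target tokens one at a time until END or max_length."""
    state = toy_encoder(source_tokens)
    token = START
    output = []
    for _ in range(max_length):
        token, state = toy_decoder_step(token, state)
        if token == END:
            break
        output.append(token)
    return output


print(greedy_decode(['hello']))  # ['bonjour']
```

In the real model, the state is the pair of LSTM vectors produced by the encoder, and each decoder step returns a probability distribution over the target vocabulary rather than a single token.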

## Dataset Preparation
For this example, we will use a tiny hand-written dataset of four English sentences paired with their French translations. Each sentence is wrapped in `<start>` and `<end>` markers.

1. **Load the Dataset**: We will load the translation dataset which contains pairs of sentences in English and French.
2. **Preprocess the Data**: We will tokenize the sentences, convert words to integers, and pad the sequences to ensure they have a uniform length, which is essential for batch processing in neural networks.
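The preprocessing steps above can be sketched in plain Python to show what tokenization and padding produce. The helpers `build_vocab`, `encode`, and `pad_post` are hypothetical and only approximate the Keras `Tokenizer` and `pad_sequences` used in this notebook. For instance, they assign indices by first appearance, whereas the Keras `Tokenizer` orders words by frequency.

```python
# Illustrative stand-ins for tokenization and post-padding (hypothetical helpers,
# not the Keras implementations).

def build_vocab(sentences):
    """Map each lowercased, whitespace-separated token to an integer, starting at 1.

    Index 0 is left unused so it can serve as the padding value.
    """
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to a list of integer token ids."""
    return [vocab[token] for token in sentence.lower().split()]


def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


sentences = ['<start> Hello <end>', '<start> How are you? <end>']
vocab = build_vocab(sentences)
sequences = [encode(s, vocab) for s in sentences]
padded = pad_post(sequences)

print(sequences)  # [[1, 2, 3], [1, 4, 5, 6, 3]]
print(padded)     # [[1, 2, 3, 0, 0], [1, 4, 5, 6, 3]]
```

Because index 0 is reserved for padding, the vocabulary size passed to an embedding layer is the number of distinct tokens plus one.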

In [3]:
# Install the libraries below if they are not already installed
!pip install numpy==1.23.5
!pip install tensorflow==2.13.1

In [22]:
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Provided data
data = {
    'english': ['<start> Hello <end>', '<start> How are you? <end>', '<start> I am learning <end>', '<start> Machine Translation is fun <end>'],
    'french': ['<start> Bonjour <end>', '<start> Comment ça va? <end>', '<start> J\'apprends <end>', '<start> La traduction automatique est amusante <end>']
}

# Tokenization
tokenizer_en = Tokenizer(filters="")
tokenizer_en.fit_on_texts(data['english'])
vocab_size_en = len(tokenizer_en.word_index) + 1  # +1 for padding token

tokenizer_fr = Tokenizer(filters="")
tokenizer_fr.fit_on_texts(data['french'])
vocab_size_fr = len(tokenizer_fr.word_index) + 1  # +1 for padding token

# Convert sentences to sequences
sequences_en = tokenizer_en.texts_to_sequences(data['english'])
sequences_fr = tokenizer_fr.texts_to_sequences(data['french'])

# Pad sequences
max_length_en = max(len(seq) for seq in sequences_en)
max_length_fr = max(len(seq) for seq in sequences_fr)

padded_en = pad_sequences(sequences_en, maxlen=max_length_en, padding='post')
padded_fr = pad_sequences(sequences_fr, maxlen=max_length_fr, padding='post')

# Prepare one-hot decoder targets: the target at step t is the token at step t + 1 (teacher forcing)
fr_target_data = np.zeros((len(padded_fr), max_length_fr, vocab_size_fr))  # (num_samples, max_length, vocab_size)

for i, seq in enumerate(sequences_fr):
    for t in range(len(seq) - 1):
        fr_target_data[i, t, seq[t + 1]] = 1.0  # One-hot encoding

# Define hyperparameters
latent_dim = 256  # Latent dimensionality of the encoding space
embedding_dim = 256  # Dimensionality of the embedding layer

# Define the encoder
encoder_inputs = Input(shape=(None,))  # Input shape for encoder
encoder_embedding = Embedding(vocab_size_en, embedding_dim)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]  # Encoder states

# Define the decoder
decoder_inputs = Input(shape=(None,))  # Input shape for decoder
decoder_embedding_layer = Embedding(vocab_size_fr, embedding_dim)  # Kept as a layer so inference can reuse its trained weights
decoder_embedding = decoder_embedding_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_dense = Dense(vocab_size_fr, activation='softmax')

# Decoder outputs
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_outputs = decoder_dense(decoder_outputs)  # Output layer for decoder

# Define the training model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Fit the model (using the prepared data)
model.fit([padded_en, padded_fr], fr_target_data, batch_size=64, epochs=100)  # Adjust epochs as needed

# Create the encoder model for inference
encoder_model = Model(encoder_inputs, encoder_states)

# Create the decoder model for inference
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_inputs_single = Input(shape=(1,))  # Input for a single token
decoder_embedding_single = decoder_embedding_layer(decoder_inputs_single)  # Reuse the trained embedding, not a fresh untrained one
decoder_outputs_single, h, c = decoder_lstm(decoder_embedding_single, initial_state=[decoder_state_input_h, decoder_state_input_c])
decoder_outputs_single = decoder_dense(decoder_outputs_single)  # Output layer for decoder

# Define the decoder model for inference
decoder_model = Model([decoder_inputs_single, decoder_state_input_h, decoder_state_input_c], [decoder_outputs_single, h, c])

# Decode sequence function (greedy, word-level)
def decode_sequence(input_seq, encoder_model, decoder_model, tokenizer_fr, max_length_fr):
    # Encode the input as state vectors [h, c]
    states_value = encoder_model.predict(input_seq)

    # Start the target sequence with the <start> token
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer_fr.word_index['<start>']

    decoded_words = []

    while True:
        # Run the decoder one step to get the next-token distribution and updated states
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Greedily pick the most likely token and convert its index to a word
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = tokenizer_fr.index_word.get(sampled_token_index, '')

        # Exit condition: the end token was produced or the maximum length was reached
        if sampled_word == '<end>' or len(decoded_words) >= max_length_fr:
            break
        decoded_words.append(sampled_word)

        # Feed the sampled token back in as the next decoder input
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return ' '.join(decoded_words)  # Return the decoded words without the <end> token

# Example usage
# Prepare the input sequence (using the first training sample as an example)
input_seq = padded_en[0].reshape(1, -1)  # Already padded to max_length_en; shape (1, max_length_en)

decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, tokenizer_fr, max_length_fr)
print(f'Decoded sentence: {decoded_sentence}')  # Print the decoded sentence

