
This guide walks through a basic sequence-to-sequence (Seq2Seq) neural machine translation (NMT) model built with recurrent layers (LSTMs) in TensorFlow/Keras. The training data is an English-to-Bengali caption dataset stored as a CSV file.

# 1. Set Up the Environment

Install TensorFlow and the supporting libraries with pip if you haven't already (scikit-learn is needed for the train/validation split):

    pip install tensorflow pandas numpy scikit-learn


# 2. Load and Preprocess Data

Load the CSV file and preprocess it: tokenize the text, build a vocabulary for each language, and pad the sequences to a fixed length.
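To make the tokenizing and padding steps concrete, here is a minimal pure-Python sketch of the idea. The helpers `build_word_index` and `texts_to_padded` are hypothetical names used only for illustration; they mimic, in simplified form, what a frequency-ordered tokenizer and post-padding do, and they are not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

def texts_to_padded(texts, word_index, maxlen):
    """Encode texts as id lists, then cut or zero-pad each one on the right."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[:maxlen]                            # keep the start of long sentences
        rows.append(ids + [0] * (maxlen - len(ids)))  # 0 is reserved for padding
    return rows

texts = ["startseq a girl endseq", "startseq a little girl climbing endseq"]
word_index = build_word_index(texts)
padded = texts_to_padded(texts, word_index, maxlen=6)
print(word_index)
print(padded)
```

Because id 0 is reserved for padding, the vocabulary size passed to an embedding layer is the number of words plus one.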

In [68]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed


In [69]:

# Load dataset
data = pd.read_csv('english to bengali.csv')


In [70]:
data.head()

| Unnamed: 0 | english_caption | bengali_caption |
| --- | --- | --- |
| 0 | a child in a pink dress is climbing up a set o... | একটি গোলাপী জামা পরা বাচ্চা মেয়ে একটি বাড়ির প্... |
| 1 | a girl going into a wooden building . | একটি মেয়ে শিশু একটি কাঠের বাড়িতে ঢুকছে |
| 2 | a little girl climbing into a wooden playhouse . | একটি বাচ্চা তার কাঠের খেলাঘরে উঠছে । |
| 3 | a little girl climbing the stairs to her playh... | ছোট মেয়েটি তার খেলার ঘরের সিড়ি বেয়ে উঠছে |
| 4 | a little girl in a pink dress going into a woo... | গোলাপি জামা পড়া ছোট একটি মেয়ে একটি কাঠের তৈরি... |


In [71]:

# Constants
MAX_NUM_WORDS = 10000
MAX_SEQUENCE_LENGTH = 20
EMBEDDING_DIM = 256

In [72]:
# Add special tokens
START_TOKEN = 'startseq'
END_TOKEN = 'endseq'

In [73]:
# Modify target texts to include start and end tokens
target_texts = [START_TOKEN + ' ' + text + ' ' + END_TOKEN for text in data['bengali_caption']]

In [74]:
# Tokenize and pad source sequences
source_texts = data['english_caption'].tolist()
source_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
source_tokenizer.fit_on_texts(source_texts)
source_sequences = source_tokenizer.texts_to_sequences(source_texts)
source_padded = pad_sequences(source_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

In [75]:
# Tokenize and pad target sequences
target_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
target_tokenizer.fit_on_texts(target_texts)
target_sequences = target_tokenizer.texts_to_sequences(target_texts)
# truncating='post' keeps the start token when a caption exceeds MAX_SEQUENCE_LENGTH
# (the default, 'pre', would cut it off)
target_padded = pad_sequences(target_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')

In [76]:
# Check if special tokens are in word index
print(f"Token Index for '{START_TOKEN}':", target_tokenizer.word_index.get(START_TOKEN))
print(f"Token Index for '{END_TOKEN}':", target_tokenizer.word_index.get(END_TOKEN))


Token Index for 'startseq': 1
Token Index for 'endseq': 2


In [77]:

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(source_padded, target_padded, test_size=0.2, random_state=42)

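# Define model parameters
# Tokenizer(num_words=N) only emits indices below N, so cap the vocabulary size accordingly
num_encoder_tokens = min(MAX_NUM_WORDS, len(source_tokenizer.word_index) + 1)
num_decoder_tokens = min(MAX_NUM_WORDS, len(target_tokenizer.word_index) + 1)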

# 3. Build the Seq2Seq Model

Build the Seq2Seq model with an LSTM encoder and an LSTM decoder. The encoder compresses the source sentence into its final hidden and cell states, which initialize the decoder.

In [78]:
# Encoder
encoder_inputs = Input(shape=(None,))
encoder_embedding = Embedding(input_dim=num_encoder_tokens, output_dim=EMBEDDING_DIM)(encoder_inputs)
encoder_lstm, state_h, state_c = LSTM(256, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]


In [79]:

# Decoder
decoder_inputs = Input(shape=(None,))
decoder_embedding = Embedding(input_dim=num_decoder_tokens, output_dim=EMBEDDING_DIM)(decoder_inputs)
decoder_lstm = LSTM(256, return_sequences=True)(decoder_embedding, initial_state=encoder_states)
decoder_outputs = TimeDistributed(Dense(num_decoder_tokens, activation='softmax'))(decoder_lstm)

In [80]:
# Model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
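Because the target sequences are zero-padded and the model applies no mask, the `accuracy` metric also counts padding positions, which are easy to predict and can inflate the score. The following NumPy sketch shows a padding-aware token accuracy; `masked_accuracy` and the toy arrays are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def masked_accuracy(y_true, y_pred_probs, pad_id=0):
    """Token accuracy over positions whose target is not the padding id."""
    y_pred = np.argmax(y_pred_probs, axis=-1)   # shape (batch, time)
    mask = y_true != pad_id                     # True where a real token is expected
    correct = (y_pred == y_true) & mask
    return correct.sum() / max(mask.sum(), 1)

# One toy target sequence: three real tokens followed by two padding positions.
y_true = np.array([[5, 7, 2, 0, 0]])
# Toy predictions: right on the padding, wrong on one real token.
pred_ids = np.array([[5, 9, 2, 0, 0]])
y_pred_probs = np.eye(10)[pred_ids]             # one-hot, shape (1, 5, 10)

plain = (pred_ids == y_true).mean()             # counts padding positions too
masked = masked_accuracy(y_true, y_pred_probs)  # ignores padding positions
print(plain, masked)
```

In this toy case the unmasked accuracy is 4 out of 5, while the masked accuracy is 2 out of 3, which better reflects the quality of the real tokens.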

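# 4. Train the Model

The decoder is trained with teacher forcing: its input is the target sequence without the last token, and its target is the same sequence shifted one step ahead, so each position learns to predict the next token.

In [81]:
# Train with teacher forcing: the decoder input drops the last token and the target drops the first
model.fit([X_train, y_train[:, :-1]], np.expand_dims(y_train[:, 1:], -1), epochs=20, batch_size=64, validation_data=([X_val, y_val[:, :-1]], np.expand_dims(y_val[:, 1:], -1)))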



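With teacher forcing, the decoder input and the training target are the same sequence offset by one time step. A small NumPy sketch with made-up token ids (1 for the start token, 2 for the end token, 0 for padding, all assumed here for illustration) shows the pairing:

```python
import numpy as np

# Toy padded targets: 1 = startseq, 2 = endseq, 0 = padding (assumed ids).
y = np.array([
    [1, 8, 9, 2, 0],
    [1, 4, 2, 0, 0],
])

decoder_input = y[:, :-1]    # what the decoder sees at each step
decoder_target = y[:, 1:]    # what it should predict at that step

# Print the (input, target) pairs for the first sentence.
for seen, expected in zip(decoder_input[0], decoder_target[0]):
    print(f"sees {seen} -> predicts {expected}")
```

If the input and the target were identical instead of shifted, the decoder could reach a low loss by copying its input and would learn nothing about predicting the next word.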
# 5. Inference and Prediction
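To use the model for inference, define separate encoder and decoder models that share the trained weights. The decoder model runs one step at a time: it takes the previous token and the previous LSTM states, and returns the next-token probabilities together with the updated states.

In [115]:
# Encoder model: maps a source sequence to the final LSTM states
encoder_model = Model(encoder_inputs, encoder_states)

# Locate the trained decoder parts in the training model
trained_decoder_embedding = Model(decoder_inputs, decoder_embedding)  # shares the trained embedding weights
trained_decoder_lstm = next(l for l in model.layers if isinstance(l, LSTM) and l.return_sequences)
trained_decoder_dense = next(l for l in model.layers if isinstance(l, TimeDistributed))

# Decoder model: token input plus the previous states
decoder_step_inputs = Input(shape=(None,))
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_embedding2 = trained_decoder_embedding(decoder_step_inputs)
# The training-time LSTM does not return its states, so build one that does and copy the trained weights into it
decoder_lstm_layer2 = LSTM(256, return_sequences=True, return_state=True)
decoder_lstm2, state_h2, state_c2 = decoder_lstm_layer2(decoder_embedding2, initial_state=decoder_states_inputs)
decoder_lstm_layer2.set_weights(trained_decoder_lstm.get_weights())
decoder_states = [state_h2, state_c2]
decoder_outputs2 = trained_decoder_dense(decoder_lstm2)

decoder_model = Model([decoder_step_inputs] + decoder_states_inputs, [decoder_outputs2] + decoder_states)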

def preprocess_input_sentence(sentence):
    sequence = source_tokenizer.texts_to_sequences([sentence])
    padded_sequence = pad_sequences(sequence, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
    return padded_sequence

def decode_sequence(input_seq):
    # Encode the source sentence into the initial decoder states
    states_value = encoder_model.predict(input_seq, verbose=0)

    # Start decoding from the start token
    start_token_index = target_tokenizer.word_index.get(START_TOKEN)
    if start_token_index is None:
        raise KeyError(f"'{START_TOKEN}' not found in target_tokenizer.word_index")
    target_seq = np.array([[start_token_index]])

    decoded_words = []
    # Greedy decoding: stop at the end token, at padding (index 0), or at the length limit in tokens
    while len(decoded_words) < MAX_SEQUENCE_LENGTH:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=0)

        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_word = target_tokenizer.index_word.get(sampled_token_index, '')
        if not sampled_word or sampled_word == END_TOKEN:
            break
        decoded_words.append(sampled_word)

        # Feed the predicted token and the updated states back into the decoder
        target_seq = np.array([[sampled_token_index]])
        states_value = [h, c]

    return ' '.join(decoded_words)

input_sentence = 'a girl going into a wooden building .'
input_seq = preprocess_input_sentence(input_sentence)
try:
    decoded_sentence = decode_sequence(input_seq)
    print(f'Input sentence: {input_sentence}')
    print(f'Decoded sentence: {decoded_sentence}')
except KeyError as e:
    print(e)

Input sentence: a girl going into a wooden building .
Decoded sentence: একটি মেয়ে শিশু একটি কাঠের বাড়িতে ঢুকছে
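The decoding loop is a greedy search: at each step it keeps only the most likely token and feeds it back in. The sketch below isolates that control flow from Keras. Both `greedy_decode` and `toy_step` are hypothetical stand-ins written for illustration; `toy_step` replaces the real decoder with a rule that always favours the next token id.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Feed back the most likely token until end_id appears or max_len tokens are produced."""
    token, state, output = start_id, None, []
    while len(output) < max_len:
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax over scores
        if token == end_id:
            break                      # the end token itself is not emitted
        output.append(token)
    return output

def toy_step(token, state):
    """Hypothetical decoder step: always favours token + 1, capped at id 4."""
    scores = [0.0] * 5
    scores[min(token + 1, 4)] = 1.0
    return scores, state

# Stops because the end token (id 4) is predicted.
print(greedy_decode(toy_step, start_id=1, end_id=4, max_len=10))
# Stops because the length limit is reached (id 9 is never predicted).
print(greedy_decode(toy_step, start_id=1, end_id=9, max_len=3))
```

Greedy search is the simplest choice; it can miss translations that a wider search such as beam search would find, at the cost of more computation.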
