<a href="https://colab.research.google.com/github/Priyanshu-Naik/Gen_AI/blob/main/Encoder_Decoder_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np, pandas as pd, string
from string import digits
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
!pip install --upgrade datasets



In [3]:
from datasets import load_dataset

# Load dataset in streaming mode
dataset = load_dataset(
    "cfilt/iitb-english-hindi",
    split="train",
    streaming=True
)

samples = []
max_samples = 25000

for i, example in enumerate(dataset):
    if i >= max_samples:
        break

    samples.append({
        "english_sentence": example["translation"]["en"],
        "hindi_sentence": example["translation"]["hi"]
    })

# Convert only the collected samples to DataFrame
lines = pd.DataFrame(samples)

# Optional: clean
lines = lines.dropna().drop_duplicates()

print(lines.head())
print(lines.shape)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


                                 english_sentence  \
0  Give your application an accessibility workout   
1               Accerciser Accessibility Explorer   
2  The default plugin layout for the bottom panel   
3     The default plugin layout for the top panel   
4  A list of plugins that are disabled by default   

                                      hindi_sentence  
0    अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें  
1                    एक्सेर्साइसर पहुंचनीयता अन्वेषक  
2              निचले पटल के लिए डिफोल्ट प्लग-इन खाका  
3               ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका  
4  उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से नि...  
(5174, 2)


**Text Cleaning**

Here it remove punctuation and digits and converts text to lowercase and strips whitespace.

It applies Applies cleaning and adds special tokens to Hindi sentences to mark start and end (start_, _end).

In [4]:
def clean_text(text):
  exclude = set(string.punctuation)
  text = ''.join(ch for ch in text if ch not in exclude)
  text = text.translate(str.maketrans('', '', digits))
  text = text.strip().lower()
  return text

lines['english_sentence'] = lines['english_sentence'].apply(clean_text)
lines['hindi_sentence'] = lines['hindi_sentence'].apply(clean_text)
lines['hindi_sentence'] = lines['hindi_sentence'].apply(lambda x: 'START_ ' + x + ' _END')

**Tokenization**

Converts text to sequences of integers using word indices. Hindi tokenizer keeps_because of special tokens.

In [5]:
eng_tokenizer = Tokenizer()
eng_tokenizer.fit_on_texts(lines['english_sentence'])
eng_seq = eng_tokenizer.texts_to_sequences(lines['english_sentence'])

hin_tokenizer = Tokenizer(filters='')
hin_tokenizer.fit_on_texts(lines['hindi_sentence'])
hin_seq = hin_tokenizer.texts_to_sequences(lines['hindi_sentence'])

**Padding**

Pads sequences to uniform length

decoder_target is shifted version of decoder_input used for teacher forcing.

In [6]:
max_eng_len = max([len(x) for x in eng_seq])
max_hin_len = max([len(x) for x in hin_seq])

encoder_input = pad_sequences(eng_seq, maxlen=max_eng_len, padding='post')
decoder_input = pad_sequences(hin_seq, maxlen=max_hin_len, padding='post')

decoder_target = np.zeros((decoder_input.shape[0], decoder_input.shape[1], 1))
decoder_target[:, 0:-1, 0] = decoder_input[:, 1:]

**Define Model Architecture
Encoder:**

It embeds English input and Passes through LSTM. Keeps hidden (state_h) and cell state (state_c) to pass to decoder.

In [7]:
eng_vocab_size = len(eng_tokenizer.word_index) + 1
latent_dim = 256

encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(eng_vocab_size, latent_dim)(encoder_inputs)
enc_outputs, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]

**Decoder:**

It embeds Hindi input. Uses initial states from encoder and Outputs probability distribution over Hindi vocabulary at each time step.

In [8]:
hin_vocab_size = len(hin_tokenizer.word_index) + 1

decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(hin_vocab_size, latent_dim)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = Dense(hin_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

**Compile and Train**

Trains on source (encoder_input) and target (decoder_input) with shifted targets and uses RMSProp optimizer and cross-entropy loss.

In [9]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit([encoder_input, decoder_input], decoder_target, batch_size=64, epochs=20, validation_split=0.2)

Epoch 1/20
[1m65/65[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m94s[0m 1s/step - accuracy: 0.8134 - loss: 2.4078 - val_accuracy: 0.9172 - val_loss: 0.6539
Epoch 2/20
[1m65/65[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m85s[0m 1s/step - accuracy: 0.8938 - loss: 0.7100 - val_accuracy: 0.9172 - val_loss: 0.6092
Epoch 3/20
[1m65/65[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m97s[0m 2s/step - accuracy: 0.9009 - loss: 0.6549 - val_accuracy: 0.9172 - val_loss: 0.6089
Epoch 4/20
[1m65/65[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m86s[0m 1s/step - accuracy: 0.9006 - loss: 0.6462 - val_accuracy: 0.9175 - val_loss: 0.6011
Epoch 5/20
[1m65/65[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m91s[0m 1s/step - accuracy: 0.9040 - loss: 0.6205 - val_accuracy: 0.9175 - val_loss: 0.6066
Epoch 6/20
[1m65/65[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m132s[0m 1s/step - accuracy: 0.8994 - loss: 0.6438 - val_accuracy: 0.9175 - val_loss: 0.6075
Epoch 7/20
[1m65/65[0m [32m━━━━━━━━━

<keras.src.callbacks.history.History at 0x7ea10625f8f0>

**Inference Models**

To translate new sentences after training:

In [10]:
encoder_model_inf = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2 = dec_emb_layer(decoder_inputs)
dec_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_outputs2 = decoder_dense(dec_outputs2)

decoder_model_inf = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs2, state_h2, state_c2])

**Reverse Lookup**

Used to convert indices back to words during decoding.

In [11]:
rev_eng = {v: k for k, v in eng_tokenizer.word_index.items()}
rev_hin = {v: k for k, v in hin_tokenizer.word_index.items()}

**Translate Function**

It prepares input sentence. Starts decoding with <start> token and Iteratively predicts next word and feeds it back until <end> is predicted. and the test the model with example

In [14]:
def translate(sentences):
  sentences = clean_text(sentences)
  eng_seq = eng_tokenizer.texts_to_sequences([sentences])
  eng_seq = pad_sequences(eng_seq, maxlen=max_eng_len, padding='post')
  state_values = encoder_model_inf.predict(eng_seq, verbose=0)

  target_seq = np.zeros((1, 1))
  target_seq[0, 0] = hin_tokenizer.word_index['start_']

  decoded = []
  while True:
    output_seq, h, c = decoder_model_inf.predict([target_seq] + state_values, verbose=0)
    pred_word_ind = np.argmax(output_seq[0, -1, :])

    # Handle the case where the model predicts the padding index (0)
    if pred_word_ind == 0:
      break

    pred_word = rev_hin[pred_word_ind]

    if pred_word == '_END' or len(decoded) >= max_hin_len:
      break

    decoded.append(pred_word)
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = pred_word_ind
    state_values = [h , c]

  return ' '.join(decoded)

print("English: And")
print("Hindi: ", translate("And"))

English: And
Hindi:  a _end
