# **Implementing Many-to-Many RNN for English-to-Urdu Language Translation and Exploring Its Limitations**

# **Part 3:** Resolving RNN Limitations Using Long Short-Term Memory (LSTM)

## Data Preparation:

### Loading the Data:

In [1]:
import pandas as pd

df = pd.read_excel('./parallel-corpus.xlsx')

# Keep only the first two columns
df = df.iloc[:, :2]

# Strip the trailing space from the 'SENTENCES ' column name
df.rename(columns={'SENTENCES ': 'SENTENCES'}, inplace=True)


df.head()

| | SENTENCES | MEANING |
|---|---|---|
| 0 | How can I communicate with my parents? | میں اپنے والدین سے کیسے بات کروں ؟ |
| 1 | How can I make friends?’ | میں دوست کیسے بنائوں ؟ |
| 2 | Why do I get so sad?’ | میں اتنا اداس کیوں ہوں؟. |
| 3 | If you’ve asked yourself such questions, you’r... | اگر آپ نے اپنے آپ سے ایسے سوالات کیے ہیں، تو آ... |
| 4 | Depending on where you’ve turned for guidance,... | اس بات پر منحصر ہے کہ آپ رہنمائی کے لیے کہاں ... |


In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Tokenize the sentences
tokenizer_eng = Tokenizer()
tokenizer_urdu = Tokenizer()

# Convert the 'SENTENCES' column to string type before fitting the tokenizer
df['SENTENCES'] = df['SENTENCES'].astype(str)
# Convert the 'MEANING' column to string type before fitting the tokenizer
df['MEANING'] = df['MEANING'].astype(str)

tokenizer_eng.fit_on_texts(df['SENTENCES'])
tokenizer_urdu.fit_on_texts(df['MEANING'])

eng_sequences = tokenizer_eng.texts_to_sequences(df['SENTENCES'])
urdu_sequences = tokenizer_urdu.texts_to_sequences(df['MEANING'])

# Pad sequences
max_len_eng = max(len(seq) for seq in eng_sequences)
max_len_urdu = max(len(seq) for seq in urdu_sequences)

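# Use one shared length for both languages: the model emits exactly
# one Urdu token per input time step, so inputs and targets must align
max_len = max(max_len_eng, max_len_urdu)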

eng_sequences = pad_sequences(eng_sequences, maxlen=max_len, padding='post')
urdu_sequences = pad_sequences(urdu_sequences, maxlen=max_len, padding='post')

# Vocabulary sizes
vocab_size_eng = len(tokenizer_eng.word_index) + 1
vocab_size_urdu = len(tokenizer_urdu.word_index) + 1
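
The `+ 1` in both vocabulary sizes accounts for index 0, which the Keras `Tokenizer` never assigns to a word and which `pad_sequences` uses as its default fill value. As a rough illustration of what `padding='post'` does, here is a NumPy-only sketch on toy data. `pad_post` is a hypothetical helper written for this example, not part of Keras, and unlike `pad_sequences` it raises an error instead of truncating long sequences.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad each integer sequence with `value` up to `maxlen` (hypothetical helper)."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) > maxlen:
            raise ValueError("sequence longer than maxlen")
        # Real tokens go first; the tail keeps the padding value
        padded[row, :len(seq)] = seq
    return padded

# Toy token-index sequences (word indices start at 1, 0 is reserved for padding)
toy_sequences = [[4, 7, 2], [5, 1]]
toy_padded = pad_post(toy_sequences, maxlen=5)
print(toy_padded)
```

Because 0 never stands for a real word, it can later be filtered out of predictions or masked out of metrics without losing any vocabulary entry.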

In [3]:
# Split the data into training, validation, and test sets (roughly 70/15/15)
train_size = int(len(eng_sequences) * 0.7)
test_size = int(len(eng_sequences) * 0.15)

# For English train, validation and test
x_train, x_temp = eng_sequences[:train_size], eng_sequences[train_size:]
x_test, x_val = x_temp[:test_size], x_temp[test_size:]

# For Urdu train, validation and test
y_train, y_temp = urdu_sequences[:train_size], urdu_sequences[train_size:]
y_test, y_val = y_temp[:test_size], y_temp[test_size:]
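
The slicing above takes the rows in their original order, without shuffling, and the validation set absorbs whatever remains after the 70% and 15% cuts. A quick sanity check along the following lines can confirm that the three parts add up to the full dataset. `split_sizes` is a hypothetical helper and the sample count is a toy value, not the size of the actual corpus.

```python
def split_sizes(n_samples, train_frac=0.7, test_frac=0.15):
    """Return (train, test, val) sizes using the same int() truncation as the slicing above."""
    train_size = int(n_samples * train_frac)
    test_size = int(n_samples * test_frac)
    # Validation gets everything that is left, including rounding remainders
    val_size = n_samples - train_size - test_size
    return train_size, test_size, val_size

# Toy example with a sample count that does not divide evenly
sizes = split_sizes(103)
print(sizes)
```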


In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Build Model
model = Sequential(
    [
        Embedding(input_dim=vocab_size_eng , output_dim=64),
        LSTM(64, return_sequences=True),
        Dense(vocab_size_urdu, activation='softmax')
    ]
)


In [5]:
print(x_train.shape, y_train.shape)


(21114, 938) (21114, 938)


In [6]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])


# Train Model
model.fit(x_train, y_train, epochs=25, validation_data=(x_val, y_val))

# Evaluate on the held-out test set
model.evaluate(x_test, y_test)

Epoch 1/25
660/660 ━━━━━━━━━━━━━━━━━━━━ 293s 434ms/step - accuracy: 0.9724 - loss: 2.2065 - val_accuracy: 0.9871 - val_loss: 0.0975
Epoch 2/25
660/660 ━━━━━━━━━━━━━━━━━━━━ 286s 434ms/step - accuracy: 0.9828 - loss: 0.1317 - val_accuracy: 0.9871 - val_loss: 0.0907
Epoch 3/25
660/660 ━━━━━━━━━━━━━━━━━━━━ 326s 440ms/step - accuracy: 0.9831 - loss: 0.1243 - val_accuracy: 0.9877 - val_loss: 0.0857
Epoch 4/25
660/660 ━━━━━━━━━━━━━━━━━━━━ 322s 440ms/step - accuracy: 0.9832 - loss: 0.1211 - val_accuracy: 0.9878 - val_loss: 0.0827
Epoch 5/25
660/660 ━━━━━━━━━━━━━━━━━━━━ 318s 435ms/step - accuracy: 0.9834 - loss: 0.1182 - val_accuracy: 0.9880 - val_loss: 0.0807
Epoch 6/25
660/660 ━━━━━━━━━━━━━━━━━━━━ 322s 435ms/step - accuracy: 0.9837 - loss: 0.1146 - val_accuracy: 0.9881 - val_loss: 0.0792
...

[0.07316845655441284, 0.9886731505393982]
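
The 98.9% test accuracy should be read with caution. Every sequence is padded to 938 tokens, so most target positions are probably padding (index 0), and each of those counts as correct whenever the model also predicts padding. One way to get a more informative number is to score only the non-padding positions. The sketch below does this with NumPy on toy arrays; `masked_accuracy` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, pad_id=0):
    """Token accuracy over positions where the target is not padding (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mask = y_true != pad_id
    if not mask.any():
        return float("nan")  # no real tokens to score
    return float((y_true[mask] == y_pred[mask]).mean())

# Toy targets and predictions: 3 real tokens, 9 padding positions
y_true = np.array([[3, 5, 0, 0, 0, 0],
                   [2, 0, 0, 0, 0, 0]])
y_pred = np.array([[3, 1, 0, 0, 0, 0],
                   [4, 0, 0, 0, 0, 0]])

plain = float((y_true == y_pred).mean())   # counts padding positions as hits
masked = masked_accuracy(y_true, y_pred)   # scores only the real tokens
print(plain, masked)
```

In the toy data only one of the three real tokens is predicted correctly, yet the unmasked score stays high because the padding positions dominate. Applied to this notebook, the same function could be fed `y_test` and the `argmax` of the model's predictions, ideally computed in batches since the full softmax output for 938 time steps is large.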

In [7]:
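# Translate an English sentence with the trained model, one output token per input step
def translate(text):
    sequence = tokenizer_eng.texts_to_sequences([text])
    # Pad to the same length used during training (max_len, not max_len_eng)
    sequence = pad_sequences(sequence, maxlen=max_len, padding='post')
    prediction = model.predict(sequence)
    # Pick the most probable Urdu token at each time step
    predicted_sequence = np.argmax(prediction, axis=-1)
    # Drop padding (index 0) and map the remaining indices back to words
    translated_text = ' '.join([tokenizer_urdu.index_word[idx] for idx in predicted_sequence[0] if idx != 0])
    return translated_text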

In [8]:
for sentence in df.head()['SENTENCES']:
  print(sentence)
  print(translate(sentence))

How can I communicate with my parents?
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 168ms/step
آپ آپ میں آپ کیسے کر آپ سکتے
How can I make friends?’
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step
آپ آپ میں میں میں آپ
Why do I get so sad?’
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step
آپ آپ آپ آپ کہ
If you’ve asked yourself such questions, you’re not alone.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
اگر آپ نے آپ آپ نے آپ آپ کہ آپ نہیں نہیں نہیں
Depending on where you’ve turned for guidance, you may have been given conflicting answers.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step
اس کی میں آپ کہ کہ کے آپ آپ آپ آپ آپ میں نے میں میں میں میں میں ہو


In [9]:
model.save("english_urdu_LSTM_f219258.keras")