<a href="https://colab.research.google.com/github/RuthBiney/Language_Translation/blob/main/Language_Translation_with_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Tokenization and Alignment
## Steps:
1. Read the data files: Load both the English and Twi files.
2. Align the sentences: Ensure that each English sentence has a corresponding Twi sentence.
3. Tokenize the sentences: Split each sentence into individual words.
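
The three steps above can be sketched in plain Python. This is a minimal illustration rather than the loader used in this notebook: `align_and_tokenize` is a hypothetical helper, it assumes the two files are already line-aligned (line *i* of one file translates line *i* of the other), and it drops incomplete pairs instead of padding them.

```python
def align_and_tokenize(english_lines, twi_lines):
    """Pair lines by position, drop incomplete pairs, split on whitespace."""
    pairs = []
    # zip stops at the shorter list, so unmatched trailing lines are ignored
    for en, tw in zip(english_lines, twi_lines):
        en, tw = en.strip(), tw.strip()
        if en and tw:  # keep only pairs where both sides contain text
            pairs.append((en.split(), tw.split()))
    return pairs


# Tiny in-memory stand-in for the two files
english_lines = [
    "I WAS born in 1930 in Alsace , France , into an artistic family .\n",
    "\n",
    "“ Oh , Jehovah , Keep My Young Girl Faithful ! ”\n",
]
twi_lines = [
    "WƆWOO me too abusua a wonim adwinne di mu wɔ Alsace , France , wɔ 1930 mu .\n",
    "“ Oo , Yehowa , Boa Me Babea Kumaa Yi Ma Onni Nokware ! ”\n",
]

pairs = align_and_tokenize(english_lines, twi_lines)
print(len(pairs))
print(pairs[0][0][:3], pairs[0][1][:3])
```

Dropping a pair is only safe when a blank line really marks a missing sentence. If the files drift out of step, positional pairing will silently match the wrong sentences.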

In [23]:
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense


In [24]:
# Step 1: Load the dataset with logging of sentence count
def load_data(english_file_path, twi_file_path):
    with open(english_file_path, 'r', encoding='utf-8', errors='replace') as english_file:
        english_sentences = english_file.readlines()

    with open(twi_file_path, 'r', encoding='utf-8', errors='replace') as twi_file:
        twi_sentences = twi_file.readlines()

    english_count = len(english_sentences)
    twi_count = len(twi_sentences)

    # Log the sentence counts
    print(f"Number of English sentences: {english_count}")
    print(f"Number of Twi sentences: {twi_count}")

    # Warn if the counts do not match (no exception is raised)
    if english_count != twi_count:
        print(f"Mismatch! English sentences: {english_count}, Twi sentences: {twi_count}")
        # Pad the shorter list with empty strings so both lists have equal length.
        # Note: this only equalizes the lengths; it does not restore sentence-level alignment.
        if english_count > twi_count:
            twi_sentences += [''] * (english_count - twi_count)
        else:
            english_sentences += [''] * (twi_count - english_count)

    return english_sentences, twi_sentences

# File paths to your dataset
english_file_path = '/content/english'
twi_file_path = '/content/twi'

# Load and preprocess the data
english_sentences, twi_sentences = load_data(english_file_path, twi_file_path)

# Check first few sentence pairs
for i in range(3):
    print(f"English: {english_sentences[i]}")
    print(f"Twi: {twi_sentences[i]}\n")


Number of English sentences: 976541
Number of Twi sentences: 606197
Mismatch! English sentences: 976541, Twi sentences: 606197
English: “ Oh , Jehovah , Keep My Young Girl Faithful ! ”

Twi: “ Oo , Yehowa , Boa Me Babea Kumaa Yi Ma Onni Nokware ! ”


English: I WAS born in 1930 in Alsace , France , into an artistic family .

Twi: WƆWOO me too abusua a wonim adwinne di mu wɔ Alsace , France , wɔ 1930 mu .


English: During the evenings , Father , sitting in his lounge chair , would be reading some books about geography or astronomy .

Twi: Ná Papa taa pa twere n’agua mu kenkan asase ho nsɛm anaa ewim nneɛma ho nhoma bi anwummere anwummere .




# 2. Model Building
We will use a recurrent neural network (RNN) built from stacked LSTM layers as the translation model. The cell below outlines the model-building process with TensorFlow and Keras:

In [25]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

# Step 1: Define the model
def build_translation_model(input_dim, output_dim, input_length):
    model = Sequential()
    model.add(Embedding(input_dim=input_dim, output_dim=256, input_length=input_length))
    model.add(LSTM(512, return_sequences=True))
    model.add(LSTM(512))
    model.add(Dense(output_dim, activation='softmax'))

    return model

# Step 2: Compile the model
model = build_translation_model(input_dim=10000, output_dim=10000, input_length=100)  # Adjust dimensions based on your data
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()
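
As a sanity check on the model size, the parameter counts for the layers above can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and the dimensions passed to `build_translation_model`. The helper names are ours, not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim


layers = {
    "embedding": embedding_params(10000, 256),
    "lstm_1": lstm_params(256, 512),
    "lstm_2": lstm_params(512, 512),
    "dense": dense_params(512, 10000),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Under these assumptions the output layer and the embedding account for roughly two thirds of the weights, so the vocabulary size matters more for model size than the LSTM width does.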


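# 3. Training and Evaluation
This section covers evaluation, data preprocessing, an optional attention mechanism, and training.

### 3.1 Implement Evaluation with BLEU Score
Add the BLEU score as part of the evaluation, using a library such as `nltk`. Note that the cell below relies on `tokenized_english` and `tokenized_twi`, which are defined in sections 3.2 and 3.4, and on a `model` that has already been trained: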

In [28]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

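# Function to calculate BLEU score with smoothing.
# Assumes `candidate` holds per-time-step predictions of shape (1, time_steps, vocab_size)
# and `reference` is a list of tokens in the same ID space as the predictions.
def calculate_bleu(reference, candidate):
    candidate = [int(np.argmax(c)) for c in candidate[0]]  # Most likely token ID at each time step
    smoothing = SmoothingFunction().method1  # Smoothing avoids zero scores when higher-order n-grams have no matches
    return sentence_bleu([reference], candidate, smoothing_function=smoothing)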

# Example: Evaluate your model's output against the Twi references
num_samples = min(5, len(tokenized_english))  # Ensure we don't go out of bounds

for i in range(num_samples):  # Dynamically set the range based on data size
    english_input = tokenized_english[i]
    reference_translation = tokenized_twi[i]  # Ground truth in Twi

    # Use your trained model to generate a translation
    model_output = model.predict(np.array([english_input]))  # Convert list to NumPy array

    # Calculate BLEU score
    bleu_score = calculate_bleu(reference_translation, model_output)
    print(f"BLEU Score for sentence {i+1}: {bleu_score}")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step
BLEU Score for sentence 1: 0.05372849659117709
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
BLEU Score for sentence 2: 0.069372929071742
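
BLEU compares two token lists, so the score is only meaningful when the reference and the candidate use the same kind of tokens (both word lists, or both lists of IDs from the same vocabulary). The sketch below shows the core idea with a simplified unigram-only score: clipped precision multiplied by a brevity penalty. It is not the `nltk` implementation, which by default combines 1- to 4-gram precisions, and `bleu1` is a hypothetical helper name.

```python
import math
from collections import Counter


def bleu1(reference, candidate):
    """Clipped unigram precision times brevity penalty (simplified BLEU-1)."""
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # A candidate token is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = overlap / len(candidate)
    # Candidates shorter than the reference are penalized
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision


reference = "wo ho te sen".split()
print(bleu1(reference, "wo ho te sen".split()))   # identical sentence
print(bleu1(reference, "wo din de sen".split()))  # two of four words match
```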


### 3.2 Enhancing Data Preprocessing
To ensure the dataset is clean and aligned properly, let's add a preprocessing step to handle special characters, punctuation, and consistency between sentence pairs.

In [29]:
import re

# Function to clean and preprocess the sentences
def clean_sentence(sentence):
    # Convert to lowercase
    sentence = sentence.lower()

    # Remove punctuation and special characters. \w is Unicode-aware for str patterns,
    # so Twi letters such as ɛ and ɔ are kept (an ASCII-only class would strip them).
    sentence = re.sub(r"[^\w\s]", "", sentence)

    # Tokenize by splitting on spaces
    tokens = sentence.split()

    return tokens

# Apply cleaning to both English and Twi sentences
tokenized_english = [clean_sentence(sentence) for sentence in english_sentences]
tokenized_twi = [clean_sentence(sentence) for sentence in twi_sentences]
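
One thing to watch when cleaning Twi text is the character class used in the regular expression. An ASCII-only class such as `[^a-zA-Z0-9\s]` removes letters like ɛ and ɔ, which changes the words themselves. The sketch below compares that pattern with a Unicode-aware one on a sentence from the dataset. Both function names are hypothetical.

```python
import re


def clean_ascii_only(sentence):
    # Keeps only ASCII letters, digits, and whitespace
    return re.sub(r"[^a-zA-Z0-9\s]", "", sentence.lower()).split()


def clean_unicode_aware(sentence):
    # For str patterns, \w matches Unicode letters and digits (and underscore)
    return re.sub(r"[^\w\s]", "", sentence.lower()).split()


twi = "WƆWOO me too abusua a wonim adwinne di mu wɔ Alsace , France , wɔ 1930 mu ."

print(clean_ascii_only(twi)[:2])
print(clean_unicode_aware(twi)[:2])
```

With the ASCII-only pattern the first word loses its ɔ, while the Unicode-aware pattern keeps it and still strips the commas and the full stop.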


### 3.3 Attention Mechanism (Optional but Useful Enhancement)
Adding an attention mechanism can improve the model’s translation quality, especially for longer sentences. Here’s a simplified way to integrate attention into an RNN-based model:

In [31]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, Attention, Concatenate
from tensorflow.keras.models import Model

# Function to build translation model with attention
def build_translation_model_with_attention(input_dim, output_dim, input_length):
    # Encoder input
    encoder_input = Input(shape=(input_length,))
    encoder_embedding = Embedding(input_dim=input_dim, output_dim=256)(encoder_input)

    # Encoder LSTM
    encoder_lstm = LSTM(512, return_sequences=True, return_state=True)
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)

    # Decoder input
    decoder_input = Input(shape=(input_length,))
    decoder_embedding = Embedding(input_dim=input_dim, output_dim=256)(decoder_input)

    # Decoder LSTM
    decoder_lstm = LSTM(512, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])

    # Attention layer
    attention = Attention()([decoder_outputs, encoder_outputs])

    # Combine decoder outputs with attention
    decoder_combined_context = Concatenate(axis=-1)([decoder_outputs, attention])

    # Final Dense layer
    output = Dense(output_dim, activation='softmax')(decoder_combined_context)

    # Define the model
    model = Model([encoder_input, decoder_input], output)

    return model

# Build the model
model_with_attention = build_translation_model_with_attention(input_dim=10000, output_dim=10000, input_length=100)

# Compile the model
model_with_attention.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model_with_attention.summary()
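
To make the attention step less of a black box, here is a small pure-Python sketch of dot-product attention. It follows our understanding of what the Keras `Attention` layer computes when given `[query, value]` with default settings: each decoder step (query) is scored against every encoder step, and the softmax of those scores gives the weights for averaging the encoder outputs. The function names and toy vectors are illustrative assumptions.

```python
import math


def softmax(xs):
    # Subtract the maximum for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(queries, values):
    """For each query vector, return a weighted average of the value vectors."""
    contexts, all_weights = [], []
    for q in queries:
        # Score = dot product between the query and each value vector
        scores = [sum(qi * vi for qi, vi in zip(q, v)) for v in values]
        weights = softmax(scores)
        context = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source time steps
decoder_outputs = [[5.0, 0.0], [0.0, 5.0]]              # 2 target time steps

contexts, weights = dot_product_attention(decoder_outputs, encoder_outputs)
for step, row in enumerate(weights):
    print(step, [round(w, 3) for w in row])
```

Each row of weights sums to 1, and the context vectors have the same width as the encoder outputs, which is why they can be concatenated with the decoder outputs before the final `Dense` layer.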


### 3.4 Final Steps for Training
Before training, the sentences must be tokenized and converted into padded sequences of integer IDs that the model can consume.

In [33]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

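# Example raw sentences. Note: this reassigns tokenized_english / tokenized_twi
# to a tiny demo set of untokenized strings, replacing the lists built in section 3.2.
tokenized_english = ["Hello, how are you?", "What is your name?", "I am a data scientist."]
tokenized_twi = ["Wo ho te sen?", "Wo din de sen?", "Meyɛ data scientist."]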

# Prepare tokenizers for both English and Twi
english_tokenizer = Tokenizer()
twi_tokenizer = Tokenizer()

# Fit tokenizers on the sentences
english_tokenizer.fit_on_texts(tokenized_english)
twi_tokenizer.fit_on_texts(tokenized_twi)

# Convert text to sequences of integers
english_sequences = english_tokenizer.texts_to_sequences(tokenized_english)
twi_sequences = twi_tokenizer.texts_to_sequences(tokenized_twi)

# Padding the sequences to ensure uniform length
english_sequences = pad_sequences(english_sequences, padding='post')
twi_sequences = pad_sequences(twi_sequences, padding='post')

# Now you can train the model with the prepared sequences
print("English sequences:", english_sequences)
print("Twi sequences:", twi_sequences)


English sequences: [[ 1  2  3  4  0]
 [ 5  6  7  8  0]
 [ 9 10 11 12 13]]
Twi sequences: [[1 3 4 2]
 [1 5 6 2]
 [7 8 9 0]]
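
The IDs in the output above follow a simple rule: more frequent words get smaller IDs, ties keep first-seen order, and 0 is reserved for padding. The sketch below imitates that behavior in plain Python for the same example sentences. It is a rough stand-in, not the Keras `Tokenizer` itself: the helper names are hypothetical, and punctuation handling is simplified to deleting ASCII punctuation.

```python
import string
from collections import Counter


def simple_tokens(sentence):
    # Lowercase, delete ASCII punctuation, split on whitespace
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


def build_word_index(sentences):
    counts = Counter()
    for sentence in sentences:
        counts.update(simple_tokens(sentence))
    # Most frequent words first; sorted() is stable, so ties keep first-seen order
    ordered = sorted(counts, key=lambda word: -counts[word])
    return {word: i for i, word in enumerate(ordered, start=1)}  # 0 is left for padding


def to_padded_sequences(sentences, word_index):
    seqs = [[word_index[w] for w in simple_tokens(s)] for s in sentences]
    width = max(len(seq) for seq in seqs)
    # Post-padding: zeros are appended after the real tokens
    return [seq + [0] * (width - len(seq)) for seq in seqs]


english = ["Hello, how are you?", "What is your name?", "I am a data scientist."]
twi = ["Wo ho te sen?", "Wo din de sen?", "Meyɛ data scientist."]

english_ids = to_padded_sequences(english, build_word_index(english))
twi_ids = to_padded_sequences(twi, build_word_index(twi))
print(english_ids)
print(twi_ids)
```

In the Twi example, "wo" and "sen" each occur twice, so they receive IDs 1 and 2 ahead of the words that occur once.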


In [34]:
!pip install keras




In [35]:
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed


In [36]:
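# Steps 1-2 (assumed; the cell that originally defined these variables is not shown):
# split the integer sequences from section 3.4 into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(english_sequences, twi_sequences, test_size=0.2)

# Step 3: Pad sequences to ensure uniform length (use same max_length for both languages)
max_length = max(max(len(seq) for seq in x_train), max(len(seq) for seq in y_train))  # Take the maximum of both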

# Pad both input (x_train, x_val) and target sequences (y_train, y_val) using the same max_length
x_train = pad_sequences(x_train, maxlen=max_length)
x_val = pad_sequences(x_val, maxlen=max_length)
y_train = pad_sequences(y_train, maxlen=max_length)  # Use the same max_length
y_val = pad_sequences(y_val, maxlen=max_length)

# Convert lists to NumPy arrays
x_train = np.array(x_train)
y_train = np.array(y_train)
x_val = np.array(x_val)
y_val = np.array(y_val)

# Check the shapes of your validation data
print("Shape of x_val:", x_val.shape)
print("Shape of y_val:", y_val.shape)

# Step 5: Define the model
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=64, input_length=max_length))  # Set input_length for the Embedding layer
model.add(LSTM(64, return_sequences=True))  # LSTM layer with return_sequences=True
model.add(TimeDistributed(Dense(10000, activation='softmax')))  # TimeDistributed output layer for each time step

# Step 6: Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Step 7: Train with full data (since the dataset is small)
# Remove steps_per_epoch and validation_steps - these are not needed for small datasets and may cause issues
history = model.fit(
    x_train, y_train,  # Train on full data
    epochs=2,
    validation_data=(x_val, y_val)
)

# Step 8: Evaluate the model
# Ensure you are evaluating on a reasonable portion of the validation data
model.evaluate(x_val, y_val)

Shape of x_val: (1, 5)
Shape of y_val: (1, 5)
Epoch 1/2
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - accuracy: 0.0000e+00 - loss: 9.2105 - val_accuracy: 0.0000e+00 - val_loss: 9.2089
Epoch 2/2
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 171ms/step - accuracy: 0.3000 - loss: 9.2078 - val_accuracy: 0.0000e+00 - val_loss: 9.2063
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step - accuracy: 0.0000e+00 - loss: 9.2063


[9.206286430358887, 0.0]
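
The loss of about 9.21 reported above is a useful sanity check. Cross-entropy for a model that spreads its probability evenly over all 10,000 output classes is ln(10000), so a loss this close to that value suggests the model has barely moved away from a uniform guess after two epochs on three sentences. The short calculation below reproduces the number. Reading the loss this way is our interpretation, assuming the predictions are still near-uniform.

```python
import math

vocab_size = 10000  # size of the softmax output layer used above

# Cross-entropy when every class gets probability 1 / vocab_size
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")  # 9.2103
```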

In [37]:
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split

english_train, english_val, twi_train, twi_val = train_test_split(
    tokenized_english, tokenized_twi, test_size=0.2
)

# Note: the split above is not used below. model.fit continues training on the
# x_train / y_train arrays prepared in the previous cell. To train on this split instead,
# first convert it to padded integer sequences (e.g. with Tokenizer and pad_sequences).

# Train the model
history = model.fit(
    x_train, y_train,  # Preprocessed input and output sequences
    epochs=10,
    validation_data=(x_val, y_val)
)

# Evaluate the model
model.evaluate(x_val, y_val)


Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 139ms/step - accuracy: 0.4000 - loss: 9.2050 - val_accuracy: 0.0000e+00 - val_loss: 9.2036
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 86ms/step - accuracy: 0.4000 - loss: 9.2022 - val_accuracy: 0.0000e+00 - val_loss: 9.2009
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 83ms/step - accuracy: 0.5000 - loss: 9.1993 - val_accuracy: 0.0000e+00 - val_loss: 9.1981
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 136ms/step - accuracy: 0.5000 - loss: 9.1963 - val_accuracy: 0.0000e+00 - val_loss: 9.1952
Epoch 5/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 84ms/step - accuracy: 0.5000 - loss: 9.1932 - val_accuracy: 0.0000e+00 - val_loss: 9.1920
Epoch 6/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 146ms/step - accuracy: 0.5000 - loss: 9.1899 - val_accuracy: 0.0000e+00 - val_loss: 9.1887
Epoch 7/10
1/1

[9.172381401062012, 0.0]
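
When splitting parallel data, both languages must be shuffled with the same permutation so that sentence *i* in English stays matched with sentence *i* in Twi. `train_test_split` does this when both lists are passed in one call. The sketch below shows the same idea with only the standard library. `paired_split` is a hypothetical helper, and its rounding of the validation size is our own choice rather than scikit-learn's rule.

```python
import random


def paired_split(sources, targets, test_size=0.2, seed=0):
    """Shuffle both lists with one permutation, then cut off a validation part."""
    if len(sources) != len(targets):
        raise ValueError("sources and targets must have the same length")
    indices = list(range(len(sources)))
    random.Random(seed).shuffle(indices)  # one shared permutation for both lists
    n_val = max(1, round(len(indices) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(sources, train_idx), pick(sources, val_idx),
            pick(targets, train_idx), pick(targets, val_idx))


english = ["Hello, how are you?", "What is your name?", "I am a data scientist."]
twi = ["Wo ho te sen?", "Wo din de sen?", "Meyɛ data scientist."]

en_train, en_val, tw_train, tw_val = paired_split(english, twi, test_size=0.2)
print(len(en_train), len(en_val))
```

With three sentences and `test_size=0.2`, one pair goes to validation. This matches the `(1, 5)` validation shape printed earlier, and it also explains why the validation accuracy is so noisy.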

# 4. Save and Document
Once the model is trained, save it and document the process:

In [39]:
# Save the model in HDF5 format (a legacy format; recent Keras versions also offer the native .keras format)
model.save('twi_translation_model.h5')


