<a href="https://colab.research.google.com/github/DavidkingMazimpaka/English-to-Kinyarwanda-Translation/blob/main/English_to_Kinyarwanda_Translation_with_RNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**English-to-Kinyarwanda Translation with RNNs**

This notebook builds a model that translates text from English to Kinyarwanda, using the Kinyarwanda-English parallel dataset published by Digital Umuganda on Hugging Face. The dataset contains 48,000 parallel sentence pairs, which makes it suitable for training a machine translation model.

**STEPS TO BUILD THE TRANSLATION MODEL**

**1. Data Collection and Preprocessing**

In [1]:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Embedding, GRU, Dense, Bidirectional
from tensorflow.keras.models import Model
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction



# Function to read data with specific encoding and error handling
def read_data_with_encoding(url, encoding='utf-8'):
    try:
        return pd.read_csv(url, sep='\t', names=['English', 'Kinyarwanda'], encoding=encoding, on_bad_lines='skip')
    except UnicodeDecodeError:
        print(f"UnicodeDecodeError encountered while reading {url} with encoding {encoding}")
        return None

# Load the datasets
urls = [
    'https://huggingface.co/datasets/DigitalUmuganda/kinyarwanda-english-machine-translation-dataset/resolve/main/kinyarwanda-english-corpus.tsv',
    'https://huggingface.co/datasets/DigitalUmuganda/kinyarwanda-english-machine-translation-dataset/resolve/main/kinyarwanda-english-corpus2.tsv',
    'https://huggingface.co/datasets/DigitalUmuganda/kinyarwanda-english-machine-translation-dataset/resolve/main/kinyarwanda-english-corpus3.tsv'
]

dfs = [read_data_with_encoding(url, encoding='ISO-8859-1') for url in urls]
df = pd.concat(dfs, ignore_index=True)

# Drop any rows with NaN values
df.dropna(inplace=True)

# Preprocess the text
def preprocess(text):
    text = text.lower()
    text = ''.join(char for char in text if char.isalnum() or char.isspace())
    return text

df['English'] = df['English'].apply(preprocess)
df['Kinyarwanda'] = df['Kinyarwanda'].apply(preprocess)

# Add special tokens
def add_special_tokens(texts):
    return ['<start> ' + text + ' <end>' for text in texts]

df['English'] = add_special_tokens(df['English'])
df['Kinyarwanda'] = add_special_tokens(df['Kinyarwanda'])

# Split the data into training and validation sets
train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)
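
As a quick check of what the cleaning step does, the sketch below reruns `preprocess` and `add_special_tokens` on two made-up English sentences (the sample strings are illustrative and do not come from the dataset). It shows that punctuation, including apostrophes, is removed, and why the special tokens have to be added after cleaning: `preprocess` would strip the angle brackets from `<start>` and `<end>`.

```python
def preprocess(text: str) -> str:
    # Lowercase, then keep only letters, digits, and whitespace
    text = text.lower()
    return ''.join(ch for ch in text if ch.isalnum() or ch.isspace())

def add_special_tokens(texts):
    # Wrap every sentence in the markers the decoder relies on
    return ['<start> ' + t + ' <end>' for t in texts]

# Illustrative inputs, not taken from the corpus
samples = ["Hello, how are you?", "Don't stop!"]
cleaned = [preprocess(s) for s in samples]
tagged = add_special_tokens(cleaned)

print(cleaned)
print(tagged)
# Cleaning after tagging would damage the markers
print(preprocess('<start>'))
```

Because apostrophes are dropped, contractions are merged into single tokens; whether that is acceptable depends on the text, and it is worth checking on a few real rows.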

**2. Tokenization and Padding**

>Tokenization converts each sentence into a sequence of integer word IDs, and padding appends zeros so that all sequences have the same length.



In [2]:
# Tokenize the text
def tokenize_text(texts):
    tokenizer = Tokenizer(filters='')
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    return sequences, tokenizer

# Tokenize English and Kinyarwanda texts
english_texts = train_data['English'].tolist()
kinyarwanda_texts = train_data['Kinyarwanda'].tolist()

english_sequences, english_tokenizer = tokenize_text(english_texts)
kinyarwanda_sequences, kinyarwanda_tokenizer = tokenize_text(kinyarwanda_texts)

# Pad sequences
max_seq_length = max(max(len(seq) for seq in english_sequences), max(len(seq) for seq in kinyarwanda_sequences))
english_padded = pad_sequences(english_sequences, maxlen=max_seq_length, padding='post')
kinyarwanda_padded = pad_sequences(kinyarwanda_sequences, maxlen=max_seq_length, padding='post')

# pad_sequences already returns NumPy arrays; the conversion below is a no-op kept for clarity
english_padded = np.array(english_padded)
kinyarwanda_padded = np.array(kinyarwanda_padded)

# Aliases used by the model, training, and evaluation code below
inp_lang = english_tokenizer
targ_lang = kinyarwanda_tokenizer
max_length_inp = max_seq_length
max_length_targ = max_seq_length
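
To make the tokenize-and-pad step concrete, here is a simplified pure-Python stand-in for what `Tokenizer` and `pad_sequences` do above. The helper names (`build_word_index`, `texts_to_sequences`, `pad_post`) are hypothetical and this is not the Keras implementation; it only mirrors the idea that frequent words get low IDs, ID 0 is reserved for padding, and zeros are appended at the end (`padding='post'`).

```python
from collections import Counter

def build_word_index(texts):
    # Count words, then assign IDs starting at 1 (0 is reserved for padding).
    # sorted() is stable, so ties keep their order of first appearance.
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

def texts_to_sequences(texts, word_index):
    # Words missing from the vocabulary are skipped
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    # Append zeros; assumes maxlen is at least the longest sequence length
    return [seq + [0] * (maxlen - len(seq)) for seq in sequences]

texts = ['<start> hello world <end>', '<start> hello <end>']
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = pad_post(sequences, max(len(s) for s in sequences))

print(word_index)
print(padded)
```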

**3. Model Development**

For the translation model, we use an RNN-based encoder-decoder architecture. The code below pairs a bidirectional GRU encoder with a GRU decoder and learns its word embeddings from scratch; LSTM units or pre-trained word embeddings such as Word2Vec or GloVe could be substituted.

**Example code for building the model:**

In [3]:
# Define the Encoder
class Encoder(Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.gru = Bidirectional(GRU(self.enc_units,
                                     return_sequences=True,
                                     return_state=True,
                                     recurrent_initializer='glorot_uniform'))

    def call(self, x, hidden):
        x = self.embedding(x)
        output, forward_h, backward_h = self.gru(x, initial_state=hidden)
        state = tf.concat([forward_h, backward_h], axis=-1)
        return output, state

    def initialize_hidden_state(self, batch_size):
        return [tf.zeros((batch_size, self.enc_units)) for _ in range(2)]

# Define the Decoder
class Decoder(Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.gru = GRU(self.dec_units,
                       return_sequences=True,
                       return_state=True,
                       recurrent_initializer='glorot_uniform')
        self.fc = Dense(vocab_size)

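    def call(self, x, hidden, enc_output):
        # enc_output is accepted but unused: this decoder has no attention
        # and sees the source sentence only through its initial hidden state
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        output = self.fc(output)  # shape: (batch, time, vocab_size)
        return output, state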

# Initialize model parameters
embedding_dim = 256
units = 512
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1
BATCH_SIZE = 64

# Instantiate Encoder and Decoder
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units * 2, BATCH_SIZE)
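
The decoder is created with `units * 2` because the encoder is bidirectional: its forward and backward final states are concatenated before being handed to the decoder as the initial hidden state. The NumPy sketch below reproduces only that shape arithmetic, using small illustrative sizes rather than the notebook's batch size of 64 and 512 units.

```python
import numpy as np

# Illustrative sizes, much smaller than the notebook's values
batch, units = 4, 8

# Stand-ins for the final states of the forward and backward GRU passes
forward_h = np.zeros((batch, units))
backward_h = np.ones((batch, units))

# Same operation as tf.concat([forward_h, backward_h], axis=-1) in Encoder.call
state = np.concatenate([forward_h, backward_h], axis=-1)

# The decoder GRU must have this many units to accept `state` as initial_state
dec_units = state.shape[-1]
print(state.shape, dec_units)
```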

**4. Training the Model**

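Train the model on the prepared dataset, tuning hyperparameters such as batch size, learning rate, and the number of epochs. The training step below uses teacher forcing: at each time step the decoder receives the true previous target token rather than its own prediction.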

**Example code for training:**

In [19]:
# Training process
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

# Training step function
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
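        for t in range(1, targ.shape[1]):
            predictions, dec_hidden = decoder(dec_input, dec_hidden, enc_output)
            # predictions has shape (batch, 1, vocab); drop the time axis so it
            # lines up with targ[:, t] and the padding mask, both of shape (batch,)
            loss += loss_function(targ[:, t], predictions[:, 0, :])
            dec_input = tf.expand_dims(targ[:, t], 1)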

    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss
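
The cells shown here define `train_step` but never call it, so the model is still untrained when it is evaluated. Below is a minimal sketch of the missing batching logic. `iterate_batches` is a hypothetical helper, not part of the notebook; it drops the last partial batch because `train_step` builds its `<start>` column with a fixed `BATCH_SIZE`.

```python
import random

def iterate_batches(inputs, targets, batch_size, seed=None):
    """Yield shuffled (inputs, targets) batches of exactly batch_size rows."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    # Stop before a final partial batch would be produced
    for start in range(0, len(indices) - batch_size + 1, batch_size):
        chosen = indices[start:start + batch_size]
        yield [inputs[i] for i in chosen], [targets[i] for i in chosen]

# Toy data: 10 aligned rows, batches of 4, so 2 full batches are produced
demo = list(iterate_batches(list(range(10)), list(range(10)), 4, seed=0))
print(len(demo), [len(inp) for inp, _ in demo])

# Assumed wiring into the notebook (EPOCHS is a hypothetical setting):
# for epoch in range(EPOCHS):
#     enc_hidden = encoder.initialize_hidden_state(BATCH_SIZE)
#     for inp, targ in iterate_batches(english_padded, kinyarwanda_padded, BATCH_SIZE):
#         batch_loss = train_step(tf.constant(np.array(inp)),
#                                 tf.constant(np.array(targ)), enc_hidden)
```

Shuffling the index list once per call keeps each source sentence paired with its translation while still varying the batch composition between epochs.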


**5. Evaluation**

Evaluate the model with an automatic metric such as the BLEU score, complemented by human evaluation, to check that the translations are accurate and fluent.
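
To give a feel for what BLEU measures, the sketch below computes clipped n-gram precision, the quantity BLEU is built on. It is a simplification, not full BLEU: there is no brevity penalty, no averaging across n-gram orders, and no smoothing. The function names and the toy sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token windows, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, candidate, n=1):
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the cat is on the mat".split()
candidate = "the the the cat".split()

print(clipped_precision(reference, candidate, n=1))  # 0.75
print(clipped_precision(reference, candidate, n=2))
```

Clipping is what stops a candidate from scoring well by repeating a common word: "the" appears three times in the candidate but is credited only twice.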

**Example code for evaluation:**

**Calculate BLEU score for a batch of sentences**

In [None]:
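# Evaluation function
def evaluate(sentence):
    # Clean raw input like the training data; rows of val_data already carry
    # the special tokens, so add them only when they are missing
    if not sentence.startswith('<start>'):
        sentence = '<start> ' + preprocess(sentence) + ' <end>'
    # Words outside the training vocabulary fall back to ID 0 (padding)
    inputs = [inp_lang.word_index.get(i, 0) for i in sentence.split()]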
    inputs = pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''
    hidden = encoder.initialize_hidden_state(1)
    enc_out, enc_hidden = encoder(inputs, hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden = decoder(dec_input, dec_hidden, enc_out)
        predicted_id = tf.argmax(predictions[0, -1]).numpy().item()  # Convert to integer

        # Debugging output
        print(f"Step {t}: Predicted ID: {predicted_id}, Result so far: '{result}'")

        # Check for padding first: ID 0 is reserved and never appears in index_word
        if predicted_id == 0:
            print("Predicted padding token, breaking loop.")
            break

        # Check if predicted_id exists in target language vocabulary
        if predicted_id not in targ_lang.index_word:
            print(f"Warning: predicted_id {predicted_id} not found in index_word.")
            break  # Exit loop if not found

        result += targ_lang.index_word[predicted_id] + ' '

        # Check for end token
        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence

        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence

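# Calculate the sentence-level BLEU score for one reference/candidate pair
def calculate_bleu(reference, candidate):
    # Score only real words: drop the <start>/<end> markers from both sides
    special = {'<start>', '<end>'}
    ref_tokens = [w for w in reference.split() if w not in special]
    cand_tokens = [w for w in candidate.split() if w not in special]
    smoothing_function = SmoothingFunction().method4
    return sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smoothing_function)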

def evaluate_batch(test_data):
    total_bleu_score = 0.0
    for idx, row in test_data.iterrows():
        reference = row['Kinyarwanda']
        candidate, _ = evaluate(row['English'])
        bleu_score = calculate_bleu(reference, candidate)
        total_bleu_score += bleu_score
        print(f"Reference: {reference}")
        print(f"Candidate: {candidate}")
        print(f"BLEU score: {bleu_score}")
    average_bleu_score = total_bleu_score / len(test_data)
    print(f"Average BLEU score: {average_bleu_score}")
    return average_bleu_score

# Example evaluation call
example_sentence = "hello how are you"
translation, sentence = evaluate(example_sentence)
print(f'Translation: {translation}')

# Evaluate BLEU score on validation data
average_bleu_score = evaluate_batch(val_data)
print(f'Average BLEU score on validation data: {average_bleu_score}')

Streaming output truncated to the last 5000 lines.
Step 11: Predicted ID: 7366, Result so far: 'befallen ids ingrained carpet sauls begs stripped tableware idly scrolls priyankas '
Step 12: Predicted ID: 10723, Result so far: 'befallen ids ingrained carpet sauls begs stripped tableware idly scrolls priyankas arimathea '
Step 13: Predicted ID: 7280, Result so far: 'befallen ids ingrained carpet sauls begs stripped tableware idly scrolls priyankas arimathea characterization '
Step 14: Predicted ID: 6979, Result so far: 'befallen ids ingrained carpet sauls begs stripped tableware idly scrolls priyankas arimathea characterization motorist '
Step 15: Predicted ID: 2939, Result so far: 'befallen ids ingrained carpet sauls begs stripped tableware idly scrolls priyankas arimathea characterization motorist orphan '
Step 16: Predicted ID: 6459, Result so far: 'befallen ids ingrained carpet sauls begs stripped tableware idly scrolls priyankas arimathea characterization motorist orph