<a href="https://colab.research.google.com/github/Praise-Atadja/language_translation/blob/main/Ewe_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PROJECT NAME: EWE LANGUAGE TRANSLATION WITH RNNS**

## IMPLEMENTATION

The Ewe people are an African ethnic group. The largest Ewe population is in Ghana (3.3 million), and the second largest is in Togo (2 million). They speak the Ewe language, which belongs to the Gbe branch of the Niger-Congo language family. They are related to other speakers of Gbe languages such as the Fon, Gen, Phla Phera, and the Aja people of Togo and Benin [1].
Ewe is written using the Latin alphabet with a few added letters, some of them derived from the International Phonetic Alphabet.

Ewe is spoken by the majority of people in the Volta Region of Ghana, West Africa. It is also widely spoken in Togo and in parts of Benin. Given how widely the language is spoken, a dependable translator is needed, and this project attempts to build one.

## THE DATASET


This dataset contains Ewe sentences paired with their English translations. It can be used for research or for training a language model.

In [1]:
#Import Necessary Libraries
from collections import Counter
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.callbacks import EarlyStopping
from google.colab import drive

# Mount Google Drive so the dataset CSV can be read
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading data

In [2]:
#load data
def load_data():
    # Load data
    ewe_english_data = pd.read_csv('/content/drive/MyDrive/EWE_ENGLISH.csv')

    return ewe_english_data

# Load the data
ewe_english_data = load_data()

In [3]:
# Display the first few rows of the data
ewe_english_data.head()

Unnamed: 0.1,Unnamed: 0,EWE,ENGLISH
0,0,Ne nyɔnu aɖe le evi dzim eye wo le kukum nɛ la...,﻿If a woman often loss his baby after he is bo...
1,1,Ŋkɔ sia nye na ŋkɔ si ke ame bubu tsɔna na ɖev...,"This name comes from another person, which mea..."
2,2,Ame si hɔ ɖevi la ƒlela tsona ƒome bubu me alo...,This person must not be part of the whole fami...
3,3,Kɔnua wo yina ale: evinɔ si ga dzi ɖevi bubu a...,The ceremony is done as follow: the family of ...
4,4,Ne ame aɖe vayina to afimagodzi he kɔ ɖevia la...,When somebody passes through the road and find...


In [4]:
# Remove unnecessary columns
ewe_english = ewe_english_data[['ENGLISH', 'EWE']]

In [5]:
# Display the first few rows of the data
ewe_english.head()

Unnamed: 0,ENGLISH,EWE
0,﻿If a woman often loss his baby after he is bo...,Ne nyɔnu aɖe le evi dzim eye wo le kukum nɛ la...
1,"This name comes from another person, which mea...",Ŋkɔ sia nye na ŋkɔ si ke ame bubu tsɔna na ɖev...
2,This person must not be part of the whole fami...,Ame si hɔ ɖevi la ƒlela tsona ƒome bubu me alo...
3,The ceremony is done as follow: the family of ...,Kɔnua wo yina ale: evinɔ si ga dzi ɖevi bubu a...
4,When somebody passes through the road and find...,Ne ame aɖe vayina to afimagodzi he kɔ ɖevia la...


In [6]:
# Drop any rows with missing values by creating a new DataFrame
ewe_english_cleaned = ewe_english.dropna().reset_index(drop=True)
print(ewe_english_cleaned.head())

                                             ENGLISH  \
0  ﻿If a woman often loss his baby after he is bo...   
1  This name comes from another person, which mea...   
2  This person must not be part of the whole fami...   
3  The ceremony is done as follow: the family of ...   
4  When somebody passes through the road and find...   

                                                 EWE  
0  Ne nyɔnu aɖe le evi dzim eye wo le kukum nɛ la...  
1  Ŋkɔ sia nye na ŋkɔ si ke ame bubu tsɔna na ɖev...  
2  Ame si hɔ ɖevi la ƒlela tsona ƒome bubu me alo...  
3  Kɔnua wo yina ale: evinɔ si ga dzi ɖevi bubu a...  
4  Ne ame aɖe vayina to afimagodzi he kɔ ɖevia la...  


Tokenization and Preprocessing

In [7]:
#Create a Vocabulary
def build_vocab(sentences):
    # Split each sentence into words and flatten the list
    words = [word for sentence in sentences for word in sentence.split()]
    # Count the frequency of each word
    word_counts = Counter(words)
    # Create a vocabulary dictionary mapping each word to a unique index
    vocab = {word: idx + 1 for idx, (word, _) in enumerate(word_counts.items())}
    # Add a special token for padding
    vocab['<PAD>'] = 0
    return vocab

# Build vocabularies for English and Ewe
english_vocab = build_vocab(ewe_english_cleaned['ENGLISH'])
ewe_vocab = build_vocab(ewe_english_cleaned['EWE'])

print(list(english_vocab.items())[:10])
print(list(ewe_vocab.items())[:10])


[('\ufeffIf', 1), ('a', 2), ('woman', 3), ('often', 4), ('loss', 5), ('his', 6), ('baby', 7), ('after', 8), ('he', 9), ('is', 10)]
[('Ne', 1), ('nyɔnu', 2), ('aɖe', 3), ('le', 4), ('evi', 5), ('dzim', 6), ('eye', 7), ('wo', 8), ('kukum', 9), ('nɛ', 10)]
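
The first English entry above is `'\ufeffIf'`: a byte-order mark from the source file has been kept as part of the word, so `If` and `\ufeffIf` would end up as two different vocabulary entries. The following is a minimal sketch of a cleanup step that could run before `build_vocab`. The helper name `clean_text` is hypothetical (it is not part of the notebook), and the use of NFC normalization is an assumption about what suits this dataset.

```python
import unicodedata

def clean_text(text):
    """Remove byte-order marks, normalise Unicode to NFC, and collapse whitespace."""
    text = text.replace('\ufeff', '')          # drop stray byte-order marks
    text = unicodedata.normalize('NFC', text)  # one canonical form per character
    return ' '.join(text.split())              # collapse runs of whitespace

sample = '\ufeffIf a woman  often loss his baby'
print(clean_text(sample).split()[0])  # prints: If
```

Applied to both columns (for example with `ewe_english_cleaned['ENGLISH'].apply(clean_text)`), this would keep the special Ewe letters intact while removing the invisible marker.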


In [8]:
# Create Word to Index and Index to Word Mappings
def create_mappings(vocab):
    word_to_index = vocab
    index_to_word = {idx: word for word, idx in vocab.items()}
    return word_to_index, index_to_word

# Create mappings for English and Ewe
english_word_to_index, english_index_to_word = create_mappings(english_vocab)
ewe_word_to_index, ewe_index_to_word = create_mappings(ewe_vocab)

print(list(english_word_to_index.items())[:10])
print(list(ewe_index_to_word.items())[:10])

[('\ufeffIf', 1), ('a', 2), ('woman', 3), ('often', 4), ('loss', 5), ('his', 6), ('baby', 7), ('after', 8), ('he', 9), ('is', 10)]
[(1, 'Ne'), (2, 'nyɔnu'), (3, 'aɖe'), (4, 'le'), (5, 'evi'), (6, 'dzim'), (7, 'eye'), (8, 'wo'), (9, 'kukum'), (10, 'nɛ')]


In [9]:
#Convert Sentences to Sequences
def text_to_sequence(text, word_to_index):
    return [word_to_index[word] for word in text.split()]

# Convert sentences to sequences
ewe_english_cleaned['ENGLISH_seq'] = ewe_english_cleaned['ENGLISH'].apply(lambda x: text_to_sequence(x, english_word_to_index))
ewe_english_cleaned['EWE_seq'] = ewe_english_cleaned['EWE'].apply(lambda x: text_to_sequence(x, ewe_word_to_index))

print(ewe_english_cleaned.head())

                                             ENGLISH  \
0  ﻿If a woman often loss his baby after he is bo...   
1  This name comes from another person, which mea...   
2  This person must not be part of the whole fami...   
3  The ceremony is done as follow: the family of ...   
4  When somebody passes through the road and find...   

                                                 EWE  \
0  Ne nyɔnu aɖe le evi dzim eye wo le kukum nɛ la...   
1  Ŋkɔ sia nye na ŋkɔ si ke ame bubu tsɔna na ɖev...   
2  Ame si hɔ ɖevi la ƒlela tsona ƒome bubu me alo...   
3  Kɔnua wo yina ale: evinɔ si ga dzi ɖevi bubu a...   
4  Ne ame aɖe vayina to afimagodzi he kɔ ɖevia la...   

                                         ENGLISH_seq  \
0  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 2,...   
1  [33, 27, 34, 35, 26, 36, 37, 38, 28, 22, 39, 4...   
2  [33, 39, 43, 44, 45, 46, 15, 18, 47, 48, 15, 1...   
3  [64, 16, 10, 65, 66, 67, 18, 48, 15, 18, 30, 7...   
4  [93, 94, 95, 90, 18, 74, 21, 96, 21, 97, 18
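
Because `text_to_sequence` indexes the dictionary directly, any word that was not seen when the vocabulary was built raises a `KeyError`. That cannot happen on the training sentences, but it can at prediction time. One common way around it, sketched here under the assumption that an extra `<UNK>` entry is added to the vocabulary (the notebook itself does not define one), is to fall back to that index for unknown words. The name `text_to_sequence_safe` is hypothetical.

```python
def text_to_sequence_safe(text, word_to_index, unk_token='<UNK>'):
    """Map words to indices, using the <UNK> index for unseen words."""
    unk_index = word_to_index[unk_token]
    return [word_to_index.get(word, unk_index) for word in text.split()]

# Toy vocabulary for illustration only
toy_vocab = {'<PAD>': 0, 'Ne': 1, 'nyɔnu': 2}
toy_vocab['<UNK>'] = len(toy_vocab)  # next free index, here 3

print(text_to_sequence_safe('Ne nyɔnu aɖe', toy_vocab))  # prints: [1, 2, 3]
```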

In [10]:
# Keep only the integer sequence columns for modeling
ewe_english_cleaned = ewe_english_cleaned[['ENGLISH_seq', 'EWE_seq']]

In [11]:
max_len_en = max(ewe_english_cleaned['ENGLISH_seq'].apply(len))
max_len_ewe = max(ewe_english_cleaned['EWE_seq'].apply(len))

encoder_input_data = pad_sequences(ewe_english_cleaned['ENGLISH_seq'], maxlen=max_len_en, padding='post')
decoder_input_data = pad_sequences(ewe_english_cleaned['EWE_seq'], maxlen=max_len_ewe, padding='post')

# Create target data for decoder by shifting the decoder input by one time step
decoder_target_data = np.zeros_like(decoder_input_data)
decoder_target_data[:, :-1] = decoder_input_data[:, 1:]
decoder_target_data[:, -1] = 0
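
The one-step shift above is what makes teacher forcing work: at position t the decoder receives token t and is trained to predict token t+1. Sequence-to-sequence setups usually also wrap each target sentence in start and end markers, so that decoding has a defined first input and a stop signal; the vocabulary built in this notebook contains no such tokens. The sketch below illustrates the idea on a toy sequence. The ids `START`, `END` and the helper `make_decoder_pair` are hypothetical and would need to be reserved in the Ewe vocabulary.

```python
import numpy as np

PAD, START, END = 0, 1, 2  # hypothetical reserved ids

def make_decoder_pair(seq, max_len):
    """Build post-padded (decoder_input, decoder_target) from one token sequence."""
    full = [START] + list(seq) + [END]
    decoder_input = full[:-1]   # starts with START, omits END
    decoder_target = full[1:]   # same tokens shifted left, ends with END
    pad = lambda s: s + [PAD] * (max_len - len(s))
    return np.array(pad(decoder_input)), np.array(pad(decoder_target))

dec_in, dec_tgt = make_decoder_pair([5, 6, 7], max_len=6)
print(dec_in)   # prints: [1 5 6 7 0 0]
print(dec_tgt)  # prints: [5 6 7 2 0 0]
```

Here `max_len` must be at least `len(seq) + 1` so that both arrays fit without truncation.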


In [12]:
# Split the data into training and validation sets
encoder_input_train, encoder_input_val, decoder_input_train, decoder_input_val, decoder_target_train, decoder_target_val = train_test_split(
    encoder_input_data, decoder_input_data, decoder_target_data, test_size=0.2
)

Modeling

Model Architecture

In [13]:
# Define parameters
num_encoder_tokens = len(english_vocab)
num_decoder_tokens = len(ewe_vocab)
latent_dim = 256

# Define the encoder
encoder_inputs = Input(shape=(None,), name='encoder_inputs')
encoder_embedding = Embedding(input_dim=num_encoder_tokens, output_dim=latent_dim, name='encoder_embedding')(encoder_inputs)
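# return_sequences=True exposes the per-timestep outputs that the attention layer attends over
encoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, name='encoder_lstm')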
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

# Define the decoder
decoder_inputs = Input(shape=(None,), name='decoder_inputs')
decoder_embedding = Embedding(input_dim=num_decoder_tokens, output_dim=latent_dim, name='decoder_embedding')(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, name='decoder_lstm')
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)

# Attention mechanism
attention = tf.keras.layers.Attention(name='attention_layer')
attention_result = attention([decoder_outputs, encoder_outputs])
decoder_concat_input = tf.keras.layers.Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attention_result])

# Dense layer to predict the next word
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Compile the model
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

# Summary of the model
model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                    Output Shape          Param #     Connected to
 encoder_inputs (InputLayer)     [(None, None)]        0           []
 decoder_inputs (InputLayer)     [(None, None)]        0           []
 encoder_embedding (Embedding)   (None, None, 256)     12123392    ['encoder_inputs[0][0]']
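
With its default settings, `tf.keras.layers.Attention` performs dot-product (Luong-style) attention: each decoder step is scored against every encoder step, the scores are turned into weights with a softmax, and the weights are used to average the encoder outputs. The NumPy sketch below reproduces that computation on random data to make the shapes explicit. It is an illustration only, not the Keras implementation, and the function name `dot_product_attention` is hypothetical.

```python
import numpy as np

def dot_product_attention(query, value):
    """query: (batch, T_dec, dim); value: (batch, T_enc, dim)."""
    # Similarity of every decoder step with every encoder step
    scores = np.einsum('btd,bsd->bts', query, value)
    # Numerically stable softmax over the encoder time axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted average of encoder outputs for each decoder step
    context = np.einsum('bts,bsd->btd', weights, value)
    return context, weights

rng = np.random.default_rng(0)
query = rng.normal(size=(2, 4, 8))  # 2 sentences, 4 decoder steps
value = rng.normal(size=(2, 5, 8))  # 2 sentences, 5 encoder steps
context, weights = dot_product_attention(query, value)
print(context.shape, weights.shape)  # prints: (2, 4, 8) (2, 4, 5)
```

The context tensor has the same shape as the decoder outputs, which is why the two can be concatenated along the last axis before the final `Dense` layer.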

Training the Model

In [None]:
# Expand the target data for sparse_categorical_crossentropy
decoder_target_train = np.expand_dims(decoder_target_train, -1)
decoder_target_val = np.expand_dims(decoder_target_val, -1)

# Train the model, stopping early if the validation loss stops improving
history = model.fit(
    [encoder_input_train, decoder_input_train], decoder_target_train,
    batch_size=128,
    epochs=5,
    validation_data=([encoder_input_val, decoder_input_val], decoder_target_val),
    callbacks=[early_stopping])


Epoch 1/5
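
As compiled, the loss is averaged over every position of the target, including the zero padding added by `pad_sequences`, so long runs of padding can dominate the reported value. A possible refinement, sketched here in NumPy rather than as a drop-in Keras loss, is to ignore positions whose target is the padding index. The function name `masked_sparse_ce` and the assumption that index 0 is padding-only are illustrative.

```python
import numpy as np

def masked_sparse_ce(y_true, y_prob, pad_id=0):
    """Mean cross-entropy over non-padding positions.

    y_true: (batch, T) integer targets; y_prob: (batch, T, vocab) probabilities.
    """
    # Probability assigned to the correct token at each position
    picked = np.take_along_axis(y_prob, y_true[..., None], axis=-1)[..., 0]
    token_loss = -np.log(np.clip(picked, 1e-9, 1.0))
    # 1.0 for real tokens, 0.0 for padding
    mask = (y_true != pad_id).astype(float)
    return (token_loss * mask).sum() / max(mask.sum(), 1.0)

y_true = np.array([[1, 2, 0]])      # last position is padding
y_prob = np.full((1, 3, 4), 0.25)   # uniform predictions over 4 tokens
print(np.isclose(masked_sparse_ce(y_true, y_prob), np.log(4)))  # prints: True
```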


Inference

In [None]:
# Encoder model for inference: returns the encoder outputs (needed by the
# attention layer) together with the final hidden and cell states
encoder_model = Model(encoder_inputs, [encoder_outputs] + encoder_states)

# Decoder model for inference
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# Encoder outputs are passed in explicitly so the decoder graph stays connected
encoder_outputs_input = Input(shape=tuple(encoder_outputs.shape[1:]))

# `decoder_embedding` defined earlier is a tensor, so reuse the trained layer by name
decoder_embedding_layer = model.get_layer('decoder_embedding')
decoder_outputs_inf, state_h_inf, state_c_inf = decoder_lstm(
    decoder_embedding_layer(decoder_inputs), initial_state=decoder_states_inputs)
decoder_states = [state_h_inf, state_c_inf]

attention_result_inf = attention([decoder_outputs_inf, encoder_outputs_input])
decoder_concat_inf = tf.keras.layers.Concatenate(axis=-1)([decoder_outputs_inf, attention_result_inf])
decoder_outputs_inf = decoder_dense(decoder_concat_inf)

decoder_model = Model(
    [decoder_inputs, encoder_outputs_input] + decoder_states_inputs,
    [decoder_outputs_inf] + decoder_states
)

In [None]:
# Function to decode sequences
def decode_sequence(input_seq):
    # Encode the input: encoder outputs for attention plus the final LSTM states.
    encoder_out, h, c = encoder_model.predict(input_seq, verbose=0)
    states_value = [h, c]

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # The vocabulary built above has no '<start>' token, so this falls back to index 1.
    target_seq[0, 0] = ewe_word_to_index.get('<start>', 1)

    # Greedy decoding loop: one word per iteration
    decoded_words = []
    while True:
        output_tokens, h, c = decoder_model.predict(
            [target_seq, encoder_out] + states_value, verbose=0)

        # Pick the most probable next word
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_word = ewe_index_to_word.get(sampled_token_index, '')

        # Exit condition: stop token, padding, or maximum length (counted in words).
        if sampled_word in ('<end>', '<PAD>') or len(decoded_words) >= max_len_ewe:
            break
        decoded_words.append(sampled_word)

        # Update the target sequence (length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return ' '.join(decoded_words)


Prediction

In [None]:
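# Prediction function
def predict_translation(input_sentence):
    # Map known words to their indices; words missing from the English vocabulary are skipped
    tokens = [english_word_to_index[word] for word in input_sentence.split()
              if word in english_word_to_index]
    input_seq = pad_sequences([tokens], maxlen=max_len_en, padding='post')
    translation = decode_sequence(input_seq)
    return translation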

Example translation

In [None]:
input_sentence = "It is not men of many days that are wise nor old men that understand what is right."
translation = predict_translation(input_sentence)
print(f"Translation: {translation}")