<a href="https://colab.research.google.com/github/Praise-Atadja/language_translation/blob/main/Ewe_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PROJECT NAME: EWE LANGUAGE TRANSLATION WITH RNNS**

## IMPLEMENTATION

The Ewe people are an African ethnic group. The largest Ewe population is in Ghana (3.3 million), and the second largest is in Togo (2 million). They speak the Ewe language, which belongs to the Gbe family of Niger-Congo languages. They are related to other speakers of Gbe languages such as the Fon, Gen, Phla Phera, and the Aja people of Togo and Benin [1].
Ewe is written using the Latin alphabet, to which a few letters have been added, some derived from the International Phonetic Alphabet.

Ewe is spoken by the majority of people in the Volta Region of Ghana, West Africa. It is also widely spoken in Togo and in parts of Benin. Given how widely the language is spoken, a dependable translator is needed, and this project attempts to build one.

## THE DATASET


This dataset contains pairs of Ewe sentences and their English translations. It can be used for research or for training a language model.

In [None]:
# Import necessary libraries
from collections import Counter

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Mount Google Drive so the dataset CSV can be read
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading data

In [None]:
#load data
def load_data():
    # Load data
    ewe_english_data = pd.read_csv('/content/drive/MyDrive/EWE_ENGLISH.csv')

    return ewe_english_data

# Load the data
ewe_english_data = load_data()

In [None]:
# Reduce the dataset size by taking a random sample (10% of the original data)
ewe_english_data = ewe_english_data.sample(frac=0.1, random_state=42).reset_index(drop=True)

In [None]:
# Display the first few rows of the data
ewe_english_data.head()

Unnamed: 0.1,Unnamed: 0,EWE,ENGLISH
0,27922,"3 Eya ta Yona ɖo to, eye wòtso yi Niniwe le Ye...",3 and Jonah went off to Nineveh in accordance ...
1,1660,Ale si maɖe nye dzixɔsea me\n,Explaining my belief
2,96,Ɣehowa be edze vivinɔ\n,Yehowa says salt is sweet
3,6237,"Kaka nane naɖi kikli ko la, menyɔna.",The slightest noise wakes me up.
4,6650,"Ne míeyi nusrɔ̃ƒe la, egbɔa dzi ɖi kpɔa kɔmpiu...","At the meeting, he patiently navigates through..."


In [None]:
# Remove unnecessary columns
ewe_english = ewe_english_data[['ENGLISH', 'EWE']]

In [None]:
# Display the first few rows of the data
ewe_english.head()

Unnamed: 0,ENGLISH,EWE
0,3 and Jonah went off to Nineveh in accordance ...,"3 Eya ta Yona ɖo to, eye wòtso yi Niniwe le Ye..."
1,Explaining my belief,Ale si maɖe nye dzixɔsea me\n
2,Yehowa says salt is sweet,Ɣehowa be edze vivinɔ\n
3,The slightest noise wakes me up.,"Kaka nane naɖi kikli ko la, menyɔna."
4,"At the meeting, he patiently navigates through...","Ne míeyi nusrɔ̃ƒe la, egbɔa dzi ɖi kpɔa kɔmpiu..."


In [None]:
# Drop any rows with missing values by creating a new DataFrame
ewe_english_cleaned = ewe_english.dropna().reset_index(drop=True)
print(ewe_english_cleaned.head())

                                             ENGLISH  \
0  3 and Jonah went off to Nineveh in accordance ...   
1                               Explaining my belief   
2                          Yehowa says salt is sweet   
3                   The slightest noise wakes me up.   
4  At the meeting, he patiently navigates through...   

                                                 EWE  
0  3 Eya ta Yona ɖo to, eye wòtso yi Niniwe le Ye...  
1                      Ale si maɖe nye dzixɔsea me\n  
2                            Ɣehowa be edze vivinɔ\n  
3               Kaka nane naɖi kikli ko la, menyɔna.  
4  Ne míeyi nusrɔ̃ƒe la, egbɔa dzi ɖi kpɔa kɔmpiu...  


Tokenization and Preprocessing

In [None]:
# Create a Vocabulary
def build_vocab(sentences):
    words = [word for sentence in sentences for word in sentence.split()]
    word_counts = Counter(words)
    vocab = {word: idx + 1 for idx, (word, _) in enumerate(word_counts.items())}
    vocab['<PAD>'] = 0
    return vocab

# Build vocabularies for English and Ewe
english_vocab = build_vocab(ewe_english_cleaned['ENGLISH'])
ewe_vocab = build_vocab(ewe_english_cleaned['EWE'])

# Add special tokens
english_vocab['<UNK>'] = len(english_vocab)
english_vocab['<start>'] = len(english_vocab)
english_vocab['<end>'] = len(english_vocab)
ewe_vocab['<UNK>'] = len(ewe_vocab)
ewe_vocab['<start>'] = len(ewe_vocab)
ewe_vocab['<end>'] = len(ewe_vocab)
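
To make the resulting index layout concrete, here is a minimal sketch on a toy corpus. The two sentences are made up for illustration, and `build_vocab` is repeated so the snippet runs on its own. It shows that `<PAD>` takes index 0, words are numbered from 1 in order of first appearance, and the special tokens are appended at the end.

```python
from collections import Counter

def build_vocab(sentences):
    # Same logic as the notebook cell: unique words numbered from 1, <PAD> at 0
    words = [word for sentence in sentences for word in sentence.split()]
    word_counts = Counter(words)
    vocab = {word: idx + 1 for idx, word in enumerate(word_counts)}
    vocab['<PAD>'] = 0
    return vocab

# Hypothetical toy corpus, not taken from the dataset
toy_sentences = ["the cat sat", "the dog sat down"]
toy_vocab = build_vocab(toy_sentences)

# Append the special tokens after the regular words
for token in ('<UNK>', '<start>', '<end>'):
    toy_vocab[token] = len(toy_vocab)

for word, idx in sorted(toy_vocab.items(), key=lambda item: item[1]):
    print(idx, word)
```

Because each special token is assigned `len(vocab)` at the moment it is added, every index stays unique and the largest index is one less than the vocabulary size, which is what the `Embedding` and `Dense` layers expect.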

In [None]:
# Create Word to Index and Index to Word Mappings
def create_mappings(vocab):
    word_to_index = vocab
    index_to_word = {idx: word for word, idx in vocab.items()}
    return word_to_index, index_to_word

english_word_to_index, english_index_to_word = create_mappings(english_vocab)
ewe_word_to_index, ewe_index_to_word = create_mappings(ewe_vocab)


In [None]:
# Convert Sentences to Sequences
def text_to_sequence(text, word_to_index, unknown_token='<UNK>'):
    return [word_to_index.get(word, word_to_index[unknown_token]) for word in text.split()]

def add_start_end(sequence, word_to_index):
    # Wrap a target sequence in <start> ... <end> so the decoder learns where a sentence begins and stops
    return [word_to_index['<start>']] + sequence + [word_to_index['<end>']]

ewe_english_cleaned['ENGLISH_seq'] = ewe_english_cleaned['ENGLISH'].apply(lambda x: text_to_sequence(x, english_word_to_index))
ewe_english_cleaned['EWE_seq'] = ewe_english_cleaned['EWE'].apply(
    lambda x: add_start_end(text_to_sequence(x, ewe_word_to_index), ewe_word_to_index))

max_len_en = max(ewe_english_cleaned['ENGLISH_seq'].apply(len))
max_len_ewe = max(ewe_english_cleaned['EWE_seq'].apply(len))

# Pad every sequence with <PAD> (index 0) at the end, up to the longest sentence
encoder_input_data = pad_sequences(ewe_english_cleaned['ENGLISH_seq'], maxlen=max_len_en, padding='post')
decoder_input_data = pad_sequences(ewe_english_cleaned['EWE_seq'], maxlen=max_len_ewe, padding='post')

# Create target data for decoder by shifting the decoder input by one time step
decoder_target_data = np.zeros_like(decoder_input_data)
decoder_target_data[:, :-1] = decoder_input_data[:, 1:]
decoder_target_data[:, -1] = 0
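
The one-step shift is what makes teacher forcing work: at each position the decoder receives the previous target word and is asked to predict the next one. The following is a minimal sketch on a single toy sequence, assuming the target sentence is wrapped in `<start>` and `<end>` ids. The ids and the helper names `pad_post` and `shift_left` are made up for illustration.

```python
# Hypothetical token ids for illustration only
PAD, START, END = 0, 1, 2

def pad_post(seq, maxlen, pad=PAD):
    # Append padding at the end; assumes len(seq) <= maxlen
    return seq + [pad] * (maxlen - len(seq))

def shift_left(seq, pad=PAD):
    # Drop the first token and pad the last position,
    # mirroring target[:, :-1] = input[:, 1:] and target[:, -1] = 0
    return seq[1:] + [pad]

words = [5, 6, 7]  # ids of a three-word target sentence
decoder_input = pad_post([START] + words + [END], maxlen=7)
decoder_target = shift_left(decoder_input)

for step, (given, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(f"step {step}: input {given} -> target {expected}")
```

At step 0 the decoder sees `<start>` and must produce the first word; when it sees the last word it must produce `<end>`. Every later position pairs padding with padding.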


In [None]:
# Split the data into training and validation sets (80% / 20%)
# random_state fixes the split so that runs are reproducible
encoder_input_train, encoder_input_val, decoder_input_train, decoder_input_val, decoder_target_train, decoder_target_val = train_test_split(
    encoder_input_data, decoder_input_data, decoder_target_data, test_size=0.2, random_state=42
)

Modeling

Model Architecture and Training

In [None]:
# Define parameters
latent_dim = 256
num_encoder_tokens = len(english_word_to_index)
num_decoder_tokens = len(ewe_word_to_index)

# Define the encoder
encoder_inputs = Input(shape=(None,), name='encoder_inputs')
encoder_embedding = Embedding(input_dim=num_encoder_tokens, output_dim=latent_dim, name='encoder_embedding')(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True, return_sequences=True, name='encoder_lstm')
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

# Define the decoder
decoder_inputs = Input(shape=(None,), name='decoder_inputs')
decoder_embedding = Embedding(input_dim=num_decoder_tokens, output_dim=latent_dim, name='decoder_embedding')(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, name='decoder_lstm')
decoder_lstm_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)

# Attention mechanism
attention = tf.keras.layers.Attention(name='attention_layer')
attention_result = attention([decoder_lstm_outputs, encoder_outputs])
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_lstm_outputs, attention_result])

# Dense layer to predict the next word
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Model summary
model.summary()

# Training the model
history = model.fit(
    [encoder_input_train, decoder_input_train], decoder_target_train,
    batch_size=128,
    epochs=10,
    validation_data=([encoder_input_val, decoder_input_val], decoder_target_val)
)

# Save the model
model.save('ewe_english_translation_model.h5')


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 decoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 encoder_embedding (Embeddi  (None, None, 256)            3103232   ['encoder_inputs[0][0]']      
 ng)                                                                                          
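
The model above is compiled with plain `sparse_categorical_crossentropy`, and the targets are post-padded with index 0, so padded time steps count toward both the loss and the accuracy. The sketch below illustrates the effect in pure Python, not Keras. The function name `mean_cross_entropy` and the toy probabilities are assumptions made for illustration. It compares the average loss of a model that always predicts `<PAD>` with and without ignoring the padded positions.

```python
import math

PAD = 0  # index used for padding, as in the vocabularies above

def mean_cross_entropy(targets, probs, ignore_pad=False):
    """Average negative log-likelihood over time steps.

    With ignore_pad=True, steps whose target is PAD are skipped.
    Assumes at least one non-PAD target.
    """
    losses = [
        -math.log(step_probs[target])
        for target, step_probs in zip(targets, probs)
        if not (ignore_pad and target == PAD)
    ]
    return sum(losses) / len(losses)

# Toy target: two real words followed by four padded positions
targets = [3, 1, PAD, PAD, PAD, PAD]
# A degenerate model that puts almost all probability on <PAD> at every step
always_pad = [[0.97, 0.01, 0.01, 0.01]] * len(targets)

unmasked = mean_cross_entropy(targets, always_pad)
masked = mean_cross_entropy(targets, always_pad, ignore_pad=True)
print(f"unmasked: {unmasked:.3f}  masked: {masked:.3f}")
```

The unmasked average is much lower than the masked one, because the many padded positions reward predicting `<PAD>`. Excluding padded positions from the loss is one common way to keep a model from collapsing onto the padding token.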



Inference

In [None]:
# Reuse the trained decoder embedding layer; a newly created Embedding would start with random weights
decoder_embedding_inference = model.get_layer('decoder_embedding')

# Define the encoder model for inference
encoder_model = Model(encoder_inputs, [encoder_outputs, state_h, state_c])

# Define the decoder model for inference
decoder_state_input_h = Input(shape=(latent_dim,), name='decoder_state_input_h')
decoder_state_input_c = Input(shape=(latent_dim,), name='decoder_state_input_c')
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_encoder_outputs = Input(shape=(None, latent_dim), name='decoder_encoder_outputs')

decoder_inputs_single = Input(shape=(1,), name='decoder_inputs_single')
decoder_embedding_single = decoder_embedding_inference(decoder_inputs_single)

decoder_lstm_output, state_h, state_c = decoder_lstm(
    decoder_embedding_single, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]

attention_result = attention([decoder_lstm_output, decoder_encoder_outputs])
decoder_concat_input = Concatenate(axis=-1)([decoder_lstm_output, attention_result])
decoder_outputs = decoder_dense(decoder_concat_input)

decoder_model = Model(
    [decoder_inputs_single, decoder_encoder_outputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)


Prediction

In [None]:
def decode_sequence(input_seq):
    # Encode the input: per-step encoder outputs (for attention) plus the final LSTM states.
    encoder_output, state_h, state_c = encoder_model.predict(input_seq, verbose=0)
    states_value = [state_h, state_c]

    # Start from a target sequence of length 1 holding the start token.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = ewe_word_to_index['<start>']

    decoded_words = []

    # Greedy decoding: emit one word per step until <end> or the maximum length.
    while len(decoded_words) < max_len_ewe:
        output_tokens, h, c = decoder_model.predict(
            [target_seq, encoder_output] + states_value, verbose=0)

        # Pick the most probable next word
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_word = ewe_index_to_word.get(sampled_token_index, '<UNK>')

        # Exit condition: stop at the end token without adding it to the translation.
        if sampled_word == '<end>':
            break
        decoded_words.append(sampled_word)

        # Feed the sampled word back in as the next decoder input (length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return ' '.join(decoded_words)
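
The loop in `decode_sequence` is greedy decoding: at each step it takes the highest-scoring word and feeds it back in as the next input. The sketch below isolates that control flow without TensorFlow. `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a scripted stand-in for `decoder_model.predict`, not the real model.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Feed the last predicted id back in until end_id appears or max_len is reached."""
    token, state, output = start_id, None, []
    while len(output) < max_len:
        scores, state = step_fn(token, state)
        # argmax over the vocabulary scores
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == end_id:
            break
        output.append(token)
    return output

def toy_step(token, state):
    # Hypothetical decoder step: ignores its input and emits ids 3, 4, then <end> (id 2)
    position = 0 if state is None else state
    script = [3, 4, 2]
    scores = [0.0] * 5
    scores[script[min(position, len(script) - 1)]] = 1.0
    return scores, position + 1

print(greedy_decode(toy_step, start_id=1, end_id=2, max_len=10))
```

The `max_len` bound matters as much as the end token: a model that never predicts `<end>` would otherwise loop forever.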

In [None]:
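from tensorflow.keras.models import load_model

# Reload the saved model. Note that decode_sequence relies on encoder_model and decoder_model,
# which are built from the layers already in memory, so this reload does not affect the translation.
model = load_model('ewe_english_translation_model.h5')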

# Define a function to translate English to Ewe
def translate_to_ewe(input_text):
    # Convert input text to sequence
    input_seq = text_to_sequence(input_text, english_word_to_index)
    input_seq = pad_sequences([input_seq], maxlen=max_len_en, padding='post')

    # Get the translated sentence
    translated_sentence = decode_sequence(input_seq)

    return translated_sentence

# Test the translation function
input_text = "Hello, how are you?"
translated_text = translate_to_ewe(input_text)
print("English:", input_text)
print("Ewe:", translated_text)


English: Hello, how are you?
Ewe: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 
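
One detail of the input side is worth checking when reading results like this. `text_to_sequence` splits on whitespace only, so punctuation stays attached to words, and a query token such as `you?` is found only if that exact string occurred in the training data. Otherwise it maps to `<UNK>`. Here is a minimal sketch with a made-up vocabulary (the words and ids are assumptions, not taken from the dataset):

```python
def text_to_sequence(text, word_to_index, unknown_token='<UNK>'):
    # Same whitespace tokenization as the notebook
    return [word_to_index.get(word, word_to_index[unknown_token]) for word in text.split()]

# Hypothetical vocabulary for illustration only
toy_word_to_index = {'<PAD>': 0, 'how': 1, 'are': 2, 'you': 3, 'Hello': 4, '<UNK>': 5}

sequence = text_to_sequence("Hello, how are you?", toy_word_to_index)
unknown_count = sequence.count(toy_word_to_index['<UNK>'])
print(sequence, "unknown tokens:", unknown_count)
```

Here `Hello,` and `you?` both fall back to `<UNK>` even though `Hello` and `you` are in the vocabulary, so half of the encoder input carries no information about the query.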

Model Application