# Importing libraries for text processing and translation modeling.


1. **`pandas as pd`**: Pandas is used for data manipulation and analysis. It provides data structures and functions needed to work with structured data.

2. **`re`**: Python's built-in regular expression module, used here for pattern matching during text cleaning.

3. **`nltk`**: The Natural Language Toolkit (NLTK) is a library for working with human language data. It includes tools for text processing, tokenization, stemming, and more.

4. **`from nltk.translate import AlignedSent, IBMModel1`**:
   - **`AlignedSent`**: Represents a pair of sentences that are aligned in translation tasks.
   - **`IBMModel1`**: A statistical word-alignment model for machine translation. It estimates word-to-word translation probabilities from a parallel corpus using expectation-maximization (EM).
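A small sketch of how these two classes fit together, using a made-up three-sentence German-English toy corpus (not the notebook's data). In NLTK, `AlignedSent(words, mots)` takes two token lists, and after training `translation_table[w][m]` holds the probability of a token `w` from `words` given a token `m` from `mots`. The indexing order matters when the table is later used for lookups.

```python
from nltk.translate import AlignedSent, IBMModel1

# Toy parallel corpus, invented for illustration only.
bitext = [
    AlignedSent(["das", "haus"], ["the", "house"]),
    AlignedSent(["das", "buch"], ["the", "book"]),
    AlignedSent(["ein", "buch"], ["a", "book"]),
]

# The second argument is the number of EM iterations.
model = IBMModel1(bitext, 5)

# Indexed as translation_table[token_from_words][token_from_mots].
table = model.translation_table
print(round(table["haus"]["house"], 3), round(table["haus"]["the"], 3))
```

On this toy corpus the first value should come out larger than the second, because "haus" only ever co-occurs with "house" while "the" also has to explain "das" and "buch".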


In [1]:
import pandas as pd
import re
import nltk
from nltk.translate import AlignedSent, IBMModel1

# Reading dataset from a CSV file and processing text data.

**1. Reading the File:** `pd.read_csv` loads the CSV file, decoded as Latin-1.

**2. Column Names:** Prints the column names to confirm them, then removes a Byte Order Mark (BOM) from the column names if present.

**3. `clean_sentences` function:** Strips whitespace, converts text to lowercase, and replaces each run of non-alphanumeric characters with a space.
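Why a BOM can show up as `ï»¿`: the UTF-8 byte order mark is the three bytes `EF BB BF`, and decoding those bytes as Latin-1 yields three separate visible characters. The sketch below reproduces this with an in-memory byte string. The header row is made up to mirror the notebook's column names, and whether the `utf-8-sig` codec suits the real file depends on its actual encoding, which is not known here.

```python
# Hypothetical header row, encoded the way a UTF-8 file with a BOM would be.
raw = "\ufeffLuhya_Loogoli,Kiswahili\n".encode("utf-8")

# Decoding as Latin-1 turns the 3 BOM bytes into 3 visible characters.
first_col = raw.decode("latin-1").split(",")[0]
print(repr(first_col))

# Same repair as the notebook: strip the mis-decoded prefix.
repaired = first_col.replace("\xef\xbb\xbf", "")
print(repaired)

# The standard 'utf-8-sig' codec drops the BOM while decoding.
print(raw.decode("utf-8-sig").split(",")[0])
```

If the file really is UTF-8, reading it as Latin-1 would also garble any non-ASCII letters in the sentences, so it may be worth checking the encoding before relying on the string replacement alone.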

In [3]:
# Read the CSV file
df = pd.read_csv("/content/updated.csv", encoding='latin-1')

# Check the columns to confirm their names
print(df.columns)

# Remove BOM from column name if present
df.columns = df.columns.str.replace('ï»¿', '', regex=False)

# Extract sentences from the relevant columns
logooli_sentences = df['Luhya_Loogoli'].tolist()
kiswahili_sentences = df['Kiswahili'].tolist()

def clean_sentences(sentences):
    cleaned_sentences = []
    for sentence in sentences:
        sentence = sentence.strip()
        sentence = sentence.lower()
        sentence = re.sub(r"[^a-zA-Z0-9\s]+", " ", sentence)  # Adjusted regex to keep spaces
        cleaned_sentences.append(sentence.strip())
    return cleaned_sentences

# Clean the sentences
cleaned_logooli_sentences = clean_sentences(logooli_sentences)
cleaned_kiswahili_sentences = clean_sentences(kiswahili_sentences)

print(cleaned_logooli_sentences[:5])  # Print the first 5 cleaned Logooli sentences
print(cleaned_kiswahili_sentences[:5])  # Print the first 5 cleaned Kiswahili sentences


Index(['Luhya_Loogoli', 'Kiswahili'], dtype='object')
['lisyoma lya jeremia', 'lya duka ndi lidala kumenya lyeng ine lye li lizulira avandu  lya duka ndi kuva kuli mukunzakali  yili lya li linene ligali mutsihiri', 'mukana womwami hagati hevivala yivyo ya duka ndi kunigwa ku lisolola', '2 a liranga ligali inyinga yobudiku amaliga hehe go doolanga ku tsindama tsitye', 'mu vosi avayanze veve savaveye nomulala u mleminya mwoyo  avalina veve vosi vamukoleye agobugadi  vagwiye avasigu veve']
['maombolezo', 'boma lilibaki vipi bila watu  lilikuwa limejaa watu  lilifikiaje kuwa kama mjane  hili lilikuwa kubwa zaidi katika koo', 'bintiye kiongozi kati ya nchi hizo alianzaje kulazimishwa kuchanga', '2 ataita kabisa wakati wa usiku katika matanga yake unayookota katika tama zake', 'kwa wote wapendwa wake hakuna mmoja anayeupendeza moyo wake  rafiki zake wote ni waongo  wamekua maadui zake']
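Tokens such as `lyeng ine` in the output above are consistent with a word-internal character (for example an apostrophe) having been replaced by a space, which would split one word into two tokens. A minimal sketch, assuming the raw text does contain apostrophes (not verified against the CSV); the sample string and the `keep_apostrophes` flag are hypothetical.

```python
import re

def clean_sentence(sentence, keep_apostrophes=False):
    """Lowercase, replace disallowed characters with a space, collapse whitespace."""
    allowed = r"a-z0-9\s'" if keep_apostrophes else r"a-z0-9\s"
    sentence = sentence.strip().lower()
    sentence = re.sub(rf"[^{allowed}]+", " ", sentence)
    # Collapse runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", sentence).strip()

sample = "Lyeng'ine lye li; Avandu!"  # made-up input, not from the dataset
default_result = clean_sentence(sample)
kept_result = clean_sentence(sample, keep_apostrophes=True)
print(default_result)  # lyeng ine lye li avandu
print(kept_result)     # lyeng'ine lye li avandu
```

Keeping such characters changes the vocabulary the model sees, so it is a modelling choice rather than a pure cleanup step.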


# Function for training the translation model.


**1. Align Sentences:** Creates `AlignedSent` objects from pairs of source and target sentences. IBM Model 1, a statistical machine translation model, is trained on these aligned pairs.

**2. Train IBM Model**: Initializes and trains IBM Model 1 on the aligned sentences.

*Modified cell incorporating error handling and iteration*

**3. Error Handling:** Checks that the source and target sentence lists have the same length, because `zip` would otherwise silently drop the unmatched sentences.

**4. Training Iterations:** The `10` in `IBMModel1(aligned_sentences, 10)` is the number of training iterations. Adjust it based on data size and training requirements.
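To illustrate what each training iteration does, here is a simplified pure-Python version of the EM loop behind IBM Model 1. It is a sketch for intuition only, not NLTK's implementation: it omits the NULL word, and the function name `ibm1_em` and the toy corpus are invented for this example.

```python
from collections import defaultdict

def ibm1_em(pairs, iterations):
    """Estimate t(f | e) from (f_words, e_words) pairs with plain EM (no NULL word)."""
    f_vocab = {f for f_words, _ in pairs for f in f_words}
    # Uniform start: every f is equally likely given any e.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e
        for f_words, e_words in pairs:
            for f in f_words:
                # E-step: split one unit of count for f across the e words.
                norm = sum(t[(f, e)] for e in e_words)
                for e in e_words:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalise expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
for n in (1, 5, 20):
    print(n, round(ibm1_em(pairs, n)[("das", "the")], 3))
```

On this toy corpus the probability for the pair ("das", "the") grows with each iteration, and the gains shrink as the estimates settle, which is why raising the iteration count eventually stops helping.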

In [6]:
from nltk.translate import AlignedSent, IBMModel1

def train_translation_model(source_sentences, target_sentences):
    # Ensure both source and target sentences lists are of the same length
    if len(source_sentences) != len(target_sentences):
        raise ValueError("Source and target sentences must be of the same length")

    # Create aligned sentences
    aligned_sentences = [AlignedSent(source.split(), target.split()) for source, target in zip(source_sentences, target_sentences)]

    # Train IBM Model 1
    ibm_model = IBMModel1(aligned_sentences, 10)  # Adjust iterations if needed

    return ibm_model

# Train the translation model
translation_model = train_translation_model(cleaned_logooli_sentences, cleaned_kiswahili_sentences)


# Function to interactively translate Luhya Logooli sentences to Kiswahili using the trained IBM Model


In [19]:
!pip install ipywidgets


Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.1


In [23]:
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML

def translate_input(ibm_model):
    # Create widgets
    text_input = widgets.Text(
        description='Sentence:',
        placeholder='Type your Luhya Logooli sentence here',
        layout=widgets.Layout(width='80%')
    )

    translate_button = widgets.Button(
        description='Translate',
        button_style='success',
        layout=widgets.Layout(width='20%')
    )

    output = widgets.Output()

    def on_button_click(b):
        with output:
            clear_output(wait=True)  # Clear previous output

            source_text = text_input.value
            if source_text.lower() == 'q':
                print("Quitting...")
                return

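            # Clean the whole sentence first, then split it into words,
            # so punctuation inside a word cannot leave a space inside a token
            cleaned_text = clean_sentences([source_text])[0]
            source_words = cleaned_text.split()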
            translated_words = []

            for source_word in source_words:
                max_prob = 0.0
                translated_word = None
                for target_word in ibm_model.translation_table.get(source_word, {}):
                    prob = ibm_model.translation_table[source_word][target_word]
                    if prob > max_prob:
                        max_prob = prob
                        translated_word = target_word
                if translated_word is not None:
                    translated_words.append(translated_word)

            translated_text = ' '.join(translated_words)
            print("Translated text:", translated_text)

    # Link the button to the function
    translate_button.on_click(on_button_click)

    # Display widgets with some styling
    display(HTML("<h2>Translation Interface</h2>"))
    display(widgets.VBox([text_input, translate_button, output]))

# Call the function with your model
translate_input(translation_model)


VBox(children=(Text(value='', description='Sentence:', layout=Layout(width='80%'), placeholder='Type your Luhy…
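The inner loop in the cell above picks, for each input word, the entry with the highest probability. The same lookup can be written more compactly with `max`. This is a sketch on a plain dict of dicts standing in for `ibm_model.translation_table`; the tokens and probabilities are invented, `best_translation` is a hypothetical helper, and skipping the `None` key assumes the NULL token is stored under `None`, as in NLTK's IBM models.

```python
def best_translation(table, word):
    """Return the highest-probability entry for `word`, or `word` itself if none exists."""
    # Ignore the None key (assumed NULL token) so it is never emitted as a word.
    candidates = {t: p for t, p in table.get(word, {}).items() if t is not None}
    if not candidates:
        return word
    return max(candidates, key=candidates.get)

# Stand-in for ibm_model.translation_table, with invented probabilities.
toy_table = {
    "s1": {"t1": 0.7, "t2": 0.2, None: 0.1},
    "s2": {"t3": 0.9, "t1": 0.1},
}

result = " ".join(best_translation(toy_table, w) for w in "s1 s2 s9".split())
print(result)  # t1 t3 s9
```

Unknown words such as `s9` are passed through unchanged here, which differs from the widget cell above, where words without a match are dropped.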

In [25]:
# Define the data cleaning function
def clean_sentences(sentences):
    cleaned_sentences = []
    for sentence in sentences:
        sentence = sentence.strip().lower()
        sentence = re.sub(r"[^a-zA-Z0-9\s]+", " ", sentence)
        cleaned_sentences.append(sentence.strip())
    return cleaned_sentences

# Define the translation function
def translate_sentence(ibm_model, sentence):
    cleaned_text = clean_sentences([sentence])[0]
    source_words = cleaned_text.split()
    translated_words = []
    for source_word in source_words:
        if source_word in ibm_model.translation_table:
            max_prob = 0.0
            translated_word = None
            for target_word in ibm_model.translation_table[source_word]:
                prob = ibm_model.translation_table[source_word][target_word]
                if prob > max_prob:
                    max_prob = prob
                    translated_word = target_word
            if translated_word is not None:
                translated_words.append(translated_word)
            else:
                translated_words.append(source_word)  # Keep the word if no translation found
        else:
            translated_words.append(source_word)  # Keep the word if not in the model
    return ' '.join(translated_words)

# Dummy function for loading a model (replace with actual model loading code)
def load_model():
    # Dummy data for the purpose of example
    train_logooli = ["hello world", "how are you"]
    train_kiswahili = ["habari dunia", "habari yako"]
    return train_translation_model(train_logooli, train_kiswahili, iterations=20)

# Dummy function for training the model (replace with actual training code)
def train_translation_model(source_sentences, target_sentences, iterations=10):
    aligned_sentences = [AlignedSent(source.split(), target.split()) for source, target in zip(source_sentences, target_sentences)]
    ibm_model = IBMModel1(aligned_sentences, iterations)
    return ibm_model

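# Load or train your model. As written, this replaces the model trained on the
# dataset with one trained on the placeholder sentences in load_model().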
translation_model = load_model()

# Create interactive widgets
input_text = widgets.Text(
    description='Enter sentence:',
    placeholder='Type here'
)

output_text = widgets.Output()

def on_button_click(b):
    with output_text:
        sentence = input_text.value
        if sentence:
            translated_sentence = translate_sentence(translation_model, sentence)
            print("Translated sentence:", translated_sentence)
        else:
            print("Please enter a sentence to translate.")

button = widgets.Button(description="Translate")
button.on_click(on_button_click)

display(input_text, button, output_text)


Text(value='', description='Enter sentence:', placeholder='Type here')

Button(description='Translate', style=ButtonStyle())

Output()

# Evaluating the translation model using BLEU scores.



In [10]:
from sklearn.model_selection import train_test_split
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Split the dataset
train_logooli, test_logooli, train_kiswahili, test_kiswahili = train_test_split(
    cleaned_logooli_sentences, cleaned_kiswahili_sentences, test_size=0.2, random_state=42
)

# Train the model on the training set
translation_model = train_translation_model(train_logooli, train_kiswahili)

# Function to translate sentences
def translate_sentence(ibm_model, sentence):
    cleaned_text = clean_sentences([sentence])[0].split()
    translated_words = []
    for source_word in cleaned_text:
        if source_word in ibm_model.translation_table:
            max_prob = 0.0
            translated_word = None
            for target_word in ibm_model.translation_table[source_word]:
                prob = ibm_model.translation_table[source_word][target_word]
                if prob > max_prob:
                    max_prob = prob
                    translated_word = target_word
            if translated_word is not None:
                translated_words.append(translated_word)
        else:
            translated_words.append(source_word)  # Keep the source word if it is not in the model
    return ' '.join(translated_words)

# Evaluate the model
bleu_scores = []
smoothie = SmoothingFunction().method4  # Smoothing method to handle short sentences

for source_sentence, target_sentence in zip(test_logooli, test_kiswahili):
    translated_sentence = translate_sentence(translation_model, source_sentence)
    reference = [target_sentence.split()]
    candidate = translated_sentence.split()
    try:
        bleu_score = sentence_bleu(reference, candidate, smoothing_function=smoothie)
        bleu_scores.append(bleu_score)
    except Exception as e:
        print(f"Error calculating BLEU score for sentence: {source_sentence}\n{e}")

average_bleu_score = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
print("Average BLEU score:", average_bleu_score)


Average BLEU score: 0.009463065465849624
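To make concrete what a score of about 0.009 reflects, the sketch below computes two ingredients of BLEU by hand: clipped n-gram precision and the brevity penalty. It is an illustration, not a replacement for `sentence_bleu`, which combines 1- to 4-gram precisions by default and applies the chosen smoothing. The reference tokens are taken from the cleaned Kiswahili output shown earlier; the candidate is made up.

```python
import math
from collections import Counter

def clipped_precision(reference, candidate, n=1):
    """Fraction of candidate n-grams that also occur in the reference (counts clipped)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def brevity_penalty(reference, candidate):
    """Penalise candidates that are shorter than the reference."""
    if not candidate:
        return 0.0
    if len(candidate) >= len(reference):
        return 1.0
    return math.exp(1 - len(reference) / len(candidate))

reference = "boma lilibaki vipi bila watu".split()
candidate = "boma watu watu bila".split()  # invented word-by-word style output
p1 = clipped_precision(reference, candidate, n=1)
p2 = clipped_precision(reference, candidate, n=2)
bp = brevity_penalty(reference, candidate)
print(p1, p2, round(bp, 4))
```

In this example three of the four candidate words appear in the reference, yet no bigram matches. A word-by-word lookup ignores word order, so it tends to score poorly on the higher-order n-grams, which is one plausible reason for the low average.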


# Fine-tuning the number of training iterations for IBM Model 1 and re-evaluating the model.

In [11]:
# Fine-tune the number of iterations
def train_translation_model(source_sentences, target_sentences, iterations=10):
    aligned_sentences = [AlignedSent(source.split(), target.split()) for source, target in zip(source_sentences, target_sentences)]
    ibm_model = IBMModel1(aligned_sentences, iterations)
    return ibm_model

# Train with more iterations
translation_model = train_translation_model(train_logooli, train_kiswahili, iterations=20)

# Re-evaluate the model
from nltk.translate.bleu_score import SmoothingFunction

bleu_scores = []
smoothie = SmoothingFunction().method4  # Smoothing method to handle short sentences

for source_sentence, target_sentence in zip(test_logooli, test_kiswahili):
    translated_sentence = translate_sentence(translation_model, source_sentence)
    reference = [target_sentence.split()]
    candidate = translated_sentence.split()
    try:
        bleu_score = sentence_bleu(reference, candidate, smoothing_function=smoothie)
        bleu_scores.append(bleu_score)
    except Exception as e:
        print(f"Error calculating BLEU score for sentence: {source_sentence}\n{e}")

average_bleu_score = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
print("Average BLEU score after fine-tuning:", average_bleu_score)


Average BLEU score after fine-tuning: 0.009560483011628634
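The two averages differ only slightly. A quick calculation of the change, using the values printed above:

```python
before = 0.009463065465849624  # average BLEU with 10 iterations
after = 0.009560483011628634   # average BLEU with 20 iterations

abs_change = after - before
rel_change = abs_change / before
print(f"absolute: {abs_change:.6f}, relative: {rel_change:.2%}")
# absolute: 0.000097, relative: 1.03%
```

A gain of about 1% from doubling the iterations suggests that the iteration count is not the main limit on quality here. That reading is tentative, since it rests on a single train/test split.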
