#Overview

This script performs preprocessing and tokenization of Yoruba phrases and their English literal translations from a CSV file. The goal is to clean the data, preserve important diacritics, remove noise (like HTML tags and extra spaces), and extract tokens (words) to prepare the dataset for translation modeling.

The input is a CSV file, "yoruba_phrases.csv", containing the following columns:

Phrase: Yoruba proverbs.

Literal Translation: Word-for-word English translation of the proverbs.

Actual Meaning: Intended meaning of the proverb.

The output is a cleaned and tokenized CSV file "yoruba_phrases_processesed.csv" that includes:

Phrase: Cleaned Yoruba phrase.

Literal Translation: Cleaned literal English translation.

Phrase_Tokens: Yoruba phrase word tokens.

Literal_Translation_Tokens: Literal translation word tokens.

In [None]:
# Import libraries
import pandas as pd
import unicodedata
import re

In [None]:
# Loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/yoruba_phrases.csv")

#Preprocessing

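The normalize_and_clean function applies five steps in order. First, it normalizes Unicode text to NFC (Normalization Form Composed) so that accented characters (e.g., ẹ, ò) have a consistent representation. Second, it removes HTML tags with a regular expression. Third, it drops every character that is not a word character (\w, which in Python 3 matches Unicode letters, digits, and the underscore, not only a–z, A–Z, 0–9, _), whitespace, a character in the range À–ỹ (U+00C0–U+1EF9, which includes the accented Latin letters and the combining tone marks; the second range in the pattern, à–ỹ, is already contained in it), or an apostrophe ('). Fourth, it collapses runs of whitespace into a single space and trims leading and trailing spaces. Finally, it converts the text to lowercase.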
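The steps can be traced one at a time on a small sample. The sketch below uses an illustrative input line (not taken from the dataset) wrapped in HTML tags, and writes the À–ỹ range with escapes (\u00C0–\u1EF9); the intermediate variable names are assumptions made for the example.

```python
import re
import unicodedata

# Illustrative input: "e\u0301" is a decomposed "é" (letter + combining acute accent)
raw = "<p>Ile\u0301  ọba tó jó, <i>ẹwà</i> ló bù sí!</p>"

step1 = unicodedata.normalize("NFC", raw)            # compose "e" + U+0301 into one code point
step2 = re.sub(r"<.*?>", "", step1)                  # strip <p>, <i>, </i>, </p>
step3 = re.sub(r"[^\w\s\u00C0-\u1EF9']", "", step2)  # drop "," and "!"
step4 = re.sub(r"\s+", " ", step3).strip()           # collapse the double space
step5 = step4.lower()                                # "Ilé" becomes "ilé"

for label, value in [("NFC", step1), ("no tags", step2), ("filtered", step3),
                     ("spacing", step4), ("lowercase", step5)]:
    print(f"{label:>10}: {value}")
```

The last line printed is the form that would be stored in the cleaned column for this input.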

In [None]:
# Define a function to normalize and clean text
def normalize_and_clean(text):
    text = unicodedata.normalize("NFC", str(text))    # Normalize Unicode text
    text = re.sub(r"<.*?>", "", text)                 # Remove HTML tags
    text = re.sub(r"[^\w\sÀ-ỹà-ỹ']", "", text)        # Drop punctuation and symbols; keep letters, diacritics, apostrophes
    text = re.sub(r"\s+", " ", text).strip()          # Remove extra whitespaces
    text = text.lower()                               # Convert all characters to lowercase

    return text

# Applying the cleaning function
df["Phrase"] = df["Phrase"].apply(normalize_and_clean)
df["Literal Translation"] = df["Literal Translation"].apply(normalize_and_clean)
df["Actual Meaning"] = df["Actual Meaning"].apply(normalize_and_clean)
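One thing to watch: because the function calls str(text), a missing value is turned into the literal string "nan" rather than staying empty. If the CSV contains empty cells (pandas loads these as NaN by default), a guard along the following lines may help. This is a minimal sketch; `clean_or_empty` is a hypothetical helper, not part of the original script, and returning an empty string is only one possible policy.

```python
import math
import re
import unicodedata

def normalize_and_clean(text):
    # Same steps as the cleaning function above (range written with escapes)
    text = unicodedata.normalize("NFC", str(text))
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^\w\s\u00C0-\u1EF9']", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

def clean_or_empty(value):
    # Hypothetical wrapper: map None/NaN to "" instead of the string "nan"
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return normalize_and_clean(value)

missing = float("nan")                     # stand-in for an empty CSV cell
print(repr(normalize_and_clean(missing)))  # 'nan'
print(repr(clean_or_empty(missing)))       # ''
```

It could be applied in the same way, e.g. df["Phrase"].apply(clean_or_empty). Dropping incomplete rows with df.dropna(subset=[...]) before cleaning is another option.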

In [None]:
# Removing columns not needed in building the model
df = df.drop(columns=["Actual Meaning"])

#Tokenization

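The simple_tokenizer function uses Python's re module to extract words from the text: it matches runs of word characters (\w+; in Python 3 these are Unicode letters, digits, and the underscore) that are bounded by word boundaries (\b). Any other character, including apostrophes and standalone combining marks, acts as a token separator.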

In [None]:
# Define a simple tokenizer function to perform word tokenization
def simple_tokenizer(text):
    return re.findall(r'\b\w+\b', str(text))       # Tokenize on word boundaries without altering the text
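One caveat worth checking on real data: Python's \w does not match combining marks, and some Yoruba letters (for example ọ̀ or ẹ́) have no single precomposed code point, so they keep a separate combining tone mark even after NFC normalization. The sketch below shows how the \b\w+\b pattern splits such a word, next to one possible alternative. `mark_aware_tokenizer` is a hypothetical name, not part of the original script, and its pattern is an assumption about what should count as a token.

```python
import re
import unicodedata

def simple_tokenizer(text):
    # Same pattern as the tokenizer above
    return re.findall(r"\b\w+\b", str(text))

def mark_aware_tokenizer(text):
    # Hypothetical variant: also accept combining diacritical marks (U+0300-U+036F)
    return re.findall(r"[\w\u0300-\u036F]+", str(text))

# "ọ̀rọ̀ kan" written with escapes; NFC leaves U+0300 as a separate code point
phrase = unicodedata.normalize("NFC", "\u1ECD\u0300r\u1ECD\u0300 kan")

split_tokens = simple_tokenizer(phrase)
whole_tokens = mark_aware_tokenizer(phrase)

print(split_tokens)  # the first word is split and loses its tone marks
print(whole_tokens)  # the first word stays intact, tone marks included
```

Whether the split matters depends on the downstream model; if tone marks carry meaning for the translation task, keeping them inside the tokens is probably the safer choice.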

In [None]:
# Applying the tokenizer
df['Phrase_Tokens'] = df['Phrase'].apply(simple_tokenizer)
df['Literal_Translation_Tokens'] = df['Literal Translation'].apply(simple_tokenizer)

In [None]:
# Save the processed dataframe into a new CSV file
df.to_csv('/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/yoruba_phrases_processesed.csv', index=False)
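Because CSV cells hold plain text, each token list is effectively written out as its string representation and comes back as a string, not a list, when the file is reloaded. A minimal sketch of restoring a list with the standard library's ast.literal_eval follows; the sample tokens are made up for illustration.

```python
import ast

tokens = ["ilé", "ọba", "tó", "jó"]      # illustrative tokens

cell_text = str(tokens)                  # the text form a list takes inside a CSV cell
restored = ast.literal_eval(cell_text)   # safely parse that text back into a list

print(type(cell_text).__name__, type(restored).__name__)
```

With pandas, the same conversion could be applied after reloading, for example df["Phrase_Tokens"].apply(ast.literal_eval), or by passing converters={"Phrase_Tokens": ast.literal_eval} to pd.read_csv.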