In [2]:
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer
from tqdm import tqdm

# 1. Load your dataset
df = pd.read_excel("mental_health.xlsx")

# 2. Load translation model (English → Swahili)
model_name = 'Helsinki-NLP/opus-mt-en-sw'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# 3. Translate function
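def translate_text(text):
    """Translate one English string to Swahili; missing values become ""."""
    if pd.isna(text):
        return ""
    # truncation=True cuts inputs longer than the model's maximum input length
    inputs = tokenizer([str(text)], return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    # generate() returns one sequence per input; decode the only one
    return tokenizer.decode(translated[0], skip_special_tokens=True)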

# 4. Apply translation to text column
tqdm.pandas()
df["text_sw"] = df["text"].progress_apply(translate_text)

# 5. Save to new CSV
df.to_csv("translated_dataset.csv", index=False)
print("✅ Translation complete! Saved as translated_dataset.csv")


100%|██████████| 103488/103488 [57:57:42<00:00,  2.02s/it]   


✅ Translation complete! Saved as translated_dataset.csv


## **Adding the Swahili Feature to our Dataset**

Our dataset contained a large number of rows (about 100k), so we moved translation into a separate notebook. For this notebook we translated around 18k rows into Swahili.

Steps:
1. Install the required dependencies with pip:
   * transformers
   * sentencepiece
   * torch
   * huggingface_hub[hf_xet]
2. Create a DataFrame with the number of rows you want to translate. For this notebook we translated 18k rows from our original dataset.
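
Step 2 can be sketched roughly as follows. This is a minimal illustration, not the exact code used for this notebook: the helper name `take_subset`, its `seed` parameter, and the toy data are assumptions introduced for the example. Only the `text` column name and the 18k figure come from the notebook itself.

```python
import pandas as pd


def take_subset(df, n_rows, seed=None):
    """Return n_rows rows of df (hypothetical helper, not from the notebook).

    With seed=None the first n_rows rows are kept in order; with an integer
    seed a reproducible random sample of n_rows rows is drawn instead.
    """
    n_rows = min(n_rows, len(df))  # never ask for more rows than exist
    if seed is None:
        subset = df.head(n_rows)
    else:
        subset = df.sample(n=n_rows, random_state=seed)
    # Reset the index so the subset is numbered 0..n_rows-1
    return subset.reset_index(drop=True)


# Toy stand-in for the real dataset, which is read from an Excel file
toy = pd.DataFrame({"text": [f"sentence {i}" for i in range(10)]})
subset = take_subset(toy, 4)
print(len(subset))
```

On the real data this would look something like `df_small = take_subset(df, 18_000)`, followed by applying `translate_text` to `df_small["text"]`. Passing a seed is one way to avoid keeping only the rows that happen to come first in the file, assuming row order is not meaningful.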