# Bemba TTS Training Notebook

This notebook trains a Text-to-Speech (TTS) model for the Bemba language using the BembaSpeech dataset. We'll use the Coqui TTS framework for training.

## 1. Import Required Libraries

Import necessary libraries for TTS training, including Coqui TTS, audio processing, and data handling.

In [20]:
# Install Coqui TTS if not already installed
try:
    import TTS
    print("✅ Coqui TTS already installed")
except ImportError:
    print("Installing Coqui TTS...")
    !pip install coqui-tts
    print("✅ Coqui TTS installed")

import os
import pandas as pd
import numpy as np
from pathlib import Path
import librosa
from TTS.api import TTS
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

✅ Coqui TTS already installed


In [21]:
# Check if BembaSpeech dataset exists
if not os.path.exists('BembaSpeech'):
    print("❌ BembaSpeech dataset not found!")
    print("Please clone it first: git clone https://github.com/csikasote/BembaSpeech.git")
    raise FileNotFoundError("BembaSpeech dataset missing")

print("✅ BembaSpeech dataset found")

✅ BembaSpeech dataset found


## 2. Load and Prepare Data

Load the BembaSpeech dataset and prepare it for TTS training by extracting text and audio paths.

In [22]:
# Load BembaSpeech dataset
dataset_path = 'BembaSpeech/bem'

# Load training data (files are TSV, not CSV)
train_df = pd.read_csv(f'{dataset_path}/train.tsv', sep='\t')
dev_df = pd.read_csv(f'{dataset_path}/dev.tsv', sep='\t')
test_df = pd.read_csv(f'{dataset_path}/test.tsv', sep='\t')

print(f"Training samples: {len(train_df)}")
print(f"Dev samples: {len(dev_df)}")
print(f"Test samples: {len(test_df)}")

# Sample data
print(train_df.head())

# Prepare metadata for TTS training
def prepare_metadata(df, audio_dir):
    metadata = []
    for _, row in df.iterrows():
        audio_path = f"{audio_dir}/{row['audio']}"  # Column is 'audio', not 'audio_filepath'
        text = row['sentence']  # Column is 'sentence', not 'text'
        metadata.append(f"{audio_path}|{text}")
    return metadata

train_metadata = prepare_metadata(train_df, f"{dataset_path}/audio")
dev_metadata = prepare_metadata(dev_df, f"{dataset_path}/audio")

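# Save metadata files (explicit UTF-8 so the output does not depend on the platform default encoding)
with open('train_metadata.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(train_metadata))

with open('dev_metadata.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(dev_metadata))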

Training samples: 12421
Dev samples: 1700
Test samples: 1368
                                    audio  \
0  01-200921-192247_bem_d31_elicit_16.wav   
1   03-200921-160552_bem_798_elicit_2.wav   
2  01-201002-002057_bem_fb0_elicit_31.wav   
3  01-180101-223427_bem_d31_elicit_25.wav   
4   01-200930-150533_bem_fb0_elicit_1.wav   

                                            sentence  
0  cisuma ukwibukisho kuti iciputulwa icikalamba ...  
1  umutitikisha wamu ndupwa namu mayanda milandu ...  
2  kwena umutima tawakalipishe atile ndeisaisanga...  
3  camulengele ukwenda ubulendo ubwayafya kabili ...  
4  joni nao ati nga iwe amaka ukwete ayakunjipush...  
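
Before training, it can help to confirm that every row points to an audio file that actually exists and that no transcript contains the `|` delimiter used in the metadata lines. The sketch below is one way to do that with only the standard library. The helper name `build_metadata_lines` and the choice to skip bad rows rather than repair them are assumptions, not part of the original notebook.

```python
from pathlib import Path


def build_metadata_lines(rows, audio_dir):
    """Return (lines, skipped) for an iterable of (audio_filename, sentence).

    Hypothetical helper: rows whose audio file is missing, or whose text is
    empty or contains the '|' delimiter, are skipped and reported.
    """
    audio_dir = Path(audio_dir)
    lines, skipped = [], []
    for audio_name, sentence in rows:
        text = str(sentence).strip()
        audio_path = audio_dir / audio_name
        if not audio_path.is_file():
            skipped.append((audio_name, "missing audio"))
        elif not text or "|" in text:
            skipped.append((audio_name, "unusable text"))
        else:
            lines.append(f"{audio_path.as_posix()}|{text}")
    return lines, skipped
```

With the DataFrames loaded above, a call might look like `build_metadata_lines(zip(train_df['audio'], train_df['sentence']), f"{dataset_path}/audio")`; inspecting `skipped` shows which rows were dropped and why.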


## 3. Define the Model

We start from a pre-trained English TTS model (Tacotron2-DDC with a HiFi-GAN vocoder), which serves as the base for fine-tuning on Bemba data. This demo only loads the pre-trained model; it does not fine-tune it.

In [23]:
# For TTS training, we'll use Coqui TTS configuration
# Note: Full training requires significant compute resources
# This is a simplified setup for demonstration

# Define TTS model configuration
model_config = {
    'model': 'tts_models/en/ljspeech/tacotron2-DDC_ph',  # Base English model
    'vocoder': 'vocoder_models/en/ljspeech/hifigan_v2',
    'language': 'en',  # We'll adapt for Bemba
}

# Initialize TTS with pre-trained model
tts = TTS(model_name=model_config['model'], 
          vocoder_name=model_config['vocoder'])

print("TTS Model loaded successfully")
print(f"Model: {model_config['model']}")
print(f"Vocoder: {model_config['vocoder']}")

TTS Model loaded successfully
Model: tts_models/en/ljspeech/tacotron2-DDC_ph
Vocoder: vocoder_models/en/ljspeech/hifigan_v2


## 4. Compile the Model

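Coqui TTS models need no separate compile step: the pre-trained model loads with its own audio configuration, so this section only notes the default settings used for inference or fine-tuning.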

In [24]:
# TTS models are pre-compiled with default settings
# Configuration is handled internally by the model
# Default sample rate is typically 22050 Hz with silence trimming enabled

print("TTS Model ready for inference")
print("Using default audio settings (22kHz sample rate, silence trimming enabled)")

TTS Model ready for inference
Using default audio settings (22kHz sample rate, silence trimming enabled)
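
Because the pre-trained model works at a fixed sample rate, it is worth checking what rate the dataset recordings actually use before fine-tuning. The following sketch reads WAV headers with Python's standard `wave` module. The function name `summarize_sample_rates` is an assumption, and the sketch only handles uncompressed PCM WAV files.

```python
import wave
from collections import Counter
from pathlib import Path


def summarize_sample_rates(audio_dir):
    """Count how many .wav files in audio_dir use each sample rate (Hz)."""
    rates = Counter()
    for wav_path in sorted(Path(audio_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            rates[wav_file.getframerate()] += 1
    return rates
```

Calling `summarize_sample_rates('BembaSpeech/bem/audio')` returns a `Counter` keyed by sample rate. If the recordings are not at 22050 Hz, they would likely need resampling, or the training configuration would need to match the dataset's rate.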


## 5. Train the Model

Note: Full TTS training requires significant computational resources and time. For this demo, we'll use the pre-trained model for inference. To train on Bemba data, use Coqui TTS training scripts.

In [25]:
# Training command (run in terminal, not in notebook for long training)
# python -m TTS.bin.train_tts --config_path config.json --restore_path <pretrained_model_path>

print("For full training, use:")
print("python -m TTS.bin.train_tts --config_path bemba_tts_config.json --restore_path pretrained_model.pth.tar")

For full training, use:
python -m TTS.bin.train_tts --config_path bemba_tts_config.json --restore_path pretrained_model.pth.tar
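
The training script reads the dataset through a formatter named in the config. One commonly used formatter, `ljspeech`, expects (to the best of my knowledge) a metadata file with `id|text|normalized_text` rows and the audio stored as `wavs/<id>.wav` under the dataset root. The sketch below converts the `path|text` lines written in Section 2 into that row format. The function name `to_ljspeech_rows` and the reuse of the raw text as the normalized text are assumptions.

```python
from pathlib import Path


def to_ljspeech_rows(metadata_lines):
    """Convert 'audio_path|text' lines into 'id|text|text' rows."""
    rows = []
    for line in metadata_lines:
        # Split only on the first '|' so the text part stays intact.
        audio_path, text = line.split("|", 1)
        file_id = Path(audio_path).stem  # file name without the '.wav' suffix
        rows.append(f"{file_id}|{text}|{text}")
    return rows
```

For example, `to_ljspeech_rows(train_metadata)` yields rows that could be written to a `metadata.csv` file. The audio files would also need to be reachable under a `wavs/` folder for this layout, so check the formatter's expectations against the installed Coqui TTS version before relying on it.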


## 6. Evaluate the Model

Test the TTS model by generating audio for sample Bemba phrases.

In [26]:
# Test TTS with Bemba phrases
bemba_phrases = [
    "Umuntu wawonekera",  # Person detected
    "Imoto yawonekera",   # Car detected
    "Ibayisikilo yawonekera",  # Bicycle detected
    "Iciti cawonekera",   # Chair detected
    "Itabule yawonekera"  # Table detected
]

print("⚠️  WARNING: This uses an English TTS model, so pronunciation will follow English phonetics, not authentic Bemba.")
print("For proper Bemba audio, record native speakers or train a custom Bemba TTS model.\n")

for i, phrase in enumerate(bemba_phrases):
    print(f"Generating audio for: {phrase}")
    try:
        tts.tts_to_file(text=phrase, file_path=f"bemba_sample_{i}.wav")
        print(f"✅ Saved: bemba_sample_{i}.wav")
    except Exception as e:
        print(f"❌ Error generating {phrase}: {e}")

For proper Bemba audio, record native speakers or train a custom Bemba TTS model.

Generating audio for: Umuntu wawonekera
✅ Saved: bemba_sample_0.wav
Generating audio for: Imoto yawonekera
✅ Saved: bemba_sample_1.wav
Generating audio for: Ibayisikilo yawonekera
✅ Saved: bemba_sample_2.wav
Generating audio for: Iciti cawonekera
✅ Saved: bemba_sample_3.wav
Generating audio for: Itabule yawonekera
✅ Saved: bemba_sample_4.wav
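
A "Saved" message only confirms that a file was written, not that it contains usable audio. As a rough sanity check, the sketch below flags generated files that cannot be read or are suspiciously short. It assumes the output is uncompressed PCM WAV; the function names and the 0.2 second threshold are illustrative choices, not part of the notebook.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def find_suspect_files(paths, min_seconds=0.2):
    """Return the paths that are unreadable or shorter than min_seconds."""
    suspect = []
    for path in paths:
        try:
            if wav_duration_seconds(path) < min_seconds:
                suspect.append(path)
        except (OSError, EOFError, wave.Error):
            # Missing, empty, or malformed files are treated as suspect.
            suspect.append(path)
    return suspect
```

For the five samples above, `find_suspect_files([f"bemba_sample_{i}.wav" for i in range(5)])` should return an empty list if every file was generated properly. Passing this check says nothing about pronunciation quality, which still needs a listener.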


## 7. Make Predictions

Generate audio files for all object detection phrases that can be used in the Android app.

### Translation Strategy

Instead of translating full phrases, we translate only the object names and append language-specific suffixes:
- **Nyanja**: `[object] yawonekera` (e.g., "munthu yawonekera" = "person detected")
- **Bemba**: `[object] yamoneka` (e.g., "umuntu yamoneka" = "person detected")

This reduces the translation workload to 181 object names per language, plus one fixed suffix each, instead of 181 full phrases per language.
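
The suffix strategy can be sketched as a small function. This is a minimal illustration: the name `build_phrase`, the `SUFFIXES` mapping, and the nested-dictionary shape of `translations` are assumptions chosen to mirror the description above, with the English label used as a fallback when no translation exists.

```python
SUFFIXES = {"nyanja": "yawonekera", "bemba": "yamoneka"}


def build_phrase(english_label, language, translations=None):
    """Compose '<object> <suffix>', falling back to the English label."""
    translations = translations or {}
    object_name = translations.get(english_label, {}).get(language) or english_label
    return f"{object_name} {SUFFIXES[language]}"
```

For instance, `build_phrase("person", "bemba", {"person": {"bemba": "umuntu"}})` gives `"umuntu yamoneka"`, while an untranslated label such as `build_phrase("car", "nyanja")` gives the placeholder `"car yawonekera"`.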

In [2]:
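# Generate audio for all 181 object detection labels
# Strategy: Translate only object names, then append language-specific suffix
import os
import re
import pandas as pd

# This cell reuses `tts` from Section 3. After a kernel restart it is undefined
# (NameError: name 'tts' is not defined), so re-create it here if needed.
if 'tts' not in globals():
    from TTS.api import TTS
    tts = TTS(model_name='tts_models/en/ljspeech/tacotron2-DDC_ph',
              vocoder_name='vocoder_models/en/ljspeech/hifigan_v2')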

# Path to the label map
label_map_path = 'app/src/main/assets/labelmap.txt'

# Read all labels from the file
try:
    with open(label_map_path, 'r') as f:
        labels = [line.strip() for line in f.readlines()]
    print(f"✅ Successfully read {len(labels)} labels from {label_map_path}")
except FileNotFoundError:
    print(f"❌ Error: {label_map_path} not found. Make sure the path is correct.")
    labels = []

# Language-specific suffixes
SUFFIX_NYANJA = "yawonekera"  # "detected" in Nyanja
SUFFIX_BEMBA = "yamoneka"     # "detected" in Bemba

# --- Translation Logic ---
# Option 1: Load translations from CSV (if available)
translations_file = 'translations.csv'
object_translations = {}

try:
    df = pd.read_csv(translations_file)
    for _, row in df.iterrows():
        english = row['English']
        nyanja = row['Nyanja'] if pd.notna(row['Nyanja']) else english
        bemba = row['Bemba'] if pd.notna(row['Bemba']) else english
        object_translations[english] = {'nyanja': nyanja, 'bemba': bemba}
    print(f"✅ Loaded translations from {translations_file}")
except FileNotFoundError:
    print(f"⚠️  {translations_file} not found. Using English object names as placeholders.")
    for label in labels:
        object_translations[label] = {'nyanja': label, 'bemba': label}

# Generate phrases for both languages
nyanja_phrases = {}
bemba_phrases = {}

for label in labels:
    sanitized_label = re.sub(r'[^a-zA-Z0-9_]', '', label.replace(' ', '_'))
    
    # Get translation or use English as fallback
    trans = object_translations.get(label, {'nyanja': label, 'bemba': label})
    
    # Create complete phrases
    nyanja_phrases[sanitized_label] = f"{trans['nyanja']} {SUFFIX_NYANJA}"
    bemba_phrases[sanitized_label] = f"{trans['bemba']} {SUFFIX_BEMBA}"

print(f"\nGenerated {len(nyanja_phrases)} Nyanja phrases and {len(bemba_phrases)} Bemba phrases.")
print("⚠️  WARNING: This uses an English TTS model, so pronunciation will be English phonetics.")
print("For proper audio, record native speakers.\n")

# Generate audio for both languages
for lang_name, phrases, output_dir in [
    ("Nyanja", nyanja_phrases, "nyanja_audio_all"),
    ("Bemba", bemba_phrases, "bemba_audio_all")
]:
    print(f"\n{'='*60}")
    print(f"Generating {lang_name} audio files...")
    print(f"{'='*60}\n")
    
    os.makedirs(output_dir, exist_ok=True)
    
    for obj_filename, phrase in phrases.items():
        output_path = f"{output_dir}/{lang_name.lower()}_{obj_filename}.wav"
        print(f"Generating '{phrase}'")
        try:
            tts.tts_to_file(text=phrase, file_path=output_path)
            print(f"✅ Saved: {output_path}")
        except Exception as e:
            print(f"❌ Error: {e}")

print(f"\n{'='*60}")
print("Audio generation complete!")
print("Nyanja files: nyanja_audio_all/")
print("Bemba files: bemba_audio_all/")
print("\nNext steps:")
print("1. Fill in translations.csv with proper Nyanja/Bemba object names")
print("2. Record native speakers saying each phrase")
print("3. Convert to MP3 and copy to app/src/main/res/raw/")
print(f"{'='*60}")

✅ Successfully read 181 labels from app/src/main/assets/labelmap.txt

Generated 181 placeholder phrases.
For proper Bemba audio, record native speakers or train a custom Bemba TTS model.

Generating for 'person': person yawonekera
❌ Error generating audio for person: name 'tts' is not defined
Generating for 'bicycle': bicycle yawonekera
❌ Error generating audio for bicycle: name 'tts' is not defined
Generating for 'car': car yawonekera
❌ Error generating audio for car: name 'tts' is not defined
Generating for 'motorcycle': motorcycle yawonekera
❌ Error generating audio for motorcycle: name 'tts' is not defined
Generating for 'airplane': airplane yawonekera
❌ Error generating audio for airplane: name 'tts' is not defined
Generating for 'bus': bus yawonekera
❌ Error generating audio for bus: name 'tts' is not defined
Generating for 'train': train yawonekera
❌ Error generating audio for train: name 'tts' is not defined
Generating for 'truck': truck yawonekera
❌ Error generating audio for 