# üá≥üá¨ Yoruba G2P Demo
### Tone-Aware Grapheme-to-Phoneme Conversion (IPA + ASCII)

This notebook demonstrates how to use the `yoruba-g2p` Python package to:

- Load `.lab` transcripts  
- Extract vocabulary  
- Convert Yoruba words → IPA phones with tones  
- Convert IPA → ASCII phones (MFA-ready)  
- Build full pronunciation dictionaries  
- Inspect phonesets + stats  

The library is designed for speech research, NLP pipelines, forced alignment with the Montreal Forced Aligner (MFA), and building Yoruba ASR/TTS datasets.


In [None]:
# Install the package (from pip or local)
# If using PyPI:
# !pip install yoruba-g2p

# If using local clone:
# !pip install -e ../


In [None]:
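# Import the library.
# Note: this import needs the epitran package. If it fails with
# "ModuleNotFoundError: No module named 'epitran'", run `pip install epitran` first.
from yoruba_g2p import YorubaG2P

g2p = YorubaG2P()
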

In [None]:
# Quick IPA conversion examples

test_words = [
    "ọ̀mọ́", "ọmọ", "gẹ́gẹ́", "àwọn",
    "jẹ́", "òǹdè", "gbòǹgbò", "makpela",
    "àǹfààní", "tálẹ́ǹtì"
]

for w in test_words:
    phones, ok = g2p.yoruba_word_to_ipa_phones(w)
    print(f"{w:12s} -> {phones}   OK={ok}")
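The test words above mix precomposed letters (such as `à`) with base letters followed by combining marks (such as `ọ` plus a grave accent). In standard Yoruba orthography an acute accent marks a high tone, a grave accent a low tone, and an unmarked vowel is read as mid. The sketch below is independent of `yoruba-g2p` and only illustrates that convention with the standard library; `split_tones` is a hypothetical helper, and it works letter by letter rather than phone by phone (so `gb` comes out as two letters).

```python
import unicodedata

ACUTE, GRAVE, DOT_BELOW = "\u0301", "\u0300", "\u0323"
TONE_OF_MARK = {ACUTE: "H", GRAVE: "L"}  # high, low
VOWELS = set("aeiou") | {"\u1eb9", "\u1ecd"}  # plain vowels plus ẹ and ọ


def split_tones(word):
    """Return [(letter, tone)] with tone in {"H", "L", "M", None}.

    "M" (mid) is assigned to vowels that carry no tone mark;
    None means the letter carries no mark and is not a vowel.
    """
    letters = []  # mutable [base, tone] pairs
    # NFD splits every precomposed letter into base + combining marks.
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch == DOT_BELOW and letters:
            # Re-attach the underdot so ẹ/ọ/ṣ stay single letters.
            letters[-1][0] = unicodedata.normalize("NFC", letters[-1][0] + ch)
        elif ch in TONE_OF_MARK and letters:
            letters[-1][1] = TONE_OF_MARK[ch]
        else:
            letters.append([ch, None])
    return [
        (base, tone if tone else ("M" if base in VOWELS else None))
        for base, tone in letters
    ]


print(split_tones("\u1ecd\u0300m\u1ecd\u0301"))  # the first test word
print(split_tones("\u00f2\u01f9d\u00e8"))        # the sixth test word
```

Because the function normalizes to NFD first, a precomposed `à` and an `a` followed by a combining grave give the same result, which is one reason to normalize text before any dictionary lookup.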


In [None]:
# IPA → ASCII examples
# (uses test_words from the previous cell; run that cell first to avoid a NameError)
for w in test_words:
    ipa, _ok = g2p.yoruba_word_to_ipa_phones(w)
    ascii_phones = [g2p.ipa_phone_to_ascii(p) for p in ipa]
    print(f"{w:12s} -> {ascii_phones}")

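To make the idea of an ASCII phone set concrete, here is a minimal sketch of a table-driven IPA → ASCII mapping. The table is hypothetical and for illustration only: the symbols that `g2p.ipa_phone_to_ascii` actually returns may differ, and `to_ascii` and `assert_injective` are names invented for this example. The injectivity check covers one property any such table needs: two IPA phones must not collapse into the same ASCII symbol.

```python
# Hypothetical IPA -> ASCII table (NOT the one used by yoruba-g2p).
IPA_TO_ASCII = {
    "\u0254": "O",         # open-mid back rounded vowel
    "\u025b": "E",         # open-mid front unrounded vowel
    "\u0283": "S",         # voiceless postalveolar fricative
    "k\u0361p": "kp",      # labial-velar stop written with a tie bar
    "\u0261\u0361b": "gb", # voiced labial-velar stop written with a tie bar
}


def to_ascii(phone, table=IPA_TO_ASCII):
    """Map one IPA phone to ASCII; phones that are already ASCII pass through."""
    if phone in table:
        return table[phone]
    if phone.isascii():
        return phone
    raise KeyError(f"no ASCII mapping for {phone!r}")


def assert_injective(table):
    """Raise ValueError if two IPA phones share one ASCII symbol."""
    seen = {}
    for ipa, asc in table.items():
        if asc in seen:
            raise ValueError(f"{ipa!r} and {seen[asc]!r} both map to {asc!r}")
        seen[asc] = ipa


assert_injective(IPA_TO_ASCII)
ascii_demo = [to_ascii(p) for p in ["\u0254", "m", "\u0254"]]
print(ascii_demo)
```

The check only looks inside the table; it does not detect a clash between a table output and an ASCII phone that passes through unchanged.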

The build step below assumes that your lab root (the folder you pass as `LAB_ROOT`, shown here as `my_labs/`) is laid out like this:

- `my_labs/train/*.lab`
- `my_labs/valid/*.lab`
- `my_labs/test/*.lab`
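Before building dictionaries it can help to confirm that each split folder really contains `.lab` files and to get a rough idea of the vocabulary. The sketch below uses only the standard library and does not reproduce how `yoruba-g2p` reads transcripts; `collect_vocab` is a hypothetical helper that tokenizes on whitespace and lowercases, with no punctuation handling. The demo builds a throwaway directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def collect_vocab(lab_root, splits=("train", "valid", "test")):
    """Return (files_per_split, vocabulary) for a lab root with one folder per split."""
    counts, vocab = {}, set()
    for split in splits:
        files = sorted(Path(lab_root, split).glob("*.lab"))
        counts[split] = len(files)
        for path in files:
            # Whitespace tokenization only; real transcripts may need more cleanup.
            vocab.update(path.read_text(encoding="utf-8").lower().split())
    return counts, vocab


# Demo on a temporary directory with one transcript per split.
with tempfile.TemporaryDirectory() as root:
    for split, text in [("train", "ile oba"), ("valid", "omo oba"), ("test", "")]:
        Path(root, split).mkdir()
        Path(root, split, "utt1.lab").write_text(text, encoding="utf-8")
    counts, vocab = collect_vocab(root)

print(counts)         # {'train': 1, 'valid': 1, 'test': 1}
print(sorted(vocab))  # ['ile', 'oba', 'omo']
```

A split that reports zero files usually points to a wrong root path or a different file extension.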

In [None]:
# Build full dictionaries from .lab root
LAB_ROOT = "/path/to/lab_root"       # input folder
OUT_DIR = "yoruba_dict_output"       # output folder

result = g2p.build_all_from_labs(
    lab_root=LAB_ROOT,
    splits=["train", "valid", "test"],
    out_dir=OUT_DIR
)

result


In [None]:
# View generated files

import json

print("IPA dictionary:", result["ipa_dict"])
print("ASCII dictionary:", result["ascii_dict"])
print("Phoneset:", result["phoneset"])
print("Stats:", result["stats"])

with open(result["stats"], "r", encoding="utf-8") as f:
    stats = json.load(f)

stats


In [None]:
# Preview dictionaries
# Show first 20 IPA entries
with open(result["ipa_dict"], "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 20:
            break
        print(line.strip())


In [None]:

# Show first 20 ASCII entries
with open(result["ascii_dict"], "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 20:
            break
        print(line.strip())


In [None]:
# List the phoneset (one phone per line; blank lines are skipped)
with open(result["phoneset"], "r", encoding="utf-8") as f:
    phones = [p.strip() for p in f if p.strip()]

phones
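A useful sanity check at this point is that every phone used in a dictionary also appears in the phoneset. The sketch below assumes the formats this notebook itself relies on: one `word<TAB>phones` entry per dictionary line with space-separated phones, and one phone per phoneset line. `find_unknown_phones` is a hypothetical helper, shown on in-memory data so that it runs without the generated files.

```python
def find_unknown_phones(dict_lines, phoneset):
    """Map each phone missing from `phoneset` to the words that use it."""
    known = set(phoneset)
    unknown = {}
    for line in dict_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, phs = line.split("\t", 1)
        for p in phs.split():
            if p not in known:
                unknown.setdefault(p, []).append(word)
    return unknown


demo_dict = ["omo\to m o", "oba\to b a", "", "ile\ti l e"]
demo_phoneset = ["o", "m", "b", "a", "i", "l"]
missing = find_unknown_phones(demo_dict, demo_phoneset)
print(missing)  # {'e': ['ile']}
```

With the real outputs, the same function can be called on the open ASCII dictionary file and the `phones` list loaded above; an empty result means the two files agree.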


In [None]:
# Histogram of phone frequencies (optional)
from collections import Counter
import matplotlib.pyplot as plt

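# Count phone occurrences across the ASCII dictionary
freq = Counter()

with open(result["ascii_dict"], "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on the unpack below
        # maxsplit=1 keeps everything after the first tab as the phone string
        word, phs = line.split("\t", 1)
        freq.update(phs.split())

# Plot, most frequent phone first
ranked = freq.most_common()
plt.figure(figsize=(12, 5))
plt.bar([p for p, _ in ranked], [c for _, c in ranked])
plt.xticks(rotation=90)
plt.title("ASCII Phone Frequency Distribution")
plt.show()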


In [None]:
# Convert a single sentence
sentence = "Àwọn ọmọ náà ń lọ sí ilé ọba"

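def transcribe_sentence(s):
    """Return a list of (token, IPA phones) pairs for the sentence `s`."""
    tokens = g2p.normalize_text(s).split()
    out = []
    for t in tokens:
        # The second return value (the OK flag) is not used here
        phones, _ok = g2p.yoruba_word_to_ipa_phones(t)
        out.append((t, phones))
    return out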

transcribe_sentence(sentence)
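The sentence helper above drops the second value returned by the converter. Assuming that value is a success flag (the notebook prints it as `OK=` but does not define it), a small sketch for collecting the tokens that need manual review could look like this. `find_failures` is a hypothetical helper that accepts any callable returning `(phones, ok)`, and `fake_convert` is a stand-in used only so that the example runs without the library.

```python
def find_failures(words, convert):
    """Return the words for which convert(word) reports a falsy ok flag."""
    failed = []
    for w in words:
        _phones, ok = convert(w)
        if not ok:
            failed.append(w)
    return failed


# Stand-in converter for the demo only: "succeeds" on purely alphabetic tokens.
def fake_convert(word):
    return list(word), word.isalpha()


failed = find_failures(["ile", "oba", "2024"], fake_convert)
print(failed)  # ['2024']
```

In the notebook, the same helper could be called with the normalized tokens and `g2p.yoruba_word_to_ipa_phones` as the converter.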


### Export as MFA-ready package (optional)

The ASCII dictionary and phoneset built above can be passed to MFA as they are.

If you prefer a dedicated folder for MFA, copy the output directory:

In [None]:
import shutil

MFA_DIR = "mfa_yoruba_dict"

# dirs_exist_ok=True (Python 3.8+) lets the copy proceed if MFA_DIR already exists
shutil.copytree(OUT_DIR, MFA_DIR, dirs_exist_ok=True)

print(f"MFA-ready dictionary copied to: {MFA_DIR}")
