# Speech AI on Colab — All-in-One Notebook

This notebook runs three speech pipelines end-to-end on Google Colab:

- Approach A (fast, English-only target): Speech → Whisper (task=translate) → English text → English TTS (SpeechT5 or Parler‑TTS)
- Approach B (multilingual, quality): Speech → Whisper (ASR) → MT (NLLB‑200 distilled 600M or OPUS‑MT) → TTS (English: SpeechT5/Parler‑TTS; non‑English: MMS‑TTS)
- Approach C (optional, end‑to‑end S2ST): SeamlessM4T direct Speech‑to‑Speech

Notes
- Purely Colab (Python 3.12, CUDA GPU). No conda.
- Large models are loaded sequentially to fit a ~15 GB GPU.
- Caches and outputs can be stored on Google Drive.
- This scaffold includes structure and placeholders. Implementation code will be added later.

## 0) Quick Start
1. Runtime → Change runtime type → GPU.
2. Set the toggles in the next cell (e.g., SAVE_TO_DRIVE, which pipelines to run).
3. (Later) Run the dependency install cell when we add it.
4. Upload your audio (Section 3) and/or record inside Colab.
5. Run pipelines A/B/C (Sections 6–8) and compare (Section 10).

## 1) Configuration (edit these as needed)
Set your preferences here. These will be read by subsequent sections.

Key toggles
- SAVE_TO_DRIVE: Store caches/outputs on Drive (recommended).
- USE_MIC: Enable an in-notebook recorder.
- RUN_PIPELINE_A/B/C: Enable/disable each approach.
- WHISPER_SIZE: large-v2 (default) or medium if VRAM is tight.
- MT_CHOICE: nllb (default) or opus.
- TTS_EN: speecht5 (default) or parler.
- TTS_NONEN: mms (default on Colab).

In [1]:
# User-configurable settings (safe to edit)
SAVE_TO_DRIVE = True           # Use Google Drive for HF cache and outputs
USE_MIC = True                 # Enable the in-notebook microphone recorder UI

RUN_PIPELINE_A = True          # Approach A: Speech -> Whisper translate -> English TTS
RUN_PIPELINE_B = True          # Approach B: Speech -> ASR -> MT -> TTS
RUN_PIPELINE_C = False         # Approach C: SeamlessM4T S2ST (optional)

WHISPER_SIZE = "large-v2"      # Options: "large-v2" (default), "medium"
MT_CHOICE = "nllb"             # Options: "nllb" (default), "opus"
TTS_EN = "speecht5"            # Options: "speecht5" (default), "parler"
TTS_NONEN = "mms"              # Options: "mms" (default)

# Language configuration examples (edit per file later in the manifest UI)
DEFAULT_SRC_LANG = "auto"      # e.g., "en", "de", "es", "zh", "ar" or "auto"
DEFAULT_TGT_LANG_TEXT = "eng_Latn"  # NLLB target code, e.g., "eng_Latn", "deu_Latn", "spa_Latn"
DEFAULT_TGT_LANG_TTS = "en"     # TTS target code, e.g., "en", "de", "es", "zh", "ar"

# Advanced (leave as-is for now)
MAX_AUDIO_SECONDS = 60          # Warn/advise chunking above this duration
SEAMLESS_AVAILABLE = False      # Will be probed later if C is enabled
PROJECT_NAME = "Speech AI on Colab — All-in-One"
print("Configuration loaded.")

Configuration loaded.


## 2) Runtime checks (GPU, Disk, Python)
Lightweight checks to confirm your Colab runtime is ready. No heavy installs here yet.

In [2]:
import os, shutil, sys, subprocess, textwrap
from datetime import datetime

print("Timestamp:", datetime.utcnow().isoformat(), "UTC")
print("Python:", sys.version.split()[0])

try:
    import torch
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
except Exception as e:
    print("Torch not available yet (will install later).", e)

print("Working dir:", os.getcwd())
print("Disk status (df -h):")
try:
    out = subprocess.check_output(["bash", "-lc", "df -h | head -n 5"])
    print(out.decode())
except Exception as e:
    print(e)

print("Note: Colab's VM disk is separate from your local machine.")

  print("Timestamp:", datetime.utcnow().isoformat(), "UTC")


Timestamp: 2025-09-08T08:11:40.752804 UTC
Python: 3.12.11
PyTorch: 2.8.0+cu126
CUDA available: False
Working dir: /content
Disk status (df -h):
Filesystem      Size  Used Avail Use% Mounted on
overlay         108G   40G   69G  37% /
tmpfs            64M     0   64M   0% /dev
shm             5.8G     0  5.8G   0% /dev/shm
/dev/root       2.0G  1.2G  775M  61% /usr/sbin/docker-init

Note: Colab's VM disk is separate from your local machine.


## 3) Google Drive (optional but recommended)
If SAVE_TO_DRIVE is True:
- Mount Drive
- Set Hugging Face cache (HF_HOME) to a folder on Drive
- Create project folders for inputs and outputs

This prevents the VM disk from filling up and keeps assets persistent between sessions.

In [3]:
HF_HOME = None
BASE_DIR = "/content"
DRIVE_BASE = "/content/drive/MyDrive/speech_ai_colab"
PATHS = {}

if SAVE_TO_DRIVE:
    try:
        from google.colab import drive  # type: ignore
        drive.mount('/content/drive', force_remount=False)
        os.makedirs(DRIVE_BASE, exist_ok=True)
        HF_HOME = os.path.join(DRIVE_BASE, "hf_cache")
        os.environ["HF_HOME"] = HF_HOME
        for p in ["inputs", "recordings", "outputs", "manifests", "results", "figures"]:
            os.makedirs(os.path.join(DRIVE_BASE, p), exist_ok=True)
        PATHS = {
            "inputs": os.path.join(DRIVE_BASE, "inputs"),
            "recordings": os.path.join(DRIVE_BASE, "recordings"),
            "outputs": os.path.join(DRIVE_BASE, "outputs"),
            "manifests": os.path.join(DRIVE_BASE, "manifests"),
            "results": os.path.join(DRIVE_BASE, "results"),
            "figures": os.path.join(DRIVE_BASE, "figures"),
        }
        print("Drive mounted. HF_HOME:", HF_HOME)
    except Exception as e:
        print("Drive not available, continuing on VM:", e)
        SAVE_TO_DRIVE = False

if not SAVE_TO_DRIVE:
    for p in ["inputs", "recordings", "outputs", "manifests", "results", "figures"]:
        os.makedirs(os.path.join(BASE_DIR, p), exist_ok=True)
    PATHS = {
        "inputs": os.path.join(BASE_DIR, "inputs"),
        "recordings": os.path.join(BASE_DIR, "recordings"),
        "outputs": os.path.join(BASE_DIR, "outputs"),
        "manifests": os.path.join(BASE_DIR, "manifests"),
        "results": os.path.join(BASE_DIR, "results"),
        "figures": os.path.join(BASE_DIR, "figures"),
    }
    print("Using VM storage. Outputs will be ephemeral.")

print("Project paths:")
for k, v in PATHS.items():
    print(f"- {k}: {v}")

Mounted at /content/drive
Drive mounted. HF_HOME: /content/drive/MyDrive/speech_ai_colab/hf_cache
Project paths:
- inputs: /content/drive/MyDrive/speech_ai_colab/inputs
- recordings: /content/drive/MyDrive/speech_ai_colab/recordings
- outputs: /content/drive/MyDrive/speech_ai_colab/outputs
- manifests: /content/drive/MyDrive/speech_ai_colab/manifests
- results: /content/drive/MyDrive/speech_ai_colab/results
- figures: /content/drive/MyDrive/speech_ai_colab/figures


## 4) Dependencies (placeholder)
This is where we will add the exact pip install commands (Transformers, faster-whisper, torchaudio, etc.). For now, this cell only documents intent so the notebook remains lightweight until we finalize code.

Planned components to install:
- System: ffmpeg
- Python: faster-whisper, transformers, accelerate, sentencepiece, sacremoses, soundfile, numpy, gradio, evaluate, sacrebleu (and optional COMET)
- Optional: SeamlessM4T from GitHub (Approach C)

We will add version pins and CUDA-friendly settings later.

In [4]:
# 4.1 Core deps (system + Python) — Colab Python 3.12, CUDA GPU
import os, subprocess, sys, json

print("[Deps] Installing system tools (ffmpeg) ...")
subprocess.run(["bash", "-lc", "apt -qq update && apt -qq install -y ffmpeg"], check=False)

print("[Deps] Upgrading pip/setuptools/wheel ...")
subprocess.run([sys.executable, "-m", "pip", "install", "-U", "pip", "setuptools", "wheel"], check=True)

core_packages = [
    # Core model/tooling
    "transformers>=4.42.0",
    "accelerate>=0.30.0",
    "faster-whisper>=1.0.0",   # includes ctranslate2
    "sentencepiece>=0.1.99",
    "sacremoses",
    "huggingface_hub>=0.23.0",

    # Audio + preprocessing
    "soundfile",
    "scipy>=1.10",             # fallback resampling/dsp if torchaudio not available
    # We'll handle torchaudio separately to match torch

    # UI + eval
    "gradio>=4.0.0",
    "evaluate>=0.4.1",
    "sacrebleu>=2.4.0",
    "jiwer>=3.0.0",
    "pandas",
    "tqdm",
]

print("[Deps] Installing core Python packages ...")
subprocess.run([sys.executable, "-m", "pip", "install", "-U"] + core_packages, check=True)

# Torchaudio: try to match the preinstalled Torch to avoid CUDA mismatches
def install_torchaudio_matching_torch():
    try:
        import torch
        torch_ver = torch.__version__
        print(f"[Deps] Detected torch {torch_ver}")
        cu = getattr(torch.version, "cuda", None)
        if cu:
            cu_tag = "cu" + cu.replace(".", "")
            # Prefer PyTorch index for CUDA wheels
            print(f"[Deps] Installing torchaudio=={torch_ver} from cu index ({cu_tag}) ...")
            cmd = [sys.executable, "-m", "pip", "install", "-U",
                   f"torchaudio=={torch_ver}",
                   "--index-url", f"https://download.pytorch.org/whl/{cu_tag}"]
        else:
            print(f"[Deps] CPU build detected, installing torchaudio=={torch_ver} from PyPI ...")
            cmd = [sys.executable, "-m", "pip", "install", "-U", f"torchaudio=={torch_ver}"]
        subprocess.run(cmd, check=True)
    except Exception as e:
        print("[Deps][WARN] Could not match-install torchaudio to torch:", e)
        print("[Deps] Falling back to plain torchaudio install from PyPI ...")
        try:
            subprocess.run([sys.executable, "-m", "pip", "install", "-U", "torchaudio"], check=True)
        except Exception as e2:
            print("[Deps][ERROR] torchaudio install failed:", e2)

install_torchaudio_matching_torch()

# Nice-to-have defaults
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("[Deps] Core dependencies installed.")
DEPS_INSTALLED = True

[Deps] Installing system tools (ffmpeg) ...
[Deps] Upgrading pip/setuptools/wheel ...
[Deps] Installing core Python packages ...
[Deps] Detected torch 2.8.0+cu126
[Deps] Installing torchaudio==2.8.0+cu126 from cu index (cu126) ...
[Deps] Core dependencies installed.


In [97]:
# 4.2 SeamlessM4T (Approach C): Transformers-native backend (recommended).
# If you really want the Meta repo backend, set USE_SEAMLESS_REPO=True below.

import os, subprocess, sys

USE_SEAMLESS_REPO = False  # keep False unless you specifically need the repo backend

SEAMLESS_AVAILABLE = False
SEAMLESS_BACKEND = None
SEAMLESS_MODEL_ID = "facebook/seamless-m4t-v2-large"

# Try Transformers-native SeamlessM4T first
print("[Deps] Trying SeamlessM4T via Transformers:", SEAMLESS_MODEL_ID)
try:
    import torch
    from transformers import AutoProcessor
    # The v2 model class name in Transformers:
    try:
        from transformers import SeamlessM4Tv2Model  # recent Transformers
        model_cls = SeamlessM4Tv2Model
        print("[Deps] Found SeamlessM4Tv2Model in transformers.")
    except Exception as e:
        # Older/alternate class name fallback
        from transformers import SeamlessM4TModel as model_cls
        print("[Deps][INFO] Using SeamlessM4TModel fallback:", e)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if (device == "cuda") else torch.float32

    # Lightweight load/instantiate to validate availability
    processor = AutoProcessor.from_pretrained(SEAMLESS_MODEL_ID)
    _model = model_cls.from_pretrained(
        SEAMLESS_MODEL_ID,
        torch_dtype=dtype,
        low_cpu_mem_usage=True
    )
    if device == "cuda":
        _model = _model.to(device)
    del _model  # free immediately; we just needed to confirm loading works

    SEAMLESS_AVAILABLE = True
    SEAMLESS_BACKEND = "transformers"
    print("[Deps] SeamlessM4T available via Transformers backend.")
except Exception as e:
    print("[Deps][INFO] Transformers-native SeamlessM4T not available:", e)

# Optional: try Meta's repo backend only if requested
if not SEAMLESS_AVAILABLE and USE_SEAMLESS_REPO:
    print("[Deps] Attempting Meta repo install for SeamlessM4T (optional) ...")
    try:
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "-U", "git+https://github.com/facebookresearch/seamless_communication.git"],
            check=True
        )
        try:
            from seamless_communication.inference import Translator  # noqa: F401
            SEAMLESS_AVAILABLE = True
            SEAMLESS_BACKEND = "meta-repo"
            print("[Deps] SeamlessM4T available via Meta repo backend.")
        except Exception as e:
            print("[Deps][INFO] Meta repo installed but import failed:", e)
            print("[Deps] Treating SeamlessM4T as unavailable.")
    except Exception as e:
        print("[Deps][INFO] Meta repo install failed (optional):", e)

# Final status
print(f"[Deps] SEAMLESS_AVAILABLE={SEAMLESS_AVAILABLE}, BACKEND={SEAMLESS_BACKEND}")

[Deps] Trying SeamlessM4T via Transformers: facebook/seamless-m4t-v2-large
[Deps] Found SeamlessM4Tv2Model in transformers.


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[Deps] SeamlessM4T available via Transformers backend.
[Deps] SEAMLESS_AVAILABLE=True, BACKEND=transformers


In [8]:
# 4.3 Verify imports and environment
import platform, importlib, shutil, os

print("=== Environment ===")
try:
    import torch
    print("Python:", platform.python_version())
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
except Exception as e:
    print("[Verify][WARN] Torch import failed:", e)

for pkg in ["transformers", "accelerate", "faster_whisper", "sentencepiece",
            "soundfile", "scipy", "gradio", "evaluate", "sacrebleu", "jiwer", "torchaudio"]:
    try:
        m = importlib.import_module(pkg.replace("-", "_"))
        ver = getattr(m, "__version__", "(no __version__)")
        print(f"{pkg}: {ver}")
    except Exception as e:
        print(f"[Verify][WARN] {pkg}: not importable -> {e}")

# Show HF_HOME if set, and disk space
hf_home = os.environ.get("HF_HOME", None)
print("HF_HOME:", hf_home if hf_home else "(not set)")
print("Outputs dir:", PATHS.get("outputs"))
print("===================")

=== Environment ===
Python: 3.12.11
PyTorch: 2.8.0+cu126
CUDA available: False
transformers: 4.56.1
accelerate: 1.10.1
faster_whisper: 1.2.0
sentencepiece: 0.2.1
soundfile: 0.13.1
scipy: 1.16.1
gradio: 5.44.1
evaluate: 0.4.5
sacrebleu: 2.5.1
jiwer: (no __version__)
torchaudio: 2.8.0+cu126
HF_HOME: /content/drive/MyDrive/speech_ai_colab/hf_cache
Outputs dir: /content/drive/MyDrive/speech_ai_colab/outputs


## 5) Data intake: upload your audio and/or record in notebook
You chose: upload your own set + optional in-notebook recorder. This section provides the UI and builds a manifest of files to process.
\
What this section will do (implementation to be added):
- Drag-and-drop uploader for WAV/MP3/M4A.
- Automatic resampling to 16 kHz mono and normalization.
- Optional: edit per-file language metadata.
- If USE_MIC=True, a small recorder UI saves clips to recordings/.
- A table view of the manifest with basic editing.

The manifest schema (CSV): id, path, src_lang, transcript_src (optional), ref_translation_en (optional), ref_translation_tgt (optional), split, license (optional).

In [22]:
##Discover your audio (and optional upload helper)

from pathlib import Path
import os, shutil

inputs_dir = PATHS["inputs"]
os.makedirs(inputs_dir, exist_ok=True)

# If your 5 MP3s are already in PATHS["inputs"], we just list them:
audio_exts = {".mp3",".wav",".flac",".m4a",".ogg"}
found = [p for p in Path(inputs_dir).glob("**/*") if p.suffix.lower() in audio_exts]
print("Found audio files:")
for p in found:
    print("-", p)

print(f"Total: {len(found)} file(s) in {inputs_dir}")

# Optional: upload from local computer into inputs/
USE_UPLOAD_HELPER = False
if USE_UPLOAD_HELPER:
    try:
        from google.colab import files  # type: ignore
        uploaded = files.upload()
        for name, _ in uploaded.items():
            shutil.move(name, os.path.join(inputs_dir, name))
        print("Uploaded ->", inputs_dir)
    except Exception as e:
        print("[WARN] Upload helper not available:", e)

Found audio files:
- /content/drive/MyDrive/speech_ai_colab/inputs/ar.mp3
- /content/drive/MyDrive/speech_ai_colab/inputs/zh.mp3
- /content/drive/MyDrive/speech_ai_colab/inputs/en.mp3
- /content/drive/MyDrive/speech_ai_colab/inputs/de.mp3
- /content/drive/MyDrive/speech_ai_colab/inputs/es.mp3
- /content/drive/MyDrive/speech_ai_colab/inputs/preproc/ar.wav
- /content/drive/MyDrive/speech_ai_colab/inputs/preproc/zh.wav
- /content/drive/MyDrive/speech_ai_colab/inputs/preproc/en.wav
- /content/drive/MyDrive/speech_ai_colab/inputs/preproc/de.wav
- /content/drive/MyDrive/speech_ai_colab/inputs/preproc/es.wav
- /content/drive/MyDrive/speech_ai_colab/inputs/uploads/de.mp3
Total: 11 file(s) in /content/drive/MyDrive/speech_ai_colab/inputs


In [23]:
##Create/update manifest
import pandas as pd, os
from pathlib import Path

MANIFEST_PATH = os.path.join(PATHS["manifests"], "manifest.csv")
os.makedirs(PATHS["manifests"], exist_ok=True)

cols = ["id","path","src_lang","transcript_src","ref_translation_en","ref_translation_tgt","split","license"]
if os.path.exists(MANIFEST_PATH):
    df = pd.read_csv(MANIFEST_PATH)
    for c in cols:
        if c not in df.columns:
            df[c] = ""
else:
    df = pd.DataFrame(columns=cols)

# Add new audio files from inputs/
audio_exts = {".mp3",".wav",".flac",".m4a",".ogg"}
found = [p for p in Path(PATHS["inputs"]).glob("**/*") if p.suffix.lower() in audio_exts]

existing_paths = set(df["path"].astype(str).tolist())
added = 0
for p in found:
    sp = str(p)
    if sp in existing_paths:
        continue
    fid = p.stem
    df.loc[len(df)] = {
        "id": fid,
        "path": sp,
        "src_lang": "auto",  # keep auto if unsure; Whisper can detect
        "transcript_src": "",
        "ref_translation_en": "",
        "ref_translation_tgt": "",
        "split": "demo",
        "license": ""        # set if you plan to publish clips
    }
    added += 1

df.to_csv(MANIFEST_PATH, index=False)
print(f"Manifest at {MANIFEST_PATH} — added {added} new row(s), total {len(df)}.")
display(df.tail(min(5, len(df))))

Manifest at /content/drive/MyDrive/speech_ai_colab/manifests/manifest.csv — added 6 new row(s), total 18.


Unnamed: 0,id,path,src_lang,transcript_src,ref_translation_en,ref_translation_tgt,split,license
13,zh,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,
14,en,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,
15,de,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,
16,es,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,
17,de,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,


In [24]:
## Normalize to 16 kHz mono WAV and update manifest paths
import pandas as pd, os, subprocess
from pathlib import Path

df = pd.read_csv(MANIFEST_PATH)
preproc_dir = os.path.join(PATHS["inputs"], "preproc")
os.makedirs(preproc_dir, exist_ok=True)

def to_wav_16k_mono(src_path: str, out_path: str):
    cmd = [
        "bash","-lc",
        f'ffmpeg -y -hide_banner -loglevel error -i "{src_path}" -ac 1 -ar 16000 -vn -map_metadata -1 "{out_path}"'
    ]
    subprocess.run(cmd, check=True)

updated = 0
for i, row in df.iterrows():
    src = row["path"]
    if not isinstance(src, str) or not os.path.exists(src):
        print(f"[WARN] Missing file, skipping: {src}")
        continue
    fid = row["id"] if isinstance(row["id"], str) and row["id"] else Path(src).stem
    dst = os.path.join(preproc_dir, f"{fid}.wav")
    try:
        to_wav_16k_mono(src, dst)
        df.at[i, "id"] = fid
        df.at[i, "path"] = dst
        updated += 1
    except Exception as e:
        print(f"[WARN] Failed to preprocess {src}: {e}")

df.to_csv(MANIFEST_PATH, index=False)
print(f"Preprocessed {updated} file(s) to WAV in: {preproc_dir}")
display(df.tail(min(5, len(df))))


[WARN] Failed to preprocess /content/drive/MyDrive/speech_ai_colab/inputs/preproc/ar.wav: Command '['bash', '-lc', 'ffmpeg -y -hide_banner -loglevel error -i "/content/drive/MyDrive/speech_ai_colab/inputs/preproc/ar.wav" -ac 1 -ar 16000 -vn -map_metadata -1 "/content/drive/MyDrive/speech_ai_colab/inputs/preproc/ar.wav"']' returned non-zero exit status 1.
[WARN] Failed to preprocess /content/drive/MyDrive/speech_ai_colab/inputs/preproc/zh.wav: Command '['bash', '-lc', 'ffmpeg -y -hide_banner -loglevel error -i "/content/drive/MyDrive/speech_ai_colab/inputs/preproc/zh.wav" -ac 1 -ar 16000 -vn -map_metadata -1 "/content/drive/MyDrive/speech_ai_colab/inputs/preproc/zh.wav"']' returned non-zero exit status 1.
[WARN] Failed to preprocess /content/drive/MyDrive/speech_ai_colab/inputs/preproc/en.wav: Command '['bash', '-lc', 'ffmpeg -y -hide_banner -loglevel error -i "/content/drive/MyDrive/speech_ai_colab/inputs/preproc/en.wav" -ac 1 -ar 16000 -vn -map_metadata -1 "/content/drive/MyDrive/spee

Unnamed: 0,id,path,src_lang,transcript_src,ref_translation_en,ref_translation_tgt,split,license
13,zh,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,
14,en,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,
15,de,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,
16,es,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,
17,de,/content/drive/MyDrive/speech_ai_colab/inputs/...,auto,,,,demo,


In [25]:
import soundfile as sf
from IPython.display import Audio, display
import pandas as pd

df = pd.read_csv(MANIFEST_PATH)

def duration(path):
    try:
        with sf.SoundFile(path) as f:
            return len(f) / f.samplerate
    except Exception:
        return None

df["duration_s"] = df["path"].apply(duration)
print(df[["id","src_lang","duration_s","path"]])

# Listen to the first few files
for _, row in df.head(3).iterrows():
    print(f"\n{id} — {row['id']} | {row['src_lang']} | {row['duration_s']:.2f}s")
    display(Audio(filename=row["path"]))

      id src_lang  duration_s  \
0     ar     auto    5.564063   
1     zh     auto    4.519188   
2     en     auto    4.519188   
3     de     auto    4.440812   
4     es     auto    4.597563   
5   de_2     auto         NaN   
6   ar_2     auto         NaN   
7   zh_2     auto         NaN   
8   en_2     auto         NaN   
9   de_3     auto         NaN   
10  es_2     auto         NaN   
11  de_4     auto         NaN   
12    ar     auto    5.564063   
13    zh     auto    4.519188   
14    en     auto    4.519188   
15    de     auto    4.440812   
16    es     auto    4.597563   
17    de     auto    4.440812   

                                                 path  
0   /content/drive/MyDrive/speech_ai_colab/inputs/...  
1   /content/drive/MyDrive/speech_ai_colab/inputs/...  
2   /content/drive/MyDrive/speech_ai_colab/inputs/...  
3   /content/drive/MyDrive/speech_ai_colab/inputs/...  
4   /content/drive/MyDrive/speech_ai_colab/inputs/...  
5   /content/drive/MyDrive/speech_ai


<built-in function id> — zh | auto | 4.52s



<built-in function id> — en | auto | 4.52s


## 6) Common preprocessing (placeholder)
This section will standardize audio (16 kHz mono), apply mild normalization, and optionally trim leading/trailing silence.

We will also warn for long files (duration > MAX_AUDIO_SECONDS) and recommend chunking. A simple inline audio player will preview inputs after preprocessing.

Outputs:
- Preprocessed WAVs stored alongside inputs (or in a temp folder), with updated paths reflected in the manifest used by pipelines.

In [26]:
#Normalize to target peak and optionally trim leading/trailing silence
import os, math
import numpy as np
import soundfile as sf
import pandas as pd

MANIFEST_PATH = os.path.join(PATHS["manifests"], "manifest.csv")

# Settings
MAX_AUDIO_SECONDS = 180  # warn if longer than this
APPLY_PEAK_NORMALIZE = True
TARGET_PEAK_DBFS = -1.0  # => amplitude ~0.891
TRIM_SILENCE = True
SILENCE_THRESH_DBFS = -40.0
MIN_SILENCE_DUR_S = 0.20   # ignore brief dips
TRIM_PADDING_S = 0.05      # keep a tiny context

def amp_from_dbfs(dbfs: float) -> float:
    return float(10.0 ** (dbfs / 20.0))

def dbfs_from_amp(amp: float) -> float:
    if amp <= 0:
        return float("-inf")
    return float(20.0 * math.log10(amp))

def frame_rms(x: np.ndarray, frame: int, hop: int) -> np.ndarray:
    n = len(x)
    if n < frame:
        return np.array([np.sqrt(np.mean(x**2) + 1e-12)], dtype=np.float32)
    n_frames = 1 + max(0, (n - frame) // hop)
    rms = np.empty(n_frames, dtype=np.float32)
    for i in range(n_frames):
        s = i * hop
        e = s + frame
        seg = x[s:e]
        rms[i] = float(np.sqrt(np.mean(seg**2) + 1e-12))
    return rms

def trim_silence(x: np.ndarray, sr: int, thresh_dbfs: float, min_silence_s: float, pad_s: float):
    if x.size == 0:
        return x, 0, 0
    # Frame params: 50 ms window, 10 ms hop
    frame = max(1, int(0.050 * sr))
    hop = max(1, int(0.010 * sr))
    rms = frame_rms(x, frame, hop)
    thresh_amp = amp_from_dbfs(thresh_dbfs)

    voiced = rms >= thresh_amp
    if not np.any(voiced):
        return x, 0, 0  # nothing above threshold

    # Enforce minimum silence duration at edges (optional conservative)
    # Find first and last voiced frames
    first = int(np.argmax(voiced))
    last = int(len(voiced) - 1 - np.argmax(voiced[::-1]))

    pad = int(pad_s * sr)
    start_sample = max(0, first * hop - pad)
    end_sample = min(len(x), last * hop + frame + pad)

    if end_sample <= start_sample:
        return x, 0, 0

    lead = start_sample
    tail = len(x) - end_sample
    return x[start_sample:end_sample], lead, tail

def peak_normalize(x: np.ndarray, target_dbfs: float):
    peak = float(np.max(np.abs(x))) if x.size else 0.0
    before_dbfs = dbfs_from_amp(peak)
    if peak <= 0.0:
        return x, before_dbfs, 0.0  # all silence
    target_amp = amp_from_dbfs(target_dbfs)
    gain = target_amp / peak
    y = np.clip(x * gain, -1.0, 1.0)
    after_peak = float(np.max(np.abs(y)))
    after_dbfs = dbfs_from_amp(after_peak)
    gain_db = dbfs_from_amp(abs(gain))  # positive => amplification
    return y.astype(np.float32), before_dbfs, gain_db

df = pd.read_csv(MANIFEST_PATH)
updated = 0

for i, row in df.iterrows():
    p = str(row.get("path", ""))
    if not p or not os.path.exists(p):
        print(f"[WARN] Missing path, skipping: {p}")
        continue
    if not p.lower().endswith(".wav"):
        print(f"[SKIP] Not a WAV (expected output of 5.3): {p}")
        continue

    try:
        x, sr = sf.read(p, dtype="float32", always_2d=False)
        if x.ndim == 2:
            x = np.mean(x, axis=1)  # ensure mono
        dur = len(x) / float(sr) if sr > 0 else 0.0
        if dur > MAX_AUDIO_SECONDS:
            print(f"[WARN] Long file ({dur:.1f}s) — consider chunking: {row.get('id','(no id)')}")

        # Trim silence (optional)
        lead_s = tail_s = 0.0
        if TRIM_SILENCE:
            y, lead, tail = trim_silence(x, sr, SILENCE_THRESH_DBFS, MIN_SILENCE_DUR_S, TRIM_PADDING_S)
            lead_s = lead / float(sr)
            tail_s = tail / float(sr)
        else:
            y = x

        # Peak normalize (optional)
        gain_db_applied = 0.0
        before_peak_db = dbfs_from_amp(float(np.max(np.abs(y))) if y.size else 0.0)
        if APPLY_PEAK_NORMALIZE:
            y, before_peak_db, gain_db_applied = peak_normalize(y, TARGET_PEAK_DBFS)

        # Save in place
        sf.write(p, y, sr, subtype="PCM_16")
        after_dur = len(y) / float(sr)
        after_peak_db = dbfs_from_amp(float(np.max(np.abs(y))) if y.size else 0.0)

        updated += 1
        print(f"[6] {row.get('id','(no id)')}: {dur:.2f}s -> {after_dur:.2f}s | peak {before_peak_db:.1f} dBFS -> {after_peak_db:.1f} dBFS | gain {gain_db_applied:+.1f} dB | trim lead {lead_s:.2f}s tail {tail_s:.2f}s")
    except Exception as e:
        print(f"[ERR] {row.get('id','(no id)')}: {e}")

print(f"Preprocessing done. Updated {updated} file(s). Manifest unchanged (paths already point to preprocessed WAVs).")

[6] ar: 5.56s -> 5.34s | peak -1.2 dBFS -> -1.0 dBFS | gain +0.2 dB | trim lead 0.07s tail 0.15s
[6] zh: 4.52s -> 4.38s | peak -3.0 dBFS -> -1.0 dBFS | gain +2.0 dB | trim lead 0.01s tail 0.13s
[6] en: 4.52s -> 4.52s | peak -3.6 dBFS -> -1.0 dBFS | gain +2.6 dB | trim lead 0.00s tail 0.00s
[6] de: 4.44s -> 4.29s | peak -0.9 dBFS -> -1.0 dBFS | gain -0.1 dB | trim lead 0.02s tail 0.13s
[6] es: 4.60s -> 4.60s | peak -1.1 dBFS -> -1.0 dBFS | gain +0.1 dB | trim lead 0.00s tail 0.00s
[WARN] Missing path, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/de_2.wav
[WARN] Missing path, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/ar_2.wav
[WARN] Missing path, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/zh_2.wav
[WARN] Missing path, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/en_2.wav
[WARN] Missing path, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/de_3.wav
[WARN] Missing path, skipping: /content/drive

In [27]:
import pandas as pd
from IPython.display import Audio, display

df = pd.read_csv(MANIFEST_PATH)
print(df[["id","src_lang","path"]].tail(min(5, len(df))))

for _, row in df.tail(min(3, len(df))).iterrows():
    print(f"\n{row['id']} — {row['path']}")
    try:
        display(Audio(filename=row["path"]))
    except Exception as e:
        print("[Audio preview error]", e)

    id src_lang                                               path
13  zh     auto  /content/drive/MyDrive/speech_ai_colab/inputs/...
14  en     auto  /content/drive/MyDrive/speech_ai_colab/inputs/...
15  de     auto  /content/drive/MyDrive/speech_ai_colab/inputs/...
16  es     auto  /content/drive/MyDrive/speech_ai_colab/inputs/...
17  de     auto  /content/drive/MyDrive/speech_ai_colab/inputs/...

de — /content/drive/MyDrive/speech_ai_colab/inputs/preproc/de.wav



es — /content/drive/MyDrive/speech_ai_colab/inputs/preproc/es.wav



de — /content/drive/MyDrive/speech_ai_colab/inputs/preproc/de.wav


## 7) Shared components and caching (placeholder)
We will centralize model loading to reuse components across pipelines and avoid re-computation:
- ASR (Whisper via faster-whisper)
- MT (NLLB-200 distilled 600M or OPUS-MT)
- TTS (SpeechT5/Parler for English; MMS-TTS for non-English)

This section will also manage GPU memory between stages and provide a small cache for intermediate text outputs (ASR transcripts, MT outputs).

In [None]:
# Placeholders for shared resources
ASR_MODEL = None
MT_MODEL = None
TTS_MODEL_EN = None
TTS_MODEL_NONEN = None
CACHE = {"asr": {}, "mt": {}}

def free_gpu_memory():
    try:
        import torch
        import gc
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except Exception:
        pass

print("[Shared components placeholder ready]")

## 8) Approach A — Speech → Whisper translate (English) → English TTS (placeholder)
Purpose: fastest path to English audio.

Steps (to be implemented):
1. ASR with Whisper `task=translate` → English text
2. English TTS with SpeechT5 (default) or Parler‑TTS

Outputs per file:
- English transcript (.txt)
- English audio (.wav)
- Timing and (later) peak VRAM stats

In [34]:
!pip -q install faster-whisper transformers soundfile einops



In [30]:
#Load Whisper (translate to English)

import os, time, math, torch
from faster_whisper import WhisperModel

# Device/config
if torch.cuda.is_available():
    ASR_MODEL_SIZE = "large-v3"
    ASR_DEVICE = "cuda"
    ASR_COMPUTE_TYPE = "float16"
    NUM_WORKERS = 4
else:
    ASR_MODEL_SIZE = "small"
    ASR_DEVICE = "cpu"
    ASR_COMPUTE_TYPE = "int8"   # fast on CPU
    NUM_WORKERS = max(1, os.cpu_count() // 2)

print(f"Loading Whisper (Approach A): model={ASR_MODEL_SIZE}, device={ASR_DEVICE}, compute={ASR_COMPUTE_TYPE}")
t0 = time.time()
asr_model = WhisperModel(
    ASR_MODEL_SIZE,
    device=ASR_DEVICE,
    compute_type=ASR_COMPUTE_TYPE,
    cpu_threads=0,
    num_workers=NUM_WORKERS,
)
print(f"ASR model loaded in {time.time()-t0:.1f}s")

Loading Whisper (Approach A): model=small, device=cpu, compute=int8


vocabulary.txt: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

model.bin:   0%|          | 0.00/484M [00:00<?, ?B/s]

ASR model loaded in 12.7s


In [33]:
# Cell 8.2 — Load English TTS (SpeechT5) without datasets/xvectors
import time, torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

TTS_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float16 if TTS_DEVICE == "cuda" else torch.float32

print("Loading SpeechT5 TTS...")
t0 = time.time()
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(TTS_DEVICE, dtype=DTYPE)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(TTS_DEVICE, dtype=DTYPE)
print(f"TTS loaded in {time.time()-t0:.1f}s on {TTS_DEVICE} ({DTYPE})")

# Use a neutral fallback speaker embedding (512-dim) — avoids datasets entirely
speaker_embeddings = torch.zeros((1, 512), device=TTS_DEVICE, dtype=DTYPE)
print("Using neutral fallback speaker embedding (zeros, 512-dim). Voice will be generic but works.")

Loading SpeechT5 TTS...


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

TTS loaded in 5.2s on cpu (torch.float32)
Using neutral fallback speaker embedding (zeros, 512-dim). Voice will be generic but works.


In [38]:
# Cell 8.3 — Run Approach A and show results in Colab (text + audio preview)
import os, re, time, math
import numpy as np
import pandas as pd
import soundfile as sf
from pathlib import Path
from IPython.display import Audio, display

MANIFEST_PATH = os.path.join(PATHS["manifests"], "manifest.csv")
OUT_TXT_DIR = os.path.join(PATHS["outputs"], "approachA", "txt")
OUT_WAV_DIR = os.path.join(PATHS["outputs"], "approachA", "wav")
os.makedirs(OUT_TXT_DIR, exist_ok=True)
os.makedirs(OUT_WAV_DIR, exist_ok=True)

# Re-run controls
RETRANSCRIBE = False   # set True to overwrite existing English text files
RE_TTS       = False   # set True to overwrite existing English wav files

# How many previews to show inline in Colab after processing
PREVIEW_MAX = 3
TEXT_PREVIEW_CHARS = 400

def chunk_text(s: str, max_len: int = 220):
    s = re.sub(r"\s+", " ", s).strip()
    if len(s) <= max_len:
        return [s] if s else []
    # sentence split first, then chunk
    parts = re.split(r"(?<=[\.\!\?\:;])\s+", s)
    chunks, cur = [], ""
    for p in parts:
        if not p:
            continue
        if len(cur) + 1 + len(p) <= max_len:
            cur = (cur + " " + p).strip()
        else:
            if cur:
                chunks.append(cur)
            cur = p
    if cur:
        chunks.append(cur)
    # fallback: if any chunk still too long, hard split
    final = []
    for c in chunks:
        if len(c) <= max_len:
            final.append(c)
        else:
            for i in range(0, len(c), max_len):
                final.append(c[i:i+max_len])
    return [x for x in final if x.strip()]

def whisper_translate_to_en(audio_path: str):
    # task="translate" -> Whisper outputs English text
    segments, info = asr_model.transcribe(
        audio_path,
        language=None,          # auto-detect source language
        task="translate",       # translate to English
        beam_size=5,
        best_of=5,
        temperature=0.0,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 500},
        word_timestamps=False,
    )
    segs = list(segments)
    text = " ".join((s.text or "").strip() for s in segs).strip()
    return segs, text, getattr(info, "language", None), getattr(info, "language_probability", None)

def synthesize_speecht5(text: str, out_wav: str):
    if not text.strip():
        return 0.0
    chunks = chunk_text(text, max_len=220)
    if not chunks:
        return 0.0

    sr = 16000
    waves = []
    for chunk in chunks:
        inputs = processor(text=chunk, return_tensors="pt").to(TTS_DEVICE)
        with torch.autocast(device_type=TTS_DEVICE, dtype=DTYPE) if TTS_DEVICE=="cuda" else torch.no_grad():
            speech = tts_model.generate_speech(
                inputs["input_ids"],
                speaker_embeddings,
                vocoder=vocoder
            )
        waves.append(speech.detach().float().cpu().numpy())

    y = np.concatenate(waves) if len(waves) > 1 else waves[0]
    # Peak-normalize softly to -1 dBFS
    peak = float(np.max(np.abs(y))) if y.size else 0.0
    if peak > 0:
        target_amp = 10 ** (-1.0 / 20.0)
        y = np.clip(y * (target_amp / peak), -1.0, 1.0).astype(np.float32)
    sf.write(out_wav, y, sr, subtype="PCM_16")
    return len(y) / sr

# Track GPU peak VRAM (optional)
def reset_cuda_peak():
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()

def cuda_peak_mb():
    if torch.cuda.is_available():
        return torch.cuda.max_memory_allocated() / (1024**2)
    return 0.0

df = pd.read_csv(MANIFEST_PATH)
for col in ["english_text", "english_audio_path", "asr_time_s", "tts_time_s", "src_lang_detected", "src_lang_prob", "gpu_mem_peak_mb"]:
    if col not in df.columns:
        df[col] = ""

processed = 0
total_t_asr = 0.0
total_t_tts = 0.0
results = []  # collect results for display

for i, row in df.iterrows():
    audio_path = row.get("path", "")
    if not isinstance(audio_path, str) or not os.path.exists(audio_path):
        print(f"[WARN] Missing file, skipping: {audio_path}")
        continue

    file_id = row["id"] if isinstance(row.get("id"), str) and row["id"] else Path(audio_path).stem
    out_txt = os.path.join(OUT_TXT_DIR, f"{file_id}.txt")
    out_wav = os.path.join(OUT_WAV_DIR, f"{file_id}.wav")

    # English text
    english_text = None
    if (not RETRANSCRIBE) and os.path.exists(out_txt):
        try:
            with open(out_txt, "r", encoding="utf-8") as f:
                english_text = f.read().strip()
        except Exception:
            english_text = None

    # 1) Whisper translate -> English text
    reset_cuda_peak()
    t0 = time.time()
    det_lang, det_prob = None, None
    if english_text is None or not english_text:
        try:
            segs, english_text, det_lang, det_prob = whisper_translate_to_en(audio_path)
            with open(out_txt, "w", encoding="utf-8") as f:
                f.write((english_text or "").strip() + "\n")
            df.at[i, "src_lang_detected"] = det_lang or ""
            df.at[i, "src_lang_prob"] = f"{det_prob:.3f}" if det_prob is not None else ""
        except Exception as e:
            print(f"[ERR] ASR/translate failed for {file_id}: {e}")
            english_text = ""
    t_asr = time.time() - t0
    total_t_asr += t_asr

    # 2) TTS English
    tts_done = False
    t_tts = 0.0
    if english_text:
        need_tts = RE_TTS or (not os.path.exists(out_wav))
        if need_tts:
            t0 = time.time()
            try:
                _dur = synthesize_speecht5(english_text, out_wav)
                t_tts = time.time() - t0
                tts_done = True
            except Exception as e:
                print(f"[ERR] TTS failed for {file_id}: {e}")
        else:
            tts_done = True
    total_t_tts += t_tts

    # Update manifest
    df.at[i, "english_text"] = english_text or ""
    df.at[i, "english_audio_path"] = out_wav if (english_text and os.path.exists(out_wav)) else ""
    df.at[i, "asr_time_s"] = f"{t_asr:.2f}"
    df.at[i, "tts_time_s"] = f"{t_tts:.2f}"
    df.at[i, "gpu_mem_peak_mb"] = f"{cuda_peak_mb():.1f}" if torch.cuda.is_available() else ""
    if det_lang is not None and not str(df.at[i, "src_lang_detected"]).strip():
        df.at[i, "src_lang_detected"] = det_lang
    if det_prob is not None and not str(df.at[i, "src_lang_prob"]).strip():
        df.at[i, "src_lang_prob"] = f"{det_prob:.3f}"

    # Collect results for display
    results.append({
        "id": file_id,
        "src_lang_detected": df.at[i, "src_lang_detected"],
        "src_lang_prob": df.at[i, "src_lang_prob"],
        "english_text_preview": (english_text or "")[:TEXT_PREVIEW_CHARS],
        "english_text_path": out_txt if os.path.exists(out_txt) else "",
        "english_audio_path": out_wav if os.path.exists(out_wav) else "",
        "asr_time_s": f"{t_asr:.2f}",
        "tts_time_s": f"{t_tts:.2f}",
        "gpu_mem_peak_mb": df.at[i, "gpu_mem_peak_mb"],
    })

    processed += 1
    print(f"[OK] {file_id} | ASR:{t_asr:.1f}s | TTS:{t_tts:.1f}s | src≈{df.at[i,'src_lang_detected']}")

df.to_csv(MANIFEST_PATH, index=False)

print(f"\nApproach A complete. Files processed: {processed}")
print(f"Outputs:\n- English text: {OUT_TXT_DIR}\n- English audio: {OUT_WAV_DIR}")
print(f"Totals: ASR {total_t_asr:.1f}s, TTS {total_t_tts:.1f}s")

# ---------- Colab display: summary table + inline previews ----------
if results:
    print("\nSummary of latest results:")
    view_cols = ["id","src_lang_detected","src_lang_prob","asr_time_s","tts_time_s","gpu_mem_peak_mb","english_text_path","english_audio_path","english_text_preview"]
    display(pd.DataFrame(results)[view_cols])

    print(f"\nInline previews (up to {PREVIEW_MAX}):")
    for r in results[:PREVIEW_MAX]:
        print(f"\nID: {r['id']} | src≈{r['src_lang_detected']} (p={r['src_lang_prob']})")
        txt = r["english_text_preview"]
        print("English text:", txt if txt else "(empty)")
        wav = r["english_audio_path"]
        if isinstance(wav, str) and os.path.exists(wav) and os.path.getsize(wav) > 44:
            display(Audio(filename=wav))
        else:
            print("(no audio output)")
else:
    print("No results to display.")

[OK] ar | ASR:0.0s | TTS:0.0s | src≈ar
[OK] zh | ASR:0.0s | TTS:0.0s | src≈zh
[OK] en | ASR:0.0s | TTS:0.0s | src≈en
[OK] de | ASR:0.0s | TTS:0.0s | src≈de
[OK] es | ASR:0.0s | TTS:0.0s | src≈es
[WARN] Missing file, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/de_2.wav
[WARN] Missing file, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/ar_2.wav
[WARN] Missing file, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/zh_2.wav
[WARN] Missing file, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/en_2.wav
[WARN] Missing file, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/de_3.wav
[WARN] Missing file, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/es_2.wav
[WARN] Missing file, skipping: /content/drive/MyDrive/speech_ai_colab/inputs/preproc/de_4.wav
[OK] ar | ASR:0.0s | TTS:0.0s | src≈nan
[OK] zh | ASR:0.0s | TTS:0.0s | src≈nan
[OK] en | ASR:0.0s | TTS:0.0s | src≈nan
[OK] de | ASR:0.0s | TTS:0.

  df.at[i, "asr_time_s"] = f"{t_asr:.2f}"
  df.at[i, "tts_time_s"] = f"{t_tts:.2f}"
  df.at[i, "gpu_mem_peak_mb"] = f"{cuda_peak_mb():.1f}" if torch.cuda.is_available() else ""


Unnamed: 0,id,src_lang_detected,src_lang_prob,asr_time_s,tts_time_s,gpu_mem_peak_mb,english_text_path,english_audio_path,english_text_preview
0,ar,ar,1.0,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a clear example of how to translate wo...
1,zh,zh,1.0,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a voice translation display used in la...
2,en,en,0.995,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a speech translation demo using large ...
3,de,de,1.0,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a demonstration of the language transl...
4,es,es,0.999,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a voice translation demonstration usin...
5,ar,,,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a clear example of how to translate wo...
6,zh,,,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a voice translation display used in la...
7,en,,,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a speech translation demo using large ...
8,de,,,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a demonstration of the language transl...
9,es,,,0.0,0.0,,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,This is a voice translation demonstration usin...



Inline previews (up to 3):

ID: ar | src≈ar (p=1.0)
English text: This is a clear example of how to translate words using a large language.



ID: zh | src≈zh (p=1.0)
English text: This is a voice translation display used in large-scale language models.



ID: en | src≈en (p=0.995)
English text: This is a speech translation demo using large language models.


## 9) Approach B — Speech → ASR (src) → MT (tgt) → TTS (tgt) (placeholder)
Purpose: high-quality multilingual results.

Steps (to be implemented):
1. ASR with Whisper `task=transcribe` (source language)
2. MT with NLLB‑200 distilled 600M (default) or OPUS‑MT to target text
3. TTS in target language: English → SpeechT5/Parler; non‑English → MMS‑TTS

Outputs per file:
- Source transcript (.txt)
- Target translation (.txt)
- Target audio (.wav)
- Timing and (later) peak VRAM stats

In [69]:
!pip -q install faster-whisper transformers sentencepiece sacremoses soundfile einops
!pip -q install edge-tts pydub



In [59]:
# Cleanup manifest: keep only rows whose input audio exists
import os, pandas as pd
MANIFEST_PATH = os.path.join(PATHS["manifests"], "manifest.csv")
df = pd.read_csv(MANIFEST_PATH)
df2 = df[df["path"].apply(lambda p: isinstance(p, str) and os.path.exists(p))].reset_index(drop=True)
df2.to_csv(MANIFEST_PATH, index=False)
print(f"Pruned manifest rows: {len(df)} -> {len(df2)}")

Pruned manifest rows: 18 -> 11


In [60]:
#Load Whisper ASR
import os, time, torch
from faster_whisper import WhisperModel

if torch.cuda.is_available():
    ASR_MODEL_SIZE = "large-v3"
    ASR_DEVICE = "cuda"
    ASR_COMPUTE_TYPE = "float16"
    NUM_WORKERS = 4
else:
    ASR_MODEL_SIZE = "small"
    ASR_DEVICE = "cpu"
    ASR_COMPUTE_TYPE = "int8"
    NUM_WORKERS = max(1, os.cpu_count() // 2)

print(f"Loading Whisper (Approach B): model={ASR_MODEL_SIZE}, device={ASR_DEVICE}, compute={ASR_COMPUTE_TYPE}")
t0 = time.time()
asr_model = WhisperModel(
    ASR_MODEL_SIZE,
    device=ASR_DEVICE,
    compute_type=ASR_COMPUTE_TYPE,
    cpu_threads=0,
    num_workers=NUM_WORKERS,
)
print(f"ASR model loaded in {time.time()-t0:.1f}s")

Loading Whisper (Approach B): model=small, device=cpu, compute=int8
ASR model loaded in 7.0s


In [79]:
# Cell 9.2 — Translation via M2M100 (set TARGET_LANG here)
# Set your target language (ISO-2): e.g., "zh", "en", "de", "es", "ar", ...
TARGET_LANG = "de"  # <--- set the target language here

import re
from typing import Optional
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# M2M100 model (single multilingual MT for many pairs, including -> zh)
M2M_MODEL_ID = "facebook/m2m100_418M"
_m2m_tok = None
_m2m_mdl = None

MT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# fp16 if CUDA; float32 on CPU
MT_DTYPE = torch.float16 if MT_DEVICE == "cuda" else torch.float32

# Normalize common language codes to M2M expected codes
_LANG_ALIAS = {
    # Chinese variants and aliases
    "zh-cn": "zh", "zh_hans": "zh", "zho_hans": "zh", "cmn": "zh", "zh-hans": "zh",
    "zh-tw": "zh", "zh_hant": "zh", "zho_hant": "zh", "zh-hant": "zh",
    # Portuguese variants
    "pt-br": "pt", "pt_pt": "pt",
    # Norwegian
    "no": "nb", "nob": "nb",
}

def _norm_lang(code: Optional[str]) -> Optional[str]:
    if not code:
        return None
    c = str(code).lower().strip()
    return _LANG_ALIAS.get(c, c)

def _ensure_m2m():
    global _m2m_tok, _m2m_mdl
    if _m2m_tok is not None and _m2m_mdl is not None:
        return _m2m_tok, _m2m_mdl
    _m2m_tok = AutoTokenizer.from_pretrained(M2M_MODEL_ID)
    # Use float16 on CUDA for speed; keep default dtype on CPU
    _m2m_mdl = AutoModelForSeq2SeqLM.from_pretrained(M2M_MODEL_ID)
    if MT_DEVICE == "cuda":
        _m2m_mdl = _m2m_mdl.to(MT_DEVICE, dtype=torch.float16)
    else:
        _m2m_mdl = _m2m_mdl.to(MT_DEVICE)
    _m2m_mdl.eval()
    return _m2m_tok, _m2m_mdl

# Optional text language detection when src='auto'
def _detect_lang_text(txt: str) -> Optional[str]:
    s = (txt or "").strip()
    if not s:
        return None
    try:
        from langdetect import detect  # pip install langdetect (optional)
        code = detect(s)
        return _norm_lang(code)
    except Exception:
        # Simple heuristic for Chinese
        if re.search(r"[\u4e00-\u9fff]", s):
            return "zh"
        return None

def translate_text(text: str, src_lang: Optional[str], tgt_lang: str) -> str:
    """
    Translate using M2M100. Handles src_lang='auto' by detecting from text.
    src_lang/tgt_lang should be ISO-2 codes (en, de, es, ar, zh, ...).
    """
    txt = (text or "").strip()
    if not txt:
        return ""
    src = _norm_lang(src_lang)
    tgt = _norm_lang(tgt_lang) or "zh"

    # If src is auto/unknown, try detecting from the text; default to 'en'
    if (not src) or src == "auto":
        guess = _detect_lang_text(txt)
        src = guess or "en"

    tok, mdl = _ensure_m2m()

    # Set languages
    tok.src_lang = src
    forced_id = tok.get_lang_id(tgt)

    inputs = tok(txt, return_tensors="pt", padding=True).to(MT_DEVICE)
    with torch.no_grad():
        gen = mdl.generate(
            **inputs,
            forced_bos_token_id=forced_id,
            max_length=512,
            num_beams=4,
        )
    out = tok.batch_decode(gen, skip_special_tokens=True)
    return out[0] if out else ""

print(f"[MT] M2M100 ready on {MT_DEVICE}. TARGET_LANG={TARGET_LANG}")

[MT] M2M100 ready on cpu. TARGET_LANG=de


In [81]:
# Cell 9.3 — TTS: English via SpeechT5; non‑English via edge-tts (online)
import os, re, time, asyncio, threading
import numpy as np
import soundfile as sf
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Device/dtype
TTS_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
TTS_DTYPE = torch.float16 if TTS_DEVICE == "cuda" else torch.float32

# 1) English TTS backend (SpeechT5)
print("Loading English TTS (SpeechT5)...")
t0 = time.time()
tts_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(TTS_DEVICE, dtype=TTS_DTYPE)
tts_vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(TTS_DEVICE, dtype=TTS_DTYPE)
print(f"SpeechT5 loaded in {time.time()-t0:.1f}s on {TTS_DEVICE}")

# Neutral speaker embedding (keeps us independent of datasets)
speaker_embeddings = torch.zeros((1, 512), device=TTS_DEVICE, dtype=TTS_DTYPE)

# 2) Non‑English backend (edge-tts, online)
import io, asyncio, threading
import numpy as np
from pydub import AudioSegment

try:
    import edge_tts
    EDGE_TTS_AVAILABLE = True
except Exception:
    EDGE_TTS_AVAILABLE = False
    print("[WARN] edge-tts not installed. Run: pip install edge-tts pydub")

EDGE_VOICE = {
    "zh": "zh-CN-XiaoxiaoNeural",
    "ja": "ja-JP-NanamiNeural",
    "ko": "ko-KR-SunHiNeural",
    "de": "de-DE-KatjaNeural",
    "es": "es-ES-ElviraNeural",
    "fr": "fr-FR-DeniseNeural",
    "it": "it-IT-ElsaNeural",
    "pt": "pt-BR-FranciscaNeural",
    "ru": "ru-RU-SvetlanaNeural",
    "ar": "ar-SA-HamedNeural",
    "tr": "tr-TR-AhmetNeural",
    "vi": "vi-VN-HoaiMyNeural",
    "_default": "en-US-AriaNeural",
}

def _run_async(coro):
    try:
        return asyncio.run(coro)
    except RuntimeError:
        try:
            import nest_asyncio
            nest_asyncio.apply()
            loop = asyncio.get_event_loop()
            return loop.run_until_complete(coro)
        except Exception:
            result = {}
            def _target():
                loop = asyncio.new_event_loop()
                asyncio.set_event_loop(loop)
                result["v"] = loop.run_until_complete(coro)
                loop.close()
            t = threading.Thread(target=_target)
            t.start(); t.join()
            return result.get("v")

async def _edge_tts_mp3_bytes(text: str, voice_name: str):
    tts = edge_tts.Communicate(text, voice_name)
    audio = bytearray()
    async for chunk in tts.stream():  # no format kw in this edge-tts version
        if chunk["type"] == "audio":
            audio.extend(chunk["data"])
    return bytes(audio)

def _edge_tts_say(text: str, lang_iso2: str):
    if not EDGE_TTS_AVAILABLE:
        raise RuntimeError("edge-tts not available. Install with: pip install edge-tts pydub")
    voice = EDGE_VOICE.get((lang_iso2 or "").lower(), EDGE_VOICE["_default"])
    mp3_bytes = _run_async(_edge_tts_mp3_bytes(text, voice))
    # Decode MP3 -> PCM using pydub (ffmpeg is available in Colab)
    seg = AudioSegment.from_file(io.BytesIO(mp3_bytes), format="mp3")
    seg = seg.set_frame_rate(16000).set_channels(1).set_sample_width(2)  # 16kHz, mono, 16-bit
    pcm = np.frombuffer(seg.raw_data, dtype=np.int16).astype(np.float32) / 32768.0
    return pcm, 16000


def synthesize_tts(text: str, lang_iso: str, out_wav: str) -> float:
    """Generate speech. English -> SpeechT5; others -> edge-tts (online). Returns duration seconds."""
    text = (text or "").strip()
    if not text:
        return 0.0
    lang = (lang_iso or "").lower()
    chunks = _chunk_text(text, max_len=220)
    if not chunks:
        return 0.0

    if lang == "en":
        sr = 16000
        waves = []
        for ch in chunks:
            inputs = tts_processor(text=ch, return_tensors="pt").to(TTS_DEVICE)
            with torch.autocast(device_type=TTS_DEVICE, dtype=TTS_DTYPE) if TTS_DEVICE=="cuda" else torch.no_grad():
                speech = tts_model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=tts_vocoder)
            waves.append(speech.detach().float().cpu().numpy())
        y = np.concatenate(waves) if len(waves) > 1 else waves[0]
    else:
        y_list = []
        sr = 16000
        for ch in chunks:
            wav, sr_e = _edge_tts_say(ch, lang)
            sr = sr_e
            y_list.append(wav)
        y = np.concatenate(y_list) if len(y_list) > 1 else y_list[0]

    peak = float(np.max(np.abs(y))) if y.size else 0.0
    if peak > 0:
        y = np.clip(y * ((10 ** (-1/20)) / peak), -1.0, 1.0).astype(np.float32)
    sf.write(out_wav, y, sr, subtype="PCM_16")
    return len(y) / sr

print(f"Non-English TTS backend: {'edge-tts ready' if EDGE_TTS_AVAILABLE else 'edge-tts NOT available'}")


Loading English TTS (SpeechT5)...


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

SpeechT5 loaded in 3.6s on cpu
Non-English TTS backend: edge-tts ready


In [77]:
# 9.3 self-test: generate Chinese and English samples and play them
import os
from IPython.display import Audio, display

# Chinese (edge-tts)
try:
    zh_wav = "/content/_tts_smoketest_zh.wav"
    dur = synthesize_tts("你好，世界。今天天气很好。", "zh", zh_wav)
    print(f"ZH: wrote {zh_wav} | duration≈{dur:.2f}s | size={os.path.getsize(zh_wav)} bytes")
    display(Audio(filename=zh_wav))
except Exception as e:
    print("[ERR] Chinese TTS failed:", e)

# English (SpeechT5) — sanity check
try:
    en_wav = "/content/_tts_smoketest_en.wav"
    dur = synthesize_tts("Hello world. This is a speech test.", "en", en_wav)
    print(f"EN: wrote {en_wav} | duration≈{dur:.2f}s | size={os.path.getsize(en_wav)} bytes")
    display(Audio(filename=en_wav))
except Exception as e:
    print("[ERR] English TTS failed:", e)

ZH: wrote /content/_tts_smoketest_zh.wav | duration≈3.79s | size=121388 bytes


EN: wrote /content/_tts_smoketest_en.wav | duration≈1.18s | size=37932 bytes


In [82]:
# Cell 9.4 — Approach B end-to-end (robust, language-aware, filtered to existing inputs)
import os, time, math
import numpy as np
import pandas as pd
import torch
from pathlib import Path
from IPython.display import display, Audio

# Preconditions
if "asr_model" not in globals():
    raise RuntimeError("ASR model not loaded. Please run Cell 9.1 first.")
if "translate_text" not in globals() or "TARGET_LANG" not in globals():
    raise RuntimeError("MT not ready. Please run Cell 9.2 (with TARGET_LANG set) first.")
if "synthesize_tts" not in globals():
    raise RuntimeError("TTS function not defined. Please run Cell 9.3 first.")

# Paths
MANIFEST_PATH = os.path.join(PATHS["manifests"], "manifest.csv")
OUT_DIR = os.path.join(PATHS["outputs"], "approachB", TARGET_LANG)
SRC_TXT_DIR = os.path.join(OUT_DIR, "src_txt")
TGT_TXT_DIR = os.path.join(OUT_DIR, "tgt_txt")
TGT_WAV_DIR = os.path.join(OUT_DIR, "tgt_wav")
for d in (SRC_TXT_DIR, TGT_TXT_DIR, TGT_WAV_DIR):
    os.makedirs(d, exist_ok=True)

# Controls — force refresh once after changing TARGET_LANG, then set back to False
RE_ASR = False
RE_MT  = True   # set to False after verifying zh outputs
RE_TTS = True   # set to False after verifying zh outputs
PREVIEW_MAX = 3

# Helpers
def _safe_str(v):
    if v is None: return ""
    try:
        if isinstance(v, float) and (np.isnan(v) or np.isinf(v)):
            return ""
    except Exception:
        pass
    return str(v)

def asr_transcribe(audio_path: str):
    segments, info = asr_model.transcribe(
        audio_path,
        language=None,
        task="transcribe",
        beam_size=5,
        best_of=5,
        temperature=0.0,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 500},
        word_timestamps=False,
    )
    segs = list(segments)
    text = " ".join((s.text or "").strip() for s in segs).strip()
    det_lang = getattr(info, "language", None)
    det_prob = getattr(info, "language_probability", None)
    return segs, text, det_lang, det_prob

def reset_cuda_peak():
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()

def cuda_peak_mb():
    if torch.cuda.is_available():
        return torch.cuda.max_memory_allocated() / (1024**2)
    return np.nan

# Load manifest
df = pd.read_csv(MANIFEST_PATH)

# Ensure columns exist and dtypes are safe
text_cols = ["src_transcript","tgt_lang","tgt_text","tgt_audio_path","src_lang_detected","id"]
num_cols  = ["asr_time_s","mt_time_s","tts_time_s","gpu_mem_peak_mb","src_lang_prob"]
for c in text_cols + num_cols:
    if c not in df.columns:
        df[c] = ("" if c in text_cols else np.nan)
for c in text_cols:
    df[c] = df[c].astype(object)
for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Process only rows with existing audio paths
exist_mask = df["path"].apply(lambda p: isinstance(p, str) and os.path.exists(p))
missing_count = int((~exist_mask).sum())
if missing_count:
    print(f"[INFO] Skipping {missing_count} manifest rows with missing audio files (not processed).")
df_run = df[exist_mask].copy()
if df_run.empty:
    print("No input files found. Make sure your manifest paths exist.")
    raise SystemExit

# Main loop
processed_ids = []
tot_asr = tot_mt = tot_tts = 0.0

for i, row in df_run.iterrows():
    audio_path = row.get("path", "")
    file_id = row["id"] if isinstance(row.get("id"), str) and row["id"] else Path(audio_path).stem
    df_run.at[i, "id"] = file_id

    src_txt_path = os.path.join(SRC_TXT_DIR, f"{file_id}.txt")
    tgt_txt_path = os.path.join(TGT_TXT_DIR, f"{file_id}.txt")
    tgt_wav_path = os.path.join(TGT_WAV_DIR, f"{file_id}.wav")

    # 1) ASR (source transcript)
    reset_cuda_peak()
    t0 = time.time()
    do_asr = RE_ASR or (not os.path.exists(src_txt_path))
    src_text, det_lang, det_prob = None, None, None
    if do_asr:
        try:
            segs, src_text, det_lang, det_prob = asr_transcribe(audio_path)
            with open(src_txt_path, "w", encoding="utf-8") as f:
                f.write((src_text or "").strip() + "\n")
        except Exception as e:
            print(f"[ERR] ASR failed for {file_id}: {e}")
            src_text = ""
    else:
        try:
            with open(src_txt_path, "r", encoding="utf-8") as f:
                src_text = f.read().strip()
        except Exception:
            src_text = ""
    t_asr = time.time() - t0
    tot_asr += t_asr

    # Resolve src language (avoid passing 'auto' to MT)
    src_lang_iso = (row.get("src_lang") or "").lower().strip()
    if not src_lang_iso or src_lang_iso == "auto":
        if det_lang:
            src_lang_iso = (det_lang or "").split("_")[0].lower()
    if not src_lang_iso or src_lang_iso == "auto":
        prev_det = (row.get("src_lang_detected") or "")
        if isinstance(prev_det, str) and prev_det:
            src_lang_iso = prev_det.split("_")[0].lower()
    if (not src_lang_iso or src_lang_iso == "auto") and src_text:
        try:
            from langdetect import detect
            src_lang_iso = detect(src_text) or "en"
        except Exception:
            # heuristic: Chinese characters present
            if any("\u4e00" <= ch <= "\u9fff" for ch in src_text):
                src_lang_iso = "zh"
            else:
                src_lang_iso = "en"

    # 2) MT (invalidate on lang change or missing or RE_MT)
    prev_tgt_lang = _safe_str(row.get("tgt_lang"))
    need_retranslate = (prev_tgt_lang != TARGET_LANG)
    t0 = time.time()
    if src_text and (RE_MT or need_retranslate or (not os.path.exists(tgt_txt_path))):
        try:
            tgt_text = translate_text(src_text, src_lang_iso, TARGET_LANG)
            with open(tgt_txt_path, "w", encoding="utf-8") as f:
                f.write((tgt_text or "").strip() + "\n")
        except Exception as e:
            print(f"[ERR] MT failed for {file_id} ({src_lang_iso}->{TARGET_LANG}): {e}")
            tgt_text = ""
    else:
        try:
            with open(tgt_txt_path, "r", encoding="utf-8") as f:
                tgt_text = f.read().strip()
        except Exception:
            tgt_text = ""
    t_mt = time.time() - t0
    tot_mt += t_mt

    # 3) TTS (invalidate on lang change or missing or RE_TTS)
    t0 = time.time()
    if tgt_text and (RE_TTS or need_retranslate or (not os.path.exists(tgt_wav_path))):
        try:
            _dur = synthesize_tts(tgt_text, TARGET_LANG, tgt_wav_path)
        except Exception as e:
            print(f"[ERR] TTS failed for {file_id} (lang={TARGET_LANG}): {e}")
    t_tts = time.time() - t0
    tot_tts += t_tts

    # Update df_run with safe dtypes
    df_run.at[i, "src_transcript"] = src_text or ""
    df_run.at[i, "tgt_lang"] = TARGET_LANG
    df_run.at[i, "tgt_text"] = tgt_text or ""
    df_run.at[i, "tgt_audio_path"] = tgt_wav_path if (tgt_text and os.path.exists(tgt_wav_path)) else ""
    df_run.at[i, "asr_time_s"] = float(t_asr)
    df_run.at[i, "mt_time_s"] = float(t_mt)
    df_run.at[i, "tts_time_s"] = float(t_tts)
    if det_lang is not None:
        df_run.at[i, "src_lang_detected"] = det_lang
    if det_prob is not None:
        try:
            df_run.at[i, "src_lang_prob"] = float(det_prob)
        except Exception:
            df_run.at[i, "src_lang_prob"] = np.nan
    df_run.at[i, "gpu_mem_peak_mb"] = float(cuda_peak_mb()) if torch.cuda.is_available() else np.nan

    processed_ids.append(file_id)
    print(f"[OK] {file_id} | tgt={TARGET_LANG} | src≈{df_run.at[i,'src_lang_detected']} | "
          f"ASR:{t_asr:.1f}s MT:{t_mt:.1f}s TTS:{t_tts:.1f}s")

# Persist updates back to full manifest (explicit assignment avoids dtype warnings)
cols_to_write = [
    "src_transcript","tgt_lang","tgt_text","tgt_audio_path",
    "asr_time_s","mt_time_s","tts_time_s","gpu_mem_peak_mb",
    "src_lang_detected","src_lang_prob","id"
]
df.loc[df_run.index, cols_to_write] = df_run[cols_to_write]
df.to_csv(MANIFEST_PATH, index=False)

print(f"\nApproach B complete. Files processed this run: {len(processed_ids)}")
print(f"Outputs (lang={TARGET_LANG}):\n- Source transcript: {SRC_TXT_DIR}\n- Target translation: {TGT_TXT_DIR}\n- Target audio: {TGT_WAV_DIR}")
print(f"Totals: ASR {tot_asr:.1f}s, MT {tot_mt:.1f}s, TTS {tot_tts:.1f}s")

# Summary + previews (only processed rows this run)
rows = []
proc_set = set(processed_ids)
for _, r in df_run.iterrows():
    rid = _safe_str(r.get("id"))
    if rid not in proc_set:
        continue
    rows.append({
        "id": rid,
        "tgt_lang": _safe_str(r.get("tgt_lang")),
        "src_lang_detected": _safe_str(r.get("src_lang_detected")),
        "src_lang_prob": ("" if pd.isna(r.get("src_lang_prob")) else f"{float(r.get('src_lang_prob')):.3f}"),
        "asr_time_s": ("" if pd.isna(r.get("asr_time_s")) else f"{float(r.get('asr_time_s')):.2f}"),
        "mt_time_s": ("" if pd.isna(r.get("mt_time_s")) else f"{float(r.get('mt_time_s')):.2f}"),
        "tts_time_s": ("" if pd.isna(r.get("tts_time_s")) else f"{float(r.get('tts_time_s')):.2f}"),
        "src_txt_path": os.path.join(SRC_TXT_DIR, f"{rid}.txt"),
        "tgt_txt_path": os.path.join(TGT_TXT_DIR, f"{rid}.txt"),
        "tgt_audio_path": _safe_str(r.get("tgt_audio_path")),
        "tgt_text_preview": (_safe_str(r.get("tgt_text")) or "")[:400],
    })

if rows:
    print("\nSummary of processed files:")
    display(pd.DataFrame(rows)[[
        "id","tgt_lang","src_lang_detected","src_lang_prob",
        "asr_time_s","mt_time_s","tts_time_s",
        "src_txt_path","tgt_txt_path","tgt_audio_path","tgt_text_preview"
    ]])

    shown = 0
    for r in rows:
        if shown >= PREVIEW_MAX:
            break
        wav = r["tgt_audio_path"]
        if isinstance(wav, str) and wav and os.path.exists(wav) and os.path.getsize(wav) > 44:
            print(f"\nID: {r['id']} | tgt={r['tgt_lang']} | src≈{r['src_lang_detected']}")
            print("Target text:", r["tgt_text_preview"] or "(empty)")
            display(Audio(filename=wav))
            shown += 1
else:
    print("No results to display (nothing processed or all inputs missing).")

pytorch_model.bin:   0%|          | 0.00/1.94G [00:00<?, ?B/s]

[OK] ar | tgt=de | src≈ar | ASR:13.6s MT:79.4s TTS:2.2s
[OK] zh | tgt=de | src≈zh | ASR:10.0s MT:12.8s TTS:1.8s
[OK] en | tgt=de | src≈en | ASR:8.5s MT:6.1s TTS:0.6s
[OK] de | tgt=de | src≈de | ASR:9.1s MT:6.2s TTS:0.8s
[OK] es | tgt=de | src≈es | ASR:9.1s MT:6.4s TTS:0.9s
[OK] ar | tgt=de | src≈nan | ASR:0.0s MT:6.7s TTS:0.8s
[OK] zh | tgt=de | src≈nan | ASR:0.0s MT:9.1s TTS:1.7s
[OK] en | tgt=de | src≈nan | ASR:0.0s MT:6.0s TTS:0.6s
[OK] de | tgt=de | src≈nan | ASR:0.0s MT:6.3s TTS:1.7s
[OK] es | tgt=de | src≈nan | ASR:0.0s MT:7.0s TTS:1.7s
[OK] de | tgt=de | src≈nan | ASR:0.0s MT:6.4s TTS:0.6s

Approach B complete. Files processed this run: 11
Outputs (lang=de):
- Source transcript: /content/drive/MyDrive/speech_ai_colab/outputs/approachB/de/src_txt
- Target translation: /content/drive/MyDrive/speech_ai_colab/outputs/approachB/de/tgt_txt
- Target audio: /content/drive/MyDrive/speech_ai_colab/outputs/approachB/de/tgt_wav
Totals: ASR 50.4s, MT 152.5s, TTS 13.5s

Summary of processed f

Unnamed: 0,id,tgt_lang,src_lang_detected,src_lang_prob,asr_time_s,mt_time_s,tts_time_s,src_txt_path,tgt_txt_path,tgt_audio_path,tgt_text_preview
0,ar,de,ar,1.0,13.62,79.44,2.17,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Dies ist eine arabische Erläuterung für die Üb...
1,zh,de,zh,1.0,10.03,12.84,1.77,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Es handelt sich um eine Übersetzungs-Demonstra...
2,en,de,en,0.995,8.48,6.07,0.64,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Dies ist eine Sprachübersetzung Demo mit große...
3,de,de,de,1.0,9.15,6.16,0.8,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Dies ist eine Demonstration der Sprachübersetz...
4,es,de,es,0.999,9.06,6.45,0.94,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Dies ist eine Demonstration der Stimmübersetzu...
5,ar,de,,,0.0,6.69,0.82,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Diese arabische Erläuterung zum Übersetzen der...
6,zh,de,,,0.0,9.08,1.73,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Es handelt sich um eine Übersetzungs-Demonstra...
7,en,de,,,0.0,6.0,0.64,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Dies ist eine Sprachübersetzung Demo mit große...
8,de,de,,,0.0,6.35,1.66,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Dies ist eine Demonstration der Sprachübersetz...
9,es,de,,,0.0,6.99,1.67,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,/content/drive/MyDrive/speech_ai_colab/outputs...,Es ist eine Demonstration der Übersetzung der ...



ID: ar | tgt=de | src≈ar
Target text: Dies ist eine arabische Erläuterung für die Übersetzung von Sprache mit einer großen Sprachmalerei



ID: zh | tgt=de | src≈zh
Target text: Es handelt sich um eine Übersetzungs-Demonstration mit einem Modell der Sprache.



ID: en | tgt=de | src≈en
Target text: Dies ist eine Sprachübersetzung Demo mit großen Sprachmodellen.


## 10) Approach C — SeamlessM4T S2ST (optional, placeholder)
Purpose: demonstrate end‑to‑end Speech‑to‑Speech without explicit text.

Notes:
- Installation and availability can vary on Colab; this section must fail gracefully if not installed.
- We'll probe import in the installer cell when we add it and set `SEAMLESS_AVAILABLE` accordingly.

Outputs per file:
- Target audio (.wav)
- Timing and (later) peak VRAM stats

In [14]:
# Step 10.0 — Install required deps (run once per fresh runtime)
import importlib, subprocess, sys
pkgs = ["encodec", "soundfile", "sentencepiece"]
need = [p for p in pkgs if importlib.util.find_spec(p) is None]
if need:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U"] + need)

In [15]:
# Step 10.1 — SeamlessM4T loader + s2st_generate

import importlib, time
import numpy as np
import torch

seamless_model = None
seamless_processor = None
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32

def _ensure_encodec():
    if importlib.util.find_spec("encodec") is None:
        raise RuntimeError("Missing 'encodec'. Run 10.0_install_deps.py first.")

def _load_seamless(model_id: str = "facebook/seamless-m4t-v2-large"):
    global seamless_model, seamless_processor
    if seamless_model is not None and seamless_processor is not None:
        return
    _ensure_encodec()
    from transformers import AutoProcessor
    try:
        from transformers import SeamlessM4Tv2Model as _Seamless
    except Exception:
        from transformers import SeamlessM4TModel as _Seamless
    print(f"[Seamless] Loading {model_id} on {DEVICE} ...")
    t0 = time.time()
    seamless_processor = AutoProcessor.from_pretrained(model_id)
    seamless_model = _Seamless.from_pretrained(model_id, torch_dtype=DTYPE, low_cpu_mem_usage=True).to(DEVICE).eval()
    globals()["seamless_model"] = seamless_model
    globals()["seamless_processor"] = seamless_processor
    print(f"[Seamless] Loaded in {time.time()-t0:.1f}s")

def _to_numpy(x):
    if torch.is_tensor(x):
        x = x.detach().float().cpu().numpy()
    else:
        x = np.asarray(x, dtype=np.float32)
    if x.ndim > 1:
        if x.shape[0] == 1: x = x[0]
        elif x.shape[-1] == 1: x = x[..., 0]
        elif x.ndim == 2 and x.shape[1] not in (1, 2): x = x.mean(axis=1)
    return np.ascontiguousarray(x.astype(np.float32))

def _extract_audio_and_sr(out):
    if isinstance(out, dict):
        if "audio" in out and isinstance(out["audio"], dict):
            a = out["audio"].get("array")
            sr = out["audio"].get("sampling_rate") or out.get("sampling_rate") or out.get("sample_rate") or 16000
            if a is not None: return a, int(sr)
        for k in ("audio", "audio_values", "waveform", "speech_values"):
            if k in out and out[k] is not None:
                return out[k], int(out.get("sampling_rate") or out.get("sample_rate") or 16000)
    for k in ("audio", "audio_values", "waveform", "speech_values"):
        if hasattr(out, k):
            v = getattr(out, k)
            if v is not None:
                sr = int(getattr(out, "sampling_rate", getattr(out, "sample_rate", 16000)))
                return v, sr
    if isinstance(out, (list, tuple)) and out:
        return _extract_audio_and_sr(out[0])
    if torch.is_tensor(out) or isinstance(out, (np.ndarray, list, tuple)):
        return out, 16000
    return None, None

def s2st_generate(input_wav: str, tgt_lang_code3: str, out_wav: str, max_new_tokens: int = 512) -> float:
    _load_seamless()
    import soundfile as sf

    y, sr = sf.read(input_wav, dtype="float32", always_2d=False)
    if getattr(y, "ndim", 1) > 1: y = y.mean(axis=1).astype(np.float32)
    y = np.asarray(y, dtype=np.float32, order="C")

    inputs = seamless_processor(audios=y, sampling_rate=sr, return_tensors="pt")
    inputs = {k: (v.to(DEVICE) if hasattr(v, "to") else v) for k, v in inputs.items()}

    gen_kwargs = dict(
        tgt_lang=tgt_lang_code3,
        generate_speech=True,
        max_new_tokens=max_new_tokens,
        num_beams=1,
        do_sample=False,
        use_cache=True,
        return_dict_in_generate=True,
    )
    with torch.inference_mode():
        out = seamless_model.generate(**inputs, **gen_kwargs)

    a, out_sr = _extract_audio_and_sr(out)
    if a is None:
        raise RuntimeError("No audio returned from generate(). Ensure 'encodec' is installed.")
    y_out = _to_numpy(a)
    out_sr = int(out_sr or 16000)
    sf.write(out_wav, y_out, out_sr, subtype="PCM_16")
    return len(y_out) / float(out_sr)

In [17]:
# Step 10.2 — Choose target language (3-letter M4T code)
# Edit this value to your desired target language.
# Examples: "eng" English, "deu" German, "fra" French, "spa" Spanish, "ita" Italian,
#           "por" Portuguese, "rus" Russian, "jpn" Japanese, "kor" Korean,
#           "cmn" Mandarin Chinese, "arb" Arabic (MSA), "hin" Hindi, etc.

S2ST_TGT_LANG_M4T = "deu"  # change this
print("[Lang] Target code:", S2ST_TGT_LANG_M4T)

[Lang] Target code: deu


In [19]:
# Step 10.3 — Batch Approach C with auto-discovery (matches A/B behavior)
# Prereqs: run 10.0_install_deps.py, 10.1_seamless_s2st.py, 10.2_target_lang.py

import os, glob, time

# Guards
if "s2st_generate" not in globals():
    raise RuntimeError("s2st_generate is not defined. Run 10.1_seamless_s2st.py first.")
if "S2ST_TGT_LANG_M4T" not in globals():
    raise RuntimeError("Target language not set. Run 10.2_target_lang.py first.")

# Where to look (same places A/B used; prioritized)
SEARCH_DIRS = [
    "/content/drive/MyDrive/speech_ai_colab/inputs/preproc",
    "/content/drive/MyDrive/speech_ai_colab/inputs",
]
EXTS = ("wav",)  # We target WAVs from the preproc step

OUT_DIR = "/content/drive/MyDrive/approachC_batch_out"
os.makedirs(OUT_DIR, exist_ok=True)

def discover_wavs(dirs, exts, limit=5):
    found = []
    for d in dirs:
        if not os.path.isdir(d):
            continue
        for ext in exts:
            # search recursively, keep stable ordering
            files = sorted(glob.glob(os.path.join(d, "**", f"*.{ext}"), recursive=True))
            for f in files:
                if len(found) < limit:
                    found.append(f)
                else:
                    break
        if len(found) >= limit:
            break
    return found[:limit]

INPUTS = discover_wavs(SEARCH_DIRS, EXTS, limit=5)
if not INPUTS:
    raise FileNotFoundError(
        "No WAV files found under the expected preprocessed directories.\n"
        "Make sure your A/B preprocessing wrote WAVs to:\n - /content/drive/MyDrive/speech_ai_colab/inputs/preproc\n"
        "Or mount Drive and place WAVs there. If needed, set INPUTS manually."
    )

def audio_duration(path):
    try:
        import soundfile as sf
        info = sf.info(path)
        if info.samplerate and info.frames:
            return info.frames / float(info.samplerate)
    except Exception:
        pass
    return None

# Run batch
rows = []
ok, fail = 0, 0

for i, inp in enumerate(INPUTS, 1):
    base = os.path.splitext(os.path.basename(inp))[0]
    outp = os.path.join(OUT_DIR, f"{base}.s2st.{S2ST_TGT_LANG_M4T}.wav")
    t0 = time.time()
    status = "ok"
    err = ""
    in_dur = audio_duration(inp)
    try:
        out_dur = s2st_generate(inp, S2ST_TGT_LANG_M4T, outp, max_new_tokens=512)
        ok += 1
    except Exception as e:
        status = "error"
        err = str(e)
        outp = ""
        out_dur = None
        fail += 1
    t1 = time.time()
    rows.append({
        "idx": i,
        "input": inp,
        "in_dur_s": round(in_dur, 2) if in_dur else None,
        "target": S2ST_TGT_LANG_M4T,
        "output": outp,
        "out_dur_s": round(out_dur, 2) if out_dur else None,
        "status": status,
        "elapsed_s": round(t1 - t0, 2),
        "error": err,
    })
    print(f"[{i}/{len(INPUTS)}] {status.upper()} -> {outp or '-'} ({round(t1 - t0, 2)}s)")

print(f"Done. OK={ok}, Fail={fail}. Outputs in: {OUT_DIR}")

# Summary table (like A/B)
have_pandas = True
try:
    import pandas as pd
except Exception:
    have_pandas = False

if have_pandas:
    df = pd.DataFrame(rows, columns=["idx","input","in_dur_s","target","output","out_dur_s","status","elapsed_s","error"])
    from IPython.display import display
    display(df)
else:
    print("\nSummary:")
    for r in rows:
        print(f"- [{r['idx']}] {r['status']} | in_dur={r['in_dur_s']}s -> out_dur={r['out_dur_s']}s | target={r['target']}\n  in: {r['input']}\n  out: {r['output']}\n  err: {r['error']}")

# Inline audio preview (source + translated), same as A/B
try:
    from IPython.display import Audio, display, HTML
    print("\nAudio preview:")
    for r in rows:
        try:
            if os.path.isfile(r["input"]):
                display(HTML(f"<b>Input:</b> {os.path.basename(r['input'])} ({r.get('in_dur_s','?')}s)"))
                display(Audio(r["input"], autoplay=False))
            if r["output"] and os.path.isfile(r["output"]):
                display(HTML(f"<b>Output ({r['target']}):</b> {os.path.basename(r['output'])} ({r.get('out_dur_s','?')}s)"))
                display(Audio(r["output"], autoplay=False))
            print("-" * 60)
        except Exception:
            pass
except Exception:
    print("[Preview] Inline audio not available in this environment.")

[1/5] OK -> /content/drive/MyDrive/approachC_batch_out/ar.s2st.deu.wav (4.68s)
[2/5] OK -> /content/drive/MyDrive/approachC_batch_out/de.s2st.deu.wav (2.97s)
[3/5] OK -> /content/drive/MyDrive/approachC_batch_out/en.s2st.deu.wav (2.36s)
[4/5] OK -> /content/drive/MyDrive/approachC_batch_out/es.s2st.deu.wav (3.16s)
[5/5] OK -> /content/drive/MyDrive/approachC_batch_out/zh.s2st.deu.wav (2.5s)
Done. OK=5, Fail=0. Outputs in: /content/drive/MyDrive/approachC_batch_out


Unnamed: 0,idx,input,in_dur_s,target,output,out_dur_s,status,elapsed_s,error
0,1,/content/drive/MyDrive/speech_ai_colab/inputs/...,5.34,deu,/content/drive/MyDrive/approachC_batch_out/ar....,4.7,ok,4.68,
1,2,/content/drive/MyDrive/speech_ai_colab/inputs/...,4.29,deu,/content/drive/MyDrive/approachC_batch_out/de....,4.34,ok,2.97,
2,3,/content/drive/MyDrive/speech_ai_colab/inputs/...,4.52,deu,/content/drive/MyDrive/approachC_batch_out/en....,4.0,ok,2.36,
3,4,/content/drive/MyDrive/speech_ai_colab/inputs/...,4.6,deu,/content/drive/MyDrive/approachC_batch_out/es....,4.36,ok,3.16,
4,5,/content/drive/MyDrive/speech_ai_colab/inputs/...,4.38,deu,/content/drive/MyDrive/approachC_batch_out/zh....,4.7,ok,2.5,



Audio preview:


------------------------------------------------------------


------------------------------------------------------------


------------------------------------------------------------


------------------------------------------------------------


------------------------------------------------------------


## 11) Interactive runner and batch processing (placeholder)
Two ways to run:
- Interactive: choose one file from the manifest and one pipeline (A/B/C) → preview outputs inline.

---


- Batch: run selected pipelines over all entries in the manifest → save outputs to outputs/ and a results CSV.

This section will manage sequencing (load/free models between stages) to stay within 15 GB VRAM.

In [None]:
def run_one(file_row: dict, pipeline: str, cfg: dict) -> dict:
    """Dispatch to A/B/C placeholders; will be wired to real implementations later."""
    input_path = file_row["path"]
    out_dir = PATHS["outputs"]
    cfg = {**cfg, "SEAMLESS_AVAILABLE": SEAMLESS_AVAILABLE}
    if pipeline.upper() == "A":
        return run_pipeline_a(input_path, out_dir, cfg)
    if pipeline.upper() == "B":
        return run_pipeline_b(input_path, out_dir, cfg)
    if pipeline.upper() == "C":
        return run_pipeline_c(input_path, out_dir, cfg)
    raise ValueError("Unknown pipeline: " + str(pipeline))

print("[Runner placeholder ready]")

## 12) Evaluation & comparison (placeholder)
Planned objective metrics:
- End‑to‑end intelligibility: re‑ASR synthesized audio in target language and compute WER/CER vs. target text (A/B) or a reference (if available) for C.
- MT quality (B): BLEU/chrF (and optional COMET on a subset).
- Efficiency: latency per minute of audio, peak VRAM, disk footprint.

Planned subjective assessment:
- Quick MOS (1–5) for naturalness and pronunciation via a small in-notebook widget or a simple Google Form link.

Outputs:
- CSVs in results/ with metrics per file × pipeline.
- Bar charts saved to figures/ and displayed inline.
- An audio gallery comparing A/B/C per input (inline players).

In [None]:
print("[Evaluation placeholder] — We will add metrics (WER, BLEU/chrF), plots, and an audio gallery here.")

## 13) Reproducibility & housekeeping (placeholder)
This section will:
- Print environment versions (Python, CUDA, PyTorch, Transformers, model IDs).
- Set random seeds (where applicable).
- Provide cache management buttons (clean HF/torch/pip caches).
- Show disk usage and guidance if low space.

These utilities help keep the Colab VM stable and the results reproducible for reviewers.

In [None]:
def print_env_summary():
    import platform
    print("\n=== Environment Summary ===")
    print("Project:", PROJECT_NAME)
    print("Python:", platform.python_version())
    try:
        import torch
        print("PyTorch:", torch.__version__)
        print("CUDA available:", torch.cuda.is_available())
        if torch.cuda.is_available():
            print("GPU:", torch.cuda.get_device_name(0))
    except Exception as e:
        print("PyTorch not installed yet:", e)
    try:
        import transformers
        print("Transformers:", transformers.__version__)
    except Exception as e:
        print("Transformers not installed yet:", e)
    print("===========================\n")

print_env_summary()
print("[Housekeeping placeholder ready]")

## 14) Appendix (for CV write-up)
You can reuse or adapt the following text blocks when writing your PhD application materials. Edit in-place.

**Methodology (draft)**
> We compare cascaded ASR→MT→TTS against end‑to‑end speech‑to‑speech translation on Colab using open models (Whisper, NLLB‑200, SpeechT5/Parler/MMS‑TTS, SeamlessM4T). We evaluate intelligibility (re‑ASR WER), translation quality (BLEU/chrF), and efficiency (latency, VRAM). We also collect small‑scale MOS ratings for naturalness. Models are loaded sequentially and caches are stored on Drive for reproducibility.

**Limitations (draft)**
> SeamlessM4T installation can vary by Colab runtime and is treated as optional. Non‑English TTS quality on Colab relies on MMS‑TTS; higher‑quality multilingual voices (e.g., Coqui) are documented but not run in‑notebook due to Python 3.12 constraints.

**Ethics (draft)**
> All demo audio was recorded by the author and contributors with explicit consent and is released under CC BY 4.0. Third‑party datasets are accessed under their respective licenses and not redistributed in this repository.