# FACTR_03 - ASR + (optional) Diarize/Align - **CPU-safe**

**Version:** 2.0 (2025-09-07)

This notebook is designed to run reliably on free Colab (CPU only). It avoids CUDA/cuDNN crashes and still produces `UTTERANCES.parquet` for Notebook 04.

Pipeline:
1. Safe Mode → force CPU + cap threads
2. Install minimal, stable CPU dependencies (faster-whisper)
3. Load `AUDIO_PATH` from `FACTR_02` handoff JSON
4. Transcribe on CPU (faster-whisper)
5. *(Optional)* Align & Diarize via WhisperX (graceful fallback)
6. Save `data/processed/UTTERANCES.parquet` to Drive


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# --- SAFEMODE: force CPU, tame native threads (prevents crashes) ---
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # force CPU
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
print("✅ Safe mode set: CPU only, threads capped.")

✅ Safe mode set: CPU only, threads capped.
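
These caps are generally read when the native libraries initialise, so they only help if they are set before anything heavy is imported. A minimal sketch of a guard for that ordering follows; the helper names `apply_safe_mode` and `already_imported` are hypothetical and not part of the notebook, and the list of library names is an assumption.

```python
import os
import sys

# Same caps as the Safe Mode cell above.
SAFE_ENV = {
    "CUDA_VISIBLE_DEVICES": "-1",
    "OMP_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "NUMEXPR_NUM_THREADS": "1",
}

def apply_safe_mode(env=None):
    """Write the caps into `env` (defaults to os.environ) and return them."""
    env = os.environ if env is None else env
    env.update(SAFE_ENV)
    return {key: env[key] for key in SAFE_ENV}

def already_imported(names=("numpy", "torch", "ctranslate2", "onnxruntime")):
    """List heavy libraries that were imported before the caps were applied."""
    return [name for name in names if name in sys.modules]

early = already_imported()
applied = apply_safe_mode()
if early:
    print("Restart the runtime; imported before safe mode:", early)
```

If the warning prints, restarting the runtime and running the Safe Mode cell first is the simplest way to make sure the caps apply.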


In [3]:
%%bash
set -euo pipefail

# Keep pip modern (below the 25.3 breaking change)
pip install -q --upgrade "pip<25.3" wheel

# CPU-only trio for faster-whisper
pip install -q "ctranslate2==4.4.0" "onnxruntime==1.19.2" "faster-whisper==1.1.1"

# Optional utilities used later
pip install -q pandas pyarrow "matplotlib<3.9" "scikit-learn<1.6" yt-dlp

python - <<'PY'
import platform, sys
print("Python", sys.version.split()[0], "| Platform", platform.platform())
import onnxruntime, faster_whisper, ctranslate2
print("onnxruntime", onnxruntime.__version__)
print("faster_whisper", faster_whisper.__version__)
print("ctranslate2", ctranslate2.__version__)
PY

Python 3.12.11 | Platform Linux-6.1.123+-x86_64-with-glibc2.35
onnxruntime 1.19.2
faster_whisper 1.1.1
ctranslate2 4.4.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
umap-learn 0.5.9.post2 requires scikit-learn>=1.6, but you have scikit-learn 1.5.2 which is incompatible.
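
The resolver message above comes from the `scikit-learn<1.6` pin colliding with the preinstalled `umap-learn`, which asks for `scikit-learn>=1.6`; it should not touch the faster-whisper stack. A small sketch for confirming that the three pinned packages really ended up at the pinned versions is shown here. The function name `check_pins` is hypothetical; the pins are copied from the install cell.

```python
from importlib import metadata

# Pins taken from the install cell above.
PINS = {"ctranslate2": "4.4.0", "onnxruntime": "1.19.2", "faster-whisper": "1.1.1"}

def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches

problems = check_pins(PINS)
print("pins ok" if not problems else f"pin mismatches: {problems}")
```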


## Load AUDIO_PATH from FACTR_02 handoff
FACTR_02 writes `data/processed/LAST_INGEST.json` with `audio_path`.


In [10]:
import json, os
HANDOFF = "/content/drive/MyDrive/FATCR/data/processed/LAST_INGEST.json"
assert os.path.exists(HANDOFF), "LAST_INGEST.json not found. Run FACTR_02 first."
with open(HANDOFF, "r", encoding="utf-8") as f:
    meta = json.load(f)
AUDIO_PATH = meta["audio_path"]
print("AUDIO_PATH:", AUDIO_PATH)
assert os.path.exists(AUDIO_PATH) and os.path.getsize(AUDIO_PATH) > 10_000, "Bad AUDIO_PATH."

AUDIO_PATH: /content/drive/MyDrive/FATCR/data/processed/speFWRuuJNs_16k_mono.wav
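
The `assert` checks above give terse messages and are skipped when Python runs with `-O`. A minimal sketch of the same validation with explicit exceptions follows; the name `load_handoff` and the `min_bytes` parameter are hypothetical, and only the `audio_path` key is known from the notebook.

```python
import json
import os

def load_handoff(path, min_bytes=10_000):
    """Read the handoff JSON and return a validated audio path."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} not found. Run FACTR_02 first.")
    with open(path, "r", encoding="utf-8") as f:
        meta = json.load(f)
    audio_path = meta.get("audio_path")
    if not audio_path:
        raise KeyError("Handoff JSON has no 'audio_path' entry.")
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file missing: {audio_path}")
    size = os.path.getsize(audio_path)
    # Same threshold as the notebook: a tiny file is almost certainly a failed download.
    if size <= min_bytes:
        raise ValueError(f"Audio file is only {size} bytes: {audio_path}")
    return audio_path
```

In the notebook this would be called as `AUDIO_PATH = load_handoff(HANDOFF)`.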


## CPU transcription with faster-whisper
The small English model (`small.en`) is fast on CPU. Change `model_size` if you want higher quality at the cost of run time (e.g., `medium.en`).


In [11]:
from faster_whisper import WhisperModel

model_size   = "small.en"     # or "medium.en" if you can afford time
compute_type = "int8"         # good default on CPU; float16 variants such as "int8_float16" target GPU
device       = "cpu"          # keep CPU to avoid cuDNN

fw = WhisperModel(model_size, device=device, compute_type=compute_type)
segments_gen, info = fw.transcribe(
    AUDIO_PATH,
    language="en",           # skip detection
    vad_filter=False,
    beam_size=1,
)
print("Detected language:", info.language)

# Convert to WhisperX-like dict
asr_segments = []
for s in segments_gen:
    seg = {
        "start": float(s.start) if s.start is not None else None,
        "end":   float(s.end)   if s.end   is not None else None,
        "text":  (s.text or "").strip(),
    }
    if getattr(s, "words", None):
        seg["words"] = [
            {"start": float(w.start) if w.start is not None else None,
             "end":   float(w.end)   if w.end   is not None else None,
             "word":  w.word}
            for w in s.words
        ]
    asr_segments.append(seg)
asr = {"segments": asr_segments, "language": info.language or "en"}
print(f"✅ ASR segments: {len(asr_segments)}")

Detected language: en
✅ ASR segments: 1242
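
In faster-whisper, `transcribe` returns a lazy generator, so the decoding work happens while the `for` loop consumes it, not in the `transcribe` call itself. On CPU that loop can run for a long time without any output. A minimal sketch of a progress wrapper is given below; `with_progress` is a hypothetical helper, and the demo uses stand-in segment objects so it runs without a model.

```python
from types import SimpleNamespace

def with_progress(segments, total_sec, every_sec=300.0, log=print):
    """Yield segments unchanged, logging each time `every_sec` more audio is done."""
    next_mark = every_sec
    for seg in segments:
        if seg.end is not None and seg.end >= next_mark:
            pct = 100.0 * seg.end / total_sec if total_sec else 0.0
            log(f"transcribed {seg.end:.0f}s of {total_sec:.0f}s ({pct:.0f}%)")
            # Skip ahead past any marks this segment already covers.
            while next_mark <= seg.end:
                next_mark += every_sec
        yield seg

# Stand-in segments (100 s each) so the sketch runs without a model.
demo = [SimpleNamespace(start=i * 100.0, end=(i + 1) * 100.0, text="") for i in range(7)]
messages = []
consumed = list(with_progress(iter(demo), total_sec=700.0, log=messages.append))
```

In the notebook the loop would become `for s in with_progress(segments_gen, info.duration):`, assuming `info.duration` holds the audio length in seconds.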


## (Optional) Alignment & Diarization with WhisperX (graceful fallback)
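- Stays on CPU. If the import or a model load fails, the cell falls back to plain ASR. The install cell above does not install `whisperx`, so alignment is skipped unless you install it yourself.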


In [13]:
USE_ALIGNMENT = True
USE_DIAR      = False         # diarization is heavier; enable if needed
HUGGINGFACE_TOKEN = ""       # paste your HF token if you enable diarization

asr_aligned = asr
diar_out = None

if USE_ALIGNMENT:
    try:
        import whisperx
        align_model, metadata = whisperx.load_align_model(
            language_code=asr["language"], device="cpu"
        )
        asr_aligned = whisperx.align(asr["segments"], align_model, metadata, AUDIO_PATH, "cpu")
        print("✅ Alignment ok.")
    except Exception as e:
        print("⚠️ Alignment skipped:", e)
        asr_aligned = asr

if USE_DIAR:
    try:
        import whisperx
        diar = whisperx.DiarizationPipeline(device="cpu", use_auth_token=(HUGGINGFACE_TOKEN or None))
        diar_out = diar(AUDIO_PATH)
        # Attach speaker labels to the segments; otherwise diar_out is never
        # used and every row is saved as SPEAKER_00.
        asr_aligned = whisperx.assign_word_speakers(diar_out, asr_aligned)
        print("✅ Diarization ok.")
    except Exception as e:
        print("⚠️ Diarization skipped:", e)
        diar_out = None

⚠️ Alignment skipped: No module named 'whisperx'
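
When diarization is enabled, each transcript segment needs a speaker label, otherwise the save step falls back to `SPEAKER_00` for every row. A common rule is to give a segment the speaker whose turns overlap it the most. The sketch below illustrates that rule in plain Python; it is not WhisperX's implementation, and both the helper name `assign_speakers` and the `(start, end, speaker)` turn format are assumptions made for the example.

```python
def assign_speakers(segments, turns, default="SPEAKER_00"):
    """Label each segment with the speaker whose turns overlap it the most."""
    labeled = []
    for seg in segments:
        overlap_by_speaker = {}
        for t_start, t_end, speaker in turns:
            # Length of the intersection of the two time intervals.
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > 0:
                overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap
        if overlap_by_speaker:
            best = max(overlap_by_speaker, key=overlap_by_speaker.get)
        else:
            best = default  # segment falls outside every diarized turn
        labeled.append({**seg, "speaker": best})
    return labeled

turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
segments = [
    {"start": 0.5, "end": 3.5, "text": "hello"},
    {"start": 3.0, "end": 8.0, "text": "hi there"},
    {"start": 20.0, "end": 21.0, "text": "outside any turn"},
]
labeled = assign_speakers(segments, turns)
```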


## Save UTTERANCES.parquet for FACTR_04


In [14]:
import pandas as pd, os
rows = []
segments = asr_aligned.get("segments", asr["segments"])
for seg in segments:
    rows.append({
        "video_id": os.path.basename(AUDIO_PATH),
        "t_start": seg.get("start"),
        "t_end":   seg.get("end"),
        "speaker": seg.get("speaker", "SPEAKER_00"),
        "text":    (seg.get("text") or "").strip(),
    })
df_utts = pd.DataFrame(rows)
out_parquet = "/content/drive/MyDrive/FATCR/data/processed/UTTERANCES.parquet"
os.makedirs(os.path.dirname(out_parquet), exist_ok=True)
df_utts.to_parquet(out_parquet, index=False)
print("✅ wrote", out_parquet, "rows:", len(df_utts))

✅ wrote /content/drive/MyDrive/FATCR/data/processed/UTTERANCES.parquet rows: 1242
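
Because Notebook 04 reads this file directly, it can help to check the rows before writing them. A minimal sketch of such a check follows. The column names come from the cell above; the helper name `find_bad_rows` and the specific rules (timestamps present and ordered, non-empty text) are assumptions about what counts as unusable.

```python
REQUIRED = ("video_id", "t_start", "t_end", "speaker", "text")

def find_bad_rows(rows):
    """Return (index, reason) pairs for rows that look unusable downstream."""
    problems = []
    for i, row in enumerate(rows):
        missing = [key for key in REQUIRED if key not in row]
        if missing:
            problems.append((i, f"missing keys: {missing}"))
            continue
        if row["t_start"] is None or row["t_end"] is None:
            problems.append((i, "missing timestamp"))
        elif row["t_end"] < row["t_start"]:
            problems.append((i, "t_end before t_start"))
        if not row["text"]:
            problems.append((i, "empty text"))
    return problems

demo_rows = [
    {"video_id": "a.wav", "t_start": 0.0, "t_end": 1.5, "speaker": "SPEAKER_00", "text": "ok"},
    {"video_id": "a.wav", "t_start": 2.0, "t_end": 1.0, "speaker": "SPEAKER_00", "text": ""},
    {"video_id": "a.wav", "t_start": None, "t_end": 3.0, "speaker": "SPEAKER_00", "text": "hi"},
]
bad = find_bad_rows(demo_rows)
```

In the notebook, `find_bad_rows(rows)` could run just before `pd.DataFrame(rows)`.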


---
**Notes**
- Keep running on CPU for stability. Once the pipeline is proven, you can experiment with a GPU runtime (e.g. Colab Pro) by removing the CPU override in the Safe Mode cell and installing torch/torchaudio builds that match the runtime's CUDA version before installing whisperx.
- If you only need transcripts, set `USE_ALIGNMENT=False` and `USE_DIAR=False`.
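
For the GPU experiment mentioned in the notes, one option is to derive `device` and `compute_type` from what the runtime actually exposes instead of editing them by hand. This is a sketch under stated assumptions: `pick_device` is a hypothetical helper, `float16` on GPU is a common choice rather than something the notebook prescribes, and the device count is passed in so the function stays testable (in the notebook it could come from `ctranslate2.get_cuda_device_count()`).

```python
import os

def pick_device(cuda_device_count, env=None):
    """Return (device, compute_type) for WhisperModel given the visible GPUs."""
    env = os.environ if env is None else env
    # The Safe Mode cell sets CUDA_VISIBLE_DEVICES to "-1" to force CPU.
    forced_cpu = env.get("CUDA_VISIBLE_DEVICES", "") == "-1"
    if cuda_device_count > 0 and not forced_cpu:
        return "cuda", "float16"
    return "cpu", "int8"

device, compute_type = pick_device(cuda_device_count=0)
print("device:", device, "| compute_type:", compute_type)
```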


## Snapshot + pointer JSON (for FACTR_03 handoff)

In [15]:
# === FACTR_03 Snapshot (rows, duration, versions) ===
import os, json, time, platform
import pandas as pd
import numpy as np

ROOT = "/content/drive/MyDrive/FATCR"
UTT  = f"{ROOT}/data/processed/UTTERANCES.parquet"
SNAP_DIR = f"{ROOT}/snapshots"
PTR_PATH  = f"{ROOT}/data/processed/LAST_ASR.json"

assert os.path.exists(UTT), f"Missing {UTT}. Run FACTR_03 transcription first."

# Load file
df = pd.read_parquet(UTT)
rows = len(df)

# Calculate durations
t_start = pd.to_numeric(df.get("t_start", pd.Series(dtype=float)), errors="coerce")
t_end   = pd.to_numeric(df.get("t_end"  , pd.Series(dtype=float)), errors="coerce")
seg_dur = (t_end - t_start).clip(lower=0)
total_duration_sec = float(np.nan_to_num(seg_dur.sum(), nan=0.0))
max_time_sec       = float(np.nan_to_num(t_end.max(), nan=0.0))

print("✅ FACTR_03 snapshot")
print("   Rows:", rows)
print("   Total speech duration (approx):", round(total_duration_sec, 2), "sec")
print("   Max end time:", round(max_time_sec, 2), "sec")

# Environment/versions
snap = {
    "ts"     : time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "python" : platform.python_version(),
    "pandas" : pd.__version__,
    "numpy"  : np.__version__,
    "rows"   : int(rows),
    "dur_s"  : round(total_duration_sec, 3),
    "max_t"  : round(max_time_sec, 3),
}

# Save snapshot (auditable trail)
os.makedirs(SNAP_DIR, exist_ok=True)
snap_path = f"{SNAP_DIR}/FACTR03_SNAPSHOT_{int(time.time())}.json"
with open(snap_path, "w") as f:
    json.dump(snap, f, indent=2)
print("📝 Snapshot saved ->", os.path.relpath(snap_path, ROOT))

# Save pointer JSON (safe to push)
os.makedirs(os.path.dirname(PTR_PATH), exist_ok=True)
with open(PTR_PATH, "w") as f:
    json.dump({
        "ts"    : snap["ts"],
        "rows"  : rows,
        "dur_s" : snap["dur_s"],
        "path"  : os.path.relpath(UTT, ROOT),
    }, f, indent=2)
print("🔗 Pointer JSON saved ->", os.path.relpath(PTR_PATH, ROOT))


✅ FACTR_03 snapshot
   Rows: 1242
   Total speech duration (approx): 1762.2 sec
   Max end time: 1786.16 sec
📝 Snapshot saved -> snapshots/FACTR03_SNAPSHOT_1757837986.json
🔗 Pointer JSON saved -> data/processed/LAST_ASR.json
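
The pointer stores `path` relative to the project root, so a consumer has to join it back onto its own `ROOT`. The sketch below shows one way a downstream notebook might do that, plus the ratio between the two durations in the log (summed segment time versus last end time), which gives a rough idea of how much of the timeline the segments cover. Both helper names are hypothetical; the keys match the pointer written above.

```python
import json
import os

def resolve_pointer(ptr_path, root):
    """Load the pointer JSON and return (absolute parquet path, pointer dict)."""
    with open(ptr_path, "r", encoding="utf-8") as f:
        ptr = json.load(f)
    # "path" is stored relative to the project root, so join it back.
    return os.path.join(root, ptr["path"]), ptr

def speech_coverage(dur_s, max_t):
    """Fraction of the timeline covered by segments (0.0 if max_t is 0)."""
    return dur_s / max_t if max_t else 0.0

# Values from the snapshot log above.
coverage = speech_coverage(1762.2, 1786.16)
print(f"segments cover {coverage:.1%} of the audio timeline")
```

A consumer can also compare `ptr["rows"]` with the row count of the parquet file it loads, as a cheap check that the pointer and the data belong together.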


## Git push helper (for FACTR_03 outputs)

In [16]:
# === FACTR push (commit notebook + pointer JSON + snapshots) ===
from google.colab import userdata
import urllib.parse, os, subprocess, shlex

ROOT = "/content/drive/MyDrive/FATCR"
os.chdir(ROOT)

print("📂 Repo status:")
!git status -sb

# Pull to avoid divergence
print("\n🔄 Pulling latest (rebase)…")
pat = userdata.get("GITHUB_PAT")
assert pat, "Missing GITHUB_PAT in Colab Secrets."
enc_pat = urllib.parse.quote(pat, safe="")
PULL_URL = f"https://LukmaanViscomi:{enc_pat}@github.com/LukmaanViscomi/FATCR.git"
!git pull --rebase {PULL_URL} main || true

# Stage notebooks + snapshots + pointer JSON
print("\n➕ Staging relevant files…")
!git add notebooks snapshots data/processed/LAST_ASR.json README.md .gitignore 2>/dev/null || true

# Commit if changes exist
changed = subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0
if changed:
    msg = "FACTR_03: add snapshot + pointer"
    print("\n✏️ Commit:", msg)
    !git commit -m {shlex.quote(msg)}
else:
    print("\nℹ️ Nothing new to commit.")

# Push using PAT
print("\n⬆️ Pushing to main…")
!git push {PULL_URL} HEAD:main

print("\n✅ Push complete.")


📂 Repo status:
Refresh index: 100% (6/6), done.
## main...origin/main [ahead 1]
 M notebooks/FACTR_03_ASR+Diarize_v2025-09-07_1.0.ipynb
?? data/

🔄 Pulling latest (rebase)…
error: cannot pull with rebase: You have unstaged changes.
error: please commit or stash them.

➕ Staging relevant files…

✏️ Commit: FACTR_03: add snapshot + pointer
Author identity unknown

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got 'root@b7374f05d615.(none)')

⬆️ Pushing to main…
Everything up-to-date

✅ Push complete.
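
The log above shows three separate problems: the rebase is refused because the working tree has unstaged changes, the commit aborts because no author identity is configured on the fresh runtime, and "Push complete" is printed even though nothing new was pushed. A minimal sketch of a more defensive helper follows. It sets a repo-local identity, commits before pulling, uses `--autostash` for leftover edits, and stops at the first failing step. The function name `sync_repo` and its parameters are hypothetical, and the name and email passed in are placeholders you would supply yourself.

```python
import subprocess

def sync_repo(remote_url, message, name, email, paths, secret="", runner=subprocess.run):
    """Stage, commit, rebase onto main and push; True only if every step succeeds."""
    def git(*args):
        proc = runner(["git", *args], capture_output=True, text=True)
        output = (proc.stdout or "") + (proc.stderr or "")
        # Never echo the token that is embedded in the remote URL.
        if secret:
            output = output.replace(secret, "***")
        if output.strip():
            print(output.strip())
        return proc.returncode

    # Repo-local identity avoids "Author identity unknown" on fresh runtimes.
    if git("config", "user.name", name) or git("config", "user.email", email):
        return False
    if git("add", "--", *paths):
        return False
    # `git diff --cached --quiet` exits with 1 when something is staged.
    if git("diff", "--cached", "--quiet") != 0 and git("commit", "-m", message):
        return False
    # Commit first, then rebase; --autostash shelves leftover unstaged edits.
    if git("pull", "--rebase", "--autostash", remote_url, "main"):
        return False
    return git("push", remote_url, "HEAD:main") == 0
```

In the notebook this could be called with `PULL_URL`, the commit message, and `secret=enc_pat`, printing the success line only when it returns `True`. Note that `git add` fails on paths that do not exist, so `paths` should list only files and folders that are present.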
