# Qwen3 ForcedAligner offline demo (parquet → timestamps)

This notebook loads **one parquet shard** from your Yodas-Granary ASR-only directory, picks one example, and runs:
1) *(Optional)* Qwen3-ASR (vLLM backend) to get transcript
2) Qwen3-ForcedAligner to produce **word/character timestamps**

Model usage references:
- ForcedAligner direct usage and `qwen-asr-demo` flags on the HF model card. citeturn1view0

---
✅ Assumptions: you have a GPU (L40S) and HF cache env vars already set.


In [1]:
# --- Check GPU + HF cache wiring (read-only) ---
import os, subprocess, textwrap
print('HF_HOME:', os.environ.get('HF_HOME'))
print('HF_HUB_CACHE:', os.environ.get('HF_HUB_CACHE'))
print('TRANSFORMERS_CACHE:', os.environ.get('TRANSFORMERS_CACHE'))
print('CUDA_VISIBLE_DEVICES:', os.environ.get('CUDA_VISIBLE_DEVICES'))
try:
    out = subprocess.check_output(['nvidia-smi','-L'], text=True)
    print('\nGPU(s):')
    print(out)
except Exception as e:
    print('nvidia-smi not available:', e)


HF_HOME: /data/user_data/haolingp/hf_cache
HF_HUB_CACHE: /data/user_data/haolingp/hf_cache/hub
TRANSFORMERS_CACHE: None
CUDA_VISIBLE_DEVICES: None

GPU(s):
GPU 0: NVIDIA L40S (UUID: GPU-5f205193-a34d-1536-6802-bbcbe0ca3b70)



## 1) Install deps (run once per env)
If you already installed these in your conda env, you can skip this cell.

In [2]:
!pip -q install -U "qwen-asr[vllm]" pandas pyarrow soundfile numpy
# optional: for resampling if needed
!pip -q install -U librosa

## 2) Point to one parquet file
Set `PARQUET_PATH` to a single parquet file inside:
`/data/group_data/li_lab/siqiouya/datasets/yodas-granary/data/en000/asr_only/`


In [3]:
import os, glob
BASE_DIR = "/data/group_data/li_lab/siqiouya/datasets/yodas-granary/data/en000/asr_only/"
parquets = sorted(glob.glob(os.path.join(BASE_DIR, "**/*.parquet"), recursive=True))
print('Found parquet files:', len(parquets))
print('First 5:')
for p in parquets[:5]:
    print('  ', p)

# Pick the first parquet by default; change to any path you want.
PARQUET_PATH = parquets[0] if parquets else None
print('\nUsing PARQUET_PATH:', PARQUET_PATH)

Found parquet files: 499
First 5:
   /data/group_data/li_lab/siqiouya/datasets/yodas-granary/data/en000/asr_only/00000000.parquet
   /data/group_data/li_lab/siqiouya/datasets/yodas-granary/data/en000/asr_only/00000001.parquet
   /data/group_data/li_lab/siqiouya/datasets/yodas-granary/data/en000/asr_only/00000002.parquet
   /data/group_data/li_lab/siqiouya/datasets/yodas-granary/data/en000/asr_only/00000003.parquet
   /data/group_data/li_lab/siqiouya/datasets/yodas-granary/data/en000/asr_only/00000004.parquet

Using PARQUET_PATH: /data/group_data/li_lab/siqiouya/datasets/yodas-granary/data/en000/asr_only/00000000.parquet


## 3) Load parquet + inspect schema

In [4]:
import pandas as pd
import pyarrow.parquet as pq

assert PARQUET_PATH is not None, 'No parquet file found. Set PARQUET_PATH manually.'

table = pq.read_table(PARQUET_PATH)
df = table.to_pandas()
print('Rows:', len(df), 'Cols:', len(df.columns))
print('\nColumns:')
print(df.columns.tolist())
display(df.head(2))

Rows: 1703 Cols: 9

Columns:
['utt_id', 'audio', 'duration', 'lang', 'task', 'text', 'translation_en', 'original_audio_id', 'original_audio_offset']


Unnamed: 0,utt_id,audio,duration,lang,task,text,translation_en,original_audio_id,original_audio_offset
0,en000_00000000_Y0aGwNq86f4_112_62_5_54,{'bytes': b'RIFF\xa4\xb4\x02\x00WAVEfmt \x10\x...,5.54,<en>,<asr>,Which sites have you been using and how did th...,,Y0aGwNq86f4,112.62
1,en000_00000000_Y0aGwNq86f4_118_16_30_48,{'bytes': b'RIFF$\xe2\x0e\x00WAVEfmt \x10\x00\...,30.48,<en>,<asr>,edge technologies to find talent in the techno...,,Y0aGwNq86f4,118.16


## 4) Pick one row + locate audio + transcript fields
This block tries to guess likely columns for audio and text.

- If your parquet stores **audio file paths**, it will use them directly.
- If it stores **audio bytes/arrays**, it will decode them.
- For transcript, it looks for columns like `text`, `transcript`, `asr_text` etc.

If the guesses are wrong, just set `AUDIO_COL` / `TEXT_COL` manually after inspecting `df.columns`.


In [5]:
import re
def guess_col(cols, patterns):
    for pat in patterns:
        r = re.compile(pat, re.IGNORECASE)
        for c in cols:
            if r.search(c):
                return c
    return None

cols = df.columns.tolist()
AUDIO_COL = guess_col(cols, [r'^audio$', r'audio', r'wav', r'path'])
TEXT_COL  = guess_col(cols, [r'^text$', r'transcript', r'asr', r'sentence', r'utt'])

print('Guessed AUDIO_COL:', AUDIO_COL)
print('Guessed TEXT_COL :', TEXT_COL)

ROW_IDX = 0
row = df.iloc[ROW_IDX]
print('\nSample row preview (selected cols):')
for k in [AUDIO_COL, TEXT_COL]:
    if k is None: continue
    v = row[k]
    s = str(v)
    print(f'  {k}:', s[:200] + ('...' if len(s)>200 else ''))

Guessed AUDIO_COL: audio
Guessed TEXT_COL : text

Sample row preview (selected cols):
  audio: {'bytes': b'RIFF\xa4\xb4\x02\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\x80>\x00\x00\x00}\x00\x00\x02\x00\x10\x00data\x80\xb4\x02\x00\xef\xf9M\xfeS\x03F\xf8\x05\xf9\x87\x01\x8e\x02\xbc\xfc\xb4\xfdz\x...
  text: Which sites have you been using and how did the use of these technologies assist your search?


## 5) Load audio into a standard form
We normalize audio into either:
- a local path string, or
- a `(np.ndarray, sr)` tuple (float32)

The Qwen3 aligner supports both input styles. citeturn1view0

In [None]:
import numpy as np
import soundfile as sf
import librosa
from pathlib import Path

def resolve_audio(row_val):
    """Return either a path (str) or (np.ndarray, sr)."""
    # Case 1: already a string path
    if isinstance(row_val, str):
        p = Path(row_val)
        if p.exists():
            return str(p)
        # Sometimes stored as relative path
        p2 = Path(BASE_DIR) / row_val
        if p2.exists():
            return str(p2)
        # fallthrough: maybe URL or something
        raise FileNotFoundError(f"Audio file not found: {row_val} (also tried {Path(BASE_DIR)/row_val})")


    # Case 2: dict-like (common in HF datasets)
    if isinstance(row_val, dict):
        for key in ['path','file','filepath','audio_path']:
            if key in row_val and isinstance(row_val[key], str):
                return resolve_audio(row_val[key])
        for key in ['array','samples']:
            if key in row_val:
                arr = np.asarray(row_val[key], dtype=np.float32)
                sr = int(row_val.get('sampling_rate', row_val.get('sr', 16000)))
                return (arr, sr)

    # Case 3: bytes (wav bytes)
    if isinstance(row_val, (bytes, bytearray)):
        import io
        data, sr = sf.read(io.BytesIO(row_val), dtype='float32')
        if data.ndim > 1:
            data = data.mean(axis=1)
        return (data, sr)

    # Case 4: numpy array already
    if isinstance(row_val, np.ndarray):
        return (row_val.astype(np.float32), 16000)

    raise TypeError(f'Unsupported audio type: {type(row_val)}')

assert AUDIO_COL is not None, 'Could not guess audio column. Set AUDIO_COL manually.'
audio_input = resolve_audio(row[AUDIO_COL])
print('Audio input type:', type(audio_input))
if isinstance(audio_input, tuple):
    arr, sr = audio_input
    print('Audio tuple:', arr.shape, 'sr=', sr, 'dtype=', arr.dtype)
else:
    print('Audio path/url:', audio_input)

Audio input type: <class 'str'>
Audio path/url: en000_00000000_Y0aGwNq86f4_112_62_5_54.wav


## 6) Option A (recommended): align using **existing transcript** from parquet
If your parquet already has transcript text, you can run ForcedAligner directly.

Direct ForcedAligner usage example is on the model card. citeturn1view0

In [7]:
TRANSCRIPT = None
if TEXT_COL is not None:
    val = row[TEXT_COL]
    if isinstance(val, str) and val.strip():
        TRANSCRIPT = val.strip()

print('Transcript found:', TRANSCRIPT is not None)
if TRANSCRIPT:
    print('Transcript preview:', TRANSCRIPT[:200])

Transcript found: True
Transcript preview: Which sites have you been using and how did the use of these technologies assist your search?


In [8]:
# --- Run forced alignment (audio + transcript) ---
import torch
from qwen_asr import Qwen3ForcedAligner

assert TRANSCRIPT is not None, 'No transcript found; use Option B below to run ASR first.'

aligner = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# Set language explicitly if you know it; otherwise choose None and/or set English.
LANGUAGE = "English"
aligned = aligner.align(audio=audio_input, text=TRANSCRIPT, language=LANGUAGE)

# aligned[0] is a list of segments/tokens with timestamps
print('Aligned items:', len(aligned[0]))
print('First item:', aligned[0][0])
print('Example:', aligned[0][0].text, aligned[0][0].start_time, aligned[0][0].end_time)


  from .autonotebook import tqdm as notebook_tqdm
  audio, sr = librosa.load(x, sr=None, mono=False)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


FileNotFoundError: [Errno 2] No such file or directory: 'en000_00000000_Y0aGwNq86f4_112_62_5_54.wav'

### Export as JSON (token/word timestamps)
This makes it easy to inspect/plot or convert to TextGrid later.

In [None]:
import json
out = []
for it in aligned[0]:
    out.append({
        "text": getattr(it, 'text', None),
        "start": float(getattr(it, 'start_time', 0.0)),
        "end": float(getattr(it, 'end_time', 0.0)),
    })

print('First 5 aligned tokens:')
print(json.dumps(out[:5], ensure_ascii=False, indent=2))

OUT_JSON_PATH = "qwen3_forced_alignment_example.json"
with open(OUT_JSON_PATH, 'w', encoding='utf-8') as f:
    json.dump(out, f, ensure_ascii=False, indent=2)
print('Wrote:', OUT_JSON_PATH)

## 7) Option B: If parquet has **audio only** (no transcript), run Qwen3-ASR (vLLM) and return timestamps
Qwen3-ASR supports returning timestamps when you pass `forced_aligner` and its kwargs. citeturn1view0

⚠️ This is heavier: it loads the ASR model (1.7B by default). You can switch to `Qwen/Qwen3-ASR-0.6B` if needed.

In [None]:
# Uncomment and run if you need ASR -> transcript -> timestamps
import torch
from qwen_asr import Qwen3ASRModel

# Use the vLLM backend wrapper from the package (as shown in the README). citeturn1view0
asr = Qwen3ASRModel.LLM(
    model="Qwen/Qwen3-ASR-1.7B",
    gpu_memory_utilization=0.70,
    max_inference_batch_size=8,
    max_new_tokens=2048,
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(device_map="cuda:0", dtype=torch.bfloat16),
)

results = asr.transcribe(audio=audio_input, language=None)
r0 = results[0]
print('Detected language:', getattr(r0, 'language', None))
print('Transcript:', getattr(r0, 'text', None))

# Depending on version, timestamps may be in r0.timestamps / r0.words / r0.segments; inspect:
print('Available fields:', [a for a in dir(r0) if not a.startswith('_')][:50])

# Try common locations
for cand in ['timestamps','words','segments','tokens','alignment']:
    if hasattr(r0, cand):
        val = getattr(r0, cand)
        if val is not None:
            print(f'Found {cand}:', type(val), 'len=', (len(val) if hasattr(val,'__len__') else 'n/a'))
            break

## 8) Next: convert to TextGrid (optional)
If you want, we can add a cell to convert the JSON timestamps into a Praat TextGrid to match your MFA workflow.