# Transcript Normalization for CLIP

This notebook explains and demonstrates the **transcript normalization** step
that prepares raw ASR output for CLIP's text encoder.

## The problem

CLIP was trained on **image-caption pairs** — clean, descriptive sentences like:

> *"a cat sleeping on a windowsill, feeling calm"*

But ASR output from dysarthric speakers often contains:
- **Filler words**: *um, uh, like, you know*
- **Stuttering / disfluency**: *the the cat is is sleeping*
- **Fragments**: *cat* (a single word instead of a sentence)
- **Inconsistent casing**: *A CAT sleeping*

Feeding raw ASR output directly into CLIP tends to produce low text-image similarity
scores, because the text looks unlike the captions CLIP was trained on.
Normalization narrows this gap.

## 0) Imports

Only `re` (regular expressions) is needed — this module is intentionally lightweight
with no heavy dependencies.

In [None]:
import re

## 1) Filler word list

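These are common filler words and discourse markers that appear in spontaneous speech
but usually carry no semantic content relevant to image description. The list includes both
**single-token** fillers (`um`, `uh`) and **two-token** phrases (`you know`, `i mean`).
Matching is purely lexical, so words such as `like`, `so`, and `well` are removed even
when they are used meaningfully (for example, `"looks like a dog"` becomes `"looks a dog"`).

`FILLER_WORDS` is a `frozenset`: immutable, with average-case O(1) membership tests.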

In [None]:
FILLER_WORDS = frozenset({
    "um", "uh", "uh-huh", "hmm", "hm", "ah", "er", "oh",
    "like", "you know", "i mean", "okay", "ok", "so", "well",
})
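
Membership is an exact string match on whitespace-separated tokens, so if the ASR system emits punctuation, a filler such as `"um,"` is not recognized. The sketch below illustrates this and one possible workaround. `strip_edge_punct` is a hypothetical helper, not part of the notebook's pipeline, and the set is repeated only so the snippet runs on its own.

```python
import string

# Copy of the notebook's filler list, repeated so this snippet is self-contained.
FILLER_WORDS = frozenset({
    "um", "uh", "uh-huh", "hmm", "hm", "ah", "er", "oh",
    "like", "you know", "i mean", "okay", "ok", "so", "well",
})


def strip_edge_punct(token: str) -> str:
    """Hypothetical helper: drop punctuation at the ends of a token only."""
    return token.strip(string.punctuation)


print("um" in FILLER_WORDS)                         # exact match
print("um," in FILLER_WORDS)                        # trailing comma defeats the match
print(strip_edge_punct("um,") in FILLER_WORDS)      # matches once the comma is stripped
print(strip_edge_punct("uh-huh.") in FILLER_WORDS)  # the inner hyphen is preserved
```

Whether such a step is needed depends on the ASR output format; transcripts without punctuation are unaffected.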

## 2) `normalize_transcript` — core cleanup

This function applies three transformations in order:

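1. **Lowercase + strip**: removes casing differences and leading/trailing whitespace
2. **Collapse stutters**: `"the the cat is is sleeping"` → `"the cat is sleeping"`  
   Uses the regex `\b(\w+)( \1\b)+` to match any word repeated consecutively.
3. **Remove fillers**: scans word-by-word, checking both single tokens and bigrams
   against `FILLER_WORDS`

The result is a lowercase transcript with consecutive repeats and listed fillers removed.
Because stutters are collapsed before fillers are removed, a repeat that is split by a
filler is not caught: `"the um the cat"` comes out as `"the the cat"`.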
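
To make the stutter regex in step 2 concrete, the snippet below applies it in isolation to a few inputs. `collapse_stutters` is a name introduced here for illustration; the notebook inlines the same `re.sub` call.

```python
import re

# Same pattern as in step 2: a word, then one or more " <same word>" repeats.
STUTTER_RE = re.compile(r"\b(\w+)( \1\b)+")


def collapse_stutters(text: str) -> str:
    """Illustrative wrapper around the step-2 substitution."""
    return STUTTER_RE.sub(r"\1", text)


samples = [
    "the the cat is is sleeping",  # two separate stutters
    "no no no dogs",               # a word repeated more than twice
    "the theory",                  # shared prefix only, not a repeat
    "The the cat",                 # differs in case
    "the  the cat",                # two spaces between the repeats
]
for s in samples:
    print(f"{s!r} -> {collapse_stutters(s)!r}")
```

The first two inputs collapse; the last three come back unchanged. That is why lowercasing has to happen before this step, and it shows that the pattern only catches repeats separated by exactly one space.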

In [None]:
def normalize_transcript(text: str) -> str:
    """Clean an ASR transcript: lowercase, collapse stutters, remove fillers."""
    # Lowercase and collapse every whitespace run to a single space up front,
    # so the stutter regex (which expects exactly one space) sees clean input.
    text = " ".join(text.lower().split())
    if not text:
        return text

    # Collapse consecutive repeats of the same word: "the the cat" -> "the cat".
    text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)

    # Drop fillers, checking the two-word phrase before the single word.
    words = text.split()
    cleaned: list[str] = []
    skip_next = False
    for i, w in enumerate(words):
        if skip_next:
            skip_next = False
            continue
        bigram = f"{w} {words[i + 1]}" if i + 1 < len(words) else ""
        if bigram in FILLER_WORDS:
            skip_next = True
            continue
        if w not in FILLER_WORDS:
            cleaned.append(w)
    return " ".join(cleaned)
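
The bigram-before-unigram scan in step 3 can also be written with an explicit index, which some readers may find easier to follow than the `skip_next` flag. This is a sketch of an equivalent formulation, not the notebook's code; `drop_fillers` and the reduced `fillers` set are names chosen here for illustration.

```python
def drop_fillers(words: list[str], fillers: frozenset[str]) -> list[str]:
    """Remove filler tokens, trying the two-word phrase before the single word."""
    kept: list[str] = []
    i = 0
    while i < len(words):
        # Look ahead one token; if the pair is a known phrase, consume both.
        if i + 1 < len(words) and f"{words[i]} {words[i + 1]}" in fillers:
            i += 2
            continue
        if words[i] not in fillers:
            kept.append(words[i])
        i += 1
    return kept


# Reduced filler set, for illustration only.
fillers = frozenset({"um", "uh", "well", "you know", "i mean"})

print(drop_fillers("well you know the girl is painting".split(), fillers))
print(drop_fillers("do you know the girl".split(), fillers))
```

The second call shows the cost of purely lexical matching: `you know` is removed even where it is part of the sentence.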

## 3) `to_caption_style` — rephrasing for CLIP

CLIP's text encoder tends to work best with **caption-like phrases or sentences** that
describe an image. Very short ASR outputs (1–2 words) like `"cat"` tend to produce weaker
matches because CLIP rarely saw isolated words during training.

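This function first calls `normalize_transcript`, then wraps results of one or two words as captions:
- `"cat"` → `"an image showing cat"`
- `"a dog"` → `"an image showing a dog"`
- `"the cat is sleeping on the window"` → left unchanged (already sentence-length)

The same cell defines `batch_normalize`, which applies either function to a list of
transcripts, selected by the `caption_style` flag.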

In [None]:
def to_caption_style(text: str) -> str:
    """Rephrase a short transcript as a caption-like sentence for CLIP."""
    text = normalize_transcript(text)
    if not text:
        return text
    if len(text.split()) <= 2:
        return f"an image showing {text}"
    return text


def batch_normalize(
    texts: list[str], caption_style: bool = False
) -> list[str]:
    """Normalize a list of transcripts."""
    fn = to_caption_style if caption_style else normalize_transcript
    return [fn(t) for t in texts]
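
The wrapping rule could be parameterized if a different prompt turns out to work better for a given CLIP checkpoint. The sketch below is a hypothetical variant, not the notebook's function: `wrap_short`, `template`, and `max_words` are names introduced here, and it assumes the text has already been normalized. The `"a photo of {}"` template echoes the prompt style used for zero-shot classification in the CLIP paper ("a photo of a {label}").

```python
def wrap_short(
    text: str,
    template: str = "an image showing {}",
    max_words: int = 2,
) -> str:
    """Hypothetical variant of the wrapping rule with a configurable template."""
    if not text:
        return text
    if len(text.split()) <= max_words:
        return template.format(text)
    return text


print(wrap_short("cat"))                                # default template
print(wrap_short("a dog", template="a photo of {}"))    # alternative template
print(wrap_short("the cat is sleeping on the window"))  # long enough: unchanged
```

Which template gives the best similarity scores is an empirical question that this sketch does not answer.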

## 4) Demo: before and after normalization

The examples below simulate realistic ASR outputs and print, for each one, the result of
`normalize_transcript` (`norm`) and of `to_caption_style` (`caption`).

In [None]:
examples = [
    "Uh the the cat is is sleeping on the window",
    "um like a dog",
    "cat",
    "  A BOY riding a bicycle on a PATH  ",
    "well you know the the girl is painting",
    "",
]

for raw in examples:
    norm = normalize_transcript(raw)
    cap = to_caption_style(raw)
    print(f"  raw:     {raw!r}")
    print(f"  norm:    {norm!r}")
    print(f"  caption: {cap!r}")
    print()

## Summary

| Step | What it does | Example |
|------|-------------|--------|
| Lowercase + strip | Removes casing noise | `"A CAT"` → `"a cat"` |
| Collapse stutters | Deduplicates repeated words | `"the the cat"` → `"the cat"` |
| Remove fillers | Strips discourse markers | `"uh like a dog"` → `"a dog"` |
| Caption wrapping | Pads short fragments | `"cat"` → `"an image showing cat"` |

This normalization is applied automatically inside the multimodal rescoring pipeline
(`multimodal_asr.py`) before text is sent to CLIP's text encoder.