
# Segment 2 — AI & ML Fundamentals for Video (CPU-only Lab)

This Colab notebook contains **CPU-friendly exercises** using **Hugging Face** models that map to Segment 2 of your course:
- Embeddings & Semantic Search
- Unsupervised Learning (Clustering & PCA)
- Named Entity Recognition (NER)
- Vision Classification (Image tagging)
- Whisper-based ASR (very short clips on CPU)
- Metrics: Precision/Recall/F1
- Transfer Learning (feature extractor → tiny classifier)

> **Tip:** Run each section sequentially. Keep inputs tiny for quick CPU runs.


## 0) Setup (run once)

In [None]:

# If running on Colab CPU, install deps
# --index-url replaces PyPI, so only the PyTorch packages are installed from the CPU wheel index
!pip -q install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip -q install transformers datasets sentence-transformers scikit-learn numpy pandas matplotlib librosa soundfile


### Imports & Seeding

In [None]:

import numpy as np
import torch
torch.manual_seed(42)
np.random.seed(42)
print("Torch:", torch.__version__)



## 1) Text Embeddings + Semantic Search

**Goal:** Demonstrate how **embeddings** enable **semantic search** on transcripts.
- Model: `sentence-transformers/all-MiniLM-L6-v2` (small, fast on CPU)


In [None]:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

docs = [
  "The pit crew changed all four tires in under two seconds.",
  "The finance minister presented the annual budget today.",
  "A red flag stopped the Formula 1 race due to debris on the track.",
  "The actor received an award for best performance in a drama."
]

doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

def search(query, top_k=3):
    q_emb = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, doc_emb)[0]
    topk = torch.topk(scores, k=min(top_k, len(docs)))  # clamp so top_k > len(docs) does not raise
    print(f"Query: {query}\n")
    for idx, score in zip(topk.indices.tolist(), topk.values.tolist()):
        print(f"{score:.3f} :: {docs[idx]}")

# Try it:
search("race was paused after an accident", top_k=3)



**Discuss:** Why does semantic search surface the F1 example even though the query shares only the word “race” with it?  
Try your own query in `search("...")`, including one with no shared keywords at all.
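
A toy comparison can make the difference between keyword matching and embedding similarity concrete. The sketch below uses hand-made 3-D vectors, not real model output; the axis meanings (motorsport, interruption, finance) and all names prefixed `toy_` are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def keyword_overlap(a, b):
    """Number of lowercase whitespace-separated tokens shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

# Hand-made 3-D vectors (assumed axes: motorsport, interruption, finance).
toy_docs = {
    "red flag stopped the race": [0.9, 0.8, 0.0],
    "minister presented the budget": [0.0, 0.1, 0.9],
}
toy_query = "session halted following a crash"
toy_query_vec = [0.7, 0.9, 0.0]

for text, vec in toy_docs.items():
    print(f"{text!r}: overlap={keyword_overlap(toy_query, text)}, "
          f"cosine={cosine(toy_query_vec, vec):.3f}")
```

Neither document shares a token with the query, yet the racing sentence scores far higher on cosine similarity because its vector points in nearly the same direction. A real encoder learns such directions from data rather than having them assigned by hand.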



## 2) Unsupervised Topic Clustering (Transcripts → Clusters)

**Goal:** Use **k-means** on sentence embeddings to group topics without labels.


In [None]:

from sklearn.cluster import KMeans

sentences = [
  "Ferrari leads after an early overtake.",
  "Budget deficit expected to narrow next year.",
  "Mercedes pits Hamilton for hard tires.",
  "Central bank maintains interest rates.",
  "Safety car deployed after collision.",
  "Inflation falls for third month in a row."
]

X = model.encode(sentences, normalize_embeddings=True)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

for lbl in sorted(set(labels)):
    print(f"\nCluster {lbl}:")
    for s, l in zip(sentences, labels):
        if l == lbl:
            print(" -", s)



**Try:** Change `n_clusters=3`. What themes emerge?  
**Note:** In real pipelines, cluster IDs can be stored as **structural/semantic tags**.
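
To show what k-means does under the hood, here is a minimal sketch of Lloyd's algorithm in NumPy. It is not scikit-learn's implementation (which adds k-means++ initialisation and multiple restarts); the function name `lloyd_kmeans` and the toy 2-D points are assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=42):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (simple random init, not k-means++)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members (skip empty clusters)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Toy data (assumed): two well-separated groups of three points each
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centroids = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
```

Which group receives ID 0 depends on the initialisation, so cluster IDs are only meaningful relative to one fitted model.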



## 3) Dimensionality Reduction for Visualization (PCA)


In [None]:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X2 = PCA(n_components=2, random_state=42).fit_transform(X)

plt.figure()
for lbl in sorted(set(labels)):
    ix = labels == lbl
    plt.scatter(X2[ix,0], X2[ix,1], label=f"Cluster {lbl}")
for i, txt in enumerate(sentences):
    plt.annotate(str(i), (X2[i,0], X2[i,1]))
plt.legend()
plt.title("Transcript sentence embeddings (PCA)")
plt.show()



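**Discuss:** PCA is a linear projection that preserves directions of greatest variance, while t-SNE and UMAP are nonlinear and emphasize local neighborhoods. When are the visual patterns in a 2-D plot informative, and when are they misleading?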
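
One rough check on whether a 2-D PCA plot can be trusted is how much of the total variance the first two components retain. The sketch below computes that ratio from the singular values of the centred data; the helper name `explained_variance_ratio` and the synthetic low-rank data are assumptions for illustration. In the notebook, a fitted scikit-learn `PCA` object exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X, n_components=2):
    """Fraction of total variance captured by each of the first n principal components."""
    Xc = X - X.mean(axis=0)                      # PCA operates on centred data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, largest first
    var = s ** 2                                 # proportional to per-component variance
    return var[:n_components] / var.sum()

rng = np.random.default_rng(0)
# Synthetic data (assumed): 50 points in 10-D that mostly vary along 2 hidden directions
low_rank = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10))
noisy = low_rank + 0.01 * rng.normal(size=(50, 10))

ratio = explained_variance_ratio(noisy)
print("Per-component ratio:", ratio, "retained in 2-D:", ratio.sum())
```

When the retained fraction is low, distances in the scatter plot say little about distances in the original embedding space, and apparent clusters or gaps deserve scepticism.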



## 4) Named Entity Recognition (NER) on Transcript
- Model: `dslim/bert-base-NER` (CPU-friendly)


In [None]:

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple", device=-1)
sample_transcript = """Lewis Hamilton speaks with the Mercedes engineer on team radio during the Monaco Grand Prix.
The FIA confirmed a five-second penalty for exceeding track limits."""
for ent in ner(sample_transcript):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))



**Extend:** Normalize entities (e.g., map “FIA” → governing body), dedupe, attach timestamps from ASR.
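
The extension steps could be sketched roughly as follows. The entity dicts mimic the shape returned by the aggregated `ner` pipeline (`entity_group`, `word`, `score`, `start`, `end`) but are hand-written here, not model output. The alias table `ALIASES` and the segment format (`char_start`, `char_end`, `t_start`, `t_end`) are hypothetical and would need to match whatever your ASR step actually produces.

```python
# Hypothetical alias table mapping surface forms to a canonical name
ALIASES = {"fia": "FIA (governing body)"}

def normalize(name):
    """Map a surface form to its canonical name, falling back to the trimmed original."""
    return ALIASES.get(name.strip().lower(), name.strip())

def dedupe(entities):
    """Keep the highest-scoring mention per (entity type, canonical name)."""
    best = {}
    for ent in entities:
        key = (ent["entity_group"], normalize(ent["word"]))
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = {**ent, "canonical": key[1]}
    return list(best.values())

def attach_timestamp(ent, segments):
    """Copy the time range of the segment whose character span contains the entity start."""
    for seg in segments:
        if seg["char_start"] <= ent["start"] < seg["char_end"]:
            return {**ent, "t_start": seg["t_start"], "t_end": seg["t_end"]}
    return {**ent, "t_start": None, "t_end": None}

# Hand-written sample inputs (assumed values)
toy_entities = [
    {"entity_group": "ORG", "word": "FIA", "score": 0.99, "start": 96, "end": 99},
    {"entity_group": "ORG", "word": "fia", "score": 0.80, "start": 150, "end": 153},
    {"entity_group": "PER", "word": "Lewis Hamilton", "score": 0.98, "start": 0, "end": 14},
]
toy_segments = [
    {"char_start": 0, "char_end": 92, "t_start": 0.0, "t_end": 4.5},
    {"char_start": 92, "char_end": 200, "t_start": 4.5, "t_end": 8.0},
]

tagged = [attach_timestamp(e, toy_segments) for e in dedupe(toy_entities)]
for t in tagged:
    print(t["entity_group"], t["canonical"], t["t_start"], t["t_end"])
```

Keeping the best-scoring mention is one possible policy; a search index might instead keep every mention so that each timestamp remains reachable.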



## 5) Image Tagging for Frames (Vision Baseline)
- Model: `google/vit-base-patch16-224` (or try `microsoft/resnet-50`)
- **If Colab blocks downloads**, upload a couple of images via the Files pane and modify the code to load local paths.


In [None]:

from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests, io

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
vision_model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

# Provide two small images; replace with local file paths if needed
image_urls = [
  "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
  "https://huggingface.co/datasets/hf-internal-testing/example-images/resolve/main/racing.jpg"
]

def load_image(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # report HTTP errors here instead of a confusing failure in Image.open
    return Image.open(io.BytesIO(resp.content)).convert("RGB")

def classify_image(img: Image.Image):
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        logits = vision_model(**inputs).logits
    pred_id = logits.argmax(-1).item()
    return vision_model.config.id2label[pred_id]

for url in image_urls:
    try:
        img = load_image(url)
        print(url, "=>", classify_image(img))
    except Exception as e:
        print("Failed to load:", url, e)



**Extension:** Extract **embeddings** from the penultimate layer and cluster images (see Bonus section).



## 6) ASR (Speech → Text) with Whisper-Tiny (CPU)
- Model: `openai/whisper-tiny.en`
- Keep clips **very short** for CPU. Upload a `.wav` (16 kHz mono recommended).
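
Before transcribing, it can help to confirm that an uploaded clip really is short, mono, and 16 kHz. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM `.wav` files; the helper names `wav_info` and `write_silence` and the temporary file name are assumptions for illustration.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate_hz, n_channels, duration_seconds) for a PCM .wav file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, channels, duration

def write_silence(path, seconds=1.0, rate=16000):
    """Write a mono 16-bit silent clip, useful as a smoke-test input."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))

toy_wav = os.path.join(tempfile.gettempdir(), "silence_16k.wav")
write_silence(toy_wav)
print(wav_info(toy_wav))
```

Calling `wav_info` on your own upload (for example the path you assign to `audio_path`) shows at a glance whether the clip needs trimming or resampling first.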


In [None]:

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en", device=-1)

# TODO: Replace with your uploaded path, e.g., /content/sample.wav
audio_path = "/content/sample.wav"
try:
    result = asr(audio_path)
    print(result["text"])
except Exception as e:
    print("Provide a valid short .wav file path in 'audio_path'. Error:", e)



## 7) Precision/Recall & F1 (Evaluation mini-lab)


In [None]:

from sklearn.metrics import precision_recall_fscore_support, classification_report

y_true = [1,0,1,0,1,0]  # ground truth for 6 items
y_pred = [1,0,0,0,1,1]  # pretend model predictions

print(classification_report(y_true, y_pred, target_names=["Non-Racing","Racing"]))

prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print("Precision:", round(prec,3), "Recall:", round(rec,3), "F1:", round(f1,3))



**Discuss:** Search often optimizes **recall** (don’t miss relevant scenes). Compliance/brand-safety may require higher **precision**.
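
To connect the report to the underlying definitions, the same numbers can be derived by hand from true positives, false positives, and false negatives. This sketch reuses the six labels from the cell above; the helper names `confusion_counts` and `f_beta` are assumptions. Setting `beta` above 1 weights recall more heavily and below 1 weights precision more heavily, which is one way to encode the search versus brand-safety trade-off.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, and false negatives for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

tp, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of the items predicted Racing, how many were correct
recall = tp / (tp + fn)     # of the true Racing items, how many were found
print("TP, FP, FN:", tp, fp, fn)
print("Precision:", round(precision, 3), "Recall:", round(recall, 3),
      "F1:", round(f_beta(precision, recall), 3))
```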



## 8) Transfer Learning Lite (Feature Extractor → Classifier)
- Use text embeddings from a pre-trained model as features for a small classifier.


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = [
  "Ferrari takes pole position at Monza",
  "Central bank raises interest rates",
  "Safety car deployed after crash",
  "Quarterly earnings beat expectations",
  "Pit stop under two seconds",
  "Budget deficit narrows in 2025"
]
y = [1,0,1,0,1,0]  # 1 = Racing, 0 = Finance

emb = model.encode(texts, normalize_embeddings=True)
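# stratify=y keeps both classes in the train and test splits of this tiny dataset
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.33, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
# Only two test items, so treat this number as a smoke test, not a benchmark
print("Accuracy:", clf.score(X_test, y_test))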

# Try your own sentence:
test_q = "Hamilton boxes for soft tyres"
pred = clf.predict(model.encode([test_q], normalize_embeddings=True))
print("Pred (1=Racing):", pred[0])



**Note:** This is *transfer learning* because we reuse a pre-trained encoder and only learn a tiny classifier head.
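
The split between a frozen encoder and a trainable head can be illustrated without any pre-trained model. In the sketch below a fixed random projection stands in for the encoder (an assumption, purely to have something frozen), and a logistic-regression head is trained with plain gradient descent. The names `W_frozen`, `encode`, `train_head`, and the toy inputs are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained encoder: a fixed random projection (assumption)
W_frozen = rng.normal(size=(3, 4))

def encode(x):
    """Frozen feature extractor: W_frozen is read but never updated."""
    return np.tanh(x @ W_frozen)

def train_head(feats, y, lr=0.5, steps=200):
    """Fit a logistic-regression head on fixed features with gradient descent."""
    Xb = np.hstack([feats, np.ones((len(feats), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))            # predicted probabilities
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        losses.append(float(loss))
        w -= lr * Xb.T @ (p - y) / len(y)              # gradient of the mean log loss
    return w, losses

# Toy inputs and labels (assumed values)
toy_raw = np.array([[1.0, 0.2, 0.0], [0.9, 0.1, 0.1],
                    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]])
toy_y = np.array([1.0, 1.0, 0.0, 0.0])

W_before = W_frozen.copy()
w_head, losses = train_head(encode(toy_raw), toy_y)
print("Loss:", round(losses[0], 3), "->", round(losses[-1], 3))
print("Encoder unchanged:", np.array_equal(W_frozen, W_before))
```

Only the five head weights (four features plus a bias) are learned, which is why this style of transfer learning stays cheap on CPU.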



## 9) Bonus: Unsupervised Frame Clustering (End-to-end)
- Extract **image embeddings** from ViT’s hidden states and **cluster** them.


In [None]:

# Reuse the ViT processor/model from earlier. We'll extract CLS token embeddings.
from sklearn.cluster import KMeans

def image_embedding(img):
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        outputs = vision_model(**inputs, output_hidden_states=True)
        cls = outputs.hidden_states[-1][:,0,:].squeeze(0).numpy()
    return cls

# Prepare a tiny image set: upload your own or reuse downloaded samples twice
try:
    imgs = []
    for url in image_urls:
        imgs.append(load_image(url))
        imgs.append(load_image(url))

    embs = np.vstack([image_embedding(im) for im in imgs])
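    # Separate names so the `kmeans`/`labels` from Section 2 stay intact for the PCA plot
    img_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
    img_labels = img_kmeans.fit_predict(embs)
    print("Cluster distribution:", {int(i): int((img_labels == i).sum()) for i in np.unique(img_labels)})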
except Exception as e:
    print("If downloads fail, upload a few images and modify the code to load local files. Error:", e)



**Wrap-up:** Name clusters (e.g., “cars/sports” vs “misc”), and decide where to store `cluster_id` in your metadata schema.
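
One possible shape for such a metadata record is sketched below. The `FrameTag` fields, the `CLUSTER_NAMES` mapping, and the frame IDs are all hypothetical; adapt them to your own schema. Because k-means cluster IDs are arbitrary and can change when the model is refitted, the sketch stores the source model name next to each ID.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical names assigned by a person after inspecting each cluster
CLUSTER_NAMES = {0: "cars/sports", 1: "misc"}

@dataclass
class FrameTag:
    frame_id: str
    cluster_id: int
    cluster_name: str
    model: str  # which encoder produced the embeddings behind this cluster

def tag_frames(frame_ids, cluster_ids, model_name):
    """Pair each frame with its cluster ID and a human-readable cluster name."""
    return [
        FrameTag(fid, int(cid), CLUSTER_NAMES.get(int(cid), "unnamed"), model_name)
        for fid, cid in zip(frame_ids, cluster_ids)
    ]

records = tag_frames(["f0001", "f0002", "f0003"], [0, 1, 0],
                     "google/vit-base-patch16-224")
print(json.dumps([asdict(r) for r in records], indent=2))
```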
