# Call Quality Analyzer (Colab)
Fast, lightweight pipeline that works on Colab Free and aims for <30s per call.

**Test link**: https://www.youtube.com/watch?v=4ostqJD3Psc

**What you get**
1) Talk-time ratio per speaker  
2) Number of questions asked  
3) Longest monologue duration  
4) Overall sentiment  
5) One actionable insight  
_Bonus_: heuristic label of Sales Rep vs Customer

---
### How it works (quick)
- Download audio via `yt-dlp`, convert to 16 kHz mono.
- Transcribe with **faster-whisper (tiny)** for speed + robustness.
- **Lightweight diarization**: extract MFCC embeddings in short windows and **KMeans (k=2)** to separate speakers; merge contiguous segments.
- Compute metrics from aligned words+segments.
- Sentiment via **NLTK VADER** (fast, no GPU).
- Actionable insight is generated from ratios, question counts, and sentiment.

Run cells top-to-bottom. Comments explain each step.
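
To check the sub-30-second target on your own runtime, each stage can be wrapped in a small timer. The following is a minimal sketch; `stage_timer` and `timings` are hypothetical names that do not appear in the notebook cells.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical registry: stage name -> elapsed seconds

@contextmanager
def stage_timer(name):
    """Record the wall-clock time spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Stand-in workload; in the notebook, wrap the body of a cell instead.
with stage_timer("demo"):
    sum(i * i for i in range(100_000))

print({k: round(v, 3) for k, v in timings.items()},
      "total:", round(sum(timings.values()), 3))
```

Summing `timings` after the last cell gives a per-call figure to compare against the 30-second goal.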

In [1]:
#@title 1) Setup (installs) — ~10-20s
!pip install -q yt-dlp faster-whisper==1.0.3 librosa==0.10.2.post1 pydub==0.25.1 numpy==1.26.4 scipy==1.13.1 scikit-learn==1.3.2 nltk==3.9.1





In [2]:
#@title 2) Imports & small utils
import os, re, json
import numpy as np
import librosa
from sklearn.cluster import KMeans
from faster_whisper import WhisperModel
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

def hhmmss(seconds: float) -> str:
    seconds = max(0, float(seconds))
    h = int(seconds // 3600); seconds -= h*3600
    m = int(seconds // 60); s = seconds - m*60
    if h: return f"{h:d}:{m:02d}:{s:04.1f}"
    return f"{m:d}:{s:04.1f}"


In [3]:
#@title 3) Download audio from YouTube (or provide your own file path)
YT_URL = "https://www.youtube.com/watch?v=4ostqJD3Psc"  #@param {type:"string"}
OUTPUT_WAV = "call_16k_mono.wav"

# Use yt-dlp to pull the best-quality audio track; ffmpeg then converts it to 16 kHz mono WAV.
# Both steps stay fast on Colab Free.
tmp_m4a = "tmp_audio.m4a"
!yt-dlp -f bestaudio -x --audio-format m4a -o "{tmp_m4a}" "{YT_URL}" -q

# Convert to 16k mono wav for consistent processing
!ffmpeg -y -i "{tmp_m4a}" -ac 1 -ar 16000 "{OUTPUT_WAV}" -loglevel error

print("Saved:", OUTPUT_WAV, "Size:", os.path.getsize(OUTPUT_WAV)//1024, "KB")


Saved: call_16k_mono.wav Size: 3834 KB


In [4]:
#@title 4) Transcribe quickly with faster-whisper (tiny)
AUDIO_PATH = OUTPUT_WAV

# tiny is fast; beam_size=1 speeds up decoding, and vad_filter skips non-speech, which helps on noisy audio.
model = WhisperModel("tiny", device="cpu", compute_type="int8")
segments, info = model.transcribe(AUDIO_PATH, beam_size=1, vad_filter=True, word_timestamps=True)

words = []  # flattened word-level timings
for seg in segments:
    if seg.words:
        for w in seg.words:
            words.append({"start": w.start, "end": w.end, "text": w.word})
    else:
        # Fallback when word timestamps are missing: treat the whole segment as one entry
        words.append({"start": seg.start, "end": seg.end, "text": seg.text})

# Whisper tokens carry leading spaces, so strip each one before joining
full_text = " ".join(t["text"].strip() for t in words)
print("Transcription done. Duration (s):", round(info.duration, 2))
print("Approx. words:", len(words))



Transcription done. Duration (s): 122.72
Approx. words: 327


In [5]:
#@title 5) Lightweight 2-speaker diarization (fast KMeans on MFCC windows)
# Window the audio, compute MFCC mean+var features per window, then KMeans(k=2).
# Map windows back to time to get speaker turns and merge contiguous segments.

y, sr = librosa.load(AUDIO_PATH, sr=16000, mono=True)
win_sec = 1.5
hop_sec = 0.5
win = int(win_sec*sr)
hop = int(hop_sec*sr)

features = []
times = []
for start in range(0, len(y) - win + 1, hop):  # +1 keeps a window that ends exactly on the last sample
    yw = y[start:start+win]
    mfcc = librosa.feature.mfcc(y=yw, sr=sr, n_mfcc=13)
    feat = np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])
    features.append(feat)
    times.append((start/sr, (start+win)/sr))

features = np.array(features)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(features)

# Build merged segments per speaker label
merged = []
for (t0,t1), lab in zip(times, labels):
    if not merged:
        merged.append([t0,t1,lab])
    else:
        if lab == merged[-1][2] and t0 <= merged[-1][1] + 0.05:
            merged[-1][1] = t1  # extend
        else:
            merged.append([t0,t1,lab])

# Clip within audio duration
duration = len(y)/sr
for seg in merged:
    seg[0] = max(0.0, seg[0]); seg[1] = min(duration, seg[1])

# Prepare speaker map (S0, S1)
speaker_map = {0: "S0", 1: "S1"}  # temporary; we'll relabel with bonus heuristic later
speaker_segments = [{"start": s, "end": e, "label": speaker_map[l]} for s,e,l in merged if e > s]

print("Speaker segments:", len(speaker_segments))
print("Total duration:", hhmmss(duration))


Speaker segments: 5
Total duration: 2:02.7
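
Because each 1.5 s window overlaps its neighbours (the hop is 0.5 s), segments built from full window spans overlap wherever the label changes. One possible alternative, sketched here as an assumption rather than what the cell above does, is to credit each window only with the hop-long slice around its centre and then merge runs of equal labels. `windows_to_turns` is a hypothetical helper.

```python
def windows_to_turns(labels, win_sec=1.5, hop_sec=0.5, duration=None):
    """Merge per-window labels into non-overlapping (start, end, label) turns.

    Window i spans [i*hop, i*hop + win]; here it is credited only with the
    hop-long slice centred on its midpoint, so adjacent slices tile the
    timeline without overlap.
    """
    turns = []
    for i, lab in enumerate(labels):
        centre = i * hop_sec + win_sec / 2
        t0, t1 = centre - hop_sec / 2, centre + hop_sec / 2
        if turns and turns[-1][2] == lab:
            turns[-1][1] = t1            # same speaker: extend the current turn
        else:
            turns.append([t0, t1, lab])  # label changed: start a new turn
    if turns:
        turns[0][0] = 0.0                # stretch the first turn back to time zero
        if duration is not None:
            turns[-1][1] = duration      # and the last turn to the end of the audio
    return [tuple(t) for t in turns]

# Toy labels for six windows (made up for illustration).
print(windows_to_turns([0, 0, 0, 1, 1, 0], duration=4.0))
```

With this tiling, per-speaker durations add up to the audio length, so talk-time percentages no longer include the overlap at each speaker change.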


In [6]:
#@title 6) Align words to speakers
# For each word, find the segment that overlaps its midpoint and assign that speaker.
def assign_speaker(words, segments):
    seg_idx = 0
    assigned = []
    for w in words:
        mid = 0.5*(w["start"] + w["end"])
        # advance seg_idx until segment covers mid
        while seg_idx < len(segments) and segments[seg_idx]["end"] < mid:
            seg_idx += 1
        spk = None
        if seg_idx < len(segments) and segments[seg_idx]["start"] <= mid <= segments[seg_idx]["end"]:
            spk = segments[seg_idx]["label"]
        assigned.append({**w, "speaker": spk or "UNK"})
    return assigned

words_spk = assign_speaker(words, speaker_segments)
print("Assigned words with speakers:", len(words_spk))


Assigned words with speakers: 327
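
The midpoint rule is easiest to see on toy data. The sketch below uses a plain linear scan (`speaker_at`, a hypothetical helper) instead of the cell's advancing index, and the words and segments are made up for illustration, not taken from the sample call.

```python
def speaker_at(t, segments):
    """Return the label of the first segment containing time t, else 'UNK'."""
    for seg in segments:
        if seg["start"] <= t <= seg["end"]:
            return seg["label"]
    return "UNK"

toy_segments = [
    {"start": 0.0, "end": 2.0, "label": "S0"},
    {"start": 2.5, "end": 4.0, "label": "S1"},
]
toy_words = [
    {"start": 0.2, "end": 0.6, "text": "hello"},  # midpoint inside the S0 segment
    {"start": 1.8, "end": 2.4, "text": "there"},  # midpoint falls in the gap
    {"start": 2.4, "end": 3.0, "text": "hi"},     # midpoint inside the S1 segment
]

assigned = []
for w in toy_words:
    mid = 0.5 * (w["start"] + w["end"])
    assigned.append(speaker_at(mid, toy_segments))
    print(w["text"], "->", assigned[-1])
```

A word whose midpoint lands between segments gets `UNK`, which is why later cells have to tolerate a speaker key outside `S0`/`S1`.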


In [7]:
#@title 7) Metrics: talk-time, question count, longest monologue, sentiment
from collections import defaultdict

# Talk-time per speaker
dur_per_spk = defaultdict(float)
for seg in speaker_segments:
    dur_per_spk[seg["label"]] += (seg["end"] - seg["start"])

total_talk = sum(dur_per_spk.values()) or 1e-6
talk_ratio = {spk: round(100.0*dur/total_talk, 1) for spk, dur in dur_per_spk.items()}

# Questions: simple heuristic. Take the larger of the '?' count and the number of sentences that open with a WH- word or auxiliary verb
text_by_spk = defaultdict(list)
for w in words_spk:
    text_by_spk[w["speaker"]].append(w["text"])
joined_by_spk = {spk: " ".join(toks) for spk, toks in text_by_spk.items()}

def count_questions(text):
    # Two estimates: literal '?' marks, and sentences that open with a WH- word or auxiliary verb
    q_mark = text.count("?")
    sents = re.split(r"(?<=[\.!?])\s+", text)
    wh = sum(1 for s in sents if re.match(r"\s*(who|what|when|where|why|how|which|can|could|would|should|do|does|did)\b", s.strip(), re.IGNORECASE))
    # Take the larger so the same question is not counted by both estimates
    return max(q_mark, wh)

questions_total = 0
questions_per_spk = {}
for spk, tx in joined_by_spk.items():
    qc = count_questions(tx)
    questions_per_spk[spk] = qc
    questions_total += qc

# Longest monologue (max contiguous segment)
longest = 0.0; longest_spk = None; longest_span = (0,0)
for seg in speaker_segments:
    d = seg["end"] - seg["start"]
    if d > longest:
        longest = d; longest_spk = seg["label"]; longest_span = (seg["start"], seg["end"])

# Sentiment via VADER on the full transcript (whitespace-normalized)
sent_scores = sia.polarity_scores(" ".join(full_text.split()))
# Label from the compound score; ±0.25 is a stricter cutoff than VADER's commonly used ±0.05
overall_sent = "positive" if sent_scores["compound"] >= 0.25 else ("negative" if sent_scores["compound"] <= -0.25 else "neutral")

print("Talk-time ratio:", talk_ratio)
print("Questions by speaker:", questions_per_spk, " Total:", questions_total)
print("Longest monologue:", round(longest,1), "s by", longest_spk, f"({hhmmss(longest_span[0])}–{hhmmss(longest_span[1])})")
print("Overall sentiment:", overall_sent, sent_scores)


Talk-time ratio: {'S0': 96.4, 'S1': 3.6}
Questions by speaker: {'S0': 8}  Total: 8
Longest monologue: 111.5 s by S0 (0:07.0–1:58.5)
Overall sentiment: positive {'neg': 0.008, 'neu': 0.794, 'pos': 0.198, 'compound': 0.9927}
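
The talk-time ratio above is computed from diarization segments, so any silence inside a segment counts as talking. A possible cross-check, sketched here under the assumption that word timestamps are reliable, is to sum word durations per speaker instead. `talk_time_from_words` is a hypothetical helper, and the toy list mimics the shape of `words_spk`.

```python
from collections import defaultdict

def talk_time_from_words(words_spk):
    """Sum word durations per speaker and return percentage shares."""
    dur = defaultdict(float)
    for w in words_spk:
        if w["speaker"] != "UNK":  # skip words that matched no segment
            dur[w["speaker"]] += max(0.0, w["end"] - w["start"])
    total = sum(dur.values())
    if total == 0:
        return {}
    return {spk: round(100.0 * d / total, 1) for spk, d in dur.items()}

# Toy input (made up for illustration).
toy = [
    {"start": 0.0, "end": 1.0, "text": "hello", "speaker": "S0"},
    {"start": 1.0, "end": 3.0, "text": "thanks", "speaker": "S1"},
    {"start": 3.0, "end": 4.0, "text": "bye", "speaker": "S0"},
    {"start": 4.0, "end": 4.5, "text": "uh", "speaker": "UNK"},
]
print(talk_time_from_words(toy))
```

If the two ratios disagree sharply, that suggests the clustering is splitting on something other than speaker identity.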


In [8]:
#@title 8) Bonus heuristic: label Sales Rep vs Customer
# Heuristics (cheap, fast):
# - Sales rep tends to ask more questions, uses sales keywords, and may speak more.
sales_keywords = set("""price pricing demo discount trial contract plan subscription quote upgrade onboarding features roadmap integration api support invoice budget stakeholder proposal pilot deployment competitor timeline requirement next steps follow-up""".split())

def score_sales_like(text):
    # Keep hyphens inside tokens so the keyword "follow-up" can match
    toks = re.findall(r"[a-z'-]+", text.lower())
    return sum(1 for t in toks if t in sales_keywords)

spk_scores = {}
for spk, tx in joined_by_spk.items():
    spk_scores[spk] = 1.0*questions_per_spk.get(spk,0) + 0.5*score_sales_like(tx) + 0.2*talk_ratio.get(spk,0)

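# Higher score => more likely the Sales Rep (words that matched no segment are ignored)
candidates = [s for s in spk_scores if s != "UNK"]
if len(candidates) >= 2:
    sales_spk = max(candidates, key=spk_scores.get)
    customer_spk = next(s for s in candidates if s != sales_spk)
else:
    # Only one speaker has transcribed words: call that speaker the rep, the other label the customer
    sales_spk = candidates[0] if candidates else "S0"
    customer_spk = "S1" if sales_spk == "S0" else "S0"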

role_map = {sales_spk: "Sales Rep", customer_spk: "Customer"}
print("Role scores:", spk_scores)
print("Heuristic roles:", role_map)


Role scores: {'S0': 28.28}
Heuristic roles: {'S0': 'Sales Rep', 'S1': 'Customer'}
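
The role score is a plain weighted sum, so the reported value can be checked by hand. The sketch below restates the formula as a function (`role_score` is a hypothetical name) and plugs in the sample run's figures; the keyword count of 2 is inferred from the reported score, since the notebook does not print it.

```python
def role_score(questions, keyword_hits, talk_pct):
    """Weighted sum used by the heuristic: questions, sales keywords, talk share."""
    return 1.0 * questions + 0.5 * keyword_hits + 0.2 * talk_pct

# Sample run: S0 asked 8 questions and held a 96.4% talk share.
# keyword_hits=2 is an inference from the reported 28.28, not a printed value.
print(round(role_score(8, 2, 96.4), 2))
```

Here the talk-share term contributes 19.28 of the 28.28 total, so with these weights a lopsided talk-time ratio largely decides who is labeled the Sales Rep.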


In [9]:
#@title 9) Actionable insight (simple rule-based)
insights = []

# Balance suggestion
diff = abs(talk_ratio.get("S0",0) - talk_ratio.get("S1",0))
if diff > 20:
    dominant = max(talk_ratio, key=talk_ratio.get)
    insights.append(f"{role_map.get(dominant, dominant)} dominated the call. Aim for a more balanced split.")

# Question suggestion: flag the call if even the most active asker asked fewer than 5 questions
asker = max(questions_per_spk, key=questions_per_spk.get) if questions_per_spk else None
if asker and any(s != asker for s in talk_ratio) and questions_per_spk[asker] < 5:
    insights.append(f"Ask more open-ended questions to uncover needs (only {questions_per_spk[asker]} questions asked).")

# Sentiment suggestion
if overall_sent == "negative":
    insights.append("Address objections early; sentiment skewed negative.")
elif overall_sent == "neutral":
    insights.append("Try adding a clear value statement and next steps to lift sentiment.")

actionable = insights[0] if insights else "Summarize needs, confirm next steps, and schedule follow-up."

print("Actionable insight:", actionable)


Actionable insight: Sales Rep dominated the call. Aim for a more balanced split.


In [10]:
#@title 10) Final report
report = {
    "talk_time_ratio_percent": talk_ratio,
    "questions_per_speaker": questions_per_spk,
    "questions_total": int(sum(questions_per_spk.values())),
    "longest_monologue_seconds": round(float(max((seg["end"]-seg["start"]) for seg in speaker_segments)),1) if speaker_segments else 0.0,
    "longest_monologue_speaker": longest_spk,
    "longest_monologue_span": {"start_sec": round(longest_span[0],1), "end_sec": round(longest_span[1],1)},
    "overall_sentiment": overall_sent,
    "roles": role_map,
    "actionable_insight": actionable,
}

print(json.dumps(report, indent=2))


{
  "talk_time_ratio_percent": {
    "S0": 96.4,
    "S1": 3.6
  },
  "questions_per_speaker": {
    "S0": 8
  },
  "questions_total": 8,
  "longest_monologue_seconds": 111.5,
  "longest_monologue_speaker": "S0",
  "longest_monologue_span": {
    "start_sec": 7.0,
    "end_sec": 118.5
  },
  "overall_sentiment": "positive",
  "roles": {
    "S0": "Sales Rep",
    "S1": "Customer"
  },
  "actionable_insight": "Sales Rep dominated the call. Aim for a more balanced split."
}


---
## Notes
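- This notebook avoids heavy diarization models for speed. The KMeans-on-MFCC approach is only a rough approximation for 2-speaker sales calls: in the sample run above it attributed 96.4% of talk time and every transcribed word to S0, so verify the speaker split before relying on it.
- For longer calls, consider switching to `WhisperModel('base')` (still fast) or enabling a GPU runtime in Colab if one is available.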
- Metrics are deterministic and explained in comments for review.
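
The model and device choice mentioned in the notes can be gathered into one place. This is a minimal sketch, not part of the notebook: `pick_whisper_settings` is a hypothetical helper, and it assumes that finding `nvidia-smi` on the PATH means a CUDA GPU is usable.

```python
import shutil

def pick_whisper_settings(prefer_accuracy=False):
    """Return keyword arguments for WhisperModel based on a crude GPU check.

    Assumption: the presence of `nvidia-smi` on PATH signals a usable CUDA GPU.
    """
    has_gpu = shutil.which("nvidia-smi") is not None
    return {
        "model_size_or_path": "base" if prefer_accuracy else "tiny",
        "device": "cuda" if has_gpu else "cpu",
        "compute_type": "float16" if has_gpu else "int8",
    }

settings = pick_whisper_settings(prefer_accuracy=True)
print(settings)
# In the notebook this would be used as: model = WhisperModel(**settings)
```

On a CPU-only runtime this falls back to the same `int8` setting used in the transcription cell.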
