<a href="https://colab.research.google.com/github/AbdulAhadSiddiqui-0786/Voice-AI/blob/main/Voice_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pytube pydub openai-whisper pyannote.audio torch torchvision torchaudio transformers -q


In [2]:
!pip install yt-dlp -q
from pydub import AudioSegment

url = "https://www.youtube.com/watch?v=4ostqJD3Psc"

# Download audio with yt-dlp
!yt-dlp -x --audio-format mp3 -o "call.%(ext)s" {url}

# Convert to WAV: mono, 16 kHz
audio = AudioSegment.from_file("call.mp3")
audio = audio.set_channels(1).set_frame_rate(16000)
audio.export("call.wav", format="wav")

print(" Audio downloaded & converted to WAV using yt-dlp!")


[youtube] Extracting URL: https://www.youtube.com/watch?v=4ostqJD3Psc
[youtube] 4ostqJD3Psc: Downloading webpage
[youtube] 4ostqJD3Psc: Downloading tv simply player API JSON
[youtube] 4ostqJD3Psc: Downloading tv client config
[youtube] 4ostqJD3Psc: Downloading player 0004de42-main
[youtube] 4ostqJD3Psc: Downloading tv player API JSON
[info] 4ostqJD3Psc: Downloading 1 format(s): 251
[download] call.mp3 has already been downloaded
[ExtractAudio] Not converting audio call.mp3; file is already in target format mp3
 Audio downloaded & converted to WAV using yt-dlp!
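The later cells assume the exported file really is mono at 16 kHz. Below is a minimal sketch of a sanity check that uses only the standard-library `wave` module. The helper name `wav_format` and the demo file `demo_16k_mono.wav` are hypothetical and not part of the original notebook; the demo writes one second of silence so the block runs without the downloaded call.

```python
import wave

def wav_format(path):
    """Return (channels, sample_rate_hz, duration_sec) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return channels, rate, duration

# Demo on a synthetic file (assumption): one second of 16-bit silence, mono, 16 kHz.
demo_path = "demo_16k_mono.wav"
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

channels, rate, duration = wav_format(demo_path)
print(channels, rate, duration)
```

In the notebook, calling `wav_format("call.wav")` after the conversion cell should report one channel and a 16000 Hz sample rate if the export behaved as intended.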


In [3]:
from pyannote.audio import Pipeline
from google.colab import userdata
from tqdm import tqdm
import soundfile as sf
from pydub import AudioSegment
import math
import os
import pickle

# Hugging Face access token, read from Colab secrets
HUGGINGFACE_TOKEN = userdata.get('HUGGINGFACE_TOKEN')

# 1. Load diarization model
print(" Loading diarization model...")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=HUGGINGFACE_TOKEN
)
print(" Model loaded!")

# 2. Load audio and get duration
audio_file = "call.wav"
info = sf.info(audio_file)
duration = info.duration  # in seconds
print(f" Audio duration: {duration:.1f} sec")

# 3. Split into 30s chunks
chunk_size = 30_000  # in ms
audio = AudioSegment.from_wav(audio_file)
chunks = math.ceil(len(audio) / chunk_size)
print(f" Splitting into {chunks} chunks of 30s each...")

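# 4. Process chunks with caching
# Cached results are reused on re-runs; delete the chunk_*_diarization.pkl files
# to force a fresh diarization pass.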
speaker_segments = []
with tqdm(total=chunks, desc="Processing chunks") as pbar:
    for i in range(chunks):
        start = i * chunk_size
        end = min((i + 1) * chunk_size, len(audio))
        chunk_file = f"chunk_{i}.wav"
        cache_file = f"chunk_{i}_diarization.pkl"

        # Export chunk only if not exists
        if not os.path.exists(chunk_file):
            audio[start:end].export(chunk_file, format="wav")

        # Load from cache if exists
        if os.path.exists(cache_file):
            with open(cache_file, "rb") as f:
                diarization = pickle.load(f)
        else:
            # Run diarization on this chunk
            diarization = pipeline(chunk_file)
            with open(cache_file, "wb") as f:
                pickle.dump(diarization, f)

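        # Extract speaker segments.
        # Each chunk is diarized independently, so a label such as SPEAKER_00
        # is local to its chunk and may refer to different people in
        # different chunks.
        offset = start / 1000  # chunk start in seconds
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            speaker_segments.append({
                "speaker": speaker,
                "start": turn.start + offset,  # shift to full-recording time
                "end": turn.end + offset,
                "duration": turn.end - turn.start
            })

        pbar.update(1)

print(" Diarization complete!")
print(" Speakers found:", {s["speaker"] for s in speaker_segments})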


 Loading diarization model...


  available_backends = torchaudio.list_audio_backends()
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _speechbrain_save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _speechbrain_load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _recover
INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Using symlink found at '/r

Model was trained with pyannote.audio 0.0.1, yours is 3.4.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.8.0+cu126. Bad things might happen unless you revert torch to 1.x.


DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _load
DEBUG:speechbrain.utils.checkpoints:Registered parameter transfer hook for _load
  wrapped_fwd = torch.cuda.amp.custom_fwd(fwd, cast_inputs=cast_inputs)
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load_if_possible
DEBUG:speechbrain.utils.parameter_transfer:Collecting files (or symlinks) for pretraining in /root/.cache/torch/pyannote/speechbrain.
INFO:speechbrain.utils.fetching:Fetch embedding_model.ckpt: Using symlink found at '/root/.cache/torch/pyannote/speechbrain/embedding_model.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["embedding_model"] = /root/.cache/torch/pyannote/speechbrain/embedding_model.ckpt
INFO:speechbrain.utils.fetching:Fetch mean_var_norm_emb.ckpt: Using symlink found at '/root

 Model loaded!
 Audio duration: 122.7 sec
 Splitting into 5 chunks of 30s each...


Processing chunks: 100%|██████████| 5/5 [00:00<00:00, 3264.56it/s]

 Diarization complete!
 Speakers found: {'SPEAKER_02', 'SPEAKER_01', 'SPEAKER_04', 'SPEAKER_00', 'SPEAKER_03'}
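The chunking arithmetic used above can be sketched on its own, together with one conservative way to treat the resulting labels. Because every 30 s chunk goes through the pipeline separately, the labels are assigned per chunk, so the set of speakers printed above merges names, not necessarily people. The helpers `chunk_bounds` and `qualify_label` are hypothetical names introduced for illustration, and 122_700 ms is only an approximation of the 122.7 s duration reported above. Linking speakers across chunks would need a further step that is not shown here.

```python
import math

def chunk_bounds(total_ms, chunk_ms=30_000):
    """Return (start_ms, end_ms) pairs covering [0, total_ms) in fixed-size chunks."""
    n_chunks = math.ceil(total_ms / chunk_ms)
    return [(i * chunk_ms, min((i + 1) * chunk_ms, total_ms)) for i in range(n_chunks)]

def qualify_label(chunk_index, speaker):
    """Namespace a chunk-local label so labels from different chunks stay distinct."""
    return f"chunk{chunk_index}/{speaker}"

bounds = chunk_bounds(122_700)  # roughly the 122.7 s call (assumed value)
print(len(bounds), bounds[-1])
print(qualify_label(2, "SPEAKER_00"))
```

Namespacing keeps the bookkeeping honest: `chunk0/SPEAKER_00` and `chunk3/SPEAKER_00` are only treated as the same person once something has actually matched them.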





In [4]:
import whisper

# Load Whisper model
model = whisper.load_model("small")   # "small" trades some accuracy for speed compared with the larger checkpoints

# Transcribe audio
result = model.transcribe("call.wav")
transcript = result["text"]

print(" Transcription complete!")
print("Sample transcript:", transcript[:250], "...")




 Transcription complete!
Sample transcript:  Thank you for calling Nissan. My name is Lauren. Can I have your name? Yeah, my name is John Smith. Thank you, John. How can I help you? I was just calling about to see how much it would cost to update the map in my car. I'd be happy to help you wit ...
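The cell above keeps only `result["text"]`, but Whisper's result also carries `result["segments"]`, a list of dicts with `start`, `end`, and `text` keys. A minimal sketch of how those timed segments could be paired with diarization turns by maximum time overlap follows. The function names and the toy data are assumptions made for illustration, and the pairing is only as reliable as the speaker labels it is given.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(text_segments, speaker_turns):
    """Label each text segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in text_segments:
        best_speaker, best_overlap = None, 0.0
        for turn in speaker_turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        labeled.append({"speaker": best_speaker, "text": seg["text"].strip()})
    return labeled

# Toy data shaped like result["segments"] and speaker_segments (made-up timings).
toy_text = [
    {"start": 0.0, "end": 4.0, "text": " Thank you for calling."},
    {"start": 4.0, "end": 6.5, "text": " Yeah, my name is John."},
]
toy_turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.8},
    {"speaker": "SPEAKER_01", "start": 3.9, "end": 7.0},
]

labeled = assign_speakers(toy_text, toy_turns)
for row in labeled:
    print(row["speaker"], "|", row["text"])
```

A text segment that overlaps no turn at all is labeled `None`, which makes gaps in the diarization visible instead of silently guessing a speaker.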


In [5]:
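# Aliased so the import does not overwrite the pyannote `pipeline` object from In [3]
from transformers import pipeline as hf_pipeline
import re

# 1. Talk-time ratio: share of total diarized speech per speaker label, in percent
talk_time = {}
for seg in speaker_segments:
    talk_time[seg["speaker"]] = talk_time.get(seg["speaker"], 0) + seg["duration"]

total_time = sum(talk_time.values())
talk_ratios = {sp: round((dur/total_time)*100, 2) for sp, dur in talk_time.items()}

# 2. Number of questions
# Heuristic: take the larger of (a) the number of "?" characters and
# (b) the number of question-word occurrences. Count (b) matches these words
# anywhere in the text, including inside statements, so it tends to overcount.
num_questions = transcript.count("?")
extra_questions = len(re.findall(r"\b(what|why|how|when|where|can|do|is|are|does|did)\b", transcript.lower()))
question_count = max(num_questions, extra_questions)

# 3. Longest monologue: the longest single diarization turn.
# Consecutive turns by the same speaker are not merged, and the 30 s chunking
# caps any single turn at 30 s.
longest_monologue = max(seg["duration"] for seg in speaker_segments)

# 4. Sentiment of the first 500 characters only (the opening of the call, not the whole call)
sentiment_analyzer = hf_pipeline("sentiment-analysis")
sentiment = sentiment_analyzer(transcript[:500])[0]

# 5. Actionable insight (rule-based; the thresholds are fixed heuristics)
if max(talk_ratios.values()) > 70:
    insight = "One speaker dominated the conversation. Allow more balanced talk-time."
elif question_count < 3:
    insight = "Too few questions were asked. Ask more questions to engage the customer."
else:
    insight = "Good balance, but could improve listening."

# Identify roles (heuristic: the label with the most talk time is taken as the
# sales rep, the label with the least as the customer)
sales_rep = max(talk_ratios, key=talk_ratios.get)
customer = min(talk_ratios, key=talk_ratios.get)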


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cpu
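The longest-monologue metric above takes the longest single diarization turn. A minimal sketch of an alternative that merges consecutive turns by the same speaker before measuring is given below. The function name `longest_run`, the `max_gap` parameter, and its 1.0 s default are assumptions introduced here, not values from the notebook, and the toy turns are made up.

```python
def longest_run(segments, max_gap=1.0):
    """Return (speaker, seconds) for the longest uninterrupted same-speaker run.

    Consecutive segments by one speaker are merged when the pause between
    them is at most max_gap seconds.
    """
    ordered = sorted(segments, key=lambda s: s["start"])
    best = (None, 0.0)
    run_speaker, run_start, run_end = None, 0.0, 0.0
    for seg in ordered:
        same_run = seg["speaker"] == run_speaker and seg["start"] - run_end <= max_gap
        if same_run:
            run_end = max(run_end, seg["end"])  # extend the current run
        else:
            run_speaker, run_start, run_end = seg["speaker"], seg["start"], seg["end"]
        if run_end - run_start > best[1]:
            best = (run_speaker, run_end - run_start)
    return best

# Toy turns: speaker A talks twice with a 0.4 s pause, then B interrupts.
toy_segments = [
    {"speaker": "A", "start": 0.0, "end": 5.0},
    {"speaker": "A", "start": 5.4, "end": 12.0},
    {"speaker": "B", "start": 12.2, "end": 14.0},
    {"speaker": "A", "start": 14.5, "end": 16.0},
]
print(longest_run(toy_segments))
```

On the toy data the two opening turns by A merge into one 12 s run, whereas the per-turn maximum would report 6.6 s. The merge is only meaningful when the speaker labels are consistent over the whole recording.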


In [6]:
print("\n===== CALL QUALITY REPORT =====")
print("Talk-time ratio:", talk_ratios)
print("Questions asked:", question_count)
print("Longest monologue (sec):", round(longest_monologue, 2))
print("Call sentiment:", sentiment["label"], "| Confidence:", round(sentiment["score"], 2))
print("Actionable Insight:", insight)
print("Likely Sales Rep:", sales_rep, "| Likely Customer:", customer)



===== CALL QUALITY REPORT =====
Talk-time ratio: {'SPEAKER_00': 33.52, 'SPEAKER_01': 32.02, 'SPEAKER_02': 23.03, 'SPEAKER_03': 10.52, 'SPEAKER_04': 0.91}
Questions asked: 20
Longest monologue (sec): 11.61
Call sentiment: POSITIVE | Confidence: 1.0
Actionable Insight: Good balance, but could improve listening.
Likely Sales Rep: SPEAKER_00 | Likely Customer: SPEAKER_04
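The question count in the report comes from a word-level heuristic that matches words such as "is" or "can" wherever they occur. A sketch of a sentence-level alternative is shown below: a sentence counts as a question if it ends in "?" or opens with a question word. The function name `count_questions` and the `QUESTION_OPENERS` list are assumptions for illustration, and the result still depends on the punctuation Whisper happens to produce. The sample text is the opening of the transcript printed earlier.

```python
import re

# Assumed list of sentence-initial words that usually signal a question.
QUESTION_OPENERS = {
    "what", "why", "how", "when", "where", "who",
    "can", "could", "do", "does", "did", "is", "are", "would", "will",
}

def count_questions(text):
    """Count sentences that end with '?' or start with a question word."""
    sentences = re.findall(r"[^.!?]+[.!?]?", text)
    count = 0
    for sentence in sentences:
        stripped = sentence.strip()
        if not stripped:
            continue
        words = re.findall(r"[a-z']+", stripped.lower())
        if stripped.endswith("?") or (words and words[0] in QUESTION_OPENERS):
            count += 1
    return count

sample = (
    "Thank you for calling Nissan. My name is Lauren. Can I have your name? "
    "Yeah, my name is John Smith. Thank you, John. How can I help you?"
)

# The word-level count from the notebook's regex, for comparison.
word_level = len(re.findall(
    r"\b(what|why|how|when|where|can|do|is|are|does|did)\b", sample.lower()))

print("sentence-level:", count_questions(sample))
print("word-level:", word_level)
```

On this sample the sentence-level count is 2, matching the two actual questions, while the word-level regex finds 5 matches because "is" and "can" also appear inside statements and inside the same question.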
