# Install and Import Dependencies

In [1]:
!pip install numpy pandas librosa torch groq load_dotenv


In [2]:
# Imports
import os
import numpy as np
import librosa
import soundfile as sf
import torch
from groq import Groq
from datetime import timedelta

In [5]:
from load_dotenv import load_dotenv
print(load_dotenv('.env.local'))

assert os.environ.get('GROQ_API_KEY'), "Groq API key not found in .env.local; set GROQ_API_KEY before running this notebook"

True


# Feature Extraction
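Features extracted:
	•	ZCR (zero-crossing rate, mean over frames)
	•	Pitch (mean, standard deviation, variance)
	•	RMS energy (mean, standard deviation, variance)
	•	MFCC (mean over 13 coefficients)
	•	DeltaMFCC (mean)
	•	SpeakingRate (words per second)
	•	PauseCount (pauses longer than 1 s)
	•	PauseDuration (total length of those pauses, in seconds)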
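To make the first and third features concrete, here is a minimal sketch that computes a zero-crossing rate and an RMS value directly from their definitions. It is an illustration only: the helper names `zero_crossing_rate` and `rms` and the synthetic test tone are assumptions, and it works on the whole signal at once, whereas librosa computes both per frame and the notebook averages those frames.

```python
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def rms(samples):
    """Root-mean-square amplitude, a simple proxy for loudness."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Hypothetical test signal: one second of a 100 Hz sine sampled at 8 kHz.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 100 * n / sr) for n in range(sr)]

tone_zcr = zero_crossing_rate(tone)  # roughly 2 crossings per period
tone_rms = rms(tone)                 # roughly amplitude / sqrt(2)
print(tone_zcr, tone_rms)
```

A pure tone crosses zero twice per period, so its ZCR grows with frequency; noisy or unvoiced audio tends to have a higher ZCR than voiced speech.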

In [27]:
def extract_features(audio_path):
    data, sr = librosa.load(audio_path)

    # 1. Zero-Crossing Rate (ZCR)
    zcr = np.mean(librosa.feature.zero_crossing_rate(data))

    # 2. Pitch: Mean, Std Deviation, Variance
    pitch = librosa.yin(data, fmin=librosa.note_to_hz("C2"),
                        fmax=librosa.note_to_hz("C7"), sr=sr)
    pitch = np.nan_to_num(pitch, nan=0.0, posinf=0.0, neginf=0.0)
    pitch_mean = np.mean(pitch)
    pitch_std = np.std(pitch)
    pitch_var = np.var(pitch)

    # 3. Energy: RMS Mean, Std, Variance
    rms = librosa.feature.rms(y=data)[0]
    rms_mean = np.mean(rms)
    rms_std = np.std(rms)
    rms_var = np.var(rms)

    # 4. MFCC and Delta MFCC Mean
    mfcc = librosa.feature.mfcc(y=data, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    mfcc_mean = np.mean(mfcc)
    delta_mean = np.mean(delta)

    # 5. Transcription
    client = Groq()
    with open(audio_path, "rb") as file:
        transcription = client.audio.transcriptions.create(
            file=(audio_path, file.read()),
            model="distil-whisper-large-v3-en",
            response_format="verbose_json",
        )

    transcript = transcription.text

    # 6. Speaking Rate: Words per second
    duration_sec = librosa.get_duration(y=data, sr=sr)
    word_count = len(transcript.split())
    speaking_rate = word_count / duration_sec if duration_sec > 0 else 0

    # 7. Long Pauses (>1s)
    intervals = librosa.effects.split(data, top_db=30)
    pauses = []
    for i in range(1, len(intervals)):
        prev_end = intervals[i-1][1]
        cur_start = intervals[i][0]
        silence_duration = (cur_start - prev_end) / sr
        if silence_duration > 1.0:
            pauses.append(silence_duration)
    long_pause_count = len(pauses)
    long_pause_total = sum(pauses)

    return {
        "transcript": transcript,
        "zcr": zcr,
        "pitch_mean": pitch_mean,
        "pitch_std": pitch_std,
        "pitch_var": pitch_var,
        "rms_mean": rms_mean,
        "rms_std": rms_std,
        "rms_var": rms_var,
        "speaking_rate": speaking_rate,
        "long_pause_count": long_pause_count,
        "long_pause_duration": long_pause_total,
        "mfcc_mean": mfcc_mean,
        "delta_mean": delta_mean
    }
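The long-pause step in `extract_features` can be checked in isolation. The sketch below repeats the same gap logic on hand-written intervals instead of the output of `librosa.effects.split`; the helper name `long_pauses`, the sample rate, and the interval values are all hypothetical.

```python
def long_pauses(intervals, sr, min_pause=1.0):
    """Return the gaps (in seconds) between consecutive non-silent
    intervals that are longer than min_pause."""
    pauses = []
    for (_, prev_end), (cur_start, _) in zip(intervals, intervals[1:]):
        gap = (cur_start - prev_end) / sr
        if gap > min_pause:
            pauses.append(gap)
    return pauses


# Hypothetical (start, end) sample indices of non-silent regions.
sr = 1000
intervals = [(0, 2000), (2500, 4000), (5800, 7000)]

pauses = long_pauses(intervals, sr)
print(len(pauses), sum(pauses))
```

Only gaps between non-silent regions are measured, so silence before the first word or after the last one is not counted as a pause.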

# Send to an LLM for feedback

In [31]:
def generate_feedback(features):
    prompt = f"""
You're a communication coach. Analyze the following features of a speaker:

- Transcript: {features['transcript']}
- ZCR: {features['zcr']}
- Mean pitch: {features['pitch_mean']}
- Std pitch: {features['pitch_std']}
- Pitch Variance: {features['pitch_var']}
- RMS (mean/std/var): {features['rms_mean']}, {features['rms_std']}, {features['rms_var']}
- Speaking rate: {features['speaking_rate']} words/sec
- Long pauses: {features['long_pause_count']} pauses totaling {features['long_pause_duration']} sec
- MFCC mean: {features['mfcc_mean']}
- Delta MFCC mean: {features['delta_mean']}

Based on this data, provide feedback on the speaker's fluency, confidence, and delivery. Do not give much weight
to the transcript, since the speaker may choose words that mislead you. Rely primarily on the audio features above to assess these three traits.
"""

    client = Groq()
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=1,
        max_completion_tokens=32768,
        top_p=1,
        stream=False,
        stop=None,
    )

    return completion.choices[0].message
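The prompt interpolates each feature at full precision, and the second feedback output in this notebook repeats values such as 3.3073250933844847 verbatim. One possible refinement, sketched here as an assumption rather than as part of the original pipeline, is to round the numeric features before building the prompt. The helper `round_features` is hypothetical.

```python
def round_features(features, ndigits=3):
    """Return a copy of the feature dict with numeric values rounded.

    Strings and plain ints are kept as-is; other values (floats, NumPy
    scalars) are converted with float() and then rounded.
    """
    rounded = {}
    for name, value in features.items():
        if isinstance(value, (str, int)):
            rounded[name] = value
        else:
            rounded[name] = round(float(value), ndigits)
    return rounded


# Hypothetical subset of a feature dict, using values from this notebook.
example = {
    "transcript": "Hi, my name is ...",
    "speaking_rate": 3.3073250933844847,
    "long_pause_count": 0,
}
rounded_example = round_features(example)
print(rounded_example)
```

Under this sketch, the call would become `generate_feedback(round_features(features))`. Very small values such as `rms_var` may need a larger `ndigits` so they do not round to zero.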

# Run Pipeline

In [32]:
# An unconfident speech
path = "samples/unconfident.m4a"
features = extract_features(path)
feedback = generate_feedback(features)


In [33]:
features

{'transcript': " Um, um, so, so, so, my name is Utkarsh Malaya. I'm a very, very scared student and I have, I have a fear of public speaking. So this is, this is, this is, this is very difficult for me. And, yeah, I'm just, just trying to get one model that can recognize these filler words.",
 'zcr': np.float64(0.0805518689798599),
 'pitch_mean': np.float64(271.47274691287276),
 'pitch_std': np.float64(477.2604483344195),
 'pitch_var': np.float64(227777.53554437112),
 'rms_mean': np.float32(0.0074115577),
 'rms_std': np.float32(0.007082585),
 'rms_var': np.float32(5.0163006e-05),
 'speaking_rate': 2.075784313557683,
 'long_pause_count': 1,
 'long_pause_duration': np.float64(1.555736961451247),
 'mfcc_mean': np.float32(-27.511122),
 'delta_mean': np.float32(0.006345318)}

In [34]:
import IPython.display as ipd
ipd.Markdown(feedback.content)

Based on the provided audio features, I'll analyze the speaker's fluency, confidence, and delivery.

**Fluency:**
The speaker's fluency appears to be affected by their nervousness. The presence of filler words ("um", "so") in the transcript, although not the primary focus, is supported by the audio features. The speaking rate of 2.075 words/sec is slightly slower than average, which may indicate hesitation or difficulty in finding the right words. The long pause of 1.555 seconds also suggests that the speaker is taking time to gather their thoughts or calm their nerves. The Mean Pitch (271.47) and Std Pitch (477.26) are relatively high, which could be a sign of tension or anxiety affecting their vocal delivery. Considering these factors, I'd rate the speaker's fluency as 6/10.

**Confidence:**
The speaker's confidence seems to be low. The high Pitch Variance (227777.54) and Std Pitch (477.26) indicate that their tone is wavering, which is often a sign of nervousness or lack of confidence. The ZCR (Zero Crossing Rate) of 0.0806 is relatively low, which may suggest a more monotone or hesitant delivery. The RMS (mean) of 0.0074 is relatively low, indicating a softer voice, which can be a sign of lack of confidence. The MFCC mean (-27.51) and Delta MFCC mean (0.0063) are within normal ranges, but the overall combination of these features suggests that the speaker is not confident in their delivery. I'd rate the speaker's confidence as 4/10.

**Delivery:**
The speaker's delivery is affected by their nervousness and lack of confidence. The slow speaking rate and long pauses contribute to a somewhat hesitant delivery. The high Pitch Variance and Std Pitch make their tone less stable, which can be distracting for the audience. However, the speaker is trying to convey their message, and their RMS (std) of 0.0071 is relatively consistent, indicating some control over their voice. The MFCC features are within normal ranges, suggesting that the speaker's pronunciation and articulation are not significantly affected. Overall, I'd rate the speaker's delivery as 5/10.

To improve, the speaker could focus on relaxation techniques to reduce their nervousness, practice speaking at a slightly faster pace, and work on maintaining a more stable tone. Recording themselves and listening to the playback could help them become more aware of their filler words, pauses, and tone. With practice and experience, they can develop greater fluency, confidence, and delivery skills.

In [35]:
# A confident speech
path = "samples/confident.m4a"
features = extract_features(path)
feedback = generate_feedback(features)


In [36]:
features

{'transcript': " Hi, my name is Adkarsh Malaya. I'm a student and right now what I'm trying to do is I'm trying to get a model and I'm trying to use it to transcribe some filler words. I am very scared. I don't know what will happen and I really really hope this works.",
 'zcr': np.float64(0.08035611403023599),
 'pitch_mean': np.float64(287.1815062705668),
 'pitch_std': np.float64(466.33901796330355),
 'pitch_var': np.float64(217472.07967497836),
 'rms_mean': np.float32(0.0131712835),
 'rms_std': np.float32(0.010488089),
 'rms_var': np.float32(0.00011000002),
 'speaking_rate': 3.3073250933844847,
 'long_pause_count': 0,
 'long_pause_duration': 0,
 'mfcc_mean': np.float32(-24.901373),
 'delta_mean': np.float32(0.016268123)}
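With both feature dictionaries available, a quick way to see which features actually separate the two recordings is to compute the relative change from the unconfident sample to the confident one. The sketch below copies a few values from the two outputs above; the helper name `percent_change` is an assumption, and two recordings are far too few to draw general conclusions.

```python
# Values copied from the two feature outputs in this notebook.
unconfident = {
    "zcr": 0.0805518689798599,
    "pitch_mean": 271.47274691287276,
    "pitch_std": 477.2604483344195,
    "rms_mean": 0.0074115577,
    "speaking_rate": 2.075784313557683,
}
confident = {
    "zcr": 0.08035611403023599,
    "pitch_mean": 287.1815062705668,
    "pitch_std": 466.33901796330355,
    "rms_mean": 0.0131712835,
    "speaking_rate": 3.3073250933844847,
}


def percent_change(before, after):
    """Relative change of each shared feature, in percent."""
    return {
        name: 100.0 * (after[name] - before[name]) / before[name]
        for name in before
    }


changes = percent_change(unconfident, confident)
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

For these two samples, RMS energy and speaking rate change by large margins, while ZCR and the pitch statistics move by only a few percent.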

In [37]:
ipd.Markdown(feedback.content)

Based on the provided audio features, I'll analyze the speaker's fluency, confidence, and delivery.

**Fluency:**
The speaking rate of 3.3073250933844847 words/sec is slightly above average, indicating that the speaker is trying to convey their message at a moderate pace. However, the Zero Crossing Rate (ZCR) of 0.08035611403023599 is relatively low, which may suggest a more hesitant or uncertain tone. The lack of long pauses (0 pauses totaling 0 sec) is a positive indicator of fluency, as it shows the speaker is able to maintain a consistent flow. Overall, I would rate the speaker's fluency as average, with some room for improvement in terms of smoothness and naturalness.

**Confidence:**
The speaker's confidence can be inferred from their pitch and volume characteristics. The mean pitch of 287.1815062705668 is relatively high, which can be an indicator of anxiety or nervousness. The standard deviation of pitch (466.33901796330355) is quite high, suggesting significant pitch variations, which can be a sign of emotional instability. The RMS (mean, std, var) values are relatively low, indicating a soft and inconsistent volume. These audio features, combined with the speaker's explicit statement of being "very scared" in the transcript, suggest that the speaker is likely feeling anxious and lacking confidence.

**Delivery:**
The delivery of the speaker can be evaluated based on their prosody and articulation. The pitch variance (217472.07967497836) is high, which can make the speech sound less engaging and more monotonous. The MFCC mean (-24.9013729095459) and delta MFCC mean (0.01626812294125557) values are within normal ranges, indicating no significant issues with articulation. However, the overall delivery is likely to be perceived as nervous and uncertain due to the high pitch variability and soft volume. To improve their delivery, the speaker could focus on relaxing their tone, varying their pitch in a more controlled manner, and projecting their voice more confidently.

In summary, based on the audio features, the speaker's fluency is average, their confidence is low, and their delivery is nervous and uncertain. With practice and training, the speaker can work on improving their fluency, confidence, and delivery to become a more effective and engaging communicator.