# Install and Import Dependencies

In [1]:
!pip install numpy pandas librosa groq load_dotenv

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting load_dotenv
  Downloading load_dotenv-0.1.0-py3-none-any.whl.metadata (1.9 kB)
Collecting python-dotenv (from load_dotenv)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading load_dotenv-0.1.0-py3-none-any.whl (7.2 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, load_dotenv

   ---------------------------------------- 0/2 [python-dotenv]
   ---------------------------------------- 0/2 [python-dotenv]
   ---------------------------------------- 2/2 [load_dotenv]

Successfully installed load_dotenv-0.1.0 python-dotenv-1.1.1


In [2]:
!pip install nltk tiktoken parselmouth pydub psutil

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting parselmouth
  Downloading parselmouth-1.1.1.tar.gz (33 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting googleads==3.8.0 (from parselmouth)
  Downloading googleads-3.8.0.tar.gz (23 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'


  error: subprocess-exited-with-error
  
  √ó Getting requirements to build wheel did not run successfully.
  ‚îÇ exit code: 1
  ‚ï∞‚îÄ> [1 lines of output]
      error in googleads setup command: use_2to3 is invalid.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

√ó Getting requirements to build wheel did not run successfully.
‚îÇ exit code: 1
‚ï∞‚îÄ> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.


In [5]:
# Imports
import io
import os
import re
import joblib
import asyncio
import librosa
import tiktoken
import numpy as np
import parselmouth
from pydub import AudioSegment
from nltk.corpus import cmudict
from parselmouth.praat import call
from groq import Groq, AsyncClient
from groq.types.audio import Transcription

# Load environment file
from load_dotenv import load_dotenv
print(load_dotenv('.env.local'))

assert os.environ.get('GROQ_API_KEY'), "Groq API key not found in .env file, please set the key before starting this notebook"

# Global variables
client = AsyncClient()
encoder = tiktoken.get_encoding('gpt2')
fluency_model = joblib.load('fluency/models/weights/xgboost_model.pkl')

try:
    cmu_dict = cmudict.dict()
except:
    import nltk
    nltk.download('cmudict')
    cmu_dict = cmudict.dict()    

True


## Monitor CPU resources

In [6]:
import psutil
import time
import functools

def monitor_resources(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        process = psutil.Process(os.getpid())

        # Get memory and CPU before
        mem_before = process.memory_info().rss / (1024 ** 2)  # MB
        cpu_before = process.cpu_percent(interval=None)

        # Start time and CPU
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()

        # Get memory and CPU after
        mem_after = process.memory_info().rss / (1024 ** 2)  # MB
        cpu_after = process.cpu_percent(interval=0.1)

        # Get number of CPUs used
        cpu_affinity = process.cpu_affinity()
        
        print(f"Function: {func.__name__}")
        print(f"Execution Time: {end_time - start_time:.2f} sec")
        print(f"Memory Usage: {mem_after - mem_before:.2f} MB")
        print(f"CPU Usage: {cpu_after:.2f}%")
        print(f"CPU Cores Used: {cpu_affinity}")

        return result
    return wrapper

def limit_to_one_core(core_id=0):
    """
    Set process to run only on one CPU core (default: core 0).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            process = psutil.Process(os.getpid())

            # Store the current affinity to restore later
            original_affinity = process.cpu_affinity()
            
            try:
                # Set affinity to a single core
                process.cpu_affinity([core_id])
                print(f"Running {func.__name__} on CPU core {core_id}")
                return func(*args, **kwargs)
            finally:
                # Restore original affinity
                process.cpu_affinity(original_affinity)
        return wrapper
    return decorator

# # Limit NumPy, OpenBLAS etc to use only one CPU core
# os.environ["OMP_NUM_THREADS"] = "1"
# os.environ["OPENBLAS_NUM_THREADS"] = "1"
# os.environ["MKL_NUM_THREADS"] = "1"
# os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
# os.environ["NUMEXPR_NUM_THREADS"] = "1"

# Feature Extraction
Features extracted:
* ZCR
* Pitch
* Jitter
* Shimmer
* Harmonic-to-Noise ratio
* RMS
* MFCC
* DeltaMFCC
* SpeakingRate
* PauseCount
* PauseDuration

In [7]:
# Async Transcription
def split_audio_in_memory(audio_path, max_mb=24):
    audio = AudioSegment.from_wav(audio_path)
    bytes_per_ms = (audio.frame_rate * audio.frame_width * audio.channels) / 1000
    max_bytes = max_mb * 1024 * 1024
    chunk_duration_ms = int(max_bytes / bytes_per_ms)

    chunks = []
    for i in range(0, len(audio), chunk_duration_ms):
        chunk = audio[i:i+chunk_duration_ms]
        buffer = io.BytesIO()
        chunk.export(buffer, format="wav")
        buffer.seek(0)
        chunks.append((f"chunk_{i//chunk_duration_ms}.wav", buffer))

    return chunks

async def transcribe_chunk(filename, audio_buffer):
    return await client.audio.transcriptions.create(
        file=(filename, audio_buffer.read()),
        model="distil-whisper-large-v3-en",
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )


async def transcribe_audio(audio_path, client=client):
    """Transcribe an audio file without saving the chunks to disk"""
    chunks = split_audio_in_memory(audio_path)
    tasks = [transcribe_chunk(name, buffer) for name, buffer in chunks]
    all_transcripts = await asyncio.gather(*tasks)

    transcript_parts = []
    all_words = []
    total_duration = 0.0

    for chunk in all_transcripts:
        transcript_parts.append(chunk.text)
        all_words.extend(getattr(chunk, "words", []))
        total_duration += chunk.duration

    transcript = "".join(transcript_parts)
    
    return Transcription(text=transcript, words=all_words, duration=total_duration)


transcript = await transcribe_audio("samples/confident.wav")
transcript

Transcription(text=" Hi, my name is Adkarsh Malaya. I'm a student and right now what I'm trying to do is I'm trying to get a model and I'm trying to use it to transcribe some filler words. I am very scared. I don't know what will happen and I really really hope this works.", words=[{'word': 'Hi,', 'start': 0.78, 'end': 1.18}, {'word': 'my', 'start': 1.18, 'end': 1.52}, {'word': 'name', 'start': 1.52, 'end': 1.66}, {'word': 'is', 'start': 1.66, 'end': 1.82}, {'word': 'Adkarsh', 'start': 1.82, 'end': 2.04}, {'word': 'Malaya.', 'start': 2.04, 'end': 2.52}, {'word': "I'm", 'start': 2.52, 'end': 3.12}, {'word': 'a', 'start': 3.12, 'end': 3.26}, {'word': 'student', 'start': 3.26, 'end': 3.76}, {'word': 'and', 'start': 3.76, 'end': 4.22}, {'word': 'right', 'start': 4.22, 'end': 4.62}, {'word': 'now', 'start': 4.62, 'end': 4.84}, {'word': 'what', 'start': 4.84, 'end': 5.1}, {'word': "I'm", 'start': 5.1, 'end': 5.26}, {'word': 'trying', 'start': 5.26, 'end': 5.38}, {'word': 'to', 'start': 5.38,

In [8]:
# Helper functions for calculating syllables speaking rate
def get_word_syllable_count(word):
    word = word.lower().strip(".,?!;:")
    if word in cmu_dict:
        return len([p for p in cmu_dict[word][0] if p[-1].isdigit()])
    return max(1, len(re.findall(r'[aeiouy]+', word)))


def estimate_syllable_rate(transcript, duration_sec):
    words = transcript.split()
    total_syllables = sum(get_word_syllable_count(word) for word in words)
    return total_syllables / duration_sec if duration_sec > 0 else 0

In [9]:
# Extract Pitch statistics, Jitter, Shimmer, and HNR ratio through Parselmouth
@monitor_resources
def extract_parselmouth_features(data, sr):
    snd = parselmouth.Sound(values=data, sampling_frequency=sr)

    pitch_obj = snd.to_pitch()
    pitch_mean = call(pitch_obj, "Get mean", 0, 0, "Hertz")
    pitch_std = call(pitch_obj, "Get standard deviation", 0, 0, "Hertz")

    point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
    jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer = call([snd, point_process], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)

    return {
        "pitch_mean": pitch_mean,
        "pitch_std": pitch_std,
        "pitch_var": pitch_std**2,
        "jitter_local": jitter,
        "shimmer_local": shimmer,
        "hnr": hnr
    }

async def async_extract_parselmouth_features(data, sr, executor):
    return await asyncio.get_event_loop().run_in_executor(
        executor, extract_parselmouth_features, data, sr
    )

In [10]:
# Extract RMS Energy, ZCR, MFCC and Deltas using librosa
@monitor_resources
def extract_librosa_features(data, sr):
    zcr = np.mean(librosa.feature.zero_crossing_rate(data))
    
    rms = librosa.feature.rms(y=data)[0]
    rms_mean = np.mean(rms)
    rms_std = np.std(rms)
    rms_var = np.var(rms)

    mfcc = librosa.feature.mfcc(y=data, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    mfcc_mean = np.mean(mfcc)
    delta_mean = np.mean(delta)

    return {
        "zcr": zcr,
        "rms_mean": rms_mean,
        "rms_std": rms_std,
        "rms_var": rms_var,
        "mfcc": mfcc.mean(axis=1),
        "delta_mfcc": delta.mean(axis=1),
        "mfcc_mean": mfcc_mean,
        "delta_mean": delta_mean
    }
    
async def async_extract_librosa_features(data, sr, executor):
    return await asyncio.get_event_loop().run_in_executor(
        executor, extract_librosa_features, data, sr
    )

In [None]:
def extract_features_from_wave(data, sr):
    return {
        **extract_librosa_features(data, sr),
        **extract_parselmouth_features(data, sr)
    }
    
async def async_extract_features_from_wave(data, sr, executor):
    # Start both tasks concurrently
    librosa_task = asyncio.create_task(async_extract_librosa_features(data, sr, executor))
    parselmouth_task = asyncio.create_task(async_extract_parselmouth_features(data, sr, executor))

    # Wait for both
    librosa_feats, parselmouth_feats = await asyncio.gather(librosa_task, parselmouth_task)

    return {**librosa_feats, **parselmouth_feats}


@monitor_resources
async def extract_features(audio_path, baseline_duration: float = 0.0, fluency_model=fluency_model):
    # -------------- Load the audio file --------------
    data, sr = librosa.load(audio_path)
    assert len(data) != 0, "Your audio file appears to contain no content. Please input a valid file"
    
    
    # -------------- Get transcription and check minimum duration --------------
    transcription_json = await transcribe_audio(audio_path)
    duration_sec = transcription_json.duration # type: ignore
    baseline_duration = baseline_duration or max(10.0, duration_sec * 0.05)

    assert duration_sec != 0, "File duration appears to be 0 after transcription?"
    
    
    # -------------- Get features of baseline and full wave --------------
    baseline_data = data[:min(len(data), int(sr * baseline_duration))]
    baseline_feats = extract_features_from_wave(baseline_data, sr)
    full_feats = extract_features_from_wave(data, sr)


    # -------------- Get fluency ratings --------------
    features = ['zcr', 'pitch_mean', 'pitch_std', 'rms_mean', 'rms_std', 'rms_var', 'mfcc_mean', 'delta_mean']
    rating_map = ['Low', 'Medium', 'High']
        
    baseline_fluency_features = np.array([baseline_feats[key] for key in baseline_feats if key in features])
    full_fluency_features = np.array([full_feats[key] for key in full_feats if key in features])

    res = fluency_model.predict(np.vstack((baseline_fluency_features, full_fluency_features)))
    baseline_fluency = rating_map[res[0].argmax()]
    full_fluency = rating_map[res[1].argmax()]

    relative_feats = {}
    for key in full_feats:
        if key not in ['mfcc', 'delta_mfcc']:
            base = baseline_feats.get(key, 0.0)
            full = full_feats[key]
            relative_feats[f'{key}_delta'] = full - base
    
    
    # -------------- Get speaking rates --------------
    # Assuming the transcript has come by now

    # Baseline speaking rate
    baseline_transcript = [word_segment['word'] for word_segment in transcription_json.words if word_segment['start'] <= baseline_duration]  # type: ignore
    baseline_word_count = len(baseline_transcript)
    baseline_transcript = " ".join(baseline_transcript)
    baseline_speaking_rate = baseline_word_count / baseline_duration
    baseline_syllables_rate = estimate_syllable_rate(baseline_transcript, baseline_duration)

    # Full data speaking rate
    transcript = transcription_json.text
    word_count = len(transcript.split())
    speaking_rate = word_count / duration_sec
    syllables_rate = estimate_syllable_rate(transcript, duration_sec)
        
    
    # -------------- Pause detection --------------
    intervals = librosa.effects.split(data, top_db=30)
    pauses = [(intervals[i][0] - intervals[i - 1][1]) / sr
              for i in range(1, len(intervals))
              if (intervals[i][0] - intervals[i - 1][1]) / sr > 1.0]
    
    long_pause_count = len(pauses)
    long_pause_total = sum(pauses)

    return {
        "transcript": transcript,
        "duration": duration_sec,
        "baseline_duration": baseline_duration,
        "speaking_rate": speaking_rate,
        "syllables_rate": syllables_rate,
        "baseline_speaking_rate": baseline_speaking_rate,
        "baseline_syllables_rate": baseline_syllables_rate,
        "long_pause_count": long_pause_count,
        "long_pause_duration": long_pause_total,
        "fluency_rating": full_fluency,
        "baseline_fluency_rating": baseline_fluency,
        **full_feats,
        **{f'baseline_{k}': v for k, v in baseline_feats.items()},
        **relative_feats,
    }


In [12]:
features = await extract_features('samples/tim-urban.wav')

Function: extract_features
Execution Time: 0.00 sec
Memory Usage: 0.00 MB
CPU Usage: 14.20%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_librosa_features
Execution Time: 1.69 sec
Memory Usage: 14.68 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_parselmouth_features
Execution Time: 0.35 sec
Memory Usage: 4.38 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_librosa_features
Execution Time: 1.27 sec
Memory Usage: 3.27 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_parselmouth_features
Execution Time: 5.06 sec
Memory Usage: -5.28 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]


In [13]:
features

{'transcript': " Reviewer.pxs. So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know, you get started maybe a little slowly, but you get enough done in the first week that with some heavier days later on, everything gets done and things stay civil. And I would want to do that like that. That would be the plan. I would have it all ready to go, but then actually the paper would come along, and then I would kind of do this. And that would happen every single paper. But then came my 90-page senior thesis, a paper you're supposed to spend a year on. I knew for a paper like that, my normal workflow was not an option, it was way too big a project. So I planned things out and I decided I kind of had to go something like this. This is how the year would go. So I'd start off light, and I'd bump it up in the middle months. And then at the end, I would kick it u

# Send to GPT for feedback

In [None]:
def get_prompt(audio_features, posture_features = None):
    prompt = f"""
You are a professional voice coach and delivery analyst tasked with evaluating a speaker's performance based on a variety of acoustic and prosodic features. Below is a detailed snapshot of the speaker‚Äôs delivery ‚Äî both baseline and full-clip ‚Äî along with their changes. Use this to deliver personalized, context-aware feedback.

## NOTE:
- The **first {int(audio_features['baseline_duration'])} seconds** of the speech are used to define the speaker's personal baseline.
- All relative metrics (e.g., deltas, ratios) are calculated with respect to this baseline.
- Interpret *changes* from baseline as signs of adaptation or stress ‚Äî not necessarily flaws.
- **Avoid quoting any raw values** in your response. Use intuitive, narrative insights only.
- An 86% accurate ML model was used to rate the fluency of the speech, and that rating has also been provided to you.

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üìù TRANSCRIPT
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
<transcript>
{audio_features['transcript']}
</transcript>

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üìè BASELINE METRICS (First {int(audio_features['baseline_duration'])} seconds)
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ

## Fluency & Tempo
- Fluency rating: {audio_features['baseline_fluency_rating']}
- Words/sec: {audio_features['baseline_speaking_rate']:.2f}
- Syllables/sec: {audio_features['baseline_syllables_rate']:.2f}

## Voice Modulation
- Pitch (Mean / Std / Var): {audio_features['baseline_pitch_mean']:.2f} / {audio_features['baseline_pitch_std']:.2f} / {audio_features['baseline_pitch_var']:.2f}
- Jitter (local): {audio_features['baseline_jitter_local']:.3f}
- Shimmer (local): {audio_features['baseline_shimmer_local']:.3f}
- Harmonic-to-Noise Ratio (HNR): {audio_features['baseline_hnr']:.2f}

## Energy & Dynamics
- RMS Energy (Mean / Std / Var): {audio_features['baseline_rms_mean']:.2f} / {audio_features['baseline_rms_std']:.2f} / {audio_features['baseline_rms_var']:.2f}
- Zero Crossing Rate: {audio_features['baseline_zcr']:.3f}

## Timbre & Articulation
- MFCC Mean: {audio_features['baseline_mfcc_mean']:.2f}
- Delta MFCC Mean: {audio_features['baseline_delta_mean']:.6f}

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üìä FULL CLIP METRICS
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ

## Fluency & Tempo
- Fluency rating: {audio_features['fluency_rating']}
- Words/sec: {audio_features['speaking_rate']:.2f}
- Syllables/sec: {audio_features['syllables_rate']:.2f}
- Long pauses (>1s): {audio_features['long_pause_count']}
- Total pause duration: {audio_features['long_pause_duration']:.2f} sec

## Voice Modulation
- Pitch (Mean / Std / Var): {audio_features['pitch_mean']:.2f} / {audio_features['pitch_std']:.2f} / {audio_features['pitch_var']:.2f}
- Jitter (local): {audio_features['jitter_local']:.3f}
- Shimmer (local): {audio_features['shimmer_local']:.3f}
- Harmonic-to-Noise Ratio (HNR): {audio_features['hnr']:.2f}

## Energy & Dynamics
- RMS Energy (Mean / Std / Var): {audio_features['rms_mean']:.2f} / {audio_features['rms_std']:.2f} / {audio_features['rms_var']:.2f}
- Zero Crossing Rate: {audio_features['zcr']:.3f}

## Timbre & Articulation
- MFCC Mean: {audio_features['mfcc_mean']:.2f}
- Delta MFCC Mean: {audio_features['delta_mean']:.6f}

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üìà RELATIVE CHANGES FROM BASELINE
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ

## Tempo & Fluency
- Speaking rate ratio: {audio_features['speaking_rate'] / audio_features['baseline_speaking_rate']:.2f}
- Syllable rate ratio: {audio_features['syllables_rate'] / audio_features['baseline_syllables_rate']:.2f}

## Modulation
- Pitch std delta: {audio_features['pitch_std_delta']:+.2f}
- Jitter delta: {audio_features['jitter_local_delta']:+.3f}
- Shimmer delta: {audio_features['shimmer_local_delta']:+.3f}
- HNR delta: {audio_features['hnr_delta']:+.2f}

## Energy
- RMS mean delta: {audio_features['rms_mean_delta']:+.2f}
- RMS std delta: {audio_features['rms_std_delta']:+.2f}
- ZCR delta: {audio_features['zcr_delta']:+.3f}

## Timbre
- MFCC mean delta: {audio_features['mfcc_mean_delta']:+.2f}
- Delta MFCC mean delta: {audio_features['delta_mean_delta']:+.6f}

üß† **Interpretation Tips** (for internal use only):
- A **negative pitch_std_delta** might suggest monotony or nervousness; a positive value implies expressive modulation.
- **Decreased RMS or HNR** may imply loss of vocal energy or confidence.
- **Increased jitter/shimmer** may reflect stress or instability.
- A **low syllable rate ratio** suggests slowing down relative to their natural pace, which may imply hesitation or deliberate pacing.
- **ZCR changes** may reflect articulation style or clarity.

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üß≠ INSTRUCTIONS FOR FEEDBACK GENERATION
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ

Using the data above, write a highly personalized and supportive **narrative-style voice coaching paragraph**. Do not cite any specific numerical values. Your tone should be professional, encouraging, and practical.

Structure your feedback in **three sections**:

1. ‚úÖ **What the speaker did well** ‚Äî Highlight strengths or improvements in vocal control, energy, fluency, or confidence.
2. üõ†Ô∏è **What they can improve** ‚Äî Tactfully mention areas that deviated from their baseline and might affect clarity or delivery.
3. üìä **Confidence & fluency rating** ‚Äî Conclude with your overall impression of their vocal confidence and fluency (e.g., low, moderate, high), based on relative metrics.

DO NOT compare to average speakers. DO NOT be generic. Focus only on deviations from this speaker's own baseline and the emotional/functional impact of those changes.
"""
    return prompt

In [None]:
from schema import AudioOnlyFeedback, FeedbackSchema

def get_prompt_with_schema(audio_features, posture_features = None, schema = None):
    prompt = f"""
You are a professional voice coach and delivery analyst tasked with evaluating a speaker's performance based on a variety of acoustic and prosodic features. Below is a detailed snapshot of the speaker‚Äôs delivery ‚Äî both baseline and full-clip ‚Äî along with their changes. Use this to deliver personalized, context-aware feedback.

## NOTE:
- The **first {int(audio_features['baseline_duration'])} seconds** of the speech are used to define the speaker's personal baseline.
- All relative metrics (e.g., deltas, ratios) are calculated with respect to this baseline.
- Interpret *changes* from baseline as signs of adaptation or stress ‚Äî not necessarily flaws.
- **Avoid quoting any raw values** in your response. Use intuitive, narrative insights only.
- An 86% accurate ML model was used to rate the fluency of the speech, and that rating has also been provided to you.

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üìù TRANSCRIPT
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
<transcript>
{audio_features['transcript']}
</transcript>

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üìè BASELINE METRICS (First {int(audio_features['baseline_duration'])} seconds)
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ

## Fluency & Tempo
- Fluency rating: {audio_features['baseline_fluency_rating']}
- Words/sec: {audio_features['baseline_speaking_rate']:.2f}
- Syllables/sec: {audio_features['baseline_syllables_rate']:.2f}

## Voice Modulation
- Pitch (Mean / Std / Var): {audio_features['baseline_pitch_mean']:.2f} / {audio_features['baseline_pitch_std']:.2f} / {audio_features['baseline_pitch_var']:.2f}
- Jitter (local): {audio_features['baseline_jitter_local']:.3f}
- Shimmer (local): {audio_features['baseline_shimmer_local']:.3f}
- Harmonic-to-Noise Ratio (HNR): {audio_features['baseline_hnr']:.2f}

## Energy & Dynamics
- RMS Energy (Mean / Std / Var): {audio_features['baseline_rms_mean']:.2f} / {audio_features['baseline_rms_std']:.2f} / {audio_features['baseline_rms_var']:.2f}
- Zero Crossing Rate: {audio_features['baseline_zcr']:.3f}

## Timbre & Articulation
- MFCC Mean: {audio_features['baseline_mfcc_mean']:.2f}
- Delta MFCC Mean: {audio_features['baseline_delta_mean']:.6f}

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üìä FULL CLIP METRICS
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ

## Fluency & Tempo
- Fluency rating: {audio_features['fluency_rating']}
- Words/sec: {audio_features['speaking_rate']:.2f}
- Syllables/sec: {audio_features['syllables_rate']:.2f}
- Long pauses (>1s): {audio_features['long_pause_count']}
- Total pause duration: {audio_features['long_pause_duration']:.2f} sec

## Voice Modulation
- Pitch (Mean / Std / Var): {audio_features['pitch_mean']:.2f} / {audio_features['pitch_std']:.2f} / {audio_features['pitch_var']:.2f}
- Jitter (local): {audio_features['jitter_local']:.3f}
- Shimmer (local): {audio_features['shimmer_local']:.3f}
- Harmonic-to-Noise Ratio (HNR): {audio_features['hnr']:.2f}

## Energy & Dynamics
- RMS Energy (Mean / Std / Var): {audio_features['rms_mean']:.2f} / {audio_features['rms_std']:.2f} / {audio_features['rms_var']:.2f}
- Zero Crossing Rate: {audio_features['zcr']:.3f}

## Timbre & Articulation
- MFCC Mean: {audio_features['mfcc_mean']:.2f}
- Delta MFCC Mean: {audio_features['delta_mean']:.6f}

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üìà RELATIVE CHANGES FROM BASELINE
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ

## Tempo & Fluency
- Speaking rate ratio: {audio_features['speaking_rate'] / audio_features['baseline_speaking_rate']:.2f}
- Syllable rate ratio: {audio_features['syllables_rate'] / audio_features['baseline_syllables_rate']:.2f}

## Modulation
- Pitch std delta: {audio_features['pitch_std_delta']:+.2f}
- Jitter delta: {audio_features['jitter_local_delta']:+.3f}
- Shimmer delta: {audio_features['shimmer_local_delta']:+.3f}
- HNR delta: {audio_features['hnr_delta']:+.2f}

## Energy
- RMS mean delta: {audio_features['rms_mean_delta']:+.2f}
- RMS std delta: {audio_features['rms_std_delta']:+.2f}
- ZCR delta: {audio_features['zcr_delta']:+.3f}

## Timbre
- MFCC mean delta: {audio_features['mfcc_mean_delta']:+.2f}
- Delta MFCC mean delta: {audio_features['delta_mean_delta']:+.6f}

üß† **Interpretation Tips** (for internal use only):
- A **negative pitch_std_delta** might suggest monotony or nervousness; a positive value implies expressive modulation.
- **Decreased RMS or HNR** may imply loss of vocal energy or confidence.
- **Increased jitter/shimmer** may reflect stress or instability.
- A **low syllable rate ratio** suggests slowing down relative to their natural pace, which may imply hesitation or deliberate pacing.
- **ZCR changes** may reflect articulation style or clarity.

‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ
üß≠ INSTRUCTIONS FOR FEEDBACK GENERATION
‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ

Using the data above, write a highly personalized and supportive **narrative-style voice coaching paragraph**. Do not cite any specific numerical values. Your tone should be professional, encouraging, and practical.

Structure your feedback in **three sections**:

1. ‚úÖ **What the speaker did well** ‚Äî Highlight strengths or improvements in vocal control, energy, fluency, or confidence.
2. üõ†Ô∏è **What they can improve** ‚Äî Tactfully mention areas that deviated from their baseline and might affect clarity or delivery.
3. üìä **Confidence & fluency rating** ‚Äî Conclude with your overall impression of their vocal confidence and fluency (e.g., low, moderate, high), based on relative metrics.

DO NOT compare to average speakers. DO NOT be generic. Focus only on deviations from this speaker's own baseline and the emotional/functional impact of those changes.
"""
    return prompt


def generate_feedback(audio_features, posture_features = None, response_schema = None, llm_model : str = "llama-3.3-70b-versatile"):
    prompt = get_prompt(audio_features)

    client = Groq()
    completion = client.chat.completions.create(
        model=llm_model,
        messages=[
        {
            "role": "user",
            "content": prompt
        }
        ],
        temperature=0.5,
        max_completion_tokens=32768,
        top_p=1,
        response_format={"type": "json_object"},
        stream=False,
        stop=None,
    )

    return completion.choices[0].message

# Run Pipeline

In [15]:
def get_n_tokens(features): return len(encoder.encode(get_prompt(features)))

In [22]:
# An unconfident speech
path = "samples/unconfident.wav"
features = await extract_features(path)
feedback = generate_feedback(features)

get_n_tokens(features)

Function: extract_features
Execution Time: 0.00 sec
Memory Usage: 0.00 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_librosa_features
Execution Time: 0.04 sec
Memory Usage: 1.54 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_parselmouth_features
Execution Time: 0.17 sec
Memory Usage: 1.82 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_librosa_features
Execution Time: 0.06 sec
Memory Usage: 0.59 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_parselmouth_features
Execution Time: 0.28 sec
Memory Usage: 0.77 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]


1438

In [23]:
import IPython.display as ipd
ipd.Markdown(feedback.content)

As you began speaking, it was clear that you took a moment to settle into your pace, and once you did, you showed a notable increase in your speaking rate, which suggests that you were able to find a rhythm that worked for you. Your voice modulation also became slightly more consistent, which is a great sign of adapting to the speaking environment. Additionally, your articulation showed some positive shifts, indicating an effort to enunciate clearly, which is commendable, especially given your self-described nervousness about public speaking.

However, there were moments where your pitch became slightly more monotone, and your vocal energy, while consistent, could benefit from a bit more variation to add emphasis and keep the listener engaged. It's also worth noting that you had a brief pause, which, while not uncommon, might be an area to work on for smoother transitions between thoughts. Furthermore, your syllable rate, while increased, still reflects a careful and perhaps slightly hesitant pace, suggesting that you might be focusing intently on your words, which is understandable but could be balanced with a more natural flow.

Overall, considering your fluency and confidence, I would say that you demonstrated a moderate level of vocal confidence and fluency. Your ability to adapt and find a comfortable pace is a significant strength, and with some practice on varying your pitch and energy, as well as working on smoother transitions, you could see noticeable improvements in your delivery. Remember, the key is not to compare yourself to others but to focus on your own growth and comfort with speaking, and it's clear that you have a good foundation to build upon.

In [18]:
# A confident speech
path = "samples/confident.wav"
features = await extract_features(path)
feedback = generate_feedback(features)

get_n_tokens(features)

Function: extract_features
Execution Time: 0.00 sec
Memory Usage: 0.00 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_librosa_features
Execution Time: 0.03 sec
Memory Usage: 0.03 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_parselmouth_features
Execution Time: 0.10 sec
Memory Usage: 0.03 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_librosa_features
Execution Time: 0.04 sec
Memory Usage: 0.00 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_parselmouth_features
Execution Time: 0.11 sec
Memory Usage: 0.96 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]


1456

In [19]:
features

{'transcript': " Hi, my name is Adkarsh Malaya. I'm a student and right now what I'm trying to do is I'm trying to get a model and I'm trying to use it to transcribe some filler words. I am very scared. I don't know what will happen and I really really hope this works.",
 'duration': 15.72,
 'baseline_duration': 10.0,
 'speaking_rate': 3.3078880407124682,
 'syllables_rate': 4.134860050890585,
 'baseline_speaking_rate': 3.3,
 'baseline_syllables_rate': 4.1,
 'long_pause_count': 0,
 'long_pause_duration': 0,
 'fluency_rating': 'Low',
 'baseline_fluency_rating': 'Low',
 'zcr': 0.08035611403023599,
 'rms_mean': 0.0131712835,
 'rms_std': 0.010488089,
 'rms_var': 0.00011000002,
 'mfcc': array([-467.24243   ,  134.46071   ,  -29.663124  ,   35.92738   ,
           7.6379876 ,    9.713081  ,    9.715767  ,    2.3445828 ,
         -13.212228  ,    0.7006626 ,   -0.62327105,  -11.099308  ,
          -2.377678  ], dtype=float32),
 'delta_mfcc': array([ 0.08744686,  0.09395339,  0.03829468,  0.007

In [20]:
ipd.Markdown(feedback.content)

As you began your speech, it was clear that you had a good foundation to build upon, with a consistent pace that allowed your words to flow smoothly. Your voice modulation showed some expressive qualities, which is a great strength to leverage in your delivery. Notably, your ability to maintain a relatively steady energy level throughout your speech is commendable, as it suggests a good level of comfort with your material. 

However, there were moments where your pitch variation became slightly more pronounced, which may indicate a touch of nervousness or an attempt to add emphasis to certain points. Additionally, your articulation and timbre underwent some subtle shifts, potentially affecting the clarity of your message. It's also worth exploring how you can harness your natural syllable rate to enhance the overall flow of your speech, as there were instances where your pace was very slightly quicker than your baseline, which might be an adaptation to convey your ideas more urgently.

Overall, your vocal confidence and fluency came across as moderate, considering the slight deviations from your baseline in pitch modulation and articulation. The fluency rating from the model also suggests that there's room for improvement in terms of smoothness and natural flow. Nonetheless, your speech demonstrated a clear and sincere attempt to communicate your thoughts, and with practice and attention to these areas, you have the potential to develop a more expressive and engaging delivery style.

In [None]:
tim_urban_path = "samples/tim-urban.wav"
tim_urban_features = await extract_features(tim_urban_path)
tim_urban_feedback = generate_feedback(tim_urban_features)

Function: extract_features
Execution Time: 0.00 sec
Memory Usage: 0.00 MB
CPU Usage: 14.30%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_librosa_features
Execution Time: 0.09 sec
Memory Usage: 1.18 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_parselmouth_features
Execution Time: 0.42 sec
Memory Usage: -11.55 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_librosa_features
Execution Time: 2.23 sec
Memory Usage: 2.18 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Function: extract_parselmouth_features
Execution Time: 7.44 sec
Memory Usage: 1.32 MB
CPU Usage: 0.00%
CPU Cores Used: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]


In [27]:
get_n_tokens(tim_urban_features)

4172

In [28]:
tim_urban_features

{'transcript': " Reviewer.pxs. So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know, you get started maybe a little slowly, but you get enough done in the first week that with some heavier days later on, everything gets done and things stay civil. And I would want to do that like that. That would be the plan. I would have it all ready to go, but then actually the paper would come along, and then I would kind of do this. And that would happen every single paper. But then came my 90-page senior thesis, a paper you're supposed to spend a year on. I knew for a paper like that, my normal workflow was not an option, it was way too big a project. So I planned things out and I decided I kind of had to go something like this. This is how the year would go. So I'd start off light, and I'd bump it up in the middle months. And then at the end, I would kick it u

In [29]:
import IPython.display as ipd
ipd.Markdown(tim_urban_feedback.content)

As I listened to your speech, I was struck by your ability to maintain a sense of humor and lightheartedness, even when discussing a topic as personal as procrastination. Your vocal control showed moments of strength, particularly in your ability to convey a sense of irony and self-deprecation, which helped to engage your audience. Your energy levels remained relatively consistent, which is commendable given the length of your speech. 

However, there were moments where your pace slowed down significantly, and your vocal modulation became less varied. This could be an indication of hesitation or a desire to emphasize certain points, but it also led to a sense of monotony in some sections. Additionally, your articulation and clarity were affected by a decrease in your syllable rate, which may have made it slightly more challenging for your audience to follow your train of thought. It's possible that you were deliberately pacing yourself to ensure your message was conveyed effectively, but being aware of these changes can help you strike a better balance between emphasis and flow.

Overall, I would rate your vocal confidence and fluency as moderate. While you demonstrated a good sense of humor and audience awareness, there were moments where your delivery was affected by changes in your tempo and modulation. With some attention to these areas, you can work on maintaining a more consistent pace and varied tone, which will help to enhance your overall confidence and fluency. The fact that you were able to engage your audience and convey a meaningful message despite these challenges is a testament to your strengths as a speaker, and with some refinement, you can continue to grow and improve in your public speaking abilities.