# 🧠 Cycle 2 Week 6 – Part 2  
## Assignment: “What Are We Really Hearing?” – Voice, Sound, and Cultural Signal Analysis in Short-Form Video

## 📘 Assignment Narrative

Short-form videos aren’t just visual. They’re compressed cultural objects, encoding emotion, performance, and group belonging in milliseconds.

Where Part 1 explored what we’re really seeing, this notebook focuses on the audio layer — what we hear, and what that sound implies. Because often, what something sounds like carries just as much weight as what’s actually said.

This week, our exploratory notebook builds toward a broader goal: to interpret videos the way humans do, using not just the spoken content but also tone, cadence, slang, musical choice, and delivery style to surface signals of belonging, trend alignment, and identity performance.

Content is also interpreted differently by different audiences — a phenomenon sometimes used intentionally to share in-group jokes, evade moderation, or pass cultural signals under the radar.

## 🔎 Why This Matters

Modern tools for content recommendation, indexing, and moderation can no longer rely on explicit tags alone.

Instead, they need to understand:

- How slang travels from fringe groups to the mainstream  
- How cadence and tone give away social or emotional intent  
- How music or speaking rhythm might index a subculture  
- How communication fingerprinting (like yours!) works — not just in *what’s said*, but *how*

If this sounds like we're building a classifier, we are — but also, we’re building a **listener**:
tuned to the invisible layer of cultural and emotional structure embedded in milliseconds of sound.

## 🔧 What We Are Doing

In this assignment, we’re building a pipeline that:
- Extracts the audio track from a short-form video  
- Transcribes speech using an ASR transformer (e.g., Whisper)  
- Analyzes musical structure (tempo, beat, tone)  
- Detects emerging slang or trending phrases  
- Measures emotional energy in delivery  
- Builds a communication fingerprint (pacing, rhythm, lexical density)  
- Interprets voice as a cultural and emotional signal

## 🎯 Learning Objectives

By completing this notebook, you will:

✅ Learn how to extract and process audio for machine interpretation  
✅ Understand the acoustic correlates of emotion and rhythm  
✅ Practice transcription and keyword extraction from short, noisy clips  
✅ Explore how identity and group signals are encoded in voice cadence  
✅ Build a pipeline that begins to interpret sound as **social signal**, not just information

## 🟦 1. Setup & Dependencies

We begin by configuring our environment and installing the necessary libraries:

- `whisper`: OpenAI's transformer-based ASR model, pretrained on multilingual audio-text pairs
- `librosa`: Used for audio feature extraction (MFCCs, tempo, pitch)
- `moviepy`: For slicing video files and extracting audio tracks
- `nltk`: To tokenize and analyze text content from transcripts
- `torch`: Required as the backend for running Whisper and tensor computations

This setup prepares the computational tools we'll use to simulate human-like interpretation of sound, rhythm, and cultural signals.

In [1]:
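# Pin numpy<2: packages compiled against NumPy 1.x fail to import under NumPy 2.
# Pin moviepy<2: the `moviepy.editor` import used below was removed in MoviePy 2.0.
!pip install -q openai-whisper librosa "moviepy<2" nltk torch "numpy<2"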

import whisper
import librosa
import moviepy.editor as mp
import nltk
import torch

# Confirm GPU availability
print("CUDA available:", torch.cuda.is_available())


## 🟦 2. Audio Track Extraction

Before we can analyze rhythm, emotion, or lexical content, we must isolate the audio signal from the video file.

We use `moviepy` to extract the audio stream and save it as a `.wav` file — an uncompressed format suitable for downstream processing:

- `.wav` files preserve **raw signal fidelity**, making them ideal for MFCC and tempo analysis  
- `pcm_s16le` is standard **16-bit linear PCM**: uncompressed, lossless, and widely supported  
- This step simulates the first action of a human listener: tuning in to the sound, separate from the visuals

Once saved, this audio can be passed to transcription, rhythm analysis, and acoustic modeling pipelines.

In [None]:
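from moviepy.editor import VideoFileClip

def extract_audio(video_path, output_path='audio.wav'):
    """Write the video's audio track to a 16-bit PCM WAV file and return its path."""
    # The context manager releases the underlying file handles when done.
    with VideoFileClip(video_path) as video:
        if video.audio is None:
            raise ValueError(f"No audio track found in {video_path}")
        video.audio.write_audiofile(output_path, codec='pcm_s16le')
    return output_path

# Example usage (replace with the path to your own clip)
audio_path = extract_audio("your_video.mp4")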
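
As a quick sanity check on the extraction step, the sketch below reads a WAV header with Python's standard `wave` module and reports channel count, bit depth, sample rate, and duration. The helper name `describe_wav` and the synthetic demo file are assumptions made for illustration; they are not part of the notebook's pipeline.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Return basic format facts for a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "bits_per_sample": wav_file.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "duration_sec": frames / rate if rate else 0.0,
        }

# Demo on a synthetic file: half a second of 16-bit mono silence at 16 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)                   # 2 bytes per sample = 16 bits
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 8000)   # 8000 frames = 0.5 s

info = describe_wav(demo_path)
print(info)
```

Calling `describe_wav(audio_path)` on the extracted file should report 16 bits per sample if the `pcm_s16le` codec was applied as intended.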

### 🧮 Signal Preprocessing

Before extracting MFCCs, we convert the audio to mono, resample it to a fixed rate, and normalize its level.

> To ensure consistent feature extraction, audio signals must be standardized. Mono conversion removes stereo imbalances; a fixed sample rate ensures consistent frequency resolution; normalization reduces loudness variance which otherwise biases spectral measurements.
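
The three standardization steps described above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: the function name `standardize_audio`, the `(n_samples, n_channels)` input layout, the 16 kHz target rate, and the use of plain linear interpolation for resampling are all choices made here, not the notebook's method.

```python
import numpy as np

def standardize_audio(samples, sr, target_sr=16000):
    """Downmix to mono, resample by linear interpolation, and peak-normalize."""
    x = np.asarray(samples, dtype=np.float64)
    if x.ndim == 2:                        # assumed layout: (n_samples, n_channels)
        x = x.mean(axis=1)                 # average the channels into one
    if sr != target_sr:
        n_out = int(round(len(x) * target_sr / sr))
        old_t = np.arange(len(x)) / sr
        new_t = np.arange(n_out) / target_sr
        x = np.interp(new_t, old_t, x)     # crude: no anti-aliasing filter
    peak = np.max(np.abs(x)) if len(x) else 0.0
    if peak > 0:
        x = x / peak                       # scale so the loudest sample is +/-1
    return x, target_sr

# Synthetic stereo tone with unequal channel levels, sampled at 44.1 kHz.
t = np.arange(44100) / 44100
tone = np.sin(2 * np.pi * 220 * t)
stereo = np.stack([0.2 * tone, 0.1 * tone], axis=1)

mono, new_sr = standardize_audio(stereo, sr=44100)
print(mono.shape, new_sr, float(np.max(np.abs(mono))))
```

In practice, `librosa.load(path, sr=16000, mono=True)` performs the downmix and a properly filtered resample in one call, so the interpolation here is only meant to make the steps visible.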

## 🟦 3. Transcription with Whisper

Once the audio has been extracted, the next step is to understand *what is being said* — not just how it sounds. For this, we use OpenAI’s `whisper`, a state-of-the-art automatic speech recognition (ASR) model.

Whisper is built using a Transformer encoder-decoder architecture:

- The audio signal is first converted into a **log-mel spectrogram**, a time-frequency representation of the signal
- An encoder processes this spectrogram to create contextual embeddings
- A decoder then generates text tokens autoregressively, attending to both audio context and previously decoded tokens

This model is robust to accents, background noise, and informal or fast-paced speech — making it ideal for analyzing short-form, often messy social video content.

Transcription provides both the **verbal content** and (optionally) the **timestamps**, which can later help align tone, tempo, and emotion with specific phrases.

In [None]:
import whisper

# Load the Whisper model (you can choose 'tiny', 'base', 'small', etc.)
model = whisper.load_model("base")

# Transcribe the audio file
result = model.transcribe(audio_path)

# Print the raw transcription
print(result["text"])

### 🧮 Transformer-Based ASR

How Whisper turns audio into text: model structure, decoder logic, and tokenization.

> Whisper uses a multi-lingual transformer architecture. It converts raw audio into mel-spectrograms, which are then interpreted via attention blocks and decoded autoregressively to generate text, optionally with timestamps.
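
To make the timestamp output concrete, the sketch below formats the `segments` list that `model.transcribe` returns alongside `text`, where each segment carries `start`, `end`, and `text` fields. The helper name `format_segments` is hypothetical, and `mock_result` is a hand-written stand-in for a real transcription, not model output.

```python
def format_segments(result):
    """Render Whisper-style segments as '[start - end] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(
            f"[{seg['start']:6.2f}s - {seg['end']:6.2f}s] {seg['text'].strip()}"
        )
    return lines

# Hand-written stand-in for a transcription result (not real model output).
mock_result = {
    "text": " okay so hear me out this is lowkey genius",
    "segments": [
        {"start": 0.0, "end": 1.4, "text": " okay so hear me out"},
        {"start": 2.1, "end": 3.6, "text": " this is lowkey genius"},
    ],
}

for line in format_segments(mock_result):
    print(line)
```

Passing the notebook's own `result` instead of `mock_result` gives a time-aligned view of the transcript that later sections can match against pitch and energy.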

## 🟦 4. Acoustic Feature Analysis

To move beyond words and into *tone*, we extract acoustic features that shape our emotional perception of speech.

Using `librosa`, we calculate:

- **MFCCs** (Mel-Frequency Cepstral Coefficients): Represent voice timbre, correlating with emotional texture (e.g., warmth, tension)  
- **Tempo**: Overall pacing or beat frequency, often aligned with energy or urgency  
- **Pitch Contour**: Variation in fundamental frequency over time, shaping expressiveness and intonation

These low-level features mirror how humans infer mood or intent from sound — whether someone is rushed, relaxed, angry, or cheerful.

In [None]:
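import librosa
import numpy as np

# Load the audio (librosa resamples to 22,050 Hz mono by default)
y, sr = librosa.load(audio_path)

# Extract MFCCs: 13 coefficients per frame
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Estimate tempo (beats per minute). Newer librosa versions return an array,
# so coerce to a plain float for the comparisons and formatting used later.
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
tempo = float(np.atleast_1d(tempo)[0])

# Estimate pitch using the YIN algorithm; pass sr so frequencies are scaled correctly
pitch = librosa.yin(y, fmin=50, fmax=300, sr=sr)

# Output summaries
print("MFCC shape:", mfccs.shape)
print(f"Estimated Tempo (BPM): {tempo:.1f}")
print("Pitch Contour (sample):", pitch[:10])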

### 🧮 Spectral Features as Mood Markers

Grounded in **speech prosody theory**, these features have real psychological correlates:

- **MFCCs** capture fine-grained variations in resonance and vocal tract shape — markers of emotional coloring  
- **Tempo** often reflects arousal (fast = excited or anxious, slow = calm or sad)  
- **Pitch movement** signals affective variation (monotone vs melodic speech)

Together, these help machines listen not just to *what* is said, but *how* it’s said.
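
For reference, one common textbook formulation of the two steps behind MFCCs is given below. It is a sketch of the general technique rather than a description of `librosa`'s exact defaults, which use a different mel-scale variant unless configured otherwise. The symbols $ S_j $, $ J $, and $ K $ are introduced here for illustration.

A frequency $ f $ in Hz is first mapped onto the perceptual mel scale:

$$
m(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)
$$

Then, for each frame, the log energies $ \log S_j $ in $ J $ mel-spaced filter bands are decorrelated with a discrete cosine transform, and the first $ K $ coefficients are kept:

$$
c_k = \sum_{j=1}^{J} \log(S_j) \, \cos\!\left[\frac{\pi k}{J}\left(j - \tfrac{1}{2}\right)\right], \quad k = 0, \dots, K - 1
$$

In the code cell above, $ K = 13 $ corresponds to `n_mfcc=13`.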

## 🟦 5. Emotion Detection from Voice

Human listeners often infer emotion from how something is said, not just the words themselves. Machines can mimic this using acoustic proxies for emotional modulation.

Here, we use three primary features:

- **Pitch** (frequency contour): Variation in fundamental frequency correlates with emotional expressiveness (e.g., rising pitch = excitement or surprise)  
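- **Tempo** (pacing): A faster tempo typically indicates urgency, agitation, or enthusiasm; a slower pace often signals calmness or sadness. Note that `librosa.beat.beat_track` estimates a rhythmic beat, so for unaccompanied speech it is only a rough proxy for speech rate  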
- **Volume** (amplitude envelope): Loudness relates to arousal or emphasis  

We use these to classify an approximate emotional state on the Russell Circumplex Model — a 2D affective space where:

- The **x-axis** reflects **valence** (unpleasant → pleasant)
- The **y-axis** reflects **arousal**, or energy level (calm → excited)

In [None]:
# This is a first approximation — 
# more sophisticated emotion models would use classifiers 
# trained on labeled corpora (e.g., RAVDESS or CREMA-D).

import numpy as np

# Compute basic stats
pitch_mean = np.mean(pitch)
pitch_std = np.std(pitch)                   # spread of the pitch contour, in Hz
volume = np.sqrt(np.mean(y**2))             # RMS energy
tempo_bpm = float(np.atleast_1d(tempo)[0])  # beat_track may return an array

# Basic heuristics for emotion classification (thresholds are illustrative, not calibrated)
if pitch_std > 30 and tempo_bpm > 100 and volume > 0.02:
    emotion = "Excited / Happy"
elif pitch_std < 10 and tempo_bpm < 80:
    emotion = "Calm / Sad"
elif pitch_std > 20 and tempo_bpm > 90:
    emotion = "Energetic / Angry"
else:
    emotion = "Neutral or Uncertain"

print("Estimated Emotion Tone:", emotion)
print(f"Pitch Std Dev: {pitch_std:.2f} | Tempo: {tempo_bpm:.2f} | Volume (RMS): {volume:.4f}")

### 🧮 Affect from Acoustic Modulation

Let:
- $ f_0(t) $: fundamental frequency (pitch) over time  
- $ E(t) $: energy or volume (e.g., RMS amplitude)  
- $ B(t) $: beat rate or tempo  

Then:
- High $ \sigma(f_0(t)) $ → expressive / dynamic speech (emotionally intense)
- High $ B(t) $ → elevated arousal (e.g., anger, joy)
- Low $ B(t) $, low $ f_0(t) $ variation → subdued, flat affect (e.g., boredom, sadness)

> Emotional expression in speech can be modeled via pitch contour (excitement), tempo (urgency), volume (arousal), and dynamic range (engagement).  
This offers a simple yet powerful lens for decoding tone and social signal from voice alone.
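
The two-axis affective space described earlier in this section can be sketched as a simple quadrant lookup. The function name `circumplex_quadrant`, the `[-1, 1]` scaling of both axes, and the four labels are assumptions for illustration. The acoustic features used in this notebook mostly inform the arousal axis, so the valence input is assumed to come from another signal, such as transcript sentiment.

```python
def circumplex_quadrant(valence, arousal):
    """Map a (valence, arousal) point in [-1, 1] x [-1, 1] to a coarse quadrant label."""
    if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
        raise ValueError("valence and arousal are expected in [-1, 1]")
    if arousal >= 0:
        # Upper half of the plane: high energy
        return "excited / happy" if valence >= 0 else "tense / angry"
    # Lower half of the plane: low energy
    return "calm / content" if valence >= 0 else "sad / bored"

print(circumplex_quadrant(0.6, 0.8))
print(circumplex_quadrant(-0.4, -0.7))
```

A continuous pair of scores keeps more information than a single label, so a downstream model would likely consume the raw coordinates and use the quadrant only for display.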

## 🟦 6. Slang and Keyword Drift

We extract n-grams and recurring phrases from the Whisper transcript.

Slang, memes, and emergent phrases move from fringe groups to the mainstream through repeated exposure, imitation, and algorithmic amplification. These changes are often reflected first in the *sound* of language, before appearing in formal lexicons.

By analyzing short word sequences (n-grams) in our transcripts, we can start to detect:

- Repetitive or trending phrases
- Semantic clusters that reflect cultural reference points
- Phrase structures that map to genre, subculture, or moment

This process lays the groundwork for identifying *what kind of thing this video is trying to be* — whether that’s a skit, a joke, an aesthetic statement, or a participation in a meme.

In [None]:
import nltk
from nltk.util import ngrams
from collections import Counter

# Download tokenizer resources if not already present
# (newer NLTK releases look for 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Tokenize and lowercase the transcript, dropping punctuation-only tokens
tokens = [t for t in nltk.word_tokenize(result["text"].lower())
          if any(ch.isalnum() for ch in t)]

# Extract bigrams (or try trigrams for more structure)
bigrams = list(ngrams(tokens, 2))

# Count and display the most frequent patterns in this clip
trending = Counter(bigrams).most_common(10)
print("Top Trending Bigrams:")
for phrase, count in trending:
    print(f"{' '.join(phrase)}: {count}")

### 🧮 Social Lexicon Drift

Let $ T = (w_1, w_2, \dots, w_N) $ be the tokenized transcript, an ordered sequence of $ N $ tokens.

We compute $ n $-grams:

$$
\text{NGram}(T, n) = \{(w_i, w_{i+1}, \dots, w_{i+n-1}) \mid i = 1, \dots, N - n + 1\}
$$

and evaluate their **burstiness**, **frequency**, and **collocation strength**:

- **Burstiness**: How sharply a phrase rises in popularity  
- **PMI (Pointwise Mutual Information)**: Measures strength of association between terms  
- **TF-IDF / trend curves**: For longer horizon drift tracking

This gives us a signal of *lexical diffusion* — how novel expressions bubble up in different social strata and transition from noise to signal.
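
Of the measures listed above, PMI is the easiest to sketch on a single transcript. The version below scores adjacent word pairs using simple relative-frequency estimates. The function name `bigram_pmi`, the `min_count` cutoff, and the toy token list are assumptions for illustration. The cutoff is there because PMI tends to overrate pairs that occur only once.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue                       # skip rare pairs, which PMI overrates
        p_ab = count / n_bi                # joint probability of the pair
        p_a = unigrams[a] / n_uni          # marginal probability of each word
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy transcript, already tokenized and lowercased.
toy_tokens = "no cap this is fire no cap it is so good no cap".split()

for pair, score in bigram_pmi(toy_tokens):
    print(" ".join(pair), round(score, 2))
```

Burstiness and TF-IDF both need a corpus of clips over time, so they cannot be estimated from one transcript in the same way.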

> “What are people saying?” becomes: “What *kind* of saying is this becoming?”

Analyzing these structures helps index emergent genres, humor, identity, and performance — all core to decoding modern media.

## 🟦 7. Cadence + Delivery Pattern Analysis

We measure words per second, pause frequency, filler usage, stutter, and the “Millennial pause.”

Cadence — the rhythm of delivery — plays a major role in how identity and emotion are expressed through voice. It acts as a kind of "social accent" that reveals:

- **Pacing** (e.g., rushed, calm, deliberate)
- **Pause patterns** (e.g., hesitation, rhetorical pauses)
- **Filler words** (e.g., “like,” “uh,” “you know”)
- **Delivery style** (e.g., the "Millennial pause": a brief silence at the start of a recording before the speaker begins)

These are not just quirks — they are part of what sociolinguists call **paralinguistic signaling**, contributing to perceived authenticity, genre alignment, and group belonging.

Cadence helps answer:  
> “Does this person sound like a YouTuber? A streamer? A parody? A real person?”  

It also relates to how short-form platforms reward or suppress certain styles (e.g., speed-talking for algorithm favor, deadpan for humor).

In [None]:
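# Estimate total audio duration in seconds
duration_sec = librosa.get_duration(y=y, sr=sr)

# Count words, ignoring punctuation-only tokens produced by the tokenizer
word_tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
words = len(word_tokens)

# Compute words per second (speech rate)
wps = words / duration_sec if duration_sec > 0 else 0.0
print(f"Words per second (WPS): {wps:.2f}")

# Optional: detect filler words (basic example set; "like" and "so" are not always fillers)
single_fillers = {'um', 'uh', 'like', 'so', 'actually'}
phrase_fillers = {('you', 'know')}  # multi-word fillers must be matched as bigrams
filler_count = sum(t in single_fillers for t in word_tokens)
filler_count += sum(pair in phrase_fillers for pair in zip(word_tokens, word_tokens[1:]))
filler_ratio = filler_count / words if words > 0 else 0

print(f"Filler word ratio: {filler_ratio:.3f} ({filler_count} fillers in {words} words)")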

### 🧮 Speech Rhythm as Identity Marker

This subsection draws on the sociolinguistics of pacing and paralinguistic cues.

> Timing patterns (e.g., delayed starts, filler interjections) reveal speaker identity and group alignment. Pacing correlates with personality traits and cultural affinity.

Let:
- $ N $: total number of transcribed words  
- $ D $: total duration of the audio in seconds  
- $ F $: number of known filler words (e.g., from a curated set)

Then:
- Words per second: $ WPS = \frac{N}{D} $
- Filler ratio: $ \frac{F}{N} $

Pause duration between words, and staccato versus flowing delivery, can be tracked in the same spirit. These rhythm signatures can become **latent features** in downstream classifiers (e.g., genre, sentiment, persona).
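
Pause behaviour can be approximated from timestamped segments like the `start` and `end` fields Whisper attaches to each entry in `result["segments"]`. The sketch below is a rough illustration: the function name `pause_profile`, the 0.25-second threshold, and the mock segments are assumptions. Segment boundaries are also coarse, so word-level timestamps (recent `openai-whisper` versions accept `word_timestamps=True`) would give a finer picture.

```python
def pause_profile(segments, min_pause=0.25):
    """Summarize silent gaps between consecutive timestamped segments."""
    # Gap between the end of one segment and the start of the next
    gaps = [nxt["start"] - cur["end"] for cur, nxt in zip(segments, segments[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    return {
        # Silence before the first words: a crude proxy for a delayed start
        "initial_delay_sec": segments[0]["start"] if segments else 0.0,
        "pause_count": len(pauses),
        "mean_pause_sec": sum(pauses) / len(pauses) if pauses else 0.0,
        "longest_pause_sec": max(pauses, default=0.0),
    }

# Hand-written timings for illustration (not real model output).
mock_segments = [
    {"start": 0.5, "end": 2.0},
    {"start": 2.5, "end": 4.0},
    {"start": 4.1, "end": 6.0},
    {"start": 7.0, "end": 8.0},
]

profile = pause_profile(mock_segments)
print(profile)
```

The `initial_delay_sec` field is one way to look for a delayed start, while the pause count and mean length help separate staccato from flowing delivery.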

> In essence, we are fingerprinting the *musicality of speech*.

## 🟦 8. Cultural Fingerprinting Summary

After analyzing voice tone, rhythm, lexical choices, and acoustic mood, we arrive at a key question:

> What kind of person — or performance — is this?

This final section synthesizes:

- 🗣️ **Emotion from voice** (pitch, tempo, volume)
- 📈 **Delivery style** (cadence, filler use, “Millennial pause”)
- 💬 **Lexical drift** (slang, memes, genre phrases)
- 🎵 **Acoustic aesthetic** (energy, warmth, musicality)

These features combine into a **soft fingerprint** — a flexible, human-like impression of *how someone talks* and *who they sound like they are*. This may hint at:

- Subculture (e.g., gaming, K-pop, alt-literature)
- Role or intent (e.g., educational, ironic, sincere)
- Identity signals (e.g., age group, in-group, parody)

In short: We’re modeling **performance style** — a key component of how social video is interpreted by both people and algorithms.
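
One way to hold the combined signals is a small record type. The sketch below is only an illustration of the idea: the class name `VoiceFingerprint`, its field names, the example values, and the cutoffs inside `describe` are all hypothetical and uncalibrated.

```python
from dataclasses import dataclass, asdict

@dataclass
class VoiceFingerprint:
    """Container for the clip-level signals computed in this notebook."""
    emotion_label: str
    pitch_std_hz: float
    tempo_bpm: float
    rms_volume: float
    words_per_second: float
    filler_ratio: float
    top_bigrams: list

    def describe(self):
        # Illustrative cutoffs only; real thresholds would need calibration.
        pace = "fast" if self.words_per_second > 3.0 else "measured"
        fillers = "filler-heavy" if self.filler_ratio > 0.05 else "clean"
        return f"{self.emotion_label} tone, {pace} pacing, {fillers} delivery"

# Made-up values standing in for the outputs of the earlier cells.
example = VoiceFingerprint(
    emotion_label="Excited / Happy",
    pitch_std_hz=42.0,
    tempo_bpm=128.0,
    rms_volume=0.05,
    words_per_second=3.4,
    filler_ratio=0.08,
    top_bigrams=["no cap", "hear me"],
)

print(example.describe())
print(asdict(example))
```

Keeping the raw numbers next to the readable summary lets the same record feed a classifier or be inspected by a person.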

> The same phrase can mean very different things depending on *how* it's said. And now, our model can begin to sense that difference.