# Problem Statement

In this project, we build a machine learning model capable of evaluating the humor quality of short stand-up comedy clips. The goal is to predict how well a given audio performance will be received by a real audience, based on its content and delivery.

To achieve this, we work with a dataset of stand-up audio clips and their transcriptions. Our system will analyze features such as:

* Overall funniness ranking, based on the ratio of total laughter time to total speech time
* Audience engagement patterns (word-level response)
* Linguistic style and structure of jokes
* Emotional tone and delivery cues (e.g., timing, pacing, pauses)
* Laughter reactions (presence, duration; intensity?)
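Delivery cues such as pauses can be approximated from short-term signal energy. The following is a minimal sketch, assuming a mono floating-point signal; the function name `find_pauses` and the defaults for `frame_len`, `threshold` and `min_pause` are illustrative assumptions, not values taken from this project.

```python
import numpy as np

def find_pauses(samples, sr, frame_len=0.025, threshold=0.01, min_pause=0.3):
    """Return (start_s, end_s) spans where frame RMS stays below `threshold`.

    frame_len, threshold and min_pause are illustrative defaults, not tuned values.
    """
    hop = int(round(frame_len * sr))
    n_frames = len(samples) // hop
    frames = np.asarray(samples[: n_frames * hop], dtype=float).reshape(n_frames, hop)
    quiet = np.sqrt((frames ** 2).mean(axis=1)) < threshold

    pauses, start = [], None
    # The appended False acts as a sentinel that closes a trailing pause
    for i, is_quiet in enumerate(np.append(quiet, False)):
        if is_quiet and start is None:
            start = i
        elif not is_quiet and start is not None:
            if (i - start) * frame_len >= min_pause:
                pauses.append((start * frame_len, i * frame_len))
            start = None
    return pauses

# Synthetic check: 1 s tone, 0.5 s silence, 1 s tone
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
signal = np.concatenate([tone, np.zeros(sr // 2), tone])
pauses = find_pauses(signal, sr)
print(pauses)
```

On real clips the threshold would likely need to be set relative to the recording's noise floor, since audience noise rarely drops to true silence.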


The project involves:

* Text preprocessing: cleaning and tokenizing the transcript text.
* Audio feature extraction: detecting laughter segments and other prosodic cues.
* Embedding and modeling: using pre-trained language models (e.g., BERT) or training a humor-specific embedding.
* Prediction & critique: predicting an overall humor rating (e.g., 1–5) and providing fine-grained feedback to the performer.

Ultimately, this model can support comedians in refining their material, help platforms surface funnier content, and enable deeper understanding of what makes something “funny.”

## Data dictionary

speech: the text transcription of the standup audio clip

ranking: the humor ranking on a scale of 1-4

audio feature 1:

audio feature 2:

audio feature 3:

audio feature 4:

audio feature 5:

audio feature 6:

duration (optional): the total time length of the speech

laugh time (optional): the total duration of all laughter segments
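The two optional fields above are enough to derive a laughter ratio and, from it, a coarse ranking. The sketch below is only an illustration: the function names and the cut points in `edges` are hypothetical and would need to be chosen from the actual data distribution.

```python
import numpy as np

def laugh_ratio(laugh_time, duration):
    """Fraction of the clip taken up by laughter (0 when duration is 0)."""
    return laugh_time / duration if duration > 0 else 0.0

def ratio_to_ranking(ratio, edges=(0.05, 0.15, 0.30)):
    """Map a laughter ratio to a 1-4 ranking. `edges` are hypothetical cut points."""
    return int(np.digitize(ratio, edges)) + 1

# (laugh time, duration) pairs in seconds, made up for illustration
for laugh, total in [(1.0, 60.0), (6.0, 60.0), (12.0, 60.0), (30.0, 60.0)]:
    r = laugh_ratio(laugh, total)
    print(f"ratio={r:.2f} -> ranking {ratio_to_ranking(r)}")
```

Quantile-based edges (for example via `np.quantile` over all clips) would give balanced classes, at the cost of making the ranking relative to this dataset.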


Word level


## Libraries

In [None]:
# Installing the libraries with specified versions
!pip install -U -q sentence-transformers==4.1.0 transformers==4.52.4 bitsandbytes==0.46.0 accelerate==1.7.0 sentencepiece==0.2.0 pandas==2.2.2 numpy==2.0.2 matplotlib==3.10.0 seaborn==0.13.2 torch==2.6.0 scikit-learn==1.6.1

In [None]:
# to read and manipulate the data
import pandas as pd
import numpy as np
pd.set_option('max_colwidth', None)  # setting column to the maximum column width as per the data

# to visualise data
import matplotlib.pyplot as plt
import seaborn as sns

# Deep Learning library
import torch

# to load transformer models
from sentence_transformers import SentenceTransformer
from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline

# to split the data
from sklearn.model_selection import train_test_split

# to compute performance metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# To build a Random Forest model
from sklearn.ensemble import RandomForestClassifier


# to ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")


# Encode audio and text into a shared vector space, allowing for multimodal tasks like cross-modal search or improved classification

* Audio feature granularity: word level or sentence level?
  * Intuitively, it makes sense to integrate word-level audio features with each text token's embedding during the embedding process. How can this be realized technically?

* How to integrate with the text embedding: add, concatenate, or something else?

* Word-level laughter information is currently available.
* Other audio features at sentence level? After removing laughter, extract features that capture aspects like pitch, rhythm, timbre, and speech patterns. Mel Frequency Cepstral Coefficients (MFCCs) are often used, along with deep learning models like WaveNet or DeepSpeech for feature extraction.
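One possible way to get word-level audio features is to average a frame-level feature over each word's time span. The sketch below assumes word timestamps are available as `(start_s, end_s)` pairs; the helper `pool_frames_per_word` and all input values are hypothetical.

```python
import numpy as np

def pool_frames_per_word(frame_values, frame_times, word_spans):
    """Average a frame-level feature over each word's (start_s, end_s) span."""
    pooled = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times < end)
        # Words with no overlapping frame fall back to 0.0
        pooled.append(frame_values[mask].mean() if mask.any() else 0.0)
    return np.array(pooled)

# Hypothetical inputs: 10 frames at a 0.1 s hop, and three word timestamps
frame_times = np.arange(10) * 0.1
frame_rms = np.array([0.1, 0.1, 0.3, 0.3, 0.3, 0.0, 0.0, 0.6, 0.6, 0.6])
word_spans = [(0.0, 0.2), (0.2, 0.5), (0.7, 1.0)]

word_rms = pool_frames_per_word(frame_rms, frame_times, word_spans)
print(word_rms)
```

The resulting per-word values could then be attached to the matching tokens, although sub-word tokenization would require deciding how one word's value is shared across its pieces.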

# Extract Audio features

In [None]:
from pydub import AudioSegment

import librosa
import librosa.display


from scipy.signal import lfilter
from scipy.fftpack import dct
from scipy.signal import spectrogram
import soundfile as sf

In [None]:
# Load the MP3 file
audio_path = "/mnt/data/1.mp3"
audio = AudioSegment.from_file(audio_path)

Plot the waveform

In [None]:
# Convert to mono and get raw data
samples = np.array(audio.set_channels(1).get_array_of_samples())

# Normalize samples to range [-1, 1] using the clip's bit depth (sample_width is in bytes)
samples = samples.astype(np.float32) / float(1 << (8 * audio.sample_width - 1))

# Create a time axis
duration = len(audio) / 1000.0  # in seconds
time = np.linspace(0, duration, num=len(samples))

# Plot waveform
plt.figure(figsize=(14, 4))
plt.plot(time, samples, linewidth=0.5)
plt.title("Audio Waveform of Stand-Up Clip")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.grid(True)
plt.tight_layout()
plt.show()


## Audio features:
* MFCCs (Mel Frequency Cepstral Coefficients): Capture the timbral texture of the voice, often used in speech and emotion recognition.
* RMS Energy: Measures the loudness of the signal. Higher energy may correlate with intense delivery or audience laughter.
* Spectrogram (dB scale): Visualizes frequency over time. Dense regions may hint at dynamic tone or punchlines.
  * spectral_centroid: Brightness or average frequency; highlights expressive voice tone
  * spectral_rolloff: Frequency where energy drops off; detects sharp or fast-paced delivery
  * spectral_contrast: Dynamic range across frequency bands; captures voice expressiveness & emotion
* Line Spectral Frequencies (LSF): Represent the speech spectral envelope and help capture vocal tract information, useful for distinguishing speaker tone or emotion.
* Zero-Crossing Rate (ZCR): Measures signal noisiness or fricative speech; higher in laughter or fast-paced delivery.
* Delta Coefficients of MFCCs: Capture temporal changes in MFCCs—important for modeling how humor builds up or shifts across a performance.
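To make two of the simpler features concrete, RMS energy and zero-crossing rate can be computed by hand on synthetic signals. This is only an illustration of what the numbers measure, written with NumPy; it is not how librosa computes its framed versions, and the helper names are made up for this sketch.

```python
import numpy as np

def rms_energy(x):
    """Root-mean-square amplitude of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(x)
    return float(np.mean(signs[1:] != signs[:-1]))

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 100 * t)              # low-pitched, periodic
noise = np.random.default_rng(0).normal(0, 0.5, sr)   # broadband, noisy

print("tone : rms=%.3f zcr=%.4f" % (rms_energy(tone), zero_crossing_rate(tone)))
print("noise: rms=%.3f zcr=%.4f" % (rms_energy(noise), zero_crossing_rate(noise)))
```

The two signals have similar loudness but very different zero-crossing rates, which is why ZCR is often read as a noisiness cue rather than a loudness cue.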

In [None]:
# Load audio file
y, sr = librosa.load(audio_path, sr=None)

# --- Feature 1: MFCCs ---
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfccs_mean = mfccs.mean(axis=1)

# --- Feature 2: RMS Energy ---
rms = librosa.feature.rms(y=y)
rms_mean = rms.mean()

# --- Feature 3 : Spectrogram ---
S = np.abs(librosa.stft(y))
spectrogram_db = librosa.amplitude_to_db(S, ref=np.max)
spectrogram_mean = spectrogram_db.mean()

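# Spectral descriptors, averaged over frames so each contributes a single value
spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
spectral_centroid_mean = spectral_centroid.mean()
spectral_rolloff_mean = spectral_rolloff.mean()
spectral_contrast_mean = spectral_contrast.mean()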

# --- Feature 4: Line Spectral Frequencies (LSF) ---
def lpc(signal, order):
    """Compute Linear Predictive Coefficients using autocorrelation."""
    from scipy.linalg import solve_toeplitz
    R = np.correlate(signal, signal, mode='full')
    R = R[len(R)//2:]          # keep non-negative lags only
    r = R[1:order+1]
    # Solve the symmetric Toeplitz normal equations R a = r
    lpc_coeffs = solve_toeplitz(R[:order], r)
    return np.concatenate(([1], -lpc_coeffs))

def lsf_from_lpc(a):
    """Convert LPC to LSF (radians in (0, pi)) using polynomial root finding."""
    # Extend A(z) by one order to form the sum and difference polynomials
    a_ext = np.concatenate((a, [0.0]))
    P = a_ext + np.flip(a_ext)
    Q = a_ext - np.flip(a_ext)
    angles = np.angle(np.concatenate((np.roots(P), np.roots(Q))))
    # Roots come in conjugate pairs on the unit circle: keep one angle per pair
    # and drop the trivial roots at 0 and pi
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])

lpc_coeffs = lpc(y[:2048], order=10)
lsf = lsf_from_lpc(lpc_coeffs)
lsf_mean = np.mean(lsf)

# --- Feature 5: Zero-Crossing Rate ---
zcr = librosa.feature.zero_crossing_rate(y)
zcr_mean = zcr.mean()

# --- Feature 6: Delta Coefficients of MFCCs ---
delta_mfcc = librosa.feature.delta(mfccs)
delta_mfcc_mean = delta_mfcc.mean(axis=1)

# --- Combine All Features into DataFrame ---
# Recommend using mean values to reduce dimensionality for the base model; add full arrays for the most important features during fine-tuning.
# feature_dict = {
#     **{f"mfcc_{i+1}": mfccs_mean[i] for i in range(len(mfccs_mean))},
#     "rms_energy": rms_mean,
#     "spectrogram_db_mean": spectrogram_mean,
#     "spectral_centroid": spectral_centroid.tolist(),
#     "spectral_rolloff": spectral_rolloff.tolist(),
#     "spectral_contrast": spectral_contrast.tolist(),
#     "lsf_mean": lsf_mean,
#     "zcr_mean": zcr_mean,
#     **{f"delta_mfcc_{i+1}": delta_mfcc_mean[i] for i in range(len(delta_mfcc_mean))}
# }
feature_dict = {
    **{f"mfcc_{i+1}": mfccs_mean[i] for i in range(len(mfccs_mean))},
    "rms_energy": rms_mean,
    "spectrogram_db_mean": spectrogram_mean,
    "spectral_centroid": spectral_centroid_mean,
    "spectral_rolloff": spectral_rolloff_mean,
    "spectral_contrast": spectral_contrast_mean,
    "lsf_mean": lsf_mean,
    "zcr_mean": zcr_mean,
    **{f"delta_mfcc_{i+1}": delta_mfcc_mean[i] for i in range(len(delta_mfcc_mean))}
}

audio_features = pd.DataFrame([feature_dict])
audio_features.head()


# Import text data

2070 files uploaded to Google Drive; one file upload failed: S_ITYFTLT_audio_6.mp3

In [None]:
import os
import csv

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:

# Define the input folder and output file
input_folder = "/content/drive/MyDrive/AI_open_mic_dataset"
output_csv = "funny.csv"

In [None]:
# Prepare a list to store data
data_rows = []

# Walk through all files in the folder
for file_name in os.listdir(input_folder):
    if file_name.endswith(".txt"):
        file_path = os.path.join(input_folder, file_name)
        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read().strip()
                data_rows.append({"text": content, "file name": file_name})
        except Exception as e:
            print(f"Error reading {file_name}: {e}")



In [None]:
# data_rows is a list of dicts, so wrap it in a DataFrame to preview the first rows
pd.DataFrame(data_rows).head()

In [None]:
# Write to CSV
with open(output_csv, "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["text", "file name"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data_rows)

print(f"Extracted {len(data_rows)} .txt files into {output_csv}")
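Since at least one upload failed, it may help to check which transcripts lack an audio file and vice versa. The sketch below assumes that a transcript and its clip share the same file stem, which is not confirmed by this notebook; `unmatched_files` is a hypothetical helper and the demo uses a temporary folder with made-up file names.

```python
import tempfile
from pathlib import Path

def unmatched_files(folder, text_ext=".txt", audio_ext=".mp3"):
    """Return (texts_without_audio, audio_without_text) as sorted lists of stems.

    Assumes a transcript and its clip share the same file stem.
    """
    folder = Path(folder)
    text_stems = {p.stem for p in folder.glob(f"*{text_ext}")}
    audio_stems = {p.stem for p in folder.glob(f"*{audio_ext}")}
    return sorted(text_stems - audio_stems), sorted(audio_stems - text_stems)

# Demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["clip_1.txt", "clip_1.mp3", "clip_2.txt"]:
        (Path(tmp) / name).touch()
    missing_audio, missing_text = unmatched_files(tmp)

print(missing_audio, missing_text)
```

In the notebook this would be called on `input_folder`, adjusting the stem-matching rule if transcripts and clips follow different naming patterns.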


# Data overview

# Text embedding

## Use an existing BERT model to do text embedding

We'll be using the all-MiniLM-L6-v2 model here.

💡 The all-MiniLM-L6-v2 model is an all-round (all) model trained on a large and diverse dataset of over 1 billion training samples and generates state-of-the-art sentence embeddings of 384 dimensions.

📊 It is a language model (LM) that has 6 transformer encoder layers (L6) and is a smaller model (Mini) trained to mimic the performance of a larger model (BERT).

🛠️ Potential use-cases include text classification, sentiment analysis, and semantic search.
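Sentence embedding models of this kind typically turn per-token vectors into one sentence vector by mean pooling over the non-padding tokens, followed by normalization. The sketch below illustrates that idea on toy numbers with NumPy; it is not the library's internal code, and `mean_pool` is a name made up for this example.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding, then L2-normalise."""
    mask = attention_mask[:, None].astype(float)
    pooled = (token_embeddings * mask).sum(axis=0) / max(mask.sum(), 1.0)
    return pooled / np.linalg.norm(pooled)

# Toy example: three 2-dimensional token vectors, the last one is padding
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)
```

Because pooling averages over tokens, word order within the sentence is only captured to the extent the encoder layers have already mixed it into each token vector.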

In [None]:
# defining the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
model.encode(['I like clean jokes!'])

In [None]:
# setting the device to GPU if available, else CPU
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
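# encoding the dataset
# `data` is assumed to be the transcript DataFrame (e.g. data = pd.read_csv(output_csv)) with a "text" column
embedding_matrix = model.encode(data['text'].tolist(), device=device, show_progress_bar=True)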

## Use Word2Vec for text embedding

## Fine-tune a language model on funny text for text embedding?

## Build own transformer encoder using mostly funny text
$$$ billions


# Encode audio and text into a shared vector space, allowing for multimodal tasks like cross-modal search or improved classification

* How to combine the two modalities: add, concatenate, or something else?


Audio: removing laughter, extracting relevant features that capture aspects like pitch, rhythm, timbre, and even speech patterns. Techniques like Mel Frequency Cepstral Coefficients (MFCCs) are often used, along with deep learning models like WaveNet or DeepSpeech for feature extraction.

Text: Preprocessing includes cleaning the text, tokenizing it into individual words or sub-word units, and converting it into embeddings that capture semantic meaning. Models like BERT, GPT, and T5 are widely used for generating text embeddings.
* sentence level embedding
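The "add" and "concatenate" options can be compared on stand-in arrays. In the sketch below all data is random; the 384 matches the text embedding size and the 33 matches the number of per-clip values in `feature_dict`. The projection matrix `W` is a random placeholder for weights that would have to be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 384))    # stand-in for sentence embeddings
audio_feat = rng.normal(size=(5, 33))   # stand-in for per-clip audio features

# Option 1: concatenate, after standardising the audio features so that
# their scale does not dominate the text embedding
audio_std = (audio_feat - audio_feat.mean(axis=0)) / (audio_feat.std(axis=0) + 1e-8)
fused_concat = np.hstack([text_emb, audio_std])

# Option 2: add, which needs both modalities in the same dimension.
# W is a random placeholder here; in practice it would be learned.
W = rng.normal(scale=0.1, size=(33, 384))
fused_add = text_emb + audio_std @ W

print(fused_concat.shape, fused_add.shape)
```

Concatenation needs no training and works directly with tree models such as a random forest, whereas addition only makes sense once the projection is trained jointly with a downstream model.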

# Humor analysis

## Use a neural network: sentence-level embedding as input, hidden layers, then a continuous ranking as output - this will hardly work since it has no sequence information
## Transformer vs. LSTM

## Use LLM

## Use an LLM fine-tuned on funny text

## Random forest, XGBoost

In [None]:
# Process the data

X = embedding_matrix
y = data["ranking"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# Building the model
rf_transformer = RandomForestClassifier(n_estimators = 100, max_depth = 7, random_state = 42)

# Fitting on train data
rf_transformer.fit(X_train, y_train)

In [None]:
# creating a function to plot the confusion matrix
def plot_confusion_matrix(actual, predicted):
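    # use the labels present in the data, since the ranking has more than two classes
    label_list = np.unique(np.concatenate([np.asarray(actual), np.asarray(predicted)]))
    cm = confusion_matrix(actual, predicted, labels=label_list)

    plt.figure(figsize = (5, 4))
    sns.heatmap(cm, annot = True,  fmt = '.0f', xticklabels = label_list, yticklabels = label_list)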
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

In [None]:
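# predictions on the train and test sets
y_pred_train = rf_transformer.predict(X_train)
y_pred_test = rf_transformer.predict(X_test)

plot_confusion_matrix(y_train, y_pred_train)
plot_confusion_matrix(y_test, y_pred_test)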
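Because the ranking is ordinal, plain accuracy hides how far off a wrong prediction is. The sketch below reports accuracy next to a majority-class baseline and the mean absolute error in ranking steps; `ordinal_report` is a hypothetical helper and the labels are made up for illustration.

```python
import numpy as np

def ordinal_report(y_true, y_pred):
    """Accuracy, majority-class baseline accuracy, and mean absolute error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    _, counts = np.unique(y_true, return_counts=True)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "majority_baseline": float(counts.max() / len(y_true)),
        "mae": float(np.mean(np.abs(y_true - y_pred))),
    }

# Made-up labels on the 1-4 scale, for illustration only
y_true = [1, 2, 2, 3, 4, 2]
y_pred = [1, 2, 3, 3, 2, 2]
report = ordinal_report(y_true, y_pred)
print(report)
```

A model whose accuracy does not clearly beat the majority baseline is not learning much from the embeddings, whatever the confusion matrix looks like.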

## Build own Transformer prediction model with Multi-head Attention
 $$$$$ billions