# Label Presidential Speeches with Ekman Emotions
This notebook uses the widely downloaded `SamLowe/roberta-base-go_emotions` model (472K+ downloads) to generate the emotion labels that serve as ground truth for the speeches.

In [1]:
%pip install transformers torch pandas openpyxl tqdm

Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl


Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5
Note: you may need to restart the kernel to use updated packages.


In [2]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


### Load the GoEmotions Model
We use `SamLowe/roberta-base-go_emotions`, a widely used emotion classification model on Hugging Face. It is a multi-label classifier that scores each of the 28 GoEmotions labels independently.

In [3]:
# Load the pre-trained GoEmotions model
MODEL_NAME = "SamLowe/roberta-base-go_emotions"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.to(device)
model.eval()

# Get the emotion labels from the model
emotion_labels = list(model.config.id2label.values())
print(f"Model has {len(emotion_labels)} emotion labels:")
print(emotion_labels)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Model has 28 emotion labels:
['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']


### Define Ekman Emotion Mapping
Map the 28 GoEmotions labels to 7 classes: the six Ekman emotions (anger, disgust, fear, joy, sadness, surprise) plus neutral.

In [4]:
# Mapping from GoEmotions (28 labels) to Ekman emotions (7 labels)
# Based on the official GoEmotions paper grouping
GOEMOTIONS_TO_EKMAN = {
    'anger': 'anger',
    'annoyance': 'anger',
    'disapproval': 'anger',
    'disgust': 'disgust',
    'fear': 'fear',
    'nervousness': 'fear',
    'joy': 'joy',
    'amusement': 'joy',
    'approval': 'joy',
    'excitement': 'joy',
    'gratitude': 'joy',
    'love': 'joy',
    'optimism': 'joy',
    'relief': 'joy',
    'pride': 'joy',
    'admiration': 'joy',
    'desire': 'joy',
    'caring': 'joy',
    'sadness': 'sadness',
    'disappointment': 'sadness',
    'embarrassment': 'sadness',
    'grief': 'sadness',
    'remorse': 'sadness',
    'surprise': 'surprise',
    'realization': 'surprise',
    'confusion': 'surprise',
    'curiosity': 'surprise',
    'neutral': 'neutral'
}

EKMAN_EMOTIONS = ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'neutral']
print(f"Ekman emotions: {EKMAN_EMOTIONS}")

Ekman emotions: ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'neutral']
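
A quick sanity check can confirm that the mapping covers every model label exactly once. The sketch below is self-contained: it rebuilds the same mapping from a grouped dictionary and compares it with the 28 labels printed by the model above. The names `EKMAN_GROUPS`, `MODEL_LABELS`, and `invert_groups` are introduced here for illustration and are not part of the notebook.

```python
# Hypothetical grouped form of the mapping; the groups mirror GOEMOTIONS_TO_EKMAN.
EKMAN_GROUPS = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["joy", "amusement", "approval", "excitement", "gratitude", "love",
            "optimism", "relief", "pride", "admiration", "desire", "caring"],
    "sadness": ["sadness", "disappointment", "embarrassment", "grief", "remorse"],
    "surprise": ["surprise", "realization", "confusion", "curiosity"],
    "neutral": ["neutral"],
}

# The 28 labels as printed from model.config.id2label in the cell above.
MODEL_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]


def invert_groups(groups):
    """Turn {ekman: [fine labels]} into {fine label: ekman}, rejecting duplicates."""
    mapping = {}
    for ekman, fine_labels in groups.items():
        for label in fine_labels:
            if label in mapping:
                raise ValueError(f"{label!r} is assigned to more than one group")
            mapping[label] = ekman
    return mapping


mapping = invert_groups(EKMAN_GROUPS)
missing = sorted(set(MODEL_LABELS) - set(mapping))  # model labels with no Ekman group
extra = sorted(set(mapping) - set(MODEL_LABELS))    # mapped names the model never emits
print(f"{len(mapping)} labels mapped; missing={missing}; extra={extra}")
```

An empty `missing` list matters most here: a model label that is absent from the mapping would be silently skipped during aggregation.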


### Load Presidential Speeches Dataset

In [5]:
# Load the presidential speeches dataset
df = pd.read_excel("data/1presidential_speeches_with_metadata.xlsx")

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

Dataset shape: (995, 9)

Columns: ['President', 'Party', 'from', 'until', 'Vice President', 'title', 'date', 'info', 'speech']

First few rows:


| | President | Party | from | until | Vice President | title | date | info | speech |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Donald Trump | Republican | 2017 | 2021 | 1.0 | January 8, 2020: Statement on Iran | 2020-01-08 00:00:00 | After the killing of General Qasem Soleimani o... | As long as I am President of the United States... |
| 1 | Donald Trump | Republican | 2017 | 2021 | 1.0 | January 3, 2020: Remarks on the Killing of Qas... | 2020-01-03 00:00:00 | President Trump announces that the US military... | Hello, everybody. Well, thank you very much. ... |
| 2 | Donald Trump | Republican | 2017 | 2021 | 1.0 | October 27, 2019: Statement on the the Death o... | 2019-10-27 00:00:00 | President Donald Trump announces the death of ... | Last night, the United States brought the worl... |
| 3 | Donald Trump | Republican | 2017 | 2021 | 1.0 | September 25, 2019: Press Conference | 2019-09-25 00:00:00 | President Donald Trump holds a press conferenc... | PRESIDENT TRUMP: Thank you very much. Thank... |
| 4 | Donald Trump | Republican | 2017 | 2021 | 1.0 | September 24, 2019: Remarks at the United Nati... | 2019-09-24 00:00:00 | President Donald Trump speaks to the 74th sess... | PRESIDENT TRUMP: Thank you very much. Mr. ... |


### Define Emotion Prediction Function
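The model returns an independent sigmoid probability for each of the 28 GoEmotions labels. We aggregate these to Ekman emotions by taking the maximum probability within each group, then report every emotion at or above the threshold (default 0.3), falling back to the single highest-scoring emotion if none qualifies.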

In [6]:
def predict_ekman_emotions(text, model, tokenizer, threshold=0.3):
    """
    Predict Ekman emotions for a given text.
    
    Args:
        text: Input text string
        model: The GoEmotions model
        tokenizer: The tokenizer
        threshold: Probability threshold for multi-label classification
    
    Returns:
        dict with Ekman emotion probabilities and predicted labels
    """
    if pd.isna(text) or not isinstance(text, str) or len(text.strip()) == 0:
        return {
            'ekman_probs': {e: 0.0 for e in EKMAN_EMOTIONS},
            'primary_emotion': 'neutral',
            'all_emotions': ['neutral']
        }
    
    # Texts longer than the model's 512-token limit are truncated here;
    # long speeches should go through the chunked function defined later
    max_length = 512
    
    # Tokenize
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        padding=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]
    
    # Map GoEmotions probabilities to Ekman emotions (take max of each group)
    ekman_probs = {ekman: 0.0 for ekman in EKMAN_EMOTIONS}
    
    for i, go_emotion in enumerate(emotion_labels):
        if go_emotion in GOEMOTIONS_TO_EKMAN:
            ekman = GOEMOTIONS_TO_EKMAN[go_emotion]
            ekman_probs[ekman] = max(ekman_probs[ekman], probs[i])
    
    # Get predicted emotions above threshold
    predicted_emotions = [e for e, p in ekman_probs.items() if p >= threshold]
    
    # If no emotion above threshold, use the highest one
    if not predicted_emotions:
        primary_emotion = max(ekman_probs, key=ekman_probs.get)
        predicted_emotions = [primary_emotion]
    else:
        primary_emotion = max(predicted_emotions, key=lambda e: ekman_probs[e])
    
    return {
        'ekman_probs': ekman_probs,
        'primary_emotion': primary_emotion,
        'all_emotions': predicted_emotions
    }

# Test the function
test_texts = [
    "I am so happy and grateful for this wonderful day!",
    "This makes me furious! How dare they do this!",
    "I'm really scared about what might happen next.",
    "That's absolutely disgusting behavior.",
    "I feel so sad and heartbroken about this loss."
]

print("Testing emotion prediction:")
print("="*60)
for text in test_texts:
    result = predict_ekman_emotions(text, model, tokenizer)
    print(f"\nText: {text[:50]}...")
    print(f"Primary emotion: {result['primary_emotion']}")
    print(f"All emotions: {result['all_emotions']}")
    print(f"Probabilities: {', '.join([f'{k}:{v:.2f}' for k, v in result['ekman_probs'].items()])}")

Testing emotion prediction:

Text: I am so happy and grateful for this wonderful day!...
Primary emotion: joy
All emotions: ['joy']
Probabilities: anger:0.00, disgust:0.00, fear:0.00, joy:0.87, sadness:0.00, surprise:0.01, neutral:0.01

Text: This makes me furious! How dare they do this!...
Primary emotion: anger
All emotions: ['anger']
Probabilities: anger:0.82, disgust:0.01, fear:0.00, joy:0.01, sadness:0.01, surprise:0.01, neutral:0.09

Text: I'm really scared about what might happen next....
Primary emotion: fear
All emotions: ['fear']
Probabilities: anger:0.01, disgust:0.01, fear:0.90, joy:0.02, sadness:0.02, surprise:0.01, neutral:0.03

Text: That's absolutely disgusting behavior....
Primary emotion: disgust
All emotions: ['disgust']
Probabilities: anger:0.05, disgust:0.85, fear:0.02, joy:0.03, sadness:0.02, surprise:0.01, neutral:0.02

Text: I feel so sad and heartbroken about this loss....
Primary emotion: sadness
All emotions: ['sadness']
Probabilities: anger:0.01, disgust:0.0
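
The aggregation step can be illustrated without loading the model. The sketch below applies a sigmoid to a handful of made-up logits, takes the maximum within each group, and applies the same threshold and fallback logic as `predict_ekman_emotions`. The function name `aggregate_max`, the five-label subset, and the logit values are assumptions chosen for illustration.

```python
import numpy as np


def aggregate_max(logits, labels, mapping, groups, threshold=0.3):
    """Sigmoid the logits, keep the max probability per group, then threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    grouped = {g: 0.0 for g in groups}
    for label, p in zip(labels, probs):
        group = mapping[label]
        grouped[group] = max(grouped[group], float(p))

    # Keep every group at or above the threshold; fall back to the top group.
    selected = [g for g, p in grouped.items() if p >= threshold]
    if not selected:
        selected = [max(grouped, key=grouped.get)]
    primary = max(selected, key=grouped.get)
    return grouped, primary, selected


# Toy subset of labels with made-up logits (not model output).
labels = ["anger", "annoyance", "joy", "gratitude", "neutral"]
mapping = {"anger": "anger", "annoyance": "anger",
           "joy": "joy", "gratitude": "joy", "neutral": "neutral"}
logits = [-3.0, -1.0, 0.5, 2.0, -0.5]

grouped, primary, selected = aggregate_max(
    logits, labels, mapping, groups=["anger", "joy", "neutral"]
)
print({g: round(p, 3) for g, p in grouped.items()})
print(primary, selected)
```

Because each sigmoid output is independent, the grouped scores are not a probability distribution and need not sum to 1, which is why more than one emotion can pass the threshold.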

### Process Long Speeches in Chunks
Most presidential speeches are far longer than the model's 512-token limit, so we split each one into overlapping chunks, predict emotions for each chunk, and average the probabilities across chunks.

In [7]:
def predict_emotions_for_long_text(text, model, tokenizer, chunk_size=400, overlap=50, threshold=0.3):
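    """
    Process long text by splitting into overlapping chunks and aggregating predictions.
    
    Args:
        text: Long input text
        model: The GoEmotions model
        tokenizer: The tokenizer
        chunk_size: Number of tokens per chunk (kept below the 512-token model limit)
        overlap: Token overlap between consecutive chunks
        threshold: Probability threshold
    
    Returns:
        Aggregated emotion predictions, including the number of chunks processed
    """
    if pd.isna(text) or not isinstance(text, str) or len(text.strip()) == 0:
        return {
            'ekman_probs': {e: 0.0 for e in EKMAN_EMOTIONS},
            'primary_emotion': 'neutral',
            'all_emotions': ['neutral'],
            'num_chunks': 0
        }
    
    # Tokenize the full text to get token count.
    # The tokenizer may warn that the sequence exceeds 512 tokens; these IDs are
    # only used for splitting, and the full sequence is never passed to the model.
    full_tokens = tokenizer.encode(text, add_special_tokens=False)
    
    # If text is short enough, process it as a single chunk
    if len(full_tokens) <= chunk_size:
        result = predict_ekman_emotions(text, model, tokenizer, threshold)
        result['num_chunks'] = 1  # Keep the return shape consistent with the chunked path
        return result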
    
    # Split into chunks with overlap
    chunk_probs = []
    
    for i in range(0, len(full_tokens), chunk_size - overlap):
        chunk_tokens = full_tokens[i:i + chunk_size]
        chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
        
        result = predict_ekman_emotions(chunk_text, model, tokenizer, threshold)
        chunk_probs.append(result['ekman_probs'])
    
    # Aggregate: take the mean probability across all chunks
    aggregated_probs = {ekman: 0.0 for ekman in EKMAN_EMOTIONS}
    for ekman in EKMAN_EMOTIONS:
        aggregated_probs[ekman] = np.mean([cp[ekman] for cp in chunk_probs])
    
    # Get predicted emotions above threshold
    predicted_emotions = [e for e, p in aggregated_probs.items() if p >= threshold]
    
    if not predicted_emotions:
        primary_emotion = max(aggregated_probs, key=aggregated_probs.get)
        predicted_emotions = [primary_emotion]
    else:
        primary_emotion = max(predicted_emotions, key=lambda e: aggregated_probs[e])
    
    return {
        'ekman_probs': aggregated_probs,
        'primary_emotion': primary_emotion,
        'all_emotions': predicted_emotions,
        'num_chunks': len(chunk_probs)
    }

print("Long text processing function defined.")

Long text processing function defined.


### Label All Speeches
Process each speech and add emotion labels to the dataset.

In [8]:
# Identify the text column (adjust if needed after seeing the data)
# Common column names: 'text', 'speech', 'transcript', 'content'
text_column = None
for col in ['text', 'speech', 'transcript', 'content', 'Speech', 'Text', 'Transcript']:
    if col in df.columns:
        text_column = col
        break

if text_column is None:
    print("Available columns:", df.columns.tolist())
    print("\nPlease set text_column manually to the column containing speech text")
else:
    print(f"Using column '{text_column}' for speech text")
    print(f"Sample text length: {df[text_column].str.len().describe()}")

Using column 'speech' for speech text
Sample text length: count      995.000000
mean     17063.630151
std      10989.687657
min        482.000000
25%       6694.500000
50%      15204.000000
75%      28168.500000
max      32759.000000
Name: speech, dtype: float64
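
The maximum length of 32,759 characters is close to Excel's limit of 32,767 characters per cell, so it may be worth checking whether the longest speeches were cut off when the source spreadsheet was created. The sketch below counts texts that fall within a margin of that limit. The `near_limit` helper and the 100-character margin are assumptions for illustration, and the example runs on toy data rather than the real dataset.

```python
import pandas as pd

EXCEL_CELL_LIMIT = 32_767  # maximum number of characters Excel stores in one cell


def near_limit(texts, margin=100):
    """Flag texts whose length is within `margin` characters of the Excel cell limit."""
    lengths = texts.fillna("").astype(str).str.len()
    return lengths >= EXCEL_CELL_LIMIT - margin


# Toy data: a short text, one as long as the longest speech above, and a missing value.
toy = pd.DataFrame({"speech": ["short speech", "x" * 32_759, None]})
mask = near_limit(toy["speech"])
print(f"{int(mask.sum())} of {len(toy)} texts are within 100 characters of the limit")
```

On the real data, the same check would be `near_limit(df[text_column])`; a large count would suggest that the labels for those speeches describe only their opening portion.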


In [9]:
# Process all speeches and add emotion columns
# This may take a while for long speeches

results = []
threshold = 0.3  # Shared by the prediction function and the binary labels below

for idx, row in tqdm(df.iterrows(), total=len(df), desc="Labeling speeches"):
    text = row[text_column] if text_column else ""
    
    # Get emotion predictions
    prediction = predict_emotions_for_long_text(text, model, tokenizer, threshold=threshold)
    
    result = {
        'primary_emotion': prediction['primary_emotion'],
        'all_emotions': ','.join(prediction['all_emotions']),
    }
    
    # Add individual emotion probabilities
    for ekman in EKMAN_EMOTIONS:
        result[f'prob_{ekman}'] = prediction['ekman_probs'][ekman]
    
    # Add binary labels for each emotion (1 if prob >= threshold)
    for ekman in EKMAN_EMOTIONS:
        result[ekman] = 1 if prediction['ekman_probs'][ekman] >= threshold else 0
    
    results.append(result)

# Convert results to DataFrame and merge with original
results_df = pd.DataFrame(results)
df_labeled = pd.concat([df.reset_index(drop=True), results_df], axis=1)

print(f"\nLabeling complete!")
print(f"New columns added: {results_df.columns.tolist()}")

Labeling speeches:   0%|          | 0/995 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1383 > 512). Running this sequence through the model will result in indexing errors
Labeling speeches: 100%|██████████| 995/995 [02:02<00:00,  8.11it/s]


Labeling complete!
New columns added: ['primary_emotion', 'all_emotions', 'prob_anger', 'prob_disgust', 'prob_fear', 'prob_joy', 'prob_sadness', 'prob_surprise', 'prob_neutral', 'anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'neutral']
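
The binary labels can also be derived from the probability columns in one vectorized step, which is convenient when trying a different threshold without rerunning the model. The sketch below is a minimal version under that assumption; `add_binary_labels` is a hypothetical helper, and the two toy rows reuse the joy and neutral probabilities of the first two speeches in the results table of this notebook, with the other emotions set to zero for brevity.

```python
import pandas as pd

EKMAN_EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def add_binary_labels(probs_df, threshold=0.3):
    """Append one 0/1 column per emotion, set to 1 where prob_<emotion> >= threshold."""
    prob_cols = [f"prob_{e}" for e in EKMAN_EMOTIONS]
    binary = (probs_df[prob_cols] >= threshold).astype(int)
    binary.columns = EKMAN_EMOTIONS  # rename prob_joy -> joy, and so on
    return pd.concat([probs_df, binary], axis=1)


# Toy probabilities: only joy and neutral are filled in; the rest are placeholders.
toy = pd.DataFrame(
    [
        {"prob_joy": 0.434055, "prob_neutral": 0.366062},
        {"prob_joy": 0.804161, "prob_neutral": 0.047896},
    ]
).reindex(columns=[f"prob_{e}" for e in EKMAN_EMOTIONS], fill_value=0.0)

labeled = add_binary_labels(toy)
print(labeled[["joy", "neutral"]])

# Re-label with a stricter threshold without touching the model.
strict = add_binary_labels(toy, threshold=0.5)
print(strict[["joy", "neutral"]])
```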





### View Results and Statistics

In [10]:
# View sample of labeled data
print("Sample of labeled speeches:")
display_cols = [text_column, 'primary_emotion', 'all_emotions'] + [f'prob_{e}' for e in EKMAN_EMOTIONS]
df_labeled[display_cols].head(10)

Sample of labeled speeches:


| | speech | primary_emotion | all_emotions | prob_anger | prob_disgust | prob_fear | prob_joy | prob_sadness | prob_surprise | prob_neutral |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | As long as I am President of the United States... | joy | joy,neutral | 0.037528 | 0.003699 | 0.003882 | 0.434055 | 0.007555 | 0.021475 | 0.366062 |
| 1 | Hello, everybody. Well, thank you very much. ... | joy | joy | 0.015578 | 0.001833 | 0.002713 | 0.804161 | 0.008377 | 0.013426 | 0.047896 |
| 2 | Last night, the United States brought the worl... | joy | joy | 0.016062 | 0.009028 | 0.054276 | 0.614096 | 0.053937 | 0.03567 | 0.204883 |
| 3 | PRESIDENT TRUMP: Thank you very much. Thank... | joy | joy | 0.076832 | 0.011377 | 0.003931 | 0.3172 | 0.076258 | 0.147637 | 0.19681 |
| 4 | PRESIDENT TRUMP: Thank you very much. Mr. ... | joy | joy | 0.048789 | 0.002534 | 0.002411 | 0.439925 | 0.089808 | 0.027506 | 0.24204 |
| 5 | THE PRESIDENT: Thank you very much, everybody... | joy | joy | 0.073178 | 0.002377 | 0.002148 | 0.425831 | 0.077917 | 0.095724 | 0.215591 |
| 6 | Madam Speaker, Mr. Vice President, Members of... | joy | joy | 0.027317 | 0.002865 | 0.005867 | 0.312904 | 0.060393 | 0.072448 | 0.258047 |
| 7 | THE PRESIDENT: Just a short time ago, I had th... | joy | joy | 0.069184 | 0.003491 | 0.004149 | 0.426853 | 0.122895 | 0.027968 | 0.257375 |
| 8 | THE PRESIDENT: Madam President, Mr. Secretary... | joy | joy | 0.095226 | 0.003532 | 0.002939 | 0.456638 | 0.031205 | 0.02623 | 0.195302 |
| 9 | THE PRESIDENT: Thank you, Lee. Thank you, Lee... | joy | joy | 0.056596 | 0.002372 | 0.002041 | 0.465214 | 0.034339 | 0.066433 | 0.185618 |


In [11]:
# Emotion distribution statistics
print("\nEmotion Distribution:")
print("="*50)

print("\nPrimary emotion counts:")
print(df_labeled['primary_emotion'].value_counts())

print("\nBinary label distribution (speeches with each emotion):")
for ekman in EKMAN_EMOTIONS:
    count = df_labeled[ekman].sum()
    pct = count / len(df_labeled) * 100
    print(f"  {ekman}: {count} ({pct:.1f}%)")

print("\nAverage emotion probabilities:")
for ekman in EKMAN_EMOTIONS:
    avg_prob = df_labeled[f'prob_{ekman}'].mean()
    print(f"  {ekman}: {avg_prob:.3f}")


Emotion Distribution:

Primary emotion counts:
primary_emotion
neutral     608
joy         380
sadness       4
surprise      2
anger         1
Name: count, dtype: int64

Binary label distribution (speeches with each emotion):
  anger: 2 (0.2%)
  disgust: 0 (0.0%)
  fear: 0 (0.0%)
  joy: 552 (55.5%)
  sadness: 7 (0.7%)
  surprise: 6 (0.6%)
  neutral: 719 (72.3%)

Average emotion probabilities:
  anger: 0.045
  disgust: 0.002
  fear: 0.003
  joy: 0.323
  sadness: 0.042
  surprise: 0.064
  neutral: 0.427
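
Fear and disgust never reach the 0.3 threshold in these results. One possible contributor is the mean aggregation across chunks: an emotion that peaks in a single passage is diluted by the many chunks where it is absent. The toy comparison below shows the effect; the per-chunk values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up per-chunk probabilities for one emotion across five chunks.
chunk_probs = np.array([0.02, 0.03, 0.85, 0.04, 0.01])
threshold = 0.3

mean_score = chunk_probs.mean()                    # what the notebook's aggregation uses
max_score = chunk_probs.max()                      # alternative: strongest single chunk
share_above = (chunk_probs >= threshold).mean()    # alternative: fraction of chunks above threshold

print(f"mean={mean_score:.2f}, max={max_score:.2f}, share_above={share_above:.2f}")
print("flagged by mean:", bool(mean_score >= threshold))
print("flagged by max: ", bool(max_score >= threshold))
```

Here one strongly emotional chunk out of five yields a mean of 0.19, below the threshold, while the maximum is 0.85. Whether mean, max, or a share-of-chunks statistic is the better speech-level label depends on whether the label should describe the overall tone or the presence of any emotional passage.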


### Save Labeled Dataset

In [12]:
# Save to CSV
output_path = "data/presidential_speeches_ekman_labeled.csv"
df_labeled.to_csv(output_path, index=False)
print(f"Labeled dataset saved to: {output_path}")

# Also save as Excel if preferred
output_path_xlsx = "data/presidential_speeches_ekman_labeled.xlsx"
df_labeled.to_excel(output_path_xlsx, index=False)
print(f"Labeled dataset saved to: {output_path_xlsx}")

print(f"\nFinal dataset shape: {df_labeled.shape}")
print(f"Columns: {df_labeled.columns.tolist()}")

Labeled dataset saved to: data/presidential_speeches_ekman_labeled.csv
Labeled dataset saved to: data/presidential_speeches_ekman_labeled.xlsx

Final dataset shape: (995, 25)
Columns: ['President', 'Party', 'from', 'until', 'Vice President', 'title', 'date', 'info', 'speech', 'primary_emotion', 'all_emotions', 'prob_anger', 'prob_disgust', 'prob_fear', 'prob_joy', 'prob_sadness', 'prob_surprise', 'prob_neutral', 'anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'neutral']
