# Sentiment Analysis of Donald Trump Rally Speeches

This notebook runs sentiment analysis on 35 of Donald Trump's rally speeches from July 2019 to September 2020 using **FinBERT**, a BERT model fine-tuned for financial sentiment analysis. Although FinBERT was trained on financial text, its three-class output (positive, negative, neutral) can be applied to political speech; because of that domain mismatch, the scores are more useful for comparing speeches with each other than as absolute measures of tone.

## Analysis Overview
- **Model**: ProsusAI/finbert - Pre-trained BERT for sentiment classification
- **Approach**: Chunk long speeches into manageable segments for BERT processing
- **Output**: Sentiment scores (positive, negative, neutral) for each speech
- **Insights**: Temporal trends, location-based patterns, and aggregate sentiment metrics

## Import Libraries

Loading required libraries for deep learning, NLP, and visualization.

In [1]:
# Suppress warnings for cleaner output
import warnings
import os
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Suppress TensorFlow warnings
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'  # Disable symlink warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import math
import tensorflow as tf
from tqdm.notebook import tqdm
from typing import List, Tuple, Dict

# Configure TensorFlow logging
import logging
logging.getLogger('tensorflow').setLevel(logging.ERROR)

from transformers import pipeline, AutoTokenizer, TFBertForSequenceClassification
from scipy.special import softmax
from tensorflow.python.ops.numpy_ops import np_config

# Enable NumPy behavior for TensorFlow
np_config.enable_numpy_behavior()

# Set visualization styles
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

print(f"✅ Libraries loaded successfully")
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {len(tf.config.list_physical_devices('GPU')) > 0}")

✅ Libraries loaded successfully
TensorFlow version: 2.20.0
GPU Available: False


## Loading and overview of stored dataset

In [2]:
# Load the dataset prepared in the Word Clouds notebook
%store -r DT_rally_speaches_dataset
df = DT_rally_speaches_dataset.copy()

print(f"Loaded {len(df)} speeches")
print(f"Date range: {df['Month'].iloc[0]} {df['Year'].iloc[0]} - {df['Month'].iloc[-1]} {df['Year'].iloc[-1]}")

Loaded 35 speeches
Date range: Jul 2019 - Sep 2020


In [3]:
df.head()

Unnamed: 0,Location,Month,Year,filename,content,Month_Num,Date,word_count
0,Greenville,Jul,2019,GreenvilleJul17_2019.txt,Thank you very much. Thank you. Thank you. Tha...,7,2019-07-15,10605
1,Cincinnati,Aug,2019,CincinnatiAug1_2019.txt,Thank you all. Thank you very much. Thank you ...,8,2019-08-15,8170
2,New Hampshire,Aug,2019,NewHampshireAug15_2019.txt,Thank you very much everybody. Thank you. Wow...,8,2019-08-15,10141
3,Texas,Sep,2019,TexasSep23_2019.txt,"Hello, Houston. I am so thrilled to be here in...",9,2019-09-15,2487
4,New Mexico,Sep,2019,NewMexicoSep16_2019.txt,"Wow, thank you. Thank you, New Mexico. Thank ...",9,2019-09-15,11498


## Model and tokenizer setup

We're using **FinBERT** (ProsusAI/finbert), a BERT model fine-tuned for financial sentiment analysis. It classifies text into three categories:
- **Positive**: Optimistic, confident language
- **Negative**: Critical, pessimistic language  
- **Neutral**: Factual, balanced statements

**Note**: The model download may take a few minutes on first run. Progress bars and some warnings are normal during model loading.

In [4]:
# Load FinBERT model and tokenizer
MODEL_CHECKPOINT = 'ProsusAI/finbert'

print("🔄 Loading FinBERT model and tokenizer...")
print("   (This may take a few moments on first run)\n")

# Suppress transformers warnings temporarily during model loading
import transformers
transformers.logging.set_verbosity_error()

try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
    model = TFBertForSequenceClassification.from_pretrained(MODEL_CHECKPOINT)
    
    # Display model configuration
    print(f"✅ Model loaded successfully!")
    print(f"\n📋 Model Configuration:")
    print(f"   Model: {MODEL_CHECKPOINT}")
    print(f"   Labels: {model.config.id2label}")
    print(f"   Max sequence length: {tokenizer.model_max_length} tokens")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("💡 Tip: Check your internet connection or try re-running the cell")
    raise

# Restore normal logging for rest of notebook
transformers.logging.set_verbosity_warning()

🔄 Loading FinBERT model and tokenizer...
   (This may take a few moments on first run)

✅ Model loaded successfully!

📋 Model Configuration:
   Model: ProsusAI/finbert
   Labels: {0: 'positive', 1: 'negative', 2: 'neutral'}
   Max sequence length: 512 tokens
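
The label order printed above matters downstream, because the model returns one probability per index. A minimal sketch of reading a probability vector against that mapping follows; `ID2LABEL` is copied from the output above, and `dominant_label` is a hypothetical helper name, not part of the notebook or of `transformers`.

```python
# Label order copied from the model configuration printed above.
ID2LABEL = {0: "positive", 1: "negative", 2: "neutral"}

def dominant_label(probs):
    """Return the label whose probability is highest (hypothetical helper)."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return ID2LABEL[best_index]

# Illustrative probability vector in (positive, negative, neutral) order.
example = [0.43, 0.32, 0.25]
print(dominant_label(example))
```

Swapping the order of the indices would silently relabel every score, so it is worth checking `model.config.id2label` whenever a different checkpoint is loaded.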


## Text Processing Functions

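BERT models accept at most 512 tokens per input, including the special `[CLS]` and `[SEP]` tokens. The speeches are much longer than that, so we split each one into chunks of up to 510 content tokens and score every chunk separately.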
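
To make the chunk arithmetic concrete, here is a small sketch of how a token count maps to chunk boundaries when each chunk holds at most 510 content tokens. The helper name `chunk_bounds` is hypothetical; the 14,239-token figure is the sequence length reported in the tokenizer warning in this notebook's output.

```python
import math

def chunk_bounds(n_tokens, max_length=510):
    """Return (start, end) token offsets for consecutive chunks (hypothetical helper)."""
    return [(start, min(start + max_length, n_tokens))
            for start in range(0, n_tokens, max_length)]

n_tokens = 14239  # sequence length reported in the tokenizer warning
bounds = chunk_bounds(n_tokens)

# The number of chunks is the token count divided by the chunk size, rounded up.
assert len(bounds) == math.ceil(n_tokens / 510)
print(len(bounds), bounds[-1])
```

This yields 28 chunks, which is consistent with the 28 chunks reported for the Greenville speech. The last chunk is shorter than the others and is padded up to the full length before it reaches the model.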

In [5]:
def chunk_text_for_bert(text: str, tokenizer, max_length: int = 510) -> List[Dict]:
    """
    Split text into chunks that fit within BERT's token limits.
    
    Parameters:
        text: Input text to chunk
        tokenizer: HuggingFace tokenizer
        max_length: Maximum tokens per chunk (510 to leave room for [CLS] and [SEP])
        
    Returns:
        List of encoded chunks ready for model input
    """
    # Tokenize the full text
    tokens = tokenizer.tokenize(text)
    
    # Split into chunks
    chunks = []
    for i in range(0, len(tokens), max_length):
        chunk_tokens = tokens[i:i + max_length]
        chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
        
        # Encode with special tokens by calling the tokenizer directly
        encoding = tokenizer(
            chunk_text,
            add_special_tokens=True,
            max_length=max_length + 2,  # +2 for [CLS] and [SEP]
            padding='max_length',
            truncation=True,
            return_tensors='tf'
        )
        chunks.append(encoding)
    
    return chunks


def analyze_sentiment(chunks: List[Dict], model) -> Tuple[np.ndarray, np.ndarray]:
    """
    Run sentiment analysis on text chunks.
    
    Parameters:
        chunks: List of encoded text chunks
        model: Loaded sentiment analysis model
        
    Returns:
        Tuple of (all_predictions, mean_sentiment) as numpy arrays
    """
    all_predictions = []
    
    for chunk in chunks:
        # Get model predictions
        outputs = model(chunk)
        
        # Convert logits to probabilities
        probs = tf.nn.softmax(outputs.logits, axis=-1)
        all_predictions.append(probs.numpy())
    
    # Stack all predictions
    all_predictions = np.vstack(all_predictions)
    
    # Calculate mean sentiment across all chunks
    mean_sentiment = np.mean(all_predictions, axis=0)
    
    return all_predictions, mean_sentiment
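
The two numerical steps in `analyze_sentiment`, a softmax over the logits and an unweighted mean over chunks, can be sketched without TensorFlow. The logits below are made-up illustrative values, not model output, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_over_chunks(chunk_probs):
    """Average each class probability across chunks (unweighted)."""
    n = len(chunk_probs)
    return [sum(column) / n for column in zip(*chunk_probs)]

# Made-up logits for three chunks, in (positive, negative, neutral) order.
chunk_logits = [[1.2, 0.3, -0.5], [0.1, 0.9, 0.0], [0.8, 0.2, 0.1]]
chunk_probs = [softmax(row) for row in chunk_logits]
speech_score = mean_over_chunks(chunk_probs)
print([round(p, 3) for p in speech_score])
```

Because every chunk counts equally, a short, padded final chunk carries the same weight as a full 510-token chunk.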

In [6]:
# Process all speeches with progress tracking
print("🔄 Processing all speeches for sentiment analysis...\n")

sentiment_results = []

for idx, row in tqdm(df.iterrows(), total=len(df), desc="Analyzing speeches"):
    try:
        # Chunk the speech text
        chunks = chunk_text_for_bert(row['content'], tokenizer, max_length=510)
        
        # Analyze sentiment
        chunk_predictions, mean_sentiment = analyze_sentiment(chunks, model)
        
        # Store results
        sentiment_results.append({
            'speech_idx': idx,
            'location': row['Location'],
            'month': row['Month'],
            'year': row['Year'],
            'num_chunks': len(chunks),
            'positive': mean_sentiment[0],
            'negative': mean_sentiment[1],
            'neutral': mean_sentiment[2],
            'chunk_predictions': chunk_predictions,
            'dominant_sentiment': model.config.id2label[np.argmax(mean_sentiment)]
        })
        
    except Exception as e:
        print(f"\n⚠️  Error processing speech {idx} ({row['Location']}): {e}")
        continue

print(f"\n✅ Successfully analyzed {len(sentiment_results)} speeches!")

# Create results DataFrame
sentiment_df = pd.DataFrame([{k: v for k, v in r.items() if k != 'chunk_predictions'} 
                              for r in sentiment_results])

🔄 Processing all speeches for sentiment analysis...



Analyzing speeches:   0%|          | 0/35 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (14239 > 512). Running this sequence through the model will result in indexing errors



✅ Successfully analyzed 35 speeches!


## Sentiment Analysis Results

Let's examine the sentiment scores for each speech.

In [7]:
# Display sentiment scores
print("Sentiment Scores by Speech:\n")
print("="*80)

for idx, row in sentiment_df.iterrows():
    print(f"{row['location']:.<30} ({row['month']} {row['year']})")
    print(f"   Positive: {row['positive']:.3f} | Negative: {row['negative']:.3f} | Neutral: {row['neutral']:.3f}")
    print(f"   Dominant: {row['dominant_sentiment']} | Chunks: {row['num_chunks']}")
    print()

# Show DataFrame
sentiment_df.head(10)

Sentiment Scores by Speech:

Greenville.................... (Jul 2019)
   Positive: 0.572 | Negative: 0.274 | Neutral: 0.153
   Dominant: positive | Chunks: 28

Cincinnati.................... (Aug 2019)
   Positive: 0.565 | Negative: 0.279 | Neutral: 0.156
   Dominant: positive | Chunks: 21

New Hampshire................. (Aug 2019)
   Positive: 0.567 | Negative: 0.277 | Neutral: 0.155
   Dominant: positive | Chunks: 26

Texas......................... (Sep 2019)
   Positive: 0.426 | Negative: 0.324 | Neutral: 0.250
   Dominant: positive | Chunks: 6

New Mexico.................... (Sep 2019)
   Positive: 0.551 | Negative: 0.283 | Neutral: 0.166
   Dominant: positive | Chunks: 31

Fayetteville.................. (Sep 2019)
   Positive: 0.541 | Negative: 0.289 | Neutral: 0.170
   Dominant: positive | Chunks: 24

Dallas........................ (Oct 2019)
   Positive: 0.560 | Negative: 0.282 | Neutral: 0.159
   Dominant: positive | Chunks: 28

Minneapolis................... (Oct 2019)
   Pos

Unnamed: 0,speech_idx,location,month,year,num_chunks,positive,negative,neutral,dominant_sentiment
0,0,Greenville,Jul,2019,28,0.572434,0.274166,0.1534,positive
1,1,Cincinnati,Aug,2019,21,0.564755,0.279329,0.155916,positive
2,2,New Hampshire,Aug,2019,26,0.567399,0.277195,0.155406,positive
3,3,Texas,Sep,2019,6,0.425736,0.324423,0.249841,positive
4,4,New Mexico,Sep,2019,31,0.550716,0.28288,0.166404,positive
5,5,Fayetteville,Sep,2019,24,0.540782,0.289273,0.169946,positive
6,6,Dallas,Oct,2019,28,0.559747,0.281512,0.158741,positive
7,7,Minneapolis,Oct,2019,31,0.57617,0.274016,0.149814,positive
8,8,Tupelo,Nov,2019,25,0.576783,0.270095,0.153122,positive
9,9,Lexington,Nov,2019,24,0.566354,0.275151,0.158495,positive


## Interactive Visualizations

Creating interactive charts to explore sentiment patterns.

In [8]:
# Create comprehensive sentiment visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Sentiment Distribution Across All Speeches',
                    'Average Sentiment by Year',
                    'Dominant Sentiment Count',
                    'Sentiment Trends Over Time'),
    specs=[[{"type": "bar"}, {"type": "bar"}],
           [{"type": "pie"}, {"type": "scatter"}]]
)

# 1. Overall sentiment distribution (grouped bars, see barmode in the layout below)
fig.add_trace(
    go.Bar(name='Positive', x=sentiment_df['location'], y=sentiment_df['positive'],
           marker_color='#2ecc71', showlegend=True),
    row=1, col=1
)
fig.add_trace(
    go.Bar(name='Negative', x=sentiment_df['location'], y=sentiment_df['negative'],
           marker_color='#e74c3c', showlegend=True),
    row=1, col=1
)
fig.add_trace(
    go.Bar(name='Neutral', x=sentiment_df['location'], y=sentiment_df['neutral'],
           marker_color='#95a5a6', showlegend=True),
    row=1, col=1
)

# 2. Average sentiment by year
year_avg = sentiment_df.groupby('year')[['positive', 'negative', 'neutral']].mean()
fig.add_trace(
    go.Bar(name='Positive', x=year_avg.index, y=year_avg['positive'],
           marker_color='#2ecc71', showlegend=False),
    row=1, col=2
)
fig.add_trace(
    go.Bar(name='Negative', x=year_avg.index, y=year_avg['negative'],
           marker_color='#e74c3c', showlegend=False),
    row=1, col=2
)
fig.add_trace(
    go.Bar(name='Neutral', x=year_avg.index, y=year_avg['neutral'],
           marker_color='#95a5a6', showlegend=False),
    row=1, col=2
)

# 3. Dominant sentiment pie chart
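sentiment_counts = sentiment_df['dominant_sentiment'].value_counts()
# Look colors up by label so each slice keeps its color regardless of count order
label_colors = {'positive': '#2ecc71', 'negative': '#e74c3c', 'neutral': '#95a5a6'}
fig.add_trace(
    go.Pie(labels=sentiment_counts.index, values=sentiment_counts.values,
           marker=dict(colors=[label_colors[label] for label in sentiment_counts.index]),
           showlegend=False),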
    row=2, col=1
)

# 4. Sentiment timeline
fig.add_trace(
    go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['positive'],
               mode='lines+markers', name='Positive',
               line=dict(color='#2ecc71', width=2), showlegend=False),
    row=2, col=2
)
fig.add_trace(
    go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['negative'],
               mode='lines+markers', name='Negative',
               line=dict(color='#e74c3c', width=2), showlegend=False),
    row=2, col=2
)
fig.add_trace(
    go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['neutral'],
               mode='lines+markers', name='Neutral',
               line=dict(color='#95a5a6', width=2), showlegend=False),
    row=2, col=2
)

# Update layout
fig.update_layout(
    height=900,
    title_text="Comprehensive Sentiment Analysis Dashboard",
    showlegend=True,
    barmode='group',
    template='plotly_white'
)

fig.update_xaxes(tickangle=-45, row=1, col=1)
fig.update_yaxes(title_text="Probability", row=1, col=1)
fig.update_yaxes(title_text="Probability", row=1, col=2)
fig.update_yaxes(title_text="Sentiment Score", row=2, col=2)
fig.update_xaxes(title_text="Speech Index", row=2, col=2)

fig.show()

## Sentiment Heatmap

Visualizing sentiment patterns across speeches and time.

In [9]:
# Create sentiment heatmap
heatmap_data = sentiment_df[['positive', 'negative', 'neutral']].T
heatmap_data.columns = [f"{row['location'][:15]}..." if len(row['location']) > 15 
                        else row['location'] 
                        for _, row in sentiment_df.iterrows()]

fig = go.Figure(data=go.Heatmap(
    z=heatmap_data.values,
    x=heatmap_data.columns,
    y=['Positive', 'Negative', 'Neutral'],
    colorscale='RdYlGn',
    text=heatmap_data.values,
    texttemplate='%{text:.2f}',
    textfont={"size": 10},
    colorbar=dict(title="Probability")
))

fig.update_layout(
    title='Sentiment Heatmap: All Speeches',
    xaxis_title='Speech Location',
    yaxis_title='Sentiment Type',
    height=400,
    template='plotly_white'
)

fig.update_xaxes(tickangle=-45)
fig.show()

## Chunk-Level Sentiment Analysis

Examining sentiment variation within individual speeches.

In [10]:
# Select four speeches spread evenly through the dataset to examine in detail
selected_speeches = [0, 10, 20, 30]  # Every tenth speech, in original dataset order

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[f"{sentiment_results[i]['location']} ({sentiment_results[i]['month']} {sentiment_results[i]['year']})" 
                    for i in selected_speeches]
)

positions = [(1, 1), (1, 2), (2, 1), (2, 2)]

for idx, (speech_idx, pos) in enumerate(zip(selected_speeches, positions)):
    result = sentiment_results[speech_idx]
    chunks = result['chunk_predictions']
    
    chunk_indices = list(range(len(chunks)))
    
    # Add traces for each sentiment
    fig.add_trace(
        go.Scatter(x=chunk_indices, y=chunks[:, 0],
                   mode='lines+markers', name='Positive',
                   line=dict(color='#2ecc71'), showlegend=(idx == 0)),
        row=pos[0], col=pos[1]
    )
    fig.add_trace(
        go.Scatter(x=chunk_indices, y=chunks[:, 1],
                   mode='lines+markers', name='Negative',
                   line=dict(color='#e74c3c'), showlegend=(idx == 0)),
        row=pos[0], col=pos[1]
    )
    fig.add_trace(
        go.Scatter(x=chunk_indices, y=chunks[:, 2],
                   mode='lines+markers', name='Neutral',
                   line=dict(color='#95a5a6'), showlegend=(idx == 0)),
        row=pos[0], col=pos[1]
    )
    
    fig.update_xaxes(title_text="Chunk Index", row=pos[0], col=pos[1])
    fig.update_yaxes(title_text="Sentiment Score", row=pos[0], col=pos[1])

fig.update_layout(
    height=800,
    title_text="Sentiment Variation Within Individual Speeches",
    template='plotly_white',
    showlegend=True
)

fig.show()

## Temporal Analysis: Sentiment Over Time

Analyzing how sentiment evolved throughout 2019 and 2020.

In [11]:
# Add chronological date information to sentiment_df
month_map = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
             'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
sentiment_df['month_num'] = sentiment_df['month'].map(month_map)
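# The results table carries only month and year, so each speech is placed at mid-month (day 15)
sentiment_df['date'] = pd.to_datetime(sentiment_df['year'].astype(str) + '-' + 
                                       sentiment_df['month_num'].astype(str) + '-15')
# A stable sort keeps speeches from the same month in their original order
sentiment_df = sentiment_df.sort_values('date', kind='stable')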

# Create temporal visualization
fig = go.Figure()

# Add sentiment traces
fig.add_trace(go.Scatter(
    x=sentiment_df['date'],
    y=sentiment_df['positive'],
    mode='lines+markers',
    name='Positive',
    line=dict(color='#2ecc71', width=3),
    marker=dict(size=8),
    hovertemplate='<b>%{text}</b><br>Positive: %{y:.3f}<extra></extra>',
    text=sentiment_df['location']
))

fig.add_trace(go.Scatter(
    x=sentiment_df['date'],
    y=sentiment_df['negative'],
    mode='lines+markers',
    name='Negative',
    line=dict(color='#e74c3c', width=3),
    marker=dict(size=8),
    hovertemplate='<b>%{text}</b><br>Negative: %{y:.3f}<extra></extra>',
    text=sentiment_df['location']
))

fig.add_trace(go.Scatter(
    x=sentiment_df['date'],
    y=sentiment_df['neutral'],
    mode='lines+markers',
    name='Neutral',
    line=dict(color='#95a5a6', width=3),
    marker=dict(size=8),
    hovertemplate='<b>%{text}</b><br>Neutral: %{y:.3f}<extra></extra>',
    text=sentiment_df['location']
))

# Add vertical line to separate years using an explicit shape + annotation with Python datetime
import datetime
vline_dt = datetime.datetime(2020, 1, 1)
fig.add_shape(dict(type='line', x0=vline_dt, x1=vline_dt, y0=0, y1=1, xref='x', yref='paper',
                   line=dict(dash='dash', color='gray')))
fig.add_annotation(dict(x=vline_dt, y=1.02, xref='x', yref='paper', showarrow=False, text='2020 Begins'))

fig.update_layout(
    title='Sentiment Evolution Over Time (2019-2020)',
    xaxis_title='Date',
    yaxis_title='Sentiment Score',
    height=500,
    template='plotly_white',
    hovermode='x unified'
)

fig.show()

# Calculate rolling average
window = 3
sentiment_df['positive_ma'] = sentiment_df['positive'].rolling(window=window, center=True).mean()
sentiment_df['negative_ma'] = sentiment_df['negative'].rolling(window=window, center=True).mean()
sentiment_df['neutral_ma'] = sentiment_df['neutral'].rolling(window=window, center=True).mean()

# Plot with moving average
fig2 = go.Figure()

# Raw data (lighter)
fig2.add_trace(go.Scatter(x=sentiment_df['date'], y=sentiment_df['positive'],
                          mode='markers', name='Positive (raw)',
                          marker=dict(color='#2ecc71', size=6, opacity=0.3),
                          showlegend=True))
fig2.add_trace(go.Scatter(x=sentiment_df['date'], y=sentiment_df['negative'],
                          mode='markers', name='Negative (raw)',
                          marker=dict(color='#e74c3c', size=6, opacity=0.3),
                          showlegend=True))

# Moving averages (bold)
fig2.add_trace(go.Scatter(x=sentiment_df['date'], y=sentiment_df['positive_ma'],
                          mode='lines', name=f'Positive ({window}-speech avg)',
                          line=dict(color='#2ecc71', width=4)))
fig2.add_trace(go.Scatter(x=sentiment_df['date'], y=sentiment_df['negative_ma'],
                          mode='lines', name=f'Negative ({window}-speech avg)',
                          line=dict(color='#e74c3c', width=4)))

# Add the same vertical line to fig2
fig2.add_shape(dict(type='line', x0=vline_dt, x1=vline_dt, y0=0, y1=1, xref='x', yref='paper',
                    line=dict(dash='dash', color='gray')))
fig2.add_annotation(dict(x=vline_dt, y=1.02, xref='x', yref='paper', showarrow=False, text='2020 Begins'))

fig2.update_layout(
    title=f'Sentiment Trends with {window}-Speech Moving Average',
    xaxis_title='Date',
    yaxis_title='Sentiment Score',
    height=500,
    template='plotly_white'
)

fig2.show()
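
The smoothing above uses a centered window of three speeches, so the first and last speeches have no average. A plain-Python sketch of the same idea follows; `centered_rolling_mean` is a hypothetical helper that mirrors what `rolling(window=3, center=True).mean()` does for an odd window, with `None` standing in for pandas' `NaN`.

```python
def centered_rolling_mean(values, window=3):
    """Centered moving average for an odd window; edges without a full window are None."""
    half = window // 2
    result = []
    for i in range(len(values)):
        if i - half < 0 or i + half >= len(values):
            result.append(None)  # not enough neighbours on one side
        else:
            segment = values[i - half:i + half + 1]
            result.append(sum(segment) / window)
    return result

# First five positive scores from the results table, in dataset order.
positive = [0.572434, 0.564755, 0.567399, 0.425736, 0.550716]
print(centered_rolling_mean(positive))
```

A single low score such as the 0.426 for Texas enters three consecutive windows, so one outlier appears as a dip spanning several points in the smoothed line.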


## Year-over-Year Comparison

Comparing sentiment patterns between 2019 and 2020.

In [12]:
# Compare sentiment statistics by year
year_stats = sentiment_df.groupby('year').agg({
    'positive': ['mean', 'std', 'min', 'max'],
    'negative': ['mean', 'std', 'min', 'max'],
    'neutral': ['mean', 'std', 'min', 'max'],
    'speech_idx': 'count'
}).round(3)

print("Year-over-Year Sentiment Statistics:")
print("="*80)
print(year_stats)
print()

# Create box plots for sentiment distribution by year
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=('Positive Sentiment', 'Negative Sentiment', 'Neutral Sentiment')
)

for year in sentiment_df['year'].unique():
    year_data = sentiment_df[sentiment_df['year'] == year]
    
    fig.add_trace(
        go.Box(y=year_data['positive'], name=year, showlegend=True,
               marker_color='#2ecc71' if year == '2019' else '#27ae60'),
        row=1, col=1
    )
    fig.add_trace(
        go.Box(y=year_data['negative'], name=year, showlegend=False,
               marker_color='#e74c3c' if year == '2019' else '#c0392b'),
        row=1, col=2
    )
    fig.add_trace(
        go.Box(y=year_data['neutral'], name=year, showlegend=False,
               marker_color='#95a5a6' if year == '2019' else '#7f8c8d'),
        row=1, col=3
    )

fig.update_layout(
    title_text='Sentiment Distribution by Year',
    height=400,
    template='plotly_white',
    showlegend=True
)

fig.update_yaxes(title_text="Sentiment Score", row=1, col=1)
fig.update_yaxes(title_text="Sentiment Score", row=1, col=2)
fig.update_yaxes(title_text="Sentiment Score", row=1, col=3)

fig.show()

# Per-year summary (mean ± standard deviation)
print("\n📊 Key Insights:")
print("="*80)
for year in sorted(sentiment_df['year'].unique()):
    year_data = sentiment_df[sentiment_df['year'] == year]
    print(f"\n{year}:")
    print(f"  • Average Positive: {year_data['positive'].mean():.3f} (±{year_data['positive'].std():.3f})")
    print(f"  • Average Negative: {year_data['negative'].mean():.3f} (±{year_data['negative'].std():.3f})")
    print(f"  • Average Neutral:  {year_data['neutral'].mean():.3f} (±{year_data['neutral'].std():.3f})")
    print(f"  • Speeches: {len(year_data)}")

Year-over-Year Sentiment Statistics:
     positive                      negative                      neutral  \
         mean    std    min    max     mean    std    min    max    mean   
year                                                                       
2019    0.555  0.042  0.426  0.580    0.281  0.015  0.269  0.324   0.164   
2020    0.568  0.011  0.545  0.586    0.278  0.006  0.268  0.288   0.154   

                          speech_idx  
        std    min    max      count  
year                                  
2019  0.028  0.149  0.250         12  
2020  0.006  0.146  0.167         23  




📊 Key Insights:

2019:
  • Average Positive: 0.555 (±0.042)
  • Average Negative: 0.281 (±0.015)
  • Average Neutral:  0.164 (±0.028)
  • Speeches: 12

2020:
  • Average Positive: 0.568 (±0.011)
  • Average Negative: 0.278 (±0.006)
  • Average Neutral:  0.154 (±0.006)
  • Speeches: 23
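
The yearly means differ by about 0.013 on the positive score, and the 2019 spread is much wider than the 2020 spread. One way to judge whether such a gap is large relative to the spread is Welch's t statistic, sketched below from the summary numbers printed above. This is an illustrative addition rather than a step the notebook performs, and it treats the speeches as independent samples, which is an assumption.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic for two samples with unequal variances."""
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_b - mean_a) / standard_error

# Positive-score summary statistics from the year-over-year table.
t_stat = welch_t(mean_a=0.555, std_a=0.042, n_a=12,
                 mean_b=0.568, std_b=0.011, n_b=23)
print(round(t_stat, 2))
```

The statistic comes out near 1.05, meaning the gap between the years is roughly one standard error wide.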


## Summary Statistics and Insights

In [13]:
# Comprehensive summary
print("=" * 80)
print("📊 SENTIMENT ANALYSIS SUMMARY")
print("=" * 80)

# Overall statistics
print(f"\n🎤 Dataset Overview:")
print(f"   Total Speeches Analyzed: {len(sentiment_df)}")
print(f"   Time Period: {sentiment_df['date'].min().strftime('%B %Y')} - {sentiment_df['date'].max().strftime('%B %Y')}")
print(f"   Total Text Chunks Processed: {sentiment_df['num_chunks'].sum():,}")
print(f"   Average Chunks per Speech: {sentiment_df['num_chunks'].mean():.1f}")

# Overall sentiment averages
print(f"\n📈 Overall Sentiment Scores:")
print(f"   Positive: {sentiment_df['positive'].mean():.3f} (±{sentiment_df['positive'].std():.3f})")
print(f"   Negative: {sentiment_df['negative'].mean():.3f} (±{sentiment_df['negative'].std():.3f})")
print(f"   Neutral:  {sentiment_df['neutral'].mean():.3f} (±{sentiment_df['neutral'].std():.3f})")

# Dominant sentiment
dominant_counts = sentiment_df['dominant_sentiment'].value_counts()
print(f"\n🎯 Dominant Sentiment Distribution:")
for sentiment, count in dominant_counts.items():
    percentage = (count / len(sentiment_df)) * 100
    print(f"   {sentiment}: {count} speeches ({percentage:.1f}%)")

# Most positive and most negative speeches
most_positive = sentiment_df.nlargest(3, 'positive')
most_negative = sentiment_df.nlargest(3, 'negative')

print(f"\n✨ Most Positive Speeches:")
for _, row in most_positive.iterrows():
    print(f"   • {row['location']} ({row['month']} {row['year']}): {row['positive']:.3f}")

print(f"\n⚠️  Most Negative Speeches:")
for _, row in most_negative.iterrows():
    print(f"   • {row['location']} ({row['month']} {row['year']}): {row['negative']:.3f}")

# Sentiment volatility (speeches with high variance in chunks)
print(f"\n📊 Sentiment Variation:")
chunk_variances = []
for result in sentiment_results:
    chunks = result['chunk_predictions']
    variance = np.var(chunks, axis=0).mean()
    chunk_variances.append((result['location'], result['month'], result['year'], variance))

chunk_variances.sort(key=lambda x: x[3], reverse=True)
print(f"   Speeches with Most Sentiment Variation:")
for location, month, year, var in chunk_variances[:3]:
    print(f"   • {location} ({month} {year}): variance = {var:.4f}")

print(f"\n   Speeches with Most Consistent Sentiment:")
for location, month, year, var in chunk_variances[-3:]:
    print(f"   • {location} ({month} {year}): variance = {var:.4f}")

print("\n" + "=" * 80)

📊 SENTIMENT ANALYSIS SUMMARY

🎤 Dataset Overview:
   Total Speeches Analyzed: 35
   Time Period: July 2019 - September 2020
   Total Text Chunks Processed: 985
   Average Chunks per Speech: 28.1

📈 Overall Sentiment Scores:
   Positive: 0.563 (±0.026)
   Negative: 0.279 (±0.010)
   Neutral:  0.158 (±0.017)

🎯 Dominant Sentiment Distribution:
   positive: 35 speeches (100.0%)

✨ Most Positive Speeches:
   • Minden (Sep 2020): 0.586
   • Tulsa (Jun 2020): 0.585
   • Pittsburgh (Sep 2020): 0.582

⚠️  Most Negative Speeches:
   • Texas (Sep 2019): 0.324
   • Fayetteville (Sep 2019): 0.289
   • Charlotte (Mar 2020): 0.288

📊 Sentiment Variation:
   Speeches with Most Sentiment Variation:
   • Texas (Sep 2019): variance = 0.0054
   • New Mexico (Sep 2019): variance = 0.0035
   • Charlotte (Mar 2020): variance = 0.0030

   Speeches with Most Consistent Sentiment:
   • Mosinee (Sep 2020): variance = 0.0006
   • Pittsburgh (Sep 2020): variance = 0.0005
   • Freeland (Sep 2020): variance = 0.000
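
The variation figure above is the variance of each class probability across a speech's chunks, averaged over the three classes. A small sketch with made-up chunk probabilities follows; `statistics.pvariance` is used because `np.var` computes the population variance by default, and `mean_class_variance` is a hypothetical helper name.

```python
from statistics import pvariance

def mean_class_variance(chunk_probs):
    """Population variance per class across chunks, averaged over classes."""
    per_class = [pvariance(column) for column in zip(*chunk_probs)]
    return sum(per_class) / len(per_class)

# Made-up chunk probabilities in (positive, negative, neutral) order.
steady = [[0.57, 0.28, 0.15], [0.56, 0.28, 0.16], [0.58, 0.27, 0.15]]
swingy = [[0.70, 0.20, 0.10], [0.40, 0.40, 0.20], [0.55, 0.30, 0.15]]
print(mean_class_variance(steady) < mean_class_variance(swingy))
```

A speech whose chunks all score alike gets a value near zero, while one that alternates between praise and criticism gets a larger value.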

## Save Results to DataFrame

Adding sentiment scores to the original dataset for further analysis.
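
Because the results table is re-sorted by date earlier in the notebook, its row order is not guaranteed to match the original dataset. Matching rows on `speech_idx` avoids pairing a score with the wrong speech. The sketch below shows the idea with plain dictionaries; the two rows and their scores are taken from the results table, and the reversed order is an assumed illustration of what sorting can do to rows that share a date.

```python
# Original dataset order (row index and location).
speeches = [{"idx": 8, "location": "Tupelo"},
            {"idx": 9, "location": "Lexington"}]

# Results after a hypothetical re-sort that swapped two same-month rows.
results_sorted = [{"speech_idx": 9, "positive": 0.566354},
                  {"speech_idx": 8, "positive": 0.576783}]

# Positional pairing silently attaches the wrong score.
by_position = [dict(s, positive=r["positive"])
               for s, r in zip(speeches, results_sorted)]

# Keyed pairing looks each score up by its speech index.
lookup = {r["speech_idx"]: r["positive"] for r in results_sorted}
by_key = [dict(s, positive=lookup[s["idx"]]) for s in speeches]

print(by_position[0])
print(by_key[0])
```

In pandas the keyed version corresponds to assigning a Series whose index is `speech_idx`, since column assignment aligns on the index rather than on row position.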

In [14]:
# Merge sentiment scores back into original DataFrame.
# sentiment_df was re-sorted by date in an earlier cell, so align on speech_idx
# (the original row index) instead of relying on row position.
sentiment_by_idx = sentiment_df.set_index('speech_idx')
df_with_sentiment = df.copy()
df_with_sentiment['sentiment_positive'] = sentiment_by_idx['positive']
df_with_sentiment['sentiment_negative'] = sentiment_by_idx['negative']
df_with_sentiment['sentiment_neutral'] = sentiment_by_idx['neutral']
df_with_sentiment['dominant_sentiment'] = sentiment_by_idx['dominant_sentiment']

# Store the enhanced dataset
DT_rally_speeches_with_sentiment = df_with_sentiment
%store DT_rally_speeches_with_sentiment

print("✅ Sentiment scores added to DataFrame!")
print(f"\nNew columns: sentiment_positive, sentiment_negative, sentiment_neutral, dominant_sentiment")
print(f"\nDataFrame shape: {df_with_sentiment.shape}")
print("\n📁 Dataset stored as 'DT_rally_speeches_with_sentiment' for use in other notebooks")

# Display sample
df_with_sentiment[['Location', 'Month', 'Year', 'sentiment_positive', 
                    'sentiment_negative', 'sentiment_neutral', 'dominant_sentiment']].head(10)

Stored 'DT_rally_speeches_with_sentiment' (DataFrame)
✅ Sentiment scores added to DataFrame!

New columns: sentiment_positive, sentiment_negative, sentiment_neutral, dominant_sentiment

DataFrame shape: (35, 12)

📁 Dataset stored as 'DT_rally_speeches_with_sentiment' for use in other notebooks


Unnamed: 0,Location,Month,Year,sentiment_positive,sentiment_negative,sentiment_neutral,dominant_sentiment
0,Greenville,Jul,2019,0.572434,0.274166,0.1534,positive
1,Cincinnati,Aug,2019,0.564755,0.279329,0.155916,positive
2,New Hampshire,Aug,2019,0.567399,0.277195,0.155406,positive
3,Texas,Sep,2019,0.425736,0.324423,0.249841,positive
4,New Mexico,Sep,2019,0.550716,0.28288,0.166404,positive
5,Fayetteville,Sep,2019,0.540782,0.289273,0.169946,positive
6,Dallas,Oct,2019,0.559747,0.281512,0.158741,positive
7,Minneapolis,Oct,2019,0.57617,0.274016,0.149814,positive
8,Tupelo,Nov,2019,0.576783,0.270095,0.153122,positive
9,Lexington,Nov,2019,0.566354,0.275151,0.158495,positive
