# 🎤 Trump Rally Speeches - Advanced Sentiment Analysis with FinBERT

This notebook performs **deep learning-based sentiment analysis** on Donald Trump's 2019-2020 rally speeches using **FinBERT**, a BERT-based transformer model fine-tuned for sentiment classification.

---

## 🔗 Project Links

- **📊 Full Project (GitHub):** [Donald-Trump-Rally-Speeches-NLP](https://github.com/JustaKris/Donald-Trump-Rally-Speeches-NLP)
- **🚀 Live API:** [Try the Sentiment Analysis API](https://trump-speeches-nlp-api.onrender.com/docs) *(FastAPI deployment)*
- **📖 Documentation:** [Project Docs](https://github.com/JustaKris/Donald-Trump-Rally-Speeches-NLP/tree/main/docs)
- **🐳 Deployment:** Azure App Service + Render (Docker containerized)

---

## 📚 Dataset

This analysis uses the **[Donald Trump Rally Speeches dataset](https://www.kaggle.com/datasets/christianlillelund/donald-trumps-rallies)** by Christian Lillelund.

**Dataset Details:**
- **35 rally speeches** from July 2019 to September 2020
- **300,000+ words** of transcribed content
- Covers key events during first presidential campaign

---

## 🛠️ Tech Stack

**Deep Learning & NLP:**
- **Transformers (HuggingFace)**: FinBERT pre-trained model
- **TensorFlow**: Deep learning framework
- **BERT Architecture**: Bidirectional encoder representations
- **Sentiment Classification**: 3-class (positive, negative, neutral)

**Analysis & Visualization:**
- Plotly for interactive dashboards
- Pandas for data manipulation
- NumPy for numerical operations

**Production API (see GitHub):**
- FastAPI for RESTful endpoints
- Docker containerization
- GitHub Actions CI/CD
- pytest testing (70%+ coverage)

---

## 🎯 Analysis Overview

1. **BERT-based sentiment classification** using FinBERT
2. **Text chunking strategy** for handling long speeches (>512 tokens)
3. **Temporal sentiment trends** across 2019-2020
4. **Speech-level aggregation** with statistical analysis
5. **Interactive visualizations** for exploring patterns

---

## 🧠 About FinBERT

**FinBERT** (ProsusAI/finbert) is a BERT model fine-tuned for financial sentiment analysis. While designed for financial text, its 3-class sentiment classification (positive, negative, neutral) works excellently for political speech analysis.

**Model Details:**
- Base: BERT-base-uncased
- Fine-tuned on: Financial news and reports
- Output: Probability distribution over 3 sentiment classes
- Max sequence length: 512 tokens

---

**Author:** Kristiyan Bonev  
**License:** MIT  
**Contact:** [GitHub](https://github.com/JustaKris)

## Import Libraries

Loading required libraries for deep learning, NLP, and visualization.

In [1]:
# Suppress warnings for cleaner output
import warnings
import os
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Suppress TensorFlow warnings
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'  # Disable symlink warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import re
import math
import tensorflow as tf
from tqdm.notebook import tqdm
from typing import List, Tuple, Dict
from pathlib import Path

# Configure TensorFlow logging
import logging
logging.getLogger('tensorflow').setLevel(logging.ERROR)

from transformers import pipeline, AutoTokenizer, TFBertForSequenceClassification
from scipy.special import softmax
from tensorflow.python.ops.numpy_ops import np_config

# Enable NumPy behavior for TensorFlow
np_config.enable_numpy_behavior()

# Set visualization styles
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

print(f"✅ Libraries loaded successfully")
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {len(tf.config.list_physical_devices('GPU')) > 0}")

✅ Libraries loaded successfully
TensorFlow version: 2.20.0
GPU Available: False


## Load and Parse Data

Loading all rally speech transcripts and extracting metadata from filenames.

**Note:** On Kaggle, the dataset will be automatically mounted at `/kaggle/input/donald-trumps-rallies/`

In [2]:
def load_speech_data(data_dir: str | Path) -> pd.DataFrame:
    """
    Load all speech transcripts and extract metadata from filenames.
    
    Parameters:
        data_dir: Path to directory containing speech text files
        
    Returns:
        DataFrame with columns: Location, Month, Year, filename, content
    """
    filenames, file_contents, years, months, locations = [], [], [], [], []
    data_path = Path(data_dir)
    
    # Get list of text files
    files = list(data_path.glob("*.txt"))
    print(f"Found {len(files)} speech transcripts")
    
    for file_path in files:
        filename = file_path.name
        filenames.append(filename)
        
        # Read file content
        try:
            with open(file_path, encoding="utf-8") as file:
                file_contents.append(file.read())
        except Exception as e:
            print(f"Error reading {filename}: {e}")
            file_contents.append("")
        
        # Extract metadata from filename (e.g., "CincinnatiAug1_2019.txt")
        years.append(filename[-8:-4])
        match = re.search(r"([A-z]+)([A-z]{3})([0-9]+)\_", filename)
        if match:
            months.append(match.group(2))
            # Add spaces before capital letters for location names
            location_raw = match.group(1)
            locations.append(''.join([' ' + c if c.isupper() else c for c in location_raw]).strip())
        else:
            months.append("Unknown")
            locations.append("Unknown")
    
    # Create DataFrame
    df = pd.DataFrame({
        'Location': locations,
        'Month': months,
        'Year': years,
        'filename': filenames,
        'content': file_contents
    })
    
    # Sort by chronological order (approximate)
    month_order = ['Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
    df['month_sort'] = df['Month'].apply(lambda x: month_order.index(x) if x in month_order else 99)
    df = df.sort_values(['Year', 'month_sort']).reset_index(drop=True)
    df = df.drop('month_sort', axis=1)
    
    return df


def get_data_path():
    """
    Determine the appropriate data directory path based on environment.
    
    Checks for Kaggle environment first, then falls back to local development path.
    
    Returns:
        Path object pointing to the dataset directory
        
    Raises:
        FileNotFoundError: If dataset not found in either environment
    """
    # Kaggle environment check
    kaggle_path = Path("/kaggle/input/donald-trumps-rallies")
    if kaggle_path.exists():
        return kaggle_path

    # Local environment fallback
    local_path = Path("D:/Programming/Repositories/Donald-Trump-Rally-Speeches-NLP/data/Donald Trump Rally Speeches")
    if local_path.exists():
        return local_path

    # Neither path exists - raise error with helpful message
    raise FileNotFoundError("Dataset not found in either local or Kaggle environment.")


# Load data using environment-aware path detection
data_dir = get_data_path()
df = load_speech_data(data_dir)

print(f"\n📊 Loaded {len(df)} speeches from {df['Year'].min()} to {df['Year'].max()}")
print(f"📅 Date range: {df['Month'].iloc[0]} {df['Year'].iloc[0]} - {df['Month'].iloc[-1]} {df['Year'].iloc[-1]}")

Found 35 speech transcripts

📊 Loaded 35 speeches from 2019 to 2020
📅 Date range: Jul 2019 - Jun 2020


In [3]:
# Display dataset overview
print("Dataset Overview:")
print(f"{'='*80}")
df.head(10)

Dataset Overview:


Unnamed: 0,Location,Month,Year,filename,content
0,Greenville,Jul,2019,GreenvilleJul17_2019.txt,Thank you very much. Thank you. Thank you. Tha...
1,Cincinnati,Aug,2019,CincinnatiAug1_2019.txt,Thank you all. Thank you very much. Thank you ...
2,New Hampshire,Aug,2019,NewHampshireAug15_2019.txt,Thank you very much everybody. Thank you. Wow...
3,Fayetteville,Sep,2019,FayettevilleSep9_2019.txt,Thank you everybody. Thank you and Vice Presi...
4,New Mexico,Sep,2019,NewMexicoSep16_2019.txt,"Wow, thank you. Thank you, New Mexico. Thank ..."
5,Texas,Sep,2019,TexasSep23_2019.txt,"Hello, Houston. I am so thrilled to be here in..."
6,Dallas,Oct,2019,DallasOct17_2019.txt,Thank you. Thank you very much. Hello Dallas. ...
7,Minneapolis,Oct,2019,MinneapolisOct10_2019.txt,"Thank you very much. Thank you, Minnesota. Thi..."
8,Lexington,Nov,2019,LexingtonNov4_2019.txt,Thank you very much and thank you to the origi...
9,Tupelo,Nov,2019,TupeloNov1_2019.txt,"ell, thank you very much. And hello, Tupelo. T..."


## Model and Tokenizer Setup

Loading **FinBERT** (ProsusAI/finbert), a BERT model fine-tuned for sentiment analysis.

**Sentiment Classes:**
- **Positive**: Optimistic, confident language
- **Negative**: Critical, pessimistic language  
- **Neutral**: Factual, balanced statements

**Note**: The model download may take a few minutes on first run. Progress bars and some warnings are normal during model loading.

In [None]:
# Load FinBERT model and tokenizer
MODEL_CHECKPOINT = 'ProsusAI/finbert'

print("🔄 Loading FinBERT model and tokenizer...")
print("   (This may take a few moments on first run)\n")

# Suppress transformers warnings temporarily during model loading
import transformers
transformers.logging.set_verbosity_error()

try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
    model = TFBertForSequenceClassification.from_pretrained(MODEL_CHECKPOINT)
    
    # Display model configuration
    print(f"✅ Model loaded successfully!")
    print(f"\n📋 Model Configuration:")
    print(f"   Model: {MODEL_CHECKPOINT}")
    print(f"   Labels: {model.config.id2label}") # type: ignore
    print(f"   Max sequence length: {tokenizer.model_max_length} tokens")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("💡 Tip: Check your internet connection or try re-running the cell")
    raise

# Restore normal logging for rest of notebook
transformers.logging.set_verbosity_warning()

🔄 Loading FinBERT model and tokenizer...
   (This may take a few moments on first run)

✅ Model loaded successfully!

📋 Model Configuration:
   Model: ProsusAI/finbert
   Labels: {0: 'positive', 1: 'negative', 2: 'neutral'}
   Max sequence length: 512 tokens
✅ Model loaded successfully!

📋 Model Configuration:
   Model: ProsusAI/finbert
   Labels: {0: 'positive', 1: 'negative', 2: 'neutral'}
   Max sequence length: 512 tokens


## Text Processing Functions

BERT models have a maximum sequence length (512 tokens). We split long speeches into manageable chunks and aggregate the results.

### Chunking Strategy:
1. Tokenize full speech text
2. Split into 510-token chunks (leaving room for [CLS] and [SEP] special tokens)
3. Run sentiment analysis on each chunk
4. Aggregate chunk-level predictions to get speech-level sentiment

In [None]:
def chunk_text_for_bert(text: str, tokenizer, max_length: int = 510) -> List[Dict]:
    """
    Split text into chunks that fit within BERT's token limits.
    
    Parameters:
        text: Input text to chunk
        tokenizer: HuggingFace tokenizer
        max_length: Maximum tokens per chunk (510 to leave room for [CLS] and [SEP])
        
    Returns:
        List of encoded chunks ready for model input
    """
    # Tokenize the full text
    tokens = tokenizer.tokenize(text)
    
    # Split into chunks
    chunks = []
    for i in range(0, len(tokens), max_length):
        chunk_tokens = tokens[i:i + max_length]
        chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
        
        # Encode with special tokens
        encoding = tokenizer.encode_plus(
            chunk_text,
            add_special_tokens=True,
            max_length=max_length + 2,  # +2 for [CLS] and [SEP]
            padding='max_length',
            truncation=True,
            return_tensors='tf'
        )
        chunks.append(encoding)
    
    return chunks


def analyze_sentiment(chunks: List[Dict], model) -> Tuple[np.ndarray, np.ndarray]:
    """
    Run sentiment analysis on text chunks.
    
    Parameters:
        chunks: List of encoded text chunks
        model: Loaded sentiment analysis model
        
    Returns:
        Tuple of (all_predictions, mean_sentiment) as numpy arrays
    """
    all_predictions = []
    
    for chunk in chunks:
        # Get model predictions
        outputs = model(chunk)
        
        # Convert logits to probabilities
        probs = tf.nn.softmax(outputs.logits, axis=-1)
        all_predictions.append(probs.numpy()) # type: ignore
    
    # Stack all predictions
    all_predictions = np.vstack(all_predictions)
    
    # Calculate mean sentiment across all chunks
    mean_sentiment = np.mean(all_predictions, axis=0)
    
    return all_predictions, mean_sentiment

print("✅ Text processing functions defined successfully!")

✅ Text processing functions defined successfully!


## Run Sentiment Analysis on All Speeches

Processing each speech through FinBERT to extract sentiment scores.

**Note:** This may take 5-10 minutes depending on hardware. Each speech is split into multiple chunks for processing.

In [None]:
# Process all speeches with progress tracking
print("🔄 Processing all speeches for sentiment analysis...\n")

sentiment_results = []

for idx, row in tqdm(df.iterrows(), total=len(df), desc="Analyzing speeches"):
    try:
        # Chunk the speech text
        chunks = chunk_text_for_bert(row['content'], tokenizer, max_length=510)
        
        # Analyze sentiment
        chunk_predictions, mean_sentiment = analyze_sentiment(chunks, model)
        
        # Store results
        sentiment_results.append({
            'speech_idx': idx,
            'location': row['Location'],
            'month': row['Month'],
            'year': row['Year'],
            'num_chunks': len(chunks),
            'positive': mean_sentiment[0],
            'negative': mean_sentiment[1],
            'neutral': mean_sentiment[2],
            'chunk_predictions': chunk_predictions,
            'dominant_sentiment': model.config.id2label[np.argmax(mean_sentiment)] # type: ignore
        })
        
    except Exception as e:
        print(f"\n⚠️  Error processing speech {idx} ({row['Location']}): {e}")
        continue

print(f"\n✅ Successfully analyzed {len(sentiment_results)} speeches!")

# Create results DataFrame
sentiment_df = pd.DataFrame([{k: v for k, v in r.items() if k != 'chunk_predictions'} 
                              for r in sentiment_results])

🔄 Processing all speeches for sentiment analysis...



Analyzing speeches:   0%|          | 0/35 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (14239 > 512). Running this sequence through the model will result in indexing errors



✅ Successfully analyzed 35 speeches!


## Sentiment Analysis Results

Examining the sentiment scores for each speech.

In [7]:
# Display sentiment scores
print("Sentiment Scores by Speech:\n")
print("="*80)

for idx, row in sentiment_df.iterrows():
    print(f"{row['location']:.<30} ({row['month']} {row['year']})")
    print(f"   Positive: {row['positive']:.3f} | Negative: {row['negative']:.3f} | Neutral: {row['neutral']:.3f}")
    print(f"   Dominant: {row['dominant_sentiment']} | Chunks: {row['num_chunks']}")
    print()

# Show DataFrame
sentiment_df.head(10)

Sentiment Scores by Speech:

Greenville.................... (Jul 2019)
   Positive: 0.394 | Negative: 0.265 | Neutral: 0.341
   Dominant: positive | Chunks: 28

Cincinnati.................... (Aug 2019)
   Positive: 0.394 | Negative: 0.272 | Neutral: 0.334
   Dominant: positive | Chunks: 21

New Hampshire................. (Aug 2019)
   Positive: 0.398 | Negative: 0.271 | Neutral: 0.331
   Dominant: positive | Chunks: 26

Fayetteville.................. (Sep 2019)
   Positive: 0.392 | Negative: 0.287 | Neutral: 0.321
   Dominant: positive | Chunks: 24

New Mexico.................... (Sep 2019)
   Positive: 0.391 | Negative: 0.271 | Neutral: 0.338
   Dominant: positive | Chunks: 31

Texas......................... (Sep 2019)
   Positive: 0.396 | Negative: 0.317 | Neutral: 0.287
   Dominant: positive | Chunks: 6

Dallas........................ (Oct 2019)
   Positive: 0.394 | Negative: 0.270 | Neutral: 0.336
   Dominant: positive | Chunks: 28

Minneapolis................... (Oct 2019)
   Pos

Unnamed: 0,speech_idx,location,month,year,num_chunks,positive,negative,neutral,dominant_sentiment
0,0,Greenville,Jul,2019,28,0.394135,0.265128,0.340737,positive
1,1,Cincinnati,Aug,2019,21,0.394186,0.272307,0.333507,positive
2,2,New Hampshire,Aug,2019,26,0.397914,0.270642,0.331445,positive
3,3,Fayetteville,Sep,2019,24,0.392093,0.286814,0.321093,positive
4,4,New Mexico,Sep,2019,31,0.391417,0.270587,0.337996,positive
5,5,Texas,Sep,2019,6,0.395873,0.316721,0.287406,positive
6,6,Dallas,Oct,2019,28,0.394222,0.270229,0.33555,positive
7,7,Minneapolis,Oct,2019,31,0.390913,0.265286,0.343801,positive
8,8,Lexington,Nov,2019,24,0.393218,0.271917,0.334865,positive
9,9,Tupelo,Nov,2019,25,0.395425,0.266898,0.337677,positive


## Interactive Sentiment Dashboard

Creating comprehensive visualizations to explore sentiment patterns across all speeches.

In [8]:
# Create comprehensive sentiment visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Sentiment Distribution Across All Speeches',
                    'Average Sentiment by Year',
                    'Dominant Sentiment Count',
                    'Sentiment Trends Over Time'),
    specs=[[{"type": "bar"}, {"type": "bar"}],
           [{"type": "pie"}, {"type": "scatter"}]]
)

# 1. Overall sentiment distribution (stacked bar)
fig.add_trace(
    go.Bar(name='Positive', x=sentiment_df['location'], y=sentiment_df['positive'],
           marker_color='#2ecc71', showlegend=True),
    row=1, col=1
)
fig.add_trace(
    go.Bar(name='Negative', x=sentiment_df['location'], y=sentiment_df['negative'],
           marker_color='#e74c3c', showlegend=True),
    row=1, col=1
)
fig.add_trace(
    go.Bar(name='Neutral', x=sentiment_df['location'], y=sentiment_df['neutral'],
           marker_color='#95a5a6', showlegend=True),
    row=1, col=1
)

# 2. Average sentiment by year
year_avg = sentiment_df.groupby('year')[['positive', 'negative', 'neutral']].mean()
fig.add_trace(
    go.Bar(name='Positive', x=year_avg.index, y=year_avg['positive'],
           marker_color='#2ecc71', showlegend=False),
    row=1, col=2
)
fig.add_trace(
    go.Bar(name='Negative', x=year_avg.index, y=year_avg['negative'],
           marker_color='#e74c3c', showlegend=False),
    row=1, col=2
)
fig.add_trace(
    go.Bar(name='Neutral', x=year_avg.index, y=year_avg['neutral'],
           marker_color='#95a5a6', showlegend=False),
    row=1, col=2
)

# 3. Dominant sentiment pie chart
sentiment_counts = sentiment_df['dominant_sentiment'].value_counts()
fig.add_trace(
    go.Pie(labels=sentiment_counts.index, values=sentiment_counts.values,
           marker=dict(colors=['#2ecc71', '#e74c3c', '#95a5a6']),
           showlegend=False),
    row=2, col=1
)

# 4. Sentiment timeline
fig.add_trace(
    go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['positive'],
               mode='lines+markers', name='Positive',
               line=dict(color='#2ecc71', width=2), showlegend=False),
    row=2, col=2
)
fig.add_trace(
    go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['negative'],
               mode='lines+markers', name='Negative',
               line=dict(color='#e74c3c', width=2), showlegend=False),
    row=2, col=2
)
fig.add_trace(
    go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['neutral'],
               mode='lines+markers', name='Neutral',
               line=dict(color='#95a5a6', width=2), showlegend=False),
    row=2, col=2
)

# Update layout
fig.update_layout(
    height=900,
    title_text="Comprehensive Sentiment Analysis Dashboard",
    showlegend=True,
    barmode='group',
    template='plotly_white'
)

fig.update_xaxes(tickangle=-45, row=1, col=1)
fig.update_yaxes(title_text="Probability", row=1, col=1)
fig.update_yaxes(title_text="Probability", row=1, col=2)
fig.update_yaxes(title_text="Sentiment Score", row=2, col=2)
fig.update_xaxes(title_text="Speech Index", row=2, col=2)

fig.show()

## Sentiment Heatmap

Visualizing sentiment patterns across speeches in a compact heatmap format.

In [9]:
# Create sentiment heatmap
heatmap_data = sentiment_df[['positive', 'negative', 'neutral']].T
heatmap_data.columns = [f"{row['location'][:15]}..." if len(row['location']) > 15 
                        else row['location'] 
                        for _, row in sentiment_df.iterrows()]

fig = go.Figure(data=go.Heatmap(
    z=heatmap_data.values,
    x=heatmap_data.columns,
    y=['Positive', 'Negative', 'Neutral'],
    colorscale='RdYlGn',
    text=heatmap_data.values,
    texttemplate='%{text:.2f}',
    textfont={"size": 10},
    colorbar=dict(title="Probability")
))

fig.update_layout(
    title='Sentiment Heatmap: All Speeches',
    xaxis_title='Speech Location',
    yaxis_title='Sentiment Type',
    height=400,
    template='plotly_white'
)

fig.update_xaxes(tickangle=-45)
fig.show()

## Chunk-Level Sentiment Analysis

Examining how sentiment varies **within** individual speeches. This reveals whether speeches maintain consistent tone or shift sentiment throughout.

In [10]:
# Select a few interesting speeches to examine in detail
selected_speeches = [0, 10, 20, 30]  # First, middle, and later speeches

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[f"{sentiment_results[i]['location']} ({sentiment_results[i]['month']} {sentiment_results[i]['year']})" 
                    for i in selected_speeches]
)

positions = [(1, 1), (1, 2), (2, 1), (2, 2)]

for idx, (speech_idx, pos) in enumerate(zip(selected_speeches, positions)):
    result = sentiment_results[speech_idx]
    chunks = result['chunk_predictions']
    
    chunk_indices = list(range(len(chunks)))
    
    # Add traces for each sentiment
    fig.add_trace(
        go.Scatter(x=chunk_indices, y=chunks[:, 0],
                   mode='lines+markers', name='Positive',
                   line=dict(color='#2ecc71'), showlegend=(idx == 0)),
        row=pos[0], col=pos[1]
    )
    fig.add_trace(
        go.Scatter(x=chunk_indices, y=chunks[:, 1],
                   mode='lines+markers', name='Negative',
                   line=dict(color='#e74c3c'), showlegend=(idx == 0)),
        row=pos[0], col=pos[1]
    )
    fig.add_trace(
        go.Scatter(x=chunk_indices, y=chunks[:, 2],
                   mode='lines+markers', name='Neutral',
                   line=dict(color='#95a5a6'), showlegend=(idx == 0)),
        row=pos[0], col=pos[1]
    )
    
    fig.update_xaxes(title_text="Chunk Index", row=pos[0], col=pos[1])
    fig.update_yaxes(title_text="Sentiment Score", row=pos[0], col=pos[1])

fig.update_layout(
    height=800,
    title_text="Sentiment Variation Within Individual Speeches",
    template='plotly_white',
    showlegend=True
)

fig.show()

## Temporal Analysis: Sentiment Over Time

Analyzing how sentiment evolved throughout 2019 and 2020. Did the tone shift as political events unfolded?

In [11]:
# Create temporal visualization
fig = go.Figure()

# Add sentiment traces
fig.add_trace(go.Scatter(
    x=sentiment_df['speech_idx'],
    y=sentiment_df['positive'],
    mode='lines+markers',
    name='Positive',
    line=dict(color='#2ecc71', width=3),
    marker=dict(size=8),
    hovertemplate='<b>%{text}</b><br>Positive: %{y:.3f}<extra></extra>',
    text=sentiment_df['location']
))

fig.add_trace(go.Scatter(
    x=sentiment_df['speech_idx'],
    y=sentiment_df['negative'],
    mode='lines+markers',
    name='Negative',
    line=dict(color='#e74c3c', width=3),
    marker=dict(size=8),
    hovertemplate='<b>%{text}</b><br>Negative: %{y:.3f}<extra></extra>',
    text=sentiment_df['location']
))

fig.add_trace(go.Scatter(
    x=sentiment_df['speech_idx'],
    y=sentiment_df['neutral'],
    mode='lines+markers',
    name='Neutral',
    line=dict(color='#95a5a6', width=3),
    marker=dict(size=8),
    hovertemplate='<b>%{text}</b><br>Neutral: %{y:.3f}<extra></extra>',
    text=sentiment_df['location']
))

fig.update_layout(
    title='Sentiment Evolution Over Time (2019-2020)',
    xaxis_title='Speech Index (Chronological)',
    yaxis_title='Sentiment Score',
    height=500,
    template='plotly_white',
    hovermode='x unified'
)

fig.show()

In [12]:
# Calculate rolling average to see trends more clearly
window = 3
sentiment_df['positive_ma'] = sentiment_df['positive'].rolling(window=window, center=True).mean()
sentiment_df['negative_ma'] = sentiment_df['negative'].rolling(window=window, center=True).mean()
sentiment_df['neutral_ma'] = sentiment_df['neutral'].rolling(window=window, center=True).mean()

# Plot with moving average
fig = go.Figure()

# Raw data (lighter)
fig.add_trace(go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['positive'],
                          mode='markers', name='Positive (raw)',
                          marker=dict(color='#2ecc71', size=6, opacity=0.3),
                          showlegend=True))
fig.add_trace(go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['negative'],
                          mode='markers', name='Negative (raw)',
                          marker=dict(color='#e74c3c', size=6, opacity=0.3),
                          showlegend=True))

# Moving averages (bold)
fig.add_trace(go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['positive_ma'],
                          mode='lines', name=f'Positive ({window}-speech avg)',
                          line=dict(color='#2ecc71', width=4)))
fig.add_trace(go.Scatter(x=sentiment_df['speech_idx'], y=sentiment_df['negative_ma'],
                          mode='lines', name=f'Negative ({window}-speech avg)',
                          line=dict(color='#e74c3c', width=4)))

fig.update_layout(
    title=f'Sentiment Trends with {window}-Speech Moving Average',
    xaxis_title='Speech Index (Chronological)',
    yaxis_title='Sentiment Score',
    height=500,
    template='plotly_white'
)

fig.show()

## Year-over-Year Comparison

Comparing sentiment patterns between 2019 and 2020 using box plots to show distribution.

In [13]:
# Compare sentiment statistics by year
year_stats = sentiment_df.groupby('year').agg({
    'positive': ['mean', 'std', 'min', 'max'],
    'negative': ['mean', 'std', 'min', 'max'],
    'neutral': ['mean', 'std', 'min', 'max'],
    'speech_idx': 'count'
}).round(3)

print("Year-over-Year Sentiment Statistics:")
print("="*80)
print(year_stats)
print()

# Create box plots for sentiment distribution by year
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=('Positive Sentiment', 'Negative Sentiment', 'Neutral Sentiment')
)

for year in sentiment_df['year'].unique():
    year_data = sentiment_df[sentiment_df['year'] == year]
    
    fig.add_trace(
        go.Box(y=year_data['positive'], name=year, showlegend=True,
               marker_color='#2ecc71' if year == '2019' else '#27ae60'),
        row=1, col=1
    )
    fig.add_trace(
        go.Box(y=year_data['negative'], name=year, showlegend=False,
               marker_color='#e74c3c' if year == '2019' else '#c0392b'),
        row=1, col=2
    )
    fig.add_trace(
        go.Box(y=year_data['neutral'], name=year, showlegend=False,
               marker_color='#95a5a6' if year == '2019' else '#7f8c8d'),
        row=1, col=3
    )

fig.update_layout(
    title_text='Sentiment Distribution by Year',
    height=400,
    template='plotly_white',
    showlegend=True
)

fig.update_yaxes(title_text="Sentiment Score", row=1, col=1)
fig.update_yaxes(title_text="Sentiment Score", row=1, col=2)
fig.update_yaxes(title_text="Sentiment Score", row=1, col=3)

fig.show()

# Statistical comparison
print("\n📊 Key Insights:")
print("="*80)
for year in sorted(sentiment_df['year'].unique()):
    year_data = sentiment_df[sentiment_df['year'] == year]
    print(f"\n{year}:")
    print(f"  • Average Positive: {year_data['positive'].mean():.3f} (±{year_data['positive'].std():.3f})")
    print(f"  • Average Negative: {year_data['negative'].mean():.3f} (±{year_data['negative'].std():.3f})")
    print(f"  • Average Neutral:  {year_data['neutral'].mean():.3f} (±{year_data['neutral'].std():.3f})")
    print(f"  • Speeches: {len(year_data)}")

Year-over-Year Sentiment Statistics:
     positive                      negative                      neutral  \
         mean    std    min    max     mean    std    min    max    mean   
year                                                                       
2019    0.394  0.002  0.391  0.398    0.274  0.015  0.259  0.317   0.332   
2020    0.393  0.003  0.388  0.398    0.269  0.007  0.258  0.282   0.339   

                          speech_idx  
        std    min    max      count  
year                                  
2019  0.016  0.287  0.346         12  
2020  0.007  0.320  0.350         23  




📊 Key Insights:

2019:
  • Average Positive: 0.394 (±0.002)
  • Average Negative: 0.274 (±0.015)
  • Average Neutral:  0.332 (±0.016)
  • Speeches: 12

2020:
  • Average Positive: 0.393 (±0.003)
  • Average Negative: 0.269 (±0.007)
  • Average Neutral:  0.339 (±0.007)
  • Speeches: 23


## Summary Statistics and Insights

Comprehensive analysis summary with key findings.

In [14]:
# Comprehensive summary
print("=" * 80)
print("📊 SENTIMENT ANALYSIS SUMMARY")
print("=" * 80)

# Overall statistics
print(f"\n🎤 Dataset Overview:")
print(f"   Total Speeches Analyzed: {len(sentiment_df)}")
print(f"   Time Period: {sentiment_df['month'].iloc[0]} {sentiment_df['year'].iloc[0]} - {sentiment_df['month'].iloc[-1]} {sentiment_df['year'].iloc[-1]}")
print(f"   Total Text Chunks Processed: {sentiment_df['num_chunks'].sum():,}")
print(f"   Average Chunks per Speech: {sentiment_df['num_chunks'].mean():.1f}")

# Overall sentiment averages
print(f"\n📈 Overall Sentiment Scores:")
print(f"   Positive: {sentiment_df['positive'].mean():.3f} (±{sentiment_df['positive'].std():.3f})")
print(f"   Negative: {sentiment_df['negative'].mean():.3f} (±{sentiment_df['negative'].std():.3f})")
print(f"   Neutral:  {sentiment_df['neutral'].mean():.3f} (±{sentiment_df['neutral'].std():.3f})")

# Dominant sentiment
dominant_counts = sentiment_df['dominant_sentiment'].value_counts()
print(f"\n🎯 Dominant Sentiment Distribution:")
for sentiment, count in dominant_counts.items():
    percentage = (count / len(sentiment_df)) * 100
    print(f"   {sentiment}: {count} speeches ({percentage:.1f}%)")

# Most/least positive speeches
most_positive = sentiment_df.nlargest(3, 'positive')
most_negative = sentiment_df.nlargest(3, 'negative')

print(f"\n✨ Most Positive Speeches:")
for _, row in most_positive.iterrows():
    print(f"   • {row['location']} ({row['month']} {row['year']}): {row['positive']:.3f}")

print(f"\n⚠️  Most Negative Speeches:")
for _, row in most_negative.iterrows():
    print(f"   • {row['location']} ({row['month']} {row['year']}): {row['negative']:.3f}")

# Sentiment volatility (speeches with high variance in chunks)
print(f"\n📊 Sentiment Variation:")
chunk_variances = []
for result in sentiment_results:
    chunks = result['chunk_predictions']
    variance = np.var(chunks, axis=0).mean()
    chunk_variances.append((result['location'], result['month'], result['year'], variance))

chunk_variances.sort(key=lambda x: x[3], reverse=True)
print(f"   Speeches with Most Sentiment Variation:")
for location, month, year, var in chunk_variances[:3]:
    print(f"   • {location} ({month} {year}): variance = {var:.4f}")

print(f"\n   Speeches with Most Consistent Sentiment:")
for location, month, year, var in chunk_variances[-3:]:
    print(f"   • {location} ({month} {year}): variance = {var:.4f}")

print("\n" + "=" * 80)

📊 SENTIMENT ANALYSIS SUMMARY

🎤 Dataset Overview:
   Total Speeches Analyzed: 35
   Time Period: Jul 2019 - Jun 2020
   Total Text Chunks Processed: 985
   Average Chunks per Speech: 28.1

📈 Overall Sentiment Scores:
   Positive: 0.393 (±0.002)
   Negative: 0.270 (±0.010)
   Neutral:  0.337 (±0.011)

🎯 Dominant Sentiment Distribution:
   positive: 35 speeches (100.0%)

✨ Most Positive Speeches:
   • Milwaukee (Jan 2020): 0.398
   • New Hampshire (Aug 2019): 0.398
   • New Hampshire (Feb 2020): 0.397

⚠️  Most Negative Speeches:
   • Texas (Sep 2019): 0.317
   • Fayetteville (Sep 2019): 0.287
   • New Hampshire (Feb 2020): 0.282

📊 Sentiment Variation:
   Speeches with Most Sentiment Variation:
   • Texas (Sep 2019): variance = 0.0006
   • Colorador Springs (Feb 2020): variance = 0.0005
   • Lexington (Nov 2019): variance = 0.0005

   Speeches with Most Consistent Sentiment:
   • Dallas (Oct 2019): variance = 0.0003
   • Pittsburgh (Sep 2020): variance = 0.0002
   • Mosinee (Sep 2020): 

---

## 🚀 Next Steps: Full Project Features

This notebook demonstrates **advanced sentiment analysis with BERT transformers**. For the complete project, visit the GitHub repository which includes:

### 📊 Additional Notebooks
- **Word Frequency & Topics Analysis**: N-gram extraction, word clouds, temporal linguistic patterns
- **Masked Language Modeling**: Custom DistilBERT fine-tuning on domain-specific corpus

### 🚀 Production-Ready API
Try the live FastAPI application that serves this exact sentiment analysis model:
- **Sentiment Analysis Endpoint**: Analyze any text with FinBERT
- **Batch Processing**: Handle multiple texts efficiently
- **RESTful Design**: Clean API with automatic documentation
- **Error Handling**: Robust input validation and error responses

**Live API Docs:** [https://trump-speeches-nlp-api.onrender.com/docs](https://trump-speeches-nlp-api.onrender.com/docs)

**Example API Request:**
```python
import requests

response = requests.post(
    "https://trump-speeches-nlp-api.onrender.com/sentiment",
    json={"text": "Your text here"}
)
print(response.json())
```

### 🛠️ Professional Engineering Practices
- **Testing**: 38 unit/integration tests with 70%+ coverage (pytest)
- **CI/CD**: GitHub Actions pipeline (tests, linting, security scans, deployment)
- **Code Quality**: Black, flake8, isort, mypy for consistent code standards
- **Deployment**: Docker containerization with Azure + Render deployment
- **Documentation**: Comprehensive guides for setup, testing, and deployment
- **Poetry**: Modern Python dependency management

---

## 📚 Learn More

- **GitHub Repository:** [Donald-Trump-Rally-Speeches-NLP](https://github.com/JustaKris/Donald-Trump-Rally-Speeches-NLP)
- **Dataset Source:** [Kaggle Dataset](https://www.kaggle.com/datasets/christianlillelund/donald-trumps-rallies)
- **FinBERT Model:** [HuggingFace Model Card](https://huggingface.co/ProsusAI/finbert)
- **Author:** Kristiyan Bonev | [GitHub Profile](https://github.com/JustaKris)

---

### 💬 Feedback Welcome!

If you found this analysis helpful:
- ⭐ Star the GitHub repository
- 🔼 Upvote this Kaggle notebook
- 💡 Share your thoughts in the comments
- 🐛 Report issues or suggest improvements on GitHub

**License:** MIT - Feel free to use, modify, and learn from this code!

---

## 🔬 Technical Notes

### Why FinBERT for Political Speeches?

While FinBERT was fine-tuned on financial text, its sentiment classification transfers well to political speeches because:
1. **Structured language**: Both domains use formal, persuasive language
2. **Clear sentiment signals**: Financial and political texts both express optimism/pessimism explicitly
3. **BERT foundation**: Pre-trained on diverse text, captures general language understanding
4. **3-class setup**: Positive/Negative/Neutral classification fits political analysis perfectly

### Chunking Strategy Explained

BERT models have a hard limit of 512 tokens. Our approach:
- **Chunk size**: 510 tokens (leaving room for [CLS] and [SEP])
- **Aggregation**: Average predictions across chunks
- **Why this works**: Sentiment tends to be consistent within speeches; averaging smooths local variations

### Performance Considerations

- **Model size**: ~440MB download
- **Processing time**: ~5-10 minutes for 35 speeches on CPU
- **GPU acceleration**: Use GPU runtime for 5-10x speedup
- **Memory usage**: ~2GB RAM for model + data