# Tutorial 3: Sentiment Analysis

**Goal:** Learn to analyze the tone and sentiment of central bank communications.

**What you'll learn:**
- What is sentiment analysis?
- Using VADER for sentiment scoring
- Detecting hawkish vs dovish language
- Tracking sentiment over time
- Comparing sentiment across banks

**Time:** ~1 hour

**Key Concept:** In central banking:
- **Hawkish** = favoring tighter monetary policy (higher interest rates)
- **Dovish** = favoring looser monetary policy (lower interest rates)

## Step 1: Setup

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import re

%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

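# Load data function
def load_statements(directory, bank_name):
    """Read every .txt file in `directory` into a DataFrame, one row per statement.

    The date comes from the filename (minus the extension), so filenames
    must be parseable by pd.to_datetime.
    """
    statements = []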
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            filepath = os.path.join(directory, filename)
            date_str = filename.replace('.txt', '').replace('-txt', '')
            with open(filepath, 'r', encoding='utf-8') as file:
                text = file.read()
            statements.append({
                'date': date_str,
                'bank': bank_name,
                'text': text,
                'filename': filename
            })
    df = pd.DataFrame(statements)
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    return df

# Load both datasets
fed_data = load_statements('../usa-central-bank/fomc-statements', 'Fed')
nz_data = load_statements('../nz-central-bank/ocr', 'RBNZ')
all_data = pd.concat([fed_data, nz_data], ignore_index=True).sort_values('date').reset_index(drop=True)

print(f"✓ Loaded {len(fed_data)} Fed statements")
print(f"✓ Loaded {len(nz_data)} RBNZ statements")

## Step 2: Understanding Sentiment Analysis

**Sentiment analysis** determines if text is positive, negative, or neutral.

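**VADER** (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based, lexicon-driven tool that:
- Returns a compound score from -1 (very negative) to +1 (very positive), plus the proportions of positive, negative, and neutral text
- Handles simple context such as negation ("not good" is scored as negative, not positive)
- Was built for social media text, so its scores on formal text like financial statements are a rough first pass: a general-purpose lexicon may miss policy-specific meaning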
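
The compound score is often turned into a coarse label with a pair of cutoffs. The sketch below uses ±0.05, the thresholds suggested in the VADER documentation. The function name `label_compound` is hypothetical, and the cutoffs may need tuning for central bank text.

```python
def label_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Illustrative scores only, not taken from the dataset
for score in (0.62, -0.34, 0.01):
    print(f"{score:+.2f} -> {label_compound(score)}")
```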

In [None]:
# Initialize the sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Test with sample sentences
test_sentences = [
    "The economy is strong and employment is robust.",
    "Economic activity has been weak and risks remain.",
    "The Committee maintains its target range.",
    "Inflation remains elevated and poses significant concerns."
]

print("Testing VADER Sentiment Analysis:")
print("=" * 80)

for sentence in test_sentences:
    scores = analyzer.polarity_scores(sentence)
    print(f"\nSentence: {sentence}")
    print(f"  Positive: {scores['pos']:.2f}")
    print(f"  Negative: {scores['neg']:.2f}")
    print(f"  Neutral:  {scores['neu']:.2f}")
    print(f"  Compound: {scores['compound']:.2f} (overall score)")

## Step 3: Analyze Sentiment of All Statements

Let's score every statement in our dataset.
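
A single score per statement hides the mix of upbeat and cautious sentences inside it. As a complementary view, one option is to score each sentence and average the results. The sketch below is a minimal version under stated assumptions: `mean_sentence_score` is a hypothetical helper, the sentence splitter is naive (it will break on abbreviations such as "U.S."), and `toy_score` is a stand-in so the example runs without VADER. In the notebook you would pass a function that returns VADER's compound score instead.

```python
import re

def mean_sentence_score(text, score_fn):
    """Average score_fn over the sentences of text (0.0 if there are none)."""
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(score_fn(s) for s in sentences) / len(sentences)

def toy_score(sentence):
    """Stand-in scorer: +0.5 for 'strong', -0.5 for 'weak', else 0."""
    lowered = sentence.lower()
    if "strong" in lowered:
        return 0.5
    if "weak" in lowered:
        return -0.5
    return 0.0

sample = "Growth is strong. Hiring is weak. The Committee met today."
print(mean_sentence_score(sample, toy_score))
```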

In [None]:
def get_sentiment(text):
    """
    Get sentiment scores for a text.
    Returns the compound score (-1 to +1).
    """
    scores = analyzer.polarity_scores(text)
    return scores['compound']

def get_detailed_sentiment(text):
    """
    Get all sentiment scores.
    """
    return analyzer.polarity_scores(text)

# Calculate sentiment for all statements
all_data['sentiment'] = all_data['text'].apply(get_sentiment)

# Add detailed scores
sentiment_details = all_data['text'].apply(get_detailed_sentiment)
all_data['positive'] = sentiment_details.apply(lambda x: x['pos'])
all_data['negative'] = sentiment_details.apply(lambda x: x['neg'])
all_data['neutral'] = sentiment_details.apply(lambda x: x['neu'])

print("✓ Sentiment analysis complete!")
all_data[['date', 'bank', 'sentiment', 'positive', 'negative']].head(10)

## Step 4: Sentiment Distribution

Let's see the overall distribution of sentiment scores.
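
When reading the histogram, keep in mind that VADER squashes the summed word valences of a text into (-1, 1), so long documents with many scored words tend to pile up near the extremes. The sketch below reproduces that squashing step as I understand it from the VADER source (sum divided by the square root of the squared sum plus 15). Treat the constant and the helper name `normalize` as assumptions to check against your installed version.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash a summed valence into (-1, 1), VADER-style."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# Larger sums (typical of long statements) saturate toward 1
for total in (1, 5, 20, 50):
    print(f"summed valence {total:>2} -> compound {normalize(total):.3f}")
```

If most of your scores sit close to +1 or -1, the spread between statements may say more about length than about tone.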

In [None]:
# Create histogram
plt.figure(figsize=(12, 6))
plt.hist(all_data['sentiment'], bins=30, color='skyblue', edgecolor='black')
plt.xlabel('Sentiment Score', fontsize=12)
plt.ylabel('Number of Statements', fontsize=12)
plt.title('Distribution of Sentiment Scores', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='red', linestyle='--', label='Neutral (0)')
plt.legend()
plt.tight_layout()
plt.show()

print(f"\nSentiment Statistics:")
print(f"  Mean: {all_data['sentiment'].mean():.3f}")
print(f"  Median: {all_data['sentiment'].median():.3f}")
print(f"  Min: {all_data['sentiment'].min():.3f}")
print(f"  Max: {all_data['sentiment'].max():.3f}")

## Step 5: Sentiment Over Time

This is where it gets interesting! Let's track how sentiment changes over time.
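
Statement-by-statement scores can be noisy, so a rolling average per bank may make trends easier to see. The sketch below uses a small toy DataFrame with the same `date`, `bank`, and `sentiment` columns as `all_data`. The values are illustrative only, and the window of 3 statements is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for all_data; the values are illustrative only
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01'] * 2),
    'bank': ['Fed'] * 4 + ['RBNZ'] * 4,
    'sentiment': [0.8, 0.6, -0.4, 0.2, 0.9, 0.7, 0.5, 0.3],
}).sort_values(['bank', 'date']).reset_index(drop=True)

# Rolling mean over the last 3 statements, computed within each bank
toy['sentiment_smooth'] = (
    toy.groupby('bank')['sentiment']
       .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
print(toy)
```

Plotting `sentiment_smooth` alongside the raw `sentiment` column shows how much of the movement survives smoothing.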

In [None]:
# Plot for each bank
plt.figure(figsize=(14, 6))

for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    plt.plot(bank_data['date'], bank_data['sentiment'], 
             marker='o', label=bank, linewidth=2, markersize=6)

plt.xlabel('Date', fontsize=12)
plt.ylabel('Sentiment Score', fontsize=12)
plt.title('Sentiment of Central Bank Statements Over Time', fontsize=14, fontweight='bold')
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5, label='Neutral')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   - Look for trends: is sentiment becoming more positive or negative?")
print("   - Sharp drops might indicate crisis periods (2008, COVID-19)")
print("   - Compare banks: do they have similar patterns?")

## Step 6: Hawkish vs Dovish Analysis

Central bank watchers care about whether statements are hawkish or dovish.

We'll create a custom dictionary for monetary policy language.
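
One thing to watch with keyword dictionaries is how matches are counted. Plain substring counting finds short keywords inside longer, unrelated words, which can push a score in the wrong direction. The sketch below shows the effect on a made-up sentence and a whole-word alternative. The helper name `count_whole_word` is hypothetical.

```python
import re

text = "Rates will increase slowly; inflation is below target."

# Substring counting: 'ease' is found inside 'increase',
# and 'low' inside both 'slowly' and 'below'
print(text.lower().count('ease'))
print(text.lower().count('low'))

def count_whole_word(text, word):
    """Count occurrences of word as a standalone token, ignoring case."""
    pattern = rf'\b{re.escape(word.lower())}\b'
    return len(re.findall(pattern, text.lower()))

# Whole-word counting only matches the keyword itself
print(count_whole_word(text, 'ease'))
print(count_whole_word(text, 'below'))
```

The trade-off is that whole-word matching skips inflected forms such as "increased", so a dictionary may need to list them explicitly.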

In [None]:
# Keywords associated with hawkish (tightening) policy
HAWKISH_WORDS = [
    'increase', 'raise', 'tighten', 'elevated', 'strong', 'robust',
    'accelerating', 'above', 'high', 'rising', 'tightening', 'restrictive'
]

# Keywords associated with dovish (loosening) policy
DOVISH_WORDS = [
    'decrease', 'lower', 'ease', 'weak', 'slow', 'below', 'low',
    'subdued', 'decline', 'falling', 'accommodative', 'support'
]

def calculate_hawk_dove_score(text):
    """
    Calculate hawkish vs dovish score.
    
    Returns:
    - Positive score = hawkish
    - Negative score = dovish
    """
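    # Split into lowercase whole words. Substring counting would find
    # 'ease' inside 'increase' and 'low' inside 'below', 'slow' and 'lower'.
    words = re.findall(r"[a-z]+", text.lower())
    
    # Count hawkish words (exact matches only, so 'raised' does not count as 'raise')
    hawk_count = sum(words.count(word) for word in HAWKISH_WORDS)
    
    # Count dovish words
    dove_count = sum(words.count(word) for word in DOVISH_WORDS)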
    
    # Normalize by text length (per 100 words)
    word_count = len(text.split())
    if word_count == 0:
        return 0
    
    hawk_score = (hawk_count / word_count) * 100
    dove_score = (dove_count / word_count) * 100
    
    return hawk_score - dove_score

# Calculate for all statements
all_data['hawk_dove_score'] = all_data['text'].apply(calculate_hawk_dove_score)

print("Hawk-Dove Score calculated!")
print("  Positive = Hawkish (tightening)")
print("  Negative = Dovish (easing)\n")

all_data[['date', 'bank', 'hawk_dove_score']].head(10)

In [None]:
# Plot hawk-dove scores over time
plt.figure(figsize=(14, 6))

for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    plt.plot(bank_data['date'], bank_data['hawk_dove_score'], 
             marker='o', label=bank, linewidth=2, markersize=6)

plt.xlabel('Date', fontsize=12)
plt.ylabel('Hawk-Dove Score', fontsize=12)
plt.title('Hawkish vs Dovish Language Over Time', fontsize=14, fontweight='bold')
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5, label='Neutral')

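# Shade the hawkish (above 0) and dovish (below 0) halves without
# stretching the y-axis to a hard-coded range
ymin, ymax = plt.ylim()
plt.axhspan(0, max(ymax, 0), alpha=0.1, color='red', label='Hawkish zone')
plt.axhspan(min(ymin, 0), 0, alpha=0.1, color='blue', label='Dovish zone')
plt.ylim(ymin, ymax)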

plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Step 7: Compare Banks

Let's compare sentiment patterns between the Fed and the RBNZ.

In [None]:
# Compare average sentiment
comparison = all_data.groupby('bank')[['sentiment', 'hawk_dove_score', 'positive', 'negative']].mean()

print("Average Scores by Central Bank:")
print("=" * 70)
print(comparison.round(3))

# Create comparison bar chart
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sentiment comparison
comparison['sentiment'].plot(kind='bar', ax=axes[0], color=['steelblue', 'coral'])
axes[0].set_title('Average Sentiment by Bank', fontweight='bold')
axes[0].set_ylabel('Sentiment Score')
axes[0].set_xlabel('')
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)

# Hawk-Dove comparison
comparison['hawk_dove_score'].plot(kind='bar', ax=axes[1], color=['steelblue', 'coral'])
axes[1].set_title('Average Hawk-Dove Score by Bank', fontweight='bold')
axes[1].set_ylabel('Hawk-Dove Score')
axes[1].set_xlabel('')
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

## Step 8: Find Most Extreme Statements

Let's find the most positive, negative, hawkish, and dovish statements.

In [None]:
def show_extreme_statement(df, column, ascending=True, label=""):
    """
    Show the most extreme statement based on a score column.
    """
    statement = df.sort_values(column, ascending=ascending).iloc[0]
    print(f"\n{'='*80}")
    print(f"{label}")
    print(f"{'='*80}")
    print(f"Date: {statement['date'].date()}")
    print(f"Bank: {statement['bank']}")
    print(f"Score: {statement[column]:.3f}")
    print(f"\nFirst 300 characters:")
    print(statement['text'][:300] + "...")

# Most positive
show_extreme_statement(all_data, 'sentiment', ascending=False, label="MOST POSITIVE STATEMENT")

# Most negative
show_extreme_statement(all_data, 'sentiment', ascending=True, label="MOST NEGATIVE STATEMENT")

# Most hawkish
show_extreme_statement(all_data, 'hawk_dove_score', ascending=False, label="MOST HAWKISH STATEMENT")

# Most dovish
show_extreme_statement(all_data, 'hawk_dove_score', ascending=True, label="MOST DOVISH STATEMENT")

## Step 9: Sentiment Change Detection

Let's detect when sentiment shifts significantly from one statement to the next.
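
"Significantly" needs a yardstick. One simple option is to flag any change that is larger, in either direction, than some multiple of the standard deviation of all changes. The sketch below is a minimal stdlib version under that assumption. The helper name `flag_large_shifts`, the default `k=1.0`, and the toy scores are all illustrative.

```python
from statistics import stdev

def flag_large_shifts(scores, k=1.0):
    """Return indices whose change from the previous score exceeds
    k standard deviations of all changes (up or down)."""
    changes = [b - a for a, b in zip(scores, scores[1:])]
    if len(changes) < 2:
        return []
    spread = stdev(changes)
    if spread == 0:
        return []
    # changes[i] is the move from scores[i] to scores[i + 1]
    return [i + 1 for i, c in enumerate(changes) if abs(c) > k * spread]

# Toy sequence for one bank, in date order; values are illustrative only
toy_scores = [0.90, 0.88, 0.91, 0.20, 0.25, 0.89]
print(flag_large_shifts(toy_scores))
```

Run it per bank, on scores sorted by date, so that a change never compares statements from different institutions.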

In [None]:
# Calculate change from previous statement (for each bank separately)
# all_data is sorted by date, so diff() within each bank is chronological
all_data['sentiment_change'] = all_data.groupby('bank')['sentiment'].diff()

# Find the biggest upward shifts (nlargest keeps only increases;
# use nsmallest for the sharpest drops)
biggest_shifts = all_data.nlargest(10, 'sentiment_change')[['date', 'bank', 'sentiment_change', 'sentiment']]

print("Top 10 Biggest Positive Sentiment Shifts:")
print(biggest_shifts)

print("\n💡 These dates might mark important turning points in policy or economic conditions!")

## Step 10: Create a Sentiment Dashboard

Let's create a comprehensive visualization.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# 1. Sentiment over time
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    axes[0, 0].plot(bank_data['date'], bank_data['sentiment'], marker='o', label=bank, linewidth=2)
axes[0, 0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0, 0].set_title('Sentiment Over Time', fontweight='bold', fontsize=12)
axes[0, 0].set_ylabel('Sentiment Score')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Hawk-Dove over time
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    axes[0, 1].plot(bank_data['date'], bank_data['hawk_dove_score'], marker='o', label=bank, linewidth=2)
axes[0, 1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0, 1].set_title('Hawkish vs Dovish Over Time', fontweight='bold', fontsize=12)
axes[0, 1].set_ylabel('Hawk-Dove Score')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Sentiment distribution
axes[1, 0].hist([all_data[all_data['bank']=='Fed']['sentiment'], 
                 all_data[all_data['bank']=='RBNZ']['sentiment']], 
                label=['Fed', 'RBNZ'], bins=20, alpha=0.7)
axes[1, 0].set_title('Sentiment Distribution', fontweight='bold', fontsize=12)
axes[1, 0].set_xlabel('Sentiment Score')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()
axes[1, 0].axvline(x=0, color='red', linestyle='--', alpha=0.5)

# 4. Positive vs Negative
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    axes[1, 1].scatter(bank_data['positive'], bank_data['negative'], 
                      label=bank, alpha=0.6, s=50)
axes[1, 1].set_title('Positive vs Negative Language', fontweight='bold', fontsize=12)
axes[1, 1].set_xlabel('Positive Score')
axes[1, 1].set_ylabel('Negative Score')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 🎯 What You Learned

1. **Sentiment analysis basics**: Using VADER to score text
2. **Domain-specific analysis**: Hawkish vs dovish language in monetary policy
3. **Time series tracking**: Following sentiment changes over time
4. **Comparative analysis**: Comparing different central banks
5. **Change detection**: Finding significant shifts in tone
6. **Comprehensive visualization**: Creating multi-panel dashboards

## 🚀 Next Steps

In Tutorial 4, we'll learn:
- Advanced visualizations (word clouds, heatmaps)
- Interactive plots with Plotly
- Creating publication-ready charts

## 💡 Try It Yourself

1. Add more hawkish/dovish keywords to improve the score
2. Correlate sentiment with actual interest rate decisions
3. Create a "risk" score based on uncertainty-related words
4. Compare sentiment before and after major economic events (2008 crisis, etc.)
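
As a possible starting point for the "risk" score exercise, the sketch below counts uncertainty-related words per 100 words. The word list and the name `uncertainty_score` are hypothetical. Read a few statements first and adjust the list to the vocabulary the banks actually use.

```python
import re

# Hypothetical starter list; extend it after reading some statements
UNCERTAINTY_WORDS = {
    'uncertain', 'uncertainty', 'risk', 'risks',
    'volatile', 'volatility', 'unclear',
}

def uncertainty_score(text):
    """Uncertainty-related words per 100 words of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in UNCERTAINTY_WORDS)
    return hits / len(tokens) * 100

print(uncertainty_score("The outlook remains uncertain and risks are elevated."))
```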

In [None]:
# Exercise space
# YOUR CODE HERE
