# Tutorial 4: Advanced Visualizations

**Goal:** Learn to create compelling, publication-ready visualizations of text analytics.

**What you'll learn:**
- Word clouds for visual impact
- Heatmaps for pattern detection
- Interactive plots with Plotly
- Multi-dimensional visualizations
- Export-ready charts

**Time:** ~1 hour

## Step 1: Setup

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from wordcloud import WordCloud
from collections import Counter
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

%matplotlib inline
sns.set_style('whitegrid')

# Load data function
def load_statements(directory, bank_name):
    statements = []
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            filepath = os.path.join(directory, filename)
            date_str = filename.replace('.txt', '').replace('-txt', '')
            with open(filepath, 'r', encoding='utf-8') as file:
                text = file.read()
            statements.append({'date': date_str, 'bank': bank_name, 'text': text})
    df = pd.DataFrame(statements)
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    return df

# Load and process data
fed_data = load_statements('../usa-central-bank/fomc-statements', 'Fed')
nz_data = load_statements('../nz-central-bank/ocr', 'RBNZ')
all_data = pd.concat([fed_data, nz_data], ignore_index=True).sort_values('date').reset_index(drop=True)

# Add sentiment
analyzer = SentimentIntensityAnalyzer()
all_data['sentiment'] = all_data['text'].apply(lambda x: analyzer.polarity_scores(x)['compound'])
all_data['word_count'] = all_data['text'].str.split().str.len()

print(f"✓ Loaded {len(all_data)} statements")

## Step 2: Word Clouds

**Word clouds** are visual representations in which a word's size reflects how often it appears in the text.
They're great for presentations and reports!
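
To see what drives the sizing, the counts can be computed directly. The sketch below uses only the standard library (`re` and `Counter`, both imported in Step 1). The helper name `top_word_frequencies` and the sample sentence are illustrative assumptions, and the tokenizer is simpler than the one `WordCloud` applies internally.

```python
import re
from collections import Counter

def top_word_frequencies(text, stop_words=(), n=5):
    """Return the n most common words in text, ignoring stop words."""
    # Lowercase, then keep runs of letters (and apostrophes) as words
    words = re.findall(r"[a-z']+", text.lower())
    stop = {w.lower() for w in stop_words}
    counts = Counter(w for w in words if w not in stop)
    return counts.most_common(n)

# Made-up sample text, not taken from the statement files
sample = "Inflation is low. The Committee expects inflation to rise as the labor market improves."
print(top_word_frequencies(sample, stop_words={'the', 'is', 'to', 'as'}, n=3))
```

A dictionary of such counts can also be passed to `WordCloud.generate_from_frequencies` if you want full control over tokenization.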

In [None]:
def create_wordcloud(text, title="Word Cloud"):
    """
    Create a word cloud from text.
    """
    # Define stop words
    stop_words = set(['the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
                      'of', 'with', 'by', 'from', 'as', 'is', 'was', 'are', 'were', 'been',
                      'be', 'have', 'has', 'had', 'will', 'would', 'committee'])
    
    # Generate word cloud
    wordcloud = WordCloud(
        width=1200,
        height=600,
        background_color='white',
        stopwords=stop_words,
        colormap='viridis',
        max_words=100,
        relative_scaling=0.5
    ).generate(text)
    
    # Plot
    plt.figure(figsize=(15, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=18, fontweight='bold', pad=20)
    plt.tight_layout(pad=0)
    plt.show()

# Create word clouds for each bank
fed_text = ' '.join(fed_data['text'])
nz_text = ' '.join(nz_data['text'])

create_wordcloud(fed_text, "Federal Reserve FOMC Statements (2014-2017)")
create_wordcloud(nz_text, "Reserve Bank of New Zealand OCR Statements (2006-2012)")

## Step 3: Heatmaps - Keyword Intensity Over Time

Heatmaps show patterns across two dimensions. Perfect for tracking multiple keywords over time.
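
One caveat worth knowing: a plain substring count also matches longer words, so `'risk'` is counted inside `'risks'` and `'risky'`. Whether that is desirable depends on the question. The sketch below contrasts the two approaches using the standard library; `count_keyword` is a hypothetical helper, and the sample sentence is invented.

```python
import re

def count_keyword(text, keyword, whole_word=False):
    """Count case-insensitive occurrences of keyword in text."""
    pattern = re.escape(keyword)
    if whole_word:
        # \b anchors the match at word boundaries, so 'risk' skips 'risks'
        pattern = rf"\b{pattern}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "Risks to the outlook remain. The main risk is inflation; downside risks persist."
print(count_keyword(sample, 'risk'))                   # 3 (substring matches)
print(count_keyword(sample, 'risk', whole_word=True))  # 1 (exact-word matches)
```

With pandas, the same idea carries over by passing the bounded pattern to `Series.str.count`, which interprets its argument as a regular expression.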

In [None]:
# Define keywords to track
keywords = ['inflation', 'employment', 'growth', 'risk', 'uncertainty', 
            'economic', 'policy', 'financial', 'market', 'recovery']

# Count keywords in Fed statements
keyword_data = fed_data[['date']].copy()
for keyword in keywords:
    keyword_data[keyword] = fed_data['text'].str.lower().str.count(keyword)

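# Create heatmap matrix
# Rows = dates, Columns = keywords
heatmap_matrix = keyword_data.set_index('date')[keywords]
# Format the index as plain dates so the y-axis labels omit the time component
heatmap_matrix.index = heatmap_matrix.index.strftime('%Y-%m-%d')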

# Plot
plt.figure(figsize=(12, 10))
sns.heatmap(heatmap_matrix, 
            cmap='YlOrRd',
            linewidths=0.5,
            cbar_kws={'label': 'Mentions'},
            fmt='d')
plt.title('Keyword Frequency Heatmap - Fed Statements', fontsize=14, fontweight='bold', pad=15)
plt.xlabel('Keywords', fontsize=11)
plt.ylabel('Date', fontsize=11)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

print("\n💡 Look for:")
print("   - Vertical bands = a keyword used heavily in that period")
print("   - Horizontal bands = a statement mentioning many keywords")
print("   - Patterns = correlations between keywords")

## Step 4: Interactive Plots with Plotly

**Plotly** creates interactive charts you can zoom, pan, and hover over.
Perfect for exploring data!

In [None]:
# Interactive scatter plot: Word count vs Sentiment
fig = px.scatter(all_data, 
                 x='word_count', 
                 y='sentiment',
                 color='bank',
                 hover_data=['date'],
                 title='Statement Length vs Sentiment',
                 labels={'word_count': 'Word Count', 'sentiment': 'Sentiment Score'},
                 template='plotly_white',
                 width=900,
                 height=600)

fig.update_traces(marker=dict(size=10, opacity=0.7))
fig.add_hline(y=0, line_dash="dash", line_color="gray", annotation_text="Neutral")
fig.show()

print("💡 TIP: Hover over points to see details, click legend to filter, drag to zoom!")
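
The dashed line at 0 splits positive from negative scores, but values very close to zero are usually treated as neutral. The VADER documentation suggests ±0.05 on the compound score as a typical cutoff, and the sketch below applies that convention. The function name `label_sentiment` is an illustrative assumption.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up scores to show the three outcomes
for score in (0.62, 0.01, -0.30):
    print(score, label_sentiment(score))
```

Applying it with `all_data['sentiment'].apply(label_sentiment)` would give a categorical column that could be passed to `color=` or `symbol=` in `px.scatter`.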

In [None]:
# Interactive time series with range slider
fig = go.Figure()

for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    fig.add_trace(go.Scatter(
        x=bank_data['date'],
        y=bank_data['sentiment'],
        mode='lines+markers',
        name=bank,
        hovertemplate='<b>%{fullData.name}</b><br>Date: %{x|%Y-%m-%d}<br>Sentiment: %{y:.3f}<extra></extra>'
    ))

fig.update_layout(
    title='Interactive Sentiment Timeline',
    xaxis_title='Date',
    yaxis_title='Sentiment Score',
    hovermode='x unified',
    template='plotly_white',
    width=1000,
    height=500,
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all", label="All")
            ])
        ),
        rangeslider=dict(visible=True),
        type="date"
    )
)

fig.show()

print("💡 Use the range slider at the bottom to zoom into specific periods!")

## Step 5: Multi-Dimensional Analysis

Let's visualize multiple metrics simultaneously.
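
When a time value such as days since the first statement is mapped to marker size, the earliest observation ends up with size 0 and can vanish from the chart. One possible workaround is min-max scaling into a fixed size range, sketched below. The helper `scale_sizes` and its default bounds are assumptions chosen for illustration.

```python
def scale_sizes(values, min_size=5.0, max_size=30.0):
    """Linearly rescale values into [min_size, max_size]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: use the midpoint for every marker
        return [(min_size + max_size) / 2 for _ in values]
    span = max_size - min_size
    return [min_size + span * (v - lo) / (hi - lo) for v in values]

days_since_start = [0, 365, 730, 1460]  # hypothetical day offsets
print(scale_sizes(days_since_start))    # [5.0, 11.25, 17.5, 30.0]
```

The scaled values could be stored in a new column and passed to `size=` in place of the raw day counts.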

In [None]:
# Bubble chart: x=word count, y=sentiment, size=date (newer = bigger), color=bank
all_data['year'] = all_data['date'].dt.year
all_data['days_since_start'] = (all_data['date'] - all_data['date'].min()).dt.days

fig = px.scatter(all_data,
                 x='word_count',
                 y='sentiment',
                 size='days_since_start',
                 color='bank',
                 hover_data=['date', 'year'],
                 title='Multi-Dimensional View: Length, Sentiment, Time, and Bank',
                 labels={'word_count': 'Statement Length (words)', 
                        'sentiment': 'Sentiment Score',
                        'days_since_start': 'Time progression'},
                 template='plotly_white',
                 width=1000,
                 height=600)

fig.update_traces(marker=dict(opacity=0.6, line=dict(width=1, color='DarkSlateGrey')))
fig.show()

print("\n💡 Bubble size = how recent (bigger = more recent)")
print("   This shows if statements are getting longer/shorter and more/less positive over time")

## Step 6: Comparison Dashboard

Create a comprehensive comparison dashboard with Plotly subplots.

In [None]:
# Create subplots
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Sentiment Over Time', 'Statement Length Distribution',
                    'Word Count Over Time', 'Sentiment Distribution'),
    specs=[[{'type': 'scatter'}, {'type': 'histogram'}],
           [{'type': 'scatter'}, {'type': 'box'}]]
)

colors = {'Fed': '#1f77b4', 'RBNZ': '#ff7f0e'}

# 1. Sentiment over time
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    fig.add_trace(
        go.Scatter(x=bank_data['date'], y=bank_data['sentiment'], 
                  name=bank, line=dict(color=colors[bank]), showlegend=True),
        row=1, col=1
    )

# 2. Statement length distribution
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    fig.add_trace(
        go.Histogram(x=bank_data['word_count'], name=bank, 
                    marker=dict(color=colors[bank]), showlegend=False, opacity=0.7),
        row=1, col=2
    )

# 3. Word count over time
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    fig.add_trace(
        go.Scatter(x=bank_data['date'], y=bank_data['word_count'], 
                  name=bank, line=dict(color=colors[bank]), showlegend=False),
        row=2, col=1
    )

# 4. Sentiment distribution (box plot)
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    fig.add_trace(
        go.Box(y=bank_data['sentiment'], name=bank, 
              marker=dict(color=colors[bank]), showlegend=False),
        row=2, col=2
    )

# Update layout
fig.update_layout(
    title_text="Central Bank Communications Dashboard",
    height=800,
    showlegend=True,
    barmode='overlay',  # overlap the semi-transparent histograms instead of grouping them
    template='plotly_white'
)

fig.update_xaxes(title_text="Date", row=1, col=1)
fig.update_xaxes(title_text="Word Count", row=1, col=2)
fig.update_xaxes(title_text="Date", row=2, col=1)
fig.update_xaxes(title_text="Bank", row=2, col=2)

fig.update_yaxes(title_text="Sentiment", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=2)
fig.update_yaxes(title_text="Word Count", row=2, col=1)
fig.update_yaxes(title_text="Sentiment", row=2, col=2)

fig.show()

## Step 7: Correlation Heatmap

See how different metrics relate to each other.
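
`DataFrame.corr()` computes the Pearson coefficient by default. The sketch below works it out by hand for two short lists so the numbers in the heatmap are less of a black box. The data values are made up for illustration, and `pearson` is a hypothetical helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance numerator and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

word_count = [300, 450, 500, 620]   # hypothetical statement lengths
sentence_count = [12, 18, 20, 25]   # hypothetical sentence counts
print(round(pearson(word_count, sentence_count), 3))
```

A coefficient near 1 for a pair like this would mostly reflect that longer statements contain more sentences, a reminder that a strong correlation is not always an interesting one.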

In [None]:
# Add more metrics
all_data['sentence_count'] = all_data['text'].str.count(r'[.!?]')
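# Strip whitespace first so the average counts only the characters inside words
all_data['avg_word_length'] = all_data['text'].str.replace(r'\s+', '', regex=True).str.len() / all_data['word_count']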
all_data['inflation_mentions'] = all_data['text'].str.lower().str.count('inflation')
all_data['employment_mentions'] = all_data['text'].str.lower().str.count('employment')

# Select numeric columns for correlation
corr_columns = ['sentiment', 'word_count', 'sentence_count', 'avg_word_length', 
                'inflation_mentions', 'employment_mentions']
correlation_matrix = all_data[corr_columns].corr()

# Plot
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, 
            annot=True,  # Show numbers
            fmt='.2f',
            cmap='coolwarm',
            center=0,
            square=True,
            linewidths=1,
            cbar_kws={'label': 'Correlation'})
plt.title('Correlation Between Metrics', fontsize=14, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

print("\n💡 Reading correlations:")
print("   1.0 = perfect positive correlation")
print("   0.0 = no correlation")
print("   -1.0 = perfect negative correlation")

## Step 8: Time Series Decomposition

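Separate the underlying trend in sentiment from statement-to-statement noise. This step uses a centered rolling mean as a simple trend estimate; it does not extract a seasonal component.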
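
A centered rolling mean replaces each point with the average of itself and its neighbours. The pure-Python sketch below mirrors the idea behind `rolling(window=5, center=True).mean()` for an odd window, including the missing values at both ends where the window is incomplete. `centered_rolling_mean` is a hypothetical helper and the scores are invented.

```python
def centered_rolling_mean(values, window=5):
    """Centered moving average; None where the window does not fit."""
    if window % 2 == 0:
        raise ValueError("this sketch assumes an odd window")
    half = window // 2
    result = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            result.append(None)  # pandas reports NaN at these positions
        else:
            chunk = values[i - half:i + half + 1]
            result.append(sum(chunk) / window)
    return result

scores = [0.2, 0.4, 0.9, 0.3, 0.7, 0.5, 0.1]  # made-up sentiment scores
print(centered_rolling_mean(scores))
```

With a window of 5, the first two and last two positions have no trend value, which is why a smoothed line starts later and ends earlier than the raw series.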

In [None]:
# Calculate rolling average (smoothed trend)
for bank in all_data['bank'].unique():
    mask = all_data['bank'] == bank
    all_data.loc[mask, 'sentiment_trend'] = all_data.loc[mask, 'sentiment'].rolling(window=5, center=True).mean()

# Plot original vs trend
fig = make_subplots(rows=2, cols=1, 
                    subplot_titles=('Federal Reserve', 'Reserve Bank of New Zealand'),
                    shared_xaxes=True)

for i, bank in enumerate(['Fed', 'RBNZ'], 1):
    bank_data = all_data[all_data['bank'] == bank]
    
    # Original
    fig.add_trace(
        go.Scatter(x=bank_data['date'], y=bank_data['sentiment'],
                  name=f'{bank} Raw', mode='lines',
                  line=dict(color='lightblue', width=1), opacity=0.5),
        row=i, col=1
    )
    
    # Trend
    fig.add_trace(
        go.Scatter(x=bank_data['date'], y=bank_data['sentiment_trend'],
                  name=f'{bank} Trend', mode='lines',
                  line=dict(color='darkblue', width=3)),
        row=i, col=1
    )

fig.update_layout(title='Sentiment: Raw vs Smoothed Trend', 
                 height=700, template='plotly_white')
fig.update_yaxes(title_text="Sentiment Score")
fig.update_xaxes(title_text="Date", row=2, col=1)
fig.show()

print("\n💡 The trend line reveals the overall direction, filtering out noise")

## Step 9: Export-Ready Publication Charts

Create professional charts ready for reports or presentations.
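
The pixel dimensions of a saved raster image are the figure size in inches multiplied by the DPI. A quick sketch of that arithmetic follows. `pixel_size` is an illustrative helper, and the result ignores any cropping or padding applied by `bbox_inches='tight'`.

```python
def pixel_size(figsize_inches, dpi):
    """Return (width_px, height_px) for a figure size in inches at a given DPI."""
    width_in, height_in = figsize_inches
    return round(width_in * dpi), round(height_in * dpi)

print(pixel_size((14, 7), 300))  # (4200, 2100)
print(pixel_size((14, 7), 100))  # (1400, 700)
```

As a rough guide, 300 DPI is a common target for print, while a lower value keeps file sizes manageable for slides or the web.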

In [None]:
# High-quality chart with custom styling
plt.figure(figsize=(14, 7), dpi=300)  # High resolution

# Plot data
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    plt.plot(bank_data['date'], bank_data['sentiment'], 
             marker='o', label=bank, linewidth=2.5, markersize=7, alpha=0.8)

# Styling
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5, linewidth=1.5, label='Neutral')
plt.xlabel('Date', fontsize=14, fontweight='bold')
plt.ylabel('Sentiment Score', fontsize=14, fontweight='bold')
plt.title('Sentiment Analysis of Central Bank Communications\n2006-2017', 
         fontsize=16, fontweight='bold', pad=20)
plt.legend(fontsize=12, frameon=True, shadow=True, loc='best')
plt.grid(True, alpha=0.3, linestyle=':', linewidth=0.8)
plt.tight_layout()

# Save
plt.savefig('sentiment_analysis_chart.png', dpi=300, bbox_inches='tight')
print("✓ Chart saved as 'sentiment_analysis_chart.png'")
plt.show()

## Step 10: Interactive HTML Export

Save interactive Plotly charts as HTML files you can share.

In [None]:
# Create comprehensive interactive chart
fig = px.line(all_data, 
              x='date', 
              y='sentiment', 
              color='bank',
              title='Central Bank Sentiment Analysis - Interactive Dashboard',
              labels={'date': 'Date', 'sentiment': 'Sentiment Score', 'bank': 'Central Bank'},
              template='plotly_white')

fig.update_traces(mode='lines+markers')
fig.add_hline(y=0, line_dash="dash", line_color="gray", annotation_text="Neutral")

# Save as HTML
fig.write_html('sentiment_dashboard.html')
print("✓ Interactive dashboard saved as 'sentiment_dashboard.html'")
print("  Open this file in a web browser to explore!")

fig.show()

## 🎯 What You Learned

1. **Word clouds**: Visual representation of text frequency
2. **Heatmaps**: Pattern detection across dimensions
3. **Interactive plots**: Plotly for explorable visualizations
4. **Multi-dimensional analysis**: Bubble charts and subplots
5. **Statistical visualizations**: Correlations and distributions
6. **Time series decomposition**: Trend analysis
7. **Export techniques**: Publication-ready charts and HTML dashboards

## 🚀 Next Steps

You now have a complete toolkit for text analytics! Consider:
- Building a complete analysis pipeline
- Creating a web dashboard with Streamlit or Dash
- Adding more advanced NLP (topic modeling, entity recognition)
- Integrating with real-time data sources

## 💡 Try It Yourself

1. Create a word cloud comparing early vs late periods
2. Build a heatmap for RBNZ statements
3. Make an animated Plotly chart showing sentiment evolution
4. Design your own custom dashboard combining all techniques

In [None]:
# Exercise space
# YOUR CODE HERE
