# Tutorial 2: Text Analysis Basics

**Goal:** Learn fundamental text analysis techniques to extract insights from central bank statements.

**What you'll learn:**
- Text preprocessing and cleaning
- Word frequency analysis
- Finding important keywords
- Basic text statistics
- Simple visualizations

**Time:** ~45 minutes

## Step 1: Setup and Load Data

We'll reuse the loading function from Tutorial 1.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re  # Regular expressions for text cleaning

# Make plots show up in the notebook
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

# Function from Tutorial 1
def load_statements(directory, bank_name):
    statements = []
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            filepath = os.path.join(directory, filename)
            date_str = filename.replace('.txt', '').replace('-txt', '')
            with open(filepath, 'r', encoding='utf-8') as file:
                text = file.read()
            statements.append({
                'date': date_str,
                'bank': bank_name,
                'text': text,
                'filename': filename
            })
    df = pd.DataFrame(statements)
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    return df

# Load data
fed_data = load_statements('../usa-central-bank/fomc-statements', 'Fed')
print(f"✓ Loaded {len(fed_data)} Fed statements")

## Step 2: Text Preprocessing

**Why clean text?** Raw text has inconsistencies. We need to:
- Convert to lowercase (so "Inflation" and "inflation" are treated the same)
- Remove punctuation
- Remove common words such as "the", "a", and "and", known as "stop words" (covered in Step 3)

These steps keep word counts from being split across spelling variants or dominated by filler words.

In [None]:
def clean_text(text):
    """
    Clean and normalize text for analysis.
    
    Steps:
    1. Convert to lowercase
    2. Remove punctuation and special characters
    3. Split into words
    
    Returns: list of cleaned words
    """
    # Convert to lowercase
    text = text.lower()
    
    # Replace runs of punctuation and whitespace with a single space
    # \W+ means "one or more non-word characters";
    # letters, digits, and underscores are kept
    text = re.sub(r'\W+', ' ', text)
    
    # Split into words
    words = text.split()
    
    return words

# Test it
sample = "The Committee decided to maintain the target range for the federal funds rate at 0 to 1/4 percent."
print("Original:")
print(sample)
print("\nCleaned:")
print(clean_text(sample))
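
Because `\W+` keeps digits, the sample sentence produces tokens such as `'0'`, `'1'`, and `'4'`. If numeric tokens are not wanted, one option is a letters-only variant. The following is a minimal sketch; `clean_text_letters_only` is a hypothetical helper, not part of the tutorial's code.

```python
import re

def clean_text_letters_only(text):
    """Lowercase the text and return only runs of ASCII letters."""
    # [a-z]+ matches letter runs, so digits, punctuation,
    # and underscores are all dropped
    return re.findall(r'[a-z]+', text.lower())

sample = "The Committee decided to maintain the target range at 0 to 1/4 percent."
print(clean_text_letters_only(sample))
```

The trade-off is that figures such as "2 percent" lose their number, so this variant suits topic counting better than any analysis of rate levels.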

## Step 3: Stop Words

**Stop words** are common words that don't carry much meaning: "the", "a", "is", etc.

We remove them to focus on meaningful words.

In [None]:
# Common English stop words
STOP_WORDS = set([
    'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
    'of', 'with', 'by', 'from', 'as', 'is', 'was', 'are', 'were', 'been',
    'be', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
    'should', 'could', 'may', 'might', 'must', 'can', 'this', 'that',
    'these', 'those', 'i', 'you', 'he', 'she', 'it', 'we', 'they',
    'their', 'our', 'your', 'its'
])

def remove_stop_words(words):
    """
    Remove common stop words from a list of words.
    Words of one or two characters are also dropped.
    """
    return [word for word in words if word not in STOP_WORDS and len(word) > 2]

# Test it
test_words = ['the', 'committee', 'decided', 'to', 'maintain', 'the', 'rate']
print("Before:", test_words)
print("After:", remove_stop_words(test_words))
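
The `len(word) > 2` check in `remove_stop_words` also drops short tokens that may matter, such as "us" or "uk". The sketch below makes that threshold an explicit parameter so its effect is visible. `DEMO_STOP_WORDS`, `remove_stop_words_demo`, and `min_length` are illustrative names, and the stop word set is deliberately tiny.

```python
# Tiny illustrative stop word set (an assumption, not the tutorial's full list)
DEMO_STOP_WORDS = {'the', 'in', 'and', 'of'}

def remove_stop_words_demo(words, min_length=3):
    """Drop stop words and any word shorter than min_length characters."""
    return [w for w in words
            if w not in DEMO_STOP_WORDS and len(w) >= min_length]

tokens = ['inflation', 'in', 'the', 'us', 'and', 'uk', 'gdp']
print(remove_stop_words_demo(tokens))                # ['inflation', 'gdp']
print(remove_stop_words_demo(tokens, min_length=1))  # ['inflation', 'us', 'uk', 'gdp']
```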

## Step 4: Word Frequency Analysis

Let's find the most common words in Fed statements. This shows what topics they talk about most.

In [None]:
def get_word_frequencies(df, n=20):
    """
    Get the most common words across all statements.
    
    Parameters:
    - df: DataFrame with 'text' column
    - n: number of top words to return
    
    Returns: list of (word, count) tuples
    """
    all_words = []
    
    # Process each statement
    for text in df['text']:
        words = clean_text(text)
        words = remove_stop_words(words)
        all_words.extend(words)
    
    # Count frequencies using Counter (like a smart dictionary)
    word_counts = Counter(all_words)
    
    # Get top N most common
    return word_counts.most_common(n)

# Get top 20 words
top_words = get_word_frequencies(fed_data, n=20)

print("Top 20 Most Common Words in Fed Statements:")
print("=" * 50)
for word, count in top_words:
    print(f"{word:20s} {count:5d}")
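
Total counts can be dominated by a few long statements. A complementary view is document frequency: in how many statements a word appears at all. The sketch below shows both with `Counter` on made-up token lists; `docs`, `total_counts`, and `doc_counts` are illustrative names and the data is not from the Fed corpus.

```python
from collections import Counter

# Made-up token lists standing in for three cleaned statements
docs = [
    ['inflation', 'inflation', 'employment'],
    ['inflation', 'inflation', 'growth'],
    ['employment', 'growth', 'growth'],
]

total_counts = Counter()  # every occurrence of a word
doc_counts = Counter()    # number of documents containing the word

for tokens in docs:
    total_counts.update(tokens)
    doc_counts.update(set(tokens))  # set() counts each word once per document

print(total_counts.most_common(2))  # [('inflation', 4), ('growth', 3)]
print(doc_counts['inflation'])      # 2
```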

## Step 5: Visualize Word Frequencies

A picture is worth a thousand words! Let's create a bar chart.

In [None]:
# Prepare data for plotting
words = [word for word, count in top_words[:15]]  # Top 15 for readability
counts = [count for word, count in top_words[:15]]

# Create bar chart
plt.figure(figsize=(12, 6))
plt.bar(words, counts, color='steelblue')
plt.xlabel('Word', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Most Common Words in Fed Statements (2014-2017)', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("\n💡 What do these words tell us?")
print("   - 'committee' usually ranks near the top (that's who issues the statements)")
print("   - Economic terms like 'economic', 'employment', 'inflation' tend to be common")
print("   - This shows the key topics the Fed focuses on")

## Step 6: Track Keywords Over Time

Let's see how often specific keywords appear over time. This shows shifting priorities.

In [None]:
def count_keyword(text, keyword):
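    """
    Count how many times a keyword appears in text (case-insensitive).

    Note: this is a substring match, so 'employment' also counts
    'unemployment' and 'risk' also counts 'risks'.
    """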
    return text.lower().count(keyword.lower())

# Keywords we care about
keywords = ['inflation', 'employment', 'growth', 'risk']

# Count each keyword in each statement
for keyword in keywords:
    fed_data[keyword] = fed_data['text'].apply(lambda x: count_keyword(x, keyword))

# Show first few rows
fed_data[['date', 'inflation', 'employment', 'growth', 'risk']].head(10)
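
`str.count` matches substrings, so a search for "employment" also picks up "unemployment". If whole-word matches are wanted, a regular expression with word boundaries is one way to get them. This is a minimal sketch; `count_whole_word` is a hypothetical helper and the sentence is invented for illustration.

```python
import re

def count_whole_word(text, keyword):
    """Count case-insensitive occurrences of keyword as a whole word."""
    # \b is a word boundary; re.escape protects any special characters
    pattern = r'\b' + re.escape(keyword) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sentence = "Unemployment fell while employment growth slowed; risks to employment remain."
print(sentence.lower().count('employment'))      # 3 (includes 'unemployment')
print(count_whole_word(sentence, 'employment'))  # 2
print(count_whole_word(sentence, 'risk'))        # 0 ('risks' is a different word)
```

Whole-word matching is stricter in both directions: it excludes "unemployment" but also misses plurals, so a pattern such as `r'\brisks?\b'` may be needed for some keywords.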

In [None]:
# Plot keyword trends over time
plt.figure(figsize=(14, 6))

for keyword in keywords:
    plt.plot(fed_data['date'], fed_data[keyword], marker='o', label=keyword.title(), linewidth=2)

plt.xlabel('Date', fontsize=12)
plt.ylabel('Mentions per Statement', fontsize=12)
plt.title('Keyword Frequency in Fed Statements Over Time', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Insights:")
print("   - You can see when certain topics become more/less important")
print("   - Spikes might correlate with economic events")
print("   - This is how researchers track changing priorities")
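
Raw mentions per statement rise and fall with statement length, so a longer statement can look more focused on a topic than it is. One common adjustment is a rate per 1,000 words. The sketch below is an illustration under assumptions: `mentions_per_1000_words` is a hypothetical helper, and it matches exact tokens rather than substrings, so its numbers will differ from `count_keyword`.

```python
import re

def mentions_per_1000_words(text, keyword):
    """Return keyword occurrences per 1,000 words (exact token match)."""
    words = re.findall(r'\w+', text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w == keyword.lower())
    return 1000 * hits / len(words)

short = "Inflation rose. Inflation remains elevated."
# Same two mentions, padded with unrelated sentences
long_ = short + " " + "The Committee will continue to monitor incoming information. " * 10

print(mentions_per_1000_words(short, 'inflation'))  # 400.0
print(round(mentions_per_1000_words(long_, 'inflation'), 1))
```

Both texts mention inflation twice, but the rate is far lower in the padded one. The same function could be applied to each statement with `fed_data['text'].apply(...)`.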

## Step 7: Bigrams (Two-Word Phrases)

Sometimes two words together carry more meaning than either word alone.
Examples: "interest rate", "economic growth", "labor market"

In [None]:
def get_bigrams(text, n=15):
    """
    Find most common two-word phrases.
    """
    words = clean_text(text)
    
    # Create pairs of consecutive words
    bigrams = []
    for i in range(len(words) - 1):
        bigram = f"{words[i]} {words[i+1]}"
        bigrams.append(bigram)
    
    # Count and return top N
    return Counter(bigrams).most_common(n)

# Combine all statements into one string
# (this also creates a few bigrams that span statement boundaries)
all_text = ' '.join(fed_data['text'])

top_bigrams = get_bigrams(all_text, n=15)

print("Top 15 Two-Word Phrases:")
print("=" * 50)
for phrase, count in top_bigrams:
    print(f"{phrase:30s} {count:5d}")
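
Since `get_bigrams` does not remove stop words, pairs such as "of the" are likely to crowd the top of the list. One possible refinement is to skip any pair that contains a stop word and to build pairs one statement at a time, so no bigram spans two statements. This is a sketch only; `content_bigrams` and `DEMO_STOP_WORDS` are hypothetical names and the two sentences are invented.

```python
import re
from collections import Counter

# Tiny illustrative stop word set (an assumption, not the tutorial's full list)
DEMO_STOP_WORDS = {'the', 'of', 'to', 'in', 'and', 'for'}

def content_bigrams(texts, n=5):
    """Count bigrams per text, skipping pairs that contain a stop word."""
    counts = Counter()
    for text in texts:
        words = re.findall(r'\w+', text.lower())
        # zip pairs each word with its successor within the same text
        for first, second in zip(words, words[1:]):
            if first in DEMO_STOP_WORDS or second in DEMO_STOP_WORDS:
                continue
            counts[f"{first} {second}"] += 1
    return counts.most_common(n)

docs = [
    "The labor market continued to strengthen.",
    "Conditions in the labor market improved.",
]
print(content_bigrams(docs, n=1))  # [('labor market', 2)]
```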

## Step 8: Compare Time Periods

Let's compare early vs. late statements to see how language evolved.

In [None]:
# Split the statements into two equal halves (fed_data is already sorted by date)
mid_point = len(fed_data) // 2
early = fed_data.iloc[:mid_point]
late = fed_data.iloc[mid_point:]

print(f"Early period: {early['date'].min().date()} to {early['date'].max().date()}")
print(f"Late period: {late['date'].min().date()} to {late['date'].max().date()}")

# Get top words for each period
early_words = dict(get_word_frequencies(early, n=10))
late_words = dict(get_word_frequencies(late, n=10))

print("\nTop 10 Words - Early Period:")
for word, count in early_words.items():
    print(f"  {word}: {count}")

print("\nTop 10 Words - Late Period:")
for word, count in late_words.items():
    print(f"  {word}: {count}")
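
Two top-10 lists are hard to compare by eye, and raw counts depend on how much text each period contains. One way to make the comparison explicit is to convert counts to shares of all words and rank words by how much their share changed. The sketch below uses invented token lists; `relative_frequencies` and `biggest_shifts` are hypothetical helpers.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def biggest_shifts(early_tokens, late_tokens, n=3):
    """Rank words by absolute change in share (late minus early)."""
    early = relative_frequencies(early_tokens)
    late = relative_frequencies(late_tokens)
    vocab = set(early) | set(late)
    shifts = {w: late.get(w, 0.0) - early.get(w, 0.0) for w in vocab}
    # Sort by size of change, then alphabetically to break ties
    return sorted(shifts.items(), key=lambda item: (-abs(item[1]), item[0]))[:n]

early_tokens = ['purchases', 'purchases', 'inflation', 'employment']
late_tokens = ['inflation', 'inflation', 'employment', 'normalization']

for word, change in biggest_shifts(early_tokens, late_tokens):
    print(f"{word:15s} {change:+.2f}")
```

A negative value means the word lost share in the later period; a positive value means it gained.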

## Step 9: Text Readability

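Let's measure how complex the statements are. The standard measure is **Flesch Reading Ease**:
- Higher score = easier to read
- 60-70 = standard writing
- 0-30 = very difficult (college graduate level)

The code below uses a simplified proxy based only on average sentence length, so its scores are not on the true Flesch scale.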

In [None]:
def calculate_readability(text):
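    """
    Estimate readability from average sentence length.

    The full Flesch Reading Ease formula is:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)

    This simplified version skips syllable counting, so its scores are
    only a rough proxy and are not comparable to true Flesch scores.
    """
    words = len(text.split())
    # Rough sentence count: every '.', '!' or '?' counts,
    # including periods in decimals and abbreviations
    sentences = text.count('.') + text.count('!') + text.count('?')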
    
    if sentences == 0:
        return 0
    
    # Simplified calculation (without syllable counting)
    avg_sentence_length = words / sentences
    
    # Estimate complexity based on sentence length
    # Longer sentences = harder to read
    score = 100 - (avg_sentence_length * 2)
    
    return max(0, score)  # Keep score above 0

# Calculate for all statements
fed_data['readability'] = fed_data['text'].apply(calculate_readability)

# Plot over time
plt.figure(figsize=(14, 6))
plt.plot(fed_data['date'], fed_data['readability'], marker='o', linewidth=2, color='purple')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Readability Score', fontsize=12)
plt.title('Fed Statement Readability Over Time (Higher = Easier)', fontsize=14, fontweight='bold')
plt.axhline(y=60, color='red', linestyle='--', label='Flesch "standard writing" level (reference only)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nAverage readability score: {fed_data['readability'].mean():.1f}")
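
For scores that are comparable to the published Flesch scale, the syllable term has to be included. The sketch below applies the full formula with a rough syllable heuristic (counting vowel groups). Both `count_syllables` and `flesch_reading_ease` are hypothetical helpers; the heuristic miscounts some words, and the sentence splitter still treats decimal points as sentence ends, so results are approximate.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r'[aeiouy]+', word))
    # A final 'e' after a consonant other than 'l' is usually silent ("rate")
    if re.search(r'[^aeiouyl]e$', word) and count > 1:
        count -= 1
    return max(1, count)

def flesch_reading_ease(text):
    """Approximate Flesch Reading Ease; returns None for empty text."""
    words = re.findall(r'[A-Za-z]+', text)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    if not words or not sentences:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

simple = "The cat sat. The dog ran."
dense = "The Committee anticipates that accommodative monetary policy remains appropriate."
print(round(flesch_reading_ease(simple), 2))
print(flesch_reading_ease(dense) < flesch_reading_ease(simple))  # True
```

Scores above 100 or below 0 are possible with this formula for very simple or very dense text.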

## 🎯 What You Learned

1. **Text cleaning**: Lowercase, punctuation removal, stop words
2. **Word frequency**: Finding most common words with Counter
3. **Visualization**: Creating bar charts and line plots
4. **Time series analysis**: Tracking keywords over time
5. **Bigrams**: Finding meaningful phrases
6. **Readability**: Measuring text complexity

## 🚀 Next Steps

In Tutorial 3, we'll learn:
- Sentiment analysis (is the tone positive/negative?)
- Hawkish vs dovish language detection
- Advanced NLP with specialized libraries

## 💡 Try It Yourself

1. Load RBNZ data and compare top words with Fed
2. Add more keywords to track (try: "uncertainty", "policy", "committee")
3. Find the most readable and least readable statements
4. Create a word cloud (requires `wordcloud` library)

In [None]:
# Exercise space
# YOUR CODE HERE
