# Tutorial 1: Loading and Exploring Central Bank Data

**Goal:** Learn how to load text files and explore them using Python and pandas.

**What you'll learn:**
- Reading text files from a directory
- Organizing data in a pandas DataFrame
- Basic exploration and statistics
- Understanding your dataset

**Time:** ~30 minutes

## Step 1: Import Libraries

First, we need to import the tools we'll use. Think of this like getting your tools out of a toolbox.

In [None]:
# Import necessary libraries
import os  # For working with files and directories
import pandas as pd  # For organizing data in tables (DataFrames)
from datetime import datetime  # For working with dates

# Make output look nicer
pd.set_option('display.max_colwidth', 100)

print("✓ Libraries imported successfully!")

## Step 2: Load Data from Files

Let's write a function to load all our text files. Functions are reusable pieces of code - write once, use many times!

In [None]:
def load_statements(directory, bank_name):
    """
    Load all text files from a directory and organize them into a DataFrame.
    
    Parameters:
    - directory: path to folder containing text files
    - bank_name: name of the central bank (e.g., 'RBNZ', 'Fed')
    
    Returns:
    - DataFrame with columns: date, bank, text, filename
    """
    
    statements = []  # Empty list to store our data
    
    # Loop through all files in the directory
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):  # Only process text files
            
            # Build the full path to the file
            filepath = os.path.join(directory, filename)
            
            # Extract date from filename (e.g., '2014-01-29.txt' -> '2014-01-29')
            date_str = filename.replace('.txt', '').replace('-txt', '')
            
            # Read the file content
            with open(filepath, 'r', encoding='utf-8') as file:
                text = file.read()
            
            # Add this statement to our list
            statements.append({
                'date': date_str,
                'bank': bank_name,
                'text': text,
                'filename': filename
            })
    
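    # Convert list to DataFrame (think of it as a spreadsheet).
    # Naming the columns keeps them in place even if no .txt files were found.
    df = pd.DataFrame(statements, columns=['date', 'bank', 'text', 'filename'])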
    
    # Convert date strings to actual date objects (makes sorting/filtering easier)
    df['date'] = pd.to_datetime(df['date'])
    
    # Sort by date (oldest first)
    df = df.sort_values('date').reset_index(drop=True)
    
    return df

print("✓ Function defined successfully!")
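
The loader above assumes every `.txt` file is named after a date. As a minimal sketch of a more defensive variant, the example below uses `pathlib` and skips files whose names do not parse as dates. The name `load_statements_safe`, the `%Y-%m-%d` filename format, and the demo files are assumptions made for illustration and are not part of the tutorial's dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_statements_safe(directory, bank_name):
    """Hypothetical variant of load_statements that skips non-date filenames."""
    rows = []
    for path in sorted(Path(directory).glob("*.txt")):
        # path.stem is the filename without its extension, e.g. '2014-01-29'
        date = pd.to_datetime(path.stem, format="%Y-%m-%d", errors="coerce")
        if pd.isna(date):
            continue  # e.g. 'README.txt' is not a statement
        rows.append({
            "date": date,
            "bank": bank_name,
            "text": path.read_text(encoding="utf-8"),
            "filename": path.name,
        })
    # Passing columns= keeps the expected columns even when rows is empty
    df = pd.DataFrame(rows, columns=["date", "bank", "text", "filename"])
    return df.sort_values("date").reset_index(drop=True)


# Try it on a throwaway directory with two statements and one stray file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "2014-03-19.txt").write_text("Second statement.", encoding="utf-8")
    Path(tmp, "2014-01-29.txt").write_text("First statement.", encoding="utf-8")
    Path(tmp, "README.txt").write_text("Not a statement.", encoding="utf-8")
    demo = load_statements_safe(tmp, "Demo")

print(demo[["date", "filename"]])
```

Testing a loader on a small temporary directory like this is a quick way to confirm it behaves as expected before pointing it at the real folders.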

## Step 3: Load All Our Data

Now let's use our function to load data from both central banks.

In [None]:
# Load Reserve Bank of New Zealand (RBNZ) Official Cash Rate (OCR) statements
nz_data = load_statements('../nz-central-bank/ocr', 'RBNZ')
print(f"Loaded {len(nz_data)} statements from RBNZ")

# Load US Federal Reserve FOMC statements
fed_data = load_statements('../usa-central-bank/fomc-statements', 'Fed')
print(f"Loaded {len(fed_data)} statements from Fed")

# Combine both datasets into one
all_data = pd.concat([nz_data, fed_data], ignore_index=True)
all_data = all_data.sort_values('date').reset_index(drop=True)

print(f"\n✓ Total: {len(all_data)} statements loaded!")
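
To see what `ignore_index=True` and `reset_index(drop=True)` actually do, here is a small sketch on two stand-in tables. The dates and rows are made up for illustration and do not come from the real statements.

```python
import pandas as pd

# Two tiny stand-in tables; the dates are invented for this example
a = pd.DataFrame({"date": pd.to_datetime(["2014-01-30", "2014-03-13"]), "bank": "RBNZ"})
b = pd.DataFrame({"date": pd.to_datetime(["2014-01-29", "2014-03-19"]), "bank": "Fed"})

# Without ignore_index, each table keeps its own row labels
kept = pd.concat([a, b])

# With ignore_index=True, the rows are relabelled from 0 upward
combined = pd.concat([a, b], ignore_index=True)

# Sorting reorders the rows, so reset_index(drop=True) renumbers them again
combined = combined.sort_values("date").reset_index(drop=True)

print(list(kept.index))
print(combined)
```

Duplicate row labels such as those in `kept` are easy to miss and can make label-based lookups return more than one row, which is why the tutorial renumbers after combining.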

## Step 4: Explore the Data

Let's look at what we have. This is like opening a new dataset for the first time.

In [None]:
# Show the first few rows
print("First 5 statements:")
all_data.head()

In [None]:
# Get basic info about our dataset
print("Dataset Information:")
all_data.info()  # info() prints its summary directly and returns None, so no print() is needed

In [None]:
# Count statements per bank
print("\nStatements per bank:")
print(all_data['bank'].value_counts())

In [None]:
# Date range for each bank
print("\nDate ranges:")
for bank in all_data['bank'].unique():
    bank_data = all_data[all_data['bank'] == bank]
    print(f"{bank}: {bank_data['date'].min().date()} to {bank_data['date'].max().date()}")
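
The loop above can also be written as a single `groupby` call. The sketch below shows the idea on a small made-up table, so the values are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for all_data
toy = pd.DataFrame({
    "bank": ["Fed", "RBNZ", "Fed", "RBNZ"],
    "date": pd.to_datetime(["2014-01-29", "2014-01-30", "2014-03-19", "2014-03-13"]),
})

# One row per bank, with the earliest date, latest date, and number of statements
date_ranges = toy.groupby("bank")["date"].agg(["min", "max", "count"])
print(date_ranges)
```

The result is itself a DataFrame, which makes it easier to reuse later than text printed from a loop.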

## Step 5: Add Useful Metrics

Let's calculate some basic metrics about each statement.

In [None]:
# Add text length metrics
all_data['char_count'] = all_data['text'].str.len()  # Number of characters
all_data['word_count'] = all_data['text'].str.split().str.len()  # Number of words
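all_data['sentence_count'] = all_data['text'].str.count(r'[.!?]')  # Rough sentence count: every '.', '!' or '?' is counted, including decimal points

# Calculate average word length in characters per word (a rough indicator of vocabulary complexity).
# Whitespace is removed first so spaces and newlines are not counted as letters;
# punctuation attached to words is still included, so treat this as approximate.
non_space_chars = all_data['text'].str.replace(r'\s+', '', regex=True).str.len()
all_data['avg_word_length'] = non_space_chars / all_data['word_count']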

print("✓ Metrics calculated!")
all_data[['date', 'bank', 'word_count', 'sentence_count', 'avg_word_length']].head()
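
It helps to check these metrics on a sentence short enough to count by hand. The example text below is invented for illustration.

```python
import pandas as pd

# A made-up two-sentence text that contains a decimal number
toy = pd.DataFrame({"text": ["The rate rose by 0.25 percent. Growth is steady."]})

toy["char_count"] = toy["text"].str.len()
toy["word_count"] = toy["text"].str.split().str.len()
toy["sentence_count"] = toy["text"].str.count(r"[.!?]")

print(toy[["char_count", "word_count", "sentence_count"]])
```

The text has two sentences, but the punctuation count is 3 because the decimal point in `0.25` is counted as well. Central bank statements are full of numbers like this, so the sentence count is best read as a rough upper estimate.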

## Step 6: Basic Statistics

Let's look at summary statistics to understand our data better.

In [None]:
# Overall statistics
print("Overall Statistics:")
all_data[['word_count', 'sentence_count', 'avg_word_length']].describe()

In [None]:
# Compare banks
print("\nComparison by Bank:")
all_data.groupby('bank')[['word_count', 'sentence_count', 'avg_word_length']].mean()
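
A mean on its own can be pulled upward by a few unusually long statements. As a sketch, `agg` can report several statistics at once so that the mean and median can be compared. The word counts below are made up for illustration.

```python
import pandas as pd

# Made-up word counts; one Fed statement is much longer than the rest
toy = pd.DataFrame({
    "bank": ["Fed", "Fed", "Fed", "RBNZ", "RBNZ"],
    "word_count": [400, 420, 1500, 300, 320],
})

# Count, mean, and median per bank in a single table
summary = toy.groupby("bank")["word_count"].agg(["count", "mean", "median"])
print(summary)
```

When the mean sits well above the median, as it does for the toy Fed rows here, a small number of long documents is probably driving the average.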

## Step 7: Look at Actual Content

Let's read a sample statement to see what we're working with.

In [None]:
# Show one statement from the Fed
sample = all_data[all_data['bank'] == 'Fed'].iloc[0]

print(f"Date: {sample['date'].date()}")
print(f"Bank: {sample['bank']}")
print(f"Word count: {sample['word_count']}")
print("\nContent preview (first 500 characters):")
print("=" * 80)
print(sample['text'][:500] + "...")

## 🎯 What You Learned

1. **Reading files**: Using `os` and file operations
2. **Functions**: Creating reusable code with `def`
3. **DataFrames**: Organizing data with pandas
4. **Dates**: Working with datetime objects
5. **Basic analysis**: Counting, calculating metrics, grouping

## 🚀 Next Steps

In Tutorial 2, we'll learn:
- Text preprocessing (cleaning, normalizing)
- Word frequency analysis
- Finding common patterns
- Creating simple visualizations

## 💡 Try It Yourself

Before moving on, try these exercises:

1. Find the longest statement (most words)
2. Find the shortest statement
3. Calculate how many statements were issued per year
4. Load the `fomc-implementation-note` data and compare it to regular statements

In [None]:
# Exercise 1: Find longest statement
# YOUR CODE HERE


In [None]:
# Exercise 2: Find shortest statement
# YOUR CODE HERE


In [None]:
# Exercise 3: Statements per year
# HINT: Use all_data['date'].dt.year to extract the year
# YOUR CODE HERE
