# Fed Data Collection and Processing

This notebook runs the scraping and processing scripts that collect FOMC meeting minutes, speeches, and press conference transcripts from the Federal Reserve website.
The scripts are located in the `../scraping/` directory.

**Note:** Ensure you have the necessary dependencies installed (see `requirements.txt`).
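
A quick way to confirm the dependencies are importable before running any script is sketched below. It assumes the three packages named in this notebook's install comments (`nltk`, `pandas`, `pdfplumber`); `requirements.txt` may list others, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed package list; requirements.txt may contain more entries.
required = ["nltk", "pandas", "pdfplumber"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Using `find_spec` only locates each package without importing it, so the check stays fast even for heavy libraries.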

In [1]:
import os
import sys

# Uncomment to install the dependencies into this kernel's environment
# !{sys.executable} -m pip install nltk pandas
# !{sys.executable} -m pip install pdfplumber

# Print current working directory to verify context
print(f"Current Working Directory: {os.getcwd()}")

# The scripts are in the sibling directory '../scraping'
# Each one is run with this kernel's interpreter: !{sys.executable} <relative path>

Current Working Directory: e:\Textming\notebooks
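
The `!` shell escape keeps going when a script exits with an error, so a failed scrape can go unnoticed in a "Run All". One alternative is sketched below, assuming the scripts need no command-line arguments; `run_script` is a hypothetical helper, not part of the project.

```python
import subprocess
import sys

def run_script(script_path):
    """Run a Python script with the current interpreter; raise if it fails."""
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        # Surface the script's stderr instead of continuing silently.
        raise RuntimeError(
            f"{script_path} exited with code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout

# Example, using a path from this notebook:
# run_script("../scraping/scrape_minutes.py")
```

Unlike the `!` form, this prints the output only after the script finishes, which matters for long scrapes.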


## 1. Meeting Minutes
Scrape and process the FOMC meeting minutes.

In [2]:
# Scrape Minutes
!{sys.executable} ../scraping/scrape_minutes.py


--- Starting FOMC Minutes Scraping ---
Fetching: https://www.federalreserve.gov/monetarypolicy/fomchistorical2018.htm
Fetching: https://www.federalreserve.gov/monetarypolicy/fomcminutes20180131.htm
Parsed Minutes from https://www.federalreserve.gov/monetarypolicy/fomcminutes20180131.htm
Fetching: https://www.federalreserve.gov/monetarypolicy/fomcminutes20180321.htm
Parsed Minutes from https://www.federalreserve.gov/monetarypolicy/fomcminutes20180321.htm
Fetching: https://www.federalreserve.gov/monetarypolicy/fomcminutes20180502.htm
Parsed Minutes from https://www.federalreserve.gov/monetarypolicy/fomcminutes20180502.htm
Fetching: https://www.federalreserve.gov/monetarypolicy/fomcminutes20180613.htm
Parsed Minutes from https://www.federalreserve.gov/monetarypolicy/fomcminutes20180613.htm
Fetching: https://www.federalreserve.gov/monetarypolicy/fomcminutes20180801.htm
Parsed Minutes from https://www.federalreserve.gov/monetarypolicy/fomcminutes20180801.htm
Fetching: https://www.federalre

In [3]:
# Process Minutes
!{sys.executable} ../scraping/process_minutes.py

Successfully loaded 6864 documents.
Segmenting text into sentences with section labels...

--- Processing Complete ---
Original Documents: 6864
Generated Sentences: 1618

Distribution of Sections:
section
Participants' Views                    498
Developments in Financial Markets      316
Staff Economic Outlook                 301
Staff Review of Financial Situation    260
Staff Review of Economic Situation     185
Committee Policy Action                 51
Inflation Analysis                       7
Name: count, dtype: int64

Sample Data:
        date  ...                                      sentence_text
0 2018-01-31  ...  The manager of the System Open Market Account ...
1 2018-01-31  ...  Financial conditions eased further over recent...
2 2018-01-31  ...  In this environment, yields on safe assets suc...
3 2018-01-31  ...  Breakeven measures of inflation compensation d...
4 2018-01-31  ...  Survey measures of longer-term inflation expec...
5 2018-01-31  ...  Judging from interest

## 2. Speeches
Scrape and process the speeches of Federal Reserve officials.

In [4]:
# Scrape Speeches
!{sys.executable} ../scraping/scrape_speeches.py


--- Starting Speeches Scraping ---
Fetching: https://www.federalreserve.gov/newsevents/speech/2018-speeches.htm
Fetching: https://www.federalreserve.gov/newsevents/speech/powell20181206a.htm
Parsed Speech: Welcoming Remarks...
Fetching: https://www.federalreserve.gov/newsevents/speech/powell20181203a.htm
Parsed Speech: Celebrating Excellence in Comm...
Fetching: https://www.federalreserve.gov/newsevents/speech/powell20181128a.htm
Parsed Speech: The Federal Reserve's Framewor...
Fetching: https://www.federalreserve.gov/newsevents/speech/powell20181002a.htm
Parsed Speech: Monetary Policy and Risk Manag...
Fetching: https://www.federalreserve.gov/newsevents/speech/powell20180927a.htm
Parsed Speech: Brief Remarks on the U.S. Econ...
Fetching: https://www.federalreserve.gov/newsevents/speech/powell20180824a.htm
Parsed Speech: Monetary Policy in a Changing ...
Fetching: https://www.federalreserve.gov/newsevents/speech/powell20180620a.htm
Parsed Speech: Monetary Policy at a Time of U...
Fetc

In [5]:
# Process Speeches
!{sys.executable} ../scraping/process_speeches.py

Loaded 1661 raw speech segments.
Processing complete. Saved 4542 sentences to e:\Textming\data\processed\fed_speeches_sentences.csv


## 3. Press Conferences
Scrape (download PDFs) and process (extract text and structure) the press conference transcripts.

In [6]:
# Download and process press conferences (fetch PDFs, then extract and structure the text)
!{sys.executable} ../scraping/scrape_process_press_conf.py


[Step 1] Starting Download to e:\Textming\data\raw\press_conf_pdfs...
  Checking year 2018...
  Checking year 2019...
  Checking year 2020...
  Checking year 2021...
  Checking year 2022...
  Checking year 2023...
  Checking year 2024...

[Step 2] Processing PDFs in e:\Textming\data\raw\press_conf_pdfs...
  Processing: FOMC20191004confcall.pdf
  Processing: FOMCpresconf20180321.pdf
DEBUG: Found earliest marker 'happy to respond to your questions' at 76 for date 2018-03-21
  Processing: FOMCpresconf20180613.pdf
DEBUG: Found earliest marker 'happy to take your questions' at 8756 for date 2018-06-13
  Processing: FOMCpresconf20180926.pdf
DEBUG: Found earliest marker 'happy to take your questions' at 5782 for date 2018-09-26
  Processing: FOMCpresconf20181219.pdf
DEBUG: Found earliest marker 'happy to take your questions' at 7982 for date 2018-12-19
  Processing: FOMCpresconf20190130.pdf
DEBUG: No marker found for date 2019-01-30
  Processing: FOMCpresconf20190320.pdf
DEBUG: No marker fou
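
The DEBUG lines above report, for each transcript, the earliest position of a phrase that marks the start of the Q&A. How such a search might work is sketched below; this is an assumption about the approach, not the code in `scrape_process_press_conf.py`. The two marker phrases come from the log, while `find_earliest_marker` and the sample text are made up for illustration.

```python
def find_earliest_marker(text, markers):
    """Return (marker, index) for the marker that occurs first in `text`, or None."""
    hits = [(text.find(marker), marker) for marker in markers]
    hits = [(pos, marker) for pos, marker in hits if pos != -1]
    if not hits:
        return None
    pos, marker = min(hits)  # smallest index wins
    return marker, pos

# Marker phrases taken from the debug output; matching here is case-sensitive.
markers = [
    "happy to take your questions",
    "happy to respond to your questions",
]

# Synthetic text standing in for a transcript.
sample = "Opening statement. I am happy to take your questions. First question."
found = find_earliest_marker(sample, markers)
if found is not None:
    marker, pos = found
    opening_statement = sample[:pos]
    question_and_answer = sample[pos:]
```

A transcript with no match returns `None`, which corresponds to the "No marker found" lines and has to be handled separately.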

## 4. Verify Data
Check the generated CSV files to ensure data has been collected and processed correctly.
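
Beyond inspecting the printed output by eye, the expected schema can be checked automatically. This is a minimal sketch: `missing_columns` is a hypothetical helper, and the expected column names are assumed from what this notebook prints for the minutes file.

```python
import pandas as pd

def missing_columns(df, expected):
    """Return the expected column names that are absent from `df`."""
    return [col for col in expected if col not in df.columns]

# Column names assumed from the minutes file as printed by this notebook.
expected_minutes = ["original_doc_id", "date", "section", "sentence_text", "source_type"]

# Tiny stand-in frame with placeholder values, so the sketch runs without the real CSV.
demo = pd.DataFrame({"date": ["2018-01-31"], "section": ["placeholder"]})
print(missing_columns(demo, expected_minutes))
```

With the real data, `demo` would be replaced by the result of `pd.read_csv` on the processed file, and an empty list would mean every expected column is present.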

In [7]:
import pandas as pd

# Define paths
data_dir = "../data/processed"
minutes_file = os.path.join(data_dir, "fed_minutes_sentences_structured.csv")
speeches_file = os.path.join(data_dir, "fed_speeches_sentences.csv")
press_file = os.path.join(data_dir, "fed_press_conf_structured.csv")

# Report whether a CSV exists and, if so, print its shape, columns, and first rows
def check_file(filepath, name):
    if os.path.exists(filepath):
        print(f"\n--- {name} ---")
        try:
            df = pd.read_csv(filepath)
            print(f"File found: {filepath}")
            print(f"Shape: {df.shape}")
            print("Columns:", df.columns.tolist())
            print(df.head(3))
        except Exception as e:
            print(f"Error reading {name}: {e}")
    else:
        print(f"\n{name} NOT found at {filepath}")

# Check all
check_file(minutes_file, "Minutes Data")
check_file(speeches_file, "Speeches Data")
check_file(press_file, "Press Conference Data")


--- Minutes Data ---
File found: ../data/processed\fed_minutes_sentences_structured.csv
Shape: (1618, 5)
Columns: ['original_doc_id', 'date', 'section', 'sentence_text', 'source_type']
   original_doc_id        date                            section  \
0              135  2018-01-31  Developments in Financial Markets   
1              135  2018-01-31  Developments in Financial Markets   
2              135  2018-01-31  Developments in Financial Markets   

                                       sentence_text source_type  
0  The manager of the System Open Market Account ...     Minutes  
1  Financial conditions eased further over recent...     Minutes  
2  In this environment, yields on safe assets suc...     Minutes  

--- Speeches Data ---
File found: ../data/processed\fed_speeches_sentences.csv
Shape: (4542, 5)
Columns: ['date', 'title', 'text', 'source_type', 'url']
         date              title  \
0  2018-12-06  Welcoming Remarks   
1  2018-12-06  Welcoming Remarks   
2  2018