# Phase 1, Step 1: Project Setup & Data Preparation

**Objective:** To source, clean, and prepare two distinct, knowledge-intensive datasets for our continual learning experiment. 

- **Task A (Broad Knowledge):** Wikipedia articles on core finance and economics topics.
- **Task B (Specialized Knowledge):** Corporate earnings call transcripts.

This notebook will handle all preprocessing and save the final, analysis-ready datasets to the `../data/` directory.

## 1. Setup & Dependencies

First, we install and import all necessary libraries. We'll need `wikipedia-api` (imported as `wikipediaapi`) for Task A, `datasets` from Hugging Face for Task B, and standard data manipulation tools such as `pandas` and `scikit-learn`.

In [1]:
%pip install wikipedia-api transformers datasets pandas pyarrow scikit-learn tqdm

Collecting wikipedia-api
  Downloading wikipedia_api-0.8.1.tar.gz (19 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.1-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.2-cp311-cp311-win_amd64.whl (8.9 MB)

In [2]:
import wikipediaapi
from datasets import load_dataset
import pandas as pd
import re
import os
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

# --- Configuration ---
DATA_DIR = "../data/"
WIKI_LANG = 'en'
CHUNK_SIZE = 256  # Words per chunk
CHUNK_OVERLAP = 50 # Words to overlap between chunks
TEST_SIZE = 0.15
VAL_SIZE = 0.15
RANDOM_STATE = 42

# Ensure data directory exists (no error if it is already there)
os.makedirs(DATA_DIR, exist_ok=True)

print(f"Project setup complete. Datasets will be saved to: {os.path.abspath(DATA_DIR)}")

Project setup complete. Datasets will be saved to: C:\HGC_Thesis\data


## 2. Task A: Wikipedia - Broad Financial & Economic Knowledge

We will fetch content from a curated list of Wikipedia pages covering fundamental economic and financial concepts. This will form our initial, broad knowledge base.

In [3]:
# wikipedia-api requires a user_agent that identifies the client;
# omitting it raises "TypeError: ... missing 1 required positional argument: 'user_agent'".
# The value below is a placeholder: replace it with your own project name and contact.
wiki_wiki = wikipediaapi.Wikipedia(
    user_agent="HGC-Thesis-DataPrep/0.1 (contact: your-email@example.com)",
    language=WIKI_LANG,
    extract_format=wikipediaapi.ExtractFormat.WIKI
)

SEED_TOPICS = [
    # Macroeconomics
    'Macroeconomics', 'Fiscal policy', 'Monetary policy', 'Inflation',
    'Gross domestic product', 'Unemployment', 'Quantitative easing',
    'Keynesian economics', 'Monetarism', 'Supply-side economics',

    # Microeconomics
    'Microeconomics', 'Supply and demand', 'Market structure',
    'Perfect competition', 'Monopoly', 'Oligopoly', 'Game theory',

    # Financial Markets
    'Financial market', 'Stock market', 'Bond market', 'Derivative (finance)',
    'Efficient-market hypothesis', 'Capital asset pricing model',
    'Behavioral economics', 'Foreign exchange market',

    # Corporate Finance
    'Corporate finance', 'Financial statement', 'Balance sheet', 'Income statement',
    'Cash flow statement', 'Valuation (finance)', 'Discounted cash flow',
    'Mergers and acquisitions'
]

print(f"Fetching content for {len(SEED_TOPICS)} seed topics from Wikipedia...")

In [None]:
def clean_wiki_text(text):
    """Cleans Wikipedia text by removing headers, extra newlines, and references."""
    # Remove headers (e.g., == History ==)
    text = re.sub(r'==.*?==+', '', text)
    # Remove extra newlines
    text = re.sub(r'\n+', '\n', text)
    # Remove references that might be left over
    text = re.sub(r'\[\d+\]', '', text) 
    text = text.strip()
    return text

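def chunk_text(text, chunk_size, overlap_size):
    """Splits text into overlapping chunks of a specified word count."""
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")

    words = text.split()
    if not words:
        return []

    chunks = []
    stride = chunk_size - overlap_size
    for i in range(0, len(words), stride):
        chunk = words[i:i + chunk_size]
        if len(chunk) < chunk_size * 0.5 and chunks:
            # Avoid very small trailing chunks: merge the tail into the previous chunk.
            # Its first overlap_size words are already in that chunk, so skip them
            # to avoid duplicating text.
            chunks[-1].extend(chunk[overlap_size:])
        else:
            chunks.append(chunk)

    return [' '.join(chunk) for chunk in chunks]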

all_chunks = []
for topic in tqdm(SEED_TOPICS, desc="Processing Wikipedia Articles"):
    page = wiki_wiki.page(topic)
    if page.exists():
        cleaned_text = clean_wiki_text(page.text)
        chunks = chunk_text(cleaned_text, CHUNK_SIZE, CHUNK_OVERLAP)
        for chunk in chunks:
            all_chunks.append({'text': chunk, 'source': 'wikipedia_finance'})

task_a_df = pd.DataFrame(all_chunks)
print(f"Successfully created {len(task_a_df)} text chunks for Task A.")
task_a_df.head()
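
To make the windowing concrete, here is a minimal sketch that computes only the word offsets of each chunk instead of the text. `chunk_spans` is a hypothetical helper, not part of the pipeline above. It assumes the intended behavior is that a short tail (under half of `chunk_size`) is absorbed into the previous chunk.

```python
def chunk_spans(n_words, chunk_size=256, overlap=50):
    """Return (start, end) word offsets for each chunk of an n_words-long text."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap  # 206 words with the defaults
    spans = []
    for start in range(0, n_words, stride):
        end = min(start + chunk_size, n_words)
        if spans and end - start < chunk_size * 0.5:
            # Short tail: stretch the previous span to the end of the text.
            spans[-1] = (spans[-1][0], n_words)
        else:
            spans.append((start, end))
    return spans


print(chunk_spans(600))  # [(0, 256), (206, 462), (412, 600)]
print(chunk_spans(500))  # [(0, 256), (206, 500)]
```

With the defaults, consecutive chunks start 206 words apart and share 50 words. In the 500-word case the 88-word tail is merged, so the last chunk is longer than `CHUNK_SIZE`.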

## 3. Task B: Earnings Call Transcripts - Specialized Financial Knowledge

Next, we'll load a dataset of earnings call transcripts from the Hugging Face Hub. This data is highly specialized, full of jargon, and structurally different from Wikipedia, which makes it a demanding second task for the continual learning experiment.

In [None]:
print("Loading Task B dataset from Hugging Face...")
# Using the 'presentation' part of earnings calls, which is dense with prepared statements.
earnings_dataset = load_dataset("toughdata/quants", split='train')
print("Dataset loaded.")

task_b_chunks = []
for item in tqdm(earnings_dataset, desc="Processing Earnings Calls"):
    # We focus on the prepared presentation section for dense knowledge
    if item['section'] == 'presentation' and isinstance(item['segment'], str):
        # The text is already quite clean, but we apply the same chunking for consistency
        chunks = chunk_text(item['segment'], CHUNK_SIZE, CHUNK_OVERLAP)
        for chunk in chunks:
            task_b_chunks.append({'text': chunk, 'source': 'earnings_calls'})

task_b_df = pd.DataFrame(task_b_chunks)
print(f"Successfully created {len(task_b_df)} text chunks for Task B.")
task_b_df.head()
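
Before relying on the `section == 'presentation'` filter, it can help to check which section labels the records actually carry and how many have a usable string in `segment`. The sketch below is one way to do that. `summarize_sections` is a hypothetical helper, and the sample records only mimic the field names used in the cell above; they are not real transcript data.

```python
from collections import Counter


def summarize_sections(records, section_key="section", text_key="segment"):
    """Count records per section, and how many of them have non-empty string text."""
    counts = Counter()
    usable = Counter()
    for rec in records:
        section = rec.get(section_key)
        counts[section] += 1
        text = rec.get(text_key)
        if isinstance(text, str) and text.strip():
            usable[section] += 1
    return counts, usable


# Hypothetical records, for illustration only.
sample = [
    {"section": "presentation", "segment": "Revenue grew 12% year over year."},
    {"section": "qa", "segment": "Can you comment on margins?"},
    {"section": "presentation", "segment": None},
]
counts, usable = summarize_sections(sample)
print(counts["presentation"], usable["presentation"])  # 2 1
```

Assuming each record iterates as a plain dict, the same call should work on `earnings_dataset`. If `'presentation'` is missing from the counts, the loop above would silently produce an empty `task_b_df`.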

## 4. Data Splitting & Saving

With both datasets processed and chunked, we'll now split them into training, validation, and test sets. This lets us train our models, tune them on the validation set, and report a final performance estimate on held-out test data.
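
Because the split is done in two stages, the validation fraction has to be rescaled relative to what remains after the test set is removed. The arithmetic can be sketched as follows. `split_fractions` is a hypothetical helper, and the defaults mirror `TEST_SIZE` and `VAL_SIZE` from the configuration cell.

```python
def split_fractions(test_size=0.15, val_size=0.15):
    """Return (train, val, test, val_adjusted) for a two-stage split.

    train, val and test are fractions of the full dataset; val_adjusted is the
    validation fraction of the data left after the test set is removed.
    """
    remaining = 1.0 - test_size          # share left after the first split
    val_adjusted = val_size / remaining  # test_size argument of the second split
    train = remaining * (1.0 - val_adjusted)
    val = remaining * val_adjusted
    return train, val, test_size, val_adjusted


train, val, test, val_adjusted = split_fractions()
print(round(val_adjusted, 4))                          # 0.1765
print(round(train, 2), round(val, 2), round(test, 2))  # 0.7 0.15 0.15
```

So the configuration yields roughly a 70/15/15 split. The actual row counts may differ by a row or so, since the split sizes are rounded to whole examples.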

In [None]:
def split_and_save(df, task_name):
    """Splits a dataframe into train, validation, and test sets and saves them."""
    print(f"\nSplitting dataset for {task_name}...")
    
    # First split: separate out the test set
    train_val_df, test_df = train_test_split(df, test_size=TEST_SIZE, random_state=RANDOM_STATE)
    
    # Second split: separate train and validation from the remaining data
    # Adjusting the validation size relative to the remaining data
    val_size_adjusted = VAL_SIZE / (1 - TEST_SIZE)
    train_df, val_df = train_test_split(train_val_df, test_size=val_size_adjusted, random_state=RANDOM_STATE)
    
    print(f"  Total examples: {len(df)}")
    print(f"  Training set size: {len(train_df)}")
    print(f"  Validation set size: {len(val_df)}")
    print(f"  Test set size: {len(test_df)}")
    
    # Save to parquet files for efficiency
    train_df.to_parquet(os.path.join(DATA_DIR, f"{task_name}_train.parquet"))
    val_df.to_parquet(os.path.join(DATA_DIR, f"{task_name}_val.parquet"))
    test_df.to_parquet(os.path.join(DATA_DIR, f"{task_name}_test.parquet"))
    
    print(f"  Successfully saved all sets for {task_name}.")

# Process Task A
split_and_save(task_a_df, 'task_a')

# Process Task B
split_and_save(task_b_df, 'task_b')
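
One caveat with a chunk-level random split: consecutive chunks share `CHUNK_OVERLAP` words, so near-duplicate text from the same document can end up in both the training and test sets. A possible alternative is to split by source document instead. The sketch below assumes each chunk carries a `doc_id` (for example the Wikipedia page title), which the dataframes above do not currently store. `split_doc_ids` is a hypothetical helper, not part of the pipeline.

```python
import random


def split_doc_ids(doc_ids, test_size=0.15, val_size=0.15, seed=42):
    """Assign whole documents to train/val/test so no document spans two sets."""
    unique_ids = sorted(set(doc_ids))           # sort first for reproducibility
    random.Random(seed).shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_size))
    n_val = max(1, round(len(unique_ids) * val_size))
    test_ids = set(unique_ids[:n_test])
    val_ids = set(unique_ids[n_test:n_test + n_val])
    train_ids = set(unique_ids[n_test + n_val:])
    return train_ids, val_ids, test_ids


# Hypothetical: one doc_id per chunk (chunk counts are made up for illustration).
chunk_doc_ids = (
    ["Inflation"] * 12 + ["Monopoly"] * 9 + ["Game theory"] * 15
    + ["Bond market"] * 7 + ["Oligopoly"] * 5 + ["Monetarism"] * 8
    + ["Unemployment"] * 10
)
train_ids, val_ids, test_ids = split_doc_ids(chunk_doc_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 5 1 1
```

With a `doc_id` column in place, each split could then be selected with something like `df[df["doc_id"].isin(train_ids)]`. scikit-learn's `GroupShuffleSplit` offers a similar grouping mechanism. With only a few dozen seed topics in Task A, a per-document split is coarse, so the resulting proportions will only approximate 70/15/15.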

## 5. Conclusion

We have successfully sourced, processed, and split our two datasets. The `../data/` directory now contains six Parquet files:

- `task_a_train.parquet`, `task_a_val.parquet`, `task_a_test.parquet`
- `task_b_train.parquet`, `task_b_val.parquet`, `task_b_test.parquet`
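
A quick way to confirm that all six files were written is to build the expected paths and check which ones are missing. This is a minimal sketch. `expected_files` and `missing_files` are hypothetical helpers, and the file naming follows the `{task_name}_{split}.parquet` pattern used when saving.

```python
from pathlib import Path


def expected_files(data_dir, tasks=("task_a", "task_b"), splits=("train", "val", "test")):
    """Build the Parquet paths the saving step is expected to produce."""
    return [Path(data_dir) / f"{task}_{split}.parquet" for task in tasks for split in splits]


def missing_files(data_dir):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_files(data_dir) if not p.is_file()]


paths = expected_files("../data")
print(len(paths))     # 6
print(paths[0].name)  # task_a_train.parquet
```

Calling `missing_files(DATA_DIR)` after the saving step should return an empty list if everything was written.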

This completes the data preparation step. We are now ready to move on to the next step: **building the HGC architecture and the baseline BERT model**.