# Data Preparation

This notebook loads the **Financial Phrasebank** dataset and creates hierarchically nested training splits, where each smaller split is contained in the next larger one, for the sentiment analysis mini-challenge.

In [None]:
import sys
import os

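# Make ../src importable so that data_loader can be found.
# The relative path is resolved against the current working directory,
# so this assumes the notebook is run from its own folder.
sys.path.append(os.path.abspath(os.path.join('..', 'src')))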

from data_loader import load_and_split_data
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data and Create Splits

We use the `load_and_split_data` function from `src/data_loader.py`, which returns the splits as a mapping keyed by split name (for example `train_100`).

In [None]:
splits = load_and_split_data()

print("Available splits:", splits.keys())
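
The internals of `load_and_split_data` are not shown in this notebook. As a rough illustration of one common way to obtain nested splits, the sketch below shuffles the row positions once and takes growing prefixes of that single ordering. The function name `nested_index_splits`, the default sizes, and the seed are assumptions made for this example and are not taken from `src/data_loader.py`.

```python
import random


def nested_index_splits(n_rows, sizes=(100, 250, 500, 1000), seed=42):
    """Hypothetical helper: return nested lists of row positions.

    All splits are prefixes of one shuffled ordering, so every smaller
    split is automatically contained in every larger one.
    """
    if max(sizes) > n_rows:
        raise ValueError("largest split size exceeds the number of rows")
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)  # one shuffle, reproducible via the seed
    return {f"train_{n}": order[:n] for n in sorted(sizes)}


# Toy run on 1200 row positions standing in for the real dataset
toy_splits = nested_index_splits(1200)
print({name: len(rows) for name, rows in toy_splits.items()})
```

Because the nesting comes from the prefix construction rather than from repeated sampling, the subset property holds for any seed. Note that this sketch does not stratify by label, so the class balance of small splits can drift from that of the full dataset.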

## Verify Hierarchical Property

Check that each training split is a subset of the next larger one: `train_100` must be contained in `train_250`, `train_250` in `train_500`, and `train_500` in `train_1000`.

In [None]:
train_100 = splits['train_100']
train_250 = splits['train_250']
train_500 = splits['train_500']
train_1000 = splits['train_1000']

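# Every row index of a smaller split must also appear in the next larger split.
# Comparing indices assumes the splits keep the original dataset's row index.
assert set(train_100.index).issubset(set(train_250.index)), "train_100 is not contained in train_250"
assert set(train_250.index).issubset(set(train_500.index)), "train_250 is not contained in train_500"
assert set(train_500.index).issubset(set(train_1000.index)), "train_500 is not contained in train_1000"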

print("Hierarchical property verified!")
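
The pairwise checks above can also be written as a loop that reports which pair fails and how many rows are missing. This is a minimal sketch; the helper name `find_nesting_violations` is hypothetical, and it assumes split names end in a numeric size such as `train_250`.

```python
def find_nesting_violations(index_by_name):
    """Hypothetical helper: list (smaller, larger, n_missing) for broken pairs.

    `index_by_name` maps split names like 'train_250' to an iterable of
    row indices. Splits are ordered by the number after the last underscore.
    """
    names = sorted(index_by_name, key=lambda name: int(name.rsplit("_", 1)[1]))
    violations = []
    for smaller, larger in zip(names, names[1:]):
        missing = set(index_by_name[smaller]) - set(index_by_name[larger])
        if missing:
            violations.append((smaller, larger, len(missing)))
    return violations


# Toy example: train_500 deliberately lacks rows 0..49 of train_250
demo = {
    "train_100": range(100),
    "train_250": range(250),
    "train_500": range(50, 550),
}
violations = find_nesting_violations(demo)
print(violations)
```

With the notebook's data, it could be called as `find_nesting_violations({k: df.index for k, df in splits.items() if k.startswith('train_')})`, assuming every such key follows the `train_<size>` pattern.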

## Visualize Class Distribution

Let's plot the label counts (0: negative, 1: neutral, 2: positive) for each training set to see whether the class balance stays similar across split sizes.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

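for i, size in enumerate([100, 250, 500, 1000]):
    df = splits[f'train_{size}']
    # Fix the category order so every panel shows all three labels on the same x-axis,
    # even if a class happens to be absent from a small split
    sns.countplot(x='label', data=df, order=[0, 1, 2], ax=axes[i])
    axes[i].set_title(f'Train Size: {size}')
    axes[i].set_xlabel('Label (0: Neg, 1: Neu, 2: Pos)')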

plt.tight_layout()
plt.show()
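
Bar heights are hard to compare across panels with different totals, so it can help to print class proportions as numbers as well. The sketch below uses only the standard library and the label mapping stated above; the helper name `label_proportions` and the toy label list are assumptions for illustration, not real dataset statistics.

```python
from collections import Counter

# Label codes as described in this notebook
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def label_proportions(labels):
    """Hypothetical helper: fraction of each class in an iterable of label codes."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    # Classes that never occur are reported with a proportion of 0.0
    return {name: counts.get(code, 0) / total for code, name in LABEL_NAMES.items()}


# Toy labels for demonstration only
demo_labels = [0] * 10 + [1] * 60 + [2] * 30
proportions = label_proportions(demo_labels)
for name, share in proportions.items():
    print(f"{name}: {share:.1%}")
```

For the notebook's splits, something like `label_proportions(splits['train_100']['label'])` should work, assuming the `label` column holds the integer codes 0, 1, and 2.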