# Multi-Label Text Classification - Data Preparation

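This notebook prepares the 1968 and 1969 articles of the Dell Research Harvard newswire dataset for **multi-label text classification**. Labels are stored as **multi-hot binary vectors**: the code and output below call them "one-hot encoded", but an article can belong to several categories, so more than one entry may be 1.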

**Key Points:**
- 7 binary classification tasks (one per category)
- Labels are stored as 7-dimensional binary vectors, one entry per category
- The entries are, in order: antitrust, civil_rights, crime, govt_regulation, labor_movement, politics, protests
- Example: `[0, 0, 1, 1, 0, 1, 0]` means the article is about crime, govt_regulation, and politics
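
The mapping between a label vector and category names depends only on position. As a minimal sketch of that convention, the helpers below (`decode_labels` and `encode_labels`, hypothetical names that do not appear in the notebook) convert in both directions using the category order listed above:

```python
import numpy as np

# Category order copied from the list above; position in the vector is the
# only link between a value and its category name.
CATEGORIES = [
    "antitrust", "civil_rights", "crime", "govt_regulation",
    "labor_movement", "politics", "protests",
]

def decode_labels(vector):
    """Return the names of the categories whose entry in `vector` is 1."""
    vector = np.asarray(vector)
    if vector.shape != (len(CATEGORIES),):
        raise ValueError(f"expected {len(CATEGORIES)} values, got shape {vector.shape}")
    return [name for name, flag in zip(CATEGORIES, vector) if flag == 1]

def encode_labels(names):
    """Inverse of decode_labels: build the 7-dimensional vector from names."""
    vector = np.zeros(len(CATEGORIES), dtype=np.float32)
    for name in names:
        vector[CATEGORIES.index(name)] = 1.0
    return vector

print(decode_labels([0, 0, 1, 1, 0, 1, 0]))
# ['crime', 'govt_regulation', 'politics']
```

An all-zero vector decodes to an empty list, which corresponds to an article that falls in none of the seven categories.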

## 1. Setup and Data Loading

In [1]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Load the Dell Research Harvard newswire dataset
print("Loading dataset from HuggingFace...")
dataset = load_dataset(
    "dell-research-harvard/newswire",
    data_files=["1968_data_clean.json", "1969_data_clean.json"],
    trust_remote_code=True
)

print(f"\n✓ Dataset loaded successfully!")
print(f"  Total articles: {dataset['train'].num_rows:,}")

# Convert to pandas
df = dataset["train"].to_pandas()
print(f"  Total features: {len(df.columns)}")

Loading dataset from HuggingFace...

✓ Dataset loaded successfully!
  Total articles: 65,191
  Total features: 23


## 2. Extract Labels and Text

Define the 7 category columns in a fixed order and count the positive samples in each. The conversion to label arrays happens in the next section.

In [2]:
# Define the 7 categories in order
CATEGORIES = [
    'antitrust',
    'civil_rights',
    'crime',
    'govt_regulation',
    'labor_movement',
    'politics',
    'protests'
]

print("Category columns:")
print("=" * 70)
for i, category in enumerate(CATEGORIES):
    count = df[category].sum()
    pct = (count / len(df)) * 100
    print(f"  [{i}] {category:20s} {count:6,} positive samples ({pct:5.1f}%)")

print(f"\n✓ 7 categories for multi-label classification")

Category columns:
  [0] antitrust                93 positive samples (  0.1%)
  [1] civil_rights          3,393 positive samples (  5.2%)
  [2] crime                 7,464 positive samples ( 11.4%)
  [3] govt_regulation       5,590 positive samples (  8.6%)
  [4] labor_movement        4,533 positive samples (  7.0%)
  [5] politics             24,005 positive samples ( 36.8%)
  [6] protests              1,833 positive samples (  2.8%)

✓ 7 categories for multi-label classification
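
The counts above are heavily imbalanced (antitrust has 93 positives out of 65,191 articles). One common way to compensate when training with `BCEWithLogitsLoss` is its `pos_weight` argument, set per category to the ratio of negatives to positives. The notebook does not say whether `src/train.py` does this, so the following is only an illustrative sketch; the counts are copied from the output above, and in practice they would be computed from the training split only.

```python
import numpy as np

CATEGORIES = [
    "antitrust", "civil_rights", "crime", "govt_regulation",
    "labor_movement", "politics", "protests",
]

# Positive counts per category and total article count, copied from the output above.
positives = np.array([93, 3393, 7464, 5590, 4533, 24005, 1833], dtype=np.float64)
n_samples = 65191

# Ratio of negative to positive examples for each category.
negatives = n_samples - positives
pos_weight = negatives / positives

for name, weight in zip(CATEGORIES, pos_weight):
    print(f"{name:20s} pos_weight = {weight:8.2f}")
```

The resulting array could be converted to a float tensor and passed as `torch.nn.BCEWithLogitsLoss(pos_weight=...)`. Very large weights, such as the one for antitrust, can make training unstable, so capping them is a reasonable variation.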


## 3. Convert to One-Hot Encoded Labels

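Each article's labels are represented as a 7-dimensional binary vector in which a 1 marks a category as present. An all-zero vector is valid and means that none of the 7 categories applies. The vectors are cast to `float32` because `BCEWithLogitsLoss`, the loss named in the next steps, expects floating-point targets.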
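
Stacking the columns into a binary matrix assumes every category column holds only 0 or 1 and has no missing values. The notebook does not check this, so here is a small sketch of such a guard on a toy frame; `check_binary_columns` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

CATEGORIES = [
    "antitrust", "civil_rights", "crime", "govt_regulation",
    "labor_movement", "politics", "protests",
]

def check_binary_columns(frame, columns):
    """Raise ValueError if a label column is missing, has NaN, or holds values other than 0/1."""
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise ValueError(f"missing label columns: {missing}")
    block = frame[columns]
    if block.isna().any().any():
        raise ValueError("label columns contain missing values")
    if not block.isin([0, 1]).all().all():
        raise ValueError("label columns contain values other than 0 and 1")

# Toy frame standing in for the real `df`.
toy = pd.DataFrame(
    [[0, 0, 1, 0, 1, 0, 0],
     [0, 0, 0, 0, 0, 0, 0]],
    columns=CATEGORIES,
)

check_binary_columns(toy, CATEGORIES)
labels = toy[CATEGORIES].to_numpy(dtype=np.float32)
print(labels.shape)  # (2, 7)
```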

In [3]:
# Extract text
texts = df['cleaned_article'].values

# Extract labels as one-hot encoded numpy arrays
# Stack the 7 binary columns into a (n_samples, 7) array
labels = df[CATEGORIES].values.astype(np.float32)

print(f"✓ Data extracted successfully!")
print(f"  Texts shape: {texts.shape}")
print(f"  Labels shape: {labels.shape}")
print(f"  Labels dtype: {labels.dtype}")

# Analyze label statistics
labels_per_sample = labels.sum(axis=1)
print(f"\nLabel Statistics:")
print(f"  Min labels per sample: {int(labels_per_sample.min())}")
print(f"  Max labels per sample: {int(labels_per_sample.max())}")
print(f"  Avg labels per sample: {labels_per_sample.mean():.2f}")

# Distribution by number of active labels
print(f"\nDistribution by number of labels:")
unique_counts, counts = np.unique(labels_per_sample, return_counts=True)
for n_labels, count in zip(unique_counts, counts):
    pct = (count / len(labels)) * 100
    print(f"  {int(n_labels)} label(s): {count:6,} samples ({pct:5.1f}%)")

# Show some examples
print(f"\nSample Data (first 5):")
print("=" * 70)
for i in range(min(5, len(texts))):
    active_categories = [name for name, flag in zip(CATEGORIES, labels[i]) if flag == 1]
    if not active_categories:
        active_categories = ['no_class']
    text_preview = (texts[i][:80] + "...") if len(texts[i]) > 80 else texts[i]
    print(f"\n{i+1}. Labels: {labels[i]} → {active_categories}")
    print(f"   Text: {text_preview}")

✓ Data extracted successfully!
  Texts shape: (65191,)
  Labels shape: (65191, 7)
  Labels dtype: float32

Label Statistics:
  Min labels per sample: 0
  Max labels per sample: 5
  Avg labels per sample: 0.72

Distribution by number of labels:
  0 label(s): 29,339 samples ( 45.0%)
  1 label(s): 26,585 samples ( 40.8%)
  2 label(s):  7,596 samples ( 11.7%)
  3 label(s):  1,551 samples (  2.4%)
  4 label(s):    119 samples (  0.2%)
  5 label(s):      1 samples (  0.0%)

Sample Data (first 5):

1. Labels: [0. 0. 0. 0. 0. 0. 0.] → ['no_class']
   Text: SAIGON (AP) — Smashing Communist thrusts across South Vietnam, allied forces kil...

2. Labels: [0. 0. 1. 0. 1. 0. 0.] → ['crime', 'labor_movement']
   Text: NEW YORK (AP) — Police cleared a barricaded building and arrested 131 demonstrat...

3. Labels: [0. 0. 0. 0. 0. 0. 0.] → ['no_class']
   Text: SAIGON (AP) — The allies called off the Tet truce in South Vietnam’s northern mi...

4. Labels: [0. 0. 0. 0. 0. 0. 0.] → ['no_class']
   Text: S

## 4. Train/Test Split

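Split the data into training (80%) and test (20%) sets with a fixed random seed. The split is random rather than stratified, so the few positives of rare categories such as antitrust (93 overall, 69 of which land in the training set) are divided by chance.

**Note:** No validation file is written here. The training step is expected to hold out a random 20% of the training set (`VAL_SPLIT=0.2`) for validation.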
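
The training script itself is not shown, so how it carves out the validation set is an assumption. A minimal sketch of one way to do it, reusing `train_test_split` on placeholder arrays with the size of the training set this notebook produces (52,152 rows):

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # same value the notebook uses; reusing it here is an assumption
VAL_SPLIT = 0.2   # value reported in the notebook's "Next steps"

# Placeholder arrays standing in for the real X_train / y_train.
X_train = np.arange(52152)
y_train = np.zeros((52152, 7), dtype=np.float32)

# Hold out VAL_SPLIT of the training rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=VAL_SPLIT, random_state=RANDOM_SEED
)

print(len(X_fit), len(X_val))  # 41721 10431
```

Under this scheme the full dataset ends up divided roughly 64% / 16% / 20% between fitting, validation, and test.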

In [4]:
# Set random seed for reproducibility
RANDOM_SEED = 42

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=RANDOM_SEED
)

print("✓ Data split complete!")
print(f"\n  Training set:   {len(X_train):6,} samples ({len(X_train)/len(texts)*100:.1f}%)")
print(f"  Test set:       {len(X_test):6,} samples ({len(X_test)/len(texts)*100:.1f}%)")
print(f"  {'─'*50}")
print(f"  Total:          {len(texts):6,} samples")
print(f"\n  Note: Validation set (20% of train) will be created during training")

# Show label distribution for both sets
def show_distribution(y, name):
    """Print label-count and per-category distributions for the label matrix `y`."""
    labels_per_sample = y.sum(axis=1)
    print(f"\n{name} Set - Labels per sample:")
    unique_counts, counts = np.unique(labels_per_sample, return_counts=True)
    for n_labels, count in zip(unique_counts, counts):
        pct = (count / len(y)) * 100
        print(f"  {int(n_labels)} label(s): {count:6,} samples ({pct:5.1f}%)")
    
    # Show per-category statistics
    print(f"\n{name} Set - Per-category distribution:")
    for i, category in enumerate(CATEGORIES):
        count = int(y[:, i].sum())
        pct = (count / len(y)) * 100
        print(f"  {category:20s}: {count:6,} ({pct:5.1f}%)")

show_distribution(y_train, "Training")
show_distribution(y_test, "Test")

✓ Data split complete!

  Training set:   52,152 samples (80.0%)
  Test set:       13,039 samples (20.0%)
  ──────────────────────────────────────────────────
  Total:          65,191 samples

  Note: Validation set (20% of train) will be created during training

Training Set - Labels per sample:
  0 label(s): 23,475 samples ( 45.0%)
  1 label(s): 21,254 samples ( 40.8%)
  2 label(s):  6,064 samples ( 11.6%)
  3 label(s):  1,266 samples (  2.4%)
  4 label(s):     92 samples (  0.2%)
  5 label(s):      1 samples (  0.0%)

Training Set - Per-category distribution:
  antitrust           :     69 (  0.1%)
  civil_rights        :  2,690 (  5.2%)
  crime               :  5,937 ( 11.4%)
  govt_regulation     :  4,495 (  8.6%)
  labor_movement      :  3,633 (  7.0%)
  politics            : 19,242 ( 36.9%)
  protests            :  1,487 (  2.9%)

Test Set - Labels per sample:
  0 label(s):  5,864 samples ( 45.0%)
  1 label(s):  5,331 samples ( 40.9%)
  2 label(s):  1,532 samples ( 11.7%)
  3 la

## 5. Save Processed Data

Save the training and test sets to Parquet files. Each row stores its label vector as a single array-valued `label` cell:

In [None]:
import os

# Create data directory if it doesn't exist
os.makedirs("../data", exist_ok=True)

# Create DataFrames with one-hot encoded labels
train_data = pd.DataFrame({
    'text': X_train,
    'label': list(y_train)  # One 1-D array per row, so each cell holds a 7-element vector
})

test_data = pd.DataFrame({
    'text': X_test,
    'label': list(y_test)
})

# Save to parquet format
train_data.to_parquet("../data/train_data.parquet", index=False)
test_data.to_parquet("../data/test_data.parquet", index=False)

print("✓ Data saved successfully!")
print(f"\n  ../data/train_data.parquet ({len(train_data):,} samples)")
print(f"  ../data/test_data.parquet  ({len(test_data):,} samples)")

print("\n" + "="*70)
print("DATA PREPARATION COMPLETE")
print("="*70)
print("\nDataset structure:")
print("  - train_data.parquet: Training data (80%)")
print("  - test_data.parquet:  Test data (20%)")
print("  - Validation split: Created automatically during training (20% of train)")
print("\nLabel format:")
print("  - One-hot encoded 7-dimensional numpy arrays")
print("  - [antitrust, civil_rights, crime, govt_regulation, labor_movement, politics, protests]")
print("  - Example: [0, 0, 1, 1, 0, 1, 0] = crime + govt_regulation + politics")
print("\nNext steps:")
print("  1. Config is set: NUM_CLASSES=7, VAL_SPLIT=0.2")
print("  2. Run training: python src/train.py")
print("  3. Model uses BCEWithLogitsLoss for multi-label classification")

✓ Data saved successfully!

  ../data/train_data.parquet (52,152 samples)
  ../data/test_data.parquet  (13,039 samples)

DATA PREPARATION COMPLETE

Dataset structure:
  - train_data.parquet: Training data (80%)
  - test_data.parquet:  Test data (20%)
  - Validation split: Created automatically during training (20% of train)

Label format:
  - One-hot encoded 7-dimensional numpy arrays
  - [antitrust, civil_rights, crime, govt_regulation, labor_movement, politics, protests]
  - Example: [0, 0, 1, 1, 0, 1, 0] = crime + govt_regulation + politics

Next steps:
  1. Config is set: NUM_CLASSES=7, VAL_SPLIT=0.2
  2. Run training: python src/train.py
  3. Model uses BCEWithLogitsLoss for multi-label classification
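
Before training, the `label` column has to be turned back into a single 2-D array. The sketch below uses an in-memory frame as a stand-in for `pd.read_parquet("../data/train_data.parquet")` and assumes each cell comes back as one 7-element array, which is typical but can depend on the Parquet engine.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by pd.read_parquet: the `label` column
# holds one 7-element array per row, as written by this notebook.
train_data = pd.DataFrame({
    "text": ["first article", "second article"],
    "label": [
        np.array([0, 0, 1, 0, 1, 0, 0], dtype=np.float32),
        np.array([0, 0, 0, 0, 0, 0, 0], dtype=np.float32),
    ],
})

texts = train_data["text"].tolist()

# The column has dtype=object, so stack the per-row vectors into one matrix.
y = np.stack(train_data["label"].to_numpy()).astype(np.float32)

print(y.shape, y.dtype)  # (2, 7) float32
```

The explicit `astype(np.float32)` keeps the targets in the floating-point dtype that `BCEWithLogitsLoss` expects, regardless of how the file was decoded.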
