# 03 - EDA and Temporal Splits

## Goal

Run EDA to build trust in the data: check class balance, year drift, and label co-occurrence. Then create temporal splits (train ≤2021, val 2022–2023, test ≥2024) to prevent time leakage.


## Why This Step Matters

**Trust in data** comes from understanding it:

- **Class balance:** Are some labels extremely rare?
- **Year trends:** Are study designs changing over time?
- **Co-occurrence:** Do certain labels always appear together?
- **Temporal splits:** Prevent leakage (future knowledge influencing past predictions)

Without EDA, you're training blind.


In [None]:
# === TODO (you code this) ===
# Goal: Import libraries for EDA and visualization.
# Hints:
# 1) pandas, numpy, matplotlib, seaborn, pandera
# 2) Set seaborn style for clean plots
# Acceptance:
# - All imports successful
# - Can create plots with plt/sns


In [None]:
# === TODO (you code this) ===
# Goal: Load the labeled dataset.
# Hints:
# 1) Read dental_abstracts.parquet from ../data/processed
# 2) Print row count
# Acceptance:
# - df loaded with 'labels' column
# - Print total papers

# TODO: load and display count


## Basic Counts

Let's understand the dataset size and temporal distribution.


In [None]:
# === TODO (you code this) ===
# Goal: Display basic dataset statistics.
# Hints:
# 1) Total count, year distribution, top journals
# 2) Use value_counts() sorted appropriately
# Acceptance:
# - Shows year-by-year counts
# - Shows top 10 journals

# TODO: print basic stats
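
One way to approach the counts above, shown on a tiny made-up frame rather than the real dataset. The column name `journal` and the journal names are assumptions for illustration; only `year` is implied elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (values are invented for illustration).
df = pd.DataFrame({
    "year": [2020, 2021, 2021, 2023, 2024, 2021],
    "journal": ["Journal A", "Journal B", "Journal A",
                "Journal A", "Journal B", "Journal C"],
})

total = len(df)

# Sort by year (the index) so the output reads chronologically.
year_counts = df["year"].value_counts().sort_index()

# value_counts() already sorts by frequency, so head(10) gives the top 10.
top_journals = df["journal"].value_counts().head(10)

print(f"Total papers: {total}")
print(year_counts)
print(top_journals)
```

Sorting years by index and journals by frequency is the "sorted appropriately" distinction from the hints: chronological order for trends, descending counts for rankings.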


## Class Balance

Which labels are common? Which are rare?


In [None]:
# === TODO (you code this) ===
# Goal: Visualize label frequency distribution.
# Hints:
# 1) Flatten labels lists to count individual occurrences (use Counter)
# 2) Create horizontal barplot sorted by frequency
# 3) Show count on x-axis, label names on y-axis
# Acceptance:
# - Barplot shows all 10 labels
# - Sorted by frequency descending
# - Clear axis labels and title

# TODO: create and display barplot
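
A minimal sketch of the flattening step from hint 1, using only the standard library. The label names are hypothetical placeholders, not the dataset's actual 10 labels.

```python
from collections import Counter

# Each paper carries a list of labels (hypothetical names for illustration).
papers = [
    ["RCT", "Cohort"],
    ["RCT"],
    ["CaseReport"],
    ["RCT", "SystematicReview"],
]

# Flatten the per-paper lists and count every individual label occurrence.
label_counts = Counter(label for labels in papers for label in labels)

# most_common() returns (label, count) pairs sorted by count, descending.
ranked = label_counts.most_common()
print(ranked)
```

With the real data, the same pattern would likely apply to `df["labels"]`, and the `ranked` pairs can feed a horizontal bar plot (for example `plt.barh`) with counts on the x-axis.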


## Label Co-occurrence

Do certain labels always appear together?


In [None]:
# === TODO (you code this) ===
# Goal: Compute and visualize label co-occurrence.
# Hints:
# 1) Create binary matrix: rows=papers, cols=labels (1 if present)
# 2) Compute matrix product (label_matrix.T @ label_matrix)
# 3) Zero diagonal, create seaborn heatmap
# Acceptance:
# - 10x10 symmetric matrix
# - Shows which labels appear together frequently
# - Annotated with counts

# TODO: build co-occurrence matrix and plot heatmap
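
The three hints above can be sketched with NumPy on a toy example. The label names and papers here are invented; with the real data the matrix would be 10x10.

```python
import numpy as np

# Hypothetical label vocabulary and per-paper label lists.
labels = ["RCT", "Cohort", "SystematicReview"]
papers = [
    ["RCT", "Cohort"],
    ["RCT"],
    ["RCT", "SystematicReview"],
    ["Cohort", "RCT"],
]

# Binary matrix: one row per paper, one column per label (1 if present).
index = {name: i for i, name in enumerate(labels)}
label_matrix = np.zeros((len(papers), len(labels)), dtype=int)
for row, paper_labels in enumerate(papers):
    for name in paper_labels:
        label_matrix[row, index[name]] = 1

# Entry (i, j) of the product counts papers that carry both label i and label j.
cooccurrence = label_matrix.T @ label_matrix

# The diagonal holds each label's own frequency; zero it so it does not
# dominate the heatmap's color scale.
np.fill_diagonal(cooccurrence, 0)
print(cooccurrence)
```

The result can then be passed to `sns.heatmap(cooccurrence, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)` to get the annotated plot the acceptance criteria ask for.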


## Temporal Splits

**Critical:** Split by year to prevent temporal leakage.

- **Train:** ≤ 2021 (~60-70% of data)
- **Val:** 2022-2023 (~15-20%)
- **Test:** ≥ 2024 (~15-20%)

This mimics real-world deployment: predicting future papers based on past patterns.


In [None]:
# === TODO (you code this) ===
# Goal: Assign temporal split labels.
# Hints:
# 1) Define function: ≤2021='train', 2022-2023='val', ≥2024='test'
# 2) Apply to create 'split' column
# 3) Print value_counts to verify distribution
# Acceptance:
# - Function assign_split(year) -> str
# - New column df['split'] with 3 values
# - Roughly 60-70% train, 15-20% val, 15-20% test

def assign_split(year):
    """Assign temporal split based on year."""
    # TODO
    raise NotImplementedError

# TODO: create df['split'] column and print counts
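
One possible implementation of the cutoffs described above, kept in plain Python so the logic is easy to check in isolation. The sample years are made up; this is a sketch, not the only valid solution.

```python
from collections import Counter


def assign_split(year: int) -> str:
    """Map a publication year to 'train', 'val', or 'test'."""
    if year <= 2021:
        return "train"
    if year <= 2023:  # 2022 and 2023
        return "val"
    return "test"     # 2024 and later


# Invented sample years covering every branch, including the boundaries.
years = [2018, 2020, 2021, 2022, 2023, 2024, 2025]
splits = [assign_split(y) for y in years]

# Fraction of papers per split, to compare against the target ranges.
shares = {name: count / len(splits) for name, count in Counter(splits).items()}
print(splits)
print(shares)
```

On the real DataFrame, the equivalent would be something like `df["split"] = df["year"].apply(assign_split)` followed by `df["split"].value_counts(normalize=True)`, assuming `year` is stored as an integer column.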


In [None]:
# === TODO (you code this) ===
# Goal: Save separate parquet files for each split.
# Hints:
# 1) Loop through ['train', 'val', 'test']
# 2) Filter df and save to {split_name}.parquet
# 3) Print count for each
# Acceptance:
# - Three files in ../data/processed: train.parquet, val.parquet, test.parquet
# - Each contains only its split

# TODO: save split files


## Schema Validation

Use Pandera to validate data quality before training.


In [None]:
# === TODO (you code this) ===
# Goal: Validate DataFrame schema with Pandera.
# Hints:
# 1) Define schema with pa.DataFrameSchema specifying types and constraints
# 2) Check: pmid/title/abstract non-null, year in range, labels non-empty, split valid
# 3) Call schema.validate(df) in try/except
# Acceptance:
# - Schema defined for all key columns
# - Validation passes or prints helpful error
# - Catches: missing values, out-of-range years, empty labels

# TODO: define schema and validate


## Recommendations

- **Revisit year cutoffs** if the val or test split is too small (aim for at least 1000 papers each)
- **For severe imbalance:** Consider stratified sampling within year windows (stretch goal)
- **Document decisions:** Why these splits? What assumptions are we making?
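
The first recommendation can be turned into a quick check. This is a sketch with invented split sizes; `undersized_splits` is a hypothetical helper name, and the 1000-paper threshold comes from the guideline above.

```python
from collections import Counter


def undersized_splits(split_labels, min_papers=1000):
    """Return {split: count} for val/test splits below the minimum size."""
    counts = Counter(split_labels)
    return {
        name: counts.get(name, 0)
        for name in ("val", "test")
        if counts.get(name, 0) < min_papers
    }


# Invented split sizes for illustration only.
split_labels = ["train"] * 5000 + ["val"] * 1200 + ["test"] * 400
too_small = undersized_splits(split_labels)
print(too_small)
```

A non-empty result suggests moving a cutoff year, for example pulling the test boundary earlier, and then documenting why.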

## 🧘 Reflection Log

**What did you learn in this session?**
- 

**What challenges did you encounter?**
- 

**How will this improve Periospot AI?**
- 
