# Data Preprocessing Notebook

This notebook implements the data preprocessing pipeline for the Narrative Similarity task (SemEval4). It covers loading the raw data, generating augmented triplets with an LLM, and extracting narrative aspects (the last step is still a TODO).

## 1. Environment Setup

The parent directory is added to `sys.path` so that project packages such as `src` can be imported from this notebook, and the Together AI banner is disabled for cleaner output.

In [1]:
import sys
import os

sys.path.insert(0, '..')
os.environ["TOGETHER_NO_BANNER"] = "1"


## 2. Import Dependencies

The following modules are imported:
- **load_track_a**: utility function to load Track A data from JSONL files
- **AugmentTripletGenerator**: LLM-based augmentor that generates triplets (anchor, positive, negative) for training

In [3]:
from utils.load_track import load_track_a
from src.llm.augmentor import AugmentTripletGenerator

## 3. Load Development Data

The development dataset for Track A is loaded from a JSONL file. This dataset contains narrative pairs that will be used to generate augmented training triplets.

In [None]:
dev_ds = load_track_a("../data/raw/trackA/dev_track_a.jsonl")
print(f"Dev examples: {len(dev_ds)}")
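
The source of `load_track_a` is not shown in this notebook. As a rough sketch of what a JSONL loader of this kind might do, the hypothetical helper below (`load_jsonl` is an illustrative name, not part of the project) assumes one JSON object per line; the real function may validate fields or return a different container.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts (hypothetical helper)."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so bad records are easy to locate
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
    return records
```

Failing loudly with the line number, rather than skipping malformed records, keeps the example count printed above trustworthy.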

## 4. Generating Augmented Data

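The `AugmentTripletGenerator` uses the Together AI API to generate augmented triplets from the development data. In the call below it is asked for two triplets per sample (`n_triplets_per_sample=2`) with a single setting in `temps` (`0.8`); the batch prompts are written to `batch_input_path` and the model responses to `batch_output_path`.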

In [None]:

# Replace this path with the location of your own Together AI API key file
with open('../data/auth/together.ai.key') as f:
    api_key = f.read().strip()

# model_names= "Qwen/Qwen2.5-72B-Instruct-Turbo"
generator = AugmentTripletGenerator(api_key=api_key)

triplets = generator.run_batch(
    dev_data=dev_ds,
    n_triplets_per_sample=2,
    temps=[0.8],
    # Paths are relative to the notebook directory, like the other cells
    batch_input_path="../data/generated/inputs-prompts/llama-full-dev-input.jsonl",
    batch_output_path="../data/generated/llama-full-dev-aug.jsonl",
)
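
The cell above reads the key from a file on disk. A slightly more defensive variant, sketched below, checks an environment variable first and falls back to the key file. The helper `read_api_key` and the variable name `TOGETHER_API_KEY` are assumptions made for illustration and are not part of this project.

```python
import os
from pathlib import Path


def read_api_key(key_file="../data/auth/together.ai.key", env_var="TOGETHER_API_KEY"):
    """Return the API key from the environment, else from a key file (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(key_file)
    if not path.is_file():
        raise FileNotFoundError(f"Set {env_var} or create the key file at {path}")
    key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"Key file {path} is empty")
    return key
```

With this in place, the generator could be created as `AugmentTripletGenerator(api_key=read_api_key())`, which avoids an opaque authentication error later if the key file is missing or empty.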

## 5. Loading Augmented Data

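The previously generated augmented data is parsed from a batch output file. This step converts the raw LLM outputs into structured triplet format for downstream processing. Note that the file read here (`llama-full-ao-dev-aug.jsonl`) has a different name from the `batch_output_path` used in Section 4; adjust the path to point at the output you want to parse.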

In [None]:
full_llama_aug_dev = generator.parse_batch_output("../data/generated/llama-full-ao-dev-aug.jsonl")
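
The batch output schema is not shown in this notebook. To illustrate what "structured triplet format" could mean, the sketch below assumes each model response is a JSON object with `anchor`, `positive`, and `negative` string fields. The `Triplet` class and `parse_triplet` function are hypothetical and are not the implementation behind `parse_batch_output`.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triplet:
    anchor: str
    positive: str
    negative: str


def parse_triplet(raw: str) -> Optional[Triplet]:
    """Parse one raw LLM response; return None if it is malformed (assumed schema)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    values = [obj.get(key) for key in ("anchor", "positive", "negative")]
    # Reject responses with missing, non-string, or blank fields
    if not all(isinstance(v, str) and v.strip() for v in values):
        return None
    return Triplet(*(v.strip() for v in values))
```

Returning `None` for malformed responses lets the caller count and drop bad generations instead of aborting the whole batch.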

## 6. Aspects Extraction (TODO)

This section is reserved for extracting narrative aspects from the texts. Aspect extraction enables aspect-aware similarity learning, where the model learns to compare narratives based on specific dimensions (e.g., theme, course of action, outcomes).
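
Since this step is not implemented yet, the following is only a placeholder sketch of how the aspects named above might be represented and requested from an LLM. The `NarrativeAspects` class, its field names, and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class NarrativeAspects:
    # One free-text summary per aspect; field names mirror the examples above
    theme: str = ""
    course_of_action: str = ""
    outcomes: str = ""


def build_aspect_prompt(narrative: str) -> str:
    """Build an extraction prompt listing every aspect field (hypothetical)."""
    names = ", ".join(f.name for f in fields(NarrativeAspects))
    return (
        "Summarize the following narrative along these aspects: "
        f"{names}. Answer with one short sentence per aspect.\n\n"
        f"Narrative:\n{narrative.strip()}"
    )
```

Deriving the aspect list from the dataclass fields keeps the prompt and the storage structure in sync if aspects are added or renamed later.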