# Part I: Preparing the Dataset

This notebook shows how to transform a dataset for fine-tuning an embedding model with NeMo Microservices.


It covers the following steps:
1. Download the SPECTER dataset.
2. Prepare the data for embedding fine-tuning.

*Dataset Disclaimer: Each user is responsible for checking the content of datasets and the applicable licenses and determining whether they are suitable for the intended use.*

In [None]:
import os
import json
import random
from datasets import load_dataset
from config import HF_TOKEN

The following code cell sets a random seed for reproducibility and sets the data path.
It also configures the fraction of the training data to use for demonstration purposes, because training on the whole [SPECTER](https://huggingface.co/datasets/embedding-data/SPECTER) dataset may take several hours.

In [None]:
SEED = 42
random.seed(SEED)

DATA_SAVE_PATH = "./data"

# Configuration for data fraction
USE_FRACTION = True  # Set to False to use full dataset
FRACTION = 0.1  # Use 10% of the dataset (0.1 = 10%, 0.01 = 1%, etc.)

# Configure Hugging Face access
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HF_ENDPOINT"] = "https://huggingface.co"

## Step 1: Download the SPECTER Dataset

The SPECTER dataset contains triplets of scientific paper titles for training embedding models. This step loads the dataset from Hugging Face.

In [None]:
# Load the dataset directly from Hugging Face
dataset = load_dataset("embedding-data/SPECTER")
print(f"Dataset info: {dataset}")

In [None]:
# Inspect dataset structure
print("Dataset structure:")
print(f"  Shape: {len(dataset['train']):,} rows × {len(dataset['train'].column_names)} column(s)")
print(f"  Column name: '{dataset['train'].column_names[0]}'")
print("  Each row's value: A 3-element list [query_title, positive_title, negative_title]")
print("\n  Example of one row:")
example = dataset['train'][0]['set']
print("    dataset['train'][0]['set'] = [")
print(f"      '{example[0][:60]}...',  # query")
print(f"      '{example[1][:60]}...',  # positive")
print(f"      '{example[2][:60]}...'   # negative")
print("    ]")

# Display diverse examples from different parts of the dataset
print("\nDiverse examples from the dataset:")
sample_indices = [0, 100000, 200000]
for idx, sample_idx in enumerate(sample_indices, 1):
    example = dataset["train"][sample_idx]["set"]
    print(f"\nExample {idx} (row {sample_idx:,}):")
    print(f"  Query (Anchor):       {example[0][:100]}{'...' if len(example[0]) > 100 else ''}")
    print(f"  Positive (Related):   {example[1][:100]}{'...' if len(example[1]) > 100 else ''}")
    print(f"  Negative (Unrelated): {example[2][:100]}{'...' if len(example[2]) > 100 else ''}")

Each row in the dataset contains a triplet of scientific paper titles:
- **Query (Anchor)**: The title of the reference paper
- **Positive (Related)**: The title of a related paper, such as one the query paper cites
- **Negative (Unrelated)**: The title of a paper treated as unrelated to the query

During training, contrastive learning is used to maximize the similarity between the query paper and related papers, while minimizing the similarity between the query paper and unrelated papers. This teaches the embedding model to understand semantic relationships between scientific documents.
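
To make the contrastive objective above concrete, here is a minimal sketch of a triplet margin loss over cosine similarity. This is one common formulation, not necessarily the loss that NeMo Microservices uses internally. The function names, the `margin` value, and the toy 3-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_margin_loss(query, positive, negative, margin=0.2):
    """Hinge-style loss: zero once the positive beats the negative by `margin`."""
    sim_pos = cosine_similarity(query, positive)
    sim_neg = cosine_similarity(query, negative)
    return max(0.0, margin - (sim_pos - sim_neg))


# Toy "embeddings" (made up; real embeddings have hundreds of dimensions)
query = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
negative = [0.0, 1.0, 0.0]

loss_good = triplet_margin_loss(query, positive, negative)
loss_bad = triplet_margin_loss(query, negative, positive)  # roles swapped

print(f"Well-separated triplet loss: {loss_good:.3f}")
print(f"Swapped triplet loss:        {loss_bad:.3f}")
```

When the query is already closer to the positive than to the negative by at least the margin, the loss is zero and the triplet contributes no gradient. The swapped case yields a positive loss, which is the signal that would push the embeddings apart during training.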

## Step 2: Prepare Data for Customization

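For customizing embedding models, the NeMo Microservices platform expects data in JSONL format. Each line of the file is a single JSON object with the fields below (shown across multiple lines here for readability):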
```
{
    "query": "query text",
    "pos_doc": "positive document text",
    "neg_doc": ["negative document text 1", "negative document text 2", ...]
}
```
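
As a small illustration of the format shown above, the sketch below maps one `[query, positive, negative]` triplet to a record and checks its shape. The helper names `triplet_to_record` and `is_valid_record` are hypothetical, and the requirement that `neg_doc` be non-empty is an assumption rather than a documented rule.

```python
import json


def triplet_to_record(triplet):
    """Map a [query, positive, negative] list to the JSONL record shape."""
    query, positive, negative = triplet
    return {"query": query, "pos_doc": positive, "neg_doc": [negative]}


def is_valid_record(record):
    """Check field names and types against the format shown above."""
    return (
        isinstance(record.get("query"), str)
        and isinstance(record.get("pos_doc"), str)
        and isinstance(record.get("neg_doc"), list)
        and len(record["neg_doc"]) > 0  # assumption: at least one negative
        and all(isinstance(doc, str) for doc in record["neg_doc"])
    )


# Made-up titles, for illustration only
triplet = ["Title A", "Title B", "Title C"]
record = triplet_to_record(triplet)

# One JSON object per line is what makes the file JSONL
line = json.dumps(record)
print(line)
print("Valid:", is_valid_record(record))
```

Note that `neg_doc` is a list even when there is only one negative, which is the case for every SPECTER triplet.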

The following code cell converts the dataset to this format and splits it into training, validation, and test sets.

In [None]:
# Select data fraction if configured
data = dataset['train'].shuffle(seed=SEED)
if USE_FRACTION:
    data = data.select(range(int(len(data) * FRACTION)))
    print(f"Using {len(data)}/{len(dataset['train'])} examples ({FRACTION*100:.1f}%)")

# Split: 90% train, 5% validation, 5% test
train_val = data.train_test_split(test_size=0.10, seed=SEED)
val_test = train_val['test'].train_test_split(test_size=0.50, seed=SEED)
splits = {'train': train_val['train'], 'validation': val_test['train'], 'test': val_test['test']}

print(f"Train: {len(splits['train'])} | Validation: {len(splits['validation'])} | Test: {len(splits['test'])}\n")

# Save to JSONL format
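# Format the percentage with :g so float rounding (e.g. 0.29 * 100 = 28.999...) does not truncate the folder name
folder_name = f"specter_{FRACTION * 100:g}pct" if USE_FRACTION else "specter_full"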
save_path = os.path.join(DATA_SAVE_PATH, folder_name)

for split_name, split_data in [("training", splits['train']), ("validation", splits['validation']), ("testing", splits['test'])]:
    split_dir = os.path.join(save_path, split_name)
    os.makedirs(split_dir, exist_ok=True)
    
    file_path = os.path.join(split_dir, f"{split_name}.jsonl")
    with open(file_path, "w", encoding="utf-8") as f:
        for row in split_data:
            example = {"query": row['set'][0], "pos_doc": row['set'][1], "neg_doc": [row['set'][2]]}
            f.write(json.dumps(example) + "\n")
    print(f"Saved {file_path}")

# Display sample
print("\nSample from training set:")
for i, row in enumerate(splits['train'].select(range(3))):
    print(f"\nExample {i+1}:")
    print(f"  Query:    {row['set'][0]}")
    print(f"  Positive: {row['set'][1]}")
    print(f"  Negative: {row['set'][2]}")
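
Before moving on, it can be useful to read a saved file back and confirm that every line parses as JSON. The following is a minimal sketch of such a check. The `load_jsonl` helper is hypothetical, and the demo writes two made-up records to a temporary file so that it runs on its own.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Self-contained demo: write two made-up records to a temporary file
demo_records = [
    {"query": "q1", "pos_doc": "p1", "neg_doc": ["n1"]},
    {"query": "q2", "pos_doc": "p2", "neg_doc": ["n2"]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in demo_records:
            f.write(json.dumps(record) + "\n")
    loaded = load_jsonl(demo_path)

print(f"Loaded {len(loaded)} records; keys: {sorted(loaded[0])}")
```

With the default settings in this notebook, the same helper could be pointed at `./data/specter_10pct/training/training.jsonl`, and the record count compared against the training split size printed earlier.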

---

## Next Steps

✅ **Completed in this notebook:**
- Loaded the SPECTER dataset containing scientific paper triplets
- Inspected the dataset structure and sample examples
- Converted the data to NeMo Microservices JSONL format
- Split the data into training (90%), validation (5%), and test (5%) sets
- Saved the processed data for fine-tuning

**Continue to [2_finetuning_and_inference.ipynb](./2_finetuning_and_inference.ipynb)** to:
- Upload the prepared dataset to NeMo Data Store
- Configure and launch the embedding fine-tuning job
- Deploy the fine-tuned model as a NIM
- Run inference and test the customized model
