# Part I: Preparing the Dataset

This notebook shows how to transform a dataset for fine-tuning an embedding model with NeMo Microservices.


It covers the following steps:
1. Download the SPECTER dataset.
2. Prepare the data for embedding fine-tuning.

*Dataset Disclaimer: Each user is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.*

In [1]:
import os
import json
import random
from datasets import load_dataset

The following code cell sets a random seed for reproducibility and sets the data path.
It also configures the fraction of the training data to use for demonstration purposes, because training on the full [SPECTER](https://huggingface.co/datasets/embedding-data/SPECTER) dataset may take several hours.

In [2]:
SEED = 42
random.seed(SEED)

DATA_SAVE_PATH = "./data"

# Configuration for data fraction
USE_FRACTION = True  # Set to False to use full dataset
FRACTION = 0.1  # Use 10% of the dataset (0.1 = 10%, 0.01 = 1%, etc.)
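
Calling `random.seed(SEED)` seeds Python's global generator, so a later shuffle is reproducible only if nothing else draws from that generator in between. As a minimal sketch (the helper name `seeded_shuffle` is hypothetical and not part of this notebook), a dedicated `random.Random` instance keeps the shuffle order independent of other code:

```python
import random

SEED = 42


def seeded_shuffle(items, seed=SEED):
    """Return a shuffled copy of `items` using a private, seeded generator."""
    rng = random.Random(seed)  # private generator; global state is untouched
    shuffled = list(items)     # copy so the caller's sequence is not mutated
    rng.shuffle(shuffled)
    return shuffled


first = seeded_shuffle(range(10))
second = seeded_shuffle(range(10))
print(first == second)  # True: same seed, same order
```

The notebook itself uses the global generator, which is sufficient as long as the cells are run once, in order.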

## Step 1: Download the SPECTER Dataset

The SPECTER dataset contains scientific paper triples for training embedding models. This step loads the dataset from Hugging Face.

In [3]:
from config import HF_TOKEN

os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HF_ENDPOINT"] = "https://huggingface.co"

In [4]:
# Load the dataset directly from Hugging Face
dataset = load_dataset("embedding-data/SPECTER")
print(f"Dataset info: {dataset}")

Dataset info: DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 684100
    })
})


In [5]:
# Inspect the first 3 rows
dataset["train"][:3]["set"]

[['Millimeter-wave CMOS digital controlled artificial dielectric differential mode transmission lines for reconfigurable ICs',
  'CMP network-on-chip overlaid with multi-band RF-interconnect',
  'Route packets, not wires: on-chip interconnection networks'],
 ['Millimeter-wave CMOS digital controlled artificial dielectric differential mode transmission lines for reconfigurable ICs',
  'CMP network-on-chip overlaid with multi-band RF-interconnect',
  'Entheses: tendon and ligament attachment sites'],
 ['Millimeter-wave CMOS digital controlled artificial dielectric differential mode transmission lines for reconfigurable ICs',
  'CMP network-on-chip overlaid with multi-band RF-interconnect',
  'Packet leashes: a defense against wormhole attacks in wireless networks']]

Each row in the dataset is a triplet of paper titles, in order: a query, a positive passage, and a negative passage.

During training, the embedding model uses contrastive learning to maximize the similarity between the query and its positive passage (the one relevant to the query), while minimizing the similarity between the query and the sampled negative passage, which is not relevant to it.
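
To make the objective concrete, here is a minimal sketch of an InfoNCE-style contrastive loss for a single triplet, written in plain Python. It is illustrative only: the exact loss, temperature, and negative sampling used by NeMo Microservices are not specified in this notebook, and the function names, the `temperature` value, and the toy embeddings are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: negative log-softmax of the positive among all candidates."""
    pos_logit = cosine_similarity(query, positive) / temperature
    neg_logits = [cosine_similarity(query, n) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - pos_logit


# Toy 2-D embeddings, made up for illustration.
query = [1.0, 0.0]
close_positive = [0.9, 0.1]
far_negative = [0.0, 1.0]

good = contrastive_loss(query, close_positive, [far_negative])
bad = contrastive_loss(query, far_negative, [close_positive])
print(good < bad)  # True
```

The loss is close to zero when the query is much more similar to the positive than to the negative, and large when the roles are swapped, which is the behavior training pushes the embeddings toward.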

## Step 2: Prepare Data for Customization

To customize embedding models, the NeMo Microservices platform uses a JSONL format, where each line is a JSON object with this structure:
```
{
    "query": "query text",
    "pos_doc": "positive document text",
    "neg_doc": ["negative document text 1", "negative document text 2", ...]
}
```
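
As a rough sketch of how a record in this format could be checked before it is written out, the hypothetical helper `is_valid_record` below verifies the three fields and their types. It reflects only the structure shown above; any further constraints the platform may impose are not covered.

```python
import json


def is_valid_record(record):
    """Return True if `record` has the query/pos_doc/neg_doc layout shown above."""
    if not isinstance(record, dict):
        return False
    if not all(isinstance(record.get(key), str) for key in ("query", "pos_doc")):
        return False
    neg_docs = record.get("neg_doc")
    # neg_doc must be a list of strings, even when there is only one negative.
    return isinstance(neg_docs, list) and all(isinstance(d, str) for d in neg_docs)


# One JSONL line is one JSON object serialized on a single line.
line = json.dumps({
    "query": "query text",
    "pos_doc": "positive document text",
    "neg_doc": ["negative document text 1"],
})
print(is_valid_record(json.loads(line)))  # True
```

Note that a bare string in `neg_doc` fails the check: the field is a list even for a single negative.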

The following code cell:
1. Defines a helper for splitting the data.
2. Takes a fraction of the data and converts each row to the required format.
3. Saves the data splits to JSONL files.
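
The split helper slices the shuffled list into contiguous blocks, truncating the training and validation sizes with `int()` and giving whatever remains to the test set. A minimal sketch of that size arithmetic, using a hypothetical `split_sizes` function and a toy example count:

```python
def split_sizes(n_examples, train_ratio, val_ratio):
    """Return (train, val, test) sizes; the test set takes the remainder."""
    n_train = int(n_examples * train_ratio)  # int() truncates toward zero
    n_val = int(n_examples * val_ratio)
    n_test = n_examples - n_train - n_val
    return n_train, n_val, n_test


print(split_sizes(25, 0.8, 0.1))  # (20, 2, 3)
```

This is why the test split in the run below is one example larger than the validation split: 68410 * 0.05 = 3420.5, which truncates to 3420 for validation and leaves 3421 for testing.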

In [6]:
def split_data(data, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1):
    """
    Splits the data into training, validation, and test sets.
    """
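    # Compare with a tolerance: exact float equality fails for some valid ratios (e.g. 0.7 + 0.2 + 0.1)
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-9, "Ratios must sum to 1"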
    
    # Compute split indices
    train_end = int(len(data) * train_ratio)
    val_end = train_end + int(len(data) * val_ratio)
    
    # Split the data
    train_set = data[:train_end]
    val_set = data[train_end:val_end]
    test_set = data[val_end:]
    
    return train_set, val_set, test_set


try:
    # Get the raw data
    raw_data = dataset['train']['set']
    print(f"Total examples in dataset: {len(raw_data)}")
    
    # Shuffle the data once at the beginning
    raw_data_list = list(raw_data)
    random.shuffle(raw_data_list)
    
    # Apply fraction if specified (after shuffling)
    if USE_FRACTION:
        original_size = len(raw_data_list)
        fraction_size = int(len(raw_data_list) * FRACTION)
        raw_data_list = raw_data_list[:fraction_size]
        print(f"Using fraction of dataset: {len(raw_data_list)}/{original_size} examples ({FRACTION*100:.1f}%)")
    else:
        print(f"Using full dataset: {len(raw_data_list)} examples")
    
    # Format the data
    data = []
    for example in raw_data_list:
        data.append({
            "query": example[0],
            "pos_doc": example[1], 
            "neg_doc": [example[2]]  # neg_doc as a list of strings
        })
    print(f"Formatted {len(data)} examples")
    
    # Split the data
    train, val, test = split_data(data, train_ratio=0.90, val_ratio=0.05, test_ratio=0.05)
    
    print(f"\nTrain set: {len(train)} examples")
    print(f"Validation set: {len(val)} examples")
    print(f"Test set: {len(test)} examples")
    
    # Generate save path with fraction suffix if using a fraction of the dataset
    if USE_FRACTION:
        # Convert fraction to percentage for folder name (e.g., 0.1 -> 10pct, 0.01 -> 1pct)
        # Format with :g rather than int(), which truncates float error (e.g. 0.29 * 100 -> 28)
        fraction_pct = f"{FRACTION * 100:g}"
        folder_name = f"specter_{fraction_pct}pct"
    else:
        folder_name = "specter_full"
    
    save_path = os.path.join(DATA_SAVE_PATH, folder_name)
    print(f"Saving data to: {save_path}")
    
    # Create directories for each split
    for split_name in ["training", "validation", "testing"]:
        split_dir = os.path.join(save_path, split_name)
        os.makedirs(split_dir, exist_ok=True)
    
    # Save to JSONL files in respective folders
    for fname, ds, folder in (("training.jsonl", train, "training"), 
                              ("validation.jsonl", val, "validation"), 
                              ("testing.jsonl", test, "testing")):
        file_path = os.path.join(save_path, folder, fname)
        with open(file_path, "w") as out:
            for obj in ds:
                out.write(json.dumps(obj) + "\n")
        print(f"Saved {len(ds)} examples to {file_path}")
    
    # Display first few examples from training set
    print("\nFirst few examples from training set:")
    for i, example in enumerate(train[:3]):
        print(f"Example {i+1}:")
        print(f"  Query: {example['query']}")
        print(f"  Positive: {example['pos_doc']}")
        print(f"  Negative: {example['neg_doc']}")
        print()
        
except Exception as e:
    # The try block covers loading, formatting, and saving, so report generically and re-raise
    print(f"Error preparing dataset: {e}")
    raise

Total examples in dataset: 684100
Using fraction of dataset: 68410/684100 examples (10.0%)
Formatted 68410 examples

Train set: 61569 examples
Validation set: 3420 examples
Test set: 3421 examples
Saving data to: ./data/specter_10pct
Saved 61569 examples to ./data/specter_10pct/training/training.jsonl
Saved 3420 examples to ./data/specter_10pct/validation/validation.jsonl
Saved 3421 examples to ./data/specter_10pct/testing/testing.jsonl

First few examples from training set:
Example 1:
  Query: Rhythm, Metrics, and the Link to Phonology
  Positive: Rhythm, Timing and the Timing of Rhythm
  Negative: ['Social software and participatory learning: Pedagogical choices with technology affordances in the Web 2.0 era']

Example 2:
  Query: underwater image processing : state of the art of restoration and image enhancement methods .
  Positive: Image quality assessment: from error visibility to structural similarity
  Negative: ['An overview of home automation systems']

Example 3:
  Query: Ma