# BRIGHT Benchmark - Steps 1-3 Demo

## Step 1: Setup
- Dependencies installed
- Project structure created
- Config file ready
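
The later cells read three keys under `config['dataset']['reasonir']`. As a rough sketch of the structure they appear to expect, the parsed config might look like the dictionary below. The `name` and `subset` values are taken from the Step 3A log; the `cache_dir` value and the helper `missing_reasonir_keys` are hypothetical and exist only for illustration.

```python
# Hypothetical shape of the parsed config/config.yaml; only the
# dataset.reasonir keys below are read by the notebook cells.
config = {
    "dataset": {
        "reasonir": {
            "name": "reasonir/reasonir-data",  # as printed in the Step 3A log
            "subset": "hq",                    # as printed in the Step 3A log
            "cache_dir": "data/reasonir",      # assumed placeholder path
        },
    },
}

REQUIRED_REASONIR_KEYS = ("name", "subset", "cache_dir")


def missing_reasonir_keys(cfg: dict) -> list:
    """Return the required dataset.reasonir keys that are absent from cfg."""
    section = cfg.get("dataset", {}).get("reasonir", {})
    return [key for key in REQUIRED_REASONIR_KEYS if key not in section]


print(missing_reasonir_keys(config))
```

Running a check like this before Step 2 would surface a missing key as a readable list rather than a `KeyError` partway through a cell.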

## Step 2: Load BRIGHT Dataset (for document lookup) and ReasonIR-HQ (for training)

In [1]:
import sys
from pathlib import Path
sys.path.append(str(Path.cwd() / 'src'))

from data.bright_loader import BRIGHTLoader
from utils.helpers import load_config

# Load config
config = load_config('config/config.yaml')

# Step 2A: Load BRIGHT dataset (needed for document ID mapping)
print("=" * 80)
print("Step 2A: Loading BRIGHT dataset for document lookup...")
print("=" * 80)
loader = BRIGHTLoader(config_path='config/config.yaml')
dataset = loader.load_dataset()

print("\n✅ BRIGHT dataset loaded!")
print(f"Available domains in documents: {list(loader.documents_dataset.keys())[:5]}...")
print(f"Available domains in examples: {list(loader.examples_dataset.keys())[:5]}...")

# Create ID-to-text mapping (needed for ReasonIR-HQ)
print("\n" + "=" * 80)
print("Creating BRIGHT document ID-to-text mapping...")
print("=" * 80)
id2doc = loader.get_all_documents_id_map()
print(f"✅ Created mapping for {len(id2doc)} documents")

# Step 2B: ReasonIR-HQ dataset configuration
print("\n" + "=" * 80)
print("Step 2B: ReasonIR-HQ dataset configuration...")
print("=" * 80)
reasonir_config = config['dataset']['reasonir']
print(f"Dataset: {reasonir_config['name']}")
print(f"Subset: {reasonir_config['subset']}")
print(f"Cache dir: {reasonir_config['cache_dir']}")
print("✅ ReasonIR-HQ will be loaded during preprocessing (Step 3)")

Step 2A: Loading BRIGHT dataset for document lookup...
Loading BRIGHT 'documents' from: xlangai/BRIGHT
Using cache directory: /Users/aiamn/scratch/aiamn/dense-retrieval-SOTA/data/bright
Loading BRIGHT 'Gemini-1.0_reason' (queries/qrels) from: xlangai/BRIGHT
Loaded Documents Domains: ['psychology', 'sustainable_living', 'earth_science', 'stackoverflow', 'biology', 'economics', 'pony', 'leetcode', 'theoremqa_theorems', 'robotics', 'aops', 'theoremqa_questions']
Loaded Examples Domains: ['psychology', 'sustainable_living', 'earth_science', 'stackoverflow', 'biology', 'economics', 'pony', 'leetcode', 'theoremqa_theorems', 'robotics', 'aops', 'theoremqa_questions']

✅ BRIGHT dataset loaded!
Available domains in documents: ['biology', 'earth_science', 'economics', 'psychology', 'robotics']...
Available domains in examples: ['biology', 'earth_science', 'economics', 'psychology', 'robotics']...

Creating BRIGHT document ID-to-text mapping...
Created ID-to-text mapping for 1145164 documents acr
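
The implementation of `get_all_documents_id_map` is not shown in this notebook. A minimal sketch of the idea, flattening per-domain document records into one ID-to-text dictionary, could look like the following. The function name `build_id_map` and the record field names `id` and `content` are assumptions, and the toy records are invented for the example.

```python
def build_id_map(documents_by_domain):
    """Flatten {domain: [records]} into a single {doc_id: text} dict."""
    id2doc = {}
    for records in documents_by_domain.values():
        for record in records:
            # "id" and "content" are assumed field names for a document record.
            id2doc[record["id"]] = record["content"]
    return id2doc


# Toy data standing in for the real per-domain document splits.
toy_documents = {
    "biology": [{"id": "bio/doc_0.txt", "content": "Insects have compound eyes."}],
    "economics": [{"id": "econ/doc_0.txt", "content": "Supply meets demand."}],
}
id2doc_toy = build_id_map(toy_documents)
print(len(id2doc_toy))
```

In this sketch, an ID that occurs in more than one domain is silently overwritten by the later record, so comparing `len(id2doc)` against the total record count is a cheap way to detect collisions.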

In [5]:
# Extract and explore ReasonIR-HQ dataset
from datasets import load_dataset

print("=" * 80)
print("Loading ReasonIR-HQ dataset...")
print("=" * 80)
reasonir_config = config['dataset']['reasonir']
hq_dataset = load_dataset(
    reasonir_config['name'],
    reasonir_config['subset'],
    cache_dir=reasonir_config.get('cache_dir')
)

print("\n✅ ReasonIR-HQ dataset loaded!")
print(f"  - Total examples: {len(hq_dataset['train'])}")
print(f"  - Dataset structure: {list(hq_dataset['train'][0].keys())}")

# Show sample entry
print("\n" + "=" * 80)
print("Sample ReasonIR-HQ entry (before mapping):")
print("=" * 80)
sample_entry = hq_dataset['train'][0]
print(f"Query sequence: {sample_entry['query']}")
print(f"Query (actual): {sample_entry['query'][1] if len(sample_entry['query']) > 1 else sample_entry['query'][0]}")
print(f"Positive document IDs: {sample_entry['pos']}")
# Note: in this sample, 'neg' holds passage text rather than a BRIGHT document ID
print(f"Negative document IDs: {sample_entry['neg']}")

# Map one document ID to show actual text
if sample_entry['pos']:
    first_pos = sample_entry['pos'][0]
    if isinstance(first_pos, list) and len(first_pos) >= 2:
        doc_id = first_pos[1]
        if doc_id in id2doc:
            print(f"\n✅ Mapped document ID '{doc_id}' to text:")
            print(f"Document text (first 200 chars): {id2doc[doc_id][:200]}...")
        else:
            print(f"\n⚠️ Document ID '{doc_id}' not found in BRIGHT mapping")

Loading ReasonIR-HQ dataset...

✅ ReasonIR-HQ dataset loaded!
  - Total examples: 100521
  - Dataset structure: ['query', 'pos', 'neg']

Sample ReasonIR-HQ entry (before mapping):
Query sequence: ['Given this reasoning-intensive query, find relevant documents that could help answer the question. ', 'A researcher is analyzing a sound signal represented by the equation f(t) = 2sin(3πt) + sin(5πt) + 0.5sin(7πt). Using the Fourier transform, what are the frequencies, amplitudes, and phases of the individual sinusoidal components in the signal?']
Query (actual): A researcher is analyzing a sound signal represented by the equation f(t) = 2sin(3πt) + sin(5πt) + 0.5sin(7πt). Using the Fourier transform, what are the frequencies, amplitudes, and phases of the individual sinusoidal components in the signal?
Positive document IDs: [['', 'camel_44852']]
Negative document IDs: [['', 'The Fourier transform is widely used in various fields, including engineering, physics, and data analysis. It is a p
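
In the sample above, each `pos` item is an `[instruction, doc_id]` pair, while the `neg` item carries passage text in the second slot. A hedged sketch of resolving one entry under that reading follows; `resolve_entry` is a hypothetical helper, not the preprocessor's actual code, and the toy entry and ID are invented.

```python
def resolve_entry(entry, id2doc):
    """Replace positive doc IDs with text; keep negatives that are already text."""
    parts = entry["query"]
    query = parts[1] if len(parts) > 1 else parts[0]
    positives, negatives = [], []
    for _instruction, doc_id in entry["pos"]:
        # Positives whose ID is unknown are skipped rather than guessed.
        if doc_id in id2doc:
            positives.append(id2doc[doc_id])
    for _instruction, value in entry["neg"]:
        # Assumption: a value that is not a known ID is the passage text itself.
        negatives.append(id2doc.get(value, value))
    return {"query": query, "positives": positives, "negatives": negatives}


toy_entry = {
    "query": ["Find relevant documents. ", "What is X?"],
    "pos": [["", "doc_1"]],
    "neg": [["", "A hard negative passage."]],
}
resolved = resolve_entry(toy_entry, {"doc_1": "Positive passage text."})
print(resolved["negatives"])
```

Falling back to the raw value for negatives is one possible design choice: a strict ID lookup would drop every text-valued negative and leave the list empty.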

In [2]:
# Extract biology domain data
biology_data = loader.get_data_split('biology')

print("✅ Biology domain extracted:")
print(f"  - Corpus: {len(biology_data['corpus'])} documents")
print(f"  - Queries: {len(biology_data['queries'])} queries")
print(f"  - Qrels: {len(biology_data['qrels'])} relevance judgments")
print(f"Sample query: {biology_data['queries'].iloc[0]['query']}")
print(f"Sample corpus doc: {biology_data['corpus'].iloc[0]['text'][:100]}...")

✅ Biology domain extracted:
  - Corpus: 57359 documents
  - Queries: 103 queries
  - Qrels: 372 relevance judgments
Sample query: ## Essential Problem:

The article claims that insects are not attracted to light sources solely due to heat radiation. However, the user argues that insects could be evolutionarily programmed to associate light with heat, potentially explaining their attraction to LEDs despite the lack of significant heat emission.

## Relevant Information:

* **Insect vision:** Insects have compound eyes, which differ from the lens-based eyes of humans and other vertebrates. Compound eyes are highly sensitive to movement and light intensity, but have lower resolution and are less adept at distinguishing colors.
* **Evolutionary history:** Insects have existed for hundreds of millions of years, evolving alongside various light sources like the sun, moon, and bioluminescent organisms.
* **Light and heat association:** In nature, sunlight is often accompanied by heat, creatin
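
The counts above (372 relevance judgments for 103 queries, about 3.6 per query) suggest that each query lists several relevant documents. How `get_data_split` derives its qrels is not shown here; one plausible sketch expands a per-query list of gold document IDs into one row per pair. The helper `expand_qrels`, the field name `gold_ids`, and the output column names are assumptions.

```python
def expand_qrels(examples):
    """Turn per-query gold ID lists into one relevance row per (query, doc) pair."""
    rows = []
    for example in examples:
        for doc_id in example["gold_ids"]:  # assumed field name
            rows.append(
                {"query_id": example["id"], "doc_id": doc_id, "relevance": 1}
            )
    return rows


# Toy examples standing in for one domain's queries.
toy_examples = [
    {"id": "0", "query": "Why are insects drawn to light?", "gold_ids": ["a.txt", "b.txt"]},
    {"id": "1", "query": "Do plants sleep?", "gold_ids": ["c.txt"]},
]
toy_qrels = expand_qrels(toy_examples)
print(len(toy_qrels))
```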

## Step 3: Preprocess to Tevatron JSONL Format

In [3]:
# Step 3: Prepare ReasonIR-HQ Training Data and BRIGHT Evaluation Data
from data.preprocessor import BRIGHTPreprocessor

preprocessor = BRIGHTPreprocessor(output_dir='data/processed')

# Step 3A: Prepare ReasonIR-HQ training data
print("=" * 80)
print("Step 3A: Preparing ReasonIR-HQ training data...")
print("=" * 80)
reasonir_config = config['dataset']['reasonir']
train_path = preprocessor.prepare_reasonir_hq_train_data(
    id2doc=id2doc,
    dataset_name=reasonir_config['name'],
    subset=reasonir_config['subset'],
    cache_dir=reasonir_config.get('cache_dir'),
    filename='train_reasonir.jsonl'
)
print(f"\n✅ ReasonIR-HQ training data saved to: {train_path}")

# Step 3B: Prepare BRIGHT evaluation data (for a domain)
print("\n" + "=" * 80)
print("Step 3B: Preparing BRIGHT evaluation data (biology domain)...")
print("=" * 80)
biology_data = loader.get_data_split('biology')

corpus_path = preprocessor.prepare_tevatron_corpus(
    biology_data['corpus'],
    'biology_corpus.jsonl'
)

queries_path = preprocessor.prepare_tevatron_queries(
    biology_data['queries'],
    'biology_queries.jsonl'
)

qrels_path = preprocessor.prepare_trec_qrels(
    biology_data['qrels'],
    'biology_qrels.txt'
)

print("\n✅ BRIGHT evaluation files created:")
print(f"  - {corpus_path}")
print(f"  - {queries_path}")
print(f"  - {qrels_path}")

Step 3A: Preparing ReasonIR-HQ training data...
Preparing ReasonIR-HQ training data...
Loading ReasonIR dataset: reasonir/reasonir-data (subset: hq)...


Mapping document IDs to texts...


Formatting training data to data/processed/train_reasonir.jsonl...
Saved 100521 training examples to data/processed/train_reasonir.jsonl

✅ ReasonIR-HQ training data saved to: data/processed/train_reasonir.jsonl

Step 3B: Preparing BRIGHT evaluation data (biology domain)...
Processing 57359 documents for biology_corpus.jsonl...
Processing 103 queries for biology_queries.jsonl...
Saved TREC qrels to data/processed/biology_qrels.txt

✅ BRIGHT evaluation files created:
  - data/processed/biology_corpus.jsonl
  - data/processed/biology_queries.jsonl
  - data/processed/biology_qrels.txt
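
The `BRIGHTPreprocessor` methods are not shown in this notebook. As a rough sketch of the file formats involved, the helpers below write one JSON object per line (JSONL) and a TREC-style qrels file with `query_id 0 doc_id relevance` on each line. The function names and the qrels row keys are hypothetical; the corpus and query records are assumed to use the `id` and `text` keys that the verification cell reads back.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts as one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps characters such as π readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def write_trec_qrels(rows, path):
    """Write relevance rows as 'query_id 0 doc_id relevance' lines."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            # The second column is the TREC iteration field, conventionally 0.
            f.write(f"{row['query_id']} 0 {row['doc_id']} {row['relevance']}\n")
    return path
```

A call such as `write_jsonl([{"id": "d1", "text": "..."}], "data/processed/example_corpus.jsonl")` would produce a file that can be read back line by line with `json.loads`.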


In [4]:
# Verify JSONL format
import json

print("=" * 80)
print("Verifying ReasonIR-HQ training data format...")
print("=" * 80)
print("\nSample ReasonIR-HQ training entry:")
with open(train_path, 'r', encoding='utf-8') as f:
    sample_train = json.loads(f.readline())
    print(f"Query ID: {sample_train['query_id']}")
    print(f"Query (first 200 chars): {sample_train['query'][:200]}...")
    print(f"Number of positives: {len(sample_train['positives'])}")
    print(f"First positive (first 200 chars): {sample_train['positives'][0][:200]}...")
    print(f"Negatives: {sample_train['negatives']}")

print("\n" + "=" * 80)
print("Verifying BRIGHT evaluation data format...")
print("=" * 80)
print("\nSample BRIGHT corpus entry:")
with open(corpus_path, 'r', encoding='utf-8') as f:
    sample_corpus = json.loads(f.readline())
    print(f"Doc ID: {sample_corpus['id']}")
    print(f"Text (first 200 chars): {sample_corpus['text'][:200]}...")

print("\nSample BRIGHT query entry:")
with open(queries_path, 'r', encoding='utf-8') as f:
    sample_query = json.loads(f.readline())
    print(f"Query ID: {sample_query['id']}")
    print(f"Query (first 200 chars): {sample_query['text'][:200]}...")

print("\n✅ All data formats verified!")


Verifying ReasonIR-HQ training data format...

Sample ReasonIR-HQ training entry:
Query ID: reasonir_0
Query (first 200 chars): A researcher is analyzing a sound signal represented by the equation f(t) = 2sin(3πt) + sin(5πt) + 0.5sin(7πt). Using the Fourier transform, what are the frequencies, amplitudes, and phases of the ind...
Number of positives: 1
First positive (first 200 chars): A sound signal is given by the equation f(t) = sin(2πt) + sin(4πt) + sin(6πt) where t is time in seconds. Use Fourier transform to find the frequencies, amplitudes, and phases of the individual sinuso...
Negatives: []

Verifying BRIGHT evaluation data format...

Sample BRIGHT corpus entry:
Doc ID: neanderthals_vitamin_C_diet/Neanderthal_0_43.txt
Text (first 200 chars):  pelvises; and proportionally shorter forearms and forelegs.
Based on 45 Neanderthal long bones from 14 men and 7 women, the average height was 164 to 168 cm (5 ft 5 in to 5 ft 6 in) for males and 152...

Sample BRIGHT query entry:
Query I