# BRIGHT Benchmark - Steps 1-3 Demo

## Step 1: Setup
- Dependencies installed
- Project structure created
- Config file ready
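
The checklist above can be verified before running the later cells. The sketch below is a minimal check, assuming the layout implied by the imports and paths used in this notebook (`src/data` on the import path and `config/config.yaml`); the helper name `missing_paths` and the `EXPECTED_PATHS` list are hypothetical and should be adjusted to the real project.

```python
from pathlib import Path

# Assumed layout, inferred from the imports and config path used in this notebook.
EXPECTED_PATHS = ["src/data", "config/config.yaml"]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(Path.cwd())
    if missing:
        print(f"Missing: {missing}")
    else:
        print("Project structure looks complete.")
```

An empty list means every expected path is present.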

## Step 2: Load BRIGHT Dataset

In [1]:
import sys
from pathlib import Path
sys.path.append(str(Path.cwd() / 'src'))

from data.bright_loader import BRIGHTLoader

# Initialize loader
loader = BRIGHTLoader(config_path='config/config.yaml')

# Load dataset
dataset = loader.load_dataset()

print("✅ Dataset loaded!")
print(f"Available domains in documents: {list(loader.documents_dataset.keys())[:5]}...")
print(f"Available domains in examples: {list(loader.examples_dataset.keys())[:5]}...")

Loading BRIGHT 'documents' from: xlangai/BRIGHT
Generating biology split:   0%|          | 0/57359 [00:00<?, ? examples/s]
Generating earth_science split:   0%|          | 0/121249 [00:00<?, ? examples/s]
Generating economics split:   0%|          | 0/50220 [00:00<?, ? examples/s]
Generating psychology split:   0%|          | 0/52835 [00:00<?, ? examples/s]
Generating robotics split:   0%|          | 0/61961 [00:00<?, ? examples/s]
Generating stackoverflow split:   0%|          | 0/107081 [00:00<?, ? examples/s]
Generating sustainable_living split:   0%|          | 0/60792 [00:00<?, ? examples/s]
Generating pony split:   0%|          | 0/7894 [00:00<?, ? examples/s]
Generating leetcode split:   0%|          | 0/413932 [00:00<?, ? examples/s]
Generating aops split:   0%|          | 0/188002 [00:00<?, ? examples/s]
Generating theoremqa_theorems split:   0%|          | 0/23839 [00:00<?, ? examples/s]
Generating theoremqa_questions split:   0%|          | 0/188002 [00:00<?, ? examples/s]

Loading BRIGHT 'Gemini-1.0_reason' (queries/qrels) from: xlangai/BRIGHT
Generating biology split:   0%|          | 0/103 [00:00<?, ? examples/s]
Generating earth_science split:   0%|          | 0/116 [00:00<?, ? examples/s]
Generating economics split:   0%|          | 0/103 [00:00<?, ? examples/s]
Generating psychology split:   0%|          | 0/101 [00:00<?, ? examples/s]
Generating robotics split:   0%|          | 0/101 [00:00<?, ? examples/s]
Generating stackoverflow split:   0%|          | 0/117 [00:00<?, ? examples/s]
Generating sustainable_living split:   0%|          | 0/108 [00:00<?, ? examples/s]
Generating leetcode split:   0%|          | 0/142 [00:00<?, ? examples/s]
Generating pony split:   0%|          | 0/112 [00:00<?, ? examples/s]
Generating aops split:   0%|          | 0/111 [00:00<?, ? examples/s]
Generating theoremqa_questions split:   0%|          | 0/194 [00:00<?, ? examples/s]
Generating theoremqa_theorems split:   0%|          | 0/76 [00:00<?, ? examples/s]

Loaded Documents Domains: ['sustainable_living', 'leetcode', 'robotics', 'psychology', 'theoremqa_theorems', 'economics', 'aops', 'pony', 'biology', 'theoremqa_questions', 'earth_science', 'stackoverflow']
Loaded Examples Domains: ['sustainable_living', 'leetcode', 'robotics', 'psychology', 'theoremqa_theorems', 'economics', 'aops', 'pony', 'biology', 'theoremqa_questions', 'earth_science', 'stackoverflow']
✅ Dataset loaded!
Available domains in documents: ['biology', 'earth_science', 'economics', 'psychology', 'robotics']...
Available domains in examples: ['biology', 'earth_science', 'economics', 'psychology', 'robotics']...
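
For reference, the log above shows two configurations being pulled from `xlangai/BRIGHT`: `documents` and `Gemini-1.0_reason`. The sketch below shows roughly what such a load could look like with the Hugging Face `datasets` library directly. It is not the actual `BRIGHTLoader` implementation; `load_bright_configs` and `split_sizes` are hypothetical helper names, and the download requires network access and `pip install datasets`.

```python
def load_bright_configs(repo_id="xlangai/BRIGHT", examples_config="Gemini-1.0_reason"):
    """Download the 'documents' config and one queries/qrels config.

    Both config names are taken from the log output in this notebook.
    """
    # Imported lazily so the helper below can be used without the library installed.
    from datasets import load_dataset

    documents = load_dataset(repo_id, "documents")
    examples = load_dataset(repo_id, examples_config)
    return documents, examples


def split_sizes(dataset_dict):
    """Map each split (domain) name to its number of rows."""
    return {name: len(split) for name, split in dataset_dict.items()}


if __name__ == "__main__":
    documents, examples = load_bright_configs()
    print(split_sizes(documents))
    print(split_sizes(examples))
```

`split_sizes` works on any mapping of split names to sized collections, so it can be used to compare the loaded row counts against the totals shown in the progress bars.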


In [2]:
# Extract biology domain data
biology_data = loader.get_data_split('biology')

print("✅ Biology domain extracted:")
print(f"  - Corpus: {len(biology_data['corpus'])} documents")
print(f"  - Queries: {len(biology_data['queries'])} queries")
print(f"  - Qrels: {len(biology_data['qrels'])} relevance judgments")
print(f"Sample query: {biology_data['queries'].iloc[0]['query']}")
print(f"Sample corpus doc: {biology_data['corpus'].iloc[0]['text'][:100]}...")

✅ Biology domain extracted:
  - Corpus: 57359 documents
  - Queries: 103 queries
  - Qrels: 372 relevance judgments
Sample query: ## Essential Problem:

The article claims that insects are not attracted to light sources solely due to heat radiation. However, the user argues that insects could be evolutionarily programmed to associate light with heat, potentially explaining their attraction to LEDs despite the lack of significant heat emission.

## Relevant Information:

* **Insect vision:** Insects have compound eyes, which differ from the lens-based eyes of humans and other vertebrates. Compound eyes are highly sensitive to movement and light intensity, but have lower resolution and are less adept at distinguishing colors.
* **Evolutionary history:** Insects have existed for hundreds of millions of years, evolving alongside various light sources like the sun, moon, and bioluminescent organisms.
* **Light and heat association:** In nature, sunlight is often accompanied by heat, creatin
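
The counts above (103 queries, 372 relevance judgments) suggest that each query is linked to several gold documents. The sketch below shows one way such qrels could be flattened from per-query examples. It assumes each example carries an `id` and a `gold_ids` list; those field names and the `build_qrels` helper are assumptions for illustration, not the confirmed behavior of `get_data_split`.

```python
def build_qrels(examples):
    """Expand per-query gold document lists into one row per (query, document) pair.

    Assumes each example is a mapping with an `id` and a `gold_ids` list
    (hypothetical field names; check them against the loaded examples).
    """
    qrels = []
    for example in examples:
        for doc_id in example["gold_ids"]:
            qrels.append(
                {"query_id": str(example["id"]), "doc_id": doc_id, "relevance": 1}
            )
    return qrels


if __name__ == "__main__":
    toy_examples = [
        {"id": 0, "gold_ids": ["doc_a", "doc_b"]},
        {"id": 1, "gold_ids": ["doc_c"]},
    ]
    for row in build_qrels(toy_examples):
        print(row)
```

With this layout the number of qrels rows equals the total number of gold document IDs across all queries.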

## Step 3: Preprocess to Tevatron JSONL Format

In [4]:
# Step 3: Preprocess to Tevatron JSONL Format
from data.preprocessor import BRIGHTPreprocessor

preprocessor = BRIGHTPreprocessor(output_dir='data/processed')

# Prepare Tevatron format files (pass only filenames, not full paths)
corpus_path = preprocessor.prepare_tevatron_corpus(
    biology_data['corpus'],
    'biology_corpus.jsonl'
)

queries_path = preprocessor.prepare_tevatron_queries(
    biology_data['queries'],
    'biology_queries.jsonl'
)

train_path = preprocessor.prepare_train_data(
    biology_data['queries'],
    biology_data['corpus'],
    biology_data['qrels'],
    'biology_train.jsonl'
)

qrels_path = preprocessor.prepare_trec_qrels(
    biology_data['qrels'],
    'biology_qrels.txt'
)

print("✅ Tevatron staging files created:")
print(f"  - {corpus_path}")
print(f"  - {queries_path}")
print(f"  - {train_path}")
print(f"  - {qrels_path}")

Processing 57359 documents for biology_corpus.jsonl...
Processing 103 queries for biology_queries.jsonl...
Preparing training pairs for biology_train.jsonl...
Saved 103 training examples to data/processed/biology_train.jsonl
Saved TREC qrels to data/processed/biology_qrels.txt
✅ Tevatron staging files created:
  - data/processed/biology_corpus.jsonl
  - data/processed/biology_queries.jsonl
  - data/processed/biology_train.jsonl
  - data/processed/biology_qrels.txt
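
For reference, the two simplest output formats can be sketched with the standard library alone. This is not the `BRIGHTPreprocessor` code; `write_jsonl` and `write_trec_qrels` are hypothetical helpers. The sketch assumes corpus and query records are dicts with `id` and `text` keys, and that qrels rows have `query_id`, `doc_id`, and `relevance` keys. The qrels file uses the standard four-column TREC layout: query ID, iteration (conventionally `0`), document ID, relevance.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the output file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def write_trec_qrels(qrels, path):
    """Write qrels rows as 'query_id 0 doc_id relevance', one per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in qrels:
            f.write(f"{row['query_id']} 0 {row['doc_id']} {row['relevance']}\n")
    return path
```

Document IDs must not contain whitespace for the TREC file to parse correctly; the sample ID shown in this notebook (`neanderthals_vitamin_C_diet/Neanderthal_0_43.txt`) satisfies that.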


In [5]:
# Verify JSONL format
import json

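# Read only the first line of each file; open with an explicit encoding
# because the text contains non-ASCII characters.
print("Sample corpus entry:")
with open(corpus_path, 'r', encoding='utf-8') as f:
    print(json.loads(f.readline()))

print("\nSample query entry:")
with open(queries_path, 'r', encoding='utf-8') as f:
    print(json.loads(f.readline()))

print("\nSample training entry:")
with open(train_path, 'r', encoding='utf-8') as f:
    print(json.loads(f.readline()))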


Sample corpus entry:
{'id': 'neanderthals_vitamin_C_diet/Neanderthal_0_43.txt', 'text': ' pelvises; and proportionally shorter forearms and forelegs.\nBased on 45 Neanderthal long bones from 14 men and 7 women, the average height was 164 to 168\xa0cm (5\xa0ft 5\xa0in to 5\xa0ft 6\xa0in) for males and 152 to 156\xa0cm (5\xa0ft 0\xa0in to 5\xa0ft 1\xa0in) for females. For comparison, the average height of 20 males and 10 females Upper Palaeolithic humans is, respectively, 176.2\xa0cm (5\xa0ft 9.4\xa0in) and 162.9'}

Sample query entry:
{'id': '0', 'text': '## Essential Problem:\n\nThe article claims that insects are not attracted to light sources solely due to heat radiation. However, the user argues that insects could be evolutionarily programmed to associate light with heat, potentially explaining their attraction to LEDs despite the lack of significant heat emission.\n\n## Relevant Information:\n\n* **Insect vision:** Insects have compound eyes, which differ from the lens-based eyes o