# MS MARCO Dataset Exploration
# How the three parts connect:
# - topics: queries (qid, query)
# - corpus: documents (docno, text)
# - qrels: relevance judgments linking queries to documents (qid, docno, label)
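The relationship listed above can be sketched on toy data. The values below are invented for illustration and are not MS MARCO records; only the column names (qid, query, docno, text, label) come from this notebook. The `toy_` prefix keeps the example from overwriting the real tables loaded later.

```python
import pandas as pd

# Toy stand-ins for the three tables (hypothetical values; ids kept as strings).
toy_topics = pd.DataFrame({
    "qid": ["q1", "q2"],
    "query": ["what is bm25", "define recall"],
})
toy_corpus = pd.DataFrame({
    "docno": ["d1", "d2", "d3"],
    "text": [
        "BM25 is a ranking function.",
        "Recall is the fraction of relevant items retrieved.",
        "An unrelated passage.",
    ],
})
toy_qrels = pd.DataFrame({
    "qid": ["q1", "q2"],
    "docno": ["d1", "d2"],
    "label": [1, 1],
})

# qrels is the link table: join it to topics on qid, then to corpus on docno.
joined = (
    toy_qrels.merge(toy_topics, on="qid", how="left")
             .merge(toy_corpus, on="docno", how="left")
)
print(joined[["qid", "query", "docno", "label", "text"]])
```

A left join keeps every judgment even when the passage text is missing. That matters if the same join is tried on the 1000-document corpus sample loaded below, where only judged passages whose docno falls inside the sample will find their text.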

In [None]:
import pyterrier as pt
import pandas as pd

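# Deprecated in recent PyTerrier: Java now starts automatically (see the warning below).
# pt.java.init() can be used instead to force early initialisation.
pt.init()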


terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...



Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...



Done


Java started and loaded: pyterrier.java.colab, pyterrier.java, pyterrier.java.24, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
java is now started automatically with default settings. To force initialisation early, run:
pt.java.init() # optional, forces java initialisation
  pt.init()


In [3]:
dataset = pt.get_dataset("msmarco_passage")


In [14]:
# Use get_corpus_iter() - returns an iterable of documents
print("Loading corpus (taking first 1000 docs for quick exploration)...")
corpus_iter = dataset.get_corpus_iter()

# Convert to DataFrame - take a sample for exploration
from itertools import islice

# islice stops after the first 1000 documents without reading the rest of the corpus
corpus = pd.DataFrame(list(islice(corpus_iter, 1000)))
print(f"Loaded {len(corpus)} documents")
corpus.head()


Loading corpus (taking first 1000 docs for quick exploration)...
Loaded 1000 documents


Unnamed: 0,docno,text
0,0,The presence of communication amid scientific ...
1,1,The Manhattan Project and its atomic bomb help...
2,2,Essay on The Manhattan Project - The Manhattan...
3,3,The Manhattan Project was the name for a proje...
4,4,versions of each volume as well as complementa...


In [7]:
topics = dataset.get_topics(variant='dev.small')
# Convert to DataFrame if it's not already
if not isinstance(topics, pd.DataFrame):
    topics = pd.DataFrame(topics)
topics.head()


Downloading msmarco_passage tars to /root/.pyterrier/corpora/msmarco_passage/collectionandqueries.tar.gz



Unnamed: 0,qid,query
0,1048585,what is paula deen's brother
1,2,Androgen receptor define
2,524332,treating tension headaches without medication
3,1048642,what is paranoid sc
4,524447,treatment of varicose veins in legs


In [8]:
qrels = dataset.get_qrels(variant='dev.small')
# Convert to DataFrame if it's not already
if not isinstance(qrels, pd.DataFrame):
    qrels = pd.DataFrame(qrels)
qrels.head()


Unnamed: 0,qid,docno,label
0,300674,7067032,1
1,125705,7067056,1
2,94798,7067181,1
3,9083,7067274,1
4,174249,7067348,1
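A query can appear in more than one qrels row, so the number of judgments need not equal the number of queries. The following is a minimal sketch of counting judged passages per query; it assumes only the (qid, docno, label) layout shown above, and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical qrels in the same (qid, docno, label) layout as above.
toy_qrels = pd.DataFrame({
    "qid": ["10", "10", "11", "12"],
    "docno": ["a", "b", "c", "d"],
    "label": [1, 1, 1, 1],
})

# Number of judged passages for each query.
per_query = toy_qrels.groupby("qid").size()

# How many queries have 1, 2, ... judged passages.
histogram = per_query.value_counts().sort_index()
print(histogram)
```

Running the same two lines on the real `qrels` DataFrame should show how many dev queries have more than one judged passage.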


In [None]:
print(f"Corpus: {len(corpus)} documents")
print(f"Topics: {len(topics)} queries")
print(f"Qrels: {len(qrels)} judgments")

# For quick exploration, take tiny samples
print("\n--- Tiny samples for structure exploration ---")
print("\nCorpus sample (first 3):")
print(corpus.head(3))
print("\nTopics sample (first 3):")
print(topics.head(3))
print("\nQrels sample (first 3):")
print(qrels.head(3))


In [16]:
# Load TRAINING data
print("="*60)
print("TRAINING DATA")
print("="*60)

topics_train = dataset.get_topics(variant='train')
# Convert to DataFrame if it's not already
if not isinstance(topics_train, pd.DataFrame):
    topics_train = pd.DataFrame(topics_train)
print(f"Training topics: {len(topics_train)} queries")
topics_train.head()


TRAINING DATA
Downloading msmarco_passage tars to /root/.pyterrier/corpora/msmarco_passage/queries.tar.gz



Training topics: 808731 queries


Unnamed: 0,qid,query
0,121352,define extreme
1,634306,what does chattel mean on credit history
2,920825,what was the great leap forward brainly
3,510633,tattoo fixers how much does it cost
4,737889,what is decentralization process.


In [17]:
qrels_train = dataset.get_qrels(variant='train')
# Convert to DataFrame if it's not already
if not isinstance(qrels_train, pd.DataFrame):
    qrels_train = pd.DataFrame(qrels_train)
print(f"Training qrels: {len(qrels_train)} judgments")
qrels_train.head()


Downloading msmarco_passage qrels to /root/.pyterrier/corpora/msmarco_passage/qrels.train.tsv



Training qrels: 532761 judgments


Unnamed: 0,qid,docno,label
0,1185869,0,1
1,1185868,16,1
2,597651,49,1
3,403613,60,1
4,1183785,389,1


In [18]:
# Compare training vs dev
print("\n" + "="*60)
print("COMPARISON: Training vs Dev")
print("="*60)
print(f"Training - Topics: {len(topics_train)}, Qrels: {len(qrels_train)}")
print(f"Dev (small) - Topics: {len(topics)}, Qrels: {len(qrels)}")
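print("\nTraining qrels label distribution:")
print(qrels_train['label'].value_counts())
# Note: the denominator is every training topic, including topics that have no qrels,
# so this figure understates the average for queries that actually have judgments.
print(f"\nAverage relevant docs per query (train): {len(qrels_train[qrels_train['label']==1]) / len(topics_train):.2f}")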



COMPARISON: Training vs Dev
Training - Topics: 808731, Qrels: 532761
Dev (small) - Topics: 6980, Qrels: 7437

Training qrels label distribution:
label
1    532761
Name: count, dtype: int64

Average relevant docs per query (train): 0.66
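
The 0.66 figure divides 532761 judgments by all 808731 training topics, so at least some topics have no judgment at all. The following is a minimal sketch of a per-judged-query average that also reports the share of topics without a relevant judgment. The helper name `relevance_summary` and the toy rows are hypothetical; only the column names come from this notebook.

```python
import pandas as pd

def relevance_summary(topics_df, qrels_df):
    """Return (average relevant passages per judged query,
    share of topics with no relevant judgment)."""
    relevant = qrels_df[qrels_df["label"] >= 1]
    judged_qids = set(relevant["qid"])
    # Divide by queries that have at least one relevant judgment, not by all topics.
    avg_per_judged = len(relevant) / len(judged_qids) if judged_qids else 0.0
    unjudged_share = 1.0 - topics_df["qid"].isin(judged_qids).mean()
    return avg_per_judged, unjudged_share

# Hypothetical toy data: four topics, only two of which have judgments.
toy_topics = pd.DataFrame({"qid": ["1", "2", "3", "4"], "query": ["a", "b", "c", "d"]})
toy_qrels = pd.DataFrame({"qid": ["1", "1", "2"], "docno": ["x", "y", "z"], "label": [1, 1, 1]})

avg, unjudged = relevance_summary(toy_topics, toy_qrels)
print(f"Average relevant passages per judged query: {avg:.2f}")
print(f"Share of topics without a relevant judgment: {unjudged:.2f}")
```

Calling `relevance_summary(topics_train, qrels_train)` would apply the same calculation to the training split, assuming both tables store qid with the same dtype.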
