## CSI4107 Assignment 2 - Information Retrieval System
- Joseph Champeau, 300170535
- Samuel Pierre-Louis, 300211427
- Yubo Zhu, 300207231



Our project runs a two-stage neural IR pipeline on the SciFact dataset and then evaluates the output:

1. **TAS-B**: dense retrieval to get the top-100 candidate documents per query.
2. **MiniLM cross-encoder**: reranks the candidates and keeps the top-10 results.
3. **Evaluation**: **MAP** and **P@10** on the reranked results.

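# Install dependencies
pip install sentence-transformers transformers torch tqdm pytrec_eval

`json`, `time`, and `collections` ship with Python and do not need to be installed with pip.

TODO: this list covers the imports used in this notebook; please add any dependency that is still missing.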
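
To check the dependency list against the current environment, each import can be probed before running the notebook. This is only a sketch and not part of the pipeline: `missing_modules` is a hypothetical helper, and the module names are taken from the imports used in the cells of this notebook.

```python
import importlib.util

# Top-level modules imported by this notebook (import names, not pip package names).
REQUIRED = ["torch", "transformers", "sentence_transformers", "tqdm", "json"]

def missing_modules(names):
    """Return the subset of `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
print("Missing modules:", missing if missing else "none")
```

Anything reported as missing can then be added to the `pip install` line.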




In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.nn.functional import softmax
from sentence_transformers import SentenceTransformer, util
import torch, json
import time
from tqdm import tqdm
from collections import defaultdict

## TAS-B: Encode Corpus and Queries

Load TAS-B model and encode the documents.

**Currently takes about 30 minutes**

In [None]:
# Load a publicly available TAS-B model
model = SentenceTransformer('msmarco-distilbert-base-tas-b')


# Load corpus
with open('scifact/corpus.jsonl', encoding='utf-8') as f:
    docs = [json.loads(line) for line in f]
doc_texts = [doc['text'] for doc in docs]
doc_ids = [doc['_id'] for doc in docs]

# Encode documents
doc_embeddings = model.encode(doc_texts, convert_to_tensor=True, show_progress_bar=True)
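
Because encoding is slow, it may be worth caching the result so that this step only runs once. The following is a minimal sketch, not the notebook's actual code: `load_or_compute` and the cache file name are hypothetical, and it uses `pickle` from the standard library for brevity (for tensors, `torch.save` / `torch.load` would be the more usual choice).

```python
import os
import pickle

def load_or_compute(cache_path, compute_fn):
    """Return the cached object at `cache_path`, computing and saving it on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

# Hypothetical usage in this notebook (the file name is an assumption):
# doc_embeddings = load_or_compute(
#     "doc_embeddings.pkl",
#     lambda: model.encode(doc_texts, convert_to_tensor=True, show_progress_bar=True),
# )
```

The cache file has to be deleted by hand whenever the corpus or the encoder changes, since the sketch does not detect stale entries.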


Encodes the queries

In [None]:
# Load queries
with open('scifact/queries.jsonl', encoding='utf-8') as f:
    queries = [json.loads(line) for line in f]
query_texts = [q['text'] for q in queries]
query_ids = [q['_id'] for q in queries]

# Encode queries
query_embeddings = model.encode(query_texts, convert_to_tensor=True)




Gets the top-k results

In [None]:
# Get top-100 candidate docs per query
tasb_candidates = {}
k = 100 # Change this to get top-k results

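for qid, q_emb in zip(query_ids, query_embeddings):
    # Similarity of this query against every document embedding
    scores = util.cos_sim(q_emb, doc_embeddings)[0]
    top_results = torch.topk(scores, min(k, len(doc_ids)))
    # Convert tensor indices to plain ints before indexing the Python lists
    tasb_candidates[qid] = [(doc_ids[i], doc_texts[i]) for i in top_results.indices.tolist()]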
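
One thing to double-check in this step is the scoring function. To our knowledge, the `msmarco-distilbert-base-tas-b` checkpoint is listed as tuned for dot-product scoring, for which sentence-transformers provides `util.dot_score`, while the retrieval loop here uses cosine similarity. The toy example below (plain Python, made-up 2-D vectors, hypothetical helper names) only illustrates that the two scores can rank the same documents differently. Whether it matters on SciFact would need to be measured.

```python
import heapq
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product of the length-normalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query, docs, score_fn, k):
    """Return the ids of the k highest-scoring docs, best first."""
    return heapq.nlargest(k, docs, key=lambda doc_id: score_fn(query, docs[doc_id]))

# Made-up embeddings: d1 points exactly in the query direction but is short,
# d2 is slightly off-direction but has a much larger norm.
query = [1.0, 0.0]
docs = {"d1": [1.0, 0.0], "d2": [3.0, 1.0]}

print(top_k(query, docs, cosine, k=1))  # d1 wins on direction
print(top_k(query, docs, dot, k=1))     # d2 wins on magnitude
```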


## Stage 2: Reranking with MiniLM

Loads the MiniLM cross-encoder and reranks the TAS-B candidates

**Currently takes 45 minutes**

In [None]:
# Load MiniLM-L-4-v2 model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-4-v2")
model = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-4-v2").to('cpu')
model.eval()

# Settings
batch_size = 64
rerank_top_k = 10  # number of reranked docs to keep per query

# Prepare (query, doc) pairs
print("Preparing (query, document) pairs for reranking...")

all_pairs = []
pair_lookup = []

for q in queries:
    qid = q.get("_id")
    qtext = q.get("text")
    candidates = tasb_candidates.get(qid, [])
    for docid, doc_text in candidates:
        all_pairs.append((qtext, doc_text))
        pair_lookup.append((qid, docid))

# Run batched reranking with progress bar
print(f"Starting reranking of {len(all_pairs)} pairs in batches of {batch_size}...\n")
start_time = time.time()
scores = []

for i in tqdm(range(0, len(all_pairs), batch_size), desc="Reranking", ncols=80):
    batch = all_pairs[i:i+batch_size]
    q_texts, d_texts = zip(*batch)

    inputs = tokenizer(
        list(q_texts),
        list(d_texts),
        padding=True,
        truncation=True,
        return_tensors="pt"
    ).to('cpu')

    with torch.no_grad():
        logits = model(**inputs).logits
        # squeeze(-1) keeps the batch dimension, so a final batch of size 1 still yields a list
        batch_scores = logits[:, 0] if logits.shape[-1] > 1 else logits.squeeze(-1)
        scores.extend(batch_scores.cpu().tolist())

end_time = time.time()
print(f"\n✅ Reranking completed in {end_time - start_time:.2f} seconds.")

# Group scores by query
minilm_results = defaultdict(list)
for (qid, docid), score in zip(pair_lookup, scores):
    minilm_results[qid].append((docid, score))

# Keep top-k docs per query
minilm_results = {
    qid: dict(sorted(docs, key=lambda x: x[1], reverse=True)[:rerank_top_k])
    for qid, docs in minilm_results.items()
}

# Save results to JSON
with open("reranked_results.json", "w") as f:
    json.dump(minilm_results, f, indent=2)

print("📝 Results saved to reranked_results.json")




Preparing (query, document) pairs for reranking...
Starting reranking of 33270 pairs in batches of 64...



Reranking: 100%|██████████████████████████████| 520/520 [46:30<00:00,  5.37s/it]


✅ Reranking completed in 2790.53 seconds.
📝 Results saved to minilm_reranked_results.json
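
Much of the reranking time is likely spent in the model's forward pass on CPU. One common trick, which we have not benchmarked here, is to batch pairs of similar length together so that `padding=True` adds fewer pad tokens per batch. The sketch below shows only the reordering logic: `score_in_length_order` is a hypothetical helper, character length is used as a cheap proxy for token count, and `fake_score_batch` is a stand-in for the tokenizer and model call.

```python
def score_in_length_order(pairs, score_batch, batch_size):
    """Score pairs in batches of similar length, returning scores in the original order."""
    # Indices sorted by combined character length (a rough proxy for token count).
    order = sorted(range(len(pairs)), key=lambda i: len(pairs[i][0]) + len(pairs[i][1]))
    scores = [None] * len(pairs)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        batch_scores = score_batch([pairs[i] for i in batch_idx])
        # Write each score back to the position of its original pair.
        for i, s in zip(batch_idx, batch_scores):
            scores[i] = s
    return scores

# Stand-in scorer for illustration only; in the notebook this would wrap the
# tokenizer + model call from the reranking loop.
def fake_score_batch(batch):
    return [float(len(q) + len(d)) for q, d in batch]

pairs = [("q", "a long document text"), ("q", "short"), ("query two", "mid length doc")]
print(score_in_length_order(pairs, fake_score_batch, batch_size=2))
```

Because the scores come back in the original order, `pair_lookup` would still line up with them.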





## Evaluation: MAP and P@10
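
For reference, these are the metrics as the evaluation cell computes them. The notation is ours: $\mathrm{rel}_q(k)$ is 1 if the document at rank $k$ for query $q$ is relevant and 0 otherwise, $R_q$ is the set of relevant documents for $q$, $n$ is the number of retrieved documents (at most 10 here), and $Q$ is the set of evaluated queries.

```latex
\begin{align}
\mathrm{P@10}(q) &= \frac{1}{10}\sum_{k=1}^{10} \mathrm{rel}_q(k) \\
\mathrm{AP}(q)   &= \frac{1}{|R_q|}\sum_{k=1}^{n} \mathrm{rel}_q(k)\,\frac{\sum_{j=1}^{k}\mathrm{rel}_q(j)}{k} \\
\mathrm{MAP}     &= \frac{1}{|Q|}\sum_{q \in Q} \mathrm{AP}(q)
\end{align}
```

Note that AP is summed only over the retrieved list but normalised by all relevant documents, so a relevant document ranked below the cut-off lowers AP.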

In [2]:
import json
from collections import defaultdict

# Load qrels (the first line is a header); keep only positively judged docs
qrels = defaultdict(set)
with open('scifact/qrels/test.tsv') as f:
    for line in f.readlines()[1:]:
        if not line.strip():
            continue  # skip blank lines
        qid, docid, label = line.strip().split()
        if int(label) > 0:
            qrels[qid].add(docid)

# After a kernel restart, reload the rankings saved by the reranking cell
if 'minilm_results' not in globals():
    with open("reranked_results.json") as f:
        minilm_results = json.load(f)

# Evaluate MAP and P@10
average_precisions = []
precisions_at_10 = []

for qid, retrieved_docs in minilm_results.items():
    # Only evaluate queries that have at least one relevant doc in the test qrels
    if qid not in qrels:
        continue

    relevant_docs = qrels[qid]
    # Dict order is the reranked order (highest score first)
    retrieved_doc_ids = list(retrieved_docs.keys())

    # Calculate Precision@10
    top_10 = retrieved_doc_ids[:10]
    relevant_at_10 = sum(1 for docid in top_10 if docid in relevant_docs)
    precisions_at_10.append(relevant_at_10 / 10)

    # Calculate Average Precision over the retrieved list
    num_hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(retrieved_doc_ids, start=1):
        if docid in relevant_docs:
            num_hits += 1
            precision_sum += num_hits / rank
    if len(relevant_docs) > 0:
        average_precisions.append(precision_sum / len(relevant_docs))

# Final metrics
map_score = sum(average_precisions) / len(average_precisions) if average_precisions else 0.0
p10_score = sum(precisions_at_10) / len(precisions_at_10) if precisions_at_10 else 0.0

print("TAS-B + MiniLM-L-4-v2 Evaluation Results:")
print(f"MAP:  {map_score:.4f}")
print(f"P@10: {p10_score:.4f}")

First run I got:

MAP:  0.5867
P@10: 0.0820

MAP: 0.5+ is a good score
P@10: around 0.75+ is a good score

Something looks wrong with our P@10 value. Possible reasons, from most to least likely:
1. A bug in the result formatting (we currently don't capture the corpus titles).
2. Relevant docs exist, but are ranked deeper than the top 10.
3. SciFact has few positive docs per query (we should exclude queries that don't have a single relevant doc).
4. Low recall with TAS-B.
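
Reason 3 can be checked directly from the qrels. A query with only R relevant documents can never score above min(R, 10) / 10 on P@10, so if most queries have a single relevant document, the mean P@10 is capped near 0.1 regardless of ranking quality. The sketch below computes that ceiling. `p_at_k_ceiling` is a hypothetical helper, and the toy `toy_qrels` dict only mimics the structure built in the evaluation cell.

```python
def p_at_k_ceiling(qrels, k=10):
    """Best achievable mean P@k given the number of relevant docs per query."""
    per_query = [min(len(rel), k) / k for rel in qrels.values() if rel]
    return sum(per_query) / len(per_query) if per_query else 0.0

# Toy data in the same shape as the notebook's `qrels`
# (query id -> set of relevant doc ids); the values are made up.
toy_qrels = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d3", "d4"}}
print(f"P@10 ceiling: {p_at_k_ceiling(toy_qrels):.4f}")
```

Running `p_at_k_ceiling(qrels)` on the real test qrels and comparing it with the measured 0.0820 would show how much headroom is actually left.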