## CSI4107 Assignment 2 - Information Retrieval System
- Joseph Champeau, 300170535
- Samuel Pierre-Louis, 300211427
- Yubo Zhu, 300207231



Our project uses a two-stage neural IR pipeline on the SciFact dataset and compares two neural rerankers:

1. **TF-IDF** (from Assignment 1): Sparse bag-of-words retrieval that returns the top-100 candidate documents per query.
2. **TAS-B**: The first dense reranker, a model trained to produce embeddings for information retrieval. It is a bi-encoder (the query and the documents are encoded separately, then their embeddings are compared).
3. **MiniLM cross-encoder**: The second neural reranker, based on a small transformer language model. It is a cross-encoder (each query-document pair is fed to the model together, and the model returns a relevance score).
4. Evaluate both runs using **MAP** and **P@10**.

In [None]:
%%capture

# Install Dependencies
%pip install -U sentence-transformers
%pip install -U pytrec_eval
%pip install -U torch
%pip install -U nltk
%pip install -U tf-keras
%pip install -U tqdm

In [None]:
import assignment1_code as a1

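import random
import json
import math
import itertools  # imported under its own name so the built-in iter() is not shadowed
from collections import defaultdict  # used when grouping scores and loading qrels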

from tqdm import tqdm # For pretty printing a progress bar

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.nn.functional import softmax
from sentence_transformers import SentenceTransformer, util
import torch
import time

## Step 0: Reading the Corpus and Queries

Read the corpus and transform it into an array of (id, text) pairs, then read the queries into parallel lists of ids and texts.

In [4]:
USE_TITLE_ONLY = False

# Read documents
documents = []
with open('./scifact/corpus.jsonl', 'r') as f:
    for line in f:
        doc = json.loads(line)
        doc_id = int(doc['_id'])
        if USE_TITLE_ONLY:
            doc_text = doc['title']
        else:
            doc_text = doc['title'] + ' ' + doc['text']
        documents.append((doc_id, doc_text))
del f, line, doc, doc_id, doc_text

# Read queries
queries = []
query_ids = []
with open('./scifact/queries.jsonl', 'r') as f:
    for line in f:
        query = json.loads(line)
        query_id = int(query['_id'])
        query_text = query['text']
        query_ids.append(query_id)
        queries.append(query_text)
del f, line, query, query_id, query_text

## Step 1: Prepare the Assignment 1 Code

Prepare the inverted index and run parameters.

We use bag-of-words TF-IDF retrieval to create a shortlist for the rerankers, since the corpus is too large to score every document with the neural models for every query.
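
The shortlist idea can be sketched in a few lines. This is only an illustration under simple assumptions (lowercase whitespace tokenization, log-scaled TF times IDF); it is not the `assignment1_code` implementation, and `build_index`, `shortlist`, and the toy documents are hypothetical names and data.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to {doc_id: term frequency} using a plain lowercase split."""
    index = defaultdict(dict)
    for doc_id, text in docs:
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def shortlist(query, index, doc_count, k=100):
    """Return the k highest-scoring (doc_id, score) pairs by summed TF-IDF weight."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the corpus
        idf = math.log(doc_count / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:k]

# Toy corpus (made up for illustration)
toy_docs = [
    (1, "aspirin reduces fever"),
    (2, "aspirin and heart disease"),
    (3, "gene expression in yeast"),
]
toy_index = build_index(toy_docs)
top = shortlist("aspirin fever", toy_index, doc_count=len(toy_docs), k=2)
```

Only the documents returned by the shortlist step are passed on to the neural rerankers, which keeps the expensive model calls to at most `k` per query.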

In [10]:
# Configuration
RUN_NAME = f"run_{random.randrange(1000000)}"
DOC_COUNT = len(documents)
QUERY_COUNT = len(queries)
BATCH_SIZE = 5
BATCH_COUNT = math.ceil(QUERY_COUNT/BATCH_SIZE)

# Create the inverted index
inverted_index = a1.create_inverted_index(documents)

# Save it to a file
with open('./inverted_index.json', 'w') as f:
    json.dump(inverted_index, f, sort_keys=True)

## Step 2: Prepare TAS-B

TAS-B encodes sentences and paragraphs into dense vector embeddings (one vector per text, not one per word).

In [11]:
# Load a publicly available TAS-B model
tasb_model = SentenceTransformer('msmarco-distilbert-base-tas-b')

# Create a function to encode the documents
def tasb_encode_documents(doc_texts: list[str]):
    return tasb_model.encode(doc_texts, convert_to_tensor=True, normalize_embeddings=True)#, batch_size=128)
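
Because the embeddings are L2-normalized (`normalize_embeddings=True`), cosine similarity reduces to a dot product. The following is a minimal sketch of how a bi-encoder reranks a candidate list once the vectors exist; it uses hand-written toy vectors instead of real TAS-B output, and `l2_normalize` and `rerank_by_dot` are hypothetical helper names.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rerank_by_dot(query_vec, candidates):
    """candidates: list of (doc_id, embedding). Returns (doc_id, score), best first."""
    scored = [
        (doc_id, sum(q * d for q, d in zip(query_vec, emb)))
        for doc_id, emb in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional "embeddings" (made up for illustration)
query_vec = l2_normalize([1.0, 0.0, 1.0])
toy_candidates = [
    (10, l2_normalize([0.0, 1.0, 0.0])),  # orthogonal to the query
    (20, l2_normalize([2.0, 0.0, 2.0])),  # same direction as the query
    (30, l2_normalize([1.0, 1.0, 0.0])),  # partial overlap
]
ranking = rerank_by_dot(query_vec, toy_candidates)
```

The document embeddings do not depend on the query, so in a bi-encoder setup they can be computed once and reused for every query.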

## Step 3: Prepare MiniLM

In [17]:
%%capture

# Load a publicly available MiniLM cross-encoder model
minilm_tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-4-v2")
minilm_model = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-4-v2").to('cpu')
minilm_model.eval()

## Step 4: Evaluate the Queries

Finally, we evaluate the queries in batches (each written to the results file).

This is done by first getting 100 candidates from TF-IDF, then reranking them independently with each model.

In [None]:
# Clear the old results files
with open("./Results_TASB.txt", "w"): pass
with open("./Results_MiniLM.txt", "w"): pass

# Evaluate the queries in batches
query_counter = 0
for batch_index in tqdm(range(0, QUERY_COUNT, BATCH_SIZE), desc="Evaluating queries in batches", ncols=80):
    # Get the current batch of queries
    batch_queries = queries[batch_index:batch_index+BATCH_SIZE]
    
    # Create arrays for the results
    results_tasb = []
    results_minilm = []
    
    # Iterate over the batch, fetch the top-100 candidate documents using TF-IDF
    for result, docs_found in a1.evaluate_queries(DOC_COUNT, inverted_index, batch_queries):
        query_id = query_ids[query_counter]
        query_counter += 1
        
        # Fetch the document ids and scores for the top 100
        top100candidates = list(result)
        
        # Encode and score the documents using the TAS-B model
        # TODO
        results_tasb.append(f"tasb B{batch_index//BATCH_SIZE} Q{query_id} Count {query_counter}")
        
        # Rerank the documents using the MiniLM cross-encoder model
        # TODO
        results_minilm.append(f"minilm B{batch_index//BATCH_SIZE} Q{query_id} Count {query_counter}")
    
    # Append the batch results to the result files
    with open("./Results_TASB.txt", "a") as f:
        f.write("\n".join(results_tasb) + "\n")
        
    with open("./Results_MiniLM.txt", "a") as f:
        f.write("\n".join(results_minilm) + "\n")

    if batch_index//BATCH_SIZE >= 5:
        break # Quit early for testing

Evaluating queries in batches:   2%|▎           | 5/222 [00:21<15:53,  4.39s/it]
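
The two `TODO` steps above still write placeholder strings. As a sketch of what each reranker could eventually append, assuming the result files follow the usual six-column TREC run layout (`query_id Q0 doc_id rank score run_name`), a formatter might look like this. `format_trec_run` is a hypothetical helper, and the document ids and scores are made up.

```python
def format_trec_run(query_id, scored_docs, run_name, top_k=100):
    """Return TREC-style result lines for one query, best score first."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:top_k]
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_name}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]

# Toy reranker output for one query: (doc_id, score) pairs in arbitrary order
lines = format_trec_run(1, [(101, 0.42), (202, 0.87)], run_name="run_demo")
```

Inside the batch loop, the returned lines could be added to `results_tasb` or `results_minilm` with `extend` in place of the placeholder `append` calls.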


---

Encodes the queries

In [None]:
# Load queries (forward slashes avoid the invalid "\q" escape and work on every OS)
with open('scifact/queries.jsonl') as f:
    queries = [json.loads(line) for line in f]
query_texts = [q['text'] for q in queries]
query_ids = [q['_id'] for q in queries]

# Encode queries with the TAS-B bi-encoder loaded in Step 2
query_embeddings = tasb_model.encode(query_texts, convert_to_tensor=True)



Gets the top-k results

In [None]:
# Get top-100 candidate docs per query
tasb_candidates = {}
k = 100 # Change this to get top-k results

for qid, q_emb in zip(query_ids, query_embeddings):
    scores = util.cos_sim(q_emb, doc_embeddings)[0]
    top_results = torch.topk(scores, k)
    tasb_candidates[qid] = [(doc_ids[i], doc_texts[i]) for i in top_results.indices]


## Stage 2: Reranking with MiniLM

Loads and Reranks with MiniLM

**Currently takes about 46 minutes on CPU**

In [None]:
# Load MiniLM-L-4-v2 model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-4-v2")
model = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-4-v2").to('cpu')
model.eval()

# Settings
batch_size = 64
rerank_top_k = 10 #change to rerank more values

# Prepare (query, doc) pairs
print("Preparing (query, document) pairs for reranking...")

all_pairs = []
pair_lookup = []

for q in queries:
    qid = q.get("_id")
    qtext = q.get("text")
    candidates = tasb_candidates.get(qid, [])
    for docid, doc_text in candidates:
        all_pairs.append((qtext, doc_text))
        pair_lookup.append((qid, docid))

# Run batched reranking with progress bar
print(f"Starting reranking of {len(all_pairs)} pairs in batches of {batch_size}...\n")
start_time = time.time()
scores = []

for i in tqdm(range(0, len(all_pairs), batch_size), desc="Reranking", ncols=80):
    batch = all_pairs[i:i+batch_size]
    q_texts, d_texts = zip(*batch)

    inputs = tokenizer(
        list(q_texts),
        list(d_texts),
        padding=True,
        truncation=True,
        return_tensors="pt"
    ).to('cpu')

    with torch.no_grad():
        logits = model(**inputs).logits
        # squeeze(-1) keeps the batch dimension, so a final batch of one pair still yields a list
        batch_scores = logits[:, 0] if logits.shape[-1] > 1 else logits.squeeze(-1)
        scores.extend(batch_scores.cpu().tolist())

end_time = time.time()
print(f"\n✅ Reranking completed in {end_time - start_time:.2f} seconds.")

# Group scores by query
minilm_results = defaultdict(list)
for (qid, docid), score in zip(pair_lookup, scores):
    minilm_results[qid].append((docid, score))

# Keep top-k docs per query
minilm_results = {
    qid: dict(sorted(docs, key=lambda x: x[1], reverse=True)[:rerank_top_k])
    for qid, docs in minilm_results.items()
}

# Save results to JSON
with open("reranked_results.json", "w") as f:
    json.dump(minilm_results, f, indent=2)

print("📝 Results saved to reranked_results.json")




Preparing (query, document) pairs for reranking...
Starting reranking of 33270 pairs in batches of 64...



Reranking: 100%|██████████████████████████████| 520/520 [46:30<00:00,  5.37s/it]


✅ Reranking completed in 2790.53 seconds.
📝 Results saved to minilm_reranked_results.json





## Evaluation: MAP and P@10

In [2]:
# Load qrels
qrels = defaultdict(set)
with open('scifact/qrels/test.tsv') as f:
    for line in f.readlines()[1:]:
        qid, docid, label = line.strip().split()
        if int(label) > 0:
            qrels[qid].add(docid)

# Evaluate MAP and P@10
average_precisions = []
precisions_at_10 = []

for qid, retrieved_docs in minilm_results.items():
    if qid not in qrels:
        continue

    relevant_docs = qrels[qid]
    retrieved_doc_ids = list(retrieved_docs.keys())

    # Calculate Precision@10
    top_10 = retrieved_doc_ids[:10]
    relevant_at_10 = sum([1 for docid in top_10 if docid in relevant_docs])
    precisions_at_10.append(relevant_at_10 / 10)

    # Calculate Average Precision
    num_hits = 0
    precision_sum = 0
    for rank, docid in enumerate(retrieved_doc_ids):
        if docid in relevant_docs:
            num_hits += 1
            precision_sum += num_hits / (rank + 1)
    if len(relevant_docs) > 0:
        average_precisions.append(precision_sum / len(relevant_docs))

# Final metrics
map_score = sum(average_precisions) / len(average_precisions) if average_precisions else 0.0
p10_score = sum(precisions_at_10) / len(precisions_at_10) if precisions_at_10 else 0.0

print("TAS-B + MiniLM-L-4-v2 Evaluation Results:")
print(f"MAP:  {map_score:.4f}")
print(f"P@10: {p10_score:.4f}")


First run I got:

MAP:  0.5867
P@10: 0.0820

As rough rules of thumb, a MAP of 0.5 or higher is a good score, and we expected a good P@10 to be around 0.75 or higher.

Our P@10 value looks far too low. Possible reasons (most to least likely):
1. A bug in result formatting (we currently aren't capturing the titles for the corpus).
2. Relevant docs exist, but are ranked deeper than the top 10.
3. SciFact has few positive documents per query, which caps P@10 (we should exclude queries that don't have a single relevant doc).
4. Low recall with TAS-B.
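
One way to check reason 3 is to compute the best P@10 any ranking could reach given the relevance judgments. The sketch below assumes a `qrels` mapping of query id to a set of relevant doc ids, as built in the evaluation cell; `max_precision_at_k` is a hypothetical helper and the toy judgments are made up.

```python
def max_precision_at_k(qrels, k=10):
    """Mean P@k of a perfect ranking, over queries with at least one relevant doc."""
    per_query = [min(len(rel), k) / k for rel in qrels.values() if rel]
    return sum(per_query) / len(per_query) if per_query else 0.0

# Toy judgments: most queries have a single relevant document
toy_qrels = {"q1": {"d1"}, "q2": {"d2", "d3"}, "q3": {"d4"}}
ceiling = max_precision_at_k(toy_qrels, k=10)
```

If most queries have only one or two relevant documents, this ceiling sits far below 0.75, and the measured P@10 should be compared against it before suspecting a bug.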