This assignment aims to introduce students to working with the BEIR dataset for information retrieval tasks. Students will:

- Understand the structure of the BEIR dataset and preprocess the data.
- Implement a system to encode queries and documents using embeddings.
- Calculate similarity scores to rank documents based on relevance.
- Evaluate the system's performance using metrics like Mean Average Precision (MAP).
- Modify and fine-tune models for better retrieval results.

Start by exploring the dataset and understanding its structure.
This will show you how the dataset is organized, which will be useful in the later stages of the assignment.

https://huggingface.co/datasets/BeIR/nfcorpus

https://huggingface.co/datasets/BeIR/nfcorpus-qrels
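
Before loading the real data, it can help to see the shape of the records on a toy example. The sketch below assumes the field names used later in this notebook (`_id` and `text` for queries and documents, `query-id` and `corpus-id` for qrels). The `title` and `score` fields, the helper name `build_relevance_map`, and all sample values are illustrative assumptions, so check them against the dataset viewer at the links above.

```python
from collections import defaultdict

# Toy records mimicking the assumed layout (all values are made up for illustration).
corpus = [
    {"_id": "MED-1", "title": "Doc one", "text": "Statins and breast cancer survival."},
    {"_id": "MED-2", "title": "Doc two", "text": "Dietary fiber and cholesterol."},
]
queries = [{"_id": "PLAIN-1", "text": "cholesterol and cancer"}]
qrels = [
    {"query-id": "PLAIN-1", "corpus-id": "MED-1", "score": 2},
    {"query-id": "PLAIN-1", "corpus-id": "MED-2", "score": 1},
]

def build_relevance_map(qrel_rows):
    """Group the relevant document IDs under each query ID (hypothetical helper)."""
    relevance = defaultdict(list)
    for row in qrel_rows:
        relevance[row["query-id"]].append(row["corpus-id"])
    return dict(relevance)

query_id_to_relevant_doc_ids = build_relevance_map(qrels)
print(query_id_to_relevant_doc_ids)
```

The same query-to-documents grouping is what `load_data` builds for the train, validation, and test splits.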

In [1]:
# This assignment consists of two key tasks: Ranking Documents and Fine-Tuning the Sentence Transformer Model.
# Students will be graded on their implementation and their written report.

# State the team/individual contributions in the report.

In [2]:
# Ranking Documents Report (10 Points)

# Students must analyze which encoding methods performed best for document ranking.

# What to include in your report:
    
# Comparison of Encoding Methods 

    # Compare GloVe embeddings vs. Sentence Transformer embeddings.
    # Which method ranked documents better?
    # Did the top-ranked documents make sense?
    # How does cosine similarity behave with different embeddings?

# Observations on Cosine Similarity & Ranking 

    # Did the ranking appear meaningful?
    # Were there cases where documents that should be highly ranked were not?
    # What are possible explanations for incorrect rankings?

# Possible Improvements

    # What can be done to improve document ranking?
    # Would a different distance metric (e.g., Euclidean, Manhattan) help?
    # Would preprocessing the queries or documents (e.g., removing stopwords) improve ranking?


# Fine-Tuning Report (15 Points)

# After fine-tuning, students must compare different training approaches and reflect on their findings.

# What to include in your report:
    
# Comparison of Different Training Strategies 

    # [anchor, positive] vs [anchor, positive, negative].
    # Which approach seemed to improve ranking?
    # How did the model behave differently?

# Impact on MAP Score 

    # Did fine-tuning improve or hurt the Mean Average Precision (MAP) score?
    # If MAP decreased, why might that be?
    # Is fine-tuning always necessary for retrieval models?

# Observations on Training Loss & Learning Rate 

    # Did the loss converge?
    # Was the learning rate too high or too low?
    # How did freezing/unfreezing layers impact training?

# Future Improvements 

    # Would training with more negatives help?
    # Would changing the loss function (e.g., using Softmax Loss) improve performance?
    # Could increasing the number of epochs lead to a better model?
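
The report question about alternative distance metrics can be explored on toy vectors before touching the real embeddings. The sketch below is only an illustration under assumed two-dimensional vectors. The function names (`cosine_sim`, `euclidean_distance`, `manhattan_distance`, `rank`) and the document labels are hypothetical and not part of the assignment code.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between u and v (higher means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance (lower means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (lower means more similar)."""
    return sum(abs(a - b) for a, b in zip(u, v))

query = [1.0, 1.0]
docs = {
    "same_direction_long": [10.0, 10.0],  # same direction as the query, much larger norm
    "nearby_off_direction": [1.0, 0.0],   # close in space, but pointing elsewhere
}

def rank(score_fn, higher_is_better):
    """Order document names from best to worst under the given metric."""
    scores = {name: score_fn(query, vec) for name, vec in docs.items()}
    return sorted(scores, key=scores.get, reverse=higher_is_better)

cosine_ranking = rank(cosine_sim, higher_is_better=True)
euclidean_ranking = rank(euclidean_distance, higher_is_better=False)
manhattan_ranking = rank(manhattan_distance, higher_is_better=False)
print(cosine_ranking, euclidean_ranking, manhattan_ranking)
```

Cosine similarity ignores vector length, while the two distances do not, so the rankings can disagree when embeddings have different norms. For unit-length vectors, Euclidean distance and cosine similarity produce the same ordering, which is worth checking before expecting a metric change to help.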


In [3]:
!pip install datasets sentence_transformers



In [4]:
# Create an API token from your Hugging Face account and save it somewhere safe (e.g., a text file) for future use.
# You will need to enter it once per session.
from huggingface_hub import login
login()


In [8]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from datasets import load_dataset
import time

class TextSimilarityModel:
    def __init__(self, corpus_name, rel_name, model_name='all-MiniLM-L6-v2', top_k=10):
        """
        Initialize the model with datasets and pre-trained sentence transformer.
        """
        self.model = SentenceTransformer(model_name)
        self.corpus_name = corpus_name
        self.rel_name = rel_name
        self.top_k = top_k
        self.load_data()


    def load_data(self):
        """
        Load and filter datasets based on test queries and documents.
        """
        # Load query and document datasets
        dataset_queries = load_dataset(self.corpus_name, "queries")
        dataset_docs = load_dataset(self.corpus_name, "corpus")

        # Extract queries and documents
        self.queries = dataset_queries["queries"]["text"]
        self.query_ids = dataset_queries["queries"]["_id"]
        self.documents = dataset_docs["corpus"]["text"]
        self.document_ids = dataset_docs["corpus"]["_id"]

        # Filter queries and documents and build relevant queries and documents mapping based on test set
        test_qrels = load_dataset(self.rel_name)["test"]
        self.filtered_test_query_ids = set(test_qrels["query-id"])
        self.filtered_test_doc_ids = set(test_qrels["corpus-id"])

        self.test_queries = [q for qid, q in zip(self.query_ids, self.queries) if qid in self.filtered_test_query_ids]
        self.test_query_ids = [qid for qid in self.query_ids if qid in self.filtered_test_query_ids]
        self.test_documents = [doc for did, doc in zip(self.document_ids, self.documents) if did in self.filtered_test_doc_ids]
        self.test_document_ids = [did for did in self.document_ids if did in self.filtered_test_doc_ids]

        self.test_query_id_to_relevant_doc_ids = {qid: [] for qid in self.test_query_ids}
        for qid, doc_id in zip(test_qrels["query-id"], test_qrels["corpus-id"]):
            if qid in self.test_query_id_to_relevant_doc_ids:
                self.test_query_id_to_relevant_doc_ids[qid].append(doc_id)
                
        ## Code Below this is used for creating the training set 
        # Build query and document id to text mapping
        self.query_id_to_text = {query_id:query for query_id, query in zip(self.query_ids, self.queries)}
        self.document_id_to_text = {document_id:document for document_id, document in zip(self.document_ids, self.documents)}

        # Build relevant queries and documents mapping based on train set
        train_qrels = load_dataset(self.rel_name)["train"]
        self.train_query_id_to_relevant_doc_ids = {qid: [] for qid in train_qrels["query-id"]}

        for qid, doc_id in zip(train_qrels["query-id"], train_qrels["corpus-id"]):
            if qid in self.train_query_id_to_relevant_doc_ids:
                # Append the document ID to the relevant doc mapping
                self.train_query_id_to_relevant_doc_ids[qid].append(doc_id)
        
        # Filter queries and documents and build relevant queries and documents mapping based on validation set  
        #TODO Put your code here. 
        ###########################################################################
        try:
            val_qrels = load_dataset(self.rel_name)['validation']
            self.filtered_val_query_ids = set(val_qrels['query-id'])
            self.filtered_val_doc_ids = set(val_qrels['corpus-id'])

            val_query_pairs = [
                (qid, query) for qid, query in zip(self.query_ids, self.queries) 
                if qid in self.filtered_val_query_ids
            ]
            self.val_query_ids = [pair[0] for pair in val_query_pairs]
            self.val_queries = [pair[1] for pair in val_query_pairs]

            val_doc_pairs = [
                (did, doc) for did, doc in zip(self.document_ids, self.documents) 
                if did in self.filtered_val_doc_ids
            ]
            self.val_document_ids = [pair[0] for pair in val_doc_pairs]
            self.val_documents = [pair[1] for pair in val_doc_pairs]

            self.val_query_id_to_relevant_doc_ids = {qid: [] for qid in self.val_query_ids}
            for qid, doc_id in zip(val_qrels['query-id'], val_qrels['corpus-id']):
                if qid in self.val_query_id_to_relevant_doc_ids:
                    self.val_query_id_to_relevant_doc_ids[qid].append(doc_id)
        except Exception as e:
            # Report the underlying error instead of assuming the split is simply missing
            print(f'Could not build the validation set ({e}). Skipping validation set creation.')
        ###########################################################################
        

    #Task 1: Encode Queries and Documents (10 Pts)

    def encode_with_glove(self, glove_file_path: str, sentences: list[str]) -> list[np.ndarray]:

        """
        # Inputs:
            - glove_file_path (str): Path to the GloVe embeddings file (e.g., "glove.6B.50d.txt").
            - sentences (list[str]): A list of sentences to encode.

        # Output:
            - list[np.ndarray]: A list of sentence embeddings 
            
        (1) Encodes sentences by averaging GloVe 50d vectors of words in each sentence.
        (2) Return a sequence of embeddings of the sentences.
        Download the glove vectors from here. 
        https://nlp.stanford.edu/data/glove.6B.zip
        Handle unknown words by using zero vectors
        """
        #TODO Put your code here. 
        ###########################################################################
        # Load GloVe embeddings dictionary
        glove_dict = {}
        with open(glove_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split()
                # Skip incomplete lines (should contain one word + 50 dimensions)
                if len(parts) < 51:
                    continue
                word = parts[0]
                vector = np.array(parts[1:], dtype=float)
                glove_dict[word] = vector

        embedding_dim = len(next(iter(glove_dict.values())))
        embeddings = []

        # Encode each sentence by averaging word vectors
        for sentence in sentences:
            tokens = sentence.split()
            vecs = []
            for token in tokens:
                # Use lower-case tokens to match the GloVe keys
                token_vec = glove_dict.get(token.lower())
                if token_vec is None:
                    token_vec = np.zeros(embedding_dim)
                vecs.append(token_vec)
            if vecs:
                sentence_embedding = np.mean(vecs, axis=0)
            else:
                sentence_embedding = np.zeros(embedding_dim)
            embeddings.append(sentence_embedding)

        return embeddings
        ###########################################################################

    #Task 2: Calculate Cosine Similarity and Rank Documents (20 Pts)
    
    def rank_documents(self, encoding_method: str = 'sentence_transformer') -> None:
        """
         # Inputs:
            - encoding_method (str): The method used for encoding queries/documents. 
                             Options: ['glove', 'sentence_transformer'].

        # Output:
            - None (updates self.query_id_to_ranked_doc_ids with ranked document IDs).
    
        (1) Compute cosine similarity between each document and the query
        (2) Rank documents for each query and save the results in a dictionary "query_id_to_ranked_doc_ids" 
            This will be used in "mean_average_precision"
            Example format {2: [125, 673], 35: [900, 822]}
        """
        if encoding_method == 'glove':
            query_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.queries)
            document_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.documents)
        elif encoding_method == 'sentence_transformer':
            query_embeddings = self.model.encode(self.queries)
            document_embeddings = self.model.encode(self.documents)
        else:
            raise ValueError("Invalid encoding method. Choose 'glove' or 'sentence_transformer'.")
        
        #TODO Put your code here.
        ###########################################################################
        # Map each ID to its position once, so every lookup is a dictionary access
        # instead of a full list.index scan (IDs are assumed to be unique).
        query_index = {qid: i for i, qid in enumerate(self.query_ids)}
        doc_index = {doc_id: i for i, doc_id in enumerate(self.document_ids)}
        test_query_indices = [query_index[qid] for qid in self.test_query_ids]
        test_doc_indices = [doc_index[doc_id] for doc_id in self.test_document_ids]
        
        # Subset the embeddings for test queries and documents using the computed indices.
        test_query_embeddings = [query_embeddings[i] for i in test_query_indices]
        test_document_embeddings = [document_embeddings[i] for i in test_doc_indices]
        
        # Now compute the cosine similarity matrix only on test queries vs test documents.
        sim_matrix = cosine_similarity(test_query_embeddings, test_document_embeddings)

        # Initialize the dictionary for storing ranked document IDs.
        self.query_id_to_ranked_doc_ids = {}
        
        # For each test query, rank the documents based on their similarity scores.
        for i, qid in enumerate(self.test_query_ids):
            sim_scores = sim_matrix[i]
            # Sort document indices by descending similarity.
            ranked_indices = np.argsort(sim_scores)[::-1]
            ranked_doc_ids = [self.test_document_ids[idx] for idx in ranked_indices]
            self.query_id_to_ranked_doc_ids[qid] = ranked_doc_ids
        ###########################################################################

    @staticmethod
    def average_precision(relevant_docs: list[str], candidate_docs: list[str]) -> float:
        """
        # Inputs:
            - relevant_docs (list[str]): A list of document IDs that are relevant to the query.
            - candidate_docs (list[str]): A list of document IDs ranked by the model.

        # Output:
            - float: The average precision score
    
        Compute average precision for a single query.
        """
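        relevant = set(relevant_docs)  # set membership avoids rescanning the list for every candidate
        y_true = [1 if doc_id in relevant else 0 for doc_id in candidate_docs]
        # Precision at each rank where a relevant document appears
        precisions = [np.mean(y_true[:k+1]) for k in range(len(y_true)) if y_true[k]]
        return float(np.mean(precisions)) if precisions else 0.0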

    #Task 3: Evaluate System Performance (10 Pts)
    
    def mean_average_precision(self) -> float:
        """
        # Inputs:
            - None (uses ranked documents stored in self.query_id_to_ranked_doc_ids).

        # Output:
            - float: The MAP score, computed as the mean of all average precision scores.
    
        (1) Compute the average precision of each query using the "average_precision" function.
        (2) Compute the mean of all average precision scores.
        Return the mean average precision score.
        
        reference: https://www.evidentlyai.com/ranking-metrics/mean-average-precision-map
        https://towardsdatascience.com/map-mean-average-precision-might-confuse-you-5956f1bfa9e2
        """
        #TODO Put your code here.
        ###########################################################################
        ap_scores = []
        for qid in self.test_query_ids:
            relevant_docs = self.test_query_id_to_relevant_doc_ids.get(qid, [])
            candidate_docs = self.query_id_to_ranked_doc_ids.get(qid, [])
            ap = self.average_precision(relevant_docs, candidate_docs)
            ap_scores.append(ap)
        return np.mean(ap_scores) if ap_scores else 0.0
        ###########################################################################
    
    #Task 4: Ranking the Top 10 Documents based on Similarity Scores (10 Pts)
   
    def show_ranking_documents(self, example_query: str) -> None:
        
        """
        # Inputs:
            - example_query (str): A query string for which top-ranked documents should be displayed.

        # Output:
            - None (prints the ranked documents along with similarity scores).
        
        (1) Rank documents for the given query by cosine similarity.
        (2) Print the top 10 results along with their similarity scores.
        
        """
        #TODO Put your code here. 
        query_embedding = self.model.encode(example_query)
        document_embeddings = self.model.encode(self.documents)
        ###########################################################################
        # Compute cosine similarity scores between the query and all documents
        sim_scores = cosine_similarity([query_embedding], document_embeddings)[0]

        # Get indices of top K documents based on the similarity scores
        top_k_indices = np.argsort(sim_scores)[::-1][:self.top_k]

        print(f'Top {self.top_k} documents for the query: "{example_query}"')
        for rank, idx in enumerate(top_k_indices, start=1):
            doc_id = self.document_ids[idx]
            score = sim_scores[idx]
            print(f'Rank {rank}: Document ID: {doc_id}, Similarity Score: {score:.4f}')
        ###########################################################################
      
    #Task 5: Fine-tune the sentence transformer model (25 Pts)
    # Students are not graded on achieving a high MAP score. 
    # The key is to show understanding, experimentation, and thoughtful analysis.
    
    def fine_tune_model(self, batch_size: int = 32, num_epochs: int = 3, save_model_path: str = "finetuned_senBERT") -> None:

        """
        Fine-tunes the model using MultipleNegativesRankingLoss.
        (1) Prepare training examples from `self.prepare_training_examples()`
        (2) Experiment with [anchor, positive] vs [anchor, positive, negative]
        (3) Define a loss function (`MultipleNegativesRankingLoss`)
        (4) Freeze all model layers except the final layers
        (5) Train the model with the specified learning rate
        (6) Save the fine-tuned model
        """
        #TODO Put your code here.
        ###########################################################################
        # Import torch at the beginning of the method
        import torch
        from torch.utils.data import DataLoader
        import time
        from datetime import timedelta
        
        # Check device
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {device}")
        
        # Move model to GPU if available
        self.model = self.model.to(device)
        
        # Prepare training examples
        train_examples = self.prepare_training_examples()
        print(f"Number of training examples: {len(train_examples)}")
        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
        
        # Define the loss function
        train_loss = losses.MultipleNegativesRankingLoss(self.model)
        
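        # Print parameter counts. Note: this runs before the freezing step below,
        # so both numbers are equal here and do not yet reflect the frozen layers.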
        total_params = sum(p.numel() for p in self.model.parameters())
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        print(f"Total parameters: {total_params:,}")
        print(f"Trainable parameters: {trainable_params:,}")
        
        # Freeze all layers except the final layer
        for name, param in self.model.named_parameters():
            # Only unfreeze the final transformer layer (layer.5 for all-MiniLM-L6-v2)
            if 'layer.5' in name:  # all-MiniLM-L6-v2 has 6 layers (0-5)
                param.requires_grad = True
                print(f"Unfreezing: {name}")
            else:
                param.requires_grad = False

        # Print trainable parameters to verify
        print("\nTrainable parameters:")
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                print(f"- {name}")
                
        # Training loop with timing
        start_time = time.time()
        print("\nStarting training...")
        
        # Fine-tune the model with warmup
        warmup_steps = int(len(train_dataloader) * 0.1)  # 10% of the steps in one epoch
        self.model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=num_epochs,
            warmup_steps=warmup_steps,
            show_progress_bar=True,
            output_path=save_model_path,
            checkpoint_path=f"{save_model_path}_checkpoint",
            checkpoint_save_steps=len(train_dataloader),
            # Note: the callback is tied to evaluation; no evaluator is passed here, so it likely never fires
            callback=lambda score, epoch, steps: print(f"\nEpoch {epoch}: Score {score:.4f}")
        )
        
        # Calculate and print training time
        training_time = time.time() - start_time
        print(f"\nTraining completed in {str(timedelta(seconds=int(training_time)))}")
        
        # Save the model
        print(f"Saving model to {save_model_path}")
        self.model.save(save_model_path)
        print("Model saved successfully!")
        ###########################################################################

    # Take a careful look into how the training set is created
    def prepare_training_examples(self) -> list[InputExample]:

        """
        Prepares training examples from the training data.
        # Inputs:
            - None (uses self.train_query_id_to_relevant_doc_ids to create training pairs).

        # Output:
            - list[InputExample]: A list of training samples containing [anchor, positive] or [anchor, positive, negative].

        """
        train_examples = []
        import random
        from datetime import timedelta
        from tqdm import tqdm
        
        print("\nPreparing training examples...")
        total_queries = len(self.train_query_id_to_relevant_doc_ids)
        print(f"Total queries to process: {total_queries}")
        
        # Count total examples that will be created
        total_examples = sum(len(doc_ids) for doc_ids in self.train_query_id_to_relevant_doc_ids.values())
        print(f"Expected total training examples: {total_examples}")
        
        start_time = time.time()
        
        # Create progress bar
        pbar = tqdm(self.train_query_id_to_relevant_doc_ids.items(), 
                    total=total_queries,
                    desc="Processing queries")
        
        # Build the full document-ID set once instead of once per query
        all_document_ids = set(self.document_ids)

        for qid, doc_ids in pbar:
            anchor = self.query_id_to_text[qid]
            # Negative candidates are all documents not marked relevant to this query
            negative_candidates = list(all_document_ids - set(doc_ids))
            # Update the progress bar once per query rather than once per document
            pbar.set_description(f"Query {qid}: {len(doc_ids)} docs")

            for doc_id in doc_ids:
                positive = self.document_id_to_text[doc_id]
                
                # Build texts list without an explicit else branch.
                texts = [anchor, positive]
                if negative_candidates:
                    texts.append(self.document_id_to_text[random.choice(negative_candidates)])
                train_examples.append(InputExample(texts=texts))
        
        elapsed_time = time.time() - start_time
        print(f"\nTraining examples preparation completed in {timedelta(seconds=int(elapsed_time))}")
        print(f"Final number of training examples: {len(train_examples)}")
        
        return train_examples
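
To see why [anchor, positive] and [anchor, positive, negative] inputs behave differently, it may help to write out the idea behind an in-batch negatives ranking loss by hand. The sketch below is a simplified pure-Python illustration and not the library's implementation: for each anchor, every candidate in the batch other than its own positive is treated as a negative, and the loss is the cross-entropy of picking the right one. The function names, the toy vectors, and the `scale=20.0` value are assumptions made for this example.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_negatives_loss(anchors, positives, negatives=None, scale=20.0):
    """Mean cross-entropy where anchor i must select positive i among all candidates."""
    # Candidates are every positive in the batch, followed by any explicit negatives.
    candidates = list(positives) + list(negatives or [])
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidate scores
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-d "embeddings" (made up for illustration).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[0.7, 0.7], [0.6, 0.8]]

loss_pairs = in_batch_negatives_loss(anchors, positives)
loss_triplets = in_batch_negatives_loss(anchors, positives, negatives)
print(f"pairs only: {loss_pairs:.6f}, with explicit negatives: {loss_triplets:.6f}")
```

Adding explicit negatives enlarges the candidate set each anchor has to discriminate against, so the loss values of the two setups are not directly comparable. In this sketch, a model that scores all N candidates equally gets a loss of ln(N), which gives a rough reference point when reading a training-loss curve.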


In [9]:
# Initialize and use the model
model = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")

# Compare the outputs 
print("Ranking with sentence_transformer...")
model.rank_documents(encoding_method='sentence_transformer')
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)

# Compare the outputs 
print("Ranking with glove...")
model.rank_documents(encoding_method='glove')
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)


model.show_ranking_documents("Breast Cancer Cells Feed on Cholesterol")

Ranking with sentence_transformer...
Mean Average Precision: 0.1717425840743641
Ranking with glove...
Mean Average Precision: 0.028223179599728726
Top 10 documents for the query: "Breast Cancer Cells Feed on Cholesterol"
Rank 1: Document ID: MED-2439, Similarity Score: 0.6946
Rank 2: Document ID: MED-2434, Similarity Score: 0.6723
Rank 3: Document ID: MED-2440, Similarity Score: 0.6473
Rank 4: Document ID: MED-2427, Similarity Score: 0.5877
Rank 5: Document ID: MED-2774, Similarity Score: 0.5498
Rank 6: Document ID: MED-838, Similarity Score: 0.5406
Rank 7: Document ID: MED-2430, Similarity Score: 0.5205
Rank 8: Document ID: MED-2102, Similarity Score: 0.5141
Rank 9: Document ID: MED-2437, Similarity Score: 0.5081
Rank 10: Document ID: MED-5066, Similarity Score: 0.5012
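
The MAP values above are easier to interpret with a small worked example of average precision. The sketch below mirrors the definition used in `average_precision` (the mean of the precision values at each rank where a relevant document appears) in plain Python. The document IDs `d1` to `d7` are hypothetical.

```python
def average_precision(relevant_docs, ranked_docs):
    """Mean of precision@k over the ranks k where a relevant document is found."""
    relevant = set(relevant_docs)
    hits = 0
    precisions = []
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

relevant = ["d1", "d2"]

# Relevant documents appear at ranks 2 and 4: (1/2 + 2/4) / 2 = 0.5
ap_mixed = average_precision(relevant, ["d3", "d1", "d7", "d2", "d5"])

# Relevant documents appear at ranks 1 and 2: (1/1 + 2/2) / 2 = 1.0
ap_perfect = average_precision(relevant, ["d1", "d2", "d3", "d7", "d5"])

# MAP is the mean of the per-query average precision values.
map_score = (ap_mixed + ap_perfect) / 2
print(ap_mixed, ap_perfect, map_score)
```

Because every precision term is divided by the rank at which the hit occurs, pushing relevant documents even a few positions down the list lowers the score quickly.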


In [10]:
# Finetune all-MiniLM-L6-v2 sentence transformer model
model.fine_tune_model(batch_size=32, num_epochs=10, save_model_path="finetuned_senBERT_train_v2")  # Adjust batch size and epochs as needed

model.rank_documents()
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)

Using device: cuda

Preparing training examples...
Total queries to process: 2590
Expected total training examples: 110575


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:51<00:00, 50.66it/s]  



Training examples preparation completed in 0:00:51
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed


Step,Training Loss
500,4.0563
1000,3.7222
1500,3.6659
2000,3.644
2500,3.5924
3000,3.5678
3500,3.5488
4000,3.4935
4500,3.4861
5000,3.4661



Training completed in 0:34:20
Saving model to finetuned_senBERT_train_v2
Model saved successfully!
Mean Average Precision: 0.2133299787073085
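
For the report, the three MAP scores printed in this notebook can be put side by side. The sketch below only does arithmetic on the values reported above; the dictionary keys and the `relative_change` helper are hypothetical names. Each score comes from a single run, so treat the differences as indicative rather than conclusive.

```python
# MAP scores copied from the outputs of the cells above.
map_scores = {
    "glove": 0.028223179599728726,
    "sentence_transformer": 0.1717425840743641,
    "fine_tuned": 0.2133299787073085,
}

def relative_change(new, old):
    """Fractional change of `new` relative to `old`."""
    return (new - old) / old

# Use the pre-trained sentence transformer as the reference point.
baseline = map_scores["sentence_transformer"]
for name, score in map_scores.items():
    change = relative_change(score, baseline)
    print(f"{name:22s} MAP={score:.4f}  vs. pre-trained: {change:+.1%}")
```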
