This assignment aims to introduce students to working with the BEIR dataset for information retrieval tasks. Students will:

- Understand the structure of the BEIR dataset and preprocess the data.
- Implement a system to encode queries and documents using embeddings.
- Calculate similarity scores to rank documents based on relevance.
- Evaluate the system's performance using metrics like Mean Average Precision (MAP).
- Modify and fine-tune models for better retrieval results.
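
The encode, score, and rank steps listed above can be sketched on toy data before touching the real corpus. This is a minimal illustration, not the assignment solution: the function name `rank_by_cosine` and the three-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity, plus the scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity of each document to the query
    order = np.argsort(-scores)         # highest score first
    return order, scores

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
query = np.array([1.0, 0.0, 1.0])
docs = np.array([
    [0.9, 0.1, 1.1],   # points almost the same way as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [1.0, 1.0, 0.0],   # partially aligned with the query
])

order, scores = rank_by_cosine(query, docs)
print("Ranked document indices:", order.tolist())
print("Scores:", np.round(scores, 4).tolist())
```

In the assignment itself, the vectors come from GloVe averages or a Sentence Transformer, and the ranking is then scored against the relevance judgments.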

Start by exploring the dataset and its structure.
Knowing how the queries, corpus, and relevance judgments (qrels) are organized will be useful in the later stages of the assignment.

https://huggingface.co/datasets/BeIR/nfcorpus

https://huggingface.co/datasets/BeIR/nfcorpus-qrels
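
The qrels dataset links query IDs to the corpus IDs judged relevant for them. A small sketch of how such rows can be grouped into a query-to-documents mapping is shown here on hand-made rows. The `query-id` and `corpus-id` column names match the ones used later in this notebook; the `score` column, the `min_score` threshold, and the example IDs are assumptions for illustration, so check the dataset page for the actual schema.

```python
from collections import defaultdict

def group_qrels(rows, min_score=1):
    """Group qrels rows into {query_id: [corpus_id, ...]}, keeping rows with score >= min_score."""
    mapping = defaultdict(list)
    for row in rows:
        if row["score"] >= min_score:
            mapping[row["query-id"]].append(row["corpus-id"])
    return dict(mapping)

# Hypothetical rows in the assumed qrels layout (not real dataset entries).
toy_rows = [
    {"query-id": "PLAIN-1", "corpus-id": "MED-10", "score": 2},
    {"query-id": "PLAIN-1", "corpus-id": "MED-11", "score": 1},
    {"query-id": "PLAIN-2", "corpus-id": "MED-10", "score": 0},
    {"query-id": "PLAIN-2", "corpus-id": "MED-12", "score": 1},
]

relevant = group_qrels(toy_rows)
print(relevant)
```

The same document can be relevant to several queries, which is why the mapping is keyed by query and not by document.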

In [1]:
# This assignment consists of two key tasks: ranking documents and fine-tuning the Sentence Transformer model.
# Students will be graded on their implementation and their written report.

# State each team member's (or your individual) contributions in the report.

In [2]:
# Ranking Documents Report (10 Points)

# Students must analyze which encoding methods performed best for document ranking.

# What to include in your report:
    
# Comparison of Encoding Methods 

    # Compare GloVe embeddings vs. Sentence Transformer embeddings.
    # Which method ranked documents better?
    # Did the top-ranked documents make sense?
    # How does cosine similarity behave with different embeddings?

# Observations on Cosine Similarity & Ranking 

    # Did the ranking appear meaningful?
    # Were there cases where documents that should be highly ranked were not?
    # What are possible explanations for incorrect rankings?

# Possible Improvements

    # What can be done to improve document ranking?
    # Would a different distance metric (e.g., Euclidean, Manhattan) help?
    # Would preprocessing the queries or documents (e.g., removing stopwords) improve ranking?


# Fine-Tuning Report (15 Points)

# After fine-tuning, students must compare different training approaches and reflect on their findings.

# What to include in your report:
    
# Comparison of Different Training Strategies 

    # [anchor, positive] vs [anchor, positive, negative].
    # Which approach seemed to improve ranking?
    # How did the model behave differently?

# Impact on MAP Score 

    # Did fine-tuning improve or hurt the Mean Average Precision (MAP) score?
    # If MAP decreased, why might that be?
    # Is fine-tuning always necessary for retrieval models?

# Observations on Training Loss & Learning Rate 

    # Did the loss converge?
    # Was the learning rate too high or too low?
    # How did freezing/unfreezing layers impact training?

# Future Improvements 

    # Would training with more negatives help?
    # Would changing the loss function (e.g., using Softmax Loss) improve performance?
    # Could increasing the number of epochs lead to a better model?


In [3]:
!pip install datasets sentence_transformers



In [3]:
# Create an API token from your Hugging Face account and save it somewhere safe (e.g., a text file) for future use.
# You will need to enter it once per session.
from huggingface_hub import login
login()


In [4]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from datasets import load_dataset
import time

class TextSimilarityModel:
    def __init__(self, corpus_name, rel_name, model_name='all-MiniLM-L6-v2', top_k=10):
        """
        Initialize the model with datasets and pre-trained sentence transformer.
        """
        self.model = SentenceTransformer(model_name)
        self.corpus_name = corpus_name
        self.rel_name = rel_name
        self.top_k = top_k
        self.load_data()


    def load_data(self):
        """
        Load and filter datasets based on test queries and documents.
        """
        # Load query and document datasets
        dataset_queries = load_dataset(self.corpus_name, "queries")
        dataset_docs = load_dataset(self.corpus_name, "corpus")

        # Extract queries and documents
        self.queries = dataset_queries["queries"]["text"]
        self.query_ids = dataset_queries["queries"]["_id"]
        self.documents = dataset_docs["corpus"]["text"]
        self.document_ids = dataset_docs["corpus"]["_id"]

        # Filter queries and documents and build relevant queries and documents mapping based on test set
        test_qrels = load_dataset(self.rel_name)["test"]
        self.filtered_test_query_ids = set(test_qrels["query-id"])
        self.filtered_test_doc_ids = set(test_qrels["corpus-id"])

        self.test_queries = [q for qid, q in zip(self.query_ids, self.queries) if qid in self.filtered_test_query_ids]
        self.test_query_ids = [qid for qid in self.query_ids if qid in self.filtered_test_query_ids]
        self.test_documents = [doc for did, doc in zip(self.document_ids, self.documents) if did in self.filtered_test_doc_ids]
        self.test_document_ids = [did for did in self.document_ids if did in self.filtered_test_doc_ids]

        self.test_query_id_to_relevant_doc_ids = {qid: [] for qid in self.test_query_ids}
        for qid, doc_id in zip(test_qrels["query-id"], test_qrels["corpus-id"]):
            if qid in self.test_query_id_to_relevant_doc_ids:
                self.test_query_id_to_relevant_doc_ids[qid].append(doc_id)
                
        ## Code Below this is used for creating the training set 
        # Build query and document id to text mapping
        self.query_id_to_text = {query_id:query for query_id, query in zip(self.query_ids, self.queries)}
        self.document_id_to_text = {document_id:document for document_id, document in zip(self.document_ids, self.documents)}

        # Build relevant queries and documents mapping based on train set
        train_qrels = load_dataset(self.rel_name)["train"]
        self.train_query_id_to_relevant_doc_ids = {qid: [] for qid in train_qrels["query-id"]}

        for qid, doc_id in zip(train_qrels["query-id"], train_qrels["corpus-id"]):
            if qid in self.train_query_id_to_relevant_doc_ids:
                # Append the document ID to the relevant doc mapping
                self.train_query_id_to_relevant_doc_ids[qid].append(doc_id)
        
        # Filter queries and documents and build relevant queries and documents mapping based on validation set  
        #TODO Put your code here. 
        ###########################################################################
        try:
            val_qrels = load_dataset(self.rel_name)['validation']
            self.filtered_val_query_ids = set(val_qrels['query-id'])
            self.filtered_val_doc_ids = set(val_qrels['corpus-id'])

            val_query_pairs = [
                (qid, query) for qid, query in zip(self.query_ids, self.queries) 
                if qid in self.filtered_val_query_ids
            ]
            self.val_query_ids = [pair[0] for pair in val_query_pairs]
            self.val_queries = [pair[1] for pair in val_query_pairs]

            val_doc_pairs = [
                (did, doc) for did, doc in zip(self.document_ids, self.documents) 
                if did in self.filtered_val_doc_ids
            ]
            self.val_document_ids = [pair[0] for pair in val_doc_pairs]
            self.val_documents = [pair[1] for pair in val_doc_pairs]

            self.val_query_id_to_relevant_doc_ids = {qid: [] for qid in self.val_query_ids}
            for qid, doc_id in zip(val_qrels['query-id'], val_qrels['corpus-id']):
                if qid in self.val_query_id_to_relevant_doc_ids:
                    self.val_query_id_to_relevant_doc_ids[qid].append(doc_id)
        except Exception as e:
            print('No validation split available. Skipping validation set creation.')
        ###########################################################################
        

    #Task 1: Encode Queries and Documents (10 Pts)

    def encode_with_glove(self, glove_file_path: str, sentences: list[str]) -> list[np.ndarray]:

        """
        # Inputs:
            - glove_file_path (str): Path to the GloVe embeddings file (e.g., "glove.6B.50d.txt").
            - sentences (list[str]): A list of sentences to encode.

        # Output:
            - list[np.ndarray]: A list of sentence embeddings 
            
        (1) Encodes sentences by averaging GloVe 50d vectors of words in each sentence.
        (2) Return a sequence of embeddings of the sentences.
        Download the glove vectors from here. 
        https://nlp.stanford.edu/data/glove.6B.zip
        Handle unknown words by using zero vectors
        """
        #TODO Put your code here. 
        ###########################################################################
        # Load GloVe embeddings dictionary
        glove_dict = {}
        with open(glove_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split()
                # Skip incomplete lines (should contain one word + 50 dimensions)
                if len(parts) < 51:
                    continue
                word = parts[0]
                vector = np.array(parts[1:], dtype=float)
                glove_dict[word] = vector

        embedding_dim = len(next(iter(glove_dict.values())))
        embeddings = []

        # Encode each sentence by averaging word vectors
        for sentence in sentences:
            tokens = sentence.split()
            vecs = []
            for token in tokens:
                # Use lower-case tokens to match the GloVe keys
                token_vec = glove_dict.get(token.lower())
                if token_vec is None:
                    token_vec = np.zeros(embedding_dim)
                vecs.append(token_vec)
            if vecs:
                sentence_embedding = np.mean(vecs, axis=0)
            else:
                sentence_embedding = np.zeros(embedding_dim)
            embeddings.append(sentence_embedding)

        return embeddings
        ###########################################################################

    #Task 2: Calculate Cosine Similarity and Rank Documents (20 Pts)
    
    def rank_documents(self, encoding_method: str = 'sentence_transformer') -> None:
        """
         # Inputs:
            - encoding_method (str): The method used for encoding queries/documents. 
                             Options: ['glove', 'sentence_transformer'].

        # Output:
            - None (updates self.query_id_to_ranked_doc_ids with ranked document IDs).
    
        (1) Compute cosine similarity between each document and the query
        (2) Rank documents for each query and save the results in a dictionary "query_id_to_ranked_doc_ids" 
            This will be used in "mean_average_precision"
            Example format {2: [125, 673], 35: [900, 822]}
        """
        if encoding_method == 'glove':
            query_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.queries)
            document_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.documents)
        elif encoding_method == 'sentence_transformer':
            query_embeddings = self.model.encode(self.queries)
            document_embeddings = self.model.encode(self.documents)
        else:
            raise ValueError("Invalid encoding method. Choose 'glove' or 'sentence_transformer'.")
        
        #TODO Put your code here.
        ###########################################################################
        # Map IDs to their positions once; list.index() would rescan the whole list on every lookup.
        query_index = {qid: i for i, qid in enumerate(self.query_ids)}
        doc_index = {doc_id: i for i, doc_id in enumerate(self.document_ids)}
        # Positions of the test queries and test documents in the full lists.
        test_query_indices = [query_index[qid] for qid in self.test_query_ids]
        test_doc_indices = [doc_index[doc_id] for doc_id in self.test_document_ids]
        
        # Subset the embeddings for test queries and documents using the computed indices.
        test_query_embeddings = [query_embeddings[i] for i in test_query_indices]
        test_document_embeddings = [document_embeddings[i] for i in test_doc_indices]
        
        # Now compute the cosine similarity matrix only on test queries vs test documents.
        sim_matrix = cosine_similarity(test_query_embeddings, test_document_embeddings)

        # Initialize the dictionary for storing ranked document IDs.
        self.query_id_to_ranked_doc_ids = {}
        
        # For each test query, rank the documents based on their similarity scores.
        for i, qid in enumerate(self.test_query_ids):
            sim_scores = sim_matrix[i]
            # Sort document indices by descending similarity.
            ranked_indices = np.argsort(sim_scores)[::-1]
            ranked_doc_ids = [self.test_document_ids[idx] for idx in ranked_indices]
            self.query_id_to_ranked_doc_ids[qid] = ranked_doc_ids
        ###########################################################################

    @staticmethod
    def average_precision(relevant_docs: list[str], candidate_docs: list[str], k: int = 10) -> float:
        """
        Implement steps:
        1. Only take the first k candidate documents
        2. Calculate the number of relevant documents in the first k documents
        3. Calculate AP@k (the mean of the precision values at each relevant position in the top k)
        
        Note:
        - k is usually set to 10, because users rarely look at more results
        - This approach is more realistic in practical applications
        - It better evaluates the model's performance on the most relevant documents
        """
        # Only take the first k documents
        candidate_docs = candidate_docs[:k]
        # Calculate which documents are relevant
        y_true = [1 if doc_id in relevant_docs else 0 for doc_id in candidate_docs]
        # Calculate precision at each position
        precisions = [np.mean(y_true[:i+1]) for i in range(len(y_true)) if y_true[i]]
        return np.mean(precisions) if precisions else 0

    #Task 3: Evaluate System Performance (10 Pts)
    
    def mean_average_precision(self) -> float:
        """
        # Inputs:
            - None (uses ranked documents stored in self.query_id_to_ranked_doc_ids).

        # Output:
            - float: The MAP score, computed as the mean of all average precision scores.
    
        (1) Compute the average precision for each test query using the "average_precision" function.
        (2) Compute the mean of all average precision scores.
        Return the mean average precision score.
        
        reference: https://www.evidentlyai.com/ranking-metrics/mean-average-precision-map
        https://towardsdatascience.com/map-mean-average-precision-might-confuse-you-5956f1bfa9e2
        """
        #TODO Put your code here. 
        ###########################################################################
        ap_scores = []
        for qid in self.test_query_ids:
            relevant_docs = self.test_query_id_to_relevant_doc_ids.get(qid, [])
            candidate_docs = self.query_id_to_ranked_doc_ids.get(qid, [])
            ap = self.average_precision(relevant_docs, candidate_docs)
            ap_scores.append(ap)
        return np.mean(ap_scores) if ap_scores else 0.0
        ###########################################################################
    
    #Task 4: Ranking the Top 10 Documents based on Similarity Scores (10 Pts)
   
    def show_ranking_documents(self, example_query: str) -> None:
        
        """
        # Inputs:
            - example_query (str): A query string for which top-ranked documents should be displayed.

        # Output:
            - None (prints the ranked documents along with similarity scores).
        
        (1) rank documents with given query with cosine similarity scores
        (2) prints the top 10 results along with its similarity score.
        
        """
        #TODO Put your code here. 
        query_embedding = self.model.encode(example_query)
        document_embeddings = self.model.encode(self.documents)
        ###########################################################################
        # Compute cosine similarity scores between the query and all documents
        sim_scores = cosine_similarity([query_embedding], document_embeddings)[0]

        # Get indices of top K documents based on the similarity scores
        top_k_indices = np.argsort(sim_scores)[::-1][:self.top_k]

        print(f'Top {self.top_k} documents for the query: "{example_query}"')
        for rank, idx in enumerate(top_k_indices, start=1):
            doc_id = self.document_ids[idx]
            score = sim_scores[idx]
            print(f'Rank {rank}: Document ID: {doc_id}, Similarity Score: {score:.4f}')
        ###########################################################################
      
    #Task 5:Fine tune the sentence transformer model (25 Pts)
    # Students are not graded on achieving a high MAP score. 
    # The key is to show understanding, experimentation, and thoughtful analysis.
    
    def fine_tune_model(self, batch_size: int = 32, num_epochs: int = 3, save_model_path: str = "finetuned_senBERT") -> None:

        """
        Fine-tunes the model using MultipleNegativesRankingLoss.
        (1) Prepare training examples from `self.prepare_training_examples()`
        (2) Experiment with [anchor, positive] vs [anchor, positive, negative]
        (3) Define a loss function (`MultipleNegativesRankingLoss`)
        (4) Freeze all model layers except the final layers
        (5) Train the model with the specified learning rate
        (6) Save the fine-tuned model
        """
        #TODO Put your code here.
        ###########################################################################
        # Import torch at the beginning of the method
        import torch
        from torch.utils.data import DataLoader
        import time
        from datetime import timedelta
        
        # Check device
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {device}")
        
        # Move model to GPU if available
        self.model = self.model.to(device)
        
        # Prepare training examples
        train_examples = self.prepare_training_examples()
        print(f"Number of training examples: {len(train_examples)}")
        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
        
        # Define the loss function
        train_loss = losses.MultipleNegativesRankingLoss(self.model)
        
        # Print parameter counts. This runs before the freezing loop below, so both counts are still equal here.
        total_params = sum(p.numel() for p in self.model.parameters())
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        print(f"Total parameters: {total_params:,}")
        print(f"Trainable parameters: {trainable_params:,}")
        
        # Freeze all layers except the final layer
        for name, param in self.model.named_parameters():
            # Only unfreeze the final transformer layer (layer.5 for all-MiniLM-L6-v2)
            if 'layer.5' in name:  # all-MiniLM-L6-v2 has 6 layers (0-5)
                param.requires_grad = True
                print(f"Unfreezing: {name}")
            else:
                param.requires_grad = False

        # Print trainable parameters to verify
        print("\nTrainable parameters:")
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                print(f"- {name}")
                
        # Training loop with timing
        start_time = time.time()
        print("\nStarting training...")
        
        # Fine-tune the model with learning-rate warmup
        warmup_steps = int(len(train_dataloader) * 0.1)  # 10% of the steps in one epoch
        self.model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=num_epochs,
            warmup_steps=warmup_steps,
            show_progress_bar=True,
            output_path=save_model_path,
            checkpoint_path=f"{save_model_path}_checkpoint",
            checkpoint_save_steps=len(train_dataloader),
            # Note: fit() only invokes this callback after an evaluation, and no evaluator is passed here.
            callback=lambda score, epoch, steps: print(f"\nEpoch {epoch}: Score {score:.4f}")
        )
        
        # Calculate and print training time
        training_time = time.time() - start_time
        print(f"\nTraining completed in {str(timedelta(seconds=int(training_time)))}")
        
        # Save the model
        print(f"Saving model to {save_model_path}")
        self.model.save(save_model_path)
        print("Model saved successfully!")
        ###########################################################################

    # Take a careful look into how the training set is created
    def prepare_training_examples(self) -> list[InputExample]:

        """
        Prepares training examples from the training data.

        # Inputs:
            - None (uses self.train_query_id_to_relevant_doc_ids to create training pairs).

        # Output:
            - list[InputExample]: A list of training samples containing [anchor, positive] or [anchor, positive, negative].
        """
        train_examples = []
        import random
        from datetime import timedelta
        from tqdm import tqdm
        
        print("\nPreparing training examples...")
        total_queries = len(self.train_query_id_to_relevant_doc_ids)
        print(f"Total queries to process: {total_queries}")
        
        # Count total examples that will be created
        total_examples = sum(len(doc_ids) for doc_ids in self.train_query_id_to_relevant_doc_ids.values())
        print(f"Expected total training examples: {total_examples}")
        
        start_time = time.time()
        
        # Create progress bar
        pbar = tqdm(self.train_query_id_to_relevant_doc_ids.items(), 
                    total=total_queries,
                    desc="Processing queries")
        
        for qid, doc_ids in pbar:
            anchor = self.query_id_to_text[qid]
            # Precompute negative candidates for current query using set subtraction for efficiency
            relevant_set = set(self.train_query_id_to_relevant_doc_ids.get(qid, []))
            negative_candidates = list(set(self.document_ids) - relevant_set)
            
            for doc_id in doc_ids:
                positive = self.document_id_to_text[doc_id]
                
                # Update progress bar description with current query details
                pbar.set_description(f"Query {qid}: {len(doc_ids)} docs")
                
                # Build texts list without an explicit else branch.
                texts = [anchor, positive]
                if negative_candidates:
                    texts.append(self.document_id_to_text[random.choice(negative_candidates)])
                train_examples.append(InputExample(texts=texts))
        
        elapsed_time = time.time() - start_time
        print(f"\nTraining examples preparation completed in {timedelta(seconds=int(elapsed_time))}")
        print(f"Final number of training examples: {len(train_examples)}")
        
        return train_examples



In [5]:
# Initialize and use the model
model = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")

# Compare the outputs 
print("Ranking with sentence_transformer...")
model.rank_documents(encoding_method='sentence_transformer')
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)

# Compare the outputs 
print("Ranking with glove...")
model.rank_documents(encoding_method='glove')
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)


model.show_ranking_documents("Breast Cancer Cells Feed on Cholesterol")

Ranking with sentence_transformer...
Mean Average Precision: 0.477390368428073
Ranking with glove...
Mean Average Precision: 0.08843636869952659
Top 10 documents for the query: "Breast Cancer Cells Feed on Cholesterol"
Rank 1: Document ID: MED-2439, Similarity Score: 0.6946
Rank 2: Document ID: MED-2434, Similarity Score: 0.6723
Rank 3: Document ID: MED-2440, Similarity Score: 0.6473
Rank 4: Document ID: MED-2427, Similarity Score: 0.5877
Rank 5: Document ID: MED-2774, Similarity Score: 0.5498
Rank 6: Document ID: MED-838, Similarity Score: 0.5406
Rank 7: Document ID: MED-2430, Similarity Score: 0.5205
Rank 8: Document ID: MED-2102, Similarity Score: 0.5141
Rank 9: Document ID: MED-2437, Similarity Score: 0.5081
Rank 10: Document ID: MED-5066, Similarity Score: 0.5012
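
The MAP values above depend on how `average_precision` normalizes its result: it averages precision over the hit positions found in the top k. The sketch below reproduces that definition on a toy ranking and contrasts it with one common variant that divides by `min(k, number of relevant documents)`, so that relevant documents missing from the top k lower the score. The function names and document IDs are hypothetical.

```python
def ap_at_k_hits(relevant, ranked, k=10):
    """AP@k as in this notebook: mean of precision values at the hit positions in the top k."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def ap_at_k_recall_aware(relevant, ranked, k=10):
    """Variant: same numerator, but divided by min(k, number of relevant documents)."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    denom = min(k, len(relevant))
    return total / denom if denom else 0.0

# Toy example: three relevant documents, two of them retrieved at ranks 1 and 3.
relevant_ids = ["d1", "d3", "d9"]
ranking = ["d1", "d2", "d3", "d4", "d5"]

ap_hits = ap_at_k_hits(relevant_ids, ranking, k=5)            # (1/1 + 2/3) / 2
ap_recall = ap_at_k_recall_aware(relevant_ids, ranking, k=5)  # (1/1 + 2/3) / 3
print(f"Hit-normalized AP@5:   {ap_hits:.4f}")
print(f"Recall-aware AP@5:     {ap_recall:.4f}")
```

Because the two definitions can differ noticeably on the same ranking, it helps to state in the report which one was used when comparing MAP scores with other sources.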


In [6]:
# Finetune all-MiniLM-L6-v2 sentence transformer model
model.fine_tune_model(batch_size=32, num_epochs=10, save_model_path="finetuned_senBERT_train_v2")  # Adjust batch size and epochs as needed

model.rank_documents()
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)

Using device: cuda

Preparing training examples...
Total queries to process: 2590
Expected total training examples: 110575


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:51<00:00, 49.99it/s]  



Training examples preparation completed in 0:00:51
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed


Step,Training Loss
500,4.0858
1000,3.706
1500,3.6731
2000,3.6281
2500,3.5824
3000,3.5742
3500,3.5523
4000,3.4992
4500,3.4767
5000,3.4656



Training completed in 0:33:17
Saving model to finetuned_senBERT_train_v2
Model saved successfully!
Mean Average Precision: 0.47489604955042447
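
When judging whether the learning rate was too high or too low, it helps to know what the schedule looks like. The sketch below assumes a linear warmup followed by linear decay and a base learning rate of 2e-5; both are assumptions about the defaults of `fit()` and should be checked against the installed sentence-transformers version. It also assumes a single device and no gradient accumulation. The function name is hypothetical.

```python
import math

def warmup_linear_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Numbers taken from the run above: 110575 examples, batch size 32, 10 epochs.
steps_per_epoch = math.ceil(110575 / 32)
warmup_steps = int(steps_per_epoch * 0.1)     # same formula as in fine_tune_model
total_steps = steps_per_epoch * 10
base_lr = 2e-5                                # assumed default, not read from the library

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = base_lr * warmup_linear_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers only a small fraction of the whole run, since it is computed from one epoch's steps and not from the total number of steps.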


In [8]:
# Create a new model instance using [anchor, positive] strategy
model_ap = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")

# First, evaluate the model before fine-tuning
print("Evaluating model before fine-tuning...")
model_ap.rank_documents()
initial_map_score = model_ap.mean_average_precision()
print("Initial MAP score (before fine-tuning):", initial_map_score)

# Define a new method for preparing training examples with [anchor, positive] pairs
def prepare_training_examples_ap(self) -> list[InputExample]:
    """
    Prepares training examples using only [anchor, positive] pairs.
    This implementation focuses on direct semantic matching without negative samples.
    
    # Output:
        - list[InputExample]: A list of training samples containing [anchor, positive] pairs.
        
    Key differences from the original implementation:
    1. Uses only positive pairs without negative samples
    2. Helps model focus on direct semantic relationships
    3. May result in different ranking behavior
    """
    train_examples = []
    from datetime import timedelta
    from tqdm import tqdm
    
    print("\nPreparing training examples with [anchor, positive] strategy...")
    total_queries = len(self.train_query_id_to_relevant_doc_ids)
    print(f"Total queries to process: {total_queries}")
    
    # Count total examples that will be created
    total_examples = sum(len(doc_ids) for doc_ids in self.train_query_id_to_relevant_doc_ids.values())
    print(f"Expected total training examples: {total_examples}")
    
    start_time = time.time()
    
    # Create progress bar for monitoring
    pbar = tqdm(self.train_query_id_to_relevant_doc_ids.items(), 
                total=total_queries,
                desc="Processing queries")
    
    for qid, doc_ids in pbar:
        # Get query text as anchor
        anchor = self.query_id_to_text[qid]
        
        for doc_id in doc_ids:
            # Get document text as positive example
            positive = self.document_id_to_text[doc_id]
            
            # Update progress information
            pbar.set_description(f"Query {qid}: {len(doc_ids)} docs")
            
            # Create training example with only anchor and positive
            texts = [anchor, positive]
            train_examples.append(InputExample(texts=texts))
    
    elapsed_time = time.time() - start_time
    print(f"\nTraining examples preparation completed in {timedelta(seconds=int(elapsed_time))}")
    print(f"Final number of training examples: {len(train_examples)}")
    
    return train_examples

# Replace the original method with our new implementation
model_ap.prepare_training_examples = prepare_training_examples_ap.__get__(model_ap)

# Fine-tune the model with [anchor, positive] strategy
print("\nFine-tuning with [anchor, positive] strategy...")
model_ap.fine_tune_model(batch_size=32, num_epochs=10, save_model_path="finetuned_senBERT_ap")

# Evaluate the model's performance after fine-tuning
print("\nEvaluating model after fine-tuning...")
model_ap.rank_documents()
final_map_score = model_ap.mean_average_precision()

# Compare all results
print("\nComparison of MAP scores:")
print(f"1. Original model (before fine-tuning): {initial_map_score:.4f}")
print(f"2. After fine-tuning with [anchor, positive]: {final_map_score:.4f}")
# The third value was copied by hand from the In [6] run (0.474896, truncated to four decimals).
print(f"3. Previous run with [anchor, positive, negative]: 0.4748")

Evaluating model before fine-tuning...
Initial MAP score (before fine-tuning): 0.477390368428073

Fine-tuning with [anchor, positive] strategy...
Using device: cuda

Preparing training examples with [anchor, positive] strategy...
Total queries to process: 2590
Expected total training examples: 110575


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:54<00:00, 47.53it/s]  



Training examples preparation completed in 0:00:54
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,3.3817
1000,3.0591
1500,3.0105
2000,2.9862
2500,2.9462
3000,2.9204
3500,2.9029
4000,2.8455
4500,2.841
5000,2.8239



Training completed in 0:19:26
Saving model to finetuned_senBERT_ap
Model saved successfully!

Evaluating model after fine-tuning...

Comparison of MAP scores:
1. Original model (before fine-tuning): 0.4774
2. After fine-tuning with [anchor, positive]: 0.4632
3. Previous run with [anchor, positive, negative]: 0.4748
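
The [anchor, positive] run above relies on MultipleNegativesRankingLoss treating the positives of the other anchors in a batch as negatives. The block below is a minimal pure-Python sketch of that idea, not the library's implementation: the 2-D embeddings are hypothetical toy values, and the similarity scale of 20 is an assumption about the library default.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the whole batch
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Hypothetical toy batch: two queries and their matching documents
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the other anchor

loss_aligned = in_batch_negatives_loss(anchors, aligned)
loss_swapped = in_batch_negatives_loss(anchors, swapped)
print(f"aligned batch loss: {loss_aligned:.4f}")
print(f"swapped batch loss: {loss_swapped:.4f}")
```

One caveat that may matter for this dataset: the training set averages roughly 43 relevant documents per query (110575 examples over 2590 queries), so a shuffled batch can contain several positives for the same query, and this loss would then push some truly relevant documents away as if they were negatives.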


In [9]:
# Create a new model instance for unfrozen experiment
model_unfrozen = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")

# First, evaluate the model before fine-tuning
print("Evaluating model before fine-tuning...")
model_unfrozen.rank_documents()
initial_map_score = model_unfrozen.mean_average_precision()
print("Initial MAP score (before fine-tuning):", initial_map_score)

# Define a modified fine_tune_model method without layer freezing
def fine_tune_model_unfrozen(self, batch_size: int = 32, num_epochs: int = 10, save_model_path: str = "finetuned_senBERT"):
    """
    Fine-tunes the model without freezing any layers.
    All parameters will be trainable during fine-tuning.
    """
    import torch
    from torch.utils.data import DataLoader
    import time
    from datetime import timedelta
    
    # Check device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    
    # Move model to GPU if available
    self.model = self.model.to(device)
    
    # Prepare training examples
    train_examples = self.prepare_training_examples()
    print(f"Number of training examples: {len(train_examples)}")
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
    
    # Define the loss function
    train_loss = losses.MultipleNegativesRankingLoss(self.model)
    
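    # Explicitly mark every parameter as trainable so this method does not
    # depend on the freezing state left by an earlier call
    for param in self.model.parameters():
        param.requires_grad = True
    
    # Print model parameters status
    total_params = sum(p.numel() for p in self.model.parameters())
    trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    
    # All parameters are trainable in this version
    print("\nAll layers are trainable in this experiment")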
    
    # Training loop with timing
    start_time = time.time()
    print("\nStarting training...")
    
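    # Fine-tune the model with warmup
    # len(train_dataloader) is the number of batches in ONE epoch, so this warms up
    # over 10% of the first epoch's steps, not 10% of the whole training run
    warmup_steps = int(len(train_dataloader) * 0.1)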
    self.model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=num_epochs,
        warmup_steps=warmup_steps,
        show_progress_bar=True,
        output_path=save_model_path,
        checkpoint_path=f"{save_model_path}_checkpoint",
        checkpoint_save_steps=len(train_dataloader),
        callback=lambda score, epoch, steps: print(f"\nEpoch {epoch}: Score {score:.4f}")
    )
    
    # Calculate and print training time
    training_time = time.time() - start_time
    print(f"\nTraining completed in {str(timedelta(seconds=int(training_time)))}")
    
    # Save the model
    print(f"Saving model to {save_model_path}")
    self.model.save(save_model_path)
    print("Model saved successfully!")

# Replace the original method with our new implementation
model_unfrozen.fine_tune_model = fine_tune_model_unfrozen.__get__(model_unfrozen)

# Fine-tune the model without layer freezing
print("\nFine-tuning with all layers trainable...")
model_unfrozen.fine_tune_model(batch_size=32, num_epochs=10, save_model_path="finetuned_senBERT_unfrozen")

# Evaluate the model's performance after fine-tuning
print("\nEvaluating model after fine-tuning...")
model_unfrozen.rank_documents()
final_map_score = model_unfrozen.mean_average_precision()

# Compare all results
print("\nComparison of MAP scores:")
print(f"1. Original model (before fine-tuning): {initial_map_score:.4f}")
print(f"2. After fine-tuning with all layers trainable: {final_map_score:.4f}")
print(f"3. Previous run with frozen layers except last: 0.4748")

Evaluating model before fine-tuning...
Initial MAP score (before fine-tuning): 0.477390368428073

Fine-tuning with all layers trainable...
Using device: cuda

Preparing training examples...
Total queries to process: 2590
Expected total training examples: 110575


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:51<00:00, 50.21it/s]  



Training examples preparation completed in 0:00:51
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216

All layers are trainable in this experiment

Starting training...


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,3.8568
1000,3.5754
1500,3.475
2000,3.3893
2500,3.336
3000,3.2396
3500,3.2035
4000,3.0545
4500,3.0229
5000,2.9886



Training completed in 1:07:05
Saving model to finetuned_senBERT_unfrozen
Model saved successfully!

Evaluating model after fine-tuning...

Comparison of MAP scores:
1. Original model (before fine-tuning): 0.4774
2. After fine-tuning with all layers trainable: 0.4965
3. Previous run with frozen layers except last: 0.4748
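
The warmup setting in the cell above is easier to reason about with the step counts written out. The sketch below redoes that arithmetic for the 110575 examples, batch size 32, and 10 epochs used in this run. `warmup_plan` and `linear_warmup_factor` are hypothetical helper names, and the factor models only the warmup ramp, not any decay that the scheduler may apply afterwards.

```python
import math

def warmup_plan(num_examples, batch_size, num_epochs, warmup_fraction=0.1):
    """Return (steps_per_epoch, warmup_steps, total_steps) for a training run."""
    # A DataLoader without drop_last yields ceil(n / batch_size) batches per epoch
    steps_per_epoch = math.ceil(num_examples / batch_size)
    warmup_steps = int(steps_per_epoch * warmup_fraction)
    total_steps = steps_per_epoch * num_epochs
    return steps_per_epoch, warmup_steps, total_steps

def linear_warmup_factor(step, warmup_steps):
    """Multiplier applied to the base learning rate during the warmup ramp."""
    if warmup_steps == 0:
        return 1.0
    return min(1.0, step / warmup_steps)

steps_per_epoch, warmup_steps, total_steps = warmup_plan(110575, 32, 10)
print(f"steps per epoch: {steps_per_epoch}")
print(f"warmup steps:    {warmup_steps}")
print(f"total steps:     {total_steps}")
print(f"warmup share of the full run: {warmup_steps / total_steps:.2%}")
```

With these inputs the plan is 3456 steps per epoch, 345 warmup steps, and 34560 total steps, so the warmup covers about 1% of the full run.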


In [27]:
# --------------------------------------------------
# Experiment with ContrastiveLoss
print("\nExperiment with ContrastiveLoss...")
model_contrastive = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")
# Evaluate the model before fine-tuning
model_contrastive.rank_documents()
initial_map_contrastive = model_contrastive.mean_average_precision()
print("Initial MAP score:", initial_map_contrastive)

# Prepare training data
def prepare_contrastive_examples(model):
    """Prepare labeled (query, document) pairs for ContrastiveLoss.

    Each relevant document yields one positive pair (label 1.0) and one
    randomly chosen non-relevant document as a negative pair (label 0.0).
    """
    import random
    train_examples = []
    for qid, doc_ids in model.train_query_id_to_relevant_doc_ids.items():
        query = model.query_id_to_text[qid]
        # Build the negative pool once per query instead of once per positive
        negative_candidates = list(set(model.document_ids) - set(doc_ids))
        for doc_id in doc_ids:
            # Positive pair
            positive = model.document_id_to_text[doc_id]
            train_examples.append(InputExample(texts=[query, positive], label=1.0))
            
            # Negative pair (randomly select an unrelated document)
            if negative_candidates:
                negative = model.document_id_to_text[random.choice(negative_candidates)]
                train_examples.append(InputExample(texts=[query, negative], label=0.0))
    
    return train_examples

training_examples_contrastive = prepare_contrastive_examples(model_contrastive)
train_dataloader_contrastive = DataLoader(training_examples_contrastive, batch_size=32, shuffle=True)

# Create ContrastiveLoss instance
contrastive_loss = losses.ContrastiveLoss(model=model_contrastive.model)

# Use model.fit directly for fine-tuning
model_contrastive.model.fit(
    train_objectives=[(train_dataloader_contrastive, contrastive_loss)],
    epochs=10,
    output_path="finetuned_senBERT_contrastive"
)
# Evaluate the model after fine-tuning
model_contrastive.rank_documents()
final_map_contrastive = model_contrastive.mean_average_precision()
print("Final MAP score after ContrastiveLoss fine-tuning:", final_map_contrastive)

# Experiment with TripletLoss
print("Experiment with TripletLoss...")
model_triplet = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")
# Evaluate the model before fine-tuning
model_triplet.rank_documents()
initial_map_triplet = model_triplet.mean_average_precision()
print("Initial MAP score:", initial_map_triplet)

# Prepare training data
training_examples_triplet = model_triplet.prepare_training_examples()  
train_dataloader_triplet = DataLoader(training_examples_triplet, batch_size=32, shuffle=True)

# Create TripletLoss instance
triplet_loss = losses.TripletLoss(model=model_triplet.model)

model_triplet.model.fit(
    train_objectives=[(train_dataloader_triplet, triplet_loss)],
    epochs=10,
    output_path="finetuned_senBERT_triplet"
)
# Evaluate the model after fine-tuning
model_triplet.rank_documents()
final_map_triplet = model_triplet.mean_average_precision()
print("Final MAP score after TripletLoss fine-tuning:", final_map_triplet)

# --------------------------------------------------
# Compare all results
print("\nLoss Function Comparison Results:")
print("1. MultipleNegativesRankingLoss (previous run): 0.4774 -> 0.4748")
print(f"2. ContrastiveLoss: {initial_map_contrastive:.4f} -> {final_map_contrastive:.4f}")
print(f"3. TripletLoss: {initial_map_triplet:.4f} -> {final_map_triplet:.4f}")


Experiment with ContrastiveLoss...
Initial MAP score: 0.477390368428073


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,0.1258
1000,0.0339
1500,0.0298
2000,0.0288
2500,0.0278
3000,0.0274
3500,0.0269
4000,0.0262
4500,0.0258
5000,0.0253


Final MAP score after ContrastiveLoss fine-tuning: 0.2926161634244341
Experiment with TripletLoss...
Initial MAP score: 0.477390368428073

Preparing training examples...
Total queries to process: 2590
Expected total training examples: 110575


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:42<00:00, 61.05it/s]  



Training examples preparation completed in 0:00:42
Final number of training examples: 110575


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,4.924
1000,4.8978
1500,4.8597
2000,4.8163
2500,4.7736
3000,4.7268
3500,4.6872
4000,4.6618
4500,4.6356
5000,4.6304


Final MAP score after TripletLoss fine-tuning: 0.14230630848830758

Loss Function Comparison Results:
1. MultipleNegativesRankingLoss (previous run): 0.4774 -> 0.4748
2. ContrastiveLoss: 0.4774 -> 0.2926
3. TripletLoss: 0.4774 -> 0.1423
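
To make the two alternative objectives concrete, here is a per-example sketch of each in pure Python. It assumes cosine distance with margin 0.5 for the contrastive loss and Euclidean distance with margin 5 for the triplet loss. These values are assumptions about the library defaults, since the cell above does not override them. The vectors are hypothetical toy embeddings.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, label, margin=0.5):
    """label=1 pulls the pair together; label=0 pushes it apart up to the margin."""
    d = cosine_distance(u, v)
    return 0.5 * (label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2)

def triplet_loss(anchor, positive, negative, margin=5.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    gap = euclidean_distance(anchor, positive) - euclidean_distance(anchor, negative)
    return max(0.0, gap + margin)

query = [1.0, 0.0]
same = [1.0, 0.0]
orthogonal = [0.0, 1.0]
opposite = [-1.0, 0.0]

print("contrastive, identical pair labeled similar:   ", contrastive_loss(query, same, 1))
print("contrastive, identical pair labeled dissimilar:", contrastive_loss(query, same, 0))
print("contrastive, orthogonal pair labeled dissimilar:", contrastive_loss(query, orthogonal, 0))
print("triplet, best possible case on unit vectors:   ", triplet_loss(query, same, opposite))
```

If the embeddings are unit length, the Euclidean distance between any two of them is at most 2, so with a margin of 5 the triplet hinge is at least 3 for every triplet and can never reach zero. Whether that contributed to the MAP drop reported above is a hypothesis that this notebook does not test. Trying a smaller margin would be one way to check.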


In [28]:
# Comparison of different training durations
print("Experiment with different training durations...")

epochs_to_try = [5, 10, 20]
duration_results = {}

for epochs in epochs_to_try:
    print(f"\nTraining for {epochs} epochs...")
    model = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")
    
    # Record initial performance
    model.rank_documents()
    initial_map = model.mean_average_precision()
    print(f"Initial MAP score: {initial_map:.4f}")
    
    # Train the model
    model.fine_tune_model(batch_size=32, 
                         num_epochs=epochs, 
                         save_model_path=f"finetuned_senBERT_{epochs}epochs")
    
    # Evaluate after training
    model.rank_documents()
    final_map = model.mean_average_precision()
    duration_results[epochs] = {
        'initial_map': initial_map,
        'final_map': final_map,
    }
    print(f"Final MAP score after {epochs} epochs: {final_map:.4f}")

# Compare results
print("\nTraining Duration Comparison Results:")
for epochs, scores in duration_results.items():
    print(f"Epochs {epochs}: {scores['initial_map']:.4f} -> {scores['final_map']:.4f}")

Experiment with different training durations...

Training for 5 epochs...
Initial MAP score: 0.4774
Using device: cuda

Preparing training examples...
Total queries to process: 2590
Expected total training examples: 110575


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:52<00:00, 49.35it/s]  



Training examples preparation completed in 0:00:52
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,4.0615
1000,3.7096
1500,3.656
2000,3.6259
2500,3.6102
3000,3.5713
3500,3.5721
4000,3.5197
4500,3.5075
5000,3.4948



Training completed in 0:16:50
Saving model to finetuned_senBERT_5epochs
Model saved successfully!
Final MAP score after 5 epochs: 0.4767

Training for 10 epochs...




Initial MAP score: 0.4774
Using device: cuda

Preparing training examples...
Total queries to process: 2590
Expected total training examples: 110575


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:46<00:00, 55.34it/s]  



Training examples preparation completed in 0:00:46
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,4.0633
1000,3.7238
1500,3.6635
2000,3.6305
2500,3.6063
3000,3.5675
3500,3.5338
4000,3.4941
4500,3.4786
5000,3.4853



Training completed in 0:31:56
Saving model to finetuned_senBERT_10epochs
Model saved successfully!
Final MAP score after 10 epochs: 0.4769

Training for 20 epochs...
Initial MAP score: 0.4774
Using device: cuda

Preparing training examples...
Total queries to process: 2590
Expected total training examples: 110575


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:42<00:00, 61.51it/s]  



Training examples preparation completed in 0:00:42
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,4.0614
1000,3.7092
1500,3.6546
2000,3.6231
2500,3.6063
3000,3.5656
3500,3.5646
4000,3.509
4500,3.4956
5000,3.481



Training completed in 1:03:39
Saving model to finetuned_senBERT_20epochs
Model saved successfully!
Final MAP score after 20 epochs: 0.4737

Training Duration Comparison Results:
Epochs 5: 0.4774 -> 0.4767
Epochs 10: 0.4774 -> 0.4769
Epochs 20: 0.4774 -> 0.4737


In [29]:
# Comparison of different numbers of negative samples
print("Experiment with different numbers of negative samples...")

# Modify prepare_training_examples to support multiple negatives
def prepare_training_examples_with_n_negatives(self, n_negatives=1):
    """
    Prepares training examples with specified number of negative samples per positive pair
    """
    train_examples = []
    import random
    import time  # used for the elapsed-time report below
    from datetime import timedelta
    from tqdm import tqdm
    
    print("\nPreparing training examples...")
    total_queries = len(self.train_query_id_to_relevant_doc_ids)
    print(f"Total queries to process: {total_queries}")
    
    start_time = time.time()
    
    pbar = tqdm(self.train_query_id_to_relevant_doc_ids.items(), 
                total=total_queries,
                desc="Processing queries")
    
    for qid, doc_ids in pbar:
        anchor = self.query_id_to_text[qid]
        relevant_set = set(doc_ids)
        negative_candidates = list(set(self.document_ids) - relevant_set)
        
        for pos_id in doc_ids:
            positive = self.document_id_to_text[pos_id]
            
            # Sample n negative documents
            if negative_candidates:
                negatives = random.sample(negative_candidates, 
                                       min(n_negatives, len(negative_candidates)))
                texts = [anchor, positive] + [self.document_id_to_text[neg] for neg in negatives]
            else:
                texts = [anchor, positive]
            
            train_examples.append(InputExample(texts=texts))
            
        # Update the progress bar once per query rather than once per document
        pbar.set_description(f"Query {qid}: {len(doc_ids)} docs")
    
    elapsed_time = time.time() - start_time
    print(f"\nTraining examples preparation completed in {timedelta(seconds=int(elapsed_time))}")
    print(f"Final number of training examples: {len(train_examples)}")
    
    return train_examples

# Test different numbers of negative samples
n_negatives_to_try = [1, 3, 5]
negative_results = {}

for n_neg in n_negatives_to_try:
    print(f"\nTraining with {n_neg} negative samples per positive pair...")
    model = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")
    
    # Record initial performance
    model.rank_documents()
    initial_map = model.mean_average_precision()
    print(f"Initial MAP score: {initial_map:.4f}")
    
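    # Replace prepare_training_examples method.
    # fine_tune_model calls self.prepare_training_examples() with no arguments, so
    # n_neg must be bound here; otherwise the default n_negatives=1 is always used.
    from functools import partial
    model.prepare_training_examples = partial(
        prepare_training_examples_with_n_negatives, model, n_negatives=n_neg)
    model.n_negatives = n_neg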
    
    # Train the model
    model.fine_tune_model(batch_size=32, 
                         num_epochs=10, 
                         save_model_path=f"finetuned_senBERT_{n_neg}neg")
    
    # Evaluate after training
    model.rank_documents()
    final_map = model.mean_average_precision()
    negative_results[n_neg] = {
        'initial_map': initial_map,
        'final_map': final_map,
    }
    print(f"Final MAP score with {n_neg} negatives: {final_map:.4f}")

# Compare results
print("\nNegative Samples Comparison Results:")
for n_neg, scores in negative_results.items():
    print(f"{n_neg} negatives: {scores['initial_map']:.4f} -> {scores['final_map']:.4f}")

Experiment with different numbers of negative samples...

Training with 1 negative samples per positive pair...
Initial MAP score: 0.4774
Using device: cuda

Preparing training examples...
Total queries to process: 2590


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:42<00:00, 60.38it/s]  



Training examples preparation completed in 0:00:42
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,4.0837
1000,3.7106
1500,3.685
2000,3.6298
2500,3.6063
3000,3.5547
3500,3.5344
4000,3.4924
4500,3.4868
5000,3.4746



Training completed in 0:32:03
Saving model to finetuned_senBERT_1neg
Model saved successfully!
Final MAP score with 1 negatives: 0.4743

Training with 3 negative samples per positive pair...
Initial MAP score: 0.4774
Using device: cuda

Preparing training examples...
Total queries to process: 2590


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:43<00:00, 59.67it/s]  



Training examples preparation completed in 0:00:43
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,4.0615
1000,3.7093
1500,3.6551
2000,3.624
2500,3.6076
3000,3.5675
3500,3.567
4000,3.5124
4500,3.4994
5000,3.4854



Training completed in 0:32:02
Saving model to finetuned_senBERT_3neg
Model saved successfully!
Final MAP score with 3 negatives: 0.4690

Training with 5 negative samples per positive pair...
Initial MAP score: 0.4774
Using device: cuda

Preparing training examples...
Total queries to process: 2590


Query PLAIN-3474: 83 docs: 100%|██████████| 2590/2590 [00:43<00:00, 59.64it/s]  



Training examples preparation completed in 0:00:43
Final number of training examples: 110575
Number of training examples: 110575
Total parameters: 22,713,216
Trainable parameters: 22,713,216
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.query.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.key.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.self.value.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.dense.bias
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight
Unfreezing: 0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias
Unfreezing: 0.auto_model.encoder.layer.5.intermediate.dense.weight
Unfreezing: 0.auto_model.encoder.layer.5.intermed

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,4.0615
1000,3.7093
1500,3.6551
2000,3.624
2500,3.6076
3000,3.5675
3500,3.567
4000,3.5124
4500,3.4994
5000,3.4854



Training completed in 0:32:01
Saving model to finetuned_senBERT_5neg
Model saved successfully!
Final MAP score with 5 negatives: 0.4690

Negative Samples Comparison Results:
1 negatives: 0.4774 -> 0.4743
3 negatives: 0.4774 -> 0.4690
5 negatives: 0.4774 -> 0.4690
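
A quick way to confirm that a requested number of negatives actually reaches the training examples is to check how many texts each example carries, since the example count itself (110575) does not change with `n_negatives`. The sketch below does this on a hypothetical toy corpus. `build_examples`, `toy_qrels`, and `toy_docs` are illustrative names, not part of the notebook's `TextSimilarityModel`.

```python
import random

def build_examples(query_to_relevant, all_doc_ids, n_negatives, seed=0):
    """Return one [query, positive, negative_1, ..., negative_n] list per positive."""
    rng = random.Random(seed)  # fixed seed so the sampling is repeatable
    examples = []
    for qid, relevant in query_to_relevant.items():
        relevant_set = set(relevant)
        candidates = [d for d in all_doc_ids if d not in relevant_set]
        for pos_id in relevant:
            k = min(n_negatives, len(candidates))
            negatives = rng.sample(candidates, k)
            examples.append([qid, pos_id] + negatives)
    return examples

# Hypothetical toy data: two queries over a six-document corpus
toy_qrels = {"q1": ["d1", "d2"], "q2": ["d3"]}
toy_docs = ["d1", "d2", "d3", "d4", "d5", "d6"]

for n in (1, 3, 5):
    examples = build_examples(toy_qrels, toy_docs, n)
    widths = sorted({len(example) for example in examples})
    print(f"n_negatives={n}: {len(examples)} examples, texts per example: {widths}")
```

The same check applied to the real `train_examples` (for instance, `len(train_examples[0].texts)`) would show whether each run trained on the intended number of negatives.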


In [40]:
def analyze_ranking_quality(self, query_example=None):
    """Analyze ranking quality by comparing model rankings with true relevance"""
    # If no query is provided, use the first query from the test set
    if query_example is None:
        query_id = self.test_query_ids[0]
        query_example = self.query_id_to_text[query_id]
    else:
        # Find the query in the test set
        query_id = None
        for qid in self.test_query_ids:
            if self.query_id_to_text[qid] == query_example:
                query_id = qid
                break
        
        if query_id is None:
            # If the query is not found in the test set, use the first query from the test set
            query_id = self.test_query_ids[0]
            query_example = self.query_id_to_text[query_id]
            print(f"Query not found in test set. Using first test query instead: {query_example}")
    
    # Get relevant documents and analyze ranking
    relevant_docs = self.test_query_id_to_relevant_doc_ids.get(query_id, [])
    self.show_ranking_documents(query_example)
    ranked_docs = self.query_id_to_ranked_doc_ids[query_id]
    
    print("\nRanking Analysis:")
    print(f"Query: {query_example}")
    print(f"Number of truly relevant documents: {len(relevant_docs)}")
    print("\nRelevant documents in top 10:")
    for i, doc_id in enumerate(ranked_docs[:10]):
        is_relevant = doc_id in relevant_docs
        print(f"Rank {i+1}: {doc_id} - {'Relevant' if is_relevant else 'Not Relevant'}")
        
    missed_relevant = [doc for doc in relevant_docs if doc not in ranked_docs[:10]]
    print(f"\nRelevant documents not in top 10: {len(missed_relevant)}")
    if missed_relevant:
        print("Examples:", missed_relevant[:3])
    
analyze_ranking_quality(model)

Top 10 documents for the query: "Do Cholesterol Statin Drugs Cause Breast Cancer?"
Rank 1: Document ID: MED-2429, Similarity Score: 0.7492
Rank 2: Document ID: MED-10, Similarity Score: 0.7311
Rank 3: Document ID: MED-2431, Similarity Score: 0.7285
Rank 4: Document ID: MED-14, Similarity Score: 0.7175
Rank 5: Document ID: MED-2428, Similarity Score: 0.6996
Rank 6: Document ID: MED-1193, Similarity Score: 0.6248
Rank 7: Document ID: MED-1429, Similarity Score: 0.6172
Rank 8: Document ID: MED-4827, Similarity Score: 0.6069
Rank 9: Document ID: MED-1486, Similarity Score: 0.6046
Rank 10: Document ID: MED-2525, Similarity Score: 0.5948

Ranking Analysis:
Query: Do Cholesterol Statin Drugs Cause Breast Cancer?
Number of truly relevant documents: 24

Relevant documents in top 10:
Rank 1: MED-2429 - Relevant
Rank 2: MED-10 - Relevant
Rank 3: MED-2431 - Relevant
Rank 4: MED-14 - Relevant
Rank 5: MED-2428 - Relevant
Rank 6: MED-1193 - Not Relevant
Rank 7: MED-1429 - Not Relevant
Rank 8: MED-148
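
The per-rank relevance listing above is the raw material for average precision. The sketch below shows one common definition, normalizing by the total number of relevant documents. The notebook's own `mean_average_precision` is not shown in this section and may differ, for example by applying a rank cutoff. The document IDs are hypothetical placeholders.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Sum of precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, qrels):
    """Mean of average_precision over all queries in qrels."""
    return sum(average_precision(rankings[q], qrels[q]) for q in qrels) / len(qrels)

# Small example: relevant documents at ranks 1 and 3
ap_small = average_precision(["a", "b", "c", "d"], ["a", "c"])
print(f"AP with hits at ranks 1 and 3: {ap_small:.4f}")

# Same shape as the listing above: a top-10 list whose first five entries are
# relevant, for a query with 24 relevant documents in total
top10 = [f"rel{i}" for i in range(5)] + [f"other{i}" for i in range(5)]
all_relevant = [f"rel{i}" for i in range(24)]
ap_top10 = average_precision(top10, all_relevant)
print(f"AP over the top 10 only: {ap_top10:.4f}")
```

Under this definition a perfect start to the ranking still scores low when only the top 10 are considered, because the 19 relevant documents outside the top 10 count against the query.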

In [46]:
def compare_distance_metrics(self, query_example=None):
    """Compare the effectiveness of different distance metrics"""
    from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances
    
    # If no query is provided, use the first query from the test set
    if query_example is None:
        query_id = self.test_query_ids[0]
        query_example = self.query_id_to_text[query_id]
    else:
        # Find the query in the test set
        query_id = None
        for qid in self.test_query_ids:
            if self.query_id_to_text[qid] == query_example:
                query_id = qid
                break
        
        if query_id is None:
            query_id = self.test_query_ids[0]
            query_example = self.query_id_to_text[query_id]
            print(f"Query not found in test set. Using first test query instead: {query_example}")
    
    # Get ground truth relevant documents
    relevant_docs = self.test_query_id_to_relevant_doc_ids.get(query_id, [])
    print(f"Query: {query_example}")
    print(f"Number of truly relevant documents: {len(relevant_docs)}\n")
    
    # Encode the query and documents
    query_embedding = self.model.encode([query_example])
    doc_embeddings = self.model.encode(self.documents)
    
    # Calculate different distance metrics
    cosine_sim = cosine_similarity(query_embedding, doc_embeddings)[0]
    euclidean_dist = euclidean_distances(query_embedding, doc_embeddings)[0]
    manhattan_dist = manhattan_distances(query_embedding, doc_embeddings)[0]
    
    # Get the top 10 documents for each metric
    top_10_cosine = np.argsort(cosine_sim)[::-1][:10]
    top_10_euclidean = np.argsort(euclidean_dist)[:10]
    top_10_manhattan = np.argsort(manhattan_dist)[:10]
    
    print("Top 10 Documents Comparison:")
    
    def print_ranking_with_relevance(name, scores, indices, higher_is_better=True):
        # higher_is_better=True labels the values as similarity scores;
        # False labels them as distances (smaller means closer)
        print(f"\n{name}:")
        relevant_count = 0
        for i, idx in enumerate(indices):
            doc_id = self.document_ids[idx]
            is_relevant = doc_id in relevant_docs
            if is_relevant:
                relevant_count += 1
            score_str = f"score: {scores[idx]:.4f}" if higher_is_better else f"distance: {scores[idx]:.4f}"
            relevance_str = "Relevant" if is_relevant else "Not Relevant"
            print(f"Rank {i+1}: {doc_id} ({score_str}) - {relevance_str}")
        print(f"Relevant documents in top 10: {relevant_count}/10")
    
    print_ranking_with_relevance("Cosine Similarity", cosine_sim, top_10_cosine)
    print_ranking_with_relevance("Euclidean Distance", euclidean_dist, top_10_euclidean, higher_is_better=False)
    print_ranking_with_relevance("Manhattan Distance", manhattan_dist, top_10_manhattan, higher_is_better=False)

compare_distance_metrics(model)

Query: Do Cholesterol Statin Drugs Cause Breast Cancer?
Number of truly relevant documents: 24

Top 10 Documents Comparison:

Cosine Similarity:
Rank 1: MED-2429 (score: 0.7492) - Relevant
Rank 2: MED-10 (score: 0.7311) - Relevant
Rank 3: MED-2431 (score: 0.7285) - Relevant
Rank 4: MED-14 (score: 0.7175) - Relevant
Rank 5: MED-2428 (score: 0.6996) - Relevant
Rank 6: MED-1193 (score: 0.6248) - Not Relevant
Rank 7: MED-1429 (score: 0.6172) - Not Relevant
Rank 8: MED-4827 (score: 0.6069) - Not Relevant
Rank 9: MED-1486 (score: 0.6046) - Not Relevant
Rank 10: MED-2525 (score: 0.5948) - Not Relevant
Relevant documents in top 10: 5/10

Euclidean Distance:
Rank 1: MED-2429 (distance: 0.7083) - Relevant
Rank 2: MED-10 (distance: 0.7333) - Relevant
Rank 3: MED-2431 (distance: 0.7369) - Relevant
Rank 4: MED-14 (distance: 0.7517) - Relevant
Rank 5: MED-2428 (distance: 0.7751) - Relevant
Rank 6: MED-1193 (distance: 0.8662) - Not Relevant
Rank 7: MED-1429 (distance: 0.8750) - Not Relevant
Rank 8: M
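
The cosine and Euclidean rankings above agree document for document. That is what one would expect if the embeddings are unit length, because then the squared Euclidean distance equals 2 - 2 * cosine similarity and the two orderings are identical. The sketch below checks that relation against the score and distance pairs printed above. `euclidean_from_cosine` is a hypothetical helper name.

```python
import math

def euclidean_from_cosine(cos_sim):
    """Euclidean distance between two unit-length vectors with this cosine similarity."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos_sim))

# (cosine similarity, Euclidean distance) pairs copied from the output above
reported = [
    (0.7492, 0.7083),  # MED-2429
    (0.7311, 0.7333),  # MED-10
    (0.7285, 0.7369),  # MED-2431
    (0.7175, 0.7517),  # MED-14
    (0.6996, 0.7751),  # MED-2428
    (0.6248, 0.8662),  # MED-1193
    (0.6172, 0.8750),  # MED-1429
]

for cos_sim, dist in reported:
    predicted = euclidean_from_cosine(cos_sim)
    print(f"cos={cos_sim:.4f}  predicted distance={predicted:.4f}  reported distance={dist:.4f}")

max_gap = max(abs(euclidean_from_cosine(c) - d) for c, d in reported)
print(f"largest gap between predicted and reported: {max_gap:.5f}")
```

The predicted and reported distances match to within rounding, which suggests that switching from cosine similarity to Euclidean distance cannot change the ranking for this model. Manhattan distance does not have this exact relation to cosine similarity, so it is the only one of the three that could produce a different order.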

In [50]:
def analyze_preprocessing_impact(self, query_example=None):
    """Analyze the impact of preprocessing on ranking results"""
    import re
    
    # If no query is provided, use the first query from the test set
    if query_example is None:
        query_id = self.test_query_ids[0]
        query_example = self.query_id_to_text[query_id]
    else:
        # Find the query in the test set
        query_id = None
        for qid in self.test_query_ids:
            if self.query_id_to_text[qid] == query_example:
                query_id = qid
                break
        
        if query_id is None:
            query_id = self.test_query_ids[0]
            query_example = self.query_id_to_text[query_id]
            print(f"Query not found in test set. Using first test query instead: {query_example}")
    
    # Get ground truth relevant documents
    relevant_docs = self.test_query_id_to_relevant_doc_ids.get(query_id, [])
    print(f"Query: {query_example}")
    print(f"Number of truly relevant documents: {len(relevant_docs)}\n")
    
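    def preprocess_text(text):
        """Lowercase, strip punctuation, and drop a small fixed stop-word list.

        For example, "Do Cholesterol Statin Drugs Cause Breast Cancer?" becomes
        "do cholesterol statin drugs cause breast cancer".
        """
        # Convert to lowercase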
        text = text.lower()
        # Remove punctuation
        text = re.sub(r'[^\w\s]', '', text)
        # Remove common stop words
        stop_words = {'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 
                     'has', 'he', 'in', 'is', 'it', 'its', 'of', 'on', 'that', 'the', 
                     'to', 'was', 'were', 'will', 'with'}
        words = text.split()
        words = [word for word in words if word not in stop_words]
        return ' '.join(words)
    
    def print_ranking_with_relevance(title, query):
        print(f"\n{title}:")
        print(f"Query used: {query}")
        self.show_ranking_documents(query)
        
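        # Get ranked documents and analyze relevance.
        # NOTE: this lookup is keyed by query_id, not by the `query` text passed in.
        # Unless show_ranking_documents refreshes it, both calls report the same
        # top 10, so the relevance analysis does not reflect the preprocessed query.
        ranked_docs = self.query_id_to_ranked_doc_ids[query_id][:10]
        relevant_count = sum(1 for doc_id in ranked_docs if doc_id in relevant_docs)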
        
        print("\nRelevance Analysis:")
        for i, doc_id in enumerate(ranked_docs):
            is_relevant = doc_id in relevant_docs
            print(f"Rank {i+1}: {doc_id} - {'Relevant' if is_relevant else 'Not Relevant'}")
        print(f"Relevant documents in top 10: {relevant_count}/10")
        
        # Show missed relevant documents
        missed_relevant = [doc for doc in relevant_docs if doc not in ranked_docs]
        print(f"Relevant documents not in top 10: {len(missed_relevant)}")
        if missed_relevant:
            print("Examples:", missed_relevant[:3])
    
    # Original query results
    print_ranking_with_relevance("Original Query Results", query_example)
    
    # Preprocessed query results
    processed_query = preprocess_text(query_example)
    print_ranking_with_relevance("Preprocessed Query Results", processed_query)
    
    # Summarize the two query strings (the rankings themselves are printed above)
    print("\nPreprocessing Impact Summary:")
    print("Original query:", query_example)
    print("Preprocessed query:", processed_query)

analyze_preprocessing_impact(model)

Query: Do Cholesterol Statin Drugs Cause Breast Cancer?
Number of truly relevant documents: 24


Original Query Results:
Query used: Do Cholesterol Statin Drugs Cause Breast Cancer?
Top 10 documents for the query: "Do Cholesterol Statin Drugs Cause Breast Cancer?"
Rank 1: Document ID: MED-2429, Similarity Score: 0.7492
Rank 2: Document ID: MED-10, Similarity Score: 0.7311
Rank 3: Document ID: MED-2431, Similarity Score: 0.7285
Rank 4: Document ID: MED-14, Similarity Score: 0.7175
Rank 5: Document ID: MED-2428, Similarity Score: 0.6996
Rank 6: Document ID: MED-1193, Similarity Score: 0.6248
Rank 7: Document ID: MED-1429, Similarity Score: 0.6172
Rank 8: Document ID: MED-4827, Similarity Score: 0.6069
Rank 9: Document ID: MED-1486, Similarity Score: 0.6046
Rank 10: Document ID: MED-2525, Similarity Score: 0.5948

Relevance Analysis:
Rank 1: MED-2429 - Relevant
Rank 2: MED-10 - Relevant
Rank 3: MED-2431 - Relevant
Rank 4: MED-14 - Relevant
Rank 5: MED-2428 - Relevant
Rank 6: MED-1193 - No