## Coding Exercise

### Ranking Documents Report (10 Points)

Students must analyze which encoding methods performed best for document ranking.

What to include in your report:

#### Comparison of Encoding Methods

- Compare GloVe embeddings vs. Sentence Transformer embeddings.
- Which method ranked documents better?
- Did the top-ranked documents make sense?
- How does cosine similarity behave with different embeddings?

#### Observations on Cosine Similarity & Ranking

- Did the ranking appear meaningful?
- Were there cases where documents that should be highly ranked were not?
- What are possible explanations for incorrect rankings?

#### Possible Improvements

- What can be done to improve document ranking?
- Would a different distance metric (e.g., Euclidean, Manhattan) help?
- Would preprocessing the queries or documents (e.g., removing stopwords) improve ranking?
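
One way to explore the distance-metric question is to rank the same vectors under each metric and see where the orderings diverge. The sketch below is only an illustration: the helper `rank_by_metric` and the 2-D toy "embeddings" are made up for this example and are not part of the assignment code.

```python
import numpy as np

def rank_by_metric(query_vec, doc_vecs, metric="cosine"):
    """Return document indices ordered from best to worst match (hypothetical helper)."""
    if metric == "cosine":
        # Higher cosine similarity is better, so negate it to sort ascending.
        norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
        scores = -(doc_vecs @ query_vec) / norms
    elif metric == "euclidean":
        scores = np.linalg.norm(doc_vecs - query_vec, axis=1)
    elif metric == "manhattan":
        scores = np.abs(doc_vecs - query_vec).sum(axis=1)
    else:
        raise ValueError(f"Unknown metric: {metric}")
    return np.argsort(scores, kind="stable")

# Toy vectors, made up for illustration only.
query = np.array([1.0, 0.0])
docs = np.array([
    [10.0, 1.0],  # doc 0: almost the same direction as the query, large magnitude
    [0.5, 0.5],   # doc 1: 45 degrees away from the query, small magnitude
    [0.0, 1.0],   # doc 2: orthogonal to the query
])

for metric in ("cosine", "euclidean", "manhattan"):
    print(metric, rank_by_metric(query, docs, metric).tolist())
```

Here cosine similarity ranks doc 0 first because it ignores vector length, while both distance metrics rank it last because its magnitude puts it far from the query. If the embeddings are L2-normalized, cosine and Euclidean rankings coincide, so a metric change mostly matters for unnormalized vectors.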


### Fine-Tuning Report (15 Points)

After fine-tuning, students must compare different training approaches and reflect on their findings.

What to include in your report:

#### Comparison of Different Training Strategies

- Compare training on [anchor, positive] pairs vs. [anchor, positive, negative] triplets.
- Which approach seemed to improve ranking?
- How did the model behave differently?

#### Impact on MAP Score

- Did fine-tuning improve or hurt the Mean Average Precision (MAP) score?
- If MAP decreased, why might that be?
- Is fine-tuning always necessary for retrieval models?

#### Observations on Training Loss & Learning Rate

- Did the loss converge?
- Was the learning rate too high or too low?
- How did freezing/unfreezing layers impact training?

#### Future Improvements

- Would training with more negatives help?
- Would changing the loss function (e.g., using Softmax Loss) improve performance?
- Could increasing the number of epochs lead to a better model?
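
To reason about whether more negatives could help, it may be useful to see how an in-batch-negatives objective treats pairs and triplets. The sketch below re-implements the idea behind `MultipleNegativesRankingLoss` in plain NumPy. It is an approximation written for illustration, not the library's code: the name `in_batch_negatives_loss`, the toy embeddings, and `scale=20.0` (assumed to match the library default) are all assumptions.

```python
import numpy as np

def in_batch_negatives_loss(anchors, positives, hard_negatives=None, scale=20.0):
    """Cross-entropy over scaled cosine similarities (hypothetical helper).

    Row i of `anchors` should match row i of `positives`; every other
    candidate in the batch acts as a negative for that anchor.
    """
    candidates = positives
    if hard_negatives is not None:
        # Triplet-style input: hard negatives become extra candidate columns.
        candidates = np.vstack([positives, hard_negatives])

    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = scale * (a @ c.T)  # shape: (batch, n_candidates)

    # Numerically stable log-softmax along each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct candidate for anchor i sits in column i.
    idx = np.arange(len(anchors))
    return float(-log_probs[idx, idx].mean())

# Toy embeddings, made up for illustration only.
anchors = np.eye(4)
aligned = in_batch_negatives_loss(anchors, positives=np.eye(4))
shuffled = in_batch_negatives_loss(anchors, positives=np.roll(np.eye(4), 1, axis=0))
with_hard = in_batch_negatives_loss(anchors, np.eye(4), hard_negatives=0.9 * np.eye(4) + 0.1)
print(f"aligned={aligned:.4f} shuffled={shuffled:.4f} with_hard_negatives={with_hard:.4f}")
```

In this toy setup, well-aligned pairs give a loss near zero, mismatched pairs give a large loss, and adding hard negatives that resemble the positives raises the loss again, which is the extra training signal that triplets are meant to provide.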


In [1]:
!pip install datasets sentence_transformers


In [2]:
from huggingface_hub import login

from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from datasets import load_dataset

import gdown
import os
import random
import torch
from torch.utils.data import DataLoader

import pandas as pd

In [15]:
# def collate_fn(batch):
#         """
#         Custom collate function to process InputExample objects into the format that
#         the model can accept.
#         """
#         # Extract texts from InputExample objects
#         texts = [example.texts for example in batch]

#         # The model will accept texts as input (two texts for each pair)
#         return texts

class TextSimilarityModel:
    def __init__(self, corpus_name, rel_name, model_name='all-MiniLM-L6-v2', top_k=10):
        """
        Initialize the model with datasets and pre-trained sentence transformer.
        """
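        # NOTE: this loads a previously fine-tuned checkpoint from a hard-coded path,
        # so the `model_name` argument is currently ignored.
        # Alternatives tried: "/content/finetuned_senBERT_train_v2", model_name
        self.model = SentenceTransformer("/content/finetuned_senBERT_train_v2_21ep_lr5e-4")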
        self.corpus_name = corpus_name
        self.rel_name = rel_name
        self.top_k = top_k

        ### Add Init
        self.query_id_to_ranked_doc_ids = {}
        self.glove_embedding_dict = {}
        self.add_negative = False

        ###
        self.load_data()


    def load_data(self):
        """
        Load and filter datasets based on test queries and documents.
        """
        # Load query and document datasets
        dataset_queries = load_dataset(self.corpus_name, "queries")
        dataset_docs = load_dataset(self.corpus_name, "corpus")

        # Extract queries and documents
        self.queries = dataset_queries["queries"]["text"]
        self.query_ids = dataset_queries["queries"]["_id"]
        self.documents = dataset_docs["corpus"]["text"]
        self.document_ids = dataset_docs["corpus"]["_id"]

        self.kaggle_test_queries = pd.read_csv("test_query.csv")["Query"].tolist()
        self.ori_kaggle_test_document_ids = pd.read_csv("test_documents.csv")["Doc"].tolist()
        # Use a set so each membership check while filtering the corpus is O(1)
        kaggle_doc_id_set = set(self.ori_kaggle_test_document_ids)
        self.kaggle_test_documents = [doc for did, doc in zip(self.document_ids, self.documents) if did in kaggle_doc_id_set]
        self.kaggle_test_document_ids = [did for did in self.document_ids if did in kaggle_doc_id_set]

        # Filter queries and documents and build relevant queries and documents mapping based on test set
        test_qrels = load_dataset(self.rel_name)["test"]
        self.filtered_test_query_ids = set(test_qrels["query-id"])
        self.filtered_test_doc_ids = set(test_qrels["corpus-id"])

        self.test_queries = [q for qid, q in zip(self.query_ids, self.queries) if qid in self.filtered_test_query_ids]
        self.test_query_ids = [qid for qid in self.query_ids if qid in self.filtered_test_query_ids]
        self.test_documents = [doc for did, doc in zip(self.document_ids, self.documents) if did in self.filtered_test_doc_ids]
        self.test_document_ids = [did for did in self.document_ids if did in self.filtered_test_doc_ids]

        self.test_query_id_to_relevant_doc_ids = {qid: [] for qid in self.test_query_ids}
        for qid, doc_id in zip(test_qrels["query-id"], test_qrels["corpus-id"]):
            if qid in self.test_query_id_to_relevant_doc_ids:
                self.test_query_id_to_relevant_doc_ids[qid].append(doc_id)

        ## Code Below this is used for creating the training set
        # Build query and document id to text mapping
        self.query_id_to_text = {query_id:query for query_id, query in zip(self.query_ids, self.queries)}
        self.document_id_to_text = {document_id:document for document_id, document in zip(self.document_ids, self.documents)}

        # Build relevant queries and documents mapping based on train set
        train_qrels = load_dataset(self.rel_name)["train"]
        self.train_query_id_to_relevant_doc_ids = {qid: [] for qid in train_qrels["query-id"]}

        for qid, doc_id in zip(train_qrels["query-id"], train_qrels["corpus-id"]):
            if qid in self.train_query_id_to_relevant_doc_ids:
                # Append the document ID to the relevant doc mapping
                self.train_query_id_to_relevant_doc_ids[qid].append(doc_id)

        # Build relevant queries and documents mapping based on validation set
        #TODO Put your code here. Done by Tianyi Li on 02/06/2025
        ###########################################################################
        validate_qrels = load_dataset(self.rel_name)["validation"]
        self.validate_query_id_to_relevant_doc_ids = {qid: [] for qid in validate_qrels["query-id"]}

        for qid, doc_id in zip(validate_qrels["query-id"], validate_qrels["corpus-id"]):
            if qid in self.validate_query_id_to_relevant_doc_ids:
                # Append the document ID to the relevant doc mapping
                self.validate_query_id_to_relevant_doc_ids[qid].append(doc_id)
        ###########################################################################

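    def kaggle_rank_documents(self, encoding_method: str = 'sentence_transformer') -> None:
        """
        Rank the Kaggle test documents for each Kaggle test query and write the
        top_k document IDs per query to "output.csv".
        Note: `encoding_method` is currently unused; the sentence transformer is always used.
        """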
        query_embeddings = self.model.encode(self.kaggle_test_queries)
        document_embeddings = self.model.encode(self.kaggle_test_documents)

        self.query_to_ranked_doc_ids = {}
        query_embeddings = np.array(query_embeddings)
        document_embeddings = np.array(document_embeddings)
        cosine_similarities = cosine_similarity(query_embeddings,document_embeddings)
        for query, similarity_scores in zip(self.kaggle_test_queries, cosine_similarities):
            sorted_indices = np.argsort(similarity_scores)[::-1][:self.top_k]  # Sort in descending order
            ranked_doc_ids = [self.kaggle_test_document_ids[idx] for idx in sorted_indices]  # Map indices to document IDs
            self.query_to_ranked_doc_ids[query] = ranked_doc_ids

        df = pd.DataFrame({
            "Query": self.query_to_ranked_doc_ids.keys(),
            "Doc_ID": [" ".join(docs) for docs in self.query_to_ranked_doc_ids.values()]
        })
        df.to_csv("output.csv", index=False, sep=",")

    #Task 2: Calculate Cosine Similarity and Rank Documents (20 Pts)
    def rank_documents(self, encoding_method: str = 'sentence_transformer') -> None:
        """
         # Inputs:
            - encoding_method (str): The method used for encoding queries/documents.
                             Options: ['glove', 'sentence_transformer'].

        # Output:
            - None (updates self.query_id_to_ranked_doc_ids with ranked document IDs).

        (1) Compute cosine similarity between each document and the query
        (2) Rank documents for each query and save the results in a dictionary "query_id_to_ranked_doc_ids"
            This will be used in "mean_average_precision"
            Example format {2: [125, 673], 35: [900, 822]}
        """
        if encoding_method == 'glove':
            query_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.queries)
            document_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.documents)
        elif encoding_method == 'sentence_transformer':
            query_embeddings = self.model.encode(self.queries)
            document_embeddings = self.model.encode(self.documents)
        else:
            raise ValueError("Invalid encoding method. Choose 'glove' or 'sentence_transformer'.")

        #TODO Put your code here. Done by Tianyi Li on 02/05/2025
        ###########################################################################
        # Build the query-by-document cosine similarity matrix, then keep the
        # top_k highest-scoring document IDs for each query.

        query_embeddings = np.array(query_embeddings)
        document_embeddings = np.array(document_embeddings)
        cosine_similarities = cosine_similarity(query_embeddings,document_embeddings)
        for query_id, similarity_scores in zip(self.query_ids, cosine_similarities):
            sorted_indices = np.argsort(similarity_scores)[::-1][:self.top_k]  # Sort in descending order
            ranked_doc_ids = [self.document_ids[idx] for idx in sorted_indices]  # Map indices to document IDs
            self.query_id_to_ranked_doc_ids[query_id] = ranked_doc_ids
        ###########################################################################

    @staticmethod
    def average_precision(relevant_docs: list[str], candidate_docs: list[str]) -> float:
        """
        # Inputs:
            - relevant_docs (list[str]): A list of document IDs that are relevant to the query.
            - candidate_docs (list[str]): A list of document IDs ranked by the model.

        # Output:
            - float: The average precision score

        Compute average precision for a single query.
        """
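        relevant = set(relevant_docs)  # set membership checks are O(1)
        # y_true[k] is 1 when the document at rank k+1 is relevant
        y_true = [1 if doc_id in relevant else 0 for doc_id in candidate_docs]
        # Precision@k, evaluated only at ranks that hold a relevant document
        precisions = [np.mean(y_true[:k+1]) for k in range(len(y_true)) if y_true[k]]
        # Note: the average is over relevant documents that were retrieved,
        # not over all relevant documents for the query.
        return float(np.mean(precisions)) if precisions else 0.0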

    #Task 3: Evaluate System Performance (10 Pts)

    def mean_average_precision(self) -> float:
        """
        # Inputs:
            - None (uses ranked documents stored in self.query_id_to_ranked_doc_ids).

        # Output:
            - float: The MAP score, computed as the mean of all average precision scores.

        (1) Compute the average precision of each test query using the "average_precision" function.
        (2) Return the mean of those average precision scores (0.0 if there are no test queries).

        reference: https://www.evidentlyai.com/ranking-metrics/mean-average-precision-map
        https://towardsdatascience.com/map-mean-average-precision-might-confuse-you-5956f1bfa9e2
        """
         #TODO Put your code here. Done by DanieL Chen on 02/06/2025
        ###########################################################################
        average_precisions = [] # Create an empty list average_precisions to store the AP of each query

        for qid in self.test_query_ids:
            relevant_docs = self.test_query_id_to_relevant_doc_ids[qid]
            ranked_docs = self.query_id_to_ranked_doc_ids[qid]
            average_precisions.append(self.average_precision(relevant_docs, ranked_docs))

        return np.mean(average_precisions) if average_precisions else 0.0

        ###########################################################################

    #Task 4: Ranking the Top 10 Documents based on Similarity Scores (10 Pts)

    def show_ranking_documents(self, example_query: str) -> None:

        """
        # Inputs:
            - example_query (str): A query string for which top-ranked documents should be displayed.

        # Output:
            - None (prints the ranked documents along with similarity scores).

        (1) Rank all documents for the given query by cosine similarity.
        (2) Print the top 10 results along with their similarity scores.

        """
        #TODO Put your code here. Done by DanieL Chen on 02/06/2025
        query_embedding = self.model.encode([example_query])[0]  # encode a one-item batch and take its only embedding
        document_embeddings = self.model.encode(self.documents)
        ###########################################################################
        cosine_similarities = cosine_similarity([query_embedding], document_embeddings)[0]
        sorted_indices = np.argsort(cosine_similarities)[::-1][:10]
        ranked_docs = [(self.documents[i], cosine_similarities[i]) for i in sorted_indices]

        print(f"Top 10 documents for query: {example_query}\n")

        for i, (doc, score) in enumerate(ranked_docs, 1):
            print(f"{i}. Score: {score:.4f}, Document: {doc}")


        ###########################################################################

    #Task 5: Fine-tune the sentence transformer model (25 Pts)
    # Students are not graded on achieving a high MAP score.
    # The key is to show understanding, experimentation, and thoughtful analysis.

    def fine_tune_model(self, batch_size: int = 64, num_epochs: int = 3, save_model_path: str = "finetuned_senBERT", lr: float = 5e-4) -> None:
        """
        Fine-tunes the model using MultipleNegativesRankingLoss.
        """
        # self.add_negative = True
        train_examples = self.prepare_training_examples()

        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)

        # Use MultipleNegativesRankingLoss
        train_loss = losses.MultipleNegativesRankingLoss(self.model)

        # Freeze all parameters except those in encoder layer index 5 and the pooler
        for name, param in self.model.named_parameters():
            if "encoder.layer.5" in name or "pooler" in name:
                param.requires_grad = True
            else:
                param.requires_grad = False

        # Training Loop with Validation
        self.model.train()
        best_val_loss = float("inf")
        patience_counter = 0

        # for epoch in range(num_epochs):
        self.model.fit(
              train_objectives=[(train_dataloader, train_loss)],
              epochs=num_epochs,  # Total number of training epochs for this fit() call
              # warmup_steps=int(0.1 * len(train_dataloader)),
              optimizer_params={'lr': lr},
              show_progress_bar=True
        )

            # # Validation and dynamic early stopping
            # val_loss = self.evaluate_validation_loss()  # Implement this function based on your validation data
            # print(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {val_loss}")

            # if val_loss < best_val_loss:
            #     best_val_loss = val_loss
            #     patience_counter = 0
            #     self.model.save(save_model_path)  # Save the best model
            # else:
            #     patience_counter += 1

            # if patience_counter > 2:  # Stop training if validation loss doesn't improve for 3 consecutive epochs
            #     print("Early stopping due to no improvement in validation loss.")
            #     break
        self.model.save(save_model_path)

    def prepare_training_examples(self) -> list[InputExample]:
        """
        Prepares training examples from the training data.
        Adds hard negatives if specified.
        """
        train_examples = []

        # Encode all documents in one batched call instead of one call per document
        all_doc_ids = list(self.document_id_to_text.keys())
        all_doc_vectors = self.model.encode([self.document_id_to_text[d] for d in all_doc_ids])
        document_embeddings = dict(zip(all_doc_ids, all_doc_vectors))
        for qid, doc_ids in self.train_query_id_to_relevant_doc_ids.items():
            anchor = self.query_id_to_text[qid]
            anchor_embedding = self.model.encode(anchor)  # Encode the query once

            # Get all negative candidates for the current query
            negative_candidates = list(set(self.document_ids) - set(doc_ids))

            for doc_id in doc_ids:
                positive = self.document_id_to_text[doc_id]
                positive_embedding = document_embeddings[doc_id]  # Precomputed embedding of positive sample

                if self.add_negative:
                    # Hard negative mining: select a negative document that is similar but not relevant
                    hard_negative = self.select_hard_negative(anchor_embedding, negative_candidates, document_embeddings)
                    train_examples.append(InputExample(texts=[anchor, positive, hard_negative]))
                else:
                    train_examples.append(InputExample(texts=[anchor, positive]))

        return train_examples

    def select_hard_negative(self, anchor_embedding, negative_candidates, document_embeddings):
        candidate_ids = list(negative_candidates)
        candidate_embeddings = np.array([document_embeddings[cid] for cid in candidate_ids])

        # Vectorized cosine similarity between the anchor and every candidate
        similarities = np.dot(candidate_embeddings, anchor_embedding) / (
            np.linalg.norm(candidate_embeddings, axis=1) * np.linalg.norm(anchor_embedding)
        )

        max_index = np.argmax(similarities)
        return self.document_id_to_text[candidate_ids[max_index]]

    # def evaluate_validation_loss(self):
    #     """
    #     A method to evaluate the model on validation data and calculate loss.
    #     """
    #     self.model.eval()
    #     validation_examples = self.prepare_validation_examples()  # Prepare validation examples

    #     # Create DataLoader from validation examples
    #     val_dataloader = DataLoader(validation_examples, batch_size=64, shuffle=False, collate_fn=collate_fn)

    #     total_loss = 0
    #     total_samples = 0

    #     # Evaluate over the entire validation set
    #     for batch in val_dataloader:
    #         # Pass the batch to the model and compute the loss
    #         loss = self.model(batch)  # The model automatically computes the loss

    #         total_loss += loss.item() * len(batch)  # Accumulate total loss (scaled by batch size)
    #         total_samples += len(batch)  # Track the number of samples

    #     avg_loss = total_loss / total_samples  # Average loss over all samples in the validation set
    #     return avg_loss

    # def prepare_validation_examples(self) -> list[InputExample]:
    #     """
    #     Prepares validation examples from the validation data.
    #     Adds hard negatives if specified.
    #     """
    #     validation_examples = []

    #     # Compute sentence embeddings for all documents at once to avoid recomputation
    #     document_embeddings = {doc_id: self.model.encode(text) for doc_id, text in self.document_id_to_text.items()}

    #     for qid, doc_ids in self.validate_query_id_to_relevant_doc_ids.items():
    #         anchor = self.query_id_to_text[qid]
    #         anchor_embedding = self.model.encode(anchor)  # Encode the query once

    #         # Get all negative candidates for the current query
    #         negative_candidates = list(set(self.document_ids) - set(doc_ids))

    #         for doc_id in doc_ids:
    #             positive = self.document_id_to_text[doc_id]
    #             positive_embedding = document_embeddings[doc_id]  # Precomputed embedding of positive sample

    #             if self.add_negative:
    #                 # Hard negative mining: select a negative document that is similar but not relevant
    #                 hard_negative = self.select_hard_negative(anchor_embedding, negative_candidates, document_embeddings)
    #                 validation_examples.append(InputExample(texts=[anchor, positive, hard_negative]))
    #             else:
    #                 validation_examples.append(InputExample(texts=[anchor, positive]))

    #     return validation_examples
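
Note that `rank_documents` calls `self.encode_with_glove(...)`, which is not defined in the class as shown above, so the `'glove'` branch would fail without it. The following is a minimal sketch of one plausible implementation, assuming whitespace tokenization, lowercasing, and mean pooling of word vectors. The names `parse_glove_lines` and `encode_with_glove_sketch` are hypothetical, and the signature differs from the call in the class (it takes parsed vectors rather than a file path).

```python
import numpy as np

def parse_glove_lines(lines):
    """Build {word: vector} from lines shaped like 'word v1 v2 ... vN'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) > 1:
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def encode_with_glove_sketch(glove_vectors, texts):
    """Mean-pool word vectors per text; texts with no known word stay all-zero."""
    dim = len(next(iter(glove_vectors.values())))
    embeddings = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        known = [glove_vectors[tok] for tok in text.lower().split() if tok in glove_vectors]
        if known:
            embeddings[row] = np.mean(known, axis=0)
    return embeddings

# Tiny made-up vocabulary standing in for a real file such as glove.6B.50d.txt
toy_lines = ["cancer 1.0 0.0 0.0", "cholesterol 0.0 1.0 0.0", "diet 0.0 0.0 1.0"]
toy_vectors = parse_glove_lines(toy_lines)
toy_embeddings = encode_with_glove_sketch(toy_vectors, ["Cancer cholesterol", "unseen words"])
print(toy_embeddings)
```

With a real embeddings file, the parser can be fed the open file handle directly, for example `with open("glove.6B.50d.txt", encoding="utf-8") as f: vectors = parse_glove_lines(f)`. Mean pooling discards word order, which is one possible explanation to consider when comparing GloVe rankings with Sentence Transformer rankings.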

In [6]:
login()


In [16]:
# Initialize and use the model
model = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")

In [9]:
# Compare the outputs
print("Ranking with sentence_transformer...")
model.rank_documents(encoding_method='sentence_transformer')
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)

# # Download glove txt
# glove_file_path = 'glove.6B.50d.txt'
# embeddings_id = "1sX7UOmk8dGQfGhe8s1qOyN10XvL-8qHx"

# if not os.path.exists(glove_file_path):
#     print("Downloading embeddings...\n\n")
#     gdown.download(id=embeddings_id, output=glove_file_path, quiet=False)

# # Compare the outputs
# print("Ranking with glove...")
# model.rank_documents(encoding_method='glove')
# map_score = model.mean_average_precision()
# print("Mean Average Precision:", map_score)

model.show_ranking_documents("Breast Cancer Cells Feed on Cholesterol")

Ranking with sentence_transformer...
Mean Average Precision: 0.4635071548112422
Top 10 documents for query: Breast Cancer Cells Feed on Cholesterol

1. Score: 0.6208, Document: The specific role of dietary fat in breast cancer progression is unclear, although a low-fat diet was associated with decreased recurrence of estrogen receptor alpha negative (ER(-)) breast cancer. ER(-) basal-like MDA-MB-231 and MDA-MB-436 breast cancer cell lines contained a greater number of cytoplasmic lipid droplets compared to luminal ER(+) MCF-7 cells. Therefore, we studied lipid storage functions in these cells. Both triacylglycerol and cholesteryl ester (CE) concentrations were higher in the ER(-) cells, but the ability to synthesize CE distinguished the two types of breast cancer cells. Higher baseline, oleic acid- and LDL-stimulated CE concentrations were found in ER(-) compared to ER(+) cells. The differences corresponded to greater mRNA and protein levels of acyl-CoA:cholesterol acyltransferase 1 (A
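
When interpreting a MAP value like the one printed above, it helps to work through average precision by hand on a small ranking. The sketch below uses made-up document IDs and shows two common conventions: averaging over the relevant documents that were retrieved (which is what `average_precision` in this notebook does) and normalizing by the total number of relevant documents. Both function names are hypothetical.

```python
def average_precision_retrieved(relevant_docs, candidate_docs):
    """AP averaged over the relevant documents that appear in the ranking."""
    relevant = set(relevant_docs)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(candidate_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def average_precision_all_relevant(relevant_docs, candidate_docs):
    """AP normalized by the total number of relevant documents."""
    relevant = set(relevant_docs)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(candidate_docs, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Made-up example: three relevant documents, two of them retrieved in the top 4.
relevant = ["d1", "d3", "d7"]
ranked = ["d1", "d2", "d3", "d4"]
print(round(average_precision_retrieved(relevant, ranked), 4))     # (1/1 + 2/3) / 2
print(round(average_precision_all_relevant(relevant, ranked), 4))  # (1/1 + 2/3) / 3
```

The first convention does not penalize relevant documents that never reach the top-k list, so scores computed this way tend to look higher than scores normalized by all relevant documents. This is worth keeping in mind when comparing MAP numbers across implementations.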

In [17]:
# Finetune all-MiniLM-L6-v2 sentence transformer model
os.environ["WANDB_DISABLED"] = "true"
model.fine_tune_model(batch_size=32, num_epochs=5, save_model_path="finetuned_senBERT_train_v2", lr=5e-5)  # Adjust batch size and epochs as needed

model.rank_documents()
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).



Step,Training Loss
500,1.3032
1000,1.3053
1500,1.2897
2000,1.2899
2500,1.291
3000,1.2891
3500,1.2822
4000,1.2689
4500,1.2863
5000,1.2731


Mean Average Precision: 0.4538383725810011
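
The early-stopping logic that is commented out in `fine_tune_model` can be tried in isolation before wiring it into training. The sketch below is a simplified stand-in under stated assumptions: `early_stopping_epoch` is a hypothetical helper, the validation losses are made-up numbers (not results from this run), and the stopping rule mirrors the commented-out condition `patience_counter > 2`.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed (hypothetical helper).

    Training stops once the validation loss has failed to improve for more
    than `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch  # a checkpoint would be saved here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses for illustration only.
toy_val_losses = [1.30, 1.29, 1.295, 1.31, 1.30, 1.28]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(f"stopped after epoch {stop_epoch}; best checkpoint from epoch {best_epoch}")
```

Setting `min_delta` above zero makes the rule ignore very small improvements, which may matter when the loss moves only in the second or third decimal place between logging steps.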


In [18]:
model.kaggle_rank_documents()

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
