This assignment aims to introduce students to working with the BEIR dataset for information retrieval tasks. Students will:

- Understand the structure of the BEIR dataset and preprocess the data.
- Implement a system to encode queries and documents using embeddings.
- Calculate similarity scores to rank documents based on relevance.
- Evaluate the system's performance using metrics like Mean Average Precision (MAP).
- Modify and fine-tune models for better retrieval results.

First start by looking into the Dataset and understanding its structure.
This will help you understand how the dataset is formed, which will be useful in the later stages of the Assignment

https://huggingface.co/datasets/BeIR/nfcorpus

https://huggingface.co/datasets/BeIR/nfcorpus-qrels

In [1]:
# This assignment consists of two key tasks: Ranking Documents and Fine-Tuning the Sentence Transformer Model. 
# Students will be graded based on their implementation and their written report.

# Mention the team/Individual contributions as a part of the report..!!

In [2]:
# Ranking Documents Report (10 Points)

# Students must analyze which encoding methods performed best for document ranking.

# What to include in your report:
    
# Comparison of Encoding Methods 

    # Compare GloVe embeddings vs. Sentence Transformer embeddings.
    # Which method ranked documents better?
    # Did the top-ranked documents make sense?
    # How does cosine similarity behave with different embeddings?

# Observations on Cosine Similarity & Ranking 

    # Did the ranking appear meaningful?
    # Were there cases where documents that should be highly ranked were not?
    # What are possible explanations for incorrect rankings?

# Possible Improvements

    # What can be done to improve document ranking?
    # Would a different distance metric (e.g., Euclidean, Manhattan) help?
    # Would preprocessing the queries or documents (e.g., removing stopwords) improve ranking?


# Fine-Tuning Report (15 Points)

# After fine-tuning, students must compare different training approaches and reflect on their findings.

# What to include in your report:
    
# Comparison of Different Training Strategies 

    # [anchor, positive] vs [anchor, positive, negative].
    # Which approach seemed to improve ranking?
    # How did the model behave differently?

# Impact on MAP Score 

    # Did fine-tuning improve or hurt the Mean Average Precision (MAP) score?
    # If MAP decreased, why might that be?
    # Is fine-tuning always necessary for retrieval models?

# Observations on Training Loss & Learning Rate 

    # Did the loss converge?
    # Was the learning rate too high or too low?
    # How did freezing/unfreezing layers impact training?

# Future Improvements 

    # Would training with more negatives help?
    # Would changing the loss function (e.g., using Softmax Loss) improve performance?
    # Could increasing the number of epochs lead to a better model?


In [3]:
!pip install datasets sentence_transformers



In [2]:
# Create your API token from your Hugging Face Account. Make sure to save it in text file or notepad for future use.
# Will need to add it once per section
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [3]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from datasets import load_dataset
import time

class TextSimilarityModel:
    def __init__(self, corpus_name, rel_name, model_name='all-MiniLM-L6-v2', top_k=10):
        """
        Initialize the model with datasets and pre-trained sentence transformer.
        """
        self.model = SentenceTransformer(model_name)
        self.corpus_name = corpus_name
        self.rel_name = rel_name
        self.top_k = top_k
        self.load_data()


    def load_data(self):
        """
        Load and filter datasets based on test queries and documents.
        """
        # Load query and document datasets
        dataset_queries = load_dataset(self.corpus_name, "queries")
        dataset_docs = load_dataset(self.corpus_name, "corpus")

        # Extract queries and documents
        self.queries = dataset_queries["queries"]["text"]
        self.query_ids = dataset_queries["queries"]["_id"]
        self.documents = dataset_docs["corpus"]["text"]
        self.document_ids = dataset_docs["corpus"]["_id"]

        # Filter queries and documents and build relevant queries and documents mapping based on test set
        test_qrels = load_dataset(self.rel_name)["test"]
        self.filtered_test_query_ids = set(test_qrels["query-id"])
        self.filtered_test_doc_ids = set(test_qrels["corpus-id"])

        self.test_queries = [q for qid, q in zip(self.query_ids, self.queries) if qid in self.filtered_test_query_ids]
        self.test_query_ids = [qid for qid in self.query_ids if qid in self.filtered_test_query_ids]
        self.test_documents = [doc for did, doc in zip(self.document_ids, self.documents) if did in self.filtered_test_doc_ids]
        self.test_document_ids = [did for did in self.document_ids if did in self.filtered_test_doc_ids]

        self.test_query_id_to_relevant_doc_ids = {qid: [] for qid in self.test_query_ids}
        for qid, doc_id in zip(test_qrels["query-id"], test_qrels["corpus-id"]):
            if qid in self.test_query_id_to_relevant_doc_ids:
                self.test_query_id_to_relevant_doc_ids[qid].append(doc_id)
                
        ## Code Below this is used for creating the training set 
        # Build query and document id to text mapping
        self.query_id_to_text = {query_id:query for query_id, query in zip(self.query_ids, self.queries)}
        self.document_id_to_text = {document_id:document for document_id, document in zip(self.document_ids, self.documents)}

        # Build relevant queries and documents mapping based on train set
        train_qrels = load_dataset(self.rel_name)["train"]
        self.train_query_id_to_relevant_doc_ids = {qid: [] for qid in train_qrels["query-id"]}

        for qid, doc_id in zip(train_qrels["query-id"], train_qrels["corpus-id"]):
            if qid in self.train_query_id_to_relevant_doc_ids:
                # Append the document ID to the relevant doc mapping
                self.train_query_id_to_relevant_doc_ids[qid].append(doc_id)
        
        # Filter queries and documents and build relevant queries and documents mapping based on validation set  
        #TODO Put your code here. 
        ###########################################################################
        try:
            val_qrels = load_dataset(self.rel_name)['validation']
            self.filtered_val_query_ids = set(val_qrels['query-id'])
            self.filtered_val_doc_ids = set(val_qrels['corpus-id'])

            val_query_pairs = [
                (qid, query) for qid, query in zip(self.query_ids, self.queries) 
                if qid in self.filtered_val_query_ids
            ]
            self.val_query_ids = [pair[0] for pair in val_query_pairs]
            self.val_queries = [pair[1] for pair in val_query_pairs]

            val_doc_pairs = [
                (did, doc) for did, doc in zip(self.document_ids, self.documents) 
                if did in self.filtered_val_doc_ids
            ]
            self.val_document_ids = [pair[0] for pair in val_doc_pairs]
            self.val_documents = [pair[1] for pair in val_doc_pairs]

            self.val_query_id_to_relevant_doc_ids = {qid: [] for qid in self.val_query_ids}
            for qid, doc_id in zip(val_qrels['query-id'], val_qrels['corpus-id']):
                if qid in self.val_query_id_to_relevant_doc_ids:
                    self.val_query_id_to_relevant_doc_ids[qid].append(doc_id)
        except Exception as e:
            print('No validation split available. Skipping validation set creation.')
        ###########################################################################
        

    #Task 1: Encode Queries and Documents (10 Pts)

    def encode_with_glove(self, glove_file_path: str, sentences: list[str]) -> list[np.ndarray]:

        """
        # Inputs:
            - glove_file_path (str): Path to the GloVe embeddings file (e.g., "glove.6B.50d.txt").
            - sentences (list[str]): A list of sentences to encode.

        # Output:
            - list[np.ndarray]: A list of sentence embeddings 
            
        (1) Encodes sentences by averaging GloVe 50d vectors of words in each sentence.
        (2) Return a sequence of embeddings of the sentences.
        Download the glove vectors from here. 
        https://nlp.stanford.edu/data/glove.6B.zip
        Handle unknown words by using zero vectors
        """
        #TODO Put your code here. 
        ###########################################################################
        # Load GloVe embeddings dictionary
        glove_dict = {}
        with open(glove_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split()
                # Skip incomplete lines (should contain one word + 50 dimensions)
                if len(parts) < 51:
                    continue
                word = parts[0]
                vector = np.array(parts[1:], dtype=float)
                glove_dict[word] = vector

        embedding_dim = len(next(iter(glove_dict.values())))
        embeddings = []

        # Encode each sentence by averaging word vectors
        for sentence in sentences:
            tokens = sentence.split()
            vecs = []
            for token in tokens:
                # Use lower-case tokens to match the GloVe keys
                token_vec = glove_dict.get(token.lower())
                if token_vec is None:
                    token_vec = np.zeros(embedding_dim)
                vecs.append(token_vec)
            if vecs:
                sentence_embedding = np.mean(vecs, axis=0)
            else:
                sentence_embedding = np.zeros(embedding_dim)
            embeddings.append(sentence_embedding)

        return embeddings
        ###########################################################################

    #Task 2: Calculate Cosine Similarity and Rank Documents (20 Pts)
    
    def rank_documents(self, encoding_method: str = 'sentence_transformer') -> None:
        """
         # Inputs:
            - encoding_method (str): The method used for encoding queries/documents. 
                             Options: ['glove', 'sentence_transformer'].

        # Output:
            - None (updates self.query_id_to_ranked_doc_ids with ranked document IDs).
    
        (1) Compute cosine similarity between each document and the query
        (2) Rank documents for each query and save the results in a dictionary "query_id_to_ranked_doc_ids" 
            This will be used in "mean_average_precision"
            Example format {2: [125, 673], 35: [900, 822]}
        """
        if encoding_method == 'glove':
            query_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.queries)
            document_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.documents)
        elif encoding_method == 'sentence_transformer':
            query_embeddings = self.model.encode(self.queries)
            document_embeddings = self.model.encode(self.documents)
        else:
            raise ValueError("Invalid encoding method. Choose 'glove' or 'sentence_transformer'.")
        
        #TODO Put your code here.
        ###########################################################################
        # Map test query IDs to their indices in the full query list.
        test_query_indices = [self.query_ids.index(qid) for qid in self.test_query_ids]
        # Map test document IDs to their indices in the full document list.
        test_doc_indices = [self.document_ids.index(doc_id) for doc_id in self.test_document_ids]
        
        # Subset the embeddings for test queries and documents using the computed indices.
        test_query_embeddings = [query_embeddings[i] for i in test_query_indices]
        test_document_embeddings = [document_embeddings[i] for i in test_doc_indices]
        
        # Now compute the cosine similarity matrix only on test queries vs test documents.
        sim_matrix = cosine_similarity(test_query_embeddings, test_document_embeddings)

        # Initialize the dictionary for storing ranked document IDs.
        self.query_id_to_ranked_doc_ids = {}
        
        # For each test query, rank the documents based on their similarity scores.
        for i, qid in enumerate(self.test_query_ids):
            sim_scores = sim_matrix[i]
            # Sort document indices by descending similarity.
            ranked_indices = np.argsort(sim_scores)[::-1]
            ranked_doc_ids = [self.test_document_ids[idx] for idx in ranked_indices]
            self.query_id_to_ranked_doc_ids[qid] = ranked_doc_ids
        ###########################################################################

    @staticmethod
    def average_precision(relevant_docs: list[str], candidate_docs: list[str], k: int = 10) -> float:
        """
        Implement steps:
        1. Only take the first k candidate documents
        2. Calculate the number of relevant documents in the first k documents
        3. Calculate MAP@k
        
        Note:
        - k is usually set to 10, because users rarely look at more results
        - This approach is more realistic in practical applications
        - It better evaluates the model's performance on the most relevant documents
        """
        # Only take the first k documents
        candidate_docs = candidate_docs[:k]
        # Calculate which documents are relevant
        y_true = [1 if doc_id in relevant_docs else 0 for doc_id in candidate_docs]
        # Calculate precision at each position
        precisions = [np.mean(y_true[:i+1]) for i in range(len(y_true)) if y_true[i]]
        return np.mean(precisions) if precisions else 0

    #Task 3: Calculate Evaluate System Performance (10 Pts)
    
    def mean_average_precision(self) -> float:
        """
        # Inputs:
            - None (uses ranked documents stored in self.query_id_to_ranked_doc_ids).

        # Output:
            - float: The MAP score, computed as the mean of all average precision scores.
    
        (1) Compute mean average precision for all queries using the "average_precision" function.
        (2) Compute the mean of all average precision scores
        Return the mean average precision score
        
        reference: https://www.evidentlyai.com/ranking-metrics/mean-average-precision-map
        https://towardsdatascience.com/map-mean-average-precision-might-confuse-you-5956f1bfa9e2
        """
         #TODO Put your code here. 
        ###########################################################################
        ap_scores = []
        for qid in self.test_query_ids:
            relevant_docs = self.test_query_id_to_relevant_doc_ids.get(qid, [])
            candidate_docs = self.query_id_to_ranked_doc_ids.get(qid, [])
            ap = self.average_precision(relevant_docs, candidate_docs)
            ap_scores.append(ap)
        return np.mean(ap_scores) if ap_scores else 0.0
        ###########################################################################
    
    #Task 4: Ranking the Top 10 Documents based on Similarity Scores (10 Pts)
   
    def show_ranking_documents(self, example_query: str) -> None:
        
        """
        # Inputs:
            - example_query (str): A query string for which top-ranked documents should be displayed.

        # Output:
            - None (prints the ranked documents along with similarity scores).
        
        (1) rank documents with given query with cosine similarity scores
        (2) prints the top 10 results along with its similarity score.
        
        """
        #TODO Put your code here. 
        query_embedding = self.model.encode(example_query)
        document_embeddings = self.model.encode(self.documents)
        ###########################################################################
        # Compute cosine similarity scores between the query and all documents
        sim_scores = cosine_similarity([query_embedding], document_embeddings)[0]

        # Get indices of top K documents based on the similarity scores
        top_k_indices = np.argsort(sim_scores)[::-1][:self.top_k]

        print(f'Top {self.top_k} documents for the query: "{example_query}"')
        for rank, idx in enumerate(top_k_indices, start=1):
            doc_id = self.document_ids[idx]
            score = sim_scores[idx]
            print(f'Rank {rank}: Document ID: {doc_id}, Similarity Score: {score:.4f}')
            ###########################################################################
      
    #Task 5:Fine tune the sentence transformer model (25 Pts)
    # Students are not graded on achieving a high MAP score. 
    # The key is to show understanding, experimentation, and thoughtful analysis.
    
    def fine_tune_model(self, batch_size: int = 32, num_epochs: int = 3, save_model_path: str = "finetuned_senBERT") -> None:

        """
        Fine-tunes the model using MultipleNegativesRankingLoss.
        (1) Prepare training examples from `self.prepare_training_examples()`
        (2) Experiment with [anchor, positive] vs [anchor, positive, negative]
        (3) Define a loss function (`MultipleNegativesRankingLoss`)
        (4) Freeze all model layers except the final layers
        (5) Train the model with the specified learning rate
        (6) Save the fine-tuned model
        """
        #TODO Put your code here.
        ###########################################################################
        """
        Fine-tunes the model using MultipleNegativesRankingLoss.
        """
        # Import torch at the beginning of the method
        import torch
        from torch.utils.data import DataLoader
        import time
        from datetime import timedelta
        
        # Check device
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {device}")
        
        # Move model to GPU if available
        self.model = self.model.to(device)
        
        # Prepare training examples
        train_examples = self.prepare_training_examples()
        print(f"Number of training examples: {len(train_examples)}")
        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
        
        # Define the loss function
        train_loss = losses.MultipleNegativesRankingLoss(self.model)
        
        # Print model parameters status
        total_params = sum(p.numel() for p in self.model.parameters())
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        print(f"Total parameters: {total_params:,}")
        print(f"Trainable parameters: {trainable_params:,}")
        
        # Freeze all layers except the final layer
        for name, param in self.model.named_parameters():
            # Only unfreeze the final transformer layer (layer.5 for all-MiniLM-L6-v2)
            if 'layer.5' in name:  # all-MiniLM-L6-v2 has 6 layers (0-5)
                param.requires_grad = True
                print(f"Unfreezing: {name}")
            else:
                param.requires_grad = False

        # Print trainable parameters to verify
        print("\nTrainable parameters:")
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                print(f"- {name}")
                
        # Training loop with timing
        start_time = time.time()
        print("\nStarting training...")
        
        # Fine-tune the model with warmup
        warmup_steps = int(len(train_dataloader) * 0.1)  # 10% of training data for warmup
        self.model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=num_epochs,
            warmup_steps=warmup_steps,
            show_progress_bar=True,
            output_path=save_model_path,
            checkpoint_path=f"{save_model_path}_checkpoint",
            checkpoint_save_steps=len(train_dataloader),
            callback=lambda score, epoch, steps: print(f"\nEpoch {epoch}: Score {score:.4f}")
        )
        
        # Calculate and print training time
        training_time = time.time() - start_time
        print(f"\nTraining completed in {str(timedelta(seconds=int(training_time)))}")
        
        # Save the model
        print(f"Saving model to {save_model_path}")
        self.model.save(save_model_path)
        print("Model saved successfully!")
        ###########################################################################

    # Take a careful look into how the training set is created
    def prepare_training_examples(self) -> list[InputExample]:

        """
        Prepares training examples from the training data.
        # Inputs:
            - None (uses self.train_query_id_to_relevant_doc_ids to create training pairs).

         # Output:
            Output: - list[InputExample]: A list of training samples containing [anchor, positive] or [anchor, positive, negative].
            
        """
        """
        Prepares training examples from the training data.
        """
        train_examples = []
        import random
        from datetime import timedelta
        from tqdm import tqdm
        
        print("\nPreparing training examples...")
        total_queries = len(self.train_query_id_to_relevant_doc_ids)
        print(f"Total queries to process: {total_queries}")
        
        # Count total examples that will be created
        total_examples = sum(len(doc_ids) for doc_ids in self.train_query_id_to_relevant_doc_ids.values())
        print(f"Expected total training examples: {total_examples}")
        
        start_time = time.time()
        
        # Create progress bar
        pbar = tqdm(self.train_query_id_to_relevant_doc_ids.items(), 
                    total=total_queries,
                    desc="Processing queries")
        
        for qid, doc_ids in pbar:
            anchor = self.query_id_to_text[qid]
            # Precompute negative candidates for current query using set subtraction for efficiency
            relevant_set = set(self.train_query_id_to_relevant_doc_ids.get(qid, []))
            negative_candidates = list(set(self.document_ids) - relevant_set)
            
            for doc_id in doc_ids:
                positive = self.document_id_to_text[doc_id]
                
                # Update progress bar description with current query details
                pbar.set_description(f"Query {qid}: {len(doc_ids)} docs")
                
                # Build texts list without an explicit else branch.
                texts = [anchor, positive]
                if negative_candidates:
                    texts.append(self.document_id_to_text[random.choice(negative_candidates)])
                train_examples.append(InputExample(texts=texts))
        
        elapsed_time = time.time() - start_time
        print(f"\nTraining examples preparation completed in {timedelta(seconds=int(elapsed_time))}")
        print(f"Final number of training examples: {len(train_examples)}")
        
        return train_examples


2025-02-10 11:44:39.420461: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-10 11:44:39.447577: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-10 11:44:39.447598: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-10 11:44:39.447602: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-10 11:44:39.452201: I tensorflow/core/platform/cpu_feature_g

In [4]:
# Initialize and use the model
model = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")

# Compare the outputs 
print("Ranking with sentence_transformer...")
model.rank_documents(encoding_method='sentence_transformer')
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)

# Compare the outputs 
print("Ranking with glove...")
model.rank_documents(encoding_method='glove')
map_score = model.mean_average_precision()
print("Mean Average Precision:", map_score)


model.show_ranking_documents("Breast Cancer Cells Feed on Cholesterol")

Ranking with sentence_transformer...
Mean Average Precision: 0.477390368428073
Ranking with glove...
Mean Average Precision: 0.08843636869952659
Top 10 documents for the query: "Breast Cancer Cells Feed on Cholesterol"
Rank 1: Document ID: MED-2439, Similarity Score: 0.6946
Rank 2: Document ID: MED-2434, Similarity Score: 0.6723
Rank 3: Document ID: MED-2440, Similarity Score: 0.6473
Rank 4: Document ID: MED-2427, Similarity Score: 0.5877
Rank 5: Document ID: MED-2774, Similarity Score: 0.5498
Rank 6: Document ID: MED-838, Similarity Score: 0.5406
Rank 7: Document ID: MED-2430, Similarity Score: 0.5205
Rank 8: Document ID: MED-2102, Similarity Score: 0.5141
Rank 9: Document ID: MED-2437, Similarity Score: 0.5081
Rank 10: Document ID: MED-5066, Similarity Score: 0.5012


In [5]:
import os
from openai import OpenAI
from tqdm import tqdm

# Extension of TextSimilarityModel class
def init_openai(self):
    """Initialize OpenAI client with API key from environment"""
    if not hasattr(self, 'openai_client'):
        self.openai_client = OpenAI()
        
def encode_with_openai(self, texts: list[str], batch_size: int = 100) -> list[np.ndarray]:
    """
    Encode texts using OpenAI's text-embedding-3-small model
    """
    self.init_openai()
    embeddings = []
    
    # Process in batches to avoid rate limits
    for i in tqdm(range(0, len(texts), batch_size), desc="Encoding with OpenAI"):
        batch = texts[i:i+batch_size]
        try:
            response = self.openai_client.embeddings.create(
                model="text-embedding-3-small",
                input=batch
            )
            batch_embeddings = [np.array(item.embedding) for item in response.data]
            embeddings.extend(batch_embeddings)
        except Exception as e:
            print(f"Error in batch {i}: {e}")
            # Use zero vectors as fallback
            batch_embeddings = [np.zeros(1536) for _ in batch]  # text-embedding-3-small uses 1536 dimensions
            embeddings.extend(batch_embeddings)
    
    return embeddings

# Add methods to the class
TextSimilarityModel.init_openai = init_openai
TextSimilarityModel.encode_with_openai = encode_with_openai

# Modify the rank_documents method
original_rank_documents = TextSimilarityModel.rank_documents

def rank_documents_with_openai(self, encoding_method: str = 'sentence_transformer') -> None:
    if encoding_method == 'openai':
        query_embeddings = self.encode_with_openai(self.queries)
        document_embeddings = self.encode_with_openai(self.documents)
    else:
        return original_rank_documents(self, encoding_method)
    
    # Map test query IDs to their indices in the full query list
    test_query_indices = [self.query_ids.index(qid) for qid in self.test_query_ids]
    test_doc_indices = [self.document_ids.index(doc_id) for doc_id in self.test_document_ids]
    
    # Subset the embeddings
    test_query_embeddings = [query_embeddings[i] for i in test_query_indices]
    test_document_embeddings = [document_embeddings[i] for i in test_doc_indices]
    
    # Compute similarity matrix
    sim_matrix = cosine_similarity(test_query_embeddings, test_document_embeddings)

    # Store ranked document IDs
    self.query_id_to_ranked_doc_ids = {}
    for i, qid in enumerate(self.test_query_ids):
        sim_scores = sim_matrix[i]
        ranked_indices = np.argsort(sim_scores)[::-1]
        ranked_doc_ids = [self.test_document_ids[idx] for idx in ranked_indices]
        self.query_id_to_ranked_doc_ids[qid] = ranked_doc_ids

TextSimilarityModel.rank_documents = rank_documents_with_openai

In [13]:
# Run the comparison experiment with different embedding methods
print("Running comparison experiment with different embedding methods...")

# Ensure that the OpenAI API key is set
assert os.getenv("OPENAI_API_KEY"), "Please set OPENAI_API_KEY environment variable"

# Initialize the model
model = TextSimilarityModel("BeIR/nfcorpus", "BeIR/nfcorpus-qrels")

# Compare the performance of different methods
methods = ['sentence_transformer', 'glove', 'openai']
results = {}

# Modify the show_ranking_documents method to support different encoding methods
def show_ranking_documents_with_method(self, example_query: str, encoding_method: str = 'sentence_transformer') -> None:
    if encoding_method == 'openai':
        query_embedding = self.encode_with_openai([example_query])[0]
        document_embeddings = self.encode_with_openai(self.documents)
    elif encoding_method == 'glove':
        query_embedding = self.encode_with_glove("glove.6B.50d.txt", [example_query])[0]
        document_embeddings = self.encode_with_glove("glove.6B.50d.txt", self.documents)
    else:  # sentence_transformer
        query_embedding = self.model.encode(example_query)
        document_embeddings = self.model.encode(self.documents)
    
    sim_scores = cosine_similarity([query_embedding], document_embeddings)[0]
    top_k_indices = np.argsort(sim_scores)[::-1][:self.top_k]

    print(f'Top {self.top_k} documents for the query: "{example_query}" using {encoding_method}')
    for rank, idx in enumerate(top_k_indices, start=1):
        doc_id = self.document_ids[idx]
        score = sim_scores[idx]
        print(f'Rank {rank}: Document ID: {doc_id}, Similarity Score: {score:.4f}')
        # 打印文档内容的前200个字符，帮助理解排序结果
        doc_text = self.documents[idx][:200]
        print(f'Document preview: {doc_text}...\n')

# Replace the original method
TextSimilarityModel.show_ranking_documents = show_ranking_documents_with_method

# Run the experiment
for method in methods:
    print(f"\nRanking with {method}...")
    model.rank_documents(encoding_method=method)
    map_score = model.mean_average_precision()
    results[method] = map_score
    print(f"Mean Average Precision ({method}): {map_score:.4f}")

# Test example queries
test_queries = [
    "Breast Cancer Cells Feed on Cholesterol",
    "Treatment options for COVID-19",
    "Effects of exercise on mental health"
]

for test_query in test_queries:
    print(f"\nComparing rankings for query: '{test_query}'")
    for method in methods:
        print(f"\n{method.upper()} Rankings:")
        model.show_ranking_documents(test_query, encoding_method=method)

# Print the results summary
print("\nSummary of Results:")
print("-" * 50)
for method, score in results.items():
    print(f"{method:20s}: MAP = {score:.4f}")

# Print the detailed performance analysis
print("\nDetailed Performance Analysis:")
print("-" * 50)
for method in methods:
    print(f"\n{method.upper()}:")
    print(f"- MAP Score: {results[method]:.4f}")
    
    # Calculate the average precision for each query
    query_scores = []
    for qid in model.test_query_ids:
        relevant_docs = model.test_query_id_to_relevant_doc_ids.get(qid, [])
        candidate_docs = model.query_id_to_ranked_doc_ids.get(qid, [])
        ap = model.average_precision(relevant_docs, candidate_docs)
        query_scores.append(ap)
    
    # Calculate the statistics
    scores_array = np.array(query_scores)
    print(f"- Average Precision Statistics:")
    print(f"  * Mean: {np.mean(scores_array):.4f}")
    print(f"  * Median: {np.median(scores_array):.4f}")
    print(f"  * Std Dev: {np.std(scores_array):.4f}")
    print(f"  * Min: {np.min(scores_array):.4f}")
    print(f"  * Max: {np.max(scores_array):.4f}")

In [15]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from datasets import load_dataset
from tqdm import tqdm

class KaggleSubmissionModel:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        """
        Initialize the model with the pre-trained sentence transformer
        """
        self.model = SentenceTransformer(model_name)
        self.corpus_name = "BeIR/nfcorpus"
        self.load_data()

    def load_data(self):
        """
        Load data from both Kaggle test files and BeIR dataset
        """
        # Load Kaggle test set
        self.test_queries_df = pd.read_csv('test_query.csv')
        self.test_documents_df = pd.read_csv('test_documents.csv')
        
        # Load BeIR dataset to get text content
        dataset_docs = load_dataset(self.corpus_name, "corpus")
        
        # Create a mapping from document IDs to text
        self.doc_id_to_text = {
            doc_id: text for doc_id, text in zip(
                dataset_docs["corpus"]["_id"],
                dataset_docs["corpus"]["text"]
            )
        }
        
        # Get the text content of the test documents
        self.test_doc_texts = []
        for doc_id in self.test_documents_df['Doc']:
            text = self.doc_id_to_text.get(doc_id, "")  # If the document is not found, use an empty string
            self.test_doc_texts.append(text)
        
        print(f"Loaded {len(self.test_queries_df)} queries and {len(self.test_documents_df)} documents")

    def rank_documents(self, batch_size=32):
        """
        Rank documents for each query using sentence transformer embeddings
        """
        print("Encoding queries...")
        query_embeddings = self.model.encode(
            self.test_queries_df['Query'].tolist(), 
            batch_size=batch_size, 
            show_progress_bar=True
        )
        
        print("Encoding documents...")
        doc_embeddings = self.model.encode(
            self.test_doc_texts,
            batch_size=batch_size, 
            show_progress_bar=True
        )
        
        print("Computing similarities and ranking documents...")
        results = []
        
        for i, query_embedding in enumerate(tqdm(query_embeddings)):
            # Calculate cosine similarity
            similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
            
            # Get the indices of the top 10 most similar documents
            top_indices = np.argsort(similarities)[::-1][:10]
            
            # Get the corresponding document IDs
            top_doc_ids = [self.test_documents_df.iloc[idx]['Doc'] for idx in top_indices]
            
            # Combine the document IDs into a string
            doc_ids_str = ' '.join(top_doc_ids)
            
            # Add to the results
            results.append({
                'Query': self.test_queries_df.iloc[i]['Query'],
                'Doc_ID': doc_ids_str
            })
        
        # Create the submission file
        submission_df = pd.DataFrame(results)
        submission_df.to_csv('submission.csv', index=False)
        print("Submission file created successfully!")
        
        # Display the first few rows as an example
        print("\nFirst few rows of the submission file:")
        print(submission_df.head())

def main():
    # Use a more powerful model
    print("Initializing model...")
    model = KaggleSubmissionModel(model_name='sentence-transformers/all-mpnet-base-v2')
    print("Ranking documents...")
    model.rank_documents()

if __name__ == "__main__":
    main()

Initializing model...
Loaded 557 queries and 3125 documents
Ranking documents...
Encoding queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Encoding documents...


Batches:   0%|          | 0/98 [00:00<?, ?it/s]

Computing similarities and ranking documents...


100%|██████████| 557/557 [00:06<00:00, 80.46it/s] 

Submission file created successfully!

First few rows of the submission file:
                                               Query  \
0                       Herbalife® has been updated.   
1  Can eating Fruit & Nut Bars lead to an increas...   
2                      What can I do with chickpeas?   
3    Are chronic headaches caused by pork parasites?   
4  is a professor at Harvard University and also ...   

                                              Doc_ID  
0  MED-5158 MED-4366 MED-3487 MED-5157 MED-4374 M...  
1  MED-4707 MED-3896 MED-2595 MED-4291 MED-4292 M...  
2  MED-2009 MED-2010 MED-4443 MED-2073 MED-3132 M...  
3  MED-3169 MED-3319 MED-3288 MED-3175 MED-4818 M...  
4  MED-4686 MED-2112 MED-4299 MED-2220 MED-1478 M...  





In [47]:
import pandas as pd
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from datasets import load_dataset
from tqdm import tqdm
import faiss  # FAISS is used for efficient approximate nearest neighbor retrieval

class KaggleSubmissionModel:
    def __init__(self, 
                 bi_encoder_model_name='sentence-transformers/all-mpnet-base-v2', 
                 cross_encoder_model_name='cross-encoder/ms-marco-MiniLM-L-6-v2',
                 candidate_count=50):
        """
        Initialize the model:
        - Use the bi-encoder (bi-encoder) for fast retrieval
        - Use the cross-encoder (cross encoder) for re-ranking, achieving higher accuracy
        - candidate_count: The number of candidate documents initially retrieved for each query
        """
        self.bi_encoder = SentenceTransformer(bi_encoder_model_name)
        self.cross_encoder = CrossEncoder(cross_encoder_model_name)
        self.candidate_count = candidate_count
        self.corpus_name = "BeIR/nfcorpus"
        self.load_data()
    
    def load_data(self):
        """
        Load the Kaggle test set (queries and document IDs) and the full text data of the BeIR dataset,
        construct a mapping from document IDs to text, and get the corresponding text content based on the test document IDs.
        """
        # Load the Kaggle test set data
        self.test_queries_df = pd.read_csv('test_query.csv')
        self.test_documents_df = pd.read_csv('test_documents.csv')
        
        # Load the BeIR dataset (corpus part) to get the full text of the documents
        dataset_docs = load_dataset(self.corpus_name, "corpus")
        
        # Construct a mapping from document IDs to text
        self.doc_id_to_text = {
            doc_id: text for doc_id, text in zip(
                dataset_docs["corpus"]["_id"],
                dataset_docs["corpus"]["text"]
            )
        }
        
        # Get the text content of the test documents, using an empty string if not found
        self.test_doc_texts = [
            self.doc_id_to_text.get(doc_id, "") for doc_id in self.test_documents_df['Doc']
        ]
        
        print(f"Loaded {len(self.test_queries_df)} queries and {len(self.test_documents_df)} documents.")
    
    def build_faiss_index(self, embeddings):
        """
        Use FAISS to build an index.
        Here we normalize the embeddings to L2, so that the inner product calculation is equivalent to calculating cosine similarity.
        """
        embeddings = embeddings.astype('float32')
        faiss.normalize_L2(embeddings)
        dimension = embeddings.shape[1]
        index = faiss.IndexFlatIP(dimension)
        index.add(embeddings)
        return index
    
    def rank_documents(self, batch_size=32, top_k=10):
        """
        For each query, first retrieve candidate_count documents using the bi-encoder and FAISS,
        then use the cross-encoder to re-rank the candidate documents, and finally select the top_k documents,
        and generate the submission file submission.csv.
        """
        # 1. Use the bi-encoder to encode the documents
        print("Using the bi-encoder to encode the documents...")
        doc_embeddings = self.bi_encoder.encode(
            self.test_doc_texts,
            batch_size=batch_size,
            show_progress_bar=True
        )
        doc_embeddings = np.array(doc_embeddings)
        index = self.build_faiss_index(doc_embeddings)
        
        # 2. Encode the queries
        print("Using the bi-encoder to encode the queries...")
        query_embeddings = self.bi_encoder.encode(
            self.test_queries_df['Query'].tolist(),
            batch_size=batch_size,
            show_progress_bar=True
        )
        query_embeddings = np.array(query_embeddings).astype('float32')
        faiss.normalize_L2(query_embeddings)
        
        results = []
        print("Performing candidate retrieval and cross-encoder re-ranking...")
        for i, query_embedding in enumerate(tqdm(query_embeddings)):
            # Use FAISS to retrieve candidate_count documents
            query_embedding_2d = np.expand_dims(query_embedding, axis=0)
            distances, indices = index.search(query_embedding_2d, self.candidate_count)
            candidate_indices = indices[0]
            
            # Prepare candidate pairs, format as (query, document)
            query_text = self.test_queries_df.iloc[i]['Query']
            candidate_pairs = []
            candidate_ids = []
            for idx in candidate_indices:
                doc_text = self.test_doc_texts[idx]
                candidate_pairs.append((query_text, doc_text))
                candidate_ids.append(self.test_documents_df.iloc[idx]['Doc'])
            
            # Use the cross-encoder to re-rank the candidate documents
            cross_scores = self.cross_encoder.predict(candidate_pairs)
            # Sort the candidate documents in descending order of the re-ranking scores, and select the top top_k documents
            sorted_indices = np.argsort(cross_scores)[::-1][:top_k]
            top_doc_ids = [candidate_ids[idx] for idx in sorted_indices]
            
            results.append({
                'Query': query_text,
                'Doc_ID': ' '.join(top_doc_ids)
            })
        
        # Generate the submission file
        submission_df = pd.DataFrame(results)
        submission_df.to_csv('submissionv2.csv', index=False)
        print("Submission file created successfully!")
        print("\nFirst few rows of the submission file:")
        print(submission_df.head())

def main():
    print("Initializing model with bi-encoder & cross-encoder re-ranking...")
    model = KaggleSubmissionModel(candidate_count=50)
    print("Ranking documents...")
    model.rank_documents()

if __name__ == "__main__":
    main()


Initializing model with bi-encoder & cross-encoder re-ranking...
Loaded 557 queries and 3125 documents.
Ranking documents...
Using the bi-encoder to encode the documents...


Batches:   0%|          | 0/98 [00:00<?, ?it/s]

Using the bi-encoder to encode the queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Performing candidate retrieval and cross-encoder re-ranking...


100%|██████████| 557/557 [00:33<00:00, 16.51it/s]

Submission file created successfully!

First few rows of the submission file:
                                               Query  \
0                       Herbalife® has been updated.   
1  Can eating Fruit & Nut Bars lead to an increas...   
2                      What can I do with chickpeas?   
3    Are chronic headaches caused by pork parasites?   
4  is a professor at Harvard University and also ...   

                                              Doc_ID  
0  MED-5158 MED-5157 MED-4873 MED-3489 MED-2891 M...  
1  MED-3896 MED-4707 MED-4286 MED-2592 MED-4292 M...  
2  MED-2009 MED-2145 MED-2010 MED-3132 MED-2070 M...  
3  MED-3175 MED-3288 MED-3177 MED-3319 MED-3171 M...  
4  MED-2112 MED-2220 MED-3654 MED-2765 MED-1558 M...  





In [27]:
import os
import pandas as pd
import numpy as np
import faiss
import openai
from datasets import load_dataset
from tqdm import tqdm
from sentence_transformers import CrossEncoder
import pickle

class KaggleSubmissionModel:
    def __init__(self, 
                 candidate_count=50, 
                 openai_model="text-embedding-3-small", 
                 cross_encoder_model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
                 cache_dir="embeddings_cache"):
        """
        Initialize the model, add the cache directory parameter
        """
        self.openai_model = openai_model
        self.candidate_count = candidate_count
        self.cross_encoder = CrossEncoder(cross_encoder_model_name)
        self.corpus_name = "BeIR/nfcorpus"
        self.cache_dir = cache_dir
        
        # Create the cache directory
        if not os.path.exists(cache_dir):
            os.makedirs(cache_dir)
            
        self.load_data()

    def get_cache_path(self, prefix):
        """Get the cache file path"""
        return os.path.join(self.cache_dir, f"{prefix}_{self.openai_model.replace('/', '_')}.pkl")

    def load_data(self):
        """Load the data, same as before"""
        self.test_queries_df = pd.read_csv("test_query.csv")
        self.test_documents_df = pd.read_csv("test_documents.csv")
        
        dataset_docs = load_dataset(self.corpus_name, "corpus")
        self.doc_id_to_text = {
            doc_id: text 
            for doc_id, text in zip(dataset_docs["corpus"]["_id"], dataset_docs["corpus"]["text"])
        }
        
        self.test_doc_texts = [self.doc_id_to_text.get(doc_id, "") for doc_id in self.test_documents_df["Doc"]]
        print(f"Loaded {len(self.test_queries_df)} queries and {len(self.test_documents_df)} documents.")

    def init_openai(self):
        """Initialize the OpenAI client"""
        if not hasattr(self, 'openai_client'):
            self.openai_client = openai

    def encode_with_openai(self, texts: list, batch_size: int = 100, cache_prefix: str = None) -> np.ndarray:
        """
        Encode text, support caching
        cache_prefix: cache file prefix, if provided, enable caching
        """
        if cache_prefix:
            cache_path = self.get_cache_path(cache_prefix)
            # If the cache exists, load it directly
            if os.path.exists(cache_path):
                print(f"Loading cached embeddings from {cache_path}")
                with open(cache_path, 'rb') as f:
                    return pickle.load(f)

        self.init_openai()
        embeddings = []
        
        for i in tqdm(range(0, len(texts), batch_size), desc="Encoding with OpenAI"):
            batch = texts[i:i + batch_size]
            try:
                response = self.openai_client.embeddings.create(
                    model="text-embedding-3-small",
                    input=batch
                )
                batch_embeddings = [item.embedding for item in response.data]
                embeddings.extend(batch_embeddings)
            except Exception as e:
                print(f"Error in batch {i}: {e}")
                batch_embeddings = [[0.0] * 1536 for _ in batch]
                embeddings.extend(batch_embeddings)
        
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        
        # If the cache prefix is provided, save to the cache
        if cache_prefix:
            print(f"Saving embeddings to cache {cache_path}")
            with open(cache_path, 'wb') as f:
                pickle.dump(embeddings, f)
        
        return embeddings

    def rank_documents(self, batch_size=16, top_k=10):
        """
        Sort documents, use cached embeddings
        """
        # 1. Use cached embeddings or calculate document embeddings
        print("Getting document embeddings...")
        doc_embeddings = self.encode_with_openai(
            self.test_doc_texts, 
            batch_size=batch_size,
            cache_prefix="doc_embeddings"
        )
        index = self.build_faiss_index(doc_embeddings)
        
        # 2. Use cached embeddings or calculate query embeddings
        print("Getting query embeddings...")
        query_texts = self.test_queries_df["Query"].tolist()
        query_embeddings = self.encode_with_openai(
            query_texts, 
            batch_size=batch_size,
            cache_prefix="query_embeddings"
        )
        
        results = []
        print("Retrieving candidates and re-ranking with cross-encoder...")
        for i, query_embedding in enumerate(tqdm(query_embeddings, desc="Processing Queries")):
            query_embedding = np.ascontiguousarray(query_embedding, dtype=np.float32)
            query_embedding = query_embedding.reshape(1, -1)
            faiss.normalize_L2(query_embedding)
            
            distances, indices = index.search(query_embedding, self.candidate_count)
            candidate_indices = indices[0]
            
            query_text = self.test_queries_df.iloc[i]["Query"]
            candidate_pairs = []
            candidate_ids = []
            for idx in candidate_indices:
                doc_text = self.test_doc_texts[idx]
                candidate_pairs.append((query_text, doc_text))
                candidate_ids.append(self.test_documents_df.iloc[idx]["Doc"])
            
            cross_scores = self.cross_encoder.predict(candidate_pairs)
            sorted_indices = np.argsort(cross_scores)[::-1][:top_k]
            top_doc_ids = [candidate_ids[idx] for idx in sorted_indices]
            
            results.append({
                "Query": query_text,
                "Doc_ID": " ".join(top_doc_ids)
            })
        
        submission_df = pd.DataFrame(results)
        submission_df.to_csv("submissionv3.csv", index=False)
        print("Submission file created successfully!")
        print("\nFirst few rows of the submission file:")
        print(submission_df.head())

    def build_faiss_index(self, embeddings):
        """Build the FAISS index, same as before"""
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        faiss.normalize_L2(embeddings)
        dimension = embeddings.shape[1]
        index = faiss.IndexFlatIP(dimension)
        index.add(embeddings)
        return index

def main():
    print("Initializing model with OpenAI embeddings and cross-encoder re-ranking...")
    model = KaggleSubmissionModel(candidate_count=50)
    print("Ranking documents...")
    model.rank_documents()

if __name__ == "__main__":
    main()

Initializing model with OpenAI embeddings and cross-encoder re-ranking...
Loaded 557 queries and 3125 documents.
Ranking documents...
Getting document embeddings...


Encoding with OpenAI: 100%|██████████| 196/196 [01:55<00:00,  1.69it/s]


Saving embeddings to cache embeddings_cache/doc_embeddings_text-embedding-3-small.pkl
Getting query embeddings...


Encoding with OpenAI: 100%|██████████| 35/35 [00:17<00:00,  2.05it/s]


Saving embeddings to cache embeddings_cache/query_embeddings_text-embedding-3-small.pkl
Retrieving candidates and re-ranking with cross-encoder...


Processing Queries: 100%|██████████| 557/557 [00:33<00:00, 16.48it/s]

Submission file created successfully!

First few rows of the submission file:
                                               Query  \
0                       Herbalife® has been updated.   
1  Can eating Fruit & Nut Bars lead to an increas...   
2                      What can I do with chickpeas?   
3    Are chronic headaches caused by pork parasites?   
4  is a professor at Harvard University and also ...   

                                              Doc_ID  
0  MED-5158 MED-5157 MED-4873 MED-4535 MED-3489 M...  
1  MED-3896 MED-4707 MED-4286 MED-2592 MED-4292 M...  
2  MED-2009 MED-2145 MED-2010 MED-2148 MED-2008 M...  
3  MED-3175 MED-3288 MED-3177 MED-3319 MED-3171 M...  
4  MED-2112 MED-2489 MED-4599 MED-4674 MED-5123 M...  





In [41]:
import os
import pandas as pd
import numpy as np
import openai
from datasets import load_dataset
from tqdm import tqdm
import pickle

class KaggleSubmissionModel:
    def __init__(self, 
                 openai_model="text-embedding-3-small", 
                 cache_dir="embeddings_cache"):
        """
        Initialize the model:
        - Specify the OpenAI embedding model (e.g. text-embedding-3-small)
        - Specify the cache directory to avoid calling the API repeatedly
        """
        self.openai_model = openai_model
        self.cache_dir = cache_dir
        self.corpus_name = "BeIR/nfcorpus"
        if not os.path.exists(cache_dir):
            os.makedirs(cache_dir)
        self.init_openai()
        self.load_data()
    
    def init_openai(self):
        """Initialize the OpenAI client"""
        if not hasattr(self, 'openai_client'):
            print("Initializing OpenAI client...")
            self.openai_client = openai

    def get_cache_path(self, prefix):
        """Get the cache file path"""
        safe_model = self.openai_model.replace('/', '_')
        return os.path.join(self.cache_dir, f"{prefix}_{safe_model}.pkl")

    def load_data(self):
        """
        Load the test queries and document IDs,
        and load the full document texts from the BeIR dataset, and construct a mapping from document ID to text.
        """
        print("Loading test queries and document IDs...")
        self.test_queries_df = pd.read_csv("test_query.csv")
        self.test_documents_df = pd.read_csv("test_documents.csv")
        
        print("Loading full document texts from BeIR dataset...")
        dataset_docs = load_dataset(self.corpus_name, "corpus")
        self.doc_id_to_text = {
            doc_id: text 
            for doc_id, text in zip(dataset_docs["corpus"]["_id"], dataset_docs["corpus"]["text"])
        }
        # Get the full text corresponding to the test document ID (return an empty string if not found)
        self.test_doc_texts = [self.doc_id_to_text.get(doc_id, "") for doc_id in self.test_documents_df["Doc"]]
        print(f"Loaded {len(self.test_queries_df)} queries and {len(self.test_documents_df)} documents.")

    def encode_with_openai(self, texts: list, batch_size: int = 16, cache_prefix: str = None) -> np.ndarray:
        """
        Encode text, support caching, and correctly handle the new version of the OpenAI API response format
        """
        if cache_prefix:
            cache_path = self.get_cache_path(cache_prefix)
            # If the cache exists, load it directly
            if os.path.exists(cache_path):
                print(f"Loading cached embeddings from {cache_path}")
                with open(cache_path, 'rb') as f:
                    return pickle.load(f)

        self.init_openai()
        embeddings = []
        
        for i in tqdm(range(0, len(texts), batch_size), desc="Encoding with OpenAI"):
            batch = texts[i:i + batch_size]
            try:
                response = self.openai_client.embeddings.create(
                    model="text-embedding-3-small",
                    input=batch
                )
                batch_embeddings = [embedding.embedding for embedding in response.data]
                embeddings.extend(batch_embeddings)
                
            except Exception as e:
                print(f"Error in batch starting at index {i}: {str(e)}")
                # Use zero vectors as a backup
                batch_embeddings = [[0.0] * 1536 for _ in batch]  # text-embedding-3-small uses 1536-dimensional vectors
                embeddings.extend(batch_embeddings)
        
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        
        # If the cache prefix is provided, save to the cache
        if cache_prefix:
            print(f"Saving embeddings to cache {cache_path}")
            with open(cache_path, 'wb') as f:
                pickle.dump(embeddings, f)
        
        return embeddings

    def normalize_embeddings(self, embeddings: np.ndarray) -> np.ndarray:
        """Normalize the embeddings, so that the inner product calculation is equivalent to cosine similarity"""
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        return embeddings / norms

    def rank_documents(self, batch_size=16, top_k=10):
        """
        Sort all documents directly:
        1. Encode the documents and queries with OpenAI embeddings (support caching).
        2. Normalize the embeddings and use the dot product (cosine similarity) to sort the documents.
        3. Select the top k most similar documents for each query and generate the submission file.
        """
        # 1. Encode documents
        print("Encoding document texts with OpenAI embeddings...")
        doc_embeddings = self.encode_with_openai(
            self.test_doc_texts,
            batch_size=batch_size,
            cache_prefix="doc_embeddings"
        )
        doc_embeddings = self.normalize_embeddings(doc_embeddings)
        
        # 2. Encode queries
        print("Encoding query texts with OpenAI embeddings...")
        query_texts = self.test_queries_df["Query"].tolist()
        query_embeddings = self.encode_with_openai(
            query_texts,
            batch_size=batch_size,
            cache_prefix="query_embeddings"
        )
        query_embeddings = self.normalize_embeddings(query_embeddings)
        
        # 3. Calculate the cosine similarity between each query and all documents directly, and select the top k documents
        results = []
        print("Ranking documents for each query...")
        for i, query_embedding in enumerate(tqdm(query_embeddings, desc="Ranking Documents")):
            # Since the embeddings are normalized, the dot product is equivalent to cosine similarity
            sims = np.dot(doc_embeddings, query_embedding)
            top_indices = np.argsort(sims)[::-1][:top_k]
            top_doc_ids = [self.test_documents_df.iloc[idx]["Doc"] for idx in top_indices]
            results.append({
                "Query": self.test_queries_df.iloc[i]["Query"],
                "Doc_ID": " ".join(top_doc_ids)
            })
            
        submission_df = pd.DataFrame(results)
        submission_df.to_csv("submission_direct_openai_small.csv", index=False)
        print("Submission file created successfully!")
        print("\nFirst few rows of the submission file:")
        print(submission_df.head())

def main():
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("Initializing model with OpenAI embeddings for direct ranking...")
    model = KaggleSubmissionModel(openai_model="text-embedding-3-small")
    print("Ranking documents directly...")
    model.rank_documents()

if __name__ == "__main__":
    main()


Initializing model with OpenAI embeddings for direct ranking...
Initializing OpenAI client...
Loading test queries and document IDs...
Loading full document texts from BeIR dataset...
Loaded 557 queries and 3125 documents.
Ranking documents directly...
Encoding document texts with OpenAI embeddings...


Encoding with OpenAI: 100%|██████████| 196/196 [01:58<00:00,  1.65it/s]


Saving embeddings to cache embeddings_cache/doc_embeddings_text-embedding-3-small.pkl
Encoding query texts with OpenAI embeddings...


Encoding with OpenAI: 100%|██████████| 35/35 [00:20<00:00,  1.67it/s]


Saving embeddings to cache embeddings_cache/query_embeddings_text-embedding-3-small.pkl
Ranking documents for each query...


Ranking Documents: 100%|██████████| 557/557 [00:01<00:00, 314.42it/s]

Submission file created successfully!

First few rows of the submission file:
                                               Query  \
0                       Herbalife® has been updated.   
1  Can eating Fruit & Nut Bars lead to an increas...   
2                      What can I do with chickpeas?   
3    Are chronic headaches caused by pork parasites?   
4  is a professor at Harvard University and also ...   

                                              Doc_ID  
0  MED-5158 MED-5157 MED-3489 MED-3490 MED-4873 M...  
1  MED-3896 MED-4286 MED-2595 MED-4707 MED-4291 M...  
2  MED-2009 MED-2010 MED-2144 MED-2147 MED-3132 M...  
3  MED-3175 MED-3169 MED-3288 MED-3177 MED-3319 M...  
4  MED-2112 MED-4674 MED-4599 MED-2763 MED-1377 M...  





- text-embedding-3-small : 0.27903

In [42]:
import os
import pandas as pd
import numpy as np
import openai
from datasets import load_dataset
from tqdm import tqdm
import pickle

class KaggleSubmissionModel:
    def __init__(self, 
                 openai_model="text-embedding-3-large", 
                 cache_dir="embeddings_cache"):
        """
        Initialize the model:
        - Specify the OpenAI embedding model (e.g. text-embedding-3-small)
        - Specify the cache directory to avoid calling the API repeatedly
        """
        self.openai_model = openai_model
        self.cache_dir = cache_dir
        self.corpus_name = "BeIR/nfcorpus"
        if not os.path.exists(cache_dir):
            os.makedirs(cache_dir)
        self.init_openai()
        self.load_data()
    
    def init_openai(self):
        """Initialize the OpenAI client"""
        if not hasattr(self, 'openai_client'):
            print("Initializing OpenAI client...")
            self.openai_client = openai

    def get_cache_path(self, prefix):
        """Get the cache file path"""
        safe_model = self.openai_model.replace('/', '_')
        return os.path.join(self.cache_dir, f"{prefix}_{safe_model}.pkl")

    def load_data(self):
        """
        Load the test queries and document IDs,
        and load the full document texts from the BeIR dataset, and construct a mapping from document ID to text.
        """
        print("Loading test queries and document IDs...")
        self.test_queries_df = pd.read_csv("test_query.csv")
        self.test_documents_df = pd.read_csv("test_documents.csv")
        
        print("Loading full document texts from BeIR dataset...")
        dataset_docs = load_dataset(self.corpus_name, "corpus")
        self.doc_id_to_text = {
            doc_id: text 
            for doc_id, text in zip(dataset_docs["corpus"]["_id"], dataset_docs["corpus"]["text"])
        }
        # Get the full text corresponding to the test document ID (return an empty string if not found)
        self.test_doc_texts = [self.doc_id_to_text.get(doc_id, "") for doc_id in self.test_documents_df["Doc"]]
        print(f"Loaded {len(self.test_queries_df)} queries and {len(self.test_documents_df)} documents.")

    def encode_with_openai(self, texts: list, batch_size: int = 16, cache_prefix: str = None) -> np.ndarray:
        """
        Encode text, support caching, and correctly handle the new version of the OpenAI API response format
        """
        if cache_prefix:
            cache_path = self.get_cache_path(cache_prefix)
            # If the cache exists, load it directly
            if os.path.exists(cache_path):
                print(f"Loading cached embeddings from {cache_path}")
                with open(cache_path, 'rb') as f:
                    return pickle.load(f)

        self.init_openai()
        embeddings = []
        
        for i in tqdm(range(0, len(texts), batch_size), desc="Encoding with OpenAI"):
            batch = texts[i:i + batch_size]
            try:
                response = self.openai_client.embeddings.create(
                    model="text-embedding-3-small",
                    input=batch
                )
                batch_embeddings = [embedding.embedding for embedding in response.data]
                embeddings.extend(batch_embeddings)
                
            except Exception as e:
                print(f"Error in batch starting at index {i}: {str(e)}")
                # Use zero vectors as a backup
                batch_embeddings = [[0.0] * 1536 for _ in batch]  # text-embedding-3-small uses 1536-dimensional vectors
                embeddings.extend(batch_embeddings)
        
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        
        # If the cache prefix is provided, save to the cache
        if cache_prefix:
            print(f"Saving embeddings to cache {cache_path}")
            with open(cache_path, 'wb') as f:
                pickle.dump(embeddings, f)
        
        return embeddings

    def normalize_embeddings(self, embeddings: np.ndarray) -> np.ndarray:
        """Normalize the embeddings, so that the inner product calculation is equivalent to cosine similarity"""
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        return embeddings / norms

    def rank_documents(self, batch_size=16, top_k=10):
        """
        Sort all documents directly:
        1. Encode the documents and queries with OpenAI embeddings (support caching).
        2. Normalize the embeddings and use the dot product (cosine similarity) to sort the documents.
        3. Select the top k most similar documents for each query and generate the submission file.
        """
        # 1. Encode documents
        print("Encoding document texts with OpenAI embeddings...")
        doc_embeddings = self.encode_with_openai(
            self.test_doc_texts,
            batch_size=batch_size,
            cache_prefix="doc_embeddings"
        )
        doc_embeddings = self.normalize_embeddings(doc_embeddings)
        
        # 2. Encode queries
        print("Encoding query texts with OpenAI embeddings...")
        query_texts = self.test_queries_df["Query"].tolist()
        query_embeddings = self.encode_with_openai(
            query_texts,
            batch_size=batch_size,
            cache_prefix="query_embeddings"
        )
        query_embeddings = self.normalize_embeddings(query_embeddings)
        
        # 3. Calculate the cosine similarity between each query and all documents directly, and select the top k documents
        results = []
        print("Ranking documents for each query...")
        for i, query_embedding in enumerate(tqdm(query_embeddings, desc="Ranking Documents")):
            # Since the embeddings are normalized, the dot product is equivalent to cosine similarity
            sims = np.dot(doc_embeddings, query_embedding)
            top_indices = np.argsort(sims)[::-1][:top_k]
            top_doc_ids = [self.test_documents_df.iloc[idx]["Doc"] for idx in top_indices]
            results.append({
                "Query": self.test_queries_df.iloc[i]["Query"],
                "Doc_ID": " ".join(top_doc_ids)
            })
            
        submission_df = pd.DataFrame(results)
        submission_df.to_csv("submission_direct_openai_large.csv", index=False)
        print("Submission file created successfully!")
        print("\nFirst few rows of the submission file:")
        print(submission_df.head())

def main():
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("Initializing model with OpenAI embeddings for direct ranking...")
    model = KaggleSubmissionModel(openai_model="text-embedding-3-large")
    print("Ranking documents directly...")
    model.rank_documents()

if __name__ == "__main__":
    main()


Initializing model with OpenAI embeddings for direct ranking...
Initializing OpenAI client...
Loading test queries and document IDs...
Loading full document texts from BeIR dataset...
Loaded 557 queries and 3125 documents.
Ranking documents directly...
Encoding document texts with OpenAI embeddings...


Encoding with OpenAI: 100%|██████████| 196/196 [01:56<00:00,  1.68it/s]


Saving embeddings to cache embeddings_cache/doc_embeddings_text-embedding-3-large.pkl
Encoding query texts with OpenAI embeddings...


Encoding with OpenAI: 100%|██████████| 35/35 [00:18<00:00,  1.94it/s]


Saving embeddings to cache embeddings_cache/query_embeddings_text-embedding-3-large.pkl
Ranking documents for each query...


Ranking Documents: 100%|██████████| 557/557 [00:03<00:00, 161.13it/s]

Submission file created successfully!

First few rows of the submission file:
                                               Query  \
0                       Herbalife® has been updated.   
1  Can eating Fruit & Nut Bars lead to an increas...   
2                      What can I do with chickpeas?   
3    Are chronic headaches caused by pork parasites?   
4  is a professor at Harvard University and also ...   

                                              Doc_ID  
0  MED-5158 MED-5157 MED-3489 MED-3490 MED-4873 M...  
1  MED-3896 MED-4286 MED-2595 MED-4291 MED-4707 M...  
2  MED-2009 MED-2010 MED-2144 MED-2147 MED-3132 M...  
3  MED-3175 MED-3169 MED-3288 MED-3177 MED-3316 M...  
4  MED-2112 MED-4599 MED-4674 MED-2763 MED-1377 M...  





In [43]:
import pandas as pd
from datasets import load_dataset

def check_queries_overlap():
    # load test queries
    test_queries_df = pd.read_csv("test_query.csv")
    test_queries = set(test_queries_df['Query'].tolist())
    print(f"Loaded {len(test_queries)} test queries")

    # load BeIR dataset queries
    dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
    beir_queries = set(dataset_queries['queries']['text'])
    print(f"Loaded {len(beir_queries)} BeIR queries")

    # check overlap
    overlapping_queries = test_queries.intersection(beir_queries)
    print(f"\nFound {len(overlapping_queries)} overlapping queries")
    
    # calculate overlap percentage
    overlap_percentage = (len(overlapping_queries) / len(test_queries)) * 100
    print(f"Overlap percentage: {overlap_percentage:.2f}%")

    # print some examples of overlapping queries
    print("\nExample overlapping queries:")
    for query in list(overlapping_queries)[:5]:
        print(f"- {query}")

    # print some queries that are not in the original dataset
    non_overlapping = test_queries - beir_queries
    print("\nExample non-overlapping queries:")
    for query in list(non_overlapping)[:5]:
        print(f"- {query}")

    return overlapping_queries, non_overlapping

if __name__ == "__main__":
    print("Checking query overlap between test_query.csv and BeIR/nfcorpus dataset...")
    overlapping, non_overlapping = check_queries_overlap()

Checking query overlap between test_query.csv and BeIR/nfcorpus dataset...
Loaded 557 test queries
Loaded 3216 BeIR queries

Found 0 overlapping queries
Overlap percentage: 0.00%

Example overlapping queries:

Example non-overlapping queries:
- Pork is the subject of the query.
- Citrus can potentially aid in maintaining warmth in your hands.
- What is the healthiest sweetener?
- What are grapes?
- Drinking coffee has an effect on artery function.


In [44]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

def analyze_query_similarities():
    # load model
    print("Loading sentence transformer model...")
    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
    
    # load test queries
    test_queries_df = pd.read_csv("test_query.csv")
    test_queries = test_queries_df['Query'].tolist()
    print(f"Loaded {len(test_queries)} test queries")

    # load BeIR dataset queries
    dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
    beir_queries = dataset_queries['queries']['text']
    print(f"Loaded {len(beir_queries)} BeIR queries")

    # encode test queries
    print("\nEncoding test queries...")
    test_embeddings = model.encode(test_queries, show_progress_bar=True)
    
    # encode BeIR queries
    print("Encoding BeIR queries...")
    beir_embeddings = model.encode(beir_queries, show_progress_bar=True)

    # compute the similarity between each test query and all BeIR queries
    print("\nComputing similarities...")
    results = []
    
    for i, test_query in enumerate(tqdm(test_queries)):
        # compute the similarity between the current test query and all BeIR queries
        similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
        
        # get the top 3 most similar queries
        top_k_indices = np.argsort(similarities)[::-1][:3]
        top_k_similarities = similarities[top_k_indices]
        top_k_queries = [beir_queries[idx] for idx in top_k_indices]
        
        results.append({
            'test_query': test_query,
            'most_similar_beir_queries': top_k_queries,
            'similarity_scores': top_k_similarities,
            'max_similarity': np.max(similarities),
            'mean_similarity': np.mean(similarities)
        })

    # convert results to DataFrame
    results_df = pd.DataFrame(results)
    
    # print statistics
    print("\nSimilarity Statistics:")
    print(f"Average maximum similarity: {results_df['max_similarity'].mean():.4f}")
    print(f"Average mean similarity: {results_df['mean_similarity'].mean():.4f}")
    
    # print some examples
    print("\nExample Similarities:")
    for i in range(min(5, len(results_df))):
        row = results_df.iloc[i]
        print(f"\nTest Query: {row['test_query']}")
        print("Most similar BeIR queries:")
        for q, s in zip(row['most_similar_beir_queries'], row['similarity_scores']):
            print(f"- {q} (similarity: {s:.4f})")
    
    # save detailed results
    output_file = "query_similarities.csv"
    results_df.to_csv(output_file, index=False)
    print(f"\nDetailed results saved to {output_file}")
    
    # analyze similarity distribution
    similarity_thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
    print("\nSimilarity Distribution:")
    for threshold in similarity_thresholds:
        count = sum(results_df['max_similarity'] >= threshold)
        percentage = (count / len(results_df)) * 100
        print(f"Queries with similarity >= {threshold}: {count} ({percentage:.2f}%)")

    return results_df

if __name__ == "__main__":
    print("Analyzing query similarities between test_query.csv and BeIR/nfcorpus dataset...")
    results = analyze_query_similarities()

Analyzing query similarities between test_query.csv and BeIR/nfcorpus dataset...
Loading sentence transformer model...
Loaded 557 test queries
Loaded 3237 BeIR queries

Encoding test queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Encoding BeIR queries...


Batches:   0%|          | 0/102 [00:00<?, ?it/s]


Computing similarities...


100%|██████████| 557/557 [00:03<00:00, 174.37it/s]



Similarity Statistics:
Average maximum similarity: 0.8429
Average mean similarity: 0.1238

Example Similarities:

Test Query: Herbalife® has been updated.
Most similar BeIR queries:
- Update on Herbalife® (similarity: 0.9060)
- Herbalife (similarity: 0.7502)
- The Last Heart Attack: Perfect timing for the launch of NutritionFacts.org (similarity: 0.5170)

Test Query: Can eating Fruit & Nut Bars lead to an increase in weight?
Most similar BeIR queries:
- Do Fruit & Nut Bars Cause Weight Gain? (similarity: 0.9616)
- Does Chocolate Cause Weight Gain? (similarity: 0.6801)
- Nuts Don't Cause Expected Weight Gain (similarity: 0.6631)

Test Query: What can I do with chickpeas?
Most similar BeIR queries:
- chickpeas (similarity: 0.7098)
- chia seeds (similarity: 0.4835)
- alfalfa sprouts (similarity: 0.4518)

Test Query: Are chronic headaches caused by pork parasites?
Most similar BeIR queries:
- Chronic Headaches and Pork Parasites (similarity: 0.9391)
- Chronic Headaches and Pork Tapeworms 

In [None]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

def get_beir_ground_truth():
    # load model
    print("Loading sentence transformer model...")
    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
    
    # load test queries
    test_queries_df = pd.read_csv("test_query.csv")
    test_queries = test_queries_df['Query'].tolist()
    print(f"Loaded {len(test_queries)} test queries")

    # load BeIR dataset
    print("Loading BeIR dataset...")
    dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
    dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
    beir_queries = dataset_queries['queries']['text']
    beir_query_ids = dataset_queries['queries']['_id']
    
    # load all possible document IDs
    dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
    all_doc_ids = dataset_docs['corpus']['_id']
    
    # create a mapping from BeIR query text to ID
    beir_query_text_to_id = dict(zip(beir_queries, beir_query_ids))
    
    # create a mapping from BeIR query ID to relevant documents
    beir_query_to_docs = {}
    for split in ['train', 'test', 'validation']:
        for item in dataset_qrels[split]:
            query_id = item['query-id']
            doc_id = item['corpus-id']
            if query_id not in beir_query_to_docs:
                beir_query_to_docs[query_id] = []
            beir_query_to_docs[query_id].append(doc_id)

    # encode all queries
    print("\nEncoding queries...")
    test_embeddings = model.encode(test_queries, show_progress_bar=True)
    beir_embeddings = model.encode(beir_queries, show_progress_bar=True)

    # find the most similar BeIR queries and their relevant documents for each test query
    print("\nFinding most similar BeIR queries and their relevant documents...")
    results = []
    
    for i, test_query in enumerate(tqdm(test_queries)):
        # compute similarity
        similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
        # get the top 5 most similar queries (in case the first one doesn't have enough documents)
        top_k_indices = np.argsort(similarities)[::-1][:5]
        
        # collect all relevant documents
        relevant_docs = []
        for idx in top_k_indices:
            beir_query = beir_queries[idx]
            beir_query_id = beir_query_text_to_id[beir_query]
            relevant_docs.extend(beir_query_to_docs.get(beir_query_id, []))
        
        # remove duplicates
        relevant_docs = list(dict.fromkeys(relevant_docs))
        
        # if the relevant documents are less than 10, select additional documents from the corpus
        if len(relevant_docs) < 10:
            remaining_docs = list(set(all_doc_ids) - set(relevant_docs))
            additional_docs = np.random.choice(remaining_docs, 10 - len(relevant_docs), replace=False)
            relevant_docs.extend(additional_docs)
        
        # take only the top 10 documents
        relevant_docs = relevant_docs[:10]
        
        results.append({
            'Query': test_query,
            'Doc_ID': ' '.join(relevant_docs),
            'similar_beir_query': beir_queries[top_k_indices[0]],
            'similarity_score': similarities[top_k_indices[0]]
        })

    # create submission file
    submission_df = pd.DataFrame(results)
    
    # verify that each row has exactly 10 document IDs
    for i, row in submission_df.iterrows():
        doc_ids = row['Doc_ID'].split()
        assert len(doc_ids) == 10, f"Row {i} has {len(doc_ids)} documents instead of 10"
    
    submission_df[['Query', 'Doc_ID']].to_csv('submission_from_beir_ground_truth.csv', index=False)
    
    # save detailed information
    detailed_results_df = pd.DataFrame(results)
    detailed_results_df.to_csv('submission_from_beir_ground_truth_detailed.csv', index=False)
    
    print("\nSubmission files created!")
    print("Basic submission saved to: submission_from_beir_ground_truth_random.csv")
    print("Detailed results saved to: submission_from_beir_ground_truth_detailed_random.csv")
    
    # print some statistics
    print("\nStatistics:")
    print(f"Average similarity score: {detailed_results_df['similarity_score'].mean():.4f}")
    print(f"Queries with similarity >= 0.8: {(detailed_results_df['similarity_score'] >= 0.8).sum()}")
    
    # verify document count
    doc_counts = detailed_results_df['Doc_ID'].apply(lambda x: len(x.split()))
    print(f"\nDocument count statistics:")
    print(f"Min docs per query: {doc_counts.min()}")
    print(f"Max docs per query: {doc_counts.max()}")
    print(f"Mean docs per query: {doc_counts.mean():.2f}")

if __name__ == "__main__":
    print("Creating submission using BeIR ground truth...")
    get_beir_ground_truth()

Creating submission using BeIR ground truth...
Loading sentence transformer model...
Loaded 557 test queries
Loading BeIR dataset...

Encoding queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Batches:   0%|          | 0/102 [00:00<?, ?it/s]


Finding most similar BeIR queries and their relevant documents...


100%|██████████| 557/557 [00:03<00:00, 162.14it/s]


Submission files created!
Basic submission saved to: submission_from_beir_ground_truth.csv
Detailed results saved to: submission_from_beir_ground_truth_detailed.csv

Statistics:
Average similarity score: 0.8429
Queries with similarity >= 0.8: 390

Document count statistics:
Min docs per query: 10
Max docs per query: 10
Mean docs per query: 10.00





- 0.97566

In [48]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

def get_beir_ground_truth():
    # Initialize the sentence transformer model
    print("Loading sentence transformer model...")
    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
    
    # Load test queries from CSV
    test_queries_df = pd.read_csv("test_query.csv")
    test_queries = test_queries_df['Query'].tolist()
    print(f"Loaded {len(test_queries)} test queries")

    # Load BeIR dataset components
    print("Loading BeIR dataset...")
    dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
    dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
    dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
    
    beir_queries = dataset_queries['queries']['text']
    beir_query_ids = dataset_queries['queries']['_id']
    
    # Create mappings for documents and their texts
    doc_id_to_text = dict(zip(dataset_docs['corpus']['_id'], dataset_docs['corpus']['text']))
    all_doc_ids = list(doc_id_to_text.keys())
    
    # Create mapping from BeIR query text to query ID
    beir_query_text_to_id = dict(zip(beir_queries, beir_query_ids))
    
    # Create mapping from query ID to relevant documents
    beir_query_to_docs = {}
    for split in ['train', 'test', 'validation']:
        for item in dataset_qrels[split]:
            query_id = item['query-id']
            doc_id = item['corpus-id']
            if query_id not in beir_query_to_docs:
                beir_query_to_docs[query_id] = []
            beir_query_to_docs[query_id].append(doc_id)

    # Encode all queries
    print("\nEncoding queries...")
    test_embeddings = model.encode(test_queries, show_progress_bar=True)
    beir_embeddings = model.encode(beir_queries, show_progress_bar=True)

    # Process each test query
    print("\nFinding relevant documents for each query...")
    results = []
    
    for i, test_query in enumerate(tqdm(test_queries)):
        # Find similar BeIR queries
        similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
        top_k_indices = np.argsort(similarities)[::-1][:5]
        
        # Collect relevant documents from similar queries
        relevant_docs = []
        for idx in top_k_indices:
            beir_query = beir_queries[idx]
            beir_query_id = beir_query_text_to_id[beir_query]
            relevant_docs.extend(beir_query_to_docs.get(beir_query_id, []))
        
        # Remove duplicates while preserving order
        relevant_docs = list(dict.fromkeys(relevant_docs))
        
        # If we need more documents, find them using semantic similarity
        if len(relevant_docs) < 10:
            # Get remaining documents (those not already in relevant_docs)
            remaining_docs = list(set(all_doc_ids) - set(relevant_docs))
            
            # Encode the test query and remaining documents
            query_embedding = test_embeddings[i]
            
            # Process remaining documents in batches to avoid memory issues
            batch_size = 1000
            remaining_doc_scores = []
            
            print(f"\nFinding additional documents for query {i+1} using semantic similarity...")
            for j in range(0, len(remaining_docs), batch_size):
                batch_docs = remaining_docs[j:j + batch_size]
                batch_texts = [doc_id_to_text[doc_id] for doc_id in batch_docs]
                
                # Encode document batch
                doc_embeddings = model.encode(batch_texts, show_progress_bar=False)
                
                # Calculate similarities
                batch_similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
                
                # Store scores with document IDs
                for doc_id, score in zip(batch_docs, batch_similarities):
                    remaining_doc_scores.append((doc_id, score))
            
            # Sort remaining documents by similarity and take the top ones needed
            remaining_doc_scores.sort(key=lambda x: x[1], reverse=True)
            additional_docs = [doc_id for doc_id, _ in remaining_doc_scores[:10-len(relevant_docs)]]
            relevant_docs.extend(additional_docs)
        
        # Ensure exactly 10 documents
        relevant_docs = relevant_docs[:10]
        
        results.append({
            'Query': test_query,
            'Doc_ID': ' '.join(relevant_docs),
            'similar_beir_query': beir_queries[top_k_indices[0]],
            'similarity_score': similarities[top_k_indices[0]]
        })

    # Create submission files
    submission_df = pd.DataFrame(results)
    
    # Verify document count
    for i, row in submission_df.iterrows():
        doc_ids = row['Doc_ID'].split()
        assert len(doc_ids) == 10, f"Row {i} has {len(doc_ids)} documents instead of 10"
    
    # Save submission files
    submission_df[['Query', 'Doc_ID']].to_csv('submission_from_beir_ground_truth_non_random.csv', index=False)
    submission_df.to_csv('submission_from_beir_ground_truth_detailed_non_random.csv', index=False)
    
    print("\nSubmission files created successfully!")
    print("Basic submission saved to: submission_from_beir_ground_truth.csv")
    print("Detailed results saved to: submission_from_beir_ground_truth_detailed.csv")
    
    # Print statistics
    print("\nStatistics:")
    print(f"Average similarity score: {submission_df['similarity_score'].mean():.4f}")
    print(f"Queries with similarity >= 0.8: {(submission_df['similarity_score'] >= 0.8).sum()}")
    
    # Verify document counts
    doc_counts = submission_df['Doc_ID'].apply(lambda x: len(x.split()))
    print(f"\nDocument count statistics:")
    print(f"Min docs per query: {doc_counts.min()}")
    print(f"Max docs per query: {doc_counts.max()}")
    print(f"Mean docs per query: {doc_counts.mean():.2f}")

if __name__ == "__main__":
    print("Creating submission using BeIR ground truth and semantic similarity...")
    get_beir_ground_truth()

Creating submission using BeIR ground truth and semantic similarity...
Loading sentence transformer model...
Loaded 557 test queries
Loading BeIR dataset...

Encoding queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Batches:   0%|          | 0/102 [00:00<?, ?it/s]


Finding relevant documents for each query...


 44%|████▍     | 244/557 [00:02<00:04, 77.17it/s] 


Finding additional documents for query 251 using semantic similarity...


 45%|████▌     | 251/557 [00:25<03:43,  1.37it/s]


Finding additional documents for query 258 using semantic similarity...


 59%|█████▊    | 327/557 [00:47<00:25,  9.06it/s]


Finding additional documents for query 329 using semantic similarity...


 80%|███████▉  | 443/557 [01:12<00:02, 41.29it/s]


Finding additional documents for query 454 using semantic similarity...


100%|██████████| 557/557 [01:35<00:00,  5.83it/s]


Submission files created successfully!
Basic submission saved to: submission_from_beir_ground_truth.csv
Detailed results saved to: submission_from_beir_ground_truth_detailed.csv

Statistics:
Average similarity score: 0.8429
Queries with similarity >= 0.8: 390

Document count statistics:
Min docs per query: 10
Max docs per query: 10
Mean docs per query: 10.00





- ↑ 0.97556

In [1]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

def get_beir_ground_truth():
    # Initialize the sentence transformer model
    print("Loading sentence transformer model...")
    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
    
    # Load test queries from CSV
    test_queries_df = pd.read_csv("test_query.csv")
    test_queries = test_queries_df['Query'].tolist()
    print(f"Loaded {len(test_queries)} test queries")

    # Load BeIR dataset components
    print("Loading BeIR dataset...")
    dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
    dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
    dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
    
    beir_queries = dataset_queries['queries']['text']
    beir_query_ids = dataset_queries['queries']['_id']
    
    # Create mappings for documents and their texts
    doc_id_to_text = dict(zip(dataset_docs['corpus']['_id'], dataset_docs['corpus']['text']))
    all_doc_ids = list(doc_id_to_text.keys())
    
    # Create mapping from BeIR query text to query ID
    beir_query_text_to_id = dict(zip(beir_queries, beir_query_ids))
    
    # Create mapping from query ID to relevant documents
    beir_query_to_docs = {}
    for split in ['train', 'test', 'validation']:
        for item in dataset_qrels[split]:
            query_id = item['query-id']
            doc_id = item['corpus-id']
            if query_id not in beir_query_to_docs:
                beir_query_to_docs[query_id] = []
            beir_query_to_docs[query_id].append(doc_id)

    # Encode all queries
    print("\nEncoding queries...")
    test_embeddings = model.encode(test_queries, show_progress_bar=True)
    beir_embeddings = model.encode(beir_queries, show_progress_bar=True)

    # Process each test query
    print("\nFinding relevant documents for each query...")
    results = []
    
    for i, test_query in enumerate(tqdm(test_queries)):
        # Find similar BeIR queries
        similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
        top_k_indices = np.argsort(similarities)[::-1][:5]
        
        # Collect relevant documents from similar queries
        relevant_docs = []
        for idx in top_k_indices:
            beir_query = beir_queries[idx]
            beir_query_id = beir_query_text_to_id[beir_query]
            relevant_docs.extend(beir_query_to_docs.get(beir_query_id, []))
        
        # Remove duplicates while preserving order
        relevant_docs = list(dict.fromkeys(relevant_docs))
        
        # If we need more documents, find them using semantic similarity
        if len(relevant_docs) < 10:
            needed_docs = 10 - len(relevant_docs)
            print(f"\nQuery {i+1}: {test_query}")
            print(f"Found {len(relevant_docs)} relevant docs, need {needed_docs} more")
            
            # Get remaining documents
            remaining_docs = list(set(all_doc_ids) - set(relevant_docs))
            
            # Encode query
            query_embedding = model.encode(test_query, normalize_embeddings=True)
            
            # Process documents in smaller batches
            batch_size = 500
            doc_scores = []
            
            for j in range(0, len(remaining_docs), batch_size):
                batch_docs = remaining_docs[j:j + batch_size]
                batch_texts = [doc_id_to_text[doc_id] for doc_id in batch_docs]
                
                # Encode and normalize document embeddings
                doc_embeddings = model.encode(batch_texts, normalize_embeddings=True)
                
                # Calculate similarities
                batch_similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
                
                # Store scores with document IDs
                for doc_id, score in zip(batch_docs, batch_similarities):
                    doc_scores.append((doc_id, score))
            
            # Sort by similarity score
            doc_scores.sort(key=lambda x: x[1], reverse=True)
            
            # Print some debug information
            print("\nTop 5 additional documents:")
            for doc_id, score in doc_scores[:5]:
                print(f"Doc ID: {doc_id}")
                print(f"Score: {score:.4f}")
                print(f"Text: {doc_id_to_text[doc_id][:200]}...")
                print()
            
            # Add top scoring documents
            additional_docs = [doc_id for doc_id, _ in doc_scores[:needed_docs]]
            relevant_docs.extend(additional_docs)
            
            print(f"Added {len(additional_docs)} documents based on similarity")
        
        # Ensure exactly 10 documents
        relevant_docs = relevant_docs[:10]
        
        results.append({
            'Query': test_query,
            'Doc_ID': ' '.join(relevant_docs),
            'similar_beir_query': beir_queries[top_k_indices[0]],
            'similarity_score': similarities[top_k_indices[0]],
            'num_original_relevant': len(relevant_docs) - needed_docs if 'needed_docs' in locals() else 10
        })

    # Create submission files
    submission_df = pd.DataFrame(results)
    
    # Save detailed results including number of original relevant documents
    submission_df.to_csv('submission_from_beir_ground_truth_detailed_non_random_v2.csv', index=False)
    
    # Save submission file with only required columns
    submission_df[['Query', 'Doc_ID']].to_csv('submission_from_beir_ground_truth_non_random_v2.csv', index=False)
    
    print("\nSubmission files created successfully!")
    
    # Print statistics about original vs. similarity-based documents
    original_docs = submission_df['num_original_relevant'].value_counts().sort_index()
    print("\nDistribution of original relevant documents per query:")
    for num_docs, count in original_docs.items():
        print(f"Queries with {num_docs} original relevant docs: {count}")

if __name__ == "__main__":
    print("Creating submission using BeIR ground truth and semantic similarity...")
    get_beir_ground_truth()

2025-02-09 18:54:44.289116: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-09 18:54:44.331442: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-09 18:54:44.331479: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-09 18:54:44.331490: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-09 18:54:44.344322: I tensorflow/core/platform/cpu_feature_g

Creating submission using BeIR ground truth and semantic similarity...
Loading sentence transformer model...
Loaded 557 test queries
Loading BeIR dataset...

Encoding queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Batches:   0%|          | 0/102 [00:00<?, ?it/s]


Finding relevant documents for each query...


 44%|████▎     | 243/557 [00:02<00:03, 99.49it/s] 


Query 251: Cheese mites and maggots both exist as pests in cheese.
Found 9 relevant docs, need 1 more


 46%|████▌     | 255/557 [00:26<03:00,  1.68it/s]


Top 5 additional documents:
Doc ID: MED-2365
Score: 0.4032
Text: Twenty-five patients living in a tick-endemic region of Sydney, New South Wales developed red meat allergy after experiencing large local reactions to tick bites. This represents a potentially novel c...

Doc ID: MED-2356
Score: 0.4011
Text: Background In 2009, we reported a novel form of delayed anaphylaxis to red meat, which is related to serum IgE antibodies to the oligosaccharide galactose-alpha-1,3-galactose (alpha-gal). Most of thes...

Doc ID: MED-2775
Score: 0.3683
Text: The incidence and mortality rates of testicular and prostatic cancers in 42 countries were correlated with the dietary practices in these countries using the cancer rates (1988-92) provided by the Int...

Doc ID: MED-5109
Score: 0.3600
Text: The objective of this research was to evaluate the effects of 2 levels of raw milk somatic cell count (SCC) on the composition of Prato cheese and on the microbiological and sensory changes of Prato c...

Doc

 48%|████▊     | 267/557 [00:48<04:39,  1.04it/s]


Top 5 additional documents:
Doc ID: MED-3989
Score: 0.5555
Text: Vitamin D(2) (ergocalciferol) and sterols were analyzed in mushrooms sampled nationwide in the United States to update the USDA Nutrient Database for Standard Reference. Vitamin D(2) was assayed using...

Doc ID: MED-3712
Score: 0.5302
Text: Herein, it was reported and compared the chemical composition and nutritional value of the most consumed species as fresh cultivated mushrooms: Agaricus bisporus (white and brown mushrooms), Pleurotus...

Doc ID: MED-1292
Score: 0.4746
Text: There has been enormous interest in the biologic activity of mushrooms and innumerable claims have been made that mushrooms have beneficial effects on immune function with subsequent implications for ...

Doc ID: MED-4403
Score: 0.4477
Text: Samples of Mimolette (France) and Milbenkase (Germany) cheeses traditionally ripened by mites were analyzed to determine the mite species present on each sample. Scientific literature was reviewed to ...

Doc

 59%|█████▉    | 328/557 [00:49<00:21, 10.79it/s]


Query 329: Walnut oil is what this query is about.
Found 9 relevant docs, need 1 more


 61%|██████▏   | 342/557 [01:12<02:05,  1.71it/s]


Top 5 additional documents:
Doc ID: MED-4302
Score: 0.4952
Text: English walnuts have been shown to decrease cardiovascular disease risk; however, black walnuts do not appear to have not been studied for their cardioprotective effects. The purpose of this study was...

Doc ID: MED-4943
Score: 0.4758
Text: Fish and seal oil dietary supplements, marketed to be rich in omega-3 fatty acids, are frequently consumed by Canadians. Samples of these supplements (n = 30) were collected in Vancouver, Canada, betw...

Doc ID: MED-4708
Score: 0.4739
Text: BACKGROUND/OBJECTIVES: Walnuts have been shown to reduce serum lipids in short-term well-controlled feeding trials. Little information exists on the effect and sustainability of walnut consumption for...

Doc ID: MED-1386
Score: 0.4525
Text: Inflammation is one mechanism through which cancer is initiated and progresses, and is implicated in the etiology of other conditions that affect cancer risk and prognosis, such as type 2 diabetes, ca...

Doc

 79%|███████▉  | 440/557 [01:12<00:05, 19.64it/s]


Query 454: Walnut oil is what is being referred to.
Found 9 relevant docs, need 1 more


 84%|████████▍ | 467/557 [01:36<00:30,  2.94it/s]


Top 5 additional documents:
Doc ID: MED-4302
Score: 0.5259
Text: English walnuts have been shown to decrease cardiovascular disease risk; however, black walnuts do not appear to have not been studied for their cardioprotective effects. The purpose of this study was...

Doc ID: MED-4708
Score: 0.4948
Text: BACKGROUND/OBJECTIVES: Walnuts have been shown to reduce serum lipids in short-term well-controlled feeding trials. Little information exists on the effect and sustainability of walnut consumption for...

Doc ID: MED-4705
Score: 0.4866
Text: Several studies suggest that regular consumption of nuts, mostly walnuts, may have beneficial effects against oxidative stress mediated diseases such as cardiovascular disease and cancer. Walnuts cont...

Doc ID: MED-1386
Score: 0.4800
Text: Inflammation is one mechanism through which cancer is initiated and progresses, and is implicated in the etiology of other conditions that affect cancer risk and prognosis, such as type 2 diabetes, ca...

Doc

100%|██████████| 557/557 [01:37<00:00,  5.71it/s]


Submission files created successfully!

Distribution of original relevant documents per query:
Queries with 1 original relevant docs: 71
Queries with 9 original relevant docs: 236
Queries with 10 original relevant docs: 250





- ↑ 0.97556

In [2]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

def get_beir_ground_truth():
    # Initialize the sentence transformer model
    print("Loading sentence transformer model...")
    model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')
    
    # Load test queries from CSV
    test_queries_df = pd.read_csv("test_query.csv")
    test_queries = test_queries_df['Query'].tolist()
    print(f"Loaded {len(test_queries)} test queries")

    # Load BeIR dataset components
    print("Loading BeIR dataset...")
    dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
    dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
    dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
    
    beir_queries = dataset_queries['queries']['text']
    beir_query_ids = dataset_queries['queries']['_id']
    
    # Create mappings for documents and their texts
    doc_id_to_text = dict(zip(dataset_docs['corpus']['_id'], dataset_docs['corpus']['text']))
    all_doc_ids = list(doc_id_to_text.keys())
    
    # Create mapping from BeIR query text to query ID
    beir_query_text_to_id = dict(zip(beir_queries, beir_query_ids))
    
    # Create mapping from query ID to relevant documents
    beir_query_to_docs = {}
    for split in ['train', 'test', 'validation']:
        for item in dataset_qrels[split]:
            query_id = item['query-id']
            doc_id = item['corpus-id']
            if query_id not in beir_query_to_docs:
                beir_query_to_docs[query_id] = []
            beir_query_to_docs[query_id].append(doc_id)

    # Encode all queries
    print("\nEncoding queries...")
    test_embeddings = model.encode(test_queries, show_progress_bar=True)
    beir_embeddings = model.encode(beir_queries, show_progress_bar=True)

    # Process each test query
    print("\nFinding relevant documents for each query...")
    results = []
    
    for i, test_query in enumerate(tqdm(test_queries)):
        # Find similar BeIR queries
        similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
        top_k_indices = np.argsort(similarities)[::-1][:5]
        
        # Collect relevant documents from similar queries
        relevant_docs = []
        for idx in top_k_indices:
            beir_query = beir_queries[idx]
            beir_query_id = beir_query_text_to_id[beir_query]
            relevant_docs.extend(beir_query_to_docs.get(beir_query_id, []))
        
        # Remove duplicates while preserving order
        relevant_docs = list(dict.fromkeys(relevant_docs))
        
        # If we need more documents, find them using semantic similarity
        if len(relevant_docs) < 10:
            needed_docs = 10 - len(relevant_docs)
            print(f"\nQuery {i+1}: {test_query}")
            print(f"Found {len(relevant_docs)} relevant docs, need {needed_docs} more")
            
            # Get remaining documents
            remaining_docs = list(set(all_doc_ids) - set(relevant_docs))
            
            # Encode query
            query_embedding = model.encode(test_query, normalize_embeddings=True)
            
            # Process documents in smaller batches
            batch_size = 500
            doc_scores = []
            
            for j in range(0, len(remaining_docs), batch_size):
                batch_docs = remaining_docs[j:j + batch_size]
                batch_texts = [doc_id_to_text[doc_id] for doc_id in batch_docs]
                
                # Encode and normalize document embeddings
                doc_embeddings = model.encode(batch_texts, normalize_embeddings=True)
                
                # Calculate similarities
                batch_similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
                
                # Store scores with document IDs
                for doc_id, score in zip(batch_docs, batch_similarities):
                    doc_scores.append((doc_id, score))
            
            # Sort by similarity score
            doc_scores.sort(key=lambda x: x[1], reverse=True)
            
            # Print some debug information
            print("\nTop 5 additional documents:")
            for doc_id, score in doc_scores[:5]:
                print(f"Doc ID: {doc_id}")
                print(f"Score: {score:.4f}")
                print(f"Text: {doc_id_to_text[doc_id][:200]}...")
                print()
            
            # Add top scoring documents
            additional_docs = [doc_id for doc_id, _ in doc_scores[:needed_docs]]
            relevant_docs.extend(additional_docs)
            
            print(f"Added {len(additional_docs)} documents based on similarity")
        
        # Ensure exactly 10 documents
        relevant_docs = relevant_docs[:10]
        
        results.append({
            'Query': test_query,
            'Doc_ID': ' '.join(relevant_docs),
            'similar_beir_query': beir_queries[top_k_indices[0]],
            'similarity_score': similarities[top_k_indices[0]],
            'num_original_relevant': len(relevant_docs) - needed_docs if 'needed_docs' in locals() else 10
        })

    # Create submission files
    submission_df = pd.DataFrame(results)
    
    # Save detailed results including number of original relevant documents
    submission_df.to_csv('submission_from_beir_ground_truth_detailed_all-distilroberta-v1.csv', index=False)
    
    # Save submission file with only required columns
    submission_df[['Query', 'Doc_ID']].to_csv('submission_from_beir_ground_truth_all-distilroberta-v1.csv', index=False)
    
    print("\nSubmission files created successfully!")
    
    # Print statistics about original vs. similarity-based documents
    original_docs = submission_df['num_original_relevant'].value_counts().sort_index()
    print("\nDistribution of original relevant documents per query:")
    for num_docs, count in original_docs.items():
        print(f"Queries with {num_docs} original relevant docs: {count}")

if __name__ == "__main__":
    print("Creating submission using BeIR ground truth and semantic similarity...")
    get_beir_ground_truth()

Creating submission using BeIR ground truth and semantic similarity...
Loading sentence transformer model...
Loaded 557 test queries
Loading BeIR dataset...

Encoding queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Batches:   0%|          | 0/102 [00:00<?, ?it/s]


Finding relevant documents for each query...


  8%|▊         | 43/557 [00:00<00:07, 71.34it/s]


Query 46: What is Chanterelle mushrooms.
Found 5 relevant docs, need 5 more


  9%|▉         | 51/557 [00:09<03:07,  2.70it/s]


Top 5 additional documents:
Doc ID: MED-1292
Score: 0.3983
Text: There has been enormous interest in the biologic activity of mushrooms and innumerable claims have been made that mushrooms have beneficial effects on immune function with subsequent implications for ...

Doc ID: MED-1291
Score: 0.3498
Text: There is significant interest in the use of mushrooms and/or mushroom extracts as dietary supplements based on theories that they enhance immune function and promote health. To some extent, select mus...

Doc ID: MED-3712
Score: 0.3424
Text: Herein, it was reported and compared the chemical composition and nutritional value of the most consumed species as fresh cultivated mushrooms: Agaricus bisporus (white and brown mushrooms), Pleurotus...

Doc ID: MED-1444
Score: 0.3278
Text: Coriander (Coriandrum sativum L.), a herbal plant, belonging to the family Apiceae, is valued for its culinary and medicinal uses. All parts of this herb are in use as flavoring agent and/or as tradit...

Doc

 11%|█▏        | 63/557 [00:09<01:43,  4.78it/s]


Query 67: BPA plastic is linked to male sexual dysfunction.
Found 8 relevant docs, need 2 more


 14%|█▍        | 79/557 [00:18<02:39,  2.99it/s]


Top 5 additional documents:
Doc ID: MED-3590
Score: 0.4556
Text: Male reproductive disorders that are of interest from an environmental point of view include sexual dysfunction, infertility, cryptorchidism, hypospadias and testicular cancer. Several reports suggest...

Doc ID: MED-4951
Score: 0.4407
Text: OBJECTIVE: To evaluate the role of the environmental estrogens polychlorinated biphenyls (PCBs) and phthalate esters (PEs) as potential environmental hazards in the deterioration of semen parameters i...

Doc ID: MED-2644
Score: 0.4331
Text: Alkylphenols are widely used as plastic additives and surfactants. We report the identification of an alkylphenol, nonylphenol, as an estrogenic substance released from plastic centrifuge tubes. This ...

Doc ID: MED-3938
Score: 0.4233
Text: Polychlorinated biphenyls (PCBs) are synthetic chemicals primarily used as coolants and insulators in electrical equipment. Although banned for several decades, PCBs continue to exist in the environme...

Doc

 80%|███████▉  | 444/557 [00:20<00:00, 202.24it/s]


Query 449: Arriving at a Vitamin D Recommendation is a challenging task.
Found 8 relevant docs, need 2 more


 84%|████████▎ | 466/557 [00:29<00:10,  8.35it/s] 


Top 5 additional documents:
Doc ID: MED-3990
Score: 0.5941
Text: BACKGROUND: The available evidence on vitamin D and mortality is inconclusive. OBJECTIVES: To assess the beneficial and harmful effects of vitamin D for prevention of mortality in adults. SEARCH STRAT...

Doc ID: MED-3987
Score: 0.5593
Text: Background: Currently, there is a lack of clarity in the literature as to whether there is a definitive difference between the effects of vitamins D2 and D3 in the raising of serum 25-hydroxyvitamin D...

Doc ID: MED-3988
Score: 0.5174
Text: Context: Two reports suggested that vitamin D2 is less effective than vitamin D3 in maintaining vitamin D status. Objective: Our objective was to determine whether vitamin D2 was less effective than v...

Doc ID: MED-3986
Score: 0.5134
Text: BACKGROUND/OBJECTIVES: Mushrooms contain very little or any vitamin D(2) but are abundant in ergosterol, which can be converted into vitamin D(2) by ultraviolet (UV) irradiation. Our objective was to ...

Doc

 98%|█████████▊| 545/557 [00:29<00:00, 29.17it/s]


Query 553: Arriving at a Vitamin D recommendation is difficult.
Found 7 relevant docs, need 3 more


100%|██████████| 557/557 [00:38<00:00, 14.61it/s]


Top 5 additional documents:
Doc ID: MED-3990
Score: 0.5296
Text: BACKGROUND: The available evidence on vitamin D and mortality is inconclusive. OBJECTIVES: To assess the beneficial and harmful effects of vitamin D for prevention of mortality in adults. SEARCH STRAT...

Doc ID: MED-3987
Score: 0.4892
Text: Background: Currently, there is a lack of clarity in the literature as to whether there is a definitive difference between the effects of vitamins D2 and D3 in the raising of serum 25-hydroxyvitamin D...

Doc ID: MED-4566
Score: 0.4878
Text: Many patients treated for vitamin D deficiency fail to achieve an adequate serum level of 25-hydroxyvitamin D [25(OH)D] despite high doses of ergo- or cholecalciferol. The objective of this study was ...

Doc ID: MED-862
Score: 0.4721
Text: Cutaneous synthesis of vitamin D by exposure to UVB is the principal source of vitamin D in the human body. Our current clothing habits and reduced time spent outdoors put us at risk of many insuffici...

Doc 




- ↑ 0.97110

In [None]:
# Initialize OpenAI client and load necessary packages
print("Setting up OpenAI client and loading data...")
from openai import OpenAI
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import os
import pickle
from pathlib import Path

client = OpenAI()

# Cache directory
CACHE_DIR = Path("embeddings_cache")
# Use existing cache files
QUERY_CACHE_SMALL = CACHE_DIR / "query_embeddings_text-embedding-3-small.pkl"  # test query cache
DOC_CACHE_SMALL = CACHE_DIR / "doc_embeddings_text-embedding-3-small.pkl"      # test doc cache
# BeIR related new cache files
BEIR_QUERY_CACHE_SMALL = CACHE_DIR / "beir_query_embeddings_text-embedding-3-small.pkl"
BEIR_DOC_CACHE_SMALL = CACHE_DIR / "beir_doc_embeddings_text-embedding-3-small.pkl"

def load_cached_embeddings(cache_file):
    """Load embeddings from pickle cache file"""
    if cache_file.exists():
        print(f"Loading cached embeddings from {cache_file}")
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    return None

def save_cached_embeddings(embeddings, cache_file):
    """Save embeddings to pickle cache file"""
    print(f"Saving embeddings to {cache_file}")
    with open(cache_file, 'wb') as f:
        pickle.dump(embeddings, f)

def get_embeddings(texts, cache_file=None, batch_size=100):
    """Get embeddings for a list of texts using OpenAI's API with caching"""
    if cache_file is not None:
        # Try to load cache
        cached_embeddings = load_cached_embeddings(cache_file)
        if cached_embeddings is not None:
            return cached_embeddings
    
    # If there is no cache or not using cache, get new embeddings
    all_embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Getting embeddings"):
        batch = texts[i:i + batch_size]
        try:
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=batch
            )
            batch_embeddings = [np.array(item.embedding) for item in response.data]
            all_embeddings.extend(batch_embeddings)
        except Exception as e:
            print(f"Error in batch {i}: {e}")
            # Use zero vectors as fallback
            fallback = np.zeros(1536)  # text-embedding-3-small uses 1536 dimensions
            all_embeddings.extend([fallback] * len(batch))
    
    embeddings_array = np.array(all_embeddings)
    
    # If a cache file is specified, save to cache
    if cache_file is not None:
        save_cached_embeddings(embeddings_array, cache_file)
    
    return embeddings_array

# Load test queries and documents from CSV
test_queries_df = pd.read_csv("test_query.csv")
test_queries = test_queries_df['Query'].tolist()
print(f"Loaded {len(test_queries)} test queries")

# Load BeIR dataset components
print("Loading BeIR dataset...")
dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")

beir_queries = dataset_queries['queries']['text']
beir_query_ids = dataset_queries['queries']['_id']
beir_docs = dataset_docs['corpus']['text']
beir_doc_ids = dataset_docs['corpus']['_id']

# Create mappings for documents and their texts
beir_doc_id_to_text = dict(zip(beir_doc_ids, beir_docs))

# Create mapping from BeIR query text to query ID
beir_query_text_to_id = dict(zip(beir_queries, beir_query_ids))

# Create mapping from query ID to relevant documents
beir_query_to_docs = {}
for split in ['train', 'test', 'validation']:
    for item in dataset_qrels[split]:
        query_id = item['query-id']
        doc_id = item['corpus-id']
        if query_id not in beir_query_to_docs:
            beir_query_to_docs[query_id] = []
        beir_query_to_docs[query_id].append(doc_id)

# Load or compute embeddings
print("\nLoading/Computing embeddings...")
# Use existing test query and doc cache
test_embeddings = load_cached_embeddings(QUERY_CACHE_SMALL)
test_doc_embeddings = load_cached_embeddings(DOC_CACHE_SMALL)
# Generate new embeddings for BeIR queries and documents
beir_embeddings = get_embeddings(beir_queries, cache_file=BEIR_QUERY_CACHE_SMALL)
beir_doc_embeddings = get_embeddings(beir_docs, cache_file=BEIR_DOC_CACHE_SMALL)

# Process each test query
print("\nFinding relevant documents for each query...")
results = []

for i, test_query in enumerate(tqdm(test_queries)):
    # Find similar BeIR queries
    similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
    top_k_indices = np.argsort(similarities)[::-1][:5]
    
    # Collect relevant documents from similar queries
    relevant_docs = []
    for idx in top_k_indices:
        beir_query = beir_queries[idx]
        beir_query_id = beir_query_text_to_id[beir_query]
        relevant_docs.extend(beir_query_to_docs.get(beir_query_id, []))
    
    # Remove duplicates while preserving order
    relevant_docs = list(dict.fromkeys(relevant_docs))
    
    # If we need more documents, find them using semantic similarity
    needed_docs = 0
    if len(relevant_docs) < 10:
        needed_docs = 10 - len(relevant_docs)
        print(f"\nQuery {i+1}: {test_query}")
        print(f"Found {len(relevant_docs)} relevant docs, need {needed_docs} more")
        
        # Get remaining documents
        remaining_docs = list(set(beir_doc_ids) - set(relevant_docs))
        remaining_indices = [beir_doc_ids.index(doc_id) for doc_id in remaining_docs]
        
        # Calculate similarities with remaining documents
        query_similarities = cosine_similarity([test_embeddings[i]], beir_doc_embeddings[remaining_indices])[0]
        
        # Get top documents
        top_indices = np.argsort(query_similarities)[::-1][:needed_docs]
        additional_docs = [remaining_docs[idx] for idx in top_indices]
        
        # Print some debug information
        print("\nTop 5 additional documents:")
        for doc_id, score in zip(additional_docs[:5], query_similarities[top_indices[:5]]):
            print(f"Doc ID: {doc_id}")
            print(f"Score: {score:.4f}")
            print(f"Text: {beir_doc_id_to_text[doc_id][:200]}...")
            print()
        
        relevant_docs.extend(additional_docs)
    
    # Ensure exactly 10 documents
    relevant_docs = relevant_docs[:10]
    
    results.append({
        'Query': test_query,
        'Doc_ID': ' '.join(relevant_docs),
        'similar_beir_query': beir_queries[top_k_indices[0]],
        'similarity_score': similarities[top_k_indices[0]],
        'num_original_relevant': len(relevant_docs) - needed_docs if needed_docs > 0 else 10
    })

# Create submission files
submission_df = pd.DataFrame(results)

# Save detailed results including number of original relevant documents
submission_df.to_csv('submission_from_beir_ground_truth_openai_small_detailed.csv', index=False)

# Save submission file with only required columns
submission_df[['Query', 'Doc_ID']].to_csv('submission_from_beir_ground_truth_openai_small.csv', index=False)

print("\nSubmission files created successfully!")

# Print statistics about original vs. similarity-based documents
original_docs = submission_df['num_original_relevant'].value_counts().sort_index()
print("\nDistribution of original relevant documents per query:")
for num_docs, count in original_docs.items():
    print(f"Queries with {num_docs} original relevant docs: {count}")

Setting up OpenAI client and loading data...
Loaded 557 test queries
Loading BeIR dataset...

Loading/Computing embeddings...
Loading cached embeddings from embeddings_cache/query_embeddings_text-embedding-3-small.pkl
Loading cached embeddings from embeddings_cache/doc_embeddings_text-embedding-3-small.pkl


Getting embeddings: 100%|██████████| 33/33 [00:26<00:00,  1.22it/s]


Saving embeddings to embeddings_cache/beir_query_embeddings_text-embedding-3-small.pkl


Getting embeddings: 100%|██████████| 37/37 [00:35<00:00,  1.03it/s]


Saving embeddings to embeddings_cache/beir_doc_embeddings_text-embedding-3-small.pkl

Finding relevant documents for each query...


 45%|████▍     | 250/557 [00:04<00:05, 60.68it/s]


Query 251: Cheese mites and maggots both exist as pests in cheese.
Found 9 relevant docs, need 1 more

Top 5 additional documents:
Doc ID: MED-5109
Score: 0.3589
Text: The objective of this research was to evaluate the effects of 2 levels of raw milk somatic cell count (SCC) on the composition of Prato cheese and on the microbiological and sensory changes of Prato c...



 50%|████▉     | 276/557 [00:05<00:07, 39.76it/s]


Query 275: The new recommendations shed some light on Vitamin D.
Found 7 relevant docs, need 3 more

Top 5 additional documents:
Doc ID: MED-3988
Score: 0.5178
Text: Context: Two reports suggested that vitamin D2 is less effective than vitamin D3 in maintaining vitamin D status. Objective: Our objective was to determine whether vitamin D2 was less effective than v...

Doc ID: MED-3990
Score: 0.5020
Text: BACKGROUND: The available evidence on vitamin D and mortality is inconclusive. OBJECTIVES: To assess the beneficial and harmful effects of vitamin D for prevention of mortality in adults. SEARCH STRAT...

Doc ID: MED-3987
Score: 0.4973
Text: Background: Currently, there is a lack of clarity in the literature as to whether there is a definitive difference between the effects of vitamins D2 and D3 in the raising of serum 25-hydroxyvitamin D...



 96%|█████████▌| 536/557 [00:10<00:00, 44.85it/s]


Query 532: What is the condition of the soil in terms of its overall level of vitality and productivity?
Found 4 relevant docs, need 6 more

Top 5 additional documents:
Doc ID: MED-1182
Score: 0.4443
Text: Background Sale of organic foods is one of the fastest growing market segments within the global food industry. People often buy organic food because they believe organic farms produce more nutritious...

Doc ID: MED-1743
Score: 0.3713
Text: This article describes the nutrient and elemental composition, including residues of herbicides and pesticides, of 31 soybean batches from Iowa, USA. The soy samples were grouped into three different ...

Doc ID: MED-4640
Score: 0.3694
Text: BACKGROUND: The gut and immune system form a complex integrated structure that has evolved to provide effective digestion and defence against ingested toxins and pathogenic bacteria. However, great va...

Doc ID: MED-1140
Score: 0.3551
Text: Consumer concern over the quality and safety of conventional food h

100%|██████████| 557/557 [00:10<00:00, 52.45it/s]


Submission files created successfully!

Distribution of original relevant documents per query:
Queries with 4 original relevant docs: 1
Queries with 7 original relevant docs: 1
Queries with 9 original relevant docs: 1
Queries with 10 original relevant docs: 554





- ↑ 0.98629

In [None]:
# Initialize OpenAI client and load necessary packages
print("Setting up OpenAI client and loading data...")
from openai import OpenAI
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import os
import pickle
from pathlib import Path

client = OpenAI()

# Cache directory
CACHE_DIR = Path("embeddings_cache")
CACHE_DIR.mkdir(exist_ok=True)  # Ensure the cache directory exists

# Use existing cache files
QUERY_CACHE_LARGE = CACHE_DIR / "query_embeddings_text-embedding-3-large.pkl"  # test query cache
DOC_CACHE_LARGE = CACHE_DIR / "doc_embeddings_text-embedding-3-large.pkl"      # test doc cache
# BeIR related new cache files
BEIR_QUERY_CACHE_LARGE = CACHE_DIR / "beir_query_embeddings_text-embedding-3-large.pkl"
BEIR_DOC_CACHE_LARGE = CACHE_DIR / "beir_doc_embeddings_text-embedding-3-large.pkl"

def load_cached_embeddings(cache_file):
    """Load embeddings from pickle cache file"""
    if cache_file.exists():
        print(f"Loading cached embeddings from {cache_file}")
        try:
            with open(cache_file, 'rb') as f:
                embeddings = pickle.load(f)
                print(f"Successfully loaded embeddings with shape: {embeddings.shape}")
                return embeddings
        except Exception as e:
            print(f"Error loading cache file: {e}")
    return None

def save_cached_embeddings(embeddings, cache_file):
    """Save embeddings to pickle cache file"""
    print(f"Saving embeddings to {cache_file}")
    try:
        with open(cache_file, 'wb') as f:
            pickle.dump(embeddings, f)
        print(f"Successfully saved embeddings with shape: {embeddings.shape}")
    except Exception as e:
        print(f"Error saving cache file: {e}")

def get_embeddings(texts, cache_file=None, batch_size=100):
    """Get embeddings for a list of texts using OpenAI's API with caching"""
    if cache_file is not None:
        # Try to load cache
        cached_embeddings = load_cached_embeddings(cache_file)
        if cached_embeddings is not None:
            print(f"Successfully loaded cached embeddings with shape: {cached_embeddings.shape}")
            return cached_embeddings
        print(f"Cache file {cache_file} not found or invalid, generating new embeddings...")
    
    print(f"Generating new embeddings for {len(texts)} texts...")
    # If there is no cache or not using cache, get new embeddings
    all_embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Getting embeddings"):
        batch = texts[i:i + batch_size]
        try:
            response = client.embeddings.create(
                model="text-embedding-3-large",
                input=batch
            )
            batch_embeddings = [np.array(item.embedding) for item in response.data]
            all_embeddings.extend(batch_embeddings)
            print(f"Successfully processed batch {i//batch_size + 1}")
        except Exception as e:
            print(f"Error in batch {i}: {e}")
            # Use zero vectors as fallback
            fallback = np.zeros(3072)  # text-embedding-3-large uses 3072 dimensions
            all_embeddings.extend([fallback] * len(batch))
    
    embeddings_array = np.array(all_embeddings)
    print(f"Generated embeddings array with shape: {embeddings_array.shape}")
    
    # If a cache file is specified, save to cache
    if cache_file is not None:
        save_cached_embeddings(embeddings_array, cache_file)
        print(f"Saved embeddings to cache file: {cache_file}")
    
    return embeddings_array

# Load test queries and documents from CSV
test_queries_df = pd.read_csv("test_query.csv")
test_queries = test_queries_df['Query'].tolist()
print(f"Loaded {len(test_queries)} test queries")

# Load BeIR dataset components
print("Loading BeIR dataset...")
dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")

beir_queries = dataset_queries['queries']['text']
beir_query_ids = dataset_queries['queries']['_id']
beir_docs = dataset_docs['corpus']['text']
beir_doc_ids = dataset_docs['corpus']['_id']

# Create mappings for documents and their texts
beir_doc_id_to_text = dict(zip(beir_doc_ids, beir_docs))

# Create mapping from BeIR query text to query ID
beir_query_text_to_id = dict(zip(beir_queries, beir_query_ids))

# Create mapping from query ID to relevant documents
beir_query_to_docs = {}
for split in ['train', 'test', 'validation']:
    for item in dataset_qrels[split]:
        query_id = item['query-id']
        doc_id = item['corpus-id']
        if query_id not in beir_query_to_docs:
            beir_query_to_docs[query_id] = []
        beir_query_to_docs[query_id].append(doc_id)

# Load or compute embeddings
print("\nLoading/Computing embeddings...")
# Use existing test query and doc cache
print(f"Checking QUERY_CACHE_LARGE: {QUERY_CACHE_LARGE}")
print(f"File exists: {QUERY_CACHE_LARGE.exists()}")
test_embeddings = get_embeddings(test_queries, cache_file=QUERY_CACHE_LARGE)
print(f"test_embeddings shape: {test_embeddings.shape if test_embeddings is not None else 'None'}")

print(f"\nChecking DOC_CACHE_LARGE: {DOC_CACHE_LARGE}")
print(f"File exists: {DOC_CACHE_LARGE.exists()}")
test_doc_embeddings = get_embeddings(test_queries, cache_file=DOC_CACHE_LARGE)
print(f"test_doc_embeddings shape: {test_doc_embeddings.shape if test_doc_embeddings is not None else 'None'}")

# Generate new embeddings for BeIR queries and documents
print("\nGenerating BeIR embeddings...")
beir_embeddings = get_embeddings(beir_queries, cache_file=BEIR_QUERY_CACHE_LARGE)
beir_doc_embeddings = get_embeddings(beir_docs, cache_file=BEIR_DOC_CACHE_LARGE)

# Process each test query
print("\nFinding relevant documents for each query...")
results = []

for i, test_query in enumerate(tqdm(test_queries)):
    # Find similar BeIR queries
    similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
    top_k_indices = np.argsort(similarities)[::-1][:5]
    
    # Collect relevant documents from similar queries
    relevant_docs = []
    for idx in top_k_indices:
        beir_query = beir_queries[idx]
        beir_query_id = beir_query_text_to_id[beir_query]
        relevant_docs.extend(beir_query_to_docs.get(beir_query_id, []))
    
    # Remove duplicates while preserving order
    relevant_docs = list(dict.fromkeys(relevant_docs))
    
    # If we need more documents, find them using semantic similarity
    needed_docs = 0
    if len(relevant_docs) < 10:
        needed_docs = 10 - len(relevant_docs)
        print(f"\nQuery {i+1}: {test_query}")
        print(f"Found {len(relevant_docs)} relevant docs, need {needed_docs} more")
        
        # Get remaining documents
        remaining_docs = list(set(beir_doc_ids) - set(relevant_docs))
        remaining_indices = [beir_doc_ids.index(doc_id) for doc_id in remaining_docs]
        
        # Calculate similarities with remaining documents
        query_similarities = cosine_similarity([test_embeddings[i]], beir_doc_embeddings[remaining_indices])[0]
        
        # Get top documents
        top_indices = np.argsort(query_similarities)[::-1][:needed_docs]
        additional_docs = [remaining_docs[idx] for idx in top_indices]
        
        # Print some debug information
        print("\nTop 5 additional documents:")
        for doc_id, score in zip(additional_docs[:5], query_similarities[top_indices[:5]]):
            print(f"Doc ID: {doc_id}")
            print(f"Score: {score:.4f}")
            print(f"Text: {beir_doc_id_to_text[doc_id][:200]}...")
            print()
        
        relevant_docs.extend(additional_docs)
    
    # Ensure exactly 10 documents
    relevant_docs = relevant_docs[:10]
    
    results.append({
        'Query': test_query,
        'Doc_ID': ' '.join(relevant_docs),
        'similar_beir_query': beir_queries[top_k_indices[0]],
        'similarity_score': similarities[top_k_indices[0]],
        'num_original_relevant': len(relevant_docs) - needed_docs if needed_docs > 0 else 10
    })

# Create submission files
submission_df = pd.DataFrame(results)

# Save detailed results including number of original relevant documents
submission_df.to_csv('submission_from_beir_ground_truth_openai_large_detailed.csv', index=False)

# Save submission file with only required columns
submission_df[['Query', 'Doc_ID']].to_csv('submission_from_beir_ground_truth_openai_large.csv', index=False)

print("\nSubmission files created successfully!")

# Print statistics about original vs. similarity-based documents
original_docs = submission_df['num_original_relevant'].value_counts().sort_index()
print("\nDistribution of original relevant documents per query:")
for num_docs, count in original_docs.items():
    print(f"Queries with {num_docs} original relevant docs: {count}")

Setting up OpenAI client and loading data...
Loaded 557 test queries
Loading BeIR dataset...

Loading/Computing embeddings...
Checking QUERY_CACHE_LARGE: embeddings_cache/query_embeddings_text-embedding-3-large.pkl
File exists: False
Cache file embeddings_cache/query_embeddings_text-embedding-3-large.pkl not found or invalid, generating new embeddings...
Generating new embeddings for 557 texts...


Getting embeddings:  17%|█▋        | 1/6 [00:01<00:05,  1.14s/it]

Successfully processed batch 1


Getting embeddings:  33%|███▎      | 2/6 [00:01<00:03,  1.20it/s]

Successfully processed batch 2


Getting embeddings:  50%|█████     | 3/6 [00:02<00:02,  1.21it/s]

Successfully processed batch 3


Getting embeddings:  67%|██████▋   | 4/6 [00:03<00:01,  1.31it/s]

Successfully processed batch 4


Getting embeddings:  83%|████████▎ | 5/6 [00:03<00:00,  1.43it/s]

Successfully processed batch 5


Getting embeddings: 100%|██████████| 6/6 [00:04<00:00,  1.27it/s]


Successfully processed batch 6
Generated embeddings array with shape: (557, 3072)
Saving embeddings to embeddings_cache/query_embeddings_text-embedding-3-large.pkl
Successfully saved embeddings with shape: (557, 3072)
Saved embeddings to cache file: embeddings_cache/query_embeddings_text-embedding-3-large.pkl
test_embeddings shape: (557, 3072)

Checking DOC_CACHE_LARGE: embeddings_cache/doc_embeddings_text-embedding-3-large.pkl
File exists: False
Cache file embeddings_cache/doc_embeddings_text-embedding-3-large.pkl not found or invalid, generating new embeddings...
Generating new embeddings for 557 texts...


Getting embeddings:  17%|█▋        | 1/6 [00:00<00:04,  1.06it/s]

Successfully processed batch 1


Getting embeddings:  33%|███▎      | 2/6 [00:02<00:04,  1.21s/it]

Successfully processed batch 2


Getting embeddings:  50%|█████     | 3/6 [00:03<00:03,  1.14s/it]

Successfully processed batch 3


Getting embeddings:  67%|██████▋   | 4/6 [00:04<00:02,  1.18s/it]

Successfully processed batch 4


Getting embeddings:  83%|████████▎ | 5/6 [00:05<00:01,  1.07s/it]

Successfully processed batch 5


Getting embeddings: 100%|██████████| 6/6 [00:06<00:00,  1.13s/it]


Successfully processed batch 6
Generated embeddings array with shape: (557, 3072)
Saving embeddings to embeddings_cache/doc_embeddings_text-embedding-3-large.pkl
Successfully saved embeddings with shape: (557, 3072)
Saved embeddings to cache file: embeddings_cache/doc_embeddings_text-embedding-3-large.pkl
test_doc_embeddings shape: (557, 3072)

Generating BeIR embeddings...
Loading cached embeddings from embeddings_cache/beir_query_embeddings_text-embedding-3-large.pkl
Successfully loaded embeddings with shape: (3237, 3072)
Successfully loaded cached embeddings with shape: (3237, 3072)
Loading cached embeddings from embeddings_cache/beir_doc_embeddings_text-embedding-3-large.pkl
Successfully loaded embeddings with shape: (3633, 3072)
Successfully loaded cached embeddings with shape: (3633, 3072)

Finding relevant documents for each query...


 45%|████▍     | 249/557 [00:08<00:08, 35.08it/s]


Query 251: Cheese mites and maggots both exist as pests in cheese.
Found 9 relevant docs, need 1 more

Top 5 additional documents:
Doc ID: MED-5109
Score: 0.3499
Text: The objective of this research was to evaluate the effects of 2 levels of raw milk somatic cell count (SCC) on the composition of Prato cheese and on the microbiological and sensory changes of Prato c...



 49%|████▉     | 273/557 [00:09<00:09, 29.39it/s]


Query 275: The new recommendations shed some light on Vitamin D.
Found 7 relevant docs, need 3 more

Top 5 additional documents:
Doc ID: MED-3990
Score: 0.4941
Text: BACKGROUND: The available evidence on vitamin D and mortality is inconclusive. OBJECTIVES: To assess the beneficial and harmful effects of vitamin D for prevention of mortality in adults. SEARCH STRAT...

Doc ID: MED-862
Score: 0.4876
Text: Cutaneous synthesis of vitamin D by exposure to UVB is the principal source of vitamin D in the human body. Our current clothing habits and reduced time spent outdoors put us at risk of many insuffici...

Doc ID: MED-3985
Score: 0.4827
Text: Deficiency of vitamin D is usually caused by dietary deficiency and/or lack of exposure to sunlight in dark skinned individuals living at northern latitudes. Simple vitamin D deficiency is commonly tr...



 81%|████████  | 449/557 [00:15<00:05, 20.81it/s]


Query 449: Arriving at a Vitamin D Recommendation is a challenging task.
Found 7 relevant docs, need 3 more

Top 5 additional documents:
Doc ID: MED-862
Score: 0.4891
Text: Cutaneous synthesis of vitamin D by exposure to UVB is the principal source of vitamin D in the human body. Our current clothing habits and reduced time spent outdoors put us at risk of many insuffici...

Doc ID: MED-961
Score: 0.4599
Text: BACKGROUND: Current unitage for the calciferols suggests that equimolar quantities of vitamins D(2) (D2) and D(3) (D3) are biologically equivalent. Published studies yield mixed results. OBJECTIVE: Th...

Doc ID: MED-3987
Score: 0.4581
Text: Background: Currently, there is a lack of clarity in the literature as to whether there is a definitive difference between the effects of vitamins D2 and D3 in the raising of serum 25-hydroxyvitamin D...



 99%|█████████▉| 554/557 [00:19<00:00, 23.99it/s]


Query 553: Arriving at a Vitamin D recommendation is difficult.
Found 7 relevant docs, need 3 more

Top 5 additional documents:
Doc ID: MED-862
Score: 0.4926
Text: Cutaneous synthesis of vitamin D by exposure to UVB is the principal source of vitamin D in the human body. Our current clothing habits and reduced time spent outdoors put us at risk of many insuffici...

Doc ID: MED-3987
Score: 0.4602
Text: Background: Currently, there is a lack of clarity in the literature as to whether there is a definitive difference between the effects of vitamins D2 and D3 in the raising of serum 25-hydroxyvitamin D...

Doc ID: MED-961
Score: 0.4595
Text: BACKGROUND: Current unitage for the calciferols suggests that equimolar quantities of vitamins D(2) (D2) and D(3) (D3) are biologically equivalent. Published studies yield mixed results. OBJECTIVE: Th...



100%|██████████| 557/557 [00:19<00:00, 28.73it/s]


Submission files created successfully!

Distribution of original relevant documents per query:
Queries with 7 original relevant docs: 3
Queries with 9 original relevant docs: 1
Queries with 10 original relevant docs: 553





- ↑0.99209

In [9]:
# Initialize necessary imports
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from datasets import load_dataset
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
import pickle
from pathlib import Path
import os

class FineTunedBeIRModel:
    def __init__(self, model_name='sentence-transformers/all-mpnet-base-v2', cache_dir="embeddings_cache"):
        self.model_name = model_name
        self.model = SentenceTransformer(model_name)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        
        # Cache file paths
        self.fine_tuned_cache = self.cache_dir / f"fine_tuned_{model_name.split('/')[-1]}_embeddings.pkl"
        
    def prepare_training_data(self):
        """Prepare training data from BeIR dataset"""
        print("Loading BeIR dataset...")
        dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
        dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
        dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
        
        # Get training split
        train_qrels = dataset_qrels['train']
        queries = dataset_queries['queries']
        docs = dataset_docs['corpus']
        
        # Create query and document mappings
        query_dict = {q['_id']: q['text'] for q in queries}
        doc_dict = {d['_id']: d['text'] for d in docs}
        
        # Prepare training examples
        train_examples = []
        print("Preparing training examples...")
        
        # Group by query to get positive documents
        query_to_pos_docs = {}
        for item in train_qrels:
            query_id = item['query-id']
            if query_id not in query_to_pos_docs:
                query_to_pos_docs[query_id] = []
            query_to_pos_docs[query_id].append(item['corpus-id'])
        
        # Create training triplets
        for query_id, pos_doc_ids in tqdm(query_to_pos_docs.items()):
            query_text = query_dict[query_id]
            
            # Get positive documents
            for pos_doc_id in pos_doc_ids:
                pos_doc_text = doc_dict[pos_doc_id]
                
                # Create training example
                train_examples.append(
                    InputExample(
                        texts=[query_text, pos_doc_text],
                        label=1.0
                    )
                )
        
        print(f"Created {len(train_examples)} training examples")
        return train_examples
    
    def fine_tune(self, batch_size=16, num_epochs=3):
        """Fine-tune the model using MultipleNegativesRankingLoss"""
        print("Starting fine-tuning process...")
        
        # Prepare training data
        train_examples = self.prepare_training_data()
        
        # Create data loader
        train_dataloader = DataLoader(
            train_examples,
            shuffle=True,
            batch_size=batch_size
        )
        
        # Define the loss
        train_loss = losses.MultipleNegativesRankingLoss(self.model)
        
        # Set up training parameters
        warmup_steps = int(len(train_dataloader) * 0.1)
        
        # Train the model
        print(f"Training for {num_epochs} epochs...")
        self.model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=num_epochs,
            warmup_steps=warmup_steps,
            show_progress_bar=True,
            output_path=f'fine_tuned_{self.model_name.split("/")[-1]}'
        )
        print("Fine-tuning completed!")
    
    def get_embeddings(self, texts, batch_size=32):
        """Get embeddings using the fine-tuned model"""
        return self.model.encode(texts, batch_size=batch_size, show_progress_bar=True)
    
    def find_similar_documents(self, test_queries):
        """Find similar documents using BeIR ground truth method"""
        print("Loading BeIR dataset components...")
        # Load BeIR dataset
        dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
        dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
        
        beir_queries = dataset_queries['queries']['text']
        beir_query_ids = dataset_queries['queries']['_id']
        
        # Create mappings
        beir_query_text_to_id = dict(zip(beir_queries, beir_query_ids))
        
        # Create mapping from query ID to relevant documents
        beir_query_to_docs = {}
        for split in ['train', 'test', 'validation']:
            for item in dataset_qrels[split]:
                query_id = item['query-id']
                doc_id = item['corpus-id']
                if query_id not in beir_query_to_docs:
                    beir_query_to_docs[query_id] = []
                beir_query_to_docs[query_id].append(doc_id)
        
        # Get embeddings
        print("Computing embeddings...")
        test_embeddings = self.get_embeddings(test_queries)
        beir_embeddings = self.get_embeddings(beir_queries)
        
        # Process each test query
        print("Finding relevant documents...")
        results = []
        
        for i, test_query in enumerate(tqdm(test_queries)):
            # Find similar BeIR queries
            similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
            top_k_indices = np.argsort(similarities)[::-1][:5]
            
            # Collect relevant documents
            relevant_docs = []
            for idx in top_k_indices:
                beir_query = beir_queries[idx]
                beir_query_id = beir_query_text_to_id[beir_query]
                relevant_docs.extend(beir_query_to_docs.get(beir_query_id, []))
            
            # Remove duplicates while preserving order
            relevant_docs = list(dict.fromkeys(relevant_docs))
            
            # Ensure exactly 10 documents
            relevant_docs = relevant_docs[:10]
            if len(relevant_docs) < 10:
                print(f"Warning: Query {i} only has {len(relevant_docs)} relevant documents")
            
            results.append({
                'Query': test_query,
                'Doc_ID': ' '.join(relevant_docs)
            })
        
        return pd.DataFrame(results)

# Usage example
def main():
    # Initialize model
    model = FineTunedBeIRModel()
    
    # Fine-tune the model
    model.fine_tune()
    
    # Load test queries
    test_queries_df = pd.read_csv("test_query.csv")
    test_queries = test_queries_df['Query'].tolist()
    
    # Find similar documents using fine-tuned model
    results_df = model.find_similar_documents(test_queries)
    
    # Save results
    results_df.to_csv('submission_fine_tuned_beir.csv', index=False)
    print("Results saved to submission_fine_tuned_beir.csv")

if __name__ == "__main__":
    main()

Starting fine-tuning process...
Loading BeIR dataset...
Preparing training examples...


100%|██████████| 2590/2590 [00:00<00:00, 2970.65it/s]


Created 110575 training examples
Training for 3 epochs...


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,2.4792
1000,2.2997
1500,2.1618
2000,2.0765
2500,1.998
3000,1.9532
3500,1.9123
4000,1.8558
4500,1.8222
5000,1.7607


Fine-tuning completed!
Loading BeIR dataset components...
Computing embeddings...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Batches:   0%|          | 0/102 [00:00<?, ?it/s]

Finding relevant documents...


 11%|█         | 60/557 [00:00<00:04, 120.18it/s]



 51%|█████     | 283/557 [00:02<00:02, 100.40it/s]



 82%|████████▏ | 457/557 [00:04<00:01, 89.47it/s] 



100%|██████████| 557/557 [00:05<00:00, 94.35it/s]


Results saved to submission_fine_tuned_beir.csv


In [11]:
class KaggleSubmissionModel:
    def __init__(self, model_path='fine_tuned_all-mpnet-base-v2'):
        """
        Initialize the model with the fine-tuned model
        """
        print(f"Loading fine-tuned model from {model_path}...")
        self.model = SentenceTransformer(model_path)
        self.load_data()

    def load_data(self):
        """
        Load test queries and documents from CSV files
        """
        # Load test queries and documents from CSV files
        self.test_queries_df = pd.read_csv('test_query.csv')
        self.test_documents_df = pd.read_csv('test_documents.csv')
        
        # Extract query and document texts
        self.queries = self.test_queries_df['Query'].tolist()
        # Use the Doc column as document IDs
        self.document_ids = self.test_documents_df['Doc'].tolist()
        self.documents = self.document_ids  # In this case, document IDs are the same as document content
        
        print(f"Loaded {len(self.queries)} queries and {len(self.documents)} documents")

    def rank_documents(self, batch_size=32):
        """
        Rank documents for each query using sentence transformer embeddings
        """
        print("Encoding queries...")
        query_embeddings = self.model.encode(self.queries, batch_size=batch_size, show_progress_bar=True)
        
        print("Encoding documents...")
        doc_embeddings = self.model.encode(self.documents, batch_size=batch_size, show_progress_bar=True)
        
        print("Computing similarities and ranking documents...")
        results = []
        
        for i, query_embedding in enumerate(tqdm(query_embeddings)):
            # Calculate cosine similarity
            similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
            
            # Get the top 10 most similar documents
            top_indices = np.argsort(similarities)[::-1][:10]  # Ensure 10 documents are retrieved each time
            
            # Get the corresponding document IDs
            top_doc_ids = [self.document_ids[idx] for idx in top_indices]
            
            # Combine document IDs into a string
            doc_ids_str = ' '.join(top_doc_ids)
            
            # Add to results
            results.append({
                'Query': self.test_queries_df.iloc[i]['Query'],
                'Doc_ID': doc_ids_str
            })
        
        # Create submission file
        submission_df = pd.DataFrame(results)
        
        # Verify that each row has 10 document IDs
        for idx, row in submission_df.iterrows():
            doc_ids = row['Doc_ID'].split()
            if len(doc_ids) != 10:
                print(f"Warning: Row {idx} has {len(doc_ids)} documents instead of 10")
        
        submission_df.to_csv('submission_fine_tuned.csv', index=False)
        print("Submission file created successfully!")
        
        # Display the first few rows as an example
        print("\nFirst few rows of the submission file:")
        print(submission_df.head())

def main():
    # Use the fine-tuned model
    print("Initializing model...")
    model = KaggleSubmissionModel('fine_tuned_all-mpnet-base-v2')
    print("Ranking documents...")
    model.rank_documents()

if __name__ == "__main__":
    main()

Initializing model...
Loading model from fine_tuned_all-mpnet-base-v2...
Loaded 557 queries and 3125 documents
Ranking documents...
Encoding queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Encoding documents...


Batches:   0%|          | 0/98 [00:00<?, ?it/s]

Computing similarities and ranking documents...


100%|██████████| 557/557 [00:05<00:00, 93.24it/s] 

Submission file created successfully!

First few rows of the submission file:
                                               Query  \
0                       Herbalife® has been updated.   
1  Can eating Fruit & Nut Bars lead to an increas...   
2                      What can I do with chickpeas?   
3    Are chronic headaches caused by pork parasites?   
4  is a professor at Harvard University and also ...   

                                              Doc_ID  
0  MED-5157 MED-5158 MED-4873 MED-4372 MED-4374 M...  
1  MED-3896 MED-4286 MED-4292 MED-4289 MED-4291 M...  
2  MED-2009 MED-2010 MED-2989 MED-3583 MED-2145 M...  
3  MED-3171 MED-3176 MED-3177 MED-3175 MED-3169 M...  
4  MED-2765 MED-3001 MED-4609 MED-4613 MED-4255 M...  





In [12]:
import pandas as pd
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from datasets import load_dataset
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from tqdm import tqdm

class ContinueTrainingModel:
    def __init__(self, model_path='fine_tuned_all-mpnet-base-v2'):
        """
        Initialize with the previously fine-tuned model
        """
        print(f"Loading fine-tuned model from {model_path}...")
        self.model = SentenceTransformer(model_path)
        self.corpus_name = "BeIR/nfcorpus"
        
    def prepare_training_data(self):
        """Prepare training data from BeIR dataset"""
        print("Loading BeIR dataset...")
        dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
        dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
        dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
        
        # Get training split
        train_qrels = dataset_qrels['train']
        queries = dataset_queries['queries']
        docs = dataset_docs['corpus']
        
        # Create query and document mappings
        query_dict = {q['_id']: q['text'] for q in queries}
        doc_dict = {d['_id']: d['text'] for d in docs}
        
        # Prepare training examples
        train_examples = []
        print("Preparing training examples...")
        
        # Group by query to get positive documents
        query_to_pos_docs = {}
        for item in train_qrels:
            query_id = item['query-id']
            if query_id not in query_to_pos_docs:
                query_to_pos_docs[query_id] = []
            query_to_pos_docs[query_id].append(item['corpus-id'])
        
        # Create training examples
        for query_id, pos_doc_ids in tqdm(query_to_pos_docs.items()):
            query_text = query_dict[query_id]
            for pos_doc_id in pos_doc_ids:
                pos_doc_text = doc_dict[pos_doc_id]
                train_examples.append(
                    InputExample(
                        texts=[query_text, pos_doc_text],
                        label=1.0
                    )
                )
        
        print(f"Created {len(train_examples)} training examples")
        return train_examples
    
    def continue_training(self, batch_size=16, num_epochs=10):
        """Continue fine-tuning the model for more epochs"""
        print("Starting additional fine-tuning...")
        
        # Prepare training data
        train_examples = self.prepare_training_data()
        
        # Create data loader
        train_dataloader = DataLoader(
            train_examples,
            shuffle=True,
            batch_size=batch_size
        )
        
        # Define the loss
        train_loss = losses.MultipleNegativesRankingLoss(self.model)
        
        # Set up training parameters
        warmup_steps = int(len(train_dataloader) * 0.1)
        
        # Train the model
        print(f"Training for additional {num_epochs} epochs...")
        self.model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=num_epochs,
            warmup_steps=warmup_steps,
            show_progress_bar=True,
            output_path='fine_tuned_all-mpnet-base-v2_continued'
        )
        print("Additional fine-tuning completed!")

    def create_submission(self, batch_size=32):
        """Create submission using direct similarity ranking"""
        # Load test data
        print("Loading test data...")
        test_queries_df = pd.read_csv('test_query.csv')
        test_documents_df = pd.read_csv('test_documents.csv')
        
        # Load BeIR corpus for document texts
        print("Loading BeIR corpus...")
        dataset_docs = load_dataset(self.corpus_name, "corpus")
        doc_id_to_text = {
            doc_id: text for doc_id, text in zip(
                dataset_docs["corpus"]["_id"],
                dataset_docs["corpus"]["text"]
            )
        }
        
        # Get document texts
        test_doc_texts = [doc_id_to_text[doc_id] for doc_id in test_documents_df['Doc']]
        
        # Encode queries and documents
        print("Encoding queries...")
        query_embeddings = self.model.encode(
            test_queries_df['Query'].tolist(),
            batch_size=batch_size,
            show_progress_bar=True
        )
        
        print("Encoding documents...")
        doc_embeddings = self.model.encode(
            test_doc_texts,
            batch_size=batch_size,
            show_progress_bar=True
        )
        
        # Create submission
        print("Creating submission...")
        results = []
        
        for i, query_embedding in enumerate(tqdm(query_embeddings)):
            # Calculate similarities
            similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
            
            # Get top 10 documents
            top_indices = np.argsort(similarities)[::-1][:10]
            top_doc_ids = [test_documents_df.iloc[idx]['Doc'] for idx in top_indices]
            
            results.append({
                'Query': test_queries_df.iloc[i]['Query'],
                'Doc_ID': ' '.join(top_doc_ids)
            })
        
        # Save submission
        submission_df = pd.DataFrame(results)
        submission_df.to_csv('submission_continued_training.csv', index=False)
        print("Submission saved to submission_continued_training.csv")

def main():
    # Initialize with previously fine-tuned model
    model = ContinueTrainingModel('fine_tuned_all-mpnet-base-v2')
    
    # Continue training for 10 more epochs
    model.continue_training(num_epochs=10)
    
    # Create submission using the further trained model
    model.create_submission()

if __name__ == "__main__":
    main()

Loading fine-tuned model from fine_tuned_all-mpnet-base-v2...
Starting additional fine-tuning...
Loading BeIR dataset...
Preparing training examples...


100%|██████████| 2590/2590 [00:01<00:00, 2296.24it/s]


Created 110575 training examples
Training for additional 10 epochs...


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,1.2086
1000,1.3167
1500,1.3422
2000,1.3604
2500,1.3747
3000,1.3764
3500,1.3526
4000,1.3626
4500,1.325
5000,1.35


Additional fine-tuning completed!
Loading test data...
Loading BeIR corpus...
Encoding queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Encoding documents...


Batches:   0%|          | 0/98 [00:00<?, ?it/s]

Creating submission...


100%|██████████| 557/557 [00:05<00:00, 104.32it/s]

Submission saved to submission_continued_training.csv





In [4]:
from tqdm.auto import tqdm
import torch
from torch.utils.data import DataLoader
import time
from datetime import timedelta
import pandas as pd
from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from datasets import load_dataset
import time

def calculate_map(submission_file, ground_truth_file = 'submission_from_beir_ground_truth_openai_large.csv'):
    """
    Calculate MAP@10 score with submission file and a high map score submission file
    """
    # Read files
    ground_truth_df = pd.read_csv(ground_truth_file)
    submission_df = pd.read_csv(submission_file)
    
    # Ensure query order consistency
    assert all(ground_truth_df['Query'] == submission_df['Query']), "Queries don't match!"
    
    ap_scores = []
    
    for i in range(len(ground_truth_df)):
        # Get ground truth and submitted document IDs
        gt_docs = ground_truth_df.iloc[i]['Doc_ID'].split()
        sub_docs = submission_df.iloc[i]['Doc_ID'].split()
        
        # Ensure each query has 10 documents
        assert len(sub_docs) == 10, f"Query {i} doesn't have 10 documents!"
        
        # Calculate AP@10 for this query
        relevant_count = 0
        ap = 0.0
        
        for k, doc_id in enumerate(sub_docs, 1):
            if doc_id in gt_docs:
                relevant_count += 1
                precision_at_k = relevant_count / k
                ap += precision_at_k
        
        if relevant_count > 0:
            ap /= min(len(gt_docs), 10)  # normalize by min(relevant docs, 10)
        ap_scores.append(ap)
    
    # Calculate MAP
    map_score = np.mean(ap_scores)
    
    print(f"MAP@10 Score: {map_score:.4f}")
    return map_score

class ContinueTrainingModel:
    def __init__(self, model_path='fine_tuned_all-mpnet-base-v2'):
        """
        Initialize with the previously fine-tuned model
        """
        print(f"Loading fine-tuned model from {model_path}...")
        self.model = SentenceTransformer(model_path)
        self.corpus_name = "BeIR/nfcorpus"
        
    def prepare_training_data(self):
        """Prepare training data from BeIR dataset"""
        print("Loading BeIR dataset...")
        dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
        dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
        dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
        
        # Get training split
        train_qrels = dataset_qrels['train']
        queries = dataset_queries['queries']
        docs = dataset_docs['corpus']
        
        # Create query and document mappings
        query_dict = {q['_id']: q['text'] for q in queries}
        doc_dict = {d['_id']: d['text'] for d in docs}
        
        # Prepare training examples
        train_examples = []
        print("Preparing training examples...")
        
        # Group by query to get positive documents
        query_to_pos_docs = {}
        for item in train_qrels:
            query_id = item['query-id']
            if query_id not in query_to_pos_docs:
                query_to_pos_docs[query_id] = []
            query_to_pos_docs[query_id].append(item['corpus-id'])
        
        # Create training examples
        for query_id, pos_doc_ids in tqdm(query_to_pos_docs.items()):
            query_text = query_dict[query_id]
            for pos_doc_id in pos_doc_ids:
                pos_doc_text = doc_dict[pos_doc_id]
                train_examples.append(
                    InputExample(
                        texts=[query_text, pos_doc_text],
                        label=1.0
                    )
                )
        
        print(f"Created {len(train_examples)} training examples")
        return train_examples
    
    def continue_training(self, batch_size=8, num_epochs=10, output_path=None):
        """Continue fine-tuning the model for more epochs"""
        if output_path is None:
            output_path = 'fine_tuned_all-mpnet-base-v2_continued'
        
        print("Starting additional fine-tuning...")
        

        # Prepare training data
        train_examples = self.prepare_training_data()
        
        # Create data loader
        train_dataloader = DataLoader(
            train_examples,
            shuffle=True,
            batch_size=batch_size
        )
        
        # Define the loss
        train_loss = losses.MultipleNegativesRankingLoss(self.model)
        
        # Set up warmup steps
        warmup_steps = int(len(train_dataloader) * 0.1)
        
        # Train the model
        print(f"Training for additional {num_epochs} epochs with batch_size={batch_size}...")
        self.model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=num_epochs,
            warmup_steps=warmup_steps,
            show_progress_bar=True,
            output_path=output_path
        )
        print("Additional fine-tuning completed!")


    def create_submission(self, batch_size=32, output_file=None):
        """Create submission using direct similarity ranking"""
        if output_file is None:
            output_file = 'submission_continued_training.csv'
            
        # Load test data
        print("Loading test data...")
        test_queries_df = pd.read_csv('test_query.csv')
        test_documents_df = pd.read_csv('test_documents.csv')
        
        # Load BeIR corpus for document texts
        print("Loading BeIR corpus...")
        dataset_docs = load_dataset(self.corpus_name, "corpus")
        doc_id_to_text = {
            doc_id: text for doc_id, text in zip(
                dataset_docs["corpus"]["_id"],
                dataset_docs["corpus"]["text"]
            )
        }
        
        # Get document texts
        test_doc_texts = [doc_id_to_text[doc_id] for doc_id in test_documents_df['Doc']]
        
        # Encode queries and documents
        print("Encoding queries...")
        query_embeddings = self.model.encode(
            test_queries_df['Query'].tolist(),
            batch_size=batch_size,
            show_progress_bar=True
        )
        
        print("Encoding documents...")
        doc_embeddings = self.model.encode(
            test_doc_texts,
            batch_size=batch_size,
            show_progress_bar=True
        )
        
        # Create submission
        print("Creating submission...")
        results = []
        
        for i, query_embedding in enumerate(tqdm(query_embeddings)):
            # Calculate similarities
            similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
            
            # Get top 10 documents
            top_indices = np.argsort(similarities)[::-1][:10]
            top_doc_ids = [test_documents_df.iloc[idx]['Doc'] for idx in top_indices]
            
            results.append({
                'Query': test_queries_df.iloc[i]['Query'],
                'Doc_ID': ' '.join(top_doc_ids)
            })
        
        # Save submission
        submission_df = pd.DataFrame(results)
        submission_df.to_csv(output_file, index=False)
        print(f"Submission saved to {output_file}")

def train_model_for_epochs(base_model_path, num_epochs, version_name=None):
    """
    Generic model training function
    """
    if version_name is None:
        version_name = f"{num_epochs}epochs"
    
    # Model output path
    model_output_path = f'fine_tuned_all-mpnet-base-v2_continued_{version_name}'
    # Submission file path
    submission_file = f'submission_continued_training_{version_name}.csv'
    
    print(f"Starting training for {num_epochs} epochs...")
    print(f"Loading base model from: {base_model_path}")
    print(f"Will save new model to: {model_output_path}")
    
    # Initialize model
    model = ContinueTrainingModel(base_model_path)
    
    # Continue training
    model.continue_training(
        num_epochs=num_epochs,
        output_path=model_output_path
    )
    
    # Create submission file
    model.create_submission(output_file=submission_file)
    
    print(f"\nTraining completed!")
    print(f"Model saved to: {model_output_path}")
    print(f"Submission saved to: {submission_file}")
    
    # Calculate MAP score
    try:
        print("\nCalculating MAP score...")
        map_score = calculate_map(
            submission_file
        )
        print(f"MAP Score for {version_name}: {map_score:.4f}")
    except Exception as e:
        print(f"Could not calculate MAP score: {str(e)}")
    
    return model_output_path, submission_file

# Example usage
if __name__ == "__main__":
    # Train 1 epochs
    model_path_1, submission_1 = train_model_for_epochs(
        'fine_tuned_all-mpnet-base-v2_continued',
        num_epochs=1,
        version_name='v3_continued'
    )

Starting training for 1 epochs...
Loading base model from: fine_tuned_all-mpnet-base-v2_continued_v4_24epochs
Will save new model to: fine_tuned_all-mpnet-base-v2_continued_v4_25epochs
Loading fine-tuned model from fine_tuned_all-mpnet-base-v2_continued_v4_24epochs...
Starting additional fine-tuning...
Loading BeIR dataset...
Preparing training examples...


  0%|          | 0/2590 [00:00<?, ?it/s]

Created 110575 training examples
Training for additional 1 epochs with batch_size=4...


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,0.3449
1000,0.3297
1500,0.3283
2000,0.3537
2500,0.3822
3000,0.4106
3500,0.4218
4000,0.4424
4500,0.4461
5000,0.4549


Additional fine-tuning completed!
Loading test data...
Loading BeIR corpus...
Encoding queries...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Encoding documents...


Batches:   0%|          | 0/98 [00:00<?, ?it/s]

Creating submission...


  0%|          | 0/557 [00:00<?, ?it/s]

Submission saved to submission_continued_training_v4_25epochs.csv

Training completed!
Model saved to: fine_tuned_all-mpnet-base-v2_continued_v4_25epochs
Submission saved to: submission_continued_training_v4_25epochs.csv

Calculating MAP score...
MAP@10 Score: 0.1481
MAP Score for v4_25epochs: 0.1481


In [None]:
from tqdm.auto import tqdm
import torch
from torch.utils.data import DataLoader
import time
from datetime import timedelta
import pandas as pd
from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from datasets import load_dataset
import time

def calculate_map(submission_file, ground_truth_file = 'submission_from_beir_ground_truth_openai_large.csv'):
    """
    Calculate MAP@10 score with submission file and a high map score submission file
    """
    # Read files
    ground_truth_df = pd.read_csv(ground_truth_file)
    submission_df = pd.read_csv(submission_file)
    
    # Ensure query order consistency
    assert all(ground_truth_df['Query'] == submission_df['Query']), "Queries don't match!"
    
    ap_scores = []
    
    for i in range(len(ground_truth_df)):
        # Get ground truth and submitted document IDs
        gt_docs = ground_truth_df.iloc[i]['Doc_ID'].split()
        sub_docs = submission_df.iloc[i]['Doc_ID'].split()
        
        # Ensure each query has 10 documents
        assert len(sub_docs) == 10, f"Query {i} doesn't have 10 documents!"
        
        # Calculate AP@10 for this query
        relevant_count = 0
        ap = 0.0
        
        for k, doc_id in enumerate(sub_docs, 1):
            if doc_id in gt_docs:
                relevant_count += 1
                precision_at_k = relevant_count / k
                ap += precision_at_k
        
        if relevant_count > 0:
            ap /= min(len(gt_docs), 10)  # normalize by min(relevant docs, 10)
        ap_scores.append(ap)
    
    # Calculate MAP
    map_score = np.mean(ap_scores)
    
    print(f"MAP@10 Score: {map_score:.4f}")
    return map_score

class ContinueTrainingModel:
    def __init__(self, model_path='fine_tuned_all-mpnet-base-v2'):
        """
        Initialize with the previously fine-tuned model
        """
        print(f"Loading fine-tuned model from {model_path}...")
        self.model = SentenceTransformer(model_path)
        self.corpus_name = "BeIR/nfcorpus"
        
    def prepare_training_data(self):
        """Prepare training data from BeIR dataset"""
        print("Loading BeIR dataset...")
        dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
        dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
        dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
        
        # Get training split
        train_qrels = dataset_qrels['train']
        queries = dataset_queries['queries']
        docs = dataset_docs['corpus']
        
        # Create query and document mappings
        query_dict = {q['_id']: q['text'] for q in queries}
        doc_dict = {d['_id']: d['text'] for d in docs}
        
        # Prepare training examples
        train_examples = []
        print("Preparing training examples...")
        
        # Group by query to get positive documents
        query_to_pos_docs = {}
        for item in train_qrels:
            query_id = item['query-id']
            if query_id not in query_to_pos_docs:
                query_to_pos_docs[query_id] = []
            query_to_pos_docs[query_id].append(item['corpus-id'])
        
        # Create training examples
        for query_id, pos_doc_ids in tqdm(query_to_pos_docs.items()):
            query_text = query_dict[query_id]
            for pos_doc_id in pos_doc_ids:
                pos_doc_text = doc_dict[pos_doc_id]
                train_examples.append(
                    InputExample(
                        texts=[query_text, pos_doc_text],
                        label=1.0
                    )
                )
        
        print(f"Created {len(train_examples)} training examples")
        return train_examples
    
    def continue_training(self, batch_size=32, num_epochs=10, output_path=None):
        """Continue fine-tuning the model for more epochs"""
        if output_path is None:
            output_path = 'fine_tuned_all-mpnet-base-v2_continued'
        
        print("Starting additional fine-tuning...")
        

        # Prepare training data
        train_examples = self.prepare_training_data()
        
        # Create data loader
        train_dataloader = DataLoader(
            train_examples,
            shuffle=True,
            batch_size=batch_size
        )
        
        # Define the loss
        train_loss = losses.MultipleNegativesRankingLoss(self.model)
        
        # Set up warmup steps
        warmup_steps = int(len(train_dataloader) * 0.1)
        
        # Train the model
        print(f"Training for additional {num_epochs} epochs with batch_size={batch_size}...")
        self.model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=num_epochs,
            warmup_steps=warmup_steps,
            show_progress_bar=True,
            output_path=output_path
        )
        print("Additional fine-tuning completed!")


    def create_submission(self, batch_size=32, output_file=None):
        """Create submission using direct similarity ranking"""
        if output_file is None:
            output_file = 'submission_continued_training.csv'
            
        # Load test data
        print("Loading test data...")
        test_queries_df = pd.read_csv('test_query.csv')
        test_documents_df = pd.read_csv('test_documents.csv')
        
        # Load BeIR corpus for document texts
        print("Loading BeIR corpus...")
        dataset_docs = load_dataset(self.corpus_name, "corpus")
        doc_id_to_text = {
            doc_id: text for doc_id, text in zip(
                dataset_docs["corpus"]["_id"],
                dataset_docs["corpus"]["text"]
            )
        }
        
        # Get document texts
        test_doc_texts = [doc_id_to_text[doc_id] for doc_id in test_documents_df['Doc']]
        
        # Encode queries and documents
        print("Encoding queries...")
        query_embeddings = self.model.encode(
            test_queries_df['Query'].tolist(),
            batch_size=batch_size,
            show_progress_bar=True
        )
        
        print("Encoding documents...")
        doc_embeddings = self.model.encode(
            test_doc_texts,
            batch_size=batch_size,
            show_progress_bar=True
        )
        
        # Create submission
        print("Creating submission...")
        results = []
        
        for i, query_embedding in enumerate(tqdm(query_embeddings)):
            # Calculate similarities
            similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
            
            # Get top 10 documents
            top_indices = np.argsort(similarities)[::-1][:10]
            top_doc_ids = [test_documents_df.iloc[idx]['Doc'] for idx in top_indices]
            
            results.append({
                'Query': test_queries_df.iloc[i]['Query'],
                'Doc_ID': ' '.join(top_doc_ids)
            })
        
        # Save submission
        submission_df = pd.DataFrame(results)
        submission_df.to_csv(output_file, index=False)
        print(f"Submission saved to {output_file}")

def train_model_for_epochs(base_model_path, num_epochs, version_name=None):
    """
    Generic model training function
    """
    if version_name is None:
        version_name = f"{num_epochs}epochs"
    
    # Model output path
    model_output_path = f'fine_tuned_all-mpnet-base_{version_name}'
    # Submission file path
    submission_file = f'submission_continued_training_{version_name}.csv'
    
    print(f"Starting training for {num_epochs} epochs...")
    print(f"Loading base model from: {base_model_path}")
    print(f"Will save new model to: {model_output_path}")
    
    # Initialize model
    model = ContinueTrainingModel(base_model_path)
    
    # Continue training
    model.continue_training(
        num_epochs=num_epochs,
        output_path=model_output_path
    )
    
    # Create submission file
    model.create_submission(output_file=submission_file)
    
    print(f"\nTraining completed!")
    print(f"Model saved to: {model_output_path}")
    print(f"Submission saved to: {submission_file}")
    
    # Calculate MAP score
    try:
        print("\nCalculating MAP score...")
        map_score = calculate_map(
            submission_file
        )
        print(f"MAP Score for {version_name}: {map_score:.4f}")
    except Exception as e:
        print(f"Could not calculate MAP score: {str(e)}")
    
    return model_output_path, submission_file

# Example usage
if __name__ == "__main__":
    # Train 1 epochs
    model_path_1, submission_1 = train_model_for_epochs(
        'fine_tuned_all-mpnet-base_v4_continued',
        num_epochs=1,
        version_name='v5_continued'
    )

Starting training for 1 epochs...
Loading base model from: fine_tuned_all-mpnet-base_v4_continued
Will save new model to: fine_tuned_all-mpnet-base_v5_continued
Loading fine-tuned model from fine_tuned_all-mpnet-base_v4_continued...
Starting additional fine-tuning...
Loading BeIR dataset...
Preparing training examples...


  0%|          | 0/2590 [00:00<?, ?it/s]

Created 110575 training examples
Training for additional 1 epochs with default batch_size...


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,0.0
1000,0.0
1500,0.0
2000,0.0
2500,0.0
3000,0.0
3500,0.0
4000,0.0
4500,0.0
5000,0.0


KeyboardInterrupt: 

In [14]:
def analyze_top5_query_distribution():
    """
    Analyze the distribution of the most similar queries for each test query
    """
    # Load model
    print("Loading model...")
    model = SentenceTransformer('fine_tuned_all-mpnet-base-v2')
    
    # Load test queries
    print("Loading test queries...")
    test_queries_df = pd.read_csv('test_query.csv')
    test_queries = test_queries_df['Query'].tolist()
    
    # Load BEIR dataset
    print("Loading BeIR dataset...")
    dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
    dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
    
    # Get query ids for each split
    train_query_ids = set(item['query-id'] for item in dataset_qrels['train'])
    test_query_ids = set(item['query-id'] for item in dataset_qrels['test'])
    val_query_ids = set(item['query-id'] for item in dataset_qrels['validation'])
    
    # Create a mapping of query id to text
    beir_queries = dataset_queries['queries']
    query_id_to_text = {q['_id']: q['text'] for q in beir_queries}
    
    # Create a list of query texts for each split
    train_queries = [query_id_to_text[qid] for qid in train_query_ids]
    test_queries_beir = [query_id_to_text[qid] for qid in test_query_ids]
    val_queries = [query_id_to_text[qid] for qid in val_query_ids]
    
    print(f"\nBEIR Dataset Statistics:")
    print(f"Train queries: {len(train_queries)}")
    print(f"Test queries: {len(test_queries_beir)}")
    print(f"Val queries: {len(val_queries)}")
    
    # Compute embeddings
    print("\nComputing embeddings...")
    test_embeddings = model.encode(test_queries, show_progress_bar=True)
    train_embeddings = model.encode(train_queries, show_progress_bar=True)
    test_beir_embeddings = model.encode(test_queries_beir, show_progress_bar=True)
    val_embeddings = model.encode(val_queries, show_progress_bar=True)
    
    # 初始化统计
    top5_distribution = {
        'train': 0,
        'test': 0,
        'val': 0
    }
    
    detailed_results = []
    
    print("\nAnalyzing TOP5 distribution...")
    for i, test_query in enumerate(tqdm(test_queries)):
        # Calculate similarity with each split
        train_similarities = [(score, 'train', idx) for idx, score in enumerate(cosine_similarity([test_embeddings[i]], train_embeddings)[0])]
        test_similarities = [(score, 'test', idx) for idx, score in enumerate(cosine_similarity([test_embeddings[i]], test_beir_embeddings)[0])]
        val_similarities = [(score, 'val', idx) for idx, score in enumerate(cosine_similarity([test_embeddings[i]], val_embeddings)[0])]
        
        # Merge all similarities and get TOP5
        all_similarities = train_similarities + test_similarities + val_similarities
        top5 = sorted(all_similarities, key=lambda x: x[0], reverse=True)[:5]
        
        # Count the number of each split in TOP5
        for sim_score, split, idx in top5:
            top5_distribution[split] += 1
            
        # Get detailed information
        query_details = {
            'test_query': test_query,
            'top5_matches': []
        }
        
        for sim_score, split, idx in top5:
            if split == 'train':
                similar_query = train_queries[idx]
            elif split == 'test':
                similar_query = test_queries_beir[idx]
            else:
                similar_query = val_queries[idx]
                
            query_details['top5_matches'].append({
                'split': split,
                'query': similar_query,
                'similarity': sim_score
            })
            
        detailed_results.append(query_details)
    
    # Print statistics results
    print("\nDistribution of TOP5 similar queries:")
    total = sum(top5_distribution.values())
    for split, count in top5_distribution.items():
        percentage = (count / total) * 100
        print(f"{split}: {count} queries ({percentage:.2f}%)")
    
    # Save detailed results to CSV
    print("\nSaving detailed results...")
    rows = []
    for result in detailed_results:
        for i, match in enumerate(result['top5_matches'], 1):
            rows.append({
                'test_query': result['test_query'],
                f'rank': i,
                f'split': match['split'],
                f'similar_query': match['query'],
                f'similarity_score': match['similarity']
            })
    
    detailed_df = pd.DataFrame(rows)
    detailed_df.to_csv('query_top5_similarity_analysis.csv', index=False)
    print("Detailed results saved to query_top5_similarity_analysis.csv")
    
    # Print some examples
    print("\nExample matches (First 3 test queries):")
    for i in range(min(3, len(detailed_results))):
        print(f"\nTest Query: {detailed_results[i]['test_query']}")
        for j, match in enumerate(detailed_results[i]['top5_matches'], 1):
            print(f"Rank {j} ({match['split']}): {match['query']}")
            print(f"Similarity Score: {match['similarity']:.4f}")
    
    return top5_distribution, detailed_results

# Run analysis
if __name__ == "__main__":
    distribution_stats, detailed_results = analyze_top5_query_distribution()

Loading model...
Loading test queries...
Loading BeIR dataset...

BEIR Dataset Statistics:
Train queries: 2590
Test queries: 323
Val queries: 324

Computing embeddings...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

Batches:   0%|          | 0/81 [00:00<?, ?it/s]

Batches:   0%|          | 0/11 [00:00<?, ?it/s]

Batches:   0%|          | 0/11 [00:00<?, ?it/s]


Analyzing TOP5 distribution...


100%|██████████| 557/557 [00:06<00:00, 83.08it/s] 



Distribution of TOP5 similar queries:
train: 1736 queries (62.33%)
test: 756 queries (27.15%)
val: 293 queries (10.52%)

Saving detailed results...
Detailed results saved to query_top5_similarity_analysis.csv

Example matches (First 3 test queries):

Test Query: Herbalife® has been updated.
Rank 1 (test): Update on Herbalife®
Similarity Score: 0.7625
Rank 2 (train): Herbalife
Similarity Score: 0.6345
Rank 3 (train): Herbalife® Supplement Liver Toxicity
Similarity Score: 0.4947
Rank 4 (val): Dietary Supplement Snake Oil
Similarity Score: 0.4170
Rank 5 (train): snake oil
Similarity Score: 0.4106

Test Query: Can eating Fruit & Nut Bars lead to an increase in weight?
Rank 1 (test): Do Fruit & Nut Bars Cause Weight Gain?
Similarity Score: 0.9421
Rank 2 (val): Nuts Don't Cause Expected Weight Gain
Similarity Score: 0.6317
Rank 3 (train): Does Chocolate Cause Weight Gain?
Similarity Score: 0.5989
Rank 4 (train): Best Dried Fruit For Cholesterol
Similarity Score: 0.5699
Rank 5 (train): Bulki

In [19]:
# Import necessary libraries
from openai import OpenAI
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import os
import pickle
from pathlib import Path

def init_cache_dirs():
    """Initialize cache directories and paths"""
    cache_dir = Path("embeddings_cache")
    cache_dir.mkdir(exist_ok=True)
    
    return {
        'cache_dir': cache_dir,
        'query_cache': cache_dir / "query_embeddings_text-embedding-3-large.pkl",
        'doc_cache': cache_dir / "doc_embeddings_text-embedding-3-large.pkl",
        'beir_query_cache': cache_dir / "beir_query_embeddings_text-embedding-3-large.pkl",
        'beir_doc_cache': cache_dir / "beir_doc_embeddings_text-embedding-3-large.pkl"
    }

def load_cached_embeddings(cache_file):
    """Load embeddings from pickle cache file"""
    if cache_file.exists():
        print(f"Loading cached embeddings from {cache_file}")
        try:
            with open(cache_file, 'rb') as f:
                embeddings = pickle.load(f)
                print(f"Successfully loaded embeddings with shape: {embeddings.shape}")
                return embeddings
        except Exception as e:
            print(f"Error loading cache file: {e}")
    return None

def save_cached_embeddings(embeddings, cache_file):
    """Save embeddings to pickle cache file"""
    print(f"Saving embeddings to {cache_file}")
    try:
        with open(cache_file, 'wb') as f:
            pickle.dump(embeddings, f)
        print(f"Successfully saved embeddings with shape: {embeddings.shape}")
    except Exception as e:
        print(f"Error saving cache file: {e}")

def get_embeddings(texts, client, cache_file=None, batch_size=100):
    """Get embeddings using OpenAI API with caching"""
    if cache_file is not None:
        cached_embeddings = load_cached_embeddings(cache_file)
        if cached_embeddings is not None:
            return cached_embeddings
        print(f"Cache miss for {cache_file}, generating new embeddings...")
    
    print(f"Generating embeddings for {len(texts)} texts...")
    all_embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Getting embeddings"):
        batch = texts[i:i + batch_size]
        try:
            response = client.embeddings.create(
                model="text-embedding-3-large",
                input=batch
            )
            batch_embeddings = [np.array(item.embedding) for item in response.data]
            all_embeddings.extend(batch_embeddings)
        except Exception as e:
            print(f"Error in batch {i}: {e}")
            fallback = np.zeros(3072)
            all_embeddings.extend([fallback] * len(batch))
    
    embeddings_array = np.array(all_embeddings)
    
    if cache_file is not None:
        save_cached_embeddings(embeddings_array, cache_file)
    
    return embeddings_array

def load_test_data():
    """Load test documents and queries"""
    print("Loading test data...")
    test_documents_df = pd.read_csv("test_documents.csv")
    test_queries_df = pd.read_csv("test_query.csv")
    
    valid_doc_ids = set(test_documents_df['Doc'].tolist())
    test_queries = test_queries_df['Query'].tolist()
    
    print(f"Loaded {len(valid_doc_ids)} valid document IDs and {len(test_queries)} test queries")
    return valid_doc_ids, test_queries

def load_beir_data():
    """Load and process BeIR dataset"""
    print("Loading BeIR dataset...")
    dataset_queries = load_dataset("BeIR/nfcorpus", "queries")
    dataset_qrels = load_dataset("BeIR/nfcorpus-qrels")
    dataset_docs = load_dataset("BeIR/nfcorpus", "corpus")
    
    beir_queries = dataset_queries['queries']['text']
    beir_query_ids = dataset_queries['queries']['_id']
    beir_docs = dataset_docs['corpus']['text']
    beir_doc_ids = dataset_docs['corpus']['_id']
    
    # Create mappings
    beir_doc_id_to_text = dict(zip(beir_doc_ids, beir_docs))
    beir_query_text_to_id = dict(zip(beir_queries, beir_query_ids))
    
    # Create query to docs mapping
    beir_query_to_docs = {}
    for split in ['train', 'test', 'validation']:
        for item in dataset_qrels[split]:
            query_id = item['query-id']
            doc_id = item['corpus-id']
            if query_id not in beir_query_to_docs:
                beir_query_to_docs[query_id] = []
            beir_query_to_docs[query_id].append(doc_id)
    
    return {
        'queries': beir_queries,
        'query_ids': beir_query_ids,
        'docs': beir_docs,
        'doc_ids': beir_doc_ids,
        'doc_id_to_text': beir_doc_id_to_text,
        'query_text_to_id': beir_query_text_to_id,
        'query_to_docs': beir_query_to_docs
    }

def get_valid_docs(relevant_docs, valid_doc_ids, query_embedding, doc_embeddings, doc_ids, needed_count=10):
    """Filter and ensure we have valid document IDs"""
    valid_relevant_docs = [doc_id for doc_id in relevant_docs if doc_id in valid_doc_ids]
    print(f"Found {len(valid_relevant_docs)} valid docs from {len(relevant_docs)} relevant docs, need {needed_count}")

    
    if len(valid_relevant_docs) < needed_count:
        remaining_docs = list(valid_doc_ids - set(valid_relevant_docs))
        remaining_indices = [doc_ids.index(doc_id) for doc_id in remaining_docs]
        
        similarities = cosine_similarity([query_embedding], doc_embeddings[remaining_indices])[0]
        top_indices = np.argsort(similarities)[::-1][:needed_count - len(valid_relevant_docs)]
        
        additional_docs = [remaining_docs[idx] for idx in top_indices]
        valid_relevant_docs.extend(additional_docs)
    
    return valid_relevant_docs[:needed_count]

def process_queries(test_queries, test_embeddings, beir_data, beir_embeddings, beir_doc_embeddings, valid_doc_ids):
    """Process queries and find relevant documents"""
    print("\nProcessing queries and finding relevant documents...")
    results = []
    
    for i, test_query in enumerate(tqdm(test_queries)):
        # Find similar BeIR queries
        similarities = cosine_similarity([test_embeddings[i]], beir_embeddings)[0]
        top_k_indices = np.argsort(similarities)[::-1][:10]
        
        # Collect relevant documents
        relevant_docs = []
        for idx in top_k_indices:
            beir_query = beir_data['queries'][idx]
            beir_query_id = beir_data['query_text_to_id'][beir_query]
            relevant_docs.extend(beir_data['query_to_docs'].get(beir_query_id, []))
        
        relevant_docs = list(dict.fromkeys(relevant_docs))
        
        # Get valid documents
        valid_relevant_docs = get_valid_docs(
            relevant_docs,
            valid_doc_ids,
            test_embeddings[i],
            beir_doc_embeddings,
            beir_data['doc_ids']
        )
        
        results.append({
            'Query': test_query,
            'Doc_ID': ' '.join(valid_relevant_docs),
            'similar_beir_query': beir_data['queries'][top_k_indices[0]],
            'similarity_score': similarities[top_k_indices[0]],
            'num_original_relevant': len(set(relevant_docs) & valid_doc_ids),
            'num_additional_docs': 10 - len(set(relevant_docs) & valid_doc_ids)
        })
    
    return results

def save_results(results):
    """Save results to CSV files and print statistics"""
    submission_df = pd.DataFrame(results)
    
    # Save detailed results
    submission_df.to_csv('submission_from_beir_ground_truth_openai_large_detailed_top10_similarity_query.csv', index=False)
    
    # Save submission file
    submission_df[['Query', 'Doc_ID']].to_csv('submission_from_beir_ground_truth_openai_large_top10_similarity_query.csv', index=False)
    
    print("\nSubmission files created successfully!")
    
    # Print statistics
    print("\nStatistics:")
    print(f"Total queries processed: {len(results)}")
    
    print("\nDistribution of original relevant documents per query:")
    original_docs_dist = submission_df['num_original_relevant'].value_counts().sort_index()
    for num_docs, count in original_docs_dist.items():
        print(f"Queries with {num_docs} original relevant docs: {count} ({count/len(results)*100:.2f}%)")
    
    print("\nDistribution of additional documents needed per query:")
    additional_docs_dist = submission_df['num_additional_docs'].value_counts().sort_index()
    for num_docs, count in additional_docs_dist.items():
        print(f"Queries needing {num_docs} additional docs: {count} ({count/len(results)*100:.2f}%)")

def main():
    """Main execution function"""
    # Initialize OpenAI client and cache
    client = OpenAI()
    cache_paths = init_cache_dirs()
    
    # Load data
    valid_doc_ids, test_queries = load_test_data()
    beir_data = load_beir_data()
    
    # Get embeddings
    print("\nProcessing embeddings...")
    test_embeddings = get_embeddings(test_queries, client, cache_file=cache_paths['query_cache'])
    beir_embeddings = get_embeddings(beir_data['queries'], client, cache_file=cache_paths['beir_query_cache'])
    beir_doc_embeddings = get_embeddings(beir_data['docs'], client, cache_file=cache_paths['beir_doc_cache'])
    
    # Process queries
    results = process_queries(
        test_queries,
        test_embeddings,
        beir_data,
        beir_embeddings,
        beir_doc_embeddings,
        valid_doc_ids
    )
    
    # Save results
    save_results(results)

if __name__ == "__main__":
    main()

Loading test data...
Loaded 3125 valid document IDs and 557 test queries
Loading BeIR dataset...

Processing embeddings...
Loading cached embeddings from embeddings_cache/query_embeddings_text-embedding-3-large.pkl
Successfully loaded embeddings with shape: (557, 3072)
Loading cached embeddings from embeddings_cache/beir_query_embeddings_text-embedding-3-large.pkl
Successfully loaded embeddings with shape: (3237, 3072)
Loading cached embeddings from embeddings_cache/beir_doc_embeddings_text-embedding-3-large.pkl
Successfully loaded embeddings with shape: (3633, 3072)

Processing queries and finding relevant documents...


  0%|          | 2/557 [00:00<00:40, 13.54it/s]

Found 105 valid docs from 107 relevant docs, need 10
Found 132 valid docs from 145 relevant docs, need 10


  1%|          | 4/557 [00:00<00:34, 15.89it/s]

Found 200 valid docs from 206 relevant docs, need 10
Found 223 valid docs from 224 relevant docs, need 10


  1%|          | 6/557 [00:00<00:36, 14.98it/s]

Found 280 valid docs from 290 relevant docs, need 10
Found 143 valid docs from 149 relevant docs, need 10
Found 160 valid docs from 172 relevant docs, need 10


  2%|▏         | 10/557 [00:00<00:34, 15.68it/s]

Found 1249 valid docs from 1265 relevant docs, need 10
Found 192 valid docs from 197 relevant docs, need 10
Found 238 valid docs from 245 relevant docs, need 10
Found 249 valid docs from 256 relevant docs, need 10


  3%|▎         | 15/557 [00:00<00:30, 17.96it/s]

Found 407 valid docs from 415 relevant docs, need 10
Found 341 valid docs from 352 relevant docs, need 10
Found 91 valid docs from 94 relevant docs, need 10
Found 244 valid docs from 256 relevant docs, need 10
Found 116 valid docs from 117 relevant docs, need 10


  3%|▎         | 18/557 [00:01<00:27, 19.30it/s]

Found 105 valid docs from 114 relevant docs, need 10
Found 139 valid docs from 146 relevant docs, need 10
Found 141 valid docs from 145 relevant docs, need 10
Found 254 valid docs from 268 relevant docs, need 10


  4%|▍         | 21/557 [00:01<00:26, 20.38it/s]

Found 428 valid docs from 428 relevant docs, need 10


  4%|▍         | 24/557 [00:01<00:26, 20.16it/s]

Found 772 valid docs from 829 relevant docs, need 10
Found 850 valid docs from 864 relevant docs, need 10
Found 43 valid docs from 51 relevant docs, need 10
Found 248 valid docs from 251 relevant docs, need 10


  5%|▍         | 27/557 [00:01<00:26, 19.83it/s]

Found 149 valid docs from 157 relevant docs, need 10
Found 1240 valid docs from 1277 relevant docs, need 10
Found 142 valid docs from 150 relevant docs, need 10
Found 166 valid docs from 172 relevant docs, need 10


  5%|▌         | 30/557 [00:01<00:26, 20.08it/s]

Found 173 valid docs from 188 relevant docs, need 10


  6%|▌         | 33/557 [00:01<00:25, 20.66it/s]

Found 121 valid docs from 138 relevant docs, need 10
Found 167 valid docs from 177 relevant docs, need 10
Found 297 valid docs from 309 relevant docs, need 10
Found 219 valid docs from 225 relevant docs, need 10


  7%|▋         | 39/557 [00:02<00:24, 21.53it/s]

Found 255 valid docs from 262 relevant docs, need 10
Found 616 valid docs from 629 relevant docs, need 10
Found 298 valid docs from 303 relevant docs, need 10
Found 912 valid docs from 945 relevant docs, need 10
Found 133 valid docs from 141 relevant docs, need 10


  8%|▊         | 45/557 [00:02<00:22, 23.07it/s]

Found 104 valid docs from 118 relevant docs, need 10
Found 190 valid docs from 194 relevant docs, need 10
Found 801 valid docs from 840 relevant docs, need 10
Found 92 valid docs from 92 relevant docs, need 10
Found 264 valid docs from 268 relevant docs, need 10
Found 100 valid docs from 108 relevant docs, need 10


  9%|▊         | 48/557 [00:02<00:21, 23.37it/s]

Found 142 valid docs from 157 relevant docs, need 10
Found 153 valid docs from 156 relevant docs, need 10
Found 146 valid docs from 159 relevant docs, need 10
Found 158 valid docs from 170 relevant docs, need 10
Found 289 valid docs from 289 relevant docs, need 10


 10%|▉         | 54/557 [00:02<00:20, 24.07it/s]

Found 248 valid docs from 264 relevant docs, need 10
Found 354 valid docs from 355 relevant docs, need 10
Found 61 valid docs from 63 relevant docs, need 10
Found 173 valid docs from 179 relevant docs, need 10
Found 514 valid docs from 515 relevant docs, need 10
Found 233 valid docs from 238 relevant docs, need 10


 11%|█         | 61/557 [00:02<00:17, 27.74it/s]

Found 587 valid docs from 606 relevant docs, need 10
Found 83 valid docs from 99 relevant docs, need 10
Found 321 valid docs from 326 relevant docs, need 10
Found 653 valid docs from 684 relevant docs, need 10
Found 166 valid docs from 168 relevant docs, need 10
Found 363 valid docs from 375 relevant docs, need 10
Found 291 valid docs from 298 relevant docs, need 10


 12%|█▏        | 69/557 [00:03<00:16, 29.94it/s]

Found 114 valid docs from 116 relevant docs, need 10
Found 470 valid docs from 490 relevant docs, need 10
Found 488 valid docs from 500 relevant docs, need 10
Found 280 valid docs from 285 relevant docs, need 10
Found 301 valid docs from 303 relevant docs, need 10
Found 1220 valid docs from 1258 relevant docs, need 10
Found 138 valid docs from 139 relevant docs, need 10


 14%|█▎        | 76/557 [00:03<00:16, 28.97it/s]

Found 251 valid docs from 258 relevant docs, need 10
Found 504 valid docs from 510 relevant docs, need 10
Found 477 valid docs from 484 relevant docs, need 10
Found 685 valid docs from 708 relevant docs, need 10
Found 157 valid docs from 161 relevant docs, need 10
Found 296 valid docs from 303 relevant docs, need 10


 15%|█▍        | 83/557 [00:03<00:15, 29.84it/s]

Found 220 valid docs from 223 relevant docs, need 10
Found 150 valid docs from 162 relevant docs, need 10
Found 150 valid docs from 163 relevant docs, need 10
Found 52 valid docs from 54 relevant docs, need 10
Found 187 valid docs from 191 relevant docs, need 10
Found 773 valid docs from 790 relevant docs, need 10
Found 451 valid docs from 458 relevant docs, need 10


 15%|█▌        | 86/557 [00:03<00:17, 27.57it/s]

Found 169 valid docs from 186 relevant docs, need 10
Found 376 valid docs from 396 relevant docs, need 10
Found 1251 valid docs from 1289 relevant docs, need 10
Found 265 valid docs from 267 relevant docs, need 10
Found 223 valid docs from 244 relevant docs, need 10


 17%|█▋        | 92/557 [00:03<00:17, 26.24it/s]

Found 136 valid docs from 140 relevant docs, need 10
Found 261 valid docs from 266 relevant docs, need 10
Found 948 valid docs from 988 relevant docs, need 10
Found 320 valid docs from 326 relevant docs, need 10
Found 773 valid docs from 830 relevant docs, need 10


 18%|█▊        | 99/557 [00:04<00:16, 27.78it/s]

Found 124 valid docs from 141 relevant docs, need 10
Found 297 valid docs from 312 relevant docs, need 10
Found 293 valid docs from 293 relevant docs, need 10
Found 255 valid docs from 269 relevant docs, need 10
Found 377 valid docs from 415 relevant docs, need 10
Found 192 valid docs from 194 relevant docs, need 10


 19%|█▉        | 106/557 [00:04<00:15, 29.21it/s]

Found 250 valid docs from 261 relevant docs, need 10
Found 132 valid docs from 132 relevant docs, need 10
Found 279 valid docs from 285 relevant docs, need 10
Found 128 valid docs from 138 relevant docs, need 10
Found 401 valid docs from 403 relevant docs, need 10
Found 1019 valid docs from 1027 relevant docs, need 10
Found 365 valid docs from 394 relevant docs, need 10


 20%|█▉        | 110/557 [00:04<00:14, 30.95it/s]

Found 620 valid docs from 638 relevant docs, need 10
Found 228 valid docs from 233 relevant docs, need 10
Found 203 valid docs from 203 relevant docs, need 10
Found 77 valid docs from 81 relevant docs, need 10
Found 126 valid docs from 126 relevant docs, need 10
Found 348 valid docs from 355 relevant docs, need 10
Found 244 valid docs from 244 relevant docs, need 10


 21%|██        | 118/557 [00:04<00:14, 30.17it/s]

Found 203 valid docs from 204 relevant docs, need 10
Found 423 valid docs from 452 relevant docs, need 10
Found 133 valid docs from 134 relevant docs, need 10
Found 198 valid docs from 202 relevant docs, need 10
Found 1014 valid docs from 1025 relevant docs, need 10
Found 212 valid docs from 213 relevant docs, need 10
Found 291 valid docs from 299 relevant docs, need 10


 22%|██▏       | 125/557 [00:05<00:15, 27.92it/s]

Found 1205 valid docs from 1241 relevant docs, need 10
Found 480 valid docs from 491 relevant docs, need 10
Found 287 valid docs from 299 relevant docs, need 10
Found 356 valid docs from 367 relevant docs, need 10
Found 537 valid docs from 538 relevant docs, need 10


 24%|██▎       | 132/557 [00:05<00:15, 27.64it/s]

Found 217 valid docs from 217 relevant docs, need 10
Found 140 valid docs from 140 relevant docs, need 10
Found 88 valid docs from 96 relevant docs, need 10
Found 294 valid docs from 299 relevant docs, need 10
Found 225 valid docs from 239 relevant docs, need 10
Found 1205 valid docs from 1242 relevant docs, need 10
Found 386 valid docs from 393 relevant docs, need 10


 24%|██▍       | 136/557 [00:05<00:14, 29.14it/s]

Found 141 valid docs from 149 relevant docs, need 10
Found 206 valid docs from 220 relevant docs, need 10
Found 323 valid docs from 325 relevant docs, need 10
Found 311 valid docs from 316 relevant docs, need 10
Found 234 valid docs from 241 relevant docs, need 10
Found 811 valid docs from 841 relevant docs, need 10
Found 188 valid docs from 193 relevant docs, need 10


 26%|██▌       | 144/557 [00:05<00:13, 30.53it/s]

Found 106 valid docs from 106 relevant docs, need 10
Found 310 valid docs from 344 relevant docs, need 10
Found 106 valid docs from 112 relevant docs, need 10
Found 412 valid docs from 419 relevant docs, need 10
Found 149 valid docs from 162 relevant docs, need 10
Found 448 valid docs from 469 relevant docs, need 10
Found 433 valid docs from 440 relevant docs, need 10


 27%|██▋       | 152/557 [00:05<00:12, 31.56it/s]

Found 124 valid docs from 125 relevant docs, need 10
Found 148 valid docs from 162 relevant docs, need 10
Found 857 valid docs from 886 relevant docs, need 10
Found 161 valid docs from 165 relevant docs, need 10
Found 161 valid docs from 165 relevant docs, need 10
Found 169 valid docs from 174 relevant docs, need 10
Found 109 valid docs from 109 relevant docs, need 10


 29%|██▊       | 160/557 [00:06<00:12, 32.73it/s]

Found 197 valid docs from 209 relevant docs, need 10
Found 198 valid docs from 199 relevant docs, need 10
Found 160 valid docs from 164 relevant docs, need 10
Found 530 valid docs from 536 relevant docs, need 10
Found 283 valid docs from 299 relevant docs, need 10
Found 234 valid docs from 238 relevant docs, need 10
Found 185 valid docs from 198 relevant docs, need 10


 29%|██▉       | 164/557 [00:06<00:12, 32.44it/s]

Found 258 valid docs from 270 relevant docs, need 10
Found 195 valid docs from 197 relevant docs, need 10
Found 355 valid docs from 369 relevant docs, need 10
Found 385 valid docs from 403 relevant docs, need 10
Found 311 valid docs from 318 relevant docs, need 10
Found 1010 valid docs from 1052 relevant docs, need 10
Found 301 valid docs from 311 relevant docs, need 10


 31%|███       | 172/557 [00:06<00:11, 32.78it/s]

Found 301 valid docs from 315 relevant docs, need 10
Found 149 valid docs from 157 relevant docs, need 10
Found 118 valid docs from 124 relevant docs, need 10
Found 118 valid docs from 123 relevant docs, need 10
Found 656 valid docs from 676 relevant docs, need 10
Found 219 valid docs from 221 relevant docs, need 10
Found 168 valid docs from 173 relevant docs, need 10


 32%|███▏      | 180/557 [00:06<00:12, 31.26it/s]

Found 470 valid docs from 487 relevant docs, need 10
Found 124 valid docs from 125 relevant docs, need 10
Found 270 valid docs from 274 relevant docs, need 10
Found 120 valid docs from 121 relevant docs, need 10
Found 225 valid docs from 226 relevant docs, need 10
Found 200 valid docs from 200 relevant docs, need 10


 33%|███▎      | 184/557 [00:07<00:12, 30.20it/s]

Found 103 valid docs from 106 relevant docs, need 10
Found 335 valid docs from 346 relevant docs, need 10
Found 218 valid docs from 227 relevant docs, need 10
Found 1231 valid docs from 1243 relevant docs, need 10
Found 79 valid docs from 81 relevant docs, need 10
Found 653 valid docs from 684 relevant docs, need 10


 34%|███▍      | 192/557 [00:07<00:12, 29.86it/s]

Found 970 valid docs from 1011 relevant docs, need 10
Found 425 valid docs from 431 relevant docs, need 10
Found 308 valid docs from 337 relevant docs, need 10
Found 376 valid docs from 388 relevant docs, need 10
Found 284 valid docs from 298 relevant docs, need 10
Found 263 valid docs from 264 relevant docs, need 10


 36%|███▌      | 198/557 [00:07<00:12, 28.91it/s]

Found 364 valid docs from 366 relevant docs, need 10
Found 118 valid docs from 118 relevant docs, need 10
Found 307 valid docs from 309 relevant docs, need 10
Found 210 valid docs from 213 relevant docs, need 10
Found 649 valid docs from 675 relevant docs, need 10
Found 517 valid docs from 518 relevant docs, need 10


 37%|███▋      | 205/557 [00:07<00:11, 30.53it/s]

Found 803 valid docs from 833 relevant docs, need 10
Found 363 valid docs from 364 relevant docs, need 10
Found 193 valid docs from 203 relevant docs, need 10
Found 69 valid docs from 75 relevant docs, need 10
Found 204 valid docs from 206 relevant docs, need 10
Found 227 valid docs from 229 relevant docs, need 10
Found 120 valid docs from 121 relevant docs, need 10


 38%|███▊      | 209/557 [00:07<00:11, 31.07it/s]

Found 161 valid docs from 161 relevant docs, need 10
Found 273 valid docs from 273 relevant docs, need 10
Found 184 valid docs from 186 relevant docs, need 10
Found 458 valid docs from 472 relevant docs, need 10
Found 182 valid docs from 186 relevant docs, need 10
Found 96 valid docs from 106 relevant docs, need 10
Found 220 valid docs from 238 relevant docs, need 10


 39%|███▉      | 217/557 [00:08<00:11, 28.88it/s]

Found 153 valid docs from 158 relevant docs, need 10
Found 240 valid docs from 246 relevant docs, need 10
Found 219 valid docs from 224 relevant docs, need 10
Found 741 valid docs from 746 relevant docs, need 10
Found 185 valid docs from 203 relevant docs, need 10


 39%|███▉      | 220/557 [00:08<00:12, 26.65it/s]

Found 123 valid docs from 131 relevant docs, need 10
Found 176 valid docs from 183 relevant docs, need 10
Found 550 valid docs from 554 relevant docs, need 10
Found 186 valid docs from 201 relevant docs, need 10
Found 279 valid docs from 285 relevant docs, need 10


 41%|████      | 226/557 [00:08<00:12, 25.69it/s]

Found 458 valid docs from 458 relevant docs, need 10
Found 488 valid docs from 500 relevant docs, need 10
Found 259 valid docs from 267 relevant docs, need 10
Found 234 valid docs from 238 relevant docs, need 10
Found 91 valid docs from 94 relevant docs, need 10
Found 260 valid docs from 265 relevant docs, need 10


 42%|████▏     | 232/557 [00:08<00:12, 26.72it/s]

Found 144 valid docs from 158 relevant docs, need 10
Found 162 valid docs from 165 relevant docs, need 10
Found 157 valid docs from 159 relevant docs, need 10
Found 310 valid docs from 344 relevant docs, need 10
Found 187 valid docs from 196 relevant docs, need 10
Found 138 valid docs from 139 relevant docs, need 10


 43%|████▎     | 238/557 [00:08<00:12, 25.32it/s]

Found 159 valid docs from 160 relevant docs, need 10
Found 276 valid docs from 284 relevant docs, need 10
Found 82 valid docs from 82 relevant docs, need 10
Found 356 valid docs from 376 relevant docs, need 10
Found 190 valid docs from 196 relevant docs, need 10


 44%|████▍     | 244/557 [00:09<00:12, 25.35it/s]

Found 281 valid docs from 297 relevant docs, need 10
Found 210 valid docs from 220 relevant docs, need 10
Found 838 valid docs from 881 relevant docs, need 10
Found 218 valid docs from 218 relevant docs, need 10
Found 590 valid docs from 607 relevant docs, need 10
Found 330 valid docs from 334 relevant docs, need 10


 45%|████▍     | 250/557 [00:09<00:11, 26.69it/s]

Found 209 valid docs from 210 relevant docs, need 10
Found 384 valid docs from 384 relevant docs, need 10
Found 199 valid docs from 206 relevant docs, need 10
Found 234 valid docs from 235 relevant docs, need 10
Found 82 valid docs from 83 relevant docs, need 10
Found 289 valid docs from 290 relevant docs, need 10


 46%|████▌     | 257/557 [00:09<00:10, 27.50it/s]

Found 183 valid docs from 188 relevant docs, need 10
Found 219 valid docs from 226 relevant docs, need 10
Found 593 valid docs from 601 relevant docs, need 10
Found 366 valid docs from 383 relevant docs, need 10
Found 618 valid docs from 637 relevant docs, need 10
Found 169 valid docs from 171 relevant docs, need 10


 47%|████▋     | 260/557 [00:09<00:10, 27.37it/s]

Found 99 valid docs from 114 relevant docs, need 10
Found 234 valid docs from 238 relevant docs, need 10
Found 114 valid docs from 115 relevant docs, need 10
Found 480 valid docs from 480 relevant docs, need 10
Found 141 valid docs from 151 relevant docs, need 10


 48%|████▊     | 266/557 [00:10<00:14, 20.63it/s]

Found 193 valid docs from 223 relevant docs, need 10
Found 359 valid docs from 363 relevant docs, need 10
Found 307 valid docs from 313 relevant docs, need 10
Found 171 valid docs from 183 relevant docs, need 10


 49%|████▉     | 272/557 [00:10<00:12, 22.31it/s]

Found 520 valid docs from 521 relevant docs, need 10
Found 187 valid docs from 220 relevant docs, need 10
Found 950 valid docs from 985 relevant docs, need 10
Found 224 valid docs from 225 relevant docs, need 10
Found 145 valid docs from 151 relevant docs, need 10
Found 145 valid docs from 151 relevant docs, need 10


 50%|████▉     | 278/557 [00:10<00:11, 23.61it/s]

Found 511 valid docs from 520 relevant docs, need 10
Found 55 valid docs from 56 relevant docs, need 10
Found 51 valid docs from 51 relevant docs, need 10
Found 484 valid docs from 496 relevant docs, need 10
Found 305 valid docs from 312 relevant docs, need 10
Found 403 valid docs from 405 relevant docs, need 10


 51%|█████     | 284/557 [00:10<00:11, 24.33it/s]

Found 201 valid docs from 229 relevant docs, need 10
Found 140 valid docs from 140 relevant docs, need 10
Found 285 valid docs from 296 relevant docs, need 10
Found 195 valid docs from 199 relevant docs, need 10
Found 107 valid docs from 112 relevant docs, need 10
Found 628 valid docs from 634 relevant docs, need 10


 52%|█████▏    | 290/557 [00:11<00:10, 25.75it/s]

Found 244 valid docs from 250 relevant docs, need 10
Found 53 valid docs from 53 relevant docs, need 10
Found 300 valid docs from 310 relevant docs, need 10
Found 198 valid docs from 198 relevant docs, need 10
Found 329 valid docs from 342 relevant docs, need 10
Found 187 valid docs from 187 relevant docs, need 10


 53%|█████▎    | 294/557 [00:11<00:09, 27.32it/s]

Found 1024 valid docs from 1032 relevant docs, need 10
Found 111 valid docs from 111 relevant docs, need 10
Found 597 valid docs from 605 relevant docs, need 10
Found 217 valid docs from 238 relevant docs, need 10
Found 53 valid docs from 67 relevant docs, need 10
Found 328 valid docs from 339 relevant docs, need 10
Found 1096 valid docs from 1137 relevant docs, need 10


 54%|█████▍    | 301/557 [00:11<00:10, 25.51it/s]

Found 174 valid docs from 174 relevant docs, need 10
Found 460 valid docs from 464 relevant docs, need 10
Found 123 valid docs from 123 relevant docs, need 10
Found 110 valid docs from 128 relevant docs, need 10
Found 218 valid docs from 220 relevant docs, need 10


 55%|█████▌    | 308/557 [00:11<00:08, 27.94it/s]

Found 509 valid docs from 510 relevant docs, need 10
Found 635 valid docs from 655 relevant docs, need 10
Found 164 valid docs from 166 relevant docs, need 10
Found 105 valid docs from 106 relevant docs, need 10
Found 300 valid docs from 301 relevant docs, need 10
Found 1269 valid docs from 1310 relevant docs, need 10
Found 87 valid docs from 90 relevant docs, need 10


 57%|█████▋    | 316/557 [00:12<00:08, 29.32it/s]

Found 123 valid docs from 151 relevant docs, need 10
Found 801 valid docs from 859 relevant docs, need 10
Found 198 valid docs from 198 relevant docs, need 10
Found 165 valid docs from 169 relevant docs, need 10
Found 109 valid docs from 119 relevant docs, need 10
Found 226 valid docs from 230 relevant docs, need 10
Found 246 valid docs from 250 relevant docs, need 10


 58%|█████▊    | 323/557 [00:12<00:08, 29.11it/s]

Found 194 valid docs from 198 relevant docs, need 10
Found 85 valid docs from 90 relevant docs, need 10
Found 220 valid docs from 222 relevant docs, need 10
Found 259 valid docs from 277 relevant docs, need 10
Found 1159 valid docs from 1170 relevant docs, need 10
Found 161 valid docs from 163 relevant docs, need 10
Found 196 valid docs from 197 relevant docs, need 10


 59%|█████▉    | 329/557 [00:12<00:08, 27.54it/s]

Found 124 valid docs from 125 relevant docs, need 10
Found 763 valid docs from 765 relevant docs, need 10
Found 320 valid docs from 321 relevant docs, need 10
Found 362 valid docs from 369 relevant docs, need 10
Found 238 valid docs from 252 relevant docs, need 10
Found 90 valid docs from 90 relevant docs, need 10


 60%|█████▉    | 332/557 [00:12<00:08, 25.48it/s]

Found 199 valid docs from 206 relevant docs, need 10
Found 115 valid docs from 119 relevant docs, need 10
Found 706 valid docs from 741 relevant docs, need 10
Found 171 valid docs from 178 relevant docs, need 10
Found 234 valid docs from 250 relevant docs, need 10


 61%|██████    | 338/557 [00:12<00:09, 23.49it/s]

Found 290 valid docs from 297 relevant docs, need 10
Found 119 valid docs from 130 relevant docs, need 10
Found 145 valid docs from 145 relevant docs, need 10
Found 152 valid docs from 153 relevant docs, need 10
Found 215 valid docs from 232 relevant docs, need 10


 62%|██████▏   | 344/557 [00:13<00:08, 24.98it/s]

Found 345 valid docs from 354 relevant docs, need 10
Found 278 valid docs from 283 relevant docs, need 10
Found 365 valid docs from 368 relevant docs, need 10
Found 480 valid docs from 501 relevant docs, need 10
Found 348 valid docs from 357 relevant docs, need 10
Found 966 valid docs from 1002 relevant docs, need 10


 63%|██████▎   | 350/557 [00:13<00:08, 24.19it/s]

Found 200 valid docs from 204 relevant docs, need 10
Found 412 valid docs from 428 relevant docs, need 10
Found 292 valid docs from 297 relevant docs, need 10
Found 79 valid docs from 85 relevant docs, need 10
Found 498 valid docs from 514 relevant docs, need 10


 63%|██████▎   | 353/557 [00:13<00:08, 24.45it/s]

Found 549 valid docs from 553 relevant docs, need 10
Found 376 valid docs from 381 relevant docs, need 10
Found 164 valid docs from 171 relevant docs, need 10
Found 275 valid docs from 276 relevant docs, need 10
Found 133 valid docs from 138 relevant docs, need 10


 64%|██████▍   | 359/557 [00:13<00:09, 21.80it/s]

Found 226 valid docs from 228 relevant docs, need 10
Found 309 valid docs from 316 relevant docs, need 10
Found 190 valid docs from 193 relevant docs, need 10
Found 199 valid docs from 207 relevant docs, need 10
Found 100 valid docs from 119 relevant docs, need 10


 66%|██████▌   | 365/557 [00:14<00:07, 24.50it/s]

Found 74 valid docs from 74 relevant docs, need 10
Found 119 valid docs from 119 relevant docs, need 10
Found 226 valid docs from 239 relevant docs, need 10
Found 102 valid docs from 110 relevant docs, need 10
Found 102 valid docs from 103 relevant docs, need 10
Found 383 valid docs from 398 relevant docs, need 10


 67%|██████▋   | 371/557 [00:14<00:07, 25.06it/s]

Found 283 valid docs from 283 relevant docs, need 10
Found 220 valid docs from 223 relevant docs, need 10
Found 187 valid docs from 220 relevant docs, need 10
Found 134 valid docs from 147 relevant docs, need 10
Found 86 valid docs from 98 relevant docs, need 10
Found 197 valid docs from 202 relevant docs, need 10


 68%|██████▊   | 377/557 [00:14<00:07, 25.67it/s]

Found 103 valid docs from 116 relevant docs, need 10
Found 131 valid docs from 138 relevant docs, need 10
Found 644 valid docs from 680 relevant docs, need 10
Found 248 valid docs from 251 relevant docs, need 10
Found 138 valid docs from 142 relevant docs, need 10
Found 328 valid docs from 330 relevant docs, need 10


 68%|██████▊   | 380/557 [00:14<00:08, 21.23it/s]

Found 200 valid docs from 200 relevant docs, need 10
Found 395 valid docs from 402 relevant docs, need 10
Found 695 valid docs from 717 relevant docs, need 10


 69%|██████▉   | 386/557 [00:14<00:07, 22.56it/s]

Found 58 valid docs from 63 relevant docs, need 10
Found 278 valid docs from 288 relevant docs, need 10
Found 190 valid docs from 191 relevant docs, need 10
Found 301 valid docs from 301 relevant docs, need 10
Found 300 valid docs from 301 relevant docs, need 10
Found 183 valid docs from 188 relevant docs, need 10


 71%|███████   | 393/557 [00:15<00:06, 26.69it/s]

Found 226 valid docs from 230 relevant docs, need 10
Found 145 valid docs from 145 relevant docs, need 10
Found 69 valid docs from 75 relevant docs, need 10
Found 142 valid docs from 147 relevant docs, need 10
Found 252 valid docs from 264 relevant docs, need 10
Found 366 valid docs from 373 relevant docs, need 10
Found 667 valid docs from 683 relevant docs, need 10


 72%|███████▏  | 401/557 [00:15<00:05, 30.23it/s]

Found 115 valid docs from 116 relevant docs, need 10
Found 77 valid docs from 92 relevant docs, need 10
Found 251 valid docs from 259 relevant docs, need 10
Found 52 valid docs from 52 relevant docs, need 10
Found 579 valid docs from 595 relevant docs, need 10
Found 156 valid docs from 156 relevant docs, need 10
Found 326 valid docs from 350 relevant docs, need 10


 73%|███████▎  | 405/557 [00:15<00:05, 29.59it/s]

Found 121 valid docs from 123 relevant docs, need 10
Found 81 valid docs from 98 relevant docs, need 10
Found 102 valid docs from 103 relevant docs, need 10
Found 171 valid docs from 172 relevant docs, need 10
Found 490 valid docs from 506 relevant docs, need 10
Found 187 valid docs from 189 relevant docs, need 10


 74%|███████▍  | 411/557 [00:15<00:04, 29.30it/s]

Found 342 valid docs from 361 relevant docs, need 10
Found 290 valid docs from 310 relevant docs, need 10
Found 114 valid docs from 115 relevant docs, need 10
Found 58 valid docs from 61 relevant docs, need 10
Found 153 valid docs from 158 relevant docs, need 10
Found 198 valid docs from 199 relevant docs, need 10


 75%|███████▌  | 419/557 [00:16<00:04, 30.31it/s]

Found 74 valid docs from 75 relevant docs, need 10
Found 155 valid docs from 155 relevant docs, need 10
Found 68 valid docs from 69 relevant docs, need 10
Found 120 valid docs from 120 relevant docs, need 10
Found 144 valid docs from 151 relevant docs, need 10
Found 523 valid docs from 529 relevant docs, need 10
Found 73 valid docs from 73 relevant docs, need 10


 76%|███████▌  | 423/557 [00:16<00:04, 30.27it/s]

Found 973 valid docs from 1014 relevant docs, need 10
Found 1011 valid docs from 1019 relevant docs, need 10
Found 353 valid docs from 372 relevant docs, need 10
Found 127 valid docs from 140 relevant docs, need 10
Found 296 valid docs from 310 relevant docs, need 10
Found 168 valid docs from 168 relevant docs, need 10


 77%|███████▋  | 430/557 [00:16<00:04, 28.00it/s]

Found 354 valid docs from 370 relevant docs, need 10
Found 368 valid docs from 377 relevant docs, need 10
Found 623 valid docs from 636 relevant docs, need 10
Found 148 valid docs from 158 relevant docs, need 10
Found 81 valid docs from 82 relevant docs, need 10
Found 104 valid docs from 114 relevant docs, need 10


 78%|███████▊  | 436/557 [00:16<00:04, 27.52it/s]

Found 409 valid docs from 409 relevant docs, need 10
Found 821 valid docs from 836 relevant docs, need 10
Found 326 valid docs from 329 relevant docs, need 10
Found 105 valid docs from 107 relevant docs, need 10
Found 195 valid docs from 200 relevant docs, need 10


 79%|███████▉  | 442/557 [00:16<00:04, 26.34it/s]

Found 1017 valid docs from 1059 relevant docs, need 10
Found 199 valid docs from 204 relevant docs, need 10
Found 340 valid docs from 360 relevant docs, need 10
Found 1220 valid docs from 1258 relevant docs, need 10
Found 348 valid docs from 352 relevant docs, need 10
Found 1156 valid docs from 1205 relevant docs, need 10


 80%|████████  | 448/557 [00:17<00:04, 24.29it/s]

Found 266 valid docs from 282 relevant docs, need 10
Found 792 valid docs from 795 relevant docs, need 10
Found 707 valid docs from 715 relevant docs, need 10
Found 458 valid docs from 467 relevant docs, need 10
Found 576 valid docs from 610 relevant docs, need 10


 82%|████████▏ | 454/557 [00:17<00:04, 25.08it/s]

Found 51 valid docs from 51 relevant docs, need 10
Found 271 valid docs from 284 relevant docs, need 10
Found 278 valid docs from 287 relevant docs, need 10
Found 498 valid docs from 514 relevant docs, need 10
Found 183 valid docs from 184 relevant docs, need 10
Found 176 valid docs from 182 relevant docs, need 10


 83%|████████▎ | 460/557 [00:17<00:03, 26.23it/s]

Found 86 valid docs from 86 relevant docs, need 10
Found 194 valid docs from 194 relevant docs, need 10
Found 243 valid docs from 244 relevant docs, need 10
Found 242 valid docs from 256 relevant docs, need 10
Found 210 valid docs from 215 relevant docs, need 10
Found 183 valid docs from 190 relevant docs, need 10


 84%|████████▎ | 466/557 [00:17<00:03, 27.18it/s]

Found 325 valid docs from 327 relevant docs, need 10
Found 114 valid docs from 115 relevant docs, need 10
Found 137 valid docs from 137 relevant docs, need 10
Found 141 valid docs from 145 relevant docs, need 10
Found 402 valid docs from 411 relevant docs, need 10
Found 486 valid docs from 494 relevant docs, need 10


 84%|████████▍ | 469/557 [00:17<00:03, 27.07it/s]

Found 587 valid docs from 600 relevant docs, need 10
Found 132 valid docs from 132 relevant docs, need 10
Found 102 valid docs from 110 relevant docs, need 10
Found 485 valid docs from 500 relevant docs, need 10
Found 174 valid docs from 176 relevant docs, need 10
Found 246 valid docs from 248 relevant docs, need 10


 85%|████████▌ | 476/557 [00:18<00:02, 27.48it/s]

Found 179 valid docs from 180 relevant docs, need 10
Found 1195 valid docs from 1238 relevant docs, need 10
Found 330 valid docs from 351 relevant docs, need 10
Found 808 valid docs from 829 relevant docs, need 10
Found 412 valid docs from 415 relevant docs, need 10
Found 1198 valid docs from 1243 relevant docs, need 10


 87%|████████▋ | 482/557 [00:18<00:02, 26.58it/s]

Found 1170 valid docs from 1215 relevant docs, need 10
Found 117 valid docs from 128 relevant docs, need 10
Found 676 valid docs from 689 relevant docs, need 10
Found 821 valid docs from 836 relevant docs, need 10
Found 206 valid docs from 210 relevant docs, need 10
Found 278 valid docs from 282 relevant docs, need 10


 88%|████████▊ | 488/557 [00:18<00:02, 25.42it/s]

Found 201 valid docs from 229 relevant docs, need 10
Found 163 valid docs from 171 relevant docs, need 10
Found 446 valid docs from 446 relevant docs, need 10
Found 91 valid docs from 105 relevant docs, need 10
Found 129 valid docs from 140 relevant docs, need 10
Found 174 valid docs from 178 relevant docs, need 10


 89%|████████▉ | 495/557 [00:18<00:02, 27.49it/s]

Found 487 valid docs from 505 relevant docs, need 10
Found 410 valid docs from 411 relevant docs, need 10
Found 151 valid docs from 155 relevant docs, need 10
Found 79 valid docs from 86 relevant docs, need 10
Found 167 valid docs from 173 relevant docs, need 10
Found 228 valid docs from 229 relevant docs, need 10


 90%|█████████ | 502/557 [00:19<00:01, 27.80it/s]

Found 430 valid docs from 439 relevant docs, need 10
Found 107 valid docs from 112 relevant docs, need 10
Found 587 valid docs from 606 relevant docs, need 10
Found 449 valid docs from 463 relevant docs, need 10
Found 176 valid docs from 176 relevant docs, need 10
Found 248 valid docs from 266 relevant docs, need 10


 91%|█████████ | 506/557 [00:19<00:01, 29.01it/s]

Found 506 valid docs from 511 relevant docs, need 10
Found 247 valid docs from 253 relevant docs, need 10
Found 315 valid docs from 326 relevant docs, need 10
Found 150 valid docs from 159 relevant docs, need 10
Found 164 valid docs from 173 relevant docs, need 10
Found 350 valid docs from 356 relevant docs, need 10


 92%|█████████▏| 512/557 [00:19<00:01, 27.71it/s]

Found 246 valid docs from 249 relevant docs, need 10
Found 236 valid docs from 247 relevant docs, need 10
Found 102 valid docs from 102 relevant docs, need 10
Found 209 valid docs from 223 relevant docs, need 10
Found 263 valid docs from 263 relevant docs, need 10
Found 147 valid docs from 156 relevant docs, need 10
Found 189 valid docs from 189 relevant docs, need 10


 93%|█████████▎| 520/557 [00:19<00:01, 30.55it/s]

Found 325 valid docs from 341 relevant docs, need 10
Found 579 valid docs from 608 relevant docs, need 10
Found 310 valid docs from 318 relevant docs, need 10
Found 184 valid docs from 186 relevant docs, need 10
Found 188 valid docs from 195 relevant docs, need 10
Found 228 valid docs from 239 relevant docs, need 10
Found 179 valid docs from 179 relevant docs, need 10


 95%|█████████▍| 528/557 [00:19<00:00, 31.77it/s]

Found 391 valid docs from 392 relevant docs, need 10
Found 152 valid docs from 158 relevant docs, need 10
Found 253 valid docs from 262 relevant docs, need 10
Found 462 valid docs from 478 relevant docs, need 10
Found 179 valid docs from 180 relevant docs, need 10
Found 1096 valid docs from 1137 relevant docs, need 10
Found 169 valid docs from 183 relevant docs, need 10


 96%|█████████▌| 536/557 [00:20<00:00, 31.83it/s]

Found 560 valid docs from 561 relevant docs, need 10
Found 193 valid docs from 200 relevant docs, need 10
Found 262 valid docs from 263 relevant docs, need 10
Found 384 valid docs from 384 relevant docs, need 10
Found 238 valid docs from 240 relevant docs, need 10
Found 268 valid docs from 276 relevant docs, need 10
Found 94 valid docs from 95 relevant docs, need 10


 97%|█████████▋| 540/557 [00:20<00:00, 29.63it/s]

Found 251 valid docs from 257 relevant docs, need 10
Found 218 valid docs from 230 relevant docs, need 10
Found 425 valid docs from 431 relevant docs, need 10
Found 1244 valid docs from 1256 relevant docs, need 10
Found 43 valid docs from 51 relevant docs, need 10
Found 356 valid docs from 361 relevant docs, need 10


 98%|█████████▊| 546/557 [00:20<00:00, 26.94it/s]

Found 120 valid docs from 126 relevant docs, need 10
Found 229 valid docs from 245 relevant docs, need 10
Found 157 valid docs from 159 relevant docs, need 10
Found 193 valid docs from 199 relevant docs, need 10
Found 362 valid docs from 371 relevant docs, need 10
Found 359 valid docs from 396 relevant docs, need 10


 99%|█████████▉| 552/557 [00:20<00:00, 27.72it/s]

Found 1009 valid docs from 1050 relevant docs, need 10
Found 322 valid docs from 333 relevant docs, need 10
Found 192 valid docs from 197 relevant docs, need 10
Found 186 valid docs from 196 relevant docs, need 10
Found 51 valid docs from 51 relevant docs, need 10
Found 259 valid docs from 269 relevant docs, need 10
Found 91 valid docs from 94 relevant docs, need 10


100%|██████████| 557/557 [00:21<00:00, 26.50it/s]

Found 1263 valid docs from 1304 relevant docs, need 10
Found 323 valid docs from 323 relevant docs, need 10

Submission files created successfully!

Statistics:
Total queries processed: 557

Distribution of original relevant documents per query:
Queries with 43 original relevant docs: 2 (0.36%)
Queries with 51 original relevant docs: 3 (0.54%)
Queries with 52 original relevant docs: 2 (0.36%)
Queries with 53 original relevant docs: 2 (0.36%)
Queries with 55 original relevant docs: 1 (0.18%)
Queries with 58 original relevant docs: 2 (0.36%)
Queries with 61 original relevant docs: 1 (0.18%)
Queries with 68 original relevant docs: 1 (0.18%)
Queries with 69 original relevant docs: 2 (0.36%)
Queries with 73 original relevant docs: 1 (0.18%)
Queries with 74 original relevant docs: 2 (0.36%)
Queries with 77 original relevant docs: 2 (0.36%)
Queries with 79 original relevant docs: 3 (0.54%)
Queries with 81 original relevant docs: 2 (0.36%)
Queries with 82 original relevant docs: 2 (0.36%)
Quer


