**Loading the dataset**


In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/content/test.csv')

# Convert the 'question' column to a list
ques = df['question'].tolist()


In [48]:
%env HUGGING_FACE_TOKEN=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

env: HUGGING_FACE_TOKEN=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


**Generating and Paraphrasing Synthetic Questions Using Pre-trained Language Models**

Here, using the GPT-2 for generating synthetic questions from a prompt and a T5-based model for paraphrasing questions to add variety. It integrates Hugging Face pipelines for both tasks, leveraging pre-trained models for text generation and rephrasing.

In [62]:
from transformers import pipeline, set_seed
import os

# Load a large language model for synthetic data generation
generator = pipeline('text-generation', model="gpt2")

# Function to generate synthetic questions based on a prompt
set_seed(42)
def generate_synthetic_questions(prompt, num_samples=5):
    synthetic_questions = []
    for _ in range(num_samples):
        synthetic_question = generator(prompt, max_length=50, num_return_sequences=1, truncation=True)[0]["generated_text"]
        synthetic_questions.append(synthetic_question)
    return synthetic_questions

# Example of adding diversity by paraphrasing
def paraphrase_question(question):
    # If you have stored your Hugging Face token in an environment variable, retrieve it
    token = os.getenv("HUGGING_FACE_TOKEN")  # Ensure to set this environment variable with your token

    # Initialize a paraphrasing pipeline with a T5-based model
    paraphraser = pipeline("text2text-generation", model="prithivida/parrot_paraphraser_on_T5")

    # Generate multiple paraphrases with beam search
    paraphrased_questions = paraphraser(question, num_beams=3, num_return_sequences=3)
    return [output["generated_text"] for output in paraphrased_questions]


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [63]:
!pip install faiss-cpu



In [64]:
!pip install sentence-transformers



**Efficient Question Retrieval and Query Rewriting Using Sentence Transformers and FAISS**

Using Sentence-Transformers to embed questions into dense vectors and indexes them in FAISS for efficient similarity-based retrieval. A query rewriting function is implemented using the Flan-T5 model, enhancing retrieval performance by refining the input query.


In [65]:
!pip install transformers

import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModel, pipeline

# Load sentence transformer model
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Build FAISS index
def embed_text(text):
    inputs = tokenizer(text, return_tensors="pt")
    embeddings = model(**inputs).last_hidden_state.mean(dim=1).detach().numpy()
    return embeddings

questions = ques  # Assuming 'ques' is defined elsewhere
question_embeddings = np.vstack([embed_text(q) for q in questions])
index = faiss.IndexFlatL2(question_embeddings.shape[1])
index.add(question_embeddings)

# Function to rewrite the query using a text generation model
def rewrite_query(query):
    query_rewriter = pipeline("text2text-generation", model="google/flan-t5-base")  # Initialize pipeline
    rewritten_query = query_rewriter(query, max_length=50)  # Increase max_length
    return rewritten_query





**Query Rewriting and Relevant Question Retrieval Using Sentence Transformers and FAISS**


In [66]:

query = "tell me about stoke laws"
# rewrite the original text query
rewritten_query = rewrite_query(query)
rewritten_query_text = rewritten_query[0]['generated_text']
print("Rewritten query:", rewritten_query[0]['generated_text']) # Print rewritten query
distances, indices = index.search(embed_text(rewritten_query_text), k=10)  # Retrieve top 3 matches
for i in indices[0]:
    print("Relevant question:", questions[i])



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Rewritten query: Stoke laws are laws that regulate the use of a stoke device.
Relevant question: When we were dealing with electrical effects, it was very useful to speak of an electric field that surrounded what?
Relevant question: One important phenomenon related to the relative strength of cohesive and adhesive forces is capillary action—the tendency of a fluid to be raised or suppressed in a narrow tube, or called this?
Relevant question: No charge is actually created or destroyed when charges are separated as we have been discussing. rather, existing charges are moved about. in fact, in all situations the total amount of charge is always this?
Relevant question: Newton’s second law of what is more than a definition; it is a relationship among acceleration, force, and mass?
Relevant question: What term, calculated by multiplying heart contractions by stroke volume, means the volume of blood pumped by the heart in one minute?
Relevant question: An object attached to a spring sliding

**Semantic Search and Document Re-Ranking Using Bi-Encoder and Cross-Encoder Model**

uses a Bi-Encoder model for efficient semantic search by embedding queries and documents into a vector space and retrieving top-k documents based on cosine similarity. The retrieved documents are further refined using a Cross-Encoder re-ranker, which scores query-document pairs for more accurate relevance ranking.

In [68]:
import torch
from sentence_transformers import SentenceTransformer, util

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load the pre-trained Bi-Encoder model for semantic search
bi_encoder_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2').to(device)


# Function to encode query and documents
def encode_query_and_docs(query, documents, model):
    # Encode the query and documents into vector space
    query_embedding = model.encode(query, convert_to_tensor=True, device=device)
    doc_embeddings = model.encode(documents, convert_to_tensor=True, device=device)
    return query_embedding, doc_embeddings

# Function to retrieve top-k relevant documents
def retrieve_top_k_documents(query, documents, model, top_k=3):
    # Encode query and documents
    query_embedding, doc_embeddings = encode_query_and_docs(query, documents, model)

    # Compute cosine similarity between query and documents
    similarities = util.pytorch_cos_sim(query_embedding, doc_embeddings)[0]

    # Get top-k most relevant documents
    top_k_indices = torch.topk(similarities, k=top_k).indices
    top_k_documents = [(documents[idx], similarities[idx].item()) for idx in top_k_indices]
    return top_k_documents


from transformers import AutoTokenizer, AutoModelForSequenceClassification
reranker_model_name = "cross-encoder/nli-deberta-v3-base"
reranker_model = AutoModelForSequenceClassification.from_pretrained(reranker_model_name).to(device)
reranker_tokenizer = AutoTokenizer.from_pretrained(reranker_model_name)

# Function to re-rank documents using a re-ranker
def rerank_documents(query, top_k_documents, model, tokenizer):
    reranked_docs = []
    for doc, score in top_k_documents:
        # Prepare the input for cross-encoder (query-document pair)
        inputs = tokenizer(query, doc, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
        outputs = model(**inputs)
        # Extract the maximum value across the logits
        rerank_score = outputs.logits[0].max().item()  # Get the max score across the logits
        reranked_docs.append((doc, rerank_score))

    # Sort by re-ranker score in descending order
    reranked_docs = sorted(reranked_docs, key=lambda x: x[1], reverse=True)
    return reranked_docs


Using device: cuda


**Top-k Document Retrieval Using Bi-Encoder Semantic Search**

Retrieving the top-k relevant documents for a given query by encoding both the query and a document set using a Bi-Encoder model. It ranks the documents based on cosine similarity, providing a list of the most relevant matches with their scores.

In [69]:
query="what is newton law of motion"
top_k_documents = retrieve_top_k_documents(query, ques, bi_encoder_model, top_k=10)
print("Top-k Relevant Documents:")
for i, (doc, score) in enumerate(top_k_documents, 1):
    print(f"{i}. {doc} (Score: {score:.4f})")

Top-k Relevant Documents:
1. Newton’s second law of what is more than a definition; it is a relationship among acceleration, force, and mass? (Score: 0.6592)
2. What is the study of how forces affect the motion of objects? (Score: 0.5277)
3. What is the term for the combined forces acting on an object? (Score: 0.4624)
4. In physics, when one subtracts the frictional force from the applied force what is the result? (Score: 0.4305)
5. In phyisics, what is considered to be the rotational version of force? (Score: 0.4261)
6. What is the force that opposes motion between two surfaces that are touching? (Score: 0.4229)
7. What is the term for the force that brings objects toward the earth? (Score: 0.4204)
8. What force occurs because no surface is perfectly smooth? (Score: 0.4052)
9. What is a vector quantity with the same direction as the force called? (Score: 0.3710)
10. When explosion of gases creates pressure resulting in motion of a rocket, the force pushing the rocket is called what? (

**Re-ranking Retrieved Documents Using Cross-Encoder**

Refining the initial top-k retrieved documents by using a Cross-Encoder model to score query-document pairs. This re-ranking process ensures more accurate prioritization of documents based on relevance, outputting the final sorted list with updated scores.


In [71]:
# Re-rank the documents using the reranker model
reranked_documents = rerank_documents(query, top_k_documents, reranker_model, reranker_tokenizer)

# Print the re-ranked documents with formatted scores
print("\nRe-ranked Documents:")
for i, (doc, score) in enumerate(reranked_documents, start=1):
    # Check if the score is a tensor and inspect its shape
    if isinstance(score, torch.Tensor):
        print(f"Score tensor shape: {score.shape}")
        # If the tensor has more than one element, take the mean or max
        if score.numel() == 1:
            score = score.item()  # Single element, directly convert to scalar
        else:
            score = score.max().item()  # Take the max score if it's a vector of logits
    print(f"{i}. {doc} (Re-rank Score: {score:.4f})")



Re-ranked Documents:
1. What force occurs because no surface is perfectly smooth? (Re-rank Score: 4.5590)
2. When explosion of gases creates pressure resulting in motion of a rocket, the force pushing the rocket is called what? (Re-rank Score: 4.2261)
3. In physics, when one subtracts the frictional force from the applied force what is the result? (Re-rank Score: 3.8636)
4. What is the force that opposes motion between two surfaces that are touching? (Re-rank Score: 2.8803)
5. In phyisics, what is considered to be the rotational version of force? (Re-rank Score: 1.8775)
6. What is the term for the combined forces acting on an object? (Re-rank Score: 1.8697)
7. Newton’s second law of what is more than a definition; it is a relationship among acceleration, force, and mass? (Re-rank Score: 1.7521)
8. What is a vector quantity with the same direction as the force called? (Re-rank Score: 1.1486)
9. What is the term for the force that brings objects toward the earth? (Re-rank Score: 1.1434)

# Evaluation


In [72]:
!pip install sacrebleu



Device Configuration:

Determines whether to use GPU or CPU for computations.
Bi-Encoder Embedding:
Utilizes the msmarco-distilbert-base-v4 model to generate vector embeddings for the query and documents for semantic search.


LLM Judge:

* Loads a GPT-2 model for binary relevance classification between query and retrieved results.
Metric 1 - LLM-Based Precision:

* Evaluates precision by scoring each result as relevant (1) or non-relevant (0) based on the GPT-2 output logits.
Metric 2 - SacreBLEU:

* Computes BLEU scores for query-document pairs to assess linguistic similarity.

Metric 3 - Cosine Precision at k:

* Uses cosine similarity between query and document embeddings to determine whether the most relevant document appears in the top-k results.
Evaluation:

Calculates and displays scores for LLM-based precision, SacreBLEU, and cosine precision metrics.


In [85]:
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import sacrebleu
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load Bi-Encoder for vector embeddings
# bi_encoder_model = SentenceTransformer('paraphrase-MiniLM-L6-v2').to(device)

# Load LLM judge (Mistral model)
llm_judge_model_name = "gpt2"
llm_judge_model = AutoModelForSequenceClassification.from_pretrained(llm_judge_model_name).to(device)
llm_judge_tokenizer = AutoTokenizer.from_pretrained(llm_judge_model_name)

# Sample queries and model outputs
# query = "Can you provide examples of questions about plant hormones?"
top_k_results = top_k_documents

# Metric 1: LLM-based Precision
def llm_based_precision(query, top_k_results, model, tokenizer):
    precision_scores = []
    for result in top_k_results:
        # Prepare query and result for LLM evaluation
        input_text = f"Query: {query}\nQuestion: {result}"
        inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(device)
        outputs = model(**inputs)
        # Assume the logits indicate relevance (1 or 0)
        relevance_score = torch.argmax(outputs.logits).item()  # Binary: 1 for relevant, 0 otherwise
        precision_scores.append(relevance_score)

    # Calculate average precision
    precision = sum(precision_scores) / len(precision_scores)
    return precision

# Metric 2: SacreBLEU
def sacrebleu_score(query, top_k_results):
    # SacreBLEU expects a reference and a list of hypothesis
    bleu_scores = [sacrebleu.sentence_bleu(doc, [query]).score for doc, _ in top_k_results]
    return sum(bleu_scores) / len(bleu_scores)

# Metric 3: Cosine Precision at k = 5
def cosine_precision_at_k(query, top_k_results, model, k=5):
    # Encode query and results
    query_embedding = model.encode(query, convert_to_tensor=True, device=device)
    # Extract only the document strings from top_k_results
    result_docs = [doc for doc, _ in top_k_results]
    result_embeddings = model.encode(result_docs, convert_to_tensor=True, device=device)

    # Compute cosine similarities
    similarities = util.pytorch_cos_sim(query_embedding, result_embeddings)[0]

    # Sort results by similarity
    sorted_indices = torch.argsort(similarities, descending=True)
    top_k_indices = sorted_indices[:k]

    # Assume target document is the first result (synthetic example)
    target_document = top_k_results[0]  # Assuming ground truth at index 0
    target_embedding = model.encode(target_document, convert_to_tensor=True, device=device)

    # Calculate precision: Check if target document is in top-k
    target_in_top_k = any(torch.equal(result_embeddings[idx], target_embedding) for idx in top_k_indices)
    precision = 1.0 if target_in_top_k else 0.0
    return precision

# Evaluation
llm_precision = llm_based_precision(query, top_k_results, llm_judge_model, llm_judge_tokenizer)
bleu_score = sacrebleu_score(query, top_k_results)
cosine_precision = cosine_precision_at_k(query, top_k_results, bi_encoder_model)

# Display Results
print("Evaluation Metrics:")
print(f"LLM-based Precision: {llm_precision:.2f}")
print(f"SacreBLEU Score: {bleu_score:.2f}")
print(f"Cosine Precision at k=5: {cosine_precision:.2f}")


Using device: cuda


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluation Metrics:
LLM-based Precision: 0.00
SacreBLEU Score: 3.81
Cosine Precision at k=5: 0.00


Computing the Discounted Cumulative Gain (DCG) for a list of relevances and normalizes it to calculate the normalized DCG (nDCG). nDCG is used to evaluate the ranking quality of a set of results by comparing it to the ideal ranking.

In [74]:
import numpy as np

def dcg(relevances):
    return np.sum([(2**rel - 1) / np.log2(i + 2) for i, rel in enumerate(relevances)])

def ndcg(relevances, ideal_relevances):
    return dcg(relevances) / dcg(ideal_relevances)

# Sample usage
relevances = [3, 2, 3, 0, 1]
ideal_relevances = sorted(relevances, reverse=True)
print("nDCG:", ndcg(relevances, ideal_relevances))


nDCG: 0.9574784666412695
