## Problem Statement

**Use Case Title:**  
**"Research Paper Selector using Retrieve-and-Rerank RAG (R&R-RAG)"**

**Problem Statement:**  
In academic research, retrieving the most relevant scholarly papers for a specific topic (e.g., "few-shot learning techniques") is difficult because the corpus is large and keyword-based results are noisy. Embedding-based semantic search alone often fails to put the best matches at the top, since it compares independently computed query and document vectors and can miss fine-grained semantic nuances.

To improve the **accuracy and relevance** of retrieved results, this project implements a **two-stage retrieve-and-rerank pipeline** using:
- A **bi-encoder (SentenceTransformer)** for fast semantic retrieval of candidate papers from a FAISS index.
- A **cross-encoder (MS MARCO TinyBERT)** for reranking the top retrieved candidates, scoring each query-document pair jointly for deeper interaction modeling.

This approach enhances **information retrieval quality** for NLP/NLU-based academic literature search.
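The control flow of retrieve-and-rerank can be sketched without any models. In the toy example below, `cheap_score` and `careful_score` are hypothetical stand-ins for the bi-encoder and the cross-encoder respectively; they only count word overlap and are not what the notebook uses. The point is the shape of the pipeline: score everything cheaply, then rescore only a shortlist.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for real text processing."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def cheap_score(query, doc):
    """Stage-1 stand-in (bi-encoder role): number of shared words."""
    return len(tokens(query) & tokens(doc))

def careful_score(query, doc):
    """Stage-2 stand-in (cross-encoder role): overlap normalized by document length."""
    doc_tokens = tokens(doc)
    return len(tokens(query) & doc_tokens) / max(len(doc_tokens), 1)

def two_stage_search(query, docs, n_candidates, top_k):
    # Stage 1: score every document cheaply and keep a shortlist.
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: rescore only the shortlist with the costlier scorer.
    reranked = sorted(shortlist, key=lambda d: careful_score(query, d), reverse=True)
    return reranked[:top_k]

docs = [
    "Meta-Learning for Efficient Few-Shot Classification",
    "A Survey on Transformers in Vision",
    "Few-Shot Learning via Prompt Tuning with LLMs",
    "Contrastive Learning for Representation Learning",
]
top = two_stage_search("few-shot learning", docs, n_candidates=3, top_k=2)
print(top)
```

The expensive scorer runs on `n_candidates` documents rather than the whole corpus, which is what keeps the second stage affordable as the corpus grows.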

In [None]:
# What's new?
# - CSV-based structured document handling
# - CrossEncoder added for deep reranking
# - Focus is on ranking, not answering
# - Output: reranked paper results with relevance scores

## Practical Significance

This pattern (R&R-RAG) is applicable to:
- **Academic search engines** such as Semantic Scholar and arxiv-sanity.
- **Legal document analysis** for ranking contract clauses by importance.
- **Patent retrieval** and **systematic literature reviews** in NLP pipelines.

In [1]:
# Install dependencies
!pip install sentence-transformers faiss-cpu pandas tqdm -q
# tqdm - For showing progress bars


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
# Imports
import pandas as pd
import faiss
import numpy as np
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, CrossEncoder

In [3]:
# Step 1: Load academic papers dataset
# Expected format: papers.csv with 'title' and 'abstract' columns
df = pd.read_csv("papers.csv")  # <- Replace with your corpus
# Treat missing titles/abstracts as empty strings so every document is a valid string for the encoder.
df['content'] = df['title'].fillna('') + ". " + df['abstract'].fillna('')
documents = df['content'].tolist()
print(f"✅ Loaded {len(documents)} documents.")

✅ Loaded 10 documents.
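If no corpus is at hand, a small file in the expected two-column layout can be produced with the standard library. The sketch below is only an illustration: the file name `papers_sample.csv` and the temp-directory location are assumptions, and the two rows are copied from the sample documents printed in this notebook.

```python
import csv
import os
import tempfile

# Two rows taken from the notebook's own sample output.
rows = [
    {
        "title": "Few-Shot Learning via Prompt Tuning with LLMs",
        "abstract": "We propose a prompt-tuning strategy using large language models "
                    "for adapting to few-shot settings in NLP tasks. Our method reduces "
                    "the need for fine-tuning by leveraging prompt engineering.",
    },
    {
        "title": "Meta-Learning for Efficient Few-Shot Classification",
        "abstract": "Meta-learning frameworks have shown promising results in few-shot "
                    "classification by optimizing the initialization of neural networks "
                    "across tasks.",
    },
]

# Hypothetical output path; point pd.read_csv at it to try the pipeline.
path = os.path.join(tempfile.gettempdir(), "papers_sample.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "abstract"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back and build the same "title. abstract" strings the notebook uses.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
contents = [f"{r['title']}. {r['abstract']}" for r in loaded]
print(contents[0])
```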


In [4]:
print("Sample document:", documents[0])

Sample document: Few-Shot Learning via Prompt Tuning with LLMs. We propose a prompt-tuning strategy using large language models for adapting to few-shot settings in NLP tasks. Our method reduces the need for fine-tuning by leveraging prompt engineering.


In [5]:
documents

['Few-Shot Learning via Prompt Tuning with LLMs. We propose a prompt-tuning strategy using large language models for adapting to few-shot settings in NLP tasks. Our method reduces the need for fine-tuning by leveraging prompt engineering.',
 'Meta-Learning for Efficient Few-Shot Classification. Meta-learning frameworks have shown promising results in few-shot classification by optimizing the initialization of neural networks across tasks.',
 'A Survey on Transformers in Vision. This paper surveys the use of Transformer architectures in computer vision, including ViT, DETR, and Swin Transformers, with benchmarks and comparisons.',
 'Contrastive Learning for Representation Learning. We explore contrastive learning approaches that learn useful representations by pulling semantically similar instances together and pushing dissimilar ones apart.',
 'Neural Scaling Laws in Large Language Models. This work investigates how performance scales with model size, dataset size, and compute, providi

In [6]:
# Step 2: Embed documents using SentenceTransformer (bi-encoder)
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = bi_encoder.encode(documents, show_progress_bar=True)


In [7]:
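# Step 3: Create FAISS index
# IndexFlatL2 performs exact (exhaustive) search using squared L2 distance.
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.asarray(doc_embeddings, dtype="float32"))  # FAISS expects float32 vectors
print("✅ FAISS index built.")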

✅ FAISS index built.
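To make the index less of a black box: `IndexFlatL2` compares the query against every stored vector and returns the nearest ones by squared Euclidean distance. The pure-Python sketch below mimics that behaviour on toy 2-D vectors; `flat_l2_search` is a hypothetical helper, not a FAISS function. It also checks that, for unit-length vectors, ranking by L2 distance agrees with ranking by cosine similarity (since squared distance equals 2 - 2·cosine). Whether your embedding model outputs unit-length vectors is an assumption worth verifying.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search; returns (distances, indices), nearest first."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity for vectors that are already unit length (plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

vectors = [normalize(v) for v in ([1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0])]
query = normalize([0.9, 0.1])

distances, indices = flat_l2_search(vectors, query, k=3)
by_cosine = sorted(range(len(vectors)), key=lambda i: cosine(vectors[i], query), reverse=True)[:3]
print(indices, by_cosine)
```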


In [8]:
# Step 4: Cross-Encoder for reranking (query-doc pairs)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-6", max_length=512)
# Loads a cross-encoder model (TinyBERT) trained on MS MARCO, a large-scale passage ranking dataset.
# The model takes (query, document) pairs and predicts a relevance score for each pair.
# max_length=512 truncates the combined query + document input to at most 512 tokens.


In [9]:
# Step 5: Retrieval + Reranking pipeline
# Takes a user's question and returns the top_k best-matching papers
# as (score, (content, title, abstract)) tuples.
def retrieve_and_rerank(query, top_k=10):
    # Stage 1: vector search (fast recall)
    query_embedding = bi_encoder.encode([query])  # Embed the query into a dense vector with the bi-encoder.
    # Over-fetch 5x top_k candidates for broader recall, capped at the number of indexed documents.
    n_candidates = min(top_k * 5, index.ntotal)
    distances, indices = index.search(np.asarray(query_embedding, dtype="float32"), n_candidates)
    # FAISS marks empty result slots with -1; skip them so they cannot index the last document.
    initial_results = [(documents[i], df.iloc[i]['title'], df.iloc[i]['abstract']) for i in indices[0] if i >= 0]

    # Stage 2: cross-encoder reranking
    rerank_pairs = [[query, doc] for doc, _, _ in initial_results]  # [query, doc] pairs are the cross-encoder's input format.
    scores = cross_encoder.predict(rerank_pairs)  # One relevance score per pair; higher means more relevant.

    # Pair each score with its candidate and sort by score, descending.
    reranked = sorted(zip(scores, initial_results), key=lambda x: x[0], reverse=True)

    return reranked[:top_k]  # e.g., the 10 best papers
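A detail worth knowing about the candidate search: when FAISS is asked for more neighbours than the index holds, it fills the remaining result slots with index -1. In Python, -1 selects the last list element, so unfiltered placeholders would silently repeat the last document among the candidates. The sketch below simulates a search result for a 3-document index queried with k=5; the index and distance values are made up for illustration, and `valid_hits` is a hypothetical helper.

```python
def valid_hits(index_row, distance_row):
    """Drop the -1 placeholders that mark 'no result' slots."""
    return [(int(i), float(d)) for i, d in zip(index_row, distance_row) if i >= 0]

documents = ["paper A", "paper B", "paper C"]

# Simulated output row for k=5 over a 3-document index: the last two slots are padding.
index_row = [2, 0, 1, -1, -1]
distance_row = [0.12, 0.48, 0.95, 3.4e38, 3.4e38]

# Naive lookup: -1 picks documents[-1], i.e. "paper C", twice more.
naive = [documents[i] for i in index_row]

# Filtered lookup: only real hits survive.
hits = [documents[i] for i, _ in valid_hits(index_row, distance_row)]

print(naive)
print(hits)
```

With a 10-document corpus and a request for 50 candidates, most slots would be padding, so filtering (or capping the request at the index size) matters for small corpora.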

In [10]:
# Test it
query = "What are the latest techniques in few-shot learning?"
results = retrieve_and_rerank(query)

print(f"\n🔍 Top results for: {query}\n")
for score, (doc, title, abstract) in results:
    print(f"📝 Title: {title}")
    print(f"📊 Score: {score:.4f}")
    print(f"📄 Abstract: {abstract[:300]}...")
    print("-" * 80)


🔍 Top results for: What are the latest techniques in few-shot learning?

📝 Title: Few-Shot Learning via Prompt Tuning with LLMs
📊 Score: 0.9026
📄 Abstract: We propose a prompt-tuning strategy using large language models for adapting to few-shot settings in NLP tasks. Our method reduces the need for fine-tuning by leveraging prompt engineering....
--------------------------------------------------------------------------------
📝 Title: Meta-Learning for Efficient Few-Shot Classification
📊 Score: 0.7122
📄 Abstract: Meta-learning frameworks have shown promising results in few-shot classification by optimizing the initialization of neural networks across tasks....
--------------------------------------------------------------------------------
📝 Title: Contrastive Learning for Representation Learning
📊 Score: 0.0003
📄 Abstract: We explore contrastive learning approaches that learn useful representations by pulling semantically similar instances together and pushing dissimilar ones apart..
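The scores above drop sharply between the second and third result, which suggests that a score cutoff could complement the fixed `top_k`. The sketch below applies a cutoff to the three scores copied from this run. `filter_by_score` and the value `MIN_SCORE = 0.5` are hypothetical; a suitable threshold depends on the model and the queries, so treat it as something to tune rather than a recommended value.

```python
def filter_by_score(results, min_score):
    """Keep (score, title) pairs whose score reaches min_score, preserving order."""
    return [(score, title) for score, title in results if score >= min_score]

# Scores and titles copied from the run shown above.
results = [
    (0.9026, "Few-Shot Learning via Prompt Tuning with LLMs"),
    (0.7122, "Meta-Learning for Efficient Few-Shot Classification"),
    (0.0003, "Contrastive Learning for Representation Learning"),
]

MIN_SCORE = 0.5  # hypothetical cutoff; tune on your own queries
kept = filter_by_score(results, MIN_SCORE)
for score, title in kept:
    print(f"{score:.4f}  {title}")
```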