## 🔍 Notebook Summary: Find Semantically Similar Research Papers

This notebook demonstrates a multi-method approach for retrieving semantically similar research papers using their abstracts. It includes:

1. **TF-IDF Retrieval**: Cosine similarity on TF-IDF-weighted bag-of-words vectors.
2. **SBERT-Based Semantic Retrieval**: Dense sentence embeddings using `all-MiniLM-L6-v2`, with candidates re-ranked by a blend of similarity and citation count.
3. **Cross-Encoder Reranking**: Fine-grained relevance scoring using the `cross-encoder/ms-marco-MiniLM-L-6-v2` model.
4. **Explainability with KeyBERT**: Matching keywords between the query and candidate abstracts.
5. **Evaluation**: Reports the average cosine similarity between each query and its top-k neighbors for TF-IDF and SBERT.

Each retrieval method is modular, and the notebook can be easily extended for additional models, visualizations, or use cases.


In [8]:
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import torch.nn.functional as F
from keybert import KeyBERT
from nltk import sent_tokenize
from sentence_transformers import SentenceTransformer, util, CrossEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from helper_functions import preprocess, get_number_of_times_a_paper_is_cited

seed = 25  # for random state
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 200)

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
# -----------------------------
# Load and preprocess dataset
# -----------------------------
file_path = "ai_ml_papers.csv"
df = pd.read_csv(file_path)
df = df.dropna(subset=['title', 'abstract'])
df = df.drop_duplicates(subset='title')
df['processed'] = df['title'] + ". " + df['abstract']
df['cleaned_text'] = df['abstract'].apply(preprocess)

nlp = spacy.load("en_core_web_sm")



In [3]:
# Number of rows in the DataFrame after cleaning
num_rows = df.shape[0]
print("Number of rows in df:", num_rows)

Number of rows in df: 266081


In [7]:
# Positional index of the query paper and the number of similar papers (top-k) to retrieve
input_idx_of_paper = 5
top_k = 2

In [5]:
# -----------------------------
# TF-IDF similarity model
# -----------------------------

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=5)
tfidf_matrix = vectorizer.fit_transform(df['cleaned_text'])

def get_top_k_similar_papers(input_paper_idx, k=3):
    # Cosine similarity between the query paper and every paper in the corpus
    similarities = cosine_similarity(tfidf_matrix[input_paper_idx], tfidf_matrix).flatten()

    # Rank all papers by similarity, excluding the query by index rather than
    # by position (an exact duplicate could otherwise take the first slot)
    similarity_scores = [(i, s) for i, s in enumerate(similarities) if i != input_paper_idx]
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)[:k]
    top_k_indices = [i for i, _ in similarity_scores]

    # Return results (copy to avoid pandas' SettingWithCopyWarning)
    results = df.iloc[top_k_indices][['id', 'abstract', 'categories']].copy()
    results['similarity_score'] = [s for _, s in similarity_scores]
    return results

similar_papers = get_top_k_similar_papers(input_idx_of_paper, top_k)
print(f"Input Paper: {df.iloc[input_idx_of_paper]['id']} - {df.iloc[input_idx_of_paper]['abstract']}\n")
print("Top Similar Papers:")
similar_papers

Input Paper: 704.0985 -   Advances in semiconductor technology are contributing to the increasing
complexity in the design of embedded systems. Architectures with novel
techniques such as evolvable nature and autonomous behavior have engrossed lot
of attention. This paper demonstrates conceptually evolvable embedded systems
can be characterized basing on acausal nature. It is noted that in acausal
systems, future input needs to be known, here we make a mechanism such that the
system predicts the future inputs and exhibits pseudo acausal nature. An
embedded system that uses theoretical framework of acausality is proposed. Our
method aims at a novel architecture that features the hardware evolability and
autonomous behavior alongside pseudo acausality. Various aspects of this
architecture are discussed in detail along with the limitations.


Top Similar Papers:


Unnamed: 0,id,abstract,categories,similarity_score
163679,2302.13451,"The transformer is a fundamental building block in deep learning, and the\nattention mechanism is the transformer's core component. Self-supervised speech\nrepresentation learning (SSRL) represe...",cs.SD cs.CL cs.LG eess.AS,0.288698
6057,1302.4958,"Whereas acausal Bayesian networks represent probabilistic independence,\ncausal Bayesian networks represent causal relationships. In this paper, we\nexamine Bayesian methods for learning both ty...",cs.AI,0.25595
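
Excluding the query paper by its index, rather than by its position in the sorted list, avoids dropping a different paper when an exact duplicate ties with the query at similarity 1.0. A minimal sketch of that idea in plain Python; the helper name `top_k_excluding_self` and the scores are made up for illustration and are not part of the notebook's helpers:

```python
def top_k_excluding_self(similarities, query_idx, k):
    """Return (index, score) pairs for the k best matches, never the query itself."""
    ranked = sorted(
        (pair for pair in enumerate(similarities) if pair[0] != query_idx),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Made-up scores: paper 0 is an exact duplicate of the query (paper 2),
# so both score 1.0 and a positional "skip the first entry" could drop either one.
toy_similarities = [1.0, 0.31, 1.0, 0.12, 0.57]
print(top_k_excluding_self(toy_similarities, query_idx=2, k=2))
```

With these toy scores the duplicate (paper 0) is kept as the best match and the query itself never appears in the result.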


In [12]:
# -----------------------------
# SBERT similarity model
# -----------------------------

# Step 1: Load SBERT model (MiniLM is fast & accurate)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Step 2: Prepare abstracts
abstracts = df['abstract'].fillna('').tolist()

# Step 3: Encode abstracts into dense vectors
embeddings = model.encode(abstracts, show_progress_bar=True, batch_size=64)


In [None]:
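def memoize_number_of_times_each_paper_is_cited():
    # Cache one citation lookup per paper in df['citation_count'].
    # Iterate over the actual index labels: after dropna/drop_duplicates they are
    # no longer contiguous, so df.loc[i] with i taken from range() can raise KeyError.
    for label in df.index:
        title = df.at[label, 'title']
        citation_count = get_number_of_times_a_paper_is_cited(title=title)
        df.at[label, 'citation_count'] = citation_count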

memoize_number_of_times_each_paper_is_cited()
print(df[['id', 'title', 'citation_count']].head(5))

In [26]:
def get_sbert_top_n_candidates(input_paper_idx, n=50):
    query_vec = embeddings[input_paper_idx].reshape(1, -1)
    similarities = cosine_similarity(query_vec, embeddings).flatten()
    
    # Exclude the query paper
    candidate_indices = [i for i in range(len(embeddings)) if i != input_paper_idx]
    
    # Sort by similarity
    sorted_candidates = sorted(candidate_indices, key=lambda idx: similarities[idx], reverse=True)
    
    # Keep top-n
    top_n_indices = sorted_candidates[:n]
    
    return [(idx, similarities[idx]) for idx in top_n_indices]



def rerank_with_citations(top_n_candidates, alpha=0.6):
    rows = []
    for idx, sim_score in top_n_candidates:
        title = df.iloc[idx]['title']
        citation_count = get_number_of_times_a_paper_is_cited(title)  # Your custom function
        rows.append({
            'idx': idx,
            'similarity_score': sim_score,
            'citation_count': citation_count
        })
        
    # Normalize citation counts via log scale
    citation_values = [np.log1p(r['citation_count']) for r in rows]
    # Guard against an empty pool or all-zero counts (a max of 0 would divide by zero)
    max_c = max(citation_values, default=0) or 1
    for r, c in zip(rows, citation_values):
        r['normalized_citations'] = c / max_c
        
    # Compute combined score
    for r in rows:
        r['combined_score'] = alpha * r['similarity_score'] + (1 - alpha) * r['normalized_citations']
        
    # Sort descending by combined_score
    rows = sorted(rows, key=lambda x: x['combined_score'], reverse=True)
    return rows

def get_top_k_similar_sbert_with_citations(input_paper_idx, k=5, pool_size=100, alpha=0.6):
    # 1) Get top-N by pure semantic similarity
    top_n_candidates = get_sbert_top_n_candidates(input_paper_idx, n=pool_size)
    
    # 2) Re-rank with citations
    reranked = rerank_with_citations(top_n_candidates, alpha=alpha)
    
    # 3) Build DataFrame for top-k
    top_k = reranked[:k]
    indices = [r['idx'] for r in top_k]
    
    results = df.iloc[indices][['id', 'title', 'abstract', 'categories']].copy()
    results['similarity_score'] = [r['similarity_score'] for r in top_k]
    results['citation_count'] = [r['citation_count'] for r in top_k]
    results['combined_score'] = [r['combined_score'] for r in top_k]
    
    return results


# Example usage
top_k = 5
similar_papers_sbert = get_top_k_similar_sbert_with_citations(input_idx_of_paper, top_k)

# Display
print(f"Input Paper: {df.iloc[input_idx_of_paper]['id']} - {df.iloc[input_idx_of_paper]['abstract']}\n")
print("Top Similar Papers (SBERT):")
similar_papers_sbert

Input Paper: 704.0985 -   Advances in semiconductor technology are contributing to the increasing
complexity in the design of embedded systems. Architectures with novel
techniques such as evolvable nature and autonomous behavior have engrossed lot
of attention. This paper demonstrates conceptually evolvable embedded systems
can be characterized basing on acausal nature. It is noted that in acausal
systems, future input needs to be known, here we make a mechanism such that the
system predicts the future inputs and exhibits pseudo acausal nature. An
embedded system that uses theoretical framework of acausality is proposed. Our
method aims at a novel architecture that features the hardware evolability and
autonomous behavior alongside pseudo acausality. Various aspects of this
architecture are discussed in detail along with the limitations.


Top Similar Papers (SBERT):


Unnamed: 0,id,title,abstract,categories,similarity_score,citation_count,combined_score
45181,1905.05248,Design Space Exploration via Answer Set Programming Modulo Theories,"The design of embedded systems, that are ubiquitously used in mobile devices\nand cars, is becoming continuously more complex such that efficient\nsystem-level design methods are becoming crucia...",cs.AI cs.LO,0.4764,25103,0.68584
265416,cs/0403031,"Concept of E-machine: How does a ""dynamical"" brain learn to process\n ""symbolic"" information? Part I",The human brain has many remarkable information processing characteristics\nthat deeply puzzle scientists and engineers. Among the most important and the\nmost intriguing of these characteristic...,cs.AI cs.LG,0.414249,7131,0.598862
225511,2406.01384,Extending Structural Causal Models for Autonomous Embodied Systems,"In this work we aim to bridge the divide between autonomous embodied systems\nand causal reasoning. Autonomous embodied systems have come to increasingly\ninteract with humans, and in many cases...",cs.AI cs.RO cs.SE,0.421662,6287,0.598337
207072,2402.03824,A call for embodied AI,"We propose Embodied AI as the next fundamental step in the pursuit of\nArtificial General Intelligence, juxtaposing it against current AI\nadvancements, particularly Large Language Models. We tr...",cs.AI,0.422943,1592,0.544893
142413,2207.11089,Do Artificial Intelligence Systems Understand?,Are intelligent machines really intelligent? Is the underlying philosophical\nconcept of intelligence satisfactory for describing how the present systems\nwork? Is understanding a necessary and ...,cs.AI cs.LG,0.403156,1792,0.537691
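
The `combined_score` column can be checked by hand. The sketch below recomputes it for the first three rows of the table above, assuming the default `alpha=0.6` and assuming that the largest citation count in the candidate pool is 25103 (the top row's combined score of 0.68584 = 0.6 × 0.4764 + 0.4 is consistent with that). The function name `combined_scores` is illustrative only:

```python
import math


def combined_scores(candidates, alpha=0.6):
    """candidates: list of (similarity, citation_count); returns blended scores."""
    log_citations = [math.log1p(count) for _, count in candidates]
    # Fall back to 1.0 when every count is zero to avoid dividing by zero.
    max_log = max(log_citations, default=0.0) or 1.0
    return [
        alpha * sim + (1 - alpha) * (log_c / max_log)
        for (sim, _), log_c in zip(candidates, log_citations)
    ]


# (similarity_score, citation_count) copied from the first three table rows
rows = [(0.4764, 25103), (0.414249, 7131), (0.421662, 6287)]
for (sim, count), score in zip(rows, combined_scores(rows)):
    print(f"sim={sim:.6f} citations={count:>6} combined={score:.6f}")
```

Because the citation term is log-scaled and weighted by `1 - alpha = 0.4`, a paper with lower semantic similarity can still outrank a closer one if it is cited far more often, which is what happens between the second and third rows.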


# For higher precision, a cross-encoder can re-rank the retrieved papers against the input paper's abstract.

In [11]:
# title = df.iloc[28]['title']
# print(f"Title: {title}")
# get_number_of_times_a_paper_is_cited(title)

Title: Can the Internet cope with stress?


2047

In [30]:
# -----------------------------
# Cross-encoder re-ranking
# -----------------------------


# 1. Load a pretrained cross-encoder model (can be swapped with others)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# 2. Take the top-k SBERT candidates retrieved above (copy so the original DataFrame is not mutated)
sbert_top_k = similar_papers_sbert.copy()

# 3. Prepare pairs: (query abstract, candidate abstract)
query_abstract = df.iloc[input_idx_of_paper]['abstract']
candidate_abstracts = sbert_top_k['abstract'].tolist()
query_pairs = [(query_abstract, cand) for cand in candidate_abstracts]

# 4. Get relevance scores (raw logits) from the cross-encoder
cross_scores = cross_encoder.predict(query_pairs, convert_to_tensor=True)

# 5. Rerank based on cross-encoder scores
sbert_top_k['cross_score'] = cross_scores.cpu().numpy()
sbert_top_k['cross_probs'] = cross_scores.sigmoid().cpu().numpy()  # logits -> (0, 1)
sbert_top_k_sorted = sbert_top_k.sort_values(by='cross_probs', ascending=False).head(top_k)

# 6. View top reranked results
sbert_top_k_sorted[['id', 'abstract', 'cross_score', 'cross_probs']]

You are trying to use a model that was created with Sentence Transformers version 4.1.0.dev0, but you're currently using version 4.0.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.


Unnamed: 0,id,abstract,cross_score,cross_probs
45181,1905.05248,"The design of embedded systems, that are ubiquitously used in mobile devices\nand cars, is becoming continuously more complex such that efficient\nsystem-level design methods are becoming crucia...",-0.210489,0.447571
207072,2402.03824,"We propose Embodied AI as the next fundamental step in the pursuit of\nArtificial General Intelligence, juxtaposing it against current AI\nadvancements, particularly Large Language Models. We tr...",-1.021541,0.264727
225511,2406.01384,"In this work we aim to bridge the divide between autonomous embodied systems\nand causal reasoning. Autonomous embodied systems have come to increasingly\ninteract with humans, and in many cases...",-3.375798,0.03306
265416,cs/0403031,The human brain has many remarkable information processing characteristics\nthat deeply puzzle scientists and engineers. Among the most important and the\nmost intriguing of these characteristic...,-5.302018,0.004957
142413,2207.11089,Are intelligent machines really intelligent? Is the underlying philosophical\nconcept of intelligence satisfactory for describing how the present systems\nwork? Is understanding a necessary and ...,-5.323239,0.004853
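
The `cross_probs` column is the logistic sigmoid of `cross_score`. A small sketch in plain Python that reproduces the first three rows of the table above; the local `sigmoid` helper stands in for the tensor operation used in the cell:

```python
import math


def sigmoid(logit):
    """Logistic function: maps a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# cross_score values copied from the first three rows of the table above
for logit in (-0.210489, -1.021541, -3.375798):
    print(f"{logit:>10.6f} -> {sigmoid(logit):.6f}")
```

Since the sigmoid is strictly increasing, sorting by `cross_probs` yields the same order as sorting by `cross_score`; the probabilities are only easier to read.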


In [31]:
# -----------------------------
# Keyword-based explainability
# -----------------------------

kw_model = KeyBERT(model='all-MiniLM-L6-v2')
def explain_similarity_with_keywords(query_abstract, candidate_abstract, top_n=10):
    # Extract top-n keywords from both abstracts
    query_keywords = [kw for kw, _ in kw_model.extract_keywords(
        query_abstract, top_n=top_n, stop_words='english', keyphrase_ngram_range=(1, 3), use_maxsum=True, nr_candidates=top_n)]
    candidate_keywords = [kw for kw, _ in kw_model.extract_keywords(
        candidate_abstract, top_n=top_n, stop_words='english', keyphrase_ngram_range=(1, 3))]

    query_keywords = semantically_deduplicate_keywords(query_keywords, model, similarity_threshold=0.8)
    candidate_keywords = semantically_deduplicate_keywords(candidate_keywords, model, similarity_threshold=0.8)

    matched_pairs = one_to_one_keyword_matches(query_keywords, candidate_keywords, model, threshold=0.6, top_n=5)

    return matched_pairs


def one_to_one_keyword_matches(query_keywords, candidate_keywords, model, threshold, top_n=None):
    # Encode phrases
    query_embs = model.encode(query_keywords, convert_to_tensor=True)
    candidate_embs = model.encode(candidate_keywords, convert_to_tensor=True)

    # Compute cosine similarity matrix
    sim_matrix = util.cos_sim(query_embs, candidate_embs).cpu().numpy()

    # Flatten and sort all (i, j, score) tuples
    all_pairs = []
    for i in range(len(query_keywords)):
        for j in range(len(candidate_keywords)):
            score = sim_matrix[i][j]
            if score >= threshold:
                all_pairs.append((i, j, score))

    # Sort pairs by score in descending order
    all_pairs.sort(key=lambda x: x[2], reverse=True)

    matched_query_indices = set()
    matched_candidate_indices = set()
    matched_pairs = []

    # Greedy 1-to-1 matching
    for i, j, score in all_pairs:
        if i not in matched_query_indices and j not in matched_candidate_indices:
            matched_pairs.append((query_keywords[i], candidate_keywords[j], round(score, 3)))
            matched_query_indices.add(i)
            matched_candidate_indices.add(j)
            if top_n and len(matched_pairs) >= top_n:
                break

    return matched_pairs

def semantically_deduplicate_keywords(keywords, model, similarity_threshold):
    keyword_embs = model.encode(keywords, convert_to_tensor=True)  # local name avoids shadowing the global `embeddings`
    keep = []
    used_indices = set()

    for i in range(len(keywords)):
        if i in used_indices:
            continue
        keep.append(keywords[i])
        sims = util.cos_sim(keyword_embs[i], keyword_embs).squeeze()
        for j in range(i + 1, len(keywords)):
            if sims[j] > similarity_threshold:
                used_indices.add(j)

    return keep

## Explainability by showing which keywords from the input paper match the candidate paper

query_abs = df.iloc[input_idx_of_paper]['abstract']

candidate_abs = sbert_top_k_sorted.iloc[0]['abstract']  # top match
citation_count = sbert_top_k_sorted.iloc[0]['citation_count']
combined_score = sbert_top_k_sorted.iloc[0]['combined_score']

matched_pairs = explain_similarity_with_keywords(query_abs, candidate_abs)

print(f"Selected paper has {citation_count} citations and a combined relevance score of {combined_score:.3f}.")
print("It is considered relevant because:")
for q, c, score in matched_pairs:
    print(f"Current paper talks about '{q}' → retrieved paper also talks about '{c}' (score: {score:.3f})")



Selected paper has 25103 citations and a combined relevance score of 0.686.
It is considered relevant because:
Current paper talks about 'design embedded systems' → retrieved paper also talks about 'design embedded systems' (score: 1.000)


In [50]:
# -----------------------------
# Evaluation Metrics
# -----------------------------

## Average Cosine Similarity


def evaluate_tfidf_cosine(df, tfidf_matrix, top_k=5, max_queries=52):
    # Mean cosine similarity between each query and its top-k neighbours,
    # averaged over the first `max_queries` papers to keep the run short.
    avg_cosine_scores = []
    # purity_scores = []

    for i in range(min(max_queries, tfidf_matrix.shape[0])):
        query_vec = tfidf_matrix[i]
        sims = cosine_similarity(query_vec, tfidf_matrix).flatten()
        sims[i] = -1  # exclude self

        top_k_indices = sims.argsort()[::-1][:top_k]
        avg_cosine_scores.append(np.mean(sims[top_k_indices]))

    return {
        "avg_cosine_similarity": np.mean(avg_cosine_scores),
    }

results = evaluate_tfidf_cosine(df, tfidf_matrix, top_k=5)

print("TF-IDF Evaluation Metrics:")
print("→ Average Cosine Similarity:", round(results["avg_cosine_similarity"], 4))

TF-IDF Evaluation Metrics:
→ Average Cosine Similarity: 0.3557


In [69]:
def evaluate_sbert_similarity(sbert_embeddings, categories, top_k=5, max_queries=52):
    # Mean cosine similarity between each query and its top-k neighbours,
    # averaged over the first `max_queries` papers; `categories` is not used yet.
    avg_cosine_scores = []
    # purity_scores = []

    for i in range(min(max_queries, len(sbert_embeddings))):
        query_vec = sbert_embeddings[i].reshape(1, -1)
        sims = cosine_similarity(query_vec, sbert_embeddings).flatten()
        sims[i] = -1  # exclude self

        top_k_indices = sims.argsort()[::-1][:top_k]
        avg_cosine_scores.append(np.mean(sims[top_k_indices]))

    return {
        "avg_cosine_similarity": np.mean(avg_cosine_scores),
        # "category_purity": np.mean(purity_scores)
    }

categories = df['categories'].fillna("unknown").tolist()
results = evaluate_sbert_similarity(embeddings, categories, top_k=5)

print("SBERT Evaluation:")
print("→ Average Cosine Similarity:", round(results["avg_cosine_similarity"], 4))

SBERT Evaluation:
→ Average Cosine Similarity: 0.6486
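
Average cosine similarity describes how close neighbours are within each method's own vector space, so the TF-IDF and SBERT numbers are not directly comparable. A label-based check such as category purity could complement it. The sketch below is one hypothetical definition (the share of top-k neighbours with at least one category tag in common with the query); the function name and the example pairing are assumptions, not part of the notebook:

```python
def category_purity_at_k(query_categories, neighbor_categories):
    """Fraction of neighbours that share at least one category tag with the query."""
    if not neighbor_categories:
        return 0.0
    query_tags = set(query_categories.split())
    hits = sum(1 for cats in neighbor_categories if query_tags & set(cats.split()))
    return hits / len(neighbor_categories)


# Category strings follow the space-separated style of the `categories` column;
# the pairing of query and neighbours here is made up for illustration.
purity = category_purity_at_k(
    "cs.AI cs.LG",
    ["cs.AI", "cs.SD cs.CL cs.LG eess.AS", "cs.RO cs.SE"],
)
print(f"category purity@3: {purity:.3f}")
```

Averaging this value over the same queries used for the cosine metric would give a score that means the same thing for both retrieval methods.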
