## 🔍 Notebook Summary: Find Semantically Similar Research Papers

This notebook demonstrates a multi-method approach for retrieving semantically similar research papers using their abstracts. It includes:

1. **TF-IDF Retrieval**: Traditional cosine similarity on bag-of-words representations.
2. **SBERT-Based Semantic Retrieval**: Dense sentence embeddings using `all-MiniLM-L6-v2`.
3. **Cross-Encoder Reranking**: Fine-grained relevance scoring of the SBERT candidates using the `cross-encoder/ms-marco-MiniLM-L-6-v2` model.
4. **Explainability with KeyBERT**: Lists semantically matching keyword pairs extracted from the query and candidate abstracts.
5. **Evaluation**: Compares TF-IDF and SBERT by the average cosine similarity between each query and its top-k neighbors, computed over the first 52 papers.

Each retrieval method is modular, and the notebook can be easily extended for additional models, visualizations, or use cases.


In [8]:
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import torch.nn.functional as F
from keybert import KeyBERT
from nltk import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util, CrossEncoder

# Project-local helper; imported by name instead of a wildcard import
from helper_functions import preprocess

seed = 25  # random state
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 200)

%load_ext autoreload
%autoreload 2



In [2]:
# -----------------------------
# Load and preprocess dataset
# -----------------------------
file_path = "ai_ml_papers.csv"
df = pd.read_csv(file_path)
df = df.dropna(subset=['title', 'abstract'])
df = df.drop_duplicates(subset='title')
df['processed'] = df['title'] + ". " + df['abstract']
df['cleaned_text'] = df['abstract'].apply(preprocess)

nlp = spacy.load("en_core_web_sm")



In [9]:
# Number of rows in df after dropping missing values and duplicate titles
num_rows = df.shape[0]
print("Number of rows in df:", num_rows)

Number of rows in df: 266081


In [10]:
# Index of the query paper, and the number of similar papers (top-k) to retrieve
input_idx_of_paper = 0
top_k = 2

In [11]:
# -----------------------------
# TF-IDF similarity model
# -----------------------------

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=5)
tfidf_matrix = vectorizer.fit_transform(df['cleaned_text'])

def get_top_k_similar_papers(input_paper_idx, k=3):
    # Cosine similarity between the query row and every paper
    similarities = cosine_similarity(tfidf_matrix[input_paper_idx], tfidf_matrix).flatten()

    # Exclude the query explicitly: an exact duplicate can tie with it,
    # so the query is not guaranteed to come first in the sorted order
    similarities[input_paper_idx] = -np.inf
    top_k_indices = np.argsort(similarities)[::-1][:k]

    # Copy the slice so that adding a column does not raise SettingWithCopyWarning
    results = df.iloc[top_k_indices][['id', 'abstract', 'categories']].copy()
    results['similarity_score'] = similarities[top_k_indices]
    return results

similar_papers = get_top_k_similar_papers(input_idx_of_paper, top_k)
print(f"Input Paper: {df.iloc[input_idx_of_paper]['id']} - {df.iloc[input_idx_of_paper]['abstract']}\n")
print("Top Similar Papers:")
similar_papers

Input Paper: 704.0047 -   The intelligent acoustic emission locator is described in Part I, while Part
II discusses blind source separation, time delay estimation and location of two
simultaneously active continuous acoustic emission sources.
  The location of acoustic emission on complicated aircraft frame structures is
a difficult problem of non-destructive testing. This article describes an
intelligent acoustic emission source locator. The intelligent locator comprises
a sensor antenna and a general regression neural network, which solves the
location problem based on learning from examples. Locator performance was
tested on different test specimens. Tests have shown that the accuracy of
location depends on sound velocity and attenuation in the specimen, the
dimensions of the tested area, and the properties of stored data. The location
accuracy achieved by the intelligent locator is comparable to that obtained by
the conventional triangulation method, while the applicability of the


Unnamed: 0,id,abstract,categories,similarity_score
1,704.005,"Part I describes an intelligent acoustic emission locator, while Part II\ndiscusses blind source separation, time delay estimation and location of two\ncontinuous acoustic emission sources.\n A...",cs.NE cs.AI,0.569653
220454,2405.0234,"Reducing Carbon dioxide (CO2) emission is vital at both global and national\nlevels, given their significant role in exacerbating climate change. CO2\nemission, stemming from a variety of indust...",stat.AP cs.LG,0.240823
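
The ranking step in the cell above can be sketched on a toy score vector. This is a minimal illustration, assuming a hypothetical helper `top_k_excluding_self` that is not part of the notebook. It masks the query's own score rather than dropping the first sorted position, which matters when an exact duplicate ties with the query.

```python
import numpy as np

def top_k_excluding_self(scores, query_idx, k):
    """Return indices of the k highest scores, never including query_idx."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[query_idx] = -np.inf           # mask the query itself
    k = min(k, len(scores) - 1)           # at most n - 1 neighbors exist
    order = np.argsort(-scores, kind="stable")  # descending, ties keep input order
    return order[:k]

# Toy scores (made up): item 2 is the query, item 0 is an exact duplicate of it
toy_scores = [1.0, 0.2, 1.0, 0.7]
neighbors = top_k_excluding_self(toy_scores, query_idx=2, k=2)
print(neighbors)
```

With positional skipping, a stable sort would place the duplicate (item 0) first and drop it, leaving the query itself in the results; masking avoids that case.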


In [12]:
# -----------------------------
# SBERT similarity model
# -----------------------------

# Step 1: Load SBERT model (MiniLM is fast & accurate)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Step 2: Prepare abstracts
abstracts = df['abstract'].fillna('').tolist()

# Step 3: Encode abstracts into dense vectors
embeddings = model.encode(abstracts, show_progress_bar=True, batch_size=64)


In [13]:

# Step 4: Define function to get top-k similar papers
def get_top_k_similar_sbert(input_paper_idx, k=5):
    query_vec = embeddings[input_paper_idx].reshape(1, -1)
    similarities = cosine_similarity(query_vec, embeddings).flatten()
    
    # Mask the query itself instead of assuming it ranks first (duplicates can tie)
    similarities[input_paper_idx] = -np.inf
    top_k_indices = similarities.argsort()[::-1][:k]
    
    results = df.iloc[top_k_indices][['id', 'abstract', 'categories']].copy()
    results['similarity_score'] = similarities[top_k_indices]
    
    return results

# Example usage
top_k = 5
similar_papers_sbert = get_top_k_similar_sbert(input_idx_of_paper, top_k)

# Display
print(f"Input Paper: {df.iloc[input_idx_of_paper]['id']} - {df.iloc[input_idx_of_paper]['abstract']}\n")
print("Top Similar Papers (SBERT):")
similar_papers_sbert

Input Paper: 704.0047 -   The intelligent acoustic emission locator is described in Part I, while Part
II discusses blind source separation, time delay estimation and location of two
simultaneously active continuous acoustic emission sources.
  The location of acoustic emission on complicated aircraft frame structures is
a difficult problem of non-destructive testing. This article describes an
intelligent acoustic emission source locator. The intelligent locator comprises
a sensor antenna and a general regression neural network, which solves the
location problem based on learning from examples. Locator performance was
tested on different test specimens. Tests have shown that the accuracy of
location depends on sound velocity and attenuation in the specimen, the
dimensions of the tested area, and the properties of stored data. The location
accuracy achieved by the intelligent locator is comparable to that obtained by
the conventional triangulation method, while the applicability of the


Unnamed: 0,id,abstract,categories,similarity_score
1,704.005,"Part I describes an intelligent acoustic emission locator, while Part II\ndiscusses blind source separation, time delay estimation and location of two\ncontinuous acoustic emission sources.\n A...",cs.NE cs.AI,0.791418
131006,2203.16988,"Acoustic source localization has been applied in different fields, such as\naeronautics and ocean science, generally using multiple microphones array data\nto reconstruct the source location. Ho...",cs.SD cs.LG eess.AS,0.554613
89431,2012.11058,"In the field of structural health monitoring (SHM), the acquisition of\nacoustic emissions to localise damage sources has emerged as a popular\napproach. Despite recent advances, the task of loc...",cs.LG cs.SD eess.AS,0.544416
137282,2206.01495,The automated localisation of damage in structures is a challenging but\ncritical ingredient in the path towards predictive or condition-based\nmaintenance of high value structures. The use of a...,cs.LG cs.SD eess.AS,0.534913
54854,1910.04415,We propose a direction of arrival (DOA) estimation method that combines\nsound-intensity vector (IV)-based DOA estimation and DNN-based denoising and\ndereverberation. Since the accuracy of IV-b...,eess.AS cs.LG cs.SD stat.ML,0.509552
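
Cosine similarity between two vectors equals the dot product of their L2-normalized versions, so normalizing the embedding matrix once reduces each query to a single matrix-vector product. The sketch below shows this on toy vectors; `normalize_rows` and the toy data are illustrative assumptions, not names from the notebook.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)

# Toy 2-D "embeddings": rows 0 and 1 are parallel, row 2 is orthogonal to row 0
toy_embeddings = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]])
unit = normalize_rows(toy_embeddings)

# Cosine similarity of every row to row 0, as one matrix-vector product
cosine_to_first = unit @ unit[0]
print(cosine_to_first)
```

For a corpus of this size, the normalized matrix can be reused across queries instead of recomputing norms inside every `cosine_similarity` call.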


For finer-grained ranking, a cross-encoder can re-rank the SBERT candidates by scoring each (query abstract, candidate abstract) pair jointly.

In [14]:
# -----------------------------
# Cross-encoder re-ranking
# -----------------------------


# 1. Load a pretrained cross-encoder model (can be swapped with others)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# 2. Get top-k SBERT candidates first (e.g., top 5)
sbert_top_k = get_top_k_similar_sbert(input_idx_of_paper, k=5)

# 3. Prepare pairs: (query abstract, candidate abstract)
query_abstract = df.iloc[input_idx_of_paper]['abstract']
candidate_abstracts = sbert_top_k['abstract'].tolist()
query_pairs = [(query_abstract, cand) for cand in candidate_abstracts]

# 4. Get similarity scores from the cross-encoder
cross_scores = cross_encoder.predict(query_pairs, convert_to_tensor=True)

# 5. Rerank based on cross-encoder scores
# Sigmoid maps the raw logits to (0, 1); it is monotonic, so the ordering is unchanged
sbert_top_k['cross_score'] = cross_scores.cpu().numpy()
sbert_top_k['cross_probs'] = cross_scores.sigmoid().cpu().numpy()
sbert_top_k_sorted = sbert_top_k.sort_values(by='cross_probs', ascending=False).head(top_k)

# 6. View top reranked results
sbert_top_k_sorted[['id', 'abstract', 'cross_score', 'cross_probs']]


Unnamed: 0,id,abstract,cross_score,cross_probs
1,704.005,"Part I describes an intelligent acoustic emission locator, while Part II\ndiscusses blind source separation, time delay estimation and location of two\ncontinuous acoustic emission sources.\n A...",3.021933,0.953555
89431,2012.11058,"In the field of structural health monitoring (SHM), the acquisition of\nacoustic emissions to localise damage sources has emerged as a popular\napproach. Despite recent advances, the task of loc...",-0.904016,0.288226
131006,2203.16988,"Acoustic source localization has been applied in different fields, such as\naeronautics and ocean science, generally using multiple microphones array data\nto reconstruct the source location. Ho...",-1.628937,0.163976
54854,1910.04415,We propose a direction of arrival (DOA) estimation method that combines\nsound-intensity vector (IV)-based DOA estimation and DNN-based denoising and\ndereverberation. Since the accuracy of IV-b...,-2.054816,0.113567
137282,2206.01495,The automated localisation of damage in structures is a challenging but\ncritical ingredient in the path towards predictive or condition-based\nmaintenance of high value structures. The use of a...,-2.330805,0.088604
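
The `cross_probs` column is the logistic sigmoid of `cross_score`. As a quick check, the sketch below recomputes it in plain Python from the scores printed in the table above; the `sigmoid` helper is written here for illustration and is not the notebook's code.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# cross_score values copied from the reranked table above
cross_scores = [3.021933, -0.904016, -1.628937, -2.054816, -2.330805]
cross_probs = [sigmoid(s) for s in cross_scores]

for score, prob in zip(cross_scores, cross_probs):
    print(f"{score:+.6f} -> {prob:.6f}")
```

Because the sigmoid is strictly increasing, sorting by `cross_probs` gives the same order as sorting by `cross_score`; the transformed value is only easier to read.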


In [26]:
# -----------------------------
# Keyword-based explainability
# -----------------------------

kw_model = KeyBERT(model='all-MiniLM-L6-v2')
def explain_similarity_with_keywords(query_abstract, candidate_abstract, top_n=10):
    # Extract top-n keywords from both abstracts
    # (Max Sum diversification is applied to the query only)
    query_keywords = [kw for kw, _ in kw_model.extract_keywords(
        query_abstract, top_n=top_n, stop_words='english', keyphrase_ngram_range=(1, 3), use_maxsum=True, nr_candidates=top_n)]
    candidate_keywords = [kw for kw, _ in kw_model.extract_keywords(
        candidate_abstract, top_n=top_n, stop_words='english', keyphrase_ngram_range=(1, 3))]

    query_keywords = semantically_deduplicate_keywords(query_keywords, model, similarity_threshold=0.8)
    candidate_keywords = semantically_deduplicate_keywords(candidate_keywords, model, similarity_threshold=0.8)

    matched_pairs = one_to_one_keyword_matches(query_keywords, candidate_keywords, model, threshold=0.6, top_n=5)

    return matched_pairs


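def one_to_one_keyword_matches(query_keywords, candidate_keywords, model, threshold, top_n=None):
    # Nothing to match if either side has no keywords
    if not query_keywords or not candidate_keywords:
        return []

    # Encode phrases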
    query_embs = model.encode(query_keywords, convert_to_tensor=True)
    candidate_embs = model.encode(candidate_keywords, convert_to_tensor=True)

    # Compute cosine similarity matrix
    sim_matrix = util.cos_sim(query_embs, candidate_embs).cpu().numpy()

    # Collect all (i, j, score) tuples at or above the threshold
    all_pairs = []
    for i in range(len(query_keywords)):
        for j in range(len(candidate_keywords)):
            score = sim_matrix[i][j]
            if score >= threshold:
                all_pairs.append((i, j, score))

    # Sort pairs by score in descending order
    all_pairs.sort(key=lambda x: x[2], reverse=True)

    matched_query_indices = set()
    matched_candidate_indices = set()
    matched_pairs = []

    # Greedy 1-to-1 matching
    for i, j, score in all_pairs:
        if i not in matched_query_indices and j not in matched_candidate_indices:
            matched_pairs.append((query_keywords[i], candidate_keywords[j], round(score, 3)))
            matched_query_indices.add(i)
            matched_candidate_indices.add(j)
            if top_n and len(matched_pairs) >= top_n:
                break

    return matched_pairs

def semantically_deduplicate_keywords(keywords, model, similarity_threshold):
    embeddings = model.encode(keywords, convert_to_tensor=True)
    keep = []
    used_indices = set()

    for i in range(len(keywords)):
        if i in used_indices:
            continue
        keep.append(keywords[i])
        sims = util.cos_sim(embeddings[i], embeddings).squeeze()
        for j in range(i + 1, len(keywords)):
            if sims[j] > similarity_threshold:
                used_indices.add(j)

    return keep

# Explain the top match: show which query keywords align with which candidate keywords

query_abs = df.iloc[input_idx_of_paper]['abstract']

candidate_abs = sbert_top_k_sorted.iloc[0]['abstract']  # top match

matched_pairs = explain_similarity_with_keywords(query_abs, candidate_abs)

print("This candidate paper is similar to your query because:")
for q, c, score in matched_pairs:
    print(f"Current paper talks about  '{q}' → and retrieved paper also talks about '{c}' (score: {score:.3f})")



This candidate paper is similar to your query because:
Current paper talks about  'acoustic emission locator' → and retrieved paper also talks about 'acoustic emission locator' (score: 1.000)
Current paper talks about  'intelligent acoustic emission' → and retrieved paper also talks about 'intelligent acoustic emission' (score: 1.000)
Current paper talks about  'structures acoustic emission' → and retrieved paper also talks about 'acoustic emission analysis' (score: 0.818)
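
The greedy one-to-one matching used above can be traced on a small hand-made similarity matrix. This is a simplified sketch that works on precomputed scores instead of embeddings; `greedy_one_to_one` and `toy_sim` are illustrative names and values, not part of the notebook.

```python
def greedy_one_to_one(sim, threshold):
    """Pair rows with columns by descending score; each row and column is used at most once."""
    pairs = [
        (i, j, score)
        for i, row in enumerate(sim)
        for j, score in enumerate(row)
        if score >= threshold
    ]
    pairs.sort(key=lambda p: p[2], reverse=True)

    used_rows, used_cols, matches = set(), set(), []
    for i, j, score in pairs:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j, score))
            used_rows.add(i)
            used_cols.add(j)
    return matches

# Rows: query keywords, columns: candidate keywords (made-up similarities)
toy_sim = [
    [0.95, 0.90, 0.10],
    [0.92, 0.30, 0.20],
    [0.10, 0.65, 0.55],
]
matches = greedy_one_to_one(toy_sim, threshold=0.6)
print(matches)
```

In this toy case row 1 stays unmatched even though it scores 0.92 against column 0, because row 0 claimed that column first. Greedy matching is simple and fast but does not maximize the total score; an optimal assignment solver would be an alternative if that matters.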


In [50]:
# -----------------------------
# Evaluation Metrics
# -----------------------------

## Average Cosine Similarity


def evaluate_tfidf_cosine(df, tfidf_matrix, top_k=5, max_queries=52):
    # Mean top-k cosine similarity, averaged over the first `max_queries` papers.
    # `df` is unused but kept so the existing call signature still works.
    avg_cosine_scores = []
    n_queries = min(max_queries, tfidf_matrix.shape[0])

    for i in range(n_queries):
        query_vec = tfidf_matrix[i]
        sims = cosine_similarity(query_vec, tfidf_matrix).flatten()
        sims[i] = -1  # exclude self

        top_k_indices = sims.argsort()[::-1][:top_k]
        avg_cosine_scores.append(np.mean(sims[top_k_indices]))

    return {
        "avg_cosine_similarity": np.mean(avg_cosine_scores),
    }

results = evaluate_tfidf_cosine(df, tfidf_matrix, top_k=5)

print("TF-IDF Evaluation Metrics:")
print("→ Average Cosine Similarity:", round(results["avg_cosine_similarity"], 4))

TF-IDF Evaluation Metrics:
→ Average Cosine Similarity: 0.3557


In [69]:
def evaluate_sbert_similarity(sbert_embeddings, categories, top_k=5, max_queries=52):
    # Mean top-k cosine similarity, averaged over the first `max_queries` papers.
    # `categories` is unused for now; it is kept for a future category-based metric.
    avg_cosine_scores = []
    n_queries = min(max_queries, len(sbert_embeddings))

    for i in range(n_queries):
        query_vec = sbert_embeddings[i].reshape(1, -1)
        sims = cosine_similarity(query_vec, sbert_embeddings).flatten()
        sims[i] = -1  # exclude self

        top_k_indices = sims.argsort()[::-1][:top_k]
        avg_cosine_scores.append(np.mean(sims[top_k_indices]))

    return {
        "avg_cosine_similarity": np.mean(avg_cosine_scores),
    }

categories = df['categories'].fillna("unknown").tolist()
results = evaluate_sbert_similarity(embeddings, categories, top_k=5)

print("SBERT Evaluation:")
print("→ Average Cosine Similarity:", round(results["avg_cosine_similarity"], 4))

SBERT Evaluation:
→ Average Cosine Similarity: 0.6486
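
The two averages above are measured in different vector spaces, so they describe how tight each method's neighborhoods are rather than which method retrieves more relevant papers. The commented-out `purity_scores` lines suggest a category-based check was considered. One possible definition is sketched below, as an assumption rather than the notebook's method: the fraction of retrieved neighbors that share at least one arXiv category with the query. The helper name `category_overlap_at_k` and the toy category strings are hypothetical.

```python
def category_overlap_at_k(query_categories, neighbor_categories):
    """Fraction of neighbors sharing at least one space-separated category with the query."""
    if not neighbor_categories:
        return 0.0
    query_set = set(query_categories.split())
    hits = sum(1 for cats in neighbor_categories if query_set & set(cats.split()))
    return hits / len(neighbor_categories)

# Toy example (made-up category strings in the same format as the `categories` column)
toy_query = "cs.LG stat.ML"
toy_neighbors = ["cs.LG cs.SD", "eess.AS", "stat.ML", "cs.NE cs.AI"]
overlap = category_overlap_at_k(toy_query, toy_neighbors)
print(overlap)
```

Unlike average cosine similarity, this score uses labels that are independent of either embedding, so it can be compared across TF-IDF and SBERT.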
