## 🔍 Notebook Summary: Find Semantically Similar Research Papers

This notebook demonstrates a multi-method approach for retrieving semantically similar research papers using their abstracts. It includes:

1. **TF-IDF Retrieval**: Traditional cosine similarity on bag-of-words representations.
2. **SBERT-Based Semantic Retrieval**: Dense sentence embeddings using `all-MiniLM-L6-v2`.
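3. **Cross-Encoder Reranking**: Fine-grained relevance scoring of the SBERT candidates using the `cross-encoder/ms-marco-MiniLM-L-6-v2` model.
4. **Explainability with KeyBERT**: Inspection of semantically matching keywords between query and candidate abstracts, followed by sentence-level alignment and spaCy noun-phrase matching.
5. **Evaluation**: Reports the average cosine similarity between each query and its top-k neighbours for TF-IDF and SBERT. The score is computed within each representation's own vector space, so it is a rough proxy rather than a ground-truth relevance measure.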

Each retrieval method is modular, and the notebook can be easily extended for additional models, visualizations, or use cases.


In [64]:
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import torch.nn.functional as F
from keybert import KeyBERT
from nltk import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util, CrossEncoder

from helper_functions import preprocess

seed=25 #for random state
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 200) 

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
# -----------------------------
# Load and preprocess dataset
# -----------------------------
file_path = "ai_ml_papers.csv"
df = pd.read_csv(file_path)
df = df.dropna(subset=['title', 'abstract'])
df = df.drop_duplicates(subset='title')
df['processed'] = df['title'] + ". " + df['abstract']
df['cleaned_text'] = df['abstract'].apply(preprocess)

nlp = spacy.load("en_core_web_sm")



In [6]:
# -----------------------------
# TF-IDF similarity model
# -----------------------------

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=5)
tfidf_matrix = vectorizer.fit_transform(df['cleaned_text'])

def get_top_k_similar_papers(input_paper_idx, k=3):
    # Cosine similarity between the query paper and every paper in the corpus
    similarities = cosine_similarity(tfidf_matrix[input_paper_idx], tfidf_matrix).flatten()

    # Exclude the query explicitly instead of assuming it always ranks first
    similarities[input_paper_idx] = -1
    top_k_indices = similarities.argsort()[::-1][:k]

    # Copy the slice so adding a column does not trigger SettingWithCopyWarning
    results = df.iloc[top_k_indices][['id', 'abstract', 'categories']].copy()
    results['similarity_score'] = similarities[top_k_indices]
    return results

input_idx = 0
top_k = 2
similar_papers = get_top_k_similar_papers(input_idx, top_k)
print(f"Input Paper: {df.iloc[input_idx]['id']} - {df.iloc[input_idx]['abstract']}\n")
print("Top Similar Papers:")
similar_papers

Input Paper: 704.0047 -   The intelligent acoustic emission locator is described in Part I, while Part
II discusses blind source separation, time delay estimation and location of two
simultaneously active continuous acoustic emission sources.
  The location of acoustic emission on complicated aircraft frame structures is
a difficult problem of non-destructive testing. This article describes an
intelligent acoustic emission source locator. The intelligent locator comprises
a sensor antenna and a general regression neural network, which solves the
location problem based on learning from examples. Locator performance was
tested on different test specimens. Tests have shown that the accuracy of
location depends on sound velocity and attenuation in the specimen, the
dimensions of the tested area, and the properties of stored data. The location
accuracy achieved by the intelligent locator is comparable to that obtained by
the conventional triangulation method, while the applicability of the


Unnamed: 0,id,abstract,categories,similarity_score
1,704.005,"Part I describes an intelligent acoustic emission locator, while Part II\ndiscusses blind source separation, time delay estimation and location of two\ncontinuous acoustic emission sources.\n A...",cs.NE cs.AI,0.569653
220454,2405.0234,"Reducing Carbon dioxide (CO2) emission is vital at both global and national\nlevels, given their significant role in exacerbating climate change. CO2\nemission, stemming from a variety of indust...",stat.AP cs.LG,0.240823
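
To make the TF-IDF scores above easier to interpret, here is a minimal standard-library sketch of the weighting behind them. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. Whitespace tokenisation, the function names, and the toy documents are illustrative assumptions, so the numbers will not match `TfidfVectorizer` on the real corpus.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Term frequency times smoothed inverse document frequency
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(u, v):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


# Hypothetical toy corpus, loosely echoing the abstracts above
toy_docs = [
    "acoustic emission source locator",
    "acoustic emission source separation",
    "carbon dioxide emission forecast",
]
toy_vecs = tfidf_vectors(toy_docs)
toy_scores = [cosine(toy_vecs[0], v) for v in toy_vecs]
print(toy_scores)
```

The second toy document shares three terms with the query and the third shares only "emission", which mirrors the gap between the two scores in the table above.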


In [None]:
# -----------------------------
# SBERT similarity model
# -----------------------------

# Step 1: Load SBERT model (MiniLM is fast & accurate)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Step 2: Prepare abstracts
abstracts = df['abstract'].fillna('').tolist()

# Step 3: Encode abstracts into dense vectors
embeddings = model.encode(abstracts, show_progress_bar=True, batch_size=64)

# Index of the query paper for which similar papers are retrieved
input_idx = 2

In [14]:

# Step 4: Define function to get top-k similar papers
def get_top_k_similar_sbert(input_paper_idx, k=5):
    query_vec = embeddings[input_paper_idx].reshape(1, -1)
    similarities = cosine_similarity(query_vec, embeddings).flatten()
    
    # Exclude the query explicitly, then take the k highest-scoring papers
    similarities[input_paper_idx] = -1
    top_k_indices = similarities.argsort()[::-1][:k]
    
    results = df.iloc[top_k_indices][['id', 'abstract', 'categories']].copy()
    results['similarity_score'] = similarities[top_k_indices]
    
    return results

# Example usage
top_k = 5
similar_papers_sbert = get_top_k_similar_sbert(input_idx, top_k)

# Display
print(f"Input Paper: {df.iloc[input_idx]['id']} - {df.iloc[input_idx]['abstract']}\n")
print("Top Similar Papers (SBERT):")
similar_papers_sbert

Input Paper: 704.0304 -   This paper discusses the benefits of describing the world as information,
especially in the study of the evolution of life and cognition. Traditional
studies encounter problems because it is difficult to describe life and
cognition in terms of matter and energy, since their laws are valid only at the
physical scale. However, if matter and energy, as well as life and cognition,
are described in terms of information, evolution can be described consistently
as information becoming more complex.
  The paper presents eight tentative laws of information, valid at multiple
scales, which are generalizations of Darwinian, cybernetic, thermodynamic,
psychological, philosophical, and complexity principles. These are further used
to discuss the notions of life, cognition and their evolution.


Top Similar Papers (SBERT):


Unnamed: 0,id,abstract,categories,similarity_score
7914,1311.0413,"Nature can be seen as informational structure with computational dynamics\n(info-computationalism), where an (info-computational) agent is needed for the\npotential information of the world to a...",cs.AI,0.704437
100692,2105.03216,"Even when concepts similar to emergence have been used since antiquity, we\nlack an agreed definition. However, emergence has been identified as one of the\nmain features of complex systems. Mos...",physics.gen-ph cs.AI,0.647915
8572,1401.4942,This paper addresses the open question formulated as: Which levels of\nabstraction are appropriate in the synthetic modelling of life and cognition?\nwithin the framework of info-computational c...,cs.AI,0.61608
1142,912.4649,"In this review we integrate results of long term experimental study on ant\n""language"" and intelligence which were fully based on fundamental ideas of\nInformation Theory, such as the Shannon en...",cs.IT cs.AI math.IT nlin.AO,0.596172
15809,1605.05676,We present some arguments why existing methods for representing agents fall\nshort in applications crucial to artificial life. Using a thought experiment\ninvolving a fictitious dynamical system...,cs.AI cs.SI,0.594813
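
As a rough illustration of the dense retrieval step, the sketch below L2-normalises the embedding matrix so that a single matrix-vector product yields cosine similarities, and masks the query before ranking. The function name and the 2-D toy vectors are assumptions for illustration, not the notebook's data.

```python
import numpy as np


def top_k_by_cosine(embeddings, query_idx, k):
    """Return indices and cosine scores of the k nearest rows, excluding the query."""
    # After L2 normalisation, a dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    # Mask the query so it can never be returned as its own neighbour
    sims[query_idx] = -np.inf
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]


# Hypothetical toy embeddings: row 1 points in the same direction as row 0
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
toy_idx, toy_sims = top_k_by_cosine(toy, query_idx=0, k=2)
print(toy_idx, toy_sims)
```

Row 1 has cosine similarity 1.0 with the query, the same score the query has with itself. Masking by index is therefore safer than skipping the first sorted result.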


For finer-grained ranking, a cross-encoder can re-rank the SBERT candidates by scoring each candidate abstract jointly with the input paper's abstract.
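
The two-stage pattern can be sketched independently of any model: a cheap scorer shortlists candidates, and a more expensive scorer reorders only that shortlist. In this hedged example the word-overlap and Jaccard scorers are stand-ins for SBERT and the cross-encoder, and every name and toy string is hypothetical.

```python
def retrieve_then_rerank(query, corpus, cheap_score, costly_score, n_candidates, k):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costly one."""
    # Stage 1: score the whole corpus with the cheap function
    shortlist = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:n_candidates]
    # Stage 2: apply the costly function to the shortlist only
    return sorted(shortlist, key=lambda doc: costly_score(query, doc), reverse=True)[:k]


def word_overlap(a, b):
    """Stand-in for a bi-encoder: number of shared words."""
    return len(set(a.split()) & set(b.split()))


def jaccard(a, b):
    """Stand-in for a cross-encoder: shared words over all words."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


toy_query = "information and cognition"
toy_corpus = [
    "information theory of cognition and life and evolution and agents",
    "cognition and information",
    "carbon emission forecast",
    "evolution of information",
]
reranked = retrieve_then_rerank(toy_query, toy_corpus, word_overlap, jaccard, n_candidates=3, k=2)
print(reranked)
```

The first two toy documents tie in stage 1, and the second scorer breaks the tie in favour of the shorter, more focused one.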

In [28]:
# -----------------------------
# Cross-encoder re-ranking
# -----------------------------


# 1. Load a pretrained cross-encoder model (can be swapped with others)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# 2. Get top-k SBERT candidates first (e.g., top 5)
sbert_top_k = get_top_k_similar_sbert(input_idx, k=5)

# 3. Prepare pairs: (query abstract, candidate abstract)
query_abstract = df.iloc[input_idx]['abstract']
candidate_abstracts = sbert_top_k['abstract'].tolist()
query_pairs = [(query_abstract, cand) for cand in candidate_abstracts]

# 4. Get raw relevance scores (logits) from the cross-encoder
cross_scores = cross_encoder.predict(query_pairs, convert_to_tensor=True)

# 5. Rerank based on cross-encoder scores
sbert_top_k['cross_score'] = cross_scores.cpu().numpy()
# Sigmoid maps the logits to (0, 1); it is monotonic, so the ranking is unchanged
sbert_top_k['cross_probs'] = F.sigmoid(cross_scores).cpu().numpy()
sbert_top_k_sorted = sbert_top_k.sort_values(by='cross_probs', ascending=False).head(top_k)

# 6. View top reranked results
sbert_top_k_sorted[['id', 'abstract', 'cross_score', 'cross_probs']]

Unnamed: 0,id,abstract,cross_score,cross_probs
8572,1401.4942,This paper addresses the open question formulated as: Which levels of\nabstraction are appropriate in the synthetic modelling of life and cognition?\nwithin the framework of info-computational c...,0.710034,0.670409
100692,2105.03216,"Even when concepts similar to emergence have been used since antiquity, we\nlack an agreed definition. However, emergence has been identified as one of the\nmain features of complex systems. Mos...",0.439256,0.608082
7914,1311.0413,"Nature can be seen as informational structure with computational dynamics\n(info-computationalism), where an (info-computational) agent is needed for the\npotential information of the world to a...",-1.049732,0.259277
1142,912.4649,"In this review we integrate results of long term experimental study on ant\n""language"" and intelligence which were fully based on fundamental ideas of\nInformation Theory, such as the Shannon en...",-3.495477,0.029441
15809,1605.05676,We present some arguments why existing methods for representing agents fall\nshort in applications crucial to artificial life. Using a thought experiment\ninvolving a fictitious dynamical system...,-3.779365,0.022327
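
The `cross_probs` column is the logistic sigmoid of `cross_score`. As a quick check, the sketch below recomputes it from the scores in the table above using only the standard library. The variable names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# cross_score and cross_probs values copied from the table above
logits = [0.710034, 0.439256, -1.049732, -3.495477, -3.779365]
reported_probs = [0.670409, 0.608082, 0.259277, 0.029441, 0.022327]

recomputed = [sigmoid(x) for x in logits]
for logit, prob in zip(logits, recomputed):
    print(f"{logit:+.6f} -> {prob:.6f}")

# Sigmoid is monotonic, so sorting by probability matches sorting by logit
same_order = sorted(range(5), key=lambda i: -recomputed[i]) == sorted(range(5), key=lambda i: -logits[i])
```

Because the transform preserves order, sorting by `cross_score` or by `cross_probs` gives the same ranking. The sigmoid only makes the scores easier to read.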


In [49]:
# -----------------------------
# Keyword-based explainability
# -----------------------------

kw_model = KeyBERT(model='all-MiniLM-L6-v2')
def explain_similarity_with_keywords(query_abstract, candidate_abstract, top_n=10, similarity_threshold=0.5):
    # Extract top-n keywords (keep the strings, drop KeyBERT's scores)
    query_keywords = [kw for kw, _ in kw_model.extract_keywords(query_abstract, top_n=top_n, stop_words='english')]
    candidate_keywords = [kw for kw, _ in kw_model.extract_keywords(candidate_abstract, top_n=top_n, stop_words='english')]

    # Embed the keywords with the SBERT model loaded earlier
    query_embs = model.encode(query_keywords, convert_to_tensor=True)
    candidate_embs = model.encode(candidate_keywords, convert_to_tensor=True)

    # Pairwise cosine similarity: rows are query keywords, columns are candidate keywords
    sim_matrix = util.cos_sim(query_embs, candidate_embs)

    # Keep keyword pairs whose similarity reaches the threshold
    matched_pairs = []
    for i, q_kw in enumerate(query_keywords):
        for j, c_kw in enumerate(candidate_keywords):
            if sim_matrix[i][j] >= similarity_threshold:
                matched_pairs.append((q_kw, c_kw, round(float(sim_matrix[i][j]), 3)))

    return {
        'query_keywords': query_keywords,
        'candidate_keywords': candidate_keywords,
        'matched_keywords_pairs': matched_pairs,
        'num_matches': len(matched_pairs)
    }

## Explainability by showing which keywords from the input paper match the candidate paper

query_abs = df.iloc[input_idx]['abstract']

candidate_abs = sbert_top_k_sorted.iloc[0]['abstract']  # top match

explanation = explain_similarity_with_keywords(query_abs, candidate_abs)

matched_pairs = explanation['matched_keywords_pairs']
# print("Query Keywords:", explanation['query_keywords'])
# print("Candidate Keywords:", explanation['candidate_keywords'])

print("Matched Keyword Pairs (Query → Candidate):")
for q, c, score in matched_pairs:
    print(f"  {q:15} → {c:15} (score: {score:.3f})")


Matched Keyword Pairs (Query → Candidate):
  evolution       → organisms       (score: 0.553)
  cognition       → cognition       (score: 1.000)
  cognition       → cognizing       (score: 0.612)
  information     → informational   (score: 0.753)
  darwinian       → organisms       (score: 0.520)
  complexity      → computational   (score: 0.597)
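
Several query keywords above match more than one candidate keyword (for example, `cognition` pairs with both `cognition` and `cognizing`). One possible way to tidy the explanation, sketched here as an optional post-processing step and not part of the original notebook, is to keep only the strongest match per query keyword. The helper name is hypothetical.

```python
# Matched pairs copied from the output above: (query keyword, candidate keyword, score)
printed_pairs = [
    ("evolution", "organisms", 0.553),
    ("cognition", "cognition", 1.000),
    ("cognition", "cognizing", 0.612),
    ("information", "informational", 0.753),
    ("darwinian", "organisms", 0.520),
    ("complexity", "computational", 0.597),
]


def best_match_per_query(pairs):
    """Keep the highest-scoring candidate keyword for each query keyword."""
    best = {}
    for q_kw, c_kw, score in pairs:
        if q_kw not in best or score > best[q_kw][1]:
            best[q_kw] = (c_kw, score)
    # Strongest matches first
    return sorted(((q, c, s) for q, (c, s) in best.items()), key=lambda t: t[2], reverse=True)


deduplicated = best_match_per_query(printed_pairs)
for q, c, score in deduplicated:
    print(f"  {q:15} -> {c:15} (score: {score:.3f})")
```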


In [65]:

def extract_phrases(text):
    doc = nlp(text)
    return [chunk.text.lower() for chunk in doc.noun_chunks]

In [66]:
query_chunks = sent_tokenize(query_abs)
candidate_chunks = sent_tokenize(candidate_abs)

query_embs = model.encode(query_chunks, convert_to_tensor=True)
candidate_embs = model.encode(candidate_chunks, convert_to_tensor=True)

sim_matrix = util.cos_sim(query_embs, candidate_embs)

aligned_pairs = []
threshold = 0.5
for i, query_sent in enumerate(query_chunks):
    for j, cand_sent in enumerate(candidate_chunks):
        score = sim_matrix[i][j]
        if score >= threshold:
            aligned_pairs.append((query_sent, cand_sent, float(score)))

aligned_pairs = sorted(aligned_pairs, key=lambda x: x[2], reverse=True)        

In [67]:
concept_pairs = []
for q_sent, c_sent, score in aligned_pairs:
    q_concepts = extract_phrases(q_sent)
    c_concepts = extract_phrases(c_sent)
    if not q_concepts or not c_concepts:
        continue

    # Encode each phrase list once, then compare all phrase pairs in one call
    q_embs = model.encode(q_concepts, convert_to_tensor=True)
    c_embs = model.encode(c_concepts, convert_to_tensor=True)
    phrase_sims = util.cos_sim(q_embs, c_embs)

    for i, q in enumerate(q_concepts):
        for j, c in enumerate(c_concepts):
            sim = phrase_sims[i][j].item()
            if sim >= 0.7:
                concept_pairs.append((q, c, round(sim, 3)))



merged_concepts = defaultdict(float)
for q, c, sim in concept_pairs:
    label = q if q == c else f"{q} / {c}"
    merged_concepts[label] = max(merged_concepts[label], sim)

final_concepts = sorted(merged_concepts.items(), key=lambda x: x[1], reverse=True)


print("This candidate paper is similar to your query because:")
for concept, sim in final_concepts:
    print(f"• Both mention **{concept}** (semantic similarity: {sim})")

This candidate paper is similar to your query because:
• Both mention **life** (semantic similarity: 1.0)
• Both mention **cognition** (semantic similarity: 1.0)
• Both mention **  this paper** (semantic similarity: 1.0)
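
The last line of the output above (`  this paper`) is a generic noun chunk with stray whitespace, not a shared concept. A minimal sketch of one way to filter such entries follows. The stop list and helper names are hypothetical and would need tuning for a real corpus.

```python
# Hypothetical stop list of generic noun chunks that carry no topical meaning
GENERIC_PHRASES = {"this paper", "the paper", "this article", "we", "it"}


def clean_phrase(phrase):
    """Lower-case and collapse leading, trailing, and repeated whitespace."""
    return " ".join(phrase.lower().split())


def keep_informative(concepts):
    """Drop generic phrases from a list of (label, similarity) pairs."""
    cleaned = [(clean_phrase(label), sim) for label, sim in concepts]
    return [(label, sim) for label, sim in cleaned if label not in GENERIC_PHRASES]


# Concepts copied from the output above, including the stray whitespace
reported_concepts = [("life", 1.0), ("cognition", 1.0), ("  this paper", 1.0)]
informative = keep_informative(reported_concepts)
print(informative)
```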


In [50]:
# -----------------------------
# Evaluation Metrics
# -----------------------------

## Average Cosine Similarity


def evaluate_tfidf_cosine(df, tfidf_matrix, top_k=5, max_queries=52):
    """Mean cosine similarity between each query and its top-k TF-IDF neighbours.

    Only the first `max_queries` papers are used as queries to keep the run short
    (52 matches the original `i > 50` cutoff).
    """
    avg_cosine_scores = []
    n_queries = min(max_queries, tfidf_matrix.shape[0])

    for i in range(n_queries):
        sims = cosine_similarity(tfidf_matrix[i], tfidf_matrix).flatten()
        sims[i] = -1  # exclude self

        top_k_indices = sims.argsort()[::-1][:top_k]
        avg_cosine_scores.append(np.mean(sims[top_k_indices]))

    return {
        "avg_cosine_similarity": np.mean(avg_cosine_scores),
        # "category_purity": np.mean(purity_scores)
    }

results = evaluate_tfidf_cosine(df, tfidf_matrix, top_k=5)

print("TF-IDF Evaluation Metrics:")
print("→ Average Cosine Similarity:", round(results["avg_cosine_similarity"], 4))

TF-IDF Evaluation Metrics:
→ Average Cosine Similarity: 0.3557


In [69]:
def evaluate_sbert_similarity(sbert_embeddings, categories, top_k=5, max_queries=52):
    """Mean cosine similarity between each query and its top-k SBERT neighbours.

    Only the first `max_queries` papers are used as queries to keep the run short
    (52 matches the original `i > 50` cutoff). `categories` is reserved for the
    category-purity metric, which is currently disabled.
    """
    avg_cosine_scores = []
    # purity_scores = []
    n_queries = min(max_queries, len(sbert_embeddings))

    for i in range(n_queries):
        query_vec = sbert_embeddings[i].reshape(1, -1)
        sims = cosine_similarity(query_vec, sbert_embeddings).flatten()
        sims[i] = -1  # exclude self

        top_k_indices = sims.argsort()[::-1][:top_k]
        avg_cosine_scores.append(np.mean(sims[top_k_indices]))

        # query_cat = categories[i]
        # match_count = sum([query_cat in categories[j] for j in top_k_indices])
        # purity_scores.append(match_count / top_k)

    return {
        "avg_cosine_similarity": np.mean(avg_cosine_scores),
        # "category_purity": np.mean(purity_scores)
    }

categories = df['categories'].fillna("unknown").tolist()
results = evaluate_sbert_similarity(embeddings, categories, top_k=5)

print("SBERT Evaluation:")
print("→ Average Cosine Similarity:", round(results["avg_cosine_similarity"], 4))

SBERT Evaluation:
→ Average Cosine Similarity: 0.6486
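
The commented-out `category_purity` metric in the evaluation functions could offer a check that does not depend on the embedding space. The sketch below is one possible reading of it, with two assumptions: category strings are space-separated arXiv labels, and a neighbour counts as a hit when it shares at least one label with the query. This differs from the substring test in the commented code, and the toy labels are invented for illustration.

```python
def category_purity(query_categories, neighbour_categories):
    """Fraction of neighbours sharing at least one category label with the query."""
    query_labels = set(query_categories.split())
    hits = sum(1 for cats in neighbour_categories if query_labels & set(cats.split()))
    return hits / len(neighbour_categories)


# Hypothetical labels for one query and its top-5 neighbours
toy_query_cats = "cs.AI cs.LG"
toy_neighbour_cats = ["cs.AI", "stat.ML cs.LG", "physics.gen-ph", "cs.CV", "cs.LG cs.AI"]

purity = category_purity(toy_query_cats, toy_neighbour_cats)
print(purity)
```

Averaging this value over the same set of queries would give a score that is comparable between TF-IDF and SBERT, unlike the raw cosine averages reported above.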
