# arXiv Paper Similarity Search

This notebook implements semantic similarity search over arXiv papers using pre-computed embeddings.

## Pipeline Overview:
1. Load pre-computed embeddings from HDFS and join them with paper metadata
2. Normalize embeddings to unit length for cosine similarity
3. Initialize the SciBERT model for query encoding
4. Encode the user query
5. Compute similarity scores (dot product)
6. Return the top-K most similar papers


## 1. Initialize Spark Session

We create a Spark session to load and process the embeddings stored in HDFS.

In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ArxivSimilaritySearch")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "1")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print(f"Spark version: {spark.version}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/15 12:30:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark version: 3.5.0


## 2. Load Pre-computed Embeddings from HDFS

We load the embeddings that were generated in the previous notebook.


Each row contains:
- `id`: paper identifier
- `embedding`: 768-dimensional vector (Array[Float])

In [2]:
EMBEDDINGS_PATH = "hdfs:///arxiv/embeddings/scibert"

embeddings_df = spark.read.parquet(EMBEDDINGS_PATH)

print("Schema:")
embeddings_df.printSchema()

print(f"\nTotal papers: {embeddings_df.count()}")

print("\nSample data:")
embeddings_df.select("id").show(5)

                                                                                

Schema:
root
 |-- id: string (nullable = true)
 |-- embedding: array (nullable = true)
 |    |-- element: float (containsNull = true)



                                                                                


Total papers: 2000

Sample data:


                                                                                

+---------+
|       id|
+---------+
|1003.2590|
|1003.2591|
|1003.2592|
|1003.2593|
|1003.2594|
+---------+
only showing top 5 rows



## 3. Load Paper Metadata

To display meaningful results, we need the paper titles and abstracts.
We'll join the embeddings with the cleaned metadata.

In [3]:
from pyspark.sql.functions import col

# Load cleaned metadata
metadata_df = spark.read.parquet("hdfs:///arxiv/clean")

# Join embeddings with metadata
papers_df = (
    embeddings_df
    .join(metadata_df.select("id", "title", "abstract"), on="id", how="inner")
)

print(f"Papers with embeddings and metadata: {papers_df.count()}")
papers_df.select("id", "title").show(3, truncate=False)

                                                                                

Papers with embeddings and metadata: 2000
+---------+------------------------------------------------------------------------------------+
|id       |title                                                                               |
+---------+------------------------------------------------------------------------------------+
|1003.2590|IceCube and KM3NeT: Lessons and Relations                                           |
|1003.2591|Stochastic Aspect of the Tomographic Reconstruction Problems in a\n  Transport Model|
|1003.2592|Non-permutation invariant Borel quantifiers                                         |
+---------+------------------------------------------------------------------------------------+
only showing top 3 rows



## 4. Normalize Embeddings

For cosine similarity, we normalize all vectors to unit length.

**Why normalize?**
- For unit-length vectors, cosine similarity equals the dot product
- Norms are computed once up front, so scoring a query needs no per-pair division
- Scores are bounded in [-1, 1], which makes them comparable across queries

**Formula**: `v_norm = v / ||v||`
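The equivalence between cosine similarity and the dot product of unit vectors can be checked on a toy example. The sketch below is dependency-free and uses hypothetical helper names (`l2_normalize`, `dot`, `cosine`) that are not part of the notebook; it only illustrates the formula above.

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 length; assumes v is not the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Textbook cosine similarity with explicit division by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

print(cosine(a, b))         # cosine computed with division by the norms
print(dot(a_unit, b_unit))  # same value (about 0.96) from a plain dot product
```

Once every stored vector is normalized, only the dot product has to be evaluated per query.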

In [4]:
!pip install numpy

Defaulting to user installation because normal site-packages is not writeable


In [5]:
from pyspark.ml.feature import Normalizer
from pyspark.ml.functions import array_to_vector
from pyspark.sql.functions import col

# Convert array<float> → Spark ML vector
papers_df = papers_df.withColumn(
    "embedding_vec",
    array_to_vector(col("embedding"))
)

# Normalize
normalizer = Normalizer(
    inputCol="embedding_vec",
    outputCol="embedding_norm",
    p=2.0
)

papers_df = normalizer.transform(papers_df)

papers_df.select("id", "embedding_norm").show(1, truncate=True)




+---------+--------------------+
|       id|      embedding_norm|
+---------+--------------------+
|1003.2590|[-0.0433954223934...|
+---------+--------------------+
only showing top 1 row



                                                                                

## 5. Load SciBERT Model for Query Encoding

We use the same model that generated the stored embeddings.

**Model**: `allenai/scibert_scivocab_uncased`
- Pre-trained on scientific papers
- Uses a vocabulary (`scivocab`) built from scientific text
- 768-dimensional embeddings (matches our stored vectors)

In [6]:
from transformers import AutoTokenizer, AutoModel
import torch

print("Loading SciBERT model...")
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()  # Set to evaluation mode

print("Model loaded successfully!")


Loading SciBERT model...
Model loaded successfully!


## 6. Define Query Encoding Function

This function converts a text query into a normalized embedding vector.

**Steps**:
1. Tokenize the query text
2. Pass through SciBERT to get token embeddings
3. Apply mean pooling to get a single vector
4. Normalize to unit length
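Step 3 (mean pooling) is the least obvious of these steps, so here is a minimal pure-Python sketch of masked mean pooling on toy data. The function name `masked_mean_pool` and the toy values are illustrative assumptions; the notebook itself does this with PyTorch tensors.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [t / count for t in total]

# Toy input: three real tokens and one padding position (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

For a single query there is no padding, so the mask is all ones and this reduces to a plain average. The mask matters when several texts of different lengths are encoded in one batch.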

In [7]:
import numpy as np

def encode_query(query_text, max_length=256):
    """
    Encode a text query into a normalized embedding vector.
    
    Args:
        query_text: String to encode
        max_length: Maximum token length
    
    Returns:
        Normalized embedding as a list
    """
    # Tokenize
    inputs = tokenizer(
        query_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=max_length
    )
    
    # Generate embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Mean pooling
    token_embeddings = outputs.last_hidden_state
    attention_mask = inputs["attention_mask"].unsqueeze(-1)
    
    embedding = (token_embeddings * attention_mask).sum(dim=1)
    embedding = embedding / attention_mask.sum(dim=1)
    
    # Normalize
    embedding_np = embedding.squeeze().numpy()
    embedding_norm = embedding_np / np.linalg.norm(embedding_np)
    
    return embedding_norm.tolist()

print("Query encoding function ready!")

Query encoding function ready!


## 7. Lineage Cut

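Project `papers_df` down to the columns the search needs. Note that `select` is a lazy transformation: it narrows the DataFrame but does not by itself truncate Spark's lineage. Truncating the lineage would require `checkpoint()` or `localCheckpoint()`.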

In [8]:
papers_df = papers_df.select(
    "id",
    "title",
    "abstract",
    "embedding_norm"
)


## 8. Define Similarity Search Function

This is the main search function that:
1. Encodes the query
2. Computes similarity with all papers
3. Returns top-K most similar papers
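The scoring and ranking logic in steps 2 and 3 can be sketched without Spark. The example below assumes a small in-memory corpus of unit-length vectors and uses a hypothetical helper `top_k_by_dot`; it illustrates the idea and is not the notebook's implementation.

```python
import heapq

def top_k_by_dot(query, docs, k):
    """Return the k (doc_id, score) pairs with the highest dot product.

    `docs` maps a document id to a unit-length vector. `query` is also
    unit-length, so the dot product equals the cosine similarity.
    """
    scored = (
        (doc_id, sum(q * d for q, d in zip(query, vec)))
        for doc_id, vec in docs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical toy corpus of 2-dimensional unit vectors.
docs = {
    "paper_a": [1.0, 0.0],
    "paper_b": [0.0, 1.0],
    "paper_c": [0.6, 0.8],
}
query = [0.8, 0.6]

for doc_id, score in top_k_by_dot(query, docs, k=2):
    print(doc_id, round(score, 2))
```

`heapq.nlargest` avoids sorting the full corpus, which is the same reason a top-K query is cheaper than a full ordering.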

In [9]:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col, expr, lit, array


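def search_similar_papers(query_text, top_k=5):
    """Return a Spark DataFrame with the top_k papers most similar to query_text.

    The result is lazy: similarities are computed when an action such as
    show() or collect() runs.
    """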
    print(f"Query: '{query_text}'")
    print("Encoding query...")

    # Driver-only encoding (NumPy OK here)
    query_embedding = encode_query(query_text)

    print("Computing similarities...")

    # Build Spark literal array from Python list
    query_arr_col = array(*[lit(float(x)) for x in query_embedding])

    scored_df = (
        papers_df
        # document vector -> array
        .withColumn("embedding_arr", vector_to_array(col("embedding_norm")))
        # attach query vector as a column
        .withColumn("query_arr", query_arr_col)
        # cosine similarity = dot product
        .withColumn(
            "similarity",
            expr("""
                aggregate(
                    zip_with(embedding_arr, query_arr, (x, y) -> x * y),
                    0D,
                    (acc, v) -> acc + v
                )
            """)
        )
    )

    return (
        scored_df
        .orderBy(col("similarity").desc())
        .select("id", "title", "abstract", "similarity")
        .limit(top_k)
    )


## 9. Example Search: Deep Learning for Medical Imaging

Let's search for papers related to deep learning in medical imaging.

In [10]:
query1 = "Deep learning methods for medical image analysis"

results1 = search_similar_papers(query1, top_k=5)

print("\n" + "="*80)
print("TOP 5 RESULTS")
print("="*80)

results1.show(truncate=False)

Query: 'Deep learning methods for medical image analysis'
Encoding query...
Computing similarities...

TOP 5 RESULTS




[Output table truncated in the source. Columns: id, title, abstract, similarity.]

                                                                                

## 10. Example Search: Quantum Computing

Let's try another query about quantum computing algorithms.

In [11]:
query2 = "Quantum algorithms for optimization problems"

results2 = search_similar_papers(query2, top_k=5)

print("\n" + "="*80)
print("TOP 5 RESULTS")
print("="*80)

results2.show(truncate=False)

Query: 'Quantum algorithms for optimization problems'
Encoding query...
Computing similarities...

TOP 5 RESULTS




[Output table truncated in the source. Columns: id, title, abstract, similarity.]

                                                                                

## 11. Example Search: Natural Language Processing

Search for papers on transformer models and attention mechanisms.

In [12]:
query3 = "Transformer architectures and self-attention mechanisms for NLP"

results3 = search_similar_papers(query3, top_k=5)

print("\n" + "="*80)
print("TOP 5 RESULTS")
print("="*80)

results3.show(truncate=False)

Query: 'Transformer architectures and self-attention mechanisms for NLP'
Encoding query...
Computing similarities...

TOP 5 RESULTS




[Output table truncated in the source. Columns: id, title, abstract, similarity.]

                                                                                

## 12. Helper Function: Display Results in Readable Format

Create a prettier display for search results with truncated abstracts.
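arXiv abstracts and titles often contain hard line breaks and leading spaces, so it can help to collapse whitespace before truncating. The helper below is a hypothetical sketch (`format_abstract` is not part of the notebook) that also tolerates a missing abstract.

```python
def format_abstract(text, max_length=300):
    """Collapse whitespace runs and truncate to max_length characters."""
    if text is None:  # a null abstract in the metadata
        return ""
    cleaned = " ".join(text.split())
    if len(cleaned) > max_length:
        cleaned = cleaned[:max_length] + "..."
    return cleaned

sample = "  We show that the holomorph\nof the free group   on two generators"
print(format_abstract(sample, max_length=40))
```

Truncation happens after whitespace is collapsed, so `max_length` counts visible characters rather than padding.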

In [13]:
def display_results(results_df, max_abstract_length=300):
    """
    Display search results in a readable format.
    
    Args:
        results_df: Spark DataFrame with results
        max_abstract_length: Max characters to show from abstract
    """
    results_list = results_df.collect()
    
    for i, row in enumerate(results_list, 1):
        print(f"\n{'='*80}")
        print(f"RANK {i} | Similarity: {row.similarity:.4f}")
        print(f"{'='*80}")
        print(f"ID: {row.id}")
        print(f"\nTitle: {row.title}")
        
        abstract = row.abstract or ""  # treat a null abstract as empty
        if len(abstract) > max_abstract_length:
            abstract = abstract[:max_abstract_length] + "..."
        
        print(f"\nAbstract: {abstract}")

print("Display function ready!")

Display function ready!


## 13. Display Previous Results in Readable Format

Let's re-display the first search with better formatting.

In [14]:
print("Query: 'Deep learning methods for medical image analysis'")
display_results(results1)

Query: 'Deep learning methods for medical image analysis'





RANK 1 | Similarity: 0.0683
ID: 1003.2811

Title: Refining the Protein-Protein Interactome using Gene Expression Data

Abstract:   Proteins interact with other proteins within biological pathways, forming
connected subgraphs in the protein-protein interactome (PPI). Proteins are
often involved in multiple biological pathways which complicates interpretation
of interactions between proteins. Gene expression data can assist our...

RANK 2 | Similarity: 0.0668
ID: 1003.4139

Title: On the vanishing of the lower K-theory of the Holomorph of a free group
  on two generators

Abstract:   We show that the holomorph of the free group on two generators satisfies the
Farrell-Jones Fibered Isomorphism Conjecture. As a consequence, we show that
the lower K-theory of the above group vanishes.


RANK 3 | Similarity: 0.0652
ID: 1003.4292

Title: Breakdown of Angular Momentum Selection Rules in High Pressure Optical
  Pumping Experiments

Abstract:   We present measurements, using two complementary m

                                                                                

## 14. Interactive Search Cell

Try your own queries here!

In [15]:
# Enter your query here
your_query = "graph neural networks for molecular property prediction"
num_results = 10

# Run search
your_results = search_similar_papers(your_query, top_k=num_results)

# Display
display_results(your_results)

Query: 'graph neural networks for molecular property prediction'
Encoding query...
Computing similarities...





RANK 1 | Similarity: 0.0745
ID: 1003.3086

Title: On Graphene Hydrate

Abstract:   Using first-principles calculations, we show that the formation of
carbohydrate directly from carbon and water is energetically favored when
graphene membrane is subjected to aqueous environment with difference in
chemical potential across the two sides. The resultant carbohydrate is
two-dimension...

RANK 2 | Similarity: 0.0732
ID: 1003.3073

Title: Strain-mediated metal-insulator transition in epitaxial ultra-thin films
  of NdNiO3

Abstract:   We have synthesized epitaxial NdNiO$_{3}$ ultra-thin films in a
layer-by-layer growth mode under tensile and compressive strain on SrTiO$_{3}$
(001) and LaAlO$_3$ (001), respectively. A combination of X-ray diffraction,
temperature dependent resistivity, and soft X-ray absorption spectroscopy has
...

RANK 3 | Similarity: 0.0701
ID: 1003.4473

Title: Ab initio study of beryllium-decorated fullerenes for hydrogen storage

Abstract:   We have found that a berylli

                                                                                

## 15. Save Search Results

You can save search results to HDFS or local filesystem for later use.
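Each query's results form a separate DataFrame, so each one needs its own output directory if all of them are to be kept. One possible way to derive a directory name from the query text is sketched below; `query_to_dirname` and `base_path` are hypothetical names chosen for illustration.

```python
import re

def query_to_dirname(query_text, max_length=60):
    """Turn a query into a lowercase, filesystem-friendly directory name."""
    slug = re.sub(r"[^a-z0-9]+", "_", query_text.lower()).strip("_")
    return slug[:max_length]

base_path = "hdfs:///arxiv/search_results"  # hypothetical base directory
query = "Quantum algorithms for optimization problems"
print(f"{base_path}/{query_to_dirname(query)}")
# hdfs:///arxiv/search_results/quantum_algorithms_for_optimization_problems
```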

In [16]:
# Save results to HDFS
output_path = "hdfs:///arxiv/search_results/result"

# Write each result set to its own subdirectory. Writing all three to the
# same path with mode("overwrite") would leave only the last one on disk.
for name, df in [("query1", results1), ("query2", results2), ("query3", results3)]:
    df.write.mode("overwrite").parquet(f"{output_path}/{name}")

print(f"Results saved to: {output_path}")



Results saved to: hdfs:///arxiv/search_results/result


                                                                                

## 16. Cleanup

Stop the Spark session when done.

In [17]:
spark.stop()
print("Spark session stopped.")

Spark session stopped.
