# Embedding Generation with SciBERT

## Objective
Convert paper abstracts into dense vector representations (embeddings) 
using SciBERT, a BERT model pre-trained on scientific text.

## Why SciBERT?
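- Pre-trained on 1.14M scientific papers from Semantic Scholar (regular BERT was trained on BooksCorpus and English Wikipedia)
- Uses its own vocabulary (SciVocab) built from scientific text; the corpus is mostly biomedical and computer science papers
- Typically a better starting point than general-domain BERT for tasks on scientific text
- Developed by the Allen Institute for AI (AI2), the group behind Semantic Scholar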

## Pipeline
1. Load cleaned paper dataset (9,280 papers)
2. Load a pre-trained model (SciBERT first, then SPECTER2, which produces the final embeddings)
3. Generate embeddings for all abstracts
4. Save embeddings + FAISS index for fast similarity search

## Output
- embeddings.npy: 9,280 × 768 float32 matrix (~28.5 MB)
- papers.index: FAISS index for fast similarity search
- papers_with_embeddings.pkl: Papers + embeddings combined
- metadata.json: Paper count, embedding dimension, model name, generation timestamp
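
The matrix size listed above can be checked with simple arithmetic. This is a minimal sketch assuming float32 storage (4 bytes per value), which is consistent with the 28.5 MB reported later in the notebook; the variable names are illustrative.

```python
# Back-of-the-envelope size of the embedding matrix.
# Assumption: values are stored as float32 (4 bytes each).
n_papers = 9_280
embedding_dim = 768
bytes_per_value = 4

size_bytes = n_papers * embedding_dim * bytes_per_value
size_mb = size_bytes / 1e6       # decimal megabytes
size_mib = size_bytes / 2**20    # binary mebibytes

print(f"{size_bytes:,} bytes = {size_mb:.1f} MB = {size_mib:.1f} MiB")
```

The two units explain why the same array can be quoted as roughly 27 (MiB) or 28.5 (MB).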

In [1]:
import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
import faiss
import os
import pickle
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")



✓ Imports successful
PyTorch version: 2.0.1
Device: cpu


In [2]:
# Load cleaned dataset
df = pd.read_csv('../data/raw/arxiv_papers_cleaned.csv')

print(f"✓ Loaded dataset")
print(f"  Papers: {len(df)}")
print(f"  Columns: {list(df.columns)}")
print(f"\nSample abstract:")
print(df['abstract'].iloc[0][:300] + "...")

✓ Loaded dataset
  Papers: 9280
  Columns: ['paper_id', 'title', 'abstract', 'authors', 'categories', 'primary_category', 'published', 'updated', 'pdf_url', 'abstract_length', 'year']

Sample abstract:
Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefine...


In [3]:
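# Load SciBERT model
# 'allenai/scibert_scivocab_uncased' is the official uncased SciBERT checkpoint.
# It is a plain Hugging Face model, not a native sentence-transformers model,
# so SentenceTransformer wraps it with a mean-pooling layer (see the warning in the output).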

print("Loading SciBERT model...")
print("(First time will download ~440MB model weights)\n")

model = SentenceTransformer('allenai/scibert_scivocab_uncased')

print(f"✓ SciBERT loaded!")
print(f"  Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"  Max sequence length: {model.max_seq_length}")

Loading SciBERT model...
(First time will download ~440MB model weights)



No sentence-transformers model found with name allenai/scibert_scivocab_uncased. Creating a new one with MEAN pooling.


✓ SciBERT loaded!
  Embedding dimension: 768
  Max sequence length: 512
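
The "Creating a new one with MEAN pooling" warning means each abstract's embedding is the average of its token vectors. The sketch below illustrates that idea with NumPy on made-up toy values; `mean_pool` is a hypothetical helper written for this note, not the library's own implementation.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dim
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # (batch, 1), avoids divide-by-zero
    return summed / counts

# Toy input: 2 sequences, 3 token slots, 4-dim vectors (made-up values)
token_embeddings = np.array([
    [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [9.0, 9.0, 9.0, 9.0]],
    [[2.0, 0.0, 2.0, 0.0], [4.0, 2.0, 0.0, 2.0], [6.0, 4.0, 4.0, 4.0]],
])
# The third slot of the first sequence is padding and must not affect the mean
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)
```

Because the mask zeroes out padding, short and long abstracts in the same batch are averaged over their real tokens only.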


In [4]:
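# Load SPECTER2 instead
# Caveat: loaded this way, the model mean-pools over abstract tokens (see the warning in the output).
# The SPECTER2 model card instead encodes title + [SEP] + abstract and takes the [CLS] vector,
# optionally with task adapters, so these embeddings may differ from the intended usage.
print("Loading SPECTER2 model...")
print("(First time will download ~440MB model weights)\n")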

model = SentenceTransformer('allenai/specter2_base')

print(f"✓ SPECTER2 loaded!")
print(f"  Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"  Max sequence length: {model.max_seq_length}")

No sentence-transformers model found with name allenai/specter2_base. Creating a new one with MEAN pooling.


Loading SPECTER2 model...
(First time will download ~440MB model weights)

✓ SPECTER2 loaded!
  Embedding dimension: 768
  Max sequence length: 512


Note: I initially planned to use SciBERT but switched to SPECTER2, which is trained on citation relationships between papers so that a paper and the papers it cites end up with similar embeddings. This aligns directly with the recommendation goal: papers linked by citations are likely to be related.

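To make the citation-based training idea concrete: the original SPECTER paper describes a triplet margin objective that pulls a query paper toward a paper it cites and pushes it away from one it does not. The sketch below is only an illustration of that loss on toy 2-D vectors; the vectors, the margin value, and the function name are assumptions, and this is not SPECTER2's actual training code.

```python
import numpy as np

def triplet_margin_loss(query, positive, negative, margin=1.0):
    """max(d(query, positive) - d(query, negative) + margin, 0) with L2 distance."""
    d_pos = np.linalg.norm(query - positive)
    d_neg = np.linalg.norm(query - negative)
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D "embeddings" (made-up values)
query = np.array([0.0, 0.0])
cited = np.array([0.0, 1.0])          # a paper the query cites
uncited_far = np.array([3.0, 4.0])    # unrelated paper, already far away
uncited_near = np.array([0.0, 1.5])   # unrelated paper that sits too close

loss_far = triplet_margin_loss(query, cited, uncited_far)    # 1 - 5 + 1 < 0, clipped to 0
loss_near = triplet_margin_loss(query, cited, uncited_near)  # 1 - 1.5 + 1 = 0.5

print(loss_far, loss_near)
```

The loss is zero once the uncited paper is at least `margin` farther from the query than the cited one, so training effort concentrates on triplets that are still confusable.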
In [5]:
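# Generate embeddings for all abstracts
print(f"Generating embeddings for {len(df):,} papers...")
# Note: the 5-10 minute estimate is optimistic on CPU; the run recorded in the output took about 45 minutes
print("(This will take ~5-10 minutes)\n")

# Convert abstracts to a list of strings
abstracts = df['abstract'].tolist()

# Encode in batches: batching amortizes per-call overhead and bounds peak memory.
# batch_size=32 is a reasonable default; abstracts longer than 512 tokens are truncated.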
embeddings = model.encode(
    abstracts,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"\n✓ Embeddings generated!")
print(f"  Shape: {embeddings.shape}")
print(f"  Size in memory: {embeddings.nbytes / 1e6:.1f} MB")

Generating embeddings for 9,280 papers...
(This will take ~5-10 minutes)



Batches: 100%|██████████| 290/290 [45:32<00:00,  9.42s/it]   



✓ Embeddings generated!
  Shape: (9280, 768)
  Size in memory: 28.5 MB
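
The progress bar figures above can be cross-checked with a little arithmetic. This sketch uses only numbers that appear in the output (9,280 papers, batch size 32, 45:32 elapsed); the variable names are illustrative.

```python
import math

n_papers = 9_280
batch_size = 32
elapsed_seconds = 45 * 60 + 32   # 45:32 from the progress bar

n_batches = math.ceil(n_papers / batch_size)
seconds_per_batch = elapsed_seconds / n_batches
papers_per_second = n_papers / elapsed_seconds

print(f"{n_batches} batches, {seconds_per_batch:.2f} s/batch, "
      f"{papers_per_second:.2f} papers/s")
```

At this CPU throughput the run takes about 45 minutes rather than the 5-10 minutes printed by the cell, which is worth keeping in mind before re-running it.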


In [6]:
# Test: Find similar papers to a sample paper
from sklearn.metrics.pairwise import cosine_similarity

# Use a fixed paper (row 100) so the result is reproducible
test_idx = 100
test_paper = df.iloc[test_idx]

print("Test paper:")
print(f"  Title: {test_paper['title']}")
print(f"  Categories: {test_paper['categories']}")
print(f"  Abstract: {test_paper['abstract'][:200]}...\n")

# Calculate similarity to all other papers
test_embedding = embeddings[test_idx].reshape(1, -1)
similarities = cosine_similarity(test_embedding, embeddings)[0]

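# Get the top 5 most similar papers, excluding the query paper by index
# (skipping position 0 would drop the wrong row if a duplicate abstract ties with the query)
order = np.argsort(similarities)[::-1]
top_indices = order[order != test_idx][:5]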

print("Top 5 most similar papers:\n")
for rank, idx in enumerate(top_indices, 1):
    sim = similarities[idx]
    paper = df.iloc[idx]
    print(f"{rank}. Similarity: {sim:.3f}")
    print(f"   Title: {paper['title']}")
    print(f"   Categories: {paper['categories']}")
    print()

Test paper:
  Title: BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
  Categories: ['cs.CL', 'cs.AI', 'cs.SE']
  Abstract: LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bi...

Top 5 most similar papers:

1. Similarity: 0.953
   Title: Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
   Categories: ['cs.CV', 'cs.LG']

2. Similarity: 0.951
   Title: Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators
   Categories: ['cs.LG']

3. Similarity: 0.951
   Title: GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
   Categories: ['cs.CV', 'cs.AI']

4. Similarity: 0.950
   Title: DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for Retrieval-Augmented Generation
   Categories: ['cs.AI', 'cs.IR']

5. Simila

In [7]:
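# Find a computer vision paper
cv_papers = df[df['primary_category'] == 'cs.CV']
# Convert the index label to a row position so it lines up with the rows of `embeddings`
test_idx_cv = df.index.get_loc(cv_papers.index[0])

test_paper = df.iloc[test_idx_cv]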
print("Test paper (Computer Vision):")
print(f"  Title: {test_paper['title']}")
print(f"  Categories: {test_paper['categories']}\n")

# Find similar papers
test_embedding = embeddings[test_idx_cv].reshape(1, -1)
similarities = cosine_similarity(test_embedding, embeddings)[0]
order = np.argsort(similarities)[::-1]
top_indices = order[order != test_idx_cv][:5]  # exclude the query paper itself

print("Top 5 similar papers:\n")
for rank, idx in enumerate(top_indices, 1):
    sim = similarities[idx]
    paper = df.iloc[idx]
    print(f"{rank}. Similarity: {sim:.3f}")
    print(f"   Title: {paper['title']}")
    print(f"   Categories: {paper['categories']}")
    print()

Test paper (Computer Vision):
  Title: Olaf-World: Orienting Latent Actions for Video World Modeling
  Categories: ['cs.CV', 'cs.AI', 'cs.LG']

Top 5 similar papers:

1. Similarity: 0.957
   Title: VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
   Categories: ['cs.RO', 'cs.CV']

2. Similarity: 0.957
   Title: BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks
   Categories: ['cs.RO', 'cs.CV']

3. Similarity: 0.954
   Title: Segment to Focus: Guiding Latent Action Models in the Presence of Distractors
   Categories: ['cs.LG', 'cs.CV']

4. Similarity: 0.952
   Title: MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction
   Categories: ['cs.RO', 'cs.CV']

5. Similarity: 0.951
   Title: VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models
   Categories: ['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']
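
The similarity search in the two cells above reduces to a dot product between L2-normalized vectors. The following is a minimal NumPy sketch of that computation on toy data; `top_k_similar` is a hypothetical helper, and the toy matrix deliberately contains an exact duplicate of the query to show why the query is masked by index rather than by skipping the first result.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=5):
    """Return indices and cosine similarities of the k rows closest to one query row."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # L2-normalize each row
    sims = unit @ unit[query_idx]                     # cosine similarity to the query
    sims[query_idx] = -np.inf                         # mask out the query itself
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]

# Toy data: row 3 duplicates row 0, row 4 points the opposite way
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [1.0, 0.0],
    [-1.0, 0.0],
])
idx, scores = top_k_similar(toy, query_idx=0, k=2)
print(idx, scores)
```

An exact inner-product index over the same normalized vectors (for example FAISS `IndexFlatIP`) performs the same search, just faster at scale.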



In [8]:
import json

# Create the output directory
os.makedirs('../data/processed', exist_ok=True)

# Save embeddings as a NumPy array
np.save('../data/processed/embeddings.npy', embeddings)
print("✓ Saved embeddings.npy")

# Save the papers dataframe with one embedding vector per row
df['embedding'] = list(embeddings)
df.to_pickle('../data/processed/papers_with_embeddings.pkl')
print("✓ Saved papers_with_embeddings.pkl")

# Save metadata (for quick reference)
metadata = {
    'n_papers': len(df),
    'embedding_dim': embeddings.shape[1],
    'model': 'allenai/specter2_base',
    'date_generated': pd.Timestamp.now().isoformat()
}

with open('../data/processed/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print("✓ Saved metadata.json")

print("\n✓ All files saved to ../data/processed/")
# Note: this is an in-memory estimate (array bytes + dataframe memory), not the size on disk
print(f"  Total size: {(embeddings.nbytes + df.memory_usage(deep=True).sum()) / 1e6:.1f} MB")

✓ Saved embeddings.npy
✓ Saved papers_with_embeddings.pkl
✓ Saved metadata.json

✓ All files saved to ../data/processed/
  Total size: 49.4 MB
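
Before relying on the saved files, it can help to reload them and confirm they agree with the metadata. The sketch below runs the same save/load round trip on a small random stand-in matrix in a temporary directory; the stand-in data and the reduced metadata dictionary are assumptions made so the example runs without the real dataset.

```python
import json
import os
import tempfile

import numpy as np

# Stand-in for the real 9,280 x 768 matrix (random values, illustration only)
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10, 768)).astype(np.float32)
metadata = {"n_papers": 10, "embedding_dim": 768}

with tempfile.TemporaryDirectory() as out_dir:
    emb_path = os.path.join(out_dir, "embeddings.npy")
    meta_path = os.path.join(out_dir, "metadata.json")

    # Write both files, mirroring the save cell above
    np.save(emb_path, embeddings)
    with open(meta_path, "w") as f:
        json.dump(metadata, f, indent=2)

    # Read them back and measure the actual on-disk size
    loaded = np.load(emb_path)
    with open(meta_path) as f:
        loaded_meta = json.load(f)
    disk_mb = os.path.getsize(emb_path) / 1e6

shape_ok = loaded.shape == (loaded_meta["n_papers"], loaded_meta["embedding_dim"])
values_ok = np.array_equal(loaded, embeddings)
print(f"shape matches metadata: {shape_ok}, values identical: {values_ok}, "
      f"on disk: {disk_mb:.3f} MB")
```

Using `os.path.getsize` reports the real file size, which is a more direct answer to "how big are the outputs" than an in-memory estimate.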
