# Semantic Search in arXiv Papers

This notebook shows how to retrieve data from the arXiv API and how to implement semantic search with recency weighting using Superlinked. It covers the following steps:

Preparation

- Retrieving, processing and exploring the data

Setting up our vector computer

- Creating a schema
- Creating vector embedding spaces
- Indexing & parsing
- Setting up & filling an in-memory data store

Searching

- Queries
- Weighting

## Preparation

In [1]:
%%capture
%%PIP COMMAND%%
%pip install lxml bs4

In [2]:
!pip install altair



In [3]:
%pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.


In [5]:
!python --version

Python 3.11.0


### Setting up a basic logger

In [8]:
import altair as alt
import logging
import numpy as np
import pandas as pd
import requests

from bs4 import BeautifulSoup
from dateutil import parser
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from superlinked.evaluation.charts.recency_plotter import RecencyPlotter
from superlinked.framework.common.dag.context import CONTEXT_COMMON, CONTEXT_COMMON_NOW
from superlinked.framework.common.dag.period_time import PeriodTime
from superlinked.framework.common.parser.dataframe_parser import DataFrameParser
from superlinked.framework.common.schema.id_schema_object import IdField
from superlinked.framework.common.schema.schema import schema
from superlinked.framework.common.schema.schema_object import String, Timestamp
from superlinked.framework.dsl.executor.in_memory.in_memory_executor import InMemoryExecutor, InMemoryApp
from superlinked.framework.dsl.index.index import Index
from superlinked.framework.dsl.query.result import Result
from superlinked.framework.dsl.query.query import Query
from superlinked.framework.dsl.query.param import Param
from superlinked.framework.dsl.source.in_memory_source import InMemorySource
from superlinked.framework.dsl.space.text_similarity_space import TextSimilaritySpace
from superlinked.framework.dsl.space.recency_space import RecencySpace

alt.renderers.enable("mimetype")

# Creating and configuring our logger
logging.basicConfig(filename="std.log", format="%(asctime)s %(message)s", filemode="w")
logger = logging.getLogger()

# Set the logger threshold to DEBUG if you encounter errors
logger.setLevel(logging.INFO)

## Fetching & processing data from the arXiv API

In [9]:
def query_arxiv(
    query="%22large%20language%20models%22",
    max_results=1000,
    order_by="lastUpdatedDate",
    order="descending",
):
    """
    Query the arXiv API and return the matching entries as a DataFrame.

    query: URL-encoded search string; the "all:" prefix searches all fields
    max_results: maximum number of results returned by the API
    order_by: field to order the results by
    order: "descending" or "ascending", applied to the order_by field
    """
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
        "sortBy": order_by,
        "sortOrder": order,
    }
    url = f"http://export.arxiv.org/api/query?{urlencode(params)}"
    try:
        # A timeout keeps the notebook from hanging if the API does not respond
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        logging.info(f"Length of response text: {len(response.text)}")
        soup = BeautifulSoup(response.text, "xml")
        data = []

        for entry in soup.find_all("entry"):
            data_entry = {tag.name: tag.text.strip() for tag in entry.find_all()}
            if "id" in data_entry:  # Ensure there is an 'id' field
                data.append(data_entry)

        logging.info(f"{len(data)} entries found")
        return pd.DataFrame(data)

    except requests.exceptions.RequestException as e:
        logging.error(f"Error during request: {e}")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")

    return pd.DataFrame()  # Return an empty DataFrame if there was an error
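
One detail of the function above is worth spelling out: the dict comprehension keeps a single value per tag name, so repeated tags inside an entry (several `author` elements, for example) collapse to the last one. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, on a made-up feed; the helper name `entries_to_rows`, the sample XML and the placeholder arXiv id are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Made-up feed with two <author> elements in one entry (illustration only)
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>  A made-up title  </title>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
  </entry>
</feed>"""


def entries_to_rows(feed_xml: str) -> list[dict[str, str]]:
    """Flatten each <entry> into a {tag name: text} dict, one value per tag name."""
    rows = []
    for entry in ET.fromstring(feed_xml).iter(f"{ATOM_NS}entry"):
        row = {}
        for element in entry.iter():
            if element is entry:
                continue  # skip the <entry> element itself
            tag_name = element.tag.removeprefix(ATOM_NS)
            # Later elements with the same tag name overwrite earlier ones
            row[tag_name] = "".join(element.itertext()).strip()
        if "id" in row:  # same guard as in query_arxiv
            rows.append(row)
    return rows


rows = entries_to_rows(SAMPLE_FEED)
print(rows[0]["author"])  # only the last author survives
```

This does not matter for this notebook, which only uses `id`, `published`, `title` and `summary`, but it is something to keep in mind if you want to work with authors or categories.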

In [10]:
# We are using URL encodings here: %22 stands for a double quote (") and %20 for a space
df = query_arxiv(query="%22retrieval%20augmented%20generation%22")
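
Because `query_arxiv` passes the query through `urlencode`, it can help to see what `urlencode` does to a string that is already percent-encoded compared to a raw one. The following is a small standard-library sketch; the variable names are made up for illustration, and how the arXiv API treats each variant is not verified here.

```python
from urllib.parse import urlencode

# The pre-encoded query used in this notebook and its raw equivalent
pre_encoded = "%22retrieval%20augmented%20generation%22"
raw = '"retrieval augmented generation"'

encoded_pre = urlencode({"search_query": f"all:{pre_encoded}"})
encoded_raw = urlencode({"search_query": f"all:{raw}"})

# urlencode escapes the "%" signs again, so %22 becomes %2522
print(encoded_pre)  # search_query=all%3A%2522retrieval%2520augmented%2520generation%2522
# With a raw string, quotes become %22 and spaces become "+"
print(encoded_raw)  # search_query=all%3A%22retrieval+augmented+generation%22
```

If you adapt the function for your own searches, inspecting the final URL this way is a quick check that the request contains what you expect.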

In [11]:
# Notice that we set the maximum to 1000 but the API returned fewer results,
# meaning that fewer than 1000 papers match our search query
len(df)

637

## Exploring & preparing the data

In [12]:
# Checking all columns
df.columns

Index(['id', 'updated', 'published', 'title', 'summary', 'author', 'name',
       'link', 'primary_category', 'category', 'comment', 'journal_ref', 'doi',
       'affiliation'],
      dtype='object')

In [13]:
# Feel free to play around more with the data if you want,
# but for this application, we will only need a few columns
df = df[["id", "published", "title", "summary"]].copy()

In [14]:
df.head(3)

| | id | published | title | summary |
|---|---|---|---|---|
| 0 | http://arxiv.org/abs/2407.15831v1 | 2024-07-22T17:50:31Z | NV-Retriever: Improving text embedding models ... | Text embedding models have been popular for in... |
| 1 | http://arxiv.org/abs/2407.15748v1 | 2024-07-22T15:53:27Z | MoRSE: Bridging the Gap in Cybersecurity Exper... | In this paper, we introduce MoRSE (Mixture of ... |
| 2 | http://arxiv.org/abs/2407.15734v1 | 2024-07-22T15:37:41Z | TaskGen: A Task-Based, Memory-Infused Agentic ... | TaskGen is an open-sourced agentic framework w... |


In [15]:
# Renaming the columns to have more intuitive names
df = df.reset_index().rename(
    columns={"id": "url", "index": "id", "summary": "abstract"}
)

In [16]:
# The API returns the datetimes as strings, which we first parse
# into datetime objects and then convert to Unix timestamps (in seconds)
df["published_timestamp"] = [
    int(parser.parse(date).replace(tzinfo=timezone.utc).timestamp())
    for date in df.published
]
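
As a quick sanity check of this conversion, the standard-library sketch below converts the `published` value of the first row shown by `df.head(3)`. It uses `datetime.strptime` instead of `dateutil.parser`, and the helper name `to_unix_seconds` is made up for illustration.

```python
from datetime import datetime, timezone


def to_unix_seconds(arxiv_datetime: str) -> int:
    """Convert a UTC datetime string such as '2024-07-22T17:50:31Z' to Unix seconds."""
    parsed = datetime.strptime(arxiv_datetime, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing "Z" means UTC, so attach the UTC time zone explicitly
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


first_published = to_unix_seconds("2024-07-22T17:50:31Z")
print(first_published)  # 1721670631

# Converting back with an explicit time zone recovers the original date
print(datetime.fromtimestamp(first_published, tz=timezone.utc).date())  # 2024-07-22
```

Attaching the time zone explicitly in both directions keeps the result independent of the machine the notebook runs on.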

In [17]:
df.head()

| | id | url | published | title | abstract | published_timestamp |
|---|---|---|---|---|---|---|
| 0 | 0 | http://arxiv.org/abs/2407.15831v1 | 2024-07-22T17:50:31Z | NV-Retriever: Improving text embedding models ... | Text embedding models have been popular for in... | 1721670631 |
| 1 | 1 | http://arxiv.org/abs/2407.15748v1 | 2024-07-22T15:53:27Z | MoRSE: Bridging the Gap in Cybersecurity Exper... | In this paper, we introduce MoRSE (Mixture of ... | 1721663607 |
| 2 | 2 | http://arxiv.org/abs/2407.15734v1 | 2024-07-22T15:37:41Z | TaskGen: A Task-Based, Memory-Infused Agentic ... | TaskGen is an open-sourced agentic framework w... | 1721662661 |
| 3 | 3 | http://arxiv.org/abs/2407.15621v1 | 2024-07-22T13:29:56Z | RadioRAG: Factual Large Language Models for En... | Large language models (LLMs) have advanced the... | 1721654996 |
| 4 | 4 | http://arxiv.org/abs/2407.15569v1 | 2024-07-22T11:55:14Z | An Empirical Study of Retrieval Augmented Gene... | Since the launch of ChatGPT at the end of 2022... | 1721649314 |


## Visualizing the timestamps

In [18]:
# Some quick transformations and an Altair histogram
# (years are taken in UTC, consistent with how the timestamps were created)
years_to_plot: pd.DataFrame = pd.DataFrame(
    {
        "year_of_publication": [
            datetime.fromtimestamp(ts, tz=timezone.utc).year
            for ts in df["published_timestamp"]
        ]
    }
)
alt.Chart(years_to_plot).mark_bar().encode(
    alt.X("year_of_publication:N", title="Year of publication"),
    y=alt.Y("count()", title="Count of articles"),
).properties(width=400, height=400)

[Figure: bar chart with "Year of publication" on the x-axis and "Count of articles" on the y-axis]


## Setting up Superlinked

In [19]:
# Setting up the schema according to our inputs
@schema
class PapersSchema:
    url: String
    title: String
    abstract: String
    published_timestamp: Timestamp
    id: IdField


papers = PapersSchema()

In [20]:
YEAR_IN_DAYS = 365

# Textual characteristics are embedded using a sentence-transformers model
abstract_space = TextSimilaritySpace(
    text=papers.abstract, model="sentence-transformers/all-mpnet-base-v2"
)
title_space = TextSimilaritySpace(
    text=papers.title, model="sentence-transformers/all-mpnet-base-v2"
)
# Release date is encoded using Superlinked's recency embedding algorithm
recency_space = RecencySpace(
    timestamp=papers.published_timestamp,
    period_time_list=[
        PeriodTime(timedelta(days=0.5 * YEAR_IN_DAYS), weight=1),
        PeriodTime(timedelta(days=10 * YEAR_IN_DAYS), weight=1),
    ],
    negative_filter=0.0,
)

In [21]:
# We create an index of our spaces
papers_index = Index(spaces=[abstract_space, title_space, recency_space])

In [22]:
dataframe_parser = DataFrameParser(
    schema=papers,
    mapping={
        papers.published_timestamp: "published_timestamp",
        papers.abstract: "abstract",
    },
)

In [23]:
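# Setting a specific end date to ensure reproducibility of the notebook
# (timezone-aware, so the value does not depend on the machine's local time zone)
END_OF_APRIL_24_TS = int(datetime(2024, 4, 30, 23, 59, tzinfo=timezone.utc).timestamp())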
EXECUTOR_DATA = {CONTEXT_COMMON: {CONTEXT_COMMON_NOW: END_OF_APRIL_24_TS}}

source: InMemorySource = InMemorySource(papers, parser=dataframe_parser)
executor: InMemoryExecutor = InMemoryExecutor(
    sources=[source], indices=[papers_index], context_data=EXECUTOR_DATA
)
app: InMemoryApp = executor.run()

In [24]:
# IMPORTANT: if you're running this notebook in Google Colab and
# this step is taking very long - you might be running an instance without a GPU
source.put([df])

## Understanding recency

In [25]:
# To get an intuitive understanding of how recency is weighted for our data,
# we can explore the weights using Superlinked's inbuilt RecencyPlotter
recency_plotter = RecencyPlotter(recency_space, context_data=EXECUTOR_DATA)
recency_plotter.plot_recency_curve()

[Figure: recency curve produced by `recency_plotter.plot_recency_curve()`]
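
To build some intuition for what such a curve expresses, here is a toy linear-decay score in plain Python. It is explicitly not Superlinked's recency encoding: the function `toy_recency_score`, the linear decay, the weighted average across periods and the handling of future timestamps are all assumptions of this sketch. Only the parameter values (half a year and ten years, equal weights, a floor of 0.0) mirror the `RecencySpace` defined above.

```python
from datetime import datetime, timedelta, timezone


def toy_recency_score(published, now, periods, negative_filter=0.0):
    """Toy linear-decay recency score (illustration only, not Superlinked's algorithm).

    periods: list of (timedelta, weight) pairs, loosely mirroring period_time_list.
    """
    age = now - published
    longest_period = max(period for period, _ in periods)
    # Assumption of this sketch: items from the "future" or older than
    # the longest period receive the floor value
    if age < timedelta(0) or age > longest_period:
        return negative_filter
    total_weight = sum(weight for _, weight in periods)
    score = 0.0
    for period, weight in periods:
        # Decays linearly from 1 (brand new) to 0 (as old as the period)
        score += weight * max(0.0, 1.0 - age / period)
    return score / total_weight


now = datetime(2024, 4, 30, 23, 59, tzinfo=timezone.utc)
periods = [(timedelta(days=0.5 * 365), 1), (timedelta(days=10 * 365), 1)]

fresh = toy_recency_score(now - timedelta(days=30), now, periods)
old = toy_recency_score(now - timedelta(days=5 * 365), now, periods)
ancient = toy_recency_score(now - timedelta(days=11 * 365), now, periods)
print(fresh, old, ancient)
```

In this toy version, the short period separates papers from the last few months, while the long period still lets a five-year-old paper score above one that falls outside the window entirely.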


## Defining queries

In [26]:
TOP_N = 10

# A simple query is enough when we just want to search the dataset with a search term.
# The term is matched against both text fields (abstract and title),
# and the importance of each space can be weighted at query time.
simple_query = (
    Query(
        papers_index,
        weights={
            abstract_space: Param("abstract_weight"),
            title_space: Param("title_weight"),
            recency_space: Param("recency_weight"),
        },
    )
    .find(papers)
    .similar(abstract_space.text, Param("query_text"))
    .similar(title_space.text, Param("query_text"))
    .limit(TOP_N)
)
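
To see why the query-time weights matter, consider a toy ranking in which each candidate has one similarity score per space and the final score is a weighted sum. This is a simplified mental model rather than a description of how Superlinked computes scores internally; the candidate names, the score values and the `combined_score` helper are made up for illustration.

```python
def combined_score(similarities: dict, weights: dict) -> float:
    """Weighted sum of per-space similarity scores (toy model)."""
    return sum(weights[space] * similarities[space] for space in weights)


# Hypothetical per-space scores for two made-up candidates
candidates = {
    "older_but_on_topic": {"abstract": 0.80, "title": 0.70, "recency": 0.10},
    "newer_less_on_topic": {"abstract": 0.60, "title": 0.55, "recency": 0.90},
}


def rank(weights: dict) -> list:
    """Return candidate names sorted by combined score, best first."""
    return sorted(
        candidates,
        key=lambda name: combined_score(candidates[name], weights),
        reverse=True,
    )


no_recency = rank({"abstract": 1, "title": 1, "recency": 0})
with_recency = rank({"abstract": 1, "title": 1, "recency": 5})
print(no_recency)    # ['older_but_on_topic', 'newer_less_on_topic']
print(with_recency)  # ['newer_less_on_topic', 'older_but_on_topic']
```

With a recency weight of 0 the ranking is driven by text similarity alone; raising it lets a newer but slightly less relevant candidate overtake an older one.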

In [27]:
# A quick helper to present the results in a notebook
def present_result(
    result: Result,
    cols_to_keep: list[str] = ["abstract", "title", "release_date", "id"],
) -> pd.DataFrame:
    # Parse result to dataframe
    df: pd.DataFrame = pd.DataFrame([entry.stored_object for entry in result.entries])
    # Transform the timestamp (Unix seconds) back to a release date in UTC
    df["release_date"] = [
        datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        for timestamp in df["published_timestamp"]
    ]
    return df[cols_to_keep]

## Executing the queries

In [28]:
regular_result = app.query(
    simple_query,
    query_text="cost reduction",
    abstract_weight=1,
    title_weight=1,
    recency_weight=0,
)

present_result(regular_result)

| | abstract | title | release_date | id |
|---|---|---|---|---|
| 0 | By integrating Artificial Intelligence (AI) wi... | Generative AI for Low-Carbon Artificial Intell... | 2024-04-28 | 32 |
| 1 | In the field of computational advertising, the... | Ad Auctions for LLMs via Retrieval Augmented G... | 2024-06-12 | 222 |
| 2 | In this paper, we explore the potential applic... | Automated Conversion of Static to Dynamic Sche... | 2024-05-08 | 356 |
| 3 | In many modern LLM applications, such as retri... | Symbolic Prompt Program Search: A Structure-Aw... | 2024-04-02 | 126 |
| 4 | Design/methodology/approach This research eval... | Graph database while computationally efficient... | 2024-01-15 | 557 |
| 5 | In this paper, we conduct a study to utilize L... | PlanRAG: A Plan-then-Retrieval Augmented Gener... | 2024-06-18 | 193 |
| 6 | Retrieval-augmented generation (RAG) technique... | Enhancing Retrieval and Managing Retrieval: A ... | 2024-07-15 | 43 |
| 7 | Retrieval-augmented generation supports langua... | CompAct: Compressing Retrieved Documents Activ... | 2024-07-12 | 45 |
| 8 | Standard Full-Data classifiers in NLP demand t... | Making LLMs Worth Every Penny: Resource-Limite... | 2023-11-10 | 593 |
| 9 | There is a compelling necessity from enterpris... | Fine Tuning LLM for Enterprise: Practical Guid... | 2024-03-23 | 454 |


In [None]:
recency_weighted_result = app.query(
    simple_query,
    query_text="cost reduction",
    abstract_weight=1,
    title_weight=1,
    recency_weight=5,
)

present_result(recency_weighted_result)

| | abstract | title | release_date | id |
|---|---|---|---|---|
| 0 | By integrating Artificial Intelligence (AI) wi... | Generative AI for Low-Carbon Artificial Intell... | 2024-04-28 | 69 |
| 1 | In this paper, we explore the potential applic... | Automated Conversion of Static to Dynamic Sche... | 2024-05-08 | 44 |
| 2 | As Large Language Models (LLMs) and Retrieval ... | RaFe: Ranking Feedback Improves Query Rewritin... | 2024-05-23 | 2 |
| 3 | Purpose: The purpose of this study is to inves... | Exploring the Potential of Large Language Mode... | 2024-05-15 | 24 |
| 4 | In customer service technical support, swiftly... | Retrieval-Augmented Generation with Knowledge ... | 2024-04-26 | 50 |
| 5 | Accurate evaluation of financial question answ... | FinTextQA: A Dataset for Long-form Financial Q... | 2024-05-16 | 22 |
| 6 | This paper introduces xRAG, an innovative cont... | xRAG: Extreme Context Compression for Retrieva... | 2024-05-22 | 4 |
| 7 | Large Language Models (LLMs) have made signifi... | Compressing Long Context for Enhancing RAG wit... | 2024-05-06 | 53 |
| 8 | Enterprise retrieval augmented generation (RAG... | Question-Based Retrieval using Atomic Units fo... | 2024-05-20 | 12 |
| 9 | This paper introduces the RAG-RLRC-LaySum fram... | RAG-RLRC-LaySum at BioLaySumm: Integrating Ret... | 2024-05-21 | 8 |


In [29]:
# A quick helper to visualize the effect of recency weighting
def get_time_differences(result: Result, alternative_result: Result) -> list[float]:
    # Getting the timestamps of both results
    result_ts = [entry.stored_object["published_timestamp"] for entry in result.entries]
    alternative_result_ts = [
        entry.stored_object["published_timestamp"]
        for entry in alternative_result.entries
    ]
    # Absolute time difference in seconds, compared position by position
    # (both results must contain the same number of entries)
    time_diff = np.absolute(np.array(result_ts) - np.array(alternative_result_ts))
    # Rounded time difference in days
    time_diff_days = [round(float(t_d) / 3600 / 24, 1) for t_d in time_diff]
    return time_diff_days

In [30]:
# Run the recency-weighted query cell above first; otherwise this cell fails with
# "NameError: name 'recency_weighted_result' is not defined"
get_time_differences(regular_result, recency_weighted_result)

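A difference of 0.0 at a position means that both result lists hold a paper with the same publication timestamp there, most likely the same paper. You will see that a lot of the positions haven’t changed, but some have!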
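
Comparing timestamps position by position is a quick proxy. A slightly more direct way to see what moved is to compare the result ids themselves. The sketch below does this for two plain lists of ids; the function `rank_shifts` and the example ids are made up for illustration, and in the notebook you could obtain such lists from the `id` column returned by `present_result`.

```python
def rank_shifts(baseline_ids: list, reweighted_ids: list) -> dict:
    """Map each id in the reweighted ranking to how many places it moved up.

    Positive values mean the item moved up, negative values mean it moved down,
    and None means the item was not in the baseline ranking at all.
    """
    baseline_position = {doc_id: pos for pos, doc_id in enumerate(baseline_ids)}
    shifts = {}
    for new_pos, doc_id in enumerate(reweighted_ids):
        old_pos = baseline_position.get(doc_id)
        shifts[doc_id] = None if old_pos is None else old_pos - new_pos
    return shifts


# Hypothetical ids, not taken from the notebook's results
baseline = [101, 102, 103, 104]
reweighted = [101, 103, 105, 102]

shifts = rank_shifts(baseline, reweighted)
print(shifts)  # {101: 0, 103: 1, 105: None, 102: -2}
```

Items that appear with `None` are the ones that only entered the top results because of the changed weights.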

Obviously, this was a pretty basic example, but I hope it made clear why recency can be an important factor. We could also have filtered on the timestamps directly; metadata filtering is a common practice in advanced RAG systems, after all. However, the recency embeddings we used here are generally more nuanced: instead of applying a hard cutoff, they let recency trade off against relevance, similar to how text embeddings are more nuanced than regex.

Which one works better for you will depend on your specific use case. It’s important to remember that there are no silver bullets!