## Introduction

This project implements a semantic search application using a dataset of Machine Learning papers from ArXiv. The dataset, tagged with "cs.LG" to indicate relevance to Machine Learning, contains approximately 100,000 papers. These papers are a subset filtered from the full ArXiv dataset, which originally contained about 2 million papers. The filtered dataset, maintained through requests to the ArXiv API, includes only the title and abstract of each paper.

The objective of this project is to build an efficient search system that retrieves relevant papers for a user's query. Using Natural Language Processing (NLP) embedding techniques, the system matches the semantic meaning of a query against the papers in the dataset instead of relying only on exact keyword overlap.

### Key Components

- **Dataset Preparation**: The dataset is sourced from Hugging Face (`CShorten/ML-ArXiv-Papers`), and each paper's title and abstract are combined into a single text field.

- **Tokenization and Embedding**: The text data is tokenized and converted into dense vector embeddings that represent the semantic content of the papers, using the pretrained sentence-transformers MiniLM model.

- **FAISS Indexing**: To facilitate fast and scalable similarity searches, FAISS (Facebook AI Similarity Search) is employed to index the embeddings.

- **Gradio Interface**: A user-friendly interface is built using Gradio, allowing users to enter search queries and retrieve the most relevant ArXiv papers.

This project demonstrates the practical application of semantic search in academic literature, offering a powerful tool for researchers to discover relevant works in the field of Machine Learning.
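
As a rough illustration of the retrieval idea behind these components, the sketch below ranks a few papers by cosine similarity between a query vector and paper vectors. The three-dimensional vectors, the titles, and the helper names are made up for illustration. In the project itself the vectors come from the MiniLM model and the search is delegated to FAISS.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_papers(query_vector, paper_vectors, k=2):
    """Return the k paper titles whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in paper_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]

# Hypothetical 3-dimensional "embeddings"; the real ones have 384 dimensions.
toy_papers = {
    "Attention mechanisms": [0.9, 0.1, 0.0],
    "Kernel methods": [0.1, 0.9, 0.1],
    "Transformer scaling": [0.8, 0.2, 0.1],
}
top_titles = rank_papers([1.0, 0.0, 0.0], toy_papers)
```

Here the query vector points in the same direction as the first coordinate, so the two papers whose vectors are dominated by that coordinate are returned first.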


#### Dataset:

**CShorten/ML-ArXiv-Papers:**
This dataset contains a subset of ArXiv papers with the "cs.LG" tag, indicating that the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle ("ArXiv Dataset on Kaggle"). The original dataset contains roughly 2 million papers, and this dataset contains approximately 100,000 papers after category filtering.

The dataset is maintained by making requests to the ArXiv API. The current iteration only includes the title and abstract of each paper.

Dataset Source: The dataset is sourced from Hugging Face: CShorten/ML-ArXiv-Papers.

In [25]:
from datasets import load_dataset

arxiv_dataset = load_dataset('CShorten/ML-ArXiv-Papers', trust_remote_code=True)

In [26]:
arxiv_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
        num_rows: 117592
    })
})

#### Preparing the dataset:

In [27]:
def combine_title_abstract(example):
    example['combined_text'] = example['title'] + ' ' + example['abstract']  # Join title and abstract with a single space
    return example

In [28]:
arxiv_dataset = arxiv_dataset['train'].map(combine_title_abstract)

In [29]:
arxiv_dataset

Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract', 'combined_text'],
    num_rows: 117592
})

In [30]:
arxiv_dataset['combined_text'][0]

'Learning from compressed observations   The problem of statistical learning is to construct a predictor of a random\nvariable $Y$ as a function of a related random variable $X$ on the basis of an\ni.i.d. training sample from the joint distribution of $(X,Y)$. Allowable\npredictors are drawn from some specified class, and the goal is to approach\nasymptotically the performance (expected loss) of the best predictor in the\nclass. We consider the setting in which one has perfect observation of the\n$X$-part of the sample, while the $Y$-part has to be communicated at some\nfinite bit rate. The encoding of the $Y$-values is allowed to depend on the\n$X$-values. Under suitable regularity conditions on the admissible predictors,\nthe underlying family of probability distributions and the loss function, we\ngive an information-theoretic characterization of achievable predictor\nperformance in terms of conditional distortion-rate functions. The ideas are\nillustrated on the example of nonparam
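
The sample above still contains the line breaks and runs of spaces carried over from the raw title and abstract. If a cleaner `combined_text` is wanted, one option is to collapse all whitespace before embedding. This is a minimal sketch and not part of the original notebook; `normalize_whitespace` and `combine_title_abstract_clean` are hypothetical names, and whether the change affects retrieval quality has not been measured here.

```python
def normalize_whitespace(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())

def combine_title_abstract_clean(example):
    """Variant of combine_title_abstract that normalizes whitespace first."""
    title = normalize_whitespace(example["title"])
    abstract = normalize_whitespace(example["abstract"])
    return {"combined_text": f"{title} {abstract}"}

# Toy record shaped like one row of the dataset.
row = {
    "title": "Learning from compressed observations",
    "abstract": "  The problem of statistical learning is to construct a predictor of a random\nvariable $Y$",
}
cleaned = combine_title_abstract_clean(row)["combined_text"]
```

Because the function returns a dictionary with the new column, it could be passed to `Dataset.map` in the same way as `combine_title_abstract`.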

#### Tokenization:

In [31]:
from transformers import AutoTokenizer, AutoModel

model_checkpoint = 'sentence-transformers/all-MiniLM-L6-v2'


In [32]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [33]:
model = AutoModel.from_pretrained(model_checkpoint)

#### Create text embeddings:

In [34]:
# Use the embedding of the [CLS] token (the first position of each sequence) as the sentence representation
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]
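
CLS pooling is the choice made in this notebook. For comparison, the model card for `sentence-transformers/all-MiniLM-L6-v2` describes mean pooling over the token embeddings, weighted by the attention mask so that padding tokens are ignored. The sketch below shows that arithmetic on a small hand-made array. NumPy is used so the numbers are easy to check; the `mean_pooling` helper and the toy values are illustrative assumptions, not code from the notebook.

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, tokens, hidden)
    attention_mask:   array of shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sequence with three token positions; the last position is padding.
toy_tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])

pooled = mean_pooling(toy_tokens, toy_mask)  # mean of the two real tokens
cls_only = toy_tokens[:, 0]                  # what CLS pooling would select
```

The padded position is excluded from the average, so `pooled` depends only on the two real tokens, while `cls_only` depends only on the first one.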

In [35]:
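import torch

def get_embeddings(text_list):
    # Tokenize a string or a list of strings into padded, truncated tensors
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Inference only, so skip gradient tracking to save memory
    with torch.no_grad():
        model_output = model(**encoded_input)  # accepts batches of input and returns batches of output
    return cls_pooling(model_output)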



In [36]:
embedding = get_embeddings(arxiv_dataset["combined_text"][0])
embedding.shape

torch.Size([1, 384])

In [37]:
from tqdm.notebook import tqdm

# Compute CLS embeddings for a string or a list of strings (same function as defined above)
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

# Embed the texts of one batch one at a time, with a progress bar over the batch
def compute_embeddings(batch):
    return {
        "embeddings": [
            get_embeddings(text).detach().cpu().numpy()
            for text in tqdm(batch["combined_text"], desc="Processing batches")
        ]
    }

# Apply the function with map; each call receives a batch of up to 1000 examples
embeddings_dataset = arxiv_dataset.map(compute_embeddings, batched=True, batch_size=1000)
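
`compute_embeddings` runs the model once per text. A common alternative is to pass several texts per forward pass. The sketch below shows only the chunking logic, with a stand-in `fake_embed` function in place of `get_embeddings` so that it runs without downloading a model. `chunked`, `fake_embed`, and `embed_in_chunks` are hypothetical names, and the chunk size of 32 is an arbitrary assumption.

```python
def chunked(items, chunk_size):
    """Yield successive slices of `items` with at most `chunk_size` elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def fake_embed(texts):
    """Stand-in for get_embeddings: returns one 1-dimensional 'vector' per text."""
    return [[float(len(text))] for text in texts]

def embed_in_chunks(texts, chunk_size=32, embed_fn=fake_embed):
    """Embed texts chunk by chunk and return one vector per input text."""
    vectors = []
    for chunk in chunked(texts, chunk_size):
        vectors.extend(embed_fn(chunk))
    return vectors

toy_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
toy_vectors = embed_in_chunks(toy_texts, chunk_size=2)
```

With a real model, `embed_fn` could wrap `get_embeddings` and convert its output to a list of NumPy rows; the order of the returned vectors matches the order of the input texts.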



In [79]:
from datasets import load_dataset

dataset = load_dataset('csv', data_files="/content/embeddings_dataset.csv")

In [80]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract', 'combined_text', 'embeddings'],
        num_rows: 117592
    })
})

In [81]:
len(dataset['train']['embeddings'][1])

7374

In [82]:
train_dataset = dataset['train']

In [84]:
import numpy as np
import re

# Convert the string-typed embeddings column back to numeric arrays.
# The pattern accepts scientific notation and plain decimals, in case some
# values were not written with an exponent.
NUMBER_PATTERN = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")

def convert_to_array(example):
    if isinstance(example['embeddings'], list):  # Handle batches
        example['embeddings'] = [np.array(NUMBER_PATTERN.findall(emb), dtype=np.float32)
                                 for emb in example['embeddings']]
    else:  # Handle single example
        numbers = NUMBER_PATTERN.findall(example['embeddings'])
        example['embeddings'] = np.array(numbers, dtype=np.float32)
    return example

# Apply the conversion to the dataset, processing examples in batches
train_dataset = train_dataset.map(convert_to_array, batched=True, batch_size=1000)
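
The conversion is needed because a CSV cell stores the printed form of an array rather than the numbers themselves. The sketch below reproduces that round trip on a made-up four-value vector: the array is turned into text, then parsed back with a regular expression. The pattern used here is an assumption chosen to accept both scientific and plain decimal notation, and the toy values are hypothetical.

```python
import re
import numpy as np

# Hypothetical 1 x 4 embedding standing in for a real 1 x 384 vector.
original = np.array([[1.5e-02, -2.25e-01, 3.0e-03, 4.125e-01]], dtype=np.float32)

# Writing an array into a CSV cell stores its string form, not the numbers.
as_text = str(original)

# Accept both scientific and plain decimal notation when parsing it back.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")
restored = np.array([float(n) for n in NUMBER.findall(as_text)], dtype=np.float32)
```

The restored array is flat, so the original `(1, 4)` shape becomes `(4,)`. Saving to a binary format such as Parquet avoids the text round trip altogether.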

In [190]:
import numpy as np

# Function to convert embeddings to NumPy arrays on-the-fly so that a FAISS index can be added
def transform(example):
    example['embeddings'] = np.array(example['embeddings'], dtype=np.float32)
    return example

# Apply the transformation
train_dataset.set_transform(transform)


In [191]:
# Materialize the transformation by creating a new dataset with the transformations applied
def materialize(example):
    return {
        'embeddings': np.array(example['embeddings'], dtype=np.float32),
        # Add any other fields that need to be retained
        'title': example['title'],
        'abstract': example['abstract'],
        'combined_text': example['combined_text']
    }

# Apply the materialize function to create a new dataset
new_train_dataset = train_dataset.map(materialize, load_from_cache_file=False)


In [192]:
type(new_train_dataset[0]['embeddings'])

numpy.ndarray

In [110]:
import pandas as pd

# Convert to pandas DataFrame
df = new_train_dataset.to_pandas()

# Save as Parquet
df.to_parquet("/content/new_train_dataset.parquet", index=False)

# After saving, the dataset was pushed to the Hugging Face Hub

#### Using FAISS for efficient similarity search

In [125]:
from datasets import load_dataset

dataset = load_dataset('Tarun-1999M/arxiv_cs_lg_embeddings', data_files="/content/new_train_dataset.parquet")

In [128]:
train_dataset = dataset['train']

In [129]:
# FAISS needs the embeddings as NumPy arrays
import numpy as np

# Function to convert embeddings to NumPy arrays on-the-fly
def transform(example):
    example['embeddings'] = np.array(example['embeddings'], dtype=np.float32)
    return example

# Apply the transformation
train_dataset.set_transform(transform)


In [130]:
# Materialize the transformation by creating a new dataset with the transformations applied
def materialize(example):
    return {
        'embeddings': np.array(example['embeddings'], dtype=np.float32),
        # Add any other fields that need to be retained
        'title': example['title'],
        'abstract': example['abstract'],
        'combined_text': example['combined_text']
    }

# Apply the materialize function to create a new dataset
new_train_dataset = train_dataset.map(materialize, load_from_cache_file=False)


In [193]:
new_train_dataset.add_faiss_index(column='embeddings')

Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract', 'combined_text', 'embeddings'],
    num_rows: 117592
})
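
When `add_faiss_index` is called without a `string_factory` or `metric_type`, as here, the `datasets` library builds a flat (exhaustive) FAISS index, which by default compares vectors by squared L2 distance, so lower scores mean closer matches. The NumPy sketch below mimics that search on made-up vectors; `flat_l2_search` is a hypothetical helper for illustration only and would be far slower than FAISS on the real data.

```python
import numpy as np

def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search by squared L2 distance."""
    distances = ((vectors - query) ** 2).sum(axis=1)
    nearest = np.argsort(distances)[:k]  # ascending: closest first
    return distances[nearest], nearest

# Made-up 2-dimensional vectors standing in for the 384-dimensional embeddings.
toy_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]], dtype=np.float32)
toy_query = np.array([1.0, 0.75], dtype=np.float32)

toy_scores, toy_indices = flat_l2_search(toy_vectors, toy_query, k=2)
```

The function returns distances and row indices in ascending order of distance, which is the same shape of result that `get_nearest_examples` exposes as `scores` and `samples`.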

In [194]:
query = 'what is a transformer'
question_embedding = get_embeddings([query]).cpu().detach().numpy()
question_embedding.shape

(1, 384)

In [195]:
scores, samples = new_train_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

In [196]:
samples

{'Unnamed: 0.1': [76176, 111843, 1074, 88859, 110134],
 'Unnamed: 0': [76176.0, 111843.0, None, 88859.0, 110134.0],
 'title': ['Do Transformer Modifications Transfer Across Implementations and\n  Applications?',
  'Multimodal Learning with Transformers: A Survey',
  'Multimodal Learning with Transformers: A Survey',
  'Transformers predicting the future. Applying attention in next-frame and\n  time series forecasting',
  'Transformers from an Optimization Perspective'],
 'abstract': ['  The research community has proposed copious modifications to the Transformer\narchitecture since it was introduced over three years ago, relatively few of\nwhich have seen widespread adoption. In this paper, we comprehensively evaluate\nmany of these modifications in a shared experimental setting that covers most\nof the common uses of the Transformer in natural language processing.\nSurprisingly, we find that most modifications do not meaningfully improve\nperformance. Furthermore, most of the Transfor
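
The results above contain the same title twice ("Multimodal Learning with Transformers: A Survey"), which suggests the dataset holds duplicate rows. One way to hide such repeats at display time is to keep only the first occurrence of each title. The sketch assumes the `scores` and `samples` layout returned by `get_nearest_examples`; `deduplicate_results` is a hypothetical helper and the toy values are made up.

```python
def deduplicate_results(scores, samples):
    """Keep the first hit for each distinct title, preserving the input order."""
    seen_titles = set()
    unique = []
    for score, title, abstract in zip(scores, samples["title"], samples["abstract"]):
        key = " ".join(title.split()).lower()  # ignore whitespace and case differences
        if key in seen_titles:
            continue
        seen_titles.add(key)
        unique.append({"score": float(score), "title": title, "abstract": abstract})
    return unique

toy_scores = [10.0, 11.5, 11.5, 12.0]
toy_samples = {
    "title": ["Paper A", "Paper B", "Paper  B", "Paper C"],
    "abstract": ["a", "b", "b", "c"],
}
unique_hits = deduplicate_results(toy_scores, toy_samples)
```

Since duplicates reduce the number of distinct hits, a caller that wants five distinct papers could request a larger `k` and truncate after deduplication.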

In [159]:
#creating a zip file and pushing the dataset to the hub

from datasets import load_dataset
import numpy as np

# Step 1: Load the dataset
dataset = load_dataset('Tarun-1999M/arxiv_cs_lg_embeddings', data_files="/content/new_train_dataset.parquet")
train_dataset = dataset['train']

# Step 2: Define a transformation to convert embeddings to NumPy arrays on-the-fly
def transform(example):
    example['embeddings'] = np.array(example['embeddings'], dtype=np.float32)
    return example

# Apply the transformation
train_dataset.set_transform(transform)

# Step 3: Materialize the transformation using map (retain NumPy arrays)
def materialize(example):
    return {
        'embeddings': np.array(example['embeddings'], dtype=np.float32),
        'title': example['title'],
        'abstract': example['abstract'],
        'combined_text': example['combined_text']
    }

new_train_dataset = train_dataset.map(materialize, load_from_cache_file=False)

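# Step 4: Set the embeddings format to NumPy and clear the on-the-fly transform before saving
# (no FAISS index has been added to this freshly mapped dataset, so there is none to drop)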
new_train_dataset.set_format("numpy", columns=["embeddings"])
new_train_dataset.set_transform(None)

# Save the materialized dataset
new_train_dataset.save_to_disk('/content/new_train_dataset_with_numpy')

# Step 5: Add the FAISS index again after loading (if needed)


In [161]:
!zip -r new_train_dataset_with_numpy.zip /content/new_train_dataset_with_numpy


updating: content/new_train_dataset_with_numpy/ (stored 0%)
  adding: content/new_train_dataset_with_numpy/state.json (deflated 39%)
  adding: content/new_train_dataset_with_numpy/data-00000-of-00001.arrow (deflated 43%)
  adding: content/new_train_dataset_with_numpy/dataset_info.json (deflated 66%)


In [162]:
from google.colab import files
files.download('/content/new_train_dataset_with_numpy.zip')

In [166]:
from datasets import load_from_disk
from huggingface_hub import login

# Load the dataset saved to disk earlier
dataset = load_from_disk('/content/new_train_dataset_with_numpy')

# Log in to the Hugging Face Hub
login()

# Push to the Hugging Face repository
dataset.push_to_hub('Tarun-1999M/arxiv_cs_lg_embeddings')


CommitInfo(commit_url='https://huggingface.co/datasets/Tarun-1999M/arxiv_cs_lg_embeddings/commit/d4addbfae4ef1db660478756c79dbd9b6158c6d3', commit_message='Upload dataset', commit_description='', oid='d4addbfae4ef1db660478756c79dbd9b6158c6d3', pr_url=None, pr_revision=None, pr_num=None)

#### Gradio Setup

In [69]:
#| default_exp app

In [73]:
#| export

import gradio as gr
from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel

model_checkpoint = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModel.from_pretrained(model_checkpoint)

# Load the dataset from Hugging Face
dataset = load_dataset('Tarun-1999M/arxiv_cs_lg_embeddings')
train_dataset = dataset['train']

# Ensure embeddings are converted to NumPy arrays on-the-fly using set_transform
def transform(example):
    example['embeddings'] = np.array(example['embeddings'], dtype=np.float32)
    return example

train_dataset.set_transform(transform)

# Add FAISS index
train_dataset.add_faiss_index(column='embeddings')

# Use the embedding of the [CLS] token (the first position of each sequence) as the sentence representation
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]
    
# Function to get the embeddings for the query
def get_embeddings(query_list):
    encoded_input = tokenizer(query_list, padding=True, truncation=True, return_tensors='pt')
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

# Function to search the ArXiv papers
def search_arxiv(query):
    # Get the embedding for the query
    question_embedding = get_embeddings([query]).cpu().detach().numpy()

    # Search for similar papers
    scores, samples = train_dataset.get_nearest_examples("embeddings", question_embedding, k=5)

    # Sort so the closest match comes first. Assuming the default FAISS index
    # (L2 distance), a lower score means a closer match, so sort ascending.
    sorted_results = sorted(
        zip(scores, samples['title'], samples['abstract']), key=lambda result: result[0]
    )

    # Prepare and format the results for display
    results = []
    for score, title, abstract in sorted_results:
        result = f"\n**Title:** {title}\n**Abstract:** {abstract}\n**Score:** {score:.4f}"
        results.append(result)

    return "\n\n".join(results)

# Create the Gradio interface
iface = gr.Interface(
    fn=search_arxiv,
    inputs=gr.components.Textbox(lines=1, placeholder="Enter your query..."),
    outputs="markdown",
    title="Semantic Search in ArXiv ML Papers",
    description="Enter a query to find relevant ML papers from the ArXiv dataset."
)

# Launch the interface
iface.launch(share=True)




Running on local URL:  http://127.0.0.1:7871

Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.




In [75]:
import nbdev
notebook_name = "Semantic_Search_in_ArXiv_ML_papers.ipynb"
export_destination = "." # the root directory
nbdev.export.nb_export(notebook_name, export_destination)