## 🔎📚 ArXiv CS RAG 📚🔍

This is the code for generating embeddings for [ArXiv CS RAG](https://huggingface.co/spaces/bishmoy/Arxiv-CS-RAG), a Hugging Face Space for searching paper embeddings and querying them with large language models (LLMs) of your choice. This code takes advantage of the amazing [ArXiv](https://www.kaggle.com/datasets/Cornell-University/arxiv) dataset, which is updated weekly, to create embeddings from computer science paper abstracts.

Our workflow consists of the following steps:

1. ⚗️ **Filtering the ArXiv Dataset**: Filter the enormous ArXiv dataset down to abstracts from the computer science domain.
2. ☑️ **Selecting Samples for Performance Verification**: Choose specific search queries and their expected paper abstracts to verify performance in later steps.
3. 🗃️ **Indexing Abstracts**: Use ColBERTv2 to index the filtered abstracts through the RAGatouille library.
4. 📊 **Verifying Indices**: Evaluate the accuracy of the generated indices using the selected sample search queries.

All right, let's start!

### Installing the libraries

In [1]:
# Installing ragatouille
!pip install -qq ragatouille
# Uninstalling the 'faiss-cpu' package, then installing 'faiss-gpu' once the uninstall has finished
!pip uninstall -y -q faiss-cpu && pip install -q faiss-gpu

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.6.1 requires cubinlinker, which is not installed.
cudf 24.6.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.6.1 requires ptxcompiler, which is not installed.
cuml 24.6.1 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 24.6.1 requires cupy-cuda11x>=12.0.0, which is not installed.
keras-cv 0.9.0 requires keras-core, which is not installed.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 16.1.0 which is incompatible.
cudf 24.6.1 requires cuda-python<12.0a0,>=11.7.1,

## ⚗️ Filtering the ArXiv Dataset

In this section, we filter the ArXiv dataset to keep only abstracts from the computer science domain.

To be more specific, we will filter abstracts from the following categories:

- cs.CV (Computer Vision and Pattern Recognition)
- cs.LG (Machine Learning)
- cs.CL (Computation and Language)
- cs.AI (Artificial Intelligence)
- cs.NE (Neural and Evolutionary Computing)
- cs.RO (Robotics)
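
As a rough illustration of how these categories are matched, the sketch below applies the same regular expression that the filtering script uses later, but with Python's standard `re` module rather than Polars. The sample category strings and the `is_selected` helper are made up for this example.

```python
import re

# Same pattern as in the filtering script: any of the six selected cs.* categories
CS_PATTERN = re.compile(r"\b(?:cs\.(?:CV|LG|CL|AI|NE|RO))\b")

def is_selected(categories: str) -> bool:
    """Return True if the space-separated category string contains a selected category."""
    return CS_PATTERN.search(categories) is not None

# Hypothetical category strings in the dataset's space-separated format
samples = ["cs.NE cs.AI", "stat.ML cs.LG", "math.CO cs.CR", "hep-th"]
for sample in samples:
    print(f"{sample!r}: {is_selected(sample)}")
```

Note that a cross-listed paper is kept as long as any one of its categories is in the list, so a paper whose primary category lies outside computer science can still be selected.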

To do that, we load the JSON dataset into memory using Polars. While Pandas would also work for this task, Polars is often faster for certain operations, making it our choice for filtering the dataset.

Loading the entire dataset occupies approximately 27 GB of RAM. If this RAM is not freed, it can lead to an out-of-memory (OOM) error once we proceed to the indexing step. To prevent this, we create a smaller, filtered dataset from the large one and save it to disk. For that purpose, we write a script that performs the filtering operation.

Using a script clears out the occupied RAM once the script execution is complete. This is very useful for our purposes, as it lets us avoid potential OOM errors.
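
The idea can be sketched with a toy example: a memory-hungry job runs in a separate Python process, and only a small summary is passed back to the caller. The child code here is a hypothetical stand-in for the real filtering script, and the list size is arbitrary.

```python
import subprocess
import sys

# Hypothetical stand-in for a memory-hungry job: the child process builds a
# large list, prints a small summary, and exits.
child_code = "data = list(range(1_000_000)); print(len(data))"

# Run the job in a separate Python process, much like `!python df_build.py` does
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

# Only the small summary comes back; the child's memory is released when it exits
summary = int(result.stdout.strip())
print("Items processed by child:", summary)
```

Because the operating system reclaims a process's memory when it exits, the notebook kernel never holds the large object itself.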


In [2]:
%%writefile /kaggle/working/df_build.py
import polars as pl
import time
import argparse

# Function to parse command-line arguments
def parse_args():
    parser = argparse.ArgumentParser(description="Process ArXiv JSON dataset.")
    parser.add_argument("--input_json", type=str, default = "/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json", help="Path to the input NDJSON file.")
    parser.add_argument("--output_json", type=str, default = "/kaggle/working/arxiv_cs.json", help="Path to the output NDJSON file.")
    return parser.parse_args()

# Function to convert the latest 'created' date from the 'versions' field to a specific format
def get_latest_time(element):
    return time.strftime("%d %b %Y", time.strptime(element[-1]['created'], "%a, %d %b %Y %H:%M:%S %Z"))

# Function to convert the 'update_date' field to a specific format
def get_latest_date(element):
    return time.strftime("%d %b %Y", time.strptime(element, "%Y-%m-%d"))

def main():
    args = parse_args()

    # Loading the entire JSON dataset from the input file path
    cs_arxiv_df = pl.read_ndjson(args.input_json)

    # Filtering rows where the 'categories' column contains specific computer science categories
    cs_arxiv_df = cs_arxiv_df.filter(pl.col("categories").str.contains(r"\b(?:cs\.(?:CV|LG|CL|AI|NE|RO))\b", strict=True))

    # Initializing a new column '_time' with default value 0
    cs_arxiv_df = cs_arxiv_df.with_columns(pl.lit(0, dtype=pl.Int64).alias('_time'))

    # Updating the '_time' column with the latest version date or update date
    cs_arxiv_df = cs_arxiv_df.with_columns(
        pl.when(cs_arxiv_df['versions'].is_not_null())
        .then(cs_arxiv_df['versions'].map_elements(get_latest_time,  return_dtype=pl.Utf8))
        .otherwise(cs_arxiv_df['update_date'].map_elements(get_latest_date,  return_dtype=pl.Utf8))
        .alias('_time')
    )

    # Columns to be dropped from the DataFrame
    columns_to_drop = ['versions', 'authors_parsed', 'report-no', 'license', 'submitter']

    # Dropping the columns that are not needed for indexing
    cs_arxiv_df = cs_arxiv_df.drop(columns_to_drop)

    # Writing the processed DataFrame to a new NDJSON file specified by the output path
    cs_arxiv_df.write_ndjson(args.output_json)

if __name__ == "__main__":
    main()

Writing /kaggle/working/df_build.py
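
The two date helpers in the script can be tried in isolation. The sketch below copies them and runs them on made-up sample values shaped like the dataset's `versions` and `update_date` fields; it assumes an English locale for the weekday and month names.

```python
import time

def get_latest_time(versions):
    # Same conversion as in df_build.py: format the last version's 'created' field
    return time.strftime("%d %b %Y", time.strptime(versions[-1]["created"], "%a, %d %b %Y %H:%M:%S %Z"))

def get_latest_date(update_date):
    # Same conversion as in df_build.py: reformat a YYYY-MM-DD date
    return time.strftime("%d %b %Y", time.strptime(update_date, "%Y-%m-%d"))

# Made-up sample values in the shapes the script expects
versions = [
    {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
    {"version": "v2", "created": "Tue, 29 Sep 2009 08:00:00 GMT"},
]
latest_from_versions = get_latest_time(versions)
latest_from_update = get_latest_date("2009-09-29")

print(latest_from_versions)
print(latest_from_update)
```

Both helpers produce the same `DD Mon YYYY` string format, so the `_time` column looks the same whichever source field it came from.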


Time to execute the script! Feel free to watch the RAM usage during execution, and observe how the RAM is freed once the script finishes.

In [3]:
# Run the df_build.py script
!python df_build.py --input_json "/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json" --output_json "/kaggle/working/arxiv_cs.json"

## ☑️ Selecting Samples for Performance Verification

To ensure our RAG system functions correctly and there are no mismatches in the indices or information, we will create a set of search queries, and select their expected results. These handpicked queries will serve as a small benchmark to verify the performance of our system once it is complete.

In [4]:
# Load the small JSON dataset consisting of ArXiv CS abstracts
import polars as pl
cs_arxiv = pl.read_ndjson('/kaggle/working/arxiv_cs.json')
cs_arxiv.head(2)

| id | authors | title | comments | journal-ref | doi | categories | abstract | update_date | _time |
|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | str | str | str | str |
| "0704.0047" | "T. Kosel and I. Grabec" | "Intelligent location of simult…" | "5 pages, 5 eps figures, uses I…" | | | "cs.NE cs.AI" | " The intelligent acoustic emi…" | "2009-09-29" | "01 Apr 2007" |
| "0704.0050" | "T. Kosel and I. Grabec" | "Intelligent location of simult…" | "5 pages, 7 eps figures, uses I…" | | | "cs.NE cs.AI" | " Part I describes an intellig…" | "2007-05-23" | "01 Apr 2007" |


Here, we pick the search queries `mistral`, `mixtral`, `gemini`, and `ColBERTv2`, along with the arXiv IDs of their respective papers (`2310.06825`, `2401.04088`, `2312.11805`, and `2112.01488`).

In [5]:
search_queries = {'mistral':'2310.06825',
                  'mixtral': '2401.04088',
                  'gemini': '2312.11805',
                  'ColBERTv2': '2112.01488',}
search_ids = list(search_queries.values())
# Get the metadata rows for the arXiv IDs in search_ids as a DataFrame
filtered_df = cs_arxiv.filter(pl.col('id').is_in(search_ids))

Let's save this subset to an NDJSON file.

In [6]:
filtered_df.write_ndjson('/kaggle/working/arxiv_cs_verify.json')
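
Before moving on, it may be worth confirming that every expected ID was actually present in the filtered dataset, since a missing paper would otherwise only surface during verification. The sketch below uses plain Python lists; `find_missing_ids` is a hypothetical helper, and the `found` list is invented to show a failing case. In the notebook, the found IDs could be taken from `filtered_df['id'].to_list()`.

```python
def find_missing_ids(expected_ids, found_ids):
    """Return the expected IDs that are absent from the found IDs, keeping their order."""
    found = set(found_ids)
    return [paper_id for paper_id in expected_ids if paper_id not in found]

search_queries = {
    "mistral": "2310.06825",
    "mixtral": "2401.04088",
    "gemini": "2312.11805",
    "ColBERTv2": "2112.01488",
}
expected = list(search_queries.values())

# Invented filter result: pretend one paper was not in the dataset
found = ["2310.06825", "2401.04088", "2112.01488"]

missing = find_missing_ids(expected, found)
print("Missing IDs:", missing)
```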

## 🗃️ Indexing Abstracts

In [7]:
# Create a directory where we can store the data
import os
upload_folder = 'arxiv'
upload_dir = os.path.join('/kaggle/working/', upload_folder)
os.makedirs(upload_dir, exist_ok=True)  # do not fail if the cell is re-run

### Picking the metadata

Here we pick which metadata to store along with the abstracts. While not all metadata may be useful for our case, we store the DOI, author names, paper title, and the time the paper was last updated (denoted by `doi`, `authors`, `title`, and `_time` respectively) along with the abstracts.

In [8]:
abstracts = list(cs_arxiv['abstract'])
ids = list(cs_arxiv['id'])
metadata = cs_arxiv[['doi','authors','title', '_time']]

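# Replace nulls with the string 'None' to avoid NoneType errors, then convert each row to a dict
if isinstance(metadata, pl.DataFrame):
    metadata = metadata.fill_null('None')
    metadata = metadata.to_dicts()
else:
    raise TypeError(f"Expected a Polars DataFrame, got {type(metadata).__name__}")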
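
The same preparation can be sketched in plain Python to make the intent visible: null values become the string `'None'`, and the abstracts, IDs, and metadata end up as three lists of equal length. The assumption here is that the indexer pairs them up by position. The `fill_none` helper and all sample rows are made up for illustration.

```python
def fill_none(records, placeholder="None"):
    """Replace None values in each metadata dict with a placeholder string."""
    return [
        {key: (placeholder if value is None else value) for key, value in record.items()}
        for record in records
    ]

# Made-up rows in the shape of the selected metadata columns
raw_metadata = [
    {"doi": None, "authors": "A. Author", "title": "First paper", "_time": "01 Apr 2007"},
    {"doi": "10.0000/example", "authors": "B. Author", "title": "Second paper", "_time": "02 Apr 2007"},
]
ids = ["0000.0001", "0000.0002"]
abstracts = ["First abstract.", "Second abstract."]

metadata = fill_none(raw_metadata)

# The three lists are assumed to be matched by position, so their lengths must agree
assert len(abstracts) == len(ids) == len(metadata)
print(metadata[0])
```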

### Loading Pre-trained ColBERTv2 and Indexing Abstracts

First, we load the pretrained ColBERTv2 model using the RAGatouille library.

In [9]:
%%time

from ragatouille import RAGPretrainedModel
# We load the pretrained model "colbert-ir/colbertv2.0" and specify a location to store the index
Colbertv2 = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0", index_root= upload_dir)

artifact.metadata:   0%|          | 0.00/1.63k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/405 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

CPU times: user 9.8 s, sys: 2.48 s, total: 12.3 s
Wall time: 19.8 s


Now this is what we have been working for - time to start indexing!

In [10]:
%%time
Colbertv2.index(collection = abstracts, 
        document_ids = ids,
        document_metadatas = metadata,
        index_name = 'arxiv_colbert',
        max_document_length = 512,
        split_documents = False,
        bsize = 128
        )



[Aug 15, 08:26:13] #> Creating directory /kaggle/working/arxiv/colbert/indexes/arxiv_colbert 


[Aug 15, 08:26:17] [0] 		 #> Encoding 108609 passages..
[Aug 15, 08:57:24] [0] 		 avg_doclen_est = 203.3844757080078 	 len(local_sample) = 108,609
[Aug 15, 08:57:39] [0] 		 Creating 131,072 partitions.
[Aug 15, 08:57:39] [0] 		 *Estimated* 78,095,367 embeddings.
[Aug 15, 08:57:39] [0] 		 #> Saving the indexing plan to /kaggle/working/arxiv/colbert/indexes/arxiv_colbert/plan.json ..
Clustering 22039384 points in 128D to 131072 clusters, redo 1 times, 4 iterations
  Preprocessing in 2.83 s
[Aug 15, 09:06:58] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Aug 15, 09:08:10] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[0.028, 0.028, 0.029, 0.026, 0.027, 0.031, 0.03, 0.026, 0.027, 0.028, 0.028, 0.028, 0.028, 0.029, 0.026, 0.03, 0.025, 0.028, 0.026, 0.027, 0.028, 0.029, 0.027, 0.028, 

0it [00:00, ?it/s]

[Aug 15, 09:09:22] [0] 		 #> Encoding 25000 passages..


1it [07:34, 454.29s/it]

[Aug 15, 09:16:56] [0] 		 #> Encoding 25000 passages..


2it [15:12, 456.77s/it]

[Aug 15, 09:24:34] [0] 		 #> Encoding 25000 passages..


3it [22:51, 457.71s/it]

[Aug 15, 09:32:13] [0] 		 #> Encoding 25000 passages..


4it [30:30, 458.20s/it]

[Aug 15, 09:39:52] [0] 		 #> Encoding 25000 passages..


5it [38:10, 458.66s/it]

[Aug 15, 09:47:32] [0] 		 #> Encoding 25000 passages..


6it [45:50, 459.34s/it]

[Aug 15, 09:55:12] [0] 		 #> Encoding 25000 passages..


7it [53:28, 458.98s/it]

[Aug 15, 10:02:51] [0] 		 #> Encoding 25000 passages..


8it [1:01:10, 459.76s/it]

[Aug 15, 10:10:32] [0] 		 #> Encoding 25000 passages..


9it [1:08:52, 460.51s/it]

[Aug 15, 10:18:14] [0] 		 #> Encoding 25000 passages..


10it [1:16:34, 460.88s/it]

[Aug 15, 10:25:56] [0] 		 #> Encoding 25000 passages..


11it [1:24:16, 461.31s/it]

[Aug 15, 10:33:38] [0] 		 #> Encoding 25000 passages..


12it [1:31:58, 461.56s/it]

[Aug 15, 10:41:20] [0] 		 #> Encoding 25000 passages..


13it [1:39:41, 461.81s/it]

[Aug 15, 10:49:03] [0] 		 #> Encoding 25000 passages..


14it [1:47:24, 462.30s/it]

[Aug 15, 10:56:46] [0] 		 #> Encoding 25000 passages..


15it [1:55:08, 462.67s/it]

[Aug 15, 11:04:30] [0] 		 #> Encoding 8979 passages..


16it [1:57:52, 442.02s/it]
100%|██████████| 16/16 [00:00<00:00, 83.32it/s]


[Aug 15, 11:07:19] #> Optimizing IVF to store map from centroids to list of pids..
[Aug 15, 11:07:19] #> Building the emb2pid mapping..
[Aug 15, 11:07:20] len(emb2pid) = 78078959


100%|██████████| 131072/131072 [00:14<00:00, 8926.98it/s]


[Aug 15, 11:07:37] #> Saved optimized IVF to /kaggle/working/arxiv/colbert/indexes/arxiv_colbert/ivf.pid.pt
Done indexing!
CPU times: user 2h 41min 33s, sys: 5min 18s, total: 2h 46min 51s
Wall time: 2h 41min 31s


'/kaggle/working/arxiv/colbert/indexes/arxiv_colbert'
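
The numbers in the indexing log above can be cross-checked with a little arithmetic. The sketch below sums the chunk sizes to get the total number of passages, multiplies by the logged average document length to approximate the estimated embedding count, and derives the partition count. The partition rule used here (the largest power of two not exceeding 16 times the square root of the estimated embeddings) is an assumption about how ColBERT picks the number of clusters, not something stated in this notebook.

```python
import math

# Chunk sizes reported in the log: fifteen chunks of 25,000 passages plus a final 8,979
chunk_sizes = [25_000] * 15 + [8_979]
total_passages = sum(chunk_sizes)

# Average tokens per passage, as estimated in the log from the 108,609-passage sample
avg_doclen_est = 203.3844757080078

# Estimated number of token embeddings across the whole collection
estimated_embeddings = int(total_passages * avg_doclen_est)

# Assumed rule: largest power of two not exceeding 16 * sqrt(estimated embeddings)
num_partitions = 2 ** math.floor(math.log2(16 * math.sqrt(estimated_embeddings)))

print(f"Total passages:       {total_passages:,}")
print(f"Estimated embeddings: {estimated_embeddings:,}")
print(f"Partitions:           {num_partitions:,}")
```

For comparison, the log reports an estimate of 78,095,367 embeddings and 131,072 partitions; the final `len(emb2pid)` of 78,078,959 is the actual embedding count, which is slightly below the estimate.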

## 📊 Verifying Indices

All right! Now that the embeddings have been generated, let's check the retrieval performance. We use the previously handpicked queries and check whether the expected result is within the top 10 results. We also compare the retrieved content (abstract, title) with the expected content.

In [11]:
Colbertv2.search(query="What is mistral", k=10)

Loading searcher for index arxiv_colbert for the first time... This may take a few seconds
[Aug 15, 11:07:45] #> Loading codec...
[Aug 15, 11:07:45] #> Loading IVF...
[Aug 15, 11:07:45] #> Loading doclens...


100%|██████████| 16/16 [00:00<00:00, 680.93it/s]

[Aug 15, 11:07:45] #> Loading codes and residuals...



100%|██████████| 16/16 [00:02<00:00,  6.23it/s]


Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is mistral, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  2003, 11094,  7941,   102,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')



[{'content': '  In a digital epoch where cyberspace is the emerging nexus of geopolitical\ncontention, the melding of information operations and Large Language Models\n(LLMs) heralds a paradigm shift, replete with immense opportunities and\nintricate challenges. As tools like the Mistral 7B LLM (Mistral, 2023)\ndemocratise access to LLM capabilities (Jin et al., 2023), a vast spectrum of\nactors, from sovereign nations to rogue entities (Howard et al., 2023), find\nthemselves equipped with potent narrative-shaping instruments (Goldstein et\nal., 2023). This paper puts forth a framework for navigating this brave new\nworld in the "ClausewitzGPT" equation. This novel formulation not only seeks to\nquantify the risks inherent in machine-speed LLM-augmented operations but also\nunderscores the vital role of autonomous AI agents (Wang, Xie, et al., 2023).\nThese agents, embodying ethical considerations (Hendrycks et al., 2021), emerge\nas indispensable components (Wang, Ma, et al., 2023), e

To verify the result, we search using the predefined queries (`mistral`, `mixtral`, `gemini`, and `ColBERTv2` in this case) and check whether any of the top-10 search results contains the expected arXiv ID (`document_id` in `rag_output`).

If the `document_id` matches the expected ID, we perform an additional verification: we check that the stored metadata matches the expected metadata. The metadata could accidentally become misaligned with the indexed documents, and this check ensures that did not happen.

In [12]:
# Initialize counters for total number of queries and number of wrong results
total = 0
wrong = 0

# Loop through each query and its corresponding value from the search_queries dictionary
for query, value in search_queries.items():
    # Perform a search using the query and retrieve the top 10 results
    rag_output = Colbertv2.search(query=query, k=10)
    
    # Retrieve the ground truth data based on the value from filtered_df
    gnd_truth = filtered_df.row(by_predicate=(pl.col("id") == value), named=True)
    
    # Initialize a flag to check if there is a match
    match = False
    
    # Loop through the search results to check if any match the ground truth
    for rag_output_val in rag_output:
        # Check if the document ID matches the ground truth ID
        if rag_output_val['document_id'] == gnd_truth['id']:
            # Assert that the titles are the same
            assert rag_output_val['document_metadata']['title'] == gnd_truth['title']
            
            # Check if the content matches the ground truth abstract
            abs_match_1 = rag_output_val['content'] == gnd_truth['abstract']
            abs_match_2 = rag_output_val['content'].replace('\n', '').replace(' ', '') == gnd_truth['abstract'].replace('\n', '').replace(' ', '')
            
            # Assert that either of the two abstract match conditions is true
            assert any([abs_match_1, abs_match_2])
            
            # Set the match flag to True
            match = True
    
    # If no match was found, increment the wrong counter and print the ground truth and search results
    if not match:
        wrong += 1
        print(gnd_truth)
        print(rag_output)
        
    # Increment the total counter for each query
    total += 1

# Print the number of correct matches and the accuracy percentage
print("Matches : ", total - wrong, " Accuracy:", round((total - wrong) * 100 / total, 2))


Matches :  4  Accuracy: 100.0
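
The loop above accepts an abstract either when it matches exactly or when it matches after newlines and spaces are removed, presumably so that differences in line wrapping or padding do not count as mismatches. A minimal standalone sketch of that comparison follows; the `normalize` and `abstracts_match` helpers and the sample strings are made up for illustration.

```python
def normalize(text: str) -> str:
    """Drop newlines and spaces, as the second abstract check in the loop does."""
    return text.replace("\n", "").replace(" ", "")

def abstracts_match(retrieved: str, expected: str) -> bool:
    """True if the texts are identical, or identical once newlines and spaces are ignored."""
    return retrieved == expected or normalize(retrieved) == normalize(expected)

# Made-up example: same words, different line wrapping and leading whitespace
expected = "  We introduce a model\nthat retrieves papers.\n"
retrieved = "We introduce a model that retrieves papers."

print(abstracts_match(retrieved, expected))
print(abstracts_match("A different abstract.", expected))
```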


Awesome! If things look great, we can use these indices in our [ArXiv CS RAG](https://huggingface.co/spaces/bishmoy/Arxiv-CS-RAG) space.