# Archimedes AI - Build with Claude June 2024 Developer Contest

**Authors**: Antek Hasiura (Princeton University) & Jędrzej Hasiura (Eindhoven University of Technology)
<br/>
<br/>

**Project Goal**: develop an AI research assistant ("Archimedes") that could accelerate the academic research process through RAG-based retrieval of scholarly work, leveraging recent findings on the efficiency of LLMs in Markdown processing, chain-of-thought prompting, and reranking.
<br/>
<br/>

**Motivation**:
* The high volume of published papers makes it difficult for researchers to stay up to date with relevant work in their field
* Tools for literature review are outdated, and the space has yet to be disrupted by recent developments in LLMs

<br/>


In this notebook, we showcase the backend RAG model we developed for Archimedes. At the end of our work, we discuss what we envision Archimedes could do if we continued to develop the project further. Our initial UI drafts can also be viewed in the provided repository.

<br/>


**References**:

<u>Markdowns</u>

[1] Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition [https://arxiv.org/abs/2401.12599]

[2] Nougat: Neural Optical Understanding for Academic Documents [https://arxiv.org/abs/2308.13418]

[3] Marker [https://github.com/VikParuchuri/marker]
<br/><br/>


<u>Retrieval Augmented Generation</u>

[4] Azure AI Search: Outperforming vector search with hybrid retrieval and ranking capabilities [https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167]

[5] Retrieval-Augmented Generation for Large Language Models: A Survey [https://arxiv.org/pdf/2312.10997]

[6] Introducing Rerank 3: A New Foundation Model for Efficient Enterprise Search & Retrieval [https://cohere.com/blog/rerank-3]

[7] Developing and Evaluating RAG Solution: Generate embeddings [https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-generating-embeddings]

[8] Build with Claude: Embeddings [https://docs.anthropic.com/en/docs/build-with-claude/embeddings]
<br/><br/>

<u>Prompting</u>

[9] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [https://arxiv.org/abs/2201.11903]

[10] The Prompt Report: A Systematic Survey of Prompting Techniques [https://arxiv.org/pdf/2406.06608]

[11] Anthropic Prompt Generator [https://console.anthropic.com]

[12] Emergent Abilities of Large Language Models [https://arxiv.org/pdf/2206.07682]
<br/><br/>

<u>Benchmarks </u>

[13] MTEB Embedding Benchmark - [https://huggingface.co/spaces/mteb/leaderboard, https://arxiv.org/abs/2210.07316]

[14] Claude 3.5 Sonnet comparison with GPT-4o [https://www.anthropic.com/news/claude-3-5-sonnet]

## 1. Initial Setup

In [None]:
!pip install --quiet langchain langchain-chroma langchain-openai sentence-transformers langchain-cohere langchain-anthropic fastembed langchain-community langchain-voyageai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.4.1 requires cubinlinker, which is not installed.
cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.4.1 requires ptxcompiler, which is not installed.
cuml 24.4.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
keras-cv 0.9.0 requires keras-core, which is not installed.
keras-nlp 0.12.1 requires keras-core, which is not installed.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
aiobotocore 2.13.0 requires aiohttp<4.0.0,>=3.9.2, but you have aiohttp 3.9.1 which is incompatible.
aiobotocore 2.13.0 requires botocore<1.34.107,>=1.34.70, but you have botocore 1.34.141 which is incompatible.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is inc

In [None]:
from kaggle_secrets import UserSecretsClient
import os

os.environ["OPENAI_API_KEY"] = UserSecretsClient().get_secret("OPENAI_API_KEY")
os.environ["COHERE_API_KEY"] = UserSecretsClient().get_secret("COHERE_API_KEY")
os.environ["ANTHROPIC_API_KEY"] = UserSecretsClient().get_secret("ANTHROPIC_API_KEY")
os.environ["VOYAGE_API_KEY"] = UserSecretsClient().get_secret("VOYAGE_API_KEY")

## 2. Paper Database

### 2.1 Processing PDFs into Markdowns
We begin by converting PDFs of academic papers scraped from the arXiv database into Markdown files. Lin 2024 [1] demonstrated how low accuracy of PDF parsing in RAGs impacts the effectiveness of knowledge-based QA, so we address this problem by equipping the Archimedes database with an alternative format. Our choice is **Markdown**: a format that allows for consistent formatting of text, formulas and tables.

<br/>

While there are a number of open source solutions for PDF to Markdown conversion, two currently stand out as leading: Meta's Nougat-OCR [2] and Marker [3]. Both show improved efficiency relative to the previously common solution, PyPDF (which was utilized by e.g. ChatGPT).

<br/>

Nougat-OCR and Marker can be thought of as complements to each other. Marker allows faster processing, while Nougat-OCR has proven to be more stable in high-accuracy generation. As such, we use a combination of both to develop our database for Archimedes.


### 2.2 Loading Markdown Files
We trim each text to exclude the bibliography section, removing content that would be irrelevant for the RAG.

In [None]:
from typing import Iterator
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class ArxivMarkdownLoader(BaseLoader):
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:
        try:
            with open(self.file_path, 'r', encoding="utf-8") as f:
                lines = f.readlines()  # Read all lines to extract the range
                abstract_start = None
                references_end = None

                abstract_str = None
                for idx, line in enumerate(lines):
                    if "###### Abstract" in line:
#                         print("Found 6# Abstract")
                        abstract_start = idx
                        abstract_str = "###### Abstract"
                    elif "## References" in line:
                        references_end = idx
                        break  # We can stop searching once the end is found

                if abstract_start is None or references_end is None or abstract_start >= references_end:
                    raise ValueError("Could not find the specified section in the file.")

                section_content = abstract_str + "".join(lines[abstract_start + 1:references_end])

                yield Document(
                    page_content=section_content,
                    metadata={"source": self.file_path}
                )
        except Exception:
            print(f"Failed to process {self.file_path}")

In [None]:
import os

class ArxivMarkdownDirectoryLoader(BaseLoader):

    def __init__(self, directory_path: str, file_extension: str = ".md") -> None:
        self.directory_path = directory_path
        self.file_extension = file_extension

    def lazy_load(self) -> Iterator[Document]:
        file_paths = [
            os.path.join(self.directory_path, f)
            for f in os.listdir(self.directory_path)
            if f.endswith(self.file_extension)
        ]

        for file_path in file_paths:
            loader = ArxivMarkdownLoader(file_path)
            # Load documents from the individual file loader
            for document in loader.lazy_load():
                yield document

In [None]:
loader = ArxivMarkdownDirectoryLoader("/kaggle/input/arxivlangchain-md2/papers_md",
                                     file_extension = ".mmd")
markdown_documents = loader.load()

Failed to process /kaggle/input/arxivlangchain-md2/papers_md/2403.08822v1.LoRA_SP__Streamlined_Partial_Parameter_Adaptation_for_Resource_Efficient_Fine_Tuning_of_Large_Language_Models.mmd
Failed to process /kaggle/input/arxivlangchain-md2/papers_md/2206.14077v2.DSME_LoRa__Seamless_Long_Range_Communication_Between_Arbitrary_Nodes_in_the_Constrained_IoT.mmd
Failed to process /kaggle/input/arxivlangchain-md2/papers_md/2402.13533v1.FinGPT_HPC__Efficient_Pretraining_and_Finetuning_Large_Language_Models_for_Financial_Applications_with_High_Performance_Computing.mmd
Failed to process /kaggle/input/arxivlangchain-md2/papers_md/2206.09532v1.Hands_on_Wireless_Sensing_with_Wi_Fi__A_Tutorial.mmd
Failed to process /kaggle/input/arxivlangchain-md2/papers_md/2209.00863v1.Delay_Tolerant_ICN_and_Its_Application_to_LoRa.mmd
Failed to process /kaggle/input/arxivlangchain-md2/papers_md/2401.13569v1.SPARC_LoRa__A_Scalable__Power_efficient__Affordable__Reliable__and_Cloud_Service_enabled_LoRa_Networking_Sys
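
The loader reports that a file failed without saying why. A small diagnostic along the following lines could show which marker is missing in a given paper. This is only a sketch: `diagnose_markers` is a hypothetical helper that is not part of the notebook, and it assumes the same two marker strings the loader searches for.

```python
def diagnose_markers(markdown_text, start_marker="###### Abstract", end_marker="## References"):
    """Return a list of reasons why a marker-based trim would reject this text."""
    lines = markdown_text.splitlines()
    # Index of the first line containing each marker, or None if it never appears.
    start = next((i for i, line in enumerate(lines) if start_marker in line), None)
    end = next((i for i, line in enumerate(lines) if end_marker in line), None)

    problems = []
    if start is None:
        problems.append(f"start marker {start_marker!r} not found")
    if end is None:
        problems.append(f"end marker {end_marker!r} not found")
    if start is not None and end is not None and start >= end:
        problems.append("end marker appears before start marker")
    return problems


# Toy inputs (made up for illustration, not taken from the dataset).
good = "# Title\n###### Abstract\nBody text\n## References\n[1] ..."
no_abstract = "# Title\n## Abstract\nBody text\n## References\n[1] ..."

print(diagnose_markers(good))         # []
print(diagnose_markers(no_abstract))
```

Running such a check over the failed files would indicate whether the conversion produced a different abstract heading level or omitted the references heading.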

In [None]:
len(markdown_documents)

88

Please note that this only includes a part of our database for the purposes of this demo.

### 3. Chunking

While dividing the papers into coherent subsections helps the RAG, we take our data processing a step further with chunking. Microsoft Research 2023 [4] showed that chunks of 512 tokens with 25% overlap strike a balance between capturing enough context and staying within the token limits of many NLP models, making it more likely that relevant information is retrieved. In the case of Archimedes, only a small number of subsections exceed 512 tokens, which also reduces the risk of cutting up coherent passages during chunking. As such, we adopt this configuration in our RAG.
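
To make the numbers concrete: with 512-token chunks and 25% overlap, consecutive chunks share 128 tokens and each new chunk advances by 384 tokens. The sketch below illustrates this with a plain sliding window over a list of token ids. It is a simplification, since a recursive splitter also tries to break on separators, and `chunk_with_overlap` is a hypothetical name used only here.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap_ratio=0.25):
    """Split a token list into fixed-size windows that overlap by `overlap_ratio`."""
    overlap = int(chunk_size * overlap_ratio)  # 128 tokens for 512 and 25%
    stride = chunk_size - overlap              # 384 new tokens per chunk
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


tokens = list(range(1000))  # stand-in for 1000 token ids
chunks = chunk_with_overlap(tokens)
print([(c[0], c[-1]) for c in chunks])  # [(0, 511), (384, 895), (768, 999)]
```

A 1000-token section therefore yields three chunks, the last one shorter than the others.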

In [None]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

def flatten(nested_list):
    return [item for sublist in nested_list for item in sublist]

headers_to_split_on = [
    ("#", "Section"),
    ("##", "Section"),
    ("######", "Section"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers = False)
docs = flatten([markdown_splitter.split_text(d.page_content) for d in markdown_documents])


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 512
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            encoding_name="cl100k_base", chunk_size=CHUNK_SIZE, chunk_overlap=int(0.25 * CHUNK_SIZE),
            allowed_special={'<|endofprompt|>', '<|endoftext|>',}
        )
docs = text_splitter.split_documents(docs)

In [None]:
len(docs)

2566

In [None]:
docs[0]

Document(metadata={'Section': 'Abstract'}, page_content="###### Abstract\nAs the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pre-trained vision encoders such as vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, _e.g._, the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select th

### 4. Vector database

The overlapping chunks need to be embedded and loaded into a vector database for the RAG architecture. The main decision we faced when developing the vector database was the choice of embedding model.
<br/>

Based on Microsoft Research [7], in the absence of specialized embedding models for academic papers, we settled on generalist embedding models. We guided our decision by the MTEB benchmark [13], focusing on both overall score and applicability. As such, we decided to use the Voyage Large 2 model. It is also the model generally recommended by the Anthropic team [8], and one that has proven highly effective for the tasks tackled by Archimedes.
<br/>

For the implementation of the database, we chose to use ChromaDB as a test store and Pinecone for the production step.
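
As a rough illustration of what the vector database does at query time, the sketch below ranks a few chunks by cosine similarity to a query vector. The three-dimensional vectors are made up for the example, and cosine similarity is an assumption chosen for illustration: the metric a given store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical embeddings: real models produce vectors with many more dimensions.
toy_index = [
    ("LoRA fine-tunes LLMs with low-rank updates", [0.9, 0.1, 0.0]),
    ("LoRa is a long-range radio technology", [0.1, 0.9, 0.0]),
    ("Vision encoders in multimodal models", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of a question about LLM fine-tuning

results = top_k(query, toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The chunk whose vector points in nearly the same direction as the query is returned first, which is the behaviour the retriever relies on.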


In [None]:
from langchain_chroma import Chroma
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_voyageai import VoyageAIEmbeddings
import torch
from tqdm import tqdm

def split_list(input_list, chunk_size):
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]

def load_chromadb(embedding_function):
    split_docs_chunked = split_list(docs, 1)

    vector_db = None
    for split_docs_chunk in tqdm(list(split_docs_chunked)):
        vector_db = Chroma.from_documents(
            documents=split_docs_chunk,
            embedding=embedding_function,
            persist_directory=DB_PATH,
        )

    return vector_db


DB_PATH = "./chroma_db"

# embedding_function = FastEmbedEmbeddings()
# embedding_function = OpenAIEmbeddings(model="text-embedding-3-large")
# embedding_function = HuggingFaceEmbeddings(
#     model_name="Alibaba-NLP/gte-large-en-v1.5",
#     model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu',
#                'trust_remote_code':True,
#                 'torch_dtype':torch.float16
#                },
#     encode_kwargs={'normalize_embeddings': True}
# )

embedding_function = VoyageAIEmbeddings(
    voyage_api_key=os.environ["VOYAGE_API_KEY"], model="voyage-large-2"
)

vector_db = load_chromadb(embedding_function)
print("Finished loading")


batch size None


100%|██████████| 2566/2566 [16:05<00:00,  2.66it/s]

Finished loading





### 5. Overview of Archimedes' Search Philosophy

Our key principle is to extract as much knowledge from papers as possible without relying on any information that an LLM uses internally (hence the choice of temperature=0 for our output generation). This approach is most useful to researchers who want to gain insights based solely on scholarly research.


Please note that in this experiment we focused on retrieval from embedded documents only. In the future, however, we plan to combine keyword-based search with RAG retrieval, which Microsoft Research has shown to be an efficient approach [4].
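
One common way to merge a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF), where each document earns `1 / (k + rank)` from every list it appears in. The sketch below is an illustration of that idea only, not something Archimedes implements. The document ids are made up, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list receive a larger contribution.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for one query from two independent retrievers.
keyword_ranking = ["doc_b", "doc_a", "doc_d"]
vector_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)  # ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Documents found by both retrievers rise to the top, while documents found by only one remain available further down the list.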

### 6. Reranking & Testing

To improve information extraction and further advance the Archimedes RAG solution, we added a reranking step [5]. This approach first retrieves a number of potentially relevant documents and then reranks them by relevance to the query in order to highlight a subset of top documents (chunks). We used the dedicated Cohere Rerank 3 model, which has been shown to outperform generalist solutions like GPT-4-turbo while remaining cost efficient [6].


We are going to ask a seemingly simple but quite ambiguous question: What is LoRA?

In LLMs, it is Low-Rank Adaptation, used for efficient fine-tuning. Yet a nearly identical name (LoRa) also exists in computer networking. Note that our database for the RAG contains papers relevant to LoRA in each of the two (among other) contexts. As such, for a simple RAG this task could present a substantial challenge. We begin by checking whether our RAG with the rerank step accurately selects relevant chunks.

In [None]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere

# https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.document_compressors.cohere_rerank.CohereRerank.html
question = "What is LoRA in LLM?"

reranker = Cohere(temperature=0)
compressor = CohereRerank(model="rerank-english-v3.0", top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vector_db.as_retriever(search_kwargs={"k": 20})
)

retrieved_docs = compression_retriever.invoke(question)

In [None]:
print("\n\n".join([d.page_content for d in retrieved_docs]))

###### Abstract
LoRA (Low-Rank Adaptation) [1] has emerged as a preferred method for efficiently adapting Large Language Models (LLMs) with remarkable simplicity and efficacy. This note extends the original LoRA paper by offering new perspectives that were not initially discussed and presents a series of insights for deploying LoRA at scale. Without introducing new experiments, we aim to improve the understanding and application of LoRA.

## 1 Introduction  
Large Language Models (LLMs) have achieved significant success across a wide spectrum of Natural Language Processing (NLP) tasks Brown et al. (2020); Yuan et al. (2023); Huang et al. (2023). For practical deployment, fine-tuning these models is essential, as it improves their performance for specific downstream tasks and/or aligns model behaviors with human preferences. Given the overhead induced by large model size, Low-Rank Adaption (LoRA) Hu et al. (2021) comes as a parameter-efficient finetuning mechanism widely adopted to fine
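
When judging whether the reranker picked the right sense of LoRA, it can help to look at a compact summary of each hit rather than the full text. The sketch below assumes the reranker stores its score under a `relevance_score` metadata key, which may not hold for every version, so a missing key is shown as `None`. The helper name and the stand-in documents are hypothetical.

```python
from types import SimpleNamespace


def summarize_hits(documents, preview_chars=60):
    """Return (rank, score, section, preview) tuples for a list of retrieved documents."""
    rows = []
    for rank, doc in enumerate(documents, start=1):
        score = doc.metadata.get("relevance_score")  # assumed key; None if absent
        section = doc.metadata.get("Section", "unknown")
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        rows.append((rank, score, section, preview))
    return rows


# Stand-ins that only mimic the `metadata` and `page_content` attributes.
fake_docs = [
    SimpleNamespace(
        metadata={"relevance_score": 0.98, "Section": "Abstract"},
        page_content="###### Abstract\nLoRA (Low-Rank Adaptation) ...",
    ),
    SimpleNamespace(metadata={}, page_content="## 1 Introduction\nLarge Language Models"),
]

rows = summarize_hits(fake_docs)
for row in rows:
    print(row)
```

In the notebook, the same function could be called on `retrieved_docs` in place of the stand-ins.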

### 7. Final response generation

After retrieving the reranked top documents, the final stage of the advanced RAG pipeline is to form the final response. We used Claude 3.5 Sonnet as our LLM, which according to the most recent benchmark [14] outperformed GPT-4o in key capabilities vital to our task: graduate-level reasoning and reasoning over text.

Having ensured the use of the most capable model, we worked on incorporating the best prompting practices, following those outlined by Schulhoff et al. 2024 [10]:
1. We used a role prompt to adjust style of response
2. We used a prompt template that allows adaptability
3. We added the handling of negative cases
4. We added the handling of cases where the model does not have sufficient context data to reduce hallucinations
5. We used Chain of Thought prompting to enhance reasoning [9,12]

To ensure a high-quality prompt, we also optimized it using the Anthropic Prompt Generator [11].

<br/>

With this advanced RAG setup, let's see how Archimedes performs in distinguishing between the two LoRA cases. We are also hoping for an output style that would be desired by a researcher looking for answers while writing a literature review for a paper...


In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model='claude-3-5-sonnet-20240620', temperature=0)
# llm = ChatAnthropic(model='claude-3-sonnet-20240229', temperature=0)
# llm = ChatOpenAI(model='gpt-4o', temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an all-knowing AI research assistant whose role is to answer questions using excerpts from research papers that will be provided to you. The provided papers are chosen very carefully so that they contain the most relevant sources there are available to answer the query. You are speaking to academic researchers."),
    ("user", """
Keep the following question in mind as you complete this task:
```
{question}
```

When working through this question I want you to think step by step. However, do not explicitly spell out the individual steps you take, rather answer the query based on the following guidelines.

Here are the excerpts (which form your context to answer the research question) from research papers that may contain relevant information to answer the question:
```
{context}
```

Carefully search through the excerpts to try to find information that will help you answer the question.
Do not add any outside information or make any assumptions. If the question cannot be fully answered by the excerpts, do not attempt to guess or fill in the gaps - just provide a partial answer using only what is stated in the excerpts. If there is nothing that you find relevant to be able to answer the question, then simply state that you cannot answer based on the information that you have available.
Do not be overly verbose.

Please also behave as if the excerpts were part of your knowledge - i.e. don't refer to them as 'excerpts' or
'provided research papers' that were given to you, but rather as knowledge that you can simply hold internally. Never use phrases like "in the excerpts provided", etc. Just imagine that the excerpts you are given are a part of your knowledge base. For example, if you're asked "What is CAMELoT", directly provide its definition from the excerpt you're provided (e.g. "CAMELoT (Consolidated
Associative Memory Enhanced Long Transformer) is a memory-augmented architecture designed to handle long input sequences in large language  models (LLMs) without the need for re-training."), instead of saying "according to the documents/excerpts provided, CAMELoT is..."

Always be specific in your references to papers. This means that, for example, if you're asked about methods, you should be naming specific methods, not giving a generic answer. Please also cite the authors and years in APA format.


""")])



chain = prompt | llm

In [None]:
question = "What is LoRA in LLM?"
context = "\n\n\n".join([d.page_content for d in retrieved_docs])

response = chain.invoke({"question": question,
             "context": context})

print(response.content)

LoRA, which stands for Low-Rank Adaptation, is a parameter-efficient fine-tuning method for Large Language Models (LLMs). Introduced by Hu et al. (2021), LoRA has become a preferred approach for efficiently adapting LLMs due to its simplicity and effectiveness.

The key idea behind LoRA is to inject a trainable rank decomposition matrix into the transformer block of an LLM while keeping the other parameters frozen. This approach significantly reduces the number of trainable parameters compared to full fine-tuning, making it more computationally efficient.

Some key advantages of LoRA include:

1. Efficiency: LoRA allows for fine-tuning LLMs with much fewer parameters. For example, a LoRA adaptation for a Llama-2-7B model weighs only about 10MB, compared to the full model size of 14GB.

2. Accessibility and shareability: LoRA weights can be easily shared and adopted for downstream tasks, enabling flexible model customization.

3. Simultaneous adoption: Multiple LoRAs can be used togethe

Let's now check what LoRA is in network communication...

In [None]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere

question = "What is LoRA in computer network communication?"

reranker = Cohere(temperature=0)
compressor = CohereRerank(model="rerank-english-v3.0", top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vector_db.as_retriever(search_kwargs={"k": 20})
)

retrieved_docs = compression_retriever.invoke(question)

In [None]:
print("\n\n".join([d.page_content for d in retrieved_docs]))

## I Introduction  
LoRa is a popular wireless technology for the Internet of Things (IoT) that achieves long range transmissions (km) at minimal power consumption (mW). The narrowband chirp spread spectrum modulation is robust against interference and doppler effect. LoRa operates in unlicensed subGHz spectrums, which are subject to regional band regulations. LoRaWAN is a cloud-based network architecture for LoRa that organizes all communication between constrained Endnodes (ENs) and user applications. LoRaWAN consists of three components: Application Servers (ASs) provide an interface for business logic implementation; A centralized Network Server (NS) coordinates communication including the PHY configuration, media access, and routes between ASs and ENs; Gateways (GWs) act as a LoRaWAN backbone and mediate packets between ENs and the NS. Three constrains are worth stressing: First, downlink packets are heavily regulated. Regional band restrictions limit the number of downlink packet

In [None]:
context = "\n\n\n".join([d.page_content for d in retrieved_docs])

response = chain.invoke({"question": question,
             "context": context})


print(response.content)

LoRa (Long Range) is a wireless communication technology designed for Internet of Things (IoT) applications. It uses chirp spread spectrum (CSS) modulation to achieve long-range transmissions (up to kilometers) while maintaining low power consumption (in the milliwatt range). LoRa operates in unlicensed sub-GHz frequency bands, typically 433 MHz, 868 MHz, or 915 MHz, depending on regional regulations.

Key features of LoRa include:

1. Modulation: LoRa uses CSS modulation, which involves a sequence of chirps with linearly increasing frequencies over time. This modulation technique allows for robust communication against interference and Doppler effects.

2. Data packet structure: A LoRa data packet consists of a preamble (for synchronization), a start frame delimiter (SFD), and a data payload.

3. Network architecture: LoRaWAN is a cloud-based network architecture built on top of LoRa. It includes End Nodes (ENs), Gateways (GWs), a Network Server (NS), and Application Servers (ASs).

4

We see that Archimedes effectively responds to the query, presenting an output desired for our target audience of academic researchers.
<br/>
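
Both queries above repeat the same steps: retrieve, join the chunks into a context string, and invoke the chain. A small wrapper could capture that pattern. In the sketch below, `answer_question` is a hypothetical helper, and the two stub classes merely imitate the `invoke` interface so the example runs without any API key.

```python
from types import SimpleNamespace


def answer_question(question, retriever, chain, separator="\n\n\n"):
    """Retrieve chunks for `question` and pass them to the chain as context."""
    retrieved = retriever.invoke(question)
    context = separator.join(doc.page_content for doc in retrieved)
    response = chain.invoke({"question": question, "context": context})
    return response.content


class StubRetriever:
    """Stand-in for a retriever: always returns two fixed chunks."""

    def invoke(self, question):
        return [
            SimpleNamespace(page_content="chunk one"),
            SimpleNamespace(page_content="chunk two"),
        ]


class StubChain:
    """Stand-in for the prompt-plus-LLM chain: echoes its inputs."""

    def invoke(self, inputs):
        return SimpleNamespace(content=f"Q: {inputs['question']} | C: {inputs['context']}")


demo = answer_question("What is LoRA?", StubRetriever(), StubChain(), separator=" / ")
print(demo)  # Q: What is LoRA? | C: chunk one / chunk two
```

With the real objects from the notebook, the call would look like `answer_question(question, compression_retriever, chain)`.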

Due to time constraints, this project has limitations, but we want to end by describing our full vision. We hope that Archimedes could become an AI copilot for scientific research, capable of:
1. Scholarly work retrieval (which we showed above)
2. Literature review writing
3. Research idea generation
4. Live research feed



While this task is undoubtedly ambitious, we continue to be thankful for teams like Anthropic, which create tools that could help bring this vision of more efficient research to reality.