# Simple RAG Demo with OLMo and ArXiv `astro-ph` Dataset

This notebook demonstrates a retrieval-augmented generation (RAG) approach that uses the [OLMo-1B](https://huggingface.co/allenai/OLMo-1B) language model by AI2 to complete an abstract from a given opening. The input text is the beginning of an abstract from the `astro-ph` category of the [ArXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv). Before generation, relevant documents are retrieved from a [Qdrant Vector Database](https://qdrant.tech/) and supplied to the model as context for the completion.
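As a rough illustration of the retrieve-then-generate flow described above, here is a minimal sketch in plain Python. The word-overlap scoring is only a stand-in for the embedding search that the vector database performs, and the names `retrieve`, `build_prompt`, `toy_corpus`, and `toy_query`, as well as the prompt wording, are assumptions made for this example.

```python
# Toy retrieve-then-generate flow. Word overlap replaces the embedding
# search purely for illustration; it is not what the notebook uses.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus entries sharing the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Place the retrieved documents in front of the text to be completed."""
    context = "\n\n".join(context_docs)
    return f"Context:\n{context}\n\nComplete the abstract:\n{query}"


# Made-up corpus entries for the example.
toy_corpus = [
    "The Magellanic Stream is a ribbon of gas trailing the Magellanic Clouds.",
    "Millimeter observations probe the solar chromosphere.",
    "Stellar streams constrain the orbital history of satellite galaxies.",
]
toy_query = "no stellar counterpart to the Magellanic Stream has been identified"
toy_prompt = build_prompt(toy_query, retrieve(toy_query, toy_corpus, k=1))
print(toy_prompt)
```

The resulting prompt would then be passed to the language model, which is the only step a real pipeline adds on top of this sketch.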

The input text is taken from the [AstroLLaMA paper](https://arxiv.org/abs/2309.06126). Rather than fine-tuning a model, we wanted to see whether a RAG approach can also work.

We will use the following statement as user input:

In [1]:
statement = """The Magellanic Stream (MS) - an enormous ribbon of gas spanning 140° of the southern
sky trailing the Magellanic Clouds - has been exquisitely mapped in the five decades since
its discovery. However, despite concerted efforts, no stellar counterpart to the MS has been
conclusively identified. This stellar stream would reveal the distance and 6D kinematics of
the MS, constraining its formation and the past orbital history of the Clouds. We"""

### Utility Functions

In [2]:
import zipfile
import json
import pandas as pd
import io
import fsspec

def fetch_arxiv_dataset(zip_url: str) -> pd.DataFrame:
    """Read the arXiv metadata (one JSON record per line) from a zip archive."""
    cols = ['id', 'title', 'abstract', 'categories']

    with fsspec.open(zip_url) as zip_file:
        with zipfile.ZipFile(zip_file) as archive:
            data = []
            # Use the first file in the archive as the JSON-lines metadata file
            json_file = archive.filelist[0]
            # Separate name for the inner stream so it does not shadow zip_file
            with archive.open(json_file) as json_stream:
                for line in io.TextIOWrapper(json_stream, encoding="latin-1"):
                    doc = json.loads(line)
                    data.append([doc[col] for col in cols])

            df_data = pd.DataFrame(data=data, columns=cols)
    return df_data

# https://github.com/allenai/open-instruct/blob/main/eval/templates.py
def create_prompt_with_olmo_chat_format(messages, bos="|||IP_ADDRESS|||", eos="|||IP_ADDRESS|||", add_bos=True):
    formatted_text = ""
    for message in messages:
        if message["role"] == "system":
            formatted_text += "<|system|>\n" + message["content"] + "\n"
        elif message["role"] == "user":
            formatted_text += "<|user|>\n" + message["content"] + "\n"
        elif message["role"] == "assistant":
            formatted_text += "<|assistant|>\n" + message["content"].strip() + eos + "\n"
        else:
            raise ValueError(
                "Olmo chat template only supports 'system', 'user' and 'assistant' roles. Invalid role: {}.".format(message["role"])
                )
    formatted_text += "<|assistant|>\n"
    formatted_text = bos + formatted_text  # forcibly add bos
    return formatted_text

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
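To show how these helpers could fit together, the sketch below packs retrieved documents and the abstract opening into the `messages` list that `create_prompt_with_olmo_chat_format` expects. The function name `build_rag_messages`, the system and user prompt wording, and the stand-in documents are assumptions for illustration, not part of the original notebook.

```python
from types import SimpleNamespace


def build_rag_messages(docs, statement: str) -> list[dict]:
    """Pack retrieved documents and the abstract opening into chat messages."""
    # Same joining rule as format_docs: one blank line between abstracts.
    context = "\n\n".join(doc.page_content for doc in docs)
    return [
        # Assumed prompt wording; adjust to taste.
        {"role": "system", "content": "You are an astrophysics writing assistant."},
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nContinue this abstract:\n{statement}",
        },
    ]


# Stand-ins for langchain Document objects; only .page_content is needed here.
toy_docs = [
    SimpleNamespace(page_content="Abstract A."),
    SimpleNamespace(page_content="Abstract B."),
]
messages = build_rag_messages(toy_docs, "The Magellanic Stream (MS) ...")
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

Only the `system`, `user`, and `assistant` roles are accepted by the chat template above, so the sketch sticks to the first two.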



### Retrieve documents (arXiv `astro-ph` abstracts)

This section retrieves the arXiv abstracts and creates documents
for loading into a vector database. You can skip running the following sections
if you have a local copy of the Qdrant Vector Database data ready to go.

In [3]:
from langchain_community.document_loaders import DataFrameLoader

Note: the `zip_url` below changes weekly.

If you are running the code for the first time or need the latest URL, please go to the ArXiv Dataset website and download the latest dataset.

You can find the `zip_url` in the file info of the locally downloaded dataset (Info -> More Info -> Where from).

If you have already downloaded the dataset, you can use the local file path directly (e.g. `file_path`).
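For readers who want to inspect a local archive before building a full DataFrame, the following standard-library sketch streams records from the first file in a zip archive one line at a time. The helper name `iter_json_lines`, the UTF-8 decoding (the notebook's own helper uses latin-1), and the two made-up records are assumptions for this example.

```python
import io
import json
import zipfile


def iter_json_lines(zip_source):
    """Yield one dict per non-empty line from the first file in a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        first_member = archive.infolist()[0]
        with archive.open(first_member) as raw:
            # Assumed encoding; switch to latin-1 to match the notebook helper.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                if line.strip():
                    yield json.loads(line)


# Build a tiny in-memory archive with two made-up records.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "toy-metadata.json",
        '{"id": "0001", "categories": "astro-ph"}\n'
        '{"id": "0002", "categories": "hep-th"}\n',
    )
buffer.seek(0)

records = list(iter_json_lines(buffer))
print(records)
```

Because the function is a generator, records could be filtered by category as they are read rather than after the whole file is in memory; `zip_source` may be a path or an open binary file object.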

In [4]:
# zip_url = "https://storage.googleapis.com/kaggle-data-sets/612177/7925852/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com@kaggle-161607.iam.gserviceaccount.com/20240327/auto/storage/goog4_request&X-Goog-Date=20240327T183523Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=4747ce35edc693785c00b4ade2fc7f62149173bf160f1b04f97fc6a752bfb1ccb5408359a16b475e7d955f04a52f2fb9f916d8090330993839fabfb1835847e0c62452243ecc74e232eeed1d747beaf6da1209b9614d305c020e6bd09bb096e6c6e2bb4711d96fb457ed1533c04bb78690253d3b6f4a4068aa3b9cd073742a3ed68562fa2a88a29e646a629dee0a26f99ff0539b5f81c926bc2b5a62642ac9f0a92febc7ca812a61351191334baad93b3ecca2ac408da8ca35a4d6e8afda67d6e8196b50c20ee18358a19cb21c25dfbcc7394bc99b280ed9222c8a933ea91f7d4b65aba05156ab985b36e761a70a35f6bbd208b9507a04ff68e15c258ec5920f"

# zip_url = "https://storage.googleapis.com/kaggle-data-sets/612177/8112112/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240420%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240420T072044Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=4205122187fe955292316d8c21c161d1a5cda0cad078505415764e112ed5502117eac4d4c492908387a31c00f84a0b5ff9591b0128fbff051a6f0f675604d95325162cf0c957033fcbdfe9f070945da28c00cfbce1b7f387804228f150ab3cb1489a9e7ce1c58f5ab8b3fe3cc5f9680d19366969b43e8e0905444416f1a77314582c3335eaee1d06e8859fb95631a5baa6a75212702383dfc628ed8cc0b34b9f9a1433cca752789d20af013022f828ef2d54ea653f03649bbf45c2b1139611e3621b5844c3f9a489a79724dee8b883f22bab92c9c08e96915561a98b3862e4e93bff86eaa613fd81d88798230fe31eee129e9dda76a2aaacb565700e41b7524f"

file_path = "/Users/amy/Desktop/archive.zip"

In [5]:
# Fetch the dataset containing all arXiv abstracts
df_data = fetch_arxiv_dataset(file_path)
# Filter the dataset to only include astro-ph category
astro_df = df_data[df_data.categories.str.contains('astro-ph')].reset_index(drop=True)
# Keep only the first 1000 astro-ph papers
astro_df = astro_df[:1000]
print("Number of astro-ph papers: ", len(astro_df))

Number of astro-ph papers:  1000
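The filter above is a substring match, so it keeps any paper whose category string mentions `astro-ph` anywhere, including cross-listed papers. If a stricter rule is wanted, a sketch along these lines could be used. It assumes the `categories` field is a space-separated string with the primary category listed first; the function names are hypothetical.

```python
def has_astro_ph(categories: str) -> bool:
    """True if any listed category is astro-ph or an astro-ph.* subcategory."""
    return any(
        cat == "astro-ph" or cat.startswith("astro-ph.")
        for cat in categories.split()
    )


def primary_is_astro_ph(categories: str) -> bool:
    """True only if the first listed category belongs to astro-ph.

    Assumes the first entry is the paper's primary category.
    """
    parts = categories.split()
    return bool(parts) and has_astro_ph(parts[0])


for example in ["astro-ph", "gr-qc astro-ph.CO", "astro-ph.GA", "hep-th"]:
    print(example, has_astro_ph(example), primary_is_astro_ph(example))
```

With pandas, either function could be applied through `df_data.categories.map(...)` to build the boolean mask.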


In [6]:
# Eagerly load the dataframe full of abstracts
# to memory in the form of langchain Document objects
loader = DataFrameLoader(astro_df, page_content_column="abstract")
documents = loader.load()

### Document Embeddings to Qdrant Vector Database

In [7]:
from langchain_community.vectorstores import Qdrant

In [8]:
from langchain_community.embeddings import HuggingFaceEmbeddings

#### Setup Vector DB

In [9]:
import os

In [10]:
qdrant_path="./local_qdrant_first_1000_L6_v2"
qdrant_collection="arxiv_astro-ph_abstracts_first_1000_L6_v2"

In [11]:
# Setup the embedding, we are using the MiniLM model here
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
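The embedding model maps each abstract to a dense vector (384 dimensions for `all-MiniLM-L6-v2`), and retrieval then compares the query vector with the stored vectors. The sketch below illustrates that comparison with cosine similarity on made-up three-dimensional vectors; that cosine similarity is the metric used by the collection is an assumption here, and the vectors and labels are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up low-dimensional "embeddings" for illustration only.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "magellanic": [0.8, 0.2, 0.1],
    "chromosphere": [0.0, 0.3, 0.9],
}

# Rank document labels from most to least similar to the query.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```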



In [12]:
if os.path.exists(qdrant_path):
    print(f"Loading existing Qdrant collection '{qdrant_collection}'")
    from qdrant_client import QdrantClient
    # If the Qdrant Vector Database Collection already exists, load it
    client = QdrantClient(path=qdrant_path)
    qdrant = Qdrant(
        client=client,
        collection_name=qdrant_collection,
        embeddings=embedding
    )
else:
    print(f"Creating new Qdrant collection '{qdrant_collection}' from {len(documents)} documents")
    
    # Load the documents into a Qdrant Vector Database Collection
    # this will save locally in the current directory as sqlite
    qdrant = Qdrant.from_documents(
        documents,
        embedding,
        path=qdrant_path,
        collection_name=qdrant_collection,
    )
    

Creating new Qdrant collection 'arxiv_astro-ph_abstracts_first_1000_L6_v2' from 1000 documents


#### Test out the Qdrant collection

In [13]:
# Set up the retriever for a later step
retriever = qdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})
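The `"mmr"` search type stands for maximal marginal relevance, which trades relevance to the query against redundancy among the documents already chosen. The following is a minimal sketch of that selection rule on two-dimensional toy vectors, not the library's implementation. The names `mmr_select` and `unit` and the example angles are assumptions, and `lambda_mult=0.5` is used here as an illustrative default.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(angle_deg):
    """Unit vector at the given angle, used to build toy embeddings."""
    rad = math.radians(angle_deg)
    return [math.cos(rad), math.sin(rad)]


query = unit(0)
docs = [unit(10), unit(12), unit(-40)]  # docs 0 and 1 are near-duplicates

diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
greedy = mmr_select(query, docs, k=2, lambda_mult=1.0)
print(diverse, greedy)
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking and returns the two near-duplicates, while `lambda_mult=0.5` skips the second near-duplicate in favour of the less similar but more distinct document.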

In [14]:
# Test out the statement retrieval
found_docs = retriever.invoke(statement)


In [15]:
print(format_docs(found_docs))

  We have analyzed the HI aperture synthesis image of the Large Magellanic
Cloud (LMC), using an objective and quantitative measure of topology to
understand the HI distribution hosting a number of holes and clumps of various
sizes in the medium. The HI distribution shows different topology at four
different chosen scales. At the smallest scales explored (19-29 pc), the HI
mass is distributed in such a way that numerous clumps are embedded on top of a
low density background. At the larger scales from 73 to 194 pc, it shows a
generic hole topology. These holes might have been formed mainly by stellar
winds from hot stars. At the scales from 240 to 340 pc, slightly above the disk
scale-height of the gaseous disk, major clumps in the HI map change the
distribution to have a slight clump topology. These clumps include the giant
cloud associations in the spiral arms and the thick filaments surrounding
superholes. At the largest scales studied (390-485 pc), the hole topology is
present again

### Question Answering

Note: to build the OLMo model correctly, we need to downgrade the `transformers` library by running the following command:

%pip install transformers==4.38

In [16]:
from pathlib import Path
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [17]:
import torch

In [18]:
qa_model = pipeline("question-answering", model="distilbert/distilbert-base-cased-distilled-squad", torch_dtype=torch.float32)


In [19]:
question = "What range of wavelengths is suggested as appropriate to look for dynamic signatures in the solar chromosphere according to the computations by Carlsson and Stein?"

print(f'Question: {question}\n')
context_docs = retriever.invoke(question)
context = " ".join([doc.page_content for doc in context_docs])

qa_response = qa_model(question=question, context=context)

print(f'\nAnswer: {qa_response["answer"]}\n')

print("Explanation:\n" + context)

Question: What range of wavelengths is suggested as appropriate to look for dynamic signatures in the solar chromosphere according to the computations by Carlsson and Stein?


Answer: 1.5-8
mHz

Explanation:
  The very nature of the solar chromosphere, its structuring and dynamics,
remains far from being properly understood, in spite of intensive research.
Here we point out the potential of chromospheric observations at millimeter
wavelengths to resolve this long-standing problem. Computations carried out
with a sophisticated dynamic model of the solar chromosphere due to Carlsson
and Stein demonstrate that millimeter emission is extremely sensitive to
dynamic processes in the chromosphere and the appropriate wavelengths to look
for dynamic signatures are in the range 0.8-5.0 mm. The model also suggests
that high resolution observations at mm wavelengths, as will be provided by
ALMA, will have the unique property of reacting to both the hot and the cool
gas, and thus will have the pote
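Because the context passed to the extractive model is the concatenation of all retrieved abstracts, it can be useful to check which abstract an answer span came from. The question-answering pipeline result includes `start` and `end` character offsets into the context, and the sketch below maps such an offset back to a document index. The helper name `locate_answer` and the toy documents and result are made up for illustration.

```python
def locate_answer(doc_texts, start, separator=" "):
    """Return the index of the document containing character offset `start`
    within separator.join(doc_texts), or None if it falls outside any document."""
    offset = 0
    for index, text in enumerate(doc_texts):
        end = offset + len(text)
        if offset <= start < end:
            return index
        # Skip over the separator that follows this document.
        offset = end + len(separator)
    return None


# Made-up documents, joined the same way as the notebook's context string.
toy_docs = [
    "Oscillations at 3 mHz dominate.",
    "Wavelengths of 0.8-5.0 mm are suitable.",
]
toy_context = " ".join(toy_docs)

# A made-up result carrying the answer text and its start offset.
toy_result = {
    "answer": "0.8-5.0 mm",
    "start": toy_context.index("0.8-5.0 mm"),
}
source_index = locate_answer(toy_docs, toy_result["start"])
print(source_index)
```

In the notebook, `doc_texts` would be `[doc.page_content for doc in context_docs]` and `start` would be `qa_response["start"]`.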

### Text Generation with a locally running LLM

In [20]:
# pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")


In [21]:
# question = "What range of wavelengths is suggested as appropriate to look for dynamic signatures in the solar chromosphere according to the computations by Carlsson and Stein?"


# context_docs = retriever.invoke(question)
# context = " ".join([doc.page_content for doc in context_docs])

# # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
# messages = [
#     {
#         "role": "system",
#         "content": "You are a helpful question answering chatbot, that tries to answer the question given the context.",
#     },
#     {"role": "user", "content": f'Question: {question}, context: {context}\n'},
# ]
# prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
# print(outputs[0]["generated_text"])