# Literature Review Assistance and Summarization Using RAG Approach

The RAG based approach below is a demonstration of how to use [OLMo-1B](https://huggingface.co/allenai/OLMo-1B) LLM model by AI2 to generate an abstract completion for a given input text. The input text is a random starting abstract from `astro-ph` category of [ArXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv). The abstract completion is generated by the model using the RAG approach. The RAG approach retrieves relevant documents from [Qdrant Vector Database](https://qdrant.tech/), which provides contextual information to the model for generating the completion.

The input text was retrieved from the [AstroLLaMa Paper](https://arxiv.org/abs/2309.06126). Rather than fine-tuning a model, we wanted to see if RAG approach can also work.

We will use the following statement as user input:

In [1]:
statement = """The Magellanic Stream (MS) - an enormous ribbon of gas spanning 140∘ of the southern
sky trailing the Magellanic Clouds - has been exquisitely mapped in the five decades since
its discovery. However, despite concerted efforts, no stellar counterpart to the MS has been
conclusively identified. This stellar stream would reveal the distance and 6D kinematics of
the MS, constraining its formation and the past orbital history of the Clouds. We"""

### Utility Functions

In [2]:
import zipfile
import json
import pandas as pd
import io
import fsspec

def fetch_arxiv_dataset(zip_path: str) -> pd.DataFrame:
    cols = ['id', 'title', 'abstract', 'categories']
    
    with zipfile.ZipFile(zip_path) as archive:
        data = []
        json_file = archive.namelist()[0]  # Assuming first file is JSON
        with archive.open(json_file) as f:
            for line in io.TextIOWrapper(f, encoding="latin-1"):
                doc = json.loads(line)
                lst = [doc['id'], doc['title'], doc['abstract'], doc['categories']]
                data.append(lst)
                
    df_data = pd.DataFrame(data=data, columns=cols)
    return df_data

# https://github.com/allenai/open-instruct/blob/main/eval/templates.py
def create_prompt_with_olmo_chat_format(messages, bos="|||IP_ADDRESS|||", eos="|||IP_ADDRESS|||", add_bos=True):
    formatted_text = ""
    for message in messages:
        if message["role"] == "system":
            formatted_text += "<|system|>\n" + message["content"] + "\n"
        elif message["role"] == "user":
            formatted_text += "<|user|>\n" + message["content"] + "\n"
        elif message["role"] == "assistant":
            formatted_text += "<|assistant|>\n" + message["content"].strip() + eos + "\n"
        else:
            raise ValueError(
                "Olmo chat template only supports 'system', 'user' and 'assistant' roles. Invalid role: {}.".format(message["role"])
                )
    formatted_text += "<|assistant|>\n"
    formatted_text = bos + formatted_text  # forcibly add bos
    return formatted_text

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)



### Retrieve documents (arXiv `astro-ph` abstracts)

This section retrieves the arXiv abstracts and creates documents
for loading into a vector database. You can skip running the following sections
if you have a local copy of the Qdrant Vector Database data ready to go.

In [3]:
from langchain_community.document_loaders import DataFrameLoader

zip_url = "https://storage.googleapis.com/kaggle-data-sets/612177/7925852/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com@kaggle-161607.iam.gserviceaccount.com/20240327/auto/storage/goog4_request&X-Goog-Date=20240327T183523Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=4747ce35edc693785c00b4ade2fc7f62149173bf160f1b04f97fc6a752bfb1ccb5408359a16b475e7d955f04a52f2fb9f916d8090330993839fabfb1835847e0c62452243ecc74e232eeed1d747beaf6da1209b9614d305c020e6bd09bb096e6c6e2bb4711d96fb457ed1533c04bb78690253d3b6f4a4068aa3b9cd073742a3ed68562fa2a88a29e646a629dee0a26f99ff0539b5f81c926bc2b5a62642ac9f0a92febc7ca812a61351191334baad93b3ecca2ac408da8ca35a4d6e8afda67d6e8196b50c20ee18358a19cb21c25dfbcc7394bc99b280ed9222c8a933ea91f7d4b65aba05156ab985b36e761a70a35f6bbd208b9507a04ff68e15c258ec5920f"

In [4]:
# Fetch the dataset containing all arXiv abstracts
df_data = fetch_arxiv_dataset("archive (1).zip")
# Filter the dataset to only include astro-ph category
astro_df = df_data[df_data.categories.str.contains('astro-ph')].reset_index(drop=True)
print("Number of astro-ph papers: ", len(astro_df))

Number of astro-ph papers:  341150


In [5]:
# Eargerly load the dataframe full of abstracts
# to memory in the form of langchain Document objects
loader = DataFrameLoader(astro_df, page_content_column="abstract")
documents = loader.load()

### Document Embeddings to Qdrant Vector Database

In [6]:
from langchain_community.vectorstores import Qdrant
from langchain.embeddings import HuggingFaceEmbeddings

#### Setup Vector DB

In [7]:
import os

In [8]:
qdrant_path="./local_qdrant"
qdrant_collection="arxiv_astro-ph_abstracts"

In [9]:
# Setup the embedding, we are using the MiniLM model here
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

  from .autonotebook import tqdm as notebook_tqdm


In [10]:
if os.path.exists(qdrant_path):
    print(f"Loading existing Qdrant collection '{qdrant_collection}'")
    from qdrant_client import QdrantClient
    # If the Qdrant Vector Database Collection already exists, load it
    client = QdrantClient(path=qdrant_path)
    qdrant = Qdrant(
        client=client,
        collection_name=qdrant_collection,
        embeddings=embedding
    )
else:
    print(f"Creating new Qdrant collection '{qdrant_collection}' from {len(documents)} documents")
    
    # Load the documents into a Qdrant Vector Database Collection
    # this will save locally in the current directory as sqlite
    qdrant = Qdrant.from_documents(
        documents,
        embedding,
        path=qdrant_path,
        collection_name=qdrant_collection,
    )

Loading existing Qdrant collection 'arxiv_astro-ph_abstracts'


#### Test out the Qdrant collection

In [11]:
# Setup the retriever for later step
retriever = qdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})

In [12]:
# Test out the statement retrieval
found_docs = retriever.get_relevant_documents(statement)

  warn_deprecated(


In [13]:
print(format_docs(found_docs))

  We explore the Magellanic Stream (MS) using a Gaussian decomposition of the
HI velocity profiles in the Leiden-Argentine-Bonn (LAB) all-sky HI survey. This
decomposition exposes the MS to be composed of two filaments distinct both
spatially (as first pointed out by Putman et al.) and in velocity. Using the
velocity coherence of the filaments, one can be traced back to its origin in
what we identify as the SouthEast HI Overdensity (SEHO) of the Large Magellanic
Cloud (LMC), which includes 30 Doradus. Parts of the Leading Arm (LA) can also
be traced back to the SEHO in velocity and position. Therefore, at least
one-half of the trailing Stream and most of the LA originates in the LMC,
contrary to previous assertions that both the MS and the LA originate in the
Small Magellanic Cloud (SMC) and/or in the Magellanic Bridge. The two MS
filaments show strong periodic, undulating spatial and velocity patterns that
we speculate are an imprint of the LMC rotation curve. If true, then the drift


### Setup OLMo Model

In [14]:
from pathlib import Path
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [15]:
model_name = "allenai/OLMo-1B"

In [16]:
# Download the model and its configuration file locally
# from the Hugging Face Hub
# we will only download the configuration file and the model as safetensors file
local_dir = Path("../OLMo-1B")
model_path = snapshot_download(
    repo_id=model_name,
    ignore_patterns=["*.bin"],
    local_dir=local_dir,
    local_dir_use_symlinks=True)

For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 2000.70it/s]


In [17]:
import hf_olmo

In [18]:
olmo = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    local_files_only=True
)

In [19]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [20]:
# Setup the text generation pipeline with the OLMo model
olmo_pipe = pipeline(
    task="text-generation",
    model=olmo,
    tokenizer=tokenizer,
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

The model 'OLMoForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MusicgenMelodyForCaus

### Setup the langchain pipeline for the OLMo model

In [21]:
from langchain.llms import HuggingFacePipeline

In [22]:
llm = HuggingFacePipeline(pipeline=olmo_pipe)

#### Define the system prompts

In [23]:
from langchain.prompts import PromptTemplate

In [24]:
no_context_prompt = PromptTemplate(
    input_variables=["question"],
    template=create_prompt_with_olmo_chat_format(messages=[
        {"role": "system", "content": "You are an astrophysics expert. Finish the given statement."}, 
        {"role": "user", "content": "{question}"}
    ]),
)

with_context_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=create_prompt_with_olmo_chat_format(messages=[
        {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context to finish the given statement:\n{context}"}, 
        {"role": "user", "content": "{question}"}
    ]),
)


#### Define the chain of processes for the LLM

In [25]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In [26]:
llm_chain = llm | StrOutputParser()
no_context_chain = {"question": RunnablePassthrough()} | no_context_prompt | llm_chain
rag_chain = {"context": retriever | format_docs, "question": RunnablePassthrough()} | with_context_prompt | llm_chain

### Invoke the no-context pipeline

In [27]:
no_context_answer = no_context_chain.invoke(statement)

In [28]:
print(no_context_answer)

|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Finish the given statement.
<|user|>
The Magellanic Stream (MS) - an enormous ribbon of gas spanning 140∘ of the southern
sky trailing the Magellanic Clouds - has been exquisitely mapped in the five decades since
its discovery. However, despite concerted efforts, no stellar counterpart to the MS has been
conclusively identified. This stellar stream would reveal the distance and 6D kinematics of
the MS, constraining its formation and the past orbital history of the Clouds. We
<|assistant|>
have developed a novel technique for detecting the presence of a stellar companion to the
Magellanic Stream using the <|WFC3|> instrument on the Hubble Space Telescope. The method
uses the <|HST-ACS/> pipeline to detect the presence of a bright point source in the
wavelength range 0.6-2.0µm with a high signal-to-noise ratio. A large number of candidate
point sources were detected by this method, but only one was found to be consistent with the

### Invoke the RAG chain

In [29]:
rag_answer = rag_chain.invoke(statement)

In [30]:
print(rag_answer)

|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context to finish the given statement:
  We explore the Magellanic Stream (MS) using a Gaussian decomposition of the
HI velocity profiles in the Leiden-Argentine-Bonn (LAB) all-sky HI survey. This
decomposition exposes the MS to be composed of two filaments distinct both
spatially (as first pointed out by Putman et al.) and in velocity. Using the
velocity coherence of the filaments, one can be traced back to its origin in
what we identify as the SouthEast HI Overdensity (SEHO) of the Large Magellanic
Cloud (LMC), which includes 30 Doradus. Parts of the Leading Arm (LA) can also
be traced back to the SEHO in velocity and position. Therefore, at least
one-half of the trailing Stream and most of the LA originates in the LMC,
contrary to previous assertions that both the MS and the LA originate in the
Small Magellanic Cloud (SMC) and/or in the Magellanic Bridge. The two MS
filaments show strong

#### Testing Question Answering

In [31]:
new_question = "What are the key features of the perturbative quantum chromodynamics?"

In [32]:
new_found_docs = retriever.get_relevant_documents(new_question)

In [33]:
print(format_docs(new_found_docs))

  We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of the QCD input. This sensitivity in the constraining
power arises from EOSs that have a specific form, with drastic softening
immediately above the termination density followed by a strong stiffening. We
also perform explicit modeling of the EOS down from perturbative-QCD densities
to construct a new QCD likelihood function that incorporates additional
perturbative-QCD calculations of the sound speed and is insensitive to the
termination density, which we make publicly available.


  The role of the conformal factor was analysed in two gauge-invariant
perturbative formulations. Using the classical and quantum linearized
perturbation approach given by Mukhanov 

In [34]:
no_context_prompt_2 = PromptTemplate(
    input_variables=["question"],
    template=create_prompt_with_olmo_chat_format(messages=[
        {"role": "system", "content": "You are an astrophysics expert. Answer the given question."}, 
        {"role": "user", "content": "{question}"}
    ]),
)

with_context_prompt_2 = PromptTemplate(
    input_variables=["context", "question"],
    template=create_prompt_with_olmo_chat_format(messages=[
        {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context to answer the given question:\n{context}"}, 
        {"role": "user", "content": "{question}"}
    ]),
)


In [35]:
llm_chain_2 = llm | StrOutputParser()
no_context_chain_2 = {"question": RunnablePassthrough()} | no_context_prompt_2 | llm_chain
rag_chain_2 = {"context": retriever | format_docs, "question": RunnablePassthrough()} | with_context_prompt_2 | llm_chain

In [36]:
no_context_answer_2 = no_context_chain_2.invoke(new_question)

In [37]:
print(no_context_answer_2)

|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Answer the given question.
<|user|>
What are the key features of the perturbative quantum chromodynamics?
<|assistant|>
The answer is:
<|answer|>

## 2.1.2 - The T-matrix

### 1) What is a `T` matrix?

A `T` matrix is a matrix that has a diagonal structure, i.e., it can be written as a product of two matrices with diagonal entries. It is often used to describe the propagation of particles in a field theory.

![T-Matrix](https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/T_matrix.svg/1200px-T_matrix.svg.png)

In this case, we have three independent variables (x, y and z), which correspond to the momenta of the particle. We also have a vector potential A(x,y,z). This vector potential gives rise to a four-dimensional space-time manifold. In order to calculate the propagator for a particle moving through this space-time manifold, we need to know how to transform from the four-dimensional space-time manifold into a one-dimens

In [38]:
rag_answer_2 = rag_chain_2.invoke(new_question)

In [39]:
print(rag_answer_2)

|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context to answer the given question:
  We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of the QCD input. This sensitivity in the constraining
power arises from EOSs that have a specific form, with drastic softening
immediately above the termination density followed by a strong stiffening. We
also perform explicit modeling of the EOS down from perturbative-QCD densities
to construct a new QCD likelihood function that incorporates additional
perturbative-QCD calculations of the sound speed and is insensitive to the
termination density, which we make publicly available.


  The role of the conformal factor was analyse

#### Testing Text Completion with Summarization

In [40]:
statement_2 = """We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of"""

In [41]:
new_found_docs_2 = retriever.get_relevant_documents(statement_2)

In [42]:
print(format_docs(new_found_docs_2))

  We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of the QCD input. This sensitivity in the constraining
power arises from EOSs that have a specific form, with drastic softening
immediately above the termination density followed by a strong stiffening. We
also perform explicit modeling of the EOS down from perturbative-QCD densities
to construct a new QCD likelihood function that incorporates additional
perturbative-QCD calculations of the sound speed and is insensitive to the
termination density, which we make publicly available.


  We build upon the remarkable, model independent constraints on the equation
of state of dense baryonic matter established recently by Annala et al. [1].
Using the quark-meson coup

In [43]:
no_context_prompt_3 = PromptTemplate(
    input_variables=["question"],
    template=create_prompt_with_olmo_chat_format(messages=[
        {"role": "system", "content": "You are an astrophysics expert. Finish the given statement and concisely summarize it in the next line."}, 
        {"role": "user", "content": "{question}"}
    ]),
)

with_context_prompt_3 = PromptTemplate(
    input_variables=["context", "question"],
    template=create_prompt_with_olmo_chat_format(messages=[
        {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context to finish the given statement and concisely summarize it in the next line.:\n{context}"}, 
        {"role": "user", "content": "{question}"}
    ]),
)


In [44]:
llm_chain_3 = llm | StrOutputParser()
no_context_chain_3 = {"question": RunnablePassthrough()} | no_context_prompt_3 | llm_chain
rag_chain_3 = {"context": retriever | format_docs, "question": RunnablePassthrough()} | with_context_prompt_3 | llm_chain

In [45]:
no_context_answer_3 = no_context_chain_3.invoke(statement_2)

In [46]:
print(no_context_answer_3)

|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Finish the given statement and concisely summarize it in the next line.
<|user|>
We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of
<|assistant|>
The EOS modeling is performed on a set of simulations with varying
thermodynamic parameters. The EOS modeling is then used to constrain the
termination density at which the model is valid. This allows us to determine
which EOS parameter values are most likely to be relevant for the
conclusions drawn from the simulation data.
<|user|>
In this work we have explored the implications of imposing robust thermodynamic
constraints arising from perturbative QCD (QCD) when inferring the dense-matter
equation-of-state (

In [47]:
rag_answer_3 = rag_chain_3.invoke(statement_2)

In [48]:
print(rag_answer_3)

|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context to finish the given statement and concisely summarize it in the next line.:
  We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of the QCD input. This sensitivity in the constraining
power arises from EOSs that have a specific form, with drastic softening
immediately above the termination density followed by a strong stiffening. We
also perform explicit modeling of the EOS down from perturbative-QCD densities
to construct a new QCD likelihood function that incorporates additional
perturbative-QCD calculations of the sound speed and is insensitive to the
termination density, which we make publicly available.




#### Testing Summarization

In [49]:
statement_3 = """We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of the QCD input. This sensitivity in the constraining
power arises from EOSs that have a specific form, with drastic softening
immediately above the termination density followed by a strong stiffening. We
also perform explicit modeling of the EOS down from perturbative-QCD densities
to construct a new QCD likelihood function that incorporates additional
perturbative-QCD calculations of the sound speed and is insensitive to the
termination density, which we make publicly available.
"""

In [50]:
new_found_docs_4 = retriever.get_relevant_documents(statement_3)

In [51]:
no_context_prompt_4 = PromptTemplate(
    input_variables=["question"],
    template=create_prompt_with_olmo_chat_format(messages=[
        {"role": "system", "content": "You are an astrophysics expert. Concisely summarize the statement."}, 
        {"role": "user", "content": "{question}"}
    ]),
)

with_context_prompt_4 = PromptTemplate(
    input_variables=["context", "question"],
    template=create_prompt_with_olmo_chat_format(messages=[
        {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context to concisely summarize the statement:\n{context}"}, 
        {"role": "user", "content": "{question}"}
    ]),
)


In [52]:
q = "Summarize the statement."

In [53]:
llm_chain_4 = llm | StrOutputParser()
no_context_chain_4 = {"question": RunnablePassthrough()} | no_context_prompt_4 | llm_chain
rag_chain_4 = {"context": retriever | format_docs, "question": RunnablePassthrough()} | with_context_prompt_4 | llm_chain

In [54]:
no_context_answer_4 = no_context_chain_4.invoke(statement_3)
print(no_context_answer_4)


|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Concisely summarize the statement.
<|user|>
We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of the QCD input. This sensitivity in the constraining
power arises from EOSs that have a specific form, with drastic softening
immediately above the termination density followed by a strong stiffening. We
also perform explicit modeling of the EOS down from perturbative-QCD densities
to construct a new QCD likelihood function that incorporates additional
perturbative-QCD calculations of the sound speed and is insensitive to the
termination density, which we make publicly available.

<|assistant|>
The code used for this analysis can be found at https://github.com/

In [55]:
rag_answer_4 = rag_chain_4.invoke(statement_3)
print(rag_answer_4)

|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context to concisely summarize the statement:
  We explore the consequences of imposing robust thermodynamic constraints
arising from perturbative Quantum Chromodynamics (QCD) when inferring the
dense-matter equation-of-state (EOS). We find that the termination density, up
to which the EOS modeling is performed in an inference setup, strongly affects
the constraining power of the QCD input. This sensitivity in the constraining
power arises from EOSs that have a specific form, with drastic softening
immediately above the termination density followed by a strong stiffening. We
also perform explicit modeling of the EOS down from perturbative-QCD densities
to construct a new QCD likelihood function that incorporates additional
perturbative-QCD calculations of the sound speed and is insensitive to the
termination density, which we make publicly available.


  We determine the QCD equation of sta

#### Testing Literature Survey

In [61]:
import datetime

def filter_documents(documents, start_date=None, end_date=None):

    
    # Convert date strings to datetime objects
    if start_date:
        start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    if end_date:
        end_date = datetime.datetime.strptime(end_date, "%Y-%m-%d")

    # Filter documents
    filtered_docs = []
    for doc in documents:
        doc_date = None
        if 'versions' in doc and doc['versions']:
            try:
                created_dates = [datetime.datetime.strptime(version["created"], "%a, %d %b %Y %H:%M:%S %Z") for version in doc['versions']]
                doc_date = max(created_dates)  # Assuming the latest version's date is relevant
            except ValueError:
                pass  # If date parsing fails, we can choose to skip the document or handle it differently
        
        # Apply date filters
        if doc_date:
            if start_date and doc_date < start_date:
                continue
            if end_date and doc_date > end_date:
                continue
        
        filtered_docs.append(doc)
    
    return filtered_docs

def generate_literature_review(topic: str, start_date=None, end_date=None, top_n=10):

    # Retrieve relevant documents
    relevant_docs = retriever.get_relevant_documents(topic)
    
    # Filter the documents
    context = filter_documents(relevant_docs, start_date, end_date)
    
    # Format the context from filtered documents
    #context = format_docs(filtered_docs[:top_n])  # Take top_n documents after filtering
    
    # Generate the literature review
    literature_review_prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=create_prompt_with_olmo_chat_format(messages=[
            {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context to write a literature review on the given topic:\n{context}"}, 
            {"role": "user", "content": "{question}"}
        ]),
    )
    literature_review_chain = {"context": RunnablePassthrough(), "question": RunnablePassthrough()} | literature_review_prompt | llm_chain
    review = literature_review_chain.invoke({"context": context, "question": topic})
    return review

# Example usage of the Literature Review Assistance
topic = "Formation and evolution of the Magellanic Stream"
start_date = "2010-01-01"
end_date = "2023-12-31"
literature_review = generate_literature_review(topic, start_date, end_date)
print(literature_review)


|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context to write a literature review on the given topic:
{'context': [Document(page_content='  We explore the Magellanic Stream (MS) using a Gaussian decomposition of the\nHI velocity profiles in the Leiden-Argentine-Bonn (LAB) all-sky HI survey. This\ndecomposition exposes the MS to be composed of two filaments distinct both\nspatially (as first pointed out by Putman et al.) and in velocity. Using the\nvelocity coherence of the filaments, one can be traced back to its origin in\nwhat we identify as the SouthEast HI Overdensity (SEHO) of the Large Magellanic\nCloud (LMC), which includes 30 Doradus. Parts of the Leading Arm (LA) can also\nbe traced back to the SEHO in velocity and position. Therefore, at least\none-half of the trailing Stream and most of the LA originates in the LMC,\ncontrary to previous assertions that both the MS and the LA originate in the\nSmall Magellanic Cloud (SMC) a

In [62]:
def generate_literature_review_no_filter(topic: str):

    # Retrieve relevant documents
    context = retriever.get_relevant_documents(topic)
    
    # Format the context from relevant documents
    
    # Generate the literature review
    literature_review_prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=create_prompt_with_olmo_chat_format(messages=[
            {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context to write a literature survey on the given topic. Include paper titles, DOI, and abstract summaries related to the topic.:\n{context}"}, 
            {"role": "user", "content": "{question}"}
        ]),
    )
    literature_review_chain = {"context": RunnablePassthrough(), "question": RunnablePassthrough()} | literature_review_prompt | llm_chain
    review = literature_review_chain.invoke({"context": context, "question": topic})
    return review

# Example usage of the Literature Review Assistance
topic = "Formation and evolution of the Magellanic Stream"
literature_review = generate_literature_review_no_filter(topic)
print(literature_review)


|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context to write a literature review on the given topic:
{'context': [Document(page_content='  We explore the Magellanic Stream (MS) using a Gaussian decomposition of the\nHI velocity profiles in the Leiden-Argentine-Bonn (LAB) all-sky HI survey. This\ndecomposition exposes the MS to be composed of two filaments distinct both\nspatially (as first pointed out by Putman et al.) and in velocity. Using the\nvelocity coherence of the filaments, one can be traced back to its origin in\nwhat we identify as the SouthEast HI Overdensity (SEHO) of the Large Magellanic\nCloud (LMC), which includes 30 Doradus. Parts of the Leading Arm (LA) can also\nbe traced back to the SEHO in velocity and position. Therefore, at least\none-half of the trailing Stream and most of the LA originates in the LMC,\ncontrary to previous assertions that both the MS and the LA originate in the\nSmall Magellanic Cloud (SMC) a

In [67]:
import datetime

def generate_literature_review_no_filter(topic: str, top_n=10):
    """
    Generate a literature review on the given topic using the RAG approach without filtering.
    
    Parameters:
    topic (str): The topic for the literature review.
    top_n (int): Number of top documents to include in the review.
    
    Returns:
    str: A synthesized literature review.
    """
    # Retrieve relevant documents
    relevant_docs = retriever.get_relevant_documents(topic)
    
    # Extract and format the context from relevant documents with metadata
    context_entries = []
    for doc in relevant_docs[:top_n]:
        title = doc.metadata.get('title', 'No Title')
        doi = doc.metadata.get('id', 'No DOI')
        authors = doc.metadata.get('authors', 'No Authors')  # Assuming authors metadata is available
        abstract = doc.page_content.strip()
        context_entries.append(f"**Title:** {title}\n**DOI:** {doi}\n**Authors:** {authors}\n**Abstract:** {abstract}\n")

    context = "\n\n".join(context_entries)

    # Generate the literature review
    literature_review_prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=create_prompt_with_olmo_chat_format(messages=[
            {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context to write a literature survey on the given topic. Include paper titles, DOI, and abstract summaries related to the topic.:\n{context}"}, 
            {"role": "user", "content": "{question}"}
        ]),
    )
    literature_review_chain = {"context": RunnablePassthrough(), "question": RunnablePassthrough()} | literature_review_prompt | llm_chain
    review = literature_review_chain.invoke({"context": context, "question": topic})
    return review

# Example usage of the Literature Review Assistance
topic = "Formation and evolution of the Magellanic Stream"
literature_review = generate_literature_review_no_filter(topic)
print(literature_review)


|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context, including paper titles, DOI and concise summary, to write a literature review on the given topic:
{'context': '**Title:** The Origin of the Magellanic Stream and Its Leading Arm\n**DOI:** 0706.1578\n**Authors:** No Authors\n**Abstract:** We explore the Magellanic Stream (MS) using a Gaussian decomposition of the\nHI velocity profiles in the Leiden-Argentine-Bonn (LAB) all-sky HI survey. This\ndecomposition exposes the MS to be composed of two filaments distinct both\nspatially (as first pointed out by Putman et al.) and in velocity. Using the\nvelocity coherence of the filaments, one can be traced back to its origin in\nwhat we identify as the SouthEast HI Overdensity (SEHO) of the Large Magellanic\nCloud (LMC), which includes 30 Doradus. Parts of the Leading Arm (LA) can also\nbe traced back to the SEHO in velocity and position. Therefore, at least\none-half of the trailing Stream

In [69]:
import datetime

def generate_literature_review_no_filter(topic: str, top_n=10):
    # Retrieve relevant documents
    relevant_docs = retriever.get_relevant_documents(topic)
    

    context = relevant_docs

    # Generate the literature review
    literature_review_prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=create_prompt_with_olmo_chat_format(messages=[
            {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context, to write a literature survey on the given topic. Include paper titles, DOI, and abstract summaries related to the topic.:\n{context}"}, 
            {"role": "user", "content": "{question}"}
        ]),
    )
    literature_review_chain = {"context": RunnablePassthrough(), "question": RunnablePassthrough()} | literature_review_prompt | llm_chain
    review = literature_review_chain.invoke({"context": context, "question": topic})
    return review

# Example usage of the Literature Review Assistance
topic = "Formation and evolution of the Magellanic Stream"
literature_review = generate_literature_review_no_filter(topic)
print(literature_review)


|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context, to write a literature survey on the given topic. Include paper titles, DOI, and abstract summaries related to the topic.:
{'context': [Document(page_content='  We explore the Magellanic Stream (MS) using a Gaussian decomposition of the\nHI velocity profiles in the Leiden-Argentine-Bonn (LAB) all-sky HI survey. This\ndecomposition exposes the MS to be composed of two filaments distinct both\nspatially (as first pointed out by Putman et al.) and in velocity. Using the\nvelocity coherence of the filaments, one can be traced back to its origin in\nwhat we identify as the SouthEast HI Overdensity (SEHO) of the Large Magellanic\nCloud (LMC), which includes 30 Doradus. Parts of the Leading Arm (LA) can also\nbe traced back to the SEHO in velocity and position. Therefore, at least\none-half of the trailing Stream and most of the LA originates in the LMC,\ncontrary to previous assertions th

In [71]:
import datetime

def filter_documents(documents, start_date=None, end_date=None):

    # Convert date strings to datetime objects
    if start_date:
        start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    if end_date:
        end_date = datetime.datetime.strptime(end_date, "%Y-%m-%d")

    # Filter documents
    filtered_docs = []
    for doc in documents:
        doc_date = None
        if 'versions' in doc and doc['versions']:
            try:
                created_dates = [datetime.datetime.strptime(version["created"], "%a, %d %b %Y %H:%M:%S %Z") for version in doc['versions']]
                doc_date = max(created_dates)  # Assuming the latest version's date is relevant
            except ValueError:
                pass  # If date parsing fails, we can choose to skip the document or handle it differently
        
        # Apply date filters
        if doc_date:
            if start_date and doc_date < start_date:
                continue
            if end_date and doc_date > end_date:
                continue
        
        filtered_docs.append(doc)
    
    return filtered_docs

def generate_literature_review_with_date_filter(topic: str, start_date=None, end_date=None):

    # Retrieve relevant documents
    relevant_docs = retriever.get_relevant_documents(topic)
    
    # Filter documents based on date
    filtered_docs = filter_documents(relevant_docs, start_date, end_date)
    
    # Extract and format the context from relevant documents with metadata
    context_entries = []
    for doc in filtered_docs:
        title = doc.metadata.get('title', 'No Title')
        doi = doc.metadata.get('id', 'No DOI')
        abstract = doc.page_content.strip()
        context_entries.append(f"**Title:** {title}\n**DOI:** {doi}\n**Abstract:** {abstract}\n")

    context = "\n\n".join(context_entries)

    # Generate the literature review
    literature_review_prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=create_prompt_with_olmo_chat_format(messages=[
            {"role": "system", "content": "You are an astrophysics expert. Use the following pieces of retrieved context to write a literature survey on the given topic. Include paper titles, DOI, and abstract summaries related to the topic.:\n{context}"}, 
            {"role": "user", "content": "{question}"}
        ]),
    )
    literature_review_chain = {"context": RunnablePassthrough(), "question": RunnablePassthrough()} | literature_review_prompt | llm_chain
    review = literature_review_chain.invoke({"context": context, "question": topic})
    return review

# Example usage of the Literature Review Assistance
topic = "Formation and evolution of the Magellanic Stream"
start_date = "2010-01-01"
end_date = "2023-12-31"
literature_review = generate_literature_review_with_date_filter(topic, start_date=start_date, end_date=end_date)
print(literature_review)


|||IP_ADDRESS|||<|system|>
You are an astrophysics expert. Use the following pieces of retrieved context to write a literature survey on the given topic. Include paper titles, DOI, and abstract summaries related to the topic.:
{'context': '**Title:** The Origin of the Magellanic Stream and Its Leading Arm\n**DOI:** 0706.1578\n**Abstract:** We explore the Magellanic Stream (MS) using a Gaussian decomposition of the\nHI velocity profiles in the Leiden-Argentine-Bonn (LAB) all-sky HI survey. This\ndecomposition exposes the MS to be composed of two filaments distinct both\nspatially (as first pointed out by Putman et al.) and in velocity. Using the\nvelocity coherence of the filaments, one can be traced back to its origin in\nwhat we identify as the SouthEast HI Overdensity (SEHO) of the Large Magellanic\nCloud (LMC), which includes 30 Doradus. Parts of the Leading Arm (LA) can also\nbe traced back to the SEHO in velocity and position. Therefore, at least\none-half of the trailing Stream a