# LlamaIndex

In this notebook we embed PDF documents into three types of vector storage (LlamaIndex's default, ChromaDB, and FAISS).

See available Vector Stores: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/

Let's download astronomy papers in PDF format:

In [1]:
import os
import urllib

PDF_DIR = "pdf_papers/"
os.makedirs(PDF_DIR, exist_ok=True)

for arXiv_id in ["2406.03308","1609.04153"]:
    urllib.request.urlretrieve("https://arxiv.org/pdf/"+arXiv_id+".pdf", PDF_DIR+arXiv_id+".pdf")
    print("Downloaded paper",arXiv_id)

Downloaded paper 2406.03308
Downloaded paper 1609.04153


### Parse PDF documents as text

In [2]:
from llama_index.core import SimpleDirectoryReader

In [3]:
%%time
reader = SimpleDirectoryReader(input_dir=PDF_DIR)
docs = reader.load_data()

print(f"We have read {len(docs)} pages.")

We have read 60 pages.
CPU times: user 906 ms, sys: 25.7 ms, total: 932 ms
Wall time: 993 ms


### Embed them

By default, the embedding used is ..., but we can also use one from Ollama. This requires to install the `llama-index-embeddings-ollama` package separately. Then:

In [4]:
from llama_index.embeddings.ollama import OllamaEmbedding

In [5]:
ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",
)

In [6]:
from llama_index.core import VectorStoreIndex

In [7]:
index = VectorStoreIndex.from_documents(docs, 
                                        show_progress=True,
                                        embed_model=ollama_embedding)

Parsing nodes:   0%|          | 0/60 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/186 [00:00<?, ?it/s]

This can be made persistant with:

In [8]:
index.set_index_id("astronomy")
index.storage_context.persist("./db/astronomyPDF")

The directories are created automatically, the vectors are stored as json, and can be reloaded later.

Here we embed entire pages. A better strategy would employ additional chunking. One way to do that is to set before calling `from_documents`:

    from llama_index.core import Settings
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50
    
    index = VectorStoreIndex.from_documents(
        documents,
    )

but this is beyond the scope of this notebook.

### Embed into ChromaDB

Llamaindex can also embed direcly into ChromaDB, with the package `llama-index-vector-stores-chroma`.

In [9]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex

db = chromadb.PersistentClient(path="./db/astronomyPDFchroma")
chroma_collection = db.get_or_create_collection("astronomy")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [10]:
chroma_index = VectorStoreIndex.from_documents(docs, 
                                               storage_context=storage_context, 
                                               show_progress=True,
                                               embed_model=ollama_embedding)

Parsing nodes:   0%|          | 0/60 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/186 [00:00<?, ?it/s]

### Embed into FAISS

Uses packages `llama-index-vector-stores-faiss` and `faiss` (which for some reason I had to install as `faiss-cpu`, the GPU versions seem to only work for specific Python versions).

We need to compute this thing first:

In [30]:
import faiss

# dimensions of nomic-embed-text:latest
d = 768
faiss_index_L2 = faiss.IndexFlatL2(d)

In [31]:
from llama_index.vector_stores.faiss import FaissVectorStore

In [32]:
faiss_vector_store = FaissVectorStore(faiss_index=faiss_index_L2)
storage_context = StorageContext.from_defaults(vector_store=faiss_vector_store)

faiss_index = VectorStoreIndex.from_documents(docs, 
                                              storage_context=storage_context, 
                                              show_progress=True,
                                              embed_model=ollama_embedding)

Parsing nodes:   0%|          | 0/60 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/186 [00:00<?, ?it/s]

Which we could also make persistant with:

    faiss_index.storage_context.persist(persist_dir="./db/astronomyFAISS")

## Retrieve

In [14]:
from llama_index.core.retrievers import VectorIndexRetriever

In [15]:
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=1,
)
retriever.retrieve('infrared')

[NodeWithScore(node=TextNode(id_='c9c224e9-cda0-468c-84c4-e02a0b43bc23', embedding=None, metadata={'page_label': '28', 'file_name': '1609.04153.pdf', 'file_path': '/Users/tristan/playground/gen-ai-ollama/pdf_papers/1609.04153.pdf', 'file_type': 'application/pdf', 'file_size': 4866808, 'creation_date': '2025-05-21', 'last_modified_date': '2025-05-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='ce295a3d-5552-4bb8-9567-209c7ccc5a87', node_type='4', metadata={'page_label': '28', 'file_name': '1609.04153.pdf', 'file_path': '/Users/tristan/playground/gen-ai-ollama/pdf_papers/1609.04153.pdf', 'file_type': 'application/pdf', 'file_size': 4866808, 'creation_date': '2025-05-21', 'last_modified_date': '2025-05-2

In [16]:
chroma_retriever = VectorIndexRetriever(
    index=chroma_index,
    similarity_top_k=1,
)
chroma_retriever.retrieve('infrared')

[NodeWithScore(node=TextNode(id_='b93b705e-3ae3-484b-89ec-78baef99711d', embedding=None, metadata={'page_label': '10', 'file_name': '1609.04153.pdf', 'file_path': '/Users/tristan/playground/gen-ai-ollama/pdf_papers/1609.04153.pdf', 'file_type': 'application/pdf', 'file_size': 4866808, 'creation_date': '2025-05-21', 'last_modified_date': '2025-05-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f49e5420-d672-4532-973d-3cc52fe36b83', node_type='4', metadata={'page_label': '10', 'file_name': '1609.04153.pdf', 'file_path': '/Users/tristan/playground/gen-ai-ollama/pdf_papers/1609.04153.pdf', 'file_type': 'application/pdf', 'file_size': 4866808, 'creation_date': '2025-05-21', 'last_modified_date': '2025-05-2

In [17]:
faiss_retriever = VectorIndexRetriever(
    index=faiss_index,
    similarity_top_k=1,
)
chroma_retriever.retrieve('infrared')

[NodeWithScore(node=TextNode(id_='b93b705e-3ae3-484b-89ec-78baef99711d', embedding=None, metadata={'page_label': '10', 'file_name': '1609.04153.pdf', 'file_path': '/Users/tristan/playground/gen-ai-ollama/pdf_papers/1609.04153.pdf', 'file_type': 'application/pdf', 'file_size': 4866808, 'creation_date': '2025-05-21', 'last_modified_date': '2025-05-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f49e5420-d672-4532-973d-3cc52fe36b83', node_type='4', metadata={'page_label': '10', 'file_name': '1609.04153.pdf', 'file_path': '/Users/tristan/playground/gen-ai-ollama/pdf_papers/1609.04153.pdf', 'file_type': 'application/pdf', 'file_size': 4866808, 'creation_date': '2025-05-21', 'last_modified_date': '2025-05-2

## Query the index with a LLM

Requires `llama-index-llms-ollama`.

In [18]:
from llama_index.llms.ollama import Ollama

In [27]:
llm_deepseek = Ollama(model="deepseek-r1:7b", request_timeout=120.0, temperature=0.)

In [20]:
query_engine = index.as_query_engine(llm=llm_deepseek)

In [21]:
%%time
response = query_engine.query("What is an open cluster?")

CPU times: user 30.6 ms, sys: 2.56 ms, total: 33.2 ms
Wall time: 41.1 s


In [22]:
print(response.response)

<think>
Okay, I need to figure out what an open cluster is based on the provided context. Let me read through the context carefully.

Looking at page_label 13, there are several documents mentioned like arXiv papers and A&A journal articles. Some of them discuss "open clusters," so that's my starting point. The first document is about a morphological, kinematical, and chemical analysis of an open cluster in the Milky Way using Gaia DR2 data. Another paper talks about differential abundances and tidal tails of open clusters.

I notice terms like "open clusters" throughout these documents. From what I know, astronomers use specific terminology to describe different groups of stars. Open clusters are probably a type of stellar group. The context mentions things like chemical tagging, kinematical studies, and morphological analyses, which are methods used to study open clusters.

The papers also discuss tidal tails associated with open clusters, suggesting that these are groups of stars bo

The query also returns `response.source_nodes` and `response.metadata`.

In [23]:
%%time
chroma_query_engine = chroma_index.as_query_engine(llm=llm_deepseek)
response = query_engine.query("Can white dwarfs be found in open clusters?")
print(response.response)

<think>
Okay, so I'm trying to figure out whether white dwarfs can be found in open clusters based on the provided context. Let me read through the given information carefully.

Looking at the context, there's a section labeled "page_label: 6" and it's discussing chemical abundances of stellar clusters. The document talks about how studies have searched for white dwarfs associated with clusters like Berkeley 17. It mentions that very few white dwarfs with progenitors more massive than 5 solar masses are known to reside in clusters. This suggests that most white dwarfs found in open clusters might be less massive.

Additionally, the context states that because cluster ages can often be accurately determined using isochrones, it's possible to constrain the mass of white dwarf progenitors much better than for field stars. However, even with this, only a few high-mass white dwarfs are associated with clusters.

So putting this together, while in theory white dwarfs could exist within open 

In [28]:
%%time
faiss_query_engine = faiss_index.as_query_engine(llm=llm_deepseek)
response = faiss_query_engine.query("Is there a Gaia DR5?")
print(response.response)

<think>
Okay, so I need to figure out if there's a Gaia DR5 based on the provided context. Let me start by looking through the given pages.

First, page 29 talks about the Gaia mission and its data releases up to five years. It mentions that after five years, they achieve certain precision levels for stellar parallaxes and radial velocities. Then it goes into details about photometric and spectroscopic data processing, validation, and the architecture of the DPAC.

Page 24 discusses the data processing cycles. Each cycle processes all the data segments from previous cycles plus a new one. The DR1 is mentioned as being based on the first intermediate release after five years. It also talks about simulations for data products like GIBIS, GASS, and GOG, which are used internally and to support the astronomical community.

Looking at page 24 again, it says that the Gaia Data Processing and Analysis Consortium (DPAC) prepares simulations from pixels in the focal plane up through simulated D

### See prompts

https://docs.llamaindex.ai/en/stable/examples/prompts/prompt_mixin/

In [39]:
print(faiss_query_engine.get_prompts()['response_synthesizer:text_qa_template'].get_template())

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 


In [40]:
print(faiss_query_engine.get_prompts()['response_synthesizer:refine_template'].get_template())

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 
