In this tutorial, we show you how to build a simple in-memory vector store that can store documents along with metadata. It will also expose a query interface that can support a variety of queries:

- semantic search (with embedding similarity)
- metadata filtering

# Download Data

In [2]:
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

mkdir: data: File exists
--2025-08-26 10:21:23--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)[151.101.195.42]:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /pdf/2307.09288 [following]
--2025-08-26 10:21:24--  https://arxiv.org/pdf/2307.09288
Reusing existing connection to arxiv.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: «data/llama2.pdf»


2025-08-26 10:21:27 (3.61 MB/s) - «data/llama2.pdf» saved [13661300/13661300]



# Imports

In [34]:
from typing import Any, cast

from llama_index.core import VectorStoreIndex
from llama_index.core.bridge.pydantic import Field
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode, BaseNode
from llama_index.core.vector_stores import MetadataFilters, VectorStoreQuery, VectorStoreQueryResult
from llama_index.core.vector_stores.types import BasePydanticVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.readers.file import PyMuPDFReader

import numpy as np


# Setup

We load in some documents, and parse them into `Node` objects - chunks that are ready to be inserted into a vector store.

### Load in Documents

In [2]:
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

### Parse into Nodes

In [3]:
node_parser = SentenceSplitter(chunk_size=256)
nodes = node_parser.get_nodes_from_documents(documents)

### Generate Embeddings for each Node

In [4]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

2025-08-26 11:34:21,036 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en
2025-08-26 11:34:23,996 - INFO - 1 prompt is loaded, with the key: query


In [6]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

# Build a Simple In-Memory Vector Store

Now we'll build our in-memory vector store. We'll store Nodes within a simple Python dictionary. We'll start by implementing embedding search, then add metadata filters.

## 1. Defining the Interface
We'll first define the interface for building a vector store. It contains the following methods:

- `get`
- `add`
- `delete`
- `query`
- `persist` (which we will not implement)

In [8]:
class BaseVectorStore(BasePydanticVectorStore):
    """Simple custom Vector Store.

    Stores documents in a simple in-memory dict.

    """

    stores_text: bool = True

    def client(self) -> Any:
        """Get client."""
        return None

    def get(self, text_id: str) -> BaseNode:
        """Get the stored node (including its embedding) by ID."""
        pass

    def add(
        self,
        nodes: list[BaseNode],
    ) -> list[str]:
        """Add nodes to index."""
        pass

    def delete(self, ref_doc_id: str, **delete_kwargs: Any) -> None:
        """
        Delete nodes using ref_doc_id.

        Args:
            ref_doc_id (str): The doc_id of the document to delete.

        """
        pass

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Get nodes for response."""
        pass

    def persist(self, persist_path, fs=None) -> None:
        """Persist the SimpleVectorStore to a directory.

        NOTE: we are not implementing this for now.

        """
        pass


At a high level, we subclass the `BasePydanticVectorStore` abstraction. There's no inherent reason to do this if you're just building a vector store from scratch. We do it because it makes it easy to plug into our downstream abstractions later.

Let's look at some of the classes defined here.

`BaseNode` is simply the parent class of our core Node modules. Each Node represents a text chunk plus its associated metadata.
We also use some lower-level constructs, for instance our `VectorStoreQuery` and `VectorStoreQueryResult`. These are lightweight dataclass containers that represent queries and results.
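
To make the shape of these containers concrete, here is a minimal sketch using plain dataclasses. `SketchQuery` and `SketchQueryResult` are hypothetical stand-ins, not the LlamaIndex classes. They carry only the fields this tutorial actually touches (`query_embedding`, `similarity_top_k`, and `filters` on the query side; `nodes`, `similarities`, and `ids` on the result side), and the default values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchQuery:
    """Hypothetical stand-in for VectorStoreQuery (only fields used in this tutorial)."""

    query_embedding: Optional[list[float]] = None
    similarity_top_k: int = 1  # assumed default, for illustration only
    filters: Optional[Any] = None


@dataclass
class SketchQueryResult:
    """Hypothetical stand-in for VectorStoreQueryResult."""

    nodes: Optional[list[Any]] = None
    similarities: Optional[list[float]] = None
    ids: Optional[list[str]] = None


# A query carries the embedding to search with and how many hits to return.
query = SketchQuery(query_embedding=[0.1, 0.2], similarity_top_k=2)
# A result carries three parallel lists: nodes, their scores, and their ids.
result = SketchQueryResult(nodes=["node-a"], similarities=[0.9], ids=["id1"])
print(query.similarity_top_k, result.ids)
```

The real classes may expose more fields than these; the point is only that a query is a bundle of search parameters and a result is a set of parallel lists.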

## 2. Defining add, get, and delete
We add some basic capabilities to add, get, and delete from a vector store.

The implementation is very simple: everything is stored in a Python dictionary.

In [9]:
class VectorStore2(BaseVectorStore):
    """VectorStore2 (add/get/delete implemented)."""

    stores_text: bool = True
    node_dict: dict[str, BaseNode] = Field(default_factory=dict)

    def get(self, text_id: str) -> BaseNode:
        """Get the stored node (including its embedding) by ID."""
        return self.node_dict[text_id]

    def add(
        self,
        nodes: list[BaseNode],
    ) -> list[str]:
        """Add nodes to index and return their IDs."""
        for node in nodes:
            self.node_dict[node.node_id] = node
        return [node.node_id for node in nodes]

    def delete(self, node_id: str, **delete_kwargs: Any) -> None:
        """
        Delete nodes using node_id.

        Args:
            node_id: str

        """
        del self.node_dict[node_id]

In [10]:
test_node = TextNode(id_="id1", text="hello world")
test_node2 = TextNode(id_="id2", text="foo bar")
test_nodes = [test_node, test_node2]

In [11]:
vector_store = VectorStore2()

vector_store.add(test_nodes)

In [12]:
node = vector_store.get("id1")
print(str(node))

Node ID: id1
Text: hello world


## 3.a. Defining query (semantic search)
We implement a basic version of top-k semantic search. This simply iterates through all document embeddings and computes the cosine similarity of each with the query embedding. The top-k documents by cosine similarity are returned.

Cosine similarity: $\dfrac{\vec{d} \cdot \vec{q}}{|\vec{d}||\vec{q}|}$ for every document, query embedding pair $\vec{d}$, $\vec{q}$.
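
As a quick sanity check of the formula, the sketch below computes cosine similarity for a few hand-picked 2-D vectors in plain Python. The helper name `cosine_similarity` and the sample vectors are made up for illustration; the notebook's own implementation uses NumPy and works on all documents at once.

```python
import math


def cosine_similarity(d: list[float], q: list[float]) -> float:
    """Dot product of d and q divided by the product of their Euclidean norms."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)


# Vectors pointing the same way score close to 1, regardless of length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions score -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Note that this sketch does not guard against zero-length vectors, which would cause a division by zero.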

NOTE: The top-k value is contained in the VectorStoreQuery container.

NOTE: Similar to the above, we define another subclass just so we don't have to reimplement the above functions (not because this is actually good code practice).

In [14]:
def get_top_k_embeddings(
    query_embedding: list[float],
    doc_embeddings: list[list[float]],
    doc_ids: list[str],
    similarity_top_k: int = 5,
) -> tuple[list[float], list]:
    """Get top nodes by similarity to the query."""
    # dimensions: D
    q_embed_np = np.array(query_embedding)
    # dimensions: N x D
    d_embed_np = np.array(doc_embeddings)
    # dimensions: N
    d_product_arr = np.dot(d_embed_np, q_embed_np)
    # dimensions: N
    norm_arr = np.linalg.norm(q_embed_np) * np.linalg.norm(
        d_embed_np, axis=1, keepdims=False
    )
    # dimensions: N
    cos_sim_arr = d_product_arr / norm_arr

    # now we have the N cosine similarities for each document
    # sort by top k cosine similarity, and return ids
    tups = [(cos_sim_arr[i], doc_ids[i]) for i in range(len(doc_ids))]
    sorted_tups = sorted(tups, key=lambda t: t[0], reverse=True)

    sorted_tups = sorted_tups[:similarity_top_k]

    result_similarities = [s for s, _ in sorted_tups]
    result_ids = [n for _, n in sorted_tups]
    return result_similarities, result_ids
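
The function above sorts all N scores before slicing off the first k. As a side note, when only the top k are needed, the standard library's `heapq.nlargest` can select them without a full sort. The sketch below uses made-up scores and ids; `top_k_by_score`, `scores`, and `doc_ids` are illustrative names, not part of the tutorial's code.

```python
import heapq


def top_k_by_score(
    scores: list[float], doc_ids: list[str], k: int
) -> tuple[list[float], list[str]]:
    """Return the k highest scores and their ids, best first."""
    best = heapq.nlargest(k, zip(scores, doc_ids), key=lambda pair: pair[0])
    return [s for s, _ in best], [i for _, i in best]


scores = [0.12, 0.87, 0.45, 0.91]
doc_ids = ["a", "b", "c", "d"]
top_scores, top_ids = top_k_by_score(scores, doc_ids, k=2)
print(top_scores, top_ids)
```

For the handful of nodes in this tutorial the difference is negligible, so the plain `sorted` call is a perfectly reasonable choice.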

In [16]:
class VectorStore3A(VectorStore2):
    """Implements semantic/dense search."""

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Get nodes for response."""

        query_embedding = cast(list[float], query.query_embedding)
        doc_embeddings = [n.embedding for n in self.node_dict.values()]
        doc_ids = [n.node_id for n in self.node_dict.values()]

        similarities, node_ids = get_top_k_embeddings(
            query_embedding,
            doc_embeddings,
            doc_ids,
            similarity_top_k=query.similarity_top_k,
        )
        result_nodes = [self.node_dict[node_id] for node_id in node_ids]

        return VectorStoreQueryResult(
            nodes=result_nodes, similarities=similarities, ids=node_ids
        )

## 3.b. Supporting Metadata Filtering
The next extension is adding metadata filter support. This means that we first narrow the candidate set down to the documents that pass the metadata filters, and then perform semantic querying on what remains.

For simplicity we use metadata filters for exact matching with an AND condition.

In [18]:
def filter_nodes(nodes: list[BaseNode], filters: MetadataFilters):
    """Keep only the nodes whose metadata matches every filter exactly (AND)."""
    filtered_nodes = []
    for node in nodes:
        matches = True
        for f in filters.filters:
            # a missing key or a different value disqualifies the node
            if f.key not in node.metadata or f.value != node.metadata[f.key]:
                matches = False
                break
        if matches:
            filtered_nodes.append(node)
    return filtered_nodes
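
To see the AND semantics in isolation, the sketch below applies the same logic to plain dictionaries. `matches_all` and the sample records are hypothetical; the metadata keys mirror those that appear in this notebook's node output (`source`, `file_path`).

```python
def matches_all(metadata: dict[str, str], required: dict[str, str]) -> bool:
    """True only if every required key is present with exactly the required value."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in required.items()
    )


records = [
    {"source": "23", "file_path": "./data/llama2.pdf"},
    {"source": "24", "file_path": "./data/llama2.pdf"},
    {"source": "24", "file_path": "./data/other.pdf"},
]

# Both conditions must hold, so only the second record survives.
required = {"source": "24", "file_path": "./data/llama2.pdf"}
kept = [r for r in records if matches_all(r, required)]
print(kept)
```

Because values are compared with `==`, the string `"24"` does not match the integer `24`. That is presumably why the query later in this notebook filters on `"24"` as a string.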

We add `filter_nodes` as a first pass over the nodes before running semantic search.

In [19]:
def dense_search(query: VectorStoreQuery, nodes: list[BaseNode]):
    """Dense search."""
    query_embedding = cast(list[float], query.query_embedding)
    doc_embeddings = [n.embedding for n in nodes]
    doc_ids = [n.node_id for n in nodes]
    return get_top_k_embeddings(
        query_embedding,
        doc_embeddings,
        doc_ids,
        similarity_top_k=query.similarity_top_k,
    )

In [20]:
class VectorStore3B(VectorStore2):
    """Implements Metadata Filtering."""

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Get nodes for response."""
        # 1. First filter by metadata
        nodes = list(self.node_dict.values())
        if query.filters is not None:
            nodes = filter_nodes(nodes, query.filters)
        if len(nodes) == 0:
            result_nodes = []
            similarities = []
            node_ids = []
        else:
            # 2. Then perform semantic search
            similarities, node_ids = dense_search(query, nodes)
            result_nodes = [self.node_dict[node_id] for node_id in node_ids]
        return VectorStoreQueryResult(
            nodes=result_nodes, similarities=similarities, ids=node_ids
        )

## 4. Load Data into our Vector Store
Let's load our text chunks into the vector store and run different types of queries against it: dense search alone, and dense search with metadata filters.

In [21]:
vector_store = VectorStore3B()
# load data into the vector stores
vector_store.add(nodes)

Define an example question and embed it.

In [22]:
query_str = "Can you tell me about the key concepts for safety finetuning"
query_embedding = embed_model.get_query_embedding(query_str)
print(query_embedding)

[-0.057459719479084015, 0.0006168371182866395, 0.0073997811414301395, -0.026107775047421455, 0.010598297230899334, 0.0079874899238348, 0.06677503138780594, 0.039609961211681366, 0.002579552587121725, -0.0042901113629341125, -0.05172336474061012, -0.03663443773984909, 0.016417549923062325, 0.020253019407391548, 0.005108681973069906, 0.0037742929998785257, 0.03296405449509621, 0.0320092998445034, -0.008439943194389343, 0.01931261643767357, -0.004506618250161409, 0.031248915940523148, -0.024012690410017967, -0.020971687510609627, 0.004960776772350073, -0.0026956091169267893, 0.017490452155470848, 0.0022530765272676945, -0.01721314713358879, -0.19273661077022552, -0.04692849516868591, -0.03517277166247368, -0.02127939835190773, -0.007309731561690569, -0.020944062620401382, -0.008686739951372147, -0.028816986829042435, 0.03126359358429909, -0.004988355562090874, -0.003774945391342044, 0.04736080393195152, 0.017691636458039284, -0.008748933672904968, -0.009141991846263409, -0.014435577206313

### Query the vector store with dense search

In [23]:
query_obj = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2
)

query_result = vector_store.query(query_obj)
for similarity, node in zip(query_result.similarities, query_result.nodes):
    print(
        "\n----------------\n"
        f"[Node ID {node.node_id}] Similarity: {similarity}\n\n"
        f"{node.get_content(metadata_mode='all')}"
        "\n----------------\n\n"
    )


----------------
[Node ID 5c848fc0-d442-4e5a-8f67-f5105b4009c3] Similarity: 0.871129510942508

total_pages: 77
file_path: ./data/llama2.pdf
source: 23

Benchmarks give a summary view of model capabilities and behaviors that allow us to understand general
patterns in the model, but they do not provide a fully comprehensive view of the impact the model may have
on people or real-world outcomes; that would require study of end-to-end product deployments. Further
testing and mitigation should be done to understand bias and other social issues for the specific context
in which a system may be deployed. For this, it may be necessary to test beyond the groups available in
the BOLD dataset (race, religion, and gender). As LLMs are integrated and deployed, we look forward to
continuing research that will amplify their potential for positive impact on these important social issues.
4.2
Safety Fine-Tuning
In this section, we describe our approach to safety fine-tuning, including safety categorie

### Query the vector store with dense search + Metadata Filters

In [24]:
# filters = MetadataFilters(
#     filters=[
#         ExactMatchFilter(key="page", value=3)
#     ]
# )
filters = MetadataFilters.from_dict({"source": "24"})

query_obj = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, filters=filters
)

query_result = vector_store.query(query_obj)
for similarity, node in zip(query_result.similarities, query_result.nodes):
    print(
        "\n----------------\n"
        f"[Node ID {node.node_id}] Similarity: {similarity}\n\n"
        f"{node.get_content(metadata_mode='all')}"
        "\n----------------\n\n"
    )


----------------
[Node ID 42365dea-7b3c-4c25-91a2-be30c4d8dc00] Similarity: 0.8589715603640367

total_pages: 77
file_path: ./data/llama2.pdf
source: 24

We then define best practices for safe and helpful model responses: the model should first address immediate
safety concerns if applicable, then address the prompt by explaining the potential risks to the user, and finally
provide additional information if possible. We also ask the annotators to avoid negative user experience
categories (see Appendix A.5.2). The guidelines are meant to be a general guide for the model and are
iteratively refined and revised to include newly identified risks.
4.2.2
Safety Supervised Fine-Tuning
In accordance with the established guidelines from Section 4.2.1, we gather prompts and demonstrations
of safe model responses from trained annotators, and use the data for supervised fine-tuning in the same
manner as described in Section 3.1. An example can be found in Table 5.
The annotators are instructed to 


## Build a RAG System with the Vector Store
Now that we've built the vector store, it's time to plug it into a downstream RAG system!

In [26]:
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)

In [28]:
llm = HuggingFaceLLM(
    model_name="google/gemma-3-270m",
    tokenizer_name="google/gemma-3-270m",
    device_map="auto"
)

In [29]:
response = llm.complete("What is the meaning of life?")
print(str(response))


Life is the eternal quest for meaning and purpose.
It may seem a small world for most people. For that person to be happy, or to have something else important to them is meaningless. But, if they really care about themselves and their future, then they have something important for them. So why do they keep waiting for something more valuable? That was the question for my grandfather, who always felt like he was waiting. Why did he sit all these long years? His answers are different depending on who you ask. Some people have a spiritual meaning to life, and are happy because they are spiritual. Others are happy because they live a full, meaningful life. But, for others, it is the end of the world. What has caused their problem? Why are they sitting here all day?
At age 67, I am still sitting here at my parents’ house, waiting for the answer. While my mother and I have gone through the same experiences as my grandfather, I do not know what has caused my mom to get to the point she has. 

In [31]:
query_engine = index.as_query_engine(llm=llm)

In [32]:
query_str = "Can you tell me about the key concepts for safety finetuning"

In [33]:
response = query_engine.query(query_str)
print(str(response))

1) The general framework for fine-tuning.
2) The importance of labeling and fine-tuning.
3) The safety category of the model for fine-tuning.
4) The methods for fine-tuning the model in the particular environment.
5) The method for fine-tuning the model in a specific environment.
5.1. Supervised Safety Fine-Tuning
Our approach to the fine-tuning process is similar to the previous section: we employ a series of tasks to collect
as many adversarial prompts as possible while ensuring that the prompts are both clean and unconstrained. We
implement one prompt in each model: the default prompt for a general fine-tuning task. We then collect
these prompts and fine-tune the model using these prompts.
---------------------
Our process for fine-tuning an LLM and a large unlabeled dataset follows the same pattern as the previous
section: we collect adversarial prompt questions from our LLM and fine-tune it using the same prompts.
This process is similar to the previous section, except that we can

## Conclusion
That's it! We've built a simple in-memory vector store that supports basic inserts, gets, and deletes, as well as dense search and metadata filtering. It can then be plugged into the rest of the LlamaIndex abstractions.

It doesn't support sparse search yet and is not meant to be used in a production application. But it should expose some of what's going on under the hood!