- Here, we will build a simple in-memory vector store that can store documents and metadata.
  - It will expose a query interface that can support a variety of queries:
    - semantic search
    - metadata filtering

- Setup
  - Load some documents and parse them into Nodes

In [1]:
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

mkdir: cannot create directory ‘data’: File exists
--2023-09-20 12:55:55--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’


2023-09-20 12:55:59 (4.05 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]



In [2]:
from pathlib import Path
from llama_hub.file.pymu_pdf.base import PyMuPDFReader

In [3]:
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

In [4]:
from llama_index.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults(chunk_size=256)
nodes = node_parser.get_nodes_from_documents(documents)

In [49]:
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()

# generate embeddings for each node
for node in nodes:
    node.embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )

- To build our in-memory vector store, we'll use a Python dictionary.
- First we will implement embedding search, and then add metadata filtering.

In [9]:
# Define the interface: get, add, delete, query, persist
from llama_index.vector_stores.types import (
    VectorStore,
    VectorStoreQuery,
    VectorStoreQueryResult,
)
from typing import List, Any, Optional, Dict
from llama_index.schema import TextNode, BaseNode
import os


class BaseVectorStore(VectorStore):
    """Simple custom vector store
    Stores documents in a simple in-memory dict
    """

    stores_text: bool = True

    def get(self, text_id: str) -> List[float]:
        """Get embedding"""
        pass

    def add(self, nodes: List[BaseNode]) -> List[str]:
        """Add nodes to index"""
        pass

    def delete(self, ref_doc_id: str, **delete_kwargs: Any) -> None:
        """Delete nodes with the given ref_doc_id
        
        Parameters
        ----------
        ref_doc_id : str
            The doc_id of the document to delete
        """
        pass

    def query(self, query: VectorStoreQuery, **kwargs: Any) -> VectorStoreQueryResult:
        """Get nodes for response"""
        pass

    def persist(self, persist_path: str, fs=None) -> None:
        """Persist the SimpleVectorStore to a directory
        
        NOTE: Not implementing now
        """
        pass        

Let’s look at some of the classes defined here.

BaseNode is simply the parent class of our core Node modules. Each Node represents a text chunk + associated metadata.

We also use some lower-level constructs, for instance our VectorStoreQuery and VectorStoreQueryResult. These are just lightweight dataclass containers to represent queries and results. We look at the dataclass fields below.
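
To make the idea of a "lightweight dataclass container" concrete, here is a minimal sketch using only the standard library. `ToyQuery`, `ToyQueryResult`, and `field_names` are hypothetical names invented for illustration; they are not the library's classes and carry only a small subset of plausible fields.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ToyQuery:
    """Hypothetical stand-in for a query container (not the library class)."""

    query_embedding: Optional[List[float]] = None
    similarity_top_k: int = 1


@dataclass
class ToyQueryResult:
    """Hypothetical stand-in for a result container (not the library class)."""

    ids: Optional[List[str]] = None
    similarities: Optional[List[float]] = None


def field_names(cls) -> List[str]:
    """Return the declared field names of a dataclass, in definition order."""
    return [f.name for f in fields(cls)]


toy_query = ToyQuery(query_embedding=[0.1, 0.2], similarity_top_k=2)
print(field_names(ToyQuery))
print(toy_query.similarity_top_k)
```

Because the containers hold no logic, a store only has to read fields off the query and fill in fields on the result.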

In [11]:
from dataclasses import fields

{f.name: f.type for f in fields(VectorStoreQuery)}

{'query_embedding': typing.Optional[typing.List[float]],
 'similarity_top_k': int,
 'doc_ids': typing.Optional[typing.List[str]],
 'node_ids': typing.Optional[typing.List[str]],
 'query_str': typing.Optional[str],
 'output_fields': typing.Optional[typing.List[str]],
 'embedding_field': typing.Optional[str],
 'mode': <enum 'VectorStoreQueryMode'>,
 'alpha': typing.Optional[float],
 'filters': typing.Optional[llama_index.vector_stores.types.MetadataFilters],
 'mmr_threshold': typing.Optional[float],
 'sparse_top_k': typing.Optional[int]}

In [12]:
{f.name: f.type for f in fields(VectorStoreQueryResult)}

{'nodes': typing.Optional[typing.Sequence[llama_index.schema.BaseNode]],
 'similarities': typing.Optional[typing.List[float]],
 'ids': typing.Optional[typing.List[str]]}

In [14]:
# 2. Define add, get, delete
class VectorStore2(BaseVectorStore):
    """VectorStore with add/get/delete methods"""

    stores_text: bool = True

    def __init__(self) -> None:
        self._node_dict: Dict[str, BaseNode] = {}

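    def get(self, text_id: str) -> BaseNode:
        """Return the stored node for this ID (the node carries its embedding)."""
        return self._node_dict[text_id]

    def add(self, nodes: List[BaseNode]) -> List[str]:
        """Store each node under its node_id and return the stored IDs."""
        for node in nodes:
            self._node_dict[node.node_id] = node
        return [node.node_id for node in nodes]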
    
    def delete(self, node_id: str, **delete_kwargs: Any) -> None:
        del self._node_dict[node_id]

In [15]:
# basic tests
test_node = TextNode(id_="id1", text="hello world")
test_node2 = TextNode(id_="id2", text="foo bar")
test_nodes = [test_node, test_node2]

vector_store = VectorStore2()
vector_store.add(test_nodes)
node = vector_store.get("id1")
print(str(node))

Node ID: id1
Text: hello world


3.a. Defining `query` - semantic search

- implement a basic version of top-k similarity search
  - iterates through doc embeddings, computing cosine-similarity with the query embedding
  - top k documents by cosine similarity are returned

Cosine similarity: $\dfrac{\vec{d} \cdot \vec{q}}{\|\vec{d}\| \, \|\vec{q}\|}$ for every pair of document and query embeddings $\vec{d}$, $\vec{q}$

NOTE: The top-k value is contained in the VectorStoreQuery container.

NOTE: Similar to the above, we define another subclass just so we don’t have to reimplement the above functions (not because this is actually good code practice).
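
As a small worked example of the formula and the top-k step described above, the sketch below uses only the standard library on toy 2-dimensional vectors. The names `cosine_similarity`, `top_k_by_cosine`, and `toy_docs` are hypothetical, and the code assumes every vector is non-zero (a zero vector would cause a division by zero).

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(d: Sequence[float], q: Sequence[float]) -> float:
    """Dot product of d and q divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm  # assumes neither vector is all zeros


def top_k_by_cosine(
    query: Sequence[float], docs: Dict[str, Sequence[float]], k: int
) -> List[Tuple[float, str]]:
    """Score every document against the query and keep the k best."""
    scored = ((cosine_similarity(vec, query), doc_id) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = top_k_by_cosine([1.0, 0.0], toy_docs, k=2)
print([doc_id for _, doc_id in top])  # ['a', 'c']
```

Document `a` points in the same direction as the query (similarity 1), `c` sits at 45 degrees (similarity 1/√2), and `b` is orthogonal (similarity 0), so `a` and `c` are returned.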

In [35]:
from typing import Tuple
import numpy as np

def get_top_k_embeddings(
    query_embedding: List[float],
    doc_embeddings: List[List[float]],
    doc_ids: List[str],
    similarity_top_k: int = 5,
) -> Tuple[List[float], List]:
    """Get top nodes by similarity to the query"""

    # dimensions: D
    qembed_np = np.array(query_embedding)

    # dimensions: N x D
    dembed_np = np.array(doc_embeddings)

    # dimensions: N
    dproduct_arr = np.dot(dembed_np, qembed_np)

    # dimensions: N
    norm_arr = np.linalg.norm(qembed_np) * np.linalg.norm(dembed_np, axis=1, keepdims=False)

    # dimensions: N
    cos_sim_arr = dproduct_arr / norm_arr 

    # Now we have the N cosine similarities for each doc
    # sort by top k and return
    tups = [(cos_sim_arr[i], doc_ids[i]) for i in range(len(doc_ids))]
    sorted_tups = sorted(tups, key=lambda t: t[0], reverse=True)

    sorted_tups = sorted_tups[:similarity_top_k]

    result_similarities = [s for s, _ in sorted_tups]
    result_ids = [n for _, n in sorted_tups]
    return result_similarities, result_ids

In [36]:
from typing import cast

class VectorStore3A(VectorStore2):
    """Implement semantic/dense search."""

    def query(self, query: VectorStoreQuery, **kwargs: Any):

        query_embedding = cast(List[float], query.query_embedding)
        doc_embeddings = [n.embedding for n in self._node_dict.values()]
        doc_ids = [n.node_id for n in self._node_dict.values()]

        similarities, node_ids = get_top_k_embeddings(
            query_embedding=query_embedding,
            doc_embeddings=doc_embeddings,
            doc_ids=doc_ids,
            similarity_top_k=query.similarity_top_k,
        )

        result_nodes = [self._node_dict[node_id] for node_id in node_ids]

        return VectorStoreQueryResult(
            nodes=result_nodes, similarities=similarities, ids=node_ids
        )

3.b. Supporting Metadata Filtering
The next extension is adding metadata filter support. This means that we first narrow the candidate set to the documents that pass the metadata filters, and then perform semantic search over what remains.

For simplicity, the filters support only exact matching, combined with an AND condition.
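
The exact-match AND rule can be illustrated on plain dictionaries before wiring it into the store. This is only a sketch: `matches_all`, `filter_records`, and `toy_records` are hypothetical names, and the records are plain dicts rather than Node objects.

```python
from typing import Any, Dict, List


def matches_all(metadata: Dict[str, Any], required: Dict[str, Any]) -> bool:
    """True only if every required key is present with an equal value (AND)."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in required.items()
    )


def filter_records(
    records: List[Dict[str, Any]], required: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep the records whose metadata satisfies every required key/value pair."""
    return [r for r in records if matches_all(r["metadata"], required)]


toy_records = [
    {"id": "n1", "metadata": {"source": "23", "lang": "en"}},
    {"id": "n2", "metadata": {"source": "24", "lang": "en"}},
    {"id": "n3", "metadata": {"lang": "en"}},  # no "source" key at all
]
kept = filter_records(toy_records, {"source": "24", "lang": "en"})
print([r["id"] for r in kept])  # ['n2']
```

A record with a missing key is rejected just like one with a mismatched value, and an empty set of requirements keeps every record.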

In [56]:
from llama_index.vector_stores import MetadataFilters
from llama_index.schema import BaseNode
from typing import cast


def filter_nodes(nodes: List[BaseNode], filters: MetadataFilters) -> List[BaseNode]:
    """Keep only the nodes whose metadata matches every filter exactly (AND)."""
    filtered_nodes = []

    for node in nodes:
        matches = True
        for f in filters.filters:
            # a missing key or a different value rules the node out;
            # stop checking the remaining filters
            if f.key not in node.metadata or f.value != node.metadata[f.key]:
                matches = False
                break
        if matches:
            filtered_nodes.append(node)

    return filtered_nodes


def dense_search(query: VectorStoreQuery, nodes: List[BaseNode]):
    """Dense search"""
    query_embedding = cast(List[float], query.query_embedding)
    doc_embeddings = [n.embedding for n in nodes]
    doc_ids = [n.node_id for n in nodes]

    return get_top_k_embeddings(
        query_embedding=query_embedding,
        doc_embeddings=doc_embeddings,
        doc_ids=doc_ids,
        similarity_top_k=query.similarity_top_k,
    )


class VectorStore3B(VectorStore2):
    """Implement Metadata filtering"""

    def query(self, query: VectorStoreQuery, **kwargs: Any):

        # materialize as a list to match the List[BaseNode] helpers below
        nodes = list(self._node_dict.values())

        # Filter by metadata
        if query.filters is not None:
            nodes = filter_nodes(nodes, query.filters)

        if len(nodes) == 0:
            result_nodes, similarities, node_ids = [], [], []
        else:
            # semantic search
            similarities, node_ids = dense_search(query, nodes)
            result_nodes = [self._node_dict[node_id] for node_id in node_ids]

        return VectorStoreQueryResult(
            nodes=result_nodes, similarities=similarities, ids=node_ids
        )

- load data into the vector store

In [57]:
vector_store = VectorStore3B()
vector_store.add(nodes)

In [58]:
query_str = "Can you tell me about the key concepts for safety finetuning"
query_embedding = embed_model.get_query_embedding(query_str)

In [59]:
query_obj = VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=2)

query_result = vector_store.query(query_obj)
for similarity, node in zip(query_result.similarities, query_result.nodes):
    print(
        "\n----------------\n"
        f"[Node ID {node.node_id}] Similarity: {similarity}\n\n"
        f"{node.get_content(metadata_mode='all')}"
        "\n----------------\n\n"
    )


----------------
[Node ID da13256d-69d7-4526-8fd0-83fc4219eaec] Similarity: 0.8353106487126585

total_pages: 77
file_path: ./data/llama2.pdf
source: 23

Specifically, we use the following techniques in safety fine-tuning:
1. Supervised Safety Fine-Tuning: We initialize by gathering adversarial prompts and safe demonstra-
tions that are then included in the general supervised fine-tuning process (Section 3.1). This teaches
the model to align with our safety guidelines even before RLHF, and thus lays the foundation for
high-quality human preference data annotation.
2. Safety RLHF: Subsequently, we integrate safety in the general RLHF pipeline described in Sec-
tion 3.2.2. This includes training a safety-specific reward model and gathering more challenging
adversarial prompts for rejection sampling style fine-tuning and PPO optimization.
3. Safety Context Distillation: Finally, we refine our RLHF pipeline with context distillation (Askell
et al., 2021b).
----------------



-------------

In [60]:
# Query with metadata filters
filters = MetadataFilters.from_dict({"source": "24"})

query_obj = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, filters=filters
)

query_result = vector_store.query(query_obj)

for similarity, node in zip(query_result.similarities, query_result.nodes):
    print(
        "\n----------------\n"
        f"[Node ID {node.node_id}] Similarity: {similarity}\n\n"
        f"{node.get_content(metadata_mode='all')}"
        "\n----------------\n\n"
    )


----------------
[Node ID a3fdf251-0340-461f-abbd-bca5a8f47a33] Similarity: 0.8185257137702916

total_pages: 77
file_path: ./data/llama2.pdf
source: 24

4.2.2
Safety Supervised Fine-Tuning
In accordance with the established guidelines from Section 4.2.1, we gather prompts and demonstrations
of safe model responses from trained annotators, and use the data for supervised fine-tuning in the same
manner as described in Section 3.1. An example can be found in Table 5.
The annotators are instructed to initially come up with prompts that they think could potentially induce
the model to exhibit unsafe behavior, i.e., perform red teaming, as defined by the guidelines. Subsequently,
annotators are tasked with crafting a safe and helpful response that the model should produce.
4.2.3
Safety RLHF
We observe early in the development of Llama 2-Chat that it is able to generalize from the safe demonstrations
in supervised fine-tuning. The model quickly learns to write detailed safe responses, addres

- Build a RAG system with the Vector Store

In [63]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)

query_engine = index.as_query_engine()

query_str = "Can you tell me about the key concepts for safety finetuning"

response = query_engine.query(query_str)
print(str(response))

The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to align the model with safety guidelines before RLHF. Safety RLHF integrates safety into the RLHF pipeline by training a safety-specific reward model and gathering more challenging adversarial prompts for fine-tuning and optimization. Finally, safety context distillation is used to refine the RLHF pipeline. These techniques aim to mitigate safety risks and ensure that the model aligns with safety guidelines.
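
Conceptually, a retrieval-augmented query follows a retrieve-then-generate pattern: fetch the most relevant chunks, place them in a prompt, and ask the model to answer from that context. The sketch below illustrates the general pattern only and is not a description of the library's internals. `build_prompt`, `answer`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the prompt wording is an assumption.

```python
from typing import Callable, List


def build_prompt(question: str, contexts: List[str]) -> str:
    """Place the retrieved chunks ahead of the question in a single prompt."""
    context_block = "\n---\n".join(contexts)
    return (
        "Context information is below.\n"
        f"{context_block}\n"
        "Given the context, answer the question.\n"
        f"Question: {question}\n"
    )


def answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context for the question, then hand the prompt to a generator."""
    return generate(build_prompt(question, retrieve(question)))


# Stand-ins for a vector store lookup and an LLM call (both hypothetical).
def fake_retrieve(question: str) -> List[str]:
    return ["Safety fine-tuning uses supervised fine-tuning and safety RLHF."]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


print(answer("What are the key concepts for safety fine-tuning?", fake_retrieve, fake_generate))
```

In a real pipeline, the retrieval step would embed the question and call the vector store's `query`, and the generation step would call a language model.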
