In [1]:
# sentence transformers
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en", cache_folder=".model_cache")

In [50]:
# llm block moved below

In [2]:
import os

import psycopg2
from dotenv import load_dotenv
from llama_index.vector_stores.postgres import PGVectorStore

# Load environment variables
load_dotenv(override=True)

# Fetch database configuration from environment variables
db_name = os.getenv("DB_NAME")
host = os.getenv("DB_HOST")
password = os.getenv("DB_PASSWORD")
port = os.getenv("DB_PORT")
user = os.getenv("DB_USER")

# Connect to Postgres and create the database
conn = psycopg2.connect(
    dbname="postgres",
    host=host,
    password=password,
    port=port,
    user=user,
)
conn.autocommit = True

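from psycopg2 import sql  # used to quote the database name safely

try:
    with conn.cursor() as c:
        # Check if the database exists and create it only if it doesn't.
        # The name is passed as a bound parameter instead of being formatted into the SQL.
        c.execute("SELECT 1 FROM pg_database WHERE datname = %s", (db_name,))
        exists = c.fetchone()
        if not exists:
            # CREATE DATABASE cannot take a bound parameter, so quote the identifier.
            # Quoting also preserves the case of the name, matching the check above.
            c.execute(sql.SQL("CREATE DATABASE {}").format(sql.Identifier(db_name)))
            print(f"Database '{db_name}' created.")
        else:
            print(f"Database '{db_name}' already exists.")
finally:
    # Close the initial connection
    conn.close()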

# Connect to the new database and set up PGVectorStore
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=host,
    password=password,
    port=port,
    user=user,
    table_name="llama2_paper",
    embed_dim=384,  # must match the embedding model; bge-small-en produces 384-dimensional vectors
)


Database 'vectorizedDnD' already exists.
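
The cell above reads five settings with os.getenv, which silently returns None when a variable is missing and always returns strings. A minimal sketch of a fail-fast loader follows. The helper name load_db_config and the REQUIRED_KEYS tuple are hypothetical and not part of the notebook; only the variable names come from the cell above.

```python
import os

# Variable names taken from the cell above; the helper itself is illustrative.
REQUIRED_KEYS = ("DB_NAME", "DB_HOST", "DB_PASSWORD", "DB_PORT", "DB_USER")


def load_db_config(env=None):
    """Return the database settings as a dict, failing fast on missing keys."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    config = {key: env[key] for key in REQUIRED_KEYS}
    config["DB_PORT"] = int(config["DB_PORT"])  # environment values are strings
    return config
```

Calling load_db_config() right after load_dotenv() would surface a misconfigured .env file before any connection attempt is made.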


In [3]:
# test with llama2 paper
from llama_index.readers.file import PyMuPDFReader

loader = PyMuPDFReader()
documents = loader.load(file_path="data/llama2.pdf")  # alternative: "./data/DnD_5e_EGTW_eng.pdf"

In [4]:
documents

[Document(id_='7fb6e712-136b-4b42-ad05-5b4c31c911d9', embedding=None, metadata={'total_pages': 77, 'file_path': 'data/llama2.pdf', 'source': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Llama 2: Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗\nLouis Martin†\nKevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev\nPunit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich\nYinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra\nIgor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi\nAlan Schelten Ruan Sil

In [5]:
from llama_index.core.node_parser import SentenceSplitter
text_parser = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

In [6]:
text_chunks = []
# keep each chunk's source document index so the document metadata can be attached to its nodes later
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
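
The bookkeeping in this loop is easier to see on toy data. The sketch below swaps SentenceSplitter for a hypothetical fixed-width splitter (split_fixed) so that it runs without any dependencies; the toy_* names are illustrative only. The point is that text_chunks and doc_idxs stay the same length, with position i of doc_idxs naming the document that produced chunk i.

```python
def split_fixed(text, size):
    """Toy splitter: cut text into pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


toy_docs = ["aaaaaa", "bb", "cccc"]
toy_chunks = []
toy_doc_idxs = []
for doc_idx, text in enumerate(toy_docs):
    pieces = split_fixed(text, 3)
    toy_chunks.extend(pieces)
    # one entry per chunk, all pointing back to the same source document
    toy_doc_idxs.extend([doc_idx] * len(pieces))

print(list(zip(toy_doc_idxs, toy_chunks)))
# [(0, 'aaa'), (0, 'aaa'), (1, 'bb'), (2, 'ccc'), (2, 'c')]
```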

In [7]:
print(text_chunks[5])

Figure 3: Safety human evaluation results for Llama 2-Chat compared to other open-source and closed-
source models. Human raters judged model generations for safety violations across ~2,000 adversarial
prompts consisting of both single and multi-turn prompts. More details can be found in Section 4.4. It is
important to caveat these safety results with the inherent bias of LLM evaluations due to limitations of the
prompt set, subjectivity of the review guidelines, and subjectivity of individual raters. Additionally, these
safety evaluations are performed using content standards that are likely to be biased towards the Llama
2-Chat models.
We are releasing the following models to the general public for research and commercial use‡:
1. Llama 2, an updated version of Llama 1, trained on a new mix of publicly available data. We also
increased the size of the pretraining corpus by 40%, doubled the context length of the model, and
adopted grouped-query attention (Ainslie et al., 2023). We are

In [8]:
from llama_index.core.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    # copy the dict so nodes do not share (and accidentally mutate) the document's metadata
    node.metadata = dict(src_doc.metadata)
    nodes.append(node)

In [9]:
nodes

[TextNode(id_='8cdfa0d6-9672-44ee-ad6a-5dc220a86620', embedding=None, metadata={'total_pages': 77, 'file_path': 'data/llama2.pdf', 'source': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Llama 2: Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗\nLouis Martin†\nKevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev\nPunit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich\nYinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra\nIgor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi\nAlan Schelten Ruan Sil

In [10]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

In [11]:
nodes

[TextNode(id_='8cdfa0d6-9672-44ee-ad6a-5dc220a86620', embedding=[-0.04612698033452034, 0.03514251485466957, -0.012776022776961327, -0.011174554005265236, 0.01774013228714466, 0.0013715123059228063, -0.0059648011811077595, 0.015248250216245651, 0.024939164519309998, -0.006700380705296993, -0.01120698545128107, -0.03888924419879913, 0.05995891988277435, 0.018198519945144653, 0.05590090528130531, 0.037978190928697586, 0.01972636952996254, -0.016168123111128807, 0.009030930697917938, 0.03445332124829292, -0.011639954522252083, -0.04606913775205612, 0.013169744983315468, 0.003836013376712799, -0.024358266964554787, 0.018239019438624382, -0.03546268492937088, -0.038379114121198654, -0.008201866410672665, -0.22522640228271484, -0.004977657925337553, 0.006810924503952265, 0.04910403862595558, 0.007316108327358961, -0.05522305890917778, -0.010157926008105278, -0.0024386036675423384, -0.019605208188295364, -0.02533531002700329, 0.017055565491318703, 0.006722817197442055, 0.002594253746792674, 0.

In [12]:
vector_store.add(nodes)

['8cdfa0d6-9672-44ee-ad6a-5dc220a86620',
 'dc68aea1-9bc5-409d-90e2-d85d123368e7',
 'f8204f71-63b0-4625-b112-e35c9c0a2ffa',
 '4b5e39af-5d97-4106-b329-41a993692f14',
 '3f862f09-d833-4d32-992e-da5becd4613f',
 '4aefa90c-4dd9-45a3-9674-e61f464d7d68',
 '1cb0156b-b65d-4b4b-af2c-edfde08a67f9',
 'a84c4228-9ea6-4129-afdf-f90145369db1',
 '4d1fd1ae-e713-4675-9539-509ba7661791',
 '96b8ebc0-5588-4239-93fc-412aa5cfe39b',
 'de8e2d8d-6412-426d-a818-a1f15fcc744b',
 '85cfa09c-8cd5-4a7f-a59f-f6dad545572b',
 'fddc1fab-4efb-4c95-adf7-680838cb9067',
 '0e8fb07f-fc0f-496b-a075-be404dfa3d0a',
 '8e45a130-1411-4523-bfb0-62917b7e9539',
 'a86c21fc-59b2-4725-99dc-c7677d22d799',
 '7af9dba6-a469-44b5-9b09-e0a546dd4d35',
 '31958fcc-1db1-4b3c-8bf7-d4205dffd8e5',
 '2f9662c9-6de5-4043-beca-3c4095f18aae',
 'd4b05e54-987c-437a-b391-6c7010021744',
 '17ce6b3a-1651-4da6-8cda-7b8ecb154aee',
 'c2a75a9e-4666-414d-923c-9269a8b06039',
 '8770787e-438b-4547-843b-815647997a53',
 '063f3bc0-49b8-4f97-906b-c99335dfa019',
 '0a64a29b-7b7f-

In [13]:
# test the system
query_str = "Can you tell me about the key concepts for safety finetuning"
query_embedding = embed_model.get_query_embedding(query_str)
# construct vector store query
from llama_index.core.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)

# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())

from llama_index.core.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))

TruthfulQA ↑
ToxiGen ↓
MPT
7B
29.13
22.32
30B
35.25
22.61
Falcon
7B
25.95
14.53
40B
40.39
23.44
Llama 1
7B
27.42
23.00
13B
41.74
23.08
33B
44.19
22.57
65B
48.71
21.77
Llama 2
7B
33.29
21.25
13B
41.86
26.10
34B
43.45
21.19
70B
50.18
24.60
Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the
percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we
present the percentage of toxic generations (the smaller, the better).
Benchmarks give a summary view of model capabilities and behaviors that allow us to understand general
patterns in the model, but they do not provide a fully comprehensive view of the impact the model may have
on people or real-world outcomes; that would require study of end-to-end product deployments. Further
testing and mitigation should be done to understand bias and other social issues for the specific context
in which a system may be deployed. For this, it may be necessary
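
Conceptually, a top-k vector query ranks stored embeddings by their similarity to the query embedding and returns the best matches with their scores. The sketch below shows that idea in plain Python using cosine similarity over toy 2-dimensional vectors. It is an illustration only: the function names are hypothetical, and which metric and index the Postgres store actually uses is not shown in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.1], toy_vectors, k=2)
print([idx for idx, _ in results])  # [0, 2]
```

The (index, score) pairs play the same role as query_result.nodes and query_result.similarities in the cell above.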

In [14]:
nodes_with_scores

[NodeWithScore(node=TextNode(id_='843d203e-3c28-48d7-a808-54d92aaebbc9', embedding=None, metadata={'total_pages': 77, 'file_path': 'data/llama2.pdf', 'source': '23'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='TruthfulQA ↑\nToxiGen ↓\nMPT\n7B\n29.13\n22.32\n30B\n35.25\n22.61\nFalcon\n7B\n25.95\n14.53\n40B\n40.39\n23.44\nLlama 1\n7B\n27.42\n23.00\n13B\n41.74\n23.08\n33B\n44.19\n22.57\n65B\n48.71\n21.77\nLlama 2\n7B\n33.29\n21.25\n13B\n41.86\n26.10\n34B\n43.45\n21.19\n70B\n50.18\n24.60\nTable 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the\npercentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we\npresent the percentage of toxic generations (the smaller, the better).\nBenchmarks give a summary view of model capabilities and behaviors that allow us to understand general\npatterns in the model, but they do not provide a fully comprehensive view of t

In [15]:
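from typing import Any, List, Optional

from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
from llama_index.core.vector_stores import VectorStoreQuery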


class VectorDBRetriever(BaseRetriever):
    """Retriever over a postgres vector store."""

    def __init__(
        self,
        vector_store: PGVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        # Use the instances passed to __init__ rather than the notebook globals
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)

        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores

In [16]:
retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)

In [17]:
from llama_index.llms.llama_cpp import LlamaCPP

# model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"
model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
model_path = ".model_cache/llama-2-13b-chat.Q4_0.gguf"
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    # model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_path,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # n_gpu_layers=-1 offloads all layers to the GPU; use 0 to run on CPU only
    model_kwargs={"n_gpu_layers": -1},
    verbose=True,
)




llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from .model_cache/llama-2-13b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32 
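
The context_window and max_new_tokens settings above together bound how much retrieved text can go into a prompt. A rough sketch of the arithmetic, assuming the budget is simply the window minus the reserved output tokens (the library's own accounting may reserve additional padding; the helper names are hypothetical):

```python
CONTEXT_WINDOW = 3900   # value passed to LlamaCPP above
MAX_NEW_TOKENS = 256    # value passed to LlamaCPP above


def prompt_budget(context_window, max_new_tokens):
    """Tokens left for the prompt once the output allowance is reserved."""
    return context_window - max_new_tokens


def fits(prompt_tokens, context_window=CONTEXT_WINDOW, max_new_tokens=MAX_NEW_TOKENS):
    """True if a prompt of this size leaves room for the full output allowance."""
    return prompt_tokens <= prompt_budget(context_window, max_new_tokens)


print(prompt_budget(CONTEXT_WINDOW, MAX_NEW_TOKENS))  # 3644
```

With two retrieved chunks of at most 1024 tokens each plus the prompt template, a query stays well inside this budget under these assumptions.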

In [18]:
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

In [19]:
query_str = "How does Llama 2 perform compared to other open-source models?"

response = query_engine.query(query_str)
print(str(response))


llama_print_timings:        load time =   18468.51 ms
llama_print_timings:      sample time =       3.23 ms /   151 runs   (    0.02 ms per token, 46821.71 tokens per second)
llama_print_timings: prompt eval time =   22458.67 ms /   762 tokens (   29.47 ms per token,    33.93 tokens per second)
llama_print_timings:        eval time =   17211.62 ms /   150 runs   (  114.74 ms per token,     8.72 tokens per second)
llama_print_timings:       total time =   39744.69 ms /   912 tokens


 Based on the information provided in the text, Llama 2 outperforms all open-source models on almost all benchmarks, with the exception of GPT-4 and PaLM-2-L. Specifically, Llama 2 achieves a score of 78.3 on MMLU, 69.3 on GSM8K, and 68.9 on TriviaQA, while GPT-3.5 achieves a score of 86.4 on MMLU and 82.1 on GSM8K. Additionally, Llama 2 is on par or better than PaLM (540B) on almost all benchmarks.


In [20]:
print(response.source_nodes[0].get_content())

Additionally, Llama 2 70B model outperforms all open-source models.
In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown
in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant
gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al.,
2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4
and PaLM-2-L.
We also analysed the potential data contamination and share the details in Section A.6.
Benchmark (shots)
GPT-3.5
GPT-4
PaLM
PaLM-2-L
Llama 2
MMLU (5-shot)
70.0
86.4
69.3
78.3
68.9
TriviaQA (1-shot)
–
–
81.4
86.1
85.0
Natural Questions (1-shot)
–
–
29.3
37.5
33.0
GSM8K (8-shot)
57.1
92.0
56.5
80.7
56.8
HumanEval (0-shot)
48.1
67.0
26.2
–
29.9
BIG-Bench Hard (3-shot)
–
–
52.3
65.7
51.2
Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4
are from OpenAI

In [None]:
# explicit cleanup of GPU memory by restarting kernel
