# RAG ICD System

Based on the LlamaIndex low-level ingestion and retrieval example: https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval.html

## Libs

In [1]:
%pip install llama-index llama-hub huggingface_hub llama-cpp-python PyMuPDF

Collecting llama-index
  Downloading llama_index-0.9.15-py3-none-any.whl.metadata (8.2 kB)
Collecting llama-hub
  Downloading llama_hub-0.0.57-py3-none-any.whl.metadata (15 kB)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.22.tar.gz (8.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 25.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting PyMuPDF
  Downloading PyMuPDF-1.23.7-cp311-none-macosx_11_0_arm64.whl.metadata (3.4 kB)
Collecting SQLAlchemy>=1.4.49 (from SQLAlchemy[asyncio]>=1.4.49->llama-index)
  Downloading SQLAlchemy-2.0.23-cp311-cp311-macosx_11_0_arm64.whl.metadata (9.6 kB)
Collecting beautifulsoup4<5.0.0,>=4.12.2 (from llama-index)
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)

## Embeddings

In [2]:
# sentence transformers
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

Downloading config.json: 100%|██████████| 684/684 [00:00<00:00, 1.94MB/s]
Downloading model.safetensors: 100%|██████████| 133M/133M [00:04<00:00, 29.4MB/s] 
Downloading tokenizer_config.json: 100%|██████████| 366/366 [00:00<00:00, 709kB/s]
Downloading vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.10MB/s]
Downloading tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 2.07MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 1.24MB/s]


In [3]:
from llama_index.llms import LlamaCPP

# model_url = "https://huggingface.co/TheBloke/Llama-2-7B-chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin"
model_url = "https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf"

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    verbose=True,
)

Downloading url https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf to path /Users/tilmankerl/Library/Caches/llama_index/models/llama-2-7b-chat.Q4_0.gguf
total size (MB): 3825.81


3649it [02:37, 23.10it/s]                          
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/tilmankerl/Library/Caches/llama_index/models/llama-2-7b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_lo
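
The `context_window` and `max_new_tokens` settings above leave a fixed number of tokens for the prompt (question, prompt template, and retrieved chunks). A minimal sketch of that arithmetic, assuming the prompt and the generated answer must fit in the window together; `prompt_budget` is a hypothetical helper for illustration, not part of LlamaIndex or llama.cpp.

```python
# Hypothetical helper: tokens left for the prompt once room for the
# generated answer has been reserved inside the context window.
def prompt_budget(context_window: int, max_new_tokens: int) -> int:
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return context_window - max_new_tokens


# Values taken from the LlamaCPP call above.
budget = prompt_budget(context_window=3900, max_new_tokens=256)
print(budget)  # 3644
```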

In [4]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)

## Data

In [6]:
!mkdir -p rag_data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "rag_data/llama2.pdf"


--2023-12-13 23:04:58--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: 'rag_data/llama2.pdf'


2023-12-13 23:05:15 (837 KB/s) - 'rag_data/llama2.pdf' saved [13661300/13661300]



In [7]:
from llama_hub.file.pymu_pdf.base import PyMuPDFReader

In [8]:
loader = PyMuPDFReader()
documents = loader.load(file_path="rag_data/llama2.pdf")

In [9]:
from llama_index.node_parser.text import SentenceSplitter

In [10]:
text_parser = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

In [11]:
text_chunks = []
# keep each chunk's source document index so its metadata can be attached when building nodes
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
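
The loop above keeps two parallel lists: the chunks and, for each chunk, the index of the document it came from. The same bookkeeping can be sketched without any library, assuming a naive fixed-size word splitter stands in for `SentenceSplitter` (which splits on sentence boundaries and token counts instead); `split_with_sources` is a hypothetical name.

```python
from typing import List, Tuple


def split_with_sources(
    texts: List[str], chunk_words: int
) -> Tuple[List[str], List[int]]:
    """Split each text into word chunks and record each chunk's source index."""
    chunks: List[str] = []
    source_idxs: List[int] = []
    for src_idx, text in enumerate(texts):
        words = text.split()
        pieces = [
            " ".join(words[i : i + chunk_words])
            for i in range(0, len(words), chunk_words)
        ]
        chunks.extend(pieces)
        # one source index per chunk, so both lists stay aligned
        source_idxs.extend([src_idx] * len(pieces))
    return chunks, source_idxs


chunks, source_idxs = split_with_sources(["a b c d e", "f g"], chunk_words=2)
print(chunks)       # ['a b', 'c d', 'e', 'f g']
print(source_idxs)  # [0, 0, 0, 1]
```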


In [12]:
from llama_index.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)


In [13]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding

In [14]:
!pip install psycopg2-binary pgvector asyncpg "sqlalchemy[asyncio]" greenlet

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting psycopg2-binary
  Downloading psycopg2_binary-2.9.9-cp311-cp311-macosx_11_0_arm64.whl.metadata (4.4 kB)
Collecting pgvector
  Downloading pgvector-0.2.4-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting asyncpg
  Downloading asyncpg-0.29.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (4.4 kB)
Downloading psycopg2_binary-2.9.9-cp311-cp311-macosx_11_0_arm64.whl (2.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 21.9 MB/s eta 0:00:00
Downloading pgvector-0.2.4-py2.py3-none-any.whl (9.6 kB)
Downloading asyncpg-0.29.0-cp311-cp311-macosx_11_0_arm64.whl (638 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 638.7/638.7 kB 54.1 MB/s eta 0:00:00
Installing collected packages: psycopg2-binary, pgvector, asyncpg
Successfully installed asyncpg-0.29.0 pgvector-0.2.4 psycopg2-binary-2.9.9
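
The `huggingface/tokenizers` warning in the output above is printed because the `!pip` call forks the process after the embedding model's tokenizer has already used parallelism. As the warning itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly avoids it. A minimal sketch, assuming it runs in an early cell before any tokenizer is used:

```python
import os

# Set explicitly, as the warning suggests; do this before the first
# tokenizer call, ideally in the first notebook cell.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```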


In [None]:
# CREATE ROLE docrag WITH LOGIN PASSWORD 'rag-adl-llama';
# ALTER ROLE docrag SUPERUSER;

In [None]:
# https://github.com/pgvector/pgvector
# cd /tmp
# git clone --branch v0.5.1 https://github.com/pgvector/pgvector.git
# cd pgvector
# make
# make install # may need sudo
# CREATE EXTENSION vector;

In [16]:
import psycopg2

db_name = "vector_db"
host = "localhost"
password = "rag-adl-llama"
port = "5432"
user = "docrag"
# conn = psycopg2.connect(connection_string)
conn = psycopg2.connect(
    dbname="postgres",
    host=host,
    password=password,
    port=port,
    user=user,
)
conn.autocommit = True

with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
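
The `DROP DATABASE` and `CREATE DATABASE` statements above interpolate `db_name` directly into the SQL string, because database names cannot be passed as ordinary query parameters. That is fine for a hard-coded name, but if the name ever comes from configuration or user input it is worth validating first. A minimal sketch, assuming only plain unquoted identifiers are allowed; `checked_identifier` is a hypothetical helper, and the `psycopg2.sql` module (`sql.SQL(...).format(sql.Identifier(name))`) is the library's own way to compose such statements.

```python
import re

# Plain unquoted identifier: a letter or underscore, then letters,
# digits, or underscores. Deliberately stricter than what PostgreSQL accepts.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def checked_identifier(name: str) -> str:
    """Return name unchanged if it is a plain SQL identifier, else raise."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


print(f"DROP DATABASE IF EXISTS {checked_identifier('vector_db')}")
```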


In [17]:
from llama_index.vector_stores import PGVectorStore

vector_store = PGVectorStore.from_params(
    database=db_name,
    host=host,
    password=password,
    port=port,
    user=user,
    table_name="llama2_paper",
    embed_dim=384,  # embedding dimension of BAAI/bge-small-en
)
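
`embed_dim` has to match the length of the vectors produced by the embedding model, otherwise inserts into the pgvector column fail. A minimal sketch of a pre-flight check, using short stand-in vectors so it runs on its own; `assert_embedding_dim` is a hypothetical helper, and in this notebook it would be called as `assert_embedding_dim([n.embedding for n in nodes], 384)`.

```python
from typing import Sequence


def assert_embedding_dim(
    embeddings: Sequence[Sequence[float]], expected_dim: int
) -> None:
    """Raise ValueError if any embedding does not have expected_dim entries."""
    for i, vec in enumerate(embeddings):
        if len(vec) != expected_dim:
            raise ValueError(
                f"embedding {i} has dimension {len(vec)}, expected {expected_dim}"
            )


# Stand-in vectors of dimension 3 for illustration only.
assert_embedding_dim([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dim=3)
```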

In [19]:
vector_store.add(nodes)

['d9422c36-8ee1-42f2-a2ae-7dc7eb0fd059',
 '05bbbd95-53ed-477b-b9b1-1bd2cc948135',
 '4541ee6e-eabe-4392-ab7b-f3a5ca64c15c',
 '322e8bdc-c783-44f0-9dab-66aa57a7d170',
 '381fda60-a499-46ae-90c5-0f86b01ebaa2',
 '8b726c43-7cd7-44d0-b128-57e5439a35c0',
 'fa66fe9d-8deb-4ac7-8175-6f30609e95bc',
 '0c884380-f535-47f6-912f-2a5969b84945',
 '91c9a63f-0506-46a5-a038-997adc7830e1',
 '87364647-f154-4a7a-a37a-179ea68634ce',
 '27ec8af1-69c4-40b5-a356-736eff50a888',
 '6e9954a2-2916-41dc-b6c6-3bc06b5fb298',
 '1170e109-eb46-4584-ae5e-b0013629cb71',
 'be5dbc65-203f-473c-89dc-77a4e65cbbc6',
 '75b606ab-9d5f-4884-927d-7aa911356813',
 '4fe9d13c-313b-4389-863f-36fac11ae950',
 '4241641b-179f-4739-b31f-f07482ee2edc',
 'e48bf3e0-ffd8-468c-b5c4-dbdb4021b6d6',
 'f3c88d50-e27a-454d-a3e1-6a5c14591b37',
 '20e4ba40-3005-4b79-92cb-0c8ad1918920',
 '82f3bdc1-40c6-4a81-8f32-b315e01fe477',
 '0078ba11-0f3a-4fcc-9067-f2a46708ca4a',
 'a582f5c8-e72d-4a41-a320-cde333a9d4e8',
 '6588d9c1-d74b-49c4-9da2-40051acc6f6b',
 '93835a21-960e-

In [20]:
query_str = "Can you tell me about the key concepts for safety finetuning"

In [21]:
query_embedding = embed_model.get_query_embedding(query_str)

In [22]:
# construct vector store query
from llama_index.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)
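
Conceptually, a dense query with `similarity_top_k=2` ranks the stored embeddings by similarity to the query embedding and keeps the best two. The sketch below is a brute-force, in-memory version of that idea, assuming cosine similarity as the metric; it is not how `PGVectorStore` is implemented, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.0], vectors, k=2)
print([i for i, _ in best])  # [0, 2]
```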

In [23]:
# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())

TruthfulQA ↑
ToxiGen ↓
MPT
7B
29.13
22.32
30B
35.25
22.61
Falcon
7B
25.95
14.53
40B
40.39
23.44
Llama 1
7B
27.42
23.00
13B
41.74
23.08
33B
44.19
22.57
65B
48.71
21.77
Llama 2
7B
33.29
21.25
13B
41.86
26.10
34B
43.45
21.19
70B
50.18
24.60
Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the
percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we
present the percentage of toxic generations (the smaller, the better).
Benchmarks give a summary view of model capabilities and behaviors that allow us to understand general
patterns in the model, but they do not provide a fully comprehensive view of the impact the model may have
on people or real-world outcomes; that would require study of end-to-end product deployments. Further
testing and mitigation should be done to understand bias and other social issues for the specific context
in which a system may be deployed. For this, it may be necessary

In [24]:
from llama_index.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))


In [25]:
from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from typing import Any, List, Optional


class VectorDBRetriever(BaseRetriever):
    """Retriever over a postgres vector store."""

    def __init__(
        self,
        vector_store: PGVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        # use the instance's embedding model and vector store, not the notebook globals
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)

        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores


In [26]:
retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)

In [27]:
from llama_index.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever, service_context=service_context
)

In [28]:
# query_str = "Can you tell me about the key concepts for safety finetuning"
query_str = "How does Llama 2 perform compared to other open-source models?"

response = query_engine.query(query_str)


llama_print_timings:        load time =   12076.71 ms
llama_print_timings:      sample time =      35.29 ms /    88 runs   (    0.40 ms per token,  2493.91 tokens per second)
llama_print_timings: prompt eval time =   21673.20 ms /  1852 tokens (   11.70 ms per token,    85.45 tokens per second)
llama_print_timings:        eval time =    6149.94 ms /    87 runs   (   70.69 ms per token,    14.15 tokens per second)
llama_print_timings:       total time =   28230.89 ms
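
The throughput figures in the timing output follow directly from the token counts and elapsed times. A short sketch of that arithmetic, using the numbers printed above; `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and an elapsed time in milliseconds to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)


# Prompt evaluation: 1852 tokens in 21673.20 ms.
prompt_tps = tokens_per_second(1852, 21673.20)
# Generation: 87 runs in 6149.94 ms.
gen_tps = tokens_per_second(87, 6149.94)

print(f"{prompt_tps:.2f}")  # 85.45
print(f"{gen_tps:.2f}")     # 14.15
```

Prompt evaluation accounts for most of the roughly 28 s total, since the two retrieved chunks make the prompt far longer than the generated answer.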


In [29]:
print(str(response))


Llama 2 outperforms all open-source models on academic benchmarks. In particular, Llama 2 70B improves the results on MMLU and BBH by ≈5 and ≈8 points, respectively, compared to Llama 1 65B. Additionally, Llama 2 70B model outperforms all open-source models.


In [30]:
print(response.source_nodes[0].get_content())

Additionally, Llama 2 70B model outperforms all open-source models.
In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown
in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant
gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al.,
2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4
and PaLM-2-L.
We also analysed the potential data contamination and share the details in Section A.6.
Benchmark (shots)
GPT-3.5
GPT-4
PaLM
PaLM-2-L
Llama 2
MMLU (5-shot)
70.0
86.4
69.3
78.3
68.9
TriviaQA (1-shot)
–
–
81.4
86.1
85.0
Natural Questions (1-shot)
–
–
29.3
37.5
33.0
GSM8K (8-shot)
57.1
92.0
56.5
80.7
56.8
HumanEval (0-shot)
48.1
67.0
26.2
–
29.9
BIG-Bench Hard (3-shot)
–
–
52.3
65.7
51.2
Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4
are from OpenAI