We will implement a RAG (retrieval-augmented generation) pipeline. This is not a naive/vanilla RAG: it adds two optimisations, reranking with Cohere and NVIDIA's NeMo Guardrails library.

Install the required libraries:

In [7]:
!pip install -qU \
    nemoguardrails==0.4.0 \
    pinecone-client==2.2.2 \
    datasets==2.14.3 \
    openai==0.27.8

In [5]:
!pip install cohere==4.27



In [4]:
!pip install tiktoken==0.5.1

Collecting tiktoken==0.5.1
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.5.1


Now we need a dataset. Here we download the chunked contents of the Llama 2 arXiv papers. This dataset suits RAG well because information about Llama 2 was not in the training data of GPT-3.5.

In [8]:
from datasets import load_dataset

data = load_dataset(
    "jamescalam/llama-2-arxiv-papers-chunked",
    split="train"
)
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 4838
})

Next come some preprocessing steps: we build a unique ID for each chunk and drop the irrelevant columns.

In [9]:
data = data.map(lambda x: {
    'uid': f"{x['doi']}-{x['chunk-id']}"
})
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references', 'uid'],
    num_rows: 4838
})

In [10]:
data = data.to_pandas()
# drop irrelevant fields
data = data[['uid', 'chunk', 'title', 'source']]
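
As a quick illustration of the ID scheme above, the sketch below builds the same `doi`-`chunk-id` string for a few made-up records (the values are hypothetical, not taken from the dataset) and checks that the IDs are distinct, since each ID is meant to identify exactly one chunk. The helper name `make_uid` is an assumption for illustration.

```python
# Toy records mimicking the dataset's 'doi' and 'chunk-id' fields.
# The values are made up for illustration only.
records = [
    {"doi": "0000.00001", "chunk-id": "0"},
    {"doi": "0000.00001", "chunk-id": "1"},
    {"doi": "0000.00002", "chunk-id": "0"},
]


def make_uid(record: dict) -> str:
    """Join the paper ID and the chunk index, as in the map() call above."""
    return f"{record['doi']}-{record['chunk-id']}"


uids = [make_uid(r) for r in records]
# Every chunk should get a distinct ID.
assert len(uids) == len(set(uids))
print(uids)
```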

Now set up your OpenAI API key so that we can use the embedding model and GPT.

In [11]:
import os

os.environ["OPENAI_API_KEY"]="sxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxQ"
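
Hard-coding keys in a notebook makes them easy to leak. A minimal alternative, assuming you are running interactively, is to prompt for the key only when it is not already set in the environment. The helper name `ensure_env` is hypothetical and not part of any library used here.

```python
import os
from getpass import getpass


def ensure_env(name: str) -> str:
    """Return the value of an environment variable, prompting for it if unset."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so it does not appear in the cell output
        value = getpass(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

It could then be called as `ensure_env("OPENAI_API_KEY")` in place of the assignment above.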

This is the code for OpenAI's embedding function, which converts sentences into vectors in a high-dimensional space that represent their semantic meaning. Here we use the text-embedding-ada-002 model. The embeddings will be stored in a vector database.

In [12]:
import openai

embed_model_id = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "We would have some text to embed here",
        "This is the second chunk"
    ], engine=embed_model_id
)
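
To make the idea of semantic vectors concrete: two texts are treated as similar when their embeddings point in a similar direction, which is commonly measured with cosine similarity. The sketch below uses tiny made-up vectors rather than real embeddings, and `cosine_similarity` is a hand-written helper, not an OpenAI function.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings
v_query = [1.0, 0.0, 1.0]
v_close = [2.0, 0.0, 2.0]  # same direction, different length
v_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(v_query, v_close))
print(cosine_similarity(v_query, v_far))
```

Because the score depends only on direction, the longer vector `v_close` still counts as a near-perfect match for `v_query`.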

We now have a dataset that can serve as our chatbot's knowledge base. Our next task is to transform that dataset into a knowledge base the chatbot can actually query. To do this we need the embedding model above and a vector database. We will use the Pinecone database. Enter your API key and set up the database.



In [14]:
import pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = "axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx3"
# find your environment next to the api key in pinecone console
env = "gcp-starter"

pinecone.init(api_key=api_key, environment=env)
pinecone.whoami()

WhoAmIResponse(username=None, user_label=None, projectname='8ymm4d3')

In [15]:
index_name = "nemo-guardrails-rag-with-actions"

In [16]:
import time

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine'
    )
    # wait for index to be initialized
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
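
The `while not ... ready` loop above waits indefinitely if the index never becomes ready. A hedged variant, using a hypothetical `wait_until` helper that is not part of the Pinecone client, gives up after a timeout:

```python
import time


def wait_until(predicate, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll predicate() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # Timed out without the condition becoming true
    return False
```

In this notebook it could be used as `wait_until(lambda: pinecone.describe_index(index_name).status['ready'])`, raising an error if it returns `False`.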

Populate the Database with the dataset embeddings.

In [17]:
from tqdm.auto import tqdm

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(data), batch_size)):
    # find end of batch
    i_end = min(len(data), i+batch_size)
    batch = data[i:i_end]
    # get ids
    ids_batch = batch['uid'].to_list()
    # get texts to encode
    texts = batch['chunk'].to_list()
    # create embeddings
    res = openai.Embedding.create(input=texts, engine=embed_model_id)
    embeds = [record['embedding'] for record in res['data']]
    # create metadata
    metadata = [{
        'chunk': x['chunk'],
        'source': x['source']
    } for _, x in batch.iterrows()]
    to_upsert = list(zip(ids_batch, embeds, metadata))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
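
The batching arithmetic in the loop can be checked on its own. The sketch below reproduces only the index ranges (no embedding or upsert calls); `batch_ranges` is a hypothetical helper name. With 4,838 rows and a batch size of 100 it yields 49 batches, the last one holding the remaining 38 rows.

```python
def batch_ranges(n_rows: int, batch_size: int):
    """Yield (start, end) index pairs covering range(n_rows) in batch_size steps."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)


ranges = list(batch_ranges(4838, 100))
print(len(ranges))  # number of batches
print(ranges[-1])   # the final, shorter batch
```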

Our database is ready. Now we must define our RAG and retrieval functions.

In [18]:
async def rag(query: str, contexts: list) -> str:
    print("> RAG Called")  # we'll add this so we can see when this is being used
    context_str = "\n".join(contexts)
    # place query and contexts into RAG prompt
    prompt = f"""You are a helpful assistant, below is a query from a user and
    some relevant contexts. Answer the question given the information in those
    contexts. If you cannot find the answer to the question, say "I don't know".

    Contexts:
    {context_str}

    Query: {query}

    Answer: """
    # generate answer
    res = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        temperature=0.0,
        max_tokens=100
    )
    return res['choices'][0]['text']
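
Separating prompt construction from the API call makes the prompt easy to inspect without spending tokens. The sketch below is a hypothetical refactor (`build_rag_prompt` is not in the original notebook) that produces the same prompt wording, minus the indentation whitespace that the triple-quoted f-string carries along. The contexts are placeholders.

```python
def build_rag_prompt(query: str, contexts: list) -> str:
    """Assemble the RAG prompt from a user query and a list of context strings."""
    context_str = "\n".join(contexts)
    return (
        "You are a helpful assistant, below is a query from a user and some "
        "relevant contexts. Answer the question given the information in those "
        'contexts. If you cannot find the answer to the question, say "I don\'t know".\n\n'
        f"Contexts:\n{context_str}\n\n"
        f"Query: {query}\n\n"
        "Answer: "
    )


prompt = build_rag_prompt(
    "What is Llama 2?",
    ["first placeholder context", "second placeholder context"],
)
print(prompt)
```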

Reranking helps us surface records that vector similarity search alone may not place among the very top results: we retrieve a larger candidate set from the database, rerank it, and pass only a smaller set of the best results to the LLM.

We will use Cohere's rerank endpoint for this; to use it you will need a Cohere API key. Once you have your key, use it to authenticate your Cohere client like so:



In [20]:
import cohere

os.environ["COHERE_API_KEY"] ="fxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxW"
# init client
co = cohere.Client(os.environ["COHERE_API_KEY"])

In [26]:
async def retrieve(query: str) -> list:
    # create query embedding
    res = openai.Embedding.create(input=[query], engine=embed_model_id)
    xq = res['data'][0]['embedding']
    # get the 10 nearest chunks from pinecone
    res = index.query(xq, top_k=10, include_metadata=True)
    # get list of retrieved texts
    contexts = [x['metadata']['chunk'] for x in res['matches']]
    # rerank the candidates with Cohere and keep the 5 most relevant
    rerank_docs = co.rerank(
        query=query, documents=contexts, top_n=5, model="rerank-english-v2.0"
    )
    # documents passed as plain strings come back under the "text" key
    # (assumed for the pinned cohere 4.x client), not "chunk"
    reranked_docs = [doc.document["text"] for doc in rerank_docs]
    return reranked_docs
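
The two-stage pattern in `retrieve` (fetch a wide candidate set, then keep the best few after rescoring) can be shown without any API calls. In the sketch below, `overlap_score` is a deliberately crude stand-in for a real reranking model and is not how Cohere scores documents; it only illustrates the control flow. All names and texts are hypothetical.

```python
def overlap_score(query: str, doc: str) -> int:
    """Crude relevance stand-in: number of distinct query words present in the doc."""
    doc_words = set(doc.lower().split())
    return len(set(query.lower().split()) & doc_words)


def rerank(query: str, candidates: list, top_n: int) -> list:
    """Re-order candidates by score (highest first) and keep the top_n."""
    ranked = sorted(candidates, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:top_n]


# Pretend these came back from the vector search, in this order
candidates = [
    "llamas are animals from south america",
    "the weather is nice today",
    "llama 2 safety benchmarks and results",
]
top = rerank("llama 2 safety", candidates, top_n=2)
print(top)
```

The third candidate moves to the front even though the first-stage search ranked it last, which is the effect reranking is meant to have.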



We can create a guardrail that identifies queries asking about our topic: when a user asks such a question, the guardrails recognise this intent and trigger the RAG pipeline.
We now need to initialize our configs for Rails:

In [21]:
yaml_content = """
models:
- type: main
  engine: openai
  model: text-davinci-003
"""

rag_colang_content = """

# define RAG intents and flow
define user ask llama
    "tell me about llama 2?"
    "what is large language model"
    "where did meta's new model come from?"
    "how to llama?"
    "have you ever meta llama?"

define flow llama
    user ask llama
    $contexts = execute retrieve(query=$last_user_message)
    $answer = execute rag(query=$last_user_message, contexts=$contexts)
    bot $answer
"""

The flow executes the retrieve action with $last_user_message to get our contexts, then passes $last_user_message and the contexts to our rag action. We initialize our RAG-enabled rails with this Colang setup:

In [22]:
from nemoguardrails import LLMRails, RailsConfig

# initialize rails config
config = RailsConfig.from_content(
    colang_content=rag_colang_content,
    yaml_content=yaml_content
)
# create rails
rag_rails = LLMRails(config)

We need to register any actions that are used in the Colang config file, otherwise our rails have no idea how to execute retrieve or execute rag. We register both like so:

In [23]:
rag_rails.register_action(action=retrieve, name="retrieve")
rag_rails.register_action(action=rag, name="rag")
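
Conceptually, registering an action associates a name used in the Colang flow with a Python callable. The sketch below is a toy, plain-Python analogue of that idea and of the `llama` flow; it is not NeMo Guardrails' implementation, and `ACTIONS`, `register`, `run_flow`, and the stub actions are hypothetical names.

```python
import asyncio

# Name -> callable lookup table, standing in for the rails' action registry
ACTIONS = {}


def register(name: str, action) -> None:
    ACTIONS[name] = action


async def fake_retrieve(query: str) -> list:
    # Stub: a real action would query the vector database
    return [f"context for: {query}"]


async def fake_rag(query: str, contexts: list) -> str:
    # Stub: a real action would call the LLM with the contexts
    return f"answer to '{query}' using {len(contexts)} context(s)"


async def run_flow(user_message: str) -> str:
    """Mirror the flow: retrieve contexts, then pass them to rag."""
    contexts = await ACTIONS["retrieve"](query=user_message)
    return await ACTIONS["rag"](query=user_message, contexts=contexts)


register("retrieve", fake_retrieve)
register("rag", fake_rag)
# In a notebook, use `await run_flow(...)` instead of asyncio.run(...)
result = asyncio.run(run_flow("tell me about llama 2"))
print(result)
```

Looking actions up by name is what lets the flow definition stay declarative: without the `register` calls, the lookup in `run_flow` would fail, just as unregistered actions cannot be executed by the rails.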

Now we can ask our chatbot questions. For questions related to Llama 2 it will call the RAG pipeline; otherwise it will simply answer from its parametric knowledge.

In [30]:
await rag_rails.generate_async(prompt="When did the Covid-19 hit the wordl?")

'The first cases of Covid-19 were reported in December 2019 in Wuhan, China. It spread rapidly across the world in 2020, leading to a global pandemic.'

In [32]:
await rag_rails.generate_async(prompt="Give some advantages of LLAMA 2")

> RAG Called


' LLAMA 2 has several advantages, including improved usability and safety, reduced costs in compute and human annotation, and improved performance on helpfulness and safety benchmarks compared to existing open-source models. Additionally, LLAMA 2 models may be on par with some of the closed-source models, at least on the human evaluations performed.'

We can see that the model did well at deciding when to call the RAG pipeline and when to answer the query from its parametric knowledge.
To summarise, we optimised the RAG pipeline in two ways: Cohere reranking reorders the retrieved records so that the most relevant ones reach the LLM, and the Guardrails library speeds things up by calling the RAG pipeline only when the query is relevant to the content of our database.