# **What is Retrieval-Augmented Generation (RAG)?**

---

Retrieval-augmented generation (RAG) is an architectural approach that can improve the efficacy of large language model (LLM) applications by retrieving relevant custom data and supplying it to the model as context.

![](https://dist.neo4j.com/wp-content/uploads/20230608064925/1zydD2GKzjpEyvL-d_cP0vA.png)


# **Installing required libraries**


- Sentence Transformers: for generating embeddings
- LucknowLLM: for document preprocessing and LLM API calls (Gemini model APIs)




In [None]:
%%capture
!pip3 install sentence_transformers
!pip3 install git+https://github.com/LucknowAI/Lucknow-LLM

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from lucknowllm import UnstructuredDataLoader, split_into_segments, GeminiModel
from google.colab import userdata

# **Configuration for the RAG System**

In [None]:
MODEL_NAME = 'paraphrase-MiniLM-L6-v2'
API_KEY = userdata.get('gemini')
GEMINI_MODEL_NAME = "gemini-1.0-pro"
FOLDER_NAME = 'Cultural_Festival_of_Lucknow'
FILE_NAME = 'Lucknow_Mahotsav.txt'
TOP_N = 3

In [None]:
sentence_model = SentenceTransformer(MODEL_NAME)
gemini_model   = GeminiModel(api_key=API_KEY, model_name=GEMINI_MODEL_NAME)




# **The retrieval step: getting the right information out of your knowledge base**



Above we assumed we had the right knowledge snippets to send to the LLM. But how do we actually get these from the user’s question? This is the retrieval step, and it is the core piece of infrastructure in any “chat with your data” system.

At its core, retrieval is a search operation—we want to look up the most relevant information based on a user’s input. And just like search, there are two main pieces:



- Indexing: Turning your knowledge base into something that can be searched/queried.
- Querying: Pulling out the most relevant bits of knowledge from a search term.


# **Load the documents and chunk them into smaller sentences**


**Why?**


Because we don't want to give the entire data set to the large language model, for two reasons:


1) Every large language model has a token limit (the maximum amount of text it accepts as input, measured in tokens rather than words or sentences), so a long document can exhaust that limit.

2) Longer documents take more time to process, so the answer is generated more slowly.
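
The token-limit concern can be made concrete with a rough budget check. The sketch below is only an illustration under stated assumptions: it approximates tokens as whitespace-separated words (real tokenizers split text into sub-word pieces, so actual counts are usually higher), and `MAX_INPUT_TOKENS` is a hypothetical limit, not the figure for any particular model.

```python
# Hypothetical input budget; real limits depend on the model you call.
MAX_INPUT_TOKENS = 50

def approx_token_count(text: str) -> int:
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())

def fits_in_budget(question: str, context: str, limit: int = MAX_INPUT_TOKENS) -> bool:
    """Return True if the question plus the context stays within the budget."""
    return approx_token_count(question) + approx_token_count(context) <= limit

question = "When did Lucknow become the capital?"
short_context = "Lucknow became the capital of the Nawabs of Awadh."
long_context = " ".join([short_context] * 20)  # simulate a long document

print(fits_in_budget(question, short_context))  # the short snippet fits
print(fits_in_budget(question, long_context))   # the long document does not
```

Sending a few short chunks instead of the whole document keeps the request inside whatever limit actually applies.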

![](https://communitykeeper-media.s3.amazonaws.com/media/images/retrieval.original.png)



**For example**

Suppose the question is **"When did Lucknow acquire the name Awadh?"**

And we have this data for context:



"Hello world, We are building RAG system for lucknow.
The Nawabs of Lucknow, in reality, the Nawabs of Awadh, acquired the name after the reign of the third Nawab when Lucknow became their capital. The city became North India's cultural capital, and its nawabs, best remembered for their refined and extravagant lifestyles, were patrons of the arts"

||

**chunking into sentences**

||

sentence 1 = Hello world, We are building RAG system for lucknow.

sentence 2 = The Nawabs of Lucknow, in reality, the Nawabs of Awadh, acquired the name after the reign of the third Nawab when Lucknow became their capital.
  
sentence 3 = The city became North India's cultural capital, and its nawabs, best remembered for their refined and extravagant lifestyles, were patrons of the arts

Now we can send only the second sentence along with the query; we don't need to send the entire document.
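
The chunking step in this example can be sketched with a naive regular-expression splitter. This is a stand-in for illustration: the helper name `split_into_sentences` is hypothetical, and LucknowLLM's `split_into_segments` may segment text differently.

```python
import re

def split_into_sentences(text: str) -> list[str]:
    """Naive rule: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

document = (
    "Hello world, We are building RAG system for lucknow. "
    "The Nawabs of Lucknow, in reality, the Nawabs of Awadh, acquired the name "
    "after the reign of the third Nawab when Lucknow became their capital. "
    "The city became North India's cultural capital, and its nawabs, best "
    "remembered for their refined and extravagant lifestyles, were patrons of the arts."
)

sentences = split_into_sentences(document)
for number, sentence in enumerate(sentences, start=1):
    print(f"sentence {number} = {sentence}")
```

A rule this simple breaks on abbreviations such as "Dr." or on decimal numbers, which is one reason to rely on a library splitter for real data.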

In [None]:
def load_and_preprocess_data():
    """Load the source file and split every document into smaller segments."""
    loader = UnstructuredDataLoader()
    # Use the constants from the configuration cell instead of repeating the literals.
    external_database = loader.get_data(folder_name=FOLDER_NAME, file_name=FILE_NAME)
    chunks = []
    for document in external_database:
        chunks.extend(split_into_segments(document['data']))
    return chunks


# **Embeddings of data**

Rather than a simple keyword search, we'll use vector search. To search the document efficiently, we embed each sentence as a vector (a sentence embedding) and compare vectors instead of raw text.

![](https://communitykeeper-media.s3.amazonaws.com/media/images/Screenshot_from_2023-08-25_09-52-18.original.png)

![](https://communitykeeper-media.s3.amazonaws.com/media/images/Screenshot_from_2023-08-22_14-09-41.original.png)
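
To see what "turning a sentence into a vector" means without downloading a model, here is a toy stand-in. It is not how `SentenceTransformer` works: a bag-of-words count over a tiny hypothetical vocabulary replaces the learned embedding, purely to show that every sentence maps to a fixed-length vector and a batch of sentences becomes a 2-D array.

```python
import numpy as np

# Hypothetical vocabulary chosen for this illustration only.
VOCABULARY = ["lucknow", "nawab", "capital", "festival", "arts"]

def toy_embed(sentences):
    """Map each sentence to a vector of word counts over VOCABULARY."""
    vectors = np.zeros((len(sentences), len(VOCABULARY)), dtype=float)
    for row, sentence in enumerate(sentences):
        # Lowercase and strip basic punctuation before counting words.
        words = sentence.lower().replace(",", " ").replace(".", " ").split()
        for col, term in enumerate(VOCABULARY):
            vectors[row, col] = words.count(term)
    return vectors

sentences = [
    "Lucknow became the capital.",
    "The nawab was a patron of the arts.",
]
embeddings = toy_embed(sentences)
print(embeddings.shape)  # one row per sentence, one column per vocabulary term
print(embeddings)
```

A learned model produces dense vectors in which sentences with similar meaning end up close together, even when they share no words; the count vectors above cannot do that.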




In [None]:
def embed_text_data(model, text_data):
    return model.encode(text_data)

# **Using cosine similarity as the relevance metric: the higher the similarity (that is, the smaller the cosine distance), the more relevant the document is to the question.**

![](https://communitykeeper-media.s3.amazonaws.com/media/images/Embedding_Plot1.original.png)

![](https://communitykeeper-media.s3.amazonaws.com/media/images/Screenshot_from_2023-08-24_15-35-42.original.png)
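
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. A small worked example with hand-picked 2-D vectors (hypothetical values, not real embeddings) shows the three reference cases:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])  # parallel to the query, different length
orthogonal = np.array([0.0, 3.0])      # unrelated direction
opposite = np.array([-1.0, 0.0])       # points the other way

print(cosine(query, same_direction))  # 1.0
print(cosine(query, orthogonal))      # 0.0
print(cosine(query, opposite))        # -1.0
```

Cosine distance is commonly defined as one minus this value, so the most similar vector is also the one at the smallest distance.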




In [None]:
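def cosine_similarity(a, b):
    """Pairwise cosine similarity: a has shape (m, d), b has shape (n, d); the result has shape (m, n)."""
    return np.dot(a, b.T) / (np.linalg.norm(a, axis=1)[:, np.newaxis] * np.linalg.norm(b, axis=1))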


# Extracting documents from the vector database

![](
https://communitykeeper-media.s3.amazonaws.com/media/images/Screenshot_from_2023-08-22_14-13-59.original.png)
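
Picking the most relevant chunks comes down to sorting similarity scores and keeping the indices of the highest ones. A minimal sketch with made-up scores (the numbers are hypothetical) shows how `np.argsort` is used for this:

```python
import numpy as np

# Hypothetical similarity scores between one query and five chunks.
scores = np.array([0.12, 0.87, 0.45, 0.91, 0.30])
top_n = 3

ascending = np.argsort(scores)    # indices ordered from lowest to highest score
descending = ascending[::-1]      # reversed: highest score first
top_indices = descending[:top_n]  # keep only the best top_n

print(top_indices.tolist())       # [3, 1, 2]
```

The returned values are positions in the chunk list, so the matching text is looked up afterwards by index.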



In [None]:
def find_top_n_similar(query_vec, data_vecs, top_n=3):
    similarities = cosine_similarity(query_vec[np.newaxis, :], data_vecs)
    top_indices = np.argsort(similarities[0])[::-1][:top_n]
    return top_indices


# **Knowledge Indexing**

Once we have the document snippets, we save them into our vector database, as described above, and we’re finally done!

Here’s the complete picture of indexing a knowledge base.


![](
https://communitykeeper-media.s3.amazonaws.com/media/images/Knowledge_Indexing_Complete.original.png
)
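
This notebook keeps its embeddings in a NumPy array rather than in a dedicated vector database. As a rough picture of what saving snippets into an index involves, here is a minimal in-memory sketch. The class name `InMemoryVectorIndex` and its methods are hypothetical, and a real vector database adds persistence and faster approximate search on top of this idea.

```python
import numpy as np

class InMemoryVectorIndex:
    """Hypothetical minimal index: stores snippets next to unit-length vectors."""

    def __init__(self):
        self.snippets = []
        self.vectors = None  # becomes a 2-D array, one row per snippet

    def add(self, snippets, vectors):
        vectors = np.asarray(vectors, dtype=float)
        # Normalise rows so that a plain dot product equals cosine similarity.
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        self.snippets.extend(snippets)
        if self.vectors is None:
            self.vectors = vectors
        else:
            self.vectors = np.vstack([self.vectors, vectors])

    def search(self, query_vector, top_n=1):
        query_vector = np.asarray(query_vector, dtype=float)
        query_vector = query_vector / np.linalg.norm(query_vector)
        scores = self.vectors @ query_vector
        best = np.argsort(scores)[::-1][:top_n]
        return [self.snippets[i] for i in best]

index = InMemoryVectorIndex()
index.add(["about festivals", "about nawabs"], [[1.0, 0.1], [0.1, 1.0]])
print(index.search([0.9, 0.2], top_n=1))
```

Normalising once at indexing time means each query needs only a matrix-vector product, which is the main reason to separate indexing from querying.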

In [None]:
def generate_gemini_response(prompt):
    return gemini_model.generate_content(prompt)


In [None]:
def main(queries):
    chunks = load_and_preprocess_data()
    embedded_data = embed_text_data(sentence_model, chunks)
    embedded_queries = embed_text_data(sentence_model, queries)

    answers = []
    for query, query_vec in zip(queries, embedded_queries):
        top_indices = find_top_n_similar(query_vec, embedded_data, TOP_N)
        top_documents = [chunks[index] for index in top_indices]

        # Pass every retrieved chunk as context, not only the best match.
        contexts = "\n".join(top_documents)
        prompt = f"You are an expert question answering system, I'll give you a question and context, and you'll return the answer. Query: {query} Contexts: {contexts}"
        # Collect the answer; returning inside the loop would skip all later queries.
        answers.append(str(generate_gemini_response(prompt)))

    # One answer per query, separated by blank lines (a single query yields just its answer).
    return "\n\n".join(answers)


![](
https://communitykeeper-media.s3.amazonaws.com/media/images/Complete.original.png
)
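
The last step of the pipeline is assembling the prompt from the question and the retrieved snippets. A minimal sketch of that assembly follows; `build_prompt` is a hypothetical helper and the snippet texts are placeholders, so treat the wording of the instructions as one possible choice rather than a recommended template.

```python
def build_prompt(question, contexts):
    """Hypothetical helper: combine a question with numbered context snippets."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        "You are an expert question answering system. "
        "Answer the question using only the contexts below.\n"
        f"Question: {question}\n"
        f"Contexts:\n{numbered}"
    )

prompt = build_prompt(
    "When does the festival take place?",
    ["Snippet one.", "Snippet two."],  # placeholder snippets
)
print(prompt)
```

Numbering the snippets keeps them visually separate in the prompt and makes it easy to check which retrieved chunk an answer relied on.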




In [None]:
# Example usage
queries = ["What is the duration of Lucknow Mahotsav, and when does it usually take place?"]
res = main(queries)
print(res)

The duration of Lucknow Mahotsav is 10 days, and it usually takes place in the month of November or December.


In [None]:
!pip3 install gradio


In [None]:
import gradio as gr

def lucknow_rag(question):
    # `question` avoids shadowing Python's built-in `input`.
    return main([question])

app = gr.Interface(fn=lucknow_rag,
                   inputs="textbox",
                   outputs="textbox",
                   description="Lucknow RAG is a retrieval-augmented generation (RAG) model, which is a type of natural language processing (NLP) system that combines the strengths of retrieval-based and generation-based approaches. It utilizes a retrieval component to gather relevant information from a large corpus of text, and then employs a language generation model to produce a coherent and contextually appropriate output by conditioning on the retrieved information. The retrieval-augmented generation approach aims to leverage the broad knowledge available in large text corpora while retaining the ability to generate fluent and semantically meaningful text outputs.")
app.launch(debug=True)

