**Introduction to the Retrieval-Augmented Generation (RAG) System**

This Colab notebook demonstrates a Retrieval-Augmented Generation (RAG) system. The system retrieves documents that are relevant to a query and passes them, together with the query, to a language model, so that the generated response is grounded in the retrieved context.

**Key Components:**

**Language Models (OpenAI and LangChain Integration):** We use LangChain to interface with OpenAI's language models, which generate the final response to each query.

**Document Retrieval (Pinecone):** Pinecone's vector database is employed to efficiently retrieve documents that are contextually similar to the input query. These documents serve as additional context for the response generation.

**Transformer Embeddings (BERT Model):** We use a pre-trained BERT model from Hugging Face's transformers library to generate text embeddings. These embeddings are crucial for the document retrieval process in Pinecone.

**Robust Error Handling (Tenacity):** To ensure reliability, we implement a retry mechanism using the Tenacity library. This handles potential rate limits or transient errors encountered when interacting with external APIs.

**Functionality Overview:**

**Text Embedding Generation:** Converts text to vector embeddings using BERT.

**Document Retrieval:** Retrieves the top relevant documents based on query embeddings.

**RAG Response Generation:** Generates responses by augmenting the query with retrieved documents and processing through Langchain-enabled OpenAI models.

**Retry Mechanism:** Uses exponential backoff for robust API interaction.

**Testing the System:**
The system is tested with a sample query to demonstrate its capability in generating context-aware responses.

**INSTALLING PACKAGES**

In [1]:
!pip install transformers
!pip install pinecone-client
!pip install openai -q
!pip install langchain -q


Collecting pinecone-client
  Downloading pinecone_client-2.2.4-py3-none-any.whl (179 kB)
Collecting loguru>=0.5.0 (from pinecone-client)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Collecting dnspython>=2.0.0 (from pinecone-client)
  Downloading dnspython-2.4.2-py3-none-any.whl (300 kB)
Installing collected packages: loguru, dnspython, pinecone-client
Successfully installed dnspython-2.4.2 loguru-0.7.2 pinecone-client-2.2.4

**IMPORTING NECESSARY LIBRARIES**

In [2]:
import os
from getpass import getpass

import numpy as np
import openai
import pinecone
import transformers
from langchain.llms import OpenAI as LangchainOpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential
from transformers import AutoTokenizer, AutoModel



**SETTING OPENAI API KEY**

In [3]:
# The API key has been removed for security reasons; supply your own before running
os.environ["OPENAI_API_KEY"] = "ENTER API KEY"
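The cell above hard-codes a placeholder string. Since `getpass` is already imported, one alternative is to prompt for the key at run time so it never appears in the notebook. The following is a minimal sketch under that assumption; `load_api_key` and its `prompt` parameter are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import os
from getpass import getpass


def load_api_key(env_var, prompt=getpass):
    """Return the key stored in `env_var`, prompting once if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        # Nothing set yet: ask interactively so the key is not saved in a cell.
        key = prompt(f"Enter value for {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `load_api_key("OPENAI_API_KEY")` would then either reuse a key that is already in the environment or ask for it once per session.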


**INITIALIZING PINECONE API KEY AND INDEX NAME**

In [4]:
# The API key has been removed for security reasons
pinecone.init(
    api_key='ENTER API KEY',
    environment='gcp-starter'
)
index = pinecone.Index('ragiiit')
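The notebook connects to an existing index and does not show how it was populated. Because the retrieval step later reads `metadata["text"]` from each match, the stored records presumably carry the raw text in a `text` metadata field. The sketch below shows one way such records could be assembled; `build_upsert_records`, `toy_embed`, and `id_prefix` are hypothetical names, and the toy embedding exists only to show the record shape.

```python
def build_upsert_records(texts, embed, id_prefix="doc"):
    """Pair each text with an id, its vector, and metadata holding the raw text."""
    records = []
    for i, text in enumerate(texts):
        vector = [float(x) for x in embed(text)]
        records.append((f"{id_prefix}-{i}", vector, {"text": text}))
    return records


# Toy embedding used only to illustrate the record layout; a real run would
# pass a function returning the flat BERT vector for the text.
def toy_embed(text):
    return [len(text), text.count(" ")]


records = build_upsert_records(["pricing strategy", "customer segments"], toy_embed)
print(records[0])
```

With pinecone-client 2.x, a list of `(id, values, metadata)` tuples like this is a form that `index.upsert(vectors=records)` is expected to accept, provided the vector length matches the dimension the index was created with.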

**INITIALIZING LANGUAGE MODELS AND SYSTEM EMBEDDINGS**

In [5]:
# Initialize Langchain with OpenAI
llm = LangchainOpenAI()


# Load Transformer Model for Embeddings
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



**GENERATING TEXT EMBEDDINGS**

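The text_to_embedding function takes a piece of text and converts it into a vector embedding using the loaded BERT model. It tokenizes the text (truncated to 512 tokens), runs it through the model, and averages the final hidden states over all tokens to obtain a single fixed-size vector. This vector is the format that vector-based systems like Pinecone operate on.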

In [6]:
def text_to_embedding(text):
    # Tokenize the text, truncating to BERT's 512-token input limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**inputs)
    # Mean-pool the final hidden states over tokens; the result has shape (1, hidden_size)
    return outputs.last_hidden_state.mean(dim=1).detach().numpy()
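The function above averages over every token position, which is fine for a single unpadded input. If several texts were batched and padded to a common length, a common variant is to average only the positions marked by the attention mask. The sketch below illustrates that idea on plain Python lists, without loading BERT; `masked_mean_pool` is a hypothetical helper and not part of the notebook.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for j, value in enumerate(vector):
                totals[j] += value
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / kept for total in totals]


tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row stands in for padding
plain = masked_mean_pool(tokens, [1, 1, 1])    # plain mean over all three rows
masked = masked_mean_pool(tokens, [1, 1, 0])   # padding row ignored
print(plain, masked)
```

The padding row pulls the plain mean toward zero, while the masked mean reflects only the real tokens.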



**RETRIEVING RELEVANT DOCUMENTS**

The retrieve_documents function takes a query, converts it into an embedding, and then uses this embedding to retrieve relevant documents from the Pinecone index. It is the retrieval step of the RAG system, fetching the documents whose embeddings are most similar to that of the query.

In [7]:
def retrieve_documents(query, top_k=5):
    # Embed the query; this should be the same model used when the documents were indexed
    query_embedding = text_to_embedding(query)

    # Pinecone expects a flat list of floats, so drop the batch dimension;
    # include_metadata=True is needed for the matches to carry their stored text
    retrieved = index.query(
        vector=query_embedding.flatten().tolist(),
        top_k=top_k,
        include_metadata=True,
    )

    # Return the stored text of each match
    return [doc["metadata"]["text"] for doc in retrieved["matches"]]
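To make the retrieval step concrete, the sketch below scores a handful of stored vectors against a query vector and returns the best matches in a shape similar to the `matches` list used above. It is a brute-force illustration that assumes cosine similarity; the metric of a real Pinecone index is chosen when the index is created, and Pinecone's internal search is not reproduced here. `cosine_similarity` and `top_k_matches` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query_vector, records, top_k=5):
    """Return the `top_k` records most similar to the query, best first."""
    scored = [
        {"id": rid, "score": cosine_similarity(query_vector, vec), "metadata": meta}
        for rid, vec, meta in records
    ]
    scored.sort(key=lambda match: match["score"], reverse=True)
    return scored[:top_k]


records = [
    ("a", [1.0, 0.0], {"text": "revenue streams"}),
    ("b", [0.0, 1.0], {"text": "office furniture"}),
    ("c", [0.9, 0.1], {"text": "pricing strategy"}),
]
matches = top_k_matches([1.0, 0.0], records, top_k=2)
print([match["metadata"]["text"] for match in matches])
```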

**GENERATING AUGMENTED RESPONSE**

The rag_generate_response function forms the core of the RAG system. It first retrieves relevant documents based on the given query and then concatenates these documents with the query to form an augmented query. This augmented query is then fed into the Langchain-enabled OpenAI model to generate a comprehensive response that combines the original query context with the information from the retrieved documents.

In [8]:
def rag_generate_response(query):
    documents = retrieve_documents(query)
    augmented_query = query + " " + " ".join(documents)
    response = llm(augmented_query)
    return response
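The function above joins the query and the documents with single spaces, which leaves the model to guess where the question ends and the context begins. One possible refinement is to label the two parts and cap the amount of context. The sketch below assumes a simple character budget as a crude stand-in for a token limit; `build_augmented_prompt`, `max_chars`, and the prompt wording are all illustrative choices, not the notebook's method.

```python
def build_augmented_prompt(query, documents, max_chars=4000):
    """Put retrieved passages in a labelled context section ahead of the question."""
    context_parts = []
    used = 0
    for i, doc in enumerate(documents, start=1):
        entry = f"[{i}] {doc.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before the context budget is exceeded
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts) if context_parts else "(no documents retrieved)"
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query.strip()}\nAnswer:"
    )


prompt = build_augmented_prompt(
    "How to build a business model",
    ["Start with the customer segment.", "Define revenue streams."],
)
print(prompt)
```

The returned string could be passed to `llm(...)` in place of the plain concatenation.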



**ROBUST RAG RESPONSE GENERATION WITH RETRY MECHANISM**

The @retry decorator wraps the function in a retry loop: up to six attempts, with a randomized
exponential backoff of between 1 and 60 seconds between them. Transient errors, such as rate limits
or temporary network issues, are therefore retried automatically with increasing wait times instead
of failing on the first error. Because no exception filter is given, every exception type is retried,
and a RetryError is raised once all attempts are exhausted.

In [9]:

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def rag_generate_response(query):
    # Retrieve documents related to the business domain
    documents = retrieve_documents(query)

    # Create an augmented query that combines the original query with business-context documents
    augmented_query = query + " " + " ".join(documents)

    # Generate response using the augmented query
    response = llm(augmented_query)
    return response
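For readers unfamiliar with the technique, the sketch below shows exponential backoff with random jitter using only the standard library. It illustrates the idea and is not tenacity's implementation; `call_with_backoff` and its parameters are hypothetical. One deliberate difference: it re-raises the original exception after the last attempt, whereas tenacity by default wraps it in a RetryError.

```python
import random
import time


def call_with_backoff(fn, max_attempts=6, min_wait=1.0, max_wait=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `fn` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error unchanged
            # The ceiling doubles on each attempt and is capped at max_wait.
            ceiling = min(max_wait, min_wait * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


calls = {"count": 0}


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("transient failure")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, calls["count"], len(waits))
```

Passing a narrower `retry_on` tuple limits retries to errors that are plausibly transient, so that programming errors fail immediately.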

**TESTING RAG MODEL**

The **RateLimitError** encountered when testing this RAG model reflects the usage limits of the OpenAI API plan in use. Such limits are typical of free or basic-tier plans, which are designed to balance server load and user access. Exponential backoff via the tenacity library mitigates short-lived rate limiting, but retries cannot expand the underlying quota, so every attempt fails and tenacity raises a RetryError.

A higher-tier OpenAI plan would raise the API call quota and allow a higher request volume, which matters for business-critical applications. In essence, the current setup is a functional prototype within the existing constraints, and a plan upgrade is needed for scalability and reliable performance in a production environment.

In [11]:
# This section of code tests the RAG model's response generation capability.
test_query = "How to build a business model"
response = rag_generate_response(test_query)
print(response)


RetryError: RetryError[<Future at 0x7c9e3cd87d90 state=finished raised RateLimitError>]

**Project Summary:** Business-Specific RAG Model

This notebook details the development and implementation of a Retrieval-Augmented Generation (RAG) model tailored for a business-oriented QA bot. Key highlights of this project include:

**OpenAI and Langchain Integration:** Utilization of OpenAI's language models via Langchain for advanced response generation.

**Document Retrieval with Pinecone:** Efficient retrieval of relevant documents from a Pinecone vector database, ensuring contextually appropriate responses.

**BERT Embeddings:** Application of a BERT model for generating text embeddings, which drive document retrieval.

**Error Handling with Tenacity:** Implementation of a robust retry mechanism to handle potential API rate limits and ensure consistent performance.

This RAG model is aimed at the linguistic and contextual requirements of a designated business domain: once the Pinecone index holds that domain's documents, responses are grounded in them, which lets the system handle diverse, domain-specific queries.