# Simple RAG application using LangChain, IBM Generative AI Python SDK, Llama and FAISS vector store

<b>Retrieval Augmented Generation (RAG)</b>

General-purpose language models can be fine-tuned to achieve several common tasks such as sentiment analysis and named entity recognition. These tasks generally don't require additional background knowledge.

For more complex and knowledge-intensive tasks, it's possible to build a language model-based system that accesses external knowledge sources to complete tasks. This enables more factual consistency, improves reliability of the generated responses, and helps to mitigate the problem of "hallucination".

Meta AI researchers introduced a method called Retrieval Augmented Generation (RAG) to address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned, and its internal knowledge can be modified efficiently without retraining the entire model.

RAG takes an input and retrieves a set of relevant/supporting documents from a given source (e.g., Wikipedia). The documents are concatenated as context with the original input prompt and fed to the text generator, which produces the final output. This makes RAG adaptive to situations where facts could evolve over time, which is very useful because an LLM's parametric knowledge is static. RAG allows language models to bypass retraining, enabling access to the latest information for generating reliable outputs via retrieval-based generation.
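
The retrieve-then-generate flow described above can be sketched without any external library. This is a minimal illustration, not the pipeline built later in this notebook: the retriever scores passages by plain word overlap instead of embeddings, the function names are hypothetical, and the first and third corpus passages are invented for the example.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Toy retriever: return the k passages sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list) -> str:
    """Concatenate the retrieved passages as context ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Illustrative corpus (assumption: made up for this sketch)
corpus = [
    "The kit ships with an Arduino board and a set of sensors.",
    "The online platform is available in English and Spanish.",
    "Faiss is a library for similarity search over dense vectors.",
]
question = "Which languages does the online platform support?"
top = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this prompt is what would be sent to the text generator
```

In a real system the word-overlap score is replaced by vector similarity over embeddings, and the prompt is passed to an LLM.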

source: https://www.promptingguide.ai/techniques/rag

# 1. Import dependencies

In [32]:
# IBM Generative AI Python SDK:
from genai import Client, Credentials
from genai.extensions.langchain import LangChainInterface
from genai.extensions.langchain import LangChainChatInterface
from genai.schema import (
    DecodingMethod,
    ModerationHAP,
    ModerationHAPInput,
    ModerationHAPOutput,
    ModerationParameters,
    TextGenerationParameters,
)

# Langchain packages:
from langchain_text_splitters import CharacterTextSplitter
from langchain.chains import RetrievalQA # used in section 5
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.embeddings import Embeddings # base class for CustomEmbeddings in section 3
from langchain_core.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS # Faiss is a library for efficient similarity search and clustering of dense vectors.

# Supporting packages:
from typing import Any, Dict, List
import inspect
from tqdm.auto import tqdm
import time
import pandas as pd
# from sentence_transformers import SentenceTransformer
import os
import getpass
from dotenv import load_dotenv

# 2. Log in to IBM Cloud and select an LLM

In [14]:
load_dotenv()
api_key = os.environ.get('IBM_BAM_API_KEY')

In [15]:
credentials = Credentials(api_key=api_key, api_endpoint="https://bam-api.res.ibm.com")

In [16]:
model_name = "meta-llama/llama-3-8b-instruct"

In [17]:
client = Client(credentials=credentials)
llm = LangChainChatInterface(
    client=client,
    model_id=model_name,
    parameters=TextGenerationParameters(
        decoding_method=DecodingMethod.GREEDY, # Greedy decoding: always pick the most probable next token (deterministic).
        temperature=0.3, # Sampling parameter; typically only takes effect with DecodingMethod.SAMPLE
        top_k=10, # Sampling parameter: restrict sampling to the 10 most probable tokens
        top_p=0.9, # Sampling parameter: restrict sampling to the smallest token set covering 90% cumulative probability
        max_new_tokens=800,
    )
)
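
To make the decoding parameters more concrete, here is a small sketch of what greedy decoding, temperature, top_k and top_p each do to a next-token distribution. It is a toy model of the general technique, not the IBM service's implementation; the function names and the probabilities are invented for illustration.

```python
import random

def greedy(probs: dict) -> str:
    """Greedy decoding: always return the single most probable token."""
    return max(probs, key=probs.get)

def apply_temperature(probs: dict, temperature: float) -> dict:
    """Rescale a distribution; equivalent to dividing the logits by `temperature`."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}

def top_k_top_p_filter(probs: dict, top_k: int, top_p: float) -> dict:
    """Keep the top_k tokens, then the shortest prefix reaching top_p; renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample(probs: dict, top_k: int, top_p: float, rng: random.Random) -> str:
    """Sampling: draw one token at random from the filtered distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]

# Invented next-token probabilities, for illustration only
next_token_probs = {"English": 0.5, "Spanish": 0.3, "French": 0.15, "Klingon": 0.05}

print(greedy(next_token_probs))                        # always the same token
print(top_k_top_p_filter(next_token_probs, 10, 0.9))   # candidates left for sampling
print(apply_temperature(next_token_probs, 0.3))        # sharper distribution
print(sample(next_token_probs, 10, 0.9, random.Random(0)))
```

Greedy decoding never consults the filtered distribution at all, which is why the sampling parameters matter only when a sampling decoding method is selected.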

# 3. Define Custom Embeddings class

Now, to build our RAG application, we must define the embedding mechanism used to store our dataset in a vector database that can later be queried for answers.

For this purpose, we're going to define our <b>CustomEmbeddings</b> class, which inherits from the Embeddings class provided by LangChain and implements the required methods. We will base our custom class on the <b>BAAI/bge-m3</b> model, loaded through the FlagEmbedding library.

Here’s how to make the CustomEmbeddings class inherit from Embeddings:

1. Import the necessary Embeddings class from LangChain.
2. Ensure CustomEmbeddings inherits from Embeddings.
3. Implement the required methods (embed_documents and embed_query) as per the Embeddings interface.

First, let's inspect LangChain's <b>Embeddings</b> class, so we know how to build our own CustomEmbeddings class:

In [20]:
inspect.getsourcelines(Embeddings) 

(['class Embeddings(ABC):\n',
  '    """Interface for embedding models."""\n',
  '\n',
  '    @abstractmethod\n',
  '    def embed_documents(self, texts: List[str]) -> List[List[float]]:\n',
  '        """Embed search docs."""\n',
  '\n',
  '    @abstractmethod\n',
  '    def embed_query(self, text: str) -> List[float]:\n',
  '        """Embed query text."""\n',
  '\n',
  '    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:\n',
  '        """Asynchronous Embed search docs."""\n',
  '        return await run_in_executor(None, self.embed_documents, texts)\n',
  '\n',
  '    async def aembed_query(self, text: str) -> List[float]:\n',
  '        """Asynchronous Embed query text."""\n',
  '        return await run_in_executor(None, self.embed_query, text)\n'],
 8)
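
The output above shows that only two methods are abstract: embed_documents and embed_query. As a dependency-free sketch of what a subclass has to provide, the example below defines a local stand-in for the interface (EmbeddingsInterface, a hypothetical name, not the LangChain class) and a toy HashingEmbeddings implementation. The hashing trick is only an illustration and carries no semantic meaning, unlike a real model such as bge-m3.

```python
import hashlib
import math
from abc import ABC, abstractmethod
from typing import List

class EmbeddingsInterface(ABC):
    """Local stand-in mirroring the two abstract methods of the interface."""

    @abstractmethod
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""

    @abstractmethod
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query."""

class HashingEmbeddings(EmbeddingsInterface):
    """Toy embedding: count words into hashed buckets, then L2-normalize."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
        return [v / norm for v in vec]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return self._embed(text)

toy = HashingEmbeddings(dim=8)
vectors = toy.embed_documents(["online platform", "supported languages"])
print(len(vectors), len(vectors[0]))  # 2 8
```

The important property is that documents and queries go through the same embedding function, so their vectors live in the same space and can be compared.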

In [31]:
from FlagEmbedding import BGEM3FlagModel

class CustomEmbeddings(Embeddings):
    """LangChain-compatible wrapper around the dense embeddings of BAAI/bge-m3."""

    def __init__(self, model_name='BAAI/bge-m3', use_fp16=True, output_dim=1024):
        # use_fp16 speeds up encoding at the cost of a little precision
        self.model = BGEM3FlagModel(model_name, use_fp16=use_fp16)
        self.output_dim = output_dim  # expected size of each dense vector

    def embed_query(self, text: str) -> List[float]:
        # encode() returns a dict; 'dense_vecs' holds one row per input text
        embeddings = self.model.encode([text], batch_size=1, max_length=1024)['dense_vecs']
        if embeddings.shape[1] != self.output_dim:
            raise ValueError(f"Expected embedding dimension to be {self.output_dim}, but got {embeddings.shape[1]}")
        return embeddings[0].tolist()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # All texts are encoded in a single batch; texts must be non-empty
        embeddings = self.model.encode(texts, batch_size=len(texts), max_length=1024)['dense_vecs']
        if embeddings.shape[1] != self.output_dim:
            raise ValueError(f"Expected embedding dimension to be {self.output_dim}, but got {embeddings.shape[1]}")
        return embeddings.tolist()

In [33]:
custom_embeddings = CustomEmbeddings()

# 4. Create Vector Database and upload document

Here is the document we want to upload into our vector db:

In [28]:
pdf_path = "data/Arduino Engineering Kit R2 FAQ updated.pdf"

## Text splitting (chunking) for RAG applications

We need to transform long text documents into smaller chunks that are embedded, indexed, and stored, then later used for information retrieval.

Splitting documents into smaller segments called chunks is an essential step when embedding text into a vector store. RAG pipelines retrieve relevant chunks from the database using similarity metrics (such as Euclidean distance, cosine similarity, or dot product) and then pass them to the LLM, which uses them to generate an answer for us.

When we retrieve, we want to return the chunks that are semantically closest to the query. If chunks are not semantically distinct enough, we may get information that was not asked for in the query, leading to lower-quality results and a higher LLM hallucination rate.
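
For reference, the three similarity metrics mentioned above can be written in a few lines of plain Python. The vectors here are invented two-dimensional examples; real embeddings from this notebook have 1024 dimensions.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product of the normalized vectors; ignores vector length."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

query = [1.0, 0.0]
chunk_a = [2.0, 0.0]  # same direction as the query, but longer
chunk_b = [0.0, 1.0]  # orthogonal to the query

for name, chunk in [("chunk_a", chunk_a), ("chunk_b", chunk_b)]:
    print(name,
          "dot:", dot_product(query, chunk),
          "euclidean:", euclidean_distance(query, chunk),
          "cosine:", cosine_similarity(query, chunk))
```

When all vectors are normalized to unit length, the three metrics produce the same ranking, so the choice mostly matters for unnormalized embeddings.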

In [29]:
loader = PyPDFLoader(file_path=pdf_path) # reads the PDF from disk
documents = loader.load() # one Document per PDF page (not yet chunked)

# We use CharacterTextSplitter to chunk our document
text_splitter = CharacterTextSplitter(
    chunk_size=100, chunk_overlap=30, separator="\n" # sizes are measured in characters
)
docs = text_splitter.split_documents(documents=documents) # split the pages into chunks, ready for embedding
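
To show what chunk size and overlap mean, here is a simplified sliding-window splitter. It is an assumption-laden sketch: CharacterTextSplitter first splits on the separator and then merges the pieces, so its chunk boundaries will differ, and chunk_text is a hypothetical helper, not a LangChain function.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap repeats the tail of each chunk at the head of the next one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.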

## Upload data into FAISS

<b>FAISS Introduction:</b> 

FAISS is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Meta's Fundamental AI Research group.

FAISS can also compress vectors (for example, with product quantization) so that large collections still fit in RAM.

<b>In the code below</b> we pass two arguments to <b>FAISS.from_documents</b>:<br>
(1) the document chunks, and<br>
(2) the embeddings object, which the function uses to turn the chunks into vectors.

It then stores the vectors in RAM on the local machine as a vector store object.

In [34]:
vectorstore = FAISS.from_documents(docs, custom_embeddings)
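
Conceptually, the simplest kind of vector index stores every vector and compares the query against all of them. The sketch below illustrates that idea in plain Python; FlatIndex is a hypothetical class, not the Faiss API, and the two-dimensional vectors and chunk labels are invented. Faiss implements the same idea (and faster approximate variants) in optimized C++.

```python
import math

class FlatIndex:
    """Brute-force nearest-neighbour index: compares the query with every stored vector."""

    def __init__(self):
        self._vectors = []
        self._chunks = []

    def add(self, vector, chunk):
        """Store a vector together with the text chunk it represents."""
        self._vectors.append(list(vector))
        self._chunks.append(chunk)

    def search(self, query, k=1):
        """Return the k (distance, chunk) pairs closest to the query by Euclidean distance."""
        scored = [(math.dist(query, v), c) for v, c in zip(self._vectors, self._chunks)]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]

index = FlatIndex()
index.add([1.0, 0.0], "chunk about languages")
index.add([0.0, 1.0], "chunk about shipping")
index.add([0.9, 0.1], "chunk about the platform")

hits = index.search([1.0, 0.0], k=2)
for distance, chunk in hits:
    print(round(distance, 3), chunk)
```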

We can also persist the index to local disk:

In [35]:
vectorstore.save_local("faiss_index_db") # our index name

# 5. Query Vector Database

Now we can query our db:

1. <b>Loading the Vector Store with FAISS</b><br>
new_vectorstore = FAISS.load_local("faiss_index_db", custom_embeddings, allow_dangerous_deserialization=True)

    - <b>load_local("faiss_index_db", custom_embeddings)</b>: Loads a FAISS index from the local folder "faiss_index_db". This index is used for efficient vector similarity search. <b>custom_embeddings</b> is the embedding model used to embed incoming queries; it must be the same model that produced the vectors stored in the index.

    - <b>allow_dangerous_deserialization=True:</b> Opts in to loading pickle-serialized data, which can execute arbitrary code while it is being loaded. Set this to True only for index files that you created yourself or otherwise trust.

2. <b>Creating a Retrieval-based QA System</b><br>
    - <b>RetrievalQA.from_chain_type(...)</b>: This method initializes a retrieval-augmented QA system.
    - <b>llm=llm</b>: Specifies the language model (LLM) to be used for generating responses. llm should be an instance of a large language model like LLaMA.
    - <b>chain_type="stuff"</b>: Specifies the type of chain to use. In this context, "stuff" refers to a specific chain type where the relevant information is "stuffed" into the prompt for the language model to process.
    - <b>retriever=new_vectorstore.as_retriever()</b>: Converts the new_vectorstore (FAISS index) into a retriever object that the QA system can use to find relevant documents based on the query. The retriever uses the FAISS index to find the most similar vectors (i.e., the most relevant documents).

3. <b>Running the QA System</b><br>
    - <b>qa.run("Which languages does the online platform support?")</b>: Executes the QA system with the given query.
        - The system uses the retriever to find relevant documents in the vector store.
        - These documents are then passed to the LLM, which generates a response based on the retrieved information.
    - <b>res</b>: Stores the output or answer generated by the QA system in response to the query.
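
To see why allow_dangerous_deserialization deserves caution, consider how Python's pickle format works: the serialized bytes can name any callable, and loading them calls it. The sketch below uses a harmless callable (int) as a stand-in; Payload is a hypothetical class written for this demonstration. To the best of our knowledge, the LangChain FAISS wrapper stores its document store with pickle, which is why the flag exists.

```python
import pickle

class Payload:
    """On unpickling, pickle calls whatever callable __reduce__ returned at dump time."""

    def __reduce__(self):
        # Harmless stand-in: an attacker could name a far more dangerous callable here.
        return (int, ("42",))

blob = pickle.dumps(Payload())  # the bytes now say: call int("42") when loaded
result = pickle.loads(blob)     # loading runs that call
print(type(result).__name__, result)  # int 42
```

The loaded object is not a Payload at all: the data decided which function ran. This is safe for an index file you saved yourself and unsafe for a file of unknown origin.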

In [36]:
new_vectorstore = FAISS.load_local("faiss_index_db", custom_embeddings, allow_dangerous_deserialization=True) # load our vectorstore db from the local folder
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=new_vectorstore.as_retriever())
# Note: recent LangChain releases deprecate qa.run in favour of qa.invoke({"query": ...})["result"]
res = qa.run("Which languages does the online platform support?")


In [37]:
res

'The online platform is currently available in English and Spanish.'