# RAG Evaluation with LangChain and KDB.AI

This notebook shows how to use LangChain tooling to evaluate a basic Retrieval-Augmented Generation (RAG) system.

The evaluation uses [LangChain's String Evaluators](https://python.langchain.com/docs/guides/evaluation/string/) to assess both conciseness and correctness. KDB.AI serves as the knowledge base from which semantically relevant content is retrieved.

## Aim

1. Build basic RAG pipeline
2. Calculate conciseness
3. Calculate correctness


## 0. Setup

### Install packages 

Before running this notebook, ensure you have installed the required dependencies provided in the `requirements.txt`. Refer to the [README](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#install-python-packages) for instructions on using `pip install` to set up the necessary environment.

In [None]:
#pip install -r requirements.txt

### Import packages

In [1]:
import os
from getpass import getpass
import kdbai_client as kdbai
import time

from dotenv import load_dotenv
load_dotenv()

True

## 1. Build basic RAG pipeline

Load a text file that contains a State of the Union address. Then build a generic QA system using LangChain with KDB.AI as the vector store.


### Set API Keys

To follow this example you will need an [OpenAI API Key](https://platform.openai.com/apps).

You can create one for free by registering using the link provided. Once you have the key, you can add it below.

In [2]:
os.environ["OPENAI_API_KEY"] = (
    os.environ["OPENAI_API_KEY"]
    if "OPENAI_API_KEY" in os.environ
    else getpass("OpenAI API Key: ")
)

### KDB.AI Cloud
To use KDB.AI Cloud, you will need two session details: a URL endpoint and an API key. To get these, you can sign up for KDB.AI Cloud for free.

You can connect to a KDB.AI Cloud session using `kdbai.Session`, passing the session URL endpoint and API key from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, they are used automatically to connect. If they do not exist, the cell below prompts you to enter your session URL endpoint and API key.

In [3]:
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)
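The same "read from the environment, otherwise prompt" pattern is used for all three credentials, so it could be factored into one small helper. The following is a minimal sketch of that idea; `get_setting` is a hypothetical name introduced here for illustration and is not part of `kdbai_client` or LangChain.

```python
import os
from getpass import getpass


def get_setting(name, prompt, secret=True, environ=None):
    """Return `name` from the environment, prompting only if it is missing.

    Hypothetical helper for this notebook, not a library function.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value:
        return value
    # Hide typed input for secrets such as API keys; echo it otherwise.
    return getpass(prompt) if secret else input(prompt)


# Example with an explicit mapping, so the real environment is not touched.
example_env = {"KDBAI_ENDPOINT": "https://example.invalid/endpoint"}
endpoint = get_setting(
    "KDBAI_ENDPOINT", "KDB.AI endpoint: ", secret=False, environ=example_env
)
```

With such a helper, each credential becomes a single call, and the prompt is only shown when the variable is absent or empty.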

### Load data and embeddings

In [4]:
# langchain packages
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import KDBAI
from langchain import HuggingFaceHub
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

doc = TextLoader("data/state_of_the_union.txt").load()
# Chunk the document into pieces of at most 500 characters using LangChain's RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)

# split_documents returns a list of all the chunks created; keep only the text of each chunk
pages = [p.page_content for p in text_splitter.split_documents(doc)]

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
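To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a deliberately simplified sketch of fixed-size character chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and lines before falling back to smaller units); `fixed_size_chunks` is a hypothetical function used only to illustrate the two parameters.

```python
def fixed_size_chunks(text, chunk_size=500, chunk_overlap=0):
    """Split `text` into windows of at most `chunk_size` characters.

    Simplified illustration only: it cuts at fixed offsets and ignores
    paragraph, line, and word boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# 1,500 characters of placeholder text, split as in the notebook settings.
sample = "word " * 300
chunks = fixed_size_chunks(sample, chunk_size=500, chunk_overlap=0)
```

With `chunk_overlap=0` the chunks partition the text exactly, so joining them reproduces the original string. A positive overlap repeats the tail of each chunk at the head of the next, which can help when an answer straddles a chunk boundary.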

### Connect to KDB.AI and save data

In [5]:
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

# Define table schema
rag_schema = {
    "columns": [
        {"name": "id", "pytype": "str"},
        {"name": "text", "pytype": "bytes"},
        {
            "name": "embeddings",
            "pytype": "float32",
            "vectorIndex": {"dims": 1536, "metric": "L2", "type": "flat"},
        },
    ]
}

# First ensure the table does not already exist
try:
    session.table("rag_langchain").drop()
    time.sleep(3)
except kdbai.KDBAIException:
    pass

table = session.create_table("rag_langchain", rag_schema)

# Save the data to KDB.AI
vecdb_kdbai = KDBAI(table, embeddings)
vecdb_kdbai.add_texts(texts=pages)

['09c86122-d028-42de-b318-c7349209bcd6',
 'c30d113d-2fef-4b28-ae5f-84e67fab3109',
 '4e307828-0ff6-469e-9f57-30fd2c031f55',
 'db7c750a-dbd2-411a-9e16-e0663f4c88cd',
 'df645be4-73d7-4ffc-977e-b0d6885f9794',
 'a9a7db05-08dc-41cd-8b43-aabbfffa1d9c',
 'c60efca0-5f6e-4259-b9a7-ad48b8cc41e8',
 '54aac53b-ed54-4526-a13a-f764c955eaa1',
 '78faf54c-e332-481c-9cf5-8831dd607410',
 '65d5dea7-49a3-4cf7-a1e7-a0652327d0eb',
 '9aee930a-0c03-4627-bab1-d761ae8a5640',
 '8856d2a8-0416-4ad7-8d6f-0d1e2cf5b1c4',
 'eea29139-b64b-4afd-904c-b32082792c62',
 '9030e40b-0ddc-4ba8-9d40-8ac69e457cf8',
 '842a14c0-58e1-4065-9096-8866cc94ed35',
 '68fe53be-9518-4a6d-80fa-8a590132f1d9',
 'ddc10548-c5a3-4310-90e7-b1e05a34f446',
 '50d6ac94-62a2-4ba7-a3cb-811d7b064de3',
 'b4571df1-ac2a-4b3f-aa50-93d55a877b97',
 '5f42a5d4-fbc7-4ba9-a98a-d27eb44c7c8f',
 'ace0f1c5-f8b2-4180-a566-d3555dc5cb80',
 '346382ce-0532-47e5-a072-aad82561335b',
 '628ab41c-8394-4a5f-a0a3-850094c27988',
 'dad0bb8e-d974-414b-a694-85649ad67662',
 '89d2ad9b-33a6-

### Similarity Search

In [7]:
query = "What improvements could be made in infrastructure?"
# query holds results of the similarity search, the closest related chunks to the query.
query_sim = vecdb_kdbai.similarity_search(query)
query_sim

[Document(page_content='Because we know that when the middle class grows, the poor have a ladder up and the wealthy do very well. \n\nAmerica used to have the best roads, bridges, and airports on Earth. \n\nNow our infrastructure is ranked 13th in the world. \n\nWe won’t be able to compete for the jobs of the 21st Century if we don’t fix that. \n\nThat’s why it was so important to pass the Bipartisan Infrastructure Law—the most sweeping investment to rebuild America in history.', metadata={'id': 'dad0bb8e-d974-414b-a694-85649ad67662', 'embeddings': array([ 0.00343209,  0.01247914,  0.00845823, ..., -0.02029975,
        -0.0327919 , -0.01868618], dtype=float32)})]
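The table schema above declares a `flat` index with the `L2` metric. Conceptually, a flat index compares the query vector with every stored vector and returns the closest ones. The sketch below illustrates that idea on toy 2-dimensional vectors in pure Python; it is not KDB.AI's implementation, and `l2_distance` and `flat_search` are hypothetical names.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query_vec, vectors, k=1):
    """Brute-force search: score every vector, return the k nearest as (index, distance)."""
    scored = [(idx, l2_distance(query_vec, vec)) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


# Toy data standing in for the 1536-dimensional embeddings in the table.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = flat_search([0.9, 1.2], toy_vectors, k=2)
```

Here the vector at index 1 is closest to the query, followed by the vector at index 0. In the notebook, the same ranking is done by KDB.AI over the embedded chunks, and `similarity_search` returns the corresponding documents.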

### Question-answering bot using GPT-3.5

The code below defines a question-answering bot that combines OpenAI's GPT-3.5 Turbo for generating responses and a retriever that accesses the KDB.AI vector database to find relevant information.

In [8]:
# GPT-3.5 generates the answer; the KDB.AI retriever supplies the K most similar chunks
K = 10
qabot = RetrievalQA.from_chain_type(
    chain_type="stuff",
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0.0),
    retriever=vecdb_kdbai.as_retriever(search_kwargs=dict(k=K)),
    return_source_documents=True,
)

# testing it out 
print(query)
print("-----")
pred = qabot(dict(query=query))["result"]
print(pred)

What improvements could be made in infrastructure?
-----
Some improvements that could be made in infrastructure include:

1. Rebuilding and repairing highways and bridges that are in disrepair.
2. Building a national network of 500,000 electric vehicle charging stations.
3. Replacing poisonous lead pipes to ensure clean water for every American.
4. Providing affordable high-speed internet access for all Americans, including urban, suburban, rural, and tribal communities.
5. Modernizing airports, ports, and waterways.
6. Investing in renewable energy production, such as solar and wind, to double America's clean energy production.
7. Lowering the cost of electric vehicles and promoting their use.
8. Weatherizing homes and businesses to improve energy efficiency.
9. Investing in emerging technologies and American manufacturing to compete with global competitors like China.
10. Ensuring that infrastructure projects are made in America, supporting domestic manufacturing and supply chains.
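The `chain_type="stuff"` setting means the retrieved chunks are placed ("stuffed") together into a single prompt that is sent to the LLM in one call. The sketch below shows the general shape of that step; the template wording and the `build_stuffed_prompt` function are assumptions made for illustration and do not reproduce LangChain's actual prompt.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block, then append the question.

    Hypothetical template, shown only to illustrate the "stuff" strategy.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the K documents returned by the retriever.
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuffed_prompt(
    "What improvements could be made in infrastructure?", example_chunks
)
```

Because every chunk goes into one prompt, the combined size of the K chunks plus the question has to fit within the model's context window, which is one reason to keep chunks small when K is large.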


## 2. Calculate Conciseness

Let's measure the conciseness of the answer the QA bot returned using LangChain's `load_evaluator` function with `criteria` set to conciseness. In this example we use GPT-4 as the LLM that performs the evaluation.

In [11]:
# GPT-4 as evaluator
# Note: this rebinds print to pprint for the rest of the notebook
from pprint import pprint as print
from langchain.evaluation import load_evaluator

evaluation_llm = ChatOpenAI(model="gpt-4")
evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluation_llm)

eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=query,
)
print(eval_result)

{'reasoning': 'The criterion for this task is conciseness. To evaluate the '
              "submission on this basis, let's consider the following:\n"
              '\n'
              '1. The submission begins with a brief introduction to the '
              'topic, which sets the context for the following points. It does '
              'not provide any unnecessary information.\n'
              '\n'
              '2. Each point in the list provides a clear and specific '
              'improvement that could be made to infrastructure. These points '
              'are directly related to the question and do not contain any '
              'extraneous details or explanations.\n'
              '\n'
              '3. The submission does not include any additional sentences or '
              'paragraphs outside of the introduced improvements. This '
              'suggests that the writer was focused on providing a direct '
              'answer to the question without adding unnecessary
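The printed result above is cut off, so only the `reasoning` text is visible. LangChain's criteria evaluators are documented as returning a dictionary with `reasoning`, `value` (`"Y"` or `"N"`) and `score` (`1` or `0`). Assuming the installed version follows that format, a small helper such as the hypothetical `summarize_criteria_result` below can reduce a result to a pass/fail flag. The dictionary used here is a hand-written placeholder, not output from the notebook.

```python
def summarize_criteria_result(result):
    """Reduce a criteria-evaluator result dict to a compact summary.

    Assumes the keys 'reasoning', 'value' and 'score'; missing keys are tolerated.
    """
    score = result.get("score")
    return {
        "passed": score == 1,
        "value": result.get("value"),
        # Keep only the start of the reasoning for quick inspection.
        "reasoning_preview": (result.get("reasoning") or "")[:80],
    }


# Hand-written placeholder, NOT real evaluator output.
placeholder_result = {
    "reasoning": "Placeholder reasoning text.",
    "value": "Y",
    "score": 1,
}
summary = summarize_criteria_result(placeholder_result)
```

In the notebook, the same helper could be applied to `eval_result` to check the verdict without reading the full reasoning.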

## 3. Calculate Correctness

We can use the same `load_evaluator` function to calculate correctness by switching to the `labeled_criteria` evaluator and setting `criteria` to correctness.

Let's try a different query this time.

In [12]:
query2 = "How many jobs were created in the country due the electric vehicle manufacturing industry?"
print(query2)
print("-----")
pred2 = qabot(dict(query=query2))["result"]
print(pred2)

('How many jobs were created in the country due the electric vehicle '
 'manufacturing industry?')
'-----'
('The passage states that Ford is investing $11 billion to build electric '
 'vehicles, creating 11,000 jobs across the country. Additionally, GM is '
 'making the largest investment in its history—$7 billion to build electric '
 'vehicles, creating 4,000 jobs in Michigan. Therefore, a total of 15,000 jobs '
 'were created in the country due to the electric vehicle manufacturing '
 'industry mentioned in the passage.')


The `labeled_criteria` evaluator accepts a reference answer against which the prediction is checked for correctness. Let's pass one reference that matches the information returned and one that contradicts it.

In [13]:
# Reference matches
evaluator2 = load_evaluator("labeled_criteria", criteria="correctness", llm=evaluation_llm, requires_reference=True)

eval_result2 = evaluator2.evaluate_strings(
    prediction=pred2,
    input=query2,
    reference="15000 jobs were created due to manufacturing of electric vehicles."
)

print(eval_result2)

{'reasoning': 'The criterion for this assessment is correctness, accuracy, and '
              'factuality. We need to verify if the submitted answer is '
              'correct and factual based on the given input and the '
              'reference.\n'
              '\n'
              'The input asks for the number of jobs created in the country '
              'due to the electric vehicle manufacturing industry. \n'
              '\n'
              'The submission provided a detailed answer, stating that Ford '
              "and GM's investments in electric vehicle manufacturing resulted "
              'in a total of 15,000 jobs across the country. \n'
              '\n'
              'Comparing this with the reference, which states that 15,000 '
              'jobs were created due to the manufacturing of electric '
              "vehicles, we can see that the submission's information aligns "
              'with the reference, indicating it is accurate and factual. \n'
          

In [14]:
# Reference contradicts
eval_result3 = evaluator2.evaluate_strings(
    prediction=pred2,
    input=query2,
    reference="12000 jobs were created due to manufacturing of electric vehicles."
)

print(eval_result3)

{'reasoning': 'The criterion to assess is the correctness, accuracy, and '
              'factualness of the submission.\n'
              '\n'
              '1. Correctness: The submission provides details about the '
              'number of jobs created due to the investment in electric '
              'vehicle manufacturing by Ford and GM. These details align with '
              'the input question.\n'
              '\n'
              '2. Accuracy: The submission states that a total of 15,000 jobs '
              'were created. This is the sum of the 11,000 from Ford and 4,000 '
              'from GM.\n'
              '\n'
              '3. Factualness: The reference provided states that 12,000 jobs '
              'were created due to the manufacturing of electric vehicles, '
              "which contradicts the submission's claim of 15,000 jobs. \n"
              '\n'
              'Based on this analysis, the submission fails to meet the '
              'factualness criterium, 
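The two cells above run the same prediction against two different references. That comparison can be wrapped in a loop over references, as in the sketch below. To keep the example runnable without an API key, it uses `FakeEvaluator`, a hypothetical stand-in that only mimics the `evaluate_strings` interface with a naive string check; it does not reproduce GPT-4's judgement, and `score_against_references` is likewise an illustrative name.

```python
class FakeEvaluator:
    """Stand-in for a labeled-criteria evaluator (illustration only).

    It marks a prediction correct when the first token of the reference
    (for example "15000") appears in the prediction once commas are removed.
    """

    def evaluate_strings(self, *, prediction, input, reference):
        number = reference.split()[0]
        matches = number in prediction.replace(",", "")
        return {
            "reasoning": "stub",
            "value": "Y" if matches else "N",
            "score": int(matches),
        }


def score_against_references(evaluator, prediction, question, references):
    """Evaluate one prediction against several references; map reference -> score."""
    return {
        ref: evaluator.evaluate_strings(
            prediction=prediction, input=question, reference=ref
        )["score"]
        for ref in references
    }


scores = score_against_references(
    FakeEvaluator(),
    prediction="a total of 15,000 jobs were created",
    question="How many jobs were created?",
    references=["15000 jobs were created.", "12000 jobs were created."],
)
```

In the notebook, passing `evaluator2`, `pred2` and `query2` instead of the stand-in values would send each reference to the real evaluator, at the cost of one LLM call per reference.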