# RAG Evaluation Methods


## LangChain Conciseness, Correctness

These criteria measure how concise (to the point) and how correct a response is.

### 1. Build basic RAG pipeline using GPT-3.5

We load a text file that contains a State of the Union address, then build a generic QA system using LangChain with KDB.AI as the vector store.

In [1]:
!pip install git+https://github.com/KxSystems/langchain.git@KDB.AI#subdirectory=libs/langchain -q

In [2]:
import os
from getpass import getpass
import kdbai_client as kdbai
import time

from dotenv import load_dotenv
load_dotenv()

True

In [3]:
os.environ["OPENAI_API_KEY"] = (
    os.environ["OPENAI_API_KEY"]
    if "OPENAI_API_KEY" in os.environ
    else getpass("OpenAI API Key: ")
)

In [4]:
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

In [5]:
# langchain packages
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import KDBAI
from langchain import HuggingFaceHub
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

doc = TextLoader("data/state_of_the_union.txt").load()
# Chunk the document into 500-character chunks using LangChain's text splitter "RecursiveCharacterTextSplitter"
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)

# split_documents returns a list of all the chunks created; keep only the text of each chunk
pages = [p.page_content for p in text_splitter.split_documents(doc)]

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

rag_schema = {
    "columns": [
        {"name": "id", "pytype": "str"},
        {"name": "text", "pytype": "bytes"},
        {
            "name": "embeddings",
            "pytype": "float32",
            "vectorIndex": {"dims": 1536, "metric": "L2", "type": "flat"},
        },
    ]
}


In [6]:
# First ensure the table does not already exist
try:
    session.table("rag_langchain").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

table = session.create_table("rag_langchain", rag_schema)

In [None]:
vecdb_kdbai = KDBAI(table, embeddings)
vecdb_kdbai.add_texts(texts=pages)

In [8]:
query2 = "What improvements could be made in infrastructure?"
# query_sim holds results of the similarity search, the closest related chunks to the query.
query_sim = vecdb_kdbai.similarity_search(query2)
query_sim

[Document(page_content='Because we know that when the middle class grows, the poor have a ladder up and the wealthy do very well. \n\nAmerica used to have the best roads, bridges, and airports on Earth. \n\nNow our infrastructure is ranked 13th in the world. \n\nWe won’t be able to compete for the jobs of the 21st Century if we don’t fix that. \n\nThat’s why it was so important to pass the Bipartisan Infrastructure Law—the most sweeping investment to rebuild America in history.', metadata={'id': '42886e69-0589-49b5-8de8-3cdac385e545', 'embeddings': array([ 0.0033795 ,  0.01255523,  0.00851544, ..., -0.0206088 ,
        -0.0326306 , -0.01865721], dtype=float32)})]
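The schema above requests a `flat` index with the `L2` metric, which amounts to comparing the query vector against every stored vector and keeping the closest ones. The following is a minimal sketch of that idea in plain Python, not KDB.AI's implementation. The function names and the toy 3-dimensional vectors are assumptions made for illustration (the real embeddings have 1536 dimensions).

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(query_vec, rows, k=1):
    """Exhaustive nearest-neighbour search: score every row, keep the k closest.

    rows is a list of (text, vector) pairs. This illustrates what a flat
    index does conceptually; it is not the KDB.AI API.
    """
    ranked = sorted(rows, key=lambda row: l2_distance(query_vec, row[1]))
    return ranked[:k]

# Toy data: hypothetical 3-dimensional "embeddings".
rows = [
    ("chunk about infrastructure", [0.9, 0.1, 0.0]),
    ("chunk about healthcare", [0.0, 0.8, 0.2]),
    ("chunk about taxes", [0.1, 0.1, 0.9]),
]
nearest = flat_search([1.0, 0.0, 0.0], rows, k=1)
```

A flat index is exact but scans every vector, so its cost grows linearly with the number of chunks.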

In [9]:
# GPT-3.5 generates the answer; the KDB.AI retriever supplies the top K chunks
K = 10
qabot = RetrievalQA.from_chain_type(
    chain_type="stuff",
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0.0),
    retriever=vecdb_kdbai.as_retriever(search_kwargs=dict(k=K)),
    return_source_documents=True,
)
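With `chain_type="stuff"`, the K retrieved chunks are concatenated ("stuffed") into a single prompt that is sent to the LLM in one call. The sketch below illustrates that assembly step only. The function name and the prompt wording are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question.

    Hypothetical template: the real chain uses its own prompt text.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

example_prompt = build_stuff_prompt(
    "What improvements could be made in infrastructure?",
    ["America used to have the best roads.", "Now our infrastructure is ranked 13th."],
)
```

Because every chunk goes into one prompt, K times the chunk size must fit in the model's context window, which is one reason to pair K = 10 with a 16k-context model.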

In [10]:
# testing it out 
print(query2)
print("-----")
pred = qabot(dict(query=query2))["result"]
print(pred)

What improvements could be made in infrastructure?
-----
Some improvements that could be made in infrastructure include:

1. Rebuilding and repairing roads, bridges, and highways that are in disrepair.
2. Building a national network of 500,000 electric vehicle charging stations.
3. Replacing poisonous lead pipes to ensure clean water for every American.
4. Providing affordable high-speed internet access for all Americans, including urban, suburban, rural, and tribal communities.
5. Modernizing airports, ports, and waterways.
6. Investing in renewable energy production, such as solar and wind, to promote clean energy and reduce reliance on fossil fuels.
7. Weatherizing homes and businesses to improve energy efficiency and reduce costs.
8. Investing in emerging technologies and American manufacturing to compete with global competitors like China.
9. Ensuring that infrastructure projects are made in America, supporting domestic manufacturing and supply chains.
10. Increasing investments i

### 2. Calculate Conciseness using GPT-4

In [11]:
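# GPT-4 as evaluator
# Note: this rebinds print to pprint for the rest of the notebook,
# which is why later printed strings appear quoted and wrapped.
from pprint import pprint as print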
from langchain.evaluation import load_evaluator
evaluation_llm = ChatOpenAI(model="gpt-4")
evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluation_llm)

eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=query2,
)
print(eval_result)

{'reasoning': 'The criterion in question is conciseness. This refers to the '
              'submission being brief and directly expressing the points '
              'without unnecessary detail. \n'
              '\n'
              'Looking at the submission, it provides a detailed list of 10 '
              'possible improvements to infrastructure. Each point is stated '
              'directly and briefly, without any excessive elaboration or '
              'unnecessary detail. It is also arranged in a clear and '
              'organized fashion, which aids in its conciseness. \n'
              '\n'
              'Therefore, the submission can be considered concise and to the '
              'point. \n'
              '\n'
              'Y',
 'score': 1,
 'value': 'Y'}
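The result above has three fields: the evaluator's free-text reasoning, a final `Y`/`N` verdict, and a numeric score of 1 or 0. As a rough sketch of how such a result could be derived from raw evaluator text, the helper below reads the verdict from the last line. The function name and parsing rule are assumptions for illustration and are not LangChain's parser.

```python
def parse_verdict(llm_text):
    """Turn evaluator text that ends in a Y or N line into a result dict.

    Hypothetical helper: mirrors the shape of the output shown above,
    where the reasoning field still includes the final verdict line.
    """
    text = llm_text.strip()
    lines = text.splitlines()
    verdict = lines[-1].strip().upper() if lines else ""
    if verdict not in ("Y", "N"):
        # No clear verdict: leave score and value undefined.
        return {"reasoning": text, "score": None, "value": None}
    return {"reasoning": text, "score": 1 if verdict == "Y" else 0, "value": verdict}

parsed = parse_verdict("The submission is brief and direct.\n\nY")
```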


### 3. Calculate Correctness using GPT-4

In [16]:
query3 = "How many jobs were created in the country due the electric vehicle manufacturing industry?"
print(query3)
print("-----")
pred3 = qabot(dict(query=query3))["result"]
print(pred3)

('How many jobs were created in the country due the electric vehicle '
 'manufacturing industry?')
'-----'
('The passage states that Ford is investing $11 billion to build electric '
 'vehicles, creating 11,000 jobs across the country. Additionally, GM is '
 'making the largest investment in its history—$7 billion to build electric '
 'vehicles, creating 4,000 jobs in Michigan. Therefore, a total of 15,000 jobs '
 'were created in the country due to the electric vehicle manufacturing '
 'industry mentioned in the passage.')


In [18]:
# reference matches
evaluator3 = load_evaluator("labeled_criteria", criteria="correctness", llm=evaluation_llm, requires_reference=True)

eval_result3 = evaluator3.evaluate_strings(
    prediction=pred3,
    input=query3,
    reference="15000 jobs were created due to manufacturing of electric vehicles."
)

print(eval_result3)

{'reasoning': 'The criterion for this task is the correctness, accuracy, and '
              'factualness of the submitted answer. \n'
              '\n'
              'The reference states that 15000 jobs were created due to the '
              'manufacturing of electric vehicles. The submission states the '
              'same result, mentioning that Ford created 11,000 jobs and GM '
              'created 4,000 jobs, therefore a total of 15,000 jobs were '
              'created. \n'
              '\n'
              'The submission matches the reference and appears to be accurate '
              'and factual, suggesting that the submission meets the '
              'criterion. \n'
              '\n'
              'Y',
 'score': 1,
 'value': 'Y'}


In [19]:
# reference contradicts
eval_result4 = evaluator3.evaluate_strings(
    prediction=pred3,
    input=query3,
    reference="12000 jobs were created due to manufacturing of electric vehicles."
)

print(eval_result4)

{'reasoning': 'The criterion for this assessment is the correctness of the '
              'submission. We have to determine if the submission is accurate '
              'and factual. The submission states that due to the electric '
              'vehicle manufacturing industry, 15,000 jobs were created. '
              'However, the reference provided states that there were 12,000 '
              'jobs created due to the manufacturing of electric vehicles. '
              'Therefore, the submission does not match the reference in terms '
              'of the number of jobs created. Hence, the submission does not '
              'meet the correctness criterion.\n'
              '\n'
              'N',
 'score': 0,
 'value': 'N'}
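A single reference answer gives one pass/fail data point. In practice the same check is usually run over a set of test cases and summarized as a pass rate. The sketch below shows that loop under stated assumptions: `stub_evaluate_strings` is a hypothetical offline stand-in that accepts the same keyword arguments as `evaluate_strings` and simply compares the numbers in the reference and the prediction, so the example runs without an API key.

```python
import re

def stub_evaluate_strings(*, prediction, input, reference):
    """Hypothetical stand-in for an LLM evaluator: passes when every number
    in the reference also appears in the prediction (thousands separators ignored)."""
    def numbers(s):
        return set(re.findall(r"\d+", s.replace(",", "")))
    ok = numbers(reference) <= numbers(prediction)
    return {"score": int(ok), "value": "Y" if ok else "N"}

def pass_rate(cases, evaluate):
    """Run evaluate over all cases and return the mean score."""
    scores = [evaluate(**case)["score"] for case in cases]
    return sum(scores) / len(scores) if scores else 0.0

question = "How many jobs were created by electric vehicle manufacturing?"
answer = "Ford is creating 11,000 jobs and GM 4,000 jobs, a total of 15,000 jobs."
cases = [
    {"prediction": answer, "input": question, "reference": "15000 jobs were created."},
    {"prediction": answer, "input": question, "reference": "12000 jobs were created."},
]
rate = pass_rate(cases, stub_evaluate_strings)
```

To use the GPT-4 evaluator instead, `evaluator3.evaluate_strings` could be passed as the `evaluate` argument, since it takes the same keyword arguments and returns a dict with a `score` field.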


## Other Metrics

### BLEU

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.

In [None]:
!pip install evaluate

In [None]:
!python3 -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"

In [None]:
import evaluate

In [29]:
bleu = evaluate.load("bleu")

In [30]:
original = ["The new study found that eating chocolate can improve cognitive function."]

In [31]:
summary1 = ["Chocolate can improve brain health."]
summary2 = ["A new study suggests that chocolate consumption can boost cognitive performance."]

In [32]:
bleu.compute(predictions=original, references=summary1)

{'bleu': 0.0,
 'precisions': [0.25, 0.09090909090909091, 0.0, 0.0],
 'brevity_penalty': 1.0,
 'length_ratio': 2.0,
 'translation_length': 12,
 'reference_length': 6}

In [33]:
bleu.compute(predictions=original, references=summary2)

{'bleu': 0.0,
 'precisions': [0.5833333333333334, 0.09090909090909091, 0.0, 0.0],
 'brevity_penalty': 1.0,
 'length_ratio': 1.0,
 'translation_length': 12,
 'reference_length': 12}

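Both overall BLEU scores are 0.0, so BLEU itself does not separate the two summaries here. Neither pair shares a 3-gram or 4-gram, and because the score is a geometric mean of the 1-gram to 4-gram precisions, a single zero precision zeroes the result. Only the unigram precision favors Summary 2 (0.58 versus 0.25), because a higher proportion of the original sentence's words also appear in it.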
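To see where the precisions and the overall score of 0.0 in the outputs above come from, here is a minimal from-scratch sketch of single-reference BLEU without smoothing. It is an illustration and not the `evaluate` implementation. The regex tokenizer is an assumption that approximates the library's default tokenization for these simple sentences.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: split into words and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_sketch(prediction, reference, max_n=4):
    cand, ref = tokenize(prediction), tokenize(reference)
    if not cand:
        return {"bleu": 0.0, "precisions": [0.0] * max_n, "brevity_penalty": 0.0}
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(clipped / total if total else 0.0)
    # Brevity penalty: only penalizes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    if min(precisions) > 0:
        score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    else:
        score = 0.0  # any zero precision zeroes the geometric mean
    return {"bleu": score, "precisions": precisions, "brevity_penalty": bp}

original_text = "The new study found that eating chocolate can improve cognitive function."
result1 = bleu_sketch(original_text, "Chocolate can improve brain health.")
result_same = bleu_sketch(original_text, original_text)
```

In this sketch `result1["precisions"]` starts with 3/12 and 1/11 followed by two zeros, so the geometric mean collapses to 0.0. This is also why BLEU tends to be uninformative on single short sentences unless smoothing is applied.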

BLEU’s output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Even good human translations rarely attain a score of 1, since that would require the candidate to be identical to one of the reference translations. A score of 1 is therefore not a realistic target.

Let's test this by giving it the identical text:

In [34]:
bleu.compute(predictions=original, references=["The new study found that eating chocolate can improve cognitive function."])

{'bleu': 1.0,
 'precisions': [1.0, 1.0, 1.0, 1.0],
 'brevity_penalty': 1.0,
 'length_ratio': 1.0,
 'translation_length': 12,
 'reference_length': 12}