## LangChain: Evaluation

### Outline:

> * Example generation
> * Manual evaluation (and debugging)
> * LLM-assisted evaluation
> * LangChain evaluation platform

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
from dotenv import load_dotenv

load_dotenv()

True

In [4]:
import os

api_key = os.getenv("GOOGLE_API_KEY")

In [6]:
# Chat model
from langchain_google_genai import ChatGoogleGenerativeAI

from langchain_community.document_loaders.csv_loader import CSVLoader # CSV file loader
from langchain_google_genai import GoogleGenerativeAIEmbeddings  # Gemini embedding model
from langchain_community.vectorstores import DocArrayInMemorySearch # vector store/database
from langchain.indexes import VectorstoreIndexCreator # indexing for vector stores
from langchain.chains import RetrievalQA # for retrieval

In [10]:
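# Forward slash avoids the invalid "\s" escape sequence and works on every OS
file = "datasets/sample_data.csv"
# Read the CSV as UTF-8 so accented characters (á, ö, ß, ...) are decoded correctly
loader = CSVLoader(file_path=file, encoding="utf-8")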

data = loader.load()

In [8]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings
).from_loaders([loader])

In [9]:
chat_model = ChatGoogleGenerativeAI(model="gemini-1.5-pro")

qa_stuff = RetrievalQA.from_chain_type(
    llm=chat_model,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
        "document_separator": "<<<<>>>>>"
    }
)

In [12]:
data[0].page_content

"Text: Habiendo probado un par de otras marcas de galletas sándwich sin gluten, estas son las mejores del grupo.Son crujientes y fieles a la textura de las otras galletas 'reales' que no son sin gluten.Algunos podrían pensar que el relleno las hace un poco demasiado dulces,\\ \npero para mí eso solo significa que he satisfecho mi gusto por lo dulce más rápido.La versión de chocolate de Glutino es igual de buena y tiene un verdadero sabor a 'chocolate',\\ \nalgo que no está presente en las otras marcas sin gluten disponibles.\nScore: 5"

In [13]:
data[1].page_content

'Text: Meine Katze liebt diese Leckerlis. Wenn ich sie im Haus nicht finden kann,öffne ich einfach den Deckel und sie schießt aus ihrem Versteck, um ein Leckerli zu holen.Sie mag keine knusprigen Leckerlis, daher sind diese perfekt für sie.Ich habe ihr alle drei Geschmacksrichtungen gegeben, und sie scheint sie alle gleichermaßen zu mögen.Allerdings neigen sie dazu, auszutrocknen, wenn ich mich dem Ende der Flasche nähere.Der Klappdeckel ist sehr praktisch. Sehr schöne, preiswerte Leckerlis für Katzen.Ich habe noch keine Katze getroffen, die diese nicht liebt!\nScore: 5'

In [49]:
# Hard-coded examples
examples = [
    {
        "query": "The text that starts with 'Meine Katze liebt diese Leckerlis' has a score of?",
        "answer": "3"  # NOTE: data[1] above shows "Score: 5", so this reference answer does not match the dataset
    },
    {
        "query": "The text that starts with 'Habiendo probado un par de otras marcas' has a score of?",
        "answer": "5"
    }
]
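Before grading a chain against hard-coded references, it can help to check the references against the data itself. The following is a minimal sketch, assuming the rows follow the `Text: ...\nScore: N` layout printed above; the helper names `quoted_prefix` and `score_for_prefix` are hypothetical, and the row strings are abbreviated copies of `data[0].page_content` and `data[1].page_content`.

```python
import re

# Abbreviated copies of the two rows printed above (text shortened for brevity).
rows = [
    "Text: Habiendo probado un par de otras marcas de galletas ...\nScore: 5",
    "Text: Meine Katze liebt diese Leckerlis. Wenn ich sie ...\nScore: 5",
]

examples = [
    {
        "query": "The text that starts with 'Meine Katze liebt diese Leckerlis' has a score of?",
        "answer": "3",
    },
    {
        "query": "The text that starts with 'Habiendo probado un par de otras marcas' has a score of?",
        "answer": "5",
    },
]


def quoted_prefix(query):
    """Return the text between the first pair of single quotes, or None."""
    match = re.search(r"'([^']+)'", query)
    return match.group(1) if match else None


def score_for_prefix(rows, prefix):
    """Return the Score field of the first row whose text starts with prefix, or None."""
    for row in rows:
        if row.startswith("Text: " + prefix):
            match = re.search(r"^Score:\s*(\S+)\s*$", row, flags=re.MULTILINE)
            return match.group(1) if match else None
    return None


# Collect every example whose reference answer disagrees with the dataset.
mismatches = []
for example in examples:
    prefix = quoted_prefix(example["query"])
    found = score_for_prefix(rows, prefix) if prefix else None
    if found != example["answer"]:
        mismatches.append((prefix, example["answer"], found))

for prefix, expected, found in mismatches:
    print(f"'{prefix}': reference answer {expected}, dataset score {found}")
```

A mismatch reported here means the reference answer, not the chain, is the likely source of a later INCORRECT grade.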

In [50]:
qa_stuff.invoke(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


{'query': "The text that starts with 'Meine Katze liebt diese Leckerlis' has a score of?",
 'result': '5\n'}

### Manual Evaluation

In [51]:
# import langchain
# langchain.debug = True

In [52]:
# qa_stuff.invoke(examples[0]["query"])

In [53]:
# Turn off the debug mode
# langchain.debug = False

In [55]:
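# Run the chain on every example; each prediction keeps the input keys ('query', 'answer') and adds 'result'
predictions = qa_stuff.apply(examples)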



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [56]:
from langchain.evaluation.qa import QAEvalChain

eval_chain = QAEvalChain.from_llm(chat_model)

In [57]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [58]:
graded_outputs

[{'results': 'INCORRECT\n'}, {'results': 'CORRECT\n'}]

In [59]:
# Print each question with its reference answer, the chain's answer, and the LLM-assigned grade
for i, (prediction, grade) in enumerate(zip(predictions, graded_outputs)):
    print(f"Example {i}:")
    print("Question: " + prediction['query'])
    print("Real Answer: " + prediction['answer'])
    print("Predicted Answer: " + prediction['result'])
    print("Predicted Grade: " + grade['results'])
    print()

Example 0:
Question: The text that starts with 'Meine Katze liebt diese Leckerlis' has a score of?
Real Answer: 3
Predicted Answer: 5

Predicted Grade: INCORRECT


Example 1:
Question: The text that starts with 'Habiendo probado un par de otras marcas' has a score of?
Real Answer: 5
Predicted Answer: 5

Predicted Grade: CORRECT
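
With more than a handful of examples, reading grades one by one stops scaling, so a summary number is useful. The sketch below is one possible way to turn the grader output into an accuracy figure; it assumes the `[{'results': ...}]` shape shown above, and `is_correct` is a hypothetical helper, not part of LangChain.

```python
# Grader output copied from the `graded_outputs` cell above.
graded_outputs = [{"results": "INCORRECT\n"}, {"results": "CORRECT\n"}]


def is_correct(grade):
    """True only when the grade is exactly CORRECT after trimming whitespace."""
    return grade["results"].strip().upper() == "CORRECT"


num_correct = sum(is_correct(grade) for grade in graded_outputs)
accuracy = num_correct / len(graded_outputs)

print(f"{num_correct}/{len(graded_outputs)} correct ({accuracy:.0%})")
```

The comparison is an exact match rather than a substring test because the string "INCORRECT" contains "CORRECT".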


