# LangChain: Evaluation
- useful when comparing different strategies, for example:
    - swapping the model
    - using a different vector database
    - changing how chunks are created, etc.
- **LangSmith**?
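
To make "comparing strategies" concrete, here is a minimal sketch of a comparison harness. It assumes each strategy can be wrapped as a function from a query string to an answer string; the names `accuracy`, `compare_strategies`, and the two toy strategies are hypothetical stand-ins, not part of LangChain. Exact string matching is used only to keep the sketch small; the notebook itself relies on LLM-assisted grading later on.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]  # expects "query" and "answer" keys


def accuracy(answer_fn: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples answered correctly (case-insensitive exact match)."""
    if not examples:
        return 0.0
    hits = sum(
        answer_fn(ex["query"]).strip().lower() == ex["answer"].strip().lower()
        for ex in examples
    )
    return hits / len(examples)


def compare_strategies(
    strategies: Dict[str, Callable[[str], str]], examples: List[Example]
) -> Dict[str, float]:
    """Score every strategy on the same examples so the numbers are comparable."""
    return {name: accuracy(fn, examples) for name, fn in strategies.items()}


# Hypothetical stand-ins for two QA pipelines (e.g. different models or chunking).
toy_examples = [
    {"query": "q1", "answer": "Yes"},
    {"query": "q2", "answer": "No"},
]


def strategy_a(query: str) -> str:
    return "yes"  # always answers "yes"


def strategy_b(query: str) -> str:
    return "yes" if query == "q1" else "no"


scores = compare_strategies({"a": strategy_a, "b": strategy_b}, toy_examples)
print(scores)
```

The key design point is that every strategy is scored on the same fixed set of examples, so a difference in score reflects the strategy and not the test data.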

## Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation
* LangChain evaluation platform

In [None]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [None]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"
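
The date check above can also be wrapped in a small function, which makes the cutoff behaviour easy to verify without waiting for the date to pass. This is a sketch only: `pick_model` and its `cutoff` parameter are hypothetical names, while the two model names and the cutoff date are taken from the cell above.

```python
import datetime


def pick_model(
    today: datetime.date,
    cutoff: datetime.date = datetime.date(2024, 6, 12),
) -> str:
    """Return the pinned model up to and including the cutoff date, the generic one after."""
    return "gpt-3.5-turbo" if today > cutoff else "gpt-3.5-turbo-0301"


# Same result as the cell above when called with the current date.
llm_model_sketch = pick_model(datetime.datetime.now().date())
print(llm_model_sketch)
```

Note that the comparison is strict, so the pinned model is still selected on the cutoff date itself.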

## Create our QandA application
- we're using the same chain we built in the previous session on Q&A

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

In [None]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

**Specify the LLM and Chain**

In [None]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
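
The idea behind the *stuff* chain type is that all retrieved documents are joined into one context string, which is then placed into a single prompt. The sketch below illustrates only that idea; `stuff_context`, `build_prompt`, and the prompt wording are hypothetical and are not LangChain's actual implementation or template. It does show why a distinctive `document_separator` helps when reading the debug output: document boundaries stay visible.

```python
from typing import List


def stuff_context(doc_texts: List[str], separator: str = "\n\n") -> str:
    """Join all retrieved document texts into one context string."""
    return separator.join(doc_texts)


def build_prompt(question: str, doc_texts: List[str], separator: str) -> str:
    """Place the stuffed context and the question into one prompt (illustrative wording)."""
    context = stuff_context(doc_texts, separator)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


# Toy document texts standing in for retrieved catalog entries.
toy_docs = ["name: Jacket A", "name: Pullover B"]
prompt = build_prompt("Which item is a pullover?", toy_docs, "<<<<>>>>>")
print(prompt)
```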

### Coming up with test datapoints
- looking at a few of the documents here
- coming up with ground truth examples manually --> takes time, does not scale well!

In [None]:
data[10]

In [None]:
data[11]

### Hard-coded examples

In [None]:
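examples = [
    {
        # adjacent string literals are concatenated; a backslash continuation
        # inside the string would keep the next line's indentation in the query
        "query": "Do the Cozy Comfort Pullover Set "
                 "have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty "
                 "850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }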
]
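
Hand-written examples are easy to get subtly wrong, for instance through a missing key or stray whitespace left over from a line continuation. A minimal sketch of a sanity check follows; `normalize_example` is a hypothetical helper, and the only assumption is that every example is a dict with `query` and `answer` keys, as in the cell above.

```python
from typing import Dict


def normalize_example(example: Dict[str, str]) -> Dict[str, str]:
    """Check the required keys and collapse runs of whitespace in both fields."""
    missing = {"query", "answer"} - example.keys()
    if missing:
        raise ValueError(f"example is missing keys: {sorted(missing)}")
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with " " normalizes the text.
    return {key: " ".join(str(example[key]).split()) for key in ("query", "answer")}


cleaned = normalize_example(
    {"query": "Do the Cozy Comfort Pullover Set        have side pockets?", "answer": " Yes "}
)
print(cleaned)
```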

### LLM-Generated examples
- automate the manual steps above to produce ground truth examples
- we use a dedicated chain, **QAGenerateChain**, for that
- it takes documents as input and **automatically generates Question/Answer pairs from each document** using an LLM

In [None]:
from langchain.evaluation.qa import QAGenerateChain


**Pass the LLM used to generate the Question/Answer pairs**

In [None]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [None]:
# the warning below can be safely ignored

**Generate examples using the apply_and_parse() method**
- we want to get back a list of dicts, each holding a *query* and an *answer* entry

In [None]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)
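
Depending on the installed LangChain version, the generated items may not be flat `query`/`answer` dicts; some versions are reported to nest each pair under a `qa_pairs` key. That nesting is an assumption about version behaviour, not something this notebook confirms. A small sketch of a helper that accepts either shape (`flatten_generated` is a hypothetical name):

```python
from typing import Any, Dict, List


def flatten_generated(raw_examples: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Return flat {'query', 'answer'} dicts whether or not items are nested under 'qa_pairs'."""
    flat = []
    for item in raw_examples:
        # Assumption: a nested item looks like {"qa_pairs": {"query": ..., "answer": ...}}.
        pair = item.get("qa_pairs", item)
        flat.append({"query": pair["query"], "answer": pair["answer"]})
    return flat


# Toy data covering both shapes.
nested = [{"qa_pairs": {"query": "What is it?", "answer": "A jacket"}}]
already_flat = [{"query": "What is it?", "answer": "A jacket"}]
print(flatten_generated(nested))
print(flatten_generated(already_flat))
```

Running the generated items through such a helper before combining them with the hard-coded examples keeps every entry in the same shape.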

### Combine examples
- add the generated examples to the ones we already created

In [None]:
examples += new_examples

**Manual test: run an example through the Q&A chain and get the LLM's response**

In [None]:
qa.run(examples[0]["query"])

## Manual Evaluation
- how do we know what's actually happening inside the chain? Which prompt is used, etc.?
- we can use LangChain's *debug*

In [None]:
import langchain
langchain.debug = True

**Manual test again, with debug: run an example through the Q&A chain and get the LLM's response**
- we can see it's using the *RetrievalQA* chain first
- then enters a *StuffDocumentsChain* (using the *stuff* method)
- lastly it uses the *LLMChain* with the original question, passing context from the retrieved document(s)
- when things go wrong, it is often not the LLM that is at fault but the retrieval step
- checking the exact question and the context that was passed in helps with debugging

In [None]:
qa.run(examples[0]["query"])
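
One quick way to separate retrieval problems from LLM problems is to check whether the expected answer appears in the retrieved context at all. The sketch below works on plain strings; `answer_in_context` is a hypothetical helper and the document texts are toy data. In the notebook the texts would come from the retrieved documents' `page_content`.

```python
from typing import List


def answer_in_context(context_texts: List[str], expected_phrase: str) -> bool:
    """True if any retrieved text contains the expected phrase (case-insensitive)."""
    needle = expected_phrase.lower()
    return any(needle in text.lower() for text in context_texts)


# Toy stand-ins for retrieved document texts. With the chain above they could be
# collected from the retriever, e.g. via get_relevant_documents(query), assuming
# the installed LangChain version still offers that method.
retrieved = [
    "name: Cozy Comfort Pullover Set ... Additional Features: Side pockets",
    "name: Some Other Item ... no pockets mentioned",
]

print(answer_in_context(retrieved, "side pockets"))
print(answer_in_context(retrieved, "DownTek"))
```

If the phrase is missing from the context, the LLM never saw the relevant information, so the fix belongs in retrieval (chunking, embeddings, number of documents) rather than in the prompt.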

In [None]:
# Turn off the debug mode
langchain.debug = False

## LLM assisted evaluation
- now we let an LLM grade the predictions for our test examples


**Get predictions for all examples first**
- this step might take a while, since every example is run through the chain

In [None]:
predictions = qa.apply(examples)

In [None]:
from langchain.evaluation.qa import QAEvalChain

- again, we need an LLM to grade so we declare it

In [None]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)

**We use a chain to grade the predictions**
- by passing our test examples and the generated predictions to a **QAEvalChain**

In [None]:
graded_outputs = eval_chain.evaluate(examples, predictions)

- we loop through the examples and, for each one:
    - print the question (generated by the LLM, except for the two hard-coded examples)
    - print the ground truth answer (generated by the same LLM call, which had the original document as context)
    - print the prediction (generated by another LLM call through the RetrievalQA chain, which retrieved the document from the vector database)
    - print the grade (generated by yet another LLM call taking all of the above as context)

In [None]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
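
Printing every grade is fine for a handful of examples; for larger sets a tally is easier to read. The sketch below assumes the grade text is `CORRECT` or `INCORRECT`, possibly preceded by a `GRADE:` prefix; that format is an assumption about the grader's output, and `summarize_grades` is a hypothetical helper. Note that `INCORRECT` contains `CORRECT` as a substring, so a naive `"CORRECT" in text` check would miscount.

```python
from collections import Counter
from typing import Dict, List


def summarize_grades(grade_texts: List[str]) -> Dict[str, int]:
    """Count CORRECT / INCORRECT / UNKNOWN labels in the grader's raw text outputs."""
    counts: Counter = Counter()
    for text in grade_texts:
        label = text.strip().upper()
        # Assumption: some outputs carry a "GRADE:" prefix before the label.
        if label.startswith("GRADE:"):
            label = label[len("GRADE:"):].strip()
        # Test INCORRECT first, because "CORRECT" is a substring of it.
        if label.startswith("INCORRECT"):
            counts["INCORRECT"] += 1
        elif label.startswith("CORRECT"):
            counts["CORRECT"] += 1
        else:
            counts["UNKNOWN"] += 1
    return dict(counts)


# Toy grade strings; in the notebook these would be the 'text' entries of graded_outputs.
toy_grades = ["CORRECT", " incorrect", "GRADE: CORRECT", "maybe"]
summary = summarize_grades(toy_grades)
print(summary)
```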

## LangChain evaluation platform

https://smith.langchain.com

### Source: https://learn.deeplearning.ai/langchain/lesson/6/evaluation