## LangChain: Evaluation ##
### Outline: ###
- Example generation
- Manual evaluation (and debugging)
- LLM-assisted evaluation
- LangChain evaluation platform

- **How do you evaluate how well your application is doing? Is it meeting some accuracy criteria?** 
    - If you change your implementation (swap in a different LLM, change how you use a vector database or some other mechanism to retrieve chunks, or change some other parameter of your system), how do you know whether you're making it better or worse? 
    - Let's dive into some frameworks for thinking about how to evaluate an LLM-based application, as well as some tools to help you do that. 
    - These applications are really chains and sequences of a lot of different steps. 
        - The first thing to do is understand exactly what is going into and coming out of each step. 
        - Some of the tools can really just be thought of as visualizers or debuggers in that vein. 
        - It's also often useful to get a more holistic picture, across a lot of different data points, of how the model is doing. One way to do that is by looking at things by eye. 
    - There's also the cool idea of using language models and chains themselves to evaluate other language models, other chains, and other applications. 

- **Setting up for evaluation.**
    - First, we need the chain or application that we're going to evaluate. 
    - Second, we use the document question answering chain from the previous lesson. 
        - With this application in place, the first thing we need to do is 
            - figure out which data points we want to evaluate it on. 
    - There are a few different methods that we're going to cover for doing this. 
        - The first is the simplest: we come up with data points that we think are good examples ourselves. 
            - To do that, we can just look at some of the data and come up with example questions and example ground truth answers that we can later use to evaluate. 
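A minimal sketch of what such hand-written data points might look like, with a quick sanity check before they are used. The `query` and `answer` keys match the examples used later in these notes; `validate_examples` is a hypothetical helper written for illustration, not part of LangChain.

```python
# Hypothetical helper (not a LangChain API): check that every hand-written
# example has a non-empty question and a non-empty ground-truth answer.
def validate_examples(examples):
    problems = []
    for i, ex in enumerate(examples):
        for key in ("query", "answer"):
            value = ex.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"example {i}: missing or empty '{key}'")
    return problems


hand_written = [
    {"query": "Do the Cozy Comfort Pullover Set have side pockets?",
     "answer": "Yes"},
    {"query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
     "answer": "The DownTek collection"},
]

# An empty list means every example is usable.
print(validate_examples(hand_written))
# An example without a ground-truth answer is reported.
print(validate_examples([{"query": "No answer given"}]))
```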
               
- The verbose output of LangChain's debug mode (`langchain.debug = True`) shows the sequence of steps involved in handling your query. It may appear repetitive at first glance, but each line marks a distinct stage with a specific purpose. Let's break down these steps to understand their roles:

    1. **[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:**
       - This line marks entry into the innermost `LLMChain`. The outer `RetrievalQA` chain and the `StuffDocumentsChain` have already started by this point, each with its own `[chain/start]` line.
       - The input at this stage is the original question and the **context extracted from the vector store**.
           - "Context" is the result of a search against the vector store, combined by the `StuffDocumentsChain`, and then passed along with your question to the `LLMChain` to generate an answer.
       - The path shown here (`RetrievalQA > StuffDocumentsChain > LLMChain`) gives the nesting of components that process the query, from outermost to innermost.

    2. **[llm/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain > 4:llm:ChatOpenAI] Entering LLM run with input:**
       - This step indicates the start of the Large Language Model (LLM) operation within the overall chain.
       - The input at this point is the "prompt" constructed from the question and the context. This prompt is specifically formatted for the language model (in this case, `ChatOpenAI`).
       - The addition of `4:llm:ChatOpenAI` shows that we have moved deeper into the chain, specifically into the language model processing part.

    3. **[llm/end] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain > 4:llm:ChatOpenAI] Exiting LLM run with output:**
       - This step marks the completion of the LLM processing.
       - The output here is the "generation" from the language model, which is the model's response to the input prompt.
       - The time `[513.7270000000001ms]` shows how long this LLM processing step took.

- In short, the verbose output details each stage of the process, including entering and exiting the different components of the chain. Each step represents a different phase:

    - **Chain Start**: The whole process begins.
    - **LLM Start**: The specific LLM processing starts.
    - **LLM End**: The LLM processing ends, and we have the model's output.
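As a rough illustration of how these header lines are structured, the sketch below splits one into its event tag and component path. It assumes the terminal color codes have already been stripped, and `parse_trace_header` is a hypothetical helper, not something LangChain provides.

```python
import re

# Matches headers such as:
# [llm/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering LLM run with input:
HEADER = re.compile(r"^\[(?P<kind>\w+)/(?P<event>\w+)\]\s+\[(?P<path>[^\]]+)\]")


def parse_trace_header(line):
    """Return (kind, event, component names) for a debug header line, or None."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    # Each path segment looks like "3:chain:LLMChain"; keep only the component name.
    components = [segment.strip().split(":")[-1]
                  for segment in match["path"].split(">")]
    return match["kind"], match["event"], components


line = ("[llm/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > "
        "3:chain:LLMChain > 4:llm:ChatOpenAI] Entering LLM run with input:")
kind, event, components = parse_trace_header(line)
print(kind, event, components)
```

The length of the component list is the nesting depth, so a growing list means the trace has moved deeper into the chain.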


- When doing question answering, a wrong result often isn't the language model's fault. It's the retrieval step that returned the wrong context, which is why it helps to inspect the context passed to the model. 
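One crude way to check the retrieval step separately from the model is to test whether the retrieved context mentions the product being asked about. The sketch below is an assumption-laden illustration: `missing_terms` is a hypothetical helper, the context string is a shortened copy of the one in the debug trace, and a plain substring test will miss paraphrases.

```python
def missing_terms(context, expected_terms):
    """Return the expected terms that do not appear in the retrieved context."""
    lowered = context.lower()
    return [term for term in expected_terms if term.lower() not in lowered]


# Shortened copy of the context shown in the debug trace.
context = (
    "name: Cozy Comfort Pullover Set, Stripe\n"
    "description: Perfect for lounging, this striped knit set lives up to its name."
)

# The product we asked about is present, so retrieval looks plausible.
print(missing_terms(context, ["Cozy Comfort Pullover Set"]))
# A different product is absent from this context.
print(missing_terms(context, ["Ultra-Lofty 850"]))
```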
 
- The trace shows **exactly what is entering the language model**, `ChatOpenAI` itself. 
    - We can see **the full prompt that's passed in**: 
        - the prompt that the question answering chain uses under the hood, 
        - the retrieved context inserted into that prompt, 
        - a human question, which is the question that we asked it. 

- We can also see a lot more information about the actual return type. 
    - Rather than just a string, we get back a bunch of information, such as 
        - the "token_usage", which breaks down into 
            - "prompt_tokens", 
            - "completion_tokens", 
            - "total_tokens", 
        - the "model_name". 
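A small sketch of reading those fields once you have them as a dictionary. The dictionary shape follows the field names listed above; the token counts are made up for illustration, and `summarize_usage` is a hypothetical helper rather than a LangChain function.

```python
# Field names follow the debug output described above.
# The numbers are invented for illustration, not taken from a real run.
llm_output = {
    "token_usage": {
        "prompt_tokens": 700,
        "completion_tokens": 20,
        "total_tokens": 720,
    },
    "model_name": "gpt-3.5-turbo-0301",
}


def summarize_usage(llm_output):
    """Format the token counts of one LLM call as a single line."""
    usage = llm_output.get("token_usage", {})
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    # Fall back to the sum if the total is not reported.
    total = usage.get("total_tokens", prompt + completion)
    model = llm_output.get("model_name", "unknown model")
    return f"{model}: {prompt} prompt + {completion} completion = {total} tokens"


print(summarize_usage(llm_output))
```

Tracking these counts per example is a cheap way to see how much of the prompt budget the stuffed context consumes.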
 

- What about all the examples we created? How are we going to evaluate those? 
    - Manual evaluation 
        - We could run the chain over all the examples, look at the outputs, and try to figure out whether each one is correct, incorrect, or partially correct. As with creating the examples by hand, that gets tedious over time. 

    - LLM-assisted evaluation
        - First, we need to create predictions for all the examples. Before doing that, turn off debug mode so that not everything is printed to the screen. 
        - Create predictions for all the different examples. 
            - We had seven examples in total, so we loop through this chain seven times, getting a prediction for each one. 
        - Evaluate those examples. 
            - Import the question answering eval chain, `QAEvalChain`.
            - Create this chain with a language model, because we're going to use a language model to help do the evaluation. 
            - Call `evaluate` on this chain, passing in the examples and predictions. 
        - Get back a bunch of graded outputs. 
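Once the graded outputs are back, a simple tally turns them into a single number you can compare across versions of the application. This sketch assumes each graded output is a dictionary with a `text` field holding `CORRECT` or `INCORRECT`, as in the printing loop later in these notes; the sample grades are illustrative, not results from the notebook, and `grade_summary` is a hypothetical helper.

```python
from collections import Counter


def grade_summary(graded_outputs):
    """Count grades and compute the fraction graded CORRECT."""
    grades = [g["text"].strip().upper() for g in graded_outputs]
    counts = Counter(grades)
    accuracy = counts["CORRECT"] / len(grades) if grades else 0.0
    return counts, accuracy


# Illustrative grades only, not results from the notebook.
sample = [
    {"text": "CORRECT"},
    {"text": " CORRECT"},
    {"text": "INCORRECT"},
    {"text": "CORRECT"},
]

counts, accuracy = grade_summary(sample)
print(dict(counts), accuracy)
```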


In [1]:
import os
# The API key is read from the environment or a local .env file (next cell).
# Never hard-code a real key in a notebook.
# os.environ['OPENAI_API_KEY'] = '<your-openai-api-key>'

In [2]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [3]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

### Create our QandA application ###

In [4]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

In [5]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [6]:
size = len(data)
size

1000

In [None]:
data[10]

In [7]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [8]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

### Coming up with test datapoints ###
- What are the data points we want to validate on?

In [None]:
data[10]

In [None]:
data[11]

### Hard-coded examples ###
- Create two examples from the data points above

In [9]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]
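One detail worth knowing about the cell above: a backslash continuation inside a string literal keeps the indentation of the following line, which is why the query appears with a run of spaces in the outputs further down. If that matters for your evaluation, a sketch like the following collapses it; `normalize_whitespace` is a hypothetical helper.

```python
# Same line-continuation style as the hard-coded example above.
query = "Do the Cozy Comfort Pullover Set\
        have side pockets?"


def normalize_whitespace(text):
    """Collapse any run of whitespace into a single space."""
    return " ".join(text.split())


print(repr(query))                        # contains the indentation as spaces
print(repr(normalize_whitespace(query)))  # single spaces only
```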

### LLM-Generated examples ###

In [10]:
# QAGenerateChain takes in documents and creates a question-answer pair for each document
from langchain.evaluation.qa import QAGenerateChain

In [11]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [12]:
# create examples
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)

In [None]:
new_examples[0]

In [None]:
data[0]

### Combine examples ###

In [13]:
examples += new_examples

In [15]:
import langchain
print(langchain.__version__)

0.0.179


In [16]:
qa.run(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


'The Cozy Comfort Pullover Set, Stripe has side pockets on the pull-on pants.'

## Manual Evaluation ##

In [17]:
import langchain
langchain.debug = True

In [18]:
qa.run(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\

'The Cozy Comfort Pullover Set, Stripe has side pockets on the pull-on pants.'

In [19]:
# Turn off the debug mode
langchain.debug = False

## LLM assisted evaluation ##
- First, create predictions for all the different examples. 
    - We had seven examples in total, so we **loop through this chain seven times**, getting a prediction for each one.

In [20]:
predictions = qa.apply(examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


- Import the question answering eval chain, `QAEvalChain`. 
    - Create this chain with a language model, because we're going to use a language model to help do the evaluation. 
    - Call `evaluate` on this chain, 
        - passing in the examples and predictions. 
    - Get back a bunch of graded outputs. 

In [21]:
from langchain.evaluation.qa import QAEvalChain

In [22]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm) # llm to do the evaluation

In [23]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [24]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set, Stripe has side pockets on the pull-on pants.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the weight of the Women's Campside Oxfords per pair?
Real Answer: The Women's Campside Oxfords have an approximate weight of 1 lb. 1 oz. per pair.
Predicted Answer: The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium Recycled Waterhog Dog Mat?
Real Answer: The dimensions of the small Recycled Waterhog Dog Mat are 18" x 28" and the dimensions of the medium Recycled Waterhog Dog

- Why do we actually need to use the language model in the first place? 
    - These two strings, the real answer and the predicted answer, are actually nothing alike. 
        - Real Answer: Yes
        - Predicted Answer: The Cozy Comfort Pullover Set, Stripe has side pockets on the pull-on pants.
        - One is really short, the other is really long. 
        - If we tried string matching, exact matching, or even some regexes here, the comparison would fail, because the strings are not the same thing. 
    - That shows the importance of using the language model to do the evaluation here. 
        - These answers are arbitrary strings. There's no single ground truth string that is the best possible answer. 
        - There are many different variants, and as long as they have the same semantic meaning, they should be graded as being similar. 
        - That's what a language model helps with, as opposed to just doing exact matching. 
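To make the point concrete, the sketch below runs exact matching and a regex containment check on the two answer pairs from the output above. Both helper functions are hypothetical and written only for this illustration.

```python
import re


def exact_match(reference, prediction):
    """Case-insensitive comparison of the whole strings."""
    return reference.strip().lower() == prediction.strip().lower()


def contains_reference(reference, prediction):
    """Does the reference appear as whole words inside the prediction?"""
    pattern = r"\b" + re.escape(reference) + r"\b"
    return re.search(pattern, prediction, flags=re.IGNORECASE) is not None


pairs = [
    ("Yes",
     "The Cozy Comfort Pullover Set, Stripe has side pockets on the pull-on pants."),
    ("The DownTek collection",
     "The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection."),
]

for reference, prediction in pairs:
    print(exact_match(reference, prediction), contains_reference(reference, prediction))
```

Exact matching rejects both pairs, and the regex check accepts only the second one, even though a human (or the grading model) would mark both predictions as correct.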

### LangChain evaluation platform ###
- This is a way to do everything that we just did in the notebook, but persisted and shown in a UI.
- The LangChain evaluation platform, LangChain Plus, can be accessed at https://www.langchain.plus/. Use the invite code lang_learners_2023.