# LangChain: Evaluation

When building an LLM-based application, evaluating the model's performance using clear and measurable metrics is a critical and often challenging step. These metrics not only help assess how well the model performs but also allow us to quantify the impact of any changes made to the application—such as integrating a Vector database—by determining whether these changes improve or degrade the application's overall performance.

This notebook explores some frameworks to evaluate LLM based applications.

## Notebook Outline:

* Example generation
* Manual evaluation (and debuging)
* LLM-assisted evaluation
* LangChain evaluation platform

## 1. Load and import required libraries

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [2]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

In [3]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings import OpenAIEmbeddings

## 2. Create our QandA application

In [4]:
file = r"..\Data\OutdoorClothingCatalog_1000.csv"
loader = CSVLoader(file_path=file)
data = loader.load()

What the Code Does:
- Document Embedding: The code takes documents from the `loader`, processes them into text, and converts them into vector embeddings using the specified `embeddings` model.
- Index Creation: The embeddings are stored in a `DocArrayInMemorySearch` vector store, creating an index that can be queried for similarity-based search.
- Search-Ready: The resulting `index` can now be used to perform tasks like question-answering, semantic search, or retrieving relevant documents based on input queries.

Breakdown:
1. **`VectorstoreIndexCreator`**:  
   - This is a utility in LangChain that simplifies the process of creating an index for vector-based search. It manages the pipeline for embedding documents, storing them in a vector store, and making them searchable.

2. **`embedding=embeddings`**:  
   - This specifies the embedding model to be used.  
   - The `embeddings` variable likely holds a pre-trained embedding model (e.g., OpenAI's embeddings, Sentence Transformers, etc.), which converts text into numerical vectors for similarity search.

3. **`vectorstore_cls=DocArrayInMemorySearch`**:  
   - `DocArrayInMemorySearch` is a vector store implementation. It stores document embeddings in memory, making it fast for small to medium datasets.  
   - This is a lightweight, in-memory solution for vector storage and similarity search, part of DocArray's library.

4. **`.from_loaders([loader])`**:  
   - The `.from_loaders()` method is used to populate the index with documents.  
   - `loader` is likely an instance of a document loader (e.g., `TextLoader`, `DirectoryLoader`, etc.), which reads documents from a file, directory, or other sources and prepares them for indexing.  
   - The `loader` provides the text data that gets embedded and stored in the vector store.

In [5]:
# Instantiating embeddings model 
embeddings = OpenAIEmbeddings()

index = VectorstoreIndexCreator(
    embedding=embeddings,
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

  embeddings = OpenAIEmbeddings()


Create the pipeline

In [6]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

  llm = ChatOpenAI(temperature = 0.0, model=llm_model)


## 3. Test the pipeline

In [22]:
data[10]

Document(metadata={'source': '..\\Data\\OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

: 

In [8]:
data[11]

Document(metadata={'source': '..\\Data\\OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

Hard-coded examples to build source truth

In [9]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

LLM-Generated examples (automated) to build source truth

In [10]:
from langchain.evaluation.qa import QAGenerateChain

`QAGenerateChain` will take documents as input and create question and answer pair from each document. It does this with a specified LLM, OpenAI LLM in this case.

In [11]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

Apply the method and parse the output into a dict, instead of a single sting. 

In [13]:
# the warning below can be safely ignored
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)



Check query and answer pair together with actual document. 

In [14]:
new_examples[0]

{'qa_pairs': {'query': "What materials are the Women's Campside Oxfords made of and what features do they have for comfort and support?",
  'answer': "The Women's Campside Oxfords are made of soft canvas material for a broken-in feel and look. They also feature a comfortable EVA innersole with Cleansport NXT® antimicrobial odor control, a vintage hunt, fish, and camping motif on the innersole, a moderate arch contour, EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern."}}

In [15]:
data[0]

Document(metadata={'source': '..\\Data\\OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

### Combine examples

In [16]:
examples += new_examples

In [17]:
qa.run(examples[0]["query"])

  qa.run(examples[0]["query"])




[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


'Yes, the Cozy Comfort Pullover Set does have side pockets.'

## Manual Evaluation

In [18]:
import langchain
langchain.debug = True

In [19]:
qa.run(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditiona

'Yes, the Cozy Comfort Pullover Set does have side pockets.'

In [20]:
# Turn off the debug mode
langchain.debug = False

## LLM assisted evaluation

In [21]:
predictions = qa.apply(examples)



[1m> Entering new RetrievalQA chain...[0m


  predictions = qa.apply(examples)



[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m


ValueError: Missing some input keys: {'query'}

In [27]:
from langchain.evaluation.qa import QAEvalChain

In [28]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)

In [29]:
graded_outputs = eval_chain.evaluate(examples, predictions)

NameError: name 'predictions' is not defined

In [30]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:


NameError: name 'predictions' is not defined

In [31]:
graded_outputs[0]

NameError: name 'graded_outputs' is not defined