# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">  Evaluating LLM Applications Using LangChain </center>
<center style="font-family: consolas; font-size: 32px; font-weight: bold;">  Hands-On LangChain for LLM Application Development </center> 

***



When constructing a sophisticated application employing an LLM, a crucial yet challenging aspect revolves around evaluating its performance. How can you ascertain if it meets accuracy standards? 

Moreover, if you opt to alter your implementation — perhaps by substituting a different LLM or adjusting the strategy for utilizing a vector database or other retrieval mechanisms — how can you gauge whether these changes enhance or detract from the application?

This notebook discusses the challenges of evaluating the performance of applications built with large language models (LLMs) and explores strategies for effectively assessing their accuracy and effectiveness.

It emphasizes the importance of understanding the inputs and outputs of each step in the application’s workflow and introduces frameworks and tools designed to aid in evaluation. 

Additionally, it explores the concept of using language models and chains themselves to evaluate other models and applications. With the rise of prompt-based development and the growing reliance on LLMs, the process of evaluating application workflows is undergoing reevaluation.

#### <a id="top"></a>
# <div style="box-shadow: rgb(60, 121, 245) 0px 0px 0px 3px inset, rgb(255, 255, 255) 10px -10px 0px -3px, rgb(31, 193, 27) 10px -10px, rgb(255, 255, 255) 20px -20px 0px -3px, rgb(255, 217, 19) 20px -20px, rgb(255, 255, 255) 30px -30px 0px -3px, rgb(255, 156, 85) 30px -30px, rgb(255, 255, 255) 40px -40px 0px -3px, rgb(255, 85, 85) 40px -40px; padding:20px; margin-right: 40px; font-size:30px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(60, 121, 245);"><b>Table of contents</b></div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">
<ul>
    <li><a href="#1" target="_self" rel=" noreferrer nofollow">1. Setting Up Working Environment </a></li> 
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">2. Manual Evaluation & Debugging </a></li> 
    <li><a href="#3" target="_self" rel=" noreferrer nofollow">3. LLM-Assisted Evaluation </a></li> 
    <li><a href="#4" target="_self" rel=" noreferrer nofollow">4. Observing Behind the Scenes </a></li> 
</ul>
</div>

***

<a id="1"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 1. Setting Up Working Environment </b></div>



As usual, we will start by setting up the working environment and importing the packages and libraries we will use throughout this notebook. First, we load the OpenAI API key (here from Kaggle secrets) so that we can call the LLM later.


In [1]:
!pip install langchain
!pip install langchain_community
!pip install openai
!pip install docarray
!pip install tiktoken

import os
import openai

from openai import OpenAI
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
openai.api_key = user_secrets.get_secret("openai_api")
client = OpenAI(
    # This is the default and can be omitted
    api_key=openai.api_key,
)

llm_model = "gpt-3.5-turbo"
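
The cell above relies on `kaggle_secrets`, which is only available on Kaggle. As a hedged aside, when running elsewhere the key could be read from an environment variable instead. The helper below is a minimal sketch: `load_openai_api_key` is a hypothetical name that is not part of the original notebook, and it assumes the key is stored in `OPENAI_API_KEY`.

```python
import os


def load_openai_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    This is a hypothetical helper, not part of the original notebook.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the notebook."
        )
    return key


# Possible usage outside Kaggle (assumes the variable is set):
# openai.api_key = load_openai_api_key()
```

Failing early with a clear message is a design choice here: it avoids a less obvious authentication error later, when the first chain call is made.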


We need a chain to evaluate, so we will use the document question-answering chain. To do this, we import the packages we are going to use and then load the data with LangChain's `PyPDFLoader`.

In [2]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.document_loaders import PyPDFLoader

file = '/kaggle/input/how-to-build-a-career-in-ai-pdf/eBook-How-to-Build-a-Career-in-AI.pdf'
loader = PyPDFLoader(file_path=file)
data = loader.load()

We create the index in a single statement, and then we create the retrieval QA chain by specifying the language model, the chain type, the retriever, and the verbosity of the printed output.

In [3]:
from langchain.embeddings import OpenAIEmbeddings
# Initialize the embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai.api_key)  # Specify the model name

# Create the vector store index
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embedding_model  # Include the embedding model
).from_loaders([loader])

# Initialize the language model
llm = ChatOpenAI(temperature=0.9, model=llm_model, openai_api_key=openai.api_key)

# Create the QA chain
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs={
        "document_separator": "<<<<>>>>>"
    }
)




Now that our application is set up, we need to decide which data points we want to evaluate it on.
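
To build some intuition for what an in-memory vector index does when the retriever is called, here is a toy sketch of exact nearest-neighbor search with cosine similarity. It is not the DocArray implementation. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Made-up "embeddings" for three pages and one query (assumption for illustration).
toy_doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query_vec = [0.9, 0.1, 0.0]

# Indices of the two pages that would be handed to the LLM as context.
print(top_k(toy_query_vec, toy_doc_vecs, k=2))
```

In the real chain, the pages selected this way are what end up in the context of the prompt, which is why retrieval quality matters so much for the final answer.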

<a id="2"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 2. Manual Evaluation & Debugging </b></div>

We will start with the simplest method of debugging: coming up with data points that we think are good examples and testing them. We can look at some of the data and write example questions together with ground-truth answers that we can later use for evaluation.

If we look at a few of the documents here, we can get a sense of what is inside them. We are using the "How to Build a Career in AI" document by Andrew Ng.

In [4]:
data[10]

Document(page_content='PAGE 11\nThe Best Way to Build \na New Habit\nOne of my favorite books is BJ Fogg’s, Tiny Habits: The Small Changes That Change \nEverything. Fogg explains that the best way to build a new habit is to start small \nand succeed, rather than start  too big and fail. For example, rather than trying to \nexercise for 30 minutes a day, he recommends aspiring to do just one push-up, and \ndoing it consistently.\nThis approach may be helpful to those of you who want to spend more time studying. \nIf you start by holding yourself accountable for watching, say, 10 seconds of an \neducational video every day — and you do so consistently — the habit of studying daily \nwill grow naturally. Even if you learn nothing in that 10 seconds, you’re establishing the \nhabit of studying a little every day. On some days, maybe you’ll end up studying for an \nhour or longer.', metadata={'source': '/kaggle/input/how-to-build-a-career-in-ai-pdf/eBook-How-to-Build-a-Career-in-AI.pdf', 'p

This part of the document discusses the importance of continuous learning and developing the habit of learning.

In [5]:
data[8]

Document(page_content='PAGE 9In the previous chapter, I introduced three key steps for building a career in AI: learning \nfoundational technical skills, working on projects, and finding a job, all of which is supported \nby being part of a community. In this chapter, I’d like to dive more deeply into the first step: \nlearning foundational skills.\nMore research papers have been published on AI than anyone can read in a lifetime. So, when \nlearning, it’s critical to prioritize topic selection. I believe the most important topics for a technical \ncareer in machine learning are:\nFoundational machine learning skills: For example, it’s important to understand models such \nas linear regression, logistic regression, neural networks, decision trees, clustering, and anomaly \ndetection. Beyond specific models, it’s even more important to understand the core concepts \nbehind how and why machine learning works, such as bias/variance, cost functions, regularization, \noptimization algorithm

The second page we looked at explores the different skills you need to start a career in AI and data science. So, for the first example, we can ask a simple question: "Is machine learning foundations the most important skill?"

For the second example, we can ask which Python frameworks you need to learn. The answer to this question is also in the second document.

In [6]:
examples = [
    {
        "query": "Is machine learning foundations the most important skill?",
        "answer": "Yes"
    },
    {
        "query": "What are Python frameworks you need to learn ?",
        "answer": "Tensorflow and PyTorch"
    }
]

But this doesn't scale well. It takes time to look through each example and figure out what is going on, so a better way is to automate the process.

<a id="3"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 3. LLM-Assisted Evaluation </b></div>

One way to automate the evaluation process is to use LLMs themselves. LangChain has a chain that does exactly that: the QA generation chain takes in documents and creates a question-answer pair from each one, using a language model to do so. We therefore need to create this chain by passing in the `ChatOpenAI` language model.

In [7]:
from langchain.evaluation.qa import QAGenerateChain


We create the chain first; from there, we can generate a list of examples.

In [8]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model, openai_api_key=openai.api_key))


We use the `apply_and_parse` method, which applies an output parser to the result, because we want to get back a dictionary containing the query and answer pair rather than a single string.

In [9]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)
new_examples[0]



{'qa_pairs': {'query': 'According to the document, who is the founder of DeepLearning.AI?',
  'answer': 'The founder of DeepLearning.AI is Andrew Ng, as mentioned in the document.'}}
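
To illustrate why an output parser is useful, the sketch below turns a raw model response into a dictionary with a regular expression. The `QUESTION:` / `ANSWER:` text format is an assumption made for this example, and `parse_qa` is a hypothetical helper; the prompt and parser that LangChain actually uses may differ between versions.

```python
import re

# Assumed raw output format (illustrative only).
QA_PATTERN = re.compile(r"QUESTION: (.*?)\n+ANSWER: (.*)", re.DOTALL)


def parse_qa(raw_text):
    """Extract a {'query': ..., 'answer': ...} dict from raw model output."""
    match = QA_PATTERN.search(raw_text)
    if match is None:
        raise ValueError("Could not find a QUESTION/ANSWER pair in the model output.")
    return {"query": match.group(1).strip(), "answer": match.group(2).strip()}


# Made-up raw response used only to exercise the parser.
raw = "QUESTION: Who is the founder of DeepLearning.AI?\nANSWER: Andrew Ng."
print(parse_qa(raw))
```

Raising an error on unparseable output, instead of returning an empty dictionary, makes malformed generations visible before they are mixed into the evaluation set.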

Looking at what is returned, we can see a query and an answer nested under the `qa_pairs` key. Let's check the document that this question-answer pair was generated from.

In [10]:
data[0]


Document(page_content='PAGE 1Founder, DeepLearning.AICollected Insights\nfrom Andrew Ng\nHow to \nBuild\nYour\nCareer\nin AIA Simple Guide\n', metadata={'source': '/kaggle/input/how-to-build-a-career-in-ai-pdf/eBook-How-to-Build-a-Career-in-AI.pdf', 'page': 0})

We just generated a set of question-answer pairs without writing them ourselves, which saves a lot of time and lets us cover more test cases. Now let's add these generated examples to the ones we created by hand.

In [11]:
examples += new_examples


In [12]:
examples

[{'query': 'Is machine learning foundations the most important skill?',
  'answer': 'Yes'},
 {'query': 'What are Python frameworks you need to learn ?',
  'answer': 'Tensorflow and PyTorch'},
 {'qa_pairs': {'query': 'According to the document, who is the founder of DeepLearning.AI?',
   'answer': 'The founder of DeepLearning.AI is Andrew Ng, as mentioned in the document.'}},
 {'qa_pairs': {'query': 'According to Andrew Ng in the document, what is AI compared to and how does he believe it will impact human life?',
   'answer': 'Andrew Ng compares AI to the new electricity and believes that it will transform and improve all areas of human life.'}},
 {'qa_pairs': {'query': 'According to the document, what are the key topics covered in the chapters of the book "How to Build a Career in AI"?',
   'answer': 'The key topics covered in the chapters of the book "How to Build a Career in AI" include: '}},
 {'qa_pairs': {'query': "According to the document, what is the author's comparison between

We now have these examples, but how exactly do we evaluate what is going on? The first thing to do is run a single example through the chain and look at the output it produces.

In [13]:
qa.run(examples[0]["query"])


> Entering new RetrievalQA chain...

> Finished chain.


'Yes, having a strong foundation in machine learning is indeed considered one of the most important skills for a career in AI. Understanding foundational machine learning skills, as well as deep learning concepts and software development, are key components to excel in the field of machine learning.'

When we input a query, we receive an answer. However, this approach limits our visibility into the chain’s inner workings. What exact prompt is fed into the language model? Which documents does it fetch? 

In more complex chains with multiple steps, what intermediate results are generated? Simply observing the final answer often isn't sufficient for understanding potential issues within the chain. To address this, LangChain provides a helpful utility: the `langchain.debug` flag.

<a id="4"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 4. Observing Behind the Scenes </b></div>


If we want to observe what is happening behind the scenes, we can set `langchain.debug` to `True`. When we rerun the same example as above, the chain prints out a lot more information.

In [14]:
import langchain
langchain.debug = True

qa.run(examples[0]["query"])

[chain/start] [chain:RetrievalQA] Entering Chain run with input:
{
  "query": "Is machine learning foundations the most important skill?"
}
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "Is machine learning foundations the most important skill?",
  "context": "PAGE 9In the previous chapter, I introduced three key steps for building a career in AI: learning \nfoundational technical skills, working on projects, and finding a job, all of which is supported \nby being part of a community. In this chapter, I’d like to dive more deeply into the first step: \nlearning foundational skills.\nMore research papers have been published on AI than anyone can read in a lifetime. So, when \nlearning, it’s critical to prioritize topic selection. I be

'Yes, foundational machine learning skills are considered crucial for a technical career in AI. Understanding concepts like linear regression, neural networks, decision trees, and core principles behind machine learning is essential for building a career in the field.'

Examining the output closely, we see that it first enters the RetrievalQA chain and then the documents chain (`StuffDocumentsChain`), since we set the chain type to `stuff`. It then enters the LLM chain, which receives several inputs: the original question and the context, which is assembled from multiple retrieved documents.
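
As a rough illustration of the `stuff` approach, the sketch below joins several retrieved page texts with a separator and places them in a single prompt. The template wording is illustrative only and is not LangChain's actual prompt. The page strings are placeholders, and `stuff_documents` and `build_prompt` are hypothetical helper names.

```python
def stuff_documents(page_contents, separator="\n\n"):
    """Concatenate all retrieved texts into one context string."""
    return separator.join(page_contents)


def build_prompt(question, page_contents, separator="\n\n"):
    """Place the stuffed context and the question into one prompt."""
    context = stuff_documents(page_contents, separator)
    # Illustrative template only; not LangChain's exact prompt text.
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Placeholder page texts standing in for retrieved documents.
pages = ["text of first retrieved page", "text of second retrieved page"]
prompt = build_prompt(
    "Is machine learning foundations the most important skill?",
    pages,
    separator="<<<<>>>>>",
)
print(prompt)
```

Because everything is placed into one prompt, this approach only works while the combined context fits within the model's context window.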

In question-answering scenarios, errors often arise not from the language model itself but from flaws in the retrieval step. Scrutinizing the question and the context therefore helps with debugging. Further down, we can see the inputs to the language model itself, `ChatOpenAI`.

Here, we have the complete prompt passed in, comprising a system message and the prompt description utilized by the question-answering chain under the hood, which we haven’t explored until now. 

The prompt instructs the model to use the provided pieces of context to answer the user's question and not to make up an answer if it is uncertain. After that come the inserted context and the human question.

Moreover, detailed information about the return type is provided, including token usage metrics such as prompt tokens, completion tokens, total tokens, and the model name. This data proves valuable for monitoring token usage in chains or language model calls over time, correlating closely with the total cost.
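
To make the link between token counts and cost concrete, here is a minimal sketch that aggregates usage records and estimates a cost. The usage records and the per-1,000-token prices are placeholders rather than real pricing, and `summarize_usage` is a hypothetical helper name.

```python
def summarize_usage(usages, prompt_price_per_1k, completion_price_per_1k):
    """Sum token counts over several calls and estimate the total cost."""
    prompt_tokens = sum(u["prompt_tokens"] for u in usages)
    completion_tokens = sum(u["completion_tokens"] for u in usages)
    cost = (
        (prompt_tokens / 1000) * prompt_price_per_1k
        + (completion_tokens / 1000) * completion_price_per_1k
    )
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "estimated_cost": cost,
    }


# Placeholder usage records and placeholder prices (not real pricing).
usage_log = [
    {"prompt_tokens": 1200, "completion_tokens": 50},
    {"prompt_tokens": 800, "completion_tokens": 150},
]
summary = summarize_usage(usage_log, prompt_price_per_1k=0.5, completion_price_per_1k=1.5)
print(summary)
```

Prompt and completion tokens are priced separately in the sketch because retrieval-heavy chains tend to spend most of their tokens on the prompt side.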

Inspecting a single example by hand does not scale to all of our examples, so let's ask a language model to do the grading. First, we need predictions for every example, so we run each example through the chain and collect its prediction.

In [15]:
# Turn off the debug mode
langchain.debug = False

In [16]:
# Preprocess examples to ensure 'query' is a top-level key
processed_examples = []

for example in examples:
    if 'qa_pairs' in example:
        processed_examples.append({
            'query': example['qa_pairs']['query'],
            'answer': example['qa_pairs']['answer']
        })
    else:
        processed_examples.append(example)

# Now all examples have a consistent structure
for example in processed_examples:
    print(example)

# qa is the RetrievalQA chain created earlier; apply() runs it on every example
predictions = qa.apply(processed_examples)
print(predictions)


{'query': 'Is machine learning foundations the most important skill?', 'answer': 'Yes'}
{'query': 'What are Python frameworks you need to learn ?', 'answer': 'Tensorflow and PyTorch'}
{'query': 'According to the document, who is the founder of DeepLearning.AI?', 'answer': 'The founder of DeepLearning.AI is Andrew Ng, as mentioned in the document.'}
{'query': 'According to Andrew Ng in the document, what is AI compared to and how does he believe it will impact human life?', 'answer': 'Andrew Ng compares AI to the new electricity and believes that it will transform and improve all areas of human life.'}
{'query': 'According to the document, what are the key topics covered in the chapters of the book "How to Build a Career in AI"?', 'answer': 'The key topics covered in the chapters of the book "How to Build a Career in AI" include: '}
{'query': "According to the document, what is the author's comparison between traditional literacy and coding literacy?", 'answer': 'The author compares tra

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.
[{'query': 'Is machine learning foundations the most important skill?', 'answer': 'Yes', 'result': 'Yes, foundational machine learning skills are considered one of the most important skills for a technical career in machine learning. Understanding models like linear regression, logistic regression, neural networks, decision trees, and more, along with core concepts like bias/variance, cost functions, regularization, and optimization algorithms are crucial for success in the field.'}, {'query': 'What are Python frameworks you need

With these predictions in hand, we can evaluate them. First, we import the question-answering evaluation chain, `QAEvalChain`. Then we instantiate it with a language model, which will do the grading. Finally, we call the chain's `evaluate` method, passing in the examples and predictions, and receive graded outputs in return.

In [17]:
from langchain.evaluation.qa import QAEvalChain
llm = ChatOpenAI(temperature=0, model=llm_model, openai_api_key=openai.api_key)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(processed_examples, predictions)

To see the result for each example, we iterate through them. We print the question, followed by the real answer. For the generated examples, both were produced by a language model that had access to the complete document.

Next, we display the predicted answer, generated by the language model within the QA chain, which uses embeddings and a vector store for retrieval before passing the retrieved context to the language model.

Finally, we print the grade, determined by a language model when prompted to assess correctness. Looping through the examples and printing these details gives us insight into how each one was evaluated.

In [18]:
for i, eg in enumerate(processed_examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: Is machine learning foundations the most important skill?
Real Answer: Yes
Predicted Answer: Yes, foundational machine learning skills are considered one of the most important skills for a technical career in machine learning. Understanding models like linear regression, logistic regression, neural networks, decision trees, and more, along with core concepts like bias/variance, cost functions, regularization, and optimization algorithms are crucial for success in the field.
Predicted Grade: CORRECT

Example 1:
Question: What are Python frameworks you need to learn ?
Real Answer: Tensorflow and PyTorch
Predicted Answer: Some key Python frameworks that are important to learn for machine learning and artificial intelligence development include TensorFlow, PyTorch, and scikit-learn. These frameworks are widely used in the field and provide essential tools and resources for building and training machine learning models.
Predicted Grade: CORRECT

Example 2:
Question: Acc

In the output shown above, the first two examples are graded CORRECT. Be aware, though, that grades can be sensitive to how the ground-truth answer is written: when a hand-written answer is very short (like our "Yes" at the beginning of the notebook) while the answer drawn from the document is much more detailed, the grader may mark a prediction as incorrect.
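
When there are many examples, reading the grades one by one becomes tedious. The sketch below tallies them into counts and an accuracy figure. It assumes each graded output is a dictionary whose `results` value is the grade text, as in the loop above. The sample grades are made up, and `summarize_grades` is a hypothetical helper name.

```python
from collections import Counter


def summarize_grades(graded_outputs, key="results"):
    """Count grades and compute the share graded CORRECT."""
    # Normalize whitespace and case so " correct " and "CORRECT" are counted together.
    grades = [g[key].strip().upper() for g in graded_outputs]
    counts = Counter(grades)
    accuracy = counts["CORRECT"] / len(grades) if grades else 0.0
    return counts, accuracy


# Made-up grades used only to exercise the helper.
sample_grades = [
    {"results": "CORRECT"},
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": " correct "},
]
counts, accuracy = summarize_grades(sample_grades)
print(counts, accuracy)
```

If the grader returns longer text than a bare label, the normalization step would need to be adapted. Either way, a single accuracy number makes it easier to compare runs after changing the LLM or the retrieval strategy.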
