# LangSmith and Evaluation Overview with AI Makerspace

Today we'll be looking at an amazing tool:

[LangSmith](https://docs.smith.langchain.com/)!

This tool will help us monitor, test, debug, and evaluate our LangChain applications - and more!

We'll also be looking at a few Advanced Retrieval techniques along the way - and evaluating them using LangSmith!

✋BREAKOUT ROOM #2:
- Task 1: Dependencies and OpenAI API Key
- Task 2: LCEL RAG Chain
- Task 3: Setting Up LangSmith
- Task 4: Examining the Trace in LangSmith!
- Task 5: Create Testing Dataset
- Task 6: Evaluation

## Task 1: Dependencies and OpenAI API Key

We'll be using OpenAI's suite of models today to help us generate and embed our documents for a simple RAG system built on top of LangChain's blogs!

In [1]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ Like the previous notebook, run some prerequisite cells.

In [1]:
!pip install langchain_core langchain_openai langchain_community langchain-qdrant qdrant-client langsmith openai tiktoken cohere lxml -qU

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

#### Asyncio Bug Handling

This is necessary for Colab.

In [3]:
import nest_asyncio
nest_asyncio.apply()
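
To see why this patch is needed, recall that Jupyter and Colab already run an asyncio event loop, and plain `asyncio` refuses to start a second one from inside it. The sketch below reproduces that error in an ordinary Python script (not a notebook cell, where the outer `asyncio.run` call would itself fail). The names `inner` and `outer` are placeholders for illustration only.

```python
import asyncio


async def inner():
    # Stand-in for any coroutine a library might want to run.
    return 42


async def outer():
    # We are already inside a running event loop here, which mimics
    # the situation inside a Jupyter or Colab cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls like this one are allowed, which is why the cell above is run before anything else.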

## Task 2: Create a Simple RAG Application Using Qdrant, OpenAI, and LCEL

Now that we have a grasp on how LCEL works, and how we can use LangChain and Hugging Face to interact with our data - let's step it up a notch and incorporate Qdrant!

## LangChain Powered RAG

First and foremost, LangChain provides a convenient way to store our chunks and their embeddings.

It's called a `VectorStore`!

We'll be using Qdrant as our `VectorStore` today. You can read more about it [here](https://qdrant.tech/documentation/).

Think of a `VectorStore` as a smart way to house your chunks and their associated embedding vectors. The implementation of the `VectorStore` also allows for smarter and more efficient search of our embedding vectors - as the method we used above would not scale well as we got into the millions of chunks.

Otherwise, the process remains relatively similar under the hood!
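
To make the idea concrete, here is a toy, brute-force vector store in plain Python. It is only a sketch of the concept: `ToyVectorStore` and its two-dimensional vectors are made up for illustration, and a real store such as Qdrant uses an approximate nearest-neighbour index rather than scoring every vector.

```python
import math


class ToyVectorStore:
    """Keeps (vector, text) pairs and returns the texts closest to a query."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def search(self, query_vector, k=2):
        # Brute force: score every stored vector, then keep the top k.
        scored = sorted(
            self._items,
            key=lambda item: self._cosine(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "about LCEL")
store.add([0.0, 1.0], "about cooking")
store.add([0.9, 0.1], "about chains")

top = store.search([1.0, 0.0], k=2)
print(top)
```

Scoring every vector costs time proportional to the number of chunks per query, which is the scaling problem a dedicated vector database is meant to address.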

We'll use a `SitemapLoader` to scrape the LangChain blogs - which will serve as our data for today!

### Data Collection

We'll be leveraging the `SitemapLoader` to load the blog posts directly from the web!

In [2]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ In this notebook, the source data for our RAG chain is the LangChain blog posts.

In [6]:
from langchain_community.document_loaders import SitemapLoader

documents = SitemapLoader(web_path="https://blog.langchain.dev/sitemap-posts.xml").load()

Fetching pages: 100%|##########| 220/220 [00:05<00:00, 38.47it/s]


### Chunking Our Documents

Let's do the same process as we did before with our `RecursiveCharacterTextSplitter` - but this time we'll use 500 characters as our max chunk size (`length_function = len` counts characters, not tokens)!

In [36]:
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# 😀 To make it easier to remember, I will import the classes this way
from langchain_text_splitters import RecursiveCharacterTextSplitter 

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 0,
    length_function = len,
)

split_chunks = text_splitter.split_documents(documents)

In [11]:
len(split_chunks)

4821

Alright, now we have 4,821 chunks, each at most 500 characters long.

Let's verify the process worked as intended by checking our max document length.

In [12]:
max_chunk_length = 0

for chunk in split_chunks:
  max_chunk_length = max(max_chunk_length, len(chunk.page_content))

print(max_chunk_length)

499


Perfect! Now we can carry on to creating and storing our embeddings.
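
As an aside on units: because the splitter above was given `length_function = len`, the chunk size is measured in characters. The sketch below shows the same idea with a deliberately simplified greedy word packer. It is not LangChain's recursive algorithm, and `split_by_chars` is a hypothetical helper written only for this illustration.

```python
def split_by_chars(text, chunk_size, length_function=len):
    """Greedily pack whitespace-separated words into chunks of at most chunk_size."""
    chunks = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if length_function(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than chunk_size would become its own chunk.
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "LangChain Expression Language makes chains composable. " * 20
chunks = split_by_chars(sample, chunk_size=100)

# Same check as in the notebook: the longest chunk must respect the limit.
max_len = max(len(chunk) for chunk in chunks)
print(len(chunks), max_len)
```

Swapping `length_function` for a token counter would make `chunk_size` a token budget instead of a character budget.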

### Embeddings and Vector Storage

We'll use the `text-embedding-3-small` embedding model again - and `Qdrant` to store all our embedding vectors for easy retrieval later!

In [13]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ Instead of manually creating a retriever, we use a vector store and its associated retriever.
# 🚶‍♀️🚶‍♀️🚶‍♀️ A vector store is LangChain's abstraction of a vector database. Many third-party vector stores are out there.
# 🚶‍♀️🚶‍♀️🚶‍♀️ We select the Qdrant vector database.

from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

qdrant_vectorstore = Qdrant.from_documents(
    documents=split_chunks,
    embedding=embedding_model,
    location=":memory:"
)



Now let's set up our retriever, just as we saw before, but this time using LangChain's simple `as_retriever()` method!

In [14]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ One of the advantages of a LangChain vector store is that we can easily get a retriever from its as_retriever method.

qdrant_retriever = qdrant_vectorstore.as_retriever()

In [23]:
# 😀 My own cell

from langchain_core.runnables import Runnable

print(isinstance(qdrant_vectorstore, Runnable))
print(isinstance(qdrant_retriever, Runnable))
print(qdrant_retriever.InputType)
print(qdrant_retriever.OutputType)

False
True
<class 'str'>
typing.List[langchain_core.documents.base.Document]


#### Back to the Flow

We're ready to move to the next step!

### Setting up our RAG

We'll use the LCEL we touched on earlier to create a RAG chain.

Let's think through each part:

1. First we need to retrieve context
2. We need to pipe that context to our model
3. We need to parse that output

Let's start by setting up our prompt again, just so it's fresh in our minds!

#### 🏗️ Activity #2:

Complete the prompt so that your RAG application answers queries based on the context provided, but *does not* answer queries if the context is unrelated to the query.

In [113]:
from langchain.prompts import ChatPromptTemplate

# 😀 I completed the prompt template.
base_rag_prompt_template = """\
Answer the user's Question based on the given Context. \
If the given Context is not relevant to the Question, please respond with "I don't know". Do not make up.

Context:
{context}

Question:
{question}
"""

base_rag_prompt = ChatPromptTemplate.from_template(base_rag_prompt_template)

We'll set our Generator - `gpt-4o-mini` in this case - below!

In [32]:
from langchain_openai.chat_models import ChatOpenAI

base_llm = ChatOpenAI(model="gpt-4o-mini", tags=["base_llm"])

#### Our RAG Chain

Notice how we have a bit of a more complex chain this time - that's because we want to return our sources with the response.

Let's break down the chain step-by-step:

1. We invoke the chain with the `question` item. Notice how we only need to provide `question` since both the retriever and the `"question"` object depend on it.
  - We also chain our `"question"` into our `retriever`! This is what ultimately collects the context through Qdrant.
2. We assign our collected context to a `RunnablePassthrough()` from the previous object. This is going to let us simply pass it through to the next step, but still allow us to run that section of the chain.
3. We finally collect our response by chaining our prompt, which expects both a `"question"` and `"context"`, into our `llm`. We also collect the `"context"` again so we can output it in the final response object.

The key thing to keep in mind here is that we need to pass our context through *after* we've retrieved it - to populate the object in a way that doesn't require us to call it or try and use it for something else.
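
The three steps above can be traced with plain dictionaries. The sketch below is not LCEL: `fake_retriever` and `fake_llm` are hypothetical stand-ins, so it can run without any API key, and it only mirrors how the keys move from one step to the next.

```python
from operator import itemgetter


def fake_retriever(question):
    # Stand-in for the Qdrant retriever: returns a list of "documents".
    return [f"(retrieved text about: {question})"]


def fake_llm(prompt):
    # Stand-in for the chat model: returns a string instead of an AIMessage.
    return f"(answer generated from a {len(prompt)}-character prompt)"


def run_chain(inputs):
    # Step 1: build "context" and "question" from the single "question" key.
    step1 = {
        "context": fake_retriever(itemgetter("question")(inputs)),
        "question": itemgetter("question")(inputs),
    }
    # Step 2: pass everything through, re-assigning "context" from the previous step.
    step2 = {**step1, "context": itemgetter("context")(step1)}
    # Step 3: format the prompt, call the model, and keep the context for the caller.
    prompt = f"Context:\n{step2['context']}\n\nQuestion:\n{step2['question']}\n"
    return {"response": fake_llm(prompt), "context": itemgetter("context")(step2)}


result = run_chain({"question": "Why should we use LCEL?"})
print(sorted(result))
```

In this sketch step 2 leaves the dictionary unchanged, which shows that its job is to carry the retrieved context forward rather than to transform it.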

In [42]:
# 😀 My own cell
from operator import itemgetter

qdrant_retriever.invoke("Why should we use LCEL?")

[Document(metadata={'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'loc': 'https://blog.langchain.dev/langchain-v0-1-0/', 'lastmod': '2024-02-09T21:45:58.000Z', '_id': 'b5d2495936e54a1a9b64c0f283c8b320', '_collection_name': '2279a7153971439bbb7bf2af0e0a5560'}, page_content='streaming, covered later in this post.The components for LCEL are in langchain-core. We’ve started to create higher level entry points for specific chains in LangChain. These will gradually replace pre-existing (now “Legacy”) chains, because chains built with LCEL will get streaming, ease of customization, observability, batching, retries out-of-the-box. Our goal is to make this transition seamless. Previously you may have done:ConversationalRetrievalChain.from_llm(llm, …)We want to simply make'),
 Document(metadata={'source': 'https://blog.langchain.dev/code-execution-with-langgraph/', 'loc': 'https://blog.langchain.dev/code-execution-with-langgraph/', 'lastmod': '2024-03-01T21:52:16.000Z', '_id': '22eb6

In [40]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ This is how we implement our RAG chain using LCEL (LangChain Expression Language), which is
# 🚶‍♀️🚶‍♀️🚶‍♀️ a very intuitive and convenient way of constructing a chain.


from operator import itemgetter
# from langchain.schema.output_parser import StrOutputParser
# from langchain.schema.runnable import RunnablePassthrough

# 😀 To make it easier to remember, I will import the classes this way
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the qdrant_retriever
    {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}
    
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": base_rag_prompt | base_llm, "context": itemgetter("context")}
)

Let's get a visual understanding of our chain!

In [43]:
!pip install -qU grandalf

In [44]:
print(retrieval_augmented_qa_chain.get_graph().draw_ascii())

          +---------------------------------+      
          | Parallel<context,question>Input |      
          +---------------------------------+      
                    **            **               
                  **                **             
                **                    **           
         +--------+                     **         
         | Lambda |                      *         
         +--------+                      *         
              *                          *         
              *                          *         
              *                          *         
  +----------------------+          +--------+     
  | VectorStoreRetriever |          | Lambda |     
  +----------------------+          +--------+     
                    **            **               
                      **        **                 
                        **    **                   
          +----------------------------------+     
          | 

Let's try another visual representation:

![image](https://i.imgur.com/Ad31AhL.png)

Let's test our chain out!

In [49]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What's new in LangChain v0.2?"})

In [86]:
response["response"].content

'LangChain v0.2 introduces several significant improvements and features, building upon the foundation laid in v0.1. Here are the key updates:\n\n1. **Separation of Packages**: There is now a full separation between the `langchain` and `langchain-community` packages, enhancing modularity and organization.\n\n2. **New Documentation**: The release includes versioned documentation, making it easier for users to navigate and reference.\n\n3. **Mature Agent Framework**: The agent framework has been improved for better maturity and control, allowing for more effective use of agents within the LangChain ecosystem.\n\n4. **Standardized LLM Interface**: There is improved standardization around the LLM (Large Language Model) interface, particularly regarding tool calling, which enhances overall usability.\n\n5. **Streaming Support**: The new version adds support for streaming, which can enhance responsiveness and performance in various applications.\n\n6. **New Partner Packages**: More than 30 n

In [53]:
response

{'response': AIMessage(content='LangChain v0.2 brings several improvements, including:\n\n1. Full separation of the langchain package from langchain-community.\n2. New (and versioned) documentation.\n3. A more mature and controllable agent framework.\n4. Improved LLM interface standardization, particularly around tool calling.\n5. Streaming support.\n6. Over 30 new partner packages.\n\nThis release builds upon the foundation laid in v0.1 and incorporates community feedback.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 92, 'prompt_tokens': 1117, 'total_tokens': 1209}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_48196bc67a', 'finish_reason': 'stop', 'logprobs': None}, id='run-db681569-701b-4ced-9bba-4d5e978b5ef0-0', usage_metadata={'input_tokens': 1117, 'output_tokens': 92, 'total_tokens': 1209}),
 'context': [Document(metadata={'source': 'https://blog.langchain.dev/langchain-v02-leap-to-stability/', 'loc': 'https://

In [54]:
for context in response["context"]:
  print("Context:")
  print(context)
  print("----")

Context:
page_content='Four months ago, we released the first stable version of LangChain. Today, we are following up by announcing a pre-release of langchain v0.2.This release builds upon the foundation laid in v0.1 and incorporates community feedback. We’re excited to share that v0.2 brings: A much-desired full separation of langchain and langchain-community New (and versioned!) docs A more mature and controllable agent framework Improved LLM interface standardization, particularly around tool callingBetter' metadata={'source': 'https://blog.langchain.dev/langchain-v02-leap-to-stability/', 'loc': 'https://blog.langchain.dev/langchain-v02-leap-to-stability/', 'lastmod': '2024-05-16T22:26:07.000Z', '_id': '5253c8094d31419fa6a2f81344b08c4f', '_collection_name': '2279a7153971439bbb7bf2af0e0a5560'}
----
Context:
page_content='LangChain v0.2: A Leap Towards Stability

Skip to content

All Posts

Case Studies

In the Lo
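
Raw scraped pages carry long runs of blank lines, which makes printed contexts hard to scan. One possible way to get a compact view is sketched below, assuming each retrieved item exposes `page_content` and a `metadata` dict with a `source` key, as the output above suggests. `preview` and `fake_doc` are hypothetical names, and `SimpleNamespace` stands in for a real `Document`.

```python
from types import SimpleNamespace


def preview(doc, width=80):
    """Return 'source: text' with whitespace collapsed and the text cut to width."""
    text = " ".join(doc.page_content.split())  # collapses newlines and repeated spaces
    if len(text) > width:
        text = text[:width] + "..."
    return f"{doc.metadata.get('source', 'unknown source')}: {text}"


# Stand-in for a retrieved Document, using text seen in the output above.
fake_doc = SimpleNamespace(
    page_content="LangChain v0.2: A Leap Towards Stability\n\n\n\n\nSkip to content\n\n\nAll Posts",
    metadata={"source": "https://blog.langchain.dev/langchain-v02-leap-to-stability/"},
)

line = preview(fake_doc)
print(line)
```

With the real chain, the same helper could be applied to each item in `response["context"]`.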

Let's see if it can handle a query that is totally unrelated to the source documents.

In [55]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What is the airspeed velocity of an unladen swallow?"})

In [56]:
response["response"].content

"I don't know."

## Task 3: Setting Up LangSmith

Now that we have a chain - we're ready to get started with LangSmith!

We're going to go ahead and use the following `env` variables to get our Colab notebook set up to start reporting.

If all you needed was simple monitoring - this is all you would need to do!

In [57]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ In fact, the main topic of this notebook is LangSmith.
# 🚶‍♀️🚶‍♀️🚶‍♀️ LangSmith is a tracing and evaluation platform we can use to get a very deep understanding of the data flow of our chain.


from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"LangSmith - {unique_id}"

### LangSmith API

To use LangSmith, you will need a beta key. You can join the queue through the `Beta Sign Up` button on LangSmith's homepage!

Join [here](https://www.langchain.com/langsmith)

In [58]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')
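
Before invoking the chain, it can help to confirm that the three variables set in this notebook are actually present. The helper below is a small sketch: `missing_langsmith_settings` is a hypothetical function, not part of LangSmith, and the example values are placeholders.

```python
import os

# The three environment variables this notebook sets for LangSmith.
NOTEBOOK_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_PROJECT", "LANGCHAIN_API_KEY")


def missing_langsmith_settings(env=os.environ):
    """Return the names of the notebook's LangSmith variables that are unset or empty."""
    return [name for name in NOTEBOOK_VARS if not env.get(name)]


# Placeholder environment with the API key left out on purpose.
example_env = {
    "LANGCHAIN_TRACING_V2": "true",
    "LANGCHAIN_PROJECT": "LangSmith - demo",
}

missing = missing_langsmith_settings(example_env)
print(missing)
```

Calling `missing_langsmith_settings()` with no argument checks the real environment of the running notebook.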

Let's test out our first generation!

In [59]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ Let's invoke our chain and get the answer.
# 🚶‍♀️🚶‍♀️🚶‍♀️ Thanks to LangSmith, we can walk through every step from the very beginning of our chain to the final response.

retrieval_augmented_qa_chain.invoke({"question" : "What is LangSmith?"}, {"tags" : ["Demo Run"]})['response']

AIMessage(content='LangSmith is a framework built on the shoulders of LangChain, designed to track the performance and improve the observability of AI-powered products, particularly those utilizing large language models (LLMs). It offers features such as a sleek user interface and an SDK that provides fine-grain controls and customizability for developers.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 63, 'prompt_tokens': 927, 'total_tokens': 990}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_48196bc67a', 'finish_reason': 'stop', 'logprobs': None}, id='run-51ce8ab4-a099-4257-b97d-5aac8e209eae-0', usage_metadata={'input_tokens': 927, 'output_tokens': 63, 'total_tokens': 990})

## Task 4: Examining the Trace in LangSmith!

Head on over to your LangSmith web UI to check out how the trace looks in LangSmith!

#### 🏗️ Activity #1:

Include a screenshot of your trace and explain what it means.

## Task 5: Loading Our Testing Set

In [3]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ LangSmith is used not only for tracing, but also for evaluation.


In [62]:
# !git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 84, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 84 (delta 23), reused 28 (delta 8), pack-reused 8 (from 1)
Receiving objects: 100% (84/84), 70.08 MiB | 14.60 MiB/s, done.
Resolving deltas: 100% (23/23), done.


In [None]:
import pandas as pd

test_df = pd.read_csv("langchain_blog_test_data.csv")

In [109]:
# 😀 My own cell

questions_and_answers = []

# iterrows() yields (index, row) pairs; only the row is needed here
for _, row in test_df.iterrows():
    questions_and_answers.append({'question': row['question'], 'answer': row['answer']})

questions_and_answers

[{'question': 'How did Podium improve their agent F1 response quality and reduce engineering intervention by 90%?',
  'answer': "Podium optimized agent behavior and reduced engineering intervention by 90% by using LangSmith for dataset curation and finetuning, which improved the agent's F1 response quality to 98%."},
 {'question': 'How did Athena Intelligence utilize LangSmith in their workflow to enhance the generation of complex research reports?',
  'answer': 'Athena Intelligence used the LangSmith playground and debugging features to quickly identify LLM issues and generate complex research reports.'},
 {'question': 'What are the four strategies supported by LangGraph Cloud for handling additional context when dealing with double-texting in currently-running threads?',
  'answer': 'LangGraph Cloud provides four different strategies for handling additional user inputs on currently-running threads of the graph. What are these strategies?\n\nReject, queue, interrupt, and rollback.'},


Now we can set up our LangSmith client - and we'll add the above created dataset to our LangSmith instance!

> NOTE: Read more about this process [here](https://docs.smith.langchain.com/old/evaluation/faq/manage-datasets#create-from-list-of-values)

In [64]:
from langsmith import Client

client = Client()

dataset_name = "langsmith-demo-dataset-aie4-triples-v3"

dataset = client.create_dataset(
    dataset_name=dataset_name, description="LangChain Blog Test Questions"
)

for triplet in test_df.iterrows():
  triplet = triplet[1]
  client.create_example(
      inputs={"question" : triplet["question"], "context": triplet["context"]},
      outputs={"answer" : triplet["answer"]},
      dataset_id=dataset.id
  )
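
The loop above turns each row into an `inputs` / `outputs` pair. The sketch below shows that mapping on its own, using the standard `csv` module so it runs without pandas or a LangSmith account. The sample row is made up for illustration, and `rows_to_examples` is a hypothetical helper.

```python
import csv
import io

# Made-up sample with the same three columns the notebook reads.
csv_text = (
    "question,context,answer\n"
    "What is LCEL?,LCEL is the LangChain Expression Language.,The LangChain Expression Language.\n"
)


def rows_to_examples(rows):
    """Map each row to the inputs/outputs shape passed to create_example above."""
    examples = []
    for row in rows:
        examples.append(
            {
                "inputs": {"question": row["question"], "context": row["context"]},
                "outputs": {"answer": row["answer"]},
            }
        )
    return examples


examples = rows_to_examples(csv.DictReader(io.StringIO(csv_text)))
print(examples[0]["inputs"]["question"])
```

Building the payloads first makes it easy to inspect a few of them before uploading anything.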

## Task 6: Evaluation

Now we can run the evaluation!

We'll need to start by writing some custom data preparation functions to ensure our chain works with the expected inputs/outputs from the `evaluate` process in LangSmith.

> NOTE: More reading on this available [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#evaluate-a-langchain-runnable)

In [73]:
def prepare_data_ref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "reference" : example.outputs["answer"],
      "input" : example.inputs["question"]
  }

def prepare_data_noref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "input" : example.inputs["question"]
  }

def prepare_context_ref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "reference" : example.inputs["context"],
      "input" : example.inputs["question"]
  }
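
To see what these functions hand to an evaluator, the sketch below calls `prepare_data_ref` on stand-in objects. `fake_run` and `fake_example` are hypothetical: they use `SimpleNamespace` to imitate only the `outputs` and `inputs` attributes the functions read, and their text is invented for illustration.

```python
from types import SimpleNamespace


def prepare_data_ref(run, example):
    # Same mapping as in the notebook cell above.
    return {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Stand-ins for the run (what the chain produced) and the dataset example.
fake_run = SimpleNamespace(outputs={"response": "LangSmith traces and evaluates LLM apps."})
fake_example = SimpleNamespace(
    inputs={
        "question": "What is LangSmith?",
        "context": "LangSmith is a tracing and evaluation platform.",
    },
    outputs={"answer": "A tracing and evaluation platform."},
)

fields = prepare_data_ref(fake_run, fake_example)
print(fields)
```

The only difference in `prepare_context_ref` is that `reference` comes from `example.inputs["context"]`, so the evaluator grades against the retrieved-context column instead of the ground-truth answer.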

We'll be using a few custom evaluators to evaluate our pipeline, as well as a few "built in" methods!

Check out the built-ins [here](https://docs.smith.langchain.com/reference/sdk_reference/langchain_evaluators)!

In [4]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ It provides many built-in evaluators, such as the CoT (chain-of-thought) evaluator.
#  It is also possible to define our own evaluation criteria.

In [None]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

cot_qa_evaluator = LangChainStringEvaluator("cot_qa", prepare_data=prepare_context_ref)

unlabeled_dopeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria" : {
            "dopeness" : "Is the answer to the question dope, meaning cool - awesome - and legit?"
        }
    },
    prepare_data=prepare_data_noref
)

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "Is the generated answer the same as the reference answer?"
        },
    },
    prepare_data=prepare_data_ref
)

base_rag_results = evaluate(
    retrieval_augmented_qa_chain.invoke,
    data=dataset_name,
    evaluators=[
        cot_qa_evaluator,
        unlabeled_dopeness_evaluator,
        labeled_score_evaluator,
        ],
    experiment_prefix="Base RAG Evaluation"
)
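
Besides the LLM-based evaluators above, a plain Python function can also serve as an evaluator. The sketch below assumes the LangSmith convention of a function that takes `(run, example)` and returns a dict with a `key` and a `score`. `exact_match` is a hypothetical evaluator that is not used in this notebook, and the stand-in objects exist only so the example runs offline.

```python
from types import SimpleNamespace


def exact_match(run, example):
    """Score 1 if the prediction equals the reference answer (ignoring case), else 0."""
    prediction = run.outputs["response"]
    # The chain returns an AIMessage; fall back to the raw value for plain strings.
    prediction = getattr(prediction, "content", prediction)
    reference = example.outputs["answer"]
    score = int(prediction.strip().lower() == reference.strip().lower())
    return {"key": "exact_match", "score": score}


# Stand-ins imitating the attributes the evaluator reads.
fake_run = SimpleNamespace(outputs={"response": SimpleNamespace(content="I don't know.")})
fake_example = SimpleNamespace(outputs={"answer": "i don't know."})

result = exact_match(fake_run, fake_example)
print(result)
```

Exact matching is far stricter than the LLM-judged accuracy criterion above, so it is best treated as a cheap sanity check rather than a replacement.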

In [6]:
# 🚶‍♀️🚶‍♀️🚶‍♀️ Evaluation results are automatically tracked so that we can easily check them out.

# 🚶‍♀️🚶‍♀️🚶‍♀️ OK, that's gonna do it for today's video.

### 😀😀😀 Screenshots of my LangSmith traces


![Screenshot 1](Screenshots/Screenshot_1.png)
![Screenshot 2](Screenshots/Screenshot_2.png)
![Screenshot 3](Screenshots/Screenshot_3.png)
![Screenshot 4](Screenshots/Screenshot_4.png)
![Screenshot 5](Screenshots/Screenshot_5.png)
![Screenshot 6](Screenshots/Screenshot_6.png)
![Screenshot 7](Screenshots/Screenshot_7.png)



#### ❓Question #1:

What conclusions can you draw about the above results?

Describe in your own words what the metrics are expressing.

### 😀😀😀 Answers to Question #1

- CoT (Chain-of-Thought) metric
    - It requires the question, context, and answer for its judgement. Ground truth is not required.
    - It focuses on whether the answer is derived from the given context in a step-by-step manner.
    - It is effectively a binary classifier.

- Dopeness metric
    - It draws its conclusion from the question and answer only. Context and ground truth are not required.
    - It is a custom metric about dopeness (coolness).
    - As the evaluator pointed out in its results, this metric can be highly subjective and should be interpreted with caution.
    - It is effectively a binary classifier.

- Accuracy metric
    - It draws its conclusion from the question, answer, and ground truth. Context is not required.
    - It examines how close our chain's response is to the ground truth.
    - It is effectively a 10-class classifier.
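
Because two of these metrics are binary and one is on a 1-10 scale, their averages are not directly comparable. One possible way to put them on a common 0-1 scale is sketched below. The scores are invented placeholders, not results from this notebook, and `summarize` is a hypothetical helper.

```python
from statistics import mean

# Placeholder scores for four examples; NOT real evaluation results.
feedback = {
    "cot_qa": [1, 0, 1, 1],        # binary: 1 = correct, 0 = incorrect
    "dopeness": [0, 1, 0, 0],      # binary: 1 = meets the criterion
    "accuracy": [8, 3, 10, 7],     # 1-10 score
}


def summarize(feedback, ten_point_keys=("accuracy",)):
    """Average each metric, dividing 1-10 metrics by 10 to land on a 0-1 scale."""
    summary = {}
    for key, scores in feedback.items():
        average = mean(scores)
        summary[key] = average / 10 if key in ten_point_keys else average
    return summary


summary = summarize(feedback)
print(summary)
```

Even on a shared scale, the numbers answer different questions, so they are better read side by side than combined into a single score.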