# LangSmith and Evaluation Overview with AI Makerspace

Today we'll be looking at an amazing tool:

[LangSmith](https://docs.smith.langchain.com/)!

This tool will help us monitor, test, debug, and evaluate our LangChain applications - and more!

We'll also be looking at a few Advanced Retrieval techniques along the way - and evaluate them using LangSmith!

✋BREAKOUT ROOM #2:
- Task 1: Dependencies and OpenAI API Key
- Task 2: LCEL RAG Chain
- Task 3: Setting Up LangSmith
- Task 4: Examining the Trace in LangSmith!
- Task 5: Create Testing Dataset
- Task 6: Evaluation

## Task 1: Dependencies and OpenAI API Key

We'll be using OpenAI's suite of models today to help us generate and embed our documents for a simple RAG system built on top of LangChain's blogs!

In [1]:
!pip install langchain_core langchain_openai langchain_community langchain-qdrant qdrant-client langsmith openai tiktoken cohere lxml -qU


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
import os
import getpass
from dotenv import load_dotenv

load_dotenv()

True
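The cell above imports `getpass` but never calls it. As a minimal sketch, assuming the OpenAI key may not always be present in the `.env` file, a fallback prompt could look like the following. `ensure_env_var` is a hypothetical helper (not part of any library), and the `prompt_fn` parameter exists only so the behaviour can be tried without typing a real key.

```python
import getpass
import os


def ensure_env_var(name, prompt_fn=getpass.getpass):
    """Return os.environ[name], prompting for it only if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical fallback: ask interactively and store the result for later cells.
        value = prompt_fn(f"Enter {name}: ")
        os.environ[name] = value
    return value


# Typical notebook usage (prompts only when load_dotenv() did not supply the key):
# ensure_env_var("OPENAI_API_KEY")
```

Calling it a second time returns the stored value without prompting again.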

#### Asyncio Bug Handling

This is necessary for Colab.

In [3]:
import nest_asyncio
nest_asyncio.apply()

## Task 2: Create a Simple RAG Application Using Qdrant, OpenAI, and LCEL

Now that we have a grasp on how LCEL works, and how we can use LangChain and OpenAI to interact with our data - let's step it up a notch and incorporate Qdrant!

## LangChain Powered RAG

First and foremost, LangChain provides a convenient way to store our chunks and their embeddings.

It's called a `VectorStore`!

We'll be using Qdrant as our `VectorStore` today. You can read more about it [here](https://qdrant.tech/documentation/).

Think of a `VectorStore` as a smart way to house your chunks and their associated embedding vectors. The implementation of the `VectorStore` also allows for smarter and more efficient search of our embedding vectors - as the method we used above would not scale well as we got into the millions of chunks.

Otherwise, the process remains relatively similar under the hood!
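To make "under the hood" concrete, here is a toy, brute-force vector store in plain Python. It is only a sketch of the idea: `ToyVectorStore` and the two-dimensional vectors are invented for illustration, and a real store such as Qdrant relies on an approximate nearest-neighbour index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Brute-force store: keeps (vector, text) pairs and scans them all on search."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    def search(self, query_vector, k=2):
        # Rank every stored chunk by similarity to the query, highest first.
        scored = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "chunk about LangSmith tracing")
store.add([0.0, 1.0], "chunk about Qdrant indexing")
store.add([0.9, 0.1], "chunk about LangSmith evaluation")

top_chunks = store.search([1.0, 0.0], k=2)
```

The linear scan in `search` is exactly the part that stops scaling once there are millions of chunks, which is the gap a dedicated vector store fills.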

We'll use a `SitemapLoader` to scrape the LangChain blogs - which will serve as our data for today!

### Data Collection

We'll be leveraging the `SitemapLoader` to load the LangChain blog posts directly from the web!

In [4]:
from langchain_community.document_loaders import SitemapLoader

documents = SitemapLoader(web_path="https://blog.langchain.dev/sitemap-posts.xml").load()

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|######################################################################################################################################################| 263/263 [00:20<00:00, 12.95it/s]


### Chunking Our Documents

Let's do the same process as we did before with our `RecursiveCharacterTextSplitter` - but this time we'll use 500 characters as our max chunk size (we measure length with `len`, so the limit is in characters, not tokens)!

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 0,
    length_function = len,
)

split_chunks = text_splitter.split_documents(documents)

In [6]:
len(split_chunks)

5675

Alright, now we have 5,675 chunks, each at most 500 characters long.

Let's verify the process worked as intended by checking our max document length.

In [7]:
max_chunk_length = 0

for chunk in split_chunks:
  max_chunk_length = max(max_chunk_length, len(chunk.page_content))

print(max_chunk_length)

499
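The 499 above is a character count, because the splitter was given `length_function = len`. The sketch below is a simplified, hypothetical re-implementation of the recursive idea (try the coarsest separator first, fall back to finer ones only for pieces that are still too long). It ignores overlap and is not LangChain's actual code, and `recursive_split` assumes the separator list ends with the empty string.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("alpha beta gamma delta", chunk_size=11)
longest = max(len(chunk) for chunk in chunks)
```

Checking `longest` against `chunk_size` mirrors the max-length check in the cell above.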


Perfect! Now we can carry on to creating and storing our embeddings.

### Embeddings and Vector Storage

We'll use the `text-embedding-3-small` embedding model again - and `Qdrant` to store all our embedding vectors for easy retrieval later!

In [8]:
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

qdrant_vectorstore = Qdrant.from_documents(
    documents=split_chunks,
    embedding=embedding_model,
    location=":memory:"
)

Now let's set up our retriever, just as we saw before, but this time using LangChain's simple `as_retriever()` method!

In [9]:
qdrant_retriever = qdrant_vectorstore.as_retriever()

#### Back to the Flow

We're ready to move to the next step!

### Setting up our RAG

We'll use the LCEL we touched on earlier to create a RAG chain.

Let's think through each part:

1. First we need to retrieve context
2. We need to pipe that context to our model
3. We need to parse that output
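The three steps above can be sketched without LangChain at all. The `Step` class below is a toy stand-in that only mimics how the `|` operator composes stages. It is not how LCEL is implemented, and the retriever, model, and parser are hard-coded fakes.

```python
class Step:
    """Toy stand-in for an LCEL runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b builds a new Step that runs a, then feeds its output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for the three stages listed above.
retrieve = Step(lambda question: {"question": question, "context": ["LangSmith traces chains."]})
generate = Step(lambda inputs: {"content": f"Answer based on {len(inputs['context'])} chunk(s)."})
parse = Step(lambda message: message["content"])

toy_chain = retrieve | generate | parse
answer = toy_chain.invoke("What is LangSmith?")
```

Each stage only needs to agree with its neighbour on the shape of the value passed between them, which is the property the real chain relies on as well.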

Let's start by setting up our prompt again, just so it's fresh in our minds!

#### 🏗️ Activity #2:

Complete the prompt so that your RAG application answers queries based on the context provided, but *does not* answer queries if the context is unrelated to the query.

In [10]:
from langchain.prompts import ChatPromptTemplate

base_rag_prompt_template = """
[INTRODUCTION]  
You are an advanced AI language model designed to provide highly accurate, relevant, and context-aware responses. 
You will use the provided retrieved context to generate responses, ensuring factual consistency and avoiding speculation.  

[CONTEXT HANDLING]  
- If relevant context is provided:  
  - Use it as the primary source of truth.  
  - Each context item will be separated with "<|>" separator.
  - Do not invent information—strictly base responses on the given context.  
  - If the context is unclear, ambiguous, or insufficient, state the limitations and ask for clarification.  

- If no context is provided:  
  - Clearly inform the user that no supporting data was given.  
  - Provide general knowledge while explicitly stating that the response is not grounded in retrieved documents.  
  - Avoid making up specific details.  

[RESPONSE GUIDELINES]  
1. Accuracy & Relevance  
   - Prioritize precise, concise, and contextually relevant responses.  
   - If multiple interpretations exist, ask clarifying questions.  

2. Structured Responses  
   - When applicable, format responses with headings, bullet points, or numbered lists.  
   - Maintain clarity and logical flow.  

3. Transparency & Uncertainty Handling  
   - If the context lacks sufficient details, indicate the gap and avoid speculation.  
   - Use phrases like: "Based on the provided context..." or "There is no specific information available on X, but generally..."  

4. No Overreliance on Training Data  
   - If asked about recent events or dynamic topics, clearly state that up-to-date information is required and suggest retrieving the latest data.  

5. Respect and Ethical Considerations  
   - Ensure responses remain neutral, unbiased, and free from misinformation.  
   - Do not engage in unethical, harmful, or misleading discussions.  

[OUTPUT FORMAT]  
- Default response in natural, professional language.  
- For structured data, return JSON or tabular format if explicitly requested.  
- Adhere to any specific formatting preferences the user specifies.  

[ADDITIONAL INSTRUCTIONS]  
- If a query is unclear, ask for clarification before proceeding.  
- If a query is outside your scope, politely decline and guide the user to authoritative sources.  
- Adopt the appropriate tone (formal, casual, technical) based on the user's input style.

[REQUEST FORMAT]
Context will be provided in the following format:

<Context Start>
{context}
<Context End>

<User Query Start>
{question}
<User Query End>
"""

base_rag_prompt = ChatPromptTemplate.from_template(base_rag_prompt_template)
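The prompt tells the model that context items are separated by `"<|>"`, but none of the code shown in this notebook inserts that separator. One possible way to honour it is a small formatting helper applied to the retrieved documents before they reach the prompt. This is a sketch under assumptions: `format_context` and `FakeDocument` are hypothetical names, and only the `page_content` attribute is modelled.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document; only `page_content` is modelled."""
    page_content: str


def format_context(documents, separator="<|>"):
    """Join the text of each retrieved document with the separator the prompt describes."""
    return f"\n{separator}\n".join(doc.page_content for doc in documents)


docs = [FakeDocument("LangSmith records traces."), FakeDocument("Qdrant stores vectors.")]
formatted = format_context(docs)
```

Whether this improves answers is something the evaluation later in the notebook could measure, so treat it as an experiment rather than a required step.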

We'll set our Generator - `gpt-4o-mini` in this case - below!

In [11]:
from langchain_openai.chat_models import ChatOpenAI

base_llm = ChatOpenAI(model="gpt-4o-mini", tags=["base_llm"])

#### Our RAG Chain

Notice how we have a bit of a more complex chain this time - that's because we want to return our sources with the response.

Let's break down the chain step-by-step:

1. We invoke the chain with the `question` item. Notice how we only need to provide `question` since both the retriever and the `"question"` object depend on it.
  - We also chain our `"question"` into our `retriever`! This is what ultimately collects the context through Qdrant.
2. We use `RunnablePassthrough.assign()` to carry the collected `"context"` forward from the previous step. This lets us simply pass it through to the next step, while still allowing us to run that section of the chain.
3. We finally collect our response by chaining our prompt, which expects both a `"question"` and `"context"`, into our `llm`. We also collect the `"context"` again so we can output it in the final response object.

The key thing to keep in mind here is that we need to pass our context through *after* we've retrieved it - to populate the object in a way that doesn't require us to call it or try and use it for something else.
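Before reading the LCEL version, it can help to trace the same data flow with plain dictionaries. The sketch below uses hypothetical stand-ins (`fake_retriever`, `fake_llm`) and ordinary function calls. It only illustrates which keys exist after each step, not how LangChain executes the chain.

```python
def fake_retriever(question):
    """Hypothetical retriever: returns canned chunks regardless of the question."""
    return ["chunk A", "chunk B"]


def fake_llm(prompt_text):
    """Hypothetical model: reports how long the prompt it received was."""
    return f"(answer generated from a {len(prompt_text)}-character prompt)"


def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming {"question"} dict.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: pass everything through, re-assigning "context" to itself.
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and keep the context for the output.
    prompt_text = f"Context: {state['context']}\nQuestion: {state['question']}"
    return {"response": fake_llm(prompt_text), "context": state["context"]}


result = run_chain({"question": "What is LCEL?"})
```

The final dictionary has the same two keys, `"response"` and `"context"`, that we read from the real chain's output later on.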

In [12]:
from operator import itemgetter
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the qdrant_retriever
    {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": base_rag_prompt | base_llm, "context": itemgetter("context")}
)

Let's get a visual understanding of our chain!

In [13]:
!pip install -qU grandalf


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [14]:
print(retrieval_augmented_qa_chain.get_graph().draw_ascii())

          +---------------------------------+      
          | Parallel<context,question>Input |      
          +---------------------------------+      
                    **            **               
                  **                **             
                **                    **           
         +--------+                     **         
         | Lambda |                      *         
         +--------+                      *         
              *                          *         
              *                          *         
              *                          *         
  +----------------------+          +--------+     
  | VectorStoreRetriever |          | Lambda |     
  +----------------------+          +--------+     
                    **            **               
                      **        **                 
                        **    **                   
          +----------------------------------+     
          | 

Let's try another visual representation:

![image](https://i.imgur.com/Ad31AhL.png)

Let's test our chain out!

In [15]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What's new in LangChain v0.2?"})

In [16]:
response["response"].content

'LangChain v0.2 introduces several significant improvements and features based on community feedback and aims to enhance stability and security. Here are the key updates:\n\n1. **Separation of Packages**:\n   - LangChain is now fully separated from langchain-community, with langchain-community depending on langchain-core and langchain.\n\n2. **Documentation**:\n   - The release comes with new and versioned documentation, making it easier to navigate and understand the functionalities.\n\n3. **Agent Framework**:\n   - A more mature and controllable agent framework has been implemented.\n\n4. **LLM Interface Standardization**:\n   - Improvements have been made to standardize the LLM interface, particularly concerning tool calling.\n\n5. **Streaming Support**:\n   - The new version includes support for streaming.\n\n6. **New Partner Packages**:\n   - More than 30 new partner packages have been added.\n\nThis version is currently a pre-release, with a full release of v0.2 expected in a few

In [23]:
for context in response["context"]:
  print("Context:")
  print(context)
  print("----")

Context:
page_content='fine-tuned for chat.yi-34b-200k-capybara - a 34b parameter model from Nous Research.Check out the linked comparisons to see the outputs in LangSmith, or reference the aggregate metrics below:' metadata={'source': 'https://blog.langchain.dev/extraction-benchmarking/', 'loc': 'https://blog.langchain.dev/extraction-benchmarking/', 'lastmod': '2023-12-05T18:17:30.000Z', '_id': 'aba7384496d847d5b9abf534b4c6d749', '_collection_name': '0edae4e819de45e5b21bf01a818a989f'}
----
Context:
page_content='- p)}{n}}\)We were surprised by the poor performance of the fine-tuned mistral-7b-instruct-v0.1 model. Why does it struggle for this task? Let's review one of its runs to see where it could be improved. For the data point "aaa" (see linked run), the model first invokes "a", then responds in text "a" with a mis-formatted function call for the letter "b". The agent then returns.Failing response on the second invocation.The image above is taken from the second LLM invocation, aft

Let's see if it can handle a query that is totally unrelated to the source documents.

In [18]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What is the airspeed velocity of an unladen swallow?"})

In [19]:
response["response"].content

'There is no specific information available on the airspeed velocity of an unladen swallow in the provided context. However, generally speaking, this is a well-known question often cited in popular culture, particularly from the film "Monty Python and the Holy Grail." In that context, it refers humorously to the idea of measuring the flight speed of a swallow, with various estimates suggesting that the airspeed velocity of an unladen European swallow is about 11 meters per second. If you need more specific or detailed information, please let me know!'

## Task 3: Setting Up LangSmith

Now that we have a chain - we're ready to get started with LangSmith!

We're going to go ahead and use the following `env` variables to get our Colab notebook set up to start reporting.

If all you needed was simple monitoring - this is all you would need to do!

In [21]:
assert os.getenv("LANGSMITH_TRACING_V2")
assert os.getenv("LANGSMITH_API_KEY")
assert os.getenv("LANGSMITH_PROJECT")
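A bare `assert` fails without saying which variable is missing. As a small sketch, assuming the same three variable names the cell above checks, a helper can report the gaps explicitly. `missing_env_vars` is a hypothetical name, and the `environ` argument is there so the check can be tried without real credentials.

```python
import os

REQUIRED_VARS = ("LANGSMITH_TRACING_V2", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Example with an explicit mapping instead of the real process environment.
example_env = {"LANGSMITH_TRACING_V2": "true", "LANGSMITH_PROJECT": "demo"}
missing = missing_env_vars(environ=example_env)
```

In the notebook itself, calling `missing_env_vars()` with no arguments checks the real environment after `load_dotenv()` has run.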

### LangSmith API

In order to use LangSmith, you will need a beta key. You can join the queue through the `Beta Sign Up` button on LangSmith's homepage!

Join [here](https://www.langchain.com/langsmith)

Let's test out our first generation!

In [22]:
retrieval_augmented_qa_chain.invoke({"question" : "What is LangSmith?"}, {"tags" : ["Demo Run"]})['response']

AIMessage(content="### What is LangSmith?\n\nLangSmith is a framework built on top of LangChain, designed to enhance the observability and usability of AI-powered products, particularly those utilizing large language models (LLMs). Here are some key features and aspects of LangSmith based on the provided context:\n\n- **Purpose**: LangSmith aims to track and improve the development lifecycle of AI applications, providing tools for better decision-making and QA (quality assurance).\n\n- **Implementation**: It comes with an SDK (Software Development Kit) that allows developers to easily integrate LangSmith into their LLM-related functions. This integration can be as simple as adding a `@traceable` decorator to functions.\n\n- **Benefits**: \n  - **Improved Observability**: LangSmith has been noted for enhancing visibility into product quality, which facilitates rapid iteration during development.\n  - **Customization**: The SDK provides fine-grain controls, allowing teams to tailor the f

## Task 4: Examining the Trace in LangSmith!

Head on over to your LangSmith web UI to check out how the trace looks in LangSmith!

#### 🏗️ Activity #1:

Include a screenshot of your trace and explain what it means.

## Task 5: Loading Our Testing Set

In [24]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (96/96), done.
remote: Total 119 (delta 36), reused 40 (delta 10), pack-reused 8 (from 1)
Receiving objects: 100% (119/119), 78.04 MiB | 722.00 KiB/s, done.
Resolving deltas: 100% (36/36), done.


In [28]:
import pandas as pd

test_df = pd.read_csv("DataRepository/langchain_blog_test_data.csv")

Now we can set up our LangSmith client - and we'll add the test data we just loaded to our LangSmith instance as a dataset!

> NOTE: Read more about this process [here](https://docs.smith.langchain.com/old/evaluation/faq/manage-datasets#create-from-list-of-values)

In [29]:
from langsmith import Client

load_dotenv()

client = Client()

dataset_name = "langsmith-demo-dataset-aie4-triples-v3"

dataset = client.create_dataset(
    dataset_name=dataset_name, description="LangChain Blog Test Questions"
)

for _, triplet in test_df.iterrows():
  client.create_example(
      inputs={"question" : triplet["question"], "context": triplet["context"]},
      outputs={"answer" : triplet["answer"]},
      dataset_id=dataset.id
  )
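To see the shape of what gets uploaded without touching the network, the row-to-example mapping can be isolated into a pure function. This is a sketch: `rows_to_examples` is a hypothetical helper, and the sample row is invented rather than taken from the CSV.

```python
def rows_to_examples(rows):
    """Convert question/context/answer rows into (inputs, outputs) dict pairs."""
    examples = []
    for row in rows:
        inputs = {"question": row["question"], "context": row["context"]}
        outputs = {"answer": row["answer"]}
        examples.append((inputs, outputs))
    return examples


# Hypothetical row standing in for one record of test_df.to_dict("records").
sample_rows = [
    {
        "question": "What is LangSmith?",
        "context": "LangSmith helps trace and evaluate LLM applications.",
        "answer": "A tracing and evaluation tool.",
    },
]
examples = rows_to_examples(sample_rows)
```

Note that the reference context is stored under `inputs` while the reference answer is stored under `outputs`, which matters for the data preparation functions in Task 6.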

## Task 6: Evaluation

Now we can run the evaluation!

We'll need to start by writing a few data preparation functions to ensure our chain's outputs match the inputs/outputs that the `evaluate` process in LangSmith expects.

> NOTE: More reading on this available [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#evaluate-a-langchain-runnable)

In [30]:
def prepare_data_ref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "reference" : example.outputs["answer"],
      "input" : example.inputs["question"]
  }

def prepare_data_noref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "input" : example.inputs["question"]
  }

def prepare_context_ref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "reference" : example.inputs["context"],
      "input" : example.inputs["question"]
  }
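To check what one of these functions hands to an evaluator, it can be called with simple stand-in objects. The sketch below assumes only that `run` and `example` expose `inputs` and `outputs` dictionaries, as the functions above do. The `SimpleNamespace` objects and their values are invented for illustration and use plain strings rather than real model messages.

```python
from types import SimpleNamespace


def prepare_data_ref(run, example):
    # Same shape as the notebook's function of the same name.
    return {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Hypothetical stand-ins; only the attributes read above are modelled.
fake_run = SimpleNamespace(outputs={"response": "LangSmith traces LLM apps.", "context": []})
fake_example = SimpleNamespace(
    inputs={"question": "What is LangSmith?", "context": "..."},
    outputs={"answer": "A platform for tracing and evaluating LLM apps."},
)

prepared = prepare_data_ref(fake_run, fake_example)
```

The only difference between the three functions is which field ends up as `"reference"`: the labelled answer, nothing at all, or the labelled context.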

We'll be using a few custom evaluators to evaluate our pipeline, as well as a few "built in" methods!

Check out the built-ins [here](https://docs.smith.langchain.com/reference/sdk_reference/langchain_evaluators)!
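Besides LLM-judged evaluators, a custom evaluator can be an ordinary function. The sketch below assumes the convention, as we understand the LangSmith SDK, that such a function receives `(run, example)` and returns a dictionary with a `key` and a `score`. The metric itself, `reference_word_recall`, is a hypothetical and deliberately crude word-overlap measure, tested here with stand-in objects.

```python
from types import SimpleNamespace


def reference_word_recall(run, example):
    """Fraction of distinct reference-answer words that also appear in the prediction."""
    predicted = set(str(run.outputs["response"]).lower().split())
    reference = set(example.outputs["answer"].lower().split())
    if not reference:
        return {"key": "reference_word_recall", "score": 0.0}
    score = len(predicted & reference) / len(reference)
    return {"key": "reference_word_recall", "score": score}


# Hypothetical stand-ins for a run and its dataset example.
fake_run = SimpleNamespace(outputs={"response": "LangSmith traces chains"})
fake_example = SimpleNamespace(outputs={"answer": "LangSmith traces and evaluates chains"})

result = reference_word_recall(fake_run, fake_example)
```

A cheap deterministic metric like this is a useful sanity check next to LLM-judged scores, since it costs nothing to run and never varies between runs.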

In [31]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

eval_llm = ChatOpenAI(model="gpt-4o-mini", tags=["eval_llm"])

cot_qa_evaluator = LangChainStringEvaluator("cot_qa",  config={"llm":eval_llm}, prepare_data=prepare_context_ref)

unlabeled_dopeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria" : {
            "dopeness" : "Is the answer to the question dope, meaning cool - awesome - and legit?"
        },
        "llm" : eval_llm,
    },
    prepare_data=prepare_data_noref
)

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "Is the generated answer the same as the reference answer?"
        },
    },
    prepare_data=prepare_data_ref
)

base_rag_results = evaluate(
    retrieval_augmented_qa_chain.invoke,
    data=dataset_name,
    evaluators=[
        cot_qa_evaluator,
        unlabeled_dopeness_evaluator,
        labeled_score_evaluator,
        ],
    experiment_prefix="Base RAG Evaluation"
)


View the evaluation results for experiment: 'Base RAG Evaluation-989012fb' at:
https://smith.langchain.com/o/572e12a7-8fde-491d-9e63-4eecbb3c1f61/datasets/01a89ad2-15a2-4d17-8876-9a997868c633/compare?selectedSessions=5c06da74-fb28-4e56-a883-20c4d63463cf




2it [13:06, 393.06s/it]


KeyboardInterrupt: 

#### ❓Question #1:

What conclusions can you draw about the above results?

Describe in your own words what the metrics are expressing.