# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load the synthetic data into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic data generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, it would be much harder to get a high-quality early signal on our application's performance.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to set a few API keys and install several dependencies, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/ash/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ash/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
# import getpass
from dotenv import load_dotenv

load_dotenv('../.env')
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

True

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGSMITH_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefully be familiar at this point since it's our Use-Case Data!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [4]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [5]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

### NOTICE: We're using a subset of the data for this example - this is to keep costs/time down.
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 64, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations depend on the corpus length. In our case they include:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finds the headlines within each document
- Theme Extractor -> extracts broad themes from the documents

Ragas then uses cosine similarity between the embeddings produced by the above transformations, along with overlap heuristics, to construct relationships between the nodes.
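To make the relationship-building step concrete, here is a minimal sketch of the idea. It assumes toy three-dimensional vectors in place of real summary embeddings and a hypothetical `threshold` of 0.9; the helper names are made up for illustration. It is not Ragas' implementation, only a picture of how thresholded cosine similarity can turn embeddings into edges.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold=0.9):
    """Link every pair of nodes whose similarity meets the threshold.

    `embeddings` maps a node id to its vector; `threshold` is an
    illustrative value, not a Ragas default.
    """
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 3)))
    return edges


# Toy vectors standing in for summary embeddings (made up for illustration).
toy_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
edges = build_edges(toy_embeddings)
print(edges)
```

Only `doc_a` and `doc_b` point in nearly the same direction, so they are the only pair that ends up connected; lowering the threshold would add more (and weaker) relationships.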

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a new name so we don't shadow the imported `default_transforms` function on re-runs.
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node

Applying SummaryExtractor:   0%|          | 0/38 [00:00<?, ?it/s]

Property 'summary' already exists in node '14db16'. Skipping!
Property 'summary' already exists in node '9be0ea'. Skipping!
Property 'summary' already exists in node '3a8d1c'. Skipping!
Property 'summary' already exists in node '651275'. Skipping!
Property 'summary' already exists in node '6337e4'. Skipping!
Property 'summary' already exists in node '62a640'. Skipping!
Property 'summary' already exists in node '5b3559'. Skipping!
Property 'summary' already exists in node '31fd56'. Skipping!
Property 'summary' already exists in node 'cc9752'. Skipping!
Property 'summary' already exists in node '6da9e1'. Skipping!
Property 'summary' already exists in node 'e56c44'. Skipping!
Property 'summary' already exists in node '9e1387'. Skipping!
Property 'summary' already exists in node '7bdefc'. Skipping!
Property 'summary' already exists in node 'f446db'. Skipping!
Property 'summary' already exists in node '51b782'. Skipping!
Property 'summary' already exists in node 'b232b0'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/48 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '3a8d1c'. Skipping!
Property 'summary_embedding' already exists in node '62a640'. Skipping!
Property 'summary_embedding' already exists in node '14db16'. Skipping!
Property 'summary_embedding' already exists in node 'cc9752'. Skipping!
Property 'summary_embedding' already exists in node 'e56c44'. Skipping!
Property 'summary_embedding' already exists in node '9be0ea'. Skipping!
Property 'summary_embedding' already exists in node '7bdefc'. Skipping!
Property 'summary_embedding' already exists in node '6337e4'. Skipping!
Property 'summary_embedding' already exists in node '5b3559'. Skipping!
Property 'summary_embedding' already exists in node '6da9e1'. Skipping!
Property 'summary_embedding' already exists in node '651275'. Skipping!
Property 'summary_embedding' already exists in node 'f446db'. Skipping!
Property 'summary_embedding' already exists in node '9e1387'. Skipping!
Property 'summary_embedding' already exists in node '31fd56'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 86, relationships: 711)

We can save and load our knowledge graphs as follows.

In [10]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 86, relationships: 711)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [11]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [12]:
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

# Each entry pairs a synthesizer with the share of the testset it should produce.
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

- SingleHopSpecificQuerySynthesizer: Generates straightforward, fact-based questions that can be answered using a single piece of information from the knowledge graph.

- MultiHopAbstractQuerySynthesizer: Creates more complex, open-ended questions that require connecting multiple pieces of information or reasoning across the knowledge graph.

- MultiHopSpecificQuerySynthesizer: Produces detailed, multi-step questions that need information from several places in the knowledge graph to answer, but with a specific, concrete answer.



Finally, we can use our `TestsetGenerator` to generate our testset!

In [13]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Whaat is ChatGPT and how is it used in the con... | [Introduction ChatGPT launched in November 202... | ChatGPT is based on a Large Language Model (LL... | single_hop_specifc_query_synthesizer |
| 1 | When is June 2025? | [Table 1: ChatGPT daily message counts (millio... | The context mentions June 2025 as the date end... | single_hop_specifc_query_synthesizer |
| 2 | SOC2 codes 11 what is it | [Variation by Occupation Figure 23 presents va... | Variation by Occupation Figure 23 presents var... | single_hop_specifc_query_synthesizer |
| 3 | Management how does it relate to ChatGPT use? | [Conclusion This paper studies the rapid growt... | The context discusses how users in management ... | single_hop_specifc_query_synthesizer |
| 4 | How do the statistical analysis and regression... | [<1-hop>\n\nVariation by Occupation Figure 23 ... | The variation in ChatGPT usage by occupation, ... | multi_hop_abstract_query_synthesizer |
| 5 | How do large language models (LLMs) like ChatG... | [<1-hop>\n\nIntroduction ChatGPT launched in N... | ChatGPT, based on large language models (LLMs)... | multi_hop_abstract_query_synthesizer |
| 6 | How do the changes in user behavior and prefer... | [<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%... | Between June 2024 and June 2025, total ChatGPT... | multi_hop_abstract_query_synthesizer |
| 7 | How does the rapid growth of ChatGPT usage by ... | [<1-hop>\n\nConclusion This paper studies the ... | By July 2025, ChatGPT experienced a significan... | multi_hop_specific_query_synthesizer |
| 8 | Whi is the dat in July 2025 that show how Chat... | [<1-hop>\n\nConclusion This paper studies the ... | The context indicates that by July 2025, ChatG... | multi_hop_specific_query_synthesizer |
| 9 | Based on the rapid growth of ChatGPT usage in ... | [<1-hop>\n\nConclusion This paper studies the ... | The context indicates that by July 2025, over ... | multi_hop_specific_query_synthesizer |
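As a quick sanity check, we can compare the mix of queries we asked for with the mix we actually got. The sketch below hard-codes the `synthesizer_name` values from the output above and the weights from `query_distribution`; the helper name `observed_shares` is made up, and pairing each synthesizer class with its snake-case name is an assumption based on that output.

```python
from collections import Counter

# Requested weights, copied from `query_distribution`.
requested = {
    "single_hop_specifc_query_synthesizer": 0.5,
    "multi_hop_abstract_query_synthesizer": 0.25,
    "multi_hop_specific_query_synthesizer": 0.25,
}

# `synthesizer_name` column of the ten generated rows, top to bottom.
observed_names = (
    ["single_hop_specifc_query_synthesizer"] * 4
    + ["multi_hop_abstract_query_synthesizer"] * 3
    + ["multi_hop_specific_query_synthesizer"] * 3
)


def observed_shares(names):
    """Fraction of rows produced by each synthesizer."""
    counts = Counter(names)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


shares = observed_shares(observed_names)
for name, weight in requested.items():
    print(f"{name}: requested {weight:.2f}, observed {shares.get(name, 0.0):.2f}")
```

With a live testset, the same helper should work on `testset.to_pandas()["synthesizer_name"]`. With only ten samples, the observed mix (0.4 / 0.3 / 0.3) can only approximate the requested weights.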


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [6]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node

Applying SummaryExtractor:   0%|          | 0/38 [00:00<?, ?it/s]

Property 'summary' already exists in node '210995'. Skipping!
Property 'summary' already exists in node 'e7e722'. Skipping!
Property 'summary' already exists in node 'bcbc72'. Skipping!
Property 'summary' already exists in node 'bc5391'. Skipping!
Property 'summary' already exists in node '030d25'. Skipping!
Property 'summary' already exists in node 'b22839'. Skipping!
Property 'summary' already exists in node '813df6'. Skipping!
Property 'summary' already exists in node '7948bd'. Skipping!
Property 'summary' already exists in node 'c23bf3'. Skipping!
Property 'summary' already exists in node 'f568b8'. Skipping!
Property 'summary' already exists in node '8898b1'. Skipping!
Property 'summary' already exists in node 'fc718d'. Skipping!
Property 'summary' already exists in node '49a4cf'. Skipping!
Property 'summary' already exists in node 'c92ea8'. Skipping!
Property 'summary' already exists in node '38b28b'. Skipping!
Property 'summary' already exists in node 'e6084d'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/48 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '49a4cf'. Skipping!
Property 'summary_embedding' already exists in node '210995'. Skipping!
Property 'summary_embedding' already exists in node 'fc718d'. Skipping!
Property 'summary_embedding' already exists in node 'c23bf3'. Skipping!
Property 'summary_embedding' already exists in node 'c92ea8'. Skipping!
Property 'summary_embedding' already exists in node 'e7e722'. Skipping!
Property 'summary_embedding' already exists in node '813df6'. Skipping!
Property 'summary_embedding' already exists in node 'f568b8'. Skipping!
Property 'summary_embedding' already exists in node 'bc5391'. Skipping!
Property 'summary_embedding' already exists in node '030d25'. Skipping!
Property 'summary_embedding' already exists in node 'bcbc72'. Skipping!
Property 'summary_embedding' already exists in node 'b22839'. Skipping!
Property 'summary_embedding' already exists in node '7948bd'. Skipping!
Property 'summary_embedding' already exists in node '8898b1'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [7]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Considering the significant milestone of ChatG... | [Introduction ChatGPT launched in November 202... | ChatGPT was launched in November 2022, and sin... | single_hop_specifc_query_synthesizer |
| 1 | OpenAI how does it help in education and what ... | [Table 1: ChatGPT daily message counts (millio... | OpenAI is involved in understanding product us... | single_hop_specifc_query_synthesizer |
| 2 | As an Educational Technology Coordinator lever... | [Variation by Occupation Figure 23 presents va... | Variation in ChatGPT usage by occupation shows... | single_hop_specifc_query_synthesizer |
| 3 | What is practical guidance in ChatGPT usage? | [Conclusion This paper studies the rapid growt... | Practical Guidance is one of the three most co... | single_hop_specifc_query_synthesizer |
| 4 | Hw many chatgpt msgs are for info and wrting v... | [<1-hop>\n\nTable 1: ChatGPT daily message cou... | The context indicates that nearly 80% of all C... | multi_hop_abstract_query_synthesizer |
| 5 | How do the trends in monthly message statistic... | [<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%... | The data shows that from June 2024 to June 202... | multi_hop_abstract_query_synthesizer |
| 6 | how chatgpt message volume and classify activi... | [<1-hop>\n\nTable 1: ChatGPT daily message cou... | Table 1 shows that daily message counts for Ch... | multi_hop_abstract_query_synthesizer |
| 7 | How does the rapid growth of ChatGPT since 202... | [<1-hop>\n\nConclusion This paper studies the ... | The rapid growth of ChatGPT since its launch i... | multi_hop_abstract_query_synthesizer |
| 8 | Considering the rapid growth of ChatGPT, which... | [<1-hop>\n\nConclusion This paper studies the ... | By July 2025, ChatGPT's weekly usage by more t... | multi_hop_specific_query_synthesizer |
| 9 | How does the rapid growth of ChatGPT since its... | [<1-hop>\n\nConclusion This paper studies the ... | Since its launch in November 2022, ChatGPT exp... | multi_hop_specific_query_synthesizer |


We'll need to provide our LangSmith API key, and set tracing to "true".
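If those values were not already loaded from the `.env` file, a small helper can fill them in. This is a minimal sketch: `ensure_env` is a hypothetical helper name, and the variable names `LANGCHAIN_TRACING_V2` and `LANGCHAIN_API_KEY` are taken from the commented-out lines in the setup cell earlier in the notebook.

```python
import os
from getpass import getpass


def ensure_env(name, prompt=getpass):
    """Return os.environ[name], asking for it only when it is not already set."""
    if not os.environ.get(name):
        os.environ[name] = prompt(f"{name}: ")
    return os.environ[name]


# Turn tracing on unless the environment (e.g. the .env file) already set it.
os.environ.setdefault("LANGCHAIN_TRACING_V2", "true")

# Uncomment to be prompted for the key when it was not loaded from .env:
# ensure_env("LANGCHAIN_API_KEY")
```

Because the helper only prompts when the variable is missing, the cell can be re-run safely without re-entering secrets.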

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [9]:
from langsmith import Client

client = Client()

dataset_name = "Use Case Synthetic Data - AIE8 v2"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [10]:
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
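The column-to-field mapping used in the loop above can also be pulled out into a small pure function, which makes it easy to check without touching LangSmith. This is only a sketch: `to_langsmith_example` is a hypothetical helper, and the sample row is made up to mirror the Ragas column names.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row onto the inputs/outputs/metadata layout used above."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# A made-up row with the same column names as the Ragas DataFrame.
sample_row = {
    "user_input": "What is ChatGPT?",
    "reference_contexts": ["ChatGPT launched in November 2022 ..."],
    "reference": "A chatbot built on a large language model.",
}
example = to_langsmith_example(sample_row)
print(example["inputs"], example["outputs"])
```

Inside the upload loop, the result could then be unpacked as `client.create_example(**to_langsmith_example(row), dataset_id=langsmith_dataset.id)`.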

## Basic RAG Chain

Time for some RAG!


In [11]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [13]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [14]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG"
)

In [15]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [16]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [17]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [18]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
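If the pipe syntax feels opaque, the data flow of the chain above can be pictured with plain functions. The sketch below is conceptual, not how LangChain implements LCEL; every name in it (`make_rag_pipeline` and the `fake_*` stand-ins) is hypothetical, and no model is called.

```python
def make_rag_pipeline(retrieve, render_prompt, generate, parse):
    """Compose four callables in the same order as the LCEL chain above."""

    def run(inputs):
        question = inputs["question"]
        # The dict step: send the question to the retriever and also pass it through.
        prompt_vars = {"context": retrieve(question), "question": question}
        # prompt -> model -> output parser
        return parse(generate(render_prompt(prompt_vars)))

    return run


# Stand-ins for the retriever, prompt template, chat model, and parser.
def fake_retrieve(question):
    return ["chunk about " + question]


def fake_render_prompt(variables):
    return f"Context: {variables['context']}\nQuestion: {variables['question']}"


def fake_generate(prompt):
    return {"content": f"(model reply to {len(prompt)} prompt characters)"}


def fake_parse(message):
    return message["content"]


pipeline = make_rag_pipeline(fake_retrieve, fake_render_prompt, fake_generate, fake_parse)
answer = pipeline({"question": "What are people doing with AI?"})
print(answer)
```

The key point is that the question is used twice: once to fetch context and once, unchanged, to fill the prompt.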

In [19]:
rag_chain.invoke({"question" : "What are people doing with AI these days?"})

'Based on the provided context, people are using AI, particularly generative AI like ChatGPT, in many flexible ways both at work and outside of work. Key activities include:\n\n- Performing workplace tasks either by augmenting or automating human labor.\n- Producing writing, software code, spreadsheets, and other digital products.\n- Using AI as co-workers that produce output or as co-pilots that give advice and improve human problem-solving.\n- Seeking information and advice, similar to how they use traditional web search engines.\n- Engaging in self-expression activities like relationships, personal reflection, games, and role play (though these represent a smaller share of use).\n- Therapy or companionship has been noted as a prevalent use case.\n\nOverall, AI is being used to enhance productivity, creativity, and decision-making across various economic and social activities.'

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [20]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [21]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dopeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this response dope, lit, cool, or is it just a generic response?",
        },
        "llm" : eval_llm
    }
)

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`: Evaluates whether the model's answer is correct with respect to the reference answer. `"qa"` is a built-in LangChain question-answering evaluator, loaded here through LangSmith's `LangChainStringEvaluator` wrapper; its feedback shows up as `correctness` in the results.

- `labeled_helpfulness_evaluator`: Assesses how helpful the answer is to the user, taking into account the correct reference answer. `"labeled_criteria"` is a criteria evaluator that is also shown a reference ("labeled") answer, which is why `prepare_data` maps the prediction, reference, and input explicitly.

- `dopeness_evaluator`: Judges whether the response is "dope, lit, cool" rather than generic. `"criteria"` is the reference-free variant: it scores the response against a custom criterion (here, "dopeness") without consulting the reference answer.
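Beyond the wrapped LangChain evaluators, a custom evaluator can be an ordinary function that receives the run and the example and returns a feedback key with a score. The sketch below is a hypothetical example (`says_i_dont_know` is not part of this notebook); it assumes the chain output is stored under the `"output"` key, as in `prepare_data` above, and uses `SimpleNamespace` objects as stand-ins for LangSmith's run and example objects so it can be tried locally.

```python
from types import SimpleNamespace


def says_i_dont_know(run, example):
    """Flag runs where the chain fell back to the prompt's "I don't know" reply."""
    prediction = (run.outputs or {}).get("output", "")
    refused = "i don't know" in prediction.lower()
    return {"key": "refused_to_answer", "score": int(refused)}


# Stand-ins for LangSmith's Run and Example objects, for a local dry run.
fake_run = SimpleNamespace(outputs={"output": "I don't know."})
fake_example = SimpleNamespace(
    inputs={"question": "..."}, outputs={"answer": "..."}
)
result = says_i_dont_know(fake_run, fake_example)
print(result)
```

A function like this could be added to the `evaluators=[...]` list alongside the three evaluators defined above, giving a cheap, deterministic signal next to the LLM-judged ones.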

## LangSmith Evaluation

In [22]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'unique-love-79' at:
https://smith.langchain.com/o/a58e59bf-7551-4351-b574-57c230dc2aa6/datasets/9ff69106-4780-4042-a51d-41486ff1415d/compare?selectedSessions=aba2a8f8-668c-4a1f-9e9c-646c15e32356




0it [00:00, ?it/s]

| | inputs.question | outputs.output | error | reference.answer | feedback.correctness | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | how many messages 18 billion and 2.5 billion m... | Based on the provided context:\n\n- The exact ... |  | The context shows that by July 2025, ChatGPT u... | 1 | 1 | 0 | 6.366211 | 7145eab0-21aa-4bdb-93a7-a69c7a1abc92 | 60bb1b3b-3cf9-4259-89fa-ac53b7f69171 |
| 1 | Hw many messages did ChatGPT send weekly by Ju... | By July 2025, ChatGPT users were collectively ... |  | By July 2025, ChatGPT was sending more than 18... | 1 | 1 | 0 | 3.949315 | 9ab933b3-7159-4c16-877f-34ef9fd07bbf | bb0fff21-5aea-4141-bb94-67fdc7bb364f |
| 2 | How does the rapid growth of ChatGPT since its... | The rapid growth of ChatGPT since its launch i... |  | Since its launch in November 2022, ChatGPT exp... | 1 | 1 | 0 | 3.407433 | 3439e2bf-0b89-4698-b360-c8c41b36b60f | cfd47a42-aecf-42e5-a0ed-343fe12e7893 |
| 3 | Considering the rapid growth of ChatGPT, which... | The extensive adoption of ChatGPT—used weekly ... |  | By July 2025, ChatGPT's weekly usage by more t... | 1 | 1 | 0 | 6.322455 | 28017ba6-25e3-4f41-a4b5-107a2f0be8bf | ab5ff3be-6b4e-43a1-9335-8715fdeddf35 |
| 4 | How does the rapid growth of ChatGPT since 202... | The rapid growth of ChatGPT since its launch i... |  | The rapid growth of ChatGPT since its launch i... | 1 | 1 | 0 | 4.312861 | a3e96e4c-cc01-43a0-a342-cb5bf287dc71 | 679cd134-17bb-46f5-b512-551e61660e49 |
| 5 | how chatgpt message volume and classify activi... | Based on the provided context:\n\n- Writing is... |  | Table 1 shows that daily message counts for Ch... | 1 | 1 | 0 | 9.426799 | 3b0ba9db-2649-462f-b5d5-c5c30ec4e723 | 71a38cd9-f93b-4ab9-b755-31e221b757c0 |
| 6 | How do the trends in monthly message statistic... | Based on the provided context, trends in month... |  | The data shows that from June 2024 to June 202... | 0 | 0 | 0 | 8.846879 | 3fad9b01-cf8d-464b-b91b-ad45e59a6709 | 2a8d40b4-db3a-4747-8cf5-e892dd082a65 |
| 7 | Hw many chatgpt msgs are for info and wrting v... | Based on the provided context:\n\n- Only 4.2% ... |  | The context indicates that nearly 80% of all C... | 1 | 1 | 0 | 9.380094 | 7d501354-9090-4404-8b51-1d96f61b9557 | 89d5a359-92c7-4411-935b-038148afcc6e |
| 8 | What is practical guidance in ChatGPT usage? | Based on the provided context, Practical Guida... |  | Practical Guidance is one of the three most co... | 1 | 1 | 0 | 6.266928 | e0812ccc-f5fa-4544-bbbf-a248128fdde9 | 25f28498-cf94-4367-a15a-c8712abf731d |
| 9 | As an Educational Technology Coordinator lever... | I don't know. |  | Variation in ChatGPT usage by occupation shows... | 0 | 0 | 0 | 0.720266 | 93d2d624-c564-4742-91bc-36cb6534efa5 | 3df82ca0-bdaa-4307-9b01-752893db0985 |
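Since these scores are most useful for directional comparisons between runs, it helps to boil each experiment down to one number per metric. The sketch below simply averages the three feedback columns, with the values copied by hand from the ten baseline rows above; `summarize` is a hypothetical helper name.

```python
from statistics import mean

# Feedback columns copied from the ten rows of the baseline results above.
baseline_feedback = {
    "correctness": [1, 1, 1, 1, 1, 1, 0, 1, 1, 0],
    "helpfulness": [1, 1, 1, 1, 1, 1, 0, 1, 1, 0],
    "dopeness": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}


def summarize(feedback):
    """Average each feedback column so runs can be compared at a glance."""
    return {metric: mean(scores) for metric, scores in feedback.items()}


baseline_summary = summarize(baseline_feedback)
print(baseline_summary)
```

For this baseline, correctness and helpfulness both average 0.8 and dopeness averages 0, which gives us a reference point for any later change to the pipeline.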


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

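- Include a "dope" prompt augmentation
- Use larger chunks (`chunk_size` of 1000 instead of 500)
- Upgrade the embedding model to `text-embedding-3-large`

Note that the new retriever is also created without `search_kwargs={"k": 10}`, so it retrieves fewer chunks per query than the baseline; keep that in mind when attributing score changes.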

Let's see how this changes our evaluation!

In [23]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [24]:
rag_documents = docs

In [25]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓Question #2:

Why would modifying our chunk size modify the performance of our application?

Modifying the chunk size changes how much context is included in each document chunk that is embedded and retrieved. 
If the chunks are too small, important information may be split across multiple chunks, making it less likely that the retriever will surface all relevant context for a given question. If the chunks are too large, they may include irrelevant information, which can confuse the model or dilute the relevance of the retrieved context. 

The optimal chunk size balances these trade-offs, ensuring that each chunk is large enough to contain meaningful context but not so large that it introduces noise, thereby improving retrieval accuracy and the quality of generated answers.
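The effect of chunk size on how a document gets carved up can be seen with a deliberately simplified splitter. The sketch below uses fixed-size character windows; it is not the algorithm used by `RecursiveCharacterTextSplitter` (which prefers to break on separators), and both `sliding_window_chunks` and the filler text are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# 3,000 characters of filler text standing in for a document page.
sample_text = "lorem ipsum " * 250

small_chunks = sliding_window_chunks(sample_text, chunk_size=500, chunk_overlap=50)
large_chunks = sliding_window_chunks(sample_text, chunk_size=1000, chunk_overlap=50)
print(len(small_chunks), len(large_chunks))
```

Doubling the chunk size yields fewer, larger chunks for the same text, so each retrieved chunk carries more surrounding context but also more material unrelated to the question.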

In [26]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓Question #3:

Why would modifying our embedding model modify the performance of our application?

Modifying the embedding model changes how well the RAG system can understand and retrieve relevant information. More advanced embedding models create better representations that capture more nuanced semantic relationships between words, phrases, and concepts. This allows them to encode context, meaning, and even subtle distinctions in language more effectively, resulting in embeddings that are more accurate and useful for retrieval tasks. As a result, the RAG system can match questions to relevant documents with higher precision, improving the quality of answers. In contrast, less advanced models may produce embeddings that miss important context or fail to distinguish between similar but distinct concepts.

In [27]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [28]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [29]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we saw before.

In [30]:
dopeness_rag_chain.invoke({"question" : "How are people using AI to make money?"})

"Alright, buckle up because the way folks are stacking cash with AI is straight-up next-level wizardry. According to the juicy insights from the context, people aren’t just using AI as a fancy tool to punch in tasks—they're leveraging ChatGPT as a **decision-making sidekick and research advisor** in the workplace. It’s like having the smartest partner who boosts your brain’s horsepower, especially in knowledge-dense gigs where every choice counts.\n\nSo instead of AI just clocking boring tasks, it’s **amplifying worker output by supercharging the quality of their decisions**—making them sharper, faster, and maybe even a bit more legendary at what they do. This elevated decision support translates to better productivity and, bam, more money moves.\n\nPlus, the sheer value is colossal: Collis and Brynjolfsson (2025) drop this beast of a stat—US users would need a $98 monthly payout just to skip using AI, reflecting a mind-boggling **$97 billion per year surplus**. That’s AI turning into 

Finally, we can evaluate the new chain on the same test set!

In [31]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'scholarly-grain-54' at:
https://smith.langchain.com/o/a58e59bf-7551-4351-b574-57c230dc2aa6/datasets/9ff69106-4780-4042-a51d-41486ff1415d/compare?selectedSessions=0041d008-6ea0-4982-aefe-d0ebf8e531ae





| | inputs.question | outputs.output | error | reference.answer | feedback.correctness | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | how many messages 18 billion and 2.5 billion m... | Yo, here’s the epic lowdown straight from the ... | | The context shows that by July 2025, ChatGPT u... | 1 | 0 | 1 | 4.552142 | 7145eab0-21aa-4bdb-93a7-a69c7a1abc92 | 3cf5c481-2bb2-4ee2-935e-d2ddd6aa6165 |
| 1 | Hw many messages did ChatGPT send weekly by Ju... | Yo, check this out! By June 2025, ChatGPT was ... | | By July 2025, ChatGPT was sending more than 18... | 1 | 1 | 1 | 4.4156 | 9ab933b3-7159-4c16-877f-34ef9fd07bbf | 269710c4-8038-4520-a9f0-28b0d3535c1d |
| 2 | How does the rapid growth of ChatGPT since its... | Yo, brace yourself for this mind-blowing ChatG... | | Since its launch in November 2022, ChatGPT exp... | 1 | 1 | 1 | 6.153453 | 3439e2bf-0b89-4698-b360-c8c41b36b60f | 57a4fe43-30d1-40bd-898d-b98134bb9d39 |
| 3 | Considering the rapid growth of ChatGPT, which... | Yo, this massive surge to 700 million weekly u... | | By July 2025, ChatGPT's weekly usage by more t... | 1 | 1 | 1 | 5.072241 | 28017ba6-25e3-4f41-a4b5-107a2f0be8bf | 6270b6c5-131e-4e09-b195-824d95d57070 |
| 4 | How does the rapid growth of ChatGPT since 202... | Oh, buckle up—this answer’s got some serious j... | | The rapid growth of ChatGPT since its launch i... | 1 | 1 | 1 | 7.042744 | a3e96e4c-cc01-43a0-a342-cb5bf287dc71 | 994d7652-7f6f-499a-b428-3c16b90c3fa8 |
| 5 | how chatgpt message volume and classify activi... | Alright, let’s break down the beast of ChatGPT... | | Table 1 shows that daily message counts for Ch... | 1 | 1 | 1 | 8.039882 | 3b0ba9db-2649-462f-b5d5-c5c30ec4e723 | c480d1d8-e1a3-4f42-8992-2dcd88084c14 |
| 6 | How do the trends in monthly message statistic... | Alright, let’s crank up the cool factor and di... | | The data shows that from June 2024 to June 202... | 1 | 1 | 1 | 6.72142 | 3fad9b01-cf8d-464b-b91b-ad45e59a6709 | 01a71960-14f3-41a8-a546-4ca32cff77e4 |
| 7 | Hw many chatgpt msgs are for info and wrting v... | Alright, let's break down the vibe in this AI ... | | The context indicates that nearly 80% of all C... | 1 | 1 | 1 | 6.714435 | 7d501354-9090-4404-8b51-1d96f61b9557 | 07ae5b5a-33a7-4766-b4e4-6a5fe0177363 |
| 8 | What is practical guidance in ChatGPT usage? | Alright, let's crank up the cool factor and br... | | Practical Guidance is one of the three most co... | 1 | 1 | 1 | 3.688803 | e0812ccc-f5fa-4544-bbbf-a248128fdde9 | 332c600e-38ea-4606-9729-fee6dea38daf |
| 9 | As an Educational Technology Coordinator lever... | Alright, buckle up for some next-level AI insi... | | Variation in ChatGPT usage by occupation shows... | 1 | 1 | 1 | 7.566596 | 93d2d624-c564-4742-91bc-36cb6534efa5 | 2ee510f1-6509-4d5b-b3d9-8b7d6792f7f8 |
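The per-example feedback in the results above can be rolled up into one score per evaluator. This is a minimal sketch that copies the ten feedback values by hand from the results table; the `feedback` dictionary is an illustrative structure, not something the evaluation returns in this shape.

```python
from statistics import mean

# Feedback values copied from the ten rows of the results table above.
feedback = {
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "helpfulness": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "dopeness": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}

# Average each metric across all examples.
averages = {metric: mean(scores) for metric, scores in feedback.items()}

for metric, average in averages.items():
    print(f"{metric}: {average:.1f}")
```

With only ten examples, a single flipped label moves an average by 0.1, so small gaps between experiments are worth reading cautiously.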


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.

CHAIN 1
![Chain 1](chain1.png)

CHAIN 2
![Chain 2](chain2.png)

Some differences between chain 1 and chain 2 are:
- Chain 1 scored 0.8 on correctness and 0.8 on helpfulness, versus Chain 2, which scored 1.0 on correctness and 0.9 on helpfulness.
- All of Chain 1's responses scored "no" on "dopeness". All of Chain 2's responses scored "yes" on "dopeness".
- Average latency was 5.5 seconds for Chain 1 versus 6.3 seconds for Chain 2. Not a significant difference.
- Chain 1 generated 43,553 tokens in total, costing a total of 0.0206 cents.
- Chain 2 generated 26,723 tokens, costing a total of 0.0155 cents.
- For Chain 1, one of the incorrect responses was "I don't know". I think our eval could be improved to score this differently, since we prefer "I don't know" to a hallucinated answer. If the model was not powerful enough to answer the question and made that clear, I think that should be scored differently than an "incorrect" hallucinated answer.
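One way to act on the last point is to label abstentions separately from wrong answers. The sketch below is a hypothetical post-processing step, not part of the evaluators used in this notebook: `score_answer`, the phrase list, and the three labels are all assumptions, and `is_correct` stands for whatever verdict the existing correctness evaluator produced.

```python
# Assumed phrases that signal the model declined to answer.
ABSTENTION_PHRASES = ("i don't know", "i do not know")


def score_answer(prediction: str, is_correct: bool) -> str:
    """Label a response as 'correct', 'abstained', or 'incorrect'."""
    # Normalize case and curly apostrophes before matching.
    normalized = prediction.strip().lower().replace("’", "'")
    if any(phrase in normalized for phrase in ABSTENTION_PHRASES):
        return "abstained"
    return "correct" if is_correct else "incorrect"


print(score_answer("I don't know.", is_correct=False))
print(score_answer("ChatGPT launched in 2019.", is_correct=False))
print(score_answer("ChatGPT launched in November 2022.", is_correct=True))
```

Substring matching is crude: it would also flag an answer that says "I don't know" and then guesses anyway. An LLM-based evaluator with an explicit "abstained" option would likely be more reliable.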

The biggest difference between the chains is the dopeness metric. Chain 2 scored higher than Chain 1 because its prompt instructed it to return dope, non-generic answers, whereas nothing of that sort was mentioned in the prompt for Chain 1.

Another interesting difference is that Chain 2 generated fewer tokens and cost less overall for this task, despite using a more expensive embedding model.
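The size of that gap can be worked out from the reported totals. This is a small arithmetic sketch using only the token counts and costs listed above; the cost figures are kept in whatever unit was reported, and `relative_change` is a helper defined here for illustration.

```python
# Totals as reported for each chain above.
chain_1 = {"tokens": 43_553, "cost": 0.0206}
chain_2 = {"tokens": 26_723, "cost": 0.0155}


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


token_change = relative_change(chain_1["tokens"], chain_2["tokens"])
cost_change = relative_change(chain_1["cost"], chain_2["cost"])
print(f"tokens: {token_change:+.1%}, cost: {cost_change:+.1%}")

# Cost per 1,000 tokens shows whether cost fell in step with token count.
for name, run in (("chain 1", chain_1), ("chain 2", chain_2)):
    per_thousand = run["cost"] / run["tokens"] * 1000
    print(f"{name}: {per_thousand:.6f} per 1,000 tokens")
```

Token count falls by a larger fraction than cost does, so Chain 2's tokens were more expensive on average even though its total bill was lower.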

The larger embedding model and larger chunks likely helped Chain 2 achieve higher correctness and helpfulness. Larger chunks keep more of the relevant context together, and the larger embedding model likely produces more precise vectors, so the cosine similarity between a question and a piece of reference text better reflects how closely they are actually related in meaning. That gives the retriever a better basis for surfacing the right passages, which in turn lets the model produce a more accurate answer.