# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a few dependencies and provide some API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the synthetic data generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ayushikandoi/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ayushikandoi/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
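If you re-run the notebook often, being prompted for every key each time gets tedious. The following is a minimal sketch of a helper that prompts only when a variable is missing. The name `ensure_env` and the `DEMO_API_KEY` variable are hypothetical and exist only for this illustration; the demo uses a stand-in prompt function so it runs without user input.

```python
import getpass
import os


def ensure_env(name, prompt_fn=getpass.getpass):
    """Set os.environ[name] by prompting only if it is missing or empty."""
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(f"{name}: ")
    return os.environ[name]


# Demonstration with a stand-in prompt so the snippet runs non-interactively.
os.environ.pop("DEMO_API_KEY", None)
ensure_env("DEMO_API_KEY", prompt_fn=lambda _: "sk-demo")
# The variable is already set, so the second prompt function is never called.
ensure_env("DEMO_API_KEY", prompt_fn=lambda _: "ignored")
print(os.environ["DEMO_API_KEY"])
```

In the notebook itself you would call it with the default prompt, for example `ensure_env("OPENAI_API_KEY")`.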

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps RAGAS build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `TextLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
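To make the relationship-building step more concrete, here is a simplified sketch of the idea: compute cosine similarity between node embeddings and keep only the pairs above a cutoff. This is not Ragas' actual implementation. The node names, the 3-dimensional vectors, and the `THRESHOLD` value are all made up for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-d "embeddings"; real embeddings have far more dimensions.
node_embeddings = {
    "sleep_hygiene": [0.9, 0.1, 0.0],
    "cbt_for_insomnia": [0.8, 0.3, 0.1],
    "strength_training": [0.0, 0.2, 0.9],
}

THRESHOLD = 0.8  # hypothetical cutoff, not a Ragas default

# Keep an edge only when two nodes are similar enough.
edges = []
for a, b in combinations(node_embeddings, 2):
    score = cosine_similarity(node_embeddings[a], node_embeddings[b])
    if score >= THRESHOLD:
        edges.append((a, b, round(score, 3)))

print(edges)
```

With these toy vectors only the two sleep-related nodes end up connected, which mirrors how topically related chunks from different documents can become linked in the graph.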

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so we don't shadow the imported `default_transforms` function
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 10, relationships: 19)

We can save and load our knowledge graphs as follows.

In [10]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 10, relationships: 19)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [11]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [12]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
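The weights in a query distribution are proportions, so they should add up to 1. The sketch below validates a list of weights and estimates how many samples of each type a given `testset_size` would yield. The rounding rule used here (largest remainder) is an assumption for illustration and is not necessarily how Ragas allocates samples internally; `validate_weights` and `expected_counts` are hypothetical helper names.

```python
def validate_weights(weights, tol=1e-9):
    """Raise ValueError unless weights are non-negative and sum to 1."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError(f"weights sum to {sum(weights)}, expected 1.0")


def expected_counts(testset_size, weights):
    """Split testset_size across weights using largest-remainder rounding."""
    validate_weights(weights)
    raw = [testset_size * w for w in weights]
    counts = [int(r) for r in raw]  # floor of each share
    leftovers = testset_size - sum(counts)
    # Hand the remaining samples to the largest fractional parts first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts


print(expected_counts(10, [0.5, 0.25, 0.25]))
print(expected_counts(10, [0.2, 0.4, 0.4]))
```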

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
SingleHopSpecificQuerySynthesizer: Generates simple, specific questions that require information from a single document or chunk. Tests basic retrieval and direct answer extraction.

MultiHopAbstractQuerySynthesizer: Generates conceptual questions that require combining information from multiple documents. Tests multi-step reasoning across the documents.

MultiHopSpecificQuerySynthesizer: Generates more factual questions that also require multiple documents. Tests both retrieval depth and factual grounding across the documents.


Finally, we can use our `TestsetGenerator` to generate our testset!

In [13]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wht is the United States?,[The Mental Health and Psychology Handbook A P...,The context provided does not include a defini...,single_hop_specifc_query_synthesizer
1,What is MBSR and how does it contribute to imp...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Mindfulness-Based Stress Reduction (MBSR) is a...,single_hop_specifc_query_synthesizer
2,What is CBT-I?,[Write letters to or from your future self Jou...,CBT-I is the recommended first-line treatment ...,single_hop_specifc_query_synthesizer
3,Who are Marriage and Family Therapists in the ...,[social interactions How to set and maintain b...,Marriage and Family Therapists are a type of m...,single_hop_specifc_query_synthesizer
4,Wednesday is what day?,[The Personal Wellness Guide A Comprehensive R...,Wednesday is the day mentioned in the context ...,single_hop_specifc_query_synthesizer
5,How can creating an optimal sleep environment—...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Creating an optimal sleep environment by keepi...,multi_hop_abstract_query_synthesizer
6,How can Cognitive Behavioral Therapy (CBT) be ...,[<1-hop>\n\nThe Mental Health and Psychology H...,Cognitive Behavioral Therapy (CBT) is one of t...,multi_hop_abstract_query_synthesizer
7,How do understanding symptoms of mental health...,[<1-hop>\n\nThe Mental Health and Psychology H...,Understanding symptoms of mental health condit...,multi_hop_abstract_query_synthesizer
8,"How can Cognitive Behavioral Therapy (CBT), in...",[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,The provided context explains that Cognitive B...,multi_hop_specific_query_synthesizer
9,How can Cognitive Behavioral Therapy for Insom...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Cognitive Behavioral Therapy for Insomnia (CBT...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [14]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How does the United States address mental heal...,[The Mental Health and Psychology Handbook A P...,The provided context does not include specific...,single_hop_specifc_query_synthesizer
1,What does Carol Dweck's research suggest about...,[PART 3: BUILDING RESILIENCE Chapter 7: What I...,"According to the context, Carol Dweck's resear...",single_hop_specifc_query_synthesizer
2,Who is Jon Kabat-Zinn?,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Jon Kabat-Zinn is the developer of Mindfulness...,single_hop_specifc_query_synthesizer
3,How do omega-3 fatty acids contribute to menta...,[- Sleep restriction: Limiting time in bed to ...,Omega-3 fatty acids are linked to reduced depr...,single_hop_specifc_query_synthesizer
4,How can understanding the effects of stress an...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,"Understanding the effects of stress, such as h...",multi_hop_abstract_query_synthesizer
5,How can building a workout routine that incorp...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,Building a workout routine that includes regul...,multi_hop_abstract_query_synthesizer
6,How do diet and key nutrients like Omega-3 and...,[<1-hop>\n\n- Sleep restriction: Limiting time...,The context explains that nutritional psychiat...,multi_hop_abstract_query_synthesizer
7,How can understanding the effects of stress an...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,"Understanding the effects of stress, which inc...",multi_hop_abstract_query_synthesizer
8,How does Chapter 10 about sleep restriction an...,[<1-hop>\n\n- Sleep restriction: Limiting time...,"Chapter 10 discusses sleep restriction, stimul...",multi_hop_specific_query_synthesizer
9,How does PART 4 relate to PART 2 in building r...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,PART 4 discusses daily mental health practices...,multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
Trade-offs between the two approaches:

Unrolled approach:

Advantages: Full control over knowledge graph construction and query distribution; better for research and experimentation.

Disadvantages: More setup; requires deeper Ragas knowledge.

When to use: When experimenting with graph transforms, or when you want a custom query distribution.

Abstracted approach:

Advantages: Minimal setup, faster iteration, and quick testing.

Disadvantages: Less flexibility; harder to fine-tune the knowledge graph.

When to use: For prototyping.

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [16]:
### YOUR CODE HERE ###

# Define a custom query distribution with different weights
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

custom_query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.2),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.4),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.4),
]
# Generate a new test set and compare with the default
testset = generator.generate(testset_size=10, query_distribution=custom_query_distribution)
testset.to_pandas()

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How does mental health influence physical heal...,[The Mental Health and Psychology Handbook A P...,Mental health conditions can increase the risk...,single_hop_specifc_query_synthesizer
1,What is psychological resilience?,[PART 3: BUILDING RESILIENCE Chapter 7: What I...,Psychological resilience is the ability to ada...,single_hop_specifc_query_synthesizer
2,How does mental health and well-being relate t...,[<1-hop>\n\nThe Mental Health and Psychology H...,The context explains that mental health encomp...,multi_hop_abstract_query_synthesizer
3,Considering the importance of supporting socia...,[],Supporting social connections and community en...,multi_hop_abstract_query_synthesizer
4,Hwo does the mind-body connecion impact mental...,[<1-hop>\n\nThe Mental Health and Psychology H...,The mind-body connection demonstrates that men...,multi_hop_abstract_query_synthesizer
5,"How do mindfulness and meditation, particularl...",[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,"Mindfulness and meditation, especially through...",multi_hop_abstract_query_synthesizer
6,How do chapters 11 and 13 together suggest tha...,[<1-hop>\n\n- Sleep restriction: Limiting time...,Chapter 11 discusses the importance of sleep f...,multi_hop_specific_query_synthesizer
7,"How do Chapters 4 and 11 from the nutrition, s...",[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,Chapters 4 and 11 provide complementary insigh...,multi_hop_specific_query_synthesizer
8,"How do B vitamins, found in foods like leafy g...",[<1-hop>\n\n- Sleep restriction: Limiting time...,"B vitamins, which are found in foods such as l...",multi_hop_specific_query_synthesizer
9,How does cognitive therapy relate to cognitive...,[<1-hop>\n\n- Sleep restriction: Limiting time...,Cognitive therapy involves addressing beliefs ...,multi_hop_specific_query_synthesizer


Comparing the questions generated with the default and custom weights:

Default: More simple questions, which mainly test retrieval precision.

Custom: More reasoning questions, which test both retrieval and reasoning depth.

I increased the weights towards multi-hop queries because that is a more realistic scenario: most failures happen when reasoning across multiple documents rather than in simple single-document retrieval.
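The comparison can also be made numerically. The sketch below tallies the `synthesizer_name` column from the two tables shown earlier (the first generated with the 0.5/0.25/0.25 weights, the second with the 0.2/0.4/0.4 weights). The counts are copied by hand from those outputs; the `shares` helper is a hypothetical name. On a live testset, `testset.to_pandas()["synthesizer_name"].value_counts()` would give the same kind of tally.

```python
from collections import Counter

# Names as printed in the output tables (including the "specifc" spelling).
SINGLE = "single_hop_specifc_query_synthesizer"
ABSTRACT = "multi_hop_abstract_query_synthesizer"
SPECIFIC = "multi_hop_specific_query_synthesizer"

# Row counts copied from the two generated tables above.
default_names = [SINGLE] * 5 + [ABSTRACT] * 3 + [SPECIFIC] * 2
custom_names = [SINGLE] * 2 + [ABSTRACT] * 4 + [SPECIFIC] * 4


def shares(names):
    """Fraction of generated samples produced by each synthesizer."""
    counts = Counter(names)
    return {name: counts[name] / len(names) for name in (SINGLE, ABSTRACT, SPECIFIC)}


for label, names in [("default", default_names), ("custom", custom_names)]:
    print(label, shares(names))
```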

We already provided our LangSmith API key and set tracing to "true" in Task 1, so the LangSmith `Client` used below can read them from the environment.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [17]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [18]:
# Unpack (index, row) directly instead of indexing the tuple with [1]
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
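The field mapping in the loop above can be pulled out into a small function, which makes it easy to check the `question`/`answer` shape before anything is uploaded. This is a sketch only: `row_to_example` is a hypothetical helper and the sample row is made up, though its keys follow the Ragas column names shown in the tables.

```python
def row_to_example(row):
    """Map one Ragas testset row to LangSmith-style inputs/outputs/metadata."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": list(row["reference_contexts"])},
    }


# Made-up row for illustration; real rows come from dataset.to_pandas().
sample_row = {
    "user_input": "Who is Jon Kabat-Zinn?",
    "reference": "Jon Kabat-Zinn is the developer of Mindfulness-Based Stress Reduction.",
    "reference_contexts": ["PART 2: THERAPEUTIC APPROACHES ..."],
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = row_to_example(sample_row)
print(example["inputs"], example["outputs"])
```

The resulting dictionary could then be unpacked into `client.create_example(**example, dataset_id=langsmith_dataset.id)`.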

## Basic RAG Chain

Time for some RAG!


In [19]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [20]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
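To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works (that splitter prefers natural separators such as paragraphs and sentences); `split_fixed` is a hypothetical function that only shows how the two parameters interact.

```python
def split_fixed(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = "x" * 1200  # placeholder document of 1,200 characters

small = split_fixed(text, chunk_size=500, chunk_overlap=50)
large = split_fixed(text, chunk_size=1000, chunk_overlap=50)

print([len(c) for c in small])
print([len(c) for c in large])
```

Doubling the chunk size roughly halves the number of chunks, so each retrieved chunk carries more text while the index holds fewer vectors.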

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [21]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [22]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [23]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [24]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [26]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [27]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
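In the chain above, the retriever's list of documents is passed straight into the prompt, so the `{context}` slot receives the string form of that list, metadata included. A common alternative is to join only the text of each document first. The sketch below shows that formatting step in isolation; `Doc` is a hypothetical stand-in for a LangChain document and `format_docs` is an illustrative helper, not part of the original chain.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in docs)


retrieved = [
    Doc("Keep your bedroom cool and dark."),
    Doc("Avoid caffeine after 2 PM."),
]
context = format_docs(retrieved)
print(context)
```

If you wanted to try this in the chain, one option would be `itemgetter("question") | retriever | format_docs` for the `context` entry, since LCEL can wrap plain functions as runnables. Treat that as an experiment to evaluate rather than a guaranteed improvement.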

In [28]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

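We'll instantiate OpenAI's GPT-4.1 as `eval_llm`. Note that the evaluators defined below are configured with the `"openai:gpt-4o"` model string, so as written they do not use `eval_llm`.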

In [29]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using three LLM-as-judge evaluators built with `create_llm_as_judge` from `openevals`: two that compare against the reference answer, and one fully custom criterion!

In [30]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # provider-prefixed model string (this does not use `eval_llm`)
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o",
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o",
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: Checks factual correctness. It looks at the question, the model's prediction, and the reference answer, then assesses whether the prediction matches the reference. Returns 1 if correct and 0 if incorrect.
> - `labeled_helpfulness_evaluator`: Assesses helpfulness relative to the user's needs. Even if the response is technically correct, it checks whether the response is actually useful and informative for the user, given the reference answer. Returns 1 if helpful and 0 if not.
> - `dopeness_evaluator`: Assesses the style, or coolness/dopeness, of the response. It evaluates whether the response is engaging and creative rather than generic, without using the reference answer. Returns 1 if dope and 0 if not.
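Not every evaluator needs an LLM. As a sketch, the function below flags responses where the chain fell back to "I don't know", which is cheap to compute and useful next to the judge-based scores. It assumes the evaluator receives `inputs`, `outputs`, and `reference_outputs` dictionaries and that the chain's string answer is stored under an `"output"` key; both are assumptions to verify against your LangSmith version. The name `abstention_evaluator` and the `"abstained"` feedback key are hypothetical.

```python
def abstention_evaluator(inputs, outputs, reference_outputs=None):
    """Score 1 when the answer is just the "I don't know" fallback, else 0."""
    # Assumption: the chain's string result is stored under the "output" key.
    answer = str(outputs.get("output", "")).strip().lower()
    abstained = answer.rstrip(".") == "i don't know"
    return {"key": "abstained", "score": int(abstained)}


print(abstention_evaluator({"question": "q"}, {"output": "I don't know."}))
print(abstention_evaluator({"question": "q"}, {"output": "CBT-I is a first-line treatment."}))
```

If the assumptions hold, the function could be appended to the `evaluators` list alongside the three judges.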

## LangSmith Evaluation

In [31]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'drab-time-33' at:
https://smith.langchain.com/o/0a0d9ed2-3509-4225-9d44-d67d51a35e08/datasets/18a31e34-f58a-4e64-8556-bc268748559b/compare?selectedSessions=f7259abe-5ace-4494-84b0-3c87384a9bc8




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How does cognitive therapy relate to cognitive...,"Based on the context, cognitive therapy is a c...",,Cognitive therapy is a component of cognitive ...,True,True,False,5.253119,dac145eb-df3b-4ec0-b61f-b09ab6a8bc82,019c5401-12ab-7f62-b647-750835b7b1b2
1,How do the strategies discussed in Chapter 13 ...,I don't know.,,"Chapter 13 emphasizes sleep hygiene practices,...",False,False,False,1.707189,2aec32dc-8f8f-4aac-a805-ef930057e70d,019c5401-5670-7370-8ae7-09695e3dc6cb
2,How does PART 4 relate to PART 2 in building r...,"Based on the context provided, PART 4 (Stress ...",,PART 4 discusses daily mental health practices...,True,True,True,4.624369,1292b810-81c4-4e27-ac5a-558f3bbc9f0e,019c5401-8530-7f13-ab69-0775783fc547
3,How does Chapter 10 about sleep restriction an...,I don't know.,,"Chapter 10 discusses sleep restriction, stimul...",False,False,False,0.619601,5700b516-1b87-44e1-813f-c9c09449eff3,019c5401-cdab-75e2-b45a-66fc06081f9e
4,How can understanding the effects of stress an...,Understanding the effects of stress is crucial...,,"Understanding the effects of stress, which inc...",True,True,True,3.240459,1b819785-7b5d-4abb-a49a-4af41a8545f9,019c5401-f3cc-7e52-8b19-0ba04d1c7734
5,How do diet and key nutrients like Omega-3 and...,Diet and key nutrients have a significant impa...,,The context explains that nutritional psychiat...,True,True,True,4.511755,205ac36f-4673-4ad4-ac8f-a8466365609f,019c5402-2a0d-7133-9d1e-a1d192ea46b4
6,How can building a workout routine that incorp...,Building a workout routine that incorporates h...,,Building a workout routine that includes regul...,True,True,True,4.671338,aef095e0-9cca-4abd-86fa-0c3cad482644,019c5402-7a87-7dd1-82c9-57c3c1740ca7
7,How can understanding the effects of stress an...,Understanding the effects of stress helps reco...,,"Understanding the effects of stress, such as h...",True,True,True,3.584988,d7523bac-5eab-4b5d-8c48-23541ce13bc9,019c5402-db3b-7d22-8466-c13c323ea289
8,How do omega-3 fatty acids contribute to menta...,Omega-3 fatty acids contribute to mental healt...,,Omega-3 fatty acids are linked to reduced depr...,True,True,False,1.035764,92e2ce6d-694d-492a-8ff3-a709f8da9fb8,019c5403-1b7a-7842-97af-deb35d3635f3
9,Who is Jon Kabat-Zinn?,Jon Kabat-Zinn is the developer of Mindfulness...,,Jon Kabat-Zinn is the developer of Mindfulness...,True,True,False,0.796232,14f5c505-966c-4336-9322-e8816d80916a,019c5403-5391-76f0-8319-a8ef7e4129b8
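Since Ragas-style evaluation is most useful for directional comparisons, it helps to reduce each run to a few pass rates. The sketch below does this for the baseline run, with the boolean feedback values copied by hand from the ten rows of the table above; `pass_rate` is a hypothetical helper.

```python
# (qa, helpfulness, dopeness) for each row of the baseline results table.
baseline_feedback = [
    (True, True, False),
    (False, False, False),
    (True, True, True),
    (False, False, False),
    (True, True, True),
    (True, True, True),
    (True, True, True),
    (True, True, True),
    (True, True, False),
    (True, True, False),
]

METRICS = ("qa", "helpfulness", "dopeness")


def pass_rate(rows, column):
    """Fraction of rows whose feedback in the given column is True."""
    return sum(row[column] for row in rows) / len(rows)


baseline_rates = {name: pass_rate(baseline_feedback, i) for i, name in enumerate(METRICS)}
print(baseline_rates)
```

With only ten examples, a single flipped judgment moves a rate by ten percentage points, so small differences between runs should be read with caution.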


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

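- Include a "dope" prompt augmentation
- Use larger chunks (`chunk_size` of 1000 instead of 500)
- Upgrade the embedding model to `text-embedding-3-large`
- Note that the new retriever is created without `search_kwargs`, so it falls back to the default `k` rather than `k=10`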

Let's see how this changes our evaluation!

In [32]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [33]:
rag_documents = docs

In [34]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
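Chunk size controls what each embedded and retrieved unit contains, so it directly changes what the LLM sees as context. The effect can be positive or negative. For example:

Larger chunks: Each retrieved chunk carries more surrounding context, which can help with multi-hop questions that span several paragraphs. However, a single embedding then has to represent more topics, which can make similarity search less precise, and each chunk consumes more of the prompt (more tokens, higher cost and latency).

Smaller chunks: Embeddings are more focused, so retrieval can be more precise, but related information may be split across chunks, and more chunks must be embedded, stored, and searched.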


In [35]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
Modifying the embedding model affects the performance because it changes how text is represented and processed. For example: 

Stronger/larger models: Can produce more accurate and meaningful embeddings, improving search and relevance, but are slower and use more memory.

Smaller/lighter models: Faster and cheaper to compute with lower memory usage, but may produce lower-quality embeddings, reducing accuracy and relevance.

It is a trade-off between quality vs cost/latency. 
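The storage side of this trade-off is simple arithmetic. The sketch below estimates the raw size of an index, assuming float32 vectors and the default output dimensions of the two OpenAI models (1536 for `text-embedding-3-small`, 3072 for `text-embedding-3-large`, to the best of our knowledge). The figure of 10,000 chunks is a made-up example, and real vector stores add overhead on top of this.

```python
BYTES_PER_FLOAT32 = 4

# Default output dimensions (assumed; check the OpenAI docs for your model version).
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def raw_index_megabytes(num_chunks, dims):
    """Raw vector storage in MB (10**6 bytes), ignoring index overhead."""
    return num_chunks * dims * BYTES_PER_FLOAT32 / 1_000_000


NUM_CHUNKS = 10_000  # hypothetical corpus size

for model, dims in EMBEDDING_DIMS.items():
    print(model, raw_index_megabytes(NUM_CHUNKS, dims), "MB")
```

For the two small guides used here the difference is negligible, but it scales linearly with corpus size.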

In [36]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [37]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [38]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on a sample question.

In [39]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Alright, ready to level up your sleep game? Here’s the ultimate cheat code straight from the sleep sages:\n\n1. **Lock in a consistent sleep schedule—yes, even weekends.** Your body loves rhythm like a dope beat, so keep those sleep and wake times steady.\n\n2. **Craft a chill bedtime ritual.** Think: cracking open a book, gentle stretching, or a warm bath to drop your brainjack from hype mode to calm.\n\n3. **Turn your bedroom into a sleep fortress.** Keep it cool (65-68°F / 18-20°C), pitch black with blackout curtains or a sleek sleep mask, and quiet—white noise machines or earplugs are your allies here.\n\n4. **Screen detox 1-2 hours before lights out.** Blue light from devices is that sneaky villain that messes with your melatonin flow, so unplug and unwind.\n\n5. **Ditch caffeine past 2 PM.** No energy potions late in the day—let your body naturally power down.\n\n6. **Exercise like a beast—but not right before bedtime.** Keep those endorphins pumped during the day to

Finally, we can evaluate the new chain on the same test set!

In [40]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'abandoned-rice-27' at:
https://smith.langchain.com/o/0a0d9ed2-3509-4225-9d44-d67d51a35e08/datasets/18a31e34-f58a-4e64-8556-bc268748559b/compare?selectedSessions=3f5f9ba1-04ad-4242-affa-51fcbf4af85b





| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How does cognitive therapy relate to cognitive... | Alright, here’s the rad lowdown on how cogniti... | | Cognitive therapy is a component of cognitive ... | True | True | True | 2.076102 | dac145eb-df3b-4ec0-b61f-b09ab6a8bc82 | 019c5405-8012-7503-ad17-b52185818cc9 |
| 1 | How do the strategies discussed in Chapter 13 ... | Yo, here’s the 411 straight from the mental we... | | Chapter 13 emphasizes sleep hygiene practices,... | False | False | True | 4.690881 | 2aec32dc-8f8f-4aac-a805-ef930057e70d | 019c5405-c1a8-78f2-96f9-39681faa89c4 |
| 2 | How does PART 4 relate to PART 2 in building r... | Oh, *hell yeah*, let’s unpack this mental well... | | PART 4 discusses daily mental health practices... | True | True | True | 4.737342 | 1292b810-81c4-4e27-ac5a-558f3bbc9f0e | 019c5406-1740-7912-84b1-d81fa557af77 |
| 3 | How does Chapter 10 about sleep restriction an... | Yo, let’s connect those dots like a pro sleep-... | | Chapter 10 discusses sleep restriction, stimul... | True | True | True | 5.500696 | 5700b516-1b87-44e1-813f-c9c09449eff3 | 019c5406-62fc-7e11-b949-e9868814ecd7 |
| 4 | How can understanding the effects of stress an... | Alright, let’s crank this up to eleven! Unders... | | Understanding the effects of stress, which inc... | True | True | True | 4.679619 | 1b819785-7b5d-4abb-a49a-4af41a8545f9 | 019c5406-a6f6-72a0-abae-7a6281a3c733 |
| 5 | How do diet and key nutrients like Omega-3 and... | Alright, listen up! Your brain is basically a ... | | The context explains that nutritional psychiat... | True | True | True | 3.601726 | 205ac36f-4673-4ad4-ac8f-a8466365609f | 019c5406-e9f8-7c33-aa3c-7ccd35576744 |
| 6 | How can building a workout routine that incorp... | Alright, let’s crank this up to legendary stat... | | Building a workout routine that includes regul... | True | True | True | 6.141353 | aef095e0-9cca-4abd-86fa-0c3cad482644 | 019c5407-2676-7b62-be4c-559930d4a23b |
| 7 | How can understanding the effects of stress an... | Yo, here’s the ultimate glow-up for your menta... | | Understanding the effects of stress, such as h... | True | True | True | 3.513046 | d7523bac-5eab-4b5d-8c48-23541ce13bc9 | 019c5407-7121-75c2-934b-091dde061457 |
| 8 | How do omega-3 fatty acids contribute to menta... | Alright, listen up—omega-3 fatty acids are the... | | Omega-3 fatty acids are linked to reduced depr... | True | True | True | 1.94196 | 92e2ce6d-694d-492a-8ff3-a709f8da9fb8 | 019c5407-bcdb-7f52-a110-cf3786af517a |
| 9 | Who is Jon Kabat-Zinn? | Jon Kabat-Zinn is the legendary master behind ... | | Jon Kabat-Zinn is the developer of Mindfulness... | True | True | True | 2.247923 | 14f5c505-966c-4336-9322-e8816d80916a | 019c5407-e20d-7613-9e7a-29fe5dbf703e |
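As a quick sanity check on the numbers LangSmith reports, the feedback columns can be aggregated by hand. The sketch below hard-codes the ten rows displayed above; the full experiment may contain more examples than are shown here, so these figures need not match the dashboard exactly. The `pass_rate` helper is a name introduced for illustration.

```python
from statistics import median

# Values copied from the ten displayed rows:
# (feedback.qa, feedback.helpfulness, feedback.dopeness, execution_time in seconds)
rows = [
    (True, True, True, 2.076102),
    (False, False, True, 4.690881),
    (True, True, True, 4.737342),
    (True, True, True, 5.500696),
    (True, True, True, 4.679619),
    (True, True, True, 3.601726),
    (True, True, True, 6.141353),
    (True, True, True, 3.513046),
    (True, True, True, 1.94196),
    (True, True, True, 2.247923),
]


def pass_rate(flags) -> float:
    """Fraction of examples where the evaluator returned True."""
    flags = list(flags)
    return sum(flags) / len(flags)


qa_rate = pass_rate(r[0] for r in rows)
helpfulness_rate = pass_rate(r[1] for r in rows)
dopeness_rate = pass_rate(r[2] for r in rows)
median_latency = median(r[3] for r in rows)

print(f"qa={qa_rate:.2f} helpfulness={helpfulness_rate:.2f} "
      f"dopeness={dopeness_rate:.2f} median_latency={median_latency:.2f}s")
```

On these ten rows, QA and helpfulness each fail on the same single example (row 1), which is worth opening in LangSmith to see what the evaluator objected to.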


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
A screenshot comparing the two chains in LangSmith is attached in the data folder.

Why certain metrics changed in certain ways, where Chain 1 is the baseline chain and Chain 2 is the dopeness chain:

1. Feedback metrics (response quality):
    1. Dopeness: Chain 1 (0.42) < Chain 2 (1.0)

        Why: The dopeness prompt modification worked. By explicitly instructing the model to give a dope answer, Chain 2 produced more creative responses.
    2. Helpfulness: Chain 1 (0.75) < Chain 2 (0.92)

        Why: The better embedding model likely improved retrieval quality, which made the responses more helpful.
    3. QA (factual correctness): Chain 1 (0.83) < Chain 2 (0.92)

        Why: Again, the better embedding model (text-embedding-3-large) likely provides better semantic understanding and more accurate retrieved context. The larger chunks (1000 vs. 500 characters) may also have given each retrieved passage more complete information.
2. Latency:
    - P50: Chain 1 (2.8s) < Chain 2 (4.1s)
    - P90: Chain 1 (5.1s) < Chain 2 (6.0s)

    Why: Chain 2 was slower, likely because the larger embedding model is more computationally expensive and because the larger chunks increase generation time.

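3. Tokens and cost:
    - Total tokens: Chain 1 (~16.5K) > Chain 2 (~14.5K)
    - Input tokens: Chain 2 used fewer input tokens than Chain 1

    Why: Despite the larger chunks, Chain 2 used fewer tokens. The most likely reason is the number of retrieved chunks: Chain 2's retriever is created with `vectorstore.as_retriever()` and no explicit `k`, so it falls back to LangChain's default of 4 documents per query. With Chain 1 retrieving K=10 chunks of up to 500 characters, it sent up to about 5,000 characters of context per question, versus up to about 4,000 for Chain 2 (4 chunks of up to 1000 characters). The better embedding model may also have surfaced more concise, relevant chunks.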
4. Error rate: Both chains had 0% errors. Both implementations were stable, with no execution failures.
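To compare the two chains on one scale, the figures from the answer above can be turned into relative changes. This is a minimal sketch: the numbers are copied from the answer (token counts are approximate there), and `relative_change` is a helper name introduced for illustration.

```python
# Figures as reported in the answer above: (chain_1, chain_2).
metrics = {
    "dopeness": (0.42, 1.0),
    "helpfulness": (0.75, 0.92),
    "qa": (0.83, 0.92),
    "latency_p50_s": (2.8, 4.1),
    "latency_p90_s": (5.1, 6.0),
    "total_tokens": (16500, 14500),  # approximate values (~16.5K vs ~14.5K)
}


def relative_change(before: float, after: float) -> float:
    """Change from Chain 1 to Chain 2 as a fraction of the Chain 1 value."""
    return (after - before) / before


changes = {name: relative_change(a, b) for name, (a, b) in metrics.items()}
for name, change in changes.items():
    print(f"{name:>14}: {change:+.0%}")
```

Seen this way, the quality gains come with a noticeably higher median latency, while total tokens move in the opposite direction, which is the trade-off the analysis above is describing.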


---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores