# Synthetic Data Generation Using RAGAS


We'll need to provide our LangSmith API key, and set tracing to "true".

In [1]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [2]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

**Loading Source Documents**

In order to create a synthetic dataset, we must first load our source documents!

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

# List of file paths for your PDFs
file_paths = ["dataset/data1.pdf", "dataset/data2.pdf"]  # Add more files here

# List to store all loaded documents
documents = []

# Loop through the file paths and load each PDF
for file_path in file_paths:
    loader = PyMuPDFLoader(file_path=file_path)
    docs = loader.load()  # Load documents from the current PDF
    documents.extend(docs)  # Add them to the overall documents list

# Now 'documents' contains the contents of all loaded PDFs



In [7]:
documents

[Document(metadata={'source': 'dataset/data1.pdf', 'file_path': 'dataset/data1.pdf', 'page': 0, 'total_pages': 73, 'format': 'PDF 1.6', 'title': 'Blueprint for an AI Bill of Rights', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe Illustrator 26.3 (Macintosh)', 'producer': 'iLovePDF', 'creationDate': "D:20220920133035-04'00'", 'modDate': "D:20221003104118-04'00'", 'trapped': ''}, page_content=' \n \n \n \n \n \n \n \n \n \nBLUEPRINT FOR AN \nAI BILL OF \nRIGHTS \nMAKING AUTOMATED \nSYSTEMS WORK FOR \nTHE AMERICAN PEOPLE \nOCTOBER 2022 \n'),
 Document(metadata={'source': 'dataset/data1.pdf', 'file_path': 'dataset/data1.pdf', 'page': 1, 'total_pages': 73, 'format': 'PDF 1.6', 'title': 'Blueprint for an AI Bill of Rights', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe Illustrator 26.3 (Macintosh)', 'producer': 'iLovePDF', 'creationDate': "D:20220920133035-04'00'", 'modDate': "D:20221003104118-04'00'", 'trapped': ''}, page_content=' \n \n \n \n \n \n \n \n \

## Task 3: Generate Synthetic Data

Let's first take a peek under the RAGAS hood to see what's happening when we generate a single example.

For simplicities sake - we'll look at a flow that results in a reasoning question.

Actually creating our Synthetic Dataset is as simple as running the following cell!

In [8]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.25,
    reasoning: 0.25
}

  from .autonotebook import tqdm as notebook_tqdm


In [9]:
testset = generator.generate_with_langchain_docs(documents, 15, distributions, with_debugging_logs=True)

Filename and doc_id are the same for all nodes.                   
Generating:   0%|          | 0/16 [00:00<?, ?it/s][ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'structure': 2, 'relevance': 3, 'score': 2.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['AI system', 'Safety risks', 'Negative risk', 'System reliability', 'Real-time monitoring']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 2, 'structure': 2, 'relevance': 3, 'score': 2.25}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['OSTP conducted meetings', 'Private sector and civil society stakeholders', 'AI Bill of Rights', 'Positive use cases', 'Oversight possibilities']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 1, 'depth': 2, 'structure': 1, 'relevance': 2, 'score': 1.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Racial equity', 'Supreme Court Decision', 'Automated society', 'Privacy protection', 'Crime predicti

In [86]:
dataset=testset.to_pandas()

In [87]:
dataset.head()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,Which organizations were involved as private s...,[APPENDIX\n• OSTP conducted meetings with a va...,"Adobe, American Civil Liberties Union (ACLU), ...",simple,"[{'source': 'dataset/data1.pdf', 'file_path': ...",True
1,What is the importance of building a glossary ...,[ \n57 \nNational Institute of Standards and T...,The importance of building a glossary for synt...,simple,"[{'source': 'dataset/data2.pdf', 'file_path': ...",True
2,How can organizations enhance content provenan...,[ \n52 \n• \nMonitoring system capabilities an...,Organizations can enhance content provenance t...,simple,"[{'source': 'dataset/data2.pdf', 'file_path': ...",True
3,What is the purpose of the NIST AI Risk Manage...,[ \n \n \n \n \n \n \n \n \n \n \n \n \n \nSAF...,The NIST AI Risk Management Framework is being...,simple,"[{'source': 'dataset/data1.pdf', 'file_path': ...",True
4,How can structured feedback mechanisms be used...,[ \n29 \nMS-1.1-006 \nImplement continuous mon...,Structured feedback mechanisms can be used to ...,simple,"[{'source': 'dataset/data2.pdf', 'file_path': ...",True


## LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [14]:
from langsmith import Client

client = Client()

dataset_name = "Etical_ai"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about etical AI"
)

In [15]:
for test in testset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": test[1]["question"]
      },
      outputs={
          "answer": test[1]["ground_truth"]
      },
      metadata={
          "context": test[0]
      },
      dataset_id=dataset.id
  )

## Evaluation Pipeline Chunk stragegy and  embedding model


Recursive character text splitter!


In [47]:
from langchain.document_loaders import PyPDFLoader
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os

In [3]:
#Firs TIME
#import nltk
#nltk.download('punkt_tab')
#import nltk
#nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [48]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

print(f"You have {len(data)} document(s) in your data")
print(f"There are {len(data[0].page_content)} characters in your document")

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 50)
rag_documents = text_splitter.split_documents(documents)

print(f"You have {len(rag_documents)} pages")

You have 1 document(s) in your data
There are 219681 characters in your document
You have 465 pages


In [49]:
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(
    model="bge-large",
)

In [45]:
embeddings

OllamaEmbeddings(model='bge-large')

In [50]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="EticalAI"
)

In [51]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
client = QdrantClient(":memory:")

client.create_collection(
    collection_name="EticalAI",
    vectors_config=VectorParams(size=1024 , distance=Distance.COSINE),
)

True

In [52]:
from langchain_qdrant import RetrievalMode

qdrant = QdrantVectorStore.from_documents(
    rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="EticalAI",
    retrieval_mode=RetrievalMode.DENSE,
)


In [53]:

query = "What data privacy concerns are mentioned?"
found_docs = qdrant.similarity_search(query)
found_docs

[Document(metadata={'source': 'dataset/data1.pdf', 'file_path': 'dataset/data1.pdf', 'page': 5, 'total_pages': 73, 'format': 'PDF 1.6', 'title': 'Blueprint for an AI Bill of Rights', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe Illustrator 26.3 (Macintosh)', 'producer': 'iLovePDF', 'creationDate': "D:20220920133035-04'00'", 'modDate': "D:20221003104118-04'00'", 'trapped': '', '_id': '1ddd09fdfddd43fba86cd857f26372ea', '_collection_name': 'EticalAI'}, page_content='SECTION TITLE\nDATA PRIVACY\nYou should be protected from abusive data practices via built-in protections and you \nshould have agency over how data about you is used. You should be protected from violations of \nprivacy through design choices that ensure such protections are included by default, including ensuring that \ndata collection conforms to reasonable expectations and that only data strictly necessary for the specific \ncontext is collected. Designers, developers, and deployers of automated systems 

In [54]:
retriever = vectorstore.as_retriever()

To get the "A" in RAG, we'll provide a prompt.

In [55]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [56]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Finally, we can set-up our RAG LCEL chain!

In [57]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [58]:
rag_chain.invoke({"question" : "What data privacy concerns are mentioned?"})

'The data privacy concerns mentioned include:\n\n1. Protection from abusive data practices and the need for built-in protections.\n2. Agency over how data about individuals is used.\n3. Violations of privacy through design choices that should ensure protections are included by default.\n4. Data collection should conform to reasonable expectations and only collect data strictly necessary for the specific context.\n5. The need for designers, developers, and deployers of automated systems to seek permission and respect user decisions regarding data collection, use, access, transfer, and deletion.\n6. User experience and design decisions should not obfuscate user choice or impose privacy-invasive defaults.\n7. Consent should only be used justifiably for data collection in certain cases.\n8. Lack of a comprehensive statutory or regulatory framework governing public rights concerning personal data, leading to unclear application of existing laws in various contexts. \n9. Concerns about autom

## SemanticChunker

In [60]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

In [62]:
semantic_chunker = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")

In [63]:
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])

In [67]:
vectorstore = Qdrant.from_documents(
    documents=semantic_chunks,
    embedding=embeddings,
    location=":memory:",
    collection_name="EticalAI2"
)

In [69]:
retriever2 = vectorstore.as_retriever()

In [118]:
rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [119]:
rag_chain.invoke({"question" : "What data privacy concerns are mentioned?"})

'The data privacy concerns mentioned include:\n\n1. The use of personal data for GAI training raises risks to privacy principles such as transparency, individual participation (including consent), and purpose specification.\n2. Lack of disclosure by model developers regarding specific data sources used for training, limiting user awareness of whether personally identifiable information (PII) was included.\n3. The potential for models to leak, generate, or infer sensitive information about individuals, leading to privacy risks.\n4. Data memorization, where models may reveal sensitive information that was present in their training data.\n5. Inferences made by models can negatively impact individuals, even if those inferences are not accurate.\n6. Wrong or inappropriate inferences of PII can lead to downstream harmful impacts, such as adverse decisions based on predictive inferences.\n7. Concerns about mission creep in data collection and use, requiring that data collection be minimized a

# According to the answer retrieval, the semantic chunk is providing better responses, so let's evaluate it.

In [115]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [116]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | retriever2, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | primary_qa_llm, "context": itemgetter("context")})

In [117]:
dataset.head()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,Which organizations were involved as private s...,[APPENDIX\n• OSTP conducted meetings with a va...,"Adobe, American Civil Liberties Union (ACLU), ...",simple,"[{'source': 'dataset/data1.pdf', 'file_path': ...",True
1,What is the importance of building a glossary ...,[ \n57 \nNational Institute of Standards and T...,The importance of building a glossary for synt...,simple,"[{'source': 'dataset/data2.pdf', 'file_path': ...",True
2,How can organizations enhance content provenan...,[ \n52 \n• \nMonitoring system capabilities an...,Organizations can enhance content provenance t...,simple,"[{'source': 'dataset/data2.pdf', 'file_path': ...",True
3,What is the purpose of the NIST AI Risk Manage...,[ \n \n \n \n \n \n \n \n \n \n \n \n \n \nSAF...,The NIST AI Risk Management Framework is being...,simple,"[{'source': 'dataset/data1.pdf', 'file_path': ...",True
4,How can structured feedback mechanisms be used...,[ \n29 \nMS-1.1-006 \nImplement continuous mon...,Structured feedback mechanisms can be used to ...,simple,"[{'source': 'dataset/data2.pdf', 'file_path': ...",True


In [105]:
test_questions = dataset["question"].values.tolist()
test_groundtruths = dataset["ground_truth"].values.tolist()

In [106]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

In [107]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [108]:
response_dataset[7]

{'question': 'What is the role of a purpose-built testing environment, such as NIST Dioptra, in empirically evaluating GAI trustworthy characteristics?',
 'answer': 'The role of a purpose-built testing environment, such as NIST Dioptra, in empirically evaluating GAI trustworthy characteristics is to facilitate the assessment of various factors including CBRN Information or Capabilities, Data Privacy, Confabulation, Information Integrity, Information Security, Dangerous, Violent, or Hateful Content, and Harmful Bias and Homogenization. It is utilized to ensure that the AI system being deployed is valid and reliable, and to document limitations of generalizability beyond the conditions under which the technology was developed.',
 'contexts': [' \n31 \nMS-2.3-004 \nUtilize a purpose-built testing environment such as NIST Dioptra to empirically \nevaluate GAI trustworthy characteristics. CBRN Information or Capabilities; \nData Privacy; Confabulation; \nInformation Integrity; Information \

In [109]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

In [110]:
results = evaluate(response_dataset, metrics)

Evaluating:  79%|███████▉  | 63/80 [00:24<00:06,  2.63it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 80/80 [00:43<00:00,  1.83it/s]


In [120]:
results_df2 = results.to_pandas()
results_df2.to_csv("results.csv")

In [127]:
results_df.answer_correctness

0     0.983049
1     0.177262
2     0.909787
3     0.525703
4     0.813576
5     0.178610
6     0.672686
7     0.997191
8     0.180106
9     0.173743
10    0.175215
11    0.176218
12    0.521829
13    0.195204
14    0.866048
15    0.180756
Name: answer_correctness, dtype: float64

In [114]:
results_df2.answer_correctness

0     0.968340
1     0.000000
2     0.972429
3     0.969035
4     0.979054
5     0.000000
6     1.000000
7     0.991522
8     0.997174
9     0.997114
10    0.933063
11    0.918173
12    1.000000
13    0.000000
14    0.971745
15    0.000000
Name: answer_relevancy, dtype: float64

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv("results.csv")

In [3]:
df

Unnamed: 0.1,Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,0,Which organizations were involved as private s...,The organizations involved as private sector a...,['APPENDIX\n• OSTP conducted meetings with a v...,"Adobe, American Civil Liberties Union (ACLU), ...",,0.96834,1.0,1.0,0.983049
1,1,What is the importance of building a glossary ...,I don't know.,['al. (2024) AI deception: A survey of example...,The importance of building a glossary for synt...,0.0,0.0,1.0,0.833333,0.177262
2,2,How can organizations enhance content provenan...,Organizations can enhance content provenance t...,[' \n52 \n• \nMonitoring system capabilities a...,Organizations can enhance content provenance t...,1.0,0.972429,1.0,1.0,0.806347
3,3,What is the purpose of the NIST AI Risk Manage...,The purpose of the NIST AI Risk Management Fra...,[' \n2 \nThis work was informed by public feed...,The NIST AI Risk Management Framework is being...,1.0,0.969035,0.5,0.75,0.428933
4,4,How can structured feedback mechanisms be used...,Structured feedback mechanisms can be used to ...,[' \n29 \nMS-1.1-006 \nImplement continuous mo...,Structured feedback mechanisms can be used to ...,0.625,0.979054,1.0,1.0,0.826538
5,5,How do safety metrics reflect system reliabili...,I don't know.,[' \n32 \nMEASURE 2.6: The AI system is evalua...,Safety metrics in the evaluation of AI systems...,0.0,0.0,1.0,1.0,0.17861
6,6,What is the focus of the National Science Foun...,The focus of the National Science Foundation's...,"["" \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...",The focus of the National Science Foundation's...,1.0,1.0,1.0,1.0,0.619114
7,7,What is the role of a purpose-built testing en...,The role of a purpose-built testing environmen...,[' \n31 \nMS-2.3-004 \nUtilize a purpose-built...,A purpose-built testing environment like NIST ...,0.8,0.991522,1.0,1.0,0.809734
8,8,What factors to consider when managing content...,The factors to consider when managing content ...,"['et al. (2023) Human favoritism, not AI avers...",The answer to given question is not present in...,1.0,0.997174,1.0,0.333333,0.180106
9,9,What steps are needed for safe and effective a...,The steps needed for safe and effective automa...,['Systems should undergo extensive testing bef...,The answer to given question is not present in...,1.0,0.997114,1.0,0.0,0.173593
