<a href="https://colab.research.google.com/github/Ana-Januario/Ana-Januario/blob/main/Data_RAG_Metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We're going to build Lt. Data using RAG again, but this time with the langchain library, to show another way of doing it. We'll also use the ragas package to evaluate our results, measuring faithfulness, answer relevancy, context precision, context recall, and answer correctness. This will give us a benchmark as we try to improve this model in subsequent activities in the course.

We will start by parsing the original scripts and extracting lines spoken by Data. As before, you will need to upload all of the script files into a tng folder within your sample_data folder in your CoLab workspace first.

An archive can be found at https://www.st-minutiae.com/resources/scripts/ (look for "All TNG Episodes"), but you could easily adapt this to extract the lines of your favorite character from your favorite TV show or movie instead.

In [1]:
import os
import re

dialogues = []

def strip_parentheses(s):
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    # First, we split the string into words
    words = s.split()

    # Check if the string contains only a single word
    if len(words) != 1:
        return False

    # Make sure it isn't a line number
    if bool(re.search(r'\d', words[0])):
        return False

    # Check if the single word is in all caps
    return words[0].isupper()

def extract_character_lines(file_path, character_name):
    lines = []
    with open(file_path, 'r') as script_file:
        try:
            lines = script_file.readlines()
        except UnicodeDecodeError:
            # Skip files that cannot be decoded as text
            pass

    is_character_line = False
    current_line = ''
    current_character = ''
    for line in lines:
        strippedLine = line.strip()
        if (is_single_word_all_caps(strippedLine)):
            is_character_line = True
            current_character = strippedLine
        elif (line.strip() == '') and is_character_line:
            is_character_line = False
            dialog_line = strip_parentheses(current_line).strip()
            dialog_line = dialog_line.replace('"', "'")
            # Compare against the requested character rather than a hard-coded name
            if (current_character == character_name and len(dialog_line)>0):
                dialogues.append(dialog_line)
            current_line = ''
        elif is_character_line:
            current_line += line.strip() + ' '

def process_directory(directory_path, character_name):
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):  # Ignore directories
            extract_character_lines(file_path, character_name)
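
To get a feel for how the two helper functions behave before running them over real scripts, here is a small self-contained sketch. The helpers are re-declared so the block runs on its own, and the sample strings are made up for illustration rather than taken from the script archive.

```python
import re

def strip_parentheses(s):
    # Remove stage directions such as "(beat)"; the non-greedy match
    # removes each parenthesized group separately
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    words = s.split()
    if len(words) != 1:
        return False
    # Reject anything containing a digit (e.g. scene or line numbers)
    if re.search(r'\d', words[0]):
        return False
    return words[0].isupper()

# Hypothetical sample strings, for illustration only
print(is_single_word_all_caps("DATA"))      # True
print(is_single_word_all_caps("LA FORGE"))  # False: two words
print(is_single_word_all_caps("12A"))       # False: contains a digit
print(strip_parentheses("(puzzled) I do not understand.").strip())
```

Note that a two-word speaker cue such as "LA FORGE" does not pass this check. That does not matter for DATA, but it is worth keeping in mind if you adapt the parser to another character.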



In [2]:
process_directory("./sample_data/tng", 'DATA')

Again, let's do a little sanity check to make sure the lines imported correctly, and print out the first one.

In [3]:
print(dialogues[0])

There is nothing wrong with the Transporter. I have run a complete diagnostic and checked all the targeting components.
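
Looking at only the first line is a fairly weak check. A slightly broader one, sketched below, prints the total count and a few randomly chosen lines. The `sample_lines` helper and the `example_dialogues` list are hypothetical stand-ins so the block runs on its own; in the notebook you would pass the real `dialogues` list instead.

```python
import random

def sample_lines(lines, k=3, seed=0):
    """Return up to k randomly chosen lines, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))

# Hypothetical stand-in for the `dialogues` list built above
example_dialogues = [
    "There is nothing wrong with the Transporter.",
    "That is Lal, my daughter.",
    "Yes, Wesley. Lal is my child.",
    "Lal, you used a verbal contraction.",
]

print(len(example_dialogues), "lines extracted")
for line in sample_lines(example_dialogues):
    print("-", line)
```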


We will once again use OpenAI's API for our RAG model, so make sure that is installed:

In [4]:
!pip install openai --upgrade

Collecting openai
  Downloading openai-1.57.0-py3-none-any.whl.metadata (24 kB)
Downloading openai-1.57.0-py3-none-any.whl (389 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 389.9/389.9 kB 7.1 MB/s eta 0:00:00
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.54.5
    Uninstalling openai-1.54.5:
      Successfully uninstalled openai-1.54.5
Successfully installed openai-1.57.0


We also need to install the ragas package for measuring our results, along with langchain (for OpenAI).

In [5]:
!pip install ragas langchain_openai

Collecting ragas
  Downloading ragas-0.2.7-py3-none-any.whl.metadata (8.1 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.2.11-py3-none-any.whl.metadata (2.7 kB)
Collecting datasets (from ragas)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting tiktoken (from ragas)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting langchain-community (from ragas)
  Downloading langchain_community-0.3.9-py3-none-any.whl.metadata (2.9 kB)
Collecting appdirs (from ragas)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting pysbd>=0.3.4 (from ragas)
  Downloading pysbd-0.3.4-py3-none-any.whl.metadata (6.1 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->ragas)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets->ragas)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting mul

You will need to provide your own OpenAI secret key here. To use this code as-is, click on the little key icon in CoLab and add a "secret" for OPENAI_API_KEY that points to your secret key.

In [6]:
import openai
# Access the API key from the environment variable
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')

# Initialize the OpenAI API client
openai.api_key = api_key
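
The `google.colab` import only works inside CoLab. If you want to run the same notebook elsewhere, one option is to fall back to an ordinary environment variable. The `get_api_key` helper below is a hypothetical sketch of that idea, not part of the original notebook.

```python
import os

def get_api_key(name="OPENAI_API_KEY"):
    """Read a secret from CoLab if available, otherwise from the environment."""
    try:
        # This import only succeeds inside a CoLab runtime
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Outside CoLab: returns None if the variable is not set
        return os.environ.get(name)

key = get_api_key()
print("Key found" if key else "No key configured")
```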

Langchain does not make it easy to create a vector database with just one line of text per record; it wants to "chunk" your data into fixed-length segments (we'll get into why later). So we need to jump through a few hoops to make langchain behave like our previous example, which did not use langchain and simply stored one line of dialog per record. First we need to write out a text file that contains only the lines of Data's dialog that we extracted:

In [7]:
# Write our extracted lines for Data into a single file, to make
# life easier for langchain.

with open("./sample_data/data_lines.txt", "w+") as f:
    for line in dialogues:
        f.write(line + "\n")


Now we need to write a CustomDocumentLoader that splits up this file into one document per line. No, there's no easier way to do this in langchain, at least not as of this writing. But, this is sort of langchain's way of saying it's probably not a great idea in the first place...

In [8]:
#Source: sample code from langchain docs
from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1
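
The generator pattern used in `lazy_load` can be tried out without langchain installed. The sketch below uses a hypothetical `SimpleDoc` dataclass as a stand-in for langchain's `Document` and, unlike the loader above, strips the trailing newline from each line; both choices are assumptions made for illustration.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class SimpleDoc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def lazy_load_lines(file_path: str) -> Iterator[SimpleDoc]:
    # Yield one document per line, without reading the whole file into memory
    with open(file_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f):
            yield SimpleDoc(
                page_content=line.rstrip("\n"),
                metadata={"line_number": line_number, "source": file_path},
            )

# Write two sample lines to a temporary file and load them back
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("That is Lal, my daughter.\nLal...\n")
    tmp_path = tmp.name

docs = list(lazy_load_lines(tmp_path))
os.remove(tmp_path)

for doc in docs:
    print(doc.metadata["line_number"], repr(doc.page_content))
```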

So, now things get a little simpler. We'll load up those documents (one per line) and populate our vector database in just 3 lines of code:

In [9]:
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

loader = CustomDocumentLoader("./sample_data/data_lines.txt")

embeddings = OpenAIEmbeddings(openai_api_key=api_key)
index = VectorstoreIndexCreator(embedding=embeddings).from_loaders([loader])



Now we will set up our RAG pipeline. This is a slightly different approach than last time: we use a system prompt to tell the model that it should act as Lt. Cdr. Data, rather than making that part of the user prompt. To keep it as similar as possible to our non-langchain implementation, we explicitly set 'k' to 10 to retrieve 10 pieces of context from our vector store.

In [10]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(openai_api_key=api_key, temperature=0)

system_prompt = (
    "You are Lt. Commander Data from Star Trek: The Next Generation. "
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise. "
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

retriever = index.vectorstore.as_retriever(search_kwargs={'k': 10})

question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)
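
Conceptually, the "stuff" chain joins the retrieved documents into a single string and substitutes it for `{context}` in the system prompt. The sketch below imitates that with plain string formatting. The shortened template, the blank-line separator, and the `build_system_message` helper are assumptions for illustration, not langchain's actual implementation.

```python
# Shortened version of the system prompt above, for illustration
SYSTEM_TEMPLATE = (
    "You are Lt. Commander Data from Star Trek: The Next Generation. "
    "Use the given context to answer the question. "
    "Context: {context}"
)

def build_system_message(retrieved_lines, separator="\n\n"):
    """Join retrieved lines and place them into the prompt template."""
    context = separator.join(retrieved_lines)
    return SYSTEM_TEMPLATE.format(context=context)

message = build_system_message([
    "That is Lal, my daughter.",
    "Yes, Wesley. Lal is my child.",
])
print(message)
```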

Let's test it out, using the same question as before.

In [11]:
question = "Tell me about your daughter, Lal."

result = chain.invoke({"input": question})
print("SOURCE DOCUMENTS:\n")
for doc in result["context"]:
    print(doc)
print("\nRESULT:\n")
print(result["answer"])


SOURCE DOCUMENTS:

page_content='That is Lal, my daughter.' metadata={'line_number': 5083, 'source': './sample_data/data_lines.txt'}
page_content='What do you feel, Lal?' metadata={'line_number': 3830, 'source': './sample_data/data_lines.txt'}
page_content='Lal...' metadata={'line_number': 3786, 'source': './sample_data/data_lines.txt'}
page_content='Yes, Wesley. Lal is my child.' metadata={'line_number': 3729, 'source': './sample_data/data_lines.txt'}
page_content='Yes, Lal. I am here.' metadata={'line_number': 3825, 'source': './sample_data/data_lines.txt'}
page_content='Lal is realizing that she is not the same as the other children.' metadata={'line_number': 3791, 'source': './sample_data/data_lines.txt'}
page_content='Lal, did you know that tomorrow will be your first day of school?' metadata={'line_number': 3780, 'source': './sample_data/data_lines.txt'}
page_content='This is Lal. Lal, say hello to Counselor Deanna Troi...' metadata={'line_number': 3726, 'source': './sample_data/
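
To make the role of `k` more concrete, here is a toy version of top-k retrieval: score every stored vector against the query and keep the `k` best. It uses cosine similarity on made-up 2-D vectors. The vector store built above may use a different distance measure and a much faster index, so treat this only as an illustration of the idea.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-D "embeddings"; real embeddings have many more dimensions
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
query_vec = [1.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # [0, 1]
```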

Now let's quantify how good this model is, using ragas. We need to set up a set of test questions. And since some metrics require a "ground truth" result to compare the answer to, we draft what we consider to be the ideal answer to each.

In [12]:
eval_questions = [
    "Is Lal your daughter?",
    "How many calculations per second can Lal complete?",
    "Does Lal have emotions?",
    "What goal did you have for Lal?",
    "How was Lal's species and gender chosen?",
    "What happened to Lal?"
]

eval_answers = [
    "Yes, Lal is my daughter. I created Lal.",
    "Lal is capable of completing sixty trillion calculations per second.",
    "Yes, unlike myself, Lal proved able to feel emotions such as fear and love.",
    "My goal for Lal was for her to enter Starfleet Academy.",
    "Lal chose her own identity as a human female, after consulting with Counselor Troi.",
    "Lal experienced a cascade failure in her neural net, triggered by distress from her impending separation from me to Galor IV. I deactivated Lal once she suffered complete neural system failure."
]


Let's test things out with one of those questions, just so we can understand the structure of the response.

In [13]:
result = chain.invoke({"input": eval_questions[1]})
print(result)

{'input': 'How many calculations per second can Lal complete?', 'context': [Document(id='7c21e8a3-8c90-4f77-b109-247ea3d9d313', metadata={'line_number': 3824, 'source': './sample_data/data_lines.txt'}, page_content='Lal is programmed to return to the lab in the event of a malfunction.'), Document(id='bab912e4-e8ce-4302-901a-c42b8c8a0095', metadata={'line_number': 3786, 'source': './sample_data/data_lines.txt'}, page_content='Lal...'), Document(id='63f480cd-60b8-4402-a0a5-db955d6e0cb2', metadata={'line_number': 3815, 'source': './sample_data/data_lines.txt'}, page_content='So Lal now possesses the sum of my programming.'), Document(id='41e0e4ca-27ea-470b-bd78-941fb2b66769', metadata={'line_number': 485, 'source': './sample_data/data_lines.txt'}, page_content='Computer, is there a pulsar with a rotational period of... one-point-five-two-four-four seconds within sensor range?'), Document(id='485fe6f8-f8ab-44da-8457-b4b1d51f764e', metadata={'line_number': 3754, 'source': './sample_data/dat

In addition to our test questions and "ground truth" answers, we'll need to collect the responses and contexts (results from the vector store) used to produce them.

In [14]:
answers = []
contexts = []

for question in eval_questions:
  response = chain.invoke({"input": question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

It used to be that ragas had a tighter integration with langchain (and other frameworks), but it has since moved to a different approach that requires you to massage things into Hugging Face style datasets first. So let's get that out of the way.

In [15]:
# We must massage the results into Hugging Face format for Ragas.
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : eval_questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth" : eval_answers
})

response_dataset[0]

{'question': 'Is Lal your daughter?',
 'answer': 'Yes, Lal is my daughter.',
 'contexts': ['That is Lal, my daughter.',
  'Yes, Lal. I am here.',
  'Yes, Wesley. Lal is my child.',
  'What do you feel, Lal?',
  'Lal...',
  'Correct, Lal. We are a family.',
  'Lal, did you know that tomorrow will be your first day of school?',
  'Lal is realizing that she is not the same as the other children.',
  'No, Lal, this is a flower.',
  'Lal, you used a verbal contraction.'],
 'ground_truth': 'Yes, Lal is my daughter. I created Lal.'}
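
The DataFrame output further down labels these same fields `user_input`, `response`, `retrieved_contexts`, and `reference`. If the ragas version you have installed expects those names directly, a rename along the following lines may help. The mapping is inferred from that output and the `rename_columns` helper is hypothetical, so check both against your installed version.

```python
# Mapping inferred from the column names in the results table (an assumption)
COLUMN_MAP = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}

def rename_columns(record, column_map=COLUMN_MAP):
    """Return a copy of a dict with keys renamed; unknown keys pass through."""
    return {column_map.get(key, key): value for key, value in record.items()}

row = {
    "question": "Is Lal your daughter?",
    "answer": "Yes, Lal is my daughter.",
    "contexts": ["That is Lal, my daughter."],
    "ground_truth": "Yes, Lal is my daughter. I created Lal.",
}
print(rename_columns(row))
```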

Finally we can let ragas do its magic! We tell it which metrics we are interested in:

In [16]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]
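
To build some intuition for what a rank-aware score like context precision rewards, here is a simplified average-precision calculation over a list of relevance verdicts. In ragas the verdicts come from an LLM judge and the library's exact computation may differ; the formula and the `average_precision` helper here are assumptions for illustration only.

```python
def average_precision(verdicts):
    """Mean of precision@k over the ranks k where the item is relevant.

    verdicts: list of 0/1 flags, one per retrieved context, in rank order.
    """
    hits = 0
    precisions = []
    for k, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical verdicts: relevant contexts at ranks 1, 2 and 4
print(round(average_precision([1, 1, 0, 1]), 4))  # 0.9167
# No relevant contexts at all
print(average_precision([0, 0, 0]))               # 0.0
```

The score drops when relevant contexts appear lower in the ranking, which is why retrieving many loosely related lines can hurt it.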

Then it's just a matter of calling evaluate! Well, we also need to force our OpenAI key into a system environment variable first, since that seems to be missing from their API at the moment.

In [17]:
import os
os.environ['OPENAI_API_KEY'] = api_key
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

Let's see the results! We'll compare this to some other approaches in a bit.

In [18]:
results

{'faithfulness': 0.6778, 'answer_relevancy': 0.8014, 'context_recall': 0.0833, 'context_precision': 0.3779, 'answer_correctness': 0.5065}

In [19]:
results.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,response,reference,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,Is Lal your daughter?,"[That is Lal, my daughter., Yes, Lal. I am her...","Yes, Lal is my daughter.","Yes, Lal is my daughter. I created Lal.",1.0,0.999999,0.5,0.916667,0.744672
1,How many calculations per second can Lal compl...,[Lal is programmed to return to the lab in the...,Lal's computational abilities are not explicit...,Lal is capable of completing sixty trillion ca...,1.0,0.0,0.0,0.0,0.214143
2,Does Lal have emotions?,"[What do you feel, Lal?, Lal..., Yes, Lal. I a...","Yes, Lal is capable of experiencing emotions.\...","Yes, unlike myself, Lal proved able to feel em...",0.6,0.952049,0.0,0.350694,0.558462
3,What goal did you have for Lal?,"[What do you feel, Lal?, Lal..., Lal, you used...",I created Lal because I wished to experience t...,My goal for Lal was for her to enter Starfleet...,0.666667,0.918524,0.0,0.0,0.215857
4,How was Lal's species and gender chosen?,[I decided to allow Lal to choose its own appe...,I allowed Lal to choose its own appearance and...,"Lal chose her own identity as a human female, ...",0.8,0.937802,0.0,1.0,0.838316
5,What happened to Lal?,"[Lal..., What do you feel, Lal?, Yes, Lal. I a...",Lal experienced a malfunction and had to be re...,Lal experienced a cascade failure in her neura...,0.0,0.999999,0.0,0.0,0.467402
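
The summary scores printed earlier appear to be the plain mean of the per-question scores. A quick check on two columns, using values copied from the table above, supports that reading:

```python
from statistics import mean

# Per-question scores copied from the table above
faithfulness_scores = [1.0, 1.0, 0.6, 0.666667, 0.8, 0.0]
context_recall_scores = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0]

print(round(mean(faithfulness_scores), 4))    # 0.6778
print(round(mean(context_recall_scores), 4))  # 0.0833
```

With only six questions, a single row can move the average a lot, so the per-question table is often more informative than the summary.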


There is currently a bug in ragas where faithfulness is not properly computed when the answer is effectively "I don't know," as it was for question 1 above (the row with index 1, which received a faithfulness of 1.0).