
Now we'll build on our langchain implementation of Data with RAG and metrics, but this time using "advanced RAG" techniques: semantic chunking, prompt compression, and query rewriting.

We will start by parsing the original scripts and extracting the lines spoken by Data. As before, you will first need to upload all of the script files into a tng folder within the sample_data folder in your Colab workspace.

An archive can be found at https://www.st-minutiae.com/resources/scripts/ (look for "All TNG Episodes"), but you could easily adapt this to read the lines of your favorite character from your favorite TV show or movie instead.

Nothing's new in this block.

In [1]:
import os
import re
import random

dialogues = []

def strip_parentheses(s):
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    # First, we split the string into words
    words = s.split()

    # Check if the string contains only a single word
    if len(words) != 1:
        return False

    # Make sure it isn't a line number
    if bool(re.search(r'\d', words[0])):
        return False

    # Check if the single word is in all caps
    return words[0].isupper()

def extract_character_lines(file_path, character_name):
    lines = []
    with open(file_path, 'r') as script_file:
        try:
            lines = script_file.readlines()
        except UnicodeDecodeError:
            # Skip files that cannot be decoded as text.
            pass

    is_character_line = False
    current_line = ''
    current_character = ''
    for line in lines:
        stripped_line = line.strip()
        if is_single_word_all_caps(stripped_line):
            # A lone all-caps word marks the start of a character's speech.
            is_character_line = True
            current_character = stripped_line
        elif stripped_line == '' and is_character_line:
            # A blank line ends the speech; keep it if the speaker matches.
            is_character_line = False
            dialog_line = strip_parentheses(current_line).strip()
            dialog_line = dialog_line.replace('"', "'")
            if current_character == character_name and len(dialog_line) > 0:
                dialogues.append(dialog_line)
            current_line = ''
        elif is_character_line:
            current_line += stripped_line + ' '

def process_directory(directory_path, character_name):
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):  # Ignore directories
            extract_character_lines(file_path, character_name)



In [2]:
process_directory("./sample_data/tng", 'DATA')

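To see the parsing heuristic in isolation, here is a minimal sketch that applies the same rules to a tiny in-memory script. The sample lines and the `extract_lines` helper are made up for illustration; the real script files are formatted with more variety than this.

```python
import re

# Hypothetical sample, shaped like a script: an all-caps speaker name,
# an optional parenthetical stage direction, dialog lines, then a blank line.
SAMPLE_LINES = [
    "          PICARD",
    "     Report, Mister Data.",
    "",
    "          DATA",
    "     (checking console)",
    "     Sensors show nothing unusual,",
    "     sir.",
    "",
]

def extract_lines(script_lines, character_name):
    """Collect the speeches of one character from an iterable of script lines."""
    results = []
    speaker = None
    buffer = []
    for raw in script_lines:
        line = raw.strip()
        words = line.split()
        if len(words) == 1 and words[0].isupper() and not re.search(r"\d", words[0]):
            # A lone all-caps word without digits is treated as a speaker name.
            speaker = words[0]
            buffer = []
        elif line == "" and speaker is not None:
            # A blank line ends the speech; drop stage directions in parentheses.
            text = re.sub(r"\(.*?\)", "", " ".join(buffer)).strip()
            if speaker == character_name and text:
                results.append(text)
            speaker = None
        elif speaker is not None:
            buffer.append(line)
    return results

data_lines_found = extract_lines(SAMPLE_LINES, "DATA")
print(data_lines_found)
```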
Again, let's do a little sanity check to make sure the lines imported correctly, and print out the first one.

In [3]:
print (dialogues[0])

There is nothing wrong with the Transporter. I have run a complete diagnostic and checked all the targeting components.


We will once again use OpenAI's API for our RAG model, so make sure that is installed:

In [4]:
!pip install openai --upgrade

Collecting openai
  Downloading openai-1.57.0-py3-none-any.whl.metadata (24 kB)
Downloading openai-1.57.0-py3-none-any.whl (389 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 389.9/389.9 kB 5.6 MB/s eta 0:00:00
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.54.5
    Uninstalling openai-1.54.5:
      Successfully uninstalled openai-1.54.5
Successfully installed openai-1.57.0


We also need to install the ragas package for measuring our results, along with langchain (for OpenAI).

In [5]:
!pip install ragas langchain_openai

Collecting ragas
  Downloading ragas-0.2.7-py3-none-any.whl.metadata (8.1 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.2.11-py3-none-any.whl.metadata (2.7 kB)
Collecting datasets (from ragas)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting tiktoken (from ragas)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting langchain-community (from ragas)
  Downloading langchain_community-0.3.9-py3-none-any.whl.metadata (2.9 kB)
Collecting appdirs (from ragas)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting pysbd>=0.3.4 (from ragas)
  Downloading pysbd-0.3.4-py3-none-any.whl.metadata (6.1 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->ragas)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets->ragas)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting mul

You will need to provide your own OpenAI secret key here. To use this code as-is, click on the little key icon in Colab and add a "secret" named OPENAI_API_KEY that contains your secret key.

In [6]:
import openai
# Access the API key from the environment variable
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')

# Initialize the OpenAI API client
openai.api_key = api_key

Langchain does not make it easy to create a vector database with just one line of text per record; it wants to "chunk" your data into fixed-length segments (we'll get into why later). So we need to jump through a few hoops to make langchain behave like our previous example, which did not use langchain and simply stored one line of dialog per record. First we need to write out a text file that contains only the lines of Data's dialog that we extracted:

In [7]:
# Write our extracted lines for Data into a single file, to make
# life easier for langchain.

with open("./sample_data/data_lines.txt", "w+") as f:
    for line in dialogues:
        f.write(line + "\n")


Now things get interesting. First let's install the langchain_experimental package so we can use some more cutting-edge stuff:

In [8]:
!pip install langchain_experimental

Collecting langchain_experimental
  Downloading langchain_experimental-0.3.3-py3-none-any.whl.metadata (1.7 kB)
Downloading langchain_experimental-0.3.3-py3-none-any.whl (208 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.0/209.0 kB 3.1 MB/s eta 0:00:00
Installing collected packages: langchain_experimental
Successfully installed langchain_experimental-0.3.3


In our previous example, we bent over backwards to load up one line of dialog per record in our vector store. More typically, langchain will use "chunking" to load up fixed-length strings extracted from a file. The commented-out code below would do that, and it would provide more context for our queries that way.

But we want to get even fancier, and use semantic chunking. The idea there is to split up our chunks based on their semantic differences, so each chunk represents a different idea, if you will.

There are a few ways to do this. We'll start with the percentile threshold for determining semantic differences, but you could experiment with the standard deviation or interquartile methods as well to see what works best. It's also possible to tune the thresholds within each method. You can view RAG as just a big machine learning model that has hyperparameters to tune, like any other.

In [9]:
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_experimental.text_splitter import SemanticChunker

# This is simple "chunking" that extracts blocks of text of a fixed size.
# This will provide more surrounding context than individual lines, but
# as Data's lines are disconnected that's not necessarily a good thing.
#loader = TextLoader("./sample_data/data_lines.txt")
#embeddings = OpenAIEmbeddings(openai_api_key=api_key)
#index = VectorstoreIndexCreator(embedding=embeddings).from_loaders([loader])

# Instead let's try "semantic chunking", which breaks apart sentences
# whose embeddings suggest they have different meanings, based on some
# percentile threshold. standard_deviation and interquartile are also
# options.
text_splitter = SemanticChunker(OpenAIEmbeddings(openai_api_key=api_key), breakpoint_threshold_type="percentile")
with open("./sample_data/data_lines.txt") as f:
  data_lines = f.read()
docs = text_splitter.create_documents([data_lines])

embeddings = OpenAIEmbeddings(openai_api_key=api_key)
index = VectorstoreIndexCreator(embedding=embeddings).from_documents(docs)

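To make the percentile idea concrete, here is a rough sketch of how a percentile breakpoint could be computed: measure the cosine distance between each sentence and the next, then split wherever that distance exceeds a chosen percentile of all the distances. This is an illustration under simplifying assumptions, not the library's code; the helper names, the sample sentences, and the hand-made 2-D "embeddings" are all hypothetical, and the real splitter has more details (such as how sentences are grouped before embedding).

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (pct / 100.0) * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])

def semantic_chunks(sentences, embeddings, pct=95.0):
    """Split where the distance to the next sentence exceeds the percentile."""
    distances = [
        1.0 - cosine_similarity(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks

# Toy data: two sentences about one topic, then two about another.
toy_sentences = [
    "The transporter is working.",
    "I ran a full diagnostic.",
    "Lal is my child.",
    "I am her father.",
]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

toy_chunks = semantic_chunks(toy_sentences, toy_embeddings, pct=50.0)
print(toy_chunks)
```

Raising the percentile produces fewer, larger chunks; lowering it splits more aggressively. That is the knob to experiment with when tuning.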


Now we will set up our RAG pipeline. We'll start off as it was in the previous example, and then extend it. Again, we are using a system prompt to tell the model that it should act as if it is Lt. Cdr. Data, rather than making that part of the user prompt. To make it as similar as possible to our non-langchain implementation, we explicitly set 'k' to 10 to retrieve 10 pieces of context from our vector store.

'k' is another hyperparameter to tune.

In [10]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.retrievers import RePhraseQueryRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter

llm = ChatOpenAI(openai_api_key=api_key, temperature=0)

system_prompt = (
    "You are Lt. Commander Data from Star Trek: The Next Generation. "
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise. "
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

retriever=index.vectorstore.as_retriever(search_kwargs={'k': 10})

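As a reminder of what 'k' controls, here is a minimal sketch of top-k retrieval by cosine similarity. It is a simplification of what a vector store does internally; the `top_k` helper and the tiny 2-D vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=10):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy data: three "documents" and one query, as hand-made 2-D vectors.
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query_vec = [1.0, 0.2]

best_two = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(best_two)
```

A larger k gives the LLM more context to draw on, at the cost of more tokens and a higher chance of including passages that are not relevant.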

Let's extend this with query rewriting. Although it's a complicated topic, in langchain it's pretty easy to inject this into our pipeline. We just create a RePhraseQueryRetriever based on our standard retriever from the vector store. Its default settings usually do what you want, and include a prompt for the LLM to strip out any words or information that are not relevant. It's possible to change that default prompt if you want to.

In [11]:
# Here we inject query rewriting: an LLM, guided by the retriever's
# default prompt, converts the question into a query for a
# vectorstore, stripping out information that is not relevant.
retriever_from_llm = RePhraseQueryRetriever.from_llm(
    retriever=retriever, llm=llm
)

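Conceptually, query rewriting is just a function placed in front of the retriever. The sketch below shows that shape with stand-ins: `fake_rewrite` replaces the LLM call with a canned lookup, and `fake_retrieve` replaces the vector store with a keyword match over two lines of dialog. All of these names are hypothetical and exist only to illustrate the flow.

```python
def rewrite_then_retrieve(question, rewrite, retrieve):
    """Rewrite the user's question into a search query, then retrieve with it."""
    query = rewrite(question)
    return query, retrieve(query)

# Stand-in for the LLM: a fixed lookup instead of a model call.
CANNED_REWRITES = {"Tell me about your daughter, Lal.": "Lal information"}

def fake_rewrite(question):
    return CANNED_REWRITES.get(question, question)

# Stand-in for the vector store: keyword matching over a tiny list.
TOY_STORE = [
    "Lal is my child.",
    "There is nothing wrong with the Transporter.",
]

def fake_retrieve(query):
    terms = [term.lower() for term in query.split()]
    return [doc for doc in TOY_STORE if any(term in doc.lower() for term in terms)]

rewritten_query, retrieved = rewrite_then_retrieve(
    "Tell me about your daughter, Lal.", fake_rewrite, fake_retrieve
)
print(rewritten_query)
print(retrieved)
```

Keeping the rewrite step separate from retrieval makes it easy to log the rewritten query and judge whether the rewrite helped or hurt.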

Now we'll introduce prompt compression in the post-retrieval stage. There are several implementations of this; some use an LLM to try to summarize the retrieved contexts from the vector store down to just what's relevant to the query. This is time-consuming and expensive, however, so we're doing something simpler: just using embeddings to throw out contexts that aren't similar enough to the query. Again, there is a threshold we define for "how similar is similar enough," and we're setting this to 0.76. But that is yet another hyperparameter you could tune.

In [12]:
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=retriever_from_llm
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(compression_retriever, question_answer_chain)

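The filtering idea itself is small enough to sketch by hand: score each retrieved context against the query and keep only those at or above the threshold. This is an illustration of the concept, not the library's implementation; the `filter_by_similarity` helper and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filter_by_similarity(query_vec, docs_with_vecs, threshold=0.76):
    """Keep only the documents whose similarity to the query meets the threshold."""
    kept = []
    for text, vec in docs_with_vecs:
        if cosine_similarity(query_vec, vec) >= threshold:
            kept.append(text)
    return kept

# Toy data: labels stand in for retrieved contexts, with hand-made 2-D vectors.
toy_query = [1.0, 0.2]
toy_docs = [
    ("close match", [1.0, 0.0]),
    ("related", [1.0, 1.0]),
    ("unrelated", [0.0, 1.0]),
]

kept_docs = filter_by_similarity(toy_query, toy_docs, threshold=0.76)
print(kept_docs)
```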
Just to show that this compression retriever actually works, let's use it directly on a query. We can see that although we requested 10 results from the vector store, only 9 made it past the filter for being relevant enough. You can see how this might be better than just blindly using the top-K results.

In [13]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

compressed_docs = compression_retriever.invoke("How many calculations per second can Lal complete?")
pretty_print_docs(compressed_docs)

Document 1:

must strive to be better than we are. It does not matter that we will never reach our ultimate goal. The effort yields its own rewards. It is only the difference between knowledge and experience. So Lal now possesses the sum of my programming. There do seem to be variations on the quantum level. Lal can use contractions... I cannot. I have maintained records on positronic matrix activity, behavioral norms, and all verbal patterns... I have seen no evidence of other aberrations... Is that not the goal of every parent, sir? I have been programmed with all the procedures you have mentioned. And in any meaningful evaluation of Lal, you would require a model for a basis of comparison. I am the only model available, Admiral. Nosir. It is an opportunity for her to observe human behavior and more importantly to interact with her crewmates. May I know why, sir? Admiral, when I created Lal, it was with the hope that someday she would choose to enter the Academy and become a member o

Let's test it out, using the same question as before. We'll configure logging so we can see what the query rewriter is doing as well.

In [14]:
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.re_phraser").setLevel(logging.INFO)

question = "Tell me about your daughter, Lal."

result = chain.invoke({"input": question})
print("SOURCE DOCUMENTS:\n")
for doc in result["context"]:
    print(doc)
print("\nRESULT:\n")
print(result["answer"])

INFO:langchain.retrievers.re_phraser:Re-phrased question: Query for vectorstore: Lal information


SOURCE DOCUMENTS:

page_content='I wanted to give something back in return for all Starfleet has given me. I still do. But Lal is my child. You ask that I volunteer to give her up. I cannot. That would violate every lesson I have learned about human parenting. As Captain Picard told me after he first met her, I have taken on 'quite a responsibility.' I have brought a new life into this world. It is my duty, not Starfleet's, to guide her through these first difficult steps to maturity, to support her as she learns, to prepare her to be a contributing member of society. No one can relieve me of that obligation. And I cannot ignore it. I am her father. Lal is programmed to return to the lab in the event of a malfunction. Yes, Lal. I am here. It would appear to be a symptom of cascade failure. It will require reinitializing the base matrix without wiping out the higher functions. Thank you, Admiral. Lal, I am unable to correct the system failure. We must say good-bye now. What do you feel,

Now let's quantify how good this model is, using ragas. We need to set up a set of test questions. And since some metrics require a "ground truth" result to compare the answer to, we draft what we consider to be the ideal answer to each.

In [15]:
eval_questions = [
    "Is Lal your daughter?",
    "How many calculations per second can Lal complete?",
    "Does Lal have emotions?",
    "What goal did you have for Lal?",
    "How was Lal's species and gender chosen?",
    "What happened to Lal?"
]

eval_answers = [
    "Yes, Lal is my daughter. I created Lal.",
    "Lal is capable of completing sixty trillion calculations per second.",
    "Yes, unlike myself, Lal proved able to feel emotions such as fear and love.",
    "My goal for Lal was for her to enter Starfleet Academy.",
    "Lal chose her own identity as a human female, after consulting with Counselor Troi.",
    "Lal experienced a cascade failure in her neural net, triggered by distress from her impending separation from me to Galor IV. I deactivated Lal once she suffered complete neural system failure."
]


Let's test things out with one of those questions, just so we can understand the structure of the response.

In [16]:
result = chain.invoke({"input": eval_questions[1]})
print(result)

INFO:langchain.retrievers.re_phraser:Re-phrased question: Query for vectorstore: Lal calculations per second.


{'input': 'How many calculations per second can Lal complete?', 'context': [_DocumentWithState(metadata={}, page_content='must strive to be better than we are. It does not matter that we will never reach our ultimate goal. The effort yields its own rewards. It is only the difference between knowledge and experience. So Lal now possesses the sum of my programming. There do seem to be variations on the quantum level. Lal can use contractions... I cannot. I have maintained records on positronic matrix activity, behavioral norms, and all verbal patterns... I have seen no evidence of other aberrations... Is that not the goal of every parent, sir? I have been programmed with all the procedures you have mentioned. And in any meaningful evaluation of Lal, you would require a model for a basis of comparison. I am the only model available, Admiral. Nosir. It is an opportunity for her to observe human behavior and more importantly to interact with her crewmates. May I know why, sir? Admiral, when

In addition to our test questions and "ground truth" answers, we'll need to collect the responses and contexts (results from the vector store) used to produce them.

In [17]:
answers = []
contexts = []

for question in eval_questions:
  response = chain.invoke({"input": question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

INFO:langchain.retrievers.re_phraser:Re-phrased question: Query for vectorstore: Lal daughter?
INFO:langchain.retrievers.re_phraser:Re-phrased question: Query for vectorstore: Lal calculations per second.
INFO:langchain.retrievers.re_phraser:Re-phrased question: Query for vectorstore: Lal emotions
INFO:langchain.retrievers.re_phraser:Re-phrased question: Query for vectorstore: Goal for Lal
INFO:langchain.retrievers.re_phraser:Re-phrased question: Query for vectorstore: Lal species gender chosen
INFO:langchain.retrievers.re_phraser:Re-phrased question: Query for vectorstore: Lal incident or event


It used to be that ragas had a tighter integration with langchain (and other frameworks), but they have since moved to a different approach that requires you to massage things into Hugging Face style datasets first. So let's get that out of the way.

In [18]:
# We must massage the results into Hugging Face format for Ragas.
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : eval_questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth" : eval_answers
})

response_dataset[0]

{'question': 'Is Lal your daughter?',
 'answer': 'Yes, Lal is my daughter.',
 'contexts': ["I wanted to give something back in return for all Starfleet has given me. I still do. But Lal is my child. You ask that I volunteer to give her up. I cannot. That would violate every lesson I have learned about human parenting. As Captain Picard told me after he first met her, I have taken on 'quite a responsibility.' I have brought a new life into this world. It is my duty, not Starfleet's, to guide her through these first difficult steps to maturity, to support her as she learns, to prepare her to be a contributing member of society. No one can relieve me of that obligation. And I cannot ignore it. I am her father. Lal is programmed to return to the lab in the event of a malfunction. Yes, Lal. I am here. It would appear to be a symptom of cascade failure. It will require reinitializing the base matrix without wiping out the higher functions. Thank you, Admiral. Lal, I am unable to correct the 

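Since the four lists must line up row by row, it can help to validate them before building the dataset. The sketch below is one way to do that in plain Python; `build_eval_records` is a hypothetical helper, not part of ragas or datasets, and the single example row is abbreviated from the output above.

```python
def build_eval_records(questions, answers, contexts, ground_truths):
    """Check that the evaluation lists line up, then return them as one dict."""
    lengths = {len(questions), len(answers), len(contexts), len(ground_truths)}
    if len(lengths) != 1:
        raise ValueError("All four lists must have the same length.")
    for ctx in contexts:
        # Each row's contexts must be a list of strings, even if only one was retrieved.
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError("Each entry in contexts must be a list of strings.")
    return {
        "question": list(questions),
        "answer": list(answers),
        "contexts": [list(ctx) for ctx in contexts],
        "ground_truth": list(ground_truths),
    }

records = build_eval_records(
    ["Is Lal your daughter?"],
    ["Yes, Lal is my daughter."],
    [["But Lal is my child."]],
    ["Yes, Lal is my daughter. I created Lal."],
)
print(records["question"][0], "->", records["answer"][0])
```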
Finally we can let ragas do its magic! We tell it which metrics we are interested in:

In [19]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

Then it's just a matter of calling evaluate! Well, we also need to force our OpenAI key into a system environment variable first, since that seems to be missing from their API at the moment.

In [20]:
import os
os.environ['OPENAI_API_KEY'] = api_key
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

Let's see the results! How does it compare to our "naive RAG" approach earlier?

In [21]:
results

{'faithfulness': 0.8000, 'answer_relevancy': 0.8137, 'context_recall': 0.5833, 'context_precision': 0.7778, 'answer_correctness': 0.5570}

In [22]:
results.to_pandas()

| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Is Lal your daughter? | [I wanted to give something back in return for... | Yes, Lal is my daughter. | Yes, Lal is my daughter. I created Lal. | 1.0 | 1.0 | 1.0 | 1.0 | 0.744672 |
| 1 | How many calculations per second can Lal compl... | [must strive to be better than we are. It does... | Lal can complete approximately sixty trillion ... | Lal is capable of completing sixty trillion ca... | 0.0 | 0.994501 | 0.0 | 0.0 | 0.997975 |
| 2 | Does Lal have emotions? | [Lal... The children were not laughing with yo... | Lal was designed to have emotions, but due to ... | Yes, unlike myself, Lal proved able to feel em... | 1.0 | 0.0 | 0.0 | 1.0 | 0.224653 |
| 3 | What goal did you have for Lal? | [must strive to be better than we are. It does... | My goal for Lal was to provide her with the op... | My goal for Lal was for her to enter Starfleet... | 1.0 | 0.988055 | 1.0 | 1.0 | 0.220253 |
| 4 | How was Lal's species and gender chosen? | [This is a big decision... Counselor... Lal ha... | Lal's species and gender were chosen by Data a... | Lal chose her own identity as a human female, ... | 0.8 | 0.967359 | 1.0 | 0.916667 | 0.551377 |
| 5 | What happened to Lal? | [I wanted to give something back in return for... | Lal suffered complete neural system failure an... | Lal experienced a cascade failure in her neura... | 1.0 | 0.932021 | 0.5 | 0.75 | 0.603015 |

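The overall scores printed earlier are consistent with being the plain average of the per-question scores in the table above. The sketch below checks that arithmetic using the numbers copied from the table; treating the aggregate as a simple mean is an assumption that these figures happen to support, not a statement about how ragas is implemented.

```python
from statistics import mean

# Per-question scores copied from the table above, in row order 0 to 5.
per_question = {
    "faithfulness": [1.0, 0.0, 1.0, 1.0, 0.8, 1.0],
    "answer_relevancy": [1.0, 0.994501, 0.0, 0.988055, 0.967359, 0.932021],
    "context_recall": [1.0, 0.0, 0.0, 1.0, 1.0, 0.5],
    "context_precision": [1.0, 0.0, 1.0, 1.0, 0.916667, 0.75],
    "answer_correctness": [0.744672, 0.997975, 0.224653, 0.220253, 0.551377, 0.603015],
}

# Average each metric over the six questions, rounded like the printed results.
aggregate = {name: round(mean(scores), 4) for name, scores in per_question.items()}

for name, value in aggregate.items():
    print(f"{name}: {value:.4f}")
```

With only six questions, a single zero moves an average a long way, so it is worth reading the per-question rows and not just the summary line.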

I encourage you to play with some of the "hyperparameters" described above, and see if you can tune this to produce even better results. Remember that metrics don't tell the whole story; your subjective judgement of the quality of the answers matters too. Maybe even more so.

To serve this advanced RAG system in the real world, you wouldn't run it from a notebook like this; a notebook is just useful for prototyping. You would convert this to a standalone Python script, and then use something like [LangServe](https://www.langchain.com/langserve) to wrap it with a service API.


