<a href="https://colab.research.google.com/github/OmarOneil/Data-Science/blob/main/Production_Ready_RAG_and_LangSmith.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Production Ready RAG and LangSmith

Today we'll take a peek at ways we can improve typical Retrieval Augmented Generation pipelines - and showcase how we can test our pipelines to provide directional signal!

In [None]:
!pip install -U -q langchain openai langsmith cohere tiktoken qdrant-client


We'll need to make sure we can run async within our Jupyter Notebook - so we'll do that in the next cell!

In [None]:
import nest_asyncio

nest_asyncio.apply()

### Document Loader

Now we can load our data source - OpenAI [blogs](https://openai.com/blog). For that, we'll use the `SitemapLoader`, which parses the OpenAI sitemap and keeps only the URLs that match the blog path!

In [None]:
from langchain.document_loaders.sitemap import SitemapLoader

loader = SitemapLoader(
    web_path = "https://openai.com/sitemap.xml",
    filter_urls=["https://openai.com/blog"]
)

In [None]:
docs = loader.load()

Fetching pages: 100%|##########| 113/113 [00:07<00:00, 14.98it/s]


In [None]:
len(docs)

113

We have 113 blog posts loaded up - and now we need to cut them down to a reasonable chunk size.

We'll use the rather naive RecursiveCharacterTextSplitter to achieve this goal today.

As we know our blogs are in a typical written format - paragraphs, sentences, headed sections - we can split preferentially by `\n\n`, `\n`, `' '`. These are the splitter's default separators, so we don't need to pass them explicitly.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1250,
    chunk_overlap  = 100,
    length_function = len,
    is_separator_regex = False,
)

Now we can split our blogs!

In [None]:
naive_split_docs = text_splitter.split_documents(docs)

In [None]:
len(naive_split_docs)

831

We've got a final number of 831 chunks!
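
To make the "split preferentially" idea concrete, here's a simplified, pure-Python sketch of recursive splitting. It is not LangChain's implementation: the helper name `recursive_split` is ours, chunk overlap is left out entirely, and the separator is dropped wherever a chunk boundary falls.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if len(part) > chunk_size:
            # This part is too big on its own: flush what we have,
            # then split the part with the next (finer) separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        # Greedily pack small parts together while they still fit.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short first paragraph survives intact, while the longer one is only broken on spaces because no coarser separator could bring it under the limit.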

### Embeddings

We'll be leveraging Cohere's [Embed v3](https://txt.cohere.com/introducing-embed-v3/) embeddings model.

As of the time of writing this notebook, it's one of the most performant closed-source embeddings models available!

We'll need to start by providing our Cohere API key!

In [None]:
import os
import getpass

os.environ['COHERE_API_KEY'] = getpass.getpass('Enter your Cohere API key: ')

Enter your Cohere API key: ··········


Now we can load our embeddings model - we'll be leveraging the "light" model to reduce cost.

In [None]:
from langchain.embeddings import CohereEmbeddings

embeddings = CohereEmbeddings(model="embed-english-light-v3.0")

### Vector Store & Retriever

For our vector store today we'll be using Qdrant!

To keep things consistent in the notebook, we'll be leveraging their cloud solution - which provides 1GB of free-tier access.

Qdrant is an open-source, self-hostable, performant vector database.

If you listen to their [marketing](https://qdrant.tech/benchmarks/?gad_source=1&gclid=CjwKCAiA1MCrBhAoEiwAC2d64Xro4dyNYPXWzmAkaqQMDEfzrjjLaMKHW0LhtMpJEvQTAETbws2RaBoC1aAQAvD_BwE) they are among the best of the best.

In reality, Qdrant can scale to extremely high volumes without performance suffering - and retains the option to self-host, which can be critical for businesses with data or privacy concerns.

Let's get started by loading our API key and our cluster URL.

In [None]:
qdrant_api_key = getpass.getpass("QDrant Cluster API Key: ")

QDrant Cluster API Key: ··········


In [None]:
qdrant_cluster_url = getpass.getpass("QDrant Cluster URL: ")

QDrant Cluster URL: ··········


Now we can instantiate our Qdrant cluster from LangChain!

In [None]:
from langchain.vectorstores import Qdrant

qdrant = Qdrant.from_documents(
    naive_split_docs,
    embeddings,
    url=qdrant_cluster_url,
    prefer_grpc=True,
    api_key=qdrant_api_key,
    collection_name="openai_blogs",
)

We'll set this as our base retriever - and set the number of retrieved documents (typically called `k`) to a high number for use with our reranker later on.

In [None]:
base_retriever = qdrant.as_retriever(search_kwargs={"k" : 20})

### Ensemble Retrieval

In order to augment our retrieval stack - we're going to leverage something called "ensemble retrieval".

The basic idea is as follows:

1. Retrieve a large number of documents from a dense vector retrieval.
2. Retrieve a large number of documents from a sparse vector search.
3. Use [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) to combine the results into a single ranked set.

We'll use our Qdrant vector-database to power our dense vector retrieval - and we'll use [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) as our sparse solution.

LangChain will take care of the rest.
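
To show what step 3 is doing, here's a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. The function name `reciprocal_rank_fusion` and the toy document IDs are ours, and we're assuming each retriever's weight simply scales its contribution; the constant `c=60` is the value suggested in the RRF paper linked above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores weight / (c + rank) for every list it appears in,
    where rank starts at 1 for the top result.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


dense_results = ["a", "b", "c"]   # toy ranking from the vector search
sparse_results = ["c", "a", "d"]  # toy ranking from the keyword search
fused = reciprocal_rank_fusion([dense_results, sparse_results], weights=[0.5, 0.5])
print(fused)
```

Documents that appear in both lists are merged into a single entry, so a fused list can be shorter than the two input lists combined.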

In [None]:
!pip install -qU rank_bm25

In [None]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(naive_split_docs)
bm25_retriever.k = 20

In [None]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, base_retriever], weights=[0.5, 0.5]
)

In [None]:
len(ensemble_retriever.invoke("How many parameters were in GPT, GPT-2, InstructGPT, and GPT-3 models?  What were other key differences?"))

37

### Re-ranking

Now that we have a large number of retrieved documents - we can use Cohere's [Rerank](https://txt.cohere.com/rerank/) service to provide us with a reranked list of the top 5 most relevant sources.

This idea of "casting a wide net" and then trimming down the results will help us improve our results fairly significantly.

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

compressor = CohereRerank(top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=ensemble_retriever
)

In [None]:
len(compression_retriever.invoke("How many parameters were in GPT, GPT-2, InstructGPT, and GPT-3 models?  What were other key differences?"))

5
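
The "wide net, then trim" pattern can be sketched without any external service. In the toy example below, `rerank` and `word_overlap` are hypothetical helpers of ours: the overlap score is only a stand-in for a real relevance model such as Cohere Rerank, which we are not trying to reproduce here.

```python
def word_overlap(query, doc):
    """Toy relevance score: number of lowercase words shared by query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def rerank(query, candidates, score_fn, top_n=5):
    """Score every candidate against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]


# A deliberately "wide net" of candidates, most of them irrelevant.
candidates = [
    "GPT-3 has 175 billion parameters",
    "DALL-E generates images",
    "GPT-2 has 1.5 billion parameters",
    "Whisper transcribes audio",
]
top_docs = rerank("how many parameters does GPT-3 have", candidates, word_overlap, top_n=2)
print(top_docs)
```

The first-stage retrievers only need to get the right documents somewhere into the candidate set; the reranker is what decides the final order and cut-off.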

### Creating our Chain

Now that we have our retrieval pipeline, we can integrate it into a chain - and leverage that to ask questions about our data!

First, we'll set up a chat template that is compatible with the RAG pattern:

We'll provide a user question, then we'll provide relevant context that will be used to answer the question!

In [None]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [None]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')

Enter your OpenAI API key: ··········


Now, we'll want to set up the "brains" of the operation - GPT-4 Turbo!

Once again, we'll use LangChain to make this easy!

In [None]:
from langchain.chat_models import ChatOpenAI

model = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)

Now we can set up our chain!

You'll notice we're using the LangChain Expression Language (LCEL) to do this - this is the preferred method of composing chains for production with LangChain.

More information is provided [here](https://python.langchain.com/docs/expression_language/)!

In [None]:
from operator import itemgetter
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

rerank_rag_chain = (
    # Both slots pull the question string out of the input dict, so the prompt receives plain text.
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | prompt
    | model
    | StrOutputParser()
)
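
If the first step of the chain looks a little magical, here's a plain-Python analogy of what happens to the input. This is only an illustration and not how LangChain is implemented: `fake_retriever`, `fake_prompt`, and `run_dict_step` are stand-ins we made up. The key idea is that every value in the leading dict is called with the same chain input.

```python
from operator import itemgetter


def fake_retriever(question):
    # Stand-in for the retriever: returns "documents" for a question string.
    return [f"(document about: {question})"]


def fake_prompt(values):
    # Stand-in for the prompt template: fills the two template slots.
    return (
        "Answer the question based only on the following context:\n"
        f"{values['context']}\n\n"
        f"Question: {values['question']}\n"
    )


def run_dict_step(steps, chain_input):
    # Every value in the dict receives the SAME chain input.
    return {key: step(chain_input) for key, step in steps.items()}


chain_input = {"question": "What is 'red teaming'?"}
get_question = itemgetter("question")

values = run_dict_step(
    {
        "context": lambda x: fake_retriever(get_question(x)),
        "question": get_question,
    },
    chain_input,
)
rendered = fake_prompt(values)
print(rendered)
```

In this sketch, filling the `question` slot with the unmodified input would put the whole `{'question': ...}` dict into the prompt instead of the bare string, which is why both slots extract the key first.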

Now we can invoke our chain - and see what kinds of outputs we get!

In [None]:
rerank_rag_chain.invoke({"question" : "What are Sam Altman's thoughts on the recent leadership transition?"})

'Based on the provided context, Sam Altman\'s thoughts on the recent leadership transition are positive and forward-looking. He expresses excitement about the future and gratitude for the team\'s hard work during an unclear and unprecedented situation. He mentions his belief in the resilience and spirit of the team, which he feels sets them apart. Altman is looking forward to continuing the work on building beneficial artificial general intelligence (AGI) with what he refers to as "the best team in the world, best mission in the world." His message conveys a sense of optimism and commitment to the mission of OpenAI.'

In [None]:
!pip install -U -q arxiv

# LangSmith

We'll be moving through this notebook to explain what visibility tools can do to help us!

In [None]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"LangSmith Introduction - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')

Enter your LangSmith API key: ··········


In [None]:
from langsmith import Client

client = Client()

In [None]:
rerank_rag_chain.invoke({"question" : "what are the ethical and alignment considerations that I should keep in mind when training and fine-tuning my own LLM?"})

"When training and fine-tuning your own Large Language Model (LLM), you should consider the following ethical and alignment considerations:\n\n1. Prohibit misuse: Establish usage guidelines and terms of use that prevent material harm to individuals, communities, and society. This includes prohibiting the use of LLMs for spam, fraud, astroturfing, or any high-risk use-cases that are not appropriate, such as classifying people based on protected characteristics.\n\n2. Enforce usage guidelines: Build systems and infrastructure to enforce the guidelines you set. This could involve rate limits, content filtering, application approval processes, monitoring for anomalous activity, and other mitigations.\n\n3. Mitigate unintentional harm: Take proactive steps to mitigate harmful model behavior. This includes comprehensive model evaluation to understand limitations, minimizing potential sources of bias in training data, and employing techniques to minimize unsafe behavior, such as learning from

In [None]:
rerank_rag_chain.invoke({"question" : "what are most important recent advancements related to building production LLM applications?"})

'Based on the provided context, the most important recent advancements related to building production Large Language Model (LLM) applications are not explicitly listed. However, the documents do discuss best practices for deploying LLMs, which can be seen as advancements in the responsible and safe use of these models in production environments. These best practices include:\n\n1. Prohibiting misuse by publishing usage guidelines and terms of use that prevent material harm through actions like spam, fraud, or astroturfing.\n2. Building systems and infrastructure to enforce usage guidelines, such as rate limits, content filtering, and monitoring for anomalous activity.\n3. Mitigating unintentional harm by conducting comprehensive model evaluations, minimizing bias in training data, and learning from human feedback.\n4. Documenting known weaknesses and vulnerabilities of the models to inform users and developers.\n5. Collaborating with diverse stakeholders to address potential biases and

In [None]:
rerank_rag_chain.invoke({"question" : "How many parameters were in GPT, GPT-2, InstructGPT, and GPT-3 models?  What were other key differences?"})

"Based on the provided context, Sam Altman's thoughts on the recent leadership transition are positive and forward-looking. He expresses excitement about the future and gratitude for the team's hard work during an unclear and unprecedented situation. He mentions his belief in the resilience and spirit of the team, setting them apart, and looks forward to working closely with the new initial board and the OpenAI community to continue building beneficial artificial general intelligence (AGI). He signs off with a message of love, indicating a personal and emotional investment in the company and its mission."

Let's build a number of input/output pairs that we can leverage later!

In [None]:
import asyncio

inputs = [
    "What are Sam's thoughts on Ilya?",
    "What are some frontier risks?",
    "What are Custom GPTs?",
    "Can I use DALL-E 3 with ChatGPT Plus?",
    "What is 'red teaming'?",
    "How can AI be leveraged to do better teaching?"
]

results = []

async def arun(chain, input_example):
    try:
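        # ainvoke is the awaitable counterpart of invoke; awaiting invoke's string result would raise a TypeError.
        return await chain.ainvoke({"question" : input_example})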
    except Exception as e:
        return e

for input_example in inputs:
    results.append(arun(rerank_rag_chain, input_example))

results = await asyncio.gather(*results)
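
The same fan-out pattern can be tried without any API keys. In this sketch `fake_ainvoke` is a made-up stand-in for the chain's async call, and we use `return_exceptions=True` so that one failing question doesn't cancel the rest. Inside the notebook you could `await run_all(...)` directly; `asyncio.run` is used here so the snippet also works as a plain script.

```python
import asyncio


async def fake_ainvoke(question):
    # Stand-in for an async chain call: pretend to do some I/O, then answer.
    await asyncio.sleep(0.01)
    if not question:
        raise ValueError("empty question")
    return f"answer to: {question}"


async def run_all(questions):
    # Schedule every call concurrently and collect results in input order.
    # Exceptions are returned as values instead of aborting the whole batch.
    tasks = [fake_ainvoke(question) for question in questions]
    return await asyncio.gather(*tasks, return_exceptions=True)


demo_results = asyncio.run(run_all(["What is 'red teaming'?", ""]))
for item in demo_results:
    print(repr(item))
```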

Now that we've run through all of those chains - we can leverage LangSmith to create a dataset that we can use to benchmark other application solutions!

In [None]:
from langchain.callbacks.tracers.langchain import wait_for_all_tracers

wait_for_all_tracers()

### Evaluating with LangSmith

The first thing we'll need to do is collect our responses into a dataset that we can use to benchmark other solutions against!

In [None]:
dataset_name = f"openai-rag-{unique_id}"

dataset = client.create_dataset(
    dataset_name, description="A dataset for benchmarking a RAG system using the OpenAI Blogs as Source Material"
)

runs = client.list_runs(
    project_name=os.environ["LANGCHAIN_PROJECT"],
    execution_order=1,  # Only return the top-level runs
    error=False,  # Only runs that succeed
)
for run in runs:
    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)

Now that we have our dataset set up in LangSmith - let's create another system that we can benchmark against our original!

Since it's possible to build a chain or agent that has memory (which could influence results and might not provide accurate benchmarking) - we'll use a `chain_factory` to create a fresh chain for each test case.

In [None]:
def chain_factory():
    rerank_rag_chain = (
        {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
        | prompt
        | model
        | StrOutputParser()
    )
    return rerank_rag_chain

Now we can use `langchain.evaluation.EvaluatorType` and `langchain.smith.RunEvalConfig` to build a pipeline for our evaluation.

More information about these metrics is found [here](https://docs.smith.langchain.com/evaluation/evaluator-implementations).

Let's set it up with the following evaluators:

- `EvaluatorType.QA` - measures how "correct" your response is, based on a reference answer (we built these in the first part of the notebook)
- `EvaluatorType.EMBEDDING_DISTANCE` - measures closeness between the two responses
- `RunEvalConfig.LabeledCriteria` - measures the output against the given criteria, using the reference answer as a label
- `RunEvalConfig.Criteria({"YOUR CUSTOM CRITERIA": "DESCRIPTION OF YOUR CRITERIA IN NATURAL LANGUAGE"})` - measures the output against a criterion you describe yourself
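
As a rough intuition for the "closeness" metric, here's a small sketch of cosine distance between two embedding vectors. We're assuming cosine distance is the metric in play (we believe it is the default for this evaluator, but check the linked docs), and the vectors below are toy values rather than real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [1.0, 0.0])       # identical direction
unrelated = cosine_distance([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
partial = cosine_distance([1.0, 1.0], [1.0, 0.0])    # somewhere in between
print(same, unrelated, round(partial, 4))
```

Lower is better here: a prediction whose embedding points the same way as the reference answer scores close to 0.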



We'll also build our own custom evaluator as a demonstration of how to implement such an evaluator!

In [None]:
!pip install -U -q tiktoken

In our own custom evaluator we need to make sure of a couple of things:

1. We provide a system by which we can measure or provide a measure of closeness/some numeric metric.
2. We provide logic for implementing our score and parsing the relevant outputs.

In [None]:
import re
from typing import Any, Optional

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import StringEvaluator


class DopenessEvaluator(StringEvaluator):
    """An LLM-based dopeness evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model="gpt-4", temperature=0)

        template = """On a scale from 0 to 100, how dope is the following response to the input:
        --------
        INPUT: {input}
        --------
        OUTPUT: {prediction}
        --------
        Reason step by step about why the score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = LLMChain.from_string(llm=llm, template=template)

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "dopeness_score"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain(
            dict(input=input, prediction=prediction), **kwargs
        )
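        # The prompt asks for the score alone on the final line, so split from the right.
        reasoning, _, last_line = evaluator_result["text"].strip().rpartition("\n")
        match = re.search(r"\d+", last_line)
        # Normalise to the 0-1 range; leave the score unset if no number could be parsed.
        score = float(match.group(0)) / 100.0 if match else None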
        return {"score": score, "dopeness": reasoning.strip()}
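
Parsing the model's free-text reply is the fragile part of an evaluator like this, so it's worth testing on its own. Here's a minimal standalone sketch; `parse_scored_response` is a hypothetical helper of ours, and it assumes the reply ends with the score on its own final line, as the prompt requests.

```python
import re


def parse_scored_response(text):
    """Split an LLM reply into (reasoning, score in the 0-1 range).

    Only the final line is searched for the score, so numbers mentioned
    in the reasoning are ignored. Returns None as the score if the final
    line contains no number.
    """
    reasoning, _, last_line = text.strip().rpartition("\n")
    match = re.search(r"\d+(?:\.\d+)?", last_line)
    if match is None:
        return text.strip(), None
    return reasoning.strip(), float(match.group(0)) / 100.0


reply = "The answer cites 3 sources and is clear.\nScore: 85\n85"
reasoning, score = parse_scored_response(reply)
print(score)

# A reply with no number at all should not crash the evaluator.
fallback_reasoning, fallback_score = parse_scored_response("no digits here")
print(fallback_score)
```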

Now we can set our `RunEvalConfig` up!

Notice how we can mix the built-in evaluators with custom evaluators that operate only on strings.

In [None]:
from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig

evaluation_config = RunEvalConfig(
    evaluators = [
        EvaluatorType.QA,
        EvaluatorType.EMBEDDING_DISTANCE,
        RunEvalConfig.LabeledCriteria("relevance"),
        RunEvalConfig.Criteria({
            "fully_answered" : "Does this response fully answer the question?"
        })
    ],
    custom_evaluators = [
        DopenessEvaluator()
    ]
)

In [None]:
from langchain.smith import (
    arun_on_dataset,
)

tag_name = "Rerank-EnsembleRetrieval"
tag = "OpenAI Blog RAG - " + tag_name

chain_results = await arun_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_factory,
    evaluation=evaluation_config,
    verbose=True,
    tags=[tag],
)

View the evaluation results for project 'cooked-insurance-16' at:
https://smith.langchain.com/o/69867b2b-1696-431d-a878-81df6e9d559b/datasets/da76ed52-268a-4504-9f84-a5aa4030c45e/compare?selectedSessions=ad1f7674-9eac-4ac3-a4d2-ccde1717c753

View all tests for Dataset openai-rag-f05a6fe8 at:
https://smith.langchain.com/o/69867b2b-1696-431d-a878-81df6e9d559b/datasets/da76ed52-268a-4504-9f84-a5aa4030c45e
[------------------------------------------------->] 9/9
 Eval quantiles:
                                                   output  \
count                                                   9   
unique                                                  9   
top     When training and fine-tuning your own Large L...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                             

In [None]:
simple_retriever = qdrant.as_retriever(search_kwargs={"k" : 5})

In [None]:
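def naive_chain_factory():
    # Same prompt and model as before, but retrieving straight from Qdrant with no ensemble or reranking.
    naive_rag_chain = (
        {"context": itemgetter("question") | simple_retriever, "question": itemgetter("question")}
        | prompt
        | model
        | StrOutputParser()
    )
    return naive_rag_chain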

In [None]:
tag_name = "SimpleRetriever"
tag = "OpenAI Blog RAG - " + tag_name

chain_results = await arun_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=naive_chain_factory,
    evaluation=evaluation_config,
    verbose=True,
    tags=[tag],
)

View the evaluation results for project 'best-time-82' at:
https://smith.langchain.com/o/69867b2b-1696-431d-a878-81df6e9d559b/datasets/da76ed52-268a-4504-9f84-a5aa4030c45e/compare?selectedSessions=1eef14c6-7ce9-4b36-a8b3-ea6741e39233

View all tests for Dataset openai-rag-f05a6fe8 at:
https://smith.langchain.com/o/69867b2b-1696-431d-a878-81df6e9d559b/datasets/da76ed52-268a-4504-9f84-a5aa4030c45e
[------------------------------------------------->] 9/9
 Eval quantiles:
                                                   output  \
count                                                   9   
unique                                                  9   
top     When training and fine-tuning your own Large L...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                    