# LangSmith

We'll walk through this notebook to show what visibility tools like LangSmith can do to help us!

In [2]:
!pip install -U -q langchain openai langsmith

## Basic Application

We'll be leveraging the same application as we did in the Weights and Biases Notebook to showcase what can be done with LangSmith:

A simple Arxiv Agent!

Let's start with grabbing a few pieces of information and our additional dependencies:

In [4]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')

In [5]:
!pip install -U -q arxiv

Now we can set up our LangChain environment variables:

In [7]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"LLM Ops Walk Through - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')
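
Tracing depends on all four of these variables being set, so a quick sanity check before building the application can save some confusion later. The sketch below is a hypothetical helper (`missing_langsmith_vars` is not part of LangChain or LangSmith); it only reports which of the variable names used above are unset or empty.

```python
import os

# Variable names taken from the cell above; the helper itself is hypothetical.
REQUIRED_VARS = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
)


def missing_langsmith_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# An empty list means every variable is present.
print(missing_langsmith_vars())
```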

In [8]:
from langsmith import Client

client = Client()

Now we can set up our simple application!

In [17]:
from langchain.chat_models import ChatOpenAI
from langchain.agents import load_tools, initialize_agent, AgentType

llm = ChatOpenAI(model_name="gpt-4", temperature=0)

tools = load_tools(
    ["arxiv"]
)

agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)

In [16]:
agent_chain("What is QLoRA?")



> Entering new AgentExecutor chain...
QLoRA might be a term or concept from a scientific field. I should search for it on arxiv to find relevant articles.
Action: arxiv
Action Input: QLoRA
Observation: Published: 2023-05-23
Title: QLoRA: Efficient Finetuning of Quantized LLMs
Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
Summary: We present QLoRA, an efficient finetuning approach that reduces memory usage
enough to finetune a 65B parameter model on a single 48GB GPU while preserving
full 16-bit finetuning task performance. QLoRA backpropagates gradients through
a frozen, 4-bit quantized pretrained language model into Low Rank
Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all
previous openly released models on the Vicuna benchmark, reaching 99.3% of the
performance level of ChatGPT while only requiring 24 hours of finetuning on a
single GPU. QLoRA introduces a number of innovations to save mem

{'input': 'What is QLoRA?',
 'output': 'QLoRA is an efficient finetuning approach for large language models that reduces memory usage while preserving performance. It introduces several innovations including a new data type for normally distributed weights, double quantization, and paged optimizers. It has been used to finetune a wide range of models, providing detailed analysis of their performance.'}

Let's build a number of input/output pairs that we can leverage later!

In [18]:
import asyncio

inputs = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

results = []

async def arun(agent, input_example):
    try:
        return await agent.arun(input_example)
    except Exception as e:
        return e

for input_example in inputs:
    results.append(arun(agent_chain, input_example))

results = await asyncio.gather(*results)
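
Because the `arun` helper returns the exception instead of raising it, `results` can contain a mix of answer strings and `Exception` objects. A minimal sketch for separating the two is shown here; `split_results` and the demo data are hypothetical and only illustrate the shape of the output.

```python
def split_results(inputs, results):
    """Pair each input with its result and separate successes from failures."""
    successes, failures = [], []
    for question, result in zip(inputs, results):
        if isinstance(result, Exception):
            failures.append((question, result))
        else:
            successes.append((question, result))
    return successes, failures


# Hypothetical stand-in data, not real agent output.
demo_inputs = ["q1", "q2"]
demo_results = ["an answer", ValueError("parsing failed")]

ok, failed = split_results(demo_inputs, demo_results)
for question, error in failed:
    print(f"{question!r} failed with {type(error).__name__}: {error}")
```

In the notebook you would call `split_results(inputs, results)` with the real lists from the cell above.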

Now that we've run through all of those chains - we can leverage LangSmith to create a dataset that we can use to benchmark other application solutions!

In [19]:
from langchain.callbacks.tracers.langchain import wait_for_all_tracers

wait_for_all_tracers()

### Evaluating with LangSmith

The first thing we'll need to do is collect our responses into a dataset that we can use to benchmark other solutions against!

In [20]:
dataset_name = f"arxiv-rag-gpt-4-{unique_id}"

dataset = client.create_dataset(
    dataset_name, description="A dataset for benchmarking a RAG system using the Arxiv tool"
)

runs = client.list_runs(
    project_name=os.environ["LANGCHAIN_PROJECT"],
    execution_order=1,  # Only return the top-level runs
    error=False,  # Only runs that succeed
)
for run in runs:
    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)

Now that we have our dataset set up in LangSmith - let's create another system that we can benchmark against our original!

Since an agent can carry memory between runs (which could let one test case influence the next and skew the benchmark), we'll use an `agent_factory` to create a fresh agent for each test case.

In [22]:
from langchain.chat_models import ChatOpenAI
from langchain.agents import load_tools, initialize_agent, AgentType

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

tools = load_tools(
    ["arxiv"]
)

def agent_factory():
    agent_chain = initialize_agent(
        tools,
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True,
        handle_parsing_errors=True,
    )
    return agent_chain
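
To see why a factory matters, consider a toy example. `StatefulAgent` is a hypothetical stand-in for an agent with memory, not a LangChain class; the point is only that a shared instance accumulates state across test cases while a factory hands each test case a clean one.

```python
class StatefulAgent:
    """Hypothetical stand-in for an agent that keeps conversation memory."""

    def __init__(self):
        self.history = []

    def run(self, question):
        # Record the question and report how many the agent has seen so far.
        self.history.append(question)
        return len(self.history)


def make_agent():
    """Factory: build a brand-new agent with empty memory."""
    return StatefulAgent()


questions = ["a", "b", "c"]

# One shared agent: every later question sees the earlier ones.
shared = StatefulAgent()
shared_counts = [shared.run(q) for q in questions]

# A fresh agent per question: no state leaks between test cases.
fresh_counts = [make_agent().run(q) for q in questions]

print(shared_counts, fresh_counts)
```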

Now we can use the `langchain.evaluation.EvaluatorType` enum and the `langchain.smith.RunEvalConfig` class to build a pipeline for our evaluation.

More information about these evaluators can be found [here](https://docs.smith.langchain.com/evaluation/evaluator-implementations).

Let's set it up with the following evaluators:

- `EvaluatorType.QA` - measures how "correct" your response is, based on a reference answer (we built these in the first part of the notebook)
- `EvaluatorType.EMBEDDING_DISTANCE` - measures how close the response is to the reference answer in embedding space
- `RunEvalConfig.LabeledCriteria` - measures the output against the given criteria
- `RunEvalConfig.Criteria({"YOUR CUSTOM CRITERIA": "DESCRIPTION OF YOUR CRITERIA IN NATURAL LANGUAGE"})` - measures the output against a custom criterion that you describe in natural language
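
For intuition about what an embedding-distance score represents, here is a minimal sketch of cosine distance in plain Python. It assumes cosine distance is the metric in use (the evaluator's actual metric depends on its configuration), and the vectors are toy values rather than real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings" for illustration only.
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

A lower distance means the response and the reference are closer in meaning, at least as far as the embedding model can tell.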



We'll also build our own custom evaluator as a demonstration of how to implement such an evaluator!

In [26]:
!pip install -U -q tiktoken

In our own custom evaluator we need to make sure of a couple things:

1. We provide a system by which we can measure or provide a measure of closeness/some numeric metric.
2. We provide logic for implementing our score and parsing the relevant outputs.
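
The second point is the easiest to get wrong, since LLM output is free-form text. Here is a minimal standalone sketch of one way to parse a score, assuming the model was asked to put the 0-100 score alone on the final line. `parse_score` is a hypothetical helper name, not a LangChain function.

```python
import re
from typing import Optional, Tuple


def parse_score(text: str) -> Tuple[str, Optional[float]]:
    """Split LLM output into (reasoning, score normalized to the 0-1 range)."""
    # Everything before the final newline is reasoning; the last line holds the score.
    reasoning, _, last_line = text.strip().rpartition("\n")
    match = re.search(r"\d+(?:\.\d+)?", last_line)
    # Return None rather than raising when no number is found.
    score = float(match.group(0)) / 100.0 if match else None
    return reasoning.strip(), score


reasoning, score = parse_score("The answer is clear and complete.\nScore: 85\n85")
print(score)
```

Returning `None` for an unparseable response keeps one malformed output from crashing the whole evaluation run.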

In [29]:
import re
from typing import Any, Optional

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import StringEvaluator


class DopenessEvaluator(StringEvaluator):
    """An LLM-based dopeness evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model="gpt-4", temperature=0)

        template = """On a scale from 0 to 100, how dope is the following response to the input:
        --------
        INPUT: {input}
        --------
        OUTPUT: {prediction}
        --------
        Reason step by step about why the score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = LLMChain.from_string(llm=llm, template=template)

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "dopeness_score"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain(
            dict(input=input, prediction=prediction), **kwargs
        )
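        text = evaluator_result["text"].strip()
        # The prompt asks for the score alone on the final line, so split from the right.
        reasoning, _, last_line = text.rpartition("\n")
        match = re.search(r"\d+", last_line)
        # Normalize the 0-100 score to [0, 1]; use None if no number was found.
        score = float(match.group(0)) / 100.0 if match else None
        return {"score": score, "dopeness": reasoning.strip()}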

Now we can set our `RunEvalConfig` up!

Notice how we can create custom evaluation criteria using nothing but strings, like `fully_answered` in the cell that follows.

In [30]:
from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig

evaluation_config = RunEvalConfig(
    evaluators = [
        EvaluatorType.QA,
        EvaluatorType.EMBEDDING_DISTANCE,
        RunEvalConfig.LabeledCriteria("relevance"),
        RunEvalConfig.Criteria({
            "fully_answered" : "Does this response fully answer the question?"
        })
    ],
    custom_evaluators = [
        DopenessEvaluator()
    ]
)

In [None]:
from langchain.smith import (
    arun_on_dataset,
)

tag_name = "<< YOUR NAME HERE >>"
tag = "GPT-4 vs. GPT-3.5-turbo on Arxiv RAG " + tag_name

chain_results = await arun_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=agent_factory,
    evaluation=evaluation_config,
    verbose=True,
    tags=[tag],
)