# Testing JSON Responses

This notebook shows how to benchmark chains that are expected to respond in JSON. We will:

1. Create a dataset of test examples
2. Upload that dataset to LangSmith
3. Create multiple chains
4. Define some evaluation criteria
5. Run some tests!

In [None]:
# %env LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
# %env LANGCHAIN_API_KEY="<your-api-key>"

In [None]:
%load_ext autoreload
%autoreload 2

In [4]:
import logging
from langchain.evaluation import load_evaluator
logging.basicConfig(level=logging.INFO)

In [5]:
c = load_evaluator("labeled_criteria")

In [6]:
c.evaluate_strings(prediction="How can i help?", reference="Come se si può", input="Hi")

{'reasoning': 'The criteria for evaluation is the helpfulness of the submission. The input is a simple greeting "Hi" and the submission is a response asking how they can help. This is a helpful and appropriate response to the input as it opens up for further communication and offers assistance. Therefore, the submission meets the criteria. The reference does not seem to be relevant in this context as it is in a different language and does not provide any additional context or information.',
 'value': None,
 'score': None}

In [None]:
from langsmith import Client

client = Client()
dataset_name = "Structured JSON Dataset"

# # Storing inputs in a dataset lets us
# # run chains and LLMs over a shared set of examples.
# dataset = client.create_dataset(
#     dataset_name=dataset_name, description="Extracting structured JSON",
# )

In [None]:
import pandas as pd

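# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0 (requires pandas >= 1.3).
df = pd.read_csv("https://raw.githubusercontent.com/dair-iitd/CaRB/master/data/gold/dev.tsv", sep="\t", on_bad_lines="skip")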

## Create a dataset of test examples

Let's create a dataset of examples. Suppose we want to extract structured information from unstructured input and return it as JSON. Here, the goal is to extract a person's name and age.

In [None]:
import json

examples = [
    # Standard example
    ("Julie is 13", json.dumps({"name": "Julie", "age": 13})),
    # Example with name in lower case
    ("ben is 9", json.dumps({"name": "Ben", "age": 9})),
    # Example with age spelled out
    ("Sam is thirty four", json.dumps({"name": "Sam", "age": 34})),
    # Examples without ground truth
    ("Bob is 17", ),
    ("Molly is 2", ),
]

## Upload dataset to LangSmith

In [None]:
# for example in examples:
#     # Each example must be unique and have inputs defined.
#     # Outputs are optional
#     if len(example) == 1:
#         client.create_example(
#             inputs={"input": example[0]},
#             outputs=None,
#             dataset_id=dataset.id,
#         )
#     elif len(example) == 2:
#         client.create_example(
#             inputs={"input": example[0]},
#             outputs={"output": example[1]},
#             dataset_id=dataset.id,
#         )
#     else:
#         raise ValueError
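
The branching in the upload loop above can be separated from the API call, which makes it easier to check before anything is sent to LangSmith. The following is a minimal sketch; `to_payload` is a hypothetical helper introduced here for illustration, not part of the LangSmith client.

```python
import json
from typing import Any, Dict, Optional, Tuple


def to_payload(example: Tuple[str, ...]) -> Tuple[Dict[str, Any], Optional[Dict[str, Any]]]:
    """Split an example tuple into (inputs, outputs). Hypothetical helper."""
    if len(example) == 1:
        # No ground truth: outputs stay empty.
        return {"input": example[0]}, None
    if len(example) == 2:
        return {"input": example[0]}, {"output": example[1]}
    raise ValueError(f"Expected 1 or 2 fields, got {len(example)}")


examples = [
    ("Julie is 13", json.dumps({"name": "Julie", "age": 13})),
    ("Bob is 17",),
]
payloads = [to_payload(e) for e in examples]
```

Each `(inputs, outputs)` pair could then be passed to `client.create_example(inputs=..., outputs=..., dataset_id=dataset.id)`, as in the commented-out cell above.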

## Create Multiple Chains

For now, let's compare OpenAI and Anthropic.

In [None]:
from langchain.chat_models import ChatAnthropic, ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema import SystemMessage
from langchain.schema.output_parser import StrOutputParser
from langchain.chains import LLMChain

instructions = """Convert any user messages into valid json. You should only respond with 

```json
...
```

Do NOT include any words before or after.

For each user input, you should extract the name and age of the person in question. \
You should use the keys `name` and `age` to store that information. \
Name should always be a properly capitalized name, and age should always be an integer.

For example, the input `Jim is 10` would get a response of:

```json
{"name": "Jim", "age": 10}
```"""

prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content=instructions),
    ("human", "{input}")
])

In [None]:
def create_openai():
    return LLMChain(
        prompt=prompt, 
        llm=ChatOpenAI(temperature=0, model="gpt-4"),
        output_parser=StrOutputParser()
    )

def create_anthropic():
    return LLMChain(
        prompt=prompt,
        llm=ChatAnthropic(temperature=0, model="claude-2"),
        output_parser=StrOutputParser()
    )

## Define Custom Evaluation Criteria

We can now define some custom evaluation criteria. Let's define a few!

1. Whether, after some parsing, the output is exactly the same as expected
2. Whether any words were returned before ```json
3. Whether the JSON that was returned is valid
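
To make the three criteria concrete, here is a minimal plain-Python sketch of how they could be checked on a single response. `check_response` is a hypothetical helper written for illustration; it is not the implementation behind LangChain's built-in evaluators, and it assumes the response wraps its payload in a `json` fenced block as the prompt requests.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # the three-backtick fence marker
# Captures the body of the first json-fenced block in a response.
FENCE_RE = re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE, re.DOTALL)


def check_response(prediction: str, reference: Optional[str] = None) -> dict:
    """Score one response against the three criteria. Hypothetical helper."""
    match = FENCE_RE.search(prediction)
    # Criterion 2: non-whitespace text before the opening fence.
    # Reported as False when no fence is found at all.
    words_before = bool(match and prediction[: match.start()].strip())
    # Fall back to the raw text when the response has no fence.
    payload = match.group(1) if match else prediction.strip()
    # Criterion 3: does the payload parse as JSON?
    try:
        parsed = json.loads(payload)
        valid = True
    except json.JSONDecodeError:
        parsed, valid = None, False
    # Criterion 1: compare parsed values, so key order and spacing are ignored.
    equal = None
    if reference is not None:
        equal = valid and parsed == json.loads(reference)
    return {
        "words_before_fence": words_before,
        "valid_json": valid,
        "equals_reference": equal,
    }
```

Comparing parsed values rather than raw strings is a design choice: two responses with the same keys in a different order count as equal, which matches the "after some parsing" wording of the first criterion.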

## Run evaluation

Now we can run evaluation!

In [None]:
from langchain.smith import RunEvalConfig, run_on_dataset

# RunEvalConfig takes the list of built-in evaluators via `evaluators`.
evaluation_config = RunEvalConfig(
    evaluators=["json_validity", "json_equality"],
)
run_on_dataset(
    client,
    "Structured JSON Dataset",
    create_anthropic,
    evaluation=evaluation_config,
    concurrency_level=2
)