# Lab | Summarization evaluation using LangSmith
Let's revisit your capstone project 2? Well, sort of. Pick different sets of data and re-run this notebook, for example parts of the dataset you used in your last project week. The point is for you to understand all the steps involved and the many different ways one can, and should, evaluate LLM applications using LangSmith.

What did you learn? - Let's discuss that in class

## LangSmith - LangChain evaluation

In [2]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

In [3]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"]="langsmith_max-test"

In [4]:
#Importing Client from Langsmith
from langsmith import Client
client = Client(api_key=LANGCHAIN_API_KEY)

### Create Dataset


In [10]:
from datasets import load_dataset

dataset = load_dataset("multi_news", split="test[:100]", trust_remote_code=True)

Generating train split: 100%|██████████| 44972/44972 [00:04<00:00, 10062.37 examples/s]
Generating validation split: 100%|██████████| 5622/5622 [00:00<00:00, 10043.37 examples/s]
Generating test split: 100%|██████████| 5622/5622 [00:00<00:00, 9604.17 examples/s] 


In [13]:
def add_prefix(example):
    return {
        **example,
        "article": f"Summarize this news:\n{example['document']}"  # 'document' is the key in multi_news
    }

dataset = dataset.map(add_prefix)

multi_news = dataset


Map: 100%|██████████| 100/100 [00:00<00:00, 2864.90 examples/s]


In [14]:
multi_news

Dataset({
    features: ['document', 'summary', 'article'],
    num_rows: 100
})

In [16]:
multi_news[0]


{'document': 'GOP Eyes Gains As Voters In 11 States Pick Governors \n \n Enlarge this image toggle caption Jim Cole/AP Jim Cole/AP \n \n Voters in 11 states will pick their governors tonight, and Republicans appear on track to increase their numbers by at least one, with the potential to extend their hold to more than two-thirds of the nation\'s top state offices. \n \n Eight of the gubernatorial seats up for grabs are now held by Democrats; three are in Republican hands. Republicans currently hold 29 governorships, Democrats have 20, and Rhode Island\'s Gov. Lincoln Chafee is an Independent. \n \n Polls and race analysts suggest that only three of tonight\'s contests are considered competitive, all in states where incumbent Democratic governors aren\'t running again: Montana, New Hampshire and Washington. \n \n While those state races remain too close to call, Republicans are expected to wrest the North Carolina governorship from Democratic control, and to easily win GOP-held seats in

In [18]:
MAX_NEWS = 10

sample_news = multi_news.select(range(MAX_NEWS)).map(add_prefix)

sample_news


Map: 100%|██████████| 10/10 [00:00<00:00, 360.37 examples/s]


Dataset({
    features: ['document', 'summary', 'article'],
    num_rows: 10
})

The dataset contains three columns: document, summary, and article. To use LangSmith, we need to create a dataset in LangSmith format.

LangSmith expects a prompt and a result. To achieve this, we transform each document into a prompt by adding the prefix "Summarize this news:" and store it in the article column. As the result, we use the content of summary, which holds the summaries written by humans.

In [19]:
print(sample_news[0])

{'document': 'GOP Eyes Gains As Voters In 11 States Pick Governors \n \n Enlarge this image toggle caption Jim Cole/AP Jim Cole/AP \n \n Voters in 11 states will pick their governors tonight, and Republicans appear on track to increase their numbers by at least one, with the potential to extend their hold to more than two-thirds of the nation\'s top state offices. \n \n Eight of the gubernatorial seats up for grabs are now held by Democrats; three are in Republican hands. Republicans currently hold 29 governorships, Democrats have 20, and Rhode Island\'s Gov. Lincoln Chafee is an Independent. \n \n Polls and race analysts suggest that only three of tonight\'s contests are considered competitive, all in states where incumbent Democratic governors aren\'t running again: Montana, New Hampshire and Washington. \n \n While those state races remain too close to call, Republicans are expected to wrest the North Carolina governorship from Democratic control, and to easily win GOP-held seats in

Now that we have the dataset with the prompt and the reference summary, it is time to create a dataset in LangSmith with this information.
### Create the Dataset in Langsmith

The dataset in LangSmith is composed of an input, which is the prompt passed to the model for evaluation, and an output, which should contain what we expect the model to return.
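
To make the input/output structure concrete, here is a minimal sketch of how each row splits into an input dictionary and an output dictionary. The helper `to_langsmith_examples` and the sample row are hypothetical and used only for illustration; the notebook itself relies on `client.upload_dataframe` for this step.

```python
def to_langsmith_examples(rows, input_keys, output_keys):
    """Split each row into an input dict (the prompt) and an output dict (the reference)."""
    inputs = [{key: row[key] for key in input_keys} for row in rows]
    outputs = [{key: row[key] for key in output_keys} for row in rows]
    return inputs, outputs


# Hypothetical row with the same columns as sample_news.
rows = [
    {
        "document": "Some news text.",
        "summary": "Short summary.",
        "article": "Summarize this news:\nSome news text.",
    }
]

inputs, outputs = to_langsmith_examples(rows, ["article"], ["summary"])
print(inputs)   # one dict per example, holding only the prompt
print(outputs)  # one dict per example, holding only the reference summary
```

Columns that are neither input nor output keys (here, `document`) are simply left out of the examples.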

In [20]:
import datetime

In [25]:
import uuid
input_key = ["article"]        # your prompt input
output_key = ["summary"]       # expected summary output

NAME_DATASET=f"Summarize_dataset_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

In [26]:
#This creates the dataset in LangSmith with the content in sample_news - If you run this more than once with the same name you will get POST errors
dataset = client.upload_dataframe(
    df=sample_news,                         # This must have 'article' and 'summary' columns
    input_keys=input_key,
    output_keys=output_key,
    name=NAME_DATASET,
    description="Test Embedding distance between model summarizations",
    data_type="kv"                          # key-value type dataset
)


Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 128.38ba/s]


In this image, we can see an example from the dataset once it's been registered in LangSmith.

In the Input column, there is the prompt to be sent, while in the Output column, the expected output is stored.

When performing the comparison, the model will be given the prompt, and the cosine distance between the embedding of its response and the embedding of the reference stored in the sample dataset will be calculated.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Dataset.jpg?raw=true">
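
As a rough illustration of the cosine distance mentioned above, here is a minimal sketch in plain Python. It assumes the two texts have already been turned into embedding vectors; the toy vectors are made up, and the function `cosine_distance` is written for this example rather than taken from LangSmith.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (not real model output).
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

A distance close to 0 means the two embeddings point in nearly the same direction, so a lower score indicates a summary that is closer to the reference.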

### Recovering Models From Hugging Face
Let's retrieve both models from Hugging Face: a base T5 model and a T5 model that has been fine-tuned to generate summaries (as its repo name indicates, on the CNN/DailyMail dataset rather than on multi_news).

In [27]:
from langchain import HuggingFaceHub

In [28]:
summarizer_base = HuggingFaceHub(
    repo_id="t5-base",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)



In [29]:
summarizer_finetuned = HuggingFaceHub(
    repo_id="flax-community/t5-base-cnn-dm",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)

## Defining Evaluator
The first step is to define an evaluator, where we specify the variables we want to evaluate. In our case, I have chosen to measure only the "embedding_distance."

I've left the "string_distance" as a comment in case you want to conduct a test with two evaluations instead of one.
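
To give an intuition for what a string-based distance measures, as opposed to an embedding-based one, here is a minimal sketch of a normalized Levenshtein (edit) distance in plain Python. This is only an illustration: the "string_distance" evaluator relies on the rapidfuzz package and may use a different metric by default, and the functions below are written for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Scale the edit distance to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


edits = levenshtein("kitten", "sitting")
score = normalized_distance("kitten", "sitting")
```

Unlike an embedding distance, a metric of this kind penalizes a summary that uses different wording even when the meaning matches the reference.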


In [30]:
from langchain.smith import run_on_dataset, RunEvalConfig
!pip install -q rapidfuzz==3.6.1

In [31]:
#We are using just one of the multiple evaluators available in LangSmith.

evaluation_config = RunEvalConfig(
    evaluators=[
        "embedding_distance",
        #"string_distance"
    ],
)



### Running Evaluator
With the same configuration, we can launch two evaluations on the same dataset. One for each of the chosen models.

In [32]:
project_name = f"T5-BASE {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

base_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_base,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-BASE 2025-07-28 15:12:58' at:
https://smith.langchain.com/o/e18ec3e5-d867-4b98-b11d-17dd5c083320/datasets/f2719b84-8226-40e6-8e44-707aeb7a5d6c/compare?selectedSessions=10dba9b8-f625-4797-b7da-d7e2eb2a74dd

View all tests for Dataset Summarize_dataset_2025-07-28 15:11:01 at:
https://smith.langchain.com/o/e18ec3e5-d867-4b98-b11d-17dd5c083320/datasets/f2719b84-8226-40e6-8e44-707aeb7a5d6c
[>                                                 ] 0/10

LLM failed for example 373ad156-ff04-4dc8-8c5a-dd702ccde710 with inputs {'article': 'Summarize this news:\nThe seed for this crawl was a list of every host in the Wayback Machine \n \n This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds) \n \n The WARC files associated with this crawl are not currently available to the general public. ||||| Summary: Microsoft\'s acquisition of Nokia is aimed at building a devices and services strategy, but the joint company won\'t take the same form as Apple. \n \n Microsoft has been working on its evolution into a devices and services company, away from the services business it has traditionally been, for several years now with limited success. \n \n Its acquisition of most of Nokia is the latest acceleration of that strategy — to move further away from the moribund world of the beige desktop and towards the sunlit world of smartphones and tablets. \n \n Owning the desktop (via Windo

[-------------->                                   ] 3/10

LLM failed for example 6c8de2bb-c14f-4bd5-8b7a-167cb0189338 with inputs {'article': 'Summarize this news:\nStarting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period. ||||| St. Paul\'s Top 3 Dive Bars"Best burger in town. Really, it\'s that amazing. Not a huge place, so don\'t bring your whole extended family with you." \n \n WCCO Viewers\' Choice For Best Egg Roll In MinnesotaIn this week’s Best of Minnesota, the tastiest egg rolls can be found at a popular neighborhood restaurant in northeast Minneapolis called Que Viet. \n \n DeRusha Eats: The Sunshine Factory Burns Bright In PlymouthThe Sunshine Factory opened in New Hope in 1976, and everything about it screamed the \'70s. \n \n Minneapolis Moonshine: 5 Of The City\'s Top DistilleriesWe crunched the numbers to find the top distilleries in Minneapolis, to help you find the best spots to meet your needs. \n \n 

[------------------->                              ] 4/10

LLM failed for example 6df6d057-a54a-40f3-a4c0-67eb2199b696 with inputs {'article': 'Summarize this news:\n\n \n \n \n UPDATE: 4/19/2001 Read Richard Metzger: How I, a married, middle-aged man, became an accidental spokesperson for gay rights overnight on Boing Boing \n \n It’s time to clarify a few details about the controversial “Hey Facebook what’s SO wrong with a pic of two men kissing?” story, as it now beginning to be reported in the mainstream media, and not always correctly. \n \n First of all, with regards to the picture: \n \n The photo which was used to illustrate my first post about the John Snow Kiss-In is a promotional still from the British soap opera “Eastenders.” It features one of the main characters from the show (Christian Clarke, played by the actor John Partridge- left) and someone else who I don’t know. I am not a regular viewer so I can’t say if the man on the right is an extra or an actual character. \n \n This picture has itself caused scandal in the UK, as it

[------------------------>                         ] 5/10

LLM failed for example 8fe42ae1-1281-4ff7-bd51-8ced8a3302f5 with inputs {'article': 'Summarize this news:\nA still image taken from Israeli Defence Forces (IDF) video footage shows what they say is a small unidentified aircraft shot down in a mid-air interception after it crossed into southern Israel October 6, 2012. \n \n DUBAI (Reuters) - The incursion by an unmanned aircraft into Israeli airspace at the weekend exposed the weakness of Israeli air defenses, an Iranian military official was quoted as saying on Monday. \n \n The Israeli air force shot down a drone on Saturday after it crossed into southern Israel, the military said, but it remained unclear where the aircraft had come from. \n \n Jamaluddin Aberoumand, deputy coordinator for Iran\'s Islamic Revolutionary Guard Corps, said the incident indicated that Israel\'s Iron Dome anti-missile defense system "does not work and lacks the necessary capacity", Fars news agency reported. \n \n The Iron Dome system, jointly funded with 

[----------------------------->                    ] 6/10

LLM failed for example be9fe43e-2880-4174-a243-ceb31d260bee with inputs {'article': 'Summarize this news:\nIt\'s the Golden State\'s latest version of the Great Secession. \n \n \n \n Fed up by Sacramento\'s regulations and Southern California\'s political sway, residents in one rural Northern California county are taking steps to leave the state. \n \n The Siskiyou County Board of Supervisors voted, 4-1, on Tuesday to pursue seceding from California, the Redding Record Searchlight reported. Proponents say Siskiyou should form a new state -- called Jefferson -- with other counties in Northern California and Southern Oregon they believe share similar interests. \n \n On Tuesday more than 100 people filled the supervisors\' chambers, many of whom indicated support for the declaration, the Searchlight reported. When a speaker asked those in the audience who was in favor, "nearly every hand in the room was raised," the newspaper said. \n \n "Many proposed laws are unconstitutional and deny

[--------------------------------------->          ] 8/10

LLM failed for example dc3b0b84-c84e-4c88-bb4f-2e3efbbf9272 with inputs {'article': 'Summarize this news:\nGOP Eyes Gains As Voters In 11 States Pick Governors \n \n Enlarge this image toggle caption Jim Cole/AP Jim Cole/AP \n \n Voters in 11 states will pick their governors tonight, and Republicans appear on track to increase their numbers by at least one, with the potential to extend their hold to more than two-thirds of the nation\'s top state offices. \n \n Eight of the gubernatorial seats up for grabs are now held by Democrats; three are in Republican hands. Republicans currently hold 29 governorships, Democrats have 20, and Rhode Island\'s Gov. Lincoln Chafee is an Independent. \n \n Polls and race analysts suggest that only three of tonight\'s contests are considered competitive, all in states where incumbent Democratic governors aren\'t running again: Montana, New Hampshire and Washington. \n \n While those state races remain too close to call, Republicans are expected to wrest

[-------------------------------------------->     ] 9/10

LLM failed for example eef4b09f-5c8b-40aa-97a1-2f630e458ea1 with inputs {'article': 'Summarize this news:\nIf True, Building Set For Demolition Could Be Manhattan\'s Oldest October 15, 2013 5:39 PM \n \n Preservationist Adam Woodward discovered a cellar that he believes could be the foundation of the Revolutionary War-era Bull’s Head Tavern. (credit: Adam Woodward) \n \n NEW YORK (CBSNewYork) — A preservationist says he has found evidence that a Manhattan building is the former site of an 18th-century tavern where George Washington is believed to have enjoyed a celebratory drink during the American Revolution. \n \n If it is indeed the home of the legendary watering hole, the discovery could mean that the building that is perhaps Manhattan’s oldest is slated to demolished. \n \n “After the English had marched up the Bowery and out of the city (in 1783), George Washington and Governor (George) Clinton stopped at the Bull’s Head (tavern),” preservationist Adam Woodward told WCBS 880’s Al

[------------------------------------------------->] 10/10


In [33]:
#Ignore the error shown below
project_name = f"T5-FineTuned {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

finetuned_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_finetuned,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-FineTuned 2025-07-28 15:13:33' at:
https://smith.langchain.com/o/e18ec3e5-d867-4b98-b11d-17dd5c083320/datasets/f2719b84-8226-40e6-8e44-707aeb7a5d6c/compare?selectedSessions=5293037d-5d19-422c-bd48-d958132ac066

View all tests for Dataset Summarize_dataset_2025-07-28 15:11:01 at:
https://smith.langchain.com/o/e18ec3e5-d867-4b98-b11d-17dd5c083320/datasets/f2719b84-8226-40e6-8e44-707aeb7a5d6c
[>                                                 ] 0/10

LLM failed for example 6c8de2bb-c14f-4bd5-8b7a-167cb0189338 with inputs {'article': 'Summarize this news:\nStarting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period. ||||| St. Paul\'s Top 3 Dive Bars"Best burger in town. Really, it\'s that amazing. Not a huge place, so don\'t bring your whole extended family with you." \n \n WCCO Viewers\' Choice For Best Egg Roll In MinnesotaIn this week’s Best of Minnesota, the tastiest egg rolls can be found at a popular neighborhood restaurant in northeast Minneapolis called Que Viet. \n \n DeRusha Eats: The Sunshine Factory Burns Bright In PlymouthThe Sunshine Factory opened in New Hope in 1976, and everything about it screamed the \'70s. \n \n Minneapolis Moonshine: 5 Of The City\'s Top DistilleriesWe crunched the numbers to find the top distilleries in Minneapolis, to help you find the best spots to meet your needs. \n \n 

[------------------->                              ] 4/10

LLM failed for example 373ad156-ff04-4dc8-8c5a-dd702ccde710 with inputs {'article': 'Summarize this news:\nThe seed for this crawl was a list of every host in the Wayback Machine \n \n This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds) \n \n The WARC files associated with this crawl are not currently available to the general public. ||||| Summary: Microsoft\'s acquisition of Nokia is aimed at building a devices and services strategy, but the joint company won\'t take the same form as Apple. \n \n Microsoft has been working on its evolution into a devices and services company, away from the services business it has traditionally been, for several years now with limited success. \n \n Its acquisition of most of Nokia is the latest acceleration of that strategy — to move further away from the moribund world of the beige desktop and towards the sunlit world of smartphones and tablets. \n \n Owning the desktop (via Windo

[------------------------>                         ] 5/10

LLM failed for example c6205ebe-3a11-4873-9bde-08de5cba1a5d with inputs {'article': 'Summarize this news:\nPARIS (AP) — The Pompidou Centre in Paris hopes to display a long-vanished Picasso painting in May, now that it has been recovered by U.S. customs authorities. \n \n This undated photo provided by the United States Department of Justice, shows a cubist painting entitled “The Hairdresser” by Pablo Picasso. Authorities say the painting worth millions of dollars was... (Associated Press) \n \n The 1911 cubist painting "The Hairdresser," worth millions of dollars, was reported missing from a Pompidou storeroom in 2001. It was smuggled into the U.S. in December from Belgium. \n \n Pompidou director Alain Seban said the discovery comes as a "true comfort" at a time when the cultural world is reeling from an Islamic State video showing the destruction of statues in Iraq. \n \n Seban said in a statement Friday that he hopes the work can be exhibited again publicly in May. \n \n U.S. and F

[--------------------------------------->          ] 8/10

LLM failed for example be9fe43e-2880-4174-a243-ceb31d260bee with inputs {'article': 'Summarize this news:\nIt\'s the Golden State\'s latest version of the Great Secession. \n \n \n \n Fed up by Sacramento\'s regulations and Southern California\'s political sway, residents in one rural Northern California county are taking steps to leave the state. \n \n The Siskiyou County Board of Supervisors voted, 4-1, on Tuesday to pursue seceding from California, the Redding Record Searchlight reported. Proponents say Siskiyou should form a new state -- called Jefferson -- with other counties in Northern California and Southern Oregon they believe share similar interests. \n \n On Tuesday more than 100 people filled the supervisors\' chambers, many of whom indicated support for the declaration, the Searchlight reported. When a speaker asked those in the audience who was in favor, "nearly every hand in the room was raised," the newspaper said. \n \n "Many proposed laws are unconstitutional and deny

[-------------------------------------------->     ] 9/10

LLM failed for example eef4b09f-5c8b-40aa-97a1-2f630e458ea1 with inputs {'article': 'Summarize this news:\nIf True, Building Set For Demolition Could Be Manhattan\'s Oldest October 15, 2013 5:39 PM \n \n Preservationist Adam Woodward discovered a cellar that he believes could be the foundation of the Revolutionary War-era Bull’s Head Tavern. (credit: Adam Woodward) \n \n NEW YORK (CBSNewYork) — A preservationist says he has found evidence that a Manhattan building is the former site of an 18th-century tavern where George Washington is believed to have enjoyed a celebratory drink during the American Revolution. \n \n If it is indeed the home of the legendary watering hole, the discovery could mean that the building that is perhaps Manhattan’s oldest is slated to demolished. \n \n “After the English had marched up the Bowery and out of the city (in 1783), George Washington and Governor (George) Clinton stopped at the Bull’s Head (tavern),” preservationist Adam Woodward told WCBS 880’s Al

[------------------------------------------------->] 10/10


<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Tests.jpg?raw=true">

In the image below you can see the comparison between the two tests.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareTestst.jpg?raw=true">

Well, since it has been so straightforward, why don't we try to make the comparison with an OpenAI model?

In [34]:
from langchain_openai import OpenAI
open_aillm=OpenAI(temperature=0.0)

In [35]:
project_name = f"OpenAI {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

openai_results = run_on_dataset(  # separate name so the fine-tuned T5 results are not overwritten
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=open_aillm,
    evaluation=evaluation_config,
)

View the evaluation results for project 'OpenAI 2025-07-28 15:13:52' at:
https://smith.langchain.com/o/e18ec3e5-d867-4b98-b11d-17dd5c083320/datasets/f2719b84-8226-40e6-8e44-707aeb7a5d6c/compare?selectedSessions=ac7ba6a7-be37-4355-a246-33a3af43efcd

View all tests for Dataset Summarize_dataset_2025-07-28 15:11:01 at:
https://smith.langchain.com/o/e18ec3e5-d867-4b98-b11d-17dd5c083320/datasets/f2719b84-8226-40e6-8e44-707aeb7a5d6c
[------------------------------------------------->] 10/10


<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareOpenAI_HF.jpg?raw=true">

The experiment with the OpenAI model has yielded the best results. But be aware: as we can see, there is a cost involved, since we are calling a paid API.

Another crucial piece of information is that we can view performance data for the models. This data could also be useful for a basic evaluation of our inference server.