# Lab | Summarization evaluation using LangSmith
Let's revisit your capstone project 2? Well, sort of. Pick different sets of data and re-run this notebook, perhaps with parts of the dataset you used in your last project week. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications using LangSmith.

What did you learn? Let's discuss that in class.

## LangSmith - LangChain evaluation

In [1]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

In [2]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"]="langsmith_max-test"

In [3]:
#Importing Client from Langsmith
from langsmith import Client
client = Client(api_key=LANGCHAIN_API_KEY)

### Create Dataset


In [4]:
from datasets import load_dataset
cnn_dataset = load_dataset(
    "permutans/fineweb-bbc-news",
    split=None,  # This will load all splits as a DatasetDict
    trust_remote_code=True,
    name="CC-MAIN-2013-20"
)



In [5]:
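def add_prefix(example):
    # The dataset exposes the news body under 'text' (see the features listed below);
    # the prompt is stored in a new 'article' column.
    return {
        **example,
        "article": f"Summarize this news:\n{example['text']}"
    }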

#cnn_dataset = cnn_dataset.map(add_prefix)

In [6]:
cnn_dataset

DatasetDict({
    train: Dataset({
        features: ['url', 'text'],
        num_rows: 179829
    })
})

In [7]:
cnn_dataset['train'][0]

{'url': 'http://news.bbc.co.uk/2/hi/programmes/from_our_own_correspondent/5326614.stm',
 'text': 'By Duncan Bartlett\nBBC News, Japan\nA Japanese legend claims that Jesus escaped Jerusalem and made his way to Aomori in Japan where he became a rice farmer. Christians say the story is nonsense. However, a monument there known as the Grave of Christ attracts curious visitors from all over the world.\nTo reach the Grave of Christ or Kristo no Hakka as it is known locally, you need to head deep into the northern countryside of Japan, a place of paddy fields and apple orchards.\nThe Grave of Christ has become an international tourist attraction\nHalfway up a remote mountain surrounded by a thicket of bamboo lies a mound of bare earth marked with a large wooden cross.\nMost visitors peer at the grave curiously and pose in front of the cross for a photograph before heading off for apple ice cream at the nearby cafe.\nBut some pilgrims leave coins in front of the grave in thanks for answered pr

In [11]:
# Take just a few news articles to test
MAX_NEWS=10

def add_prefix(example):
	return {
		**example,
		"article": f"Summarize this news:\n{example['text']}"
	}

sample_cnn = cnn_dataset["train"].select(range(MAX_NEWS)).map(add_prefix)

sample_cnn

Map: 100%|██████████| 10/10 [00:00<00:00, 1199.61 examples/s]


Dataset({
    features: ['url', 'text', 'article'],
    num_rows: 10
})

The raw dataset contains two columns, url and text; the mapping above adds a third, article, which holds the prompt. To use LangSmith, we need to create a dataset in LangSmith format.

LangSmith expects a prompt and a result. To build the prompt, we transform the news text by adding the prefix "Summarize this news:". As the result, we use a highlights column, which plays the role of the reference summary. This dataset does not ship human-written summaries, so a later cell fills highlights with the article text itself.

In [12]:
print(sample_cnn[0])

{'url': 'http://news.bbc.co.uk/2/hi/programmes/from_our_own_correspondent/5326614.stm', 'text': 'By Duncan Bartlett\nBBC News, Japan\nA Japanese legend claims that Jesus escaped Jerusalem and made his way to Aomori in Japan where he became a rice farmer. Christians say the story is nonsense. However, a monument there known as the Grave of Christ attracts curious visitors from all over the world.\nTo reach the Grave of Christ or Kristo no Hakka as it is known locally, you need to head deep into the northern countryside of Japan, a place of paddy fields and apple orchards.\nThe Grave of Christ has become an international tourist attraction\nHalfway up a remote mountain surrounded by a thicket of bamboo lies a mound of bare earth marked with a large wooden cross.\nMost visitors peer at the grave curiously and pose in front of the cross for a photograph before heading off for apple ice cream at the nearby cafe.\nBut some pilgrims leave coins in front of the grave in thanks for answered pra

With the prompt in place, it is time to create a dataset in LangSmith that pairs it with a reference summary.

### Create the Dataset in LangSmith

The dataset in LangSmith is composed of an input, which is the prompt passed to the model for evaluation, and an output, which should contain what we expect the model to return.
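To make the input/output split concrete, here is a minimal sketch of how each record could be divided according to the chosen input and output keys. The helper `to_langsmith_examples` and the sample record are hypothetical and only illustrate the shape of the data; they are not part of the LangSmith client.

```python
def to_langsmith_examples(records, input_keys, output_keys):
    """Split each record into an inputs/outputs pair (illustrative helper only)."""
    examples = []
    for record in records:
        examples.append({
            # What the model receives.
            "inputs": {key: record[key] for key in input_keys},
            # What we expect the model to return.
            "outputs": {key: record[key] for key in output_keys},
        })
    return examples


# Hypothetical record with the same columns used in this notebook.
records = [
    {
        "url": "http://example.com/1",
        "article": "Summarize this news:\nFirst story.",
        "highlights": "First summary.",
    },
]

examples = to_langsmith_examples(records, ["article"], ["highlights"])
print(examples[0])
```

Columns that are neither input nor output keys (here, url) are simply left out of the example.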

In [13]:
import datetime

In [14]:
import uuid
input_key=['article']
output_key=['highlights']

NAME_DATASET=f"Summarize_dataset_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

In [17]:
# Add a 'highlights' column to sample_cnn; the article text is reused as the reference
highlights = cnn_dataset["train"].select(range(MAX_NEWS))["text"]
sample_cnn = sample_cnn.add_column("highlights", highlights)

# This creates the dataset in LangSmith with the content in sample_cnn - If you run this more than once you will get POST errors
dataset = client.upload_dataframe(
    df=sample_cnn,
    input_keys=input_key,
    output_keys=output_key,
    name=NAME_DATASET,
    description="Test Embedding distance between model summarizations",
    data_type="kv"
)

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 222.37ba/s]


In the image below, we can see an example from the dataset once it has been registered in LangSmith.

The Input column holds the prompt to be sent, while the Output column stores the expected output.

When performing the comparison, the model is given the prompt, and the cosine distance between the embedding of its response and the embedding of the reference stored in the dataset is calculated.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Dataset.jpg?raw=true">
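The cosine distance mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the formula (one minus the cosine similarity) applied to two small hand-written vectors; the real evaluator works on embeddings produced by an embedding model, and the function name `cosine_distance` is my own.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing in the same direction: distance close to 0.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors: distance of 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

A lower distance means the response and the reference are closer in embedding space, so lower scores are better for this evaluator.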

### Recovering Models From Hugging Face
Let's retrieve both models from Hugging Face: a base T5 model and a T5 model that has been fine-tuned to generate summaries (judging by its repository name, on the CNN/DailyMail dataset rather than on the BBC news used here).

In [18]:
from langchain import HuggingFaceHub

In [19]:
summarizer_base = HuggingFaceHub(
    repo_id="t5-base",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)



In [20]:
summarizer_finetuned = HuggingFaceHub(
    repo_id="flax-community/t5-base-cnn-dm",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)

## Defining Evaluator
The first step is to define an evaluator, where we specify the variables we want to evaluate. In our case, I have chosen to measure only the "embedding_distance."

I've left the "string_distance" as a comment in case you want to conduct a test with two evaluations instead of one.
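To get an intuition for what a string distance measures, as opposed to an embedding distance, here is a minimal sketch of a normalized Levenshtein (edit) distance. It is an assumption chosen for illustration: the metric actually used by the "string_distance" evaluator may be a different one, and the function names below are my own.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Scale the edit distance to the range [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_distance("summary", "summary"))
```

A string distance compares characters, not meaning, so two summaries that say the same thing with different words can still score as far apart. That is one reason to prefer the embedding distance for summarization.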


In [21]:
from langchain.smith import run_on_dataset, RunEvalConfig
!pip install -q rapidfuzz==3.6.1

In [22]:
# We are using just one of the multiple evaluators available in LangSmith.

evaluation_config = RunEvalConfig(
    evaluators=[
        "embedding_distance",
        #"string_distance"
    ],
)



### Running Evaluator
With the same configuration, we can launch two evaluations on the same dataset. One for each of the chosen models.

In [23]:
project_name = f"T5-BASE {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

base_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_base,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-BASE 2025-06-24 12:46:55' at:
https://smith.langchain.com/o/7d7d32ad-43dd-483f-a860-6edb1a519b69/datasets/ff55fe95-6679-40b9-aef9-2e4ea670675e/compare?selectedSessions=352bbe0e-dffc-4308-a071-1274b419c477

View all tests for Dataset Summarize_dataset_2025-06-24 12:46:06 at:
https://smith.langchain.com/o/7d7d32ad-43dd-483f-a860-6edb1a519b69/datasets/ff55fe95-6679-40b9-aef9-2e4ea670675e
[>                                                 ] 0/10

LLM failed for example 0322b738-2379-4ad2-a883-b1b45afe4cf9 with inputs {'article': 'Summarize this news:\nMr Erdogan says Turkey has a role to play as mediator\nTurkish PM Recep Tayyip Erdogan has met Syrian President Bashar al-Assad in Damascus, as part of efforts to secure a peace deal between Syria and Israel.\nMr Erdogan said both nations had sought Turkey\'s help on the issue.\nMediation would begin at a low level and, if successful, progress to higher-level officials, he said.\nOn Thursday Syria said Israel had indicated it would be prepared to withdraw from the Golan Heights in return for peace.\nThe office of Israeli Prime Minister Ehud Olmert has declined to comment on the reports, but Mr Olmert has said that he is interested in peace in Syria.\nIsrael and Syria remain technically at war, although both sides have recently spoken of their desire for peace.\nIsrael says it is interested in peace with Syria\nThe Syrian government has insisted that peace talks can be resumed only

[--------->                                        ] 2/10

LLM failed for example 17f1570d-fecc-4c72-9c76-5c1d4c78f9ed with inputs {'article': 'Summarize this news:\nBy Laura Smith-Spark\nBBC News, Washington\nAs wildfires force hundreds of thousands of people to flee their homes in California, inevitable comparisons are drawn with the response to Hurricane Katrina.\nFirefighters continue to battle fierce blazes across southern California\nHas the US learnt the harsh lessons of New Orleans?\nThe ramifications of the bungled response to Katrina are still felt two years later in the US, both politically and by the people living in New Orleans and the Gulf Coast.\nIt has quickly become clear that the White House has no intention of letting events unravel in a similarly chaotic - and public - fashion in California.\nHomeland Security Secretary Michael Chertoff and David Paulison, head of the Federal Emergency Management Agency (Fema), are already on the scene.\n"What we see now that we did not see during Hurricane Katrina is a very good team effor

[------------------->                              ] 4/10

LLM failed for example 49970ce6-8c9d-456d-8932-0572e96b4d6e with inputs {'article': 'Summarize this news:\nPaypal is the world\'s leading online payment service\nWeb payment firm Paypal has said it will block "unsafe browsers" from using its service as part of wider anti-phishing efforts.\nCustomers will first be warned that a browser is unsafe but could then be blocked if they continue using it.\nPaypal said it was "an alarming fact that there is a significant set of users who use very old and vulnerable browsers such as Internet Explorer 4".\nPhishing attacks trick users into handing over sensitive data.\nPaypal said some users were still using Internet Explorer 3 , released more than 10 years ago. It lacks many of the security and safety features needed to protect users from phishing and other online attacks.\nPaypal said it supported the use of Extended Validation SSL Certificates. Browsers which support the technology highlight the address bar in green when users are on a site tha

[----------------------------->                    ] 6/10

LLM failed for example abe3c25d-3942-4279-a139-2fca0dcffb42 with inputs {'article': 'Summarize this news:\nBirmingham Six release remembered\nTwenty years ago the Birmingham Six were freed after their convictions for the murders of 21 people in two pub bombings were quashed.\nThey had served nearly 17 years behind bars in one of the worst miscarriages of justice seen in Britain.\nPaddy Hill, Gerry Hunter, Johnny Walker, Hugh Callaghan, Richard McIlkenny and Billy Power strode from London\'s Old Bailey on 14 March 1991, their innocence finally proved.\nAlongside the men as they left court greeted by cheering crowds and beeping car horns was Chris Mullin, a journalist and MP who had been working towards their freedom since the late 1970s.\nEnd Quote Chris Mullin\nI was convinced that here were six civilians who were in the wrong place at the wrong time”\nMr Mullin, now 63, first became interested in the case when his journalist friend Peter Chippindale, who attended the men\'s trial and 

[-------------------------------------------->     ] 9/10

LLM failed for example 8acf7ae2-28f6-4d69-8d83-ac755db1b31c with inputs {'article': "Summarize this news:\nBy Vaudine England\nBBC News, Hong Kong\nA study by doctors in Hong Kong has concluded that epilepsy can be induced by the Chinese tile game of mahjong.\nThe study said the syndrome affects more men than women\nThe findings, published in the Hong Kong Medical Journal, were based on 23 cases of people who suffered mahjong-induced seizures.\nThe report's four authors, from Hong Kong's Queen Mary Hospital, said the best prevention - and cure - was to avoid playing mahjong.\nThe study led the doctors to define mahjong epilepsy as a unique syndrome.\nEpileptic seizures can be provoked by a wide variety of triggers, but one cause increasingly evident to researchers is the playing - or even watching - of mahjong.\nThis Chinese tile game, played by four people round a table, can involve gambling and quickly becomes compulsive.\nThe game, which is intensely social and sometimes played in c

[------------------------------------------------->] 10/10

In [24]:
#Ignore the error shown below
project_name = f"T5-FineTuned {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

finetuned_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_finetuned,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-FineTuned 2025-06-24 12:47:04' at:
https://smith.langchain.com/o/7d7d32ad-43dd-483f-a860-6edb1a519b69/datasets/ff55fe95-6679-40b9-aef9-2e4ea670675e/compare?selectedSessions=6935c8db-34c6-4b5b-a02c-8ac4f82cbcee

View all tests for Dataset Summarize_dataset_2025-06-24 12:46:06 at:
https://smith.langchain.com/o/7d7d32ad-43dd-483f-a860-6edb1a519b69/datasets/ff55fe95-6679-40b9-aef9-2e4ea670675e
[>                                                 ] 0/10

LLM failed for example 16c036e4-69db-4365-8d78-bbf5ee0682ab with inputs {'article': "Summarize this news:\nLeader: Yevgeny Shevchuk\nA former speaker of Trans-Dniester's parliament, Yevgeny Shevchuk overturned expectations by coming first in the initial round of voting in the November 2011 presidential election, pushing the incumbent Igor Smirnov into third place.\nMr Shevchuk then beat Anatoly Kaminsky, a colleague-turned-rival and Russia's preferred candidate, in the second round to become president in December.\nYevgeny Shevchuk broke with long-serving President Smirnov in 2009 in an attempt to limit the latter's powers. He then lead an anti-corruption movement that also called for greater transparency in government.\nThe 43-year-old Mr Shevchuk's election campaign benefited from public weariness with lack of progress in peace talks and general economic stagnation under Mr Smirnov, who had also lost the support of Russia.\nThe new president says he wants to improve relations with Mo

[---->                                             ] 1/10

LLM failed for example 3e49cd8d-a71f-4681-835e-cf703f7deda7 with inputs {'article': 'Summarize this news:\nBy Tom Bishop\nBBC News entertainment reporter\nMadonna has headed back to the disco for her new album, Confessions on a Dancefloor, which is released in the UK on Monday. It has earned her some of the best reviews of her 22-year career.\nIn stark contrast to 2003\'s introspective American Life album, she has dusted off her glitterball, strapped on her pink stilettos and sampled Abba on latest hit single Hung Up.\nHas Madonna reinvigorated her music career, or is she merely throwing one final dance party for her long-term fans before settling down to record more sedate material?\nMadonna embraced New York club culture in the early 1980s\n"Dance music fans may be unconvinced by Madonna\'s new image as it no longer reflects her real life," says DJ magazine\'s features editor Carl Loben.\n"Madonna embraced the early stages of New York club culture in the 1980s but I doubt she has bee

[-------------->                                   ] 3/10

LLM failed for example 0322b738-2379-4ad2-a883-b1b45afe4cf9 with inputs {'article': 'Summarize this news:\nMr Erdogan says Turkey has a role to play as mediator\nTurkish PM Recep Tayyip Erdogan has met Syrian President Bashar al-Assad in Damascus, as part of efforts to secure a peace deal between Syria and Israel.\nMr Erdogan said both nations had sought Turkey\'s help on the issue.\nMediation would begin at a low level and, if successful, progress to higher-level officials, he said.\nOn Thursday Syria said Israel had indicated it would be prepared to withdraw from the Golan Heights in return for peace.\nThe office of Israeli Prime Minister Ehud Olmert has declined to comment on the reports, but Mr Olmert has said that he is interested in peace in Syria.\nIsrael and Syria remain technically at war, although both sides have recently spoken of their desire for peace.\nIsrael says it is interested in peace with Syria\nThe Syrian government has insisted that peace talks can be resumed only

[------------------------>                         ] 5/10

LLM failed for example 7069d92f-df2f-48d7-b98b-d9d4637f72ad with inputs {'article': 'Summarize this news:\nBy Duncan Bartlett\nBBC News, Japan\nA Japanese legend claims that Jesus escaped Jerusalem and made his way to Aomori in Japan where he became a rice farmer. Christians say the story is nonsense. However, a monument there known as the Grave of Christ attracts curious visitors from all over the world.\nTo reach the Grave of Christ or Kristo no Hakka as it is known locally, you need to head deep into the northern countryside of Japan, a place of paddy fields and apple orchards.\nThe Grave of Christ has become an international tourist attraction\nHalfway up a remote mountain surrounded by a thicket of bamboo lies a mound of bare earth marked with a large wooden cross.\nMost visitors peer at the grave curiously and pose in front of the cross for a photograph before heading off for apple ice cream at the nearby cafe.\nBut some pilgrims leave coins in front of the grave in thanks for an

[----------------------------->                    ] 6/10

LLM failed for example 74475abf-7c65-4628-bf8a-9a49dc228e7c with inputs {'article': 'Summarize this news:\nSaudi woman athlete makes headlines\nWojdan Shaherkani has been making headlines.\nShe became the subject of worldwide media attention when it was announced that she would be one of the first two Saudi female athletes to compete at the Olympics.\nBut this was soon overshadowed by a row over her hijab - a head covering that many Muslim women wear - that meant she was at risk of not taking part at all.\nThe International Judo Federation initially said Shaherkani would not be allowed to wear a headscarf during the competition due to safety concerns.\nA spokesman said that in Judo athletes used strangleholds and chokeholds and that wearing a hijab could be dangerous.\nBut that was a deal-breaker for the Saudi Arabian Olympic Committee.\nEnd Quote Noor al-Sajan Saudi law student\nShe\'s 16 years old... I don\'t know how she\'s handling all of this”\nThe Saudi authorities had agreed to 

[-------------------------------------------->     ] 9/10

LLM failed for example abe3c25d-3942-4279-a139-2fca0dcffb42 with inputs {'article': 'Summarize this news:\nBirmingham Six release remembered\nTwenty years ago the Birmingham Six were freed after their convictions for the murders of 21 people in two pub bombings were quashed.\nThey had served nearly 17 years behind bars in one of the worst miscarriages of justice seen in Britain.\nPaddy Hill, Gerry Hunter, Johnny Walker, Hugh Callaghan, Richard McIlkenny and Billy Power strode from London\'s Old Bailey on 14 March 1991, their innocence finally proved.\nAlongside the men as they left court greeted by cheering crowds and beeping car horns was Chris Mullin, a journalist and MP who had been working towards their freedom since the late 1970s.\nEnd Quote Chris Mullin\nI was convinced that here were six civilians who were in the wrong place at the wrong time”\nMr Mullin, now 63, first became interested in the case when his journalist friend Peter Chippindale, who attended the men\'s trial and 

[------------------------------------------------->] 10/10

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Tests.jpg?raw=true">

In the image below you can see the comparison between the two tests.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareTestst.jpg?raw=true">

Well, since it has been so straightforward, why don't we try to make the comparison with an OpenAI model?

In [25]:
from langchain_openai import OpenAI
open_aillm=OpenAI(temperature=0.0)

In [26]:
project_name = f"OpenAI {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

openai_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=open_aillm,
    evaluation=evaluation_config,
)

View the evaluation results for project 'OpenAI 2025-06-24 12:47:17' at:
https://smith.langchain.com/o/7d7d32ad-43dd-483f-a860-6edb1a519b69/datasets/ff55fe95-6679-40b9-aef9-2e4ea670675e/compare?selectedSessions=3b33c45b-23f1-439f-9529-c66729be2b19

View all tests for Dataset Summarize_dataset_2025-06-24 12:46:06 at:
https://smith.langchain.com/o/7d7d32ad-43dd-483f-a860-6edb1a519b69/datasets/ff55fe95-6679-40b9-aef9-2e4ea670675e
[------------------------------------------------->] 10/10

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareOpenAI_HF.jpg?raw=true">

The experiment with the OpenAI model has yielded the best results. Be aware, though, that it comes at a cost: we are calling a paid API.

Another useful piece of information is that LangSmith also shows performance data for each model. This data can serve as a first, basic evaluation of our inference server.