
# Summarization Application with LLMs

In this notebook, I build a summarization application using existing, open-source LLMs. For this, I use [Hugging Face models](https://huggingface.co/models) and prompt engineering.


In [3]:
%pip install sacremoses==0.0.53

In [None]:
%run ../Includes/Classroom-Setup

In [None]:
from datasets import load_dataset
from transformers import pipeline

### Summarization

Summarization can take two forms:
* `extractive` (selecting representative excerpts from the text)
* `abstractive` (generating novel text summaries)

Here, I will use a model which does *abstractive* summarization.
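To make the contrast concrete, here is a minimal sketch of the *extractive* approach, which this notebook does not otherwise use. It scores each sentence by the average frequency of its words and returns the top sentence verbatim. The function name `extractive_summary` and the scoring rule are illustrative assumptions, not part of any library.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Toy extractive summarizer: return the highest-scoring sentences verbatim."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased).
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    # Pick the top sentences, then emit them in their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


example_text = "The cat sat on the mat. The cat likes the mat. Dogs bark."
example_summary = extractive_summary(example_text)
```

The output is always copied from the input, which is the defining property of extractive summarization. An abstractive model such as T5 can instead produce sentences that appear nowhere in the article.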

**Background reading**: The [Hugging Face summarization task page](https://huggingface.co/docs/transformers/tasks/summarization) lists model architectures which support summarization. The [summarization course chapter](https://huggingface.co/course/chapter7/5) provides a detailed walkthrough.

In this section, I will use:
* **Data**: [xsum](https://huggingface.co/datasets/xsum) dataset, which provides a set of BBC articles and summaries.
* **Model**: [t5-small](https://huggingface.co/t5-small) model, which has 60 million parameters (242 MB for the PyTorch weights). T5 is an encoder-decoder model created by Google which supports several tasks such as summarization, translation, Q&A, and text classification. For more details, see the [Google blog post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html), [code on GitHub](https://github.com/google-research/text-to-text-transfer-transformer), or the [research paper](https://arxiv.org/pdf/1910.10683.pdf).

In [None]:
# Note: We specify cache_dir to use predownloaded data.
xsum_dataset = load_dataset(
    "xsum", 
    version="1.2.0",
    cache_dir=DA.paths.datasets,
    #verification_mode="no_checks"
)

xsum_dataset  # The printed representation of this object shows the `num_rows` of each dataset split.

This dataset provides 3 columns:
* `document`: the BBC article text
* `summary`: a "ground-truth" summary. Note how subjective this "ground truth" is: is it the same summary you would write? This is a great example of how many LLM applications do not have obvious "right" answers.
* `id`: article ID
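One rough way to see why a single reference summary is subjective is to compare two equally reasonable summaries word by word. The sketch below uses a plain word-set (Jaccard) overlap. Both example summaries are made up for illustration and do not come from xsum, and `unigram_overlap` is a hypothetical helper, not a standard metric implementation.

```python
def unigram_overlap(summary_a: str, summary_b: str) -> float:
    """Jaccard overlap of the lowercase word sets of two summaries (0.0 to 1.0)."""
    words_a = set(summary_a.lower().split())
    words_b = set(summary_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # Two empty summaries are trivially identical.
    return len(words_a & words_b) / len(words_a | words_b)


# Two made-up summaries of the same hypothetical article.
summary_1 = "council approves new flood defences for the town"
summary_2 = "town to get flood barriers after council vote"

overlap = unigram_overlap(summary_1, summary_2)
print(f"Word overlap: {overlap:.2f}")
```

Both summaries describe the same event, yet they share only 3 of 13 distinct words. A generated summary can therefore differ substantially from the reference and still be acceptable.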

In [None]:
xsum_sample = xsum_dataset["train"].select(range(10))
display(xsum_sample.to_pandas())

I next use the Hugging Face `pipeline` tool to load a pre-trained model.  In this LLM pipeline constructor, I specify:
* `task`: This first argument specifies the primary task.  See [Hugging Face tasks](https://huggingface.co/tasks) for more information.
* `model`: This is the name of the pre-trained model from the [Hugging Face Hub](https://huggingface.co/models).
* `min_length`, `max_length`: The generated summaries should be between these two lengths, measured in tokens.
* `truncation`: Some input articles may be too long for the LLM to process.  Most LLMs have fixed limits on the length of input sequences.  This option tells the pipeline to truncate the input if needed.
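The effect of `truncation` can be sketched without loading a model. In the toy example below, whitespace splitting stands in for the model's real subword tokenizer, and the limit of 512 tokens is an assumed value for illustration. The helper `truncate_tokens` is hypothetical and is not how the pipeline is implemented internally.

```python
def truncate_tokens(tokens: list[str], max_input_tokens: int) -> list[str]:
    """Keep only the first `max_input_tokens` tokens and drop the rest."""
    return tokens[:max_input_tokens]


# A stand-in "article" of 1000 whitespace-separated tokens.
article = " ".join(f"word{i}" for i in range(1000))
tokens = article.split()

kept = truncate_tokens(tokens, max_input_tokens=512)  # 512 is an assumed limit.
print(f"{len(tokens)} tokens in, {len(kept)} tokens kept")
```

Anything past the limit never reaches the model, so for long articles the summary can only reflect the beginning of the text.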

In [None]:
summarizer = pipeline(
    task="summarization",
    model="t5-small",  # Alternative model to try: "google/pegasus-xsum"
    min_length=20,
    max_length=40,
    truncation=True,
    model_kwargs={"cache_dir": DA.paths.datasets},
)  # Note: We specify cache_dir to use predownloaded models.

In [None]:
# Apply to 1 article
summarizer(xsum_sample["document"][0])

In [None]:
# Apply to a batch of articles
results = summarizer(xsum_sample["document"])

In [None]:
# Display the generated summaries side by side with the reference summaries and original documents.
# Both DataFrames share the default 0..n-1 index, so `join` aligns them row by row.
import pandas as pd

display(
    pd.DataFrame(results)
    .rename(columns={"summary_text": "generated_summary"})
    .join(xsum_sample.to_pandas())[
        ["generated_summary", "summary", "document"]
    ]
)