# Text summarization using Hugging Face Transformers

The following script demonstrates how to use the Hugging Face Transformers library to perform text summarization with pre-trained models.

CAUTION: this works on Colab, but fails locally!



In [1]:
import nltk  # Import the Natural Language Toolkit library
from nltk.tokenize import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

nltk.download("punkt")  # Download the 'punkt' tokenizer models from NLTK
# Download the 'punkt_tab' resource required by sumy's Tokenizer
nltk.download("punkt_tab")
from nltk.tokenize import sent_tokenize  # Import the sentence tokenizer from NLTK

from transformers import pipeline, set_seed  # Import the pipeline and set_seed functions from the transformers library

from sumy.parsers.plaintext import PlaintextParser  # Import PlainTextParser from the sumy package
from sumy.nlp.tokenizers import Tokenizer  # Import Tokenizer from the sumy package
from sumy.summarizers.text_rank import TextRankSummarizer  # Import TextRankSummarizer from the sumy package

from datasets import load_dataset  # Import functions to load datasets from the datasets library

import evaluate # Import the evaluate library for model evaluation

import pandas as pd  # Import the pandas library to work with tables

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


NLTK (Natural Language Toolkit) is a comprehensive library for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and more. The 'punkt' tokenizer handles sentence tokenization, which is the process of dividing a text into a list of its component sentences.

The Transformers library by Hugging Face is a state-of-the-art natural language processing library that provides thousands of pretrained models for tasks on different modalities such as text, vision, and audio (Hugging Face – The AI Community Building the Future., 2025). The `pipeline` function simplifies the use of complex models for various NLP tasks, including text generation and summarization.

Sumy is a Python library used for text summarization. It offers several algorithms for extracting summaries from text documents. The PlaintextParser is used to read and parse plain text files. TextRankSummarizer is an implementation of the TextRank algorithm, a graph-based summarization technique.


In [2]:
#dataset = load_dataset('ccdv/cnn_dailymail', '3.0.0', split='train[:1]', streaming=False, download_mode="force_redownload")#, trust_remote_code = True)
#dataset = load_dataset("ccdv/cnn_dailymail", version="3.0.0", trust_remote_code=True)
# Access the first article
#first_article = dataset[0]
#print(first_article['article'])

billsum = load_dataset("billsum", split="ca_test")



In [3]:
billsum = billsum.train_test_split(test_size=0.2)
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 12300 of the Welfare and Institutions Code is amended to read:\n12300.\n(a) The purpose of this article is to provide in every county in a manner consistent with this chapter and the annual Budget Act those supportive services identified in this section to aged, blind, or disabled persons, as defined under this chapter, who are unable to perform the services themselves and who cannot safely remain in their homes or abodes of their own choosing unless these services are provided.\n(b) Supportive services shall include domestic services and services related to domestic services, heavy cleaning, personal care services, accompaniment by a provider when needed during necessary travel to health-related appointments or to alternative resource sites, yard hazard abatement, protective supervision, teaching and demonstration directed at reducing the need for other supportive services, and paramedical se

The commented-out lines above refer to the CNN/DailyMail dataset (Hermann et al., 2015), which includes approximately 300,000 pairs of news articles and their corresponding summaries. Those summaries are derived from the bullet points that CNN and the DailyMail add to their articles, and they are abstractive rather than mere excerpts. The code that actually runs loads the `ca_test` split of the BillSum dataset instead, in which each record pairs the `text` of a California bill with a human-written `summary`.

For illustrative purposes, we apply the summarization techniques to a single document. Lengthy documents are a challenge for many transformer models because their context size is limited; GPT-2, for example, accepts at most 1,024 tokens, which equates to several paragraphs. A common, albeit rudimentary, strategy to manage this constraint is to truncate texts that exceed the model's context size. Although crucial content might reside towards the end of the text, we accept this limitation for now and limit the bill to its first 2000 characters:


In [4]:
sample_text = billsum["train"][0]['text'][:2000] # Get the first 2000 characters of the article in the training dataset
#sample_text = " ".join(sample_text)
sample_text

'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 12300 of the Welfare and Institutions Code is amended to read:\n12300.\n(a) The purpose of this article is to provide in every county in a manner consistent with this chapter and the annual Budget Act those supportive services identified in this section to aged, blind, or disabled persons, as defined under this chapter, who are unable to perform the services themselves and who cannot safely remain in their homes or abodes of their own choosing unless these services are provided.\n(b) Supportive services shall include domestic services and services related to domestic services, heavy cleaning, personal care services, accompaniment by a provider when needed during necessary travel to health-related appointments or to alternative resource sites, yard hazard abatement, protective supervision, teaching and demonstration directed at reducing the need for other supportive services, and paramedical services\nw
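The slice `[:2000]` cuts at a fixed character offset, so the sample can stop in the middle of a word or sentence, as the output above does. The following is a minimal sketch of a gentler cut, assuming a hypothetical helper `truncate_at_boundary` that is not part of the notebook: it backs up to the last sentence-ending punctuation inside the character budget and falls back to the last whitespace.

```python
def truncate_at_boundary(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars, preferring a sentence or word boundary."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Prefer the last sentence-ending punctuation found inside the window.
    cut = max(
        window.rfind(". "),
        window.rfind(".\n"),
        window.rfind("? "),
        window.rfind("! "),
    )
    if cut != -1:
        return window[: cut + 1]  # keep the punctuation mark itself
    # Otherwise fall back to the last whitespace so that no word is split.
    cut = max(window.rfind(" "), window.rfind("\n"))
    return window[:cut] if cut > 0 else window


print(truncate_at_boundary("First sentence. Second sentence is longer. Third.", 30))
```

A character budget is still only a proxy for the model's token limit; counting tokens with the model's own tokenizer would be more precise.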

Before applying any model, we establish a simple baseline. It gives the machine-generated summaries a consistent point of comparison alongside the human-written reference summary in the dataset. To achieve this, we take the first three sentences of the text as our baseline summary.

In [5]:
# We'll collect the generated summaries of each model in a dictionary
summaries = {}  # Initialize an empty dictionary to store summaries

summaries["baseline"] = "\n".join(tokenizer.tokenize(sample_text)[:3])  # Generate a baseline summary and store it in the dictionary

The baseline summary consists of the first three sentences of the bill text as split by the Punkt tokenizer, which here cover the enacting clause, the amended section, and its stated purpose:

In [6]:
summaries["baseline"]

'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 12300 of the Welfare and Institutions Code is amended to read:\n12300.\n(a) The purpose of this article is to provide in every county in a manner consistent with this chapter and the annual Budget Act those supportive services identified in this section to aged, blind, or disabled persons, as defined under this chapter, who are unable to perform the services themselves and who cannot safely remain in their homes or abodes of their own choosing unless these services are provided.'

We now generate a summary with a classical algorithm from the Sumy library. One of the key algorithms provided by Sumy is the TextRankSummarizer, which implements TextRank, an unsupervised algorithm inspired by Google's PageRank method for ranking web pages. As it doesn't require pre-labeled data or training, it is a versatile option when large annotated datasets aren't available, especially when domain-specific training data is scarce. It extracts sentences that aim to capture the main points of the source document without needing extensive computational resources.
Here's how TextRank operates in the context of text summarization:

1. Graph-Based Approach: TextRank constructs a graph where sentences are nodes. The edges between sentences are weighted by the similarity between them, computed for example as normalized word overlap (as in the original TextRank formulation) or as cosine similarity on word vectors.

2. Sentence Representation: Because every sentence is a node, the edge weights encode how strongly sentences relate to each other; the more similar two sentences are, the stronger the connection.

3. Ranking Sentences: Once the graph is built, the TextRank algorithm applies a ranking process similar to PageRank to score the nodes (sentences). This process iteratively refines the score of each sentence based on its connections and the score of the connected sentences.

4. Extracting Key Sentences: After the ranking process converges, the sentences with the highest scores are extracted as the summary. Typically, the top-ranked sentences are selected to form a coherent and concise summary of the original text.
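The four steps above can be sketched in plain Python. This is a simplified illustration, not Sumy's implementation: the regex sentence splitter, the function names, the damping factor of 0.85, and the fixed iteration count are all assumptions made for the example, and the similarity is word overlap normalized by log sentence lengths, computed here on sets of distinct words.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared distinct words, normalized by the log of each sentence's size."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(text, top_n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    # Steps 1 and 2: weighted graph with sentences as nodes.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    # Step 3: PageRank-style iteration over the weighted edges.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping
            * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    # Step 4: keep the top-scoring sentences, restored to document order.
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


demo = (
    "Cats chase mice in the barn. Mice hide from cats in the barn. "
    "Cats and mice share the barn at night. Rain fell yesterday."
)
print(textrank(demo, top_n=3))
```

In the demo text, the last sentence shares no words with the others, so it receives only the minimum score and is left out of the summary.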


In [7]:
parser = PlaintextParser.from_string(sample_text, tokenizer = Tokenizer("english"))  # Use PlainTextParser to parse the sample text
summarizer = TextRankSummarizer()  # Initialize the TextRank summarizer

# Collect the summary sentences in a list
summary_sentences = []  # Initialize an empty list to store summary sentences
for sentence in summarizer(parser.document, 5):  # Generate summary sentences using the TextRank summarizer
    summary_sentences.append(str(sentence))  # Append each summary sentence to the list

# Join the sentences to form a single summary string
summaries["sumy"] = "\n".join(summary_sentences)  # Store the generated summary in the dictionary

# Print the summary (optional)
print(summaries["sumy"])  # Print the summary generated by the TextRank summarizer


(a) The purpose of this article is to provide in every county in a manner consistent with this chapter and the annual Budget Act those supportive services identified in this section to aged, blind, or disabled persons, as defined under this chapter, who are unable to perform the services themselves and who cannot safely remain in their homes or abodes of their own choosing unless these services are provided.
(b) Supportive services shall include domestic services and services related to domestic services, heavy cleaning, personal care services, accompaniment by a provider when needed during necessary travel to health-related appointments or to alternative resource sites, yard hazard abatement, protective supervision, teaching and demonstration directed at reducing the need for other supportive services, and paramedical services which that make it possible for the recipient to establish and maintain an independent living arrangement.
(8) Respiration.
(d) Personal care services are avail

In the final step, we employ three different language models for summarization: GPT-2, BART, and a lightweight version of DeepSeek. Among these, GPT-2 and DeepSeek serve as general-purpose LLMs. We prompt these models by appending "TL;DR:" to the text, which is shorthand for "too long; didn't read" and commonly signals a brief summary. Language models recognize this because it occurs frequently in training data and tend to treat it as an instruction to summarize. This allows for conditional generation, where the model, given this prefix, produces the subsequent words, forming the summary. In contrast, the specific BART model employed here has been fine-tuned on the CNN/DailyMail dataset for summarization tasks. Consequently, BART serves as a reference point for a model trained specifically for summarization, although the legislative text used here differs from the news articles it was fine-tuned on.
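The prompting pattern has two parts: append the "TL;DR:" marker to the input, then strip the echoed prompt from the generated text so that only the continuation remains. The sketch below isolates that logic. `build_query`, `extract_summary`, and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that only mimics the output shape of a text-generation pipeline (a list of dicts with a `generated_text` key that starts with the prompt); no model is loaded.

```python
PROMPT_SUFFIX = "\nTL;DR:\n"


def build_query(text):
    """Append the TL;DR marker that cues the model to summarize."""
    return text + PROMPT_SUFFIX


def extract_summary(generated_text, query):
    """Remove the echoed prompt and return only the model's continuation."""
    if generated_text.startswith(query):
        return generated_text[len(query):].strip()
    return generated_text.strip()


def fake_generate(query):
    # Hypothetical stand-in for a pipeline call: returns prompt + continuation.
    return [{"generated_text": query + " Supportive services for disabled persons."}]


query = build_query("Section 12300 is amended to read: ...")
output = fake_generate(query)
print(extract_summary(output[0]["generated_text"], query))
```

With the real pipeline, passing `return_full_text=False` should make the slicing step unnecessary, since the pipeline then returns only the newly generated text.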

In [8]:
# GPT2 summary
set_seed(42)  # Set the random seed for reproducibility
pipe = pipeline("text-generation", model="gpt2-xl")  # Initialize a text generation pipeline with the GPT-2 XL model
query = sample_text + "\nTL;DR:\n"  # Create a query for the GPT-2 model
pipe_out = pipe(query, max_new_tokens=1000, clean_up_tokenization_spaces=True)  # Generate text using the GPT-2 model
summaries["gpt2"] = "\n".join(  # Store the generated summary in the dictionary
    sent_tokenize(pipe_out[0]["generated_text"][len(query) :]))  # Tokenize the generated text into sentences
del pipe
del pipe_out
del query


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [9]:
# DeepSeek Summary
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # Define the model name for the DeepSeek model
set_seed(42)  # Set the random seed for reproducibility
pipe = pipeline("text-generation", model=model_name)  # Initialize a text generation pipeline with the DeepSeek model
query = sample_text + "\nTL;DR:\n"  # Create a query for the DeepSeek model
pipe_out = pipe(query, max_new_tokens=1000, clean_up_tokenization_spaces=True)  # Generate text using the DeepSeek model
summaries["DeepSeek"] = "\n".join(  # Store the generated summary in the dictionary
    sent_tokenize(pipe_out[0]["generated_text"][len(query) :]))  # Tokenize the generated text into sentences
del pipe
del pipe_out
del query


Device set to use cpu


In [10]:
#BART summary
pipe = pipeline("summarization", model="facebook/bart-large-cnn")  # Initialize a summarization pipeline with the BART model
pipe_out = pipe(sample_text)  # Generate a summary using the BART model
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))  # Store the generated summary in the dictionary
del pipe
del pipe_out



Device set to use cpu


In [11]:
summaries["DeepSeek"]

'The people of the State of California do enact as follows:\n\nSECTION 12300 of the Welfare and Institutions Code is amended to provide for the provision of supportive services to aged, blind, or disabled persons who cannot perform the services themselves and cannot safely remain in their homes.\nThe services include domestic cleaning, personal care, medical care, and paramedical services, which enable the recipient to establish an independent living arrangement.\nThe question is: What is the total number of distinct support services provided under this amended section?\n(Note that some of these services are provided in both the recipient\'s home and other locations as may be authorized by the director.)\nTo determine this, we need to count the number of distinct categories listed in the amendment.\nEach category is a separate service.\nSo, for example, if in the amendment, " assistance with ambulation" is listed twice, that would count as two separate services.\nTherefore, to find the

## Comparing summaries

In [19]:
print("ORIGINAL TEXT")
for sentence in sent_tokenize(sample_text):
    print(sentence)
print("")

print("HUMAN SUMMARY")
#print(dataset["train"][1]["highlights"])
print(billsum["train"][0]['summary'][:2000])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("")

ORIGINAL TEXT
The people of the State of California do enact as follows:


SECTION 1.
Section 12300 of the Welfare and Institutions Code is amended to read:
12300.
(a) The purpose of this article is to provide in every county in a manner consistent with this chapter and the annual Budget Act those supportive services identified in this section to aged, blind, or disabled persons, as defined under this chapter, who are unable to perform the services themselves and who cannot safely remain in their homes or abodes of their own choosing unless these services are provided.
(b) Supportive services shall include domestic services and services related to domestic services, heavy cleaning, personal care services, accompaniment by a provider when needed during necessary travel to health-related appointments or to alternative resource sites, yard hazard abatement, protective supervision, teaching and demonstration directed at reducing the need for other supportive services, and paramedical servi

But which of these summaries comes closest to the reference summary? To this end, we evaluate the results with the ROUGE metric set, which measures the overlap of n-grams between the machine-generated and reference summaries.

- ROUGE-1 measures the overlap of unigrams (single words) between the generated and reference summary; a higher score implies the generated summary captures more key terms from the reference.

- ROUGE-2 assesses the overlap of bigrams (two-word sequences); a higher score suggests more phrase-level fidelity and detail in the generated summary relative to the reference.

- ROUGE-L is based on the longest common subsequence (LCS) of words, so it rewards words that appear in the same order as in the reference even when they are not adjacent.

- ROUGE-Lsum is a variant of ROUGE-L for multi-sentence summaries: it splits both texts into sentences at newlines and computes the LCS sentence by sentence. A higher score suggests a better sentence-level match to the reference summary.
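To make the definitions concrete, here is a small from-scratch sketch of ROUGE-N and ROUGE-L as F-measures. It is an illustration under simplifying assumptions (lowercasing, alphanumeric tokens, no stemming) and is not the `rouge_score` implementation, so its values may differ slightly from those of the `evaluate` metric; all function names are made up for this example.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(seq, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    pred = ngrams(tokens(prediction), n)
    ref = ngrams(tokens(reference), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(prediction, reference):
    pred, ref = tokens(prediction), tokens(reference)
    return f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(prediction, reference, 1))  # 5 of 6 unigrams match
print(rouge_n(prediction, reference, 2))  # 3 of 5 bigrams match
print(rouge_l(prediction, reference))     # LCS "the cat on the mat" has length 5
```

Because the F-measure balances precision and recall, a very long generated text can score poorly even if it contains most of the reference's words.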


In [17]:
!pip install rouge_score



In [18]:
rouge_metric = evaluate.load("rouge")

In [25]:
reference = billsum["train"][0]['summary'] #dataset["train"][1]["highlights"]  # Get the reference summary from the dataset
records = []  # Initialize an empty list to store ROUGE scores
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]  # Define the names of the ROUGE metrics

for model_name in summaries:  # Iterate over the generated summaries
    rouge_metric.add(prediction=summaries[model_name], reference=reference)  # Add the prediction and reference to the ROUGE metric
    score = rouge_metric.compute()  # Compute the ROUGE scores
    rouge_dict = dict((rn, score[rn]) for rn in rouge_names)  # Create a dictionary of ROUGE scores
    records.append(rouge_dict)  # Append the ROUGE scores to the list

pd.DataFrame.from_records(records, index=summaries.keys())  # Create a DataFrame from the ROUGE scores


| model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.377622 | 0.099291 | 0.223776 | 0.27972 |
| sumy | 0.22963 | 0.059701 | 0.140741 | 0.192593 |
| gpt2 | 0.095238 | 0.0 | 0.095238 | 0.095238 |
| DeepSeek | 0.075206 | 0.018846 | 0.054054 | 0.072855 |
| bart | 0.237624 | 0.040404 | 0.138614 | 0.217822 |


Based on the ROUGE evaluation results, we can draw several conclusions about the performance of the summarization systems:

**Baseline (Three-Sentence Summary):**

With the highest scores on all four metrics (ROUGE-1: 0.378, ROUGE-2: 0.099, ROUGE-L: 0.224, and ROUGE-Lsum: 0.280), the baseline shows the strongest overlap with the reference summary. It shares many key terms with the reference, though the ROUGE-2 score below 0.1 shows that few two-word phrases match exactly.

**Sumy (TextRank):**

Achieving ROUGE-1: 0.230 and ROUGE-2: 0.060, Sumy performs moderately well, outperforming GPT-2 and DeepSeek. Its ROUGE-L: 0.141 and ROUGE-Lsum: 0.193 indicate a fair level of fluency and structural alignment, making it a viable classical approach when resources are limited.

**GPT-2:**

With ROUGE-1: 0.095 and ROUGE-2: 0.000, GPT-2 shows minimal overlap with the reference summary. The absence of bigram matches means that not a single two-word phrase is shared with the reference, and ROUGE-L: 0.095 confirms that word order agrees only marginally. This indicates that GPT-2, without fine-tuning, may not be well-suited for extractive or reference-aligned summarization tasks.

**DeepSeek:**

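Scoring ROUGE-1: 0.075, ROUGE-2: 0.019, ROUGE-L: 0.054, and ROUGE-Lsum: 0.073, DeepSeek has the lowest ROUGE-1, ROUGE-L, and ROUGE-Lsum of the five systems; only its ROUGE-2 exceeds GPT-2's 0.000. The output shown earlier opens with a plausible summary but then drifts into posing and answering a question about the bill, and this extra text likely dilutes the overlap with the short reference.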

**BART:**

With ROUGE-1: 0.238, ROUGE-2: 0.040, ROUGE-L: 0.139, and ROUGE-Lsum: 0.218, BART performs comparably to Sumy: it is slightly ahead on ROUGE-1 and ROUGE-Lsum and slightly behind on ROUGE-2 and ROUGE-L. It does not outperform the baseline, which is plausible given that this checkpoint was fine-tuned on news articles rather than legislative text.

**Summary:**

On this single truncated bill, the baseline method is the strongest performer, likely due to its simplicity and alignment with the reference format. Sumy and BART offer competitive alternatives, particularly in environments where classical or pretrained models are preferred. GPT-2 and DeepSeek, however, show limited effectiveness in this evaluation, highlighting the need for fine-tuning or task-specific adaptation. Because the comparison covers only one document, the ranking should be read as an illustration rather than a benchmark. Overall, these results emphasize the importance of aligning model capabilities with the nature of the summarization task and the characteristics of the target dataset.