# Part 2 - Zero-shot summaries

In this part we will use Hugging Face's high-level Pipeline API to create summaries with a pre-trained model. There are three main steps involved when you pass some text to a pipeline:

1) The text is preprocessed into a format the model can understand.

2) The preprocessed inputs are passed to the model.

3) The predictions of the model are post-processed, so you can make sense of them.
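
The three steps above can be sketched by hand. This is a minimal sketch, not necessarily what the pipeline does internally: it assumes the `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes from transformers, and the helper names `load_components` and `summarize_in_three_steps` are hypothetical.

```python
def load_components(checkpoint="sshleifer/distilbart-cnn-12-6"):
    """Load a tokenizer and a seq2seq model (downloads weights on first use)."""
    # Imported lazily so the helper below can be read and tried without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    return tokenizer, model


def summarize_in_three_steps(text, tokenizer, model, max_length=80):
    # 1) Preprocess: turn the raw string into token IDs the model understands.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # 2) Model: generate the token IDs of the summary.
    output_ids = model.generate(**inputs, max_length=max_length)
    # 3) Post-process: decode the generated IDs back into readable text.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


# Usage (not run here because it downloads the model):
# tokenizer, model = load_components()
# print(summarize_in_three_steps("Some long text ...", tokenizer, model))
```

The pipeline wraps these steps in a single call, which is why the rest of this part only needs one line per summary.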

In [8]:
from transformers import pipeline
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [25]:
from transformers import file_utils
print(file_utils.default_cache_path)

C:\Users\ASUS\.cache\huggingface\hub


- pipeline: A function provided by the Hugging Face transformers library that makes it easy to apply different Natural Language Processing (NLP) tasks, such as text classification, translation, and summarization. It returns a ready-to-use pipeline object for the specified task.
- "summarization": The first argument to the pipeline function, which specifies the task the pipeline should perform. Here, "summarization" configures the pipeline to summarize text.
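
The warning printed when the pipeline was created advises against relying on the default model. A minimal sketch of pinning the checkpoint and revision explicitly, using the values reported in that warning; the names `SUMMARIZER_KWARGS` and `build_summarizer` are hypothetical helpers, not part of transformers.

```python
# Pin both the checkpoint and its revision so results stay reproducible.
# The values are the ones reported in the warning printed by pipeline("summarization").
SUMMARIZER_KWARGS = {
    "task": "summarization",
    "model": "sshleifer/distilbart-cnn-12-6",
    "revision": "a4f8f3e",
}


def build_summarizer(**overrides):
    """Create the pipeline from the pinned settings (hypothetical helper)."""
    # Imported lazily; creating the pipeline downloads the model if it is not cached.
    from transformers import pipeline

    return pipeline(**{**SUMMARIZER_KWARGS, **overrides})


# Usage:
# summarizer = build_summarizer()
```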

In [31]:
#!pip install wrapt

In [None]:
#!pip install h5py

The h5py package is a Pythonic interface to the HDF5 binary data format.
HDF5 lets you store huge amounts of numerical data.

The following line of code shows which model is being used by default. This information is also available in the source code for pipelines: https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/__init__.py

In [29]:
summarizer.model.config._name_or_path

'sshleifer/distilbart-cnn-12-6'

The default model for the summarisation task is https://huggingface.co/sshleifer/distilbart-cnn-12-6, which has been trained on two datasets: https://huggingface.co/datasets/cnn_dailymail and https://huggingface.co/datasets/xsum. We will keep using this model, but a different one can be used by specifying it as shown below. All models trained for summarisation can be viewed here: https://huggingface.co/models?pipeline_tag=summarization&sort=downloads

In [None]:
# summarizer = pipeline("summarization", model='facebook/bart-large-cnn')

In [10]:
import pandas as pd
df_test = pd.read_csv('data/test.csv')
ref_summaries = list(df_test['summary'])
texts = list(df_test['text'])


Testing the pipeline with an abstract from the test dataset

In [30]:
summarizer(texts[0], max_length=80)

[{'summary_text': ' Threefold $X$ has a unique anticanonical section which is a Jacobian K3 Kummer surface $S$ of Picard number 17 . We construct an infinite-order pseudo-automorphism $\\phi_X$ on $X$, induced by the complete linear system of a divisor of degree 13 .'}]

Running the pipeline over all 2,000 examples. Because this takes a while, we print a counter every 100 examples to keep track of progress. The full run should take around 50 minutes.

In [None]:
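candidate_summaries = []

for i, text in enumerate(texts):
    # Print a progress counter every 100 examples
    if i % 100 == 0:
        print(i)
    # The pipeline returns a list with one dict per input text
    candidate = summarizer(text, min_length=5, max_length=20)
    candidate_summaries.append(candidate[0]['summary_text'])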
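
A pipeline can also be called with a list of texts and a `batch_size` argument, which may shorten the run, although a speed-up is not guaranteed (especially on CPU). The following is a sketch under that assumption; `summarize_in_batches` is a hypothetical helper that takes the summarizer as a parameter.

```python
def summarize_in_batches(summarizer, texts, batch_size=8, **generate_kwargs):
    """Summarize texts chunk by chunk, printing progress after each chunk."""
    summaries = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # A pipeline accepts a list of strings and returns one result per input.
        results = summarizer(chunk, batch_size=batch_size, **generate_kwargs)
        for result in results:
            # Depending on the library version, each result may be a dict
            # or a one-element list holding that dict.
            if isinstance(result, list):
                result = result[0]
            summaries.append(result["summary_text"])
        print(f"{len(summaries)}/{len(texts)} done")
    return summaries


# Usage with the notebook's objects:
# candidate_summaries = summarize_in_batches(summarizer, texts, min_length=5, max_length=20)
```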

Saving the candidate summaries in case we want to investigate further.

In [None]:
# Write one summary per line; the with-block closes the file even if an error occurs
with open("summaries/zero-shot-summaries.txt", "w", encoding="utf-8") as file:
    for s in candidate_summaries:
        file.write(s + "\n")
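
To investigate the summaries later, they need to be read back with one entry per line. A minimal sketch of a save/load pair; `save_summaries` and `load_summaries` are hypothetical helpers, and collapsing whitespace inside each summary is an assumption made here so that a summary containing a line break cannot split into two entries.

```python
from pathlib import Path


def save_summaries(path, summaries):
    """Write one summary per line, flattening any embedded line breaks."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    with path.open("w", encoding="utf-8") as f:
        for summary in summaries:
            # Collapse all runs of whitespace (including newlines) to single spaces.
            f.write(" ".join(summary.split()) + "\n")


def load_summaries(path):
    """Read the file back into a list with one summary per entry."""
    with Path(path).open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Usage with the notebook's objects:
# save_summaries("summaries/zero-shot-summaries.txt", candidate_summaries)
# candidate_summaries = load_summaries("summaries/zero-shot-summaries.txt")
```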

In [None]:
candidate_summaries[:5]

Calculating the ROUGE scores

In [None]:
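# Note: load_metric is deprecated in newer releases of datasets; the separate evaluate library replaces it
from datasets import load_metric
metric = load_metric("rouge")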

In [None]:
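def calc_rouge_scores(candidates, references):
    # Compute ROUGE with stemming so that inflected word forms count as matches
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    # Keep the mid F-measure of each ROUGE variant, as a percentage rounded to one decimal
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}
    return result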

In [None]:
calc_rouge_scores(candidate_summaries, ref_summaries)
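
For intuition about what the ROUGE-1 F-measure captures, here is a simplified sketch for a single candidate/reference pair. It is not the implementation used by the metric above: it splits on whitespace, lowercases, and skips stemming and bootstrap aggregation, and the function name `rouge1_f` is hypothetical.

```python
from collections import Counter


def rouge1_f(candidate, reference):
    """Unigram-overlap F-measure between two strings (simplified ROUGE-1)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())  # share of candidate words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


# Five of the six words match in each direction, so precision and recall are both 5/6.
print(rouge1_f("the cat sat on the mat", "the cat lay on the mat"))
```

ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L is based on the longest common subsequence.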