# Text summarization with Hugging Face Transformers

In [None]:
# To install the library in jupyter notebook environment
#!pip install transformers

## Import

In [60]:
from transformers import pipeline

## Load summarization pipeline

In [61]:
# load pre-trained summarization pipeline
summarizer = pipeline("summarization")
print(type(summarizer))

<class 'transformers.pipelines.text2text_generation.SummarizationPipeline'>


In [62]:
# the input size limit lives on the tokenizer, not on the pipeline object
summarizer.tokenizer.model_max_length

In [3]:
chapter = """
Judicial space
Regarding space,
Bruno Dayez carries out a topographical analysis of the trial which allows us to highlight several characteristics:
First, "the trial takes place in a defined, unchanging and closed place: the courtroom ", which is itself located within the courthouse, which at first glance makes us think of an imposing and austere temple where we have the impression that we are not necessarily welcome and that there is it's not good to live, while for my part, I can assure you that we have very good times.
Therefore, the place where the trial takes place is separate from the ordinary world, "the justice is a particular, autonomous operation that requires its detachment from the everyday world. "

It is therefore a space separated from the secular space of the city.
Regarding the publicity of the audiences, these cannot be filmed
and disseminated (except for trials of historical interest by
example). When the trial takes place at the last instance, it must
putting a definitive end to the conflict is what sets it apart from
the gear of private revenge.

Indeed, imagine it is broadcast on television, it would be subject to
recurrent manner in democratic debate, which would be an obstacle to its
essential function of social pacification: "Res judicata pro veritate
habetur, "the saying goes," Res judicata is held to be truth. "
So "except for the few palate rats baited by the smell of
sentence ", of which I am a part, are present at the hearing only those who
are summoned to appear there. And it is the press that delivers the only echo of what is happening
weft within the walls of the palace.
Then, concerning the interior space of the trial, it is divided into
regions.
Each speaker occupies a limited space, it defines the status
even of the speaker, the precise role he must fulfill. "It is forbidden to
put in the place of others because this substitution would risk throwing the
confusion in the artificial world of the trial. "

Like the auditorium, the courtroom has two differentiated spaces, the separation of which
can be materialized by a barrier, a rope or simply a
empty space. One, with benches, is for the public and the other for the stage
judicial proper.
The space is organized symmetrically on either side of the president,
whose chair is often slightly raised. The president is surrounded
assessors. Then, at the ends, we find the clerk on one side and the
the public prosecutor of the other (or the attorney general if the case falls within the jurisdiction of the seats).

So, in general, the public prosecutor is at the same level as the court.
The question then arises of the balance of power in the spatial organization of the
trial and in particular the asymmetrical position of the prosecutor and the lawyer
in relation to the judge. This geographical proximity between the prosecutor and the
judge could lead one to believe that the rights of the defense and the necessities of
repression are not on an equal footing. And finally, always separated
of the public, is the bar where witnesses come to testify, "the past in
flashback " to use the good word of Master Vergès. And then, on both sides
on the other hand, the benches reserved for the accused and his lawyer are distributed, and those
reserved for the victim and his lawyer. "The compartmentalisation of the actors
therefore already freezes in the personification of an action: accusing, defending, judging
or be judged. "

We can also add that in addition to the very precise function assigned to each
actor, the dress, that is to say the dressing, also makes it possible to better identify
the various protagonists of the judicial scene.
Antoine Garapon assigns him three main functions. A first
function of purification, it purifies the ordinary person before this one
exercise its institutional role. Here too, there is a desire to mark the
break between life and trial. Then, it aims to protect the
person who is about to perform the function which is proper to him, in him
conferring a feeling of superiority which will release it from violence
legitimate which it is called upon to exercise. And finally, it allows to signify the
victory of the appearing over the being, of the character over the person, of
the institution on the person. Antoine Garapon says it very rightly: "the dress
allows, for the wearer, identification with his character.
Contrary to the saying, in the trial it is the dress that makes the judge,
the lawyer and the prosecutor."
"""

In [19]:
# Summarize into a text of at least 30 and at most 130 tokens
# do_sample=False turns off random sampling: the decoder deterministically keeps the most probable next tokens
summarizer(chapter, max_length=130, min_length=30, do_sample=False)

[{'summary_text': ' Bruno Dayez says the trial takes place in a defined, unchanging and closed place: the courtroom . The space is organized symmetrically on either side of the president, the clerk and the public prosecutor of the other . Each speaker occupies a limited space, it defines the status of the speaker, the precise role he must fulfill . The public prosecutor is at the same level as the court .'}]

In [39]:
import re
import requests
import time
import random

# from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd

## Functions

In [40]:
def download_page(url):
    """
    Function to 
    - download page 
    - and parse it with BeautifulSoup
    """
    # download page
    response = requests.get(url)
    print(url, response.status_code)
    
    # parse page
    soup = bs(response.content, features="lxml")

    return soup


def get_book_links(soup, base_url) -> pd.DataFrame:
    """
    Function to get links of the first 25 books 
    from search page
    """
    # create empty dataframe
    cols = ["title", "author", "link", "cover_img_link", "book_id"]
    books_df = pd.DataFrame(columns=cols)

    # scrape info
    for rank, element in enumerate(soup.find_all("li", attrs={"class": "booklink"}), start=1):
        title = element.find("span", attrs={"class": "title"}).text
        author = element.find("span", attrs={"class": "subtitle"}).text
        link = base_url + element.find("a", attrs={"class": "link"}).get("href")
        cover_img_link = base_url + element.find("img", attrs={"class": "cover-thumb"}).get("src")
        book_id = re.findall(r"\d+", link)[0]
        # utf_8_txt_link = f"{base_url}/files/{book_id}/{book_id}-0.txt"
        books_df.loc[rank] = [title, author, link, cover_img_link, book_id]
        
    return books_df

def get_book_text(link, base_url):
    """
    Function to get book text in plain text UTF-8
    """
    # download book page
    soup = download_page(link)

    # scrape book text link
    book_text_link = base_url + soup.find("a", attrs={"class": "link", "type": "text/plain"}).get("href")

    # slow down requests frequency to avoid IP ban
    time.sleep(random.uniform(2.0, 3.0))

    # download book text page
    response = requests.get(book_text_link)
    print(book_text_link, response.status_code)
    
    # return book text
    return response.text
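
`get_book_links` takes the first run of digits in the link as the book id. That works for `https://www.gutenberg.org/ebooks/31547`, but it would pick up the wrong number if the host name itself contained digits. A minimal sketch of a stricter variant, assuming book pages always follow the `/ebooks/<id>` pattern seen above (`extract_book_id` and the mirror URL are hypothetical, for illustration only):

```python
import re
from typing import Optional

# Only accept digits that directly follow "/ebooks/"
BOOK_ID_RE = re.compile(r"/ebooks/(\d+)")


def extract_book_id(link: str) -> Optional[str]:
    """Return the numeric book id from a book page link, or None if there is none."""
    match = BOOK_ID_RE.search(link)
    return match.group(1) if match else None


print(extract_book_id("https://www.gutenberg.org/ebooks/31547"))
# hypothetical mirror host with a digit in its name
print(extract_book_id("https://mirror2.example.org/ebooks/84"))
# search pages have no book id
print(extract_book_id("https://www.gutenberg.org/ebooks/search/?query=asimov"))
```

Returning `None` instead of raising lets the caller skip rows that are not book pages.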

In [None]:
# define base url for links
base_url = "https://www.gutenberg.org"

# download page of most popular books
url='https://www.gutenberg.org/ebooks/search/?sort_order=downloads'
soup = download_page(url)

In [39]:
# input search query
search = input("search for books, authors, genre, ...")

# prepare search query url in required format
search_book_url = "https://www.gutenberg.org/ebooks/search/?query="
search_book_url += "+".join(search.split(" "))

# download page
soup = download_page(search_book_url)

https://www.gutenberg.org/ebooks/search/?query=asimov 200
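
Joining the words with `+` works for plain queries, but characters such as `&` or `?` would change the meaning of the URL. A minimal sketch that delegates the escaping to the standard library; `build_search_url` is a hypothetical helper name, and the endpoint is the one used in the cell above:

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.gutenberg.org/ebooks/search/"


def build_search_url(query: str) -> str:
    """Return the search URL for a free-text query, with spaces and symbols escaped."""
    # urlencode turns spaces into "+" and percent-encodes reserved characters
    return f"{SEARCH_URL}?{urlencode({'query': query})}"


print(build_search_url("isaac asimov"))
print(build_search_url("war & peace"))
```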


In [42]:
search_df = get_book_links(soup, base_url)
search_df

| | title | author | link | cover_img_link | utf_8_txt_link |
|---|---|---|---|---|---|
| 1 | Youth | Isaac Asimov | https://www.gutenberg.org/ebooks/31547 | https://www.gutenberg.org/cache/epub/31547/pg3... | https://www.gutenberg.org/files/31547/31547-0.txt |
| 2 | Worlds Within Worlds: The Story of Nuclear Ene... | Isaac Asimov | https://www.gutenberg.org/ebooks/49819 | https://www.gutenberg.org/cache/epub/49819/pg4... | https://www.gutenberg.org/files/49819/49819-0.txt |
| 3 | 100 New Yorkers of the 1970s | Max Millard | https://www.gutenberg.org/ebooks/17385 | https://www.gutenberg.org/cache/epub/17385/pg1... | https://www.gutenberg.org/files/17385/17385-0.txt |
| 4 | Worlds Within Worlds: The Story of Nuclear Ene... | Isaac Asimov | https://www.gutenberg.org/ebooks/49821 | https://www.gutenberg.org/cache/epub/49821/pg4... | https://www.gutenberg.org/files/49821/49821-0.txt |
| 5 | The Genetic Effects of Radiation | Isaac Asimov and Theodosius Dobzhansky | https://www.gutenberg.org/ebooks/55738 | https://www.gutenberg.org/cache/epub/55738/pg5... | https://www.gutenberg.org/files/55738/55738-0.txt |
| 6 | Worlds Within Worlds: The Story of Nuclear Ene... | Isaac Asimov | https://www.gutenberg.org/ebooks/49820 | https://www.gutenberg.org/cache/epub/49820/pg4... | https://www.gutenberg.org/files/49820/49820-0.txt |


In [3]:
# define base url for links
base_url = "https://www.gutenberg.org"
# book_link = search_df.loc[1, "link"]
# book_text = get_book_text(book_link, base_url)
book_text = get_book_text("https://www.gutenberg.org/ebooks/31547", base_url)

https://www.gutenberg.org/ebooks/31547 200
https://www.gutenberg.org/ebooks/31547.txt.utf-8 200


In [44]:
def get_book_text(link):
    """
    Function to get book text in plain text UTF-8
    """
    # download book page
    soup = download_page(link)

    # define base url for links
    base_url = "https://www.gutenberg.org"

    # scrape book text link (escape the dot so that only a literal ".txt" matches)
    book_text_link = base_url + soup.find("a", attrs={"class": "link"}, href=re.compile(r"\.txt")).get("href")

    # slow down requests frequency to avoid IP ban
    time.sleep(random.uniform(2.0, 3.0))

    # download book text page
    response = requests.get(book_text_link)
    print(book_text_link, response.status_code)

    # decode as UTF-8 and drop the byte order mark at the start of the file
    response.encoding = "utf-8-sig"

    # return book text
    return response.text


book_text = get_book_text("https://www.gutenberg.org/ebooks/64317")

https://www.gutenberg.org/ebooks/64317 200
https://www.gutenberg.org/files/64317/64317-0.txt 200


In [45]:
print(type(book_text))

<class 'str'>


In [46]:
print(book_text[:1000])

The Project Gutenberg eBook of The Great Gatsby, by F. Scott Fitzgerald

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: The Great Gatsby

Author: F. Scott Fitzgerald

Release Date: January 17, 2021 [eBook #64317]
[Most recently updated: January 24 2021]

Language: English

Character set encoding: UTF-8

Produced by: Alex Cabal for the Standard Ebooks project, based on a
             transcription produced for Project Gutenberg Australia.

*** START OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***


			   The Great Gatsby
				  by
			 F. Scott Fitzgerald



In [47]:
# find index position where metadata from website ends
metadata_end_idx = book_text.rfind("***",0,1000)

In [48]:
book_text[metadata_end_idx: 1000]

'***\r\n\r\n\r\n\t\t\t   The Great Gatsby\r\n\t\t\t\t  by\r\n\t\t\t F. Scott Fitzgerald\r\n'
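
Searching for `***` within the first 1000 characters depends on how long the header happens to be. A more explicit sketch looks for the marker line itself. The `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` line appears in the output above. The matching `*** END OF ...` line and the older `THIS` wording are assumptions about other Project Gutenberg files. `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
import re

# Marker lines that wrap the actual book text
START_RE = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)
END_RE = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines, if they are present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin_idx = start.end() if start else 0
    end_idx = end.start() if end else len(raw)
    return raw[begin_idx:end_idx].strip()


# made-up miniature file for illustration
sample = (
    "Title: X\r\n\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK X ***\r\n\r\n"
    "Chapter 1\r\nText.\r\n\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK X ***\r\nLicense"
)
print(repr(strip_gutenberg_boilerplate(sample)))
```

If neither marker is found, the function falls back to returning the whole text, stripped of surrounding whitespace.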

In [77]:
len(book_text)

78716

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
  
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum")

In [5]:
from transformers import pipeline

# requires `model` and `tokenizer` from the previous cell
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
summarizer(book_text[metadata_end_idx:metadata_end_idx + 3500], min_length=30, max_length=130, do_sample=False)

In [None]:
# Scratch cell: collect listing links with Selenium instead of BeautifulSoup
# soup = bs(self._data)
# link_tags = soup.main.find_all("a", attrs={"class": "property-content"})
# self._links = [link.attrs["href"] for link in link_tags]
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By


def get_listing_links(page_url, n_pages=334):
    """
    Function to collect links of houses and apartments
    from the first n_pages result pages
    """
    # Create webdriver object
    driver = webdriver.Firefox()

    # Wait up to 30 seconds for elements to appear
    driver.implicitly_wait(30)
    driver.get(page_url)

    # When opening the url on Firefox, a pop-up window appears.
    # Click on "Keep browsing" to get to the actual page.
    driver.find_element(By.XPATH, "//button[@id='uc-btn-accept-banner']").click()

    # Search for all houses and apartments
    # 1. Select "House and apartment" label
    driver.find_element(By.XPATH, "//button[@id='propertyTypesDesktop']").click()
    driver.find_element(By.XPATH, "//li[@data-value='HOUSE,APARTMENT']").click()

    # 2. Click on search
    driver.find_element(By.XPATH, "//button[@id='searchBoxSubmitButton']").click()

    # 3. Get links of houses and apartments on each result page
    links = []
    for _ in range(n_pages):
        # Retry up to 5 times if the page is not ready yet
        for _attempt in range(5):
            try:
                link_tags = driver.find_elements(By.XPATH, "//a[@class='card__title-link']")
                links.extend([link.get_attribute("href") for link in link_tags])
                break
            except WebDriverException:
                continue

        # Navigate to next page
        driver.find_element(
            By.XPATH,
            "//a[@class='pagination__link pagination__link--next button button--text button--size-small']",
        ).click()

    driver.quit()
    return links


# BeautifulSoup lookup snippets (notes only; `name` and `label` are defined elsewhere)
# soup.find("th", string=re.compile(name, re.IGNORECASE))
# soup.select_one(".classified__title").text.strip().lower()
# soup.select_one('th:-soup-contains("area")')
# label.next_sibling.next_sibling.contents[0].strip()

In [36]:
from bs4 import BeautifulSoup as bs
import re
html = """
<a href="/files/84/84-0.txt" type="text/plain; charset=utf-8" class="link" title="Download">Plain Text UTF-8</a>
"""
soup = bs(html, features="lxml")
# escape the dot: r".txt" would also match any character followed by "txt"
soup.find("a", attrs={"class": "link"}, href=re.compile(r"\.txt")).get("href")
# soup.select('a[href*=".txt"]')
# soup.find_all(href=re.compile(r'\.txt'))

'/files/84/84-0.txt'
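
In a regular expression an unescaped `.` matches any character, so `r".txt"` accepts more links than intended. A small sketch of the difference, using only the `re` module; the third and fourth links in the list are hypothetical, made up to show a false positive and a non-match:

```python
import re

hrefs = [
    "/files/84/84-0.txt",
    "/ebooks/84.txt.utf-8",
    "/files/84/84-plaintxt.html",  # hypothetical link: "ntxt" satisfies ".txt"
    "/files/84/84-h.htm",
]

loose = re.compile(r".txt")    # "." matches any character
strict = re.compile(r"\.txt")  # "\." matches a literal dot only

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)
print(strict_hits)
```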

In [8]:
ALLOWED_EXTENSIONS = {'txt', 'pdf'}

In [12]:
file = "qs.az.txt"
# split string into list only 1 time and from the right
file.rsplit(".", 1)[1]

'txt'
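
The two cells above can be combined into a single check for uploaded file names. This is a minimal sketch; `allowed_file` is a hypothetical helper name, and lower-casing the extension is an added assumption so that `book.PDF` is accepted as well:

```python
ALLOWED_EXTENSIONS = {"txt", "pdf"}


def allowed_file(filename: str) -> bool:
    """Return True if the text after the last dot is an allowed extension."""
    # rsplit from the right, once, so "qs.az.txt" yields the extension "txt"
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


print(allowed_file("qs.az.txt"))
print(allowed_file("archive.tar.gz"))
```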

# Summarize larger text
Source: https://github.com/nicknochnack/Longform-Summarization-with-Hugging-Face/blob/main/LongSummarization.ipynb

In [1]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests

# 1. load summarization pipeline
summarizer = pipeline("summarization")

# 2. download page (only one URL is used; the alternative is kept commented out)
# URL = "https://towardsdatascience.com/a-bayesian-take-on-model-regularization-9356116b6457"
URL = "https://hackernoon.com/will-the-game-stop-with-gamestop-or-is-this-just-the-beginning-2j1x32aa"

r = requests.get(URL)

soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all(['h1', 'p'])
text = [result.text for result in results]
ARTICLE = ' '.join(text)
ARTICLE

''

In [None]:
# 3. chunk text
max_chunk = 500

ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')

sentences = ARTICLE.split('<eos>')
current_chunk = 0 
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1: 
        if len(chunks[current_chunk]) + len(sentence.split(' ')) <= max_chunk:
            chunks[current_chunk].extend(sentence.split(' '))
        else:
            current_chunk += 1
            chunks.append(sentence.split(' '))
    else:
        print(current_chunk)
        chunks.append(sentence.split(' '))

for chunk_id in range(len(chunks)):
    chunks[chunk_id] = ' '.join(chunks[chunk_id])

len(chunks)
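
The chunking step can also be written as a reusable function. The sketch below keeps the same idea (split into sentences, then fill each chunk up to a word budget) but splits with a regular expression instead of inserting `<eos>` markers. `chunk_by_words` is a hypothetical name, and counting whitespace-separated words is only a rough stand-in for the model's token count.

```python
import re


def chunk_by_words(text: str, max_words: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # split after ".", "?" or "!" when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # start a new chunk when this sentence would exceed the budget
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six? Seven eight! Nine."
print(chunk_by_words(demo, max_words=6))
```

A single sentence longer than `max_words` still ends up in a chunk of its own, so the summarizer call may need `truncation=True` for such cases.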

In [None]:
# 4. summarize each chunk
res = summarizer(chunks, max_length=120, min_length=30, do_sample=False)
res[0]

In [None]:
# 5. join the chunk summaries
text = ' '.join([summ['summary_text'] for summ in res])

In [None]:
# 6. output to a file
with open('blogsummary.txt', 'w') as f:
    f.write(text)

In [33]:
book = 1
f'href="{{{{ url_for(\'summarize_book\', book={book}) }}}}"'
# 'href="{{ url_for(' + "'summarize_book'" + ', book=book) }}"')

'href="{{ url_for(\'summarize_book\', book=1) }}"'
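
Inside an f-string, `{{` produces one literal `{`, so four braces are needed to emit the `{{` that Jinja expects. A small sketch that wraps this in a function; `jinja_href` is a hypothetical helper name:

```python
def jinja_href(endpoint: str, book: int) -> str:
    """Build an href attribute that contains a Jinja url_for() expression."""
    # "{{{{" -> "{{" and "}}}}" -> "}}" in the output; {endpoint} and {book} are substituted
    return f'href="{{{{ url_for(\'{endpoint}\', book={book}) }}}}"'


print(jinja_href("summarize_book", 1))
```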

In [None]:
href="{{ url_for('summarize_book', book=book) }}"

In [19]:
import time

In [18]:
# gensim.summarization was removed in gensim 4.0; this cell needs gensim < 4 (e.g. 3.8.3)
import gensim
from gensim.summarization import summarize

In [21]:
text = """The story is staged in the distant future within our own Milky Way Galaxy, approximately in the late 36th century. A portion of the galaxy is filled with terraformed worlds inhabited by interstellar traveling human beings. For 150 years two mighty space powers have intermittently warred with each other: the Galactic Empire and the Free Planets Alliance.
Within the Galactic Empire, based on mid 19th century Prussia, an ambitious military genius, Reinhard von Müsel, later conferred Reinhard von Lohengramm, is rising to power. He is driven by the desire to free his sister Annerose, who was taken by the Kaiser as a concubine. Later, he wants not only to end the corrupt Goldenbaum dynasty but also to defeat the Free Planets Alliance and to unify the whole galaxy under his rule.
In the Free Planets Alliance Star Fleet is another genius, Yang Wen Li. He originally aspired to become a historian through a military academy, and joined the tactical division only out of need for tuition money. He was rapidly promoted to an admiral because he demonstrated excellence in military strategy in a number of decisive battles and conflicts. He becomes the archrival of Reinhard, though they highly respect one another. Unlike Reinhard he is better known for his underdog victories and accomplishments in overcoming seemingly impossible odds and mitigating casualties and damages due to military operations.
As a historian, Yang often predicts the motives behind his enemies and narrates the rich history of his world and comments on it. One of his famous quotes is: “There are few wars between good and evil; most are between one good and another good.”
Besides the two main heroes, the story is full of vivid characters and intricate politics. All types of characters, from high nobility, admirals and politicians, to common soldiers and farmers, are interwoven into the story. The story frequently switches away from the main heroes to the Unknown Soldier fighting for his life on the battlefield.
There is a third neutral power nominally attached to the Galactic Empire called the Phezzan Dominion, a planet-state (city-state on a galactic scale) which trades with both warring powers. There is also a Terraism cult, which claims that humans should go back to Earth, gaining popularity throughout the galaxy. Throughout the story executive political figures of Phezzan in concert with the upper-hierarchy of the Terraism cult orchestrate a number of conspiracies to shift the tide of the galactic war so that it may favor their objectives.
"""

In [24]:
# start timer
start_time = time.time()

# pass text corpus to summarizer
summary = summarize(text)

# stop timer and compute the execution time
end_time = time.time()
diff_time = (end_time - start_time)
print(f"\nTime to summarize: {diff_time:.2f} seconds")
summary


Time to summarize: 0.01 seconds


'For 150 years two mighty space powers have intermittently warred with each other: the Galactic Empire and the Free Planets Alliance.\nIn the Free Planets Alliance Star Fleet is another genius, Yang Wen Li. He originally aspired to become a historian through a military academy, and joined the tactical division only out of need for tuition money.\nThroughout the story executive political figures of Phezzan in concert with the upper-hierarchy of the Terraism cult orchestrate a number of conspiracies to shift the tide of the galactic war so that it may favor their objectives.'

In [26]:
# start timer
start_time = time.time()

# pass text corpus to summarizer
## ratio = proportion of summary compared with text
summary = summarize(text, ratio=0.1)

# stop timer and compute the execution time
end_time = time.time()
diff_time = (end_time - start_time)
print(f"\nTime to summarize: {diff_time:.2f} seconds")
summary


Time to summarize: 0.01 seconds


'For 150 years two mighty space powers have intermittently warred with each other: the Galactic Empire and the Free Planets Alliance.'

In [30]:
# start timer
start_time = time.time()

# pass text corpus to summarizer
## word_count = number of words in summary
summary = summarize(text, word_count=25)

# stop timer and compute the execution time
end_time = time.time()
diff_time = (end_time - start_time)
print(f"\nTime to summarize: {diff_time:.2f} seconds")
print(len(summary.split(' ')))
summary


Time to summarize: 0.01 seconds
21


'For 150 years two mighty space powers have intermittently warred with each other: the Galactic Empire and the Free Planets Alliance.'
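
The start/stop timer code is repeated in each of the three cells above. One way to avoid the repetition is a small context manager. This is a sketch: `timer` is a hypothetical helper, the body of the `with` block is a stand-in for the real `summarize(text)` call, and `time.perf_counter()` is used because it is intended for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label: str):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"\nTime to {label}: {result['seconds']:.2f} seconds")


with timer("summarize") as t:
    # stand-in workload; in the notebook this would be summarize(text)
    summary = " ".join("a b c".split())
```

The elapsed time is also available afterwards as `t["seconds"]`.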

In [31]:
# Importing requirements
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration


# Instantiate model and tokenizer 
my_model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

Downloading: 100%|██████████| 1.20k/1.20k [00:00<00:00, 202kB/s]
Downloading: 100%|██████████| 242M/242M [00:24<00:00, 10.1MB/s]
Downloading: 100%|██████████| 792k/792k [00:00<00:00, 1.19MB/s]
Downloading: 100%|██████████| 1.39M/1.39M [00:01<00:00, 1.14MB/s]


In [32]:
# Prefix the task string "summarize: " (with a trailing space) to the original text
text_t5 = "summarize: " + text

In [38]:
# T5 is an encoder-decoder model
# => encode the text into input ids (a sequence of token ids)
input_ids = tokenizer.encode(text_t5, return_tensors='pt', max_length=512, truncation=True)


In [39]:
# Generate summary ids (no max_length is passed, so generation stops at the default of 20 tokens)
summary_ids = my_model.generate(input_ids)
summary_ids

tensor([[    0,     8,   733,    19,  1726,    26,    16,     8, 10382,   647,
           441,    69,   293, 18389,    63,  5994, 24856,     3,     5,     8]])

In [40]:
# decode tensor of input-ids (inverse of encode())
t5_summary = tokenizer.decode(summary_ids[0])
print(t5_summary)


<pad> the story is staged in the distant future within our own Milky Way galaxy. the


In [41]:
len(t5_summary)

84

In [42]:
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [43]:
# Load model and tokenizer for bart-large-cnn
# pretrained model fine tuned for summarization task

tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

Downloading: 100%|██████████| 899k/899k [00:01<00:00, 802kB/s]
Downloading: 100%|██████████| 456k/456k [00:00<00:00, 694kB/s]
Downloading: 100%|██████████| 1.36M/1.36M [00:01<00:00, 959kB/s] 
Downloading: 100%|██████████| 1.40k/1.40k [00:00<00:00, 1.35MB/s]
Downloading: 100%|██████████| 1.63G/1.63G [02:50<00:00, 9.55MB/s]


In [44]:
# Encode inputs
inputs: dict = tokenizer.batch_encode_plus([text],return_tensors='pt')
print(inputs)

# Generate ids
summary_ids = model.generate(inputs['input_ids'], early_stopping=True)

{'input_ids': tensor([[    0,   133,   527,    16, 10899,    11,     5, 13258,   499,   624,
            84,   308, 36713,  4846,  5325,     6,  2219,    11,     5,   628,
          2491,   212,  3220,     4,    83,  4745,     9,     5, 22703,    16,
          3820,    19,  8470,   763, 10312, 14490, 36308,    30, 43240,  5796,
          1050, 14766,     4,   286,  3982,   107,    80, 23514,   980,  4361,
            33, 41870,  7240,   997,  2050,    19,   349,    97,    35,     5,
         40332, 11492,     8,     5,  3130,  5427,  2580,  6035,     4, 50118,
         35469,     5, 40332, 11492,     6,   716,    15,  1084,   753,   212,
          3220,  2869, 17280,     6,    41,  8263,   831, 16333,     6, 17789,
          9635,  5689, 28168,  5317,     6,   423, 35851, 17789,  9635,  5689,
          5463, 31075,  4040,   119,     6,    16,  2227,     7,   476,     4,
            91,    16,  3185,    30,     5,  4724,     7,   481,    39,  2761,
           660,  1396,  3876,     6,  

In [45]:
# Decode summary
bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(bart_summary)

The story is staged in the distant future within our own Milky Way Galaxy, approximately in the late 36th century. For 150 years two mighty space powers have intermittently warred with each other: the Galactic Empire and the Free Planets Alliance. The story frequently switches away from the main heroes to the Unknown Soldier fighting for his life on the battlefield.


In [46]:
# Importing model and tokenizer
from transformers import GPT2Tokenizer,GPT2LMHeadModel

Downloading: 100%|██████████| 1.04M/1.04M [00:01<00:00, 831kB/s]
Downloading: 100%|██████████| 456k/456k [00:01<00:00, 335kB/s]
Downloading: 100%|██████████| 1.36M/1.36M [00:02<00:00, 643kB/s]
Downloading: 100%|██████████| 665/665 [00:00<00:00, 576kB/s]
Downloading: 100%|██████████| 548M/548M [00:56<00:00, 9.73MB/s]
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 512, but ``max_length`` is set to 20.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


In [50]:
# Instantiating the model and tokenizer with gpt-2
tokenizer=GPT2Tokenizer.from_pretrained('gpt2')
model=GPT2LMHeadModel.from_pretrained('gpt2')

In [55]:
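# Encode text to get input ids & pass them to model.generate()
# Note: for GPT-2, max_length counts the prompt tokens too; a 512-token prompt with
# max_length=200 leaves no room for new tokens (see the warning below)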
inputs=tokenizer.batch_encode_plus([text],return_tensors='pt',max_length=512, truncation=True)
summary_ids=model.generate(inputs['input_ids'],early_stopping=True, max_length=200)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 512, but ``max_length`` is set to 200.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.
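
The warning above comes down to simple arithmetic: for a decoder-only model such as GPT-2, `max_length` covers the prompt plus the generated tokens. The sketch below only illustrates that bookkeeping and does not call the library. `new_token_budget` is a hypothetical helper, and 1024 is GPT-2's context size.

```python
GPT2_CONTEXT_SIZE = 1024  # maximum number of positions GPT-2 can attend to


def new_token_budget(prompt_len: int, max_length: int, context_size: int = GPT2_CONTEXT_SIZE) -> int:
    """Tokens left for generation when max_length includes the prompt."""
    total = min(max_length, context_size)
    return max(0, total - prompt_len)


print(new_token_budget(prompt_len=512, max_length=200))  # 0: the prompt alone exceeds max_length
print(new_token_budget(prompt_len=512, max_length=712))  # 200 new tokens
```

Recent versions of `generate()` also accept `max_new_tokens`, which states the number of generated tokens directly instead of a total length.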


In [56]:
# Decode
GPT_summary=tokenizer.decode(summary_ids[0],skip_special_tokens=True)
print(GPT_summary)

The story is staged in the distant future within our own Milky Way Galaxy, approximately in the late 36th century. A portion of the galaxy is filled with terraformed worlds inhabited by interstellar traveling human beings. For 150 years two mighty space powers have intermittently warred with each other: the Galactic Empire and the Free Planets Alliance.
Within the Galactic Empire, based on mid 19th century Prussia, an ambitious military genius, Reinhard von Müsel, later conferred Reinhard von Lohengramm, is rising to power. He is driven by the desire to free his sister Annerose, who was taken by the Kaiser as a concubine. Later, he wants not only to end the corrupt Goldenbaum dynasty but also to defeat the Free Planets Alliance and to unify the whole galaxy under his rule.
In the Free Planets Alliance Star Fleet is another genius, Yang Wen Li. He originally aspired to become a historian through a military academy, and joined the tactical division only out of need for tuition money. He 

### Summarization with XLM Transformers
The summary from XLM is not very good, as the model was not fine-tuned for summarization.

In [57]:
# Importing model and tokenizer
from transformers import XLMWithLMHeadModel, XLMTokenizer

Downloading: 100%|██████████| 646k/646k [00:01<00:00, 531kB/s]
Downloading: 100%|██████████| 487k/487k [00:00<00:00, 501kB/s]
Downloading: 100%|██████████| 840/840 [00:00<00:00, 498kB/s]
Downloading: 100%|██████████| 2.67G/2.67G [04:14<00:00, 10.5MB/s]
Some weights of XLMWithLMHeadModel were not initialized from the model checkpoint at xlm-mlm-en-2048 and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [58]:
# Instantiating the model and tokenizer 
tokenizer=XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
model=XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')

# Encoding text to get input ids & pass them to model.generate()
inputs=tokenizer.batch_encode_plus([text],return_tensors='pt',max_length=512)
summary_ids=model.generate(inputs['input_ids'],early_stopping=True)

# Decode and print the summary
XLM_summary=tokenizer.decode(summary_ids[0],skip_special_tokens=True)
print(XLM_summary)

Some weights of XLMWithLMHeadModel were not initialized from the model checkpoint at xlm-mlm-en-2048 and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Input length of input_ids is 512, but ``max_length`` is set to 20.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.
the story is staged in the distant future within our own milky way galaxy, approximately in the late 36th century. a portion of the galaxy is filled with terraformed worlds inhabited 

In [59]:
tokenizer.max_model_input_sizes

{'xlm-mlm-en-2048': 512,
 'xlm-mlm-ende-1024': 512,
 'xlm-mlm-enfr-1024': 512,
 'xlm-mlm-enro-1024': 512,
 'xlm-mlm-tlm-xnli15-1024': 512,
 'xlm-mlm-xnli15-1024': 512,
 'xlm-clm-enfr-1024': 512,
 'xlm-clm-ende-1024': 512,
 'xlm-mlm-17-1280': 512,
 'xlm-mlm-100-1280': 512}

In [None]:
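import os
from distutils.util import strtobool  # assumed origin of strtobool; distutils was removed in Python 3.12
from itertools import chain

from nltk.tokenize import RegexpTokenizer, sent_tokenize
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


# BaseTextSummarizer is assumed to be defined elsewhere in the project
class TransformersTextSummarizer(BaseTextSummarizer):
    def __init__(self, model_key, language):
        self._tokenizer = AutoTokenizer.from_pretrained(model_key)

        self._language = language

        self._model = AutoModelForSeq2SeqLM.from_pretrained(model_key)

        # default to CPU when USE_GPU is not set (strtobool(None) would raise)
        self._device = 'cuda' if bool(strtobool(os.getenv('USE_GPU', 'false'))) else 'cpu'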

    def __chunk_text(self, text):
        sentences = [ s + ' ' for s in sentence_segmentation(text, minimum_n_words_to_accept_sentence=1, language=self._language) ]

        chunks = []

        chunk = ''

        length = 0

        for sentence in sentences:
            tokenized_sentence = self._tokenizer.encode(sentence, truncation=False, max_length=None, return_tensors='pt') [0]

            if len(tokenized_sentence) > self._tokenizer.model_max_length:
                continue

            length += len(tokenized_sentence)

            if length <= self._tokenizer.model_max_length:
                chunk = chunk + sentence
            else:
                chunks.append(chunk.strip())
                chunk = sentence
                length = len(tokenized_sentence)

        if len(chunk) > 0:
            chunks.append(chunk.strip())

        return chunks

    def __clean_text(self, text):
        if text.count('.') == 0:
            return text.strip()

        end_index = text.rindex('.') + 1

        return text[0 : end_index].strip()

    def summarize(self, text, *args, **kwargs):
        chunk_texts = self.__chunk_text(text)

        chunk_summaries = []

        for chunk_text in chunk_texts:
            input_tokenized = self._tokenizer.encode(chunk_text, truncation=True, return_tensors='pt')

            input_tokenized = input_tokenized.to(self._device)

            # Scale the summary length by the chunk's token count; len(chunk_text) would count characters, not tokens
            n_tokens = input_tokenized.shape[1]

            summary_ids = self._model.to(self._device).generate(input_tokenized, length_penalty=3.0, min_length=int(0.2 * n_tokens), max_length=int(0.3 * n_tokens), early_stopping=True, num_beams=5, no_repeat_ngram_size=2)

            output = [self._tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in summary_ids]

            chunk_summaries.append(output)

        summaries = [ self.__clean_text(text) for chunk_summary in chunk_summaries for text in chunk_summary ]

        return summaries

def sentence_segmentation(document, minimum_n_words_to_accept_sentence, language):
    paragraphs = list(filter(lambda o: len(o.strip()) > 0, document.split('\n')))

    paragraphs = [ p.strip() for p in paragraphs ]

    paragraph_sentences = [ sent_tokenize(p, language=language) for p in paragraphs ]

    paragraph_sentences = chain(*paragraph_sentences)

    paragraph_sentences = [ s.strip() for s in paragraph_sentences ]

    normal_word_tokenizer = RegexpTokenizer(r'[^\W_]+')

    paragraph_sentences = filter(lambda o: len(normal_word_tokenizer.tokenize(o)) >= minimum_n_words_to_accept_sentence, paragraph_sentences)

    return list(paragraph_sentences)

My approach is to “explode” the input text into sentences, use the transformer tokenizer to get the token length of each sentence, and compute a chunking that is roughly uniform in length and never splits a sentence. This is the function that I am using, where RE_SPLITTER is ‘\.(?!\d)|\n’ (a period not followed by a digit, or a newline):

In [None]:
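import re
import numpy as np

# split on a period that is not followed by a digit, or on a newline
RE_SPLITTER = r'\.(?!\d)|\n'

# NOTE: tokenizer and MODEL_MAX_LEN are assumed to be defined in an earlier cell
def chunk_text(text, num_tok):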
    text_sent =\
        [sent.strip()+'.' for sent in re.split(RE_SPLITTER, text) if len(sent) > 1]

    # calculate number of tokens per sentence
    num_tok_sent = [len(tokenizer.tokenize(sent)) for sent in text_sent]
    
    # calculate chunk dimension to fit into model
    n = int(np.ceil(num_tok / MODEL_MAX_LEN))
    len_chunk = int(num_tok / n)

    # get a more uniform splitting to avoid splits
    # which are too short at the end
    if len_chunk+50 > MODEL_MAX_LEN:
        len_chunk = int(num_tok / (n+1))
    
    len_curr = 0
    text_curr = []
    text_chunk = []
    for te, len_sent in zip(text_sent, num_tok_sent):

        if len_curr + len_sent < len_chunk:
            text_curr.append(te)
            len_curr += len_sent

        elif len_curr + len_sent >= MODEL_MAX_LEN:
            text_chunk.append(text_curr)

            text_curr = [te]
            len_curr = len_sent

        else: # >= len_chunk && < MODEL_MAX_LEN
            text_curr.append(te)
            text_chunk.append(text_curr)
            
            text_curr = []
            len_curr = 0

    if len_curr > 0:
        text_chunk.append(text_curr)

    return text_chunk
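
To make the chunk-size arithmetic in the function above easier to follow, here is a small standalone sketch of just that step. `target_chunk_len` is a hypothetical helper name, and the 50-token margin is simply copied from the function above; treat it as an illustration under those assumptions, not as a replacement.

```python
import math

def target_chunk_len(num_tok: int, model_max_len: int, margin: int = 50) -> int:
    """Return the per-chunk token target for splitting num_tok tokens evenly."""
    if num_tok <= 0:
        return 0
    # smallest number of chunks that can hold all tokens
    n = math.ceil(num_tok / model_max_len)
    len_chunk = num_tok // n
    # if the target sits too close to the model limit, spread over one more chunk
    if len_chunk + margin > model_max_len:
        len_chunk = num_tok // (n + 1)
    return len_chunk

print(target_chunk_len(2403, 1024))  # 3 chunks of about 801 tokens each
print(target_chunk_len(2040, 1024))  # 2 chunks would be 1020 each, so 3 chunks of 680
```

The second call shows why the margin exists: two chunks of 1020 tokens would leave almost no room to finish a sentence before hitting the limit.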

In [63]:
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

long_text = "This is a very very long text. " * 300

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# tokenize without truncation
inputs_no_trunc = tokenizer(long_text, max_length=None, return_tensors='pt', truncation=False)

# get batches of tokens corresponding to the exact model_max_length
chunk_start = 0
chunk_end = tokenizer.model_max_length  # == 1024 for Bart
inputs_batch_lst = []
while chunk_start < len(inputs_no_trunc['input_ids'][0]):  # strict '<' avoids an empty trailing batch
    inputs_batch = inputs_no_trunc['input_ids'][0][chunk_start:chunk_end]  # get batch of n tokens
    inputs_batch = torch.unsqueeze(inputs_batch, 0)
    inputs_batch_lst.append(inputs_batch)
    chunk_start += tokenizer.model_max_length  # == 1024 for Bart
    chunk_end += tokenizer.model_max_length  # == 1024 for Bart

# generate a summary on each batch
summary_ids_lst = [model.generate(inputs, num_beams=4, max_length=100, early_stopping=True) for inputs in inputs_batch_lst]

# decode the output and join into one string with one paragraph per summary batch
summary_batch_lst = []
for summary_id in summary_ids_lst:
    summary_batch = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_id]
    summary_batch_lst.append(summary_batch[0])
summary_all = '\n'.join(summary_batch_lst)

print(summary_all)

# output (would of course make more sense on a sensible input):
#This is a very very long text. This is avery very long texts. This has been a very long day for me. I'm going to have to take a break from this. I've got a lot of work to do. I'll be back in a few days.
#This is a very very long text. This is avery very long texts. This has been a very long day for me. I'm going to have to take a break from this. I've got a lot of work to do. I'll be back in a few days.
#This is a very very long text. This is avery very long texts. This has been a very, very long day. This will be a very long, very, long night. I hope you enjoy it. I will be back in a week or so with a new text.

Token indices sequence length is longer than the specified maximum sequence length for this model (2403 > 1024). Running this sequence through the model will result in indexing errors
This is a very very long text. This is avery very long texts. This has been a very long day for me. I'm going to have to take a break from this. I've got a lot of work to do. I'll be back in a few days.
This is a very very long text. This is avery very long texts. This has been a very long day for me. I'm going to have to take a break from this. I've got a lot of work to do. I'll be back in a few days.
This is a very very long text. This is avery very long texts. This has been a very, very long day. This will be a very long, very, long night. I hope you enjoy it. I will be back in a week or so with a new text.
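
The fixed-size windowing used in the cell above can be isolated into a few lines of plain Python, which makes it easy to check the batch sizes without loading a model. `split_ids` is a hypothetical helper name, and the list of integers only stands in for real token ids; the 2403 and 1024 figures are taken from the warning above.

```python
from typing import List, Sequence

def split_ids(ids: Sequence[int], window: int) -> List[List[int]]:
    """Cut a sequence of token ids into consecutive windows of at most `window` ids."""
    if window <= 0:
        raise ValueError("window must be positive")
    # range() stops before len(ids), so an exact multiple yields no empty trailing window
    return [list(ids[start:start + window]) for start in range(0, len(ids), window)]

fake_ids = list(range(2403))  # stand-in for the 2403 token ids reported in the warning
print([len(w) for w in split_ids(fake_ids, 1024)])
```

Note that windows cut this way can start or end in the middle of a sentence, and only the first and last window carry the tokenizer's start and end tokens.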


The main advantage of this approach is that it uses the tokenization directly from the transformers tokenizer instead of an external tokenizer like NLTK. Keep in mind that most transformer models use different sub-word tokenizers, while NLTK probably uses a word-level tokenizer (see explanation here). This means that NLTK will split a string like “I have a new GPU!” into 6 tokens (one per word + punctuation), while e.g. BERT’s tokenizer will split it into 7 (['i', 'have', 'a', 'new', 'gp', '##u', '!']), because it splits rare words into sub-words (e.g. GPU). With the “pure transformers” approach you can be sure to really get the exact maximum number of tokens.
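
The difference in token counts can be reproduced with a toy example that needs no model download. This is only a sketch: `TOY_VOCAB` is a made-up vocabulary just large enough for the example sentence, and `wordpiece` is a simplified greedy longest-match splitter in the style of WordPiece, not BERT's actual tokenizer.

```python
import re

# Tiny hypothetical vocabulary, just enough for the example sentence.
TOY_VOCAB = {"i", "have", "a", "new", "gp", "##u", "!"}

def word_tokenize(text):
    """Word-level split: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def wordpiece(word, vocab):
    """Greedy longest-match-first sub-word split of one lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # continuation pieces are marked with a leading '##'
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing in the vocabulary matched
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

sentence = "I have a new GPU!"
words = word_tokenize(sentence)
subwords = [p for w in words for p in wordpiece(w.lower(), TOY_VOCAB)]
print(len(words), words)
print(len(subwords), subwords)
```

The word-level split gives 6 tokens and the sub-word split gives 7, because "gpu" is not in the vocabulary as a whole word and falls back to `gp` + `##u`.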

The disadvantage is that there is no sentence boundary detection. In theory you can solve that by splitting the text into sentences with NLTK (or spaCy). But the threshold should then probably be set below 1024 words (maybe 900?), because 1024 NLTK word tokens translate into more than 1024 sub-word tokens. Otherwise, the text gets truncated again and you effectively delete parts of your text.
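
One way to combine both ideas is to pack whole sentences into chunks while counting tokens with whatever counter you trust. The sketch below assumes the sentences have already been split; `pack_sentences` is a hypothetical helper, and the whitespace-based `count_words` is only a stand-in so the example runs offline. With a real model you would presumably pass something like `lambda s: len(tokenizer.tokenize(s))` and a budget a bit below the model limit.

```python
import re
from typing import Callable, List

def pack_sentences(sentences: List[str],
                   count_tokens: Callable[[str], int],
                   budget: int) -> List[str]:
    """Greedily join whole sentences into chunks of at most `budget` tokens."""
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if n > budget:
            # a single over-long sentence cannot fit; emit it alone (the caller may truncate it)
            if current:
                chunks.append(" ".join(current))
                current, used = [], 0
            chunks.append(sentence)
            continue
        if current and used + n > budget:
            # the current chunk is full: close it and start a new one
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def count_words(s: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(s.split())

text = "This is a very very long text. " * 10
sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
chunks = pack_sentences(sentences, count_words, budget=20)
print(len(chunks), [count_words(c) for c in chunks])
```

Each sentence here has 7 words, so a budget of 20 fits two sentences per chunk and no sentence is ever cut in half.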

I feel like summarising texts above 1024 tokens is probably a common use case and enabling this kind of “long text summarisation” could be a very useful feature for the summarisation pipeline. Could this maybe be something that could be added to the pipeline with e.g. a keyword argument like ‘summarise_long_text=True’?
I don’t know the internals of the pipeline well enough to know whether this would be an easy addition or too complicated, @sshleifer. (Also, please correct me if I’m wrong about the code and explanation above; I’m also new to this.)

(Btw, look at the input and the output… Is BART getting lazy when the text is too long and monotonous?)
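
To illustrate what such a keyword argument might do, here is a purely hypothetical sketch. Nothing below is part of the transformers pipeline API: `summarise`, `fake_summarise_chunk` and `split_on_blank_lines` are invented names, and the two helpers are stand-ins so the example runs without a model.

```python
from typing import Callable, List

def summarise(text: str,
              summarise_chunk: Callable[[str], str],
              split_into_chunks: Callable[[str], List[str]],
              summarise_long_text: bool = False) -> str:
    """Hypothetical wrapper: summarise directly, or chunk by chunk when the flag is set."""
    if not summarise_long_text:
        return summarise_chunk(text)
    # summarise every non-empty chunk and put one summary per line
    parts = [summarise_chunk(chunk) for chunk in split_into_chunks(text) if chunk.strip()]
    return "\n".join(parts)

# Stand-ins so the sketch runs without downloading a model.
def fake_summarise_chunk(chunk: str) -> str:
    """Pretend summary: keep only the first sentence of the chunk."""
    return chunk.split(".")[0].strip() + "."

def split_on_blank_lines(text: str) -> List[str]:
    """Pretend chunker: one chunk per paragraph."""
    return text.split("\n\n")

doc = "First topic. More detail.\n\nSecond topic. Even more detail."
print(summarise(doc, fake_summarise_chunk, split_on_blank_lines, summarise_long_text=True))
```

In a real setting the chunker would respect the model's token limit and the per-chunk function would call the model, as in the cells above.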

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
from typing import List

def old_summarization_pipeline(text: List[str]) -> List[str]:
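    # 'bart-large-cnn' is the legacy shortcut name; the checkpoint is published on the Hub as 'facebook/bart-large-cnn'
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    # pad to the longest text in the batch and truncate to the 1024-token model limit
    inputs = tokenizer(text, return_tensors='pt', max_length=1024, truncation=True, padding=True)
    summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'])
    summaries = [tokenizer.decode(s, skip_special_tokens=True) for s in summary_ids]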
    return summaries

@sshleifer what's the typical recommendation for summarization on larger documents? Chunk them and generate summaries or any other tips?

EDIT: Cross-posted here, I think this is a much better place for this.

This is what I use currently, but I am open to better recommendations.

In [None]:
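import nltk

# group sentences into chunks of fewer than 1024 characters
# (length is counted in characters, not tokens; for English text this stays well under the 1024-token limit)
def nest_sentences(document):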
  nested = []
  sent = []
  length = 0
  for sentence in nltk.sent_tokenize(document):
    length += len(sentence)
    if length < 1024:
      sent.append(sentence)
    else:
      nested.append(sent)
      sent = [sentence]
      length = len(sentence)

  if sent:
    nested.append(sent)
  return nested

# generate summary on text with <= 1024 tokens
def generate_summary(nested_sentences):
  device = 'cuda' if torch.cuda.is_available() else 'cpu'  # torch is imported in an earlier cell
  summaries = []
  for nested in nested_sentences:
    input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
    input_tokenized = input_tokenized.to(device)
    summary_ids = bart_model.to(device).generate(input_tokenized,
                                      length_penalty=3.0,
                                      min_length=30,
                                      max_length=100)
    output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
    summaries.append(output)
  summaries = [sentence for sublist in summaries for sentence in sublist]
  return summaries

In [64]:
import torch
torch.cuda.is_available()

False

In [1]:
from transformers import pipeline

translator = pipeline("translation_fr_to_en")

ValueError: The task does not provide any default models for options ('fr', 'en')