# Text summarization with Hugging Face Transformers

In [None]:
# To install the library in a Jupyter notebook environment, uncomment the next line
#!pip install transformers

## Import

In [79]:
from transformers import pipeline

ModuleNotFoundError: No module named 'transformers'

## Load summarization pipeline

In [17]:
# load pre-trained summarization pipeline
summarizer = pipeline("summarization")

In [3]:
chapter = """
Judicial space
Regarding space,
Bruno Dayez carries out a topographical analysis of the trial which allows us to highlight several characteristics:
First, "the trial takes place in a defined, unchanging and closed place: the courtroom ", which is itself located within the courthouse, which at first glance makes us think of an imposing and austere temple where we have the impression that we are not necessarily welcome and that there is it's not good to live, while for my part, I can assure you that we have very good times.
Therefore, the place where the trial takes place is separate from the ordinary world, "the justice is a particular, autonomous operation that requires its detachment from the everyday world. "

It is therefore a space separated from the secular space of the city.
Regarding the publicity of the audiences, these cannot be filmed
and disseminated (except for trials of historical interest by
example). When the trial takes place at the last instance, it must
putting a definitive end to the conflict is what sets it apart from
the gear of private revenge.

Indeed, imagine it is broadcast on television, it would be subject to
recurrent manner in democratic debate, which would be an obstacle to its
essential function of social pacification: "Res judicata pro veritate
habetur, "the saying goes," Res judicata is held to be truth. "
So "except for the few palate rats baited by the smell of
sentence ", of which I am a part, are present at the hearing only those who
are summoned to appear there. And it is the press that delivers the only echo of what is happening
weft within the walls of the palace.
Then, concerning the interior space of the trial, it is divided into
regions.
Each speaker occupies a limited space, it defines the status
even of the speaker, the precise role he must fulfill. "It is forbidden to
put in the place of others because this substitution would risk throwing the
confusion in the artificial world of the trial. "

Like the auditorium, the courtroom has two differentiated spaces, the separation of which
can be materialized by a barrier, a rope or simply a
empty space. One, with benches, is for the public and the other for the stage
judicial proper.
The space is organized symmetrically on either side of the president,
whose chair is often slightly raised. The president is surrounded
assessors. Then, at the ends, we find the clerk on one side and the
the public prosecutor of the other (or the attorney general if the case falls within the jurisdiction of the seats).

So, in general, the public prosecutor is at the same level as the court.
The question then arises of the balance of power in the spatial organization of the
trial and in particular the asymmetrical position of the prosecutor and the lawyer
in relation to the judge. This geographical proximity between the prosecutor and the
judge could lead one to believe that the rights of the defense and the necessities of
repression are not on an equal footing. And finally, always separated
of the public, is the bar where witnesses come to testify, "the past in
flashback " to use the good word of Master Vergès. And then, on both sides
on the other hand, the benches reserved for the accused and his lawyer are distributed, and those
reserved for the victim and his lawyer. "The compartmentalisation of the actors
therefore already freezes in the personification of an action: accusing, defending, judging
or be judged. "

We can also add that in addition to the very precise function assigned to each
actor, the dress, that is to say the dressing, also makes it possible to better identify
the various protagonists of the judicial scene.
Antoine Garapon assigns him three main functions. A first
function of purification, it purifies the ordinary person before this one
exercise its institutional role. Here too, there is a desire to mark the
break between life and trial. Then, it aims to protect the
person who is about to perform the function which is proper to him, in him
conferring a feeling of superiority which will release it from violence
legitimate which it is called upon to exercise. And finally, it allows to signify the
victory of the appearing over the being, of the character over the person, of
the institution on the person. Antoine Garapon says it very rightly: "the dress
allows, for the wearer, identification with his character.
Contrary to the saying, in the trial it is the dress that makes the judge,
the lawyer and the prosecutor."
"""

In [19]:
# Summarize into a text of at least 30 and at most 130 tokens (the limits count tokens, not words)
# do_sample=False disables sampling, so decoding is deterministic
# (greedy or beam search, depending on the model's num_beams setting)
summarizer(chapter, max_length=130, min_length=30, do_sample=False)

[{'summary_text': ' Bruno Dayez says the trial takes place in a defined, unchanging and closed place: the courtroom . The space is organized symmetrically on either side of the president, the clerk and the public prosecutor of the other . Each speaker occupies a limited space, it defines the status of the speaker, the precise role he must fulfill . The public prosecutor is at the same level as the court .'}]
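
The comment in the summarization cell mentions greedy decoding. As a rough illustration of that idea only (it is not how the pipeline is implemented internally), the sketch below repeatedly picks the most probable next token from a hand-written table. The table `NEXT_TOKEN_PROBS`, its tokens and the function name `greedy_decode` are all invented for this example.

```python
# Hypothetical next-token probabilities, invented for this illustration.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"trial": 0.7, "court": 0.3},
    "a": {"judge": 1.0},
    "trial": {"ends": 0.8, "</s>": 0.2},
    "court": {"</s>": 1.0},
    "judge": {"</s>": 1.0},
    "ends": {"</s>": 1.0},
}


def greedy_decode(start="<s>", end="</s>", max_steps=10):
    """Follow the single most probable next token until `end` or `max_steps`."""
    tokens = []
    current = start
    for _ in range(max_steps):
        candidates = NEXT_TOKEN_PROBS[current]
        # Greedy choice: keep only the highest-probability continuation.
        current = max(candidates, key=candidates.get)
        if current == end:
            break
        tokens.append(current)
    return tokens


print(" ".join(greedy_decode()))
```

Because the choice at every step is fixed by the probabilities, running the function twice gives the same sequence, which is the practical effect of turning sampling off.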

In [1]:
import re
import requests
import time
import random

# from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd

## Functions

In [2]:
def download_page(url):
    """
    Download a page and parse it with BeautifulSoup.
    """
    # download page (the timeout avoids waiting forever on a stalled connection)
    response = requests.get(url, timeout=30)
    print(url, response.status_code)

    # parse page
    soup = bs(response.content, features="lxml")

    return soup


def get_book_links(soup, base_url) -> pd.DataFrame:
    """
    Function to get links of the first 25 books 
    from search page
    """
    # create empty dataframe
    cols = ["title", "author", "link", "cover_img_link", "book_id"]
    books_df = pd.DataFrame(columns=cols)

    # scrape info
    for rank, element in enumerate(soup.find_all("li", attrs={"class": "booklink"}), start=1):
        title = element.find("span", attrs={"class": "title"}).text
        author = element.find("span", attrs={"class": "subtitle"}).text
        link = base_url + element.find("a", attrs={"class": "link"}).get("href")
        cover_img_link = base_url + element.find("img", attrs={"class": "cover-thumb"}).get("src")
        book_id = re.findall(r"\d+", link)[0]
        # utf_8_txt_link = f"{base_url}/files/{book_id}/{book_id}-0.txt"
        books_df.loc[rank] = [title, author, link, cover_img_link, book_id]
        
    return books_df

def get_book_text(link, base_url):
    """
    Function to get book text in plain text UTF-8
    """
    # download book page
    soup = download_page(link)

    # scrape book text link
    book_text_link = base_url + soup.find("a", attrs={"class": "link", "type": "text/plain"}).get("href")

    # slow down requests frequency to avoid IP ban
    time.sleep(random.uniform(2.0, 3.0))

    # download book text page
    response = requests.get(book_text_link)
    print(book_text_link, response.status_code)
    
    # return book text
    return response.text
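
`get_book_links` takes the book id as the first run of digits found anywhere in the link. A slightly stricter variant, sketched below, only accepts digits that directly follow the `/ebooks/` path segment. The helper name `extract_book_id` is hypothetical and not part of the notebook.

```python
import re


def extract_book_id(link):
    """Return the numeric id from a Project Gutenberg ebook URL, or None."""
    match = re.search(r"/ebooks/(\d+)", link)
    return match.group(1) if match else None


print(extract_book_id("https://www.gutenberg.org/ebooks/31547"))
```

Returning `None` instead of raising an `IndexError` lets the caller decide how to handle links that are not book pages.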

In [None]:
# define base url for links
base_url = "https://www.gutenberg.org"

# download page of most popular books
url='https://www.gutenberg.org/ebooks/search/?sort_order=downloads'
soup = download_page(url)

In [39]:
# input search query
search = input("search for books, authors, genre, ...")

# prepare search query url in required format
search_book_url = "https://www.gutenberg.org/ebooks/search/?query="
search_book_url += "+".join(search.split(" "))

# download page
soup = download_page(search_book_url)

https://www.gutenberg.org/ebooks/search/?query=asimov 200
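
Joining the words with `+` works for simple queries, but characters such as `&` or `#` would break the URL. One possible alternative, assuming the same search endpoint as in the cell above, is to let the standard library encode the query. The function name `build_search_url` is made up for this sketch.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.gutenberg.org/ebooks/search/"


def build_search_url(query):
    """Build a search URL, percent-encoding the query and using '+' for spaces."""
    return f"{SEARCH_URL}?{urlencode({'query': query})}"


print(build_search_url("isaac asimov"))
```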


In [42]:
search_df = get_book_links(soup, base_url)
search_df

| | title | author | link | cover_img_link | utf_8_txt_link |
|---|---|---|---|---|---|
| 1 | Youth | Isaac Asimov | https://www.gutenberg.org/ebooks/31547 | https://www.gutenberg.org/cache/epub/31547/pg3... | https://www.gutenberg.org/files/31547/31547-0.txt |
| 2 | Worlds Within Worlds: The Story of Nuclear Ene... | Isaac Asimov | https://www.gutenberg.org/ebooks/49819 | https://www.gutenberg.org/cache/epub/49819/pg4... | https://www.gutenberg.org/files/49819/49819-0.txt |
| 3 | 100 New Yorkers of the 1970s | Max Millard | https://www.gutenberg.org/ebooks/17385 | https://www.gutenberg.org/cache/epub/17385/pg1... | https://www.gutenberg.org/files/17385/17385-0.txt |
| 4 | Worlds Within Worlds: The Story of Nuclear Ene... | Isaac Asimov | https://www.gutenberg.org/ebooks/49821 | https://www.gutenberg.org/cache/epub/49821/pg4... | https://www.gutenberg.org/files/49821/49821-0.txt |
| 5 | The Genetic Effects of Radiation | Isaac Asimov and Theodosius Dobzhansky | https://www.gutenberg.org/ebooks/55738 | https://www.gutenberg.org/cache/epub/55738/pg5... | https://www.gutenberg.org/files/55738/55738-0.txt |
| 6 | Worlds Within Worlds: The Story of Nuclear Ene... | Isaac Asimov | https://www.gutenberg.org/ebooks/49820 | https://www.gutenberg.org/cache/epub/49820/pg4... | https://www.gutenberg.org/files/49820/49820-0.txt |


In [3]:
# define base url for links
base_url = "https://www.gutenberg.org"
# book_link = search_df.loc[1, "link"]
# book_text = get_book_text(book_link, base_url)
book_text = get_book_text("https://www.gutenberg.org/ebooks/31547", base_url)

https://www.gutenberg.org/ebooks/31547 200
https://www.gutenberg.org/ebooks/31547.txt.utf-8 200


In [16]:
print(book_text[:1000])

﻿The Project Gutenberg EBook of Youth, by Isaac Asimov

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Youth

Author: Isaac Asimov

Illustrator: Schecterson

Release Date: March 7, 2010 [EBook #31547]
[Last updated: February 22, 2012]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK YOUTH ***




Produced by Greg Weeks, Stephen Blundell and the Online
Distributed Proofreading Team at https://www.pgdp.net









YOUTH

_by_ ISAAC ASIMOV


    Red and Slim found the two strange little animals the morning after
    they heard the thunder sounds. They knew that they could never show
    their new pets to their parents.


[Illustration]


There was a spatter of pebbles against the window and the youngster
stirred in hi


In [4]:
# find the index where the Project Gutenberg header ends:
# the last "***" within the first 1000 characters closes the "*** START OF ..." line
metadata_end_idx = book_text.rfind("***",0,1000)

In [76]:
book_text[metadata_end_idx: 1000]

'***\r\n\r\n\r\n\r\n\r\nProduced by Greg Weeks, Stephen Blundell and the Online\r\nDistributed Proofreading Team at https://www.pgdp.net\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nYOUTH\r\n\r\n_by_ ISAAC ASIMOV\r\n\r\n\r\n    Red and Slim found the two strange little animals the morning after\r\n    they heard th'
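
Searching for `***` in the first 1000 characters relies on the header being short. A somewhat more general sketch is shown below: it cuts at the `*** START OF` marker line seen in the printed header and, as an assumption not shown in this notebook, at a matching `*** END OF` marker line before the license text. The function name `strip_gutenberg_boilerplate` and the sample string are invented for the example.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines."""
    start = text.find("*** START OF")
    if start != -1:
        # Skip to the end of the marker line itself.
        line_end = text.find("\n", start)
        start = line_end + 1 if line_end != -1 else len(text)
    else:
        start = 0

    # Assumption: the footer begins with a line starting with "*** END OF".
    end = text.find("*** END OF", start)
    if end == -1:
        end = len(text)

    return text[start:end].strip()


sample = (
    "Title: Youth\r\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK YOUTH ***\r\n"
    "\r\nYOUTH\r\n\r\nThere was a spatter of pebbles.\r\n\r\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK YOUTH ***\r\n"
    "License text\r\n"
)
print(repr(strip_gutenberg_boilerplate(sample)))
```

If neither marker is present the text is returned unchanged apart from surrounding whitespace, so the function is safe to call on any downloaded file.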

In [77]:
len(book_text)

78716

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
  
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum")

In [5]:
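from transformers import pipeline

# `model` and `tokenizer` are created in the cell above (In [6]);
# running this cell before that one raises the NameError shown below
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)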
summarizer(book_text[metadata_end_idx:metadata_end_idx + 3500], min_length=30, max_length=130, do_sample=False)

NameError: name 'model' is not defined
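
The cell above only summarizes a 3500-character slice, while the whole book is 78716 characters long. One possible way to cover more of the text is to split it into pieces of a similar size first. The sketch below is a plain-Python helper under that assumption; the name `chunk_text` is hypothetical, and the limit counts characters, whereas the model's real limit is measured in tokens.

```python
def chunk_text(text, max_chars=3500):
    """Split text into chunks of at most max_chars, preferring paragraph boundaries."""
    normalized = text.replace("\r\n", "\n")
    paragraphs = [p.strip() for p in normalized.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        # Hard-split any single paragraph that is longer than the limit.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The paragraph does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


print([len(chunk) for chunk in chunk_text("aaaa\n\nbbbb\n\ncccc", max_chars=10)])
```

Each chunk could then be passed to `summarizer(...)` in turn and the partial summaries joined, at the cost of one model call per chunk.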

In [None]:
# Scratch code kept from an earlier Selenium scraper (not used for summarization).
# The original fragment was the body of a class method; it is wrapped here in a
# standalone function whose name, collect_listing_links, is chosen for illustration.
# Selenium 4 removed find_elements_by_xpath, so find_element(By.XPATH, ...) is used.
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

# parse with BeautifulSoup
# soup = bs(self._data)
# Get listed links
# link_tags = soup.main.find_all("a", attrs={"class": "property-content"})
# self._links = [link.attrs["href"] for link in link_tags]


def collect_listing_links(page_url, n_pages=334):
    """Open page_url in Firefox, search for houses and apartments, and collect listing links."""
    # Create webdriver object
    driver = webdriver.Firefox()

    # Wait up to 30 seconds for elements to appear before giving up
    driver.implicitly_wait(30)
    driver.get(page_url)

    # When opening the url on Firefox, a pop-up window appears.
    # Click on "Keep browsing" to get to the actual page.
    driver.find_element(By.XPATH, "//button[@id='uc-btn-accept-banner']").click()

    # Search for all houses and apartments
    # 1. Select "House and apartment" label
    driver.find_element(By.XPATH, "//button[@id='propertyTypesDesktop']").click()
    driver.find_element(By.XPATH, "//li[@data-value='HOUSE,APARTMENT']").click()

    # 2. Click on search
    driver.find_element(By.XPATH, "//button[@id='searchBoxSubmitButton']").click()

    # 3. Get links of houses and apartments, one results page at a time
    links = []
    next_xpath = "//a[@class='pagination__link pagination__link--next button button--text button--size-small']"
    for _ in range(n_pages):
        # Retry up to 5 times, e.g. when elements go stale while the page reloads
        for _attempt in range(5):
            try:
                links_tags = driver.find_elements(By.XPATH, "//a[@class='card__title-link']")
                links.extend(link.get_attribute("href") for link in links_tags)
                break
            except WebDriverException:
                continue

        # Navigate to next page
        driver.find_element(By.XPATH, next_xpath).click()

    # print(links)
    driver.quit()
    return links

# BeautifulSoup lookup snippets (scratch notes; `name` and `label` must be defined by the caller)
soup.find("th", string=re.compile(name, re.IGNORECASE))
soup.select_one(".classified__title").text.strip().lower()
soup.select_one('th:-soup-contains("area")')
label.next_sibling.next_sibling.contents[0].strip()

In [8]:
ALLOWED_EXTENSIONS = {'txt', 'pdf'}

In [12]:
file = "qs.az.txt"
# split string into list only 1 time and from the right
file.rsplit(".", 1)[1]

'txt'
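
The two cells above look like the building blocks of an upload filter: a set of accepted extensions and a way to read the extension of a file name. A minimal sketch of how they might be combined is shown below. The function name `allowed_file` and the lower-casing of the extension are assumptions, not something the notebook defines.

```python
ALLOWED_EXTENSIONS = {"txt", "pdf"}


def allowed_file(filename):
    """Return True when the file name's last extension is in ALLOWED_EXTENSIONS."""
    if "." not in filename:
        return False
    # Split once from the right so "qs.az.txt" yields "txt".
    extension = filename.rsplit(".", 1)[1].lower()
    return extension in ALLOWED_EXTENSIONS


print(allowed_file("qs.az.txt"), allowed_file("notes.docx"))
```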