# Text summarization with Hugging Face Transformers

In [None]:
# To install the library in a Jupyter notebook environment, uncomment the next line
#!pip install transformers

## Import

In [2]:
from transformers import pipeline

## Load summarization pipeline

In [17]:
# load pre-trained summarization pipeline
summarizer = pipeline("summarization")

In [18]:
chapter = """
Judicial space
Regarding space,
Bruno Dayez carries out a topographical analysis of the trial which allows us to
highlight several characteristics:
First, "the trial takes place in a defined, unchanging and closed place: the
courtroom ", which is itself located within the courthouse, which
at first glance makes us think of an imposing and austere temple where we have
the impression that we are not necessarily welcome and that there is
it's not good to live, while for my part, I can assure you that we have
very good times.
Therefore, the place where the trial takes place is separate from the ordinary world, "the
justice is a particular, autonomous operation that requires its
detachment from the everyday world. "

It is therefore a space separated from the secular space of the city.
Regarding the publicity of the audiences, these cannot be filmed
and disseminated (except for trials of historical interest by
example). When the trial takes place at the last instance, it must
putting a definitive end to the conflict is what sets it apart from
the gear of private revenge.

Indeed, imagine it is broadcast on television, it would be subject to
recurrent manner in democratic debate, which would be an obstacle to its
essential function of social pacification: "Res judicata pro veritate
habetur, "the saying goes," Res judicata is held to be truth. "
So "except for the few palate rats baited by the smell of
sentence ", of which I am a part, are present at the hearing only those who
are summoned to appear there. And it is the press that delivers the only echo of what is happening
weft within the walls of the palace.
Then, concerning the interior space of the trial, it is divided into
regions.
Each speaker occupies a limited space, it defines the status
even of the speaker, the precise role he must fulfill. "It is forbidden to
put in the place of others because this substitution would risk throwing the
confusion in the artificial world of the trial. "

Like the auditorium, the courtroom has two differentiated spaces, the separation of which
can be materialized by a barrier, a rope or simply a
empty space. One, with benches, is for the public and the other for the stage
judicial proper.
The space is organized symmetrically on either side of the president,
whose chair is often slightly raised. The president is surrounded
assessors. Then, at the ends, we find the clerk on one side and the
the public prosecutor of the other (or the attorney general if the case falls within the jurisdiction of the seats).

So, in general, the public prosecutor is at the same level as the court.
The question then arises of the balance of power in the spatial organization of the
trial and in particular the asymmetrical position of the prosecutor and the lawyer
in relation to the judge. This geographical proximity between the prosecutor and the
judge could lead one to believe that the rights of the defense and the necessities of
repression are not on an equal footing. And finally, always separated
of the public, is the bar where witnesses come to testify, "the past in
flashback " to use the good word of Master Vergès. And then, on both sides
on the other hand, the benches reserved for the accused and his lawyer are distributed, and those
reserved for the victim and his lawyer. "The compartmentalisation of the actors
therefore already freezes in the personification of an action: accusing, defending, judging
or be judged. "

We can also add that in addition to the very precise function assigned to each
actor, the dress, that is to say the dressing, also makes it possible to better identify
the various protagonists of the judicial scene.
Antoine Garapon assigns him three main functions. A first
function of purification, it purifies the ordinary person before this one
exercise its institutional role. Here too, there is a desire to mark the
break between life and trial. Then, it aims to protect the
person who is about to perform the function which is proper to him, in him
conferring a feeling of superiority which will release it from violence
legitimate which it is called upon to exercise. And finally, it allows to signify the
victory of the appearing over the being, of the character over the person, of
the institution on the person. Antoine Garapon says it very rightly: "the dress
allows, for the wearer, identification with his character.
Contrary to the saying, in the trial it is the dress that makes the judge,
the lawyer and the prosecutor."
"""

In [19]:
# Summarize with a minimum length of 30 tokens and a maximum of 130 tokens
# do_sample=False turns off random sampling: the decoder picks high-probability tokens deterministically
summarizer(chapter, max_length=130, min_length=30, do_sample=False)

[{'summary_text': ' Bruno Dayez says the trial takes place in a defined, unchanging and closed place: the courtroom . The space is organized symmetrically on either side of the president, the clerk and the public prosecutor of the other . Each speaker occupies a limited space, it defines the status of the speaker, the precise role he must fulfill . The public prosecutor is at the same level as the court .'}]
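
Summarization models accept only a bounded number of input tokens, so a text much longer than the chapter above (a whole book, for example) would need to be split before it is passed to `summarizer`. The following is a minimal sketch of one way to do that, assuming that a word count is an acceptable rough proxy for the token count. `chunk_words` and its `max_words` default are hypothetical and not part of the Transformers library.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words.

    max_words is an assumed budget: the real model limit is counted in
    tokens, not words, so it should be chosen conservatively.
    """
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


sample = "one two three four five six seven"
chunks = chunk_words(sample, max_words=3)
print(chunks)
```

Each chunk could then be summarized on its own with the same call as above, and the partial summaries joined into one text.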

In [1]:
# from selenium import webdriver
from bs4 import BeautifulSoup as bs
import re
import requests

In [12]:
import pandas as pd

In [2]:
# download page of most popular books
# url='https://www.gutenberg.org/files/103/103-h/103-h.htm'
url='https://www.gutenberg.org/ebooks/search/?sort_order=downloads'
response = requests.get(url)
print(url, response.status_code)

# parse data
soup = bs(response.content, features="lxml")

https://www.gutenberg.org/ebooks/search/?sort_order=downloads 200


In [28]:
# Get links of the 25 most popular books
base_url = "https://www.gutenberg.org"
cols = ["title", "author", "link", "cover_img_link", "utf_8_txt_link"]
books_df = pd.DataFrame(columns=cols)

for rank, element in enumerate(soup.find_all("li", attrs={"class": "booklink"}), start=1):
    title = element.find("span", attrs={"class": "title"}).text
    author = element.find("span", attrs={"class": "subtitle"}).text
    link = base_url + element.find("a", attrs={"class": "link"}).get("href")
    cover_img_link = base_url + element.find("img", attrs={"class": "cover-thumb"}).get("src")
    book_id = re.findall(r"\d+", link)[0]
    utf_8_txt_link = f"{base_url}/files/{book_id}/{book_id}-0.txt"
    books_df.loc[rank] = [title, author, link, cover_img_link, utf_8_txt_link]
books_df


rank,title,author,link,cover_img_link,utf_8_txt_link
1,"Frankenstein; Or, The Modern Prometheus",Mary Wollstonecraft Shelley,https://www.gutenberg.org/ebooks/84,https://www.gutenberg.org/cache/epub/84/pg84.c...,https://www.gutenberg.org/files/84/84-0.txt
2,Pride and Prejudice,Jane Austen,https://www.gutenberg.org/ebooks/1342,https://www.gutenberg.org/cache/epub/1342/pg13...,https://www.gutenberg.org/files/1342/1342-0.txt
3,The Great Gatsby,F. Scott Fitzgerald,https://www.gutenberg.org/ebooks/64317,https://www.gutenberg.org/cache/epub/64317/pg6...,https://www.gutenberg.org/files/64317/64317-0.txt
4,A Tale of Two Cities,Charles Dickens,https://www.gutenberg.org/ebooks/98,https://www.gutenberg.org/cache/epub/98/pg98.c...,https://www.gutenberg.org/files/98/98-0.txt
5,Alice's Adventures in Wonderland,Lewis Carroll,https://www.gutenberg.org/ebooks/11,https://www.gutenberg.org/cache/epub/11/pg11.c...,https://www.gutenberg.org/files/11/11-0.txt
6,Et dukkehjem. English,Henrik Ibsen,https://www.gutenberg.org/ebooks/2542,https://www.gutenberg.org/cache/epub/2542/pg25...,https://www.gutenberg.org/files/2542/2542-0.txt
7,The Importance of Being Earnest: A Trivial Com...,Oscar Wilde,https://www.gutenberg.org/ebooks/844,https://www.gutenberg.org/cache/epub/844/pg844...,https://www.gutenberg.org/files/844/844-0.txt
8,The Picture of Dorian Gray,Oscar Wilde,https://www.gutenberg.org/ebooks/174,https://www.gutenberg.org/cache/epub/174/pg174...,https://www.gutenberg.org/files/174/174-0.txt
9,A Modest Proposal\r,Jonathan Swift,https://www.gutenberg.org/ebooks/1080,https://www.gutenberg.org/cache/epub/1080/pg10...,https://www.gutenberg.org/files/1080/1080-0.txt
10,Metamorphosis,Franz Kafka,https://www.gutenberg.org/ebooks/5200,https://www.gutenberg.org/cache/epub/5200/pg52...,https://www.gutenberg.org/files/5200/5200-0.txt
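
The book id is taken from the first run of digits found anywhere in the link, which works as long as no other digits appear earlier in the URL. As a slightly stricter sketch, the id can be anchored on the `/ebooks/` path segment. `plain_text_url` is a hypothetical helper, and the `files/<id>/<id>-0.txt` pattern is simply the one assumed in the cell above; it may not exist for every book.

```python
import re

BASE_URL = "https://www.gutenberg.org"


def plain_text_url(ebook_link):
    """Build the assumed UTF-8 text URL from an /ebooks/<id> link."""
    match = re.search(r"/ebooks/(\d+)", ebook_link)
    if match is None:
        return None  # not an ebook page link
    book_id = match.group(1)
    return f"{BASE_URL}/files/{book_id}/{book_id}-0.txt"


print(plain_text_url("https://www.gutenberg.org/ebooks/84"))
```

For the first row of the table this yields the same link as the `utf_8_txt_link` column.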


In [33]:
# download book in plain text UTF-8
utf_8_txt_url = books_df.loc[1, "utf_8_txt_link"]
response = requests.get(utf_8_txt_url)
print(utf_8_txt_url, response.status_code)
book = response.text

https://www.gutenberg.org/files/84/84-0.txt 200


In [35]:
book[:100]

'ï»¿The Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft (Godwin) Shelley\r\n\r\nThis eBoo'
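
The three characters at the start of this output are a UTF-8 byte order mark (BOM) that was decoded with a single-byte encoding, most likely because the encoding was guessed rather than declared. The sketch below reproduces the effect on a short byte string (the sample bytes are made up for illustration) and shows that the `utf-8-sig` codec decodes the text and drops the BOM.

```python
raw = b"\xef\xbb\xbfThe Project Gutenberg eBook"  # UTF-8 BOM followed by ASCII text

as_latin1 = raw.decode("iso-8859-1")   # wrong guess: the BOM becomes three visible characters
as_utf8_sig = raw.decode("utf-8-sig")  # decodes UTF-8 and strips a leading BOM

print(repr(as_latin1[:6]))
print(repr(as_utf8_sig[:6]))
```

With `requests`, setting `response.encoding = "utf-8-sig"` before reading `response.text` should have the same effect on the downloaded book.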

In [32]:
print(soup.find("h2", string=re.compile("contents", re.IGNORECASE)))

<h2 align="center">
CONTENTS
</h2>
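
Locating the table of contents is one way to find chapter boundaries. As an alternative sketch for the plain-text version of a book, chapters can be split directly on heading lines. This assumes headings of the form `Chapter <number>` at the start of a line, which will not hold for every book; `split_chapters` is a hypothetical helper.

```python
import re


def split_chapters(text):
    """Split text at lines that look like 'Chapter <number>' headings.

    Anything before the first heading is dropped. If no heading is
    found, the whole text is returned as a single chapter.
    """
    pattern = re.compile(r"^chapter\s+\d+\b.*$", re.IGNORECASE | re.MULTILINE)
    starts = [match.start() for match in pattern.finditer(text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


sample = "Preface\n\nChapter 1\nFirst text.\n\nChapter 2\nSecond text.\n"
for chapter_text in split_chapters(sample):
    print(repr(chapter_text))
```

In a real book the table of contents usually repeats the same headings, so those entries would show up as very short "chapters" that need to be filtered out.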


In [None]:
# parse with Beautifulsoup
# soup = bs(self._data)

        # Get listed links
        # link_tags = soup.main.find_all("a", attrs={"class": "property-content"})
        # self._links = [link.attrs["href"] for link in link_tags]

        # Create webdriver object
        driver = webdriver.Firefox()

        # Wait up to 30 seconds for elements to appear (implicit wait)
        driver.implicitly_wait(30)
        driver.get(self.page_url)

        # When opening the url on Firefox, a pop-up window appears.
        # Click on "Keep browsing" to get to the actual page.
        python_button = driver.find_elements_by_xpath(
            "//button[@id='uc-btn-accept-banner']"
        )[0]
        python_button.click()

        # Search for all houses and apartments
        # 1. Select "House and apartment" label
        python_label_button = driver.find_elements_by_xpath(
            "//button[@id='propertyTypesDesktop']"
        )[0]
        python_label_button.click()
        python_house_apartment_button = driver.find_elements_by_xpath(
            "//li[@data-value='HOUSE,APARTMENT']"
        )[0]
        python_house_apartment_button.click()

        # 2. Click on search
        python_search_button = driver.find_elements_by_xpath(
            "//button[@id='searchBoxSubmitButton']"
        )[0]
        python_search_button.click()

        # 3. Get links of houses and apartments from each result page
        self._links = []

        # Get links for each page
        for _ in range(334):
            # Initialize attempts count
            attempts_count = 0
            while attempts_count < 5:
                try:
                    links_tags = driver.find_elements_by_xpath("//a[@class='card__title-link']")
                    self._links.extend([link.get_attribute("href") for link in links_tags])
                    break

                except Exception:
                    # The page may still be loading; retry up to 5 times
                    attempts_count += 1

            # Navigate to next page
            python_label_button = driver.find_elements_by_xpath(
                "//a[@class='pagination__link pagination__link--next button button--text button--size-small']"
            )[0]
            python_label_button.click()


        # print(self._links)
        driver.close()



    # Simple function to get values from details table
    @staticmethod
    def get_detail(soup, name):
        """Get detail from table by name."""
        # Find cell
        tag = soup.find("th", text=re.compile(name, re.IGNORECASE))
        if tag is None:
            return None
        # Info is in sibling tag
        data = tag.next_sibling.next_sibling.contents[0].strip().lower()

        # Convert "yes"/"no" to booleans
        if data in ["yes", "no"]:
            return data == "yes"
        return data


        # 1. locality: str = None
        # label = soup.select_one('th:-soup-contains("locality")')
        # self._property.locality = label.next_sibling.next_sibling.contents[0].strip()
        self._property.locality = self.get_detail(soup, "locality")
        try:
            property_type = soup.select_one(".classified__title").text.strip().lower()
            if "house" in property_type or "house" in self.page_url:
                self._property.property_type = "house"
            elif "apartment" in property_type or "apartment" in self.page_url:
                self._property.property_type = "apartment"
        except AttributeError:  # title element missing: fall back to the URL
            if "house" in self.page_url:
                self._property.property_type = "house"
            elif "apartment" in self.page_url:
                self._property.property_type = "apartment"

        property_subtype = soup.select_one(".classified__title").text.strip().lower()


        # 4. price: float = None
        try:
            price = soup.select_one('span:-soup-contains("€")').text
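            # strip the currency sign and thousands separators
            price = price.replace("€", "").replace(",", "")
            # take the minimum price available, comparing numbers rather than strings
            min_price = min(int(amount) for amount in re.findall(r"\d+", price))
            self._property.price = float(min_price)
        except (AttributeError, ValueError):
            # no price element found, or no digits in it
            self._property.price = None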


        # 5. sale_type: str = None

        # 6. number_rooms: int = None
        # label = soup.select_one('th:-soup-contains("Bedrooms")')
        # self._property.number_rooms = int(label.next_sibling.next_sibling.contents[0].strip())
        self._property.number_rooms = self.get_detail(soup, "Bedrooms")
        # Convert number_rooms into integer if not None
        self._property.number_rooms = (
            int(self._property.number_rooms) if self._property.number_rooms else None
        )

        # 7. area: float = None
        # label = soup.select_one('th:-soup-contains("area")')
        # self._property.area = float(label.next_sibling.next_sibling.contents[0].strip())
        self._property.area = self.get_detail(soup, "area")
        # Convert area into float if not None
        self._property.area = (
            float(self._property.area) if self._property.area else None
        )

        # 8. fully_equipped_kitchen: bool = None
        kitchen_type = self.get_detail(soup, "Kitchen type")
        # Determine if the kitchen is fully equipped or not
        not_installed_labels = ["notinstalled", "uninstalled", "not installed"]
        installed_labels = ["fully", "hyper", "installed"]
        if kitchen_type is not None:
            if any(label in kitchen_type for label in not_installed_labels):
                self._property.fully_equipped_kitchen = False
            elif any(label in kitchen_type for label in installed_labels):
                self._property.fully_equipped_kitchen = True

        # 9. is_furnished: bool = None
        self._property.is_furnished = self.get_detail(soup, "Furnished")

        # 10. has_open_fire: bool = None
        fireplace = self.get_detail(soup, "fireplace")
        # Determine if there is a fireplace
        if fireplace is not None:
            self._property.has_open_fire = int(fireplace) > 0

        # 11. has_terrace: bool = None ; # 12. terrace_area: float = None
        terrace = self.get_detail(soup, "Terrace")
        # Determine if there is a terrace
        if terrace is not None:
            if terrace is True:  # get_detail already converts "yes" to True
                self._property.has_terrace = True
                self._property.terrace_area = None  # None as the terrace area is not specified
            elif terrace is False:  # get_detail already converts "no" to a boolean
                self._property.has_terrace = False
                self._property.terrace_area = 0
            else:
                try:
                    terrace = float(terrace)
                    self._property.has_terrace = terrace > 0
                    self._property.terrace_area = terrace
                except ValueError:
                    self._property.has_terrace = None
                    self._property.terrace_area = None

        # 13. has_garden: bool = None
        garden = self.get_detail(soup, "Garden")
        # Determine if there is a garden
        if garden is not None:
            self._property.has_garden = float(garden) > 0

        # 14. garden_area: float = None
        self._property.garden_area = float(garden) if garden else None

        # 15. land_surface: float = None

        # 16. land_plot_area: float = None
        self._property.land_plot_area = self.get_detail(soup, "Surface of the plot")
        # Convert area into float if not None
        self._property.land_plot_area = (
            float(self._property.land_plot_area)
            if self._property.land_plot_area
            else None
        )

        # 17. number_facades: int = None
        self._property.number_facades = self.get_detail(soup, "frontage")
        # Convert number_facades into integer if not None
        self._property.number_facades = (
            int(self._property.number_facades)
            if self._property.number_facades
            else None
        )

        # 18. has_swimming_pool: bool = None
        self._property.has_swimming_pool = self.get_detail(soup, "Swimming")
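
The price step earlier in this cell can be isolated as a small pure function, which makes it easy to check without a browser or a live page. This is only a sketch: `parse_min_price` is a hypothetical helper and the sample price string is an assumed format, not one taken from the scraped site. The amounts are converted to integers before taking the minimum, because digit strings compare character by character.

```python
import re


def parse_min_price(price_text):
    """Return the lowest number found in a price string, or None."""
    cleaned = price_text.replace("€", "").replace(",", "")
    amounts = [int(number) for number in re.findall(r"\d+", cleaned)]
    return float(min(amounts)) if amounts else None


print(parse_min_price("€950,000 - €1,200,000"))

# Comparing the raw strings instead picks the wrong value,
# because "1" sorts before "9":
print(min(["950000", "1200000"]))
```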