# <u>Chapter 8</u>: Text Summarization

This chapter introduces another challenging topic in natural language processing and demystifies methods for text summarization. To implement such systems, we use data from the Web, so we first examine techniques for accessing and automatically parsing this resource. Besides the standard text summarization methods, we delve into a state-of-the-art architecture that performs well in many real-world applications. This topology extends the seq2seq architectures we have already discussed and combines many concepts encountered throughout the book. Finally, as in previous chapters, we discuss the metrics for assessing the performance of such systems.

In [None]:
!pip install scrapy
!pip install pandas
!pip install sumy
!pip install nltk
!pip install wikipedia

## Scraping book reviews

As in the quotes example, we crawl a website with book reviews that lists 152 books across eight web pages. The spider is seeded with a selected URL and is responsible for identifying and visiting all subsequent links.

An ``Item`` in ``scrapy`` is a logical grouping (container) of extracted data points from a website. In the following code, we define the ``BookItem`` to read the title and the product description of a book.

In [None]:
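import scrapy

# Recent Scrapy versions provide the processors in the itemloaders package;
# the older scrapy.loader.processors path is kept as a fallback.
try:
    from itemloaders.processors import MapCompose, TakeFirst
except ImportError:
    from scrapy.loader.processors import MapCompose, TakeFirst

# Remove the double quotes from the input.
def remove_quotes(text):
    return text.replace('"', '')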

# Create the book item for scraping.
class BookItem(scrapy.Item):

    # The item consists of a title and a description.
    title = scrapy.Field(output_processor=TakeFirst())
    product_description = scrapy.Field(input_processor=MapCompose(remove_quotes), output_processor=TakeFirst())
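
To make the role of the two processors more concrete, the following sketch mimics their behavior in plain Python. The helpers `map_compose_sketch` and `take_first_sketch` are hypothetical stand-ins written for illustration, not the library's implementation, and the sample values are made up.

```python
# Hypothetical stand-ins for the processors used in BookItem (illustration only).

def remove_quotes(text):
    # Same cleaning step as in the item definition above.
    return text.replace('"', '')

def map_compose_sketch(*functions):
    # Apply each function, in order, to every extracted value.
    def process(values):
        for function in functions:
            values = [function(value) for value in values]
            # Drop values for which a function returned None.
            values = [value for value in values if value is not None]
        return values
    return process

def take_first_sketch(values):
    # Return the first value that is neither None nor an empty string.
    for value in values:
        if value is not None and value != '':
            return value
    return None

# Made-up values, shaped like the matches a CSS selector might return.
extracted = ['', '"A gripping tale," said the critic.', 'Second match']
cleaned = map_compose_sketch(remove_quotes)(extracted)
description = take_first_sketch(cleaned)
print(description)
```

The input processor cleans every extracted value, while the output processor reduces the cleaned list to a single value, which is why each field of the item ends up holding one string rather than a list.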

Let's now create the crawler and set the start URL.

In [None]:
from scrapy.loader import ItemLoader

# Create a spider for scraping book info.
class BookSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/catalogue/category/books/default_15/index.html']
    custom_settings = {
        "FEEDS" : { "books.json": { "format": "json", "overwrite": True}}
    }
    
    # Parse the info for each page with books.
    def parse(self, response):

        # Iterate over all products in the page.
        for article in response.css("article.product_pod"):

            # Get the url for one book.
            book_url = article.css("div > a::attr(href)").get()
            
            if book_url:
                # Parse the info for the specific book.
                yield response.follow(
                    url=book_url,
                    callback=self.parse_book_info,
                    dont_filter=True)

        # Go to the next books page.
        next_url = response.css("li.next > a::attr(href)").get()
        if next_url:
            yield response.follow(url=next_url, callback=self.parse)


    # Callback method for scraping a specific book's page.
    def parse_book_info(self, response):

        item_loader = ItemLoader(item=BookItem(), response=response)
        item_loader.add_css('title', "div > h1::text")
        item_loader.add_css('product_description', "div#product_description + p::text")

        return item_loader.load_item()
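
The links extracted by the selectors are typically relative, and `response.follow` resolves them against the URL of the page on which they were found. A minimal sketch of that resolution step, using the standard library's `urljoin` as an analogy, is shown below. The two `href` values are hypothetical examples, not values read from the site.

```python
from urllib.parse import urljoin

# URL of the first listing page (the spider's start URL).
listing_url = 'https://books.toscrape.com/catalogue/category/books/default_15/index.html'

# Hypothetical relative links, similar in shape to what the selectors may return.
next_href = 'page-2.html'
book_href = '../../../some-book_1/index.html'

# Relative links are resolved against the URL of the page they were found on.
next_page_url = urljoin(listing_url, next_href)
book_page_url = urljoin(listing_url, book_href)

print(next_page_url)
print(book_page_url)
```

Because the resolution happens per response, the same relative `href` can point to different pages depending on where it was found, which is why the spider follows links through the response object instead of concatenating strings.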

We create and start a crawler process using the ``BookSpider``.

In [None]:
from scrapy.crawler import CrawlerProcess

# Create a crawler process using the book spider.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Start the crawling.
crawler = process.create_crawler(BookSpider)
process.crawl(crawler)
process.start()

# If you get a ReactorNotRestartable error, restart the kernel:
# the reactor can only be started once per process.

Let's verify that everything worked as expected.

In [None]:
# Print statistics from the scraping process.
stats_dict = crawler.stats.get_stats()
stats_dict
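
Rather than inspecting the dictionary by eye, a few entries can be checked programmatically. The sketch below assumes the statistics include the `finish_reason` and `item_scraped_count` entries. The helper `crawl_looks_complete` is hypothetical, and the sample values are made up rather than taken from an actual run.

```python
# Sample statistics with made-up values; a real run reports many more entries.
sample_stats = {
    'item_scraped_count': 152,
    'downloader/response_status_count/200': 160,
    'finish_reason': 'finished',
}

def crawl_looks_complete(stats, expected_items):
    # Hypothetical helper: the crawl ended normally and produced enough items.
    finished = stats.get('finish_reason') == 'finished'
    scraped = stats.get('item_scraped_count', 0)
    return finished and scraped >= expected_items

print(crawl_looks_complete(sample_stats, expected_items=152))
```

In the notebook, the same helper could be called with `stats_dict` in place of `sample_stats`.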

## Performing extractive summarization

Extractive summarization identifies important words or phrases and stitches together portions of the content to produce a condensed version of the original text. We use the previously created ``books.json`` file and employ different methods to extract summaries for an input document.

In [None]:
import pandas as pd

df = pd.read_json('books.json')
df.head()

Next, we ensure that there are no missing values.

In [None]:
# Remove missing values.
df = df.dropna()
df.shape
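
Missing values can appear when a book page has no description, so the scraped record lacks that field. The sketch below illustrates the same filtering idea with the standard library only, assuming the feed is a JSON array of objects. The three records are made up for illustration.

```python
import json

# Made-up feed content: the second record has no description.
feed = '''[
  {"title": "Book A", "product_description": "A short description."},
  {"title": "Book B"},
  {"title": "Book C", "product_description": "Another description."}
]'''

records = json.loads(feed)
required = ('title', 'product_description')

# Keep only records where every required field is present and not null.
complete = [r for r in records if all(r.get(k) is not None for k in required)]

print(len(records), len(complete))
```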

Let's now print a sample description.

In [None]:
# Print a sample description.
print(df['product_description'][136])

Then, we define a generic method that performs summarization.

In [None]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lex_rank import LexRankSummarizer 
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.summarizers.sum_basic import SumBasicSummarizer
from sumy.summarizers.kl import KLSummarizer
from sumy.summarizers.reduction import ReductionSummarizer
from sumy.utils import get_stop_words
import nltk
from nltk.corpus import stopwords

# Download the NLTK resources used for stop words and sentence tokenization.
# Newer NLTK releases may also require the 'punkt_tab' resource.
nltk.download('stopwords')
nltk.download('punkt')

stop_words = stopwords.words('english')

# Summarize the input given a method and a number of output sentences.
def summarize(text, method, sentence_num, language='english'):
    summarizer = method(Stemmer(language))
    summarizer.stop_words = get_stop_words(language)

    # For this summarizer we can define positive (bonus),
    # negative (stigma), and null (ignored) words.
    if isinstance(summarizer, EdmundsonSummarizer):
        # No real bonus or stigma words are used; a single empty string
        # serves as a placeholder because empty collections are rejected.
        summarizer.bonus_words = ['']
        summarizer.stigma_words = ['']
        summarizer.null_words = stop_words

    # Extract the summary.
    document = PlaintextParser(text, Tokenizer(language)).document
    summary = summarizer(document, sentence_num)

    return summary

Finally, we can extract the summaries using seven methods.

In [None]:
# Extract summaries with all methods.
for method in [EdmundsonSummarizer, KLSummarizer, LexRankSummarizer, LsaSummarizer,
               LuhnSummarizer, ReductionSummarizer, TextRankSummarizer]:

    print('>> ' + method.__name__ + ':')
    summary = summarize(df['product_description'][136], method, 1)

    # Print the summary.
    for sentence in summary:
        print(sentence)
    
    print('')
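
To give an intuition for what such methods do internally, here is a minimal frequency-based extractive summarizer in plain Python. It is a simplified sketch and not the algorithm of any of the summarizers used above: the sentence splitter, the tiny stop-word list, and the scoring rule are all assumptions chosen for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {'a', 'an', 'and', 'the', 'is', 'it', 'of', 'to', 'in', 'on'}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def content_words(sentence):
    # Lowercase alphabetic tokens that are not stop words.
    return [w for w in re.findall(r'[a-z]+', sentence.lower()) if w not in STOP_WORDS]

def frequency_summary(text, sentence_num):
    sentences = split_sentences(text)
    frequencies = Counter(w for s in sentences for w in content_words(s))

    # Score a sentence by the average document frequency of its content words.
    def score(sentence):
        words = content_words(sentence)
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:sentence_num])]

text = ('The dragon guards the castle. '
        'A knight rides to the castle to fight the dragon. '
        'It rains on Tuesday.')
print(frequency_summary(text, 1))
```

Sentences whose words recur throughout the document score higher, so the off-topic sentence about the weather is the last one to be selected. The output always consists of sentences copied verbatim from the input, which is the defining property of extractive summarization.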


In [None]:
import wikipedia

# Get wiki content.
wikisearch = wikipedia.page("Athens")
wikicontent = wikisearch.content
wikisummary = wikisearch.summary

print(wikisummary)

In [None]:
# Search Wikipedia and keep the top five matching page titles.
result = wikipedia.search("India", results=5)

# Print the search results.
print(result)

In [None]:
# Set the language to English.
wikipedia.set_lang("en")
 
# printing the summary
print(wikipedia.summary("Microsoft"))

In [None]:
# wikipedia page object is created
page_object = wikipedia.page("Microsoft")
 
# Print the HTML of the page (html is a method, so it must be called).
print(page_object.html())
 
# printing title
print(page_object.original_title)
 
# printing links on that page object
print(page_object.links[0:100])

### Machine Learning Techniques for Text 
&copy;2021&ndash;2022, Nikos Tsourakis, <nikos@tsourakis.net>, Packt Publications. All Rights Reserved.