# <u>Chapter 7</u>: Summarizing Wikipedia Articles

This exercise introduces another challenging topic in natural language processing, `text summarization`, and demystifies its standard methods. The systems we implement rely on data coming from the Web, so we also examine techniques for accessing and automatically parsing this resource.

In [1]:
import sys
import subprocess
import pkg_resources

# Find out which packages are missing.
installed_packages = {dist.key for dist in pkg_resources.working_set}
required_packages = {'scrapy', 'pandas', 'sumy', 'nltk', 'wikipedia'}
missing_packages = required_packages - installed_packages

# If there are missing packages install them.
if missing_packages:
    print('Installing the following packages: ' + str(missing_packages))
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing_packages], stdout=subprocess.DEVNULL)
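
Recent setuptools releases deprecate `pkg_resources`, so the same check can also be sketched with the standard-library `importlib.metadata` module. The helper name `find_missing` is hypothetical and not part of the original notebook.

```python
from importlib import metadata

def find_missing(required_packages):
    """Return the subset of required_packages that is not installed."""
    installed = set()
    for dist in metadata.distributions():
        name = dist.metadata.get('Name')
        if name:  # Skip distributions with broken metadata.
            installed.add(name.lower())
    return {pkg for pkg in required_packages if pkg.lower() not in installed}

# Same package set as in the cell above.
missing_packages = find_missing({'scrapy', 'pandas', 'sumy', 'nltk', 'wikipedia'})
print(missing_packages)
```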

## Scrape book reviews

We crawl one section of a book website (http://books.toscrape.com/) that contains 152 book items spread over 8 web pages. The spider is seeded with a start URL and is responsible for identifying and visiting all subsequent links.

An ``Item`` in ``scrapy`` is a logical grouping (container) of extracted data points from a website. In the following code, we define the ``BookItem`` to read the title and the product description of a book.

In [2]:
import scrapy
# The processors live in the itemloaders package (scrapy.loader.processors is deprecated).
from itemloaders.processors import MapCompose, TakeFirst

# Remove the double quotes from the input.
def remove_quotes(text):
    return text.replace('"', '')

# Create the book item for scraping.
class BookItem(scrapy.Item):

    # The item consists of a title and a description.
    title = scrapy.Field(output_processor=TakeFirst())
    product_description = scrapy.Field(input_processor=MapCompose(remove_quotes), output_processor=TakeFirst())
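
To make the role of the two processors concrete, here is a simplified pure-Python sketch of their behaviour: `MapCompose` applies each function to every extracted value, and `TakeFirst` returns the first non-empty value. The helpers `map_compose` and `take_first` are hypothetical stand-ins written for illustration, not the library's implementation.

```python
def remove_quotes(text):
    """Remove the double quotes from the input."""
    return text.replace('"', '')

def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply each function to every value."""
    def process(values):
        for func in functions:
            results = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # Values mapped to None are dropped.
                if isinstance(result, (list, tuple)):
                    results.extend(result)  # Flatten iterable results.
                else:
                    results.append(result)
            values = results
        return values
    return process

def take_first(values):
    """Simplified stand-in for TakeFirst: first non-null, non-empty value."""
    for value in values:
        if value is not None and value != '':
            return value
    return None

# A selector may return several strings; the item keeps only one cleaned value.
extracted = ['', 'A "quoted" description', 'ignored']
cleaned = map_compose(remove_quotes)(extracted)
print(take_first(cleaned))
```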



Let's now create the crawler and set the start URL.

In [3]:
from scrapy.loader import ItemLoader

# Create a spider for scraping book info.
class BookSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/catalogue/category/books/default_15/index.html']
    custom_settings = {
        "FEEDS" : { "books.json": { "format": "json", "overwrite": True}}
    }
    
    # Parse the info for each page with books.
    def parse(self, response):

        # Iterate over all products on the page.
        for article in response.css("article.product_pod"):

            # Get the url for one book.
            book_url = article.css("div > a::attr(href)").get()
            
            if book_url:
                # Parse the info for the specific book.
                yield response.follow(
                    url=book_url,
                    callback=self.parse_book_info,
                    dont_filter=True)

        # Go to the next books page.
        next_url = response.css("li.next > a::attr(href)").get()
        if next_url:
            yield response.follow(url=next_url, callback=self.parse)


    # Callback method for scraping a specific book's page.
    def parse_book_info(self, response):

        item_loader = ItemLoader(item=BookItem(), response=response)
        item_loader.add_css('title', "div > h1::text")
        item_loader.add_css('product_description', "div#product_description + p::text")

        return item_loader.load_item()
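
The `href` values extracted by the CSS selectors are typically relative, and `response.follow` resolves them against the URL of the current page. The following sketch reproduces that resolution with the standard-library `urljoin`; the two relative links are hypothetical examples, not values copied from the site.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL).
page_url = 'https://books.toscrape.com/catalogue/category/books/default_15/index.html'

# Hypothetical relative links, as a selector might return them.
next_href = 'page-2.html'
book_href = '../../../some-book_1/index.html'

# Resolve each relative link against the current page.
next_page = urljoin(page_url, next_href)
book_page = urljoin(page_url, book_href)

print(next_page)
print(book_page)
```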

We create and start a crawler process using the ``BookSpider``.

In [4]:
from scrapy.crawler import CrawlerProcess

# Create a crawler process using the book spider.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Start the crawling.
crawler = process.create_crawler(BookSpider)
process.crawl(crawler)
process.start()

# In case you get: ReactorNotRestartable error, you have to restart the kernel.
# The reactor is only meant to run once.

2022-11-02 00:22:42 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-02 00:22:42 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Windows-10-10.0.19042-SP0
2022-11-02 00:22:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-02 00:22:42 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2022-11-02 00:22:42 [scrapy.extensions.telnet] INFO: Telnet Password: 5468ed58ba18b9d5
2022-11-02 00:22:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-11-02 00:22:4

Let's verify that everything worked as expected.

In [5]:
# Print statistics from the scraping process.
stats_dict = crawler.stats.get_stats()
stats_dict

{'log_count/INFO': 13,
 'start_time': datetime.datetime(2022, 11, 1, 23, 22, 42, 889317),
 'scheduler/enqueued/memory': 160,
 'scheduler/enqueued': 160,
 'scheduler/dequeued/memory': 160,
 'scheduler/dequeued': 160,
 'downloader/request_count': 160,
 'downloader/request_method_count/GET': 160,
 'downloader/request_bytes': 60476,
 'downloader/response_count': 160,
 'downloader/response_status_count/200': 160,
 'downloader/response_bytes': 3349033,
 'log_count/DEBUG': 312,
 'response_received_count': 160,
 'request_depth_max': 8,
 'item_scraped_count': 152,
 'elapsed_time_seconds': 4.635436,
 'finish_time': datetime.datetime(2022, 11, 1, 23, 22, 47, 524753),
 'finish_reason': 'finished',
 'feedexport/success_count/FileFeedStorage': 1}

Indeed, 160 pages are downloaded (8 listing pages and 152 book pages), 152 book items are scraped, and their data is stored in the _books.json_ file.
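
This check can also be scripted. A minimal sketch, with the values copied from the statistics above and assuming that each book page yields exactly one item:

```python
# Values copied from the printed statistics.
stats = {
    'response_received_count': 160,
    'downloader/response_status_count/200': 160,
    'item_scraped_count': 152,
}

# Every response should have succeeded with HTTP status 200.
all_ok = stats['response_received_count'] == stats['downloader/response_status_count/200']

# Responses that did not produce an item are the listing pages.
listing_pages = stats['response_received_count'] - stats['item_scraped_count']

print(all_ok, listing_pages)
```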

## Extractive summarization

`Extractive summarization` identifies important words or phrases and stitches together portions of the content to produce a condensed version of the original text. We use the previously created _books.json_ file and apply different methods to extract a summary from an input document.
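
As a rough illustration of the idea, the sketch below scores each sentence by the average corpus frequency of its non-stop words and returns the top-scoring sentences in their original order. It is a deliberately simplified frequency heuristic, not the algorithm of any `sumy` summarizer; the function names, the tiny stop-word set, and the regex-based sentence splitter are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption, not a complete list).
STOP_WORDS = {'a', 'an', 'and', 'the', 'of', 'to', 'in', 'is', 'on'}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(text):
    """Lowercase words without stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize_by_frequency(text, sentence_num=1):
    """Return the sentence_num highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    frequencies = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(frequencies[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentence_num])
    return [sentences[i] for i in chosen]

sample = "Cats sleep a lot. Cats chase mice and cats purr. Dogs bark."
print(summarize_by_frequency(sample, 1))
```

Averaging the word frequencies is one of several possible design choices; summing them instead would bias the selection toward longer sentences.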

In [6]:
import pandas as pd

df = pd.read_json('books.json')
df.head()

2022-11-02 00:22:48 [numexpr.utils] INFO: NumExpr defaulting to 8 threads.


| | title | product_description |
| --- | --- | --- |
| 0 | Tracing Numbers on a Train | Start preparing children for classroom success... |
| 1 | A Piece of Sky, a Grain of Rice: A Memoir in F... | In this layered collage of memory within memor... |
| 2 | The Bridge to Consciousness: I'm Writing the B... | |
| 3 | The Emerald Mystery | Three young adults are invited to spend a week... |
| 4 | The Secret (The Secret #1) | Fragments of a Great Secret have been found in... |


Next, we drop the rows with missing values, such as the book at index 2 above, which has no description.

In [7]:
# Remove missing values.
df = df.dropna()
df.shape

(151, 2)

Let's now print a sample description.

In [8]:
# Print a sample description.
print(df['product_description'][136])

Mikael Blomkvist, crusading journalist and publisher of the magazine Millennium, has decided to run a story that will expose an extensive sex trafficking operation between Eastern Europe and Sweden, implicating well-known and highly placed members of Swedish society, business, and government.But he has no idea just how explosive the story will be until, on the eve of publi Mikael Blomkvist, crusading journalist and publisher of the magazine Millennium, has decided to run a story that will expose an extensive sex trafficking operation between Eastern Europe and Sweden, implicating well-known and highly placed members of Swedish society, business, and government.But he has no idea just how explosive the story will be until, on the eve of publication, the two investigating reporters are murdered. And even more shocking for Blomkvist: the fingerprints found on the murder weapon belong to Lisbeth Salander—the troubled, wise-beyond-her-years genius hacker who came to his aid in The Girl with

Then, we define a generic method that performs summarization.

In [9]:
import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lex_rank import LexRankSummarizer 
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.summarizers.kl import KLSummarizer
from sumy.summarizers.reduction import ReductionSummarizer
from sumy.utils import get_stop_words
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = stopwords.words('english')

# Summarize the input text with the given method and number of sentences.
def summarize(text, method, sentence_num, language='english'):
    summarizer = method(Stemmer(language))
    summarizer.stop_words = get_stop_words(language)

    # For this summarizer, we can define positive (bonus),
    # negative (stigma), and stop (null) words.
    if isinstance(summarizer, EdmundsonSummarizer):
        # No real bonus or stigma words are used; each list holds a single
        # empty string as a placeholder.
        summarizer.bonus_words = ['']
        summarizer.stigma_words = ['']
        summarizer.null_words = stop_words

    # Extract the summary.
    summary = summarizer(PlaintextParser(text, Tokenizer(language)).document, sentence_num)

    return summary

Finally, we can extract the summaries using seven methods.

In [10]:
# Extract summaries with all methods.
for method in [EdmundsonSummarizer, KLSummarizer, LexRankSummarizer, LsaSummarizer,
               LuhnSummarizer, ReductionSummarizer, TextRankSummarizer]:

    print('>> ' + method.__name__ + ':')
    summary = summarize(df['product_description'][136], method, 1)

    # Print the summary.
    for sentence in summary:
        print(sentence)
    
    print('')


>> EdmundsonSummarizer:
Mikael Blomkvist, crusading journalist and publisher of the magazine Millennium, has decided to run a story that will expose an extensive sex trafficking operation between Eastern Europe and Sweden, implicating well-known and highly placed members of Swedish society, business, and government.But he has no idea just how explosive the story will be until, on the eve of publi Mikael Blomkvist, crusading journalist and publisher of the magazine Millennium, has decided to run a story that will expose an extensive sex trafficking operation between Eastern Europe and Sweden, implicating well-known and highly placed members of Swedish society, business, and government.But he has no idea just how explosive the story will be until, on the eve of publication, the two investigating reporters are murdered.

>> KLSummarizer:
And even more shocking for Blomkvist: the fingerprints found on the murder weapon belong to Lisbeth Salander—the troubled, wise-beyond-her-years genius h

## What we have learned …

| |
| --- |
| **Tools**<ul><li>Web crawling and scraping</li></ul> |
| |