# <u>Chapter 7</u>: Summarizing Wikipedia Articles

In this exercise, we scrape data from http://quotes.toscrape.com, a website that includes popular quotes from famous people. 

In [1]:
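import sys
import subprocess
from importlib import metadata

# Find out which packages are missing.
# importlib.metadata replaces the deprecated pkg_resources API (Python 3.8+).
installed_packages = {(dist.metadata['Name'] or '').lower() for dist in metadata.distributions()}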
required_packages = {'scrapy'}
missing_packages = required_packages - installed_packages

# If there are missing packages install them.
if missing_packages:
    print('Installing the following packages: ' + str(missing_packages))
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing_packages], stdout=subprocess.DEVNULL)

<ins>Note</ins>: If you get a _ReactorNotRestartable_ error, restart the kernel. The Twisted reactor that drives the crawl can be started only once per process, so re-running the crawl cell in the same kernel session fails.
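
An alternative to restarting the kernel is to launch each crawl in a separate Python process, so that every run gets its own reactor. The following is a minimal sketch of that idea using only the standard library. The helper name `run_in_fresh_process` and the file name `quotes_spider.py` are assumptions made for illustration and are not part of this notebook.

```python
import subprocess
import sys


def run_in_fresh_process(python_args):
    """Run the current Python interpreter with the given arguments in a child process."""
    return subprocess.run(
        [sys.executable, *python_args],
        capture_output=True,
        text=True,
        check=True,
    )


# Demonstration with a trivial command; every call starts a new interpreter.
result = run_in_fresh_process(['-c', 'print("fresh process")'])
print(result.stdout.strip())

# Hypothetical usage for a crawl, assuming the spider is saved as quotes_spider.py:
# run_in_fresh_process(['-m', 'scrapy', 'runspider', 'quotes_spider.py'])
```

Because the child process exits after each call, the reactor is never started twice in the same process. The trade-off is that the spider has to live in a file rather than in a notebook cell.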

## Web scraping

The ``scrapy`` Python framework is an elegant way to implement spiders in Python for large-scale web scraping. In the code that follows, we define the spider, set its start URL, and implement the ``parse`` method that extracts the text, author, and tags of each quote.

In [2]:
import scrapy

# Create a spider for scraping quotes.
class QuotesSpider(scrapy.Spider):
    name = 'quote_spider'
    start_urls = ['http://quotes.toscrape.com']    
    
    # Define its parse method.
    def parse(self, response):
        print(f"Visiting: {response.url}")

        # Parse the info for each quote.
        for quote in response.css("div.quote"):
            text = quote.css("span.text::text").get()
            author = quote.css("small.author::text").get()
            tags = quote.css("div.tags a.tag::text").getall()
            
            print(dict(text=text, author=author, tags=tags))

Next, let's create and start a crawler process using the ``QuotesSpider``.

In [3]:
from scrapy.crawler import CrawlerProcess

# Create a crawler process using the quote spider.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Start the crawling.
crawler = process.create_crawler(QuotesSpider)
process.crawl(crawler)
process.start()

2022-11-01 23:37:18 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-01 23:37:18 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Windows-10-10.0.19042-SP0
2022-11-01 23:37:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-01 23:37:18 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2022-11-01 23:37:18 [scrapy.extensions.telnet] INFO: Telnet Password: 2a9d51827d98cbf5
2022-11-01 23:37:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-01 23:37:18 [scrapy.middleware] INFO: Enabled downloader

Visiting: http://quotes.toscrape.com
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags

Each quote is printed as a Python dictionary with three keys: _text_, _author_ and _tags_. The first two hold strings, while _tags_ holds a list of strings.
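
To store or exchange such a record as a JSON string, the standard `json` module can be used. The following is a minimal sketch, assuming a single record whose values are copied from the output above. The variable names are illustrative only.

```python
import json

# One record in the shape printed by the spider (values taken from the output above).
quote = {
    'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    'author': 'J.K. Rowling',
    'tags': ['abilities', 'choices'],
}

# ensure_ascii=False keeps the curly quotation marks readable instead of escaping them.
json_line = json.dumps(quote, ensure_ascii=False)
print(json_line)

# Parsing the string gives back an equal dictionary.
restored = json.loads(json_line)
```

Unlike the printed dictionary, the JSON string uses double quotes for keys and values, so it can be read by any JSON parser.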

## What we have learned …

| |
| --- |
| **Tools**<ul><li>Web crawling and scraping</li></ul> |
| |

## Author Information

- **Author:** Nikos Tsourakis
- **Email:** nikos@tsourakis.net
- **Website:** [tsourakis.net](https://tsourakis.net)
- **Date:** November 20, 2023