# <u>Chapter 7</u>: Summarizing Wikipedia Articles

In this exercise, we scrape text data from [Wikipedia](https://en.wikipedia.org/).

In [1]:
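import sys
import subprocess
from importlib import metadata

# Find out which packages are missing.
# importlib.metadata replaces the deprecated pkg_resources module.
installed_packages = {(dist.metadata['Name'] or '').lower() for dist in metadata.distributions()}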
required_packages = {'scrapy', 'wikipedia'}
missing_packages = required_packages - installed_packages

# Install any missing packages.
if missing_packages:
    print('Installing the following packages: ' + str(missing_packages))
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing_packages], stdout=subprocess.DEVNULL)

<ins>Note</ins>: If you get a _ReactorNotRestartable_ error, restart the kernel. The Twisted reactor that drives ``scrapy`` can only be started once per process, so the crawl cell cannot be run a second time in the same session.

``XML path`` (XPath) is a language for selecting nodes in XML and HTML documents, and ``scrapy`` supports it as an alternative to the CSS selectors used so far. We implement the spider, set the start URL, and define the ``parse`` method, which prints the text of every ``span`` tag whose class is ``mw-headline``.

In [2]:
import scrapy

# Create a spider for scraping Wikipedia articles.
class WikipediaSpider(scrapy.Spider):
    name = 'wikipedia_spider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Athens']
    
    # Parse the info for a specific page.
    def parse(self, response):

        print(response.xpath("//span[@class='mw-headline']/text()").getall())

Starting the crawler then prints all section headlines of the Athens article.

In [3]:
from scrapy.crawler import CrawlerProcess

# Create a crawler process for the Wikipedia spider.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Start the crawling.
crawler = process.create_crawler(WikipediaSpider)
process.crawl(crawler)
process.start()

2022-11-07 12:44:29 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-07 12:44:29 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Windows-10-10.0.19042-SP0
2022-11-07 12:44:29 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-07 12:44:29 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2022-11-07 12:44:29 [scrapy.extensions.telnet] INFO: Telnet Password: 25313d2eb5206d45
2022-11-07 12:44:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-07 12:44:29 [scrapy.middleware] INFO: Enabled downloader

['Etymology and names', 'History', 'Geography', 'Environment', 'Safety', 'Climate', 'Locations', 'Neighbourhoods of the center of Athens (Municipality of Athens)', 'Parks and zoos', 'Urban and suburban municipalities', 'Administration', 'Athens Urban Area', 'Athens Metropolitan Area', 'Demographics', 'Population in modern times', 'Population of the Athens Metropolitan Area', 'Population in ancient times', 'Government and politics', 'International relations and influence', 'Twin towns – sister cities', 'Partnerships', 'Other locations named after Athens', 'Economy', 'Transport', 'Bus transport', 'Athens Metro', 'Commuter/suburban rail (Proastiakos)', 'Tram', 'Athens International Airport', 'Railways and ferry connections', 'Motorways', 'Education', 'Culture', 'Archaeological hub', 'Architecture', 'Urban sculpture', 'Museums', 'Tourism', 'Entertainment and performing arts', 'Sports', 'Sports clubs', 'Olympic Games', '1896 Summer Olympics', '1906 Summer Olympics', '2004 Summer Olympics', 

## What we have learned …

| |
| --- |
| **Tools**<ul><li>Web crawling and scraping</li></ul> |
| |