# Chapter 4 Web Crawling Models

## Planning and Defining Objects

This section of the textbook mainly discusses how to design objects to store scraped data. You can think of it as a preparatory step for storing data later. Please check the textbook for a detailed explanation.
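
As a rough illustration of the idea (this is not the textbook's own code), a scraped item can be modeled as a small class whose fields are decided before any scraping happens. The `Product` name, its fields, and the sample values below are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class Product:
    """Hypothetical container for one scraped product page."""
    name: str
    price: float
    url: str


# Values are made up for illustration; a real scraper would fill them in.
item = Product(name='Example Widget', price=9.99,
               url='http://example.com/widget')

# asdict() gives a plain dict, which is convenient for later storage
# (CSV, JSON, or a database row).
record = asdict(item)
print(record)
```

Deciding on the fields first makes it clear which pieces of each page the scraper has to extract.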

## Dealing with Different Website Layouts

Although a general web crawler that works on every website would be ideal, in practice people pre-select a few websites to scrape. Here we give an example of using a `Content` class to handle both "Brookings" and "New York Times".

In [1]:
import requests
from bs4 import BeautifulSoup

We wrap the process of using `BeautifulSoup` to scrape webpages into a utility function.

In [2]:
def getPage(url):
    """
    Utility function used to get a BeautifulSoup object from a given URL
    """

    session = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
    try:
        req = session.get(url, headers=headers)
    except requests.exceptions.RequestException:
        return None
    bs = BeautifulSoup(req.text, 'html.parser')
    return bs

In [3]:
class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')


def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find('h1').text
    lines = bs.select('div.StoryBodyCompanionColumn div p')
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)

def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find('h1').text
    body = bs.find('div', {'class': 'post-body'}).text
    return Content(url, title, body)

Scraping "Brookings":

In [4]:
url = 'https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'
content = scrapeBrookings(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
# uncomment to see the body
# print(content.body)
print("Length of BODY: {}".format(len(content.body)))

Title: Delivering inclusive urban access: 3 uncomfortable truths
URL: https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/

Length of BODY: 8234


Scraping "New York Times":

In [5]:
url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
# uncomment to see the body
# print(content.body)
print("Length of BODY: {}".format(len(content.body)))

Title: The Men Who Want to Live Forever
URL: https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html

Length of BODY: 6515


Do you see a pattern? We can wrap the printing logic in a method of the `Content` class. We can also construct a `Website` class to store information about each website's structure.

In [6]:
class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        """
        Flexible printing function controls output
        """
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        # uncomment to see the body
        # print('BODY:\n{}'.format(self.body))
        print("Length of BODY: {}".format(len(self.body)))

class Website:
    """ 
    Contains information about website structure
    """
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

And a better `Crawler` class:

In [7]:
class Crawler:

    def getPage(self, url):
        session = requests.Session()
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
        try:
            req = session.get(url, headers=headers)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is None:
            print("Request exception caught in: {}".format(url))
        else:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()
            else:
                print("Failed to load title and/or body: {}".format(url))
        print('-'*20)  

Given how the `Crawler` class is defined, we only need to list the websites and their respective selectors.

In [8]:
crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'h1', 'div.StandardArticleBody_body'], # the "_1gnLA" suffix is not needed
    ['Brookings', 'http://www.brookings.edu', 'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com', 'h1', 'div.StoryBodyCompanionColumn div p']
]

websites = []
for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))

crawler.parse(websites[0], 'http://shop.oreilly.com/product/0636920028154.do')
crawler.parse(
    websites[1], 'http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(
    websites[2],
    'https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(
    websites[3], 
    'https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html')

URL: http://shop.oreilly.com/product/0636920028154.do
TITLE: Learning Python, 5th Edition 
Length of BODY: 1306
--------------------
URL: http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0
TITLE: EPA chief wants scientists to debate climate on TV
Length of BODY: 4863
--------------------
URL: https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/
TITLE: Idea to Retire: Old methods of policy education
Idea to Retire: Old methods of policy education
Length of BODY: 9557
--------------------
URL: https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html
TITLE: Oil Boom Gives the U.S. a New Edge in Energy and Diplomacy
Length of BODY: 8498
--------------------
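
Each row of `siteData` lists its values in the same order as the parameters of the `Website` constructor, so the rows can also be unpacked directly. The sketch below redefines a minimal `Website` as a `NamedTuple` (an assumption made here only to keep the example self-contained) and builds the list with `*row` argument unpacking.

```python
from typing import NamedTuple


class Website(NamedTuple):
    """Minimal stand-in for the Website class above (same four fields)."""
    name: str
    url: str
    titleTag: str
    bodyTag: str


siteData = [
    ['Brookings', 'http://www.brookings.edu', 'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com', 'h1',
     'div.StoryBodyCompanionColumn div p'],
]

# Website(*row) is equivalent to Website(row[0], row[1], row[2], row[3])
websites = [Website(*row) for row in siteData]

for site in websites:
    print(site.name, '->', site.bodyTag)
```

Unpacking keeps the loop short, at the cost of relying on the column order of `siteData` matching the constructor.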


## Structuring Crawlers

In this section, we focus on one of the three crawler structures introduced in the textbook: **Crawling Sites Through Search**.

In [9]:
class Content:
    """Common base class for all articles/pages"""

    def __init__(self, topic, title, body, url):
        self.topic = topic
        self.title = title
        self.body = body
        self.url = url

    def print(self):
        """
        Flexible printing function controls output
        """
        print('New article found for topic: {}'.format(self.topic))
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        # uncomment to see the body
        # print('BODY:\n{}'.format(self.body))
        print("Length of BODY: {}".format(len(self.body)))


class Website:
    """Contains information about website structure"""

    def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl
        self.resultListing = resultListing
        self.resultUrl = resultUrl
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag

        
class Crawler:

    def getPage(self, url):
        session = requests.Session()
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
        try:
            req = session.get(url, headers=headers)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        childObj = pageObj.select(selector)
        if childObj is not None and len(childObj) > 0:
            return childObj[0].get_text()
        return ''

    def search(self, topic, site):
        """
        Searches a given website for a given topic and records all pages found
        """
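        bs = self.getPage(site.searchUrl + topic)
        if bs is None:
            # getPage returns None when the request fails
            print('Could not load search page for: {}'.format(site.name))
            return
        searchResults = bs.select(site.resultListing)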
        for result in searchResults[:3]: # limit to first 3 results
            url = result.select(site.resultUrl)[0].attrs['href']
            # Check to see whether it's a relative or an absolute URL
            if site.absoluteUrl:
                bs = self.getPage(url)
            else:
                bs = self.getPage(site.url + url)
            if bs is None:
                print('Something was wrong with that page or URL. Skipping!')
                print('-'*20)
                continue  # skip this result and move on to the next one
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(topic, title, body, url)
                content.print()
            else:
                print("Failed to load title and/or body: {}".format(url))
            print('-'*20)
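
The `search` method builds the search URL by concatenating `site.searchUrl` and `topic`. A topic that contains a space, such as `'data science'`, often still works because HTTP libraries tend to escape the URL, but encoding the topic explicitly makes the intent clear. Below is a minimal sketch using the standard library; `buildSearchUrl` is a hypothetical helper and not part of the textbook's code.

```python
from urllib.parse import quote_plus


def buildSearchUrl(searchUrl, topic):
    """Append a URL-encoded topic to a search URL prefix (hypothetical helper)."""
    # quote_plus replaces spaces with '+' and escapes other reserved characters
    return searchUrl + quote_plus(topic)


print(buildSearchUrl('https://www.brookings.edu/search/?s=', 'data science'))
# prints: https://www.brookings.edu/search/?s=data+science
```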

In [10]:
crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'https://ssearch.oreilly.com/?q=',
        'article.product-result', 'p.title a', True, 'h1', 'section#product-description'], # works only for non-free books
    ['Reuters', 'http://reuters.com', 'http://www.reuters.com/search/news?blob=', 'div.search-result-content',
        'h3.search-result-title a', False, 'h1', 'div.StandardArticleBody_body'],
    ['Brookings', 'http://www.brookings.edu', 'https://www.brookings.edu/search/?s=',
        'div.list-content article', 'h4.title a', True, 'h1', 'div.post-body']
]
sites = []
for row in siteData:
    sites.append(Website(row[0], row[1], row[2],
                         row[3], row[4], row[5], row[6], row[7]))

topics = ['python', 'data science']
for topic in topics:
    print('*'*40)
    print('GETTING INFO ABOUT: ' + topic)
    print('*'*40)
    for targetSite in sites:
        crawler.search(topic, targetSite)

****************************************
GETTING INFO ABOUT: python
****************************************
New article found for topic: python
URL: http://shop.oreilly.com/product/0636920028154.do
TITLE: Learning Python, 5th Edition 
Length of BODY: 1306
--------------------
Failed to load title and/or body: https://www.oreilly.com/programming/free/python-data-for-developers.csp
--------------------
Failed to load title and/or body: https://www.oreilly.com/programming/free/python-for-scientists.csp
--------------------
New article found for topic: python
URL: /article/idUSKBN1OD2CM
TITLE: UK woman illegally imported python-skin products
Length of BODY: 1262
--------------------
New article found for topic: python
URL: /article/idUSKCN11S04G
TITLE: Python in India demonstrates huge appetite
Length of BODY: 1008
--------------------
New article found for topic: python
URL: /article/idUSKBN0L31PS20150130
TITLE: Zimbabwean jailed for nine years for eating python meat
Length of BODY: 617
