# Chapter 4: Web Crawling Models

---

A well-designed web crawler is valuable not only because it collects data, but because it stays maintainable, robust, and adaptable as situations and site patterns change.

---

## 4.1 Planning and Defining Objects

A common web scraping mistake is collecting data based only on what's visible, leading to an unsustainable and messy dataset that requires constant adjustments and becomes difficult to manage.

**Note**

When collecting data, it's essential to assess its relevance to the project goals. Key questions to consider include whether the data is necessary, if it's redundant, if it can be collected later, and if it makes sense to store it within the current object (e.g., storing product descriptions that vary by store may not be logical).

A flexible data model is essential for handling sparse and evolving product attributes without frequent schema changes. Price tracking requires associating products with stores, timestamps, and prices, while attribute-based pricing may necessitate separate product instances. When scraping news articles, distinguishing between essential and optional fields ensures adaptability. Thoughtful planning of data structures prevents inefficiencies, making extraction and analysis more scalable and maintainable.
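
As a minimal sketch of the data model described above, the following keeps sparse product attributes in a dictionary and stores each price as a separate observation tied to a product, a store, and a timestamp. The class names, field names, and sample values are assumptions made for illustration; they are not taken from the book.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Product:
    """A product described independently of any store."""
    product_id: str
    name: str
    # Sparse, evolving attributes live in a dict, so a new attribute
    # does not require a schema change.
    attributes: dict = field(default_factory=dict)


@dataclass
class PriceObservation:
    """One price for one product at one store at one point in time."""
    product_id: str
    store: str
    price: float
    observed_at: datetime


def latest_price(observations, product_id, store):
    """Return the most recent price for a product at a store, or None."""
    matches = [o for o in observations
               if o.product_id == product_id and o.store == store]
    if not matches:
        return None
    return max(matches, key=lambda o: o.observed_at).price


# Hypothetical sample data
shirt = Product('sku-1', 'T-shirt', {'color': 'red', 'size': 'M'})
observations = [
    PriceObservation('sku-1', 'store-a', 19.99,
                     datetime(2024, 1, 1, tzinfo=timezone.utc)),
    PriceObservation('sku-1', 'store-a', 17.49,
                     datetime(2024, 2, 1, tzinfo=timezone.utc)),
    PriceObservation('sku-1', 'store-b', 18.00,
                     datetime(2024, 1, 15, tzinfo=timezone.utc)),
]
print(latest_price(observations, 'sku-1', 'store-a'))  # 17.49
```

Keeping prices out of the `Product` object means the same product can be tracked across stores and over time without duplicating its description.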

---

## 4.2 Dealing with Different Website Layouts

Web crawling often targets a small set of known sites, so hand-written parsing rules are usually enough and complex algorithms are unnecessary. The most straightforward approach is to write a separate scraper for each site, using BeautifulSoup to extract structured content into Python objects.

Here is an example:

In [40]:
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body
    
    def print(self):
        print(f'TITLE: {self.title}')
        print(f'URL: {self.url}')
        print(f'BODY:\n {self.body}')

In [81]:
def scrapeCNN(url):
    bs = BeautifulSoup(urlopen(url), 'html.parser')
    title = bs.find('h1').text.strip()
    body = bs.find('div', {'class': 'article__content'}).text.strip()
    print('body: ')
    print(body)
    return Content(url, title, body)

def scrapeBrookings(url):
    bs = BeautifulSoup(requests.get(url).text, 'html.parser')
    title = bs.find('h1').text.strip()
    try:
        body = bs.find('section', {'id': 'content'}).text.strip()
    except AttributeError:
        # find() returned None: the section is missing on this page
        body = None
    return Content(url, title, body)

In [85]:
url = 'https://www.brookings.edu/research/robotic-rulemaking/'
content = scrapeBrookings(url)
content.print()

TITLE: Robotic rulemaking
URL: https://www.brookings.edu/research/robotic-rulemaking/
BODY:
 Sections



















Print








Series on Regulatory Process and Perspective







Read more from


        Series on Regulatory Process and Perspective
              









Follow the authors



Bridget C. E. Dooling









Mark Febrizio









See More







      More On
    

Technology & Information
U.S. Economy




								Sub-Topics
								

Regulatory Policy 






Program

Economic Studies 


Center

Center on Regulation and Markets 





As it has rocketed to some 100 million active users in record time, ChatGPT is provoking conversations about the role of artificial intelligence (AI) in drafting written materials such as student exams, news articles, legal pleadings, poems, and more. The chatbot, developed by OpenAI, relies on a large language model (LLM) to respond to user-submitted requests, or “prompts” as they are known. It is an example of generative AI, a te

In [86]:
url = 'https://www.cnn.com/2023/04/03/investing/dogecoin-elon-musk-twitter/index.html'
content = scrapeCNN(url)
content.print()

body: 
New York
CNN
         — 
    


            Twitter’s traditional bird icon was booted and replaced with an image of a Shiba Inu, an apparent nod to dogecoin, the joke cryptocurrency that CEO Elon Musk is being sued over. 
    

            Musk addressed the change Monday afternoon, tweeting, “as promised” above an image of a year-old conversation in which another user suggested that Musk “just buy Twitter” and “change the bird logo to a doge.” 
    











CNN/Adobe Stock





Elon Musk's Twitter promised a purge of blue check marks. Instead he singled out one account




            The doge logo appeared on the site two days after Musk asked a judge to throw out a $258 billion racketeering lawsuit accusing him of running a pyramid scheme to support the dogecoin, according to Reuters.


            Lawyers for Musk and Tesla called the lawsuit by dogecoin investors a “fanciful work of fiction” over Musk’s “innocuous and often silly tweets.”
    

            It wasn’t cle

*More Convenient Code*

In [27]:
class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body
    def print(self):
        """
        Flexible printing function controls output
        """
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))
    
class Website:
    """
    Contains information about website structure
    """
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

**Note: Some of the book's code no longer runs as written, because websites change their structure and element names over time. Before running a scraper, check the page's current HTML to confirm that the selectors still match.**

In [87]:

class Crawler:
    @staticmethod
    def getPage(url):
        try:
            html = urlopen(url)
        except Exception:
            return None
        return BeautifulSoup(html, 'html.parser')

    @staticmethod
    def safeGet(bs, selector):
        """
        Utility function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = bs.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    @staticmethod
    def getContent(website, path):
        """
        Extract content from a given page URL
        """
        url = website.url+path
        bs = Crawler.getPage(url)
        if bs is not None:
            title = Crawler.safeGet(bs, website.titleTag)
            body = Crawler.safeGet(bs, website.bodyTag)
            return Content(url, title, body)
        return Content(url, '', '')

In [88]:
siteData = [
    ['O\'Reilly Media', 'https://www.oreilly.com', 'h1', 'div.title-description'],
    ['Reuters', 'https://www.reuters.com', 'h1', 'div.ArticleBodyWrapper'],
    ['Brookings', 'https://www.brookings.edu', 'h1', 'div.post-body'],
    ['CNN', 'https://www.cnn.com', 'h1', 'div.article__content']
]
websites = []

for name, url, title, body in siteData:
    websites.append(Website(name, url, title, body))

Crawler.getContent(
    websites[0], '/library/view/web-scraping-with/9781491910283').print()
Crawler.getContent(
    websites[1], '/article/us-usa-epa-pruitt-idUSKBN19W2D0').print()
Crawler.getContent(
    websites[2],
    '/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/').print()
Crawler.getContent(
    websites[3], 
    '/2023/04/03/investing/dogecoin-elon-musk-twitter/index.html').print()

TITLE: Web Scraping with Python
URL: https://www.oreilly.com/library/view/web-scraping-with/9781491910283
BODY:
 


      
        Book
      description
Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once.Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice.
Show and hide more

Publisher resources
View/Submit Errata




TITLE: 
URL: https://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0
BODY:
 
TITLE: 
URL: https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of

---

## 4.3 Structuring Crawlers

This section outlines how to build a flexible, automated web crawler for discovering links and gathering data efficiently. It presents three basic crawler structures adaptable to most web scraping scenarios.

### 4.3.1 Crawling Sites Through Search

In [155]:
class Content:
    """Common base class for all articles/pages"""

    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.title = title
        self.body = body
        self.url = url

    def print(self):
        """
        Flexible printing function controls output
        """
        print(f'New article found for topic: {self.topic}')
        print(f'URL: {self.url}')
        print(f'TITLE: {self.title}')
        print(f'BODY:\n{self.body}')

In [156]:
class Website:
    """Contains information about website structure"""

    def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl
        self.resultListing = resultListing
        self.resultUrl = resultUrl
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag

In [221]:
class Crawler:
    def __init__(self, website):
        self.site = website
        self.found = {}

    @staticmethod
    def getPage(url):
        try:
            html = urlopen(url)
        except Exception:
            return None
        return BeautifulSoup(html, 'html.parser')

    @staticmethod
    def safeGet(bs, selector):
        """
        Utility function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = bs.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def getContent(self, topic, url):
        """
        Extract content from a given page URL
        """
        bs = Crawler.getPage(url)
        if bs is not None:
            title = Crawler.safeGet(bs, self.site.titleTag)
            body = Crawler.safeGet(bs, self.site.bodyTag)
            return Content(topic, url, title, body)
        return Content(topic, url, '', '')

    def search(self, topic):
        """
        Searches a given website for a given topic and records all pages found
        """
        bs = Crawler.getPage(self.site.searchUrl+topic)
        if bs is None:
            # The search page could not be fetched, so there is nothing to record
            return
        for result in bs.select(self.site.resultListing):
            links = result.select(self.site.resultUrl)
            if not links:
                # Skip results that contain no matching link element
                continue
            url = links[0].attrs['href']
            # Check to see whether it's a relative or an absolute URL
            url = url if self.site.absoluteUrl else self.site.url + url
            if url not in self.found:
                self.found[url] = self.getContent(topic, url)
            self.found[url].print()

The code should work as long as every selector matches the site's current element names.

In [222]:
siteData = [
    ['Reuters', 'http://reuters.com', 'https://www.reuters.com/search/news?blob=', 'div.search-result-indiv',
        'h3.search-result-title a', False, 'h1', 'div.ArticleBodyWrapper'],
    ['Brookings', 'http://www.brookings.edu', 'https://www.brookings.edu/search/?s=',
        'div.article-info', 'h4.title a', True, 'h1', 'div.core-block']
]
sites = []
for name, url, search, rListing, rUrl, absUrl, tt, bt in siteData:
    sites.append(Website(name, url, search, rListing, rUrl, absUrl, tt, bt))

crawlers = [Crawler(site) for site in sites]
topics = ['python', 'data%20science']

for topic in topics:
    for crawler in crawlers:
        crawler.search(topic)
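
The topics above are percent-encoded by hand (`'data%20science'`). As a small sketch, the standard library can do the encoding instead; `build_search_url` is a hypothetical helper introduced here for illustration and is not part of the crawler above.

```python
from urllib.parse import quote


def build_search_url(search_url, topic):
    """Append a percent-encoded topic to a search URL prefix."""
    return search_url + quote(topic)


print(build_search_url('https://www.brookings.edu/search/?s=', 'data science'))
# https://www.brookings.edu/search/?s=data%20science
```

`quote` encodes a space as `%20`, which matches the hand-encoded topic, so topics can be kept in plain text and encoded only when the request URL is built.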

### 4.3.2 Crawling Sites Through Links

This crawler uses a regular expression (targetPattern) to identify and follow links matching a specific URL pattern, making it flexible for scraping unstructured websites. It doesn't rely on predefined search page structures and can handle both absolute and relative URLs. This approach is ideal for gathering data from large, disorganized websites.
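
The link-selection step can be sketched on its own, without any network access. This is an illustrative variant rather than the crawler's actual code: it takes a plain list of `href` strings, keeps those matching the pattern, and uses `urljoin` from the standard library to resolve relative links instead of an `absoluteUrl` flag. The function name and sample URLs are assumptions.

```python
import re
from urllib.parse import urljoin


def select_target_urls(base_url, hrefs, target_pattern):
    """Keep hrefs matching target_pattern and resolve them against base_url."""
    pattern = re.compile(target_pattern)
    seen = set()
    targets = []
    for href in hrefs:
        if not pattern.search(href):
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            targets.append(absolute)
    return targets


# Hypothetical links as they might appear on a home page
hrefs = [
    '/articles/first-post/',
    'https://example.com/articles/second-post/',
    '/about/',
    '/articles/first-post/',
]
print(select_target_urls('https://example.com', hrefs, r'/articles/'))
# ['https://example.com/articles/first-post/', 'https://example.com/articles/second-post/']
```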

In [223]:
class Website:

    def __init__(self, name, url, targetPattern, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.targetPattern = targetPattern
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag


class Content:

    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print(f'URL: {self.url}')
        print(f'TITLE: {self.title}')
        print(f'BODY:\n{self.body}')

In [230]:
import re


class Crawler:
    def __init__(self, site):
        self.site = site
        self.visited = {}

    @staticmethod
    def getPage(url):
        try:
            html = requests.get(url)
        except Exception as e:
            print(e)
            return None
        return BeautifulSoup(html.text, 'html.parser')

    @staticmethod
    def safeGet(bs, selector):
        selectedElems = bs.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def getContent(self, url):
        """
        Extract content from a given page URL
        """
        bs = Crawler.getPage(url)
        if bs is not None:
            title = Crawler.safeGet(bs, self.site.titleTag)
            body = Crawler.safeGet(bs, self.site.bodyTag)
            return Content(url, title, body)
        return Content(url, '', '')

    def crawl(self):
        """
        Get pages from website home page
        """
        bs = Crawler.getPage(self.site.url)
        if bs is None:
            # The home page could not be fetched, so there are no links to follow
            return
        targetPages = bs.find_all('a', href=re.compile(self.site.targetPattern))
        for targetPage in targetPages:
            url = targetPage.attrs['href']
            # Prepend the site URL to the href (not the whole tag) for relative links
            url = url if self.site.absoluteUrl else f'{self.site.url}{url}'
            if url not in self.visited:
                self.visited[url] = self.getContent(url)
                self.visited[url].print()

In [242]:
brookings = Website('Brookings', 'https://brookings.edu', '/articles/', True, 'h1.w-full', '#content > div > div > p')
crawler = Crawler(brookings)
crawler.crawl()

URL: https://www.brookings.edu/articles/what-comes-after-a-usaid-shutdown/
TITLE: The implications of a USAID shutdown
BODY:
Over the past few weeks, the Trump administration has frozen U.S. foreign assistance, appointed Secretary of State Marco Rubio as acting administrator for the U.S. Agency for International Development (USAID), and announced a review of the agency’s activities.
With USAID’s future unclear, Brookings experts weigh in on the implications of a shutdown or reorganization, for the United States and for the world.
The Trump administration’s reported plans to abolish the U.S. Agency for International Development (USAID) by presidential fiat raise serious questions of constitutional law. While USAID was originally created by executive order, Congress has statutorily mandated its existence since the Foreign Affairs Reform and Restructuring Act of 1998 and routinely reinforces this status through authorization and appropriations legislation. Recent annual appropriations leg

### 4.3.3 Crawling Multiple Page Types

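When a site has several kinds of pages (for example, product pages and articles), each page type can get its own class that extends a common base class with the selectors specific to that type.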

In [243]:
class Website:
    """Common base class for all articles/pages"""

    def __init__(self, name, url, titleTag):
        # Only the fields shared by every page type live in the base class
        self.name = name
        self.url = url
        self.titleTag = titleTag

In [244]:
class Product(Website):
    """Contains information for scraping a product page"""

    def __init__(self, name, url, titleTag, productNumber, price):
        super().__init__(name, url, titleTag)
        self.productNumberTag = productNumber
        self.priceTag = price

class Article(Website):
    """Contains information for scraping an article page"""

    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        super().__init__(name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag
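
One way a crawler might use these page types is to ask each object which fields it defines, so a single extraction loop can serve every type. The sketch below is an assumption about usage, not the book's method: it redefines compact stand-ins for the classes so that it runs on its own, relies on the convention that selector attributes end in `Tag`, and uses hypothetical site names and selectors.

```python
class Website:
    """Compact stand-in for the base class above."""

    def __init__(self, name, url, titleTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag


class Product(Website):
    def __init__(self, name, url, titleTag, productNumberTag, priceTag):
        super().__init__(name, url, titleTag)
        self.productNumberTag = productNumberTag
        self.priceTag = priceTag


class Article(Website):
    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        super().__init__(name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag


def selectors_for(page):
    """Map field names to CSS selectors for any page-type object.

    Assumes the convention that selector attributes end in 'Tag'.
    """
    return {attr[:-len('Tag')]: selector
            for attr, selector in vars(page).items()
            if attr.endswith('Tag')}


# Hypothetical sites and selectors, for illustration only
pages = [
    Product('Example Shop', 'https://shop.example.com',
            'h1', 'span.sku', 'span.price'),
    Article('Example News', 'https://news.example.com',
            'h1', 'div.body', 'time'),
]
for page in pages:
    print(type(page).__name__, selectors_for(page))
```

A crawler could then loop over the returned mapping and apply each selector to the fetched page, without a separate code path per page type.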

---

## 4.4 Thinking About Web Crawler Models

When building a web scraper, it's crucial to normalize data across sources for consistency and scalability. Designing scrapers with future expansion in mind minimizes programming overhead. Identifying patterns in seemingly different websites can simplify integration. Additionally, understanding data relationships (e.g., type, size, topic) improves storage and retrieval. While software architecture is complex, web scraping follows recurring patterns that can be mastered with practice. Planning ahead ensures efficient and maintainable scrapers.
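
Normalizing data across sources can be sketched as a per-source field mapping applied to each raw record. The source names, raw keys, and date formats below are hypothetical and chosen only to illustrate the idea.

```python
from datetime import datetime

# Hypothetical mappings: normalized field name -> key used by each source
FIELD_MAPS = {
    'site_a': {'title': 'headline', 'body': 'text', 'published': 'date'},
    'site_b': {'title': 'title', 'body': 'content', 'published': 'pubDate'},
}
# Hypothetical date formats used by each source
DATE_FORMATS = {'site_a': '%Y-%m-%d', 'site_b': '%d/%m/%Y'}


def normalize(source, raw):
    """Convert a raw scraped record into the common schema."""
    field_map = FIELD_MAPS[source]
    # Missing fields become empty strings so every record has the same keys
    record = {name: raw.get(key, '').strip()
              for name, key in field_map.items()}
    record['source'] = source
    if record['published']:
        parsed = datetime.strptime(record['published'], DATE_FORMATS[source])
        record['published'] = parsed.date().isoformat()
    return record


print(normalize('site_a', {'headline': ' Hello ', 'text': 'Body',
                           'date': '2023-04-03'}))
print(normalize('site_b', {'title': 'Hi', 'content': 'Text',
                           'pubDate': '03/04/2023'}))
```

Adding a new source then means adding one entry to each mapping rather than writing new normalization code.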

---

## End

Chapter 4 examines how to build adaptable web crawlers by outlining different crawling strategies and emphasizing flexible data modeling. It stresses planning the data you need before scraping and using configuration objects to manage the variability in website layouts. The chapter also covers three ways to structure a crawler (through search, through links, and across multiple page types), setting a foundation for scalable and maintainable web scraping projects.