# Chapter 3: Writing Web Crawlers

---

With scrapers traversing multiple pages and even multiple sites, you must be extremely conscientious of how much bandwidth you are using and make every effort to determine whether there’s a way to make the target server’s load easier.

---

## 3.1 Traversing a Single Domain

A first step toward a Six Degrees of Wikipedia solution finder is to retrieve every link on the Kevin Bacon page:

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html.read(), 'html.parser')

links = bs.find_all('a')
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Other_ventures
#Six_Degrees_of_Kevin_B

Wikipedia is full of sidebar, footer, and header links that appear on every page, along with links to category pages, talk pages, and other pages that are not articles.

If you examine the links that point to article pages (as opposed to other internal pages), you’ll see that they all have three things in common:

- They reside within the div with the id set to bodyContent.

- The URLs do not contain colons.

- The URLs begin with /wiki/.
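
The last two rules can be checked with a regular expression alone. As a minimal sketch, the pattern below is tried against a few sample hrefs that appear in this section's outputs; the helper name `is_article_link` is made up for illustration.

```python
import re

# Starts with /wiki/ and contains no colon anywhere after it
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

def is_article_link(href):
    """Return True if href looks like a link to a Wikipedia article."""
    return ARTICLE_LINK.match(href) is not None

samples = [
    '/wiki/Kyra_Sedgwick',
    '/wiki/Special:Random',
    '/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon',
    '#Early_life_and_education',
]
for href in samples:
    print(href, is_article_link(href))
```

Only the first sample passes: the second contains a colon, and the last two do not start with /wiki/.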

In [2]:
import re

artcls_links = bs.find('div', {'id': 'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))
for link in artcls_links:
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Leading_man
/wiki/Character_actor
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/National_Lampoon%27s_Animal_House
/wiki/Diner_(1982_film)
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Frost/Nixon_(film)
/wiki/Friday_the_13th_(1980_film)
/wiki/Tremors_(1990_film)
/wiki/The_River_Wild
/wiki/The_Woodsman_(2004_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Patriots_Day_(film)
/wiki/Losing_Chase
/wiki/Loverboy_(2005_film)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Michael_Strobl
/wiki/HBO
/wiki/Taking_Chance
/wiki/Fox_Broadcasting_Company
/wik

The next step is to restructure the code into something more effective. The version below works, but it is far from production-ready because it still needs the exception handling mentioned in Chapter 1. Its value lies not only in what it extracts but also in its reusability:

In [3]:
import datetime
import random

random.seed(datetime.datetime.now().timestamp())

def getLinks(article_URL):
    html = urlopen(f'http://en.wikipedia.org{article_URL}')
    bs = BeautifulSoup(html.read(), 'html.parser')
    
    lks = bs.find('div', {'id': 'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))
    
    return lks

links = getLinks('/wiki/Kevin_Bacon')

for i in range(5):
    new_article = links[random.randint(0, len(links)-1)].attrs['href']
    print(new_article)
    links = getLinks(new_article)

/wiki/Old_97%27s
/wiki/Satellite_Rides
/wiki/Rhett_Miller
/wiki/The_Traveler_(Rhett_Miller_album)
/wiki/Rhett_Miller_(album)


**Note**

Strictly speaking, nothing a deterministic algorithm computes is random: every generated number follows from some earlier state, such as the seed. Generators like Python's `random` module are therefore called pseudorandom, which is good enough for now.

For less predictable numbers, operating systems collect entropy that can include hardware sources (e.g. the RDRAND instruction on Intel CPUs) and expose it through user-friendly tools (e.g. `secrets`, `os.urandom`, or pyOpenSSL).

***There are doubts about whether even these numbers are truly random, so LOL.***
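
To make the difference concrete, here is a small sketch that uses only the standard library. A seeded `random.Random` replays the same sequence every time, while `secrets` draws from the operating system's randomness source and cannot be seeded. The seed value 42 and the sample list are arbitrary choices for illustration.

```python
import random
import secrets

links = ['/wiki/Philadelphia', '/wiki/Footloose_(1984_film)', '/wiki/HBO']

# Two generators created with the same seed make the same "random" choices
first = random.Random(42)
second = random.Random(42)
picks_a = [first.choice(links) for _ in range(5)]
picks_b = [second.choice(links) for _ in range(5)]
print(picks_a == picks_b)  # the sequence depends only on the seed

# secrets uses the OS source of randomness (the same one behind os.urandom)
print(secrets.choice(links))
print(secrets.randbelow(len(links)))
```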

---

## 3.2 Crawling an Entire Site

Crawling an entire site is a beast, so plan before taking it on to save time and resources.

In [4]:
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                try:
                    getLinks(newPage)
                except:
                    return
getLinks('')

/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Special:Search
/wiki/Special:MyContributions
/wiki/Special:MyTalk
/wiki/Special:WhatLinksHere/User_talk:102.185.89.233
/wiki/Help:User_contributions
/wiki/Help_talk:User_contributions
/wiki/Special:WhatLinksHere/Help_talk:User_contributions
/wiki/Help:What_links_here
/wiki/Help_talk:What_links_here
/wiki/Special:WhatLinksHere/Help_talk:What_links_here
/wiki/Special:SpecialPages
/wiki/Help:Special_page
/wiki/Help_talk:Special_page
/wiki/Special:WhatLinksHere/Help_talk:Special_page
/wiki/Help:What_links_here#Transclusion
/wiki/Special:WhatLinksHere/Help:What_links_here
/wiki/Talk:Vancouver_(disambiguation)
/wiki/Vancouver_(disambiguation)
/wiki/Special:WhatLinksHere/Vancouver_(disambiguation)
/wiki/Vancouver
/wiki/Talk:Vancouver
/wiki

**Note**

*Python's default recursion limit is 1,000, so a recursive crawler like the one above can crash on a large site such as Wikipedia. To avoid this, add a recursion counter that stops the crawl at a set depth. For websites with fewer than 1,000 links, the program usually works fine, with some exceptions.*
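
One way to apply a recursion counter, sketched under simplifying assumptions: the crawl below walks a small in-memory link graph (`FAKE_SITE`, a hypothetical stand-in for real pages so the example runs without network access) and stops descending once `depth` exceeds `max_depth`.

```python
import sys

# Hypothetical link graph standing in for real pages: page -> links found on it
FAKE_SITE = {
    '/a': ['/b', '/c'],
    '/b': ['/c', '/d'],
    '/c': ['/a'],
    '/d': ['/e'],
    '/e': [],
}

def crawl(page, pages, depth=0, max_depth=2):
    """Visit page and its links recursively, never deeper than max_depth."""
    if depth > max_depth or page in pages:
        return
    pages.add(page)
    for link in FAKE_SITE.get(page, []):
        crawl(link, pages, depth + 1, max_depth)

print(sys.getrecursionlimit())  # the interpreter's current recursion limit

visited = set()
crawl('/a', visited)
print(sorted(visited))  # '/e' sits at depth 3 and is never reached
```

In a real crawler, the lookup in `FAKE_SITE` would be replaced by fetching and parsing the page, and `max_depth` would be kept well below the recursion limit.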

---

## 3.3 Collecting Data Across an Entire Site

Web crawlers become useful when they extract meaningful data from pages rather than just navigating links. To build a scraper, analyze page structures to identify patterns for extracting the title, first paragraph, and edit link (if available). This requires inspecting multiple pages to ensure consistency in data retrieval.

**Note**: place exception handling carefully so that you keep full control over the scraper and are aware of every possible error.
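
As a rough illustration of placing exceptions deliberately, the sketch below catches each failure type on its own instead of using one bare `except`. The `fetch` helper and both test URLs are made up for the example. `HTTPError` is a subclass of `URLError`, so it has to be caught first.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url):
    """Return the page body as bytes, or None if the page cannot be retrieved."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print(f'HTTP error {e.code} for {url}')
    except URLError as e:
        # No usable answer at all: unknown host, refused connection, missing file
        print(f'Could not reach {url}: {e.reason}')
    except ValueError as e:
        # The string is not a valid URL (for example, it has no scheme)
        print(f'Bad URL {url!r}: {e}')
    return None

print(fetch('not-a-url'))
print(fetch('file:///no/such/file/for-this-example'))
```

Because each branch reports what went wrong, the scraper can decide whether to skip the page, retry later, or stop.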

### Handling Redirects

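Redirects send a request for one URL to a different location. Server-side redirects are followed automatically by Python’s urllib, and by requests as well, whose allow_redirects parameter defaults to True for every method except HEAD. Client-side redirects are performed in the browser by JavaScript or HTML, often after a delay, so neither library follows them. Be mindful that the final URL may differ from the one you requested.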
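
To see that the final URL can differ from the requested one, the sketch below starts a throwaway local server and fetches a redirecting URL with urllib. The local server is an assumption made so the example needs no internet access; `RedirectHandler` and the `/old` and `/new` paths are invented for the demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

class RedirectHandler(BaseHTTPRequestHandler):
    """Tiny local server: /old answers with a 302 that points to /new."""
    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.end_headers()
        else:
            body = b'<html><body>new page</body></html>'
            self.send_response(200)
            self.send_header('Content-Type', 'text/html')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet

server = HTTPServer(('127.0.0.1', 0), RedirectHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Opener with proxies disabled so the local request is not routed elsewhere;
# it follows server-side redirects the same way urlopen does
opener = build_opener(ProxyHandler({}))

requested = f'http://127.0.0.1:{server.server_port}/old'
response = opener.open(requested)
final_url = response.url  # the URL that was actually retrieved
content = response.read()
response.close()
server.shutdown()
server.server_close()

print(requested)
print(final_url)
```

Comparing `requested` with `response.url` is a cheap way to notice that a page has moved before storing or deduplicating its URL.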

---

## 3.4 Crawling Across the Internet

Before building a web crawler, define your data goals and decide if scraping a few known sites is sufficient or if discovering new sites is necessary. Determine whether the crawler should explore deeply within a site or quickly follow outbound links. Consider exclusions like non-English content and ensure legal compliance to avoid issues with webmasters.

In [13]:
from urllib.parse import urlparse

random.seed(datetime.datetime.now().timestamp())

We begin with the function responsible for internal links:

In [9]:
def getInternalLinks(bs, includeURL):
    # Reduce the URL to scheme://host so relative links can be made absolute
    includeURL = f'{urlparse(includeURL).scheme}://{urlparse(includeURL).netloc}'
    internal_links = []

    # Match hrefs that start with '/' or that contain the site's own URL
    for link in bs.find_all('a', href=re.compile('^(/|.*' + re.escape(includeURL) + ')')):
        href = link.attrs['href']
        # Make relative links absolute first so duplicates are detected correctly
        if href.startswith('/'):
            href = includeURL + href
        if href not in internal_links:
            internal_links.append(href)

    return internal_links

Then the one responsible for external links

In [10]:
def getExternalLinks(bs, excludeURL):
    external_links = []

    # Match hrefs that start with 'http' or 'www' and do not contain excludeURL
    pattern = re.compile('^(http|www)((?!' + re.escape(excludeURL) + ').)*$')
    for link in bs.find_all('a', href=pattern):
        href = link.attrs['href']
        if href not in external_links:
            external_links.append(href)

    return external_links
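
As an alternative to matching hrefs with regular expressions, links can be classified by resolving each one against the page URL and comparing hosts. This is only a sketch, not the approach used in the two functions above. `classify_links` and the sample hrefs are hypothetical, and the comparison treats `www.example.com` and `example.com` as different hosts.

```python
from urllib.parse import urljoin, urlparse

def classify_links(page_url, hrefs):
    """Split hrefs into (internal, external) lists of absolute URLs."""
    page_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves '/path' and '//host/path'
        parsed = urlparse(absolute)
        if parsed.scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, and similar
        target = internal if parsed.netloc == page_host else external
        if absolute not in target:
            target.append(absolute)
    return internal, external

internal, external = classify_links(
    'http://example.com/index.html',
    ['/about', 'http://example.com/about', 'https://other.example.org/', 'mailto:a@example.com'],
)
print(internal)
print(external)
```

Resolving links before comparing them means that `/about` and `http://example.com/about` are recognized as the same page.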

And one more function that returns a random external link:

In [19]:
def getRandomExternalLink(start_url):
    try:
        html = urlopen(start_url)
        bs = BeautifulSoup(html.read(), 'html.parser')
    except Exception:
        # The page could not be fetched or parsed; tell the caller with None
        return None

    external_links = getExternalLinks(bs, urlparse(start_url).netloc)
    if len(external_links) == 0:
        print('No external links, looking around the site for one')
        domain = f'{urlparse(start_url).scheme}://{urlparse(start_url).netloc}'
        # Collect the internal links and retry from a random one of them
        internal_links = getInternalLinks(bs, domain)
        if len(internal_links) == 0:
            return None
        return getRandomExternalLink(random.choice(internal_links))

    return random.choice(external_links)

A function to wrap the process

In [24]:
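def followExternalOnly(startingSite):
    external_link = getRandomExternalLink(startingSite)
    if external_link is None:
        # Dead end: the page could not be fetched or had no usable links
        return
    print('Random external link is: {}'.format(external_link))
    try:
        followExternalOnly(external_link)
    except RecursionError:
        # Python's recursion limit was reached; stop hopping
        pass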

In [27]:
# Now we run our program
followExternalOnly('http://oreilly.com')

Random external link is: https://oreilly.hk/
Random external link is: https://learning.oreilly.com/search/?q=author%3A%22Neal%20Ford%22&type=*
Random external link is: https://play.google.com/store/apps/details?id=com.safariflow.queue
Random external link is: https://support.google.com/googleplay/?p=report_content
Random external link is: https://www.google.com.eg/intl/en/about/products?tab=uh
Random external link is: https://home.google.com/welcome
Random external link is: https://www.google.com
Random external link is: https://play.google.com/?hl=ar&tab=w8
Random external link is: https://support.google.com/googleplay?p=pff_parentguide
Random external link is: https://play.google.com/store/apps/details?id=com.google.android.apps.kids.familylink&referrer=utm_source%3Dplayparentguide
Random external link is: https://www.google.com/policies/privacy
Random external link is: https://policies.google.com/privacy
Random external link is: https://www.google.com/intl/en/safetycenter/
Random ex

Of course, this code is not ready for production, since it needs more exception handling and coverage of corner cases.

Another form of the code, which collects every external link found across a site:

In [28]:
# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()
def getAllExternalLinks(siteUrl):
    
    html = urlopen(siteUrl)
    domain = '{}://{}'.format(urlparse(siteUrl).scheme, urlparse(siteUrl).netloc)
    bs = BeautifulSoup(html, 'html.parser')
    
    internalLinks = getInternalLinks(bs, domain)
    externalLinks = getExternalLinks(bs, domain)
    
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
            
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)
            
allIntLinks.add('http://oreilly.com')
getAllExternalLinks('http://oreilly.com')

https://www.oreilly.com
https://www.oreilly.com/member/login/
https://www.oreilly.com/online-learning/try-now.html
https://www.oreilly.com/online-learning/teams.html
https://www.oreilly.com/online-learning/government.html
https://www.oreilly.com/online-learning/academic.html
https://www.oreilly.com/online-learning/individuals.html
https://www.oreilly.com/online-learning/features.html
https://www.oreilly.com/online-learning/courses.html
https://www.oreilly.com/online-learning/feature-certification.html
https://www.oreilly.com/online-learning/intro-interactive-learning.html
https://www.oreilly.com/online-learning/live-events.html
https://www.oreilly.com/online-learning/feature-answers.html
https://www.oreilly.com/online-learning/insights-dashboard.html
https://www.oreilly.com/online-learning/pricing.html
https://www.oreilly.com/radar/
https://www.oreilly.com/content-marketing-solutions.html
https://www.oreilly.com/ceros/727901-lunar-new-year-2025.html
https://learning.oreilly.com/start-t

KeyboardInterrupt: 

**Note**

Making diagrams of what the code should do before you write the code itself is a fantastic habit to get into, and one that can save you a lot of time and frustration as your crawlers get more complicated.

---

## End

Chapter 3 focused on building web crawlers to navigate and extract data from websites. It covered techniques for following links, handling multiple pages, and managing crawling efficiency. The chapter provided essential skills for developing scalable web crawlers, preparing for more advanced scraping methods.