Web Crawling & Scraping (Web Crawling - crawl ALL webpages of a blog)

Web Crawling using Python

We build a custom web crawler.

It starts with the "seed" page (typically the first page or home page of a website).
For example,
    http://www.mongabong.com/
The above is the "seed" page that we will use later on.

This web crawler will:
1. Look for all anchor tags (links) on the seed page.
2. Follow each of those links and repeat step 1 on the page it leads to.
3. In the end, the web crawler will have compiled a list of all webpages at http://www.mongabong.com that can be reached by following links from the seed page.
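
The three steps above can be sketched without any network access. The example below is only an illustration: `SITE` is a made-up, in-memory stand-in for a website, and `crawl_all` is a hypothetical name, not part of the lab code.

```python
from collections import deque

# Hypothetical in-memory "website": each page maps to the links found on it.
SITE = {
    "/": ["/about", "/posts"],
    "/about": ["/"],
    "/posts": ["/posts/1", "/posts/2"],
    "/posts/1": ["/posts"],
    "/posts/2": ["/posts/1", "/about"],
}


def crawl_all(seed, get_links):
    """Return every page reachable from `seed` by following links."""
    queue = deque([seed])  # pages found but not yet searched
    crawled = set()        # pages already searched
    while queue:
        page = queue.popleft()
        if page in crawled:
            continue  # already searched, skip the duplicate
        crawled.add(page)
        for link in get_links(page):  # step 1: find the links on this page
            if link not in crawled:
                queue.append(link)    # step 2: search each link later
    return crawled                    # step 3: the complete list


pages = crawl_all("/", lambda page: SITE.get(page, []))
print(sorted(pages))
```

The loop ends when the queue is empty, which is why the crawler only ever lists pages that some other page links to.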

'queue.txt' stores the links that the web crawler has extracted but not yet searched.
Each time a webpage listed in 'queue.txt' is searched and its anchor tags are identified, two things happen:
- The URLs (href values) of any newly identified anchor tags are added to 'queue.txt'.
- The webpage that was just searched is moved from 'queue.txt' to 'crawled.txt'.
'crawled.txt' therefore contains every link that the web crawler has extracted AND has already searched for more anchor tags.
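
The bookkeeping described above can be sketched with two plain Python sets standing in for the two files. This is an illustration under stated assumptions: `record_search` is a hypothetical helper name, and the two sets are only stand-ins for 'queue.txt' and 'crawled.txt'.

```python
def record_search(queue, crawled, page, found_links):
    """Update both sets after `page` has been searched for anchor tags."""
    for link in found_links:
        # Only links we have never seen before go into the queue.
        if link not in queue and link not in crawled:
            queue.add(link)
    # The searched page moves from the queue to the crawled set.
    queue.discard(page)
    crawled.add(page)


seed = "http://www.mongabong.com/"
post = "http://www.mongabong.com/2019/09/maisha-from-closet-lover.html"

queue = {seed}   # stand-in for 'queue.txt'
crawled = set()  # stand-in for 'crawled.txt'

# Pretend the seed page links to itself and to one blog post.
record_search(queue, crawled, seed, [seed, post])
print("queue:", queue)
print("crawled:", crawled)
```

After this single step the seed page sits in the crawled set and only the blog post is left waiting in the queue.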

In this lab, we will attempt to perform web crawling on:
--> Mongchin Yeo's blog (local influencer): http://www.mongabong.com/

In [None]:
##### Step 1: Helper Functions #####
'''
The following are custom 'helper' functions that our web crawler will use later on.
They mostly deal with file writing, data appending, directory creation, etc.

Run this code segment - to load the functions for later use.
'''

import os

# Each website is a separate project (folder)
def create_project_dir(directory):
    # If specified directory does NOT exist, then create one.
    if not os.path.exists(directory):
        print('Creating directory ' + directory)
        os.makedirs(directory) # makedirs --> make directory


# Create queue and crawled files (if not created)
def create_data_files(project_name, base_url):
    queue = os.path.join(project_name , 'queue.txt')
    crawled = os.path.join(project_name, 'crawled.txt')
    
    if not os.path.isfile(queue):
        # This part is VERY IMPORTANT
        # "queue.txt" must have the SEED PAGE's URL
        # Thus, we're going to write the SEED PAGE's URL (base_url) into this file.
        write_file(queue, base_url)
        
    if not os.path.isfile(crawled):
        write_file(crawled, '')


# Create a new file
def write_file(path, data):
    with open(path, 'w') as f:
        f.write(data)


# Add data onto an existing file
def append_to_file(path, data):
    with open(path, 'a') as file:
        file.write(data + '\n')


# Delete the contents of a file
def delete_file_contents(path):
    open(path, 'w').close()


# Read a file and convert each line to set items
def file_to_set(file_name):
    results = set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.replace('\n', ''))
    return results


# Iterate through a set, each item will be a line in a file
def set_to_file(links, file_name):
    with open(file_name,"w") as f:
        for l in sorted(links):
            f.write(l+"\n")
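
The file format used by the helpers above is simply one URL per line. The sketch below shows that a set of links survives a round trip through such a file. `save_links` and `load_links` are hypothetical stand-ins written for this example; unlike `file_to_set`, `load_links` also skips blank lines, which is an assumption of this sketch rather than the behaviour of the lab code.

```python
import os
import tempfile


def save_links(links, path):
    # One URL per line, sorted so the file is stable between runs.
    with open(path, "w") as f:
        for link in sorted(links):
            f.write(link + "\n")


def load_links(path):
    # Strip the trailing newline from each line and ignore blank lines.
    with open(path, "rt") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


original = {
    "http://www.mongabong.com/",
    "http://www.mongabong.com/2019/09/maisha-from-closet-lover.html",
}

# A temporary directory stands in for the project folder.
with tempfile.TemporaryDirectory() as project_dir:
    queue_path = os.path.join(project_dir, "queue.txt")
    save_links(original, queue_path)
    restored = load_links(queue_path)

print(restored == original)
```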


In [None]:
##### Step 2: LinkFinder class #####
'''
The below is a Class definition for LinkFinder.

Given the HTML of a webpage (page_url), LinkFinder parses it and
looks for ALL <a> tags.

For each <a> tag, e.g. <a href="http://www.mongabong.com/2019/09/maisha-from-closet-lover.html">
LinkFinder will extract the link/URL: http://www.mongabong.com/2019/09/maisha-from-closet-lover.html

The page_links() method will return ALL links found in a given webpage (page_url).

Run this code segment - to load the class for later use.
'''

from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):

    # This constructor initializes base_url & page_url.
    #    Example
    #       base_url: http://www.mongabong.com
    #       page_url: http://www.mongabong.com/2019/09/maisha-from-closet-lover.html
    def __init__(self, base_url, page_url):
        
        #print('base_url: ' + base_url)
        #print('page_url: ' + page_url)
        #exit(1)
        
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser feed(), this function is called when it encounters an opening tag <a>
    # Given an <a> tag, e.g. <a href="http://www.mongabong.com/2019/09/maisha-from-closet-lover.html">
    #   extract attribute 'href' value --> http://www.mongabong.com/2019/09/maisha-from-closet-lover.html
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    # Resolve relative links against the page they appear on
                    url = parse.urljoin(self.page_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
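
To see how href values turn into absolute URLs, here is a small stand-alone sketch. `HrefCollector`, the sample HTML, and the page URL ending in `index.html` are all made up for illustration. The sketch resolves relative links against the URL of the page they appear on, which is an assumption of this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Hypothetical minimal parser that collects absolute link URLs."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip <a> tags without an href (e.g. named anchors).
            if name == "href" and value:
                self.links.add(urljoin(self.page_url, value))


SAMPLE_HTML = """
<a href="/2019/09/maisha-from-closet-lover.html">Post</a>
<a href="about.html">About</a>
<a href="http://cnn.com/entertainment/xyz.html">External</a>
<a name="top">No href</a>
"""

collector = HrefCollector("http://www.mongabong.com/2019/09/index.html")
collector.feed(SAMPLE_HTML)
for link in sorted(collector.links):
    print(link)
```

A root-relative href (starting with `/`) is resolved from the site root, a plain relative href is resolved from the current page's folder, and an absolute URL is left unchanged.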


In [None]:
##### Step 3: Domain Name Helper Functions #####
'''
The following are custom 'helper' functions that our web crawler will use later on.
They extract the domain name from a URL, so that the crawler can tell internal links from external ones.

Run this code segment - to load the functions for later use.
'''

from urllib.parse import urlparse


# Get domain name (example.com)
# Given: http://www.mongabong.com/
# returns the domain name without http://
#    mongabong.com
def get_domain_name(url):
    try:
        #print('===url: ' + url)
        results = get_sub_domain_name(url).split('.')
        #print(results[-2])
        #print(results[-1])
        #exit(1)
        return results[-2] + '.' + results[-1]
    except IndexError:  # raised when the URL has fewer than two dot-separated parts
        return ''


# Get the full host name, including any sub domain (name.example.com)
# Given: http://www.mongabong.com/
# returns
#       www.mongabong.com
def get_sub_domain_name(url):
    try:
        #print(urlparse(url).netloc)
        #exit(1)
        return urlparse(url).netloc
    except Exception:
        return ''
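
The "last two labels" idea behind get_domain_name can be tried out on its own. The sketch below is an illustration only: `last_two_labels` is a hypothetical name, and it uses `urlparse(...).hostname` instead of `netloc` so that a port number does not end up in the result, which is a choice made for this example.

```python
from urllib.parse import urlparse


def last_two_labels(url):
    """Return the last two dot-separated parts of the host name, or ''."""
    host = urlparse(url).hostname or ""  # hostname is None for non-URLs
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else ""


print(last_two_labels("http://www.mongabong.com/"))
print(last_two_labels("http://www.mongabong.com:8080/page.html"))
print(last_two_labels("http://blog.example.co.uk/"))
print(repr(last_two_labels("not a url")))
```

Note the third case: for country-code domains such as `example.co.uk`, keeping only the last two labels gives `co.uk`, so this simple rule suits domains like `mongabong.com` but not every website.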


In [None]:
##### Step 4: Spider class #####
'''
The below is a Class definition for Spider.

This is the main class for Web Crawling.

crawl_page is the main method.
It takes a new web page URL (page_url) and initiates crawling.

Run this code segment - to load the functions for later use.
'''

from urllib.request import urlopen

class Spider:

    project_name = ''
    base_url = ''
    domain_name = ''
    queue_file = ''
    crawled_file = ''
    queue = set()
    crawled = set()

    def __init__(self, project_name, base_url, domain_name):
        Spider.project_name = project_name
        Spider.base_url = base_url
        Spider.domain_name = domain_name
        Spider.queue_file = Spider.project_name + '/queue.txt'
        Spider.crawled_file = Spider.project_name + '/crawled.txt'
        self.boot()
        self.crawl_page('First spider', Spider.base_url)

    # Creates directory and files for project on first run and starts the spider
    @staticmethod
    def boot():
        create_project_dir(Spider.project_name)
        create_data_files(Spider.project_name, Spider.base_url)
        Spider.queue = file_to_set(Spider.queue_file)
        Spider.crawled = file_to_set(Spider.crawled_file)

    # Updates user display, fills queue and updates files
    @staticmethod
    def crawl_page(thread_name, page_url):
        if page_url not in Spider.crawled:
            print(thread_name + ' now crawling ' + page_url)
            print('Queue ' + str(len(Spider.queue)) + ' | Crawled  ' + str(len(Spider.crawled)))
            Spider.add_links_to_queue(Spider.gather_links(page_url))
            Spider.queue.discard(page_url)  # discard() does not raise KeyError if the URL is already gone
            Spider.crawled.add(page_url)
            Spider.update_files()

    # Downloads the page, decodes it if it is HTML, and returns the set of links found in it
    @staticmethod
    def gather_links(page_url):
        html_string = ''
        try:
            response = urlopen(page_url)
            # The Content-Type header may be missing, so fall back to an empty string
            if 'text/html' in (response.getheader('Content-Type') or ''):
                html_bytes = response.read()
                html_string = html_bytes.decode("utf-8")
            finder = LinkFinder(Spider.base_url, page_url)
            #print(html_string) # prints HTML content of the current page
            finder.feed(html_string)
        except Exception as e:
            print(str(e))
            return set()
        return finder.page_links()

    # Adds newly found internal links to the queue (skips duplicates and external links)
    @staticmethod
    def add_links_to_queue(links):
        for url in links:
            # If url is something that we already extracted and it's either
            # - in queue to be crawled
            # - in crawled.txt (already crawled)
            # Then, ignore it - no need to process duplicates.
            if (url in Spider.queue) or (url in Spider.crawled):
                continue
            
            # VERY IMPORTANT
            # In this lab, we will ONLY crawl internal webpages.
            # If an internal webpage links to http://cnn.com/entertainment/xyz.html,
            # we will IGNORE that link.
            # How do we do it? We compare domain names.
            # For xyz.html, the domain name is "cnn.com".
            # Our domain name is "mongabong.com", which is NOT equal to "cnn.com",
            # so we IGNORE the link.
            if Spider.domain_name != get_domain_name(url):
                continue
                
            # ELSE
            # We add this url to the queue
            Spider.queue.add(url)

    # Saves the queue and crawled sets to the project files
    @staticmethod
    def update_files():
        set_to_file(Spider.queue, Spider.queue_file)
        set_to_file(Spider.crawled, Spider.crawled_file)
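
Because the queue and crawled sets are class attributes, every worker thread reads and writes the same two sets. The class above does not guard them with a lock. The sketch below shows one way such a guard could look; `SharedFrontier` and `finish` are hypothetical names invented for this example and are not part of the lab code.

```python
import threading


class SharedFrontier:
    """Hypothetical helper: queue/crawled sets guarded by a single lock."""

    def __init__(self, seed):
        self._lock = threading.Lock()
        self.queue = {seed}
        self.crawled = set()

    def finish(self, page, found_links):
        # Only one thread at a time may update the two sets.
        with self._lock:
            for link in found_links:
                if link not in self.crawled:
                    self.queue.add(link)
            self.queue.discard(page)
            self.crawled.add(page)


frontier = SharedFrontier("/")

# Eight threads each report one finished page and the links found on it.
threads = [
    threading.Thread(target=frontier.finish, args=(f"/p{i}", [f"/p{i}/a", "/"]))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(frontier.queue), "queued |", len(frontier.crawled), "crawled")
```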


In [None]:
##### Step 5: MAIN #####
'''
This is where the user (you) initiates web crawling.

PROJECT_NAME: this should uniquely identify the website you intend to crawl
--> Later on, our code creates a directory (inside the current Jupyter directory).
--> <current_directory>/<PROJECT_NAME>/queue.txt
--> <current_directory>/<PROJECT_NAME>/crawled.txt

NUMBER_OF_THREADS: specify desired number of threads
--> Threading will create multiple instances of web crawler for "parallel processing"
--> Increasing this to a very high number does not necessarily mean web crawling will be done faster
-----> This depends on the computer server/machine on which web crawling is done.

Run this code segment to set up the project and crawl the seed page once.
The remaining pages are crawled when you run the next code segment.
'''

import threading
from queue import Queue

PROJECT_NAME = 'mongabong'
HOMEPAGE = 'http://www.mongabong.com/'

DOMAIN_NAME = get_domain_name(HOMEPAGE)
QUEUE_FILE = PROJECT_NAME + '/queue.txt'
CRAWLED_FILE = PROJECT_NAME + '/crawled.txt'
NUMBER_OF_THREADS = 2
queue = Queue()
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)


# Create worker threads (will die when main exits)
def create_workers():
    for _ in range(NUMBER_OF_THREADS):
        t = threading.Thread(target=work)
        t.daemon = True
        t.start()


# Do the next job in the queue
def work():
    while True:
        url = queue.get() # grab the next URL in the queue
        Spider.crawl_page(threading.current_thread().name, url) # kick off web crawler
        queue.task_done()


# Each queued link is a new job
def create_jobs():
    for link in file_to_set(QUEUE_FILE):
        queue.put(link)
    queue.join()
    crawl()


# Check if there are items in the queue, if so crawl them
def crawl():
    # 'queue.txt' should have the SEED PAGE's URL
    # We start crawling from the SEED PAGE
    queued_links = file_to_set(QUEUE_FILE)
    if len(queued_links) > 0:
        print(str(len(queued_links)) + ' links in the queue')
        create_jobs()
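
The worker pattern used above (put jobs in a Queue, let daemon threads take them with get(), mark each with task_done(), and wait with join()) can be tried without touching any website. In the sketch below the "jobs" are just made-up strings, and `jobs`, `results` and `worker` are names invented for this example.

```python
import threading
from queue import Queue

jobs = Queue()
results = []
results_lock = threading.Lock()


def worker():
    while True:
        url = jobs.get()  # blocks until a job is available
        try:
            # Stand-in for Spider.crawl_page: just record who handled what.
            with results_lock:
                results.append((threading.current_thread().name, url))
        finally:
            jobs.task_done()  # tell the queue this job is finished


# Daemon threads stop automatically when the main program exits.
for _ in range(2):
    threading.Thread(target=worker, daemon=True).start()

for url in ["/a", "/b", "/c", "/d"]:
    jobs.put(url)

jobs.join()  # wait until task_done() was called for every job
processed = sorted(url for _, url in results)
print(processed)
```

join() returns only after every job has been marked done, which is what lets create_jobs() safely re-read 'queue.txt' for the next round of links.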

In [None]:
'''
WARNING

We're crawling someone else's website.
For DEMO purposes, please run this segment for up to 10 seconds and STOP running this script.

'''

####### We're ready to kick off web crawler #######
# Let's create the worker(s).
# If you specified NUMBER_OF_THREADS = 2,
# this function will create 2 worker threads.
# The two workers will then crawl webpages in parallel.
# This function does NOT start crawling yet - it simply creates and prepares the workers.
create_workers()

# Let's crawl now
crawl()