# Obtain internet resources

Beyond communication, people use the internet to create and consume content. There are a few ways to obtain that content, especially for data analysis purposes.

## Web scraping (HTML parsing)

Technically, [a number of techniques](https://en.wikipedia.org/wiki/Web_scraping#Techniques) fall under the category of web scraping, including the aforementioned Web API consumption. This section focuses solely on HTML (short for Hypertext Markup Language) parsing, which automates what a human would do to read information from a website manually. It is usually a workaround for the lack of publicly accessible Web APIs.

HTML parsing relies on a semantic understanding of the markup. Regardless of how complex and dynamic the processes behind a website (or web app) are, the eventual content is delivered as HTML, plus CSS (short for Cascading Style Sheets) for styling, and usually JavaScript for interactivity.
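
To make the idea of "understanding the markup" concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` module, not Scrapy. The `FirstCellLinkParser` class and the `SAMPLE_HTML` snippet are made up for illustration; the snippet only imitates the shape of a table whose first column links to a detail page.

```python
from html.parser import HTMLParser


class FirstCellLinkParser(HTMLParser):
    """Collect (href, text) for links found in the first <td> of each table row."""

    def __init__(self):
        super().__init__()
        self.links = []              # collected (href, link text) pairs
        self._cell_index = 0         # number of <td> cells opened in the current row
        self._in_first_cell = False  # True while inside the first <td> of a row
        self._href = None            # href of the <a> currently being read, if any
        self._text = []              # text fragments of that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cell_index = 0
        elif tag == 'td':
            self._cell_index += 1
            self._in_first_cell = self._cell_index == 1
        elif tag == 'a' and self._in_first_cell:
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_first_cell = False
        elif tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

    def handle_data(self, data):
        # only keep text that sits inside a link we are tracking
        if self._href is not None:
            self._text.append(data)


# made-up markup for illustration only
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Name</th><th>City</th></tr>
  <tr><td><a href="/wiki/Example_University">Example University</a></td><td><a href="/wiki/Example_City">Example City</a></td></tr>
  <tr><td><a href="/wiki/Sample_College">Sample College</a></td><td>Sample Town</td></tr>
</table>
"""

parser = FirstCellLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Tracking tags and cell positions by hand like this gets tedious quickly, which is one reason to reach for selector-based tools instead.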

In Python, we can leverage the open-source framework [Scrapy](https://scrapy.org/) to crawl and scrape data from websites.

### A Canadian University Spider

In [1]:
import json

from scrapy import Spider


# our first "Spider" (that crawls the designated website for us)
class UniversitySpider(Spider):

    name = 'University Spider'
    start_urls = ['https://en.wikipedia.org/wiki/List_of_universities_in_Canada']
    
    custom_settings = {
        'ITEM_PIPELINES': { 'item_pipeline.ItemPipeline': 300 },  # from item_pipeline.py
        'LOG_LEVEL': 'ERROR',
    }

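    def parse(self, response):
        # every row of the list tables on the start page
        rows = response.css('table.wikitable > tbody > tr')

        for row in rows:
            # the first data cell of the row, expected to link to the school's own page
            school = row.xpath('td[1]')

            # skip rows without a linked first cell (e.g. header rows), follow the rest
            if school.css('a ::text'):
                yield response.follow(school.css('a')[0], self.school_parser)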

    def school_parser(self, response):
        school_info = {}
        school_info['name'] = response.css('h1.firstHeading ::text').get()

        school_info['lat'] = response.css('span.latitude ::text').get()
        school_info['lng'] = response.css('span.longitude ::text').get()

        rows = response.css('table.infobox > tbody > tr')
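        # loosely collect every infobox row that has a header cell, keyed by the
        # header text; .get() keeps only the first text node of the value cell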
        for row in rows:
            header = row.css('th ::text').get()
            if header:
                school_info[header] = row.css('td ::text').get()

        yield school_info

To write a scraping script, we define a `class` that extends the `scrapy.Spider` base class. The base class abstracts away the underlying crawling process so we can focus on specifics such as:
* The starting URLs for the "Spider" to crawl.
* Rules based on CSS and XPath selectors to:
    * Find the next-level links to follow.
    * Parse and pick out the actual information we want to collect.

Besides extending a base class, another new concept is the use of `yield`. It relies on Python's generator mechanism, which lets a function (or method) behave like an iterator: it produces values one at a time instead of building an entire `list` up front. You can read more about it in the [Python Wiki entry](https://wiki.python.org/moin/Generators). In short, `yield` behaves much like `return`, except that the function pauses rather than exits, and it resumes from the same point each time the next value is requested, until it has nothing left to produce.
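
The difference between `return` and `yield` is easiest to see side by side. The two functions below are toy examples written for this comparison and are not part of the spider.

```python
def count_up_with_return(limit):
    """`return` ends the function at once, so only the first value ever comes out."""
    for n in range(limit):
        return n


def count_up_with_yield(limit):
    """`yield` hands back one value, pauses, and resumes when the next one is requested."""
    for n in range(limit):
        yield n


first_only = count_up_with_return(3)       # the loop never gets past n == 0
all_values = list(count_up_with_yield(3))  # list() keeps asking until the generator is exhausted

# a generator can also be advanced one step at a time
gen = count_up_with_yield(2)
first = next(gen)
second = next(gen)

print(first_only, all_values, first, second)
```

In the spider, Scrapy plays the role of `list()` here: it keeps pulling requests and items out of `parse` and `school_parser` until they are exhausted.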

In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(UniversitySpider)
process.start()

2021-03-02 18:39:40 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-02 18:39:40 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.2 (default, May  5 2020, 15:52:07) - [Clang 11.0.0 (clang-1100.0.33.17)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1j  16 Feb 2021), cryptography 3.4.6, Platform macOS-10.16-x86_64-i386-64bit
2021-03-02 18:39:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-02 18:39:40 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 'ERROR'}
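
The crawl hands every `school_info` dictionary yielded by the spider to the `item_pipeline.ItemPipeline` class named in `custom_settings`. That module is not shown here, so the following is only a hypothetical sketch of what such a pipeline could look like, assuming it buffers the items and writes them to `./universities.json` when the spider closes. The method names `open_spider`, `process_item` and `close_spider` are the hooks Scrapy calls on a pipeline; the `output_path` attribute and the buffering strategy are assumptions.

```python
import json
import os
import tempfile


class ItemPipeline:
    """Hypothetical pipeline: buffer scraped items and dump them as one JSON array."""

    output_path = './universities.json'  # assumed output location

    def open_spider(self, spider):
        # called once when the spider starts
        self.items = []

    def process_item(self, item, spider):
        # called for every item the spider yields
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        with open(self.output_path, 'w', encoding='utf-8') as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)


# drive the pipeline by hand (no Scrapy needed), writing to a temporary directory
pipeline = ItemPipeline()
pipeline.output_path = os.path.join(tempfile.mkdtemp(), 'universities.json')
pipeline.open_spider(spider=None)
pipeline.process_item({'name': 'Example University', 'lat': '49°16′44″N'}, spider=None)
pipeline.close_spider(spider=None)

with open(pipeline.output_path, encoding='utf-8') as f:
    saved = json.load(f)
print(saved)
```

Writing a single JSON array at the end keeps the file directly loadable with `pd.read_json`, at the cost of holding all items in memory during the crawl.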


In [3]:
import pandas as pd

# load university data into a Pandas DataFrame
df = pd.read_json('./universities.json')
df

|  | name | lat | lng | Former names | Type | Established | President | Academic staff | Administrative staff | Students | ... | Call signs | Principal and Vice-Chancellor | Tag line | Athletic teams | Public transit | Faculty | The University of Manitoba Act | Legislative Assembly of Manitoba | Passed | Introduced by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Alberta University of the Arts | 51°03′43″N | 114°05′29″W | `\n` | Public | 1926 | Dr. Daniel Doz | 145 | 95 | 1323 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | Simon Fraser University | 49°16′44″N | 122°54′58″W |  | Public | 1965 | Joy Johnson | 1095 |  | 34990 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | Royal Roads University | 48°26′04″N | 123°28′22″W |  | Public university | 1995 | Dr. Philip Steenkamp | 52 core full-time, plus 450 associate faculty |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | Brandon University | 49°48′34″N | 97°07′58″W |  | Public | 1889 | Dr. David Docherty |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | University of Northern British Columbia | 53°53′14.40″N | 122°48′49.40″W |  | Public university | 1990 | Geoffrey Payne (Interim) |  |  | 3570 (2019/2020) | ... |  |  |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88 | University College of the North | 53°49′11″N | 101°14′16″W | Keewatin Community College (1966-2004) | University college | July 1, 2004 as University College of the North | Doug Lauvstad |  | Approximately 400 | Approximately 2,400 | ... |  |  |  |  |  |  |  |  |  |  |
| 89 | Mount Royal University | 51°0′49.09″N | 114°8′0.54″W |  | Public | 1910 | Tim Rahilly | 740 |  | 14258 | ... |  |  |  |  |  |  |  |  |  |  |
| 90 | MacEwan University | 53°32′49″N | 113°30′17″W | Grant MacEwan University, Grant MacEwan Colleg... | Public University | 1971 | Annette Trimbee | 972 |  | 19101 | ... |  |  |  |  |  |  |  |  |  |  |
| 91 | Athabasca University | 54°43′20.63″N | 113°18′12.19″W |  | Public university | 1970 | Neil Fassina |  |  | 40722 | ... |  |  |  |  |  | 1233 |  |  |  |  |


In [4]:
# convert DMS (Degrees-Minutes-Seconds) format to pure numerical decimal point format