# Part 0: Mining the web

Perhaps the richest source of openly available data today is [the Web](http://www.computerhistory.org/revolution/networking/19/314)! In this lab, you'll explore some of the basic programming tools you need to scrape web data.

> **Note 0.** The Vocareum platform runs in a cloud-based environment that limits which websites a program can connect to directly. As a result, some (or possibly all) of the code below will **not** work there. We are therefore making this notebook **optional** and providing some solutions inline.
>
> **Note 1.** Even if you are using a home or local installation of Jupyter, you may encounter problems if you attempt to access a site too many times or too rapidly. That can happen if your internet service provider (ISP) or the target website flags your accesses as "unusual" and rejects them. It's easy to imagine accidentally writing an infinite loop that tries to access a page and being seen from the other side as a malicious program. :)
>
> **Note 2.** The exercises below involve processing of HTML files. However, you don't need to know anything specific about HTML; you can solve (and we have solved) all of these exercises assuming only that the data is a semi-structured string, amenable to simple string manipulation and regular expression processing techniques. In Part 1 of this notebook assignment, you'll see a different method that employs the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) module.
>
> **Note 3.** Following Note 2, there are some outspoken people who believe you should never use regular expressions on HTML. Your instructor finds these arguments to be overly pedantic. For an entertaining take on the subject, see [this blog post](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/).
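
One way to reduce the risk described in Note 1 is to enforce a minimum delay between successive requests. The following is a minimal sketch, not part of the original lab: the `Throttle` class and its `min_interval` parameter are hypothetical names introduced here for illustration, and a suitable delay depends on the site you are accessing.

```python
import time

class Throttle:
    """Enforce a minimum delay (in seconds) between successive calls to wait().

    Hypothetical helper for illustration; `clock` and `sleep` are injectable
    so the logic can be exercised without actually pausing.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous wait() finished

    def wait(self):
        """Sleep just long enough to honor min_interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay

# Intended usage (assumes a hypothetical list `urls` and the `requests` module):
#
#     throttle = Throttle(min_interval=2.0)
#     for url in urls:
#         throttle.wait()
#         response = requests.get(url, timeout=10)
```

Calling `wait()` before every download means that even a loop with a bug in it issues at most one request per `min_interval` seconds.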

## The Requests module

You can use Python's [Requests module](http://requests.readthedocs.io/en/latest/user/quickstart/) to download a web page.

For instance, here is a code fragment to download the [Georgia Tech](http://www.gatech.edu) home page and print the beginning of its HTML. You might also want to [view the source](http://www.computerhope.com/issues/ch000746.htm) of Georgia Tech's home page in your browser to get a nicely formatted view, and compare it to the output you see below.

> If you are having connection or download issues, we have also provided a file containing the HTML contents from a snapshot of the site. Just change the variable `USE_LOCAL_SNAPSHOT` to `True` to load that file instead.

In [17]:
import requests

USE_LOCAL_SNAPSHOT = False

if USE_LOCAL_SNAPSHOT:
    print("\n=== Reading webpage from local file ... ===\n")
    with open('gatech_edu--20190125-1143.html', 'rt') as fp:
        webpage = fp.read()
else:
    print("\n=== Attempting to download webpage ... ===\n")
    response = requests.get('https://www.gatech.edu/')
    webpage = response.text  # or response.content for raw bytes

print(webpage[0:50000]) # Prints up to the first 50,000 characters


=== Attempting to download webpage ... ===

<!DOCTYPE html>
<html lang="en" dir="ltr" 
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:og="http://ogp.me/ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:sioc="http://rdfs.org/sioc/ns#"
  xmlns:sioct="http://rdfs.org/sioc/types#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
    <head profile="http://www.w3.org/1999/xhtml/vocab">
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="description" content="Georgia Tech (Georgia Institute of Technology) is a technology-focused college in Atlanta, Ga, and one of the top research universities in the USA." />
<link rel="shortcut icon" href="https://www.gatech.edu/sites/all/themes/gt_tlw/favicon.ico" type="image/vnd.microsoft.ico

**Exercise 1.** Given the string contents of the GT home page as above (e.g., the `webpage` variable), write a function that returns a list of links (URLs) of the site's "top stories."

For instance, consider the front page from Saturday, January 25, 2020:

![www.gatech.edu as of Sat Jan 25, 2020](./gatech_edu--20190125-1143.png)

The top stories are the ones associated with the three images ("Quantum collaborators," "10 x 10 x 10 Tech," and "Transfer program offers...").  Each image links to a news story, and we want your function to return the URL of each link. If no URLs can be found, the function should return an empty list.

In [12]:
import re # Maybe you want to use a regular expression?

def get_gt_top_stories(webpage_text):
    """Given the HTML text for the GT front page, returns a list
    of the URLs of the top stories or an empty list if none are
    found.
    """
    pattern = '''<a class="slide-link" href="(?P<url>[^"]+)"'''
    return re.findall(pattern, webpage_text)

In [13]:
top_stories = get_gt_top_stories(webpage)

print("Your claimed links to top stories:")
for k, url in enumerate(top_stories):
    print(k, url)

Your claimed links to top stories:
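
The demo output above lists no links, which is what `re.findall` returns when the pattern matches nothing. Patterns like this one depend on the exact markup, so even a small change in the page can make them stop matching. The sketch below illustrates this with two short snippets that are hypothetical (they are not taken from the real page): one matches the pattern used in `get_gt_top_stories`, and the other contains the same link with its attributes in a different order.

```python
import re

# Same pattern as in get_gt_top_stories above.
pattern = r'<a class="slide-link" href="(?P<url>[^"]+)"'

# Hypothetical snippets for illustration only:
matching = '<a class="slide-link" href="/news/story-1"><img src="a.jpg"></a>'
reordered = '<a href="/news/story-1" class="slide-link"><img src="a.jpg"></a>'

print(re.findall(pattern, matching))   # ['/news/story-1']
print(re.findall(pattern, reordered))  # []
```

If your function returns an empty list on a live download, it may be worth viewing the page source to check whether the markup still looks the way the pattern assumes.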


## A more complex example

Go to [Yelp!](http://www.yelp.com) and look up `ramen` in `Atlanta, GA`. Take note of the URL:

![Yelp! search for ramen in ATL (January 25, 2020)](./yelp-ramen-atl--20200125-1205--scroll-to-results-annotated.png)

This URL encodes what is known as an _HTTP "get"_ method (or request). Such a URL has two parts: a _command_ followed by one or more _arguments_. In this case, the command is everything up to and including the word `search`; the arguments are the rest, following the `?`, with individual arguments separated by `&`. (A `#`, if present, marks the start of a _fragment_, which the browser handles itself and does not send to the server.)

> "HTTP" stands for "Hypertext Transfer Protocol," a standardized communication protocol that allows _web clients_, like your web browser or your Python program, to communicate with _web servers_.
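
To make the command/argument split concrete, here is a small sketch that takes a search URL apart using Python's standard `urllib.parse` module. The URL is an example of the kind described above; the variable names `command` and `args` are chosen here just to mirror the terminology in the text.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://www.yelp.com/search?find_desc=ramen&find_loc=atlanta%2C+ga'

parts = urlsplit(url)

# The "command": everything up to and including `search`.
command = f"{parts.scheme}://{parts.netloc}{parts.path}"

# The "arguments": the query string, decoded into a dictionary of lists.
args = parse_qs(parts.query)

print(command)  # https://www.yelp.com/search
print(args)     # {'find_desc': ['ramen'], 'find_loc': ['atlanta, ga']}
```

Note that `parse_qs` also undoes the URL encoding: `%2C` becomes a comma and `+` becomes a space.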

In this next example, let's see how to build a "get request" with the `requests` module. It's pretty easy!

In [18]:
url_command = 'https://yelp.com/search'
url_args = {'find_desc': "ramen",
            'find_loc': "atlanta, ga"}
response = requests.get(url_command, params=url_args, timeout=60)

print("==> Downloading from: '%s'" % response.url) # confirm URL
print("\n==> Excerpt from this URL:\n\n%s\n" % response.text[0:10000])

==> Downloading from: 'https://www.yelp.com/search?find_desc=ramen&find_loc=atlanta%2C+ga'

==> Excerpt from this URL:

<!DOCTYPE HTML>

<!--[if lt IE 7 ]> <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie6 ie ltie9 ltie8 no-js" lang="en"> <![endif]-->
<!--[if IE 7 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie7 ie ltie9 ltie8 no-js" lang="en"> <![endif]-->
<!--[if IE 8 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie8 ie ltie9 no-js" lang="en"> <![endif]-->
<!--[if IE 9 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie9 ie no-js" lang="en"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html xmlns:fb="http://www.facebook.com/2008/fbml" class="no-js" lang="en"> <!--<![endif]-->
    <head>
        <script>
            (function() {
                var main = null;

                var main=function(){window.onerror=function(k,a,c,i,f){var j=(document.getElementsByTagName("html")[0].getAttribute("webdriver")==="true"||naviga
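
Notice how the "Downloading from" line in the output shows the argument dictionary encoded into the URL. The sketch below reproduces that query string using the standard library's `urlencode`. It is meant only to illustrate the encoding, under the assumption that the result matches what the output above shows; it makes no claim about how `requests` builds the URL internally.

```python
from urllib.parse import urlencode

url_command = 'https://www.yelp.com/search'
url_args = {'find_desc': "ramen",
            'find_loc': "atlanta, ga"}

# Encode the arguments: pairs are joined by `&`, spaces become `+`,
# and the comma becomes `%2C`.
query = urlencode(url_args)
full_url = f"{url_command}?{query}"

print(full_url)
# https://www.yelp.com/search?find_desc=ramen&find_loc=atlanta%2C+ga
```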

**Sample data (HTML file) from a Yelp! query.** We've pre-downloaded the results of a query for `"fried chicken"` in `"atlanta, ga"` and stored them in a local file. The following code cell reads the file's contents and stores them in a variable called `yelp_fried_chicken_atl_query_html`, which the test cells will use.

In [19]:
# Query page for fried chicken in Atlanta (pre-downloaded):
sample_query_filename = "yelp-fried_chicken-atl--20200125-1240.html"
with open(sample_query_filename, "rt") as fp:
    yelp_fried_chicken_atl_query_html = fp.read()
    
# Sample:
print(f"=== First few characters of '{sample_query_filename}' ===\n")
print(yelp_fried_chicken_atl_query_html[:1000])

=== First few characters of 'yelp-fried_chicken-atl--20200125-1240.html' ===

<!DOCTYPE HTML>

<!--[if lt IE 7 ]> <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie6 ie ltie9 ltie8 no-js" lang="en"> <![endif]-->
<!--[if IE 7 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie7 ie ltie9 ltie8 no-js" lang="en"> <![endif]-->
<!--[if IE 8 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie8 ie ltie9 no-js" lang="en"> <![endif]-->
<!--[if IE 9 ]>    <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie9 ie no-js" lang="en"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html xmlns:fb="http://www.facebook.com/2008/fbml" class="no-js" lang="en"> <!--<![endif]-->
    <head>
        <script>
            (function() {
                var main = null;

                var main=function(){window.onerror=function(k,a,c,i,f){var j=(document.getElementsByTagName("html")[0].getAttribute("webdriver")==="true"||navigator.userAgent==="selenium");var h=f&&(f.na

**Exercise 2.** Given a string holding the HTML contents of a Yelp query, like the one above, complete the function below so it returns the list of the names of all **non-sponsored** search results. The list should be in ascending order of the rank of the result, and should contain no more than 10 items (since a query of the form above returns, by default, the top 10 matches).

> **Note 0.** The test cell uses the pre-downloaded query file from above. You may find it helpful to open that file in a web browser, view the source, and study its contents.
>
> **Note 1.** We are providing one possible solution, which uses elementary string processing and regular expressions. How would you have approached this problem?

In [20]:
import re

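def find_biz_names(html_string):
    """Given the HTML text of a Yelp query page, returns a list of
    the names of the top 10 results in ascending order of rank.
    Any rank that is not found is left as `None`.
    """
    # SAMPLE SOLUTION:
    # Keep only the text that follows the "All Results" marker.
    all_results_raw = html_string.split(r'"text":"All Results"')[1]
    # After this split, each item of interest begins with its rank,
    # e.g., `3,"reviewCount":...,"name":"..."`.
    items_raw = all_results_raw.split('"ranking":')
    top10 = [None] * 10
    for item in items_raw:
        match = re.match(r'^([0-9]+),"reviewCount":\d+,"name":"([^"]*)"', item)
        if match is not None:
            rank = int(match.group(1))
            name = match.group(2)
            if 1 <= rank <= 10:
                top10[rank-1] = name  # ranks are 1-based
    return top10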

In [21]:
# Demo:
find_biz_names(yelp_fried_chicken_atl_query_html)

['Hattie B’s Hot Chicken - Atlanta',
 'Gus’s World Famous Fried Chicken',
 'Roc South Cuisine',
 'South City Kitchen Midtown',
 'Buttermilk Kitchen',
 'Mary Mac’s Tea Room',
 'Rock’s Chicken &amp; Fries',
 'Busy Bee Cafe',
 'Joella’s Hot Chicken - Cumberland',
 'Gus’s World Famous Fried Chicken']
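
Some of the names above contain `&amp;`, which is the HTML entity for an ampersand; it appears because the names are taken verbatim from the HTML source. If you wanted display-ready names, one option is the standard library's `html.unescape`, sketched below on two of the names from the output. Note that the test cells in this notebook expect the raw names, so this step is not part of the exercise.

```python
import html

raw_names = ['Rock’s Chicken &amp; Fries', 'Busy Bee Cafe']

# Replace HTML entities such as `&amp;` with the characters they stand for.
clean_names = [html.unescape(name) for name in raw_names]

print(clean_names)  # ['Rock’s Chicken & Fries', 'Busy Bee Cafe']
```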

In [None]:
# Test cell 1: `yelp_atl__test1`
def load_query_results(filename):
    print(f"Loading HTML query results from {filename}...")
    with open(filename, "rt") as fp:
        html_string = fp.read()
    return html_string

query_0 = load_query_results("yelp-fried_chicken-atl--20200125-1240.html")
your_top10_0 = find_biz_names(query_0)
assert your_top10_0 == ['Hattie B’s Hot Chicken - Atlanta',
 'Gus’s World Famous Fried Chicken',
 'Roc South Cuisine',
 'South City Kitchen Midtown',
 'Buttermilk Kitchen',
 'Mary Mac’s Tea Room',
 'Rock’s Chicken &amp; Fries',
 'Busy Bee Cafe',
 'Joella’s Hot Chicken - Cumberland',
 'Gus’s World Famous Fried Chicken']

print("\n(Passed!)")

In [None]:
# Test cell 2: `yelp_atl__test2`
query_1 = load_query_results("yelp-ramen-atl--20200125-1205.html")
your_top10_1 = find_biz_names(query_1)
assert your_top10_1 == ['JINYA Ramen Bar',
                        'E Ramen +',
                        'Ginya Izakaya',
                        'JINYA Ramen Bar',
                        'Hajime',
                        'Okiboru Tsukemen &amp; Ramen',
                        'Hotto Hotto Ramen &amp; Teppanyaki',
                        'Lifting Noodles Ramen',
                        'Tanaka Ramen',
                        'Ton Ton']
print("\n(Passed!)")

One issue with the above exercises is that they treat HTML as a flat string, whereas the document is at least semi-structured. Moreover, web pages are such a common source of data today that you would expect better tools for processing them. Indeed, such tools exist! The next part of this assignment, Part 1, walks you through one such tool. So, head there when you are ready!
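
As a small taste of what structure-aware processing looks like, here is a sketch that uses `html.parser` from Python's standard library. This is a stand-in chosen for illustration and is not the tool covered in Part 1; the `LinkCollector` class and the sample snippet are hypothetical. Because the parser hands you each tag's attributes as name-value pairs, the order in which attributes appear and the presence of extra classes no longer matter.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the `href` of every <a> tag that carries a given class.

    Hypothetical helper for illustration only.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)  # list of (name, value) pairs -> dictionary
        classes = (attrs.get('class') or '').split()
        if self.css_class in classes and attrs.get('href'):
            self.links.append(attrs['href'])

# Hypothetical snippet, not taken from a real page:
snippet = ('<a class="slide-link" href="/news/story-1">One</a>'
           '<a href="/news/story-2" class="big slide-link">Two</a>'
           '<a class="nav" href="/about">About</a>')

collector = LinkCollector('slide-link')
collector.feed(snippet)
print(collector.links)  # ['/news/story-1', '/news/story-2']
```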