# Python Web Scraping Introduction

## Introduction - What Is Web Scraping?
We have seen how many interesting Internet companies have evolved APIs to facilitate the delivery of
data. While the availability of APIs for accessing data is on the rise, the fact remains that the vast
majority of data available from the Web cannot be accessed via a well-formed API.

Web scraping (WS) is a means of ingesting data from the web in the absence of an API. The objective of
WS is to download semi-structured data from the web, select relevant portions of that data, and
forward that data to the next stage in your analytics pipeline.
WS projects can range from extremely sophisticated to “quick and dirty”. The good news is that the
Python ecosystem provides some amazing tools in support of your Web scraping ambitions.

## Web Scraping for Mere Mortals

Recall that since all APIs are different, there is an upfront investment (a learning curve) to using an API.
You have to create developer accounts, understand API authentication and authorization, understand
API endpoints and data payloads, etc.
WS also comes with its own learning curve. The first requirement of scraping any Web site is
understanding the nature and structure of the HTML on the site’s web pages. That means looking at lots
of HTML.

When you decide to scrape a page you typically know what data from that page you hope to gather.
Your job is to understand how the target data is nestled within the page’s HTML in order to devise an
extraction plan. Needless to say, this is often a one-off (bespoke) plan that is unique to the page/site
that you are scraping.

## Research Question and Methodology
Let’s get started. As with all analytics projects, you start with a question that needs answering, and
answering questions requires data. Here’s our question - `who are the five most popular actors to play
the role of Dr. Who in the popular and long-running BBC series Dr. Who?`

Excellent question! Once we have a question, we need to decide what available data would be best to
answer that question. In this case, let’s agree that page views to an actor’s Wikipedia page will be our
proxy for popularity.

It’s always a good idea to consider how one would accomplish a task manually prior to automation.
That’s not to say that they will be identical processes; however, they will be conceptually similar. In my
mind, if I had a complete list of the Dr. Who actors, I could just Google each one,
click on their Wikipedia link, then click on the Page information menu-link on the left-hand side of every Wikipedia
web page. There I can see the number of page views in the last 30 days. Tally those up for each actor,
sort in descending order and, voilà, I have my answer.

So, to implement this plan using a WS strategy I would need:

1. a complete list of Dr. Who actors
2. the Wikipedia page information page for each actor
3. the 30-day page view information from that page

## Web Scraping Solution
This is where Python and Web scraping come in. As stated above, WS is about downloading
semi-structured data from the web, capturing some of that data, and using the data
in an analytics pipeline.

In this lab, you will be writing a Python program that:

1. downloads the list of Dr. Who actors
2. downloads their Wikipedia information pages
3. captures the 30-day page view data as a proxy for their popularity
4. records that data
5. analyzes the data
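
The last two steps can be sketched before any scraping code exists. What follows is only a minimal sketch, assuming the page-view counts have already been collected into a dictionary. The function name `top_n` and the sample names and numbers are invented for illustration; they are not real page-view data.

```python
def top_n(page_views, n=5):
    """Return the n (name, views) pairs with the highest view counts."""
    ranked = sorted(page_views.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Placeholder data only: invented names and counts, NOT real page views.
sample_views = {'Actor A': 120, 'Actor B': 450, 'Actor C': 300}

print(top_n(sample_views, n=2))
```

Sorting on the count and slicing keeps the "analysis" step to a few lines, so most of the effort in this lab goes into gathering the data.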

## The Dr. Who Actors
There are many places to get the list of Dr. Who actors; however, one of my favorites is, of course,
Entertainment Weekly (EW). Specifically, the following single page contains the complete list -
https://ew.com/tv/doctor-who-actors/. Our job is to parse the HTML on the page to build our list.
Navigate to the above page and use your browser to explore the HTML that underlies the page. For
Chromium browsers, this can be accomplished by pressing the Ctrl+U keys. **What you are looking for are
exploitable patterns that appear in the HTML.** This is the essence (and the "art") of Web scraping.

What does "exploitable patterns" mean exactly? Well, there are some number of actors and I
do NOT want to write separate data extraction plans for each one – I want to write one!
That means I’d like to find a pattern in the
HTML surrounding the actor names that occurs for all of the actors. Usually, this is pretty easy
when it comes to lists and this example is no exception.

Looking at the rendered page in a browser, I see that the 13th Dr. Who is Jodie Whittaker.
Now turning to the underlying HTML (Ctrl+U), I’m going to search for the string ‘Jodie Whittaker’. I’m going
to continue to search until I see something that looks easy to parse and looks like a pattern
that may repeat for all actors.
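
The same search can be done from Python if you prefer. The following is a rough sketch, assuming the page source is already in a string; the helper name `show_context` and the tiny `sample_html` fragment are made up for illustration (the real page is far larger and will differ).

```python
def show_context(html, needle, width=40):
    """Return the text surrounding each occurrence of needle in html."""
    snippets = []
    start = html.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(html), start + len(needle) + width)
        snippets.append(html[lo:hi])
        start = html.find(needle, start + 1)
    return snippets

# Hypothetical fragment for illustration only; NOT copied from the real page.
sample_html = (
    '<h2>Jodie Whittaker</h2>'
    '<noscript><div><img title="Slide 2: Thirteenth Doctor: Jodie Whittaker"></div></noscript>'
)

for snippet in show_context(sample_html, 'Jodie Whittaker', width=30):
    print(snippet)
```

Seeing each hit with a little surrounding markup makes it easier to judge which occurrence sits inside a repeatable pattern.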

Try this on your own before proceeding. It’s possible that we all come up with a different pattern!

Here’s what I found. Look for an occurrence of Jodie Whittaker within a block of HTML that begins with
&lt;noscript&gt;. I see the actor’s name in this block, so it looks promising. Next I will search the HTML
for &lt;noscript&gt; again to see if the other actors’ names also appear in similar blocks – they do.
However, this pattern also occurs for other data on the page. Can you see a way to distinguish the
‘useful’ HTML blocks (the ones that contain the data we are after) from the noise? The ability to discern
these types of patterns is an essential skill for WS.

Well, it turns out that if an &lt;img&gt; tag nested within a &lt;div&gt; tag nested within a &lt;noscript&gt; tag has a
title attribute that begins with the regex pattern ^Slide\s+\d+: then we have our name. All we need to do is
find &lt;img&gt; tags that have title attributes that match this RE pattern!
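
To get a feel for the pattern, it can be tried against a few strings before touching any HTML. In this small sketch the first two titles are ones that appear in this lab; the third is an invented value standing in for a title we do not want.

```python
import re

slide_re = re.compile(r'^Slide\s+\d+:')

titles = [
    'Slide 15: First Doctor: William Hartnell',
    'Slide 10: Sixth Doctor: Colin Baker',
    'Site logo',  # invented example of a title we do NOT want
]

for title in titles:
    print(bool(slide_re.search(title)), title)
```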

## Make a Web Request
As with past labs, use the requests library to return the content from the EW web page above. Ultimately,
you will need the .text attribute from the response returned by the HTTP GET. Refer to our previous API
labs for a refresher if necessary.
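
As a reminder, a request of this kind might look like the sketch below. The function name `fetch_html` and the 5-second timeout are my own assumptions, not requirements of the lab; the block only defines the function and does not contact the network until you call it.

```python
import requests

def fetch_html(url, timeout=5):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
    return resp.text

# Example call (requires network access):
# html = fetch_html('https://ew.com/tv/doctor-who-actors/')
```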

## Wrangling HTML with BeautifulSoup
Once you have raw HTML in front of you, you can start to select and extract. For this purpose, we will be
using BeautifulSoup (BS). The BS constructor parses raw HTML strings and produces an object that
mirrors the HTML document’s structure. Parse? Yes, parse. That should sound familiar. That is the
same term we applied to processing XML documents. In fact, HTML, like XML, is a tag-based language.
BS parses HTML in much the same way as we parsed XML. The result of this parsing is an in-memory
tree of Python objects reflective of the HTML.

### Consider the following example of an HTML document and the accompanying code:

```python
raw_html = '''
<html>
<head>
<title> Contrived Example </title>
</head>
<body>
<p id="eggman"> I am the egg man </p>
<p id="walrus"> I am the walrus </p>
</body>
</html>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(raw_html, 'html.parser')

# The simplest way to navigate the parse tree is to say the name of the tag you want (akin to XPATH)
# If you want the <body> tag, just say soup.body:
body_tag = soup.body

# A tag’s children are available in a list called .contents:
print("here is testing")
print(body_tag.contents)

# You can use this trick again and again to zoom in on a certain part of the parse tree.
# This code gets the FIRST <p> tag beneath the <body> tag:
p_tag = body_tag.p

# Note the difference between contents and text
print("here is testing1")
print(p_tag.contents)
print(p_tag.text)

# If you need to get all the <a> tags, or anything more complicated than the first tag
# with a certain name, you’ll need to use one of the 'searching' methods described in
# Searching the tree (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree),
# such as find_all():

# Find ALL <p> tags in the soup - may be more than you want!
for p in soup.find_all('p'):
    if p['id'] == 'walrus':
        print(p.text)

# As with XML, you can call the find* methods from any valid element in the soup.
# Here we start with the previously found body element
for p_tag in body_tag.find_all('p'):
    if p_tag['id'] == 'walrus':
        print(p_tag.text)

# OR
# some common HTML attributes like id, class, etc. are implemented
# as keyword arguments
for p in body_tag.find_all('p', id='walrus'):
    print(p.text)
```

```
here is testing
['\n', <p id="eggman"> I am the egg man </p>, '\n', <p id="walrus"> I am the walrus </p>, '\n']
here is testing1
[' I am the egg man ']
 I am the egg man 
 I am the walrus 
 I am the walrus 
 I am the walrus 
```


Breaking down the example, you first parse the raw HTML by passing it to the BS constructor. BS
accepts multiple back-end parsers; html.parser, which ships with Python's standard library, is the one you
supply here as the second argument. Note, however, that the 'lxml' parser is also a popular choice.
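
If you want to try lxml without breaking your code on machines where it is not installed, one option is to fall back to html.parser. This is just a sketch of that idea; the helper name `make_soup` is an assumption of mine, and it relies on BS raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse html with lxml if it is installed, else fall back to html.parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # lxml is a separate install; html.parser is always available
        return BeautifulSoup(html, 'html.parser')

soup = make_soup('<p id="walrus"> I am the walrus </p>')
print(soup.p.text)
```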

The 'tags-as-attributes' approach allows you to find the first occurrence of a tag. This is
a bit like XPATH. This bit of magic is easy to implement in a Python class using either the
\_\_getattr\_\_ or \_\_getattribute\_\_ dunder method.
See https://medium.com/@satishgoda/python-attribute-access-using-getattr-and-getattribute-6401f7425ce6
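
To make the idea concrete, here is a toy class that offers the same 'tags-as-attributes' convenience using \_\_getattr\_\_. It is only an illustration of the technique: the `ToyNode` class is invented for this example and is not BeautifulSoup's actual code.

```python
class ToyNode:
    """A toy tree node; NOT BeautifulSoup's real implementation."""

    def __init__(self, name, children=None, text=''):
        self.name = name
        self.children = children or []
        self.text = text

    def __getattr__(self, tag_name):
        # Called only when normal attribute lookup fails, so real
        # attributes such as .name and .text are unaffected.
        if tag_name.startswith('_'):
            raise AttributeError(tag_name)
        for child in self.children:
            if child.name == tag_name:
                return child  # first match only
        return None

body = ToyNode('body', children=[
    ToyNode('p', text='I am the egg man'),
    ToyNode('p', text='I am the walrus'),
])
print(body.p.text)
```

Note that `body.p` can only ever hand back the first `p` child, which is exactly the limitation of the attribute approach.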

The issue with the attribute approach, however, is that you can only find the first
such tag.  If you need to find all tags, then you need to use one of the find* methods.
See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree.

The find_all() method on the soup object lets you locate all matching elements in the document,
while find() returns only the first match.
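
A quick side-by-side comparison, using a cut-down version of the contrived document (the shortened HTML string here is my own):

```python
from bs4 import BeautifulSoup

html = '<body><p id="eggman">egg man</p><p id="walrus">walrus</p></body>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')      # a single Tag (the first match)
all_p = soup.find_all('p')    # a list-like collection of every match
missing = soup.find('table')  # None when nothing matches

print(first_p['id'], len(all_p), missing)
```

Because find() returns None when there is no match, it is worth checking the result before using it.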

In the above case, soup.find_all('p') returns a list of
paragraph elements (i.e., &lt;p&gt;).
Each *p* may have HTML attributes that you can access like a dictionary. In the line if p['id'] == 'walrus', for
example, you check if the id attribute is equal to the string 'walrus', which corresponds to
&lt;p id="walrus"&gt; in the HTML. This is the same as what we saw when using ElementTree to parse XML.

Also, as with XML, the .text property yields the text associated with a tag.
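
One caveat with dictionary-style access: real pages often contain tags that lack the attribute you are testing, and p['id'] raises a KeyError in that case. A small sketch of the safer .get() alternative, using an invented two-paragraph document:

```python
from bs4 import BeautifulSoup

html = '<body><p id="walrus">I am the walrus</p><p>no id here</p></body>'
soup = BeautifulSoup(html, 'html.parser')

for p in soup.find_all('p'):
    # p['id'] would raise KeyError on the second <p>; .get() returns None instead
    print(p.get('id'), '->', p.text)
```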

<hr/>

**Technical Note:** in addition to strings representing HTML tags, the arguments to find() and find_all() can
be compiled regular expressions (RE). This is extremely useful as you shall see.
<hr/>
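
Here is a short sketch of that idea. The HTML string is contrived for this example (the `walrus-footer` id is invented); the point is that a compiled RE can stand in for either the tag name or an attribute value.

```python
import re
from bs4 import BeautifulSoup

html = '''
<body>
<p id="eggman"> I am the egg man </p>
<p id="walrus"> I am the walrus </p>
<div id="walrus-footer"> footer </div>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')

# A compiled RE as the tag name: matches <p> but not <div>
p_tags = soup.find_all(re.compile(r'^p$'))

# A compiled RE as an attribute filter: any tag whose id starts with "walrus"
walrus_tags = soup.find_all(id=re.compile(r'^walrus'))

print([t['id'] for t in p_tags])
print([t.name for t in walrus_tags])
```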

### Phase 1 - Using BeautifulSoup to Get Dr. Who Actor Names
Now that you have given BeautifulSoup’s find methods a short test drive, how do you
determine exactly what argument to supply to find? The fastest way is to step out of Python and use your
browser to examine the underlying HTML of the document as discussed above.

Recall that an &lt;img&gt; tag nested within a &lt;div&gt; tag nested within a &lt;noscript&gt; tag is one we are
interested in if its title attribute begins with the regex pattern ^Slide\s+\d+:\s+[A-Z] .

The "beautiful" thing about BeautifulSoup is that many of the pattern-type arguments
to BS methods can be compiled regular expressions. I told you this would be useful…

So, to find &lt;img&gt; tags that contain the Dr. Who actor names we need a title attribute that matches the
following compiled RE pattern:
```
re.compile(r'^Slide\s+\d+:\s+[A-Z]')
# Example title attribute data: Slide 15: First Doctor: William Hartnell
```
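
To see the pattern at work inside find_all(), here is a minimal sketch run against a hand-written fragment. The `sample_html` string is hypothetical: it only imitates the structure described above and is not copied from the EW page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure described in this lab.
sample_html = '''
<noscript><div><img src="a.jpg" title="Slide 15: First Doctor: William Hartnell"></div></noscript>
<noscript><div><img src="b.jpg" title="Site logo"></div></noscript>
<noscript><div><img src="c.jpg"></div></noscript>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
slide_title = re.compile(r'^Slide\s+\d+:\s+[A-Z]')

# Only <img> tags whose title attribute matches the RE are returned
matches = soup.find_all('img', title=slide_title)
print([img['title'] for img in matches])
```

Images with a non-matching title, or with no title at all, are simply skipped.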

Once we have the title text, we will need to turn to REs again to parse the names out of the title
attribute, like so:
```
m = re.search(r'^Slide\s+\d+:[^:]+[:]\s+(?P<actor>.*)$', title)
```
What’s going on in the above RE? We want only the name from the title attribute text. This RE
starts the same as before; however, after the first : the [^:]+ says "gobble up all characters that are
**not** a : (colon) until you run into a colon that is followed by one or more spaces."

After that, capture all remaining characters (the $ is the end-of-string anchor) in a group named actor.
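
The RE can be checked in isolation against the two sample titles that appear in this lab. This is only a quick sketch; the variable name `name_re` is my own.

```python
import re

name_re = re.compile(r'^Slide\s+\d+:[^:]+[:]\s+(?P<actor>.*)$')

for title in ('Slide 15: First Doctor: William Hartnell',
              'Slide 10: Sixth Doctor: Colin Baker'):
    m = name_re.search(title)
    if m:
        # The named group is retrieved by name rather than by position
        print(m.group('actor'))
```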

### To Do:
To see how this part turned out, review the who_actors() function below.
Make sure you understand what’s going on before proceeding.

```python
from requests.exceptions import HTTPError
import requests
from bs4 import BeautifulSoup
import re

EW_URL = 'https://ew.com/tv/doctor-who-actors/'

def simple_get(url, *args, **kwargs):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    Returns the response object if the request succeeds; otherwise prints
    a message and re-raises the exception.
    """
    try:
        resp = requests.get(url, *args, **kwargs)
        # If the response was successful, no Exception will be raised
        resp.raise_for_status()

    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
        raise
    except Exception as err:
        print(f'Other error occurred: {err}')
        raise

    return resp

def who_actors(url):
    resp = simple_get(url, timeout=5)
    html = resp.text

    # Sanity check: is this HTML?
    assert re.search('html', resp.headers.get('Content-Type', ''), re.IGNORECASE)

    soup = BeautifulSoup(html, 'html.parser')

    # to be returned
    actor_list = []

    for img in soup.find_all('img', title=re.compile(r'^Slide\s+\d+:\s+[A-Z]')):

        # I want the name from the title attribute, which looks like this:
        #   Slide 10: Sixth Doctor: Colin Baker
        # Another good use for REs.
        # This RE starts the same as before; however, after the first :
        # the [^:]+[:]\s+ says "gobble up all (one or more) characters that
        # are not a : until you run into a colon that is followed by one
        # or more spaces. After that, capture all remaining characters
        # in a group named actor."
        title = img['title']

        m = re.search(r'^Slide\s+\d+:[^:]+[:]\s+(?P<actor>.*)$', title)
        # If there is no match, then I've gotten the pattern wrong
        assert m is not None
        if m:
            actor_list.append(m.group('actor'))

    # Great, got my list of actors. Return to caller
    return actor_list

# PHASE 1: TESTING ONLY
# Get the Dr. Who actors from EW_URL
alst = who_actors(EW_URL)
print(alst)
```


```
['Jo Martin', 'Jodie Whittaker', 'Peter Capaldi', 'Matt Smith', 'David Tennant', 'Christopher Eccleston', 'John Hurt', 'Paul McGann', 'Sylvester McCoy', 'Colin Baker', 'Peter Davison', 'Tom Baker', 'Jon Pertwee', 'Patrick Troughton', 'William Hartnell']
```
