# Web Scraping Tutorial

## Part 1 of 4

Let's continue adding to our knowledge baseline even as we look to add nuance and granularity to what we have already learned.

In this mini project, we'll use two of the most popular libraries among digital humanists, archivists, and data curators: the `requests` library, which refers to itself as "HTTP for Humans", and the `Beautiful Soup` library. We'll use `requests` to download an HTML page from the web. Then, using `Beautiful Soup`, we'll take that HTML page apart and scrape it for the information we want.

This is fairly straightforward, but we're still building our confidence with this material, so we'll break things into three phases, in order to avoid the need to chew through too much new material simultaneously.

In this tutorial, which covers the first phase, we'll write some Python code to download and parse the HTML we need. In the next tutorial, which covers the second phase, we'll go "behind the scenes" to discover how to find a webpage that makes an ideal candidate for scraping, and we'll make some notes about the CSS and HTML involved. Finally, in the third phase, we'll put what we've learned to use as we break up the HTML we scraped and organize the data we find into a CSV file.

As a bonus, we'll finish this series by using the `Pandas` library to draw a quick (but good-looking!) little chart describing the top five movies in theatrical release.
 
Along the way, we'll probably make use of a few additional techniques, too, but I encourage you to pay special attention to the way we put these libraries to use:  There are inevitably myriad ways you'll be able to put them to work in your own research -- and even your daily life -- right away.

In [19]:
import requests

# Read about the venerable requests library here:
# http://docs.python-requests.org/en/master/

# I am duty-bound to point out that this is the only
# library of which I know that sells _stickers_ from within
# its support documentation:
# http://www.unixstickers.com/stickers/coding_stickers/requests-shaped-sticker

In [20]:
from bs4 import BeautifulSoup

# While `import library` is usually enough to load
# a Python library, that isn't the case here.  The package
# is imported under the name `bs4`, and the class we need,
# `BeautifulSoup`, lives inside it.  Because there are
# significant differences between versions 3 and 4 of the
# library, the developers put the version in the package
# name to make it clear to programmers that they are using
# version 4.

# The docs on Beautiful Soup are well done:
# https://beautiful-soup-4.readthedocs.io/en/latest/

# They've recently made them available in Chinese:
# https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
# (Japanese and Korean, too.)
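
If you'd like to convince yourself of what that import line is doing, you can peek at the package itself. This is only a quick sketch, and it assumes Beautiful Soup 4 is already installed in your environment (typically via `pip install beautifulsoup4`).

```python
import bs4
from bs4 import BeautifulSoup

# The installed package is named bs4; BeautifulSoup is a class inside it.
# The exact version string printed here depends on your installation.
print(bs4.__version__)

# Both names point at the very same class object.
print(bs4.BeautifulSoup is BeautifulSoup)
```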

Let's start by creating a variable to hold the URL of the page we want to scrape.  (For the moment, trust me on this.  In the next part of the tutorial, I'll show you the website, we'll look at its source code, and then I'll explain how I plan to decompose it.)

In [21]:
movieDataURL = "http://www.boxofficemojo.com/weekend/?view=year&p=.htm"

In [22]:
# Now use the requests library to grab the website:

boxOfficeData = requests.get(movieDataURL)

# requests.get(URL) activates the library and sends it 
# to fetch the HTML (or whatever) located at the address we give it.
# Whatever it finds on the web it brings back and saves inside
# the variable we specified at the start.  .get() isn't the
# only action the `requests` library can perform, but it is
# the most common one to use.
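
Downloads can fail or hang, so in your own projects you may want a slightly more defensive version of that call. The sketch below is one possible approach, not the way this tutorial does it: `fetch_html` is a hypothetical helper name, and the ten-second timeout is an arbitrary assumption. The `timeout` argument and the `raise_for_status()` method are both part of `requests`.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string.

    fetch_html is a hypothetical helper, not part of the requests library.
    """
    # Give up if the server has not answered within `timeout` seconds.
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text


# Usage (needs a network connection, so it is left commented out):
# html = fetch_html("http://www.boxofficemojo.com/weekend/?view=year&p=.htm")
# print(html[:200])
```

Wrapping the call in a function also gives you a single place to change the timeout or add retries later.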

Regardless of what we plan to do with the data we just grabbed, we're well-served to let Beautiful Soup work its beautiful magic and parse the code.  In essence, the library is smart enough to distinguish between the text of a webpage and the markup (tags) that define it.  It may seem like a small thing, but it will save you hours of effort every time you use it.

In [23]:
beautifulBoxOffice = BeautifulSoup(boxOfficeData.text, 'html.parser')

# Here, I'm putting my Beautiful Soup library to work on the boxOfficeData
# text I just got from the web.  BeautifulSoup wants two arguments before
# it starts:  The first argument is the markup to operate on (a plain
# string), and the second argument names the parser to apply.

# You can see for the second argument I've used the 'html.parser' -- it
# is not as fast as lxml, and not as lenient as html5lib, but it ships
# with Python, so there is nothing extra to install:  Whenever possible,
# stick with 'html.parser'.

# For the first argument, though, you'll notice that I took the variable
# from earlier in the code -- boxOfficeData -- and added a weird thing:
# boxOfficeData.text

# What gives?  Here's the deal:  When `requests` brought back that HTML
# from the web, it packaged it up as an OBJECT, not as plain text.  That
# can be good AND bad.  But it means that if we want to operate on the
# text of that website, we have to extract it from the OBJECT in which it is wrapped.

# How?  Oh, how?  Actually, it's easy.  We just append .text to the original object,
# and it will divulge its deepest, text-iest secrets.
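
To see what "distinguishing between text and markup" looks like in practice, here is a small sketch that runs without touching the web. The HTML in `sample_html` is invented for illustration and is not taken from the Box Office Mojo page; the variable names are assumptions too.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page standing in for the real download.
sample_html = """
<html>
  <head><title>Weekend Box Office (sample)</title></head>
  <body>
    <p class="note">Figures are <b>invented</b> for illustration.</p>
    <a href="/weekend/chart/">This weekend</a>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title.string)   # the text inside the <title> tag
print(soup.p.get_text())   # the paragraph's text, with the <b> tags dropped
print(soup.a["href"])      # an attribute that lives in the markup, not the text
```

Once the page is parsed, you can ask for tags by name and get back either the text they contain or the attributes attached to them.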

All of this can seem to happen fairly quickly, leading to confusion.  If you get lost, remember that Python evaluates every expression according to a fixed order of operations -- just like an algebraic equation.  The most significant rule is this:  Don't read a nested call strictly left to right.  Instead, work from the inside out.  In the case of the single line above (the BeautifulSoup call), the code may make more sense to you if you start inside the parentheses:  Decide what boxOfficeData.text will yield.  Then you can think about what sense BeautifulSoup will make of it.  And then you'll know what value beautifulBoxOffice will be assigned.  (Roughly.)
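
The inside-out reading can be spelled out one step at a time. In this sketch, `fake_response` is a hypothetical stand-in for the object `requests` brings back (it imitates only the `.text` attribute), so the example runs without a network connection.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup

# A stand-in for the object requests.get() returns; only .text is imitated.
fake_response = SimpleNamespace(text="<html><body><h1>Hello</h1></body></html>")

# One line, in the style of the call above:
one_line = BeautifulSoup(fake_response.text, "html.parser")

# The same work, from the inside out:
html_text = fake_response.text                    # 1. innermost: pull out the string
parsed = BeautifulSoup(html_text, "html.parser")  # 2. hand that string to the parser
step_by_step = parsed                             # 3. outermost: assign the result

# Both routes end with the same parsed document.
print(str(one_line) == str(step_by_step))
```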

And as a final pass, we "prettify" the code:  Beautiful Soup re-indents the HTML so that each tag sits on its own line, which makes the layout much easier for human readers to follow.

In [24]:
# prettify() returns a string, so beautifulBoxOffice now holds text, not a parsed object.
beautifulBoxOffice = beautifulBoxOffice.prettify()
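
One detail worth knowing: `prettify()` hands back a plain string rather than a Beautiful Soup object. If you expect to keep searching the parsed page afterwards, one option is to store the pretty-printed text under a separate name. The sketch below uses a made-up snippet of HTML, and the names `soup` and `pretty` are assumptions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>Hi</p></body></html>", "html.parser")

# Keep the parsed object as-is; store the formatted text separately.
pretty = soup.prettify()

print(type(pretty).__name__)  # str
print(pretty)                 # the same HTML, one tag per line, indented
print(soup.p.string)          # tag navigation still works on the soup object
```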

What does that leave us with?  Let's find out:

In [25]:
print(beautifulBoxOffice)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <title>
   Weekend Box Office Index, 1982-Present - Box Office Mojo
  </title>
  <meta content="Weekend box office from 1982 to the present." name="description"/>
  <meta content="weekend box office, weekend, box, office, movie, report, film" name="keywords"/>
  <link charset="utf-8" href="/css/mojo.css?1" media="screen" rel="stylesheet" title="no title" type="text/css"/>
  <link charset="utf-8" href="/css/mojo.css?1" media="print" rel="stylesheet" title="no title" type="text/css"/>
 </head>
 <body>
  <iframe frameborder="0" height="1" id="sis_pixel_sitewide" marginheight="0" marginwidth="0" style="display: none;" width="1">
  </iframe>
  <script>
   setTimeout(function(){
        try{
            //sis3.0 pixel
            var cacheBust = Math.random() * 10000000000000000,
                url_sis3 = 'http://s.amazon-adsystem.com/iu3?',
                param