## Web Scraping 101: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

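This notebook walks through scraping data from web pages with Beautiful Soup.

Web scraping is feasible because web pages share a common, structured format (HTML), so the same kind of information tends to sit in the same place from page to page on a site.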

## HTML Refresher

### Overview
* HTML is the basic language used to create a web page. 
* It tells the web browser what text/media to display, where to display it, and how to display it (style).
* HTML is very structured/hierarchical.
* Every page is made up of discrete "elements."

### Tags

* Elements are labeled with "tags."

* For example:

    ```html
    <p>You are beginning to learn HTML.</p>
    ```

### Attributes

* A start tag also often contains "attributes" with info about the element.

* Attributes usually have a name and value.

* Example:

```html
<p class="my_red_sentences">You are beginning to learn HTML.</p>
```

### Structure

A full HTML document has a structure more like this:

```html
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
```
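
To make the nesting concrete, here is a minimal sketch that walks a similar document with Python's built-in `html.parser` module and records each tag with its depth and attributes. The `OutlineParser` class and `SAMPLE_HTML` string are names made up for this illustration, and the depth counter assumes every tag is explicitly closed, as in the sample.

```python
from html.parser import HTMLParser

# Sample document, modeled on the structure example above
SAMPLE_HTML = """
<html>
  <head></head>
  <body>
    <p class="red">You are beginning to learn HTML.</p>
    <h1>This is a header</h1>
    <a href="http://www.google.com">Some link</a>
  </body>
</html>
"""


class OutlineParser(HTMLParser):
    """Record each start tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag, attributes-dict)

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        # assumes every tag in the sample has a matching end tag
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE_HTML)

# Indent each tag by its depth to show the hierarchy
for depth, tag, attrs in parser.outline:
    print("  " * depth + tag, attrs)
```

The indentation in the printed outline mirrors the tree: `head` and `body` sit inside `html`, and `p`, `h1`, and `a` sit inside `body`.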

### Explore in Browser

* Let's explore some live HTML!
* Go to http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser, preferably Chrome.
* Right-click the page and choose Inspect (Inspect Element), then also try View Page Source.

## HTML to BeautifulSoup

### Request data for The Big Lebowski

Scrape some information about [The Big Lebowski](http://boxofficemojo.com/movies/?id=biglebowski.htm).

In [None]:
from __future__ import print_function, division

In [None]:
# if needed: pip install requests or conda install requests
import requests

requests.__path__

In [None]:
url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'

response = requests.get(url)

### Check the Status

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [None]:
response.status_code # status code = 200 => OK
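
As a small illustration of what the numeric codes mean, the sketch below uses the standard library's `http.HTTPStatus` enum to look up the reason phrase for a code and to flag whether it is in the 2xx (success) range. The helper name `describe_status` is made up for this example. With `requests`, `response.raise_for_status()` offers a related check by raising an exception on 4xx/5xx responses.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (reason phrase, is_success) for a numeric HTTP status code."""
    status = HTTPStatus(code)        # raises ValueError for an unknown code
    is_success = 200 <= code < 300   # 2xx codes mean the request succeeded
    return status.phrase, is_success


for code in (200, 404, 500):
    print(code, describe_status(code))
```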

### Look at the Text

In [None]:
print(response.text)

### Soupify the Text

In [None]:
page = response.text

lxml is a library for processing XML and HTML in Python. Here it serves as the parser that Beautiful Soup uses to turn the raw text of the page into a navigable tree.

In [None]:
# if needed: conda install beautifulsoup4 lxml (in a terminal window)
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "lxml")

In [None]:
print(soup)

### Prettify the Soup

A webpage can be thought of as a tree of elements: there is the 'body', which contains a few 'divs', and each of those 'divs' can in turn contain more 'divs' and other elements. A Soup object contains this tree. The `prettify()` method turns a Beautiful Soup tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line.

In [None]:
print(soup.prettify())
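
The same two steps can be tried offline on a tiny document. This is a minimal sketch: the `page` string is a made-up stand-in for `response.text`, and it uses Python's built-in `"html.parser"` instead of `"lxml"` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML standing in for response.text
page = '<html><body><div><p class="red">Hello</p><a href="/next">Next</a></div></body></html>'

# "html.parser" ships with Python; "lxml" is a third-party alternative
soup = BeautifulSoup(page, "html.parser")

# prettify() returns a string with one tag per line, indented by depth
pretty = soup.prettify()
print(pretty)
```

The soup can be navigated like a tree (`soup.body.div.p`), while `pretty` is just a string meant for reading.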

## Beautiful Soup - Find & Find_All

### `soup.find()`

* `soup.find()` is the most common function we will use from this package.  
* Let's try out some common variations of `soup.find()`

* `soup.find()` returns the first matched tag it finds.
* It searches the entire tree.

* Search for a type of tag by using the tag as a string argument ('body','div','p','a')

In [None]:
soup.find('a') # "a" tag is for hyperlink

In [None]:
# Equivalently:
soup.a

In [None]:
# Prettier:
print(soup.a.prettify())

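Here's how you can find the next sibling tag, that is, the next element at the same level of the tree (not necessarily another `a` tag).

In [None]:
soup.find('a').find_next_sibling()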
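
The difference between "next sibling" and "next match in the document" is easy to miss, so here is a small sketch on made-up markup. `find_next_sibling("a")` only looks at tags on the same level as the starting tag, while `find_next("a")` follows document order and can descend into other elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link is nested inside a <span>
html = """
<div id="nav">
  <a href="/home">Home</a>
  <span><a href="/inner">Inner</a></span>
  <a href="/movies">Movies</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                         # first <a> in the whole tree
sibling_link = first_link.find_next_sibling("a")    # next <a> on the same level
following_link = first_link.find_next("a")          # next <a> in document order

print(first_link["href"], sibling_link["href"], following_link["href"])
```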

### `soup.find_all()`

`soup.find_all()` returns a list of all matches.

In [None]:
len(soup.find_all('a'))

In [None]:
for link in soup.find_all('a'): 
    print(link)

In [None]:
[link for link in soup.find_all('a') if 'joelcoen' in str(link)]
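
Filtering on `str(link)` works, but it matches anywhere in the tag's markup. A slightly more targeted sketch, on made-up markup, filters on the `href` attribute itself. It uses `.get()` because not every `a` tag has an `href`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the last anchor has no href attribute
html = """
<ul>
  <li><a href="/people/joelcoen">Joel Coen</a></li>
  <li><a href="/people/ethancoen">Ethan Coen</a></li>
  <li><a name="top">No href here</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")

# .get() returns None instead of raising KeyError when the attribute is missing
hrefs = [link.get("href") for link in links]

# Keep only the links whose href contains a given substring
coen_links = [link for link in links if "joelcoen" in link.get("href", "")]

# href=True restricts the search to tags that have an href at all
links_with_href = soup.find_all("a", href=True)
```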

## Beautiful Soup - More on Find

### `href` Example

In [None]:
# retrieve the url from an anchor tag
soup.find('a')['href']

### `id` and `class` examples

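* Tags can also be matched on an attribute such as `id` or `class`.
* `class` is a reserved word in Python, so Beautiful Soup uses the keyword argument `class_` instead.
* Example: elements with the class 'mp_box_content'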

In [None]:
soup.find_all(id='top_links')

In [None]:
for element in soup.find_all(class_='mp_box_content'):
    print(element, '\n')
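
Here is a minimal offline sketch of the same lookups on made-up markup. It also shows that `class_` matches an element when any one of its classes equals the given name, and that a `class` attribute comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the id and class names from above
html = """
<div id="top_links"><a href="/">Home</a></div>
<div class="mp_box_content">Box one</div>
<div class="mp_box_content highlighted">Box two</div>
<div class="other">Not a match</div>
"""
soup = BeautifulSoup(html, "html.parser")

top = soup.find(id="top_links")                   # ids are meant to be unique
boxes = soup.find_all(class_="mp_box_content")    # matches both boxes

# Equivalent search using an explicit attribute dictionary
boxes_by_attrs = soup.find_all("div", attrs={"class": "mp_box_content"})

print([box.text for box in boxes], boxes[1]["class"])
```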

## Beautiful Soup - Chaining Finds

All the fields in the `mp_box_content` elements can be found by "chaining" `find_all` calls: first find the containers, then search within each one.

In [None]:
# 'td' is for a cell in an HTML table
chain = [x.find_all('td') for x in soup.find_all(class_='mp_box_content')]

In [None]:
# for the first mp_box_content find all td's
chain[0]

To extract just the value of interest:

In [None]:
# Find the domestic gross. The '\xa0' is a non-breaking space character
soup.find(class_='mp_box_content').find_all('td')[1].text

In [None]:
# There are 2 td's; the second one holds the $17,451,873, and [1:] drops the leading non-breaking space
soup.find(class_='mp_box_content').find_all('td')[1].text[1:] 
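
Slicing with `[1:]` depends on there being exactly one stray character. A somewhat more forgiving sketch, on markup invented to resemble a label/value table, uses `str.strip()`, which treats the non-breaking space as whitespace in Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; "\xa0" in this Python string is a real non-breaking space
html = """
<div class="mp_box_content">
  <table>
    <tr><td>Domestic:</td><td>\xa0$17,451,873</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Chain the searches: first the container, then the cells inside it
cells = soup.find(class_="mp_box_content").find_all("td")

raw_value = cells[1].text          # still carries the leading non-breaking space
clean_value = raw_value.strip()    # strip() removes it along with ordinary spaces

print(repr(raw_value), repr(clean_value))
```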

## Let's Practice Web Scraping!

### Items to scrape for each movie:

* Movie Title
* Domestic Total Gross
* Runtime
* MPAA Rating
* Release Date

### Movie Title

In [None]:
soup.find('title')

In [None]:
soup.find('title').text

In [None]:
title_string = soup.find('title').text
title_string

In [None]:
title_string.split('(')

In [None]:
# .strip() removes the white spaces at the beginning and end of the string
title = title_string.split('(')[0].strip() 
title

### Domestic Total Gross

Let's try to find the exact text.

In [None]:
print(soup.find(text="Domestic Total Gross"))

The `text` argument does an exact-match search, so the string has to match the whole text node, including the trailing colon and space.

In [None]:
print(soup.find(text="Domestic Total Gross: "))

What if we don't want to be careful? [Regular expressions](https://xkcd.com/208/) to the rescue!

We are going to talk a lot more about regular expressions in the next week or two; they're a really powerful way to search for patterns in text. Today, we're going to use a very simple case: doing a "contains" match instead of an "exact match".

In [None]:
import re
domestic_total_regex = re.compile('Domestic Total')
domestic_total_regex

In [None]:
dtg_string = soup.find(text=domestic_total_regex)
dtg_string

In [None]:
dtg_string.find_next_sibling()

We found the domestic total gross! Now let's strip it down and convert it to an integer.

In [None]:
dtg = dtg_string.find_next_sibling().text
print(dtg, type(dtg))

dtg = dtg.replace('$','').replace(',','')
print(dtg, type(dtg))

domestic_total_gross = int(dtg)
print(domestic_total_gross, type(domestic_total_gross))
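
The whole sequence (exact match, regex match, sibling lookup, conversion) can be replayed offline. This is a sketch on invented markup that assumes the label is a text node followed by the value in a `b` tag; the real page may be laid out differently. It uses `string=`, the newer spelling of the `text=` argument in Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a label text node followed by the value in a <b> tag
html = "<p>Domestic Total Gross: <b>$17,451,873</b></p>"
soup = BeautifulSoup(html, "html.parser")

exact_miss = soup.find(string="Domestic Total Gross")        # None: the node also has ": "
exact_hit = soup.find(string="Domestic Total Gross: ")       # the whole text node matches
regex_hit = soup.find(string=re.compile("Domestic Total"))   # "contains"-style match

# The value lives in the tag that follows the label text
raw_value = regex_hit.find_next_sibling("b").text
domestic_total_gross = int(raw_value.replace("$", "").replace(",", ""))

print(exact_miss, repr(exact_hit), domestic_total_gross)
```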

### Runtime, MPAA Rating & Release Date

#### Step 1: Create Function to Identify Values

Let's make a function to scrape multiple things, assuming the value will always follow the field name.

In [None]:
def get_movie_value(soup, field_name):
    '''Grab a value from boxofficemojo HTML

    Takes a string attribute of a movie on the page and returns the string in
    the next sibling object (the value for that attribute) or None if nothing is found.
    '''
    # find the first text node that contains the field name
    obj = soup.find(text=re.compile(field_name))

    if not obj:
        return None

    # this works for most of the values
    next_sibling = obj.find_next_sibling()

    if next_sibling:
        return next_sibling.text
    else:
        return None

In [None]:
# domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print(dtg)

In [None]:
# runtime
runtime = get_movie_value(soup,'Runtime')
print(runtime)

In [None]:
# rating
rating = get_movie_value(soup,'MPAA Rating')
print(rating)

In [None]:
release_date = get_movie_value(soup,'Release Date')
print(release_date)

#### Step 2: Convert Values to Appropriate Data Types

In [None]:
import dateutil.parser

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

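def runtime_to_minutes(runtimestring):
    # expects hours at index 0 and minutes at index 2 after splitting on whitespace
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (ValueError, IndexError):
        # the string was not in the expected format
        return None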

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date
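
The converters above can be exercised on sample strings without fetching a page. The sketch below redefines them so that it runs on its own, and it swaps `dateutil.parser.parse` for the standard library's `datetime.strptime` with an explicit format. The sample input strings and the `fmt` default are assumptions about how the scraped values look, not something read from the live page.

```python
from datetime import datetime


def money_to_int(moneystring):
    """Convert a string such as '$17,451,873' to the integer 17451873."""
    return int(moneystring.replace("$", "").replace(",", ""))


def runtime_to_minutes(runtimestring):
    """Convert a string such as '1 hrs. 57 min.' to minutes, or None if malformed."""
    parts = runtimestring.split()
    try:
        return int(parts[0]) * 60 + int(parts[2])
    except (ValueError, IndexError):
        return None


def to_date(datestring, fmt="%B %d, %Y"):
    """Parse a date such as 'March 6, 1998' using an explicit (assumed) format."""
    return datetime.strptime(datestring, fmt)


# Assumed sample inputs, shaped like the values the page is expected to hold
print(money_to_int("$17,451,873"))
print(runtime_to_minutes("1 hrs. 57 min."))
print(runtime_to_minutes("N/A"))
print(to_date("March 6, 1998"))
```

An explicit format fails loudly on unexpected input, whereas `dateutil` accepts many formats at the cost of occasionally guessing wrong on ambiguous dates.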

#### Step 3: Apply the Conversions

In [None]:
# Let's get these again and format them all in one swoop

raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

raw_release_date = get_movie_value(soup,'Release Date')
release_date = to_date(raw_release_date)

print(domestic_total_gross, runtime, release_date)
print(type(domestic_total_gross), type(runtime), type(release_date))

#### Step 4: Print It All Out

In [None]:
from pprint import pprint # pretty print

headers = ['movie title', 'domestic total gross',
           'runtime (mins)', 'rating', 'release date']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                runtime,
                                rating, 
                                release_date]))

movie_data.append(movie_dict)
pprint(movie_data)

## Table Scraping Example

### Step 1: Soupify the Website

Let's take a look at the foreign language page of Box Office Mojo. Suppose we want to pull all of the data from the main table on the page.

In [None]:
url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page,"lxml")

### Step 2: Find the Tables

In [None]:
tables = soup.find_all("table")
tables

### Step 3: Pull Just the Rows

In [None]:
rows = tables[3].find_all('tr') # tr tag is for rows

In [None]:
# let's take a look at one row
rows[0]

In [None]:
# let's take a look at one value in the row
rows[0].find_all('td')[1].find('a')['href']

### Step 4: Pull All Values

In [None]:
rows[1].find_all('td')[1].find('a')['href']

In [None]:
rows = rows[1:21] # let's just look at the first 20 rows for now
movies = {}

for row in rows:
    items = row.find_all('td')
    link = items[1].find('a')['href'] # the link to the movie's page serves as the key
    movies[link] = [i.text for i in items[1:]]
    
list(movies.items())[0]
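
The row loop can be tested on a small table before pointing it at the live page. In this sketch the table, its column names, and its values are all invented; it also pairs each cell with its header so the result is easier to read than a bare list.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the real page's columns and values will differ
html = """
<table>
  <tr><td>Rank</td><td>Title</td><td>Studio</td></tr>
  <tr><td>1</td><td><a href="/movies/?id=film_a.htm">Film A</a></td><td>Studio X</td></tr>
  <tr><td>2</td><td><a href="/movies/?id=film_b.htm">Film B</a></td><td>Studio Y</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find("table").find_all("tr")
header = [cell.text for cell in rows[0].find_all("td")]

movies = {}
for row in rows[1:]:                       # skip the header row
    items = row.find_all("td")
    link = items[1].find("a")["href"]      # the movie's URL serves as a unique key
    values = [cell.text for cell in items[1:]]
    movies[link] = dict(zip(header[1:], values))

print(movies)
```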

### Step 5: Pandas Alternative

In [None]:
# you can also use pandas to read tables
import pandas as pd

url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

In [None]:
tables = pd.read_html(url)

In [None]:
tables[2]
# how can you fix the header?

In [None]:
tables[2][0:5]

Conclusion: Beautiful Soup is powerful, but it has limitations. It only parses the HTML it is given, so if a page requires interaction (like entering a password) or its content is generated dynamically rather than served as static HTML, Beautiful Soup alone is not enough. We will explore other tools for that.

One such tool for tomorrow: *Selenium*