## Web Scraping 1: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

This notebook covers scraping data from web pages with `requests` and BeautifulSoup.

### First, an HTML refresher

**HTML** is the basic language used to create a web page. 

* It tells the web browser what text/media to display, where to display it, and how to display it (style)

* HTML is highly structured and hierarchical.

* Every page is made up of discrete "elements."
  * ex: `<p>You are beginning to learn HTML.</p>`

* Elements are labeled with "tags."
  * ex: `<p>` and `</p>` form the tags that "tell" the browser that this is a paragraph element

A start tag also often contains **attributes** with info about the element.

* Attributes consist of a name and a value
  * ex: `<p class="obvious-statements">You are beginning to learn HTML.</p>`

**Class attributes** are particularly useful: they let you single out elements, whether to style them with CSS, change them with JavaScript, or (as we will do) find them when scraping.

---

A full HTML document has a structure more like this:

```
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
```
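
To make the vocabulary of elements, tags, and attributes concrete, here is a minimal sketch that walks the same small document with Python's standard-library `html.parser` and lists each start tag with its attributes. The `TagLister` class and the `SAMPLE_HTML` name are made up for this illustration, and the link is written with a full `http://` URL.

```python
from html.parser import HTMLParser

# A copy of the small sample document (SAMPLE_HTML is a name invented for this sketch).
SAMPLE_HTML = """
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag_name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE_HTML)
for tag, attrs in lister.seen:
    print(tag, attrs)
```

Only `<p>` and `<a>` carry attributes here; the other tags come back with an empty dictionary.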


# Exercise

## Explore some live HTML with `BeautifulSoup`

**Step 1.** Go to http://boxofficemojo.com/movies/?id=biglebowski.htm 

* Right click on any element and choose 'Inspect' or 'Inspect Element'


**Step 2.** Get the HTML from this page and convert to a BeautifulSoup object


In [None]:
import requests

Run the cell above. If the import fails, run `pip install requests` in your terminal.

**Step 2a.** Using requests, we'll retrieve the contents of the page

In [None]:
BASE_URL = 'http://boxofficemojo.com/movies/?id='
slug = 'biglebowski.htm'

url = BASE_URL + slug
response = requests.get(url)

**Step 2b.** Make sure we got the page successfully by checking the response object: `response.ok` is `True` when the status code is below 400.

[Information on HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)

In [None]:
response.ok
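
As a rough guide to reading status codes, the sketch below uses the standard library's `http.HTTPStatus` to turn a numeric code into its standard phrase and applies the same "below 400" rule that `response.ok` uses. The helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase  # the standard reason phrase for this code
    except ValueError:
        phrase = "Unknown"  # not a code the standard library knows about
    # requests reports response.ok as True for any status code below 400
    outcome = "ok" if code < 400 else "error"
    return "{} {} ({})".format(code, phrase, outcome)


for code in (200, 301, 404, 503):
    print(describe_status(code))
```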

**Step 2c.** Access the contents of the page and make sure they look reasonable

In [None]:
raw_html = response.text
raw_html[:500]  # peek at the first 500 characters

**Step 2d.** Import BeautifulSoup and convert the contents of the page to a BeautifulSoup object.

In [None]:
from bs4 import BeautifulSoup  # import the BeautifulSoup class
# (if this fails: pip install beautifulsoup4)

# Parse the page; naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(raw_html, 'html.parser')

In [None]:
print(soup.prettify())

**Step 3.** Explore the page contents using `soup.find()`

>`soup.find()` is the method we will use most often from the `Beautiful Soup` package.



**3a.** Search for a type of tag (like 'body','div','p','a') by using the tag as a string argument

**`soup.find()`** returns the first matching tag it finds (or `None` if nothing matches).

In [None]:
soup.find('a')

In [None]:
# Equivalently:
soup.a

**`soup.find_all()`** returns a list of all matches in the document

In [None]:
soup.find_all('a')

In [None]:
first_link = soup.find_all('a')[0]
print(type(first_link))          # each match is a bs4 Tag
print(type(first_link['href']))  # attribute values come back as strings
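
The difference between `find()` and `find_all()` is easier to see on a snippet small enough to read in full. The sketch below uses a made-up HTML fragment (not taken from the live page) and a separate variable, `soup_demo`, so it does not overwrite the notebook's `soup`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this illustration.
html = """
<div id="nav">
  <a href="/movies/">Movies</a>
  <a href="/alltime/">All Time</a>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first_link = soup_demo.find("a")      # a single Tag: the first match
all_links = soup_demo.find_all("a")   # a list-like collection of every match
missing = soup_demo.find("table")     # None, because nothing matched

print(type(first_link).__name__)
print(len(all_links))
print(first_link["href"])             # attribute access works like a dictionary
print(missing)
```

Because `find()` can return `None`, chaining calls on its result raises an `AttributeError` when an element is absent, which is worth keeping in mind on pages whose layout varies.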

**3b.** Search for a more specific element or set of elements, matching on an attribute like an id or class.


In [None]:
print(soup.find('div', {'id': 'hp_footer'}))        # match on an id attribute
print(soup.find('ul', class_='footer_link_list'))   # match on a class (note the trailing underscore)

**3c.** Chaining `find()`s and `find_all()`s to get even more specific


In [None]:
# find the footer div, take its first <li>, then step up to the element that contains it
thing = soup.find('div',{'id':'hp_footer'}).find('li').find_parent()

**3d.** Extract just a value of interest from the returned element(s)

In [None]:
second_to_last_link = thing.find_all('li')[-2].find('a')
print(second_to_last_link['href'])  # the link target
print(second_to_last_link.text)     # the visible link text
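
Steps 3b through 3d can be tried offline on a tiny stand-in for the footer. The markup below is hypothetical: it borrows the `hp_footer` id and `footer_link_list` class from the cells above, but the links are invented and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical footer markup; the real page's structure may differ.
html = """
<div id="hp_footer">
  <ul class="footer_link_list">
    <li><a href="/about/">About</a></li>
    <li><a href="/help/">Help</a></li>
    <li><a href="/contact/">Contact</a></li>
  </ul>
</div>
"""
footer_soup = BeautifulSoup(html, "html.parser")

footer = footer_soup.find("div", {"id": "hp_footer"})      # 3b: match on an id
link_list = footer.find("ul", class_="footer_link_list")   # 3b: match on a class, searching inside the footer
second_to_last = link_list.find_all("li")[-2].find("a")    # 3c: chain down to a single <a>

print(second_to_last["href"])        # 3d: an attribute value
print(second_to_last.text.strip())   # 3d: the visible text
```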

## Web scraping for reals
Web scraping is made simple by the consistent format of information among like pages of a website. 

Let's choose a set of items to scrape for each movie

### Items to scrape for each movie:

1. Title
1. Domestic Total Gross
1. Actors
1. Year
1. Release Date
1. Director


**Step 4.** Let's get all of the information we want for one movie

**4a.** For each item on our list, let's figure out how to access the information of interest


In [None]:
from pprint import pprint
all_tables = soup.find_all('table')
table_with_title = all_tables[2]  # third table on the page

# the first table nested inside it; its first row is the one we want
money_table = table_with_title.find_all('table')[0]
row_with_stuff = money_table.find('tr')
# print(row_with_stuff)

cells_with_stuff = row_with_stuff.find_all('td')
print(cells_with_stuff)

In [None]:
def title_from_cws(cws):
    title = cws[1].find('b').text
    return title

In [None]:
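# Relies on the BASE_URL global defined in Step 2a
import requests
from bs4 import BeautifulSoup
import urllib.parse as urlparse  # Python 3 location of urljoin; used in a later cell


def get_soup_from_slug(slug):
    """Fetch the movie page for a slug and return it as a BeautifulSoup object."""
    url = BASE_URL + slug
    response = requests.get(url)
    response.raise_for_status()  # fail loudly instead of silently returning None
    return BeautifulSoup(response.text, 'html.parser')

def find_cells_w_stuff(soup):
    """Return the <td> cells of the first row of the nested table that holds the title."""
    all_tables = soup.find_all('table')
    table_with_title = all_tables[2]

    money_table = table_with_title.find_all('table')[0]
    row_with_stuff = money_table.find('tr')

    return row_with_stuff.find_all('td')

def get_stuff_from_bomojo_url(slg):
    """Fetch a movie page by slug and return its title-row cells."""
    soup = get_soup_from_slug(slg)  # soup object has .find_all
    return find_cells_w_stuff(soup)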

In [None]:
slug = 'brooklyn.htm'

cws = get_stuff_from_bomojo_url(slug)
title = title_from_cws(cws)
print(title)

In [None]:
url = "http://www.boxofficemojo.com/alltime/world/?pagenum=1&p=.htm"
response = requests.get(url)

In [None]:
raw_html = response.text

In [None]:
soup = BeautifulSoup(raw_html, 'html.parser')
SHORT_BASE_URL = 'http://boxofficemojo.com'
long_slug = '/movies/?id=dawnoftheapes.htm'
# urljoin resolves the site-relative link against the base URL
url = urlparse.urljoin(SHORT_BASE_URL, long_slug)

In [2]:
from pprint import pprint
import collections
import urllib.parse as urlparse  # Python 3 location of urljoin
import bs4
import requests


def ahref_to_string(link):
    """Pull the movie id out of a link such as <a href="/movies/?id=NAME.htm">.

    Prints a notice and returns None when the link is not a movie link.
    """
    string = str(link)
    try:
        movie_title = string.split('=')[2].split('.')[0]
        return movie_title
    except IndexError:
        print("?????NOT A MOVIE?????\n", link)

def get_movie_names(list_of_movies_links):
    """Apply ahref_to_string to every link and collect the results."""
    movie_title_list = []

    for link in list_of_movies_links:
        movie_name = ahref_to_string(link)
        movie_title_list.append(movie_name)
    return movie_title_list


URL1 = "http://www.boxofficemojo.com/alltime/world/"
# URL2 = "http://www.boxofficemojo.com/alltime/world/?sort=year&order=DESC&pagenum=2&p=.htm"
response = requests.get(URL1)
raw_html = response.text

# Earlier approach, kept for reference: collect every link and extract the names
# soup = bs4.BeautifulSoup(raw_html, 'html.parser')
# dict_of_movies = collections.defaultdict(dict)
# list_of_movies_links = soup.find_all('a')[44:]
# list_of_movies_names = get_movie_names(list_of_movies_links)

soup = bs4.BeautifulSoup(raw_html, 'html.parser')
tables = soup.find_all('table')
movie_table = tables[2]
rows = movie_table.find_all('tr')
# cell = rows[1].find_all('td')
# cell_contents = cell[0].find('font')
# pprint(cell_contents)

rows  # display every row of the table

Output (truncated): a list of `<tr>` rows. The first is the header row, whose sort links read Rank, Title, Studio, Worldwide, Domestic / %, and Overseas.

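The alphabetical listing pages follow a regular URL pattern:

* First page for the letter A: http://www.boxofficemojo.com/movies/alphabetical.htm?letter=A&p=.htm
* Second page for the letter A (different subheading): http://www.boxofficemojo.com/movies/alphabetical.htm?letter=A&page=2&p=.htm
* First page for the letter B (different main heading): http://www.boxofficemojo.com/movies/alphabetical.htm?letter=B&p=.htm
* Second page for the letter B (different subheading): http://www.boxofficemojo.com/movies/alphabetical.htm?letter=B&page=2&p=.htm

Schema: `base_url + ?letter={letter}&page={pagenum}&p=.htm`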
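
The example URLs above use a `letter` parameter, add a `page` parameter from the second page onward, and always end in `p=.htm`. A minimal sketch of a builder for that pattern follows; the names `LIST_URL` and `listing_url` are made up for this illustration, and whether the site accepts an explicit `page=1` is not something the examples show, so the first page is built without it.

```python
from urllib.parse import urlencode

# Hypothetical constant name; the address itself comes from the example URLs.
LIST_URL = "http://www.boxofficemojo.com/movies/alphabetical.htm"


def listing_url(letter, page=1):
    """Build the listing URL for one letter and page, following the observed pattern."""
    params = [("letter", letter)]
    if page > 1:
        # the observed first-page URLs carry no page parameter
        params.append(("page", page))
    params.append(("p", ".htm"))
    return LIST_URL + "?" + urlencode(params)


print(listing_url("A"))
print(listing_url("B", page=2))
```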

In [213]:
from pprint import pprint
import collections
import string
import urllib.parse as urlparse  # Python 3 location of urljoin
from bs4 import BeautifulSoup as bs
import requests

BASE_URL = "http://www.boxofficemojo.com/movies/alphabetical.htm?"
SLUG = "alphabetical.htm?letter={letter}&page={page}&p=.htm"

LETTERS = list(string.ascii_uppercase)  # 'A' through 'Z'

def gen_list_of_urls():
    """Build the listing URL for every letter and page number (no requests are made here)."""
    list_of_urls = []
    for letter in LETTERS:
        # page numbers appear to start at 1; the upper bound is a guess
        for number in range(1, 100):
            url = urlparse.urljoin(BASE_URL, SLUG.format(letter=letter, page=number))
            list_of_urls.append(url)
    return list_of_urls

movies_dict = collections.defaultdict(dict)

# First, get the data from one page
url = urlparse.urljoin(BASE_URL, SLUG.format(letter="A", page=3))
response = requests.get(url)
raw_html = response.text

soup = bs(raw_html, 'html.parser')

# Rows: movies appear to start at row 5; row 7 is used here as a sample
rows = soup.find_all('tr')
row = rows[7]

# Cells of the sample row
cells = row.find_all('td')

# Title: the movie id embedded in the link's href
movie_title = str(cells[0]).split('=')[5].split('.')[0]

# Studio: the raw cell HTML for now
studio = str(cells[1])

# Total gross
cell = cells[2]
total_gross = str(cell).split('>')[2].split('<')[0]

# Theaters for total gross
cell = cells[3]
entry = cell.find_all('font')
theaters = str(entry).split('>')[1].split('<')[0]

# Opening
cell = cells[4]
opening = str(cell).split('>')[2].split('<')[0]

# Theaters at opening
cell = cells[5]
entry = cell.find_all('font')
theaters_at_opening = str(entry[0]).split('>')[1].split('<')[0]

# Date of opening
cell = cells[6]
date_of_opening = str(cell).split('>')[3].split('<')[0]
date_of_opening


# entry = cell.find_all('font')
# entry

# bodys = table.find_all('tr')
# bodys
# TableSchema = collections.namedtuple('table_schema', ['movie',
#                                                       'studio',
#                                                       'total_gross',
#                                                       'opening',
#                                                       'theaters',
#                                                       'open_date'])


    

'8/11/2000'

In [223]:
str(row.find_all('td')[1]).split('>')[2].split('<')[0]

'FL'

**4b.** As we figure out each one, convert the process into a function so that it becomes reusable.

**4c.** If a function gets too complicated, let's pull out helper functions.
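
As one illustration of small helper functions, the sketch below converts scraped strings into Python values. It assumes that gross figures look like `$1,234,567` (an assumption about the page's formatting) and that dates look like the `8/11/2000` value shown earlier. The names `money_to_int` and `to_date` are hypothetical.

```python
from datetime import datetime


def money_to_int(money_string):
    """Convert a string such as '$1,234,567' to the integer 1234567."""
    return int(money_string.replace("$", "").replace(",", ""))


def to_date(date_string):
    """Convert a month/day/year string such as '8/11/2000' to a datetime.date."""
    return datetime.strptime(date_string, "%m/%d/%Y").date()


print(money_to_int("$1,234,567"))
print(to_date("8/11/2000"))
```

Keeping each conversion in its own function means a formatting surprise on one page, such as a missing value, only has to be handled in one place.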

## Generalizing

Now that we can scrape one movie page, the same code should work on any movie page on this website, provided the pages share a layout.

**Step 5.** Make one function to create a movie page information object for a given url
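
One possible shape for the "movie page information object" is a `namedtuple` whose fields mirror the columns extracted earlier. This is only a sketch: the `MovieRow` type and the `movie_row_from_values` helper are assumptions, the fetching and parsing step is left out, and the sample values are placeholders apart from the studio and date seen in the earlier output.

```python
import collections

# Field names follow the columns extracted earlier; the record type itself is an assumption.
MovieRow = collections.namedtuple(
    "MovieRow",
    ["title", "studio", "total_gross", "theaters",
     "opening", "theaters_at_opening", "open_date"],
)


def movie_row_from_values(values):
    """Build a MovieRow from seven already-extracted cell strings."""
    if len(values) != len(MovieRow._fields):
        raise ValueError(
            "expected {} values, got {}".format(len(MovieRow._fields), len(values))
        )
    return MovieRow(*[value.strip() for value in values])


# Placeholder values for illustration only (not scraped data).
example = movie_row_from_values(
    ["examplemovie", "FL", "$1,000", "10", "$500", "5", "8/11/2000"]
)
print(example.studio, example.open_date)
```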

**Step 6.** Use our function on new pages

**6a.** Create a list of urls to scrape
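
Once there is a list of urls, the loop over them benefits from two habits: pausing between requests and recording failures instead of stopping at the first one. The sketch below takes the fetching function as a parameter so it can be tried without network access. `scrape_all` and `fake_fetch` are hypothetical names, and in real use `fetch` would be whatever function Step 5 produces.

```python
import time


def scrape_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for every url, pausing between requests.

    fetch is any callable that returns parsed data or raises an exception.
    Returns (results, failures): a dict of url -> data and a list of failed urls.
    """
    results, failures = {}, []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # be polite to the server
        try:
            results[url] = fetch(url)
        except Exception as error:     # keep going and report the failure
            failures.append(url)
            print("failed:", url, error)
    return results, failures


def fake_fetch(url):
    """Stand-in for a real scraper so the loop can run offline."""
    if "bad" in url:
        raise ValueError("could not parse page")
    return {"url": url}


results, failures = scrape_all(["page-a", "bad-page", "page-c"], fake_fetch, delay_seconds=0)
print(len(results), failures)
```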

**Step 7.** Turn our work into a reusable module.