## Web Scraping 101: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Web scraping means extracting data from web pages programmatically.

It is feasible because pages are written in HTML, which gives the information on a page a consistent, predictable structure.

## HTML Refresher

### Overview
* HTML is the basic language used to create a web page. 
* It tells the web browser what text/media to display, where to display it, and how to display it (style).
* HTML is very structured/hierarchical.
* Every page is made up of discrete "elements."

### Tags

* Elements are labeled with "tags."

* For example:

    ```html
    <p>You are beginning to learn HTML.</p>
    ```

### Attributes

* A start tag also often contains "attributes" with info about the element.

* Attributes usually have a name and value.

* Example:

```html
<p class="my_red_sentences">You are beginning to learn HTML.</p>
```

### Structure

A full HTML document has a structure more like this:

```html
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>
```
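To make the hierarchy concrete, here is a minimal sketch that walks a document like the one above with the standard library's `html.parser` and records how deeply each tag is nested. The class name `OutlinePrinter` is a hypothetical helper for illustration only; BeautifulSoup, introduced later, does this kind of work for us.

```python
from html.parser import HTMLParser

SAMPLE = """<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Record each start tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(SAMPLE)

# Indent each tag by its depth to show the tree shape
for depth, tag, attrs in printer.outline:
    print("  " * depth + tag, attrs)
```

The `p`, `h1`, and `a` elements all sit at the same depth because they are siblings inside `body`.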

### Explore in Browser

* Let's explore some live HTML!
* Go to http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser, preferably Chrome.
* Right-click the page and choose Inspect Element to explore the elements interactively, or choose View Page Source to see the raw HTML.

## HTML to BeautifulSoup

### Request data for The Big Lebowski

Scrape some information about [The Big Lebowski](http://boxofficemojo.com/movies/?id=biglebowski.htm).

In [1]:
from __future__ import print_function, division

In [2]:
# if needed: pip install requests or conda install requests
import requests

requests.__path__

['//anaconda3/lib/python3.7/site-packages/requests']

In [3]:
url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'

response = requests.get(url)

In [11]:
!pip install requests

Looking in indexes: https://pypi.org/simple, https://pypi.fury.io/artificialsoph/


### Check the Status

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

In [4]:
response.status_code # status code = 200 => OK

200

In [5]:
url2="https://www.saudi.gov.sa/wps/portal/snp/pages/agencies/!ut/p/z1/04_Sj9CPykssy0xPLMnMz0vMAfIjo8zifQxNHT2c3Q18_L38DA0czf38zE38fI3cTY30w1EVWAS7WwAVuHgaW4YZGhsYGOhHkaTf39DMCajA1CTQ1NIXqN-EsP4oVCVYXIBmBqYVYAUGOICjgX5BbmiEQaanIgBQekxl/p0/IZ7_L15AHCG0LOJN10A7NN74NM2GL0=CZ6_L15AHCG0LOJN10A7NN74NM2G52=MK5l6YvA=GF=/#Z7_L15AHCG0LOJN10A7NN74NM2GL0"
response2 = requests.get(url2)
response2.status_code

200

In [6]:
url3="https://www.tripadvisor.com/Restaurant_Review-g293995-d12408571-Reviews-Cipriani-Riyadh_Riyadh_Province.html"
response3 = requests.get(url3)
response3.status_code

200

In [7]:
url4="https://www.usnews.com/best-colleges/rankings/national-universities"
response4 = requests.get(url4)
response4.status_code

403

In [8]:
response5 = requests.get('https://foursquare.com/mazenat/list/restaurants-in-riyadh')
response5.status_code

200

In [9]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
            }
url6="https://www.usnews.com/best-colleges/rankings/national-universities"
response6 = requests.get(url6,headers=headers)
response6.status_code

403

In [None]:
# To find User-Agent strings to try (in Chrome):
# open the inspector window, then
# triple-dot menu > More tools > Network conditions

In [10]:
headers = {'User-Agent':"Mozilla/5.0 (BB10; Touch) AppleWebKit/537.1+ (KHTML, like Gecko) Version/10.0.0.1337 Mobile Safari/537.1+"}
url6="https://www.usnews.com/best-colleges/rankings/national-universities"
response6 = requests.get(url6,headers=headers)
response6.status_code

403
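A 200 means the request succeeded, while the 403 responses mean the server refused the request, and changing the User-Agent did not help for that site. As a small sketch, the standard library's `http.HTTPStatus` can turn a numeric code into a readable label. The function `describe_status` is a hypothetical helper, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for a standard HTTP status code."""
    families = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    family = families.get(code // 100, "other")
    # HTTPStatus(code) raises ValueError for codes it does not know
    return f"{code} {HTTPStatus(code).phrase} ({family})"


for code in (200, 403, 404, 500):
    print(describe_status(code))
```

On a `requests` response object, `response.ok` and `response.raise_for_status()` offer a similar check without any extra code.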

### Look at the Text

In [None]:
print(response.text)

### Soupify the Text

In [12]:
page = response.text

In [16]:
page[:200]

'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html lang="en">\n<head>\n<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1">\n<'

lxml is a library for processing XML and HTML in Python. Passing `"lxml"` as the second argument tells BeautifulSoup to use it as the parser that turns the raw HTML text into a tree of elements.

In [14]:
# if needed: conda install beautifulsoup4 lxml (in a terminal window)
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "lxml")

In [15]:
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
<title>The Big Lebowski (1998) - Box Office Mojo</title>
<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo" name="keywords"/>
<meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description"/>
<link cha

### Prettify the Soup

A web page can be thought of as a tree of elements: the `body` contains a few `div`s, and each of those `div`s can in turn contain more `div`s and other elements. A Soup object contains this tree. The `prettify()` method turns a Beautiful Soup tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line.

In [17]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
  <title>
   The Big Lebowski (1998) - Box Office Mojo
  </title>
  <style type="text/css">
   table.chart-wide { width: 100%; }
  </style>
  <meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo" name="keywords"/>
  <meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="d
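The tree idea is easier to see on a tiny document. The sketch below parses a made-up HTML string (not taken from Box Office Mojo) with the built-in `"html.parser"` so that it runs without lxml, then contrasts direct children with all descendants. The variable name `mini_soup` is chosen only to avoid overwriting `soup`.

```python
from bs4 import BeautifulSoup

# Made-up markup: one outer div holding two inner divs
html = (
    '<html><body><div id="outer">'
    "<div><p>first</p></div>"
    "<div><p>second</p></div>"
    "</div></body></html>"
)
mini_soup = BeautifulSoup(html, "html.parser")

outer = mini_soup.find("div", id="outer")

# .children yields only the direct children of a tag
child_tags = [child.name for child in outer.children]

# find_all searches every level below the tag
paragraphs = [p.text for p in outer.find_all("p")]

print(child_tags)
print(paragraphs)
print(mini_soup.prettify())
```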

## Beautiful Soup - Find & Find_All

### `soup.find()`

* `soup.find()` is the most common function we will use from this package.  
* Let's try out some common variations of `soup.find()`

* `soup.find()` returns the first matched tag it finds.
* It searches the entire tree.

* Search for a type of tag by using the tag as a string argument ('body','div','p','a')

In [19]:
type(soup)

bs4.BeautifulSoup

In [20]:
soup.find('a') # "a" tag is for hyperlink

<a href="/daily/chart/">Daily Box Office (Fri.)</a>

In [21]:
# Equivalently:
soup.a

<a href="/daily/chart/">Daily Box Office (Fri.)</a>

In [24]:
type(soup.a)

bs4.element.Tag

In [22]:
# Prettier:
print(soup.a.prettify())

<a href="/daily/chart/">
 Daily Box Office (Fri.)
</a>



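`findNextSibling()` returns the next tag at the same level of the tree, here the next link (`find_next_sibling()` is the current spelling of the same method):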

In [23]:
soup.find('a').findNextSibling()

<a href="/weekend/chart/">Weekend Box Office (Aug. 30–Sep. 1)</a>

In [28]:
soup.find('a',attrs={'target':"_blank"})

<a href="//bs.serving-sys.com/Serving/adServer.bs?cn=brd&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" target="_blank">
<img border="0" height="90" src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=display&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" width="728"/>
</a>

### `soup.find_all()`

`soup.find_all()` returns all matches in a list-like `ResultSet`.

In [25]:
len(soup.find_all('a'))

100

In [27]:
soup.find_all('a')[0]

<a href="/daily/chart/">Daily Box Office (Fri.)</a>

In [26]:
soup.find_all('a')[4]

<a href="//bs.serving-sys.com/Serving/adServer.bs?cn=brd&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" target="_blank">
<img border="0" height="90" src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=display&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" width="728"/>
</a>

In [30]:
type(soup.find_all('a'))

bs4.element.ResultSet

In [29]:
for link in soup.find_all('a'): 
    print(link)

<a href="/daily/chart/">Daily Box Office (Fri.)</a>
<a href="/weekend/chart/">Weekend Box Office (Aug. 30–Sep. 1)</a>
<a href="/movies/?id=angelhasfallen.htm">#1 Movie: 'Angel has Fallen'</a>
<a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a>
<a href="//bs.serving-sys.com/Serving/adServer.bs?cn=brd&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" target="_blank">
<img border="0" height="90" src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=display&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" width="728"/>
</a>
<a href="/"><img alt="Box Office Mojo" height="56" src="/img/misc/bom_logo1.png" width="245"/></a>
<a href="http://pro.imdb.com/signup/index.html?rf=mojo_nb_hm&amp;ref_=mojo_nb_hm" target="_blank">
<img alt="Get industry info at IMDbPro" height="20" src="/images/IMDbPro.png"/>
</a>
<a href="http://twitter.com/boxofficemojo" target="_blank">
<img alt="Follow us on Twitter" height="18" src="/images/glyphicons-social-32-twitter@2x.png"/>
</a>
<a href="http://f

In [31]:
[link for link in soup.find_all('a') if 'joelcoen' in str(link)]

[<a href="/people/chart/?view=Director&amp;id=joelcoen.htm">Joel Coen</a>,
 <a href="/people/chart/?view=Writer&amp;id=joelcoen.htm">Joel Coen</a>]
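One difference between the two methods is easy to miss: when nothing matches, `find()` returns `None` while `find_all()` returns an empty result. The sketch below shows this on a made-up snippet (the `example.com` link is illustrative), parsed with the built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

# Made-up markup with three links and no table
html = """
<div>
  <a href="/daily/chart/">Daily</a>
  <a href="/weekend/chart/">Weekend</a>
  <a href="https://example.com/ad" target="_blank">Ad</a>
</div>
"""
links_soup = BeautifulSoup(html, "html.parser")

first = links_soup.find("a")                              # first match only
all_links = links_soup.find_all("a")                      # every match
blank = links_soup.find("a", attrs={"target": "_blank"})  # filter on an attribute

missing_one = links_soup.find("table")       # no match: None
missing_all = links_soup.find_all("table")   # no match: empty result

hrefs = [a["href"] for a in all_links]
print(hrefs)
print(missing_one, len(missing_all))
```

Because of this, chaining onto `find()` (for example `soup.find('table').find_all('tr')`) raises an `AttributeError` when the first lookup finds nothing, so a `None` check is worth adding on pages whose layout may vary.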

## Beautiful Soup - More on Find

### `href` Example

In [33]:
soup.find('a')

<a href="/daily/chart/">Daily Box Office (Fri.)</a>

In [32]:
# retrieve the url from an anchor tag
soup.find('a')['href']

'/daily/chart/'

### `id` and `class` examples

* An attribute like id or class can be matched
* Example: 'mp_box_content' classes

In [35]:
soup.find_all(attrs={'id':'top_links'})

[<div id="top_links">
 <div style="float: left"><a href="/daily/chart/">Daily Box Office (Fri.)</a> | <a href="/weekend/chart/">Weekend Box Office (Aug. 30–Sep. 1)</a> | <a href="/movies/?id=angelhasfallen.htm">#1 Movie: 'Angel has Fallen'</a> | <a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a></div>
 <div style="float: right">Updated 9/7/2019 8:13 A.M. Pacific Time</div>
 <div style="clear:both; height: 0px"></div>
 </div>]

In [34]:
soup.find_all(id='top_links')

[<div id="top_links">
 <div style="float: left"><a href="/daily/chart/">Daily Box Office (Fri.)</a> | <a href="/weekend/chart/">Weekend Box Office (Aug. 30–Sep. 1)</a> | <a href="/movies/?id=angelhasfallen.htm">#1 Movie: 'Angel has Fallen'</a> | <a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a></div>
 <div style="float: right">Updated 9/7/2019 8:13 A.M. Pacific Time</div>
 <div style="clear:both; height: 0px"></div>
 </div>]

In [37]:
for element in soup.find_all(class_='mp_box_content'):
    print(element.prettify(), '\n')

<div class="mp_box_content">
 <table border="0" cellpadding="0" cellspacing="0">
  <tr>
   <td width="40%">
    <b>
     Domestic:
    </b>
   </td>
   <td align="right" width="35%">
    <b>
     $18,034,458
    </b>
   </td>
   <td align="right" width="25%">
    <b>
     38.6%
    </b>
   </td>
  </tr>
  <tr>
   <td width="40%">
    + Foreign:
   </td>
   <td align="right" width="35%">
    $28,690,764
   </td>
   <td align="right" width="25%">
    61.4%
   </td>
  </tr>
  <tr>
   <td colspan="3" width="100%">
    <hr/>
   </td>
  </tr>
  <tr>
   <td width="40%">
    =
    <b>
     Worldwide:
    </b>
   </td>
   <td align="right" width="35%">
    <b>
     $46,725,222
    </b>
   </td>
   <td width="25%">
   </td>
  </tr>
 </table>
</div>
 

<div class="mp_box_content">
 <table border="0" cellpadding="0" cellspacing="0">
  <tr>
   <td align="center">
    <a href="/weekend/chart/?yr=1998&amp;wknd=10&amp;p=.htm">
     Opening Weekend:
    </a>
   </td>
   <td>
    $5,533,844
   </td>
  <

In [38]:
len(soup.find_all(class_='mp_box_content'))

7
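The keyword is spelled `class_` because `class` is a reserved word in Python; `id` has no such problem. A small sketch on made-up markup (the extra class name `first` is hypothetical) shows that the keyword form and the `attrs` dictionary form find the same elements.

```python
from bs4 import BeautifulSoup

# Made-up markup reusing the id and class names from the page above
html = """
<div id="top_links">links</div>
<div class="mp_box_content first">Domestic</div>
<div class="mp_box_content">Opening</div>
"""
box_soup = BeautifulSoup(html, "html.parser")

by_id = box_soup.find(id="top_links")
by_class = box_soup.find_all(class_="mp_box_content")
by_attrs = box_soup.find_all(attrs={"class": "mp_box_content"})

print(by_id.text)
print([div.text for div in by_class])
print([div.text for div in by_attrs])
```

Note that an element with several classes still matches when any one of them equals the value searched for.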

## Beautiful Soup - Chaining Finds

All the fields in mp_box_content can be found by "chaining" a few `find_all` functions.

In [45]:
soup.find_all(class_='mp_box_content')[0]

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$18,034,458</b></td>
<td align="right" width="25%">   <b>38.6%</b></td>
</tr>
<tr>
<td width="40%">+ Foreign:</td>
<td align="right" width="35%"> $28,690,764</td>
<td align="right" width="25%">   61.4%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$46,725,222</b></td>
<td width="25%"> </td>
</tr>
</table>
</div>

In [43]:
soup.find_all(class_='mp_box_content')[0].find_all('td')

[<td width="40%"><b>Domestic:</b></td>,
 <td align="right" width="35%"> <b>$18,034,458</b></td>,
 <td align="right" width="25%">   <b>38.6%</b></td>,
 <td width="40%">+ Foreign:</td>,
 <td align="right" width="35%"> $28,690,764</td>,
 <td align="right" width="25%">   61.4%</td>,
 <td colspan="3" width="100%"><hr/></td>,
 <td width="40%">= <b>Worldwide:</b></td>,
 <td align="right" width="35%"> <b>$46,725,222</b></td>,
 <td width="25%"> </td>]

In [44]:
soup.find_all(class_='mp_box_content')[1].find_all('td')

[<td align="center"><a href="/weekend/chart/?yr=1998&amp;wknd=10&amp;p=.htm">Opening Weekend:</a></td>,
 <td> $5,533,844</td>,
 <td align="center" colspan="2"><font size="2">(#6 rank, 1,207 theaters, $4,585 average)</font></td>,
 <td align="right">% of Total Gross:</td>,
 <td> 31.7%</td>,
 <td align="right" colspan="2"><font face="Helvetica, Arial, Sans-Serif" size="1"><a href="/movies/?page=weekend&amp;id=biglebowski.htm"><b>&gt; View All 4 Weekends</b></a></font></td>,
 <td>Widest Release:</td>,
 <td> 1,235 theaters</td>]

In [46]:
# 'td' is for a cell in an HTML table
chain = [div_tag.find_all('td') for div_tag in soup.find_all(class_='mp_box_content')]

In [47]:
# for the first mp_box_content find all td's
chain[0]

[<td width="40%"><b>Domestic:</b></td>,
 <td align="right" width="35%"> <b>$18,034,458</b></td>,
 <td align="right" width="25%">   <b>38.6%</b></td>,
 <td width="40%">+ Foreign:</td>,
 <td align="right" width="35%"> $28,690,764</td>,
 <td align="right" width="25%">   61.4%</td>,
 <td colspan="3" width="100%"><hr/></td>,
 <td width="40%">= <b>Worldwide:</b></td>,
 <td align="right" width="35%"> <b>$46,725,222</b></td>,
 <td width="25%"> </td>]

To extract just the value of interest:

In [60]:
# Find the domestic gross. The '\xa0' is a Unicode non-breaking space
cost=soup.find(class_='mp_box_content').find_all('td')[1].text
cost

'\xa0$18,034,458'

In [None]:
# attrs={...} can match on any attribute, such as id or class

In [61]:
cost1=soup.find(attrs={"class":'mp_box_content'}).find_all('td')[1].text
cost1

'\xa0$18,034,458'

In [54]:
int(cost)

ValueError: invalid literal for int() with base 10: '\xa0$18,034,458'

In [57]:
cost.strip().replace('$','').replace(',','')

'18034458'

In [58]:
int(cost.strip().replace('$','').replace(',',''))

18034458

In [None]:
# The second td of the first box holds '\xa0$18,034,458'; slicing with [1:] drops the leading non-breaking space
soup.find(class_='mp_box_content').find_all('td')[1].text[1:] 
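The cleaning steps above can be wrapped in a small reusable function. This is a minimal sketch; `parse_money` is a hypothetical name. It relies on `strip()` rather than slicing with `[1:]`, since `strip()` also works when there is no leading space or more than one.

```python
def parse_money(text):
    """Convert a string such as '\\xa0$18,034,458' to an int."""
    # strip() treats the non-breaking space '\xa0' as whitespace
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


raw = "\xa0$18,034,458"
print(repr(raw.strip()))
print(parse_money(raw))
```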

## Let's Practice Web Scraping!

### Items to scrape for each movie:

* Movie Title
* Domestic Total Gross
* Runtime
* MPAA Rating
* Release Date

### Movie Title

In [62]:
soup.find('title')

<title>The Big Lebowski (1998) - Box Office Mojo</title>

In [63]:
soup.find('title').text

'The Big Lebowski (1998) - Box Office Mojo'

In [64]:
title_string = soup.find('title').text
title_string

'The Big Lebowski (1998) - Box Office Mojo'

In [65]:
title_string.split('(')

['The Big Lebowski ', '1998) - Box Office Mojo']

In [66]:
# .strip() removes the white spaces at the beginning and end of the string
title = title_string.split('(')[0].strip() 
title

'The Big Lebowski'

### Domestic Total Gross

Let's try to find the exact text.

In [106]:
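# This attempt raised the SyntaxError shown below: the lambda is a positional
# argument, so it cannot follow the keyword argument name='font'.
#soup.find_all(name='font',lambda tag: "Domestic Total Gross" in tag.text )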

SyntaxError: positional argument follows keyword argument (<ipython-input-106-26dff0a4e871>, line 1)

In [67]:
print(soup.find(text="Domestic Total Gross"))

None


The `text` argument matches the whole string exactly, so the trailing colon and space have to be included.

In [71]:
type(soup.find(text="Domestic Total Gross: "))

bs4.element.NavigableString

What if we don't want to be careful? [Regular expressions](https://xkcd.com/208/) to the rescue!

We will cover regular expressions in much more depth in the next week or two; they are a really powerful way to search for patterns in text. Today we use a very simple case: a "contains" search instead of an "exact match".

In [74]:
import re
domestic_total_regex = re.compile('Domestic Total')
domestic_total_regex

re.compile(r'Domestic Total', re.UNICODE)

In [75]:
dtg_string = soup.find(text=domestic_total_regex)
dtg_string

'Domestic Total Gross: '
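The difference between the two kinds of search can be reproduced with plain strings. The sketch below does not use BeautifulSoup at all; it only mimics the exact comparison and the regex "contains" check described above. The candidate strings are illustrative, and the last one is made up.

```python
import re

candidates = [
    "Domestic Total Gross: ",   # as it appears on the page, with colon and space
    "Domestic Total Gross",
    "Total Lifetime Grosses",   # made-up string that should not match
]

# Exact match: the whole string must be identical
exact = [s for s in candidates if s == "Domestic Total Gross"]

# Regex search: the pattern may appear anywhere in the string
pattern = re.compile("Domestic Total")
contains = [s for s in candidates if pattern.search(s)]

print(exact)
print(contains)
```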

In [76]:
dtg_string.findNextSibling()

<b>$17,451,873</b>

We found the domestic total gross! Now let's strip it down and convert it to an integer.

In [77]:
dtg = dtg_string.findNextSibling().text
print(dtg, type(dtg))

dtg = dtg.replace('$','').replace(',','')
print(dtg, type(dtg))

domestic_total_gross = int(dtg)
print(domestic_total_gross, type(domestic_total_gross))

$17,451,873 <class 'str'>
17451873 <class 'str'>
17451873 <class 'int'>


### Runtime, MPAA Rating & Release Date

#### Step 1: Create Function to Identify Values

Let's make a function to scrape multiple things, assuming the value will always follow the field name.

In [78]:
def get_movie_value(soup, field_name):
    
    '''Grab a value from boxofficemojo HTML
    
    Takes a string attribute of a movie on the page and returns the string in
    the next sibling object (the value for that attribute) or None if nothing is found.
    '''
    
    obj = soup.find(text=re.compile(field_name))
    
    if not obj: 
        return None
    
    # this works for most of the values
    next_sibling = obj.findNextSibling()
    
    if next_sibling:
        return next_sibling.text 
    else:
        return None

In [79]:
# domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print(dtg)

$17,451,873


In [80]:
# runtime
runtime = get_movie_value(soup,'Runtime')
print(runtime)

1 hrs. 57 min.


In [81]:
# rating
rating = get_movie_value(soup,'MPAA Rating')
print(rating)

R


In [82]:
release_date = get_movie_value(soup,'Release Date')
print(release_date)

March 6, 1998
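The same lookup can be tried offline on a small snippet. This sketch uses made-up markup that imitates the label-then-value layout, and it uses the newer spellings `string=` and `find_next_sibling()`, which Beautiful Soup 4 offers alongside `text=` and `findNextSibling()`. The name `get_value` is a hypothetical stand-in for `get_movie_value`.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: each label is followed by a <b> tag holding the value
html = (
    "<font>Runtime: <b>1 hrs. 57 min.</b></font>"
    "<font>MPAA Rating: <b>R</b></font>"
)
snippet = BeautifulSoup(html, "html.parser")


def get_value(soup, field_name):
    """Return the text of the tag that follows the first string matching field_name."""
    label = soup.find(string=re.compile(field_name))
    if label is None:
        return None
    sibling = label.find_next_sibling()
    return sibling.text if sibling is not None else None


print(get_value(snippet, "Runtime"))
print(get_value(snippet, "MPAA Rating"))
print(get_value(snippet, "Budget"))
```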


#### Step 2: Convert Values to Appropriate Data Types

In [83]:
import dateutil.parser

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

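def runtime_to_minutes(runtimestring):
    # expects a string like '1 hrs. 57 min.'
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (ValueError, IndexError):
        # unexpected format, e.g. missing or non-numeric parts
        return None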

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date
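As an alternative sketch that needs only the standard library, the runtime can be parsed with a regular expression and the date with `datetime.strptime`. The names `parse_runtime` and `parse_date` are hypothetical, and the format string assumes dates always look like `March 6, 1998` with English month names.

```python
import re
from datetime import datetime


def parse_runtime(text):
    """Convert a string like '1 hrs. 57 min.' to minutes, or None if it does not match."""
    match = re.search(r"(\d+)\s*hrs?\.?\s*(\d+)\s*min", text)
    if match is None:
        return None
    hours, minutes = (int(group) for group in match.groups())
    return hours * 60 + minutes


def parse_date(text):
    """Parse a date written as 'Month day, year'."""
    return datetime.strptime(text, "%B %d, %Y")


print(parse_runtime("1 hrs. 57 min."))
print(parse_date("March 6, 1998"))
```

`dateutil.parser.parse` is more forgiving about input formats, while `strptime` fails loudly when the format changes; which behavior is preferable depends on how much the scraped pages vary.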

#### Step 3: Apply the Conversions

In [84]:
# Let's get these again and format them all in one swoop

raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

raw_release_date = get_movie_value(soup,'Release Date')
release_date = to_date(raw_release_date)

print(domestic_total_gross, runtime, release_date)
print(type(domestic_total_gross), type(runtime), type(release_date))

17451873 117 1998-03-06 00:00:00
<class 'int'> <class 'int'> <class 'datetime.datetime'>


#### Step 4: Print It All Out

In [85]:
from pprint import pprint # pretty print

headers = ['movie title', 'domestic total gross',
           'runtime (mins)', 'rating', 'release date']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                runtime,
                                rating, 
                                release_date]))

movie_data.append(movie_dict)
pprint(movie_data)

[{'domestic total gross': 17451873,
  'movie title': 'The Big Lebowski',
  'rating': 'R',
  'release date': datetime.datetime(1998, 3, 6, 0, 0),
  'runtime (mins)': 117}]


## Table Scraping Example

### Step 1: Soupify the Website

Let's take a look at the foreign language page of Box Office Mojo. Let's say we wanted to pull all of the data from the main table on the page.

In [86]:
url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page,"lxml")

### Step 2: Find the Tables

In [87]:
tables = soup.find_all("table")
tables

[<table border="0" cellpadding="0" cellspacing="0">
 <tr>
 <form action="/adjuster.php" method="POST" name="adjuster">
 <input name="returnURL" type="hidden" value="/genres/chart/?id=foreign.htm"/>
 <td valign="center">
 <font face="Verdana" size="2"><a href="/about/adjuster.htm"><b>Adjuster:</b></a></font>
 <select name="ticketyr" size="1" style="font-family: Verdana; font-size: 10pt">
 <option selected="" value="0">Actuals</option>
 <option value="1">Est. Tckts</option>
 <script language="javascript">
   for(i=2019; i>=1933; i--) {
   	document.write('<option value="' + i + '"');
 	if(i=='0') document.write(' selected');
 	document.write('>' + i );
 	if(i=='0') document.write(', $' + '0.00');
 	document.write('</option>');
   }
 </script>
 <option value="1929">1929</option>
 <option value="1924">1924</option>
 <option value="1910">1910</option>
 </select><input name="Go" style="font-size: 10pt; height: 22" type="submit" value="Go"/>
 </td></form></tr></table>,
 <table border="0" cell

### Step 3: Pull Just the Rows

In [88]:
rows = [row for row in tables[3].find_all('tr')] # tr tag is for rows

In [89]:
# let's take a look at one row
rows[0]

<tr bgcolor="#dcdcdc"><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=rank&amp;order=ASC&amp;p=.htm">Rank</a></font></td><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=title&amp;order=ASC&amp;p=.htm">Title (click to view)</a></font></td><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=studio&amp;order=ASC&amp;p=.htm">Studio</a></font></td><td align="center" colspan="2"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=gross&amp;order=ASC&amp;p=.htm"><b>Lifetime Gross</b></a> / </font><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=maxtheaters&amp;order=DESC&amp;p=.htm">Theaters</a></font></td><td align="center" colspan="2"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=opengross&amp;order=DESC&amp;p=.htm">Opening</a> / </font><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=opentheaters&amp;order=DESC&amp;p=.htm">Theaters</a></f

In [93]:
# let's take a look at the cells (td tags) in one data row
rows[1].find_all('td')

[<td align="center"><font size="2">1</font></td>,
 <td><font size="2"><a href="/movies/?id=crouchingtigerhiddendragon.htm"><b>Crouching Tiger, Hidden Dragon</b></a><br/>(Taiwan)</font></td>,
 <td><font size="2"><a href="/studio/chart/?studio=sonyclassics.htm">SPC</a></font></td>,
 <td align="right"><font size="2"><b>$128,078,872</b></font></td>,
 <td align="right"><font size="2">2,027</font></td>,
 <td align="right"><font size="2">$663,205</font></td>,
 <td align="right"><font size="2">16</font></td>,
 <td align="right"><font size="2"><a href="/schedule/?view=bydate&amp;release=theatrical&amp;date=2000-12-08&amp;p=.htm">12/8/00</a></font></td>]

### Step 4: Pull All Values

In [94]:
rows[1].find_all('td')[1].find('a')['href']

'/movies/?id=crouchingtigerhiddendragon.htm'

In [95]:
rows = rows[1:21] # let's just look at the first 20 rows for now
movies = {}

for row in rows:
    items = row.find_all('td')
    title = items[1].find('a')['href']
    movies[title] = [i.text for i in items[1:]]
    
list(movies.items())[0]

('/movies/?id=crouchingtigerhiddendragon.htm',
 ['Crouching Tiger, Hidden Dragon(Taiwan)',
  'SPC',
  '$128,078,872',
  '2,027',
  '$663,205',
  '16',
  '12/8/00'])
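The row-by-row pattern can also produce one dictionary per movie, keyed by the header row. The sketch below runs on a small made-up table that imitates the layout above; the second `href` is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up table: a header row followed by two data rows
html = """
<table>
  <tr><td>Rank</td><td>Title</td><td>Studio</td><td>Lifetime Gross</td></tr>
  <tr><td>1</td><td><a href="/movies/?id=crouchingtigerhiddendragon.htm">Crouching Tiger, Hidden Dragon</a></td><td>SPC</td><td>$128,078,872</td></tr>
  <tr><td>2</td><td><a href="/movies/?id=example.htm">Life Is Beautiful</a></td><td>Mira.</td><td>$57,563,264</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# Split the header row from the data rows
header, *body = table.find_all("tr")
columns = [td.get_text(strip=True) for td in header.find_all("td")]

records = []
for row in body:
    cells = row.find_all("td")
    # Pair each column name with the text of the matching cell
    record = dict(zip(columns, (td.get_text(strip=True) for td in cells)))
    record["href"] = cells[1].find("a")["href"]
    records.append(record)

print(records[0])
```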

### Step 5: Pandas Alternative

In [96]:
# you can also use pandas to read tables
import pandas as pd

url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

In [97]:
tables = pd.read_html(url)

In [103]:
print(type(tables),len(tables))

<class 'list'> 3


In [100]:
tables[1]
# how can you fix the header?

Unnamed: 0,0,1,2,3,4,5,6,7
0,Rank,Title (click to view),Studio,Lifetime Gross / Theaters,Lifetime Gross / Theaters,Opening / Theaters,Opening / Theaters,Date
1,1,"Crouching Tiger, Hidden Dragon(Taiwan)",SPC,"$128,078,872",2027,"$663,205",16,12/8/00
2,2,Life Is Beautiful(Italy),Mira.,"$57,563,264",1136,"$118,920",6,10/23/98
3,3,Hero(China),Mira.,"$53,710,019",2175,"$18,004,319",2031,8/27/04
4,4,Instructions Not Included,LGF,"$44,467,206",978,"$7,846,426",348,8/30/13
5,5,Pan's Labyrinth(Mexico),PicH,"$37,634,615",1143,"$568,641",17,12/29/06
6,6,Amelie(France),Mira.,"$33,225,499",303,"$136,470",3,11/2/01
7,7,Jet Li's Fearless(China),Rog.,"$24,633,730",1810,"$10,590,244",1806,9/22/06
8,8,Il Postino(Italy),Mira.,"$21,848,932",430,"$95,310",10,6/16/95
9,9,Like Water for Chocolate(Mexico),Mira.,"$21,665,468",64,"$23,600",2,2/19/93


In [104]:
tables[1].iloc[0,:]

0                         Rank
1        Title (click to view)
2                       Studio
3    Lifetime Gross / Theaters
4    Lifetime Gross / Theaters
5           Opening / Theaters
6           Opening / Theaters
7                         Date
Name: 0, dtype: object
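One possible answer to the header question is to promote the first row to column names and drop it from the data. The sketch below demonstrates this on a small hand-built DataFrame with a shortened set of columns rather than on the live table.

```python
import pandas as pd

# Hand-built stand-in for a table whose real header landed in row 0
raw = pd.DataFrame([
    ["Rank", "Title", "Studio"],
    ["1", "Crouching Tiger, Hidden Dragon(Taiwan)", "SPC"],
    ["2", "Life Is Beautiful(Italy)", "Mira."],
])

# Drop the header row from the data, then use its values as column names
fixed = raw.iloc[1:].reset_index(drop=True)
fixed.columns = raw.iloc[0].tolist()

print(fixed)
```

Passing `header=0` to `pd.read_html` may achieve the same result at read time, depending on how the page's table is marked up.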

In [99]:
tables[2][0:5]

Unnamed: 0,0,1,2
0,Title (click to view),Studio,Release Date
1,Dus Khaniyan(India),Eros,12/7/07
2,Britt-Marie Was Here,Cohen,9/20/19
3,The Golden Glove,Strand,9/27/19
4,War (2019),Yash,10/2/19


Conclusion: Beautiful Soup is powerful, but it only parses the HTML it is given. If a page requires interaction (such as entering a password), or if its content is generated dynamically rather than served as static HTML, Beautiful Soup alone is not enough. We will explore other tools for that.

One such tool for tomorrow: *Selenium*