# Web Data Scraping

[Spring 2021 ITSS Mini-Course](https://www.colorado.edu/cartss/programs/interdisciplinary-training-social-sciences-itss/mini-course-web-data-scraping) — ARSC 5040  
[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Class outline

* **Week 1**: Introduction to Jupyter, browser console, structured data, ethical considerations
* **Week 2**: Scraping HTML with `requests` and `BeautifulSoup`
* **Week 3**: Scraping web data with Selenium
* **Week 4**: Scraping an API with `requests` and `json`, Wikipedia and Reddit
* **Week 5**: Scraping data from Twitter

## Acknowledgements

This course will draw on resources built by myself and [Allison Morgan](https://allisonmorgan.github.io/) for the [2018 Summer Institute for Computational Social Science](https://github.com/allisonmorgan/sicss_boulder), which were in turn derived from [other resources](https://github.com/simonmunzert/web-scraping-with-r-extended-edition) developed by [Simon Munzert](http://simonmunzert.github.io/) and [Chris Bail](http://www.chrisbail.net/). 

Thank you also to Professor [Terra McKinnish](https://www.colorado.edu/economics/people/faculty/terra-mckinnish) for coordinating the ITSS seminars.

## Class 2 goals

* Sharing accomplishments and challenges with last week's material
* Parsing HTML data into tabular data
* Writing your own parser
* Traversing directories vs. parsing targets to retrieve data
* Applying techniques and debugging for individual projects

## Sharing accomplishments and challenges

* Using the inspect tool
* Counting numbers of members of U.S. House with XML
* Parsing information out from Twitter's JSON payload

## Parsing HTML data into tabular data

The overall goal we have as researchers in scraping data from the web is converting data from one structured format (HTML's tree-like structures) into another structured format (probably a tabular structure with rows and columns). 
This could range from simply reading tables out of a webpage all the way up to converting irregularly structured HTML elements into a tabular format. 

We are going to make some use of the [`pandas`](https://pandas.pydata.org/) library ("**pan**el **da**ta", not the cute animal), which is Python's implementation of a data frame concept. This is a very powerful and complex library that I typically spend more than 12 hours of lecture teaching in intermediate programming classes. I hope to convey some important elements as we work through material, but it is far beyond the scope of this class to be able to cover all the fundamentals and syntax. 

Let's begin by importing the libraries we'll need in this notebook: `requests`, `BeautifulSoup`, and `pandas`.

In [1]:
# Most straightforward way to import a library in Python
import requests

# BeautifulSoup is a class inside the "bs4" library; we import only that class
from bs4 import BeautifulSoup

# We import pandas but give the library a shortcut alias "pd" since we will call its functions so much
import pandas as pd

### Reading an HTML table into Python

[The Numbers](http://www.the-numbers.com) is a popular source of data about movies' box office revenue numbers. Their daily domestic charts are HTML tables with the top-grossing movies for each day of the year, going back for several years. This [table](https://www.the-numbers.com/box-office-chart/daily/2018/12/25) for Christmas day in 2018 has columns for the current week's ranking, previous week's ranking, name of movie, distributor, gross, change over the previous week, number of theaters, revenue per theater, total gross, and number of days since release. This looks like a fairly straightforward table that could be read directly into a data frame-like structure.

Using the Inspect tool, we can see the table exists as a `<table border="0" ... align="CENTER">` element with child tags like `<tbody>` and `<tr>` (table row). Each `<tr>` has `<td>` tags which define each of the cells and their content. For more on how HTML defines tables, check out [this tutorial](https://www.w3schools.com/html/html_tables.asp).

Using `requests` and `BeautifulSoup` we would get this webpage's HTML, turn it into soup, and then find the table (`<table>`) or the table rows (`<tr>`) and pull out their content.

In [4]:
# Make the request
xmas_bo_raw = requests.get('https://www.the-numbers.com/box-office-chart/daily/2018/12/25').text

# Turn into soup, specify the HTML parser
xmas_bo_soup = BeautifulSoup(xmas_bo_raw,'html.parser')

# Use .find_all to retrieve all the tables in the page
xmas_bo_tables = xmas_bo_soup.find_all('table')

It turns out there are two tables on the page. The first is a baby table consisting of the "Previous Chart", "Chart Index", and "Next Chart" at the top. We want the second table with all the data: `xmas_bo_tables[1]` returns the second table (remember that Python is 0-indexed, so the first table is at `xmas_bo_tables[0]`). With this table identified, we can do a second `find_all` to get the table rows inside it and save them as `xmas_bo_trs`.

In [22]:
xmas_bo_trs = xmas_bo_tables[1].find_all('tr')

Let's inspect a few of these rows. The first row in our list of rows under `xmas_bo_trs` should be the header with the names of the columns.

In [23]:
xmas_bo_trs[0]

<tr><th> </th><th> </th><th>Movie</th><th>Distributor</th><th>Gross</th><th>Change</th><th>Thtrs.</th><th>Per Thtr.</th><th>Total Gross</th><th>Days</th></tr>
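Since the header row holds the column names in `<th>` tags, one option is to pull the names out programmatically rather than typing them by hand later. The following is a minimal sketch under the assumption that the header markup looks like the row above; `header_html` is a hard-coded stand-in for `xmas_bo_trs[0]` so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Stand-in for xmas_bo_trs[0]; the markup mirrors the header row shown above
header_html = ('<tr><th> </th><th> </th><th>Movie</th><th>Distributor</th>'
               '<th>Gross</th><th>Change</th><th>Thtrs.</th><th>Per Thtr.</th>'
               '<th>Total Gross</th><th>Days</th></tr>')
header_row = BeautifulSoup(header_html, 'html.parser').find('tr')

# get_text(strip=True) trims surrounding whitespace, so blank headers become ''
column_names = [th.get_text(strip=True) for th in header_row.find_all('th')]
print(column_names)
```

The two blank names correspond to the rank columns, which the site leaves unlabeled, so they would still need to be filled in manually.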

The next table row should be for Aquaman.

In [24]:
xmas_bo_trs[1]

<tr>
<td class="data">1</td>
<td class="data">(1)</td>
<td><b><a href="/movie/Aquaman-(2018)#tab=box-office">Aquaman</a></b></td>
<td><a href="/market/distributor/Warner-Bros">Warner Bros.</a></td>
<td class="data">$21,982,419</td>
<td class="data chart_up">+103%</td>
<td class="data">4,125</td>
<td class="data chart_grey">$5,329</td>
<td class="data">  $105,407,869</td>
<td class="data">5</td>
</tr>

If we wanted to access the contents of this table row, we could use the `.contents` attribute to get a list of each of the `<td>` table cells, which (frustratingly) intersperses newline characters.

In [31]:
xmas_bo_trs[1].contents

['\n',
 <td class="data">1</td>,
 '\n',
 <td class="data">(1)</td>,
 '\n',
 <td><b><a href="/movie/Aquaman-(2018)#tab=box-office">Aquaman</a></b></td>,
 '\n',
 <td><a href="/market/distributor/Warner-Bros">Warner Bros.</a></td>,
 '\n',
 <td class="data">$21,982,419</td>,
 '\n',
 <td class="data chart_up">+103%</td>,
 '\n',
 <td class="data">4,125</td>,
 '\n',
 <td class="data chart_grey">$5,329</td>,
 '\n',
 <td class="data">  $105,407,869</td>,
 '\n',
 <td class="data">5</td>,
 '\n']
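If the interspersed newlines get in the way, one possible workaround is to keep only the children that are actual tags. This is a sketch rather than the approach this notebook goes on to use; `row_html` is a shortened, hard-coded stand-in for the Aquaman row.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for xmas_bo_trs[1]
row_html = """<tr>
<td class="data">1</td>
<td class="data">(1)</td>
<td><b><a href="/movie/Aquaman-(2018)#tab=box-office">Aquaman</a></b></td>
</tr>"""
row = BeautifulSoup(row_html, 'html.parser').find('tr')

# Keep only children that are tags; the '\n' entries are strings, not Tag objects
cells = [child for child in row.contents if isinstance(child, Tag)]

# Pull the text out of each remaining cell
cell_text = [cell.get_text(strip=True) for cell in cells]
print(cell_text)
```

Calling `row.find_all('td')` would give the same cells here; filtering `.contents` is mainly useful when you want direct children of mixed tag types.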

Another alternative is to use the `.text` attribute to get the text content of all the cells in this row.

In [32]:
xmas_bo_trs[1].text

'\n1\n(1)\nAquaman\nWarner Bros.\n$21,982,419\n+103%\n4,125\n$5,329\n\xa0\xa0$105,407,869\n5\n'

The `\n` characters re-appear here, but if we `print` this string, we see them rendered as line breaks.

In [33]:
print(xmas_bo_trs[1].text)


1
(1)
Aquaman
Warner Bros.
$21,982,419
+103%
4,125
$5,329
  $105,407,869
5



We could use string processing to take this text string and convert it into a simple list of data. `.split('\n')` will split the string on the newline characters and return a list of what exists in between.

In [34]:
xmas_bo_trs[1].text.split('\n')

['',
 '1',
 '(1)',
 'Aquaman',
 'Warner Bros.',
 '$21,982,419',
 '+103%',
 '4,125',
 '$5,329',
 '\xa0\xa0$105,407,869',
 '5',
 '']

We'll write a `for` loop to go through all the table rows in `xmas_bo_trs`, get the list of data from the row, and add it back to a list of all the rows.

In [41]:
cleaned_xmas_bo_rows = []

# Loop through all the non-header (first row) table rows
for row in xmas_bo_trs[1:]:
    
    # Get the text of the row and split on the newlines (like above)
    cleaned_row = row.text.split('\n')
    
    # Add this cleaned row back to the external list of row data
    cleaned_xmas_bo_rows.append(cleaned_row)
    
# Inspect the first few rows of data
cleaned_xmas_bo_rows[:2]

[['',
  '1',
  '(1)',
  'Aquaman',
  'Warner Bros.',
  '$21,982,419',
  '+103%',
  '4,125',
  '$5,329',
  '\xa0\xa0$105,407,869',
  '5',
  ''],
 ['',
  '2',
  '(2)',
  'Mary Poppins Returns',
  'Walt Disney',
  '$11,457,469',
  '+86%',
  '4,090',
  '$2,801',
  '\xa0\xa0$49,946,455',
  '7',
  '']]

Now we can pass this list of lists in `cleaned_xmas_bo_rows` to pandas's `DataFrame` function and hopefully get a nice table out.

In [43]:
xmas_bo_df = pd.DataFrame(cleaned_xmas_bo_rows)

# Inspect
xmas_bo_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,,1,(1),Aquaman,Warner Bros.,"$21,982,419",+103%,4125,"$5,329","$105,407,869",5,
1,,2,(2),Mary Poppins Returns,Walt Disney,"$11,457,469",+86%,4090,"$2,801","$49,946,455",7,
2,,3,(3),Bumblebee,Paramount Pictures,"$8,887,978",+139%,3550,"$2,504","$34,253,863",5,
3,,4,new,Holmes & Watson,Sony Pictures,"$6,434,922",,2719,"$2,367","$6,434,922",1,
4,,5,(4),Spider-Man: Into The Spider…,Sony Pictures,"$5,630,385",+68%,3813,"$1,477","$73,543,868",12,


We need to do a bit of cleanup on this data:

* Columns 0 and 11 are all empty
* Add column names

In [45]:
# Drop columns 0 and 11 and overwrite the xmas_bo_df variable
xmas_bo_df = xmas_bo_df.drop(columns=[0,11])

# Rename the columns
xmas_bo_df.columns = ['Rank','Last rank','Movie','Distributor','Gross',
                      'Change','Theaters','Per theater','Total gross',
                      'Days']

# Write to disk
# xmas_bo_df.to_csv('christmas_2018_box_office.csv',encoding='utf8')

# Inspect
xmas_bo_df.head()

Unnamed: 0,Rank,Last rank,Movie,Distributor,Gross,Change,Theaters,Per theater,Total gross,Days
0,1,(1),Aquaman,Warner Bros.,"$21,982,419",+103%,4125,"$5,329","$105,407,869",5
1,2,(2),Mary Poppins Returns,Walt Disney,"$11,457,469",+86%,4090,"$2,801","$49,946,455",7
2,3,(3),Bumblebee,Paramount Pictures,"$8,887,978",+139%,3550,"$2,504","$34,253,863",5
3,4,new,Holmes & Watson,Sony Pictures,"$6,434,922",,2719,"$2,367","$6,434,922",1
4,5,(4),Spider-Man: Into The Spider…,Sony Pictures,"$5,630,385",+68%,3813,"$1,477","$73,543,868",12
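Splitting on newlines works here, but it leaves empty edge columns and non-breaking spaces (`\xa0`) behind. A hedged alternative is to build each row cell by cell, which keeps genuinely empty cells (like the missing "Change" for a new release) in their own column. The sketch below uses a small hard-coded table, `table_html`, as a stand-in for the real page, with only four of the ten columns.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hard-coded stand-in for the box office table (four columns only)
table_html = """<table>
<tr><th>Rank</th><th>Movie</th><th>Change</th><th>Total Gross</th></tr>
<tr>
<td class="data">1</td>
<td><b><a href="#">Aquaman</a></b></td>
<td class="data chart_up">+103%</td>
<td class="data">&nbsp;&nbsp;$105,407,869</td>
</tr>
<tr>
<td class="data">4</td>
<td><b><a href="#">Holmes &amp; Watson</a></b></td>
<td class="data"></td>
<td class="data">&nbsp;&nbsp;$6,434,922</td>
</tr>
</table>"""
soup = BeautifulSoup(table_html, 'html.parser')

rows = []
for tr in soup.find_all('tr')[1:]:  # skip the header row
    # One entry per <td>, so an empty cell stays an empty string in its own column
    rows.append([td.get_text(strip=True) for td in tr.find_all('td')])

df = pd.DataFrame(rows, columns=['Rank', 'Movie', 'Change', 'Total gross'])
print(df)
```

Because `get_text(strip=True)` strips leading and trailing whitespace, including the non-breaking spaces, no columns need to be dropped afterwards.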


### `pandas`'s `read_html`
That was a good amount of work just to get this simple HTML table into Python. But it was important to cover how table elements moved from a string in `requests`, into a soup object from `BeautifulSoup`, into a list of data, and finally into `pandas`. 

`pandas` also has powerful functionality for reading tables directly from HTML. If we convert the soup of the second table (`xmas_bo_tables[1]`) back into a string, `pandas` can read it directly into a table. 

There are a few idiosyncrasies here. The result is a list of DataFrames, even if there's only a single table, so we need to select the first (and only) element of this list. This is why there's a `[0]` at the end; the `.head()` is just to show the first five rows.

In [48]:
xmas_bo_table_as_string = str(xmas_bo_tables[1])

pd.read_html(xmas_bo_table_as_string)[0].head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,,,Movie,Distributor,Gross,Change,Thtrs.,Per Thtr.,Total Gross,Days
1,1.0,(1),Aquaman,Warner Bros.,"$21,982,419",+103%,4125,"$5,329","$105,407,869",5
2,2.0,(2),Mary Poppins Returns,Walt Disney,"$11,457,469",+86%,4090,"$2,801","$49,946,455",7
3,3.0,(3),Bumblebee,Paramount Pictures,"$8,887,978",+139%,3550,"$2,504","$34,253,863",5
4,4.0,new,Holmes & Watson,Sony Pictures,"$6,434,922",,2719,"$2,367","$6,434,922",1


The column names got lumped in as rows, but we can fix this as well with the `read_html` function by passing the row index where the column names live. In this case, it is the first row, so we pass `header=0`.

In [49]:
pd.read_html(xmas_bo_table_as_string,header=0)[0].head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Movie,Distributor,Gross,Change,Thtrs.,Per Thtr.,Total Gross,Days
0,1,(1),Aquaman,Warner Bros.,"$21,982,419",+103%,4125,"$5,329","$105,407,869",5
1,2,(2),Mary Poppins Returns,Walt Disney,"$11,457,469",+86%,4090,"$2,801","$49,946,455",7
2,3,(3),Bumblebee,Paramount Pictures,"$8,887,978",+139%,3550,"$2,504","$34,253,863",5
3,4,new,Holmes & Watson,Sony Pictures,"$6,434,922",,2719,"$2,367","$6,434,922",1
4,5,(4),Spider-Man: Into The Spider…,Sony Pictures,"$5,630,385",+68%,3813,"$1,477","$73,543,868",12


Finally, you can point `read_html` at a URL without any `requests` or `BeautifulSoup` and get all the tables on the page as a list of DataFrames; `pandas` handles the fetching and parsing internally. Interestingly, I'm getting an [HTTP 403](https://en.wikipedia.org/wiki/HTTP_403) error indicating the server (The Numbers) is forbidding the client (us) from accessing their data using this strategy. We will discuss next week whether and how to handle situations where web servers refuse connections from non-human clients. In this case, you cannot use the off-the-shelf `read_html` approach and would need to revert to using the `requests`+`BeautifulSoup` approach above.

In [56]:
simple_tables = pd.read_html('https://www.the-numbers.com/box-office-chart/daily/2018/12/25')
simple_tables

HTTPError: HTTP Error 403: Forbidden

If we point it at Wikipedia's [2018 in film](https://en.wikipedia.org/wiki/2018_in_film), it will pull all of the tables present on the page.

In [58]:
simple_tables = pd.read_html('https://en.wikipedia.org/wiki/2018_in_film')

The first three correspond to the "Year in film" navigation box on the side and are poorly-formatted by default.

In [59]:
simple_tables[0]

Unnamed: 0,0,1,2,3
0,List of years in film (table),,List of years in film,(table)
1,,List of years in film,(table),
2,... 2008 2009 2010 2011 2012 2013 2014 ... 201...,,,
3,.mw-parser-output .nobold{font-weight:normal}I...,,,
4,Art Archaeology Architecture Literature Music ...,,,
5,,List of years in film,(table),


The fourth table in the `simple_tables` list we got from parsing the Wikipedia page with `read_html` is the table under the "Highest-grossing films" section.

In [64]:
simple_tables[3]

Unnamed: 0,0,1,2,3
0,Rank,Title,Distributor,Worldwide gross
1,1,Avengers: Infinity War,Disney,"$2,048,359,754"
2,2,Black Panther,"$1,346,913,161",
3,3,Jurassic World: Fallen Kingdom,Universal,"$1,309,484,461"
4,4,Incredibles 2,Disney,"$1,242,770,554"
5,5,Aquaman,Warner Bros.,"$1,066,976,848"
6,6,Venom,Sony,"$855,789,419"
7,7,Bohemian Rhapsody,20th Century Fox,"$798,453,776"
8,8,Mission: Impossible – Fallout,Paramount,"$791,115,104"
9,9,Deadpool 2,20th Century Fox,"$743,879,037"
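Picking a table by its position in the list can be fragile if the page layout changes. As a hedged alternative, `read_html` accepts a `match` argument that keeps only tables containing text matching a pattern. The sketch below uses a made-up two-table page in `page_html` rather than the live Wikipedia page, and assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO
import pandas as pd

# Made-up page with a navigation-style table and a data table
page_html = """
<table><tr><th>Year</th><th>Link</th></tr>
<tr><td>2017</td><td>List of years in film</td></tr></table>
<table><tr><th>Rank</th><th>Title</th><th>Worldwide gross</th></tr>
<tr><td>1</td><td>Avengers: Infinity War</td><td>$2,048,359,754</td></tr></table>
"""

# Keep only tables whose text matches the pattern; the result is still a list
matching = pd.read_html(StringIO(page_html), match='Worldwide gross', header=0)
top_grossing = matching[0]

print(len(matching))
print(top_grossing)
```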


You can pass the "header" option in `read_html` to make sure the column names from a particular row (in this case the first row) do not accidentally become rows of data. 

In [69]:
wiki_top_grossing_t = pd.read_html('https://en.wikipedia.org/wiki/2018_in_film',header=0)[3]
wiki_top_grossing_t

Unnamed: 0,Rank,Title,Distributor,Worldwide gross
0,1,Avengers: Infinity War,Disney,"$2,048,359,754"
1,2,Black Panther,"$1,346,913,161",
2,3,Jurassic World: Fallen Kingdom,Universal,"$1,309,484,461"
3,4,Incredibles 2,Disney,"$1,242,770,554"
4,5,Aquaman,Warner Bros.,"$1,066,976,848"
5,6,Venom,Sony,"$855,789,419"
6,7,Bohemian Rhapsody,20th Century Fox,"$798,453,776"
7,8,Mission: Impossible – Fallout,Paramount,"$791,115,104"
8,9,Deadpool 2,20th Century Fox,"$743,879,037"
9,10,Fantastic Beasts: The Crimes of Grindelwald,Warner Bros.,"$650,320,988"


Note that there are still a few errors in this table because the "Disney" value in the Wikipedia table spans two rows and `read_html` thus skips the "Distributor" value for Black Panther. 

In [70]:
# Copy the value at index position 1, column position Distributor to Worldwide gross
wiki_top_grossing_t.loc[1,'Worldwide gross'] = wiki_top_grossing_t.loc[1,'Distributor']

# Change the value at 1, Distributor to Disney
wiki_top_grossing_t.loc[1,'Distributor'] = 'Disney'

wiki_top_grossing_t

Unnamed: 0,Rank,Title,Distributor,Worldwide gross
0,1,Avengers: Infinity War,Disney,"$2,048,359,754"
1,2,Black Panther,Disney,"$1,346,913,161"
2,3,Jurassic World: Fallen Kingdom,Universal,"$1,309,484,461"
3,4,Incredibles 2,Disney,"$1,242,770,554"
4,5,Aquaman,Warner Bros.,"$1,066,976,848"
5,6,Venom,Sony,"$855,789,419"
6,7,Bohemian Rhapsody,20th Century Fox,"$798,453,776"
7,8,Mission: Impossible – Fallout,Paramount,"$791,115,104"
8,9,Deadpool 2,20th Century Fox,"$743,879,037"
9,10,Fantastic Beasts: The Crimes of Grindelwald,Warner Bros.,"$650,320,988"
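Fixing one row by hand is fine for a single table, but the same shift could occur in several rows. The following sketch generalizes the repair under a loudly stated assumption: a shifted row is one whose last column is missing, and its true distributor is the one in the row above. The `films` DataFrame is a hard-coded three-row stand-in for `wiki_top_grossing_t`.

```python
import pandas as pd

# Hard-coded stand-in with one row shifted left by a spanning cell
films = pd.DataFrame({
    'Title': ['Avengers: Infinity War', 'Black Panther', 'Jurassic World: Fallen Kingdom'],
    'Distributor': ['Disney', '$1,346,913,161', 'Universal'],
    'Worldwide gross': ['$2,048,359,754', None, '$1,309,484,461'],
})

# Assumption: a shifted row is one whose last column is missing
shifted = films['Worldwide gross'].isna()

# Move the misplaced gross into the correct column
films.loc[shifted, 'Worldwide gross'] = films.loc[shifted, 'Distributor']

# Blank the misplaced distributor, then carry the value from the row above downward
films['Distributor'] = films['Distributor'].mask(shifted).ffill()

print(films)
```

This would give wrong results if a gross value were missing for some other reason, so it is worth inspecting the flagged rows before overwriting anything.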


## Writing your own parser

We will return to the historical Oscars data. Even though data as prominent as this is likely to already exist in tabular format somewhere, we will maintain the illusion that we are the first to both scrape it and parse it into a tabular format. Our goal here is to write a parser that will (ideally) work across multiple pages; in this case, each of the award years.

One of the first things we should do before writing any code is come up with a model of what we want our data to look like at the end of this. This is an intuitive and "tidy" format, but you might come up with alternatives based on your analysis and modeling needs.

| *Year* | *Category* | *Nominee* | *Movie* | *Won* |
| --- | --- | --- | --- | --- |
| 2019 | Actor in a leading role | Christian Bale | Vice | NA |
| 2019 | Actor in a leading role | Bradley Cooper | A Star Is Born | NA |
| 2019 | Actor in a leading role | Willem Dafoe | At Eternity's Gate | NA |
| 2019 | Actor in a leading role | Rami Malek | Bohemian Rhapsody | NA |
| 2019 | Actor in a leading role | Viggo Mortensen | Green Book | NA |

We will begin with writing a parser for a (hopefully!) representative year, then scrape the data for all the years, then apply the parser to each of those years, and finally combine all the years' data together into a large data set. 

For the representative year, 2019 is actually not a great case: it is missing information about who won and lost because (at the time of my writing this notebook) the winners had not been announced. We will use 2018 instead and make the profoundly naïve assumption that the same parser should work going back in time.

Start off with using `requests` to get the data and then use `BeautifulSoup` to turn it into soup we can parse through.

In [71]:
oscars2018_raw = requests.get('https://www.oscars.org/oscars/ceremonies/2018').text

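# Turn into soup, specifying the HTML parser explicitly as we did for the box office page
oscars2018_soup = BeautifulSoup(oscars2018_raw,'html.parser')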

Using the Inspect tool exercise from Class 1, the `<div class="view-grouping">` seems to be the most promising tag for us to extract. Use `.find_all('div',{'class':'view-grouping'})` to (hopefully!) get all of these award groups. Inspect the first and last ones to make sure they look coherent.

In [73]:
# Get all the groups that have a <div class="view-grouping"> tag
oscars2018_groups = oscars2018_soup.find_all('div',{'class':'view-grouping'})

# Inspect the first one
oscars2018_groups[0]

<div class="view-grouping"><div class="view-grouping-header"><h2>Actor in a Leading Role</h2></div><div class="view-grouping-content"> <h3><span class="golden-text">Winner</span></h3>
<div class="views-row views-row-1 views-row-odd views-row-first views-row-last">
<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Gary Oldman</h4> </div>
<div class="views-field views-field-title"> <span class="field-content">Darkest Hour
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>
<h3>Nominees</h3>
<div class="views-row views-row-1 views-row-odd views-row-first">
<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Timothée Chalamet</h4> </div>
<div class="views-field views-field-title"> <span class="field-content">Call Me by Your Name
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>
<div class="views-row views-row

The last group is something besides "Writing (Original Screenplay)" and it's not clear to me where this tag's content renders on the page.

In [78]:
# Inspect the last one
oscars2018_groups[-1]

<div class="view-grouping"><div class="view-grouping-header">Wonder</div><span class="label">1 Nomination</span><div class="view-grouping-content"> <div>
<div class="views-field views-field-title-field"> <div class="field-content">Makeup and Hairstyling - Arjen Tuiten</div> </div> </div>
</div></div>

This puts us into something of a bind going forward: if the `.find_all` returns more groupings than we expected, then it's not sufficiently precise to identify *only* groupings of nominees. However, there do not appear to be any child tags in the `oscars2018_groups[0]` grouping that uniquely differentiate them from the child tags present in the `oscars2018_groups[-1]` grouping. Another alternative is to simply parse the first 24 groupings, but this is a very brittle solution since other years' awards might have more or fewer groupings.

In [79]:
len(oscars2018_groups)

105

### Navigating the HTML tree to find more specific parent elements
A third alternative is to leverage the tree structure of HTML and get the parent element in the hopes it is more unique than its children. In this case something like `<div id="quicktabs-tabpage-honorees-0"...>` is a promising lead. Use `find_all` to search for this tag and confirm there is only one `<div>` element (with its children) rather than multiple `<div>` elements matching this `id`.

In [88]:
# Get the new tag group
oscars2018_parent_group = oscars2018_soup.find_all('div',{'id':'quicktabs-tabpage-honorees-0'})

# Hopefully there is only one group matching this pattern
len(oscars2018_parent_group)


1

So far so good. Now we can use `find_all` for `<div class="view-grouping">` to search *within* this specific parent group, where hopefully there are only the 24 awards groupings.

In [89]:
# Note the addition of the [0] since the _parent_group is a list with 1 element in it
# We just extract that single element (which is a soup) and then we can use find_all on it

oscars2018_true_groups = oscars2018_parent_group[0].find_all('div',{'class':'view-grouping'})

len(oscars2018_true_groups)

24

Hallelujah! The award names for each group live inside a `<div class="view-grouping-header">`, so we can `find_all` for those, loop through each, and print out the name.

In [92]:
for group in oscars2018_parent_group[0].find_all('div',{'class':'view-grouping-header'}):
    print(group.text)

Actor in a Leading Role
Actor in a Supporting Role
Actress in a Leading Role
Actress in a Supporting Role
Animated Feature Film
Cinematography
Costume Design
Directing
Documentary (Feature)
Documentary (Short Subject)
Film Editing
Foreign Language Film
Makeup and Hairstyling
Music (Original Score)
Music (Original Song)
Best Picture
Production Design
Short Film (Animated)
Short Film (Live Action)
Sound Editing
Sound Mixing
Visual Effects
Writing (Adapted Screenplay)
Writing (Original Screenplay)


It turns out that the Oscars site loads a bunch of extra data that it does not render that lives underneath a `<div id="quicktabs-tabpage-honorees-1">` which is where the 81 extra "awards" come from. This appears to be an attempt to organize the page by film, rather than by category.
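The effect of scoping a search to one parent can be reproduced on a toy page. In this sketch, `page_html` is a made-up miniature of the structure described above (two tab pages that both contain `view-grouping` elements); only the `id` and `class` names are taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page: two tab pages sharing the same child class
page_html = """
<div id="quicktabs-tabpage-honorees-0">
  <div class="view-grouping"><div class="view-grouping-header"><h2>Directing</h2></div></div>
  <div class="view-grouping"><div class="view-grouping-header"><h2>Best Picture</h2></div></div>
</div>
<div id="quicktabs-tabpage-honorees-1">
  <div class="view-grouping"><div class="view-grouping-header">Wonder</div></div>
</div>
"""
soup = BeautifulSoup(page_html, 'html.parser')

# Searching the whole soup picks up groupings from both tab pages
all_groups = soup.find_all('div', {'class': 'view-grouping'})

# Searching inside the first tab page returns only the by-category groupings
by_category = soup.find('div', {'id': 'quicktabs-tabpage-honorees-0'})
category_groups = by_category.find_all('div', {'class': 'view-grouping'})
category_names = [group.find('h2').get_text() for group in category_groups]

print(len(all_groups), len(category_groups))
print(category_names)
```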

### Navigating the HTML tree from a specific child to find specific generic parents
Now bear with me through some additional and presently unnecessary pain. Above, we were able to isolate the 24 category groupings we wanted through finding an appropriate *parent* tag and then working *down*. But I also want to show how we could identify the same 24 category groups by finding an appropriate *child* tag and working back up. This could be helpful in other situations where the elements are hard to disambiguate.

Let's start by finding the tag for the "Actor in a Leading Role" from the soup containing all the tags.

In [96]:
oscars2018_groups[0]

<div class="view-grouping"><div class="view-grouping-header"><h2>Actor in a Leading Role</h2></div><div class="view-grouping-content"> <h3><span class="golden-text">Winner</span></h3>
<div class="views-row views-row-1 views-row-odd views-row-first views-row-last">
<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Gary Oldman</h4> </div>
<div class="views-field views-field-title"> <span class="field-content">Darkest Hour
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>
<h3>Nominees</h3>
<div class="views-row views-row-1 views-row-odd views-row-first">
<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Timothée Chalamet</h4> </div>
<div class="views-field views-field-title"> <span class="field-content">Call Me by Your Name
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>
<div class="views-row views-row

Rather than finding *all* the `<div class="view-grouping">` present in the page, we only want the 23 *siblings* of this specific tag. We can use the `find_next_siblings()` method to get these 23 siblings. I do not like this method very much because you have to find the "eldest" sibling and then combine it with its siblings later on if you want all the children. In this case, you'd need to keep track of the `<div class="view-grouping">` corresponding to Best Actor and then combine it with its 23 siblings, rather than an approach that simply returns all 24 in a single list.

In [102]:
oscars2018_group0_next_siblings = oscars2018_groups[0].find_next_siblings()

len(oscars2018_group0_next_siblings)

23

We could also go up to get the parent and then find all 24 of the `<div class='view-grouping'>` among the children.

In [112]:
# From the child we like, get its parent
oscars2018_group0_parent = oscars2018_groups[0].parent

# Now with the parent, find all the relevant children
oscars2018_group0_parent_children = oscars2018_group0_parent.find_all('div',{'class':'view-grouping'})

# Confirm
len(oscars2018_group0_parent_children)

24
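Both routes can be compared side by side on a toy example. This sketch uses a made-up snippet, `toy_html`, with three sibling groupings plus one unrelated grouping elsewhere on the page, and shows how the "eldest" sibling has to be added back by hand when using `find_next_siblings`.

```python
from bs4 import BeautifulSoup

# Made-up snippet: three sibling groupings and one unrelated grouping elsewhere
toy_html = """<div id="parent">
<div class="view-grouping">A</div>
<div class="view-grouping">B</div>
<div class="view-grouping">C</div>
</div>
<div class="view-grouping">Elsewhere</div>"""
soup = BeautifulSoup(toy_html, 'html.parser')

# Route 1: start from the eldest sibling and add it back to its younger siblings
eldest = soup.find('div', class_='view-grouping')
younger = eldest.find_next_siblings('div', class_='view-grouping')
via_siblings = [eldest] + younger

# Route 2: go up to the parent and search downward again
via_parent = eldest.parent.find_all('div', class_='view-grouping')

sibling_text = [group.get_text() for group in via_siblings]
parent_text = [group.get_text() for group in via_parent]
print(sibling_text, parent_text)
```

Neither route picks up the "Elsewhere" grouping, because it does not share a parent with the first three.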

### Checking the relevant fields

That seemed like a major digression away from the core task of writing a parser, but it is critical that we write a parser that parses *only* the data we want and nothing else. Now that we have our 24 awards groups in `oscars2018_true_groups`, let's break one open and extract all the yummy data waiting inside.

There are a few helpfully named tags and classes that should make extracting this data a bit easier.

* `<div class="view-grouping-header">` - name of the category
* `<span class="golden-text">` - winner
* `<div class="views-field views-field-field-actor-name">` - name of actor
* `<div class="views-field views-field-title">` - title of movie

In [121]:
oscars2018_true_groups[0]

<div class="view-grouping"><div class="view-grouping-header"><h2>Actor in a Leading Role</h2></div><div class="view-grouping-content"> <h3><span class="golden-text">Winner</span></h3>
<div class="views-row views-row-1 views-row-odd views-row-first views-row-last">
<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Gary Oldman</h4> </div>
<div class="views-field views-field-title"> <span class="field-content">Darkest Hour
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>
<h3>Nominees</h3>
<div class="views-row views-row-1 views-row-odd views-row-first">
<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Timothée Chalamet</h4> </div>
<div class="views-field views-field-title"> <span class="field-content">Call Me by Your Name
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>
<div class="views-row views-row

"Zoom in" to the `views-field-field-actor-name`.

In [118]:
oscars2018_true_groups[0].find_all('div',{'class':"views-field views-field-field-actor-name"})

[<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Gary Oldman</h4> </div>,
 <div class="views-field views-field-field-actor-name"> <h4 class="field-content">Timothée Chalamet</h4> </div>,
 <div class="views-field views-field-field-actor-name"> <h4 class="field-content">Daniel Day-Lewis</h4> </div>,
 <div class="views-field views-field-field-actor-name"> <h4 class="field-content">Daniel Kaluuya</h4> </div>,
 <div class="views-field views-field-field-actor-name"> <h4 class="field-content">Denzel Washington</h4> </div>]

These `<h4>` tags may be more specific and helpful.

In [122]:
oscars2018_true_groups[0].find_all('h4')

[<h4 class="field-content">Gary Oldman</h4>,
 <h4 class="field-content">Timothée Chalamet</h4>,
 <h4 class="field-content">Daniel Day-Lewis</h4>,
 <h4 class="field-content">Daniel Kaluuya</h4>,
 <h4 class="field-content">Denzel Washington</h4>]

Zoom into the `views-field-title`.

In [120]:
oscars2018_true_groups[0].find_all('div',{'class':"views-field views-field-title"})

[<div class="views-field views-field-title"> <span class="field-content">Darkest Hour
 </span> </div>,
 <div class="views-field views-field-title"> <span class="field-content">Call Me by Your Name
 </span> </div>,
 <div class="views-field views-field-title"> <span class="field-content">Phantom Thread
 </span> </div>,
 <div class="views-field views-field-title"> <span class="field-content">Get Out
 </span> </div>,
 <div class="views-field views-field-title"> <span class="field-content">Roman J. Israel, Esq.
 </span> </div>]

These `<span>` tags may be more specific and helpful, but there are also empty tags here clogging things up.

In [128]:
oscars2018_true_groups[0].find_all('span')

[<span class="golden-text">Winner</span>,
 <span class="field-content">Darkest Hour
 </span>,
 <span class="field-content"></span>,
 <span class="field-content">Call Me by Your Name
 </span>,
 <span class="field-content"></span>,
 <span class="field-content">Phantom Thread
 </span>,
 <span class="field-content"></span>,
 <span class="field-content">Get Out
 </span>,
 <span class="field-content"></span>,
 <span class="field-content">Roman J. Israel, Esq.
 </span>,
 <span class="field-content"></span>]
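One way to deal with the empty tags is to skip any `<span>` whose stripped text is empty. This is a minimal sketch on a hard-coded fragment, `group_html`, modeled on the output above; restricting the search to `class="field-content"` also leaves out the "Winner" label.

```python
from bs4 import BeautifulSoup

# Hard-coded fragment modeled on the grouping shown above
group_html = """<div class="view-grouping-content">
<h3><span class="golden-text">Winner</span></h3>
<div class="views-field views-field-title"> <span class="field-content">Darkest Hour
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div>
<div class="views-field views-field-title"> <span class="field-content">Get Out
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div>
</div>"""
group = BeautifulSoup(group_html, 'html.parser')

titles = []
for span in group.find_all('span', {'class': 'field-content'}):
    # strip=True removes the trailing newline inside each span
    text = span.get_text(strip=True)
    # Empty spans produce an empty string, which we skip
    if text:
        titles.append(text)

print(titles)
```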

As a battle-scarred web scraper, let me continue to emphasize the importance of quick-checking your assumptions before committing to writing code. Are these fields still appropriate for other awards categories? Let's check the last category for original screenplay. Are the `<div>`s for "field-actor-name" still people and for "field-title" still movies? Nope. 

Looking back at the web page, it's now obvious that the movie title and person who gets the award are flipped between actors/actresses and the other awards categories. We're going to have to keep this in mind going forward!

In [117]:
oscars2018_true_groups[-1].find_all('div',{'class':"views-field views-field-field-actor-name"})

[<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Get Out</h4> </div>,
 <div class="views-field views-field-field-actor-name"> <h4 class="field-content">The Big Sick</h4> </div>,
 <div class="views-field views-field-field-actor-name"> <h4 class="field-content">Lady Bird</h4> </div>,
 <div class="views-field views-field-field-actor-name"> <h4 class="field-content">The Shape of Water</h4> </div>,
 <div class="views-field views-field-field-actor-name"> <h4 class="field-content">Three Billboards outside Ebbing, Missouri</h4> </div>]

In [119]:
oscars2018_true_groups[-1].find_all('div',{'class':"views-field views-field-title"})

[<div class="views-field views-field-title"> <span class="field-content">Written by Jordan Peele
 </span> </div>,
 <div class="views-field views-field-title"> <span class="field-content">Written by Emily V. Gordon &amp; Kumail Nanjiani
 </span> </div>,
 <div class="views-field views-field-title"> <span class="field-content">Written by Greta Gerwig
 </span> </div>,
 <div class="views-field views-field-title"> <span class="field-content">Screenplay by Guillermo del Toro &amp; Vanessa Taylor; Story by Guillermo del Toro
 </span> </div>,
 <div class="views-field views-field-title"> <span class="field-content">Written by Martin McDonagh
 </span> </div>]

### Writing the core parser functionality

How will we map the contents of the HTML to the fields of our table?

* **Year**: All the awards are from the same year, also in the URL
* **Category**: `<h2>`
* **Nominee**: `<h4>` for actors, `<span>` for non-actors
* **Movie**: `<span>` for actors, `<h4>` for non-actors
* **Won**: the nominee in the row right after the "Winner" `<h3>` won, everyone else lost; alternatively, just take the first nominee listed in each category

In [135]:
oscars2018_true_groups[0]

<div class="view-grouping"><div class="view-grouping-header"><h2>Actor in a Leading Role</h2></div><div class="view-grouping-content"> <h3><span class="golden-text">Winner</span></h3>
<div class="views-row views-row-1 views-row-odd views-row-first views-row-last">
<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Gary Oldman</h4> </div>
<div class="views-field views-field-title"> <span class="field-content">Darkest Hour
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>
<h3>Nominees</h3>
<div class="views-row views-row-1 views-row-odd views-row-first">
<div class="views-field views-field-field-actor-name"> <h4 class="field-content">Timothée Chalamet</h4> </div>
<div class="views-field views-field-title"> <span class="field-content">Call Me by Your Name
</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>
<div class="views-row views-row

In [155]:
category = oscars2018_true_groups[0].find_all('h2')[0].text
print("The name of the category is:",category)

The name of the category is: Actor in a Leading Role


In [149]:
names = []
for _nominee in oscars2018_true_groups[0].find_all('h4'):
    nominee_name = _nominee.text
    names.append(nominee_name)
    print("The name of a nominee is:",nominee_name)

The name of a nominee is: Gary Oldman
The name of a nominee is: Timothée Chalamet
The name of a nominee is: Daniel Day-Lewis
The name of a nominee is: Daniel Kaluuya
The name of a nominee is: Denzel Washington


In [150]:
movies = []

for _movie in oscars2018_true_groups[0].find_all('span'):
    if len(_movie.text) > 0:
        movie_name = _movie.text.strip()
        movies.append(movie_name)
        print("The name of a movie is:",movie_name)

The name of a movie is: Winner
The name of a movie is: Darkest Hour
The name of a movie is: Call Me by Your Name
The name of a movie is: Phantom Thread
The name of a movie is: Get Out
The name of a movie is: Roman J. Israel, Esq.


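One strategy is to use Python's built-in [`zip`](https://docs.python.org/3.7/library/functions.html#zip) function to combine elements from different lists together. But `zip` is a bit too slick and abstract for my tastes, and it can silently misalign the results: the "Winner" label is also a `<span>`, so `movies` has one more element than `names` and every nominee below gets paired with the wrong movie.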

In [166]:
# The elements of each list being combined need to be the same size
# So we make a list of the category name and multiply it by 5 to make it the same size as the others
list(zip([category]*5,names,movies))

[('Actor in a Leading Role', 'Gary Oldman', 'Winner'),
 ('Actor in a Leading Role', 'Timothée Chalamet', 'Darkest Hour'),
 ('Actor in a Leading Role', 'Daniel Day-Lewis', 'Call Me by Your Name'),
 ('Actor in a Leading Role', 'Daniel Kaluuya', 'Phantom Thread'),
 ('Actor in a Leading Role', 'Denzel Washington', 'Get Out')]
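
The pairing in the output above is off by one because `zip` stops at the shortest input without complaining. A minimal sketch of one way to guard against this, using the values printed in the earlier cells; `safe_zip` is a hypothetical helper name and is not part of the original notebook.

```python
# Values copied from the outputs of the earlier cells
category = "Actor in a Leading Role"
names = ["Gary Oldman", "Timothée Chalamet", "Daniel Day-Lewis",
         "Daniel Kaluuya", "Denzel Washington"]
movies = ["Winner", "Darkest Hour", "Call Me by Your Name",
          "Phantom Thread", "Get Out", "Roman J. Israel, Esq."]

def safe_zip(*lists):
    """Zip lists together, but raise an error if their lengths differ."""
    lengths = {len(each_list) for each_list in lists}
    if len(lengths) > 1:
        raise ValueError("Lists have different lengths: {0}".format(sorted(lengths)))
    return list(zip(*lists))

# Drop the "Winner" label, which is a <span> but not a movie title
movies_only = [movie for movie in movies if movie != "Winner"]

# Repeat the category once per nominee instead of hard-coding 5
rows = safe_zip([category] * len(names), names, movies_only)

for row in rows:
    print(row)
```

Calling `safe_zip(names, movies)` on the unfiltered lists raises a `ValueError` rather than quietly producing shifted rows. On Python 3.10 and later, `zip(..., strict=True)` offers a similar check.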

Another strategy is to use the `<div class="views-row">`s for each nominee and extract the relevant information from its subdivs. This is a bit more intuitive in the sense of reading from top to bottom and also makes it easier to capture the winner and losers based on position.

In [203]:
actor_nominees = oscars2018_true_groups[0].find_all('div',{'class':'views-row'})

for i,nominee in enumerate(actor_nominees):
    
    # If in the first position, the nominee won
    if i == 0:
        winner = 'Won'
    # Otherwise, the nominee lost
    else:
        winner = 'Lost'
    
    # Get a list of all the sub-divs
    subdivs = nominee.find_all('div')
    
    # The first subdiv (for an actor) is the name
    name = subdivs[0].text.strip()
    
    # The second subdiv (for an actor) is the movie name
    movie = subdivs[1].text.strip()
    
    print("{0} was nominated for \"{1}\" and {2}.".format(name,movie,winner))

Gary Oldman was nominated for "Darkest Hour" and Won.
Timothée Chalamet was nominated for "Call Me by Your Name" and Lost.
Daniel Day-Lewis was nominated for "Phantom Thread" and Lost.
Daniel Kaluuya was nominated for "Get Out" and Lost.
Denzel Washington was nominated for "Roman J. Israel, Esq." and Lost.


Check that reversing "movie" and "name" works for another award category like original screenplay (`oscars2018_true_groups[-1]`). There's some weirdness with "Written by" and "Story by" filtering in here rather than simply names that may need to get fixed in the final calculation, but I would want to talk to a domain expert about the differences between these labels.

In [204]:
original_screenplay_nominees = oscars2018_true_groups[-1].find_all('div',{'class':'views-row'})

for i,nominee in enumerate(original_screenplay_nominees):
    if i == 0:
        winner = 'Won'
    else:
        winner = 'Lost'
        
    subdivs = nominee.find_all('div')
    
    # movie and name reversed
    movie = subdivs[0].text.strip()
    name = subdivs[1].text.strip()
    
    print("{0} was nominated for \"{1}\" and {2}.".format(name,movie,winner))

Written by Jordan Peele was nominated for "Get Out" and Won.
Written by Emily V. Gordon & Kumail Nanjiani was nominated for "The Big Sick" and Lost.
Written by Greta Gerwig was nominated for "Lady Bird" and Lost.
Screenplay by Guillermo del Toro & Vanessa Taylor; Story by Guillermo del Toro was nominated for "The Shape of Water" and Lost.
Written by Martin McDonagh was nominated for "Three Billboards outside Ebbing, Missouri" and Lost.
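
If we did want to separate the credit labels from the names, here is one possible sketch. It assumes every credit looks like "<role> by <people>" and that multiple credits are separated by semicolons, which holds for the five strings above but may not hold for other categories or years. `split_credits` is a hypothetical helper name, not something from the original notebook.

```python
def split_credits(text):
    """Split a credit string into a list of (role, people) pairs.

    Assumes credits are separated by ';' and each looks like '<role> by <people>'.
    Anything without ' by ' is returned with a role of None.
    """
    credits = []
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        # partition splits on the first ' by ' only
        role, separator, people = part.partition(" by ")
        if separator:
            credits.append((role, people))
        else:
            credits.append((None, part))
    return credits

print(split_credits("Written by Jordan Peele"))
print(split_credits("Screenplay by Guillermo del Toro & Vanessa Taylor; Story by Guillermo del Toro"))
```

This keeps the role label rather than throwing it away, so the decision about whether "Story by" and "Screenplay by" should count the same can be made later.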


This was just for one category at a time; now let's add another layer to loop over all the different awards categories. The switch between movie name and awardee matters now, since most of the categories are reversed relative to the acting awards.

In [195]:
for group in oscars2018_true_groups:
    category = group.find_all('h2')[0].text
    
    for i,nominee in enumerate(group.find_all('div',{'class':'views-row'})):
        if i == 0:
            winner = 'Won'
        else:
            winner = 'Lost'

        subdivs = nominee.find_all('div')
        name = subdivs[0].text.strip()
        movie = subdivs[1].text.strip()

        print("{0} was nominated in {1} for \"{2}\" and {3}.".format(name,category,movie,winner))

Gary Oldman was nominated in Actor in a Leading Role for "Darkest Hour" and Won.
Timothée Chalamet was nominated in Actor in a Leading Role for "Call Me by Your Name" and Lost.
Daniel Day-Lewis was nominated in Actor in a Leading Role for "Phantom Thread" and Lost.
Daniel Kaluuya was nominated in Actor in a Leading Role for "Get Out" and Lost.
Denzel Washington was nominated in Actor in a Leading Role for "Roman J. Israel, Esq." and Lost.
Sam Rockwell was nominated in Actor in a Supporting Role for "Three Billboards outside Ebbing, Missouri" and Won.
Willem Dafoe was nominated in Actor in a Supporting Role for "The Florida Project" and Lost.
Woody Harrelson was nominated in Actor in a Supporting Role for "Three Billboards outside Ebbing, Missouri" and Lost.
Richard Jenkins was nominated in Actor in a Supporting Role for "The Shape of Water" and Lost.
Christopher Plummer was nominated in Actor in a Supporting Role for "All the Money in the World" and Lost.
Frances McDormand was nominated in Actre

Include some flow control: if "Actor" or "Actress" appears in the category title, then take the nominee name first and the movie name second; otherwise take the movie name first and the nominee name second.

In [196]:
for group in oscars2018_true_groups:
    category = group.find_all('h2')[0].text
    
    if 'Actor' in category or 'Actress' in category:
    
        for i,nominee in enumerate(group.find_all('div',{'class':'views-row'})):
            if i == 0:
                winner = 'Won'
            else:
                winner = 'Lost'

            subdivs = nominee.find_all('div')
            name = subdivs[0].text.strip()
            movie = subdivs[1].text.strip()

            print("{0} was nominated in {1} for \"{2}\" and {3}.".format(name,category,movie,winner))
            
    else:
        
        for i,nominee in enumerate(group.find_all('div',{'class':'views-row'})):
            if i == 0:
                winner = 'Won'
            else:
                winner = 'Lost'

            subdivs = nominee.find_all('div')
            movie = subdivs[0].text.strip()
            name = subdivs[1].text.strip()

            print("{0} was nominated in {1} for \"{2}\" and {3}.".format(name,category,movie,winner))
            

Gary Oldman was nominated in Actor in a Leading Role for "Darkest Hour" and Won.
Timothée Chalamet was nominated in Actor in a Leading Role for "Call Me by Your Name" and Lost.
Daniel Day-Lewis was nominated in Actor in a Leading Role for "Phantom Thread" and Lost.
Daniel Kaluuya was nominated in Actor in a Leading Role for "Get Out" and Lost.
Denzel Washington was nominated in Actor in a Leading Role for "Roman J. Israel, Esq." and Lost.
Sam Rockwell was nominated in Actor in a Supporting Role for "Three Billboards outside Ebbing, Missouri" and Won.
Willem Dafoe was nominated in Actor in a Supporting Role for "The Florida Project" and Lost.
Woody Harrelson was nominated in Actor in a Supporting Role for "Three Billboards outside Ebbing, Missouri" and Lost.
Richard Jenkins was nominated in Actor in a Supporting Role for "The Shape of Water" and Lost.
Christopher Plummer was nominated in Actor in a Supporting Role for "All the Money in the World" and Lost.
Frances McDormand was nominate

Rather than printing out the information, store it in `nominees_2018` so that we can turn it into a DataFrame.

In [207]:
nominees_2018 = []

for group in oscars2018_true_groups:
    category = group.find_all('h2')[0].text
    
    if 'Actor' in category or 'Actress' in category:
    
        for i,nominee in enumerate(group.find_all('div',{'class':'views-row'})):
            if i == 0:
                winner = 'Won'
            else:
                winner = 'Lost'

            subdivs = nominee.find_all('div')
            name = subdivs[0].text.strip()
            movie = subdivs[1].text.strip()
            
            # Swap out the print
            # Make a payload for each nominee
            nominee_payload = {'Category':category,
                               'Name':name,
                               'Movie':movie,
                               'Year':2018, # We're only looking at 2018 right now
                               'Winner':winner}
            
            # Add the payload to the list of nominees at top
            nominees_2018.append(nominee_payload)
            
    else:
        
        for i,nominee in enumerate(group.find_all('div',{'class':'views-row'})):
            if i == 0:
                winner = 'Won'
            else:
                winner = 'Lost'

            subdivs = nominee.find_all('div')
            movie = subdivs[0].text.strip()
            name = subdivs[1].text.strip()

            # Swap out the print
            nominee_payload = {'Category':category,
                               'Name':name,
                               'Movie':movie,
                               'Year':2018,
                               'Winner':winner}
            
            nominees_2018.append(nominee_payload)
            

Moment of truth!

In [208]:
nominees_df = pd.DataFrame(nominees_2018)
nominees_df

| | Category | Movie | Name | Winner | Year |
|---|---|---|---|---|---|
| 0 | Actor in a Leading Role | Darkest Hour | Gary Oldman | Won | 2018 |
| 1 | Actor in a Leading Role | Call Me by Your Name | Timothée Chalamet | Lost | 2018 |
| 2 | Actor in a Leading Role | Phantom Thread | Daniel Day-Lewis | Lost | 2018 |
| 3 | Actor in a Leading Role | Get Out | Daniel Kaluuya | Lost | 2018 |
| 4 | Actor in a Leading Role | Roman J. Israel, Esq. | Denzel Washington | Lost | 2018 |
| 5 | Actor in a Supporting Role | Three Billboards outside Ebbing, Missouri | Sam Rockwell | Won | 2018 |
| 6 | Actor in a Supporting Role | The Florida Project | Willem Dafoe | Lost | 2018 |
| 7 | Actor in a Supporting Role | Three Billboards outside Ebbing, Missouri | Woody Harrelson | Lost | 2018 |
| 8 | Actor in a Supporting Role | The Shape of Water | Richard Jenkins | Lost | 2018 |
| 9 | Actor in a Supporting Role | All the Money in the World | Christopher Plummer | Lost | 2018 |
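
The same list-of-dictionaries payload can also be written straight to CSV without pandas. This is only a sketch using the standard library's `csv.DictWriter`, with two payloads copied from the table above and an in-memory buffer standing in for a file on disk; the variable names are illustrative.

```python
import csv
import io

# Two payloads in the same shape the parser produces
example_nominees = [
    {'Category': 'Actor in a Leading Role', 'Name': 'Gary Oldman',
     'Movie': 'Darkest Hour', 'Year': 2018, 'Winner': 'Won'},
    {'Category': 'Actor in a Leading Role', 'Name': 'Denzel Washington',
     'Movie': 'Roman J. Israel, Esq.', 'Year': 2018, 'Winner': 'Lost'},
]

# Fix the column order explicitly rather than relying on dictionary order
fieldnames = ['Category', 'Name', 'Movie', 'Year', 'Winner']

# An in-memory buffer; swap in open('nominees.csv', 'w', newline='') to write a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(example_nominees)

print(buffer.getvalue())
```

Note that the writer quotes "Roman J. Israel, Esq." automatically because the title contains a comma, which is exactly the kind of detail that breaks hand-rolled string joins.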


Now let's turn this hulking beast of a parser into a function so we can apply it to other years' nominees in the next step.

In [213]:
def parse_nominees(true_groups,year):
    nominees_list = []

    for group in true_groups:
        category = group.find_all('h2')[0].text

        if 'Actor' in category or 'Actress' in category:

            for i,nominee in enumerate(group.find_all('div',{'class':'views-row'})):
                if i == 0:
                    winner = 'Won'
                else:
                    winner = 'Lost'

                subdivs = nominee.find_all('div')
                name = subdivs[0].text.strip()
                movie = subdivs[1].text.strip()
                
                nominee_payload = {'Category':category,
                                   'Name':name,
                                   'Movie':movie,
                                   'Year':year, # We may look at other years
                                   'Winner':winner}

                nominees_list.append(nominee_payload)

        else:

            for i,nominee in enumerate(group.find_all('div',{'class':'views-row'})):
                if i == 0:
                    winner = 'Won'
                else:
                    winner = 'Lost'

                subdivs = nominee.find_all('div')
                movie = subdivs[0].text.strip()
                name = subdivs[1].text.strip()

                nominee_payload = {'Category':category,
                                   'Name':name,
                                   'Movie':movie,
                                   'Year':year,
                                   'Winner':winner}

                nominees_list.append(nominee_payload)
    
    return nominees_list
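
The two branches of `parse_nominees` differ only in which subdiv is treated as the name and which as the movie. One way the duplication could be collapsed is sketched below; `split_name_movie` and `make_payload` are hypothetical helper names, and the sketch works on plain strings so it can be checked without any HTML.

```python
def split_name_movie(category, first, second):
    """Return (name, movie) given the text of the first and second subdivs.

    Acting categories list the person first; other categories list the movie first.
    """
    if 'Actor' in category or 'Actress' in category:
        return first, second
    return second, first

def make_payload(category, first, second, position, year):
    """Build the dictionary for one nominee; position 0 is the winner."""
    name, movie = split_name_movie(category, first, second)
    return {'Category': category,
            'Name': name,
            'Movie': movie,
            'Year': year,
            'Winner': 'Won' if position == 0 else 'Lost'}

# An acting category: person first, movie second
print(make_payload('Actor in a Leading Role', 'Gary Oldman', 'Darkest Hour', 0, 2018))

# A non-acting category (label assumed for illustration): movie first, person second
print(make_payload('Writing (Original Screenplay)', 'Get Out', 'Written by Jordan Peele', 0, 2018))
```

Inside the parser, `first` and `second` would be `subdivs[0].text.strip()` and `subdivs[1].text.strip()`, and `position` would be the `i` from `enumerate`.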

## Iterating vs. parsing to retrieve data

Often the data you are interested in is spread across multiple web pages. In an ideal world, the naming conventions would let you retrieve the data from these pages systematically. In the case of the Oscars, the URLs appear to be consistently formatted: `https://www.oscars.org/oscars/ceremonies/2019` suggests that we could change the 2019 to any other year going back to the start of the Oscars and get that year as well: `https://www.oscars.org/oscars/ceremonies/2018` should get us the page for 2018, and so on. The alternative is to parse the list of links to each year's page from the navigation in the page header. Let's demonstrate both strategies with the Oscars data: iterating over the years in the URL versus parsing the list of links from the header.

### Iterating strategies for retrieving data

The fundamental assumption with this strategy is that the data are stored at URLs in a consistent way that we can access sequentially. In the case of the Oscars, we *should* be able to simply pass each year to the URL in requests. Here we want to practice responsible data scraping by including a sleep between each request so that we do not overwhelm the Oscars server with requests. We can use the `sleep` function within `time`.

In [210]:
from time import sleep

The `sleep(3)` below prevents any more code from progressing for 3 seconds.

In [212]:
print("The start of something.")
sleep(3)
print("The end of something.")

The start of something.
The end of something.


The core part of the iterating strategy is simply using Python's [`range`](https://docs.python.org/3.7/library/functions.html#func-range) function to generate a sequence of values. Here, we can use `range` to print out a sequence of URLs that should correspond to awards pages from 2010 through 2019. We can also incorporate the `sleep` functionality and wait a second between each `print` statement—it should now take 10 seconds for this code to finish printing. This simulates how we can use `sleep` to slow down and spread out requests so that we do not overwhelm the servers whose data we are trying to scrape.

In [228]:
for year in range(2010,2020):
    sleep(1)
    print('https://www.oscars.org/oscars/ceremonies/{0}'.format(year))

https://www.oscars.org/oscars/ceremonies/2010
https://www.oscars.org/oscars/ceremonies/2011
https://www.oscars.org/oscars/ceremonies/2012
https://www.oscars.org/oscars/ceremonies/2013
https://www.oscars.org/oscars/ceremonies/2014
https://www.oscars.org/oscars/ceremonies/2015
https://www.oscars.org/oscars/ceremonies/2016
https://www.oscars.org/oscars/ceremonies/2017
https://www.oscars.org/oscars/ceremonies/2018
https://www.oscars.org/oscars/ceremonies/2019


We defined a function `parse_nominees` above that takes the "true groups" of nominees. Let's try to tie these pieces together for all the nominees in the 2010s.

In [221]:
# Create an empty dictionary to store the data we get, keyed by year
all_years_nominees = dict()

# For each year starting in 2010 until 2019
for year in range(2010,2020):
    
    # Pause for a second between each request
    sleep(1)
    
    # Get the raw HTML
    year_raw_html = requests.get('https://www.oscars.org/oscars/ceremonies/{0}'.format(year)).text

    # Soup-ify
    year_souped_html = BeautifulSoup(year_raw_html)
    
    # Get the parent group
    year_parent_group = year_souped_html.find_all('div',{'id':'quicktabs-tabpage-honorees-0'})
    
    # Get the true groups under the parent group
    year_true_groups = year_parent_group[0].find_all('div',{'class':'view-grouping'})
    
    # Use our parsing function, passing the year from above
    year_nominees = parse_nominees(year_true_groups,year)
    
    # Convert the year_nominees to a DataFrame and add them to all_years_nominees
    all_years_nominees[year] = pd.DataFrame(year_nominees)

Combine each of the DataFrames in `all_years_nominees` into a giant DataFrame of all the nominees from 2010-2019.

In [230]:
all_years_nominees_df = pd.concat(all_years_nominees)
all_years_nominees_df.reset_index(drop=True).head(10)

| | Category | Movie | Name | Winner | Year |
|---|---|---|---|---|---|
| 0 | Actor in a Leading Role | Crazy Heart | Jeff Bridges | Won | 2010 |
| 1 | Actor in a Leading Role | Up in the Air | George Clooney | Lost | 2010 |
| 2 | Actor in a Leading Role | A Single Man | Colin Firth | Lost | 2010 |
| 3 | Actor in a Leading Role | Invictus | Morgan Freeman | Lost | 2010 |
| 4 | Actor in a Leading Role | The Hurt Locker | Jeremy Renner | Lost | 2010 |
| 5 | Actor in a Supporting Role | Inglourious Basterds | Christoph Waltz | Won | 2010 |
| 6 | Actor in a Supporting Role | Invictus | Matt Damon | Lost | 2010 |
| 7 | Actor in a Supporting Role | The Messenger | Woody Harrelson | Lost | 2010 |
| 8 | Actor in a Supporting Role | The Last Station | Christopher Plummer | Lost | 2010 |
| 9 | Actor in a Supporting Role | The Lovely Bones | Stanley Tucci | Lost | 2010 |


### Parsing strategy for retrieving data

Frustratingly, this iterating strategy may not always hold: maybe some years are skipped or the naming convention changes at some point. We will cover some basics of [error-handling in Python](https://realpython.com/python-exceptions/) that could let us work around errors as they pop up, but this may result in an incomplete collection if the naming conventions are not systematic. What we would want to do instead is identify all the links ahead of time by parsing them from a list, and then work through that list to get the complete data collection.
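
As a rough sketch of what error handling around the iterating strategy might look like: the loop below records which years failed instead of stopping at the first error. `collect_years` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the real request-and-parse step and simply pretends that the 2012 page is missing.

```python
from time import sleep

def collect_years(years, fetch, pause=1):
    """Call fetch(year) for each year, recording failures instead of stopping."""
    collected = {}
    failed = {}
    for year in years:
        # Pause between requests, as in the loops above
        sleep(pause)
        try:
            collected[year] = fetch(year)
        # Catching every Exception is deliberately broad for this sketch;
        # in practice you would catch the specific errors you expect
        except Exception as error:
            failed[year] = error
    return collected, failed

def fake_fetch(year):
    """Stand-in for a real request: pretends the 2012 page does not exist."""
    if year == 2012:
        raise ValueError("No page found for {0}".format(year))
    return "<html>{0}</html>".format(year)

# pause=0 only because nothing is actually being requested here
collected, failed = collect_years(range(2010, 2015), fake_fetch, pause=0)

print("Years collected:", sorted(collected))
print("Years that failed:", failed)
```

The `failed` dictionary makes the gaps in the collection explicit, but it cannot tell you about pages that live at URLs you never thought to try.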

What this means in the context of our Oscars example is assuming that we cannot trust that the sequential numbering of the years is a reliable guide to get all the data. Instead, we should get a list of the URLs for each of the awards pages from the "ceremonies-decade-scroller" (from Inspect) at the top. This scroller *should* be consistent across all the pages, but start with the nominees for 2019 just to be safe:

In [232]:
oscars2019_raw = requests.get('https://www.oscars.org/oscars/ceremonies/2019').text
oscars2019_soup = BeautifulSoup(oscars2019_raw)

Using the Inspect tool, there is a `<div class="years">` that contains the links to each of the years. Run a `.find_all` to get all these href locations.

In [240]:
# Get the <div class="years"> as a parent tag first, just in case there are <a class="year"> elsewhere
oscars2019_years_div = oscars2019_soup.find_all('div',{'class':'years'})[0]

# Now get the <a class="year"> tags underneath only the oscars2019_years_div
oscars2019_years_a = oscars2019_years_div.find_all('a',{'class':'year'})

# Inspect the first 10
oscars2019_years_a[:10]

[<a class="year" href="/oscars/ceremonies/1929">1929</a>,
 <a class="year" href="/oscars/ceremonies/1930">1930</a>,
 <a class="year" href="/oscars/ceremonies/1931">1931</a>,
 <a class="year" href="/oscars/ceremonies/1932">1932</a>,
 <a class="year" href="/oscars/ceremonies/1933">1933</a>,
 <a class="year" href="/oscars/ceremonies/1934">1934</a>,
 <a class="year" href="/oscars/ceremonies/1935">1935</a>,
 <a class="year" href="/oscars/ceremonies/1936">1936</a>,
 <a class="year" href="/oscars/ceremonies/1937">1937</a>,
 <a class="year" href="/oscars/ceremonies/1938">1938</a>]

Each of these `<a>` tags contains an `href` attribute, the (relative) URL where the page lives, and a text element for what's displayed.

In [241]:
oscars2019_years_a[0]['href']

'/oscars/ceremonies/1929'

In [242]:
oscars2019_years_a[0].text

'1929'

Now we can write a loop to print out the URL locations for all the other award years based on the "official" links in the "ceremonies-decade-scroller" navigation rather than assuming the years are sequential—I promise this will pay dividends in the future when inconsistent design wreaks havoc on your sequential data strategies!

In [244]:
for a in oscars2019_years_a[-10:]:
    href = a['href']
    print('https://www.oscars.org' + href)

https://www.oscars.org/oscars/ceremonies/2010
https://www.oscars.org/oscars/ceremonies/2011
https://www.oscars.org/oscars/ceremonies/2012
https://www.oscars.org/oscars/ceremonies/2013
https://www.oscars.org/oscars/ceremonies/2014
https://www.oscars.org/oscars/ceremonies/2015
https://www.oscars.org/oscars/ceremonies/2016
https://www.oscars.org/oscars/ceremonies/2017
https://www.oscars.org/oscars/ceremonies/2018
https://www.oscars.org/oscars/ceremonies/2019
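
String concatenation works here because every `href` is a path starting with `/`. A slightly more defensive sketch uses `urljoin` from the standard library, which resolves a link relative to the page it was found on and leaves already-absolute links alone. The `hrefs` list is typed in by hand for illustration, and the last entry is a made-up absolute URL.

```python
from urllib.parse import urljoin

# The page the links were scraped from
base_url = 'https://www.oscars.org/oscars/ceremonies/2019'

# Two hrefs in the format shown above, plus a hypothetical absolute link
hrefs = ['/oscars/ceremonies/1929',
         '/oscars/ceremonies/2018',
         'https://example.com/elsewhere']

# Resolve each href against the page it came from
full_urls = [urljoin(base_url, href) for href in hrefs]

for url in full_urls:
    print(url)
```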


We can now use the `parse_nominees` function for these pages as well.

In [246]:
# Create an empty dictionary to store the data we get, keyed by year
all_years_nominees = dict()

# For the 10 most recent years
for a in oscars2019_years_a[-10:]:
    
    # Pause for a second between each request
    sleep(1)
    
    # Get the href
    href = a['href']
    
    # Get the year
    year = a.text
    
    # Get the raw HTML
    url = 'https://www.oscars.org' + href
    year_raw_html = requests.get(url).text

    # Soup-ify
    year_souped_html = BeautifulSoup(year_raw_html)
    
    # Get the parent group
    year_parent_group = year_souped_html.find_all('div',{'id':'quicktabs-tabpage-honorees-0'})
    
    # Get the true groups under the parent group
    year_true_groups = year_parent_group[0].find_all('div',{'class':'view-grouping'})
    
    # Use our parsing function, passing the year from above
    year_nominees = parse_nominees(year_true_groups,year)
    
    # Convert the year_nominees to a DataFrame and add them to all_years_nominees
    all_years_nominees[year] = pd.DataFrame(year_nominees)

Combine each of the DataFrames in `all_years_nominees` into a giant DataFrame of all the nominees from 2010-2019.

In [247]:
all_years_nominees_df = pd.concat(all_years_nominees)
all_years_nominees_df.reset_index(drop=True).head(10)

| | Category | Movie | Name | Winner | Year |
|---|---|---|---|---|---|
| 0 | Actor in a Leading Role | Crazy Heart | Jeff Bridges | Won | 2010 |
| 1 | Actor in a Leading Role | Up in the Air | George Clooney | Lost | 2010 |
| 2 | Actor in a Leading Role | A Single Man | Colin Firth | Lost | 2010 |
| 3 | Actor in a Leading Role | Invictus | Morgan Freeman | Lost | 2010 |
| 4 | Actor in a Leading Role | The Hurt Locker | Jeremy Renner | Lost | 2010 |
| 5 | Actor in a Supporting Role | Inglourious Basterds | Christoph Waltz | Won | 2010 |
| 6 | Actor in a Supporting Role | Invictus | Matt Damon | Lost | 2010 |
| 7 | Actor in a Supporting Role | The Messenger | Woody Harrelson | Lost | 2010 |
| 8 | Actor in a Supporting Role | The Last Station | Christopher Plummer | Lost | 2010 |
| 9 | Actor in a Supporting Role | The Lovely Bones | Stanley Tucci | Lost | 2010 |


## Project time

Let's take a look at websites you're interested in scraping and see what kinds of challenges and opportunities exist for scraping their data.