# STAT29000 Project 6 Examples

In this project we work with two primary libraries: `requests` and `beautifulsoup4`. `requests` is an HTTP library with utilities for downloading data from the web. `beautifulsoup4` is a package that helps us extract the data we actually want from the HTML and XML files we download. JSON responses do not need `beautifulsoup4`; `requests` can decode them directly.

In addition to those libraries, we will be using `pandas` and by extension, `numpy` (`pandas` is built on `numpy`).

Here are the links to the official documentation for `requests` and `beautifulsoup4`:
- [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#)
- [requests](https://requests.readthedocs.io/en/master/)

Here are some links where you can read more about how to use these two libraries:
- https://realpython.com/python-web-scraping-practical-introduction/
- https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460


One last quick note is that it will be immensely useful to use your browser to peek at the structure of a website's HTML in order to figure out patterns. Typically, you can do so by right clicking on a page and clicking on: "view page source", or "inspect element". The former shows the entire web page, and the latter tries to show you the HTML responsible for where you right-clicked on the page.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bsoup

In [2]:
# Let's take a look at rotten tomatoes at https://www.rottentomatoes.com/
# Let's write a function that returns a list of tuples where the first 
# element in each tuple is a movie name, and the second is the box office $$.
# You can see that that info is freely available and updated on the home page.

# Before we write the function let's walk through some steps.

# Download the html.
html = requests.get('https://www.rottentomatoes.com/')
print(html) # 200 means success!

# Show it as html (the first 100 characters of the text).
print(html.text[:100])

# Okay, now that we have the html
# let's feed it to beautiful soup.
soup = bsoup(html.text)

# Show it as html (the first 100 characters of the text).
# The method prettify() displays the text in a nicer way 
# (e.g., showing each tag in one line).
print(soup.prettify()[:100])

<Response [200]>
<!DOCTYPE html>
<html lang="en" dir="ltr" xmlns="http://www.w3.org/1999/xhtml" prefix="fb: http://ww
<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="fb: http://www.facebook.com/2008/fbml og: http://o


Seeing this snippet of HTML brings up a good point. Jupyter notebooks provide a way to properly display and run HTML without getting an error. We will demonstrate below using the `%%html` magic command. You can read more about this command, and others, [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html).

In [3]:
# In a normal code cell, including html, like the html below, will cause an error.
<html>
    <body>
        <div class="content">
            <h1>My HTML page!</h1>
        </div>
    </body>
</html>

SyntaxError: invalid syntax (<ipython-input-3-da97b316b31c>, line 2)

In the following cell, we will use the magic `%%html` command, and everything will work just as we would expect.

In [4]:
%%html
<html>
    <body>
        <div style="text-align: center;">
            <h1>My HTML page!</h1>
        </div>
    </body>
</html>

In [5]:
# At this point, I have everything downloaded and ready to parse through.
# My next step is to look at the website and figure out where the info I want 
# is at in the HTML.

# Start by right clicking on the "Top Box Office" title and inspecting the element.
# If you look closely, you can see that the entirety of that "box" of information
# is within: <section id="top-box-office" class="media-lists__category-section media-lists__top-box-office"> </section>

# Let's break this down. "section" is an HTML element. Most elements have a start and end "tag".
# In this case, <section id="top-box-office" class="..."> is the start tag, and </section> 
# is the end tag. "id" and "class" are called "attributes". Often attributes are
# given logical meaning and patterns by the individual who made the webpage. In web scraping, attributes 
# are useful because we can use them to isolate parts of the html.

# In this case, it looks like the "id" attribute may be unique! Let's see what we can do.
box_office_html = soup.find_all(id="top-box-office")
print(box_office_html) # The printed HTML is relatively long!

# Excellent! It worked! We captured everything in between the "section" tags, and you can see 
# that we were correct in thinking that the "id" attribute was unique!
print(len(box_office_html))

# If it were not unique, box_office_html would have a len > 1. For example:
print(len(soup.find_all(class_="icon")))

# Note that because "class" is a special python keyword, in order to search for 
# an HTML attribute called "class", we specify "class_" in the find_all method.

[<section class="media-lists__category-section media-lists__top-box-office" id="top-box-office">
<div class="media-lists__main-section-header-group">
<h2 class="h3 h--neusa">Top Box Office</h2>
<a href="/showtimes/">Get Tickets</a>
</div>
<table class="media-lists__table table">
<tr>
<td class="media-lists__td-rating">
<a href="/m/sonic_the_hedgehog_2020">
<span class="icon icon--tiny icon__fresh"></span>
<span>63%</span>
</a>
</td>
<td class="media-lists__td-title">
<a href="/m/sonic_the_hedgehog_2020">Sonic the Hedgehog</a>
</td>
<td class="media-lists__td-date">
<a href="/m/sonic_the_hedgehog_2020">$58M</a>
</td>
</tr>
<tr>
<td class="media-lists__td-rating">
<a href="/m/birds_of_prey_2020">
<span class="icon icon--tiny icon__certified_fresh"></span>
<span>78%</span>
</a>
</td>
<td class="media-lists__td-title">
<a href="/m/birds_of_prey_2020">Birds of Prey (And the Fantabulous Emancipation of One Harley Quinn)</a>
</td>
<td class="media-lists__td-date">
<a href="/m/birds_of_prey_20
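
The same ideas can be tried offline on a small hand-written snippet. The sketch below is only an illustration: the HTML string is made up (modeled on the structure printed above), the variable names are hypothetical, and `"html.parser"` is passed explicitly as an assumption so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A made-up snippet modeled on the "Top Box Office" section above.
snippet = """
<section id="top-box-office" class="media-lists__category-section media-lists__top-box-office">
  <table>
    <tr>
      <td class="media-lists__td-title"><a href="/m/sonic_the_hedgehog_2020">Sonic the Hedgehog</a></td>
      <td class="media-lists__td-date"><a href="/m/sonic_the_hedgehog_2020">$58M</a></td>
    </tr>
  </table>
</section>
"""

demo_soup = BeautifulSoup(snippet, "html.parser")

# find_all always returns a list-like result, even for a single match.
sections = demo_soup.find_all(id="top-box-office")
print(len(sections))

# find returns the first matching tag directly (or None if nothing matches).
section = demo_soup.find(id="top-box-office")
print(section.name)      # the element name
print(section["id"])     # an attribute with a single value
print(section["class"])  # "class" can hold several values, so bs4 returns a list
```

Using `find` instead of `find_all(...)[0]` is a matter of taste; `find_all` makes it easier to check that an attribute really is unique.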

In [6]:
# Another important note: An attribute may have more than one value.
# For example: <span class="icon icon--tiny icon__certified_fresh"></span>
# In this instance, the "class" attribute has 3 values: icon, icon--tiny, & icon__certified_fresh.

# These would all find our example span tag given above:
# (the output is omitted)
soup.find_all(class_="icon")
soup.find_all(class_="icon--tiny")
soup.find_all(class_="icon__certified_fresh")

# To see how many tags we find:
print(len(soup.find_all(class_="icon")))
print(len(soup.find_all(class_="icon--tiny")))
print(len(soup.find_all(class_="icon__certified_fresh")))

# But if you search on a string with multiple values, it must match the 
# full class attribute exactly, with the values in the same order:
print(soup.find_all(class_="icon icon--tiny icon__certified_fresh")) # works
print(soup.find_all(class_="icon icon--tiny")) # doesn't work, not complete
print(soup.find_all(class_="icon--tiny icon icon__certified_fresh")) # doesn't work, wrong order

# If you'd like, you can also search on multiple attributes:
our_attributes_dict = {"class": "media-lists__top-box-office", "id": "top-box-office"}
another_box_office_html = soup.find_all(attrs=our_attributes_dict) 
print(len(another_box_office_html))
print(another_box_office_html[0].prettify()[:80])

40
40
6
[<span class="icon icon--tiny icon__certified_fresh"></span>, <span class="icon icon--tiny icon__certified_fresh"></span>, <span class="icon icon--tiny icon__certified_fresh"></span>, <span class="icon icon--tiny icon__certified_fresh"></span>, <span class="icon icon--tiny icon__certified_fresh"></span>, <span class="icon icon--tiny icon__certified_fresh"></span>]
[]
[]
1
<section class="media-lists__category-section media-lists__top-box-office" id="t
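
If the order of the class values should not matter, one option is a CSS selector via `select`. This is a minimal sketch on a made-up two-tag snippet, not something the project requires; the snippet and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Two made-up span tags that share some, but not all, class values.
snippet = """
<span class="icon icon--tiny icon__certified_fresh"></span>
<span class="icon icon--tiny icon__fresh"></span>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# A string with several classes must match the whole attribute exactly,
# so this partial string matches neither tag.
exact = demo_soup.find_all(class_="icon icon--tiny")

# A CSS selector lists the required classes; their order does not matter
# and extra classes on the tag are allowed.
both = demo_soup.select("span.icon--tiny.icon")
certified_only = demo_soup.select("span.icon.icon__certified_fresh")

print(len(exact), len(both), len(certified_only))
```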


In [7]:
# Back to the topic. We have box_office_html, and we are looking to extract a list of movie names and 
# a list of $$ values.

# Now we need to look for patterns in the structure that may let us separate the data.
# For instance, it looks like the $$ values are in "a" tags that are in "td" tags 
# whose class attribute is "media-lists__td-date".

# We will first obtain the "td" tags in a list
money = box_office_html[0].find_all("td", class_="media-lists__td-date")

# Now let's get the "a" tags from each of the "td" tags
money_list = [td.find("a").text for td in money]

# or
money_list = [td.a.text for td in money]
print(money_list)

# Excellent.

# Now for the movie names (and let's also add the id's)
names = box_office_html[0].find_all("td", class_="media-lists__td-title")
names = [name.a.text for name in names]
print(names)

ids = box_office_html[0].find_all("td", class_="media-lists__td-title")

# note that to get the attribute value itself, you can access it like a 
# dict
ids = [i.a['href'] for i in ids] 
print(ids)

['$58M', '$17.3M', '$12.3M', '$12.3M', '$11.6M', '$8.2M', '$5.8M', '$5.7M', '$5M', '$4.7M']
['Sonic the Hedgehog', 'Birds of Prey (And the Fantabulous Emancipation of One Harley Quinn)', 'Fantasy Island', 'The Photograph', 'Bad Boys for Life', '1917', 'Parasite (Gisaengchung)', 'Jumanji: The Next Level', 'Dolittle', 'Downhill']
['/m/sonic_the_hedgehog_2020', '/m/birds_of_prey_2020', '/m/fantasy_island_2020', '/m/the_photograph_2020', '/m/bad_boys_for_life', '/m/1917_2019', '/m/parasite_2019', '/m/jumanji_the_next_level', '/m/dolittle', '/m/downhill_2020']
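
The $$ values come back as strings such as `'$58M'`. If numbers are needed later, a conversion along these lines could work. This is a hedged sketch: the function name `parse_box_office` is hypothetical, and the `K` and `B` suffixes are assumptions (only `M` appears in the output above).

```python
def parse_box_office(amount: str) -> int:
    """Convert a string such as '$17.3M' into a whole number of dollars."""
    # Only "M" was seen on the page; "K" and "B" are assumed for completeness.
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

    cleaned = amount.strip().lstrip("$").replace(",", "")
    suffix = cleaned[-1].upper() if cleaned else ""

    if suffix in multipliers:
        # round() guards against tiny floating point errors like 17.3 * 1e6
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return round(float(cleaned))


money_list = ['$58M', '$17.3M', '$12.3M']
print([parse_box_office(m) for m in money_list])
```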


In [8]:
# Great. Let's put this together in a function
from typing import List, Tuple

# The -> Tuple[List[str], List[str], List[str]] part is called type hinting.
# It is completely optional, and tells the user
# exactly what types are returned. This makes it 
# easier for users who have to read the code, as
# well as better for documentation and IDEs. You 
# can read more: https://stackoverflow.com/questions/32557920/what-are-type-hints-in-python-3-5
def get_box_office() -> Tuple[List[str], List[str], List[str]]:
    """
    Scrape the Top Box Office: movie names, $ values,
    and the end of the links to the movie pages. Return
    them as 3 lists.
    """
    # GET the rottentomatoes home page
    html = requests.get('https://www.rottentomatoes.com/')

    # Create the bs4 parser
    soup = bsoup(html.text)

    # Get the enclosing box office "section" tag from rottentomatoes.com
    box_office_html = soup.find_all(id="top-box-office")

    # Isolate and parse the $
    money = box_office_html[0].find_all("td", class_="media-lists__td-date")
    money = [td.a.text for td in money]

    # Isolate and parse the movie names
    names = box_office_html[0].find_all("td", class_="media-lists__td-title")
    names = [name.a.text for name in names]
    
    # Isolate and parse the ids
    ids = box_office_html[0].find_all("td", class_="media-lists__td-title")
    ids = [i.a['href'] for i in ids] 

    return names, money, ids

# Test things out
# (unpack in the same order the function returns: names, money, ids)
names, money, ids = get_box_office()
print(names, money, ids)

['Sonic the Hedgehog', 'Birds of Prey (And the Fantabulous Emancipation of One Harley Quinn)', 'Fantasy Island', 'The Photograph', 'Bad Boys for Life', '1917', 'Parasite (Gisaengchung)', 'Jumanji: The Next Level', 'Dolittle', 'Downhill'] ['$58M', '$17.3M', '$12.3M', '$12.3M', '$11.6M', '$8.2M', '$5.8M', '$5.7M', '$5M', '$4.7M'] ['/m/sonic_the_hedgehog_2020', '/m/birds_of_prey_2020', '/m/fantasy_island_2020', '/m/the_photograph_2020', '/m/bad_boys_for_life', '/m/1917_2019', '/m/parasite_2019', '/m/jumanji_the_next_level', '/m/dolittle', '/m/downhill_2020']
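
The goal stated at the start was a list of tuples (movie name, box office $$), while the function returns separate lists. One way to pair them up is `zip`. The sketch below uses a few values copied from the output above rather than a live download.

```python
# A few values copied from the printed output above.
names = ['Sonic the Hedgehog', 'Fantasy Island', 'The Photograph']
money = ['$58M', '$12.3M', '$12.3M']

# zip pairs the first name with the first amount, the second with the second, ...
name_money_pairs = list(zip(names, money))
print(name_money_pairs)

# Each tuple can be unpacked while looping.
for name, amount in name_money_pairs:
    print(f"{name}: {amount}")
```

This relies on both lists having the same length and order, which holds here because they are built from the same table rows.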


In [9]:
# Let's make it return a pandas dataframe instead
def get_box_office() -> pd.DataFrame:
    """
    Scrape the Top Box Office: movie names, $ values,
    and the end of the links to the movie pages. Return
    them as a pandas DataFrame with one row per movie.
    """
    # GET the rottentomatoes home page
    html = requests.get('https://www.rottentomatoes.com/')

    # Create the bs4 parser
    soup = bsoup(html.text)

    # Get the enclosing box office "section" tag from rottentomatoes.com
    box_office_html = soup.find_all(id="top-box-office")

    # Isolate and parse the $
    money = box_office_html[0].find_all("td", class_="media-lists__td-date")
    money = [td.a.text for td in money]

    # Isolate and parse the movie names
    names = box_office_html[0].find_all("td", class_="media-lists__td-title")
    names = [name.a.text for name in names]
    
    # Isolate and parse the ids
    ids = box_office_html[0].find_all("td", class_="media-lists__td-title")
    ids = [i.a['href'] for i in ids] 

    return pd.DataFrame(data = {"names": names, "box_office": money, "rt_id": ids})

print(get_box_office().head())

                                               names box_office  \
0                                 Sonic the Hedgehog       $58M   
1  Birds of Prey (And the Fantabulous Emancipatio...     $17.3M   
2                                     Fantasy Island     $12.3M   
3                                     The Photograph     $12.3M   
4                                  Bad Boys for Life     $11.6M   

                        rt_id  
0  /m/sonic_the_hedgehog_2020  
1       /m/birds_of_prey_2020  
2      /m/fantasy_island_2020  
3      /m/the_photograph_2020  
4        /m/bad_boys_for_life  
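
Because the home page changes over time, it can help to separate downloading from parsing so the parsing step can be checked against a saved snippet. The following is a sketch under assumptions: `parse_box_office_rows` is a hypothetical helper, the sample HTML is made up to mirror the structure seen earlier, and `"html.parser"` is chosen explicitly.

```python
from typing import Dict, List

from bs4 import BeautifulSoup


def parse_box_office_rows(html_text: str) -> List[Dict[str, str]]:
    """Return one dict per movie row found in the "top-box-office" section."""
    soup = BeautifulSoup(html_text, "html.parser")
    section = soup.find(id="top-box-office")
    if section is None:
        # The layout changed or the download failed: nothing to parse.
        return []

    rows = []
    for tr in section.find_all("tr"):
        title = tr.find("td", class_="media-lists__td-title")
        amount = tr.find("td", class_="media-lists__td-date")
        if title is None or amount is None or title.a is None or amount.a is None:
            continue  # skip rows that do not have the expected cells
        rows.append({
            "names": title.a.text.strip(),
            "box_office": amount.a.text.strip(),
            "rt_id": title.a["href"],
        })
    return rows


# Made-up sample mirroring the structure of the real section.
sample = """
<section id="top-box-office">
  <table>
    <tr>
      <td class="media-lists__td-title"><a href="/m/sonic_the_hedgehog_2020">Sonic the Hedgehog</a></td>
      <td class="media-lists__td-date"><a href="/m/sonic_the_hedgehog_2020">$58M</a></td>
    </tr>
  </table>
</section>
"""
rows = parse_box_office_rows(sample)
print(rows)
```

A list of dicts like this can be passed straight to `pd.DataFrame(rows)` to get the same three columns as above.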


<font color="red">
    
**!!!**

**You do not need the following examples to solve the project, however, they are included in case you are interested, or would like to see more.**

**!!!**
</font>

In [None]:
# That's pretty neat. You could imagine putting that into a script
# and making the script run every day to update a database, or 
# append to an excel file.

# A lot of times you will need to use scraping in order to collect data to do
# an analysis or make a comparison. What if we wanted to do the latter?

# Take a look at these two links:
# https://www.rottentomatoes.com/browse/dvd-streaming-all?minTomato=95&maxTomato=100&services=netflix_iw
# https://www.rottentomatoes.com/browse/dvd-streaming-all?minTomato=95&maxTomato=100&services=amazon_prime

# By clicking around a website and carefully observing the URL, you can figure out the patterns
# that the site uses, and use those patterns to request exactly the information that you want. 
# Let's write a function that accepts 3 arguments: minTomato, maxTomato, and service, and returns
# a list of link ids for every qualifying movie.
from typing import List

def get_movie_links(minTomato: int, maxTomato: int, service: str) -> List[str]:
    url = None
    if service.lower() == "netflix":
        url = f'https://www.rottentomatoes.com/browse/dvd-streaming-all?minTomato={minTomato}&maxTomato={maxTomato}&services=netflix_iw'
    elif service.lower() == "amazon":
        url = f'https://www.rottentomatoes.com/browse/dvd-streaming-all?minTomato={minTomato}&maxTomato={maxTomato}&services=amazon_prime'
    else:
        raise ValueError(f'Service {service} not found.')
    # (unfinished: we still need to download the url and collect the links)

    
# Ok, good start. But how do we get all the movies to show, not just the first page?
# If you inspect the "Show More" button, you can see that rottentomatoes is being 
# clever and not showing us how to modify the url to show more. Luckily,
# if you open the developer tools, switch to the "Network" tab, and click "Show More", you can see
# that a request is made to a different url, rottentomatoes.com/api/private/v2.0/...
# This is what we are looking for. They have an API that returns already
# organized data. Let's figure this out.

# https://www.rottentomatoes.com/api/private/v2.0/browse?minTomato=95&maxTomato=100&services=amazon_prime&type=dvd-streaming-all&page=1
# If you go to that url, you can see that they are returning structured JSON data, already organized! Looks like only 32 results at a time.
# We can handle this type of structured data too!
html = requests.get('https://www.rottentomatoes.com/api/private/v2.0/browse?minTomato=95&maxTomato=100&services=amazon_prime&type=dvd-streaming-all&page=1')

# Instead of using html.text, let's use html.json()
print(type(html.json()))

# A dict is easy to navigate, great.
json = html.json()

# 3 keys: counts, results, and debugUrl
print(json.keys())

# Let's look at counts.
print(json['counts'])

# That is useful. It tells us the total # of results for our query. We could use this 
# to calculate the page #'s.
print(type(json['results'])) # hmm, it's a list

print(len(json['results'])) # oh okay, the length is the number of movies, 32

# Let's see if each movie is a dict
print(type(json['results'][0]))

# Great. Let's see what information we have for each movie.
print(json['results'][0].keys())

# Excellent. Since we have all of the movie information already here, we don't need to messily scrape each
# movie's webpage for more information. Let's use this API instead. I want to know how netflix and amazon 
# compare when looking at movies with 95+ tomatoScore. Specifically, how many reviews per movie on avg?
json['results'][0]['tomatoScore']

# Wait, I'm not seeing how many reviews each movie has. Time to step back and see if we can get that info.
# Let's see how we could find this number for WALL-E: https://www.rottentomatoes.com/m/wall_e
# Open up the browser console (usually ctrl+shift+j), and navigate to that page and investigate the 
# requests to see if we can find an API again.
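
The URL pattern observed above can also be assembled from a dict of parameters instead of one long f-string. This is a sketch only: `build_browse_url` and `SERVICE_CODES` are hypothetical names, and the endpoint and parameter names are simply the ones seen in the browser above, so they may change or disappear.

```python
from urllib.parse import urlencode

# Endpoint and service codes as observed above (not an official, documented API).
BROWSE_API = "https://www.rottentomatoes.com/api/private/v2.0/browse"
SERVICE_CODES = {"netflix": "netflix_iw", "amazon": "amazon_prime"}


def build_browse_url(min_tomato: int, max_tomato: int, service: str, page: int = 1) -> str:
    """Build the browse API url for one page of results."""
    key = service.lower()
    if key not in SERVICE_CODES:
        raise ValueError(f"Service {service} not found.")

    params = {
        "minTomato": min_tomato,
        "maxTomato": max_tomato,
        "services": SERVICE_CODES[key],
        "type": "dvd-streaming-all",
        "page": page,
    }
    # urlencode joins the pairs as key=value&key=value and escapes special characters
    return f"{BROWSE_API}?{urlencode(params)}"


print(build_browse_url(95, 100, "amazon"))
```

`requests.get` also accepts such a dict through its `params` argument, which avoids building the query string by hand.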

In [None]:
# So far: https://www.rottentomatoes.com/m/wall_e/reviews, so it looks like they just tack /reviews 
# to the end of the regular movie link. Looks like 20 reviews per page, and 1-20 on the last page.
# Let's write a function to count the number of reviews on a page.

# First we need to investigate.
html = requests.get("https://www.rottentomatoes.com/m/wall_e/reviews?type=&sort=&page=1")
soup = bsoup(html.text)

# Let's right click and inspect one of those rows with a review.
# The whole table of reviews is inside: <div class="review_table"></div>
# Each review lies inside: <div class="row review_table_row"></div>
# That's convenient. We can just pull the entire table and count the 
# rows.
def count_reviews(url:str) -> int:
    html = requests.get(url)
    soup = bsoup(html.text)

    table = soup.find_all(class_="review_table")

    # Let's ensure that we got at most 1
    if len(table) > 1:
        raise ValueError("Retrieved more than one div tag with class review_table.")
    
    if len(table) < 1:
        return 0

    return len(table[0].find_all(class_="row review_table_row"))

# Ok, let's test it
print(count_reviews("https://www.rottentomatoes.com/m/wall_e/reviews?type=&sort=&page=1"))
print(count_reviews("https://www.rottentomatoes.com/m/wall_e/reviews?type=&sort=&page=13"))

# Perfect. Now let's write a function that counts the reviews for a movie, 
# given the rt_id thing: /m/something
def count_movie_reviews(rt_id: str) -> int:
    url = f'https://www.rottentomatoes.com{rt_id}/reviews'
    html = requests.get(url)
    soup = bsoup(html.text)

    # Get page count... wait, we didn't cover this yet.

# Navigate to: https://www.rottentomatoes.com/m/wall_e/reviews
# right click and inspect the part that says page 1 of X.
# The class attribute is pageInfo
html = requests.get("https://www.rottentomatoes.com/m/wall_e/reviews")
soup = bsoup(html.text)

# There are 2 which makes sense. They are identical though.
print(soup.find_all(class_="pageInfo"))

# Extract the values of the tag
print(soup.find(class_="pageInfo").string)

# The number we want is always after "of", unless of course
# there is only a single page. If that is the case, just
# set the last_page to 1
soup.find(class_="pageInfo").string.split("of")[1].strip()

# Ok, back to the function:
def count_movie_reviews(rt_id: str) -> int:
    url = f'https://www.rottentomatoes.com{rt_id}/reviews'
    html = requests.get(url)
    soup = bsoup(html.text)

    # Get page count. Note that this "try" "except" stuff isn't important
    # for now. Just know that python "tries" to do what is in the "try"
    # block. If it encounters one of the listed exceptions, it runs the 
    # code in the "except" block.
    try:
        last_page = soup.find(class_="pageInfo").string.split("of")[1].strip()
    except (AttributeError, IndexError):
        # No pageInfo tag, or no "of" in its text: assume a single page
        last_page = 1

    url = f'https://www.rottentomatoes.com{rt_id}/reviews?page={last_page}'

    # Count the reviews
    last_page_count = count_reviews(url)

    return last_page_count + 20*(int(last_page)-1)

# Let's test it out
print(count_movie_reviews("/m/wall_e"))
print(count_movie_reviews("/m/frozen_2013"))
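
The arithmetic inside `count_movie_reviews` can be checked without touching the network. The sketch below isolates the two steps: reading X from text like "Page 1 of X", and combining the page count with the size of the last page. The helper names are hypothetical, and the example numbers (13 pages, 7 reviews on the last page) are made up for illustration.

```python
import re
from typing import Optional

REVIEWS_PER_PAGE = 20  # taken from the observation above


def parse_last_page(page_info: Optional[str]) -> int:
    """Extract X from text such as 'Page 1 of X'; fall back to 1 page."""
    match = re.search(r"of\s+(\d+)", page_info or "")
    return int(match.group(1)) if match else 1


def total_reviews(last_page: int, last_page_count: int) -> int:
    """Every page before the last is full; only the last page may hold fewer."""
    return REVIEWS_PER_PAGE * (last_page - 1) + last_page_count


print(parse_last_page("Page 1 of 13"))
print(total_reviews(13, 7))
```

Using a regular expression here is a design choice, not the only option; it avoids the exceptions that `split("of")[1]` raises when the text is missing.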

In [None]:
# Let's use our new function to compare the number of reviews for Netflix and Amazon 
# movies that have 95+ rating.
amazonHTML = requests.get('https://www.rottentomatoes.com/api/private/v2.0/browse?minTomato=95&maxTomato=100&services=amazon_prime&type=dvd-streaming-all&page=1')
netflixHTML = requests.get('https://www.rottentomatoes.com/api/private/v2.0/browse?minTomato=95&maxTomato=100&services=netflix_iw&type=dvd-streaming-all&page=1')

amazon = amazonHTML.json()
netflix = netflixHTML.json()

# Remember to get the rt_id we want:
print(amazon['results'][0]['url'])

# and to get page #'s
from math import ceil
print(ceil(amazon['counts']['total']/amazon['counts']['count']))

# Loop through all of the pages of amazons 95+ movies
amazonLinks = []
for page in range(1, ceil(amazon['counts']['total']/amazon['counts']['count'])+1):
    html = requests.get(f'https://www.rottentomatoes.com/api/private/v2.0/browse?minTomato=95&maxTomato=100&services=amazon_prime&type=dvd-streaming-all&page={page}')
    json = html.json()

    # Loop through all of the 32 or fewer movies
    for result in json['results']:
        amazonLinks.append(result['url'])

# Do the same for netflix (note that the page count comes from netflix's own counts)
netflixLinks = []
for page in range(1, ceil(netflix['counts']['total']/netflix['counts']['count'])+1):
    html = requests.get(f'https://www.rottentomatoes.com/api/private/v2.0/browse?minTomato=95&maxTomato=100&services=netflix_iw&type=dvd-streaming-all&page={page}')
    json = html.json()

    # Loop through all of the 32 or fewer movies
    for result in json['results']:
        netflixLinks.append(result['url'])
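
Since the two loops differ only in the service, the paging logic could be pulled into one helper. This is a sketch, assuming the JSON keeps the `counts` and `results` layout explored earlier; `collect_links` and `fake_fetch` are hypothetical names, and the fake data stands in for real downloads so the logic can be tried offline.

```python
from math import ceil
from typing import Callable, Dict, List


def collect_links(fetch_page: Callable[[int], Dict]) -> List[str]:
    """Collect result['url'] from every page.

    fetch_page(page) must return the decoded JSON dict for that page.
    """
    first = fetch_page(1)
    total = first["counts"]["total"]
    per_page = first["counts"]["count"]
    n_pages = ceil(total / per_page) if per_page else 1

    links = [result["url"] for result in first["results"]]
    for page in range(2, n_pages + 1):
        links.extend(result["url"] for result in fetch_page(page)["results"])
    return links


# Made-up stand-in for the API: 5 movies, 2 per page, so 3 pages.
def fake_fetch(page: int) -> Dict:
    pages = {1: ["/m/a", "/m/b"], 2: ["/m/c", "/m/d"], 3: ["/m/e"]}
    return {
        "counts": {"total": 5, "count": 2},
        "results": [{"url": url} for url in pages[page]],
    }


print(collect_links(fake_fetch))
```

With real data, `fetch_page` would be a small function that downloads the API url for the given page and returns `.json()`.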

In [None]:
# Now we need to cycle through and use our count_movie_reviews function

amazonReviewCount = 0
netflixReviewCount = 0
for link in amazonLinks:
    amazonReviewCount += count_movie_reviews(link)

for link in netflixLinks:
    netflixReviewCount += count_movie_reviews(link)

print(amazonReviewCount/len(amazonLinks))
print(netflixReviewCount/len(netflixLinks))
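
The averaging step divides by the number of links, which fails if a service returns no movies. A small helper can guard against that. This is a sketch with made-up review counts; `average_reviews` is a hypothetical name, and the counting function is passed in so the example runs without downloading anything.

```python
from statistics import mean
from typing import Callable, Dict, List


def average_reviews(
    links_by_service: Dict[str, List[str]],
    count_fn: Callable[[str], int],
) -> Dict[str, float]:
    """Average review count per movie for each service (NaN if no movies)."""
    averages = {}
    for service, links in links_by_service.items():
        counts = [count_fn(link) for link in links]
        averages[service] = mean(counts) if counts else float("nan")
    return averages


# Made-up counts standing in for count_movie_reviews.
fake_counts = {"/m/a": 100, "/m/b": 300, "/m/c": 50}
result = average_reviews(
    {"amazon": ["/m/a", "/m/b"], "netflix": ["/m/c"], "other": []},
    fake_counts.get,
)
print(result)
```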