## Project: Web Scraping Top 2019 Movies at the Worldwide Box Office

In this project, I'll walk through how to **scrape** a **table from an HTML webpage**. Familiarity with this procedure helps data professionals **fetch tabular data from the internet without worrying about distorted formatting or omitted table elements**. Scraping works for tables of any size, but it is most useful for tables with many rows and columns, which are hard to copy and paste all at once (and doing so is not advisable).

Reminder: If you wish to reuse this code to scrape a table from another webpage, inspect the site first by pressing **F12** or by right-clicking and selecting **Inspect** from the menu (for Chrome users). Inspecting the site structure shows you how the page is built and whether the table was created using HTML, JavaScript, or other components. Static tables are usually written directly in HTML (look for tags enclosed in `<>`, such as `<table>`, to confirm), while interactive and dynamic tables are likely rendered using JavaScript. This sample notebook works with an HTML-created table.
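As a rough complement to inspecting the page by hand, the same check can be sketched in code: parse the raw HTML and see whether it already contains a `<table>` with rows. The helper name `has_static_table` and the two sample pages are made up for illustration; a page that returns `False` here may still build its table with JavaScript after loading.

```python
from bs4 import BeautifulSoup


def has_static_table(html: str) -> bool:
    """Return True if the raw HTML contains at least one <table> with a row."""
    soup = BeautifulSoup(html, "html.parser")
    return any(table.find("tr") is not None for table in soup.find_all("table"))


# Hypothetical pages: one with a table in the HTML, one rendered by a script
static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
dynamic_page = "<html><body><div id='app'></div><script>renderTable()</script></body></html>"

print(has_static_table(static_page))   # True
print(has_static_table(dynamic_page))  # False
```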

## Step 1: Install & Import Libraries and Modules

In [6]:
#!pip install bs4 # For installing the library where BeautifulSoup package is stored 
from bs4 import BeautifulSoup # For parsing the HTML document 
import requests # For making HTTP requests 
import pandas as pd # For data structure and manipulation

## Step 2: Define Data Source and HTML Parser 

In [7]:
# Store the URL of the page that contains the table to scrape
url = 'https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019'

# Request the page and keep the response body as text (the raw HTML)
data = requests.get(url).text

# Parse the retrieved HTML with Python's built-in HTML parser
soup = BeautifulSoup(data, 'html.parser')

## Step 3: Find Table Class and Assign Table Index

In [13]:
# Find all tables in the parsed document 'soup'
tables = soup.find_all('table')

# Iterate over 'tables' to locate the table that contains "Worldwide". Notice that there is only one table on the webpage,
# but this step is kept for future reference, for when a webpage has more than one table.
table_index = None  # Stays None if no table matches
for index, table in enumerate(tables):
    if "Worldwide" in str(table):
        table_index = index
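The lookup above can be tried offline on a small inline page. This is only a sketch: `find_table_index` and `sample_html` are hypothetical, the function searches the visible text of each table rather than its full markup, and it returns the first match (or `None`), whereas the loop above keeps the last match.

```python
from bs4 import BeautifulSoup


def find_table_index(tables, keyword):
    """Return the index of the first table whose text contains keyword, else None."""
    for index, table in enumerate(tables):
        if keyword in table.get_text():
            return index
    return None


# Hypothetical page with two tables; only the second mentions "Worldwide"
sample_html = """
<table><tr><th>Year</th></tr><tr><td>2019</td></tr></table>
<table><tr><th>Worldwide Box Office</th></tr><tr><td>$2,797,800,564</td></tr></table>
"""
sample_tables = BeautifulSoup(sample_html, "html.parser").find_all("table")

print(find_table_index(sample_tables, "Worldwide"))  # 1
print(find_table_index(sample_tables, "Budget"))     # None
```

Returning `None` makes the "no matching table" case explicit, so it can be checked before indexing into the list of tables.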

## Step 4: Create Output Table

In [None]:
# Define the output column names exactly as they appear in the target table
columns = ["Rank", "Movie", "Worldwide Box Office", "Domestic Box Office", "International Box Office", "Domestic"]

# Iterate over the table rows ('tr'), read the cells ('td') in column order, and collect each row as a dictionary.
# Rows are gathered in a list and converted in one step, which also works on pandas 2.0+,
# where DataFrame.append is no longer available.
rows = []
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col:  # Skip rows without 'td' cells, such as header rows
        rows.append({
            "Rank": col[0].text,
            "Movie": col[1].text,
            "Worldwide Box Office": col[2].text.strip(),
            "Domestic Box Office": col[3].text.strip(),
            "International Box Office": col[4].text.strip(),
            "Domestic": col[5].text,
        })

# Create the output table 'boxoffice' as a dataframe from the collected rows
boxoffice = pd.DataFrame(rows, columns=columns)
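To experiment with the row-extraction step without requesting the live page, the same idea can be run on a tiny inline table. This is a sketch under assumptions: `sample_html` is a hand-written stand-in for the real page, limited to three columns, and it reuses two rows from the output shown in this notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the scraped page (not the real markup of the site)
sample_html = """
<table>
  <thead><tr><th>Rank</th><th>Movie</th><th>Worldwide Box Office</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Avengers: Endgame</td><td> $2,797,800,564 </td></tr>
    <tr><td>2</td><td>The Lion King</td><td> $1,654,367,425 </td></tr>
  </tbody>
</table>
"""
columns = ["Rank", "Movie", "Worldwide Box Office"]
table = BeautifulSoup(sample_html, "html.parser").find("table")

# Collect one dictionary per body row, pairing each cell with its column name
records = []
for row in table.tbody.find_all("tr"):
    cells = row.find_all("td")
    if cells:
        records.append({name: cell.get_text(strip=True) for name, cell in zip(columns, cells)})

sample_df = pd.DataFrame(records, columns=columns)
print(sample_df)
```

All values come out as strings, including the dollar amounts, so any numeric analysis would need a separate conversion step.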

In [5]:
# View the output table
boxoffice

Unnamed: 0,Rank,Movie,Worldwide Box Office,Domestic Box Office,International Box Office,Domestic
0,1,Avengers: Endgame,"$2,797,800,564","$858,373,000","$1,939,427,564",30.68%
1,2,The Lion King,"$1,654,367,425","$543,638,043","$1,110,729,382",32.86%
2,3,Frozen II,"$1,446,925,396","$477,373,578","$969,551,818",32.99%
3,4,Spider-Man: Far From Home,"$1,131,113,066","$390,532,085","$740,580,981",34.53%
4,5,Captain Marvel,"$1,129,727,388","$426,829,839","$702,897,549",37.78%
...,...,...,...,...,...,...
95,96,ek-si-teu,"$67,044,017","$478,949","$66,565,068",0.71%
96,97,Happy Death Day 2U,"$64,686,515","$28,148,130","$36,538,385",43.51%
97,98,Eiga Doraemon: Nobita no Getsumen Tansaki,"$63,191,904",,"$63,191,904",
98,99,Cold Pursuit,"$62,599,159","$32,138,862","$30,460,297",51.34%


In [6]:
# Extra: Apply table styling
boxoffice = boxoffice.style
boxoffice.set_caption("Top 2019 Movies at the Worldwide Box Office").set_table_styles([{'selector':'caption','props': [('color', '#2d477a'),\
            ('font-size', '20px'),('font-weight', 'bold'),('text-align','center')]}, {'selector':'th.col_heading','props': [('background-color', '#1877ad'),\
            ('color','white'),('text-align', 'center')]},{'selector': 'td', 'props': [('text-align', 'left')]}], overwrite=True)

Unnamed: 0,Rank,Movie,Worldwide Box Office,Domestic Box Office,International Box Office,Domestic
0,1,Avengers: Endgame,"$2,797,800,564","$858,373,000","$1,939,427,564",30.68%
1,2,The Lion King,"$1,654,367,425","$543,638,043","$1,110,729,382",32.86%
2,3,Frozen II,"$1,446,925,396","$477,373,578","$969,551,818",32.99%
3,4,Spider-Man: Far From Home,"$1,131,113,066","$390,532,085","$740,580,981",34.53%
4,5,Captain Marvel,"$1,129,727,388","$426,829,839","$702,897,549",37.78%
5,6,Toy Story 4,"$1,073,080,329","$434,038,008","$639,042,321",40.45%
6,7,Star Wars: The Rise of Skywalker,"$1,072,848,487","$515,202,542","$557,645,945",48.02%
7,8,Joker,"$1,072,507,517","$335,451,311","$737,056,206",31.28%
8,9,Aladdin,"$1,046,649,706","$355,559,216","$691,090,490",33.97%
9,10,Jumanji: The Next Level,"$800,128,637","$316,831,246","$483,297,391",39.60%


#### Note that `boxoffice.style` returns a Styler object rather than a dataframe. After the reassignment above, you can no longer apply data manipulation or exploratory data analysis to `boxoffice` directly.

#### To get the dataframe back, use the Styler's `data` attribute: `boxoffice1 = boxoffice.data`