## Worldwide Box Office collection, 2010 to 2022

Introduction:
- Web scraping is the process of extracting data from websites, typically by downloading pages over the Hypertext Transfer Protocol (HTTP) and parsing their HTML.
- Box Office Mojo is an American website that tracks box-office revenue in a systematic, algorithmic way. The goal here is to collect its worldwide box office figures for each year from 2010 to 2022.
- I used the Python programming language with the requests, Beautiful Soup and pandas libraries for this web scraping project.

Here are the steps we'll follow:

* We're going to scrape [Box Office Mojo](https://www.boxofficemojo.com/year/world/?ref_=bo_nb_cso_tab)
* We'll get a list of box office collections. For each movie, we'll get the title, worldwide collection, domestic collection, percentage of domestic collection, foreign collection, percentage of foreign collection and release year.
* We'll get 200 movies for each year
* We'll create a CSV file for each year and then merge them into one CSV file


In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import os

## Scrape box office collections from the Box Office Mojo

The approach is:

- Use requests to download the page
- Use BS4 to parse and extract information from the page
- Convert into a pandas data frame 

This function is used to get the HTML content of the Box Office Mojo website using the requests library. The HTML content is then parsed using the Beautiful Soup library and returned as a Beautiful Soup object.

In [2]:
def get_box_office_mojo():
    # Download the page
    url = 'https://www.boxofficemojo.com/year/world/?ref_=bo_nb_cso_tab'
    response = requests.get(url)
    # Check response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    # Parse using Beautiful Soup
    doc = bs(response.text, 'html.parser')
    return doc
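
A slightly more defensive variant of the download step could look like the sketch below. The function name `fetch_html`, the 10 second timeout and the `User-Agent` string are illustrative assumptions and not part of the original notebook. It relies on `raise_for_status()` from requests instead of comparing the status code by hand, and it accepts an optional session so the function can be tried without network access.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors or timeouts."""
    # `session` is injectable so the function can be exercised offline
    session = session or requests.Session()
    response = session.get(
        url,
        headers={'User-Agent': 'box-office-notebook/0.1'},  # hypothetical identifier
        timeout=timeout,
    )
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

The returned text can be passed to `bs(text, 'html.parser')` exactly as in `get_box_office_mojo()`.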

In [3]:
# Store the parsed page in a variable
doc = get_box_office_mojo()

In [16]:
title_class = 'a-text-left mojo-field-type-release_group'
title = doc.find_all('td', {'class': title_class})

In [15]:
title[0]

<td class="a-text-left mojo-field-type-release_group"><a class="a-link-normal" href="/releasegroup/gr817189381/?ref_=bo_ydw_table_1">M3GAN</a></td>

### Getting the years and year URLs from the `option` tags

To get the years and their URLs, we can use the `option` tags, whose `value` attributes contain the links.

![](https://imgur.com/6BdMZEk.png)

The base URL for the Box Office Mojo website is assigned to a variable called `base_url`. This variable can be used later to access different pages on the website by appending the specific path of the page to the base URL. For example, we could use `base_url + '/year/world/2022/'` to access the page for worldwide box office collections in 2022.

In [26]:
# Box Office Mojo url
base_url = 'https://www.boxofficemojo.com'
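
String concatenation works here because each path starts with a slash. As an alternative sketch (not used in the original notebook), `urllib.parse.urljoin` from the standard library resolves site-relative paths and leaves absolute URLs unchanged:

```python
from urllib.parse import urljoin

base_url = 'https://www.boxofficemojo.com'

# A path that starts with "/" is resolved against the site root
relative = urljoin(base_url, '/year/world/2022/')
# An already absolute URL is returned unchanged
absolute = urljoin(base_url, 'https://www.boxofficemojo.com/year/world/2021/')

print(relative)
print(absolute)
```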

The function `get_year_url` takes no input parameters and returns a list of strings containing the URLs for each year from 2010 to 2022. It first uses Beautiful Soup to find all `option` tags in the HTML document and stores them in a list called `year`. Then it creates an empty list called `year_url` and loops over the elements of `year`, appending the base URL plus each element's `value` attribute (a site-relative path) to `year_url`. Finally, the function returns a slice of `year_url` that includes only the URLs for the years 2010 to 2022.

In [23]:
def get_year_url():
    # Find the option tags
    year = doc.find_all('option')
    # Loop over the tags and add each absolute year URL to the list `year_url`
    year_url = []
    for i in year:
        year_url.append(base_url + i['value'])
    # Skip the first option and keep the 13 entries for 2022 down to 2010
    year_urls = year_url[1:14]
    return year_urls

The list contains the URLs for 2022 down to 2010; the first five are shown below.

In [24]:
# Create a variable
year_urls = get_year_url()
# Show the values of the urls
year_urls[:5]

['https://www.boxofficemojo.com/year/world/2022/',
 'https://www.boxofficemojo.com/year/world/2021/',
 'https://www.boxofficemojo.com/year/world/2020/',
 'https://www.boxofficemojo.com/year/world/2019/',
 'https://www.boxofficemojo.com/year/world/2018/']

Similarly, we define a function for the years. It works like the previous function, but instead of reading the `value` attribute it extracts the text of each `option` tag, which is the year. The returned list covers the years from 2022 down to 2010.

In [13]:
def get_year():
    # Find all the option tags
    year = doc.find_all('option')
    # Loop over the tags and add each year to the list `years`
    years = []
    for i in year:
        years.append(i.text)
    # Skip the first option and keep the 13 years from 2022 down to 2010
    years = years[1:14]
    return years

Calling `get_year()` returns the list of years from 2022 down to 2010; the first five are shown below.

In [14]:
# Show the values of years
get_year()[:5]

['2022', '2021', '2020', '2019', '2018']
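
The two helpers above pick years by position in the list of `option` tags. A possible alternative, sketched here under assumptions, is to filter by the year value itself and build the year-to-URL mapping in one pass. The function name `year_url_map` and the sample markup are hypothetical; the real page's `select` element may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may differ
sample_html = '''
<select>
  <option value="/year/world/2023/">2023</option>
  <option value="/year/world/2022/">2022</option>
  <option value="/year/world/2010/">2010</option>
  <option value="/year/world/2009/">2009</option>
</select>
'''


def year_url_map(doc, base_url, first=2010, last=2022):
    """Map each year between `first` and `last` (inclusive) to its absolute URL."""
    result = {}
    for option in doc.find_all('option'):
        text = option.get_text(strip=True)
        # Skip options whose label is not a year in the wanted range
        if text.isdigit() and first <= int(text) <= last:
            result[text] = base_url + option['value']
    return result


doc_sample = BeautifulSoup(sample_html, 'html.parser')
years = year_url_map(doc_sample, 'https://www.boxofficemojo.com')
print(years)
```

Filtering by value keeps the result stable if the site adds a new year at the top of the list.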

### Getting information out of each link

In [158]:
box_office_url = year_urls[3]

In [159]:
box_office_url

'https://www.boxofficemojo.com/year/world/2019/'

In [213]:
response = requests.get(year_urls[3])

In [214]:
response.status_code

200

In [215]:
len(response.text)

233671

In [216]:
box_doc = bs(response.text, 'html.parser')

In [217]:
box_office = box_doc.find_all('tr')
len(box_office)

201

In [218]:
box_office = box_office[1:201]

In [219]:
box_office[0]

<tr><td class="a-text-right mojo-header-column mojo-truncate mojo-field-type-rank mojo-sort-column">1</td><td class="a-text-left mojo-field-type-release_group"><a class="a-link-normal" href="/releasegroup/gr3511898629/?ref_=bo_ydw_table_1">Avengers: Endgame</a></td><td class="a-text-right mojo-field-type-money">$2,797,501,328</td><td class="a-text-right mojo-field-type-money">$858,373,000</td><td class="a-text-right mojo-field-type-percent">30.7%</td><td class="a-text-right mojo-field-type-money">$1,939,128,328</td><td class="a-text-right mojo-field-type-percent">69.3%</td></tr>

In [150]:
td_tags = box_office[0].find_all('td')

In [151]:
td_tags[0].text

'1'

In [152]:
td_tags[1].text

'Avengers: Endgame'

In [153]:
td_tags[2].text

'$2,797,501,328'

In [126]:
td_tags[3].text

'$858,373,000'

In [127]:
td_tags[4].text

'30.7%'

In [128]:
td_tags[5].text

'$1,939,128,328'

In [129]:
td_tags[6].text

'69.3%'

In [31]:
def get_box_office_info(tr_tags):
    td_tags = tr_tags.find_all('td')
    Title = td_tags[1].text
    Worldwide = td_tags[2].text
    Domestic = td_tags[3].text
    Domestic_percent = td_tags[4].text
    Foreign = td_tags[5].text
    Foreign_percent = td_tags[6].text
    return Title, Worldwide, Domestic, Domestic_percent, Foreign, Foreign_percent

In [32]:
get_box_office_info(box_office[0])

('Avengers: Endgame',
 '$2,797,501,328',
 '$858,373,000',
 '30.7%',
 '$1,939,128,328',
 '69.3%')

In [33]:
box_office_dict = {
    'Title': [],
    'Worldwide': [],
    'Domestic': [],
    'Domestic_percent': [],
    'Foreign': [],
    'Foreign_percent': []
}


for i in range(len(box_office)):
    info = get_box_office_info(box_office[i])
    box_office_dict['Title'].append(info[0])
    box_office_dict['Worldwide'].append(info[1])
    box_office_dict['Domestic'].append(info[2])
    box_office_dict['Domestic_percent'].append(info[3])
    box_office_dict['Foreign'].append(info[4])
    box_office_dict['Foreign_percent'].append(info[5])

## Get box office collections from 2022 to 2010

This function is used to download and parse a webpage containing box office data for a particular year. It takes the URL of a year's box office page as input and returns a Beautiful Soup object of the webpage's HTML content. The function uses the bs4 library's BeautifulSoup function to parse the HTML content of the webpage and return the resulting object.

In [18]:
def get_box_office_page(box_year_urls):
    # Download the box office pages
    response = requests.get(box_year_urls)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(box_year_urls))
    # Parse using Beautiful Soup
    box_doc = bs(response.text, 'html.parser')
    return box_doc

The `tr` tags hold the rows of the table on the box office page. Within each row, we find all the `td` tags; each one is a table cell containing a piece of box office information about a movie.

![](https://imgur.com/qqDGAMs.png)

This function, `get_box_office_info()`, takes a Beautiful Soup object representing a row of data in a table as an argument. It then searches for all `td` tags within that row and retrieves the text content of each of these `td` tags. This text is then stored in variables named Title, Worldwide, Domestic, Domestic_percent, Foreign, and Foreign_percent. Finally, these variables are returned as a tuple.

The purpose of this function is to extract specific data points from a table row in the HTML of a webpage. In this case, the data points correspond to information about movies listed on the Box Office Mojo website, such as the title of the movie, its worldwide box office earnings, and its foreign box office earnings.

In [17]:
def get_box_office_info(tr_tags):
    # Find all the td tags in the tr tags
    td_tags = tr_tags.find_all('td')
    # Create variable which contains the box office information
    Title = td_tags[1].text
    Worldwide = td_tags[2].text
    Domestic = td_tags[3].text
    Domestic_percent = td_tags[4].text
    Foreign = td_tags[5].text
    Foreign_percent = td_tags[6].text
    return Title, Worldwide, Domestic, Domestic_percent, Foreign, Foreign_percent
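
Indexing `td_tags[1]` to `td_tags[6]` raises an `IndexError` on any row with fewer cells, such as a header row. The sketch below shows one possible guard. The names `row_to_dict` and `FIELDS` are hypothetical, and the sample row is the 2019 row printed earlier with its `class` attributes trimmed.

```python
from bs4 import BeautifulSoup

# Row taken from the 2019 page output shown earlier (attributes trimmed)
row_html = (
    '<table><tr><td>1</td><td><a href="#">Avengers: Endgame</a></td>'
    '<td>$2,797,501,328</td><td>$858,373,000</td><td>30.7%</td>'
    '<td>$1,939,128,328</td><td>69.3%</td></tr></table>'
)

FIELDS = ('Title', 'Worldwide', 'Domestic', 'Domestic_percent',
          'Foreign', 'Foreign_percent')


def row_to_dict(tr_tag):
    """Return a dict for one table row, or None if the row lacks the expected cells."""
    cells = [td.get_text(strip=True) for td in tr_tag.find_all('td')]
    # Expect one rank cell followed by one cell per field
    if len(cells) != len(FIELDS) + 1:
        return None
    return dict(zip(FIELDS, cells[1:]))


row = BeautifulSoup(row_html, 'html.parser').find('tr')
print(row_to_dict(row))
```

Returning a dict keyed by column name also avoids having to remember which tuple position holds which value.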

In [80]:
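def get_release_year(year_url):
    # Download and parse the page for the given year URL
    year_doc = get_box_office_page(year_url)
    # The page heading carries the year in its first four characters
    heading = year_doc.find('h1', {'class': 'mojo-gutter'})
    # Collect the first four characters of each piece of text in the heading
    release_years = []
    for child in heading:
        release_years.append(child.text[:4])
    return release_years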

In [81]:
year = get_release_year(year_urls[2])
year

['2020']
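
Taking the first four characters works only while the heading starts with the year. A regular expression is one way to make the extraction independent of position. This is a sketch: `extract_year` is a hypothetical helper and the heading strings below are invented for illustration, since the real `h1` text is not shown in this notebook.

```python
import re


def extract_year(heading_text):
    """Return the first four-digit year in the text as a string, or None."""
    match = re.search(r'\b(?:19|20)\d{2}\b', heading_text)
    return match.group(0) if match else None


# Hypothetical heading strings; the real <h1> text may be worded differently
print(extract_year('2020 Worldwide Box Office'))
print(extract_year('Worldwide Box Office for 2020'))
print(extract_year('No year here'))
```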

This function gets the box office information from the given box_doc which is a Beautiful Soup object containing the HTML page of a particular year's box office collections. It first extracts all the `tr` tags from the page and assigns them to the variable "box_office". It then removes the first `tr` tag, which is the header row, by slicing the list from index 1 to 201. This leaves us with a list of tr tags containing the box office information for 200 movies for a particular year.

Next, the function creates an empty dictionary with the keys being the column names of the final data frame and the values being empty lists. It then loops through the list of `tr` tags and for each tag, it calls the `get_box_office_info()` function which returns a tuple containing the box office information for a particular movie. The function then appends this information to the corresponding lists in the dictionary.

Finally, the function converts the dictionary into a pandas data frame.

In [19]:
def get_box_office(box_doc):
    # Find all the tr tags
    box_office = box_doc.find_all('tr')
    box_office = box_office[1:201]

    # Create a dictionary with one empty list per column
    box_office_dict = {
        'Title': [],
        'Worldwide': [],
        'Domestic': [],
        'Domestic_percent': [],
        'Foreign': [],
        'Foreign_percent': []
    }
    # Loop over the rows and append each value to the matching list
    for i in range(len(box_office)):
        info = get_box_office_info(box_office[i])
        box_office_dict['Title'].append(info[0])
        box_office_dict['Worldwide'].append(info[1])
        box_office_dict['Domestic'].append(info[2])
        box_office_dict['Domestic_percent'].append(info[3])
        box_office_dict['Foreign'].append(info[4])
        box_office_dict['Foreign_percent'].append(info[5])

    # Convert into a pandas data frame
    return pd.DataFrame(box_office_dict)

The function `get_box_office()` takes in a Beautiful Soup object of a box office page and returns a pandas data frame containing the box office information from the page. 
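
As an aside, pandas can also build the data frame directly from a list of row tuples, which avoids maintaining six parallel lists. The sketch below uses two rows from the 2021 output in this notebook; `COLUMNS` and `rows` are illustrative names.

```python
import pandas as pd

COLUMNS = ['Title', 'Worldwide', 'Domestic', 'Domestic_percent',
           'Foreign', 'Foreign_percent']

# Two rows from the 2021 output in this notebook
rows = [
    ('Spider-Man: No Way Home', '$1,906,693,477', '$804,793,477', '42.2%',
     '$1,101,900,000', '57.8%'),
    ('Hi, Mom', '$822,009,764', '-', '-', '$822,009,764', '100%'),
]

# Each tuple becomes one row; `columns` supplies the header names in order
df = pd.DataFrame(rows, columns=COLUMNS)
print(df)
```

Since `get_box_office_info()` already returns a tuple in this column order, a list of its results could be passed the same way.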

In [27]:
box_office_page = get_box_office_page(year_urls[1])
get_box_office(box_office_page)

| | Title | Worldwide | Domestic | Domestic_percent | Foreign | Foreign_percent |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Spider-Man: No Way Home | $1,906,693,477 | $804,793,477 | 42.2% | $1,101,900,000 | 57.8% |
| 1 | The Battle at Lake Changjin | $902,548,476 | $342,411 | <0.1% | $902,206,065 | 100% |
| 2 | Hi, Mom | $822,009,764 | - | - | $822,009,764 | 100% |
| 3 | No Time to Die | $774,153,007 | $160,891,007 | 20.8% | $613,262,000 | 79.2% |
| 4 | F9: The Fast Saga | $726,229,501 | $173,005,945 | 23.8% | $553,223,556 | 76.2% |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | Pushpa: The Rise - Part 1 | $7,592,374 | $1,320,000 | 17.4% | $6,272,374 | 82.6% |
| 196 | The Mauritanian | $7,527,030 | $836,536 | 11.1% | $6,690,494 | 88.9% |
| 197 | The Ice Road | $7,502,846 | - | - | $7,502,846 | 100% |
| 198 | Judas and the Black Messiah | $7,428,769 | $5,478,009 | 73.7% | $1,950,760 | 26.3% |
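
The scraped values are still strings such as `'$1,906,693,477'`, `'-'` and `'<0.1%'`, so they cannot be summed or sorted numerically as they are. The sketch below shows one possible way to convert them. `parse_money` and `parse_percent` are hypothetical helpers, and treating `'<0.1%'` as 0.1 (an upper bound) and `'-'` as a missing value are assumptions about how the data should be interpreted.

```python
def parse_money(text):
    """Convert '$1,906,693,477' to 1906693477; return None for '-' (no figure reported)."""
    text = text.strip()
    if text in ('', '-'):
        return None
    return int(text.replace('$', '').replace(',', ''))


def parse_percent(text):
    """Convert '42.2%' to 42.2; '<0.1%' becomes 0.1 as an upper bound; '-' becomes None."""
    text = text.strip()
    if text in ('', '-'):
        return None
    return float(text.lstrip('<').rstrip('%'))


for raw in ('$1,906,693,477', '$342,411', '-'):
    print(raw, '->', parse_money(raw))
for raw in ('42.2%', '<0.1%', '100%', '-'):
    print(raw, '->', parse_percent(raw))
```

Both helpers can be applied column by column, for example with the pandas `Series.map` method.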


This function downloads and parses the Box Office Mojo page. It then calls the `get_year` and `get_year_url` functions to obtain the lists of years and year URLs, stores these lists in a dictionary with the keys 'year' and 'year_urls', and converts the dictionary into a pandas DataFrame.

In [28]:
def scrape_box_office():
    # Refresh the module-level `doc`, which get_year() and get_year_url() read
    global doc
    doc = get_box_office_mojo()

    year_urls = {
        'year' : get_year(),
        'year_urls' : get_year_url()
    }
    return pd.DataFrame(year_urls)

In [29]:
# Show the data frame
scrape_box_office().head()

| | year | year_urls |
| --- | --- | --- |
| 0 | 2022 | https://www.boxofficemojo.com/year/world/2022/ |
| 1 | 2021 | https://www.boxofficemojo.com/year/world/2021/ |
| 2 | 2020 | https://www.boxofficemojo.com/year/world/2020/ |
| 3 | 2019 | https://www.boxofficemojo.com/year/world/2019/ |
| 4 | 2018 | https://www.boxofficemojo.com/year/world/2018/ |


We define a function that scrapes the information from Box Office Mojo for a specific year and saves the resulting data to a CSV file. Its parameters are:

- year_url: The URL of the page on the Box Office Mojo website for the specific year that we want to scrape
- path: The file path where we want to save the resulting data

In [33]:
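def scrape_info(year_url, path):
    # Skip the download if the file already exists
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    # Create the target folder if needed; to_csv does not create it
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # Scrape the page for this year and save it as a CSV file
    box_office = get_box_office(get_box_office_page(year_url))
    box_office.to_csv(path, index=False)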

## Putting all together

We define a function that scrapes the box office collection data from the Box Office Mojo website for every year in the list. The function first calls the `scrape_box_office` function to get the list of years and their corresponding URLs. It then iterates through each year and URL, and calls the `scrape_info` function to scrape the box office collection data for that year. The scraped data is saved to a CSV file in the "data" folder.

In [41]:
def scrape_box_office_collections():
    box_office_collection = scrape_box_office()

    for index, i in box_office_collection.iterrows():
        print('Scraping box office collections "{}" ...'.format(i['year']))
        scrape_info(i['year_urls'], 'data/box office collection {}.csv'.format(i['year']))


In [61]:
scrape_box_office_collections()

Scraping box office collections "2022" ...
Scraping box office collections "2021" ...
Scraping box office collections "2020" ...
Scraping box office collections "2019" ...
Scraping box office collections "2018" ...
Scraping box office collections "2017" ...
Scraping box office collections "2016" ...
Scraping box office collections "2015" ...
Scraping box office collections "2014" ...
Scraping box office collections "2013" ...
Scraping box office collections "2012" ...
Scraping box office collections "2011" ...
Scraping box office collections "2010" ...
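
The loop above requests all pages back to back and stops at the first failure. A gentler variant, sketched here under assumptions, pauses between requests and records the years that failed so they can be retried. `scrape_all`, its `delay` default and the injected `scrape_one` and `sleep` callables are hypothetical and not part of the original notebook.

```python
import time


def scrape_all(year_urls, scrape_one, delay=1.0, sleep=time.sleep):
    """Call scrape_one(year, url) for each entry, pausing between requests.

    Returns the list of years that failed so they can be retried later.
    """
    failed = []
    for i, (year, url) in enumerate(year_urls.items()):
        if i:  # no pause before the first request
            sleep(delay)
        try:
            scrape_one(year, url)
        except Exception as exc:  # keep going; one bad page should not stop the run
            print('Failed to scrape {}: {}'.format(year, exc))
            failed.append(year)
    return failed
```

In this notebook, `scrape_one` could be a small wrapper around `scrape_info` that builds the CSV path from the year.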


## Merging all Data Frames into one 

In [212]:
# Read the per-year CSV files, 2022 down to 2010, into a list of data frames
frames = []
for year in range(2022, 2009, -1):
    frames.append(pd.read_csv('./data/box office collection {}.csv'.format(year)))

In [213]:
# Stack the yearly data frames into one; concat keeps every row from every year
df_all = pd.concat(frames, ignore_index=True)

In [176]:
df_all.to_csv('data/box office.csv', index=False)
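
The per-year files do not store the release year themselves, so that information is lost once the rows are combined. One possible approach, sketched below, is to read the year from each file name and add it as a column while concatenating. `merge_yearly_csvs` is a hypothetical helper, and the demonstration writes two single-row files to a temporary folder instead of using the real `data` folder.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def merge_yearly_csvs(folder):
    """Concatenate every 'box office collection <year>.csv' in `folder`, adding a Year column."""
    frames = []
    paths = sorted(Path(folder).glob('box office collection *.csv'), reverse=True)
    for csv_path in paths:
        # The year is taken from the file name, e.g. '... 2019.csv' -> 2019
        year = int(re.search(r'(\d{4})\.csv$', csv_path.name).group(1))
        frames.append(pd.read_csv(csv_path).assign(Year=year))
    return pd.concat(frames, ignore_index=True)


# Small demonstration with two single-row files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for year, title in ((2019, 'Avengers: Endgame'), (2021, 'Spider-Man: No Way Home')):
        pd.DataFrame({'Title': [title]}).to_csv(
            Path(tmp) / 'box office collection {}.csv'.format(year), index=False)
    merged = merge_yearly_csvs(tmp)

print(merged)
```

Using `pd.concat` rather than an outer merge keeps every row, even if two years happen to contain identical values.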