# Scraping the All Time Worldwide Box Office using Python

### Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program
1. Import the `requests` library
2. Download web pages using the `requests` library
3. The `requests` library lets you fetch the contents of HTTP links from Python

![](https://i.imgur.com/heKuzSD.png)


### The Numbers website
the-numbers.com is a website that publishes the All Time Worldwide Box Office list.

##### **Objective**:
Scrape the "All Time Worldwide Box Office" list by parsing the page and extracting the information as tabular data.

##### **List of fields on the website:**

1. Rank
2. Year
3. Movie
4. Worldwide_Box Office
5. Domestic_Box Office
6. International_Box Office

## **Outline of the project:**
1. Understanding the structure of [The Numbers website](https://www.the-numbers.com)
2. Installing and importing the required libraries
3. Extracting each movie's details from the website using `BeautifulSoup`
4. Parsing the All Time Worldwide Box Office details into 6 fields: Rank, Year, Movie, Worldwide_Box Office, Domestic_Box Office, International_Box Office
5. Storing the extracted data in a dictionary
6. Compiling all the data into a DataFrame using `Pandas` and saving the data to a `CSV` file

In [100]:
# Install the library
!pip install requests --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


### Reading the saved page contents from a file

The cell below reads a previously saved copy of the page, `all.html`, into a string.

In [86]:
with open('all.html','r',encoding='utf-8') as f:
    htmlss = f.read()
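
The cell above expects `all.html` to exist already. A minimal sketch of how that file could be produced with `requests` is shown below. The `download_page` helper, its `fetch` parameter, and the `User-Agent` value are assumptions made for illustration and are not part of the original notebook; the URL is the one listed in the References.

```python
import requests

# Page listed in the References section of this notebook.
URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/all-time"

# Assumption: a browser-like User-Agent, since some sites reject the default one.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; box-office-notebook)"}


def download_page(url, path, fetch=requests.get):
    """Download `url`, write the HTML to `path`, and return its length in characters."""
    response = fetch(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return len(response.text)


# Usage (performs a real network request):
# download_page(URL, "all.html")
```

The `fetch` parameter exists only so the helper can be exercised without network access; in normal use the default `requests.get` is enough.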

In [103]:
# Install the library
!pip install beautifulsoup4 --upgrade --quiet

## Use Beautiful Soup to parse and extract information
#### To extract information from the HTML source code of a page we can use the BeautifulSoup library, which is imported from the `bs4` module.

In [87]:
from bs4 import BeautifulSoup
# The 'lxml' parser requires the lxml package; Python's built-in 'html.parser' also works
doc = BeautifulSoup(htmlss, 'lxml')

In [88]:
type(doc)

bs4.BeautifulSoup

In [89]:
doc.name

'[document]'

In [90]:
doc.title

<title>All Time Worldwide Box Office</title>
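
The same navigation calls can be tried without `all.html`. The sketch below parses a tiny, hypothetical HTML string (not the real page) with Python's built-in `html.parser`, so it should run without installing `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>All Time Worldwide Box Office</title></head>
  <body>
    <div id="wrap"><a href="/">Home</a></div>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra install is needed here.
sample_doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_text = sample_doc.title.string                      # text inside <title>
tag_names = [tag.name for tag in sample_doc.find_all()]   # every tag, in document order
wrap_div = sample_doc.find("div", id="wrap")              # first <div> whose id is "wrap"
home_link = wrap_div.a["href"]                            # attribute access on a tag

print(title_text)
print(tag_names)
print(home_link)
```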

In [91]:
# List the name of every HTML tag in the document
[tag.name for tag in doc.find_all()]

['html',
 'head',
 'link',
 'meta',
 'script',
 'script',
 'meta',
 'meta',
 ...]

In [92]:
print(doc.head.prettify())

<head>
 <link href="/images/logo_2021/favicon.ico" rel="icon"/>
 <meta content="nofollow, NOARCHIVE, NOODP, NOYDIR" name="robots"/>
 <!-- Global site tag (gtag.js) - Google Analytics -->
 <!-- Google tag (gtag.js) -->
 <script async="" src="https://www.googletagmanager.com/gtag/js?id=G-5K2DT3XQN5">
 </script>
 <script>
  window.dataLayer = window.dataLayer || [];
                function gtag() {
                    dataLayer.push(arguments);
                }
                gtag('js', new Date());

                gtag('config', 'G-5K2DT3XQN5');
 </script>
 <!--<script async src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1"></script>
            <script>
              window.dataLayer = window.dataLayer || [];
              function gtag(){dataLayer.push(arguments);}
              gtag('js', new Date());
            
              gtag('config', 'UA-1343128-1');
            </script>-->
 <meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "http

In [93]:
print(doc.body.prettify())

<body>
 <div id="wrap">
  <div id="desktopnav">
   <div id="header">
    <span style="/*float: left;*/">
     <a href="/">
      <img alt="The Numbers - Where Data and Movies Meet" border="0" height="50" src="/images/logo_2021/SVG/numbers-logo-r.svg" width="300"/>
     </a>
    </span>
    <span style="float:right;">
     <form action="/custom-search" method="get" style="display: inline-block;margin-top:12px;">
      <input name="searchterm" onkeyup="if (event.keyCode == 13) { this.form.submit(); return false; }" placeholder="Search" size="20" style="height:40px" type="text" value=""/>
     </form>
     <a href="https://www.facebook.com/TheNumbers" target="_blank">
      <img height="28" src="https://www.the-numbers.com/images/icons/facebook.svg" style="border:none;" title="Follow The Numbers on Facebook" width="28"/>
     </a>
     <a href="https://www.twitter.com/MovieNumbers" target="_blank" title="Follow The Numbers on Twitter">
      <img height="28" src="https://www.the-numbers.c

In [94]:
print(doc.div.prettify())

<div id="wrap">
 <div id="desktopnav">
  <div id="header">
   <span style="/*float: left;*/">
    <a href="/">
     <img alt="The Numbers - Where Data and Movies Meet" border="0" height="50" src="/images/logo_2021/SVG/numbers-logo-r.svg" width="300"/>
    </a>
   </span>
   <span style="float:right;">
    <form action="/custom-search" method="get" style="display: inline-block;margin-top:12px;">
     <input name="searchterm" onkeyup="if (event.keyCode == 13) { this.form.submit(); return false; }" placeholder="Search" size="20" style="height:40px" type="text" value=""/>
    </form>
    <a href="https://www.facebook.com/TheNumbers" target="_blank">
     <img height="28" src="https://www.the-numbers.com/images/icons/facebook.svg" style="border:none;" title="Follow The Numbers on Facebook" width="28"/>
    </a>
    <a href="https://www.twitter.com/MovieNumbers" target="_blank" title="Follow The Numbers on Twitter">
     <img height="28" src="https://www.the-numbers.com/images/icons/twitter.

In [95]:
print(doc.text)

All Time Worldwide Box Office

News

Latest News
RSS Feed
Release Schedule
On This Day


Box Office

Daily Chart
Weekend Chart
Weekly Chart
Annual Box Office
Theatrical Market
International Charts
Records
Chart Index
Release Schedule
2023 Domestic
2023 Worldwide


Home Video

Weekly DVD Chart
Weekly Blu-ray Chart
Weekly Combined DVD+Blu-ray Chart
DEG Watched at Home Top 20 Chart
Netflix Daily Top 10
2023 DVD Chart
2023 Blu-ray Chart
2023 Combined Chart
All-Time Blu-ray
Release Schedule
Distributors


Movies

Budgets and Finances
Franchises
Keywords
Movie Index
Release Schedule
Most Anticipated
Trending Movies
Production Companies
Production Countries
Languages
Comparisons
Report Builder


People

Bankability
Records
People Index
Trending People
Highest Grossing Stars of 2023


Research Tools

Report Builder
Keyword Analysis
Movie Comparison
Search


Our Services

Research Services
Data Services
Bankability
Advanced Report

#### Fetching the table rows with `find_all`: `doc.tbody` selects the first table body on the page, and each `tr` inside it holds one movie.

In [96]:
rows = doc.tbody.find_all('tr')
rows

[<tr>
 <td class="data">1</td>
 <td class="data"><a href="/box-office-records/worldwide/all-movies/cumulative/released-in-2009">2009</a></td>
 <td><b><a href="/movie/Avatar#tab=summary">Avatar</a></b></td>
 <td align="right">$2,923,706,026</td>
 <td align="right">$785,221,649</td>
 <td align="right">$2,138,484,377</td>
 </tr>,
 <tr>
 <td class="data">2</td>
 <td class="data"><a href="/box-office-records/worldwide/all-movies/cumulative/released-in-2019">2019</a></td>
 <td><b><a href="/movie/Avengers-Endgame-(2019)#tab=summary">Avengers: Endgame</a></b></td>
 <td align="right">$2,794,731,755</td>
 <td align="right">$858,373,000</td>
 <td align="right">$1,936,358,755</td>
 </tr>,
 <tr class="highlight">
 <td class="data">3</td>
 <td class="data"><a href="/box-office-records/worldwide/all-movies/cumulative/released-in-2022">2022</a></td>
 <td><b><a href="/movie/Avatar-The-Way-of-Water-(2022)#tab=summary">Avatar: The Way of Water</a></b></td>
 <td align="right">$2,319,972,415</td>
 <td alig

#### Creating a dictionary of lists to hold the extracted data

In [97]:
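# One list per column; the loop further down appends one value per movie
rank = []
year = []
movie = []
ww = []   # worldwide box office
db = []   # domestic box office
ibo = []  # international box office
table_dict = {'Rank': rank,
              'Year': year,
              'Movie': movie,
              'Worldwide_Box Office': ww,
              'Domestic_Box Office': db,
              'International_Box Office': ibo
              }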

In [104]:
doc.tbody.find_all('td')

[<td class="data">1</td>,
 <td class="data"><a href="/box-office-records/worldwide/all-movies/cumulative/released-in-2009">2009</a></td>,
 <td><b><a href="/movie/Avatar#tab=summary">Avatar</a></b></td>,
 <td align="right">$2,923,706,026</td>,
 <td align="right">$785,221,649</td>,
 <td align="right">$2,138,484,377</td>,
 <td class="data">2</td>,
 <td class="data"><a href="/box-office-records/worldwide/all-movies/cumulative/released-in-2019">2019</a></td>,
 <td><b><a href="/movie/Avengers-Endgame-(2019)#tab=summary">Avengers: Endgame</a></b></td>,
 <td align="right">$2,794,731,755</td>,
 <td align="right">$858,373,000</td>,
 <td align="right">$1,936,358,755</td>,
 <td class="data">3</td>,
 <td class="data"><a href="/box-office-records/worldwide/all-movies/cumulative/released-in-2022">2022</a></td>,
 <td><b><a href="/movie/Avatar-The-Way-of-Water-(2022)#tab=summary">Avatar: The Way of Water</a></b></td>,
 <td align="right">$2,319,972,415</td>,
 <td align="right">$684,075,767</td>,
 <td al

#### Appending the values from each row to the dictionary's lists

Each row (`tr` tag) contains 6 `td` tags, which hold the details of one movie.
![imgg.png](attachment:imgg.png)

In [98]:
for row in rows:
    cells = row.find_all('td')  # the six <td> cells of this row, looked up once
    rank.append(cells[0].text.strip())
    year.append(cells[1].text.strip())
    movie.append(cells[2].text.strip())
    ww.append(cells[3].text.strip())
    db.append(cells[4].text.strip())
    ibo.append(cells[5].text.strip())
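
The loop above assumes that every row has exactly six cells. As a sketch of a slightly more defensive variant, the hypothetical `parse_rows` helper below skips rows with a different layout and returns one dictionary per movie. The sample HTML is two rows copied from the table output shown earlier; `COLUMNS` and `parse_rows` are names introduced here for illustration only.

```python
from bs4 import BeautifulSoup

# Two rows copied from the table output shown earlier in this notebook.
SAMPLE_TABLE = """
<table><tbody>
<tr>
<td class="data">1</td>
<td class="data"><a href="/box-office-records/worldwide/all-movies/cumulative/released-in-2009">2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></b></td>
<td align="right">$2,923,706,026</td>
<td align="right">$785,221,649</td>
<td align="right">$2,138,484,377</td>
</tr>
<tr>
<td class="data">2</td>
<td class="data"><a href="/box-office-records/worldwide/all-movies/cumulative/released-in-2019">2019</a></td>
<td><b><a href="/movie/Avengers-Endgame-(2019)#tab=summary">Avengers: Endgame</a></b></td>
<td align="right">$2,794,731,755</td>
<td align="right">$858,373,000</td>
<td align="right">$1,936,358,755</td>
</tr>
</tbody></table>
"""

COLUMNS = ["Rank", "Year", "Movie", "Worldwide_Box Office",
           "Domestic_Box Office", "International_Box Office"]


def parse_rows(soup):
    """Return one dict per <tr> that has exactly len(COLUMNS) cells."""
    records = []
    for row in soup.tbody.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != len(COLUMNS):
            continue  # skip rows that do not match the expected layout
        records.append(dict(zip(COLUMNS, cells)))
    return records


sample_records = parse_rows(BeautifulSoup(SAMPLE_TABLE, "html.parser"))
for record in sample_records:
    print(record["Rank"], record["Movie"], record["Worldwide_Box Office"])
```

A list of dictionaries like `sample_records` can be passed directly to `pd.DataFrame`, as an alternative to the dictionary of lists used in this notebook.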

#### Converting dictionary into a DataFrame

In [99]:
import pandas as pd
df = pd.DataFrame(table_dict)
df

| | Rank | Year | Movie | Worldwide_Box Office | Domestic_Box Office | International_Box Office |
|---|---|---|---|---|---|---|
| 0 | 1 | 2009 | Avatar | $2,923,706,026 | $785,221,649 | $2,138,484,377 |
| 1 | 2 | 2019 | Avengers: Endgame | $2,794,731,755 | $858,373,000 | $1,936,358,755 |
| 2 | 3 | 2022 | Avatar: The Way of Water | $2,319,972,415 | $684,075,767 | $1,635,896,648 |
| 3 | 4 | 1997 | Titanic | $2,222,985,568 | $674,396,795 | $1,548,588,773 |
| 4 | 5 | 2015 | Star Wars Ep. VII: The Force Awakens | $2,064,615,817 | $936,662,225 | $1,127,953,592 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | 2016 | Fantastic Beasts and Where to Find Them | $811,724,385 | $234,037,575 | $577,686,810 |
| 96 | 97 | 2007 | Shrek the Third | $807,330,936 | $322,719,944 | $484,610,992 |
| 97 | 98 | 2019 | Jumanji: The Next Level | $798,210,215 | $316,831,246 | $481,378,969 |
| 98 | 99 | 1982 | E.T. the Extra-Terrestrial | $797,103,542 | $439,251,124 | $357,852,418 |
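
The DataFrame above stores every value as a string, and the outline's last step (saving to a `CSV` file) is not shown in a cell. The sketch below illustrates one possible way to do both. It works on two rows taken from the output above; the `clean_box_office` helper and the file name `worldwide_box_office.csv` are hypothetical choices, not part of the original notebook.

```python
import pandas as pd

MONEY_COLUMNS = ["Worldwide_Box Office", "Domestic_Box Office",
                 "International_Box Office"]


def clean_box_office(frame):
    """Return a copy with Rank/Year and the dollar columns converted to integers."""
    cleaned = frame.copy()
    for column in ["Rank", "Year"]:
        cleaned[column] = cleaned[column].astype(int)
    for column in MONEY_COLUMNS:
        cleaned[column] = (
            cleaned[column]
            .str.replace("$", "", regex=False)   # drop the currency symbol
            .str.replace(",", "", regex=False)   # drop thousands separators
            .astype("int64")                     # totals exceed the 32-bit range
        )
    return cleaned


# Two rows taken from the DataFrame output above.
sample = pd.DataFrame({
    "Rank": ["1", "2"],
    "Year": ["2009", "2019"],
    "Movie": ["Avatar", "Avengers: Endgame"],
    "Worldwide_Box Office": ["$2,923,706,026", "$2,794,731,755"],
    "Domestic_Box Office": ["$785,221,649", "$858,373,000"],
    "International_Box Office": ["$2,138,484,377", "$1,936,358,755"],
})

cleaned = clean_box_office(sample)

# Hypothetical output file name; index=False leaves out the row index column.
cleaned.to_csv("worldwide_box_office.csv", index=False)
print(cleaned.dtypes)
```

With numeric columns, checks such as confirming that domestic plus international equals the worldwide total become simple arithmetic. In the notebook itself the same calls would be applied to `df` instead of `sample`.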


# **References**
https://dorianlazar.medium.com/scraping-medium-with-python-beautiful-soup-3314f898bbf5

https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/all-time
