# Reading HTML tables using Pandas

In this practice session, we use the "Top Grossing Movies of 2021" data.
We download the table from the website and take a first look at what it contains.
<br>
I watched the tutorial "Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)."

https://www.youtube.com/watch?v=r-uOLxNrNk8&t=13929s

In [1]:
import pandas as pd
import numpy as np
import requests

html_url = "https://www.the-numbers.com/market/2021/top-grossing-movies"
r = requests.get(html_url)
top_movie = pd.read_html(r.text, header=0)
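
`pd.read_html` returns a list with one DataFrame per `<table>` element it finds. Below is a minimal offline sketch, assuming an HTML parser such as `lxml` is installed. The HTML string is a hand-written stand-in for the page (two rows copied from the output shown later), not the site's real markup, and `sample_html`, `tables` and `sample` are illustrative names.

```python
from io import StringIO

import pandas as pd

# Hand-written stand-in for the downloaded page (not the site's real markup).
sample_html = """
<table>
  <tr><th>Rank</th><th>Movie</th><th>Genre</th></tr>
  <tr><td>1</td><td>A Quiet Place: Part II</td><td>Horror</td></tr>
  <tr><td>2</td><td>Godzilla vs. Kong</td><td>Action</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table found.
# Wrapping the text in StringIO also works on newer pandas releases,
# which deprecate passing literal HTML text directly.
tables = pd.read_html(StringIO(sample_html), header=0)
sample = tables[0]

print(len(tables))
print(sample)
```

Because the result is always a list, a page with a single table still has to be unpacked with `tables[0]`.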

There are two ways to check which version of pandas is installed.
<ul>
<li>
<code>print(pd.__version__)</code> in a Python session or notebook cell
</li>
<li>
<code>pip show pandas</code> in a terminal
</li>
</ul>

In [2]:
print(pd.__version__)

1.2.4


###### How many tables are there on this page?

In [3]:
len(top_movie)

1

###### Let `movie` be the only table on this page.

In [4]:
movie = top_movie[0]

In [5]:
movie.head()

,Rank,Movie,ReleaseDate,Distributor,Genre,2021 Gross,Tickets Sold
0,1,A Quiet Place: Part II,"May 28, 2021",Paramount Pictures,Horror,"$136,381,860",14888849.0
1,2,Godzilla vs. Kong,"Mar 31, 2021",Warner Bros.,Action,"$100,392,257",10959853.0
2,3,Cruella,"May 28, 2021",Walt Disney,Comedy,"$71,382,602",7792860.0
3,4,F9: The Fast Saga,"Jun 25, 2021",Universal,Action,"$70,043,165",7646633.0
4,5,The Conjuring: The Devil Ma…,"Jun 4, 2021",Warner Bros.,Horror,"$59,204,511",6463374.0


In [6]:
movie.tail()

,Rank,Movie,ReleaseDate,Distributor,Genre,2021 Gross,Tickets Sold
194,195,The Bra,"Oct 16, 2020",Indican Pictures,Comedy,$572,62.0
195,196,Funhouse,"May 28, 2021",Magnet Releasing,Horror,$507,55.0
196,197,The Evil Next Door,"Jun 25, 2021",Magnet Releasing,Horror,$303,33.0
197,Total Gross of All Movies,Total Gross of All Movies,Total Gross of All Movies,Total Gross of All Movies,Total Gross of All Movies,"$1,024,231,962",
198,Total Tickets Sold,Total Tickets Sold,Total Tickets Sold,Total Tickets Sold,Total Tickets Sold,Total Tickets Sold,111815621.0


###### Set `Rank` as the index of `movie` and remove the last two rows, which hold totals rather than movies.

In [7]:
# Use Rank as the row index
movie.set_index('Rank', inplace=True)
# Drop the two summary rows at the bottom by their index labels
movie = movie.drop(['Total Gross of All Movies', 'Total Tickets Sold'], axis=0)
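
The same clean-up can be sketched on a small hand-built frame whose rows are copied from the output above; `sample`, `by_label` and `by_mask` are illustrative names. As an alternative to naming the summary rows, the second half keeps only rows whose `Rank` parses as a number, which assumes that summary rows never have a numeric rank.

```python
import pandas as pd

# Hand-built sample: two movie rows plus the two summary rows.
sample = pd.DataFrame({
    "Rank": ["1", "2", "Total Gross of All Movies", "Total Tickets Sold"],
    "Movie": ["A Quiet Place: Part II", "Godzilla vs. Kong",
              "Total Gross of All Movies", "Total Tickets Sold"],
    "2021 Gross": ["$136,381,860", "$100,392,257", "$1,024,231,962",
                   "Total Tickets Sold"],
})

# Approach used above: index by Rank, then drop the summary labels.
by_label = sample.set_index("Rank").drop(
    ["Total Gross of All Movies", "Total Tickets Sold"], axis=0
)

# Alternative: keep only rows whose Rank can be read as a number.
numeric_rank = pd.to_numeric(sample["Rank"], errors="coerce")
by_mask = sample[numeric_rank.notna()].set_index("Rank")

print(by_label)
print(by_mask)
```

The mask-based variant does not depend on the exact wording of the summary labels, which may help if the site changes them.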

In [8]:
movie.head()

Rank,Movie,ReleaseDate,Distributor,Genre,2021 Gross,Tickets Sold
1,A Quiet Place: Part II,"May 28, 2021",Paramount Pictures,Horror,"$136,381,860",14888849.0
2,Godzilla vs. Kong,"Mar 31, 2021",Warner Bros.,Action,"$100,392,257",10959853.0
3,Cruella,"May 28, 2021",Walt Disney,Comedy,"$71,382,602",7792860.0
4,F9: The Fast Saga,"Jun 25, 2021",Universal,Action,"$70,043,165",7646633.0
5,The Conjuring: The Devil Ma…,"Jun 4, 2021",Warner Bros.,Horror,"$59,204,511",6463374.0


In [9]:
movie.tail()

Rank,Movie,ReleaseDate,Distributor,Genre,2021 Gross,Tickets Sold
193,Killer Raccoons 2: Dark Chr…,"Jul 31, 2020",Indican Pictures,Comedy,$900,98.0
194,The Forgotten Carols,"Nov 20, 2020",Purdie Distribution,Musical,$623,68.0
195,The Bra,"Oct 16, 2020",Indican Pictures,Comedy,$572,62.0
196,Funhouse,"May 28, 2021",Magnet Releasing,Horror,$507,55.0
197,The Evil Next Door,"Jun 25, 2021",Magnet Releasing,Horror,$303,33.0


###### We did it! The following two cells show two ways to list the columns of the table.

In [10]:
for col in movie.columns:
    print(col)

Movie
ReleaseDate
Distributor
Genre
2021 Gross
Tickets Sold


In [11]:
list(movie.columns.values)

['Movie', 'ReleaseDate', 'Distributor', 'Genre', '2021 Gross', 'Tickets Sold']
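
`columns` is an Index object, so it also offers `tolist()` and supports membership tests. Here is a short sketch on an empty frame that carries only the six column names; `sample` and `names` are illustrative names.

```python
import pandas as pd

# Empty frame with the same column names as the movie table.
sample = pd.DataFrame(
    columns=["Movie", "ReleaseDate", "Distributor", "Genre",
             "2021 Gross", "Tickets Sold"]
)

# tolist() gives a plain Python list of the column names.
names = sample.columns.tolist()
print(names)

# Membership test: handy before accessing a column that may be missing.
print("Genre" in sample.columns)
```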

###### The following cell lists the distinct values in the `Genre` column.

In [12]:
movie['Genre'].unique()

array(['Horror', 'Action', 'Comedy', 'Adventure', 'Thriller/Suspense',
       'Musical', 'Western', 'Drama', 'Black Comedy', 'Romantic Comedy',
       'Concert/Perfor…', 'Documentary', 'Multiple Genres'], dtype=object)
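
To see how many movies fall under each genre rather than only the distinct names, `value_counts` is one option. This is a small sketch on a hand-built Series holding the genres of the five top-ranked movies shown earlier; `genres` and `counts` are illustrative names.

```python
import pandas as pd

# Genres of the five top-ranked movies from the head() output.
genres = pd.Series(
    ["Horror", "Action", "Comedy", "Action", "Horror"], name="Genre"
)

# Distinct values, in order of first appearance.
print(genres.unique())

# Number of rows per distinct value.
counts = genres.value_counts()
print(counts)
```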

###### We want to see the list of popular horror movies.

In [13]:
horror = movie[(movie['Genre'] == "Horror")]
pd.DataFrame(horror['Movie'])

Rank,Movie
1,A Quiet Place: Part II
5,The Conjuring: The Devil Ma…
16,Spiral
20,The Unholy
30,Separation
49,Wrong Turn
51,In the Earth
53,Come Play
67,Willy's Wonderland
79,Freaky


###### Now, we want to see how much these movies earned and how many tickets were sold in 2021. We will compute the averages over the top 197 movies. First, let's check whether the columns hold numbers or objects.

In [14]:
print(movie.dtypes)

Movie            object
ReleaseDate      object
Distributor      object
Genre            object
2021 Gross       object
Tickets Sold    float64
dtype: object


###### We want to convert each value in `2021 Gross` from a string to a float.

In [15]:
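# Remove the thousands separators, e.g. "$136,381,860" -> "$136381860"
movie['2021 Gross'] = movie['2021 Gross'].str.replace(',', '', regex=False)
# Strip the leading dollar sign and convert the strings to floats
movie['2021 Gross'] = movie['2021 Gross'].str.lstrip('$').astype(float)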

In [16]:
print(movie.dtypes)

Movie            object
ReleaseDate      object
Distributor      object
Genre            object
2021 Gross      float64
Tickets Sold    float64
dtype: object
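
A possible variant of this conversion, sketched on a few values copied from the table: one regular expression removes both `$` and `,`, and `pd.to_numeric` does the conversion. The names `gross`, `cleaned` and `as_float` are illustrative, and `errors="coerce"` (which turns unparseable entries into `NaN`) is a choice made for this sketch, not something the cell above does.

```python
import pandas as pd

# Two real amounts plus the text that appears in a summary row.
gross = pd.Series(
    ["$136,381,860", "$303", "Total Tickets Sold"], name="2021 Gross"
)

# Remove dollar signs and thousands separators in one pass.
cleaned = gross.str.replace(r"[$,]", "", regex=True)

# Convert to numbers; entries that are still not numeric become NaN.
# astype("float64") ensures a plain float column in every case.
as_float = pd.to_numeric(cleaned, errors="coerce").astype("float64")

print(as_float.dtype)
print(as_float.tolist())
```

With coercion, a stray text row would show up as a missing value instead of raising an error, so it is worth checking `as_float.isna().sum()` afterwards.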


Now, let's see the averages of `2021 Gross` and `Tickets Sold`.

In [17]:
movie['2021 Gross'].mean()

5199147.015228426

In [18]:
movie['Tickets Sold'].mean()

567591.9847715736
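
Both averages can also be computed in one call by selecting the two columns together. This sketch uses three rows copied from the table, so its numbers differ from the full-table averages above; `sample`, `averages` and `manual` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    "2021 Gross": [136381860.0, 100392257.0, 71382602.0],
    "Tickets Sold": [14888849.0, 10959853.0, 7792860.0],
})

# mean() on a DataFrame returns one average per column.
averages = sample[["2021 Gross", "Tickets Sold"]].mean()
print(averages)

# The mean is the column total divided by the number of rows.
manual = sample["2021 Gross"].sum() / len(sample)
print(manual, averages["2021 Gross"])
```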