Using what you know about HTML file structure, you're going to use Beautiful Soup to extract three things from each Rotten Tomatoes HTML file: the Audience Score, the number of audience ratings, and the movie title (so we have something to merge the datasets on later). The results are then saved in a pandas DataFrame.

In [1]:
from bs4 import BeautifulSoup
import os
import pandas as pd

#### HTML File Structure
The Hypertext Markup Language (or HTML) is the language used to create documents for the World Wide Web.

In [3]:
with open('rt_html/et_the_extraterrestrial.html') as file:
    soup = BeautifulSoup(file)  # make the soup (no parser is named, so Beautiful Soup picks the best one installed)

In [8]:
#soup #outputs html document

Let's find the title of our movie using the **find()** method on our soup.

In [7]:
soup.find('title') 
# This is the title of the web page, not just the movie title.

<title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title>

To get only the movie title, we have to do some simple string slicing.

To access the contents of this tag, we can use **.contents**, an attribute (not a method) that returns a list of the tag's children.

In [10]:
soup.find('title').contents #this outputs a list

['E.T. The Extra-Terrestrial\xa0(1982) - Rotten Tomatoes']

Because there's only one thing within this tag, the list is one item long, and we can access that item using index 0.
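
To make the behaviour of .contents concrete, here is a minimal sketch that runs on an inline HTML string instead of the saved file. The `html` string is a hand-made stand-in that contains only the title tag shown above, and the built-in 'html.parser' backend is an assumption, chosen so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for the real page: only the <title> tag shown above.
html = (
    "<html><head>"
    "<title>E.T. The Extra-Terrestrial\xa0(1982) - Rotten Tomatoes</title>"
    "</head><body></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.find("title")
children = title_tag.contents     # attribute, not a method: a list of child nodes
page_title = children[0]          # the single text node inside <title>
same_text = title_tag.get_text()  # alternative that returns the text directly

print(len(children))
print(page_title == same_text)
```

For a tag with a single text child, indexing .contents and calling get_text() should give the same string; .contents becomes more useful once a tag mixes text with other tags.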

In [18]:
# Indexing with 0 gives us the only element of the list: the full title string.
soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
# We then slice from the start up to the last 18 characters, which is the length of the ' - Rotten Tomatoes' suffix we don't want.

'E.T. The Extra-Terrestrial\xa0(1982)'

In [19]:
len(' - Rotten Tomatoes')

18
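
The slicing step can be tried on its own with plain strings. The sketch below uses the title string from the output above; the endswith() guard and the replacement of the non-breaking space (\xa0) are optional extras that are not part of the original solution, so treat them as assumptions about what you may want later when merging on the title.

```python
raw_title = "E.T. The Extra-Terrestrial\xa0(1982) - Rotten Tomatoes"
suffix = " - Rotten Tomatoes"

# Same slice as in the notebook: drop the last len(suffix) characters.
sliced = raw_title[:-len(suffix)]

# Optional guard: only slice when the suffix is really there.
guarded = raw_title[:-len(suffix)] if raw_title.endswith(suffix) else raw_title

# Optional clean-up: swap the non-breaking space for a regular space.
normalized = guarded.replace("\xa0", " ")

print(len(suffix))
print(repr(sliced))
print(repr(normalized))
```

Whether to normalize \xa0 depends on how titles are written in the dataset you merge with; if that dataset also contains \xa0, leave the title untouched.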

### Quiz

The Jupyter Notebook below contains template code that:

- Creates an empty list, df_list, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (collecting the rows in a list and converting once is an efficient way to build a DataFrame row by row, and much cheaper than growing the DataFrame itself one row at a time).

- Loops through each movie's Rotten Tomatoes HTML file in the rt_html folder.

- Opens each HTML file and passes it into a file handle called file.

- Creates a DataFrame called df by converting df_list using the pd.DataFrame constructor.
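
The list-of-dictionaries pattern described in the bullets above can be sketched without any HTML at all. The two rows below are copied from the DataFrame output shown at the end of this notebook and are used purely for illustration; the `rows` variable is a hypothetical stand-in for the values you will extract from each file.

```python
import pandas as pd

df_list = []

# Stand-in for the per-file extraction: (title, audience score, number of ratings).
rows = [
    ("Zootopia (2016)", "92", "98633"),
    ("Selma (2015)", "86", "60533"),
]

for title, audience_score, num_audience_ratings in rows:
    # One dictionary per movie; the keys become the column names.
    df_list.append({
        "title": title,
        "audience_score": int(audience_score),
        "number_of_audience_ratings": int(num_audience_ratings),
    })

# Convert once, at the end, and fix the column order explicitly.
df = pd.DataFrame(
    df_list,
    columns=["title", "audience_score", "number_of_audience_ratings"],
)
print(df.shape)
```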

Your task is to extract the title, audience score, and number of audience ratings in each HTML file so each trio can be appended as a dictionary to df_list.

The Beautiful Soup methods required for this task are:

- find()
- find_all()

In [29]:
# Step 1: create a for loop to iterate through the files in the rt_html folder by using os.listdir()
# df_list is the list of dictionaries to build file by file and later convert to a DataFrame
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    print(movie_html)

# os.listdir() returns a list of the names of all files and directories in the specified directory, in arbitrary order.

zootopia.html
treasure_of_the_sierra_madre.html
1000642-all_quiet_on_the_western_front.html
1017289-rear_window.html
selma.html
citizen_kane.html
1003707-casablanca.html
inside_out_2015.html
gravity_2013.html
dr_strangelove.html
seven_samurai_1956.html
open_city.html
1021749-touch_of_evil.html
get_out.html
the_battle_of_algiers.html
star_wars_episode_vii_the_force_awakens.html
rosemarys_baby.html
brooklyn.html
grapes_of_wrath.html
the_conformist.html
1000626-all_about_eve.html
bicycle_thieves.html
gone_with_the_wind.html
argo_2012.html
moonlight_2016.html
bride_of_frankenstein.html
1000121-39_steps.html
toy_story_2.html
1048445-snow_white_and_the_seven_dwarfs.html
1046060-high_noon.html
psycho.html
12_years_a_slave.html
1000013-12_angry_men.html
la_la_land.html
rashomon.html
the_wizard_of_oz_1939.html
the_good_the_bad_and_the_ugly.html
400_blows.html
manchester_by_the_sea.html
battleship_potemkin.html
man_on_wire.html
toy_story_3.html
up.html
spotlight_2015.html
boyhood.html
harry_pott
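
Since os.listdir() returns every entry in the folder in arbitrary order, it can help to filter and sort the names before parsing. The sketch below uses a temporary folder with made-up contents so that it runs anywhere; 'notes.txt' is a hypothetical stray file, not something from the real rt_html folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty files: two HTML pages and one hypothetical stray file.
    for name in ["zootopia.html", "selma.html", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()

    # Keep only the .html files and sort them for a reproducible order.
    html_files = sorted(
        name for name in os.listdir(folder) if name.endswith(".html")
    )

print(html_files)
```

If the rt_html folder only ever contains HTML files, the filter is unnecessary, but the sort still makes the resulting DataFrame's row order reproducible.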

In [32]:
# Step 2: join the folder and HTML file name into a path, then use open() and make the soup
df_list = []
folder = 'rt_html'

for movie_html in os.listdir(folder):
    # os.path.join() joins two or more path components. We need it because open()
    # expects the full relative path, e.g. 'rt_html/et_the_extraterrestrial.html'
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file) # make the soup
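
To see what os.path.join() produces, here is a small sketch that only builds the path and does not open anything. The pathlib variant is an alternative from the standard library, shown for comparison; it is not used in the notebook itself.

```python
import os
from pathlib import Path

folder = "rt_html"
movie_html = "et_the_extraterrestrial.html"

# os.path.join inserts the separator that is correct for the current OS.
path = os.path.join(folder, movie_html)

# pathlib alternative: the / operator joins path components.
same_path = Path(folder) / movie_html

print(path)
print(same_path)
```

On Linux and macOS this prints the path with a forward slash; on Windows the separator is a backslash, which is exactly why joining by hand with '/' is best avoided.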

In [46]:
# Step 3: use find() to get the movie title first, then the audience score

df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file)
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_ = 'audience-score meter').find('span')
        print(audience_score)
        break

<span class="superPageFontColor" style="vertical-align:top">92%</span>


In [48]:
# Extract the text inside the <span> with .contents[0] and check the audience score
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file)
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0]
        print(audience_score)
        break

92%


In [52]:
# Remove the % sign from the audience score and check the result
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file)
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]
        # To get rid of the % sign, we use some basic string slicing:
        # [:-1] grabs everything in the string except its last character.
        print(audience_score)
        break

92
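
The three audience-score steps above can be reproduced on a tiny fragment. The `html` string below is a simplified, hand-made stand-in built from the span shown in the output; the real page wraps it in much more markup, and the 'html.parser' backend is again an assumption. Note that passing a string with several class names to class_ matches the exact value of the class attribute.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the relevant part of the page.
html = """
<div class="audience-score meter">
  <span class="superPageFontColor" style="vertical-align:top">92%</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ with a multi-word string matches the exact class attribute value.
meter = soup.find("div", class_="audience-score meter")
span = meter.find("span")             # first <span> inside that div
raw_score = span.contents[0]          # the text node '92%'
sliced_score = raw_score[:-1]         # notebook approach: drop the last character
audience_score = int(raw_score.rstrip("%"))  # alternative: only removes a trailing %

print(raw_score, sliced_score, audience_score)
```

The rstrip('%') variant is slightly safer than [:-1] because it leaves the string alone when no % sign is present.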


In [71]:
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file)
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2]
        print(num_audience_ratings)
        break


        98,633
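
The indices in find_all('div')[1].contents[2] are easier to follow on a small example. The fragment below is a hypothetical, simplified layout that is consistent with those indices; the label texts and the exact nesting are assumptions, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical simplified layout: two inner divs, each with a label span and a value.
html = """
<div class="audience-info hidden-xs superPageFontColor">
  <div>
    <span>Average Rating:</span> (not used here)
  </div>
  <div>
    <span>User Ratings:</span>
    98,633
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

info = soup.find("div", class_="audience-info hidden-xs superPageFontColor")
inner_divs = info.find_all("div")   # every <div> nested inside the outer one
parts = inner_divs[1].contents      # children of the second inner div

# parts[0] is the whitespace before the span, parts[1] is the span itself,
# and parts[2] is the text after the span, which holds the number.
for index, part in enumerate(parts):
    print(index, repr(part))
```

Printing the children with repr() like this is a quick way to find the right index on the real page, where the surrounding whitespace explains the indented output above.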


The output is a string with surrounding whitespace. We'll have to convert it to an integer, and to do that we first have to remove the commas. We will do that using Python's replace() string method, replacing each comma with an empty string.
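
Here is the clean-up on its own, using a plain string that mimics the whitespace-padded output above. The sketch also shows why the commas have to go: int() refuses a string that still contains them.

```python
raw = "\n        98,633"

# int() cannot parse thousands separators, so this raises ValueError.
try:
    int(raw.strip())
    conversion_failed = False
except ValueError:
    conversion_failed = True

# strip() removes the surrounding whitespace, replace() removes the commas.
cleaned = raw.strip().replace(",", "")
num_audience_ratings = int(cleaned)

print(conversion_failed)
print(num_audience_ratings)
```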

In [79]:
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file)
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
        print(num_audience_ratings)
        break
# The strip() method returns a copy of the string with the given characters removed
# from the beginning and the end of the string (whitespace characters by default).

98633


In [98]:
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file)
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div', class_= 'audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',','')
        df_list.append({'title': title,
                       'audience_score' : int(audience_score),
                       'number_of_audience_ratings' : int(num_audience_ratings)})
df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])

In [99]:
df

| | title | audience_score | number_of_audience_ratings |
|---|---|---|---|
| 0 | Zootopia (2016) | 92 | 98633 |
| 1 | The Treasure of the Sierra Madre (1948) | 93 | 25627 |
| 2 | All Quiet on the Western Front (1930) | 89 | 17768 |
| 3 | Rear Window (1954) | 95 | 149458 |
| 4 | Selma (2015) | 86 | 60533 |
| ... | ... | ... | ... |
| 95 | The Night of the Hunter (1955) | 90 | 24322 |
| 96 | A Streetcar Named Desire (1951) | 90 | 54761 |
| 97 | Arrival (2016) | 82 | 78740 |
| 98 | Skyfall (2012) | 86 | 372497 |
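
One thing to keep in mind with this approach: find() returns None when nothing matches, so a page with a different layout would make the chained calls fail with an AttributeError. The helper below is a minimal defensive sketch, not part of the original solution; the name `extract_audience_score` and both test pages are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_audience_score(html):
    """Return the audience score as an int, or None if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")
    meter = soup.find("div", class_="audience-score meter")
    if meter is None:          # the div is missing on this page
        return None
    span = meter.find("span")
    if span is None:           # the div exists but has no span inside
        return None
    text = span.get_text().strip().rstrip("%")
    return int(text) if text.isdigit() else None


# Hypothetical pages: one with the expected markup, one without it.
good_page = '<div class="audience-score meter"><span>92%</span></div>'
bad_page = "<div>no score here</div>"

print(extract_audience_score(good_page))
print(extract_audience_score(bad_page))
```

Returning None keeps the loop running over all files, and the missing values can then be inspected in the DataFrame instead of stopping at the first unusual page.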


In [100]:
# Example
# Python program to demonstrate the use of the strip() method

str1 = 'geeks for geeks'
# Print the string without stripping.
print(str1)

# Set of characters to be removed
# from both ends of the original string.
str2 = 'ekgs'

# Print str1 after stripping the characters in str2 from both of its ends.
print(str1.strip(str2))

geeks for geeks
 for 


How the code above works:
- #1 We first construct the string str1 = 'geeks for geeks'.
- #2 We then call the strip() method on str1 and pass str2 = 'ekgs' as the argument.
- #3 The Python interpreter scans str1 from the left, removing each character that is present in str2.
- #4 It stops scanning at the first character that is not in str2.
- #5 The interpreter then scans str1 from the right, removing each character that is present in str2.
- #6 Again, it stops at the first character that is not in str2.
- #7 Finally, it returns the resulting string.
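
The steps above can be written out as a small function. This is only an illustrative re-implementation under the hypothetical name `strip_chars`; the built-in strip() is what you should use in practice.

```python
def strip_chars(text, chars):
    """Remove any characters found in chars from both ends of text."""
    start = 0
    end = len(text)
    # Scan from the left, skipping characters that are in chars.
    while start < end and text[start] in chars:
        start += 1
    # Scan from the right, skipping characters that are in chars.
    while end > start and text[end - 1] in chars:
        end -= 1
    return text[start:end]


result = strip_chars("geeks for geeks", "ekgs")
print(repr(result))
print(result == "geeks for geeks".strip("ekgs"))
```

Note that str2 is treated as a set of characters, not as a prefix or suffix: the order of the letters in 'ekgs' does not matter.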

When we call strip() without an argument, it removes leading and trailing whitespace (spaces, tabs, and newlines).

In [101]:
# Python program to demonstrate the use of the strip() method without any argument
str1 = """ geeks for geeks """

# Print the string without stripping.
print(str1) 

# Print the string after removing all leading
# and trailing whitespace.
print(str1.strip()) 

 geeks for geeks 
geeks for geeks
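
As a final sketch, the example below shows that strip() with no argument also removes tabs and newlines, and how it compares with lstrip() and rstrip(), which each trim only one side. The padded string is made up to resemble the ratings text extracted earlier.

```python
padded = "\n\t  98,633 \n"

both_sides = padded.strip()    # trims whitespace on both ends
left_only = padded.lstrip()    # trims only the leading whitespace
right_only = padded.rstrip()   # trims only the trailing whitespace

print(repr(both_sides))
print(repr(left_only))
print(repr(right_only))
```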
