I want to use Beautiful Soup to extract the title from the movie webpage for "Skyfall". First we make the soup:
* open the HTML file by its path to get a file handle
* then pass that file handle to the BeautifulSoup constructor

In [1]:
from bs4 import BeautifulSoup

In [2]:
with open('data/rt_html/skyfall.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'lxml')

In [3]:
soup

<!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<script src="//cdn.optimizely.com/js/594670329.js"></script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="VPPXtECgUUeuATBacnqnCm4ydGO99reF-xgNklSbNbc" name="google-site-verification"/>
<meta content="034F16304017CA7DCF45D43850915323" name="msvalidate.01"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/iphone/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/icons/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/styles/css/rt_main.css" rel="stylesheet"/>
<script id="jsonLdSchema" type="application/ld+json">{"@context":"http

Now we can use the methods of the BeautifulSoup object to find and extract data from this HTML.
Let's use the `find` method to locate the title of the movie.
To do that, we search for the `title` tag:

In [4]:
soup.find('title')

<title>Skyfall (2012) - Rotten Tomatoes</title>
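To make clear what `find` hands back, here is a minimal sketch on a tiny hand-written HTML string (an assumption for illustration, not the real page). It uses the built-in `html.parser` so it runs without `lxml`. `find` returns a `Tag` object, the text inside it is available through `.contents[0]` or `.get_text()`, and a tag that does not exist gives `None`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page (assumption for illustration)
html = '<html><head><title>Skyfall (2012) - Rotten Tomatoes</title></head></html>'
soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.find('title')           # a Tag object, printed with its markup
title_text = title_tag.get_text()        # only the text between the tags
first_child = str(title_tag.contents[0]) # same text, taken from the child list

missing = soup.find('h1')                # no <h1> in this snippet, so None

print(title_tag)
print(title_text)
print(missing)
```

Because a failed `find` returns `None`, chaining `.contents` or `.get_text()` onto it raises an `AttributeError`, which is worth keeping in mind once we loop over many files.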

Next we take the text inside the tag with `.contents[0]` and use slicing to cut off the trailing ' - Rotten Tomatoes':

In [5]:
soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]

'Skyfall\xa0(2012)'

`'\xa0'` is the escape sequence for the Unicode non-breaking space (U+00A0).
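If the non-breaking space gets in the way later (for example when merging on the title), it can be swapped for a normal space. The following is only a sketch: `clean_title` and `SUFFIX` are hypothetical names of my own, and the helper also checks that the suffix is really there before slicing it off.

```python
SUFFIX = ' - Rotten Tomatoes'

def clean_title(raw_title: str) -> str:
    """Drop the site suffix (if present) and replace non-breaking spaces."""
    if raw_title.endswith(SUFFIX):
        raw_title = raw_title[:-len(SUFFIX)]
    # '\xa0' is U+00A0; replace it with an ordinary space
    return raw_title.replace('\xa0', ' ').strip()

print(clean_title('Skyfall\xa0(2012) - Rotten Tomatoes'))
```

Whether to normalise the space depends on the dataset we merge with later; if that one also contains `'\xa0'`, leaving it untouched is the safer choice.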

Now we are going to use Beautiful Soup to extract the genre, the director's name, and the runtime, along with the movie title (so we have something to merge the datasets on later), from each HTML file, and then save them in a pandas DataFrame.
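The overall pattern is to collect one dictionary per movie in a list and build the DataFrame once at the end. Here is a rough sketch of that pattern only: the column names are my assumption, and a single hand-written row stands in for the values that the parsing loop would produce.

```python
import pandas as pd

df_list = []

# Stand-in for the values extracted from each HTML file (assumption:
# in the real loop these come from the parsed soup, not a literal list)
sample_rows = [
    ('12 Angry Men (Twelve Angry Men) (1957)', 'Classics', 'Sidney Lumet', 95),
]

for title, genre, director, runtime in sample_rows:
    # One dictionary per movie; the keys become the column names
    df_list.append({'title': title,
                    'genre': genre,
                    'director': director,
                    'runtime': runtime})

# Build the DataFrame once, after the loop, with an explicit column order
df = pd.DataFrame(df_list, columns=['title', 'genre', 'director', 'runtime'])
print(df)
```

Appending to a plain list and converting once is usually cheaper than growing a DataFrame row by row inside the loop.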

In [6]:
from bs4 import BeautifulSoup
import os

In [7]:
df_list = []
folder = 'data/rt_html/'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        print(title)
        break

12 Angry Men (Twelve Angry Men) (1957)


Now we have to grab the genre. I only extract the first genre that is listed.

In [8]:
df_list = []
folder = 'data/rt_html/'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        genre = soup.find_all('ul')[14].find_all('a')[0].get_text()
        print(genre)
        break


                        Classics


We can use `strip()` to remove the surrounding whitespace:

In [9]:
df_list = []
folder = 'data/rt_html/'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        genre = soup.find_all('ul')[14].find_all('a')[0].get_text().strip()
        print(genre)
        break

Classics
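As a side note, `get_text()` can do the stripping itself through its `strip=True` argument. A small sketch on a hand-written snippet (an assumption, not the real page markup) shows that both ways give the same result here:

```python
from bs4 import BeautifulSoup

# Hand-written snippet that mimics a link padded with whitespace (assumption)
snippet = '<ul><li><a href="#">\n                Classics\n        </a></li></ul>'
link = BeautifulSoup(snippet, 'html.parser').find('a')

manual = link.get_text().strip()        # strip after extracting the text
built_in = link.get_text(strip=True)    # let Beautiful Soup strip it

print(repr(manual))
print(repr(built_in))
```

For a tag with a single piece of text the two are interchangeable. With several nested strings, `strip=True` strips each piece before joining them, so the results can differ.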


The next task is to find the director:

In [14]:
df_list = []
folder = 'data/rt_html/'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        genre = soup.find_all('ul')[14].find_all('a')[0].get_text().strip()
        director = soup.find_all('ul')[14].find_all('li')[2].find('a').get_text()
        print(director)
        break

Sidney Lumet


We also want to retrieve the runtime:

In [22]:
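df_list = []
folder = 'data/rt_html/'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html), 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'lxml')
        # Page title without the trailing ' - Rotten Tomatoes'
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        # All three fields below come from the same <ul> (index 14)
        # First link in that list is the first genre
        genre = soup.find_all('ul')[14].find_all('a')[0].get_text().strip()
        # Third <li> holds the director link
        director = soup.find_all('ul')[14].find_all('li')[2].find('a').get_text()
        # Seventh <li> holds a <time> element; drop the trailing ' minutes'
        runtime = soup.find_all('ul')[14].find_all('li')[6].find('time').get_text().strip()[:-len(' minutes')]
        print(runtime)
        # Stop after the first file while testing
        break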

95
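The runtime is still a string at this point. If we want to do arithmetic on it later, it can be converted to an integer. A minimal sketch, assuming the text always starts with the number followed by a unit (`parse_runtime` is a hypothetical helper name):

```python
def parse_runtime(text: str) -> int:
    """Turn a string such as '95 minutes' into the integer 95."""
    # Take the first whitespace-separated token and convert it
    return int(text.strip().split()[0])

print(parse_runtime('95 minutes'))
```

Taking the first token instead of slicing off `' minutes'` also copes with extra whitespace around the value, but it raises a `ValueError` if the text does not begin with a number.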
