## Topic 4

In this notebook, we explore web scraping with `BeautifulSoup`.

In [1]:
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

### Example 1

In [2]:
# Read the local HTML file; the with-block closes it automatically
with open('data/wk4_example0.html', 'r') as html_file:
    html_doc = html_file.read()
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [3]:
soup.title

<title>The Dormouse's story</title>

In [4]:
soup.title.text

"The Dormouse's story"

In [5]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [6]:
soup.find_all('p', class_ = 'title')

[<p class="title"><b>The Dormouse's story</b></p>]
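
Tags also expose their HTML attributes. As a minimal sketch (the inline `sample_html` string is a made-up stand-in for `data/wk4_example0.html`), the `href` and `id` of each link can be read with dictionary-style access or with `.get()`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the links in the example document
sample_html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# tag['href'] raises KeyError if the attribute is missing;
# tag.get('id') returns None instead
links = [
    (a.get_text(strip=True), a['href'], a.get('id'))
    for a in sample_soup.find_all('a')
]
print(links)
```

Using `.get()` is the safer choice when some tags may lack the attribute.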

### Example 2

In [7]:
url = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = urlopen(url)
html_doc = html.read()
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.h1)

<h1>War and Peace</h1>
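
The cell above assumes the request succeeds. A hedged sketch of a more defensive download step follows; `fetch_html` is a hypothetical helper (not part of `urllib` or `bs4`), and the `User-Agent` value and the 10-second timeout are assumptions, not requirements of the site:

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, or None if the request fails.

    Hypothetical helper for illustration only.
    """
    # Some servers reject the default urllib user agent; this value is an assumption
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        # The with-block closes the connection once the body has been read
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404
        print(f'HTTP error {err.code} for {url}')
    except URLError as err:
        # The server could not be reached, or the URL scheme is unknown
        print(f'Could not reach {url}: {err.reason}')
    return None
```

`HTTPError` is caught first because it is a subclass of `URLError`. The returned bytes can be passed to `BeautifulSoup(..., 'html.parser')` exactly as `html.read()` is above.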


In [8]:
nameList = soup.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
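
Some names in this output are split across two lines (for example `Anna` / `Pavlovna Scherer`) because the text inside the `span` contains a line break in the page source. A minimal sketch of normalizing the whitespace and counting how often each name occurs, using a made-up `sample_html` fragment rather than the live page:

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page: the first name contains a line break
sample_html = (
    '<span class="green">Anna\nPavlovna Scherer</span> '
    '<span class="green">the prince</span> '
    '<span class="green">the prince</span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# str.split() without arguments splits on any run of whitespace, newlines included
names = [
    ' '.join(span.get_text().split())
    for span in sample_soup.find_all('span', class_='green')
]
name_counts = Counter(names)
print(name_counts.most_common(2))
```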


In [9]:
titles = soup.find_all(['h1', 'h2','h3','h4','h5','h6'])
print(titles)

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
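
Instead of listing every heading tag, `find_all` also accepts a compiled regular expression that is matched against tag names. A small sketch on a made-up `sample_html` string:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's headings
sample_html = '<h1>War and Peace</h1><p>text</p><h2>Chapter 1</h2>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Match the tag names h1 through h6 and nothing else
heading_pattern = re.compile(r'^h[1-6]$')
headings = sample_soup.find_all(heading_pattern)
heading_texts = [h.get_text() for h in headings]
print(heading_texts)
```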


In [10]:
text = soup.find_all(id = 'text')
print(text)

[<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p></p>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the first to arrive at her
reception. <span class="green">Anna Pavlovna</span> had had a cough for some days. She was, as
she said, suffering from la grippe; grippe b

### 300 Best movies of all time

> https://editorial.rottentomatoes.com/guide/best-movies-of-all-time/

In [11]:
url = 'https://editorial.rottentomatoes.com/guide/best-movies-of-all-time/'
html = urlopen(url)
soup = BeautifulSoup(html.read(), 'html.parser')

The whole movie list is contained in a `table` tag.

In [12]:
table = soup.find_all('table')
print(table)

[<table class="aligncenter" style="width: 75%; padding: 8px;">
<tbody>
<tr style="height: 23px; border: 1px solid #dddddd;">
<td style="width: 10%; height: 23px; text-align: center;">1.</td>
<td style="width: 500px; height: 23px; border: 1px solid #dddddd;">
<p class="apple-news-link-wrap movie">
<span class="score-wrap">
<img class="apple-critic-score-icon" height="16" src="https://images.fandango.com/cms/assets/c6672520-d359-11ea-a15f-bdf29fa24277--certified-fresh.png" width="16"/>
<span class="score"><strong>97%</strong></span>
</span>
<span class="details">
<a class="title" href="https://www.rottentomatoes.com/m/the_godfather">The Godfather</a>
<span class="year">(1972)</span>
</span>
</p></td>
</tr>
<tr style="height: 23px; border: 1px solid #dddddd;">
<td style="width: 10%; height: 23px; text-align: center;">2.</td>
<td style="width: 500px; height: 23px; border: 1px solid #dddddd;">
<p class="apple-news-link-wrap movie">
<span class="score-wrap">
<img class="apple-critic-score-ic

Each movie is stored in its own table row, i.e. a `tr` tag.

In [13]:
rows = soup.find_all('tr')
print(rows)

[<tr style="height: 23px; border: 1px solid #dddddd;">
<td style="width: 10%; height: 23px; text-align: center;">1.</td>
<td style="width: 500px; height: 23px; border: 1px solid #dddddd;">
<p class="apple-news-link-wrap movie">
<span class="score-wrap">
<img class="apple-critic-score-icon" height="16" src="https://images.fandango.com/cms/assets/c6672520-d359-11ea-a15f-bdf29fa24277--certified-fresh.png" width="16"/>
<span class="score"><strong>97%</strong></span>
</span>
<span class="details">
<a class="title" href="https://www.rottentomatoes.com/m/the_godfather">The Godfather</a>
<span class="year">(1972)</span>
</span>
</p></td>
</tr>, <tr style="height: 23px; border: 1px solid #dddddd;">
<td style="width: 10%; height: 23px; text-align: center;">2.</td>
<td style="width: 500px; height: 23px; border: 1px solid #dddddd;">
<p class="apple-news-link-wrap movie">
<span class="score-wrap">
<img class="apple-critic-score-icon" height="16" src="https://images.fandango.com/cms/assets/c6672520-
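
Because each `tr` holds one movie, the fields can also be extracted row by row, which keeps rank, title, year, and score aligned even when a row lacks one of them. A minimal sketch, assuming a hand-written `sample_html` that copies the row structure printed above (inline styles and the image omitted); `text_or_none` and `parse_row` are hypothetical helpers:

```python
from bs4 import BeautifulSoup

# Hand-written sample copying the row structure shown above
sample_html = """
<table><tbody>
<tr>
<td>1.</td>
<td><p class="apple-news-link-wrap movie">
<span class="score-wrap"><span class="score"><strong>97%</strong></span></span>
<span class="details">
<a class="title" href="https://www.rottentomatoes.com/m/the_godfather">The Godfather</a>
<span class="year">(1972)</span>
</span>
</p></td>
</tr>
</tbody></table>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when find() matched nothing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_row(row):
    """Collect the fields of one movie from a single <tr> tag."""
    rank = text_or_none(row.find('td'))  # first cell holds the rank, e.g. '1.'
    year = text_or_none(row.find('span', class_='year'))
    return {
        'rank': rank.rstrip('.') if rank else None,
        'title': text_or_none(row.find('a', class_='title')),
        'year': year.strip('()') if year else None,
        'score': text_or_none(row.find('span', class_='score')),
    }


sample_soup = BeautifulSoup(sample_html, 'html.parser')
movies = [parse_row(tr) for tr in sample_soup.find_all('tr')]
print(movies)
```

A list of dictionaries like `movies` can be passed directly to `pd.DataFrame`.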

In [14]:
title_tags = soup.find_all('a', class_ = 'title')
titles = [t.get_text(strip = True) for t in title_tags]
print(titles)

['The Godfather', 'Seven Samurai', 'Casablanca', 'Rear Window', 'L.A. Confidential', 'On the Waterfront', 'Chinatown', 'Modern Times', 'The Battle of Algiers', "Schindler's List", '12 Angry Men', 'All About Eve', 'Parasite', "Singin' in the Rain", 'Stop Making Sense', 'Sunset Boulevard', 'Toy Story', 'The Third Man', 'Toy Story 2', 'Top Gun: Maverick', 'Star Wars: Episode IV - A New Hope', 'Godzilla Minus One', 'Cool Hand Luke', 'The Philadelphia Story', 'How to Train Your Dragon', 'M', 'Citizen Kane', 'The Godfather, Part II', 'Three Colors: Red', 'The Decalogue', 'A Separation', 'The Kid', 'Summer of Soul (...Or, When the Revolution Could Not Be Televised)', 'Toy Story 3', 'Finding Nemo', 'Dr. Strangelove Or: How I Learned to Stop Worrying and Love the Bomb', 'Still: A Michael J. Fox Movie', 'Sinners', 'Up', 'The Wages of Fear', 'The Maltese Falcon', 'Spotlight', 'The Wrestler', 'Grave of the Fireflies', 'North by Northwest', 'Psycho', 'Bicycle Thieves', 'Spider-Man: Into the Spider-

In [15]:
year_tags = soup.find_all('span', class_ = 'year')
years = [y.get_text(strip = True).strip('()') for y in year_tags]
print(years)

['1972', '1954', '1942', '1954', '1997', '1954', '1974', '1936', '1966', '1993', '1957', '1950', '2019', '1952', '1984', '1950', '1995', '1949', '1999', '2022', '1977', '2023', '1967', '1940', '2010', '1931', '1941', '1974', '1994', '1989', '2011', '1921', '2021', '2010', '2003', '1964', '2023', '2025', '2009', '1953', '1941', '2015', '2008', '1988', '1959', '1960', '1948', '2018', '1939', '2020', '1934', '2001', '1948', '1959', '2016', '1944', '1977', '2008', '2024', '2017', '2024', '2015', '1969', '2018', '1949', '1985', '2018', '1957', '1953', '1995', '1975', '2020', '1957', '1966', '1979', '2014', '2024', '1927', '2023', '1956', '2017', '1999', '1982', '1971', '1955', '1979', '1964', '2018', '1944', '2008', '2024', '2020', '1946', '1930', '1940', '1945', '1952', '1940', '2019', '1925', '2013', '1959', '2024', '2019', '2024', '2016', '2018', '1954', '2023', '2018', '2002', '1984', '1962', '1940', '2010', '1955', '2024', '1943', '1964', '2005', '1963', '1964', '1941', '1978', '1974',

In [16]:
score_tags = soup.find_all('span', class_ = 'score')
scores = [s.find('strong').get_text(strip = True) for s in score_tags]
print(scores)

['97%', '100%', '99%', '99%', '99%', '99%', '98%', '98%', '99%', '98%', '100%', '99%', '99%', '100%', '100%', '98%', '100%', '99%', '100%', '96%', '94%', '99%', '100%', '100%', '99%', '100%', '99%', '96%', '100%', '100%', '99%', '100%', '99%', '98%', '99%', '98%', '99%', '97%', '98%', '100%', '99%', '97%', '99%', '100%', '97%', '97%', '99%', '97%', '98%', '98%', '98%', '96%', '100%', '99%', '98%', '97%', '97%', '100%', '100%', '97%', '97%', '98%', '100%', '99%', '100%', '100%', '98%', '100%', '100%', '100%', '97%', '99%', '100%', '97%', '93%', '99%', '97%', '97%', '99%', '100%', '99%', '98%', '100%', '98%', '98%', '100%', '98%', '100%', '100%', '98%', '100%', '100%', '98%', '98%', '99%', '100%', '98%', '99%', '97%', '98%', '98%', '100%', '99%', '99%', '98%', '98%', '99%', '98%', '97%', '97%', '95%', '100%', '100%', '98%', '98%', '100%', '97%', '100%', '99%', '98%', '97%', '100%', '99%', '98%', '100%', '100%', '98%', '97%', '97%', '98%', '100%', '98%', '97%', '98%', '100%', '98%', '96%'

In [17]:
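# Rank cells are the <td> tags whose text ends with a period, e.g. '1.'
rank_tags = [
    td for td in soup.find_all('td')
    if td.get_text(strip = True).endswith('.')
]
# Drop the trailing period to keep only the number
ranks = [r.get_text(strip = True).strip('.') for r in rank_tags]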
print(ranks)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '

In [18]:
print(len(titles), len(years), len(scores), len(ranks))

300 300 300 300
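
The four lists have matching lengths, but every value is still a string. A minimal sketch of converting them to integers before building the DataFrame, using short made-up lists in place of the scraped ones (the `sample_` names are assumptions):

```python
# Short made-up stand-ins with the same shape as the scraped lists
sample_ranks = ['1', '2', '3']
sample_years = ['1972', '1954', '1942']
sample_scores = ['97%', '100%', '99%']

# Guard against silently misaligned columns before combining them
assert len(sample_ranks) == len(sample_years) == len(sample_scores)

rank_numbers = [int(r) for r in sample_ranks]
year_numbers = [int(y) for y in sample_years]
score_numbers = [int(s.rstrip('%')) for s in sample_scores]  # '97%' -> 97
print(rank_numbers, year_numbers, score_numbers)
```

Numeric columns make later sorting and filtering behave as expected; as strings, `'100%'` would sort before `'97%'`.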


In [19]:
df = pd.DataFrame({
    'rank': ranks,
    'title': titles,
    'year': years,
    'score': scores
})
df.head(10)

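|   | rank | title | year | score |
|---|------|-------|------|-------|
| 0 | 1 | The Godfather | 1972 | 97% |
| 1 | 2 | Seven Samurai | 1954 | 100% |
| 2 | 3 | Casablanca | 1942 | 99% |
| 3 | 4 | Rear Window | 1954 | 99% |
| 4 | 5 | L.A. Confidential | 1997 | 99% |
| 5 | 6 | On the Waterfront | 1954 | 99% |
| 6 | 7 | Chinatown | 1974 | 98% |
| 7 | 8 | Modern Times | 1936 | 98% |
| 8 | 9 | The Battle of Algiers | 1966 | 99% |
| 9 | 10 | Schindler's List | 1993 | 98% |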


In [None]:
# df.to_csv('data/movies.csv', index = False)