# Intro to web scraping

The first step in web scraping is to identify a website and download its HTML code.

Real HTML from websites tends to be long and a bit too chaotic for a total beginner. Here we will start with a dummy HTML document and learn the basics of extracting information with Beautiful Soup.

In [2]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [3]:
html_doc

'\n<!DOCTYPE html>\n<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n\n<p class="story">...</p>\n</html>\n'

In [4]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [5]:
# parse the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

In [6]:
# prettify() is a method: call it (and print the result) to see the indented document
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

#### accessing single elements

In [7]:
soup.title

<title>The Dormouse's story</title>

In [8]:
soup.title.string

"The Dormouse's story"

In [9]:
soup.title.parent.name

'head'

In [10]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [11]:
# accessing a tag name as an attribute only retrieves the first element with that tag
soup.p

<p class="title"><b>The Dormouse's story</b></p>

#### finding all elements of a tag with find_all()

In [12]:
p_tags = soup.find_all("p")

In [13]:
for p in p_tags:
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [14]:
html = '<title>This is a title</title>there is no tag for this text<something>This is something</something>'
soup_2 = BeautifulSoup(html, 'html.parser')

print(soup_2.get_text())                #This is a titlethere is no tag for this textThis is something
print(soup_2.text)                      #This is a titlethere is no tag for this textThis is something
print(soup_2.string)                    #None
print(soup_2.find('title').get_text())  #This is a title
print(soup_2.find('title').string)      #This is a title

This is a titlethere is no tag for this textThis is something
This is a titlethere is no tag for this textThis is something
None
This is a title
This is a title


#### Using css selectors

Let's first learn the syntax of CSS selectors by playing this game: https://flukeout.github.io/

Everyone should reach level 12!

In [15]:
soup.select("a")

for a in soup.select('a'):
    print(a.get_text())

Elsie
Lacie
Tillie


In [16]:
# select all elements with class="title"
soup.select(".title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [17]:
# select all elements with class="sister"
soup.select(".sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [18]:
# select the element with id="link2" (select() still returns a list, even for a single match)
soup.select("#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
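
Selectors can also be combined. The block below is a minimal sketch rather than part of the original walkthrough: it rebuilds a shortened copy of the Dormouse document (the names `snippet` and `demo` are made up for this example) and tries a child combinator, an attribute selector, and `select_one()`, which returns the first match directly instead of a list.

```python
from bs4 import BeautifulSoup

# shortened copy of the example document; `snippet` and `demo` are illustrative names
snippet = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

# child combinator: <a> tags that are direct children of <p class="story">
child_links = demo.select("p.story > a")
print([a.get_text() for a in child_links])

# attribute selector: <a> tags whose href ends with "tillie"
tillie_links = demo.select('a[href$="tillie"]')
print(tillie_links[0]["id"])

# select_one() returns the first matching element (or None), not a list
first_sister = demo.select_one("a.sister")
print(first_sister.get_text())
```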

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, works on a single element, while `select()` returns a list that may contain several elements. So we either index into the result or, more commonly, iterate through it.

In [19]:
print(soup.select("p.story")[0].get_text())

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.


In [20]:
for p in soup.select("p.story"):
    print(p.get_text())

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
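
Instead of printing inside the loop, the texts can be collected into a list. The sketch below assumes a shortened stand-in for the example document (`snippet` and `demo` are illustrative names) and also collapses the line breaks inside each paragraph, which is often handy before storing scraped text.

```python
from bs4 import BeautifulSoup

# shortened stand-in for the example document; `snippet` and `demo` are illustrative names
snippet = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

# str.split() with no arguments splits on any run of whitespace, so joining the
# pieces with single spaces removes the line breaks inside each paragraph
stories = [" ".join(p.get_text().split()) for p in demo.select("p.story")]
print(stories)
```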


### Your turn:

Write code to print the following contents (only the human-readable text, without the HTML tags):

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [21]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [22]:
# Create the "soup"
soup_2 = BeautifulSoup(geography, 'html.parser')

print(soup_2.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [1]:
# 1. All the "fun facts"
fun_facts = soup_2.find_all("p")
for p in fun_facts:
    print(p.get_text())

In [23]:
# 2. The names of all the places.


In [24]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)


In [25]:
# 4. The names (not facts!) of all the cities (not countries!)


## Use case: IMDb top charts

Let's go to https://www.imdb.com/chart/top, where we'll see the top 250 movies according to IMDb ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- IMDb rating

- Director & main stars (they appear when you hover over the title)

Our objective is to scrape this information and store it in a pandas DataFrame.



In [26]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [27]:
# 2. find url and store it in a variable
url = "https://www.imdb.com/chart/top"

In [28]:
# 3. download html with a get request
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [29]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup

In [None]:
# 5. retrieve/extract the desired info (here, you'll paste the "Selector" you copied before to get the element that belongs to the top movie)

soup.select("#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(1) > td.titleColumn > a")[0].get_text()

'Die Verurteilten'

The selector we copied is long and ugly, isn't it? It also selects only a single movie, while we want to collect data from all of them. Going from that particular selector to one that's more "general" and "elegant" is the actual work a web scraper needs to do.

In this case, we can play around with different tags and classes until we notice that all the information about the movies sits under the tag `<td class="titleColumn">`. We're lucky that there's not much "trash" under this tag, just the info we need.

In [None]:
soup.select("td.titleColumn") # all the info about all the movies

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Die Verurteilten</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">Der Pate</a>
 <span class="secondaryInfo">(1972)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">Der Pate 2</a>
 <span class="secondaryInfo">(1974)</span>
 </td>,
 <td class="titleColumn">
       4.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>
 <span class="secondaryInfo">(2008)</span>
 </td>,
 <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">Die zwölf Geschworenen</a>
 <span class="secondaryInfo">(1957)</span>
 

In [None]:
soup.select("td.titleColumn a") # all elements containing movie titles

[<a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Die Verurteilten</a>,
 <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">Der Pate</a>,
 <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">Der Pate 2</a>,
 <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>,
 <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">Die zwölf Geschworenen</a>,
 <a href="/title/tt0108052/" title="Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes">Schindlers Liste</a>,
 <a href="/title/tt0167260/" title="Peter Jackson (dir.), Elijah Wood, Viggo Mortensen">Der Herr der Ringe: Die Rückkehr des Königs</a>,
 <a href="/title/tt0110912/" title="Quentin Tarantino (dir.), John Travolta, Uma Thurman">Pulp Fiction</a>,
 <a href="/title/tt0060196/" title="Sergio Leone (dir.), Clint Eastwood, Eli Wallach">Zwei glorre

In [None]:
# we can use .get_text() to extract the content of the tags we selected
# we'll need to do it to each tag with a for loop: here we do it to the first one
soup.select("td.titleColumn a")[0]
soup.select("td.titleColumn a")[0].get_text()

'Die Verurteilten'

In [None]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:
soup.select("td.titleColumn a")[0]["title"] 

# instead of ["title"] we could use .get("title"): choose whatever you prefer

'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'
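
The two ways of reading an attribute differ when the attribute is missing. A small sketch, assuming a single `<a>` tag copied from the output above (`tag_html` and `link` are illustrative names):

```python
from bs4 import BeautifulSoup

# one <a> tag copied from the output above; `tag_html` and `link` are illustrative names
tag_html = '<a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Die Verurteilten</a>'
link = BeautifulSoup(tag_html, "html.parser").a

print(link["title"])      # dictionary-style access
print(link.get("title"))  # same value via .get()
print(link.attrs)         # all attributes as a dictionary

# for a missing attribute, [] raises KeyError while .get() returns None
print(link.get("class"))
try:
    link["class"]
except KeyError:
    print("no 'class' attribute on this tag")
```

When an attribute might be absent on some elements, `.get()` is the more forgiving choice inside a loop.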

In [None]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# we also specify the parent tag and its class, which is the same we used before
# the years are inside parentheses, but we'll take care of that later
soup.select("td.titleColumn span.secondaryInfo")[0].get_text()

'(1994)'

### Storing information in lists

In [None]:
#initialize empty lists
title = []
dir_stars = []
year = []

In [None]:
# define the number of iterations of our for loop 
# by checking how many elements are in the retrieved result set
# (this is equivalent to, but more robust than, hard-coding 250 iterations)
num_iter = len(soup.select("td.titleColumn a"))

In [None]:
table = soup.select("tbody > tr")

In [None]:
table[2].select("td.titleColumn a")

[<a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">Der Pate 2</a>]

In [None]:
# iterate through the result set and retrieve all the data
# (each selector runs once, outside the loop, instead of on every iteration)
links = soup.select("td.titleColumn a")
years = soup.select("td.titleColumn span.secondaryInfo")
for i in range(num_iter):
    title.append(links[i].get_text())
    dir_stars.append(links[i]["title"])
    year.append(years[i].get_text())

In [None]:
#if condition returns True, then nothing happens
#if condition returns False, AssertionError is raised
assert len(title) == len(dir_stars) == len(year)
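
An alternative to three parallel lists is to go through the table row by row, as hinted at by the `tbody > tr` selection above, so that the values of one movie always stay together. The following is only a sketch under assumptions: the two-row table is rebuilt by hand from the output shown earlier and stands in for the real page, and `rows_html`, `sample` and `records` are illustrative names.

```python
from bs4 import BeautifulSoup

# two rows rebuilt from the output shown earlier; this small table is a stand-in
# for the real page, and `rows_html`, `sample`, `records` are illustrative names
rows_html = """
<table><tbody>
<tr><td class="titleColumn">1.
  <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Die Verurteilten</a>
  <span class="secondaryInfo">(1994)</span></td></tr>
<tr><td class="titleColumn">2.
  <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">Der Pate</a>
  <span class="secondaryInfo">(1972)</span></td></tr>
</tbody></table>
"""
sample = BeautifulSoup(rows_html, "html.parser")

records = []
for row in sample.select("tbody > tr"):
    link = row.select_one("td.titleColumn a")
    year_tag = row.select_one("td.titleColumn span.secondaryInfo")
    if link is None:
        continue  # skip rows that do not describe a movie
    records.append({
        "movie_name": link.get_text(),
        "director_stars": link.get("title"),
        # keep None when a row has no year instead of shifting later values
        "release_year": year_tag.get_text() if year_tag is not None else None,
    })

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame()`.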

### Storing information in pandas DataFrames

In [None]:
import pandas as pd

In [None]:
movies_df = pd.DataFrame(
    {"movie_name": title,
     "director_stars": dir_stars,
     "release_year": year
    }
)

In [None]:
movies_df.head()

| | movie_name | director_stars | release_year |
|---|---|---|---|
| 0 | Die Verurteilten | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | (1994) |
| 1 | Der Pate | Francis Ford Coppola (dir.), Marlon Brando, Al... | (1972) |
| 2 | Der Pate 2 | Francis Ford Coppola (dir.), Al Pacino, Robert... | (1974) |
| 3 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... | (2008) |
| 4 | Die zwölf Geschworenen | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | (1957) |


#### Challenge: Cleaning the data

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the year out of the parentheses: regex would work, but string methods such as `str.replace()` are probably simpler to use. Additionally, this column should be converted to a numerical data type.

- Split `director_stars` into 3 columns, one for each person: "director", "star_1", "star_2". This could have been done by filtering when extracting the data from the HTML document, but it looks easier afterwards:

    - The "(dir.)" pattern can be totally removed
    - We can split the string at each comma
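
To make the two steps concrete, here is a minimal sketch on the raw strings of a single movie, using plain Python string methods. The variable names are illustrative, and applying the same idea to the whole DataFrame remains the challenge.

```python
# values of one scraped row, copied from the output above; names are illustrative
raw_year = "(1994)"
raw_people = "Frank Darabont (dir.), Tim Robbins, Morgan Freeman"

# remove both parentheses, then convert the remaining digits to an integer
clean_year = int(raw_year.replace("(", "").replace(")", ""))

# drop the "(dir.)" marker, split at each comma and trim surrounding spaces
people = [name.strip() for name in raw_people.replace("(dir.)", "").split(",")]
director, star_1, star_2 = people

print(clean_year, director, star_1, star_2, sep=" | ")
```

The unpacking into three names assumes every entry lists exactly one director and two stars, which is worth checking before applying it to all rows.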

In [None]:
# your code here