# Intro to web scraping

The first step in web scraping is to pick a website and download its HTML.

Real HTML from websites tends to be long and too chaotic for a complete beginner, so we will start with a dummy HTML document and learn the basics of extracting information with Beautiful Soup (the `bs4` package).

In [15]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [16]:
html_doc

'\n<!DOCTYPE html>\n<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n\n<p class="story">...</p>\n</html>\n'

In [17]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [18]:
# parse the HTML document with Python's built-in parser
soup = BeautifulSoup(html_doc, 'html.parser')

#### accessing single elements

We can access an HTML tag by appending a dot and the tag's name to the soup object, e.g.:

* title
* body
* p
* a

If the document contains several instances of the tag, **only the first one is retrieved**.
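
As a quick check of the "first match only" behaviour, the sketch below compares dot access with `find()` and `find_all()` on a tiny made-up snippet. The `snippet` string and the variable names are illustrative and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet, just for illustration
snippet = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
demo = BeautifulSoup(snippet, "html.parser")

first_by_dot = demo.a            # dot access: first <a> only
first_by_find = demo.find("a")   # find() also returns the first match
all_links = demo.find_all("a")   # find_all() returns every match

print(first_by_dot["id"], first_by_find["id"], len(all_links))
```

Dot access is handy for exploring a document; `find_all()` is the tool to reach for when every match matters.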



In [5]:
soup.title

<title>The Dormouse's story</title>

In [6]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [7]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

#### finding all elements of a tag with find_all()

`find_all()` returns a list with every element of a given tag. To retrieve only the elements that have a particular attribute (`id`, `class`), we can pass a dictionary as the second argument to `find_all()`. If the dictionary value is a list of classes, elements carrying any of those classes are matched.

In [8]:
soup.find_all("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [9]:
len(soup.find_all("p"))

3

In [12]:
soup.find_all("p")[-1]

<p class="story">...</p>

In [14]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can narrow down which tags we get by providing additional attributes, such as the `class`, in a dictionary.

In [13]:
soup.find_all("p", {"class":"story"})

[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>, <p class="story">...</p>]

In [14]:
soup.find_all("p", {"class":"story"})[0]

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

In [15]:
soup.find_all("p", {"class":"story"})[-1]

<p class="story">...</p>

In [16]:
soup.find_all("a", {"id":"link2"})

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
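
The attribute filters above can be combined in a few ways. The following sketch, on a made-up snippet (not the notebook's `html_doc`), shows a list of classes, the `class_` keyword spelling, and a CSS selector for the case where an element must carry all the listed classes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first paragraph carries two classes
snippet = """
<p class="story intro">First</p>
<p class="story">Second</p>
<p class="title">Third</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

# A list of values matches a tag carrying ANY of the listed classes
any_match = demo.find_all("p", {"class": ["intro", "title"]})

# class_ is the keyword-argument spelling of the same kind of filter
by_keyword = demo.find_all("p", class_="story")

# To require ALL classes at once, a CSS selector is the simplest route
both = demo.select("p.story.intro")

print([p.get_text() for p in any_match])
print([p.get_text() for p in by_keyword])
print([p.get_text() for p in both])
```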

#### Using css selectors

Let's learn first the syntax of css selectors playing this game: https://flukeout.github.io/

Everyone should reach level 6!

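To specify a hierarchy, we can use `>` (the child combinator):

`soup.select("tag1 > tag2")` selects every `tag2` that is a direct child of a `tag1`. With a space instead (`"tag1 tag2"`), `tag2` is matched at any depth inside `tag1`.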

In [19]:
soup.select("p > b")

[<b>The Dormouse's story</b>]

In [20]:
soup.select("p > b")[0].get_text()

"The Dormouse's story"

In [21]:
type(soup.select("p > b")[0].get_text())

str
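
To make the difference between `>` and a plain space concrete, here is a small sketch on a made-up snippet where an `<i>` tag is nested inside a `<b>` inside a `<p>`.

```python
from bs4 import BeautifulSoup

# Hypothetical nesting: p > b > i
snippet = "<div><p>outer <b>bold <i>nested</i></b></p></div>"
demo = BeautifulSoup(snippet, "html.parser")

direct_i = demo.select("p > i")   # <i> is not a direct child of <p>
any_depth_i = demo.select("p i")  # a space matches descendants at any depth
direct_b = demo.select("p > b")   # <b> is a direct child of <p>

print(len(direct_i), len(any_depth_i), len(direct_b))
```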

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, is called on a single element, while `select()` returns a list that may hold several elements. It is common to iterate through the output of `select()`:

In [22]:
[elem.get_text().replace("\n"," ") for elem in soup.select("p")]

["The Dormouse's story",
 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.',
 '...']
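
Replacing `"\n"` works for this document, but real pages often mix newlines, tabs and runs of spaces. One alternative, sketched here on a made-up snippet, is to split the text on any whitespace and join it back with single spaces.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with indentation and line breaks
snippet = """<p>
  Once upon a time
  <a>Elsie</a>,
  <a>Lacie</a>
</p>"""
demo = BeautifulSoup(snippet, "html.parser")

raw = demo.p.get_text()
# str.split() with no argument splits on any run of whitespace
normalized = " ".join(raw.split())

print(repr(raw))
print(normalized)
```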

### Your turn:

Write code to print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [24]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [25]:
soup = BeautifulSoup(geography,'html.parser')

In [26]:
[elem.get_text() for elem in soup.find_all("p")]

['London is the most popular tourist destination in the world.',
 'Paris was originally a Roman City called Lutetia.',
 "Spain produces 43,8% of all the world's Olive Oil."]

In [27]:
soup.find_all("div", {"class":"city"})

[<div class="city">
 <h2>London</h2>
 <p>London is the most popular tourist destination in the world.</p>
 </div>,
 <div class="city">
 <h2>Paris</h2>
 <p>Paris was originally a Roman City called Lutetia.</p>
 </div>]

In [28]:
for elem in soup.find_all("div", {"class":"city"}):
    #print(elem.find_all("h2")[0].get_text())
    #print(elem.find_all("p")[0].get_text())
    print("{}: facts: {}".format(elem.find_all("h2")[0].get_text(),elem.find_all("p")[0].get_text()))

London: facts: London is the most popular tourist destination in the world.
Paris: facts: Paris was originally a Roman City called Lutetia.


In [29]:
[elem.get_text() for elem in soup.select("h2")]

['London', 'Paris', 'Spain']

In [30]:
soup = BeautifulSoup(geography, 'html.parser')

In [None]:
# 1. All the "fun facts" using .find_all()


[<div class="city">
 <h2>London</h2>
 <p>London is the most popular tourist destination in the world.</p>
 </div>, <div class="city">
 <h2>Paris</h2>
 <p>Paris was originally a Roman City called Lutetia.</p>
 </div>]

First get the tags that contain the text you want

[<p>London is the most popular tourist destination in the world.</p>,
 <p>Paris was originally a Roman City called Lutetia.</p>]

Now get the text inside

['London is the most popular tourist destination in the world.',
 'Paris was originally a Roman City called Lutetia.']

In [None]:
# 2. The names of all the places.

['London', 'Paris', 'Spain']

In [None]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)


['London', 'Paris']

In [None]:
for elem in soup.find_all("div", {"class":"city"}):
    #print(elem.h2.get_text() + ': ' + elem.p.get_text())
    print(elem.h2.get_text() + ': ' + ' '.join(elem.p.get_text().split()[1:]))
    #print(' '.join(elem.p.get_text().split()[1:]))

London: is the most popular tourist destination in the world.
Paris: was originally a Roman City called Lutetia.


In [None]:
# 4. The names (not facts!) of all the cities (not countries!)


London
Paris
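
One possible solution for exercise 4, as a sketch: a CSS selector that keeps only the `h2` tags directly inside a `div` with the `city` class. The block uses a shortened copy of the `geography` document (with placeholder facts) so that it runs on its own.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the geography document
geo = """
<div class="city"><h2>London</h2><p>Fact about London.</p></div>
<div class="city"><h2>Paris</h2><p>Fact about Paris.</p></div>
<div class="country"><h2>Spain</h2><p>Fact about Spain.</p></div>
"""
geo_soup = BeautifulSoup(geo, "html.parser")

# Only <h2> tags whose parent is <div class="city">
city_names = [h2.get_text() for h2 in geo_soup.select("div.city > h2")]
for name in city_names:
    print(name)
```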


## Use case: imdb top charts

Let's go to https://www.imdb.com/chart/top, where we'll see the top 250 movies according to IMDb ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- IMDb rating

- Director & main stars (they appear when you hover over the title)

Our objective is going to be to scrape this information and store it in a pandas dataframe. We will proceed in steps:

1.
* Store the titles inside a list of titles
* Store the release year inside a list of years
* Store the rating inside another list
* Store the director and main stars into another list

2.
* Create a dictionary in which the keys will contain the column names of the dataframe and the values will be the lists created before

3.
* Create the dataframe from the dictionary
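
The three steps above can be sketched with toy data before touching the real page. The `toy_` lists below are placeholders standing in for the scraped values.

```python
import pandas as pd

# Step 1: one list per variable (toy values, illustrative only)
toy_titles = ["Movie A", "Movie B"]
toy_years = [1994, 1972]
toy_ratings = [9.2, 9.1]

# Step 2: keys become column names, each list becomes one column
toy_movies = {"title": toy_titles, "year": toy_years, "rating": toy_ratings}

# Step 3: build the dataframe from the dictionary
toy_df = pd.DataFrame(toy_movies)
print(toy_df)
```

All lists need the same length, otherwise `pd.DataFrame` raises a `ValueError`. Checking `len()` of each list before this step is a cheap way to catch a selector that matched too much or too little.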


In [31]:
# 1. import libraries
import requests # to download html code
from bs4 import BeautifulSoup # to navigate through the html code
import pandas as pd
import numpy as np
import re

In [32]:
# 2. find url and store it in a variable
url = "https://www.imdb.com/chart/top"

In [33]:
# 3. download the html with a GET request: call requests.get() and store the output in `response`
response = requests.get(url)
# a 200 status code means OK
print(response.status_code)

200
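
A slightly more defensive version of the download step is sketched below. The function name `fetch_html`, the header values and the timeout are assumptions for illustration. Some sites reject the default `requests` user agent, and the titles scraped later in this notebook come back in German, which suggests the response was localised. Whether a site honours the `Accept-Language` header is up to the site.

```python
import requests


def fetch_html(url, language="en-US", timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    headers = {
        # Placeholder user agent; adjust to identify your own script
        "User-Agent": "Mozilla/5.0 (tutorial example)",
        # Ask for content in a given language; the site may ignore it
        "Accept-Language": language,
    }
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Usage (needs network access):
# html = fetch_html("https://www.imdb.com/chart/top")
```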


In [34]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.text, 'html.parser')
# 4.2. check that the html code looks like it should
print(soup.prettify())

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   Top 250 Movies - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
  </script>
  <link href="https://www.imdb.com/chart/top" rel="canonical"/>
  <meta content="http://ww

In [36]:
# 5. retrieve/extract the desired info (here, you'll paste the "Selector" you copied before to get the element that belongs to the top movie)

text = soup.select("td.titleColumn")[0].get_text()
print("Original text: ",text)
#\b(\w*\s){1,}
#re.search(r"\b(\w*\s{1}){1,}",text).group()[:-1]
#print("Final text: ",re.match(r"\b\w*\s+\w*",text))

Original text:  
      1.
      Die Verurteilten
(1994)
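
The text of the whole cell holds the rank, the title and the year on separate lines. As a sketch, plain string methods are enough to pull the three pieces apart. The exact whitespace in `cell_text` is an assumption based on the printed output.

```python
# Cell text roughly as printed above (whitespace is an assumption)
cell_text = "\n      1.\n      Die Verurteilten\n(1994)\n"

# Keep non-empty lines, stripped of surrounding whitespace
parts = [line.strip() for line in cell_text.splitlines() if line.strip()]
rank, title, year = parts

print(parts)
```

Selecting the `a` and `span` tags separately, as done in the next cells, avoids this kind of post-processing.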



Let's start creating the list of titles.

In [40]:
titles = []

In [46]:
titles = [elem.get_text() for elem in soup.select("td.titleColumn a")]
titles

['Die Verurteilten',
 'Der Pate',
 'Der Pate 2',
 'The Dark Knight',
 'Die zwölf Geschworenen',
 'Schindlers Liste',
 'Der Herr der Ringe: Die Rückkehr des Königs',
 'Pulp Fiction',
 'Zwei glorreiche Halunken',
 'Der Herr der Ringe: Die Gefährten',
 'Fight Club',
 'Forrest Gump',
 'Inception',
 'Der Herr der Ringe: Die zwei Türme',
 'Das Imperium schlägt zurück',
 'Matrix',
 'GoodFellas - Drei Jahrzehnte in der Mafia',
 'Einer flog über das Kuckucksnest',
 'Die sieben Samurai',
 'Sieben',
 'Das Schweigen der Lämmer',
 'City of God',
 'Das Leben ist schön',
 'Ist das Leben nicht schön?',
 'Krieg der Sterne',
 'Der Soldat James Ryan',
 'Interstellar',
 'Chihiros Reise ins Zauberland',
 'The Green Mile',
 'Parasite',
 'Léon: Der Profi',
 'Harakiri',
 'Der Pianist',
 'Terminator 2: Tag der Abrechnung',
 'Die üblichen Verdächtigen',
 'Zurück in die Zukunft',
 'Psycho',
 'Der König der Löwen',
 'Moderne Zeiten',
 'American History X',
 'Die letzten Glühwürmchen',
 'Lichter der Großstadt - Ei

Now the lists of years.

In [47]:
years = []


In [48]:
years = [int(re.sub(r"\D", "", elem.get_text())) for elem in soup.find_all("span", {"class": "secondaryInfo"})]
years

[1994,
 1972,
 1974,
 2008,
 1957,
 1993,
 2003,
 1994,
 1966,
 2001,
 1999,
 1994,
 2010,
 2002,
 1980,
 1999,
 1990,
 1975,
 1954,
 1995,
 1991,
 2002,
 1997,
 1946,
 1977,
 1998,
 2014,
 2001,
 1999,
 2019,
 1994,
 1962,
 2002,
 1991,
 1995,
 1985,
 1960,
 1994,
 1936,
 1998,
 1988,
 1931,
 2014,
 2000,
 2006,
 2011,
 2006,
 1942,
 1968,
 1954,
 1988,
 1979,
 1979,
 2000,
 1981,
 1940,
 2006,
 2012,
 1957,
 1950,
 2008,
 2018,
 1957,
 1980,
 2018,
 1964,
 1997,
 2019,
 2003,
 2016,
 2012,
 2017,
 1986,
 1984,
 2018,
 2019,
 1981,
 1963,
 1999,
 2009,
 1995,
 1984,
 1995,
 2020,
 2009,
 1997,
 1983,
 1968,
 1992,
 1931,
 1958,
 2007,
 1985,
 1941,
 2012,
 2000,
 1952,
 1959,
 2004,
 1948,
 1952,
 1962,
 1921,
 1987,
 2016,
 2020,
 1960,
 2010,
 1971,
 1927,
 1955,
 1976,
 1944,
 2021,
 2011,
 1973,
 1983,
 2000,
 2019,
 2001,
 1962,
 2010,
 1965,
 2009,
 1989,
 1995,
 1997,
 1961,
 1985,
 1988,
 1950,
 2018,
 2004,
 1975,
 1950,
 1959,
 2005,
 1992,
 1997,
 2004,
 2013,
 1961,
 1963,
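
The year extraction can be checked in isolation on a few sample strings. This sketch compares the regex used above with a plain string method; the `raw_years` list is illustrative.

```python
import re

# Sample strings in the scraped format
raw_years = ["(1994)", "(1972)", "(2008)"]

# Option 1: regex, removing every non-digit character
with_regex = [int(re.sub(r"\D", "", y)) for y in raw_years]

# Option 2: string method, stripping the surrounding parentheses
with_strip = [int(y.strip("()")) for y in raw_years]

print(with_regex)
print(with_strip)
```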

Now the ratings.

In [49]:
#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(1) > td.ratingColumn.imdbRating
ratings = []


In [50]:
ratings = [float(elem.get_text()) for elem in soup.select("strong")]
ratings

[9.2,
 9.1,
 9.0,
 9.0,
 8.9,
 8.9,
 8.9,
 8.8,
 8.8,
 8.8,
 8.8,
 8.7,
 8.7,
 8.7,
 8.7,
 8.6,
 8.6,
 8.6,
 8.6,
 8.6,
 8.6,
 8.6,
 8.6,
 8.6,
 8.6,
 8.6,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.5,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.4,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.3,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.2,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1,
 8.1
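
Selecting every `strong` tag works on this page, but it would break as soon as the page used `strong` anywhere else. A scoped selector is safer. The sketch below uses made-up markup; the class names are taken from the selector copied in the comment above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two rating cells plus an unrelated <strong>
snippet = """
<p><strong>Sponsored</strong></p>
<table>
  <tr><td class="ratingColumn imdbRating"><strong>9.2</strong></td></tr>
  <tr><td class="ratingColumn imdbRating"><strong>9.1</strong></td></tr>
</table>
"""
demo = BeautifulSoup(snippet, "html.parser")

broad = demo.select("strong")                 # also catches "Sponsored"
scoped = demo.select("td.imdbRating strong")  # only the rating cells
demo_ratings = [float(tag.get_text()) for tag in scoped]

print(len(broad), demo_ratings)
```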

Now the directors.

In [52]:
#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(1) > td.titleColumn > a
directors = []


In [53]:
[elem.select("a") for elem in soup.find_all("td", {"class":"titleColumn"})]

[[<a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Die Verurteilten</a>],
 [<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">Der Pate</a>],
 [<a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">Der Pate 2</a>],
 [<a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>],
 [<a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">Die zwölf Geschworenen</a>],
 [<a href="/title/tt0108052/" title="Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes">Schindlers Liste</a>],
 [<a href="/title/tt0167260/" title="Peter Jackson (dir.), Elijah Wood, Viggo Mortensen">Der Herr der Ringe: Die Rückkehr des Königs</a>],
 [<a href="/title/tt0110912/" title="Quentin Tarantino (dir.), John Travolta, Uma Thurman">Pulp Fiction</a>],
 [<a href="/title/tt0060196/" title="Sergio Leone (dir.), Clint Eastwood, Eli Wal

In [54]:
directors = [elem['title'].split(" (dir.)")[0] for elem in soup.select("td.titleColumn a")]
directors

['Frank Darabont',
 'Francis Ford Coppola',
 'Francis Ford Coppola',
 'Christopher Nolan',
 'Sidney Lumet',
 'Steven Spielberg',
 'Peter Jackson',
 'Quentin Tarantino',
 'Sergio Leone',
 'Peter Jackson',
 'David Fincher',
 'Robert Zemeckis',
 'Christopher Nolan',
 'Peter Jackson',
 'Irvin Kershner',
 'Lana Wachowski',
 'Martin Scorsese',
 'Milos Forman',
 'Akira Kurosawa',
 'David Fincher',
 'Jonathan Demme',
 'Fernando Meirelles',
 'Roberto Benigni',
 'Frank Capra',
 'George Lucas',
 'Steven Spielberg',
 'Christopher Nolan',
 'Hayao Miyazaki',
 'Frank Darabont',
 'Bong Joon Ho',
 'Luc Besson',
 'Masaki Kobayashi',
 'Roman Polanski',
 'James Cameron',
 'Bryan Singer',
 'Robert Zemeckis',
 'Alfred Hitchcock',
 'Roger Allers',
 'Charles Chaplin',
 'Tony Kaye',
 'Isao Takahata',
 'Charles Chaplin',
 'Damien Chazelle',
 'Ridley Scott',
 'Martin Scorsese',
 'Olivier Nakache',
 'Christopher Nolan',
 'Michael Curtiz',
 'Sergio Leone',
 'Alfred Hitchcock',
 'Giuseppe Tornatore',
 'Ridley Scott

Finally, the stars.

In [55]:
stars = []


In [56]:
star1 = [elem['title'].split(",")[1][1:] for elem in soup.select("td.titleColumn a")]
star1

['Tim Robbins',
 'Marlon Brando',
 'Al Pacino',
 'Christian Bale',
 'Henry Fonda',
 'Liam Neeson',
 'Elijah Wood',
 'John Travolta',
 'Clint Eastwood',
 'Elijah Wood',
 'Brad Pitt',
 'Tom Hanks',
 'Leonardo DiCaprio',
 'Elijah Wood',
 'Mark Hamill',
 'Keanu Reeves',
 'Robert De Niro',
 'Jack Nicholson',
 'Toshirô Mifune',
 'Morgan Freeman',
 'Jodie Foster',
 'Alexandre Rodrigues',
 'Roberto Benigni',
 'James Stewart',
 'Mark Hamill',
 'Tom Hanks',
 'Matthew McConaughey',
 'Daveigh Chase',
 'Tom Hanks',
 'Kang-ho Song',
 'Jean Reno',
 'Tatsuya Nakadai',
 'Adrien Brody',
 'Arnold Schwarzenegger',
 'Kevin Spacey',
 'Michael J. Fox',
 'Anthony Perkins',
 'Matthew Broderick',
 'Charles Chaplin',
 'Edward Norton',
 'Tsutomu Tatsumi',
 'Charles Chaplin',
 'Miles Teller',
 'Russell Crowe',
 'Leonardo DiCaprio',
 'François Cluzet',
 'Christian Bale',
 'Humphrey Bogart',
 'Henry Fonda',
 'James Stewart',
 'Philippe Noiret',
 'Sigourney Weaver',
 'Martin Sheen',
 'Guy Pearce',
 'Harrison Ford',
 

In [57]:
star2 = [elem['title'].split(",")[2][1:] for elem in soup.select("td.titleColumn a")]
star2

['Morgan Freeman',
 'Al Pacino',
 'Robert De Niro',
 'Heath Ledger',
 'Lee J. Cobb',
 'Ralph Fiennes',
 'Viggo Mortensen',
 'Uma Thurman',
 'Eli Wallach',
 'Ian McKellen',
 'Edward Norton',
 'Robin Wright',
 'Joseph Gordon-Levitt',
 'Ian McKellen',
 'Harrison Ford',
 'Laurence Fishburne',
 'Ray Liotta',
 'Louise Fletcher',
 'Takashi Shimura',
 'Brad Pitt',
 'Anthony Hopkins',
 'Leandro Firmino',
 'Nicoletta Braschi',
 'Donna Reed',
 'Harrison Ford',
 'Matt Damon',
 'Anne Hathaway',
 'Suzanne Pleshette',
 'Michael Clarke Duncan',
 'Sun-kyun Lee',
 'Gary Oldman',
 'Akira Ishihama',
 'Thomas Kretschmann',
 'Linda Hamilton',
 'Gabriel Byrne',
 'Christopher Lloyd',
 'Janet Leigh',
 'Jeremy Irons',
 'Paulette Goddard',
 'Edward Furlong',
 'Ayano Shiraishi',
 'Virginia Cherrill',
 'J.K. Simmons',
 'Joaquin Phoenix',
 'Matt Damon',
 'Omar Sy',
 'Hugh Jackman',
 'Ingrid Bergman',
 'Charles Bronson',
 'Grace Kelly',
 'Enzo Cannavale',
 'Tom Skerritt',
 'Marlon Brando',
 'Carrie-Anne Moss',
 'Kar

The selector we copied from the browser is long and ugly, isn't it? And it selects only a single movie, while we want to collect data from all of them. Going from that particular selector to a more general and elegant one is the actual work of the web scraper.

In this case, we can play around with different tags and classes until we notice that all the information about the movies sits under `<td class="titleColumn">`. We're lucky that there isn't much "trash" under this tag, just the info we need.
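
The step from a copied selector to a general one can be sketched on a toy table. The markup below is made up and only mimics the structure described above. Note also that selectors copied from a browser may include tags such as `tbody` that the browser adds itself, so they may not match the downloaded source.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table mimicking the chart layout
snippet = """
<table>
  <tr><td class="titleColumn"><a href="/title/1/">First</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/2/">Second</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/3/">Third</a></td></tr>
</table>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Browser-style selector: pinned to the first row only
one_movie = demo.select("table > tr:nth-child(1) > td.titleColumn > a")

# General selector: every link inside a titleColumn cell
all_movies = demo.select("td.titleColumn a")

print([a.get_text() for a in one_movie])
print([a.get_text() for a in all_movies])
```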

In [None]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:

# instead of ["title"] we could use .get("title"): choose whatever you prefer

'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'
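
A minimal sketch of the attribute access described in the comments above, run on a single link copied from the earlier output. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# One link as it appears in the earlier output
snippet = (
    '<td class="titleColumn"><a href="/title/tt0111161/" '
    'title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">'
    "Die Verurteilten</a></td>"
)
demo = BeautifulSoup(snippet, "html.parser")
link = demo.select("td.titleColumn a")[0]

people = link["title"]             # raises KeyError if the attribute is missing
same = link.get("title")           # returns None instead of raising
missing = link.get("data-absent")  # attribute not present, so None

# Remove the "(dir.)" marker, then split at each comma
director, first_star, second_star = people.replace(" (dir.)", "").split(", ")
print(director, first_star, second_star, sep=" | ")
```

`.get()` is the safer choice when some tags might lack the attribute.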

In [None]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# we also specify the parent tag and its class, which is the same we used before
# the years are inside parentheses, but we'll take care of that later


'(1994)'
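
A sketch of the selection described in the comments above, on made-up markup: naming the parent cell in the selector keeps out `span` tags with the same class elsewhere on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one year inside the title cell, one unrelated span
snippet = """
<td class="titleColumn">
  1. <a href="/title/tt0111161/">Die Verurteilten</a>
  <span class="secondaryInfo">(1994)</span>
</td>
<span class="secondaryInfo">(unrelated)</span>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Parent tag and class first, then the span and its class
year_tags = demo.select("td.titleColumn span.secondaryInfo")
print(year_tags[0].get_text())
```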

#### Building the dataframe

In [None]:
# Create a list for each of the variables you're scraping


In [81]:
# Each list becomes a dataframe column
#pd.DataFrame()
# We want a dataframe with the columns: title, year, director, stars1, stars2, rating
pd.DataFrame({'title': titles, 'year': years, 'director': directors, 'stars1': star1, 'stars2': star2, 'rating': ratings})

| | title | year | director | stars1 | stars2 | rating |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Frank Darabont | Tim Robbins | Morgan Freeman | 9.2 |
| 1 | The Godfather | 1972 | Francis Ford Coppola | Marlon Brando | Al Pacino | 9.1 |
| 2 | The Godfather: Part II | 1974 | Francis Ford Coppola | Al Pacino | Robert De Niro | 9.0 |
| 3 | The Dark Knight | 2008 | Christopher Nolan | Christian Bale | Heath Ledger | 9.0 |
| 4 | 12 Angry Men | 1957 | Sidney Lumet | Henry Fonda | Lee J. Cobb | 8.9 |
| ... | ... | ... | ... | ... | ... | ... |
| 245 | Three Colors: Red | 1994 | Krzysztof Kieslowski | Irène Jacob | Jean-Louis Trintignant | 8.0 |
| 246 | Sunrise | 1927 | F.W. Murnau | George O'Brien | Janet Gaynor | 8.0 |
| 247 | Neon Genesis Evangelion: The End of Evangelion | 1997 | Hideaki Anno | Megumi Ogata | Megumi Hayashibara | 8.0 |
| 248 | Drishyam | 2013 | Jeethu Joseph | Mohanlal | Meena | 8.0 |


Unfortunately, if both stars are kept in a single `stars` column, each cell holds a list. We would rather have two columns: one for the first star and another for the second star.

In [None]:
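# `stars` was left empty above, so build it first:
# one list per movie, e.g. ['Tim Robbins', 'Morgan Freeman']
stars = [elem['title'].split(", ")[1:] for elem in soup.select("td.titleColumn a")]

# one list per star
star1 = [elem[0] for elem in stars]
star2 = [elem[1] for elem in stars]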

pd.DataFrame({'title': titles, 'year': years, 'director': directors, 'star1': star1, 'star2': star2, 'ratings': ratings})

| | title | year | director | star1 | star2 |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Frank Darabont | Tim Robbins | Morgan Freeman |
| 1 | The Godfather | 1972 | Francis Ford Coppola | Marlon Brando | Al Pacino |
| 2 | The Godfather: Part II | 1974 | Francis Ford Coppola | Al Pacino | Robert De Niro |
| 3 | The Dark Knight | 2008 | Christopher Nolan | Christian Bale | Heath Ledger |
| 4 | 12 Angry Men | 1957 | Sidney Lumet | Henry Fonda | Lee J. Cobb |
| ... | ... | ... | ... | ... | ... |
| 245 | Miracle in cell NO.7 | 2019 | Mehmet Ada Öztekin | Aras Bulut Iynemli | Nisa Sofiya Aksongur |
| 246 | Tangerines | 2013 | Zaza Urushadze | Lembit Ulfsak | Elmo Nüganen |
| 247 | Hera Pheri | 2000 | Priyadarshan | Akshay Kumar | Sunil Shetty |
| 248 | Swades | 2004 | Ashutosh Gowariker | Shah Rukh Khan | Gayatri Joshi |


#### Cleaning the data

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the year out of the parentheses: regex works, but string methods such as `str.replace()` or `str.strip()` may be simpler to use.

- Split `dir_stars` into 3 columns, one for each person: "director", "star_1", "star_2". This could have been done while extracting the data from the HTML document, but it looks easier afterwards:

    - The "(dir.)" pattern can be totally removed
    - We can split the string at each comma
    
- Change the data type of the year column to integer.
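
The cleaning steps above can be sketched on a toy dataframe in the raw shape described, with the year still in parentheses and all people in one `dir_stars` column. The frame, its titles and the name `clean` are illustrative; this is one way to do it, not the only one.

```python
import pandas as pd

# Toy frame in the raw, uncleaned shape (illustrative values)
raw = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "year": ["(1994)", "(1972)"],
    "dir_stars": [
        "Frank Darabont (dir.), Tim Robbins, Morgan Freeman",
        "Francis Ford Coppola (dir.), Marlon Brando, Al Pacino",
    ],
})

# Year: drop the parentheses, then convert to integer
raw["year"] = raw["year"].str.strip("()").astype(int)

# People: remove "(dir.)", then split at each comma into three columns
people = (
    raw["dir_stars"]
    .str.replace(" (dir.)", "", regex=False)
    .str.split(", ", expand=True)
)
people.columns = ["director", "star_1", "star_2"]

clean = pd.concat([raw.drop(columns="dir_stars"), people], axis=1)
print(clean)
```

Passing `regex=False` keeps the parentheses and the dot in `" (dir.)"` from being read as regex syntax.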

In [None]:
# year out of the parentheses


In [None]:
# remove "(dir.)"


In [None]:
# a column for each person


In [None]:
# year column to integer
