# Web scraping one page using Beautiful Soup

### Tools for scraping 

+ https://www.crummy.com/software/BeautifulSoup/bs4/doc/  (this is what we will use in lectures)

+ https://scrapy.org/

+ https://selenium-python.readthedocs.io/



## Dormouse HTML Code 


In [1]:
#create the variable

html_doc ="""
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [2]:
html_doc

'\n<!DOCTYPE html>\n<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n\n<p class="story">...</p>\n</html>\n'

In [3]:
# install first with: conda install -c anaconda beautifulsoup4

#Import needed libraries - BeautifulSoup
from bs4 import BeautifulSoup



In [4]:
# parse (create) the soup 
soup_mouse=BeautifulSoup(html_doc,'html.parser')

In [5]:
soup_mouse


<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [6]:
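# prettify is a method: without the parentheses we only get a reference to it,
# not the indented HTML (the actual call, soup_mouse.prettify(), comes further down)
soup_mouse.prettify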

<bound method Tag.prettify of 
<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>

## Option 1 - using Beautiful Soup the "HTML" way

In [7]:
# using basic tree navigation to access single elements
soup_mouse.title

<title>The Dormouse's story</title>

In [8]:
soup_mouse.title.string

"The Dormouse's story"

In [9]:
soup_mouse.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [10]:
soup_mouse.p

<p class="title"><b>The Dormouse's story</b></p>
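A note on `.string`, used above on the title tag: it only returns text when the tag has a single child, while `get_text()` joins all nested text. A minimal sketch on a made-up one-line document (the variable name `demo` is only for this example):

```python
from bs4 import BeautifulSoup

# made-up snippet: a <p> tag that mixes plain text and a nested <b> tag
demo = BeautifulSoup('<p>Once upon a time <b>three</b> sisters</p>', 'html.parser')

# the <b> tag has exactly one text child, so .string returns it
b_string = demo.b.string
print(b_string)

# the <p> tag has several children, so .string is None
p_string = demo.p.string
print(p_string)

# get_text() concatenates every piece of text inside the tag
p_text = demo.p.get_text()
print(p_text)
```

When in doubt, `get_text()` is usually the safer choice for extracting human-readable text.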

In [11]:
# find elements of the tag using find_all()

In [12]:
soup_mouse.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [13]:
soup_mouse.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [14]:
soup_mouse.title.parent

<head><title>The Dormouse's story</title></head>

In [15]:
print(soup_mouse.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [16]:
p_tags=soup_mouse.find_all('p')

In [17]:
for p in p_tags:
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [18]:
a_tags= soup_mouse.find_all('a')

In [19]:
a_tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [20]:
for atag in a_tags:
    print(atag.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
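As a small extension of the cells above, `find()` returns only the first match, and `find_all()` accepts attribute filters. The sketch below uses a shortened copy of the Dormouse HTML so that it runs on its own; the names `snippet` and `soup_demo` are made up for this example.

```python
from bs4 import BeautifulSoup

# shortened copy of the Dormouse document, so the example is self-contained
snippet = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup_demo = BeautifulSoup(snippet, 'html.parser')

# find() returns only the first match (or None if nothing matches)
first_link = soup_demo.find('a')
print(first_link.get_text())

# filter by class: the keyword is class_ because "class" is reserved in Python
story_paragraphs = soup_demo.find_all('p', class_='story')
print(len(story_paragraphs))

# filter by id with a normal keyword argument
lacie = soup_demo.find_all('a', id='link2')
print(lacie)

# collect all the href values in one list
hrefs = [a.get('href') for a in soup_demo.find_all('a')]
print(hrefs)
```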


## Option 2 - using Beautiful Soup the "CSS" way

As we will be using CSS selectors, let's first learn their syntax by playing this game: https://flukeout.github.io/

Everyone should reach level 12!

In [21]:
# using select()

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to a single element, while `select()` returns a list that may contain several elements, so it is common to iterate through the output of `select()`.
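A minimal sketch of that pattern, on a made-up two-link snippet (`html_demo` and `soup_demo` are names invented for this example). It also shows `select_one()`, which returns the first matching element directly instead of a list:

```python
from bs4 import BeautifulSoup

# made-up snippet with the same kind of tags as the Dormouse document
html_demo = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup_demo = BeautifulSoup(html_demo, 'html.parser')

# select() returns a list-like result, even when only one element matches
matches = soup_demo.select('#link2')
print(len(matches))

# so we index into it, or loop over it, before calling get_text()
print(matches[0].get_text())
names = [a.get_text() for a in soup_demo.select('p.story a.sister')]
print(names)

# select_one() returns the first matching element itself (or None)
first_name = soup_demo.select_one('#link1').get_text()
print(first_name)
```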

In [22]:
soup_mouse.select('#link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [23]:
soup_mouse.select('.sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]:
soup_mouse.select('#link1')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [25]:
soup_mouse.select('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [26]:
for a in soup_mouse.select('a'):
    print(a.get_text())

Elsie
Lacie
Tillie


In [27]:
for p in soup_mouse.select('p.story'):
    print(p.get_text())

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [28]:
print(soup_mouse.select('p.story')[1])

<p class="story">...</p>


Useful links for the lecture:

+ https://www.w3schools.com/cssref/css_selectors.asp
+ https://www.w3schools.com/tags/default.asp
+ https://www.w3schools.com/css/css_syntax.ASP
+ https://www.imdb.com/chart/top/

## Activity 

Write code to extract and print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts"

2. The names of all the places

3. The content (name and fact) of all the cities (only cities, not countries) 

4. The names (not facts!) of all the cities (not countries)


In [29]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [30]:
soup=BeautifulSoup(geography, 'html.parser')

In [31]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [32]:
# 1. All the "fun facts"
for i in soup.find_all('p'):
    print(i.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


example : 
    

**Paris was originally a Roman City called Lutetia**


In [33]:
# 2. The names of all the places.
for i in soup.find_all('h2'):
    print(i.get_text())

London
Paris
Spain


example: 

**Paris**

In [34]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)

for i in soup.find_all('div',{'class':'city'}):
    print(i.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



example: 
    
**Paris**

**Paris was originally a Roman City called Lutetia.**

In [35]:
# 4. The names (not facts!) of all the cities (not countries!)

for i in soup.find_all('div',{'class':'city'}):
    print(i.h2.get_text())

London
Paris


In [36]:
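# note: the first argument of get_text() is the separator placed between text pieces,
# not a tag name, so this inserts the literal string 'h2' instead of selecting <h2>
for i in soup.find_all('div',{'class':'city'}):
    print(i.get_text('h2'))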


h2Londonh2
h2London is the most popular tourist destination in the world.h2


h2Parish2
h2Paris was originally a Roman City called Lutetia.h2
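For comparison, the four tasks of the activity can also be sketched the "CSS" way with `select()`. This is one possible solution, not the only one; it uses a copy of the three `div` blocks so that it runs on its own, and the names `geo_html` and `geo_soup` are made up for this example.

```python
from bs4 import BeautifulSoup

# copy of the three <div> blocks from the geography document
geo_html = """
<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>
<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>
<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>
"""
geo_soup = BeautifulSoup(geo_html, 'html.parser')

# 1. all the fun facts
facts = [p.get_text() for p in geo_soup.select('p')]

# 2. the names of all the places
places = [h.get_text() for h in geo_soup.select('h2')]

# 3. name and fact of the cities only: the first argument of get_text() is a
#    separator, and strip=True removes the surrounding whitespace
city_content = [d.get_text(' - ', strip=True) for d in geo_soup.select('div.city')]

# 4. names of the cities only: an <h2> that is a descendant of div.city
city_names = [h.get_text() for h in geo_soup.select('div.city h2')]

print(facts, places, city_content, city_names, sep='\n')
```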



## Scraping the IMDb top 250

Let's go to https://www.imdb.com/chart/top, where we'll see the top 250 movies according to IMDb ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- IMDb rating

- Director & main stars (they appear when you hover over the title)

Our objective is going to be to scrape this information and store it in a pandas dataframe.

In [37]:
# 1. importing libraries- BeautifulSoup, requests, pandas
import requests
import pandas as pd

# 2. find url and store it in a variable
url = "https://www.imdb.com/chart/top"

# 3. download html with a get request
response = requests.get(url)

In [38]:
#check response status code 
response.status_code

200
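A status code of 200 means the request succeeded. Websites change over time, and some reject requests that do not look like they come from a browser, so a slightly more defensive download step can help. The sketch below is only one way to do it: `fetch_soup` is a hypothetical helper, and the `User-Agent` value and the 10-second timeout are assumptions, not requirements of any particular site.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    # assumption: a browser-like User-Agent; some sites reject the default one
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise an HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')

# usage (needs network access):
# soup = fetch_soup("https://www.imdb.com/chart/top")
```

If the selectors used later return empty lists, it is worth checking the downloaded HTML first, since the page structure may differ from what is described here.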

In [39]:
#parse and store the contents of the url call
soup=BeautifulSoup(response.content, 'html.parser')

In [40]:
#prettify the soup 


### Query the soup to get movie title, actors, director, year 


In [41]:
soup.select('td.titleColumn a')[5].text

'La lista de Schindler'

In [42]:
soup.select('td.titleColumn a')[5]['title']

'Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes'

In [43]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:

# instead of ["title"] we could use .get("title"): choose whatever you prefer
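The practical difference between the two shows up when an attribute is missing. A minimal sketch on a made-up link (the movie, names and `href` below are placeholders, not real data):

```python
from bs4 import BeautifulSoup

# made-up link that mimics the structure described above
link_html = ('<td class="titleColumn">'
             '<a href="/title/tt0000000/" title="Some Director (dir.), Star One, Star Two">'
             'Some Movie</a></td>')
link = BeautifulSoup(link_html, 'html.parser').select_one('td.titleColumn a')

# both forms return the same value when the attribute exists
print(link['title'])
print(link.get('title'))

# .get() returns None for a missing attribute...
missing = link.get('data-missing')
print(missing)

# ...while ["key"] raises a KeyError, like a dictionary
try:
    link['data-missing']
    raised = False
except KeyError:
    raised = True
print(raised)
```

Using `.get()` avoids a crash in the middle of a long loop, at the cost of having to handle `None` values afterwards.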

In [44]:
soup.select('td.titleColumn span.secondaryInfo')[5].text

'(1993)'

In [45]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# we also specify the parent tag and its class, which is the same we used before
# the years are inside parentheses, but we'll take care of that later

### Once we have a method working for one movie, we can apply it for all the movies

- loop through the movies
- pick up title, director, actors, year
- store each value in a list

For example:

**movie_lst = soup.select("td.titleColumn a")**

**yr_lst = soup.select("td.titleColumn span.secondaryInfo")**
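One way to loop over the two lists together is `zip()`. The sketch below runs on a made-up two-row table that mimics the structure queried above (the movies and names are placeholders), so it does not need network access:

```python
from bs4 import BeautifulSoup

# made-up two-row table with the same tags and classes as in the queries above
table_html = """
<table>
<tr><td class="titleColumn"><a title="Director A (dir.), Star A1, Star A2">Movie A</a>
<span class="secondaryInfo">(1999)</span></td></tr>
<tr><td class="titleColumn"><a title="Director B (dir.), Star B1, Star B2">Movie B</a>
<span class="secondaryInfo">(2005)</span></td></tr>
</table>
"""
demo_soup = BeautifulSoup(table_html, 'html.parser')

# run each selector once, outside the loop
movie_lst = demo_soup.select("td.titleColumn a")
yr_lst = demo_soup.select("td.titleColumn span.secondaryInfo")

# zip() pairs the n-th link with the n-th year
rows = []
for movie_tag, year_tag in zip(movie_lst, yr_lst):
    rows.append({
        'title': movie_tag.text,
        'dir_stars': movie_tag['title'],
        'year': year_tag.text,
    })

print(rows)
```

Note that `zip()` stops at the shorter list without raising an error, so it is worth checking that both lists have the same length before relying on the pairing.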

In [46]:
# install tqdm with: conda install -c conda-forge tqdm
from tqdm.notebook import tqdm

In [47]:
title=[]
dir_stars=[]
year=[]
len_movies=len(soup.select('td.titleColumn a'))

In [48]:
# run each selector once, instead of re-querying the whole soup on every iteration
movie_tags = soup.select('td.titleColumn a')
year_tags = soup.select('td.titleColumn span.secondaryInfo')
for i in tqdm(range(len_movies)):
    title.append(movie_tags[i].text)
    dir_stars.append(movie_tags[i]['title'])
    year.append(year_tags[i].text)

  0%|          | 0/250 [00:00<?, ?it/s]

In [49]:
year


['(1994)',
 '(1972)',
 '(1974)',
 '(2008)',
 '(1957)',
 '(1993)',
 '(2003)',
 '(1994)',
 '(1966)',
 '(2001)',
 '(1999)',
 '(1994)',
 '(2010)',
 '(2002)',
 '(1980)',
 '(1999)',
 '(1990)',
 '(1975)',
 '(1954)',
 '(1995)',
 '(1991)',
 '(2002)',
 '(1997)',
 '(1946)',
 '(1977)',
 '(1998)',
 '(2001)',
 '(2014)',
 '(1999)',
 '(2019)',
 '(1994)',
 '(1962)',
 '(1995)',
 '(2002)',
 '(1985)',
 '(1991)',
 '(1960)',
 '(1936)',
 '(1994)',
 '(1998)',
 '(1931)',
 '(2000)',
 '(1988)',
 '(2014)',
 '(2006)',
 '(2011)',
 '(2006)',
 '(1942)',
 '(1968)',
 '(1954)',
 '(1988)',
 '(1979)',
 '(1979)',
 '(2000)',
 '(1981)',
 '(1940)',
 '(2006)',
 '(2012)',
 '(1957)',
 '(1950)',
 '(2008)',
 '(1980)',
 '(2018)',
 '(1957)',
 '(2019)',
 '(1964)',
 '(2018)',
 '(2003)',
 '(1997)',
 '(2020)',
 '(2016)',
 '(1984)',
 '(2012)',
 '(1986)',
 '(2017)',
 '(1981)',
 '(2018)',
 '(2019)',
 '(1963)',
 '(1999)',
 '(1995)',
 '(1995)',
 '(1955)',
 '(1984)',
 '(2009)',
 '(2009)',
 '(1997)',
 '(1983)',
 '(1968)',
 '(1992)',
 '(1931)',

In [50]:
movies_top250=pd.DataFrame({'title':title, 'dir_stars':dir_stars,'year':year})

In [51]:
movies_top250.head()

| | title | dir_stars | year |
|---|---|---|---|
| 0 | Cadena perpetua | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | (1994) |
| 1 | El padrino | Francis Ford Coppola (dir.), Marlon Brando, Al... | (1972) |
| 2 | El padrino: Parte II | Francis Ford Coppola (dir.), Al Pacino, Robert... | (1974) |
| 3 | El caballero oscuro | Christopher Nolan (dir.), Christian Bale, Heat... | (2008) |
| 4 | 12 hombres sin piedad | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | (1957) |


In [52]:
len(dir_stars)

250

### Cleaning / Wrangling steps for the scraped data 

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the year out of the parentheses: we know we can do that with regex, but string methods such as str.replace() might be simpler to use.

- Split dir_stars into 3 columns, one for each person: "director", "star_1", "star_2". This could have been done by filtering when extracting the data from the html document, but it looks easier afterwards:

    - The "(dir.)" pattern can be removed
    - We can split the string at each comma
    
- Change the data type of the year column to integer.
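These steps can also be sketched directly in pandas with the `.str` string methods. The example below uses a made-up two-row sample in the same shape as the scraped data (the titles and names are placeholders), and it assumes that every row lists exactly one director and two stars.

```python
import pandas as pd

# made-up sample with the same columns as the scraped dataframe
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'dir_stars': ['Director A (dir.), Star A1, Star A2',
                  'Director B (dir.), Star B1, Star B2'],
    'year': ['(1999)', '(2005)'],
})

# take the year out of the parentheses and convert it to integer
sample['year'] = sample['year'].str.strip('()').astype(int)

# remove the "(dir.)" pattern, then split at each comma into separate columns
people = (sample['dir_stars']
          .str.replace(' (dir.)', '', regex=False)
          .str.split(',', expand=True))
people.columns = ['director', 'star_1', 'star_2']

# remove the space left after each comma
people = people.apply(lambda col: col.str.strip())

clean = pd.concat([sample[['title']], people, sample[['year']]], axis=1)
print(clean)
```

If some rows had fewer than three names, `str.split(..., expand=True)` would fill the missing cells with `None`, and assigning three column names would still work as long as at least one row has three names.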


In [53]:
year_clean=[yr.strip('()') for yr in year]

In [54]:
year_clean

['1994',
 '1972',
 '1974',
 '2008',
 '1957',
 '1993',
 '2003',
 '1994',
 '1966',
 '2001',
 '1999',
 '1994',
 '2010',
 '2002',
 '1980',
 '1999',
 '1990',
 '1975',
 '1954',
 '1995',
 '1991',
 '2002',
 '1997',
 '1946',
 '1977',
 '1998',
 '2001',
 '2014',
 '1999',
 '2019',
 '1994',
 '1962',
 '1995',
 '2002',
 '1985',
 '1991',
 '1960',
 '1936',
 '1994',
 '1998',
 '1931',
 '2000',
 '1988',
 '2014',
 '2006',
 '2011',
 '2006',
 '1942',
 '1968',
 '1954',
 '1988',
 '1979',
 '1979',
 '2000',
 '1981',
 '1940',
 '2006',
 '2012',
 '1957',
 '1950',
 '2008',
 '1980',
 '2018',
 '1957',
 '2019',
 '1964',
 '2018',
 '2003',
 '1997',
 '2020',
 '2016',
 '1984',
 '2012',
 '1986',
 '2017',
 '1981',
 '2018',
 '2019',
 '1963',
 '1999',
 '1995',
 '1995',
 '1955',
 '1984',
 '2009',
 '2009',
 '1997',
 '1983',
 '1968',
 '1992',
 '1931',
 '2007',
 '1958',
 '1941',
 '2012',
 '1985',
 '2000',
 '1952',
 '1959',
 '2004',
 '1948',
 '1952',
 '1962',
 '1921',
 '1987',
 '2016',
 '2020',
 '1971',
 '1927',
 '1976',
 '1944',
 ...]

In [None]:
# splitting out the directors from the stars (function). director, star1, star2

In [55]:
director=[]
star1=[]
star2=[]
for movie in dir_stars:
    # split at each comma and strip the space that follows it
    split_list=[name.strip() for name in movie.split(',')]
    director.append(split_list[0].replace(' (dir.)', ''))
    star1.append(split_list[1])
    star2.append(split_list[2])

### Create data frame from results and preview 

In [56]:
movies_new=pd.DataFrame({'title': title, 'director':director, 'star1':star1, 'star2':star2, 'year':year_clean})

In [57]:
movies_new

| | title | director | star1 | star2 | year |
|---|---|---|---|---|---|
| 0 | Cadena perpetua | Frank Darabont | Tim Robbins | Morgan Freeman | 1994 |
| 1 | El padrino | Francis Ford Coppola | Marlon Brando | Al Pacino | 1972 |
| 2 | El padrino: Parte II | Francis Ford Coppola | Al Pacino | Robert De Niro | 1974 |
| 3 | El caballero oscuro | Christopher Nolan | Christian Bale | Heath Ledger | 2008 |
| 4 | 12 hombres sin piedad | Sidney Lumet | Henry Fonda | Lee J. Cobb | 1957 |
| ... | ... | ... | ... | ... | ... |
| 245 | Neon Genesis Evangelion: The End of Evangelion | Hideaki Anno | Megumi Ogata | Megumi Hayashibara | 1997 |
| 246 | Fanny y Alexander | Ingmar Bergman | Bertil Guve | Pernilla Allwin | 1982 |
| 247 | Soul | Pete Docter | Jamie Foxx | Tina Fey | 2020 |
| 248 | Amanecer | F.W. Murnau | George O'Brien | Janet Gaynor | 1927 |
