
# Intro to web scraping

The first step of web scraping is to identify a website and download its HTML code.

Real HTML from websites tends to be long and a bit too chaotic for a total beginner. Here we start with a dummy HTML document and learn the basics of extracting information with Beautiful Soup.

In [None]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [None]:
html_doc

'\n<!DOCTYPE html>\n<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n\n<p class="story">...</p>\n</html>\n'

In [None]:
!pip install --upgrade beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Installing collected packages: beautifulsoup4
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.11.2
    Uninstalling beautifulsoup4-4.11.2:
      Successfully uninstalled beautifulsoup4-4.11.2
Successfully installed beautifulsoup4-4.12.2


In [None]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [None]:
# parse the HTML document
soup = BeautifulSoup(html_doc, 'html.parser')

In [None]:
# no parentheses: this shows the bound method's repr; use print(soup.prettify()) for indented HTML
soup.prettify

<bound method Tag.prettify of 
<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>

#### accessing single elements

In [None]:
soup.title

<title>The Dormouse's story</title>

In [None]:
soup.title.string

"The Dormouse's story"

In [None]:
soup.title.parent.name

'head'

In [None]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [None]:
# this method only retrieves the first element of the specified tag
soup.p

<p class="title"><b>The Dormouse's story</b></p>

#### finding all elements of a tag with find_all()

In [None]:
p_tags = soup.find_all("p")

In [None]:
p_tags

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [None]:
for p in p_tags:
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


#### Using css selectors

https://htmlcheatsheet.com/css/

In [None]:
# select all elements with class="title"
soup.select(".title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [None]:
# select all elements with class="sister"
soup.select(".sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [None]:
# select "all" elements with the id="link2"
soup.select("#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to a single element, while `select()` returns a list that may contain several elements. It's common to iterate through the output of `select()`.

In [None]:
print(soup.select("p.story")[0].get_text())

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.


In [None]:
for p in soup.select("p.story"):
    print(p.get_text())

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
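
When the texts should be collected rather than printed, the same iteration can be written as a list comprehension. The sketch below rebuilds a shortened copy of the dummy document so that it runs on its own; the names `html_snippet`, `snippet_soup`, and `p.footer` are made up for this illustration. It also uses `select_one()`, the Beautiful Soup method that returns only the first match (or `None`), as a possible alternative to indexing with `[0]`.

```python
from bs4 import BeautifulSoup

# shortened copy of the dummy document, so the example is self-contained
html_snippet = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
snippet_soup = BeautifulSoup(html_snippet, "html.parser")

# select() returns a list of tags: collect the text of each one
sister_names = [a.get_text() for a in snippet_soup.select("a.sister")]

# select_one() returns the first matching tag, or None if nothing matches
first_title = snippet_soup.select_one("p.title").get_text()
missing = snippet_soup.select_one("p.footer")  # hypothetical class, not in the document

print(sister_names)
print(first_title)
print(missing)
```

Checking for `None` before calling `get_text()` is a sensible habit when a selector might not match anything on the page.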


### Your turn:

Write code to print the following contents (not including the html tags, only human-readable text):

1. All the "fun facts".

2. The names of all the places.

3. The content (name and fact) of all the cities (only cities, not countries!)

4. The names (not facts!) of all the cities (not countries!)

In [None]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [None]:
# Create the "soup"


In [None]:
# 1. All the "fun facts"


In [None]:
# 2. The names of all the places.


In [None]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)


In [None]:
# 4. The names (not facts!) of all the cities (not countries!)


## Use case: imdb top charts

Let's go to https://wbscodingschool.github.io/wbs-movies-website/, where we'll see the top 100 movies according to user ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- Rating

Our objective is going to be to scrape this information and store it in a pandas dataframe.

In [None]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
# 2. find url and store it in a variable
url = "https://wbscodingschool.github.io/wbs-movies-website/"


The `robots.txt` file is like a guide for web robots (or "bots") visiting a website. It tells these bots which parts of the website they are allowed to visit and which parts they should avoid. However, it is more of a suggestion than a rule, so not all bots will follow it. It is mainly used to manage how bots interact with a website and should not be used as a way to secure sensitive parts of a site.

A step-by-step guide to checking the `robots.txt` file of any website:

- Step 1: Open a web browser of your choice (Google Chrome, Firefox, Safari, etc.).

- Step 2: In the URL bar, type the website's address, followed by "/robots.txt". For example, if you wanted to see the robots.txt file for Google, you would type in "https://www.google.com/robots.txt".

- Step 3: Press enter to navigate to that address.

**How to understand the robots.txt?**

Let's imagine we have the following robots.txt file:

```
User-agent: *
Disallow: /private/
Allow: /public/
```

* The "User-agent: *" part means that the following rules apply to all web robots.
* The "Disallow: /private/" part tells robots not to access anything in the "/private/" directory of the website.
* The "Allow: /public/" part tells robots that they are allowed to access the "/public/" directory.

In the absence of "Allow" or "Disallow" directives, the default is to allow access. So, if a robots.txt file only specifies "Disallow" directives, robots can assume they're allowed to access any part of the site not explicitly disallowed.
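
The same rules can also be checked from Python. The following is a minimal sketch using the standard library's `urllib.robotparser`; it feeds the example file above to the parser line by line, and `example.com` is only a placeholder domain chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# the example robots.txt from above, as a list of lines
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# example.com is a placeholder domain used only for illustration
can_private = parser.can_fetch("*", "https://example.com/private/data.html")
can_public = parser.can_fetch("*", "https://example.com/public/index.html")
can_other = parser.can_fetch("*", "https://example.com/blog/post.html")

print(can_private, can_public, can_other)
```

For a live website, calling `set_url()` with the address of its `robots.txt` followed by `read()` downloads and parses the real file instead of a hand-written list.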

Let's check the `robots.txt` of the website we will scrape: https://wbscodingschool.github.io/wbs-movies-website/robots.txt .

In [None]:
# now we know that our code (a robot) is allowed to access the website.
# Let's go to it :)
# 3. download html with a get request
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [None]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<title>Top Rated Movies</title>
<link href="styles.css" rel="stylesheet"/>
</head>
<main>
<header class="headerTitle">
<h1 class="title">Top 100 Movies from all time</h1>
<p class="description">Welcome to the Top Rated Movies List!</p>
</header>
<table class="moviesTable">
<tr class="tableHeaders">
<th>Rank &amp; Title</th>
<th>WBS Rating</th>
</tr>
<tr class="movieInfo">
<td class="movieTitle">1. The Shawshank Redemption (1994)</td>
<td class="movieRating">9.2</td>
</tr>
<tr class="movieInfo">
<td class="movieTitle">2. The Godfather (1972)</td>
<td class="movieRating">9.2</td>
</tr>
<tr class="movieInfo">
<td class="movieTitle">3. The Dark Knight (2008)</td>
<td class="movieRating">9.0</td>
</tr>
<tr class="movieInfo">
<td class="movieTitle">4. The Godfather Part II (1974)</td>
<td class="movieRating">9.0</td>
</tr>
<tr class="movieInfo">
<td class="movieTitle">5. 12 Angry Men (1957)</td>
<td class="movieRating">9.0</td>
</tr>
<tr

In [None]:
soup.select("tr.movieInfo") # all the info about all the movies

[<tr class="movieInfo">
 <td class="movieTitle">1. The Shawshank Redemption (1994)</td>
 <td class="movieRating">9.2</td>
 </tr>,
 <tr class="movieInfo">
 <td class="movieTitle">2. The Godfather (1972)</td>
 <td class="movieRating">9.2</td>
 </tr>,
 <tr class="movieInfo">
 <td class="movieTitle">3. The Dark Knight (2008)</td>
 <td class="movieRating">9.0</td>
 </tr>,
 <tr class="movieInfo">
 <td class="movieTitle">4. The Godfather Part II (1974)</td>
 <td class="movieRating">9.0</td>
 </tr>,
 <tr class="movieInfo">
 <td class="movieTitle">5. 12 Angry Men (1957)</td>
 <td class="movieRating">9.0</td>
 </tr>,
 <tr class="movieInfo">
 <td class="movieTitle">6. Schindler's List (1993)</td>
 <td class="movieRating">8.9</td>
 </tr>,
 <tr class="movieInfo">
 <td class="movieTitle">
           7. The Lord of the Rings: The Return of the King (2003)
         </td>
 <td class="movieRating">8.9</td>
 </tr>,
 <tr class="movieInfo">
 <td class="movieTitle">8. Pulp Fiction (1994)</td>
 <td class="mo

In [None]:
soup.select(".movieTitle") # all elements containing movie titles

[<td class="movieTitle">1. The Shawshank Redemption (1994)</td>,
 <td class="movieTitle">2. The Godfather (1972)</td>,
 <td class="movieTitle">3. The Dark Knight (2008)</td>,
 <td class="movieTitle">4. The Godfather Part II (1974)</td>,
 <td class="movieTitle">5. 12 Angry Men (1957)</td>,
 <td class="movieTitle">6. Schindler's List (1993)</td>,
 <td class="movieTitle">
           7. The Lord of the Rings: The Return of the King (2003)
         </td>,
 <td class="movieTitle">8. Pulp Fiction (1994)</td>,
 <td class="movieTitle">
           9. The Lord of the Rings: The Fellowship of the Ring (2001)
         </td>,
 <td class="movieTitle">10. The Good, the Bad and the Ugly (1966)</td>,
 <td class="movieTitle">11. Forrest Gump (1994)</td>,
 <td class="movieTitle">
           12. Spider-Man: Across the Spider-Verse (2023)
         </td>,
 <td class="movieTitle">13. Fight Club (1999)</td>,
 <td class="movieTitle">
           14. The Lord of the Rings: The Two Towers (2002)
         </td>,
 <

In [None]:
# we can use .get_text() to extract the content of the tags we selected
# we'll need to do it to each tag with a for loop: here we do it to the first one
soup.select(".movieTitle")[0].get_text()

'1. The Shawshank Redemption (1994)'

In [None]:
# the ratings are inside 'td' tags
soup.select("td.movieRating")[0].get_text()

'9.2'

### Storing information in lists

In [None]:
#initialize empty lists
title = []
rating = []

In [None]:
# define the number of iterations of our for loop
# by checking how many elements are in the retrieved result set
# (this is equivalent to, but more robust than, hard-coding 100 iterations)
num_iter = len(soup.select(".movieTitle"))
num_iter

100

In [None]:
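# iterate through the result sets and retrieve all the data
# (select once outside the loop instead of re-querying the soup on every iteration)
title_tags = soup.select(".movieTitle")
rating_tags = soup.select("td.movieRating")
for i in range(num_iter):
    title.append(title_tags[i].get_text())
    rating.append(rating_tags[i].get_text())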

In [None]:
title

['1. The Shawshank Redemption (1994)',
 '2. The Godfather (1972)',
 '3. The Dark Knight (2008)',
 '4. The Godfather Part II (1974)',
 '5. 12 Angry Men (1957)',
 "6. Schindler's List (1993)",
 '\n          7. The Lord of the Rings: The Return of the King (2003)\n        ',
 '8. Pulp Fiction (1994)',
 '\n          9. The Lord of the Rings: The Fellowship of the Ring (2001)\n        ',
 '10. The Good, the Bad and the Ugly (1966)',
 '11. Forrest Gump (1994)',
 '\n          12. Spider-Man: Across the Spider-Verse (2023)\n        ',
 '13. Fight Club (1999)',
 '\n          14. The Lord of the Rings: The Two Towers (2002)\n        ',
 '15. Inception (2010)',
 '\n          16. Star Wars: Episode V - The Empire Strikes Back (1980)\n        ',
 '17. The Matrix (1999)',
 '18. GoodFellas (1990)',
 "19. One Flew Over the Cuckoo's Nest (1975)",
 '20. Seven (1995)',
 "21. It's a Wonderful Life (1946)",
 '22. Seven Samurai (1954)',
 '23. The Silence of the Lambs (1991)',
 '24. Saving Private Ryan (1998

In [None]:
rating

['9.2',
 '9.2',
 '9.0',
 '9.0',
 '9.0',
 '8.9',
 '8.9',
 '8.8',
 '8.8',
 '8.8',
 '8.8',
 '8.8',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.1',
 '8.1',
 '8.1',
 '8.1',
 '8.0',
 '8.0',
 '8.0',
 '8.0',
 '7.9',
 '7.9',
 '7.9',
 '7.9',
 '7.8',
 '7.8',
 '7.8',
 '7.8',
 '7.7',
 '7.7',
 '7.7',
 '7.7',
 '7.6',
 '7.6',
 '7.6',
 '7.6',
 '7.5',
 '7.5']
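
An alternative to indexing two separate result sets is to loop over the table rows and read both cells from the same row. The sketch below assumes that every `tr.movieInfo` row holds exactly one title cell and one rating cell, as in the output shown earlier; `sample_html` contains two rows copied from that output, and the variable names are illustrative. Passing `strip=True` to `get_text()` removes the surrounding whitespace that some titles carry.

```python
from bs4 import BeautifulSoup

# two rows copied from the page output, wrapped in a table for a self-contained example
sample_html = """
<table class="moviesTable">
<tr class="movieInfo">
<td class="movieTitle">1. The Shawshank Redemption (1994)</td>
<td class="movieRating">9.2</td>
</tr>
<tr class="movieInfo">
<td class="movieTitle">
          7. The Lord of the Rings: The Return of the King (2003)
        </td>
<td class="movieRating">8.9</td>
</tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for row in sample_soup.select("tr.movieInfo"):
    # look up both cells inside the same row so title and rating stay paired
    title_cell = row.select_one("td.movieTitle")
    rating_cell = row.select_one("td.movieRating")
    rows.append(
        {
            "movie_name": title_cell.get_text(strip=True),
            "rating": rating_cell.get_text(strip=True),
        }
    )

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame()`, which makes mismatched list lengths impossible by construction.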

### Storing information in pandas DataFrames

If you get an error when creating the DataFrame, first check that both lists have the same length: `assert len(title) == len(rating)`

In [None]:
import pandas as pd

In [None]:
movies_df = pd.DataFrame(
    {"movie_name": title,
     "rating": rating,
    }
)

In [None]:
movies_df.head()

|   | movie_name | rating |
|---|---|---|
| 0 | 1. The Shawshank Redemption (1994) | 9.2 |
| 1 | 2. The Godfather (1972) | 9.2 |
| 2 | 3. The Dark Knight (2008) | 9.0 |
| 3 | 4. The Godfather Part II (1974) | 9.0 |
| 4 | 5. 12 Angry Men (1957) | 9.0 |


In [None]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   movie_name  100 non-null    object
 1   rating      100 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB


#### Challenge: Cleaning the data

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the rank and the year out of the title and clean unnecessary spaces.

- Transform the data columns to the right types.

In [None]:
# your code here

In [None]:
# 1. Take the rank out of the title and clean unnecessary spaces

movies_df = movies_df.assign(
    Rank = movies_df['movie_name'].str.extract(r'(\d+)\.', expand=False),
    Movie = movies_df['movie_name'].str.extract(r'\.\s+(.*)\s+\(', expand=False),
    Year = movies_df['movie_name'].str.extract(r'\((\d+)\)', expand=False)
    ).drop(columns=['movie_name'])
movies_df

|   | rating | Rank | Movie | Year |
|---|---|---|---|---|
| 0 | 9.2 | 1 | The Shawshank Redemption | 1994 |
| 1 | 9.2 | 2 | The Godfather | 1972 |
| 2 | 9.0 | 3 | The Dark Knight | 2008 |
| 3 | 9.0 | 4 | The Godfather Part II | 1974 |
| 4 | 9.0 | 5 | 12 Angry Men | 1957 |
| ... | ... | ... | ... | ... |
| 95 | 7.6 | 96 | Moon's Reflection | 2023 |
| 96 | 7.6 | 97 | Dance of The Silver Phoenix | 2023 |
| 97 | 7.6 | 98 | Crystal Visions | 2023 |
| 98 | 7.5 | 99 | Silver Mirror | 2023 |
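
To see what the extraction in the cell above does, it can help to trace it on a single string with Python's built-in `re` module. This is only a sketch: it merges the three patterns into one combined expression, and it assumes the year always has four digits (`\d{4}`), which the original patterns do not require. The variable names are illustrative.

```python
import re

# one of the raw titles from the scraped list, including the stray whitespace
raw_title = "\n          7. The Lord of the Rings: The Return of the King (2003)\n        "

# rank: digits before the first dot; title: everything up to the final "(year)"
pattern = re.compile(r"(\d+)\.\s+(.*)\s+\((\d{4})\)")
match = pattern.search(raw_title)

rank = int(match.group(1))
movie = match.group(2)
year = int(match.group(3))

print(rank, movie, year)
```

Because the pattern anchors on the rank at the start and the parenthesised year at the end, the leading and trailing whitespace never ends up in the captured title.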


In [None]:
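# 2. Transform the data columns to the right types.
# assign() returns a new DataFrame, so store the result to keep the converted types
movies_df = movies_df.assign(
    rating = movies_df['rating'].astype(float),
    Rank = movies_df['Rank'].astype(int),
    Year = movies_df['Year'].astype(int)
)
movies_df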

|   | rating | Rank | Movie | Year |
|---|---|---|---|---|
| 0 | 9.2 | 1 | The Shawshank Redemption | 1994 |
| 1 | 9.2 | 2 | The Godfather | 1972 |
| 2 | 9.0 | 3 | The Dark Knight | 2008 |
| 3 | 9.0 | 4 | The Godfather Part II | 1974 |
| 4 | 9.0 | 5 | 12 Angry Men | 1957 |
| ... | ... | ... | ... | ... |
| 95 | 7.6 | 96 | Moon's Reflection | 2023 |
| 96 | 7.6 | 97 | Dance of The Silver Phoenix | 2023 |
| 97 | 7.6 | 98 | Crystal Visions | 2023 |
| 98 | 7.5 | 99 | Silver Mirror | 2023 |
