# Getting data and web scraping

Learning objectives. At the end of this lab you will be able to:

- Identify when you need to use web scraping to retrieve data
- Use the Python requests module to retrieve a web page
- Use BeautifulSoup to isolate and retrieve data from an HTML web page
- Automate the download of multiple web pages using Python

This lab is inspired by this one: https://ourcodingclub.github.io/tutorials/webscraping/

It is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

## When do we need to do web scraping to retrieve data?

Web scraping is the automated extraction of data from web pages.

Imagine you want to collect information on the area and percentage water area of African countries. It’s easy enough to head to Wikipedia, click through each page, then copy the relevant information and paste it into a spreadsheet. Now imagine you want to repeat this for every country in the world! This can quickly become very tedious as you click between lots of pages, repeating the same actions over and over. It also increases the chance of making mistakes when copying and pasting. By automating this process, i.e. web scraping, you can reduce the chance of making mistakes and speed up your data collection. Additionally, once you have written the script, it can be adapted for a lot of different projects, saving time in the long run.

The first rule of web scraping is **don't do it unless you have to**. Data is increasingly available in structured formats such as CSV and JSON. Web scraping seems cool, but it's also time-consuming and fiddly. It pays to spend some time searching for files in structured formats, in which the data may well be in a cleaner format than on a web page. However, if you can't find the data you need in a structured format, but it is available on a web page, then web scraping can be helpful.

## Before we start - the law and ethics of web-scraping

The law around web scraping is not always clear ([Byrce Davies - _Is web scraping legal?_](https://scrapediary.com/is-web-scraping-legal/)), but you should always check the terms and conditions of the website before starting to scrape. For example, [the Copyright Policy of the Financial Times website](https://help.ft.com/legal-privacy/copyright-policy/) says you cannot:
> Frame, harvest or scrape FT content or otherwise access FT content for similar purposes.

Beyond any legal requirements are the ethical considerations. James Densmore, in his article [*Ethics in webscraping*](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01), suggests the following set of rules for those undertaking web scraping:

> - If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping all together.
> - I will always provide a User Agent<sup>*</sup> string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
> - I will request data at a reasonable rate. I will strive to never be confused for a [DDoS](https://en.wikipedia.org/wiki/Denial-of-service_attack) attack.
> - I will only save the data I absolutely need from your page. [...]
> - I will respect any content I do keep. I’ll never pass it off as my own.
> - I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
> - I will respond in a timely fashion to your outreach and work with you towards a resolution.
> - I will scrape for the purpose of creating new value from the data, not to duplicate it.

* We'll explain this later.

Equipped with this information, we'll now start scraping.

# Bird facts

Spring is in the air, and the birds are starting to sing again, so we might be inspired to find out more about the birds around us. We're going to build a data frame of different species of UK birds from the [IUCN Red List](https://www.iucnredlist.org/). Being fans of Linux, we'll focus on the status and when the assessment was made, so we will have a data frame that looks something like this:

| Scientific Name         | Common Name       | Red List Status | Assessment Date |
|-------------------------|-------------------|-----------------|-----------------|
| Aptenodytes forsteri    | Emperor Penguin   | Near Threatened | 2016-10-01      |
| Aptenodytes patagonicus | King Penguin      | Least Concern   | 2016-10-01      |
| Spheniscus mendiculus   | Galapagos Penguin | Endangered      | 2016-10-01      |
| ...                     | ...               | ...             | ...             |


## First check the copyright

**Exercise 01:** Take a look at the [IUCN Red List Terms of Use](https://www.iucnredlist.org/terms/terms-of-use). Can you see anything that would prevent us scraping the information? (Note that education is classed as a non-commercial activity.) What restrictions are there on how you can use the information?

Your answer:

There is nothing that specifically prevents web scraping. We are not allowed to use the data for commercial use:

> Neither (a) IUCN Red List Data nor (b) any work derived from or based upon IUCN Red List Data (i.e., “Derivative Works”) may be put to Commercial Use without the prior written permission of IUCN.

Nor are we allowed to repost the data:

> All forms of reposting, and any sub-licensing, reselling, or other forms of redistribution of IUCN Red List Data in their original format, either whole or in part, alone or combined with other data, including within Derivative Works, are strictly prohibited without the prior written permission of IUCN. 

So we couldn't scrape the information from the Red List website, and then use the information to create a site called penguinpix.com.

## Getting our first page

A web server is just another computer on the internet that is listening for **requests** from other computers on the internet. When we're browsing the web, our web browsers are making requests to web servers all around the internet. The web browser is one example of a **user agent**. We're now going to use the Python `requests` package as a user agent.

The first file we will retrieve is called `robots.txt`.

Every webserver should have this file to tell robots (i.e. automated user agents like search engine crawlers that move from page to page on the web) what the website owner is happy for them to request. (Note that the file is purely advisory. Bad bots - malware etc - will ignore this file.)

In [1]:
import requests

In [6]:
url = 'https://www.iucnredlist.org/robots.txt'
r = requests.get(url)
print(r.text)

User-agent: *
Disallow: /author_panel
Disallow: /api
Disallow: /delayed_job
Disallow: /ckeditor
Disallow: /search
Disallow: /account
Sitemap: https://www.iucnredlist.org/sitemap.xml.gz



`r` is a `Response` object. `r.text` contains the contents of the `robots.txt` file.

The first line of the file, `User-agent: *`, tells us that the next section of the file applies to all user agents. The `Disallow` lines that follow tell us the areas of the site that should not be requested automatically. There is more information on the format of the `robots.txt` file [here](https://www.robotstxt.org/robotstxt.html).
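Reading the rules by eye is fine for one site, but the check can also be automated. The following is a minimal sketch using `urllib.robotparser` from the Python standard library. The rules are copied from the output above so that it runs offline, and the user agent name `fds-course-bot` is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules printed above, copied into a string so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /author_panel
Disallow: /api
Disallow: /delayed_job
Disallow: /ckeditor
Disallow: /search
Disallow: /account
Sitemap: https://www.iucnredlist.org/sitemap.xml.gz
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical user agent name; any name falls under the "*" section here.
agent = 'fds-course-bot'
base = 'https://www.iucnredlist.org'

# can_fetch() answers: may this user agent request this URL?
search_allowed = parser.can_fetch(agent, base + '/search')
species_allowed = parser.can_fetch(agent, base + '/species/22697752/157658053')

print('search page allowed: ', search_allowed)
print('species page allowed:', species_allowed)
```

Under these rules the search page is disallowed and the species page is allowed. In a real script, `parser.set_url(...)` followed by `parser.read()` would fetch the live file instead of using a copied string.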

Note that the webserver will have a record of the "User-agent" in its logs. For example, the request we've just made will be recorded in the webserver logs like this:
```
77.99.216.20 - - [20/Feb/2021:18:24:40 +0000] "GET /robots.txt HTTP/1.1" 200 4594 "-" "python-requests/2.24.0"
```

This means that a user agent called "python-requests/2.24.0" has made a request from the IP address 77.99.216.20 for the `robots.txt` file. It might be more polite to advertise who we are, by changing the name of the user agent in the request like this:

In [12]:
headers = {'user-agent': 'Foundations of Data Science Course (https://www.github.com/Inf2-FDS)'}
r = requests.get(url, headers=headers)

This will appear in the server logs like this:
```
77.99.216.20 - - [20/Feb/2021:19:57:16 +0000] "GET /robots.txt HTTP/1.1" 200 4594 "-" "Foundations of Data Science Course (https://www.github.com/Inf2-FDS)"
```
So if the site owner gets annoyed by frequent requests, they can at least let us know about the problem. But please, when you web scrape, don't make lots of requests in quick succession.
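One way to follow both pieces of advice is to set the user agent once on a `requests.Session` and pause between requests. This is a sketch under stated assumptions rather than a recommended recipe: `make_session`, `fetch_all` and the two-second default delay are illustrative choices, not something the site specifies.

```python
import time

import requests

USER_AGENT = 'Foundations of Data Science Course (https://www.github.com/Inf2-FDS)'


def make_session(user_agent=USER_AGENT):
    """Create a session that sends our user agent with every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def fetch_all(session, urls, delay_seconds=2.0):
    """Request each URL in turn, sleeping between consecutive requests."""
    responses = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be gentle with the server
        responses.append(session.get(url, timeout=10))
    return responses


# Creating the session does not contact any server.
session = make_session()
```

Calling `fetch_all(session, [url1, url2])` would then make the two requests a couple of seconds apart, each carrying the same `User-Agent` header.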

Note that the server log also includes the status code 200. This means that the request has been delivered successfully by the server. At our end we can check that the code is 200:

In [13]:
r.status_code

200

If the code starts with 2, it means the request was successful. If it starts with 3, it means that the request was redirected to another location. If it starts with 4, it indicates a problem with the request, such as a page that doesn't exist, and if it starts with 5, it indicates an error at the server end.
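The first digit of the code is enough to classify a response. Here is a small sketch of that rule using only the standard library; `status_class` is a hypothetical helper name, while `HTTPStatus` comes from Python's `http` module.

```python
from http import HTTPStatus


def status_class(code):
    """Return a short label for the class of an HTTP status code."""
    classes = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    # Integer division by 100 keeps only the first digit of a 3-digit code.
    return classes.get(code // 100, 'unknown')


for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard short description of the code.
    print(code, HTTPStatus(code).phrase, '->', status_class(code))
```

With `requests`, a related convenience is `r.raise_for_status()`, which raises an exception for codes in the 4xx and 5xx ranges and does nothing otherwise.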

**Exercise 02:** Make a request to the non-existent page https://www.iucnredlist.org/dodo and report on the error code.

In [19]:
r = requests.get("https://www.iucnredlist.org/dodo", headers=headers)
r.status_code

404

## Getting our first penguin

**Exercise 03:** Get the IUCN web page for the Emperor Penguin: https://www.iucnredlist.org/species/22697752/157658053. Print the text of the page that is returned.

In [48]:
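# Your answer
# URL given in the exercise (IUCN page for the Emperor Penguin):
# url = 'https://www.iucnredlist.org/species/22697752/157658053'
# The outputs below were produced with the RSPB bird A-Z page instead:
url = 'https://www.rspb.org.uk/birds-and-wildlife/wildlife-guides/bird-a-z/'
r = requests.get(url, headers=headers)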

In [49]:
r.text

'\r\n\r\n    <!DOCTYPE html>\r\n    <!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en-GB"> <![endif]-->\r\n    <!--[if IE 7]><html class="no-js lt-ie9 lt-ie8" lang="en-GB"> <![endif]-->\r\n    <!--[if IE 8]><html class="no-js lt-ie10 lt-ie9" lang="en-GB"> <![endif]-->\r\n    <!--[if IE 9]><html class="no-js lt-ie10" lang="en-GB"> <![endif]-->\r\n    <!--[if gt IE 9]><!-->\r\n    <html class="no-js" lang=en-GB>\r\n    <!--<![endif]-->\r\n    <head>\r\n        <script>(function (H) { H.className = H.className.replace(/\\bno-js\\b/, \'js\') })(document.documentElement)</script>\r\n\r\n        <meta charset="utf-8" />\r\n        <meta http-equiv="X-UA-Compatible" content="IE=EDGE" />\r\n        <title>Birds A- Z | Bird Guides - The RSPB</title>\r\n        <meta name="description" content="Browse our UK bird guide by name. See birds alphabetised by name and family, A-Z in this handy guide">\r\n<meta property="og:title" content="Birds A- Z | Bird Guides - The RSPB" />\r\n<me

What you should see is the Hypertext Markup Language (HTML) source of the page. It's exactly what you would see if you looked at the page source in the browser.

You probably know already, but an HTML document has a nested structure, like this:
```
<!DOCTYPE html>
<html>
  <head>
    <title>Page title</title>
    <script src='https://www.googletagmanager.com/gtag/js?id=UA-11409245-4'></script>
  </head>
  <body>
    <h1>Emperor penguin</h1>
    <p style='color: red'>Stuff about Emperor penguin</p>
  </body>
</html>
```
The things in angle brackets are called **tags** and some of them have **attributes**, such as the `style` attribute in the `p` tag above.

You can just about see some tags in the text returned by our request, but the structure is not clear. Fortunately there is a Python package called [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) that will help us to _parse_ the text so that we can extract the contents of tags easily.

In [50]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
soup


<!DOCTYPE html>

<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en-GB"> <![endif]-->
<!--[if IE 7]><html class="no-js lt-ie9 lt-ie8" lang="en-GB"> <![endif]-->
<!--[if IE 8]><html class="no-js lt-ie10 lt-ie9" lang="en-GB"> <![endif]-->
<!--[if IE 9]><html class="no-js lt-ie10" lang="en-GB"> <![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js" lang="en-GB">
<!--<![endif]-->
<head>
<script>(function (H) { H.className = H.className.replace(/\bno-js\b/, 'js') })(document.documentElement)</script>
<meta charset="utf-8"/>
<meta content="IE=EDGE" http-equiv="X-UA-Compatible"/>
<title>Birds A- Z | Bird Guides - The RSPB</title>
<meta content="Browse our UK bird guide by name. See birds alphabetised by name and family, A-Z in this handy guide" name="description"/>
<meta content="Birds A- Z | Bird Guides - The RSPB" property="og:title">
<meta content="Browse our UK bird guide by name. See birds alphabetised by name and family, A-Z in this handy guide" property="og:descriptio

We can now see that the HTML code is printed in a somewhat prettier way.

We can access the first tag of a particular type using the following syntax:

In [53]:
print(soup.title)

print(soup.title.name)

print(soup.title.string)

print(soup.img)

<title>Birds A- Z | Bird Guides - The RSPB</title>
title
Birds A- Z | Bird Guides - The RSPB
<img alt="RSPB - giving nature a home" class="lazyload" data-expand="500" data-src="/static/images/rspb-logo-white.png"/>


Note the last tag we have retrieved is an image (`img`). It has an **attribute** `alt`. We can extract the contents of the `alt` attribute like this:

In [54]:
soup.img['alt']

'RSPB - giving nature a home'
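The same access patterns can be tried offline on the small HTML document shown earlier. This sketch assumes the `bs4` package is installed. The variable name `doc` is used only to avoid overwriting `soup`, and the `script` tag is left out for brevity.

```python
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>Emperor penguin</h1>
    <p style='color: red'>Stuff about Emperor penguin</p>
  </body>
</html>
"""

doc = BeautifulSoup(html, 'html.parser')

heading = doc.h1.string        # text inside the first <h1> tag
style = doc.p['style']         # value of an attribute that exists
missing = doc.p.get('class')   # .get() gives None instead of raising KeyError
all_attrs = doc.p.attrs        # dictionary of every attribute on the tag

print(heading, style, missing, all_attrs)
```

Using `.get()` rather than square brackets is a safer choice when some tags on a page may lack the attribute.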

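Inspecting the page in the browser shows that each bird name sits in a `div` with the class `bird-browser__results__container__item__name`, for example `<div class="bird-browser__results__container__item__name">Aquatic warbler</div>`. The XPath of this element is `/html/body/div[1]/div/div[2]/div[2]/div/div[2]/div/div/form/div[3]/div[2]/a/div`.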

More often than not the tag we want won't be the first occurrence in the document. If we want to extract all the `div` tags with a particular class, such as the ones holding the bird names, we can use the `find_all` method like this:

In [58]:
soup.body.find_all('div', attrs={'class': "bird-browser__results__container__item__name"})

[<div class="bird-browser__results__container__item__name">
 								Aquatic warbler
 							</div>, <div class="bird-browser__results__container__item__name">
 								Arctic skua
 							</div>, <div class="bird-browser__results__container__item__name">
 								Arctic tern
 							</div>, <div class="bird-browser__results__container__item__name">
 								Avocet
 							</div>, <div class="bird-browser__results__container__item__name">
 								Balearic shearwater
 							</div>, <div class="bird-browser__results__container__item__name">
 								Bar-tailed godwit
 							</div>, <div class="bird-browser__results__container__item__name">
 								Barn owl
 							</div>, <div class="bird-browser__results__container__item__name">
 								Barnacle goose
 							</div>, <div class="bird-browser__results__container__item__name">
 								Bearded tit
 							</div>, <div class="bird-browser__results__container__item__name">
 								Bewick's swan
 							</div>, <div class="bird-browser
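The list above contains whole tags, with the names surrounded by whitespace. A minimal sketch of how the names alone might be pulled out, run here on a two-item copy of the output so that it works offline:

```python
from bs4 import BeautifulSoup

# Two of the divs from the output above, whitespace included.
html = """
<div class="bird-browser__results__container__item__name">
        Aquatic warbler
    </div>
<div class="bird-browser__results__container__item__name">
        Arctic skua
    </div>
"""

page = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) is shorthand for attrs={'class': ...}.
name_divs = page.find_all('div', class_='bird-browser__results__container__item__name')

# get_text(strip=True) drops the leading and trailing whitespace.
names = [div.get_text(strip=True) for div in name_divs]
print(names)
```

On the real page, the same two lines applied to `soup` should give one clean name per bird, assuming the class name stays as it is in the output above.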

In [32]:
soup.title.parent.name
# .parent moves up one level to the tag that encloses <title>

'head'

In [60]:
url = 'https://www.rspb.org.uk/birds-and-wildlife/wildlife-guides/bird-a-z/aquatic-warbler'
r = requests.get(url, headers=headers)
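The species URL above was typed in by hand. To automate the download of many pages, the links could instead be collected from the A-Z index page. The sketch below assumes that each name `div` sits inside an `a` tag whose `href` points to the species page. The XPath noted earlier ends in `a/div`, which suggests this, but the `href` values in `index_html` are invented for illustration and should be checked against the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

INDEX_URL = 'https://www.rspb.org.uk/birds-and-wildlife/wildlife-guides/bird-a-z/'

# Invented stand-in for part of the index page: the href values are assumptions.
index_html = """
<a href="/birds-and-wildlife/wildlife-guides/bird-a-z/aquatic-warbler">
  <div class="bird-browser__results__container__item__name">Aquatic warbler</div>
</a>
<a href="/birds-and-wildlife/wildlife-guides/bird-a-z/arctic-skua">
  <div class="bird-browser__results__container__item__name">Arctic skua</div>
</a>
"""

index = BeautifulSoup(index_html, 'html.parser')

species_urls = {}
for div in index.find_all('div', class_='bird-browser__results__container__item__name'):
    link = div.find_parent('a')  # nearest enclosing <a> tag, or None
    if link is not None and link.has_attr('href'):
        # urljoin turns a relative href into an absolute URL.
        species_urls[div.get_text(strip=True)] = urljoin(INDEX_URL, link['href'])

for name, species_url in species_urls.items():
    print(name, '->', species_url)
```

Each URL in `species_urls` could then be requested in turn, with a pause between requests.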

In [63]:
soup = BeautifulSoup(r.content, 'html.parser')
soup


<!DOCTYPE html>

<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en-GB"> <![endif]-->
<!--[if IE 7]><html class="no-js lt-ie9 lt-ie8" lang="en-GB"> <![endif]-->
<!--[if IE 8]><html class="no-js lt-ie10 lt-ie9" lang="en-GB"> <![endif]-->
<!--[if IE 9]><html class="no-js lt-ie10" lang="en-GB"> <![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js" lang="en-GB">
<!--<![endif]-->
<head>
<script>(function (H) { H.className = H.className.replace(/\bno-js\b/, 'js') })(document.documentElement)</script>
<meta charset="utf-8"/>
<meta content="IE=EDGE" http-equiv="X-UA-Compatible"/>
<title>Aquatic Warbler Bird Facts | Acrocephalus Paludicola - The RSPB</title>
<meta content="A regular but scarce autumn migrant to certain areas in Britain, visiting on its way between breeding grounds in eastern Europe &amp; its winter home in West Africa" name="description"/>
<meta content="Aquatic Warbler Bird Facts | Acrocephalus Paludicola - The RSPB" property="og:title">
<meta content="A reg

In [66]:
soup.h1

<h1 class="species-hero__page-title">Aquatic warbler</h1>

In [67]:
soup.h1.text

'Aquatic warbler'

Now suppose we want the measurements of length, wingspan and weight. We need to locate this information in the web page. One way to do this is to use the "Inspector" function in the browser. In Firefox you can open it by pressing `Ctrl-Shift-I` (`Cmd-Opt-I` on macOS). Then, as in the picture below:

1. Select the "element picker"
2. Click on the part of the page you are interested in (we'll call this the element of interest)
3. Look at the corresponding code. You can copy it by right-clicking on it and selecting **Copy->Outer HTML**.

![Screenshot of Inspector in Firefox](Screenshot-inspector-annotated.png)

The HTML we have found looks like this:
```
<div class="species-measurements-population__measurements">
    <h3 class="key-info-title-measurements">Measurements:</h3>
    <dl class="species-measurements-population__details">
        <dt class="species-measurements-population__details-label">Length:</dt>
        <dd class="species-measurements-population__details-content">
            13cm
        </dd>
    </dl>

    <dl class="species-measurements-population__details">
        <dt class="species-measurements-population__details-label">Wingspan:</dt>
        <dd class="species-measurements-population__details-content">
            16.5-19.5cm
        </dd>
    </dl>

    <dl class="species-measurements-population__details">
        <dt class="species-measurements-population__details-label">Weight:</dt>
        <dd class="species-measurements-population__details-content">
            10-14g
        </dd>
   </dl>
</div>
```

We now need to work out how to retrieve these elements. This is where the art of web scraping comes in.

We notice that the elements are all in a `div` with the class `species-measurements-population__measurements`.

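We can't use the previous method of accessing tags by name (for example `soup.dd`), because that only returns the first tag of each type and there are three sets of measurements here. The `find_all` method returns every matching tag; for instance, this retrieves all the anchors in the body:

In [38]:
soup.body.find_all('a')
# Returns a list of every <a> tag inside <body>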

[<a href="mailto:redlist@iucn.org?subject=Feedback%20on%20new%20Red%20List%20website"><img alt="feedback" src="https://nrl.iucnredlist.org/assets/feedback-ff9e97bc39b9ea47cb552b887161aa72e3d59ab786f680255603c4c53b54d698.png"/>
 </a>,
 <a class="account-item account-item--link" href="#link" id="login-button">Login / Register</a>,
 <a class="account-item account-item--link" href="/support/whatsnew"><span>☀</span>
 What's New
 </a>,
 <a class="account-item account-item--link" href="/contact/contact-page">Contact</a>,
 <a class="account-item account-item--link" href="/terms/terms-of-use">Terms of Use</a>,
 <a class="brand-symbol" href="/" rel="home">The IUCN Red List of Threatened Species</a>,
 <a class="nav-site__link-single" href="/search">Advanced</a>,
 <a class="nav-site__link" href="/about">About</a>,
 <a class="nav-site__link" href="/assessment">Assessment process</a>,
 <a class="nav-site__link" href="/resources/grid">Resources &amp; Publications</a>,
 <a class="nav-site__link" href=
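Returning to the measurements, here is a minimal sketch of how `find` and `find_all` might be combined to turn the `div` shown earlier into a dictionary. It runs on a copy of that HTML so that it works offline, and `extract_measurements` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Copy of the measurements HTML found with the Inspector.
measurements_html = """
<div class="species-measurements-population__measurements">
    <h3 class="key-info-title-measurements">Measurements:</h3>
    <dl class="species-measurements-population__details">
        <dt class="species-measurements-population__details-label">Length:</dt>
        <dd class="species-measurements-population__details-content">
            13cm
        </dd>
    </dl>
    <dl class="species-measurements-population__details">
        <dt class="species-measurements-population__details-label">Wingspan:</dt>
        <dd class="species-measurements-population__details-content">
            16.5-19.5cm
        </dd>
    </dl>
    <dl class="species-measurements-population__details">
        <dt class="species-measurements-population__details-label">Weight:</dt>
        <dd class="species-measurements-population__details-content">
            10-14g
        </dd>
    </dl>
</div>
"""


def extract_measurements(soup):
    """Map each measurement label (dt) to its value (dd)."""
    container = soup.find('div', class_='species-measurements-population__measurements')
    if container is None:
        return {}  # page has no measurements block
    result = {}
    for dl in container.find_all('dl'):
        label, value = dl.find('dt'), dl.find('dd')
        if label is not None and value is not None:
            # Strip whitespace and the trailing colon from the label.
            result[label.get_text(strip=True).rstrip(':')] = value.get_text(strip=True)
    return result


measurements = extract_measurements(BeautifulSoup(measurements_html, 'html.parser'))
print(measurements)
```

Searching inside the container first, rather than calling `find_all('dl')` on the whole page, avoids picking up `dl` lists from other parts of the page.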

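In [30]:
# Further access patterns. The commented outputs in this cell come from the
# example document in the BeautifulSoup documentation, not from our page.
soup.p['class']
# ['title']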

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

<title>
IUCN Red List of Threatened Species
</title>
title


Import multiple web pages with mapply

In [47]:
import requests_html

ModuleNotFoundError: No module named 'requests_html'

In [None]:
# Packages
library(rvest)
library(dplyr)
library(knitr)


# Set working directory to source file location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

##########################################
# Single web page scraping ###############
##########################################

# Import data for one species
Penguin_html <- readLines("Aptenodytes forsteri (Emperor Penguin).html")

# Find Scientific Name
## Find anchor
grep("Scientific Name:", Penguin_html)  # 132

## Isolate 
Penguin_html[131:135]
Penguin_html[133]

## Store line in new object
species_line <- Penguin_html[133]

## Pipes to grab the text and get rid of unwanted information like html tags
species_name <- species_line %>% 
  gsub("<td class=\"sciName\"><span class=\"notranslate\"><span class=\"sciname\">", "", .) %>%  # Remove leading html
  gsub("</span></span></td>", "", .) %>%  # Remove trailing html
  gsub("^\\s+|\\s+$", "", .)  # Remove whitespace

# Find common name
## Find anchor
grep("Common Name", Penguin_html)  # 146

## Isolate
Penguin_html[130:160]
Penguin_html[150:160]
Penguin_html[151]

## Cut straight to cleaning html
common_name <- Penguin_html[151] %>% 
  gsub("<td>", "", .) %>%
  gsub("</td>", "", .) %>%
  gsub("^\\s+|\\s+$", "", .)


# Find Red List Category
## Find anchor
grep("Red List Category", Penguin_html)  # 180

## Anchor
Penguin_html[179:185]  #182
Penguin_html[182]

## Cut straight to cleaning html
red_cat_line <- Penguin_html[182]
red_cat <- gsub("^\\s+|\\s+$", "", red_cat_line)

# Find date of listing
## Find anchor
grep("Date Assessed:", Penguin_html)  # 195

## Isolate
Penguin_html[192:197] #196
Penguin_html[196]

## Store line in new object
date_line <- Penguin_html[196]

## Clean html
date_assess <- date_line %>% 
  gsub("<td>", "",.) %>% 
  gsub("</td>", "",.) %>% 
  gsub("\\s", "",.)


# Combine vectors into a data frame
iucn <- data.frame(species_name, common_name, red_cat, date_assess)

# View iucn
View(iucn)

##########################################
# Multiple web page scraping #############
##########################################

#Download many web pages from list of URLs from search results from iucn:
search_html <- readLines("Search Results.html")

# Search for html tags relating to species URLs
## Create anchor and store locations in vector
line_list <- grep("<a href=\"/details", search_html)
## Use vector to isolate lines and store in vector
link_list <- search_html[line_list]

# Clean html and add IUCN URL prefix
species_list <- link_list %>%
  gsub('<a href=\"', "http://www.iucnredlist.org", .) %>%  # Add IUCN URL prefix
  gsub('\".*', "", .) %>%  # Remove everything from the closing quote onwards
  gsub('\\s', "",.)  # Remove white space

# Collect species names for each species in the list of links using gsub for use in naming downloaded html files
file_list_grep <- link_list %>%  
  gsub('.*sciname\">', "", .) %>%  # Remove everything up to and including sciname">
  gsub('</span></a>.*', ".html", .)  # remove trailing html and add .html file suffix

# Download each file and place in wd
mapply(function(x,y) download.file(x,y), species_list, file_list_grep)

# Import each file in list
penguin_html_list <- lapply(file_list_grep, readLines)

# Find scientific name
## Isolate line
sci_name_list_rough <- lapply(penguin_html_list, grep, pattern="Scientific Name:")
sci_name_list_rough  # 132 
penguin_html_list[[2]][133]  # +1

## Create vector of positions of lines for each list entry
sci_name_unlist_rough <- unlist(sci_name_list_rough) + 1

## Retrieve lines containing scientific names from each list entry in turn
sci_name_line <- mapply('[', penguin_html_list, sci_name_unlist_rough)

## Clean html
sci_name <- sci_name_line %>% 
  gsub(pattern = "<td class=\"sciName\"><span class=\"notranslate\"><span class=\"sciname\">", 
         replacement = "") %>%  # Remove leading html
  gsub(pattern = "</span></span></td>", replacement = "") %>%  # Remove trailing html
  gsub(pattern = "^\\s+|\\s+$", replacement = "")  # Remove white space

# Find common name
## Isolate line
common_name_list_rough <- lapply(penguin_html_list, grep, pattern = "Common Name")
common_name_list_rough #146
penguin_html_list[[1]][151]

## Create vector of positions of lines for each list entry
common_name_unlist_rough <- unlist(common_name_list_rough) + 5

## Retrieve lines containing common names from each list entry in turn
common_name_line <- mapply('[', penguin_html_list, common_name_unlist_rough)

## Clean html
common_name <- common_name_line %>%
  gsub(pattern = "<td>", replacement = "") %>%
  gsub(pattern = "</td>", replacement = "") %>%
  gsub(pattern = "^\\s+|\\s+$", replacement = "")

# Find red list category
## Isolate line
red_cat_list_rough <- lapply(penguin_html_list, grep, pattern = "Red List Category")
penguin_html_list[[16]][186]

## Create vector of positions of lines for each list entry
red_cat_unlist_rough <- unlist(red_cat_list_rough) + 2

## Retrieve lines containing red list categories from each list entry in turn
red_cat_line <- mapply(`[`, penguin_html_list, red_cat_unlist_rough)

## Clean html
red_cat <- gsub("^\\s+|\\s+$", "", red_cat_line)

# Find date of listing
## Isolate line
date_list_rough <- lapply(penguin_html_list, grep, pattern = "Date Assessed:")
penguin_html_list[[18]][200]

## Create vector of positions of lines for each list entry
date_unlist_rough <- unlist(date_list_rough) + 1 

## Retrieve lines containing assessment dates from each list entry in turn
date_line <- mapply('[', penguin_html_list, date_unlist_rough)

## Clean html
date <- date_line %>%
  gsub("<td>", "",.) %>%
  gsub("</td>", "",.) %>%
  gsub("\\s", "",.)


# Create data frame from vectors
penguin_df <- data.frame(sci_name, common_name, red_cat, date)

# Create markdown table from penguin_df
kable(penguin_df, format = "markdown")
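The last step of the R script above renders the data frame as a Markdown table. A rough pure-Python equivalent is sketched below, using the three example rows from the table at the start of the "Bird facts" section. `to_markdown_table` is a hypothetical helper written for this illustration, not a library function.

```python
def to_markdown_table(rows, columns):
    """Render a list of dictionaries as a Markdown table with padded columns."""
    # Each column is as wide as its longest cell (header included).
    widths = [
        max([len(col)] + [len(str(row[col])) for row in rows])
        for col in columns
    ]

    def fmt(cells):
        padded = (str(cell).ljust(width) for cell, width in zip(cells, widths))
        return '| ' + ' | '.join(padded) + ' |'

    lines = [fmt(columns)]
    # Separator row: dashes spanning each column plus its two padding spaces.
    lines.append('|' + '|'.join('-' * (width + 2) for width in widths) + '|')
    lines.extend(fmt([row[col] for col in columns]) for row in rows)
    return '\n'.join(lines)


columns = ['Scientific Name', 'Common Name', 'Red List Status', 'Assessment Date']
values = [
    ['Aptenodytes forsteri', 'Emperor Penguin', 'Near Threatened', '2016-10-01'],
    ['Aptenodytes patagonicus', 'King Penguin', 'Least Concern', '2016-10-01'],
    ['Spheniscus mendiculus', 'Galapagos Penguin', 'Endangered', '2016-10-01'],
]
rows = [dict(zip(columns, row_values)) for row_values in values]

table = to_markdown_table(rows, columns)
print(table)
```

A list of dictionaries like `rows` is also a convenient intermediate form if the records are later loaded into a data frame library.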