# Web scraping with `requests` and `BeautifulSoup`

Many websites contain data that might be useful but that cannot be directly downloaded in commonly used data formats (such as CSV). This is especially true for text data, though we will also see how to get tabular data from websites. Since collecting large amounts of data manually is infeasible, we will use code to download web pages and extract their content for further processing.

We will use two packages: `requests` to request data from websites, and `beautifulsoup4` (imported as `bs4`), from which we will use objects of the type `BeautifulSoup`, which enable us to extract data from the downloaded pages.

In [1]:
from bs4 import BeautifulSoup, SoupStrainer
import re
import requests
import time
import pandas as pd

## Example: data from a fake online bookstore
Before discussing in detail how to download pages and extract their content, let us take a look at an example from a website that has been created specifically for practicing scraping. This example is based on https://github.com/jonathanoheix/scraping_basics_with_beautifulsoup

In [2]:
url_main = 'http://books.toscrape.com/'
url_index = url_main+'index.html'

We use the function `get` from the `requests` module to get the content of the website.

In [3]:
result = requests.get(url_index)
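
A request can fail or hang, so it may be worth checking the response before using it. The following is a minimal sketch and not part of the original example: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string (hypothetical helper)."""
    # The timeout makes the call fail instead of waiting indefinitely
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text

# Usage (requires network access):
# html = fetch_html('http://books.toscrape.com/index.html')
```

All exceptions raised by `requests` derive from `requests.exceptions.RequestException`, so a single `except` clause is enough to catch connection errors, timeouts and bad status codes.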

The attribute `text` contains a string comprising the website's content.

In [4]:
result.text[:1000]

'<!DOCTYPE html>\n<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->\n    <head>\n        <title>\n    All products | Books to Scrape - Sandbox\n</title>\n\n        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n        <meta name="created" content="24th Jun 2016 09:29" />\n        <meta name="description" content="" />\n        <meta name="viewport" content="width=device-width" />\n        <meta name="robots" content="NOARCHIVE,NOCACHE" />\n\n        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->\n        <!--[if lt IE 9]>\n        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>\n        <![endif]-->\n\n        \n            <link rel="shortcut icon" href

This doesn't look too nice. We will need `BeautifulSoup` to deal with this HTML code.

In [5]:
soup = BeautifulSoup(result.text, 'html.parser')

BeautifulSoup's method `prettify` gives us a clearer view of the structure of the text.

In [6]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

After some work involving a close inspection of the HTML code, you will be able to extract the links to the individual books. There are many other books available in this bookstore, but we will limit our attention to those available from the homepage.

In [7]:
book_urls = [x.div.a.get('href') for x in soup.find_all("article", class_ = "product_pod")]
book_urls

['catalogue/a-light-in-the-attic_1000/index.html',
 'catalogue/tipping-the-velvet_999/index.html',
 'catalogue/soumission_998/index.html',
 'catalogue/sharp-objects_997/index.html',
 'catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'catalogue/the-requiem-red_995/index.html',
 'catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'catalogue/the-black-maria_991/index.html',
 'catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html',
 'catalogue/shakespeares-sonnets_989/index.html',
 'catalogue/set-me-free_988/index.html',
 'catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
 'catalogue/rip-it-up-and-start-again_986/index.html',
 'catalogue/our-band-could-be-your-life-scene

We can see those URLs are incomplete as they are just the path from the site's main address. To get the complete URLs, we need to join the main address with the paths.

In [8]:
book_urls = [url_main + x for x in book_urls]
book_urls

['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'http://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990
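
String concatenation works here because `url_main` ends with a slash and the paths are relative to it. As a more general alternative, `urljoin` from the standard library resolves a relative path against a base URL. This is only a sketch: the image path in the second call is a made-up example, not one taken from the site.

```python
from urllib.parse import urljoin

url_main = 'http://books.toscrape.com/'

# A relative path is resolved against the base URL
book_url = urljoin(url_main, 'catalogue/a-light-in-the-attic_1000/index.html')
print(book_url)

# '..' segments are resolved as well (hypothetical image path)
img_url = urljoin(book_url, '../../media/example.jpg')
print(img_url)
```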

We can use these URLs to request each book's page and collect the information available there. We will wait for one second after each request to avoid putting too heavy a load on the bookstore's server.

In [9]:
names = []
prices = []
nb_in_stock = []
img_urls = []
categories = []
ratings = []

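for url in book_urls:
    # Download and parse the page of a single book
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    time.sleep(1)  # pause to limit the load on the server
    names.append(soup.find("div", class_ = re.compile("product_main")).h1.text)
    # Keep only digits and the decimal point, which drops the currency symbol
    # regardless of how many characters it occupies after decoding
    prices.append(re.sub(r"[^0-9.]", "", soup.find("p", class_ = "price_color").text))
    nb_in_stock.append(re.sub("[^0-9]", "", soup.find("p", class_ = "instock availability").text))
    img_urls.append(url.replace("index.html", "") + soup.find("img").get("src"))
    # The category is the fourth component of the first link into the category tree
    categories.append(soup.find("a", href = re.compile("../category/books/")).get("href").split("/")[3])
    # The rating is encoded as the second CSS class, e.g. class="star-rating Three"
    ratings.append(soup.find("p", class_ = re.compile("star-rating")).get("class")[1])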

We create a pandas DataFrame containing the data we collected for later use.

In [10]:
books = pd.DataFrame({'name': names, 'price': prices, 'nb_in_stock': nb_in_stock, "url_img": img_urls, 
                      "product_category": categories, "rating": ratings})
books.tail()

| | name | price | nb_in_stock | url_img | product_category | rating |
|---|---|---|---|---|---|---|
| 15 | Our Band Could Be Your Life: Scenes from the A... | 57.25 | 19 | http://books.toscrape.com/catalogue/our-band-c... | music_14 | Three |
| 16 | Olio | 23.88 | 19 | http://books.toscrape.com/catalogue/olio_984/.... | poetry_23 | One |
| 17 | Mesaerion: The Best Science Fiction Stories 18... | 37.59 | 19 | http://books.toscrape.com/catalogue/mesaerion-... | science-fiction_16 | One |
| 18 | Libertarianism for Beginners | 51.33 | 19 | http://books.toscrape.com/catalogue/libertaria... | politics_48 | Two |
| 19 | It's Only the Himalayas | 45.17 | 19 | http://books.toscrape.com/catalogue/its-only-t... | travel_2 | Two |
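
All scraped values are strings, so numeric columns usually need to be converted before analysis. The following sketch uses a small hypothetical DataFrame `books_demo` with values in the same format as above; the mapping `rating_map` is an assumption about how one might encode the ratings.

```python
import pandas as pd

# Hypothetical rows in the same format as the scraped columns
books_demo = pd.DataFrame({'price': ['57.25', '23.88'],
                           'nb_in_stock': ['19', '19'],
                           'rating': ['Three', 'One']})

# Convert strings to numbers
books_demo['price'] = pd.to_numeric(books_demo['price'])
books_demo['nb_in_stock'] = pd.to_numeric(books_demo['nb_in_stock'])

# Translate the rating words into integers
rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
books_demo['rating'] = books_demo['rating'].map(rating_map)

print(books_demo.dtypes)
```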


# HTML
HyperText Markup Language (HTML) is the language used to create web pages. It is not a programming language but a markup language telling the browser how to present the content. If you right-click on a website and choose `View page source` or simply press `Ctrl+U` (both in Chrome, but it is similar in other browsers), you get to see the site's HTML code.

<center>

<img src="images/html.png" align="center" width="1400" />
</center>

While it would be possible to find the HTML corresponding to the elements you see on the website in the page source, it is easier to use the browser's developer tools. To do so, move the mouse to the part of the site for which you would like to see the code, then right-click and choose `Inspect`, or press `Ctrl+Shift+I`.

Let us have a look at the __Elements__ tab here. This shows you the DOM (Document Object Model), which is, for a site consisting only of HTML code, just the site's HTML.

I had pointed my mouse at the link to the book on the top left and the corresponding code is highlighted. We can see a tree structure where each indentation means that the indented content consists of `children` of their `parent` node.

<center>
<img src="images/inspect.png" align="center" width="800" />
</center>

## Structure of HTML documents

We can see the structure of the file looks similar to what we had seen when applying the `prettify` method to the `BeautifulSoup` object in the example.

There is no need for you to know the details of HTML, but there is some basic knowledge that will help you better understand the code.

Any HTML document consists of opening and corresponding closing __tags__ with possibly some other tags and contents in between. Additionally, opening tags may contain attributes. If the name of a tag is 'tagname' and the names of its attributes are 'attribute1' and 'attribute2', the basic syntax for an opening followed by a closing tag is
```
<tagname attribute1="attribute1value" attribute2="attribute2value">

</tagname>
```


## Tags
The content of any HTML file is contained in `html` tags. The file's content is usually separated into a `head` and a `body`, as indicated by the corresponding tags. The head typically contains a `title`. We say that `head` and `body` are children of `html`, `title` is a child of `head` and a descendant of `html`, and `head` and `body` are siblings. `html` is the parent of `head` and `body`, and `head` is the parent of `title`.
```
<html>
    <head>
        <title>
        </title>
    </head>
    <body>    
        <!-- This is an HTML comment, which isn't displayed on the website. -->
    </body>    
</html>
```

The body can consist of

* divisions: the `div` tag
* paragraphs: the `p` tag
* hyperlinks to other urls: the `a` tag
* tables: the `table` tag
* section headings: `h1` to `h6` for headings of decreasing size

and many other tags: https://www.w3schools.com/TAGS/default.ASP



## The `a` tag for hyperlinks
`a` tags, which contain links to other urls, have the `href` attribute defining the url of the link, and optionally a `title` attribute defining tooltip text to be displayed when the mouse hovers over the link.
```
<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
A Light in the ...
</a>
```
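
To connect this to the code used earlier, here is a small sketch that parses the snippet above and reads its attributes. The variable names are chosen for illustration, and `target` is simply an example of an attribute the tag does not have.

```python
from bs4 import BeautifulSoup

html = ('<a href="catalogue/a-light-in-the-attic_1000/index.html" '
        'title="A Light in the Attic">A Light in the ...</a>')
link = BeautifulSoup(html, 'html.parser').a

print(link['href'])        # square brackets raise KeyError if the attribute is missing
print(link.get('title'))   # get returns None if the attribute is missing
print(link.get('target'))  # None, since this tag has no 'target' attribute
```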

## `class` and `id` attributes
The most important attributes of tags are `class` and `id`. These attributes are defined in the opening tags. E.g., in our bookstore you will find
```
<div class="product_price">
</div>
```
and
```
<div id="promotions_left">
</div>
```
These attributes can facilitate our navigation through the file if we want to use the content of a tag carrying a particular class or id.
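
As a brief sketch of how these attributes are used when scraping, the following code searches a small HTML fragment by `id` and by `class`. The fragment is made up for illustration: the inner `p` tag and the price are assumptions, not copied from the bookstore.

```python
from bs4 import BeautifulSoup

html = '''
<div id="promotions_left"></div>
<div class="product_price"><p class="price_color">51.77</p></div>
'''
soup_demo = BeautifulSoup(html, 'html.parser')

# An id should be unique within a document, so find is sufficient
promo = soup_demo.find(id='promotions_left')
# class is a reserved word in Python, hence the trailing underscore
price_div = soup_demo.find('div', class_='product_price')

print(promo)
print(price_div.p.text)
```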

# Beautiful Soup
We will closely follow the official documentation of BeautifulSoup here, though the latter is more complete. 

## Types of objects
There are four kinds of objects you will be dealing with when using BeautifulSoup:
* Tag
* NavigableString
* BeautifulSoup
* Comment

### Tag
`Tag` objects correspond to the tags you find in an HTML (or XML, more on that later) document.

We create an object `soup` containing the whole homepage of the bookstore site again. The first `a` tag inside the `soup` is a tag object.

In [11]:
soup = BeautifulSoup(result.text, 'html.parser')
tag = soup.a
type(tag)

bs4.element.Tag

In [12]:
tag

<a href="index.html">Books to Scrape</a>

Tags have names. A tag's name can be changed, which will be reflected in the HTML code.

In [13]:
tag.name

'a'

In [14]:
tag.name = 'link'
tag

<link href="index.html">Books to Scrape</link>

A tag can have attributes, in this case the attribute 'href'. The attributes are contained in a dictionary.

In [15]:
tag['href']

'index.html'

In [16]:
tag.attrs

{'href': 'index.html'}

In [17]:
type(tag.attrs)

dict

We can modify attributes, add new ones and also remove them.

In [18]:
tag['anotherattr'] = 'attribute2'
tag

<link anotherattr="attribute2" href="index.html">Books to Scrape</link>

In [19]:
del tag['anotherattr']
tag

<link href="index.html">Books to Scrape</link>

An attribute can have multiple values in HTML (the default behavior is different from XML). They are then contained in a list.

In [20]:
soup2 = BeautifulSoup('<p class="value1 value2"></p>', 'html.parser')
soup2.p['class']

['value1', 'value2']

If we disable this behavior, the attribute will have only one value consisting of multiple words.

In [21]:
soup3 = BeautifulSoup('<p class="value1 value2"></p>', 'html.parser', multi_valued_attributes=None)
soup3.p['class']

'value1 value2'

### NavigableString
NavigableString objects contain strings of text. They behave like ordinary strings except that they support Beautiful Soup methods for navigating and searching the tree that will be discussed later in this chapter. NavigableStrings can be converted to strings.

In [None]:
tag.string

In [None]:
type(tag.string)

In [None]:
s = str(tag.string)
type(s)

NavigableStrings are immutable, but we can replace them using the method `replace_with`.

In [None]:
tag.string.replace_with('Bookstore')
tag

### BeautifulSoup
BeautifulSoup objects behave similarly to tag objects, except that they form the root of the tree, so there are no parent nodes to search for. Since they don't correspond to an actual tag, they have no attributes, and their `name` is the special value `'[document]'`.

### Comment
A comment object corresponds to a string containing an HTML comment.

In [None]:
html_comment = "<b><!--This is a comment.--></b>"
soup4 = BeautifulSoup(html_comment, 'html.parser')
comment = soup4.b.string
type(comment)
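
Because comments have their own type, they can be found and removed before the text is processed further. The following sketch uses a made-up one-line document; the lambda passed to `string` is called with each string in the tree.

```python
from bs4 import BeautifulSoup, Comment

html = '<p>Visible text<!-- hidden note --></p>'
soup_demo = BeautifulSoup(html, 'html.parser')

# Collect all strings that are comments
comments = soup_demo.find_all(string=lambda s: isinstance(s, Comment))

# Remove each comment from the tree
for c in comments:
    c.extract()

print(soup_demo)
```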

## Navigating the tree
We will take the simple example html code from the Beautiful Soup documentation to illustrate the navigation. We will begin the navigation starting with the BeautifulSoup object that forms the root of the tree.

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

### Navigation using tag names
We can go directly to a tag by using its name as an attribute. This is most obvious for a tag immediately below the level we are at: if we start the navigation from `soup`, the `html` tag is one of its direct children.

In [None]:
soup.html

However, we can also navigate further down within the tree.

In [None]:
soup.title, soup.head.title

Those two are equivalent here because there is only one `title` in the document. Therefore, it doesn't matter whether we tell the parser that the `title` should be a child of the `head`.

If there is more than one tag with the same name, the parser will navigate to the first one.

In [None]:
soup.a

### contents, children, descendants
A tag's `contents` attribute contains a list of its children. These children can be tags or strings, including strings that consist only of whitespace such as line breaks.

In [None]:
soup.body.contents

A tag's `children` attribute is a generator of its children which you can iterate over.

In [None]:
soup.body.children

In [None]:
for child in soup.body.children:
    print(child)

A tag's `descendants` attribute is a generator not only of its children but also of all other descendants, i.e., the children's children etc.

In [None]:
for child in soup.head.descendants:
    print(child)

### string
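If a tag has only one child and that child is a string, the attribute `string` returns it. If the only child is another tag, `string` is passed through from that child, which is why `soup.head.string` returns the title text below. If a tag has several children, `string` is `None`.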

In [None]:
soup.head.string
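
The behavior of `string` can be illustrated with a small made-up document. This is only a sketch; `soup_demo` is a hypothetical name used to avoid overwriting `soup`.

```python
from bs4 import BeautifulSoup

html = '<h1><span>Title</span></h1><p>Hello <b>world</b></p>'
soup_demo = BeautifulSoup(html, 'html.parser')

print(soup_demo.h1.string)     # Title: passed through from the only child, span
print(soup_demo.p.string)      # None: p has two children, a string and a tag
print(soup_demo.p.get_text())  # Hello world: get_text joins all strings below p
```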

### strings and stripped_strings
The attribute `strings` provides a generator of all strings of children and descendants.

`repr` here provides a string representation explicitly showing the whitespace characters.

In [None]:
for string in soup.strings:
    print(repr(string))

`stripped_strings` omits strings consisting only of whitespace and removes leading and trailing whitespace from the remaining ones.

In [None]:
for string in soup.stripped_strings:
    print(repr(string))

### parent and parents
We can move back up in the tree using the `parent` attribute. The `parents` attribute provides a generator that can be used to iterate over all ancestors.

In [None]:
tag = soup.head.title
tag.parent

In [None]:
tag = soup.body.p
# The last ancestor is the BeautifulSoup object itself, whose name is '[document]'
for parent in tag.parents:
    print(parent.name)

### next_sibling and previous_sibling
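We can move between nodes that have the same parent using `next_sibling` and `previous_sibling`. Note that siblings are not necessarily tags: in our document the `a` tags are separated by text, so the next sibling of the first `a` tag is the string `',\n'`, and only the sibling after that is the second `a` tag.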

In [None]:
a1 = soup.body.a
a1

In [None]:
a2 = a1.next_sibling
a2

In [None]:
a3 = a2.next_sibling
a3

In [None]:
a2.previous_sibling
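
If only sibling tags are of interest, the intermediate strings can be skipped with `find_next_sibling`. The following sketch uses a shortened version of the document above.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
first = BeautifulSoup(html, 'html.parser').a

# The immediate sibling is the text between the two links
print(repr(first.next_sibling))

# find_next_sibling skips ahead to the next sibling matching the filter
print(first.find_next_sibling('a'))
```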

### next_siblings and previous_siblings
We can iterate over all subsequent or previous siblings using `next_siblings` and `previous_siblings`.

In [None]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

In [None]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

## Searching the tree
The two most useful methods for searching the tree are `find` and `find_all`.

`find_all` returns all descendants of a tag that match the filter, i.e., the expression of what we search for. If we pass only one argument, this is understood to be the `name` argument, i.e., `find_all` searches for tags with names corresponding to the argument.

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')

We can, alternatively or additionally, specify keyword arguments that must be matched. If we pass more than one argument, the methods return those descendants of the tag that match all of the arguments.

In [None]:
soup.find_all('a', id = 'link3')

Calling a BeautifulSoup object or a tag directly is equivalent to calling `find_all` on it.

In [None]:
soup('a', id = 'link3')

### `find`
The method `find` returns the first match of its arguments, or `None` if nothing matches. If we know there is only one match, this is faster than calling `find_all`.

In [None]:
soup.find('title')
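
When a tag might be absent, the result of `find` should be checked before it is used, since chaining further calls onto `None` raises an `AttributeError`. A minimal sketch with a made-up document:

```python
from bs4 import BeautifulSoup

soup_demo = BeautifulSoup('<html><head><title>Demo</title></head></html>',
                          'html.parser')

table = soup_demo.find('table')
print(table)  # None, because the document contains no table

# Guard against the missing tag before chaining further calls
n_rows = len(table.find_all('tr')) if table is not None else 0

# find_all returns an empty list instead of None
tables = soup_demo.find_all('table')
print(n_rows, tables)
```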

### Filters
`find_all` and `find` can be passed filters in the form of a string, a regular expression, a list, or a function.

#### string

If we search by a string, `find_all` will return a list of all tags with a name equal to that string.

In [None]:
soup.find_all('p')

#### regular expression
If we search by a regular expression, `find_all` will return a list of all tags with a name matching the regular expression.

In [None]:
soup.find_all(re.compile(r"^b"))

#### list
If we search by a list, `find_all` will return a list of all tags with a name matching any item in the list.

In [None]:
soup.find_all(['a', re.compile(r"^b")])

#### function
If we search by a function, `find_all` will return a list of all tags for which the function returns true.

In [None]:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

#### True
The value `True` matches every tag. Therefore, if we pass the argument `True`, `find_all` will return a list of all tags contained in the document, but none of the strings.

In [None]:
for tag in soup.find_all(True):
    print(tag.name)

### Searching by CSS class
We can search by the `class` attribute, which CSS (Cascading Style Sheets), a language used to define how websites are displayed, uses to select elements. Since `class` is a reserved word in Python, the keyword argument is called `class_`.

In [None]:
soup.find_all(class_ = 'sister')
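
When a tag carries several classes, `class_` matches any single one of them or the exact string of the whole attribute, but not a reordered variant. The sketch below illustrates this with a tag modeled on the bookstore's availability paragraph.

```python
from bs4 import BeautifulSoup

soup_demo = BeautifulSoup('<p class="instock availability">In stock</p>',
                          'html.parser')

by_one = soup_demo.find_all('p', class_='availability')                # one of the classes
by_exact = soup_demo.find_all('p', class_='instock availability')      # exact attribute string
by_reordered = soup_demo.find_all('p', class_='availability instock')  # different order

print(len(by_one), len(by_exact), len(by_reordered))
```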

### Searching strings
We can search the strings contained in the tag by using the `string` argument.

In [None]:
soup.find_all(string = re.compile(r'cie$'))

### Limiting the length of the returned list
We can use the `limit` argument to get a list of only the first `limit` matches.

In [None]:
soup.find_all(class_ = 'sister', limit = 2)

### Avoiding recursive search
By default, `find_all` searches all of a tag's descendants and not only the direct children. To avoid this recursive search, we can set the argument `recursive=False`.

In [None]:
soup.html.find_all([re.compile(r"^b")], recursive=False)

### Other search methods
`find` and `find_all` search a tag's descendants. There are several methods taking the same arguments searching different parts of the tree. Some of those are

* `find_parents` and `find_parent` to search a tag's ancestors
* `find_next_siblings` and `find_next_sibling` to search a tag's next siblings
* `find_previous_siblings` and `find_previous_sibling` to search a tag's previous siblings

### `get_text`
To get all of the text contained in a document or tag, use the `get_text` method. It returns a string containing all the text in a document or beneath a tag in the tree.

In [None]:
soup.get_text()

In [None]:
soup.body.get_text()
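
`get_text` also accepts a separator and a `strip` flag, which often yields cleaner output. A short sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

soup_demo = BeautifulSoup('<div><p> First </p>\n<p>Second</p></div>',
                          'html.parser')

# By default, all strings are concatenated exactly as they appear
print(repr(soup_demo.get_text()))

# strip=True strips each string and drops empty ones; separator joins the rest
print(soup_demo.get_text(separator=' ', strip=True))
```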

## Modifying the tree
For the purpose of using the content of an HTML document, you will rarely need to make changes to that content, though you might sometimes want to eliminate some part of the document before processing it further. We will limit the discussion to those methods potentially useful for that purpose here.

### `clear`
The method `clear` removes the content of a tag.

In [None]:
soup.a

In [None]:
soup.a.clear()
soup.a

### `extract`
The method `extract` removes and returns the tag or string from the tree.

In [None]:
extracted_a = soup.a.extract()

In [None]:
extracted_a

In [None]:
soup

In [None]:
extracted_string = soup.a.string.extract()
extracted_string

In [None]:
soup

### `decompose`
The method `decompose` removes a tag from the tree and destroys it.

In [None]:
soup

In [None]:
soup.a.decompose()
soup
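
A typical use of `decompose` when scraping is to remove `script` and `style` tags before extracting the text. The following is a sketch on a made-up document, not a step taken from the example above.

```python
from bs4 import BeautifulSoup

html = ('<body><script>alert("hi");</script>'
        '<style>p {color: red;}</style>'
        '<p>Visible text</p></body>')
soup_demo = BeautifulSoup(html, 'html.parser')

# Calling the soup with a list is shorthand for find_all with that list
for tag in soup_demo(['script', 'style']):
    tag.decompose()

text = soup_demo.get_text()
print(text)
```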

### `unwrap`
The method `unwrap` replaces a tag with what is inside it.

In [None]:
soup.p

In [None]:
soup.p.b.unwrap()
soup.p

## Parsers
Parsers are the tools used to interpret the structure of the documents. They are not part of BeautifulSoup but used by it. We have used the `html.parser` in the examples above by passing it as an argument when creating the BeautifulSoup object.

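There exist alternative parsers that slightly differ in how they interpret the documents. If the document is a complete, valid HTML document that doesn't contain any errors, the output of the HTML parsers is the same, though they differ in how they deal with errors such as an opening tag without a closing tag or vice versa.

The available HTML parsers are `html.parser`, which is part of the Python standard library, and `lxml` and `html5lib`, which have to be installed separately (this should be the case if you installed Anaconda). If no parser is specified, BeautifulSoup picks the best one that is installed, preferring `lxml`, then `html5lib`, then `html.parser`, and issues a warning. Naming the parser explicitly makes the code behave the same on every machine.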

If you are dealing with an XML rather than an HTML document, passing `'xml'` as the parser argument will tell BeautifulSoup to interpret the document as such. This requires `lxml` to be installed.
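
To see how the parsers treat invalid markup, one can feed the same broken fragment to each of them. This sketch skips parsers that are not installed; the fragment is the kind of example used in the Beautiful Soup documentation.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = '<a></p>'  # a closing tag without a matching opening tag

for parser in ['html.parser', 'lxml', 'html5lib']:
    try:
        print(parser, '->', BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised when the requested parser is not installed
        print(parser, 'is not installed')
```

The three parsers repair the fragment differently, for example in whether they add the surrounding `html` and `body` tags.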

## SoupStrainer
If we want to consider only a particular part of the HTML document, e.g., all tags of a certain type, all tags with a certain attribute or with a specific value for that attribute, we can define a `SoupStrainer` and pass it as an argument when creating a BeautifulSoup object.

Suppose we want to use only the 'a' tags.

In [None]:
a_tags = SoupStrainer("a")
soup_only_a = BeautifulSoup(html_doc, 'html.parser', parse_only = a_tags)
print(soup_only_a.prettify())
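
A `SoupStrainer` can also filter by attribute values. The sketch below keeps only the tag with a particular `id`, using a shortened copy of the example document.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '''<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>'''

# Keep only tags whose id attribute equals 'link2'
only_link2 = SoupStrainer(id='link2')
soup_link2 = BeautifulSoup(html, 'html.parser', parse_only=only_link2)

print(soup_link2)
```

Note that `parse_only` is not supported by the `html5lib` parser, so this approach requires `html.parser` or `lxml`.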