In this notebook, we take a look at a couple more examples using Beautiful Soup. We'll mainly use the `select` method here to illustrate the use of CSS selectors.

In [1]:
import requests
from bs4 import BeautifulSoup

# Scraping Books to Scrape

For our first example, we'll use http://books.toscrape.com/. Play around a bit with this website to see how it works. We're going to scrape the details for all books.

Note: this website also illustrates the importance of checking how a site handles pagination. Here, we can use `http://books.toscrape.com/catalogue/page-XXX.html`, and the site returns a 404 status when the page doesn't exist. This is not always the case: some sites show an empty listing, whereas others show the last (existing) page again, so you need an explicit check to know when to stop.
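The three end-of-pagination behaviours described above can be folded into one stopping rule. The following is only a sketch: `should_stop` and its arguments are hypothetical names, not part of Requests or Beautiful Soup, and it assumes you keep the list of items scraped from the previous page around for comparison.

```python
def should_stop(status_code, items, previous_items):
    """Decide whether a paginated crawl has run past the last page."""
    # Case 1: the site answers with a 404 (as Books to Scrape does).
    if status_code == 404:
        return True
    # Case 2: the site answers normally but with an empty listing.
    if not items:
        return True
    # Case 3: the site silently serves the last existing page again.
    if previous_items is not None and items == previous_items:
        return True
    return False


print(should_stop(404, [], None))
print(should_stop(200, [], ['a']))
print(should_stop(200, ['a', 'b'], ['a', 'b']))
print(should_stop(200, ['c'], ['a', 'b']))
```

The first three calls print `True` and the last one prints `False`. For Books to Scrape the status code check alone is enough, which is what the loop below relies on.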

In [2]:
page = 1
results = []

while True:
    print('Scraping page', page)
    p = requests.get('http://books.toscrape.com/catalogue/page-{}.html'.format(page))
    page += 1
    if p.status_code == 404:
        break
    soup = BeautifulSoup(p.text, 'html.parser')
    books = soup.select('.product_pod')
    for book in books:
        book_title = book.find('img').get('alt')
        book_link = book.find('a').get('href')
        book_rating = book.find(class_='star-rating').get('class')
        book_price = book.find(class_='price_color').get_text(strip=True)
        results.append({
            'book_title': book_title,
            'book_link': book_link,
            'book_rating': book_rating,
            'book_price': book_price
        })

Scraping page 1
Scraping page 2
Scraping page 3
Scraping page 4
Scraping page 5
Scraping page 6
Scraping page 7
Scraping page 8
Scraping page 9
Scraping page 10
Scraping page 11
Scraping page 12
Scraping page 13
Scraping page 14
Scraping page 15
Scraping page 16
Scraping page 17
Scraping page 18
Scraping page 19
Scraping page 20
Scraping page 21
Scraping page 22
Scraping page 23
Scraping page 24
Scraping page 25
Scraping page 26
Scraping page 27
Scraping page 28
Scraping page 29
Scraping page 30
Scraping page 31
Scraping page 32
Scraping page 33
Scraping page 34
Scraping page 35
Scraping page 36
Scraping page 37
Scraping page 38
Scraping page 39
Scraping page 40
Scraping page 41
Scraping page 42
Scraping page 43
Scraping page 44
Scraping page 45
Scraping page 46
Scraping page 47
Scraping page 48
Scraping page 49
Scraping page 50
Scraping page 51


In [3]:
results[:4]

[{'book_title': 'A Light in the Attic',
  'book_link': 'a-light-in-the-attic_1000/index.html',
  'book_rating': ['star-rating', 'Three'],
  'book_price': 'Â£51.77'},
 {'book_title': 'Tipping the Velvet',
  'book_link': 'tipping-the-velvet_999/index.html',
  'book_rating': ['star-rating', 'One'],
  'book_price': 'Â£53.74'},
 {'book_title': 'Soumission',
  'book_link': 'soumission_998/index.html',
  'book_rating': ['star-rating', 'One'],
  'book_price': 'Â£50.10'},
 {'book_title': 'Sharp Objects',
  'book_link': 'sharp-objects_997/index.html',
  'book_rating': ['star-rating', 'Four'],
  'book_price': 'Â£47.82'}]

This looks pretty good. Let's try this again, but fixing two elements:

- For the rating, we'll convert the rating to an actual number
- The price is decoded incorrectly (`Â£` instead of `£`). This is because Requests picks the wrong encoding for this page

In [4]:
p.encoding

'ISO-8859-1'
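To see why this encoding produces `Â£`, we can reproduce the effect without any network access. This small sketch encodes a price as UTF-8 (what the server sends) and decodes it as ISO-8859-1 (what Requests assumed); the variable names are purely illustrative.

```python
price = '£51.77'

# The server sends the pound sign as two UTF-8 bytes.
raw = price.encode('utf-8')
print(raw)

# Decoding those bytes as ISO-8859-1 turns each byte into its own character.
garbled = raw.decode('iso-8859-1')
print(garbled)

# Reversing the wrong decoding step recovers the original text.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired)
```

The second `print` shows the same `Â£51.77` as in our results above.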

Why did Requests think the encoding is `ISO-8859-1`? Let's take a look at the headers:

In [5]:
p.headers

{'Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Fri, 31 Jul 2020 09:42:31 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Upstream': 'toscrape-books-master_web', 'Content-Encoding': 'gzip'}

The `Content-Type` header specifies no `charset`, so Requests falls back to its default for `text/*` responses (`ISO-8859-1`), but in the HTML, we see:

In [6]:
requests.get('http://books.toscrape.com/catalogue/page-{}.html'.format(1)).text[:600]

'\n\n<!DOCTYPE html>\n<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->\n    <head>\n        <title>\n    All products | Books to Scrape - Sandbox\n</title>\n\n        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n        <meta name="created" content="24th Jun 2016 09:30" />\n        <meta name="description" c'

That is, the page itself declares `<meta http-equiv="content-type" content="text/html; charset=UTF-8" />`. We'll therefore override the encoding Requests picked.
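Instead of hard-coding the encoding, one could also read it from the `<meta>` tag. The following is a rough sketch using only the standard library: `sniff_meta_charset` is a hypothetical helper, its regular expression is deliberately simple and will not cover every way a page can declare its charset, and it assumes you pass in the undecoded bytes of the response (`p.content` in Requests).

```python
import re

# Matches charset=... inside a <meta> tag, with or without quotes.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(raw_html, default='ISO-8859-1'):
    """Return the charset declared in a <meta> tag, or the default."""
    # The declaration normally sits near the top of the document.
    match = META_CHARSET.search(raw_html[:2048])
    return match.group(1).decode('ascii') if match else default


sample = (b'<head><meta http-equiv="content-type" '
          b'content="text/html; charset=UTF-8" /></head>')
print(sniff_meta_charset(sample))
print(sniff_meta_charset(b'<head><title>No charset</title></head>'))
```

With a helper like this, the override would become `p.encoding = sniff_meta_charset(p.content)`. For a single well-known site, simply setting the encoding by hand as we do below is just as good.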

In [7]:
page = 1
results = []

ratings = ['Zero', 'One', 'Two', 'Three', 'Four', 'Five']

while True:
    print('Scraping page', page)
    p = requests.get('http://books.toscrape.com/catalogue/page-{}.html'.format(page))
    p.encoding = 'UTF-8'
    page += 1
    if p.status_code == 404:
        break
    soup = BeautifulSoup(p.text, 'html.parser')
    books = soup.select('.product_pod')
    for book in books:
        book_title = book.find('img').get('alt')
        book_link = book.find('a').get('href')
        book_rating = ratings.index(book.find(class_='star-rating').get('class')[1])
        book_price = book.find(class_='price_color').get_text(strip=True)
        results.append({
            'book_title': book_title,
            'book_link': book_link,
            'book_rating': book_rating,
            'book_price': book_price
        })

Scraping page 1
Scraping page 2
Scraping page 3
Scraping page 4
Scraping page 5
Scraping page 6
Scraping page 7
Scraping page 8
Scraping page 9
Scraping page 10
Scraping page 11
Scraping page 12
Scraping page 13
Scraping page 14
Scraping page 15
Scraping page 16
Scraping page 17
Scraping page 18
Scraping page 19
Scraping page 20
Scraping page 21
Scraping page 22
Scraping page 23
Scraping page 24
Scraping page 25
Scraping page 26
Scraping page 27
Scraping page 28
Scraping page 29
Scraping page 30
Scraping page 31
Scraping page 32
Scraping page 33
Scraping page 34
Scraping page 35
Scraping page 36
Scraping page 37
Scraping page 38
Scraping page 39
Scraping page 40
Scraping page 41
Scraping page 42
Scraping page 43
Scraping page 44
Scraping page 45
Scraping page 46
Scraping page 47
Scraping page 48
Scraping page 49
Scraping page 50
Scraping page 51


In [8]:
results[:4]

[{'book_title': 'A Light in the Attic',
  'book_link': 'a-light-in-the-attic_1000/index.html',
  'book_rating': 3,
  'book_price': '£51.77'},
 {'book_title': 'Tipping the Velvet',
  'book_link': 'tipping-the-velvet_999/index.html',
  'book_rating': 1,
  'book_price': '£53.74'},
 {'book_title': 'Soumission',
  'book_link': 'soumission_998/index.html',
  'book_rating': 1,
  'book_price': '£50.10'},
 {'book_title': 'Sharp Objects',
  'book_link': 'sharp-objects_997/index.html',
  'book_rating': 4,
  'book_price': '£47.82'}]
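
Note that the `book_link` values are relative to the catalogue page they were scraped from. If absolute URLs are needed, a sketch along these lines would work, using `urljoin` from the standard library (the variable names are illustrative only):

```python
from urllib.parse import urljoin

# The page the link was found on, and the relative link as scraped.
page_url = 'http://books.toscrape.com/catalogue/page-1.html'
book_link = 'a-light-in-the-attic_1000/index.html'

# urljoin resolves the relative link against the page URL.
absolute_link = urljoin(page_url, book_link)
print(absolute_link)
```

This prints `http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html`.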

# Scraping Zalando

In this example, we'll scrape some product information from the Zalando web store. Note that this example is already more complicated. We'll also import the `re` (regular expression) module, as we'll need it later.

In [9]:
import re

We're going to scrape a number of women's dresses. Also note the custom `User-Agent` defined below. If we don't set it, Zalando will block our requests.

In [24]:
url = 'https://www.zalando.co.uk/womens-clothing-dresses/'
pages_to_crawl = 2
headers = {
    'User-Agent': 
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}

There are a couple of things we need to keep in mind here:

- First, getting a reliable list of articles is pretty tricky. You can e.g. also use `select('#z-nvg-cognac-root z-grid-item')` here, but this will return a couple of additional elements which do not actually correspond to article pages
- As such, we also include the class here. However, we still get one additional element at the end not containing an actual article, which is what the `if` condition is for
- Also, we're selecting on classes here, but a lot of the class names Zalando uses have strange names, e.g. `class="cat_brandName-2XZRz cat_ellipsis-MujnT"`. The suffixes look auto-generated (e.g. by some CSS middleware toolkit) and might change at any time. Since we always want our scrapers to be as robust as possible, we'll use regex to match the beginning of the class name only
- Luckily for us, regular expressions and functions can also be used to filter on attributes
- Try to expand this example to get brand names, original price and discounted price (if available) as well
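
To make the prefix matching idea concrete without hitting the website, here is a small sketch using only the `re` module. `has_class_prefix` is a hypothetical helper written for illustration. It tests the pattern against each class name separately, which is, to our understanding, also how Beautiful Soup treats a regular expression passed as `class_`.

```python
import re


def has_class_prefix(class_attr, prefix):
    """Check whether any class in a class attribute starts with prefix."""
    pattern = re.compile('^' + re.escape(prefix))
    # A class attribute can hold several space-separated class names.
    return any(pattern.search(name) for name in class_attr.split())


class_attr = 'cat_brandName-2XZRz cat_ellipsis-MujnT'
print(has_class_prefix(class_attr, 'cat_brandName'))
print(has_class_prefix(class_attr, 'cat_ellipsis'))
print(has_class_prefix(class_attr, 'cat_articleName'))
```

The first two calls print `True` and the last one prints `False`. The auto-generated suffixes (`-2XZRz`, `-MujnT`) play no role in the match, so the check keeps working if they change.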

In [26]:
for p in range(1, pages_to_crawl+1):
    print('Scraping page:', p)
    r = requests.get(url, params={'p' : p}, headers=headers)
    html_soup = BeautifulSoup(r.text, 'html.parser')
    for article in html_soup.find_all('z-grid-item', class_=re.compile('^cat_card')):
        article_info = article.find(class_=re.compile('^cat_infoDetail'))
        if article_info is None:
            continue
        article_name = article.find(class_=re.compile('^cat_articleName')).get_text(strip=True)
        print(' -', article_name, article_info.get('href'))

Scraping page: 1
 - Maxi dress - almond /billabong-maxi-dress-almond-bi721c03g-b11.html
 - BE RIDER - Day dress - black /roxy-be-rider-day-dress-anthracite-ro521c04s-q11.html
 - BE RIDER - Day dress - mood indigo /roxy-be-rider-day-dress-mood-indigo-ro521c04s-k11.html
 - VIGRETA ANCLE DRESS - Day dress - samoan sun /vila-vigreta-ancle-dress-day-dress-samoan-sun-v1021c205-e11.html
 - Maxi dress - black/rose/dark green /anna-field-curvy-jersey-dress-blackrosedark-green-ax821c03n-q11.html
 - Maxi dress - turquoise /anna-field-curvy-maxi-dress-turquoise-ax821c03y-l11.html
 - TEE DRESS - Jersey dress - black /adidas-originals-tee-dress-jersey-dress-black-ad121c05r-q11.html
 - ABREUVOIR - Day dress - white /derhy-abreuvoir-day-dress-rd521c0h5-a11.html
 - RARE FEELING - Maxi dress - black /free-people-rare-feeling-maxi-maxi-dress-black-fp021c07o-q11.html
 - Denim dress - light blue denim /vero-moda-denim-dress-light-blue-denim-ve121c1ak-k11.html
 - UNGEFÜTTERT LANG - Day dress - varicolored /