Thanks to the article 'Web scraping basics with BeautifulSoup' by Jonathan Oheix, published on Towards Data Science.
Some parts of this notebook reflect my learning process, so the code could be shorter, but I like it this way.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Scraping-html-page-with-helper-function" data-toc-modified-id="Scraping-html-page-with-helper-function-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Scraping html page with helper function</a></span></li><li><span><a href="#Finding-book-categories-URLs-on-the-main-page" data-toc-modified-id="Finding-book-categories-URLs-on-the-main-page-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Finding book categories URLs on the main page</a></span></li><li><span><a href="#Scraping-all-book-data" data-toc-modified-id="Scraping-all-book-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Scraping all book data</a></span></li><li><span><a href="#Getting-all-products-URLs" data-toc-modified-id="Getting-all-products-URLs-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Getting all products URLs</a></span></li><li><span><a href="#Getting-product-data---table-with-available-information" data-toc-modified-id="Getting-product-data---table-with-available-information-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Getting product data - table with available information</a></span></li></ul></div>

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

#### Introduction

In [3]:
main_url = 'http://books.toscrape.com/index.html'
r = requests.get(main_url)

Comparing different ways of displaying the content of an HTML page: the raw response, the parsed BeautifulSoup object, and the BeautifulSoup object with the `prettify` method.

In [4]:
#print(r.content)

In [5]:
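# naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(r.content, 'html.parser')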
#print(soup)

In [6]:
#print(soup.prettify())

In [7]:
# all attributes and methods of the BeautifulSoup object
#print(soup.__dir__())

In [8]:
# first <article> element (one product) on the page
soup.find('article')

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [9]:
soup.article.find('div')

<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>

In [10]:
# getting the product URL stored in the 'href' attribute
soup.article.div.find('a').get('href')

'catalogue/a-light-in-the-attic_1000/index.html'

#### Scraping html page with helper function

In [11]:
# useful function to get BeautifulSoup object
def getAndParse(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return soup

In [12]:
# finding the first article; filtering by CSS class needs the trailing underscore (class_=) because `class` is a Python keyword
soup.find('article', class_='product_pod')

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [13]:
# going deeper
soup.find('article', class_='product_pod').div

<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>

In [14]:
# ...and deeper
soup.find('article', class_='product_pod').div.a.get('href')
# above is equivalent to
soup.article.div.find('a').get('href')

'catalogue/a-light-in-the-attic_1000/index.html'

'catalogue/a-light-in-the-attic_1000/index.html'

In [15]:
# getting all product urls from main page (using class_ is not necessary)
main_page_product_urls = [x.div.a.get('href') for x in soup.find_all('article')]
print(f'{len(main_page_product_urls)} fetched products URLs')

20 fetched products URLs
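
The same links can also be collected with a CSS selector through `select`. Below is a minimal sketch on a small inline HTML fragment. The fragment is a shortened stand-in for the real page (an assumption made so the example runs without a network request), and the names `sample_html`, `sample_soup`, `sample_urls` and `sample_titles` are not part of the notebook.

```python
from bs4 import BeautifulSoup

# shortened stand-in for the main page markup (not the full real page)
sample_html = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic"/></a>
  </div>
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet"/></a>
  </div>
  <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the ...</a></h3>
</article>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# 'article.product_pod h3 a' matches every <a> inside an <h3> inside an <article class="product_pod">
links = sample_soup.select('article.product_pod h3 a')
sample_urls = [a.get('href') for a in links]
sample_titles = [a.get('title') for a in links]  # the full title is kept in the 'title' attribute
print(sample_urls)
print(sample_titles)
```

Taking the link from the `h3` element rather than from the image container also gives access to the untruncated title in one pass.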


In [16]:
main_url
'/'.join(main_url.split('/')[:-1])

'http://books.toscrape.com/index.html'

'http://books.toscrape.com'

In [17]:
# example URL after prepending the main page address (with the 'index.html' part removed)
print(f"One example:\n{'/'.join(main_url.split('/')[:-1])}/{main_page_product_urls[0]}")

One example:
http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
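
The split-and-join step can also be delegated to `urllib.parse.urljoin` from the standard library, which resolves a relative link against the page it was found on. A minimal sketch; the variable names `main_page`, `relative_href`, `full_url` and `next_page` are only illustrative.

```python
from urllib.parse import urljoin

main_page = 'http://books.toscrape.com/index.html'
relative_href = 'catalogue/a-light-in-the-attic_1000/index.html'

# urljoin drops the last path segment of the base ('index.html') and appends the relative part
full_url = urljoin(main_page, relative_href)
print(full_url)

# the same call works for a link found on a paginated listing page
next_page = urljoin('http://books.toscrape.com/catalogue/page-2.html', 'page-3.html')
print(next_page)
```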


In [18]:
# another helper function returning the list of book URLs found on a page
def getBooksURLs(url):
    soup = getAndParse(url)
    return ['/'.join(url.split('/')[:-1]) + '/' + x.div.a.get('href')
            for x in soup.find_all('article', class_='product_pod')]

In [19]:
for book_url in getBooksURLs(main_url)[:5]:
    print(book_url)

http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
http://books.toscrape.com/catalogue/soumission_998/index.html
http://books.toscrape.com/catalogue/sharp-objects_997/index.html
http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html


#### Finding book categories URLs on the main page

In [20]:
url = '/'.join(main_url.split('/')[:-1])
print(url)
categories_urls = [url + '/' + x.get('href') for x in soup.find_all('a', href=re.compile('catalogue/category/books'))]
categories_urls = categories_urls[1:] # the first link points to the listing of all books, so skip it

http://books.toscrape.com


In [21]:
print(str(len(categories_urls)), 'fetched categories URLs')
print('Some examples:')
for cat in categories_urls[:5]:
    print(cat)

50 fetched categories URLs
Some examples:
http://books.toscrape.com/catalogue/category/books/travel_2/index.html
http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html
http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html
http://books.toscrape.com/catalogue/category/books/classics_6/index.html


#### Scraping all book data

In [22]:
pages_urls = [main_url]
soup = getAndParse(pages_urls[0])

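# the first page has only a 'next' link and the last page only a 'previous' link;
# every page in between has both, i.e. two links whose href contains 'page'
while len(soup.find_all('a', href=re.compile('page')))==2 or len(pages_urls)==1:
    # the last matching link is the 'next' link; resolve it against the current page's directory
    new_url = '/'.join(pages_urls[-1].split('/')[:-1])+'/'+soup.find_all('a', href=re.compile('page'))[-1].get('href')
    pages_urls.append(new_url)
    soup = getAndParse(new_url)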

In [23]:
print(str(len(pages_urls)), 'fetched URLs')
print('Some examples:')
for page in pages_urls[:5]:
    print(page)

50 fetched URLs
Some examples:
http://books.toscrape.com/index.html
http://books.toscrape.com/catalogue/page-2.html
http://books.toscrape.com/catalogue/page-3.html
http://books.toscrape.com/catalogue/page-4.html
http://books.toscrape.com/catalogue/page-5.html
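
Instead of counting links, the loop could look for the 'next' link directly and stop when it is missing. The sketch below assumes that the pager marks this link as `<li class="next"><a ...>`; that markup, the inline HTML fragments and the function name `find_next_page` are assumptions for illustration, not output taken from the site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_page(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    page_soup = BeautifulSoup(html, 'html.parser')
    next_link = page_soup.select_one('li.next a')  # assumed pager markup
    if next_link is None:
        return None
    return urljoin(page_url, next_link.get('href'))


# made-up pager fragments: one from a middle page, one from the last page
middle_page = ('<ul class="pager"><li class="previous"><a href="page-1.html">previous</a></li>'
               '<li class="next"><a href="page-3.html">next</a></li></ul>')
last_page = '<ul class="pager"><li class="previous"><a href="page-49.html">previous</a></li></ul>'

print(find_next_page('http://books.toscrape.com/catalogue/page-2.html', middle_page))
print(find_next_page('http://books.toscrape.com/catalogue/page-50.html', last_page))
```

With such a helper the crawl ends as soon as `None` is returned, so no special case for the first page is needed.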


In [24]:
result = requests.get('http://books.toscrape.com/catalogue/page-50.html')
print('Status code for page 50:', result.status_code)
result = requests.get('http://books.toscrape.com/catalogue/page-51.html')
print('Status code for page 51:', result.status_code)

Status code for page 50: 200
Status code for page 51: 404


In [25]:
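# extracting the page number from the URL with a regular expression
url = 'http://books.toscrape.com/catalogue/page-51.html'
m = re.search(r'\d+', url).group()
# `result` still holds the response for page 51 from the previous cell
print(f"Status code for page {m}:", result.status_code)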

Status code for page 51: 404


In [26]:
# listing all pages that respond with status code 200
def allPages():
    page_urls = []
    new_page = 'http://books.toscrape.com/catalogue/page-1.html'
    while requests.get(new_page).status_code == 200:
        page_urls.append(new_page)
        # build the next URL: keep everything before the '-' and increment the page number
        new_page = page_urls[-1].split('-')[0] + '-' + str(int(re.search(r'\d+', new_page).group()) + 1) + '.html'
    return page_urls

allPages()[-5:]

['http://books.toscrape.com/catalogue/page-46.html',
 'http://books.toscrape.com/catalogue/page-47.html',
 'http://books.toscrape.com/catalogue/page-48.html',
 'http://books.toscrape.com/catalogue/page-49.html',
 'http://books.toscrape.com/catalogue/page-50.html']
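
Incrementing the page number can also be written with `re.sub` and a replacement function. A small sketch; the name `next_page_url` is hypothetical and not part of the notebook, and the function only builds the URL without sending any request.

```python
import re


def next_page_url(page_url):
    """Return page_url with its page number increased by one."""
    # replace only the digits directly after 'page-', leaving the rest of the URL untouched
    return re.sub(r'(?<=page-)\d+', lambda m: str(int(m.group()) + 1), page_url)


print(next_page_url('http://books.toscrape.com/catalogue/page-1.html'))
print(next_page_url('http://books.toscrape.com/catalogue/page-50.html'))
```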

#### Getting all products URLs

In [27]:
# function returning all books urls
def allBooks():
    all_books = []
    for page in allPages():
        for book_url in getBooksURLs(page):
            all_books.append(book_url)
    return all_books

all_urls = allBooks()
print(f'The total number of books in "Books to Scrape" repository: {len(all_urls)}')

print('Some examples:')
for book_url in all_urls[:5]:
    print(book_url)

The total number of books in "Books to Scrape" repository: 1000
Some examples:
http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
http://books.toscrape.com/catalogue/soumission_998/index.html
http://books.toscrape.com/catalogue/sharp-objects_997/index.html
http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html


#### Getting product data - table with available information

In [42]:
# preparing list of url addresses for each book
all_urls = allBooks()

Code for extracting each field separately: title, number in stock, price, book category, rating, image URL.

In [43]:
url = all_urls[0]
soup = getAndParse(url)
#title = soup.find_all('li', class_="active")
#title[0].text
title = soup.find_all('h1')[0].text
print(title)

A Light in the Attic


In [44]:
in_stock = re.search(r'\d+', soup.find_all('p', class_="instock availability")[0].text).group()
print(in_stock)

22


In [45]:
price = re.search(r'\d+\.\d+', soup.find_all('p', class_="price_color")[0].text).group()
print(price)

51.77
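
Both `in_stock` and `price` are still strings at this point. If numeric values are needed later (for sorting or arithmetic), they can be converted right after extraction. A sketch, assuming the availability text looks like `In stock (22 available)` and the price text like `£51.77`; the helper names `parse_stock` and `parse_price` are hypothetical.

```python
import re


def parse_stock(text):
    """Return the first integer found in the availability text, or 0 if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0


def parse_price(text):
    """Return the price as a float, ignoring the currency symbol; None if no price is found."""
    match = re.search(r'\d+\.\d+', text)
    return float(match.group()) if match else None


print(parse_stock('\n    In stock (22 available)\n'))
print(parse_price('£51.77'))
```

Returning a default instead of calling `.group()` unconditionally keeps the loop over all books from stopping with an `AttributeError` when one page lacks the expected text.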


In [32]:
book_category = re.match('[a-zA-Z]+',soup.find_all('a', href=re.compile("../category/books"))
                         [-1].get('href').split('/')[3]).group()
print(book_category)

poetry


In [33]:
rating = soup.find_all('p', class_=re.compile("star"))[0].get("class")[-1]
print(rating)

Three


In [34]:
url_image = soup.find_all('img')[0].get('src')
url_image = '/'.join(url.split('/')[:3]) + '/' + '/'.join(url_image.split('/')[2:])
print(url_image)

http://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg
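
The same image address can be obtained with `urllib.parse.urljoin`, which resolves the `../` segments on its own. A sketch, assuming the `src` attribute on the product page is the relative path `../../media/cache/...` (this is what the `split('/')[2:]` step implies); the variable names are illustrative.

```python
from urllib.parse import urljoin

product_url = 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
# assumed value of the <img> src attribute on the product page
image_src = '../../media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg'

# each '..' climbs one directory up from the product page's folder
image_url = urljoin(product_url, image_src)
print(image_url)
```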


In [35]:
# collecting each field for the first 5 books: title, number in stock, price, book category, rating, image URL
titles = []
nb_in_stock = []
prices = []
book_categories = []
ratings = []
image_urls = []
for url in all_urls[:5]:
    soup = getAndParse(url)
    titles.append(soup.find_all('h1')[0].text)
    nb_in_stock.append(re.search(r'\d+', soup.find_all('p', class_="instock availability")[0].text).group())
    prices.append(re.search(r'\d+\.\d+', soup.find_all('p', class_="price_color")[0].text).group())
    book_categories.append(soup.find_all('a', href=re.compile('../category/books'))[-1].get('href').split('/')[3])
    ratings.append(soup.find_all('p', class_=re.compile('star'))[0].get("class")[-1])
    image_urls.append('/'.join(url.split('/')[:3]) + '/' + '/'.join(soup.find_all('img')[0].get('src').split('/')[2:]))
    

In [36]:
# printing the results for the first 5 books, field by field
print(titles,'\n',nb_in_stock,'\n',prices,'\n',book_categories,'\n',ratings)
for image in image_urls:
    print(image)

['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind'] 
 ['22', '20', '20', '20', '20'] 
 ['51.77', '53.74', '50.10', '47.82', '54.23'] 
 ['poetry_23', 'historical-fiction_4', 'fiction_10', 'mystery_3', 'history_32'] 
 ['Three', 'One', 'One', 'Four', 'Five']
http://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg
http://books.toscrape.com/media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg
http://books.toscrape.com/media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg
http://books.toscrape.com/media/cache/c0/59/c05972805aa7201171b8fc71a5b00292.jpg
http://books.toscrape.com/media/cache/ce/5f/ce5f052c65cc963cf4422be096e915c9.jpg


In [37]:
scraped_results = pd.DataFrame({'Titles':titles, 'Number in stock': nb_in_stock, 'Prices':prices,
                                'Categories': book_categories, 'Ratings': ratings, 'Image url': image_urls})
scraped_results.head()

   Titles                                 Number in stock  Prices  Categories            Ratings  Image url
0  A Light in the Attic                   22               51.77   poetry_23             Three    http://books.toscrape.com/media/cache/fe/72/fe...
1  Tipping the Velvet                     20               53.74   historical-fiction_4  One      http://books.toscrape.com/media/cache/08/e9/08...
2  Soumission                             20               50.10   fiction_10            One      http://books.toscrape.com/media/cache/ee/cf/ee...
3  Sharp Objects                          20               47.82   mystery_3             Four     http://books.toscrape.com/media/cache/c0/59/c0...
4  Sapiens: A Brief History of Humankind  20               54.23   history_32            Five     http://books.toscrape.com/media/cache/ce/5f/ce...


In [38]:
# converting ratings into numerical values
scraped_results['Ratings'] = scraped_results['Ratings'].map({'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5})
scraped_results.head()

   Titles                                 Number in stock  Prices  Categories            Ratings  Image url
0  A Light in the Attic                   22               51.77   poetry_23             3        http://books.toscrape.com/media/cache/fe/72/fe...
1  Tipping the Velvet                     20               53.74   historical-fiction_4  1        http://books.toscrape.com/media/cache/08/e9/08...
2  Soumission                             20               50.10   fiction_10            1        http://books.toscrape.com/media/cache/ee/cf/ee...
3  Sharp Objects                          20               47.82   mystery_3             4        http://books.toscrape.com/media/cache/c0/59/c0...
4  Sapiens: A Brief History of Humankind  20               54.23   history_32            5        http://books.toscrape.com/media/cache/ce/5f/ce...


In [39]:
# stripping the numeric suffix from the Categories values
# note: the pattern stops at the first non-letter, so 'historical-fiction_4' becomes 'historical'
def clean_cat(category):
    return re.match('[a-zA-Z]+', category).group()
scraped_results.Categories = scraped_results.Categories.apply(clean_cat)
scraped_results.head()

   Titles                                 Number in stock  Prices  Categories  Ratings  Image url
0  A Light in the Attic                   22               51.77   poetry      3        http://books.toscrape.com/media/cache/fe/72/fe...
1  Tipping the Velvet                     20               53.74   historical  1        http://books.toscrape.com/media/cache/08/e9/08...
2  Soumission                             20               50.10   fiction     1        http://books.toscrape.com/media/cache/ee/cf/ee...
3  Sharp Objects                          20               47.82   mystery     4        http://books.toscrape.com/media/cache/c0/59/c0...
4  Sapiens: A Brief History of Humankind  20               54.23   history     5        http://books.toscrape.com/media/cache/ce/5f/ce...
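
In the table above 'historical-fiction_4' has been shortened to 'historical', because the pattern `[a-zA-Z]+` stops at the first hyphen. If the full category name should be kept, one option is to remove only the trailing `_<number>` part. A minimal sketch; the name `strip_category_id` is hypothetical.

```python
import re


def strip_category_id(category):
    """Remove a trailing '_<number>' from a category slug, keeping any hyphens."""
    return re.sub(r'_\d+$', '', category)


for raw in ['poetry_23', 'historical-fiction_4', 'fiction_10']:
    print(raw, '->', strip_category_id(raw))
```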


The end.