# Scraping data from a website

website link: https://books.toscrape.com/

* To work with websites in Python, we import the following libraries, which let us fetch a URL and extract data from the page it returns.

* These libraries come pre-installed in many notebook environments, so we only need to import them.

* It is important to write `BeautifulSoup` in CamelCase exactly as shown. Python names are case-sensitive, so any other spelling won't work.

* `requests` is a library that sends a request to the website and gives us back the website's response.

* We will use a variable named `response` to store the response from the website.

* We cannot extract data from any website we want; we should scrape only those websites that allow their data to be extracted.

* The `response` object gives us the source code (HTML) of the page.
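
Whether a site allows scraping is usually stated in its terms of use and in its `robots.txt` file. As a minimal sketch, the standard library's `urllib.robotparser` can evaluate such rules. The rules below are invented for illustration and are not the actual `robots.txt` of books.toscrape.com, and `is_allowed` is a hypothetical helper name. For a live check you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, made up for this example only.
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(example_rules)

def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://example.com/catalogue/page-1.html"))  # True
print(is_allowed("https://example.com/private/report.html"))    # False
```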







In [1]:
import requests
from bs4 import BeautifulSoup as bs

In [2]:
# requesting the website to connect

url = 'https://books.toscrape.com/'

response = requests.get(url)
response
# 200 is the status code for successful response

<Response [200]>

In [3]:
# printing status code

response.status_code

200

In [4]:
# checking type of response
type(response)

requests.models.Response

In [5]:
# printing the first 1000 characters of the response text
print(response.text[:1000])

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" href="static/oscar/favicon.

In [6]:
type(response.text)

str

In [7]:
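# parse the page with bs so that we can extract the title
# (naming the parser explicitly avoids a "no parser specified" warning)

soup = bs(response.text, 'html.parser')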
type(soup)

bs4.BeautifulSoup

In [8]:
soup.find('title').text

'\n    All products | Books to Scrape - Sandbox\n'

In [9]:
soup.find('title').text.strip()

'All products | Books to Scrape - Sandbox'

It's time to inspect the HTML tags and identify the tag that holds each book, so that we can extract information about the books.

In [10]:
# finding all the article tags

books_tag = soup.find_all('article', class_ = 'product_pod')

In [11]:
len(books_tag)

# len of books_tag is 20 as there are 20 books on each page of the website

20

Now, let's select a single book and extract all the information we can.

In [12]:
# extracting individual part of first book

book_tag = books_tag[0]
book_tag

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

The title is inside an 'a' tag (anchor tag). The book contains more than one 'a' tag, so we cannot simply select them all. We only want the tag with a 'title' attribute, so let's select it.

In [13]:
# finding a single element inside this book: here, the anchor tag that has a title attribute

title_tag = book_tag.find('a', title = True)
title_tag
# a = anchor tag element

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

In [14]:
# printing title as text
title_tag.text

'A Light in the ...'

But, as you can see, it's not the complete title. The complete title of the book can be extracted from the 'title' attribute.

In [15]:
# printing the full title
title_tag['title']

'A Light in the Attic'

In [16]:
# one more way to print full title
title = book_tag.find('a', title = True)['title']
title

'A Light in the Attic'

We follow the same process as above to extract the rating, price and book link.

In [17]:
# rating

rating = book_tag.find('p')['class'][1]
rating
# p = paragraph tag
# [1] = second class of the p tag ('star-rating' is the first, at index 0)

'Three'
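
The rating comes back as a word because the site stores it as the second CSS class of the `<p>` tag (`class="star-rating Three"`). For sorting or averaging, it helps to map the word to a number. A minimal sketch: `RATING_WORDS` and `rating_to_int` are names invented here, and the mapping assumes the site only uses the words One through Five.

```python
# Map the rating words found in the class attribute to integers.
# Assumption: the site uses only these five words.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(css_classes):
    """Return the numeric rating from a list such as ['star-rating', 'Three'].

    Returns None if no known rating word is present.
    """
    for name in css_classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None

print(rating_to_int(["star-rating", "Three"]))  # 3
```

Searching the whole class list, rather than indexing `[1]`, keeps the helper working even if the order of the classes changes.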

In [18]:
# book link

title_tag['href']

'catalogue/a-light-in-the-attic_1000/index.html'

In [19]:
# building the full (absolute) book link from the site address and the href

link = 'https://books.toscrape.com/' + book_tag.find('a')['href']
link

'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'

Let's put the above code inside a function.

In [20]:
def get_details(book_tag):
  title = book_tag.find('a', title = True)['title']
  rating = book_tag.find('p')['class'][1]
  # [1:] drops the stray 'Â' that precedes '£' in the decoded page text
  price = book_tag.find('p', class_= 'price_color').text[1:]
  link = 'https://books.toscrape.com/' + book_tag.find('a')['href']
  return title, rating, price, link

The `get_details` function takes a 'book_tag', extracts all the details from it and returns them.
Let's write some more functions too.

In [21]:
def get_soup(url):
  """Takes a URL and returns a soup object, or None if the request fails"""
  resp = requests.get(url)
  if resp.status_code == 200:
      return bs(resp.text, 'html.parser')
  return None

def get_books(url):
  """Extract details from all the book tags"""
  soup = get_soup(url)
  if soup is None:   # the request failed, so there is nothing to extract
      return []
  book_tags = soup.find_all('article', class_ = 'product_pod')

  books = []   # list of books
  for book_tag in book_tags:
      books.append(get_details(book_tag))

  return books

In [22]:
url = 'https://books.toscrape.com/'
books = get_books(url)
len(books)

20

In [23]:
# first three books

books[:3]

[('A Light in the Attic',
  'Three',
  '£51.77',
  'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'),
 ('Tipping the Velvet',
  'One',
  '£53.74',
  'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'),
 ('Soumission',
  'One',
  '£50.10',
  'https://books.toscrape.com/catalogue/soumission_998/index.html')]
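
The prices above are still strings, and the `[1:]` slice in `get_details` only works because the price text arrives as `Â£51.77`, as seen in the tag output earlier. That pattern is what the UTF-8 bytes for `£` look like when they are decoded as ISO-8859-1 (Latin-1). The sketch below reproduces the effect with plain strings and shows one way to turn a price into a number whatever its prefix is. `parse_price` is a hypothetical helper, not part of the notebook.

```python
import re

# '£' is two bytes in UTF-8 (0xC2 0xA3); read as Latin-1 they become 'Â£'.
garbled = "£51.77".encode("utf-8").decode("iso-8859-1")
print(garbled)      # Â£51.77
print(garbled[1:])  # £51.77

def parse_price(text):
    """Extract the first decimal number from a price string as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())

print(parse_price(garbled))  # 51.77
```

Setting `response.encoding = 'utf-8'` before reading `response.text` should avoid the stray character in the first place; the `[1:]` slice would then have to go, since it would cut off the `£` instead.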

In [24]:
# scraping several pages and converting the resulting list of tuples into a dataframe using pandas

import pandas as pd

def get_all_books(page = 3):   # page = number of pages to scrape (3 by default); pass any number of pages you want
  books = []
  for i in range(1, page + 1):
    # this is how the url changes with every page
    url = f'https://books.toscrape.com/catalogue/page-{i}.html'
    soup = get_soup(url)
    if soup:
      book_tags = soup.find_all('article', class_ = 'product_pod')

      for book_tag in book_tags:
        books.append(get_details(book_tag))

  books = pd.DataFrame(books, columns = ['title', 'rating', 'price', 'link'])
  return books

We will scrape only the first three pages to test our code.

In [25]:
df = get_all_books(3)
df.head()

| | title | rating | price | link |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | https://books.toscrape.com/a-light-in-the-atti... |
| 1 | Tipping the Velvet | One | £53.74 | https://books.toscrape.com/tipping-the-velvet_... |
| 2 | Soumission | One | £50.10 | https://books.toscrape.com/soumission_998/inde... |
| 3 | Sharp Objects | Four | £47.82 | https://books.toscrape.com/sharp-objects_997/i... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | https://books.toscrape.com/sapiens-a-brief-his... |
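
Note that the links in this output are missing the `catalogue/` segment that the homepage links had. On the paginated pages the `href` values appear to be relative to the `/catalogue/` folder, so prefixing them with the site root gives a different path. One way around this, sketched below, is to resolve each `href` against the URL of the page it came from using `urllib.parse.urljoin` from the standard library. `absolute_link` is a hypothetical helper name.

```python
from urllib.parse import urljoin

def absolute_link(page_url, href):
    """Resolve a relative href against the URL of the page it was found on."""
    return urljoin(page_url, href)

# href as it appears on the homepage
home_link = absolute_link(
    "https://books.toscrape.com/",
    "catalogue/a-light-in-the-attic_1000/index.html",
)

# href as it appears on a paginated catalogue page
page_link = absolute_link(
    "https://books.toscrape.com/catalogue/page-1.html",
    "a-light-in-the-attic_1000/index.html",
)

print(home_link)
print(page_link)
# Both print:
# https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

To use this inside `get_details`, the page URL would have to be passed in alongside `book_tag`.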


In [26]:
df.shape

# 60 = no.of books as each page has 20 books on it and we have scraped 3 pages.

(60, 4)

**Perfection! Exactly what we expected.**

Before we scrape all 1000 books, we have to take care of a few more things.

*   Whenever you scrape a website, try to be responsible. A normal user generally makes 2-5 requests (clicks) per minute, but your Python program can make up to 1000 requests per second. This can use up all the resources on the server and, in the worst case, even crash it. So, make sure you sleep for a couple of seconds before you make the next request.

*   Multiple things can go wrong when scraping a website, such as a `network error`, `slow connection`, `timeout`, `element missing`, `code change`, etc. So, it's highly recommended to use `try`/`except` blocks to handle errors effectively.
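
Both points can be sketched independently of the scraping functions. The helper below pauses after every request and retries when a request raises an error. It is only an illustration under stated assumptions: `fetch_with_retries` and `flaky_fetch` are invented names, and `fetch` stands for any callable that downloads a page, such as a small wrapper around `requests.get`. A fake download function is used here so that the example runs without a network connection.

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=2.0):
    """Call fetch(url) up to `retries` times, sleeping `delay` seconds after each call.

    `fetch` is any callable that returns page content or raises on failure.
    Returns None if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        try:
            result = fetch(url)
        except Exception as error:  # narrow this to the errors you expect
            print(f"attempt {attempt} for {url} failed: {error}")
            result = None
        time.sleep(delay)  # pause so the server is not flooded with requests
        if result is not None:
            return result
    return None

# Stand-in for a real download: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return "<html>ok</html>"

# delay=0 keeps the demonstration fast; use a couple of seconds for real sites.
page = fetch_with_retries(flaky_fetch, "https://books.toscrape.com/", delay=0)
print(page)  # <html>ok</html>
```

Passing the download function in as an argument keeps the retry logic separate from the library used to make the request.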

This is how your final code will look.

