- Importing Libraries

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

- URL of the website we want to scrape: http://books.toscrape.com/

In [None]:
url = 'http://books.toscrape.com/'

- Use the requests.get() function to send an HTTP GET request to the specified URL and store the response.

In [None]:
response = requests.get(url)
response

<Response [200]>

- response.status_code holds the HTTP status code the web server returned for the request; 200 means the request succeeded.

In [None]:
response.status_code

200
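A status check can also be written so that it fails loudly instead of being read by eye. The sketch below is only an illustration: `check_response` is a hypothetical helper, and the two `Response` objects are built by hand so the example runs without touching the network.

```python
import requests

def check_response(response):
    """Return True for a 2xx status; raise requests.HTTPError for 4xx/5xx."""
    response.raise_for_status()          # raises on 4xx and 5xx status codes
    return 200 <= response.status_code < 300

# Hand-built responses standing in for real ones (assumption: no network needed)
fake_ok = requests.Response()
fake_ok.status_code = 200

fake_missing = requests.Response()
fake_missing.status_code = 404

ok_result = check_response(fake_ok)

try:
    check_response(fake_missing)
    missing_raised = False
except requests.HTTPError:
    missing_raised = True

print(ok_result, missing_raised)
```

With a real request the same helper would be called as `check_response(requests.get(url))`.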

In [None]:
type(response.text)  # response.text is a str holding the decoded response body

str

- response.text[:100] shows the first 100 characters of the text content received in the HTTP response.

In [None]:
response.text[:100]

'<!DOCTYPE html>\n<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![end'

- bs(response.text, 'html.parser') creates a BeautifulSoup object (soup) by parsing the HTML in response.text, the text content of the HTTP response, with Python's built-in HTML parser. Naming the parser keeps the result consistent across machines.

In [None]:
soup = bs(response.text, 'html.parser')

In [None]:
type(soup)

bs4.BeautifulSoup
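When no parser is named, BeautifulSoup picks whichever parser is installed and warns about the guess. As a minimal sketch of passing the parser explicitly, the example below parses a hard-coded string; `sample_html` is made up for illustration and stands in for `response.text`.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical snippet standing in for response.text
sample_html = "<html><head><title> Demo page </title></head><body><p>Hello</p></body></html>"

# 'html.parser' is the parser that ships with Python, so nothing extra is installed
soup_demo = bs(sample_html, "html.parser")
page_title = soup_demo.title.text.strip()

print(type(soup_demo).__name__, "|", page_title)
```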

- The line below reads the text of the title tag from the BeautifulSoup object (soup)

In [None]:
soup.title.text.strip() #title tag

'All products | Books to Scrape - Sandbox'

- The line below uses BeautifulSoup to find all HTML elements with the tag 'article' and the class attribute set to 'product_pod' within the BeautifulSoup object (soup)

In [None]:
books_tag = soup.find_all('article',class_ ='product_pod')

- len() gives the number of elements in books_tag, i.e. the number of books on the page

In [None]:
len(books_tag)

20
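To see how the tag-plus-class filter behaves, here is a small sketch on a hand-written fragment (the HTML and the book names are invented). It also shows `select()` with a CSS selector, which should return the same elements as `find_all()` here.

```python
from bs4 import BeautifulSoup as bs

# Invented fragment: two product cards and one unrelated article
sample_html = """
<section>
  <article class="product_pod"><h3><a href="catalogue/book-a/index.html" title="Book A">Book A</a></h3></article>
  <article class="product_pod"><h3><a href="catalogue/book-b/index.html" title="Book B">Book B</a></h3></article>
  <article class="other"><h3><a href="#" title="Not a product">x</a></h3></article>
</section>
"""

demo_soup = bs(sample_html, "html.parser")

# Filter by tag name and class, as in the cell above
by_find_all = demo_soup.find_all("article", class_="product_pod")

# The same filter written as a CSS selector
by_select = demo_soup.select("article.product_pod")

titles = [article.find("a", title=True)["title"] for article in by_find_all]
print(len(by_find_all), len(by_select), titles)
```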

- Get the first element of books_tag to inspect the HTML of a single book

In [None]:
book = books_tag[0]
book

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

- Extracting the 'title' of a book from the HTML structure of the selected book

In [None]:
title_tag = book.find('a',title=True)['title']
title_tag

'A Light in the Attic'

- Extracting the 'Rating' of a book from the HTML structure of the selected book

In [None]:
rating_tag = book.find('p')['class'][1]
rating_tag

'Three'
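The rating comes back as a word because it is stored as a CSS class. If a number is more convenient for later analysis, a lookup table is enough. This is a sketch; `RATING_WORDS` and `rating_to_int` are names introduced here, not part of the notebook.

```python
# Mapping from the class word used on the page to a numeric rating
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(word):
    """Return the numeric rating for a class word such as 'Three', or None if unknown."""
    return RATING_WORDS.get(word)

print(rating_to_int("Three"))
```

Note that `book.find('p')['class'][1]` relies on the rating paragraph being the first `p` in the article; `book.find('p', class_='star-rating')` would make that dependency explicit.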

- Extracting the 'price' of a book from the HTML structure of the selected book. The [1:] slice drops the stray leading 'Â' that appears before the pound sign in the HTML above.

In [None]:
price_tag = book.find('p',class_ = 'price_color').text[1:]
price_tag

'£51.77'
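The 'Â£51.77' in the HTML dump is what a UTF-8 encoded pound sign looks like when its bytes are decoded as ISO-8859-1, which is consistent with the response being decoded with the wrong encoding. The sketch below reproduces that effect and shows one way to pull the number out regardless of what precedes it; `parse_price` is a hypothetical helper.

```python
import re

# '£' encoded as UTF-8 is two bytes; decoding them as ISO-8859-1 yields 'Â£'
garbled = "£51.77".encode("utf-8").decode("iso-8859-1")

def parse_price(text):
    """Return the first decimal number found in text as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())

print(repr(garbled), parse_price(garbled), parse_price("£51.77"))
```

Searching for the digits avoids depending on how many stray characters sit in front of the price.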

- Extracting the 'link' of a book from the HTML structure of the selected book

In [None]:
link_tag = book.find('a')['href']
url + link_tag

'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
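String concatenation works on the home page because its links are relative to the site root. A relative link is really relative to the page it appears on, so `urllib.parse.urljoin` is a safer way to build the absolute URL. In this sketch the second href is invented to show what happens on a listing page under `catalogue/`.

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"

# Link as it appears on the home page
home_link = urljoin(base_url, "catalogue/a-light-in-the-attic_1000/index.html")

# Hypothetical href as it might appear on a catalogue listing page
page_url = "http://books.toscrape.com/catalogue/page-2.html"
listing_link = urljoin(page_url, "some-book_1/index.html")

print(home_link)
print(listing_link)
```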

In [None]:
list_data = []  # list to store the scraped data

- The loop below iterates over all 20 books in the 'books_tag' list, extracts the information for each book, and appends it to 'list_data'.

In [None]:
for book in books_tag:
  title_tag = book.find('a',title=True)['title']
  rating_tag = book.find('p')['class'][1]
  price_tag = book.find('p',class_ = 'price_color').text[1:]
  link_tag = url + book.find('a')['href']

  list_data.append([title_tag,rating_tag,price_tag,link_tag])

In [None]:
columns = ['title','rating','price','link']

- This creates a DataFrame from list using pandas

In [None]:
data = pd.DataFrame(list_data,columns=columns)

- Displaying the complete data of one page

In [None]:
data # complete data of one page

Unnamed: 0,title,rating,price,link
0,A Light in the Attic,Three,£51.77,http://books.toscrape.com/catalogue/a-light-in...
1,Tipping the Velvet,One,£53.74,http://books.toscrape.com/catalogue/tipping-th...
2,Soumission,One,£50.10,http://books.toscrape.com/catalogue/soumission...
3,Sharp Objects,Four,£47.82,http://books.toscrape.com/catalogue/sharp-obje...
4,Sapiens: A Brief History of Humankind,Five,£54.23,http://books.toscrape.com/catalogue/sapiens-a-...
5,The Requiem Red,One,£22.65,http://books.toscrape.com/catalogue/the-requie...
6,The Dirty Little Secrets of Getting Your Dream...,Four,£33.34,http://books.toscrape.com/catalogue/the-dirty-...
7,The Coming Woman: A Novel Based on the Life of...,Three,£17.93,http://books.toscrape.com/catalogue/the-coming...
8,The Boys in the Boat: Nine Americans and Their...,Four,£22.60,http://books.toscrape.com/catalogue/the-boys-i...
9,The Black Maria,One,£52.15,http://books.toscrape.com/catalogue/the-black-...


- Checking the URL pattern of the catalogue pages. The base url already ends with '/', so the path must not start with another one.

In [None]:
for i in range(1,5):
  url1 = url + 'catalogue/page-' + str(i) + '.html'
  print(url1)

http://books.toscrape.com/catalogue/page-1.html
http://books.toscrape.com/catalogue/page-2.html
http://books.toscrape.com/catalogue/page-3.html
http://books.toscrape.com/catalogue/page-4.html


## **Steps to follow to scrape data from all pages**
- Loop through pages
- Construct URL for each page
- Make an HTTP request and create BeautifulSoup object
- Find all book articles on the page
- Extract data for each book on the page
- Append the data to the list
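
The extraction steps above can be sketched as a function that takes HTML text rather than a URL, which makes it possible to try the logic without a network connection. `parse_page` and the sample HTML are illustrative assumptions; the sketch also resolves links against the page URL and looks up the rating paragraph by its class.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

def parse_page(html, page_url):
    """Return [title, rating, price, link] rows for every book in one page of HTML."""
    soup = bs(html, "html.parser")
    rows = []
    for book in soup.find_all("article", class_="product_pod"):
        title = book.find("a", title=True)["title"]
        rating = book.find("p", class_="star-rating")["class"][1]
        price = book.find("p", class_="price_color").text.strip()
        link = urljoin(page_url, book.find("a")["href"])   # resolve relative to the page
        rows.append([title, rating, price, link])
    return rows

# Invented fragment shaped like one product card
sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="demo-book_1/index.html" title="Demo Book">Demo ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
"""

rows = parse_page(sample_html, "http://books.toscrape.com/catalogue/page-1.html")
print(rows)
```

For a live page, the same function would be called with `response.text` and the URL that was requested.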

In [None]:
from urllib.parse import urljoin

data_list = []
total_pages = 50
for page in range(1, total_pages + 1):                             # loop through pages
  url1 = url + 'catalogue/page-' + str(page) + '.html'             # URL construction (uses page, not a leftover i)
  response = requests.get(url1)                                    # Make HTTP request
  soup = bs(response.text, 'html.parser')                          # Create BeautifulSoup object
  books_tag = soup.find_all('article', class_='product_pod')       # Find all book articles on the page
  for book in books_tag:                                           # Data for each book on the page is extracted
    title_tag = book.find('a', title=True)['title']
    rating_tag = book.find('p')['class'][1]
    price_tag = book.find('p', class_='price_color').text[1:]
    link_tag = urljoin(url1, book.find('a')['href'])               # hrefs are relative to the page URL
    data_list.append([title_tag, rating_tag, price_tag, link_tag]) # Appending the data to list

- List of data that contains the information scraped from all pages

In [None]:
data_list[:5]

[['The Nameless City (The Nameless City #1)',
  'Four',
  '£38.16',
  'http://books.toscrape.com/the-nameless-city-the-nameless-city-1_940/index.html'],
 ['The Murder That Never Was (Forensic Instincts #5)',
  'Three',
  '£54.11',
  'http://books.toscrape.com/the-murder-that-never-was-forensic-instincts-5_939/index.html'],
 ["The Most Perfect Thing: Inside (and Outside) a Bird's Egg",
  'Four',
  '£42.96',
  'http://books.toscrape.com/the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html'],
 ['The Mindfulness and Acceptance Workbook for Anxiety: A Guide to Breaking Free from Anxiety, Phobias, and Worry Using Acceptance and Commitment Therapy',
  'Four',
  '£23.89',
  'http://books.toscrape.com/the-mindfulness-and-acceptance-workbook-for-anxiety-a-guide-to-breaking-free-from-anxiety-phobias-and-worry-using-acceptance-and-commitment-therapy_937/index.html'],
 ['The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing',
  'Three',
  '£16.77',
 

- Converting the list of data from all pages into a DataFrame using pandas

In [None]:
complete_df = pd.DataFrame(data_list, columns = columns)

In [None]:
complete_df

Unnamed: 0,title,rating,price,link
0,The Nameless City (The Nameless City #1),Four,£38.16,http://books.toscrape.com/the-nameless-city-th...
1,The Murder That Never Was (Forensic Instincts #5),Three,£54.11,http://books.toscrape.com/the-murder-that-neve...
2,The Most Perfect Thing: Inside (and Outside) a...,Four,£42.96,http://books.toscrape.com/the-most-perfect-thi...
3,The Mindfulness and Acceptance Workbook for An...,Four,£23.89,http://books.toscrape.com/the-mindfulness-and-...
4,The Life-Changing Magic of Tidying Up: The Jap...,Three,£16.77,http://books.toscrape.com/the-life-changing-ma...
...,...,...,...,...
995,Security,Two,£39.25,http://books.toscrape.com/security_925/index.html
996,"Saga, Volume 6 (Saga (Collected Editions) #6)",Three,£25.02,http://books.toscrape.com/saga-volume-6-saga-c...
997,"Saga, Volume 5 (Saga (Collected Editions) #5)",Two,£51.04,http://books.toscrape.com/saga-volume-5-saga-c...
998,Reskilling America: Learning to Labor in the T...,Two,£19.83,http://books.toscrape.com/reskilling-america-l...


In [None]:
complete_df.shape

(1000, 4)

- Converting the DataFrame to .csv file for further use

In [None]:
complete_df.to_csv('books_data.csv', index=False)
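
The `index=False` argument matters when the file is read back: without it the row index is written as an extra, unnamed first column. The sketch below shows both cases on a one-row frame, using an in-memory buffer instead of a file; the row contents are invented.

```python
import io
import pandas as pd

columns = ["title", "rating", "price", "link"]
demo_df = pd.DataFrame(
    [["Demo Book", "Three", "£10.00", "http://example.com/demo"]],  # invented row
    columns=columns,
)

# Written without the index: the columns round-trip unchanged
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# Written with the index (the default): an extra first column appears on reload
with_index = io.StringIO(demo_df.to_csv())
loaded_with_index = pd.read_csv(with_index)

print(list(loaded.columns))
print(list(loaded_with_index.columns))
```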

- Using try/except so that a failed request or a missing field does not stop the whole scrape

In [None]:
import numpy as np
import time

In [None]:
from urllib.parse import urljoin

dl = []
total_pages = 50
for page in range(1, total_pages + 1):                             # loop through pages
  url1 = url + 'catalogue/page-' + str(page) + '.html'             # URL construction (uses page, not a leftover i)
  try:
    response = requests.get(url1, timeout=10)                      # Make HTTP request
  except requests.RequestException as e:
    print(e)
    continue                                                       # skip this page if the request failed
  if response.status_code == 200:
    soup = bs(response.text, 'html.parser')
    try:
      books_tag = soup.find_all('article', class_='product_pod')   # Find all book articles on the page
      for book in books_tag:
        try:
          title_tag = book.find('a', title=True)['title']
        except Exception:
          title_tag = np.nan                                       # missing title

        try:
          rating_tag = book.find('p')['class'][1]
        except Exception:
          rating_tag = np.nan                                      # missing rating

        try:
          price_tag = book.find('p', class_='price_color').text[1:]
        except Exception:
          price_tag = np.nan                                       # missing price

        try:
          link_tag = urljoin(url1, book.find('a')['href'])         # hrefs are relative to the page URL
        except Exception:
          link_tag = np.nan                                        # missing link

        dl.append([title_tag, rating_tag, price_tag, link_tag])    # Appending the data to list
    except Exception:
      print(f'Error reading page {page}')
    time.sleep(5)                                                  # pause between requests
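
The four near-identical try/except blocks can be folded into one small helper. This is only a sketch of the idea: `safe_extract` is a hypothetical name, `math.nan` stands in for `np.nan`, and a plain dictionary stands in for a parsed book so the example runs on its own.

```python
import math

def safe_extract(extract, default=math.nan):
    """Call extract() and return its result, or default if the lookup fails."""
    try:
        return extract()
    except (AttributeError, KeyError, IndexError, TypeError):
        return default

# Stand-in for a parsed book with some fields missing
record = {"title": "Demo Book", "classes": ["star-rating"]}
missing_tag = None   # what find() returns when nothing matches

title = safe_extract(lambda: record["title"])
rating = safe_extract(lambda: record["classes"][1])    # IndexError -> nan
price = safe_extract(lambda: record["price"])          # KeyError -> nan
link = safe_extract(lambda: missing_tag["href"])       # TypeError -> nan

print(title, rating, price, link)
```

In the loop above this would read, for example, `title_tag = safe_extract(lambda: book.find('a', title=True)['title'])`. Catching only the lookup errors keeps unrelated problems visible.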

In [None]:
dl[:3]

[['The Nameless City (The Nameless City #1)',
  'Four',
  '£38.16',
  'http://books.toscrape.com/the-nameless-city-the-nameless-city-1_940/index.html'],
 ['The Murder That Never Was (Forensic Instincts #5)',
  'Three',
  '£54.11',
  'http://books.toscrape.com/the-murder-that-never-was-forensic-instincts-5_939/index.html'],
 ["The Most Perfect Thing: Inside (and Outside) a Bird's Egg",
  'Four',
  '£42.96',
  'http://books.toscrape.com/the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html']]

In [None]:
complete_data = pd.DataFrame(dl, columns = columns)  # build the frame from dl, the list filled by the try/except loop

In [None]:
complete_data.head()

Unnamed: 0,title,rating,price,link
0,The Nameless City (The Nameless City #1),Four,£38.16,http://books.toscrape.com/the-nameless-city-th...
1,The Murder That Never Was (Forensic Instincts #5),Three,£54.11,http://books.toscrape.com/the-murder-that-neve...
2,The Most Perfect Thing: Inside (and Outside) a...,Four,£42.96,http://books.toscrape.com/the-most-perfect-thi...
3,The Mindfulness and Acceptance Workbook for An...,Four,£23.89,http://books.toscrape.com/the-mindfulness-and-...
4,The Life-Changing Magic of Tidying Up: The Jap...,Three,£16.77,http://books.toscrape.com/the-life-changing-ma...


In [None]:
complete_data.to_csv('data_of_books.csv', index=False)