# Web-scraping Books to Scrape

<img src="https://i.pinimg.com/564x/21/4c/cf/214ccffd449f7c7fd9693e4cf7a913d3.jpg" width="600" height="400"/>

> Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.

For this project, I will scrape book information from [Books to Scrape](http://books.toscrape.com/index.html), a fictional bookstore built for learners and developers who want to test their scraping tools. Python's active community maintains a variety of libraries for scraping the web and collecting the required data.

Libraries used:
 - **Requests**: To download web pages
 - **BeautifulSoup**: To parse and explore the structure of the downloaded pages
 - **Pandas**: To organize the data into a DataFrame

### Project Outline

1. Examine the [Books to Scrape](http://books.toscrape.com/index.html) website.
2. Read the web page into Python using the Requests library.
3. Inspect the tags to find the required information.
4. Create a list of the URLs of all books by looping through pages 1-50.
5. For each book, grab the title, price, stock, UPC (universal product code) and link.
6. Compile the details of all 1,000 books into a list of dictionaries.
7. Create a data frame using the Pandas library.
8. Finally, save the information to a CSV file.


## Installing & Importing the libraries

In [1]:
!pip install jovian --upgrade --quiet
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install lxml --upgrade --quiet

Import the required libraries.

In [2]:
import jovian
import requests
from bs4 import BeautifulSoup as BS

## Loading the webpage

Let's start by building the URL of the first book on the landing page.

In [3]:
Landing_page = 'http://books.toscrape.com/'  # trailing slash so relative links join correctly
# Download the page
r = requests.get(Landing_page)
# Parse using BeautifulSoup
s = BS(r.content, 'lxml')

# Each book title sits in an <h3> tag that wraps a link to the book's page
link = s.find_all('h3')
print(Landing_page + link[0].a['href'])

http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
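
Joining a base URL and a relative `href` by plain string concatenation is easy to get wrong when a slash is missing or doubled. As a minimal sketch, the standard library's `urllib.parse.urljoin` resolves a relative link against the page it was found on. The second `href` below is an illustrative value, not one captured from this notebook.

```python
from urllib.parse import urljoin

# Pages on which the relative links are found
landing_page = 'http://books.toscrape.com'
listing_page = 'http://books.toscrape.com/catalogue/page-2.html'

# A relative href from the landing page; urljoin inserts the missing slash
first_book = urljoin(landing_page, 'catalogue/a-light-in-the-attic_1000/index.html')
print(first_book)

# On catalogue pages the href is resolved relative to /catalogue/
# ('some-book_1/index.html' is an illustrative value)
other_book = urljoin(listing_page, 'some-book_1/index.html')
print(other_book)
```

Because `urljoin` works from the URL of the page being parsed, the same call could be used on the landing page and on the catalogue pages without hard-coding a different prefix for each.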


![](https://i.imgur.com/yNBvPiu.png)

We now have the URL of the first book, so we can loop through all 50 pages to build a list of book URLs.

In [4]:
booklinks = []  # URLs of every book's detail page
for x in range(1, 51):  # catalogue pages are numbered 1 to 50
    page_link = f'http://books.toscrape.com/catalogue/page-{x}.html'
    response = requests.get(page_link)
    soup = BS(response.content, 'lxml')
    booklist = soup.find_all('h3')  # one <h3> per book on the page
    for book in booklist:
        for link in book.find_all('a'):
            # hrefs on catalogue pages are relative to /catalogue/
            booklinks.append('http://books.toscrape.com/catalogue/' + link['href'])

In [5]:
len(booklinks)

1000

We now have a list of URLs for all 1,000 books and can use it to extract the details of each one.
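
With 1,000 pages to download, a single timeout or dropped connection would stop the loop partway through. One possible safeguard is a small retry wrapper. The helper name `fetch_with_retry` and its parameters are hypothetical and not part of Requests; it accepts any callable that downloads a URL, and the demonstration uses a fake downloader so that it runs without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying up to `retries` times on any exception.

    `fetch` is any callable taking a URL, for example
    lambda u: requests.get(u, timeout=10).
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay)  # brief pause before trying again


# Demonstration with a fake downloader that fails twice, then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return f'<html>page for {url}</html>'

page = fetch_with_retry(flaky_fetch, 'http://books.toscrape.com/', delay=0)
print(len(calls), page)
```

Catching every exception keeps the sketch short; a real scraper would probably catch only network-related errors.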

Let's look at the page we will inspect to retrieve the required details.

![](https://i.imgur.com/RL5100F.png)

In [6]:
bookinfo = []  # list that will hold one dictionary per book

for link in booklinks:  # loop through the list of links created earlier
    bresponse = requests.get(link)
    bsoup = BS(bresponse.content, 'lxml')

    title = bsoup.h1.text.strip()
    price = bsoup.find('p', {'class': 'price_color'}).text
    stock = bsoup.find('p', {'class': 'instock availability'}).text.strip()
    UPC = bsoup.find('td').text.strip()  # the first table cell holds the UPC

    books = {  # dictionary with this book's details
        'Title' : title,
        'Price' : price,
        'Stock' : stock,
        'UPC' : UPC,
        'Link' : link
    }
    bookinfo.append(books)
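
To make the selectors in the cell above easier to follow, here is a small offline sketch that runs the same lookups against a hand-written HTML fragment. The fragment is a simplified assumption about the product page's markup, not a copy of it, and the built-in `html.parser` is used so that no extra parser is needed. It also shows one alternative for the UPC: finding the `th` cell labelled `UPC` and reading the `td` next to it, which does not rely on the UPC being the first table cell.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for a product page (an assumption)
html = """
<div class="product_main">
  <h1>A Light in the Attic</h1>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
    In stock (22 available)
  </p>
</div>
<table>
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Same lookups as in the scraping loop
title = soup.h1.text.strip()
price = soup.find('p', {'class': 'price_color'}).text
stock = soup.find('p', {'class': 'instock availability'}).text.strip()

# Look the UPC up by its row label instead of by position
upc = soup.find('th', string='UPC').find_next_sibling('td').text.strip()

print(title, price, stock, upc)
```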

In [7]:
bookinfo[:2]

[{'Title': 'A Light in the Attic',
  'Price': '£51.77',
  'Stock': 'In stock (22 available)',
  'UPC': 'a897fe39b1053632',
  'Link': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'},
 {'Title': 'Tipping the Velvet',
  'Price': '£53.74',
  'Stock': 'In stock (20 available)',
  'UPC': '90fa61229261140a',
  'Link': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'}]
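
The `Price` and `Stock` values are still plain strings. If numeric values are needed later, one possible approach is to pull the numbers out with regular expressions. The helper names `parse_price` and `parse_stock` are hypothetical, and the patterns assume the exact formats shown in the output above.

```python
import re

def parse_price(text):
    """Return the numeric part of a price string such as '£51.77' as a float."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None

def parse_stock(text):
    """Return the count from 'In stock (22 available)' as an int, or 0 if absent."""
    match = re.search(r'\((\d+) available\)', text)
    return int(match.group(1)) if match else 0

print(parse_price('£51.77'))
print(parse_stock('In stock (22 available)'))
```

Applied to the first record above, this would give `51.77` and `22`.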

We will now convert the list into a data frame using the pandas library.

In [8]:
import pandas as pd

df = pd.DataFrame(bookinfo)
df.head(10)

| | Title | Price | Stock | UPC | Link |
|---|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock (22 available) | a897fe39b1053632 | http://books.toscrape.com/catalogue/a-light-in... |
| 1 | Tipping the Velvet | £53.74 | In stock (20 available) | 90fa61229261140a | http://books.toscrape.com/catalogue/tipping-th... |
| 2 | Soumission | £50.10 | In stock (20 available) | 6957f44c3847a760 | http://books.toscrape.com/catalogue/soumission... |
| 3 | Sharp Objects | £47.82 | In stock (20 available) | e00eb4fd7b871a48 | http://books.toscrape.com/catalogue/sharp-obje... |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock (20 available) | 4165285e1663650f | http://books.toscrape.com/catalogue/sapiens-a-... |
| 5 | The Requiem Red | £22.65 | In stock (19 available) | f77dbf2323deb740 | http://books.toscrape.com/catalogue/the-requie... |
| 6 | The Dirty Little Secrets of Getting Your Dream... | £33.34 | In stock (19 available) | 2597b5a345f45e1b | http://books.toscrape.com/catalogue/the-dirty-... |
| 7 | The Coming Woman: A Novel Based on the Life of... | £17.93 | In stock (19 available) | e72a5dfc7e9267b2 | http://books.toscrape.com/catalogue/the-coming... |
| 8 | The Boys in the Boat: Nine Americans and Their... | £22.60 | In stock (19 available) | e10e1e165dc8be4a | http://books.toscrape.com/catalogue/the-boys-i... |
| 9 | The Black Maria | £52.15 | In stock (19 available) | 1dfe412b8ac00530 | http://books.toscrape.com/catalogue/the-black-... |


In [9]:
df.to_csv('books_to_scrape.csv', index=False)  # create a CSV file without the index column
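
To confirm that an export like this keeps every value intact, the file can be read back and compared with the data in memory. The sketch below is a stand-in rather than a check of the real file: it uses a two-row sample, a temporary directory, an illustrative file name, and the standard library's `csv` module instead of Pandas so that it runs without extra dependencies.

```python
import csv
import os
import tempfile

# Two-row sample in the same shape as `bookinfo`
sample = [
    {'Title': 'A Light in the Attic', 'Price': '£51.77'},
    {'Title': 'Tipping the Velvet', 'Price': '£53.74'},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample_books.csv')  # illustrative file name

    # UTF-8 keeps the pound sign intact; newline='' is what the csv module expects
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['Title', 'Price'])
        writer.writeheader()
        writer.writerows(sample)

    # Read the file back and check that nothing was lost
    with open(path, newline='', encoding='utf-8') as f:
        restored = list(csv.DictReader(f))

print(restored == sample)
```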

In [10]:
# Execute this to save new versions of the notebook
jovian.commit(project="web-scraping")

[jovian] Updating notebook "alkabhambhu98/web-scraping" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/alkabhambhu98/web-scraping


'https://jovian.ai/alkabhambhu98/web-scraping'

### Summary

For my first project, I wanted to practice on a site built for scraping, so I used [Books to Scrape](http://books.toscrape.com/index.html). I created a dataset of all the books listed across its 50 pages using the Python libraries Requests, BeautifulSoup and Pandas, and then saved the dataset as a CSV file.

### Future Work

In future work, I would like to use more powerful tools such as Selenium and Scrapy to automate the task and dive deeper into web scraping.

### References

* [Let’s Build a Python Web Scraping Project from Scratch](https://www.youtube.com/watch?v=RKsLLG-bzEY) | Hands-On Tutorial by Aakash N S, CEO, Jovian.

* Beautiful Soup [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

* Image [Source](https://i.pinimg.com/564x/21/4c/cf/214ccffd449f7c7fd9693e4cf7a913d3.jpg)

* [Stack Overflow](https://stackoverflow.com/questions/54861405/scraping-multiple-pages-in-python-with-beautifulsoup)
