# BOOK SCRAPPER
This workbook demonstrates pulling data out from the landing page of **BOOKS TO SCRAPE** which contains information regarding books such as types, costs etc.

In [1]:
url='https://books.toscrape.com/catalogue/page-1.html'
import requests
import os
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import seaborn as sns
import plotly as pt

In [2]:
res=requests.get(url)
print(res.status_code)

200


In [3]:
content=res.text

In [5]:
content



Having gotten the content in the landing page scrapping is performed creating an instance of Beautiful soup object passing in the content and parser.

In [6]:
soup=BeautifulSoup(content,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:30" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="../static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="../static/oscar/css/styles.css" rel="stylesheet" typ

HTML parser error : Tag header invalid
    <header class="header container-fluid">
                                          ^
HTML parser error : Tag aside invalid
            <aside class="sidebar col-sm-4 col-md-3">
                                                    ^
HTML parser error : Tag section invalid
        <section>
                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTM

### FIND DIFFERENT GENRES OF BOOKS PRESENT
The webpage contain a list of different genres of books present. The different genres are retrieved along with link to it.

All genres are wrapped within anchor tags  and present under **div** tag with class as **side_categories** except the first one.

In [12]:
book_genres_with_links={}

for idx,item in enumerate(soup.find('div',attrs={'class':'side_categories'}).find_all('a')):
    if idx>0:
        genre=str(item.string).strip()
        link_to_genrepage = item['href']
        book_genres_with_links[genre]=link_to_genrepage

In [13]:
print(f"There exists {len(book_genres_with_links)} different genres of books on the webpage to select from with the genres being {book_genres_with_links}")

There exists 50 different genres of books on the webpage to select from with the genres being {'Travel': 'category/books/travel_2/index.html', 'Mystery': 'category/books/mystery_3/index.html', 'Historical Fiction': 'category/books/historical-fiction_4/index.html', 'Sequential Art': 'category/books/sequential-art_5/index.html', 'Classics': 'category/books/classics_6/index.html', 'Philosophy': 'category/books/philosophy_7/index.html', 'Romance': 'category/books/romance_8/index.html', 'Womens Fiction': 'category/books/womens-fiction_9/index.html', 'Fiction': 'category/books/fiction_10/index.html', 'Childrens': 'category/books/childrens_11/index.html', 'Religion': 'category/books/religion_12/index.html', 'Nonfiction': 'category/books/nonfiction_13/index.html', 'Music': 'category/books/music_14/index.html', 'Default': 'category/books/default_15/index.html', 'Science Fiction': 'category/books/science-fiction_16/index.html', 'Sports and Games': 'category/books/sports-and-games_17/index.html',

## SCRAPE BOOKS PRESENT ACROSS THE WEBSITE AND OBTAIN RELEVANT INFORMATION
Obtain list of all books mapped to price, genre and rating present across all pages of each genre. This is achieved by iterating the genres and for each we scrape all the books under it along with the price tags. In addition also map books to genre.

In [59]:
#TC=O(N^2)
prefix='https://books.toscrape.com/catalogue/'

books_to_genre_cost_rating={}
genre_to_books={}

#scrapes landing page for every genre and looks into whether there exists more than one page
for genre in book_genres_with_links:
    genre_books_index_url = prefix+book_genres_with_links[genre]
    genre_prefix=genre_books_index_url[:-10]

    res=requests.get(genre_books_index_url)

    if res.status_code==200:
        soup_genre = BeautifulSoup(res.text,'lxml')

        for item in soup_genre.find_all('article',attrs={'class':'product_pod'}):

            book_title=item.find('img')['alt']
            rating=item.find('p')['class'][-1]
            cost=str(item.find('div',attrs={'class':'product_price'}).find('p').string).strip()[1:]

            books_to_genre_cost_rating[book_title]=[rating,cost,genre]

            if genre not in genre_to_books:
                genre_to_books[genre]=[]
            genre_to_books[genre].append(book_title)
        
        #some genres may have books across many pages. For such genres we iterate through all such pages to scrape.
        next_page_link=soup_genre.find('li',attrs={'class':'next'})

        while next_page_link is not None:
            genre_next_page_url=genre_prefix+next_page_link.find('a')['href']
            genre_res=requests.get(genre_next_page_url)

            if genre_res.status_code==200:
                soup_genre_next_page=BeautifulSoup(genre_res.text,'lxml')

                for item in soup_genre_next_page.find_all('article',attrs={'class':'product_pod'}):
                    next_page_book_title=item.find('a').find('img')['alt']
                    next_page_rating=item.find('p')['class'][-1]
                    next_page_cost=str(item.find('div',attrs={'class':'product_price'}).find('p').string).strip()[1:]

                    books_to_genre_cost_rating[next_page_book_title]=[next_page_rating,next_page_cost,genre]
                    genre_to_books[genre].append(next_page_book_title)
          
                next_page_link=soup_genre_next_page.find('li',attrs={'class':'next'})            

HTML parser error : Tag header invalid
    <header class="header container-fluid">
                                          ^
HTML parser error : Tag aside invalid
            <aside class="sidebar col-sm-4 col-md-3">
                                                    ^
HTML parser error : Tag section invalid
        <section>
                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTML parser error : Tag article invalid
    <article class="product_pod">
                                ^
HTM

In [60]:
print(f"There exists a total of {len(books_to_genre_cost_rating)} books across all genres")

There exists a total of 999 books across all genres


#### FIND GENRES WITH MAX AND MIN BOOKS

In [61]:
import math

In [62]:
max_books=-1
min_books=math.inf

for genre in genre_to_books:
    if max_books<len(genre_to_books[genre]):
        if max_books!=-1:
            min_books=max_books
            min_genre=max_genre

        max_books=len(genre_to_books[genre])
        max_genre=genre

    elif max_books==len(genre_to_books[genre]):
        min_books=len(genre_to_books[genre])
        min_genre=genre
    else:
        if len(genre_to_books[genre])>=min_books:
            min_books=len(genre_to_books[genre])
            min_genre=genre
    

In [63]:
max_books,max_genre,min_books,min_genre
print(f"The {max_genre} genre has the most number of books under it which is {max_books} that make up {max_books/len(books_to_genre_cost_rating)*100:.2f}% of all books while {min_genre} has the least with a count of only {min_books} which is a mere {min_books/len(books_to_genre_cost_rating)*100:.2f}% of the total")

The Default genre has the most number of books under it which is 152 that make up 15.22% of all books while Nonfiction has the least with a count of only 110 which is a mere 11.01% of the total


#### CREATE A DATAFRAME THAT STORES ALL INFORMATION ABOUT BOOKS GATHERED

In [64]:
book_dict={}

for book in books_to_genre_cost_rating:
    if 'BOOK' not in book_dict:
        book_dict['BOOK']=[]
    book_dict['BOOK'].append(book)

    if 'GENRE' not in book_dict:
        book_dict['GENRE']=[]
    book_dict['GENRE'].append(books_to_genre_cost_rating[book][-1])

    if 'RATING' not in book_dict:
        book_dict['RATING']=[]
    book_dict['RATING'].append(books_to_genre_cost_rating[book][0])
    
    if 'PRICE' not in book_dict:
        book_dict['PRICE']=[]
    book_dict['PRICE'].append(books_to_genre_cost_rating[book][1])

book_df=pd.DataFrame(book_dict)

In [65]:
book_df.head()

Unnamed: 0,BOOK,GENRE,RATING,PRICE
0,It's Only the Himalayas,Travel,Two,£45.17
1,Full Moon over Noahâs Ark: An Odyssey to Mou...,Travel,Four,£49.43
2,See America: A Celebration of Our National Par...,Travel,Three,£48.87
3,Vagabonding: An Uncommon Guide to the Art of L...,Travel,Two,£36.94
4,Under the Tuscan Sun,Travel,Three,£37.33


In [66]:
os.getcwd()

'/Users/soumyadipsikdar/Desktop/Data Science/Web Scrapping'

In [68]:
book_df.to_csv(os.getcwd()+'/book_details.csv')