### Overview

This project crawls **Books to Scrape**, a demo site built for practicing web scraping. The aim is to parse the site's catalogue and pull detailed information about each book: title, rating, price, stock status, and links to its product page and cover image. Using requests and BeautifulSoup to fetch and parse the HTML, then pandas and other libraries to assemble the results, the project covers the journey from raw page content to a clean, structured data set. The same extraction skills apply to tasks such as analyzing trends in book prices or building recommendation engines, or just exploring the wonders of web scraping!

Let's unlock the world of books, one scrape at a time!

### Scraping Book Data from the Home Page

In [1]:
import requests
from bs4 import BeautifulSoup

link = 'https://books.toscrape.com/catalogue/page-1.html'

In [2]:
res = requests.get(link)

soup = BeautifulSoup(res.text, 'html.parser')
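The price text in the output below shows `Â£` rather than `£`. A likely cause (an assumption, not something the notebook confirms) is that the response body is UTF-8 but gets decoded with a single-byte Latin-1 fallback. The sketch below reproduces the effect with plain strings, so it needs no network access.

```python
# UTF-8 encodes the pound sign as two bytes (0xC2 0xA3).
raw = "£51.77".encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character.
garbled = raw.decode("latin-1")
print(garbled)  # Â£51.77

# Decoding with the matching codec restores the original text.
fixed = raw.decode("utf-8")
print(fixed)  # £51.77
```

With requests, setting `res.encoding = 'utf-8'` before reading `res.text` is one way to apply this. The slice offsets used in later cells (`[1:]` and `[2:]`) would then need adjusting, since they currently skip the stray `Â`.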

In [3]:
soup.find_all('li', class_ = 'col-xs-6 col-sm-4 col-md-3 col-lg-3')

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>,
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a 

In [4]:
len(soup.find_all('li', class_ = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'))

20
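Passing the full class string to `class_` only matches when the attribute has exactly those classes in exactly that order. As a hedged alternative, the sketch below selects the `article.product_pod` elements with a CSS selector instead. The inline HTML is a hand-written stand-in for the catalogue markup, with the second item's classes deliberately reordered.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the catalogue markup (not fetched from the site).
html = """
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <article class="product_pod"><h3><a title="Book A">Book A</a></h3></article>
  </li>
  <li class="col-sm-4 col-xs-6 col-md-3 col-lg-3">
    <article class="product_pod"><h3><a title="Book B">Book B</a></h3></article>
  </li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: misses the item whose classes appear in another order.
exact = soup.find_all("li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")

# CSS selector on the single, more descriptive class: finds both items.
by_selector = soup.select("article.product_pod")

print(len(exact), len(by_selector))  # 1 2
```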

In [5]:
data = []

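# Each book on the page sits in one <li> with this exact class string.
for sp in soup.find_all('li', class_ = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'):
    title_anchor = sp.find_all('a')[-1]  # last <a> in the card carries the full title
    # src starts with '../', so drop those three characters before adding the site root
    img_link = 'https://books.toscrape.com/' + sp.find('img').get('src')[3:]
    book_link = 'https://books.toscrape.com/catalogue/' + title_anchor.get('href')
    title = title_anchor.get('title')
    rating = sp.find('p').get('class')[-1]  # ['star-rating', 'Three'] -> 'Three'
    price = sp.find('p', class_ = 'price_color').text[1:]  # drop the stray leading 'Â'
    stock = sp.find('p', class_ = 'instock availability').text.strip()
    data.append([title, rating, price, stock, book_link, img_link])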
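The loop above builds absolute links by slicing off `../` and prepending a fixed prefix. A less position-dependent option is the standard library's `urljoin`, which resolves a relative link against the URL of the page it came from. This is a sketch of that alternative, not the notebook's method. The relative paths are copied from the HTML output shown earlier.

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"

# Relative links as they appear in the page's HTML.
img_src = "../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
book_href = "a-light-in-the-attic_1000/index.html"

# urljoin resolves '../' and sibling paths relative to page_url.
img_link = urljoin(page_url, img_src)
book_link = urljoin(page_url, book_href)

print(img_link)
print(book_link)
```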

In [6]:
data[0]

['A Light in the Attic',
 'Three',
 '£51.77',
 'In stock',
 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg']
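The rating arrives as a word and the price as a string with a currency symbol, which makes sorting and arithmetic awkward later on. A minimal sketch of one possible cleaning step follows. `RATING_WORDS` and `clean_row` are hypothetical names introduced here, and the mapping assumes ratings only ever use the words One through Five.

```python
# Assumed mapping from the rating class name to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_row(row):
    """Convert one scraped row (a list of strings) into a typed dict."""
    title, rating, price, stock, book_link, img_link = row
    return {
        "title": title,
        "rating": RATING_WORDS[rating],    # 'Three' -> 3
        "price": float(price.lstrip("£")),  # '£51.77' -> 51.77
        "stock": stock,
        "book_link": book_link,
        "img_link": img_link,
    }

row = [
    "A Light in the Attic",
    "Three",
    "£51.77",
    "In stock",
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg",
]
cleaned = clean_row(row)
print(cleaned["rating"], cleaned["price"])  # 3 51.77
```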

In [8]:
len(data)

20

### Scraping Books from Multiple Pages

In [9]:
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup
import pandas as pd

In [10]:
link = 'https://books.toscrape.com/catalogue/page-1.html'

In [11]:
print('https://books.toscrape.com/catalogue/page-' + str(1) + '.html')

https://books.toscrape.com/catalogue/page-1.html


In [12]:
data = []

# The catalogue spans 50 pages of 20 books each.
for i in tqdm(range(1, 51)):
    link = 'https://books.toscrape.com/catalogue/page-' + str(i) + '.html'
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')

    for sp in soup.find_all('li', class_ = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'):
        title_anchor = sp.find_all('a')[-1]  # last <a> in the card carries the full title
        # src starts with '../', so drop those three characters before adding the site root
        img_link = 'https://books.toscrape.com/' + sp.find('img').get('src')[3:]
        book_link = 'https://books.toscrape.com/catalogue/' + title_anchor.get('href')
        title = title_anchor.get('title')
        rating = sp.find('p').get('class')[-1]  # ['star-rating', 'Three'] -> 'Three'
        price = sp.find('p', class_ = 'price_color').text[1:]  # drop the stray leading 'Â'
        stock = sp.find('p', class_ = 'instock availability').text.strip()

        data.append([title, rating, price, stock, book_link, img_link])

100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [01:25<00:00,  1.71s/it]
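The loop above hard-codes 50 pages. If the page count is not known in advance, one option is to keep requesting pages until the site stops returning them. The sketch below assumes a missing page is answered with a non-200 status. `iter_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP call so the example runs offline.

```python
def iter_pages(fetch, url_template, max_pages=1000):
    """Yield (page_number, html) until fetch reports a missing page.

    fetch is any callable that takes a URL and returns (status_code, text).
    max_pages is a safety cap so the loop cannot run forever.
    """
    for page in range(1, max_pages + 1):
        status, html = fetch(url_template.format(page))
        if status != 200:
            break
        yield page, html

def fake_fetch(url):
    """Offline stand-in for an HTTP request: pretends only 3 pages exist."""
    page = int(url.rsplit("-", 1)[1].split(".")[0])
    if page <= 3:
        return 200, f"<html>page {page}</html>"
    return 404, ""

template = "https://books.toscrape.com/catalogue/page-{}.html"
pages = list(iter_pages(fake_fetch, template))
print([number for number, _ in pages])  # [1, 2, 3]
```

In the notebook, `fetch` could be a small wrapper around `requests.get` that returns `(res.status_code, res.text)`.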


In [13]:
len(data)

1000

In [14]:
df = pd.DataFrame(data, columns = ['title', 'rating', 'price', 'stock', 'book_link', 'img_link'])

In [15]:
df.head()

|   | title | rating | price | stock | book_link | img_link |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | In stock | https://books.toscrape.com/catalogue/a-light-i... | https://books.toscrape.com/media/cache/2c/da/2... |
| 1 | Tipping the Velvet | One | £53.74 | In stock | https://books.toscrape.com/catalogue/tipping-t... | https://books.toscrape.com/media/cache/26/0c/2... |
| 2 | Soumission | One | £50.10 | In stock | https://books.toscrape.com/catalogue/soumissio... | https://books.toscrape.com/media/cache/3e/ef/3... |
| 3 | Sharp Objects | Four | £47.82 | In stock | https://books.toscrape.com/catalogue/sharp-obj... | https://books.toscrape.com/media/cache/32/51/3... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | In stock | https://books.toscrape.com/catalogue/sapiens-a... | https://books.toscrape.com/media/cache/be/a5/b... |


In [16]:
data[0]

['A Light in the Attic',
 'Three',
 '£51.77',
 'In stock',
 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg']

In [17]:
data[999]

['1,000 Places to See Before You Die',
 'Five',
 '£26.08',
 'In stock',
 'https://books.toscrape.com/catalogue/1000-places-to-see-before-you-die_1/index.html',
 'https://books.toscrape.com/media/cache/d7/0f/d70f7edd92705c45a82118c3ff6c299d.jpg']

In [62]:
df.to_csv('books.csv', index=False)

### Individual Page Scraper

In [18]:
import pandas as pd
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup

In [19]:
df = pd.read_csv('books.csv')
df.head()

|   | title | rating | price | stock | book_link | img_link |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | In stock | https://books.toscrape.com/catalogue/a-light-i... | https://books.toscrape.com/media/cache/2c/da/2... |
| 1 | Tipping the Velvet | One | £53.74 | In stock | https://books.toscrape.com/catalogue/tipping-t... | https://books.toscrape.com/media/cache/26/0c/2... |
| 2 | Soumission | One | £50.10 | In stock | https://books.toscrape.com/catalogue/soumissio... | https://books.toscrape.com/media/cache/3e/ef/3... |
| 3 | Sharp Objects | Four | £47.82 | In stock | https://books.toscrape.com/catalogue/sharp-obj... | https://books.toscrape.com/media/cache/32/51/3... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | In stock | https://books.toscrape.com/catalogue/sapiens-a... | https://books.toscrape.com/media/cache/be/a5/b... |


In [21]:
data = []

for link in tqdm(df['book_link']):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    # The third breadcrumb link is the book's category.
    book_type = soup.find('ul', class_ = 'breadcrumb').find_all('a')[2].text
    # Look up the product information table once and reuse its cells.
    cells = soup.find('table', class_ = 'table table-striped').find_all('td')
    upc = cells[0].text
    price_exclusive = cells[2].text[2:]  # [2:] drops the two-character 'Â£' prefix
    price_inclusive = cells[3].text[2:]
    tax = cells[4].text[2:]
    quantity = cells[5].text
    reviews = cells[6].text
    data.append([book_type, price_exclusive, price_inclusive, tax, quantity, upc, reviews])

100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [19:32<00:00,  1.17s/it]
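The run above took close to 20 minutes because the 1000 product pages are fetched one after another. Since each request spends most of its time waiting on the network, a thread pool is one way to overlap them. This is a sketch of that idea, not the notebook's approach. `fetch_all`, `fake_fetch_one`, and the `max_workers` value are hypothetical, and `fake_fetch_one` replaces the real request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(links, fetch_one, max_workers=8):
    """Apply fetch_one to every link using a pool of threads.

    pool.map returns results in the same order as the input links,
    which matters here because rows are later combined by position.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, links))

def fake_fetch_one(link):
    """Offline stand-in for 'download the page and parse it'."""
    return link.rsplit("/", 2)[-2]  # the book's slug from the URL

links = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/1000-places-to-see-before-you-die_1/index.html",
]
results = fetch_all(links, fake_fetch_one)
print(results)
```

A small `max_workers` keeps the load on the server modest; raising it trades politeness for speed.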


In [22]:
len(data)

1000
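The scraper above reads table cells by position (`[0]`, `[2]`, and so on), which silently picks up the wrong value if a row is ever added or reordered. A hedged alternative is to key each cell by its row header. The inline HTML below is a hand-written stand-in for the product table; its `<th>` labels are assumptions, not copied from the notebook's output.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the product information table (labels assumed).
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="table table-striped")

# Map each row's header text to its cell text.
info = {
    row.find("th").text.strip(): row.find("td").text.strip()
    for row in table.find_all("tr")
}

print(info["UPC"])
print(info["Availability"])
```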

In [23]:
df = pd.DataFrame(data, columns=['category', 'price_e_tax', 'price_i_tax', 'tax', 'quantity', 'upc', 'reviews'])

df.head()

|   | category | price_e_tax | price_i_tax | tax | quantity | upc | reviews |
|---|---|---|---|---|---|---|---|
| 0 | Poetry | 51.77 | 51.77 | 0.0 | In stock (22 available) | a897fe39b1053632 | 0 |
| 1 | Historical Fiction | 53.74 | 53.74 | 0.0 | In stock (20 available) | 90fa61229261140a | 0 |
| 2 | Fiction | 50.1 | 50.1 | 0.0 | In stock (20 available) | 6957f44c3847a760 | 0 |
| 3 | Mystery | 47.82 | 47.82 | 0.0 | In stock (20 available) | e00eb4fd7b871a48 | 0 |
| 4 | History | 54.23 | 54.23 | 0.0 | In stock (20 available) | 4165285e1663650f | 0 |
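The `quantity` column holds text such as `In stock (22 available)`, so the count cannot be summed or compared as it stands. A minimal sketch of pulling the number out with a regular expression follows. `parse_quantity` is a hypothetical helper, and returning 0 when no count is present is an assumption about how missing values should be treated.

```python
import re

def parse_quantity(text):
    """Extract the integer from text like 'In stock (22 available)'.

    Returns 0 when the pattern is absent (an assumed convention).
    """
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0

print(parse_quantity("In stock (22 available)"))  # 22
print(parse_quantity("In stock"))                 # 0
```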


In [24]:
df.isnull().sum()

category       0
price_e_tax    0
price_i_tax    0
tax            0
quantity       0
upc            0
reviews        0
dtype: int64

In [25]:
df.to_csv('books_data.csv', index=False)

### Data Combining

In [26]:
import pandas as pd

In [27]:
df_1 = pd.read_csv('books.csv')
df_2 = pd.read_csv('books_data.csv')

In [28]:
df_1.head()

|   | title | rating | price | stock | book_link | img_link |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | Three | £51.77 | In stock | https://books.toscrape.com/catalogue/a-light-i... | https://books.toscrape.com/media/cache/2c/da/2... |
| 1 | Tipping the Velvet | One | £53.74 | In stock | https://books.toscrape.com/catalogue/tipping-t... | https://books.toscrape.com/media/cache/26/0c/2... |
| 2 | Soumission | One | £50.10 | In stock | https://books.toscrape.com/catalogue/soumissio... | https://books.toscrape.com/media/cache/3e/ef/3... |
| 3 | Sharp Objects | Four | £47.82 | In stock | https://books.toscrape.com/catalogue/sharp-obj... | https://books.toscrape.com/media/cache/32/51/3... |
| 4 | Sapiens: A Brief History of Humankind | Five | £54.23 | In stock | https://books.toscrape.com/catalogue/sapiens-a... | https://books.toscrape.com/media/cache/be/a5/b... |


In [29]:
df_2.head()

|   | category | price_e_tax | price_i_tax | tax | quantity | upc | reviews |
|---|---|---|---|---|---|---|---|
| 0 | Poetry | 51.77 | 51.77 | 0.0 | In stock (22 available) | a897fe39b1053632 | 0 |
| 1 | Historical Fiction | 53.74 | 53.74 | 0.0 | In stock (20 available) | 90fa61229261140a | 0 |
| 2 | Fiction | 50.1 | 50.1 | 0.0 | In stock (20 available) | 6957f44c3847a760 | 0 |
| 3 | Mystery | 47.82 | 47.82 | 0.0 | In stock (20 available) | e00eb4fd7b871a48 | 0 |
| 4 | History | 54.23 | 54.23 | 0.0 | In stock (20 available) | 4165285e1663650f | 0 |


#### Creating an empty DataFrame

In [32]:
df = pd.DataFrame()

In [35]:
df['title'] = df_1['title']
df['upc'] = df_2['upc']
df['category'] = df_2['category']
df['price_e_tax'] = df_2['price_e_tax']
df['price_i_tax'] = df_2['price_i_tax']
df['tax'] = df_2['tax']
df['rating'] = df_1['rating']
df['reviews'] = df_2['reviews']
df['quantity'] = df_2['quantity']
df['stock'] = df_1['stock']
df['book_link'] = df_1['book_link']
df['img_link'] = df_1['img_link']
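The column-by-column assignments above combine the two tables purely by row position, which works because the product pages were scraped in the same order as the catalogue rows. As a sketch of a more compact equivalent, `pd.concat` with `axis=1` places the frames side by side, and a column list then sets the order. The two small frames below are made-up stand-ins for `df_1` and `df_2`.

```python
import pandas as pd

# Made-up stand-ins for the two scraped tables.
df_1 = pd.DataFrame({"title": ["Book A", "Book B"], "rating": ["Three", "One"]})
df_2 = pd.DataFrame({"upc": ["u1", "u2"], "category": ["Poetry", "Fiction"]})

# Positional combining is only safe if both frames have the same row count.
assert len(df_1) == len(df_2)

# Place the frames side by side, aligned on their shared 0..n-1 index.
combined = pd.concat([df_1, df_2], axis=1)

# Reorder the columns explicitly.
combined = combined[["title", "upc", "category", "rating"]]
print(combined.columns.tolist())
```

If row order could ever differ between the two files, merging on a shared key column would be the safer route.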

In [36]:
df.head()

|   | title | category | price_e_tax | price_i_tax | upc | tax | rating | reviews | quantity | stock | book_link | img_link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | Poetry | 51.77 | 51.77 | a897fe39b1053632 | 0.0 | Three | 0 | In stock (22 available) | In stock | https://books.toscrape.com/catalogue/a-light-i... | https://books.toscrape.com/media/cache/2c/da/2... |
| 1 | Tipping the Velvet | Historical Fiction | 53.74 | 53.74 | 90fa61229261140a | 0.0 | One | 0 | In stock (20 available) | In stock | https://books.toscrape.com/catalogue/tipping-t... | https://books.toscrape.com/media/cache/26/0c/2... |
| 2 | Soumission | Fiction | 50.1 | 50.1 | 6957f44c3847a760 | 0.0 | One | 0 | In stock (20 available) | In stock | https://books.toscrape.com/catalogue/soumissio... | https://books.toscrape.com/media/cache/3e/ef/3... |
| 3 | Sharp Objects | Mystery | 47.82 | 47.82 | e00eb4fd7b871a48 | 0.0 | Four | 0 | In stock (20 available) | In stock | https://books.toscrape.com/catalogue/sharp-obj... | https://books.toscrape.com/media/cache/32/51/3... |
| 4 | Sapiens: A Brief History of Humankind | History | 54.23 | 54.23 | 4165285e1663650f | 0.0 | Five | 0 | In stock (20 available) | In stock | https://books.toscrape.com/catalogue/sapiens-a... | https://books.toscrape.com/media/cache/be/a5/b... |


In [37]:
df.to_csv('final.csv', index=False)