# Web Scraping Book Retailer

This notebook scrapes http://books.toscrape.com/, a fake book retailer designed to mimic the layout of many retail websites. The site exists solely for web-scraping practice.

The goal is to generate a dataframe with four columns: one for the title, one for the price, one for the star rating, and one for the book cover JPEG's URL. The dataframe will also have 1000 rows, one for each of the 1000 books listed on the 50 pages of this website.

## Import the following libraries:

In [1]:
#import library
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import sys
sys.tracebacklimit = 0 # turn off the error tracebacks

## Pull HTML Code from the Site
Pull the HTML code from http://books.toscrape.com/, providing a user agent string with the request. Then parse the HTML and save the parsed result as a separate Python variable.

In [2]:
#define url
url = "http://books.toscrape.com/"

#designate user agent header
headers = {'user-agent': 'UVA DS6001 class example version 1.0 (dnr4qq@virginia.edu) (Language=Python 3.8.2; Platform=Windows 10 120.2212.2020.0)'}

#initiate get request for data
r = requests.get(url, headers = headers)
r

<Response [200]>

### Extract Book Titles
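Extract the titles of the 20 books on the first page and save them in a list. The link text is truncated for longer titles, which is why some entries end in "...".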

In [3]:
#initiate beautiful soup
book_scrape = BeautifulSoup(r.text, 'html')

#create title list for df
title_list = book_scrape.find_all("a", title = True)
titles = [t.string for t in title_list]
titles

['A Light in the ...',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History ...',
 'The Requiem Red',
 'The Dirty Little Secrets ...',
 'The Coming Woman: A ...',
 'The Boys in the ...',
 'The Black Maria',
 'Starving Hearts (Triangular Trade ...',
 "Shakespeare's Sonnets",
 'Set Me Free',
 "Scott Pilgrim's Precious Little ...",
 'Rip it Up and ...',
 'Our Band Could Be ...',
 'Olio',
 'Mesaerion: The Best Science ...',
 'Libertarianism for Beginners',
 "It's Only the Himalayas"]

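The truncated entries above come from the visible link text. On this site the anchor's `title` attribute appears to carry the untruncated title, so reading that attribute is one way to recover it. The following is a minimal sketch that parses a hand-written HTML fragment rather than the live page; the fragment and the names `sample_html`, `short_titles`, and `full_titles` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating one listing entry (assumed markup, not fetched).
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")
links = soup.find_all("a", title=True)

short_titles = [a.string for a in links]   # visible link text, may be truncated
full_titles = [a["title"] for a in links]  # value of the title attribute

print(short_titles)
print(full_titles)
```

Swapping `t.string` for `t["title"]` in the list comprehension would apply the same idea to the scraped page.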
### Extract Book Price
Extract the price of each of the 20 books and save the prices in a list.
Prices are listed in British pounds, so remove the £ symbols.

In [4]:
#create price list for df
price_list = book_scrape.find_all("p", "price_color")
prices = [p.string for p in price_list]
prices = [s.replace('Â£', '') for s in prices]
prices

['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 '22.65',
 '33.34',
 '17.93',
 '22.60',
 '52.15',
 '13.99',
 '20.66',
 '17.46',
 '52.29',
 '35.02',
 '57.25',
 '23.88',
 '37.59',
 '51.33',
 '45.17']
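The `'Â£'` in the replacement string is what a UTF-8 pound sign looks like when its two bytes are decoded as ISO-8859-1, which is probably what happens to `r.text` here. The sketch below reproduces that effect and shows one way to turn the cleaned strings into floats. The helper `price_to_float` and the sample inputs are hypothetical. Setting `r.encoding = 'utf-8'` before reading `r.text` may avoid the garbling in the first place, but that is an assumption about how this server responds.

```python
# The pound sign is two bytes in UTF-8; read as ISO-8859-1 they appear as 'Â£'.
pound_utf8 = "£".encode("utf-8")            # b'\xc2\xa3'
garbled = pound_utf8.decode("iso-8859-1")   # 'Â£'

def price_to_float(raw):
    """Strip either form of the currency symbol and return a float."""
    return float(raw.replace("Â£", "").replace("£", ""))

# Made-up inputs covering the garbled, clean, and already-stripped cases.
sample_prices = ["Â£51.77", "£53.74", "50.10"]
clean = [price_to_float(p) for p in sample_prices]

print(garbled)
print(clean)
```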

### Extract Star Level Rating
Extract the star level ratings for the 20 books.

In [5]:
#create star ranking list for df
star_list = book_scrape.find_all("p", "star-rating")
stars = [i.attrs['class'][1] for i in star_list]
stars

['Three',
 'One',
 'One',
 'Four',
 'Five',
 'One',
 'Four',
 'Three',
 'Four',
 'One',
 'Two',
 'Four',
 'Five',
 'Five',
 'Five',
 'Three',
 'One',
 'One',
 'Two',
 'Two']
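The ratings come back as words because they are taken from the second CSS class of each `star-rating` paragraph. If numeric ratings are more convenient for sorting or averaging, a small lookup table is enough. This is a sketch only; `STAR_WORDS` and `stars_to_int` are hypothetical names, not part of the notebook.

```python
# Lookup from the class-name word to an integer rating.
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def stars_to_int(words):
    """Convert rating words such as 'Three' into integers."""
    return [STAR_WORDS[w] for w in words]

sample = ["Three", "One", "Four", "Five", "Two"]
numeric = stars_to_int(sample)
print(numeric)
```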

### Extract URLs
Extract URLs for JPEG thumbnail images showing the covers of the 20 books.

In [6]:
#create image url list for df
img_list = book_scrape.find_all("img")
imgs = [i.attrs['src'] for i in img_list]
imgs

['media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
 'media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
 'media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
 'media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
 'media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg',
 'media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg',
 'media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg',
 'media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg',
 'media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg',
 'media/cache/58/46/5846057e28022268153beff6d352b06c.jpg',
 'media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg',
 'media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg',
 'media/cache/5b/88/5b88c52633f53cacf162c15f4f823153.jpg',
 'media/cache/94/b1/94b1b8b244bce9677c2f29ccc890d4d2.jpg',
 'media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg',
 'media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg',
 'media/cache/55/33/553310a7162dfbc2c6d19a84da0df9e1.jpg
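The `src` values are relative paths, so they only resolve against the page they were scraped from. One way to store usable links is to join each path with the page URL using the standard library's `urljoin`. This is a sketch under that assumption; the helper name `absolute_urls` is hypothetical.

```python
from urllib.parse import urljoin

def absolute_urls(page_url, srcs):
    """Resolve relative image paths against the URL of the page they came from."""
    return [urljoin(page_url, s) for s in srcs]

# A path as it appears on the home page.
home = absolute_urls(
    "http://books.toscrape.com/",
    ["media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"],
)

# The same image as referenced from a catalogue page, one directory deeper.
catalogue = absolute_urls(
    "https://books.toscrape.com/catalogue/page-1.html",
    ["../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"],
)

print(home[0])
print(catalogue[0])
```

Both calls resolve to the same path under the site root, differing only in the scheme of the page URL that was passed in.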

## Add Data to DF
Create a dataframe with one row for each of the 20 books, using the book titles, prices, star ratings, and cover JPEG URLs as the four columns.

In [7]:
#create dictionary and headers for df
mydict = {'Book Title' : titles,
         'Price' : prices,
         'Star Rating' : stars,
         'cover JPEG URL' : imgs}

#initiate df with dictionary
books_df = pd.DataFrame(mydict)
books_df

|  | Book Title | Price | Star Rating | cover JPEG URL |
| --- | --- | --- | --- | --- |
| 0 | A Light in the ... | 51.77 | Three | media/cache/2c/da/2cdad67c44b002e7ead0cc35693c... |
| 1 | Tipping the Velvet | 53.74 | One | media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f... |
| 2 | Soumission | 50.1 | One | media/cache/3e/ef/3eef99c9d9adef34639f51066202... |
| 3 | Sharp Objects | 47.82 | Four | media/cache/32/51/3251cf3a3412f53f339e42cac213... |
| 4 | Sapiens: A Brief History ... | 54.23 | Five | media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c... |
| 5 | The Requiem Red | 22.65 | One | media/cache/68/33/68339b4c9bc034267e1da611ab3b... |
| 6 | The Dirty Little Secrets ... | 33.34 | Four | media/cache/92/27/92274a95b7c251fea59a2b8a7827... |
| 7 | The Coming Woman: A ... | 17.93 | Three | media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78... |
| 8 | The Boys in the ... | 22.6 | Four | media/cache/66/88/66883b91f6804b2323c8369331cb... |
| 9 | The Black Maria | 52.15 | One | media/cache/58/46/5846057e28022268153beff6d352... |


## Define Function to Apply Prior Code
Create a function that takes the URL of a webpage as input, applies the prior code, and returns the dataframe.

In [8]:
#define scraping function
def scraping(url):
    """Scrape one listing page of the fake book site and return its books as a dataframe"""
    
    #extract headers and data
    headers = {'user-agent': 'UVA DS6001 class example version 1.0 (dnr4qq@virginia.edu) (Language=Python 3.8.2; Platform=Windows 10 120.2212.2020.0)'}
    r = requests.get(url, headers = headers)
    book_scrape = BeautifulSoup(r.text, 'html')
    
    #create lists for data
    title_list = book_scrape.find_all("a", title = True)
    price_list = book_scrape.find_all("p", "price_color")
    star_list = book_scrape.find_all("p", "star-rating")
    img_list = book_scrape.find_all("img") 
    
    #create inputs for df
    titles = [t.string for t in title_list]    
    prices = [p.string for p in price_list]
    prices = [s.replace('Â£', '') for s in prices]
    stars = [i.attrs['class'][1] for i in star_list]
    imgs = [i.attrs['src'] for i in img_list]
    
    #create dictionary
    mydict = {'Book Title' : titles,
         'Price' : prices,
         'Star Rating' : stars,
         'cover JPEG URL' : imgs}
    
    #create and return df
    books_df = pd.DataFrame(data = mydict)
    return books_df

In [9]:
#run scraping function
scraping("http://books.toscrape.com/")

|  | Book Title | Price | Star Rating | cover JPEG URL |
| --- | --- | --- | --- | --- |
| 0 | A Light in the ... | 51.77 | Three | media/cache/2c/da/2cdad67c44b002e7ead0cc35693c... |
| 1 | Tipping the Velvet | 53.74 | One | media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f... |
| 2 | Soumission | 50.1 | One | media/cache/3e/ef/3eef99c9d9adef34639f51066202... |
| 3 | Sharp Objects | 47.82 | Four | media/cache/32/51/3251cf3a3412f53f339e42cac213... |
| 4 | Sapiens: A Brief History ... | 54.23 | Five | media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c... |
| 5 | The Requiem Red | 22.65 | One | media/cache/68/33/68339b4c9bc034267e1da611ab3b... |
| 6 | The Dirty Little Secrets ... | 33.34 | Four | media/cache/92/27/92274a95b7c251fea59a2b8a7827... |
| 7 | The Coming Woman: A ... | 17.93 | Three | media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78... |
| 8 | The Boys in the ... | 22.6 | Four | media/cache/66/88/66883b91f6804b2323c8369331cb... |
| 9 | The Black Maria | 52.15 | One | media/cache/58/46/5846057e28022268153beff6d352... |
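The function builds its dataframe from four lists gathered independently, and `pd.DataFrame` raises a `ValueError` when the lists in a dictionary differ in length. That could happen if a page contained an extra `<img>` or titled link. A small check before constructing the frame makes that failure easier to diagnose. This is a sketch, and `check_equal_lengths` is a hypothetical helper rather than part of the notebook.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in `columns`, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Made-up columns of matching length.
ok = check_equal_lengths({"Book Title": ["A", "B"], "Price": ["1.00", "2.00"]})
print(ok)
```

Calling the helper on `mydict` just before `pd.DataFrame(...)` would report which column is out of step.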


## Spider to Loop Over Pages
The http://books.toscrape.com/ site has 50 pages in total.

Use a for loop to scrape each of the 50 pages, combining the data from every page into a single dataframe.

In [10]:
#create base url for book site
base_url = 'https://books.toscrape.com/catalogue/page-'

#collect one dataframe per page
page_frames = []

#write for loop for page navigation
for b in range(1,51):
    moredata = scraping(base_url + str(b) + '.html')
    page_frames.append(moredata)

#combine the pages; DataFrame.append was removed in pandas 2.0
books_total_df = pd.concat(page_frames)
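The loop builds each page address from the base URL and a page number. The sketch below separates that step from the fetching and adds an optional pause between requests, which is a common courtesy when crawling a real site. The names `page_urls` and `visit` are hypothetical, and the `fetch` argument is a stand-in for a function such as `scraping`.

```python
import time

BASE_URL = "https://books.toscrape.com/catalogue/page-"

def page_urls(first=1, last=50):
    """Build the listing-page URLs for pages first..last inclusive."""
    return [f"{BASE_URL}{n}.html" for n in range(first, last + 1)]

def visit(urls, fetch, delay_seconds=0.0):
    """Call fetch on each URL in order, sleeping delay_seconds after each call."""
    results = []
    for u in urls:
        results.append(fetch(u))
        time.sleep(delay_seconds)
    return results

urls = page_urls()

# A stand-in fetch that only returns the file name, so no request is made here.
visited = visit(urls[:3], fetch=lambda u: u.rsplit("/", 1)[-1])
print(len(urls), visited)
```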

In [11]:
#call new dataframe
books_total_df

|  | Book Title | Price | Star Rating | cover JPEG URL |
| --- | --- | --- | --- | --- |
| 0 | A Light in the ... | 51.77 | Three | ../media/cache/2c/da/2cdad67c44b002e7ead0cc356... |
| 1 | Tipping the Velvet | 53.74 | One | ../media/cache/26/0c/260c6ae16bce31c8f8c95dadd... |
| 2 | Soumission | 50.10 | One | ../media/cache/3e/ef/3eef99c9d9adef34639f51066... |
| 3 | Sharp Objects | 47.82 | Four | ../media/cache/32/51/3251cf3a3412f53f339e42cac... |
| 4 | Sapiens: A Brief History ... | 54.23 | Five | ../media/cache/be/a5/bea5697f2534a2f86a3ef27b5... |
| ... | ... | ... | ... | ... |
| 15 | Alice in Wonderland (Alice's ... | 55.53 | One | ../media/cache/96/ee/96ee77d71a31b7694dac6855f... |
| 16 | Ajin: Demi-Human, Volume 1 ... | 57.06 | Four | ../media/cache/09/7c/097cb5ecc6fb3fbe1690cf0cb... |
| 17 | A Spy's Devotion (The ... | 16.97 | Five | ../media/cache/1b/5f/1b5ff86f3c75e51e24c573d3f... |
| 18 | 1st to Die (Women's ... | 53.98 | One | ../media/cache/2b/41/2b4161c5b72a4ae386b644682... |