# WEBSCRAPING LIBRARY WEBSITE PROJECT

The goal of the project is to scrape all available data from the website https://books.toscrape.com/ and organize it into a spreadsheet that can be used for data analysis. The website is a fake bookstore with 1000 books. For each book we can gather the following data:
- Title
- Rating
- Price
- Stock availability
- Genre

Let's start by importing the required libraries.

In [2]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

The website catalogue has 50 pages with 20 books each. We start by scraping the first page: our objective is a list of all the data we're interested in for the first 20 books.

With the following lines we retrieve the HTML code of the page, parse it and store the result in the variable "soup".

In [3]:
url = "https://books.toscrape.com/catalogue/page-1.html"

request = requests.get(url).text
soup = bs(request, "html.parser")

Next, we define a function that returns all the titles of the first 20 books.

In [4]:
# Create a list for all titles

def create_titles_list():
    # Each book sits in its own <article> tag; the full title is the "title" attribute of the link inside <h3>
    books = soup.find_all("article")
    titles = []
    for book in books:
        titles.append(book.h3.a.get("title"))
    return titles
    
titles = create_titles_list()
titles

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind',
 'The Requiem Red',
 'The Dirty Little Secrets of Getting Your Dream Job',
 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
 'The Black Maria',
 'Starving Hearts (Triangular Trade Trilogy, #1)',
 "Shakespeare's Sonnets",
 'Set Me Free',
 "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
 'Rip it Up and Start Again',
 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
 'Olio',
 'Mesaerion: The Best Science Fiction Stories 1800-1849',
 'Libertarianism for Beginners',
 "It's Only the Himalayas"]

Same thing with the prices...

In [5]:
# Create a list for all the prices

def create_prices_list():
    book_prices = soup.find_all(class_="price_color")
    prices = []
    for price in book_prices:
        # The pound sign arrives mis-decoded as "Â£", so we repair it
        prices.append(price.text.replace("Â£", "£"))
    return prices

prices = create_prices_list()    
prices

['£51.77',
 '£53.74',
 '£50.10',
 '£47.82',
 '£54.23',
 '£22.65',
 '£33.34',
 '£17.93',
 '£22.60',
 '£52.15',
 '£13.99',
 '£20.66',
 '£17.46',
 '£52.29',
 '£35.02',
 '£57.25',
 '£23.88',
 '£37.59',
 '£51.33',
 '£45.17']
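The stray `Â` that the `replace` call in `create_prices_list` strips out is what typically appears when UTF-8 bytes are decoded as Latin-1. The sketch below reproduces the effect with plain strings and shows one possible way to turn a scraped price into a number for later analysis. The helper name `price_to_float` is hypothetical and not part of the notebook.

```python
# Reproduce the mojibake: "£" is two bytes in UTF-8 (0xC2 0xA3);
# decoding those bytes as Latin-1 yields two characters, "Â" and "£".
garbled = "£51.77".encode("utf-8").decode("latin-1")
print(garbled)  # Â£51.77

# Reversing the wrong decoding restores the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # £51.77


def price_to_float(price_text):
    """Hypothetical helper: skip everything before the first digit, then parse."""
    digits_start = next(i for i, ch in enumerate(price_text) if ch.isdigit())
    return float(price_text[digits_start:])


print(price_to_float("£51.77"))   # 51.77
print(price_to_float("Â£53.74"))  # 53.74
```

Keeping prices as floats rather than strings makes it possible to sort, average and filter them once they are in the dataframe.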

... and with the ratings (measured on a scale from one to five).

In [6]:
# Create a list for the ratings

def create_ratings_list():
    # Each rating tag looks like <p class="star-rating Three">; the last class is the rating
    book_ratings = soup.find_all(class_="star-rating")
    ratings = []
    for tag in book_ratings:
        ratings.append(tag["class"][-1])
    return ratings

ratings = create_ratings_list()
ratings

['Three',
 'One',
 'One',
 'Four',
 'Five',
 'One',
 'Four',
 'Three',
 'Four',
 'One',
 'Two',
 'Four',
 'Five',
 'Five',
 'Five',
 'Three',
 'One',
 'One',
 'Two',
 'Two']
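Since the ratings come back as words, it may be convenient to convert them to integers before analysing them. The following is a minimal sketch; `RATING_VALUES` and `ratings_to_int` are hypothetical names introduced only for this illustration.

```python
# Map the rating words found on the page to integers from 1 to 5.
RATING_VALUES = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def ratings_to_int(ratings):
    """Convert a list of rating words into a list of integers."""
    return [RATING_VALUES[word] for word in ratings]


sample = ["Three", "One", "One", "Four", "Five"]
print(ratings_to_int(sample))  # [3, 1, 1, 4, 5]
```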

Getting the stock is a bit trickier, as it is not reported in the catalogue. To scrape it we need to send a request to each book's individual webpage and proceed from there.

In [25]:
# Stock availability is reported on each book's own webpage
# We start by scraping the stock data for the first book
# The url of the first book's webpage is:

url_stock = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

# We define a function that gets the stock from the page:

def scrap_stocks(url_stock):
    request_stock = requests.get(url_stock).text
    soup_stock = bs(request_stock, "html.parser")
    # Isolate the table cell that holds the stock data
    availab = soup_stock.find("td", string=lambda text: text is not None and "stock" in text)
    # Convert the tag into a string and split it into words
    listavailab = str(availab).split(" ")
    # The number of available copies is the word that contains "("
    stocks = []
    for word in listavailab:
        if "(" in word:
            stocks.append(int(word.replace("(", "")))
    return stocks      # single-element list with the available copies for that book

scrap_stocks(url_stock)


[22]
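The word-splitting step in `scrap_stocks` could also be written with a regular expression. The sketch below assumes the availability text has the form `In stock (22 available)`, which is what the `(` handling in the function implies. The name `parse_stock` and the fallback value of 0 are assumptions made for this example.

```python
import re


def parse_stock(availability_text):
    """Return the number that follows '(' in the text, or 0 if there is none."""
    match = re.search(r"\((\d+)", availability_text)
    return int(match.group(1)) if match else 0


print(parse_stock("In stock (22 available)"))  # 22
print(parse_stock("Out of stock"))             # 0
```

Returning a plain integer instead of a single-element list is a design choice of this sketch; the notebook's functions rely on lists so that the results can be combined with `extend`.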

We have scraped the stock of the first book; 19 more to go...

In [8]:
# We need to replicate the process for each book
# To automate it, we define a function that scrapes the urls of all the books' pages from the catalogue page:

def get_page_urls(soup):
    # Only the title links carry both an href and a title attribute
    hrefs = soup.find_all("a", href=True, title=True)
    urls_page = []
    for link in hrefs:
        # The hrefs in the catalogue are relative, so we prepend the catalogue address
        urls_page.append("https://books.toscrape.com/catalogue/" + link["href"])
    return urls_page

# Now we combine the two functions to retrieve the stock from every url

def get_page_stocks():
    urls = get_page_urls(soup)
    stocks_in_page = []
    for i in range(len(urls)):
        url_stock = urls[i]
        stocks_in_page.extend(scrap_stocks(url_stock))
    return stocks_in_page

stocks = get_page_stocks()
stocks

[22,
 20,
 20,
 20,
 20,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19]
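`get_page_urls` builds each book's address by putting the catalogue address in front of a relative link. The standard library offers `urllib.parse.urljoin` for the same job. The sketch below shows it on the first book, assuming the relative `href` has the form used in the example; `absolute_urls` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolute_urls(page_url, hrefs):
    """Resolve each relative href against the address of the page it came from."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://books.toscrape.com/catalogue/page-1.html"
hrefs = ["a-light-in-the-attic_1000/index.html"]  # assumed form of a relative link

print(absolute_urls(page_url, hrefs))
# ['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html']
```

Resolving links against the page address avoids hard-coding the catalogue prefix, which can help if links on other pages are relative to a different folder.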

As for the genre, we run into the same problem: we have to scrape it from each book's page. The good news is that, since we already have the list of urls, half of the job is done.

In [26]:
# Now for the genres.
# The genre of each book is in the book's webpage.
# Start from the first book

url_genre = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"


def scrap_genre(url_genre):
    request_genre = requests.get(url_genre).text
    soup_genre = bs(request_genre, "html.parser")
    # Isolate the genre data
    ahrefs = soup_genre.find_all("a", href = True)  # First collect all the links...
    g = ahrefs[3]        # ... the fourth link on the page holds the genre
    g = g.string   # extract its text as a navigable string
    genre = []
    genre.append(g)
    return genre       # single-element list, as with the stock

scrap_genre(url_genre)




['Poetry']

In [10]:
# Now we need to iterate the process for each book.
# Thankfully we can recycle the get_page_urls function

def get_page_genres():
    urls = get_page_urls(soup)
    genres_in_page = []
    for i in range(len(urls)):
        url_genre = urls[i]
        genres_in_page.extend(scrap_genre(url_genre))
    return genres_in_page

genres = get_page_genres()
genres

['Poetry',
 'Historical Fiction',
 'Fiction',
 'Mystery',
 'History',
 'Young Adult',
 'Business',
 'Default',
 'Default',
 'Poetry',
 'Default',
 'Poetry',
 'Young Adult',
 'Sequential Art',
 'Music',
 'Music',
 'Poetry',
 'Science Fiction',
 'Politics',
 'Travel']
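`get_page_stocks` and `get_page_genres` each download every book page, so each page is requested twice. One possible refinement is to parse both values from a single download. The sketch below works on an HTML string so that it runs without network access. `SAMPLE_HTML` is a simplified stand-in and the real markup is only assumed to look similar; `parse_book_page` is a hypothetical name.

```python
import re

from bs4 import BeautifulSoup

# Simplified stand-in for a book page (assumption, not the real markup).
SAMPLE_HTML = """
<ul class="breadcrumb">
  <li><a href="../../index.html">Home</a></li>
  <li><a href="../category/books_1/index.html">Books</a></li>
  <li><a href="../category/books/poetry_23/index.html">Poetry</a></li>
  <li class="active">A Light in the Attic</li>
</ul>
<table>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
</table>
"""


def parse_book_page(html):
    """Return (stock, genre) extracted from the HTML of one book page."""
    page = BeautifulSoup(html, "html.parser")
    # The genre is taken as the last link in the breadcrumb trail.
    links = page.find("ul", class_="breadcrumb").find_all("a")
    genre = links[-1].get_text(strip=True)
    # The stock is the number in parentheses in the availability cell.
    cell = page.find("td", string=re.compile("stock"))
    match = re.search(r"\((\d+)", cell.get_text()) if cell else None
    stock = int(match.group(1)) if match else 0
    return stock, genre


print(parse_book_page(SAMPLE_HTML))  # (22, 'Poetry')
```

Looking the genre up through the breadcrumb rather than through a fixed position among all links is a choice of this sketch, made so that extra links elsewhere on the page would not shift the result.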

We have successfully gathered lists of the titles, prices, ratings, stock and genres of the first 20 books! We can create a dataframe to visualize our results:

In [27]:
# We now have a list for the titles, the prices, the ratings, the stock and the genres.
# Let's create a dataframe to summarize our data

data = {"Title": titles, "Price": prices, "Rating": ratings, "In stock": stocks, "Genre": genres}
df = pd.DataFrame(data)
df

,Title,Price,Rating,In stock,Genre
0,A Light in the Attic,£51.77,Three,22,Poetry
1,Tipping the Velvet,£53.74,One,20,Historical Fiction
2,Soumission,£50.10,One,20,Fiction
3,Sharp Objects,£47.82,Four,20,Mystery
4,Sapiens: A Brief History of Humankind,£54.23,Five,20,History
5,The Requiem Red,£22.65,One,19,Young Adult
6,The Dirty Little Secrets of Getting Your Dream...,£33.34,Four,19,Business
7,The Coming Woman: A Novel Based on the Life of...,£17.93,Three,19,Default
8,The Boys in the Boat: Nine Americans and Their...,£22.60,Four,19,Default
9,The Black Maria,£52.15,One,19,Poetry


Now that our scraping functions are up and running, all we need to do is create a loop that scrapes all the pages in the catalogue (it will take some time).

In [12]:
# We have successfully summarized all the data in the first page
# However, there are 49 more
# We can aggregate the data from all pages by using a for loop

# WARNING: the following loop will take a few minutes to complete 

alltitles = []
allprices = []
allratings = []
allstocks = []
allgenres = []

for i in range(1, 51):
    url = "https://books.toscrape.com/catalogue/page-" + str(i) + ".html"
    request = requests.get(url).text
    soup = bs(request, "html.parser")
    

    alltitles.extend(create_titles_list())    
    allprices.extend(create_prices_list())   
    allratings.extend(create_ratings_list())
    allstocks.extend(get_page_stocks())
    allgenres.extend(get_page_genres())
    
    print("Page " + str(i) + " completed")

Page 1 completed
Page 2 completed
Page 3 completed
Page 4 completed
Page 5 completed
Page 6 completed
Page 7 completed
Page 8 completed
Page 9 completed
Page 10 completed
Page 11 completed
Page 12 completed
Page 13 completed
Page 14 completed
Page 15 completed
Page 16 completed
Page 17 completed
Page 18 completed
Page 19 completed
Page 20 completed
Page 21 completed
Page 22 completed
Page 23 completed
Page 24 completed
Page 25 completed
Page 26 completed
Page 27 completed
Page 28 completed
Page 29 completed
Page 30 completed
Page 31 completed
Page 32 completed
Page 33 completed
Page 34 completed
Page 35 completed
Page 36 completed
Page 37 completed
Page 38 completed
Page 39 completed
Page 40 completed
Page 41 completed
Page 42 completed
Page 43 completed
Page 44 completed
Page 45 completed
Page 46 completed
Page 47 completed
Page 48 completed
Page 49 completed
Page 50 completed
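The loop above works because `soup` is reassigned as a global variable on every pass and the five lists are filled in the same order. A variant that passes each url explicitly and collects one record per book might look like the sketch below. `catalogue_page_urls`, `scrape_catalogue` and `fake_scrape_page` are hypothetical names, and the fake scraper only stands in for the real functions so that the example runs without network access.

```python
def catalogue_page_urls(first=1, last=50):
    """Build the catalogue addresses for pages first..last (inclusive)."""
    base = "https://books.toscrape.com/catalogue/page-{}.html"
    return [base.format(number) for number in range(first, last + 1)]


def scrape_catalogue(urls, scrape_page):
    """Call scrape_page(url) for every url and collect the returned rows."""
    rows = []
    for number, url in enumerate(urls, start=1):
        rows.extend(scrape_page(url))
        print(f"Page {number} completed")
    return rows


def fake_scrape_page(url):
    """Stand-in for a real page scraper: returns one made-up record per page."""
    return [{"Title": "Example book", "Source": url}]


rows = scrape_catalogue(catalogue_page_urls(1, 3), fake_scrape_page)
print(len(rows))  # 3
```

A list of dictionaries can be passed straight to `pd.DataFrame`, and keeping each book's fields together in one record means the columns cannot fall out of step if one value is missing.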


We now have all the data that we need. Time to summarize it in our final dataframe.

In [13]:
# The final dataframe!

Books2Scrap = {"Title": alltitles, "Price": allprices, "Rating": allratings, "Stocks": allstocks, "Genre": allgenres}

df = pd.DataFrame(Books2Scrap)
df

,Title,Price,Rating,Stocks,Genre
0,A Light in the Attic,£51.77,Three,22,Poetry
1,Tipping the Velvet,£53.74,One,20,Historical Fiction
2,Soumission,£50.10,One,20,Fiction
3,Sharp Objects,£47.82,Four,20,Mystery
4,Sapiens: A Brief History of Humankind,£54.23,Five,20,History
...,...,...,...,...,...
995,Alice in Wonderland (Alice's Adventures in Won...,£55.53,One,1,Classics
996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",£57.06,Four,1,Sequential Art
997,A Spy's Devotion (The Regency Spies of London #1),£16.97,Five,1,Historical Fiction
998,1st to Die (Women's Murder Club #1),£53.98,One,1,Mystery


In [14]:
# First index is 0, let's change it to 1

I = list(range(1, 1001))

Books2Scrap = {"Index": I, "Title": alltitles, "Price": allprices, "Rating": allratings, "Stocks": allstocks, "Genre": allgenres}

df = pd.DataFrame(Books2Scrap)
df.set_index("Index", inplace = True)

df


Index,Title,Price,Rating,Stocks,Genre
1,A Light in the Attic,£51.77,Three,22,Poetry
2,Tipping the Velvet,£53.74,One,20,Historical Fiction
3,Soumission,£50.10,One,20,Fiction
4,Sharp Objects,£47.82,Four,20,Mystery
5,Sapiens: A Brief History of Humankind,£54.23,Five,20,History
...,...,...,...,...,...
996,Alice in Wonderland (Alice's Adventures in Won...,£55.53,One,1,Classics
997,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",£57.06,Four,1,Sequential Art
998,A Spy's Devotion (The Regency Spies of London #1),£16.97,Five,1,Historical Fiction
999,1st to Die (Women's Murder Club #1),£53.98,One,1,Mystery


Only one thing left to do: export the dataframe into a format that can be used in Excel. 

In [23]:
df.to_csv("/Users/Luca/Desktop/Books2Scrape.csv")    
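The path above is specific to one machine. As a hedged alternative, the sketch below writes a small sample dataframe to a temporary folder and reads it back. It also passes `encoding="utf-8-sig"`, which adds a byte-order mark that usually helps Excel display characters such as `£` correctly. The sample data and variable names are made up for the illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny made-up sample standing in for the full dataframe.
sample = pd.DataFrame({"Title": ["A Light in the Attic"], "Price": ["£51.77"]})

with tempfile.TemporaryDirectory() as folder:
    out_path = Path(folder) / "Books2Scrape.csv"
    # "utf-8-sig" prepends a byte-order mark to the file.
    sample.to_csv(out_path, encoding="utf-8-sig")
    has_bom = out_path.read_bytes().startswith(b"\xef\xbb\xbf")
    restored = pd.read_csv(out_path, index_col=0, encoding="utf-8-sig")

print(has_bom)                   # True
print(restored.loc[0, "Price"])  # £51.77
```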

Job's done! We now have a fully fledged spreadsheet with all the data we scraped from the website!

Check out the Excel file in the repository to see the final result ;)