## Applied Analytics -- Python Workshop Challenge

*The challenge: efficiently scrape a fictional book store's (http://books.toscrape.com/) Travel, Poetry, Art, Humor and Academic books and retrieve Book Title, Product Description, Price (excl. tax), Number of Reviews.
Store all of the data in Python dictionaries or lists.*

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

I take advantage of the fact that the URL is structured predictably to loop over the different genres. For every genre, I retrieve the results page and collect the URL of every book. Then I loop over the list of book URLs to access each book's individual page and retrieve its title, product description, price excluding tax, and number of reviews.

In [2]:
title = [] #empty results vector
descr = []
price = []
reviews = []
book_genre = []

for genre in ["travel_2", "poetry_23", "art_25", "humor_30", "academic_40"]: #loop over genres
    url = "http://books.toscrape.com/catalogue/category/books/" + genre + "/index.html" #general url
    res = requests.get(url) #request data
    soup = BeautifulSoup(res.text, "lxml") 
    book_titles = soup.select(".product_pod a")  #select all book titles from the genre's page
    book_links = [] #empty book link list
    
    for i in range(0, len(book_titles)): #for every book of a certain genre
        if i % 2 != 0: #only odd indices, to avoid duplicates
            #builds a working link by dropping the ../../../ prefix
            link = str("http://books.toscrape.com/catalogue/" + book_titles[i]["href"]).replace("../../../", "") 
            book_links.append(link)
            
            book_genre.append(genre) #keeps track of book genre
        
    for w in range(0, len(book_links)): 
        url = book_links[w] #accesses every book in book_link's product page
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "lxml")
        new_title = soup.select("h1") #selects title
        description = soup.select("#product_description + p") #product description
        price_notax = soup.select("tr:nth-of-type(3) td") #price excl tax
        n_reviews = soup.select("tr:nth-of-type(7) td") #number of reviews
    
        #then adds the above to my result lists
        title.append(new_title[0]) #index [0] so I append a single element instead of a list
        descr.append(description[0])
        price.append(price_notax[0])
        reviews.append(n_reviews[0])

Note: during a previous run, every book appeared twice in the list of links. This code therefore only keeps every other link.
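
One way to avoid relying on the odd/even position of the links is to de-duplicate the `href` values while preserving their order, and to let `urllib.parse.urljoin` resolve the `../../../` prefix. The sketch below is only an illustration, not the code that produced the results in this notebook; the `hrefs` list and the book slugs in it are made-up stand-ins for what `soup.select(".product_pod a")` returns.

```python
from urllib.parse import urljoin

# Page the relative links were found on (the travel category page from the loop above).
page_url = "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"

# Hypothetical href values: each book shows up twice in a row.
hrefs = [
    "../../../example-book-one_1/index.html",
    "../../../example-book-one_1/index.html",
    "../../../example-book-two_2/index.html",
    "../../../example-book-two_2/index.html",
]

# dict.fromkeys keeps the first occurrence of each href and preserves order.
unique_hrefs = list(dict.fromkeys(hrefs))

# urljoin resolves the leading "../" segments relative to the page URL.
book_links = [urljoin(page_url, href) for href in unique_hrefs]

print(book_links)
```

Unlike the index-parity check, this does not assume that duplicates always come in adjacent pairs.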

Taking a look at the results:

In [3]:
title[:6]

[<h1>It's Only the Himalayas</h1>,
 <h1>Full Moon over Noahâs Ark: An Odyssey to Mount Ararat and Beyond</h1>,
 <h1>See America: A Celebration of Our National Parks &amp; Treasured Sites</h1>,
 <h1>Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel</h1>,
 <h1>Under the Tuscan Sun</h1>,
 <h1>A Summer In Europe</h1>]

In [4]:
descr[0]

<p>âWherever you go, whatever you do, just . . . donât do anything stupid.â âMy MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as "stupid." She swam with great white sharks in South Africa, ran from lions in Zimbabwe, climbed a Himalayan mountain without training in Nepal, and wa âWherever you go, whatever you do, just . . . donât do anything stupid.â âMy MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as "stupid." She swam with great white sharks in South Africa, ran from lions in Zimbabwe, climbed a Himalayan mountain without training in Nepal, and watched as her friend was attacked by a monkey in Indonesia.But interspersed in those slightly more crazy moments, Sue Bedfored and her friend "Sara the Stoic" experienced the sights, sounds, life, and culture of fifteen countries. 

In [5]:
price[:3]

[<td>Â£45.17</td>, <td>Â£49.43</td>, <td>Â£48.87</td>]

In [6]:
reviews[:3]

[<td>0</td>, <td>0</td>, <td>0</td>]

There are a few formatting issues: quite a bit of HTML remains. I can easily remove it using string operations. I'll also use this as an opportunity to convert the prices to `float` and the review counts to `int`. Moreover, some characters aren't decoded properly in the descriptions, but I'll leave that for another day.

Note: this could probably have been done in the previous loop as the scraping results were being appended, but I find this clearer.
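
As for the badly decoded characters, sequences such as `Â£` and `â\x80\x99` are consistent with UTF-8 bytes having been read as Latin-1 (ISO-8859-1). Assuming that is what happened here, the round trip below is one possible repair. `fix_mojibake` is a hypothetical helper name, and the sample string is built by hand rather than scraped.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1 (ISO-8859-1)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way; leave it untouched.
        return text


# Build a garbled sample the same way the garbling is assumed to have happened.
original = "Noah\u2019s Ark costs \u00a345.17"
garbled = original.encode("utf-8").decode("latin-1")

repaired = fix_mojibake(garbled)
print(repaired == original)
```

An alternative that avoids the repair step altogether would be to set `res.encoding = "utf-8"` before reading `res.text`, again assuming the pages really are served as UTF-8.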

In [7]:
title = [str(i).replace("<h1>", "").replace("</h1>", "").replace("â\x80\x99", "'").replace("&amp;", "&") for i in title]
descr = [str(i).replace("<p>", "").replace("</p>", "") for i in descr]
price = [float(str(i).replace("<td>Â£", "").replace("</td>", "")) for i in price] #converts it to float
reviews = [int(str(i).replace("<td>", "").replace("</td>", "")) for i in reviews] #converts it to int
book_genre = [str(i).split("_")[0] for i in book_genre]
#a bit messy -- could've used dictionary to replace strings and create more compact code
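
The dictionary idea mentioned in the last comment could look roughly like the sketch below. `TITLE_REPLACEMENTS` and `replace_all` are hypothetical names introduced for illustration; the table simply mirrors the chained `.replace()` calls used for the titles.

```python
# Replacement table: substring to find -> what to put in its place.
TITLE_REPLACEMENTS = {
    "<h1>": "",
    "</h1>": "",
    "\xe2\x80\x99": "'",  # garbled apostrophe, same sequence as in the cell above
    "&amp;": "&",
}


def replace_all(text, replacements):
    """Apply every substring replacement in the table, in insertion order."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


raw_titles = [
    "<h1>See America: A Celebration of Our National Parks &amp; Treasured Sites</h1>",
    "<h1>Under the Tuscan Sun</h1>",
]
clean_titles = [replace_all(t, TITLE_REPLACEMENTS) for t in raw_titles]
print(clean_titles)
```

A separate table per column (titles, descriptions, prices) would keep each list comprehension down to a single function call.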

Quick check:

In [8]:
print(title[32], book_genre[32], price[32], reviews[32])

Art and Fear: Observations on the Perils (and Rewards) of Artmaking art 48.63 0


It all looks good, so I'll put it in a DataFrame:

In [9]:
book_scrape = pd.DataFrame({"title" : title})
book_scrape["genre"] = book_genre
book_scrape["price"] = price
book_scrape["reviews"] = reviews
book_scrape["descr"] = descr #python was being finicky -- code is a bit clunky but works
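
Since the challenge asks for the data in dictionaries or lists, the parallel result lists can also be combined into one list of dictionaries, one per book. This is a minimal sketch using short stand-in lists (the description strings are placeholders); `fields` and `records` are names chosen for illustration.

```python
# Tiny stand-in lists; in the notebook these would be the full result lists.
title = ["It's Only the Himalayas", "Under the Tuscan Sun"]
book_genre = ["travel", "travel"]
price = [45.17, 37.33]
reviews = [0, 0]
descr = ["(description text)", "(description text)"]

fields = ["title", "genre", "price", "reviews", "descr"]
columns = [title, book_genre, price, reviews, descr]

# zip silently truncates to the shortest list, so check the lengths first.
assert len({len(col) for col in columns}) == 1, "result lists differ in length"

# One dictionary per book, pairing each field name with that book's value.
records = [dict(zip(fields, row)) for row in zip(*columns)]
print(records[0])
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`, which would build the same table in a single call.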

In [10]:
book_scrape.head()

| | title | genre | price | reviews | descr |
|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | travel | 45.17 | 0 | âWherever you go, whatever you do, just . . ... |
| 1 | Full Moon over Noah's Ark: An Odyssey to Mount... | travel | 49.43 | 0 | Acclaimed travel writer Rick Antonson sets his... |
| 2 | See America: A Celebration of Our National Par... | travel | 48.87 | 0 | To coincide with the 2016 centennial anniversa... |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | travel | 36.94 | 0 | With a new foreword by Tim Ferriss â¢Thereâ... |
| 4 | Under the Tuscan Sun | travel | 37.33 | 0 | A CLASSIC FROM THE BESTSELLING AUTHOR OF UNDER... |


In [11]:
book_scrape.tail()

| | title | genre | price | reviews | descr |
|---|---|---|---|---|---|
| 44 | When You Are Engulfed in Flames | humor | 30.89 | 0 | It's early autumn 1964. Two straight-A student... |
| 45 | Naked | humor | 31.69 | 0 | Welcome to the hilarious, strange, elegiac, ou... |
| 46 | Lamb: The Gospel According to Biff, Christ's C... | humor | 55.5 | 0 | The birth of Jesus has been well chronicled, a... |
| 47 | Holidays on Ice | humor | 51.07 | 0 | A new holiday classic--six of the most profoun... |
| 48 | Logan Kade (Fallen Crest High #5.5) | academic | 13.12 | 0 | People think that just because they know my na... |
