# ISYS613 - Data Sourcing and Quality
# Assignment 4
## Web Scraping

## Question 1 - Book Data Scraper

You have been asked to collect price data about books from the Books to Scrape website.
Specifically, you are to scrape and capture the following book-related information: book category,
book title, star rating and price. Once captured, you are to output your results to a CSV file.
```
Top-Level URL: http://books.toscrape.com/
```

### Requirements
1. Examine the HTML returned from the Books to Scrape top-level URL.
 Your objective is to identify and extract book category and the book
 category URL information from this page.
2. For each book category URL, follow the URL to the book category
 page. You may restrict your data scraping to the first page of books
 returned for the category URL.
3. For each of the books on a category page, capture the book title,
 star rating and price.
4. Convert the ordinal star rating data to a numeric scale. For
 example, the string 'star-rating One' would be converted to the integer number
 1, 'star-rating Two' would be converted to 2, and so on.
5. For each book, output the book category, title, numeric star rating and price.
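
As an illustration of requirements 4 and 5, the following is a minimal sketch that uses only the standard library. The helper names `rating_to_int` and `books_to_csv` and the sample rows are hypothetical and are not part of the assignment; real values would come from the scraped category pages.

```python
import csv
import io

# Map the word that follows "star-rating" in the class attribute to an integer.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(class_value):
    """Convert a class string such as 'star-rating Three' to the integer 3.

    Returns None when no known rating word is present, so a malformed
    entry stays visible instead of being silently dropped.
    """
    for word in class_value.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


def books_to_csv(books, out):
    """Write (category, title, class_value, price) tuples as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["category", "title", "rating", "price"])
    for category, title, class_value, price in books:
        writer.writerow([category, title, rating_to_int(class_value), price])


# Hypothetical sample rows, used only to exercise the helpers offline.
sample = [
    ("Poetry", "Example Title A", "star-rating Three", "10.00"),
    ("Travel", "Example Title B", "star-rating One", "12.50"),
]
buffer = io.StringIO()
books_to_csv(sample, buffer)
print(buffer.getvalue())
```

Writing to a file instead of a buffer would only require passing a handle opened with `open(path, "w", newline="")` as `out`.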

### Challenge (Optional)
The challenge objective is to display all books (formatted as above) from all categories, not
just the first page of books from each category. To see how this works, go to the
top-level URL and observe how to manually navigate from the first page of a category to the
next, and so on, until you have followed the Next link through all pages for a category.

Notice that book data from a category URL is returned 20 books at a time. If there are more than
20 books in a category, a Next link appears at the bottom of the HTML page.
When the list of books in a category has been exhausted, i.e., when the last
category page has been reached, the Next link no longer appears on the page.

Hint: That's how you will know your job is complete.

Copy your previous solution to a new code-cell. Modify your copied solution to follow
the Next page links located at the bottom of a category page. 
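
The Next-link loop described above can be sketched as follows. This is one possible shape, not a required design: `scrape_category`, `fetch` and `parse_page` are hypothetical names, and fetching and parsing are passed in as callables so the loop can be tried offline. In a real solution, `parse_page` might locate the link with BeautifulSoup, for example via `soup.select_one("li.next a")`, but that selector is an assumption about the page markup and should be checked against the actual HTML.

```python
from urllib.parse import urljoin


def scrape_category(first_url, fetch, parse_page):
    """Collect rows from every page of one category.

    fetch(url) returns the page HTML; parse_page(html) returns a tuple
    (rows, next_href), where next_href is None on the last page.
    Both callables are hypothetical and supplied by the caller.
    """
    rows = []
    seen = set()  # guards against a Next link that loops back
    url = first_url
    while url is not None and url not in seen:
        seen.add(url)
        page_rows, next_href = parse_page(fetch(url))
        rows.extend(page_rows)
        # The Next href is relative to the current page, so resolve it.
        url = urljoin(url, next_href) if next_href else None
    return rows


# Offline stand-in for a two-page category (hypothetical URLs and content).
BASE = "http://example.test/catalogue/category/books/demo_1/"
FAKE_SITE = {
    BASE + "index.html": (["book 1", "book 2"], "page-2.html"),
    BASE + "page-2.html": (["book 3"], None),
}

all_rows = scrape_category(
    BASE + "index.html",
    fetch=lambda url: url,  # pretend the "HTML" is just the URL
    parse_page=lambda html: FAKE_SITE[html],
)
print(all_rows)  # ['book 1', 'book 2', 'book 3']
```

The loop ends as soon as a page yields no Next href, which matches the hint above.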

In [2]:
# TEST DATA
#URL = 'http://books.toscrape.com/'

from requests.exceptions import HTTPError
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

EW_URL = 'http://books.toscrape.com/'

def simple_get(url, *args, **kwargs):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    Returns the `requests.Response` object on success. HTTP errors and
    other request failures are printed and then re-raised.
    """
    try:
        resp = requests.get(url, *args, **kwargs)
        # If the response was successful, no Exception will be raised
        resp.raise_for_status()

    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
        raise http_err
    except Exception as err:
        print(f'Other error occurred: {err}')
        raise err

    return resp


def who_actors(url):
    resp = simple_get(url, timeout=5)
    html = resp.text
    
    soup = BeautifulSoup(html, 'html.parser')
    #print("here is testing")
    #print(soup.findAll("article", class_ = "product_pod"))
    
    url_main_page = [EW_URL+x.div.a.get('href') for x in soup.findAll("article", class_ = "product_pod")]

    title_name = []
    prices = []
    category = []
    word_rating = []

    for book_url in url_main_page:
        # Reuse simple_get so every request has a timeout and HTTP errors are raised
        result = simple_get(book_url, timeout=5)
        soup = BeautifulSoup(result.text, 'html.parser')
        
        title_name.append(soup.find("div", class_ = re.compile("product_main")).h1.text)
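        # Price of book: keep only digits and the decimal point, so the result
        # does not depend on how the currency symbol was decoded
        prices.append(re.sub(r"[^\d.]", "", soup.find("p", class_ = "price_color").text))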
        
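        # Category of title: the breadcrumb href is expected to look like
        # ../category/books/<slug>/index.html, so the slug is the fourth path segment.
        # The dots are escaped so they match literally rather than as regex wildcards.
        category.append(soup.find("a", href = re.compile(r"\.\./category/books/")).get("href").split("/")[3])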
        
        # ratings
        word_rating.append(soup.find("p", class_ = re.compile("star-rating")).get("class")[1])
        
    # Convert the rating word to an integer (requirement 4). An unrecognised
    # word becomes None so that all lists stay the same length.
    rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    numberial_rating = [rating_map.get(t) for t in word_rating]
            
    output = pd.DataFrame({
        'title_name': title_name,
        'price': prices,
        'product_category': category,
        'rating': numberial_rating,
    })

    print(output)
    output.to_csv('bookscrape.csv')

def main():
    who_actors(EW_URL)
    
if __name__ == "__main__":
    main()

                                           title_name  price  \
0                                A Light in the Attic  51.77   
1                                  Tipping the Velvet  53.74   
2                                          Soumission  50.10   
3                                       Sharp Objects  47.82   
4               Sapiens: A Brief History of Humankind  54.23   
5                                     The Requiem Red  22.65   
6   The Dirty Little Secrets of Getting Your Dream...  33.34   
7   The Coming Woman: A Novel Based on the Life of...  17.93   
8   The Boys in the Boat: Nine Americans and Their...  22.60   
9                                     The Black Maria  52.15   
10     Starving Hearts (Triangular Trade Trilogy, #1)  13.99   
11                              Shakespeare's Sonnets  20.66   
12                                        Set Me Free  17.46   
13  Scott Pilgrim's Precious Little Life (Scott Pi...  52.29   
14                          Rip it Up an