# ISYS613 - Data Sourcing and Quality
# Assignment 4
## Web Scraping

## Question 1 - Book Data Scraper

You have been asked to collect some price data about books from the Books to Scrape website.
Specifically, you are to scrape and capture the following book-related information: book category,
book title, star rating and price. Once captured, you are to output your results to a CSV file.
```
Top-Level URL: http://books.toscrape.com/
```
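
The CSV output step could look something like the following. This is a minimal sketch, not a required design: the helper name `write_books_csv` and the column order in `FIELDNAMES` are assumptions made for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Assumed column order; the assignment only names the four fields.
FIELDNAMES = ['category', 'title', 'star_rating', 'price']

def write_books_csv(rows, fileobj):
    """Write a header row followed by one CSV row per book dictionary."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# One sample row in the shape the scraper is expected to produce.
rows = [
    {'category': 'Travel', 'title': "It's Only the Himalayas",
     'star_rating': 2, 'price': '45.17'},
]

buf = io.StringIO()          # stand-in for open('books.csv', 'w', newline='')
write_books_csv(rows, buf)
csv_text = buf.getvalue()
print(csv_text)
```

When writing to a real file, the `csv` module documentation recommends opening it with `newline=''` so that line endings are not translated twice.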

### Requirements
1. Examine the HTML returned from the Books to Scrape top-level URL.
 Your objective is to identify and extract book category and the book
 category URL information from this page.
2. For each book category URL, follow the URL to the book category
 page. You may restrict your data scraping to the first page of books
 returned for the category URL.
3. For each of the books on a category page, capture the book title,
 star rating and price.
4. Convert the ordinal star rating data to a numeric scale. For
 example, the string 'star-rating One' would be converted to the integer number
 1, 'star-rating Two' would be converted to 2, and so on.
5. For each book, output the book category, title, numeric star rating and price.
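
The conversion in requirement 4 can be sketched with a simple lookup table. The names `STAR_WORDS` and `star_rating_to_int` are hypothetical, and the function assumes the class attribute arrives as a single string such as 'star-rating Three'; a parser that returns the classes as a list would need them joined with a space first.

```python
# Ordinal word -> numeric rating
STAR_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def star_rating_to_int(class_attr):
    """Convert a class string such as 'star-rating Three' to the integer 3."""
    words = class_attr.split()
    if len(words) != 2 or words[0] != 'star-rating' or words[1] not in STAR_WORDS:
        raise ValueError(f'unexpected star rating: {class_attr!r}')
    return STAR_WORDS[words[1]]

print(star_rating_to_int('star-rating One'))   # 1
print(star_rating_to_int('star-rating Two'))   # 2
```

Raising `ValueError` on anything unexpected is one possible choice; it makes a change in the site's markup visible instead of silently recording a wrong rating.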

### Challenge (Optional)
The challenge objective is to display all books (formatted as above) from all categories - not
 just the first page of books from each category.  To see how this works, go to the
 top-level URL and observe how to
 manually navigate from the first page of a category to the next, and so on, until you have
 followed the Next link through every page of the category.

Notice that book data from a category URL is returned 20 books at a time. If there are more than
20 books in a category, the Next link appears at the bottom of the HTML page.
When the list of books in a category has been exhausted, i.e., when the last
category page has been reached, the Next link will no longer appear on the page.

Hint: That's how you will know your job is complete.

Copy your previous solution to a new code-cell. Modify your copied solution to follow
the Next page links located at the bottom of a category page.
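
Following the Next links can be written as a loop as well as by recursion. The sketch below shows only the traversal logic, with no network access: `iter_page_urls`, `next_href_for` and `fake_site` are hypothetical names, and the three-page layout in `fake_site` is invented for illustration. In a real scraper, `next_href_for` would fetch the page and return the `href` of the anchor inside `<li class="next">`, or `None` when that element is absent.

```python
from urllib.parse import urljoin

def iter_page_urls(start_url, next_href_for):
    """Yield start_url, then each page reached by following Next links.

    next_href_for(url) must return the (possibly relative) href of the
    Next link on that page, or None when the page has no Next link.
    """
    url = start_url
    while url is not None:
        yield url
        href = next_href_for(url)
        # Next links are relative (e.g. 'page-2.html'), so resolve them
        # against the URL of the page they were found on.
        url = urljoin(url, href) if href is not None else None

# Invented stand-in for the website: page URL -> href of its Next link.
base = 'http://books.toscrape.com/catalogue/category/books/travel_2/'
fake_site = {
    base + 'index.html': 'page-2.html',
    base + 'page-2.html': 'page-3.html',
    base + 'page-3.html': None,      # last page: no Next link
}

pages = list(iter_page_urls(base + 'index.html', fake_site.get))
for p in pages:
    print(p)
```

A loop visits the pages in order and avoids deep call stacks for long categories; the recursive approach is equally valid for a site of this size.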

In [17]:
from requests.exceptions import HTTPError
import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint
from urllib.parse import urljoin

def simple_get(url, params=None):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    Returns the Response object if the request succeeds and the
    content-type is some kind of HTML; otherwise an exception is raised.
    """
    try:
        resp = requests.get(url, timeout=5, params=params)
        # If the response was successful, no Exception will be raised
        resp.raise_for_status()

    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
        raise
    except Exception as err:
        print(f'Other error occurred: {err}')
        raise
    else:
        # Sanity check: is this HTML?
        # A missing header or a non-HTML content-type fails the assertion.
        content_type = resp.headers.get('Content-Type', '').lower()
        assert 'html' in content_type, f'unexpected content-type: {content_type!r}'

    return resp

def book_category_contents(url, cat_list):
    
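    resp = simple_get(url)
    # Pass the raw bytes to BeautifulSoup so it can work out the encoding from
    # the document itself. resp.text can fall back to ISO-8859-1 when the HTTP
    # header declares no charset, which garbles characters such as '£' and '’'.
    soup = BeautifulSoup(resp.content, 'html.parser')

    # CHALLENGE:
    # Recursively call book_category_contents() for all pages of the category.
    # Looking for HTML
    # <li class="next"><a href="page-2.html">next</a></li>
    #
    # Note: the recursive call runs before this page is parsed, so books
    # from later pages are appended to cat_list ahead of this page's books.
    li = soup.find('li', class_='next')
    if li is not None:
        pgurl = urljoin(url, li.a['href'])
        book_category_contents(pgurl, cat_list)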

    '''
    # The following example chunk of HTML contains all
    # the data we need for each book in a category
    #
    <article class="product_pod">

            <div class="image_container">
                    <a href="..
            </div>
                <p class="star-rating One">
                .
                </p>
            <h3><a href="../../../
            <div class="product_price">
        <p class="price_color">£30.54</p>
    </article>
    '''
    star_list = ['One', 'Two', 'Three', 'Four', 'Five']

    for art in soup.find_all('article', class_=re.compile(r'^product_pod$')):

        # fetch title
        # Note: the <h3> tag contains an abbreviated title so don't use that

        div = art.find('div', class_='image_container')
        title = div.a.img['alt']

        # fetch star rating info
        regex = re.compile(r'^star-rating\s+(One|Two|Three|Four|Five)$')
        star = art.find('p', class_=regex)

        # the class attr of star-rating is two words separated by a space and
        # is therefore returned as a list of words
        # Ex: ['star-rating', 'One']
        #
        cls_list = star['class']
        assert isinstance(cls_list, list)
        star_rating = star['class'][1]
        star_rating = star_list.index( star_rating ) + 1

        # fetch price info
        price_str = art.find('p', class_='price_color').text

        # use re to remove any chars that are
        # not \d (digit) or decimal place (dot)

        price = re.sub(r'[^\d.]', '', price_str)
        cat_dict = {'title': title, 'star_rating': star_rating, 'price': price}
        cat_list.append( cat_dict )


# The following function is actually a 'generator' function
# because of the use of the yield statement.  See below.
# A generator is sometimes called a 'lazy iterator'
# because it procrastinates work until absolutely necessary.
# So, instead of gobbling up all the categories and keeping them in
# a list, this generator will return a category and its url one
# at a time as they are requested.  That is, this function is iterable. If there were a
# large number of categories it might even be worth the effort!
#
# See http://naiquevin.github.io/python-generators-and-being-lazy.html

def book_categories(url):

    resp = simple_get(url)

    # get the decoded payload.
    # The text property relies on the HTTP headers to divine the encoding.
    html = resp.text

    # By inspecting the HTML at the URL we can see
    # the book cats are in an unordered list (<ul>)
    # that has a class of "nav nav-list". We could use this
    # to drill down to the right spot; however, there is a better
    # pattern.  Each book cat is an anchor tag (<a>) with an href
    # attribute that always begins (and ends - which is significant
    # w.r.t. the pattern) with the same text: 'catalogue/category/books/.../index.html'
    # Ex: <a href="catalogue/category/books/travel_2/index.html">
    # Time to write a RE!
    #
    soup = BeautifulSoup(html, 'html.parser')

    # Note: category urls can be composed of
    # characters in the character class:  [-_\w\d/]

    for anchor in soup.find_all('a', href= re.compile(r'^catalogue/category/books/[-_\w\d/]+/index[.]html$')):
        # the text for each <a> contains the category data
        cat = anchor.text.strip()
        url = anchor['href'].strip()

        # The use of yield makes this function a generator function (aka 'lazy iterator').
        # The yield effectively takes a snapshot of the state of this function
        # and then returns the (cat, url) tuple.
        # When the next value is requested, the state of
        # the function is restored and the loop picks up where it left off. Pretty neat - IMO.
        # See https://realpython.com/introduction-to-python-generators/
        yield cat, url

def main():
    # Get the book categories and cat URLs from top-level URL
    BOOK_URL = 'http://books.toscrape.com/'

    # book_categories() is a generator function. See comment
    # above the function def
    #
    for c,u in book_categories(BOOK_URL):
        # create a list to hold book info for category.
        # Pass list as argument to func book_category_contents().
        # The reason the list is an argument (and not returned by the function)
        # is because the book_category_contents() function
        # recursively follows the 'next page' links.
        #
        cat_list = []
        cat_url = urljoin(BOOK_URL, u)
        
        book_category_contents(cat_url, cat_list)

        # at this point, cat_list will have been updated
        # to be a list of book related dictionaries of the following type:
        # {'title': title, 'star_rating': star_rating, 'price': price}
        # The keys in this dict need to be identical to the field names
        # expected by the csv.DictWriter object (if you planned to print the results).

        # display data
        for d in cat_list:
            # add category key to the dict
            d['category'] = c
            pprint(d, indent=2)

if __name__ == '__main__':
    main()

{ 'category': 'Travel',
  'price': '45.17',
  'star_rating': 2,
  'title': "It's Only the Himalayas"}
{ 'category': 'Travel',
  'price': '49.43',
  'star_rating': 4,
  'title': 'Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and '
           'Beyond'}
{ 'category': 'Travel',
  'price': '48.87',
  'star_rating': 3,
  'title': 'See America: A Celebration of Our National Parks & Treasured Sites'}
{ 'category': 'Travel',
  'price': '36.94',
  'star_rating': 2,
  'title': 'Vagabonding: An Uncommon Guide to the Art of Long-Term World '
           'Travel'}
{ 'category': 'Travel',
  'price': '37.33',
  'star_rating': 3,
  'title': 'Under the Tuscan Sun'}
{ 'category': 'Travel',
  'price': '44.34',
  'star_rating': 2,
  'title': 'A Summer In Europe'}
{ 'category': 'Travel',
  'price': '30.54',
  'star_rating': 1,
  'title': 'The Great Railway Bazaar'}
{ 'category': 'Travel',
  'price': '56.88',
  'star_rating': 4,
  'title': 'A Year in Provence (Provence #1)'}
{ 'category': 'Tra

  'price': '37.46',
  'star_rating': 1,
  'title': 'Sense and Sensibility'}
{ 'category': 'Classics',
  'price': '47.11',
  'star_rating': 2,
  'title': 'Of Mice and Men'}
{'category': 'Classics', 'price': '32.93', 'star_rating': 2, 'title': 'Emma'}
{ 'category': 'Classics',
  'price': '55.53',
  'star_rating': 1,
  'title': "Alice in Wonderland (Alice's Adventures in Wonderland #1)"}
{ 'category': 'Philosophy',
  'price': '15.94',
  'star_rating': 5,
  'title': "Sophie's World"}
{ 'category': 'Philosophy',
  'price': '58.11',
  'star_rating': 4,
  'title': 'The Death of Humanity: and the Case for Life'}
{ 'category': 'Philosophy',
  'price': '17.44',
  'star_rating': 4,
  'title': 'The Stranger'}
{ 'category': 'Philosophy',
  'price': '54.21',
  'star_rating': 1,
  'title': 'Proofs of God: Classical Arguments from Tertullian to Barth'}
{ 'category': 'Philosophy',
  'price': '47.13',
  'star_rating': 1,
  'title': 'Kierkegaard: A Christian Missionary to Christians'}
{ 'category': 'Phil

{ 'category': 'Childrens',
  'price': '16.26',
  'star_rating': 2,
  'title': 'The Cat in the Hat (Beginner Books B-1)'}
{ 'category': 'Childrens',
  'price': '28.54',
  'star_rating': 3,
  'title': 'Red: The True Story of Red Riding Hood'}
{ 'category': 'Childrens',
  'price': '37.52',
  'star_rating': 2,
  'title': 'Horrible Bear!'}
{ 'category': 'Childrens',
  'price': '10.79',
  'star_rating': 4,
  'title': 'Green Eggs and Ham (Beginner Books B-16)'}
{ 'category': 'Childrens',
  'price': '10.62',
  'star_rating': 1,
  'title': 'Counting Thyme'}
{ 'category': 'Childrens',
  'price': '10.66',
  'star_rating': 3,
  'title': 'Are We There Yet?'}
{ 'category': 'Childrens',
  'price': '52.88',
  'star_rating': 4,
  'title': 'Diary of a Minecraft Zombie Book 1: A Scare of a Dare (An '
           'Unofficial Minecraft Book)'}
{ 'category': 'Childrens',
  'price': '28.34',
  'star_rating': 1,
  'title': 'Matilda'}
{ 'category': 'Childrens',
  'price': '22.85',
  'star_rating': 3,
  'title':

{ 'category': 'Music',
  'price': '55.66',
  'star_rating': 2,
  'title': "Old Records Never Die: One Man's Quest for His Vinyl and His Past"}
{ 'category': 'Music',
  'price': '28.80',
  'star_rating': 3,
  'title': 'Forever Rockers (The Rocker #12)'}
{ 'category': 'Default',
  'price': '23.90',
  'star_rating': 5,
  'title': 'Dark Places'}
{ 'category': 'Default',
  'price': '35.28',
  'star_rating': 5,
  'title': 'Breaking Dawn (Twilight #4)'}
{ 'category': 'Default',
  'price': '21.55',
  'star_rating': 5,
  'title': 'Beautiful Creatures (Caster Chronicles #1)'}
{ 'category': 'Default',
  'price': '14.08',
  'star_rating': 5,
  'title': 'A Visit from the Goon Squad'}
{ 'category': 'Default',
  'price': '19.69',
  'star_rating': 5,
  'title': 'The Zombie Room'}
{ 'category': 'Default',
  'price': '50.59',
  'star_rating': 3,
  'title': 'The Name of the Wind (The Kingkiller Chronicle #1)'}
{ 'category': 'Default',
  'price': '18.88',
  'star_rating': 2,
  'title': 'Taking Shots (Assa

  'star_rating': 2,
  'title': 'Soft Apocalypse'}
{ 'category': 'Science Fiction',
  'price': '48.74',
  'star_rating': 1,
  'title': 'Sleeping Giants (Themis Files #1)'}
{ 'category': 'Science Fiction',
  'price': '21.36',
  'star_rating': 4,
  'title': 'Arena'}
{ 'category': 'Science Fiction',
  'price': '32.42',
  'star_rating': 1,
  'title': 'Foundation (Foundation (Publication Order) #1)'}
{ 'category': 'Science Fiction',
  'price': '10.92',
  'star_rating': 1,
  'title': "The Restaurant at the End of the Universe (Hitchhiker's Guide to "
           'the Galaxy #2)'}
{ 'category': 'Science Fiction',
  'price': '19.07',
  'star_rating': 4,
  'title': 'Ready Player One'}
{ 'category': 'Science Fiction',
  'price': '33.26',
  'star_rating': 2,
  'title': "Life, the Universe and Everything (Hitchhiker's Guide to the "
           'Galaxy #3)'}
{ 'category': 'Science Fiction',
  'price': '54.86',
  'star_rating': 1,
  'title': 'Dune (Dune #1)'}
{ 'category': 'Science Fiction',
  'price'

{ 'category': 'Fantasy',
  'price': '53.82',
  'star_rating': 2,
  'title': 'Vampire Girl (Vampire Girl #1)'}
{ 'category': 'Fantasy',
  'price': '36.25',
  'star_rating': 3,
  'title': 'The Silent Twin (Detective Jennifer Knight #3)'}
{ 'category': 'Fantasy',
  'price': '29.38',
  'star_rating': 1,
  'title': 'The Mirror & the Maze (The Wrath and the Dawn #1.5)'}
{ 'category': 'Fantasy',
  'price': '13.33',
  'star_rating': 3,
  'title': 'Sister Sable (The Mad Queen #1)'}
{ 'category': 'Fantasy',
  'price': '21.72',
  'star_rating': 4,
  'title': 'Shadow Rites (Jane Yellowrock #10)'}
{ 'category': 'Fantasy',
  'price': '28.99',
  'star_rating': 1,
  'title': 'Origins (Alphas 0.5)'}
{ 'category': 'Fantasy',
  'price': '52.94',
  'star_rating': 2,
  'title': 'One Second (Seven #7)'}
{ 'category': 'Fantasy',
  'price': '58.75',
  'star_rating': 4,
  'title': 'Myriad (Prentor #1)'}
{ 'category': 'Fantasy',
  'price': '12.16',
  'star_rating': 5,
  'title': 'Every Heart a Doorway (Every He

{ 'category': 'Poetry',
  'price': '51.77',
  'star_rating': 3,
  'title': 'A Light in the Attic'}
{ 'category': 'Poetry',
  'price': '52.15',
  'star_rating': 1,
  'title': 'The Black Maria'}
{ 'category': 'Poetry',
  'price': '20.66',
  'star_rating': 4,
  'title': "Shakespeare's Sonnets"}
{'category': 'Poetry', 'price': '23.88', 'star_rating': 1, 'title': 'Olio'}
{ 'category': 'Poetry',
  'price': '33.63',
  'star_rating': 2,
  'title': "You can't bury them all: Poems"}
{ 'category': 'Poetry',
  'price': '57.31',
  'star_rating': 3,
  'title': 'Slow States of Collapse: Poems'}
{ 'category': 'Poetry',
  'price': '14.27',
  'star_rating': 4,
  'title': 'Untitled Collection: Sabbath Poems 2014'}
{ 'category': 'Poetry',
  'price': '14.19',
  'star_rating': 4,
  'title': 'Poems That Make Grown Women Cry'}
{ 'category': 'Poetry',
  'price': '41.05',
  'star_rating': 1,
  'title': 'Night Sky with Exit Wounds'}
{'category': 'Poetry', 'price': '46.78', 'star_rating': 4, 'title': 'salt.'}
{ '

{ 'category': 'Food and Drink',
  'price': '46.01',
  'star_rating': 4,
  'title': 'How to Cook Everything Vegetarian: Simple Meatless Recipes for '
           'Great Food (How to Cook Everything)'}
{ 'category': 'Food and Drink',
  'price': '28.25',
  'star_rating': 2,
  'title': 'How to Be a Domestic Goddess: Baking and the Art of Comfort '
           'Cooking'}
{ 'category': 'Food and Drink',
  'price': '59.92',
  'star_rating': 5,
  'title': 'The Barefoot Contessa Cookbook'}
{ 'category': 'Food and Drink',
  'price': '39.61',
  'star_rating': 3,
  'title': 'Better Homes and Gardens New Cook Book'}
{ 'category': 'Food and Drink',
  'price': '11.05',
  'star_rating': 5,
  'title': 'The Power Greens Cookbook: 140 Delicious Superfood Recipes'}
{ 'category': 'Food and Drink',
  'price': '24.91',
  'star_rating': 5,
  'title': 'Mexican Today: New and Rediscovered Recipes for Contemporary '
           'Kitchens'}
{ 'category': 'Food and Drink',
  'price': '13.66',
  'star_rating': 2,
  't

{ 'category': 'Spirituality',
  'price': '17.66',
  'star_rating': 5,
  'title': 'The Four Agreements: A Practical Guide to Personal Freedom'}
{ 'category': 'Spirituality',
  'price': '32.24',
  'star_rating': 5,
  'title': "The Activist's Tao Te Ching: Ancient Advice for a Modern "
           'Revolution'}
{ 'category': 'Spirituality',
  'price': '37.80',
  'star_rating': 2,
  'title': 'Chasing Heaven: What Dying Taught Me About Living'}
{ 'category': 'Spirituality',
  'price': '20.91',
  'star_rating': 1,
  'title': "If I Gave You God's Phone Number....: Searching for Spirituality "
           'in America'}
{ 'category': 'Spirituality',
  'price': '46.33',
  'star_rating': 2,
  'title': 'Unreasonable Hope: Finding Faith in the God Who Brings Purpose to '
           'Your Pain'}
{ 'category': 'Spirituality',
  'price': '55.65',
  'star_rating': 5,
  'title': "A New Earth: Awakening to Your Life's Purpose"}
{ 'category': 'Academic',
  'price': '13.12',
  'star_rating': 2,
  'title': 'L