### Install Dependencies

In [None]:
!pip install beautifulsoup4

In [None]:
!pip install requests

### Import Packages

In [1]:
from bs4 import BeautifulSoup
import requests

### Overview

We will be using [ToScrape.com](https://books.toscrape.com/), an online resource that allows us to *legally* collect data and use it for practice.

By the end of this activity, you should be able to:
- Understand how information is structured in a simple web page
- Know how to make GET requests to pull data
- Debug your code

### Exercise 0 - GET links to the books on the front page

In this demo, we will get the links to all the books on the front page.

In [5]:
def get_links_from_first_page():
    """Collect the urls of all the books listed on the front page (/)."""
    
    BASE_URL = "https://books.toscrape.com/"
    urls = []
    
    # GET /
    r = requests.get(BASE_URL)
    
    # if request successful
    if r.status_code == 200:
        page = r.text # take the html
        soup = BeautifulSoup(page, 'html.parser') # parse it using BS
        
        # find all products
        products = soup.find_all("article")
        
        # for each product, get the href attribute from the <a> tag
        for product in products:
            href = product.find("a")["href"]
            # form the url and add it to our list
            urls.append(BASE_URL + href)
            
    return urls
        

In [6]:
# Test your function
links = get_links_from_first_page()
print(links[0])
print(len(links))

https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
20


### Exercise 1 - Write a function to get the links to books from pages 1-5 (inclusive)

**num_pages**: Integer - No. of pages to crawl

Hint: Try to look for patterns in the url. You will need a for loop for this. 
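
As a minimal sketch of the pattern the hint points at, the listing pages follow `page-1.html`, `page-2.html`, and so on under `/catalogue/`. The helper name `build_page_urls` is hypothetical, and `urljoin` from the standard library is used here as one possible way to join urls instead of string concatenation.

```python
from urllib.parse import urljoin

CATALOGUE_URL = "https://books.toscrape.com/catalogue/"

def build_page_urls(num_pages):
    """Return the listing-page urls for pages 1..num_pages (inclusive)."""
    return [urljoin(CATALOGUE_URL, f"page-{i}.html") for i in range(1, num_pages + 1)]

page_urls = build_page_urls(3)
print(page_urls)

# urljoin also resolves a relative href against the page it was found on
book_url = urljoin(page_urls[0], "a-light-in-the-attic_1000/index.html")
print(book_url)
```

Resolving each href against the url of the page it came from means the same code works whether the link was found on the front page or on a page under `/catalogue/`.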

In [2]:
def get_links(num_pages):
    BASE_URL = "https://books.toscrape.com/catalogue/"
    urls = []
    for i in range(1, num_pages+1):
        PAGE_URL = f"{BASE_URL}page-{i}.html"
        # GET /catalogue/page-{i}.html
        r = requests.get(PAGE_URL)

        # if request successful
        if r.status_code == 200:
            page = r.text # take the html
            soup = BeautifulSoup(page, 'html.parser') # parse it using BS

            # find all products
            products = soup.find_all("article")

            # for each product, get the href attribute from the <a> tag
            for product in products:
                href = product.find("a")["href"]
                # form the url and add it to our list
                urls.append(BASE_URL + href)
    return urls

In [3]:
# Test your function
links = get_links(2)
print(links[:3])
print(len(links))

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', 'https://books.toscrape.com/catalogue/soumission_998/index.html']
40


### Exercise 2 - Write a function to get the Book object from each page

**page_link**: String - The book's url

Example: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/

Hint: Make the code work for the first link, then generalize it for all links.
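
One detail that can break when generalizing is the price text: the number is preceded by a currency symbol, and possibly by a stray character if the page encoding is mis-detected. The sketch below is one way to handle that, assuming the price always contains a plain decimal number. The helper name `parse_price` is hypothetical.

```python
import re

def parse_price(text):
    """Extract the first decimal number from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())

print(parse_price("£51.77"))
print(parse_price("Â£51.77"))  # same result even with a mis-decoded prefix
```

Searching for the number rather than slicing off a fixed number of characters keeps the parsing independent of how many characters come before it.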

In [41]:
# this is the book object
{
    "title": "A Light in the Attic",
    "description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
    "price": 51.77
}

{'title': 'A Light in the Attic',
 'description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone do

In [44]:
def get_book_object(page_link):
    """Fetch a book page and return its title, price and description."""
    r = requests.get(page_link)
    # return None if the request failed
    if r.status_code != 200:
        return None

    soup = BeautifulSoup(r.text, 'html.parser') # parse the html using BS

    # the book details live in the first <article> tag
    product = soup.find("article")
    if product is None:
        return None

    ps = product.find_all('p')
    return {
        "title": product.h1.string,
        # strip the currency symbol (and any mis-decoded prefix) before converting
        "price": float(ps[0].string.lstrip("Â£")),
        "description": ps[-1].string
    }

In [45]:
# Test your function
book = get_book_object("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/")
print(book)

{'title': 'A Light in the Attic', 'price': 51.77, 'description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I t

### Exercise 3 - Write a function to get all Book objects from pages 1-2

**num_pages**: Integer - no. of pages of books to scrape

Hint: Use the functions you've previously defined
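
As a rough sketch of the composition the hint suggests, the loop below applies a per-book function to a list of links. The names `collect_books` and `fake_fetch` are hypothetical, and the stand-in fetcher exists only so the example runs without network access. In the exercise, the links and the fetcher would come from the functions defined earlier.

```python
def collect_books(links, fetch_book):
    """Call fetch_book on each link and keep the results that are not None."""
    books = []
    for link in links:
        book = fetch_book(link)
        if book is not None:
            books.append(book)
    return books

# stand-in fetcher (an assumption for this demo): no real requests are made
def fake_fetch(link):
    if link.endswith("missing"):
        return None
    return {"title": link.rsplit("/", 1)[-1]}

demo = collect_books(
    ["https://example.com/a", "https://example.com/missing", "https://example.com/b"],
    fake_fetch,
)
print(demo)
```

Passing the fetcher in as an argument makes the loop easy to test on its own before pointing it at real pages.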

In [None]:
def get_all_books(num_pages):
    pass

In [None]:
# Test your function
books = get_all_books(2)
print(books)

### Bonus Exercise 1.1 - Write a function to get all category links

In [None]:
def get_categories():
    pass

### Bonus Exercise 1.2 - Write a function to get all books from a category

**category_link**: String - The category's url

In [None]:
def get_all_books_from_cat(category_link):
    pass

### Bonus Exercise 1.3 - Write a function to get all books from all categories

In [None]:
def get_everything():
    pass