# Lab Report 5: Web Scraping
## Name: Afnan Alabdulwahab

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

For the following problems, you will be scraping http://books.toscrape.com/. This website is a fake book retailer, designed to mimic the design of many retail websites. It exists solely to help students practice web-scraping, so there aren’t going to be any ethical concerns with this particular exercise, and there shouldn’t be any issues with rate limits or other gates that could prevent web-scraping. Take a moment and look at this website, so that you know what you will be working with.

Your goal is to generate a dataframe with four columns: one for the title, one for the price, one for the star-rating, and one for the book cover JPEG’s URL. The dataframe will also have 1000 rows, one for each of the 1000 books listed on the 50 pages of this website.

## Problem 0
Import the following libraries:

In [23]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import sys
import json
sys.tracebacklimit = 0 # turn off the error tracebacks

## Problem 1

The following code snippet retrieves the user agent string from the https://httpbin.org/user-agent endpoint and sets it as the User-Agent header for subsequent HTTP requests.

In [2]:
# retrieve the user agent string
r = requests.get('https://httpbin.org/user-agent')
useragent = json.loads(r.text)['user-agent']

# define the headers with the retrieved user agent
headers = {'User-Agent': useragent,
           'From': 'aa7dd@virginia.ed'}

Here, I send an HTTP GET request to http://books.toscrape.com with the headers defined above and check the response status.

In [3]:
url = 'http://books.toscrape.com/'
r = requests.get(url, headers=headers)
r

<Response [200]>

## Problem 2

The following code parses the HTML content of the response object `r` using BeautifulSoup with `'html.parser'` and saves the resulting object as `mysoup`.

In [4]:
mysoup = BeautifulSoup(r.text, 'html.parser')

To extract all 20 book titles from the website, I examined the raw HTML and identified that the book titles are within the following HTML tags:

```html
<article class="product_pod">
    <div class="image_container">
        <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail"></a>
    </div>
    <p class="star-rating Three">
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
    </p>
    <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
```

To extract the 20 book titles and save them in a list, I took the following steps:

1. Used BeautifulSoup to find all `article` elements with the class `product_pod`.
2. Extracted the book titles from the `title` attribute of the `a` tags within the `h3` tags of these `article` elements.

In [5]:
# Find all 'article' elements with the class 'product_pod' in the parsed HTML
book_elements = mysoup.find_all('article', 'product_pod')

In [6]:
booktitles = [x.h3.a['title'] for x in book_elements]
booktitles

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind',
 'The Requiem Red',
 'The Dirty Little Secrets of Getting Your Dream Job',
 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
 'The Black Maria',
 'Starving Hearts (Triangular Trade Trilogy, #1)',
 "Shakespeare's Sonnets",
 'Set Me Free',
 "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
 'Rip it Up and Start Again',
 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
 'Olio',
 'Mesaerion: The Best Science Fiction Stories 1800-1849',
 'Libertarianism for Beginners',
 "It's Only the Himalayas"]
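
The `title` attribute is used rather than the link text because the visible text is truncated on the page (`A Light in the ...`). A small sketch on a trimmed copy of the HTML shown earlier, assuming `bs4` is installed; the names `sample_html`, `link`, `visible_text`, and `full_title` are illustrative only:

```python
from bs4 import BeautifulSoup

# Trimmed copy of one product card from the page's HTML.
sample_html = """
<article class="product_pod">
    <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Same navigation as in the list comprehension: article -> h3 -> a
link = soup.find('article', 'product_pod').h3.a

visible_text = link.text     # the shortened text displayed on the page
full_title = link['title']   # the complete title stored in the attribute

print(visible_text)
print(full_title)
```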

## Problem 3

To extract the prices of each of the 20 books and save them in a list, I examined the raw HTML and identified that the prices are within `p` tags with the class `price_color`:

```html
<article class="product_pod">
    <div class="image_container">
        <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail"></a>
    </div>
    <p class="star-rating Three">
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
    </p>
    <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
    <div class="product_price">
        <p class="price_color">£51.77</p>
    </div>
</article>
```

To extract the prices and save them in a list, I took the following steps:

1. Used BeautifulSoup to find all `p` elements with the class `price_color`.
2. Extracted the text content of these elements, which contains the prices.

In [7]:
price_elements = mysoup.find_all('p', 'price_color')
prices = [x.text for x in price_elements]
prices

['Â£51.77',
 'Â£53.74',
 'Â£50.10',
 'Â£47.82',
 'Â£54.23',
 'Â£22.65',
 'Â£33.34',
 'Â£17.93',
 'Â£22.60',
 'Â£52.15',
 'Â£13.99',
 'Â£20.66',
 'Â£17.46',
 'Â£52.29',
 'Â£35.02',
 'Â£57.25',
 'Â£23.88',
 'Â£37.59',
 'Â£51.33',
 'Â£45.17']

3. Removed the currency symbol from the prices, which appears in the output as `Â£` rather than `£`.

In [8]:
prices = [s.replace('Â£', '') for s in prices]
prices

['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 '22.65',
 '33.34',
 '17.93',
 '22.60',
 '52.15',
 '13.99',
 '20.66',
 '17.46',
 '52.29',
 '35.02',
 '57.25',
 '23.88',
 '37.59',
 '51.33',
 '45.17']
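
The `Â£` prefix is what a UTF-8 encoded `£` looks like when its two bytes are decoded as ISO-8859-1. That is a plausible explanation for the output above, but it is an assumption, since the response headers were not inspected here. A minimal sketch of the effect, using a hard-coded sample string rather than the live response:

```python
# UTF-8 encodes the pound sign as two bytes (0xC2 0xA3).
raw = '£51.77'.encode('utf-8')

# Decoding those bytes as ISO-8859-1 maps each byte to its own character.
garbled = raw.decode('iso-8859-1')
print(garbled)

# Decoding as UTF-8 recovers the original text, so a plain '£' can be stripped
# and the remainder converted to a number.
clean = raw.decode('utf-8')
price = float(clean.replace('£', ''))
print(price)
```

If this is the cause, setting `r.encoding = 'utf-8'` before reading `r.text`, or passing `r.content` to BeautifulSoup instead of `r.text`, should produce a plain `£` in the scraped prices.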

## Problem 4

To extract the star level ratings for the 20 books, I examined the raw HTML and identified that the star ratings are within `p` tags with the class `star-rating`, where the second item in the class list indicates the rating (e.g., "One", "Two", "Three", etc.):

```html
<article class="product_pod">
    <div class="image_container">
        <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail"></a>
    </div>
    <p class="star-rating Three">
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
    </p>
    <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
    <div class="product_price">
        <p class="price_color">£51.77</p>
    </div>
</article>
```

To extract the star ratings and save them in a list, I took the following steps:

1. Reused the `book_elements` list from Problem 2, which holds every `article` element with the class `product_pod`.
2. Extracted the second item in the class list of the `p` tags within these `article` elements, which indicates the star rating.

In [9]:
ratings = [x.p['class'][1] for x in book_elements]
ratings

['Three',
 'One',
 'One',
 'Four',
 'Five',
 'One',
 'Four',
 'Three',
 'Four',
 'One',
 'Two',
 'Four',
 'Five',
 'Five',
 'Five',
 'Three',
 'One',
 'One',
 'Two',
 'Two']
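
The ratings are stored as words. If numeric ratings are wanted later, for example for sorting or averaging, a lookup dictionary is one option. This is an optional extra rather than part of the lab's requirements, and `rating_to_int` and `sample_ratings` are hypothetical names:

```python
# Map the class-name words used by the site to integers.
rating_to_int = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

# The first five ratings from the output above.
sample_ratings = ['Three', 'One', 'One', 'Four', 'Five']

numeric_ratings = [rating_to_int[word] for word in sample_ratings]
print(numeric_ratings)
```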

## Problem 5

To extract the URLs for the JPEG thumbnail images that show the covers of the 20 books, I examined the raw HTML and identified that the image URLs are in the `src` attribute of the `img` tags, which sit inside `a` tags within the `div` elements with the class `image_container`, which are in turn within the `article` tags with the class `product_pod`:

```html
<article class="product_pod">
    <div class="image_container">
        <a href="catalogue/a-light-in-the-attic_1000/index.html">
            <img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail">
        </a>
    </div>
</article>
```

To extract the URLs for the JPEG thumbnails and save them in a list, I took the following steps:

1. Defined the root URL of the website.
2. Reused the `book_elements` list from Problem 2, which holds every `article` element with the class `product_pod`.
3. Constructed the full URLs by concatenating the root URL with the `src` attribute of the `img` tags inside the `a` tags within the `div` elements.

In [10]:
root = 'http://books.toscrape.com/'

In [11]:
JPEGthumbnails = [root + x.a.img['src'] for x in book_elements]
JPEGthumbnails 

['http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
 'http://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
 'http://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
 'http://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
 'http://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg',
 'http://books.toscrape.com/media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg',
 'http://books.toscrape.com/media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg',
 'http://books.toscrape.com/media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg',
 'http://books.toscrape.com/media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg',
 'http://books.toscrape.com/media/cache/58/46/5846057e28022268153beff6d352b06c.jpg',
 'http://books.toscrape.com/media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg',
 'http://books.toscrape.com/media/cache/10/48/1048f63d3b5061cd2f4

## Problem 6

To create a DataFrame with one row for each of the 20 books, including the book titles, prices, star ratings, and cover JPEG URLs as the four columns, I took the following steps:

1. Created a dictionary to store the data for each column.
2. Constructed the DataFrame from the dictionary.

In [12]:
book_dict = {
    'titles': booktitles,
    'prices': prices,
    'ratings': ratings,
    'cover': JPEGthumbnails
}

book_df = pd.DataFrame.from_records(book_dict)

# Reorder the columns
book_df = book_df[['titles', 'ratings', 'prices', 'cover']]
book_df

| | titles | ratings | prices | cover |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | 51.77 | http://books.toscrape.com/media/cache/2c/da/2c... |
| 1 | Tipping the Velvet | One | 53.74 | http://books.toscrape.com/media/cache/26/0c/26... |
| 2 | Soumission | One | 50.10 | http://books.toscrape.com/media/cache/3e/ef/3e... |
| 3 | Sharp Objects | Four | 47.82 | http://books.toscrape.com/media/cache/32/51/32... |
| 4 | Sapiens: A Brief History of Humankind | Five | 54.23 | http://books.toscrape.com/media/cache/be/a5/be... |
| 5 | The Requiem Red | One | 22.65 | http://books.toscrape.com/media/cache/68/33/68... |
| 6 | The Dirty Little Secrets of Getting Your Dream... | Four | 33.34 | http://books.toscrape.com/media/cache/92/27/92... |
| 7 | The Coming Woman: A Novel Based on the Life of... | Three | 17.93 | http://books.toscrape.com/media/cache/3d/54/3d... |
| 8 | The Boys in the Boat: Nine Americans and Their... | Four | 22.60 | http://books.toscrape.com/media/cache/66/88/66... |
| 9 | The Black Maria | One | 52.15 | http://books.toscrape.com/media/cache/58/46/58... |


## Problem 7

This function takes a URL of a webpage to scrape as input. It retrieves data such as book titles, prices, ratings, and cover images from the specified URL using BeautifulSoup and generates a pandas DataFrame as output.

In [17]:
def bookscraper(url):
    r = requests.get(url, headers=headers)
    mysoup = BeautifulSoup(r.text, 'html.parser')
    root = 'http://books.toscrape.com/'

    book_dict = {}

    book_elements = mysoup.find_all('article', 'product_pod')
    book_dict['titles'] = [x.h3.a['title'] for x in book_elements]

    price_elements = mysoup.find_all('p', 'price_color')
    prices = [x.text for x in price_elements]
    book_dict['prices'] = [s.replace('Â£', '') for s in prices]

    book_dict['ratings'] = [x.p['class'][1] for x in book_elements]

    book_dict['cover'] = [root + x.a.img['src'] for x in book_elements]

    book_df = pd.DataFrame.from_records(book_dict)

    # Reorder the columns
    book_df = book_df[['titles', 'ratings', 'prices', 'cover']]
    return book_df 
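
Concatenating `root` with `src` works on the home page, where `src` is relative to the site root. On the catalogue pages the `src` values start with `../`, which is why the cover URLs in the Problem 8 output contain `/../`. One possible alternative, sketched here with hard-coded sample values rather than a live request, is to resolve each `src` against the URL of the page it came from using `urllib.parse.urljoin` from the standard library:

```python
from urllib.parse import urljoin

# Sample values: a catalogue page URL and a relative src as it appears there.
page_url = 'http://books.toscrape.com/catalogue/page-1.html'
src = '../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'

# Plain concatenation keeps the '../' segment in the middle of the URL.
naive = 'http://books.toscrape.com/' + src

# urljoin resolves the relative path against the page's own location.
resolved = urljoin(page_url, src)

print(naive)
print(resolved)

# The same call also handles the home page, where src has no '../' prefix.
home = urljoin('http://books.toscrape.com/', 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg')
print(home)
```

Inside `bookscraper`, this would amount to replacing `root + x.a.img['src']` with `urljoin(url, x.a.img['src'])`, which is a suggestion rather than the code that produced the outputs in this report.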

## Problem 8

To scrape multiple pages of book data, I took the following steps:

1. Created a list `booklinks` containing URLs for each of the 50 pages to scrape.
2. Used a `for` loop to iterate through each URL in `booklinks`.
3. Inside the loop, called the `bookscraper` function with each URL to retrieve a DataFrame `temp_df`.
4. Concatenated `temp_df` with the main DataFrame `bookdf` using `pd.concat()`, so that the rows from every page end up in one DataFrame.

In [24]:
booklinks = [f'http://books.toscrape.com/catalogue/page-{i}.html'
             for i in range(1, 51)]
booklinks

['http://books.toscrape.com/catalogue/page-1.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html',
 'http://books.toscrape.com/catalogue/page-6.html',
 'http://books.toscrape.com/catalogue/page-7.html',
 'http://books.toscrape.com/catalogue/page-8.html',
 'http://books.toscrape.com/catalogue/page-9.html',
 'http://books.toscrape.com/catalogue/page-10.html',
 'http://books.toscrape.com/catalogue/page-11.html',
 'http://books.toscrape.com/catalogue/page-12.html',
 'http://books.toscrape.com/catalogue/page-13.html',
 'http://books.toscrape.com/catalogue/page-14.html',
 'http://books.toscrape.com/catalogue/page-15.html',
 'http://books.toscrape.com/catalogue/page-16.html',
 'http://books.toscrape.com/catalogue/page-17.html',
 'http://books.toscrape.com/catalogue/page-18.html',
 'http://books.toscrape.com/catalogue/page-19.html',
 '

In [21]:
bookdf = pd.DataFrame()
for x in booklinks:
    temp_df = bookscraper(x)
    bookdf = pd.concat([bookdf, temp_df])
bookdf

| | titles | ratings | prices | cover |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | 51.77 | http://books.toscrape.com/../media/cache/2c/da... |
| 1 | Tipping the Velvet | One | 53.74 | http://books.toscrape.com/../media/cache/26/0c... |
| 2 | Soumission | One | 50.10 | http://books.toscrape.com/../media/cache/3e/ef... |
| 3 | Sharp Objects | Four | 47.82 | http://books.toscrape.com/../media/cache/32/51... |
| 4 | Sapiens: A Brief History of Humankind | Five | 54.23 | http://books.toscrape.com/../media/cache/be/a5... |
| ... | ... | ... | ... | ... |
| 15 | Alice in Wonderland (Alice's Adventures in Won... | One | 55.53 | http://books.toscrape.com/../media/cache/96/ee... |
| 16 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | Four | 57.06 | http://books.toscrape.com/../media/cache/09/7c... |
| 17 | A Spy's Devotion (The Regency Spies of London #1) | Five | 16.97 | http://books.toscrape.com/../media/cache/1b/5f... |
| 18 | 1st to Die (Women's Murder Club #1) | One | 53.98 | http://books.toscrape.com/../media/cache/2b/41... |
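
Each page's DataFrame carries its own 0 to 19 index, so the concatenated result repeats index labels; the last rows shown above are labeled 15 to 18. One possible variation, sketched here with two tiny stand-in frames instead of real scraped pages, collects the frames in a list and concatenates once with `ignore_index=True`. The names `page_frames` and `combined` are illustrative:

```python
import pandas as pd

# Stand-ins for the per-page frames that bookscraper would return.
page_frames = [
    pd.DataFrame({'titles': ['A Light in the Attic', 'Tipping the Velvet'],
                  'prices': ['51.77', '53.74']}),
    pd.DataFrame({'titles': ['Soumission', 'Sharp Objects'],
                  'prices': ['50.10', '47.82']}),
]

# A single concat call avoids re-copying the growing frame on every iteration,
# and ignore_index=True renumbers the rows from 0 upward.
combined = pd.concat(page_frames, ignore_index=True)

print(combined.index.tolist())
print(combined.shape)
```

With the real data, the list could be built as `[bookscraper(x) for x in booklinks]`, which should give a frame indexed 0 through 999 if all 50 pages return 20 books each.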
