Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email rdb104@case.edu.<br />
___

# Web Scraping: Making a Request and Receiving a Response

**Description:** This lesson explores how to collect multiple connected data points from a web page and store them in a CSV file.

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 60 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** HTML Structure

**Data Format:** `html`, `txt`, `py` 

**Libraries Used:** `requests`, `bs4` (BeautifulSoup), `lxml`
___

## Project #3: Scraping sets of related information into CSV files.

Build a scraper that collects multiple data points about each book based upon specific criteria.

In this project you will:
1. Determine what data we are able to collect about each book listed in the store.
2. Use the `Inspect` tool in your web browser to identify the web page structure for those pieces of data.
3. Understand and use a Python script to crawl the web page and extract only the data that meets the classification criteria we identified in steps 1 and 2.
4. Write the data to a CSV file.



### What data is available?

If we look at this screenshot captured from the web page, we can see several interesting pieces of data in addition to the title: the price, the rating, and whether or not the book is in stock. We could also save the cover images or collect the links to those images, but let's leave those out for now. So we will try to collect four pieces of data for each book: title, price, rating, and stock status.

![title](img/booklisting.png)    

We'll get started the same way we have in the last few lessons, importing packages! 

In [2]:
from bs4 import BeautifulSoup
import requests  #https://requests.readthedocs.io/

Next, just like before, use requests to get the content of the website, store it in a variable, and then use BeautifulSoup to parse that content into the "soup" we can analyze.  

In [3]:
# 1. Fetch the page
results = requests.get("https://books.toscrape.com/")

# 2. Get the page content and assign it to the variable 'content'
content = results.text

# 3. Create the soup
soup = BeautifulSoup(content, "lxml")

Now let's look at the HTML structure of the page to work out how to identify each piece of information we want to scrape.

Each book's information is presented in an `article` element with `class="product_pod"`. We can use `find_all` to find all of these `article` elements and then scrape the data we need from each one.

`articles = soup.find_all('article', class_='product_pod')`

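As we determined in the last lesson, the title is contained in the `h3` element. More precisely, the full title is stored in the `title` attribute of the `a` element inside that `h3`.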

`title = article.find('h3').find('a')['title']`

We can see the price data is contained in a `p` element with `class=price_color`.    

`price = article.find('p', class_='price_color').text`

The rating is contained in another `p` element with `class="star-rating NUMBER"`.  We don't want the whole `p` element, just the `class`, so we add `['class']` to this line to limit what we scrape.

`rating = article.find('p', class_='star-rating')['class']`
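
To see what `['class']` actually returns, here is a minimal sketch that runs on a made-up HTML snippet instead of the live page. The snippet and the `WORD_TO_NUMBER` mapping are assumptions added for illustration, and the sketch uses the built-in `html.parser` so it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the rating markup of a single book
html = '<article class="product_pod"><p class="star-rating Three"></p></article>'
article = BeautifulSoup(html, "html.parser").find("article", class_="product_pod")

# 'class' can hold several values, so BeautifulSoup returns it as a list
classes = article.find("p", class_="star-rating")["class"]
print(classes)  # ['star-rating', 'Three']

# Assumed helper mapping: convert the rating word into a number
WORD_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
rating_word = classes[1]
rating_number = WORD_TO_NUMBER[rating_word]
print(rating_word, rating_number)  # Three 3
```

Because the result is a list, the rating word can be picked out by its position, and a numeric rating may be easier to sort or filter later.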

For stock status, we use the same approach of identifying the element by its `class`.

`stock = article.find('p', class_='instock availability').text`
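
One detail worth knowing about this line: when `class_` is given two names separated by a space, BeautifulSoup compares it against the exact text of the `class` attribute, so the order of the names matters. The sketch below illustrates this on a made-up snippet (an assumption, not the live page) and shows a CSS selector via `select_one` as one possible alternative that does not depend on order.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the stock markup of a single book
html = """
<article class="product_pod">
  <p class="instock availability">
    In stock
  </p>
</article>
"""
article = BeautifulSoup(html, "html.parser").find("article")

# Exact attribute string: matches
exact = article.find("p", class_="instock availability")
# A single class name: also matches
single = article.find("p", class_="availability")
# Same names in a different order: no match, find() returns None
reordered = article.find("p", class_="availability instock")
# CSS selector: matches elements carrying both classes, in any order
css = article.select_one("p.instock.availability")

print(exact is not None)      # True
print(single is not None)     # True
print(reordered is not None)  # False
print(css.text.strip())       # In stock
```

The lesson's version works because the page writes the classes in exactly that order; the selector form is simply a bit more tolerant if the markup changes.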


![title](img/htmlscrape.png)  

In [15]:
# Find all article elements with class 'product_pod'
articles = soup.find_all('article', class_='product_pod')

# Initialize an empty list to store book information
book_info_list = []

# Iterate through each article to extract information and store in a list
for article in articles:
    # Extract title
    title = article.find('h3').find('a')['title']

    # Extract product price
    price = article.find('p', class_='price_color').text

    # Extract star rating as a list of class names, e.g. ['star-rating', 'Three']
    rating = article.find('p', class_='star-rating')['class']

    # Extract stock status
    stock = article.find('p', class_='instock availability').text

    # Store the information in a list
    book_info = [title, price, rating, stock]

    # Append the book information to the main list
    book_info_list.append(book_info)

# Print the list of lists containing book information
for book_info in book_info_list:
    print(book_info)

['A Light in the Attic', 'Â£51.77', ['star-rating', 'Three'], '\n\n    \n        In stock\n    \n']
['Tipping the Velvet', 'Â£53.74', ['star-rating', 'One'], '\n\n    \n        In stock\n    \n']
['Soumission', 'Â£50.10', ['star-rating', 'One'], '\n\n    \n        In stock\n    \n']
['Sharp Objects', 'Â£47.82', ['star-rating', 'Four'], '\n\n    \n        In stock\n    \n']
['Sapiens: A Brief History of Humankind', 'Â£54.23', ['star-rating', 'Five'], '\n\n    \n        In stock\n    \n']
['The Requiem Red', 'Â£22.65', ['star-rating', 'One'], '\n\n    \n        In stock\n    \n']
['The Dirty Little Secrets of Getting Your Dream Job', 'Â£33.34', ['star-rating', 'Four'], '\n\n    \n        In stock\n    \n']
['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'Â£17.93', ['star-rating', 'Three'], '\n\n    \n        In stock\n    \n']
['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'Â£22.60', ['star-ra

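The output above needs some cleanup, so the next cell makes three changes. First, `.replace("Â", "")` removes the stray `Â` character that appears in front of each `£`. Second, adding `[1]` to the rating line keeps only the second item of the class list, which is the rating word itself. Third, we use `.strip()` on the stock text. You can specify which leading and trailing characters you'd like to remove from a string; in our case we've left the parentheses empty, so the method uses its default behavior, which is to remove any whitespace.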
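
The cleanup can be tried in isolation on the raw values of the first book, copied from the printed output. This is a small sketch using only built-in string and list operations. The last line shows one likely explanation for the stray `Â`, stated here as an assumption: the page's bytes are UTF-8 but were decoded as ISO-8859-1.

```python
# Raw values for the first book, copied from the printed output
raw_price = "Â£51.77"
raw_rating = ["star-rating", "Three"]
raw_stock = "\n\n    \n        In stock\n    \n"

price = raw_price.replace("Â", "")  # drop the stray character
rating = raw_rating[1]              # keep only the rating word
stock = raw_stock.strip()           # remove leading/trailing whitespace
print([price, rating, stock])  # ['£51.77', 'Three', 'In stock']

# strip() can also take the characters to remove as an argument
print("£51.77".strip("£"))  # 51.77

# UTF-8 bytes for '£' read as ISO-8859-1 produce the two characters 'Â£'
print("£".encode("utf-8").decode("iso-8859-1"))  # Â£
```

If that explanation holds for this site, setting `results.encoding = "utf-8"` before reading `results.text` would be another way to avoid the `Â`; the `.replace()` approach used in this lesson works either way.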


In [13]:
# Find all article elements with class 'product_pod'
articles = soup.find_all('article', class_='product_pod')

# Initialize an empty list to store book information
book_info_list = []

# Iterate through each article to extract information and store in a list
for article in articles:
    # Extract title
    title = article.find('h3').find('a')['title']

    # Extract product price
    price = article.find('p', class_='price_color').text.replace("Â", "")
    
    # Extract star rating, keeping only the rating word, e.g. 'Three'
    rating = article.find('p', class_='star-rating')['class'][1]

    # Extract stock status
    stock = article.find('p', class_='instock availability').text.strip()

    # Store the information in a list
    book_info = [title, price, rating, stock]

    # Append the book information to the main list
    book_info_list.append(book_info)

# Print the list of lists containing book information
for book_info in book_info_list:
    print(book_info)

['A Light in the Attic', '£51.77', 'Three', 'In stock']
['Tipping the Velvet', '£53.74', 'One', 'In stock']
['Soumission', '£50.10', 'One', 'In stock']
['Sharp Objects', '£47.82', 'Four', 'In stock']
['Sapiens: A Brief History of Humankind', '£54.23', 'Five', 'In stock']
['The Requiem Red', '£22.65', 'One', 'In stock']
['The Dirty Little Secrets of Getting Your Dream Job', '£33.34', 'Four', 'In stock']
['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', '£17.93', 'Three', 'In stock']
['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', '£22.60', 'Four', 'In stock']
['The Black Maria', '£52.15', 'One', 'In stock']
['Starving Hearts (Triangular Trade Trilogy, #1)', '£13.99', 'Two', 'In stock']
["Shakespeare's Sonnets", '£20.66', 'Four', 'In stock']
['Set Me Free', '£17.46', 'Five', 'In stock']
["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", '£52.29', 'Five', 'In stock']
['Rip it Up and Start
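
With the data cleaned, the remaining project step is writing it to a CSV file. The sketch below is one way to do that with Python's built-in `csv` module. The filename `books.csv` and the header names are assumptions, and the two sample rows stand in for the full `book_info_list` so that the example runs on its own.

```python
import csv

# Two sample rows in the same shape as book_info_list
book_info_list = [
    ["A Light in the Attic", "£51.77", "Three", "In stock"],
    ["Tipping the Velvet", "£53.74", "One", "In stock"],
]

# 'books.csv' is a hypothetical filename; newline="" avoids blank rows on Windows
with open("books.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["title", "price", "rating", "stock"])  # header row
    writer.writerows(book_info_list)                         # one row per book

# Read the file back to check what was written
with open("books.csv", newline="", encoding="utf-8") as file:
    rows = list(csv.reader(file))

print(rows[0])    # ['title', 'price', 'rating', 'stock']
print(len(rows))  # 3
```

In the notebook itself you would skip the sample rows and pass the `book_info_list` built in the cell above. Titles that contain commas are quoted automatically by `csv.writer`.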

In [18]:
# Save the raw HTML of the page ('results' is the response fetched earlier)
with open('scrape.txt', 'w', encoding='utf-8') as file:
    file.write(results.text)