Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email rdb104@case.edu.<br />
___

# Web Scraping: Scraping Sets of Related Information and Saving it to a CSV file.

**Description:** This lesson explores how to collect multiple connected data points from a web page and store them in a csv file.  

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 30 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** HTML Structure

**Data Format:** `html`, `txt`, `py` 

**Libraries Used:** `requests` `BeautifulSoup` `csv`
___

## Project #3: Scraping sets of related information into CSV files.

Build a scraper that collects multiple data points about each book based upon specific criteria.

In this project you will:
1. Determine what data we are able to collect about each book listed in the store.  
2. Use the `Inspect` tool in your web browser to identify the web page structure for those pieces of data.
3. Understand and use a Python script to crawl the web page and extract only the data that meets the classification criteria we identified in steps 1 and 2.
4. Write the data to a csv file. 



### What data is available?

If we look at this screenshot captured from the web page, we can see that there are several interesting pieces of data in addition to the title.  There is the price, the rating, and whether or not the book is in stock.  We could also save the cover images or collect the links to those images, but let's leave those out for right now.  So we are going to try to collect four pieces of data for each book: title, price, rating, and stock status.

![title](img/booklisting.png)    

We'll get started the same way we have in the last few lessons, importing packages! 

In [1]:
from bs4 import BeautifulSoup
import requests  #https://requests.readthedocs.io/

Next, just like before, use requests to get the content of the website, store it in a variable, and then use BeautifulSoup to parse that content into the "soup" we can analyze.  

In [None]:
# 1. Fetch the page
results = requests.get("https://books.toscrape.com/")

# 2. Get the page content and assign it to the variable 'content'
content = results.text

# 3. Create the soup
soup = BeautifulSoup(content, "lxml")

Now let's take a look at the html structure of the page so we can determine how to identify each piece of information to scrape.  Right-click on a book and use the `Inspect` option to open up the Inspector panel and take a look at the html.  You may need to expand the html using the grey arrows at the beginning of the elements in order to see all the relevant information.  It should look like the image below.

Each book's information is presented in an `article` element with `class=product_pod`.  We can use `find_all` to find all of these `article` elements and then scrape the data we need from each one.

`articles = soup.find_all('article', class_='product_pod')`

As we determined in the last lesson, the title is contained in the `h3` element.

`title = article.find('h3').find('a')['title']`

We can see the price data is contained in a `p` element with `class=price_color`.    

`price = article.find('p', class_='price_color').text`

The rating is contained in another `p` element with `class="star-rating NUMBER"`.  We don't want the whole `p` element, just the `class`, so we add `['class']` to this line to limit what we scrape.

`rating = article.find('p', class_='star-rating')['class']`

For stock status, we are using the same approach of identifying the element by its `class`.

`stock = article.find('p', class_='instock availability').text`


![title](img/htmlscrape.png)  
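
To see how these four lookups behave without fetching the live page, here is a minimal sketch run against a hand-written snippet. The markup in `sample_html` is a simplified assumption modeled on the structure described above, not a copy of the real page, and the `sample_` names are used only for illustration. It uses Python's built-in `html.parser`, so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on a single book listing
sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
    In stock
  </p>
</article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_article = sample_soup.find("article", class_="product_pod")

# The same four lookups described above
sample_title = sample_article.find("h3").find("a")["title"]
sample_price = sample_article.find("p", class_="price_color").text
sample_rating = sample_article.find("p", class_="star-rating")["class"]
sample_stock = sample_article.find("p", class_="instock availability").text

print(sample_title)
print(sample_price)
print(sample_rating)        # a list of every class on the element
print(repr(sample_stock))   # repr() makes the surrounding whitespace visible
```

The rating comes back as a list of classes and the stock text still carries line breaks and spaces. Both are cleaned up later in this lesson.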

We also need to create an empty list to store all this data in.  

The line `book_info_list = []` initializes an empty list called `book_info_list`.

Then we use a `for` loop to go through each book and scrape the data we need.  We scrape the title, price, rating, and stock status and assign them to the variables `title`, `price`, `rating`, and `stock`.

`book_info = [title, price, rating, stock]` creates a list with the four fields of data that were collected for a book.  That list is added to `book_info_list` at the end of each pass through the loop with the `append` method: `book_info_list.append(book_info)`.

So we end up with a list of lists.  Each book has a list of data.  And the `book_info_list` is a list of all those lists.
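
If the idea of a list of lists is new, this small sketch may help. The book values are made up purely for illustration, and `example_rows` and `example_book` are hypothetical names chosen so they do not clash with the variables used by the scraper.

```python
# Made-up values for two books, only to illustrate the structure
example_rows = []

for title, price, rating, stock in [
    ("Book A", "£10.00", "Three", "In stock"),
    ("Book B", "£12.50", "Five", "In stock"),
]:
    # One inner list per book
    example_book = [title, price, rating, stock]
    # Add the inner list to the outer list
    example_rows.append(example_book)

print(len(example_rows))    # number of books collected
print(example_rows[0])      # the whole inner list for the first book
print(example_rows[1][0])   # the title of the second book
```

The first index picks a book and the second index picks a field within that book.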

Run the code block below and take a look at the results to see the list of lists.

In [None]:
# Initialize an empty list to store book information
book_info_list = []

# Find all article elements with class 'product_pod'
articles = soup.find_all('article', class_='product_pod')

# Iterate through each article to extract information and store in a list
for article in articles:
    # Extract title
    title = article.find('h3').find('a')['title']

    # Extract product price
    price = article.find('p', class_='price_color').text

    # Extract star rating (returns the element's list of classes)
    rating = article.find('p', class_='star-rating')['class']

    # Extract stock status
    stock = article.find('p', class_='instock availability').text

    # Store the information in a list
    book_info = [title, price, rating, stock]

    # Append the book information to the main list
    book_info_list.append(book_info)

# Print the list of lists containing book information
for book_info in book_info_list:
    print(book_info)

Excellent!  We got a lot of useful information.  However, we can see that there are some problems with the data that we are scraping that need to be cleaned up.  

First, the price has an unusual `Â` character at the beginning.  This character appears because of a text encoding mismatch: the bytes of the page were decoded using the wrong character encoding.  We can add a `.replace()` method to the line of code that scrapes the price.  `replace()` takes two arguments.  The first is the character you want to replace.  The second is what you want to replace it with.  So if we edit that line of code to add `replace()`, we can take the `Â` and replace it with nothing, represented by the empty quotation marks `""`.

`price = article.find('p', class_='price_color').text.replace("Â", "")`
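
For the curious, the sketch below reproduces the stray character using only built-in string methods. It assumes the page's bytes are UTF-8 and were decoded as Latin-1, which is the usual cause of an `Â` appearing in front of a pound sign. The price value is just an example.

```python
# In UTF-8 the pound sign is stored as two bytes
utf8_bytes = "£51.77".encode("utf-8")

# Decoding those bytes with the wrong encoding yields an extra character
wrong = utf8_bytes.decode("latin-1")

# Decoding with the matching encoding yields the original text
right = utf8_bytes.decode("utf-8")

print(wrong)
print(right)

# The replace() approach from this lesson removes the extra character
print(wrong.replace("Â", ""))
```

If that assumption holds for this site, an alternative to `replace()` would be to set `results.encoding = "utf-8"` before reading `results.text`, so the text is decoded correctly in the first place.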

Second, the rating contains too much information: `['star-rating', 'Two']`.  We don't need the first part of the data, only the number in the second part.  The `[]` indicate that this is a list.  It is a short list with only two items.  We can indicate which item we'd like to keep using its index in the list.  In Python, counting starts at zero.  So the first item in the list would be `[0]` and the second item in the list would be `[1]`.  If we add this index number to our line of code, we get only the second item in the list from the `p` element.

`rating = article.find('p', class_='star-rating')['class'][1]`
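
Here is a short sketch of that indexing step on its own, using the example list from above. The `word_to_number` lookup is a hypothetical extra, not part of this lesson's scraper, that shows one way the rating word could be turned into a number if you wanted to sort or average ratings later.

```python
# The list of classes from the example above
rating_classes = ["star-rating", "Two"]

# Index 0 is the first item, index 1 is the second
rating_word = rating_classes[1]
print(rating_word)

# Hypothetical lookup table for converting the word to an integer
word_to_number = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
rating_number = word_to_number[rating_word]
print(rating_number)
```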

Third, the stock status information is surrounded by line breaks and extra spaces.  We can remove them by using `.strip()` on the text.  You can specify which leading and trailing characters you'd like to remove from a string.  In our case, we've left the parentheses empty so the strip method will use the default behavior, which is to remove any white space.  The `\n` characters are line breaks and will also be removed as white space.

`stock = article.find('p', class_='instock availability').text.strip()`
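
A quick sketch of `strip()` on its own. The `raw_stock` string is an approximation of what the scraped text looks like, and the asterisk example is made up to show what passing an argument does.

```python
# Approximation of the scraped stock text, with line breaks and spaces
raw_stock = "\n\n    \n        In stock\n    \n"

# With no argument, strip() removes leading and trailing white space
clean_stock = raw_stock.strip()

print(repr(raw_stock))
print(repr(clean_stock))

# With an argument, strip() removes those characters from both ends instead
starred = "***In stock***".strip("*")
print(starred)
```

Note that `strip()` only affects the ends of the string, so the space between "In" and "stock" is kept.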

Now we can add all these changes to the original block of code and we should get a much cleaner set of data.  Run the next code block and see what you get.

In [None]:
# Initialize an empty list to store book information
book_info_list = []

# Find all article elements with class 'product_pod'
articles = soup.find_all('article', class_='product_pod')

# Iterate through each article to extract information and store in a list
for article in articles:
    # Extract title
    title = article.find('h3').find('a')['title']

    # Extract product price
    price = article.find('p', class_='price_color').text.replace("Â", "")
    
    # Extract star rating (the second class, e.g. 'Three')
    rating = article.find('p', class_='star-rating')['class'][1]

    # Extract stock status
    stock = article.find('p', class_='instock availability').text.strip()

    # Store the information in a list
    book_info = [title, price, rating, stock]

    # Append the book information to the main list
    book_info_list.append(book_info)

# Print the list of lists containing book information
for book_info in book_info_list:
    print(book_info)

Alright!  That is much cleaner data. 

When you have a list of lists, it is easy to store it in a csv, or comma separated values, file.  The list of data about each book becomes a new line in the csv file.

First we import the `csv` package so Python can work with csv files.  Then we use the same `with` statement to handle creating and writing to the file.  The `writerows` method of `csv.writer` will write each list in our list of lists to the csv file as a new line.

In [None]:
import csv

with open('book_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(book_info_list)
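
As an optional variation, the sketch below adds a header row with `writerow` before writing the data, then reads the file back to check it. The rows are made-up values, the column names are only a suggestion, and the file is written to a temporary folder under the hypothetical name `example_books.csv` so it does not overwrite `book_data.csv`.

```python
import csv
import os
import tempfile

# Made-up rows in the same shape as book_info_list
example_rows = [
    ["Book A", "£10.00", "Three", "In stock"],
    ["Book B, Part 2", "£12.50", "Five", "In stock"],
]

# Write to a temporary folder so the lesson's own file is left alone
example_path = os.path.join(tempfile.mkdtemp(), "example_books.csv")

with open(example_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # writerow writes a single line, used here for the column names
    writer.writerow(["title", "price", "rating", "stock"])
    # writerows writes one line per inner list
    writer.writerows(example_rows)

# Read the file back to confirm what was written
with open(example_path, newline="", encoding="utf-8") as f:
    rows_read = list(csv.reader(f))

print(rows_read[0])
print(rows_read[2])
```

The second title contains a comma, and the `csv` module quotes it automatically so it still comes back as a single field.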

Click on the `book_data.csv` file in the file directory on the left to open it and take a look at our exported data. 

You might notice that we only scraped the data for 20 books.  There are a thousand books on this website, but we only scraped data from the first page.  In the next tutorial we will learn how to scrape all the pages on the [Books to Scrape](https://books.toscrape.com/) website.  