Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email rdb104@case.edu.<br />
___

# Web Scraping: Making a Request and Receiving a Response

**Description:** This lesson introduces the basic web scraping workflow using the `requests` library for Python.  

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 15 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** HTML Structure

**Data Format:** `html`, `txt`, `py` 

**Libraries Used:** `requests`, `beautifulsoup4` (imported as `bs4`), `lxml`, `csv`
___

## Scraping Data from Multiple Pages

In this project you will:
1. Use the `Inspect` tool to explore how <a href="https://books.toscrape.com/">Books to Scrape</a> handles navigation between pages.
2. Understand and use a Python script that directs the web scraper to navigate to the next page if there is one, scrape the specified data from it, and repeat until there are no more pages.
3. Examine the data to see if our scraper worked the way we think it should.
4. Write the data to a CSV file.


### Scraping multiple pages using `requests`

We're going to be using much of the same code we used in the last lesson, as the data we are trying to collect is the same. However, we are going to wrap that code in some navigational instructions for moving from one page to the next.


![title](img/next.png)
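
One way to picture the navigation step: the Next button shown above is an ordinary link, and a link's `href` is often written relative to the page it sits on. The sketch below uses only the standard library's `urllib.parse.urljoin` to show how a relative `href` resolves to a full address. The `href` values are assumptions made for illustration; use the `Inspect` tool to confirm what the site actually serves.

```python
from urllib.parse import urljoin

# Hypothetical (page URL, href of the Next link) pairs; confirm them with Inspect.
examples = [
    ("https://books.toscrape.com/", "catalogue/page-2.html"),
    ("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"),
]

# Resolve each relative href against the page it was found on.
next_urls = [urljoin(page_url, href) for page_url, href in examples]

for url in next_urls:
    print(url)
```

The script later in this lesson takes a simpler route: it builds each address from a page number and only checks whether a Next link exists.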

In [1]:
from bs4 import BeautifulSoup
import requests  #https://requests.readthedocs.io/

In [2]:
# 1. Fetch the page
results = requests.get("https://books.toscrape.com/")

# 2. Get the page content and assign it to the variable 'content'
content = results.text

# 3. Create the soup
soup = BeautifulSoup(content, "lxml")

In [None]:
# Find all article elements with class 'product_pod'
articles = soup.find_all('article', class_='product_pod')

# Initialize an empty list to store book information
book_info_list = []

# Iterate through each article to extract information and store in a list
for article in articles:
    # Extract title
    title = article.find('h3').find('a')['title']

    # Extract product price
    price = article.find('p', class_='price_color').text.replace("Â", "")
    
    # Extract star rating (the second class name, e.g. 'Three')
    rating = article.find('p', class_='star-rating')['class'][1]

    # Extract stock status
    stock = article.find('p', class_='instock availability').text.strip()

    # Store the information in a list
    book_info = [title, price, rating, stock]

    # Append the book information to the main list
    book_info_list.append(book_info)

# Print the list of lists containing book information
for book_info in book_info_list:
    print(book_info)
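
The `.replace("Â", "")` call above is a workaround for a character-encoding mismatch. Below is a minimal sketch of how a stray `Â` can arise, assuming the page arrives as UTF-8 bytes but is decoded as Latin-1. That is an assumption about the cause, not something this lesson confirms, and the price is a made-up sample value.

```python
# The pound sign takes two bytes in UTF-8.
pound_bytes = "£".encode("utf-8")

# Decoding those bytes with the wrong codec yields two characters.
wrong = pound_bytes.decode("latin-1")

# Decoding with the matching codec yields the original character.
right = pound_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))

# The workaround strips the extra character after the fact (sample price).
garbled_price = "£51.77".encode("utf-8").decode("latin-1")
cleaned = garbled_price.replace("Â", "")
print(cleaned)
```

If this is indeed the cause, setting `results.encoding = "utf-8"` before reading `results.text` should avoid the stray character in the first place.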

With the `requests` package imported, we can use the methods built into it. The one most commonly used for web scraping is `get`.

`requests.get` sends an HTTP `GET` request to the web address you specify and returns the server's response. This simple example retrieves everything the web server returns for that URL; `requests` also has tools for refining a request, which we will explore in a later lesson.

Try running the code below.
What response do you get?

In [None]:
base_url = 'https://books.toscrape.com/catalogue/'  

# Initialize an empty list to store scraped data
book_info_list = []

page_number = 1

while True:
    # Construct the URL for the current page
    url = f'{base_url}page-{page_number}.html'
    
    # Fetch the page content
    results = requests.get(url)
    
    if results.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(results.content, 'html.parser')
        
        # Find all articles with class 'product_pod'
        articles = soup.find_all('article', class_='product_pod')
        
        # Scraping logic for each article
        for article in articles:
            # Extract title
            title = article.find('h3').find('a')['title']

            # Extract product price
            price = article.find('p', class_='price_color').text.replace("Â", "")
            
            # Extract star rating (the second class name, e.g. 'Three')
            rating = article.find('p', class_='star-rating')['class'][1]

            # Extract stock status
            stock = article.find('p', class_='instock availability').text.strip()

            # Store the information in a list
            book_info = [title, price, rating, stock]

            # Append the book information to the main list
            book_info_list.append(book_info)
        
        # Find the 'Next' link
        next_link = soup.find('li', class_='next')
        
        if next_link:
            # Update page number for the next iteration
            page_number += 1
        else:
            # No 'Next' link found, exit the loop
            break
    
    else:
        print(f"Failed to fetch page {page_number}. Status code: {results.status_code}")
        break

# Process book_info_list as needed (e.g., save to a file, further analysis)
# For example, you can print the scraped data:
for book_info in book_info_list:
    print(book_info)
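
Before saving, it can help to check that the scraped rows look the way we expect. The sketch below uses a small hand-made sample in place of the real `book_info_list`; the helper name `summarize_rows` and the sample values are hypothetical.

```python
from collections import Counter

def summarize_rows(rows, expected_fields=4):
    """Return basic checks for a list of [title, price, rating, stock] rows."""
    # Rows that do not have exactly the expected number of fields.
    malformed = [row for row in rows if len(row) != expected_fields]
    well_formed = [row for row in rows if len(row) == expected_fields]

    # How many books carry each star rating.
    ratings = Counter(row[2] for row in well_formed)

    # Titles that appear more than once (a hint that a page was scraped twice).
    title_counts = Counter(row[0] for row in well_formed)
    duplicates = [title for title, count in title_counts.items() if count > 1]

    return {
        "total": len(rows),
        "malformed": malformed,
        "ratings": dict(ratings),
        "duplicate_titles": duplicates,
    }

# Made-up sample rows standing in for the scraped data.
sample_rows = [
    ["Book A", "£10.00", "Three", "In stock"],
    ["Book B", "£12.50", "Five", "In stock"],
    ["Book C", "£8.99", "Three", "In stock"],
]

summary = summarize_rows(sample_rows)
print(summary)
```

With the real data, `summarize_rows(book_info_list)["total"]` can be compared against the number of results the site reports on its first page.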

In [None]:
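import csv

# Write a header row followed by one row per book
with open('all_book_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'price', 'rating', 'stock'])
    writer.writerows(book_info_list)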
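
To confirm that a file was written as intended, it can be read back with `csv.reader`. This sketch writes two made-up rows to a temporary file so that it does not touch `all_book_data.csv`; the sample rows and the file name `sample_books.csv` are assumptions.

```python
import csv
import os
import tempfile

# Made-up rows, including a title with a comma to exercise CSV quoting.
sample_rows = [
    ["Book A", "£10.00", "Three", "In stock"],
    ["Book, with a comma", "£12.50", "Five", "In stock"],
]

# Work inside a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_books.csv")

    # Write the rows, then read them straight back.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(sample_rows)

    with open(path, newline="", encoding="utf-8") as f:
        rows_read = list(csv.reader(f))

# True when the round trip preserved every field.
print(rows_read == sample_rows)
```

Passing the same `encoding` when writing and reading keeps characters such as `£` intact regardless of the operating system's default.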