Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email rdb104@case.edu.<br />
___

# Web Scraping: Scraping Data from Multiple Pages

**Description:** This lesson introduces the basic web scraping workflow using the `requests` library for Python.  

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 30 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** HTML Structure

**Data Format:** `html`, `txt`, `py` 

**Libraries Used:** `requests` `BeautifulSoup`
___

## Scraping Data from Multiple Pages

In this project you will:
1. Use the `Inspect` tool to explore how <a href="https://books.toscrape.com/">Books to Scrape</a> handles navigation between pages.
2. Understand and use a Python script that directs the web scraper to navigate to the next page if there is one, scrape the specified data from it, and repeat this process until there are no more pages.
3. Examine the data to see if our scraper worked the way we think it should.
4. Write the data to a csv file. 

Let's import the packages we need so we can get started.


In [None]:
from bs4 import BeautifulSoup
import requests  #https://requests.readthedocs.io/

### Scraping multiple pages using `requests`.

We're going to be using much of the same code we used in the last lesson, as the data we are trying to collect is the same.  However, we are going to wrap that code in some navigational instructions for Python to use so it can visit and scrape all the pages.   

If you scroll to the bottom of the page, right-click the Next button, and choose Inspect, you will see the HTML for the link to the next page.  

![title](img/next.png)

We can see that each page has a URL that ends in `page-NUMBER.html`. We are going to create two variables that we can use to piece together the URL for each page we want to scrape. 

`base_url` will be the part of the url that doesn't change.  
`page_number` will start at 1 and after we scrape a page we can add 1 to the `page_number` and use that new number to create the next url.  

Run the code cell below to create both of these variables.  

In [None]:
base_url = 'https://books.toscrape.com/catalogue/'  

page_number = 1
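
As a quick check of how these two variables combine, the sketch below builds the first three page URLs with the same f-string pattern the scraper uses later. The `build_page_url` helper is a hypothetical name introduced only for illustration, and nothing here makes a network request.

```python
base_url = 'https://books.toscrape.com/catalogue/'

def build_page_url(base_url, page_number):
    """Join the fixed part of the URL with the changing page number."""
    return f'{base_url}page-{page_number}.html'

# Build the URLs for pages 1, 2 and 3 without fetching them
first_three = [build_page_url(base_url, n) for n in range(1, 4)]

for url in first_three:
    print(url)
```

Each printed URL differs only in the number after `page-`, which is why adding 1 to `page_number` is enough to move to the next page.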

We are using a `while True` statement, a loop construct that keeps executing its block of code because its condition (`True`) is always true. It creates an infinite loop unless something inside the loop causes it to break or exit.

In the context of web scraping, using `while True` allows you to scrape multiple pages by repeatedly fetching and parsing subsequent pages until a specific condition is met.  In our example, the loop keeps fetching and scraping pages until there is no "Next" link or an error occurs while fetching the page, at which point the loop breaks and exits. 

1. `while True:` creates an infinite loop because the condition provided (`True`) is always true.
2. Inside the loop, a URL for the current page is constructed based on the `page_number`.
3. The script attempts to fetch the page content using `requests.get()`.
4. If the HTTP response status code is 200 (indicating a successful request), the HTML content of the page is parsed using `BeautifulSoup`, and the necessary scraping operations are performed.
5. Within the loop, there's usually a condition check to determine whether to update the `page_number` for the next iteration or to break the loop (for example, if there's no "Next" link or any other condition that signifies the end of the scraping process).
6. If the response status code is not 200, it could mean there was an error fetching the page, so the loop breaks.
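
The steps above can be tried out without touching the network. The sketch below is a toy version of the same control flow: `fake_site` is a made-up dictionary standing in for the website, where each page number maps to the items on that page and a flag saying whether a "Next" link exists. All names and data in it are assumptions for illustration.

```python
# Hypothetical stand-in for a website:
# page number -> (items on that page, whether the page has a "Next" link)
fake_site = {
    1: (['a', 'b'], True),
    2: (['c', 'd'], True),
    3: (['e'], False),
}

collected = []
page_number = 1

while True:
    page = fake_site.get(page_number)

    if page is None:
        # Comparable to a non-200 response: report the problem and stop
        print(f'Failed to fetch page {page_number}.')
        break

    items, has_next = page
    collected.extend(items)  # the "scraping" step

    if has_next:
        page_number += 1     # move on to the next page
    else:
        break                # no "Next" link, so exit the loop

print(collected)
print(page_number)
```

The loop stops on page 3 because that page has no "Next" flag, which mirrors how the real scraper stops on the last catalogue page.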

`next_link = soup.find('li', class_='next')` uses Beautiful Soup (`soup`) to find an `<li>` element with the class `next`. After using the `Inspect` tool, we know this is how the site marks up the Next button.

`if next_link:` checks whether `next_link` contains a valid result. If `next_link` is not `None`, it means that a link to the next page has been found.

`page_number += 1` runs if a valid link to the next page is found. It adds 1 to the `page_number` variable, so the script will visit a new page on the next iteration of the loop.

`else: break` runs if no 'Next' link is found on the current page (i.e., `next_link` is `None`). The `break` statement exits the `while True` loop, stopping the scraping process because there are no more pages to scrape.

Overall, this section of code is responsible for checking if a 'Next' link exists on the current page. If it does, it updates the page_number variable to move to the next page for scraping. If there's no 'Next' link, the loop breaks, terminating the scraping process as it indicates that there are no more pages to scrape.
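
To see the 'Next' check in isolation, the sketch below runs the same `soup.find('li', class_='next')` call on two small HTML strings. The markup is a simplified, hand-written approximation of a pager (an assumption, not copied from the site), and `has_next_page` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified pager markup: one page with a "next" item, one without
page_with_next = (
    '<ul class="pager">'
    '<li class="current">Page 1 of 2</li>'
    '<li class="next"><a href="page-2.html">next</a></li>'
    '</ul>'
)
last_page = (
    '<ul class="pager">'
    '<li class="previous"><a href="page-1.html">previous</a></li>'
    '<li class="current">Page 2 of 2</li>'
    '</ul>'
)

def has_next_page(html):
    """Return True if the HTML contains an <li> element with class 'next'."""
    soup = BeautifulSoup(html, 'html.parser')
    # find() returns None when no matching element exists
    return soup.find('li', class_='next') is not None

print(has_next_page(page_with_next))
print(has_next_page(last_page))
```

Because `find()` returns `None` when nothing matches, the scraper's `if next_link:` test is enough to tell the last page apart from the others.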

Run the code block below and see if you get what you expect.  It is scraping info for 1000 books, so it may take up to 60 seconds to complete.

In [None]:
# Initialize an empty list to store scraped data
book_info_list = []

while True:
    # Construct the URL for the current page
    url = f'{base_url}page-{page_number}.html'
    
    # Fetch the page content
    results = requests.get(url)
    
    if results.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(results.content, 'html.parser')
        
        # Find all articles with class 'product_pod'
        articles = soup.find_all('article', class_='product_pod')
        
        # Scraping logic for each article
        for article in articles:
            # Extract title
            title = article.find('h3').find('a')['title']

            # Extract product price (strip any stray 'Â' left by an encoding mismatch)
            price = article.find('p', class_='price_color').text.replace("Â", "")
            
            # Extract star rating (the second class name on the element, e.g. 'Three')
            rating = article.find('p', class_='star-rating')['class'][1]

            # Extract stock status
            stock = article.find('p', class_='instock availability').text.strip()

            # Store the information in a list
            book_info = [title, price, rating, stock]

            # Append the book information to the main list
            book_info_list.append(book_info)
        
        # Find the 'Next' link using the class of the li element.  
        next_link = soup.find('li', class_='next')
        
        if next_link:
            # Update page number for the next iteration
            page_number += 1
        else:
            # No 'Next' link found, exit the loop
            break
    
    else:
        print(f"Failed to fetch page {page_number}. Status code: {results.status_code}")
        break

# Print the scraped data to see if it worked.
for book_info in book_info_list:
    print(book_info)

That looks good!!  We are getting more than the 20 books on the first page, and the data is still clean as a result of all the changes we made in Project #3.
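
Beyond reading the printed output, a couple of quick checks can help confirm the scraper behaved as expected. The sketch below counts the rows and flags any row that does not have the four expected fields. It runs on a small hand-made sample (hypothetical rows in the same `[title, price, rating, stock]` shape), and `check_rows` is an illustrative helper name.

```python
def check_rows(rows, expected_fields=4):
    """Return the row count and a list of rows with the wrong number of fields."""
    bad_rows = [row for row in rows if len(row) != expected_fields]
    return len(rows), bad_rows

# Hypothetical sample data; the last row is deliberately incomplete
sample_rows = [
    ['Example Book One', '£10.00', 'Three', 'In stock'],
    ['Example Book Two', '£20.00', 'Five', 'In stock'],
    ['Broken Row', '£5.00'],
]

count, bad_rows = check_rows(sample_rows)
print(f'{count} rows, {len(bad_rows)} malformed')
```

In the notebook you could call `check_rows(book_info_list)` instead. If every page was scraped, the count should match the 1000 books mentioned earlier.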

Let's write it to a file so we can use it for some data visualization in Project #5.

We're going to use the same csv library and `with open` command that we did in Project #3.  

In [None]:
import csv

with open('all_book_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(book_info_list)
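
The file written above contains only data rows. If a header row would be useful later, one possible variation is sketched below: it writes a header line first, then reads the file back to confirm what was saved. The column names, the sample rows, and the temporary file location are all assumptions made for this example.

```python
import csv
import os
import tempfile

header = ['title', 'price', 'rating', 'stock']  # assumed column names
sample_rows = [                                 # hypothetical sample data
    ['Example Book One', '£10.00', 'Three', 'In stock'],
    ['Example Book Two', '£20.00', 'Five', 'In stock'],
]

# Write to a temporary folder so the real output file is left alone
path = os.path.join(tempfile.mkdtemp(), 'sample_book_data.csv')

with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(header)        # one header line
    writer.writerows(sample_rows)  # then all data rows

# Read the file back to check its contents
with open(path, newline='', encoding='utf-8') as f:
    rows_read = list(csv.reader(f))

print(rows_read[0])
print(len(rows_read) - 1, 'data rows')
```

Reading the file back is a cheap way to confirm that the number of rows on disk matches the number of rows that were scraped.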

If you want to explore how to use the results of our web-scraping in data visualization, go back to the starting page and take a look at Project #5.

