# Multiple Book Scraping
This notebook demonstrates asynchronous scraping.
The site used for this purpose is
**Books to Scrape**, which you can find [here](http://books.toscrape.com/index.html). It exists solely as a scraping sandbox, so you can send it as many requests as you want without anything to worry about.
<div style="text-align: center;">
<img src="assets/async_python_scraping_small.jpg" width="auto" height="auto">
</div>
Let us get going by importing our dependencies.

In [1]:
from bs4 import BeautifulSoup, Tag
import pandas as pd

import aiohttp
import asyncio

from typing import Any, List


### Function Definitions

Besides our usual imports, here
we define three important functions:
1. `scrape_page(content: bytes) -> List`: This function takes the content of a web page as input and extracts book-related data, returning a list with one `[title, price, star rating]` row per book.

2. `fetch_page(session: aiohttp.ClientSession, url: str) -> bytes`: This asynchronous function fetches the content of a web page using aiohttp. It sends an HTTP GET request to a given URL and returns the page's content as bytes. If an `aiohttp.ClientError` occurs, it prints the error and returns empty bytes instead.

3. `get_data(session: aiohttp.ClientSession, urls: List[str]) -> List`: This function collects data from multiple web pages asynchronously, taking a session and a list of URLs as input. In more detail:
- It initializes an empty list called `all_data` to store the collected data.
- It creates a list of asynchronous tasks (`tasks`) for fetching pages using the `fetch_page` function for each URL in the `urls` list.
- It uses `asyncio.gather(*tasks)` to execute these tasks concurrently and gather the results (page content) into the `pages` list.
- For each page content in the `pages` list, it calls the `scrape_page` function to extract book data and extends the `all_data` list with this data.
- Finally, it returns the `all_data` list containing data from all the pages.
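
The gather step described above can be tried without touching the network. The following is a minimal sketch, not the notebook's own code: `fake_fetch` and `gather_pages` are hypothetical stand-ins that sleep instead of performing HTTP requests. It illustrates two points: `asyncio.gather` returns results in the order the tasks were passed in, regardless of which finishes first, and empty results (the error convention used by `fetch_page`) can be filtered out before parsing.

```python
import asyncio
from typing import List


# Hypothetical stand-in for fetch_page: sleeps instead of doing network I/O.
async def fake_fetch(url: str, delay: float) -> bytes:
    await asyncio.sleep(delay)
    # Mimic fetch_page's error convention: empty bytes mean the fetch failed.
    if "bad" in url:
        return b""
    return f"<html>{url}</html>".encode()


async def gather_pages(urls: List[str]) -> List[bytes]:
    # Later URLs get shorter delays, so they finish first...
    tasks = [fake_fetch(url, 0.03 - 0.01 * i) for i, url in enumerate(urls)]
    # ...yet gather returns the results in the order the tasks were passed in.
    pages = await asyncio.gather(*tasks)
    # Drop failed fetches before any parsing step.
    return [page for page in pages if page]


# In a script, asyncio.run() starts the event loop.
pages = asyncio.run(gather_pages(["page-1", "bad-page", "page-3"]))
print(pages)
```

Inside a notebook cell, where an event loop is already running, the last step would be `pages = await gather_pages([...])` rather than `asyncio.run(...)`.
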

In [9]:


# Scrapes the required elements from the page content provided
def scrape_page(content: bytes) -> List:

    # Parse the page's HTML content
    soup: BeautifulSoup = BeautifulSoup(content, 'html.parser')
    
    # Get the books from the list; a failed fetch yields empty content with no <ol>
    ol = soup.find('ol')
    if ol is None:
        return []
    articles = ol.find_all('article', class_='product_pod')
    data = []
    
    # Extract the fields needed
    for article in articles:
        image = article.find('img')
        title = image.attrs['alt']  # the title is read from the image's alt text
        star_elem = article.find('p')
        star_num = star_elem.attrs['class'][1]  # second class name holds the rating word
        price = article.find('p', class_='price_color').get_text()
        price_float = float(price[1:])  # drop the leading currency symbol
    
        data.append([title, price_float, star_num])
        
    return data


# Fetches page, called asynchronously by "parent" function
async def fetch_page(session: aiohttp.ClientSession, url: str) -> bytes:
    headers = {'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/117.0",
               "X-Amzn-Trace-Id": "Root=1-65043b46-31bc2efb2ba67202432972da",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
               "Accept-Encoding": "gzip, deflate",
               "Accept-Language": "en-US,en;q=0.5",
               "Upgrade-Insecure-Requests": "1"
              }

    try:
        async with session.get(url, headers=headers) as response:
            response.raise_for_status()
            return await response.read()
    except aiohttp.ClientError as e:
        # Return empty bytes to indicate an error
        print(f"HTTP error occurred for URL {url}: {e}")
        return b''

        
    
# Gets data by fetching and scraping the pages asynchronously
async def get_data(session: aiohttp.ClientSession, urls: List[str]) -> List:
    all_data = []
    tasks = [fetch_page(session, url) for url in urls]
    pages = await asyncio.gather(*tasks)
    
    for page in pages:
        data = scrape_page(page)
        all_data.extend(data)
        
    return all_data
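
The field extraction in `scrape_page` relies on two string-level assumptions: the price text starts with exactly one currency character, and the rating word is the second class name on the rating element. The sketch below shows a more defensive variant of both steps. The helpers `parse_price` and `star_word` are hypothetical and not part of the notebook, and the sample strings are made up for illustration.

```python
import re
from typing import List


def parse_price(text: str) -> float:
    # Take the first decimal number and ignore whatever symbols precede it.
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


def star_word(classes: List[str]) -> str:
    # Return the first class name that is not the generic 'star-rating' marker.
    for name in classes:
        if name != "star-rating":
            return name
    raise ValueError(f"no rating word in {classes!r}")


print(parse_price("£51.77"))
print(star_word(["star-rating", "Three"]))
```

A regular expression costs a little more than slicing off the first character, but it keeps working if the text ever carries extra leading characters, for example when the page is decoded with the wrong encoding.
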


### Main Execution
The following code is the main execution. It is what we would define as a `main()` function if we were to turn this notebook into a script. It performs the following steps:
- Initializes an empty list called `books` to store all the book data.
- Creates a list of URLs (`urls`) representing the pages to be scraped (in this case, 50 pages).
- Opens a session using `aiohttp.ClientSession()`. No event loop is created explicitly: the notebook already runs one, which is why `await` works at the top level of the cell.
- Uses `await get_data(session, urls)` to asynchronously fetch and scrape all the pages, storing the collected data in the `books_data` variable.
- Finally, it appends each book row in `books_data` to the `books` list.
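
For comparison, here is a rough sketch of how these steps might look as a `main()` function in a standalone script. To keep it runnable without network access, it uses a hypothetical stand-in, `fake_get_data`, in place of the real `get_data`. The comments mark where the aiohttp session would go.

```python
import asyncio
from typing import Any, List

BASE_URL = "http://books.toscrape.com/catalogue/page-{}.html"


# Hypothetical stand-in for get_data so the sketch runs offline.
# It returns one made-up [title, price, rating] row per URL.
async def fake_get_data(urls: List[str]) -> List[List[Any]]:
    await asyncio.sleep(0)
    return [[f"Book from {url}", 0.0, "One"] for url in urls]


async def main() -> List[List[Any]]:
    urls = [BASE_URL.format(i) for i in range(1, 51)]
    # In the real script this would be:
    #   async with aiohttp.ClientSession() as session:
    #       books = await get_data(session, urls)
    books = await fake_get_data(urls)
    return books


if __name__ == "__main__":
    # A script has no running event loop, so asyncio.run() creates one.
    books = asyncio.run(main())
    print(len(books))
```

The main difference from the notebook cell is the entry point: a script needs `asyncio.run(main())`, whereas the notebook can `await` directly.
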

In [7]:
books: List[Any] = []
BASE_URL: str = "http://books.toscrape.com/catalogue/page-{}.html"
urls: List[str] = [BASE_URL.format(i) for i in range(1, 51)]

# Top-level await works here because the notebook already runs an event loop
async with aiohttp.ClientSession() as session:
    books_data = await get_data(session, urls)

    # get_data returns a flat list with one row per book
    for book in books_data:
        books.append(book)

### Data Processing and Exporting
In this cell, we perform the following actions:
- Create a pandas DataFrame (`df`) from the `books` list, specifying column names ('Title', 'Price', 'Star Rating').
- Save this DataFrame to a CSV file named "books.csv" using `df.to_csv()`.
- Print "Done!" to indicate the completion of the scraping and exporting process.

This cell is responsible for processing the collected data and storing it in a structured CSV format for further analysis or use.
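
Since the star rating is stored as a word taken from a CSS class name, further analysis may be easier with a numeric column. The following is a small sketch of that conversion. It assumes the rating words are `One` through `Five`; the `STAR_WORDS` mapping and the sample rows are illustrative and not taken from the notebook's output.

```python
from typing import Any, List

# Assumed mapping from the rating word to a number.
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Made-up rows in the same [title, price, star rating] layout as `books`.
rows: List[List[Any]] = [
    ["Example Book A", 51.77, "Three"],
    ["Example Book B", 20.0, "Five"],
]

# Replace the rating word in each row with its numeric value.
numeric_rows = [[title, price, STAR_WORDS[stars]] for title, price, stars in rows]
print(numeric_rows)
```

With the DataFrame from this notebook, the same mapping could be applied with `df['Star Rating'].map(STAR_WORDS)`.
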

In [8]:
df = pd.DataFrame(books, columns=['Title','Price', 'Star Rating'])
df.to_csv('books.csv')
print("Done!")

Done!
