# Scraping books.toscrape.com ---> SECTION 2

### CourseWork 1

#### Specified Tasks:

In this question, you will build a dataset on books. The data will be collected from
https://books.toscrape.com/
Note: The purpose of this exercise is to demonstrate data acquisition (scraping), cleaning, and exploration.
You can use BeautifulSoup and Requests Python libraries.

Write scripts to fulfil the below requirements:

a. Collect 1000 (or as many as possible) items from the website and save them to
csv(s) files.

    The data collected should include:
    1. Title
    2. Rating
    3. Price
    4. Stock availability (Boolean)
    5. UPC
    6. Quantity available
    7. Category
    (10 marks)
    
b. Identify any problems with the data and clean accordingly. (5 marks)

c. Calculate the total value of each category. Which category is the most valuable? (5 marks)

d. Which category with more than 10 titles has the highest percentage of books
rated 4 stars or higher? (5 marks)

e. Compare the average ratings of each category. What are your conclusions? (10 marks)

# SOLUTION

#### STEP 1: Import required libraries

In [None]:
import time
from typing import Union

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

    Pandas ---> Our library for data wrangling, manipulation, and transformation.
    Requests ---> Our library for making HTTP requests to web pages.
    Time ---> Used to date-stamp the saved file and to pause between scraping runs.
    BS4 ---> The Beautiful Soup library, used to parse HTML and navigate its tags while scraping.
    NumPy ---> Provides np.nan for missing ratings and np.linspace for the chart colours.
    Matplotlib ---> Our plotting library for the bar charts.
    Typing ---> Provides Union for the return-type hints.

#### STEP 2: Creating functions

To scrape the books from the web page, we break the task into four parts. Each part becomes a function, which makes the overall problem easier to solve. This approach is usually called functional decomposition (it is not the same thing as functional programming), and we apply it to create and save our dataset.

###### The Functions:
Functions for scraping:
- get_info_internal_links -> Requests the web address, parses the HTML with Beautiful Soup's 'html.parser', and returns the first tag that matches the given tag name and class.
- get_books_into_dataset -> Stores the data collected from each page in a list of lists, one inner list per book.
- save_scraped_data -> Saves the data to the generated_data folder as a date-stamped CSV file.
- iterate_through_pages -> Combines the functions above to scrape every category on https://books.toscrape.com/, whether the category has one page or several.
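The extraction step can be tried offline before any request is sent. The sketch below parses a hand-written HTML fragment that imitates one book card; the fragment and the helper name `parse_book_card` are assumptions made for illustration, not a copy of the live markup or of the functions in this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates one book card on a listing page (an assumption).
SAMPLE_CARD = """
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
  <article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="../../../sample-book_1/index.html" title="Sample Book">Sample ...</a></h3>
    <p class="price_color">£10.00</p>
    <p class="instock availability">
        In stock
    </p>
  </article>
</li>
"""


def parse_book_card(card) -> dict:
    """Extract the fields that are visible on a listing page from one card."""
    return {
        "Title": card.h3.a["title"],
        # The second class on the star-rating <p> is the rating word.
        "Rating": card.find("p", class_="star-rating")["class"][1],
        "Price": card.find("p", class_="price_color").text.strip(),
        "Availability": card.find("p", class_="instock availability").text.strip(),
    }


soup = BeautifulSoup(SAMPLE_CARD, "html.parser")
book = parse_book_card(soup.find("li"))
print(book)
```

Working against a fixed fragment like this makes it easier to check the selectors without repeatedly hitting the site.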


Functions for working with the collected data:
- eda -> Takes a dataset and performs exploratory data analysis.
- convert_stock_to_boolean -> Applied to the availability column. Converts each value to boolean TRUE if the book is in stock, or boolean FALSE if it is not.
- convert_rating_to_int -> Applied to the rating column. Converts each rating word ("One" to "Five") to its numerical representation.
- get_number_of_books -> Applied row by row to get the total number of books in each category.
- percentage_above_or_equal_to_4rating -> Applied row by row to get the percentage of books rated 4 stars or higher in each category. It calls get_number_of_books to obtain the denominator.

Please note: The functions for working with the data can only be used after the data has been scraped and loaded into a pandas dataframe.

In [None]:
def get_info_internal_links(address: str, tag: str, class_: str = None):
    """
    This function scrapes the html tags from the web address and creates an 
    instance of the beautiful soup library using the data scraped and the 
    'html.parser' to parse the data.

    Parameters
    ----------
    address : str
        The website to scrape the data.
    tag : str
        HTML tag for the container where the books are.
    class_ : str, optional
        The class associated with the HTML tag. The default is None.

    Returns
    -------
    info : Tag Object
        HTML tags associated with the request.

    """
    scraper = requests.get(address, timeout = 10)
    response = scraper.text
    
    soup = BeautifulSoup(response, "html.parser")
    info = soup.find(tag, class_ = class_)
    return info

In [None]:
def get_books_into_dataset(container_books, books) -> list:
    """
    Store all the data collected from each page into a list of lists.

    Parameters
    ----------
    container_books : Tag Object
        The HTML tag container where all the books on that page are stored.
    books : ResultSet
        The individual book cards (one <li> tag per book) found on that page.

    Returns
    -------
    store : list
        A list of lists, one inner list per book.

    """
    store = []
    for each_book in books:
        book_title = each_book.h3.a["title"]
        book_rating = each_book.p.attrs["class"][1]
        # Strip the pound sign, including the mis-decoded form "Â£", from the front of the price.
        book_price = each_book.find("p", class_ ="price_color").text.strip().lstrip("Â£")
        info_address = each_book.find("h3").a["href"].replace("../../../", "https://books.toscrape.com/catalogue/")
        
        get_upc = get_info_internal_links(address = info_address, tag = "table", class_ ="table table-striped")
        book_upc = get_upc.find("td").text
        
        book_in_stock = each_book.find("p", class_ ="instock availability").text.strip()
        
        get_quantity_available = get_info_internal_links(address = info_address, tag = "p", class_ ="instock availability")
        book_quantity_available = get_quantity_available.text.strip().replace("In stock (", "").split()[0]
        
        book_category = container_books.find("h1").text
        book_more_info = info_address  # same detail-page URL that was built above
        store.append([book_title, 
                      book_rating, 
                      book_price, 
                      book_upc, 
                      book_in_stock,
                      book_quantity_available,
                      book_category, 
                      book_more_info])
    return store
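The stray "Â" in front of the pound sign is what appears when UTF-8 bytes are decoded as ISO-8859-1. The sketch below reproduces the effect with plain strings and shows a price parser that tolerates both forms. The helper names `decode_page` and `price_to_float` are hypothetical, and the assumption that the site serves UTF-8 is mine. With Requests, the equivalent would be to set `response.encoding = "utf-8"` before reading `response.text`, or to decode `response.content` directly.

```python
def decode_page(raw_bytes: bytes) -> str:
    """Decode a downloaded page as UTF-8 (the assumed encoding of the site)."""
    return raw_bytes.decode("utf-8")


def price_to_float(text: str) -> float:
    """Strip currency debris from either form of the price and return a float."""
    return float(text.strip().lstrip("Â£"))


raw = "£51.77".encode("utf-8")    # the bytes as they would travel over the wire
wrong = raw.decode("iso-8859-1")  # decoded with the wrong charset
right = decode_page(raw)          # decoded with the assumed correct charset

print(repr(wrong), repr(right))
print(price_to_float(wrong), price_to_float(right))
```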

In [None]:
def save_scraped_data(dataset: pd.DataFrame):
    """
    Save the data to the generated_data folder.

    Parameters
    ----------
    dataset : pd.DataFrame
        Dataset containing the scraped data.

    Returns
    -------
    None.

    """
    try:
        data_name = "scraped_books_data" 
        date = time.strftime("%Y-%m-%d")
        dataset.to_csv(f"../../generated_data/{data_name}_{date}.csv", index = False)
        print("\nSuccessfully saved file to the specified folder ---> generated_data folder.")
    except OSError:  # includes FileNotFoundError; recent pandas versions raise OSError for a missing folder
        print("\nFailed to save file to the specified folder ---> generated_data folder.")
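Saving fails when the target folder does not exist yet. A possible variant, sketched here under the assumption that creating the folder on demand is acceptable, uses `pathlib` to make the directory first. The function name `save_with_directory` and the temporary folder are illustrative only.

```python
import tempfile
import time
from pathlib import Path

import pandas as pd


def save_with_directory(dataset: pd.DataFrame, folder: str,
                        data_name: str = "scraped_books_data") -> Path:
    """Create `folder` if needed, then write a date-stamped CSV into it."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    date = time.strftime("%Y-%m-%d")
    path = target_dir / f"{data_name}_{date}.csv"
    dataset.to_csv(path, index=False)
    return path


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"Title": ["Sample Book"], "Price": [10.0]})
    saved = save_with_directory(demo, f"{tmp}/generated_data")
    round_trip = pd.read_csv(saved)

print(saved.name)
print(round_trip)
```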

In [None]:
def iterate_through_pages() -> pd.DataFrame:
    """
    Scrape the data from https://books.toscrape.com/ by combining all the 
    functions created:
        a) get_info_internal_links -> To get the webpage and parse it.
        b) get_books_into_dataset -> Gets the book data from the parsed data 
    and stores it.
        c) save_scraped_data -> Save the data to the generated_data folder.

    Returns
    -------
    dataset : pd.DataFrame
        A dataframe containing the scraped books data from all pages on the 
        https://books.toscrape.com/ web page.

    """
    # Get the dataset
    dataset = pd.DataFrame(columns = ["Title", "Rating", "Price", "UPC", "Availability", "Quantity_Available", "Category", "More_Info"])
    
    url = "https://books.toscrape.com/catalogue/category/books_1/index.html"
    box = get_info_internal_links(address = url, tag = "ul", class_ = "nav nav-list").li
    column_box = box.find_all("li")
    
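    # Turn each category name into its URL slug, e.g. "Historical Fiction" -> "historical-fiction".
    columns = [column.text.lower().strip().replace(" ", '-') for column in column_box]
    
    print("Now iterating through pages to scrape data...\n    TRUE: Categories with one page.\n    FALSE: Categories with more than one page.\n")
    # Category ids in the URLs start at 2 (books_1 is the listing of all books).
    for index, category in enumerate(columns, start = 2):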
        url1 = f"https://books.toscrape.com/catalogue/category/books/{category}_{index}/index.html"
        
        container_books = get_info_internal_links(address = url1, tag = "div", class_ = "col-sm-8 col-md-9")
        books = container_books.find_all("li", class_ ="col-xs-6 col-sm-4 col-md-3 col-lg-3")
        
        # The pager ("Page 1 of N") only exists for categories with more than one page.
        pager = container_books.find("li", class_ = "current")
        print(f"{category}{index} -->", pager is None)
        
        if pager is None:
            # Single-page category: the books are already on the page we fetched.
            store = get_books_into_dataset(container_books, books)
            data = pd.DataFrame(store, columns = ["Title", "Rating","Price", "UPC", "Availability", "Quantity_Available", "Category", "More_Info"])
            dataset = pd.concat([dataset, data], ignore_index = True)
            
        else:
            # Multi-page category: the last token of the pager text is the page count.
            page_number = int(pager.text.split()[-1])
            for page in range(1, (page_number + 1)):
                url2 = f"https://books.toscrape.com/catalogue/category/books/{category}_{index}/page-{page}.html"
                container_books = get_info_internal_links(address = url2, tag = "div", class_ = "col-sm-8 col-md-9")
                books = container_books.find_all("li", class_ ="col-xs-6 col-sm-4 col-md-3 col-lg-3")
                
                store = get_books_into_dataset(container_books, books)
                data = pd.DataFrame(store, columns = ["Title", "Rating","Price", "UPC", "Availability", "Quantity_Available", "Category", "More_Info"])
                dataset = pd.concat([dataset, data], ignore_index = True)
    
    # Save dataset to folder
    save_scraped_data(dataset)
    
    return dataset
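The paging rule used by iterate_through_pages can be isolated into a small pure function, which makes it testable without network access. This is a sketch only: `page_urls` is a hypothetical helper, and the slugs and ids in the usage lines are illustrative values that follow the URL pattern used in this notebook.

```python
from typing import List, Optional


def page_urls(category_slug: str, category_id: int,
              pager_text: Optional[str] = None) -> List[str]:
    """Return the listing URLs for one category.

    `pager_text` is the text of the "current" pager item (e.g. "Page 1 of 3"),
    or None when the category has no pager and therefore a single page.
    """
    base = f"https://books.toscrape.com/catalogue/category/books/{category_slug}_{category_id}"
    if pager_text is None:
        return [f"{base}/index.html"]
    # The last whitespace-separated token of the pager text is the page count.
    page_count = int(pager_text.split()[-1])
    return [f"{base}/page-{page}.html" for page in range(1, page_count + 1)]


single = page_urls("travel", 2)
several = page_urls("mystery", 3, "\n  Page 1 of 2\n")
print(single)
print(several)
```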

In [None]:
def eda(dataset: pd.DataFrame) -> dict:
    """
    Perform exploratory data analysis on the dataset.

    Parameters
    ----------
    dataset : pd.DataFrame
        Dataset to perform EDA.

    Returns
    -------
    dict
        A dictionary containing different evaluation metrics for exploring the 
        columns and understanding how values in the dataset are distributed.

    """
    data_unique = {}
    data_category_count = {}
    dataset.info()
    data_head = dataset.head()
    data_tail = dataset.tail()
    data_mode = dataset.mode().iloc[0]
    data_descriptive_stats = dataset.describe()
    data_more_descriptive_stats = dataset.describe(include = "all")
    data_correlation_matrix = dataset.corr(numeric_only = True)
    data_distinct_count = dataset.nunique()
    data_count_duplicates = dataset.duplicated().sum()
    data_count_null = dataset.isnull().sum()
    data_total_null = dataset.isnull().sum().sum()
    for each_column in dataset.columns:
        data_unique[each_column] = dataset[each_column].unique()
    for each_column in dataset.select_dtypes(object).columns:
        data_category_count[each_column] = dataset[each_column].value_counts()
    
    result = {"data_head": data_head,
              "data_tail": data_tail,
              "data_mode": data_mode,
              "data_descriptive_stats": data_descriptive_stats,
              "data_more_descriptive_stats": data_more_descriptive_stats,
              "data_correlation_matrix": data_correlation_matrix,
              "data_distinct_count": data_distinct_count,
              "data_count_duplicates": data_count_duplicates,
              "data_count_null": data_count_null,
              "data_total_null": data_total_null,
              "data_unique": data_unique,
              "data_category_count": data_category_count,
              }
    
    return result

In [None]:
def convert_stock_to_boolean(row: str) -> bool:
    """
    This function is created for the availability column in the dataset. It takes 
    each row in the availability column and converts it to boolean TRUE if the book is
    in stock or boolean FALSE if it is not.

    Parameters
    ----------
    row : str
        Each row of the availability column.

    Returns
    -------
    bool
        Returns boolean of TRUE if the book is in stock, else FALSE.

    """
    return row.strip() == "In stock"


In [None]:
def convert_rating_to_int(row: str) -> Union[int, float]:
    """
    This function is created for the rating column in the dataset. It takes 
    each row in the ratings column and converts it to its numerical representation.

    Parameters
    ----------
    row : str
        Each row of the ratings column.

    Returns
    -------
    Union[int, float]
        Returns an integer which is the numerical representation of the ratings
        column or np.nan which is of type float.

    """
    rating_words = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    # Unknown words fall back to np.nan so that they show up as missing values.
    return rating_words.get(row.strip(), np.nan)

In [None]:
def get_number_of_books(row) -> int:
    """
    This function is applied on a dataset to get the total number of books for
    each category.

    Parameters
    ----------
    row : pd.Series
        One row of a dataframe that has a "Category" column. The function is
        meant to be used with DataFrame.apply(..., axis = 1).

    Returns
    -------
    int
        Returns an integer representing the total number of books for that category.

    """
    category = row["Category"]
    # `categories` is the module-level table built in solution (D); take the single matching count.
    number_of_books = int(categories.loc[categories["Category"] == category, "Count"].iloc[0])
    return number_of_books

In [None]:
def percentage_above_or_equal_to_4rating(row) -> Union[int, float]:
    """
    This function is applied row by row to get the percentage of books rated
    4 stars or higher in each category. It calls get_number_of_books to get
    the total number of books in the category, which is the denominator.

    Parameters
    ----------
    row : pd.Series
        One row of the per-category ratings table, containing the "Category"
        and "Count" columns.

    Returns
    -------
    Union[int, float]
        The percentage of books in that category rated 4 stars or higher.

    """
    number_of_books = get_number_of_books(row)
    number = row["Count"]
    return (number/number_of_books) * 100

#### STEP 3: Start the program... Get the dataset

(A) SOLUTION

We use the function iterate_through_pages as our bot. It combines the other functions we have created (get_info_internal_links, get_books_into_dataset, and save_scraped_data) to scrape the data from the website and save it.

In [None]:
# Program Starts
if __name__ == "__main__":
    while True:
        dataframe = iterate_through_pages()
        print("\n\nNext scraping will be done in 2 hours.\nPlease wait...")
        # Sleep for 2 hours between runs so that repeated scrapes are spaced out.
        # Note: this loop never exits on its own; interrupt the kernel to move on to STEP 4.
        time.sleep(7200)

#### STEP 4: Working on the dataset

In [None]:
# Get scraped data
dataset = pd.read_csv("../../generated_data/scraped_books_data_2024-02-11.csv")
print(dataset)

In [None]:
# Exploratory Data Analysis
initial_eda = eda(dataset)

(B) SOLUTION

### Issues identified from the dataset
1) The STOCK AVAILABILITY column is text and must be converted to BOOLEAN
2) The RATING column holds words ("One" to "Five") and must be converted to integers
3) The PRICE column must be converted to floats
4) The QUANTITY column must be converted to integers

In [None]:
# ---> Solution 1: Convert the STOCK AVAILABILITY column to BOOLEAN
dataset["Availability"] = dataset["Availability"].apply(convert_stock_to_boolean).astype(bool)
# ---> Solution 2: Convert the RATING column from words to integers
dataset["Rating"] = dataset["Rating"].apply(convert_rating_to_int)
# ---> Solution 3: Convert the PRICE column to floats
dataset["Price"] = dataset["Price"].astype(float)
# ---> Solution 4: Convert the QUANTITY column to integers
dataset["Quantity_Available"] = dataset["Quantity_Available"].astype(int)
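A direct `astype` call raises an error as soon as one value cannot be converted. An alternative, sketched below on made-up rows, is to coerce unexpected values to missing so that they can be inspected afterwards. The toy values and the `rating_words` mapping are assumptions for illustration and are not taken from the scraped file.

```python
import pandas as pd

# Made-up rows, including deliberately bad values in the last row.
raw = pd.DataFrame({
    "Rating": ["Three", "Five", "Zero"],
    "Price": ["51.77", "Â£53.74", "n/a"],
    "Quantity_Available": ["22", "3", ""],
})

rating_words = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

clean = raw.copy()
# Words that are not in the mapping become NaN instead of raising.
clean["Rating"] = clean["Rating"].str.strip().map(rating_words)
# Remove any leading currency debris, then coerce unparseable text to NaN.
clean["Price"] = pd.to_numeric(clean["Price"].str.lstrip("Â£"), errors="coerce")
# The nullable Int64 dtype keeps whole numbers while still allowing missing values.
clean["Quantity_Available"] = pd.to_numeric(
    clean["Quantity_Available"], errors="coerce"
).astype("Int64")

# Rows with at least one value that could not be converted.
problem_rows = clean[clean.isna().any(axis=1)]
print(clean)
print(problem_rows)
```

Listing `problem_rows` gives a concrete record of what had to be cleaned, which is useful evidence for part (b).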

(C) SOLUTION

### Calculate the total value of each category. Which category is the most valuable?

In [None]:
# ---> Calculate the total value of each category.
total_value_category = dataset[["Price", "Category"]].groupby("Category").sum()
total_value_category = total_value_category.reset_index().sort_values("Price", ascending = False)

In [None]:
# ---> Which category is the most valuable?
# The table from the previous cell is already sorted by total price, descending.
most_valuable = total_value_category.copy()
most_valuable_category = most_valuable.iloc[0]

In [None]:
# ---> Visualising the most valuable category
plt.figure(figsize = (30, 10))
colormap = plt.cm.OrRd
plt.grid(True, alpha = 0.2)
container = plt.bar(most_valuable["Category"], 
                    most_valuable["Price"], 
                    color = colormap(np.linspace(0, 1, num = len(most_valuable["Category"]))))
plt.bar_label(container, 
              labels = round(most_valuable["Price"], 1), 
              padding = 10)
plt.plot(most_valuable["Category"], 
         most_valuable["Price"], 
         marker = "o", 
         linestyle = "--", 
         color = "orange", 
         markersize = 8)
plt.title("Analyzing the most valuable category.", pad = 10, fontsize = 20)
plt.xlabel("Category.", labelpad= 10)
plt.ylabel("Price", labelpad = 10)
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

From our analysis, the most valuable category is the DEFAULT category, with a total price of £5227.7. Here "value" means the sum of the listed prices in the category.
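"Total value" can also be read as the value of the stock on hand, that is price multiplied by quantity available. The sketch below compares the two readings on made-up numbers; it is an alternative interpretation and not the calculation used above, and the toy categories "A" and "B" are purely illustrative.

```python
import pandas as pd

# Made-up data: category B has fewer titles but more copies in stock.
toy = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "Price": [10.0, 20.0, 25.0],
    "Quantity_Available": [1, 1, 3],
})

# Reading 1: sum of listed prices per category.
sum_of_prices = toy.groupby("Category")["Price"].sum()

# Reading 2: price multiplied by quantity, then summed per category.
stock_value = (toy["Price"] * toy["Quantity_Available"]).groupby(toy["Category"]).sum()

print(sum_of_prices.idxmax(), stock_value.idxmax())
```

On this toy data the two readings pick different categories, so it is worth stating which definition an answer relies on.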

(D) SOLUTION

### Which category with more than 10 titles has the highest percentage of books rated 4 stars or higher?

In [None]:
# ---> Keep only the categories with more than 10 titles - STAGE 1
categories = dataset["Category"].groupby(dataset["Category"]).count()
categories.name = "Count"
categories = categories.reset_index()
categories = categories[categories["Count"] > 10]  # strictly more than 10, as the question asks
# ---> Keep only the books that belong to those categories - STAGE 2
columns = categories["Category"].values
filtered_data = dataset[dataset["Category"].isin(columns)]

In [None]:
# ---> Category with highest percentage of books rated 4 stars or higher?
grouped_ratings = filtered_data[["Rating", "Category", "Title"]].groupby(["Category", "Rating"]).count().reset_index()
grouped_ratings = grouped_ratings[grouped_ratings["Rating"].isin([4, 5])]
ratings = grouped_ratings[["Category", "Title"]].groupby("Category").sum().reset_index()
ratings = ratings.rename({"Title": "Count"}, axis = 1)
ratings["Number of books"] = ratings.apply(get_number_of_books, axis = 1)
ratings["%Ratings of 4Stars above"] = ratings.apply(percentage_above_or_equal_to_4rating, axis = 1)
category_highest_percent_rating_above4stars = ratings.sort_values("%Ratings of 4Stars above", ascending = False).iloc[[0, 1, 2]]

In [None]:
# ---> Visualising category with highest percentage of books rated 4 stars or higher
ratings = ratings.sort_values("%Ratings of 4Stars above", ascending = False)
plt.figure(figsize = (30, 10))
colormap = plt.cm.OrRd
plt.grid(True, alpha = 0.2)
container = plt.bar(ratings["Category"], 
                    ratings["%Ratings of 4Stars above"], 
                    color = colormap(np.linspace(0, 1, num = len(ratings["Category"]))))
plt.bar_label(container, 
              labels = round(ratings["%Ratings of 4Stars above"], 1), 
              padding = 10)
plt.plot(ratings["Category"], 
         ratings["%Ratings of 4Stars above"], 
         marker = "o", 
         linestyle = "--", 
         color = "orange", 
         markersize = 8)
plt.title("Analyzing category with highest percentage of books rated 4 stars or higher.", pad = 10, fontsize = 20)
plt.xlabel("Category.", labelpad= 10)
plt.ylabel("Percentage of Ratings of 4 stars and above.", labelpad = 10)
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

From our analysis, the category with more than 10 titles that has the highest percentage of books rated 4 stars or higher is POETRY: 63.2% of its books are rated 4 stars or higher.
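The same percentage can be computed in one grouped aggregation, without row-wise helper functions or a global lookup table. This is a sketch on made-up ratings; the threshold `min_titles = 2` stands in for the 10 used in the question, and the category names are illustrative.

```python
import pandas as pd

# Made-up ratings for three categories of different sizes.
toy = pd.DataFrame({
    "Category": ["A"] * 4 + ["B"] * 2 + ["C"] * 3,
    "Rating":   [5, 4, 2, 1,   5, 5,   4, 3, 1],
})

min_titles = 2  # stands in for the threshold of 10 titles

# The mean of a boolean mask is the share of True values.
summary = toy.groupby("Category")["Rating"].agg(
    titles="count",
    pct_4_plus=lambda r: (r >= 4).mean() * 100,
)

# Keep categories with strictly more titles than the threshold, best first.
eligible = summary[summary["titles"] > min_titles].sort_values(
    "pct_4_plus", ascending=False
)
print(eligible)
```

Category "B" has a perfect share but only two titles, so the strict filter removes it, which is exactly the effect the "more than 10 titles" condition is meant to have.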

(E) SOLUTION

### Compare the average ratings of each category. What are your conclusions?

In [None]:
# ---> Getting the average ratings
avg_rating = dataset[["Rating", "Category"]].groupby("Category").mean(numeric_only = True)
avg_rating = avg_rating.reset_index().sort_values("Rating", ascending = False)

In [None]:
# ---> Visualise average ratings of each category
plt.figure(figsize = (30, 10))
colormap = plt.cm.OrRd
plt.grid(True, alpha = 0.2)
container = plt.bar(avg_rating["Category"], 
                    avg_rating["Rating"], 
                    color = colormap(np.linspace(0, 1, num = len(avg_rating["Category"]))))
plt.bar_label(container, 
              labels = round(avg_rating["Rating"], 1), 
              padding = 10)
plt.plot(avg_rating["Category"], 
         avg_rating["Rating"], 
         marker = "o", 
         linestyle = "--", 
         color = "orange", 
         markersize = 8)
plt.title("Analyzing average ratings of each category.", pad = 10, fontsize = 20)
plt.xlabel("Category.", labelpad= 10)
plt.ylabel("Rating.", labelpad = 10)
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

From our analysis, three categories share the highest average rating of 5.0:
   - Erotica Category
   - Adult Fiction Category
   - Novels Category

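At the lower end, four categories share the lowest average rating of 1.0:
   - Short Stories Category
   - Crime Category
   - Cultural Category
   - Paranormal Category

An average of exactly 5.0 or 1.0 is only possible when every book in the category has that same rating, which usually points to a category with very few titles. These extremes should therefore be read together with the number of books in each category before drawing conclusions about quality.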
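To judge how much weight an average deserves, it helps to show the number of titles next to it. The sketch below does this on made-up ratings; the category names "Single" and "Large" are illustrative and do not refer to categories on the site.

```python
import pandas as pd

# Made-up ratings: one category with a single title, one with four.
toy = pd.DataFrame({
    "Category": ["Single", "Large", "Large", "Large", "Large"],
    "Rating": [5, 4, 3, 5, 2],
})

# Average rating and number of titles side by side, best average first.
rating_summary = (
    toy.groupby("Category")["Rating"]
    .agg(avg_rating="mean", titles="count")
    .sort_values("avg_rating", ascending=False)
)
print(rating_summary)
```

Here the top average comes from a single book, so it says little about the category as a whole.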