<span style="background-color:yellow; font-size:20px;">Run the coding part first, then come back and re-run the analysis part. The graphs will not render until the code has been run.</span>

<span style="font-size:20px;">To speed up reviewing, all data have been saved to a CSV file. Scraping the site from scratch takes about 5 minutes.</span>

This notebook presents a data analysis exercise using data from the "books.toscrape.com" website.

For the sake of practice, all data are scraped and analysed using only Python libraries, without any API (the website does not have one).

"books.toscrape.com" is a demo bookstore that lists book titles, prices, ratings, etc. Since the website is made for practice purposes, its content is limited.


Data available for each book:
- Title
- Price
- Rating (1-5)
- Availability
- Stock
- Genre 

1. **Amount of Books in the Store**
2. **Duplicates**: Check for duplicate entries.
3. **Availability**: Books in stock vs. out of stock.
4. **Books Prices**: Sorted list and price range.
5. **Books Ratings**: Sorted list and rating distribution.
6. **Price vs. Rating Correlation**
7. **Amount of Books per Genre**: Genre popularity.
8. **Price Distribution**: Average, range, and standard deviation.
9. **Genre vs. Price**

1. **Amount of books in the Store:** 1000 

2. **Duplicates:** 1

   Since the duplicate has a different price and stock, it is treated as a separate title in the further analysis.

In [None]:
duplicates_library

3. **Availability:** Books in stock vs. out of stock.

   All books are labeled as "in stock".

   Total number of copies in stock: 8585

4. **Books Prices**:

    The most expensive book:

   Title: "the perfect play (play by play #1)"

   Price: £59.99
   
   

   The cheapest book:

   Title: "an abundance of katherines"

   Price: £10.00

   Books Prices graph:

In [None]:
fig_price

5. **Books ratings:**

In [None]:
fig_rating

6. **Price vs. Rating Correlation:**

In [None]:
price_rating_corr

The correlation is calculated with the pandas "corr()" function.

Correlation: 0.028166239485872963 (~0.03)

A correlation this close to 0 means there is no linear relationship between price and rating.

7. **Amount of Books per Genre:**

In [None]:
fig_genres

Most popular: "default": 152, followed by "nonfiction": 110.

Least popular (1 book each): "crime", "erotica", "novels", "cultural", "suspense", "short stories", "academic", "adult fiction", "parenting", "paranormal".

8. **Price distribution:**

   Average price of a book: £35.07035 (~£35.07)

   Price range: £49.99

   Standard deviation: 14.446689669952764 (~£14.45)

   Book prices typically deviate by about £14.45 from the mean price of £35.07.

   Shapiro-Wilk test:

   Statistic = 0.9532239071596627

   p-value = 2.6180709475683377e-17

   The statistic is close to 1, but the p-value is far below 0.05, so the hypothesis that the prices are normally distributed is rejected. The histogram below shows the same thing.

In [None]:
fig_histogram

9. **Genre vs. Price**:
    Average price per genre

In [None]:
fig_genre_price

<span style="background-color:yellow; font-size:20px;">Coding begins here</span>



Important libraries 

In [None]:
import requests # Imports library to send a request to a website
from bs4 import BeautifulSoup # Imports a library to clean website information that we request
import pandas as pd # Imports pandas
import plotly.express as px # Imports plotly

from scipy.stats import shapiro # Imports Shapiro Wilk test of normality for data distribution

Function to add a book with its information to the library (dictionary)

In [None]:
# Function to add a book with automatic ID


def add_book(library, title, price, rating, stock, genre, available=True):
    # Generate an automatic ID based on the current size of the library
    book_id = str(len(library) + 1)
    
    if book_id in library:
        print(f"Book ID {book_id} already exists. This should not happen!")
        return
            
    
    library[book_id] = {
        "title": title,
        "price": price,
        "rating": rating,
        "stock": stock,
        "genre": genre,
        "available": available
        
    }
    print(f"Book '{title}' added successfully with ID {book_id}.")

Checks whether a title appears twice in the library and, if so, prints both entries with all their details.

"clean_duplicates()" can't be run twice without re-declaring the variables (emptying their contents), since it would add the duplicates all over again. Writing a function that checks the duplicates library for duplicates by price as well as title is not worth the time for this project.

In [None]:
def clean_duplicates(library):
    # Relies on the module-level "seen", "unique_library" and "duplicates_library"
    # declared before the call; it fills them in place and returns nothing.
    for i in library:
        book_title = library[i]["title"]  # String containing the title of a book
        book_info = library[i]  # Dictionary with all information about the book

        if book_title not in seen:
            unique_library[i] = book_info  # Add the book to the library of unique titles
            seen.add(book_title)  # Remember the title (a set can't contain duplicates)

        else:  # The title is already in the verification set
            for n in unique_library:
                if unique_library[n]["title"] == book_title:  # Find the first book with this title
                    # Store both the duplicate and the book already in the library
                    duplicates_library[str(len(duplicates_library) + 1)] = book_info
                    duplicates_library[str(len(duplicates_library) + 1)] = unique_library[n]

                    print("Duplicate: \n", book_info)
                    print("Book in library: \n", unique_library[n], "\n")
                    break  # Titles in unique_library are unique, so stop searching
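
# The re-run problem described above comes from writing into module-level
# containers. A minimal sketch of an alternative, assuming the same library
# layout (ID mapped to a dictionary with a "title" key): the hypothetical
# split_duplicates() builds fresh containers on every call and returns them,
# so calling it twice gives the same result. It is not part of the original
# notebook and it stores only the later occurrences as duplicates.
def split_duplicates(library):
    first_seen = {}   # title -> ID of the first book with that title
    unique = {}       # first occurrence of each title
    duplicates = {}   # later occurrences, keyed by their own ID

    for book_id, info in library.items():
        title = info["title"]
        if title in first_seen:
            duplicates[book_id] = info
        else:
            first_seen[title] = book_id
            unique[book_id] = info
    return unique, duplicates


# Made-up sample data for illustration only
sample = {
    "1": {"title": "book a", "price": "10.00"},
    "2": {"title": "book b", "price": "12.50"},
    "3": {"title": "book a", "price": "15.00"},  # same title, different price
}
sample_unique, sample_duplicates = split_duplicates(sample)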


    print("Books in library before checking for duplicates:", len(library))
    print("Books in library after deleting duplicates:", len(unique_library))


A loop that gets data about books from the website and saves all details into "library". It repeats for each catalogue page.

This function is really slow, since it follows the link of every book (1000 books) and scrapes the data from there.
The data could be scraped from the catalogue pages directly (~20 books per page), which would significantly reduce the scraping time, but the data would then lack genre and stock (amount) information.

The decision was made to keep the slower approach and scrape more information.

In [None]:
def get_books():
    page = 1
    while True:
        url = f"http://books.toscrape.com/catalogue/page-{page}.html"
        print("\nPage:", page, "| URL:", url)
        
        try:
            response = requests.get(url)
            response.raise_for_status()  # Will raise HTTPError if the response code is 4xx/5xx
            print("Connecting to website successful\n")
        except requests.exceptions.RequestException as e:
            print(f"Error: {e}")
            break
        
        soup = BeautifulSoup(response.text, "html.parser")
        book_info = soup.find_all("h3")
        
        if not book_info:  # If no books are found, stop
            print("No more books found.")
            break
        
        for link in book_info:
            href = link.find("a")["href"]
            book_url = f"http://books.toscrape.com/catalogue/{href}"
            
            try:
                response = requests.get(book_url)
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print(f"Error fetching book details: {e}")
                continue  # Skip this book and proceed to the next
            
            soup = BeautifulSoup(response.text, "html.parser")
            
            title = soup.find("li", class_="active").text.strip().lower()
            rating = soup.find("p", class_="star-rating")["class"][1].strip().lower()
            price = soup.find("p", class_="price_color").text[2:].strip()
            genre = soup.find("ul", class_="breadcrumb").find_all("a")[2].text.lower()
            
            storage = soup.find("p", class_="instock availability").text.strip().lower().split()
            stock = int(storage[2][1:])
            in_stock = " ".join(storage[:2])
            available = in_stock == "in stock"
            
            add_book(library, title, price, rating, stock, genre, available)
        
        page += 1
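
The scraper above extracts price and stock by position ("text[2:]" and "storage[2][1:]"), which breaks if the page encoding or the wording changes. A minimal sketch of a more tolerant alternative, assuming the availability text looks like "In stock (22 available)" and the price text contains one decimal number. The helpers "parse_availability()" and "parse_price()" are hypothetical and not part of the original notebook.

```python
import re


def parse_availability(text):
    """Return (available, stock) from text such as 'In stock (22 available)'."""
    cleaned = " ".join(text.split()).lower()  # collapse whitespace and newlines
    available = cleaned.startswith("in stock")
    match = re.search(r"\((\d+)\s+available\)", cleaned)
    stock = int(match.group(1)) if match else 0  # assumption: no number means 0 copies
    return available, stock


def parse_price(text):
    """Return the first decimal number in the text, ignoring currency symbols."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


availability_example = parse_availability("\n    In stock (22 available)\n")
price_example = parse_price("£51.77")
```

Because "parse_price()" searches for the number instead of slicing, it gives the same result whether or not the currency symbol is preceded by a stray character from a mis-decoded response.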

<span style="background-color:yellow; font-size:20px;">Don't run the code below. It will scrape the data in real time. Instead, skip the "get_books()" code and continue with "read_csv".</span>

The cells below scrape the books into "library" and then check it for duplicates. If any are found, they are printed and a new library without duplicates is created.

In [None]:
library = {}  # Library containing ALL books, even duplicates
get_books()

In [None]:
seen = set() # Set that can't contain duplicates, used for a verification purpose  
duplicates_library = {} # Library containing only books that have been duplicated (both original and duplicate)
unique_library = {} # Library containing books without duplicates

clean_duplicates(library)

In [None]:
duplicates_library

Making a dataframe out of a dictionary

In [None]:
df = pd.DataFrame.from_dict(library).T # Read library as a dataframe for analysis (transposed so each book is a row)

In [None]:
df.to_csv("books_to_scrap.csv", index=True)  # Saves with index (the book ID)

<span style="background-color:yellow; font-size:20px;">To use the CSV, start here and continue downwards. The cells below use the already scraped data from the CSV file.</span>

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/MollenFerneus/Data_Projects_Scrapping_Books/refs/heads/main/books_to_scrap.csv",index_col=0)  # Replace with your actual file name

In [None]:
word_to_number = {"one": 1, "two": 2, "three": 3, "four": 4,"five":5}
df["rating"] = df["rating"].replace(word_to_number) # Changes the rating from a word to a number

In [None]:
df["price"] = pd.to_numeric(df["price"]) # Changes the price from a string to a number (float)

The rating is changed from a word to an integer.

The price is changed from a string to a number (float), not an integer.
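
The rating conversion above uses "replace()", which leaves any value missing from the dictionary unchanged. A small sketch of an alternative with "Series.map()", which turns unmapped values into NaN so they are easy to spot. The sample ratings are made up, and "six" is deliberately invalid.

```python
import pandas as pd

word_to_number = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# Made-up sample; "six" is not a valid rating on purpose
ratings = pd.Series(["three", "one", "five", "six"])

numeric = ratings.map(word_to_number)  # values missing from the dictionary become NaN
unmapped = ratings[numeric.isna()]     # the original words that could not be converted
```

If "unmapped" is empty, every rating was converted and the column can safely be cast to an integer type.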

________________________________________________________________

3. **Book availability.** Check how many books are unavailable in the store.

In [None]:
df_not_in_stock = df[df["available"] == False]
df_not_in_stock

________________________________________________________________

3. **Availability**: Books in stock vs. out of stock.

In [None]:
df_stock = df[["title", "stock"]]
df_stock.sort_values(by = "stock", ascending = False)

Sum the stock of all books in the store.

In [None]:
df_stock["stock"].sum()

In [None]:
df

_____________________________________

4. **Sort books by price.**

In [None]:
df_price = df[["title","price"]]
df_price.sort_values(by = "price", ascending = False)


In [None]:
fig_price = px.bar(df, x=df.index.astype(str),y="price").update_xaxes(categoryorder='total ascending')
fig_price.update_layout(
    xaxis_title='Index',  # Change X-axis title
    yaxis_title='Price'  # Change Y-axis title
)

____________________________

5. **Rating and rating distribution**

In [None]:
df_rating = df["rating"].value_counts()
df_rating

In [None]:
fig_rating = px.bar(df_rating, text_auto=True)

fig_rating.update_layout(
    xaxis_title='Rating',  # Change X-axis title
    yaxis_title='Number of Books'  # Change Y-axis title
)

_____________________________

6. **Price vs Rating Correlation**

In [None]:
df_price_rating = df[["price","rating"]]
df_price_rating

In [None]:
price_rating_corr = df["price"].corr(df["rating"])

In [None]:
price_rating_corr
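
By default "corr()" computes the Pearson correlation coefficient. As a sketch of what that number measures, the hypothetical "pearson()" helper below computes it by hand on made-up data: the covariance of the two columns divided by the product of their spreads.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up data for illustration only
prices = [10.0, 20.0, 30.0, 40.0]
rising_ratings = [1, 2, 3, 4]      # rating rises in step with price
unrelated_ratings = [3, 1, 1, 3]   # no linear trend with price

r_rising = pearson(prices, rising_ratings)        # perfect linear relationship
r_unrelated = pearson(prices, unrelated_ratings)  # no linear relationship
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0, as for the book data, indicates none. It does not rule out a non-linear relationship.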

____________________________________

7. **Genre distribution**

In [None]:
df_genre = df["genre"].value_counts()
df_genre

In [None]:
fig_genres = px.bar(df_genre, text_auto=True)
fig_genres.update_layout(
    xaxis_title='Book Genre',  # Change X-axis title
    yaxis_title='Number of Books'  # Change Y-axis title
)


__________________________________________________

8. **Price Distribution**

In [None]:
price_range = df['price'].max() - df['price'].min()
price_range

In [None]:
price_average = df["price"].mean()
price_average

In [None]:
price_SD = df["price"].std()
price_SD
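
Note that pandas "std()" computes the sample standard deviation (it divides by n - 1) by default. A short sketch with the standard library on made-up prices shows the difference from the population version, along with the range calculation used above.

```python
import statistics

# Made-up prices for illustration only
sample_prices = [10.0, 20.0, 30.0, 40.0, 50.0]

sample_sd = statistics.stdev(sample_prices)       # divides by n - 1, like pandas .std()
population_sd = statistics.pstdev(sample_prices)  # divides by n
sample_range = max(sample_prices) - min(sample_prices)
```

With 1000 books the two versions differ only marginally, so the choice does not change the conclusions here.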

In [None]:
stat, p_value = shapiro(df['price'])
print(f"Shapiro-Wilk Test: Stat={stat}, p-value={p_value}")
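
One way to build intuition for the test result is to run the same test on synthetic data. The sketch below uses made-up numbers, not the book prices: it draws 1000 values spread evenly between 10 and 60 and applies "shapiro()" to them. The seed, bounds and sample size are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
uniform_prices = rng.uniform(10, 60, size=1000)  # evenly spread, not bell-shaped

uniform_stat, uniform_p = shapiro(uniform_prices)
rejects_normality = uniform_p < 0.05  # conventional 5% significance level
```

Evenly spread data like this still gives a statistic fairly close to 1 while the p-value is tiny, which is why the p-value, not the statistic alone, is used to decide.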

In [None]:
fig_histogram = px.histogram(df["price"], nbins=10, title="Histogram Showing Distribution")
fig_histogram.update_layout(
    xaxis_title='Price',  # Change X-axis title
    yaxis_title='Number of Books'  # Change Y-axis title
)


____________________________________

9. **Genre vs. Price**

In [None]:
df_genre_price = df[["price", "genre"]]
df_genre_price = df_genre_price.groupby('genre')['price'].mean().reset_index()
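
Since several genres contain only one book, their "average" price is just that single book's price. A minimal sketch, on made-up data, of computing the book count next to the mean so that thinly populated genres can be filtered out. The column names "mean_price" and "n_books" and the threshold of 2 are assumptions for illustration.

```python
import pandas as pd

# Made-up data for illustration only
books = pd.DataFrame({
    "genre": ["poetry", "poetry", "crime", "poetry"],
    "price": [10.0, 20.0, 45.0, 30.0],
})

# Named aggregation: mean price and number of books per genre
summary = (
    books.groupby("genre")["price"]
    .agg(mean_price="mean", n_books="count")
    .reset_index()
)

# Keep only genres with at least 2 books (arbitrary threshold)
reliable = summary[summary["n_books"] >= 2]
```

The same "summary" frame can be passed to "px.bar()" with "y" set to "mean_price" in place of the plain mean.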

In [None]:
fig_genre_price = px.bar(df_genre_price, x="genre",y="price").update_xaxes(categoryorder = "total ascending")
fig_genre_price