# Web scraping data about philosophy books
<br>
This program scrapes information about philosophy books listed for sale on books.toscrape.com. For each book it collects the title, price, star rating, and availability. The collected data gets a minor cleanup and is saved to a .csv file. The program then reads the .csv file back and visualises the data with matplotlib charts.

## Environment

In [None]:
from bs4 import BeautifulSoup as bs # webscraping
import matplotlib.pyplot as plt # visualization
import pandas as pd # data structure
import requests # HTTP-requests

## Collect data

In [None]:
url = "https://books.toscrape.com/catalogue/category/books/philosophy_7/index.html"

In [None]:
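response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500
# pass the raw bytes so BeautifulSoup detects the encoding itself;
# response.text can mis-decode characters such as '£'
soup = bs(response.content, 'html.parser')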

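# the link text may be truncated, so prefer the 'title' attribute when it is present
titles = [title.get('title', title.text.strip()) for title in soup.select('h3 a')]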
prices = [price.text.strip() for price in soup.select('div p.price_color')]
ratings = [rating['class'][1] for rating in soup.select('p.star-rating')]
availability = [status.text.strip() for status in soup.select('div p.availability')]
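
To see what each selector picks up, the same four lookups can be run against a small HTML fragment. The markup below is hand-written and only assumed to resemble one book entry on the listing page; it is a sketch for illustration, not a copy of the site's HTML. It also shows why `rating['class'][1]` yields the rating word: BeautifulSoup returns the `class` attribute as a list of class names.

```python
from bs4 import BeautifulSoup

# Hand-written fragment, assumed to resemble one book entry on the listing page.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#">Sample Book</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability"> In stock </p>
  </div>
</article>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

sample_title = sample_soup.select_one("h3 a").text.strip()
sample_price = sample_soup.select_one("div p.price_color").text.strip()
# 'class' is multi-valued, so BeautifulSoup returns a list of class names
rating_classes = sample_soup.select_one("p.star-rating")["class"]
sample_rating = rating_classes[1]
sample_status = sample_soup.select_one("div p.availability").text.strip()

print(sample_title, sample_price, rating_classes, sample_status)
```

Taking index 1 relies on the rating word being the second class name, so a fragment with the classes in a different order would need a different lookup.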

## Create dataframe

In [None]:
# create dataframe
data = {'Title': titles, 'Price': prices, 'Rating': ratings, 'Availability': availability}
df = pd.DataFrame(data)

# the site's UTF-8 text can be mis-decoded as Latin-1, which turns '£' into 'Â£' and 'é' into 'Ã©'.
# The replacements below undo that; they have no effect when the text was decoded correctly.
df['Price'] = df['Price'].str.replace('Â', '', regex=False)
df['Title'] = df['Title'].str.replace('Ã©', 'é', regex=False)

# view dataframe head
print(df.head())
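
The stray characters handled in the cleanup are consistent with UTF-8 bytes being decoded as Latin-1 (ISO-8859-1). The sketch below reproduces the effect without any network access and shows that it can be reversed. That this is what happens to the scraped page is an assumption, not something the source confirms.

```python
# Reproduce the garbling: encode as UTF-8, then decode with the wrong charset.
pound_garbled = "£".encode("utf-8").decode("latin-1")
e_acute_garbled = "é".encode("utf-8").decode("latin-1")
print(pound_garbled, e_acute_garbled)

# Reverse it: re-encode with the wrong charset, then decode with the right one.
pound_repaired = pound_garbled.encode("latin-1").decode("utf-8")
print(pound_repaired)
```

Each garbled pair corresponds to the two UTF-8 bytes of one original character, which is why the cleanup targets 'Â', 'Ã', and '©'.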

## Save data

In [None]:
# save dataframe to csv (tab-separated, so it must be read back with sep='\t')
df.to_csv('philosophy_books_data.csv', index=False, sep='\t', encoding='utf-8')

print("Data has been saved to a .csv file")

## Read data

In [None]:
# read the tab-separated file back into a new dataframe
loaded_df = pd.read_csv('philosophy_books_data.csv', sep='\t', encoding='utf-8')

# view dataframe
print(loaded_df)

## Data visualisation

### Availability

In [None]:
# count the number of books per availability status
availability_counts = df['Availability'].value_counts()

plt.pie(availability_counts, labels=availability_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Availability of philosophy books')
plt.show()

### Rating

In [None]:
# convert the 'Rating' column to an ordered categorical datatype
rating_order = ['One', 'Two', 'Three', 'Four', 'Five']
df['Rating'] = pd.Categorical(df['Rating'], categories=rating_order, ordered=True)

plt.figure(figsize=(8, 5))
df['Rating'].value_counts().sort_index().plot(kind='bar', color='green')
plt.xlabel('Rating')
plt.ylabel('Number of books')
plt.title('Book rating')
plt.xticks(rotation=0, ha='center')
plt.show()

### Price versus rating

In [None]:
# strip the currency symbol and thousands separators, then convert prices to floats
df['Price'] = df['Price'].replace(r'[£,]', '', regex=True).astype(float)

# scatter plot
plt.figure(figsize=(10, 6))
rating_to_number = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
plt.scatter(df['Price'], df['Rating'].map(rating_to_number), color='purple', alpha=0.5)
plt.xlabel('Price (£)')
plt.ylabel('Rating')
plt.title('Price versus rating')
# only show whole numbers
plt.yticks(range(1, 6))
plt.show()
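
The price conversion and the rating mapping can be checked in isolation on a few strings. This is a minimal sketch using only the standard library; `parse_price` and `RATING_TO_NUMBER` are hypothetical names, not part of the program above, and the sample strings are made up.

```python
import re


def parse_price(text: str) -> float:
    """Strip the pound sign and thousands separators, then convert to float."""
    return float(re.sub(r"[£,]", "", text))


# Same word-to-number mapping as used for the scatter plot's y-axis.
RATING_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

print(parse_price("£51.77"))
print(parse_price("£1,234.50"))
print(RATING_TO_NUMBER["Three"])
```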

### Pricing

In [None]:
plt.hist(df['Price'].astype(float), bins=20, color='skyblue', edgecolor='black')

plt.title('Price distribution of philosophy books')
plt.xlabel('Price (£)')
plt.ylabel('Number of books')

plt.show()
