# 📊 What we found after analyzing 1,000 Amazon Data Science Books 📚

## 🔥 Table of Contents
* [📚 Amazon Data Science Books Dataset](#books-dataset)
* [1. 🔎 Exploratory Data Analysis](#data-analysis)
    * [1.1. 💰 Price vs. Reviews](#analyse-price-reviews)
    * [1.2. 💰 Price vs. Book Length](#analyse-price-book-length)
    * [1.3. 📚 Best Python & Machine Learning Books](#analyse-best-books)
* [2. 💎 Clustering Book Titles](#clustering-book-titles)
* [3. 🌐 Scraping Amazon Book Reviews](#scraping)

### Useful Python Libraries for Data Science

* **NumPy (Numerical Python)** provides a powerful N-dimensional array object.
* **Pandas (Python Data Analysis)** is heavily used for data manipulation and analysis.
* **Matplotlib** provides data visualizations as well as an object-oriented API for embedding those plots into applications.
* **Plotly Express** is a high-level API for creating figures.

In [None]:
!pip install numpy pandas matplotlib plotly

# !pip list

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

## 📚 Amazon Data Science Books Dataset <a class="anchor" id="books-dataset"></a>

> [This project uses the "Amazon Data Science Books Dataset" available on Kaggle](https://www.kaggle.com/datasets/die9origephit/amazon-data-science-books)

In [None]:
data = pd.read_csv('final_book_dataset_kaggle.csv')
# data.head(n=3)
# data.shape
# data.columns
# print(data.info())

df = pd.DataFrame(data)
print(df)

## 1. 🔎 Exploratory Data Analysis <a class="anchor" id="data-analysis"></a>

### 1.1. 💰 Price vs. Reviews <a class="anchor" id="analyse-price-reviews"></a>

> Do more expensive books have better reviews?

In [None]:
# fig = px.scatter(df, x='price', y='avg_reviews', size='n_reviews')
fig = px.scatter(df, x='price', y='avg_reviews', color='pages', size='n_reviews')

fig.show(renderer='iframe')
# fig.show(renderer='iframe_connected')
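
The scatter plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below is a minimal example on a few made-up rows (the `sample` frame is hypothetical, standing in for `df`), assuming the `price` and `avg_reviews` columns may need numeric coercion. The rank-based value is computed as a Pearson correlation on ranks, which is the definition of Spearman's coefficient.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook these columns would come from `df`.
sample = pd.DataFrame({
    "price": [12.99, 25.50, 39.99, 54.00, 61.25, None],
    "avg_reviews": [4.2, 4.6, 4.4, 4.7, 4.1, 4.5],
})

# Coerce to numeric in case the columns were read as strings, then drop incomplete rows.
clean = sample[["price", "avg_reviews"]].apply(pd.to_numeric, errors="coerce").dropna()

# Pearson measures linear association between the raw values.
pearson = clean["price"].corr(clean["avg_reviews"])

# Spearman is Pearson applied to ranks, so a few very expensive books weigh less.
spearman = clean["price"].rank().corr(clean["avg_reviews"].rank())

print(f"rows used: {len(clean)}, pearson: {pearson:.2f}, spearman: {spearman:.2f}")
```

A value near zero would suggest that price tells us little about the average rating; the sample data here is only for illustration and says nothing about the real dataset.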

### 1.2. 💰 Price vs. Book Length <a class="anchor" id="analyse-price-book-length"></a>

> Do longer books have higher prices?

In [None]:
# fig = px.scatter(df, x='pages', y='price')
# fig = px.scatter(df, x='pages', y='price', color='avg_reviews', size='n_reviews')

# fig.show(renderer='iframe')
# fig.show(renderer='iframe_connected')
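
One way to approach the question without a plot is to bucket books by length and compare the median price per bucket. This is a sketch on made-up rows: the `sample` frame, the bin edges, and the bucket labels are all assumptions chosen for illustration, and the real `pages` and `price` columns may need cleaning first.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook these columns would come from `df`.
sample = pd.DataFrame({
    "pages": [120, 250, 310, 480, 520, 760, 900],
    "price": [14.0, 22.5, 29.0, 35.0, 41.0, 58.0, 63.5],
})

# Arbitrary bin edges; pd.cut uses right-closed intervals by default, e.g. (0, 300].
bins = [0, 300, 600, 1000]
labels = ["short (<=300)", "medium (301-600)", "long (601-1000)"]
sample["length_bucket"] = pd.cut(sample["pages"], bins=bins, labels=labels)

# The median is less sensitive to a few very expensive books than the mean.
median_price = sample.groupby("length_bucket", observed=True)["price"].median()
print(median_price)
```

If the medians rise from the short bucket to the long one, longer books tend to cost more in that sample.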

### 1.3. 📚 Best Python & Machine Learning Books <a class="anchor" id="analyse-best-books"></a>

> What are the best Python books? What are the best ML books?

In [None]:
## Select books whose title contains "Python" (na=False skips missing titles)
python_books = df[df['title'].str.contains("Python", na=False)]

## Top 7 Python books by number of reviews, with average rating as the tie-breaker
best_python_books = python_books.nlargest(7, ['n_reviews','avg_reviews'])
best_python_books

In [None]:
## Select books whose title contains "Machine Learning" (na=False skips missing titles)
ml_books = df[df['title'].str.contains("Machine Learning", na=False)]

## Top 7 ML books by number of reviews, with average rating as the tie-breaker
best_ml_books = ml_books.nlargest(7, ['n_reviews', 'avg_reviews'])
best_ml_books
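
Ranking by review count first favours popular books, while ranking by rating alone favours books with only a handful of reviews. A common compromise, not used in this notebook, is a Bayesian-style weighted rating that pulls books with few reviews toward the overall mean. The sketch below uses made-up rows, and `prior_weight` is a hypothetical tuning value.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook these columns would come from `df`.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "avg_reviews": [4.9, 4.7, 4.2],
    "n_reviews": [3, 900, 2500],
})

# Assumed prior strength: how many "average" reviews every book starts with.
prior_weight = 100
global_mean = books["avg_reviews"].mean()

# Weighted rating = (n * rating + m * global_mean) / (n + m)
n = books["n_reviews"]
books["weighted_rating"] = (n * books["avg_reviews"] + prior_weight * global_mean) / (n + prior_weight)

ranked = books.sort_values("weighted_rating", ascending=False)
print(ranked[["title", "n_reviews", "avg_reviews", "weighted_rating"]])
```

In this sample, "Book A" has the highest raw rating but only three reviews, so it drops below "Book B" once the prior is applied.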

## 2. 💎 Clustering Book Titles <a class="anchor" id="clustering-book-titles"></a>

> Cluster Analysis of book names / TF-IDF and K-Means

> 💡 What are the main types of Data Science books?

In [None]:
## Install sklearn if not done already!

!pip install scikit-learn

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))

X = vectorizer.fit_transform(df["title"])

In [None]:
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

In [None]:
from sklearn.cluster import KMeans
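## Fit K-Means for k = 2..9 and record the inertia
## (sum of squared distances of samples to their closest cluster centre)
sum_of_squared_distances = []
K = range(2, 10)
for k in K:
    km = KMeans(n_clusters=k, max_iter=600, n_init=10)
    km.fit(X)
    sum_of_squared_distances.append(km.inertia_)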

In [None]:
# plt.plot(K, sum_of_squared_distances, 'bx-')

plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
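
The elbow in an inertia curve is often hard to pin down by eye. As a complementary check, which this notebook does not use, the silhouette score can be computed for each candidate k, with higher values indicating better-separated clusters. The sketch below runs on a few sample titles (not taken from the dataset); the `toy_` names and `random_state=0` are assumptions made for a reproducible illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Sample titles for illustration only.
toy_titles = [
    "Python for Data Analysis",
    "Python Data Science Handbook",
    "Learning Python",
    "Deep Learning with Python",
    "Deep Learning",
    "Hands-On Machine Learning",
    "Machine Learning Yearning",
    "Practical Statistics for Data Scientists",
    "Statistics in Plain English",
]
toy_X = TfidfVectorizer(stop_words="english").fit_transform(toy_titles)

# Silhouette needs at least 2 clusters and fewer clusters than samples.
scores = {}
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_X)
    scores[k] = silhouette_score(toy_X, km.labels_)

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On the real TF-IDF matrix the same loop could run over `K`, and the result can be compared with the elbow plot rather than replacing it.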

In [None]:
## Get clusters
true_k = 6
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=600, n_init=10)
model.fit(X)

## Get prediction/ labels
labels = model.labels_
book_cl = pd.DataFrame(list(zip(df["title"],labels)),columns=['title','cluster'])
print(book_cl.sort_values(by=['cluster']))

In [None]:
!pip install wordcloud==1.8.2.2

In [None]:
## Create Wordclouds for Clusters
from wordcloud import WordCloud

for k in range(true_k):
   text = book_cl[book_cl.cluster == k]['title'].str.cat(sep=' ')
   wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)

   ## Create subplot (the 2x3 grid fits the true_k = 6 clusters chosen above)
   plt.subplot(2, 3, k+1).set_title("Cluster " + str(k))
   plt.imshow(wordcloud, interpolation="bilinear")
   plt.axis("off")
plt.show()

In [None]:
cluster_num = '1'

In [None]:
## Books in clusters
book_cl[book_cl.cluster == int(cluster_num)]

In [None]:
## Prediction on unseen data
test = vectorizer.transform(['Tensorflow Deep learning'])
model.predict(test)[0]

## 3. 🌐 Scraping Amazon Book Reviews <a class="anchor" id="scraping"></a>

> Amazon review scraping & Book review summary

In [None]:
## Example urls
product_url = "https://www.amazon.com/Becoming-Data-Head-Understand-Statistics/dp/1119741742/"
reviews_url = "https://www.amazon.com/Becoming-Data-Head-Understand-Statistics/product-reviews/1119741742/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

In [None]:
def get_review_url(product_url):
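    ## Split on '/dp/' rather than the bare string 'dp', so a title slug containing "dp" cannot break the parsing
    try:
        base_url, product_part = product_url.split('/dp/', 1)
        product_number = product_part.split('/')[0]
        review_url = base_url + '/product-reviews/' + product_number + "/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
    except (AttributeError, ValueError):
        ## Missing (NaN) links have no .split(); links without '/dp/' fail to unpack
        review_url = None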
    return review_url
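
A stricter alternative is to pull the product number out with a regular expression. This is a sketch under one assumption: the product number is the 10-character alphanumeric segment that follows `/dp/`, as in the example URL above. The helper name `extract_product_id` is hypothetical.

```python
import re
from typing import Optional

# Assumption: the product number is 10 uppercase letters or digits right after "/dp/".
PRODUCT_ID_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?:[/?]|$)")

def extract_product_id(product_url) -> Optional[str]:
    """Return the product number from a product URL, or None if it cannot be found."""
    if not isinstance(product_url, str):
        # Missing links are read by pandas as NaN (a float), not a string.
        return None
    match = PRODUCT_ID_PATTERN.search(product_url)
    return match.group(1) if match else None

example_url = "https://www.amazon.com/Becoming-Data-Head-Understand-Statistics/dp/1119741742/"
print(extract_product_id(example_url))  # prints 1119741742
```

Anchoring on `/dp/` plus a fixed-length ID means that a URL without a product segment yields `None` instead of a malformed review link.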

In [None]:
## Create review urls for each book in dataset
df['review_urls'] = df['complete_link'].apply(get_review_url)

## Remove empty review urls and create a new dataset
df_reviews = df.dropna(subset=['review_urls']).reset_index(drop=True)

In [None]:
!pip install requests beautifulsoup4
# !pip install lxml

In [None]:
## Code adapted from Jeff James https://gist.github.com/jrjames83/4653d488801be6f0683b91eda8eeb627
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
import logging

headers = {
    "authority": "www.amazon.com",
    "pragma": "no-cache",
    "cache-control": "no-cache",
    "dnt": "1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "sec-fetch-site": "none",
    "sec-fetch-mode": "navigate",
    "sec-fetch-dest": "document",
    "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
}

URLS = df_reviews['review_urls']
book_titles = df_reviews['title']

def get_page_html(page_url: str) -> str:
    ## A timeout keeps one unresponsive page from stalling the whole run
    resp = requests.get(page_url, headers=headers, timeout=30)
    return resp.text

def get_reviews_from_html(page_html: str) -> list:
    # soup = BeautifulSoup(page_html, "lxml")
    ## BeautifulSoup(markup,"html.parser") --> Python HTML parser
    soup = BeautifulSoup(page_html, features="html.parser")
    reviews = soup.find_all("div", {"class": "a-section celwidget"})
    return reviews

def get_review_text(soup_object: BeautifulSoup) -> str:
    review_text = soup_object.find(
        "span", {"class": "a-size-base review-text review-text-content"}
    ).get_text()
    return review_text.strip()

def get_number_stars(soup_object: BeautifulSoup) -> str:
    stars = soup_object.find("span", {"class": "a-icon-alt"}).get_text()
    return stars.strip()

def orchestrate_data_gathering(single_review: BeautifulSoup) -> dict:
    return {
        "review_text": get_review_text(single_review),
        "review_stars": get_number_stars(single_review)
    }

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    all_results = []

    for i in range(len(URLS)):
        logging.info(URLS[i])
        html = get_page_html(URLS[i])
        reviews = get_reviews_from_html(html)
        for rev in reviews:
            data = orchestrate_data_gathering(rev)
            data.update({'title': df_reviews['title'][i]})
            all_results.append(data)

    out = pd.DataFrame.from_records(all_results)
    logging.info(f"Total number of reviews {out.shape[0]}")
    ## %H is the zero-padded hour; %h is not an hour code (where it is accepted it yields the abbreviated month name)
    save_name = f"book_reviews_{datetime.now().strftime('%Y-%m-%d-%H')}.csv"
    logging.info(f"saving to {save_name}")
    out.to_csv(save_name, index=False)
    logging.info('Done yayy')

In [None]:
# book_reviews = pd.read_csv('book_reviews_2022-12-14-12.csv')
## Reuse the file name from the scraping cell; rebuilding it from the clock fails once the hour changes
book_reviews = pd.read_csv(save_name)

## Aggregate reviews for each book title
book_reviews['review_text'] = book_reviews['review_text'].astype(str)
book_reviews_agg = book_reviews.groupby(['title'], as_index = False).agg({'review_text': ' '.join})
book_reviews_agg
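
The `review_stars` column holds the raw text of the rating element, so it needs parsing before it can be averaged. The sketch below assumes strings of the form "4.0 out of 5 stars"; that format, the helper name `parse_star_rating`, and the sample rows are all assumptions for illustration.

```python
import re
from typing import Optional

import pandas as pd

def parse_star_rating(stars_text) -> Optional[float]:
    """Extract the leading number from text such as '4.0 out of 5 stars'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of\s+5", str(stars_text))
    return float(match.group(1)) if match else None

# Hypothetical scraped rows; in the notebook this would be `book_reviews`.
sample_reviews = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "review_stars": ["5.0 out of 5 stars", "3.0 out of 5 stars", "4.0 out of 5 stars"],
})

# Convert the text to numbers, then average per book title.
sample_reviews["stars"] = sample_reviews["review_stars"].apply(parse_star_rating)
mean_stars = sample_reviews.groupby("title")["stars"].mean()
print(mean_stars)
```

Text that does not match the expected pattern becomes a missing value, which `mean()` skips by default.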

In [None]:
## Install Bert extractive summarizer if not done already!

!pip install bert-extractive-summarizer
# !pip install transformers summarizer
# !pip uninstall summarizer

# import transformers, torch, tensorflow
# from platform import python_version
# print("python={}".format(python_version()))
# print("transformers=={}\ntorch=={}\ntensorflow=={}\n".
#       format(transformers.__version__, torch.__version__, tensorflow.__version__ ))

## ~~Book Reviews Summarization~~

In [None]:
## Summarizing book reviews
# from summarizer import Summarizer

# bert_model = Summarizer()
# bert_summary = ''.join(bert_model(book_reviews_agg.review_text[2], ratio=0.2))
# print(bert_summary)

In [None]:
# print(book_reviews_agg.review_text[2])

In [None]:
# from IPython.display import display, Markdown
# display(Markdown(book_reviews_agg.review_text[2]))

In [None]:
# display(Markdown(bert_summary))