## Lab - EDA Bivariate Analysis: Diving into Amazon UK Product Insights Part II

**Objective**: Delve into the dynamics of product pricing on Amazon UK to uncover insights that can inform business strategies and decision-making.

**Dataset**: This lab uses the [Amazon UK product dataset](https://www.kaggle.com/datasets/asaniczka/uk-optimal-product-price-prediction/), which provides information on product categories, brands, prices, ratings, and more from Amazon UK. You'll need to download it to start working with it.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

df = pd.read_csv("amz_uk_price_prediction_dataset.csv")

---
### Part 1: Analyzing Best-Seller Trends Across Product Categories

In [None]:
#  Create a crosstab between the product `category` and the `isBestSeller` status. Are there categories where being a best-seller is more prevalent?
# Hint: one option is to calculate the proportion of best-sellers for each category and then sort the categories based on this proportion in descending order.

crosstab_proportions = pd.crosstab(df["category"], df["isBestSeller"], normalize="index").sort_values(by=True, ascending=False)
crosstab_proportions
# Best-sellers are rare in every category. "Grocery" has the highest proportion of best-selling products, and that proportion is still below 6%.
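
The row-normalised crosstab can be cross-checked on a small example. The sketch below uses a hypothetical toy frame (the category labels and flags are made up, not taken from the dataset) and assumes `isBestSeller` is a boolean column, as the `sort_values(by=True)` call above suggests.

```python
import pandas as pd

# Hypothetical toy data; category names and flags are invented for illustration.
toy = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "isBestSeller": [True, False, False, False, True, True, False, False, False, False],
})

# Row-normalised crosstab: each row sums to 1, so the True column is the best-seller share.
shares = pd.crosstab(toy["category"], toy["isBestSeller"], normalize="index")

# Shortcut: the mean of a boolean column is the proportion of True values.
shares_by_mean = (
    toy.groupby("category")["isBestSeller"].mean().sort_values(ascending=False)
)
print(shares_by_mean)
```

Both routes give the same per-category proportion; the `groupby` form is handy when only the best-seller share is needed.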

In [None]:
# Conduct a Chi-square test to determine if the best-seller distribution is independent of the product category.
from scipy.stats import chi2_contingency

crosstab = pd.crosstab(df["isBestSeller"], df["category"])
chi2_statistic, chi2_p_value, _, _ = chi2_contingency(crosstab)
chi2_statistic, chi2_p_value

# chi2_statistic = 36540 indicates a large discrepancy between observed and expected counts: the observed data deviate substantially from what we would expect if the two variables were independent.
# chi2_p_value = 0.0 (a value too small to represent as a float, not an exact zero) is strong evidence against the null hypothesis of independence, and implies an association between category and best-seller status.
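
To make "observed versus expected" concrete, here is a minimal sketch that rebuilds the expected counts and the statistic by hand and compares them with `chi2_contingency`. The 2x3 table of counts is hypothetical, not derived from the dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = not best-seller / best-seller, columns = three categories.
observed = np.array([[90, 180, 30],
                     [10, 20, 70]])

# Expected count under independence = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy returns the statistic, p-value, degrees of freedom and expected counts.
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed)
print(chi2_manual, chi2_scipy, dof)
```

The two statistics agree here because SciPy applies the Yates continuity correction only when there is a single degree of freedom (a 2x2 table).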

In [None]:
# Compute Cramér's V to understand the strength of association between best-seller status and category.
from scipy.stats.contingency import association
association(crosstab, method="cramer")

# Cramér's V = 0.12 implies a weak association between the two variables (category and best-seller status).
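
Cramér's V rescales the chi-square statistic by the sample size and table shape, which is why a very large statistic can still correspond to a weak association. A minimal sketch on a hypothetical table of counts (not the dataset), using the standard formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))):

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Hypothetical counts: rows = not best-seller / best-seller, columns = three categories.
observed = np.array([[90, 180, 30],
                     [10, 20, 70]])

# Uncorrected chi-square statistic, matching the default used by association().
chi2, _, _, _ = chi2_contingency(observed, correction=False)

n = observed.sum()              # total number of observations
k = min(observed.shape) - 1     # min(rows, cols) - 1

cramers_v_manual = np.sqrt(chi2 / (n * k))
cramers_v_scipy = association(observed, method="cramer")
print(cramers_v_manual, cramers_v_scipy)
```

Because the statistic is divided by `n`, the same strength of association gives a much larger chi-square value on a dataset with millions of rows than on a small one.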

In [None]:
# Visualize the relationship between product categories and the best-seller status using a stacked bar chart.
crosstab_proportions = pd.crosstab(df["category"], df["isBestSeller"], normalize="index").sort_values(by=True, ascending=False)
sorted_crosstab = crosstab_proportions.sort_values(by=True, ascending=True).tail(10) # Top 10 based on proportion of best-sellers
sorted_crosstab.plot(kind="barh", stacked=True);

---
### Part 2: Exploring Product Prices and Ratings Across Categories and Brands

In [None]:
# Use a violin plot to visualize the distribution of `price` across different product `categories`. Filter out the top 5 categories based on count for better visualization.

top5 = df.groupby("category")["price"].count().sort_values(ascending=False).head(5).index
top5 = df[df["category"].isin(top5)]
sns.violinplot(data=top5, x="price", y="category", palette="bright",  hue="category");

In [None]:
# Which product category tends to have the highest median price? Don't filter here by top categories.
df.groupby("category")["price"].median().sort_values(ascending=False).head(1)

In [None]:
# Create a bar chart comparing the average price of products for the top 5 product categories (based on count).
sns.barplot(data=top5, x="price", y="category", palette="bright", hue="category");

In [None]:
# Which product category commands the highest average price? Don't filter here by top categories.
df.groupby("category")["price"].mean().sort_values(ascending=False).head(1)

In [None]:
# Visualize the distribution of product `ratings` based on their `category` using side-by-side box plots. Filter out the top 5 categories based on count for better visualization.

top_5_ratings = df.groupby("category")["stars"].count().sort_values(ascending=False).head(5).index
top_5_ratings = df[df["category"].isin(top_5_ratings)]
sns.boxplot(data=top_5_ratings, x="stars", y="category", palette="bright", hue="category");

In [None]:
# Which category tends to receive the highest median rating from customers? Don't filter here by top categories.
df.groupby("category")["stars"].median().sort_values(ascending=False).head(1)

---
### Part 3: Investigating the Interplay Between Product Prices and Ratings

In [None]:
# Calculate the correlation coefficient between `price` and `stars`. Is there a significant correlation between product price and its rating?
correlation = df["price"].corr(df["stars"])
spearman = df["price"].corr(df["stars"], method="spearman")
correlation, spearman

# The Pearson (-0.12) and Spearman (-0.13) coefficients indicate a weak negative linear and a weak negative monotonic association, respectively.
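
The difference between "linear" and "monotonic" association can be seen on a small made-up example: Spearman's coefficient is Pearson's coefficient computed on ranks, so it reaches 1 for any strictly increasing relationship, while Pearson's stays below 1 unless the relationship is a straight line. The series below are hypothetical.

```python
import pandas as pd

# Hypothetical data: y increases with x, but not along a straight line.
x = pd.Series([1, 2, 3, 4, 5, 6])
y = x ** 3

pearson = x.corr(y)                          # measures linear association
spearman = x.corr(y, method="spearman")      # measures monotonic association
pearson_on_ranks = x.rank().corr(y.rank())   # Spearman computed by hand

print(pearson, spearman, pearson_on_ranks)
```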

In [None]:
# Use a scatter plot to visualize the relationship between product rating and price. What patterns can you observe?
sns.scatterplot(data = df, x = "stars", y = "price");

Patterns observed:
- The majority of products are priced between 0 and 20K, with two outliers (80K and 100K).
- Most ratings are 4-5 stars. Oddly, the two very expensive products (80K, 100K) have 0 stars, which may indicate inaccurate data.
- There is no visible relationship between price and rating.

In [None]:
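# The QQ plot shows that product prices don't follow a normal distribution.
import statsmodels.api as sm
# line="s" scales the reference line to the sample mean and standard deviation; a 45-degree line only makes sense for standardized data.
sm.qqplot(df["price"], line="s");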

In [None]:
# Use a correlation heatmap to visualize correlations between all numerical variables.
plt.figure(figsize=(8, 5))
correlation_matrix = df.select_dtypes("number").corr()  # Compute the correlation matrix
sns.heatmap(correlation_matrix, annot=True, annot_kws={"size": 10}, cmap="coolwarm")
plt.show()

---
### Bonus
Remove outliers in product prices and repeat Part 2 and Part 3. What are your insights?

In [None]:
# Remove outliers in product prices.
def tukeys_test_outliers_delete(data):
    data = data.copy()  # Create a copy to avoid the SettingWithCopyWarning
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    
    # Define bounds for the outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Keep only the values that are within the lower and upper bounds
    data = data[(data >= lower_bound) & (data <= upper_bound)]

    return data

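# The assignment aligns on the index: prices outside the bounds become NaN, while the rows themselves stay in df.
df["price"] = tukeys_test_outliers_delete(df["price"])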
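
A small sketch of what this step does, on a hypothetical toy frame (the values are made up): Tukey's fences are computed from the quartiles, and assigning the filtered series back to the column leaves `NaN` where an outlier was, rather than dropping the row.

```python
import pandas as pd

# Hypothetical prices with one obvious outlier.
prices = pd.Series([5.0, 7.0, 8.0, 9.0, 10.0, 12.0, 500.0])

# Tukey's fences: 1.5 * IQR below Q1 and above Q3.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the fences (the outlier's index label disappears).
within = prices[(prices >= lower_bound) & (prices <= upper_bound)]

toy = pd.DataFrame({"price": prices, "stars": [4.5, 4.0, 3.5, 5.0, 4.2, 4.8, 0.0]})
toy["price"] = within  # index-aligned: the missing label is filled with NaN

print(toy)
print(len(toy), toy["price"].isna().sum())

# If the rows themselves should go, drop them explicitly.
toy_without_outliers = toy.dropna(subset=["price"])
```

Keeping the rows means counts and statistics on other columns such as `stars` are unchanged, while `count()`, `mean()` and `median()` on `price` skip the `NaN` values.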

---
### Part 2: Exploring Product Prices and Ratings Across Categories and Brands

In [None]:
# The violin plot is much easier to read now, and the top 5 categories differ from the top 5 before outlier removal.
top5 = df.groupby("category")["price"].count().sort_values(ascending=False).head(5).index
top5 = df[df["category"].isin(top5)]
sns.violinplot(data=top5, x="price", y="category", palette="bright",  hue="category");

In [None]:
# The highest median price used to be "Laptops 1042.725". It's changed to "Desktop PCs 74.0".
df.groupby("category")["price"].median().sort_values(ascending=False).head(1)

In [None]:
sns.barplot(data=top5, x="price", y="category", palette="bright", hue="category");

In [None]:
# The highest average price used to be "Laptops 1087.987827". Now it's "Motherboards 68.772432".
df.groupby("category")["price"].mean().sort_values(ascending=False).head(1)

In [None]:
# The box plots didn't change because `category` and `stars` aren't affected by the outlier removal.

top_5_ratings = df.groupby("category")["stars"].count().sort_values(ascending=False).head(5).index
top_5_ratings = df[df["category"].isin(top_5_ratings)]
sns.boxplot(data=top_5_ratings, x="stars", y="category", palette="bright", hue="category");

In [None]:
df.groupby("category")["stars"].median().sort_values(ascending=False).head(1)

---
### Part 3: Investigating the Interplay Between Product Prices and Ratings

In [None]:
# Calculate the correlation coefficient between `price` and `stars`. Is there a significant correlation between product price and its rating?
correlation = df["price"].corr(df["stars"])
spearman = df["price"].corr(df["stars"], method="spearman")
correlation, spearman

# Both coefficients moved closer to zero after removing the price outliers: Pearson from -0.12 to -0.07 and Spearman from -0.13 to -0.06, so the already weak negative association is weaker still.

In [None]:
# The scatter plot now shows no observable pattern.
sns.scatterplot(data = df, x = "stars", y = "price");

In [None]:
# The QQ plot still shows that product prices don't follow a normal distribution.
import statsmodels.api as sm
# Prices removed as outliers are NaN at this point, so drop them before plotting.
sm.qqplot(df["price"].dropna(), line="s");

In [None]:
# Correlation heatmap still shows very weak correlations.
plt.figure(figsize=(8, 5))
correlation_matrix = df.select_dtypes("number").corr()  # Compute the correlation matrix
sns.heatmap(correlation_matrix, annot=True, annot_kws={"size": 10}, cmap="coolwarm")
plt.show()