# Lab - EDA Bivariate Analysis: Diving into Amazon UK Product Insights Part II
__Objective:__ Delve into the dynamics of product pricing on Amazon UK to uncover insights that can inform business strategies and decision-making.

__Dataset:__ This lab uses the Amazon UK product dataset, which provides information on product categories, brands, prices, ratings, and more. You'll need to download it before you can start working with it.

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

%matplotlib inline

dataset = 'amz_uk_price_prediction_dataset.csv'
df = pd.read_csv(dataset)

df.head()

Unnamed: 0,uid,asin,title,stars,reviews,price,isBestSeller,boughtInLastMonth,category
0,1,B09B96TG33,"Echo Dot (5th generation, 2022 release) | Big ...",4.7,15308,21.99,False,0,Hi-Fi Speakers
1,2,B01HTH3C8S,"Anker Soundcore mini, Super-Portable Bluetooth...",4.7,98099,23.99,True,0,Hi-Fi Speakers
2,3,B09B8YWXDF,"Echo Dot (5th generation, 2022 release) | Big ...",4.7,15308,21.99,False,0,Hi-Fi Speakers
3,4,B09B8T5VGV,"Echo Dot with clock (5th generation, 2022 rele...",4.7,7205,31.99,False,0,Hi-Fi Speakers
4,5,B09WX6QD65,Introducing Echo Pop | Full sound compact Wi-F...,4.6,1881,17.99,False,0,Hi-Fi Speakers


In [26]:
df.shape

(2443651, 9)

## Part 1: Analyzing Best-Seller Trends Across Product Categories

__Objective:__ Understand the relationship between product categories and their best-seller status.

1. __Crosstab Analysis:__
   * Create a crosstab between the product category and the _isBestSeller_ status.
   * Are there categories where being a best-seller is more prevalent?
   * Hint: one option is to calculate the proportion of best-sellers for each category and then sort the categories based on this proportion in descending order.

In [34]:
# 1. Creating a crosstab between 'category' and 'isBestSeller'
# This crosstab shows the counts of best-sellers and non-best-sellers across different product categories.
crosstab_category_bestseller = pd.crosstab(df['category'], df['isBestSeller'])

# Display the crosstab
display(crosstab_category_bestseller)

category,isBestSeller=False,isBestSeller=True
3D Printers,247,1
3D Printing & Scanning,4065,2
Abrasive & Finishing Products,245,5
Action Cameras,1696,1
Adapters,251,3
...,...,...
Wind Instruments,243,7
Window Treatments,234,5
Women,17559,213
Women's Sports & Outdoor Shoes,1939,20


In [36]:
# 2. Calculate the Proportion of Best-Sellers
# Adding a column to calculate the proportion of best-sellers
crosstab_category_bestseller['Total'] = crosstab_category_bestseller.sum(axis=1)
crosstab_category_bestseller['BestSeller_Proportion'] = crosstab_category_bestseller[True] / crosstab_category_bestseller['Total']

# Display the updated crosstab with the proportion of best-sellers
crosstab_category_bestseller

category,isBestSeller=False,isBestSeller=True,Total,BestSeller_Proportion
3D Printers,247,1,248,0.004032
3D Printing & Scanning,4065,2,4067,0.000492
Abrasive & Finishing Products,245,5,250,0.020000
Action Cameras,1696,1,1697,0.000589
Adapters,251,3,254,0.011811
...,...,...,...,...
Wind Instruments,243,7,250,0.028000
Window Treatments,234,5,239,0.020921
Women,17559,213,17772,0.011985
Women's Sports & Outdoor Shoes,1939,20,1959,0.010209


In [38]:
# 3. Sort categories by the proportion of best-sellers in descending order
crosstab_sorted = crosstab_category_bestseller.sort_values(by='BestSeller_Proportion', ascending=False)

# Display the sorted table
crosstab_sorted[['BestSeller_Proportion']]

category,BestSeller_Proportion
Grocery,0.058135
Smart Home Security & Lighting,0.057692
Health & Personal Care,0.057686
Mobile Phone Accessories,0.042471
Power & Hand Tools,0.035339
...,...
Projectors,0.000000
Printer Accessories,0.000000
Power Supplies,0.000000
Basketball Footwear,0.000000
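
As an alternative to adding the `Total` and `BestSeller_Proportion` columns by hand, `pd.crosstab` can return row-wise proportions directly through its `normalize='index'` argument. The sketch below uses a small made-up DataFrame (`toy`, not the Amazon data) so that it runs on its own; only the column names mirror the lab's dataset.

```python
import pandas as pd

# Made-up data for illustration only; column names mirror the lab's dataset.
toy = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "isBestSeller": [True, False, False, False, True, True,
                     False, False, False, False],
})

# normalize="index" divides each row by its row total,
# so every row of the result sums to 1.
proportions = pd.crosstab(toy["category"], toy["isBestSeller"], normalize="index")

# Keep the share of best-sellers per category, largest first.
best_seller_share = proportions[True].sort_values(ascending=False)
print(best_seller_share)
```

On the real data the same two lines would apply to `df['category']` and `df['isBestSeller']`. The raw counts are lost in this form, so the unnormalized crosstab is still the one to pass to a Chi-square test.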


__Categories where being a Best-Seller is more prevalent:__

- __Grocery:__ This category has the highest proportion of best-sellers, with 5.81% of products being best-sellers.

- __Smart Home Security & Lighting:__ This category follows closely behind, with 5.77% of products being best-sellers.

- __Health & Personal Care:__ Practically tied with the previous category, also at 5.77% (0.057686 versus 0.057692). Health-related products are essential and are often repeat purchases, which may explain why this category has a high proportion of best-sellers.

- __Mobile Phone Accessories and Power & Hand Tools:__ These categories also show a moderate prevalence of best-sellers, at around 4.25% and 3.53%, respectively.

2. __Statistical Tests:__
   * Conduct a Chi-square test to determine if the best-seller distribution is independent of the product category.
   * Compute Cramér's V to understand the strength of association between best-seller status and category.

_Null Hypothesis (H0)_ : The product category and best-seller status are independent. This means that the likelihood of a product being a best-seller is not influenced by its product category. In other words, the proportion of best-sellers is the same across all product categories.

_Alternative Hypothesis (H1)_: The product category and best-seller status are not independent. This means that the likelihood of a product being a best-seller depends on its product category. In other words, certain categories may have a higher or lower proportion of best-sellers than others.


In [43]:
# 1. Chi-square Test of Independence

# Import necessary libraries
from scipy.stats import chi2_contingency

# Create a crosstab for category and isBestSeller
crosstab = pd.crosstab(df['category'], df['isBestSeller'])

# Perform the Chi-square test of independence
chi2_stat, p_value, dof, expected = chi2_contingency(crosstab)

# Print the results
print(f"Chi-square statistic: {chi2_stat}")
print(f"p-value: {p_value}")

Chi-square statistic: 36540.20270061387
p-value: 0.0


__Interpretation of the Chi-square Test Results:__
- __Chi-square statistic: 36540.20__: The large Chi-square statistic suggests a significant difference between the observed and expected counts in the crosstab. This means the observed distribution of best-sellers across product categories is far from what we would expect if they were independent of each other.

- __p-value: 0.0__: The printed value of 0.0 is a floating-point underflow: the true p-value is positive but too small to represent, and it is far below the typical threshold of 0.05. This indicates that the association between product category and best-seller status is statistically significant.

__Conclusion:__ Reject the null hypothesis. Given the extremely low p-value, we conclude that product category and best-seller status are dependent: there is a statistically significant association between the two variables.
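
To make the comparison of observed and expected counts concrete, the following minimal sketch computes the Pearson Chi-square statistic by hand. It uses a made-up 3×2 table of counts (the name `observed` and its values are illustrative, not taken from the Amazon data) and plain Python only.

```python
# Toy contingency table (made-up counts, not the Amazon data).
# Rows are categories; columns are (not best-seller, best-seller).
observed = [
    [90, 10],
    [80, 20],
    [95, 5],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum over all cells of (O - E)^2 / E.
chi2_by_hand = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof_by_hand = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi2 = {chi2_by_hand:.4f}, dof = {dof_by_hand}")
```

Passing the same table to `chi2_contingency` should give the same statistic, since SciPy applies its continuity correction only when there is a single degree of freedom.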

In [51]:
# 2. Calculate Cramér's V (measures the strength of association between two categorical variables)

from scipy.stats.contingency import association

# Calculate Cramér's V using the crosstab from the chi-square test
cramers_v = association(crosstab, method='cramer')

# Print the result
print(f"Cramér's V: {cramers_v}")


# Interpretation of Cramér's V:
# 0 to 0.1: Very weak or no association.
# 0.1 to 0.3: Weak to moderate association.
# 0.3 to 0.5: Moderate to strong association.
# 0.5 and above: Strong association.

Cramér's V: 0.1222829439760564


__Interpretation of Cramér's V:__
- __Cramér's V = 0.122__ falls at the low end of the 0.1 to 0.3 range, which indicates a _weak_ association between product category and best-seller status. The Chi-square test established that the relationship is statistically significant, but with roughly 2.4 million rows even small differences between categories reach significance, so the effect size is the more informative number here.

In practical terms, product category is associated with whether a product is a best-seller, but it is not the dominant factor. Other factors (such as price, reviews, product quality, or market trends) may relate more strongly to whether a product achieves best-seller status.
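
As a sanity check, Cramér's V can be recovered from the Chi-square statistic through V = sqrt(chi2 / (n * min(r - 1, c - 1))), where r and c are the numbers of rows and columns in the crosstab. The sketch below plugs in the values printed earlier. It assumes that the crosstab covers all 2,443,651 rows reported by `df.shape` (no rows dropped for missing values) and that no continuity correction was applied; the variable names are illustrative.

```python
import math

chi2_stat = 36540.20270061387  # Chi-square statistic printed earlier
n = 2443651                    # row count from df.shape; assumes none were dropped
min_dim = 1                    # min(rows - 1, columns - 1); isBestSeller has two levels

# Cramér's V rescales chi-square by the sample size and table dimensions.
cramers_v_by_hand = math.sqrt(chi2_stat / (n * min_dim))
print(f"Cramér's V (by hand): {cramers_v_by_hand:.4f}")
```

Because the statistic is divided by `n`, a very large Chi-square value can still correspond to a small effect size when the dataset is this large.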

## Part 2: Exploring Product Prices and Ratings Across Categories and Brands

## Part 3: Investigating the Interplay Between Product Prices and Ratings