## Lab - EDA Univariate Analysis: Diving into Amazon UK Product Insights

**Objective**: Explore the product listing dynamics on Amazon UK to extract actionable business insights. By understanding the distribution, central tendencies, and relationships of various product attributes, businesses can make more informed decisions on product positioning, pricing strategies, and inventory management.

**Dataset**: This lab uses the [Amazon UK product dataset](https://www.kaggle.com/datasets/asaniczka/uk-optimal-product-price-prediction/),
which provides information on product categories, brands, prices, ratings, and more from Amazon UK. Download it before you start working.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("amz_uk_price_prediction_dataset.csv")
df

---
### Part 1: Understanding Product Categories

In [None]:
# 1. Generate a frequency table for the product `category`.
frequency_table = df["category"].value_counts()
frequency_table

In [None]:
# 2. Display the distribution of products across different categories using a bar chart. If the full chart is hard to read, plot only a subset of the top categories.

subset = frequency_table[:5]

barplot = sns.barplot(x=subset.values, 
                      y=subset.index,
                      palette="Set1", 
                      hue=subset.index,
                      legend="full")

plt.legend(loc="lower right")
plt.show()

In [None]:
# 3. For a subset of top categories, visualize their proportions using a pie chart.

plt.figure(figsize=(5, 5))  # increase figsize if the labels overlap
subset.plot.pie(autopct="%.1f%%", startangle=15)
plt.show()
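
A pie chart built from only the top categories shows each category's share of that subset, not of the whole catalogue. One possible way around this, sketched below, is to sum the remaining categories into an "Other" slice. The helper name `top_n_with_other` and the category counts are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def top_n_with_other(counts: pd.Series, n: int = 5) -> pd.Series:
    """Keep the n largest counts and sum the remainder into an 'Other' entry."""
    counts = counts.sort_values(ascending=False)
    top = counts.iloc[:n]
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Other": rest})])
    return top

# Hypothetical category counts, for illustration only
counts = pd.Series({"A": 50, "B": 30, "C": 10, "D": 6, "E": 4})
shares = top_n_with_other(counts, n=2)
print(shares)
```

In the lab itself, the result of `top_n_with_other(frequency_table)` could be passed to `.plot.pie(autopct="%.1f%%")` in the same way as `subset` above.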

---
### Part 2: Delving into Product Pricing

In [None]:
# 1. The mean price (89) is much higher than the median price (19). This points to a right-skewed distribution in which a few extremely high prices pull the mean up.

df["price"].agg(["mean", "median"])

In [None]:
df["price"].mode()

In [None]:
# 2. Measures of dispersion show a very wide spread of prices. The most common price is 9.99, but the most expensive items cost up to 119445 more than that.

df["price"].agg(["var", "std"])
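
Variance and standard deviation do not report the range directly, and both are inflated by extreme values. The sketch below adds the range and the interquartile range (IQR) next to the standard deviation. The helper `spread_summary` and the price values are hypothetical and are used only to keep the example self-contained.

```python
import pandas as pd

def spread_summary(s: pd.Series) -> pd.Series:
    """Return min, max, range, and IQR alongside the standard deviation."""
    q1, q3 = s.quantile([0.25, 0.75])
    return pd.Series({
        "min": s.min(),
        "max": s.max(),
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "std": s.std(),
    })

# Hypothetical prices, NOT taken from the Amazon UK dataset
prices = pd.Series([4.99, 9.99, 9.99, 14.50, 19.00, 25.00, 49.99, 1200.00])
summary = spread_summary(prices)
print(summary)
```

With a single extreme price in the sample, the IQR stays small while the range and the standard deviation grow, which is why the IQR is often the more robust description of spread.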

In [None]:
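# 3. Most prices sit far below the maximum (the median is 19 and the mode is 9.99), so a few extreme values stretch the x-axis and squeeze almost all products into the first bin. This makes the histogram hard to read.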

sns.histplot(df["price"], 
             kde=True, # overlay a kernel density estimate (KDE) curve
             bins=30,  # more bins show finer detail
             color="salmon");
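
An alternative to dropping rows is to plot a log-transformed price, which compresses the long right tail. The sketch below uses `np.log1p` (so that a price of 0 would not break the transform) and counts how many of 30 equal-width bins actually contain data before and after. The price values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, NOT taken from the Amazon UK dataset
prices = pd.Series([4.99, 9.99, 9.99, 14.50, 19.00, 25.00, 49.99, 1200.00])

# log1p(x) = log(1 + x), defined for x = 0 as well
log_prices = np.log1p(prices)

# Count the non-empty bins for the raw and the log-transformed values
raw_counts, _ = np.histogram(prices, bins=30)
log_counts, _ = np.histogram(log_prices, bins=30)
raw_used = int((raw_counts > 0).sum())
log_used = int((log_counts > 0).sum())
print(raw_used, log_used)
```

On the real data, `sns.histplot(np.log1p(df["price"]), bins=30)` would be the corresponding plot. The x-axis then shows log-prices, so the tick labels need to be read accordingly.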

In [None]:
# 4. One way to make the plot readable is to drop the extreme values and focus on prices below the mean price.

subset = df[df["price"] < df["price"].mean()]
sns.histplot(subset["price"], kde=True, bins=30, color="salmon")
plt.show()
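
Filtering below the mean is a quick fix, but the mean is itself pulled up by the extreme prices. A common alternative is Tukey's IQR rule, which keeps values within 1.5 IQR of the quartiles. The sketch below is one way to write it. The function name `drop_iqr_outliers` and the sample prices are hypothetical.

```python
import pandas as pd

def drop_iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    # between() is inclusive on both ends by default
    return s[s.between(q1 - k * iqr, q3 + k * iqr)]

# Hypothetical prices, NOT taken from the Amazon UK dataset
prices = pd.Series([4.99, 9.99, 9.99, 14.50, 19.00, 25.00, 49.99, 1200.00])
trimmed = drop_iqr_outliers(prices)
print(trimmed)
```

Applied to the lab data, `drop_iqr_outliers(df["price"])` could be passed to `sns.histplot` in place of `subset["price"]`.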

In [None]:
# 5. The boxplot shows 2 products priced far higher than the rest: around 80K and 100K.

sns.boxplot(x=df["price"], color="salmon");

---
### Part 3: Unpacking Product Ratings

In [None]:
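# 1. Measures of centrality show that the most frequent value in `stars` is 0 (see the mode in the next cell). Since Amazon ratings start at 1 star, a 0 probably marks a product with no rating yet rather than a real 0-star review. Treat this as an assumption to check against the dataset description.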

df["stars"].agg(["mean", "median"])

In [None]:
df["stars"].mode()

In [None]:
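# 2. Measures of dispersion show a wide spread of ratings, with values from 0 up to 4.4 stars covering the bulk of the products. describe() in the next cell gives the exact quartiles, minimum, and maximum.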

df["stars"].agg(["var", "std"])

In [None]:
df["stars"].describe()

In [None]:
# 3. Skewness is close to 0, so the distribution is roughly symmetric around its mean. On its own, this does not show that the ratings are normally distributed.

df["stars"].skew()

In [None]:
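# Kurtosis of about -1.9 rules out a normal distribution (pandas reports excess kurtosis, which is 0 for a normal distribution). It is also well below the uniform value of -1.2 and close to the theoretical minimum of -2, which suggests two separate clusters of ratings rather than a flat spread.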

df["stars"].kurtosis()
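
To make these reference values concrete, the sketch below computes skewness and excess kurtosis for three synthetic samples: normal, uniform, and two equal spikes. The spike positions (0 and 4.5) are an arbitrary illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Synthetic samples with three different shapes
samples = {
    "normal": rng.normal(size=n),
    "uniform": rng.uniform(0, 5, size=n),
    "two_spikes": rng.choice([0.0, 4.5], size=n),  # half the mass at each value
}

# pandas returns excess kurtosis (normal distribution = 0)
shape = pd.DataFrame({
    name: {"skew": pd.Series(v).skew(), "kurt": pd.Series(v).kurt()}
    for name, v in samples.items()
})
print(shape.round(2))
```

All three samples have skewness near 0, so skewness alone cannot tell them apart. The kurtosis values do: roughly 0 for the normal sample, roughly -1.2 for the uniform one, and roughly -2 for the two spikes.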

In [None]:
# 4. The histogram confirms that 0 is the most frequent value. Ratings of 1, 2, and 3 stars are almost absent, so nearly all non-zero ratings sit at the high end of the scale.

sns.histplot(df["stars"], 
             kde=True, # overlay a kernel density estimate (KDE) curve
             bins=30,  # more bins show finer detail
             color="salmon");
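
If the zeros do stand for unrated products (an assumption, not something the dataset confirms here), it can help to report their share separately and summarize only the rated products. The sketch below shows the idea on a few hypothetical values.

```python
import pandas as pd

# Hypothetical ratings, NOT taken from the Amazon UK dataset
stars = pd.Series([0.0, 0.0, 0.0, 4.5, 4.4, 5.0, 3.9, 0.0])

# Assumption: 0 means "no rating yet"
unrated_share = (stars == 0).mean()
rated = stars[stars > 0]
rated_median = rated.median()

print(f"unrated share: {unrated_share:.0%}")
print(f"median of rated products: {rated_median:.2f}")
```

On the lab data, the same filter would be `df.loc[df["stars"] > 0, "stars"]`, which can then be passed to `sns.histplot` or `.describe()`.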