## Lab - EDA Univariate Analysis: Diving into Amazon UK Product Insights

## Part 1: Understanding Product Categories

**Business Question**: What are the most popular product categories on Amazon UK, and how do they compare in terms of listing frequency?

**Frequency Tables:**

In [None]:
import pandas as pd

# Load the dataset - adjust the path to wherever the CSV file is stored
df = pd.read_csv('amz_uk_price_prediction_dataset.csv')

In [None]:
# Generate a frequency table for the product category
category_freq = df['category'].value_counts()
category_freq.head(5)
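Absolute counts are easier to compare when shown next to each category's share of all listings. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration and are not taken from the dataset); the same `value_counts(normalize=True)` call should work on `df['category']`.

```python
import pandas as pd

# Hypothetical toy data standing in for df['category'] (not the real dataset)
toy = pd.DataFrame({
    "category": ["Sports & Outdoors"] * 6 + ["Beauty"] * 3 + ["Bath & Body"] * 1
})

# Absolute counts, sorted from most to least frequent
toy_freq = toy["category"].value_counts()

# Relative frequencies: each category's share of all listings
toy_share = toy["category"].value_counts(normalize=True)

# Combine both views into one table for easier reading
freq_table = pd.DataFrame({"listings": toy_freq, "share": toy_share.round(3)})
print(freq_table)
```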

**Visualizations:**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Bar chart
plt.figure(figsize=(12, 6))
sns.countplot(data=df, y='category', order=category_freq.index, palette="viridis")
plt.title('Distribution of Product Categories')
plt.show()

# Pie chart for top 5 categories
top_categories = category_freq.head(5)
top_categories.plot.pie(autopct='%1.1f%%', startangle=90, colors=sns.color_palette("viridis", 5))
plt.title('Top 5 Product Categories')
plt.ylabel('')
plt.show()

# To do - just show top categories in countplot
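One possible way to address the to-do above is to cut the frequency table down to the top few categories before plotting. This is only a sketch: the toy data, the `top_n` cut-off, and the choice of a plain matplotlib horizontal bar chart (instead of `sns.countplot`) are assumptions made for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for df['category'] (not the real dataset)
toy = pd.DataFrame({"category": list("AAAAABBBBCCCDDE")})
top_n = 3  # assumed cut-off; pick whatever keeps the chart readable

# Keep only the top_n most frequent categories
top_freq = toy["category"].value_counts().head(top_n)

# Horizontal bars; reverse the order so the largest category sits on top
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(top_freq.index[::-1], top_freq.values[::-1], color="teal")
ax.set_title(f"Top {top_n} Product Categories by Listing Count")
ax.set_xlabel("Number of listings")
plt.show()
```

On the real data, `category_freq.head(top_n)` could take the place of `top_freq`.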

**Interpretation for Part 1: Understanding Product Categories**

The most popular product categories on Amazon UK, based on listing frequency, are as follows:

1. **Sports & Outdoors**, with 836,265 listings, overwhelmingly dominates the platform.
2. **Beauty** with 19,312 listings.
3. **Handmade Clothing, Shoes & Accessories** with 19,229 listings.
4. **Bath & Body** with 19,092 listings.
5. **Birthday Gifts** with 18,978 listings.

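Clearly, "Sports & Outdoors" stands out as the most listed category by a wide margin: its 836,265 listings are roughly 43 times the 19,312 of second-placed "Beauty". The other four categories in the top 5 have comparable numbers of listings (roughly 19,000 to 19,300 each), but they pale in comparison to "Sports & Outdoors."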

Given the vast difference between the first category and the rest, sellers dealing in "Sports & Outdoors" products might face higher competition on Amazon UK. At the same time, the sheer volume suggests a potential high demand in this category.

## Part 2: Delving into Product Pricing

**Business Question**: How are products priced on Amazon UK, and are there specific price points or ranges that are more common?

**Measures of Centrality:**

In [None]:
mean_price = df['price'].mean()
median_price = df['price'].median()
mode_price = df['price'].mode()[0]

mean_price, median_price, mode_price


**Measures of Dispersion:**

In [None]:
variance_price = df['price'].var()
std_dev_price = df['price'].std()
min_price = df['price'].min()
max_price = df['price'].max()
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1  # reuse the quartiles computed above

variance_price, std_dev_price, min_price, max_price, q1, q3, iqr


**Interpretation**:

- The average (mean) price of products listed on Amazon UK is approximately £89.24.
- The median price of the products (the middle value when sorted) is £19.09. This is notably lower than the mean, suggesting that there are several high-priced items skewing the average upwards.
- The lowest-priced items are listed at £0.00, which might indicate promotional or digital products.
- The highest priced item is listed at £100,000. This indicates the presence of some luxury or niche items on the platform.
- The interquartile range (25th to 75th percentile) shows that the middle 50% of products on Amazon UK are priced between £9.99 and £45.99, an IQR of £36.00. This gives a sense of the typical price range on the platform.
  
From a business perspective, it's clear that while Amazon UK does cater to premium segments, a significant portion of its product listings are more affordably priced, making it accessible to a broader customer base.

Now, let's visually represent the distribution of product prices using histograms and box plots.

**Visualizations:**

In [None]:
# Histogram for product prices
plt.figure(figsize=(10, 6))
sns.histplot(data=df['price'], bins=30, kde=True, color="skyblue")
plt.title('Distribution of Product Prices')
plt.show()

# Boxplot for product prices
plt.figure(figsize=(8, 6))
sns.boxplot(data=df['price'], color="lightblue")
plt.title('Box Plot of Product Prices')
plt.show()

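Both the histogram and the box plot are hard to read: a few extreme outliers (up to £100,000) stretch the price axis so far that the bulk of the products, priced below £50, is squeezed into a sliver at the low end.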
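Two common ways to make such plots readable are to trim the data at a chosen percentile before plotting, or to plot the prices on a logarithmic scale. The sketch below shows both on hypothetical toy prices; the values and the 90th-percentile cut-off are assumptions for illustration (on a dataset of this size, a higher cut-off such as the 99th percentile may be more appropriate).

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices with one extreme value (not the real dataset)
toy_prices = pd.Series([5.0, 9.99, 12.5, 19.09, 25.0, 45.99, 60.0, 100000.0])

# Option 1: keep only prices at or below a chosen percentile for plotting
cutoff = toy_prices.quantile(0.90)  # assumed threshold
trimmed = toy_prices[toy_prices <= cutoff]

# Option 2: keep every row but compress the scale.
# log1p(x) = log(1 + x), so prices of £0.00 map to 0 instead of -inf
log_prices = np.log1p(toy_prices)

print(f"Kept {len(trimmed)} of {len(toy_prices)} prices, max is now {trimmed.max()}")
print(f"Largest log-transformed price: {log_prices.max():.2f}")
```

Either series could then be passed to the existing `sns.histplot` or `sns.boxplot` calls in place of `df['price']`. Trimming is for display only; the summary statistics above should still be computed on the full column.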



In conclusion, while most products on Amazon UK are priced within a lower range (as seen from the statistics above), there are a few products that are priced significantly higher, which can be considered outliers.
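To make "outlier" concrete, one widely used convention (Tukey's rule, chosen here as an assumption, not something the lab prescribes) flags values more than 1.5 × IQR beyond the quartiles. With the quartiles reported above (£9.99 and £45.99, so IQR = £36.00), the upper fence would be 45.99 + 1.5 × 36.00 = £99.99. The helper name `iqr_outlier_mask` and the toy data below are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(prices: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = prices.quantile(0.25)
    q3 = prices.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (prices < lower) | (prices > upper)

# Hypothetical toy prices (not the real dataset)
toy_prices = pd.Series([5.0, 9.99, 12.5, 19.09, 25.0, 45.99, 60.0, 100000.0])
mask = iqr_outlier_mask(toy_prices)

print(f"{mask.sum()} of {len(toy_prices)} toy prices flagged as outliers")
print(toy_prices[mask])
```

Applied to the real data, `iqr_outlier_mask(df['price']).mean()` would give the share of listings flagged as outliers.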

## Part 3: Unpacking Product Ratings

**Business Question**: How do customers rate products on Amazon UK, and are there any patterns or tendencies in the ratings?

**Measures of Centrality:**

We'll begin by calculating the mean, median, and mode for the stars (rating) of products to understand how customers generally rate products on Amazon UK.

In [None]:
mean_rating = df['stars'].mean()
median_rating = df['stars'].median()
mode_rating = df['stars'].mode()[0]

mean_rating, median_rating, mode_rating

Here are the measures of centrality for the product ratings:

1. **Mean Rating**: Approximately \(2.15\)
2. **Median Rating**: \(0.0\)
3. **Mode (Most Common Rating)**: \(0.0\)

The results are intriguing. While the mean rating is a bit above 2, both the median and the mode are 0, which means that at least half of the products have a rating of 0. These are most likely products that haven't received any ratings yet, or products for which ratings are unavailable, rather than products that customers actually rated 0 stars.
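If a rating of 0 really does mean "not yet rated" (an assumption that should be checked against the dataset documentation), the measures of centrality can be recomputed for rated products only. A minimal sketch on hypothetical toy ratings:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings (not the real dataset)
toy_stars = pd.Series([0.0, 0.0, 0.0, 0.0, 4.4, 4.5, 4.5, 3.0])

# Assumption: 0 means "no rating yet", so treat it as missing
rated = toy_stars.replace(0, np.nan)

share_unrated = rated.isna().mean()  # fraction of products without a rating
mean_rated = rated.mean()            # pandas skips NaN by default
median_rated = rated.median()
mode_rated = rated.mode()[0]

print(f"Unrated share: {share_unrated:.0%}")
print(f"Mean: {mean_rated:.2f}, median: {median_rated:.2f}, mode: {mode_rated}")
```

The same steps applied to `df['stars']` would show how customers rate products once unrated listings are set aside.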

**Measures of Dispersion:**

Next, let's determine the variance, standard deviation, and interquartile range for product ratings. This will help us understand the consistency or variation in customer feedback.

In [None]:
variance_rating = df['stars'].var()
std_dev_rating = df['stars'].std()
iqr_rating = df['stars'].quantile(0.75) - df['stars'].quantile(0.25)

variance_rating, std_dev_rating, iqr_rating


Here are the measures of dispersion for the product ratings:

1. **Variance**: Approximately \(4.82\)
2. **Standard Deviation**: Approximately \(2.19\)
3. **Interquartile Range (IQR)**: \(4.4\)

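The standard deviation (2.19) is about as large as the mean (2.15), so the ratings are widely spread. The IQR of 4.4 should be read with care, though: because the median is 0, the first quartile must also be 0 (assuming ratings cannot be negative), so the IQR simply equals the third quartile of 4.4. The spread therefore largely reflects the split between products with a rating of 0 and products with an actual rating, rather than wide disagreement in customer feedback.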

**Shape of the Distribution:**

Now, let's calculate the skewness and kurtosis for the `stars` (rating) column to determine the shape of the ratings distribution.

In [None]:
skewness_rating = df['stars'].skew()
kurtosis_rating = df['stars'].kurtosis()

skewness_rating, kurtosis_rating


Here are the measures related to the shape of the distribution for product ratings:

1. **Skewness**: Approximately \(0.0812\)
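   - This value is close to 0, which by itself would suggest a roughly symmetric distribution. Given the large number of 0 ratings noted above, however, it more likely reflects a balance between the cluster of 0 ratings and the remaining rated products than a single symmetric peak.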
2. **Kurtosis**: Approximately \(-1.926\)
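   - pandas' `kurtosis()` reports excess kurtosis, for which a normal distribution scores 0. A negative value indicates lighter tails than a normal distribution; such a distribution is termed "platykurtic". A value this close to -2, the lowest excess kurtosis any distribution can have, is typical of data concentrated in two separate clusters rather than around one peak.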
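To see why near-zero skewness and strongly negative kurtosis can go together with a two-cluster shape, consider a hypothetical series in which half the values are 0 and half are 4.4. The data below is invented purely for illustration and is not the real ratings distribution.

```python
import pandas as pd

# Hypothetical two-cluster data: half "unrated" (0), half rated 4.4
two_clusters = pd.Series([0.0] * 500 + [4.4] * 500)

toy_skew = two_clusters.skew()      # symmetric clusters -> skewness near 0
toy_kurt = two_clusters.kurtosis()  # excess kurtosis, close to -2 here

print(f"Skewness: {toy_skew:.4f}, excess kurtosis: {toy_kurt:.4f}")
```

The toy series is neither bell-shaped nor flat, yet its skewness and kurtosis resemble the values reported above, which is why these two numbers are best read alongside a histogram.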

**Visualizations:**

Let's visualize the distribution of product ratings using a histogram.

In [None]:
# Histogram for product ratings
plt.figure(figsize=(10, 6))
sns.histplot(data=df['stars'], bins=20, kde=True, color="skyblue")
plt.title('Distribution of Product Ratings')
plt.show()

From the visualization, we can infer the following:

**Histogram**:
1. A large number of products have a rating of 0, which confirms our earlier observation. These could be products without any reviews or ratings.
2. Among products with ratings, there seems to be a trend of higher ratings (around 4 to 5 stars) being more common.
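The second observation can be quantified by grouping the non-zero ratings into one-star bands and computing each band's share. This is a sketch under the assumption that 0 means "unrated"; the toy ratings and the band edges are hypothetical.

```python
import pandas as pd

# Hypothetical toy ratings (not the real dataset)
toy_stars = pd.Series([0.0, 0.0, 0.0, 1.0, 2.5, 3.8, 4.2, 4.4, 4.5, 5.0])

# Assumption: 0 means "unrated", so keep only products with a rating
rated = toy_stars[toy_stars > 0]

# One-star-wide bands; each interval includes its right edge
bins = [0, 1, 2, 3, 4, 5]
labels = ["(0, 1]", "(1, 2]", "(2, 3]", "(3, 4]", "(4, 5]"]

band_share = (
    pd.cut(rated, bins=bins, labels=labels)
    .value_counts(normalize=True)  # share of rated products per band
    .sort_index()                  # order bands from low to high
)
print(band_share)
```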
