In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

df = pd.read_csv(r'C:\Users\38095\Documents\GitHub\lab-eda-univariate\amz_uk_price_prediction_dataset.csv')
df

### Part 1: Understanding Product Categories

In [None]:
df.dtypes

1. **Frequency Tables**:
    - Generate a frequency table for the product `category`.

In [None]:
counts = df['category'].value_counts()

In [None]:
frequency_table = df['category'].value_counts()
frequency_table 
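
Raw counts are easier to compare once they are also expressed as shares of all listings. The sketch below shows one way to do that on a small made-up sample. The `toy` frame and its category names are illustrative assumptions, not values from the dataset; on the real data the same calls would apply to `df['category']`.

```python
import pandas as pd

# Hypothetical stand-in for df; the category names are invented for illustration.
toy = pd.DataFrame({"category": ["Books", "Toys", "Books", "Garden", "Books", "Toys"]})

# Absolute counts and relative shares, both sorted from most to least frequent.
toy_counts = toy["category"].value_counts()
toy_shares = toy["category"].value_counts(normalize=True)

# Combine both views into a single frequency table.
toy_freq = pd.DataFrame({"count": toy_counts, "share": toy_shares.round(3)})
print(toy_freq)
```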

 - Which are the top 5 most listed product categories?

In [None]:
# Top 5 categories by number of listings
most_listed = df['category'].value_counts().head(5)
most_listed

2. **Visualizations**:
    - Display the distribution of products across different categories using a bar chart.
    *If you face problems understanding the chart, do it for a subset of top categories.*
    - For a subset of top categories, visualize their proportions using a pie chart.
    Does any category dominate the listings?

In [None]:
sns.barplot(x=most_listed.index, y=most_listed.values, palette="Set3")
plt.xticks(rotation = 45)

In [None]:
df['category'].value_counts().head(5).plot.pie(autopct='%1.1f%%', startangle=0, colors=sns.color_palette("Set3"))
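
A pie chart of a top-N subset shows each category's share within that subset only, which can overstate how dominant a category is across all listings. The sketch below contrasts the two views on made-up labels. The names `labels`, `toy`, and `top_n` are hypothetical and chosen for illustration.

```python
import pandas as pd

# Hypothetical category labels, invented for illustration.
labels = ["A"] * 6 + ["B"] * 2 + ["C"] + ["D"]
toy = pd.Series(labels, name="category")

top_n = 2
# Share of each category across all listings.
shares_all = toy.value_counts(normalize=True)
top = shares_all.head(top_n)

# Share within the top-n subset only (what a pie chart of the subset displays).
shares_within_top = top / top.sum()

print(shares_all.head(top_n))
print(shares_within_top)
```

Comparing `shares_all` with `shares_within_top` makes it clear whether a category dominates the whole catalogue or only the plotted subset.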

### Part 2: Delving into Product Pricing

**Business Question**: How are products priced on Amazon UK, and are there specific price points or ranges that are more common?

1. **Measures of Centrality**:
    - Calculate the mean, median, and mode for the `price` of products.
    - What's the average price point of products listed? How does this compare with the most common price point (mode)?

In [None]:
df['price'].mean()

In [None]:
df['price'].median()

In [None]:
df['price'].mode()
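
To compare the three measures side by side, it helps to reduce them to scalars first. Note that `mode()` returns a Series, because several values can tie for most frequent. The sketch below uses a made-up `prices` sample (an assumption for illustration, not data from the file) with one large value to mimic a long right tail.

```python
import pandas as pd

# Hypothetical prices, invented for illustration; one large value mimics a long right tail.
prices = pd.Series([5.0, 9.99, 9.99, 12.5, 15.0, 250.0])

mean_price = prices.mean()
median_price = prices.median()
# mode() returns a Series because several values can tie; take the first one.
mode_price = prices.mode().iloc[0]

print(f"mean={mean_price:.2f}, median={median_price:.2f}, mode={mode_price:.2f}")
```

A mean well above the median, as in this toy sample, is a common sign that a few expensive items are pulling the average upward.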

2. **Measures of Dispersion**:
    - Determine the variance, standard deviation, range, and interquartile range for product `price`.
    - How varied are the product prices? Are there any indicators of a significant spread in prices?

In [None]:
df['price'].describe()

In [None]:
variance_price = df['price'].var()
std_dev_price = df['price'].std()
min_price = df['price'].min()
max_price = df['price'].max()
range_price = max_price - min_price
quantiles_price = df['price'].quantile([0.25, 0.5, 0.75])

variance_price, std_dev_price, min_price, max_price, range_price, quantiles_price
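
The quartiles above are the ingredients of the interquartile range, which is the difference between the 75th and 25th percentiles. A minimal sketch on a made-up sample (the `prices` values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical prices, invented for illustration.
prices = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0])

# Interquartile range: spread of the middle 50% of the values.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Range: distance between the smallest and largest value.
price_range = prices.max() - prices.min()

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, range={price_range}")
```

On the real data, the same expression would be `quantiles_price[0.75] - quantiles_price[0.25]`.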

3. **Visualizations**:
    - Is there a specific price range where most products fall?
    Plot a histogram to visualize the distribution of product prices.
    *If it's hard to read these diagrams, think about why that is and explain how it could be solved.*

In [None]:
plt.figure(figsize=(8, 10))
plt.bar(quantiles_price.index, quantiles_price.values, color='skyblue')
plt.title('Quantiles of Price')
plt.xlabel('Quantile')
plt.ylabel('Price')
plt.xticks(quantiles_price.index, ['25%', '50%', '75%'])
plt.show()

In [None]:
df['price'].hist()

In [None]:
plt.figure(figsize=(15,5))
plt.hist(df['price'], bins=500, color='skyblue', edgecolor='black')
plt.title('Distribution of Product Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
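
When a histogram is hard to read, a likely cause is a long right tail: a few very high prices stretch the x-axis and squeeze most products into the first bins. Two common remedies are trimming the tail at a high percentile or using log-spaced bins. The sketch below prepares both on synthetic data. The lognormal sample, the 99th-percentile cutoff, and the bin count are assumptions for illustration, not properties of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices, generated for illustration only.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1_000))

# Option 1: keep only prices at or below the 99th percentile.
cutoff = prices.quantile(0.99)
trimmed = prices[prices <= cutoff]

# Option 2: log-spaced bin edges, so each bin spans the same price ratio.
positive = prices[prices > 0]  # a log scale cannot represent zero or negative prices
log_bins = np.geomspace(positive.min(), positive.max(), num=31)
counts, edges = np.histogram(positive, bins=log_bins)

print(f"kept {len(trimmed)} of {len(prices)} rows below {cutoff:.2f}")
print(counts)
```

Either result can be passed to the plotting call, for example `plt.hist(trimmed, bins=50)` or `plt.hist(positive, bins=log_bins)` followed by `plt.xscale('log')`.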

 - Are there products that are priced significantly higher than the rest? Use a box plot to showcase the spread and potential outliers in product pricing.

In [None]:
sns.boxplot(data = df['price'], color="lightblue")
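
By default, the whiskers of a box plot extend to 1.5 times the interquartile range beyond the quartiles, and points outside them are drawn as outliers. The same rule can be applied directly to count those points. A minimal sketch on made-up prices (the `prices` values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical prices, invented for illustration; 500.0 is the obvious extreme value.
prices = pd.Series([8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 500.0])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the default box plot whisker rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outliers.tolist()}")
```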

### Part 3: Unpacking Product Ratings

**Business Question**: How do customers rate products on Amazon UK, and are there any patterns or tendencies in the ratings?

1. **Measures of Centrality**:
    - Calculate the mean, median, and mode for the `rating` of products (stored in the `stars` column).
    - How do customers generally rate products? Is there a common trend?

In [None]:
df.stars.median()

In [None]:
df.stars.mean()

In [None]:
df.stars.mode()

2. **Measures of Dispersion**:
    - Determine the variance, standard deviation, and interquartile range for product `rating`.
    - Are the ratings consistent, or is there a wide variation in customer feedback?

In [None]:
df['stars'].describe()
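
`describe()` reports the standard deviation and the quartiles, but not the variance or the interquartile range directly. The sketch below computes them explicitly on a made-up `stars` sample (the values are assumptions for illustration); on the real data the same calls would apply to `df['stars']`.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration.
stars = pd.Series([3.0, 3.5, 4.0, 4.5, 4.5, 5.0, 5.0], name="stars")

variance_stars = stars.var()  # sample variance (ddof=1 by default)
std_stars = stars.std()       # sample standard deviation
iqr_stars = stars.quantile(0.75) - stars.quantile(0.25)

print(f"variance={variance_stars:.3f}, std={std_stars:.3f}, IQR={iqr_stars:.2f}")
```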

3. **Shape of the Distribution**:
    - Calculate the skewness and kurtosis for the `rating` column. 
    - Are the ratings normally distributed, or do they lean towards higher or lower values?

In [None]:
skewness_rating = df['stars'].skew()
kurtosis_rating = df['stars'].kurtosis()

skewness_rating, kurtosis_rating
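
When reading these two numbers, keep in mind that pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0 rather than 3. A negative skewness means the tail extends towards low values, with most of the mass at the high end. The sketch below shows the sign convention on a made-up sample (the `ratings` values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical ratings, invented for illustration: mostly high, with one low value.
ratings = pd.Series([1.0, 4.0, 4.5, 4.5, 5.0, 5.0, 5.0])

skewness = ratings.skew()
excess_kurtosis = ratings.kurtosis()  # Fisher's definition: normal distribution is about 0

if skewness < 0:
    direction = "left-skewed (tail towards low ratings)"
elif skewness > 0:
    direction = "right-skewed (tail towards high ratings)"
else:
    direction = "symmetric"

print(f"skewness={skewness:.2f} -> {direction}")
print(f"excess kurtosis={excess_kurtosis:.2f}")
```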

4. **Visualizations**:
    - Plot a histogram to visualize the distribution of product ratings. Is there a specific rating that is more common?

In [None]:
sns.histplot(df['stars'], kde=True, bins=30, color="salmon")

In [None]:
df