Batu
Question: Is there a relationship between category and discounts?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print('Set up complete')

# Load data
file_path = '../data/amazon.csv'
amazon_df = pd.read_csv(file_path)

# Report the dataset dimensions, then preview the first rows
print(amazon_df.shape)
amazon_df.head()

In [None]:
def convert_percentage(value):
    value = value.replace('%', '')
    new_value = float(value)
    new_value = new_value/100
    return new_value

def convert_standard_float(value):
    value = value.replace('₹', '')
    value = value.replace(',', '')
    new_value = float(value)
    return new_value

def convert_standard_int(value):
    value = value.replace('₹', '')
    value = value.replace(',', '')
    new_value = int(value)
    return new_value

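working_df = amazon_df.copy()
# Replace the invalid '|' rating with the most common rating (only the 'rating' column, not the whole row)
working_df.loc[working_df['rating'] == '|', 'rating'] = working_df['rating'].mode()[0]
# Fill missing rating counts with the most common value
working_df['rating_count'] = working_df['rating_count'].fillna(working_df['rating_count'].mode()[0])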
working_df['discounted_price'] = working_df['discounted_price'].apply(convert_standard_float)
working_df['actual_price'] = working_df['actual_price'].apply(convert_standard_float)
working_df['discount_percentage'] = working_df['discount_percentage'].apply(convert_percentage)
working_df['rating'] = working_df['rating'].apply(convert_standard_float)
working_df['rating_count'] = working_df['rating_count'].apply(convert_standard_float)

working_df.head()
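
The cleaning step above can also be written with vectorized string operations. The following is a minimal sketch, not the notebook's method: the `sample` rows and the `to_number` helper are hypothetical, and it assumes the raw columns are strings shaped like those in `amazon.csv`.

```python
import pandas as pd

# Hypothetical sample rows imitating the raw string columns in amazon.csv
sample = pd.DataFrame({
    'discounted_price': ['₹399', '₹1,299'],
    'discount_percentage': ['64%', '43%'],
    'rating': ['4.2', '|'],
})

def to_number(series):
    """Strip currency/percent symbols and thousands separators, then parse as float."""
    cleaned = series.astype(str).str.replace(r'[₹,%]', '', regex=True)
    # errors='coerce' turns unparseable values such as '|' into NaN
    return pd.to_numeric(cleaned, errors='coerce')

sample['discounted_price'] = to_number(sample['discounted_price'])
sample['discount_percentage'] = to_number(sample['discount_percentage']) / 100
sample['rating'] = to_number(sample['rating'])
# Fill the rating that failed to parse with the most common valid rating
sample['rating'] = sample['rating'].fillna(sample['rating'].mode()[0])
print(sample)
```

Coercing bad values to NaN first means an unexpected token in any column shows up as a missing value instead of raising an error midway through `apply`.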

## Descriptive Statistics

## Calculate Average Ratings and Discounts
Calculate the average ratings and discounts for each category to determine which categories to keep.

In [None]:
# Calculate average ratings and discounts for each category
category_summary = working_df.groupby('category').agg({
    'discount_percentage': 'mean',
    'rating': 'mean',
    'discounted_price': 'mean',
    'actual_price': 'mean',
    'rating_count': 'count'
}).reset_index()

# Rename columns for clarity
category_summary.columns = ['category', 'avg_discount', 'avg_rating', 'avg_discounted_price', 'avg_actual_price', 'product_count']

# Display the summary
print(category_summary)

## Filter Top Categories
Keep only the top N categories ranked by average rating. For example, we might keep the top 10 categories.

In [None]:
# Filter for top N categories by average rating
top_n = 10  # Change this value to include more or fewer categories
top_categories = category_summary.nlargest(top_n, 'avg_rating')

# Display top categories
print(top_categories)

## Group Less Popular Categories
Group less popular categories into an "Other" category for clearer visualizations.

In [None]:
# Define a threshold for minimum product count to keep a category
threshold = 50  # Categories with fewer than this count will be grouped as 'Other'
popular_categories = category_summary[category_summary['product_count'] >= threshold]
other_categories = category_summary[category_summary['product_count'] < threshold]

# Create a new DataFrame with 'Other' category
other_summary = other_categories[['avg_discount', 'avg_rating', 'avg_discounted_price', 'avg_actual_price']].mean().to_frame().T
other_summary['category'] = 'Other'
other_summary['product_count'] = other_categories['product_count'].sum()

# Combine popular categories and the 'Other' category
final_category_summary = pd.concat([popular_categories, other_summary], ignore_index=True)

# Display the final summary
print(final_category_summary)
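
Note that the 'Other' row above is an unweighted mean of category means, so a category with 2 products counts as much as one with 40. The sketch below, using hypothetical numbers and a hypothetical `summary` frame with the same column names, shows how a product-count-weighted mean could differ.

```python
import numpy as np
import pandas as pd

# Hypothetical per-category summary with the same column names as category_summary
summary = pd.DataFrame({
    'category': ['A', 'B', 'C'],
    'avg_discount': [0.50, 0.20, 0.80],
    'product_count': [100, 40, 10],
})
threshold = 50
small = summary[summary['product_count'] < threshold]

# Unweighted mean of category means: every small category counts equally
unweighted = small['avg_discount'].mean()
# Weighted by product count: equals the mean over all products in the small categories
weighted = np.average(small['avg_discount'], weights=small['product_count'])
print(f'unweighted={unweighted:.3f}, weighted={weighted:.3f}')
```

Which version is appropriate depends on whether 'Other' should describe a typical small category or a typical product in a small category.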

## Visualize the Results

In [51]:
# Find the category with the highest average discount
highest_discount_category = final_category_summary.loc[final_category_summary['avg_discount'].idxmax()]

# Display the result
print("Category with the Highest Average Discount:")
print(highest_discount_category[['category', 'avg_discount']])

# Find the category with the lowest average discount
lowest_discount_category = final_category_summary.loc[final_category_summary['avg_discount'].idxmin()]

# Display the result
print("Category with the Lowest Average Discount:")
print(lowest_discount_category[['category', 'avg_discount']])

Category with the Highest Average Discount:
category        Electronics|WearableTechnology|SmartWatches
avg_discount                                       0.698158
Name: 4, dtype: object
Category with the Lowest Average Discount:
category        Electronics|Mobiles&Accessories|Smartphones&Ba...
avg_discount                                             0.232941
Name: 3, dtype: object


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set up for visualization
%matplotlib inline

# Visualization of average discounts by category
plt.figure(figsize=(12, 6))
sns.barplot(data=final_category_summary, x='avg_discount', y='category', palette='viridis')
plt.title('Average Discount Percentage by Product Category')
plt.xlabel('Average Discount Percentage')
plt.ylabel('Product Category')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout for better fit
plt.show()

# Correlation Analysis
# Encode the category as integer codes (arbitrary labels: pd.factorize numbers categories in order of appearance)
final_category_summary['category_code'] = pd.factorize(final_category_summary['category'])[0]

# Calculate the correlation matrix focusing on category code and average discount
correlation_matrix = final_category_summary[['category_code', 'avg_discount']].corr()

# Display the correlation matrix
print("Correlation Matrix between Category and Discounts:")
print(correlation_matrix)





## Correlation Coefficients
category_code vs. avg_discount: -0.194391

- Interpretation: The coefficient is slightly negative, so average discount tends to fall a little as the category code rises. Because `pd.factorize` assigns codes in order of appearance, the codes carry no ranking, and both the sign and the size of this value depend on that arbitrary order rather than on any property of the categories.

- Strength of Correlation: Pearson's coefficient ranges from -1 to 1, and -0.194391 is close to 0, which indicates a weak linear association. A weak correlation with arbitrary codes does not show that category has little effect on discounts; it only shows that discounts do not rise or fall steadily along this particular numbering.

Summary
Overall, the correlation between the encoded category and average discount is weak and slightly negative (-0.194391). Because the category codes are arbitrary labels, this coefficient is at best a rough indicator. The category-level averages themselves, which range from about 23% to about 70%, are more direct evidence of how discounts vary by category. Further analysis is needed to quantify that variation and to investigate other influencing factors.
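
To illustrate why a Pearson coefficient computed on factorized codes is hard to interpret, the sketch below uses hypothetical category averages (not values from the dataset) and shows that renumbering the same categories flips the sign of the coefficient.

```python
import pandas as pd

# Hypothetical category-level averages (not taken from the dataset)
df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D'],
    'avg_discount': [0.70, 0.50, 0.40, 0.23],
})

# Codes assigned in order of appearance, as pd.factorize does
df['code_original'] = pd.factorize(df['category'])[0]
r_original = df['code_original'].corr(df['avg_discount'])

# The same categories with the integer labels assigned in reverse order
df['code_reversed'] = df['code_original'].max() - df['code_original']
r_reversed = df['code_reversed'].corr(df['avg_discount'])

print(r_original, r_reversed)
```

The data are identical in both cases; only the labelling changed, which is why the coefficient says little about the categories themselves.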



## Analysis
The analysis shows large differences in average discounts among the product categories in the final summary. The category with the highest average discount is Electronics|WearableTechnology|SmartWatches, at approximately 69.82%. This suggests that products in this category are often heavily discounted, potentially to stimulate sales in a competitive market. In contrast, the category with the lowest average discount is Electronics|Mobiles&Accessories|Smartphones&Ba... (the name is truncated in the output above), at about 23.29%. This smaller discount may indicate a more stable pricing strategy, possibly reflecting higher demand or a premium pricing model.

The correlation coefficient of -0.194391 between category code and average discount is weak and slightly negative. Since the codes are arbitrary integers assigned by `pd.factorize`, the coefficient describes only how discounts trend along that numbering and cannot measure how strongly discounts depend on category. The gap between the highest and lowest category averages (roughly 70% versus 23%) indicates that category does matter, while other factors, such as market trends, consumer preferences, or promotional strategies, may also influence discount rates. Further analysis may be warranted to understand these dynamics better and identify other variables affecting discount strategies.
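
One possible follow-up, offered as a sketch and not as part of the original analysis, is the correlation ratio (eta squared), which measures the share of variance in a numeric column explained by a nominal grouping and does not depend on how categories are numbered. The `correlation_ratio` helper and the `toy` rows are hypothetical; on the real data the function would presumably be applied to `working_df` with `'category'` and `'discount_percentage'`.

```python
import pandas as pd

def correlation_ratio(df, group_col, value_col):
    """Eta squared: share of the variance in value_col explained by group_col."""
    overall_mean = df[value_col].mean()
    grouped = df.groupby(group_col)[value_col]
    # Between-group sum of squares: group size times squared distance
    # of each group mean from the overall mean
    ss_between = (grouped.count() * (grouped.mean() - overall_mean) ** 2).sum()
    ss_total = ((df[value_col] - overall_mean) ** 2).sum()
    return ss_between / ss_total if ss_total > 0 else float('nan')

# Hypothetical product-level rows (not taken from the dataset)
toy = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'discount_percentage': [0.70, 0.60, 0.20, 0.30],
})
eta_sq = correlation_ratio(toy, 'category', 'discount_percentage')
print(round(eta_sq, 3))
```

Values near 1 mean that category accounts for most of the variation in discounts, and values near 0 mean it accounts for very little.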



