### AMAZON RATINGS

## Information
* Authors: Dominik Panzarealla & Samuele Saporito
* Group A
* SUPSI 2023/2024

## Introduction
The aim of this project is to analyze rating and price patterns across products listed on Amazon India.

## Dataset 
Dataset used: [https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset](https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset)

## Description of the dataset

- 'name': The name of the product
- 'main_category': The main category the product belongs to
- 'sub_category': The sub-category the product belongs to
- 'image': An image showing what the product looks like
- 'link': The Amazon website reference link of the product
- 'ratings': The rating given to the product by Amazon customers
- 'no_of_ratings': The number of ratings the product has received on Amazon
- 'discount_price': The discounted price of the product
- 'actual_price': The actual MRP (maximum retail price) of the product
- 'category': The category of the given rating, with the following values:
    - Rating below 2.0           = **Poor**
    - Rating range of 2.0 - 2.9  = **Below Average**
    - Rating range of 3.0 - 3.9  = **Average**
    - Rating range of 4.0 - 4.9  = **Above Average**
    - Rating of 5.0              = **Excellent**
- 'price_range': The price range the actual_price falls in
- 'price_category': The category of the given actual_price (in CHF), with the following values:
    - Actual price up to 5.00              = **Cheap**
    - Actual price range of 5.01 - 33.33   = **Low**
    - Actual price range of 33.34 - 66.66  = **Medium**
    - Actual price range of 66.67 - 100    = **High**
- 'discount_category': The range the discount_percentage falls in


## Loading Dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from IPython.display import display, HTML
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt


df = pd.read_csv("/kaggle/input/amazon-products-csv/Amazon-Products.csv")
df

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df = df.dropna()

In [None]:
# Drop the first and sixth columns (positions 0 and 5), then drop duplicate product names
df = df.drop(df.columns[[0,5]], axis=1)
df = df.drop_duplicates(subset=['name'], keep='first')

# Merge the "home, kitchen, pets" main category into "home & kitchen"
df.loc[df["main_category"]=="home, kitchen, pets", "main_category"]="home & kitchen"

## Data Cleaning and Preparation

Prices in the dataset are in Indian rupees (INR). We convert them to Swiss francs (CHF) at a rate of **1 INR = 0.011 CHF**.

In [None]:
df["actual_price"] = df["actual_price"].str.replace('₹', '').str.replace(',','').astype(float)
df["discount_price"] = df["discount_price"].str.replace('₹', '').str.replace(',','').astype(float)
df["discount_percentage"] = 100 - ((df["discount_price"] / df["actual_price"])*100)
df["actual_price"] =  df["actual_price"] * 0.011
df["discount_price"] = df["discount_price"] * 0.011 
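The conversion above can be checked by hand on a single product. The sketch below is illustrative only: `parse_inr` and `INR_TO_CHF` are hypothetical names that do not appear in the notebook, and the two sample prices are made up.

```python
INR_TO_CHF = 0.011  # exchange rate stated in this notebook


def parse_inr(text: str) -> float:
    """Turn a price string such as '₹1,299' into a float amount in rupees."""
    return float(text.replace("₹", "").replace(",", ""))


# Made-up list price and discounted price for one product.
actual_inr = parse_inr("₹2,000")
discount_inr = parse_inr("₹1,500")

# Percentage saved relative to the list price, computed before the conversion.
discount_percentage = 100 - (discount_inr / actual_inr) * 100

# Both prices are scaled by the same rate, so the percentage is unaffected.
actual_chf = actual_inr * INR_TO_CHF
discount_chf = discount_inr * INR_TO_CHF

print(discount_percentage, round(actual_chf, 2), round(discount_chf, 2))
```

Because both columns are multiplied by the same factor, the discount percentage would be the same whether it is computed before or after the conversion.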

#### Check for correct ratings

In [None]:
df["ratings"].unique()

Drop the rows with the wrong rating values

In [None]:
# Remove the non-numeric values found in this column
other_values = ['Get', 'FREE', '₹68.99', '₹65', '₹70','₹100','₹99', '₹2.99']
mask = df["ratings"].isin(other_values)
df.drop(df[mask].index, inplace=True)
df["ratings"] = df["ratings"].astype(float)
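Listing every stray token by hand works for this dataset, but it has to be updated whenever a new token shows up. A possible alternative, sketched here on made-up values rather than on the real column, is to let `pd.to_numeric` turn anything unparseable into `NaN` and then drop those rows.

```python
import pandas as pd

# Made-up sample mimicking the stray tokens found in the 'ratings' column.
ratings = pd.Series(["4.2", "Get", "₹70", "3.9", "FREE"])

# Anything that cannot be parsed as a number becomes NaN.
numeric = pd.to_numeric(ratings, errors="coerce")

# Keep only the rows that hold a parseable rating.
clean = numeric.dropna()
print(clean.tolist())
```

One caveat of this approach: a stray value that happens to be a plain number (for example a price without the rupee sign) would survive, so a range check between 0 and 5 may still be needed.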

### Rating Category

We will map each rating to a particular category class:
- Rating below 2.0          = **Poor**
- Rating range of 2.0 - 2.9 = **Below Average**
- Rating range of 3.0 - 3.9 = **Average**
- Rating range of 4.0 - 4.9 = **Above Average**
- Rating of 5.0             = **Excellent**

In [None]:
def getCategory(score):
    if score<2.0: return "Poor"
    elif score<3.0: return "Below Average"
    elif score<4.0: return "Average"
    elif score<5.0: return "Above Average"
    else: return "Excellent"

df["category"] = df["ratings"]
df["category"] = df["category"].apply(lambda x:getCategory(x))
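The same mapping can also be expressed without a Python-level function. The following is a minimal sketch on made-up ratings, assuming the thresholds listed above; it uses `pd.cut` with left-closed bins so that the boundaries match `getCategory`.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1.9, 2.0, 3.5, 4.9, 5.0])  # made-up values

# right=False makes each bin closed on the left: [2.0, 3.0), [3.0, 4.0), ...
category = pd.cut(
    ratings,
    bins=[-np.inf, 2.0, 3.0, 4.0, 5.0, np.inf],
    labels=["Poor", "Below Average", "Average", "Above Average", "Excellent"],
    right=False,
)
print(category.tolist())
```

The result is an ordered categorical, so sorting or grouping on it follows the order of the labels rather than alphabetical order.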

In [None]:
df["no_of_ratings"] = df["no_of_ratings"].str.replace(",","") 
df["no_of_ratings"] = pd.to_numeric(df["no_of_ratings"], errors="coerce")

## Product Pricing

In [None]:
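# Avoid shadowing the built-in min() and max()
min_price = df["actual_price"].min()
max_price = df["actual_price"].max()
# Extend the last edge past the maximum so the most expensive products also get a bin
bins = np.arange(0, max_price + 1000, 1000)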
df['price_range'] = pd.cut(df['actual_price'], bins=bins, labels=[f'Fr. {i}-{i+1000}' for i in bins[:-1]])
average_price_range = df.groupby('price_range').agg(
    {"actual_price":"mean",
        "ratings":"count"
    }
).reset_index().rename(columns={"actual_price":"mean","ratings":"count"}).dropna()
total = average_price_range["count"].sum()
average_price_range["percentage"] = (average_price_range["count"] / total) * 100
average_price_range


Since 99% of the products fall in the Fr. 0-1000 range, we analyze that range in more detail.

In [None]:
mask = (df["actual_price"]>0) & (df["actual_price"]<=1000)
df_m = df[mask].copy()
# Include 1000 as the last edge so the Fr. 900-1000 bin is not lost
bins = np.arange(0, 1100, 100)
df_m['price_range'] = pd.cut(df_m['actual_price'], bins=bins, labels=[f'Fr. {i}-{i+100}' for i in bins[:-1]])
average_price_range = df_m.groupby('price_range').agg(
    {"actual_price":"mean",
        "ratings":"count"
    }
).reset_index().rename(columns={"actual_price":"mean","ratings":"count"}).dropna()
total = average_price_range["count"].sum()
average_price_range["percentage"] = (average_price_range["count"] / total) * 100
average_price_range


In [None]:
fig = px.bar(average_price_range,y="price_range", x="count")
fig.update_layout(
    paper_bgcolor='white',
    plot_bgcolor="white")
fig.show()

Note: Since products priced between Fr. 0 and Fr. 100 represent 94% of the dataset, we focus our analysis on this price range to obtain more accurate results.

In [None]:
mask = (df["actual_price"]>0) & (df["actual_price"]<=100)
# Work on a copy so that the columns added later do not trigger SettingWithCopyWarning
df_100 = df[mask].copy()

## Product Category

As we did for 'ratings', we map each product's 'actual_price' to a price range and to a price category (Cheap, Low, Medium or High). The 'discount_percentage' column is binned into the same 10-point ranges.

In [None]:
def getCategoryPrice(x):
    if(x<=10):    return "0-10"
    elif x<=20:    return "11-20"
    elif x<=30:    return "21-30"
    elif x<=40:    return "31-40"
    elif x<=50:    return "41-50"
    elif x<=60:    return "51-60"
    elif x<=70:    return "61-70"
    elif x<=80:    return "71-80"
    elif x<=90:    return "81-90"
    else:          return "91-100"


In [None]:
def category_price(price):
    if price<=5:
        return "Cheap"
    elif price<=33.33:
        return "Low"
    elif price<=66.66:
        return "Medium"
    else:
        return "High"
# Colors assigned to these categories later on: LightGreen, Green, Orange, Red

In [None]:
df_100["price_range"] = df_100["actual_price"].map(getCategoryPrice)
df_100["price_category"] = df_100["actual_price"].map(category_price)
df_100["discount_category"] = df_100["discount_percentage"].map(getCategoryPrice)

In [None]:
df_100

In [None]:
#Colors to be assigned

ratings_category_color = {
    "Poor": "red",
    "Below Average": "darkorange",
    "Average": "yellow",
    "Above Average": "limegreen",
    "Excellent": "green"
}

price_category_color = {
    "Cheap": "LightGreen",
    "Low": "Green",
    "Medium": "Orange",
    "High": "Red"
}

color_background = 'white'
color_line_background = 'black'

### Correlation between features

In [None]:
columns_of_interest = ["ratings", "no_of_ratings", "discount_price", "actual_price", "discount_percentage"]
correlation_matrix = df_100[columns_of_interest].corr()

# Create a mask to select the lower triangular part of the correlation matrix
mask = np.tril(np.ones_like(correlation_matrix, dtype=bool))

fig = px.imshow(correlation_matrix.where(mask), labels=dict(color="Correlation"), zmin=-1, zmax=1, color_continuous_scale="RdBu")

# Add annotations with two decimal places
for i in range(len(columns_of_interest)):
    for j in range(len(columns_of_interest)):
        if i<j:
            fig.add_annotation(x=i, y=j, text=f"{correlation_matrix.iloc[i, j]:.2f}", font_color="black", showarrow=False)
        if i==j:
            fig.add_annotation(x=i, y=j, text=f"{correlation_matrix.iloc[i, j]:.2f}", font_color="white", showarrow=False)
 
fig.update_layout(title=dict(text="<b>Correlation Matrix</b>", font=dict(size=25, color="black")),
    paper_bgcolor=color_background,
    plot_bgcolor=color_background
)

fig.show()
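The masking step used for the correlation matrix can be seen more easily on a tiny example. This is a sketch on made-up data (the column names `a`, `b` and `c` are placeholders): `np.tril` builds a boolean lower-triangular mask and `DataFrame.where` blanks everything above the diagonal.

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' grows with 'a', while 'c' is 'a' in reverse order.
data = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})
corr = data.corr()

# True on and below the diagonal, False above it.
mask = np.tril(np.ones_like(corr, dtype=bool))

# DataFrame.where keeps values where the mask is True and writes NaN elsewhere.
lower = corr.where(mask)
print(lower.round(2))
```

Since a correlation matrix is symmetric, hiding the upper triangle removes only duplicated values.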

### Actual Price Distribution

In [None]:
fig = px.histogram(df_100, x="actual_price", nbins=80, height = 800, color_discrete_sequence=px.colors.sequential.Peach)
fig.update_traces(marker_line_color='black', marker_line_width=1, opacity=1)
fig.update_layout(
    title=dict(text='<b>Actual Price Distribution between 0-100 CHF</b>', font=dict(size=25, color="black")),
    yaxis=dict(title='Count', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Actual Price (CHF)',tickfont=dict(size=15, color="black"),title_font=dict(size=20, color="black")),
    height=1000,
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    yaxis_gridcolor=color_line_background,
)
fig.show()

### Price distribution by Rating Category

The boxes for the different rating categories overlap or lie close together, which suggests that actual prices are distributed similarly across those categories.

In [None]:
fig = px.box(df_100,
             x="actual_price",
             y="category",
             color="category",
             category_orders={"category": ['Poor', 'Below Average', 'Average', 'Above Average', 'Excellent']},
             color_discrete_map=ratings_category_color)
fig.update_layout(
    title=dict(text="<b>Actual Price Distribution by Rating Category</b>", font=dict(size=25, color="black")),
    yaxis=dict(title='Rating Category', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Actual Price (CHF)', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    showlegend=False
)
fig.show()

### Distribution of Products' Price Category

Most products fall in the **Low** price category (Fr. 5.01 - 33.33), followed by the **Medium** category (Fr. 33.34 - 66.66).

In [None]:
color_sequence = list(price_category_color.values())
fig = px.pie(df_100, names='price_category', 
             labels={'price_category': 'Price Category'}, 
             category_orders = {"price_category":["Cheap","Low","Medium","High"]}, color_discrete_sequence = color_sequence)

fig.update_traces(textposition='inside', 
                  textinfo='percent+label', 
                  showlegend=False)

fig.update_layout(title=dict(text="<b>Distribution of Products' Price Category</b>", font=dict(size=30, color="black")),
    height=800,
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    yaxis_gridcolor=color_line_background,
    font=dict(size=15)
)

fig.show()

In [None]:
fig = px.box(df_100,y="main_category",x="actual_price",color_discrete_sequence=px.colors.sequential.Peach)
fig.update_layout(
    title=dict(text="<b>Actual Price Distribution by Main Category</b>", font=dict(size=25, color="black")),
    yaxis=dict(title='Main Category', tickfont=dict(size=10, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Actual Price (CHF)', tickfont=dict(size=10, color="black"), title_font=dict(size=20, color="black")),
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    coloraxis_colorbar=dict(title="Mean No. of Ratings"),
    height=1000,
    width=1000, 
    showlegend=False
)

fig.show()

### Distribution of Count and Price Percentage for Main Category

In [None]:
# Group by 'main_category' and calculate count
df_count = df_100.groupby(["main_category"]).size().reset_index().rename(columns={0: "count"})

# Sort df_count by count in ascending order
df_count = df_count.sort_values(by='count', ascending=True)

# Group by 'main_category' and 'price_category', calculate count, and sort
grouped_df = df_100.groupby(["main_category", "price_category"]).size().reset_index(name="count")
custom_price_order = ["Cheap", "Low", "Medium", "High"]
# Make 'price_category' an ordered categorical so that only this column follows the custom order
grouped_df["price_category"] = pd.Categorical(grouped_df["price_category"], categories=custom_price_order, ordered=True)
grouped_df_sorted = grouped_df.sort_values(by=['main_category', 'price_category'])

# Calculate total count for each 'main_category'
total_count = grouped_df_sorted.groupby("main_category")["count"].transform("sum")

# Calculate percentage
grouped_df_sorted["percentage"] = (grouped_df_sorted["count"] / total_count) * 100

# Create subplots with shared y-axis
fig = make_subplots(rows=1, cols=2, shared_yaxes=True)

# Horizontal Histogram for Count
fig.add_trace(
    go.Bar(y=df_count['main_category'], x=df_count['count'], name='Count', orientation='h',marker=dict(color='#FFDAB9')),
    row=1, col=1
)

# Horizontal Histogram for Percentage
fig.add_trace(
    go.Bar(
        x=grouped_df_sorted['percentage'], 
        y=grouped_df_sorted['main_category'], 
        name='Percentage', 
        orientation='h',
        marker_color=[price_category_color[category] for category in grouped_df_sorted['price_category']],
        hovertext=grouped_df_sorted.apply(lambda row: f"{row['price_category']}<br>Percentage: {row['percentage']:.2f}%", axis=1),
        hoverinfo='text'
    ),
    row=1, col=2
)

# Update layout with axis titles
fig.update_layout(
    title=dict(text='<b>Distribution of Count and Price Percentage for Main Category</b>', font=dict(size=25, color="black")),
    showlegend=False,
    legend=dict(x=0.5, y=1.1),  # Adjust legend position above the second subplot,
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    height=800, 
    width=1200,
     yaxis=dict(title='Main Category', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Count',tickfont=dict(size=15, color="black"),title_font=dict(size=20, color="black")),
    xaxis2=dict(title='Percentage (%)',tickfont=dict(size=15, color="black"),title_font=dict(size=20, color="black"), range=[0,100]),

)

# Initialize the horizontal position of the first legend entry
distanza = 0.545

# Fixed gap between legend entries
spazio_tra_parole = 0.02

# Add annotations that act as a custom legend
legend_annotations = []
for category, color in price_category_color.items():
    annotation = dict(
        xref='paper',
        yref='paper',
        x=distanza,
        y=1.02,
        xanchor='left',
        yanchor='middle',
        text=f'<span style="color:{color};margin-right:10px">&#x25cf;</span><span style="color:black">{category}</span>',
        showarrow=False,
        font=dict(color='black')
    )
    legend_annotations.append(annotation)
    distanza += spazio_tra_parole + len(category)/125 + 0.06

legend_annotations.append(
    dict(
        xref='paper',
        yref='paper',
        x=0.545,
        y=1.05,
        xanchor='left',
        yanchor='middle',
        text='<b>Price Category</b>',
        showarrow=False,
        font=dict(color='black')
    )
)
fig.update_layout(annotations=legend_annotations)  # Add the custom legend annotations

# Set the figure width to 1200
fig.update_layout(width=1200, margin=dict(r=200))

# Show the plot
fig.show()

### Ratings Distribution

In [None]:
fig = px.histogram(df_100, x="ratings", nbins=80, height = 800, color_discrete_sequence=px.colors.sequential.Peach)
fig.update_traces(marker_line_color='black', marker_line_width=2, opacity=1)
fig.update_layout(
    title=dict(text='Rating Distribution', font=dict(size=25, color="black")),
    yaxis=dict(title='Count', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Rating',tickfont=dict(size=15, color="black"),title_font=dict(size=20, color="black")),
    height=1000,
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    yaxis_gridcolor=color_line_background
)
fig.show()

### Ratings Distribution by Price Category

In [None]:
fig = px.box(df_100,
    y="price_category",
    x="ratings",
    title="Relationship between Ratings and Price Category",
    labels={'price_category': 'Price Category', 'ratings': 'Ratings'},
    category_orders={"price_category": ['Cheap', 'Low', 'Medium', 'High']},
    color="price_category",
    color_discrete_map=price_category_color
)

# Update layout
fig.update_layout(title=dict(text="<b>Ratings Distribution by Price Category</b>", font=dict(size=25, color="black")),
    yaxis=dict(title='Price Category', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Rating', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    showlegend=False,
)

# Show the plot
fig.show()

In [None]:
# Price categories to plot, one subplot per category
categories = ['Cheap', 'Low', 'Medium', 'High']
# Calculate the number of rows and columns needed for the subplot grid
num_rows = 2
num_cols = 2

# Create subplots with appropriate titles
fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=categories)

# Iterate over the price categories and add one histogram per subplot
for i, price_cat in enumerate(categories, 1):

    # Ratings of the products that belong to this price category
    data = df_100[df_100["price_category"]==price_cat]['ratings']

    histogram = go.Histogram(x=data, name=price_cat, marker=dict(color=price_category_color.get(price_cat, 'LightGreen')))
    
    
    # Calculate the current row and column for subplot placement
    row = (i - 1) // num_cols + 1
    col = (i - 1) % num_cols + 1
    
    fig.add_trace(histogram, row=row, col=col)

# Update layout
fig.update_layout(
    height=800,
    width=1000,
    title_text="Ratings Distribution by Price Category",
    showlegend=False,  # Subplot titles already name each category
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    yaxis_gridcolor=color_line_background
)

# Update x-axis and y-axis titles for the subplots
for i in range(1, num_rows + 1):
    for j in range(1, num_cols + 1):
        fig.update_xaxes(title_text="Rating", row=i, col=j)
        fig.update_yaxes(title_text="Count", row=i, col=j)   
        fig.update_yaxes(gridcolor="black", row=i, col=j)
        fig.update_traces(marker_line_color='black', marker_line_width=1, opacity=1, row=i, col=j)

fig.show()


### Rating Category Count per Price Category

In [None]:
df_count = df_100.groupby(["price_category"]).size().reset_index().rename(columns={0: "count"})
grouped_df = df_100.groupby(["price_category", "category"]).size().reset_index(name="count")

# Calculate the total count for each 'price_category'
total_count = grouped_df.groupby("price_category")["count"].transform("sum")

# Calculate the percentage
grouped_df["percentage"] = (grouped_df["count"] / total_count) * 100
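The percentage computation relies on `transform("sum")` returning one total per row rather than one per group. A minimal sketch with made-up counts shows the effect:

```python
import pandas as pd

# Made-up counts for two price categories.
counts = pd.DataFrame({
    "price_category": ["Low", "Low", "High", "High"],
    "category": ["Average", "Above Average", "Average", "Above Average"],
    "count": [30, 70, 5, 15],
})

# transform("sum") returns one total per row, aligned with the original index,
# so the division below is computed within each price category.
group_total = counts.groupby("price_category")["count"].transform("sum")
counts["percentage"] = counts["count"] / group_total * 100
print(counts)
```

With this layout the percentages of each price category add up to 100, which is what the stacked bars in the next figure assume.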


In [None]:
light_orange = '#FFE4B5'
medium_light_orange = '#FFDAB9'
medium_dark_orange = '#FFA500'
dark_orange = '#FF8C00'

orange_gradients = [dark_orange, medium_dark_orange, medium_light_orange, light_orange]



In [None]:
# Define custom orders
custom_order_price = ["High", "Medium", "Low", "Cheap"]
custom_order_ratings = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

# Sort the DataFrames based on custom order
df_count_sorted = df_count.set_index("price_category").loc[custom_order_price].reset_index()
grouped_df_sorted = grouped_df.set_index("price_category").loc[custom_order_price].reset_index()

# Create subplots with shared y-axis
fig = make_subplots(rows=1, cols=2, shared_yaxes=True)

# Horizontal Histogram for Count
fig.add_trace(
    go.Bar(
        y=df_count_sorted['price_category'], 
        x=df_count_sorted['count'], 
        name='Count', 
        orientation='h',
        marker=dict(color=orange_gradients)
    ),
    row=1, col=1
)

# Order the rating categories for the percentage plot
# The rating category is stored in the existing 'category' column
grouped_df_sorted['category'] = pd.Categorical(grouped_df_sorted['category'], categories=custom_order_ratings, ordered=True)
grouped_df_sorted = grouped_df_sorted.sort_values(by=['price_category', 'category'])

# Horizontal Histogram for Percentage
fig.add_trace(
    go.Bar(
        x=grouped_df_sorted['percentage'], 
        y=grouped_df_sorted['price_category'], 
        name='Percentage', 
        orientation='h',
        marker_color=[ratings_category_color[category] for category in grouped_df_sorted['category']],
        hovertext=grouped_df_sorted.apply(lambda row: f"{row['category']}<br>Percentage: {row['percentage']:.2f}%", axis=1),
        hoverinfo='text',
    ),
    row=1, col=2
)

# Update layout
fig.update_layout(
    title=dict(text='<b>Distribution of Count and Ratings Percentage for Price Category</b>',font=dict(size=25, color="black")),
    showlegend=False,
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    height=800, 
    width=1500,
    yaxis=dict(title='Price Category', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Count',tickfont=dict(size=15, color="black"),title_font=dict(size=20, color="black")),
    xaxis2=dict(title='Percentage (%)',tickfont=dict(size=15, color="black"),title_font=dict(size=20, color="black"), range=[0,100]),
)


# Add annotations that act as a custom legend
legend_annotations = []
# Initialize the horizontal position of the first legend entry
distanza = 0.545

# Fixed gap between legend entries
spazio_tra_parole = 0.02
for category, color in ratings_category_color.items():

    annotation = dict(
        xref='paper',
        yref='paper',
        x=distanza,
        y=1.02,
        xanchor='left',
        yanchor='middle',
        text=f'<span style="color:{color};margin-right:10px">&#x25cf;</span><span style="color:black">{category}</span>',
        showarrow=False,
        font=dict(color='black')
    )
    legend_annotations.append(annotation)

    distanza += spazio_tra_parole + len(category)/125

legend_annotations.append(
    dict(
        xref='paper',
        yref='paper',
        x=0.545,
        y=1.05,
        xanchor='left',
        yanchor='middle',
        text='<b>Rating Category</b>',
        showarrow=False,
        font=dict(color='black',size=15)
    )
)
fig.update_layout(annotations=legend_annotations)  # Add the custom legend annotations

# Set the figure width to 1200
fig.update_layout(width=1200, margin=dict(r=200))

# Show the plot
fig.show()

### Discount Percentage Distribution

In [None]:
fig = px.histogram(df_100, x="discount_percentage", nbins=80, height = 800, color_discrete_sequence=px.colors.sequential.Peach)
fig.update_traces(marker_line_color='black', marker_line_width=2, opacity=1)
fig.update_layout(
    title=dict(text='Discount Percentage Distribution', font=dict(size=25, color="black")),
    yaxis=dict(title='Count', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Discount Percentage (%)',tickfont=dict(size=15, color="black"),title_font=dict(size=20, color="black")),
    height=1000,
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    yaxis_gridcolor=color_line_background
)
fig.show()

### Rating Category Distribution per Discount Percentage

In [None]:
# Count products per (discount_category, rating category) pair, one column per rating category
df_grouped = df_100.groupby(["discount_category","category"]).size().unstack()
df_percentage = df_grouped.div(df_grouped.sum(axis=1), axis=0) * 100
custom_order = ['Poor', 'Below Average', 'Average', 'Above Average', 'Excellent']

# Sort the dataframe based on the custom order
df_percentage = df_percentage.reindex(columns=custom_order)

# Create a horizontal bar chart
fig = px.bar(df_percentage,
             orientation='h',
             barmode='stack',
             title='Category Percentage Distribution',
             labels={'value': 'Percentage', 'discount_category': 'Discount Category (%)'},
             width=1200,
             height=800,
             color_discrete_map=ratings_category_color)

# Update layout
fig.update_layout(title=dict(text="<b>Rating Category Distribution per Discount Percentage</b>", font=dict(size=25, color="black")),
                  xaxis_title=dict(text="Percentage (%)", font=dict(size=20, color="black")), 
                  yaxis_title=dict(text="Discount Category (%)", font=dict(size=20, color="black")),
                  legend=dict(title="Rating Category", y=1,
                    bgcolor='rgba(255, 255, 255, 0.7)',
                    bordercolor='rgba(0, 0, 0, 0.2)',
                    borderwidth=2,
                    font=dict(size=13, color='black')),
                  paper_bgcolor=color_background,
                  plot_bgcolor=color_background,
                  xaxis_gridcolor=color_line_background,
                  xaxis=dict(tickfont=dict(size=15, color='black')),
                  yaxis=dict(tickfont=dict(size=15, color='black')),
                  )

# Show the plot
fig.show()

### Discount Percentage Distribution by Price Category

In [None]:
fig = px.box(df_100, 
             y="price_category", 
             x="discount_percentage",
             category_orders = {"price_category":["Cheap","Low","Medium","High"]},color="price_category", color_discrete_map=price_category_color)
            

fig.update_layout(title=dict(text='Discount Percentage Distribution by Price Category', font=dict(size=25, color="black")),
    yaxis=dict(title='Price Category', tickfont=dict(size=15, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Discount Percentage (%)',tickfont=dict(size=15, color="black"),title_font=dict(size=20, color="black")),
    height=1000,
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    yaxis_gridcolor=color_line_background,
    legend=dict(title="Price Category", y=1,
                    bgcolor='rgba(255, 255, 255, 0.7)',
                    bordercolor='rgba(0, 0, 0, 0.2)',
                    borderwidth=2,
                    font=dict(size=13, color='black'))
)

fig.show()

### Correlation between actual and discount price

A positive correlation between "actual_price" and "discount_price" means that as the actual price of a product increases, its discounted price tends to increase as well. This is expected, because the discounted price is the actual price minus the discount, so the two columns move together.

In practical terms, this correlation alone does not show that the retailer offers larger discounts on more expensive items. To test that pricing strategy, "actual_price" has to be compared with "discount_percentage", as in the correlation matrix above.
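To see why the discounted price and the size of the discount should be kept apart, the sketch below generates synthetic products (not the Amazon data) whose discount rate is drawn independently of the list price. The variable names mirror the notebook's columns, while the price and discount ranges are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic products: list prices and discount rates drawn independently.
actual_price = rng.uniform(1, 100, size=n)
discount_percentage = rng.uniform(10, 50, size=n)
discount_price = actual_price * (1 - discount_percentage / 100)

sample = pd.DataFrame({
    "actual_price": actual_price,
    "discount_price": discount_price,
    "discount_percentage": discount_percentage,
})

corr = sample.corr()
# The discounted price tracks the list price closely even though the
# discount rate was generated independently of the price.
print(corr.loc["actual_price", "discount_price"])
print(corr.loc["actual_price", "discount_percentage"])
```

In this synthetic setting the first correlation is strongly positive and the second is close to zero, so a high correlation between the two price columns says little about how generous the discounts are.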

In [None]:
df_sample = df_100.sample(n=10000, random_state = 42)
fig = px.scatter(df_sample, x="actual_price", y="discount_price", 
                 trendline="ols",trendline_color_override="red", opacity=0.8,color_discrete_sequence=px.colors.sequential.Peach)
fig.update_yaxes(range=[0, 100])
fig.update_layout(title=dict(text="<b>Correlation between actual and discount price</b>", font=dict(size=20, color="black"))
    ,xaxis_title=dict(text="Actual Price (CHF)", font=dict(size=20, color="black")), 
                  yaxis_title=dict(text="Discount Price (CHF)", font=dict(size=20, color="black")),
                  legend=dict(title="Actual Price", y=1,
                    bgcolor='rgba(255, 255, 255, 0.7)',
                    bordercolor='rgba(0, 0, 0, 0.2)',
                    borderwidth=2,
                    font=dict(size=13, color='black')),
                  paper_bgcolor=color_background,
                  plot_bgcolor=color_background,
                  xaxis_gridcolor=color_line_background,
                  yaxis_gridcolor=color_line_background,
                  xaxis=dict(tickfont=dict(size=15, color='black')),
                  yaxis=dict(tickfont=dict(size=15, color='black')),
                  )
fig.show()


### Exploring Discounts Across Price Categories

This heatmap shows how discounts are distributed across price categories. Each row is a price category (Cheap, Low, Medium or High) and each column is a discount category. The color intensity gives the number of products in each combination, which shows which discount levels are most common within each price segment.

In [None]:
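# Pass the column name instead of a separately sorted Series, so x and y stay aligned row by row
discount_order = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81-90", "91-100"]
fig = px.density_heatmap(
  df_100,
  x="discount_category",
  y="price_category",
  title="Discount category count per Price Category",
  histfunc="count",
    category_orders = {"price_category":["Cheap","Low","Medium","High"],
                       "discount_category": discount_order}
)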

fig.update_layout(title=dict(text="<b>Exploring Discounts Across Price Categories</b>", font=dict(size=25, color="black")),
                  xaxis=dict(title='Discount Category (%)', tickfont=dict(size=10, color="black"), title_font=dict(size=20, color="black")),
                  yaxis=dict(title='Price Category',tickfont=dict(size=10, color="black"),title_font=dict(size=20, color="black")),
                  paper_bgcolor=color_background,
                  plot_bgcolor=color_background,
                  xaxis_gridcolor=color_line_background,
                  yaxis_gridcolor=color_line_background,
                  coloraxis_colorbar=dict(title="Product Count"))

# The x-axis holds the discount categories, so order them as ascending labels
fig.update_xaxes(categoryorder='category ascending')

fig.update_traces(hovertemplate="Discount Category: %{x}<br>Price Category: %{y}<br>Product Count: %{z}")



fig.show()
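The counts behind a heatmap like this one can also be inspected as a plain table. The following sketch uses `pd.crosstab` on a few made-up rows standing in for `df_100`; the `products` DataFrame is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for df_100.
products = pd.DataFrame({
    "price_category": ["Cheap", "Low", "Low", "Medium", "High", "Low"],
    "discount_category": ["0-10", "41-50", "41-50", "11-20", "0-10", "11-20"],
})

# Rows are price categories, columns are discount categories, cells are counts.
table = pd.crosstab(products["price_category"], products["discount_category"])

# Put the rows in the notebook's price order instead of alphabetical order.
table = table.reindex(["Cheap", "Low", "Medium", "High"])
print(table)
```

Reading the table next to the figure is a quick way to confirm that the axes and the hover labels refer to the intended columns.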


### Mean Number of Ratings by Rating Category and Price Category

This heatmap shows the mean number of ratings for each combination of rating category and price category. The x-axis shows the rating categories and the y-axis the price categories. The color of each cell is the mean "no_of_ratings" of the products in that cell, giving a quick overview of which combinations attract the most customer ratings.

In [None]:
try_df = df_100.groupby(["category", "price_category"]).agg({'no_of_ratings': 'mean'}).reset_index()

fig = px.density_heatmap(
    try_df,
    x="category",
    y="price_category",
    z="no_of_ratings",
    category_orders={"category": ["Poor", "Below Average", "Average", "Above Average", "Excellent"],
                    "price_category": ["Cheap", "Low", "Medium", "High"]}
)

fig.update_layout(
    title=dict(text="<b>Mean No. of Ratings by Category and Price Category</b>", font=dict(size=25, color="black")),
    yaxis=dict(title='Price Category', tickfont=dict(size=10, color="black"), title_font=dict(size=20, color="black")),
    xaxis=dict(title='Rating Category', tickfont=dict(size=10, color="black"), title_font=dict(size=20, color="black")),
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    coloraxis_colorbar=dict(title="<i>Mean No. of Ratings</i>")
)

# The hover text only uses x, y and z, so no customdata is needed
fig.update_traces(hovertemplate="Rating Category: %{x}<br>Price Category: %{y}<br>Mean No. of Ratings: %{z:.2f}")


fig.show()

In [None]:
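# Bin edges for the number of ratings (hard-coded; values above the last edge would become NaN)
bin_edges = [0, 4, 16, 104, 589547]

# Define bin labels
bin_labels = ['Poor', 'Below Average', 'Above Average', 'Excellent']

# Create a new column 'rating_category' by binning 'no_of_ratings'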
df_100['rating_category'] = pd.cut(df_100['no_of_ratings'], bins=bin_edges, labels=bin_labels, include_lowest=True)
df_100
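
To illustrate how `pd.cut` assigns values to bins like these, here is a minimal sketch on made-up counts (the values in `toy_counts` are assumptions chosen to sit on and around the bin edges, not dataset values). Each bin is closed on the right, and `include_lowest=True` makes the first bin also include its left edge.

```python
import pandas as pd

bin_edges = [0, 4, 16, 104, 589547]
bin_labels = ['Poor', 'Below Average', 'Above Average', 'Excellent']

# Made-up numbers of ratings placed on and just above each bin edge
toy_counts = pd.Series([0, 4, 5, 16, 17, 104, 105])

# Bins are (edge, next_edge]; include_lowest=True lets 0 fall into the first bin
binned = pd.cut(toy_counts, bins=bin_edges, labels=bin_labels, include_lowest=True)

print(pd.DataFrame({"no_of_ratings": toy_counts, "rating_category": binned}))
```

A value equal to an edge (4, 16, 104) falls into the lower of the two neighbouring bins, while the next integer starts the following bin.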

In [None]:
new_df = df_100.groupby(["rating_category","price_range"])["ratings"].mean().reset_index()

fig = px.density_heatmap(
    new_df,
    y="rating_category",
    x="price_range",
    z="ratings",
    category_orders={"rating_category": ["Excellent","Above Average","Below Average","Poor"]}
)

fig.update_layout(
    title=dict(text="<b>Average Rating by Price Range and Number-of-Ratings Category</b>", font=dict(size=25, color="black")),
    xaxis=dict(title='Price Range (CHF)', tickfont=dict(size=12, color="black"), title_font=dict(size=20, color="black")),
    yaxis=dict(title='Number-of-Ratings Category', tickfont=dict(size=12, color="black"), title_font=dict(size=20, color="black")),
    showlegend=False,
    paper_bgcolor=color_background,
    plot_bgcolor=color_background,
    xaxis_gridcolor=color_line_background,
    yaxis_gridcolor=color_line_background,
    coloraxis_colorbar=dict(title="<i>Average Rating</i>")
)


fig.show()


## Main & Sub Category

### Highest-rated sub-category for each of the top 5 main categories

In [None]:
# Sub-category with the highest mean rating for each main category
to_compare = ["women's clothing","men's clothing"]
new_100 = df_100[df_100["main_category"].isin(to_compare)]
new_100 = new_100[["main_category","sub_category","name","actual_price","ratings","discount_percentage","no_of_ratings"]]

# Count products per (main_category, sub_category); 'name' is renamed to 'count' to hold the counts
sub = df_100[["main_category", "sub_category", "name"]]
sub = sub.rename(columns={'name': 'count'})
piv = pd.pivot_table(sub, index=["main_category", "sub_category"], aggfunc="count")
df_100_count = piv.groupby("main_category")["count"].sum().reset_index()

categories = df_100_count.sort_values(by="count",ascending=False).head(5)["main_category"].tolist()
df_5 = df_100[df_100["main_category"].isin(categories)]
ml_df = df_5.groupby(["main_category","sub_category"])["ratings"].mean().reset_index()
result = ml_df.loc[ml_df.groupby("main_category")["ratings"].idxmax()]
ml_df.set_index(['main_category', 'sub_category'], inplace=True)

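# Highlight, within each main category, the sub-category with the highest mean rating.
# Comparing against the per-category maximum avoids highlighting a value that merely
# equals the maximum of another main category (ties within a category are all highlighted).
def highlight_category_max(col):
    is_max = col == col.groupby(level="main_category").transform("max")
    return ['background-color: yellow' if m else '' for m in is_max]

# Apply the custom function column-wise to the DataFrame and display it
styled_ml_df = ml_df.style.apply(highlight_category_max, axis=0)
styled_ml_df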

In [None]:
sub_categories = result["sub_category"].tolist()

# First pass: best-rated product per top sub-category, without any filter on the number of ratings
new_df = df[df["sub_category"].isin(sub_categories)]
new_df = new_df.groupby(["sub_category","name"])["ratings"].mean().reset_index()
result = new_df.loc[new_df.groupby("sub_category")["ratings"].idxmax()]

# Second pass: attach the (rounded) mean number of ratings of each sub-category to every product
new_df = df[df["sub_category"].isin(sub_categories)]
sub_means = new_df.groupby("sub_category")["no_of_ratings"].mean().round(0).reset_index()
new_df = pd.merge(new_df, sub_means, on="sub_category", suffixes=('', '_mean'))

# Keep only products with at least the mean number of ratings of their sub-category
filtered_df = new_df[new_df["no_of_ratings"] >= new_df["no_of_ratings_mean"]]
means = filtered_df.groupby(["sub_category","name"])["ratings"].mean().reset_index()
# Best-rated product per sub-category among the filtered products
result = filtered_df.loc[filtered_df.groupby("sub_category")["ratings"].idxmax()]

We see that the no_of_ratings is too low, so we apply a filter to our analysis: only products whose no_of_ratings is at least the mean of their sub-category are kept.
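
The effect of such a filter can be sketched on a toy DataFrame (all names and values below are made up for illustration). This sketch uses `groupby(...).transform("mean")` to compute the per-sub-category threshold, which is an alternative to the merge used in the notebook, and it does not round the mean.

```python
import pandas as pd

# Toy data (made-up products, not taken from the Amazon dataset)
toy = pd.DataFrame({
    "sub_category": ["Shirts", "Shirts", "Shirts", "Jeans", "Jeans"],
    "name": ["A", "B", "C", "D", "E"],
    "ratings": [5.0, 4.2, 4.0, 4.8, 3.9],
    "no_of_ratings": [1, 200, 99, 3, 50],
})

# Without a filter, products with very few ratings win on rating alone
unfiltered_best = toy.loc[toy.groupby("sub_category")["ratings"].idxmax()]

# Per-row threshold: the mean number of ratings of the row's sub-category
sub_mean = toy.groupby("sub_category")["no_of_ratings"].transform("mean")

# Keep products that have at least the mean number of ratings of their sub-category
filtered = toy[toy["no_of_ratings"] >= sub_mean]

# Best-rated product per sub-category among the remaining products
best = filtered.loc[filtered.groupby("sub_category")["ratings"].idxmax()]

print(unfiltered_best[["sub_category", "name", "ratings", "no_of_ratings"]])
print(best[["sub_category", "name", "ratings", "no_of_ratings"]])
```

In this toy case the unfiltered winners have only 1 and 3 ratings, while the filtered winners are backed by 200 and 50 ratings.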

In [None]:
from IPython.display import display, HTML
for index, row in result.iterrows():
    name = row['name']
    sub_category = row['sub_category']
    mean_ratings = row['ratings']
    image_url = row['image']

    # Build a small HTML card for the product; the <div> is closed explicitly
    styled_html = "<div style='background-color:#FAE8E0; padding: 10px;'>"
    styled_html += f"<b>Name:</b> {name}<br>"
    styled_html += f"<b>Sub-Category:</b> {sub_category}<br>"
    styled_html += f"<b>Mean Ratings:</b> {mean_ratings}<br>"
    styled_html += f"<img src='{image_url}' style='max-height:150px; max-width:150px; margin: 5px;' />"
    styled_html += "</div>"

    display(HTML(styled_html))
