
# Data Cleaning

In [49]:
# Import the libraries needed for exploratory data analysis (EDA) of the Amazon sales data.

import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

In [50]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [51]:
# Load the CSV dataset.
df = pd.read_csv("/content/drive/MyDrive/amazon_data/amazon.csv")
df.head(3)

Unnamed: 0,product_id,product_name,category,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,Computers&Accessories|Accessories&Peripherals|...,₹399,"₹1,099",64%,4.2,24269,High Compatibility : Compatible With iPhone 12...,"AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBB...","Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jasp...","R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...
1,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,Computers&Accessories|Accessories&Peripherals|...,₹199,₹349,43%,4.0,43994,"Compatible with all Type C enabled devices, be...","AECPFYFQVRUWC3KGNLJIOREFP5LQ,AGYYVPDD7YG7FYNBX...","ArdKn,Nirbhay kumar,Sagar Viswanathan,Asp,Plac...","RGIQEG07R9HS2,R1SMWZQ86XIN8U,R2J3Y1WL29GWDE,RY...","A Good Braided Cable for Your Type C Device,Go...",I ordered this cable to connect my phone to An...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Ambrane-Unbreakable-Char...
2,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,Computers&Accessories|Accessories&Peripherals|...,₹199,"₹1,899",90%,3.9,7928,【 Fast Charger& Data Sync】-With built-in safet...,"AGU3BBQ2V2DDAMOAKGFAWDDQ6QHA,AESFLDV2PT363T2AQ...","Kunal,Himanshu,viswanath,sai niharka,saqib mal...","R3J3EQQ9TZI5ZJ,R3E7WBGK7ID0KV,RWU79XKQ6I1QF,R2...","Good speed for earlier versions,Good Product,W...","Not quite durable and sturdy,https://m.media-a...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Sounce-iPhone-Charging-C...


In [52]:
df.columns

Index(['product_id', 'product_name', 'category', 'discounted_price',
       'actual_price', 'discount_percentage', 'rating', 'rating_count',
       'about_product', 'user_id', 'user_name', 'review_id', 'review_title',
       'review_content', 'img_link', 'product_link'],
      dtype='object')

In [53]:
# df.info() shows that the dataset has 16 columns and that "rating_count" has two missing values.
# Every column is stored with the generic object dtype, so the numeric columns (prices, discount, rating, rating count) need to be converted.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1465 entries, 0 to 1464
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   product_id           1465 non-null   object
 1   product_name         1465 non-null   object
 2   category             1465 non-null   object
 3   discounted_price     1465 non-null   object
 4   actual_price         1465 non-null   object
 5   discount_percentage  1465 non-null   object
 6   rating               1465 non-null   object
 7   rating_count         1463 non-null   object
 8   about_product        1465 non-null   object
 9   user_id              1465 non-null   object
 10  user_name            1465 non-null   object
 11  review_id            1465 non-null   object
 12  review_title         1465 non-null   object
 13  review_content       1465 non-null   object
 14  img_link             1465 non-null   object
 15  product_link         1465 non-null   object
dtypes: object(16)

In [54]:
# # Fixing the datatypes for string columns
# df['product_name'] = df['product_name'].astype(str)
# df['product_name'] = df['product_name'].astype(str)
# df['product_name'] = df['product_name'].astype(str)


In [55]:
# Split the 'category' column on '|' into separate columns, which we then add to the original dataset for the EDA. Only the first five levels are renamed; deeper levels keep their default integer names.

df_cat = df['category'].str.split('|', expand=True).rename(columns={0:'category_1', 1:'category_2', 2:'category_3', 3:'category_4',4:'category_5'})
df_cat.head()

Unnamed: 0,category_1,category_2,category_3,category_4,category_5,5,6
0,Computers&Accessories,Accessories&Peripherals,Cables&Accessories,Cables,USBCables,,
1,Computers&Accessories,Accessories&Peripherals,Cables&Accessories,Cables,USBCables,,
2,Computers&Accessories,Accessories&Peripherals,Cables&Accessories,Cables,USBCables,,
3,Computers&Accessories,Accessories&Peripherals,Cables&Accessories,Cables,USBCables,,
4,Computers&Accessories,Accessories&Peripherals,Cables&Accessories,Cables,USBCables,,
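
The rename above covers only the first five levels, which is why the output still shows columns named 5 and 6. A minimal sketch of one way to name every level regardless of depth follows. The helper `split_levels` and the sample values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def split_levels(series: pd.Series, sep: str = "|", prefix: str = "category_") -> pd.DataFrame:
    """Split a delimited string column into one column per level."""
    # A single-character separator is taken literally by str.split.
    parts = series.str.split(sep, expand=True)
    # Name every resulting column, however many levels the data contains.
    parts.columns = [f"{prefix}{i + 1}" for i in range(parts.shape[1])]
    return parts

# Made-up sample values, not rows from the real dataset.
sample = pd.Series(["A|B|C", "A|B", "X|Y|Z|W"])
levels = split_levels(sample)
print(levels.columns.tolist())
```

Rows with fewer levels than the deepest row get missing values in the trailing columns, which matches the missing counts reported later for 'category_3' and 'category_4'.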


In [56]:
# Add the first four category levels to df.
df['category_1'] = df_cat['category_1']
df['category_2'] = df_cat['category_2']
df['category_3'] = df_cat['category_3']
df['category_4'] = df_cat['category_4']

In [57]:
# Drop the original 'category' column from the dataset.
df.drop('category', axis=1, inplace=True)

In [58]:
df.isnull().sum()

product_id               0
product_name             0
discounted_price         0
actual_price             0
discount_percentage      0
rating                   0
rating_count             2
about_product            0
user_id                  0
user_name                0
review_id                0
review_title             0
review_content           0
img_link                 0
product_link             0
category_1               0
category_2               0
category_3               8
category_4             165
dtype: int64

In [59]:
# It's worth noting that in the 'category_1' column, some specific words lack spaces between them, and we intend to correct this issue for improved data consistency.
df['category_1'].value_counts()

Electronics              526
Computers&Accessories    453
Home&Kitchen             448
OfficeProducts            31
MusicalInstruments         2
HomeImprovement            2
Toys&Games                 1
Car&Motorbike              1
Health&PersonalCare        1
Name: category_1, dtype: int64

In [60]:
df['category_1'] = df['category_1'].str.replace('&',' & ')
df['category_1'] = df['category_1'].str.replace('OfficeProducts', 'Office Products')
df['category_1'] = df['category_1'].str.replace('MusicalInstruments', 'Musical Instruments')
df['category_1'] = df['category_1'].str.replace('HomeImprovement', 'Home Improvement')
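
The replacements above fix each label by hand. As an alternative sketch, a single regular expression can insert a space wherever a lowercase letter is directly followed by an uppercase one. The helper `add_spaces` is hypothetical, and this is a swapped-in technique rather than the notebook's method: it would also change labels the cell above leaves alone, such as 'Health&PersonalCare'.

```python
import pandas as pd

def add_spaces(labels: pd.Series) -> pd.Series:
    """Put spaces around '&' and between a lowercase and a following uppercase letter."""
    spaced = labels.str.replace("&", " & ", regex=False)
    # Capture the lowercase/uppercase pair and re-emit it with a space in between.
    return spaced.str.replace(r"([a-z])([A-Z])", r"\1 \2", regex=True)

# Labels taken from the value_counts output above.
sample = pd.Series(["OfficeProducts", "Computers&Accessories", "Health&PersonalCare"])
print(add_spaces(sample).tolist())
```

Whether the automatic split is desirable depends on the labels: a brand-style name written in camel case would be split as well, so it is worth checking the result with `value_counts()` before overwriting the column.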

In [61]:
# We are replicating the identical procedure for the 'category_2' column, addressing any issues with spacing or formatting to maintain consistency within the dataset.

df['category_2'].value_counts()

Accessories&Peripherals                    381
Kitchen&HomeAppliances                     308
HomeTheater,TV&Video                       162
Mobiles&Accessories                        161
Heating,Cooling&AirQuality                 116
WearableTechnology                          76
Headphones,Earbuds&Accessories              66
NetworkingDevices                           34
OfficePaperProducts                         27
ExternalDevices&DataStorage                 18
Cameras&Photography                         16
HomeStorage&Organization                    16
HomeAudio                                   16
GeneralPurposeBatteries&BatteryChargers     14
Accessories                                 14
Printers,Inks&Accessories                   11
CraftMaterials                               7
Components                                   5
OfficeElectronics                            4
Electrical                                   2
Monitors                                     2
Microphones  

In [62]:
df['category_2'] = df['category_2'].str.replace('&',' & ')
df['category_2'] = df['category_2'].str.replace(',',' , ')

In [63]:
# Remove the '₹' and '%' symbols and the thousands separators from the 'discounted_price', 'actual_price', and 'discount_percentage' columns, then convert the three columns to float.

# making sure that the 3 columns are in str dtypes
df['discounted_price'] = df['discounted_price'].astype(str)
df['actual_price'] = df['actual_price'].astype(str)
df['discount_percentage'] = df['discount_percentage'].astype(str)


#String operation for the 3 columns
df['discounted_price'] = df['discounted_price'].str.replace('₹','')
df['discounted_price'] = df['discounted_price'].str.replace(',','')
df['discounted_price'] = df['discounted_price'].astype(float)

df['actual_price'] = df['actual_price'].str.replace('₹','')
df['actual_price'] = df['actual_price'].str.replace(',','')
df['actual_price'] = df['actual_price'].astype(float)

df['discount_percentage'] = df['discount_percentage'].str.replace('%','')
df['discount_percentage'] = df['discount_percentage'].str.replace(',','')
df['discount_percentage'] = df['discount_percentage'].astype(float)
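
The three blocks above repeat the same strip-and-convert steps. A minimal sketch of how they could be folded into one helper follows. The function `to_number` and the sample frame are hypothetical; the sample strings mirror the formats shown in `df.head(3)`.

```python
import pandas as pd

def to_number(series: pd.Series, symbols: str = "₹%,") -> pd.Series:
    """Strip the given symbols from a string column and convert it to float."""
    cleaned = series.astype(str)
    for symbol in symbols:
        # regex=False: every symbol is removed as a literal character.
        cleaned = cleaned.str.replace(symbol, "", regex=False)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce").astype(float)

sample = pd.DataFrame({
    "actual_price": ["₹1,099", "₹349"],
    "discount_percentage": ["64%", "43%"],
})
for col in sample.columns:
    sample[col] = to_number(sample[col])
print(sample.dtypes)
```

Using `errors="coerce"` is a design choice for this sketch: it surfaces bad values as NaN that can be counted afterwards, whereas `astype(float)` on its own stops at the first unparseable string.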

In [64]:
# It's noticeable that there is a row in the 'rating' column containing the symbol '|', which needs to be rectified for accurate data representation and analysis.
df["rating"].value_counts()

4.1    244
4.3    230
4.2    228
4.0    129
3.9    123
4.4    123
3.8     86
4.5     75
4       52
3.7     42
3.6     35
3.5     26
4.6     17
3.3     16
3.4     10
4.7      6
3.1      4
5.0      3
3.0      3
4.8      3
3.2      2
2.8      2
2.3      1
|        1
2        1
3        1
2.6      1
2.9      1
Name: rating, dtype: int64

In [65]:
# Below is the query that displays the row in which the 'rating' value is "|".

df.query('rating=="|"')

Unnamed: 0,product_id,product_name,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link,category_1,category_2,category_3,category_4
1279,B08L12N5H1,Eureka Forbes car Vac 100 Watts Powerful Sucti...,2099.0,2499.0,16.0,|,992,No Installation is provided for this product|1...,"AGTDSNT2FKVYEPDPXAA673AIS44A,AER2XFSWNN4LAUCJ5...","Divya,Dr Nefario,Deekshith,Preeti,Prasanth R,P...","R2KKTKM4M9RDVJ,R1O692MZOBTE79,R2WRSEWL56SOS4,R...","Decent product,doesn't pick up sand,Ok ok,Must...","Does the job well,doesn't work on sand. though...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Eureka-Forbes-Vacuum-Cle...,Home & Kitchen,Kitchen & HomeAppliances,"Vacuum,Cleaning&Ironing",Vacuums&FloorCare


In [66]:
# Replace the "|" placeholder with 4.0 and convert the column to float.
# regex=False makes pandas treat "|" as a literal character rather than a regular expression.

df['rating'] = df['rating'].str.replace('|', '4.0', regex=False).astype(float)
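
A more general way to handle stray non-numeric ratings is to coerce the column and fill whatever fails to parse. The sketch below runs on made-up values and fills with the median, which is an assumption of this example; the notebook itself uses the fixed value 4.0.

```python
import pandas as pd

# Made-up sample: one entry is the "|" placeholder seen in the real column.
sample = pd.Series(["4.2", "|", "3.9", "4"])

# Anything that is not a number becomes NaN.
rating = pd.to_numeric(sample, errors="coerce")
n_bad = int(rating.isna().sum())

# Fill the unparseable entries with the median of the valid ratings.
rating = rating.fillna(rating.median())
print(n_bad, rating.tolist())
```

Counting the coerced values first (`n_bad`) makes it easy to confirm that only the expected rows were affected before filling them.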



In [67]:
# It is evident that the 'user_name' and 'user_id' columns contain data for multiple individuals within a single row. Our objective is to separate this data into distinct rows for each individual.
df['user_id'] = df['user_id'].str.split(',', expand=False)
df['user_name'] = df['user_name'].str.split(',', expand=False)


In [68]:
# We are using the 'explode' function to expand the data into separate rows, effectively splitting the information within the 'user_name' and 'user_id' columns, resulting in one row per individual.

df1 = df.explode('user_name', ignore_index=True)
df2 = df.explode('user_id', ignore_index=True)

In [69]:
# Drop the original list-valued 'user_id' column from df1.

df1.drop("user_id", axis=1, inplace=True)

In [70]:
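# Copy 'user_id' from df2 into df1 by row position, then check the shape of the result.
# Caveat: the two columns were exploded independently, so the pairing is only correct if every product row has as many user names as user ids.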
df1["user_id"] = df2["user_id"]
df1.shape

(11515, 19)

In [71]:
# Both 'user_name' and 'user_id' are now spread across one row per reviewer.

df1.head()

Unnamed: 0,product_id,product_name,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_name,review_id,review_title,review_content,img_link,product_link,category_1,category_2,category_3,category_4,user_id
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,399.0,1099.0,64.0,4.2,24269,High Compatibility : Compatible With iPhone 12...,Manav,"R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...,Computers & Accessories,Accessories & Peripherals,Cables&Accessories,Cables,AG3D6O4STAQKAY2UVGEUV46KN35Q
1,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,399.0,1099.0,64.0,4.2,24269,High Compatibility : Compatible With iPhone 12...,Adarsh gupta,"R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...,Computers & Accessories,Accessories & Peripherals,Cables&Accessories,Cables,AHMY5CWJMMK5BJRBBSNLYT3ONILA
2,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,399.0,1099.0,64.0,4.2,24269,High Compatibility : Compatible With iPhone 12...,Sundeep,"R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...,Computers & Accessories,Accessories & Peripherals,Cables&Accessories,Cables,AHCTC6ULH4XB6YHDY6PCH2R772LQ
3,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,399.0,1099.0,64.0,4.2,24269,High Compatibility : Compatible With iPhone 12...,S.Sayeed Ahmed,"R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...,Computers & Accessories,Accessories & Peripherals,Cables&Accessories,Cables,AGYHHIERNXKA6P5T7CZLXKVPT7IQ
4,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,399.0,1099.0,64.0,4.2,24269,High Compatibility : Compatible With iPhone 12...,jaspreet singh,"R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...,Computers & Accessories,Accessories & Peripherals,Cables&Accessories,Cables,AG4OGOFWXJZTQ2HKYIOCOY3KXF2Q


In [72]:
# Extract the 'user_id' and 'user_name' columns from df1 into a new DataFrame called df_users.

df_users = pd.DataFrame({'user_id':df1['user_id'], 'user_name':df1['user_name']})
df_users

Unnamed: 0,user_id,user_name
0,AG3D6O4STAQKAY2UVGEUV46KN35Q,Manav
1,AHMY5CWJMMK5BJRBBSNLYT3ONILA,Adarsh gupta
2,AHCTC6ULH4XB6YHDY6PCH2R772LQ,Sundeep
3,AGYHHIERNXKA6P5T7CZLXKVPT7IQ,S.Sayeed Ahmed
4,AG4OGOFWXJZTQ2HKYIOCOY3KXF2Q,jaspreet singh
...,...,...
11510,,PARDEEP
11511,,Anindya Pramanik
11512,,Vikas Singh
11513,,Harshada Pimple
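
The last rows of df_users have a name but no user_id, which suggests that some products list more names than ids (for example, when a name itself contains a comma). A minimal sketch of a paired explode that keeps ids and names together follows. It runs on made-up data and assumes pandas 1.3 or later, where `DataFrame.explode` accepts a list of columns.

```python
import pandas as pd

# Made-up sample: the second row has a comma inside one of the names.
sample = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "user_id": ["U1,U2", "U3,U4"],
    "user_name": ["Asha,Ravi", "Lee, Jr.,Sam"],
})
sample["user_id"] = sample["user_id"].str.split(",")
sample["user_name"] = sample["user_name"].str.split(",")

# Multi-column explode requires lists of equal length in every row.
same_length = sample["user_id"].map(len) == sample["user_name"].map(len)
paired = sample[same_length].explode(["user_id", "user_name"], ignore_index=True)

# Rows that cannot be paired safely are set aside for manual inspection.
mismatched = sample[~same_length]
print(paired)
print(len(mismatched))
```

Exploding both columns in one call keeps each id on the same row as its name, so no positional realignment between two separate frames is needed.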


In [73]:
df.isnull().sum()

product_id               0
product_name             0
discounted_price         0
actual_price             0
discount_percentage      0
rating                   0
rating_count             2
about_product            0
user_id                  0
user_name                0
review_id                0
review_title             0
review_content           0
img_link                 0
product_link             0
category_1               0
category_2               0
category_3               8
category_4             165
dtype: int64

In [74]:
# The "rating_count" column has two missing (NaN) values. The rows below show them; we then fill them with the column's most frequent value (its mode).

df[df['rating_count'].isnull()]

Unnamed: 0,product_id,product_name,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link,category_1,category_2,category_3,category_4
282,B0B94JPY2N,Amazon Brand - Solimo 65W Fast Charging Braide...,199.0,999.0,80.0,3.0,,USB C to C Cable: This cable has type C connec...,[AE7CFHY23VAJT2FI4NZKKP6GS2UQ],[Pranav],RUB7U91HVZ30,The cable works but is not 65W as advertised,I have a pd supported car charger and I bought...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Amazon-Brand-Charging-Su...,Computers & Accessories,Accessories & Peripherals,Cables&Accessories,Cables
324,B0BQRJ3C47,"REDTECH USB-C to Lightning Cable 3.3FT, [Apple...",249.0,999.0,75.0,5.0,,💎[The Fastest Charge] - This iPhone USB C cabl...,[AGJC5O5H5BBXWUV7WRIEIOOR3TVQ],[Abdul Gafur],RQXD5SAMMPC6L,Awesome Product,Quick delivery.Awesome ProductPacking was good...,https://m.media-amazon.com/images/I/31-q0xhaTA...,https://www.amazon.in/REDTECH-Lightning-Certif...,Computers & Accessories,Accessories & Peripherals,Cables&Accessories,Cables


In [75]:
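# Fill the missing counts with the mode, assigning the result back instead of using a chained inplace call.
df['rating_count'] = df['rating_count'].fillna(df['rating_count'].mode()[0])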

# Data Visualization

In [76]:
# Visualize the products in our dataframe that received a 5.0 rating.
# Filter the DataFrame for products with a 5.0 rating
Five_star_products = df[df['rating'] == 5.0]

fig = px.bar(Five_star_products,
             y=Five_star_products['product_name'].str[:50],
             x=Five_star_products['actual_price'],
             color=Five_star_products['product_name'].str[:50],  # Use the same column for color
             color_discrete_sequence=['orange'],  # Set the color to orange
             title="Five Star Rating Products")

fig.update_layout(minreducedwidth=400,
                  minreducedheight=400,
                  width=1100,
                  height=350)

# Remove the y-axis title
fig.update_yaxes(title_text=None)

fig.show()



In [77]:
# Query the dataframe for each whole-number rating and keep the results in variables for quick reuse in rating-based bar charts.
# Each query matches exact values only (for example, rating==4 does not include 4.1).

Five_star_products = df.query('rating==5')
Four_star_products = df.query('rating==4')
Three_star_products = df.query('rating==3')
Two_star_products = df.query('rating==2')
One_star_products = df.query('rating==1')

F1 = One_star_products
F2 = Two_star_products
F3 = Three_star_products
F4 = Four_star_products
F5 = Five_star_products



In [78]:
# Here I'm visualizing the lowest-rated products, i.e. those with a rating below 3.
# Filter the DataFrame for products with a rating less than 3 and sort by actual price
Lowest_rated_products = df[df['rating'] < 3].sort_values(by='actual_price')

fig = px.bar(Lowest_rated_products,
             x=Lowest_rated_products['product_name'].str[:50],
             y='actual_price',
             color='product_name',  # Use product name for color variation
             color_discrete_sequence=['red'],  # Set the color to red
             title="Lowest Rated Products by Actual Price",
             labels={'actual_price': 'Actual Price'},
             )

fig.update_layout(
    width=1100,
    height=500,
    showlegend=False,
    xaxis_title=None,
)

fig.update_xaxes(automargin=True)
fig.show()


In [79]:
# Visualize the relationship between actual price and discounted price.
# Note: trendline="ols" requires the statsmodels package.

fig = px.scatter(df, x='actual_price', y='discounted_price', trendline="ols")

fig.update_layout(
    title="Correlation between Actual Price and Discounted Price",
    xaxis_title="Actual Price",
    yaxis_title="Discounted Price",
    width=900,   # Figure width in pixels
    height=600   # Figure height in pixels
)

fig.show()
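
The scatter plot shows the relationship visually; the strength of a linear relationship can also be put into a single number. A minimal sketch using `Series.corr`, which computes the Pearson coefficient by default, follows. The prices are made-up values, not results from the dataset.

```python
import pandas as pd

# Made-up prices for illustration only.
sample = pd.DataFrame({
    "actual_price": [1099.0, 349.0, 1899.0, 699.0],
    "discounted_price": [399.0, 199.0, 199.0, 329.0],
})

# Pearson correlation between the two price columns (always between -1 and 1).
r = sample["actual_price"].corr(sample["discounted_price"])

# Sanity check: a column that is an exact multiple of another correlates perfectly.
perfect = sample["actual_price"].corr(sample["actual_price"] * 0.5)
print(round(perfect, 6))
```

On the real data the same call would be `df['actual_price'].corr(df['discounted_price'])`, run after the price columns have been converted to float.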


In [80]:
# Here I'm visualizing the total number of products per Category 1.

Cat_1 = df['category_1'].value_counts()

fig1 = px.histogram(y=Cat_1.index, x=Cat_1.values)
fig1.update_layout(title='Total Number of Products by Category 1', xaxis_title = 'Count')
fig1.show()

In [81]:
# Here I'm visualizing the total number of products per Category 2.

Cat_2 = df['category_2'].value_counts()

fig2 = px.histogram(y=Cat_2.index, x=Cat_2.values)
fig2.update_layout(title='Total Number of Products by Category 2', xaxis_title = 'Count')
fig2.show()

In [82]:
# Here I'm visualizing the total number of products per Category 3.
# Calculate the total number of products per Category 3
Cat_3 = df['category_3'].value_counts()

fig = px.histogram(y=Cat_3.index, x=Cat_3.values, color_discrete_sequence=['green'])

fig.update_layout(title='Total Number of Products by Category 3', xaxis_title='Count')

fig.show()


In [83]:
# Label each product's rating (on the 1-to-5 scale) with a customer-satisfaction string.

Customer_Satisfaction = []

for score in df['rating']:
    if score < 2.0:
        Customer_Satisfaction.append('Very Dissatisfied')
    elif score < 3.0:
        Customer_Satisfaction.append('Dissatisfied')
    elif score < 4.0:
        Customer_Satisfaction.append('Neutral')
    elif score < 5.0:
        Customer_Satisfaction.append('Satisfied')
    elif score == 5.0:
        Customer_Satisfaction.append('Very Satisfied')

In [84]:
# Add the Customer_Satisfaction column to the original dataset and change its dtype to category.

df['Customer_Satisfaction'] = Customer_Satisfaction
df['Customer_Satisfaction'] = df['Customer_Satisfaction'].astype('category')
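
The loop and the dtype conversion above can also be expressed with `pd.cut`. The sketch below uses made-up ratings and assumes the same thresholds as the loop; it is an alternative, not the notebook's method. One difference to be aware of: a rating above 5.0 would be labelled 'Very Satisfied' here, whereas the loop would skip it.

```python
import pandas as pd

labels = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
# With right=False each bin is [low, high): <2, <3, <4, <5, and 5.0 upwards.
bins = [0, 2, 3, 4, 5, float("inf")]

# Made-up ratings, one per satisfaction level.
sample = pd.Series([1.5, 2.9, 3.0, 4.9, 5.0])
satisfaction = pd.cut(sample, bins=bins, labels=labels, right=False)
print(satisfaction.tolist())
```

The result is an ordered categorical that carries all five levels even when some do not occur in the data, so `value_counts()` reports a count of 0 for an unused level.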


In [85]:
# Create a table of Customer Satisfaction levels and their value counts. Levels with no products appear as NaN.

Customer_Satisfaction_Score = df['Customer_Satisfaction'].value_counts().reindex(index = ['Very Dissatisfied','Dissatisfied','Neutral','Satisfied','Very Satisfied']).\
rename_axis('Customer_Satisfaction_Score').reset_index(name='Total Counts')
Customer_Satisfaction_Score

Unnamed: 0,Customer_Satisfaction_Score,Total Counts
0,Very Dissatisfied,
1,Dissatisfied,6.0
2,Neutral,348.0
3,Satisfied,1108.0
4,Very Satisfied,3.0


In [86]:
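# Visualize the Customer Satisfaction Score as a box plot.
# Note: the table has one row per satisfaction level, so this plot shows the five labels rather than a distribution of ratings.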

fig = px.box(Customer_Satisfaction_Score,
             x='Customer_Satisfaction_Score',
             title="Customer Satisfaction Score",
             labels={'Customer_Satisfaction_Score': 'Rating Score'},
             points="all"  # Show all data points (optional)
             )

fig.update_layout(
    showlegend=False,
    xaxis_title="Rating Score",
)

fig.show()


In [87]:
#  Donut Chart representation for the "Customer Satisfaction Score"

# Define custom colors for each rating score
custom_colors = ['#FF5733', '#FFC300', '#33FF57', '#33C5FF', '#9133FF']

fig = px.pie(Customer_Satisfaction_Score,
             values='Total Counts',
             names='Customer_Satisfaction_Score',
             hole=0.5,
             color_discrete_sequence=custom_colors)  # Set custom colors

fig.update_layout(title="Customer Satisfaction Percentage")

fig.show()


In [88]:
# Visualize the top 10 reviewers from the df_users dataframe created earlier. Reviewers are counted by 'user_name', so different people who share a name are counted together.

Top_10_reviewer = df_users['user_name'].value_counts().head(10).rename_axis('UserName').reset_index(name='counts')

fig = px.bar(y=Top_10_reviewer['UserName'], x=Top_10_reviewer['counts'], color_discrete_sequence=['#FF5733'])

fig.update_layout(title='Top 10 Reviewers',
                   xaxis_title='Total Number of Reviews',
                   yaxis_title='User Name')

fig.show()
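
Because display names are not unique, counting by 'user_id' may give a more reliable ranking. The sketch below runs on made-up data and assumes the ids are correctly paired with the names; the column names follow df_users.

```python
import pandas as pd

# Made-up reviewers: two different people share the name "Amit".
users = pd.DataFrame({
    "user_id": ["U1", "U2", "U1", "U3", "U2", "U1"],
    "user_name": ["Amit", "Amit", "Amit", "Sara", "Amit", "Amit"],
})

# Counting by name merges the two Amits into one reviewer.
by_name = users["user_name"].value_counts()

# Counting by id keeps them apart while still carrying a display name.
by_id = (
    users.groupby("user_id")
    .agg(user_name=("user_name", "first"), counts=("user_name", "size"))
    .sort_values("counts", ascending=False)
    .head(10)
    .reset_index()
)
print(by_id)
```

The resulting `by_id` table has the same shape as `Top_10_reviewer` plus an id column, so it could be passed to the same bar chart.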


In [89]:
# Visualize the distribution of ratings as a histogram.

fig = px.histogram(df, x='rating', color_discrete_sequence=['yellow'])

fig.update_layout(
    title='Distribution of Rating',
    xaxis_title='Rating'
)

fig.show()


In [90]:
# Rename category_1 and category_2 to Main_Category and Sub_Category so we can see which sub-categories belong to each main category.

df.rename(columns= {'category_1':'Main_Category', 'category_2':'Sub_Category'}, inplace= True)

In [91]:
# Visualize the distribution of Main Category per rating.

fig = px.bar(df, x='rating', color='Main_Category',
             title='Distribution of Main Category per Rating',
             labels={'rating': 'Rating'},
             barmode='stack')

fig.update_layout(xaxis_title='Rating')

fig.show()


In [92]:
# Visualize the distribution of ratings per Sub Category.

fig = px.box(df, x='rating', y='Sub_Category',
             title='Distribution of Sub Category per Rating',
             labels={'rating': 'Rating'})

fig.update_layout(xaxis_title='Rating', yaxis_title='Sub Category')

fig.show()
# The highest-rated sub-category is "Accessories & Peripherals".


In [93]:
# Table of each Main Category with its Sub Categories and their product counts.
# size() counts the rows in each group directly, without relying on a particular column position.

df_cat = df.groupby(['Main_Category','Sub_Category']).size().reset_index(name='Count')
df_cat

Unnamed: 0,Main_Category,Sub_Category,Count
0,Car & Motorbike,CarAccessories,1
1,Computers & Accessories,Accessories & Peripherals,381
2,Computers & Accessories,Components,5
3,Computers & Accessories,ExternalDevices & DataStorage,18
4,Computers & Accessories,Laptops,1
5,Computers & Accessories,Monitors,2
6,Computers & Accessories,NetworkingDevices,34
7,Computers & Accessories,"Printers , Inks & Accessories",11
8,Computers & Accessories,Tablets,1
9,Electronics,Accessories,14


In [94]:
#  Visualizing the table

fig = px.histogram(df_cat, y='Main_Category', x='Count', color='Sub_Category',
                   title="Number of Products for each Category and each Sub-Category",
                   labels={'Count': 'Count', 'Main_Category': 'Main Category'},
                   color_discrete_sequence=px.colors.qualitative.Set3)

fig.update_layout(yaxis_title="Main Category")

fig.show()
