<a href="https://www.kaggle.com/code/jaspreetkhokhar/amazon-product-reviews-ratings-analysis?scriptVersionId=240004662" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## ***Dataset Description***   

This dataset contains information about over 1,000 Amazon products, including their ratings, reviews, pricing, and product details as listed on the official Amazon website. It is designed to provide insights into customer opinions, product popularity, pricing strategies, and discount effectiveness. By analyzing this data, we can identify patterns in customer preferences, evaluate product performance, and explore how discounts relate to ratings and rating counts (the dataset has no direct sales figures).

## ***Column Description***   
| **Column Name**          | **Description**                                                                 |
|---------------------------|---------------------------------------------------------------------------------|
| `product_id`             | Unique identifier for each product listed on Amazon.                            |
| `product_name`           | The name or title of the product.                                               |
| `category`               | Category or type of the product.                                                |
| `discounted_price`       | Current price of the product after discounts.                                   |
| `actual_price`           | Original price of the product before any discounts.                             |
| `discount_percentage`    | Percentage of discount applied to the product.                                  |
| `rating`                 | Average customer rating of the product out of 5.                                |
| `rating_count`           | Total number of customers who rated the product.                                |
| `about_product`          | Brief description or features of the product.                                   |
| `user_id`                | Unique identifier for the user who wrote the review.                            |
| `user_name`              | Name of the user who provided the review.                                       |
| `review_id`              | Unique identifier for each review.                                              |
| `review_title`           | Short headline or title of the review.                                          |
| `review_content`         | Detailed content of the user's review.                                          |
| `img_link`               | URL link to the product image.                                                  |
| `product_link`           | URL link to the official Amazon product page.                                   |


In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Read the dataset
file_path = '/kaggle/input/amazon-sales-dataset/amazon.csv'
data = pd.read_csv(file_path)

## **Data Cleaning and Preprocessing**

Data cleaning is an essential step to ensure our dataset is ready for analysis. In this section, we will perform the following steps:

### **1. Understanding the Data Structure**
We will begin by exploring the first few rows and understanding the structure of the dataset.


In [3]:
# Display the first few rows to understand the structure
data.head()

Unnamed: 0,product_id,product_name,category,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,Computers&Accessories|Accessories&Peripherals|...,₹399,"₹1,099",64%,4.2,24269,High Compatibility : Compatible With iPhone 12...,"AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBB...","Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jasp...","R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...
1,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,Computers&Accessories|Accessories&Peripherals|...,₹199,₹349,43%,4.0,43994,"Compatible with all Type C enabled devices, be...","AECPFYFQVRUWC3KGNLJIOREFP5LQ,AGYYVPDD7YG7FYNBX...","ArdKn,Nirbhay kumar,Sagar Viswanathan,Asp,Plac...","RGIQEG07R9HS2,R1SMWZQ86XIN8U,R2J3Y1WL29GWDE,RY...","A Good Braided Cable for Your Type C Device,Go...",I ordered this cable to connect my phone to An...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Ambrane-Unbreakable-Char...
2,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,Computers&Accessories|Accessories&Peripherals|...,₹199,"₹1,899",90%,3.9,7928,【 Fast Charger& Data Sync】-With built-in safet...,"AGU3BBQ2V2DDAMOAKGFAWDDQ6QHA,AESFLDV2PT363T2AQ...","Kunal,Himanshu,viswanath,sai niharka,saqib mal...","R3J3EQQ9TZI5ZJ,R3E7WBGK7ID0KV,RWU79XKQ6I1QF,R2...","Good speed for earlier versions,Good Product,W...","Not quite durable and sturdy,https://m.media-a...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Sounce-iPhone-Charging-C...
3,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,Computers&Accessories|Accessories&Peripherals|...,₹329,₹699,53%,4.2,94363,The boAt Deuce USB 300 2 in 1 cable is compati...,"AEWAZDZZJLQUYVOVGBEUKSLXHQ5A,AG5HTSFRRE6NL3M5S...","Omkar dhale,JD,HEMALATHA,Ajwadh a.,amar singh ...","R3EEUZKKK9J36I,R3HJVYCLYOY554,REDECAZ7AMPQC,R1...","Good product,Good one,Nice,Really nice product...","Good product,long wire,Charges good,Nice,I bou...",https://m.media-amazon.com/images/I/41V5FtEWPk...,https://www.amazon.in/Deuce-300-Resistant-Tang...
4,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,Computers&Accessories|Accessories&Peripherals|...,₹154,₹399,61%,4.2,16905,[CHARGE & SYNC FUNCTION]- This cable comes wit...,"AE3Q6KSUK5P75D5HFYHCRAOLODSA,AFUGIFH5ZAFXRDSZH...","rahuls6099,Swasat Borah,Ajay Wadke,Pranali,RVK...","R1BP4L2HH9TFUP,R16PVJEXKV6QZS,R2UPDB81N66T4P,R...","As good as original,Decent,Good one for second...","Bought this instead of original apple, does th...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Portronics-Konnect-POR-1...


In [4]:
# Display Dataset Information
print("\n🔍 **Dataset Information:**\n")
data.info()


🔍 **Dataset Information:**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1465 entries, 0 to 1464
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   product_id           1465 non-null   object
 1   product_name         1465 non-null   object
 2   category             1465 non-null   object
 3   discounted_price     1465 non-null   object
 4   actual_price         1465 non-null   object
 5   discount_percentage  1465 non-null   object
 6   rating               1465 non-null   object
 7   rating_count         1463 non-null   object
 8   about_product        1465 non-null   object
 9   user_id              1465 non-null   object
 10  user_name            1465 non-null   object
 11  review_id            1465 non-null   object
 12  review_title         1465 non-null   object
 13  review_content       1465 non-null   object
 14  img_link             1465 non-null   object
 15  product_link         1465 non-null   object

###  **2. Removing Unwanted Characters and Converting Data Types** 
Several numeric columns are stored as strings that contain currency symbols (₹), percent signs (%), and thousands separators (,). We need to:

* Remove symbols and commas
* Convert columns to appropriate data types

In [5]:
def clean_column(column):
    """Cleans and converts columns to numeric."""
    return pd.to_numeric(column.str.replace('[₹,]', '', regex=True).str.strip(), errors='coerce')

# Apply the cleaning function to the required columns
data['discounted_price'] = clean_column(data['discounted_price'])
data['actual_price'] = clean_column(data['actual_price'])

data['discount_percentage'] = data['discount_percentage'].str.replace('%', '').astype(float)
data['rating_count'] = data['rating_count'].str.replace(',', '').astype(float)
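
As a quick sanity check on this step, the sketch below applies the same cleaning to a small hypothetical sample built from the first three rows shown in `data.head()`, then recomputes the discount from the two price columns. The `sample` frame and the `recomputed_discount` column are illustrative names, not part of the original notebook, and the check assumes the listed percentage is rounded to the nearest whole number.

```python
import pandas as pd

# Hypothetical sample: price and discount strings copied from the first three rows of data.head()
sample = pd.DataFrame({
    "discounted_price": ["₹399", "₹199", "₹199"],
    "actual_price": ["₹1,099", "₹349", "₹1,899"],
    "discount_percentage": ["64%", "43%", "90%"],
})

def clean_column(column):
    """Strip the rupee sign and thousands separators, then convert to numeric."""
    return pd.to_numeric(column.str.replace("[₹,]", "", regex=True).str.strip(), errors="coerce")

for col in ["discounted_price", "actual_price"]:
    sample[col] = clean_column(sample[col])

sample["discount_percentage"] = (
    sample["discount_percentage"].str.replace("%", "", regex=False).astype(float)
)

# Recompute the discount from the prices and compare it with the listed value
sample["recomputed_discount"] = (
    (sample["actual_price"] - sample["discounted_price"]) / sample["actual_price"] * 100
).round()

mismatches = sample[sample["recomputed_discount"] != sample["discount_percentage"]]
print(sample)
print("Rows where the listed discount disagrees with the prices:", len(mismatches))
```

Running the same comparison on the full `data` frame would show whether any listing carries a discount percentage that does not match its prices.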

### **3. Identifying and Handling Inconsistent Data**

The `rating` column is still stored as strings, so we list its distinct values to spot any entry that is not a valid number.



In [6]:
# Find unusual strings in the rating column

data['rating'].value_counts()

rating
4.1    244
4.3    230
4.2    228
4.0    129
3.9    123
4.4    123
3.8     86
4.5     75
4       52
3.7     42
3.6     35
3.5     26
4.6     17
3.3     16
3.4     10
4.7      6
3.1      4
5.0      3
3.0      3
4.8      3
3.2      2
2.8      2
2.3      1
|        1
2        1
3        1
2.6      1
2.9      1
Name: count, dtype: int64
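
Because the column holds strings, `value_counts()` lists `4.0` and `4` as separate values, and the stray `|` has to be spotted by eye. A more direct way to surface such entries is to attempt a numeric conversion and keep only the values that fail. The sketch below uses a small hypothetical `ratings` series rather than the real column.

```python
import pandas as pd

# Hypothetical sample mimicking the string-typed rating column, including the stray "|"
ratings = pd.Series(["4.2", "4.0", "|", "3.9", "4"], name="rating")

# errors="coerce" turns anything that cannot be parsed into NaN
parsed = pd.to_numeric(ratings, errors="coerce")

# Keep entries that were present in the original but failed to parse
bad_values = ratings[parsed.isna() & ratings.notna()]
print(bad_values)
```

On the real data this would be `data['rating']` in place of `ratings`, giving the index of each offending row without scanning the full frequency table.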

In [7]:
# Inspect the row with the invalid rating

data.query('rating == "|"')

Unnamed: 0,product_id,product_name,category,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link
1279,B08L12N5H1,Eureka Forbes car Vac 100 Watts Powerful Sucti...,"Home&Kitchen|Kitchen&HomeAppliances|Vacuum,Cle...",2099.0,2499.0,16.0,|,992.0,No Installation is provided for this product|1...,"AGTDSNT2FKVYEPDPXAA673AIS44A,AER2XFSWNN4LAUCJ5...","Divya,Dr Nefario,Deekshith,Preeti,Prasanth R,P...","R2KKTKM4M9RDVJ,R1O692MZOBTE79,R2WRSEWL56SOS4,R...","Decent product,doesn't pick up sand,Ok ok,Must...","Does the job well,doesn't work on sand. though...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Eureka-Forbes-Vacuum-Cle...


In [8]:
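# Replace the stray '|' with 4.0, then convert the rating column to float.
# regex=False treats '|' as a literal character rather than regex alternation.

data['rating'] = data['rating'].str.replace('|', '4.0', regex=False).astype('float64')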
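
The notebook fills the invalid rating with a fixed value of 4.0. An alternative, shown here only as a sketch on a hypothetical `ratings` series, is to coerce unparseable entries to `NaN` and impute them with the column median, so the fill value is derived from the data rather than chosen by hand.

```python
import pandas as pd

# Hypothetical sample with one invalid rating
ratings = pd.Series(["4.2", "4.0", "|", "3.9", "4.1"], name="rating")

# Coerce invalid strings to NaN, then fill them with the median of the valid ratings
numeric = pd.to_numeric(ratings, errors="coerce")
median_rating = numeric.median()
imputed = numeric.fillna(median_rating)

print("Median used for imputation:", median_rating)
print(imputed)
```

With a single affected row out of 1,465, either choice has little effect on aggregate statistics; the median approach mainly avoids hard-coding a value.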

### **4. Checking Duplicate Entries**
We check for duplicated rows and remove them if found.

In [9]:
# Display all duplicated rows
data[data.duplicated()]

Unnamed: 0,product_id,product_name,category,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link
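
The output above is empty, so no fully duplicated rows exist in this dataset. For completeness, the sketch below shows how duplicates could be counted and removed if any were found, using a small hypothetical frame `df`. It also checks for repeats on a single key column, which `duplicated()` without arguments does not catch when other columns differ.

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row
df = pd.DataFrame({
    "product_id": ["A1", "A1", "B2"],
    "rating": [4.2, 4.2, 3.9],
})

# Count rows that are exact copies of an earlier row
n_dupes = df.duplicated().sum()

# Drop the copies, keeping the first occurrence, and rebuild the index
deduped = df.drop_duplicates().reset_index(drop=True)

# Count repeated keys regardless of the other columns
n_dupe_ids = df.duplicated(subset="product_id").sum()

print("Exact duplicate rows:", n_dupes)
print("Repeated product_id values:", n_dupe_ids)
print(deduped)
```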


### **5. Checking Missing Values**
We identify columns with missing values and decide how to handle them.

In [10]:
# Count missing values in each column

data.isna().sum()

product_id             0
product_name           0
category               0
discounted_price       0
actual_price           0
discount_percentage    0
rating                 0
rating_count           2
about_product          0
user_id                0
user_name              0
review_id              0
review_title           0
review_content         0
img_link               0
product_link           0
dtype: int64
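
Only `rating_count` has missing values (2 rows). Two common ways to handle them are sketched below on a hypothetical frame `df`: dropping the affected rows, or filling them with the column median. Which option suits the analysis is a judgment call that the notebook has not yet made at this point.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing rating counts
df = pd.DataFrame({
    "product_id": ["A1", "B2", "C3", "D4"],
    "rating_count": [24269.0, np.nan, 7928.0, np.nan],
})

# Inspect the affected rows before deciding what to do with them
missing_rows = df[df["rating_count"].isna()]

# Option 1: drop rows without a rating count
dropped = df.dropna(subset=["rating_count"])

# Option 2: keep all rows and fill the gaps with the median rating count
median_count = df["rating_count"].median()
filled = df.assign(rating_count=df["rating_count"].fillna(median_count))

print(missing_rows)
print("Rows after dropping:", len(dropped))
print(filled)
```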