In [1]:
import pandas as pd

In [3]:
# Read the CSV file
file_path = 'scrapped_amazon_products.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
data.head()

,Title,Price,Rating,Brand
0,Samsung Galaxy M15 5G Prime Edition (Stone Gre...,13499.0,3.9,Samsung
1,"Redmi A4 5G (Starry Black, 4GB RAM, 128GB Stor...",9299.0,,Redmi
2,"Samsung Galaxy M05 (Mint Green, 4GB RAM, 64 GB...",6499.0,3.9,Samsung
3,"realme NARZO N61 (Voyage Blue,6GB RAM+128GB St...",8498.0,4.0,realme
4,POCO C61 Mystical Green 4GB RAM 64GB ROM,5999.0,3.5,POCO


In [4]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values

Column,Missing values
Title,0
Price,2
Rating,11
Brand,0


In [5]:
# Check for duplicate rows
duplicates = data.duplicated().sum()
duplicates

53
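
Before dropping duplicates it can help to see which rows actually repeat. The following is a minimal sketch on a small made-up frame (the `sample` data and its brand names are hypothetical and only mirror the columns above). It contrasts the default `keep='first'`, which counts only the second and later occurrences as the cell above does, with `keep=False`, which flags every member of a duplicate group.

```python
import pandas as pd

# Hypothetical sample rows that mirror the notebook's columns
sample = pd.DataFrame({
    "Title": ["Phone A", "Phone B", "Phone A", "Phone C"],
    "Price": [9999.0, 6499.0, 9999.0, 5999.0],
    "Rating": [4.0, 3.9, 4.0, 3.5],
    "Brand": ["BrandX", "BrandY", "BrandX", "BrandZ"],
})

# Default keep='first': only the repeated occurrences are flagged
n_repeats = int(sample.duplicated().sum())

# keep=False: every row that belongs to a duplicate group is flagged
all_copies = sample[sample.duplicated(keep=False)]

print(n_repeats)
print(all_copies)
```

Inspecting `all_copies` shows whether the repeats are exact copies produced by scraping or distinct listings that merely look alike.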

In [6]:
# Data types for verification
data_types = data.dtypes
data_types

Column,dtype
Title,object
Price,float64
Rating,float64
Brand,object


In [7]:
# Fill missing values: median for Price, mean for Rating.
# Assigning the result back avoids the chained-assignment FutureWarning
# that data[col].fillna(..., inplace=True) raises.
data['Price'] = data['Price'].fillna(data['Price'].median())
data['Rating'] = data['Rating'].fillna(data['Rating'].mean())
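
One alternative to filling column by column is to pass a column-to-value mapping to `DataFrame.fillna`, which fills both columns in a single call on the frame itself. The sketch below uses a made-up `toy` frame (an assumption for illustration, not the scraped data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each numeric column
toy = pd.DataFrame({
    "Price": [100.0, np.nan, 300.0, 500.0],
    "Rating": [4.0, 3.0, np.nan, 5.0],
})

# Compute the fill value for each column, skipping NaN by default
fill_values = {
    "Price": toy["Price"].median(),
    "Rating": toy["Rating"].mean(),
}

# fillna with a dict fills each named column with its own value
toy = toy.fillna(fill_values)
print(toy)
```

The median is less sensitive to a few very expensive or very cheap listings than the mean, which is a plausible reason for treating Price and Rating differently.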


In [8]:
# Remove duplicate rows; .copy() makes data_cleaned an independent DataFrame
# so the column assignments below do not trigger SettingWithCopyWarning
data_cleaned = data.drop_duplicates().copy()

In [9]:
# Clean text fields
data_cleaned['Title'] = data_cleaned['Title'].str.strip()
data_cleaned['Brand'] = data_cleaned['Brand'].str.strip()


In [10]:
# Re-check for remaining issues
remaining_missing = data_cleaned.isnull().sum()
remaining_duplicates = data_cleaned.duplicated().sum()


In [11]:
{
    "Remaining Missing Values": remaining_missing,
    "Remaining Duplicate Rows": remaining_duplicates,
    "Shape After Cleaning": data_cleaned.shape
}

{'Remaining Missing Values': Title     0
 Price     0
 Rating    0
 Brand     0
 dtype: int64,
 'Remaining Duplicate Rows': 0,
 'Shape After Cleaning': (147, 4)}

**Cleaning Summary:**

Missing Values:
The 2 missing Price values were filled with the column median, and the 11 missing Rating values with the column mean.

Duplicates:
All 53 duplicate rows were removed, leaving 147 rows and 4 columns.

Text Field Cleaning:
Leading and trailing whitespace was stripped from the Title and Brand columns.
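
The steps summarized above can be gathered into one reusable function. This is only a sketch under stated assumptions: the function name `clean_products` and the `raw` sample frame are hypothetical, and the order of operations (fill, deduplicate, strip) simply follows the cells above.

```python
import numpy as np
import pandas as pd


def clean_products(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values, drop duplicate rows, and strip text columns."""
    out = df.copy()  # leave the caller's frame untouched
    out = out.fillna({
        "Price": out["Price"].median(),
        "Rating": out["Rating"].mean(),
    })
    out = out.drop_duplicates().copy()
    for col in ("Title", "Brand"):
        out[col] = out[col].str.strip()
    return out


# Hypothetical sample: rows 0 and 2 are exact duplicates
raw = pd.DataFrame({
    "Title": [" Phone A", "Phone B ", " Phone A", "Phone C"],
    "Price": [100.0, np.nan, 100.0, 300.0],
    "Rating": [4.0, 3.0, 4.0, np.nan],
    "Brand": ["BrandX ", "BrandY", "BrandX ", "BrandZ"],
})
cleaned = clean_products(raw)
print(cleaned.shape)
print(cleaned["Title"].tolist())
```

Because stripping happens after deduplication, rows that differ only in surrounding whitespace would survive the first pass. Re-checking `duplicated()` after cleaning, as cell In [10] does, catches that case.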

In [14]:
data_cleaned.to_csv("cleaned_dataset.csv", index=False)
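
The `index=False` argument matters when the file is read back. The sketch below writes a small hypothetical `frame` to a temporary directory (the path and data are assumptions for illustration) and compares the columns that `read_csv` returns with and without the index written.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame with the notebook's columns
frame = pd.DataFrame({
    "Title": ["Phone A", "Phone B"],
    "Price": [100.0, 200.0],
    "Rating": [4.0, 3.5],
    "Brand": ["BrandX", "BrandY"],
})

with tempfile.TemporaryDirectory() as tmp:
    # index=False: only the data columns are written
    no_index_path = os.path.join(tmp, "no_index.csv")
    frame.to_csv(no_index_path, index=False)
    without_index = pd.read_csv(no_index_path)

    # Default index=True: the row index is written under a blank header
    index_path = os.path.join(tmp, "with_index.csv")
    frame.to_csv(index_path)
    with_index = pd.read_csv(index_path)

print(list(without_index.columns))
print(list(with_index.columns))
```

When the index is written, `read_csv` names the blank header column `Unnamed: 0`, so omitting it keeps the saved file at the same four columns as the cleaned frame.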