# Data Cleaning - Nykaa Cosmetics Product Reviews

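This notebook cleans the Nykaa cosmetics product review dataset in the following steps:
- Missing value analysis
- Duplicate detection
- Data type conversions (numeric, boolean, and date columns)
- Text cleaning
- Outlier removal (range checks on ratings and prices)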

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import re
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

## Load Data

In [2]:
# Load the CSV file
df = pd.read_csv('nyka_top_brands_cosmetics_product_reviews.csv')
print(f"Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

Data loaded: 61284 rows, 18 columns


Unnamed: 0,product_id,brand_name,review_id,review_title,review_text,author,review_date,review_rating,is_a_buyer,pro_user,review_label,product_title,mrp,price,product_rating,product_rating_count,product_tags,product_url
0,781070,Olay,16752142,Worth buying 50g one,Works as it claims. Could see the difference f...,Ashton Dsouza,2021-01-23 15:17:18,5.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
1,781070,Olay,14682550,Best cream to start ur day,It does what it claims . Best thing is it smoo...,Amrit Neelam,2020-09-07 15:30:42,5.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
2,781070,Olay,15618995,perfect for summers dry for winters,I have been using this product for months now....,Sanchi Gupta,2020-11-13 12:24:14,4.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
3,781070,Olay,13474509,Not a moisturizer,"i have an oily skin, while this whip acts as a...",Ruchi Shah,2020-06-14 11:56:50,3.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
4,781070,Olay,16338982,Average,It's not that good. Please refresh try for oth...,Sukanya Sarkar,2020-12-22 15:24:35,2.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...


## Initial Data Overview

In [3]:
# Display basic information
print("Dataset Info:")
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61284 entries, 0 to 61283
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   product_id            61284 non-null  int64  
 1   brand_name            61284 non-null  object 
 2   review_id             61284 non-null  int64  
 3   review_title          61284 non-null  object 
 4   review_text           61275 non-null  object 
 5   author                61284 non-null  object 
 6   review_date           61284 non-null  object 
 7   review_rating         61283 non-null  float64
 8   is_a_buyer            61284 non-null  bool   
 9   pro_user              61284 non-null  bool   
 10  review_label          48249 non-null  object 
 11  product_title         61284 non-null  object 
 12  mrp                   61284 non-null  int64  
 13  price                 61284 non-null  int64  
 14  product_rating        61284 non-null  float64
 15  produ

In [4]:
# Display first few rows
df.head(10)

Unnamed: 0,product_id,brand_name,review_id,review_title,review_text,author,review_date,review_rating,is_a_buyer,pro_user,review_label,product_title,mrp,price,product_rating,product_rating_count,product_tags,product_url
0,781070,Olay,16752142,Worth buying 50g one,Works as it claims. Could see the difference f...,Ashton Dsouza,2021-01-23 15:17:18,5.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
1,781070,Olay,14682550,Best cream to start ur day,It does what it claims . Best thing is it smoo...,Amrit Neelam,2020-09-07 15:30:42,5.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
2,781070,Olay,15618995,perfect for summers dry for winters,I have been using this product for months now....,Sanchi Gupta,2020-11-13 12:24:14,4.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
3,781070,Olay,13474509,Not a moisturizer,"i have an oily skin, while this whip acts as a...",Ruchi Shah,2020-06-14 11:56:50,3.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
4,781070,Olay,16338982,Average,It's not that good. Please refresh try for oth...,Sukanya Sarkar,2020-12-22 15:24:35,2.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
5,781070,Olay,14549640,not good for oily skin,dz product z best for dry skin ...one of olay ...,Laxmi Basumatary,2020-08-27 18:16:47,1.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
6,739418,Olay,16531371,All time favorite,"This cream is just awesome, It makes my rough ...",Priyanka Barwal,2021-01-06 15:43:25,5.0,False,False,,Olay Regenerist Whip Mini and Ultimate Eye Cre...,2198,1943,4.0,792,,https://www.nykaa.com/olay-regenerist-whip-min...
7,739418,Olay,21356560,"""Good Product """,Instantly perfect skin tone appearance.,Bandana Mukherjee,2021-11-13 19:57:06,5.0,False,False,,Olay Regenerist Whip Mini and Ultimate Eye Cre...,2198,1943,4.0,792,,https://www.nykaa.com/olay-regenerist-whip-min...
8,739418,Olay,15235570,Good eye cream combo,This eye cream combo is effective. Works on fi...,krish,2020-10-19 15:08:13,5.0,False,False,,Olay Regenerist Whip Mini and Ultimate Eye Cre...,2198,1943,4.0,792,,https://www.nykaa.com/olay-regenerist-whip-min...
9,739418,Olay,22008691,"""Olay''","3in1 benifits, helps reduces dark spots and wr...",Gargi Mukherjee,2021-12-16 14:55:52,5.0,False,False,,Olay Regenerist Whip Mini and Ultimate Eye Cre...,2198,1943,4.0,792,,https://www.nykaa.com/olay-regenerist-whip-min...


## Missing Values Analysis

In [5]:
# Check missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing_Count': missing,
    'Percentage': missing_pct
})

print("Missing Values Summary:")
print(missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False))

Missing Values Summary:
               Missing_Count  Percentage
product_tags           47782   77.968148
review_label           13035   21.269826
review_text                9    0.014686
review_rating              1    0.001632
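
The notebook reports the missing values but does not act on them. As a minimal sketch of one possible follow-up, the block below drops rows that lack a review text or rating and fills the two sparse categorical columns with placeholders. The toy frame, the `"Unlabelled"` and `"No tag"` placeholders, and the `"BESTSELLER"` tag are all hypothetical; whether dropping or filling is appropriate depends on the later analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with the same four columns that contain gaps in the real data
toy = pd.DataFrame({
    "review_text": ["Nice cream", np.nan, "Too oily"],
    "review_rating": [5.0, 4.0, np.nan],
    "review_label": ["Verified Buyer", np.nan, np.nan],
    "product_tags": [np.nan, np.nan, "BESTSELLER"],  # hypothetical tag value
})

# Assumption: a review without text or rating is unusable, so drop those rows
handled = toy.dropna(subset=["review_text", "review_rating"]).copy()

# Assumption: a missing label or tag means "none", so fill with a placeholder
handled["review_label"] = handled["review_label"].fillna("Unlabelled")
handled["product_tags"] = handled["product_tags"].fillna("No tag")

print(handled)
```

In the sample rows shown earlier, `review_label` is empty exactly where `is_a_buyer` is `False`, so the missing labels may simply mean "not a verified buyer" rather than lost data. That would need checking on the full dataset.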


## Duplicate Analysis

In [6]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Total duplicate rows: {duplicates}")

# Check duplicate review_ids
duplicate_reviews = df['review_id'].duplicated().sum()
print(f"Duplicate review IDs: {duplicate_reviews}")

# Remove duplicates if any
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"Removed {duplicates} duplicate rows")
    print(f"New shape: {df.shape}")

Total duplicate rows: 0
Duplicate review IDs: 0
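
The two checks above answer different questions: `df.duplicated()` flags rows that are identical in every column, while checking `review_id` alone flags a repeated key even when other columns differ. The sketch below illustrates the difference on hypothetical toy data; the real dataset has neither kind of duplicate.

```python
import pandas as pd

# Hypothetical toy data: review_id 102 appears twice with slightly different text
toy = pd.DataFrame({
    "review_id": [101, 102, 102, 103],
    "review_text": ["Good", "Okay", "Okay!", "Bad"],
})

# Fully identical rows: none here, because the two texts for 102 differ
full_dupes = toy.duplicated().sum()

# Repeated key: the second occurrence of review_id 102 is flagged
id_dupes = toy.duplicated(subset=["review_id"]).sum()

# Keep the first row for each review_id
deduped = toy.drop_duplicates(subset=["review_id"], keep="first")

print(full_dupes, id_dupes, len(deduped))
```

Note that the cleaning cell only removes fully identical rows. If duplicate review IDs ever appeared, they would be counted but not dropped.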


## Clean Text Columns

In [7]:
# Clean text columns
text_cols = ['review_title', 'review_text', 'author', 'product_title', 'brand_name']

for col in text_cols:
    if col in df.columns:
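        # Remember which entries are genuinely missing before casting to str
        is_missing = df[col].isna()
        # Remove extra whitespace
        df[col] = df[col].astype(str).str.strip()
        df[col] = df[col].str.replace(r'\s+', ' ', regex=True)
        # Restore NaN only where the value was missing, so a review whose
        # text is literally 'nan' is not discarded
        df[col] = df[col].mask(is_missing)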
        print(f"Cleaned: {col}")

print("\nText columns cleaned successfully!")

Cleaned: review_title
Cleaned: review_text
Cleaned: author
Cleaned: product_title
Cleaned: brand_name

Text columns cleaned successfully!
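
As an alternative sketch, the `.str` accessor can do the same whitespace cleanup without any cast, because it passes missing values through unchanged. This assumes every non-missing entry in the column is already a string; the toy series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values: padded text, a missing entry, and the literal text "nan"
s = pd.Series(["  Good   product ", np.nan, "nan"], dtype=object)

# Strip the ends and collapse internal runs of whitespace to a single space
cleaned = s.str.strip().str.replace(r"\s+", " ", regex=True)

print(cleaned.tolist())
print("Missing after cleaning:", cleaned.isna().sum())
```

Only the entry that was missing to begin with stays missing; the literal text `"nan"` survives as ordinary text.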


## Clean Numeric Columns

In [8]:
# Convert rating columns to float
rating_cols = ['review_rating', 'product_rating']
for col in rating_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        print(f"Converted {col} to numeric")

# Convert price columns to numeric
price_cols = ['mrp', 'price']
for col in price_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        print(f"Converted {col} to numeric")

# Convert product_rating_count to numeric
if 'product_rating_count' in df.columns:
    df['product_rating_count'] = pd.to_numeric(df['product_rating_count'], errors='coerce')
    print("Converted product_rating_count to numeric")

print("\nNumeric columns cleaned successfully!")

Converted review_rating to numeric
Converted product_rating to numeric
Converted mrp to numeric
Converted price to numeric
Converted product_rating_count to numeric

Numeric columns cleaned successfully!
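
Because `errors='coerce'` silently turns unparseable values into `NaN`, it can be worth counting how many values were lost in the conversion. The following is a minimal sketch on a hypothetical series; the columns in this dataset were already numeric, so nothing was coerced here.

```python
import pandas as pd

# Hypothetical raw values: two parseable, two unparseable, one already missing
raw = pd.Series(["4.5", "3", "N/A", None, "five"])

before = raw.isna().sum()
converted = pd.to_numeric(raw, errors="coerce")
after = converted.isna().sum()

# Values that were present but could not be parsed
newly_missing = after - before
print(f"Unparseable values coerced to NaN: {newly_missing}")

# Show the offending raw values for inspection
offenders = raw[converted.isna() & raw.notna()].tolist()
print(offenders)
```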


## Clean Boolean Columns

In [9]:
# Clean boolean columns
bool_cols = ['is_a_buyer', 'pro_user']

for col in bool_cols:
    if col in df.columns:
        df[col] = df[col].astype(str).str.strip().str.lower()
        df[col] = df[col].map({'true': True, 'false': False})
        print(f"Converted {col} to boolean")

print("\nBoolean columns cleaned successfully!")

Converted is_a_buyer to boolean
Converted pro_user to boolean

Boolean columns cleaned successfully!
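
One property of the `map` approach is that any value outside the dictionary becomes `NaN` without raising an error. The sketch below applies the same chain to hypothetical input and lists the values that failed to map, which is a cheap check to run after the conversion.

```python
import pandas as pd

# Hypothetical raw values, including an unexpected "yes" and a missing entry
raw = pd.Series([True, "False", " TRUE ", "yes", None], dtype=object)

# Same normalisation chain as the cell above
normalised = raw.astype(str).str.strip().str.lower()
mapped = normalised.map({"true": True, "false": False})

# Present in the input but not recognised by the mapping
unmapped = mapped.isna() & raw.notna()
print("Unrecognised values:", raw[unmapped].tolist())
```

In this dataset both columns were already of `bool` dtype on load, so the mapping leaves them unchanged.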


## Clean Date Columns

In [10]:
# Convert date column
if 'review_date' in df.columns:
    df['review_date'] = pd.to_datetime(df['review_date'], errors='coerce')
    print(f"Date range: {df['review_date'].min()} to {df['review_date'].max()}")
    print("Date column cleaned successfully!")

Date range: 2013-05-20 16:48:56 to 2022-10-22 18:12:27
Date column cleaned successfully!
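
As with the numeric conversion, `errors='coerce'` hides parse failures as `NaT`. A minimal sketch of how the failures could be counted is shown below. The explicit `format` string is an assumption based on the timestamps visible in the sample rows, and the toy series is hypothetical.

```python
import pandas as pd

# Two timestamps in the style of the sample rows, one bad value, one missing
raw = pd.Series(["2021-01-23 15:17:18", "2020-09-07 15:30:42", "not a date", None])

# An explicit format avoids per-row format guessing and makes failures explicit
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Present in the input but not parseable
failed = parsed.isna() & raw.notna()
print(f"Unparseable dates: {failed.sum()}")
```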


## Remove Outliers

In [11]:
# Check rating outliers (should be between 1-5)
if 'review_rating' in df.columns:
    invalid_ratings = df[(df['review_rating'] < 1) | (df['review_rating'] > 5)].shape[0]
    print(f"Invalid review ratings (not 1-5): {invalid_ratings}")
    
    # Remove invalid ratings, keeping rows where the rating is missing
    df = df[((df['review_rating'] >= 1) & (df['review_rating'] <= 5)) | df['review_rating'].isna()]

# Check price outliers (negative prices)
if 'price' in df.columns:
    negative_prices = df[df['price'] < 0].shape[0]
    print(f"Negative prices: {negative_prices}")
    
    # Remove negative prices
    df = df[(df['price'] >= 0) | df['price'].isna()]

print(f"\nFinal shape after outlier removal: {df.shape}")

Invalid review ratings (not 1-5): 0
Negative prices: 0

Final shape after outlier removal: (61284, 18)
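
The same range checks can be written more compactly with `Series.between`, building one boolean mask per rule so that violations are counted before any rows are dropped. This is a sketch on hypothetical toy data, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two bad ratings, one missing rating, one bad price
toy = pd.DataFrame({
    "review_rating": [5.0, 0.0, np.nan, 7.0, 3.0],
    "price": [1599, 400, 262, 639, -10],
})

# between() is inclusive on both ends by default; NaN compares as False,
# so missing values are added back explicitly
valid_rating = toy["review_rating"].between(1, 5) | toy["review_rating"].isna()
valid_price = toy["price"].ge(0) | toy["price"].isna()

print(f"Invalid ratings: {(~valid_rating).sum()}")
print(f"Negative prices: {(~valid_price).sum()}")

# Apply both rules in a single filtering step
filtered = toy[valid_rating & valid_price]
print(filtered)
```

These are validity checks rather than statistical outlier detection: a price of 2947 passes even though it sits far above the median.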


## Final Data Overview

In [12]:
# Display final data info
print("Final Dataset Info:")
df.info()

Final Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61284 entries, 0 to 61283
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   product_id            61284 non-null  int64         
 1   brand_name            61284 non-null  object        
 2   review_id             61284 non-null  int64         
 3   review_title          61284 non-null  object        
 4   review_text           61275 non-null  object        
 5   author                61284 non-null  object        
 6   review_date           61284 non-null  datetime64[ns]
 7   review_rating         61283 non-null  float64       
 8   is_a_buyer            61284 non-null  bool          
 9   pro_user              61284 non-null  bool          
 10  review_label          48249 non-null  object        
 11  product_title         61284 non-null  object        
 12  mrp                   61284 non-null  int64         
 

In [13]:
# Display summary statistics
df.describe()

Unnamed: 0,product_id,review_id,review_date,review_rating,mrp,price,product_rating,product_rating_count
count,61284.0,61284.0,61284,61283.0,61284.0,61284.0,61284.0,61284.0
mean,798380.2,14849950.0,2020-08-22 20:00:04.072498432,4.414781,573.260247,462.129512,4.09913,7582.96384
min,250.0,96.0,2013-05-20 16:48:56,1.0,75.0,45.0,1.5,1.0
25%,160488.0,11023730.0,2019-10-23 01:55:38.500000,4.0,300.0,262.0,4.0,1760.0
50%,452443.0,15251420.0,2020-10-20 13:07:14.500000,5.0,599.0,400.0,4.1,3925.0
75%,766529.0,20029270.0,2021-08-14 19:37:17.500000,5.0,799.0,639.0,4.3,8720.0
max,7749427.0,29630310.0,2022-10-22 18:12:27,5.0,3874.0,2947.0,4.8,98477.0
std,1281418.0,7383506.0,,1.062547,324.09893,264.876964,0.235945,14463.246136


In [14]:
# Display sample of cleaned data
df.head(10)

Unnamed: 0,product_id,brand_name,review_id,review_title,review_text,author,review_date,review_rating,is_a_buyer,pro_user,review_label,product_title,mrp,price,product_rating,product_rating_count,product_tags,product_url
0,781070,Olay,16752142,Worth buying 50g one,Works as it claims. Could see the difference f...,Ashton Dsouza,2021-01-23 15:17:18,5.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
1,781070,Olay,14682550,Best cream to start ur day,It does what it claims . Best thing is it smoo...,Amrit Neelam,2020-09-07 15:30:42,5.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
2,781070,Olay,15618995,perfect for summers dry for winters,I have been using this product for months now....,Sanchi Gupta,2020-11-13 12:24:14,4.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
3,781070,Olay,13474509,Not a moisturizer,"i have an oily skin, while this whip acts as a...",Ruchi Shah,2020-06-14 11:56:50,3.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
4,781070,Olay,16338982,Average,It's not that good. Please refresh try for oth...,Sukanya Sarkar,2020-12-22 15:24:35,2.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
5,781070,Olay,14549640,not good for oily skin,dz product z best for dry skin ...one of olay ...,Laxmi Basumatary,2020-08-27 18:16:47,1.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
6,739418,Olay,16531371,All time favorite,"This cream is just awesome, It makes my rough ...",Priyanka Barwal,2021-01-06 15:43:25,5.0,False,False,,Olay Regenerist Whip Mini and Ultimate Eye Cre...,2198,1943,4.0,792,,https://www.nykaa.com/olay-regenerist-whip-min...
7,739418,Olay,21356560,"""Good Product """,Instantly perfect skin tone appearance.,Bandana Mukherjee,2021-11-13 19:57:06,5.0,False,False,,Olay Regenerist Whip Mini and Ultimate Eye Cre...,2198,1943,4.0,792,,https://www.nykaa.com/olay-regenerist-whip-min...
8,739418,Olay,15235570,Good eye cream combo,This eye cream combo is effective. Works on fi...,krish,2020-10-19 15:08:13,5.0,False,False,,Olay Regenerist Whip Mini and Ultimate Eye Cre...,2198,1943,4.0,792,,https://www.nykaa.com/olay-regenerist-whip-min...
9,739418,Olay,22008691,"""Olay''","3in1 benifits, helps reduces dark spots and wr...",Gargi Mukherjee,2021-12-16 14:55:52,5.0,False,False,,Olay Regenerist Whip Mini and Ultimate Eye Cre...,2198,1943,4.0,792,,https://www.nykaa.com/olay-regenerist-whip-min...


## Save Cleaned Data

In [15]:
# Save cleaned data to CSV
df.to_csv('cleaned_data.csv', index=False)
print("Cleaned data saved to: cleaned_data.csv")
print(f"Final shape: {df.shape}")

Cleaned data saved to: cleaned_data.csv
Final shape: (61284, 18)
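
CSV does not store dtypes, so the datetime conversion done earlier is lost on reload unless the date column is parsed again. The sketch below demonstrates the round trip on a hypothetical two-row frame, using an in-memory buffer in place of `cleaned_data.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row frame with an integer, a datetime, and a boolean column
toy = pd.DataFrame({
    "review_id": [16752142, 14682550],
    "review_date": pd.to_datetime(["2021-01-23 15:17:18", "2020-09-07 15:30:42"]),
    "is_a_buyer": [True, False],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates the date column would come back as plain text
reloaded = pd.read_csv(buffer, parse_dates=["review_date"])

print(reloaded.dtypes)
```

If preserving dtypes exactly matters, a binary format such as Parquet or pickle is an option, at the cost of a file that is no longer human-readable.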


## Summary

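Data cleaning completed successfully. No duplicate rows, out-of-range ratings, or negative prices were found, so the dataset keeps its original shape of 61,284 rows and 18 columns. Missing values were reported but not imputed or dropped. The cleaned dataset, saved as cleaned_data.csv, is ready for preprocessing and analysis.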