# Data Preprocessing - Nykaa Cosmetics Product Reviews

This notebook preprocesses the cleaned Nykaa product review data. The steps include:
- Feature engineering
- Text preprocessing with NLP
- Categorical encoding
- Missing value handling
- Feature scaling

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import warnings
warnings.filterwarnings('ignore')


## Download NLTK Data

In [2]:
# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

print("NLTK data ready!")

NLTK data ready!


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Load Cleaned Data

In [3]:
# Load cleaned data
df = pd.read_csv('cleaned_data.csv')
print(f"Data loaded: {df.shape}")
df.head()

Data loaded: (61284, 18)


Unnamed: 0,product_id,brand_name,review_id,review_title,review_text,author,review_date,review_rating,is_a_buyer,pro_user,review_label,product_title,mrp,price,product_rating,product_rating_count,product_tags,product_url
0,781070,Olay,16752142,Worth buying 50g one,Works as it claims. Could see the difference f...,Ashton Dsouza,2021-01-23 15:17:18,5.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
1,781070,Olay,14682550,Best cream to start ur day,It does what it claims . Best thing is it smoo...,Amrit Neelam,2020-09-07 15:30:42,5.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
2,781070,Olay,15618995,perfect for summers dry for winters,I have been using this product for months now....,Sanchi Gupta,2020-11-13 12:24:14,4.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
3,781070,Olay,13474509,Not a moisturizer,"i have an oily skin, while this whip acts as a...",Ruchi Shah,2020-06-14 11:56:50,3.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...
4,781070,Olay,16338982,Average,It's not that good. Please refresh try for oth...,Sukanya Sarkar,2020-12-22 15:24:35,2.0,True,False,Verified Buyer,Olay Ultra Lightweight Moisturiser: Luminous W...,1999,1599,4.1,43,,https://www.nykaa.com/olay-ultra-lightweight-m...


## Feature Engineering

### Date Features

In [4]:
# Create date features
if 'review_date' in df.columns:
    df['review_date'] = pd.to_datetime(df['review_date'])
    df['review_year'] = df['review_date'].dt.year
    df['review_month'] = df['review_date'].dt.month
    df['review_day'] = df['review_date'].dt.day
    df['review_dayofweek'] = df['review_date'].dt.dayofweek
    df['review_quarter'] = df['review_date'].dt.quarter
    print("Date features created:")
    print("- review_year, review_month, review_day")
    print("- review_dayofweek, review_quarter")

df[['review_date', 'review_year', 'review_month', 'review_quarter']].head()

Date features created:
- review_year, review_month, review_day
- review_dayofweek, review_quarter


Unnamed: 0,review_date,review_year,review_month,review_quarter
0,2021-01-23 15:17:18,2021,1,1
1,2020-09-07 15:30:42,2020,9,3
2,2020-11-13 12:24:14,2020,11,4
3,2020-06-14 11:56:50,2020,6,2
4,2020-12-22 15:24:35,2020,12,4
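
As a rough illustration of the date decomposition above, the sketch below runs the same accessors on two made-up values. The `errors="coerce"` argument is an assumption added here and is not in the original cell; it turns unparseable strings into `NaT` instead of raising. Note that `dt.dayofweek` counts from Monday = 0.

```python
import pandas as pd

# Two illustrative values: one valid timestamp and one unparseable string
dates = pd.Series(["2021-01-23 15:17:18", "not a date"])

# errors="coerce" (an assumption, not in the original cell) maps bad values to NaT
parsed = pd.to_datetime(dates, errors="coerce")

features = pd.DataFrame({
    "review_year": parsed.dt.year,
    "review_month": parsed.dt.month,
    "review_dayofweek": parsed.dt.dayofweek,  # Monday = 0, Sunday = 6
    "review_quarter": parsed.dt.quarter,
})
print(features)
```

Rows with `NaT` produce missing values in every derived column, so they would need to be handled in the missing-value step.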


### Price Features

In [5]:
# Create price features
if 'mrp' in df.columns and 'price' in df.columns:
    df['discount'] = df['mrp'] - df['price']
    df['discount_percentage'] = ((df['mrp'] - df['price']) / df['mrp'] * 100).round(2)
    print("Price features created:")
    print("- discount")
    print("- discount_percentage")

df[['mrp', 'price', 'discount', 'discount_percentage']].head()

Price features created:
- discount
- discount_percentage


Unnamed: 0,mrp,price,discount,discount_percentage
0,1999,1599,400,20.01
1,1999,1599,400,20.01
2,1999,1599,400,20.01
3,1999,1599,400,20.01
4,1999,1599,400,20.01
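
The percentage formula divides by `mrp`, so a zero MRP would produce `inf` or `NaN`. The summary statistics later in this notebook show a minimum MRP of 75, so this does not occur here. A minimal sketch of a guard for other datasets, using made-up prices:

```python
import numpy as np
import pandas as pd

# Illustrative prices; the second row has a zero MRP on purpose
prices = pd.DataFrame({"mrp": [1999, 0, 599], "price": [1599, 0, 599]})

# Replace a zero MRP with NaN so the division yields NaN rather than inf
safe_mrp = prices["mrp"].replace(0, np.nan)

prices["discount"] = prices["mrp"] - prices["price"]
prices["discount_percentage"] = (prices["discount"] / safe_mrp * 100).round(2)
print(prices)
```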


### Text Length Features

In [6]:
# Create text length features
if 'review_text' in df.columns:
    df['review_text_length'] = df['review_text'].astype(str).str.len()
    df['review_word_count'] = df['review_text'].astype(str).str.split().str.len()
    print("Review text features created")

if 'review_title' in df.columns:
    df['review_title_length'] = df['review_title'].astype(str).str.len()
    print("Review title features created")

df[['review_text_length', 'review_word_count', 'review_title_length']].describe()

Review text features created
Review title features created


Unnamed: 0,review_text_length,review_word_count,review_title_length
count,61284.0,61284.0,61284.0
mean,116.315237,21.612003,14.257424
std,101.363261,19.094298,9.413795
min,1.0,1.0,1.0
25%,53.0,10.0,8.0
50%,87.0,16.0,12.0
75%,147.0,27.0,18.0
max,2555.0,437.0,201.0
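
The length features are computed before missing values are handled, and `astype(str)` may turn a missing review into a literal string (such as `nan`) that is then counted as text. One possible alternative, sketched on made-up reviews, is to fill missing text with an empty string first so that it counts as length 0:

```python
import pandas as pd

# Illustrative reviews; the second one is missing
reviews = pd.Series(["Works as it claims", None, "Average"])

# Fill missing reviews with an empty string so they count as length 0
filled = reviews.fillna("")

text_length = filled.str.len()
word_count = filled.str.split().str.len()

print(text_length.tolist())
print(word_count.tolist())
```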


### Rating Features

In [7]:
# Create rating difference feature
if 'review_rating' in df.columns and 'product_rating' in df.columns:
    df['rating_diff'] = df['review_rating'] - df['product_rating']
    print("Rating difference feature created")

# Create sentiment categories based on rating
if 'review_rating' in df.columns:
    df['sentiment'] = pd.cut(df['review_rating'], 
                              bins=[0, 2, 3, 5], 
                              labels=['Negative', 'Neutral', 'Positive'])
    print("Sentiment categories created")
    print("\nSentiment distribution:")
    print(df['sentiment'].value_counts())

df[['review_rating', 'product_rating', 'rating_diff', 'sentiment']].head()

Rating difference feature created
Sentiment categories created

Sentiment distribution:
sentiment
Positive    52948
Negative     4795
Neutral      3540
Name: count, dtype: int64


Unnamed: 0,review_rating,product_rating,rating_diff,sentiment
0,5.0,4.1,0.9,Positive
1,5.0,4.1,0.9,Positive
2,4.0,4.1,-0.1,Positive
3,3.0,4.1,-1.1,Neutral
4,2.0,4.1,-2.1,Negative
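
The bin edges `[0, 2, 3, 5]` rely on `pd.cut` using right-inclusive intervals by default, so ratings 1 and 2 are Negative, 3 is Neutral, and 4 and 5 are Positive. A short check on illustrative ratings, including one missing value:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

# Intervals are right-inclusive by default: (0, 2], (2, 3], (3, 5]
sentiment = pd.cut(ratings, bins=[0, 2, 3, 5],
                   labels=["Negative", "Neutral", "Positive"])
print(sentiment.tolist())
```

A missing rating stays missing after binning, which is why `sentiment` has one null value in this notebook.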


## Text Preprocessing

### Define Text Preprocessing Function

In [8]:
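# Build the stopword set and lemmatizer once, not on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, strip non-letters, tokenize, drop stopwords, and lemmatize."""
    if pd.isna(text):
        return ""

    # Convert to lowercase
    text = str(text).lower()

    # Keep only letters and whitespace (digits and punctuation are dropped)
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords (NLTK's English list includes negations such as "not")
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Lemmatize (WordNetLemmatizer treats each token as a noun by default)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return ' '.join(tokens)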

print("Text preprocessing function defined")

Text preprocessing function defined


### Process Review Text

In [9]:
# Process review text (this may take a few minutes)
if 'review_text' in df.columns:
    print("Processing review text... (this may take a few minutes)")
    df['processed_review_text'] = df['review_text'].apply(preprocess_text)
    print("Review text processed!")

# Show example
print("\nExample:")
print("Original:", df['review_text'].iloc[0])
print("Processed:", df['processed_review_text'].iloc[0])

Processing review text... (this may take a few minutes)
Review text processed!

Example:
Original: Works as it claims. Could see the difference from the first day. Use it with Olay cleanser for best results
Processed: work claim could see difference first day use olay cleanser best result


### Process Review Title

In [10]:
# Process review title
if 'review_title' in df.columns:
    print("Processing review title...")
    df['processed_review_title'] = df['review_title'].apply(preprocess_text)
    print("Review title processed!")

# Show example
print("\nExample:")
print("Original:", df['review_title'].iloc[0])
print("Processed:", df['processed_review_title'].iloc[0])

Processing review title...
Review title processed!

Example:
Original: Worth buying 50g one
Processed: worth buying g one
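
Because the stopword list contains negations, a title such as "Not a moisturizer" is reduced to "moisturizer", which may matter for sentiment modeling. The sketch below shows one way to keep negations. It is a simplified stand-in, not the notebook's function: `clean_keep_negations` and the tiny `STOP_WORDS` set are hypothetical, it splits on whitespace instead of using `word_tokenize`, and it skips lemmatization.

```python
import re

# Tiny illustrative stopword list (an assumption; the notebook uses NLTK's full English list)
STOP_WORDS = {"a", "an", "the", "it", "its", "is", "that", "for", "not", "no", "nor"}
NEGATIONS = {"not", "no", "nor"}

def clean_keep_negations(text):
    """Lowercase, keep letters only, and drop stopwords except negations."""
    text = re.sub(r"[^a-z\s]", "", str(text).lower())
    to_drop = STOP_WORDS - NEGATIONS
    return " ".join(word for word in text.split() if word not in to_drop)

print(clean_keep_negations("Not a moisturizer"))
print(clean_keep_negations("It's not that good."))
```

With NLTK, the same idea would amount to subtracting the negation words from the stopword set before filtering.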


## Encode Categorical Features

### Label Encoding for Brands

In [11]:
# Label encoding for brand names
if 'brand_name' in df.columns:
    le_brand = LabelEncoder()
    df['brand_encoded'] = le_brand.fit_transform(df['brand_name'].astype(str))
    print(f"Brand names encoded: {len(le_brand.classes_)} unique brands")
    print("\nBrand encoding sample:")
    print(df[['brand_name', 'brand_encoded']].head(10))

Brand names encoded: 11 unique brands

Brand encoding sample:
  brand_name  brand_encoded
0       Olay             10
1       Olay             10
2       Olay             10
3       Olay             10
4       Olay             10
5       Olay             10
6       Olay             10
7       Olay             10
8       Olay             10
9       Olay             10
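
`LabelEncoder` assigns codes by the sorted order of the unique values, which is why Olay, sorting last among the 11 brands, gets code 10. The pure-Python sketch below mimics that ordering on placeholder names (not the dataset's actual brand list). The resulting integers are arbitrary identifiers, so models that treat them as ordered quantities may read meaning into them that is not there.

```python
# Placeholder brand names for illustration only (not the dataset's actual brand list)
brands = ["Olay", "Brand B", "Brand A", "Olay"]

# Codes follow the sorted order of the unique values, as LabelEncoder does
classes = sorted(set(brands))
mapping = {name: code for code, name in enumerate(classes)}
encoded = [mapping[name] for name in brands]

print(mapping)
print(encoded)
```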


### One-Hot Encoding for Sentiment

In [12]:
# One-hot encoding for sentiment
if 'sentiment' in df.columns:
    sentiment_dummies = pd.get_dummies(df['sentiment'], prefix='sentiment')
    df = pd.concat([df, sentiment_dummies], axis=1)
    print("Sentiment one-hot encoded")
    print(f"\nNew columns: {list(sentiment_dummies.columns)}")
    print(sentiment_dummies.head())

Sentiment one-hot encoded

New columns: ['sentiment_Negative', 'sentiment_Neutral', 'sentiment_Positive']
   sentiment_Negative  sentiment_Neutral  sentiment_Positive
0               False              False                True
1               False              False                True
2               False              False                True
3               False               True               False
4                True              False               False
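
Two details of `pd.get_dummies` are worth knowing here, sketched below on illustrative values: passing `dtype=int` yields 0/1 columns instead of booleans, and a row whose sentiment is missing gets zeros in all three columns.

```python
import pandas as pd

levels = ["Negative", "Neutral", "Positive"]
# Illustrative column; the third value is missing
sentiment = pd.Series(pd.Categorical(["Positive", "Neutral", None], categories=levels))

# dtype=int gives 0/1 columns; a missing sentiment yields a row of all zeros
dummies = pd.get_dummies(sentiment, prefix="sentiment", dtype=int)
print(dummies)
```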


### Encode Boolean Columns

In [13]:
# Encode boolean columns
bool_cols = ['is_a_buyer', 'pro_user']
for col in bool_cols:
    if col in df.columns:
        df[f'{col}_encoded'] = df[col].astype(int)
        print(f"Encoded: {col}")

df[['is_a_buyer', 'is_a_buyer_encoded', 'pro_user', 'pro_user_encoded']].head()

Encoded: is_a_buyer
Encoded: pro_user


Unnamed: 0,is_a_buyer,is_a_buyer_encoded,pro_user,pro_user_encoded
0,True,1,False,0
1,True,1,False,0
2,True,1,False,0
3,True,1,False,0
4,True,1,False,0


## Handle Missing Values

In [14]:
# Check missing values before handling
print("Missing values before handling:")
print(df.isnull().sum()[df.isnull().sum() > 0])

Missing values before handling:
review_text          9
review_rating        1
review_label     13035
product_tags     47782
rating_diff          1
sentiment            1
dtype: int64


In [15]:
# Fill numeric columns with median
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        df[col] = df[col].fillna(df[col].median())
        print(f"Filled {col} with median")

Filled review_rating with median
Filled rating_diff with median


In [16]:
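# Fill categorical (object-dtype) columns with mode
# Note: the category-dtype 'sentiment' column is not selected by this filter
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        mode = df[col].mode()
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else 'Unknown')
        print(f"Filled {col} with mode")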

Filled review_text with mode
Filled review_label with mode
Filled product_tags with mode


In [17]:
# Check missing values after handling
print("\nMissing values after handling:")
print(df.isnull().sum().sum())


Missing values after handling:
1
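
The one remaining missing value is in `sentiment`: it has the `category` dtype, so neither the numeric filter nor the `object` filter selects it. A minimal sketch of one way to close the gap, on illustrative ratings, is to re-derive `sentiment` after the rating has been imputed so that the two columns agree:

```python
import numpy as np
import pandas as pd

levels = ["Negative", "Neutral", "Positive"]
demo = pd.DataFrame({"review_rating": [5.0, np.nan, 2.0]})  # illustrative ratings
demo["sentiment"] = pd.cut(demo["review_rating"], bins=[0, 2, 3, 5], labels=levels)

# The category column is matched by neither dtype filter used for imputation
in_numeric = "sentiment" in demo.select_dtypes(include=[np.number]).columns
in_object = "sentiment" in demo.select_dtypes(include=["object"]).columns
print(in_numeric, in_object)

# Impute the rating, then re-derive sentiment so both columns agree
demo["review_rating"] = demo["review_rating"].fillna(demo["review_rating"].median())
demo["sentiment"] = pd.cut(demo["review_rating"], bins=[0, 2, 3, 5], labels=levels)
print(demo)
```

In the notebook itself, the one-hot `sentiment_*` columns would also need to be regenerated after this step.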


## Feature Scaling (Optional)

In [18]:
# Note: Uncomment this cell if you need scaled features for ML models

# # Select numeric columns to scale
# cols_to_scale = ['mrp', 'price', 'discount', 'discount_percentage', 
#                  'review_text_length', 'review_word_count', 'product_rating_count']

# cols_to_scale = [col for col in cols_to_scale if col in df.columns]

# if cols_to_scale:
#     scaler = StandardScaler()
#     df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
#     print(f"Scaled columns: {cols_to_scale}")

print("Scaling skipped (uncomment code above if needed)")

Scaling skipped (uncomment code above if needed)
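
For reference, `StandardScaler` with default settings subtracts each column's mean and divides by its population standard deviation. The NumPy sketch below reproduces that calculation on a few illustrative prices. When scaling for a model, the scaler would typically be fitted on the training split only.

```python
import numpy as np

prices = np.array([1599.0, 400.0, 639.0, 262.0])  # illustrative values

# Subtract the mean and divide by the population standard deviation (ddof=0)
scaled = (prices - prices.mean()) / prices.std(ddof=0)
print(scaled.round(3))
```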


## Final Preprocessed Data Overview

In [19]:
# Display final data info
print("Final Preprocessed Data Info:")
df.info()

Final Preprocessed Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61284 entries, 0 to 61283
Data columns (total 38 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   product_id              61284 non-null  int64         
 1   brand_name              61284 non-null  object        
 2   review_id               61284 non-null  int64         
 3   review_title            61284 non-null  object        
 4   review_text             61284 non-null  object        
 5   author                  61284 non-null  object        
 6   review_date             61284 non-null  datetime64[ns]
 7   review_rating           61284 non-null  float64       
 8   is_a_buyer              61284 non-null  bool          
 9   pro_user                61284 non-null  bool          
 10  review_label            61284 non-null  object        
 11  product_title           61284 non-null  object        
 12  mrp             

In [20]:
# Display summary statistics
df.describe()

Unnamed: 0,product_id,review_id,review_date,review_rating,mrp,price,product_rating,product_rating_count,review_year,review_month,...,review_quarter,discount,discount_percentage,review_text_length,review_word_count,review_title_length,rating_diff,brand_encoded,is_a_buyer_encoded,pro_user_encoded
count,61284.0,61284.0,61284,61284.0,61284.0,61284.0,61284.0,61284.0,61284.0,61284.0,...,61284.0,61284.0,61284.0,61284.0,61284.0,61284.0,61284.0,61284.0,61284.0,61284.0
mean,798380.2,14849950.0,2020-08-22 20:00:04.072498432,4.41479,573.260247,462.129512,4.09913,7582.96384,2020.124747,6.726128,...,2.586809,111.130736,17.663469,116.315237,21.612003,14.257424,0.315655,4.730027,0.786861,0.00749
min,250.0,96.0,2013-05-20 16:48:56,1.0,75.0,45.0,1.5,1.0,2013.0,1.0,...,1.0,0.0,0.0,1.0,1.0,1.0,-3.6,0.0,0.0,0.0
25%,160488.0,11023730.0,2019-10-23 01:55:38.500000,4.0,300.0,262.0,4.0,1760.0,2019.0,4.0,...,2.0,0.0,0.0,53.0,10.0,8.0,0.0,2.0,1.0,0.0
50%,452443.0,15251420.0,2020-10-20 13:07:14.500000,5.0,599.0,400.0,4.1,3925.0,2020.0,7.0,...,3.0,90.0,20.03,87.0,16.0,12.0,0.7,4.0,1.0,0.0
75%,766529.0,20029270.0,2021-08-14 19:37:17.500000,5.0,799.0,639.0,4.3,8720.0,2021.0,10.0,...,4.0,165.0,25.03,147.0,27.0,18.0,0.9,8.0,1.0,0.0
max,7749427.0,29630310.0,2022-10-22 18:12:27,5.0,3874.0,2947.0,4.8,98477.0,2022.0,12.0,...,4.0,1200.0,50.0,2555.0,437.0,201.0,2.2,10.0,1.0,1.0
std,1281418.0,7383506.0,,1.06254,324.09893,264.876964,0.235945,14463.246136,1.375558,3.366954,...,1.095502,113.211927,14.005325,101.363261,19.094298,9.413795,1.043421,2.681349,0.409528,0.086219


In [21]:
# Display sample of preprocessed data
df.head()

Unnamed: 0,product_id,brand_name,review_id,review_title,review_text,author,review_date,review_rating,is_a_buyer,pro_user,...,rating_diff,sentiment,processed_review_text,processed_review_title,brand_encoded,sentiment_Negative,sentiment_Neutral,sentiment_Positive,is_a_buyer_encoded,pro_user_encoded
0,781070,Olay,16752142,Worth buying 50g one,Works as it claims. Could see the difference f...,Ashton Dsouza,2021-01-23 15:17:18,5.0,True,False,...,0.9,Positive,work claim could see difference first day use ...,worth buying g one,10,False,False,True,1,0
1,781070,Olay,14682550,Best cream to start ur day,It does what it claims . Best thing is it smoo...,Amrit Neelam,2020-09-07 15:30:42,5.0,True,False,...,0.9,Positive,claim best thing smoothens ur skin n make soft...,best cream start ur day,10,False,False,True,1,0
2,781070,Olay,15618995,perfect for summers dry for winters,I have been using this product for months now....,Sanchi Gupta,2020-11-13 12:24:14,4.0,True,False,...,-0.1,Positive,using product month perfect combination n oily...,perfect summer dry winter,10,False,False,True,1,0
3,781070,Olay,13474509,Not a moisturizer,"i have an oily skin, while this whip acts as a...",Ruchi Shah,2020-06-14 11:56:50,3.0,True,False,...,-1.1,Neutral,oily skin whip act great base primer smoothens...,moisturizer,10,False,True,False,1,0
4,781070,Olay,16338982,Average,It's not that good. Please refresh try for oth...,Sukanya Sarkar,2020-12-22 15:24:35,2.0,True,False,...,-2.1,Negative,good please refresh try product,average,10,True,False,False,1,0


## Save Preprocessed Data

In [22]:
# Save preprocessed data to CSV
df.to_csv('preprocessed_data.csv', index=False)
print("Preprocessed data saved to: preprocessed_data.csv")
print(f"Final shape: {df.shape}")

Preprocessed data saved to: preprocessed_data.csv
Final shape: (61284, 38)
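
CSV does not store dtypes, so some of the work above is lost on reload: `review_date` comes back as text, and processed text that was cleaned down to an empty string is read back as missing. The sketch below shows one way to restore both when reading the file, using a small in-memory frame as a stand-in for `preprocessed_data.csv`:

```python
import io
import pandas as pd

# Illustrative frame; the second title was cleaned down to an empty string
demo = pd.DataFrame({
    "review_date": pd.to_datetime(["2021-01-23 15:17:18", "2020-09-07 15:30:42"]),
    "processed_review_title": ["worth buying g one", ""],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on load
restored = pd.read_csv(buffer, parse_dates=["review_date"])
missing_after_reload = int(restored["processed_review_title"].isna().sum())
print(missing_after_reload)

# Empty strings come back as NaN, so fill them again before using the text
restored["processed_review_title"] = restored["processed_review_title"].fillna("")
print(restored.dtypes)
```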


## Summary

Data preprocessing completed successfully! The dataset now includes:
- Date features (year, month, quarter, etc.)
- Price features (discount, discount percentage)
- Text features (length, word count)
- Processed text (cleaned and lemmatized)
- Encoded categorical variables
- Sentiment categories
- Missing values filled in numeric and text columns (one missing `sentiment` value remains, as the final check shows)

The data is now ready for exploratory data analysis and modeling!