# 00 – Data Preprocessing

In this notebook, I’m preparing and cleaning the raw Sephora skincare reviews dataset.  
This is the foundation of my machine learning project.

I will:
- Merge all raw review files
- Clean missing and irrelevant data
- Engineer new features like review length, word count, and sentiment score
- Add a sentiment label based on ratings
- Save a clean, ready-to-use dataset for modeling

I also keep an eye on memory use (files are read one at a time), reproducibility (fixed random seeds), and versioning (timestamped output files).


In [12]:
# 📦 Import libraries
import os
import pandas as pd
import numpy as np
import glob
import warnings
import random
from textblob import TextBlob
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# 🧬 Reproducibility
random.seed(42)
np.random.seed(42)


In [13]:
# 📁 Define data paths
RAW_DATA_PATH = "../data/raw/"
INTERIM_DATA_PATH = "../data/interim/"
PROCESSED_DATA_PATH = "../data/processed/"

# Create folders if missing
os.makedirs(INTERIM_DATA_PATH, exist_ok=True)
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)


## 🧩 Merge all review files

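I read each review CSV file one at a time, drop rows that lack review text or a rating, and add features inside the loop, so incomplete rows are discarded before the slower sentiment scoring runs.  
The features are review length (characters), word count, and a basic sentiment score from TextBlob. The per-file DataFrames are then concatenated into one.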
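
To make the length and word-count features concrete, here is a minimal sketch on two made-up reviews (they are not from the dataset). It uses pandas' vectorized string methods rather than `apply`; that is an alternative I'm assuming gives the same counts for non-missing text, not what the cell below does. TextBlob's polarity score is left out of the sketch; it is a float between -1 (negative) and 1 (positive).

```python
import pandas as pd

# Toy reviews, invented for illustration only
toy = pd.DataFrame({
    "review_text": [
        "Love this cleanser",
        "Too greasy for me, would not buy again",
    ]
})

# Character count per review
toy["review_length"] = toy["review_text"].str.len()

# Whitespace-delimited word count per review
toy["word_count"] = toy["review_text"].str.split().str.len()

print(toy[["review_length", "word_count"]])
```

Punctuation stays attached to its word (`me,` counts as one word), so this is a rough count rather than a linguistic tokenization.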


In [14]:
# Find all review CSV files
review_files = sorted([
    os.path.join(RAW_DATA_PATH, file)
    for file in os.listdir(RAW_DATA_PATH)
    if file.startswith("reviews_") and file.endswith(".csv")
])

# Process each file individually
review_dfs = []

for file in review_files:
    try:
        df = pd.read_csv(file)
        df.dropna(subset=['review_text', 'rating'], inplace=True)
        df['review_length'] = df['review_text'].astype(str).apply(len)
        df['word_count'] = df['review_text'].apply(lambda x: len(str(x).split()))
        df['sentiment_score'] = df['review_text'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
        review_dfs.append(df)
    except Exception as e:
        print(f"❌ Error processing {file}: {e}")

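# Merge all DataFrames (pd.concat raises ValueError on an empty list,
# so fail early with a clearer message if nothing was loaded)
if not review_dfs:
    raise FileNotFoundError(f"No review files were loaded from {RAW_DATA_PATH}")
all_reviews = pd.concat(review_dfs, ignore_index=True)
print("✅ Merged review shape:", all_reviews.shape)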


✅ Merged review shape: (1092967, 22)
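
The loop above still keeps every per-file DataFrame in memory until the final `pd.concat`. If memory becomes a problem, one option is to read in chunks and ask for smaller numeric dtypes up front. The sketch below is only an illustration: the inline CSV, the chunk size, and the dtype choices are my assumptions, not settings used in this notebook.

```python
import io
import pandas as pd

# Stand-in for one raw file; the real files live under RAW_DATA_PATH
csv_text = (
    "review_text,rating,price_usd\n"
    "Great,5,19.0\n"
    ",4,24.0\n"
    "Meh,3,24.0\n"
    "Bad,1,\n"
)

# Hypothetical dtype choices: narrower numeric types use less memory.
# int8 only works if the column has no missing values; otherwise use
# the nullable "Int8" dtype or cast after dropping missing rows.
dtypes = {"rating": "int8", "price_usd": "float32"}

parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2):
    # Drop incomplete rows per chunk so they never accumulate
    chunk = chunk.dropna(subset=["review_text", "rating"])
    parts.append(chunk)

small = pd.concat(parts, ignore_index=True)
print(small.dtypes)
print(small.memory_usage(deep=True).sum(), "bytes")
```

On the real files, passing `usecols=` to `pd.read_csv` would also avoid loading columns that are dropped later anyway.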


## Clean review data

I now drop columns I don't need for modeling, one-hot encode `skin_type` (filling missing values with `unknown`), and derive a sentiment label from the rating: 4 or 5 is positive, 1 or 2 is negative.  
For simplicity, I drop ‘neutral’ reviews (rating = 3) and treat the task as binary classification.


In [15]:
# Drop unused or sparse columns
all_reviews.drop(columns=[
    'Unnamed: 0', 'review_title', 'hair_color',
    'eye_color', 'skin_tone', 'submission_time'
], inplace=True, errors='ignore')

# Handle missing skin_type and one-hot encode
all_reviews['skin_type'] = all_reviews['skin_type'].fillna('unknown')
skin_dummies = pd.get_dummies(all_reviews['skin_type'], prefix='skin')
all_reviews = pd.concat([all_reviews, skin_dummies], axis=1)
all_reviews.drop(columns=['skin_type'], inplace=True)
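
As a small illustration of what the fill-then-encode step produces, here is the same pattern on a made-up `skin_type` column (the values are invented for the example):

```python
import pandas as pd

# Toy column with one missing value
toy = pd.DataFrame({"skin_type": ["dry", None, "oily", "dry"]})

# Missing values become their own category instead of being lost
toy["skin_type"] = toy["skin_type"].fillna("unknown")

# One indicator column per category, named skin_<category>
dummies = pd.get_dummies(toy["skin_type"], prefix="skin")

print(dummies.columns.tolist())
print(dummies)
```

Each row has exactly one indicator set, which is why `skin_unknown` shows up as its own column in the cleaned dataset.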


In [16]:
# Create sentiment labels from ratings
def classify_sentiment(rating):
    if rating >= 4:
        return 'positive'
    elif rating <= 2:
        return 'negative'
    else:
        return 'neutral'

all_reviews['sentiment'] = all_reviews['rating'].apply(classify_sentiment)

# Drop 'neutral' ratings
all_reviews = all_reviews[all_reviews['sentiment'] != 'neutral'].reset_index(drop=True)

print("✅ Cleaned and labeled review dataset shape:", all_reviews.shape)
display(all_reviews.head())


✅ Cleaned and labeled review dataset shape: (1011215, 21)
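
Comparing the two printed shapes, dropping the 3-star reviews removes 81,752 rows (1,092,967 minus 1,011,215). Before modeling it is worth checking how the two remaining classes are balanced. The sketch below shows one way to do the labeling without a row-wise `apply` and then inspect the class shares; the ratings are made up, and `np.select` is my substitution for `classify_sentiment`, with the same thresholds.

```python
import numpy as np
import pandas as pd

# Toy ratings, invented for illustration
ratings = pd.Series([5, 4, 3, 2, 1, 5])

# Same thresholds as classify_sentiment: >= 4 positive, <= 2 negative
labels = np.select(
    [ratings >= 4, ratings <= 2],
    ["positive", "negative"],
    default="neutral",
)

toy = pd.DataFrame({"rating": ratings, "sentiment": labels})

# Drop neutral rows, then look at the share of each class
toy = toy[toy["sentiment"] != "neutral"].reset_index(drop=True)
balance = toy["sentiment"].value_counts(normalize=True)
print(balance)
```

Running `all_reviews['sentiment'].value_counts(normalize=True)` on the real data would show whether the classes are skewed enough to need stratified splits or class weights.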


| | author_id | rating | is_recommended | helpfulness | total_feedback_count | total_neg_feedback_count | total_pos_feedback_count | review_text | product_id | product_name | brand_name | price_usd | review_length | word_count | sentiment_score | skin_combination | skin_dry | skin_normal | skin_oily | skin_unknown | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1741593524 | 5 | 1.0 | 1.0 | 2 | 0 | 2 | I use this with the Nudestix “Citrus Clean Bal... | P504322 | Gentle Hydra-Gel Face Cleanser | NUDESTIX | 19.0 | 455 | 79 | 0.283333 | False | True | False | False | False | positive |
| 1 | 31423088263 | 1 | 0.0 | | 0 | 0 | 0 | I bought this lip mask after reading the revie... | P420652 | Lip Sleeping Mask Intense Hydration with Vitam... | LANEIGE | 24.0 | 162 | 28 | 0.0 | False | False | False | False | True | negative |
| 2 | 5061282401 | 5 | 1.0 | | 0 | 0 | 0 | My review title says it all! I get so excited ... | P420652 | Lip Sleeping Mask Intense Hydration with Vitam... | LANEIGE | 24.0 | 272 | 53 | 0.102778 | False | True | False | False | False | positive |
| 3 | 6083038851 | 5 | 1.0 | | 0 | 0 | 0 | I’ve always loved this formula for a long time... | P420652 | Lip Sleeping Mask Intense Hydration with Vitam... | LANEIGE | 24.0 | 230 | 45 | 0.38125 | True | False | False | False | False | positive |
| 4 | 47056667835 | 5 | 1.0 | | 0 | 0 | 0 | If you have dry cracked lips, this is a must h... | P420652 | Lip Sleeping Mask Intense Hydration with Vitam... | LANEIGE | 24.0 | 213 | 46 | -0.127381 | True | False | False | False | False | positive |


## 💾 Save processed dataset

I save the cleaned dataset to `../data/processed/`, with a timestamp in the filename so that each run writes a new file instead of overwriting an earlier version.


In [17]:
# Create timestamped filename
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
filename = f"clean_reviews_v1_{timestamp}.csv"

# Save to processed folder
try:
    all_reviews.to_csv(os.path.join(PROCESSED_DATA_PATH, filename), index=False)
    print(f"✅ Saved final dataset as: {filename}")
except Exception as e:
    print("❌ Error saving file:", e)


✅ Saved final dataset as: clean_reviews_v1_20250803_1522.csv
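
A timestamp tells me when a file was written, but not what is in it. One possible extension, sketched below, is to write a small JSON manifest next to the CSV with the row count, column names, and a SHA-256 hash of the file. The helper name `save_with_manifest` and the manifest fields are hypothetical; nothing like this exists in the notebook yet.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime

import pandas as pd


def save_with_manifest(df, out_dir, stem="clean_reviews_v1"):
    """Write df to a timestamped CSV plus a JSON manifest (hypothetical helper)."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    csv_path = os.path.join(out_dir, f"{stem}_{timestamp}.csv")
    df.to_csv(csv_path, index=False)

    # Hash the file in 1 MiB blocks so large CSVs are not read in one go
    sha = hashlib.sha256()
    with open(csv_path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            sha.update(block)

    manifest = {
        "file": os.path.basename(csv_path),
        "rows": int(len(df)),
        "columns": list(df.columns),
        "sha256": sha.hexdigest(),
    }
    manifest_path = os.path.splitext(csv_path)[0] + ".json"
    with open(manifest_path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    return csv_path, manifest


# Toy usage in a temporary folder, so nothing touches ../data/processed/
toy = pd.DataFrame({"rating": [5, 1], "sentiment": ["positive", "negative"]})
with tempfile.TemporaryDirectory() as tmp:
    path, manifest = save_with_manifest(toy, tmp)
    reloaded = pd.read_csv(path)

print(manifest["rows"], manifest["columns"])
```

A later notebook could recompute the hash before loading and stop if it does not match the manifest.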
