# Process of Selecting Reviews for Sentiment Analysis

## Objective
Sample 400 reviews per product from the Sephora cosmetics dataset and split them into sentence-level data for text analytics.

## Output Files
1. `sampled_reviews_by_product_400_*.xlsx` - Review-level data (16,800 reviews)
2. `sampled_400_sentence_level_*.xlsx` - Sentence-level data (~70,000 sentences)

## Sampling Strategy
- **Target**: 400 reviews per product, 42 products total
- **Categories**: 3 (Cleansers, Moisturizers, Treatments)
- **Brands**: Top 3 brands per category (9 brands total)
- **Products**: Up to 5 top products per brand, each with at least 400 reviews available
- **Balance**: Star ratings, recency, review length, quality
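
The selection hierarchy above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual selection code: it assumes that "top" means "most reviews", and the column names `category` and `brand_name` as well as the helper `select_products` are hypothetical.

```python
import pandas as pd

def select_products(reviews, n_brands=3, n_products=5, min_reviews=400):
    """Pick top brands per category, then top eligible products per brand.

    Assumption: "top" is ranked by review count; the source does not
    state the ranking criterion.
    """
    # Review counts per (category, brand, product)
    counts = (
        reviews.groupby(["category", "brand_name", "product_id"])
        .size()
        .reset_index(name="n_reviews")
    )
    # Only products with enough reviews can supply a full sample
    eligible = counts[counts["n_reviews"] >= min_reviews]

    # Rank brands within each category by total reviews of eligible products
    brand_totals = (
        eligible.groupby(["category", "brand_name"])["n_reviews"]
        .sum()
        .reset_index(name="brand_reviews")
        .sort_values(["category", "brand_reviews"], ascending=[True, False])
    )
    top_brands = brand_totals.groupby("category").head(n_brands)

    # Within each selected brand, keep the most-reviewed eligible products
    selected = (
        eligible.merge(top_brands[["category", "brand_name"]],
                       on=["category", "brand_name"])
        .sort_values(["category", "brand_name", "n_reviews"],
                     ascending=[True, True, False])
        .groupby(["category", "brand_name"])
        .head(n_products)
        .reset_index(drop=True)
    )
    return selected

# Toy example: one row per review
toy = pd.DataFrame({
    "category": ["Cleansers"] * 6,
    "brand_name": ["A", "A", "A", "B", "B", "C"],
    "product_id": ["P1", "P1", "P1", "P2", "P2", "P3"],
})
picked_one = select_products(toy, n_brands=1, n_products=1, min_reviews=2)
picked_two = select_products(toy, n_brands=2, n_products=5, min_reviews=2)
```

Under this sketch, a brand with fewer than `n_products` eligible products simply contributes fewer, which is one way a total below 9 × 5 = 45 products could arise.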


## Configuration


In [None]:
# Configuration Parameters
import pandas as pd
import numpy as np
import glob
import os
import re
import time
from hashlib import sha1

# Set random seed for reproducibility
np.random.seed(42)

# Paths (machine-specific; adjust both to your local environment)
DATA_PATH = "Group project/Dataset for Group project/"
OUTPUT_DIR = "/Users/dinghongyan/Downloads/Text Analytic"

# Sampling Configuration
REVIEWS_PER_PRODUCT = 400
REVIEWS_PER_STAR = 40
MIN_REVIEW_CHARS = 20
RECENT_PERCENTILE = 0.80
LONG_REVIEW_PERCENTILE = 0.75

# Target Categories
TARGET_CATEGORIES = ["Moisturizers", "Treatments", "Cleansers"]
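
One way the thresholds above could be turned into per-review flags is sketched below. It assumes that `RECENT_PERCENTILE` and `LONG_REVIEW_PERCENTILE` are quantile cut-offs (a review at or above the 80th percentile of submission time counts as recent, and one at or above the 75th percentile of character length counts as long) and that the relevant columns are named `submission_time` and `review_text`. The configuration itself confirms none of this, and `add_balance_flags` is a hypothetical helper.

```python
import pandas as pd

MIN_REVIEW_CHARS = 20
RECENT_PERCENTILE = 0.80
LONG_REVIEW_PERCENTILE = 0.75

def add_balance_flags(df):
    """Drop very short reviews and flag recent and long ones (assumed semantics)."""
    out = df.copy()
    out["submission_time"] = pd.to_datetime(out["submission_time"])
    out["n_chars"] = out["review_text"].fillna("").str.len()

    # Reviews shorter than the minimum length are excluded from sampling
    out = out[out["n_chars"] >= MIN_REVIEW_CHARS].copy()

    # Quantile cut-offs computed on the remaining reviews
    recent_cut = out["submission_time"].quantile(RECENT_PERCENTILE)
    long_cut = out["n_chars"].quantile(LONG_REVIEW_PERCENTILE)

    out["is_recent"] = out["submission_time"] >= recent_cut
    out["is_long"] = out["n_chars"] >= long_cut
    return out

# Toy example: the first review is too short and is dropped
toy = pd.DataFrame({
    "submission_time": ["2020-01-01", "2021-01-01", "2022-01-01",
                        "2023-01-01", "2024-01-01"],
    "review_text": ["ok", "a" * 20, "b" * 30, "c" * 40, "d" * 100],
})
flagged = add_balance_flags(toy)
```

Flags like these could then serve as strata when drawing the per-product sample, although the exact balancing rule depends on the sampling step itself.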


## 1. Data Loading and Initial Inspection


In [None]:
# Load product information
product_file = os.path.join(DATA_PATH, "product_info.csv")
df_product = pd.read_csv(product_file)

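# Load all review files (fail early if the glob pattern matches nothing,
# since pd.concat raises an unhelpful error on an empty list)
review_files = sorted(glob.glob(os.path.join(DATA_PATH, "reviews_*.csv")))
if not review_files:
    raise FileNotFoundError(f"No reviews_*.csv files found in {DATA_PATH!r}")
df_reviews = pd.concat([pd.read_csv(f) for f in review_files], ignore_index=True)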

print(f"Total reviews loaded: {len(df_reviews):,}")
print(f"Total products: {df_product['product_id'].nunique():,}")


In [None]:
# Check for missing values
missing_summary = df_reviews.isnull().sum()
missing_report = pd.DataFrame({
    "Missing Count": missing_summary,
    "Missing Ratio (%)": (missing_summary / len(df_reviews) * 100).round(2)
})

print("=== Missing Value Report ===")
print(missing_report)

missing_cols = missing_report[missing_report["Missing Count"] > 0]
if not missing_cols.empty:
    print("\n⚠️ Columns with missing values:")
    print(missing_cols)
else:
    print("\n✅ No missing values detected.")


## 2. Merge Product Categories


In [None]:
# Columns to merge from product_info
TARGET_COLS = [
    "primary_category", "secondary_category", "tertiary_category",
    "variation_type", "variation_value", "variation_desc"
]

# Check that product_id is unique; duplicates would multiply review rows in the merge
dup_in_product = df_product["product_id"].duplicated(keep=False)
if dup_in_product.any():
    print(f"⚠️ Warning: {dup_in_product.sum()} rows in product_info share a duplicated product_id")
else:
    print("✅ product_id is unique in product_info")

# Select columns for merging, keeping one row per product_id
# so that the left merge preserves the review row count
cols_to_add = [c for c in TARGET_COLS if c in df_product.columns]
product_info_sub = (
    df_product[["product_id"] + cols_to_add]
    .drop_duplicates(subset="product_id")
    .copy()
)

# Merge with reviews
df_reviews = df_reviews.reset_index(drop=True).copy()
df_merged = df_reviews.merge(product_info_sub, on="product_id", how="left")
df_merged["review_seq_id"] = (df_merged.index + 1).astype("int64")

print("\n✅ Merge completed!")
print(f"Rows before merge: {len(df_reviews):,}")
print(f"Rows after merge: {len(df_merged):,}")
print(f"Columns added: {['review_seq_id'] + cols_to_add}")
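
As a complement to comparing row counts by hand, pandas can enforce the many-to-one assumption and report unmatched reviews directly. The sketch below uses toy data; `checked_left_merge` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def checked_left_merge(reviews, products):
    """Left-merge product info onto reviews with built-in sanity checks."""
    # validate="many_to_one" raises MergeError if product_id repeats in `products`;
    # indicator=True adds a "_merge" column showing where each row came from
    return reviews.merge(
        products,
        on="product_id",
        how="left",
        validate="many_to_one",
        indicator=True,
    )

reviews = pd.DataFrame({
    "product_id": ["P1", "P1", "P2", "P9"],
    "rating": [5, 4, 3, 1],
})
products = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "primary_category": ["Skincare", "Skincare"],
})

merged = checked_left_merge(reviews, products)

# Reviews whose product_id has no entry in the product table
unmatched = merged[merged["_merge"] == "left_only"]

# A product table with repeated product_ids is rejected instead of
# silently duplicating review rows
try:
    checked_left_merge(reviews, pd.concat([products, products]))
    duplicates_rejected = False
except pd.errors.MergeError:
    duplicates_rejected = True
```

With `validate` in place the merge cannot change the number of review rows, and `unmatched` lists the reviews that will have missing category values.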


## 3. Exploratory Analysis - Category Distribution
