# Project Part 3

## Project Goal and Significance

### Goal

The objective of Part 3 of this project is to **predict** a specific customer's product rating by leveraging **user intrinsic characteristics** (e.g., skin tone, skin type, eye color, hair color) and **product features** (e.g., product name, loves count, average rating, price). We also plan to analyze which features contribute most to product ratings, providing insight into customer preferences and product considerations.

### Significance
This project holds value in several key areas:
1. **Personalized Recommendations**: Enables tailored product suggestions by leveraging customer and product attributes, enhancing user experience.
2. **Improved Customer Satisfaction**: Delivers actionable insights to meet individual preferences more effectively.
3. **Business Insights**: Helps businesses identify features that drive satisfaction, informing product improvements and marketing strategies.
4. **Advancing E-commerce AI**: Demonstrates the power of predictive modeling in improving decision-making and engagement within e-commerce platforms.

In summary, this project not only enhances user satisfaction but also deepens our understanding of consumer behavior and product performance, bridging the gap between data-driven technology and personalized service.

## Preprocessing Data

In this step, we import and preprocess the data for modeling. This includes handling missing values, extracting relevant features, and merging the product and review datasets into a single dataframe. These steps ensure that the data is clean and ready for visualization and modeling.

### Data Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import collections

In [None]:
# Import the dataset
product_df = pd.read_csv("Sephora/product_info.csv")
print("product info dataframe shape: ", product_df.shape)
product_df.head()

In [None]:
# Reference: https://www.kaggle.com/code/themeeemul/sephora-eda-and-sentiment-analysis-using-pytorch
review_df_1 = pd.read_csv("Sephora/reviews_0-250.csv",index_col = 0, dtype={'author_id':'str'})
review_df_2 = pd.read_csv("Sephora/reviews_250-500.csv",index_col = 0, dtype={'author_id':'str'})
review_df_3 = pd.read_csv("Sephora/reviews_500-750.csv",index_col = 0, dtype={'author_id':'str'})
review_df_4 = pd.read_csv("Sephora/reviews_750-1250.csv",index_col = 0, dtype={'author_id':'str'})
review_df_5 = pd.read_csv("Sephora/reviews_1250-end.csv",index_col = 0, dtype={'author_id':'str'})

# Concatenate review_df_1 through review_df_5
review_df = pd.concat([review_df_1, review_df_2, review_df_3, review_df_4, review_df_5],axis=0)
print("review_df shape: ",review_df.shape)
review_df.head()
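
The five chunk files can also be read in a loop, which keeps the file list in one place. The following is a minimal sketch, not the notebook's own code: `load_review_chunks` is a hypothetical helper, and the demonstration writes two tiny temporary CSV files instead of reading the real Sephora data.

```python
import os
import tempfile

import pandas as pd


def load_review_chunks(paths):
    """Read each review CSV with author_id kept as a string, then stack them row-wise."""
    frames = [pd.read_csv(p, index_col=0, dtype={"author_id": "str"}) for p in paths]
    return pd.concat(frames, axis=0)


# Toy demonstration with two temporary files (not the real Sephora data).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, ids in enumerate([["001", "002"], ["003"]]):
        path = os.path.join(tmp, f"reviews_{i}.csv")
        pd.DataFrame({"author_id": ids, "rating": [5] * len(ids)}).to_csv(path)
        paths.append(path)
    demo_reviews = load_review_chunks(paths)

print(demo_reviews.shape)
```

Reading `author_id` as a string preserves leading zeros. Each file keeps its own row index after `pd.concat`, so index values can repeat across chunks; passing `ignore_index=True` is one way to renumber them if that matters downstream.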

### Data Cleaning

In [None]:
product_df['size_oz'] = product_df['size'].str.extract(r'(\d+\.?\d*)\s*oz', expand=False).astype(float)
print(product_df[['size', 'size_oz']].sample(5))

product_df['size_ml'] = product_df['size'].str.extract(r'(\d+\.?\d*)\s*mL', expand=False)
product_df['size_ml'] = product_df['size_ml'].astype(float)
print(product_df[['size', 'size_ml']].sample(5))
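
The patterns above are case-sensitive, so a size written as `50 ml` would not match `mL`. The sketch below shows the same extraction on a few made-up size strings with an inline `(?i)` flag; whether the real `size` column contains such variants is an assumption worth checking before adopting it.

```python
import pandas as pd

# Toy size strings (illustrative only, not taken from the dataset).
sizes = pd.Series(["1 oz/ 30 mL", "0.5 oz", "50 ml", "One size"])

# Same pattern shape as in the cell above, with (?i) so "ml" and "mL" both match.
size_oz = sizes.str.extract(r"(?i)(\d+\.?\d*)\s*oz", expand=False).astype(float)
size_ml = sizes.str.extract(r"(?i)(\d+\.?\d*)\s*ml", expand=False).astype(float)

print(pd.DataFrame({"size": sizes, "size_oz": size_oz, "size_ml": size_ml}))
```

Rows where the unit is absent come back as `NaN`, which is why the later cleaning step fills `size_oz` and `size_ml` with their medians.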

In [None]:
product_df['price_per_oz'] = product_df['price_usd'] / product_df['size_oz']
product_df['price_per_ml'] = product_df['price_usd'] / product_df['size_ml']
product_df[['price_usd', 'size_oz', 'price_per_oz', 'size_ml', 'price_per_ml']].sample(10)
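
One edge case in the division above: if a parsed size were ever 0, pandas would produce `inf` rather than `NaN`, and `fillna` would not catch it later. The sketch below, on made-up values, converts infinities to missing values; it is a precaution under that assumption, not something the dataset is known to need.

```python
import numpy as np
import pandas as pd

# Toy values: a normal size, a zero size, and a missing size.
toy = pd.DataFrame({"price_usd": [30.0, 12.0, 20.0], "size_oz": [1.0, 0.0, np.nan]})

# Dividing by zero yields inf; treat it as missing so a later median fill can handle it.
toy["price_per_oz"] = (toy["price_usd"] / toy["size_oz"]).replace([np.inf, -np.inf], np.nan)
print(toy)
```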

In [None]:
# Reference: https://www.kaggle.com/code/themeeemul/sephora-eda-and-sentiment-analysis-using-pytorch
# Merge product_df and review_df

# Drop columns that review_df already provides, so they are not duplicated after the merge
product_df = product_df.drop(columns=['product_name', 'brand_name', 'price_usd'])

# Identify the remaining overlapping columns (excluding 'product_id')
common_cols = set(product_df.columns).intersection(set(review_df.columns)) - {'product_id'}

# Add prefixes to overlapping columns to avoid conflicts
product_df = product_df.rename(columns={col: f"product_{col}" for col in common_cols})
review_df = review_df.rename(columns={col: f"review_{col}" for col in common_cols})

# Ensure 'product_id' remains consistent in both dataframes
if 'product_id' not in product_df.columns:
    product_df = product_df.rename(columns={"product_product_id": "product_id"})
if 'product_id' not in review_df.columns:
    review_df = review_df.rename(columns={"review_product_id": "product_id"})

# Retrieve non-overlapping columns from product_df
cols_to_use = product_df.columns.difference(review_df.columns).tolist()
cols_to_use.append('product_id')  # Ensure 'product_id' is included for merging

# Merge the two datasets
Sephora_df = pd.merge(review_df, product_df[cols_to_use], how='outer', on='product_id')
print("Sephora Shape: ", Sephora_df.shape)
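
To see what an outer merge does with products that have no reviews, the toy example below uses two small made-up frames. The `validate` and `indicator` arguments are standard `pd.merge` options, but using them here is a suggestion rather than part of the original pipeline.

```python
import pandas as pd

# Toy frames: P3 has product info but no reviews.
reviews = pd.DataFrame({"product_id": ["P1", "P1", "P2"], "review_rating": [5, 4, 3]})
products = pd.DataFrame({"product_id": ["P1", "P2", "P3"], "loves_count": [10, 20, 30]})

# validate="many_to_one" raises if product_id is duplicated on the product side;
# indicator=True adds a _merge column showing where each row came from.
merged = pd.merge(reviews, products, how="outer", on="product_id",
                  validate="many_to_one", indicator=True)
print(merged["_merge"].value_counts())
```

Rows marked `right_only` are products without any review. They carry a missing `review_rating`, which is why the notebook later drops rows where the rating is absent.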


In [None]:
# Fill missing values in the merged dataset

# 1. Handle feedback-related columns
Sephora_df = Sephora_df.assign(
    helpfulness=Sephora_df['helpfulness'].fillna(0),
    is_recommended=Sephora_df['is_recommended'].fillna(1)
)

# 2. Handle price-related columns by filling with price_usd
Sephora_df = Sephora_df.assign(
    child_max_price=Sephora_df['child_max_price'].fillna(Sephora_df['price_usd']),
    child_min_price=Sephora_df['child_min_price'].fillna(Sephora_df['price_usd']),
    sale_price_usd=Sephora_df['sale_price_usd'].fillna(Sephora_df['price_usd']),
    value_price_usd=Sephora_df['value_price_usd'].fillna(Sephora_df['price_usd'])
)

# 3. Handle variation-related columns
Sephora_df = Sephora_df.assign(
    variation_desc=Sephora_df['variation_desc'].fillna('No variation'),
    variation_type=Sephora_df['variation_type'].fillna('Unknown'),
    variation_value=Sephora_df['variation_value'].fillna('Unknown')
)

# 4. Handle product and review metadata
Sephora_df = Sephora_df.assign(
    review_text=Sephora_df['review_text'].fillna('No review provided'),
    review_title=Sephora_df['review_title'].fillna('No title'),
    tertiary_category=Sephora_df['tertiary_category'].fillna('Uncategorized')
)

# 5. Handle skin, hair, and eye attributes
Sephora_df = Sephora_df.assign(
    skin_tone=Sephora_df['skin_tone'].fillna('Not specified'),
    eye_color=Sephora_df['eye_color'].fillna('Not specified'),
    skin_type=Sephora_df['skin_type'].fillna('Not specified'),
    hair_color=Sephora_df['hair_color'].fillna('Not specified')
)

# 6. Handle ingredient and highlight columns
Sephora_df = Sephora_df.assign(
    ingredients=Sephora_df['ingredients'].fillna('Ingredients not listed'),
    highlights=Sephora_df['highlights'].fillna('No highlights available')
)

# 7. Handle size-related columns
Sephora_df = Sephora_df.assign(
    size=Sephora_df['size'].fillna('Unknown size'),
    size_ml=Sephora_df['size_ml'].fillna(Sephora_df['size_ml'].median()),
    size_oz=Sephora_df['size_oz'].fillna(Sephora_df['size_oz'].median())
)

# 8. Handle price per unit columns
Sephora_df = Sephora_df.assign(
    price_per_ml=Sephora_df['price_per_ml'].fillna(Sephora_df['price_per_ml'].median()),
    price_per_oz=Sephora_df['price_per_oz'].fillna(Sephora_df['price_per_oz'].median())
)

# 9. Print the result to confirm the missing values have been handled
missing_counts_after = Sephora_df.isna().sum()
print("Missing values after processing:")
print(missing_counts_after[missing_counts_after > 0])
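
The same fills can be expressed more compactly by passing a dictionary to `fillna`. This is a sketch on a made-up frame with only three columns; the dictionary names are hypothetical, and the fill values mirror the ones used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "skin_type": ["dry", None, "oily"],
    "review_title": [None, "Great", None],
    "size_oz": [1.0, None, 3.0],
})

# Constant fills live in one dictionary; median fills are computed from the data.
constant_fills = {"skin_type": "Not specified", "review_title": "No title"}
median_fills = {col: toy[col].median() for col in ["size_oz"]}

toy = toy.fillna({**constant_fills, **median_fills})
print(toy.isna().sum().sum())
```

Keeping the fill rules in dictionaries makes it easier to review which default each column receives.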

In [None]:
# Drop rows without review_rating information for analysis convenience
Sephora_df = Sephora_df.dropna(subset=['review_rating'])

# Print the result to confirm the missing values have been handled
missing_counts_after = Sephora_df.isna().sum()
print("Missing values after processing:")
print(missing_counts_after[missing_counts_after > 0])

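To keep the analysis focused on intrinsic user and product characteristics, we remove features that describe the review itself rather than the user or the product, such as `is_recommended` and `helpfulness`, along with the review text, submission time, and feedback counts.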

In [None]:
drop_columns = ['is_recommended', 'helpfulness', 'review_text', 'review_title', 'submission_time',
                'total_feedback_count', 'total_neg_feedback_count', 'total_pos_feedback_count']
Sephora_df = Sephora_df.drop(columns=drop_columns)
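
Re-running a drop cell in a notebook raises `KeyError`, because the columns are already gone. A small sketch on a made-up frame shows how `errors="ignore"` makes the step safe to repeat; the `leaky` list is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame({"review_rating": [5, 2], "is_recommended": [1, 0], "skin_type": ["dry", "oily"]})
leaky = ["is_recommended", "helpfulness"]  # "helpfulness" is absent from this toy frame

# errors="ignore" skips names that are not present instead of raising KeyError.
toy = toy.drop(columns=leaky, errors="ignore")
toy = toy.drop(columns=leaky, errors="ignore")  # second run is a no-op
print(list(toy.columns))
```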

In [None]:
drop_columns2 = ['size', 'size_ml']
Sephora_df = Sephora_df.drop(columns=drop_columns2)

In [None]:
print("Shape of Sephora: ", Sephora_df.shape)
Sephora_df.head()

In [None]:
Sephora_df.columns

### Data Visualization

In [None]:
target_cols = ['review_rating']
id_cols = ['author_id', 'product_id', 'brand_id']
numerical_features = ['price_usd', 'child_count', 'child_max_price', 'child_min_price', 
                    'loves_count','price_per_ml', 'price_per_oz', 'product_rating', 
                    'reviews', 'sale_price_usd', 'size_oz', 'value_price_usd']
categorical_features = ['skin_tone', 'eye_color', 'skin_type', 'hair_color', 'new', 
                        'online_only', 'out_of_stock', 'sephora_exclusive',
                        'limited_edition']
text_cols = ['product_name', 'variation_desc', 'variation_type', 'variation_value',
             'highlights', 'secondary_category', 'tertiary_category', 'ingredients',
             'brand_name']

In [None]:
# Code generation by ChatGPT

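# Update matplotlib parameters for tighter layouts.
# 'figure.constrained_layout.use' is left out: matplotlib documents it as
# incompatible with 'figure.autolayout', and the plotting functions below
# call plt.tight_layout() explicitly.
plt.rcParams.update({
    'figure.autolayout': True,
    'axes.titlesize': 14,
    'axes.labelsize': 12,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10
})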


def plot_categorical_distribution_by_target(df, column, target="review_rating", palette="Set2"):
    """
    Plot the distribution of a categorical variable, with one subplot per target value in a single row.
    
    Parameters:
    df (pd.DataFrame): The dataset.
    column (str): The categorical column to visualize.
    target (str): The target variable to group by.
    palette (str): The color palette to use.
    
    Returns:
    None
    """
    # Sort categories alphabetically so every subplot uses the same x-axis order
    sorted_categories = sorted(df[column].dropna().unique(), key=str)
    
    # One subplot per distinct target value (the original layout assumed five)
    target_values = sorted(df[target].dropna().unique())
    colors = sns.color_palette(palette, n_colors=len(target_values))
    fig, axes = plt.subplots(1, len(target_values), figsize=(18, 4), sharey=False, squeeze=False)
    axes = axes.ravel()
    
    for i, value in enumerate(target_values):
        sns.countplot(
            data=df[df[target] == value],
            x=column,
            order=sorted_categories,  # Pass the sorted order here
            color=colors[i],
            ax=axes[i]
        )
        axes[i].set_title(f"{target} = {value}")
        axes[i].set_xlabel(column)
        axes[i].set_ylabel("Count" if i == 0 else "")
        axes[i].tick_params(axis='x', rotation=45)
    
    plt.suptitle(f"Distribution of {column} by {target}")
    plt.tight_layout()
    plt.show()


def plot_numeric_distribution_by_target(df, column, target="review_rating", bins=20, kde=True, palette="Set2"):
    """
    Plot the distribution of a numeric variable, with one subplot per target value in a single row.
    
    Parameters:
    df (pd.DataFrame): The dataset.
    column (str): The numeric column to visualize.
    target (str): The target variable to group by.
    bins (int): Number of bins for the histogram.
    kde (bool): Whether to show a KDE plot.
    palette (str): The color palette for different target values.
    
    Returns:
    None
    """
    # One subplot per distinct target value (the original layout assumed five)
    target_values = sorted(df[target].dropna().unique())
    fig, axes = plt.subplots(1, len(target_values), figsize=(18, 4), sharey=False, squeeze=False)
    axes = axes.ravel()
    colors = sns.color_palette(palette, n_colors=len(target_values))
    for i, value in enumerate(target_values):
        sns.histplot(
            df[df[target] == value][column].dropna(),
            bins=bins,
            kde=kde,
            ax=axes[i],
            color=colors[i],
            alpha=0.7
        )
        axes[i].set_title(f"{target} = {value}")
        axes[i].set_xlabel(column)
        axes[i].set_ylabel("Frequency" if i == 0 else "")
    plt.suptitle(f"Distribution of {column} by {target}")
    plt.tight_layout()
    plt.show()

def plot_correlation_heatmap(df, target="review_rating"):
    """
    Plot a heatmap of pairwise correlations between numeric variables (the target is included if it is numeric).
    
    Parameters:
    df (pd.DataFrame): The dataset.
    target (str): The target variable (kept for a consistent interface; not used in the computation).
    
    Returns:
    None
    """
    # Exclude identifier columns, whose numeric values carry no meaning for correlation
    numeric_cols = df.select_dtypes(include=["float64", "int64", "bool"]).drop(
        columns=["author_id", "product_id", "brand_id"], errors="ignore")
    correlation_matrix = numeric_cols.corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation Heatmap of Numeric Features")
    plt.show()

def visualize_sephora_data_by_target(df, categorical_features, numerical_features, target="review_rating"):
    """
    Visualize the dataset by target: categorical distributions, numeric distributions, then a correlation heatmap.
    
    Parameters:
    df (pd.DataFrame): The dataset.
    categorical_features (list): Categorical column names to plot.
    numerical_features (list): Numeric column names to plot.
    target (str): The target variable.
    
    Returns:
    None
    """
    
    # Plot categorical features
    for feature in categorical_features:
        plot_categorical_distribution_by_target(df, column=feature, target=target)
    
    # Plot numeric features
    for feature in numerical_features:
        plot_numeric_distribution_by_target(df, column=feature, target=target, bins=20, kde=True, palette="Set2")
    
    # Plot correlation heatmap
    plot_correlation_heatmap(df, target=target)

In [None]:
visualize_sephora_data_by_target(Sephora_df, categorical_features, numerical_features, target="review_rating")
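
The count plots can be complemented by a table of proportions, which is easier to compare across rating levels with very different sample sizes. A minimal sketch on made-up data using `pd.crosstab`:

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    "skin_type": ["dry", "dry", "oily", "oily", "oily", "combination"],
    "review_rating": [5, 4, 5, 5, 1, 4],
})

# Share of each skin_type within every rating level (each column sums to 1).
shares = pd.crosstab(toy["skin_type"], toy["review_rating"], normalize="columns")
print(shares.round(2))
```

With `normalize="columns"`, each rating level is scaled separately, so a category that dominates low ratings stands out even when low ratings are rare overall.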

### Data Transformation

#### Numerical Features - Log Transformation

In [None]:
# Apply Log Transformation to several numeric features and visualize the changes
# Use ChatGPT to facilitate data transformation

def apply_log_transformation(df, features):
    """
    Apply log transformation to specified numeric features in the DataFrame.
    
    Parameters:
    df (pd.DataFrame): Input DataFrame.
    features (list): List of column names to log transform.
    
    Returns:
    pd.DataFrame: DataFrame with log-transformed features.
    """
    df_transformed = df.copy()
    for feature in features:
        if feature in df_transformed.columns:
            # Apply log1p to handle zeros
            df_transformed[feature] = np.log1p(df_transformed[feature])
        else:
            print(f"Feature '{feature}' not found in DataFrame.")
    return df_transformed

def apply_sqrt_transformation(df, features):
    """
    Apply square root transformation to specified numeric features in the DataFrame.
    
    Parameters:
    df (pd.DataFrame): Input DataFrame.
    features (list): List of column names to apply square root transformation.
    
    Returns:
    pd.DataFrame: DataFrame with square root-transformed features.
    """
    df_transformed = df.copy()
    for feature in features:
        if feature in df_transformed.columns:
            # Apply sqrt transformation, ensure non-negative values
            df_transformed[feature] = np.sqrt(df_transformed[feature].clip(lower=0))
        else:
            print(f"Feature '{feature}' not found in DataFrame.")
    return df_transformed


def visualize_features(df, features, bins=20):
    """
    Visualize the distributions of specified features in the DataFrame.
    
    Parameters:
    df (pd.DataFrame): Input DataFrame.
    features (list): List of column names to visualize.
    bins (int): Number of bins for the histograms.
    
    Returns:
    None
    """
    print("Visualizing features...")
    print(features + ["review_rating"])
    visualize_sephora_data_by_target(df[features + ["review_rating"]], categorical_features=[],
                                     numerical_features=features, target="review_rating")

In [None]:
# Apply log transformation

features_to_transform = ['child_max_price', 'child_min_price', 'product_rating', 
                    'reviews', 'loves_count', 'price_usd', 'price_per_ml',
                    'price_per_oz', 'sale_price_usd', 'value_price_usd']

Sephora_df = apply_log_transformation(Sephora_df, features_to_transform)

# Visualize transformed features
print("Log Transformed Features:")
visualize_features(Sephora_df, features_to_transform)
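
A small numeric illustration of what `np.log1p` does to a right-skewed feature, using made-up values rather than the real `loves_count` column. It also shows that `np.expm1` inverts the transform, which is useful if predictions or summaries need to be reported on the original scale.

```python
import numpy as np
import pandas as pd

loves = pd.Series([0, 10, 100, 1_000, 100_000], name="loves_count")  # toy values

logged = np.log1p(loves)     # log(1 + x), defined at x = 0
restored = np.expm1(logged)  # inverse of log1p

print(f"skew before: {loves.skew():.2f}, after: {logged.skew():.2f}")
```

Because the cell above overwrites `Sephora_df` in place, running it a second time would apply the log transform twice. Restarting from the cleaned dataframe before re-running avoids that.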

#### Categorical Features - One-Hot Encoding

In [None]:
def one_hot_encode_features(df, categorical_features):
    """
    Perform one-hot encoding for a list of categorical features in a DataFrame.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    categorical_features (list): A list of categorical column names to one-hot encode.
    
    Returns:
    pd.DataFrame: A DataFrame with the specified categorical features one-hot encoded.
    """
    # Perform one-hot encoding for the categorical features
    df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)
    return df_encoded

In [None]:
categorical_features = ['skin_tone', 'eye_color', 'skin_type', 'hair_color', 'new', 
                        'online_only', 'out_of_stock', 'sephora_exclusive',
                        'limited_edition']
original_shape = Sephora_df.shape  # Record the shape before encoding
Sephora_df[categorical_features] = Sephora_df[categorical_features].astype('category')
Sephora_df = one_hot_encode_features(Sephora_df, categorical_features)

# Check the result
print("Original DataFrame shape:", original_shape)
print("Encoded DataFrame shape:", Sephora_df.shape)
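
To make the effect of `drop_first=True` concrete, here is the same `pd.get_dummies` call on a made-up frame with a single three-level category:

```python
import pandas as pd

toy = pd.DataFrame({
    "skin_type": ["dry", "oily", "combination", "dry"],
    "price_usd": [10.0, 20.0, 15.0, 30.0],
})

# Three categories become two indicator columns; the first level is the baseline.
encoded = pd.get_dummies(toy, columns=["skin_type"], drop_first=True)
print(toy.shape, "->", encoded.shape)
print(list(encoded.columns))
```

The first category in sorted order (`combination` here) is dropped and becomes the implicit reference level, so a row with all indicator columns equal to zero belongs to it.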

#### Text Features - TF-IDF Encoding

In [None]:
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_svd_transform(df, text_columns, n_components=100, max_features=1000):
    """
    Perform TF-IDF transformation on specified text columns and apply SVD for dimensionality reduction.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    text_columns (list): List of column names containing text data.
    n_components (int): Number of components for SVD.
    max_features (int): Maximum number of features for TF-IDF vectorization.

    Returns:
    pd.DataFrame: A DataFrame with reduced dimensions after SVD, indexed like the input.
    """
    tfidf_vectorizers = {}
    tfidf_results = []
    
    # Perform TF-IDF on each text column
    for col in text_columns:
        vectorizer = TfidfVectorizer(max_features=max_features, stop_words='english')
        tfidf_matrix = vectorizer.fit_transform(df[col].fillna(""))  # Handle NaN by filling with empty strings
        tfidf_vectorizers[col] = vectorizer
        tfidf_results.append(tfidf_matrix)

    # Concatenate all TF-IDF matrices, keeping them sparse to avoid building a large dense array
    combined_tfidf = hstack(tfidf_results).tocsr()

    # Apply SVD for dimensionality reduction (TruncatedSVD accepts sparse input)
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    reduced_matrix = svd.fit_transform(combined_tfidf)

    # Return as a DataFrame aligned to the input index, so it can be joined back to df
    svd_columns = [f"svd_component_{i+1}" for i in range(n_components)]
    return pd.DataFrame(reduced_matrix, columns=svd_columns, index=df.index)

In [None]:
text_cols = ['product_name', 'variation_desc', 'variation_type', 'variation_value',
             'highlights', 'secondary_category', 'tertiary_category', 'ingredients',
             'brand_name']
reduced_df = tfidf_svd_transform(Sephora_df, text_cols, n_components=50, max_features=500)

# Check the reduced DataFrame
print(reduced_df.shape)
print(reduced_df.head())
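
The TF-IDF plus SVD step can be checked on a tiny made-up corpus before running it on the full dataset. The sketch below stacks per-column TF-IDF matrices with `scipy.sparse.hstack` and reports the share of variance the retained components explain; the toy texts and the choice of two components are illustrative assumptions.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy text columns (illustrative only).
toy = pd.DataFrame({
    "product_name": ["rose face cream", "vitamin c serum", "rose lip balm", "hydrating face serum"],
    "highlights": ["vegan, hydrating", "brightening", "vegan", "hydrating, fragrance free"],
})

# One TF-IDF matrix per text column, stacked side by side without densifying.
blocks = [TfidfVectorizer().fit_transform(toy[col]) for col in toy.columns]
combined = hstack(blocks).tocsr()

svd = TruncatedSVD(n_components=2, random_state=42)
reduced = pd.DataFrame(svd.fit_transform(combined), index=toy.index,
                       columns=["svd_component_1", "svd_component_2"])
print(reduced.shape, svd.explained_variance_ratio_.sum())
```

Inspecting `explained_variance_ratio_` on the real data is one way to judge whether 50 components retain enough of the text signal, or whether `n_components` should be adjusted.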

## Classification Analysis