# Module 2: Feature Engineering - Notebook 1: Handling Categorical Data and Customer Interaction Journey

**Objectives:**
- Load persisted e-commerce data (interactions, customers, products).
- Handle categorical features using `StringIndexer` and `OneHotEncoder`.
- Engineer features based on customer-product interaction journeys (counts, time, purchase history, engagement).
- Create the binary target variable `has_purchased`.
- Combine features and save the resulting DataFrame for the next notebook.

In [1]:
# Setup: Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType, LongType
from pyspark.sql.window import Window
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline
import os

# Create a SparkSession
spark = SparkSession.builder \
    .appName("MonApplication") \
    .getOrCreate()

## 1. Load Data

Load the three datasets (customers, products, interactions) from their Parquet files.

In [3]:
# Load tables

# Define paths for each file in the Workspace storage
customers_path = "data/ecommerce_customers.parquet"
products_path = "data/ecommerce_products.parquet"
interactions_path = "data/ecommerce_interactions.parquet"

# Load the parquet files
customers_df = spark.read.parquet(customers_path)
products_df = spark.read.parquet(products_path)
interactions_df = spark.read.parquet(interactions_path)

customers_df.show(5)
products_df.show(5)
interactions_df.show(5)

+-----------+---+------+-------+-----------+----------------+
|customer_id|age|gender|country|tenure_days|membership_level|
+-----------+---+------+-------+-----------+----------------+
|          1| 40|     M|     US|        936|        Platinum|
|          2| 33|     M|     UK|        192|          Silver|
|          3| 42|     M|     MX|        160|          Bronze|
|          4| 53|     M|     AU|        823|          Bronze|
|          5| 32|     F|     US|        513|          Bronze|
+-----------+---+------+-------+-----------+----------------+
only showing top 5 rows

+----------+--------+-----+----------+
|product_id|category|price|avg_rating|
+----------+--------+-----+----------+
|         1|Clothing|33.64|       3.8|
|         2|Clothing|24.05|       3.5|
|         3|  Beauty|58.67|       4.1|
|         4|Clothing|11.15|       4.3|
|         5|  Beauty|47.74|       4.5|
+----------+--------+-----+----------+
only showing top 5 rows

+-----------+----------+-----------------

## 2. Feature Engineering: Customer-Product Journey Aggregation

We need to aggregate the interaction data to the (`customer_id`, `product_id`) level. This step creates:
1.  The **target variable** `has_purchased`.
2.  **Journey features**: interaction counts, time-based features, purchase history aggregates, and engagement metrics.

Customer and product attributes are needed as well, so we join the three tables first and then aggregate.
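
To make the target definition concrete, here is a plain-Python sketch (no Spark required) that collapses a small hand-made event list to one record per (`customer_id`, `product_id`) pair. The event tuples and the helper name `aggregate_pairs` are illustrative assumptions, not part of the notebook's data or code.

```python
from collections import defaultdict

# Hypothetical events: (customer_id, product_id, interaction_type, purchase_amount)
events = [
    (1, 10, "view", None),
    (1, 10, "add_to_cart", None),
    (1, 10, "purchase", 40.0),
    (1, 20, "view", None),
    (2, 10, "view", None),
]


def aggregate_pairs(rows):
    """Collapse events to one record per (customer_id, product_id) pair."""
    stats = defaultdict(
        lambda: {"has_purchased": 0, "purchase_count": 0, "total_purchase_amount": 0.0}
    )
    for customer_id, product_id, interaction_type, amount in rows:
        pair = stats[(customer_id, product_id)]  # creates the pair on first sight
        if interaction_type == "purchase":
            pair["has_purchased"] = 1
            pair["purchase_count"] += 1
            pair["total_purchase_amount"] += amount or 0.0
    return dict(stats)


pair_stats = aggregate_pairs(events)
for key in sorted(pair_stats):
    print(key, pair_stats[key])
```

Every pair that appears in the events gets a record, and `has_purchased` is 1 only when at least one of its events is a purchase.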

In [5]:
# Rename potentially conflicting columns before join
customers_renamed_df = customers_df.withColumnRenamed("customer_id", "cust_id")
products_renamed_df = products_df.withColumnRenamed("product_id", "prod_id").withColumnRenamed("avg_rating", "product_avg_rating")

# Join the three tables
# Use left joins starting from interactions to keep all interaction records
# Then join with customers and products
joined_df = interactions_df \
    .join(customers_renamed_df, interactions_df["customer_id"] == customers_renamed_df["cust_id"], "left") \
    .join(products_renamed_df, interactions_df["product_id"] == products_renamed_df["prod_id"], "left") \
    .select(
        "customer_id", "product_id", "timestamp", "interaction_type", "time_spent_seconds",
        "purchase_amount", "user_rating", "device", "previous_visits", # From interactions
        "age", "gender", "country", "tenure_days", "membership_level", # From customers
        "category", "price", "product_avg_rating" # From products
    )

print("Joined DataFrame Schema:")
joined_df.printSchema()
print("\nJoined DataFrame Sample:")
joined_df.show(5)

Joined DataFrame Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- timestamp: timestamp_ntz (nullable = true)
 |-- interaction_type: string (nullable = true)
 |-- time_spent_seconds: long (nullable = true)
 |-- purchase_amount: double (nullable = true)
 |-- user_rating: double (nullable = true)
 |-- device: string (nullable = true)
 |-- previous_visits: long (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- country: string (nullable = true)
 |-- tenure_days: integer (nullable = true)
 |-- membership_level: string (nullable = true)
 |-- category: string (nullable = true)
 |-- price: double (nullable = true)
 |-- product_avg_rating: double (nullable = true)


Joined DataFrame Sample:
+-----------+----------+--------------------+----------------+------------------+---------------+-----------+-------+---------------+---+------+-------+-----------+----------------+-----------+------+----------

### Aggregation Logic
Group by `customer_id` and `product_id` and calculate aggregates.
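
The aggregation cell also restricts the feature inputs to interactions that happened strictly before the first purchase of each pair. The plain-Python sketch below shows that rule on a hypothetical event list for a single pair; the helper name `pre_purchase_events` and the timestamps are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical events for one (customer_id, product_id) pair: (timestamp, interaction_type)
events = [
    (datetime(2025, 2, 10, 9, 0), "view"),
    (datetime(2025, 2, 11, 18, 30), "add_to_cart"),
    (datetime(2025, 2, 12, 19, 42), "purchase"),
    (datetime(2025, 2, 13, 8, 15), "review"),
]


def pre_purchase_events(rows):
    """Keep events strictly before the first purchase; keep all if there is none."""
    purchase_times = [ts for ts, kind in rows if kind == "purchase"]
    if not purchase_times:
        return list(rows)
    first_purchase_ts = min(purchase_times)
    return [(ts, kind) for ts, kind in rows if ts < first_purchase_ts]


kept = pre_purchase_events(events)
print([kind for _, kind in kept])  # ['view', 'add_to_cart']

# A pair whose only event is a purchase keeps nothing
only_purchase = pre_purchase_events([(datetime(2025, 2, 12, 19, 42), "purchase")])
print(only_purchase)  # []
```

The second call shows an edge case worth keeping in mind: a pair whose earliest event is a purchase contributes no rows to the feature aggregation.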

In [7]:
# Step 1: Calculate the TARGET and overall purchase stats from ALL interactions per group,
# and find the first purchase timestamp.
window_spec_cp = Window.partitionBy("customer_id", "product_id")

# Apply window function first to get first_purchase_ts accessible for grouping key calculation
joined_with_purchase_ts = joined_df.withColumn("first_purchase_ts", F.min(F.when(F.col("interaction_type") == "purchase", F.col("timestamp"))).over(window_spec_cp))

# Now group and aggregate overall stats
overall_stats_agg = joined_with_purchase_ts.groupBy("customer_id", "product_id").agg(
    F.max(F.when(F.col("interaction_type") == "purchase", 1).otherwise(0)).alias("has_purchased"),
    F.sum(F.when(F.col("interaction_type") == "purchase", F.col("purchase_amount")).otherwise(0)).alias("total_purchase_amount"),
    F.sum(F.when(F.col("interaction_type") == "purchase", 1).otherwise(0)).alias("purchase_count"),
    # Also get the first purchase timestamp for filtering later
    F.first("first_purchase_ts", ignorenulls=True).alias("first_purchase_ts")
)
# Calculate overall average purchase amount (handle division by zero)
overall_stats_agg = overall_stats_agg.withColumn(
    "avg_purchase_amount",
     F.when(F.col("purchase_count") > 0, F.col("total_purchase_amount") / F.col("purchase_count"))
     .otherwise(0.0)
)

print("Overall Stats Aggregation Schema:")
overall_stats_agg.printSchema()
overall_stats_agg.show(5)


# Step 2: Filter interactions to only those BEFORE the first purchase (if any)
# No extra join is needed: joined_with_purchase_ts already carries first_purchase_ts
# on every row from the window function above. Pairs with no purchase keep all their rows.
interactions_for_features = joined_with_purchase_ts.where(
    (F.col("timestamp") < F.col("first_purchase_ts")) | F.col("first_purchase_ts").isNull()
)
print("Sample Interactions Considered for Feature Calculation:")
interactions_for_features.select("customer_id", "product_id", "timestamp", "interaction_type", "first_purchase_ts").show(5)


# Step 3: Aggregate FEATURES based ONLY on the filtered pre-purchase (or all if no purchase) interactions
# Exclude the overall purchase aggregates from this list
feature_agg_expressions = [
    # Interaction Counts (pre-purchase)
    F.sum(F.when(F.col("interaction_type") == "view", 1).otherwise(0)).alias("view_count"),
    F.sum(F.when(F.col("interaction_type") == "add_to_cart", 1).otherwise(0)).alias("add_to_cart_count"),
    F.sum(F.when(F.col("interaction_type") == "review", 1).otherwise(0)).alias("review_count"),
    F.count("*").alias("total_interactions"), # Now counts pre-purchase interactions

    # Time-based Features (pre-purchase)
    F.sum("time_spent_seconds").alias("total_time_spent"),
    (F.unix_timestamp(F.max("timestamp")) - F.unix_timestamp(F.min("timestamp"))).alias("interaction_time_span_seconds"),
    F.min("timestamp").alias("first_interaction_time"), # Based on pre-purchase interactions

    # Engagement Metrics (pre-purchase reviews)
    F.avg(F.when(F.col("interaction_type") == "review", F.col("user_rating"))).alias("avg_user_rating_non_null"),

    # Collect Original Features
    F.first("age", ignorenulls=True).alias("age"),
    F.first("gender", ignorenulls=True).alias("gender"),
    F.first("country", ignorenulls=True).alias("country"),
    F.first("tenure_days", ignorenulls=True).alias("tenure_days"),
    F.first("membership_level", ignorenulls=True).alias("membership_level"),
    F.first("category", ignorenulls=True).alias("category"),
    F.first("price", ignorenulls=True).alias("price"),
    F.first("product_avg_rating", ignorenulls=True).alias("product_avg_rating"),
    F.first("device", ignorenulls=True).alias("device"),
    F.first("previous_visits", ignorenulls=True).alias("previous_visits")
]

# Perform aggregation on filtered data
pre_purchase_features_agg = interactions_for_features.groupBy("customer_id", "product_id").agg(*feature_agg_expressions)

print("Aggregated Pre-Purchase Features DataFrame Schema:")
pre_purchase_features_agg.printSchema()

Overall Stats Aggregation Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- has_purchased: integer (nullable = true)
 |-- total_purchase_amount: double (nullable = true)
 |-- purchase_count: long (nullable = true)
 |-- first_purchase_ts: timestamp_ntz (nullable = true)
 |-- avg_purchase_amount: double (nullable = true)

+-----------+----------+-------------+---------------------+--------------+--------------------+-------------------+
|customer_id|product_id|has_purchased|total_purchase_amount|purchase_count|   first_purchase_ts|avg_purchase_amount|
+-----------+----------+-------------+---------------------+--------------+--------------------+-------------------+
|          1|       154|            1|                41.26|             1|2025-02-12 19:42:...|              41.26|
|          1|       290|            1|                398.2|             1|2025-02-23 13:49:...|              398.2|
|          1|       551|            0|   

### Post-Processing Aggregated Features
Calculate derived features, handle imputations, and finalize columns based on the aggregation results.
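
Two of the derivations in the next cell are easy to check by hand: converting a time span to fractional days, and the fallback chain used to impute `avg_user_rating` (the pair's own review average, then the product's average rating, then the constant 3.0). The sketch below reproduces both in plain Python; the helper names `coalesce` and `span_days` and the sample values are assumptions for illustration.

```python
from datetime import datetime

SECONDS_PER_DAY = 60 * 60 * 24


def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def span_days(first_ts, last_ts):
    """Elapsed time between two timestamps in fractional days."""
    return (last_ts - first_ts).total_seconds() / SECONDS_PER_DAY


# 36 hours between first and last interaction
first_seen = datetime(2025, 2, 10, 9, 0)
last_seen = datetime(2025, 2, 11, 21, 0)
print(span_days(first_seen, last_seen))  # 1.5

# Imputation order: own review average, product average, default 3.0
print(coalesce(2.0, 4.1, 3.0))    # 2.0
print(coalesce(None, 4.1, 3.0))   # 4.1
print(coalesce(None, None, 3.0))  # 3.0
```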

In [8]:
# Step 4: Post-processing Pre-Purchase Features
# Fill nulls in the numeric aggregates (e.g., total_time_spent is null when every
# time_spent_seconds value in the group is null)
customer_product_features_temp = pre_purchase_features_agg.fillna(0, subset=[
    "view_count", "add_to_cart_count", "review_count", "total_interactions",
    "total_time_spent", "interaction_time_span_seconds"
])


# Calculate interaction_time_span_days
customer_product_features_temp = customer_product_features_temp.withColumn(
    "interaction_time_span_days",
    F.when(F.col("interaction_time_span_seconds") > 0,
           F.col("interaction_time_span_seconds") / (60 * 60 * 24))
    .otherwise(0.0)
).drop("interaction_time_span_seconds")

# Calculate days_since_first_interaction (relative to max timestamp overall)
max_timestamp_overall = joined_df.select(F.max("timestamp")).first()[0]
customer_product_features_temp = customer_product_features_temp.withColumn(
    "days_since_first_interaction",
    F.when(F.col("first_interaction_time").isNotNull(),
        (F.unix_timestamp(F.lit(max_timestamp_overall)) - F.unix_timestamp(F.col("first_interaction_time"))) / (60*60*24)
    ).otherwise(0.0) 
)

# Impute avg_user_rating (pre-purchase reviews)
customer_product_features_temp = customer_product_features_temp.withColumn(
    "avg_user_rating",
    F.coalesce(F.col("avg_user_rating_non_null"), F.col("product_avg_rating"), F.lit(3.0))
).drop("avg_user_rating_non_null")

# Drop intermediate time columns if they exist
if "first_interaction_time" in customer_product_features_temp.columns:
    customer_product_features_temp = customer_product_features_temp.drop("first_interaction_time")


# Step 5: Join the pre-purchase features with the overall target variables and purchase stats
final_customer_product_features = customer_product_features_temp.join(
    overall_stats_agg.select(
        "customer_id", "product_id", "has_purchased",
        "total_purchase_amount", "avg_purchase_amount", "purchase_count"
    ),
    ["customer_id", "product_id"],
    # Inner join keeps only pairs present in both inputs. Pairs whose earliest
    # interaction is already a purchase have no pre-purchase rows, so they are dropped here.
    "inner"
)

# Assign to the variable name expected by the next cell (cell 9)
customer_product_features = final_customer_product_features

print("Final Combined DataFrame Schema:")
customer_product_features.printSchema()
print("Final Combined DataFrame Sample:")
customer_product_features.show(5)

Final Combined DataFrame Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- view_count: long (nullable = true)
 |-- add_to_cart_count: long (nullable = true)
 |-- review_count: long (nullable = true)
 |-- total_interactions: long (nullable = false)
 |-- total_time_spent: long (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- country: string (nullable = true)
 |-- tenure_days: integer (nullable = true)
 |-- membership_level: string (nullable = true)
 |-- category: string (nullable = true)
 |-- price: double (nullable = true)
 |-- product_avg_rating: double (nullable = true)
 |-- device: string (nullable = true)
 |-- previous_visits: long (nullable = true)
 |-- interaction_time_span_days: double (nullable = true)
 |-- days_since_first_interaction: double (nullable = true)
 |-- avg_user_rating: double (nullable = false)
 |-- has_purchased: integer (nullable = true)
 |-- total_purchase_amount:

## 3. Handling Categorical Features

Now apply `StringIndexer` and `OneHotEncoder` to the categorical columns collected during aggregation (`gender`, `country`, `membership_level`, `category`, `device`).
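
As a rough mental model of the two stages, the plain-Python sketch below mimics the documented defaults: `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category so that it is encoded as an all-zero vector. The helper names `fit_string_index` and `one_hot` and the sample labels are assumptions; the sketch returns dense lists rather than Spark's sparse vectors and ignores the extra bucket that `handleInvalid="keep"` adds for unseen labels.

```python
from collections import Counter


def fit_string_index(values):
    """Map labels to indices by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Hypothetical membership_level column
levels = ["Bronze", "Bronze", "Silver", "Platinum", "Bronze", "Silver"]
mapping = fit_string_index(levels)
print(mapping)  # {'Bronze': 0, 'Silver': 1, 'Platinum': 2}

for label in ["Bronze", "Silver", "Platinum"]:
    print(label, one_hot(mapping[label], len(mapping)))
# Bronze [1.0, 0.0]
# Silver [0.0, 1.0]
# Platinum [0.0, 0.0]
```

Dropping one category avoids a redundant column, since the dropped category is implied when all other slots are zero.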

In [9]:
# Identify categorical columns to encode
# Ensure these columns exist and handle potential nulls from joins if necessary
categorical_cols = ["gender", "country", "membership_level", "category", "device"]

# Fill nulls in categorical columns before indexing (important!)
# Use a placeholder like 'Unknown' or 'Missing'
fill_values_cat = {col: "Unknown" for col in categorical_cols}
customer_product_features_filled = customer_product_features.fillna(fill_values_cat)

# Create list of indexer and encoder stages
indexers = [StringIndexer(inputCol=col, outputCol=col + "_index", handleInvalid="keep") for col in categorical_cols]
encoders = [OneHotEncoder(inputCol=col + "_index", outputCol=col + "_vec") for col in categorical_cols]

# Create pipeline
pipeline_stages = indexers + encoders
categorical_pipeline = Pipeline(stages=pipeline_stages)

# Fit and transform the data
feature_pipeline_model = categorical_pipeline.fit(customer_product_features_filled)
features_encoded_df = feature_pipeline_model.transform(customer_product_features_filled)

print("Schema after Categorical Encoding:")
features_encoded_df.printSchema()

# Show some encoded columns
print("\nSample Data after Encoding:")
features_encoded_df.select("customer_id", "product_id", "gender", "gender_index", "gender_vec", "category", "category_index", "category_vec").show(5)

Schema after Categorical Encoding:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- view_count: long (nullable = true)
 |-- add_to_cart_count: long (nullable = true)
 |-- review_count: long (nullable = true)
 |-- total_interactions: long (nullable = false)
 |-- total_time_spent: long (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = false)
 |-- country: string (nullable = false)
 |-- tenure_days: integer (nullable = true)
 |-- membership_level: string (nullable = false)
 |-- category: string (nullable = false)
 |-- price: double (nullable = true)
 |-- product_avg_rating: double (nullable = true)
 |-- device: string (nullable = false)
 |-- previous_visits: long (nullable = true)
 |-- interaction_time_span_days: double (nullable = true)
 |-- days_since_first_interaction: double (nullable = true)
 |-- avg_user_rating: double (nullable = false)
 |-- has_purchased: integer (nullable = true)
 |-- total_purchase_

## 4. Final Feature Selection and Schema Alignment

Select the final columns: the customer/product identifiers, the encoded categorical vectors, the original numerical features and the engineered journey features (both still to be scaled in the next notebook), and the target variable.
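
The cell below keeps some purchase aggregates while flagging them as leakage risks. Because these aggregates are computed from all interactions, including the purchase itself, `purchase_count` is greater than zero exactly when `has_purchased` is 1. The plain-Python sketch below illustrates why that matters; the rows, the threshold of 3 views, and the helper names are hypothetical.

```python
# Hypothetical aggregated rows: (purchase_count, view_count, has_purchased)
rows = [
    (1, 3, 1),
    (0, 5, 0),
    (2, 1, 1),
    (0, 2, 0),
    (0, 7, 0),
]


def rule_accuracy(data, predict):
    """Share of rows where the rule's prediction equals the label."""
    hits = sum(1 for *features, label in data if predict(*features) == label)
    return hits / len(data)


def leaky_rule(purchase_count, view_count):
    """Predict from a feature that is derived from the purchase itself."""
    return int(purchase_count > 0)


def pre_purchase_rule(purchase_count, view_count):
    """Predict from a pre-purchase feature only."""
    return int(view_count >= 3)


print(rule_accuracy(rows, leaky_rule))         # 1.0
print(rule_accuracy(rows, pre_purchase_rule))  # 0.4
```

A trivial rule on the leaky feature reproduces the label perfectly, which is why such columns are usually excluded before training a classifier.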

In [10]:
# Define final columns
final_columns_desired = [
    # Identifiers
    "customer_id", "product_id",
    # Encoded Categorical Vectors
    "gender_vec", "country_vec", "membership_level_vec", "category_vec", "device_vec",
    # Numerical - Original (To be scaled in the next notebook)
    "age", "tenure_days", "price", "product_avg_rating", "previous_visits",
    # Numerical - Journey Features (To be scaled in the next notebook)
    "view_count", "add_to_cart_count", "review_count",
    "purchase_count", # Leakage risk for classifier (> 0 exactly when has_purchased == 1), keep for now
    "total_interactions",
    "total_time_spent", "interaction_time_span_days", "days_since_first_interaction",
    "total_purchase_amount", # Leakage risk for classifier, keep for now
    "avg_purchase_amount",   # Leakage risk for classifier, keep for now
    "avg_user_rating",
    # Target
    "has_purchased"
]

final_df = features_encoded_df.select(final_columns_desired)

print("\nFinal DataFrame Schema:")
final_df.printSchema()
print("\nFinal DataFrame Sample:")
final_df.show(5)


Final DataFrame Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- gender_vec: vector (nullable = true)
 |-- country_vec: vector (nullable = true)
 |-- membership_level_vec: vector (nullable = true)
 |-- category_vec: vector (nullable = true)
 |-- device_vec: vector (nullable = true)
 |-- age: integer (nullable = true)
 |-- tenure_days: integer (nullable = true)
 |-- price: double (nullable = true)
 |-- product_avg_rating: double (nullable = true)
 |-- previous_visits: long (nullable = true)
 |-- view_count: long (nullable = true)
 |-- add_to_cart_count: long (nullable = true)
 |-- review_count: long (nullable = true)
 |-- purchase_count: long (nullable = true)
 |-- total_interactions: long (nullable = false)
 |-- total_time_spent: long (nullable = true)
 |-- interaction_time_span_days: double (nullable = true)
 |-- days_since_first_interaction: double (nullable = true)
 |-- total_purchase_amount: double (nullable = true)
 |-- avg_pur

## 5. Save Output

Persist the final DataFrame (`final_df`) for use in the next notebook (Module 2, Notebook 2). We will save it as a managed table.

In [16]:
# Define output table name
output_table_name = "ecommerce.customer_product_features"

# Save as Managed Table (Overwrite if exists)
try:
    print(f"\nSaving final DataFrame to managed table: {output_table_name}")
    final_df.write.mode("overwrite").saveAsTable(output_table_name)
    print(f"Table '{output_table_name}' saved successfully.")

    # Verify save by reading back a sample
    print("\nVerifying saved table:")
    spark.table(output_table_name).show(5)

except Exception as e:
    print(f"Error saving table {output_table_name}: {e}")



Saving final DataFrame to managed table: ecommerce.customer_product_features
Table 'ecommerce.customer_product_features' saved successfully.

Verifying saved table:
+-----------+----------+-------------+--------------+--------------------+--------------+-------------+---+-----------+------+------------------+---------------+----------+-----------------+------------+--------------+------------------+----------------+--------------------------+----------------------------+---------------------+-------------------+---------------+-------------+
|customer_id|product_id|   gender_vec|   country_vec|membership_level_vec|  category_vec|   device_vec|age|tenure_days| price|product_avg_rating|previous_visits|view_count|add_to_cart_count|review_count|purchase_count|total_interactions|total_time_spent|interaction_time_span_days|days_since_first_interaction|total_purchase_amount|avg_purchase_amount|avg_user_rating|has_purchased|
+-----------+----------+-------------+--------------+---------------

## 6. Conclusion

This notebook successfully loaded the base data, engineered customer-product journey features including the `has_purchased` target variable, handled categorical variables using `StringIndexer` and `OneHotEncoder`, and aligned the output schema.

The resulting `ecommerce.customer_product_features` table is now ready for the next stage in Module 2 (Notebook 2), which will focus on scaling the numerical features.