# Module 2 Notebook 2: Handling Numerical Data

**Objective:** Process the numerical features from the previous notebook's output (`ecommerce.customer_product_features`) using scaling techniques and prepare two distinct feature sets: one for classification and one for regression.

**Input:** `ecommerce.customer_product_features` table.
**Outputs:**
1. `ecommerce.classification_scaled_features`: Data ready for classification feature selection.
2. `ecommerce.regression_scaled_features`: Data ready for regression feature selection.

## 1. Setup & Load Data

Load the features table created in the previous notebook.

In [3]:

    from pyspark.sql import SparkSession

    # Create a SparkSession
    spark = SparkSession.builder \
        .appName("MonApplication") \
        .getOrCreate()

    # Create database if it doesn't exist
    spark.sql("CREATE DATABASE IF NOT EXISTS ecommerce")
    spark.sql("USE ecommerce")

    print("Using 'ecommerce' database")

Using 'ecommerce' database


In [7]:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import MinMaxScaler, VectorAssembler
    from pyspark.sql.functions import col, sum
    import sys
    import os
    


    # Load the data from the previous notebook
    input_table = "ecommerce.customer_product_features"
    features_df = spark.read.parquet("spark-warehouse/ecommerce.db/customer_product_features")
    print(f"Successfully loaded data from {input_table}")
    features_df.show(5)


Successfully loaded data from ecommerce.customer_product_features
+-----------+----------+-------------+--------------+--------------------+--------------+-------------+---+-----------+------+------------------+---------------+----------+-----------------+------------+--------------+------------------+----------------+--------------------------+----------------------------+---------------------+-------------------+---------------+-------------+
|customer_id|product_id|   gender_vec|   country_vec|membership_level_vec|  category_vec|   device_vec|age|tenure_days| price|product_avg_rating|previous_visits|view_count|add_to_cart_count|review_count|purchase_count|total_interactions|total_time_spent|interaction_time_span_days|days_since_first_interaction|total_purchase_amount|avg_purchase_amount|avg_user_rating|has_purchased|
+-----------+----------+-------------+--------------+--------------------+--------------+-------------+---+-----------+------+------------------+---------------+-------

## 2. Identify Feature Columns

We need to identify which columns are our already encoded categorical features (vectors ending in `_vec`) and which are the numerical features that require scaling.

In [8]:

    # Get column names
    all_cols = features_df.columns

    # Identify categorical vector columns (ending with '_vec')
    categorical_vec_cols = [c for c in all_cols if c.endswith('_vec')]
    print(f"Categorical Vector Columns: {categorical_vec_cols}")

    # Identify numerical columns for scaling
    # Exclude IDs, the target 'has_purchased', and leakage risks for classification
    numerical_cols_to_scale = [
        'age', 'tenure_days', 'price', 'product_avg_rating', 'previous_visits',
        'view_count', 'add_to_cart_count', 'review_count',
        'total_interactions', 'total_time_spent',
        'interaction_time_span_days', 'days_since_first_interaction',
        'avg_user_rating'
        # Excluded: 'purchase_count', 'total_purchase_amount', 'avg_purchase_amount' (leakage risk for classifier)
    ]
    print(f"Numerical Columns to Scale: {numerical_cols_to_scale}")

    # Identify target columns
    classification_target_col = "has_purchased"
    regression_target_col = "total_purchase_amount"
    print(f"Classification Target: {classification_target_col}")
    print(f"Regression Target: {regression_target_col}")

    # Columns to exclude from feature vector assembly (IDs, targets, potential leakage for specific tasks)
    exclude_cols = ['customer_id', 'product_id', classification_target_col, regression_target_col, 'avg_purchase_amount', 'purchase_count']
    print(f"Columns generally excluded from direct feature assembly: {exclude_cols}")


Categorical Vector Columns: ['gender_vec', 'country_vec', 'membership_level_vec', 'category_vec', 'device_vec']
Numerical Columns to Scale: ['age', 'tenure_days', 'price', 'product_avg_rating', 'previous_visits', 'view_count', 'add_to_cart_count', 'review_count', 'total_interactions', 'total_time_spent', 'interaction_time_span_days', 'days_since_first_interaction', 'avg_user_rating']
Classification Target: has_purchased
Regression Target: total_purchase_amount
Columns generally excluded from direct feature assembly: ['customer_id', 'product_id', 'has_purchased', 'total_purchase_amount', 'avg_purchase_amount', 'purchase_count']


## 3. Numerical Feature Scaling

Scaling numerical features matters because many ML algorithms are sensitive to the magnitude of feature values. Logistic Regression, SVMs, and PCA can perform poorly or converge slowly when features sit on vastly different scales. Standardization (`StandardScaler`), which rescales data to zero mean and unit variance, is one common technique. This notebook instead uses min-max scaling (`MinMaxScaler`), which maps each feature to the range [0, 1] by default.
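To make the two rescalings concrete, here is a minimal plain-Python sketch (not Spark code) applied to a single made-up column. The helper names `min_max_scale` and `standardize` and the `prices` values are illustrative assumptions. The constant-column rule and the use of the sample standard deviation are meant to mirror the documented behaviour of Spark's `MinMaxScaler` and `StandardScaler`, but treat the sketch as an illustration of the formulas rather than a reimplementation of either class.

```python
from statistics import mean, stdev


def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so the smallest maps to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column has no spread, so map it to the midpoint of the range.
        return [0.5 * (lo + hi) for _ in values]
    return [(v - v_min) / (v_max - v_min) * (hi - lo) + lo for v in values]


def standardize(values):
    """Center values on their mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


prices = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical values for one feature
scaled = min_max_scale(prices)
zscores = standardize(prices)

print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in zscores])
```

Min-max scaling keeps every value inside a fixed interval but is sensitive to outliers, since a single extreme value compresses the rest of the column. Standardization has no fixed bounds but is less dominated by one extreme point.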

We will:
1. Assemble all numerical features needing scaling into a single vector.
2. Apply `MinMaxScaler` to this vector.
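The scaling cell assumes the previous notebook already handled nulls. One way to check that assumption before assembling the vector is to count nulls per column, sketched below. The helpers `null_count_exprs` and `count_nulls` are hypothetical names introduced here; they rely only on `DataFrame.selectExpr`, `first()`, and `Row.asDict()`. Note that this counts SQL nulls only, not `NaN` values in double columns.

```python
def null_count_exprs(columns):
    """Build one SQL aggregate expression per column that counts its nulls."""
    return [
        f"SUM(CASE WHEN `{c}` IS NULL THEN 1 ELSE 0 END) AS `{c}`"
        for c in columns
    ]


def count_nulls(df, columns):
    """Return {column: null count} for a Spark DataFrame `df`."""
    row = df.selectExpr(*null_count_exprs(columns)).first()
    return row.asDict()


# Usage inside the notebook, where features_df and numerical_cols_to_scale exist:
# null_counts = count_nulls(features_df, numerical_cols_to_scale)
# print({c: n for c, n in null_counts.items() if n})  # empty dict means no nulls
```

If any count is non-zero, imputing or dropping those rows before scaling is likely safer than relying on `handleInvalid="keep"`.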

In [10]:

    # 1. Assemble numerical features into an intermediate vector
    num_vector_assembler = VectorAssembler(
        inputCols=numerical_cols_to_scale,
        outputCol="unscaled_numerical_features",
        handleInvalid="keep" # 'keep' retains rows with nulls (as NaN in the vector), which may make MinMaxScaler output unreliable.
                           # 'skip' would drop rows with nulls in these columns.
                           # Assuming M2N1 handled nulls, 'keep' is okay.
                           # If nulls are possible, use 'skip' or impute first.
    )
    df_with_unscaled_vec = num_vector_assembler.transform(features_df)

    # 2. Apply MinMaxScaler
    scaler = MinMaxScaler(
        inputCol="unscaled_numerical_features",
        outputCol="scaled_numerical_features"
        # min=0.0, max=1.0 are the defaults
    )
    scaler_model = scaler.fit(df_with_unscaled_vec)
    scaled_df = scaler_model.transform(df_with_unscaled_vec)

    print("Schema after Scaling:")
    scaled_df.printSchema()
    print("Sample Data with Scaled Numerical Features:")
    scaled_df.select('customer_id', 'product_id', 'scaled_numerical_features').show(5)


Schema after Scaling:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- gender_vec: vector (nullable = true)
 |-- country_vec: vector (nullable = true)
 |-- membership_level_vec: vector (nullable = true)
 |-- category_vec: vector (nullable = true)
 |-- device_vec: vector (nullable = true)
 |-- age: integer (nullable = true)
 |-- tenure_days: integer (nullable = true)
 |-- price: double (nullable = true)
 |-- product_avg_rating: double (nullable = true)
 |-- previous_visits: long (nullable = true)
 |-- view_count: long (nullable = true)
 |-- add_to_cart_count: long (nullable = true)
 |-- review_count: long (nullable = true)
 |-- purchase_count: long (nullable = true)
 |-- total_interactions: long (nullable = true)
 |-- total_time_spent: long (nullable = true)
 |-- interaction_time_span_days: double (nullable = true)
 |-- days_since_first_interaction: double (nullable = true)
 |-- total_purchase_amount: double (nullable = true)
 |-- avg_purchas

## 4. Prepare Classification Dataset

For the classification task (predicting `has_purchased`), we need to assemble the final feature vector. This vector will include:
- The `scaled_numerical_features` vector.
- All the one-hot encoded categorical vectors (`_vec` columns).

**Important:** We must **exclude** features that would cause target leakage. The columns `purchase_count`, `total_purchase_amount`, and `avg_purchase_amount` are potential leakage risks for predicting `has_purchased`. These should **not** be included in the input features for the classification model, even in their scaled form (they were excluded from `numerical_cols_to_scale` earlier).
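A small guard can make this exclusion explicit rather than relying on the column list staying correct by hand. The sketch below is plain Python; `LEAKY_FOR_CLASSIFICATION` and `find_leaky_columns` are hypothetical names introduced for illustration, and the column list is copied from the earlier cell.

```python
# Columns the text identifies as leakage risks for predicting has_purchased.
LEAKY_FOR_CLASSIFICATION = {
    "purchase_count",
    "total_purchase_amount",
    "avg_purchase_amount",
}


def find_leaky_columns(feature_cols, leaky_cols=LEAKY_FOR_CLASSIFICATION):
    """Return the sorted list of feature columns that are known leakage risks."""
    return sorted(set(feature_cols) & set(leaky_cols))


# Same list as numerical_cols_to_scale defined in Section 2.
numerical_cols_to_scale = [
    "age", "tenure_days", "price", "product_avg_rating", "previous_visits",
    "view_count", "add_to_cart_count", "review_count",
    "total_interactions", "total_time_spent",
    "interaction_time_span_days", "days_since_first_interaction",
    "avg_user_rating",
]

leaks = find_leaky_columns(numerical_cols_to_scale)
if leaks:
    raise ValueError(f"Leaky columns found in feature list: {leaks}")
print("No leaky columns in the scaled feature list.")
```

Running a check like this before assembling the `features` vector turns an accidental edit of the column list into an immediate error instead of an inflated accuracy score later.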

In [11]:

    # Define columns for the classification feature vector
    # Includes scaled numerical vector and all categorical vectors
    classification_feature_cols = ["scaled_numerical_features"] + categorical_vec_cols
    print(f"Columns for Classification Features: {classification_feature_cols}")

    # Assemble the final classification feature vector
    classification_assembler = VectorAssembler(
        inputCols=classification_feature_cols,
        outputCol="features",
        handleInvalid="keep" # Consistent handling with scaler input
    )
    classification_prepared_df = classification_assembler.transform(scaled_df)

    # Select final columns for the classification dataset
    classification_final_df = classification_prepared_df.select(
        "customer_id",
        "product_id",
        "features",
        classification_target_col # Target variable 'has_purchased'
    )

    print("Classification Dataset Schema:")
    classification_final_df.printSchema()
    print("Sample Classification Data:")
    classification_final_df.show(5)


Columns for Classification Features: ['scaled_numerical_features', 'gender_vec', 'country_vec', 'membership_level_vec', 'category_vec', 'device_vec']
Classification Dataset Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- has_purchased: integer (nullable = true)

Sample Classification Data:
+-----------+----------+--------------------+-------------+
|customer_id|product_id|            features|has_purchased|
+-----------+----------+--------------------+-------------+
|          2|       485|(43,[0,1,2,3,6,7,...|            0|
|          4|       312|(43,[0,1,2,3,4,6,...|            0|
|          4|       616|(43,[0,1,2,3,4,7,...|            1|
|          5|       825|(43,[0,1,2,3,4,6,...|            1|
|          6|       143|(43,[0,1,2,3,4,7,...|            0|
+-----------+----------+--------------------+-------------+
only showing top 5 rows



## 5. Prepare Regression Dataset

For the regression task (predicting `total_purchase_amount`), we first need to filter the data to include only instances where a purchase actually occurred (`has_purchased == 1`).

The feature vector for regression will include:
- The `scaled_numerical_features` vector.
- All the one-hot encoded categorical vectors (`_vec` columns).

We must exclude the target variable (`total_purchase_amount`) and features directly derived from it (like `avg_purchase_amount`) from the input `features` vector.

In [12]:

    # 1. Filter data for instances where a purchase occurred
    regression_filtered_df = scaled_df.filter(col(classification_target_col) == 1)
    print(f"Number of rows for regression (purchases only): {regression_filtered_df.count()}")

    # 2. Define columns for the regression feature vector
    # These are the *same* inputs as classification in this setup, as potentially leaky features
    # were already excluded during scaling definition or are targets themselves.
    regression_feature_cols = ["scaled_numerical_features"] + categorical_vec_cols
    print(f"Columns for Regression Features: {regression_feature_cols}") # Same as classification

    # 3. Assemble the final regression feature vector
    # We can reuse the classification_assembler definition if inputs are identical,
    # but defining it again makes the step clearer.
    regression_assembler = VectorAssembler(
        inputCols=regression_feature_cols,
        outputCol="features",
        handleInvalid="keep" # Consistent handling
    )
    regression_prepared_df = regression_assembler.transform(regression_filtered_df)

    # 4. Select final columns for the regression dataset
    regression_final_df = regression_prepared_df.select(
        "customer_id",
        "product_id",
        "features",
        regression_target_col # Target variable 'total_purchase_amount'
    )

    print("Regression Dataset Schema:")
    regression_final_df.printSchema()
    print("Sample Regression Data:")
    regression_final_df.show(5)


Number of rows for regression (purchases only): 25238
Columns for Regression Features: ['scaled_numerical_features', 'gender_vec', 'country_vec', 'membership_level_vec', 'category_vec', 'device_vec']
Regression Dataset Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- total_purchase_amount: double (nullable = true)

Sample Regression Data:
+-----------+----------+--------------------+---------------------+
|customer_id|product_id|            features|total_purchase_amount|
+-----------+----------+--------------------+---------------------+
|          4|       616|(43,[0,1,2,3,4,7,...|               281.39|
|          5|       825|(43,[0,1,2,3,4,6,...|               232.21|
|          7|       584|(43,[0,1,2,3,4,7,...|               656.01|
|          7|       615|(43,[0,1,2,3,4,7,...|               205.55|
|          9|       340|(43,[0,1,2,3,4,7,...|               496.87|
+-----------+---------

## 6. Save Outputs

Save the prepared classification and regression datasets to new tables for use in the next notebook (Module 2, Notebook 3: PCA and Feature Selection).

In [14]:

    # Define output table names
    classification_output_table = "ecommerce.classification_scaled_features"
    regression_output_table = "ecommerce.regression_scaled_features"

    # Save classification data
    print(f"Saving classification data to {classification_output_table}...")
    classification_final_df.write \
        .mode("overwrite") \
        .format("parquet") \
        .saveAsTable(classification_output_table)
    print("Classification data saved.")

    # Save regression data
    print(f"Saving regression data to {regression_output_table}...")
    regression_final_df.write \
        .mode("overwrite") \
        .format("parquet") \
        .saveAsTable(regression_output_table)
    print("Regression data saved.")


Saving classification data to ecommerce.classification_scaled_features...
Classification data saved.
Saving regression data to ecommerce.regression_scaled_features...
Regression data saved.
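Once the tables are written, a quick read-back can confirm that they exist and carry the expected columns. The following is a minimal sketch: `check_saved_table` is a hypothetical helper, and it assumes the active `SparkSession` is passed in and that `SparkSession.table` can resolve the table name.

```python
def check_saved_table(spark, table_name, expected_cols):
    """Read a saved table back and report its row count and any missing columns."""
    df = spark.table(table_name)
    missing = [c for c in expected_cols if c not in df.columns]
    return df.count(), missing


# Usage inside the notebook, where `spark` is the active SparkSession:
# rows, missing = check_saved_table(
#     spark,
#     "ecommerce.classification_scaled_features",
#     ["customer_id", "product_id", "features", "has_purchased"],
# )
# print(rows, missing)  # an empty `missing` list means the schema matches
```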


## 7. Summary & Next Steps

In this notebook, we:
1. Loaded the feature-engineered data from M2N1.
2. Identified numerical columns requiring scaling and categorical vector columns.
3. Applied `MinMaxScaler` to the numerical features, assuming nulls in those columns were already handled in M2N1.
4. Assembled distinct feature vectors for classification and regression tasks, carefully considering potential target leakage for the classification model.
5. Saved the two prepared datasets (`classification_scaled_features` and `regression_scaled_features`) for the next stage.

**Next:** In Module 2, Notebook 3, we will explore feature selection techniques (like Chi-Square selection) and Principal Component Analysis (PCA) using these scaled datasets to potentially reduce dimensionality and improve model performance and interpretability.