# Module 2 Notebook 3: PCA and Feature Selection

**Objective:** Demonstrate PCA and Chi-Square feature selection on the scaled datasets, select the final feature set with Chi-Square (a course design decision that favors interpretability), and split the data into training and testing sets.

**Inputs:**
1. `ecommerce.classification_scaled_features` (from M2N2 with MinMaxScaler)
2. `ecommerce.regression_scaled_features` (from M2N2 with MinMaxScaler)

**Outputs:**
1. `ecommerce.classification_train_features`
2. `ecommerce.classification_test_features`
3. `ecommerce.regression_train_features`
4. `ecommerce.regression_test_features`


## 1. Setup & Load Data

Load the scaled feature tables from the previous notebook.


In [4]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA, ChiSqSelector, VectorSlicer, VectorAssembler
from pyspark.sql.functions import col
import sys
import os

# Create a SparkSession
spark = SparkSession.builder \
    .appName("MonApplication") \
    .getOrCreate()


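# Load data
# Table names are defined up front so the success message below can reference them
classification_input_table = "ecommerce.classification_scaled_features"
regression_input_table = "ecommerce.regression_scaled_features"
try:
    classification_df = spark.read.parquet("spark-warehouse/ecommerce.db/classification_scaled_features")
    regression_df = spark.read.parquet("spark-warehouse/ecommerce.db/regression_scaled_features")
    print(f"Successfully loaded data from {classification_input_table} and {regression_input_table}")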
except Exception as e:
    print(f"Error loading tables: {e}")
    print("Please ensure Module 2 Notebook 2 was run successfully with MinMaxScaler.")


Successfully loaded data from ecommerce.classification_scaled_features and ecommerce.regression_scaled_features


In [6]:
# Display sample data for classification
print("Sample Classification Input Data:")
classification_df.show(5)


Sample Classification Input Data:
+-----------+----------+--------------------+-------------+
|customer_id|product_id|            features|has_purchased|
+-----------+----------+--------------------+-------------+
|          2|       485|(43,[0,1,2,3,6,7,...|            0|
|          4|       312|(43,[0,1,2,3,4,6,...|            0|
|          4|       616|(43,[0,1,2,3,4,7,...|            1|
|          5|       825|(43,[0,1,2,3,4,6,...|            1|
|          6|       143|(43,[0,1,2,3,4,7,...|            0|
+-----------+----------+--------------------+-------------+
only showing top 5 rows



## 2. PCA Demonstration

Principal Component Analysis (PCA) is a dimensionality reduction technique that projects the features onto a smaller set of principal components, ordered by how much of the data's variance each one captures. While powerful, each component is a linear combination of the original features, which makes the result harder to interpret directly.

We'll demonstrate applying PCA but won't use its results for our final model pipeline, sticking to ChiSqSelector for better interpretability.


In [10]:
# Configure PCA
num_pca_components = 10 # Choose a small number of components for demonstration
pca = PCA(
    k=num_pca_components, 
    inputCol="features", 
    outputCol="pcaFeatures"
)

# Fit PCA model to the classification data
print(f"\nFitting PCA with k={num_pca_components}...")
pca_model = pca.fit(classification_df)

# Transform the data 
pca_df = pca_model.transform(classification_df)
print("\nData after PCA Transformation:")
pca_df.select('customer_id', 'product_id', 'pcaFeatures').show(5)

# Show explained variance
explained_variance = pca_model.explainedVariance
print(f"Explained Variance per component: {list(explained_variance)}") # Cast to list for cleaner print
print(f"Total Variance Explained by {num_pca_components} components: {sum(explained_variance):.4f}")



Fitting PCA with k=10...

Data after PCA Transformation:
+-----------+----------+--------------------+
|customer_id|product_id|         pcaFeatures|
+-----------+----------+--------------------+
|          2|       485|[0.74976916095425...|
|          4|       312|[0.73520409210458...|
|          4|       616|[0.66944346981817...|
|          5|       825|[-0.7729857783717...|
|          6|       143|[0.67368060065962...|
+-----------+----------+--------------------+
only showing top 5 rows

Explained Variance per component: [0.12195263026849008, 0.11017128129711976, 0.07874457921957756, 0.06297266098746845, 0.060305930370820056, 0.04806744810650913, 0.03444931623220214, 0.03409640759744138, 0.033225320949748, 0.02940129278425248]
Total Variance Explained by 10 components: 0.6134


As the output shows, PCA reduces the 43 original features to 10 components.

The 'explained variance' values give the proportion of the original data's total variance captured by each principal component. The first 10 components together capture approximately 61% of the total variance of the original 43 features. This shows PCA's ability to condense information, though in practice `k` is often chosen to capture a higher share (for example 90% or more).

However, for our course, we prioritize interpretability and will proceed with Chi-Square Feature Selection.
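
To make the variance figures concrete, here is a minimal sketch in plain Python (no Spark required) that accumulates the ratios printed above and reports how many components are needed to reach a target share. The helper name `components_for_threshold` is an illustrative assumption, not part of PySpark.

```python
from itertools import accumulate

# Explained-variance ratios copied from the PCA output above (k=10).
explained_variance = [
    0.12195263026849008, 0.11017128129711976, 0.07874457921957756,
    0.06297266098746845, 0.060305930370820056, 0.04806744810650913,
    0.03444931623220214, 0.03409640759744138, 0.033225320949748,
    0.02940129278425248,
]

def components_for_threshold(ratios, threshold):
    """Return the smallest k whose cumulative ratio reaches `threshold`, or None."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    return None

cumulative_share = list(accumulate(explained_variance))
print(f"Cumulative share after 10 components: {cumulative_share[-1]:.4f}")
print(f"Components needed for 50%: {components_for_threshold(explained_variance, 0.50)}")
print(f"Components needed for 90%: {components_for_threshold(explained_variance, 0.90)}")
```

With only ten ratios available, the 90% target returns `None`; the model would have to be refit with a larger `k` to find where that threshold is crossed.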


## 3. Feature Selection (Chi-Square)

The Chi-Square test is designed for categorical features. We will first isolate the one-hot encoded (OHE) categorical parts of our feature vector using `VectorSlicer`, apply `ChiSqSelector` to only those features, and then combine the selected categorical features with the original scaled numerical features.
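
As a rough illustration of what the test measures for a single one-hot column, the sketch below computes the Pearson Chi-Square statistic for a 2x2 table of feature value against label, then applies an `fpr`-style rule that keeps the feature when its p-value falls below the threshold. It is plain Python with made-up counts, not Spark's implementation, and the function name `chi_square_2x2` is an assumption for this example.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table.

    table[i][j] = number of rows with feature value i and label j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = OHE column is 0 / 1, columns = has_purchased 0 / 1.
associated = [[300, 200], [200, 300]]
independent = [[250, 250], [250, 250]]

fpr_threshold = 0.05
for name, table in [("associated", associated), ("independent", independent)]:
    statistic, p_value = chi_square_2x2(table)
    keep = p_value < fpr_threshold
    print(f"{name}: chi2={statistic:.2f}, p={p_value:.4g}, keep={keep}")
```

A column whose value shifts the purchase rate produces a large statistic and a small p-value, so it is kept; a column that is independent of the label is dropped.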


In [12]:
# Indices based on M2N2 assembly order: 13 numerical + 3 gender + 4 membership + 3 device + 10 country + 10 category = 43 total
numerical_indices = list(range(0, 13))
categorical_indices = list(range(13, 43))

print(f"Numerical Feature Indices (0-based): {numerical_indices}")
print(f"Categorical Feature Indices (0-based): {categorical_indices}")


Numerical Feature Indices (0-based): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Categorical Feature Indices (0-based): [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
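
The hard-coded ranges depend entirely on the assembly order used in M2N2. As a hedged alternative, the sketch below derives the same index lists from per-group sizes, assuming the sizes in the comment above match what M2N2 actually produced. The helper `build_layout` is a hypothetical name introduced for this example.

```python
# Group sizes in assembly order, taken from the comment in the cell above.
group_sizes = [
    ("numerical", 13),
    ("gender", 3),
    ("membership", 4),
    ("device", 3),
    ("country", 10),
    ("category", 10),
]

def build_layout(groups):
    """Map each group name to its half-open (start, end) index range."""
    layout, start = {}, 0
    for name, size in groups:
        layout[name] = (start, start + size)
        start += size
    return layout, start

layout, total_features = build_layout(group_sizes)
numerical_indices = list(range(*layout["numerical"]))
categorical_indices = list(range(layout["numerical"][1], total_features))

for name, (start, end) in layout.items():
    print(f"{name:<11} -> indices {start}..{end - 1}")
print(f"Total features: {total_features}")
```

Keeping the sizes in one place means a change in an encoder's output width only has to be updated once, and the total can be checked against the vector size shown by `show()`.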


In [13]:
# Slice out the OHE categorical features
ohe_slicer = VectorSlicer(inputCol="features", outputCol="ohe_features", indices=categorical_indices)
classification_df_sliced_ohe = ohe_slicer.transform(classification_df)

print("Schema after slicing OHE features:")
classification_df_sliced_ohe.printSchema()
classification_df_sliced_ohe.select("ohe_features").show(5)


Schema after slicing OHE features:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- has_purchased: integer (nullable = true)
 |-- ohe_features: vector (nullable = true)

+--------------------+
|        ohe_features|
+--------------------+
|(30,[1,4,14,25,27...|
|(30,[1,9,13,18,27...|
|(30,[1,9,13,23,28...|
|(30,[0,3,13,17,28...|
|(30,[1,6,13,17,28...|
+--------------------+
only showing top 5 rows



In [17]:
# Configure ChiSqSelector for the OHE features
fpr_threshold = 0.05
chisq_selector = ChiSqSelector(
    selectorType="fpr", 
    fpr=fpr_threshold, 
    featuresCol="ohe_features", 
    outputCol="selected_ohe_features", 
    labelCol="has_purchased"
)

# Fit ChiSqSelector model
print(f"\nFitting ChiSqSelector on OHE features (fpr={fpr_threshold})...")
chisq_model = chisq_selector.fit(classification_df_sliced_ohe)

# Transform to get selected OHE features
classification_selected_ohe = chisq_model.transform(classification_df_sliced_ohe)

print("Schema after selecting OHE features:")
classification_selected_ohe.printSchema()
classification_selected_ohe.select("selected_ohe_features").show(5)


Fitting ChiSqSelector on OHE features (fpr=0.05)...
Schema after selecting OHE features:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- has_purchased: integer (nullable = true)
 |-- ohe_features: vector (nullable = true)
 |-- selected_ohe_features: vector (nullable = true)

+---------------------+
|selected_ohe_features|
+---------------------+
|      (14,[12],[1.0])|
| (14,[1,5],[1.0,1.0])|
| (14,[1,10],[1.0,1...|
| (14,[1,4],[1.0,1.0])|
| (14,[0,1,4],[1.0,...|
+---------------------+
only showing top 5 rows
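
Since interpretability is the reason for choosing Chi-Square here, it can help to translate the selected positions back into positions in the original 43-wide vector. The fitted model exposes the chosen indices as `chisq_model.selectedFeatures`, and those indices are relative to `ohe_features`. The sketch below shows only the offset arithmetic; the index values are hypothetical placeholders, and `to_original_indices` is a name invented for this example.

```python
def to_original_indices(selected_ohe_indices, ohe_offset):
    """Shift indices within the sliced OHE vector back into the 43-wide vector."""
    return [ohe_offset + i for i in selected_ohe_indices]

# In the notebook the real list would come from chisq_model.selectedFeatures;
# the values below are hypothetical placeholders for illustration only.
example_selected = [0, 4, 12]
ohe_offset = 13  # the OHE slice starts right after the 13 numerical features

print(to_original_indices(example_selected, ohe_offset))
```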



In [18]:
# Slice out the scaled numerical features from the original vector
num_slicer = VectorSlicer(inputCol="features", outputCol="numerical_features_scaled", indices=numerical_indices)
# Apply to the df that now has selected_ohe_features
classification_final_parts = num_slicer.transform(classification_selected_ohe)

print("Schema after slicing numerical features:")
classification_final_parts.printSchema()
classification_final_parts.select("numerical_features_scaled").show(5)

Schema after slicing numerical features:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- has_purchased: integer (nullable = true)
 |-- ohe_features: vector (nullable = true)
 |-- selected_ohe_features: vector (nullable = true)
 |-- numerical_features_scaled: vector (nullable = true)

+-------------------------+
|numerical_features_scaled|
+-------------------------+
|     (13,[0,1,2,3,6,7,...|
|     (13,[0,1,2,3,4,6,...|
|     (13,[0,1,2,3,4,7,...|
|     (13,[0,1,2,3,4,6,...|
|     (13,[0,1,2,3,4,7,...|
+-------------------------+
only showing top 5 rows



In [19]:
# Assemble the final feature vector: scaled numerical + selected OHE
final_assembler = VectorAssembler(
    inputCols=["numerical_features_scaled", "selected_ohe_features"], 
    outputCol="final_features"
)
classification_assembled = final_assembler.transform(classification_final_parts)

print("Schema after final assembly:")
classification_assembled.printSchema()
classification_assembled.select("final_features").show(5)

Schema after final assembly:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- has_purchased: integer (nullable = true)
 |-- ohe_features: vector (nullable = true)
 |-- selected_ohe_features: vector (nullable = true)
 |-- numerical_features_scaled: vector (nullable = true)
 |-- final_features: vector (nullable = true)

+--------------------+
|      final_features|
+--------------------+
|(27,[0,1,2,3,6,7,...|
|(27,[0,1,2,3,4,6,...|
|(27,[0,1,2,3,4,7,...|
|(27,[0,1,2,3,4,6,...|
|(27,[0,1,2,3,4,7,...|
+--------------------+
only showing top 5 rows
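
The size 27 in the output is the 13 numerical features followed by the 14 selected OHE features. Conceptually, assembling two vectors concatenates them and shifts the indices of the second by the width of the first. The sketch below mimics that with plain Python dictionaries; it is a simplified model of the behaviour, not how `VectorAssembler` is implemented, and the feature values are invented.

```python
def concat_sparse(*vectors):
    """Concatenate sparse vectors given as (size, {index: value}) pairs."""
    total_size, merged = 0, {}
    for size, entries in vectors:
        for index, value in entries.items():
            merged[total_size + index] = value  # shift by the width so far
        total_size += size
    return total_size, merged

# Hypothetical values; only the sizes (13 and 14) come from the notebook output.
numerical_part = (13, {0: 0.25, 3: 0.8})
selected_ohe_part = (14, {1: 1.0, 5: 1.0})

final_size, final_entries = concat_sparse(numerical_part, selected_ohe_part)
print(final_size, final_entries)
```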



## 4. Prepare Final Datasets using Selected Features

Select the final columns for the classification dataset, then apply the same slicers, the already-fitted Chi-Square model, and the assembler to the regression dataset so that both datasets share one feature layout.


In [26]:
# Select final columns for classification, renaming final_features -> features
final_classification_df = classification_assembled.select(
    "customer_id", 
    "product_id", 
    col("final_features").alias("features"), 
    "has_purchased"
)

print("Final Classification Dataset Schema:")
final_classification_df.printSchema()
final_classification_df.show(5)


Final Classification Dataset Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- has_purchased: integer (nullable = true)

+-----------+----------+--------------------+-------------+
|customer_id|product_id|            features|has_purchased|
+-----------+----------+--------------------+-------------+
|          2|       485|(27,[0,1,2,3,6,7,...|            0|
|          4|       312|(27,[0,1,2,3,4,6,...|            0|
|          4|       616|(27,[0,1,2,3,4,7,...|            1|
|          5|       825|(27,[0,1,2,3,4,6,...|            1|
|          6|       143|(27,[0,1,2,3,4,7,...|            0|
+-----------+----------+--------------------+-------------+
only showing top 5 rows



In [27]:
# Apply the same transformations to the regression data
print("\nProcessing regression data...")
# 1. Slice OHE
regression_df_sliced_ohe = ohe_slicer.transform(regression_df)
# 2. Select OHE using the *fitted* chisq_model
regression_selected_ohe = chisq_model.transform(regression_df_sliced_ohe)
# 3. Slice Numerical
regression_final_parts = num_slicer.transform(regression_selected_ohe)
# 4. Assemble Final
regression_assembled = final_assembler.transform(regression_final_parts)

# Select final columns for regression
final_regression_df = regression_assembled.select(
    "customer_id", 
    "product_id", 
    col("final_features").alias("features"), 
    "total_purchase_amount"
)

print("Final Regression Dataset Schema:")
final_regression_df.printSchema()
final_regression_df.show(5)



Processing regression data...
Final Regression Dataset Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- total_purchase_amount: double (nullable = true)

+-----------+----------+--------------------+---------------------+
|customer_id|product_id|            features|total_purchase_amount|
+-----------+----------+--------------------+---------------------+
|          4|       616|(27,[0,1,2,3,4,7,...|               281.39|
|          5|       825|(27,[0,1,2,3,4,6,...|               232.21|
|          7|       584|(27,[0,1,2,3,4,7,...|               656.01|
|          7|       615|(27,[0,1,2,3,4,7,...|               205.55|
|          9|       340|(27,[0,1,2,3,4,7,...|               496.87|
+-----------+----------+--------------------+---------------------+
only showing top 5 rows



## 5. Split Data (Train/Test)

Split both the final classification and regression datasets into training and testing sets for model evaluation in the next module.


In [28]:
# Define split ratio and seed
train_ratio = 0.8
test_ratio = 0.2
seed = 42

# Split classification data
print(f"\nSplitting classification data ({train_ratio*100}% train / {test_ratio*100}% test)...")
class_train_df, class_test_df = final_classification_df.randomSplit([train_ratio, test_ratio], seed=seed)

# Split regression data
print(f"Splitting regression data ({train_ratio*100}% train / {test_ratio*100}% test)...")
reg_train_df, reg_test_df = final_regression_df.randomSplit([train_ratio, test_ratio], seed=seed)




Splitting classification data (80.0% train / 20.0% test)...
Splitting regression data (80.0% train / 20.0% test)...


In [29]:
# Show counts of the splits
print("\nDataset Counts after Splitting:")
print(f"Classification Train Count: {class_train_df.count()}")
print(f"Classification Test Count:  {class_test_df.count()}")
print(f"Regression Train Count:   {reg_train_df.count()}")
print(f"Regression Test Count:    {reg_test_df.count()}")



Dataset Counts after Splitting:
Classification Train Count: 40174
Classification Test Count:  9793
Regression Train Count:   20286
Regression Test Count:    4952
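
The counts above are close to, but not exactly, an 80/20 split. `randomSplit` samples rows at random according to the weights, so the resulting sizes are approximate. The sketch below is a simplified stand-in for that behaviour in plain Python (not Spark's actual sampler), followed by the train shares implied by the printed counts. The function name `bernoulli_split` and the stand-in row ids are assumptions for illustration.

```python
import random

def bernoulli_split(rows, train_ratio, seed):
    """Assign each row to train with probability train_ratio, independently."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_ratio else test).append(row)
    return train, test

rows = list(range(10_000))  # stand-in row ids, not the notebook's data
train, test = bernoulli_split(rows, train_ratio=0.8, seed=42)
print(f"train={len(train)}, test={len(test)}, "
      f"train share={len(train) / len(rows):.4f}")

# Shares implied by the counts printed above.
print(f"classification train share: {40174 / (40174 + 9793):.4f}")
print(f"regression train share:     {20286 / (20286 + 4952):.4f}")
```

Fixing the seed makes the sketch repeatable: calling `bernoulli_split` again with the same rows and seed returns the same two lists.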


## 6. Save Outputs

Save the training and testing datasets as Parquet-backed tables in the `ecommerce` database for use in Module 3.


In [31]:
# Create database if it doesn't exist
spark.sql("CREATE DATABASE IF NOT EXISTS ecommerce")
spark.sql("USE ecommerce")


print("Using 'ecommerce' database")

Using 'ecommerce' database


In [32]:
# Define output table names
class_train_table = "ecommerce.classification_train_features"
class_test_table = "ecommerce.classification_test_features"
reg_train_table = "ecommerce.regression_train_features"
reg_test_table = "ecommerce.regression_test_features"

# Save tables (the writer below uses the Parquet format)
print("\nSaving split datasets as Parquet tables...")
class_train_df.write.mode("overwrite").format("parquet").saveAsTable(class_train_table)
class_test_df.write.mode("overwrite").format("parquet").saveAsTable(class_test_table)
reg_train_df.write.mode("overwrite").format("parquet").saveAsTable(reg_train_table)
reg_test_df.write.mode("overwrite").format("parquet").saveAsTable(reg_test_table)

print("Successfully saved training and testing datasets.")



Saving split datasets as Parquet tables...
Successfully saved training and testing datasets.


## 7. Summary & Next Steps

In this notebook, we:
1. Loaded the scaled feature data from M2N2.
2. Briefly demonstrated PCA for dimensionality reduction.
3. Isolated the OHE categorical features using `VectorSlicer`.
4. Applied Chi-Square feature selection (using `fpr`) *only* to the OHE features.
5. Isolated the scaled numerical features using `VectorSlicer`.
6. Recombined the selected OHE features and the scaled numerical features using `VectorAssembler`.
7. Applied the same transformations to the regression dataset.
8. Split both datasets into training and testing sets.
9. Saved the final train/test datasets as Parquet tables.

**Next:** Module 3 will utilize these prepared datasets to train and evaluate machine learning models.
