# Module 09: Feature Engineering at Scale

**Difficulty**: ⭐⭐  
**Estimated Time**: 70 minutes  
**Prerequisites**: 
- [Module 08: MLlib Basics](08_pyspark_machine_learning_mllib_basics.ipynb)
- Understanding of feature transformation concepts

## Learning Objectives

By the end of this notebook, you will be able to:

1. Apply various scaling techniques (StandardScaler, MinMaxScaler, Normalizer) to normalize features
2. Use VectorAssembler effectively to combine heterogeneous features into ML-ready vectors
3. Implement categorical encoding strategies (label encoding with StringIndexer, one-hot encoding with OneHotEncoder)
4. Perform feature selection using statistical methods and model-based approaches
5. Build robust feature engineering pipelines that scale to large datasets

## 1. Setup and Introduction

**What is Feature Engineering?**

Feature engineering is the process of transforming raw data into features that better represent the underlying patterns for machine learning models. Good features can:
- Improve model accuracy significantly
- Reduce training time
- Make models more interpretable
- Handle data quality issues

**Why Scale Features?**

Many ML algorithms are sensitive to feature scales:
- **Distance-based algorithms** (K-NN, SVM): Features with larger ranges dominate
- **Gradient descent** (Neural Networks, Linear Models): Convergence is faster with normalized features
- **Regularization**: Works better when features are on similar scales
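
To make the first point concrete, here is a minimal plain-Python sketch (not Spark code) of how one large-range feature can dominate a Euclidean distance. The two rows and the helper name `euclidean` are illustrative assumptions.

```python
import math

# Two people described as (salary, age, height_m); the values are illustrative.
a = (100000.0, 25.0, 1.75)
b = (55000.0, 35.0, 1.68)

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

raw_distance = euclidean(a, b)

# Share of the squared distance contributed by each feature.
squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
shares = [d / sum(squared_diffs) for d in squared_diffs]

print(f"raw distance: {raw_distance:.2f}")
print("share (salary, age, height):", [f"{s:.6f}" for s in shares])
```

On these unscaled values the salary difference accounts for essentially all of the distance, so age and height have almost no influence on which rows count as "close".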

**Common Scaling Techniques:**
- **StandardScaler**: Z-score normalization (mean=0, std=1)
- **MinMaxScaler**: Scale to a specific range (e.g., [0, 1])
- **Normalizer**: Scale each sample to unit norm
- **MaxAbsScaler**: Scale by maximum absolute value

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand, when, expr, round as spark_round
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# Feature engineering imports
from pyspark.ml.feature import (
    VectorAssembler, StandardScaler, MinMaxScaler, Normalizer, MaxAbsScaler,
    StringIndexer, OneHotEncoder, IndexToString,
    Bucketizer, QuantileDiscretizer, VectorIndexer,
    ChiSqSelector, UnivariateFeatureSelector
)
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors

# For generating data
import numpy as np
import random

# Set random seeds
np.random.seed(42)
random.seed(42)

In [None]:
# Create Spark session
spark = SparkSession.builder \
    .appName("Feature Engineering at Scale") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
print(f"Spark version: {spark.version}")
print("Spark session created successfully!")

## 2. Feature Scaling Techniques

We'll explore different scaling methods and understand when to use each one.

### StandardScaler (Z-score Normalization)

**Formula**: `(x - mean) / std`

**When to use:**
- Features follow roughly normal distribution
- You want zero mean and unit variance
- For algorithms sensitive to feature variance (SVM, Neural Networks)

**Parameters:**
- `withMean`: Center data to mean=0 (default: False, because it makes vectors dense)
- `withStd`: Scale to std=1 (default: True)
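
The effect of the two parameters can be sketched in plain Python on a single column. This is an illustration of the arithmetic, not the Spark implementation; it assumes the sample standard deviation (dividing by n - 1), which is what the Spark documentation describes for StandardScaler.

```python
import statistics

# Salary column from the sample data used in this section.
salaries = [100000.0, 55000.0, 80000.0, 45000.0,
            120000.0, 65000.0, 90000.0, 75000.0]

mean = statistics.mean(salaries)
std = statistics.stdev(salaries)  # sample standard deviation (n - 1)

# withMean=False, withStd=True: divide only, so zeros stay zero (sparsity kept).
scaled_only = [x / std for x in salaries]

# withMean=True, withStd=True: the full z-score, (x - mean) / std.
centered_scaled = [(x - mean) / std for x in salaries]

print(f"mean={mean:.1f}, std={std:.1f}")
print("scaled only:      ", [round(v, 3) for v in scaled_only])
print("centered + scaled:", [round(v, 3) for v in centered_scaled])
```

Both outputs have unit standard deviation, but only the centered version has zero mean; the uncentered version keeps every value positive here.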

In [None]:
# Generate sample data with different scales
# Features on very different scales to demonstrate the effect of scaling
sample_data = [
    (1, 100000.0, 25, 1.75),   # Salary: 100k, Age: 25, Height: 1.75m
    (2, 55000.0, 35, 1.68),
    (3, 80000.0, 28, 1.82),
    (4, 45000.0, 42, 1.65),
    (5, 120000.0, 30, 1.78),
    (6, 65000.0, 38, 1.70),
    (7, 90000.0, 26, 1.88),
    (8, 75000.0, 33, 1.72)
]

df_scale = spark.createDataFrame(
    sample_data,
    ["id", "salary", "age", "height"]
)

df_scale.show()
df_scale.describe().show()

In [None]:
# First, assemble features into a vector
# The scalers below operate on a single vector column, so assemble first
assembler = VectorAssembler(
    inputCols=["salary", "age", "height"],
    outputCol="features"
)

df_assembled = assembler.transform(df_scale)
df_assembled.select("id", "features").show(truncate=False)

In [None]:
# Apply StandardScaler
# withMean=False keeps vectors sparse (important for large-scale data)
# withStd=True scales to unit variance
standard_scaler = StandardScaler(
    inputCol="features",
    outputCol="scaled_features",
    withMean=False,
    withStd=True
)

scaler_model = standard_scaler.fit(df_assembled)
df_standard_scaled = scaler_model.transform(df_assembled)

print("Original vs Standard Scaled features:")
df_standard_scaled.select("id", "features", "scaled_features").show(truncate=False)

print(f"\nStandard deviations: {scaler_model.std}")
print(f"Means: {scaler_model.mean}")

### MinMaxScaler

**Formula**: `(x - min) / (max - min) * (max_range - min_range) + min_range`

**When to use:**
- You need features in a specific range (e.g., [0, 1])
- For image processing (pixel values)
- Not for sparse data: zeros are shifted unless the column minimum is 0, so the output becomes dense (use MaxAbsScaler to keep sparsity)

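**Trade-offs:**
- Preserves the relative spacing between values (it is a linear rescaling)
- Sensitive to outliers: a single extreme value sets the min or max and compresses all other values into a narrow band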
- Bounded output range
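
The formula above can be checked by hand with a short plain-Python sketch. The function name `min_max_scale` and the outlier value are illustrative assumptions; the constant-column branch returns the midpoint of the target range, which is a choice made for this sketch.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so the smallest maps to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant column: this sketch returns the midpoint of the target range.
        return [0.5 * (lo + hi) for _ in values]
    return [(x - v_min) / (v_max - v_min) * (hi - lo) + lo for x in values]

ages = [25.0, 35.0, 28.0, 42.0, 30.0, 38.0, 26.0, 33.0]
scaled = min_max_scale(ages)

# One extreme value stretches the range and squeezes everything else toward 0.
with_outlier = min_max_scale(ages + [95.0])

# A zero is only preserved when the column minimum happens to be 0.
shifted = min_max_scale([-5.0, 0.0, 5.0])

print([round(v, 3) for v in scaled])
print([round(v, 3) for v in with_outlier])
print(shifted)
```

With the outlier added, every original age lands below 0.25 of the output range, and in the last example the zero moves to 0.5.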

In [None]:
# Apply MinMaxScaler
# Default scales to [0, 1]
minmax_scaler = MinMaxScaler(
    inputCol="features",
    outputCol="minmax_features"
)

minmax_model = minmax_scaler.fit(df_assembled)
df_minmax_scaled = minmax_model.transform(df_assembled)

print("Original vs MinMax Scaled features:")
df_minmax_scaled.select("id", "features", "minmax_features").show(truncate=False)

print(f"\nOriginal mins: {minmax_model.originalMin}")
print(f"Original maxs: {minmax_model.originalMax}")

### Normalizer

**Purpose**: Scales each **row** (sample) to have unit norm

**When to use:**
- Text classification (TF-IDF vectors)
- When only the direction of a feature vector matters, not its magnitude
- Comparing similarity between samples

**Norms:**
- L1: Sum of absolute values = 1
- L2: Euclidean length (square root of the sum of squares) = 1 (default)
- L∞: Maximum absolute value = 1
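
As a quick check of the three norms, the following plain-Python sketch normalizes one row by hand. The helper `normalize` is a hypothetical name for illustration and is not part of the Spark API.

```python
import math

def normalize(row, p):
    """Divide every entry of row by its p-norm (p may be math.inf)."""
    if p == math.inf:
        norm = max(abs(x) for x in row)
    else:
        norm = sum(abs(x) ** p for x in row) ** (1.0 / p)
    # An all-zero row has no direction; leave it unchanged.
    return [x / norm for x in row] if norm != 0 else list(row)

row = [3.0, -4.0, 0.0]

l1 = normalize(row, 1)            # entries sum to 1 in absolute value
l2 = normalize(row, 2)            # Euclidean length 1
linf = normalize(row, math.inf)   # largest absolute entry is 1

print("L1:  ", l1)
print("L2:  ", l2)
print("Linf:", linf)
```

Note that the operation works row by row: unlike the scalers above, nothing is learned from the rest of the dataset.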

In [None]:
# Apply Normalizer with L2 norm (default)
normalizer = Normalizer(
    inputCol="features",
    outputCol="normalized_features",
    p=2.0  # L2 norm
)

# Normalizer is a Transformer, not an Estimator
# It doesn't need to be fitted
df_normalized = normalizer.transform(df_assembled)

print("Original vs Normalized features (L2):")
df_normalized.select("id", "features", "normalized_features").show(truncate=False)

In [None]:
# Compare L1, L2, and L∞ normalization
normalizer_l1 = Normalizer(inputCol="features", outputCol="norm_l1", p=1.0)
normalizer_l2 = Normalizer(inputCol="features", outputCol="norm_l2", p=2.0)
normalizer_linf = Normalizer(inputCol="features", outputCol="norm_linf", p=float('inf'))

df_norm_comparison = df_assembled
df_norm_comparison = normalizer_l1.transform(df_norm_comparison)
df_norm_comparison = normalizer_l2.transform(df_norm_comparison)
df_norm_comparison = normalizer_linf.transform(df_norm_comparison)

print("Comparison of different norms:")
df_norm_comparison.select("id", "features", "norm_l1", "norm_l2", "norm_linf").show(3, truncate=False)

## 3. Advanced Categorical Encoding

Beyond basic StringIndexer and OneHotEncoder, we'll explore more sophisticated encoding strategies.

### Encoding Strategies:

1. **Label Encoding** (StringIndexer): Assigns integer indices
   - Good for: Tree-based models, ordinal categories
   - Bad for: Linear models (implies ordering)

2. **One-Hot Encoding**: Creates binary columns
   - Good for: Linear models, low-cardinality categories
   - Bad for: High-cardinality features (too many columns)

3. **Target Encoding**: Uses target statistics (advanced, not covered here)

4. **Frequency Encoding**: Encodes by category frequency
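
The three encodings used in this notebook can be sketched in plain Python on a tiny column. This is an illustration under stated assumptions, not the Spark implementation: indices are assigned by descending frequency (StringIndexer's default ordering), ties are broken alphabetically as an assumption of this sketch, and the one-hot vector drops the last category in the way OneHotEncoder's default `dropLast=True` does.

```python
from collections import Counter

colors = ["Red", "Blue", "Red", "Green", "Red", "Blue"]
counts = Counter(colors)

# Label encoding: most frequent label gets index 0.
ordered = sorted(counts, key=lambda label: (-counts[label], label))
index_of = {label: i for i, label in enumerate(ordered)}
indexed = [index_of[c] for c in colors]

def one_hot(index, num_categories, drop_last=True):
    """Binary vector for index; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

encoded = [one_hot(i, len(ordered)) for i in indexed]

# Frequency encoding: replace each label by how often it occurs.
frequency = [counts[c] for c in colors]

print("index map:", index_of)
print("indexed:  ", indexed)
print("one-hot:  ", encoded)
print("frequency:", frequency)
```

Dropping the last category avoids a redundant column: a row of all zeros already identifies the least frequent label.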

In [None]:
# Generate data with multiple categorical features
n_samples = 500
categories = ["Electronics", "Clothing", "Books", "Home", "Sports"]
sizes = ["Small", "Medium", "Large", "XLarge"]
colors = ["Red", "Blue", "Green", "Black", "White"]

categorical_data = []
for i in range(n_samples):
    category = random.choice(categories)
    size = random.choice(sizes)
    color = random.choice(colors)
    price = np.random.uniform(10, 500)
    quantity = np.random.randint(1, 20)
    
    # Create target based on features
    score = 0
    if category == "Electronics":
        score += 2
    if size in ["Large", "XLarge"]:
        score += 1
    if price > 200:
        score += 1
    
    purchased = 1 if (score >= 2 and np.random.random() > 0.3) else 0
    
    categorical_data.append((category, size, color, float(price), quantity, purchased))

df_categorical = spark.createDataFrame(
    categorical_data,
    ["category", "size", "color", "price", "quantity", "purchased"]
)

print(f"Total samples: {df_categorical.count()}")
print("\nCategory distribution:")
df_categorical.groupBy("category").count().show()
df_categorical.show(10)

In [None]:
# Strategy 1: Label Encoding Only (for tree-based models)
# Tree models can handle categorical features as indices
category_indexer = StringIndexer(inputCol="category", outputCol="category_idx")
size_indexer = StringIndexer(inputCol="size", outputCol="size_idx")
color_indexer = StringIndexer(inputCol="color", outputCol="color_idx")

df_indexed = category_indexer.fit(df_categorical).transform(df_categorical)
df_indexed = size_indexer.fit(df_indexed).transform(df_indexed)
df_indexed = color_indexer.fit(df_indexed).transform(df_indexed)

print("Label Encoded Features:")
df_indexed.select("category", "category_idx", "size", "size_idx", "color", "color_idx").show(10)

In [None]:
# Strategy 2: One-Hot Encoding (for linear models)
# Convert indices to binary vectors
category_encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_vec")
size_encoder = OneHotEncoder(inputCol="size_idx", outputCol="size_vec")
color_encoder = OneHotEncoder(inputCol="color_idx", outputCol="color_vec")

df_encoded = category_encoder.fit(df_indexed).transform(df_indexed)
df_encoded = size_encoder.fit(df_encoded).transform(df_encoded)
df_encoded = color_encoder.fit(df_encoded).transform(df_encoded)

print("One-Hot Encoded Features:")
df_encoded.select("category", "category_vec", "size", "size_vec").show(10, truncate=False)

In [None]:
# Combine all features for a linear model
# Use one-hot encoded vectors for categorical features
linear_assembler = VectorAssembler(
    inputCols=["category_vec", "size_vec", "color_vec", "price", "quantity"],
    outputCol="features"
)

df_linear_features = linear_assembler.transform(df_encoded)
print("Feature vector for linear model:")
df_linear_features.select("features", "purchased").show(5, truncate=False)

In [None]:
# Combine features for tree-based model
# Use indexed features (no one-hot encoding needed)
tree_assembler = VectorAssembler(
    inputCols=["category_idx", "size_idx", "color_idx", "price", "quantity"],
    outputCol="tree_features"
)

df_tree_features = tree_assembler.transform(df_indexed)
print("Feature vector for tree model:")
df_tree_features.select("tree_features", "purchased").show(5, truncate=False)

## 4. Binning and Discretization

Converting continuous features into discrete bins can:
- Reduce the effect of outliers
- Capture non-linear relationships
- Make features more interpretable

### Bucketizer vs QuantileDiscretizer:

**Bucketizer**: Uses fixed bin edges
- You specify exact split points
- Bins may have unequal counts

**QuantileDiscretizer**: Creates bins with approximately equal counts
- You specify number of bins
- Automatically finds split points
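
The difference between the two approaches can be sketched in plain Python. The helper names `bucketize` and `quantile_splits` are hypothetical; the quantile helper computes exact split points on a small list, whereas Spark's QuantileDiscretizer uses an approximate algorithm, so its splits may differ slightly.

```python
import bisect
from collections import Counter

def bucketize(value, splits):
    """Bucket index for value; buckets are [a, b) except the last, which is [a, b]."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the splits {splits}")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect.bisect_right(splits, value) - 1

def quantile_splits(values, num_buckets):
    """Exact interior split points that give equal-frequency bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // num_buckets] for k in range(1, num_buckets)]

ages = [18, 22, 25, 29, 30, 41, 45, 52, 60, 67, 73, 79]

# Fixed edges: the bin counts depend on where the data happens to fall.
fixed_bins = [bucketize(a, [18, 30, 45, 60, 80]) for a in ages]

# Quantile edges: the split points adapt so each bin holds the same count.
interior = quantile_splits(ages, 4)
quantile_bins = [bisect.bisect_right(interior, a) for a in ages]

print("fixed bin counts:   ", sorted(Counter(fixed_bins).items()))
print("quantile splits:    ", interior)
print("quantile bin counts:", sorted(Counter(quantile_bins).items()))
```

On this list the fixed edges give uneven bins, while the quantile edges give three values per bin.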

In [None]:
# Create data for binning
age_data = [(float(age),) for age in np.random.randint(18, 80, 200)]
df_ages = spark.createDataFrame(age_data, ["age"])

print("Age distribution:")
df_ages.describe().show()

In [None]:
# Bucketizer: Fixed bin edges
# Create age groups: [18, 30), [30, 45), [45, 60), [60, 80]
bucketizer = Bucketizer(
    splits=[18, 30, 45, 60, 80],  # Note: splits has n+1 values for n bins
    inputCol="age",
    outputCol="age_bucket"
)

df_bucketed = bucketizer.transform(df_ages)
print("Bucketed ages:")
df_bucketed.show(10)

print("\nBucket distribution:")
df_bucketed.groupBy("age_bucket").count().orderBy("age_bucket").show()

In [None]:
# QuantileDiscretizer: Equal-frequency bins
# Automatically finds split points for 4 bins with roughly equal counts
quantile_discretizer = QuantileDiscretizer(
    numBuckets=4,
    inputCol="age",
    outputCol="age_quantile"
)

quantile_model = quantile_discretizer.fit(df_ages)
df_quantiled = quantile_model.transform(df_ages)

print("Quantile-based bins:")
df_quantiled.show(10)

print(f"\nSplit points found: {quantile_model.getSplits()}")
print("\nQuantile distribution (should be roughly equal):")
df_quantiled.groupBy("age_quantile").count().orderBy("age_quantile").show()

## 5. Feature Selection

**Why Feature Selection?**
- Reduce overfitting by removing irrelevant features
- Improve model performance and training speed
- Enhance interpretability
- Reduce storage and computation costs

**Methods in PySpark:**

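1. **ChiSqSelector**: Chi-squared test between categorical features and a categorical label (deprecated since Spark 3.1 in favor of UnivariateFeatureSelector)
2. **UnivariateFeatureSelector**: Picks the test (chi-squared, ANOVA F-test, or F-regression) from the declared feature and label types
3. **VectorIndexer**: Not a selector itself; it marks low-cardinality vector entries as categorical so tree models can treat them correctly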
4. **Model-based**: Use feature importances from tree models
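
To show what a chi-squared selector ranks by, here is a minimal plain-Python sketch of the Pearson statistic, the sum over all cells of (observed - expected)² / expected, for one categorical feature against a categorical label. The function name and the toy data are illustrative assumptions; Spark additionally converts the statistic into a p-value before ranking.

```python
from collections import Counter

def chi_squared(feature, label):
    """Pearson chi-squared statistic for two categorical sequences of equal length."""
    n = len(feature)
    observed = Counter(zip(feature, label))
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            # Expected count if feature and label were independent.
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat

label       = [1, 1, 1, 1, 0, 0, 0, 0]
informative = ["a", "a", "a", "a", "b", "b", "b", "b"]  # tracks the label exactly
noise       = ["a", "b", "a", "b", "a", "b", "a", "b"]  # independent of the label

print("informative:", chi_squared(informative, label))
print("noise:      ", chi_squared(noise, label))
```

The test counts co-occurrences of categories, which is why it is defined for categorical features; continuous features are usually binned first or handled with an F-test instead.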

In [None]:
# Generate data with relevant and irrelevant features
n_samples = 1000
feature_data = []

for _ in range(n_samples):
    # Relevant features (actually influence target)
    f1 = np.random.uniform(0, 10)
    f2 = np.random.uniform(0, 10)
    # Irrelevant features (random noise)
    f3 = np.random.uniform(0, 10)
    f4 = np.random.uniform(0, 10)
    f5 = np.random.uniform(0, 10)
    
    # Target depends only on f1 and f2
    label = 1 if (f1 + f2 > 10) else 0
    
    feature_data.append((float(f1), float(f2), float(f3), float(f4), float(f5), label))

df_features = spark.createDataFrame(
    feature_data,
    ["f1", "f2", "f3", "f4", "f5", "label"]
)

print(f"Total samples: {df_features.count()}")
print("Label distribution:")
df_features.groupBy("label").count().show()
df_features.show(10)

In [None]:
# Assemble features
feature_assembler = VectorAssembler(
    inputCols=["f1", "f2", "f3", "f4", "f5"],
    outputCol="features"
)

df_assembled_features = feature_assembler.transform(df_features)
df_assembled_features.select("features", "label").show(5, truncate=False)

In [None]:
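# ChiSqSelector: keep the top k features ranked by a chi-squared test against the label
# Caveat: the test assumes categorical features. On continuous values like these,
# every distinct value is treated as its own category, so read the ranking as illustrative only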
chi_selector = ChiSqSelector(
    numTopFeatures=3,  # Select top 3 features
    featuresCol="features",
    outputCol="selected_features",
    labelCol="label"
)

chi_model = chi_selector.fit(df_assembled_features)
df_selected = chi_model.transform(df_assembled_features)

print("Original features vs Selected features:")
df_selected.select("features", "selected_features", "label").show(10, truncate=False)

print(f"\nSelected feature indices: {chi_model.selectedFeatures}")

In [None]:
# UnivariateFeatureSelector: More flexible feature selection
# Supports different selection modes and feature types
univariate_selector = UnivariateFeatureSelector(
    featuresCol="features",
    outputCol="univariate_features",
    labelCol="label",
    selectionMode="numTopFeatures"  # Can also use "percentile", "fpr", "fdr", "fwe"
)

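# Continuous features with a categorical label: the selector uses an ANOVA F-test
univariate_selector.setFeatureType("continuous").setLabelType("categorical")
# With selectionMode="numTopFeatures", the threshold is the number of features to keep
univariate_selector.setSelectionThreshold(2)  # Select top 2 features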

univariate_model = univariate_selector.fit(df_assembled_features)
df_univariate = univariate_model.transform(df_assembled_features)

print("Univariate feature selection:")
df_univariate.select("features", "univariate_features", "label").show(10, truncate=False)

print(f"\nSelected feature indices: {univariate_model.selectedFeatures}")

In [None]:
# Compare model performance with and without feature selection
train_df, test_df = df_assembled_features.randomSplit([0.7, 0.3], seed=42)

# Model with all features
lr_all = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
model_all = lr_all.fit(train_df)
predictions_all = model_all.transform(test_df)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_all = evaluator.evaluate(predictions_all)

print(f"Accuracy with all features (5): {accuracy_all:.4f}")

In [None]:
# Model with selected features
# Refit the selector on the training split only, so test rows do not influence which features are kept
chi_model_train = chi_selector.fit(train_df)
train_selected = chi_model_train.transform(train_df)
test_selected = chi_model_train.transform(test_df)

lr_selected = LogisticRegression(labelCol="label", featuresCol="selected_features", maxIter=10)
model_selected = lr_selected.fit(train_selected)
predictions_selected = model_selected.transform(test_selected)

accuracy_selected = evaluator.evaluate(predictions_selected)

print(f"Accuracy with selected features (3): {accuracy_selected:.4f}")
print(f"\nDifference: {accuracy_selected - accuracy_all:.4f}")
print("A difference close to zero suggests the dropped features carried little signal.")

## 6. Building Robust Feature Engineering Pipelines

A complete feature engineering pipeline should:
1. Handle missing values (if any)
2. Encode categorical features appropriately
3. Scale numerical features
4. Perform feature selection
5. Assemble final feature vector

All steps should be in a Pipeline to ensure reproducibility and prevent data leakage.
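
The leakage point can be illustrated without Spark. In this plain-Python sketch, the class `MeanStdScaler` is a hypothetical stand-in for any fitted transformer: statistics learned in `fit` are reused unchanged in `transform`, so fitting on the training split alone keeps test rows out of the learned parameters.

```python
import statistics

class MeanStdScaler:
    """Tiny estimator: fit learns mean and std, transform applies them."""

    def fit(self, values):
        self.mean_ = statistics.mean(values)
        self.std_ = statistics.stdev(values)
        return self

    def transform(self, values):
        return [(x - self.mean_) / self.std_ for x in values]

train = [30.0, 40.0, 50.0]
test = [1000.0]  # an extreme value that only appears at test time

# Correct: statistics come from the training split only.
scaler = MeanStdScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Leaky: fitting on train + test lets the test row shift the learned mean.
leaky = MeanStdScaler().fit(train + test)

print("train-only mean:", scaler.mean_, "-> train scaled:", train_scaled)
print("leaky mean:     ", leaky.mean_)
```

A Spark `Pipeline` follows the same pattern: `fit` on the training DataFrame learns every stage's parameters, and `transform` on the test DataFrame only applies them.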

In [None]:
# Create realistic dataset for complete pipeline
n_samples = 1000
complete_data = []

for _ in range(n_samples):
    # Categorical features
    department = random.choice(["Sales", "Engineering", "Marketing", "HR"])
    education = random.choice(["High School", "Bachelor", "Master", "PhD"])
    
    # Numerical features
    age = np.random.randint(22, 65)
    years_experience = min(age - 22, np.random.randint(0, 40))
    salary = np.random.uniform(30000, 150000)
    satisfaction_score = np.random.uniform(1, 10)
    
    # Target: Will employee be promoted?
    promo_score = (
        years_experience * 0.1 +
        satisfaction_score * 0.5 +
        (1 if department == "Engineering" else 0) * 2 +
        (1 if education in ["Master", "PhD"] else 0) * 1
    )
    promoted = 1 if promo_score > 6 and np.random.random() > 0.3 else 0
    
    complete_data.append((
        department, education, age, years_experience,
        float(salary), float(satisfaction_score), promoted
    ))

df_complete = spark.createDataFrame(
    complete_data,
    ["department", "education", "age", "years_experience", "salary", "satisfaction_score", "promoted"]
)

print(f"Total samples: {df_complete.count()}")
print("\nPromotion distribution:")
df_complete.groupBy("promoted").count().show()
df_complete.show(10)

In [None]:
# Split data first (before any transformations)
complete_train, complete_test = df_complete.randomSplit([0.7, 0.3], seed=42)

print(f"Training samples: {complete_train.count()}")
print(f"Test samples: {complete_test.count()}")

In [None]:
# Build complete feature engineering pipeline

# Stage 1: Index categorical features
dept_indexer = StringIndexer(inputCol="department", outputCol="dept_idx")
edu_indexer = StringIndexer(inputCol="education", outputCol="edu_idx")

# Stage 2: One-hot encode
dept_encoder = OneHotEncoder(inputCol="dept_idx", outputCol="dept_vec")
edu_encoder = OneHotEncoder(inputCol="edu_idx", outputCol="edu_vec")

# Stage 3: Assemble numerical features for scaling
numeric_assembler = VectorAssembler(
    inputCols=["age", "years_experience", "salary", "satisfaction_score"],
    outputCol="numeric_features"
)

# Stage 4: Scale numerical features
scaler = StandardScaler(
    inputCol="numeric_features",
    outputCol="scaled_numeric",
    withStd=True,
    withMean=False
)

# Stage 5: Assemble all features (categorical + scaled numerical)
final_assembler = VectorAssembler(
    inputCols=["dept_vec", "edu_vec", "scaled_numeric"],
    outputCol="raw_features"
)

# Stage 6: Feature selection
feature_selector = ChiSqSelector(
    numTopFeatures=8,
    featuresCol="raw_features",
    outputCol="features",
    labelCol="promoted"
)

# Stage 7: Model
lr_model = LogisticRegression(
    labelCol="promoted",
    featuresCol="features",
    maxIter=10
)

# Create the complete pipeline
complete_pipeline = Pipeline(stages=[
    dept_indexer,
    edu_indexer,
    dept_encoder,
    edu_encoder,
    numeric_assembler,
    scaler,
    final_assembler,
    feature_selector,
    lr_model
])

print("Complete feature engineering pipeline created with 9 stages!")

In [None]:
# Fit the pipeline
# All transformations are learned from training data only
complete_model = complete_pipeline.fit(complete_train)

print("Pipeline fitted successfully!")

# Make predictions
complete_predictions = complete_model.transform(complete_test)

# Show results
complete_predictions.select(
    "department", "education", "age", "years_experience",
    "promoted", "prediction", "probability"
).show(15, truncate=False)

In [None]:
# Evaluate the complete pipeline
complete_evaluator = MulticlassClassificationEvaluator(
    labelCol="promoted",
    predictionCol="prediction"
)

accuracy = complete_evaluator.evaluate(complete_predictions, {complete_evaluator.metricName: "accuracy"})
f1 = complete_evaluator.evaluate(complete_predictions, {complete_evaluator.metricName: "f1"})
precision = complete_evaluator.evaluate(complete_predictions, {complete_evaluator.metricName: "weightedPrecision"})
recall = complete_evaluator.evaluate(complete_predictions, {complete_evaluator.metricName: "weightedRecall"})

print("Complete Pipeline Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

## 7. Exercises

### Exercise 1: Compare Scaling Methods

Create a dataset and compare the performance of a model with different scaling methods.

**Tasks:**
1. Generate synthetic data with features on different scales
2. Train three models: no scaling, StandardScaler, MinMaxScaler
3. Compare accuracy and training time
4. Determine which scaling method works best for this data

In [None]:
# Your code here
# TODO: Generate data with different scales
# TODO: Create three pipelines with different scaling
# TODO: Train and compare results

### Exercise 2: Categorical Encoding Strategy

Experiment with different categorical encoding strategies for high-cardinality features.

**Tasks:**
1. Create data with a high-cardinality categorical feature (e.g., 50+ unique values)
2. Try: (a) Label encoding, (b) One-hot encoding, (c) Frequency-based encoding (count occurrences)
3. Compare model performance and feature vector size
4. Discuss the trade-offs

In [None]:
# Your code here
# TODO: Generate data with high-cardinality categorical feature
# TODO: Implement different encoding strategies
# TODO: Compare results

### Exercise 3: Feature Selection Impact

Build a pipeline that demonstrates the impact of feature selection.

**Tasks:**
1. Create a dataset with 20 features, where only 5 are actually relevant
2. Build two pipelines: one with all features, one with ChiSqSelector selecting top 5
3. Compare training time, model complexity, and accuracy
4. Verify that the selector identified the correct relevant features

In [None]:
# Your code here
# TODO: Generate data with relevant and irrelevant features
# TODO: Build pipelines with and without selection
# TODO: Compare and analyze results

## 8. Exercise Solutions

### Solution 1: Compare Scaling Methods

In [None]:
# Generate data with vastly different scales
import time

n = 2000
scale_data = []
for _ in range(n):
    # Features on very different scales
    tiny_feature = np.random.uniform(0, 1)
    small_feature = np.random.uniform(0, 100)
    large_feature = np.random.uniform(0, 100000)
    
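    # Each term below ranges over [0, 10000], so all three features matter equally to the label
    # even though their raw magnitudes differ by five orders of magnitude.
    # The threshold 15000 is the midpoint of the sum, which gives roughly balanced classes
    # (a threshold of 6000 would label about 96% of rows as 1 and make accuracy uninformative).
    label = 1 if (tiny_feature * 10000 + small_feature * 100 + large_feature * 0.1 > 15000) else 0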
    
    scale_data.append((float(tiny_feature), float(small_feature), float(large_feature), label))

df_scale_test = spark.createDataFrame(scale_data, ["tiny", "small", "large", "label"])
train_scale, test_scale = df_scale_test.randomSplit([0.7, 0.3], seed=42)

print("Sample data:")
df_scale_test.show(5)
df_scale_test.describe().show()

In [None]:
# Pipeline 1: No scaling
no_scale_assembler = VectorAssembler(inputCols=["tiny", "small", "large"], outputCol="features")
no_scale_lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=20)
no_scale_pipeline = Pipeline(stages=[no_scale_assembler, no_scale_lr])

start = time.time()
no_scale_model = no_scale_pipeline.fit(train_scale)
no_scale_time = time.time() - start
no_scale_pred = no_scale_model.transform(test_scale)
no_scale_acc = evaluator.evaluate(no_scale_pred)

print(f"No Scaling - Accuracy: {no_scale_acc:.4f}, Time: {no_scale_time:.2f}s")

In [None]:
# Pipeline 2: StandardScaler
std_assembler = VectorAssembler(inputCols=["tiny", "small", "large"], outputCol="raw_features")
std_scaler = StandardScaler(inputCol="raw_features", outputCol="features", withStd=True, withMean=False)
std_lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=20)
std_pipeline = Pipeline(stages=[std_assembler, std_scaler, std_lr])

start = time.time()
std_model = std_pipeline.fit(train_scale)
std_time = time.time() - start
std_pred = std_model.transform(test_scale)
std_acc = evaluator.evaluate(std_pred)

print(f"StandardScaler - Accuracy: {std_acc:.4f}, Time: {std_time:.2f}s")

In [None]:
# Pipeline 3: MinMaxScaler
mm_assembler = VectorAssembler(inputCols=["tiny", "small", "large"], outputCol="raw_features")
mm_scaler = MinMaxScaler(inputCol="raw_features", outputCol="features")
mm_lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=20)
mm_pipeline = Pipeline(stages=[mm_assembler, mm_scaler, mm_lr])

start = time.time()
mm_model = mm_pipeline.fit(train_scale)
mm_time = time.time() - start
mm_pred = mm_model.transform(test_scale)
mm_acc = evaluator.evaluate(mm_pred)

print(f"MinMaxScaler - Accuracy: {mm_acc:.4f}, Time: {mm_time:.2f}s")

In [None]:
# Summary comparison
print("\n=== Scaling Method Comparison ===")
print(f"No Scaling:      Accuracy={no_scale_acc:.4f}, Time={no_scale_time:.2f}s")
print(f"StandardScaler:  Accuracy={std_acc:.4f}, Time={std_time:.2f}s")
print(f"MinMaxScaler:    Accuracy={mm_acc:.4f}, Time={mm_time:.2f}s")
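print("\nNote: Spark's LogisticRegression standardizes features internally by default (standardization=True),")
print("so the three pipelines may score similarly here. Explicit scaling matters more for estimators without that option.")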

### Solution 2: Categorical Encoding Strategy

In [None]:
# Generate high-cardinality categorical data
# Scenario: User ID (100 unique users)
n = 1000
user_ids = [f"user_{i:03d}" for i in range(100)]

high_card_data = []
for _ in range(n):
    user_id = random.choice(user_ids)
    activity_score = np.random.uniform(0, 100)
    
    # Some users are more likely to convert
    user_num = int(user_id.split("_")[1])
    base_prob = 0.5 if user_num < 50 else 0.3
    converted = 1 if (activity_score > 50 and np.random.random() < base_prob) else 0
    
    high_card_data.append((user_id, float(activity_score), converted))

df_high_card = spark.createDataFrame(high_card_data, ["user_id", "activity_score", "converted"])

print(f"Total samples: {df_high_card.count()}")
print(f"Unique users: {df_high_card.select('user_id').distinct().count()}")
df_high_card.show(10)

In [None]:
# Split data
hc_train, hc_test = df_high_card.randomSplit([0.7, 0.3], seed=42)

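# Strategy 1: Label encoding (small vector)
# handleInvalid="keep" assigns an extra index to users that appear only in the test split
label_indexer = StringIndexer(inputCol="user_id", outputCol="user_idx", handleInvalid="keep")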
label_assembler = VectorAssembler(inputCols=["user_idx", "activity_score"], outputCol="features")
label_lr = LogisticRegression(labelCol="converted", featuresCol="features", maxIter=10)
label_pipeline = Pipeline(stages=[label_indexer, label_assembler, label_lr])

label_model = label_pipeline.fit(hc_train)
label_pred = label_model.transform(hc_test)
label_acc = evaluator.evaluate(label_pred.withColumnRenamed("converted", "label"))

print(f"Label Encoding - Accuracy: {label_acc:.4f}")
print(f"Feature vector size: 2 (user_idx + activity_score)")

In [None]:
# Strategy 2: One-hot encoding (large sparse vector)
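# handleInvalid="keep" avoids a failure if a user appears only in the test split
oh_indexer = StringIndexer(inputCol="user_id", outputCol="user_idx", handleInvalid="keep")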
oh_encoder = OneHotEncoder(inputCol="user_idx", outputCol="user_vec")
oh_assembler = VectorAssembler(inputCols=["user_vec", "activity_score"], outputCol="features")
oh_lr = LogisticRegression(labelCol="converted", featuresCol="features", maxIter=10)
oh_pipeline = Pipeline(stages=[oh_indexer, oh_encoder, oh_assembler, oh_lr])

oh_model = oh_pipeline.fit(hc_train)
oh_pred = oh_model.transform(hc_test)
oh_acc = evaluator.evaluate(oh_pred.withColumnRenamed("converted", "label"))

print(f"One-Hot Encoding - Accuracy: {oh_acc:.4f}")
print(f"Feature vector size: ~100 (one per user + activity_score)")

In [None]:
# Strategy 3: Frequency encoding (count-based)
from pyspark.sql.functions import count

# Calculate user frequency in training data
user_counts = hc_train.groupBy("user_id").agg(count("*").alias("user_frequency"))
hc_train_freq = hc_train.join(user_counts, on="user_id", how="left")
hc_test_freq = hc_test.join(user_counts, on="user_id", how="left").fillna(0, subset=["user_frequency"])

freq_assembler = VectorAssembler(inputCols=["user_frequency", "activity_score"], outputCol="features")
freq_lr = LogisticRegression(labelCol="converted", featuresCol="features", maxIter=10)
freq_pipeline = Pipeline(stages=[freq_assembler, freq_lr])

freq_model = freq_pipeline.fit(hc_train_freq)
freq_pred = freq_model.transform(hc_test_freq)
freq_acc = evaluator.evaluate(freq_pred.withColumnRenamed("converted", "label"))

print(f"Frequency Encoding - Accuracy: {freq_acc:.4f}")
print(f"Feature vector size: 2 (user_frequency + activity_score)")

In [None]:
# Comparison summary
print("\n=== Encoding Strategy Comparison ===")
print(f"Label Encoding:     Acc={label_acc:.4f}, Vector Size=2")
print(f"One-Hot Encoding:   Acc={oh_acc:.4f}, Vector Size=~100")
print(f"Frequency Encoding: Acc={freq_acc:.4f}, Vector Size=2")
print("\nTrade-offs:")
print("- One-hot: Better for linear models but creates huge sparse vectors")
print("- Label: Compact but implies ordering")
print("- Frequency: Compact and captures popularity information")

### Solution 3: Feature Selection Impact

In [None]:
# Generate data with 20 features (only 5 relevant)
n = 2000
many_features_data = []

for _ in range(n):
    # 5 relevant features
    relevant = [np.random.uniform(0, 10) for _ in range(5)]
    # 15 irrelevant features (noise)
    irrelevant = [np.random.uniform(0, 10) for _ in range(15)]
    
    # Label depends only on relevant features
    label = 1 if sum(relevant) > 25 else 0
    
    all_features = relevant + irrelevant
    many_features_data.append(tuple(all_features + [label]))

feature_cols = [f"f{i}" for i in range(20)]
df_many = spark.createDataFrame(many_features_data, feature_cols + ["label"])

print(f"Total samples: {df_many.count()}")
print(f"Total features: 20 (5 relevant + 15 noise)")
print("\nLabel distribution:")
df_many.groupBy("label").count().show()

In [None]:
# Split data
many_train, many_test = df_many.randomSplit([0.7, 0.3], seed=42)

# Pipeline 1: All features
all_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
all_lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
all_pipeline = Pipeline(stages=[all_assembler, all_lr])

start = time.time()
all_model = all_pipeline.fit(many_train)
all_time = time.time() - start
all_pred = all_model.transform(many_test)
all_acc = evaluator.evaluate(all_pred)

print(f"All Features (20) - Accuracy: {all_acc:.4f}, Time: {all_time:.2f}s")

In [None]:
# Pipeline 2: Selected features (top 5)
sel_assembler = VectorAssembler(inputCols=feature_cols, outputCol="all_features")
selector = ChiSqSelector(numTopFeatures=5, featuresCol="all_features", outputCol="features", labelCol="label")
sel_lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
sel_pipeline = Pipeline(stages=[sel_assembler, selector, sel_lr])

start = time.time()
sel_model = sel_pipeline.fit(many_train)
sel_time = time.time() - start
sel_pred = sel_model.transform(many_test)
sel_acc = evaluator.evaluate(sel_pred)

# Get selected features
selector_stage = sel_model.stages[1]  # The ChiSqSelector stage
selected_indices = selector_stage.selectedFeatures

print(f"Selected Features (5) - Accuracy: {sel_acc:.4f}, Time: {sel_time:.2f}s")
print(f"\nSelected feature indices: {selected_indices}")
print(f"Expected relevant indices: [0, 1, 2, 3, 4]")
print(f"\nCorrectly identified: {sum(1 for i in selected_indices if i < 5)}/5 relevant features")

In [None]:
# Summary
print("\n=== Feature Selection Impact ===")
print(f"All Features (20):      Acc={all_acc:.4f}, Time={all_time:.2f}s")
print(f"Selected Features (5):  Acc={sel_acc:.4f}, Time={sel_time:.2f}s")
print("\nObservations:")
print(f"- Training time ratio (all / selected): {all_time/sel_time:.2f}x")
print("- Model simplicity: 75% fewer features")
print(f"- Accuracy change: {sel_acc - all_acc:+.4f}")
print("\nNote: on data this small, timings mostly reflect Spark overhead, so read the ratio with care.")

## 9. Summary

Congratulations! You've mastered feature engineering at scale with PySpark.

### Key Takeaways:

1. **Feature Scaling**:
   - StandardScaler: Best for normally distributed data, zero-mean unit-variance
   - MinMaxScaler: Best for bounded ranges; sensitive to outliers and does not preserve sparsity
   - Normalizer: Best for sample-wise normalization (text, similarity)
   - Always scale before training distance-based or gradient-based models

2. **Categorical Encoding**:
   - Label encoding: Compact, good for tree models, implies ordering
   - One-hot encoding: No ordering assumption, good for linear models, can be sparse
   - Choose based on cardinality and model type

3. **Binning and Discretization**:
   - Bucketizer: Fixed bin edges, interpretable bins
   - QuantileDiscretizer: Equal-frequency bins, handles skewed distributions
   - Useful for capturing non-linear patterns and reducing outlier effects

4. **Feature Selection**:
   - Can reduce overfitting and improve generalization
   - Speeds up training and prediction
   - ChiSqSelector: Statistical approach for categorical features and categorical targets
   - UnivariateFeatureSelector: More flexible, supports multiple tests

5. **Production Pipelines**:
   - Always split data before fitting pipelines
   - Chain all transformations to ensure consistency
   - Fit transformations only on training data
   - Pipelines make deployment and reuse straightforward

### Best Practices:

- Understand your data distribution before choosing scaling methods
- Consider model type when encoding categorical features
- Use feature selection to combat the curse of dimensionality
- Always validate transformations on held-out test data
- Document your feature engineering decisions for reproducibility

### What's Next?

In [Module 10: Model Training and Evaluation](10_model_training_and_evaluation.ipynb), you'll learn:
- Training multiple classification and regression algorithms
- Cross-validation for robust model selection
- Hyperparameter tuning with ParamGridBuilder
- Comprehensive model evaluation strategies
- Comparing and selecting the best model

### Additional Resources:

- [Feature Transformers Guide](https://spark.apache.org/docs/latest/ml-features.html)
- [Feature Selection in MLlib](https://spark.apache.org/docs/latest/ml-features.html#feature-selectors)
- [ML Pipelines Documentation](https://spark.apache.org/docs/latest/ml-pipeline.html)

In [None]:
# Clean up
spark.stop()
print("Spark session stopped. Excellent work on feature engineering!")