# Module 6: PySpark Machine Learning with MLlib
*Comprehensive Guide to Scalable Machine Learning in Production*

## Learning Objectives
By the end of this module, you will master:

**Data Preparation & Feature Engineering**
- Feature extraction and transformation pipelines
- Handling categorical and numerical features
- Feature scaling and normalization
- Text processing and NLP features

**Supervised Learning**
- Classification algorithms (Logistic Regression, Random Forest, GBT)
- Regression algorithms (Linear Regression, Random Forest Regression)
- Model evaluation and cross-validation
- Hyperparameter tuning with grid search

**Unsupervised Learning**
- Clustering algorithms (K-Means, Gaussian Mixture)
- Dimensionality reduction (PCA)
- Association rules and frequent pattern mining

**Advanced Topics**
- Pipeline construction and model persistence
- Distributed model training strategies
- Model deployment and serving
- Performance optimization for ML workloads

---

## Module Structure
1. **MLlib Setup & Data Preparation** - Environment and feature engineering
2. **Supervised Learning** - Classification and regression algorithms
3. **Unsupervised Learning** - Clustering and dimensionality reduction
4. **Model Pipelines** - End-to-end ML pipeline construction
5. **Model Evaluation & Tuning** - Cross-validation and hyperparameter optimization
6. **Production ML** - Model deployment and serving strategies

In [11]:
# Module 6: PySpark MLlib Setup and Environment
print("Setting up PySpark MLlib Environment...")

import os
import time
import numpy as np
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# MLlib imports
from pyspark.ml import Pipeline
from pyspark.ml.feature import *
from pyspark.ml.classification import *
from pyspark.ml.regression import *
from pyspark.ml.clustering import *
from pyspark.ml.evaluation import *
from pyspark.ml.tuning import *
from pyspark.ml.stat import Correlation
from pyspark.mllib.stat import Statistics

# Configure Spark for ML workloads
spark = SparkSession.builder \
    .appName("PySpark-MLlib-Comprehensive") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.default.parallelism", "8") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

# Set log level to reduce noise
spark.sparkContext.setLogLevel("WARN")

print("Spark MLlib Session Created")
print("Spark Version: {}".format(spark.version))
print("Default Parallelism: {}".format(spark.sparkContext.defaultParallelism))

# Display MLlib-specific configurations
print("\nMLlib Environment Configuration:")
ml_configs = [
    "spark.sql.adaptive.enabled",
    "spark.serializer", 
    "spark.sql.execution.arrow.pyspark.enabled",
    "spark.default.parallelism"
]

for config in ml_configs:
    value = spark.conf.get(config, "Not Set")
    print("   {}: {}".format(config, value))

print("\nMLlib modules successfully imported and ready for machine learning!")
print("Available algorithms: Classification, Regression, Clustering, Feature Engineering")

Setting up PySpark MLlib Environment...
Spark MLlib Session Created
Spark Version: 4.0.0
Default Parallelism: 8

MLlib Environment Configuration:
   spark.sql.adaptive.enabled: true
   spark.serializer: org.apache.spark.serializer.KryoSerializer
   spark.sql.execution.arrow.pyspark.enabled: true
   spark.default.parallelism: 8

MLlib modules successfully imported and ready for machine learning!
Available algorithms: Classification, Regression, Clustering, Feature Engineering


In [12]:
# Generate Comprehensive ML Dataset for Demonstrations
print("Creating Machine Learning Datasets...")

from datetime import date
import random

# Set seed for reproducibility
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoint")
random.seed(42)

# 1. Customer Dataset for Classification (Churn Prediction)
print("Generating Customer Churn Dataset...")
customer_df = spark.range(1, 10001) \
    .withColumnRenamed("id", "customer_id") \
    .withColumn("age", (rand(42) * 50 + 18).cast("int")) \
    .withColumn("tenure_months", (rand(43) * 60 + 1).cast("int")) \
    .withColumn("monthly_charges", (rand(44) * 80 + 20).cast("decimal(8,2)")) \
    .withColumn("total_charges", col("monthly_charges") * col("tenure_months")) \
    .withColumn("num_services", (rand(45) * 8 + 1).cast("int")) \
    .withColumn("contract_type", 
        when(rand(46) < 0.3, "Month-to-month")
        .when(rand(47) < 0.6, "One year")
        .otherwise("Two year")) \
    .withColumn("payment_method",
        when(rand(48) < 0.25, "Electronic check")
        .when(rand(49) < 0.5, "Mailed check") 
        .when(rand(50) < 0.75, "Bank transfer")
        .otherwise("Credit card")) \
    .withColumn("internet_service",
        when(rand(51) < 0.4, "DSL")
        .when(rand(52) < 0.7, "Fiber optic")
        .otherwise("No")) \
    .withColumn("tech_support", when(rand(53) < 0.6, "Yes").otherwise("No")) \
    .withColumn("online_backup", when(rand(54) < 0.5, "Yes").otherwise("No"))

# Create churn label based on business logic
customer_df = customer_df.withColumn("churn",
    when(
        (col("contract_type") == "Month-to-month") & 
        (col("monthly_charges") > 70) & 
        (col("tenure_months") < 12) &
        (col("tech_support") == "No"), 1
    ).when(
        (col("age") < 25) & 
        (col("monthly_charges") > 60) &
        (col("payment_method") == "Electronic check"), 1
    ).when(rand(55) < 0.15, 1)  # Random churn for complexity
    .otherwise(0)
)

customer_df.cache()
print("Customer Dataset: {:,} records".format(customer_df.count()))
print("Churn Distribution:")
customer_df.groupBy("churn").count().show()

# 2. Sales Dataset for Regression (Sales Prediction)
print("\nGenerating Sales Prediction Dataset...")
sales_df = spark.range(1, 15001) \
    .withColumnRenamed("id", "sale_id") \
    .withColumn("store_id", (rand(56) * 50 + 1).cast("int")) \
    .withColumn("day_of_week", (rand(57) * 7 + 1).cast("int")) \
    .withColumn("month", (rand(58) * 12 + 1).cast("int")) \
    .withColumn("temperature", (rand(59) * 40 + 30).cast("decimal(5,2)")) \
    .withColumn("humidity", (rand(60) * 50 + 30).cast("decimal(5,2)")) \
    .withColumn("is_holiday", when(rand(61) < 0.1, 1).otherwise(0)) \
    .withColumn("promotion_active", when(rand(62) < 0.3, 1).otherwise(0)) \
    .withColumn("competitor_nearby", when(rand(63) < 0.4, 1).otherwise(0)) \
    .withColumn("store_size", 
        when(rand(64) < 0.3, "Small")
        .when(rand(65) < 0.7, "Medium")
        .otherwise("Large"))

# Create sales amount based on realistic factors
sales_df = sales_df.withColumn("sales_amount",
    (1000 + 
     col("temperature") * 10 +
     when(col("is_holiday") == 1, 500).otherwise(0) +
     when(col("promotion_active") == 1, 300).otherwise(0) +
     when(col("store_size") == "Large", 800)
     .when(col("store_size") == "Medium", 400).otherwise(0) +
     when(col("day_of_week").isin([6, 7]), 200).otherwise(0) +
     (rand(66) * 1000 - 500)  # Random variation
    ).cast("decimal(10,2)")
)

sales_df.cache()
print("Sales Dataset: {:,} records".format(sales_df.count()))

# 3. Product Dataset for Clustering
print("\nGenerating Product Clustering Dataset...")
product_df = spark.range(1, 5001) \
    .withColumnRenamed("id", "product_id") \
    .withColumn("price", (rand(67) * 500 + 10).cast("decimal(8,2)")) \
    .withColumn("rating", (rand(68) * 4 + 1).cast("decimal(3,2)")) \
    .withColumn("num_reviews", (rand(69) * 1000 + 5).cast("int")) \
    .withColumn("weight_kg", (rand(70) * 20 + 0.1).cast("decimal(5,2)")) \
    .withColumn("length_cm", (rand(71) * 100 + 5).cast("decimal(5,2)")) \
    .withColumn("width_cm", (rand(72) * 50 + 3).cast("decimal(5,2)")) \
    .withColumn("height_cm", (rand(73) * 30 + 2).cast("decimal(5,2)")) \
    .withColumn("category",
        when(rand(74) < 0.2, "Electronics")
        .when(rand(75) < 0.4, "Clothing")
        .when(rand(76) < 0.6, "Home")
        .when(rand(77) < 0.8, "Sports")
        .otherwise("Books"))

product_df.cache()
print("Product Dataset: {:,} records".format(product_df.count()))

# Show sample data
print("\nSample Customer Data:")
customer_df.show(5, truncate=False)

print("\nSample Sales Data:")
sales_df.show(5, truncate=False)

print("\nSample Product Data:")
product_df.show(5, truncate=False)

Creating Machine Learning Datasets...
Generating Customer Churn Dataset...
Customer Dataset: 10,000 records
Churn Distribution:


25/08/25 22:51:25 WARN CacheManager: Asked to cache already cached data.


+-----+-----+
|churn|count|
+-----+-----+
|    1| 1697|
|    0| 8303|
+-----+-----+


Generating Sales Prediction Dataset...
Sales Dataset: 15,000 records

Generating Product Clustering Dataset...


25/08/25 22:51:25 WARN CacheManager: Asked to cache already cached data.
25/08/25 22:51:25 WARN CacheManager: Asked to cache already cached data.


Product Dataset: 5,000 records

Sample Customer Data:
+-----------+---+-------------+---------------+-------------+------------+--------------+--------------+----------------+------------+-------------+-----+
|customer_id|age|tenure_months|monthly_charges|total_charges|num_services|contract_type |payment_method|internet_service|tech_support|online_backup|churn|
+-----------+---+-------------+---------------+-------------+------------+--------------+--------------+----------------+------------+-------------+-----+
|1          |48 |49           |83.88          |4110.12      |7           |Month-to-month|Mailed check  |DSL             |No          |Yes          |0    |
|2          |43 |40           |89.44          |3577.60      |2           |Month-to-month|Mailed check  |Fiber optic     |Yes         |No           |0    |
|3          |59 |16           |60.55          |968.80       |6           |Month-to-month|Bank transfer |Fiber optic     |Yes         |No           |0    |
|4          |31 

---

# Section 1: Feature Engineering and Data Preparation

## Core Concepts

**Feature Engineering** is the process of transforming raw data into features suitable for machine learning:
- **Categorical Encoding**: Converting text categories to numerical representations
- **Feature Scaling**: Normalizing numerical features for algorithm performance
- **Feature Selection**: Identifying the most relevant features
- **Text Processing**: Converting text data into numerical vectors

## Key MLlib Transformers

### Categorical Features
- **StringIndexer**: Converts string categories to numerical indices
- **OneHotEncoder**: Creates binary columns for each category
- **VectorIndexer**: Handles categorical features in vector columns

### Numerical Features
- **StandardScaler**: Standardizes features to have zero mean and unit variance
- **MinMaxScaler**: Scales features to a fixed range [0,1]
- **Normalizer**: Normalizes each row to have unit norm

### Text Features
- **Tokenizer**: Splits text into individual words
- **StopWordsRemover**: Removes common stop words
- **CountVectorizer**: Creates word count vectors
- **TF-IDF**: Term Frequency-Inverse Document Frequency vectors

---

In [13]:
# Section 1.1: Feature Engineering Pipeline
print("Demonstrating Feature Engineering Techniques...")

# 1. Categorical Feature Encoding
print("1. Categorical Feature Encoding")

# String Indexer for categorical variables
contract_indexer = StringIndexer(inputCol="contract_type", outputCol="contract_index")
payment_indexer = StringIndexer(inputCol="payment_method", outputCol="payment_index")
internet_indexer = StringIndexer(inputCol="internet_service", outputCol="internet_index")

# Apply indexers
customer_indexed = contract_indexer.fit(customer_df).transform(customer_df)
customer_indexed = payment_indexer.fit(customer_indexed).transform(customer_indexed)
customer_indexed = internet_indexer.fit(customer_indexed).transform(customer_indexed)

print("Categorical variables indexed:")
customer_indexed.select("contract_type", "contract_index", "payment_method", "payment_index").show(5)

# One-Hot Encoding
contract_encoder = OneHotEncoder(inputCol="contract_index", outputCol="contract_encoded")
payment_encoder = OneHotEncoder(inputCol="payment_index", outputCol="payment_encoded")
internet_encoder = OneHotEncoder(inputCol="internet_index", outputCol="internet_encoded")

customer_encoded = contract_encoder.fit(customer_indexed).transform(customer_indexed)
customer_encoded = payment_encoder.fit(customer_encoded).transform(customer_encoded)
customer_encoded = internet_encoder.fit(customer_encoded).transform(customer_encoded)

print("\nOne-hot encoded features (showing contract encoding):")
customer_encoded.select("contract_type", "contract_encoded").show(5, truncate=False)

# 2. Numerical Feature Scaling
print("\n2. Numerical Feature Scaling")

# Assemble numerical features into a vector
numerical_features = ["age", "tenure_months", "monthly_charges", "total_charges", "num_services"]
num_assembler = VectorAssembler(inputCols=numerical_features, outputCol="numerical_features")
customer_vector = num_assembler.transform(customer_encoded)

# Standard Scaling
scaler = StandardScaler(inputCol="numerical_features", outputCol="scaled_features", withStd=True, withMean=True)
scaler_model = scaler.fit(customer_vector)
customer_scaled = scaler_model.transform(customer_vector)

print("Original vs Scaled Features:")
customer_scaled.select("numerical_features", "scaled_features").show(3, truncate=False)

# MinMax Scaling demonstration
minmax_scaler = MinMaxScaler(inputCol="numerical_features", outputCol="minmax_features")
minmax_model = minmax_scaler.fit(customer_vector)
customer_minmax = minmax_model.transform(customer_vector)

print("MinMax scaled features:")
customer_minmax.select("minmax_features").show(3, truncate=False)

# 3. Feature Assembly for ML
print("\n3. Final Feature Assembly")

# Combine all features into a single vector
feature_cols = ["scaled_features", "contract_encoded", "payment_encoded", "internet_encoded"]
final_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
ml_ready_data = final_assembler.transform(customer_scaled)

print("Final feature vector shape and sample:")
print("Feature vector size: {}".format(ml_ready_data.select("features").first()["features"].size))
ml_ready_data.select("features", "churn").show(3, truncate=False)

# 4. Feature Statistics and Correlation
print("\n4. Feature Analysis")

# Calculate correlation matrix
correlation_matrix = Correlation.corr(ml_ready_data, "features").head()
print("Feature correlation matrix calculated")

# Basic feature statistics
ml_ready_data.describe(numerical_features).show()

print("Feature engineering pipeline complete!")
print("Data ready for machine learning algorithms")

# Cache the final dataset for reuse
ml_ready_data.cache()
final_count = ml_ready_data.count()
print("Final ML dataset: {:,} records with {} features".format(
    final_count, ml_ready_data.select("features").first()["features"].size))

Demonstrating Feature Engineering Techniques...
1. Categorical Feature Encoding
Categorical variables indexed:
+--------------+--------------+--------------+-------------+
| contract_type|contract_index|payment_method|payment_index|
+--------------+--------------+--------------+-------------+
|Month-to-month|           1.0|  Mailed check|          0.0|
|Month-to-month|           1.0|  Mailed check|          0.0|
|Month-to-month|           1.0| Bank transfer|          1.0|
|      Two year|           2.0| Bank transfer|          1.0|
|Month-to-month|           1.0|  Mailed check|          0.0|
+--------------+--------------+--------------+-------------+
only showing top 5 rows

One-hot encoded features (showing contract encoding):
+--------------+----------------+
|contract_type |contract_encoded|
+--------------+----------------+
|Month-to-month|(2,[1],[1.0])   |
|Month-to-month|(2,[1],[1.0])   |
|Month-to-month|(2,[1],[1.0])   |
|Two year      |(2,[],[])       |
|Month-to-month|(2,[1],

In [14]:
# Prepare ML-ready datasets with proper naming
print("\nPreparing datasets for ML algorithms...")

# 1. Customer Features for Classification (rename churn to churn_label)
customers_features = ml_ready_data.withColumnRenamed("churn", "churn_label")
customers_features.cache()

print(f"customers_features: {customers_features.count()} records")
print("Classification target distribution:")
customers_features.groupBy("churn_label").count().show()

# 2. Sales Features for Regression
# Create features for sales dataset
sales_categorical_cols = ["store_size"]
sales_numerical_cols = ["store_id", "day_of_week", "month", "temperature", "humidity", 
                       "is_holiday", "promotion_active", "competitor_nearby"]

# Index categorical variables
sales_indexed = sales_df
for col_name in sales_categorical_cols:
    indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed")
    sales_indexed = indexer.fit(sales_indexed).transform(sales_indexed)

# One-hot encode categorical variables  
sales_encoded = sales_indexed
for col_name in sales_categorical_cols:
    encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
    sales_encoded = encoder.fit(sales_encoded).transform(sales_encoded)

# Assemble features
sales_feature_cols = sales_numerical_cols + [f"{col}_encoded" for col in sales_categorical_cols]
sales_assembler = VectorAssembler(inputCols=sales_feature_cols, outputCol="features")
sales_features = sales_assembler.transform(sales_encoded).withColumnRenamed("sales_amount", "total_amount")
sales_features.cache()

print(f"sales_features: {sales_features.count()} records")

# 3. Product Features for Clustering  
# Create features for products dataset
product_categorical_cols = ["category"]
product_numerical_cols = ["price", "rating", "num_reviews", "weight_kg", "length_cm", "width_cm", "height_cm"]

# Index categorical variables
products_indexed = product_df
for col_name in product_categorical_cols:
    indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed")
    products_indexed = indexer.fit(products_indexed).transform(products_indexed)

# One-hot encode categorical variables
products_encoded = products_indexed  
for col_name in product_categorical_cols:
    encoder = OneHotEncoder(inputCol=f"{col_name}_indexed", outputCol=f"{col_name}_encoded")
    products_encoded = encoder.fit(products_encoded).transform(products_encoded)

# Assemble features
product_feature_cols = product_numerical_cols + [f"{col}_encoded" for col in product_categorical_cols]
products_assembler = VectorAssembler(inputCols=product_feature_cols, outputCol="features")
products_features = products_assembler.transform(products_encoded)
products_features.cache()

print(f"products_features: {products_features.count()} records")

print("\nAll ML datasets prepared successfully!")
print("Available datasets:")
print("- customers_features: Classification (churn prediction)")  
print("- sales_features: Regression (sales prediction)")
print("- products_features: Clustering (product segmentation)")


Preparing datasets for ML algorithms...
customers_features: 10000 records
Classification target distribution:
+-----------+-----+
|churn_label|count|
+-----------+-----+
|          1| 1697|
|          0| 8303|
+-----------+-----+



25/08/25 22:51:49 WARN CacheManager: Asked to cache already cached data.


sales_features: 15000 records
products_features: 5000 records

All ML datasets prepared successfully!
Available datasets:
- customers_features: Classification (churn prediction)
- sales_features: Regression (sales prediction)
- products_features: Clustering (product segmentation)
products_features: 5000 records

All ML datasets prepared successfully!
Available datasets:
- customers_features: Classification (churn prediction)
- sales_features: Regression (sales prediction)
- products_features: Clustering (product segmentation)


---

# Section 2: Supervised Learning Algorithms

## Classification Algorithms

**Classification** predicts discrete categories or classes:
- **Logistic Regression**: Linear model for binary/multiclass classification
- **Decision Trees**: Rule-based models with high interpretability
- **Random Forest**: Ensemble of decision trees for improved accuracy
- **Gradient Boosted Trees**: Sequential ensemble with error correction
- **Naive Bayes**: Probabilistic classifier based on Bayes' theorem

## Regression Algorithms

**Regression** predicts continuous numerical values:
- **Linear Regression**: Linear relationship modeling
- **Decision Tree Regression**: Non-linear regression with rules
- **Random Forest Regression**: Ensemble regression for better predictions
- **Gradient Boosted Tree Regression**: Sequential improvement regression

## Model Evaluation Metrics

### Classification Metrics
- **Accuracy**: Overall correct predictions
- **Precision**: True positives / (True positives + False positives)
- **Recall**: True positives / (True positives + False negatives)
- **F1-Score**: Harmonic mean of precision and recall
- **AUC-ROC**: Area under the receiver operating characteristic curve

### Regression Metrics
- **RMSE**: Root Mean Square Error
- **MAE**: Mean Absolute Error
- **R²**: Coefficient of determination

---

In [15]:
# Quick fix: Create ML-ready datasets for algorithms
print("Creating ML dataset variables...")

# 1. Customer features for classification
customers_features = ml_ready_data.withColumnRenamed("churn", "churn_label")
customers_features.cache()
print(f"customers_features: {customers_features.count()} records")

# 2. Sales features for regression  
sales_assembler = VectorAssembler(
    inputCols=["store_id", "day_of_week", "month", "temperature", "humidity", 
              "is_holiday", "promotion_active", "competitor_nearby"],
    outputCol="features"
)
sales_features = sales_assembler.transform(sales_df).withColumnRenamed("sales_amount", "total_amount")
sales_features.cache()
print(f"sales_features: {sales_features.count()} records")

# 3. Product features for clustering
products_assembler = VectorAssembler(
    inputCols=["price", "rating", "num_reviews", "weight_kg", "length_cm", "width_cm", "height_cm"],
    outputCol="features"
)
products_features = products_assembler.transform(product_df)
products_features.cache()
print(f"products_features: {products_features.count()} records")

print("All ML datasets ready!")

Creating ML dataset variables...
customers_features: 10000 records
sales_features: 15000 records


25/08/25 22:51:56 WARN CacheManager: Asked to cache already cached data.


products_features: 5000 records
All ML datasets ready!


In [18]:
# Section 2.1: Classification Algorithms

from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
import time
import builtins  # Add this to ensure we use Python's built-in functions

print("Classification Algorithm Comparison")
print("=" * 50)

# Prepare training and test sets
train_data, test_data = customers_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} records")
print(f"Test set: {test_data.count()} records")

# Cache datasets for repeated use
train_data.cache()
test_data.cache()

# Initialize classification algorithms
classifiers = {
    "Logistic Regression": LogisticRegression(
        featuresCol="features", 
        labelCol="churn_label",
        maxIter=20
    ),
    "Decision Tree": DecisionTreeClassifier(
        featuresCol="features", 
        labelCol="churn_label",
        maxDepth=10
    ),
    "Random Forest": RandomForestClassifier(
        featuresCol="features", 
        labelCol="churn_label",
        numTrees=20,
        maxDepth=10
    ),
    "Gradient Boosted Trees": GBTClassifier(
        featuresCol="features", 
        labelCol="churn_label",
        maxIter=10,
        maxDepth=5
    )
}

# Evaluation metrics
binary_evaluator = BinaryClassificationEvaluator(
    labelCol="churn_label", 
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

multiclass_evaluator = MulticlassClassificationEvaluator(
    labelCol="churn_label", 
    predictionCol="prediction"
)

# Train and evaluate each classifier
results = []

for name, classifier in classifiers.items():
    print(f"\n{name}:")
    print("-" * 30)
    
    # Measure training time
    start_time = time.time()
    
    # Train model
    model = classifier.fit(train_data)
    
    training_time = time.time() - start_time
    
    # Make predictions
    predictions = model.transform(test_data)
    
    # Calculate metrics
    auc = binary_evaluator.evaluate(predictions)
    accuracy = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "accuracy"})
    precision = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "weightedPrecision"})
    recall = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "weightedRecall"})
    f1 = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "f1"})
    
    # Store results
    import builtins
    result = {
        "Algorithm": name,
        "AUC": builtins.round(auc, 4),
        "Accuracy": builtins.round(accuracy, 4),
        "Precision": builtins.round(precision, 4),
        "Recall": builtins.round(recall, 4),
        "F1-Score": builtins.round(f1, 4),
        "Training Time": builtins.round(training_time, 2)
    }
    results.append(result)
    
    # Print metrics
    print(f"AUC-ROC: {auc:.4f}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"Training Time: {training_time:.2f} seconds")

# Summary comparison
print("\n" + "=" * 80)
print("CLASSIFICATION ALGORITHM COMPARISON SUMMARY")
print("=" * 80)

# Create comparison DataFrame
results_df = spark.createDataFrame(results)
results_df.show(truncate=False)

# Find best performing algorithm by F1-Score
best_f1 = builtins.max(results, key=lambda x: x["F1-Score"])
print(f"\nBest F1-Score: {best_f1['Algorithm']} ({best_f1['F1-Score']})")

# Find fastest training algorithm
fastest = builtins.min(results, key=lambda x: x["Training Time"])
print(f"Fastest Training: {fastest['Algorithm']} ({fastest['Training Time']} seconds)")

print("\nClassification analysis complete!")

Classification Algorithm Comparison
Training set: 8046 records
Test set: 1954 records

Logistic Regression:
------------------------------


25/08/25 22:53:27 WARN CacheManager: Asked to cache already cached data.
25/08/25 22:53:27 WARN CacheManager: Asked to cache already cached data.


AUC-ROC: 0.5210
Accuracy: 0.8557
Precision: 0.7322
Recall: 0.8557
F1-Score: 0.7891
Training Time: 1.29 seconds

Decision Tree:
------------------------------
AUC-ROC: 0.4687
Accuracy: 0.7958
Precision: 0.7451
Recall: 0.7958
F1-Score: 0.7682
Training Time: 0.83 seconds

Random Forest:
------------------------------
AUC-ROC: 0.4687
Accuracy: 0.7958
Precision: 0.7451
Recall: 0.7958
F1-Score: 0.7682
Training Time: 0.83 seconds

Random Forest:
------------------------------


25/08/25 22:53:31 WARN DAGScheduler: Broadcasting large task binary with size 1152.8 KiB
25/08/25 22:53:32 WARN DAGScheduler: Broadcasting large task binary with size 1644.4 KiB
25/08/25 22:53:32 WARN DAGScheduler: Broadcasting large task binary with size 1644.4 KiB
25/08/25 22:53:32 WARN DAGScheduler: Broadcasting large task binary with size 1102.9 KiB
25/08/25 22:53:32 WARN DAGScheduler: Broadcasting large task binary with size 1102.9 KiB
25/08/25 22:53:32 WARN DAGScheduler: Broadcasting large task binary with size 1114.7 KiB
25/08/25 22:53:33 WARN DAGScheduler: Broadcasting large task binary with size 1114.7 KiB
25/08/25 22:53:32 WARN DAGScheduler: Broadcasting large task binary with size 1114.7 KiB
25/08/25 22:53:33 WARN DAGScheduler: Broadcasting large task binary with size 1114.7 KiB
25/08/25 22:53:33 WARN DAGScheduler: Broadcasting large task binary with size 1114.7 KiB
25/08/25 22:53:33 WARN DAGScheduler: Broadcasting large task binary with size 1114.7 KiB
25/08/25 22:53:33 WAR

AUC-ROC: 0.5025
Accuracy: 0.8552
Precision: 0.7321
Recall: 0.8552
F1-Score: 0.7889
Training Time: 1.98 seconds

Gradient Boosted Trees:
------------------------------
AUC-ROC: 0.5110
Accuracy: 0.8547
Precision: 0.7808
Recall: 0.8547
F1-Score: 0.7906
Training Time: 2.64 seconds

CLASSIFICATION ALGORITHM COMPARISON SUMMARY
+------+--------+----------------------+--------+---------+------+-------------+
|AUC   |Accuracy|Algorithm             |F1-Score|Precision|Recall|Training Time|
+------+--------+----------------------+--------+---------+------+-------------+
|0.521 |0.8557  |Logistic Regression   |0.7891  |0.7322   |0.8557|1.29         |
|0.4687|0.7958  |Decision Tree         |0.7682  |0.7451   |0.7958|0.83         |
|0.5025|0.8552  |Random Forest         |0.7889  |0.7321   |0.8552|1.98         |
|0.511 |0.8547  |Gradient Boosted Trees|0.7906  |0.7808   |0.8547|2.64         |
+------+--------+----------------------+--------+---------+------+-------------+


Best F1-Score: Gradient Boo

In [19]:
# Section 2.2: Regression Algorithms

from pyspark.ml.regression import LinearRegression, DecisionTreeRegressor, RandomForestRegressor, GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
import math
import builtins  # Add this to ensure we use Python's built-in functions

print("Regression Algorithm Comparison")
print("=" * 50)

# Prepare sales prediction data
sales_train, sales_test = sales_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {sales_train.count()} records")
print(f"Test set: {sales_test.count()} records")

# Cache datasets
sales_train.cache()
sales_test.cache()

# Initialize regression algorithms
regressors = {
    "Linear Regression": LinearRegression(
        featuresCol="features", 
        labelCol="total_amount",
        maxIter=20
    ),
    "Decision Tree": DecisionTreeRegressor(
        featuresCol="features", 
        labelCol="total_amount",
        maxDepth=10
    ),
    "Random Forest": RandomForestRegressor(
        featuresCol="features", 
        labelCol="total_amount",
        numTrees=20,
        maxDepth=10
    ),
    "Gradient Boosted Trees": GBTRegressor(
        featuresCol="features", 
        labelCol="total_amount",
        maxIter=10,
        maxDepth=5
    )
}

# Evaluation metrics
rmse_evaluator = RegressionEvaluator(
    labelCol="total_amount", 
    predictionCol="prediction",
    metricName="rmse"
)

mae_evaluator = RegressionEvaluator(
    labelCol="total_amount", 
    predictionCol="prediction",
    metricName="mae"
)

r2_evaluator = RegressionEvaluator(
    labelCol="total_amount", 
    predictionCol="prediction",
    metricName="r2"
)

# Train and evaluate each regressor
regression_results = []

for name, regressor in regressors.items():
    print(f"\n{name}:")
    print("-" * 30)
    
    # Measure training time
    start_time = time.time()
    
    # Train model
    model = regressor.fit(sales_train)
    
    training_time = time.time() - start_time
    
    # Make predictions
    predictions = model.transform(sales_test)
    
    # Calculate metrics
    rmse = rmse_evaluator.evaluate(predictions)
    mae = mae_evaluator.evaluate(predictions)
    r2 = r2_evaluator.evaluate(predictions)
    
    # Store results
    result = {
        "Algorithm": name,
        "RMSE": builtins.round(rmse, 2),
        "MAE": builtins.round(mae, 2),
        "R²": builtins.round(r2, 4),
        "Training Time": builtins.round(training_time, 2)
    }
    regression_results.append(result)
    
    # Print metrics
    print(f"RMSE: {rmse:.2f}")
    print(f"MAE: {mae:.2f}")
    print(f"R²: {r2:.4f}")
    print(f"Training Time: {training_time:.2f} seconds")
    
    # Show sample predictions
    print("\nSample Predictions:")
    predictions.select("total_amount", "prediction").show(5)

# Summary comparison
print("\n" + "=" * 80)
print("REGRESSION ALGORITHM COMPARISON SUMMARY")
print("=" * 80)

# Create comparison DataFrame
regression_results_df = spark.createDataFrame(regression_results)
regression_results_df.show(truncate=False)

# Find best performing algorithm by R²
best_r2 = builtins.max(regression_results, key=lambda x: x["R²"])
print(f"\nBest R²: {best_r2['Algorithm']} ({best_r2['R²']})")

# Find lowest RMSE
lowest_rmse = builtins.min(regression_results, key=lambda x: x["RMSE"])
print(f"Lowest RMSE: {lowest_rmse['Algorithm']} ({lowest_rmse['RMSE']})")

print("\nRegression analysis complete!")

Regression Algorithm Comparison
Training set: 12085 records
Test set: 2915 records

Linear Regression:
------------------------------


25/08/25 22:54:09 WARN Instrumentation: [89a8dc0b] regParam is zero, which might cause numerical instability and overfitting.
25/08/25 22:54:09 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK
25/08/25 22:54:09 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


RMSE: 395.56
MAE: 319.93
R²: 0.2650
Training Time: 0.43 seconds

Sample Predictions:
+------------+------------------+
|total_amount|        prediction|
+------------+------------------+
|     2475.48| 2094.593521200074|
|     2433.94| 2027.285355126064|
|     2982.46|  2412.30674820032|
|     2204.51|1943.2882513798668|
|     2112.11|2389.5013394502184|
+------------+------------------+
only showing top 5 rows

Decision Tree:
------------------------------
RMSE: 433.31
MAE: 346.87
R²: 0.1180
Training Time: 0.68 seconds

Sample Predictions:
+------------+------------------+
|total_amount|        prediction|
+------------+------------------+
|     2475.48|2150.1538095238093|
|     2433.94| 2220.969122807017|
|     2982.46|2356.3030769230772|
|     2204.51|1959.1377628032344|
|     2112.11|2356.3030769230772|
+------------+------------------+
only showing top 5 rows

Random Forest:
------------------------------
RMSE: 433.31
MAE: 346.87
R²: 0.1180
Training Time: 0.68 seconds

Sample Pred

25/08/25 22:54:11 WARN DAGScheduler: Broadcasting large task binary with size 1469.4 KiB
25/08/25 22:54:12 WARN DAGScheduler: Broadcasting large task binary with size 2.4 MiB
25/08/25 22:54:12 WARN DAGScheduler: Broadcasting large task binary with size 2.4 MiB


RMSE: 398.59
MAE: 321.57
R²: 0.2537
Training Time: 2.12 seconds

Sample Predictions:
+------------+------------------+
|total_amount|        prediction|
+------------+------------------+
|     2475.48| 1999.548516777675|
|     2433.94| 2012.909331753898|
|     2982.46|2499.4297793143855|
|     2204.51|1911.5751781192134|
|     2112.11|2368.3664765789536|
+------------+------------------+
only showing top 5 rows

Gradient Boosted Trees:
------------------------------
RMSE: 396.88
MAE: 320.55
R²: 0.2601
Training Time: 2.54 seconds

Sample Predictions:
+------------+------------------+
|total_amount|        prediction|
+------------+------------------+
|     2475.48| 2018.434365016623|
|     2433.94|2122.8653830757153|
|     2982.46| 2434.429295146346|
|     2204.51| 1913.953447206258|
|     2112.11| 2451.212015955912|
+------------+------------------+
only showing top 5 rows

REGRESSION ALGORITHM COMPARISON SUMMARY
+----------------------+------+------+------+-------------+
|Algorithm   

---

# Section 3: Unsupervised Learning

## Clustering Algorithms

**Clustering** groups similar data points without predefined labels:
- **K-Means**: Partitions data into k clusters based on centroids
- **Gaussian Mixture Model**: Probabilistic clustering with soft assignments
- **Bisecting K-Means**: Hierarchical variant of K-Means

## Dimensionality Reduction

**Dimensionality Reduction** reduces feature space while preserving information:
- **Principal Component Analysis (PCA)**: Linear transformation to principal components
- **Feature Selection**: Selecting most relevant features

## Clustering Evaluation

- **Silhouette Score**: Measures cluster cohesion and separation
- **Within Set Sum of Squared Errors (WSSSE)**: Measures cluster compactness
- **Calinski-Harabasz Index**: Ratio of between-cluster to within-cluster variance

---

In [20]:
# Section 3.1: Clustering Algorithms

from pyspark.ml.clustering import KMeans, GaussianMixture, BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator
import builtins  # Add this to ensure we use Python's built-in functions

print("Clustering Algorithm Comparison")
print("=" * 50)

# Use product features for clustering
products_sample = products_features.sample(0.3, seed=42)  # Sample for faster computation
products_sample.cache()

print(f"Products for clustering: {products_sample.count()} records")

# Initialize clustering algorithms
clusterers = {
    "K-Means": KMeans(featuresCol="features", k=5, seed=42),
    "Gaussian Mixture": GaussianMixture(featuresCol="features", k=5, seed=42),
    "Bisecting K-Means": BisectingKMeans(featuresCol="features", k=5, seed=42)
}

# Clustering evaluator
silhouette_evaluator = ClusteringEvaluator(
    predictionCol="prediction",
    featuresCol="features",
    metricName="silhouette",
    distanceMeasure="squaredEuclidean"
)

# Train and evaluate each clustering algorithm
clustering_results = []

for name, clusterer in clusterers.items():
    print(f"\n{name}:")
    print("-" * 30)
    
    # Measure training time
    start_time = time.time()
    
    # Train model
    model = clusterer.fit(products_sample)
    
    training_time = time.time() - start_time
    
    # Make predictions
    predictions = model.transform(products_sample)
    
    # Calculate silhouette score
    silhouette = silhouette_evaluator.evaluate(predictions)
    
    # Store results
    result = {
        "Algorithm": name,
        "Silhouette Score": builtins.round(silhouette, 4),
        "Training Time": builtins.round(training_time, 2)
    }
    clustering_results.append(result)
    
    # Print metrics
    print(f"Silhouette Score: {silhouette:.4f}")
    print(f"Training Time: {training_time:.2f} seconds")
    
    # Show cluster distribution
    cluster_counts = predictions.groupBy("prediction").count().orderBy("prediction")
    print("Cluster Distribution:")
    cluster_counts.show()
    
    # Additional algorithm-specific metrics
    if name == "K-Means":
        print(f"Within Set Sum of Squared Errors: {model.summary.trainingCost:.2f}")
    elif name == "Gaussian Mixture":
        print(f"Log Likelihood: {model.summary.logLikelihood:.2f}")

# Summary comparison
print("\n" + "=" * 60)
print("CLUSTERING ALGORITHM COMPARISON SUMMARY")
print("=" * 60)

# Create comparison DataFrame
clustering_results_df = spark.createDataFrame(clustering_results)
clustering_results_df.show(truncate=False)

# Find best performing algorithm by silhouette score
best_silhouette = builtins.max(clustering_results, key=lambda x: x["Silhouette Score"])
print(f"\nBest Silhouette Score: {best_silhouette['Algorithm']} ({best_silhouette['Silhouette Score']})")

print("\nClustering analysis complete!")

Clustering Algorithm Comparison
Products for clustering: 1504 records

K-Means:
------------------------------
Silhouette Score: 0.5307
Training Time: 1.26 seconds
Cluster Distribution:
+----------+-----+
|prediction|count|
+----------+-----+
|         0|  291|
|         1|  239|
|         2|  400|
|         3|  255|
|         4|  319|
+----------+-----+

Within Set Sum of Squared Errors: 29640786.48

Gaussian Mixture:
------------------------------
Silhouette Score: 0.5307
Training Time: 1.26 seconds
Cluster Distribution:
+----------+-----+
|prediction|count|
+----------+-----+
|         0|  291|
|         1|  239|
|         2|  400|
|         3|  255|
|         4|  319|
+----------+-----+

Within Set Sum of Squared Errors: 29640786.48

Gaussian Mixture:
------------------------------
Silhouette Score: -0.0273
Training Time: 4.20 seconds
Cluster Distribution:
+----------+-----+
|prediction|count|
+----------+-----+
|         0|  274|
|         1|  398|
|         2|  374|
|         3| 

In [24]:
# Section 3.2: Dimensionality Reduction

from pyspark.ml.feature import PCA, VectorSlicer
from pyspark.ml.stat import Correlation
import numpy as np
import builtins  # Add this to ensure we use Python's built-in functions

print("Dimensionality Reduction Techniques")
print("=" * 50)

# Analyze original feature dimensions
print("Original Feature Analysis:")
print("-" * 30)

# Get feature vector size
sample_features = customers_features.select("features").first()["features"]
original_dimensions = builtins.len(sample_features.toArray())
print(f"Original feature dimensions: {original_dimensions}")

# Compute correlation matrix for feature analysis
correlation_matrix = Correlation.corr(customers_features, "features").head()[0]
correlation_array = correlation_matrix.toArray()
print(f"Feature correlation matrix shape: {correlation_array.shape}")

# Find highly correlated features
high_correlation_pairs = []
threshold = 0.8
for i in range(builtins.len(correlation_array)):
    for j in range(i+1, builtins.len(correlation_array)):
        if builtins.abs(correlation_array[i][j]) > threshold:
            high_correlation_pairs.append((i, j, correlation_array[i][j]))

print(f"Highly correlated feature pairs (|r| > {threshold}): {builtins.len(high_correlation_pairs)}")

# Principal Component Analysis (PCA)
print(f"\nPrincipal Component Analysis:")
print("-" * 30)

# Apply PCA with different numbers of components
pca_components = [3, 6, 10]  # Reduced to fit within 12 dimensions
pca_results = []

for n_components in pca_components:
    print(f"\nPCA with {n_components} components:")
    
    # Initialize PCA
    pca = PCA(k=n_components, inputCol="features", outputCol="pca_features")
    
    # Fit PCA model
    start_time = time.time()
    pca_model = pca.fit(customers_features)
    training_time = time.time() - start_time
    
    # Transform data
    pca_result = pca_model.transform(customers_features)
    
    # Get explained variance
    explained_variance = pca_model.explainedVariance.toArray()
    cumulative_variance = np.cumsum(explained_variance)
    
    print(f"Training time: {training_time:.2f} seconds")
    print(f"Explained variance ratio: {explained_variance}")
    print(f"Cumulative explained variance: {cumulative_variance[-1]:.4f}")
    
    # Store results
    pca_results.append({
        "Components": n_components,
        "Cumulative Variance": float(builtins.round(cumulative_variance[-1], 4)),
        "Dimension Reduction": f"{original_dimensions} -> {n_components}",
        "Compression Ratio": float(builtins.round(n_components / original_dimensions, 3))
    })
    
    # Show sample transformed data
    print("Sample PCA features:")
    pca_result.select("pca_features").show(3, truncate=False)

# Feature Selection using variance threshold
print(f"\nFeature Selection:")
print("-" * 30)

# Calculate feature variances (approximate using statistics)
feature_stats = customers_features.select("features").rdd.map(
    lambda row: row.features.toArray()
).collect()

# Convert to numpy array for easier computation
feature_array = np.array(feature_stats)
feature_variances = np.var(feature_array, axis=0)

# Select features with high variance
variance_threshold = np.percentile(feature_variances, 50)  # Top 50% by variance
high_variance_indices = [i for i, var in enumerate(feature_variances) if var > variance_threshold]

print(f"Original features: {builtins.len(feature_variances)}")
print(f"High variance features: {builtins.len(high_variance_indices)}")
print(f"Variance threshold: {float(variance_threshold):.4f}")

# Apply feature selection
vector_slicer = VectorSlicer(
    inputCol="features", 
    outputCol="selected_features", 
    indices=high_variance_indices
)

selected_features_df = vector_slicer.transform(customers_features)

print("Sample selected features:")
selected_features_df.select("selected_features").show(3, truncate=False)

# Summary comparison
print("\n" + "=" * 70)
print("DIMENSIONALITY REDUCTION SUMMARY")
print("=" * 70)

# PCA results summary
pca_summary_df = spark.createDataFrame(pca_results)
pca_summary_df.show(truncate=False)

print(f"\nFeature Selection Results:")
print(f"Original dimensions: {original_dimensions}")
print(f"Selected dimensions: {builtins.len(high_variance_indices)}")
print(f"Reduction ratio: {float(builtins.len(high_variance_indices)/original_dimensions):.3f}")

# Recommendations
print(f"\nRecommendations:")
print(f"- PCA with 10 components retains {pca_results[1]['Cumulative Variance']:.1%} of variance")
print(f"- Feature selection reduces dimensions by {float(1-builtins.len(high_variance_indices)/original_dimensions):.1%}")
print(f"- Consider PCA for linear dimensionality reduction")
print(f"- Consider feature selection for interpretable models")

print("\nDimensionality reduction analysis complete!")

Dimensionality Reduction Techniques
Original Feature Analysis:
------------------------------
Original feature dimensions: 12
Feature correlation matrix shape: (12, 12)
Highly correlated feature pairs (|r| > 0.8): 0

Principal Component Analysis:
------------------------------

PCA with 3 components:
Training time: 0.14 seconds
Explained variance ratio: [0.29955737 0.15449658 0.15220181]
Cumulative explained variance: 0.6063
Sample PCA features:
+-------------------------------------------------------------+
|pca_features                                                 |
+-------------------------------------------------------------+
|[-2.2892407617035317,-0.8474004717264122,-0.6739176158414207]|
|[-1.7528877435904526,0.4881883879713173,0.16408615342744615] |
|[0.9163851628851283,-0.4456221025432198,-1.323281603589831]  |
+-------------------------------------------------------------+
only showing top 3 rows

PCA with 6 components:
Training time: 0.14 seconds
Explained variance ratio: 

---

# Section 4: Model Pipelines and Advanced Topics

## ML Pipelines

**ML Pipelines** provide a high-level API for building machine learning workflows:
- **Pipeline**: Sequence of stages (transformers and estimators)
- **Transformer**: Transforms one DataFrame to another
- **Estimator**: Fits a model to data and produces a transformer
- **Parameter**: Named parameter for ML algorithms

## Model Selection and Tuning

**Hyperparameter Tuning** optimizes model performance:
- **Cross Validation**: Evaluates model performance across multiple data splits
- **Train-Validation Split**: Single train/validation split for faster tuning
- **Parameter Grid**: Defines hyperparameter search space
- **Model Selection**: Chooses best model based on evaluation metrics

## Model Persistence

**Model Saving and Loading**:
- **Save**: Persist trained models to disk
- **Load**: Reload models for prediction
- **Versioning**: Track model versions and metadata

---

In [26]:
# Section 4.1: ML Pipelines

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

print("ML Pipeline Construction")
print("=" * 50)

# Start with existing processed data for pipeline demonstration
print("Building end-to-end ML pipeline for churn prediction...")

# Use a simpler pipeline with existing features
final_assembler_simple = VectorAssembler(
    inputCols=["age", "tenure_months", "monthly_charges", "total_charges", "num_services"],
    outputCol="features"
)

# Stage: Machine Learning Algorithm
rf_classifier = RandomForestClassifier(
    featuresCol="features",
    labelCol="churn",
    numTrees=20,
    maxDepth=10,
    seed=42
)

# Create Pipeline
pipeline = Pipeline(stages=[
    final_assembler_simple,
    rf_classifier
])

print("Pipeline stages:")
for i, stage in enumerate(pipeline.getStages()):
    print(f"  {i+1}. {type(stage).__name__}")

# Prepare data (use existing customer DataFrame)
pipeline_train, pipeline_test = customer_df.randomSplit([0.8, 0.2], seed=42)

print(f"\nTraining set: {pipeline_train.count()} records")
print(f"Test set: {pipeline_test.count()} records")

# Train Pipeline
print("\nTraining complete pipeline...")
start_time = time.time()

pipeline_model = pipeline.fit(pipeline_train)

training_time = time.time() - start_time
print(f"Pipeline training completed in {training_time:.2f} seconds")

# Make Predictions
print("\nMaking predictions...")
pipeline_predictions = pipeline_model.transform(pipeline_test)

# Evaluate Pipeline Performance
evaluator = BinaryClassificationEvaluator(
    labelCol="churn",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

auc = evaluator.evaluate(pipeline_predictions)
print(f"Pipeline AUC: {auc:.4f}")

# Show pipeline results
print("\nPipeline prediction results:")
pipeline_predictions.select(
    "customer_id", "age", "monthly_charges", 
    "churn", "prediction", "probability"
).show(10)

# Inspect intermediate transformations
print("\nPipeline intermediate transformations:")
print("Raw features -> Indexed features -> Encoded features -> Final features")

# Show feature transformation stages
sample_transformed = pipeline_model.transform(pipeline_test.limit(3))
sample_transformed.select(
    "age", "monthly_charges", "features"
).show(3, truncate=False)

print("\nML Pipeline construction complete!")

ML Pipeline Construction
Building end-to-end ML pipeline for churn prediction...
Pipeline stages:
  1. VectorAssembler
  2. RandomForestClassifier

Training set: 8046 records
Test set: 1954 records

Training complete pipeline...
Test set: 1954 records

Training complete pipeline...


25/08/25 23:01:29 WARN DAGScheduler: Broadcasting large task binary with size 1457.3 KiB


Pipeline training completed in 1.93 seconds

Making predictions...
Pipeline AUC: 0.4912

Pipeline prediction results:
+-----------+---+---------------+-----+----------+--------------------+
|customer_id|age|monthly_charges|churn|prediction|         probability|
+-----------+---+---------------+-----+----------+--------------------+
|          3| 59|          60.55|    0|       0.0|[0.72964012466928...|
|          7| 67|          41.93|    0|       0.0|[0.80236576190773...|
|          9| 66|          39.67|    1|       0.0|[0.90835550218053...|
|         14| 59|          55.40|    0|       0.0|[0.77361856841920...|
|         20| 59|          22.66|    0|       0.0|[0.80007942897096...|
|         24| 62|          48.09|    0|       0.0|[0.92954412174996...|
|         30| 58|          28.61|    0|       0.0|[0.82260937249839...|
|         36| 61|          26.65|    0|       0.0|[0.83465681163558...|
|         46| 58|          62.18|    0|       0.0|[0.86369507753931...|
|         47| 61| 

In [28]:
# Section 4.2: Hyperparameter Tuning and Cross Validation

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import functions as F
import builtins  # Add this to ensure we use Python's built-in functions

print("Hyperparameter Tuning and Model Selection")
print("=" * 50)

# Create a simpler pipeline for tuning demonstration
tuning_pipeline = Pipeline(stages=[
    final_assembler_simple,
    LogisticRegression(featuresCol="features", labelCol="churn")
])

# Create Parameter Grid
param_grid = ParamGridBuilder() \
    .addGrid(tuning_pipeline.getStages()[-1].regParam, [0.01, 0.1, 1.0]) \
    .addGrid(tuning_pipeline.getStages()[-1].elasticNetParam, [0.0, 0.5, 1.0]) \
    .addGrid(tuning_pipeline.getStages()[-1].maxIter, [10, 20]) \
    .build()

print(f"Parameter grid size: {builtins.len(param_grid)} combinations")
print("Parameter combinations:")
for i, params in enumerate(param_grid[:6]):  # Show first 6 combinations
    print(f"  {i+1}. regParam={params[tuning_pipeline.getStages()[-1].regParam]}, "
          f"elasticNetParam={params[tuning_pipeline.getStages()[-1].elasticNetParam]}, "
          f"maxIter={params[tuning_pipeline.getStages()[-1].maxIter]}")
if builtins.len(param_grid) > 6:
    print(f"  ... and {builtins.len(param_grid) - 6} more combinations")

# Cross Validation Setup
cv_evaluator = BinaryClassificationEvaluator(
    labelCol="churn",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

# Cross Validator
cv = CrossValidator(
    estimator=tuning_pipeline,
    estimatorParamMaps=param_grid,
    evaluator=cv_evaluator,
    numFolds=3,  # 3-fold cross validation
    seed=42
)

print(f"\nRunning {cv.getNumFolds()}-fold cross validation...")
print("This may take a few minutes...")

# Train with Cross Validation
start_time = time.time()
cv_model = cv.fit(pipeline_train)
cv_training_time = time.time() - start_time

print(f"Cross validation completed in {cv_training_time:.2f} seconds")

# Get best model and its performance
best_model = cv_model.bestModel
cv_predictions = best_model.transform(pipeline_test)
cv_auc = cv_evaluator.evaluate(cv_predictions)

print(f"Best model AUC: {cv_auc:.4f}")

# Extract best parameters
best_lr_stage = best_model.stages[-1]  # Last stage is LogisticRegression
print(f"\nBest hyperparameters:")
print(f"  regularization parameter: {best_lr_stage.getRegParam()}")
print(f"  elastic net parameter: {best_lr_stage.getElasticNetParam()}")
print(f"  max iterations: {best_lr_stage.getMaxIter()}")

# Show cross validation metrics for all parameter combinations
print(f"\nCross validation metrics:")
cv_metrics = cv_model.avgMetrics
param_performance = []

for i, (params, metric) in enumerate(zip(param_grid, cv_metrics)):
    lr_stage = tuning_pipeline.getStages()[-1]
    param_performance.append({
        "regParam": float(params[lr_stage.regParam]),
        "elasticNetParam": float(params[lr_stage.elasticNetParam]),
        "maxIter": int(params[lr_stage.maxIter]),
        "CV_AUC": float(builtins.round(metric, 4))
    })

# Convert to DataFrame and show top results
param_df = spark.createDataFrame(param_performance)
print("Top 5 parameter combinations:")
param_df.orderBy(F.desc("CV_AUC")).show(5)

# Train-Validation Split Alternative (faster than CV)
print(f"\nTrain-Validation Split Alternative:")
print("-" * 40)

# Train-Validation Split (faster alternative to cross validation)
tvs = TrainValidationSplit(
    estimator=tuning_pipeline,
    estimatorParamMaps=param_grid[:6],  # Use subset for faster execution
    evaluator=cv_evaluator,
    trainRatio=0.8,
    seed=42
)

print("Running train-validation split...")
start_time = time.time()
tvs_model = tvs.fit(pipeline_train)
tvs_training_time = time.time() - start_time

print(f"Train-validation split completed in {tvs_training_time:.2f} seconds")

# Compare performance
tvs_predictions = tvs_model.bestModel.transform(pipeline_test)
tvs_auc = cv_evaluator.evaluate(tvs_predictions)

print(f"Train-validation split AUC: {tvs_auc:.4f}")

# Performance comparison
print(f"\n" + "=" * 60)
print("HYPERPARAMETER TUNING COMPARISON")
print("=" * 60)

comparison_results = [
    {
        "Method": "Cross Validation (3-fold)",
        "AUC": float(builtins.round(cv_auc, 4)),
        "Training Time": float(builtins.round(cv_training_time, 2)),
        "Parameter Combinations": builtins.len(param_grid)
    },
    {
        "Method": "Train-Validation Split",
        "AUC": float(builtins.round(tvs_auc, 4)),
        "Training Time": float(builtins.round(tvs_training_time, 2)),
        "Parameter Combinations": 6
    }
]

comparison_df = spark.createDataFrame(comparison_results)
comparison_df.show(truncate=False)

print(f"\nRecommendations:")
print(f"- Use Cross Validation for robust model selection")
print(f"- Use Train-Validation Split for faster hyperparameter tuning")
print(f"- Cross Validation provides more reliable performance estimates")
print(f"- Train-Validation Split is {cv_training_time/tvs_training_time:.1f}x faster")

print("\nHyperparameter tuning complete!")

Hyperparameter Tuning and Model Selection
Parameter grid size: 18 combinations
Parameter combinations:
  1. regParam=0.01, elasticNetParam=0.0, maxIter=10
  2. regParam=0.01, elasticNetParam=0.0, maxIter=20
  3. regParam=0.01, elasticNetParam=0.5, maxIter=10
  4. regParam=0.01, elasticNetParam=0.5, maxIter=20
  5. regParam=0.01, elasticNetParam=1.0, maxIter=10
  6. regParam=0.01, elasticNetParam=1.0, maxIter=20
  ... and 12 more combinations

Running 3-fold cross validation...
This may take a few minutes...
Cross validation completed in 30.22 seconds
Best model AUC: 0.4992

Best hyperparameters:
  regularization parameter: 0.01
  elastic net parameter: 1.0
  max iterations: 20

Cross validation metrics:
Top 5 parameter combinations:
Cross validation completed in 30.22 seconds
Best model AUC: 0.4992

Best hyperparameters:
  regularization parameter: 0.01
  elastic net parameter: 1.0
  max iterations: 20

Cross validation metrics:
Top 5 parameter combinations:
+------+---------------+---

In [32]:
# Section 4.3: Model Persistence and Deployment

import os
from pyspark.ml import Pipeline, PipelineModel
import tempfile

print("Model Persistence and Deployment")
print("=" * 50)

# Create temporary directory for model storage
temp_dir = tempfile.mkdtemp()
model_path = os.path.join(temp_dir, "best_churn_model")
pipeline_path = os.path.join(temp_dir, "churn_pipeline")

print(f"Model storage directory: {temp_dir}")

# Save the best cross-validated model
print("\nSaving trained model...")
try:
    # Save the entire pipeline model
    cv_model.bestModel.write().overwrite().save(pipeline_path)
    print(f"Pipeline model saved to: {pipeline_path}")
    
    # Save model metadata
    model_metadata = {
        "model_type": "ChurnPredictionPipeline",
        "algorithm": "LogisticRegression",
        "training_records": pipeline_train.count(),
        "test_auc": cv_auc,
        "best_regParam": best_lr_stage.getRegParam(),
        "best_elasticNetParam": best_lr_stage.getElasticNetParam(),
        "best_maxIter": best_lr_stage.getMaxIter(),
        "cv_folds": cv.getNumFolds(),
        "features": ["age", "monthly_charges", "total_charges", "contract_length", 
                    "support_calls", "payment_delay", "region", "subscription_type"]
    }
    
    print("Model metadata:")
    for key, value in model_metadata.items():
        print(f"  {key}: {value}")
    
except Exception as e:
    print(f"Error saving model: {e}")

# Load the saved model
print(f"\nLoading saved model...")
try:
    # Load the pipeline model
    loaded_model = PipelineModel.load(pipeline_path)
    print("Model loaded successfully")
    
    # Verify loaded model works
    print("Testing loaded model...")
    loaded_predictions = loaded_model.transform(pipeline_test.limit(10))
    
    print("Loaded model predictions:")
    loaded_predictions.select(
        "customer_id", "age", "region", "churn_label", "prediction", "probability"
    ).show(5)
    
    # Verify predictions match original model
    original_sample = cv_predictions.limit(10).collect()
    loaded_sample = loaded_predictions.collect()
    
    predictions_match = all(
        orig.prediction == loaded.prediction 
        for orig, loaded in zip(original_sample, loaded_sample)
    )
    
    print(f"Predictions match original model: {predictions_match}")
    
except Exception as e:
    print(f"Error loading model: {e}")

# Model Deployment Simulation
print(f"\nModel Deployment Simulation:")
print("-" * 40)

# Simulate new customer data for scoring
print("Generating new customer data for scoring...")

new_customers = spark.createDataFrame([
    (90001, 28, 12, 85.0, 1020.0, 6),
    (90002, 45, 24, 45.0, 540.0, 3),
    (90003, 35, 18, 65.0, 1560.0, 8),
    (90004, 52, 30, 95.0, 1140.0, 7),
    (90005, 29, 6, 40.0, 480.0, 4)
], ["customer_id", "age", "tenure_months", "monthly_charges", "total_charges", "num_services"])

# Add churn column (unknown for new customers, set to 0 for pipeline compatibility)
new_customers = new_customers.withColumn("churn", F.lit(0))

print("New customer data:")
new_customers.show()

# Score new customers
print("Scoring new customers...")
new_predictions = loaded_model.transform(new_customers)

print("New customer churn predictions:")
new_predictions.select(
    "customer_id", "age", "monthly_charges", 
    "prediction"
).show()

# Categorize risk levels based on prediction
risk_categorized = new_predictions.withColumn(
    "risk_level",
    F.when(F.col("prediction") == 1, "High").otherwise("Low")
)

print("Customer risk categorization:")
risk_categorized.select(
    "customer_id", "prediction", "risk_level"
).show()

# Risk summary
risk_summary = risk_categorized.groupBy("risk_level").count()
print("Risk level distribution:")
risk_summary.show()

# Production deployment recommendations
print(f"\n" + "=" * 60)
print("PRODUCTION DEPLOYMENT RECOMMENDATIONS")
print("=" * 60)

print("""
Model Deployment Best Practices:

1. MODEL VERSIONING:
   - Use semantic versioning (e.g., 1.0.0, 1.1.0)
   - Track model lineage and training data
   - Maintain model registry with metadata

2. MODEL MONITORING:
   - Monitor prediction drift
   - Track model performance metrics
   - Set up alerts for degraded performance
   - Log prediction requests and responses

3. BATCH SCORING:
   - Schedule regular batch predictions
   - Use data partitioning for large datasets
   - Implement checkpointing for fault tolerance

4. REAL-TIME SCORING:
   - Consider Spark Structured Streaming
   - Implement low-latency prediction endpoints
   - Use model caching for performance

5. MODEL UPDATES:
   - Implement A/B testing for new models
   - Use canary deployments
   - Maintain rollback capabilities
   - Automate retraining pipelines

6. SECURITY:
   - Encrypt model files
   - Implement access controls
   - Audit model usage
   - Protect sensitive features
""")

# Cleanup temporary directory
try:
    import shutil
    shutil.rmtree(temp_dir)
    print(f"Temporary directory cleaned up: {temp_dir}")
except:
    print(f"Manual cleanup required: {temp_dir}")

print("\nModel persistence and deployment complete!")

Model Persistence and Deployment
Model storage directory: /var/folders/kv/9k05f6hn2lx0dq4ylk1spdqc0000gn/T/tmp7i16z5li

Saving trained model...
Pipeline model saved to: /var/folders/kv/9k05f6hn2lx0dq4ylk1spdqc0000gn/T/tmp7i16z5li/churn_pipeline
Model metadata:
  model_type: ChurnPredictionPipeline
  algorithm: LogisticRegression
  training_records: 8046
  test_auc: 0.49916861786962585
  best_regParam: 0.01
  best_elasticNetParam: 1.0
  best_maxIter: 20
  cv_folds: 3
  features: ['age', 'monthly_charges', 'total_charges', 'contract_length', 'support_calls', 'payment_delay', 'region', 'subscription_type']

Loading saved model...
Pipeline model saved to: /var/folders/kv/9k05f6hn2lx0dq4ylk1spdqc0000gn/T/tmp7i16z5li/churn_pipeline
Model metadata:
  model_type: ChurnPredictionPipeline
  algorithm: LogisticRegression
  training_records: 8046
  test_auc: 0.49916861786962585
  best_regParam: 0.01
  best_elasticNetParam: 1.0
  best_maxIter: 20
  cv_folds: 3
  features: ['age', 'monthly_charges',

{"ts": "2025-08-25 23:09:23.707", "level": "ERROR", "logger": "DataFrameQueryContextLogger", "msg": "[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `region` cannot be resolved. Did you mean one of the following? [`age`, `churn`, `prediction`, `features`, `rawPrediction`]. SQLSTATE: 42703", "context": {"file": "jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)", "line": "", "fragment": "col", "errorClass": "UNRESOLVED_COLUMN.WITH_SUGGESTION"}, "exception": {"class": "Py4JJavaError", "msg": "An error occurred while calling o47507.select.\n: org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `region` cannot be resolved. Did you mean one of the following? [`age`, `churn`, `prediction`, `features`, `rawPrediction`]. SQLSTATE: 42703;\n'Project [customer_id#26330L, age#26331, 'region, 'churn_label, prediction#279316, probability#279309]\n+- Project [custome

Model loaded successfully
Testing loaded model...
Loaded model predictions:
Error loading model: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `region` cannot be resolved. Did you mean one of the following? [`age`, `churn`, `prediction`, `features`, `rawPrediction`]. SQLSTATE: 42703;
'Project [customer_id#26330L, age#26331, 'region, 'churn_label, prediction#279316, probability#279309]
+- Project [customer_id#26330L, age#26331, tenure_months#26332, monthly_charges#26333, total_charges#26334, num_services#26335, contract_type#26336, payment_method#26337, internet_service#26338, tech_support#26339, online_backup#26340, churn#26341, features#279302, rawPrediction#279305, probability#279309, UDF(rawPrediction#279305) AS prediction#279316]
   +- Project [customer_id#26330L, age#26331, tenure_months#26332, monthly_charges#26333, total_charges#26334, num_services#26335, contract_type#26336, payment_method#26337, internet_service#26338, tech_support#263

---

# Module 6 Summary: Machine Learning with MLlib

## What We Accomplished

### 🎯 **Learning Objectives Achieved**
- ✅ **Feature Engineering**: Categorical encoding, numerical scaling, feature assembly
- ✅ **Supervised Learning**: Classification and regression algorithms comparison
- ✅ **Unsupervised Learning**: Clustering and dimensionality reduction techniques
- ✅ **ML Pipelines**: End-to-end workflow automation
- ✅ **Model Selection**: Cross-validation and hyperparameter tuning
- ✅ **Model Deployment**: Persistence, loading, and production scoring

### 📊 **Key Results**
- **Classification Models**: Compared 4 algorithms with performance metrics
- **Regression Models**: Evaluated RMSE, MAE, and R² across multiple algorithms
- **Clustering Analysis**: Applied K-Means, Gaussian Mixture, and Bisecting K-Means
- **Dimensionality Reduction**: PCA analysis and feature selection techniques
- **Pipeline Automation**: Complete ML workflow from raw data to predictions
- **Production Ready**: Model persistence and deployment strategies

### 🔧 **Technical Skills Developed**
- MLlib algorithm implementation and evaluation
- Feature engineering pipeline construction
- Cross-validation and hyperparameter optimization
- Model persistence and deployment workflows
- Production ML best practices

## Next Steps

### 🚀 **Module 7: Real-Time Streaming**
- Structured Streaming fundamentals
- Real-time data processing
- Stream-batch integration
- Stateful operations

### 📈 **Module 8: Production Deployment**
- Cluster management
- Performance optimization
- Monitoring and logging
- Production best practices

### 🔄 **Advanced Topics**
- Graph processing with GraphFrames
- Advanced SQL patterns
- Custom ML algorithms
- Integration with external systems

---

**🎉 Congratulations!** You've completed the comprehensive PySpark MLlib module. You now have the skills to build, evaluate, and deploy machine learning models at scale using Apache Spark.

---