# Module 08: PySpark Machine Learning - MLlib Basics

**Difficulty**: ⭐⭐  
**Estimated Time**: 60 minutes  
**Prerequisites**: 
- [Module 03: DataFrames and Datasets](03_dataframes_and_datasets.ipynb)
- [Module 05: DataFrame Operations](05_dataframe_operations.ipynb)
- Basic understanding of machine learning concepts

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the MLlib architecture and the ML Pipeline concept
2. Build simple classification models using Logistic Regression in PySpark
3. Build simple regression models using Linear Regression in PySpark
4. Use feature transformers like StringIndexer, OneHotEncoder, and VectorAssembler
5. Evaluate machine learning models using built-in metrics and methods

## 1. Setup and Introduction

**What is MLlib?**

MLlib is Spark's machine learning library that provides:
- Distributed ML algorithms that scale to large datasets
- A unified API for building ML pipelines
- Feature transformers and extractors
- Model evaluation and tuning utilities

**Two APIs:**
- `spark.ml` - DataFrame-based API (recommended, what we'll use)
- `spark.mllib` - RDD-based API (legacy, in maintenance mode)

**ML Pipeline Architecture:**
- **Transformer**: Converts one DataFrame to another (e.g., VectorAssembler, a trained model)
- **Estimator**: Fits on a DataFrame to produce a Transformer (e.g., StringIndexer, LogisticRegression)
- **Pipeline**: Chains multiple Transformers and Estimators together
- **Parameter**: Configuration for Transformers and Estimators
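
The Estimator/Transformer split is easier to see in a toy version. The sketch below is plain Python, not Spark code: `ToyIndexer` and `ToyIndexerModel` are hypothetical classes that only mimic the `fit()` / `transform()` contract described above.

```python
from dataclasses import dataclass


@dataclass
class ToyIndexerModel:
    """Plays the Transformer role: it maps rows but never learns anything."""
    mapping: dict

    def transform(self, rows):
        return [(value, self.mapping[value]) for value in rows]


class ToyIndexer:
    """Plays the Estimator role: fit() learns state and returns a model."""

    def fit(self, rows):
        # In this toy, the most frequent value gets index 0 and
        # ties are broken alphabetically.
        counts = {}
        for value in rows:
            counts[value] = counts.get(value, 0) + 1
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
        return ToyIndexerModel({value: float(i) for i, value in enumerate(ordered)})


train_rows = ["blue", "red", "blue", "green"]
toy_model = ToyIndexer().fit(train_rows)          # Estimator -> Transformer
indexed = toy_model.transform(["red", "blue"])    # Transformer -> new rows
print(indexed)
```

A Pipeline applies the same idea stage by stage: each Estimator is fitted, and the resulting Transformer prepares the data for the next stage.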

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand, when, round as spark_round
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# ML imports
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.evaluation import RegressionEvaluator

# For generating sample data
import numpy as np
import random

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

In [None]:
# Create Spark session with appropriate memory settings for ML
spark = SparkSession.builder \
    .appName("MLlib Basics") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

# Set log level to reduce verbose output
spark.sparkContext.setLogLevel("ERROR")

print(f"Spark version: {spark.version}")
print("Spark session created successfully!")

## 2. Feature Transformers

Before building models, we need to prepare features. Most `spark.ml` estimators expect the features in a single column of **Vector** type.

### Common Feature Transformers:

**VectorAssembler**: Combines multiple columns into a single vector column
- Input: Multiple numeric columns
- Output: Single `features` column of type Vector

**StringIndexer**: Converts string labels to numeric indices
- Input: String column (e.g., "cat", "dog", "bird")
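- Output: Numeric column (e.g., 0.0, 1.0, 2.0); by default the most frequent label gets index 0
- Raises an error on unseen labels by default; set `handleInvalid` to `"keep"` or `"skip"` to handle them during prediction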

**OneHotEncoder**: Converts categorical indices to binary vectors
- Input: Numeric indices (from StringIndexer)
- Output: Sparse vector with one-hot encoding (by default the last category is dropped, so n categories yield a vector of length n - 1)
- Prevents ordinality assumptions in categorical data

In [None]:
# Create sample data to demonstrate transformers
sample_data = [
    ("blue", "small", 10.5),
    ("red", "medium", 15.2),
    ("blue", "large", 20.0),
    ("green", "small", 8.5),
    ("red", "small", 12.0),
    ("green", "medium", 18.5)
]

df_demo = spark.createDataFrame(sample_data, ["color", "size", "value"])
df_demo.show()

In [None]:
# StringIndexer: Convert categorical strings to numeric indices
color_indexer = StringIndexer(inputCol="color", outputCol="color_index")
size_indexer = StringIndexer(inputCol="size", outputCol="size_index")

# Fit the indexers on the data (learns the unique values)
color_model = color_indexer.fit(df_demo)
size_model = size_indexer.fit(df_demo)

# Transform the data
df_indexed = color_model.transform(df_demo)
df_indexed = size_model.transform(df_indexed)

df_indexed.show()

In [None]:
# OneHotEncoder: Convert indices to binary vectors
# This prevents the model from assuming ordering (e.g., red > blue > green)
color_encoder = OneHotEncoder(inputCol="color_index", outputCol="color_vec")
size_encoder = OneHotEncoder(inputCol="size_index", outputCol="size_vec")

df_encoded = color_encoder.fit(df_indexed).transform(df_indexed)
df_encoded = size_encoder.fit(df_encoded).transform(df_encoded)

df_encoded.select("color", "color_index", "color_vec", "size", "size_index", "size_vec").show(truncate=False)

In [None]:
# VectorAssembler: Combine multiple feature columns into a single vector
# MLlib estimators read their inputs from a single vector column ('features' by default)
assembler = VectorAssembler(
    inputCols=["color_vec", "size_vec", "value"],
    outputCol="features"
)

df_features = assembler.transform(df_encoded)
df_features.select("color", "size", "value", "features").show(truncate=False)

## 3. Classification with Logistic Regression

**Logistic Regression** is a linear classifier used for binary or multi-class classification.

**Use cases:**
- Spam detection (spam vs. not spam)
- Customer churn prediction (churn vs. retain)
- Disease diagnosis (positive vs. negative)

**How it works in PySpark:**
1. Prepare features as a vector column
2. Create a label column (numeric: 0, 1 for binary)
3. Instantiate LogisticRegression estimator
4. Fit the estimator on training data → produces a model (transformer)
5. Use model to make predictions on test data
6. Evaluate using appropriate metrics
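
Behind steps 3 to 5, a binary logistic regression model turns a linear score into a probability with the sigmoid function. This plain-Python sketch shows that arithmetic. The coefficients are not fitted values: they are assumptions that mirror the rule used to generate the synthetic purchase data later in this section.

```python
import math


def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, coefficients, intercept):
    """Probability of the positive class for one row of features."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(margin)


# Assumed weights for (age, income, browsing_time); a fitted model
# would expose its own values as lr_model.coefficients and lr_model.intercept.
coefficients = [-0.01, 0.00001, 0.02]
intercept = -0.8

p_purchase = predict_probability([30, 60000.0, 45.0], coefficients, intercept)
predicted_label = 1.0 if p_purchase > 0.5 else 0.0
print(f"P(purchase) = {p_purchase:.4f}, predicted label = {predicted_label}")
```

The 0.5 cut-off matches the default `threshold` of `LogisticRegression`. Raising or lowering it trades precision against recall.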

In [None]:
# Generate synthetic classification data
# Scenario: Predict customer purchase based on age, income, and browsing time
n_samples = 1000

classification_data = []
for _ in range(n_samples):
    age = np.random.randint(18, 70)
    income = np.random.uniform(20000, 150000)
    browsing_time = np.random.uniform(0, 120)  # minutes
    
    # Create a decision rule with some randomness
    # More likely to purchase if: younger, higher income, more browsing time
    score = (70 - age) * 0.01 + income * 0.00001 + browsing_time * 0.02
    probability = 1 / (1 + np.exp(-score + 1.5))
    purchased = 1 if np.random.random() < probability else 0
    
    classification_data.append((age, float(income), float(browsing_time), purchased))

df_classification = spark.createDataFrame(
    classification_data,
    ["age", "income", "browsing_time", "purchased"]
)

print(f"Total samples: {df_classification.count()}")
print("\nClass distribution:")
df_classification.groupBy("purchased").count().show()

In [None]:
# Display sample data
df_classification.show(10)

# Basic statistics
df_classification.describe().show()

In [None]:
# Prepare features for classification
# Combine all feature columns into a single vector
feature_assembler = VectorAssembler(
    inputCols=["age", "income", "browsing_time"],
    outputCol="features"
)

df_class_features = feature_assembler.transform(df_classification)
df_class_features.select("features", "purchased").show(5, truncate=False)

In [None]:
# Split data into training and test sets
# Roughly 70% for training, 30% for testing (randomSplit weights are approximate)
# A fixed seed makes the split repeatable for the same data and partitioning
train_df, test_df = df_class_features.randomSplit([0.7, 0.3], seed=42)

print(f"Training samples: {train_df.count()}")
print(f"Test samples: {test_df.count()}")

In [None]:
# Create and train Logistic Regression model
# labelCol: the column containing the target variable
# featuresCol: the column containing the feature vector
# maxIter: maximum number of iterations for optimization
lr = LogisticRegression(
    labelCol="purchased",
    featuresCol="features",
    maxIter=10,
    regParam=0.01  # Regularization parameter to prevent overfitting
)

# Fit the model (this is where learning happens)
lr_model = lr.fit(train_df)

print("Model trained successfully!")
print(f"Coefficients: {lr_model.coefficients}")
print(f"Intercept: {lr_model.intercept}")

In [None]:
# Make predictions on test data
predictions = lr_model.transform(test_df)

# Show predictions with probabilities
# rawPrediction: raw margins (for binary logistic regression, [-margin, +margin] in log-odds)
# probability: the model's predicted probability for each class
# prediction: final predicted class
predictions.select("features", "purchased", "rawPrediction", "probability", "prediction").show(10, truncate=False)

In [None]:
# Evaluate the model
# BinaryClassificationEvaluator uses Area Under ROC as default metric
evaluator = BinaryClassificationEvaluator(
    labelCol="purchased",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

auc = evaluator.evaluate(predictions)
print(f"Area Under ROC: {auc:.4f}")

# Also calculate accuracy using MulticlassClassificationEvaluator
multi_evaluator = MulticlassClassificationEvaluator(
    labelCol="purchased",
    predictionCol="prediction",
    metricName="accuracy"
)

accuracy = multi_evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy:.4f}")

# F1 score (the "f1" metric averages per-class F1, weighted by class frequency)
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="purchased",
    predictionCol="prediction",
    metricName="f1"
)
f1 = f1_evaluator.evaluate(predictions)
print(f"F1 Score: {f1:.4f}")
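
Area Under ROC has a useful interpretation: it is the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on made-up scores. Spark's evaluator derives the area from the ROC curve itself, so treat this as an interpretation aid rather than its implementation.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and positive-class scores.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"Pairwise AUC: {toy_auc:.4f}")
```

Here 8 of the 9 positive/negative pairs are ordered correctly. A value of 0.5 would mean the scores rank no better than chance.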

## 4. Regression with Linear Regression

**Linear Regression** predicts a continuous numeric value based on input features.

**Use cases:**
- House price prediction
- Sales forecasting
- Temperature prediction

**Key differences from classification:**
- Label is continuous (not discrete classes)
- Evaluation uses metrics like RMSE, R², MAE
- No probability or class prediction

In [None]:
# Generate synthetic regression data
# Scenario: Predict house price based on size, bedrooms, and age
n_samples = 1000

regression_data = []
for _ in range(n_samples):
    size_sqft = np.random.uniform(500, 3500)
    bedrooms = np.random.randint(1, 6)
    age_years = np.random.randint(0, 50)
    
    # True relationship with some noise
    # Price increases with size and bedrooms, decreases with age
    price = (
        150 * size_sqft +
        50000 * bedrooms -
        1000 * age_years +
        100000 +
        np.random.normal(0, 50000)  # Random noise
    )
    
    regression_data.append((float(size_sqft), bedrooms, age_years, float(price)))

df_regression = spark.createDataFrame(
    regression_data,
    ["size_sqft", "bedrooms", "age_years", "price"]
)

print(f"Total samples: {df_regression.count()}")
df_regression.show(10)

In [None]:
# Summary statistics for regression data
df_regression.describe().show()

In [None]:
# Prepare features for regression
reg_assembler = VectorAssembler(
    inputCols=["size_sqft", "bedrooms", "age_years"],
    outputCol="features"
)

df_reg_features = reg_assembler.transform(df_regression)
df_reg_features.select("features", "price").show(5, truncate=False)

In [None]:
# Split into training and test sets
reg_train, reg_test = df_reg_features.randomSplit([0.7, 0.3], seed=42)

print(f"Training samples: {reg_train.count()}")
print(f"Test samples: {reg_test.count()}")

In [None]:
# Create and train Linear Regression model
lin_reg = LinearRegression(
    labelCol="price",
    featuresCol="features",
    maxIter=10,
    regParam=0.01,  # Regularization strength (pure L2 here because elasticNetParam=0.0)
    elasticNetParam=0.0  # 0 = L2, 1 = L1, between = mix
)

# Fit the model
lin_reg_model = lin_reg.fit(reg_train)

print("Model trained successfully!")
print(f"Coefficients: {lin_reg_model.coefficients}")
print(f"Intercept: {lin_reg_model.intercept}")
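
`regParam` and `elasticNetParam` together define the penalty added to the training loss. The sketch below evaluates the formula given in the Spark documentation, regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2 squared), in plain Python. The weights are made up, and the sketch ignores the feature standardization Spark applies internally.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Penalty term for a coefficient vector (the intercept is not penalized)."""
    l1_norm = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


toy_weights = [3.0, -4.0]  # made-up coefficients

ridge_penalty = elastic_net_penalty(toy_weights, reg_param=0.01, elastic_net_param=0.0)
lasso_penalty = elastic_net_penalty(toy_weights, reg_param=0.01, elastic_net_param=1.0)
mixed_penalty = elastic_net_penalty(toy_weights, reg_param=0.01, elastic_net_param=0.5)

print(f"L2 only: {ridge_penalty:.4f}")
print(f"L1 only: {lasso_penalty:.4f}")
print(f"50/50 mix: {mixed_penalty:.4f}")
```

L1 tends to push some coefficients to exactly zero, while L2 shrinks all of them smoothly, which is why the mix is a tunable choice.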

In [None]:
# Make predictions
reg_predictions = lin_reg_model.transform(reg_test)

# Show actual vs predicted prices
reg_predictions.select("features", "price", "prediction").show(15)

In [None]:
# Evaluate regression model
reg_evaluator = RegressionEvaluator(
    labelCol="price",
    predictionCol="prediction"
)

# RMSE: Root Mean Squared Error (lower is better)
rmse = reg_evaluator.evaluate(reg_predictions, {reg_evaluator.metricName: "rmse"})
print(f"RMSE: ${rmse:,.2f}")

# MAE: Mean Absolute Error (lower is better)
mae = reg_evaluator.evaluate(reg_predictions, {reg_evaluator.metricName: "mae"})
print(f"MAE: ${mae:,.2f}")

# R²: Coefficient of determination (higher is better, 1.0 is perfect)
r2 = reg_evaluator.evaluate(reg_predictions, {reg_evaluator.metricName: "r2"})
print(f"R² Score: {r2:.4f}")

# Training summary (additional metrics from the model itself)
print("\nTraining Summary:")
print(f"Training RMSE: ${lin_reg_model.summary.rootMeanSquaredError:,.2f}")
print(f"Training R²: {lin_reg_model.summary.r2:.4f}")
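
To make the three metrics concrete, here is a plain-Python sketch that computes them by hand on four made-up prices. `regression_metrics` is a hypothetical helper written for this illustration, not part of MLlib.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, MAE, and R² for two equally long lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up house prices and predictions.
toy_actual = [200000.0, 300000.0, 400000.0, 500000.0]
toy_predicted = [210000.0, 290000.0, 420000.0, 480000.0]

toy_metrics = regression_metrics(toy_actual, toy_predicted)
print(toy_metrics)
```

RMSE exceeds MAE here because squaring weights the two 20,000 errors more heavily than the two 10,000 errors.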

## 5. Building ML Pipelines

**Why use Pipelines?**
- Combine multiple steps into a single workflow
- Ensure transformations are applied consistently
- Make it easier to deploy models to production
- Help prevent data leakage (when the pipeline is fit only on training data, so is every transformation in it)

**Pipeline stages:**
1. Feature transformers (StringIndexer, OneHotEncoder, VectorAssembler)
2. Model estimator (LogisticRegression, LinearRegression, etc.)

The pipeline is fitted on training data and produces a PipelineModel that can be used on new data.

In [None]:
# Create sample data with categorical features
pipeline_data = [
    ("male", "bachelor", 35000, 0),
    ("female", "master", 55000, 1),
    ("male", "phd", 75000, 1),
    ("female", "bachelor", 40000, 0),
    ("male", "master", 60000, 1),
    ("female", "phd", 80000, 1),
    ("male", "bachelor", 32000, 0),
    ("female", "master", 58000, 1),
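] * 100  # Repeat to get 800 rows; these are exact duplicates, so test rows also appear in training and the accuracy below is optimistic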

df_pipeline = spark.createDataFrame(
    pipeline_data,
    ["gender", "education", "salary", "promoted"]
)

df_pipeline.show(10)
print(f"Total samples: {df_pipeline.count()}")

In [None]:
# Split the data BEFORE fitting the pipeline
# so that indexers, encoders, and the model learn only from training rows
pipeline_train, pipeline_test = df_pipeline.randomSplit([0.7, 0.3], seed=42)

print(f"Training samples: {pipeline_train.count()}")
print(f"Test samples: {pipeline_test.count()}")

In [None]:
# Build a complete ML pipeline
# Stage 1: Index categorical features
gender_indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
education_indexer = StringIndexer(inputCol="education", outputCol="education_index")

# Stage 2: One-hot encode the indexed features
gender_encoder = OneHotEncoder(inputCol="gender_index", outputCol="gender_vec")
education_encoder = OneHotEncoder(inputCol="education_index", outputCol="education_vec")

# Stage 3: Assemble all features into a vector
pipeline_assembler = VectorAssembler(
    inputCols=["gender_vec", "education_vec", "salary"],
    outputCol="features"
)

# Stage 4: Train the model
pipeline_lr = LogisticRegression(
    labelCol="promoted",
    featuresCol="features",
    maxIter=10
)

# Create the pipeline with all stages
ml_pipeline = Pipeline(stages=[
    gender_indexer,
    education_indexer,
    gender_encoder,
    education_encoder,
    pipeline_assembler,
    pipeline_lr
])

print("Pipeline created with 6 stages")

In [None]:
# Fit the entire pipeline on training data
# This fits each stage sequentially:
# 1. StringIndexers learn the categorical mappings
# 2. OneHotEncoders create the encoding scheme
# 3. VectorAssembler combines features
# 4. LogisticRegression trains the model
pipeline_model = ml_pipeline.fit(pipeline_train)

print("Pipeline fitted successfully!")

In [None]:
# Make predictions using the pipeline model
# The model automatically applies all transformations
pipeline_predictions = pipeline_model.transform(pipeline_test)

# Show original features and predictions
pipeline_predictions.select(
    "gender", "education", "salary", "promoted", "prediction", "probability"
).show(15, truncate=False)

In [None]:
# Evaluate the pipeline model
pipeline_evaluator = MulticlassClassificationEvaluator(
    labelCol="promoted",
    predictionCol="prediction",
    metricName="accuracy"
)

pipeline_accuracy = pipeline_evaluator.evaluate(pipeline_predictions)
print(f"Pipeline Model Accuracy: {pipeline_accuracy:.4f}")

## 6. Exercises

Now it's your turn to practice! Complete the following exercises.

### Exercise 1: Customer Churn Classification

Create a classification model to predict customer churn.

**Tasks:**
1. Generate synthetic customer data with features: monthly_charges, tenure_months, total_charges, churn (0 or 1)
2. Create a logistic regression model to predict churn
3. Evaluate the model using accuracy and AUC metrics
4. Print the model coefficients and interpret which features are most important

In [None]:
# Your code here
# TODO: Generate customer data
# TODO: Prepare features
# TODO: Train logistic regression model
# TODO: Evaluate the model

### Exercise 2: Student Score Prediction

Build a regression model to predict student exam scores.

**Tasks:**
1. Generate student data with: study_hours, previous_score, attendance_rate, final_score
2. Create a linear regression model to predict final_score
3. Evaluate using RMSE, MAE, and R² metrics
4. Make predictions for new students with specific characteristics

In [None]:
# Your code here
# TODO: Generate student data
# TODO: Prepare features and split data
# TODO: Train linear regression model
# TODO: Evaluate and make predictions

### Exercise 3: Complete ML Pipeline

Build a complete pipeline for a multi-class classification problem.

**Tasks:**
1. Create data with categorical features (product_category, customer_type) and numeric features (price, quantity)
2. Predict purchase_rating (1, 2, 3, 4, or 5 stars)
3. Build a pipeline that includes:
   - StringIndexer for categorical features
   - OneHotEncoder for the indexed features
   - VectorAssembler to combine all features
   - LogisticRegression for multi-class classification
4. Evaluate the pipeline using accuracy and F1 score

In [None]:
# Your code here
# TODO: Create sample data with categorical and numeric features
# TODO: Build pipeline with all transformation stages
# TODO: Fit pipeline and make predictions
# TODO: Evaluate multi-class classification performance

## 7. Exercise Solutions

### Solution 1: Customer Churn Classification

In [None]:
# Generate customer churn data
n_customers = 1000
churn_data = []

for _ in range(n_customers):
    monthly_charges = np.random.uniform(20, 100)
    tenure_months = np.random.randint(1, 72)
    total_charges = monthly_charges * tenure_months + np.random.normal(0, 100)
    
    # Higher charges and shorter tenure increase churn probability
    churn_prob = 1 / (1 + np.exp(-(monthly_charges * 0.02 - tenure_months * 0.05 + 1)))
    churn = 1 if np.random.random() < churn_prob else 0
    
    churn_data.append((float(monthly_charges), tenure_months, float(total_charges), churn))

df_churn = spark.createDataFrame(churn_data, ["monthly_charges", "tenure_months", "total_charges", "churn"])

print(f"Total customers: {df_churn.count()}")
print("\nChurn distribution:")
df_churn.groupBy("churn").count().show()
df_churn.show(10)

In [None]:
# Prepare features
churn_assembler = VectorAssembler(
    inputCols=["monthly_charges", "tenure_months", "total_charges"],
    outputCol="features"
)
df_churn_features = churn_assembler.transform(df_churn)

# Split data
churn_train, churn_test = df_churn_features.randomSplit([0.7, 0.3], seed=42)

# Train model
churn_lr = LogisticRegression(labelCol="churn", featuresCol="features", maxIter=10)
churn_model = churn_lr.fit(churn_train)

print("Churn model coefficients:")
print(f"Monthly Charges: {churn_model.coefficients[0]:.6f}")
print(f"Tenure Months: {churn_model.coefficients[1]:.6f}")
print(f"Total Charges: {churn_model.coefficients[2]:.6f}")
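print("\nInterpretation: positive coefficients raise the predicted churn probability, negative ones lower it.")
print("Coefficients are on each feature's original scale, so account for feature scale before comparing magnitudes.")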
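
Raw coefficient size alone does not show which feature matters most, because the features are measured in different units. One common heuristic is to multiply each coefficient by the standard deviation of its feature. The sketch below does that in plain Python. The coefficients and sample values are made up, and `scaled_effects` is a hypothetical helper.

```python
import statistics


def scaled_effects(coefficients, columns):
    """Coefficient times the feature's standard deviation, keyed by feature name."""
    return {
        name: coef * statistics.pstdev(values)
        for (name, values), coef in zip(columns.items(), coefficients)
    }


# Made-up feature samples and coefficients, for illustration only.
toy_columns = {
    "monthly_charges": [20.0, 40.0, 60.0, 80.0],
    "tenure_months": [2.0, 4.0, 6.0, 8.0],
}
toy_coefficients = [0.02, -0.05]

effects = scaled_effects(toy_coefficients, toy_columns)
for name, effect in effects.items():
    print(f"{name}: change in log-odds per standard deviation = {effect:+.4f}")
```

In this toy data `monthly_charges` has the smaller raw coefficient but the larger effect per standard deviation, so ranking features by raw coefficients would mislead.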

In [None]:
# Evaluate
churn_predictions = churn_model.transform(churn_test)

churn_acc_evaluator = MulticlassClassificationEvaluator(
    labelCol="churn", predictionCol="prediction", metricName="accuracy"
)
churn_auc_evaluator = BinaryClassificationEvaluator(
    labelCol="churn", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)

churn_accuracy = churn_acc_evaluator.evaluate(churn_predictions)
churn_auc = churn_auc_evaluator.evaluate(churn_predictions)

print(f"Churn Model Accuracy: {churn_accuracy:.4f}")
print(f"Churn Model AUC: {churn_auc:.4f}")

### Solution 2: Student Score Prediction

In [None]:
# Generate student score data
n_students = 800
student_data = []

for _ in range(n_students):
    study_hours = np.random.uniform(0, 10)
    previous_score = np.random.uniform(40, 100)
    attendance_rate = np.random.uniform(0.5, 1.0)
    
    # Final score depends on all three factors with some noise
    final_score = (
        study_hours * 3.5 +
        previous_score * 0.5 +
        attendance_rate * 20 +
        np.random.normal(0, 5)
    )
    final_score = min(100, max(0, final_score))  # Clamp to 0-100
    
    student_data.append((float(study_hours), float(previous_score), float(attendance_rate), float(final_score)))

df_students = spark.createDataFrame(
    student_data,
    ["study_hours", "previous_score", "attendance_rate", "final_score"]
)

print(f"Total students: {df_students.count()}")
df_students.show(10)
df_students.describe().show()

In [None]:
# Prepare features and train model
student_assembler = VectorAssembler(
    inputCols=["study_hours", "previous_score", "attendance_rate"],
    outputCol="features"
)
df_student_features = student_assembler.transform(df_students)

# Split data
student_train, student_test = df_student_features.randomSplit([0.7, 0.3], seed=42)

# Train linear regression
student_lr = LinearRegression(labelCol="final_score", featuresCol="features", maxIter=10)
student_model = student_lr.fit(student_train)

print("Student score model trained!")
print(f"Coefficients: {student_model.coefficients}")
print(f"Intercept: {student_model.intercept}")

In [None]:
# Evaluate
student_predictions = student_model.transform(student_test)
student_evaluator = RegressionEvaluator(labelCol="final_score", predictionCol="prediction")

student_rmse = student_evaluator.evaluate(student_predictions, {student_evaluator.metricName: "rmse"})
student_mae = student_evaluator.evaluate(student_predictions, {student_evaluator.metricName: "mae"})
student_r2 = student_evaluator.evaluate(student_predictions, {student_evaluator.metricName: "r2"})

print(f"RMSE: {student_rmse:.2f}")
print(f"MAE: {student_mae:.2f}")
print(f"R² Score: {student_r2:.4f}")

# Make predictions for new students
new_students = spark.createDataFrame([
    (8.0, 85.0, 0.95),  # High effort student
    (2.0, 60.0, 0.70),  # Average student
    (5.0, 75.0, 0.85)   # Good student
], ["study_hours", "previous_score", "attendance_rate"])

new_student_features = student_assembler.transform(new_students)
new_predictions = student_model.transform(new_student_features)

print("\nPredictions for new students:")
new_predictions.select("study_hours", "previous_score", "attendance_rate", "prediction").show()

### Solution 3: Complete ML Pipeline

In [None]:
# Generate multi-class classification data
categories = ["Electronics", "Clothing", "Books", "Home"]
customer_types = ["Regular", "Premium", "VIP"]
n_purchases = 1000

purchase_data = []
for _ in range(n_purchases):
    category = random.choice(categories)
    customer_type = random.choice(customer_types)
    price = np.random.uniform(10, 500)
    quantity = np.random.randint(1, 10)
    
    # Rating depends on customer type, price, and category
    base_rating = 3
    if customer_type == "Premium":
        base_rating += 0.5
    elif customer_type == "VIP":
        base_rating += 1.0
    
    if category == "Electronics" and price > 200:
        base_rating += 0.5
    
    # Add randomness and clamp to 1-5
    rating = int(min(5, max(1, base_rating + np.random.normal(0, 0.8))))
    
    purchase_data.append((category, customer_type, float(price), quantity, rating))

df_purchases = spark.createDataFrame(
    purchase_data,
    ["product_category", "customer_type", "price", "quantity", "purchase_rating"]
)

print(f"Total purchases: {df_purchases.count()}")
print("\nRating distribution:")
df_purchases.groupBy("purchase_rating").count().orderBy("purchase_rating").show()
df_purchases.show(10)

In [None]:
# Split data first
purchase_train, purchase_test = df_purchases.randomSplit([0.7, 0.3], seed=42)

# Build complete pipeline
category_indexer = StringIndexer(inputCol="product_category", outputCol="category_index")
customer_indexer = StringIndexer(inputCol="customer_type", outputCol="customer_index")

category_encoder = OneHotEncoder(inputCol="category_index", outputCol="category_vec")
customer_encoder = OneHotEncoder(inputCol="customer_index", outputCol="customer_vec")

purchase_assembler = VectorAssembler(
    inputCols=["category_vec", "customer_vec", "price", "quantity"],
    outputCol="features"
)

# For multi-class, we need to ensure label is indexed (0, 1, 2, 3, 4)
# Subtract 1 from rating to get 0-indexed labels
from pyspark.sql.functions import col
purchase_train = purchase_train.withColumn("label", col("purchase_rating") - 1)
purchase_test = purchase_test.withColumn("label", col("purchase_rating") - 1)

multiclass_lr = LogisticRegression(
    labelCol="label",
    featuresCol="features",
    maxIter=10,
    family="multinomial"  # Explicitly set for multi-class
)

# Create pipeline
purchase_pipeline = Pipeline(stages=[
    category_indexer,
    customer_indexer,
    category_encoder,
    customer_encoder,
    purchase_assembler,
    multiclass_lr
])

print("Purchase rating pipeline created!")

In [None]:
# Fit and evaluate pipeline
purchase_pipeline_model = purchase_pipeline.fit(purchase_train)
purchase_predictions = purchase_pipeline_model.transform(purchase_test)

# Show predictions (add 1 back to get original ratings)
purchase_predictions_display = purchase_predictions.withColumn(
    "predicted_rating", col("prediction") + 1
)
purchase_predictions_display.select(
    "product_category", "customer_type", "price", "quantity", "purchase_rating", "predicted_rating"
).show(15)

# Evaluate
multiclass_acc_eval = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
multiclass_f1_eval = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1"
)

purchase_accuracy = multiclass_acc_eval.evaluate(purchase_predictions)
purchase_f1 = multiclass_f1_eval.evaluate(purchase_predictions)

print(f"\nPurchase Rating Model Accuracy: {purchase_accuracy:.4f}")
print(f"Purchase Rating Model F1 Score: {purchase_f1:.4f}")
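
For multi-class problems, the `f1` metric of `MulticlassClassificationEvaluator` is a weighted average: each class's F1 score counts in proportion to how often that class occurs. This plain-Python sketch reproduces the calculation on eight made-up labels. `weighted_f1` is a hypothetical helper, not an MLlib function.

```python
from collections import Counter


def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of the labels."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, count in support.items():
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = count - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        score += (count / total) * f1
    return score


# Made-up true labels and predictions for three classes.
toy_labels = [0, 0, 0, 0, 1, 1, 2, 2]
toy_preds = [0, 0, 0, 1, 1, 1, 2, 0]

toy_accuracy = sum(1 for y, p in zip(toy_labels, toy_preds) if y == p) / len(toy_labels)
toy_f1 = weighted_f1(toy_labels, toy_preds)
print(f"Accuracy: {toy_accuracy:.4f}")
print(f"Weighted F1: {toy_f1:.4f}")
```

Because of the weighting, a model can reach a high F1 while doing poorly on rare classes, so it is worth checking per-class results when the ratings are imbalanced.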

## 8. Summary

Congratulations! You've learned the fundamentals of machine learning with PySpark MLlib.

### Key Concepts:

1. **MLlib Architecture**:
   - DataFrame-based API (`spark.ml`) is the modern approach
   - Transformers convert DataFrames (e.g., trained models, VectorAssembler)
   - Estimators fit on data to produce Transformers (e.g., StringIndexer, learning algorithms)
   - Pipelines chain multiple stages for reproducible workflows

2. **Feature Preparation**:
   - StringIndexer: Convert categorical strings to numeric indices
   - OneHotEncoder: Create binary vectors from categorical indices
   - VectorAssembler: Combine features into a single vector column
   - Most `spark.ml` estimators expect features in a single vector column

3. **Classification**:
   - Logistic Regression for binary and multi-class problems
   - Evaluation metrics: Accuracy, AUC, F1 score
   - Produces probabilities and class predictions

4. **Regression**:
   - Linear Regression for continuous predictions
   - Evaluation metrics: RMSE, MAE, R²
   - Learns linear relationships between features and target

5. **ML Pipelines**:
   - Combine preprocessing and modeling into one object
   - Prevent data leakage by fitting only on training data
   - Make deployment and reuse easier
   - Ensure consistent transformations across train/test/production

### Best Practices:

- Always split data BEFORE fitting pipelines
- Use appropriate evaluation metrics for your problem
- Set random seeds for reproducibility
- Include regularization to prevent overfitting
- Validate on held-out test data

### What's Next?

In [Module 09: Feature Engineering at Scale](09_feature_engineering_at_scale.ipynb), you'll learn:
- Advanced feature transformations and scaling techniques
- Feature selection methods
- Handling imbalanced datasets
- Custom transformers and feature engineering pipelines

### Additional Resources:

- [PySpark MLlib Guide](https://spark.apache.org/docs/latest/ml-guide.html)
- [ML Pipelines Documentation](https://spark.apache.org/docs/latest/ml-pipeline.html)
- [MLlib API Reference](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html)

In [None]:
# Clean up
spark.stop()
print("Spark session stopped. Great work!")