# ML Layer – Baseline Repeat Purchase Prediction

## Purpose
This notebook trains a baseline Logistic Regression model to predict
whether a customer will make a repeat purchase within 30 days.

The model uses customer-level features from the Gold layer and is
tracked using MLflow on Databricks.

## Input
- Gold table:
  workspace.repeat_purchase.gold_customer_features

## Target Variable
- repeat_purchase_label
  - 1 = repeat purchase within 30 days
  - 0 = no repeat purchase observed

## Model Choice
- Logistic Regression (baseline, interpretable)

## Tracking
- MLflow tracks the trained model and its evaluation metric (AUC)

## Notes
- Focus is on clarity and correctness, not model complexity


In [0]:
# Load final customer-level features from Gold table

df_gold = spark.table(
    "workspace.repeat_purchase.gold_customer_features"
)


In [0]:
# Select features that describe customer behavior
# - frequency (total_orders)
# - monetary value (total_spent, avg_order_value)
# - purchase regularity (avg_days_between_orders)
# - customer longevity (active_days)

feature_cols = [
    "total_orders",
    "total_spent",
    "avg_order_value",
    "avg_days_between_orders",
    "active_days"
]

target_col = "repeat_purchase_label"

df_ml = df_gold.select(
    *feature_cols,
    target_col
)


In [0]:
# Customers with only one order have no order gap.
# Assign a large value to indicate very infrequent purchases.

from pyspark.sql.functions import col, when

df_ml = df_ml.withColumn(
    "avg_days_between_orders",
    when(col("avg_days_between_orders").isNull(), 999)
    .otherwise(col("avg_days_between_orders"))
)


In [0]:
# Split data into training and testing sets

train_df, test_df = df_ml.randomSplit(
    [0.8, 0.2],
    seed=42
)


In [0]:
# Spark ML requires features to be assembled into a single vector

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

train_vec = assembler.transform(train_df)
test_vec = assembler.transform(test_df)


In [0]:
# Logistic Regression chosen as a simple, interpretable baseline model

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(
    featuresCol="features",
    labelCol=target_col
)
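
To make the "interpretable" claim concrete, the following is a minimal sketch of how a logistic regression turns a weighted sum of features into a probability, and how a single coefficient reads as an odds ratio. The coefficients, intercept, and customer values are hypothetical and are not taken from the fitted model; `predict_probability` is an illustrative helper, not part of Spark ML.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Apply the logistic (sigmoid) function to a linear combination of features."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, NOT taken from the fitted model
example_coefficients = {"total_orders": 0.5, "avg_days_between_orders": -0.01}
example_intercept = -1.0

# Hypothetical customer
example_customer = {"total_orders": 4, "avg_days_between_orders": 30}

p = predict_probability(example_customer, example_coefficients, example_intercept)

# exp(coefficient) is the multiplicative change in the odds of a repeat
# purchase for a one-unit increase in that feature, other features held fixed
odds_ratio_total_orders = math.exp(example_coefficients["total_orders"])

print(f"Predicted probability: {p:.3f}")
print(f"Odds ratio for total_orders: {odds_ratio_total_orders:.3f}")
```

Because the features here are unscaled, the magnitudes of real coefficients would not be directly comparable across features; only the sign and the per-unit odds ratio read cleanly.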


In [0]:
# Point MLflow at a Unity Catalog volume for its temporary storage

import os
import mlflow
import mlflow.spark

os.environ["MLFLOW_DFS_TMP"] = "/Volumes/workspace/repeat_purchase/raw_data/mlflow_tmp"


In [0]:
# Ensure no active MLflow run is open (safe for re-runs)

if mlflow.active_run() is not None:
    mlflow.end_run()


In [0]:
# Train model

lr_model = lr.fit(train_vec)


In [0]:
# Generate predictions on test data

predictions = lr_model.transform(test_vec)


In [0]:
# Evaluate model performance on the test set using AUC (area under the ROC curve)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol=target_col,
    metricName="areaUnderROC"
)

auc = evaluator.evaluate(predictions)

print(f"AUC Score: {auc}")


AUC Score: 0.968471071726706
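
One way to read this number: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity by brute force on made-up labels and scores. It is a pure-Python illustration, not how `BinaryClassificationEvaluator` computes the metric internally, and `pairwise_auc` is a hypothetical helper name.

```python
from itertools import product

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, for illustration only
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.1]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"Toy AUC: {toy_auc:.4f}")
```

In the toy data, five of the six positive/negative pairs are ordered correctly. The brute-force loop is quadratic in the number of rows, so it is only suitable for small samples.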


In [0]:
# Log the trained model and its test AUC to a named MLflow run

with mlflow.start_run(run_name="baseline_logistic_regression"):

    mlflow.spark.log_model(
        lr_model,
        artifact_path="model"
    )

    mlflow.log_metric("AUC", auc)




In [0]:
# Sanity check: compare actual vs predicted labels

display(
    predictions.groupBy(
        "repeat_purchase_label",
        "prediction"
    ).count()
)


repeat_purchase_label,prediction,count
0,1.0,53
1,0.0,36
1,1.0,322
0,0.0,469
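
The four counts above form a confusion matrix at the model's default decision threshold, so accuracy, precision, and recall can be derived from them directly. The following sketch hard-codes the counts from the output above; the variable names are illustrative.

```python
# Counts copied from the sanity-check output above
tn = 469  # label 0, prediction 0
fp = 53   # label 0, prediction 1
fn = 36   # label 1, prediction 0
tp = 322  # label 1, prediction 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all test rows predicted correctly
precision = tp / (tp + fp)            # share of predicted repeaters who did repeat
recall = tp / (tp + fn)               # share of actual repeaters the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Test rows: {total}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

Unlike AUC, these values depend on the decision threshold, so they would change if the threshold were moved.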


## Final Notes

- A simple Logistic Regression model was trained to predict repeat purchases within 30 days.
- On the held-out 20% test split, the model reaches an AUC of about 0.968, which indicates strong separation between customers who return and those who do not.
- Predictions were checked by cross-tabulating actual against predicted labels on the test set: 791 of 880 rows (about 90%) were classified correctly.
- This model is meant as a clear baseline and can be improved in future work.
