Traditional Market Basket Analysis (MBA) using FP-Growth identifies frequent product combinations based only on static measures like support and confidence.
However, these static rules overlook how purchasing behavior evolves over time, how strong or stable these associations are, and how they can be used for prediction of future purchases.

This project aims to design a Dynamic Market Basket Framework that not only extracts association rules using FP-Growth but also introduces three novel indices to enhance interpretability: the Affinity Strength Index (ASI), the Temporal Stability Index (TSI), and the Diversity Spread Index (DSI).

Additionally, the project integrates a Machine Learning model to predict the next likely product.

In [None]:
!apt-get update
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null



In [None]:
!pip install -q pyspark


In [None]:
from pyspark.sql import SparkSession

# Create or get existing Spark session
spark = SparkSession.builder \
    .appName("MyColabSparkApp") \
    .getOrCreate()

# Check Spark version
print("✅ Spark version:", spark.version)


✅ Spark version: 3.5.1


In [None]:
# Step 5: Load all CSV files from local Colab working directory
# Make sure you've uploaded them via the Files sidebar or using files.upload()

aisles = spark.read.csv("/content/aisles.csv", header=True, inferSchema=True)
departments = spark.read.csv("/content/departments.csv", header=True, inferSchema=True)
order_products__prior = spark.read.csv("/content/order_products__prior.csv", header=True, inferSchema=True)
order_products__train = spark.read.csv("/content/order_products__train.csv", header=True, inferSchema=True)
orders = spark.read.csv("/content/orders.csv", header=True, inferSchema=True)
products = spark.read.csv("/content/products.csv", header=True, inferSchema=True)


In [None]:
# run this cell first
from pyspark.sql import functions as F
from pyspark.sql.functions import col, explode, array_distinct, size, collect_set, countDistinct
from pyspark.ml.fpm import FPGrowth
from pyspark.sql.window import Window
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator


**1) Combine prior + train and join metadata**

In [None]:
# Create temporary views
aisles.createOrReplaceTempView("aisles")
departments.createOrReplaceTempView("departments")
orders.createOrReplaceTempView("orders")
order_products__prior.createOrReplaceTempView("order_products_prior")
order_products__train.createOrReplaceTempView("order_products_train")
products.createOrReplaceTempView("products")

# Combine prior + train
query_union = """
SELECT * FROM order_products_prior
UNION ALL
SELECT * FROM order_products_train
"""
spark.sql(query_union).createOrReplaceTempView("order_products_all")

# Join all datasets together using INNER JOIN to avoid nulls
query_joined = """
SELECT
    opa.order_id,
    o.user_id,
    opa.product_id,
    p.product_name,
    p.aisle_id,
    p.department_id,
    a.aisle,
    d.department,
    o.order_number,
    o.order_dow,
    o.order_hour_of_day,
    o.days_since_prior_order
FROM order_products_all opa
INNER JOIN products p       ON opa.product_id = p.product_id
INNER JOIN aisles a         ON p.aisle_id = a.aisle_id
INNER JOIN departments d    ON p.department_id = d.department_id
INNER JOIN orders o         ON opa.order_id = o.order_id
"""

# Create the final combined view
instacart_full = spark.sql(query_joined)
instacart_full.createOrReplaceTempView("instacart_full")

# Show a sample
instacart_full.select(
    "order_id", "user_id", "product_id", "product_name", "aisle", "department", "order_dow"
).show(5, truncate=80)


+--------+-------+----------+---------------------------------------------+----------------------+----------+---------+
|order_id|user_id|product_id|                                 product_name|                 aisle|department|order_dow|
+--------+-------+----------+---------------------------------------------+----------------------+----------+---------+
|       6|  22352|     40462|                                      Cleanse|          refrigerated| beverages|        1|
|       6|  22352|     15873|                  Dryer Sheets Geranium Scent|               laundry| household|        1|
|       6|  22352|     41897|Clean Day Lavender Scent Room Freshener Spray|air fresheners candles| household|        1|
|       8|   3107|     23423|                Original Hawaiian Sweet Rolls|            buns rolls|    bakery|        4|
|      14|  18194|     20392|                Hair Bender Whole Bean Coffee|                coffee| beverages|        3|
+--------+-------+----------+-----------

**Preparing your transaction data for market basket analysis**

In [None]:
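# create 'items' as an array of string product_ids per order
# (order_products_all was registered only as a temp view, so load it as a DataFrame)
order_products_all = spark.table("order_products_all")
tx = (order_products_all
      .withColumn("product_id_str", F.col("product_id").cast("string"))
      .groupBy("order_id")
      .agg(collect_set("product_id_str").alias("items")))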

print("Total transactions:", tx.count())
tx.show(5, truncate=False)


Total transactions: 108610
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|order_id|items                                                                                                                                                                             |
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1       |[49683, 22035, 47209, 43633, 49302, 11109, 13176, 10246]                                                                                                                          |
|3       |[46667, 21903, 17461, 32665, 33754, 24838, 17668, 17704]                                                                                                                          |
|5       |[12962, 48825

FREQUENT ITEMS AND ASSOCIATION RULES

In [None]:
fp = FPGrowth(itemsCol="items", minSupport=0.002, minConfidence=0.15)
model = fp.fit(tx)

freq_itemsets = model.freqItemsets  # columns: items, freq
rules = model.associationRules      # columns: antecedent, consequent, confidence, lift, support

print("Frequent itemsets (sample):")
freq_itemsets.orderBy(col("freq").desc()).show(10, truncate=False)

print("Association rules (sample):")
rules.orderBy(col("confidence").desc()).show(10, truncate=False)


Frequent itemsets (sample):
+-------+-----+
|items  |freq |
+-------+-----+
|[24852]|15736|
|[13176]|12871|
|[21137]|8867 |
|[21903]|8135 |
|[47209]|6566 |
|[47626]|6104 |
|[47766]|6084 |
|[16797]|5224 |
|[26209]|4911 |
|[27966]|4648 |
+-------+-----+
only showing top 10 rows

Association rules (sample):
+--------------+----------+-------------------+------------------+---------------------+
|antecedent    |consequent|confidence         |lift              |support              |
+--------------+----------+-------------------+------------------+---------------------+
|[27966, 47209]|[13176]   |0.4994413407821229 |4.214460727398521 |0.004115643126783906 |
|[19057, 21137]|[13176]   |0.4200743494423792 |3.5447342936008703|0.0020808397016849277|
|[5876, 47209] |[13176]   |0.4172461752433936 |3.5208691704750974|0.002762176595156984 |
|[4957]        |[33754]   |0.4152542372881356 |48.91622853781389 |0.0022557775527115367|
|[30391, 47209]|[13176]   |0.41509433962264153|3.502711228841201 |0.002

# METRIC 1 - ASI - Association Strength Index
Support tells you how often A and B occur together.

Confidence tells you how likely B appears when A is bought.

Lift tells you how much stronger that relationship is compared to random chance.

But sometimes:

High support = popular products bought often → may not mean a strong relation.

High confidence = can be misleading if one product is common across all baskets.

**So, ASI was proposed to measure how consistently two items appear together across transactions: of all the orders that contain A or B, it is the share that contain both, i.e. freq(A, B) / (freq(A) + freq(B) - freq(A, B)).**
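As a quick sanity check of these definitions, here is a minimal plain-Python sketch that recomputes support, confidence, lift and ASI for a single pair. The counts are the ones this notebook reports for the pair 4957 / 33754 (with 108610 transactions); the helper name `pair_metrics` is illustrative only and is not part of the Spark pipeline.

```python
def pair_metrics(freq_pair, freq_a, freq_b, total_tx):
    """Return (support, confidence A->B, lift, ASI) for one item pair."""
    support = freq_pair / total_tx              # share of orders with both items
    confidence = freq_pair / freq_a             # P(B in order | A in order)
    lift = confidence / (freq_b / total_tx)     # confidence relative to B's base rate
    # ASI as computed in this notebook: orders with both / orders with either
    asi = freq_pair / (freq_a + freq_b - freq_pair)
    return support, confidence, lift, asi


# Counts for the pair (4957, 33754) as reported in this notebook's outputs
support, confidence, lift, asi = pair_metrics(245, 590, 922, 108610)
print(f"support={support:.4f} confidence={confidence:.4f} lift={lift:.4f} ASI={asi:.4f}")
```

Because the denominator counts every order that contains either item, a very popular item pulls ASI down even when confidence towards it is high, which is the behaviour the text above is asking for.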

In [None]:
# total number of transactions
total_tx = tx.count()

# singleton frequencies
singles = freq_itemsets.filter(size(col("items")) == 1) \
    .select(col("items").getItem(0).alias("item"), col("freq").alias("freq_single"))

# pair frequencies
pairs = freq_itemsets.filter(size(col("items")) == 2) \
    .select(col("items").getItem(0).alias("itemA"), col("items").getItem(1).alias("itemB"), col("freq").alias("freq_pair"))

# Join singles to pairs to compute ASI
asi_df = (pairs.join(singles.withColumnRenamed("item","itemA").withColumnRenamed("freq_single","freqA"), on="itemA")
              .join(singles.withColumnRenamed("item","itemB").withColumnRenamed("freq_single","freqB"), on="itemB"))

# compute ASI
asi_df = asi_df.withColumn(
    "ASI",
    (col("freq_pair")/F.lit(total_tx)) / ((col("freqA") + col("freqB") - col("freq_pair"))/F.lit(total_tx))
)

# Order by ASI descending
asi_df.select("itemA","itemB","freq_pair","freqA","freqB","ASI").orderBy(col("ASI").desc()).show(20, truncate=False)


+-----+-----+---------+-----+-----+-------------------+
|itemA|itemB|freq_pair|freqA|freqB|ASI                |
+-----+-----+---------+-----+-----+-------------------+
|4957 |33754|245      |590  |922  |0.19337016574585636|
|33787|33754|237      |587  |922  |0.18632075471698117|
|2295 |15290|290      |941  |1417 |0.1402321083172147 |
|26209|47626|1182     |4911 |6104 |0.12020746465981896|
|21137|13176|2326     |8867 |12871|0.11982279002678756|
|35221|44632|465      |1608 |2740 |0.11975276847798093|
|21709|35221|292      |1124 |1608 |0.11967213114754098|
|47209|13176|2036     |6566 |12871|0.11700476984081373|
|24964|22935|715      |3543 |3601 |0.11121480790169544|
|31717|26209|704      |2628 |4911 |0.10299926847110462|
|27966|21137|1224     |4648 |8867 |0.0995850622406639 |
|47766|24852|1855     |6084 |15736|0.09291259704482845|
|27966|13176|1487     |4648 |12871|0.09275199600798403|
|21903|13176|1782     |8135 |12871|0.09269662921348315|
|47209|21137|1308     |6566 |8867 |0.09260176991

In [None]:
from pyspark.sql.functions import col, concat_ws, lit

# Compute all metrics (F.round is used so Python's built-in round is not shadowed)
asi_rules = asi_df.withColumn(
    "support", col("freq_pair") / F.lit(total_tx)
).withColumn(
    "confidence_A_to_B", col("freq_pair") / col("freqA")
).withColumn(
    "confidence_B_to_A", col("freq_pair") / col("freqB")
).withColumn(
    "lift",
    (col("freq_pair") / F.lit(total_tx)) /
    ((col("freqA") / F.lit(total_tx)) * (col("freqB") / F.lit(total_tx)))
).withColumn(
    "ASI", F.round(col("ASI"), 4)
).withColumn(
    "support", F.round(col("support"), 4)
).withColumn(
    "confidence_A_to_B", F.round(col("confidence_A_to_B"), 4)
).withColumn(
    "confidence_B_to_A", F.round(col("confidence_B_to_A"), 4)
).withColumn(
    "lift", F.round(col("lift"), 4)
)

# 🪄 Create readable rule strings (like Apriori)
asi_rules = asi_rules.withColumn(
    "Rule_A_to_B", concat_ws(" → ", col("itemA"), col("itemB"))
).withColumn(
    "Rule_B_to_A", concat_ws(" → ", col("itemB"), col("itemA"))
)

# 🔹 Show Top Rules (A → B)
print("🔹 Association Rules (A → B):")
asi_rules.select(
    "Rule_A_to_B", "support", "confidence_A_to_B", "lift", "ASI"
).orderBy(col("ASI").desc()).show(20, truncate=False)

# 🔹 Show Reverse Rules (B → A)
print("🔹 Association Rules (B → A):")
asi_rules.select(
    "Rule_B_to_A", "support", "confidence_B_to_A", "lift", "ASI"
).orderBy(col("ASI").desc()).show(20, truncate=False)


🔹 Association Rules (A → B):
+-------------+-------+-----------------+-------+------+
|Rule_A_to_B  |support|confidence_A_to_B|lift   |ASI   |
+-------------+-------+-----------------+-------+------+
|4957 → 33754 |0.0023 |0.4153           |48.9162|0.1934|
|33787 → 33754|0.0022 |0.4037           |47.5608|0.1863|
|2295 → 15290 |0.0027 |0.3082           |23.6215|0.1402|
|26209 → 47626|0.0109 |0.2407           |4.2826 |0.1202|
|21137 → 13176|0.0214 |0.2623           |2.2136 |0.1198|
|35221 → 44632|0.0043 |0.2892           |11.4627|0.1198|
|21709 → 35221|0.0027 |0.2598           |17.5469|0.1197|
|47209 → 13176|0.0187 |0.3101           |2.6166 |0.117 |
|24964 → 22935|0.0066 |0.2018           |6.0867 |0.1112|
|31717 → 26209|0.0065 |0.2679           |5.9244 |0.103 |
|27966 → 21137|0.0113 |0.2633           |3.2256 |0.0996|
|47766 → 24852|0.0171 |0.3049           |2.1044 |0.0929|
|27966 → 13176|0.0137 |0.3199           |2.6996 |0.0928|
|21903 → 13176|0.0164 |0.2

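# METRIC 2 - TSI - Temporal Stability Index (day of week)

TSI (Temporal Stability Index) measures how consistent a relationship between products is over time.

It doesn't just check how often two items occur together;
it checks whether they occur together regularly and predictably over time. Here "time" is the day of the week: TSI = 1 - sigma/mu, where mu and sigma are the mean and standard deviation of the pair's co-occurrence counts across days.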
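To make the index concrete, here is a small plain-Python sketch of the same calculation (1 minus standard deviation over mean) on two made-up weekly count profiles. The counts are hypothetical, and `tsi` is an illustrative helper; it uses the sample standard deviation, which is also what Spark's `F.stddev` computes.

```python
from statistics import mean, stdev


def tsi(daily_counts):
    """1 - sigma/mu over per-day co-occurrence counts (sample std deviation)."""
    mu = mean(daily_counts)
    if mu == 0:
        return 0.0
    return 1 - stdev(daily_counts) / mu


# Hypothetical co-occurrence counts for one pair, one value per day of week
steady = [10, 10, 10, 10, 10, 10, 10]   # bought together equally often every day
weekend = [1, 1, 1, 1, 1, 20, 20]       # bought together mostly on two days

print("steady :", tsi(steady))
print("weekend:", round(tsi(weekend), 3))
```

A perfectly even profile scores exactly 1, while a profile concentrated on a few days scores much lower and can even go negative, because the standard deviation can exceed the mean.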

In [None]:
from itertools import combinations

from pyspark.sql import functions as F
from pyspark.sql.functions import col, explode, collect_set, concat_ws, lit
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

# ---------------------------------------
# 1) Define UDF to create item pairs per basket
# ---------------------------------------
def generate_pairs(items):
    # Sort first so every unordered pair is always emitted as (smaller, larger).
    # collect_set gives no ordering guarantee, so without sorting the same pair
    # could be counted separately as (A, B) and (B, A).
    return [list(pair) for pair in combinations(sorted(items), 2)]

pairs_udf = udf(generate_pairs, ArrayType(ArrayType(StringType())))

# ---------------------------------------
# 2) Create all item pairs per order
# ---------------------------------------
# instacart_full is used here because it carries order_dow (joined from orders);
# the raw order-products tables do not have that column.
order_pairs = (
    instacart_full
    .withColumn("product_id_str", col("product_id").cast("string"))
    .groupBy("order_id", "order_dow")
    .agg(collect_set("product_id_str").alias("items"))
    .withColumn("pairs", pairs_udf(col("items")))
    .select("order_dow", explode("pairs").alias("pair"))
    .select(
        col("order_dow"),
        col("pair").getItem(0).alias("itemA"),
        col("pair").getItem(1).alias("itemB")
    )
)

# ---------------------------------------
# 3) Frequency of pairs per day-of-week (for TSI)
# ---------------------------------------
pair_dow_counts = (
    order_pairs.groupBy("itemA", "itemB", "order_dow")
    .count()
    .withColumnRenamed("count", "freq_dow")
)

# ---------------------------------------
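# 4) Compute mean, stddev and TSI for each pair
# (days on which a pair never co-occurs have no row in pair_dow_counts,
#  so mu and sigma cover only the days on which the pair was observed)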
# ---------------------------------------
tsi_df = (
    pair_dow_counts.groupBy("itemA", "itemB")
    .agg(
        F.mean("freq_dow").alias("mu"),
        F.stddev("freq_dow").alias("sigma")
    )
    .withColumn(
        "TSI",
        F.when(col("mu") == 0, F.lit(0)).otherwise(1 - (col("sigma") / col("mu")))
    )
)

# ---------------------------------------
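# 5) Compute overall frequencies for support/confidence/lift
# ---------------------------------------
pair_freq = (
    order_pairs.groupBy("itemA", "itemB").count().withColumnRenamed("count", "freq_pair")
)

# Per-item order counts. Counting rows of order_pairs instead would count pair
# memberships, not the number of orders that contain the item.
item_freq = (
    instacart_full
    .select("order_id", col("product_id").cast("string").alias("item"))
    .distinct()
    .groupBy("item")
    .count()
)
freqA = item_freq.select(col("item").alias("itemA"), col("count").alias("freqA"))
freqB = item_freq.select(col("item").alias("itemB"), col("count").alias("freqB"))

# Support is a share of orders, so divide by the number of distinct orders
# (not by the number of distinct days of the week)
total_tx = instacart_full.select("order_id").distinct().count()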

# Join all frequency info
rules_df = (
    pair_freq
    .join(freqA, on="itemA")
    .join(freqB, on="itemB")
    .join(tsi_df.select("itemA", "itemB", "TSI"), on=["itemA", "itemB"], how="left")
)

# ---------------------------------------
# 6) Compute Support, Confidence, Lift
# ---------------------------------------
rules_df = (
    rules_df
    .withColumn("support", col("freq_pair") / lit(total_tx))
    .withColumn("confidence_A_to_B", col("freq_pair") / col("freqA"))
    .withColumn("confidence_B_to_A", col("freq_pair") / col("freqB"))
    .withColumn(
        "lift",
        (col("freq_pair") / lit(total_tx)) /
        ((col("freqA") / lit(total_tx)) * (col("freqB") / lit(total_tx)))
    )
)

# ---------------------------------------
# 7) Make readable rules and show
# ---------------------------------------
rules_df = (
    rules_df
    .withColumn("Rule_A_to_B", concat_ws(" → ", col("itemA"), col("itemB")))
    .withColumn("Rule_B_to_A", concat_ws(" → ", col("itemB"), col("itemA")))
    .select(
        "Rule_A_to_B", "Rule_B_to_A", "support",
        "confidence_A_to_B", "confidence_B_to_A",
        "lift", "TSI"
    )
)

# ---------------------------------------
# 8) Show results
# ---------------------------------------
print("🔹 Association Rules (A → B) with TSI:")
rules_df.orderBy(col("TSI").desc(), col("lift").desc()).show(20, truncate=False)


🔹 Association Rules (A → B) with TSI:
+-------------+-------------+-------+--------------------+--------------------+--------------------+---+
|Rule_A_to_B  |Rule_B_to_A  |support|confidence_A_to_B   |confidence_B_to_A   |lift                |TSI|
+-------------+-------------+-------+--------------------+--------------------+--------------------+---+
|27371 → 27128|27128 → 27371|0.25   |0.058823529411764705|0.5                 |0.11764705882352941 |1.0|
|18375 → 21516|21516 → 18375|0.25   |0.3333333333333333  |0.07692307692307693 |0.10256410256410256 |1.0|
|17158 → 5433 |5433 → 17158 |0.25   |0.2222222222222222  |0.1                 |0.08888888888888889 |1.0|
|27128 → 38347|38347 → 27128|0.25   |0.5                 |0.044444444444444446|0.08888888888888889 |1.0|
|9026 → 47097 |47097 → 9026 |0.25   |0.4                 |0.05263157894736842 |0.08421052631578947 |1.0|
|30972 → 44930|44930 → 30972|0.25   |0.1                 |0.15384615384615385 |0.061538461538

# METRIC 3 - DSI - **Diversity Spread Index (basket diversity score)**

* Basket-level DSI measures how diverse a single shopping basket is, that is, how many different types (departments or categories) of products are present. It is computed as unique departments / (distinct products + 1).
 * It shows how broad or narrow the shopper's interest is within a single purchase.
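As a worked illustration, the sketch below applies the DSI formula used in this notebook's DSI cell (unique departments divided by distinct products plus one) to a single basket in plain Python. The product and department IDs are made up, and `basket_dsi` is an illustrative helper name.

```python
def basket_dsi(lines):
    """DSI for one basket given (product_id, department_id) order lines."""
    unique_dept = len({dept for _, dept in lines})
    total_items = len({prod for prod, _ in lines})
    # The +1 in the denominator mirrors the notebook's formula
    return unique_dept / (total_items + 1)


# Hypothetical basket: three products spread over two departments
narrow_basket = [(101, 4), (102, 4), (103, 16)]
print(basket_dsi(narrow_basket))
```

With three products from two departments the score is 2 / (3 + 1) = 0.5. Note that the +1 means even a basket where every product comes from a different department stays below 1.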

In [None]:
from pyspark.sql.functions import countDistinct, avg, col

# Compute basket diversity for each order
# (instacart_full, the joined view from step 1, supplies order_id, product_id and department_id)
basket_div = (
    instacart_full.groupBy("order_id")
    .agg(
        countDistinct("department_id").alias("unique_dept"),
        countDistinct("product_id").alias("total_items")
    )
    .withColumn("DSI", col("unique_dept") / (col("total_items") + 1))
)

# Compute average DSI per user (optional)
user_div = (
    basket_div.join(
        orders.select("order_id", "user_id"), "order_id", "left"
    )
    .groupBy("user_id")
    .agg(avg("DSI").alias("avg_DSI"))
    .orderBy(col("avg_DSI").desc())
)

# Show results
print("🧺 Basket-level Diversity Spread Index:")
basket_div.show(10, truncate=False)

print("\n👥 User-level Average DSI:")
user_div.show(10, truncate=False)


🧺 Basket-level Diversity Spread Index:
+--------+-----------+-----------+-------------------+
|order_id|unique_dept|total_items|DSI                |
+--------+-----------+-----------+-------------------+
|12799   |7          |13         |0.5                |
|35947   |6          |11         |0.5                |
|37489   |12         |39         |0.3                |
|31983   |4          |12         |0.3076923076923077 |
|4900    |7          |24         |0.28               |
|35071   |2          |3          |0.5                |
|22521   |5          |16         |0.29411764705882354|
|31261   |5          |9          |0.5                |
|12046   |3          |12         |0.23076923076923078|
|20497   |4          |9          |0.4                |
+--------+-----------+-----------+-------------------+
only showing top 10 rows


👥 User-level Average DSI:
+-------+------------------+
|user_id|avg_DSI           |
+-------+------------------+
|20548  |0.8888888888888888|
|14630  |0.88888

The user-level average DSI is the mean DSI across all of a user's baskets.

This gives you an overall user profile:

High average DSI → shopper who buys many types of products (generalist)

Low average DSI → shopper who sticks to one category (specialist)
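The per-user averaging can be sketched in plain Python as follows. The user labels and DSI values are invented for illustration, and `average_dsi_per_user` is a hypothetical helper that mirrors the group-by, average and descending sort done in Spark.

```python
from collections import defaultdict


def average_dsi_per_user(rows):
    """Average basket DSI per user, sorted from most to least diverse."""
    totals = defaultdict(lambda: [0.0, 0])   # user_id -> [sum of DSI, basket count]
    for user_id, dsi in rows:
        totals[user_id][0] += dsi
        totals[user_id][1] += 1
    averages = [(user_id, total / n) for user_id, (total, n) in totals.items()]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Hypothetical (user_id, basket DSI) rows
basket_scores = [
    ("generalist", 0.5), ("generalist", 0.3),
    ("specialist", 0.25), ("specialist", 0.2), ("specialist", 0.15),
]
for user_id, avg_dsi in average_dsi_per_user(basket_scores):
    print(f"{user_id}: {avg_dsi:.2f}")
```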

# GRADIENT BOOSTING CLASSIFIER
  
  **"Predicting Product Reorders Using Gradient Boosted Trees Based on User Purchase Behavior Parameters (like total purchases, cart position, order time, and reorder gap), with Feature Importance Analysis to Identify Key Buying Patterns."**


In [None]:
from pyspark.sql import functions as F
# max/min are not imported by name so Python's built-ins are not shadowed; F.max is used below
from pyspark.sql.functions import col, count, avg

# Merge product and order info for feature creation
train_df = (
    order_products__prior
    .join(orders.select("order_id", "user_id", "order_dow", "order_hour_of_day", "days_since_prior_order"), "order_id", "left")
    .join(products.select("product_id", "aisle_id", "department_id"), "product_id", "left")
    .filter(col("reordered").isNotNull())  # keep only rows with reorder info
)

# Aggregate user-level and product-level behavioral features
features_df = (
    train_df.groupBy("user_id", "product_id")
    .agg(
        count("*").alias("total_purchases"),
        avg("add_to_cart_order").alias("avg_cart_position"),
        avg("order_dow").alias("avg_order_dow"),
        avg("order_hour_of_day").alias("avg_order_hour"),
        avg("days_since_prior_order").alias("avg_days_between_orders"),
        F.max("reordered").alias("label")  # 1 if ever reordered
    )
)


In [None]:
features_df = features_df.fillna({"avg_days_between_orders": 0})


In [None]:
# Fill any remaining nulls with 0
features_df = features_df.fillna({
    "total_purchases": 0,
    "avg_cart_position": 0,
    "avg_order_dow": 0,
    "avg_order_hour": 0,
    "avg_days_between_orders": 0,
    "label": 0
})


In [None]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=[
        "total_purchases",
        "avg_cart_position",
        "avg_order_dow",
        "avg_order_hour",
        "avg_days_between_orders"
    ],
    outputCol="features",
    handleInvalid="keep"   # <- prevents failure if nulls still exist
)

final_df = assembler.transform(features_df).select("user_id", "product_id", "features", "label")


In [None]:
train_data, test_data = final_df.randomSplit([0.8, 0.2], seed=42)

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(
    labelCol="label",
    featuresCol="features",
    maxIter=50,
    maxDepth=5,
    stepSize=0.1,
    seed=42
)

gbt_model = gbt.fit(train_data)


In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = gbt_model.transform(test_data)

# Evaluate area under the ROC curve and area under the precision-recall curve
evaluator_auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator_auc.evaluate(predictions)

evaluator_pr = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderPR")
pr_auc = evaluator_pr.evaluate(predictions)

print(f"✅ AUC: {auc:.4f}")
print(f"✅ PR-AUC: {pr_auc:.4f}")


✅ AUC: 0.7687
✅ PR-AUC: 0.8600


In [None]:
feature_importances = list(zip(
    assembler.getInputCols(),
    gbt_model.featureImportances.toArray()
))

print("\n🔍 Feature Importances:")
for feature, importance in feature_importances:
    print(f"{feature}: {importance:.4f}")



🔍 Feature Importances:
total_purchases: 0.5048
avg_cart_position: 0.0735
avg_order_dow: 0.0641
avg_order_hour: 0.1707
avg_days_between_orders: 0.1870


This extracts how important each feature is for the model. The importance values below are the ones printed by the cell above.

| Feature | Importance | Meaning |
|---|---|---|
| total_purchases | 0.5048 | Most influential: frequent buyers behave consistently. |
| avg_cart_position | 0.0735 | Somewhat relevant: early cart positions may indicate priority. |
| avg_order_dow | 0.0641 | Slight influence: weekday pattern of orders. |
| avg_order_hour | 0.1707 | Moderate: ordering time has some predictive power. |
| avg_days_between_orders | 0.1870 | Significant: separates regular from irregular buyers. |
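For reading the importances at a glance, a small plain-Python sketch like the one below ranks them from most to least influential. The numbers are copied from the feature-importance output printed by this notebook; the variable names are illustrative.

```python
# Importances as printed by the notebook's feature-importance cell
importances = {
    "total_purchases": 0.5048,
    "avg_cart_position": 0.0735,
    "avg_order_dow": 0.0641,
    "avg_order_hour": 0.1707,
    "avg_days_between_orders": 0.1870,
}

# Sort by importance, largest first
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for rank, (feature, score) in enumerate(ranked, start=1):
    print(f"{rank}. {feature:<26}{score:.4f}")
```

The scores sum to roughly 1 (up to rounding), so each value can be read as the feature's share of the model's total split gain.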