## Executive Overview

This project tackles a critical challenge facing modern retailers: how to allocate marketing resources efficiently when a small core of Premium Loyalists generates roughly 76% of revenue, while the broader customer base remains underpenetrated. We're working with the Dunnhumby Complete Journey dataset, two years of real transaction data from 2017 to 2018 covering 2,500 households, to answer five interconnected business questions that span customer segmentation, promotional tactics, pricing strategy, and retention.

The overarching goal is to protect and grow our most valuable segments without eroding margins through unnecessary discounts, while simultaneously identifying which promotional levers (bundles, displays, coupons, and pricing) actually move volume and extend customer lifetime. Every analysis workstream ties back to actionable decisions: which households to target, which products to bundle, how much to rely on price versus merchandising, which coupons to fund, and when to intervene before customers churn.

## Business Context and Motivation

Retailers face a classic paradox: their top customers generate the majority of revenue, yet marketing budgets often spread resources evenly across the entire base. This project quantifies that imbalance and provides a data-driven framework for concentrating efforts where they matter most.

Revenue concentration is a critical issue. Premium Loyalists, roughly 20 to 25% of households, drive 76% of total revenue, so losing even a small fraction of this segment devastates the bottom line.

Promotional inefficiency is another challenge. Without targeting, coupon campaigns reach low propensity households who wouldn't redeem anyway, wasting budget while missing high propensity shoppers who could have converted.

Margin erosion from aggressive discounting pulls down prices for customers who would have paid full retail, especially when price elasticity is mild and demand doesn't surge much from cuts.

Churn blind spots mean retailers often discover a household has churned only after months of inactivity, when reactivation costs far exceed preemptive retention.

By addressing these four pain points across five analytical modules, we build a cohesive playbook that balances short term volume goals with long term profitability and customer equity.

## Data Foundation: Dunnhumby Complete Journey

All analyses draw from the Complete Journey dataset, a comprehensive record of household shopping behavior provided by Dunnhumby. The data spans from day 1 to day 711, covering January 1, 2017 to December 12, 2018, approximately two years of transactions.

We have 2,500 households with complete purchase histories, generating around 2.6 million transaction rows that aggregate into over 275,000 baskets. The products include roughly 92,000 unique product IDs spanning categories from soft drinks to fresh proteins.

The promotional data includes a causal table with flags for displays, weekly mailers, and various campaign types. Coupon information links redemption records to specific campaigns, and demographic data covers homeownership, household composition including whether they have kids, income levels, and age group segments.

The dataset is structured across multiple files. Transaction data provides line item purchases with product ID, quantity, sales value, discounts, and basket ID. Product information includes the hierarchy from department to commodity to subcommodity, along with brand and manufacturer details. Household demographics give us attributes for segmentation and profiling. Causal data contains promotion flags for displays and mailers matched to product, week, and store combinations. Campaign and coupon files include campaign assignments, coupon details, and redemption events.

This richness lets us measure not just what customers buy, but how they respond to pricing, merchandising, and targeted offers. These are exactly the levers marketing teams control.

## Project Structure: Five Analytical Workstreams

The project is organized into five self-contained modules, each answering a distinct business question while feeding insights into the others.

### Customer Value Map: Who Are Our Most Valuable Households?

The business question we're addressing here is who are our most valuable households today, and how are they distributed by size, share, and characteristics?

Our approach uses RFM segmentation, which stands for Recency, Frequency, and Monetary. This scores every household on how recently they shopped, how often they visit, and how much they spend, producing intuitive tiers like Champions, Loyal Customers, At Risk, and Dormant.
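The RFM calculation described above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the tuple layout `(household_id, basket_id, day, sales_value)`, the helper names `rfm_table` and `tertile_scores`, the three-level scoring, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict

def rfm_table(transactions, snapshot_day):
    """Return {household: (recency_days, frequency_baskets, monetary_total)}.

    transactions: iterable of (household_id, basket_id, day, sales_value) rows
    (a hypothetical layout chosen for this sketch).
    """
    last_day = {}
    baskets = defaultdict(set)
    spend = defaultdict(float)
    for hh, basket, day, value in transactions:
        last_day[hh] = max(day, last_day.get(hh, day))
        baskets[hh].add(basket)          # count distinct trips, not line items
        spend[hh] += value
    return {
        hh: (snapshot_day - last_day[hh], len(baskets[hh]), round(spend[hh], 2))
        for hh in last_day
    }

def tertile_scores(values, higher_is_better=True):
    """Rank households and assign a score from 1 (worst) to 3 (best)."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    n = len(ordered)
    return {hh: 1 + (3 * i) // n for i, hh in enumerate(ordered)}

# Made-up sample rows; day 711 is the last day of the observation window.
transactions = [
    (1, "b1", 700, 40.0), (1, "b2", 705, 60.0), (1, "b2", 705, 10.0),
    (2, "b3", 300, 15.0),
    (3, "b4", 650, 30.0), (3, "b5", 500, 20.0),
]
rfm = rfm_table(transactions, snapshot_day=711)
recency_score = tertile_scores({hh: r for hh, (r, f, m) in rfm.items()},
                               higher_is_better=False)  # fewer days since last trip is better
monetary_score = tertile_scores({hh: m for hh, (r, f, m) in rfm.items()})
print(rfm[1], recency_score[1], monetary_score[1])
```

Recency is scored in reverse because a smaller gap since the last trip signals a healthier household. Tier labels such as Champions or Dormant would then be assigned from combinations of the three scores.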

We then use K-means clustering to refine these RFM segments by grouping households on normalized RFM dimensions. This reveals four clusters that map cleanly to strategic priorities. Cluster 2 represents Premium Loyalists, containing 392 households that account for 47% of revenue. Cluster 0 represents Growth Regulars who are nurture candidates. Cluster 3 is the Reactivation Pool, with 86% Dormant households. Cluster 1 contains Lost or Churned customers with minimal activity.
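To make the clustering step concrete, here is a small standard-library sketch of K-means on standardized RFM values. It is illustrative only: the six RFM rows are invented, it uses two clusters rather than the four found in the project, and the deterministic farthest-point initialization is a simplification chosen so the example is repeatable. A library implementation would normally replace `kmeans` in practice.

```python
import math

def standardize(rows):
    """Z-score each column so no single RFM dimension dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(r, means, stds)) for r in rows]

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical (recency_days, trips, total_spend) rows: three loyal, three dormant.
rfm_rows = [(5, 120, 4000.0), (8, 110, 3600.0), (12, 100, 3900.0),
            (300, 6, 150.0), (350, 4, 90.0), (400, 3, 60.0)]
labels, centroids = kmeans(standardize(rfm_rows), k=2)
print(labels)
```

Standardizing first matters because raw spend is measured in dollars while recency is measured in days. Without it, the spend column would dominate the Euclidean distance.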

The key outputs include cluster profiles with median spend, trip frequency, and demographic overlays such as homeowners and households with kids. This creates actionable segmentation for downstream coupon targeting and churn intervention.

### Bundle Analysis: Which Products Should We Co-Promote?

The business question here is, given that Premium Loyalists generate roughly 76% of revenue, which high support product and category bundles within Dunnhumby baskets deserve prioritized co-promotional shelf space and tailored bundles?

We apply market basket analysis with the Apriori algorithm to mine association rules from 275,000 baskets. Rules reveal which categories consistently appear together. For example, when cereal and milk are purchased, bread follows with 81% confidence and a lift of 3, meaning bread appears three times as often as its baseline purchase rate would predict.
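The three metrics used to rank rules (support, confidence, and lift) can be computed directly from basket counts. The sketch below evaluates a single rule on a handful of invented baskets. It does not implement Apriori's candidate generation and pruning, and the function name `rule_metrics` is hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Assumes the antecedent appears in at least one basket.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_a = sum(antecedent <= b for b in baskets)                   # baskets with the antecedent
    n_c = sum(consequent <= b for b in baskets)                   # baskets with the consequent
    n_both = sum((antecedent | consequent) <= b for b in baskets)
    support = n_both / n                # how common the full bundle is
    confidence = n_both / n_a           # P(consequent | antecedent)
    lift = confidence / (n_c / n)       # confidence relative to the consequent's base rate
    return support, confidence, lift

# Invented baskets for illustration only.
baskets = [
    {"cereal", "milk", "bread"},
    {"cereal", "milk", "bread"},
    {"cereal", "milk"},
    {"bread", "soda"},
    {"soda"},
    {"milk", "soda"},
    {"soda", "chips"},
    {"chips"},
]
support, confidence, lift = rule_metrics(baskets, {"cereal", "milk"}, {"bread"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the consequent shows up more often alongside the antecedent than it does across all baskets, which is the signal that justifies co-promotion.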

We focus on high support anchors like soft drinks, dairy, bakery bread, and proteins to ensure bundles resonate with most shoppers, not just niche segments. Basket size analysis confirms that Premium Loyalists average 9 unique items per trip, so bundles stay tight with 3 core products to leave room for additional purchases.

The key outputs are the top 10 association rules ranked by lift and support, covering breakfast kits with cereal, bread, and milk, deli lunch bundles with meats, bread, and snacks, pasta night combos with pasta, cheese, and milk, and snack pairings like baked goods with soft drinks. We also create network visualizations showing which categories cluster together, guiding cross-aisle display design.

### Price Elasticity and In-Store Support: Price or Promote?

The business question is whether we should rely more on pricing moves or on in-store support like displays and weekly mailers to grow volume in our top performing categories and manufacturer brands.

We use log-log regression to estimate price elasticity and promotional lift for top categories such as soft drinks, beef, and dairy, as well as leading manufacturers. The model structure takes the log of quantity as the dependent variable, with independent variables including the log of price, display flags, mailer flags, and controls.

The price coefficient is interpreted directly as elasticity. Values between negative 0.5 and negative 0.9 indicate inelastic demand, meaning a 10% price hike trims volume by only about 5 to 9%. Display and mailer coefficients translate to percentage lifts, often 15 to 30%, quantifying the incremental volume from merchandising.
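A short sketch shows why the log-log slope is the elasticity and how a flag coefficient converts to a percentage lift. It is deliberately simplified: it fits a single-regressor OLS on synthetic data generated with a known elasticity of negative 0.7, and it omits the display flags, mailer flags, and controls that the full model includes. All function names are hypothetical.

```python
import math

def ols_slope_intercept(x, y):
    """Closed-form simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic demand curve with a built-in elasticity of -0.7 (not project data).
prices = [1.00, 1.25, 1.50, 1.75, 2.00]
quantities = [1000 * p ** -0.7 for p in prices]

elasticity, _ = ols_slope_intercept([math.log(p) for p in prices],
                                    [math.log(q) for q in quantities])

def volume_change(elasticity, price_change):
    """Exact volume change implied by a constant-elasticity model."""
    return (1 + price_change) ** elasticity - 1

def lift_from_flag(beta):
    """Percentage lift implied by a 0/1 flag coefficient when the outcome is log quantity."""
    return math.exp(beta) - 1

print(round(elasticity, 3))                      # recovers the built-in slope
print(round(volume_change(elasticity, 0.10), 4)) # effect of a 10% price increase
print(round(lift_from_flag(0.20), 4))            # lift for a coefficient of 0.20
```

The "10% hike trims volume by elasticity times 10%" reading is a first-order approximation. The exact figure from `volume_change` is slightly smaller in magnitude, and the gap widens for larger price moves. Likewise, a flag coefficient only equals the percentage lift approximately when it is small, so `exp(beta) - 1` is the safer conversion.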

The key outputs include an elasticity classification showing which categories tolerate price increases versus which require careful pricing. We also provide lift estimates for displays and mailers, revealing that in-store support consistently outperforms discounting for driving short term volume without margin erosion.

### Coupon Redemption Propensity: Who to Target?

The business question is which households should receive coupons to maximize redemption rates and campaign-driven spend without wasting budget on low propensity shoppers?

We use logistic regression to predict household level redemption probability using features like recent spend, trip frequency, historical coupon usage, and campaign type. Lift analysis ranks households by predicted score and compares top deciles to random targeting. The top 30%, which includes deciles 1 through 3, captures around 68% of all redeemers, a lift of roughly 2 times over random targeting.
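The capture rate and lift figures come from a simple ranking calculation, sketched below on ten invented households. The scores stand in for model output, and the function name `capture_and_lift` is hypothetical.

```python
def capture_and_lift(scores, redeemed, top_fraction):
    """Share of all redeemers found in the top-scored fraction, and lift vs random targeting."""
    ranked = sorted(zip(scores, redeemed), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * top_fraction)
    captured = sum(flag for _, flag in ranked[:cutoff])
    capture_rate = captured / sum(redeemed)
    # Random targeting of the same fraction would capture about top_fraction of redeemers.
    return capture_rate, capture_rate / top_fraction

# Invented predicted scores and observed redemptions (1 = redeemed).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
redeemed = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

capture_rate, lift = capture_and_lift(scores, redeemed, top_fraction=0.30)
print(f"capture={capture_rate:.2f} lift={lift:.2f}")
```

By the same arithmetic, capturing 68% of redeemers in the top 30% of households corresponds to a lift of about 0.68 / 0.30, or a little over 2.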

We perform dual objective optimization to balance redemption volume, measured as coupon take rate, against campaign spend, measured as incremental revenue. We prioritize high score households in mid to high spend quartiles.

The key outputs include a scoring model with an AUC of around 0.75 that ranks households for coupon allocation. Campaign type recommendations show TypeA campaigns convert 2 times better than TypeB or TypeC, so we should fund TypeA first. Spend lift charts show which score deciles deliver both redemptions and meaningful revenue gains.

### Time to Churn: When to Intervene?

The business question is which customer segments and engagement levers like coupons, campaigns, and spend intensity meaningfully extend household survival, so we know whom to target and which promotional tactics actually drive profitable retention?

We use Kaplan-Meier survival curves to visualize retention over the 711 day observation window, comparing coupon redeemers versus non-redeemers and campaign reached versus unreached households. Descriptively, survival rates for coupon users run 15 to 30 percentage points higher than for non-users by week 90.
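The Kaplan-Meier estimator behind these curves is a running product over observed churn times. The following sketch uses six invented households. How churn is defined (for example, a minimum inactivity gap) and the time unit are assumptions that the real analysis would need to fix first.

```python
def kaplan_meier(durations, churned):
    """Return [(time, survival_probability)] at each observed churn time.

    durations: time until churn, or until observation ended for censored households.
    churned:   True if churn was observed, False if the household was censored.
    """
    event_times = sorted({t for t, e in zip(durations, churned) if e})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(d >= t for d in durations)   # still under observation just before t
        events = sum(d == t and e for d, e in zip(durations, churned))
        survival *= 1 - events / at_risk           # conditional survival through t
        curve.append((t, survival))
    return curve

# Invented durations (in weeks) for illustration only.
durations = [10, 20, 20, 30, 40, 50]
churned = [True, True, False, True, False, False]

curve = kaplan_meier(durations, churned)
for t, s in curve:
    print(t, round(s, 3))
```

Censored households, those still active when the window closes, stay in the at-risk count until their last observed time. This is what lets the estimator use partial histories instead of discarding them.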

The Cox proportional hazards model then estimates each driver's effect while controlling for trips, spend, and demographics. The key finding is that log trips and log total spend are the only significant protective factors, with hazard ratios around 0.57 and 0.67, while campaign and coupon flags lose significance once behavior is controlled.
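To read these hazard ratios, it helps to write the model out. The form below is the standard Cox specification. The covariate list is abbreviated, and the use of natural logs is an assumption about how the features were built.

$$
h(t \mid x) = h_0(t)\,\exp\!\big(\beta_{\text{trips}} \log(\text{trips}) + \beta_{\text{spend}} \log(\text{spend}) + \dots\big), \qquad \mathrm{HR}_j = e^{\beta_j}
$$

A hazard ratio of 0.57 on log trips therefore means that a one-unit increase in log trips (multiplying trips by about 2.72) is associated with a 43% lower churn hazard at any point in time, holding the other covariates fixed. Under the same assumption, doubling trips scales the hazard by

$$
\frac{h_{\text{doubled}}(t)}{h_{\text{original}}(t)} = e^{\beta_{\text{trips}} \ln 2} = \mathrm{HR}_{\text{trips}}^{\,\ln 2} \approx 0.57^{0.693} \approx 0.68
$$

which corresponds to a churn hazard roughly 32% lower.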

Log-rank tests confirm that Age Group 6 households churn faster than younger cohorts, and the gap persists even after adjusting for spend.

The key outputs include survival curves segmented by engagement level and demographics. Cox hazard ratios show that campaigns and coupons only extend survival when they trigger incremental trips or spend, not just by targeting already loyal customers. For intervention timing, households at risk with declining trip frequency and low recent spend should receive preemptive offers before they slide into the churn zone.

## Analytical Methodology Overview

Each workstream employs a specific statistical or machine learning technique aligned with course lectures.

The Customer Value Map uses RFM scoring and K-means clustering, referencing Lectures 2 and 3, with key metrics including Silhouette score, cluster size, and revenue share.

Bundle Analysis applies Apriori association rules based on market basket principles, measuring support, confidence, and lift.

Price Elasticity uses log-log OLS regression with Ridge and Lasso validation from Lecture 6, focusing on elasticity coefficients, R squared values, and lift percentages.

Coupon Propensity applies logistic regression with ROC and AUC evaluation from classification fundamentals, tracking AUC, lift tables, and precision recall metrics.

Time to Churn uses Kaplan-Meier estimators and Cox proportional hazards from Lecture 10, examining hazard ratios, log-rank p-values, and median survival times.

### Common Design Principles

Reproducibility is fundamental. Every notebook starts by loading a cleaned, versioned dataset in parquet or CSV format produced by a separate data prep module. This ensures all analyses reference the same households and time windows.

Interpretability drives our approach. We prioritize business friendly metrics like elasticity, lift percentages, and hazard ratios over black box model performance. Every coefficient or cluster profile ties to a concrete action, such as adding a product to a bundle or targeting specific households.

Validation ensures robustness. Models include diagnostics like Ridge and Lasso for multicollinearity, ROC curves for classification, and log-rank tests for survival analysis to confirm that findings are not artifacts of overfitting.

Segmentation refines our insights. Where possible, we break out results by customer tier, comparing Premium Loyalists to occasional shoppers, or by campaign type, comparing TypeA to TypeB and TypeC, so recommendations stay tailored to segment economics.

Visual storytelling makes insights accessible. Each analysis pairs quantitative tables with charts like heatmaps, scatter plots, lift curves, and survival plots that make patterns obvious to non-technical stakeholders.

## Cross-Module Integration

While each workstream stands alone, they reinforce each other to form a cohesive strategy.

Segmentation feeds targeting. The Customer Value Map identifies Premium Loyalists in Cluster 2 and at-risk households in Cluster 0 declining into Cluster 3. These segments become the priority targets in coupon propensity scoring and churn intervention.

Bundles inform promotions. The high support categories from bundle analysis, including soft drinks, dairy, and proteins, are the same categories that show inelastic demand in the price elasticity study. This alignment suggests we should design cross aisle displays pairing these anchors, like cereal with milk and soft drinks, and rely on display lifts of 20 to 30% rather than price cuts to move volume.

Coupons drive retention, but only if behavioral. The coupon propensity model ranks households for redemption likelihood, but the survival analysis clarifies that coupons only extend lifetime if they trigger incremental trips or spend. So we should target low engagement households with high propensity scores, not just reward already loyal Champions who would have shopped anyway.

Pricing and promotions balance margin. The Price Elasticity module shows that most top categories are inelastic, with elasticities between negative 0.5 and negative 0.9, giving us headroom to stabilize or slightly raise prices without losing much volume. Meanwhile, displays and mailers deliver strong lifts of 15 to 30%, so we can grow baskets profitably by reallocating dollars from discounts to merchandising.

Together, these five modules answer the strategic question: How do we allocate finite marketing resources, including shelf space, mailer features, coupon budgets, and pricing flexibility, to maximize both short term revenue and long term customer equity, while protecting the Premium Loyalist core that drives 76% of sales?

## Navigation and Next Steps

The project is structured to be read sequentially or by individual interest.

Start with the Customer Value Map to understand who the high value households are and how they're distributed. This establishes the who for all downstream targeting.

Move to Bundle Analysis to see which product combinations these households already buy together, setting the stage for promotional design.

Continue to Price Elasticity to quantify whether price or merchandising drives more volume for those bundles, informing the promotional mix.

Review Coupon Propensity to learn how to allocate limited coupon budgets to households most likely to redeem, maximizing ROI.

Finish with Time to Churn to understand which engagement levers actually extend customer lifetime, distinguishing tactics that cause retention from those that merely correlate with it.

Each module concludes with actionable recommendations tied back to the data, and the final Strategic Conclusions notebook synthesizes all five workstreams into an integrated marketing playbook.

Let's begin the journey.