# Elist E-Commerce Recommendation Engine

This notebook builds a recommendation engine that delivers personalized product suggestions to brand-new members at sign-up. Instead of waiting for browsing or purchase history, it uses information available at sign-up (date, platform, channel, geography, loyalty status) to predict a likely first purchase and return a compact Top-K list that encourages and accelerates the first order.

Operationally, it runs as a simple, reliable workflow: train once, save the model, then do fast lookups whenever needed. After each run, the system automatically emails the recommendations (e.g., “Top choices for your first order”) through the existing email service, so results reach the user immediately. This closes the cold-start gap in my stack, plugs cleanly into onboarding and CRM triggers, and gives the marketing/loyalty programs a measurable lever to lift revenue without heavy data collection or complex behavior tracking.

To prove it moves the needle, it will launch with an A/B test: randomly split new sign-ups 50/50 into Control (standard welcome email) and Variant (same email + Top-3 recommendations). Keep send time, subject, and layout identical; only the recommendations block changes. 
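
One way to get a stable 50/50 split is to hash each new member's ID, so the same person always lands in the same arm even if the job reruns. The sketch below is only an illustration: the function name `assign_arm`, the salt string, and the `user-<n>` IDs are assumptions, not part of the existing pipeline.

```python
import hashlib

def assign_arm(user_id: str, salt: str = "welcome-recs-v1") -> str:
    """Deterministically assign a sign-up to 'control' or 'variant' (roughly 50/50)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # The hash behaves like a uniform random integer, so its parity splits users about evenly.
    return "variant" if int(digest, 16) % 2 else "control"

# Hypothetical user IDs, used only to check that the split is close to even.
arms = [assign_arm(f"user-{i}") for i in range(10_000)]
variant_share = arms.count("variant") / len(arms)
print(f"variant share: {variant_share:.3f}")
```

Changing the salt reshuffles the assignment, which is handy if a later experiment should not reuse the same split.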

Primary KPI: first-order conversion within 7 days. Secondary KPIs: revenue per recipient, average order value, email click-through on the recommended items, plus guardrails like unsubscribe/complaint rates. Run until the test reaches a reasonable sample size (e.g., ~5–10k recipients or ~2 weeks, whichever comes first), then review results overall and by key segments (country, loyalty).

Success would be a clear lift in conversion rate (target +3–5% or better); if positive, roll out to 100% of new users.
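
To judge whether an observed lift is more than noise, a two-proportion z-test on the primary KPI is one common option. The sketch below assumes that approach. The helper name `two_proportion_ztest` and the counts in the example are placeholders for illustration, not results from this project.

```python
import math

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (absolute lift, z statistic, two-sided p-value) for B versus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Placeholder counts: Control = 200 of 5,000 converted, Variant = 250 of 5,000.
lift, z, p_value = two_proportion_ztest(200, 5000, 250, 5000)
print(f"absolute lift: {lift:.4f}, z: {z:.2f}, p-value: {p_value:.4f}")
```

The same function can be applied per segment (country, loyalty), keeping in mind that smaller segments give noisier estimates.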

In [None]:
# Import packages

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import top_k_accuracy_score
import joblib
from datetime import datetime
from pathlib import Path  # used for ARTIFACTS_PATH in the persistence cell

In [71]:
# Load data

path = r"c:\Users\danac\OneDrive - OneWorkplace\Documents\Elist\Elist ML Data.csv"

keep_cols = [
    "PURCHASE_TS", "PRODUCT_NAME", 
    "PURCHASE_PLATFORM", "MARKETING_CHANNEL",
    "ACCOUNT_CREATION_METHOD", "COUNTRY_CODE",
    "Region", "LOYALTY_PROGRAM"
]

data = pd.read_csv(path, usecols=keep_cols, encoding="utf-8-sig")

# Normalize names

data.columns = (
    data.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

data = data.replace(r"^\s*$", pd.NA, regex=True)
print(data.head(3))

  purchase_ts                 product_name purchase_platform  \
0    1/1/2019     Apple Airpods Headphones           website   
1    1/1/2019  Samsung Charging Cable Pack           website   
2    1/1/2019     Apple Airpods Headphones           website   

  marketing_channel account_creation_method country_code region  \
0            direct                 desktop           IT   EMEA   
1         affiliate                 unknown           US    NaN   
2            direct                 desktop           US    NaN   

   loyalty_program  
0                0  
1                0  
2                0  


In [72]:
# Parsing dates and feature extraction

data["purchase_ts"] = pd.to_datetime(data["purchase_ts"], errors="coerce")
bad_ts = data["purchase_ts"].isna().sum()
if bad_ts:
    print("Rows with unparseable purchase_ts (dropping):", bad_ts)
    data = data[~data["purchase_ts"].isna()].copy()

data["ts_year"]  = data["purchase_ts"].dt.year
data["ts_month"] = data["purchase_ts"].dt.month
print("Added time features:", ["ts_year", "ts_month"])

Added time features: ['ts_year', 'ts_month']


In [73]:
# Drop rows with a missing target (product_name)

before = len(data)
data = data[~data["product_name"].isna()].copy()
print("Dropped rows:", before - len(data))

print("Rows after target/date cleaning:", len(data))

Dropped rows: 0
Rows after target/date cleaning: 108124


In [74]:
# Define feature sets

numeric_features = [c for c in ["ts_year", "ts_month"] if c in data.columns]
categorical_features = [c for c in [
    "purchase_platform", "marketing_channel", "account_creation_method",
    "country_code", "region", "loyalty_program"
] if c in data.columns]

X = data[numeric_features + categorical_features].copy()
y = data["product_name"].astype(str)

print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)
print("X shape:", X.shape, "| Unique classes:", y.nunique())

Numeric features: ['ts_year', 'ts_month']
Categorical features: ['purchase_platform', 'marketing_channel', 'account_creation_method', 'country_code', 'region', 'loyalty_program']
X shape: (108124, 8) | Unique classes: 8


In [75]:
# Encoding target

le = LabelEncoder()
y_enc = le.fit_transform(y)

In [76]:
# Train / Test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc, test_size=0.2, random_state=42, stratify=y_enc
)

X_train.shape, X_test.shape

((86499, 8), (21625, 8))

In [77]:
# Preprocessing

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
])

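cat_pipe = Pipeline([
    # sparse_output replaces the `sparse` argument (deprecated in scikit-learn 1.2, removed in 1.4);
    # dense output is needed because HistGradientBoostingClassifier does not accept sparse input.
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])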

preprocess = ColumnTransformer([
    ("num", num_pipe, numeric_features),
    ("cat", cat_pipe, categorical_features),
], remainder="drop")
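
Because the encoder uses `handle_unknown="ignore"`, a category never seen during fitting is encoded as all zeros rather than raising an error. The toy example below illustrates this. The value `"mobile app"` is made up for the demonstration, and the default sparse output is converted with `.toarray()` so the snippet does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training column; "mobile app" is a hypothetical category for illustration.
train = pd.DataFrame({"purchase_platform": ["website", "mobile app", "website"]})
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# A known category sets exactly one indicator column.
seen = enc.transform(pd.DataFrame({"purchase_platform": ["website"]})).toarray()
# An unseen spelling ("Web") matches no category, so every indicator is zero.
unseen = enc.transform(pd.DataFrame({"purchase_platform": ["Web"]})).toarray()

print(enc.categories_)
print("seen:  ", seen)
print("unseen:", unseen)
```

In practice this means inference inputs should use the same spelling and casing as the training data, otherwise the feature silently contributes nothing.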

## Model

I chose the Histogram-based Gradient Boosting Classifier largely because it provides predict_proba, which lets me rank products by likelihood and optimize for top-K recommendations. This model also performs well on tabular data and captures non-linear patterns and interactions between features without heavy feature engineering. It is also memory-efficient and works well with one-hot-encoded categorical features.
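
To make the ranking idea concrete, here is a small sketch of how a probability matrix turns into top-K picks and a top-K accuracy figure. The probabilities and labels are made up, and the manual calculation is meant to mirror what `top_k_accuracy_score` measures rather than replace it.

```python
import numpy as np

# Made-up class probabilities for 3 customers over 4 products (each row sums to 1).
proba = np.array([
    [0.10, 0.60, 0.25, 0.05],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.15, 0.30, 0.50],
])
y_true = np.array([2, 3, 3])  # index of the product each customer actually bought

k = 2
# Sort each row ascending, reverse to descending, and keep the first k column indices.
topk_idx = np.argsort(proba, axis=1)[:, ::-1][:, :k]
# A row counts as a hit if the true product appears anywhere in its top k.
hits = (topk_idx == y_true[:, None]).any(axis=1)
topk_accuracy = hits.mean()

print(topk_idx)
print(f"top-{k} accuracy: {topk_accuracy:.3f}")
```

Here the first and third customers are hits and the second is a miss, so two of three rows count toward top-2 accuracy.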

In [None]:
# HGB Classification Model

RANDOM_STATE = 42

pipe = Pipeline([
    ("pre", preprocess),
    ("clf", HistGradientBoostingClassifier(
        early_stopping=True,
        validation_fraction=0.10,
        random_state=RANDOM_STATE
    )),
])

def top3_scorer(estimator, X, y):
    proba = estimator.predict_proba(X)
    return top_k_accuracy_score(y, proba, k=3)

param_grid = {
    "clf__learning_rate":     [0.05, 0.07, 0.09],
    "clf__max_iter":          [400, 500, 600],    
    "clf__max_leaf_nodes":    [12, 27, 42],
    "clf__min_samples_leaf":  [1, 3, 5],
    "clf__l2_regularization": [0.01, 0.02, 0.03],
    "clf__max_depth":         [None]
}

# GridSearchCV

cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=RANDOM_STATE)
gs = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring=top3_scorer,
    refit=True,
    cv=cv,
    n_jobs=8,
    verbose=2,
)

In [None]:
# Fit Model

gs.fit(X_train, y_train)

Fitting 6 folds for each of 243 candidates, totalling 1458 fits
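
The fit count in the log follows directly from the grid: the number of candidates is the product of the option counts per hyperparameter, multiplied by the number of CV folds. A quick check, reusing the grid from the cell above:

```python
import math

param_grid = {
    "clf__learning_rate":     [0.05, 0.07, 0.09],
    "clf__max_iter":          [400, 500, 600],
    "clf__max_leaf_nodes":    [12, 27, 42],
    "clf__min_samples_leaf":  [1, 3, 5],
    "clf__l2_regularization": [0.01, 0.02, 0.03],
    "clf__max_depth":         [None],
}

# One candidate per combination of values: 3 * 3 * 3 * 3 * 3 * 1.
n_candidates = math.prod(len(values) for values in param_grid.values())
n_splits = 6
n_fits = n_candidates * n_splits
print(n_candidates, n_fits)  # 243 1458
```

With `refit=True`, GridSearchCV also fits the best configuration once more on the full training set after the search. Adding a fourth value to any one list would raise the total to 324 candidates, so the grid grows quickly.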


In [None]:
# Model Results

print("Best params (Top-3):", gs.best_params_)
print("Best CV Top-3 accuracy:", gs.best_score_)

best_model = gs.best_estimator_
y_proba = best_model.predict_proba(X_test)

print(f"Test Top-2 accuracy: {top_k_accuracy_score(y_test, y_proba, k=2):.4f}")
print(f"Test Top-3 accuracy: {top_k_accuracy_score(y_test, y_proba, k=3):.4f}")

Best params (Top-3): {'clf__l2_regularization': 0.02, 'clf__learning_rate': 0.05, 'clf__max_depth': None, 'clf__max_iter': 400, 'clf__max_leaf_nodes': 12, 'clf__min_samples_leaf': 1}
Best CV Top-3 accuracy: 0.9451785745537541
Test Top-2 accuracy: 0.8205
Test Top-3 accuracy: 0.9432


In [59]:
# Persistence

ARTIFACTS_PATH = Path("models/recommendation_engine.joblib")
ARTIFACTS_PATH.parent.mkdir(parents=True, exist_ok=True)

joblib.dump({
    "model": best_model,
    "label_encoder": le,
    "feature_cols": list(X.columns),
}, ARTIFACTS_PATH)

['models\\recommendation_engine.joblib']

In [None]:
# Inference

def recommend_topk(
    context: dict,
    when: str | datetime | None = None,
    topk: int = 3,
    artifacts_path: Path = ARTIFACTS_PATH,
):
    art = joblib.load(artifacts_path)
    model = art["model"]
    le = art["label_encoder"]
    feature_cols = art["feature_cols"]

    # Derive time features
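    # Timestamp.now(tz="UTC") avoids Timestamp.utcnow(), which is deprecated in recent pandas releases
    ts = pd.to_datetime(when) if when is not None else pd.Timestamp.now(tz="UTC")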
    base = {"ts_year": int(ts.year), "ts_month": int(ts.month)}

    # Build one-row input
    row = {c: base.get(c, None) for c in feature_cols}
    for k, v in (context or {}).items():
        if k in row:
            row[k] = v
    X_one = pd.DataFrame([row], columns=feature_cols)

    proba = model.predict_proba(X_one)[0]
    k = min(int(topk), len(proba))
    top_idx = np.argsort(proba)[-k:][::-1]

    labels = le.inverse_transform(top_idx)
    scores = proba[top_idx]
    return [(str(lbl), float(s)) for lbl, s in zip(labels, scores)]
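
`recommend_topk` reloads the artifact from disk on every call. If lookups happen many times within one process, caching the loaded bundle may help keep them fast. The sketch below shows one possible way with `functools.lru_cache`. The helper name `load_artifacts` is an assumption, and the default path simply mirrors `ARTIFACTS_PATH`.

```python
from functools import lru_cache

import joblib

@lru_cache(maxsize=1)
def load_artifacts(path: str = "models/recommendation_engine.joblib"):
    """Load the saved bundle once per process and reuse it on later calls."""
    # The path is taken as a string so that it is hashable and usable as a cache key.
    return joblib.load(path)
```

Inside `recommend_topk`, `joblib.load(artifacts_path)` could then become `load_artifacts(str(artifacts_path))`. After retraining, `load_artifacts.cache_clear()` forces a fresh load.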

In [63]:
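# Sanity check
# Note: the sample rows printed earlier use values such as "website" and "direct",
# and loyalty_program is stored as 0/1. Values not seen in training are encoded as
# all zeros (handle_unknown="ignore"), so real inputs should match the training categories.
recommend_topk({
    "purchase_platform": "Web",
    "marketing_channel": "Email",
    "account_creation_method": "Email",
    "country_code": "US",
    "region": "North America",
    "loyalty_program": "Yes"
}, when="2024-11-15T14:30:00Z", topk=3)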

[('Apple Airpods Headphones', 0.6421616021059674),
 ('27in 4k gaming monitor', 0.309085911284575),
 ('Macbook Air Laptop', 0.024213171740366413)]
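
To hand these results to the email step, the (label, score) pairs need to be turned into a message body. The sketch below uses only the standard library to build the message. The function name `build_welcome_email`, the addresses, and the body wording are assumptions, the scores are rounded from the output above, and actually sending the message is left to whatever email service is already in place.

```python
from email.message import EmailMessage

def build_welcome_email(to_addr: str, recs, sender: str = "hello@example.com") -> EmailMessage:
    """Build (but do not send) a welcome email listing the recommended products."""
    msg = EmailMessage()
    msg["Subject"] = "Top choices for your first order"
    msg["From"] = sender
    msg["To"] = to_addr
    # Show only the product names, in ranked order; scores stay internal.
    lines = [f"{rank}. {name}" for rank, (name, _score) in enumerate(recs, start=1)]
    msg.set_content("Welcome! Here are our top picks for you:\n\n" + "\n".join(lines))
    return msg

# Example using the sanity-check output (scores rounded).
recs = [
    ("Apple Airpods Headphones", 0.64),
    ("27in 4k gaming monitor", 0.31),
    ("Macbook Air Laptop", 0.02),
]
msg = build_welcome_email("new.member@example.com", recs)
print(msg.get_content())
```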