# Logistic Regression Baseline - Customer Purchase Propensity Prediction

This notebook trains a Logistic Regression model to predict whether a customer will purchase a product after adding it to cart.

## Plan:
1. Load data from the Feast feature repository (read directly from its parquet source file)
2. Preprocessing: StandardScaler for numerical, OneHotEncoder for categorical
3. Train/Val/Test split: 64%/16%/20%
4. Regularization tuning on validation set
5. Evaluate with Accuracy, Precision, Recall, F1, AUC-ROC
6. Save metrics to JSON

In [12]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
from datetime import datetime

from feast import FeatureStore

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    classification_report,
    confusion_matrix
)

import warnings
# Note: this silences all warnings, including scikit-learn's ConvergenceWarning
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Load Data from Feast Feature Store

In [13]:
# Define paths
FEATURE_REPO_PATH = Path("../../data_pipeline/propensity_feature_store/propensity_features/feature_repo")
FEATURE_STORE_YAML = FEATURE_REPO_PATH / "feature_store.yaml"

print(f"Feature store path: {FEATURE_STORE_YAML}")
print(f"Exists: {FEATURE_STORE_YAML.exists()}")

# Initialize Feast Feature Store
store = FeatureStore(repo_path=str(FEATURE_REPO_PATH))
print("\nFeast Feature Store initialized successfully!")
print(f"Project: {store.project}")

Feature store path: ../../data_pipeline/propensity_feature_store/propensity_features/feature_repo/feature_store.yaml
Exists: True

Feast Feature Store initialized successfully!
Project: propensity_features


In [14]:
# Load data directly from parquet file (faster than get_historical_features for full dataset)
PARQUET_PATH = FEATURE_REPO_PATH / "data" / "processed_purchase_propensity_data_v1.parquet"

print(f"Loading data from: {PARQUET_PATH}")
df = pd.read_parquet(PARQUET_PATH)

print(f"\nDataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")

Loading data from: ../../data_pipeline/propensity_feature_store/propensity_features/feature_repo/data/processed_purchase_propensity_data_v1.parquet

Dataset shape: (2933439, 11)

Columns: ['user_id', 'product_id', 'event_timestamp', 'created_timestamp', 'category_code_level1', 'category_code_level2', 'brand', 'event_weekday', 'price', 'activity_count', 'is_purchased']

Data types:
user_id                          int64
product_id                       int64
event_timestamp         datetime64[ns]
created_timestamp       datetime64[us]
category_code_level1            object
category_code_level2            object
brand                           object
event_weekday                    int64
price                          float64
activity_count                   int64
is_purchased                     int64
dtype: object


In [15]:
# Explore the data
print("First 5 rows:")
display(df.head())

print(f"\nTarget distribution (is_purchased):")
print(df['is_purchased'].value_counts())
print(f"\nPositive rate: {df['is_purchased'].mean():.4f} ({df['is_purchased'].mean()*100:.2f}%)")

First 5 rows:


|   | user_id | product_id | event_timestamp | created_timestamp | category_code_level1 | category_code_level2 | brand | event_weekday | price | activity_count | is_purchased |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 515903856 | 2601552 | 2019-11-17 00:11:39 | 2026-01-18 22:17:22.150556 | unknown | unknown | gorenje | 6 | 486.24 | 6 | 0 |
| 1 | 516301799 | 12702930 | 2019-11-12 15:40:15 | 2026-01-18 22:17:22.150556 | unknown | unknown | cordiant | 1 | 35.78 | 2 | 0 |
| 2 | 516301799 | 12702930 | 2019-11-12 15:41:46 | 2026-01-18 22:17:22.150556 | unknown | unknown | cordiant | 1 | 35.78 | 6 | 0 |
| 3 | 516301799 | 12702930 | 2019-11-12 15:42:05 | 2026-01-18 22:17:22.150556 | unknown | unknown | cordiant | 1 | 35.78 | 8 | 0 |
| 4 | 561066382 | 3800966 | 2019-11-15 23:36:25 | 2026-01-18 22:17:22.150556 | appliances | iron | elenberg | 4 | 20.57 | 2 | 0 |



Target distribution (is_purchased):
is_purchased
0    2170105
1     763334
Name: count, dtype: int64

Positive rate: 0.2602 (26.02%)
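
For context when reading the accuracy figures later in the notebook, it helps to know what a trivial classifier would score. The sketch below only re-does the arithmetic on the class counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
n_negative = 2_170_105  # is_purchased == 0
n_positive = 763_334    # is_purchased == 1
n_total = n_negative + n_positive

positive_rate = n_positive / n_total

# A classifier that always predicts "not purchased" is correct on every
# negative row and wrong on every positive row.
majority_baseline_accuracy = n_negative / n_total

print(f"Total rows:                 {n_total:,}")
print(f"Positive rate:              {positive_rate:.4f}")
print(f"Majority-baseline accuracy: {majority_baseline_accuracy:.4f}")
```

Because the always-negative baseline is already right about 74% of the time, accuracy alone says little here, which is one reason to track AUC-ROC and macro-averaged metrics as well.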


In [16]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

print(f"\nUnique values per categorical column:")
for col in ['brand', 'category_code_level1', 'category_code_level2']:
    print(f"  {col}: {df[col].nunique()} unique values")

Missing values:
user_id                 0
product_id              0
event_timestamp         0
created_timestamp       0
category_code_level1    0
category_code_level2    0
brand                   0
event_weekday           0
price                   0
activity_count          0
is_purchased            0
dtype: int64

Unique values per categorical column:
  brand: 3058 unique values
  category_code_level1: 14 unique values
  category_code_level2: 58 unique values


## 2. Preprocessing Pipeline

- **Numerical features**: StandardScaler for `price`, `activity_count`, `event_weekday`
- **Categorical features**: OneHotEncoder for `brand`, `category_code_level1`, `category_code_level2`
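
To make the two transformations concrete, here is a minimal pure-Python sketch of what they compute. It is not scikit-learn's implementation: `standardize` and `one_hot` are hypothetical helpers, and the toy values are the `price` and `brand` entries from the five rows shown by `df.head()`. The z-score uses the population standard deviation, and an unseen category encodes to all zeros, which is the behaviour `handle_unknown='ignore'` is meant to give.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(value, categories):
    """Encode value against a fixed category list; unseen values give all zeros."""
    return [1 if value == category else 0 for category in categories]


# Prices and brands taken from the five rows shown by df.head().
prices = [486.24, 35.78, 35.78, 35.78, 20.57]
brands = ["gorenje", "cordiant", "cordiant", "cordiant", "elenberg"]

z_prices = standardize(prices)
known_brands = sorted(set(brands))  # categories learned at "fit" time

print([round(z, 3) for z in z_prices])
print(one_hot("cordiant", known_brands))
print(one_hot("some_new_brand", known_brands))  # hypothetical unseen brand
```

After standardization the values have mean 0 and unit variance, so features on very different scales (price versus a small integer count) contribute comparably to the regularization penalty.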

In [17]:
# Define feature columns
NUMERICAL_FEATURES = ['price', 'activity_count', 'event_weekday']
CATEGORICAL_FEATURES = ['brand', 'category_code_level1', 'category_code_level2']
TARGET = 'is_purchased'

ALL_FEATURES = NUMERICAL_FEATURES + CATEGORICAL_FEATURES

print(f"Numerical features: {NUMERICAL_FEATURES}")
print(f"Categorical features: {CATEGORICAL_FEATURES}")
print(f"Target: {TARGET}")

Numerical features: ['price', 'activity_count', 'event_weekday']
Categorical features: ['brand', 'category_code_level1', 'category_code_level2']
Target: is_purchased


In [18]:
# Prepare X and y
X = df[ALL_FEATURES].copy()
y = df[TARGET].copy()

# Convert categorical columns to string type for OneHotEncoder
for col in CATEGORICAL_FEATURES:
    X[col] = X[col].astype(str)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"\nX dtypes:\n{X.dtypes}")

X shape: (2933439, 6)
y shape: (2933439,)

X dtypes:
price                   float64
activity_count            int64
event_weekday             int64
brand                    object
category_code_level1     object
category_code_level2     object
dtype: object


## 3. Train/Validation/Test Split (64%/16%/20%)

In [19]:
# First split: 80% train+val, 20% test
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: 80%/20% of the remaining 80%, i.e. 64% train and 16% validation of the total
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, random_state=42, stratify=y_train_val
)

print(f"Training set:   {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"Test set:       {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

print(f"\nClass distribution:")
print(f"  Train - Positive rate: {y_train.mean():.4f}")
print(f"  Val   - Positive rate: {y_val.mean():.4f}")
print(f"  Test  - Positive rate: {y_test.mean():.4f}")

Training set:   1,877,400 samples (64.0%)
Validation set: 469,351 samples (16.0%)
Test set:       586,688 samples (20.0%)

Class distribution:
  Train - Positive rate: 0.2602
  Val   - Positive rate: 0.2602
  Test  - Positive rate: 0.2602


In [20]:
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), NUMERICAL_FEATURES),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), CATEGORICAL_FEATURES)
    ],
    remainder='drop'
)

print("Preprocessor created!")
print(preprocessor)

Preprocessor created!
ColumnTransformer(transformers=[('num', StandardScaler(),
                                 ['price', 'activity_count', 'event_weekday']),
                                ('cat',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse_output=False),
                                 ['brand', 'category_code_level1',
                                  'category_code_level2'])])
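
With `sparse_output=False`, the encoder materialises every one-hot column as a dense array. A rough back-of-the-envelope estimate of what that means for the training split is sketched below. It uses the unique-value counts and training-set size printed earlier, and assumes 8-byte floats; the training split may contain slightly fewer categories than the full dataset, so treat the dense figure as an upper bound. The sparse figure further assumes a CSR-style layout with 8-byte values and 4-byte column indices.

```python
# Figures copied from earlier notebook outputs.
n_train_rows = 1_877_400
n_one_hot_cols = 3058 + 14 + 58  # brand + category level 1 + category level 2
n_numeric_cols = 3               # price, activity_count, event_weekday
n_cols = n_one_hot_cols + n_numeric_cols

# Dense: every cell is stored (assumption: float64, 8 bytes per value).
dense_bytes = n_train_rows * n_cols * 8

# Sparse: only non-zeros are stored. Each row has at most 3 one-hot ones
# plus 3 numeric values (assumption: 8-byte value + 4-byte column index).
nonzeros_per_row = 3 + n_numeric_cols
sparse_bytes_estimate = n_train_rows * nonzeros_per_row * (8 + 4)

print(f"Columns after preprocessing: {n_cols:,}")
print(f"Dense matrix:  ~{dense_bytes / 1e9:.1f} GB")
print(f"Sparse matrix: ~{sparse_bytes_estimate / 1e6:.0f} MB")
```

If memory or training time becomes a problem, leaving `sparse_output` at its default (`True`) is one option worth trying, since `LogisticRegression` with the `lbfgs` solver accepts sparse input.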


## 4. Regularization Tuning on Validation Set

Try different values of C (inverse of regularization strength) and select the best one based on validation AUC-ROC.
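
One common way to write the L2-penalized objective that C controls is shown below. This is a sketch of the textbook formulation, not necessarily the exact scaling scikit-learn uses internally. Here $y_i \in \{-1, +1\}$ is the label, $x_i$ the preprocessed feature vector, and $s_i$ a per-sample weight (the class weight of row $i$ when `class_weight='balanced'` is set):

$$
\min_{w,\,b}\; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_{i=1}^{n} s_i \,\log\!\bigl(1 + \exp\bigl(-y_i\,(x_i^\top w + b)\bigr)\bigr)
$$

C multiplies the data-fit term, so a small C lets the penalty $\tfrac{1}{2}\lVert w \rVert_2^2$ dominate (strong regularization, coefficients shrink toward zero), while a large C lets the model fit the training data more closely.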

In [21]:
from sklearn.base import clone

# Define C values to try
C_VALUES = [0.001, 0.01, 0.1, 1, 10, 100]

# Store results
tuning_results = []

print("Starting regularization tuning...")
print("="*60)

for C in C_VALUES:
    print(f"\nTraining with C={C}...")
    
    # Create pipeline with current C; clone the preprocessor so each
    # stored pipeline owns its own fitted transformers
    pipeline = Pipeline([
        ('preprocessor', clone(preprocessor)),
        ('classifier', LogisticRegression(
            C=C,
            solver='lbfgs',
            max_iter=1000,
            class_weight='balanced',
            random_state=42,
            n_jobs=-1
        ))
    ])
    
    # Fit on training data
    pipeline.fit(X_train, y_train)
    
    # Predict on validation set
    y_val_pred = pipeline.predict(X_val)
    y_val_proba = pipeline.predict_proba(X_val)[:, 1]
    
    # Calculate metrics
    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred, average='macro')
    val_auc = roc_auc_score(y_val, y_val_proba)
    
    result = {
        'C': C,
        'accuracy': val_accuracy,
        'f1_macro': val_f1,
        'auc_roc': val_auc,
        'pipeline': pipeline
    }
    tuning_results.append(result)
    
    print(f"  Accuracy: {val_accuracy:.4f}")
    print(f"  F1 Macro: {val_f1:.4f}")
    print(f"  AUC-ROC:  {val_auc:.4f}")

print("\n" + "="*60)
print("Tuning complete!")

Starting regularization tuning...

Training with C=0.001...
  Accuracy: 0.5580
  F1 Macro: 0.5246
  AUC-ROC:  0.5798

Training with C=0.01...
  Accuracy: 0.5534
  F1 Macro: 0.5228
  AUC-ROC:  0.5832

Training with C=0.1...
  Accuracy: 0.5503
  F1 Macro: 0.5211
  AUC-ROC:  0.5843

Training with C=1...


KeyboardInterrupt: 
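
To help interpret the AUC-ROC values reported for the completed runs: AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on a tiny made-up example. `auc_from_scores` is a hypothetical helper, and its O(n²) loop is for illustration only, not for the full dataset.

```python
def auc_from_scores(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only.
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(y_toy, scores_toy))
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, so values around 0.58 indicate only a weak ranking signal.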

In [None]:
# Select best model based on AUC-ROC
best_result = max(tuning_results, key=lambda x: x['auc_roc'])
best_C = best_result['C']
best_pipeline = best_result['pipeline']

print(f"Best C value: {best_C}")
print(f"Best validation AUC-ROC: {best_result['auc_roc']:.4f}")

# Summary table
print("\nTuning Summary:")
print("-" * 50)
print(f"{'C':>10} | {'Accuracy':>10} | {'F1 Macro':>10} | {'AUC-ROC':>10}")
print("-" * 50)
for r in tuning_results:
    marker = " <-- BEST" if r['C'] == best_C else ""
    print(f"{r['C']:>10} | {r['accuracy']:>10.4f} | {r['f1_macro']:>10.4f} | {r['auc_roc']:>10.4f}{marker}")

## 5. Final Training and Evaluation

Refit the pipeline with the best C on the combined train+validation data, then evaluate once on the held-out test set.

In [None]:
# Combine train and validation for final training
X_train_final = pd.concat([X_train, X_val], axis=0)
y_train_final = pd.concat([y_train, y_val], axis=0)

print(f"Final training set size: {len(X_train_final):,} samples")

# Create final pipeline with best C
final_pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), NUMERICAL_FEATURES),
            ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), CATEGORICAL_FEATURES)
        ],
        remainder='drop'
    )),
    ('classifier', LogisticRegression(
        C=best_C,
        solver='lbfgs',
        max_iter=1000,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])

# Train final model
print("\nTraining final model...")
final_pipeline.fit(X_train_final, y_train_final)
print("Training complete!")

In [None]:
# Evaluate on test set
print("Evaluating on test set...")
print("="*60)

y_test_pred = final_pipeline.predict(X_test)
y_test_proba = final_pipeline.predict_proba(X_test)[:, 1]

# Calculate all metrics
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision_macro = precision_score(y_test, y_test_pred, average='macro')
test_recall_macro = recall_score(y_test, y_test_pred, average='macro')
test_f1_macro = f1_score(y_test, y_test_pred, average='macro')
test_auc_roc = roc_auc_score(y_test, y_test_proba)

# Per-class metrics
test_precision_per_class = precision_score(y_test, y_test_pred, average=None)
test_recall_per_class = recall_score(y_test, y_test_pred, average=None)
test_f1_per_class = f1_score(y_test, y_test_pred, average=None)

print(f"\nTest Set Metrics:")
print(f"  Accuracy:  {test_accuracy:.4f}")
print(f"  Precision: {test_precision_macro:.4f} (macro)")
print(f"  Recall:    {test_recall_macro:.4f} (macro)")
print(f"  F1-Score:  {test_f1_macro:.4f} (macro)")
print(f"  AUC-ROC:   {test_auc_roc:.4f}")

print(f"\nPer-Class Metrics:")
print(f"  Class 0 (Not Purchased): Precision={test_precision_per_class[0]:.4f}, Recall={test_recall_per_class[0]:.4f}, F1={test_f1_per_class[0]:.4f}")
print(f"  Class 1 (Purchased):     Precision={test_precision_per_class[1]:.4f}, Recall={test_recall_per_class[1]:.4f}, F1={test_f1_per_class[1]:.4f}")

In [None]:
# Confusion Matrix
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_test_pred)
print(cm)

print(f"\nClassification Report:")
print(classification_report(y_test, y_test_pred, target_names=['Not Purchased', 'Purchased']))
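
The per-class and macro-averaged numbers in the classification report can all be derived from the four confusion-matrix cells. The sketch below shows that derivation in plain Python. `binary_report` is a hypothetical helper and the counts are made up for illustration; they are not results from this notebook. The cell order follows the `[[TN, FP], [FN, TP]]` layout of `confusion_matrix` for labels 0 and 1.

```python
def _precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def binary_report(tn, fp, fn, tp):
    """Accuracy plus per-class and macro metrics from confusion-matrix cells."""
    class_1 = _precision_recall_f1(tp=tp, fp=fp, fn=fn)
    # For class 0 the roles flip: its "true positives" are the TN cell,
    # its false positives are the FN cell, and its false negatives the FP cell.
    class_0 = _precision_recall_f1(tp=tn, fp=fn, fn=fp)
    macro = {k: (class_0[k] + class_1[k]) / 2 for k in class_0}
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "class_0": class_0, "class_1": class_1, "macro": macro}


# Made-up counts, for illustration only.
report = binary_report(tn=50, fp=30, fn=10, tp=10)
print(report)
```

Macro averaging weights both classes equally regardless of their size, so a weak minority-class score pulls the macro figure down even when accuracy looks acceptable.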

In [None]:
# Also get validation metrics from the best model for comparison
y_val_pred_best = best_pipeline.predict(X_val)
y_val_proba_best = best_pipeline.predict_proba(X_val)[:, 1]

val_accuracy = accuracy_score(y_val, y_val_pred_best)
val_precision_macro = precision_score(y_val, y_val_pred_best, average='macro')
val_recall_macro = recall_score(y_val, y_val_pred_best, average='macro')
val_f1_macro = f1_score(y_val, y_val_pred_best, average='macro')
val_auc_roc = roc_auc_score(y_val, y_val_proba_best)

val_precision_per_class = precision_score(y_val, y_val_pred_best, average=None)
val_recall_per_class = recall_score(y_val, y_val_pred_best, average=None)
val_f1_per_class = f1_score(y_val, y_val_pred_best, average=None)

print("Validation Set Metrics (for comparison):")
print(f"  Accuracy:  {val_accuracy:.4f}")
print(f"  Precision: {val_precision_macro:.4f} (macro)")
print(f"  Recall:    {val_recall_macro:.4f} (macro)")
print(f"  F1-Score:  {val_f1_macro:.4f} (macro)")
print(f"  AUC-ROC:   {val_auc_roc:.4f}")

## 6. Save Metrics to JSON

In [None]:
# Prepare metrics dictionary
metrics = {
    "model": "LogisticRegression",
    "timestamp": datetime.now().isoformat(),
    "hyperparameters": {
        "best_C": best_C,
        "solver": "lbfgs",
        "max_iter": 1000,
        "class_weight": "balanced"
    },
    "data_split": {
        "train_size": int(len(X_train)),
        "val_size": int(len(X_val)),
        "test_size": int(len(X_test)),
        "train_val_size": int(len(X_train_final)),
        "total_size": int(len(X))
    },
    "features": {
        "numerical": NUMERICAL_FEATURES,
        "categorical": CATEGORICAL_FEATURES,
        "preprocessing": {
            "numerical": "StandardScaler",
            "categorical": "OneHotEncoder"
        }
    },
    "regularization_tuning": [
        {
            "C": r['C'],
            "val_accuracy": round(r['accuracy'], 4),
            "val_f1_macro": round(r['f1_macro'], 4),
            "val_auc_roc": round(r['auc_roc'], 4)
        }
        for r in tuning_results
    ],
    "validation_metrics": {
        "accuracy": round(val_accuracy, 4),
        "precision": {
            "macro": round(val_precision_macro, 4),
            "class_0": round(float(val_precision_per_class[0]), 4),
            "class_1": round(float(val_precision_per_class[1]), 4)
        },
        "recall": {
            "macro": round(val_recall_macro, 4),
            "class_0": round(float(val_recall_per_class[0]), 4),
            "class_1": round(float(val_recall_per_class[1]), 4)
        },
        "f1": {
            "macro": round(val_f1_macro, 4),
            "class_0": round(float(val_f1_per_class[0]), 4),
            "class_1": round(float(val_f1_per_class[1]), 4)
        },
        "auc_roc": round(val_auc_roc, 4)
    },
    "test_metrics": {
        "accuracy": round(test_accuracy, 4),
        "precision": {
            "macro": round(test_precision_macro, 4),
            "class_0": round(float(test_precision_per_class[0]), 4),
            "class_1": round(float(test_precision_per_class[1]), 4)
        },
        "recall": {
            "macro": round(test_recall_macro, 4),
            "class_0": round(float(test_recall_per_class[0]), 4),
            "class_1": round(float(test_recall_per_class[1]), 4)
        },
        "f1": {
            "macro": round(test_f1_macro, 4),
            "class_0": round(float(test_f1_per_class[0]), 4),
            "class_1": round(float(test_f1_per_class[1]), 4)
        },
        "auc_roc": round(test_auc_roc, 4)
    },
    "confusion_matrix": {
        "true_negative": int(cm[0, 0]),
        "false_positive": int(cm[0, 1]),
        "false_negative": int(cm[1, 0]),
        "true_positive": int(cm[1, 1])
    }
}

print("Metrics prepared!")
print(json.dumps(metrics, indent=2))

In [None]:
# Save to JSON file
METRICS_PATH = Path("../metrics/logistic_regression_metrics.json")
METRICS_PATH.parent.mkdir(parents=True, exist_ok=True)

with open(METRICS_PATH, 'w') as f:
    json.dump(metrics, f, indent=2)

print(f"Metrics saved to: {METRICS_PATH.resolve()}")
print("\nTraining complete!")
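
A quick way to confirm the file is usable downstream is to read it back and compare it with what was written. The sketch below does this in a temporary directory with a small made-up metrics dictionary. `save_and_reload` and the example values are hypothetical, not output of this notebook. The explicit `int(...)` and `float(...)` casts in the metrics cell matter for the same reason: the standard `json` module cannot serialize NumPy integer types or arrays directly.

```python
import json
import tempfile
from pathlib import Path


def save_and_reload(metrics, path):
    """Write metrics as JSON, then read the file back and return the parsed dict."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    with open(path) as f:
        return json.load(f)


# Made-up values, for illustration only.
example_metrics = {
    "model": "LogisticRegression",
    "hyperparameters": {"best_C": 0.1, "solver": "lbfgs"},
    "test_metrics": {"accuracy": 0.5, "auc_roc": 0.5},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "metrics" / "example_metrics.json"
    reloaded = save_and_reload(example_metrics, target)

print(reloaded == example_metrics)
```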

## Summary

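This notebook trains a Logistic Regression baseline model for customer purchase propensity prediction.

### Key Results:
- The regularization parameter C is selected by validation AUC-ROC. In the recorded run the sweep was interrupted at C=1; the three completed settings (C=0.001, 0.01, 0.1) reached validation AUC-ROC of 0.5798, 0.5832, and 0.5843
- The model uses `class_weight='balanced'` to compensate for class imbalance (26% positive rate)
- The final model is trained on train+validation and evaluated on the held-out test set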
- All metrics saved to `model_pipeline/metrics/logistic_regression_metrics.json`
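
For reference, the `'balanced'` heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the full-dataset class counts printed earlier. The notebook itself fits on the training split, but the stratified splits keep the same class ratio, so the weights should come out essentially the same. `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
def balanced_class_weights(class_counts):
    """Weight per class: n_samples / (n_classes * count for that class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class counts copied from the target distribution output.
counts = {0: 2_170_105, 1: 763_334}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}, weighted total {weight * counts[label]:,.1f}")
```

After weighting, both classes carry the same total weight in the loss, which is why the classifier trades some accuracy on the majority class for recall on purchases.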