
New Modeling Backend, BetaGeoModel, and PPD Metric #24

Merged: 8 commits merged from the pymc branch into main on Jun 19, 2022

Conversation

@ColtAllen (Owner) commented Jun 10, 2022

Overview

The current autograd backend for lifetimes is no longer being actively developed, and it fits models via maximum-likelihood estimation (MLE). MLE finds only the single most probable value of each model parameter (the peak of the bell curve, i.e., the posterior mode under flat priors). It's fast, but considering only a single parameter value is akin to missing the forest for the trees and can be limiting in a number of ways.

In #6, @juanitorduz provided a link to a notebook he wrote containing variants of BetaGeoFitter written in pymc, the premier Python library for Bayesian statistical modeling. Bayesian methods estimate the full posterior distribution of each parameter and confer many advantages for model tuning and interpretation, so I've adapted Juan's code into a new backend for lifetimes and a Beta-Geometric/NBD model class called BetaGeoModel. Best of all, pymc has a very large and engaged developer community, so the dependencies for this library are unlikely to be abandoned.

New Feature Description

UML diagram of the new modeling backend:

[Image: models_uml]

The new BaseModel object contains methods and attributes shared between all models and is also an abstract class, providing a template for future models as well as enforcing API standardization.
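To make that concrete, here is a minimal sketch of the pattern; the method names and signatures below are illustrative assumptions rather than the actual BaseModel internals:

# Minimal sketch (assumed names, not the actual implementation) of how an
# abstract BaseModel enforces a shared API across model classes.
from abc import ABC, abstractmethod

import pandas as pd


class BaseModel(ABC):
    """Methods and attributes shared by every Bayesian model class."""

    @abstractmethod
    def fit(self, data: pd.DataFrame) -> "BaseModel":
        """Build the pymc model, sample the posterior, and return self."""

    @abstractmethod
    def _unload_params(self) -> tuple:
        """Return posterior parameter arrays for the predictive methods."""


class BetaGeoModel(BaseModel):
    """Concrete models only need to fill in the abstract template methods."""

    def fit(self, data: pd.DataFrame) -> "BetaGeoModel":
        # Prior specification and NUTS sampling would happen here.
        return self

    def _unload_params(self) -> tuple:
        return ()

Any future model class would subclass BaseModel and be required by the abstract methods to expose the same fitting and prediction surface.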

In Bayesian modeling, the user specifies prior distributions for model parameters based on their own subjective intuition. Here's the stochastic dependency graph of what I came up with for BetaGeoModel:

[Image: betageo_graph]
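For reference, here is a rough code sketch of how priors and the BG/NBD likelihood can be wired together in pymc. The HalfNormal priors and variable names are illustrative assumptions rather than the exact defaults in BetaGeoModel; the log-likelihood follows the standard BG/NBD formulation (Fader, Hardie & Lee 2005):

# Illustrative sketch only -- the priors shown here are assumptions, not the
# BetaGeoModel defaults. Written against recent pymc/pytensor releases.
import pymc as pm
import pytensor.tensor as pt

import lifetimes as lt

data_df = lt.load_cdnow_summary()
x = data_df["frequency"].values.astype(float)
t_x = data_df["recency"].values.astype(float)
T = data_df["T"].values.astype(float)

with pm.Model() as model:
    # Subjective priors on the four BG/NBD parameters
    a = pm.HalfNormal("a", sigma=10)
    b = pm.HalfNormal("b", sigma=10)
    alpha = pm.HalfNormal("alpha", sigma=10)
    r = pm.HalfNormal("r", sigma=10)

    # Standard BG/NBD log-likelihood
    delta = (x > 0).astype(float)
    A = pt.gammaln(r + x) - pt.gammaln(r) + r * pt.log(alpha)
    B = pt.gammaln(a + b) + pt.gammaln(b + x) - pt.gammaln(b) - pt.gammaln(a + b + x)
    C = -(r + x) * pt.log(alpha + T)
    D = pt.log(a) - pt.log(b + pt.maximum(x, 1.0) - 1.0) - (r + x) * pt.log(alpha + t_x)
    loglike = A + B + pt.log(pt.exp(C) + delta * pt.exp(D))
    pm.Potential("loglike", loglike.sum())

    # Posterior estimation via NUTS
    idata = pm.sample(draws=1000, tune=1000, chains=4)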

Parameter posteriors are estimated by the No-U-Turn Sampler (NUTS), a variation of the Hamiltonian Monte Carlo (HMC) sampling algorithm. The following link contains research paper links and interactive demos of these and similar algorithms in action:

The Markov-chain Monte Carlo Interactive Gallery

For larger datasets (perhaps >30k rows), prior specification matters little because the data will overwhelm the subjectivity of the priors. However, the smaller the dataset, the more important prior selection becomes. These default model priors performed well in testing, but in a future PR I want to give users more options for tunability.

After a model has been fitted, it will contain an idata attribute holding an InferenceData object, which can be plotted and analyzed with the ArviZ library. Here's an example slightly modified from a code excerpt in Juan's notebook:

import lifetimes as lt
import pymc as pm
import arviz as az

data_df = lt.load_cdnow_summary()

# Fit current lifetimes BetaGeo model via MLE
bg_mle = lt.BetaGeoFitter().fit(
    frequency=data_df['frequency'].values,
    recency=data_df['recency'].values,
    T=data_df['T'].values,
)

# Fit proposed new BetaGeo Bayesian model
bg_bayes = lt.BetaGeoModel().fit(data_df)

# Use ArviZ to plot posterior parameter distributions against the MLE estimates
axes = az.plot_trace(
    data=bg_bayes.idata,
    var_names=["a", "b", "alpha", "r"],
    lines=[(k, {}, [v]) for k, v in bg_mle.summary["coef"].items()],
    compact=True,
    backend_kwargs={
        "figsize": (12, 9),
        "layout": "constrained",
    },
)
fig = axes[0][0].get_figure()
fig.suptitle("BG/NBD Model Trace")

[Image: bg_nbd_arviz_plots]

The existing user API for lifetimes remains fully intact; in fact the predictive methods for BetaGeoModel were copy/pasted right over from BetaGeoFitter! The coolest thing about a Bayesian approach is that repeated sample draws from the parameter posteriors will build an entire predictive probability distribution for any given customer, enabling prediction intervals and further analysis around customer behavior. Here's another slightly-modified example from Juan's notebook:

from matplotlib import pyplot as plt
import seaborn as sns

import lifetimes as lt

data_df = lt.load_cdnow_summary()

# Fit current lifetimes BetaGeo model and estimate p_alive via MLE
bg_mle = lt.BetaGeoFitter().fit(
    frequency=data_df['frequency'].values,
    recency=data_df['recency'].values,
    T=data_df['T'].values,
)

p_alive_mle = bg_mle.conditional_probability_alive(
    frequency=data_df['frequency'].values,
    recency=data_df['recency'].values,
    T=data_df['T'].values,
)

# Fit proposed new BetaGeo Bayesian model and estimate p_alive
bg_bayes = lt.BetaGeoModel().fit(data_df)
p_alive_bayes = bg_bayes.conditional_probability_alive()

# Plotting function to compare the Bayesian p_alive samples against the MLE point estimates
def plot_conditional_probability_alive(p_alive_bayes, p_alive_mle, idx, ax):
    sns.kdeplot(x=p_alive_bayes[idx], color="C0", fill=True, ax=ax)
    ax.axvline(x=p_alive_mle[idx], color="C1", linestyle="--")
    ax.set(title=f"idx={idx}")
    return ax

fig, axes = plt.subplots(
    nrows=3,
    ncols=3,
    figsize=(9, 9),
    layout="constrained",
)
for idx, ax in enumerate(axes.flatten()):
    plot_conditional_probability_alive(p_alive_bayes, p_alive_mle, idx, ax)

fig.suptitle("Conditional Probability Alive", fontsize=16)

[Image: bg_nbd_pymc_prob_alive_plots]

Collaborators: The plotting function defined in that code block would be a great addition to the plotting.py module in a future PR.

Please note these graphs cannot be recreated yet with the current version of lt.BetaGeoModel().conditional_probability_alive() because it only supports point estimates at this time. However, posterior sampling for predictions is a top priority for the next version release of this library.

Instead of saving/loading the full model object via pickle files - which I consider a security risk because anything can be pickled, including malware - model persistence now comes in the form of a stripped-down JSON containing arrays for posterior parameter distributions and some metadata. During testing, the typical file size of this JSON was about 1.5 MB.
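For illustration, a stripped-down persistence routine could look something like the sketch below; the function names are hypothetical and this is not the actual BetaGeoModel save/load API:

# Hypothetical sketch of JSON persistence for posterior draws -- not the actual
# BetaGeoModel save/load API.
import json

import arviz as az
import numpy as np


def save_posterior_json(idata: az.InferenceData, path: str) -> None:
    """Write posterior parameter arrays and minimal metadata to a JSON file."""
    posterior = idata.posterior
    payload = {
        "posterior": {var: np.asarray(posterior[var]).tolist() for var in posterior.data_vars},
        "metadata": {"chains": int(posterior.sizes["chain"]), "draws": int(posterior.sizes["draw"])},
    }
    with open(path, "w") as file:
        json.dump(payload, file)


def load_posterior_json(path: str) -> az.InferenceData:
    """Rebuild an InferenceData object from the saved JSON."""
    with open(path) as file:
        payload = json.load(file)
    draws = {var: np.asarray(vals) for var, vals in payload["posterior"].items()}
    return az.from_dict(posterior=draws)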

Comparisons to BetaGeoFitter

A downside to this new backend is that full posterior estimation requires more time to fit a model than MLE. The tests I ran on the CDNOW dataset took about a minute to complete on 2.4k rows of data, whereas MLE converged in seconds. However, pymc has experimental support for JAX just-in-time (JIT) compilation and even GPUs, which could speed things up quite a bit and would be a great future enhancement for lifetimes.
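As a speculative example, JAX-accelerated NUTS sampling on a toy model might look like this; the sampler entry point has moved between pymc versions, so treat the nuts_sampler argument as an assumption to verify against the installed release:

# Speculative sketch of JAX-accelerated NUTS sampling on a toy model. In pymc
# 4.x this lived in pymc.sampling_jax; recent releases expose it via the
# nuts_sampler argument shown here. Requires jax and numpyro to be installed.
import pymc as pm

with pm.Model() as toy_model:
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)
    pm.Normal("obs", mu=mu, sigma=1.0, observed=[0.1, -0.3, 0.7])
    idata = pm.sample(draws=1000, tune=1000, nuts_sampler="numpyro")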

The mean values of the parameter posteriors, as well as the predictive outputs they generated, were comparable to those of BetaGeoFitter to 1-3 decimal places when fit to the same CDNOW dataset during testing. For well-behaved benchmarking datasets like CDNOW, MLE will outperform Bayesian estimates. However, real-world data is noisy and constantly influenced by unknown external factors, posing overfitting risks for point estimates of model parameters. Full posterior distributions provide a range of possible parameter values that varies every time the model runs, and repeated runs are loosely analogous to an ML modeling stack, affording greater resilience and explainability for the irreducible uncertainties of reality.

Posterior Predictive Deviation Metric

Bayesian posterior predictive checks (PPCs) are usually a subjective visual exercise, comparing graphs of stochastic model outputs against observed data. However, an article on Medium inspired me to create a potential metric for PPCs based on the Wasserstein Distance, which I've dubbed the posterior_predictive_deviation():

Statistical Tests Won't Help You to Compare Distributions

It can be interpreted as the number of standard deviations required to transform the distribution of model outputs into that of the observed data. The Wasserstein distance implementation in scipy only calculates distances between 1D arrays, but it does allow optional weight vectors to be specified. During testing I weighted the Wasserstein distance of customer frequency by recency because I wanted to consider both variables in this metric, but this is an uncharted new path for model interpretation and demands further analysis.
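For illustration, a standardized, recency-weighted Wasserstein distance could be sketched as below; the standardization step (dividing by the observed frequency's standard deviation) is my assumption and may differ from the exact posterior_predictive_deviation() implementation:

# Rough sketch of a standardized, recency-weighted Wasserstein distance between
# observed and posterior-generated frequency. The function name and the
# standardization step are assumptions, not the library's exact code.
import numpy as np
import numpy.typing as npt
from scipy.stats import wasserstein_distance


def ppd_sketch(
    obs_freq: npt.ArrayLike,
    gen_freq: npt.ArrayLike,
    obs_rec: npt.ArrayLike,
    gen_rec: npt.ArrayLike,
) -> float:
    """Wasserstein distance of frequency, weighted by recency, in units of observed std devs."""
    obs_freq = np.asarray(obs_freq, dtype=float)
    gen_freq = np.asarray(gen_freq, dtype=float)
    distance = wasserstein_distance(
        u_values=obs_freq,
        v_values=gen_freq,
        u_weights=np.asarray(obs_rec, dtype=float),
        v_weights=np.asarray(gen_rec, dtype=float),
    )
    return distance / obs_freq.std()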

After this PR is merged I intend to proceed with publishing an alpha release to PyPI. This new modeling backend is still incomplete and lacking adequate documentation, but it's a start. Functionality elsewhere in lifetimes remains unchanged.

@ColtAllen added the enhancement and good first issue labels Jun 10, 2022
@ColtAllen added this to the pymc milestone Jun 10, 2022
@ColtAllen self-assigned this Jun 10, 2022
@ColtAllen linked an issue Jun 10, 2022 that may be closed by this pull request
@ColtAllen mentioned this pull request Jun 10, 2022
@juanitorduz (Collaborator) commented:

@ColtAllen ! Amazing job! I will take a look at it next week (been busy these days) 🚀 !

@ColtAllen (Owner, Author) commented Jun 12, 2022:

> @ColtAllen ! Amazing job! I will take a look at it next week (been busy these days) 🚀 !

Thanks, and that's great! I want to get this PR merged and the alpha release published to PyPI before Fri, 17-Jun, because I'm going on vacation afterward and won't be able to resume work on lifetimes until July.


def posterior_predictive_deviation(obs_freq: npt.ArrayLike, gen_freq: npt.ArrayLike, obs_rec: npt.ArrayLike, gen_rec: npt.ArrayLike) -> float:
    """
    In lieu of a traditional posterior predictive check, calculate the standardized Wasserstein distance of frequency, weighted by recency.
    """
Collaborator (review comment on the snippet above):

In view?

Owner Author:

Can you provide more detail? I'll need to merge this PR today, but I can still answer any questions you have and/or create an issue to implement any proposed changes afterward.

@juanitorduz (Collaborator) left a comment:

Hey! I had a quick look, as I was on a business trip this week, and I know you want feedback in order to iterate (and go on vacation 😎 !). Here are some comments:

Code

  • The pymc package is not in the requirements so I cannot run the tests. For example, when trying pytest -v tests/test_models.py I get:
ImportError while importing test module '/Users/juanitorduz/Documents/lifetimes/tests/test_models.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../../opt/anaconda3/envs/lifetimes_dev/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_models.py:12: in <module>
    import arviz as az
  • Same for other modules like psutil.
  • I think we need to make sure the tests and model run when installing locally via pip install -e . inside a virtual environment. I believe the requirements have to be specified inside the setup.py file from the requirements.txt, as in https://github.com/pymc-devs/pymc/blob/main/setup.py#L52-L53 (see the sketch after this list).
  • We need to be able to run the tests in CI/CD. I can support with that if we use GitHub Actions.
  • In view of the CI/CD, we should also have code style checkers and linters, as the code in this PR does not have a consistent convention. I would suggest black and flake8.
  • There are some .coverage files which are being pushed into this branch and are not ignored correctly via the .gitignore.
  • We should add the testing artifacts into a test/fixtures folder to be able to run the tests in the CI/CD pipeline (so I guess we would need to remove them from the .gitignore)
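A minimal sketch of the setup.py pattern linked in the third bullet above (names and file layout are placeholders to adapt):

# Minimal sketch of reading requirements.txt into install_requires, following
# the linked pymc setup.py pattern. Package name and paths are placeholders.
from pathlib import Path

from setuptools import find_packages, setup

requirements = [
    line.strip()
    for line in Path("requirements.txt").read_text().splitlines()
    if line.strip() and not line.startswith("#")
]

setup(
    name="lifetimes",
    packages=find_packages(),
    install_requires=requirements,
)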

Model

  • It would be nice to allow the user to modify the priors, for example as in Bambi: https://bambinos.github.io/bambi/main/notebooks/radon_example.html (Oops! I just read this would be part of the second iteration 🙈 !)
  • I wonder if we should expose the complete InferenceData object to the user (instead of just returning the .to_dict()), as xarray is a very rich library.
  • It would be great to have an example notebook which would serve as documentation on how to run this new generation of models.

I did not have time to go into the details of the methods' math expressions, but I will try to come back to them in the upcoming days. I hope this quick feedback is helpful for the development. Keep up the great work! 🚀 !

@ColtAllen (Owner, Author) commented:

Thanks @juanitorduz! I've replied to all of your review comments and will be creating work issues for the requested changes, because I need to get this PR merged today before I go on vacation.

The requirements.txt and other package files are actually leftovers from older versions of lifetimes which I'll be removing today before publishing the alpha release.

@ColtAllen merged commit 1f12714 into main Jun 19, 2022
@ColtAllen deleted the pymc branch June 19, 2022 18:45
@juanitorduz (Collaborator) commented:

Hey! Congrats on the alpha release! I'll try to give a more detailed review and test it 🚀 ! Enjoy your time off!

@ColtAllen (Owner, Author) commented:

Thanks @juanitorduz! I'm back from vacation and have reorganized the work tasks into a tentative release roadmap.

@zwelitunyiswa commented:

@ColtAllen Instead of a JSON, would an InferenceData object (i.e. Xarray) not make more sense to persist files? You would then get Arviz functions for free as well.

@ColtAllen (Owner, Author) commented:

> @ColtAllen Instead of a JSON, would an InferenceData object (i.e. Xarray) not make more sense to persist files? You would then get Arviz functions for free as well.

Hey @zwelitunyiswa, as of the Beta release the full InferenceData object is now persisted as a JSON. If I can figure out the issue with loading from Pandas DataFrames then many more formats could be supported like CSV, Parquet & Feather, etc. In general I don't like using Pickle files because they can obscure malware and pose security risks.

Linked issue that may be closed by this pull request: BG/NBD Model in PyMC