
New Modeling Backend, BetaGeoModel, and PPD Metric #24

Merged: 8 commits merged from the pymc branch into main on Jun 19, 2022

Conversation

@ColtAllen (Owner) commented Jun 10, 2022

Overview

The current autograd backend for lifetimes is no longer being actively developed, and it fits models via maximum-likelihood estimation (MLE). MLE finds only the single most probable value of each model parameter (the peak of the bell curve, i.e., the posterior mode under flat priors). It's fast, but considering only a single parameter value is akin to missing the forest for the trees and can be limiting in a number of ways.

In #6, @juanitorduz provided a link to a notebook he wrote containing variants of BetaGeoFitter written in pymc, the premier Python library for Bayesian statistical modeling. Bayesian methods estimate the full posterior distribution of each parameter and confer many advantages for model tuning and interpretation, so I've adapted Juan's code into a new backend for lifetimes and a Beta-Geometric/NBD model class called BetaGeoModel. Best of all, pymc has a very large and engaged developer community, so the dependencies for this library are unlikely to be abandoned.

New Feature Description

UML diagram of the new modeling backend:

[Image: models_uml]

The new BaseModel object contains methods and attributes shared between all models and is also an abstract class, providing a template for future models as well as enforcing API standardization.
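To make that concrete, here is a minimal sketch of the pattern; the method names and signatures below are illustrative assumptions rather than the actual BaseModel internals:

# Minimal sketch (assumed names, not the actual implementation) of how an
# abstract BaseModel enforces a shared API across model classes.
from abc import ABC, abstractmethod

import pandas as pd


class BaseModel(ABC):
    """Methods and attributes shared by every Bayesian model class."""

    @abstractmethod
    def fit(self, data: pd.DataFrame) -> "BaseModel":
        """Build the pymc model, sample the posterior, and return self."""

    @abstractmethod
    def _unload_params(self) -> tuple:
        """Return posterior parameter arrays for the predictive methods."""


class BetaGeoModel(BaseModel):
    """Concrete models only need to fill in the abstract template methods."""

    def fit(self, data: pd.DataFrame) -> "BetaGeoModel":
        # Prior specification and NUTS sampling would happen here.
        return self

    def _unload_params(self) -> tuple:
        return ()

Any future model class would subclass BaseModel and be required by the abstract methods to expose the same fitting and prediction surface.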

In Bayesian modeling, the user specifies prior distributions for model parameters based on their own subjective intuition. Here's the stochastic dependency graph of what I came up with for BetaGeoModel:

[Image: betageo_graph]
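For reference, here is a rough code sketch of how priors and the BG/NBD likelihood can be wired together in pymc. The HalfNormal priors and variable names are illustrative assumptions rather than the exact defaults in BetaGeoModel; the log-likelihood follows the standard BG/NBD formulation (Fader, Hardie & Lee 2005):

# Illustrative sketch only -- the priors shown here are assumptions, not the
# BetaGeoModel defaults. Written against recent pymc/pytensor releases.
import pymc as pm
import pytensor.tensor as pt

import lifetimes as lt

data_df = lt.load_cdnow_summary()
x = data_df["frequency"].values.astype(float)
t_x = data_df["recency"].values.astype(float)
T = data_df["T"].values.astype(float)

with pm.Model() as model:
    # Subjective priors on the four BG/NBD parameters
    a = pm.HalfNormal("a", sigma=10)
    b = pm.HalfNormal("b", sigma=10)
    alpha = pm.HalfNormal("alpha", sigma=10)
    r = pm.HalfNormal("r", sigma=10)

    # Standard BG/NBD log-likelihood
    delta = (x > 0).astype(float)
    A = pt.gammaln(r + x) - pt.gammaln(r) + r * pt.log(alpha)
    B = pt.gammaln(a + b) + pt.gammaln(b + x) - pt.gammaln(b) - pt.gammaln(a + b + x)
    C = -(r + x) * pt.log(alpha + T)
    D = pt.log(a) - pt.log(b + pt.maximum(x, 1.0) - 1.0) - (r + x) * pt.log(alpha + t_x)
    loglike = A + B + pt.log(pt.exp(C) + delta * pt.exp(D))
    pm.Potential("loglike", loglike.sum())

    # Posterior estimation via NUTS
    idata = pm.sample(draws=1000, tune=1000, chains=4)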

Parameter posteriors are estimated by the No-U-Turn Sampler (NUTS), a variation of the Hamiltonian Monte Carlo (HMC) sampling algorithm. The following link contains research paper links and interactive demos of these and similar algorithms in action:

The Markov-chain Monte Carlo Interactive Gallery

For larger datasets (perhaps >30k rows), prior specification matters little because the data will overwhelm the subjectivity of the priors. However, the smaller the dataset, the more important prior selection becomes. These default model priors performed well in testing, but in a future PR I want to give users more options for tunability.

After a model has been fitted, it will contain an idata attribute holding an InferenceData object, which can be plotted and analyzed with the ArviZ library. Here's an example slightly modified from a code excerpt in Juan's notebook:

import lifetimes as lt
import pymc as pm
import arviz as az

data_df = lt.load_cdnow_summary()

# Fit current lifetimes BetaGeo model via MLE
bg_mle = lt.BetaGeoFitter().fit(
    frequency=data_df['frequency'].values,
    recency=data_df['recency'].values,
    T=data_df['T'].values,
)

# Fit proposed new BetaGeo Bayesian model
bg_bayes = lt.BetaGeoModel().fit(data_df)

# Use ArviZ to plot posterior parameter distributions against the MLE estimates
axes = az.plot_trace(
    data=bg_bayes.idata,
    var_names=["a", "b", "alpha", "r"],
    lines=[(k, {}, [v]) for k, v in bg_mle.summary["coef"].items()],
    compact=True,
    backend_kwargs={
        "figsize": (12, 9),
        "layout": "constrained",
    },
)
fig = axes[0][0].get_figure()
fig.suptitle("BG/NBD Model Trace")

[Image: bg_nbd_arviz_plots]

The existing user API for lifetimes remains fully intact; in fact the predictive methods for BetaGeoModel were copy/pasted right over from BetaGeoFitter! The coolest thing about a Bayesian approach is that repeated sample draws from the parameter posteriors will build an entire predictive probability distribution for any given customer, enabling prediction intervals and further analysis around customer behavior. Here's another slightly-modified example from Juan's notebook:

from matplotlib import pyplot as plt
import seaborn as sns

import lifetimes as lt

data_df = lt.load_cdnow_summary()

# Fit current lifetimes BetaGeo model and estimate p_alive via MLE
bg_mle = lt.BetaGeoFitter().fit(
    frequency=data_df['frequency'].values,
    recency=data_df['recency'].values,
    T=data_df['T'].values,
)

p_alive_mle = bg_mle.conditional_probability_alive(
    frequency=data_df['frequency'].values,
    recency=data_df['recency'].values,
    T=data_df['T'].values,
)

# Fit proposed new BetaGeo Bayesian model and estimate p_alive
bg_bayes = lt.BetaGeoModel().fit(data_df)
p_alive_bayes = bg_bayes.conditional_probability_alive()

# Plotting function to compare the Bayesian p_alive samples against the MLE point estimates
def plot_conditional_probability_alive(p_alive_bayes, p_alive_mle, idx, ax):
    sns.kdeplot(x=p_alive_bayes[idx], color="C0", fill=True, ax=ax)
    ax.axvline(x=p_alive_mle[idx], color="C1", linestyle="--")
    ax.set(title=f"idx={idx}")
    return ax

fig, axes = plt.subplots(
    nrows=3,
    ncols=3,
    figsize=(9, 9),
    layout="constrained",
)
for idx, ax in enumerate(axes.flatten()):
    plot_conditional_probability_alive(p_alive_bayes, p_alive_mle, idx, ax)

fig.suptitle("Conditional Probability Alive", fontsize=16)

[Image: bg_nbd_pymc_prob_alive_plots]

Collaborators: The plotting function defined in that code block would be a great addition to the plotting.py module in a future PR.

Please note these graphs cannot be recreated yet with the current version of lt.BetaGeoModel().conditional_probability_alive() because it only supports point estimates at this time. However, posterior sampling for predictions is a top priority for the next version release of this library.

Instead of saving/loading the full model object via pickle files - which I consider a security risk because anything can be pickled, including malware - model persistence now comes in the form of a stripped-down JSON containing arrays for posterior parameter distributions and some metadata. During testing, the typical file size of this JSON was about 1.5 MB.
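For illustration, a stripped-down persistence routine could look something like the sketch below; the function names are hypothetical and this is not the actual BetaGeoModel save/load API:

# Hypothetical sketch of JSON persistence for posterior draws -- not the actual
# BetaGeoModel save/load API.
import json

import arviz as az
import numpy as np


def save_posterior_json(idata: az.InferenceData, path: str) -> None:
    """Write posterior parameter arrays and minimal metadata to a JSON file."""
    posterior = idata.posterior
    payload = {
        "posterior": {var: np.asarray(posterior[var]).tolist() for var in posterior.data_vars},
        "metadata": {"chains": int(posterior.sizes["chain"]), "draws": int(posterior.sizes["draw"])},
    }
    with open(path, "w") as file:
        json.dump(payload, file)


def load_posterior_json(path: str) -> az.InferenceData:
    """Rebuild an InferenceData object from the saved JSON."""
    with open(path) as file:
        payload = json.load(file)
    draws = {var: np.asarray(vals) for var, vals in payload["posterior"].items()}
    return az.from_dict(posterior=draws)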

Comparisons to BetaGeoFitter

A downside to this new backend is that full posterior estimation requires more time to fit a model than MLE. The tests I ran on the CDNOW dataset took about a minute to complete on 2.4k rows of data, whereas MLE converged in seconds. However, pymc has experimental support for JAX just-in-time (JIT) compilation and even GPUs, which could speed things up quite a bit and would be a great future enhancement for lifetimes.
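As a speculative example, JAX-accelerated NUTS sampling on a toy model might look like this; the sampler entry point has moved between pymc versions, so treat the nuts_sampler argument as an assumption to verify against the installed release:

# Speculative sketch of JAX-accelerated NUTS sampling on a toy model. In pymc
# 4.x this lived in pymc.sampling_jax; recent releases expose it via the
# nuts_sampler argument shown here. Requires jax and numpyro to be installed.
import pymc as pm

with pm.Model() as toy_model:
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)
    pm.Normal("obs", mu=mu, sigma=1.0, observed=[0.1, -0.3, 0.7])
    idata = pm.sample(draws=1000, tune=1000, nuts_sampler="numpyro")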

The mean values of the parameter posteriors, as well as the predictive outputs they generated, were comparable to those of BetaGeoFitter to 1-3 decimal places when fit to the same CDNOW dataset during testing. For well-behaved benchmarking datasets like CDNOW, MLE will outperform Bayesian estimates. However, real-world data is noisy and constantly influenced by unknown external factors, posing overfitting risks for point estimates of model parameters. Full posterior distributions provide a range of possible parameter values that varies every time the model runs, and repeated runs are loosely analogous to an ML modeling stack, affording greater resilience and explainability for the irreducible uncertainties of reality.

Posterior Predictive Deviation Metric

Bayesian posterior predictive checks (PPCs) are usually a subjective visual exercise, comparing graphs of stochastic model outputs against observed data. However, an article on Medium inspired me to create a potential metric for PPCs based on the Wasserstein Distance, which I've dubbed the posterior_predictive_deviation():

Statistical Tests Won't Help You to Compare Distributions

It can be interpreted as the number of standard deviations required to transform the distribution of model outputs into that of the observed data. The Wasserstein distance implementation in scipy only calculates distances between 1D arrays, but it does allow optional weight vectors to be specified. During testing I weighted the Wasserstein distance of customer frequency by recency because I wanted to consider both variables in this metric, but this is an uncharted new path for model interpretation and demands further analysis.
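For illustration, a standardized, recency-weighted Wasserstein distance could be sketched as below; the standardization step (dividing by the observed frequency's standard deviation) is my assumption and may differ from the exact posterior_predictive_deviation() implementation:

# Rough sketch of a standardized, recency-weighted Wasserstein distance between
# observed and posterior-generated frequency. The function name and the
# standardization step are assumptions, not the library's exact code.
import numpy as np
import numpy.typing as npt
from scipy.stats import wasserstein_distance


def ppd_sketch(
    obs_freq: npt.ArrayLike,
    gen_freq: npt.ArrayLike,
    obs_rec: npt.ArrayLike,
    gen_rec: npt.ArrayLike,
) -> float:
    """Wasserstein distance of frequency, weighted by recency, in units of observed std devs."""
    obs_freq = np.asarray(obs_freq, dtype=float)
    gen_freq = np.asarray(gen_freq, dtype=float)
    distance = wasserstein_distance(
        u_values=obs_freq,
        v_values=gen_freq,
        u_weights=np.asarray(obs_rec, dtype=float),
        v_weights=np.asarray(gen_rec, dtype=float),
    )
    return distance / obs_freq.std()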

After this PR is merged I intend to proceed with publishing an alpha release to PyPI. This new modeling backend is still incomplete and lacking adequate documentation, but it's a start. Functionality elsewhere in lifetimes remains unchanged.

@ColtAllen added the enhancement and good first issue labels Jun 10, 2022
@ColtAllen added this to the pymc milestone Jun 10, 2022
@ColtAllen self-assigned this Jun 10, 2022
@ColtAllen linked an issue Jun 10, 2022 that may be closed by this pull request
@ColtAllen mentioned this pull request Jun 10, 2022
@juanitorduz (Collaborator) commented:

@ColtAllen ! Amazing job! I will take a look at it next week (been busy these days) 🚀 !

@ColtAllen (Owner, Author) commented Jun 12, 2022:

> @ColtAllen ! Amazing job! I will take a look at it next week (been busy these days) 🚀 !

Thanks, and that's great! I want to get this PR merged and the alpha release published to PyPI before Fri, 17-Jun, because I'm going on vacation afterward and won't be able to resume work on lifetimes until July.


def posterior_predictive_deviation(obs_freq: npt.ArrayLike, gen_freq: npt.ArrayLike, obs_rec: npt.ArrayLike, gen_rec: npt.ArrayLike) -> float:
    """
    In lieu of a traditional posterior predictive check, calculate the standardized Wasserstein distance of frequency, weighted by recency.
    """
Collaborator (review comment on the snippet above):

In view?

Owner Author:

Can you provide more detail? I'll need to merge this PR today, but I can still answer any questions you have and/or create an issue to implement any proposed changes afterward.

@juanitorduz (Collaborator) left a comment:

Hey! I had a quick look, as I was on a business trip this week, and I know you want feedback in order to iterate (and go on vacation 😎 !). Here are some comments:

Code

  • The pymc package is not in the requirements so I cannot run the tests. For example, when trying pytest -v tests/test_models.py I get:
ImportError while importing test module '/Users/juanitorduz/Documents/lifetimes/tests/test_models.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../../opt/anaconda3/envs/lifetimes_dev/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_models.py:12: in <module>
    import arviz as az
  • Same for other modules like psutil.
  • I think we need to make sure the tests and model run when installing locally via pip install -e . inside a virtual environment. I believe the requirements have to be specified inside the setup.py file from the requirements.txt, as in https://github.com/pymc-devs/pymc/blob/main/setup.py#L52-L53 (see the sketch after this list).
  • We need to be able to run the tests in CI/CD. I can support with that if we use GitHub Actions.
  • In view of the CI/CD, we should also have code style checkers and linters, as the code in this PR does not have a consistent convention. I would suggest black and flake8.
  • There are some .coverage files which are being pushed into this branch and are not ignored correctly via the .gitignore.
  • We should add the testing artifacts into a test/fixtures folder to be able to run the tests in the CI/CD pipeline (so I guess we would need to remove them from the .gitignore)
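A minimal sketch of the setup.py pattern linked in the third bullet above (names and file layout are placeholders to adapt):

# Minimal sketch of reading requirements.txt into install_requires, following
# the linked pymc setup.py pattern. Package name and paths are placeholders.
from pathlib import Path

from setuptools import find_packages, setup

requirements = [
    line.strip()
    for line in Path("requirements.txt").read_text().splitlines()
    if line.strip() and not line.startswith("#")
]

setup(
    name="lifetimes",
    packages=find_packages(),
    install_requires=requirements,
)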

Model

  • It would be nice to allow the user to modify the priors, for example as in Bambi: https://bambinos.github.io/bambi/main/notebooks/radon_example.html (Oops! I just read this would be part of the second iteration 🙈 !)
  • I wonder if we should expose the complete InferenceData object to the user (instead of just returning the .to_dict()), as xarray is a very rich library.
  • It would be great to have an example notebook which would serve as documentation on how to run this new generation of models.

I did not have time to go into the details of the methods' math expressions, but I will try to come back to them in the upcoming days. I hope this quick feedback is helpful for the development. Keep up the great work! 🚀 !

@ColtAllen (Owner, Author) commented:

Thanks @juanitorduz! I've replied to all of your review comments and will be creating work issues for the requested changes, because I need to get this PR merged today before I go on vacation.

The requirements.txt and other package files are actually leftovers from older versions of lifetimes which I'll be removing today before publishing the alpha release.

@ColtAllen merged commit 1f12714 into main Jun 19, 2022
@ColtAllen deleted the pymc branch June 19, 2022 18:45
@juanitorduz (Collaborator) commented:

Hey! Congrats on the alpha release! I'll try to give a more detailed review and test it 🚀 ! Enjoy your time off!

@ColtAllen (Owner, Author) commented:

Thanks @juanitorduz! I'm back from vacation and have reorganized the work tasks into a tentative release roadmap.

@zwelitunyiswa commented:

@ColtAllen Instead of a JSON, would an InferenceData object (i.e. Xarray) not make more sense to persist files? You would then get Arviz functions for free as well.

@ColtAllen (Owner, Author) commented:

> @ColtAllen Instead of a JSON, would an InferenceData object (i.e. Xarray) not make more sense to persist files? You would then get Arviz functions for free as well.

Hey @zwelitunyiswa, as of the Beta release the full InferenceData object is now persisted as a JSON. If I can figure out the issue with loading from Pandas DataFrames then many more formats could be supported like CSV, Parquet & Feather, etc. In general I don't like using Pickle files because they can obscure malware and pose security risks.

Linked issue that may be closed by this pull request: BG/NBD Model in PyMC