This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Predicted LTV Too High #313

Closed
dmanhattan opened this issue Aug 26, 2019 · 5 comments
Labels
bug (necessarily a mistake; improvements or other ways to do it should go elsewhere) · use case (help with a use case scenario)

Comments


dmanhattan commented Aug 26, 2019

Hi Cameron,

Thanks for creating this great library!

I'm having a few problems with the values predicted using customer_lifetime_value().

I would expect the LTV to be a slightly discounted version of the predicted purchases * average order value, but as shown in the screen shot below it's actually coming out much higher:

[Screenshot: table comparing customer_lifetime_value() output with predicted purchases × average order value]

In one example, a customer who has only spent $180 over 2 years (roughly $90 per year) is predicted to have a 12 month value of $2670, about 30 times their historical run rate.

Is there an error somewhere in the calculation, perhaps caused by the fix for Issue #180?

Cheers,

Duncan


orenshk commented Aug 27, 2019

what's the correlation between frequency and monetary value in your dataset?

@psygo added the use case label Aug 28, 2019
psygo (Collaborator) commented Aug 28, 2019

Please, whenever posting code, don't use screenshots; copy-paste the code itself. It's not only easier on the eyes, but it will most likely help those who are trying to help you. Even the example table you showed can be pasted as code.

what's the correlation between frequency and monetary value?

@orenshk's question is relevant, and the answer is something you should show us before we draw any conclusions.

Another thing you could plot in order to better diagnose the problem is the monetary value distribution itself (that's something I would like to add to the library in a future PR). Essentially, you plot the supposedly gamma-distributed monetary values from your data alongside the gamma distribution implied by the model and compare the two; a rough sketch follows below. If they are far apart, try varying the penalizer_coef until you reach a better fit. This is mentioned in one of Prof. Bruce Hardie's notes on the Gamma-Gamma model.
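
Here is one way to do that comparison (just a sketch, not library code; it assumes a fitted GammaGammaFitter called ggf and a DataFrame summary of repeat customers with frequency and monetary_value columns, using the p, q, v parameterization from Hardie's note):

import numpy as np
import matplotlib.pyplot as plt

# fitted Gamma-Gamma parameters: per-transaction spend ~ Gamma(p, nu), with nu ~ Gamma(q, rate v)
p, q, v = ggf.params_['p'], ggf.params_['q'], ggf.params_['v']

rng = np.random.default_rng(42)
x = summary['frequency'].values  # repeat customers only (frequency > 0)

# simulate each customer's observed average order value under the fitted model:
# the mean of x draws from Gamma(p, nu) is Gamma(p * x, rate nu * x)
nu = rng.gamma(shape=q, scale=1.0 / v, size=len(x))
simulated_avg = rng.gamma(shape=p * x, scale=1.0 / (nu * x))

plt.hist(summary['monetary_value'], bins=50, density=True, alpha=0.5, label='data')
plt.hist(simulated_avg, bins=50, density=True, alpha=0.5, label='model')
plt.legend()
plt.show()

If the two histograms are clearly different, the Gamma-Gamma fit (and therefore the CLV numbers built on top of it) shouldn't be trusted.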

dmanhattan (Author) commented Aug 28, 2019

Here's the correlation data:

import scipy.stats
scipy.stats.spearmanr(rfm_data['frequency'], rfm_data['monetary_value'])
SpearmanrResult(correlation=0.0305697204235341, pvalue=0.11448729124506023)

scipy.stats.pearsonr(rfm_data['frequency'], rfm_data['monetary_value'])
(-0.04303217429907956, 0.02626314703027851)

import matplotlib.pyplot as plt
plt.scatter(rfm_data['frequency'], rfm_data['monetary_value'])
plt.show()

[Scatter plot of frequency vs monetary_value]

Apologies for copy-pasting an image — here's the code:

# Perform average monetary value analysis for repeat purchasers only
from lifetimes import BetaGeoFitter, GammaGammaFitter

# BG/NBD transaction model (fit reproduced here so the snippet is self-contained)
bgf = BetaGeoFitter(penalizer_coef=0)
bgf.fit(rfm_data['frequency'], rfm_data['recency'], rfm_data['T'])

returning_customers_summary = rfm_data[rfm_data['frequency'] > 0]

ggf = GammaGammaFitter(penalizer_coef=0)
ggf.fit(
    returning_customers_summary['frequency'],
    returning_customers_summary['monetary_value']
)

# Expected number of purchases over the next 90 days
rfm_data['90_day_expected_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(
    90, rfm_data['frequency'], rfm_data['recency'], rfm_data['T']
)

# Calculate expected average order value for each customer
rfm_data['average_order_revenue'] = ggf.conditional_expected_average_profit(
    rfm_data['frequency'],
    rfm_data['monetary_value']
)

# Predict LTV over the period

# There appears to be an issue in this method
# see https://github.com/CamDavidsonPilon/lifetimes/issues/313
rfm_data['90_day_predicted_ltv'] = ggf.customer_lifetime_value(
    bgf,
    rfm_data['frequency'],
    rfm_data['recency'],
    rfm_data['monetary_value'],
    rfm_data['T'],
    time=3,  # number of months to predict
    discount_rate=0.01,
    freq='D'
)

rfm_data['90_day_predicted_ltv_manual'] = (
    rfm_data['90_day_expected_purchases'] * rfm_data['average_order_revenue']
).round(2)

rfm_data
frequency recency T monetary_value probability_alive 90_day_expected_purchases average_order_revenue 90_day_predicted_ltv 90_day_predicted_ltv_manual
0.0 0.0 711.0 0.000000 0.894723 0.052595 50.817010 10.206675 2.67
0.0 0.0 633.0 0.000000 0.899626 0.057467 50.817010 10.206675 2.92
0.0 0.0 707.0 0.000000 0.894969 0.052826 50.817010 10.206675 2.68
3.0 554.0 720.0 39.733333 0.984319 0.326599 41.193103 661.351880 13.45
0.0 0.0 248.0 0.000000 0.927597 0.103553 50.817010 10.206675 5.26
0.0 0.0 145.0 0.000000 0.936630 0.130707 50.817010 10.206675 6.64
0.0 0.0 582.0 0.000000 0.902946 0.061144 50.817010 10.206675 3.11
2.0 385.0 696.0 90.750000 0.969252 0.239031 83.348384 374.311556 19.92

It seems like there must be an issue in the customer_lifetime_value() function that is perhaps not deducting the previous period as it iterates over time:

for i in steps * factor:
    # since the prediction of number of transactions is cumulative, we have to
    # subtract off the previous periods
    expected_number_of_transactions = transaction_prediction_model.predict(
        i, frequency, recency, T
    ) - transaction_prediction_model.predict(i - factor, frequency, recency, T)
    # sum up the CLV estimates of all of the periods and apply discounted cash flow
    df["clv"] += (monetary_value * expected_number_of_transactions) / (1 + discount_rate) ** (i / factor)

psygo (Collaborator) commented Aug 31, 2019

This does seem pretty weird indeed.

It seems like there must be an issue in the customer_lifetime_value() function that is perhaps not deducting the previous period as it iterates over time:

But isn't that exactly what the transaction_prediction_model.predict(i - factor, frequency, recency, T) term is doing? If anything, I would suspect the (i / factor) term. In my opinion, what's really strange is the summation in the for loop (+=): isn't the last prediction already the cumulative total? Why are we summing the per-period predictions if they are supposedly already accounted for (it seems to me that we are building a future history of transactions for the customer and then wrongly summing it up)? Maybe you could try that and see if it gives better results?

I will have to do some more digging, but, unfortunately, I don't have much time for that right now (maybe in a week or two).

At any rate, you do need to check the fit of your Gamma-Gamma model. The first thing you could do is check the negative log-likelihood and see whether varying the penalizer_coef gets you a smaller value. Secondly, you should check whether the gamma distribution of monetary_value implied by the model is close to the real monetary_value distribution, i.e., the one in your data; scipy.stats has gamma distribution functions that may be of help here. If the model's gamma distribution sits far to the right, it will feed values much higher than the average monetary_value into the customer_lifetime_value() function, producing the weird behavior you've seen. A quick check along those lines is sketched below.
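
For example (a rough sketch, assuming the rfm_data frame from your comment above; the variable names are placeholders), you can refit with a few penalizer values and compare the model's expected average order value against the observed mean for repeat customers, rather than the log-likelihood:

from lifetimes import GammaGammaFitter

repeat = rfm_data[rfm_data['frequency'] > 0]
observed_mean = repeat['monetary_value'].mean()

# refit with several penalizer values and compare model vs. observed mean spend
for coef in [0.0, 0.001, 0.01, 0.1]:
    ggf_check = GammaGammaFitter(penalizer_coef=coef)
    ggf_check.fit(repeat['frequency'], repeat['monetary_value'])
    model_mean = ggf_check.conditional_expected_average_profit(
        repeat['frequency'], repeat['monetary_value']
    ).mean()
    print(coef, round(observed_mean, 2), round(model_mean, 2))

If no penalizer_coef brings the two close together, the monetary value model itself is off, and the CLV numbers will inherit that error.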

@psygo added the bug label Aug 31, 2019
@ColtAllen

I've a good idea of what's causing this, which I've summed up here and will be working on in the btyd successor library.
