This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Predicted LTV Too High #313

Closed
dmanhattan opened this issue Aug 26, 2019 · 5 comments
Labels
bug (necessarily a mistake; improvements or other ways to do it should go elsewhere) · use case (help with a use case scenario)

Comments


dmanhattan commented Aug 26, 2019

Hi Cameron,

Thanks for creating this great library!

I'm having a few problems with the values predicted using customer_lifetime_value().

I would expect the LTV to be a slightly discounted version of the predicted purchases * average order value, but as shown in the screen shot below it's actually coming out much higher:

[Screenshot: table comparing customer_lifetime_value() output with predicted purchases × average order value]

In one example, a customer who has only spent $180 over 2 years (roughly $90 per year) is predicted to have a 12 month value of $2670, about 30 times their historical run rate.

Is there an error somewhere in the calculation, perhaps caused by the fix for Issue #180?

Cheers,

Duncan


orenshk commented Aug 27, 2019

what's the correlation between frequency and monetary value in your dataset?

@psygo added the use case label Aug 28, 2019
psygo (Collaborator) commented Aug 28, 2019

Please, whenever posting code, don't use screenshots; copy-paste the code itself. It's not only easier on the eyes, but it will most likely help those who are trying to help you. Even the example table you showed can be pasted as code.

what's the correlation between frequency and monetary value?

@orenshk's question is relevant, and the answer is something you should show us before we draw any conclusions.

Another thing you could plot in order to better diagnose the problem is the monetary value distribution itself (that's something I would like to add to the library in a future PR). Essentially, you plot the supposedly gamma-distributed monetary values from your data alongside the gamma distribution implied by the model and compare the two; a rough sketch follows below. If they are far apart, try varying the penalizer_coef until you reach a better fit. This is mentioned in one of Prof. Bruce Hardie's notes on the Gamma-Gamma model.
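
Here is one way to do that comparison (just a sketch, not library code; it assumes a fitted GammaGammaFitter called ggf and a DataFrame summary of repeat customers with frequency and monetary_value columns, using the p, q, v parameterization from Hardie's note):

import numpy as np
import matplotlib.pyplot as plt

# fitted Gamma-Gamma parameters: per-transaction spend ~ Gamma(p, nu), with nu ~ Gamma(q, rate v)
p, q, v = ggf.params_['p'], ggf.params_['q'], ggf.params_['v']

rng = np.random.default_rng(42)
x = summary['frequency'].values  # repeat customers only (frequency > 0)

# simulate each customer's observed average order value under the fitted model:
# the mean of x draws from Gamma(p, nu) is Gamma(p * x, rate nu * x)
nu = rng.gamma(shape=q, scale=1.0 / v, size=len(x))
simulated_avg = rng.gamma(shape=p * x, scale=1.0 / (nu * x))

plt.hist(summary['monetary_value'], bins=50, density=True, alpha=0.5, label='data')
plt.hist(simulated_avg, bins=50, density=True, alpha=0.5, label='model')
plt.legend()
plt.show()

If the two histograms are clearly different, the Gamma-Gamma fit (and therefore the CLV numbers built on top of it) shouldn't be trusted.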

dmanhattan (Author) commented Aug 28, 2019

Here's the correlation data:

import scipy.stats
scipy.stats.spearmanr(rfm_data['frequency'], rfm_data['monetary_value'])
SpearmanrResult(correlation=0.0305697204235341, pvalue=0.11448729124506023)

scipy.stats.pearsonr(rfm_data['frequency'], rfm_data['monetary_value'])
(-0.04303217429907956, 0.02626314703027851)

import matplotlib.pyplot as plt
plt.scatter(rfm_data['frequency'], rfm_data['monetary_value'])
plt.show()

[Scatter plot of frequency vs monetary_value]

Apologies for copy-pasting an image — here's the code:

# Perform average monetary value analysis for repeat purchasers only
from lifetimes import BetaGeoFitter, GammaGammaFitter

# BG/NBD transaction model (fit reproduced here so the snippet is self-contained)
bgf = BetaGeoFitter(penalizer_coef=0)
bgf.fit(rfm_data['frequency'], rfm_data['recency'], rfm_data['T'])

returning_customers_summary = rfm_data[rfm_data['frequency'] > 0]

ggf = GammaGammaFitter(penalizer_coef=0)
ggf.fit(
    returning_customers_summary['frequency'],
    returning_customers_summary['monetary_value']
)

# Expected number of purchases over the next 90 days
rfm_data['90_day_expected_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(
    90, rfm_data['frequency'], rfm_data['recency'], rfm_data['T']
)

# Calculate expected average order value for each customer
rfm_data['average_order_revenue'] = ggf.conditional_expected_average_profit(
    rfm_data['frequency'],
    rfm_data['monetary_value']
)

# Predict LTV over the period

# There appears to be an issue in this method
# see https://github.com/CamDavidsonPilon/lifetimes/issues/313
rfm_data['90_day_predicted_ltv'] = ggf.customer_lifetime_value(
    bgf,
    rfm_data['frequency'],
    rfm_data['recency'],
    rfm_data['monetary_value'],
    rfm_data['T'],
    time=3,  # number of months to predict
    discount_rate=0.01,
    freq='D'
)

rfm_data['90_day_predicted_ltv_manual'] = (
    rfm_data['90_day_expected_purchases'] * rfm_data['average_order_revenue']
).round(2)

rfm_data
frequency recency T monetary_value probability_alive 90_day_expected_purchases average_order_revenue 90_day_predicted_ltv 90_day_predicted_ltv_manual
0.0 0.0 711.0 0.000000 0.894723 0.052595 50.817010 10.206675 2.67
0.0 0.0 633.0 0.000000 0.899626 0.057467 50.817010 10.206675 2.92
0.0 0.0 707.0 0.000000 0.894969 0.052826 50.817010 10.206675 2.68
3.0 554.0 720.0 39.733333 0.984319 0.326599 41.193103 661.351880 13.45
0.0 0.0 248.0 0.000000 0.927597 0.103553 50.817010 10.206675 5.26
0.0 0.0 145.0 0.000000 0.936630 0.130707 50.817010 10.206675 6.64
0.0 0.0 582.0 0.000000 0.902946 0.061144 50.817010 10.206675 3.11
2.0 385.0 696.0 90.750000 0.969252 0.239031 83.348384 374.311556 19.92

It seems like there must be an issue in the customer_lifetime_value() function that is perhaps not deducting the previous period as it iterates over time:

for i in steps * factor:
    # since the prediction of number of transactions is cumulative, we have to
    # subtract off the previous periods
    expected_number_of_transactions = transaction_prediction_model.predict(
        i, frequency, recency, T
    ) - transaction_prediction_model.predict(i - factor, frequency, recency, T)
    # sum up the CLV estimates of all of the periods and apply discounted cash flow
    df["clv"] += (monetary_value * expected_number_of_transactions) / (1 + discount_rate) ** (i / factor)

psygo (Collaborator) commented Aug 31, 2019

This does seem pretty weird indeed.

It seems like there must be an issue in the customer_lifetime_value() function that is perhaps not deducting the previous period as it iterates over time:

But isn't that exactly what the transaction_prediction_model.predict(i - factor, frequency, recency, T) term is doing? If anything, I would suspect the (i / factor) term. In my opinion, what's really strange is the summation in the for loop (+=): isn't the last prediction already the cumulative total? Why are we summing the per-period predictions if they are supposedly already accounted for (it seems to me that we are building a future history of transactions for the customer and then wrongly summing it up)? Maybe you could try that and see if it gives better results?

I will have to do some more digging, but, unfortunately, I don't have much time for that right now (maybe in a week or two).

At any rate, you do need to check the fit of your Gamma-Gamma model. The first thing you could do is check the negative log-likelihood and see whether varying the penalizer_coef gets you a smaller value. Secondly, you should check whether the gamma distribution of monetary_value implied by the model is close to the real monetary_value distribution, i.e., the one in your data; scipy.stats has gamma distribution functions that may be of help here. If the model's gamma distribution sits far to the right, it will feed values much higher than the average monetary_value into the customer_lifetime_value() function, producing the weird behavior you've seen. A quick check along those lines is sketched below.
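
For example (a rough sketch, assuming the rfm_data frame from your comment above; the variable names are placeholders), you can refit with a few penalizer values and compare the model's expected average order value against the observed mean for repeat customers, rather than the log-likelihood:

from lifetimes import GammaGammaFitter

repeat = rfm_data[rfm_data['frequency'] > 0]
observed_mean = repeat['monetary_value'].mean()

# refit with several penalizer values and compare model vs. observed mean spend
for coef in [0.0, 0.001, 0.01, 0.1]:
    ggf_check = GammaGammaFitter(penalizer_coef=coef)
    ggf_check.fit(repeat['frequency'], repeat['monetary_value'])
    model_mean = ggf_check.conditional_expected_average_profit(
        repeat['frequency'], repeat['monetary_value']
    ).mean()
    print(coef, round(observed_mean, 2), round(model_mean, 2))

If no penalizer_coef brings the two close together, the monetary value model itself is off, and the CLV numbers will inherit that error.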

@psygo added the bug label Aug 31, 2019
@ColtAllen

I've a good idea of what's causing this, which I've summed up here and will be working on in the btyd successor library.
