
Evaluate metrics over time #68

Open · gverbock opened this issue Feb 10, 2021 · 10 comments
Labels: enhancement (New feature or request), investigation needed (One to do research related to this issue, and share findings)

Comments

@gverbock
Contributor

Problem Description
It would be great if Probatus could report the metrics (including their volatility) over time, so that any drops in model performance can be spotted easily. The time aggregation level (day, month, quarter) would be chosen by the user.

Desired Outcome
The output would be a dataframe containing the following columns: dates, metric1, metric2. This output could then be used for a plot like the following:
[image: example plot of metrics over time]
The possibility to evaluate on out-of-time samples would also be required.

Solution Outline
Maybe incorporate it into the metric_volatility class: pass a series with the dates to aggregate on, and use a groupby before computing the metrics.
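A minimal sketch of the groupby idea (illustrative only; assumes a binary classifier, with labels y, out-of-fold probabilities proba, and a dates series already available):

import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

# Group observations by calendar month and score each bucket, yielding the
# desired dataframe: one row per period, one column per metric.
df = pd.DataFrame({"y": y, "proba": proba, "date": pd.to_datetime(dates)})
by_month = df.groupby(df["date"].dt.to_period("M"))
result = pd.DataFrame({
    "auc": by_month.apply(lambda g: roc_auc_score(g["y"], g["proba"])),
    "avg_precision": by_month.apply(
        lambda g: average_precision_score(g["y"], g["proba"])),
})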

@gverbock added the enhancement label Feb 10, 2021
@operte
Contributor

operte commented Feb 15, 2021

I'm not sure, but is this something that can be done with popmon?

@timvink
Collaborator

timvink commented Feb 15, 2021

Have you thought about what a potential API would look like (pseudocode)?

@Matgrb
Collaborator

Matgrb commented Feb 16, 2021

I think this could be done by extending BaseVolatilityEstimator and implementing something similar to TrainTestVolatility, with one crucial difference:

When you split the data into train and test, you take the time column into account:

  • Stratify the split based on the time column; this allows train and test samples to come from the entire time span. Repeating this split multiple times allows you to plot the volatility of the out-of-sample metric over time.
  • Split the data into multiple time-based folds. At each step, test on one fold and train on the remaining folds. To get time-based volatility, you can apply bootstrapping to the train and test folds. This will basically tell you how different and how volatile a given time-based fold is when predicted by a model trained on the other folds.
  • Split the data into multiple time-based folds, then apply the schema shown in the image below. To get the volatility, you can again apply bootstrapping to the train and test folds. This will tell you how volatile the model is with OOT splits, and how much data you need for training to get a stable OOT result.

[image: time-based fold schema]

The first option seems easiest to implement, reusing most of the existing code; the remaining two would be more difficult. I suppose the first one would be a good starting point.
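A minimal sketch of the first option (assuming a dates series aligned with X; these are illustrative names, not existing probatus code):

import pandas as pd
from sklearn.model_selection import train_test_split

# Stratify the split on a coarse time bucket so that both train and test
# cover the entire time span; repeat with different random_state values to
# estimate the out-of-sample volatility per bucket.
time_bucket = pd.PeriodIndex(dates, freq="Q").astype(str)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=time_bucket, random_state=0)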

In all cases, you would need to consider a new plotting method, which would be similar for all time-based metric volatility estimators. You could also make another base class, BaseTimeBasedVolatilityEstimator, which overwrites the plot method of BaseVolatilityEstimator.

Regarding the use of popmon: we could try to use it for plotting; however, I think this is a minor part of the feature, and we could get a more efficient implementation if we do it ourselves.

@gverbock
Contributor Author

gverbock commented Feb 17, 2021

My thoughts were to start simple:

Having something like:

class PerformanceOverTimeEstimator:
    def __init__(self, model, X, y, scorer_list, dates, frequency):
        # the model is already fitted: hyperparameter optimization is done outside
        self.model = model
        self.X, self.y = X, y
        self.scorer_list = scorer_list
        self.dates = dates          # series of timestamps aligned with X
        self.frequency = frequency  # aggregation level, e.g. month or quarter

    def bootstrap_process(self, n_iterations=1000):
        # time_stratified_sampling and compute_scores_over_time are
        # helpers still to be implemented (see below)
        proba = self.model.predict_proba(self.X)[:, 1]
        results = []
        for _ in range(n_iterations):
            y_boot, proba_boot, dates_boot = time_stratified_sampling(
                self.y, proba, self.dates, self.frequency)
            results.append(compute_scores_over_time(
                y_boot, proba_boot, dates_boot, self.frequency, self.scorer_list))
        return results

    def plot_results(self):
        ...

    def results_as_table(self):
        ...

I had in mind to take a fitted model as an argument, so that hyperparameter optimization is done outside the class.

@Matgrb reopened this Feb 17, 2021
@Matgrb
Collaborator

Matgrb commented Feb 17, 2021

Possible improvements:

  • X_proba could be computed with cross-validation using cross_val_predict, to ensure there is no leakage (see the sketch below this list).
  • Let's try to stick to the probatus API: init(clf, metrics, ...), fit(X, y, ...), compute(metrics, ...), plot().
  • The clf provided by the user can be a model, or a model wrapped in GridSearchCV that will perform hyperparameter optimization at each training. So you don't have to worry too much about the hyperparameter optimization.
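For the first point, a minimal sketch (standard scikit-learn; binary classification assumed):

from sklearn.model_selection import cross_val_predict

# Out-of-fold probabilities: each row is predicted by a model that never
# saw it during training, so the scores per time bucket are leakage-free.
X_proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]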

What would time_stratified_sampling and compute_scores_over_time do? Also, what would the frequency parameter do?

What would be the use case for this code? Could you provide an example of what this analysis tells you about the model/data?

@gverbock
Contributor Author

gverbock commented Feb 17, 2021

Good points, Mateusz.

  • frequency would set the level of aggregation over time: monthly, quarterly, ...
  • time_stratified_sampling would ensure the bootstrap sample is homogeneously distributed across time.
  • compute_scores_over_time would compute the score metrics for each unit of time (month, quarter); see the sketch below.
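A rough sketch of what these two helpers could look like (hypothetical names from the pseudocode above, simplified to a single metric, AUC, instead of the full scorer_list; assumes pandas Series inputs):

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def time_stratified_sampling(y, proba, dates, frequency="M", seed=None):
    # Bootstrap *within* each time bucket, so the resample stays
    # homogeneously distributed across time.
    rng = np.random.default_rng(seed)
    periods = pd.PeriodIndex(dates, freq=frequency)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(periods == p),
                   size=(periods == p).sum(), replace=True)
        for p in periods.unique()
    ])
    return y.iloc[idx], proba.iloc[idx], dates.iloc[idx]

def compute_scores_over_time(y, proba, dates, frequency="M"):
    # One metric value (here AUC) per time bucket. Assumes every bucket
    # contains both classes, otherwise AUC is undefined.
    periods = pd.PeriodIndex(dates, freq=frequency)
    df = pd.DataFrame({"y": np.asarray(y), "proba": np.asarray(proba)})
    return df.groupby(periods).apply(
        lambda g: roc_auc_score(g["y"], g["proba"]))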

The benefit of the new code is that the user sees (for example) the AUC over time and can easily spot performance degradation in specific months (say Covid, summer holidays, ...).

This helps you assess the impact of unexpected changes (for example Covid, a crisis, bad publicity), and understanding the reason may also reveal weaknesses of the model. For example, if the model starts to deteriorate once mortgage production increases, you could try to mitigate this by adding features related to mortgages.

@Matgrb
Collaborator

Matgrb commented Feb 17, 2021

So to summarize:

  1. Compute probabilities for the entire X using cross-validation
  2. Split the data into time buckets
  3. For each window of data, randomly sample examples multiple times (bootstrapping) and measure the metric, e.g. AUC, each time
  4. Produce a plot and report on the volatility of the metric in each time bucket

Is that correct?
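If so, the end-to-end loop might look roughly like this (a sketch reusing the hypothetical helpers from the previous comment, with 1000 bootstrap iterations as in the pseudocode, and y and dates assumed to be pandas Series aligned with X):

import pandas as pd
from sklearn.model_selection import cross_val_predict

# 1. out-of-fold probabilities for the entire X (no leakage)
proba = pd.Series(
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1])

# 2.-3. bootstrap within time buckets and score each bucket repeatedly
all_scores = []
for _ in range(1000):
    y_boot, proba_boot, dates_boot = time_stratified_sampling(y, proba, dates)
    all_scores.append(compute_scores_over_time(y_boot, proba_boot, dates_boot))

# 4. mean and volatility of the metric per time bucket
report = pd.concat(all_scores, axis=1).agg(["mean", "std"], axis=1)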

I like the approach for its simplicity. It tells you which periods of time to be cautious about, and points to possible data drifts. It is similar to issue #72, but it focuses on how the performance of target prediction changes over time.

The limitation I see is that when you compute the probabilities for X using CV, the model is trained on data from the entire time span. Imagine you have a sample in the middle of the dataset: the model has seen samples from both before and after it.

Let's also ask the others what they think: @timvink @anilkumarpanda @operte

@gverbock
Contributor Author

You understood it correctly.

I am not sure the limitation you raise about the cross-validation would have a large impact.

@Matgrb
Collaborator

Matgrb commented Feb 17, 2021

Indeed, probably low impact 👍

However, I would reach out to a couple of users and see if they would find it useful for their projects.

@Matgrb added the investigation needed label Feb 26, 2021
@ReinierKoops
Contributor

Is this still a feature that we want to work on, @gverbock @anilkumarpanda?
