Time-series validation workflow #7

AntonBiryukovUofC · 2020-05-07T19:29:00Z

Hello folks,

I really like the idea of your package and the approach. I was just curious how difficult it might be to introduce a custom cross-validation scheme (or even just a time-series-aware one).

I could probably assist in that with a bit of guidance from you :)

Thanks,

Anton.

AntonBiryukovUofC · 2020-05-07T19:35:49Z

As a baseline, we could start with a non-shuffled KFold, or the TimeSeriesSplit provided in scikit-learn.
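
For readers comparing the two baselines, here is a minimal sketch of how their splits differ on time-ordered data. The toy array `X` is an assumption made purely for illustration; only `KFold` and `TimeSeriesSplit` come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations, assumed to be in time order

# Non-shuffled KFold: test blocks are contiguous, but the training data
# may come from after the test block (e.g. in the first fold).
kfold_splits = list(KFold(n_splits=3, shuffle=False).split(X))

# TimeSeriesSplit: every training index precedes every test index.
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X))

for train_idx, test_idx in ts_splits:
    print("train:", train_idx, "test:", test_idx)
```

So a non-shuffled KFold keeps neighbouring observations together, while only TimeSeriesSplit guarantees that the model is never trained on the future.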

8080labs · 2020-05-08T05:50:21Z

Hi Anton
That sounds great. We can pass through a cv argument that behaves like a scikit-learn cross-validator.
Have you already had a look at the code? Please let me know if anything is unclear.
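
A rough sketch of what such a pass-through could look like. The helper `cv_score` and its signature are hypothetical and not part of ppscore; `check_cv` and `cross_val_score` are existing scikit-learn functions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, check_cv, cross_val_score
from sklearn.tree import DecisionTreeRegressor


def cv_score(X, y, cv=4):
    """Hypothetical helper: score a tree under a caller-supplied CV scheme."""
    # check_cv accepts an int, None, a splitter object, or an iterable of
    # (train, test) index pairs and returns an object with a split() method.
    splitter = check_cv(cv)
    model = DecisionTreeRegressor(random_state=0)
    scores = cross_val_score(
        model, X, y, cv=splitter, scoring="neg_mean_absolute_error"
    )
    return scores.mean()


rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=100)

default_score = cv_score(X, y)  # an int becomes a non-shuffled KFold
ts_score = cv_score(X, y, cv=TimeSeriesSplit(n_splits=4))  # time-ordered folds
```

Accepting anything `check_cv` understands would let callers keep the current default while opting into a time-series splitter when they need one.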
Florian

AntonBiryukovUofC · 2020-05-08T18:35:25Z

Not yet, will dig in over the weekend!

AntonBiryukovUofC · 2020-05-10T20:34:18Z

So here's what needs to be done, unless I missed anything:
Generally, for inspiration on various CV schemes, I was thinking of using mlxtend by Sebi Raschka, e.g. similar to what he uses here:

http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

Re: your codebase:

_calculate_model_cv_score_ - this function needs revamping, likely with the encoders plugged into a Pipeline object to avoid leakage. Alternatively, we can explicitly write a for-loop over the splits and avoid using cross_val_score.
_mae_normalizer(df, y, model_score) - we need to think about what to do here, as the median/baseline should be calculated over the splits of the given CV object. The most likely solution is the same as above: an explicit for-loop over cv.split(X) or something similar.
Pass the CV object somehow to score and matrix.
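
The explicit loop described above could be sketched roughly as follows. This is not the package's code: the helper name `fold_wise_normalized_mae`, the synthetic data, and the `1 - model_mae / baseline_mae` normalization floored at 0 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor


def fold_wise_normalized_mae(X, y, cv):
    """Hypothetical helper, not part of ppscore."""
    scores = []
    for train_idx, test_idx in cv.split(X):
        model = DecisionTreeRegressor(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        model_mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))

        # The baseline uses only the training fold's median, so nothing
        # from the test fold leaks into it.
        baseline_pred = np.full(len(test_idx), np.median(y[train_idx]))
        baseline_mae = mean_absolute_error(y[test_idx], baseline_pred)

        # Assumed normalization: 1 - model/baseline, floored at 0.
        if baseline_mae == 0:
            scores.append(0.0)
        else:
            scores.append(max(0.0, 1.0 - model_mae / baseline_mae))
    return float(np.mean(scores))


rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=200)
score = fold_wise_normalized_mae(X, y, TimeSeriesSplit(n_splits=4))
```

Computing the model error and the baseline error inside the same loop keeps both tied to identical train/test indices, which is the point of avoiding cross_val_score here.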

@8080labs Is there anything I missed?

AntonBiryukovUofC · 2020-05-11T00:52:11Z

A second look at cross_val_score() makes me think that we could also introduce a new scorer that calculates a baseline and the decision tree score simultaneously. I prefer doing the two things explicitly, though.

UPD: I think I have implemented most of what is necessary. I will try to test it on some simple examples. It is probably worth writing some tests that are sensitive to CV changes.
Here, check this out (also check the tests in that branch, and let me know if I should open a PR for it):

https://github.com/AntonBiryukovUofC/ppscore/blob/custom_cv_regression/src/ppscore/calculation.py

Any chance you could create a dev branch so I could stage a PR?

In the meantime, I'll think about a test that passes or fails depending on whether KFold is used with or without shuffle=True, as well as a time-series-related test case (should be easy given we have a DecisionTree here).
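
One possible shape for such a test, sketched under assumptions: the linear-trend data, the helper `mean_mae`, and the factor of 0.5 are illustrative choices, not taken from the branch. It relies on the fact that a decision tree cannot extrapolate beyond the target range it was trained on.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor


def mean_mae(cv, X, y):
    """Hypothetical helper: mean absolute error averaged over the folds."""
    scores = cross_val_score(
        DecisionTreeRegressor(random_state=0), X, y,
        cv=cv, scoring="neg_mean_absolute_error",
    )
    return -scores.mean()


def test_score_is_sensitive_to_cv():
    # A pure linear trend: the tree interpolates well but cannot extrapolate.
    t = np.arange(200, dtype=float)
    X, y = t.reshape(-1, 1), 2.0 * t

    shuffled = mean_mae(KFold(n_splits=5, shuffle=True, random_state=0), X, y)
    ordered = mean_mae(KFold(n_splits=5, shuffle=False), X, y)
    forward = mean_mae(TimeSeriesSplit(n_splits=5), X, y)

    # Shuffling leaks neighbouring time steps into training, so the error
    # should be far lower than with either time-respecting scheme.
    assert shuffled < 0.5 * ordered
    assert shuffled < 0.5 * forward
```

If the CV object were silently ignored somewhere in the scoring path, the three errors would coincide and the assertions would fail.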

8080labs · 2020-05-12T19:26:16Z

@AntonBiryukovUofC I've created a dev branch. Looking forward to your PR :)
(I looked at your code but it's better to wait for the PR so I can see the diff)

Cheers,
Tobias

FlorianWetschoreck self-assigned this Jul 13, 2020
