Time-series validation workflow #7

Open
AntonBiryukovUofC opened this issue May 7, 2020 · 6 comments

@AntonBiryukovUofC

Hello folks,

I really like the idea of your package and the approach. I was curious how difficult it would be to introduce a custom cross-validation (CV) scheme, or even just one that is meaningful for time series.

I could probably assist with that, given a bit of guidance from you :)

Thanks,

Anton.

@AntonBiryukovUofC
Author

As a baseline, we could start with a non-shuffled KFold, or the TimeSeriesSplit splitter provided in scikit-learn.
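For illustration, here is a minimal sketch of how the two baseline options differ on toy, time-ordered data. It is not part of ppscore, and the helper name `describe_splits` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# 12 samples, assumed to be ordered in time (toy data, not from ppscore).
X = np.arange(12).reshape(-1, 1)


def describe_splits(cv, X):
    """Hypothetical helper: list (train_indices, test_indices) for every fold."""
    return [(train.tolist(), test.tolist()) for train, test in cv.split(X)]


# Non-shuffled KFold: contiguous test blocks, but training data may come
# from *after* the test block.
kfold_splits = describe_splits(KFold(n_splits=3, shuffle=False), X)

# TimeSeriesSplit: every training index precedes every test index.
ts_splits = describe_splits(TimeSeriesSplit(n_splits=3), X)

for train, test in ts_splits:
    print("train:", train, "test:", test)
```

With the non-shuffled KFold, the first fold trains on later samples and tests on earlier ones, which is usually what a time-series validation wants to avoid.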

@8080labs
Owner

8080labs commented May 8, 2020

Hi Anton,
that sounds great. We can pass through a cv argument that behaves like a scikit-learn cross-validator.
Have you already had a look at the code? Please let me know if anything is unclear.
Florian
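
A minimal sketch of what accepting such a cv argument could look like, using scikit-learn's `check_cv` to normalize the input. The helper name `resolve_cv` and the default of 4 folds are assumptions for illustration, not ppscore's actual API.

```python
from sklearn.model_selection import KFold, TimeSeriesSplit, check_cv


def resolve_cv(cv=4):
    """Hypothetical helper: turn a user-supplied `cv` into a splitter object.

    `cv` may be an int (number of folds) or any object with a `split` method,
    which is how scikit-learn cross-validators behave.
    """
    # classifier=False makes an int resolve to a plain (non-stratified) KFold.
    return check_cv(cv, classifier=False)


default_cv = resolve_cv()                          # int -> KFold
ts_cv = resolve_cv(TimeSeriesSplit(n_splits=3))    # splitter passed through
```

Any object returned this way can then be iterated with `cv.split(X)` regardless of what the caller supplied.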

@AntonBiryukovUofC
Author

Not yet, will dig in over the weekend!

@AntonBiryukovUofC
Author

AntonBiryukovUofC commented May 10, 2020

So here's what needs to be done, in case I did not miss anything.
For inspiration on supporting various CV schemes, I was thinking of using mlxtend by Sebi Raschka, e.g. similar to what he does here:

http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

Re: your codebase:

  1. _calculate_model_cv_score_ - this function needs revamping, most likely with the encoders plugged into a Pipeline object to avoid leakage. Alternatively, we can write an explicit for-loop over the splits and avoid using cross_val_score.
  2. _mae_normalizer(df, y, model_score) - we need to think about what to do here, as the median baseline should be calculated over the splits of the given CV object. The most likely solution is the same as above: an explicit for-loop over cv.split(X) or something similar.
  3. Pass the CV object through to score and matrix somehow.
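
The explicit loop from points 1 and 2 could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the ppscore implementation: the function name `cv_mae_with_baseline` is hypothetical, and the final normalization (1 minus the MAE ratio, clipped at 0) is an assumption about how the two numbers might be combined.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor


def cv_mae_with_baseline(X, y, cv):
    """Hypothetical helper: per-fold model MAE and median-baseline MAE."""
    model_maes, baseline_maes = [], []
    for train_idx, test_idx in cv.split(X):
        # Fit only on the training fold, so nothing leaks from the test fold.
        model = DecisionTreeRegressor(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        model_maes.append(mean_absolute_error(y[test_idx], pred))

        # Baseline: the median of the *training* fold's target.
        baseline = np.full(len(test_idx), np.median(y[train_idx]))
        baseline_maes.append(mean_absolute_error(y[test_idx], baseline))
    return float(np.mean(model_maes)), float(np.mean(baseline_maes))


# Synthetic trending series (assumption, for illustration only).
rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=100)

model_mae, baseline_mae = cv_mae_with_baseline(X, y, TimeSeriesSplit(n_splits=4))

# Assumed normalization: 1 - model/baseline, clipped at 0.
normalized = max(0.0, 1.0 - model_mae / baseline_mae)
```

Any preprocessing, such as an encoder, would need to be fitted inside the loop on the training fold as well, for example by wrapping it with the model in a Pipeline.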

@8080labs Is there anything I missed?

@AntonBiryukovUofC
Author

AntonBiryukovUofC commented May 11, 2020

A second look at cross_val_score() makes me think that we could also introduce a new scorer that calculates the baseline and the decision tree score simultaneously. I prefer doing the two things explicitly, though.
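
One way to keep the two computations explicit while still relying on cross_val_score is to score a median-predicting DummyRegressor next to the tree, on the same cv object. This is only a sketch on synthetic data and not the code in the linked branch.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): y depends non-linearly on a single feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# The same deterministic splitter is used for both runs,
# so model and baseline are evaluated on identical folds.
cv = KFold(n_splits=4, shuffle=False)
scoring = "neg_mean_absolute_error"

tree_mae = -cross_val_score(
    DecisionTreeRegressor(random_state=0), X, y, cv=cv, scoring=scoring
)
# DummyRegressor(strategy="median") predicts the training-fold median.
baseline_mae = -cross_val_score(
    DummyRegressor(strategy="median"), X, y, cv=cv, scoring=scoring
)
```

Both arrays hold one MAE per fold, so they can be compared fold by fold or averaged.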

UPD: I think I have implemented most of what is necessary. I will try to test it on some simple examples. It is probably worth writing some tests that are sensitive to CV changes.
Here, check this out (also check the tests in that branch, and let me know if I should open a PR for it):

https://github.com/AntonBiryukovUofC/ppscore/blob/custom_cv_regression/src/ppscore/calculation.py

Any chance you could create a dev branch so I could stage a PR?

In the meantime, I'll think about a test that passes or fails depending on whether KFold is used with or without shuffle=True, as well as a time-series-related test case (this should be easy given that we have a DecisionTree here).
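
A rough sketch of such a test, assuming a monotonically increasing target: a decision tree cannot extrapolate beyond the target range it saw in training, so ordered folds should give a much larger error than shuffled ones. The factor of 5 in the assertion and all names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Toy "time series": the target grows linearly with the (ordered) feature.
X = np.arange(200, dtype=float).reshape(-1, 1)
y = X.ravel().copy()


def mean_cv_mae(cv):
    """Mean absolute error of a decision tree, averaged over the folds of `cv`."""
    scores = cross_val_score(
        DecisionTreeRegressor(random_state=0),
        X,
        y,
        cv=cv,
        scoring="neg_mean_absolute_error",
    )
    return float(-scores.mean())


mae_ordered = mean_cv_mae(KFold(n_splits=4, shuffle=False))
mae_shuffled = mean_cv_mae(KFold(n_splits=4, shuffle=True, random_state=0))


def test_score_is_sensitive_to_cv():
    # Ordered folds force the tree to predict outside the range it was
    # trained on, so the error should be clearly larger than with shuffling.
    assert mae_ordered > 5 * mae_shuffled
```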

@8080labs
Owner

@AntonBiryukovUofC I've created a dev branch. Looking forward to your PR :)
(I looked at your code but it's better to wait for the PR so I can see the diff)

Cheers,
Tobias

FlorianWetschoreck self-assigned this Jul 13, 2020

3 participants