FUTURE: add in a subset of new features, prune out the non-useful ones, repeat #59

Open
ClimbsRocks opened this issue Dec 8, 2015 · 0 comments

Comments

@ClimbsRocks
Owner

right now, any time we add in new features (polynomialFeatures.py, groupBy.py, imputingMissingValues.py, etc.), we add them all in at once as one big group.

only much later, once all of these new features have been aggregated together, do we perform feature selection.

it might make much more sense to perform feature selection at the end of each file where we add in new features.

and then, if we wanted to optimize further, it might make more sense to add in only a subset of new features, perform feature selection, and then add in the next subset of new features.
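a minimal sketch of that add-a-subset-then-prune loop, in plain python so it stands alone. everything here is hypothetical and not taken from the repo: the names `abs_correlation` and `add_and_prune`, the dict-of-columns layout, and especially the scoring rule (absolute pearson correlation with the target), which is just a cheap stand-in for whatever feature selection we actually use.

```python
import math


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_x * var_y)


def add_and_prune(feature_batches, target, keep_top_k):
    """Add one batch of features at a time, pruning after every batch.

    feature_batches: list of dicts mapping feature name -> list of values
                     (assumed layout; one dict per feature-engineering step).
    Returns a dict holding only the features that survived every round.
    """
    survivors = {}
    for batch in feature_batches:
        # Add the next subset of new features to the current survivors.
        survivors.update(batch)
        # Rank everything currently held and keep only the best keep_top_k.
        ranked = sorted(
            survivors,
            key=lambda name: abs_correlation(survivors[name], target),
            reverse=True,
        )
        survivors = {name: survivors[name] for name in ranked[:keep_top_k]}
    return survivors
```

with this shape, we never hold more than `keep_top_k` columns plus one batch at a time, and every surviving feature gets re-scored in each later round.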

this would add computation time but save memory, since pruned features are dropped before the next subset is added.

this would ensure that any features that ultimately made it through so many rounds of feature selection were really, really robust.

however, it would probably also cut out some marginally useful, borderline features, which might make the cut in one round but not the next.
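one way to see which features are borderline is to count how often each one makes the cut across several slices of the data. a rough sketch, under the same assumptions as before: `survival_counts` and `abs_correlation` are hypothetical names, the fold split is a simple every-nth-row holdout, and absolute pearson correlation stands in for the real selection step.

```python
import math


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_x * var_y)


def survival_counts(features, target, n_folds, keep_top_k):
    """Count, per feature, how many folds it ranks in the top keep_top_k.

    features: dict mapping feature name -> list of values (assumed layout).
    """
    n_rows = len(target)
    counts = {name: 0 for name in features}
    for fold in range(n_folds):
        # Hold out every n_folds-th row, starting at offset `fold`.
        rows = [i for i in range(n_rows) if i % n_folds != fold]
        fold_target = [target[i] for i in rows]
        scores = {
            name: abs_correlation([column[i] for i in rows], fold_target)
            for name, column in features.items()
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        for name in ranked[:keep_top_k]:
            counts[name] += 1
    return counts
```

a feature that survives in every fold is a safe keep; one that survives in only some folds is the borderline kind described above, and we could choose a threshold for those instead of letting a single round decide.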

what this would ultimately do is let us try many more features. since useless features would be pruned very quickly, we could keep adding things without worrying about a memory explosion. it's highly unlikely that all of our feature engineering is going to be useful, but it is highly likely that some of it will be. i would rather have the opportunity to try everything, and let the data decide what's best for this particular dataset.

we do, of course, risk overfitting, but we're using so much cross-validation that i'm not too concerned.
