[DEM 303] Adding functions from mosaic-modeling
#64
Conversation
geoml/feature_selection_new.py
Outdated
# if there is only one feature to be selected, then tuning will take care of alpha value
if len(labels_x) == 1:
    return alpha_now
---
Handling case where we have one candidate feature
---
Do we have to check and be sure feat_n_sel == 1 as well? There may be a chance that alpha_initial does not result in 1 feature (but I'm not sure). Note this is related to the TODO question above:

TODO: Do we want to allow for zero features to be selected? If so, we should rethink our alpha range for modeling with one candidate feature.
---
I'm spinning my wheels on this one a bit. On one hand, we should ensure that our functions can handle all situations, but on the other hand, why the heck would we be running feature selection when there's only one feature? Haha.
This is something I can look into more thoroughly with some test data I already have loaded in this repo, but it'd be nice to agree on desired behavior. This is what I'm thinking:
- We don't want to select 0 features in the feature selection. If we are interested in looking at the null model, we can create it. This should be adjusted in the other helper functions.
- For lasso_feature_selection(), if we only have one feature to select from, we will determine the maximum alpha value for which we still have one feature selected (i.e., alpha_min_feats) and not zero features. Then, we will set alpha_max_feats = alpha_min_feats * 0.1 and continue on with the function. As this is a feature selection function, we do not need to find the best value of alpha for this model; that is done in the tuning step.
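A minimal sketch of that proposed behavior, assuming a plain scikit-learn Lasso stands in for the selection step. `single_feature_alpha_range()` and `_n_selected()` are hypothetical names for illustration, not part of geoml:

```python
import numpy as np
from sklearn.linear_model import Lasso

def _n_selected(x, y, alpha):
    """Number of nonzero lasso coefficients at this alpha (illustrative helper)."""
    return int(np.count_nonzero(Lasso(alpha=alpha, max_iter=10000).fit(x, y).coef_))

def single_feature_alpha_range(x, y, alpha_initial=1.0, factor=1.2, max_iter=500):
    alpha = alpha_initial
    # Shrink alpha until the lone candidate feature survives the penalty at all.
    for _ in range(max_iter):
        if _n_selected(x, y, alpha) == 1:
            break
        alpha /= factor
    # Grow alpha while the feature is still selected, so `alpha` lands near the
    # maximum value that keeps one feature (alpha_min_feats in the discussion).
    for _ in range(max_iter):
        if _n_selected(x, y, alpha * factor) != 1:
            break
        alpha *= factor
    alpha_min_feats = alpha
    alpha_max_feats = alpha_min_feats * 0.1  # per the suggestion above
    return alpha_max_feats, alpha_min_feats
```

This never returns a range that selects zero features, which matches the first bullet.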
---
Your code changes from today look good.
This function operates in four parts:

(1) Use "while" loops to search for a roughly maximum alpha value that achieves `n_feats`, which is called `alpha_now`. Construct an alpha range around `alpha_now` using factors of 10.

(2) A custom lasso feature selection function is created which returns -alpha if `n_feats` is achieved; otherwise, 0 is returned. Minimizing this function over an alpha range therefore maximizes alpha.

(3) Narrow the alpha range created in (1) to ensure that the global "minimum" value is achieved. Determine where `n_feats` is achieved at regular intervals in the alpha range using the function in (2). Then, expand the range out one index in either direction.

(4) Use scipy.optimize.minimize_scalar() to minimize the function in (2) across the alpha range from (3).

Raises Exception if the optimization step is unsuccessful.
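Steps (2) and (4) can be sketched roughly as follows. The data and names are illustrative, not the geoml implementation, and the range construction/narrowing of steps (1) and (3) is collapsed here into fixed bounds:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Lasso

def neg_alpha_if_n_feats(alpha, x, y, n_feats):
    """Step (2): -alpha when the lasso selects exactly n_feats features, else 0."""
    coef = Lasso(alpha=alpha, max_iter=10000).fit(x, y).coef_
    return -alpha if np.count_nonzero(coef) == n_feats else 0.0

# Toy data: two informative features out of five.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
y = 2.0 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(scale=0.1, size=100)

# Step (4): minimizing over alpha maximizes the alpha that achieves n_feats.
res = minimize_scalar(
    neg_alpha_if_n_feats,
    bounds=(1e-4, 1.0),  # stand-in for the alpha range from step (3)
    args=(x, y, 2),
    method="bounded",
)
# res.x approximates the largest alpha that still selects 2 features
```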
---
As I went through and tested/checked each part of this function, I added additional information here on what was going on and why.
plot_alpha_vs_n_features(
    alpha_min=alpha_min_feats,
    alpha_max=alpha_max_feats,
    x_train=x_train_copy,
    y_train=y_train,
    plot_save=plot_save,
    model_dir=model_dir,
    model_name=model_name,
    random_state=random_state,
)
---
Made this step into a separate function so that it can be used to help debug problems with lasso_feature_selection if needed.
---
Should the plot_save arg be deleted from this function? In general, I think creating and saving plots (same for csvs, rasters, etc.) should be in higher level application code rather than in library code. This probably isn't always true, but I think it is generally true and would be the case with feature selection in particular.
---
I think I see what you are saying. You want this function to always return the plot when called?
---
I think I'm suggesting to pull plot_alpha_vs_n_features() out of lasso_feature_selection(), separating those two actions. Perhaps plot_alpha_vs_n_features() should be named more explicitly to plot_lasso_alpha_vs_n_features() too?
It's fine as is, but just a suggestion.
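One way the suggestion could look in practice: have the (hypothetically renamed) plotting function build and return a figure, leaving the save decision to application code. This is a simplified, assumed version of the function; the real one also takes plot_save, model_dir, model_name, and random_state:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so library code never opens a window
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

def plot_lasso_alpha_vs_n_features(alpha_min, alpha_max, x_train, y_train, n_points=25):
    """Return (not save) a figure of selected-feature count vs. alpha."""
    alphas = np.geomspace(alpha_min, alpha_max, n_points)
    n_feats = [
        np.count_nonzero(Lasso(alpha=a, max_iter=10000).fit(x_train, y_train).coef_)
        for a in alphas
    ]
    fig, ax = plt.subplots()
    ax.plot(alphas, n_feats, marker="o")
    ax.set(xscale="log", xlabel="alpha", ylabel="n features selected")
    return fig  # caller decides whether to fig.savefig(...)
```

Application code that wants a file on disk can then call fig.savefig() itself, keeping I/O out of the library.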
pyproject.toml
Outdated
---
What do we want to change the version to here?
---
@tnigon What are your thoughts on a version for this story?
---
Let's bump to 1.0.0. This is the beginning of a major overhaul that will eliminate co-dependency on the old geoml database schema.
geoml/feature_selection_new.py
Outdated
    return feat_n_sel

# TODO: Will this ever get caught in a while loop?
---
I don't think it will, but we could add a few lines of code to ensure it doesn't. Something like limit to 100 or 500 iterations, and before it gives up, try multiplying alpha_now by something much higher than 1.2 - maybe 100 or 1000.
Then if it's still not down to 1 feature, just issue a warning or raise an error? I think any of these options are fine for now, and I don't really see this as being a troublesome part of the ML training process in my experience.
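A hedged sketch of the guard being discussed, with `feat_n_sel_at` standing in for the lasso feature-count computation (both names here are hypothetical, not the actual geoml code):

```python
def find_alpha_for_one_feature(feat_n_sel_at, alpha_initial=1.0, factor=1.2, max_iter=500):
    """Increase alpha until at most one feature is selected, with an iteration cap."""
    alpha_now = alpha_initial
    for _ in range(max_iter):
        if feat_n_sel_at(alpha_now) <= 1:
            return alpha_now
        alpha_now *= factor  # larger alpha -> stronger penalty -> fewer features
    raise RuntimeError(
        f"Could not reduce selection to one feature within {max_iter} iterations"
    )
```

Swapping the unbounded `while` for a capped loop like this guarantees termination, and raising (rather than warning) makes the failure loud if it ever happens.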
---
By allowing 500 iterations of the while loop from alpha_initial = 1 (default), the min and max alpha values that are attempted are 5.490e-80 and 4.67e39, respectively. Thus, I decided not to try the last-resort type approach you are suggesting in your first paragraph, since I figured that's probably enough. Perhaps we want to reduce the limit on iterations to 100, though? I'm not really sure what makes the most sense at this point.
c011aaf
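For context, the arithmetic behind those bounds, assuming one multiply or divide by factor = 1.2 per iteration (the quoted minimum of 5.490e-80 is roughly consistent with shrinking by factor**2 per pass instead, so the loop's downward step likely differs from this sketch):

```python
alpha_initial = 1.0
factor = 1.2

# 500 multiplicative steps span roughly 40 orders of magnitude each way.
alpha_hi = alpha_initial * factor**500  # ~3.9e39, same order as the quoted 4.67e39
alpha_lo = alpha_initial / factor**500  # ~2.6e-40
```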
---
I think the iteration limit put in as an arg looks good to me. That will ensure we don't get stuck in the while loop. If you think it's reasonable, you can bump the default down to 100 or 200, but it's a pretty minor thing. These iterations will happen in a matter of seconds.
---
Looks good!
Adding functions from mosaic-modeling

What?
- These functions come from the mosaic-modeling project and can operate (for the time being) as modules in geoml for future projects. I adjusted some of these functions from their application in mosaic-modeling and highlighted any major changes with comments.
- Made them consistent with geoml in terms of dependencies/versions, etc.

Why?
- Prepares geoml for the new database and workflow

PR Checklist
- Updated the version in _version.py and pyproject.toml
- Ran poetry update and committed pyproject.toml and poetry.lock
- Ran poetry run pre-commit autoupdate

Breaking Changes