
[DEM 303] Adding functions from mosaic-modeling #64

Merged (15 commits) Mar 29, 2023

Conversation

@marissakivi (Contributor) commented Mar 22, 2023

Adding functions from mosaic-modeling

What?

  • Created two new files of reusable functions ported from the mosaic-modeling project; for the time being they operate as modules in geoml for future projects. I adjusted some of these functions from their mosaic-modeling implementations and highlighted any major changes with comments.
  • Dusted off geoml's dependencies, versions, etc.

Why?

  • Avoiding redundant work across project repos
  • Starting to rebuild and adapt geoml for the new database and workflow

CR Checklist

  • Ensure that the version changes make sense
  • Look through the two new files and the included functions. Is there any part/function that could be better generalized? Is every function operating as it should?
  • Do the changes I made make sense? Do we want to add any TODOs for later/make any changes now before starting project fulfillment?

PR Checklist

  • Merged latest main
  • Updated version number
  • Version numbers match between package _version.py and pyproject.toml
  • Ran poetry update and committed pyproject.toml and poetry.lock
  • Ran poetry run pre-commit autoupdate

Breaking Changes

Comment on lines 74 to 76
# if there is only one feature to be selected, then tuning will take care of alpha value
if len(labels_x) == 1:
    return alpha_now
@marissakivi (Contributor, Author):
Handling case where we have one candidate feature

@tnigon (Contributor) commented Mar 27, 2023:

Do we also need to check that feat_n_sel == 1? There may be a chance that alpha_initial does not result in 1 feature (but I'm not sure).

Note this is related to the TODO question above:

TODO: Do we want to allow for zero features to be selected? If so, we should rethink our alpha range for modeling with one candidate feature.

@marissakivi (Contributor, Author):

I'm spinning my wheels on this one a bit. On one hand, we should ensure that our functions can handle all situations, but on the other hand, why the heck would we be running feature selection when there's only one feature? Haha.

This is something I can look into more thoroughly with some test data I already have loaded in this repo, but it'd be nice to agree on desired behavior. This is what I'm thinking:

  • We don't want to select 0 features in the feature selection. If we are interested in looking at the null model, we can create it. This should be adjusted in the other helper functions.
  • For lasso_feature_selection(), if we only have one feature to select from, we will determine the maximum value for which we still have one feature selected (i.e., alpha_min_feats) and not zero features. Then, we will set alpha_max_feats = alpha_min_feats * 0.1 and continue on with the function. As this is a feature selection function, we do not need to find the best value of alpha for this model; that is done in the tuning step.
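A minimal sketch of the single-candidate-feature plan described above, assuming scikit-learn's Lasso. The helper names (`count_selected`, `single_feature_alpha_range`) are illustrative, not geoml's actual API:

```python
import numpy as np
from sklearn.linear_model import Lasso

def count_selected(alpha, x, y):
    """Number of features with nonzero Lasso coefficients at this alpha."""
    model = Lasso(alpha=alpha, max_iter=10_000)
    model.fit(x, y)
    return int(np.sum(model.coef_ != 0))

def single_feature_alpha_range(x, y, alpha_initial=1.0):
    """Find roughly the largest alpha that still selects the lone
    candidate feature (alpha_min_feats), then set alpha_max_feats an
    order of magnitude below it, per the proposal above."""
    alpha = alpha_initial
    # shrink alpha until the single feature is selected at all
    while count_selected(alpha, x, y) == 0:
        alpha /= 1.2
    # grow alpha to (roughly) the largest value that still keeps it
    while count_selected(alpha * 1.2, x, y) == 1:
        alpha *= 1.2
    alpha_min_feats = alpha                  # max alpha with one feature
    alpha_max_feats = alpha_min_feats * 0.1  # per the proposal
    return alpha_min_feats, alpha_max_feats
```

As noted, no tuning happens here: the range only has to bracket the one-feature solution, and the best alpha is found later in the tuning step.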

@tnigon (Contributor):

Your code changes from today look good.

Comment on lines +122 to +136
This function operates in four parts:

(1) Use "while" loops to search for a roughly maximum alpha value that achieves `n_feats`, which
is called `alpha_now`. Construct an alpha range around `alpha_now` using factors of 10.

(2) A custom lasso feature selection function is created that returns -alpha if `n_feats` is achieved
and 0 otherwise. Minimizing this function over an alpha range therefore maximizes alpha.

(3) Narrow the alpha range created in (1) to ensure that the global "minimum" value is achieved.
Determine where `n_feats` is achieved at regular intervals in the alpha range using the function in (2).
Then, expand the range out one index in either direction.

(4) Use scipy.optimize.minimize_scalar() to minimize the function in (2) across the alpha range from (3).

Raises Exception if the optimization step is unsuccessful.
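The four parts can be condensed into a rough sketch. Function and variable names here are illustrative rather than geoml's actual API, and the real implementation's range construction and error handling differ:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Lasso

def n_selected(alpha, x, y):
    """Number of features with nonzero Lasso coefficients at this alpha."""
    model = Lasso(alpha=alpha, max_iter=10_000)
    model.fit(x, y)
    return int(np.sum(model.coef_ != 0))

def objective(alpha, x, y, n_feats):
    # (2) -alpha where n_feats is achieved, 0 elsewhere: minimizing this
    # maximizes alpha among the alphas that achieve n_feats
    return -alpha if n_selected(alpha, x, y) == n_feats else 0.0

def max_alpha_for_n_feats(x, y, n_feats, alpha_initial=1.0):
    # (1) coarse search: shrink alpha until enough features survive,
    # then bracket alpha_now with factors of 10
    alpha_now = alpha_initial
    while n_selected(alpha_now, x, y) < n_feats:
        alpha_now /= 10
    lo, hi = alpha_now / 10, alpha_now * 10

    # (3) narrow the range: probe a regular grid of alphas and keep the
    # ones where n_feats is achieved
    grid = np.linspace(lo, hi, 25)
    ok = [a for a in grid if n_selected(a, x, y) == n_feats]
    if not ok:
        raise Exception("no alpha on the probe grid achieves n_feats")

    # (4) minimize the objective over the narrowed range
    res = minimize_scalar(objective, bounds=(min(ok), max(ok)),
                          args=(x, y, n_feats), method="bounded")
    # fall back to the best probed alpha if the optimizer strayed
    alpha_star = res.x if n_selected(res.x, x, y) == n_feats else max(ok)
    return alpha_star
```

The narrowing in (3) matters because the objective in (2) is discontinuous; a bounded scalar minimizer is only reliable once the range is mostly inside the region where `n_feats` is achieved.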
@marissakivi (Contributor, Author):

As I went through and tested/checked each part of this function, I added additional information here on what was going on and why.

Comment on lines +408 to +417
plot_alpha_vs_n_features(
    alpha_min=alpha_min_feats,
    alpha_max=alpha_max_feats,
    x_train=x_train_copy,
    y_train=y_train,
    plot_save=plot_save,
    model_dir=model_dir,
    model_name=model_name,
    random_state=random_state,
)
@marissakivi (Contributor, Author):

Made this step into a separate function so that it can be used to help debug problems with lasso_feature_selection if needed.

@tnigon (Contributor):

Should the plot_save arg be deleted from this function? In general, I think creating and saving plots (and CSVs, rasters, etc.) belongs in higher-level application code rather than in library code. That probably isn't always true, but it generally is, and it would be the case for feature selection in particular.

@marissakivi (Contributor, Author):

I think I see what you are saying. You want this function to always return the plot when called?

@tnigon (Contributor):

I think I'm suggesting to pull plot_alpha_vs_n_features() out of lasso_feature_selection(), separating those two actions. Perhaps plot_alpha_vs_n_features() should be named more explicitly to plot_lasso_alpha_vs_n_features() too?

It's fine as is, but just a suggestion.

pyproject.toml Outdated
@marissakivi (Contributor, Author):

What do we want to change the version to here?

@marissakivi (Contributor, Author):

@tnigon What are your thoughts on a version for this story?

@tnigon (Contributor):

Let's bump to 1.0.0. This is the beginning of a major overhaul that will eliminate co-dependency on the old geoml database schema.

geoml/utils.py
return feat_n_sel


# TODO: Will this ever get caught in a while loop?
@tnigon (Contributor):

I don't think it will, but we could add a few lines of code to ensure it doesn't. Something like limiting it to 100 or 500 iterations and, before it gives up, trying a multiplier on alpha_now much higher than 1.2 (maybe 100 or 1000).

Then if it's still not down to 1 feature, just issue a warning or raise an error? I think any of these options are fine for now, and I don't really see this as being a troublesome part of the ML training process in my experience.

@marissakivi (Contributor, Author) commented Mar 28, 2023:

By allowing 500 iterations of the while loop from alpha_initial = 1 (the default), the minimum and maximum alpha values attempted are 5.490e-80 and 4.67e39, respectively. So I decided not to try the last-resort approach you suggested in your first paragraph, since that range is probably enough. Perhaps we want to reduce the iteration limit to 100, though? I'm not really sure what makes the most sense at this point. (commit c011aaf)

@tnigon (Contributor):

The iteration limit you put in as an arg looks good to me. That will ensure we don't get stuck in the while loop. If you think it's reasonable, you can bump the default down to 100 or 200, but it's a pretty minor thing; these iterations happen in a matter of seconds.
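The guarded loop being discussed can be sketched as follows. `count_feats` is a hypothetical callable (alpha to number of selected features) standing in for the Lasso fit inside geoml, and the defaults mirror the values mentioned in this thread:

```python
def grow_alpha_until_one_feature(count_feats, alpha_initial=1.0,
                                 growth=1.2, max_iter=500):
    """Grow alpha by `growth` each iteration until at most one feature
    is selected, but raise an error after `max_iter` iterations instead
    of looping forever."""
    alpha_now = alpha_initial
    for _ in range(max_iter):
        if count_feats(alpha_now) <= 1:
            return alpha_now
        alpha_now *= growth
    raise RuntimeError(
        f"still more than 1 feature selected after {max_iter} iterations"
    )
```

Using a bounded `for` loop instead of a raw `while` makes the iteration cap an explicit argument, which is the behavior settled on above.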

@tnigon (Contributor) left a review:

Looks good!

@marissakivi merged commit 65f500e into main on Mar 29, 2023 and deleted the DEM-303-mosaic-funcs branch (Mar 29, 2023 14:17).