
[DEM 303] Adding functions from mosaic-modeling #64

Merged (15 commits) Mar 29, 2023

Conversation

@marissakivi (Contributor) commented Mar 22, 2023

Adding functions from mosaic-modeling

What?

  • Created two new files of reusable functions ported from the mosaic-modeling project; for the time being they operate as modules in geoml for future projects. I adjusted some of these functions from their mosaic-modeling implementations and highlighted any major changes with comments.
  • Dusted off geoml's dependencies, versions, etc.

Why?

  • Avoiding redundant work across project repos
  • Starting to rebuild and adapt geoml for the new database and workflow

CR Checklist

  • Ensure that the version changes make sense
  • Look through the two new files and the included functions. Is there any part/function that could be better generalized? Is every function operating as it should?
  • Do the changes I made make sense? Do we want to add any TODOs for later/make any changes now before starting project fulfillment?

PR Checklist

  • Merged latest main
  • Updated version number
  • Version numbers match between package _version.py and pyproject.toml
  • Ran poetry update and committed pyproject.toml and poetry.lock
  • Ran poetry run pre-commit autoupdate

Breaking Changes

Comment on lines 74 to 76
# if there is only one feature to be selected, then tuning will take care of alpha value
if len(labels_x) == 1:
    return alpha_now
@marissakivi (Contributor, Author):
Handling case where we have one candidate feature

@tnigon (Contributor) commented Mar 27, 2023:

Do we also need to check that feat_n_sel == 1? There may be a chance that alpha_initial does not result in 1 feature (but I'm not sure).

Note this is related to the TODO question above:

TODO: Do we want to allow for zero features to be selected? If so, we should rethink our alpha range for modeling with one candidate feature.

@marissakivi (Contributor, Author):

I'm spinning my wheels on this one a bit. On one hand, we should ensure that our functions can handle all situations, but on the other hand, why the heck would we be running feature selection when there's only one feature? Haha.

This is something I can look into more thoroughly with some test data I already have loaded in this repo, but it'd be nice to agree on desired behavior. This is what I'm thinking:

  • We don't want to select 0 features in the feature selection. If we are interested in looking at the null model, we can create it. This should be adjusted in the other helper functions.
  • For lasso_feature_selection(), if we only have one feature to select from, we will determine the maximum value for which we still have one feature selected (i.e., alpha_min_feats) and not zero features. Then, we will set alpha_max_feats = alpha_min_feats * 0.1 and continue on with the function. As this is a feature selection function, we do not need to find the best value of alpha for this model; that is done in the tuning step.
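A minimal sketch of the single-candidate-feature plan described above, assuming scikit-learn's Lasso. The helper names (`count_selected`, `single_feature_alpha_range`) are illustrative, not geoml's actual API:

```python
import numpy as np
from sklearn.linear_model import Lasso

def count_selected(alpha, x, y):
    """Number of features with nonzero Lasso coefficients at this alpha."""
    model = Lasso(alpha=alpha, max_iter=10_000)
    model.fit(x, y)
    return int(np.sum(model.coef_ != 0))

def single_feature_alpha_range(x, y, alpha_initial=1.0):
    """Find roughly the largest alpha that still selects the lone
    candidate feature (alpha_min_feats), then set alpha_max_feats an
    order of magnitude below it, per the proposal above."""
    alpha = alpha_initial
    # shrink alpha until the single feature is selected at all
    while count_selected(alpha, x, y) == 0:
        alpha /= 1.2
    # grow alpha to (roughly) the largest value that still keeps it
    while count_selected(alpha * 1.2, x, y) == 1:
        alpha *= 1.2
    alpha_min_feats = alpha                  # max alpha with one feature
    alpha_max_feats = alpha_min_feats * 0.1  # per the proposal
    return alpha_min_feats, alpha_max_feats
```

As noted, no tuning happens here: the range only has to bracket the one-feature solution, and the best alpha is found later in the tuning step.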

@tnigon (Contributor):

Your code changes from today look good.

Comment on lines +122 to +136
This function operates in four parts:

(1) Use "while" loops to search for a roughly maximum alpha value that achieves `n_feats`, which
is called `alpha_now`. Construct an alpha range around `alpha_now` using factors of 10.

(2) A custom lasso feature selection function is created that returns -alpha if `n_feats` is achieved
and 0 otherwise. Minimizing this function over an alpha range therefore maximizes alpha.

(3) Narrow the alpha range created in (1) to ensure that the global "minimum" value is achieved.
Determine where `n_feats` is achieved at regular intervals in the alpha range using the function in (2).
Then, expand the range out one index in either direction.

(4) Use scipy.optimize.minimize_scalar() to minimize the function in (2) across the alpha range from (3).

Raises Exception if the optimization step is unsuccessful.
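The four parts can be condensed into a rough sketch. Function and variable names here are illustrative rather than geoml's actual API, and the real implementation's range construction and error handling differ:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Lasso

def n_selected(alpha, x, y):
    """Number of features with nonzero Lasso coefficients at this alpha."""
    model = Lasso(alpha=alpha, max_iter=10_000)
    model.fit(x, y)
    return int(np.sum(model.coef_ != 0))

def objective(alpha, x, y, n_feats):
    # (2) -alpha where n_feats is achieved, 0 elsewhere: minimizing this
    # maximizes alpha among the alphas that achieve n_feats
    return -alpha if n_selected(alpha, x, y) == n_feats else 0.0

def max_alpha_for_n_feats(x, y, n_feats, alpha_initial=1.0):
    # (1) coarse search: shrink alpha until enough features survive,
    # then bracket alpha_now with factors of 10
    alpha_now = alpha_initial
    while n_selected(alpha_now, x, y) < n_feats:
        alpha_now /= 10
    lo, hi = alpha_now / 10, alpha_now * 10

    # (3) narrow the range: probe a regular grid of alphas and keep the
    # ones where n_feats is achieved
    grid = np.linspace(lo, hi, 25)
    ok = [a for a in grid if n_selected(a, x, y) == n_feats]
    if not ok:
        raise Exception("no alpha on the probe grid achieves n_feats")

    # (4) minimize the objective over the narrowed range
    res = minimize_scalar(objective, bounds=(min(ok), max(ok)),
                          args=(x, y, n_feats), method="bounded")
    # fall back to the best probed alpha if the optimizer strayed
    alpha_star = res.x if n_selected(res.x, x, y) == n_feats else max(ok)
    return alpha_star
```

The narrowing in (3) matters because the objective in (2) is discontinuous; a bounded scalar minimizer is only reliable once the range is mostly inside the region where `n_feats` is achieved.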
@marissakivi (Contributor, Author):

As I went through and tested/checked each part of this function, I added additional information here on what was going on and why.

Comment on lines +408 to +417
plot_alpha_vs_n_features(
    alpha_min=alpha_min_feats,
    alpha_max=alpha_max_feats,
    x_train=x_train_copy,
    y_train=y_train,
    plot_save=plot_save,
    model_dir=model_dir,
    model_name=model_name,
    random_state=random_state,
)
@marissakivi (Contributor, Author):

Made this step into a separate function so that it can be used to help debug problems with lasso_feature_selection if needed.

@tnigon (Contributor):

Should the plot_save arg be deleted from this function? In general, I think creating and saving plots (and CSVs, rasters, etc.) belongs in higher-level application code rather than in library code. That probably isn't always true, but it generally is, and it would be the case for feature selection in particular.

@marissakivi (Contributor, Author):

I think I see what you are saying. You want this function to always return the plot when called?

@tnigon (Contributor):

I think I'm suggesting to pull plot_alpha_vs_n_features() out of lasso_feature_selection(), separating those two actions. Perhaps plot_alpha_vs_n_features() should be named more explicitly to plot_lasso_alpha_vs_n_features() too?

It's fine as is, but just a suggestion.

pyproject.toml Outdated
@marissakivi (Contributor, Author):

What do we want to change the version to here?

@marissakivi (Contributor, Author):

@tnigon What are your thoughts on a version for this story?

@tnigon (Contributor):

Let's bump to 1.0.0. This is the beginning of a major overhaul that will eliminate co-dependency on the old geoml database schema.

geoml/utils.py
return feat_n_sel


# TODO: Will this ever get caught in a while loop?
@tnigon (Contributor):

I don't think it will, but we could add a few lines of code to ensure it doesn't. Something like limiting it to 100 or 500 iterations and, before it gives up, trying a multiplier on alpha_now much higher than 1.2 (maybe 100 or 1000).

Then if it's still not down to 1 feature, just issue a warning or raise an error? I think any of these options are fine for now, and I don't really see this as being a troublesome part of the ML training process in my experience.

@marissakivi (Contributor, Author) commented Mar 28, 2023:

By allowing 500 iterations of the while loop from alpha_initial = 1 (the default), the minimum and maximum alpha values attempted are 5.490e-80 and 4.67e39, respectively. So I decided not to try the last-resort approach you suggested in your first paragraph, since that range is probably enough. Perhaps we want to reduce the iteration limit to 100, though? I'm not really sure what makes the most sense at this point. (commit c011aaf)

@tnigon (Contributor):

The iteration limit you put in as an arg looks good to me. That will ensure we don't get stuck in the while loop. If you think it's reasonable, you can bump the default down to 100 or 200, but it's a pretty minor thing; these iterations happen in a matter of seconds.
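The guarded loop being discussed can be sketched as follows. `count_feats` is a hypothetical callable (alpha to number of selected features) standing in for the Lasso fit inside geoml, and the defaults mirror the values mentioned in this thread:

```python
def grow_alpha_until_one_feature(count_feats, alpha_initial=1.0,
                                 growth=1.2, max_iter=500):
    """Grow alpha by `growth` each iteration until at most one feature
    is selected, but raise an error after `max_iter` iterations instead
    of looping forever."""
    alpha_now = alpha_initial
    for _ in range(max_iter):
        if count_feats(alpha_now) <= 1:
            return alpha_now
        alpha_now *= growth
    raise RuntimeError(
        f"still more than 1 feature selected after {max_iter} iterations"
    )
```

Using a bounded `for` loop instead of a raw `while` makes the iteration cap an explicit argument, which is the behavior settled on above.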

@tnigon (Contributor) left a review:

Looks good!

@marissakivi merged commit 65f500e into main on Mar 29, 2023 and deleted the DEM-303-mosaic-funcs branch (Mar 29, 2023 14:17).