Bootstrap weights #485

alanlujan91 · 2024-03-04T22:07:51Z

Add weights kwarg to handle survey weights

codecov · 2024-03-05T14:05:41Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.99%. Comparing base (1183231) to head (b8644dd).

❗ Current head b8644dd differs from pull request most recent head a39bf15

Please upload reports for the commit a39bf15 to get more accurate results.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #485   +/-   ##
=======================================
  Coverage   92.98%   92.99%           
=======================================
  Files         194      193    -1     
  Lines       14659    14665    +6     
=======================================
+ Hits        13631    13638    +7     
+ Misses       1028     1027    -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

timmens · 2024-03-05T18:29:03Z

Hey Alan! Thanks for opening another pull request.

Since this time, the PR is not related to a minor fix but an extension, could you extend the PR description? In particular, could you state the problem (i.e. the missing feature), how this is called in the bootstrap literature, and how you plan to implement it. If you know other libraries that implement this feature, this would also be great to know.

timmens

Nice work so far! Thank you!

As mentioned above, I think we need to go through a few things first. If implementing these changes will take too much of your time, you can also re-open this pull request as an issue (feature request). Then we will think about how to implement the feature.

timmens · 2024-03-05T18:35:18Z

tests/inference/test_bootstrap_samples.py

You only check that the get_bootstrap_indices function works when called with the new weights arguments. But I think you can do more:

I would rename "wts" to "weights", just to be explicit

If the weights are uniform (as in your example), the bootstrap indices with and without the weights argument should be the same. Could you test that? You will need to create two random number generators with the same state.

Additionally, it would be very helpful to see if the weights work. An extreme case to check this would be a weighting vector where one entry is 1, and the other is 0. Then, you could (probabilistically) check that the index corresponding to the 0 weight is never chosen.

timmens · 2024-03-05T18:41:05Z

src/estimagic/inference/bootstrap_samples.py

+            bootstrap_indices = list(
+                rng.choice(
+                    n_obs,
+                    size=(n_draws, n_obs),
+                    replace=True,
+                    p=data[weights] / data[weights].sum(),
+                )
+            )


It would be more readable if you compute the vector of probabilities outside of these if conditions. If weights is None, the vector of probabilities could be None. In this case, you could also write the same code for both cases:

probs = _calculate_bootstrap_indices_probs(data, weight_by, cluster_by) bootstrap_indices = list( rng.choice( n_obs, size=(n_draws, n_obs), replace=True, p=probs, ) )

This will be slower than np.integers(n_obs, ...) but I think this is a trade-off we would want to make here.

src/estimagic/estimation/msm_weighting.py

timmens · 2024-03-05T18:55:01Z

src/estimagic/inference/bootstrap_samples.py

+            clusters = data.groupby(cluster_by)[weights].sum().reset_index()
+
+            drawn_clusters = rng.choice(
+                clusters[cluster_by],
+                size=(n_draws, len(clusters)),
+                replace=True,
+                p=clusters[weights] / clusters[weights].sum(),
+            )
+
+            bootstrap_indices = _convert_cluster_ids_to_indices(
+                data[cluster_by], drawn_clusters
+            )


This has to be updated. Especially the line clusters[clusters_by] is very confusing. I would think that clusters is equal to something like data[cluster_by].

Similarly as above, it would be nice to pre-compute the probabilities.

alanlujan91 · 2024-03-06T13:39:01Z

Hey Alan! Thanks for opening another pull request.

Since this time, the PR is not related to a minor fix but an extension, could you extend the PR description? In particular, could you state the problem (i.e. the missing feature), how this is called in the bootstrap literature, and how you plan to implement it. If you know other libraries that implement this feature, this would also be great to know.

Will try to address all of this in the coming days! Thanks!

timmens · 2024-05-02T16:28:33Z

Hey @alanlujan91, thanks again for working on this feature. We are planning a major refactoring and want to close as many PRs as possible beforehand.

Are you planning to work on this PR in the near future?

alanlujan91 · 2024-05-02T16:42:14Z

hi @timmens

I know this feature as proportional sampling, which is something we have done in Econ-ARK but I don't have a good link to the literature.

https://en.wikipedia.org/wiki/Probability-proportional-to-size_sampling

I can work on this more if you think this is a desired feature, I've just been pretty busy so this hasn't been a priority.

janosg · 2024-05-03T06:20:03Z

Hi Alan, thanks for the PR. I think this would be nice to have in estimagic. As Tim said, we are planning major refactorings and would like to have few PRs open when we start. Do you think you can finish within the next three weeks? Otherwise it might be a good idea to convert this to an issue and postpone the actual implementation until you have more free time.

alanlujan91 · 2024-05-06T20:05:53Z

@janosg I can definitely finish this within 3 weeks

alanlujan91 · 2024-05-21T15:57:43Z

Modified code according to above comments from @timmens, except for

If the weights are uniform (as in your example), the bootstrap indices with and without the weights argument should be the same. Could you test that? You will need to create two random number generators with the same state.

This one is complicated because of the following note from numpy

"Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a)."

It can be done, but requires generating dummy p's as p = np.ones(n_obs)/n_obs when weights_by is None. Let me know if this is desirable

timmens

The changes look really good, thank you! I only have a few minor remarks and two small request, and then we can merge 🎉

Request 1

The function _get_probs_for_bootstrap_indices is not tested yet and the docstring is not that informative. Can you

Extend the docstring by explaining what to expect in the different cases (i.e., weight_by is not None, cluster_by is not None, etc.)
Write unit tests for these cases. Looking at the function, it is not directly clear to me that the case where both weight_by and cluster_by are not None is correctly handled. With an example in a unit test, this would be much easier to see.

Request 2

It can be done, but requires generating dummy p's as p = np.ones(n_obs)/n_obs when weights_by is None. Let me know if this is desirable

Ah, I see; very good point! In this case, can you add a structural test like this:

def test_get_bootstrap_indices_heterogeneous_weights():
    data = pd.DataFrame({"id": [0, 1], "w_homogenous": [0.5, 0.5], "w_heterogenous": [0.1, 0.9]})
    
    res_homogenous = get_bootstrap_indices(data, weight_by="w_homogenous", n_draws=1_000, rng=get_rng(seed=0))
    res_heterogenous = get_bootstrap_indices(data, weight_by="w_heterogenous", n_draws=1_000, rng=get_rng(seed=0))
    
    # Given the weights, the first sample mean should be close to 0.5, while the second one should be close to 0.9
    assert np.mean(res_homogenous) < 0.75 < np.mean(res_heterogenous)

timmens · 2024-05-22T08:36:03Z

tests/inference/test_bootstrap_ci.py

+    weights = "this is not a column name of df"
+    expected = "Input 'weight_by' must be None or a column name of 'data'."
+
+    with pytest.raises(ValueError) as error:
+        check_inputs(data=setup["df"], weight_by=weights)
+    assert str(error.value) == expected


Suggested change

weights = "this is not a column name of df"

expected = "Input 'weight_by' must be None or a column name of 'data'."

with pytest.raises(ValueError) as error:

check_inputs(data=setup["df"], weight_by=weights)

assert str(error.value) == expected

expected_error_msg = "Input 'weight_by' must be None or a column name of 'data'."

with pytest.raises(ValueError, match=expected_error_msg):

check_inputs(data=setup["df"], weight_by="this is not a column name of df")

timmens · 2024-05-22T10:38:43Z

src/estimagic/inference/bootstrap_samples.py

+        cluster_by (str): column name of the variable to cluster by.
+
+    Returns:
+        list: numpy array with probabilities.


Looking at the code, the output will be either None or a pd.Series, and not a list or list of arrays, right?

timmens · 2024-05-22T10:41:21Z

src/estimagic/inference/bootstrap_samples.py

@@ -34,6 +45,29 @@ def get_bootstrap_indices(data, rng, cluster_by=None, n_draws=1000):
    return bootstrap_indices


+def _get_probs_for_bootstrap_indices(data, weight_by, cluster_by):


I prefer the function name _calculate_bootstrap_indices_weights. In that case, the first line of the docstring needs to be updated as well.

alanlujan91 added 4 commits March 4, 2024 17:06

add kwarg weights to get_bootstrap_indices

3ca1a39

fix downstream

38bdd3e

update get_moments_cov

3d04678

add tests

7029aba

alanlujan91 mentioned this pull request Mar 4, 2024

Bootstrap with survey weights #486

Open

Merge remote-tracking branch 'upstream/main' into bootstrap_weights

f3933df

timmens reviewed Mar 5, 2024

View reviewed changes

change weights to weight_by

b8644dd

Merge branch 'main' into bootstrap_weights

1556ecb

alanlujan91 added 2 commits May 20, 2024 10:14

Merge remote-tracking branch 'upstream/main' into bootstrap_weights

7b3146a

add _get_probs_for_bootstrap_indices

a39bf15

timmens requested changes May 22, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bootstrap weights #485

Bootstrap weights #485

alanlujan91 commented Mar 4, 2024

codecov bot commented Mar 5, 2024 •

edited

timmens commented Mar 5, 2024

timmens left a comment

timmens Mar 5, 2024

timmens Mar 5, 2024

timmens Mar 5, 2024

alanlujan91 commented Mar 6, 2024

timmens commented May 2, 2024

alanlujan91 commented May 2, 2024

janosg commented May 3, 2024

alanlujan91 commented May 6, 2024

alanlujan91 commented May 21, 2024

timmens left a comment •

edited

timmens May 22, 2024

timmens May 22, 2024

timmens May 22, 2024

		@@ -34,6 +45,29 @@ def get_bootstrap_indices(data, rng, cluster_by=None, n_draws=1000):
		return bootstrap_indices


		def _get_probs_for_bootstrap_indices(data, weight_by, cluster_by):

Bootstrap weights #485

Are you sure you want to change the base?

Bootstrap weights #485

Conversation

alanlujan91 commented Mar 4, 2024

codecov bot commented Mar 5, 2024 • edited

Codecov Report

timmens commented Mar 5, 2024

timmens left a comment

Choose a reason for hiding this comment

timmens Mar 5, 2024

Choose a reason for hiding this comment

timmens Mar 5, 2024

Choose a reason for hiding this comment

timmens Mar 5, 2024

Choose a reason for hiding this comment

alanlujan91 commented Mar 6, 2024

timmens commented May 2, 2024

alanlujan91 commented May 2, 2024

janosg commented May 3, 2024

alanlujan91 commented May 6, 2024

alanlujan91 commented May 21, 2024

timmens left a comment • edited

Choose a reason for hiding this comment

Request 1

Request 2

timmens May 22, 2024

Choose a reason for hiding this comment

timmens May 22, 2024

Choose a reason for hiding this comment

timmens May 22, 2024

Choose a reason for hiding this comment

codecov bot commented Mar 5, 2024 •

edited

timmens left a comment •

edited