
[BUG] DATAGEN_SEED=<seed> environment does not override the marker datagen_overrides #10089

Closed
gerashegalov opened this issue Dec 21, 2023 · 17 comments · Fixed by #10109
Labels: bug (Something isn't working), test (Only impacts tests)

Comments

@gerashegalov
Collaborator

gerashegalov commented Dec 21, 2023

We document the DATAGEN_SEED environment variable as a mechanism to reproduce test failures.

If a bug cannot be resolved in a timely manner, however, we resort to a constant seed that is known to pass the failing test via the datagen_overrides marker. Such a marker disables DATAGEN_SEED, which in turn invalidates the repro steps described in bugs like #9992.

We need to settle on and document the precedence of the various ways to set the seed.

  • Could use a pytest CLI parameter as a final override
  • Could use DATAGEN_SEED_OVERRIDE as yet another environment variable for the super override, since datagen_overrides is the current test decorator for such tests.
@gerashegalov gerashegalov added bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 21, 2023
@NVnavkumar NVnavkumar added the test Only impacts tests label Dec 26, 2023
@abellina
Collaborator

We set DATAGEN_SEED as an environment variable today to pass the actual seed; I'd say we add the DATAGEN_SEED_OVERRIDE variable to make it clear we're overriding the overrides.

@jlowe
Member

jlowe commented Dec 27, 2023

I think it will be confusing to change what everyone is already used to, and that's setting DATAGEN_SEED. Instead I propose we use a different, internal-only environment variable to denote the randomly rolled seed generated by the test framework.
DATAGEN_SEED overrides that (and any explicit seeds set by decorators on the test) if it is set. Test names would still print [DATAGEN_SEED=xxx] to indicate that's the environment variable users should set to force a particular seed.
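The precedence jlowe proposes could be sketched roughly as follows. This is a minimal illustration only; the internal variable name `_DATAGEN_SEED_INTERNAL` and the helper function are hypothetical, not the actual spark-rapids conftest code:

```python
import os
import random

def resolve_datagen_seed(decorator_seed=None, env=os.environ):
    """Illustrative precedence sketch:
    1. the framework rolls a random seed into an internal-only variable
    2. a test-level decorator override replaces that rolled seed
    3. an explicit DATAGEN_SEED from the user wins over everything
    """
    rolled_seed = int(env.get("_DATAGEN_SEED_INTERNAL", random.randrange(2**31)))
    seed = rolled_seed
    if decorator_seed is not None:
        seed = decorator_seed            # marker-pinned seed beats the rolled one
    if "DATAGEN_SEED" in env:
        seed = int(env["DATAGEN_SEED"])  # user-supplied seed beats both
    return seed
```

With this ordering, the `[DATAGEN_SEED=xxx]` text in the test name keeps pointing at the one knob a user actually needs to set.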

@abellina
Collaborator

Ah, makes sense to me now; this is the "I wrote it, so it makes sense to me" issue. Sorry!

I'll put up a quick patch with this today.

@abellina abellina self-assigned this Dec 27, 2023
@revans2
Collaborator

revans2 commented Dec 27, 2023

I don't think there is a good solution to this that does not violate the principle of least surprise. Those who are used to how it works today will be surprised if it changes to override the overrides; those who are not used to it will be, or already are, surprised if it does. Just pick names that are as clear as we can make them and document what they do. We can all adapt. I don't want the value that is used to set the DATAGEN_SEED to be hidden; it needs to be documented too, so that the person who sets an override as a workaround can test that they did it right.

@revans2
Collaborator

revans2 commented Dec 27, 2023

Also on that note, do we need a new annotation to distinguish between "we are overriding the seed temporarily to make the test pass until we can fix it" and "the seed is hard-coded because there are fundamental problems we cannot deal with"? @jlowe, I know you set a few of those, so I am curious whether you have an opinion here.

My main concern is that we get a failure, someone tries to rerun all of the related tests with the bad seed, and because we cannot distinguish between the two cases the user ends up seeing test failures that are not directly related and wastes time trying to debug them.

@jlowe
Member

jlowe commented Dec 27, 2023

Having a different decorator (or argument to the existing decorator) to denote temporary vs. permanent is probably a good idea. We could omit the [DATAGEN_SEED=] text from those test names using a permanent override, since they don't support an override of the seed.
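A temporary-vs-permanent distinction could look something like the sketch below. The names are illustrative only (modeled on the datagen_overrides marker discussed in this issue), not the real spark-rapids implementation:

```python
# Hedged sketch: a single decorator whose 'permanent' argument records
# whether the pinned seed is a temporary workaround or a hard requirement.
def datagen_overrides(seed, permanent=False):
    def wrapper(test_func):
        # Attach the override metadata so the test framework can decide
        # whether an external DATAGEN_SEED is allowed to replace the seed.
        test_func.datagen_override = {"seed": seed, "permanent": permanent}
        return test_func
    return wrapper

@datagen_overrides(seed=0)                  # temporary: DATAGEN_SEED may still win
def test_flaky_until_fixed():
    pass

@datagen_overrides(seed=0, permanent=True)  # permanent: the seed is never replaced
def test_fundamentally_seed_bound():
    pass
```

The framework could then omit the `[DATAGEN_SEED=]` text from any test whose metadata says `permanent=True`.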

@NVnavkumar
Collaborator

How often would we run into the situation @revans2 described if the user passes the -k parameter? I agree that there is no ideal solution, but are two different decorators and multiple environment variables really necessary? I get that it would be documented, but I find that a bit complicated even after reading the docs. Really the concern is: how do we make it as easy as possible to replicate the failure so that it can be debugged quickly? Shouldn't we optimize for that?

@gerashegalov
Collaborator Author

gerashegalov commented Dec 27, 2023

If we admit there is no ideal solution, I would side with making an externally supplied DATAGEN_SEED=<seed> definitely override, since so many issues already instruct the user to reproduce bugs this way.

Adding more nuance to the datagen_overrides decorator via attributes, if only for documentation purposes, sounds like a good idea.

To avoid confusion about blanket overrides affecting many different tests, we could probably find a way to raise an error if the pytest invocation involves more test node ids than a single specific prefix test_file.py::test_func.
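The "single specific prefix" check could be sketched like this (an illustrative helper under the assumption that node ids look like `file.py::test_func[params]`; the function name is hypothetical, not actual conftest code):

```python
def single_test_prefix(collected_nodeids):
    """Return the common 'test_file.py::test_func' prefix shared by all
    collected items, or None if they span more than one test function.
    A caller could raise an error on None when DATAGEN_SEED is set."""
    # Strip the parametrization suffix, e.g. '[String]', from each node id.
    prefixes = {nodeid.split("[", 1)[0] for nodeid in collected_nodeids}
    return prefixes.pop() if len(prefixes) == 1 else None
```

Parametrized variants of one function would still count as a single prefix, so a repro command targeting one test keeps working.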

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 27, 2023
@abellina
Collaborator

I am leaning towards the following:

DATAGEN_SEED=x ./run_pyspark.. --disable_datagen_temp_overrides

I think using the python options would be a cleaner way to deal with this. Yes, we need to remember to add the modifier to the tests, but at least this is better than having to comment out the test marker.

@abellina
Collaborator

Ouch, found this while testing this change: #10108

Fixing it with the same PR.

@gerashegalov
Collaborator Author

gerashegalov commented Dec 28, 2023

DATAGEN_SEED=x ./run_pyspark.. --disable_datagen_temp_overrides

I think using the python options would be a cleaner way to deal with this, and yes we need to remember to add the modifier to the tests, but at least this is better than having to comment out the test marker.

Why not allow DATAGEN_SEED to override by default when the collected node ids share a single specific prefix test_file.py::test_func, so that it just works without any extra switches?

@abellina
Collaborator

I am leaning towards not doing this automatically, because if two or more functions are selected, DATAGEN_SEED has a different meaning than with a single item. I think it has to be explicit; otherwise it is bound to be even more confusing.

@gerashegalov
Collaborator Author

gerashegalov commented Dec 28, 2023

I was trying to reconcile my requirement for this issue (that DATAGEN_SEED should override unconditionally, to reproduce a typical DATAGEN_SEED bug) with the need to avoid accidental overrides of unrelated tests.

Another way to look at it: what is the current use case for setting DATAGEN_SEED for more than one test function? If the answer is none, then we can just raise an error instead of having different semantics.

@abellina
Collaborator

I was trying to reconcile my requirement for this issue (that DATAGEN_SEED should override unconditionally, to reproduce a typical DATAGEN_SEED bug) with the need to avoid accidental overrides of unrelated tests.

Another way to look at it: what is the current use case for setting DATAGEN_SEED for more than one test function? If the answer is none, then we can just raise an error instead of having different semantics.

If multiple tests fail, one could run them all with the different seed, and they could be a whole class of tests; I don't want to preclude that use case. Is it really that bad a user experience to set the extra argument I've proposed here?

@revans2
Collaborator

revans2 commented Dec 28, 2023

We are making this way too complicated (too many cooks in the kitchen). Let's just have DATAGEN_SEED override the seed for anything that is annotated/tagged with an override that is not permanent. Have an annotation that distinguishes between permanent and non-permanent (either a single tag or separate ones) and be done with it. Any concerns I have beyond that are really just minor, and spending more time to "fix" them feels like a waste.

@gerashegalov
Collaborator Author

gerashegalov commented Dec 28, 2023

Is it really that bad a user experience to set the extra argument I’ve proposed here?

My point is that an issue like #9992 contains a repro command (GitHub conveniently lets the user copy it to the clipboard with a single click in the top right corner). My expectation is that it reproduces the bug.

Let's just have DATAGEN_SEED override the seed for anything that is annotated/tagged with an override that is not permanent.

This takes us there in most cases, and the rest should already have been closed when the override was made permanent. This is reasonable.

@abellina
Collaborator

OK, thanks all for the patience. Here's the latest, following the example in #9992. This is what it looks like when we set DATAGEN_SEED (and I verified that this is indeed the seed we use):

TEST_PARALLEL=0 DATAGEN_SEED=1701976979 TZ=UTC ./run_pyspark_from_build.sh --test_oom_injection_mode=always -k 'test_conditional_with_side_effects_cast'
...
../../src/main/python/conditionals_test.py::test_conditional_with_side_effects_cast[String][DATAGEN_SEED=1701976979, INJECT_OOM]

If we don't set DATAGEN_SEED this is the result:

TEST_PARALLEL=0 TZ=UTC ./run_pyspark_from_build.sh --test_oom_injection_mode=always -k 'test_conditional_with_side_effects_cast'
...
../../src/main/python/conditionals_test.py::test_conditional_with_side_effects_cast[String][DATAGEN_SEED_OVERRIDE=0, INJECT_OOM] 

If we specify the permanent flag, @datagen_seed(seed=0, permanent=True,...), this is the result:

TEST_PARALLEL=0 DATAGEN_SEED=1701976979 TZ=UTC ./run_pyspark_from_build.sh --test_oom_injection_mode=always -k 'test_conditional_with_side_effects_cast'
...
../../src/main/python/conditionals_test.py::test_conditional_with_side_effects_cast[String][DATAGEN_SEED_OVERRIDE_PERMANENT=0, INJECT_OOM]
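The three test-name suffixes shown above could be derived from the seed's origin roughly like this (an illustrative helper only, not the real plugin code; the function name is hypothetical):

```python
def seed_suffix(seed, overridden=False, permanent=False):
    """Pick the test-name suffix that tells the user which seed source won:
    a permanent marker pin, a temporary marker pin, or the (possibly
    user-supplied) DATAGEN_SEED itself."""
    if permanent:
        # Permanent pin always wins, even if DATAGEN_SEED was set.
        return f"DATAGEN_SEED_OVERRIDE_PERMANENT={seed}"
    if overridden:
        # Temporary pin applies only when the user did not set DATAGEN_SEED.
        return f"DATAGEN_SEED_OVERRIDE={seed}"
    return f"DATAGEN_SEED={seed}"
```

Reading the suffix back from a failed test name then tells the user exactly which environment variable (if any) can reproduce the run.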
