Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop/Create Index Timestamping Race condition #1907

Merged
merged 6 commits into from Feb 10, 2020

Conversation

nacrooks
Copy link
Contributor

@nacrooks nacrooks commented Feb 10, 2020

1) Race condition
There was a race condition where dropping an index and recreate an index in close succession would cause timestamping to stop on a source.

2) Naming collision
This PR also fixes a case of duplicate naming if

  1. there is a non materialized view v
  2. there is a materialized view v1 that derives from v
  3. we create an index on v

3) CHBench Load Generator

  1. Ensures that customers of all countries are generated
  2. Removes accidental correlation between item ids and supplier countries.

@cuongdo cuongdo added this to the 0.1 milestone Feb 10, 2020
.hist_date_offset_millis = zero_const,
.order_entry_date_offset_millis = zero_const,
.orderline_delivery_date_offset_millis = zero_const,
.hist_date_offset_millis = static_cast<inner_type>(std::uniform_int_distribution<int64_t>(946684800,1704067200)), // 2010-2024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are changes like these also intended to produce dates with a greater range of values than just the day that the demo is being run?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the metabase demo, which uses this, yes that is beneficial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've configured it to generate dates from 2000 to 2024.

src/coord/coord.rs Outdated Show resolved Hide resolved
src/coord/coord.rs Outdated Show resolved Hide resolved
src/coord/coord.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@frankmcsherry frankmcsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! My understanding is limited, but the in-person explanation made things more clear.

@quodlibetor quodlibetor self-requested a review February 10, 2020 15:49
@quodlibetor
Copy link
Contributor

could you include what the meaning of the conversation with andi was? that commit message won't help us remember the rationale later.

@nacrooks
Copy link
Contributor Author

nacrooks commented Feb 10, 2020

What this PR doesn't solve:

Right now we lack a way to "identify" a source instantiation in a way that is 1) unique across dataflows 2) persists across crashes

We can achieve 1) by identifying a source instance with the pair (source_id, id of the first index that that view exports). Unfortunately, we cannot necessarily achieve 2): if we have two indices on a view i and ii (say respectively on column a and column b), we will recover the view with a different id than the one we created it with.

create view v1 as select * from aggdata;
create index i on v1(a);
[2020-02-10T18:13:32.783740Z  INFO coord/timestamp:266] Timestamping Source u2/u4 with Real Time Consistency
create index ii on v1(b);
drop index i ;
crash
recover
[2020-02-10T18:13:52.820311Z  INFO coord/timestamp:266] Timestamping Source u2/u5 with Real Time Consistency

This "bug" will have no effect on sources with BYO consistency. For "real-time" consistency, this will have the effect of treating the second "iteration" of the view as a "fresh" iteration with no prior timestamping information. This means before the crash, record 1 can be timestamped with t, and after the crash, it can be retimestamped with t'. It will not cause a crash or any data to be lost.
The only way that I can see to solve this is to associate a persistent dataflow id when we materialize the view and persist that dataflow id in the catalog.

Copy link
Contributor Author

@nacrooks nacrooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion with Andi was trying to identify a tuple that would refer uniquely to a given source instantiation. The pair (src_id.sid, first_export_id) is unique for a given run of Materialize, but not across runs if we create two indices and drop the first one only (see general comment in main PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants