Introduce Primary Terms #14062

bleskes · 2015-10-12T07:58:28Z

Every shard group in Elasticsearch has a selected copy called a primary. When a primary shard fails, a new primary is selected from the existing replica copies. This PR introduces primary terms to track the number of times this has happened. This will allow us, as follow-up work and among other things, to identify operations that come from old, stale primaries. It is also the first step on the road towards sequence numbers.
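To make the stale-primary idea concrete, here is a minimal sketch of how a shard copy might use a term to reject operations from an old primary. The class and method names are hypothetical and are not taken from this PR, and the acceptance rule (accept any operation whose term is at least the current one) is an assumption made for illustration only.

```java
// Hypothetical sketch: none of these names come from the Elasticsearch code base.
final class ReplicaTermCheck {
    // Highest primary term this copy has been told about.
    private long currentPrimaryTerm;

    ReplicaTermCheck(long initialTerm) {
        this.currentPrimaryTerm = initialTerm;
    }

    /** Called when a new primary has been selected for this shard group. */
    void onNewPrimaryTerm(long newTerm) {
        if (newTerm < currentPrimaryTerm) {
            throw new IllegalArgumentException("primary term must not decrease");
        }
        currentPrimaryTerm = newTerm;
    }

    /** Returns true if an operation stamped with opTerm may be applied (assumed rule). */
    boolean accept(long opTerm) {
        return opTerm >= currentPrimaryTerm;
    }
}
```

An operation stamped with term 1 would be accepted while the copy is on term 1, and rejected once the copy has learned about term 2.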

Relates to #10708

bleskes · 2015-10-12T07:58:49Z

@brwe @jasontedor care to take a look?

brwe · 2015-10-12T12:51:28Z

core/src/main/java/org/elasticsearch/cluster/routing/ShardRouting.java

@@ -637,6 +662,9 @@ public boolean equals(Object o) {
        if (unassignedInfo != null ? !unassignedInfo.equals(that.unassignedInfo) : that.unassignedInfo != null) {
            return false;
        }
+        if (primaryTerm != that.primaryTerm) {


do we need a change in hashCode() too?
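As a hedged illustration of the question, the sketch below keeps `equals()` and `hashCode()` in step by deriving both from the same fields. `RoutingKey` is a hypothetical stand-in, not the real `ShardRouting`. Leaving a field out of `hashCode()` does not break the contract (equal objects still hash equally), but including it avoids needless collisions between objects that differ only in that field.

```java
import java.util.Objects;

// Hypothetical stand-in for a routing entry; not the real ShardRouting class.
final class RoutingKey {
    private final String index;
    private final int shardId;
    private final int primaryTerm;

    RoutingKey(String index, int shardId, int primaryTerm) {
        this.index = index;
        this.shardId = shardId;
        this.primaryTerm = primaryTerm;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof RoutingKey)) {
            return false;
        }
        RoutingKey that = (RoutingKey) o;
        // Every field compared here is also fed into hashCode() below.
        return shardId == that.shardId
            && primaryTerm == that.primaryTerm
            && index.equals(that.index);
    }

    @Override
    public int hashCode() {
        return Objects.hash(index, shardId, primaryTerm);
    }
}
```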

brwe · 2015-10-12T15:06:07Z

Should the primary term also increase when we move a primary from one node to another?

brwe · 2015-10-12T15:33:02Z

When I restart a node then primaryTerm of primaries is incremented by 2. Is this intended?

jasontedor · 2015-10-13T13:01:37Z

core/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java

@@ -580,6 +605,7 @@ public void writeTo(StreamOutput out) throws IOException {
            out.writeLong(version);
            out.writeByte(state.id);
            Settings.writeSettingsToStream(settings, out);
+            out.writeIntArray(primaryTerms);


Since we expect these to be non-negative and "not large", I wonder if it'd be better to serialize these using a variable-length encoding? See this PR.

we can - (and yeah, reviewed your PR :) )

updated to writeVIntArray
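For readers unfamiliar with variable-length encoding, the following is a rough, self-contained sketch of the general technique: seven payload bits per byte, with the high bit signalling that more bytes follow. It is not Elasticsearch's `StreamOutput` implementation, and all names in it are hypothetical. The point is that small non-negative values such as primary terms take one byte each, whereas a fixed-width int always takes four.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: names are hypothetical, not the StreamOutput API.
final class VIntArraySketch {

    /** Writes value using 7 payload bits per byte; a set high bit means more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes the array length followed by each element, all as variable-length ints. */
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, values.length);
        for (int v : values) {
            writeVInt(out, v);
        }
        return out.toByteArray();
    }

    /** Reverses encode(). */
    static int[] decode(byte[] bytes) {
        int[] pos = {0}; // single-element array used as a mutable read cursor
        int length = readVInt(bytes, pos);
        int[] values = new int[length];
        for (int i = 0; i < length; i++) {
            values[i] = readVInt(bytes, pos);
        }
        return values;
    }

    private static int readVInt(byte[] bytes, int[] pos) {
        int result = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[pos[0]++];
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}
```

With this sketch, an array of three small terms encodes to four bytes (one for the length, one per term).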

bleskes · 2015-10-13T13:07:11Z

> Should the primary term also increase when we move a primary from one node to another?

Maybe later. My feeling for now is that this is not needed, as this is the "same" primary - it just moved.

> When I restart a node then primaryTerm of primaries is incremented by 2. Is this intended?

Good catch!! Fixed, and added some testing.

brwe · 2015-10-14T14:09:51Z

core/src/test/java/org/elasticsearch/gateway/PrimaryShardAllocatorTests.java

        assertThat(testAllocator.needToFindPrimaryCopy(shard), equalTo(false));
    }

    @Test
-    public void testNoProcessPrimayNotAllcoatedBefore() {
-        ShardRouting shard = TestShardRouting.newShardRouting("test", 0, null, null, null, true, ShardRoutingState.UNASSIGNED, 0, new UnassignedInfo(UnassignedInfo.Reason.INDEX_CREATED, null));
+    public void testNoProcessPrimacyNotAllocatedBefore() {


yep.. fixed.

brwe · 2015-10-14T15:59:54Z

I left some nitpicks, but in general I wonder about this: shard version and primary term should always be the same for all copies. We add the primary term to the index meta data but not the shard version. Also, we write the shard version when we persist the shard meta data but not the primary term. Why do we treat them differently?
I also wonder why we cannot just get the version of the shard and the primary term from the index metadata instead of adding this information to the shard routings? It might be less confusing to have a single source of truth for this information and we write the index meta data now in any case.

jasontedor · 2015-10-16T13:04:42Z

core/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java

        return true;
    }

    @Override
    public int hashCode() {
        int result = index.hashCode();
+        result = 31 * result + (int) (version ^ (version >>> 32));


There's a built-in (int Long#hashCode(long)) for computing the hash code of a long since Java 8.

sure thing.
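A small sketch of the suggestion: `Long.hashCode(long)` (available since Java 8) is specified to return the same value as the hand-written shift-and-xor expression in the diff, so the two are interchangeable. The wrapper class and method names below are hypothetical.

```java
// Hypothetical helper comparing the manual expression with the JDK built-in.
final class VersionHash {

    /** The expression used in the diff above. */
    static int manual(long version) {
        return (int) (version ^ (version >>> 32));
    }

    /** The equivalent built-in, available since Java 8. */
    static int builtIn(long version) {
        return Long.hashCode(version);
    }
}
```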

brwe · 2015-10-16T13:47:34Z

> Not exactly - versions are incremented with every shard routing change, of any shard (primary or not). The terms are only incremented on primary assignment and promotion.

What I meant was that they are both always the same for each copy (although primary term and version can of course differ). Shard version is only in the ShardRoutings but shard term is in both and that seems redundant to me.
I was actually hoping we could move the shard term to IndexMetaData only and not store this information in the ShardRouting too because I found it cumbersome to figure out where shard versions are incremented before and now we do the same thing for primary terms. But after some digging I think this is not really easy to do.

However, this is not a problem with this pull request but more with how versioning of MetaData, IndexMetaData, shards etc. works now. I have no good idea how to make this easier to read but opened an issue here to discuss: #14158
We can leave it now in this pull request as is.

s1monw · 2015-10-16T20:28:34Z

I looked at the PR and I wonder if we should introduce a dedicated class for this, for several reasons:

- documentation: I think this needs a lot of documentation, once used, about what it is and what its semantics are.
- we can implement ToXContent, Comparable and Writeable
- we can ensure it's always positive and never decreasing
- we should also use a long just to be on the safe end :) (over paranoid simon)

WDYT?
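For illustration only, a dedicated class along the lines proposed above might look roughly like the sketch below. It is hypothetical and is not code from this PR. It covers only the `Comparable`, validation and `long` points; the Elasticsearch-specific `ToXContent` and `Writeable` interfaces are left out to keep the sketch self-contained, and it allows zero as a starting value, which is an assumption.

```java
// Hypothetical wrapper class; not part of this PR.
final class PrimaryTermSketch implements Comparable<PrimaryTermSketch> {
    private final long value;

    PrimaryTermSketch(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("primary term must not be negative: " + value);
        }
        this.value = value;
    }

    long value() {
        return value;
    }

    /** Returns the term that follows this one; terms only ever move forward. */
    PrimaryTermSketch next() {
        return new PrimaryTermSketch(Math.addExact(value, 1L));
    }

    @Override
    public int compareTo(PrimaryTermSketch other) {
        return Long.compare(value, other.value);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PrimaryTermSketch && ((PrimaryTermSketch) o).value == value;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(value);
    }
}
```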

bleskes · 2015-10-19T18:58:40Z

Pushed another commit with a fix for the double version increment issue @brwe found and some(what) beefed-up javadocs.

@s1monw I gave it some more thought and I still think - at least as things stand now - that a wrapper class for the PrimaryTerm will add complexity instead of making things clearer. It would just be a wrapper around an int and would obscure simple operations behind method calls. Since it's a gut-feeling thing, I asked the group today and @jasontedor tends to agree. We do totally see the importance of documentation. I've beefed up what I could in the current PR and added an explicit docs todo on the seq no meta data issue. I suggest we proceed as is. This is the very first step in a longer journey - as soon as there is more complex logic around the primary term that needs a home, we'll wrap it up in a class.

I also moved primary terms to be long. I made them int originally to address concerns people voiced about 16 bytes (term + counter) per doc but I agree we can review it later on and maybe just encode it differently.

jasontedor · 2015-10-20T15:06:40Z

core/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java

+        }
+
+        private void primaryTerms(long[] primaryTerms) {
+            this.primaryTerms = primaryTerms;


Should this be a copy?

Sure, I'll make a copy for safety (though it's called with freshly constructed arrays).

Maybe I misread, but I think there's one place where it's not in IndexMetaDataDiff.apply?

You didn’t misread - the only thing is that the diffs are read off the network and are discarded. That one pulled me over the line to actually change it and copy the array.
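The defensive copy discussed here can be sketched as follows. `TermsHolder` is a hypothetical stand-in for the builder; the point is only that cloning the incoming array means later changes to the caller's array cannot leak into the holder.

```java
// Hypothetical stand-in for the builder discussed above; not the real IndexMetaData.Builder.
final class TermsHolder {
    private long[] primaryTerms;

    /** Stores a private copy so the caller cannot mutate our state afterwards. */
    void primaryTerms(long[] primaryTerms) {
        this.primaryTerms = primaryTerms.clone();
    }

    long primaryTerm(int shardId) {
        return primaryTerms[shardId];
    }
}
```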

On 21 Oct 2015, at 18:15, Jason Tedor notifications@github.com wrote:

In core/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java:

        }

        /**
         * sets the primary term for the given shard.
         * See {@link IndexMetaData#primaryTerm(int)} for more information.
         */
        public Builder primaryTerm(int shardId, long primaryTerm) {
            if (primaryTerms == null) {
                initializePrimaryTerms();
            }
            this.primaryTerms[shardId] = primaryTerm;
            return this;
        }

        private void primaryTerms(long[] primaryTerms) {
            this.primaryTerms = primaryTerms;

> Maybe I misread, but I think there's one place where it's not in IndexMetaDataDiff.apply?

jasontedor · 2015-10-20T17:11:25Z

I left a few more comments:

I have reservations about the conversion from int to long but I think we can continue to think about that as this work progresses. Otherwise, LGTM.

bleskes · 2015-10-21T16:27:47Z

this is pushed to the feature/seq_no branch. Thanks @jasontedor @brwe and @s1monw for the reviews.

Primary terms are a way to make sure that operations replicated from a stale primary are rejected by shards following a newly elected primary. The original PRs adding this to the seq# feature branch are elastic#14062 and elastic#14651. Unlike those PRs, here we take a different approach (based on newer code in master) where the primary terms are stored in the meta data only (and not in `ShardRouting` objects). Relates to elastic#17038. Closes elastic#17044.

bleskes added >enhancement :Internal labels Oct 12, 2015

brwe reviewed Oct 12, 2015
jasontedor reviewed Oct 13, 2015
brwe reviewed Oct 14, 2015
bleskes added 10 commits October 15, 2015 20:40

initial commit

a798db3

better handling of primary term in builder

905404c

fix StartedShardsRoutingTests

3dff0ea

fix ClusterStateCreationUtils

47ff727

move primary term syncing with meta data to AllocationService.

849876b

add primary terms to ClusterState's XContent rendering

f8c6e6b

add some assertions to RecoveryFromGatewayIT

2ea979d

failed to increment routing table and meta data table

f1434f4

truely copy the primaryTerms array

3d3df93

fix RoutingTableTests to user routing results

82f65fa

jasontedor reviewed Oct 16, 2015
brwe mentioned this pull request Oct 16, 2015

Cluster state and versions - when should we increment which version and how often? #14158

Closed

fancy hashing

f5499d3

bleskes added 2 commits October 19, 2015 19:26

fix double version increments + some improved docs

3d5afcb

move primaryTerm to a long

21d2654

jasontedor reviewed Oct 20, 2015
final feeback

a13a424

bleskes closed this Oct 21, 2015

bleskes deleted the primary_terms branch October 21, 2015 16:27

bleskes mentioned this pull request Nov 3, 2015

Add Sequence Numbers to write operations #10708

Closed

64 tasks

clintongormley added :Sequence IDs and removed :Internal labels Nov 18, 2015

bleskes mentioned this pull request Mar 10, 2016

Port Primary Terms to master #17044

Merged

clintongormley added :Engine :Distributed/Engine and removed :Sequence IDs labels Feb 14, 2018

vigyasharma mentioned this pull request Feb 22, 2019

Data loss when old master node dead and startup again #39282

Closed
