Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Primary Terms #14062

Closed
wants to merge 19 commits into from
Closed

Introduce Primary Terms #14062

wants to merge 19 commits into from

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Oct 12, 2015

Every shard group in Elasticsearch has a selected copy called a primary. When a primary shard fails a new primary would be selected from the existing replica copies. This PR introduces primary terms to track the number of times this has happened. This will allow us, as follow up work and among other things, to identify operations that come from old stale primaries. It is also the first step in road towards sequence numbers.

Relates to #10708

@bleskes
Copy link
Contributor Author

bleskes commented Oct 12, 2015

@brwe @jasontedor care to take a look?

@@ -637,6 +662,9 @@ public boolean equals(Object o) {
if (unassignedInfo != null ? !unassignedInfo.equals(that.unassignedInfo) : that.unassignedInfo != null) {
return false;
}
if (primaryTerm != that.primaryTerm) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a change in hashCode() too?

@brwe
Copy link
Contributor

brwe commented Oct 12, 2015

Should the primary term also increase when we move a primary from one node to another?

@brwe
Copy link
Contributor

brwe commented Oct 12, 2015

When I restart a node then primaryTerm of primaries is incremented by 2. Is this intended?

@@ -580,6 +605,7 @@ public void writeTo(StreamOutput out) throws IOException {
out.writeLong(version);
out.writeByte(state.id);
Settings.writeSettingsToStream(settings, out);
out.writeIntArray(primaryTerms);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we expect these to be non-negative and "not large", I wonder if it'd be better to serialize these using a variable-length encoding? See this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can - (and yeah, reviewed your PR :) )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to writeVIntArray

@bleskes
Copy link
Contributor Author

bleskes commented Oct 13, 2015

Should the primary term also increase when we move a primary from one node to another?

We can maybe later. My feeling is now that this not needed and as this is the "same" primary - it just moved.

When I restart a node then primaryTerm of primaries is incremented by 2. Is this intended?

Good catch!! fixed and added some testing.

assertThat(testAllocator.needToFindPrimaryCopy(shard), equalTo(false));
}

@Test
public void testNoProcessPrimayNotAllcoatedBefore() {
ShardRouting shard = TestShardRouting.newShardRouting("test", 0, null, null, null, true, ShardRoutingState.UNASSIGNED, 0, new UnassignedInfo(UnassignedInfo.Reason.INDEX_CREATED, null));
public void testNoProcessPrimacyNotAllocatedBefore() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep.. fixed.

@brwe
Copy link
Contributor

brwe commented Oct 14, 2015

I left some nitpicking but in general I wonder this: shard version and primary term should always be same for all copies. We add primary term to index meta data but not the shard version. Also, we write the shard version when we persist the shard meta data but not the primary term. Why do we treat them differently?
I also wonder why we cannot just get the version of the shard and the primary term from the index metadata instead of adding this information to the shard routings? It might be less confusing to have a single source of truth for this information and we write the index meta data now in any case.

return true;
}

@Override
public int hashCode() {
int result = index.hashCode();
result = 31 * result + (int) (version ^ (version >>> 32));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a built-in (int Long#hashCode(long)) for computing the hash code of a long since Java 8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing.

@brwe
Copy link
Contributor

brwe commented Oct 16, 2015

Note exactly - versions are incremented with every shard routing change, of any shard (primary or not). the terms are only incremented on primary assignment en promotion.

What I meant was that they are both always the same for each copy (although primary term and version can of course differ). Shard version is only in the ShardRoutings but shard term is in both and that seems redundant to me.
I was actually hoping we could move the shard term to IndexMetaData only and not store this information in the ShardRouting too because I found it cumbersome to figure out where shard versions are incremented before and now we do the same thing for primary terms. But after some digging I think this is not really easy to do.

However, this is not a problem with this pull request but more with how versioning of MetaData, IndexMetaData, shards etc. works now. I have no good idea how to make this easier to read but opened an issue here to discuss: #14158
We can leave it now in this pull request as is.

@s1monw
Copy link
Contributor

s1monw commented Oct 16, 2015

I look at the PR and I wonder if we should introduce a dedicated class for this for several reasons:

  • documentation, I think this needs a lot of documentation once used what it is and what it's semantics are.
  • we can implement ToXContent, Comparable and Writeable
  • we can ensure it's always positive and never decreasing
  • we should also use a long just to be on the safe end :) (over paranoid simon)

WDYT?

@bleskes
Copy link
Contributor Author

bleskes commented Oct 19, 2015

pushed another commit with a fix for the double version increment issue @brwe found and some(what) beefed java docs.

@s1monw I gave it some more thought and I still think - at least as things stand now - that a wrapper class for the PrimaryTerm will add complexity instead of making things clearer. It will just be a wrapper around an int and would obscure simple operation behind a method. Since it's a gut feeling thing I've asked the group today and @jasontedor tends to agree. We do totally see the importance of documentation. I've beefed up what I could in the current PR and added an explicit docs todo on the seq no meta data issue. I suggest we proceed as is. This is the very first step in a longer journey - as soon as there is more complex logic around the primary term that needs a home we'll wrap it up in a class.

I also moved primary terms to be long. I made them int originally to address concerns people voiced about 16 bytes (term + counter) per doc but I agree we can review it later on and maybe just encode it differently.

}

private void primaryTerms(long[] primaryTerms) {
this.primaryTerms = primaryTerms;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll a copy for safety (though it's called with freshly constructed arrays).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I misread, but I think there's one place where it's not in IndexMetaDataDiff.apply?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You didn’t missread - the only thing is that the diffs are read of the network and are discarded. That one pulled me over the line to actually change it and copy the array.

On 21 Oct 2015, at 18:15, Jason Tedor notifications@github.com wrote:

In core/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java:

  •    }
    
  •    /**
    
  •     \* sets the primary term for the given shard.
    
  •     \* See {@link IndexMetaData#primaryTerm(int)} for more information.
    
  •     */
    
  •    public Builder primaryTerm(int shardId, long primaryTerm) {
    
  •        if (primaryTerms == null) {
    
  •            initializePrimaryTerms();
    
  •        }
    
  •        this.primaryTerms[shardId] = primaryTerm;
    
  •        return this;
    
  •    }
    
  •    private void primaryTerms(long[] primaryTerms) {
    
  •        this.primaryTerms = primaryTerms;
    

Maybe I misread, but I think there's one place where it's not in IndexMetaDataDiff.apply?


Reply to this email directly or view it on GitHub.

@jasontedor
Copy link
Member

I left a few more comments:

I have reservations about the conversion from int to long but I think we can continue to think about that as this work progresses. Otherwise, LGTM.

bleskes added a commit that referenced this pull request Oct 21, 2015
Every shard group in Elasticsearch has a selected copy called a primary. When a primary shard fails a new primary would be selected from the existing replica copies. This PR introduces `primary terms` to track the number of times this has happened. This will allow us, as follow up work and among other things, to identify operations that come from old stale primaries. It is also the first step in road towards sequence numbers.

Relates to #10708
Closes #14062
@bleskes
Copy link
Contributor Author

bleskes commented Oct 21, 2015

this is pushed to the feature/seq_no branch. Thanks @jasontedor @brwe and @s1monw for the reviews.

@bleskes bleskes closed this Oct 21, 2015
@bleskes bleskes deleted the primary_terms branch October 21, 2015 16:27
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Mar 25, 2016
Primary terms is a way to make sure that operations replicated from stale primary are rejected by shards following a newly elected primary.

Original PRs adding this to the seq# feature branch elastic#14062 , elastic#14651 . Unlike those PR, here we take a different approach (based on newer code in master) where the primary terms are stored in the meta data only (and not in `ShardRouting` objects).

Relates to elastic#17038

Closes elastic#17044
@clintongormley clintongormley added :Engine :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants