
Throttle recovery speed #1821

Open · bobrik opened this issue Jan 24, 2018 · 6 comments

@bobrik (Contributor) commented Jan 24, 2018

We're in the process of migrating ClickHouse from older hardware to a newer generation.

Older machines have 12x6T disks, 128GB RAM, and 2x10G NICs; newer machines have 12x10T disks, 256GB RAM, and 2x25G NICs. The dataset per replica is around 35TiB, and each shard has 3 replicas.

Our process is as follows (a rough shell sketch is below the list):

  1. Stop one replica of the shard.
  2. Clear it from ZooKeeper.
  3. Remove it from the cluster topology (znode update for remote_servers).
  4. Add the new replica to the cluster topology.
  5. Start the new replica and let it replicate all the data from its peers.
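For illustration only, here's roughly what those steps look like in shell. Everything concrete here is an assumption: the hostnames (old-replica, new-replica, zk1), the table, and the ZooKeeper paths are made up, and the exact znode layout depends on the zookeeper_path in your ReplicatedMergeTree definitions:

    # 1. Stop the outgoing replica (hostname is hypothetical).
    ssh old-replica 'sudo systemctl stop clickhouse-server'

    # 2. Clear its metadata from ZooKeeper; the path mirrors the table's
    #    zookeeper_path plus /replicas/<replica_name> (all illustrative).
    zkCli.sh -server zk1:2181 rmr /clickhouse/tables/shard1/events/replicas/old-replica

    # 3./4. Swap old-replica for new-replica in the znode that backs
    #       remote_servers, however your config substitution is wired up.

    # 5. Start the new replica; on first start it registers itself in
    #    ZooKeeper and begins fetching all parts from its peers.
    ssh new-replica 'sudo systemctl start clickhouse-server'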

The issue we're seeing is that source replicas saturate disks, starving user queries and merges.

It takes ~7h to replicate the full dataset; below are graphs for the 12h around that time:

[five graphs omitted]

Source peer:

[two graphs omitted]

Target peer:

[two graphs omitted]

Naturally, the source peers are not great at IO in the first place (that's why we're upgrading), but 7h of degraded service hurts. It'd be nice to be able to cap the recovery speed so that it doesn't starve other activities like user queries and merges.

Moreover, the number of parts per partition on the target replica quickly reaches the maximum (we set it to 1000), at which point inserts are throttled, which doesn't make things any better. With recovery throttled to a lower speed, and therefore running longer, that window would likely stretch even further. Maybe the threads that do merges should be split from the threads that do replication; it looks like the whole pool is busy just replicating.
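To see how close partitions are getting to that ceiling (the 1000 we set corresponds to the parts_to_throw_insert MergeTree setting), a quick look at system.parts works; the query below is just a sketch:

    # Count active parts per partition; the top entries are the ones
    # approaching the insert-throttling threshold.
    clickhouse-client --query "
        SELECT table, partition, count() AS active_parts
        FROM system.parts
        WHERE active
        GROUP BY table, partition
        ORDER BY active_parts DESC
        LIMIT 10"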

It's also possible that we're just doing it wrong; if so, it'd be great to have a guide describing the recommended process.

cc @vavrusa, @dqminh, @bocharov

@bobrik (Contributor, Author) commented Jan 29, 2018

If anyone stumbles on this issue: we worked around it by throttling the 50G interface down to 4G-8G with the comcast tool, depending on disk performance:

    ./bin/comcast --device vlanXYZ --stop; ./bin/comcast --device vlanXYZ --packet-loss=0% --target-bw $((4 * 1000 * 1000))

Linux is still not great at throttling ingress, so this has to be run on the source replicas.
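Under the hood comcast just drives tc, so if you'd rather not ship an extra binary, a minimal token-bucket equivalent on the source replica looks roughly like this (the device name and rate are illustrative, and burst/latency likely need tuning for multi-gigabit rates):

    # Shape egress to ~4Gbit/s with a token bucket filter.
    tc qdisc add dev vlanXYZ root tbf rate 4gbit burst 1mb latency 50ms

    # Remove the shaping once recovery is done.
    tc qdisc del dev vlanXYZ root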

@blinkov (Member) commented Sep 13, 2018

Looks like you successfully got through this.

@alexey-milovidov (Member) commented Sep 13, 2018

Limiting the number of connections and throttling the recovery speed is indeed a pending task.

@filimonov (Collaborator) commented Mar 12, 2019

Related: #520

@xichen2020 commented Apr 30, 2019

+1, we're running into similar issues when adding/replacing nodes during our tests. Any ETA on a fix?

@Felixoid (Member) commented Jun 11, 2019

Surprisingly, it affects our hypervisors during deployment of new nodes as well... I'd like to see either a speed or a thread-count limit.
