
Implement a proxy module with manual failover #2625

Open
kostja opened this issue Jul 21, 2017 · 16 comments
@kostja
Contributor

kostja commented Jul 21, 2017

Implement a Tarantool module which would proxy traffic to a master-master configuration and fail over automatically. Right now this is done in https://github.com/tarantool/sentinel; we should take it on board.

The proxy has to be built-in.

API

proxy = require('proxy').new()
proxy:set_rw(uri)
proxy:set_ro(uri)
proxy:listen(uri)

Failover, manual:

proxy:fail_over_to(uri)

How should we forward authentication?
How should we detect which node is a master and which is a replica?
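
For illustration, a minimal sketch of how the proposed API could be used; the URIs are placeholders and the module does not exist yet:

proxy = require('proxy').new()
proxy:set_rw('storage1:3301')   -- current master, takes writes
proxy:set_ro('storage2:3301')   -- replica, read-only
proxy:listen('0.0.0.0:3300')    -- clients connect here instead of the storages

-- Manual failover: the operator promotes the replica out of band,
-- then repoints the proxy to it.
proxy:fail_over_to('storage2:3301')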

@kostja kostja added the feature (A new functionality) label Jul 21, 2017
@kostja kostja added this to the 1.7.6 milestone Jul 21, 2017
@kostja kostja added the prio1 label Aug 15, 2017
@rtsisyk rtsisyk modified the milestones: 1.7.6, 1.7.7 Oct 21, 2017
@kostja kostja modified the milestones: 1.8.5, 1.7.7 Oct 27, 2017
@kostja
Contributor Author

kostja commented Jan 19, 2018

Requested by ICQ

@rtsisyk
Contributor

rtsisyk commented Jan 22, 2018

Please move to 1.7.8

@rtsisyk rtsisyk assigned rtsisyk and unassigned GeorgyKirichenko Jan 22, 2018
@rtsisyk rtsisyk added the needs feedback (Something is unclear with the issue) label Jan 22, 2018
@Gerold103 Gerold103 assigned Gerold103 and unassigned rtsisyk Feb 9, 2018
@kostja kostja modified the milestones: 1.8.0, 1.9.0 Feb 11, 2018
@kostja kostja added the replication and in design (Requires design document) labels and removed the needs feedback (Something is unclear with the issue) and in design (Requires design document) labels Feb 11, 2018
@Gerold103
Collaborator

Gerold103 commented Mar 1, 2018

In progress, but already opened for comments and discussion.

Why?

  1. A client must not know which instance is the master and which is a replica - these
    details must be hidden behind a proxy.

  2. When multiple clients exist, their client-to-proxy connections could be
    multiplexed into fewer, 'fatter' proxy-to-storage connections.

Main solved problems

  1. Connections cannot be multiplexed if they contain different credentials.
    Each 'fat' connection must serve only one Tarantool user (literally a user from
    space._user).

  2. The proxy must use its own sync values, because multiple client-to-proxy
    connections multiplexed into a single proxy-to-storage connection can send
    conflicting syncs. So the proxy must replace these syncs with its own when
    sending requests to a storage, and translate them back when sending a response
    to a client. This mechanism, similar in manner to NAT, must work per
    proxy-to-storage connection.

  3. If a proxy runs in the same process as the replicaset master, then all its
    requests must be sent to the master directly via cbus, with no socket I/O
    overhead. This is the killer feature of the Tarantool proxy.

API

Proxy is a module, and can be configured like this:

proxy = require('proxy')
proxy.cfg{
	replicaset = {
		{ uri = <instance URI>, is_master = <boolean> },
		...
	},
	-- Options kept the same as in the storage (box.cfg) configuration --
	listen,
	readahead,
	log,
	log_nonblock,
	log_format
}

On proxy.cfg with listen set, the proxy starts an IProto thread, or joins an existing
one if box.cfg has already been called and the master in the proxy configuration is
exactly this instance. All options are static in the first implementation.
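
For illustration, a concrete configuration under this proposed scheme might look as follows; all URIs and option values are hypothetical:

proxy = require('proxy')
proxy.cfg{
	replicaset = {
		{ uri = 'storage1:3301', is_master = true },
		{ uri = 'storage2:3301', is_master = false },
	},
	listen = 3300,
	readahead = 16320,
	log = 'proxy.log',
}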

Architecture

client, user1 ------*
 ...                 \           proxy              master
client, user1 --------*----------* - *----------------*
                         ---- SYNC -> proXYNC ---->
                         <--- SYNC <- proXYNC -----

Connecting, authorization

All proxy internals can be implemented inside the IProto thread. Only
configuration/reconfiguration details must stay in the TX thread.

When started, the IProto thread knows which port the proxy listens on and to which
storages it must proxy incoming connections. The IProto thread establishes connections
to all storages under the guest user, with no IPROTO_AUTH request. Even though the
proxy sends all requests to the master, it must be able to fail over quickly to one
of the replicas, so it must connect to the slaves too.

On connect, for each proxy-to-storage connection the proxy saves the salt - call it
the storage salt from now on. When a client connects, the proxy sends it its
own proxy salt, different from the storage salt.
Then the client sends an IPROTO_AUTH request. The proxy extracts the user name from
the request, looks up the password hash in its local credentials storage, salts it
with the proxy salt and compares the results. If they mismatch, an authorization
error is sent to the client. If they match, the proxy salts the password hash with
the storage salt.
Then it searches for a proxy-to-storage connection under the same user. If one is
found, the modified IPROTO_AUTH request is sent to it. From then on
all requests are proxied with no changes.

If there is no connection under this user, a new one is created.

Note: if the proxy has out-of-date passwords, authorization can succeed on the proxy
but fail on the storage; this is why each new client-to-proxy connection must have
its credentials checked again.
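
A sketch of the salt translation described above, assuming the storages use Tarantool's CHAP-SHA1 scheme: hash2 below is sha1(sha1(password)) as cached by the proxy from space._user, both salts are assumed to be already base64-decoded, and all function names are illustrative:

local digest = require('digest')
local bit = require('bit')

-- XOR two 20-byte SHA-1 digests.
local function xor20(a, b)
	local r = {}
	for i = 1, 20 do
		r[i] = string.char(bit.bxor(a:byte(i), b:byte(i)))
	end
	return table.concat(r)
end

-- Check the client's scramble against the proxy salt and, on success,
-- return the scramble re-salted for the proxy-to-storage connection.
local function translate_scramble(client_scramble, proxy_salt, storage_salt, hash2)
	local step1 = xor20(client_scramble, digest.sha1(proxy_salt:sub(1, 20) .. hash2))
	if digest.sha1(step1) ~= hash2 then
		return nil -- mismatch: an authorization error is sent to the client
	end
	return xor20(step1, digest.sha1(storage_salt:sub(1, 20) .. hash2))
end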

Sync translation

If a proxy-to-storage connection serves one client-to-proxy connection, then
sync translation is not needed - there are no conflicts.

When a proxy-to-storage connection serves multiple client-to-proxy connections,
it stores and maintains an increasing sync counter. Consider the
communication steps (a short sketch follows the list):

  1. A client sends a request to the proxy with sync = s1;
  2. The proxy remembers this sync, changes it to sync = s2, and sends the request
    to the storage;
  3. A response with sync = s2 is received from the storage. The proxy replaces
    s2 back with s1 and sends the response to the client.
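
A minimal sketch of this NAT-like translation; the data structure and names are illustrative, not an actual implementation:

-- One translator per proxy-to-storage connection.
local function new_translator()
	return { next_sync = 0, pending = {} }
end

-- Request path: remember (client connection, s1), return the generated s2.
local function translate_request(tr, client_conn, client_sync)
	tr.next_sync = tr.next_sync + 1
	tr.pending[tr.next_sync] = { conn = client_conn, sync = client_sync }
	return tr.next_sync
end

-- Response path: map s2 back to the client connection and its original s1.
local function translate_response(tr, storage_sync)
	local origin = tr.pending[storage_sync]
	tr.pending[storage_sync] = nil
	return origin.conn, origin.sync
end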

Queuing

Consider one proxy-to-storage connection. To prevent mixing parts of multiple
requests from different client-to-proxy connections, the proxy must forward
requests one by one. To do this fairly, each proxy-to-storage connection has a
queue which stores the client-to-proxy connections whose sockets are available
for reading.

When a client socket with no available data becomes readable, it is placed at the
end of the queue. The first client in the queue is removed from the queue after
forwarding ONE request. If it has more requests, it is placed at the end of the
queue again to send them. This guarantees fairness even if one client is always
available for reading.
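
A sketch of that round-robin queue, with hypothetical helper methods (read_one_request, has_more_requests) standing in for the real IProto machinery:

local queue = {}   -- readable client-to-proxy connections, FIFO order

local function on_client_readable(conn)
	table.insert(queue, conn)
end

local function forward_next(storage_conn)
	local conn = table.remove(queue, 1)
	if conn == nil then
		return
	end
	-- Forward exactly ONE request, then give other clients a turn.
	storage_conn:write(conn:read_one_request())
	if conn:has_more_requests() then
		table.insert(queue, conn)   -- back to the tail: fairness
	end
end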

To speed up sync translation, it can be done right after a request is received
from a client, without waiting until the proxy-to-storage connection is available
for writing. This avoids dawdling with syncs when a client reaches the front of
the queue.

Storage and proxy alliance

The main feature of the proxy is the ability to merge a proxy and a storage. If a
master storage with no box.cfg.listen and a proxy are started in the same process,
then the proxy can send requests to the master with no sockets,
directly via cbus. To do this, the proxy on start must check that the UUID of the
master in its proxy.cfg.replicaset is the same as the local UUID. If it is, then
all requests are sent directly to tx_pipe.
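
A sketch of that check, assuming the proxy learns each configured instance's UUID when it connects (the uuid field below is hypothetical):

-- Return the master entry from proxy.cfg.replicaset if it is this very
-- instance; in that case requests go straight to the TX thread via cbus
-- (tx_pipe) instead of a socket.
local function find_local_master(replicaset)
	for _, instance in ipairs(replicaset) do
		if instance.is_master and instance.uuid == box.info.uuid then
			return instance
		end
	end
	return nil
end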

@Gerold103
Collaborator

Gerold103 commented Mar 1, 2018

@kostja comments:

  • the proxy must not do anything automatic - neither detect a master change nor initiate one;
  • some remarks about authorization that I cannot understand right now;
  • the proxy must be a fully separate module;
  • look at other proxies: MariaDB proxy, haproxy, MySQL Router, proxysql, redis-proxy, redis-sentinel;
  • quote:
Ah, almost forgot: the proxy configuration should make it possible
to forward some requests to the same tarantool which is operating
the proxy, without a round trip to the operating system/socket
interface. In other words, if one of the instances in proxy configuration
is a loopback, the proxy should not need to perform any network
I/O to deliver data to its local node, but push it right into tx
thread.

This would make it possible to entirely replace box.cfg{listen=}
with require('proxy').listen() and make box.cfg{listen=}
obsolete.

@Gerold103 Gerold103 added the in design (Requires design document) label Mar 1, 2018
@Gerold103 Gerold103 changed the title Implement a proxy module with automatic failover Implement a proxy module with manual failover Mar 5, 2018
@Mons
Contributor

Mons commented Apr 10, 2020

https://github.com/moonlibs/apex already implements a proxy with failover, role-aware request balancing, health ping checks and lua-namespace autoload proxying.

@kyukhin
Contributor

kyukhin commented Apr 10, 2020

We might want to implement override of IPROTO_(SELECT|INSERT|...) instead.

@knazarov

After thinking about it for some time, I don't believe this will be useful. For sharded Tarantool, vshard already takes care of this. Even if we put it in front of (multiple) vshard routers, this proxy will fail to handle all the traffic unless it is multi-threaded.

@kostja
Contributor Author

kostja commented Apr 26, 2020

This is supposed to be present in every Tarantool instance, so there will be as many threads as there are iproto threads across all the instances you have. I don't see how it's possible to make RAFT work without it.

@knazarov

@kostja I mean that it will be mostly useless for clients if designed as written above. Proxying authentication and iproto is only useful when the client works with only one instance of Tarantool. And working with a single instance makes it impossible to scale horizontally.

If we subtract the actual proxying and only implement it for regular net.box calls between a router and a storage, it may be useful for RAFT. But still, RAFT is not on the horizon anytime soon, as far as I know. As for vshard, it can already implement such logic in "callrw".

@kostja
Contributor Author

kostja commented Apr 27, 2020

@knazarov the idea with this patch is that eventually you will add vshard router functions into iproto. What you're suggesting is that since you can't do it yet, it's useless. It would be a pity to stop this work only because, as is, it doesn't meet your present-day needs.

@knazarov

The description and design by @Gerold103 don't say anything about it being needed for vshard, or about plans to add vshard calls to iproto. So I'm only making assumptions from what I read, or can deduce from my current knowledge. The original description only talks about client -> storage connections.

As for vshard, I doubt the usefulness of proxying it for a few reasons:

  • you will add an additional network hop: client -> router -> storage -> another storage, instead of the current client -> router -> storage;
  • it will mean that client libraries get a low-level vshard API and have to implement their own merging facilities and a high-level querying API to be generally useful;
  • Tarantool libraries haven't followed a "fat client" design since their inception, and I haven't heard anywhere that this approach will change.

So, in general I see exposing the current vshard API as useless, and would instead like to see a high-level API for client libraries to work with (an API that has at least CRUD-level querying capabilities).

If you think otherwise, please share your reasoning.

@kostja
Contributor Author

kostja commented Apr 27, 2020

Let's talk about read/write queries for elastic, clustered storage for a minute. The target scheme we need to come to is client -> correct storage instance. For this, the client needs to be sharding-topology aware. However, not all clients are sharding-topology aware, and there are times when the topology changes and buckets are being moved around. For these clients and times, iproto should be able to bounce the sharded write/read on the client's behalf.
The "router" function is artificial - a perfect Tarantool cluster should only have application nodes and storage nodes, or even be homogeneous (a true data grid architecture), where every app node is also a storage node.
To transition (eventually) to this kind of topology, all routing functions have to be handled by iproto.

@kostja
Contributor Author

kostja commented Apr 27, 2020

When I was mentioning adding vshard router functions to iproto, I didn't mean exposing a new API. I meant iproto should be able to route a query to the right bucket automatically if it arrives at the wrong storage instance.

@kostja
Contributor Author

kostja commented Apr 27, 2020

And yes, the goal is to move all routing logic to iproto, be it for RAFT or for sharding. This logic won't be isolated in iproto alone, that's impossible: imagine you run a stored procedure on a storage node - eventually we'll want to select from a remote bucket from it. And as you rightfully point out, it is not the best idea to route a request to a random node, so the client will have to be topology-aware as well. That makes at least 3 places where we will do the routing. The way I view the iproto routing function is:

  • enforcing correctness (i.e. doing the right thing in 100% of cases; the client and in-tx net.box can be only 99% right);
  • supporting dumb clients efficiently, so that anyone can write a simple key/value app using a Tarantool cluster without messing with heterogeneous roles or complex drivers.

@kostja
Contributor Author

kostja commented Apr 27, 2020

Imagine operating a heterogeneous cluster, i.e. a cluster you're trying to upgrade from 3.0 to 4.0. In 4.0 your routing logic may be different from 3.0, so 4.0 nodes will need to behave differently depending on whether the bucket is on an old or a new node. There is no way you'll be able to fix old clients and 3.0 net.box to handle that.

@kostja
Contributor Author

kostja commented Apr 27, 2020

BTW, @Gerold103's write-up plus comments about "some of kostja's comments which I did not understand" doesn't pass as a complete spec for me :) Why not do yet another round of RFC discussion on this item and iron out all the wrinkles?
