FATAL ERROR: "172.17.0.2:51081" is in use (duplicate or overlapping run?) #140

Closed
HeinrichTremblay opened this issue Jul 17, 2023 · 10 comments


@HeinrichTremblay

Error Message

E 14:12:39.182323 target:296 FATAL ERROR: t[XIbjcKDg]: "172.17.0.2:51081" is in use (duplicate or overlapping run?)
FATAL ERROR: t[XIbjcKDg]: "172.17.0.2:51081" is in use (duplicate or overlapping run?)

Context

I initially deployed aistore successfully using the Docker image, following the docs. The fatal error appeared after I restarted my machine and ran the Docker image again to start the cluster (since the container was no longer running). Here is the docker run command:

docker run -d \
  -p 51080:51080 \
  -v /mnt/disk0:/ais/disk0 \
  -v /mnt/disk1:/ais/disk1 \
  -v /mnt/disk2:/ais/disk2 \
  aistorage/cluster-minimal:latest
@alex-aizman
Member

alex-aizman commented Jul 17, 2023

  • The proper way to shut down a node or an entire cluster is to use the (documented) shutdown API. In CLI terms, for the cluster it'd be something like ais cluster shutdown. The same applies across the board, from big production clusters to a toy cluster like the one you are running.
  • A starting-up node always checks for the "proper shutdown" condition. This is done for several reasons that are beyond the scope of this discussion.
  • Upon detecting a problem, the next thing we do is check for a "duplicate run": an attempt to run the same node twice. For this, we currently use the lsof command, e.g. lsof -sTCP:LISTEN -i tcp@hostname:51080.

You say: "restarted my machine." It'd be interesting to find out why exactly lsof reports that somebody's still listening on 172.17.0.2:51081 after restart.
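
For reference, a minimal sketch of the sequence described above, reusing the port mapping and mountpaths from the docker run command in the issue description:

# graceful shutdown through the documented API, then restart the container
$ AIS_ENDPOINT=http://localhost:51080 ais cluster shutdown
$ docker run -d -p 51080:51080 \
    -v /mnt/disk0:/ais/disk0 -v /mnt/disk1:/ais/disk1 -v /mnt/disk2:/ais/disk2 \
    aistorage/cluster-minimal:latest

# the host-side equivalent of the duplicate-run check a starting-up node performs;
# any output means something is already listening on that endpoint
$ lsof -sTCP:LISTEN -i tcp@localhost:51080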

@compiaffe

compiaffe commented Aug 25, 2023

I see the exact same problem.

I checked with lsof -sTCP:LISTEN -i tcp@localhost:51080 and lsof -sTCP:LISTEN -i tcp@localhost:51081 before starting; nothing is reported there.

However, I can successfully start ais if I reformat the partition prior to starting docker.

@HeinrichTremblay
Author

I also checked with lsof -sTCP:LISTEN -i tcp@localhost:51080 and lsof -sTCP:LISTEN -i tcp@localhost:51081 and got no output.

I inspected the source code for the check that triggers the error message and found the checkRestarted function, which checks for markers.

func (t *target) checkRestarted() (fatalErr, writeErr error) {
	if fs.MarkerExists(fname.NodeRestartedMarker) {
		// NOTE the risk: duplicate aisnode run - which'll fail shortly with "bind:
		// address already in use" but not before triggering (`NodeRestartedPrev` => GFN)
		// sequence and stealing nlog symlinks - that's why we go extra length
		if _lsof(t.si.PubNet.TCPEndpoint()) {
			fatalErr = fmt.Errorf("%s: %q is in use (duplicate or overlapping run?)",
				t, t.si.PubNet.TCPEndpoint())
			return
		}

		t.statsT.Inc(stats.RestartCount)
		fs.PersistMarker(fname.NodeRestartedPrev)
	}
	fatalErr, writeErr = fs.PersistMarker(fname.NodeRestartedMarker)
	return
}

I tried deleting .ais.markers directly on the mounted disks, and now it works again when I run the Docker image to start the cluster; running ais show cluster confirms that my cluster is up as expected.
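
Concretely, the workaround was roughly the following, assuming the /mnt/disk0..2 mountpaths from the docker run command above (sudo may or may not be needed depending on ownership):

# remove the restart markers from every mounted disk
$ sudo rm -rf /mnt/disk0/.ais.markers /mnt/disk1/.ais.markers /mnt/disk2/.ais.markers

# start the cluster again and verify it is up
$ docker run -d -p 51080:51080 \
    -v /mnt/disk0:/ais/disk0 -v /mnt/disk1:/ais/disk1 -v /mnt/disk2:/ais/disk2 \
    aistorage/cluster-minimal:latest
$ AIS_ENDPOINT=http://localhost:51080 ais show cluster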

@alex-aizman
Member

Of course. But that's illegal: the whole point of this specific persistent marker, and the reason for its existence, is to let us know that the node restarted without a proper shutdown.

@compiaffe

The error message is a little confusing in that case. Shouldn't the system automatically try to recover from such a condition?

In any case, good to know how to manually recover.

@alex-aizman
Member

The keyword is "overlapping run". Maybe there's a better way to express the fact that there is another instance of an ais storage target running (and listening on the same local port), and that an immediate exit seems to be the best remedy.

@compiaffe

Yes, the overlapping run is clear as such. However, the user doesn't explicitly spin up a second target, nor is one running on the host machine prior to starting the Docker container. It is clearly the cluster-minimal container that tries to spin up multiple overlapping targets.

So the question is why the cluster-minimal container spins up multiple targets, causing the error message shown above.

The only difference between successful runs and the ones exhibiting this behaviour is the mounting of a volume that was not properly shut down.

@alex-aizman
Member

alex-aizman commented Aug 31, 2023

I just can't reproduce it. Here's what I've done:

# 1. run it first time
#  `/tmp/cluster-minimal` here is just an arbitrary place where the container can write 
$ docker run -d -p 51080:51080 -v /tmp/cluster-minimal:/ais/disk0 aistorage/cluster-minimal:latest
# 2. use this new cluster somehow
$ AIS_ENDPOINT=http://localhost:51080 aisloader -bucket=ais://nnn -cleanup=false -totalputsize=50M -duration=0 -minsize=1MB -maxsize=1MB -numworkers=8 -pctput=100 -quiet

$ AIS_ENDPOINT=http://localhost:51080 ais ls --summary
# 3. shutdown
$ AIS_ENDPOINT=http://localhost:51080 ais cluster shutdown
# 4. restart
$ docker run -d -p 51080:51080 -v /tmp/cluster-minimal:/ais/disk0 aistorage/cluster-minimal:latest
# 5. Finally, see that it sees ais://nnn bucket and generally works
$ export AIS_ENDPOINT=http://localhost:51080
$ ais show cluster
$ ais ls --summary

# and so on

This is with aistore v3.19

@compiaffe

compiaffe commented Sep 5, 2023

The difference is that I hadn't run ais cluster shutdown; instead I either restarted the machine, or did a docker stop or docker compose down (depending on usage).
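
For completeness, the graceful sequence the maintainer describes, run before a docker stop (or a machine restart), would look roughly like this; the container name/ID placeholder is whatever docker ps reports:

# shut the cluster down through the API first, then stop the container
$ AIS_ENDPOINT=http://localhost:51080 ais cluster shutdown
$ docker stop <container-id>   # or: docker compose down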

@alex-aizman
Member

#140 (comment)

closing
