FATAL ERROR: "172.17.0.2:51081" is in use (duplicate or overlapping run?) #140

Closed
HeinrichTremblay opened this issue Jul 17, 2023 · 10 comments


@HeinrichTremblay

Error Message

E 14:12:39.182323 target:296 FATAL ERROR: t[XIbjcKDg]: "172.17.0.2:51081" is in use (duplicate or overlapping run?)
FATAL ERROR: t[XIbjcKDg]: "172.17.0.2:51081" is in use (duplicate or overlapping run?)

Context

I initially deployed aistore successfully using the Docker image, following the docs. The fatal error appeared after I restarted my machine and ran the Docker image again to start the cluster (since the container was no longer running). Here is the docker run command:

docker run -d \
  -p 51080:51080 \
  -v /mnt/disk0:/ais/disk0 \
  -v /mnt/disk1:/ais/disk1 \
  -v /mnt/disk2:/ais/disk2 \
  aistorage/cluster-minimal:latest
@alex-aizman
Member

alex-aizman commented Jul 17, 2023

  • The proper way to shut down a node or an entire cluster is to use the (documented) shutdown API. In CLI terms, for the cluster it'd be something like ais cluster shutdown. The same applies across the board, from big production clusters to a toy cluster like the one you are running.
  • A starting-up node always checks for the "proper shutdown" condition. This is done for several reasons that are beyond the scope of this discussion.
  • Upon detecting a problem, the next thing we do is check for a "duplicate run": an attempt to run the same node twice. For this, we currently use the lsof command, e.g. lsof -sTCP:LISTEN -i tcp@hostname:51080.

You say: "restarted my machine." It'd be interesting to find out why exactly lsof reports that somebody's still listening on 172.17.0.2:51081 after restart.
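
For reference, a minimal sketch of the sequence described above, reusing the port mapping and mountpaths from the docker run command in the issue description:

# graceful shutdown through the documented API, then restart the container
$ AIS_ENDPOINT=http://localhost:51080 ais cluster shutdown
$ docker run -d -p 51080:51080 \
    -v /mnt/disk0:/ais/disk0 -v /mnt/disk1:/ais/disk1 -v /mnt/disk2:/ais/disk2 \
    aistorage/cluster-minimal:latest

# the host-side equivalent of the duplicate-run check a starting-up node performs;
# any output means something is already listening on that endpoint
$ lsof -sTCP:LISTEN -i tcp@localhost:51080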

@compiaffe

compiaffe commented Aug 25, 2023

I see the exact same problem.

I checked with lsof -sTCP:LISTEN -i tcp@localhost:51080 and lsof -sTCP:LISTEN -i tcp@localhost:51081 before starting; nothing is reported there.

However, I can successfully start ais if I reformat the partition prior to starting docker.

@HeinrichTremblay
Author

I also checked with lsof -sTCP:LISTEN -i tcp@localhost:51080 and lsof -sTCP:LISTEN -i tcp@localhost:51081 and got no output.

I inspected the source code for the check that triggers the error message and found the checkRestarted function, which checks for markers.

func (t *target) checkRestarted() (fatalErr, writeErr error) {
	if fs.MarkerExists(fname.NodeRestartedMarker) {
		// NOTE the risk: duplicate aisnode run - which'll fail shortly with "bind:
		// address already in use" but not before triggering (`NodeRestartedPrev` => GFN)
		// sequence and stealing nlog symlinks - that's why we go extra length
		if _lsof(t.si.PubNet.TCPEndpoint()) {
			fatalErr = fmt.Errorf("%s: %q is in use (duplicate or overlapping run?)",
				t, t.si.PubNet.TCPEndpoint())
			return
		}

		t.statsT.Inc(stats.RestartCount)
		fs.PersistMarker(fname.NodeRestartedPrev)
	}
	fatalErr, writeErr = fs.PersistMarker(fname.NodeRestartedMarker)
	return
}

I tried deleting .ais.markers directly on the mounted disks, and now it works again when I run the Docker image to start the cluster; running ais show cluster confirms that my cluster is up as expected.
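
Concretely, the workaround was roughly the following, assuming the /mnt/disk0..2 mountpaths from the docker run command above (sudo may or may not be needed depending on ownership):

# remove the restart markers from every mounted disk
$ sudo rm -rf /mnt/disk0/.ais.markers /mnt/disk1/.ais.markers /mnt/disk2/.ais.markers

# start the cluster again and verify it is up
$ docker run -d -p 51080:51080 \
    -v /mnt/disk0:/ais/disk0 -v /mnt/disk1:/ais/disk1 -v /mnt/disk2:/ais/disk2 \
    aistorage/cluster-minimal:latest
$ AIS_ENDPOINT=http://localhost:51080 ais show cluster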

@alex-aizman
Member

Of course. But that's illegal: the whole point of this specific persistent marker, and the reason for its existence, is to let us know that the node restarted without a proper shutdown.

@compiaffe

The error message is a little confusing in that case. Shouldn't the system automatically try to recover from such a condition?

In any case, good to know how to manually recover.

@alex-aizman
Member

The keyword is "overlapping run". Maybe there's a better way to express the fact that there is another instance of an ais storage target running (and listening on the same local port), and that an immediate exit seems to be the best remedy.

@compiaffe

Yes, the overlapping run is clear as such. However, the user doesn't explicitly spin up a second target, nor is one running on the host machine prior to starting the Docker container. It is clearly the cluster-minimal container that tries to spin up multiple overlapping targets.

So the question is why the cluster-minimal container spins up multiple targets, causing the error message shown above.

The only difference between successful runs and the ones exhibiting this behaviour is the mounting of a volume that was not properly shut down.

@alex-aizman
Member

alex-aizman commented Aug 31, 2023

I just can't reproduce it. Here's what I've done:

# 1. run it first time
#  `/tmp/cluster-minimal` here is just an arbitrary place where the container can write 
$ docker run -d -p 51080:51080 -v /tmp/cluster-minimal:/ais/disk0 aistorage/cluster-minimal:latest
# 2. use this new cluster somehow
$ AIS_ENDPOINT=http://localhost:51080 aisloader -bucket=ais://nnn -cleanup=false -totalputsize=50M -duration=0 -minsize=1MB -maxsize=1MB -numworkers=8 -pctput=100 -quiet

$ AIS_ENDPOINT=http://localhost:51080 ais ls --summary
# 3. shutdown
$ AIS_ENDPOINT=http://localhost:51080 ais cluster shutdown
# 4. restart
$ docker run -d -p 51080:51080 -v /tmp/cluster-minimal:/ais/disk0 aistorage/cluster-minimal:latest
# 5. Finally, see that it sees ais://nnn bucket and generally works
$ export AIS_ENDPOINT=http://localhost:51080
$ ais show cluster
$ ais ls --summary

# and so on

This is with aistore v3.19

@compiaffe

compiaffe commented Sep 5, 2023

The difference is that I hadn't run ais cluster shutdown; instead I either restarted the machine, or did a docker stop or docker compose down (depending on usage).
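
For completeness, the graceful sequence the maintainer describes, run before a docker stop (or a machine restart), would look roughly like this; the container name/ID placeholder is whatever docker ps reports:

# shut the cluster down through the API first, then stop the container
$ AIS_ENDPOINT=http://localhost:51080 ais cluster shutdown
$ docker stop <container-id>   # or: docker compose down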

@alex-aizman
Member

#140 (comment)

closing
