Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dsub 0.4.3 crashes on ubuntu 20.04 #219

Closed
hylkedonker opened this issue Dec 20, 2020 · 12 comments
Closed

dsub 0.4.3 crashes on ubuntu 20.04 #219

hylkedonker opened this issue Dec 20, 2020 · 12 comments

Comments

@hylkedonker
Copy link

hylkedonker commented Dec 20, 2020

When I run the hello world example

dsub \
   --provider local \
   --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
   --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
   --command 'echo "Hello World" > "${OUT}"' \
   --wait

it crashes as follows:

***WARNING: No Docker image specified. The default, `ubuntu:14.04` will be used.
***WARNING: For reproducible pipelines, specify an image with the `--image` flag.
Job properties:
  job-id: echo--hylke--201220-165144-45
  job-name: echo
  user-id: hylke
Launched job-id: echo--hylke--201220-165144-45
To check the status, run:
  dstat --provider local --jobs 'echo--hylke--201220-165144-45' --users 'hylke' --status '*'
To cancel the job, run:
  ddel --provider local --jobs 'echo--hylke--201220-165144-45' --users 'hylke'
Waiting for job to complete...
Waiting for: echo--hylke--201220-165144-45.
Traceback (most recent call last):
  File "/home/hylke/.local/bin/dsub", line 8, in <module>
    sys.exit(main())
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1106, in main
    dsub_main(prog, argv)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1091, in dsub_main
    launched_job = run_main(args)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1168, in run_main
    return run(
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1322, in run
    error_messages = _wait_after(provider, [job_metadata['job-id']],
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 786, in _wait_after
    jobs_left = _wait_for_any_job(provider, job_ids_to_check, poll_interval,
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/commands/dsub.py", line 1015, in _wait_for_any_job
    tasks = provider.lookup_job_tasks({'*'}, job_ids=job_ids)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/providers/local.py", line 519, in lookup_job_tasks
    task = self._get_task_from_task_dir(j, u, task_id, task_attempt)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/providers/local.py", line 669, in _get_task_from_task_dir
    end_time = self._get_end_time_from_task_dir(task_dir)
  File "/home/hylke/.local/lib/python3.8/site-packages/dsub/providers/local.py", line 583, in _get_end_time_from_task_dir
    datetime.datetime.strptime(f.readline().strip(),
  File "/usr/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '' does not match format '%Y-%m-%d %H:%M:%S.%f

Any ideas how to get the Hello World example to work?

Versions:
dsub: 0.4.3
Ubuntu: 20.04
Python: 3.8.5
Docker: 19.03.11

Thanks,

Hylke

@wnojopra
Copy link
Contributor

Hi @hylkedonker,

That is a very interesting error, and one that I haven't ran across before. When using the local provider, it writes a timestamp to a tmp file named 'end-time.txt' in /tmp. The error seems to indicate that that either that file is empty, or was read incorrectly. There are a few things you could do to help debug:

  1. Do you still get the same error if you run the hello world example again?
  2. Perhaps there is a problem getting the timestamp? Run this command:
    python -c 'import datetime; print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"))'
    You should see a timestamp. If not, then we'll need to debug your system python.
  3. Perhaps there is a problem writing to the file? Run this command:
    cat /tmp/dsub-local/echo--hylke--201220-165144-45/task/end-time.txt
    (or replace echo--hylke--201220-165144-45 with any job id).
    Let me know what is output here.

@hylkedonker
Copy link
Author

hylkedonker commented Dec 21, 2020

Thanks for getting in touch.

  1. Yes, I can confirm that this problem still persists when running the example again.
  2. python3 -c 'import datetime; print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"))' gives the output 2020-12-21 12:01:42.924656.
  3. The output indicates an empty file:
$ cat /tmp/dsub-local/echo--hylke--201221-115840-24/task/end-time.txt 

$
  1. When I uninstall the package and reinstall in the virtual environment:
python3 -m venv dsub_libs
source dsub_libs/bin/activate
pip3 install dsub
dsub \        
   --provider local \
   --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
   --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
   --command 'echo "Hello World" > "${OUT}"' \
   --wait

I get a slightly different error:

***WARNING: No Docker image specified. The default, `ubuntu:14.04` will be used.
***WARNING: For reproducible pipelines, specify an image with the `--image` flag.
Job properties:
  job-id: echo--hylke--201221-123051-85
  job-name: echo
  user-id: hylke
Launched job-id: echo--hylke--201221-123051-85
To check the status, run:
  dstat --provider local --jobs 'echo--hylke--201221-123051-85' --users 'hylke' --status '*'
To cancel the job, run:
  ddel --provider local --jobs 'echo--hylke--201221-123051-85' --users 'hylke'
Waiting for job to complete...
Waiting for: echo--hylke--201221-123051-85.
  echo--hylke--201221-123051-85: FAILURE
['2020-12-21 12:30:54 I: Copying the logs before cleanup']
JobExecutionError: One or more jobs finished with status FAILURE or CANCELED during wait.
echo--hylke--201221-123051-85

with the stderr output:

$ cat stderr.txt 
Error: No such container: dsub-echo--hylke--201221-123051-85.task

I hope this helps.

@wnojopra
Copy link
Contributor

Hi @hylkedonker, do you have Docker installed? The Hello World example uses the local provider, and installing Docker is necessary if using the local provider.

Going back to your first error, I imagine looking at cat /tmp/dsub-local/echo--hylke--201220-165144-45/task/stderr.txt may also reveal information about whether Docker is installed.

The error we are seeing here is certainly not helpful. In the past, I recall seeing errors about Docker not being installed. I will work on a fix about improving the error handling here.

@hylkedonker
Copy link
Author

hylkedonker commented Dec 21, 2020 via email

@wnojopra
Copy link
Contributor

I was able to get a machine with ubuntu 20.04 running, and was able to reproduce your error regardless of whether Docker was installed.

The root issue is that the local provider makes a call to python (not python3) to get timestamps. On ubuntu 20.04, python3 is installed, but python is not. This is not a problem if you are using dsub from a venv, but we'll fix it in a future release. Using a venv would be my recommended workaround, but if you're unable to resolve the 'No such container' error, you could also try creating a symlink named 'python' that points to 'python3'. This command would do that:

update-alternatives --install /usr/bin/python python /usr/bin/python3 10

@hylkedonker
Copy link
Author

hylkedonker commented Dec 22, 2020

I tried the command and was able to verify that python now points to python3. However the "No such container" error still persists (both with and without venv). Any suggestions for further debugging?

@wnojopra
Copy link
Contributor

I'm unable to reproduce the "no such container" error, and doing a couple searches mostly shows that its a transient issue, which isn't too helpful here. There are a couple things we could do to debug further:

  1. In the same /tmp/.../task dir, can you share the output of the log.txt and runner-log.txt files? There may be more detailed error messages here.
  2. Can we confirm that docker is set up correctly? Does running docker run hello-world produce any errors?
  3. Are there any unexpected containers showing when you run docker ps -a?

@hylkedonker
Copy link
Author

  1. Output of runner-log.txt:
2020-12-23 13:19:19 I: Localizing inputs.
2020-12-23 13:19:19 I: Checking image userid.
2020-12-23 13:19:19 I: Task image: ubuntu:14.04
2020-12-23 13:19:21 I: Running Docker image.
docker: open /tmp/dsub-local/echo--hylke--201223-131919-18/task/docker.env: no such file or directory.
See 'docker run --help'.
Error response from daemon: No such container: dsub-echo--hylke--201223-131919-18.task
2020-12-23 13:19:21 I: Docker exit code .
2020-12-23 13:19:21 I: Ensure host user (1000:1000) owns Docker-written data
2020-12-23 13:19:22 I: Copying outputs.
CommandException: No URLs matched: /tmp/dsub-local/echo--hylke--201223-131919-18/task/data/output/file/tmp/dsub-test/output/out.txt
CommandException: 1 file/object could not be transferred.
2020-12-23 13:19:23 E: Error on or near line 47; /tmp/dsub-local/echo--hylke--201223-131919-18/task/runner.sh exiting with status 1
2020-12-23 13:19:23 I: Copying the logs before cleanup

As for the error docker: open /tmp/dsub-local/echo--hylke--201223-131919-18/task/docker.env: no such file or directory., that file does seem to exist:

$ cat /tmp/dsub-local/echo--hylke--201223-131919-18/task/docker.env       
DATA_ROOT=/mnt/data
TMPDIR=/mnt/data/tmp
OUT=/mnt/data/output/file/tmp/dsub-test/output/out.txt
$

Output of log.txt

2020-12-23 13:19:19 I: Localizing inputs.
2020-12-23 13:19:19 I: Checking image userid.
2020-12-23 13:19:19 I: Task image: ubuntu:14.04
2020-12-23 13:19:21 I: Running Docker image.
2020-12-23 13:19:21 I: Docker exit code .
2020-12-23 13:19:21 I: Ensure host user (1000:1000) owns Docker-written data
2020-12-23 13:19:22 I: Copying outputs.
2020-12-23 13:19:23 E: Error on or near line 47; /tmp/dsub-local/echo--hylke--201223-131919-18/task/runner.sh exiting with status 1
2020-12-23 13:19:23 I: Copying the logs before cleanup
  1. The hello-world seems to work correctly.
$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
$

Also, I haven't experienced any problems with Docker lately. No problems building, running and pushing containers.
I am running

$ docker --version      
Docker version 19.03.11, build dd360c7
  1. Nothing pops up that strikes me as out of the ordinary.

I would be more inclined to think that some arguments may be empty, resulting in arguments being passed in the incorrect order.

@wnojopra
Copy link
Contributor

The missing docker.env file looks to be the most suspicious to me, and I'm not sure why we're seeing that. It is a little bit interesting that it is the last file dsub writes before executing the runner.sh script. I wonder if the file is still buffered and not flushed or synced before command to run docker is executed.

I believe the expectation is that because we're using with open(env_fname, 'wt') as f: construct, Python guarantees that the file descriptor close method is called, which flushes the file. This should mean that the subprocess for runner.sh should be able to read the docker.env file, but I'm not sure...

@hylkedonker
Copy link
Author

hylkedonker commented Dec 24, 2020 via email

@hylkedonker
Copy link
Author

hylkedonker commented Dec 30, 2020

It seems that the problem is indeed related to Docker, which does not like the /tmp folder.
For example, this does not work

$ docker run \
   --env-file=/tmp/docker.env \
   "ubuntu:14.04"
docker: open /tmp/docker.env: no such file or directory.
See 'docker run --help'.

But if I copy /tmp/docker.env to my $HOME folder, it works okay:

$ docker run \
   --env-file=$HOME/docker.env \
   "ubuntu:14.04"
$

Solution:
I removed the snap Docker package

sudo snap remove docker

and installed the apt Docker package instead (after adding the Docker apt repository)

sudo apt-get install docker-ce docker-ce-cli containerd.io

Now it seems to be working as expected (I am now running Docker version 20.10.1, build 831ebea):

$ dsub \                
   --provider local \
   --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
   --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
   --command 'echo "Hello World" > "${OUT}"' \
   --wait
***WARNING: No Docker image specified. The default, `ubuntu:14.04` will be used.
***WARNING: For reproducible pipelines, specify an image with the `--image` flag.
Job properties:
  job-id: echo--hylke--201230-204230-26
  job-name: echo
  user-id: hylke
Launched job-id: echo--hylke--201230-204230-26
To check the status, run:
  dstat --provider local --jobs 'echo--hylke--201230-204230-26' --users 'hylke' --status '*'
To cancel the job, run:
  ddel --provider local --jobs 'echo--hylke--201230-204230-26' --users 'hylke'
Waiting for job to complete...
Waiting for: echo--hylke--201230-204230-26.
  echo--hylke--201230-204230-26: SUCCESS
echo--hylke--201230-204230-26

@wnojopra
Copy link
Contributor

wnojopra commented Jan 4, 2021

I'm happy to see you were able to get it working. I am not familiar with snap, but it looks like the way it containerizes docker prevents it from accessing /tmp. I can't think of a workaround with dsub other than to use the apt package of docker as you have done.

On our end we'll update the documentation to make this clear, and provider better error handling in the local provider.

@wnojopra wnojopra closed this as completed Jan 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants