Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement retrying on cleanup #265

Merged
merged 2 commits into from
Jul 14, 2016
Merged

Implement retrying on cleanup #265

merged 2 commits into from
Jul 14, 2016

Conversation

brainstorm
Copy link
Contributor

Tjena Phil!

This is related to the following backtrace seen on a new cluster with NFS mounts.

Seems that my mantra on "Filesystems are evil and logs should be streamed" is paying dividends in this particular instance, @ewels ;P

  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run                                                                                                  [0/1999]
    _do_run(cmd, checks, log_stdout)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; export TMPDIR=/scratch/hofmann/projects/NA12878-exome-eval/tmp && /home/hofmann/local/share/bcbio/anaconda/bin/multiqc -f /scratch/hofmann/projects/NA12878-exome-eval
/work_local/qc/NA12878-1/variants /scratch/hofmann/projects/NA12878-exome-eval/work_local/multiqc/report/metrics /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/fastqc /scratch/hofmann/projects/NA12878-
exome-eval/work_local/qc/NA12878-2/fastqc /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/samtools /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/samtools /scratch/hofmann/projects
/NA12878-exome-eval/work_local/qc/NA12878-2/variants /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/gemini /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/gemini   -o /scratch/hofm
ann/projects/NA12878-exome-eval/tmp/bcbiotx/c24eb7d9-9a37-45c7-9410-b5cf4d2c9a03/tmp9O6oxf 
[INFO   ]         multiqc : This is MultiQC v0.7
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/variants'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/multiqc/report/metrics'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/fastqc'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/fastqc'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/samtools'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/samtools'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/variants'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/gemini'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/gemini'
[INFO   ]        samtools : Found 2 stats reports
[INFO   ]          fastqc : Found 2 reports
[INFO   ]         multiqc : Report      : ../tmp/bcbiotx/c24eb7d9-9a37-45c7-9410-b5cf4d2c9a03/tmp9O6oxf/multiqc_report.html
[INFO   ]         multiqc : Data        : ../tmp/bcbiotx/c24eb7d9-9a37-45c7-9410-b5cf4d2c9a03/tmp9O6oxf/multiqc_data
[INFO   ]         multiqc : MultiQC complete
Traceback (most recent call last):
  File "/home/hofmann/local/share/bcbio/anaconda/bin/multiqc", line 4, in <module>
    __import__('pkg_resources').run_script('multiqc==0.7', 'multiqc')
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/setuptools-20.3-py2.7.egg/pkg_resources/__init__.py", line 726, in run_script
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/setuptools-20.3-py2.7.egg/pkg_resources/__init__.py", line 1484, in run_script
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/multiqc-0.7-py2.7.egg/EGG-INFO/scripts/multiqc", line 467, in <module>
    multiqc()
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/multiqc-0.7-py2.7.egg/EGG-INFO/scripts/multiqc", line 456, in multiqc
    log.copy_tmp_log()
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/multiqc-0.7-py2.7.egg/multiqc/utils/log.py", line 57, in copy_tmp_log
    shutil.rmtree(log_tmp_dir)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/scratch/hofmann/projects/NA12878-exome-eval/tmp/tmp5e2RDe'
' returned non-zero exit status 1

Bits that deserve some attention before merging this up:

  1. Tune up the default waiting time, ideally backed by experience. Panasas took around 30min on average on my experience... but do we really want to introduce that delay?
  2. Bubble up that error and/or apply it to other disk ops/functions.

@ohofmann, perhaps you can apply this to your cluster "manually" as a "Hotfix" for now, in the interest of time.

@chapmanb

@ewels
Copy link
Member

ewels commented Jul 14, 2016

Hah, ok - interesting that this happens for the log file but not for the temporary directory used for report generation?

Perhaps instead of this, I could just create a new configuration option to disable logging to a file? Then if you / others have problems, you can just disable that piece of code altogether? Also avoids adding another dependency as a bonus. Best of all, we both get our desired behaviour 😉

Phil

@ewels
Copy link
Member

ewels commented Jul 14, 2016

After a bit of googling and help from stack overflow, it sounds like it could be because logger doesn't close its filehandles. So something like the following chunk of code could fix it more cleanly?

logger.handlers[0].close()
logger.removeHandler(logger.handlers[0])
shutil.rmtree(log_tmp_dir)

@ewels
Copy link
Member

ewels commented Jul 14, 2016

Version someone else wrote for the same reason with no dependencies: https://github.com/hashdist/hashdist/pull/116/files (no exponential back off, but would be easy to add).

@brainstorm
Copy link
Contributor Author

Oh, gotcha, forgot to add those links (actually Oliver told me about them first, doh)... anyway will put that robust_rmtree() in, hold on ;)

@@ -6,11 +6,9 @@

import logging
import os
import shutil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need shutil for copying files..

@brainstorm
Copy link
Contributor Author

brainstorm commented Jul 14, 2016

Rebased, the last commit/comment. I think both commits (retrying and robust_rmtree()) should be there for future reference, when exponential backoff suddenly becomes hip ;P.

@brainstorm brainstorm changed the title Implement retrying with exponential backoff Implement retrying on cleanup Jul 14, 2016
@@ -7,9 +7,31 @@
import json
import os
import yaml
import time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add shutil here too.. Good spot Travis!

@ewels ewels merged commit 87a0b07 into MultiQC:master Jul 14, 2016
@ewels
Copy link
Member

ewels commented Jul 14, 2016

Thanks! 👍

@ohofmann
Copy link

I leave the office for a few hours and you fix everything? Thanks, appreciate it -- off to give it a try :) The logger as the source makes sense and might explain why I am not seeing the error with any of the rmtrree() calls that bcbio issues.

@ewels
Copy link
Member

ewels commented Jul 14, 2016

Yup, agreed - MultiQC has other rmtree() calls too which are working fine, so I think it must be the trailing open file handler left by the logger which is causing the problem. That's not fixed here, but I guess we can come back to it if people still encounter the bug.

@roryk
Copy link
Contributor

roryk commented Jul 14, 2016

Awesome, thanks folks. I saw this as well occasionally so it is awesome to have a fix in.

@brainstorm brainstorm deleted the retrying branch July 14, 2016 16:04
@brainstorm
Copy link
Contributor Author

Hmm, apparently it is not enough with robust_rmtree(), we have an EBUSY errno :_/

    do.run(cmd.format(**locals()), "Run multiqc")
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; export TMPDIR=/scratch/hofmann/projects/NA12878-exome-eval/tmp && /home/hofmann/local/share/bcbio/anaconda/bin/multiqc -f /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/variants /scratch/hofmann/projects/NA12878-exome-eval/work_local/multiqc/report/metrics /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/fastqc /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/fastqc /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/samtools /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/samtools /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/variants /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/gemini /scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/gemini   -o /scratch/hofmann/projects/NA12878-exome-eval/tmp/bcbiotx/d53c474d-44fd-4c87-a09e-3743db5f1e4e/tmpw_QyZs 
[INFO   ]         multiqc : This is MultiQC v0.8.dev0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/variants'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/multiqc/report/metrics'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/fastqc'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/fastqc'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/samtools'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/samtools'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/variants'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-1/gemini'
[INFO   ]         multiqc : Searching '/scratch/hofmann/projects/NA12878-exome-eval/work_local/qc/NA12878-2/gemini'
[INFO   ]        samtools : Found 2 stats reports
[INFO   ]          fastqc : Found 2 reports
[INFO   ]         multiqc : Report      : ../tmp/bcbiotx/d53c474d-44fd-4c87-a09e-3743db5f1e4e/tmpw_QyZs/multiqc_report.html
[INFO   ]         multiqc : Data        : ../tmp/bcbiotx/d53c474d-44fd-4c87-a09e-3743db5f1e4e/tmpw_QyZs/multiqc_data
[INFO   ]         multiqc : MultiQC complete
Traceback (most recent call last):
  File "/home/hofmann/local/share/bcbio/anaconda/bin/multiqc", line 475, in <module>
    multiqc()
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/hofmann/local/share/bcbio/anaconda/bin/multiqc", line 464, in multiqc
    log.copy_tmp_log()
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/multiqc/utils/log.py", line 56, in copy_tmp_log
    util_functions.robust_rmtree(log_tmp_dir)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/site-packages/multiqc/utils/util_functions.py", line 33, in robust_rmtree
    shutil.rmtree(path)
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/shutil.py", line 252, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "/home/hofmann/local/share/bcbio/anaconda/lib/python2.7/shutil.py", line 250, in rmtree
    os.remove(fullname)
OSError: [Errno 16] Device or resource busy: '/scratch/hofmann/projects/NA12878-exome-eval/tmp/tmp9RQ21i/.nfs00000000e017f9490000157f'
' returned non-zero exit status 1

@ewels
Copy link
Member

ewels commented Jul 14, 2016

What about if you try my idea above for closing the logging handler? Sorry I can't be more help here, difficult when I can't reproduce the error myself.

@brainstorm
Copy link
Contributor Author

brainstorm commented Jul 15, 2016

@ewels, That too, but do you think it'll then remove the NFS stalled file? tmp9RQ21i/.nfs00000000e017f9490000157f. Worth a shot in any case, brb.

@brainstorm
Copy link
Contributor Author

Wow, there's even an errno dedicated to NFS:

errno.ESTALE
    Stale NFS file handle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants