dRep fails with symlinks pointing to the same file #25

Closed

fungs opened this issue Mar 11, 2018 · 4 comments

Comments

@fungs

fungs commented Mar 11, 2018

Hi, dRep produces an error when I try to lower the thresholds. I basically want to cluster/dereplicate all bins, regardless of size and completeness level, so I ran dRep dereplicate -pa 0.8 -sa 0.98 -comp 0 -con 50 -l 20000, which gave the following error in v2.0.5:

2b. Cluster pair-wise MASH clustering
Traceback (most recent call last):
  File "/home/johdro/.conda/envs/drep/bin/dRep", line 26, in <module>
    controller.parseArguments(args)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/controller.py", line 144, in parseArguments
    self.dereplicate_operation(**vars(args))
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/controller.py", line 86, in dereplicate_operation
    drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_workflows.py", line 36, in dereplicate_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 75, in d_cluster_wrapper
    data_folder, wd=workDirectory, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 145, in cluster_genomes
    Cdb, cluster_ret = cluster_mash_database(Mdb, **kwargs)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/drep/d_cluster.py", line 570, in cluster_mash_database
    linkage_db = db.pivot("genome1","genome2","dist")
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/frame.py", line 4382, in pivot
    return pivot(self, index=index, columns=columns, values=values)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 389, in pivot
    return indexed.unstack(columns)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/series.py", line 2224, in unstack
    return unstack(self, level, fill_value)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 474, in unstack
    fill_value=fill_value)
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 116, in __init__
    self._make_selectors()
  File "/home/johdro/.conda/envs/drep/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 154, in _make_selectors
    raise ValueError('Index contains duplicate entries, '
ValueError: Index contains duplicate entries, cannot reshape

It would be great if dRep could also do ANI clustering and representative picking for the smaller bins that usually get filtered out; that is the real challenge in metagenome data.

Best,
Johannes

@MrOlm
Owner

MrOlm commented Mar 12, 2018

Hi Johannes,

  1. That error occurs when more than one input file carries the same genome name. Could that be the case? (A minimal reproduction of the error is sketched below this list.) If not, would you mind uploading your log file? (It is located in log/logger.log.)

  2. The problem with small bins is that mash ANI estimates are inaccurate for them. If your genome set is small enough / if you have the computational power, you can run it with --SkipMash, in which case the small bins should be handled fine.
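For reference, here is a minimal standalone reproduction of that pandas error (this is not dRep's code; the table and column names just mirror the traceback). When the same genome pair appears more than once in the Mash results, the pivot into a distance matrix cannot produce unique row/column labels:

import pandas as pd

# Toy Mash-style table in which the pair (binA.fa, binB.fa) appears twice,
# as happens when two different input paths turn out to be the same genome.
db = pd.DataFrame({
    "genome1": ["binA.fa", "binA.fa", "binB.fa"],
    "genome2": ["binB.fa", "binB.fa", "binA.fa"],
    "dist":    [0.01, 0.01, 0.02],
})

try:
    db.pivot(index="genome1", columns="genome2", values="dist")
except ValueError as err:
    print(err)  # Index contains duplicate entries, cannot reshape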

Best,
-Matt

@fungs
Author

fungs commented Mar 13, 2018

I don't think it contains duplicate files; they are all unique. I will check and see if I can produce a minimal example of the error and send you the log file. It might take a little while, though.

Regarding the ANI calculation, that is a good tip! Maybe dRep could automatically skip the MASH filter for smaller bins if the computations are not costly (which I believe they are not for small vs. small bins).
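To sketch the idea (made-up names, not dRep code; the size cutoff is arbitrary): only the larger bins would go through the cheap Mash pre-clustering, while the small ones would be compared directly with the slower ANI step, since small-vs-small comparisons shouldn't cost much.

import os

def split_by_size(genome_paths, size_cutoff=100_000):
    """Split genome files into (large, small) by file size in bytes (a crude proxy for genome length)."""
    large, small = [], []
    for path in genome_paths:
        (large if os.path.getsize(path) >= size_cutoff else small).append(path)
    return large, small

# large bins -> Mash primary clustering, then ANI within clusters
# small bins -> pairwise ANI directly (roughly what --SkipMash does, but only for this subset)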

Best,
Johannes

@fungs
Author

fungs commented Mar 14, 2018

I've checked and can confirm your hypothesis. The files were all different, but some of them were symlinks pointing to the same file, by my mistake. It seems CheckM resolves these links, yet no duplicate-file filtering is done on the list of input files? A simple fix would be to filter the list and emit a warning (just like it does for empty files, for instance).
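Something along these lines is what I have in mind (the function and variable names are made up, this is not dRep's actual code): resolve symlinks, keep one path per underlying file, and warn about the ones that get dropped.

import os
import logging

def drop_duplicate_genomes(genome_paths):
    """Return the input paths with entries that resolve to the same real file removed."""
    seen = {}       # resolved real path -> first input path that referenced it
    unique = []
    for path in genome_paths:
        real = os.path.realpath(path)
        if real in seen:
            logging.warning("Skipping %s: it points to the same file as %s", path, seen[real])
            continue
        seen[real] = path
        unique.append(path)
    return unique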

fungs changed the title from "Error when lowering thresholds" to "dRep fails with symlinks pointing to the same file" on Mar 14, 2018
@MrOlm
Owner

MrOlm commented Mar 19, 2018

Great, thanks for the suggestion. I'll add it to my internal "to-do" list.
