Complete genomes can be 'complete' or 'chromosome' #14

Closed
tseemann opened this issue Sep 10, 2016 · 7 comments

@tseemann

We have a problem with bacterial genomes: for finished genomes, the assembly_summary.txt file sometimes says complete and sometimes says chromosome.

This is partly because bacteria usually have only one chromosome, and partly because they can have plasmids too. It's confusing.

If I want both, can I use -l complete -l chromosome?
Or do I run 2 commands with the same -o folder?

kblin added the question label Sep 10, 2016
@kblin
Owner

kblin commented Sep 10, 2016

It seems that chromosome means that there can still be gaps in the assembly, but the gaps need to be of known size. Short reads that likely fill the gaps are frequently also included. I've not seen this in complete records.

In any case, you can always run ncbi-genome-download multiple times with the same output dir and different options to get exactly the set of downloaded files that you want.
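
As a rough illustration of the "run it several times into the same output dir" approach, the sketch below only builds the command lines. It assumes the `-l` and `-o` flags and the trailing `bacteria` group exactly as they are written in this thread; `build_commands` is a hypothetical helper, not part of ncbi-genome-download.

```python
import shlex


def build_commands(levels, output_dir, group="bacteria"):
    """Return one argv list per assembly level, all sharing one output dir.

    Flag names (-l, -o) are taken from this thread and are an assumption
    about the CLI, not a verified reference.
    """
    return [
        ["ncbi-genome-download", "-l", level, "-o", output_dir, group]
        for level in levels
    ]


# Print the two invocations; each could be passed to
# subprocess.run(argv, check=True) to actually run it.
for argv in build_commands(["complete", "chromosome"], "genomes"):
    print(shlex.join(argv))
```

Because both runs write into the same folder, the second run adds the chromosome-level records next to the complete ones.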

Allowing multiple --assembly-level values would be nice, as would allowing multiple --format values, and the same for some other options. But it also makes the CLI more and more complicated, not to mention the code. :)

@kblin
Owner

kblin commented Sep 10, 2016

I will add this to the documentation. I'm just beginning to think that the features are outgrowing the README file and I need to start doing this properly.

@tseemann
Author

There is also overlap between taxid and division,
e.g. using -t 2 (or -T 2?) is the same as passing bacteria at the end of the command.

@tseemann
Author

tseemann commented Sep 11, 2016

I am thinking that if --assembly-level or --format contains a comma (,) character, you could convert the value into a regexp and match against it in your filtering logic.

For example --format genbank,fasta,wgs would become ^(genbank|fasta|wgs)$

And --format protein-cds would just be ^(protein-cds)$ etc

I've used this method in my tools before and it works well. You could also use a hash/dict lookup by creating a want_format[ xxx ] = True/False system instead of regexp. This method also has the advantage that if you preset all the valid options to False then you can validate input easily.
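
Both ideas can be sketched in a few lines of Python. This is only an illustration of the suggestion, not how ncbi-genome-download parses its options. `VALID_FORMATS` is a hypothetical list holding just the format names mentioned in this comment, and `to_regex` / `to_lookup` are made-up helper names.

```python
import re

# Hypothetical set of accepted values: only the names used in this thread.
VALID_FORMATS = ("genbank", "fasta", "wgs", "protein-cds")


def to_regex(value):
    """Turn 'a,b,c' into a compiled pattern equivalent to ^(a|b|c)$."""
    parts = [re.escape(p.strip()) for p in value.split(",") if p.strip()]
    return re.compile("^(" + "|".join(parts) + ")$")


def to_lookup(value, valid=VALID_FORMATS):
    """Preset every valid option to False, then switch on the requested ones.

    Anything that is not a preset key is rejected, which gives input
    validation for free.
    """
    want = {name: False for name in valid}
    for part in value.split(","):
        part = part.strip()
        if part not in want:
            raise ValueError(f"unknown option: {part!r}")
        want[part] = True
    return want


pattern = to_regex("genbank,fasta,wgs")
print(bool(pattern.match("fasta")))   # a listed format
print(bool(pattern.match("gff")))     # not listed
print(to_lookup("genbank,fasta"))
```

The regexp version is shorter, while the lookup version can tell a typo apart from a format that was simply not requested.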

@kblin
Owner

kblin commented Sep 11, 2016

That's not quite how the input parsing works, but it shouldn't be too hard to do this.

@kblin
Owner

kblin commented Jan 24, 2018

I think @rhpvorderman's patches take care of this use case, don't you think?

@kblin
Owner

kblin commented Mar 9, 2018

The new 0.2.6 release contains the patch from @rhpvorderman that allows you to use --assembly-level complete,chromosome. That should take care of this use case, right? If not, feel free to reopen this issue.
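
For readers wondering what a comma-separated level amounts to, here is a minimal sketch of filtering summary rows by several levels at once. It is not the code from the patch. It assumes a tab-separated summary with an `assembly_level` column holding values such as `Complete Genome` and `Chromosome`; the `LEVEL_NAMES` mapping, the `filter_by_level` helper, and the sample rows are all made up for illustration.

```python
import csv
import io

# Assumed mapping from CLI-style names to the values in the summary file.
LEVEL_NAMES = {"complete": "Complete Genome", "chromosome": "Chromosome"}

# Made-up sample rows, not real accessions.
SAMPLE = (
    "assembly_accession\tassembly_level\n"
    "ACC_1\tComplete Genome\n"
    "ACC_2\tChromosome\n"
    "ACC_3\tContig\n"
)


def filter_by_level(rows, levels):
    """Keep rows whose assembly_level matches any comma-separated level."""
    wanted = {LEVEL_NAMES[name.strip()] for name in levels.split(",")}
    return [row for row in rows if row["assembly_level"] in wanted]


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
for row in filter_by_level(rows, "complete,chromosome"):
    print(row["assembly_accession"], row["assembly_level"])
```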
