Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to download others branches using downloadRefSeq.pl #5

Closed
RxLoutre opened this issue Aug 10, 2018 · 14 comments
Closed

How to download others branches using downloadRefSeq.pl #5

RxLoutre opened this issue Aug 10, 2018 · 14 comments

Comments

@RxLoutre
Copy link

@RxLoutre RxLoutre commented Aug 10, 2018

Hello there,

I would like to know if there is anyway to download data from other species than this list : "archae bacteria fungi protozoa viral" ?

For specific purpose, I would like to construct a wider database, with less species from each branche (like 1 or 2 fungi, 1 or 2 archae etc) including vertebrate species. I see in the perl code that it was somewhat planned to do so.

Line 90 :

#my @target_subdirs = qw/bacteria fungi protozoa viral invertebrate plant unknown vertebrate_other vertebrate_mammalian/;

I'm really tempted to uncomment it, but I guess it is commented in purpose. Any idea of how I could achieve that ? Is it something you are working on ?

Cheers,

Roxane

@AlexanderDilthey

This comment has been minimized.

Copy link
Member

@AlexanderDilthey AlexanderDilthey commented Aug 13, 2018

Hi Roxane,

Yes, more comfortable ways to construct new reference databases is something I am working on.

First, if you want to download more data, you can specify --targetBranches - e.g. --targetBranches fungi,protozoa

What currently happens is that the subsequent steps will iterate through your downloaded directory; search for gzipped assembly files; extract (i.e. un-gzip) some of these, based on some criteria (e.g. based on whether the assembly is complete); and then, in a final step, find all unzipped assembly files, combine these into a single file, and introduce appropriate additional taxon IDs.

I guess this process might not be ideal for you - you probably want a very comprehensive download, and then somehow specify which files from the download you want to use? What would be a good way for you to specify which assembly files to include?

Best wishes,

Alex

@RxLoutre

This comment has been minimized.

Copy link
Author

@RxLoutre RxLoutre commented Aug 13, 2018

Hi Alex,

Alright, I'm not sure I got all of it. Something that could be easily configurable would be parsing a file listing all the genome you want to be included in. I can't help it but thinking about something that look like FastQScreen conf file :

#DATABASE       Human   /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Homo_sapiens/genome/Hg38/BWA/hg38_normalized.fa
#DATABASE       Mouse   /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Mus_musculus/genome/bwa/Mus_musculus.GRCm38.75.dna.toplevel.fa
#DATABASE       ZebraFish       /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Danio_rerio/genome/bwa/Danio_Rerio_ref_Zv9.fa
#DATABASE       AThaliana       /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Arabidopsis_thaliana/genome/TAIR10/bwa/TAIR10.fasta
#DATABASE       Lizard  /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Anas_platyrhynchos/Genome/BWA/GCF_000355885.1_BGI_duck_1.0_genomic.fasta
#DATABASE       Drosophila      /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Drosophilia_melanogaster/genome/bwa/Drosophila_melanogaster.BDGP5.69.dna.toplevel.fa
#DATABASE       Lactobacillus /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Lactococcus_lactis/subsp_lactis_Il1403/genome/bwa/NC_002662.fasta
DATABASE        Xanthomonas     /save/ng6/TODO/HiSeqIndexedGenomes/new_struct/Xanthomonas_campestris/genome/bwa/NC007086.1.fasta

I don't know if it can fit the rest of the pipeline, but this is the idea that came up in my mind when you asked !

@AlexanderDilthey

This comment has been minimized.

Copy link
Member

@AlexanderDilthey AlexanderDilthey commented Aug 13, 2018

@RxLoutre

This comment has been minimized.

Copy link
Author

@RxLoutre RxLoutre commented Aug 14, 2018

Well, I'm hyped about it, I think it could work :)

@AlexanderDilthey

This comment has been minimized.

Copy link
Member

@AlexanderDilthey AlexanderDilthey commented Aug 16, 2018

OK! I'll implement this and get back to you! Give me a few days :-)

@AlexanderDilthey

This comment has been minimized.

Copy link
Member

@AlexanderDilthey AlexanderDilthey commented Aug 20, 2018

OK - I've got something, let me know what you think and whether this is useful to you!

I've added a new command combineAndAnnotateReferences.pl - basically the idea is that you specify input genomes and taxon IDs in a file, and this script will prepare everything for you to pass on to buildDB.pl.

Here is an example command line:

perl combineAndAnnotateReferences.pl --inputFileList tests/fileList --outputFile tests/combinedHumanMouse.fa --taxonomyInDirectory downloads/taxonomy --taxonomyOutDirectory tests/combinedHumanMouse_NCBI_taxonomy_modified

The file fileList could look like this (one line per reference genome - i.e. here we've got two complete human genomes and one mouse genome):

9606 tests/shortHuman1.fa
9606 tests/shortHuman2.fa
10090 tests/shortMouse.fa

... and the database construction process could be completed with:

perl buildDB.pl --DB databases/testHumanMouse --FASTAs tests/combinedHumanMouse.fa --taxonomy tests/combinedHumanMouse_NCBI_taxonomy_modified

@RxLoutre

This comment has been minimized.

Copy link
Author

@RxLoutre RxLoutre commented Aug 20, 2018

Hi Alexander,

Thanks for working hard and fast on that case !

So, if I understand well, the new script combineAndAnnotateReferences.pl is a patch so I can use only a file listing all the path to my genomes.fasta as the only input, right ? Which kinda replace the first step which was to download the database from RefSeq.

I'm still uncertain about the meaning of the taxon ID in here.. If I have a bunch of genomes, is the taxon ID just a number I can made up in order to identify my genome or is it related to refSeq somehow and I have to fetch a proper TaxonID ? I mean, this is probably an important information you use for assigning a read right ? But I have no idea on how to attribute such a ID to my genomes...

I'm also not sure to understand what is needed with --taxonomyInDirectory, which is an input to the new script. How can I have this information ? If I didn't downloaded the genome from RefSeq ? Because so far, I don't even have a taxonomy directory. I saw one on an example run, it contains 4 .dmp files. delnodes, merged, names and nodes. And to be honest I have no idea of what informations they contains and what it means.

Sorry for all thoses question, I'm still very new to the field !

Thanks for your time,

Rox

@AlexanderDilthey

This comment has been minimized.

Copy link
Member

@AlexanderDilthey AlexanderDilthey commented Aug 21, 2018

Hi Roxane,

These are excellent questions!

First and most importantly, you're right in that this script replaces the RefSeq download.

Now, for the taxonomy: MetaMaps needs this information to understand how the input genomes are related to each other. It is contained in the 4 files you mentioned. The format of these files is identical to the format of the NCBI taxonomy database. By far the easiest method is therefore to stay within the NCBI taxonomy - almost any known species should have a unique ID in the NCBI system (https://www.ncbi.nlm.nih.gov/taxonomy), and typically this ID is part of the assembly metadata or can be obtained by querying the database for the name of your species. You can get a complete download of the most recent version of the NCBI taxonomy from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz, and the directory contained therein can be passed as input to --taxonomyInDirectory. It is instructive to open the files contained in there in a text editor, to get an idea of what they look like and what information they contain.

So, the easiest way forward is to identify the NCBI taxonomy ID for each of your input genomes and pass this information to MetaMaps.

In theory you could also make up your own taxonomy, store it in NCBI format and use it with MetaMaps, but unless you know exactly what you're doing I would heavily advise against that.

Please let me know if you have further questions!

@RxLoutre

This comment has been minimized.

Copy link
Author

@RxLoutre RxLoutre commented Aug 22, 2018

Hi Alexander,

Thanks a lot for your answers ! I'm kind of a newbie in the taxonomy field. I downloaded ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz and untared the file in a taxonomy directory within metamaps. Even if it contains more informations than it should (I mean by that more genomes than I'm going to use), is it still okay to give this as a --taxonomyInDirectory ?

The next step is to create the input file with my genomes and their corresponding NCBI taxonomy ID, and then finally try to build the database.

I have one more question : I'm working with a cluster, which means I do not install tools myself and I do not have the right to update MetaMaps installation. I knew that I'll probably have to update the tool at least once, as MetaMaps is very recent. But I'd like to make it easy for the people working at the datacenter and cluster facility. How can they update MetaMaps easily ? Is it something than can be done with git for example ?

Thanks again for your feedback !

Cheers,

Roxane

@RxLoutre

This comment has been minimized.

Copy link
Author

@RxLoutre RxLoutre commented Aug 22, 2018

Oh and one other simple question : The fileList use as a seperator a tabulation or a simple space ?
ANd again an other one : can we comment this file with line starting with a # or not ?

@AlexanderDilthey

This comment has been minimized.

Copy link
Member

@AlexanderDilthey AlexanderDilthey commented Aug 22, 2018

Hi Roxane,

for the file list, any whitespace character (space or tab) is fine. Currently no comments are accepted. This could be easily changed, though.

Even if it contains more informations than it should (I mean by that more genomes than I'm going to use), is it still okay to give this as a --taxonomyInDirectory ?

Absolutely yes! You can always use the complete NCBI taxonomy!

Regarding the cluster question: you're right that the source code is still evolving rapidly and that future changes are likely. Currently the best way to update MetaMaps is to do a git pull followed be a rebuild (make clean; make metamaps). At some point in the future we'll switch to a more formal release schedule.

Best wishes,

Alex

@RxLoutre

This comment has been minimized.

Copy link
Author

@RxLoutre RxLoutre commented Aug 23, 2018

Great, thanks a lot for all your answers, have a great day ! And good luck with MetaMaps : )

Rox

@AlexanderDilthey

This comment has been minimized.

Copy link
Member

@AlexanderDilthey AlexanderDilthey commented Aug 23, 2018

Thank you! Please let me know if you run into problems or if you have additional feature requests! :-)

Alex

@ms-gx

This comment has been minimized.

Copy link

@ms-gx ms-gx commented Nov 29, 2019

To generate my own DB: Do I ideally provide species_taxid or taxid (as for example given in assembly_summary_genbank.txt). I guess taxid, because MetaMaps goes down to strain level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.