centrifuge-build hanging up #199

Open
GastonViarengo opened this issue Sep 13, 2020 · 19 comments

@GastonViarengo

Hello everyone. I've recently started using Centrifuge, and I've been able to create a viral index and use it with my metagenomic data. However, when I try to build a bacterial index (bac), the process hangs (at least that's the only explanation I've found so far). I'm using the following command:

centrifuge-build -p 8 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp inputs/seq_bac.fna indices/bac

The files bac.1.cf, bac.2.cf, and bac.3.cf are created within a few minutes after the job begins, but bac.2.cf is 0 KB in size. The output shows:

Settings:
Output files: "indices/bac.*.cf"
Line rate: 7 (line is 128 bytes)
Lines per side: 1 (side is 128 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
inputs/seq_bac.fna
Reading reference sizes
Warning: Encountered reference sequence with only gaps
Time reading reference sizes: 00:07:04
Calculating joined length
Writing header
Reserving space for joined string
Could not allocate space for a joined string of 67127059294 elements.
Switching to a packed string representation.
Reading reference sizes
Warning: Encountered reference sequence with only gaps
Time reading reference sizes: 00:07:04
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:07:05
Warning: taxomony id doesn't exists for NC_017270.1! (repeated several times for different IDs)
Warning: Taxonomy ID 90270 is not in the provided taxonomy tree (taxonomy/nodes.dmp)! (repeated several times for different IDs)

Even after leaving it running for a few days, the bac.*.cf files show no modifications and the output is frozen (I believe the process has hung).

I've tried removing the erroneous IDs, but the process still hangs.
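
The two warnings in the log point at sequence IDs with no entry in the conversion table and at taxonomy IDs missing from nodes.dmp. A minimal Python sketch for listing both before starting a build could look like the following. It assumes that the sequence ID is the first whitespace-delimited token of each FASTA header and that the map file holds one sequence ID and one taxid per line; the function names are illustrative and are not part of Centrifuge. Cleaning up these IDs may silence the warnings without resolving the hang itself.

```python
def read_fasta_ids(fasta_path):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    yield tokens[0]


def read_conversion_table(map_path):
    """Return {sequence ID: taxid} from a two-column map file (assumed layout)."""
    table = {}
    with open(map_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                table[fields[0]] = fields[1]
    return table


def read_taxonomy_ids(nodes_path):
    """Return the set of taxids found in the first '|'-delimited column of nodes.dmp."""
    with open(nodes_path) as handle:
        return {line.split("|", 1)[0].strip() for line in handle if line.strip()}


def preflight(fasta_path, map_path, nodes_path):
    """Return (sequence IDs with no taxid, taxids absent from nodes.dmp), both sorted."""
    table = read_conversion_table(map_path)
    known_taxids = read_taxonomy_ids(nodes_path)
    unmapped = sorted({seq_id for seq_id in read_fasta_ids(fasta_path)
                       if seq_id not in table})
    missing = sorted({taxid for taxid in table.values()
                      if taxid not in known_taxids})
    return unmapped, missing
```

With the paths from the command above, this would be called as `preflight("inputs/seq_bac.fna", "seqid2taxid.map", "taxonomy/nodes.dmp")`.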

Could you help me understand what's going on in order to solve this?

Thank you so much!

Best regards

Prof. Dr. Gastón Viarengo
Institute of Molecular and Cellular Biology of Rosario (IBR-CONICET)
Human Virology Lab

@mourisl
Collaborator

mourisl commented Oct 22, 2020

Sorry for the delayed reply, which version of Centrifuge did you use? Thank you.

@GastonViarengo
Author

Sorry for the delayed reply, which version of Centrifuge did you use? Thank you.

Hello Li Song, no problem, thanks for your response. I'm using version 1.0.4-beta. Could you help me find the problem? Thank you.

@mourisl
Collaborator

mourisl commented Oct 23, 2020

I just checked the log and realized that I fixed this bug after the release of 1.0.4-beta. Can you try git clone to get the most recent version of Centrifuge? Thank you.

@GastonViarengo
Author

Thanks Li Song, I'll try that and let you know how it goes. What was the bug? Best, Gastón.

@fanninpm

fanninpm commented May 3, 2021

I also ran into this (or a similar issue) while I was using the provided Makefile to make an nt database. Compiling 65c42fc from source did not change anything.

@choede

choede commented Dec 14, 2021

Hi, I have a similar issue with nt. I'm using version 1.0.4. I modified the map file so that it starts with: accession.version taxid
A00001.1 10641
A00002.1 9913
A00003.1 9913
A00004.1 32630
A00005.1 32630
and launched
centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.2map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt

After one hour, the process does not write anything else. nt.1.cf and nt.3.cf are not empty but nt.2.cf is empty. I only have warnings in the output logs. The process uses only one CPU. Moreover, the nt indexes available on the Centrifuge website are not up to date (they are from 2018). Could you help me, please?
Thanks a lot in advance

@Jolvii85

Jolvii85 commented Jan 1, 2022

Hi all, I have the same error with nt. Has anyone fixed it?

@savytskanatalia

Hi all, I have a similar problem with a custom database. Did anyone figure it out?

@Jolvii85

Jolvii85 commented May 24, 2022 via email

@wittler-github

wittler-github commented Oct 20, 2022

For me, the warning "Warning: taxomony id doesn't exists for NC_0####.1!" (repeated several times for different IDs) had a simple cause: when I concatenated several seqid2taxid.map files, a newline was sporadically missing at the junction between two files, which made centrifuge-build miss all the NCBI taxid entries after that point.
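
A minimal Python sketch of a concatenation that guards against this, assuming each map line is meant to hold exactly two whitespace-separated fields (sequence ID and taxid). The helper names are hypothetical and not part of Centrifuge. The first function appends a newline whenever an input file lacks a final one; the second lists lines whose field count is off, which is how a fused junction line shows up.

```python
import os
import shutil


def concatenate_maps(input_paths, output_path):
    """Concatenate map files, adding a newline after any input that lacks one.

    Returns the number of inputs that needed a newline appended.
    """
    repaired = 0
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, so large maps are fine
                if src.tell() > 0:            # skip empty inputs
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
                        repaired += 1
    return repaired


def malformed_lines(map_path, limit=10):
    """Return up to `limit` (line number, text) pairs without exactly two fields."""
    bad = []
    with open(map_path) as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip() and len(line.split()) != 2:
                bad.append((number, line.rstrip("\n")))
                if len(bad) >= limit:
                    break
    return bad
```

Running `malformed_lines` on an already concatenated map is a quick way to tell whether this particular problem is present before rebuilding anything.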

@ramnageena11

Is there any solution, if anyone has found one?
I have been stuck in this situation for the last 20 days.

Thanks
Ram

@ramnageena11

Hello, any suggestions?

@ramnageena11

Hi
It seems I need to change my strategy for analyzing my data. Any suggestions other than Centrifuge? I am using long-read data from ONT; will "Kraken2" work for taxonomic analysis?

Please suggest.
Thanks
RNS

@ramnageena11

hi

@sarah-buddle

Hi, have there been any updates on this issue? I am encountering the same thing.

@mourisl
Collaborator

mourisl commented Sep 12, 2023

How much memory do you have on your server and which database are you building? Thank you.

@sarah-buddle

I am trying to build a custom database based on bacteria, viral, fungi and protozoa downloaded from RefSeq. I'm running centrifuge v1.0.4, and have tried with the conda installation and installed from source. The total size of my fasta file is 148GB. On my last attempt to build, I tried with 80GB of memory and 8 cores. I didn't get any error messages about running out of memory, I just got warnings e.g. "Warning: taxonomy id doesn't exists for NCxxx" as above, and the output file refseq.4.cf was empty. I have access to more memory though, so I could try with that. The command I used to build was:
centrifuge-build --conversion-table ${db}/seqid2taxid.map --taxonomy-tree ${software}/taxdump/new_taxdump_2023-08-01/nodes.dmp --name-table ${software}/taxdump/new_taxdump_2023-08-01/names.dmp ${db}/refseq_all_genomic.fasta refseq -p 8

@mourisl
Collaborator

mourisl commented Sep 12, 2023

With 148 GB of sequence, I think you may need about 600 GB of memory to build the index. You can increase the --dcv and --bmax values to reduce the memory usage, but the build may take longer.
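
As a rough illustration of that estimate, the Python sketch below scales the ratio implied by the figures in this comment (about 600 GB of memory for 148 GB of sequence) to other input sizes. The ratio is a rule of thumb read off that single data point, not a documented Centrifuge formula, and the function names are made up for this example.

```python
# Ratio implied by "about 600 GB for 148 GB of sequence" (assumption, roughly 4x).
MEMORY_PER_SEQUENCE_GB = 600 / 148


def estimate_build_memory_gb(fasta_size_gb, ratio=MEMORY_PER_SEQUENCE_GB):
    """Return a rough memory estimate in GB for building an index."""
    return fasta_size_gb * ratio


def fits_in_memory(fasta_size_gb, available_gb, ratio=MEMORY_PER_SEQUENCE_GB):
    """Return True if the rough estimate fits within the available memory."""
    return estimate_build_memory_gb(fasta_size_gb, ratio) <= available_gb
```

By this estimate, `fits_in_memory(148, 80)` is false, which is consistent with the 80 GB attempt described above not completing.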

@sarah-buddle

OK thank you, I will try that!
