The second file of customer db is empty #264

Closed
LilyAnderssonLee opened this issue Nov 7, 2023 · 4 comments

Comments

@LilyAnderssonLee

Hi, I am in the process of building a database from RefSeq data covering bacteria, viruses, archaea, fungi, parasites, protozoa, plasmids, and even contaminants. The input data is quite large, around 1.3TB in size.

However, I've run into an issue where the second index file, db.2.cf, always turns out empty. Has anyone else had this problem? Here is the script I've been using:

#!/bin/bash
#SBATCH -A xx
#SBATCH -p core
#SBATCH -n 50
#SBATCH -t 10-00:00:00
#SBATCH -J centrifuge_db
#SBATCH --mem=400GB
# Build the index with 50 threads; the trailing backslashes keep this one command.
centrifuge-build -p 50 --bmax 3342177280 --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
    input-sequences.fna db
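
A quick check after the build can flag an empty index file such as db.2.cf before the database is used. The sketch below is not part of Centrifuge: `check_index` is a hypothetical helper, and it assumes only that the index files are named `PREFIX.N.cf`, as in the script above.

```shell
# Hypothetical helper (not a Centrifuge command): report index files that are
# missing or zero bytes. Usage: check_index PREFIX   (e.g. check_index db)
check_index() {
    prefix=$1
    status=0
    for f in "$prefix".*.cf; do
        # With default shell settings an unmatched glob stays literal,
        # so the -s test also fails when no index file exists at all.
        if [ ! -s "$f" ]; then
            echo "empty or missing: $f"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_index db` at the end of the job script makes the job's exit status reflect an incomplete index instead of leaving it to be found later.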

@mourisl
Collaborator

mourisl commented Nov 7, 2023

I think for 1.3TB of sequences, you may need about 3TB of memory to build the index...
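
To compare that figure with the job request, here is a rough sketch that scales the quoted ratio (about 3TB of memory for 1.3TB of sequence) to an input size in GB. Scaling linearly from a single data point is an assumption, not a documented Centrifuge rule, and `estimate_mem_gb` is a hypothetical helper.

```shell
# Hypothetical helper: rough memory estimate in GB, scaled linearly from the
# figure quoted above (about 3 TB of RAM for 1.3 TB of sequence).
# The linear scaling is an assumption, not a documented rule.
estimate_mem_gb() {
    input_gb=$1
    echo $(( input_gb * 30 / 13 ))
}

input_gb=1300       # about 1.3 TB of input sequence
requested_gb=400    # the --mem value from the job script
needed_gb=$(estimate_mem_gb "$input_gb")
if [ "$needed_gb" -gt "$requested_gb" ]; then
    echo "estimated ${needed_gb} GB needed, only ${requested_gb} GB requested"
fi
```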

@LilyAnderssonLee
Author

@mourisl Thanks for your response. Unfortunately, I don't have that much memory available. I suppose I'll need to reduce the data size, perhaps by keeping only the representative genome for each species.
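
One possible way to do that reduction is to filter NCBI's assembly summary table by its RefSeq category column. This is a sketch, not the method used in this thread. It assumes the input follows the `assembly_summary.txt` layout, where column 5 is `refseq_category` and column 20 is `ftp_path`. It accepts both "reference genome" and "representative genome", and `select_representatives` is a hypothetical helper name.

```shell
# Hypothetical helper: print the FTP path (column 20) of every assembly whose
# refseq_category (column 5) is "reference genome" or "representative genome".
# Reads an assembly_summary.txt-style TSV on stdin; lines starting with '#'
# are treated as headers and skipped.
select_representatives() {
    awk -F '\t' '
        /^#/ { next }
        $5 == "reference genome" || $5 == "representative genome" { print $20 }
    '
}
```

For example, `select_representatives < assembly_summary.txt > representative_paths.txt` would write one path per selected assembly, which can then drive the download step.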

@LilyAnderssonLee
Author

LilyAnderssonLee commented Nov 22, 2023

@mourisl What k-mer length is used for genome compression in the Centrifuge h+p+v+c database, or what is the default k-mer length in database construction?

Are you planning to update the Centrifuge databases or create Centrifuge databases based on all RefSeq genomes?

@mourisl
Collaborator

mourisl commented Nov 22, 2023

Centrifuge itself does not use k-mers. The compression step uses 31-mers, but only to cluster the more similar strains within a species, so the k-mer information is not directly used in the compression either.

The recent RefSeq prokaryotic genome set is too large: the index is above 80GB, which exceeds Zenodo's limit...
