Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indexing error: there is no k-mer that belongs to a list of size > 128 and <= 256 #126

Closed
rpolicastro opened this issue Jan 30, 2024 · 9 comments

Comments

@rpolicastro
Copy link

Hi!

I was trying out the new version of simpleaf (0.16.1) and piscem (0.7.0) and I ran into an error during indexing.

Installing everything through conda/mamba.

mamba create -n simpleaf_reprex -c conda-forge -c bioconda -y simpleaf==0.16.1 piscem==0.7.0
mamba activate simpleaf_reprex

Downloading the mouse genome from ENSEMBL.

# Download and unzip the genome assembly.
curl -O 'https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_rm.primary_assembly.fa.gz'
gunzip 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa.gz'

# Download the genome annotation.
curl -O 'https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.chr.gtf.gz'

Running simpleaf index.

export ALEVIN_FRY_HOME='.'
simpleaf set-paths

simpleaf index \
    --output index \
    --overwrite \
    --threads 12 \
    --ref-type 'spliced+intronic' \
    --fasta 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa' \
    --gtf 'Mus_musculus.GRCm39.111.chr.gtf.gz' \
    --rlen 100 \
    --decoy-paths 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa'

The error.

[2024-01-30 15:17:45.643] [info] === step 3: 'build_index' 81.135411 [sec] (114.02275084064424 [ns/kmer])
[2024-01-30 15:17:45.643] [info] max_num_super_kmers_in_bucket 273
[2024-01-30 15:17:45.643] [info] log2_max_num_super_kmers_in_bucket 9
[2024-01-30 15:17:46.285] [info] num_buckets_in_skew_index 55 / 124009136 (4.435157100038178e-05%)
[2024-01-30 15:17:46.880] [info] num_partitions 3
[2024-01-30 15:17:46.880] [info] computing partitions...
[2024-01-30 15:17:46.880] [info] num_kmers belonging to buckets of size > 64 and <= 128: 15989
[2024-01-30 15:17:46.880] [critical] ==> Empty bucket detected:
there is no k-mer that belongs to a list of size > 128 and <= 256
fatal runtime error: Rust cannot catch foreign exceptions
====
Error: piscem index failed to build succesfully ExitStatus(unix_wait_status(134))
@rob-p
Copy link
Contributor

rob-p commented Jan 30, 2024

ohhh — thanks for reporting this @rpolicastro! This seems like an error in the underlying construction of sshash. I'll ping @jermp here as he will likely know where this is coming from and what the fix might be upstream (once we know the fix, it will be pretty easy to propagate new versions of piscem; simpleaf itself shouldn't require an update). I believe it's related to the skew data structure in sshash, and particularly to one of the skew buckets being empty! One really simple idea is, what if you try with a smaller minimizer size — like -m 17 or 15?

@jermp
Copy link

jermp commented Jan 30, 2024

Hi all,
yes, as Rob suggested, the error is due to SSHash detecting an empty bucket in the skew index.
You don't need to understand this detail; just change the minimizer length or try a different seed for the construction.
@rob-p, does piscem accept a seed argument and pass it to SSHash?

Best,
-Giulio

@rob-p
Copy link
Contributor

rob-p commented Jan 30, 2024

@jermp,

No, but it would be trivial to expose (the underlying piscem-cpp includes it). So I can add that and in the meantime @rpolicastro can try a different minimizer size. However, I wonder if there is a way to make the skew index itself robust to possibly empty buckets.

--Rob

@jermp
Copy link

jermp commented Jan 30, 2024

Yeah, good point. I know it is annoying that this might happen.
In principle, it is possibile to avoid -- of course -- but needs a "search" that is slower. I'll think about it.
Perhaps, it is of no harm since there are very few bucket thresholds.

@jermp
Copy link

jermp commented Jan 30, 2024

Well, just to keep my on track: jermp/sshash#42.
:)

@rpolicastro
Copy link
Author

Thank you both for the quick response. -m 17 did indeed fix the issue. I'm glad it was easy to diagnose!

I assume if I move forward with this setting for now it won't have any adverse effects on the downstream analysis?

Cheers,
Bob

@rob-p
Copy link
Contributor

rob-p commented Jan 30, 2024

@rpolicastro,

Nope! In fact, the results should not be any different. The only distinction is that mapping may become a few % slower (may not even be noticeable). I am adding the --seed flag to piscem upstream, and then I will propagate it to simpleaf as well; just a flag you can pass to change the seed for SSHash construction if you don't want to change the minimizer length.

--Rob

@rpolicastro
Copy link
Author

rpolicastro commented Feb 1, 2024

I didn't check with the above reprex but I just wanted to confirm with the new simpleaf version (0.16.2) that setting the minimizer length back to default -m 19 and setting some arbitrary seed --seed 100 fixed the error in my original workflow :)

@rob-p
Copy link
Contributor

rob-p commented Feb 1, 2024

w00t! I was just going to come here to ask if you could do this if it wasn't too much trouble, but you already did it! Glad to hear that just propagating a different seed fixes it :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants