Duplicate alllele id's inserted into the database #168

jrober84 · 2023-03-21T16:22:44Z

I downloaded the Listeria Monocytogenes scheme from ChewieNS and used it successfully to allele call public data ~52K genomes split into batches of 5K. All of these allele calling jobs finished and produced allele call results successfully. However, I went to call new genomes using the same scheme folder as the others and ran into an error referring to duplicate keys for locus Pasteur_cgMLST-00025341_*164. I looked at the fasta file for that locus and there are multiple duplicate allele names all of which have an asterix at the front of the number. I checked other files and it seems to be the same issue. I run my jobs in an HPC and allele calling can happen concurrently. So I am wondering is that is what is happening here that each time the program runs it grabs the last "valid" integer allele code and increments from there. From what I have seen that is what looks to be happening as there are blocks of repeated identifiers and all start from the next available integer. Is there a way to process the allele database to remove the asterix naming and have them assigned a valid id within chewie? Or alternatively, is there a way of not having these interim named sequences into the database?

Thanks!

chewBBACA version: 3.1.0
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==========================
chewBBACA - AlleleCall

Started at: 2023-03-21T10:53:13

Minimum sequence length: 144
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file:
Listeria/Listeria_cgmlst/lmonocytogenes_Pasteur_cgMLST/Listeria_monocytogenes.trn
CPU cores: 32
BLAST path: conda/envs/chewbbaca/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 3759
Number of loci: 1748

== CDS prediction ==

Predicting CDS for 3759 inputs...
[====================] 100%

== CDS extraction ==

Extracting predicted CDS for 3759 inputs...
[====================] 100%
Extracted a total of 10904979 CDS from 3759 inputs.

== CDS deduplication ==

Identifying distinct CDS...identified 344785 distinct CDS.

== CDS exact matches ==

Searching for DNA exact matches...found 6554947 exact matches (matching 129058 distinct alleles).
Unclassified CDS: 228846

== CDS translation ==

Translating 228846 CDS...
[====================] 100%
Identified 55640 CDS that could not be translated.
Information about untranslatable and small sequences stored in allele_calls2/temp/invalid_cds.txt
Unclassified CDS: 173206

== Protein deduplication ==

Identifying distinct proteins...identified 120104 distinct proteins.

== Protein exact matches ==

Searching for Protein exact matches...found 37663 exact matches (80662 distinct CDS, 110937 total CDS).
Unclassified proteins: 109913

== Clustering ==

Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 226722 distinct minimizers for 1748 loci.
Clustering proteins...
[====================] 100%
Clustered 109913 proteins into 3321 clusters.
Traceback (most recent call last):
File ".conda/envs/chewbbaca/bin/chewBBACA.py", line 10, in
sys.exit(main())
^^^^^^
File "/conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
functions_info[process]1
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
func(*args, **kwargs)
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
AlleleCall.main(genome_list, loci_list, args.schema_directory,
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2671, in main
results = allele_calling(input_files, schema_directory, temp_directory,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2258, in allele_calling
prot_index = fao.index_fasta(all_prots)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/fasta_operations.py", line 60, in index_fasta
fasta_index = SeqIO.index(fasta_file, 'fasta')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/Bio/SeqIO/init.py", line 872, in index
return IndexedSeqFileDict(
^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/Bio/File.py", line 203, in init
raise ValueError(f"Duplicate key '{key}'")
ValueError: Duplicate key 'Pasteur_cgMLST-00025341*164'

ramirma · 2023-03-22T08:25:13Z

Dear @jrober84 , you have guessed correctly and having used an HPC with concurrent jobs is what led to the problem. chewBBACA's design does not allow for such distributed processing. Asterisk denote novel alleles. Running processes in parallel meant that a process was started while another was running. If both find the same "new" allele, then two different IDs may be attributed to the same sequence, or the same ID may be attributed to two different sequences (representing distinct novel alleles), generating the problem you describe. Although I image that a strategy could be devised to sort through the files generated, we have not implemented anything for that purpose.
I see that you are using chewBBACA 3.1.0. I would advise to update to 3.1.2 because it corrects a few issues (including some to handle unexpected strings in sequence names) that possibly have no impact for you but that would give you greater confidence when analyzing such large number of isolates.
The other comment is that chewBBACA 3 is quite fast, so it should not be too much of a hassle and it would not take an undue amount of time to run the 5K batches sequentially. This would obviate the problem you describe and give you clean results. This would be my preferred option since it would identify novel alleles, update the local database and it would give you the option of submitting the new alleles to chewie-NS. The other option is that you may run chewBBACA without identifying or storing novel alleles. You can do this by exploring the --no-inferred or the --mode flags (more info in chewBBACA's readthedocs.
I hope to have clarified this issue but do not hesitate to reach out in case we can be of further help.

jrober84 · 2023-03-22T13:27:35Z

Thanks so much, @ramirma; that definitely clarifies things. I was actually performing this as a test for a few purposes, which included simulating multiple individuals using a pipeline I am developing concurrently. I think it is safe to close this issue. But I would be most interested in any developments that make it work within a multi-user environment. But I definitely know that is no small task, and it is a bit of an edge case.

jrober84 closed this as completed Mar 22, 2023

rfm-targa added the Status: Completed Issue has been resolved. label Mar 23, 2023

ramirma mentioned this issue Apr 2, 2023

Error: invalid start byte #173

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate alllele id's inserted into the database #168

Duplicate alllele id's inserted into the database #168

jrober84 commented Mar 21, 2023 •

edited

ramirma commented Mar 22, 2023

jrober84 commented Mar 22, 2023

Duplicate alllele id's inserted into the database #168

Duplicate alllele id's inserted into the database #168

Comments

jrober84 commented Mar 21, 2023 • edited

========================== chewBBACA - AlleleCall

ramirma commented Mar 22, 2023

jrober84 commented Mar 22, 2023

jrober84 commented Mar 21, 2023 •

edited

==========================
chewBBACA - AlleleCall