New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Duplicate alllele id's inserted into the database #168
Comments
Dear @jrober84 , you have guessed correctly and having used an HPC with concurrent jobs is what led to the problem. chewBBACA's design does not allow for such distributed processing. Asterisk denote novel alleles. Running processes in parallel meant that a process was started while another was running. If both find the same "new" allele, then two different IDs may be attributed to the same sequence, or the same ID may be attributed to two different sequences (representing distinct novel alleles), generating the problem you describe. Although I image that a strategy could be devised to sort through the files generated, we have not implemented anything for that purpose. |
Thanks so much, @ramirma; that definitely clarifies things. I was actually performing this as a test for a few purposes, which included simulating multiple individuals using a pipeline I am developing concurrently. I think it is safe to close this issue. But I would be most interested in any developments that make it work within a multi-user environment. But I definitely know that is no small task, and it is a bit of an edge case. |
I downloaded the Listeria Monocytogenes scheme from ChewieNS and used it successfully to allele call public data ~52K genomes split into batches of 5K. All of these allele calling jobs finished and produced allele call results successfully. However, I went to call new genomes using the same scheme folder as the others and ran into an error referring to duplicate keys for locus Pasteur_cgMLST-00025341_*164. I looked at the fasta file for that locus and there are multiple duplicate allele names all of which have an asterix at the front of the number. I checked other files and it seems to be the same issue. I run my jobs in an HPC and allele calling can happen concurrently. So I am wondering is that is what is happening here that each time the program runs it grabs the last "valid" integer allele code and increments from there. From what I have seen that is what looks to be happening as there are blocks of repeated identifiers and all start from the next available integer. Is there a way to process the allele database to remove the asterix naming and have them assigned a valid id within chewie? Or alternatively, is there a way of not having these interim named sequences into the database?
Thanks!
chewBBACA version: 3.1.0
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt
==========================
chewBBACA - AlleleCall
Started at: 2023-03-21T10:53:13
Minimum sequence length: 144
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file:
Listeria/Listeria_cgmlst/lmonocytogenes_Pasteur_cgMLST/Listeria_monocytogenes.trn
CPU cores: 32
BLAST path: conda/envs/chewbbaca/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 3759
Number of loci: 1748
== CDS prediction ==
Predicting CDS for 3759 inputs...
[====================] 100%
== CDS extraction ==
Extracting predicted CDS for 3759 inputs...
[====================] 100%
Extracted a total of 10904979 CDS from 3759 inputs.
== CDS deduplication ==
Identifying distinct CDS...identified 344785 distinct CDS.
== CDS exact matches ==
Searching for DNA exact matches...found 6554947 exact matches (matching 129058 distinct alleles).
Unclassified CDS: 228846
== CDS translation ==
Translating 228846 CDS...
[====================] 100%
Identified 55640 CDS that could not be translated.
Information about untranslatable and small sequences stored in allele_calls2/temp/invalid_cds.txt
Unclassified CDS: 173206
== Protein deduplication ==
Identifying distinct proteins...identified 120104 distinct proteins.
== Protein exact matches ==
Searching for Protein exact matches...found 37663 exact matches (80662 distinct CDS, 110937 total CDS).
Unclassified proteins: 109913
== Clustering ==
Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 226722 distinct minimizers for 1748 loci.
Clustering proteins...
[====================] 100%
Clustered 109913 proteins into 3321 clusters.
Traceback (most recent call last):
File ".conda/envs/chewbbaca/bin/chewBBACA.py", line 10, in
sys.exit(main())
^^^^^^
File "/conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
functions_info[process]1
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
func(*args, **kwargs)
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
AlleleCall.main(genome_list, loci_list, args.schema_directory,
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2671, in main
results = allele_calling(input_files, schema_directory, temp_directory,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2258, in allele_calling
prot_index = fao.index_fasta(all_prots)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/fasta_operations.py", line 60, in index_fasta
fasta_index = SeqIO.index(fasta_file, 'fasta')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/Bio/SeqIO/init.py", line 872, in index
return IndexedSeqFileDict(
^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/Bio/File.py", line 203, in init
raise ValueError(f"Duplicate key '{key}'")
ValueError: Duplicate key 'Pasteur_cgMLST-00025341*164'
The text was updated successfully, but these errors were encountered: