Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate alllele id's inserted into the database #168

Closed
jrober84 opened this issue Mar 21, 2023 · 2 comments
Closed

Duplicate alllele id's inserted into the database #168

jrober84 opened this issue Mar 21, 2023 · 2 comments
Labels
Status: Completed Issue has been resolved.

Comments

@jrober84
Copy link

jrober84 commented Mar 21, 2023

I downloaded the Listeria Monocytogenes scheme from ChewieNS and used it successfully to allele call public data ~52K genomes split into batches of 5K. All of these allele calling jobs finished and produced allele call results successfully. However, I went to call new genomes using the same scheme folder as the others and ran into an error referring to duplicate keys for locus Pasteur_cgMLST-00025341_*164. I looked at the fasta file for that locus and there are multiple duplicate allele names all of which have an asterix at the front of the number. I checked other files and it seems to be the same issue. I run my jobs in an HPC and allele calling can happen concurrently. So I am wondering is that is what is happening here that each time the program runs it grabs the last "valid" integer allele code and increments from there. From what I have seen that is what looks to be happening as there are blocks of repeated identifiers and all start from the next available integer. Is there a way to process the allele database to remove the asterix naming and have them assigned a valid id within chewie? Or alternatively, is there a way of not having these interim named sequences into the database?

Thanks!

chewBBACA version: 3.1.0
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==========================
chewBBACA - AlleleCall

Started at: 2023-03-21T10:53:13

Minimum sequence length: 144
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file:
Listeria/Listeria_cgmlst/lmonocytogenes_Pasteur_cgMLST/Listeria_monocytogenes.trn
CPU cores: 32
BLAST path: conda/envs/chewbbaca/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 3759
Number of loci: 1748

== CDS prediction ==

Predicting CDS for 3759 inputs...
[====================] 100%

== CDS extraction ==

Extracting predicted CDS for 3759 inputs...
[====================] 100%
Extracted a total of 10904979 CDS from 3759 inputs.

== CDS deduplication ==

Identifying distinct CDS...identified 344785 distinct CDS.

== CDS exact matches ==

Searching for DNA exact matches...found 6554947 exact matches (matching 129058 distinct alleles).
Unclassified CDS: 228846

== CDS translation ==

Translating 228846 CDS...
[====================] 100%
Identified 55640 CDS that could not be translated.
Information about untranslatable and small sequences stored in allele_calls2/temp/invalid_cds.txt
Unclassified CDS: 173206

== Protein deduplication ==

Identifying distinct proteins...identified 120104 distinct proteins.

== Protein exact matches ==

Searching for Protein exact matches...found 37663 exact matches (80662 distinct CDS, 110937 total CDS).
Unclassified proteins: 109913

== Clustering ==

Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 226722 distinct minimizers for 1748 loci.
Clustering proteins...
[====================] 100%
Clustered 109913 proteins into 3321 clusters.
Traceback (most recent call last):
File ".conda/envs/chewbbaca/bin/chewBBACA.py", line 10, in
sys.exit(main())
^^^^^^
File "/conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
functions_info[process]1
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
func(*args, **kwargs)
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
AlleleCall.main(genome_list, loci_list, args.schema_directory,
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2671, in main
results = allele_calling(input_files, schema_directory, temp_directory,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2258, in allele_calling
prot_index = fao.index_fasta(all_prots)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/fasta_operations.py", line 60, in index_fasta
fasta_index = SeqIO.index(fasta_file, 'fasta')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/Bio/SeqIO/init.py", line 872, in index
return IndexedSeqFileDict(
^^^^^^^^^^^^^^^^^^^^
File "conda/envs/chewbbaca/lib/python3.11/site-packages/Bio/File.py", line 203, in init
raise ValueError(f"Duplicate key '{key}'")
ValueError: Duplicate key 'Pasteur_cgMLST-00025341
*164'

@ramirma
Copy link
Member

ramirma commented Mar 22, 2023

Dear @jrober84 , you have guessed correctly and having used an HPC with concurrent jobs is what led to the problem. chewBBACA's design does not allow for such distributed processing. Asterisk denote novel alleles. Running processes in parallel meant that a process was started while another was running. If both find the same "new" allele, then two different IDs may be attributed to the same sequence, or the same ID may be attributed to two different sequences (representing distinct novel alleles), generating the problem you describe. Although I image that a strategy could be devised to sort through the files generated, we have not implemented anything for that purpose.
I see that you are using chewBBACA 3.1.0. I would advise to update to 3.1.2 because it corrects a few issues (including some to handle unexpected strings in sequence names) that possibly have no impact for you but that would give you greater confidence when analyzing such large number of isolates.
The other comment is that chewBBACA 3 is quite fast, so it should not be too much of a hassle and it would not take an undue amount of time to run the 5K batches sequentially. This would obviate the problem you describe and give you clean results. This would be my preferred option since it would identify novel alleles, update the local database and it would give you the option of submitting the new alleles to chewie-NS. The other option is that you may run chewBBACA without identifying or storing novel alleles. You can do this by exploring the --no-inferred or the --mode flags (more info in chewBBACA's readthedocs.
I hope to have clarified this issue but do not hesitate to reach out in case we can be of further help.

@jrober84
Copy link
Author

Thanks so much, @ramirma; that definitely clarifies things. I was actually performing this as a test for a few purposes, which included simulating multiple individuals using a pipeline I am developing concurrently. I think it is safe to close this issue. But I would be most interested in any developments that make it work within a multi-user environment. But I definitely know that is no small task, and it is a bit of an edge case.

@rfm-targa rfm-targa added the Status: Completed Issue has been resolved. label Mar 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Completed Issue has been resolved.
Projects
None yet
Development

No branches or pull requests

3 participants