
5.4 horizontal or lateral transfer assessment of gene clusters using salt

Rauf Salamzade edited this page Jun 10, 2024 · 19 revisions

salt is an auxiliary program for assessing support for horizontal or lateral transfer of a gene cluster in each of the genomes where fai detected it. It takes as input a fai results directory and the prepTG database directory that was searched with fai.

salt reports multiple statistics to inform users about whether the gene cluster was horizontally transferred in a specific genomic context. These statistics are based on three independent analyses:

  1. Running codoff to assess codon usage discrepancies between the gene cluster instance and its respective background genomic context.
  2. Annotating viral and plasmid-associated proteins across full genomes to determine whether scaffolds might be viral or plasmid in origin, and annotating IS/transposon elements across full genomes to determine how far gene cluster instances are from such elements. For this step to be performed, users must have run prepTG with the option --mge_annotation/-ma.
  3. Selecting a reference genome and comparing the average amino acid identity (AAI) of universal ribosomal proteins and of gene cluster proteins between it and the other genomes that carry the gene cluster. This determines whether the gene cluster is more similar between genomes than would be expected based on ribosomal protein AAI, which would indicate potential HGT. The reference gene cluster instance/genome is selected based on the best match to the queries provided to fai.
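To make the first analysis more concrete, here is a minimal sketch of one simple way to quantify a codon usage discrepancy: the cosine distance between the codon counts of a gene cluster and those of its background genome. This is an illustration only. The function names and toy sequences are made up, and codoff's own statistic and its empirical P-value procedure may differ from what is shown here.

```python
from collections import Counter
from math import sqrt


def codon_counts(cds_sequences):
    """Count codons across in-frame coding sequences.

    Trailing partial codons and codons with ambiguous bases are skipped.
    """
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    return counts


def cosine_distance(counts_a, counts_b):
    """Return 1 - cosine similarity between two codon count profiles."""
    codons = sorted(set(counts_a) | set(counts_b))
    dot = sum(counts_a[c] * counts_b[c] for c in codons)
    norm_a = sqrt(sum(counts_a[c] ** 2 for c in codons))
    norm_b = sqrt(sum(counts_b[c] ** 2 for c in codons))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("both inputs need at least one valid codon")
    return 1.0 - dot / (norm_a * norm_b)


# Toy sequences, made up purely for illustration (not real genes).
cluster_cds = ["ATGGCGGCGCTGTAA"]
background_cds = ["ATGGCAGCATTATAA", "ATGAAAGCATTAAAATAA"]

distance = cosine_distance(codon_counts(cluster_cds), codon_counts(background_cds))
print(f"cosine distance: {distance:.3f}")
```

A distance near 0 means the two codon usage profiles are nearly identical, while larger values mean the gene cluster uses codons differently from the rest of the genome.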

💁

This functionality can be used with any method of running fai, including single query mode - where only a single query protein is provided and homologs are identified in genomes with flanking contexts gathered as well. See the options -sq and -f in the fai usage page.

While the zol suite is designed to work with eukaryotic genomic data as well, salt is designed specifically for bacteria.

Example: Assessing staphyloxanthin synthesizing crt operon assimilation in Staphylococcus representative genomes

1. Create a prepTG database of Staphylococcus representative genomes:

Note, we do not recommend running this step yourself: it will take a while, especially with only 10 threads! If you do want to run it, we recommend skipping the -ma argument, which should make it much faster (we have not actually timed the difference). Without -ma, however, salt cannot perform the mobile genetic element analysis.

We will first create a database of Staphylococcus representative genomes selected using skDER based on GTDB release 214.

While prepTG has an option to automatically download a premade database for Staphylococcus from Zenodo (e.g. -d "Staphylococcus"), that database does not have MGE annotations available. So we will instead download the representative genomes for the genus from Zenodo and rerun prepTG:

# download 496 representative Staphylococcus genomes
# (-O sets the local file name; quoting the URL protects the "?" from the shell)
wget -O Staphylococcus.tar.gz "https://zenodo.org/records/10041203/files/Staphylococcus.tar.gz?download=1"
tar -zxvf Staphylococcus.tar.gz

# now run prepTG with the -ma option
# Note, this will take a bit (mainly because of the -ma option) and 
# in the below command we are requesting usage of 10 threads
prepTG -i Staphylococcus/ -o prepTG_Database/ -c 10 -ma

Note, we specify the -ma argument to request that genomes be annotated for mobile genetic element (MGE) associated proteins, which salt uses downstream.

2. Search for the crt operon in the prepTG database using fai

Next, we will create a FASTA file of crt operon proteins (n=5; crtM, crtN, crtO, crtP, crtQ) and use this as a query to search for the operon in the representative Staphylococcus genomes.

Note, the proteins below are from Staphylococcus equorum because we previously found, in the lsaBGC study, that this species carries a version of the crt operon that has been laterally transferred.

# create FASTA file of query proteins
echo '>crtO' > crt_proteins.faa
echo 'MKKLIIGNVLYWFMIQMIISTLGTFVTNAFLERHLKYFRIWPIEQNGALWQKYFKVKRWKRYLPDGQRINPNIYSKSKFDKSITSEALHQFIIETRRAELVHILSIVPVVIFLRATKMIKVINVIYVLLANVPCMIAQRYNRPKLERYYLAKVNRKGD' >> crt_proteins.faa
echo '>crtP' >> crt_proteins.faa
echo 'MNNEHVVVIGGGLGGISSAIRMAQAGYSVDLYEQNNHIGGKVNRLKIEQFGFDLGPSILTMPKIFQRLFSYSNKNLKDYVHIQKLDMQIRNFYPDGTIIDLYEAMEDTLINNDALTKTDINQLNTFFDYAKNIHKFAEKGYFNKGLDTLLEIVRYHGPFTALKEFDYFHTMQQAINRRVESPYLREMLGYFIKYVGSSSYDAPAVLSLLPQMQHAEGLWYVEGGIHKLAEAMEQLAKELGVRIHLNQKIIDMKYNNQHEITEIECANGQRIETDFIISNMEVIPLYKNLLHFEEKKLKKLERKFEPASSGYVMHLGVDKSYPELGHHNFFFSSNSERNYDEVFNQYVLPQDPTIYVVNSNKTDETQAPKGYENIKVLPHIPYIQAQDFTEKQYQQFRENILDKLEGMGLKGLRKHIVYEDTWTPHDIETTYGSNKGAIYGVVSSKKKNNGFKFPKHSQYFKNLYFVGGSVNPGPGMPMVTLSGMQVAEAIISQNRK' >> crt_proteins.faa
echo '>crtQ' >> crt_proteins.faa
echo 'MKIAVVGGGVSGLAAASRLSANGHQVDVYEKNKQIGGRMNQIKQDGFTFDMGPTIVMMSEIYHDIFNYAQKDMNDYLEIKQLAYIYDVYFSDTDKIRVPTDLAQLRDMLESIEPNSTHGFMSFLTDIYNRYEIARKYFLERTFRKPSDFYNPFTLYQGMKLKTFDTADNLIEKYVDNEKIQKLLAFQTLYIGIDPKRSPSLYSIIPMIELMFGVHFIKGGMYSFVNALEQLNYELGTQIYTNASVEEIIIDQRFKRAEGLKVNGKIEKYDKILCTADFPYAASELLNEDNQTKKYTHEKIEEMDYSCSAFLMYIGIDKDLSEDVLIHNIIFSNDFNNNIEEIFGGELSHDPSIYVYAPSVEDESLAPKGQTGLYILMPVSELKTGNVDWSDEQSIENVKKIIYKKLATVKALDNLKEHVVTEIIYTPNDFEGDYNAKFGSAFGLMPTLAQSNYYRPPNVSRDYKNLYFAGASVHPGAGVPIVLTSAKITVEEILEDIKNGI' >> crt_proteins.faa
echo '>crtM' >> crt_proteins.faa
echo 'MNQLERDYQYCHNVMKFHSKTFSYAFDFLEFKKKKAIWAIYAVCRIIDDSIDRDKDVKQLSKIEKDLEGIYNNSLEQYHSDEAIMNAFNDTLNYYDIPHEPFRTLIHYVKADLNLKNLSTDEELFNYCYGVAGTVGELLTPILASQNSKNIECAEYAAIELGKALQLTNILRDVGEDFENERIYLTEERLNQYKVNLQEIYQSGVTQDYIDLWESYAQDAAQFYKNALNGVNNFDEEVHYIIELASRAYLEILEEARRADYTLHKKVYISKIKKMKIYHEMVSKYNRSEKI' >> crt_proteins.faa
echo '>crtN' >> crt_proteins.faa
echo 'MYQQRQKISDEKFPNNKMAVSLIIPTRNEAQNLPNLLATITGIDNKEIEVIVMNDGSTDKTQEIAQGYGAKVYNIDNKSSWKGKSRACWEGSKHASHDLLLFIDADVQFCGSESIEKIVQQYQRQNGHGLLSIQPYHRIQKVYENISAIFNLMTIVGMNKFSITSNNKDKKNAFGPIMLTNKKDYHETQGHLNAKDKVIEGFALSKAYSDANMPVEIYEGEGIANFRMYPQGFKALVEGWSKHFALGSTTTKKSTFSLIILWLMGSVVSSLTVLLSVKLSIIYIMLSLLIYIAYTIQFHLLINRTGNFSLTASACHPFLFICFLAIFFKSWLDANIFKRIKWKDRDIKL' >> crt_proteins.faa

# run fai
fai -pq crt_proteins.faa -tg prepTG_Database/ -c 10 -o fai_Results/
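Before launching a long search, it can be worth confirming that the query FASTA contains what you expect. The following is an optional sanity check and not something fai requires. The helper names are hypothetical, and the short sequences in the demo are placeholders rather than the real crt proteins.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("found a header line without a name")
            name = header[0]
            if name in records:
                raise ValueError(f"duplicate record name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence found before the first header")
        else:
            records[name] += line
    return records


def check_queries(records, expected_names):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for missing in sorted(set(expected_names) - set(records)):
        problems.append(f"missing query: {missing}")
    for name, seq in records.items():
        if not seq:
            problems.append(f"empty sequence: {name}")
        elif not seq.isalpha():
            problems.append(f"non-letter characters in: {name}")
    return problems


# Placeholder sequences for demonstration only.
demo_text = ">crtM\nMNQLERDY\n>crtN\nMYQQRQKI\n"
demo_records = read_fasta(demo_text)
print(check_queries(demo_records, ["crtM", "crtN", "crtO", "crtP", "crtQ"]))
```

To check the real file, read it with `open("crt_proteins.faa").read()` and pass the text to `read_fasta`; an empty list from `check_queries` means all five expected queries are present and look like protein sequences.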

3. Assess support for horizontal/lateral transfer of the crt operon in the different genomic contexts where it was found

We can now run salt!

salt -f fai_Results/ -tg prepTG_Database/ -o salt_Results/ -c 10
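If you prefer to script the three steps of this example, the sketch below assembles the same prepTG, fai, and salt commands shown on this page as argument lists. It only uses flags that appear in the commands above. The function name and its parameters are hypothetical, and actually executing the commands is left commented out.

```python
import shlex


def build_commands(genomes_dir, query_faa, db_dir, fai_dir, salt_dir, cpus=10):
    """Return the prepTG, fai, and salt commands from this example as argv lists."""
    cpus = str(cpus)
    return [
        ["prepTG", "-i", genomes_dir, "-o", db_dir, "-c", cpus, "-ma"],
        ["fai", "-pq", query_faa, "-tg", db_dir, "-c", cpus, "-o", fai_dir],
        ["salt", "-f", fai_dir, "-tg", db_dir, "-o", salt_dir, "-c", cpus],
    ]


commands = build_commands(
    "Staphylococcus/", "crt_proteins.faa",
    "prepTG_Database/", "fai_Results/", "salt_Results/",
)
for command in commands:
    print(shlex.join(command))
    # To actually run each step, import subprocess and uncomment:
    # subprocess.run(command, check=True)
```

Keeping each command as a list avoids shell quoting problems, and running with `check=True` would stop the pipeline at the first failing step, so salt never runs on incomplete fai results.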

4. Investigate the results:

There are two major results from salt:

1. The SALT_Report.xlsx Spreadsheet: The main result is this spreadsheet which can be opened in Excel/Google Sheets and manually sorted and investigated:

Let's open up the results we just produced. The first thing I would do is sort by the column "GC AAI observed - expectation" in descending order (largest to smallest). Then we can assess the individual quantitative measures to assess if we see support for horizontal/lateral transfer relative to the instance from S. equorum. One row jumps out right away:

Screenshot from 2024-06-06 20-33-11

This corresponds to an instance of the crt operon detected in a Staphylococcus pasteuri genome. Why is this interesting:

  1. Low codoff empirical P-value (Column H): This tells us that the codon usage of the crt operon differs from the codon usage of the rest of the genome.
  2. Short distance from a transposon (Column L): This tells us that the crt operon is close to a suspected IS element or transposon.
  3. Homologs of plasmid-associated proteins found on the same scaffold (Column O): This suggests the crt operon instance might be on a plasmid. We can look up the scaffold ID on NCBI and, indeed, it is marked as a plasmid in the name.

So this particular instance is likely a case where the specific crt operon was at some point laterally transferred to an ancestor of this particular S. pasteuri strain.
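The manual triage described above can also be sketched in code. The example below sorts rows by "GC AAI observed - expectation" in descending order and lists which lines of evidence each row shows. Column names follow the report overview on this page, but the rows, genome names, and cutoffs are made up for illustration. In particular, the thresholds, the distance units, and the use of None for a missing IS element distance are assumptions rather than salt conventions.

```python
def supporting_signals(row, max_pvalue=0.05, max_is_distance=10000):
    """List the lines of evidence for transfer shown by one report row.

    The default cutoffs are arbitrary illustrative choices.
    """
    signals = []
    if row["codoff empirical p-value"] <= max_pvalue:
        signals.append("codon usage")
    distance = row["distance to IS element"]
    if distance is not None and distance <= max_is_distance:
        signals.append("near IS element")
    if row["scaffold CDS proportion plasmid-associated"] > 0:
        signals.append("plasmid-like scaffold")
    if row["GC AAI observed - expectation"] > 0:
        signals.append("AAI above expectation")
    return signals


def rank_rows(rows):
    """Sort rows from largest to smallest 'GC AAI observed - expectation'."""
    return sorted(rows, key=lambda r: r["GC AAI observed - expectation"], reverse=True)


# Made-up rows; these are NOT results from the Staphylococcus analysis.
rows = [
    {"genome": "genome_A", "codoff empirical p-value": 0.62,
     "GC AAI observed - expectation": -1.5, "distance to IS element": 250000,
     "scaffold CDS proportion plasmid-associated": 0.0},
    {"genome": "genome_B", "codoff empirical p-value": 0.01,
     "GC AAI observed - expectation": 8.2, "distance to IS element": 1200,
     "scaffold CDS proportion plasmid-associated": 0.15},
    {"genome": "genome_C", "codoff empirical p-value": 0.03,
     "GC AAI observed - expectation": 0.4, "distance to IS element": None,
     "scaffold CDS proportion plasmid-associated": 0.0},
]

for row in rank_rows(rows):
    print(row["genome"], supporting_signals(row))
```

Rows that show several independent signals at once, like the S. pasteuri instance discussed above, are the strongest candidates for follow-up. To apply this to real output, the spreadsheet would first need to be loaded into a list of dictionaries, for example with a spreadsheet library such as pandas.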

2. The GC_to_RiboProt_AAI_Relationships.pdf Visual: A scatterplot showing the relationship between ribosomal protein average amino acid identity (AAI) and gene cluster AAI. A linear regression is fitted through the data and shown as a dashed red line. A dashed grey line shows the 1:1 relationship between the two axes. Because ribosomal proteins evolve much more slowly, the values here all fall under this line. If horizontal transfer has occurred more recently and the genomes being investigated span multiple genera/families, then values can fall above the grey line, and those would be highly suggestive of HGT.

Screenshot from 2024-06-06 20-42-58
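The idea behind the regression line and the "GC AAI observed - expectation" statistic can be illustrated with a small least-squares fit. This is a sketch under stated assumptions: it treats ribosomal protein AAI as the predictor and gene cluster AAI as the response, uses ordinary least squares, and runs on made-up AAI values. The exact fitting procedure salt uses may differ.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally sized lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def observed_minus_expected(ribo_aai, gc_aai):
    """Residual of each gene cluster AAI relative to the fitted line."""
    slope, intercept = fit_line(ribo_aai, gc_aai)
    return [y - (slope * x + intercept) for x, y in zip(ribo_aai, gc_aai)]


# Made-up AAI values (percent) for four genomes versus a reference genome.
ribo_aai = [99.0, 95.0, 90.0, 85.0]
gc_aai = [98.0, 90.0, 95.0, 70.0]

residuals = observed_minus_expected(ribo_aai, gc_aai)
for x, y, r in zip(ribo_aai, gc_aai, residuals):
    print(f"ribosomal AAI {x:5.1f}  GC AAI {y:5.1f}  observed - expected {r:+.2f}")
```

In this toy data the third genome has a gene cluster AAI well above what its ribosomal protein AAI predicts, so it gets the largest positive residual. That is the kind of point that would sit above the dashed red line in the plot.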

Note, this assessment and the "GC AAI observed - expectation" statistic are relative to the reference genome selected (the best match to the query proteins used in fai). They can thus vary considerably if we were to use crt proteins from another species/genome, say from Staphylococcus aureus:

Screenshot from 2024-06-07 01-08-30

Overview of salt table report

| Column | Description |
| --- | --- |
| gene cluster (GC) instance | A unique identifier for the gene cluster instance. |
| GC gbk path | The path to the gene cluster instance GenBank file. |
| genome | The genome identifier for the gene cluster instance. |
| scaffold | The scaffold the gene cluster instance is found on. |
| scaffold length (bp) | The length of the scaffold the gene cluster instance is found on. |
| scaffold CDS count | The number of CDS features on the scaffold. |
| GC CDS count | The number of CDS features in the specific gene cluster instance. |
| codoff empirical p-value | The empirical P-value computed by codoff for how divergent the codon usage of the gene cluster instance is relative to the background genome. |
| GC AAI observed - expectation | The difference between the observed gene cluster AAI and the gene cluster AAI expected from the observed ribosomal protein AAI, based on a linear regression fitted to the data. Values above 0 indicate that the gene cluster is more similar than might be expected and might be supportive of horizontal/lateral transfer. |
| GC AAI between genome and reference genome | The gene cluster AAI between the reference genome and the genome featuring the specific gene cluster instance. |
| ribosomal protein AAI between genome and reference genome | The AAI for highly conserved universal ribosomal proteins (from GToTree by Lee 2019; originally from Hug et al. 2016) between the reference genome and the genome featuring the specific gene cluster instance. |
| distance to IS element | Minimum distance to an IS element / transposon from the gene cluster instance. |
| scaffold CDS proportion IS elements | The proportion of the scaffold's coding sequences / proteins that are homologous to IS elements/transposons. |
| scaffold CDS proportion VOGs | The proportion of the scaffold's coding sequences / proteins that are homologous to viral ortholog groups (VOGs). |
| scaffold CDS proportion plasmid-associated | The proportion of the scaffold's coding sequences / proteins that are homologous to plasmid-associated proteins (from MOB-suite). |
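As an illustration of what a "minimum distance to an IS element" means, the sketch below computes the smallest edge-to-edge gap between a gene cluster and a set of annotated elements on the same scaffold. The coordinates are made up, and measuring edge to edge (with overlap counted as 0) is an assumption; salt may define the distance differently.

```python
def interval_gap(start_a, end_a, start_b, end_b):
    """Edge-to-edge distance between two intervals; 0 if they overlap or touch."""
    if start_a > end_a or start_b > end_b:
        raise ValueError("interval start must not exceed its end")
    return max(0, start_b - end_a, start_a - end_b)


def min_distance_to_elements(cluster, elements):
    """Smallest gap between a (start, end) cluster and any (start, end) element.

    Returns None when the scaffold has no annotated elements.
    """
    if not elements:
        return None
    cluster_start, cluster_end = cluster
    return min(interval_gap(cluster_start, cluster_end, s, e) for s, e in elements)


# Made-up coordinates (bp) on a single hypothetical scaffold.
crt_cluster = (12000, 18500)
is_elements = [(2000, 3300), (19700, 21000), (40000, 41200)]

print(min_distance_to_elements(crt_cluster, is_elements))
```

Here the nearest element starts 1,200 bp downstream of the cluster, so that is the reported minimum. Elements on other scaffolds would have to be excluded before calling the function, since a distance between scaffolds is not defined.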

Usage:

usage: salt [-h] -f FAI_RESULTS_DIR -tg TARGET_GENOMES_DB -o OUTDIR [-c CPUS]

	Program: salt
	Author: Rauf Salamzade
	Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

	salt - Support Assessment for Lateral Transfer
	
	salt performs various analyses to assess support for horizontal vs. vertical evolution
	of a gene cluster across target genomes searched using fai. It takes as input the result
	directory from fai as well as the prepTG database searched.
								  
	salt will: (1) run codoff to assess codon usage similarities between gene clusters detected
	and their respective background genomes, (2) infer similarity between query gene cluster 
	searched for in fai and detected homolog with respect to expected similarity based on 
	universal ribosomal proteins, and (3) assess whether the scaffold the detected gene cluster is on 
	features insertion-elements, phage proteins, or plasmid proteins (assumming --mge_annotation
	was requestd for prepTG).
								  
	Similar to other zol programs, the final result is an auto-color-formatted XLSX spreadsheet.
	

options:
  -h, --help            show this help message and exit
  -f FAI_RESULTS_DIR, --fai_results_dir FAI_RESULTS_DIR
                        Path to antiSMASH BGC prediction results directory for a single sample/genome.
  -tg TARGET_GENOMES_DB, --target_genomes_db TARGET_GENOMES_DB
                        Result directory from running prepTG for target genomes of interest.
  -o OUTDIR, --outdir OUTDIR
                        Output directory for saHGT analysis.
  -c CPUS, --cpus CPUS  The number of CPUs to use [Default is 1].