1‐Featured‐function

Ruolin He edited this page Mar 14, 2024 · 3 revisions

C domain subtype classification

One of the most important features of NRPS-motif-Finder is its support for full subtype classification of the C domain.

C_all_tree7

Maximum-likelihood phylogenetic tree of the condensation domain superfamily.

Subtype classification and sequences are described in the main text and the Method. Different subtypes are indicated by colors, with subtypes exclusive to fungi marked by underlines and subtypes found predominantly in bacteria marked by asterisks. The tree is rooted, with papA and WES as outgroups (black shading). The L-clade and D-clade are indicated by blue and red shading, respectively.

The details of C domain subtypes

| C domain subtype | Species distribution | Function | Comment | Reference sequence source |
| --- | --- | --- | --- | --- |
| LCL | Bacteria and fungi | LCL-type C domains catalyze peptide bond formation between two L-amino acids. | It is hard to distinguish between LCL and SgcC5 due to their high sequence similarity. | Conserved Protein Domain Family |
| DCL | Bacteria and fungi | The DCL-type C domain catalyzes the condensation between a D-aminoacyl/peptidyl-PCP donor and an L-aminoacyl-PCP acceptor. | | Conserved Protein Domain Family |
| Starter | Mainly bacteria | While standard C domains catalyze peptide bond formation between two amino acids, the Starter C domain may instead acylate an amino acid with a fatty acid in the first module of the NRPS. | | Conserved Protein Domain Family |
| Dual | Bacteria and fungi | Dual-function E/C domains have both epimerization and DCL condensation activity. They first epimerize the substrate amino acid to the D-configuration, then catalyze the condensation between the D-aminoacyl/peptidyl-PCP donor and an L-aminoacyl-PCP acceptor. | | Conserved Protein Domain Family |
| CT | Fungi only | Unlike bacterial NRPSs, which typically have specialized terminal thioesterase (TE) domains to cyclize peptide products, many fungal NRPSs employ a terminal condensation-like (CT) domain to produce macrocyclic peptidyl products. | | Conserved Protein Domain Family |
| CT-DCL | Fungi only | The CT-DCL domain catalyzes the same reaction as the DCL domain but has high sequence similarity to the CT domain. | This subtype is proposed in our paper. | Conserved Protein Domain Family |
| CT-A | Fungi only | The CT-Atypical (CT-A) domain catalyzes the same reaction as the DCL domain but has high sequence similarity to the CT domain. It always follows an ACP (acyl carrier protein) domain rather than a T domain. | This subtype is proposed in our paper. | Conserved Protein Domain Family |
| PS | Mainly bacteria | The PS domain catalyzes the Pictet-Spengler reaction. | | Literature |
| bL | Mainly bacteria | The beta-lactam (bL) C domain mediates an unusual cyclization to form beta-lactam rings. | The bL domain is actually a subtype of the DCL domain. | Conserved Protein Domain Family |
| X | Mainly bacteria | The X domain is a catalytically inactive condensation-like domain shown to recruit oxygenases to the NRPS. | | Conserved Protein Domain Family |
| Cyc | Bacteria and fungi | Cyc (heterocyclization) domains catalyze two separate reactions in the creation of heterocyclized peptide products in NRPS: amide bond formation, followed by intramolecular cyclodehydration between a Cys, Ser, or Thr side chain and a carbonyl carbon on the peptide backbone to form a thiazoline, oxazoline, or methyloxazoline ring. | | Conserved Protein Domain Family |
| I | Mainly bacteria | The Interface (I) domain plays a role in positioning the β-hydroxylase and the NRPS-bound amino acid substrate prior to hydroxylation. | | Literature |
| modAA | Mainly bacteria | The core function of the modAA C domain is to catalyze the dehydration of a beta-hydroxy amino acid (such as Ser or Thr) to form a dehydroamino acid. Derived functions include pyrrolizidine formation, conjugate addition instead of amide formation, pyrimidine formation, L-2-amino-4-methoxy-trans-3-butenoic acid formation, and side-chain conjugate addition. | | Literature |
| Cglyc | Mainly bacteria | The glycopeptide condensation domain functions in peptide bond formation during glycopeptide antibiotic biosynthesis. | | MiBiG v2 |
| Hybrid | Bacteria and fungi | C domains of hybrid polyketide synthase/nonribosomal peptide synthetases (PKS/NRPSs) catalyze peptide bond formation within (usually) large multi-modular enzymatic complexes. Hybrid PKS/NRPSs create polymers containing both polyketide and amide linkages. | | Conserved Protein Domain Family |
| FUM14 | Fungi only | C domain of NRPS similar to the ester-bond-forming Fusarium verticillioides FUM14 protein. | The module containing the FUM14 domain is always used iteratively. The ester-bond formation function is uncommon. | Conserved Protein Domain Family |
| SgcC5 | Bacteria and fungi | SgcC5 is an NRPS C domain with ester- and amide-bond-forming activity. | It is hard to distinguish between LCL and SgcC5 due to their high sequence similarity. | Conserved Protein Domain Family |
| LCL-A | Fungi only | C domain with an atypical active site motif. Members of this subfamily typically have a non-canonical conserved SHXXXDX(14)Y motif, which replaces the HHXXXD motif typically found in the C domain. | This subtype is named in our paper. | Conserved Protein Domain Family |
| E | Bacteria and fungi | Epimerization (E) domains of NRPS flip the chirality of the terminal amino acid of the peptide being assembled by the NRPS. | | Conserved Protein Domain Family |

Note: In NRPS-motif-Finder results, the E domain is not treated as a C domain subtype. The E domain has 7 motifs, whereas the C domain has 10 motifs.
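To make the motif notation used for the LCL-A subtype concrete, the following is a minimal sketch that translates the two active-site patterns (HHXXXD and SHXXXDX(14)Y, with X meaning any residue) into regular expressions. This is only an illustration of the notation: NRPS-motif-Finder classifies subtypes with HMMs, not with regular expressions, and the function name and toy sequences below are invented for this example.

```python
import re

# Canonical C domain active-site motif: H-H-x-x-x-D
CANONICAL = re.compile(r"HH.{3}D")
# Atypical LCL-A motif: S-H-x-x-x-D-x(14)-Y
ATYPICAL = re.compile(r"SH.{3}D.{14}Y")


def active_site_motif(seq: str) -> str:
    """Return 'canonical', 'atypical', or 'none' for a protein sequence.

    Hypothetical helper for illustration; not part of NRPS-motif-Finder.
    """
    seq = seq.upper()
    if CANONICAL.search(seq):
        return "canonical"
    if ATYPICAL.search(seq):
        return "atypical"
    return "none"


# Toy sequences, invented for illustration only.
canonical_example = "MKTAHHIIFDGWS"
atypical_example = "MKTASHIIFDG" + "A" * 13 + "YQ"

print(active_site_motif(canonical_example))  # canonical
print(active_site_motif(atypical_example))   # atypical
```

A plain pattern match like this ignores the surrounding sequence context, which is one reason profile-based methods are preferred for real classification.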

HMM files for C domain subtype classification can be found here.

The raw sequences for HMM construction can be found here.

A domain loop group classification

Loop length and loop group are concepts proposed in our paper. We found that the loop group is related to A domain substrate specificity.

The combination of phylum and loop group can reduce the diversity of the specificity-conferring code.

Fig3_20230113

The specificity-conferring code of the A domain is correlated with loop length and phylogeny

A. SCA of 2,636 A domain sequences, together with their substrate specificities attached to the last column of the multiple sequence alignment. Six sectors with a high contribution from the substrate column (>0.05, the size of points on the left scales the substrate’s contribution to the sector, see Method for details) are sorted by their eigenvalues. The size of points scales its contribution to the sector. Orange bars mark the A domain motifs from A1 to A8. The start and end of the five loop regions are marked by black and green dotted lines, respectively. S4 and S6 are the 4th and 6th of the specificity-conferring codes. G is the G-motif.

B. Distance matrix of A domain. Upper right on the heatmap is the Euclidean distance of the loop length as a 5-element vector. Lower left on the heatmap is the sequence distance of the A domain. The matrix is sorted by the substrate specificity followed by the loop length group. Substrates, groups of loop length, and phylum of these A domains, are shown by colors in sidebars.

C. An example showing that A domains with identical substrate specificity exhibit distinct specificity-conferring codes when they fall into different loop-length groups. The phylum composition of each group is shown in the pie chart.

S32_Fig

Clustering and groups of five loops

A. Hierarchical clustering of the A domains based on the Euclidean distances between their lengths in five loops. A domains were categorized into five groups based on their loop-length vectors. For visual clarity, Euclidean distances greater than 12 were set to 12 before normalization.
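The distance calculation described in panel A can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the caption says only that distances above 12 are set to 12 "before normalizing", so dividing by the cap is an assumed normalization, and the loop-length vectors are invented toy values.

```python
import math

CAP = 12.0  # distances above this value are clipped, as stated in the caption


def capped_normalized_distances(loop_lengths):
    """Pairwise Euclidean distances between 5-element loop-length vectors.

    Distances are clipped at CAP and then divided by CAP so that they lie
    in [0, 1]. Dividing by CAP is an assumption made for this sketch.
    """
    n = len(loop_lengths)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = min(math.dist(loop_lengths[i], loop_lengths[j]), CAP)
            dist[i][j] = dist[j][i] = d / CAP
    return dist


# Toy loop-length vectors (one value per loop), invented for illustration.
toy_vectors = [
    (5, 8, 3, 10, 6),
    (5, 8, 3, 10, 7),
    (20, 8, 3, 10, 6),
]
toy_matrix = capped_normalized_distances(toy_vectors)
for row in toy_matrix:
    print([round(value, 3) for value in row])
```

A matrix of this form could then be passed to any hierarchical clustering routine; clipping keeps a few very long loops from dominating the color scale.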

B. Loop length profiles of five groups shown in A. More details in Method.

S34_Fig

The sequence logo of the specificity-conferring code for substrate alanine in the dimension of phylum and loop group

Nine of the 10 positions of the specificity-conferring code are displayed. The last one, the conserved lysine (K) in the A10 motif, is not shown because our A domain sequences only cover A1-A8. A sequence logo is not plotted if the number of sequences is less than 3. Substrate abbreviation: A = alanine.

New motifs

We found some conserved sites in the A domain and T domain that do not belong to any known motif.

The new motif before A1 was named the Aalpha motif. The new motif between A5 and A6 was named the G-motif because of its core conserved Gly. The new motif before T1 was named the Talpha motif.

Fig S7

Sequence logo of highly conserved positions near known motifs

The black boxes show known core motifs in the A domain. The black triangles show highly conserved positions in the multiple sequence alignment of the 1,161 C+A+T NRPS sequences from the MiBiG database v2.

Fig5_revision20230113

Analysis of amino acid frequency reveals potential new motifs with implications for structural flexibility

A. Amino acid frequency and gap frequency along the multiple sequence alignment of the NRPS CAT modules. In the bottom panel, bar heights indicate the frequency of the most frequent amino acid. Bars in the known core motifs from the C, A, and T domains were colored blue, orange, and yellow, respectively. The horizontal red dashed line represents the 0.95 frequency level. Domain boundaries annotated by Pfam are divided by red triangles. The colored patch above the amino acid frequency indicates gap frequency. Three potential new motifs (position 1183, 1960, and 2435/2447 in MSA) are marked by the blue dashed box. The upper panel shows the sequence logo and the gap frequency near the three potential new motifs.

B. Chemical interactions and secondary structures surrounding the second potential new motif in (A) at the substrate donation state (PDB: 6MFY). Hydrogen bonds near the most conserved Gly were shown in blue dashed lines. Covalent bonds were shown as black lines. Secondary structures, such as beta-sheets, were demonstrated as bold gray arrows. Known A domain motifs adjacent to related residues were shown in the orange box.

C. Same as that in B, but in the thiolation state (PDB: 6MG0).

D. Same as that in B, but in the condensation state (PDB: 6MFZ).
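The per-column statistics plotted in panel A (frequency of the most frequent amino acid and gap frequency) can be sketched as below. This is an illustrative sketch only: the toy alignment is invented, and computing both frequencies over all sequences in the column (gaps included in the denominator) is an assumption, since the source does not state the denominator. The 0.95 cutoff is the level marked by the red dashed line.

```python
from collections import Counter


def column_stats(msa, gap="-"):
    """Return (top_residue, top_frequency, gap_frequency) for each column.

    Frequencies are taken over all sequences, gaps included; this choice
    of denominator is an assumption of this sketch.
    """
    n_seqs = len(msa)
    stats = []
    for column in zip(*msa):
        counts = Counter(column)
        gaps = counts.pop(gap, 0)
        if counts:
            residue, top = counts.most_common(1)[0]
        else:
            residue, top = gap, 0  # column made only of gaps
        stats.append((residue, top / n_seqs, gaps / n_seqs))
    return stats


# Toy alignment, invented for illustration only.
toy_msa = ["GA-K", "GAVK", "GT-K", "GAVR"]
toy_stats = column_stats(toy_msa)

# Columns whose most frequent residue reaches the 0.95 level.
conserved = [i for i, (_, freq, _) in enumerate(toy_stats) if freq >= 0.95]
print(conserved)  # [0]
```

Runs of highly conserved columns outside the known core motifs are the kind of signal that points to candidate new motifs.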

We found that the G-motif is important for structural flexibility (panels B-D of the figure above), and the G-motif has been proposed as a re-engineering cut point (Ref1, Ref2).

We verified the importance of the G-motif in the A domain through point mutation experiments in fungi, which showed that the G-motif is important for A domain function.

Fig 3(experimet)20230128

Mutations of G-motif G409 in FmqC support the importance of the conserved domain in the biosynthesis of fumiquinazoline C

A. Residues in and near the G-motif in the structure of FmqC predicted by Phyre2 [96]. G409 is the conserved glycine in the G-motif. Residues in the G-motif are marked in blue. The residues N397 and S491 (equivalent to F493 and N577 in LgrA, Fig 3B-D), which may collide with G409, are marked in yellow and cyan.

B. Same as that in A, but with simulated mutation of G409W. The mutated tryptophan is marked in magenta.

C. The fmq gene cluster responsible for the production of FQC. Two NRPSs, fmqA and fmqC, are filled in red with their substrate selectivity marked. Ant: non-proteinogenic amino acid anthranilate. C* represents a truncated and presumably inactive C domain. CT represents a terminal condensation-like domain that catalyzes macrocyclization reaction.

D. The biosynthetic pathway for FQC is depicted, along with how it diverges into the production of compound 1 in the absence of functional FmqC.

E. LC-MS analysis of the control fmqC (first row), ΔfmqC (second row), and six point mutation strains (3rd to 8th row).

F. Normalized yield of FQC and compound 1 in different strains. For FQC, the yield is normalized by its production in the wild-type strain. For compound 1, the yield is normalized by its production in the fmqC gene deletion strain. Error bars show standard deviations.

NRPS motif sequence logo repository

In our paper, we constructed the most comprehensive NRPS motif sequence logo repository, based on 16,820 bacterial and 2,505 fungal genomes (as of 2022/10/23).

In this large dataset, we analyzed 83,489 C domains, 95,582 A domains, 86,688 T domains, 14,502 E domains, and 23,590 TE domains from bacteria; and 34,269 C domains, 40,458 A domains, 26,651 T domains, 3,982 E domains, and 4,008 TE domains from fungi.

Figures S7-S14 and S16-S18 are available in the Word document file. High-resolution figures can be downloaded from the supporting information of our paper.

motif logo

Our findings highlight the variation in the location and motif of C domains with the T domain between subtypes in fungi and bacteria (Figure 2D).

Firstly, some C domain subtypes do not directly follow a T domain: the fungal Hybrid and CT-A subtypes are adjacent to an ACP domain rather than a T domain, and the bacterial DCL and fungal CT-DCL subtypes are located after an E domain, consistent with previous reports[65]. For the Cyc subtype, only 29.34% are adjacent to a T domain, while the rest are located at the start of the pathway. The Starter, I, and PS subtypes are also exclusively found at the start of the pathway.

Secondly, we observed that the coupling between the C domain and its adjacent T domain cannot be solely explained by the L- and D-clades. Previous research mainly based on bacterial data suggested that the LCL subtype (located in the L-clade of the phylogenetic tree in our analysis) is adjacent to a T domain with the LGGHSL motif, and the DCL subtype (located in the D-clade of the phylogenetic tree in our analysis) is adjacent to a T domain with the LGGDSI motif [64, 65].

Our analysis not only confirmed these relationships but also showed that not all subtypes in the L- and D-clades follow this pattern: the X subtype in the L-clade is adjacent to a T domain with the LGGDSG motif, and the Dual and fungal DCL subtypes in the D-clade are adjacent to T domains with the LGGHSI and LGGHSL motifs, respectively. These observations demonstrate that the T1 motifs are primarily linked to the specific subtypes of their adjacent C domains, rather than to their clades.

C_subtype_figure_D

Figure 2D. The sequence logo for the C3 or E2 motif from different C domain subtypes and the T1 or ACP1 motif adjacent to each subtype. Sequences from bacteria are marked in red, while sequences from fungi are marked in blue.

AT_NRPS

Figure S7. Sequence logo of the twelve A motifs and two T motifs among 95,582 A domains and 86,688 T domains in bacteria

The y-axis range in all sequence logo figures is 0~4.4 bits.
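For readers unfamiliar with the bit scale on the y-axis, the following sketch computes the information content of a single alignment column in the standard way for protein sequence logos: log2(20) minus the Shannon entropy of the column. The maximum, log2(20) ≈ 4.32 bits, is reached by a fully conserved position and fits within the 0~4.4 range. This is a generic textbook calculation without small-sample correction; the logo software and settings used for the figures are not stated here, and ignoring gaps is an assumption of the sketch.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    residue frequencies. Gaps are ignored (an assumption of this sketch)
    and no small-sample correction is applied.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


# Toy columns, invented for illustration only.
fully_conserved = information_content("GGGGGGGG")
half_and_half = information_content("AAAAGGGG")
print(round(fully_conserved, 2), round(half_and_half, 2))
```

In a logo, the height of each letter stack corresponds to this value, with individual letters scaled by their frequency in the column.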

AT_NRPS

Figure S10. Sequence logo of the twelve A motifs and two T motifs among 40,458 A domains and 26,651 T domains in fungi

The y-axis range in all sequence logo figures is 0~4.4 bits.

E

Figure S8. Sequence logo of the seven E motifs among 14,502 E domains in bacteria

The y-axis range in all sequence logo figures is 0~4.4 bits.

E

Figure S11. Sequence logo of the seven E motifs among 3,982 E domains in fungi

The y-axis range in all sequence logo figures is 0~4.4 bits.

TE_NRPS

Figure S9. Sequence logo of the TE1 motifs among 23,590 TE domains in bacteria

The y-axis range in the sequence logo figure is 0~4.4 bits.

TE_NRPS

Figure S12. Sequence logo of the TE1 motifs among 4,008 TE domains in fungi

The y-axis range in sequence logo figures is 0~4.4 bits.

C_200_FUM

Figure S16. Sequence logo of the seven C motifs among 77,152 C domains (first row), and among 13 subtypes of C domains (second row) in bacteria

There are 77,152 C domains with subtype prediction scores above the threshold (200) and belonging to subtypes with more than 3 sequences.

The y-axis range in the sequence logo figures is 0~4.4 bits, except for the bL subtype (0~1) and the PS subtype (0~2.1), because these subtypes have few sequences.

The number of sequences in each C domain subtype is labeled at the end. For clarity, only motifs with the prevalent length are used in plotting.

The actual total C domain number used in the figure for the specific motif is shown in the title.

There are some gaps at both ends of the sequence for alignment. In motif C5, there are some interior gaps labeled in red to align with motif C5 of other subtypes.
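The selection rule described for Figure S16 (a prediction score above 200 and more than 3 sequences per subtype) can be sketched as a two-step filter. This is a hypothetical illustration: the input format, the function name, and the order of the two steps are assumptions, and the scores are invented.

```python
from collections import Counter


def select_domains(predictions, score_threshold=200, min_count=3):
    """Keep (subtype, score) pairs that pass both criteria.

    Step 1 keeps domains whose score exceeds score_threshold.
    Step 2 keeps subtypes that still have more than min_count domains.
    Applying the score filter first is an assumption of this sketch.
    """
    passing = [(s, score) for s, score in predictions if score > score_threshold]
    counts = Counter(subtype for subtype, _ in passing)
    return [(s, score) for s, score in passing if counts[s] > min_count]


# Toy predictions, invented for illustration only.
toy_predictions = (
    [("LCL", 350.0)] * 5   # five confident LCL domains
    + [("LCL", 150.0)]     # below the score threshold
    + [("PS", 400.0)] * 2  # confident, but too few sequences
)
selected = select_domains(toy_predictions)
print(len(selected))  # 5
```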

C_200_FUM

Figure S17. Sequence logo of the seven C motifs among 34,269 C domains (first row), and among 11 subtypes of C domains (second row) in fungi

There are 34,269 C domains with subtype prediction scores above the threshold (200) and belonging to subtypes with more than 3 sequences.

The y-axis range in all sequence logo figures is 0~4.4 bits.

The number of sequences in each C domain subtype is labeled at the end. For clarity, only motifs with the prevalent length are used in plotting.

The actual total C domain number used in the figure for the specific motif is shown in the title. There are some gaps at both ends of the sequence for alignment.

In motif C5, there are some interior gaps labeled in red to align with motif C5 of other subtypes.

Domain architecture comparison between bacteria and fungi

intermotif_AT(1)

Figure S13. Comparison of NRPS A and T domain architecture between bacteria and fungi

For comparison, only A domains with the same motif lengths as the reference A domain are used. Sequence numbers of intermotifs (A1-A2, A2-A3, A3-A4, A4-A5, A5-G motif, G motif-A6, A6-A7, A7-A8, A8-A9, A9-A10, Tα1-T1) are 75,407 in bacteria and 17,890 in fungi.

Sequence numbers of the A1-Tα1 intermotif (actually interdomain) are 69,440 in bacteria and 12,194 in fungi, because only part of the A domains are adjacent to a T domain.

Sequence numbers of the Tα1-T1 intermotif are 85,755 in bacteria and 25,069 in fungi.

intermotif_E(1)

Figure S14. Comparison of NRPS E domain architecture between bacteria and fungi

For comparison, only A domains with the same motif lengths as the reference A domain are used. Sequence numbers of intermotifs (E1-E2, E2-E3, E3-E4, E4-E5, E5-E6, E6-E7) are 12,875 in bacteria and 2,852 in fungi. Sequence numbers of intermotifs (actually interdomain) in bacteria are 12,618 for T1-E1 and 8,353 for E7-C1, while in fungi they are 2,088 for T1-E1 and 2,530 for E7-C1.

intermotif_C_FUM

Figure S18. Comparison of NRPS C domain architecture between bacteria and fungi

To compare intermotif lengths between different C domain subtypes and the E domain, we chose conserved positions that exist in all C domains and E domains as the start and end of each intermotif.

T1/ACP1-C1/E1 ends before the conserved “Q” in C1 and E1 (the conserved “E” in LCL-A subtype).

C1-C2 starts with the conserved “Q” in C1 (the conserved “E” in LCL-A subtype), and ends before the second conserved “R” in C2.

C2-C3 starts before the second conserved “R” in C2 and ends before the conserved “D” in C3.

C3-C4 starts before the conserved “D” in C3 and ends before the second conserved “Y” in C4.

C4-C5 starts before the second conserved “Y” in C4 and ends before the conserved “G” in C5.

C5-C6 starts before the conserved “G” in C5 and ends before the conserved “P” in C6.

C6-C7 starts before the conserved “P” in C6 and ends before the conserved “F” in C7 (the conserved “F” in LCL).

C7-Aα1 starts before the conserved “F” in C7 (the conserved “F” in LCL) and ends before Aα1.
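Given the anchor definitions above, each C domain intermotif length reduces to the difference between the positions of two consecutive conserved anchor residues (for example, C1-C2 runs from the conserved Q in C1 up to, but not including, the second conserved R in C2). The sketch below illustrates this arithmetic. The anchor positions are invented toy values, and the function and dictionary layout are hypothetical, not the NRPS-motif-Finder output format.

```python
# Order of the motifs whose conserved anchor residues delimit the intermotifs.
ANCHOR_ORDER = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]


def intermotif_lengths(anchor_positions):
    """Map 'Cn-Cn+1' to the number of residues between consecutive anchors.

    anchor_positions maps each motif name to the sequence position of its
    conserved anchor residue (Q in C1, second R in C2, D in C3, and so on).
    Each intermotif starts at the left anchor and ends just before the
    right anchor, so its length is the difference of the two positions.
    """
    lengths = {}
    for left, right in zip(ANCHOR_ORDER, ANCHOR_ORDER[1:]):
        lengths[f"{left}-{right}"] = anchor_positions[right] - anchor_positions[left]
    return lengths


# Toy anchor positions (1-based), invented for illustration only.
toy_anchors = {"C1": 12, "C2": 60, "C3": 130, "C4": 180,
               "C5": 275, "C6": 300, "C7": 318}
toy_lengths = intermotif_lengths(toy_anchors)
print(toy_lengths["C1-C2"], toy_lengths["C6-C7"])  # 48 18
```

Anchoring on residues conserved across all subtypes, rather than on motif boundaries, keeps the lengths comparable even when motif lengths differ between subtypes.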

Sequence numbers of intermotifs (C1-C2, C2-C3, C3-C4, C4-C5, C5-C6, C6-C7) are 77,152 in bacteria and 34,269 in fungi.

Sequence numbers of intermotifs (actually interdomain) in bacteria are 28,185 for T1-C1 and 33,176 for C7-Aα1 in the LCL subtype; 6,967 for E7-C1 and 6,752 for C7-Aα1 in the DCL subtype; 3,860 for T1-C1 and 4,495 for C7-Aα1 in the Dual subtype; and 6,967 for T1-C1 and 6,752 for C7-Aα1 in the Starter subtype. In fungi, they are 434 for T1-C1 and 616 for C7-Aα1 in the LCL subtype; 808 for E7-C1 and 1,931 for C7-Aα1 in the DCL subtype; and 24 for T1-C1 and 21 for C7-Aα1 in the Dual subtype.