# Reverse Ecology and Metatranscriptomics of Uncultivated Freshwater Actinobacteria

# Overview

An organism's seed set contains all of the metabolites that its metabolic network cannot synthesize. Seed compounds can represent auxotrophies, compounds that can be degraded, and (in many cases) errors in the organism's metabolic network reconstruction. Seed compounds were manually curated to identify those that may be biologically meaningful. Examples of each case (for the acI-C composite genome) are discussed below.

Many seed compounds were also associated with reactions catalyzed by peptidases or glycoside hydrolases, and the genes associated with these reactions were re-annotated. Peptidase sequences were annotated using the MEROPS batch BLAST interface. Glycoside hydrolases were first assigned to glycoside hydrolase families using dbCAN; HMMER3 was then used to assign them to individual sub-families using HMMs downloaded from dbCAN.

## Curation of Predicted Seed Compounds

### Checking Compounds for Biological Plausibility

This section presents vignettes illustrating the diverse ways in which seed compounds were manually curated.

#### Homoserine Auxotrophy

`L-Aspartate-4-semialdehyde`, `L-homoserine`, and `O-Phospho-L-homoserine` were identified as seed compounds. These three compounds can be interconverted via the following reactions:

        Homoserine dehydrogenase: L-Aspartate-4-semialdehyde <--> L-homoserine
        
        Homoserine kinase: L-homoserine <--> O-Phospho-L-homoserine

Because they can be freely interconverted, the three compounds are considered equivalent and any of them could be a seed. `Homoserine dehydrogenase` is the final step in homoserine biosynthesis, so these compounds suggest an auxotrophy for homoserine.
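
The equivalence described above can be sketched in code. In a metabolite graph, interconvertible compounds form a strongly connected component, and a component with no inward arcs from outside is a seed set. The snippet below is a minimal illustration under that assumption, not the pipeline's actual implementation; the function names are hypothetical, and the arc from `O-Phospho-L-homoserine` to `L-threonine` (threonine synthase) is added only so that the example contains a non-seed compound.

```python
from collections import defaultdict


def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph maps each node to a set of successors."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(sorted(graph.get(start, ()))))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(sorted(graph.get(nxt, ())))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = defaultdict(set)
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].add(src)

    # Pass 2: DFS on the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(frozenset(component))
    return components


def seed_sets(graph):
    """Return the components that receive no arc from any other component."""
    components = strongly_connected_components(graph)
    member_of = {node: comp for comp in components for node in comp}
    has_inward = set()
    for src, targets in graph.items():
        for dst in targets:
            if member_of[src] != member_of[dst]:
                has_inward.add(member_of[dst])
    return [comp for comp in components if comp not in has_inward]


# Reversible reactions contribute an arc in each direction.
network = {
    "L-Aspartate-4-semialdehyde": {"L-homoserine"},
    "L-homoserine": {"L-Aspartate-4-semialdehyde", "O-Phospho-L-homoserine"},
    "O-Phospho-L-homoserine": {"L-homoserine", "L-threonine"},
}

for seeds in seed_sets(network):
    print(sorted(seeds))
```

Here the three interconvertible compounds are reported together as a single seed set, while `L-threonine` is not a seed because it can be reached from that set.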

#### Degradation of Peptides

`Ala-Leu` and `gly-pro-L` were predicted to be seed compounds. The compounds are associated with the following reactions:

        H2O + Ala-Leu --> L-Leucine + L-Alanine
        
        H2O + Gly-Pro --> Glycine + L-Proline
        
and the COGs were annotated as various peptidases. These seed compounds suggest the ability to degrade peptides.

#### Carbamoyl phosphate: A result of network pruning heuristics

Another putative seed compound was carbamoyl phosphate. Carbamoylphosphate synthase catalyzes the first step in arginine and pyrimidine biosynthesis via the reaction:

    2 ATP + L-glutamine + hydrogen carbonate + H2O → L-glutamate + carbamoyl phosphate + 2 ADP + phosphate + 2 H+

This reaction contains a number of currency metabolites (the phosphate carriers ATP/ADP and the NH3 carriers glutamine and glutamate), as well as the single metabolites hydrogen carbonate, water, phosphate, and protons. All of these metabolites were removed from the network during pruning. Thus, all inward arcs to carbamoyl phosphate were removed, rendering it a seed compound. A quick inspection of the genome identified the gene for carbamoylphosphate synthase, confirming that carbamoyl phosphate is an artifact of pruning rather than a true seed.
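
The effect of pruning on this reaction can be reproduced with a small sketch. It assumes that each reaction contributes an arc from every substrate to every product and that pruning simply drops every arc touching a pruned metabolite; the function names are hypothetical, and the pruned set lists only the metabolites named above.

```python
# Metabolites removed during pruning, as listed in the text above.
PRUNED = frozenset({
    "ATP", "ADP", "L-glutamine", "L-glutamate",
    "hydrogen carbonate", "H2O", "phosphate", "H+",
})

# Carbamoylphosphate synthase reaction (stoichiometry is irrelevant for arcs).
SUBSTRATES = ["ATP", "L-glutamine", "hydrogen carbonate", "H2O"]
PRODUCTS = ["L-glutamate", "carbamoyl phosphate", "ADP", "phosphate", "H+"]


def reaction_arcs(substrates, products, pruned=frozenset()):
    """Return substrate -> product arcs, skipping any pruned metabolite."""
    return {
        (s, p)
        for s in substrates
        for p in products
        if s not in pruned and p not in pruned
    }


def in_degree(arcs, metabolite):
    """Count the arcs that point at the given metabolite."""
    return sum(1 for _, dst in arcs if dst == metabolite)


full_arcs = reaction_arcs(SUBSTRATES, PRODUCTS)
pruned_arcs = reaction_arcs(SUBSTRATES, PRODUCTS, PRUNED)

print(in_degree(full_arcs, "carbamoyl phosphate"))    # 4 inward arcs before pruning
print(in_degree(pruned_arcs, "carbamoyl phosphate"))  # 0 inward arcs after pruning
```

With every substrate pruned, carbamoyl phosphate has no inward arcs left and is therefore reported as a seed even though the synthase gene is present.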

#### Fatty acids: A network error

Many putative seed compounds participate in fatty acid biosynthesis. For example, the pipeline predicted 26 R-enoyl-ACP compounds to be seed compounds. All 26 compounds were associated with a COG annotated as an `Enoyl-[acyl-carrier-protein] reductase [NADH] (EC 1.3.1.9)`, the enzyme which catalyzes the final step in fatty acid elongation. Fatty acid and lipid biosynthesis pathways are typically difficult to reconstruct automatically (e.g., using the Model SEED), and these compounds are highly unlikely to be seeds. For acI-C, a total of 29 of 73 putative seed compounds (roughly 40%) participate in fatty acid or lipid biosynthesis, so this is a major problem.

### Auxotrophies: Checking Genome Annotations and Pathway Completeness

Automatic metabolic network reconstruction pipelines are prone to errors, such as missing/poor annotations, incorrect reaction assignments, etc. Thus, for each seed compound identified above, I manually inspected the genome for evidence for/against the prediction being a true positive. For the sake of brevity, some typical analyses are presented below. 

#### Auxotrophy for Homoserine

As described above, the seed compound L-Aspartate-4-semialdehyde suggests an auxotrophy for homoserine. Homoserine biosynthesis occurs via the following reactions:

    aspartate kinase: aspartate --> L-aspartyl-4-phosphate
    aspartate semialdehyde dehydrogenase: L-aspartyl-4-phosphate --> L-Aspartate-4-semialdehyde
    homoserine dehydrogenase: L-Aspartate-4-semialdehyde --> homoserine
    
The presence of L-Aspartate-4-semialdehyde as a seed compound suggests the reaction `aspartate semialdehyde dehydrogenase` is missing. I can find genes for the other two reactions in the pathway, aspartate kinase (group00288) and homoserine dehydrogenase (group00198), but I am unable to identify a candidate gene for `aspartate semialdehyde dehydrogenase`. Thus, on the evidence available, I conclude acI-C is auxotrophic for homoserine.

#### Auxotrophy for Tyrosine

The seed compound L-arogenate suggests an auxotrophy for tyrosine. Tyrosine can be synthesized via the following route:

    chorismate mutase: chorismate --> prephenate
    prephenate aminotransferase: prephenate --> L-arogenate
    arogenate dehydrogenase: L-arogenate --> L-tyrosine
    
L-arogenate was predicted to be a seed compound based on the presence of `arogenate dehydrogenase`, the final step in the pathway. The reaction `chorismate mutase` is also present, but I was unable to find a likely gene for `prephenate aminotransferase`.

However, L-tyrosine can be synthesized from chorismate via an alternative pathway:

    chorismate mutase: chorismate --> prephenate
    prephenate dehydrogenase: prephenate --> 4-hydroxyphenylpyruvate
    tyrosine aminotransferase: 4-hydroxyphenylpyruvate --> L-tyrosine
    
All three genes in this pathway are present in the genome, indicating acI-C is not auxotrophic for tyrosine.
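
The reasoning in the two auxotrophy checks above amounts to asking whether at least one route to the product has a gene for every step. A minimal sketch of that check follows; the function names and route labels are hypothetical, and the set of reactions with candidate genes is transcribed from the text above.

```python
def complete_routes(routes, present):
    """Return the names of routes for which every step has a candidate gene."""
    return [
        name for name, steps in routes.items()
        if all(step in present for step in steps)
    ]


def missing_steps(routes, present):
    """Map each route to the steps for which no candidate gene was found."""
    return {
        name: [step for step in steps if step not in present]
        for name, steps in routes.items()
    }


homoserine_routes = {
    "from aspartate": [
        "aspartate kinase",
        "aspartate semialdehyde dehydrogenase",
        "homoserine dehydrogenase",
    ],
}

tyrosine_routes = {
    "via arogenate": [
        "chorismate mutase",
        "prephenate aminotransferase",
        "arogenate dehydrogenase",
    ],
    "via 4-hydroxyphenylpyruvate": [
        "chorismate mutase",
        "prephenate dehydrogenase",
        "tyrosine aminotransferase",
    ],
}

# Reactions for which a candidate gene was found in the acI-C genome.
present = {
    "aspartate kinase",
    "homoserine dehydrogenase",
    "chorismate mutase",
    "arogenate dehydrogenase",
    "prephenate dehydrogenase",
    "tyrosine aminotransferase",
}

print(complete_routes(homoserine_routes, present))  # no complete route
print(complete_routes(tyrosine_routes, present))    # one complete route
print(missing_steps(homoserine_routes, present))
```

A product with no complete route is a candidate auxotrophy; a product with at least one complete route is not, even if another route is broken.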

### Degradation: Checking Genome Annotations

#### Taurine

The seed compound `aminoacetaldehyde` suggested the ability for acI-C to degrade `taurine`, for which `aminoacetaldehyde` is an intermediate. `Aminoacetaldehyde` degradation was predicted from three COG groups annotated as `aldehyde dehydrogenases`. I was unable to find any other reactions in the taurine degradation pathway, and BLAST-based annotations suggest these genes are probably `Acyl-CoA reductases`. So taurine is probably not degraded.

#### Phospholipids

Phospholipids were predicted to be seed compounds on the basis of a COG annotated as `Glycerophosphoryl diester phosphodiesterase`. The annotation seems correct, but this enzyme serves to internally recycle phospholipids, so phospholipids are probably not true seed compounds.

### Degradation: Re-annotation of GHases and Peptidases

The remaining potential degradation reactions occur via peptidases or glycoside hydrolases. These types of enzymes typically have broad specificity, while KBase indicates a narrow substrate specificity. We re-annotated all peptidase-associated genes with [MEROPS](http://merops.sanger.ac.uk/) and all glycoside hydrolase-associated genes with [dbCAN](http://csbl.bmb.uga.edu/dbCAN/index.php) to obtain a better picture of these enzymes' substrate specificity.

#### Degradation Pathways: Annotating Peptidases using MEROPS

KBase identified two peptidase reactions:

        H2O + Ala-Leu --> L-Leucine + L-Alanine
        
        H2O + Gly-Pro --> Glycine + L-Proline
        
associated with four COG groups: `group00108`, `group00018`, `group00328`, and `group02654`. Using protein sequences for genes in these COGs, I used the [MEROPS batch BLAST tool](http://merops.sanger.ac.uk/cgi-bin/batch_blast) to assign these COGs to peptidase families. In all cases, all genes within a COG were assigned to the same family. The results:

COG | Peptidase Family | Description | Expression (%ile)
----|------------------|-------------|-----------
group00018 | M17 | Intracellular. Maximal activity between pH 9 and 9.5. Can cleave any N-terminal amino acid from di- and poly-peptides. Prefers leucine, no proline activity. | 90
group00108 | M01 | Cleaves proteins and dipeptides. Capable of releasing a variety of residues. Can be both cytosolic and membrane-bound (w/ enzyme activity outside the cell). | 78
group00328 | M24A | Intracellular, removes the initiating N-terminal methionine from newly synthesized proteins. | 58
group02654 | None | None | 

Thus, I conclude that acI-C has the ability to degrade di- and poly-peptides.
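
The observation that all genes within a COG were assigned to the same family can be checked mechanically once per-gene assignments are available. The sketch below assumes the batch BLAST results have already been reduced to a gene-to-family mapping; the gene identifiers are made up for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def consensus_family(gene_to_cog, gene_to_family):
    """Split COGs into those with one family assignment and those with several.

    Genes without a hit are recorded as None, so a COG whose genes all lack
    a hit gets a consensus of None.
    """
    families = defaultdict(set)
    for gene, cog in gene_to_cog.items():
        families[cog].add(gene_to_family.get(gene))

    consensus, conflicts = {}, {}
    for cog, assigned in families.items():
        if len(assigned) == 1:
            consensus[cog] = next(iter(assigned))
        else:
            conflicts[cog] = assigned
    return consensus, conflicts


# Hypothetical gene identifiers; the COG and family names come from the table.
gene_to_cog = {
    "gene_a": "group00018",
    "gene_b": "group00018",
    "gene_c": "group00108",
    "gene_d": "group02654",
}
gene_to_family = {"gene_a": "M17", "gene_b": "M17", "gene_c": "M01"}

consensus, conflicts = consensus_family(gene_to_cog, gene_to_family)
print(consensus)
print(conflicts)
```

An empty `conflicts` dictionary corresponds to the situation reported above, where every COG maps to a single family (or to none).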

#### Degradation Pathways: Annotating Glycoside Hydrolases using dbCAN

KBase identified the following degradation reactions associated with seed compounds:

Group | Substrate
------|---------
group00539 | Maltose
group01023 | Cellobiose
group01186 | Stachyose, Manninotriose
group01316 | Maltose
group01410 | Maltose

Using protein sequences for genes in these COGs, I used the [dbCAN Annotation Website](http://csbl.bmb.uga.edu/dbCAN/annotate.php) to assign these COGs to glycoside hydrolase families, as defined in the Carbohydrate-Active enZYmes Database ([CAZy](http://www.cazy.org/)). I then used HMMER3 with default parameters to assign these genes to individual sub-families, using Hidden Markov Models for the sub-families downloaded from dbCAN.
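
The sub-family assignment can be summarized by keeping the best-scoring profile for each gene. The sketch below assumes the search was run with `hmmscan` and its `--domtblout` option, which writes a whitespace-delimited table with the profile name in the first column, the query sequence in the fourth, and the full-sequence E-value and bit score in the seventh and eighth. The notes do not state which HMMER3 program was used, so this is an assumption; the sample rows, gene names, and profile names are invented for illustration, and the optional E-value cutoff is not part of the described method.

```python
def best_subfamily_hits(domtblout_text, max_evalue=None):
    """Return {gene: best-scoring profile} from hmmscan --domtblout output."""
    best = {}
    for line in domtblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(maxsplit=22)
        if len(fields) < 22:
            continue  # not a complete domain-table row
        profile, gene = fields[0], fields[3]
        evalue, score = float(fields[6]), float(fields[7])
        if max_evalue is not None and evalue > max_evalue:
            continue
        if gene not in best or score > best[gene][1]:
            best[gene] = (profile, score)
    return {gene: hit[0] for gene, hit in best.items()}


# Invented rows in domtblout layout (23 columns, description last).
SAMPLE = """\
# target name  accession  tlen  query name  accession  qlen  E-value  score  bias ...
GH13_20 - 300 gene_a - 620 1.2e-80 268.1 0.1 1 1 3.0e-83 1.2e-80 268.0 0.1 2 299 110 410 108 412 0.95 -
GH13_7 - 310 gene_a - 620 4.0e-30 101.5 0.3 1 1 9.0e-33 4.0e-30 101.2 0.3 5 280 120 400 115 405 0.88 -
GH36_4 - 700 gene_b - 720 2.5e-150 498.7 0.0 1 1 6.0e-153 2.5e-150 498.5 0.0 1 699 10 715 9 716 0.97 -
"""

print(best_subfamily_hits(SAMPLE))
```

When two profiles score similarly for the same gene, as for the row reported as `GH13-7 OR GH13-27` below, a best-hit rule alone may be too coarse, and listing all near-tied profiles would be the safer choice.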

In all cases, all genes within a COG were assigned to the same family. The results:

COG | GH Family | Subfamily | Known Activities | Expression (%ile)
----|-----------|-----------|------------------|-----------
group00539 | GH13 | GH13-7 OR GH13-27 | 3.2.1.20 (alpha-glucosidase) or 3.2.1.28 (trehalase) | 65
group01023 | GH1 | GH1-8 | 3.2.1.21 (beta-glucosidase), 3.2.1.23 (beta-galactosidase), 3.2.1.38 (beta-D-fucosidase) | 15
group01186 | GH36 | GH36-4 | 3.2.1.22 (alpha-galactosidase) | 68
group01316 | GH13 | GH13-104 | 3.2.1.1 (alpha-amylase) | 28
group01410 | GH13 and CBM34 | GH13-20 | 3.2.1.20 (alpha-glucosidase) | 26

In general, the annotations from KBase make sense within the GH families.

## Re-annotation of Transporters

## Limitations of Seed-Set Analysis