# Reverse Ecology and Metatranscriptomics of Uncultivated Freshwater Actinobacteria

## Overview

The previous notebook predicted seed compounds for the acI-A, acI-B, and acI-C composite genomes. An organism's seed set contains all of the metabolites that its metabolic network cannot synthesize. Seed compounds may represent auxotrophies, or compounds that the organism can degrade. In the latter case, genes associated with the degradation of these compounds should be expressed.

However, seed compounds were computed from the composite metabolic network graph of a clade, so an individual reaction in the network graph may be associated with genes from multiple genomes. To overcome this obstacle, we decided to map metatranscriptome samples to the "pan-genome" of each clade.

To construct the pan-genome, we used our reference genome collection to define acI COGs (clusters of orthologous groups), and defined the pan-genome of a clade as the union of all COGs present in at least one genome of that clade. We then used BBMap to map metatranscriptome reads to our reference genome collection, and counted the unique reads that mapped to each actinobacterial COG.
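
The pan-genome definition above amounts to a set union. The following is a minimal sketch of that idea, not the code used in this repository; the function name `clade_pan_genome`, the genome names, and the clade membership list are hypothetical placeholders.

```python
def clade_pan_genome(genome_cogs, clade_genomes):
    """Return every COG present in at least one genome of the clade."""
    pan_genome = set()
    for genome in clade_genomes:
        # A genome with no recorded COGs contributes nothing to the union.
        pan_genome |= genome_cogs.get(genome, set())
    return pan_genome


# Hypothetical toy data: genome name -> set of COGs found in that genome.
genome_cogs = {
    "genomeA": {"group00000", "group00001"},
    "genomeB": {"group00001", "group00002"},
    "genomeC": {"group00003"},
}

# Hypothetical clade made up of two of the three genomes.
clade_members = ["genomeA", "genomeB"]

pan = clade_pan_genome(genome_cogs, clade_members)
print(sorted(pan))
```

A COG found only in `genomeC` is excluded, because that genome is not a member of the clade in this toy example.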

## Protein Clustering

We used OrthoMCL [1] to identify clusters of orthologous groups (COGs) in the set of 36 freshwater acI genomes. OrthoMCL is an algorithm for grouping proteins into orthologous gene families based on sequence similarity. It takes as input a set of protein sequences and returns a list of COGs and the proteins that belong to each COG. The OrthoMCL pipeline consists of the following steps:

1. Format `faa` files to be compatible with OrthoMCL (script `01faaParser.py`).
2. Run all-vs-all BLAST on the concatenated set of protein sequences (script `02parallelBlast`).
3. Initialize the MySQL server to store OrthoMCL output and run OrthoMCL (scripts `setupMySql.sh` and `runOrthoMCL.sh`).
4. Rearrange the OrthoMCL output into a user-friendly format (script `05parseCOGs`). 
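
As an illustration of step 1, OrthoMCL expects FASTA headers of the form `>taxon|proteinID`. The sketch below shows one way such a reformatting could be done. It is not the contents of `01faaParser.py`; the function names, the taxon code `A023`, and the layout of the input header are all assumptions made for the example.

```python
def to_orthomcl_header(header, taxon):
    """Rewrite a FASTA header line as '>taxon|protein_id'.

    Assumes the protein ID is the first whitespace-delimited token
    of the original header.
    """
    protein_id = header.lstrip(">").split()[0]
    return f">{taxon}|{protein_id}"


def reformat_faa(lines, taxon):
    """Yield FASTA lines with headers rewritten; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield to_orthomcl_header(line, taxon)
        elif line:
            yield line


# Hypothetical input record and taxon code.
example = [">AAA023D18.genome.CDS.800 DNA gyrase subunit A\n", "MSDE\n"]
reformatted = list(reformat_faa(example, "A023"))
print("\n".join(reformatted))
```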

Detailed instructions for running these scripts can be found in `code/orthoMCL/README.md`. The outputs of these scripts are described below and are located in `data/orthoMCL`.


#### cogTable.csv
A table listing the locus tags associated with each (genome, COG) pair.

| COG | AAA023D18 | AAA023J06 | AAA024D14 |
|---|---|---|---|
| group00000 | AAA023D18.genome.CDS.1002; AAA023D18.genome.CDS.925; AAA023D18.genome.CDS.939 | AAA023J06.genome.CDS.1227; AAA023J06.genome.CDS.862 |  |
| group00001 | AAA023D18.genome.CDS.800 | AAA023J06.genome.CDS.798 | AAA024D14.genome.CDS.945; AAA024D14.genome.CDS.1601 |

For example, in genome AAA023D18, the following genes belong to `group00000`: AAA023D18.genome.CDS.1002, AAA023D18.genome.CDS.925, and AAA023D18.genome.CDS.939.
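
A table of this shape can be loaded into a nested dictionary for lookups by (COG, genome). The sketch below assumes that `cogTable.csv` is comma-separated, that its first column holds the COG ID, and that locus tags within a cell are separated by semicolons. The sample text is transcribed from the excerpt above, and the function name `read_cog_table` is hypothetical.

```python
import csv
import io

# Sample rows transcribed from the excerpt above (assumed file layout).
SAMPLE = """\
,AAA023D18,AAA023J06,AAA024D14
group00000,AAA023D18.genome.CDS.1002; AAA023D18.genome.CDS.925; AAA023D18.genome.CDS.939,AAA023J06.genome.CDS.1227; AAA023J06.genome.CDS.862,
group00001,AAA023D18.genome.CDS.800,AAA023J06.genome.CDS.798,AAA024D14.genome.CDS.945; AAA024D14.genome.CDS.1601
"""


def read_cog_table(handle):
    """Return {cog: {genome: [locus tags]}} from a cogTable-style CSV."""
    reader = csv.reader(handle)
    genomes = next(reader)[1:]  # header row; first cell is the COG column
    table = {}
    for row in reader:
        if not row:
            continue
        cog, cells = row[0], row[1:]
        table[cog] = {
            genome: [tag.strip() for tag in cell.split(";") if tag.strip()]
            for genome, cell in zip(genomes, cells)
        }
    return table


cog_table = read_cog_table(io.StringIO(SAMPLE))
print(cog_table["group00000"]["AAA023D18"])
```

An empty cell, such as `group00000` in AAA024D14, becomes an empty list, which makes it easy to test whether a COG is present in a genome.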

#### annotTable.csv
A table listing the annotations associated with each (genome, COG) pair.

| COG | AAA023D18 | AAA023J06 | AAA024D14 |
|---|---|---|---|
| group00000 | Short-chain dehydrogenase/reductase in hypothetical Actinobacterial gene cluster; hypothetical protein; 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) | 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100); 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) |  |
| group00001 | DNA gyrase subunit A (EC 5.99.1.3) | DNA gyrase subunit A (EC 5.99.1.3) | DNA gyrase subunit A (EC 5.99.1.3); Topoisomerase IV subunit A (EC 5.99.1.-) |

#### annotSummary.csv
This table lists all annotations associated with the genes in a COG. It can be parsed further by hand to reveal the distribution of annotations within a COG. For example, `group00000` contains 94 genes across 72 genomes, annotated as follows:

| Annotation | Counts |
|------------|--------|
| 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) | 63 |
| Short-chain dehydrogenase/reductase in hypothetical Actinobacterial gene cluster | 11 |
| None Provided | 9 |
| hypothetical protein | 3 |
| COG1028: Dehydrogenases with different specificities (related to short-chain alcohol dehydrogenases) | 2 |
| 2,3-butanediol dehydrogenase, S-alcohol forming, (S)-acetoin-specific (EC 1.1.1.76) | 1 |
| D-beta-hydroxybutyrate dehydrogenase (EC 1.1.1.30) | 1 |
| Oxidoreductase, short chain dehydrogenase/reductase family | 1 |
| Short-chain dehydrogenase/reductase SDR | 1 |
| Acetoacetyl-CoA reductase (EC 1.1.1.36) | 1 |
| short chain dehydrogenase | 1 |

Because 63 of the 94 genes carry the same annotation, this COG appears to be a 3-oxoacyl-[acyl-carrier protein] reductase.
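
The same judgement can be made programmatically by taking the most frequent annotation and its share of the genes in the COG. The sketch below uses the counts from the table above and illustrates the tally only. It does not read `annotSummary.csv`, whose on-disk layout is not described here.

```python
from collections import Counter

# Annotation counts copied from the table above.
annotation_counts = Counter({
    "3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)": 63,
    "Short-chain dehydrogenase/reductase in hypothetical Actinobacterial gene cluster": 11,
    "None Provided": 9,
    "hypothetical protein": 3,
    "COG1028: Dehydrogenases with different specificities (related to short-chain alcohol dehydrogenases)": 2,
    "2,3-butanediol dehydrogenase, S-alcohol forming, (S)-acetoin-specific (EC 1.1.1.76)": 1,
    "D-beta-hydroxybutyrate dehydrogenase (EC 1.1.1.30)": 1,
    "Oxidoreductase, short chain dehydrogenase/reductase family": 1,
    "Short-chain dehydrogenase/reductase SDR": 1,
    "Acetoacetyl-CoA reductase (EC 1.1.1.36)": 1,
    "short chain dehydrogenase": 1,
})

total = sum(annotation_counts.values())
top_annotation, top_count = annotation_counts.most_common(1)[0]
top_fraction = top_count / total

print(f"{top_annotation}: {top_count} of {total} genes")
```

A majority vote like this is only a first pass. Uninformative labels such as "None Provided" or "hypothetical protein" may be worth excluding before choosing a consensus annotation.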

#### genomes/genomeCOGs.txt
This file contains a list of (gene, COG) pairs, giving the COG associated with each gene in the genome. One such file is created per genome. For example:

    AAA023D18.genome.CDS.834,group01620
    AAA023D18.genome.CDS.1427,group00803
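
A gene-to-COG mapping of this kind is what allows per-gene read counts to be summed per COG. The sketch below parses the two example lines above and aggregates a set of made-up read counts. The function name `read_genome_cogs` and the counts in `hypothetical_gene_counts` are assumptions for illustration, not results from this study.

```python
import io
from collections import Counter

# The two example lines shown above.
SAMPLE = (
    "AAA023D18.genome.CDS.834,group01620\n"
    "AAA023D18.genome.CDS.1427,group00803\n"
)


def read_genome_cogs(handle):
    """Return {gene: cog} from 'gene,COG' lines."""
    gene_to_cog = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        gene, cog = line.split(",", 1)
        gene_to_cog[gene] = cog
    return gene_to_cog


gene_to_cog = read_genome_cogs(io.StringIO(SAMPLE))

# Hypothetical per-gene read counts (not real data).
hypothetical_gene_counts = {
    "AAA023D18.genome.CDS.834": 12,
    "AAA023D18.genome.CDS.1427": 5,
}

# Sum the reads of all genes that belong to the same COG.
cog_counts = Counter()
for gene, n_reads in hypothetical_gene_counts.items():
    cog_counts[gene_to_cog[gene]] += n_reads

print(dict(cog_counts))
```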

### References
1. Li, L., Stoeckert, C. J., & Roos, D. S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research, 13(9), 2178–89. http://doi.org/10.1101/gr.1224503

## Metatranscriptomic Mapping

## Clade-Level Gene Expression