Download metagenomic assemblies from ENA and build a gene catalog
- Identify all of the host-associated metagenomes with the ENA API
- Download each metagenome assembly and extract the coding sequences (CDS)
- Aggregate all gene sequences
- Cluster genes at a few different levels of amino acid identity
- Return the non-redundant gene catalogs
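The aggregation step and the strictest identity level can be sketched in plain Python. This is a minimal illustration, not the workflow's actual implementation: it assumes gene IDs are unique across all assemblies, and it treats "100% identity" as exact sequence equality, which is simpler than what dedicated clustering tools do. The helper names `parse_fasta` and `collapse_identical` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (gene_id, sequence) pairs from FASTA-formatted lines."""
    gene_id: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if gene_id is not None:
                yield gene_id, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            tokens = line[1:].split()
            gene_id = tokens[0] if tokens else ""
            chunks = []
        else:
            chunks.append(line)
    if gene_id is not None:
        yield gene_id, "".join(chunks)


def collapse_identical(
    records: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Group genes with identical sequences.

    The first gene seen with a given sequence becomes the representative.
    Returns (representatives, membership), where representatives maps
    representative ID -> sequence and membership maps every input gene ID
    -> its representative ID. Assumes gene IDs are unique.
    """
    rep_for_seq: Dict[str, str] = {}
    representatives: Dict[str, str] = {}
    membership: Dict[str, str] = {}
    for gene_id, seq in records:
        rep = rep_for_seq.setdefault(seq, gene_id)
        if rep == gene_id:
            representatives[gene_id] = seq
        membership[gene_id] = rep
    return representatives, membership
```

Records from several assemblies can be chained into a single iterable (for example with `itertools.chain`) before calling `collapse_identical`, so that duplicates are collapsed across the whole collection rather than per assembly.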
Outputs, for each level of amino acid identity (e.g. 100%, 90%, 80%):
- FASTA of protein sequences
- TSV linking each input gene to its final cluster
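One way the per-level TSV could be produced is sketched below. It assumes, purely for illustration, that clustering is run as a cascade: the genes are clustered at the highest identity, then the representatives of that level are clustered at the next level, and so on. Under that assumption each input gene is linked to its cluster at a lower level by following the chain of representatives. The function names and the `gene`/`cluster` column headers are hypothetical, not taken from the workflow.

```python
import csv
import io
from typing import Dict, List


def compose_levels(levels: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Resolve every input gene to its representative at each level.

    `levels` is ordered from highest to lowest identity. Each dict maps a
    member ID to its representative at that level. The keys of the first
    dict are the input genes; the keys of each later dict are the
    representatives of the previous level (cascade assumption).
    """
    resolved: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for i, mapping in enumerate(levels):
        if i == 0:
            current = dict(mapping)
        else:
            # Follow each gene's previous representative one level down.
            current = {gene: mapping[rep] for gene, rep in current.items()}
        resolved.append(current)
    return resolved


def to_tsv(assignments: Dict[str, str]) -> str:
    """Render gene -> cluster assignments as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "cluster"])  # hypothetical column names
    for gene in sorted(assignments):
        writer.writerow([gene, assignments[gene]])
    return buf.getvalue()
```

With three levels, `compose_levels([at_100, at_90, at_80])` returns three dicts keyed by the original gene IDs, and each can be passed to `to_tsv` to write the table for that level.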