# Amplicon mapping to AGORA2 Genomes (AmtG)
## Notebook 1: Creating a marker gene compendium

#### Kathleen Beilsmith, Argonne National Lab (ANL)
#### ANL Henry Group & UChicago Chang Lab Mouse Gut Microbiome collaboration
#### Data Credit: Megan Kennedy

This notebook takes a collection of genomes in fasta format (in this case, AGORA2 genomes) and generates a compendium of sequences for all copies of a marker gene (in this case, 16S) found in that collection.

**catalogCollection()** takes a directory path and a genome file extension and returns a list of the genome file paths for the collection.

The following three functions are used in a loop to annotate the genomes in the collection:

**getContigs()** returns the sequences for each contig in a genome fasta file.

**setAnnotationParams()** uses the output of getContigs() to set up parameters for the RAST annotation workflow for a genome.

This notebook is set up to annotate **rRNA** genes by running the workflow "call_features_rRNA_SEED" at https://tutorial.theseed.org/services/genome_annotation. To annotate a different kind of marker gene, change the workflow stage passed to setAnnotationParams().

**extractResult()** takes the RAST annotation output and searches for the RAST feature name of the marker gene of interest. It returns a compendium of copies of that marker gene for a genome. Each entry for a copy of the marker gene provides the genome and contig where it was found as well as the strand direction, start position, length, and full sequence.

This notebook is set up to create a compendium of **"SSU rRNA ## 16S rRNA, small subunit ribosomal RNA"** sequences for genomes. For a different marker gene, change the feature name that extractResult() searches for. Because features are matched by their exact RAST feature name, success in finding a marker gene depends on the accuracy and consistency of the annotations.


In [1]:
############################
# Required modules
############################

# Standard library
import glob
import json
import os

# Third-party: RAST client and sequence handling
from modelseedpy import RPCClient
from Bio import SeqIO
from Bio.Seq import Seq

### Start with a collection of genomes saved as fasta files in a directory.

In [2]:
# Given a directory path and a file extension for the fasta genomes,
# this function returns a list of all the genome file paths for the collection.

def catalogCollection(directory, genome_file_extension = ".fna"):
    
    # Make sure the directory path ends with a separator
    if not directory.endswith("/"):
        directory = directory + "/"

    # Make sure the file extension starts with a dot
    if not genome_file_extension.startswith("."):
        genome_file_extension = "." + genome_file_extension
    
    # List and count all files in the directory
    file_paths = os.listdir(directory)
    file_count = len(file_paths)
    
    # List and count all files (genomes) with specified extension in the directory
    genome_file_paths = glob.glob(directory + "*" + genome_file_extension)
    genome_count = len(genome_file_paths)

    print(f"Collection is in directory {directory} with {file_count} file(s). \n")
    print(f"Collection has {genome_count} genome(s) with extension {genome_file_extension}. \n")
    
    return genome_file_paths

In [3]:
my_collection = catalogCollection('/Users/kbeilsmith/Desktop/2024_AmtG/AmtG_0.0.2_test_files/FNA_files')

Collection is in directory /Users/kbeilsmith/Desktop/2024_AmtG/AmtG_0.0.2_test_files/FNA_files/ with 11 file(s). 

Collection has 10 genome(s) with extension .fna. 
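
One way to sanity-check the cataloguing step is to run the same matching logic against a throwaway directory. The sketch below is an illustration only: `list_genomes` is a hypothetical helper (not part of the notebook) that builds the glob pattern with `os.path.join` and sorts the result so the order is reproducible.

```python
import glob
import os
import tempfile


def list_genomes(directory, extension=".fna"):
    """Hypothetical helper: return sorted paths of files ending in `extension`."""
    if not extension.startswith("."):
        extension = "." + extension
    # os.path.join handles a missing or present trailing separator
    return sorted(glob.glob(os.path.join(directory, "*" + extension)))


# Build a temporary collection: two genome files and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.fna", "b.fna", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(">x\nACGT\n")
    found = [os.path.basename(p) for p in list_genomes(tmp, "fna")]

print(found)
```

Only the two `.fna` files are returned; the extra file explains why the directory file count and the genome count can differ, as in the output above.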



### Get a list of contigs in the genome for annotation.

In [4]:
# Given a fasta genome file path,
# this function returns a tuple.
# The tuple contains a genome name (str) and list of contigs (dicts).
# Each contig has a 'dna' entry with the sequence and an 'id' entry with the contig name.

def getContigs(genome_file):
    
    # Use the file name (last path component) as the genome name;
    # for NCBI downloads it starts with the GCA/GCF accession
    genome_name = genome_file.split("/")[-1]
    
    contig_list = []
    
    # SeqIO.parse will read out each contig record in the genome fasta.
    # Each contig record has a sequence ("dna") and a name ("id") different from the genome.
    for contig in SeqIO.parse(genome_file, "fasta"):
        contig_record = {"dna":str(contig.seq), "id":str(contig.id)}
        contig_list.append(contig_record)
    contig_count = len(contig_list)
    
    print(f"Genome {genome_name} has {contig_count} contig(s). \n")
    
    return genome_name, contig_list
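
To make the contig record structure concrete, here is a minimal sketch that builds the same `{"dna": ..., "id": ...}` records from FASTA text using only the standard library. The helper `parse_fasta_text` and the toy sequences are hypothetical; it stands in for `SeqIO.parse` purely for illustration, and the notebook itself uses Biopython.

```python
def parse_fasta_text(text):
    """Hypothetical stand-in for SeqIO.parse: return a list of contig dicts."""
    contigs = []
    current_id = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if current_id is not None:
                contigs.append({"dna": "".join(chunks), "id": current_id})
            # Use the first whitespace-delimited token of the header as the id
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    # Close the final record
    if current_id is not None:
        contigs.append({"dna": "".join(chunks), "id": current_id})
    return contigs


# Toy two-contig genome (made-up sequences)
example_fasta = """>contig_A toy description
ACGTAC
GTTA
>contig_B
GGCC
"""
example_contigs = parse_fasta_text(example_fasta)
print(example_contigs)
```

Each contig keeps its own id, separate from the genome name, which is what later lets a feature be traced back to the contig it came from.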

### Get RAST annotations for features on the contigs.

##### Set parameters for RAST annotation

genome id: use genome_name returned by getContigs()

genetic code: default 11

scientific name: default "Unknown"

domain: default "Bacteria"

contigs: use contig_list returned by getContigs()

features: default [ ]

workflow: default "call_features_rRNA_SEED"

In [5]:
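def setAnnotationParams(genome_id, contig_list, genetic_code=11,
                        sci_name="Unknown", domain="Bacteria", features=None,
                        workflow_name="call_features_rRNA_SEED"):
    # Use None as the default so that a fresh list is created on every call
    # (a list default would be shared between calls)
    if features is None:
        features = []

    # Genome object to annotate
    genome = {
                "id":str(genome_id),
                "genetic_code":genetic_code,
                "scientific_name":sci_name,
                "domain":domain,
                "contigs":contig_list,
                "features":features
            }
    # Annotation workflow with a single stage
    workflow = {
                "stages":[
                {"name":workflow_name},
                ]
            }
    return [genome,workflow]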

### Prune RAST annotations to feature of interest (marker gene) and extract location information.

In [6]:
def extractResult(SEED_output, marker_gene_compendium = {}, 
                 geneAnnotation = "SSU rRNA ## 16S rRNA, small subunit ribosomal RNA"):
    # NOTE: the default dict above is created once and shared between calls,
    # so entries accumulate across calls that do not pass a compendium.
    # Pass a dict explicitly to control where results are stored.

    genome = SEED_output[0]
    genome_id = genome["id"]

    for r in genome['features']:
        # Keep only features whose annotation matches the marker gene
        # (.get() skips features that have no 'function' entry)
        if r.get('function') != geneAnnotation:
            continue
        print(r['function'])
        print(f"Genome ID: {genome_id}")

        # Each location entry is [contig id, position, strand, length]
        contig_id, position, direction, length = r['location'][0][:4]

        # Based on location, get the full contig sequence
        contig_seq = None
        for contig in genome["contigs"]:
            if contig["id"] == contig_id:
                print(f"Contig ID: {contig_id}")
                contig_seq = contig["dna"]
                break
        if contig_seq is None:
            print(f"Contig {contig_id} not found; skipping feature. \n")
            continue

        # Feature sequence extraction depends on strand direction.
        # The reported position is used directly as a Python slice index;
        # check this against the coordinate convention of your RAST output.
        if direction == "+":
            start_loc = position
            end_loc = start_loc + length
            feature_seq = contig_seq[start_loc:end_loc]
        elif direction == "-":
            end_loc = position
            start_loc = end_loc - length
            feature_seq = str(Seq(contig_seq[start_loc:end_loc]).reverse_complement())
        else:
            print(f"Unrecognized direction {direction}; skipping feature. \n")
            continue

        print(f"Direction: {direction}")
        print(f"Start: {start_loc}")
        print(f"Length: {length}")
        print(f"End: {end_loc}")
        print(f"Sequence: {feature_seq} \n")

        # The key records where this copy of the marker gene was found
        unique_feature_id = f"{genome_id};{contig_id};{direction};{start_loc};{length}"
        marker_gene_compendium[unique_feature_id] = feature_seq
    return marker_gene_compendium
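
The strand-dependent slicing in extractResult() can be checked on a toy contig without calling RAST. The sketch below mirrors the slicing as written in this notebook, using the reported position directly as a Python index; it does not settle which coordinate convention the RAST output follows. The names `reverse_complement` and `slice_feature`, the toy contig, and the toy locations are all hypothetical, and the complement table assumes only unambiguous A/C/G/T bases.

```python
# Complement table for unambiguous DNA bases (assumption: no IUPAC codes)
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a plain A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def slice_feature(contig_seq, location):
    """Mirror the notebook's slicing for one [contig, position, strand, length] entry."""
    _contig_id, position, strand, length = location
    if strand == "+":
        # Forward strand: the position is the slice start
        return contig_seq[position:position + length]
    if strand == "-":
        # Reverse strand: the position is the slice end, then flip the strand
        return reverse_complement(contig_seq[position - length:position])
    raise ValueError(f"Unrecognized strand: {strand!r}")


toy_contig = "TTACGGATCCAA"
plus_seq = slice_feature(toy_contig, ["toy", 2, "+", 4])   # toy_contig[2:6]
minus_seq = slice_feature(toy_contig, ["toy", 8, "-", 4])  # revcomp of toy_contig[4:8]
print(plus_seq, minus_seq)
```

Running a known sequence through both branches like this is a cheap way to confirm that minus-strand features come out in the same 5' to 3' orientation as plus-strand ones.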

In [8]:
##################################################
# RAST annotate contigs of the genome
##################################################

rRNA_dict = {}
marker_annotation = "SSU rRNA ## 16S rRNA, small subunit ribosomal RNA"

# Create the annotation client once and reuse it for every genome:
client = RPCClient("https://tutorial.theseed.org/services/genome_annotation")

# For each genome:
for a_genome in my_collection:
    
    # Get the genome name (str) and contigs (list of dicts):
    genome_name, contig_list = getContigs(a_genome)
    
    # Use the genome name and contigs to fill out the parameters for the RAST call:
    params = setAnnotationParams(genome_name, contig_list)
    # Annotation results:
    result = client.call("GenomeAnnotation.run_pipeline", params)
    
    # Count and report the number of copies of the marker gene in the genome:
    feature_count = sum(1 for r in result[0]["features"]
                        if r.get("function") == marker_annotation)
    print(f"Annotated {genome_name} with {feature_count} copies of marker gene. \n")
    
    # Add this genome's marker gene copies to the compendium.
    # rRNA_dict is passed explicitly so the results do not depend on
    # the function's shared default dict.
    rRNA_dict = extractResult(result, marker_gene_compendium=rRNA_dict,
                              geneAnnotation=marker_annotation)

Genome GCA_000210035.1_ASM21003v1_genomic.fna has 1 contig(s). 

Annotated GCA_000210035.1_ASM21003v1_genomic.fna with 1 copies of marker gene. 

SSU rRNA ## 16S rRNA, small subunit ribosomal RNA
Genome ID: GCA_000210035.1_ASM21003v1_genomic.fna
Contig ID: FP929055.1
Direction: +
Start: 147
Length: 1539
End: 1686
Sequence: CACTTTTAACGAGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCACCTTGATTTGATTCTTCGGATGAAGATCCTGGTGACTGAGTGGCGGACGGGTGAGTAACGCGTGGGTAACCTGCCTCATACAGGGGGATAACAGTTAGAAATGACTGCTAATACCGCATAAGACCACAGCACCGCATGGTGCAGGGGTAAAAACTCCGGTGGTATGAGATGGACCCGCGTCTGATTAGGTAGTTGGTGGGGTAACGGCCTACCAAGCCGACGATCAGTAGCCGACCTGAGAGGGTGACCGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGATGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGAGTGGCAAGTCTGATGTGAAAACCCGGGGCTCAACCCCGGGACTGCATTGGAAACTGTCAATCTGGAGTACCGGAGAGGTAAGCGGAATTCCTA

Annotated GCF_000157535.1_ASM15753v1_genomic.fna with 1 copies of marker gene. 

SSU rRNA ## 16S rRNA, small subunit ribosomal RNA
Genome ID: GCF_000157535.1_ASM15753v1_genomic.fna
Contig ID: NZ_GG688487.1
Direction: -
Start: 2137
Length: 1573
End: 3710
Sequence: TGTTCGAACTTTTTATGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGCTTCTTTTTCCACCGGAGCTTGCTCCACCGGAAAAAGAGGAGTGGCGAACGGGTGAGTAACACGTGGGTAACCTGCCCATCAGAAGGGGATAACACTTGGAAACAGGTGCTAATACCGTATAACAATCGAAACCGCATGGTTTTGATTTGAAAGGCGCTTTCGGGTGTCGCTGATGGATGGACCCGCGGTGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCCACGATGCATAGCCGACCTGAGAGGGTGATCGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTAGAGAAGAACAAGGATGAGAGTAACTGTTCATCCCTTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATA

Annotated GCA_000210015.1_ASM21001v1_genomic.fna with 0 copies of marker gene. 

Genome GCA_000209835.1_ASM20983v1_genomic.fna has 1 contig(s). 

Annotated GCA_000209835.1_ASM20983v1_genomic.fna with 0 copies of marker gene. 

Genome GCA_000185445.1_ASM18544v1_genomic.fna has 5 contig(s). 

Annotated GCA_000185445.1_ASM18544v1_genomic.fna with 1 copies of marker gene. 

SSU rRNA ## 16S rRNA, small subunit ribosomal RNA
Genome ID: GCA_000185445.1_ASM18544v1_genomic.fna
Contig ID: GL622347.1
Direction: +
Start: 171
Length: 1546
End: 1717
Sequence: TTTTTTTGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGATGAAGCCGCAGCTTGCTGTGGTGGATTAGTGGCGAACGGGCGAGTAACACGTGAGTAACCTGTCCTTTTCTTTGGGATAACGGCTGGAAACGGCTGCTAATACTGGATATTCAGGCGTCACCGCATGGTGGTGTTTGGAAAGGTTTTTTCTGGGATTGGGTGGGCTCGCGGCCTATCAGCTTGTTGGTGGGGTGATGGCTTACCAAGGCTTTGACGGGTAGCCGGCCTGAGAGGGTGGTCGGTCGCACTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGACGAAAGTTTGATGCAGCGACGCCGCGTGGAGGGTGTAGGCCTTCGGGTTGTGAACTCCTTTTTCTCG

### Write out the marker gene compendium to use in Notebook 2.

In [9]:
# Make a copy of the compendium, converting the values (sequences) to strings
rRNA_dict_JSON = {key: str(seq) for key, seq in rRNA_dict.items()}

# Write out
output_json = "/Users/kbeilsmith/Desktop/2024_AmtG/AmtG_0.0.2_test_files/test_Compendium.json"
with open(output_json, "w") as outfile:
    json.dump(rRNA_dict_JSON, outfile)
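
Because each compendium key packs the genome, contig, strand, start, and length into one `;`-separated string, the written JSON can be read back and summarized without re-running the annotation. The sketch below is one possible approach, not part of the pipeline: `parse_feature_id` is a hypothetical helper, the toy compendium is made-up data, and it assumes that genome and contig names never contain a `;`.

```python
import json
from collections import Counter


def parse_feature_id(feature_id):
    """Hypothetical helper: split a compendium key back into its five fields."""
    # Assumption: genome and contig names contain no ';'
    genome_id, contig_id, direction, start, length = feature_id.split(";")
    return {
        "genome_id": genome_id,
        "contig_id": contig_id,
        "direction": direction,
        "start": int(start),
        "length": int(length),
    }


# Toy compendium with made-up keys and sequences
toy_compendium = {
    "genomeA.fna;contig1;+;147;1539": "ACGT",
    "genomeA.fna;contig2;-;2137;1573": "GGCC",
    "genomeB.fna;contig9;+;171;1546": "TTAA",
}

# Round-trip through JSON text, as writing and re-reading the file would
round_tripped = json.loads(json.dumps(toy_compendium))

# Count marker gene copies per genome
copies_per_genome = Counter(parse_feature_id(key)["genome_id"] for key in round_tripped)
print(dict(copies_per_genome))
```

A per-genome copy count like this is a quick check that genomes reported with zero copies during annotation are indeed absent from the compendium.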

In [10]:
import sys
print(sys.version)

3.7.9 (default, Aug 31 2020, 07:22:35) 
[Clang 10.0.0 ]
