# 02 Retrieving GenBank Annotations

## Pyrewton module genbank, submodule get_genbank_annotations

This notebook describes how all protein annotations are retrieved from GenBank files. This retrieval is part of the process of summarising the CAZymes within GenBank files, which is done by identifying annotated coding sequences that are linked to any entry in the [CAZy database](http://www.cazy.org/).

This notebook refers to the `pyrewton` submodule `get_genbank_annotations`.


<div class="alert-danger">
<p></p>
<b>Note:</b> this notebook does not include the entirety of the script code, but instead includes excerpts to help illustrate the programme architecture and function. Specifically, the function _main_, which orchestrates the calling of functions to perform the overall operation of the script, is excluded from this notebook, as are logging and error checking.
</div>
<p></p>
<div class="alert-danger">    
Additionally, in some instances sections of code have been removed and replaced by a comment indicating the intent or function of the code, and/or are introduced later in the notebook than they appear in the script. This enables a detailed description of the code's function at a more logical and opportune time.
</div>
<p></p>
<div class="alert-success">
<p></p>
<p>For the complete script, navigate to `pyrewton/genbank/get_genbank_annotations` within the repository.</p>
<p></p>
</div>

## Contents

- [Operating the script](#linkoperating)
- [Command-line options](#linkcommand)
- [Python imports](#linkimports)
- [Create a foundation dataframe](#linkparsing)
- [Coordination of protein annotations retrieval](#linkcoordination)
- [Annotation retrieval](#linkannotation)


<a id="linkoperating"></a>

## Operating the script


The script `get_genbank_annotations.py` is written as a command-line programme and thus is most easily operated via the command-line. The standard structure of the command-line call to operate the script is:
`python3 get_genbank_annotations.py <other options>`

Multiple options are available to customise the operation of the script to the user's needs. A full list of the available options is included in the documentation at [Read the Docs](https://phd-project-scripts.readthedocs.io/en/latest/genbank.html#get-genbank-annotations), as well as further on in this notebook.


<div class="alert-info">
    To retrieve protein annotations from the GenBank files retrieved previously (see the [notebook 01_downloading_genbank_files](WWWWWWWW)), the following code was run at the command-line:
</div>

> `python3 get_genbank_annotations -d get_ncbi_genomes/2020_05_31_genome_dataframe.csv -g get_ncbi_genomes/2020_05_31_GenBank_file_pulldown -l get_genbank_annotations/2020_07_21_genbank_annotations_log.log -o get_genbank_annotations/2020_07_21_genbank_annotaitons_dataframe.csv -v`

<a id="linkcommand"></a>

## Command-line options

The script is designed as a command-line programme, and thus operation of the script is customisable by passing arguments from the command-line.

**Compulsory argument**<br>
The option `-u` or `--user` <font color=red>**must**</font> be used to provide the user's email address, because this is a requirement of Entrez, which is used to call NCBI.

**Optional arguments**<br>

`-d, --df_input`<br>
&emsp;&emsp;Path to input dataframe.

`-f, --force`<br>
&emsp;&emsp;Enable writing in the specified output directory if the output directory already exists.

`-g, --genbank`<br>
&emsp;&emsp;Path to directory containing GenBank files.

`-h, --help`<br>
&emsp;&emsp;Display help messages and exit.


`-l, --log`<br>
&emsp;&emsp;Specify name of log file (with extension). If only a filename is given, the log file will be written out to the current working directory; otherwise provide the path including the filename. If no option is given, no log file will be written out; however, logs will still be printed to the terminal.

`-n, --nodelete`<br>
&emsp;&emsp;Enable not deleting files in an existing output directory. If enabled, and the output directory exists and writing to it is ‘forced’, files in the output directory will not be deleted and new files will be written to the output directory.

`-o, --output`<br>
&emsp;&emsp;Specify filename (with extension) of output file. If no option is given, output will be written to STDOUT.

`-v, --verbose`<br>
&emsp;&emsp;Enable verbose logging - changes logger level from WARNING to INFO.
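
The optional arguments above map naturally onto `argparse`. The following is a minimal sketch of such a parser, covering only the optional arguments listed above. It is not the parser shipped in `pyrewton.parsers.parser_get_genbank_annotations`: the function name `build_parser_sketch`, the defaults, the help strings and the example file names are all assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser_sketch():
    """Return an argparse parser mirroring the optional arguments listed above."""
    parser = argparse.ArgumentParser(
        prog="get_genbank_annotations.py",
        description="Retrieve protein annotations from GenBank files",
    )
    parser.add_argument("-d", "--df_input", type=Path, help="Path to input dataframe")
    parser.add_argument(
        "-f", "--force", action="store_true",
        help="Allow writing in an existing output directory",
    )
    parser.add_argument(
        "-g", "--genbank", type=Path, help="Directory containing GenBank files"
    )
    parser.add_argument(
        "-l", "--log", type=Path, default=None, help="Log file (default: no log file)"
    )
    parser.add_argument(
        "-n", "--nodelete", action="store_true",
        help="Do not delete files in an existing output directory",
    )
    parser.add_argument(
        "-o", "--output", type=Path, default=None, help="Output file (default: STDOUT)"
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="Log at INFO instead of WARNING"
    )
    return parser


# Parse an explicit argument list (hypothetical file names) instead of sys.argv
args = build_parser_sketch().parse_args(
    ["-d", "genome_dataframe.csv", "-g", "genbank_files", "-v"]
)
```

Flags that are not passed fall back to their defaults, so here `args.force` and `args.nodelete` are `False`, and `args.log` and `args.output` are `None`.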

<a id="linkimports"></a>

## Python imports

In [None]:
import gzip
import logging
import sys

from pathlib import Path
from typing import List, Optional

import pandas as pd

from Bio import SeqIO
from tqdm import tqdm

from pyrewton.file_io import make_output_directory, write_out_dataframe
from pyrewton.loggers import build_logger
from pyrewton.parsers.parser_get_genbank_annotations import build_parser


<a id="linkparsing"></a>

## Create a foundation dataframe


Prior to collecting the protein annotations from the GenBank files, an empty dataframe is created, to which all retrieved annotation data can be added.

Then the input dataframe, containing the taxonomic and genomic accession numbers, is parsed, retrieving the protein annotations for one species at a time. This one-at-a-time approach facilitates the close association of annotation data with the host species' taxonomic data.


In [None]:
def create_dataframe(input_df, args, logger):

    # Create empty dataframe to add data to
    protein_annotation_df = pd.DataFrame(
        columns=[
            "Genus",
            "Species",
            "NCBI Taxonomy ID",
            "NCBI Accession Number",
            "NCBI Protein ID",
            "Locus Tag",
            "Gene Locus",
            "NCBI Recorded Function",
            "Protein Sequence",
        ]
    )

    # Retrieve the annotations for one species (one row of input_df) at a
    # time and add them to the dataframe
    for df_index in range(len(input_df["Genus"])):
        protein_annotation_df = pd.concat(
            [
                protein_annotation_df,
                get_genbank_annotations(input_df.iloc[df_index], args, logger),
            ],
            ignore_index=True,
        )

    return protein_annotation_df
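
The loop above grows the master dataframe one species at a time. As an alternative sketch, the per-species dataframes can be collected in a list and concatenated once at the end, which avoids copying the growing dataframe on every iteration. The names `collect_annotations` and `fetch_species_annotations` are hypothetical and are not part of `pyrewton`. The callable stands in for `get_genbank_annotations` with `args` and `logger` already bound.

```python
import pandas as pd

COLUMNS = [
    "Genus",
    "Species",
    "NCBI Taxonomy ID",
    "NCBI Accession Number",
    "NCBI Protein ID",
    "Locus Tag",
    "Gene Locus",
    "NCBI Recorded Function",
    "Protein Sequence",
]


def collect_annotations(input_df, fetch_species_annotations):
    """Build one dataframe from the per-species dataframes returned by the callable."""
    # One dataframe per row (species) of the input dataframe
    species_frames = [
        fetch_species_annotations(row) for _, row in input_df.iterrows()
    ]
    # pd.concat raises ValueError on an empty list, so handle that case explicitly
    if not species_frames:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(species_frames, ignore_index=True)
```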

<a id="linkcoordination"></a>

## Coordination of protein annotations retrieval


The function `get_genbank_annotations` coordinates the retrieval and processing of retrieved annotations from all GenBank files for a given species.

The function creates an empty dataframe (identical to the one above) to which the data collected from the annotations is added.

The function then separates the human-readable list of accession numbers into a Python list. It iterates over this list to retrieve and collate the protein annotations for each accession number.
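
As a small illustration of this step, the sketch below splits one cell of comma-separated accession numbers into a list. The helper name `split_accessions` and the accession numbers are hypothetical. Unlike a plain `split(", ")`, this variant also tolerates a missing or extra space around the comma, which is an assumption about the input rather than something the script requires.

```python
def split_accessions(cell):
    """Split a comma-separated cell of accession numbers into a list."""
    # Strip surrounding whitespace and drop empty items (e.g. a trailing comma)
    return [item.strip() for item in cell.split(",") if item.strip()]


# Hypothetical accession numbers, for illustration only
accessions = split_accessions("GCA_000000001.1, GCA_000000002.1")
```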


In [None]:
def get_genbank_annotations(df_row, args, logger):

    # Create empty dataframe to store data in
    protein_data_df = pd.DataFrame(
        columns=[
            "Genus",
            "Species",
            "NCBI Taxonomy ID",
            "NCBI Accession Number",
            "NCBI Protein ID",
            "Locus Tag",
            "Gene Locus",
            "NCBI Recorded Function",
            "Protein Sequence",
        ]
    )

    # convert human readable list of accession numbers into Python list
    accession_list = df_row[3].split(", ")



If no protein data was retrieved from the respective GenBank file for the current accession number, a default 'no data' row is added to the dataframe for the host species.


<div class="alert-warning">
Retrieval of no protein data is frequently the result of an unannotated or poorly annotated genomic assembly submission to NCBI. With the capacity for genome sequencing far outweighing the capacity for automated and manual annotation, many genomes are not fully annotated with CDS features.
</div>

In [None]:

    if len(protein_data) == 0:
        # log warning
        # Add null values to dataframe for the accession number
        new_df_row = {
            "Genus": df_row[0],
            "Species": df_row[1],
            "NCBI Taxonomy ID": df_row[2],
            "NCBI Accession Number": accession,
            "NCBI Protein ID": "NA",
            "Locus Tag": "NA",
            "Gene Locus": "NA",
            "NCBI Recorded Function": "NA",
            "Protein Sequence": "NA",
        }
        protein_data_df = pd.concat(
            [protein_data_df, pd.DataFrame([new_df_row])], ignore_index=True
        )



If data is retrieved from the GenBank file, the data is added to the appropriate key/value pair in a dictionary, to form the data for a new row in the dataframe of protein data for the host species.


In [None]:
    else:
        # protein_index is the position of each protein within protein_data
        for protein_index in tqdm(
            range(len(protein_data)),
            desc=f"Getting proteins {df_row[2]}-{accession}",
        ):
            # Compile data for new row to be added to dataframe
            new_df_row = {
                "Genus": df_row[0],
                "Species": df_row[1],
                "NCBI Taxonomy ID": df_row[2],
                "NCBI Accession Number": accession,
                "NCBI Protein ID": protein_data[protein_index][0],
                "Locus Tag": protein_data[protein_index][1],
                "Gene Locus": protein_data[protein_index][2],
                "NCBI Recorded Function": protein_data[protein_index][3],
                "Protein Sequence": protein_data[protein_index][4],
            }

            # Add new row to dataframe
            protein_data_df = pd.concat(
                [protein_data_df, pd.DataFrame([new_df_row])], ignore_index=True
            )



Then the dataframe containing all retrieved protein data for the given species is returned and appended to the master dataframe created in the `create_dataframe` function.
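
The two branches above differ only in where the five protein values come from. A minimal sketch of a shared helper is given below. `build_rows` is a hypothetical name, not a `pyrewton` function, and the taxonomic and accession values in the usage lines are illustrative only.

```python
NULL_PROTEIN = ["NA", "NA", "NA", "NA", "NA"]
PROTEIN_COLUMNS = [
    "NCBI Protein ID",
    "Locus Tag",
    "Gene Locus",
    "NCBI Recorded Function",
    "Protein Sequence",
]


def build_rows(genus, species, tax_id, accession, protein_data):
    """Return one row (dict) per protein, or a single 'NA' row if there are none."""
    # Fall back to one null protein when nothing was retrieved
    proteins = protein_data if len(protein_data) > 0 else [NULL_PROTEIN]
    rows = []
    for protein in proteins:
        row = {
            "Genus": genus,
            "Species": species,
            "NCBI Taxonomy ID": tax_id,
            "NCBI Accession Number": accession,
        }
        # Pair each protein value with its column name
        row.update(zip(PROTEIN_COLUMNS, protein))
        rows.append(row)
    return rows


# Illustrative values only
no_data_rows = build_rows("Aspergillus", "niger", "5061", "GCA_000000001.1", [])
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in one call, rather than adding one row at a time.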


<a id="linkannotation"></a>

### Annotation retrieval


The function `get_annotations` coordinates the retrieval of annotations from a single GenBank file, named by the accession number passed to the function.

Initially, the function checks that an accession number is provided. If none is provided, the default null values for all protein data are returned for the accession number.


In [None]:
def get_annotations(accession_number, args, logger):

    # check if accession number was provided
    if accession_number == "NA":
        logger.warning(
            (
                f"Null value ('NA') was contained in cell for {accession_number}, "
                "exiting retrieval of protein data.\nReturning null ('NA') value "
                "for all protein data"
            )
        )
        return ["NA", "NA", "NA", "NA", "NA"]


If an accession number is passed to the function, the associated GenBank file is retrieved, using the `get_genbank_file` function. 


If no GenBank file, multiple files, or an empty file (i.e. one that is 0 bytes in size) is found, null values are returned for all protein data for the accession number. The latter two tests are performed by `get_genbank_file`.
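
The three failure cases can be sketched with `pathlib` alone. The function below is not the `get_genbank_file` implementation from `pyrewton`: the name `find_genbank_file`, its signature, and the assumption that file names begin with the accession number are all hypothetical.

```python
from pathlib import Path


def find_genbank_file(accession, genbank_dir):
    """Return the single non-empty file whose name starts with accession, else None."""
    # Assumed naming scheme: the file name starts with the accession number
    matches = sorted(Path(genbank_dir).glob(f"{accession}*"))
    # No file or multiple files: the accession cannot be resolved unambiguously
    if len(matches) != 1:
        return None
    # Empty file (0 bytes): nothing to parse
    if matches[0].stat().st_size == 0:
        return None
    return matches[0]
```

Returning `None` for every failure case lets the caller use a single `is None` test before falling back to null values.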


In [None]:

    gb_file = get_genbank_file(accession_number, args, logger)

    if gb_file is None:
        # error logging performed in get_genbank_file()
        return ["NA", "NA", "NA", "NA", "NA"]
    


If a GenBank file is retrieved, each record is parsed, retrieving the data from all 'CDS' features (i.e. all proteins) annotated within the GenBank file. The retrieval of the data for each feature is performed by the `get_record_feature` function.
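
Biopython stores the qualifiers of a `SeqFeature` in `feature.qualifiers`, a dictionary mapping each qualifier name to a list of strings, and the position in `feature.location`. The sketch below shows one way a single value could be read with a fall-back to "NA". It is not the `get_record_feature` implementation from `pyrewton`, and the name `get_feature_value_sketch` is hypothetical. A stand-in object with made-up values replaces a real `SeqFeature`, so the example runs without a GenBank file.

```python
from types import SimpleNamespace


def get_feature_value_sketch(feature, qualifier):
    """Return one value from a CDS feature, or "NA" if it is missing."""
    # The location is an attribute of the feature, not a qualifier
    if qualifier == "location":
        location = getattr(feature, "location", None)
        return str(location) if location is not None else "NA"
    # Qualifier values are lists of strings; take the first entry if present
    values = feature.qualifiers.get(qualifier)
    if not values:
        return "NA"
    return values[0]


# Stand-in for a Biopython SeqFeature; all values are made up
cds = SimpleNamespace(
    type="CDS",
    location="[0:300](+)",
    qualifiers={"protein_id": ["XXX00001.1"], "product": ["hypothetical protein"]},
)
names = ("protein_id", "locus_tag", "location", "product", "translation")
protein_data = [get_feature_value_sketch(cds, name) for name in names]
```

Because every missing value becomes "NA", the list always holds five items, in the same order as the protein columns of the dataframe.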


In [None]:

    # create list to store all protein data retrieved from GenBank file,
    # with one inner list of five values per protein
    all_protein_data = []

    # Retrieve protein data from GenBank file
    with gzip.open(gb_file, "rt") as handle:
        for gb_record in SeqIO.parse(handle, "genbank"):
            for (index, feature) in enumerate(gb_record.features):
                # empty protein data list so as not to contaminate data of next protein
                protein_data = []
                # Parse over only protein encoding features (type = 'CDS')
                if feature.type == "CDS":
                    # extract protein ID
                    protein_data.append(
                        get_record_feature(feature, "protein_id", logger)
                    )
                    # extract locus tag
                    protein_data.append(
                        get_record_feature(feature, "locus_tag", logger)
                    )
                    # extract location
                    protein_data.append(get_record_feature(feature, "location", logger))
                    # extract annotated function of product
                    protein_data.append(get_record_feature(feature, "product", logger))
                    # extract protein sequence
                    protein_data.append(
                        get_record_feature(feature, "translation", logger)
                    )


While still processing the GenBank file, the function checks whether only null values were returned (i.e. "NA" for all protein data) or whether an error occurred and not all data was retrieved. In either case the data for that feature is not added to the list of protein data. Otherwise the data retrieved from the GenBank file is returned to the `get_genbank_annotations` function for addition to the growing dataframe of all protein data for a given species.
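
The check described above can be condensed into a single predicate. This is a sketch only: `keep_protein` is a hypothetical helper, not part of `pyrewton`, and it leaves out the logging that the script performs.

```python
NULL_PROTEIN = ["NA"] * 5


def keep_protein(protein_data):
    """Return True only if all five values were collected and not all are "NA"."""
    # A length other than five means one of the appends failed, which would
    # misalign the dataframe columns; an all-"NA" list carries no information
    return len(protein_data) == 5 and protein_data != NULL_PROTEIN
```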


In [None]:

    # This check is made for each CDS feature, while the GenBank file is
    # still being processed
    if len(protein_data) == 5:
        # if null value was returned for every feature attribute log warning
        # and don't add to all_protein_data list
        if protein_data == ["NA", "NA", "NA", "NA", "NA"]:
            logger.warning(
                f"No data retrieved from CDS type feature, index: {index}",
                exc_info=1,
            )
        # if some data retrieved, add to all_protein_data list
        else:
            all_protein_data.append(protein_data)

    else:
        # error occurred in that one of the appending actions failed to append
        # and would lead to misalignment in the dataframe if added to the
        # all_protein_data list
        logger.warning(
            (
                f"Error occurred during retrieval of data from feature, {index}\n"
                f"for {accession_number}. Returning no protein data"
            )
        )

    return all_protein_data
