# 01 Downloading GenBank files

## pyrewton module genbank submodule get_ncbi_genomes

This notebook describes the process of downloading GenBank files for all genomic assemblies associated with each species of interest.

This notebook refers to the `pyrewton` submodule `get_ncbi_genomes`.


In [None]:
<div class="alert-warning">
<p></p>
<b>Note:</b> the genomic assemblies are downloaded as GenBank Flat Files (.gbff). This format was chosen because it provides both nucleotide sequences and annotation data. The annotations include the product of each annotated nucleotide sequence and a unique identifier, which can be used to find further information about the annotated object in NCBI-linked databases such as [UniProtKB](https://www.uniprot.org/). Additionally, the BioPython module [SeqIO](https://biopython.org/wiki/SeqIO), a standard ‘Sequence Input/Output’ interface for sequence-handling Python scripts, facilitates the automated parsing of GenBank files (including GenBank Flat Files) and the extraction of specified data (BioPython, 2020).
<p></p>
</div>

<div class="alert-danger">
<p></p>
<b>Note:</b> this notebook does not include the entirety of the script code, but instead includes excerpts to help illustrate the programme architecture and function. Specifically, the function _main_, which orchestrates the calling of functions to perform the overall operation of the script, is excluded from this notebook, as are logging and error checking.
</div>
<p></p>
<div class="alert-danger">    
Additionally, in some instances sections of code have been removed and replaced by a comment indicating their intent or function, and/or are introduced later in the notebook than their position in the script. This enables a detailed description of the code's function at a more logical and opportune point.
</div>
<p></p>
<div class="alert-success">
<p></p>
<p>For the complete script, navigate to `pyrewton/genbank/get_ncbi_genomes` within the repository.</p>
<p></p>
</div>

## Contents


- [Operating the script](#linkoperating)
- [Script input](#linkinput)
- [Command-line options](#linkcommand)
- [Python imports](#linkimports)
- [Parsing the input file and creating the dataframe](#linkparsing)
    - [Scientific name and Taxonomy ID retrieval](#linkscientific)
    - [Creating the dataframe](#linkcdf)
- [Retrieving accession numbers](#linkretrieving)
    - [Retrieval of assembly IDs](#linkassemblyids)
    - [Retrieval of the accession numbers](#linkanretrieve)
- [Downloading GenBank files](#linkgenbank)
    - [Compiling the URL](#linkurl)
    - [Downloading the GenBank file](#linkdownload)
- [Writing out the dataframe](#linkoutdf)


<a id="linkoperating"></a>

## Operating the script

The script `get_ncbi_genomes.py` is written as a command-line programme and is therefore most easily operated from the command line. The standard structure of the command-line call is:
`python3 Extract_genomes_NCBI.py -u <user email> <other options>`

Multiple options are available to customise the operation of the script to the user's needs. A full list of the available options is included in the README.

An email address must be provided because this is a requirement for accessing the NCBI database remotely using `Entrez`.


<div class="alert-info">
    For the downloading of the GenBank files for the PhD project, the following command was run at the command line:
</div>

> `python3 Extract_genomes_NCBI.py -u eemh1@st-andrews.ac.uk -i selected_species_list.txt -o 2020_05_31_GenBank_file_pulldown/ -l 2020_05_31_GB_file_download -v -d 2020_05_31_genome_dataframe.csv`

<div class="alert-danger">
Note: Before using the script `get_ncbi_genomes.py`, read the documentation for Entrez, taking care to note the expected practices laid out under _'Frequency, Timing and Registration of E-utility URL Requests'_. Failing to meet the expected practices can result in restricted or banned access to the Entrez utility.
</div>

Entrez documentation can be found at the following links:
- ['Entrez documentation'](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
- ['BioPython Entrez module documentation'](https://biopython.org/DIST/docs/api/Bio.Entrez-module.html)

<a id="linkinput"></a>

## Script input

The script takes a plain text file as input, containing a list of the species of interest. Each line contains a unique species, identified by its scientific name or NCBI taxonomy ID (including the 'NCBI:txid' prefix). Lines beginning with '#' are treated as comments and skipped. An input file template can be found within the `get_ncbi_genomes` directory within the repository.


<div class="alert alert-warning">
For the downloading of the GenBank files for the PhD project, the file `selected_species_list.txt` within the `get_ncbi_genomes` directory was used as the input file.
</div>

<a id="linkcommand"></a>

## Command-line options

The script is designed as a command-line programme, so its operation is customised by passing arguments at the command line.

**Compulsory argument**<br>
The option `-u` or `--user` <font color=red>**must**</font> be used to provide the user's email address, because this is a requirement of Entrez, which is used to call NCBI.

**Optional arguments**<br>

`-d, --dataframe`<br>
&emsp;&emsp;Specify the output path for the dataframe (include the file extension). If not provided, the dataframe will be written to STDOUT.

`-f, --force`<br>
&emsp;&emsp;Enable writing to the specified output directory if it already exists.

`-g, --genbank`<br>
&emsp;&emsp;Enable or disable downloading of GenBank files.

`-h, --help`<br>
&emsp;&emsp;Display help messages and exit

`-i, --input`<br>
&emsp;&emsp;Specify the path to the input file (include the extension). If not given, input will be taken from STDIN.

`-l, --log`<br>
&emsp;&emsp;Specify the name of the log file (include the extension). If this option is not given no log file will be written out;
however, logs will still be printed to the terminal.

`-n, --nodelete`<br>
&emsp;&emsp;Enable not deleting files in an existing output directory. If enabled, and the output directory exists and writing to it is 'forced', then files already in the output directory will not be deleted, and new files will be written alongside them.

`-o, --output`<br>
&emsp;&emsp;Specify filename (including extension) of output file. If not given, output will be written to STDOUT.
If only the filename is given, Extract_genomes_NCBI.py

`-r, --retries`<br>
&emsp;&emsp;Specify the maximum number of times a call to NCBI is retried before it is abandoned. The default is a maximum of 10 retries.

`-t, --timeout`<br>
&emsp;&emsp;Specify timeout limit of URL connection when downloading GenBank files. Default is 10 seconds.

`-v, --verbose`<br>
&emsp;&emsp;Enable verbose logging - changes logger level from WARNING to INFO.
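
A subset of the options above could be declared with `argparse` roughly as follows. This is a minimal sketch rather than the script's actual parser: the function name `build_parser`, the argument types and the example email address are assumptions, and only the option names and the documented defaults (10 retries, 10 second timeout) are taken from the list above.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser for a subset of the documented options (sketch only)."""
    parser = argparse.ArgumentParser(prog="get_ncbi_genomes.py")
    # Compulsory: the email address that Entrez requires
    parser.add_argument("-u", "--user", type=str, required=True)
    # Optional: path to the input file listing the species
    parser.add_argument("-i", "--input", type=Path, default=None)
    # Optional: retry limit for calls to NCBI (documented default: 10)
    parser.add_argument("-r", "--retries", type=int, default=10)
    # Optional: URL connection timeout in seconds (documented default: 10)
    parser.add_argument("-t", "--timeout", type=int, default=10)
    # Optional on/off flags
    parser.add_argument("-f", "--force", action="store_true")
    parser.add_argument("-n", "--nodelete", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(["-u", "user@example.com", "-r", "5", "-v"])
print(args.retries, args.timeout, args.verbose)
```

Options that are not passed fall back to their defaults, so `args.timeout` is 10 here while `args.retries` is overridden to 5.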

<a id="linkimports"></a>

## Python imports

In [None]:
import argparse
import logging
import re
import shutil
import sys
import time

from pathlib import Path
from socket import timeout
from typing import List, Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

import pandas as pd

from Bio import Entrez
from tqdm import tqdm


<a id="linkparsing"></a>

## Parsing the input file and creating the dataframe

The input file is read either from STDIN or, if a path was provided at the command line, from that path.

Opening and parsing of the input file is performed by the function `parse_input_file`.

First, the path to the input file is checked; if it is not valid the programme terminates.
The file is then opened and parsed line by line.

In [None]:
def parse_input_file(input_filename, logger, retries):
    
    # test path to input file exists, if not exit programme
    if not input_filename.is_file():
        # report to user and exit programme
        logger.error(f"Input file not found: {input_filename}")
        sys.exit(1)

    # if path to input file exists proceed
    # parse input file
    with open(input_filename) as file:
        input_list = file.read().splitlines()

    # Parse input, retrieving tax ID or scientific name as appropriate
    all_species_data = []  # one [genus, species, taxonomy ID] list per species
    line_count = 0
    for line in tqdm(input_list, desc="Reading lines"):
        line_count += 1

        if line.startswith("#"):
            continue

        line_data = parse_line(line, logger, line_count, retries)
        all_species_data.append(line_data)

    # create dataframe containing three columns: 'Genus', 'Species', 'NCBI Taxonomy ID'
    species_table = pd.DataFrame(
        all_species_data, columns=["Genus", "Species", "NCBI Taxonomy ID"]
    )
    return species_table


The function `parse_line` coordinates the appropriate calling of functions to retrieve the NCBI taxonomy ID if the scientific name is provided, or to retrieve the scientific name if the taxonomy ID is provided. Comment lines (indicated by the starting character '#') are skipped by `parse_input_file` before `parse_line` is called.

The scientific name and taxonomy ID are stored as a single list for each species. This list contains three elements:
- The species 'Genus' name
- The species 'Species' name
- The species taxonomy ID (including the 'NCBI:txid' prefix).

The list is returned by the function and appended to the list `all_species_data` within the `parse_input_file` function. This creates a list of lists, with each element containing the identification data for a unique species.
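
The shape of this per-species data can be illustrated without calling NCBI. In the sketch below, `build_line_data` is a hypothetical stand-in for `parse_line`: the value that would normally come from the Entrez lookup is passed in directly as `looked_up_value`. The species and taxonomy IDs are taken from the dataframe shown at the end of this notebook.

```python
def build_line_data(line, looked_up_value):
    """Mimic the list built for one input line, without calling NCBI.

    looked_up_value stands in for the result of the Entrez lookup.
    """
    if line.startswith("NCBI:txid"):
        # The line is a taxonomy ID, so the looked-up value is the scientific name
        line_data = looked_up_value.split()
        line_data.append(line)
    else:
        # The line is a scientific name, so the looked-up value is the taxonomy ID
        line_data = line.split()
        line_data.append(looked_up_value)
    return line_data


all_species_data = [
    build_line_data("Aspergillus niger", "NCBI:txid5061"),
    build_line_data("NCBI:txid5518", "Fusarium graminearum"),
]
print(all_species_data)
```

Both input styles end up in the same order: genus, species, then the prefixed taxonomy ID.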


In [None]:
def parse_line(line, logger, line_count, retries):
    
    # For taxonomy ID retrieve scientific name
    if line.startswith("NCBI:txid"):
        gs_name = get_genus_species_name(line[9:], logger, line_count, retries)
        line_data = gs_name.split()
        line_data.append(line)
        
    # For scientific name retrieve taxonomy ID
    else:
        tax_id = get_tax_id(line, logger, line_count, retries)
        line_data = line.split()
        line_data.append(tax_id)

    return line_data


<a id="linkscientific"></a>

### Scientific name and Taxonomy ID retrieval

The retrieval of the scientific names and taxonomy IDs from the NCBI Taxonomy database is performed by the functions `get_genus_species_name` and `get_tax_id`.

Each function calls the NCBI Taxonomy database using `Entrez`.


<div class="alert-danger">
If the retrieval of the scientific name or taxonomy ID fails, the null value 'NA' is returned.
The retrieval of the taxonomy ID has an additional test, which checks whether the scientific name passed to the function contains any numbers. If so, the null value 'NA' is returned. The most common cause of this is a typo in the scientific name, or the omission of the 'NCBI:txid' prefix from a taxonomy ID written in the input.
The probable causes of common errors are discussed in more detail in the project's README.md file.
</div>

In [None]:
def get_genus_species_name(taxonomy_id, logger, line_number, retries):

    # Retrieve scientific name
    with entrez_retry(
        logger, retries, Entrez.efetch, db="Taxonomy", id=taxonomy_id, retmode="xml"
    ) as handle:
        record = Entrez.read(handle)

    # extract scientific name from record
    try:
        return record[0]["ScientificName"]

    except IndexError:
        # log error and return null value 'NA'
        return "NA"


def get_tax_id(genus_species, logger, line_number, retries):
    # check for potential mistake in taxonomy ID prefix
    if re.search(r"\d", genus_species):
        # log warning that numbers were found in line which was identified as a scientific name
        return "NA"

    else:
        with entrez_retry(
            logger, retries, Entrez.esearch, db="Taxonomy", term=genus_species
        ) as handle:
            record = Entrez.read(handle)

    # extract taxonomy ID from record
    try:
        return "NCBI:txid" + record["IdList"][0]

    except IndexError:
        # log error
        return "NA"
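
The helper `entrez_retry` used above is not shown in this notebook. The sketch below illustrates one way a retry wrapper of this kind could behave; it is an assumption, not the script's implementation. The names `retry_call` and `flaky_lookup` are hypothetical, and `flaky_lookup` replaces the real Entrez call so that the example runs offline.

```python
import time


def retry_call(retries, func, *func_args, delay=0.0, **func_kwargs):
    """Call func until it succeeds or `retries` attempts have failed."""
    last_error = None
    for _ in range(retries):
        try:
            return func(*func_args, **func_kwargs)
        except IOError as error:
            # Network-style failure: remember it, wait, then try again
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"Call failed after {retries} attempts") from last_error


# Hypothetical stand-in for an Entrez call: fails twice, then succeeds
attempts = {"count": 0}


def flaky_lookup(term):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("temporary network failure")
    return {"IdList": ["5061"]}


record = retry_call(10, flaky_lookup, "Aspergillus niger")
tax_id = "NCBI:txid" + record["IdList"][0]
print(tax_id, attempts["count"])
```

Capping the number of attempts keeps a persistent failure from looping forever, which matches the purpose of the `--retries` option.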


<a id="linkcdf"></a>

### Creating the dataframe

The `pandas` module is used to create a dataframe, within the `parse_input_file` function, with three columns:
- Genus
- Species
- NCBI Taxonomy ID

Each row of the dataframe holds a unique species.

`pandas` makes the storage and manipulation of tabular data easier than building tables by hand. More information is available on the [pandas website](https://pandas.pydata.org/).

The dataframe (`species_table`) is returned to the function `main`.

In [None]:
    species_table = pd.DataFrame(
        all_species_data, columns=["Genus", "Species", "NCBI Taxonomy ID"]
    )
    return species_table

<a id="linkretrieving"></a>

## Retrieving accession numbers

The `pandas` `apply` method applies a function along a given axis (row or column) of a dataframe, which is more concise than an explicit Python `for` loop. Note that a row-wise `apply` still calls the function once per row, so the run time here is dominated by the calls to NCBI. The `apply` method is used to retrieve the accession numbers of all genomic assemblies associated with each taxonomy ID within the `species_table` dataframe. This is done by calling the function `get_accession_numbers`, and the retrieved accession numbers are stored in a new column in the dataframe: `NCBI Accession Numbers`.


<div class="alert-warning">
Entrez will only retrieve the accession numbers of genomic assembly entries within the NCBI Assembly database that are **directly** linked to the taxonomy entry (identified by its taxonomy ID); it will not return accession numbers for genomic assembly entries that are **subtree** linked.

**Directly** linked entries are those that have the organism assigned as the source organism for the data.<br>
**Subtree** linked entries are those records associated with the taxonomy node and all nodes beneath it.

Browsing a taxonomy entry within the NCBI Taxonomy database in a web browser will quickly identify whether any genomic assemblies have been directly linked.
</div>

<div class="alert-danger">
As before, if the retrieval of the accession numbers fails at any point, the null value 'NA' is returned, and the retrieval of the accession numbers is exited.

If the taxonomy ID could not be retrieved previously (and was thus stored as 'NA'), a null value of 'NA' will automatically be returned for the accession numbers.
</div>

In [None]:
species_table["NCBI Accession Numbers"] = species_table.apply(
    get_accession_numbers, args=(logger, args), axis=1
)


With `axis=1`, the `pandas` `apply` method passes each row to the function as a `pandas` `Series`, with each element (a cell within the dataframe) accessible by its position, similar to accessing an element in a list.

- df_row[0]: Genus
- df_row[1]: Species
- df_row[2]: Taxonomy ID
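
The row-wise use of `apply` with extra arguments can be tried on a small dataframe without contacting NCBI. In this sketch `describe_row` is a hypothetical stand-in for `get_accession_numbers`, the second row is made-up data included only to show the 'NA' path, and `.iloc` is used for positional access to the row.

```python
import pandas as pd

species_table = pd.DataFrame(
    [
        ["Aspergillus", "niger", "NCBI:txid5061"],
        ["Examplegenus", "examplespecies", "NA"],  # made-up row with a failed lookup
    ],
    columns=["Genus", "Species", "NCBI Taxonomy ID"],
)


def describe_row(df_row, prefix):
    """Return the numeric part of the taxonomy ID, or 'NA' if it is missing."""
    tax_id = df_row.iloc[2]  # third cell of the row: the taxonomy ID
    if tax_id == "NA":
        return "NA"
    # Strip the 9-character 'NCBI:txid' prefix, as the script does with [9:]
    return f"{prefix}{tax_id[9:]}"


# axis=1 passes one row at a time; args supplies the extra positional arguments
species_table["Lookup"] = species_table.apply(describe_row, args=("taxid=",), axis=1)
print(species_table)
```

The returned values become the new column, one value per row, which is how `NCBI Accession Numbers` is populated in the script.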


<a id="linkassemblyids"></a>

### Retrieval of assembly IDs

`Entrez` is used to perform the call to the NCBI Assembly database. Because the taxonomy ID for a species is stored within the NCBI Taxonomy database, the `Entrez` function `elink` is used to retrieve the IDs of all genomic assemblies within the NCBI Assembly database that are linked to the species' NCBI Taxonomy database entry.


In [None]:
def get_accession_numbers(df_row, logger, args):
    with entrez_retry(
        logger,
        args.retries,
        Entrez.elink,
        dbfrom="Taxonomy",
        id=df_row[2][9:],
        db="Assembly",
        linkname="taxonomy_assembly",
    ) as assembly_number_handle:
        assembly_number_record = Entrez.read(assembly_number_handle)

This retrieves a list of IDs for all genomic assemblies associated with the taxonomy ID. To minimise the number of calls to the NCBI database, the list of IDs is posted as a single query to NCBI using `Entrez`. The web environment and query key of this query are then retrieved to facilitate the retrieval of the accession numbers of the identified genomic assemblies.

In [None]:
    # compile list of ids in suitable format for epost
    id_post_list = str(",".join(assembly_id_list))
    # Post all assembly IDs to Entrez-NCBI for downstream pulldown of accession numbers
    epost_search_results = Entrez.read(
        entrez_retry(logger, args.retries, Entrez.epost, "Assembly", id=id_post_list)
    )

    # Retrieve web environment and query key from Entrez epost
    epost_webenv = epost_search_results["WebEnv"]
    epost_query_key = epost_search_results["QueryKey"]

<a id="linkanretrieve"></a>

### Retrieval of the accession numbers

`Entrez` is used again to retrieve the accession numbers from the previously posted query. The accession numbers are collated into a single list, which is then converted into a string. This prevents the list's enclosing square brackets `[]` from appearing within the final dataframe.

In [None]:
    ncbi_accession_numbers_list = []

    with entrez_retry(
        logger,
        args.retries,
        Entrez.efetch,
        db="Assembly",
        query_key=epost_query_key,
        WebEnv=epost_webenv,
        rettype="docsum",
        retmode="xml",
    ) as accession_handle:
        accession_record = Entrez.read(accession_handle, validate=False)


    # Extract accession numbers from document summary
    for index_number in tqdm(
        range(len(accession_record["DocumentSummarySet"]["DocumentSummary"])),
        desc=f"Retrieving accessions ({df_row[2]})",
    ):
        try:
            new_accession_number = accession_record["DocumentSummarySet"][
                "DocumentSummary"
            ][index_number]["AssemblyAccession"]
            ncbi_accession_numbers_list.append(new_accession_number)

        except IndexError:
            # log error and return null value
            return "NA"
        
        # Download GenBank file for genomic assembly entry
        # (no manual increment of index_number is needed: the for loop advances it)

    # Process accession numbers into human readable list for dataframe
    ncbi_accession_numbers = ", ".join(ncbi_accession_numbers_list)

    return ncbi_accession_numbers
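
The extraction step can be followed on a mock record. The nested dictionary below only imitates the keys used above (`DocumentSummarySet`, `DocumentSummary`, `AssemblyAccession`, `AssemblyName`); a real `Entrez.read` result contains many more fields. The accession numbers are taken from the dataframe at the end of this notebook, while the assembly names are placeholders.

```python
# Mock of the parsed document summary; the assembly names are hypothetical
accession_record = {
    "DocumentSummarySet": {
        "DocumentSummary": [
            {"AssemblyAccession": "GCA_011075025.1", "AssemblyName": "placeholder_name_1"},
            {"AssemblyAccession": "GCA_011074995.1", "AssemblyName": "placeholder_name_2"},
        ]
    }
}

# Iterate over the summaries directly rather than over index numbers
summaries = accession_record["DocumentSummarySet"]["DocumentSummary"]
accession_list = [summary["AssemblyAccession"] for summary in summaries]

# Join into one string so no list brackets appear in the dataframe cell
ncbi_accession_numbers = ", ".join(accession_list)
print(ncbi_accession_numbers)
```

Iterating over the summaries themselves gives the same result as indexing by position and avoids the need to handle `IndexError`.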


<a id="linkgenbank"></a>

## Downloading GenBank files

Prior to adding the newly retrieved accession number to the growing list of all accession numbers for a specific species, a check is performed to see whether downloading of the GenBank file (.gbff) for the genomic assembly entry is enabled.

If enabled, the downloading of the GenBank file is initiated by calling the `get_genbank_files` function.

In [None]:
        # If downloading of GenBank files is enabled, download Genbank files
        if args.genbank is True:
            get_genbank_files(
                new_accession_number,
                accession_record["DocumentSummarySet"]["DocumentSummary"][index_number][
                    "AssemblyName"
                ],
                logger,
                args,
            )
            
            

Downloading a file requires a URL from which to download it and an output path to which the downloaded data is written. The creation of both is orchestrated by the `get_genbank_files` function.


In [None]:
def get_genbank_files(
    accession_number,
    assembly_name,
    logger,
    args,
    suffix="genomic.gbff.gz",
):
    
    # compile url for download
    genbank_url, filestem = compile_url(accession_number, assembly_name, logger, suffix)

    # if downloaded file is not to be written to STDOUT, compile output path
    if args.output is not sys.stdout:
        out_file_path = args.output / "_".join([filestem.replace(".", "_"), suffix])
    else:
        out_file_path = args.output

    # download GenBank file
    download_file(
        genbank_url, args, out_file_path, logger, accession_number, "GenBank file",
    )

    return


<a id="linkurl"></a>

### Compiling the URL

NCBI assembly names can include a variety of escape characters (such as whitespace, slashes, commas, hashes and parentheses), but NCBI requires all of these to be written as underscores within its URLs. Python's regular expression module (`re`) is used to replace any escape characters with underscores prior to compiling the URL.

<div class="alert-warning">NCBI genomic assembly files are available for download from the NCBI FTP site; hence the prefix (the ftpstem) and structure of the compiled URL do not match the structure of the URL seen when searching the NCBI Assembly database in a browser.</div>


In [None]:
def compile_url(
    accession_number,
    assembly_name,
    logger,
    suffix,
    ftpstem="ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
):
   
    # Extract assembly name, replacing alternative escape characters with underscores
    escape_characters = re.compile(r"[\s/,#\(\)]")
    escape_name = re.sub(escape_characters, "_", assembly_name)

    # compile filestem
    filestem = "_".join([accession_number, escape_name])

    # separate out filestem into GC stem, accession number integers and the discarded remainder
    url_parts = tuple(filestem.split("_", 2))

    # separate identifying numbers from version number
    sub_directories = "/".join(
        [url_parts[1][i : i + 3] for i in range(0, len(url_parts[1].split(".")[0]), 3)]
    )

    # return url for downloading file
    return (
        "{0}/{1}/{2}/{3}/{3}_{4}".format(
            ftpstem, url_parts[0], sub_directories, filestem, suffix
        ),
        filestem,
    )
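
A worked example helps to show how the accession number maps onto the FTP directory structure. The sketch below re-implements the same steps in a standalone function, here called `sketch_compile_url` (a hypothetical name), without the logger. The accession number comes from the dataframe at the end of this notebook; the assembly name is a made-up value chosen to contain characters that need replacing.

```python
import re


def sketch_compile_url(
    accession_number,
    assembly_name,
    suffix="genomic.gbff.gz",
    ftpstem="ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
):
    """Compile a download URL and filestem from an accession and assembly name."""
    # Replace whitespace, slashes, commas, hashes and parentheses with underscores
    escape_name = re.sub(r"[\s/,#\(\)]", "_", assembly_name)
    filestem = "_".join([accession_number, escape_name])

    # Split into the GC prefix, the numeric part with version, and the remainder
    prefix, digits_and_version, _ = filestem.split("_", 2)

    # Drop the version number and break the digits into groups of three
    digits = digits_and_version.split(".")[0]
    sub_directories = "/".join(digits[i : i + 3] for i in range(0, len(digits), 3))

    url = f"{ftpstem}/{prefix}/{sub_directories}/{filestem}/{filestem}_{suffix}"
    return url, filestem


url, filestem = sketch_compile_url("GCA_011075025.1", "Example assembly (v1)")
print(filestem)
print(url)
```

For this input the nine digits `011075025` become the sub-directories `011/075/025`, and the spaces and parentheses in the assembly name each become an underscore.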


<a id="linkdownload"></a>

### Downloading the GenBank file

The downloading of the GenBank file is initiated by calling the `download_file` function.

The URL connection is coordinated by the Python `urllib` module ([documentation found here](https://docs.python.org/3/library/urllib.html)).

The Python `tqdm` module is also used to provide a visual log of the download progress in the terminal.


In [None]:
def download_file(
    genbank_url, args, out_file_path, logger, accession_number, file_type
):
    # Try URL connection
    try:
        response = urlopen(genbank_url, timeout=args.timeout)
    except (HTTPError, URLError, timeout):
        # log error and exit downloading of GenBank file
        return

    if out_file_path.exists():
        logger.warning(f"Output file {out_file_path} exists, not downloading")

    else:
        # Download file
        file_size = int(response.info().get("Content-length"))
        bsize = 1_048_576  # read the response in 1 MiB chunks

        try:
            with open(out_file_path, "wb") as out_handle:
                # Using leave=False as this will be an internally-nested progress bar
                with tqdm(
                    total=file_size,
                    leave=False,
                    desc=f"Downloading {accession_number} {file_type}",
                ) as pbar:
                    while True:
                        buffer = response.read(bsize)
                        if not buffer:
                            break
                        pbar.update(len(buffer))
                        out_handle.write(buffer)
        except IOError:
            # log error and exit downloading of GenBank file
            return

    return
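
The chunked read-and-write loop can be tried without a network connection. In this sketch an in-memory `io.BytesIO` object stands in for the URL response and another for the output file; the function name `copy_in_chunks` is hypothetical and the progress bar is left out.

```python
import io


def copy_in_chunks(source, destination, chunk_size=1_048_576):
    """Copy source to destination in fixed-size chunks; return bytes written."""
    total = 0
    while True:
        buffer = source.read(chunk_size)
        if not buffer:
            # An empty read means the end of the stream has been reached
            break
        destination.write(buffer)
        total += len(buffer)
    return total


payload = b"x" * 2_500_000  # a little over two full chunks
source = io.BytesIO(payload)
destination = io.BytesIO()
bytes_written = copy_in_chunks(source, destination)
print(bytes_written)
```

Reading in chunks keeps memory use bounded by the chunk size, which matters for large compressed GenBank files.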
    

<a id="linkoutdf"></a>

## Writing out the dataframe

If the option to write out the dataframe to a file, rather than STDOUT, is enabled, the function `write_out_dataframe` will be called.

The dataframe is written out as a .csv file (comma-separated values format), so that it can be opened and read by several packages, including Microsoft Office Excel, and easily parsed by other Python scripts that use `pandas`.

<div class="alert-danger">
When providing the output path for the .csv file, make sure to include the .csv extension. Omitting it will not prevent the file being written out; the function checks for the extension and appends '.csv' to the path if it is missing.
</div>


In [None]:
def write_out_dataframe(species_table, logger, outdir, force, nodelete):

    # Check if overwrite of existing directory will occur
    logger.info("Checking if output directory for dataframe already exists")
    if outdir.exists():
        if force is False:
            logger.warning(
                "Specified directory for dataframe already exists.\nExiting writing out dataframe."
            )
            return
        else:
            logger.warning(
                "Specified directory for dataframe already exists.\nForced overwriting enabled."
            )

    # Check if user included .csv file extension (outdir is a pathlib.Path)
    if outdir.suffix == ".csv":
        species_table.to_csv(outdir)
    else:
        out_df = outdir.with_name(outdir.name + ".csv")
        species_table.to_csv(out_df)

    return
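
A CSV file written this way can be read back into `pandas` by another script. The sketch below writes a one-row dataframe to a temporary directory and reloads it; the file name `example_dataframe.csv` is an assumption, and the row is taken from the dataframe shown at the end of this notebook. Because `to_csv` also writes the row index as an unnamed first column, `index_col=0` is passed when reading so that it is restored as the index.

```python
import tempfile
from pathlib import Path

import pandas as pd

species_table = pd.DataFrame(
    [["Aspergillus", "nidulans", "NCBI:txid162425", "GCA_011075025.1, GCA_011074995.1"]],
    columns=["Genus", "Species", "NCBI Taxonomy ID", "NCBI Accession Numbers"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "example_dataframe.csv"
    # The accession cell contains commas, so pandas quotes it in the file
    species_table.to_csv(csv_path)
    # Treat the first (unnamed) column as the row index when reading back
    reloaded = pd.read_csv(csv_path, index_col=0)

print(reloaded)
```

If `index_col=0` is omitted, the index appears as an extra column named `Unnamed: 0`.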

<div class="alert alert-warning">
For the downloading of GenBank files for the project, the dataframe was written out to '2020_05_31_genome_dataframe.csv', stored within the same directory as the script (get_ncbi_genomes). The following dataframe was written:
</div>

In [6]:
import pandas as pd

from pathlib import Path
from IPython.display import display

display(pd.read_csv(
    '2020_05_31_genome_dataframe.csv',
    header=0,
    names=["Genus", "Species", "NCBI Taxonomy ID", "NCBI Accession Numbers"]
))

| Unnamed: 0 | Genus | Species | NCBI Taxonomy ID | NCBI Accession Numbers |
|---|---|---|---|---|
| 0 | Aspergillus | fumigatus | NCBI:txid746128 | GCA_012656185.1, GCA_012656215.1, GCA_01265616... |
| 1 | Aspergillus | nidulans | NCBI:txid162425 | GCA_011075025.1, GCA_011074995.1 |
| 2 | Aspergillus | niger | NCBI:txid5061 | GCA_011316255.1, GCA_009812365.1, GCA_00463431... |
| 3 | Aspergillus | sydowii | NCBI:txid75750 | GCA_009828905.1, GCA_009193685.1 |
| 4 | Fusarium | graminearum | NCBI:txid5518 | GCA_012959185.1, GCA_006942295.1, GCA_90049270... |
| 5 | Fusarium | oxysporum | NCBI:txid5507 | GCA_011428085.1, GCA_011426355.1, GCA_01142633... |
| 6 | Fusarium | proliferatum | NCBI:txid948311 | GCA_003709405.1, GCA_003705095.1, GCA_00370496... |
| 7 | Magnaporthe | grisea | NCBI:txid148305 | GCF_004355905.1, GCA_003933175.1, GCA_00292524... |
| 8 | Magnaporthe | oryzae | NCBI:txid318829 | GCA_012979135.1, GCA_012978465.1, GCA_01297841... |
| 9 | Mycosphaerella | graminicola | NCBI:txid1047171 | GCA_902712725.1, GCA_003613095.1, GCA_00361118... |
