# Section 1: Extracting GenBank files from NCBI

This notebook describes the process of downloading GenBank files for all genomic assemblies associated with each species of interest.

This notebook refers to the Python script Extract_genomes_NCBI.py, included in the `Proteng` package.


    Note that this notebook does not include the entirety of the script code, but instead includes excerpts to help illustrate the programme architecture and function. Specifically, the function `main`, which orchestrates the calling of functions to perform the overall operation of the script, is excluded from this notebook, as are logging and error checking.
    
    Additionally, in some instances sections of code have been removed and replaced by a comment indicating the intent or function of the code, so that the code's function can be described in detail at a more logical and opportune point.
    
    For the complete script, navigate to `Section1_Extracting_genomes_NCBI/Extract_genomes_NCBI.py` within the repository.

## Contents



## Operating the script

The script `Extract_genomes_NCBI.py` is written as a command-line programme and thus is most easily operated via the command-line. The standard structure of the command-line call to operate the script is:
`python3 Extract_genomes_NCBI.py <user email> <options>`

Multiple options are available to customise the operation of the script to the user's needs. A full list of the available options is included in the README.

An email address must be provided because this is a requirement for accessing the NCBI database remotely using `Entrez`.
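
The call structure `<user email> <options>` can be sketched with `argparse`. This is a minimal illustration and not the script's actual parser: the short flags are those used in the command run for this project, while the long option names, the help texts and the meaning attached to each flag are assumptions made for the example.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser mirroring `<user email> <options>` (long names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="Extract_genomes_NCBI.py")
    # Positional argument: the email address passed on to Entrez
    parser.add_argument("user_email", type=str, help="Email address for NCBI Entrez")
    # Short flags as used for this project; long forms and meanings are assumed
    parser.add_argument("-i", "--input_file", type=Path, help="Species list file")
    parser.add_argument("-o", "--output", type=Path, help="Directory for GenBank files")
    parser.add_argument("-l", "--log", type=Path, help="Log file")
    parser.add_argument("-v", "--verbose", action="store_true", help="Verbose logging")
    parser.add_argument("-d", "--dataframe", type=Path, help="Output file for dataframe")
    return parser


# Parse an example argument list instead of sys.argv so the sketch runs anywhere
example_args = build_parser().parse_args(
    ["name@example.org", "-i", "Species_list.txt", "-o", "genbank_files", "-v"]
)
print(example_args.user_email, example_args.input_file, example_args.verbose)
```

The README remains the authoritative list of options.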


    For the downloading of the GenBank files for the project, the following code was run at the command-line:
    python3 Extract_genomes_NCBI.py eemh1@st-andrews.ac.uk -i Species_list.txt -o 2020_05_31_GenBank_file_pulldown/ -l 2020_05_31_GB_file_download -v -d 2020_05_31_genome_dataframe.csv

    **Note**: Before using the script `Extract_genomes_NCBI.py`, ensure the documentation for Entrez has been read and understood, taking care to note the expected practices laid out under ['Frequency, Timing and Registration of E-utility URL Requests'](https://www.ncbi.nlm.nih.gov/books/NBK25497/).
    
    In total 695 GenBank files were downloaded in a single run of the script `Extract_genomes_NCBI.py` for the project.

## Script input

The script takes a plain text file as input, containing a list of the species of interest. Each line contains a unique species, identified by its scientific name or NCBI taxonomy ID (including the 'NCBI:txid' prefix).
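
As an illustration of the input format, the sketch below classifies the lines of a small example input. The file content and the helper `classify_line` are hypothetical and written only for this example; lines starting with '#' are treated as comments, as in the script.

```python
# Hypothetical input content: one entry per line (the values are illustrative only)
example_input = """\
# Lines starting with '#' are comments
Aspergillus niger
NCBI:txid51453
"""


def classify_line(line):
    """Return 'comment', 'taxonomy_id' or 'scientific_name' for one input line."""
    if line.startswith("#"):
        return "comment"
    if line.startswith("NCBI:txid"):
        return "taxonomy_id"
    return "scientific_name"


line_kinds = [classify_line(line) for line in example_input.splitlines()]
print(line_kinds)
```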


    For the downloading of the GenBank files for the project, the file `Species_list.txt` within the `Section1_Extracting_genomes_NCBI` directory was used as the input file.

## Python imports

In [None]:
import argparse
import logging
import re
import shutil
import sys
import time

from pathlib import Path
from socket import timeout
from typing import List, Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

import pandas as pd

from Bio import Entrez
from tqdm import tqdm


## Parsing the input file and creating the dataframe

The input file is read either from STDIN or, if a path was provided at the command-line, from the location the path points to.

Opening and parsing of the input file is performed by the function `parse_input_file`.

Initially, it is checked that the path to the input file is valid and if not, the programme terminates. 
Afterwards, the file is opened, and parsed line-by-line.

In [None]:
def parse_input_file(input_filename, logger, retries):
    
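    # test path to input file exists, if not exit programme
    if not input_filename.is_file():
        # report to user and exit programme
        # (full logging is elided in this excerpt; a minimal stand-in is shown)
        logger.critical(f"Input file not found: {input_filename}")
        sys.exit(1)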


If the input file is retrievable, the file is parsed line-by-line.

In [None]:
    # if path to input file exists proceed
    # parse input file
    with open(input_filename) as file:
        input_list = file.read().splitlines()

    # Parse input, retrieving tax ID or scientific name as appropriate
    all_species_data = []  # one [genus, species, taxonomy ID] list per species
    line_count = 0
    for line in tqdm(input_list, desc="Reading lines"):
        line_count += 1

        if line.startswith("#"):
            continue

        line_data = parse_line(line, logger, line_count, retries)
        all_species_data.append(line_data)

    # create dataframe containing three columns: 'Genus', 'Species', 'NCBI Taxonomy ID'
    species_table = pd.DataFrame(
        all_species_data, columns=["Genus", "Species", "NCBI Taxonomy ID"]
    )
    return species_table


The function `parse_line` coordinates the appropriate calling of functions to retrieve the NCBI taxonomy ID if the scientific name is provided, or to retrieve the scientific name if the taxonomy ID is provided. Comment lines (indicated by the starting line character '#') are skipped by `parse_input_file` before `parse_line` is called.

The scientific name and taxonomy ID are stored as a single list for each species. This list contains three elements:
- The species 'Genus' name
- The species 'Species' name
- The species taxonomy ID (including the 'NCBI:txid' prefix).

The list is returned by the function and appended to the list `all_species_data` within the `parse_input_file` function. This creates a list of lists, with each element containing the identification data for a unique species.


In [None]:
def parse_line(line, logger, line_count, retries):
    
    # For taxonomy ID retrieve scientific name
    if line.startswith("NCBI:txid"):
        gs_name = get_genus_species_name(line[9:], logger, line_count, retries)
        line_data = gs_name.split()
        line_data.append(line)
        
    # For scientific name retrieve taxonomy ID
    else:
        tax_id = get_tax_id(line, logger, line_count, retries)
        line_data = line.split()
        line_data.append(tax_id)

    return line_data
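
To show the shape of the list built for each species, the sketch below mirrors the branching of `parse_line` but replaces the NCBI lookups with a small in-memory dictionary. The dictionaries and the name `parse_line_sketch` are stand-ins introduced for this example and are not part of the script.

```python
# Illustrative in-memory stand-in for the NCBI Taxonomy database
NAME_TO_ID = {"Aspergillus niger": "5061"}
ID_TO_NAME = {tax_id: name for name, tax_id in NAME_TO_ID.items()}


def parse_line_sketch(line):
    """Mirror the branching of parse_line without any network access."""
    if line.startswith("NCBI:txid"):
        # Strip the 9-character 'NCBI:txid' prefix before the lookup
        name = ID_TO_NAME.get(line[9:], "NA")
        line_data = name.split()
        line_data.append(line)
    else:
        tax_id = NAME_TO_ID.get(line)
        line_data = line.split()
        line_data.append("NCBI:txid" + tax_id if tax_id else "NA")
    return line_data


from_name = parse_line_sketch("Aspergillus niger")
from_id = parse_line_sketch("NCBI:txid5061")
print(from_name)
print(from_id)
```

Both forms of input give the same three-element list, which is what allows the two kinds of line to be mixed freely in the input file.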


### Scientific name and Taxonomy ID retrieval

The retrieval of the scientific names and taxonomy IDs from the NCBI Taxonomy database is performed by the functions `get_genus_species_name` and `get_tax_id`.

Each function calls the NCBI Taxonomy database using `Entrez`.

If the scientific name or taxonomy ID cannot be retrieved, the null value 'NA', which the `pandas` module recognises, is returned.

The retrieval of taxonomy IDs includes an additional check for whether the scientific name passed to the function contains any digits. If so, the null value 'NA' is returned. The most common cause of this is the inclusion of a taxonomy ID in the input file without the prefix 'NCBI:txid'.

In [None]:
def get_genus_species_name(taxonomy_id, logger, line_number, retries):

    # Retrieve scientific name
    with entrez_retry(
        logger, retries, Entrez.efetch, db="Taxonomy", id=taxonomy_id, retmode="xml"
    ) as handle:
        record = Entrez.read(handle)

    # extract scientific name from record
    try:
        return record[0]["ScientificName"]

    except IndexError:
        # log error and return null value 'NA'
        return "NA"


def get_tax_id(genus_species, logger, line_number, retries):
    # check for potential mistake in taxonomy ID prefix
    if re.search(r"\d", genus_species):
        # log warning that numbers were found in line which was identified as a scientific name
        return "NA"

    else:
        with entrez_retry(
            logger, retries, Entrez.esearch, db="Taxonomy", term=genus_species
        ) as handle:
            record = Entrez.read(handle)

    # extract taxonomy ID from record
    try:
        return "NCBI:txid" + record["IdList"][0]

    except IndexError:
        # log error
        return "NA"
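
The helper `entrez_retry` is not reproduced in this notebook. Judging from the calls above, it receives the logger, the number of retries, the `Entrez` function to call and that function's arguments. The sketch below shows the general retry pattern only: `retry_call`, its `delay` parameter and the choice of `OSError` as the retried exception are assumptions for this example, and the logger is omitted.

```python
import time


def retry_call(retries, func, *args, delay=0.0, **kwargs):
    """Call func until it succeeds or `retries` attempts have failed."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except OSError:
            # Network errors such as urllib's URLError are subclasses of OSError
            if attempt == retries:
                raise
            time.sleep(delay)


# Example: a stand-in "network call" that fails twice before succeeding
attempts = {"count": 0}


def flaky_lookup(term):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("temporary failure")
    return f"record for {term}"


result = retry_call(5, flaky_lookup, "Aspergillus niger")
print(result, attempts["count"])
```

A non-zero `delay` between attempts helps keep the request rate within the limits described in the NCBI usage guidelines.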


### Creating the dataframe

The `pandas` module is used to create a dataframe, within the `parse_input_file` function, with three columns:
- Genus
- Species
- NCBI Taxonomy ID

Each row contains a unique species.

`pandas` provides convenient storage and manipulation of tabular data. More information is available on the [pandas website](https://pandas.pydata.org/).

The dataframe (`species_table`) is returned to the function `main`.

In [None]:
    species_table = pd.DataFrame(
        all_species_data, columns=["Genus", "Species", "NCBI Taxonomy ID"]
    )
    return species_table

## Retrieving accession numbers

The `pandas` `apply` method applies a function along a given axis of a dataframe (row or column), which avoids writing an explicit Python `for` loop over the rows. Therefore, the `pandas` `apply` method is used to retrieve the accession numbers of all genomic assemblies associated with each taxonomy ID within the `species_table` dataframe. This is completed by calling the function `get_accession_numbers`, and the retrieved accession numbers are stored in a new column in the dataframe: `NCBI Accession Numbers`.


    As before, if the retrieval of the accession numbers fails, the null value 'NA' is returned.

In [None]:
species_table["NCBI Accession Numbers"] = species_table.apply(
    get_accession_numbers, args=(logger, args), axis=1
)


    If the taxonomy ID could not be retrieved previously (and was thus stored as 'NA'), a null value of 'NA' will automatically be returned for the accession numbers.

    The `pandas` `apply` method passes a `pandas` series, with each element (or cell within the dataframe) accessible via an index number, similar to accessing an element in a list.

    The function is applied to each row of the dataframe, so each column is given its own index number through which the appropriate data is accessed. In this case the index number assignments are:
        df_row[0]: Genus
        df_row[1]: Species
        df_row[2]: Taxonomy ID
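
A small, self-contained sketch of row-wise `apply` is given below. The function `describe_row` is a hypothetical stand-in for `get_accession_numbers` that performs no network access, and the second row of the table is placeholder data. The sketch accesses the cell by column name rather than by index number, which is an alternative that does not depend on column order.

```python
import pandas as pd

species_table = pd.DataFrame(
    [["Aspergillus", "niger", "NCBI:txid5061"], ["Genus", "species", "NA"]],
    columns=["Genus", "Species", "NCBI Taxonomy ID"],
)


def describe_row(df_row, prefix):
    """Stand-in for get_accession_numbers: string handling only, no network access."""
    tax_id = df_row["NCBI Taxonomy ID"]
    if tax_id == "NA":
        # Propagate the null value when no taxonomy ID is available
        return "NA"
    # Drop the 9-character 'NCBI:txid' prefix, as done before calling Entrez
    return f"{prefix}{tax_id[9:]}"


# axis=1 passes each row to the function as a pandas series
species_table["Lookup"] = species_table.apply(describe_row, args=("id=",), axis=1)
print(species_table["Lookup"].tolist())
```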

### Retrieval of assembly IDs

`Entrez` is used to perform the call to the NCBI Assembly database. Because the taxonomy ID of a species is stored within the NCBI Taxonomy database, the `Entrez` function `elink` is used to retrieve the IDs of all genomic assemblies within the NCBI Assembly database that are associated with the provided taxonomy ID.


In [None]:
def get_accession_numbers(df_row, logger, args):
    with entrez_retry(
        logger,
        args.retries,
        Entrez.elink,
        dbfrom="Taxonomy",
        id=df_row[2][9:],
        db="Assembly",
        linkname="taxonomy_assembly",
    ) as assembly_number_handle:
        assembly_number_record = Entrez.read(assembly_number_handle)

This retrieves a list of IDs for all genomic assemblies associated with the taxonomy ID. To minimise the number of calls to the NCBI database, the list of IDs is posted as a single query to NCBI using `Entrez`.

The web environment and query key are retrieved to facilitate the retrieval of the accession numbers of the identified genomic assemblies.

In [None]:
    # compile list of ids in suitable format for epost
    id_post_list = str(",".join(assembly_id_list))
    # Post all assembly IDs to Entrez-NCBI for downstream pulldown of accession numbers
    epost_search_results = Entrez.read(
        entrez_retry(logger, args.retries, Entrez.epost, "Assembly", id=id_post_list)
    )

    # Retrieve web environment and query key from Entrez epost
    epost_webenv = epost_search_results["WebEnv"]
    epost_query_key = epost_search_results["QueryKey"]

### Retrieval of accession numbers

`Entrez` is used again to retrieve the accession numbers from the previously posted query. The accession numbers are collated into a single list, which is then converted into a string; this prevents the list's enclosing square brackets `[]` from appearing in the final dataframe.

In [None]:
    ncbi_accession_numbers_list = []

    with entrez_retry(
        logger,
        args.retries,
        Entrez.efetch,
        db="Assembly",
        query_key=epost_query_key,
        WebEnv=epost_webenv,
        rettype="docsum",
        retmode="xml",
    ) as accession_handle:
        accession_record = Entrez.read(accession_handle, validate=False)


    # Extract accession numbers from document summary
    for index_number in tqdm(
        range(len(accession_record["DocumentSummarySet"]["DocumentSummary"])),
        desc=f"Retrieving accessions ({df_row[2]})",
    ):
        try:
            new_accession_number = accession_record["DocumentSummarySet"][
                "DocumentSummary"
            ][index_number]["AssemblyAccession"]
            ncbi_accession_numbers_list.append(new_accession_number)

        except IndexError:
            # log error and return null value
            return "NA"
        
        # GenBank file download enabling check

    # Process accession numbers into human readable list for dataframe
    ncbi_accession_numbers = ", ".join(ncbi_accession_numbers_list)

    return ncbi_accession_numbers
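
The extraction and joining step can be tried without contacting NCBI by using a plain dictionary shaped like the parsed document summary. The accession numbers and assembly names below are made-up placeholders, not records from the project.

```python
# Placeholder data shaped like the parsed Entrez document summary
accession_record = {
    "DocumentSummarySet": {
        "DocumentSummary": [
            {"AssemblyAccession": "GCA_000000001.1", "AssemblyName": "ASM1v1"},
            {"AssemblyAccession": "GCA_000000002.1", "AssemblyName": "ASM2v1"},
        ]
    }
}

summaries = accession_record["DocumentSummarySet"]["DocumentSummary"]
accession_list = [summary["AssemblyAccession"] for summary in summaries]

# Joining gives a plain string, so no list brackets end up in the dataframe cell
accession_string = ", ".join(accession_list)

print(str(accession_list))
print(accession_string)
```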

## Downloading GenBank files

Before the retrieved accession number is added to the growing list of accession numbers, a check is performed to see whether downloading of the GenBank file (.gbff) associated with the genomic assembly is enabled.

If enabled, the download of the GenBank file is initiated by a call to the `get_genbank_files` function.

In [None]:
        # If downloading of GenBank files is enabled, download Genbank files
        if args.genbank is True:
            get_genbank_files(
                new_accession_number,
                accession_record["DocumentSummarySet"]["DocumentSummary"][index_number][
                    "AssemblyName"
                ],
                logger,
                args,
            )
            
            

Downloading any file requires a URL from which to download the file, and an output path to write the downloaded data to. The creation of both is orchestrated by the `get_genbank_files` function.


In [None]:
def get_genbank_files(
    accession_number,
    assembly_name,
    logger,
    args,
    suffix="genomic.gbff.gz",
):
    
    # compile url for download
    genbank_url, filestem = compile_url(accession_number, assembly_name, logger, suffix)

    # if downloaded file is not to be written to STDOUT, compile output path
    if args.output is not sys.stdout:
        out_file_path = args.output / "_".join([filestem.replace(".", "_"), suffix])
    else:
        out_file_path = args.output

    # download GenBank file
    download_file(
        genbank_url, args, out_file_path, logger, accession_number, "GenBank file",
    )

    return

### Compiling the URL

NCBI assembly names can include a variety of characters that need escaping (whitespace, slashes, commas, hashes and parentheses), but NCBI requires all of these to be written as underscores within its URLs. The Python regular expression module (`re`) is used to replace any such characters with underscores prior to compiling the URL.


In [None]:
def compile_url(
    accession_number,
    assembly_name,
    logger,
    suffix,
    ftpstem="ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
):
   
    # Extract assembly name, replacing alternative escape characters
    escape_characters = re.compile(r"[\s/,#\(\)]")
    escape_name = re.sub(escape_characters, "_", assembly_name)

    # compile filestem
    filestem = "_".join([accession_number, escape_name])

    # separate out filestem into GCstem, accession number integers and discarded
    url_parts = tuple(filestem.split("_", 2))

    # separate identifying numbers from version number
    sub_directories = "/".join(
        [url_parts[1][i : i + 3] for i in range(0, len(url_parts[1].split(".")[0]), 3)]
    )

    # return url for downloading file
    return (
        "{0}/{1}/{2}/{3}/{3}_{4}".format(
            ftpstem, url_parts[0], sub_directories, filestem, suffix
        ),
        filestem,
    )
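
A worked example may help to follow how the URL is assembled. The sketch below repeats the logic of `compile_url` in a standalone function (without the logger) and applies it to a made-up accession number and assembly name; `compile_url_sketch` and the example values are introduced here for illustration only.

```python
import re


def compile_url_sketch(
    accession_number,
    assembly_name,
    suffix="genomic.gbff.gz",
    ftpstem="ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
):
    # Replace whitespace, '/', ',', '#', '(' and ')' with underscores
    escape_name = re.sub(r"[\s/,#\(\)]", "_", assembly_name)
    filestem = "_".join([accession_number, escape_name])

    # Split into the prefix (e.g. 'GCA'), the numbers with version, and the rest
    prefix, number_and_version, _ = filestem.split("_", 2)
    digits = number_and_version.split(".")[0]

    # Break the accession digits into groups of three to form the directory path
    sub_directories = "/".join(
        number_and_version[i : i + 3] for i in range(0, len(digits), 3)
    )
    url = f"{ftpstem}/{prefix}/{sub_directories}/{filestem}/{filestem}_{suffix}"
    return url, filestem


# Made-up accession number and assembly name, for illustration only
url, filestem = compile_url_sketch("GCA_000123456.1", "ASM12345 v1")
print(filestem)
print(url)
```

For this input the file stem is `GCA_000123456.1_ASM12345_v1` and the directory path derived from the accession digits is `GCA/000/123/456`.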


### Downloading the GenBank file

The downloading of the GenBank file is initiated by calling the `download_file` function.

The URL connection is coordinated by the Python `urllib` module ([documentation found here](https://docs.python.org/3/library/urllib.html)).

The Python `tqdm` module is also used to provide a visual log of the download progress in the terminal.


In [None]:
def download_file(
    genbank_url, args, out_file_path, logger, accession_number, file_type
):
    # Try URL connection
    try:
        response = urlopen(genbank_url, timeout=args.timeout)
    except (HTTPError, URLError, timeout):
        # log error and exit downloading of GenBank file
        return

    if out_file_path.exists():
        logger.warning(f"Output file {out_file_path} exists, not downloading")

    else:
        # Download file
        file_size = int(response.info().get("Content-length"))
        bsize = 1_048_576

        try:
            with open(out_file_path, "wb") as out_handle:
                # Using leave=False as this will be an internally-nested progress bar
                with tqdm(
                    total=file_size,
                    leave=False,
                    desc=f"Downloading {accession_number} {file_type}",
                ) as pbar:
                    while True:
                        buffer = response.read(bsize)
                        if not buffer:
                            break
                        pbar.update(len(buffer))
                        out_handle.write(buffer)
        except IOError:
            # log error and exit downloading of GenBank file (exception type assumed)
            return

    return
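
The chunked read-and-write loop at the centre of the download can be exercised without a network connection. In the sketch below an in-memory `io.BytesIO` object stands in for the URL response and for the output file; `copy_in_chunks` is a name introduced for this example, and the progress bar is left out.

```python
import io


def copy_in_chunks(response, out_handle, chunk_size=1_048_576):
    """Copy a file-like response to out_handle, returning the bytes written."""
    total = 0
    while True:
        buffer = response.read(chunk_size)
        if not buffer:
            # An empty read signals the end of the stream
            break
        out_handle.write(buffer)
        total += len(buffer)
    return total


# In-memory stand-ins for the URL response and the output file
fake_response = io.BytesIO(b"x" * 2_500_000)
destination = io.BytesIO()
bytes_written = copy_in_chunks(fake_response, destination)
print(bytes_written)
```

Reading in fixed-size chunks keeps memory use bounded regardless of the size of the GenBank file.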
    

## Writing out the dataframe

If the option to write out the dataframe to a file, rather than STDOUT, is enabled, the function `write_out_dataframe` will be called.

The dataframe is written out as a .csv file (comma-separated values format), so that the file can be opened and read by several packages, including Microsoft Office Excel, and easily parsed by other Python scripts that use `pandas`.


In [None]:
def write_out_dataframe(species_table, logger, outdir, force, nodelete):

    # Check if overwrite of existing directory will occur
    logger.info("Checking if output directory for dataframe already exists")
    if outdir.exists():
        if force is False:
            logger.warning(
                "Specified directory for dataframe already exists.\nExiting writing out dataframe."
            )
            return
        else:
            logger.warning(
                "Specified directory for dataframe already exists.\nForced overwriting enabled."
            )

    # Check if user included .csv file extension
    # (outdir is a pathlib.Path, so the suffix is checked rather than using str methods)
    if outdir.suffix == ".csv":
        species_table.to_csv(outdir)
    else:
        out_df = Path(f"{outdir}.csv")
        species_table.to_csv(out_df)

    return
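
Because `outdir` is tested with `.exists()`, it is presumably a `pathlib.Path` object. Under that assumption, the check for the file extension can be sketched with `Path.suffix`; the helper name `with_csv_suffix` and the example paths are hypothetical.

```python
from pathlib import Path


def with_csv_suffix(out_path):
    """Return out_path unchanged if it ends in .csv, otherwise append .csv."""
    out_path = Path(out_path)
    if out_path.suffix == ".csv":
        return out_path
    # Append rather than replace, so any other dots in the name are kept
    return Path(f"{out_path}.csv")


print(with_csv_suffix("results/genome_dataframe"))
print(with_csv_suffix("genome_dataframe.csv"))
```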

    For the downloading of GenBank files for the project, the following dataframe was written:

In [2]:
import pandas as pd
from pathlib import Path

input_path = ('/Section1_Extracting_Genomes/2020_05_31_genome_dataframe.csv')

input_df = pd.read_csv(
        input_path,
        header=0,
        names=["Genus", "Species", "NCBI Taxonomy ID", "NCBI Accession Numbers"],
    )

print(input_df)
