# 01 Downloading GenBank Files

## pyrewton module genbank, submodule get_ncbi_genomes

This notebook describes the process of downloading GenBank files for all genomic assemblies associated with each species of interest.

This notebook refers to the `pyrewton` submodule `get_ncbi_genomes`.


In [None]:
<div class="alert-warning">
<p></p>
<b>Note:</b> the genomic assemblies are downloaded as GenBank Flat Files (.gbff). This format was chosen because it provides both nucleotide sequences and annotation data. The annotations include the product of each annotated sequence and a unique identifier, which can be used to find further information about the annotated object in NCBI-linked databases such as [UniProtKB](https://www.uniprot.org/). Additionally, the BioPython module [SeqIO](https://biopython.org/wiki/SeqIO), a standard ‘Sequence Input/Output’ interface for sequence-handling Python scripts, facilitates the automated parsing of GenBank files (including GenBank Flat Files) and the extraction of specified data (BioPython, 2020).
<p></p>
</div>

<div class="alert-danger">
<p></p>
<b>Note:</b> this notebook does not include the entirety of the script code, but instead includes excerpts to help illustrate the programme architecture and function. Specifically, the function _main_, which orchestrates the calling of functions to perform the overall operation of the script, is excluded from this notebook, as are logging and error checking.
</div>
<p></p>
<div class="alert-danger">    
Additionally, in some instances sections of code have been removed, replaced by a comment indicating the intent or function of the code, and/or introduced later in the notebook than they appear in the script. This enables a detailed description of the code's function at a more logical and opportune point.
</div>
<p></p>
<div class="alert-success">
<p></p>
<p>For the complete script, navigate to `pyrewton/genbank/get_ncbi_genomes` within the repository.</p>
<p></p>
</div>

## Contents


- [Operating the script](#linkoperating)
- [Script input](#linkinput)
- [Command-line options](#linkcommand)
- [Python imports](#linkimports)
- [Parsing the input file and creating the dataframe](#linkparsing)
    - [Scientific name and Taxonomy ID retrieval](#linkscientific)
    - [Creating the dataframe](#linkcdf)
- [Retrieving accession numbers](#linkretrieving)
    - [Retrieval of assembly IDs](#linkassemblyids)
    - [Retrieval of the accession numbers](#linkanretrieve)
- [Downloading GenBank files](#linkgenbank)
    - [Compiling the URL](#linkurl)
    - [Downloading the GenBank file](#linkdownload)
- [Writing out the dataframe](#linkoutdf)


<a id="linkoperating"></a>

## Operating the script

The script `get_ncbi_genomes.py` is written as a command-line programme and is therefore most easily operated from the command line. The standard structure of the command-line call is:
`python3 get_ncbi_genomes.py -u <user email> <other options>`

Multiple options are available to customise the operation of the script to the user's needs. A full list of the available options is included in the documentation at [Read the Docs](https://phd-project-scripts.readthedocs.io/en/latest/genbank.html#genbank-mod-get-ncbi-genomes-def), as well as further on in this notebook.

An email address must be provided because NCBI requires one for remote access to its databases using `Entrez`.


<div class="alert-info">
    For the downloading of the GenBank files for the PhD project, the following code was run at the command line:
</div>

> `python3 get_ncbi_genomes.py -u eemh1@st-andrews.ac.uk -i selected_species_list.txt -o 2020_05_31_GenBank_file_pulldown/ -l 2020_05_31_GB_file_download -v -d 2020_05_31_genome_dataframe.csv`

<div class="alert-danger">
Note: Before using the script `get_ncbi_genomes.py`, read the documentation for Entrez, taking care to note the expected practices laid out under _'Frequency, Timing and Registration of E-utility URL Requests'_. Failing to meet the expected practices can result in restricted or banned access to the Entrez utility.
</div>

Entrez documentation can be found at the following links:
- ['Entrez documentation'](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
- ['BioPython Entrez module documentation'](https://biopython.org/DIST/docs/api/Bio.Entrez-module.html)

<a id="linkinput"></a>

## Script input

The script takes a plain text file as input, containing a list of the species of interest. Each line contains a unique species, identified by its scientific name or NCBI taxonomy ID (including the 'NCBI:txid' prefix). An input file template can be found within the `get_ncbi_genomes` directory within the repository.
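
As an illustration of the input format, the sketch below classifies each line of a small example input. The helper name `classify_line` and the sample contents are hypothetical and are not part of `pyrewton`; only the '#' comment marker and the 'NCBI:txid' prefix are taken from this notebook.

```python
SAMPLE_INPUT = """\
# Species of interest (comment lines start with '#')
Aspergillus niger
NCBI:txid5518
"""


def classify_line(line):
    """Return 'comment', 'taxonomy_id' or 'scientific_name' for one input line."""
    if line.startswith("#"):
        return "comment"
    if line.startswith("NCBI:txid"):
        return "taxonomy_id"
    return "scientific_name"


for line in SAMPLE_INPUT.splitlines():
    print(f"{classify_line(line):<16} {line}")
```

Each non-comment line therefore needs only one of the two identifiers; the script looks up the other one.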


<div class="alert alert-warning">
For the downloading of the GenBank files for the PhD project, the file `selected_species_list.txt` within the `get_ncbi_genomes` directory was used as the input file.
</div>

<a id="linkcommand"></a>

## Command-line options

The script is designed as a command-line programme, so its operation can be customised by passing arguments at the command line.

**Compulsory argument**<br>
The option `-u` or `--user` <font color=red>**must**</font> be used to provide the user's email address, because this is a requirement of Entrez, which is used to call NCBI.

**Optional arguments**<br>

`-d, --dataframe`<br>
&emsp;&emsp;Specify the output path for the dataframe (include the file extension). If not provided, the dataframe will be written to STDOUT.

`-f, --force`<br>
&emsp;&emsp;Enable writing in the specified output directory if the output directory already exists.

`-g, --genbank`<br>
&emsp;&emsp;Enable or disable downloading of GenBank files.

`-h, --help`<br>
&emsp;&emsp;Display help messages and exit.

`-i, --input`<br>
&emsp;&emsp;Specify the path to the input file (include the file extension). If not given, input will be taken from STDIN.

`-l, --log`<br>
&emsp;&emsp;Specify the name of the log file (include the file extension). If this option is not given, no log file will be written out;
however, logs will still be printed to the terminal.

`-n, --nodelete`<br>
&emsp;&emsp;Enable not deleting files in an existing output directory. If enabled, and the output directory exists and writing in the output directory is 'forced', the files already in the output directory will not be deleted, and new files will be written to the output directory.

`-o, --output`<br>
&emsp;&emsp;Specify filename (including extension) of output file. If not given, output will be written to STDOUT.
If only the filename is given, Extract_genomes_NCBI.py

`-r, --retries`<br>
&emsp;&emsp;Specify the maximum number of times a call to NCBI is retried before the call is abandoned. The default is a maximum of 10 retries.

`-t, --timeout`<br>
&emsp;&emsp;Specify timeout limit of URL connection when downloading GenBank files. Default is 10 seconds.

`-v, --verbose`<br>
&emsp;&emsp;Enable verbose logging - changes logger level from WARNING to INFO.
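
To make the option list more concrete, here is a minimal `argparse` sketch covering a subset of the options above. It is an assumption about how such a parser could look, not the actual `build_parser` from `pyrewton.parsers.parser_get_ncbi_genomes`; the function name `build_sketch_parser` and the example email address are hypothetical, while the defaults of 10 retries and 10 seconds are those stated above.

```python
import argparse
import sys
from pathlib import Path


def build_sketch_parser():
    """Build a parser covering a subset of the options listed above."""
    parser = argparse.ArgumentParser(prog="get_ncbi_genomes.py")
    parser.add_argument("-u", "--user", required=True, help="email address for Entrez")
    parser.add_argument("-i", "--input", type=Path, default=sys.stdin, help="input file")
    parser.add_argument("-o", "--output", type=Path, default=sys.stdout, help="output path")
    parser.add_argument("-r", "--retries", type=int, default=10, help="maximum retries")
    parser.add_argument("-t", "--timeout", type=int, default=10, help="URL timeout (s)")
    parser.add_argument("-f", "--force", action="store_true", help="write to existing directory")
    parser.add_argument("-v", "--verbose", action="store_true", help="log at INFO level")
    return parser


# Parse an example command line (hypothetical email address)
args = build_sketch_parser().parse_args(
    ["-u", "user@example.com", "-i", "selected_species_list.txt", "-v"]
)
print(args.input, args.retries, args.timeout, args.verbose)
```

Using `sys.stdin` and `sys.stdout` as defaults mirrors the `args.output is not sys.stdout` checks that appear in the script excerpts later in this notebook.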

<a id="linkimports"></a>

## Python imports

In [None]:
import logging
import re
import sys
import time

from socket import timeout
from typing import List, Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

import pandas as pd

from Bio import Entrez
from tqdm import tqdm

from pyrewton.loggers import build_logger
from pyrewton.parsers.parser_get_ncbi_genomes import build_parser
from pyrewton.file_io import make_output_directory, write_out_dataframe

<a id="linkparsing"></a>

## Parsing the input file and creating the dataframe

The input is read either from STDIN or, if a path was provided at the command line, from the file at that path.

Opening and parsing of the input file is performed by the function `parse_input_file`.

The function first checks that the path to the input file is valid; if it is not, the programme terminates.
Otherwise, the file is opened and parsed line by line.

In [None]:
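def parse_input_file(input_filename, logger, retries):

    # test path to input file exists, if not exit programme
    if not input_filename.is_file():
        # report to user and exit programme
        logger.error(f"Input file not found: {input_filename}. Terminating programme.")
        sys.exit(1)

    # if path to input file exists proceed
    # parse input file
    with open(input_filename) as file:
        input_list = file.read().splitlines()

    # list collecting one [genus, species, taxonomy ID] entry per species
    all_species_data = []
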
    # Parse input, retrieving tax ID or scientific name as appropriate
    line_count = 0
    for line in tqdm(input_list, desc="Reading lines"):
        line_count += 1

        if line.startswith("#"):
            continue

        line_data = parse_line(line, logger, line_count, retries)
        all_species_data.append(line_data)

    # create dataframe containing three columns: 'Genus', 'Species', 'NCBI Taxonomy ID'
    species_table = pd.DataFrame(
        all_species_data, columns=["Genus", "Species", "NCBI Taxonomy ID"]
    )
    return species_table


The function `parse_line` coordinates the calls needed to complete the data for each species: it retrieves the NCBI taxonomy ID if the scientific name is provided, or the scientific name if the taxonomy ID is provided. Comment lines (indicated by the starting character '#') are skipped by `parse_input_file` before `parse_line` is called.

The scientific name and taxonomy ID are stored as a single list for each species. This list contains three elements:
- The species 'Genus' name
- The species 'Species' name
- The species taxonomy ID (including the 'NCBI:txid' prefix).

The list is returned by the function and appended to the list `all_species_data` within the `parse_input_file` function. This creates a list of lists, with each element containing the identification data for a unique species.


In [None]:
def parse_line(line, logger, line_count, retries):
    
    # For taxonomy ID retrieve scientific name
    if line.startswith("NCBI:txid"):
        gs_name = get_genus_species_name(line[9:], logger, line_count, retries)
        line_data = gs_name.split()
        line_data.append(line)
        
    # For scientific name retrieve taxonomy ID
    else:
        tax_id = get_tax_id(line, logger, line_count, retries)
        line_data = line.split()
        line_data.append(tax_id)

    return line_data


<a id="linkscientific"></a>

### Scientific name and Taxonomy ID retrieval

The retrieval of scientific names and taxonomy IDs from the NCBI Taxonomy database is performed by the functions `get_genus_species_name` and `get_tax_id`, respectively.

Each function calls the NCBI Taxonomy database using `Entrez`.
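
Both functions make their Entrez calls through a helper named `entrez_retry`, whose definition is not reproduced in this notebook. The following is a minimal sketch of what such a retry wrapper might look like, assuming it simply repeats the call until it succeeds or the retry limit is reached. The names `retry_call` and `flaky_lookup` are hypothetical, and the stand-in function makes no network calls.

```python
import logging


def retry_call(logger, retries, func, *func_args, **func_kwargs):
    """Call func until it succeeds or `retries` attempts have failed.

    Returns the function's result, or None if every attempt raised IOError.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*func_args, **func_kwargs)
        except IOError as error:
            logger.warning("Attempt %d of %d failed: %s", attempt, retries, error)
    logger.error("All %d attempts failed", retries)
    return None


# Stand-in for a network call: fails twice, then succeeds
calls = {"count": 0}


def flaky_lookup(term):
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("temporary network failure")
    return f"record for {term}"


logger = logging.getLogger("retry_sketch")
result = retry_call(logger, 10, flaky_lookup, "Aspergillus niger")
print(result)
```

Returning `None` after the final failure is a design assumption in this sketch; the real helper may signal failure differently.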


<div class="alert-danger">
If the retrieval of the scientific name or taxonomy ID fails, the null value 'NA' is returned.
The retrieval of the taxonomy ID (`get_tax_id`) includes an additional check: if the scientific name passed to the function contains any digits, the null value 'NA' is returned. The most common causes are a typo in the scientific name or the omission of the 'NCBI:txid' prefix from a taxonomy ID written in the input file.
More detail on the probable causes of common errors is given in the project's README.md file.
</div>

In [None]:
def get_genus_species_name(taxonomy_id, logger, line_number, retries):

    # Retrieve scientific name
    with entrez_retry(
        logger, retries, Entrez.efetch, db="Taxonomy", id=taxonomy_id, retmode="xml"
    ) as handle:
        record = Entrez.read(handle)

    # extract scientific name from record
    try:
        return record[0]["ScientificName"]

    except IndexError:
        # log error and return null value 'NA'
        return "NA"


def get_tax_id(genus_species, logger, line_number, retries):
    # check for potential mistake in taxonomy ID prefix
    if re.search(r"\d", genus_species):
        # log warning that numbers were found in line which was identified as a scientific name
        return "NA"

    else:
        with entrez_retry(
            logger, retries, Entrez.esearch, db="Taxonomy", term=genus_species
        ) as handle:
            record = Entrez.read(handle)

    # extract taxonomy ID from record
    try:
        return "NCBI:txid" + record["IdList"][0]

    except IndexError:
        # log error
        return "NA"
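
The digit check in `get_tax_id` can be tried in isolation. The sketch below uses the same regular expression, `\d`, on a few example strings; the helper name `contains_digit` and the example strings are hypothetical.

```python
import re


def contains_digit(name):
    """Return True if the candidate scientific name contains any digit."""
    return re.search(r"\d", name) is not None


for candidate in ["Aspergillus niger", "txid5061", "Aspergillus n1ger"]:
    status = "rejected ('NA')" if contains_digit(candidate) else "accepted"
    print(f"{candidate!r}: {status}")
```

A taxonomy ID written without its 'NCBI:' prefix, such as the second example, is treated as a scientific name by `parse_line` and is then caught by this check.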


<a id="linkcdf"></a>

### Creating the dataframe

The `pandas` module is used to create a dataframe, within the `parse_input_file` function, with three columns:
- Genus
- Species
- NCBI Taxonomy ID

Each row holds a unique species.

`pandas` simplifies the storage and manipulation of tabular data. More information is available on the [pandas website](https://pandas.pydata.org/).

The dataframe (`species_table`) is returned to the function `main`.

In [None]:
    species_table = pd.DataFrame(
        all_species_data, columns=["Genus", "Species", "NCBI Taxonomy ID"]
    )
    return species_table

<a id="linkretrieving"></a>

## Retrieving accession numbers

The `pandas` `apply` method applies a function along a given axis of a dataframe (row by row or column by column), which avoids writing an explicit Python `for` loop. It is used here to retrieve the accession numbers of all genomic assemblies associated with each taxonomy ID within the `species_table` dataframe: the function `get_accession_numbers` is called once per row, and the retrieved accession numbers are stored in a new column of the dataframe, `NCBI Accession Numbers`.


<div class="alert-warning">
Entrez will only retrieve the accession numbers of genomic assembly entries within the NCBI Assembly database that are **directly** linked to the taxonomy entry (identified by its taxonomy ID); it will not return accession numbers for genomic assembly entries that are **subtree** linked.

**Directly** linked entries are those that have the organism assigned as the source organism for the data.<br>
**Subtree** linked entries are those records associated with the taxonomy node and all nodes underneath it.

Browsing a taxonomy entry within the NCBI Taxonomy database via the browser will quickly show whether any genomic assemblies have been directly linked.
</div>

<div class="alert-danger">
As before, if the retrieval of the accession numbers fails at any point, the null value 'NA' is returned and the retrieval of the accession numbers is abandoned.

If the taxonomy ID could not be retrieved previously (and was thus stored as 'NA'), the null value 'NA' is automatically returned for the accession numbers.
</div>

In [None]:
species_table["NCBI Accession Numbers"] = species_table.apply(
    get_accession_numbers, args=(logger, args), axis=1
)


The `pandas apply` function passes a `pandas series` with each element (or cell within the dataframe) accessible via an index number, similar to accessing an element in a list.

- df_row[0]: Genus
- df_row[1]: Species
- df_row[2]: Taxonomy ID
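
The row-wise behaviour of `apply` can be seen in a small self-contained sketch that needs no network access. The function `describe_row`, the `Lookup` column and the second ('placeholder') row are hypothetical. The sketch reads the cell with `.iloc[2]`, which is explicitly positional; this is a choice made for the sketch, whereas the script itself uses `df_row[2]`.

```python
import pandas as pd

species_table = pd.DataFrame(
    [["Aspergillus", "niger", "NCBI:txid5061"], ["Genus", "placeholder", "NA"]],
    columns=["Genus", "Species", "NCBI Taxonomy ID"],
)


def describe_row(df_row, prefix):
    """Toy stand-in for get_accession_numbers: no network access."""
    tax_id = df_row.iloc[2]  # same cell as df_row["NCBI Taxonomy ID"]
    if tax_id == "NA":
        return "NA"
    return f"{prefix}{tax_id[9:]}"  # [9:] strips the 'NCBI:txid' prefix


# axis=1 passes one row (a pandas Series) at a time; args supplies extra arguments
species_table["Lookup"] = species_table.apply(describe_row, args=("taxid=",), axis=1)
print(species_table)
```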


<a id="linkassemblyids"></a>

### Coordination of accession number retrieval


The retrieval of the accession numbers of all directly linked genomic assemblies for each species of interest is coordinated by multiple functions.

A single function (`get_accession_numbers`) coordinates the multiple calls to Entrez required to ultimately retrieve the accession numbers.

The first call to Entrez (`get_assembly_ids`) links the taxonomy entry to the species' directly linked genomic assemblies.

The second call (`post_assembly_ids`) posts the IDs of the genomic assemblies to Entrez, so that the genomic assembly data can be requested as a single query, which reduces unnecessary traffic to Entrez.

The final call (`retrieve_accession_numbers`) uses the web environment data from the posting of the assembly IDs to retrieve the genomic assembly data from Entrez.


In [None]:
def get_accession_numbers(df_row, logger, args):
    """Return all NCBI accession numbers associated with NCBI Taxonomy ID.

    Reminder of Pandas series structure:
    df_row[0]: Genus
    df_row[1]: Species
    df_row[2]: Taxonomy ID

    Return list of NCBI accession numbers.
    """
    # If previously failed to retrieve the taxonomy ID cancel retrieval of accession numbers
    if df_row[2] == "NA":
        # log warning
        return "NA"

    # Retrieve all IDs of genomic assemblies for taxonomy ID

    logger.info(f"Retrieving assembly IDs for {df_row[2]}")
    assembly_id_list = get_assembly_ids(df_row, logger, args)

    # Check if assembly ID retrieval was successful
    if assembly_id_list == "NA":
        # log error
        return "NA"

    logger.info(f"Posting assembly IDs for {df_row[2]} to retrieve accession numbers")
    epost_webenv_data = post_assembly_ids(assembly_id_list, df_row, logger, args)

    # Check web environment data was retrieved from epost
    if epost_webenv_data == "NA":
        # log warning
        return "NA"

    logger.info(f"Retrieving accession numbers for {df_row[2]}")
    accession_numbers = retrieve_accession_numbers(
        epost_webenv_data, df_row, logger, args
    )

    if accession_numbers == "NA":
        logger.error(
            (
                f"Failed to retrieve accession numbers for {df_row[2]}.\n"
                "Returning 'NA' for accession numbers."
            )
        )
        return "NA"

    logger.info(f"Finished processing retrieval accession numbers for {df_row[2]}")

    return accession_numbers

#### Retrieval of assembly IDs


The first call to Entrez retrieves genomic assembly data from the NCBI Assembly database using data from the NCBI Taxonomy database, which requires `elink`: `elink` looks up linked entries in other NCBI databases. This is coordinated by the function `get_assembly_ids`, which retrieves the IDs of all directly linked genomic assemblies for the given NCBI Taxonomy ID as a list.


In [None]:
def get_assembly_ids(df_row, logger, args):

    # df_row[2][9:] removes 'NCBI:txid' prefix
    with entrez_retry(
        logger,
        args.retries,
        Entrez.elink,
        dbfrom="Taxonomy",
        id=df_row[2][9:],
        db="Assembly",
        linkname="taxonomy_assembly",
    ) as assembly_number_handle:
        try:
            assembly_number_record = Entrez.read(assembly_number_handle)
        # if no record is returned from call to Entrez
        except (TypeError, AttributeError) as error:
            # log error
            return "NA"

    # extract assembly IDs from record
    try:
        # 'link' is used as the loop variable to avoid shadowing the built-in dict
        assembly_id_list = [
            link["Id"] for link in assembly_number_record[0]["LinkSetDb"][0]["Link"]
        ]

    except IndexError:
        # log error
        return "NA"

    return assembly_id_list


#### Posting of assembly IDs


To minimise the number of calls to the NCBI database, the list of IDs is posted as a single query to NCBI using `Entrez`. This is coordinated by the function `post_assembly_ids`, which also retrieves the web environment and query key needed to subsequently retrieve the accession numbers of the identified genomic assemblies.


In [None]:
def post_assembly_ids(assembly_id_list, df_row, logger, args):

    # compile list of ids in suitable format for epost
    id_post_list = str(",".join(assembly_id_list))
    # Post all assembly IDs to Entrez-NCBI for downstream pulldown of accession numbers
    try:
        epost_search_results = Entrez.read(
            entrez_retry(
                logger, args.retries, Entrez.epost, "Assembly", id=id_post_list
            )
        )
    # if no record is returned from call to Entrez
    except (TypeError, AttributeError) as error:
        # log error
        return "NA"

    # Retrieve web environment and query key from Entrez epost
    epost_webenv = epost_search_results["WebEnv"]
    epost_query_key = epost_search_results["QueryKey"]

    return epost_webenv, epost_query_key

#### Retrieve accession numbers


Entrez is used again to retrieve the accession numbers from the previously posted query, which is coordinated by the function `retrieve_accession_numbers`. The accession numbers are collated into a single list, which is then converted into a string; this prevents the list's enclosing square brackets `[]` from appearing in the final dataframe.

Within `retrieve_accession_numbers`, if the downloading of GenBank files is enabled, the function that coordinates the download of the associated GenBank file is called for each accession number as it is retrieved.


In [None]:
def retrieve_accession_numbers(webenv, df_row, logger, args):
    # create empty list to store accession numbers
    ncbi_accession_numbers_list = []

    with entrez_retry(
        logger,
        args.retries,
        Entrez.efetch,
        db="Assembly",
        query_key=webenv[1],
        WebEnv=webenv[0],
        rettype="docsum",
        retmode="xml",
    ) as accession_handle:
        try:
            accession_record = Entrez.read(accession_handle, validate=False)
        # if no record is returned from call to Entrez
        except (TypeError, AttributeError) as error:
            # log error
            return "NA"

    # Extract accession numbers from document summary
    for index_number in tqdm(
        range(len(accession_record["DocumentSummarySet"]["DocumentSummary"])),
        desc=f"Retrieving accessions ({df_row[2]})",
    ):
        try:
            new_accession_number = accession_record["DocumentSummarySet"][
                "DocumentSummary"
            ][index_number]["AssemblyAccession"]
            ncbi_accession_numbers_list.append(new_accession_number)

        except IndexError:
            total_assemblies = len(
                accession_record["DocumentSummarySet"]["DocumentSummary"]
            )
            # log error when fails to retrieve data
            return "NA"

        # If downloading of GenBank files is enabled, download Genbank files
        if args.genbank is True:
            get_genbank_files(
                new_accession_number,
                accession_record["DocumentSummarySet"]["DocumentSummary"][index_number][
                    "AssemblyName"
                ],
                logger,
                args,
            )

        index_number += 1

    # Process accession numbers into human readable list for dataframe
    ncbi_accession_numbers = ", ".join(ncbi_accession_numbers_list)

    return ncbi_accession_numbers
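
The effect of joining the list into a string, as done at the end of `retrieve_accession_numbers`, can be seen in a short example. It uses the two accession numbers listed for *Aspergillus nidulans* in the dataframe at the end of this notebook; the variable names are illustrative only.

```python
accessions = ["GCA_011075025.1", "GCA_011074995.1"]

as_list_text = str(accessions)  # keeps the brackets and quotes of the list
as_joined_text = ", ".join(accessions)  # plain comma-separated text

print(as_list_text)
print(as_joined_text)

# The joined string can be split back into a list when needed
assert as_joined_text.split(", ") == accessions
```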


<a id="linkgenbank"></a>

## Downloading GenBank files

After the newly retrieved accession number is added to the growing list of all accession numbers for a specific species, a check is performed to see whether downloading of the GenBank file (.gbff) of the genomic assembly entry is enabled.

If enabled, the downloading of the GenBank file is initiated by calling the `get_genbank_files` function, which coordinates the multiple stages of downloading the GenBank file.

In [None]:
def get_genbank_files(
    accession_number,
    assembly_name,
    logger,
    args,
    suffix="genomic.gbff.gz",
):
    
    # compile url for download
    genbank_url, filestem = compile_url(accession_number, assembly_name, logger, suffix)

    # if downloaded file is not to be written to STDOUT, compile output path
    if args.output is not sys.stdout:
        out_file_path = args.output / "_".join([filestem.replace(".", "_"), suffix])
    else:
        out_file_path = args.output

    # download GenBank file
    download_file(
        genbank_url, args, out_file_path, logger, accession_number, "GenBank file",
    )

    return
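
As a quick illustration of how the output path is assembled in `get_genbank_files`, the sketch below repeats the path expression with example values. The output directory is taken from the example command earlier in this notebook; the assembly name inside `filestem` is hypothetical.

```python
from pathlib import Path

output_dir = Path("2020_05_31_GenBank_file_pulldown")  # from the example command
filestem = "GCA_011075025.1_ASM1107502v1"  # assembly name part is hypothetical
suffix = "genomic.gbff.gz"

# Dots in the file stem are replaced by underscores before the suffix is appended
out_file_path = output_dir / "_".join([filestem.replace(".", "_"), suffix])
print(out_file_path.name)
```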


<a id="linkurl"></a>

### Compiling the URL

Downloading a file requires a URL from which to download it and an output path to which the downloaded data is written. The URL, and the file stem used to build the output path, are created by the `compile_url` function.

NCBI records can include a variety of special characters (such as whitespace, slashes, commas, hashes and parentheses) in assembly names, but these characters must be written as underscores within its URLs. Python's regular expression module (`re`) is used to replace any such characters with underscores prior to compiling the URL.

<div class="alert-warning">NCBI genomic assembly files are available for download from the NCBI FTP site; hence the prefix (the ftpstem) and structure of the compiled URL do not match the structure of the URL seen when searching the NCBI Assembly database in-browser.</div>


In [None]:
def compile_url(
    accession_number,
    assembly_name,
    logger,
    suffix,
    ftpstem="ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
):
   
    # Extract assembly name, replacing alternative escape characters with underscores
    escape_characters = re.compile(r"[\s/,#\(\)]")
    escape_name = re.sub(escape_characters, "_", assembly_name)

    # compile filestem
    filestem = "_".join([accession_number, escape_name])

    # separate out filestem into GCstem, accession number integers and discarded
    url_parts = tuple(filestem.split("_", 2))

    # separate identifying numbers from version number
    sub_directories = "/".join(
        [url_parts[1][i : i + 3] for i in range(0, len(url_parts[1].split(".")[0]), 3)]
    )

    # return url for downloading file
    return (
        "{0}/{1}/{2}/{3}/{3}_{4}".format(
            ftpstem, url_parts[0], sub_directories, filestem, suffix
        ),
        filestem,
    )
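
To make the steps in `compile_url` easier to follow, the sketch below walks through the same logic with one example. The accession number is taken from the dataframe at the end of this notebook, but the assembly name is hypothetical (chosen to contain spaces), so the resulting URL is illustrative and is not claimed to exist on the NCBI FTP site.

```python
import re

accession_number = "GCA_011075025.1"
assembly_name = "ASM 1107502 v1"  # hypothetical name containing spaces

# Replace whitespace, '/', ',', '#', '(' and ')' with underscores
escape_name = re.sub(r"[\s/,#\(\)]", "_", assembly_name)
filestem = "_".join([accession_number, escape_name])

# Split into prefix ('GCA'), numeric accession with version, and the remainder
prefix, numbers, _ = filestem.split("_", 2)
digits = numbers.split(".")[0]

# Group the accession digits in threes to form the nested directories
sub_directories = "/".join(digits[i : i + 3] for i in range(0, len(digits), 3))

url = "{0}/{1}/{2}/{3}/{3}_{4}".format(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
    prefix,
    sub_directories,
    filestem,
    "genomic.gbff.gz",
)
print(url)
```

The file stem appears twice in the URL: once as the directory name and once as the start of the file name.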


<a id="linkdownload"></a>

### Downloading the GenBank file

The downloading of the GenBank file is initiated by calling the `download_file` function.

The URL connection is handled by the Python `urllib` module; its documentation can be found [here](https://docs.python.org/3/library/urllib.html).

The Python `tqdm` module is also used to provide a visual log of the download progress in the terminal.


In [None]:
# signature reconstructed from the call made in get_genbank_files
def download_file(genbank_url, args, out_file_path, logger, accession_number, file_type):
    # Try URL connection
    try:
        response = urlopen(genbank_url, timeout=args.timeout)
    except (HTTPError, URLError, timeout):
        # log error and exit downloading of GenBank file
        return

    if out_file_path.exists():
        logger.warning(f"Output file {out_file_path} exists, not downloading")

    else:
        # Download file
        file_size = int(response.info().get("Content-length"))
        bsize = 1_048_576

        try:
            with open(out_file_path, "wb") as out_handle:
                # Using leave=False as this will be an internally-nested progress bar
                with tqdm(
                    total=file_size,
                    leave=False,
                    desc=f"Downloading {accession_number} {file_type}",
                ) as pbar:
                    while True:
                        buffer = response.read(bsize)
                        if not buffer:
                            break
                        pbar.update(len(buffer))
                        out_handle.write(buffer)
        except IOError:
            # error handling is omitted from this excerpt; IOError is an assumed placeholder
            # log error and abandon writing of this file
            return

    return
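
The chunked read loop at the centre of `download_file` can be tried without a network connection. The following sketch copies data between two in-memory buffers that stand in for the URL response and the output file; the function name `copy_in_chunks` is hypothetical, and the progress bar is left out to keep the example dependency-free.

```python
import io


def copy_in_chunks(response, out_handle, chunk_size=1_048_576):
    """Copy a file-like response to out_handle, returning the bytes written."""
    written = 0
    while True:
        buffer = response.read(chunk_size)
        if not buffer:  # an empty bytes object signals the end of the stream
            break
        out_handle.write(buffer)
        written += len(buffer)
    return written


# In-memory stand-ins for the URL response and the output file
fake_response = io.BytesIO(b"x" * 2_500_000)
fake_output = io.BytesIO()

total = copy_in_chunks(fake_response, fake_output)
print(total)
```

Reading in fixed-size chunks keeps memory use bounded regardless of the size of the file being downloaded, and gives a natural point at which to update a progress bar.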
    

<a id="linkoutdf"></a>

## Writing out the dataframe

If the option to write the dataframe to a file, rather than to STDOUT, is enabled, the function `write_out_dataframe` is called; this function is stored within the `pyrewton` `file_io` module.

<div class="alert-warning">To review the code for writing out the created dataframe, navigate to the `file_io` directory within the `pyrewton` repository.</div>

The dataframe is written out as a .csv file (comma-separated values format), which can be opened and read by several packages, including Microsoft Office Excel, and is easily parsed by other Python scripts that use `pandas`.

<div class="alert-danger">
When providing the output path for the .csv file, make sure to include the .csv extension. Excluding the extension will not prevent the file from being written out; however, the file will then be written without the '.csv' extension in its name.
</div>


In [None]:
    # Write out dataframe
    if args.dataframe is not sys.stdout:
        write_out_dataframe(species_table, logger, args.dataframe, args.force)
    else:
        species_table.to_csv(args.dataframe)

<div class="alert alert-warning">
For the downloading of GenBank files for the project, the dataframe was written out to '2020_05_31_genome_dataframe.csv', stored within the same directory as the script (get_ncbi_genomes). The following dataframe was written:
</div>

In [7]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [10]:
import pandas as pd

from pathlib import Path
from IPython.display import display

pd.set_option('display.max_colwidth', None)

display(pd.read_csv(
    '2020_05_31_genome_dataframe.csv',
    header=0,
    names=["Genus", "Species", "NCBI Taxonomy ID", "NCBI Accession Numbers"]
))


Unnamed: 0,Genus,Species,NCBI Taxonomy ID,NCBI Accession Numbers
0,Aspergillus,fumigatus,NCBI:txid746128,"GCA_012656185.1, GCA_012656215.1, GCA_012656165.1, GCA_012656115.1, GCA_012656125.1, GCA_005768625.2, GCA_003069565.1, GCA_002234985.1, GCA_002234955.1, GCA_001715275.2, GCA_001643655.1, GCA_001643665.1"
1,Aspergillus,nidulans,NCBI:txid162425,"GCA_011075025.1, GCA_011074995.1"
2,Aspergillus,niger,NCBI:txid5061,"GCA_011316255.1, GCA_009812365.1, GCA_004634315.1, GCA_002211485.2, GCA_900248155.1, GCA_002740505.1, GCA_001931795.1, GCA_001741915.1, GCA_001741905.1, GCA_001741885.1, GCA_001715265.1, GCA_001515345.1, GCF_000002855.3"
3,Aspergillus,sydowii,NCBI:txid75750,"GCA_009828905.1, GCA_009193685.1"
4,Fusarium,graminearum,NCBI:txid5518,"GCA_012959185.1, GCA_006942295.1, GCA_900492705.1, GCA_900476405.1, GCA_002352725.1, GCA_900044135.1, GCA_001717915.1, GCA_001717905.1, GCA_000966635.1, GCA_000966645.1, GCA_000599445.1"
5,Fusarium,oxysporum,NCBI:txid5507,"GCA_011428085.1, GCA_011426355.1, GCA_011426335.1, GCA_011424645.1, GCA_011424625.1, GCA_011424605.1, GCA_011421335.1, GCA_011421285.1, GCA_011421305.1, GCA_011421375.1, GCA_011421365.1, GCA_011421355.1, GCA_011421275.1, GCA_011421325.1, GCA_011037735.1, GCA_011037105.1, GCA_011037075.1, GCA_011036425.1, GCA_011036365.1, GCA_011036345.1, GCA_011036325.1, GCA_011036305.1, GCA_011036285.1, GCA_011036015.1, GCA_011035995.1, GCA_011035975.1, GCA_011035895.1, GCA_011035875.1, GCA_011035855.1, GCA_011035785.1, GCA_011035765.1, GCA_011035725.1, GCA_011037135.1, GCA_011037005.1, GCA_011036985.1, GCA_011036965.1, GCA_011035695.1, GCA_011036925.1, GCA_011036905.1, GCA_011036835.1, GCA_011035665.1, GCA_011035645.1, GCA_011035625.1, GCA_011035595.1, GCA_011035555.1, GCA_011035525.1, GCA_011036745.1, GCA_011035505.1, GCA_011036685.1, GCA_011036655.1, GCA_011036635.1, GCA_011036615.1, GCA_011035355.1, GCA_011036575.1, GCA_011035205.1, GCA_011035185.1, GCA_011035135.1, GCA_011035015.1, GCA_011034965.1, GCA_011034945.1, GCA_011034875.1, GCA_011034825.1, GCA_011034785.1, GCA_011034745.1, GCA_011034655.1, GCA_011034575.1, GCA_011034545.1, GCA_011034455.1, GCA_011034415.1, GCA_011034375.1, GCA_011034275.1, GCA_011034205.1, GCA_011034135.1, GCA_011034075.1, GCA_011034045.1, GCA_011034025.1, GCA_011037795.1, GCA_011033995.1, GCA_011033925.1, GCA_011033815.1, GCA_011033715.1, GCA_011033745.1, GCA_011033645.1, GCA_011036945.1, GCA_011036875.1, GCA_011036795.1, GCA_011036775.1, GCA_011036815.1, GCA_011036855.1, GCA_011036595.1, GCA_011036705.1, GCA_011036765.1, GCA_011036565.1, GCA_011036545.1, GCA_011036445.1, GCA_011036725.1, GCA_011036505.1, GCA_011036515.1, GCA_011036475.1, GCA_011036455.1, GCA_011036395.1, GCA_011036385.1, GCA_011036275.1, GCA_011036235.1, GCA_011036165.1, GCA_011036225.1, GCA_011036215.1, GCA_011036205.1, GCA_011036075.1, GCA_011036135.1, GCA_011036125.1, GCA_011036115.1, GCA_011036055.1, GCA_011036065.1, GCA_011036045.1, 
GCA_011035965.1, GCA_011035955.1, GCA_011035835.1, GCA_011035845.1, GCA_011035825.1, GCA_011035755.1, GCA_011035745.1, GCA_011033685.1, GCA_011033665.1, GCA_011033625.1, GCA_011033575.1, GCA_011033555.1, GCA_011033535.1, GCA_011033505.1, GCA_011033485.1, GCA_011033455.1, GCA_011035615.1, GCA_011035485.1, GCA_011035495.1, GCA_011035455.1, GCA_011035415.1, GCA_011035435.1, GCA_011035375.1, GCA_011035345.1, GCA_011035385.1, GCA_011035335.1, GCA_011035235.1, GCA_011035255.1, GCA_011035245.1, GCA_011035265.1, GCA_011035275.1, GCA_011035075.1, GCA_011035065.1, GCA_011035045.1, GCA_011035055.1, GCA_011035035.1, GCA_011034915.1, GCA_011034925.1, GCA_011034935.1, GCA_011034815.1, GCA_011034845.1, GCA_011034805.1, GCA_011034775.1, GCA_011034735.1, GCA_011034615.1, GCA_011034635.1, GCA_011034645.1, GCA_011034625.1, GCA_011034675.1, GCA_011034565.1, GCA_011034515.1, GCA_011034445.1, GCA_011034485.1, GCA_011034475.1, GCA_011034395.1, GCA_011034265.1, GCA_011034195.1, GCA_011034235.1, GCA_011034155.1, GCA_011034225.1, GCA_011033985.1, GCA_011033945.1, GCA_011034125.1, GCA_011034105.1, GCA_011033955.1, GCA_011033875.1, GCA_011034095.1, GCA_011033805.1, GCA_011033885.1, GCA_011033895.1, GCA_011033835.1, GCA_011033705.1, GCA_011033765.1, GCA_011033785.1, GCA_011033475.1, GCA_011033595.1, GCA_011033375.1, GCA_011033525.1, GCA_011033385.1, GCA_011032885.1, GCA_011032855.1, GCA_009746015.1, GCA_009299335.1, GCA_009299235.1, GCA_009299215.1, GCA_009299195.1, GCA_009299155.1, GCA_009299095.1, GCA_009299045.1, GCA_009298875.1, GCA_009298855.1, GCA_009298805.1, GCA_009298685.1, GCA_009298645.1, GCA_009298615.1, GCA_009298555.1, GCA_009298505.1, GCA_009298475.1, GCA_009298435.1, GCA_009298405.1, GCA_009298245.1, GCA_009298235.1, GCA_009298205.1, GCA_009298195.1, GCA_009298175.1, GCA_009298145.1, GCA_009298125.1, GCA_009298085.1, GCA_009298065.1, GCA_009298075.1, GCA_009298035.1, GCA_009297995.1, GCA_009297985.1, GCA_009297935.1, GCA_009297945.1, GCA_009297925.1, GCA_009297855.1, 
GCA_009297755.1, GCA_009297735.1, GCA_009297675.1, GCA_009297655.1, GCA_009297635.1, GCA_009297575.1, GCA_009297555.1, GCA_009297465.1, GCA_009297425.1, GCA_009297405.1, GCA_009297385.1, GCA_009297365.1, GCA_009297885.1, GCA_009297835.1, GCA_009299255.1, GCA_009299295.1, GCA_009299135.1, GCA_009299115.1, GCA_009299075.1, GCA_009299125.1, GCA_009299175.1, GCA_009298955.1, GCA_009299025.1, GCA_009298985.1, GCA_009299005.1, GCA_009298915.1, GCA_009298925.1, GCA_009298935.1, GCA_009298945.1, GCA_009298845.1, GCA_009298825.1, GCA_009298675.1, GCA_009298715.1, GCA_009298755.1, GCA_009298705.1, GCA_009298745.1, GCA_009298655.1, GCA_009298635.1, GCA_009298515.1, GCA_009298545.1, GCA_009298495.1, GCA_009298455.1, GCA_009298465.1, GCA_009298395.1, GCA_009298275.1, GCA_009298295.1, GCA_009298315.1, GCA_009298335.1, GCA_009298285.1, GCA_009298305.1, GCA_009298045.1, GCA_009297915.1, GCA_009297825.1, GCA_009297725.1, GCA_009297785.1, GCA_009297715.1, GCA_009297695.1, GCA_009297625.1, GCA_009297605.1, GCA_009297515.1, GCA_009297505.1, GCA_009297445.1, GCA_009297485.1, GCA_009297495.1, GCA_004292535.1, GCA_004291455.1, GCA_003709395.1, GCA_003705045.1, GCA_003704975.1, GCA_003705035.1, GCA_003615165.1, GCA_003615155.1, GCA_003615115.1, GCA_003615185.1, GCA_003025235.1, GCA_003025205.1, GCA_002894245.1, GCA_900096695.1, GCA_002233955.1, GCA_002233985.1, GCA_002233935.1, GCA_002233995.1, GCA_001931975.2, GCA_001703125.1, GCA_000733055.2"
6,Fusarium,proliferatum,NCBI:txid948311,"GCA_003709405.1, GCA_003705095.1, GCA_003704965.1, GCA_003704895.1, GCA_003704885.1, GCA_003704875.1, GCA_003615215.1, GCA_003290285.1, GCA_003123625.1, GCA_002234285.1, GCA_900029915.1, GCA_001705295.1"
7,Magnaporthe,grisea,NCBI:txid148305,"GCF_004355905.1, GCA_003933175.1, GCA_002925245.1, GCA_002924675.1, GCA_001548815.1, GCA_001548795.1"
8,Magnaporthe,oryzae,NCBI:txid318829,"GCA_012979135.1, GCA_012978465.1, GCA_012978415.1, GCA_012979075.1, GCA_012978505.1, GCA_012978515.1, GCA_012978495.1, GCA_012978435.1, GCA_012272995.1, GCA_012922935.1, GCA_012654135.1, GCA_012654105.1, GCA_012654075.1, GCA_012654115.1, GCA_012654035.1, GCA_012596185.1, GCA_012490815.1, GCA_012490805.1, GCA_011799965.1, GCA_011799925.1, GCA_011799915.1, GCA_011799905.1, GCA_900474545.3, GCA_900474475.3, GCA_900474655.3, GCA_900474175.3, GCA_004785725.1, GCA_004346965.1, GCA_900474375.2, GCA_900474635.2, GCA_900474435.2, GCA_900474225.2, GCA_003991345.1, GCA_003017255.1, GCA_003017175.1, GCA_003017165.1, GCA_003017125.1, GCA_003017115.1, GCA_003017045.1, GCA_003017065.1, GCA_003017035.1, GCA_003017025.1, GCA_003016985.1, GCA_003016965.1, GCA_003016955.1, GCA_003016935.1, GCA_003016905.1, GCA_003016895.1, GCA_003016875.1, GCA_003016855.1, GCA_003016825.1, GCA_003016805.1, GCA_003016795.1, GCA_003016785.1, GCA_003016745.1, GCA_003016725.1, GCA_003016715.1, GCA_003016705.1, GCA_003016665.1, GCA_003016655.1, GCA_003016635.1, GCA_003016625.1, GCA_003016585.1, GCA_003016555.1, GCA_003016575.1, GCA_003016545.1, GCA_003016505.1, GCA_003016495.1, GCA_003016475.1, GCA_003016465.1, GCA_003016425.1, GCA_003016415.1, GCA_003016395.1, GCA_003016385.1, GCA_003016325.1, GCA_003016265.1, GCA_003016275.1, GCA_003016255.1, GCA_003016245.1, GCA_003016195.1, GCA_003016185.1, GCA_003016175.1, GCA_003016165.1, GCA_003016105.1, GCA_003016115.1, GCA_003016095.1, GCA_003016085.1, GCA_003016015.1, GCA_003016035.1, GCA_003016025.1, GCA_003016005.1, GCA_003015975.1, GCA_003015955.1, GCA_003015935.1, GCA_003015925.1, GCA_003015895.1, GCA_003015885.1, GCA_003015825.1, GCA_003015835.1, GCA_003015815.1, GCA_003015805.1, GCA_003015755.1, GCA_003015745.1, GCA_003015735.1, GCA_003015705.1, GCA_003015645.1, GCA_003015655.1, GCA_003015635.1, GCA_003015625.1, GCA_003015595.1, GCA_003015565.1, GCA_003015555.1, GCA_003015545.1, GCA_003015495.1, GCA_003015515.1, 
GCA_003015465.1, GCA_003015475.1, GCA_003015425.1, GCA_003015405.1, GCA_003015395.1, GCA_003015385.1, GCA_003013125.1, GCA_002924695.1, GCA_002925445.1, GCA_002925415.1, GCA_002925425.1, GCA_002925405.1, GCA_002925385.1, GCA_002925325.1, GCA_002925335.1, GCA_002925345.1, GCA_002925295.1, GCA_002925285.1, GCA_002925215.1, GCA_002925225.1, GCA_002925205.1, GCA_002925165.1, GCA_002925145.1, GCA_002925155.1, GCA_002925095.1, GCA_002925085.1, GCA_002925105.1, GCA_002925065.1, GCA_002925045.1, GCA_002924965.1, GCA_002925025.1, GCA_002924985.1, GCA_002924975.1, GCA_002924945.1, GCA_002924885.1, GCA_002924865.1, GCA_002924915.1, GCA_002924875.1, GCA_002924825.1, GCA_002924785.1, GCA_002924835.1, GCA_002924795.1, GCA_002924755.1, GCA_002924745.1, GCA_002924665.1, GCA_002924685.1, GCA_002368515.1, GCA_002368525.1, GCA_002368485.1, GCA_002368475.1, GCA_002218485.1, GCA_002218465.1, GCA_002218475.1, GCA_002218435.1, GCA_002218425.1, GCA_002218355.1, GCA_002218345.1, GCA_002105295.1, GCA_001936935.1, GCA_001936435.1, GCA_001936075.1, GCA_001853415.2, GCA_001675605.1, GCA_001675625.1, GCA_001675595.1, GCA_001675615.1, GCA_001548855.1, GCA_001548845.1, GCA_001548775.1, GCA_001548785.1, GCA_000805855.1, GCA_000734785.1, GCA_000734755.1, GCA_000734735.1, GCA_000734685.1, GCA_000734675.1, GCA_000734705.1, GCA_000734655.1, GCA_000734635.1, GCA_000734605.1, GCA_000734595.1, GCA_000734575.1, GCA_000734555.1, GCA_000734515.1, GCA_000734525.1, GCA_000734495.1, GCA_000734455.1, GCA_000734425.1, GCA_000734395.1, GCA_000734405.1, GCA_000734325.1, GCA_000734345.1, GCA_000734335.1, GCA_000734315.1, GCA_000734275.1, GCA_000734265.1, GCA_000734245.1, GCA_000734235.1, GCA_000734215.1, GCA_000734185.1, GCA_000734165.1, GCA_000734155.1, GCA_000734105.1, GCA_000734075.1, GCA_000734095.1, GCA_000734085.1"
9,Mycosphaerella,graminicola,NCBI:txid1047171,"GCA_902712725.1, GCA_003613095.1, GCA_003611185.1, GCA_003611175.1, GCA_003611135.1, GCA_003611115.1, GCA_003611125.1, GCA_003611075.1, GCA_003611065.1, GCA_003611055.1, GCA_002937425.1"
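
The rows above can be read back into Python for later steps. The following is a minimal sketch, not code taken from `get_ncbi_genomes`: the function name `parse_accessions` is hypothetical, and the column order (row index, genus, species, NCBI Taxonomy ID, accession numbers) is an assumption inferred from the rows shown. The sample reuses rows 6 and 7 from the output, with a line break added inside the quoted field of row 7 to mimic the way longer rows wrap.

```python
import csv
import io

# Rows 7 and 6 copied from the output above. The line break inside the quoted
# field of the first record is added here to mimic the wrapped rows.
SAMPLE = (
    '7,Magnaporthe,grisea,NCBI:txid148305,"GCF_004355905.1, GCA_003933175.1, GCA_002925245.1, \n'
    'GCA_002924675.1, GCA_001548815.1, GCA_001548795.1"\n'
    '6,Fusarium,proliferatum,NCBI:txid948311,"GCA_003709405.1, GCA_003705095.1, '
    'GCA_003704965.1, GCA_003704895.1, GCA_003704885.1, GCA_003704875.1, '
    'GCA_003615215.1, GCA_003290285.1, GCA_003123625.1, GCA_002234285.1, '
    'GCA_900029915.1, GCA_001705295.1"\n'
)


def parse_accessions(handle):
    """Yield (genus, species, taxonomy_id, accessions) for each CSV record.

    Assumes five columns in the order: index, genus, species,
    taxonomy ID, comma-separated accession numbers (hypothetical layout).
    """
    for row in csv.reader(handle):
        _index, genus, species, tax_id, accession_field = row
        # Split the quoted field and discard padding, including any
        # newline left where a long row wrapped.
        accessions = [
            item.strip() for item in accession_field.split(",") if item.strip()
        ]
        yield genus, species, tax_id, accessions


records = list(parse_accessions(io.StringIO(SAMPLE)))
for genus, species, tax_id, accessions in records:
    print(f"{genus} {species} ({tax_id}): {len(accessions)} assemblies")
```

Because the accession list is a single quoted field, `csv.reader` treats a wrapped row as one record, so the line breaks inside long rows do not need to be removed before parsing.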
