# NCBI: National Center for Biotechnology Information
#### Author: Ghassan Abboud - Genorobotics, Date : March 2024

NCBI provides international genetic databases that serve as a basis for comparing the DNA fragments we obtain by sequencing. The database we use is GenBank, the result of a collaboration between the DNA Data Bank of Japan, the European Nucleotide Archive and NCBI. NCBI also provides a very efficient algorithm, BLAST, for aligning DNA sequences against this database and returning the best matches.

Ways to access the GenBank database:
- Search GenBank for sequences, names or identifiers: https://www.ncbi.nlm.nih.gov/genbank/
- Use BLAST for sequence alignment, which compares your query to all entries in GenBank.
- Use the Entrez Programming Utilities (see Part Two).

## Part One: Exploring GenBank and BLASTn online

Please refer to the PDF tutorial [_Exploring GenBank and BLASTn online_](Exploring_GenBank_and_BLASTn_online.pdf) in the repo.


## Part Two: Accessing GenBank through Python

### Introduction to E-Utilities

NCBI provides a set of tools called the _Entrez Programming Utilities_ (E-utilities) that give access to GenBank's query and database system through code. They allow you to retrieve information such as sets of genetic sequences or taxonomy data. They use a fixed URL syntax and HTTP requests, all starting with `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/`. Your main reference for coding pipelines that use these requests is the [Entrez Programming Utilities Help document](ncbi_entrez_eutilities_help.pdf), which provides examples in the Perl programming language.
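
To make the fixed URL syntax concrete, here is a minimal sketch that only assembles the URL of an E-utility call, without sending any request. The helper name `build_eutils_url` is hypothetical (it is not part of any library); the base URL and the `db`, `id`, `rettype` and `retmode` parameters are the ones described in this tutorial.

```python
from urllib.parse import urlencode

# Common prefix of every E-utility request, as stated above
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_eutils_url(utility, **params):
    """Return the URL of an E-utility call (hypothetical helper, no request is sent)."""
    # Each utility lives at <base><name>.fcgi and takes its parameters as a query string
    return f"{BASE_URL}{utility}.fcgi?{urlencode(params)}"

url = build_eutils_url("efetch", db="nuccore", id="NM_005297.4",
                       rettype="fasta", retmode="text")
print(url)
```

Pasting the printed URL into a browser should be equivalent to the corresponding `efetch` call made through code.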

If you are comfortable with creating HTTP requests in Python, you can directly translate code from the Help document and get inspiration from the sample applications on page 27. Creating the requests yourself probably provides more flexibility and control over the parameters. In this tutorial we will use a Biopython module called `Bio.Entrez` to create the requests.

There are nine e-utilities, i.e. nine functions for accessing different information. The most important ones are:

|E-utility|URL|Description|Required parameters|
|---|---|---|---|
|ESearch (text searches)|eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi|Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query.| db = database, term = search term|
|EFetch (data record downloads)|eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi|Responds to a list of UIDs in a given database with the corresponding data records in a specified format.| db = database, id = list of IDs separated by commas|
|EPost (ID uploads)|eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi|Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset.|db = database, id = list of IDs separated by commas|

**Note:** to search GenBank, the db parameter should be set to _nuccore_ (the alias _nucleotide_, used in the code below, is also accepted).

**Note:** the list of IDs can be composed of GI or accession.version IDs. The term UID refers to either of these.

**Note:** the parameters listed are only the required ones. Before using a utility, it is important to go over the optional parameters on page 39 of the Help document, because some of them, like _retstart_ and _retmax_, come in handy.

### Warning on NCBI requests frequency

NCBI limits the number of requests that a single user can make, to ensure the server is not overloaded. The limit is 3 requests per second. NCBI also recommends running large jobs on weekends or between 9:00 PM and 5:00 AM Eastern time on weekdays. If a user fails to comply with these guidelines, their IP address may get blocked.

Hence, it is very important to properly investigate how the e-utilities can be used to execute a job in the _smallest number of requests_. Beware of loops that contain e-utilities calls. You must also include your email address in any program you run, as we will see below. This way, NCBI may contact you in case of a breach of the guidelines instead of directly blocking your IP.
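
As an illustration of keeping the number of requests low, the sketch below groups a list of IDs into comma-separated batches, so that one call with an `id` list can replace many single-ID calls inside a loop. The helper name `chunk_ids` and the default batch size of 200 are assumptions made for this example, not values taken from this tutorial, and the IDs are placeholders rather than real accessions.

```python
def chunk_ids(ids, batch_size=200):
    """Group IDs into comma-separated strings of at most batch_size IDs each.

    Hypothetical helper: each returned string can be used as the `id`
    parameter of a single request instead of one request per ID.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [",".join(ids[i:i + batch_size]) for i in range(0, len(ids), batch_size)]

# Placeholder IDs, for illustration only
placeholder_ids = ["ID1", "ID2", "ID3", "ID4", "ID5"]
batches = chunk_ids(placeholder_ids, batch_size=2)
print(batches)  # ['ID1,ID2', 'ID3,ID4', 'ID5']
```

With this grouping, five IDs need three requests instead of five; with a larger batch size they would need only one.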

### Bio.Entrez (Biopython)

This module provides functions that make the HTTP requests and return a handle to read the results. Handles are often used in Python to read and write text files. You can check out the module's [documentation](https://biopython.org/docs/1.75/api/Bio.Entrez.html).

The functions in this module are named after their respective e-utilities and automatically respect the limit of 3 requests per second.


In [10]:
from Bio import Entrez
import os
#First, one must set an email
Entrez.email = 'fill.in.the.blanks@hotmail.com' #replace by your email address

In [None]:
#Let's first do a simple request
handle = Entrez.efetch(db = "nucleotide", id = "NM_005297.4", rettype = "gb", retmode = "text")
#This line launches a request that downloads the entry corresponding to ID NM_005297.4
#The rettype and retmode parameters are explained in the E-utilities Help document
entry = handle.read()
handle.close()
print(entry)


In [24]:
#Let's now download this record again in FASTA format and write it to a local file
handle = Entrez.efetch(db = "nucleotide", id = "NM_005297.4", rettype = "fasta", retmode = "text") #FASTA is a rettype; retmode stays "text"
sequence = handle.read()
handle.close()
path = os.path.join(os.getcwd(), "extracted_sequence.fasta") #current working directory + file name, on any operating system
with open(path, mode = "w") as file: #we open a handle to a new file in our current working directory
   file.write(sequence) #and we write the sequence inside

After running the above cell you should see a new file appear with your extracted sequence. For the next level, let's combine efetch with esearch.
We will perform a search using a term as if we were on the GenBank website. This search will return IDs that we will then plug into efetch to retrieve the associated GenBank entries.

By default, `Entrez.esearch` only returns the top 20 results. We can change that with the optional keyword `retmax`. Note that the maximum value of `retmax` is 10,000. Searches that match more than 10,000 entries must be downloaded in batches.

Unlike efetch, esearch returns its results in XML format by default, which is not very handy to work with. We can parse it into a Python dictionary with `Entrez.read`.

In [None]:
search_term = "rbcl" #we search for entries corresponding to the rbcl gene
handle = Entrez.esearch(db = "nucleotide", term = search_term, retmax = 50, idtype = "acc")
search_result = Entrez.read(handle)
handle.close()
relevant_ids= search_result["IdList"]
print(relevant_ids)


In [36]:
handle = Entrez.efetch(db="nucleotide", id=relevant_ids, rettype = "fasta", retmode = "text") #FASTA is a rettype; retmode stays "text"
rbcl_sequences = handle.read()
handle.close()
path = os.path.join(os.getcwd(), "rbcl_database.fasta") #current working directory + file name, on any operating system
with open(path, mode = "w") as file: #we open a handle to a new file in our current working directory
   file.write(rbcl_sequences) #and we write the sequences inside


In the last example, we downloaded a database of the first 50 entries. Look at the first one in your downloaded file: it is likely the **complete genome** of a plant rather than a single gene. To avoid such situations, we will now learn to filter our results by sequence length and other parameters, as we did on the webpage with Advanced Search.

The e-Utilities do not take optional arguments like sequence length, _all these parameters need to be included in the search term, following the same structure as the one built by the online advanced search tool (see Part One)_. Here is a recommended workflow for developing these search terms:
- Go to the GenBank website and use [Advanced Search](https://www.ncbi.nlm.nih.gov/nuccore/advanced).
- Fill in the fields you want to filter by.
- Copy the term from the search bar and use it in your code.

This may seem counter-intuitive: why bother downloading through code if you have to go to the website anyway? The website is only needed once. By understanding how the search term is built, you can code a function that takes some fields as arguments and builds the search term, as in the example below.

In [39]:
def download_gene_database(gene_name, min_sequence_length, max_sequence_length, file_name, retmax):
    '''Download data to a .fasta file, with filtering options

    Args:
        gene_name: (str) name of the gene to extract from GenBank
        min_sequence_length: (int) lower bound of the sequence length filter
        max_sequence_length: (int) upper bound of the sequence length filter
        file_name:(str) name of the generated file
        retmax:(int) number of entries to be downloaded, at most 10,000
    '''
    #building search term
    search_term = f"{gene_name}[Gene Name] AND {min_sequence_length}:{max_sequence_length}[Sequence Length]"

    #searching for the IDs of the entries that match the search
    handle = Entrez.esearch(db = "nucleotide", term = search_term, retmax = retmax, idtype = "acc")
    search_result = Entrez.read(handle)
    handle.close()
    relevant_ids= search_result["IdList"]

    #Using these IDs to retrieve the entries
    handle = Entrez.efetch(db="nucleotide", id=relevant_ids, rettype = "fasta", retmode = "text")
    sequences = handle.read()
    handle.close()

    #writing the entries to our own local FASTA file
    path = os.path.join(os.getcwd(), file_name) #works on any operating system
    with open(path, mode = "w") as file:
        file.write(sequences)

In [40]:
download_gene_database("matk", 750, 1500, "my_gene_database.fasta", 20)
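
The search-term step of the function above can also be isolated and extended. The sketch below is one possible way to do it, not a definitive implementation: the helper name `build_search_term` and the optional `organism` filter are assumptions added for illustration, and the quoted `[Organism]` syntax should be checked against the term produced by Advanced Search before relying on it.

```python
def build_search_term(gene_name, min_length=None, max_length=None, organism=None):
    """Assemble a GenBank search term from optional filters (hypothetical helper)."""
    # The gene name is always part of the term
    parts = [f"{gene_name}[Gene Name]"]
    # The length filter is added only when both bounds are given
    if min_length is not None and max_length is not None:
        parts.append(f"{min_length}:{max_length}[Sequence Length]")
    # Assumed organism filter, quoted so that multi-word names stay together
    if organism is not None:
        parts.append(f'"{organism}"[Organism]')
    return " AND ".join(parts)

term = build_search_term("matk", 750, 1500)
print(term)  # matk[Gene Name] AND 750:1500[Sequence Length]
```

Keeping the term builder separate makes it possible to print and inspect the term, or paste it into the website's search bar, before spending any request on it.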

In [None]:
#Bonus: can you edit the function's body so that it can download more than 10,000 entries?
#Reminder: one esearch call cannot return more than 10,000 IDs, even if retmax is set higher
#Hint: check the Help document for the retstart optional parameter