# Bio.Entrez module
## 1. [Introduction](#1.-Introduction)
1. NCBI and E-Utilities/Entrez
2. Bio.Entrez

## 2. [An Example](#2.-Bio.Entrez-Example)
## 3. [More Detail](#3.-More-Bio.Entrez:)
## 4. [More?](#4.-More-Request-Options)

In [2]:
import re
from Bio import Entrez
from Bio import SeqIO

# 1. Introduction
1. NCBI Entrez E-Utilities provide a URL syntax for searching 38 NCBI databases.
2. Bio.Entrez is simply a wrapper for these E-Utilities:
    1. Making the calls programmatically helps slightly; mostly it ensures adherence to NCBI usage guidelines.
    2. Parsing the results helps more.
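To make the "URL syntax" concrete, here is a minimal sketch of the kind of address an `efetch` request corresponds to. It uses only the standard library; `build_eutils_url` is a hypothetical helper written for this illustration (it is not part of Bio.Entrez), and the email address is a placeholder.

```python
from urllib.parse import urlencode

# Base address of the E-utilities; each utility is a separate .fcgi endpoint.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_eutils_url(utility, **params):
    """Return the URL for one E-utility call (hypothetical helper, not a Bio.Entrez function)."""
    return "{0}{1}.fcgi?{2}".format(EUTILS_BASE, utility, urlencode(params))

efetch_url = build_eutils_url(
    "efetch",
    db="protein",
    id="BAB04322",
    rettype="gp",
    retmode="xml",
    email="you@example.org",  # placeholder address
)
print(efetch_url)
```

Nothing is sent over the network here; the sketch only shows how the keyword arguments passed to `Entrez.efetch` line up with URL query parameters.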

# 2. Bio.Entrez Example
#### _"Hey, do any of those proteins include annotated CBMs?"_

In [3]:
Entrez.email="combee_example_not_real at wisc dot edu" #providing an email address is good for everyone

In [4]:
"""Get the list of accession codes"""
import re
# First, read the text file from the regex example
with open('./documents/AccessionExample1.txt') as f:
    ex1text=f.read()
# Compile the regex (raw string, so \d is not treated as a string escape),
# find the accession codes, then upper-case and de-duplicate them
proteingb_regex=re.compile(r'[A-Za-z]{3}\d{5}')
paccs_=proteingb_regex.findall(ex1text)
paccs_=list(set([x.upper() for x in paccs_]))
paccs_

['ADZ22510',
 'FGU98722',
 'EIM57503',
 'AAC19169',
 'BAB04322',
 'AEV67086',
 'ABW39335',
 'AAA23220',
 'AAA23224',
 'CBL17440']
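The pattern accepts any three letters followed by five digits, so it can also pick up strings that are not real accessions. The sketch below runs a slightly stricter variant on an inline string so the behavior can be checked without the text file. The sample text is made up for illustration, and the added `\b` word boundaries are an assumption about what counts as a clean match, not something the original pattern enforces.

```python
import re

# Hypothetical sample text standing in for AccessionExample1.txt
sample_text = "Hits: aac19169, BAB04322 and AAC19169 (see also bab04322)."

# \b keeps the pattern from matching inside longer alphanumeric tokens
proteingb_regex = re.compile(r'\b[A-Za-z]{3}\d{5}\b')

# Upper-case, de-duplicate, and sort; a bare set() has no guaranteed order
accessions = sorted({m.upper() for m in proteingb_regex.findall(sample_text)})
print(accessions)
```

Sorting the de-duplicated codes makes the list reproducible from run to run, which a plain `list(set(...))` does not guarantee.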

In [5]:
def parse_region(ftable_entry):
    """Flatten a feature's qualifier list into a {name: value} dict."""
    retdict={}
    for gbfdict in ftable_entry['GBFeature_quals']:
        dakey,davalue=gbfdict['GBQualifier_name'],gbfdict['GBQualifier_value']
        retdict[dakey]=davalue
    return retdict

for p in paccs_:
    try:
        srhandle=Entrez.efetch(db="protein",id=p,rettype="gp",retmode="xml")
    except Exception:  # e.g. an HTTP error for an accession NCBI does not recognize
        print("failed to fetch {0}".format(p))
        continue
    info=Entrez.read(srhandle)
    srhandle.close()

    for f in info[0]['GBSeq_feature-table']:
        if f['GBFeature_key']=='Region':
            regdict=parse_region(f)
            # .get() avoids a KeyError if a Region has no region_name qualifier
            if regdict.get('region_name','').startswith("CBM"):
                print("CBM included in {0}:{1}".format(p,regdict))

failed to fetch FGU98722
CBM included in BAB04322:{'note': 'Carbohydrate binding domain X2; pfam03442', 'db_xref': 'CDD:281441', 'region_name': 'CBM_X2'}
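The nested structure that the loop walks through can be explored offline. The sketch below uses a hand-written stand-in for one parsed record: the key names and the CBM_X2 qualifiers are copied from the code and output above, while `mock_record` and `cbm_regions` are names invented for this illustration, and the Protein entry is included only to show a non-Region feature being skipped.

```python
# Hand-written stand-in for one element of the list returned by Entrez.read();
# only the keys used in the loop above are included.
mock_record = {
    'GBSeq_feature-table': [
        {'GBFeature_key': 'Protein',
         'GBFeature_quals': [
             {'GBQualifier_name': 'product',
              'GBQualifier_value': 'endo-beta-1,4-glucanase (celulase B)'}]},
        {'GBFeature_key': 'Region',
         'GBFeature_quals': [
             {'GBQualifier_name': 'region_name', 'GBQualifier_value': 'CBM_X2'},
             {'GBQualifier_name': 'note',
              'GBQualifier_value': 'Carbohydrate binding domain X2; pfam03442'},
             {'GBQualifier_name': 'db_xref', 'GBQualifier_value': 'CDD:281441'}]},
    ]
}

def cbm_regions(record):
    """Return qualifier dicts for Region features whose region_name starts with 'CBM'."""
    hits = []
    for feature in record['GBSeq_feature-table']:
        if feature['GBFeature_key'] != 'Region':
            continue
        # Flatten the list of qualifier dicts into {name: value}
        quals = {q['GBQualifier_name']: q.get('GBQualifier_value', '')
                 for q in feature.get('GBFeature_quals', [])}
        if quals.get('region_name', '').startswith('CBM'):
            hits.append(quals)
    return hits

print(cbm_regions(mock_record))
```

Using `.get()` with a default means a feature that lacks a qualifier is skipped quietly instead of raising a `KeyError` partway through a batch.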


# 3. More Bio.Entrez:
1. Changing the retmode
2. 'esearch' and others

#### We used the XML return mode (retmode="xml"). Entrez.read() parses it into a list of big nested dictionaries

In [14]:
srhandle=Entrez.efetch(db="protein",id='BAB04322',rettype="gp",retmode="xml")
protein_infos_=Entrez.read(srhandle)
print(protein_infos_)

[{'GBSeq_project': 'PRJNA235', 'GBSeq_xrefs': [{'GBXref_id': 'PRJNA235', 'GBXref_dbname': 'BioProject'}, {'GBXref_id': 'SAMD00061093', 'GBXref_dbname': 'BioSample'}], 'GBSeq_source': 'Bacillus halodurans C-125', 'GBSeq_organism': 'Bacillus halodurans C-125', 'GBSeq_update-date': '10-MAY-2017', 'GBSeq_references': [{'GBReference_authors': ['Takami,H.'], 'GBReference_journal': '(in) Horikoshi,K. and Tsujii,K. (Eds.); EXTREMOPHILES IN DEEP-SEA ENVIRONMENTS: 249-284; Springer-Verlag (1999)', 'GBReference_title': 'Genome analysis of facultatively alkalihilic Bacillus halodurans C-125', 'GBReference_reference': '1'}, {'GBReference_xref': [{'GBXref_id': '10.1271/bbb.63.943', 'GBXref_dbname': 'doi'}], 'GBReference_reference': '2', 'GBReference_authors': ['Takami,H.', 'Horikoshi,K.'], 'GBReference_journal': 'Biosci. Biotechnol. Biochem. 63 (5), 943-945 (1999)', 'GBReference_title': 'Reidentification of Facultatively Alkaliphilic Bacillus sp. C-125 to Bacillus halodurans', 'GBReference_pubmed': 

#### We could instead pull as text and parse as a SeqRecord

In [7]:
from Bio import SeqIO
srhandle=Entrez.efetch(db="protein",id='ADZ22510',rettype="gp",retmode="text")
sr=SeqIO.read(srhandle,"genbank")  # exactly one record is expected for a single id
print(sr)

ID: ADZ22510.1
Name: ADZ22510
Description: Endoglucanase family 5; cell-adhesion and dockerin domains [Clostridium acetobutylicum EA 2018].
Database cross-references: BioProject:PRJNA50455, BioSample:SAMN02603410
Number of features: 8
/keywords=['']
/taxonomy=['Bacteria', 'Firmicutes', 'Clostridia', 'Clostridiales', 'Clostridiaceae', 'Clostridium']
/date=07-JAN-2015
/sequence_version=1
/comment=Method: conceptual translation.
/source=Clostridium acetobutylicum EA 2018
/topology=linear
/data_file_division=BCT
/organism=Clostridium acetobutylicum EA 2018
/accessions=['ADZ22510']
/db_source=accession CP002118.1
/references=[Reference(title='Comparative genomic and transcriptomic analysis revealed genetic characteristics related to solvent formation and xylose utilization in Clostridium acetobutylicum EA 2018', ...), Reference(title='Direct Submission', ...)]
Seq('MRNKKRITSLVTGLAMLFTCAVGNTSLKVHADAQSIYTTKGETTKIYASAFTQN...KGA', IUPACProtein())


#### esearch

In [13]:
'''find all annotated GH5s in Ruminococcus albus'''
srhandle=Entrez.esearch(db="protein",term="Ruminococcus albus[Organism] AND \"glycoside hydrolase family 5\"[All Fields]",idtype="acc",retmax=65)
a=srhandle.read()
print(a)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>17</Count><RetMax>17</RetMax><RetStart>0</RetStart><IdList>
<Id>ADU23246.1</Id>
<Id>ADU23111.1</Id>
<Id>ADU23074.1</Id>
<Id>ADU22510.1</Id>
<Id>ADU22343.1</Id>
<Id>ADU22290.1</Id>
<Id>ADU21877.1</Id>
<Id>ADU21807.1</Id>
<Id>ADU21608.1</Id>
<Id>ADU21575.1</Id>
<Id>ADU21559.1</Id>
<Id>ADU21423.1</Id>
<Id>ADU21113.1</Id>
<Id>ADU20585.1</Id>
<Id>AAT48117.1</Id>
<Id>BAA92430.1</Id>
<Id>BAA92146.1</Id>
</IdList><TranslationSet><Translation>     <From>Ruminococcus albus[Organism]</From>     <To>"Ruminococcus albus"[Organism]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>"Ruminococcus albus"[Organism]</Term>    <Field>Organism</Field>    <Count>38790</Count>    <Explode>Y</Explode>   </TermSet>   <TermSet>    <Term>"glycoside hydrolase family 5"[All Fields]</Term>    <F
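The raw XML is mostly useful for seeing what came back. To use the IDs, the result has to be parsed. With Bio.Entrez, calling `Entrez.read()` on a fresh, unread handle returns a dictionary-like object whose `"Count"` and `"IdList"` entries hold this information (a handle that has already been consumed with `.read()` cannot be parsed again). The standard-library sketch below does the equivalent on a trimmed copy of the output above, so it runs without a network connection; `sample_xml` keeps only three of the 17 IDs.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the eSearchResult printed above (three of the 17 IDs kept)
sample_xml = """<eSearchResult><Count>17</Count><RetMax>17</RetMax><RetStart>0</RetStart><IdList>
<Id>ADU23246.1</Id>
<Id>ADU23111.1</Id>
<Id>BAA92146.1</Id>
</IdList></eSearchResult>"""

root = ET.fromstring(sample_xml)
total_hits = int(root.findtext("Count"))       # total matches reported by NCBI
ids = [elem.text for elem in root.iter("Id")]  # accession.version strings
print(total_hits, ids)
```

Comparing `Count` with the number of IDs returned is a quick way to notice when `retmax` has cut the list short.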

# 4. More Request Options

#### Many return types (rettype) are available

In [9]:
srhandle=Entrez.efetch(db="protein",id="BAB04322",rettype="ft",retmode="text")  # the feature table is plain text
a=srhandle.read()
print(a)

>Feature dbj|BAB04322.1|
1	574	Protein
			product	endo-beta-1,4-glucanase (celulase B)
1	382	Region
			region_name	BglC
			note	Aryl-phospho-beta-D-glucosidase BglC, GH1 family [Carbohydrate transport and metabolism]
			db_xref	CDD:225344
59	340	Region
			region_name	Cellulase
			note	Cellulase (glycosyl hydrolase family 5)
			db_xref	CDD:278575
367	454	Region
			region_name	CBM_X2
			note	Carbohydrate binding domain X2
			db_xref	CDD:281441
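The feature table is tab-delimited: a feature line carries start, stop, and key, and each qualifier line starts with three empty columns followed by a name and a value. A minimal parser might look like the sketch below. `parse_feature_table` is a hypothetical helper written for this notebook, the sample is a trimmed copy of the output above, and the sketch assumes plain integer coordinates (partial locations such as `<1` are not handled).

```python
# Trimmed copy of the feature table printed above
sample_ft = (
    ">Feature dbj|BAB04322.1|\n"
    "1\t574\tProtein\n"
    "\t\t\tproduct\tendo-beta-1,4-glucanase (celulase B)\n"
    "59\t340\tRegion\n"
    "\t\t\tregion_name\tCellulase\n"
    "\t\t\tdb_xref\tCDD:278575\n"
    "367\t454\tRegion\n"
    "\t\t\tregion_name\tCBM_X2\n"
    "\t\t\tnote\tCarbohydrate binding domain X2\n"
    "\t\t\tdb_xref\tCDD:281441\n"
)

def parse_feature_table(text):
    """Parse a feature table into a list of dicts (hypothetical helper)."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(">"):
            continue  # skip blank lines and the '>Feature' header
        fields = line.split("\t")
        if fields[0]:
            # Feature line: start, stop, key
            features.append({"start": int(fields[0]), "stop": int(fields[1]),
                             "key": fields[2], "quals": {}})
        elif features and len(fields) >= 4:
            # Qualifier line: three empty columns, then name and optional value
            value = fields[4] if len(fields) > 4 else ""
            features[-1]["quals"][fields[3]] = value
    return features

features = parse_feature_table(sample_ft)
cbm = [(f["start"], f["stop"], f["quals"]["region_name"])
       for f in features
       if f["key"] == "Region" and f["quals"].get("region_name", "").startswith("CBM")]
print(cbm)
```

Unlike the XML route used earlier, this keeps the residue range of each region, which is handy for slicing the CBM out of the sequence.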




### The Bio.Entrez documentation mostly points back to the NCBI Entrez documentation
[Available databases](https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly)

![title](./images/EntrezDBNames.png "our key")

[efetch options](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly)

![title](./images/EntrezEFetchOptions.png "efetchstuff")
