# ACCESS BIOINFORMATICS DATABASES WITH BIOPYTHON

1. [NCBI](#1.-NCBI)<br>
    1.1. [Nucleotide BLAST](#1.1.-Nucleotide-BLAST)<br>
    1.2. [Protein BLAST](#1.2.-Protein-BLAST)
    
2. [ENTREZ](#2.-ENTREZ)<br>
    2.1. [PUBMED](#2.1.-PUBMED)<br>
    2.2. [Nucleotide](#2.2.-Nucleotide)
    
3. [PDB](#3.-PDB)

4. [EXPASY](#4.-EXPASY)<br>
    4.1. [PROSITE](#4.1.-PROSITE)<br>
    4.2. [ScanProsite](#4.2.-ScanProsite)
    
5. [KEGG](#5.-KEGG)

# 1. NCBI

### Import Modules

In [16]:
from Bio.Blast import NCBIWWW
from Bio import SeqIO, SearchIO

In [2]:
help(NCBIWWW)

Help on module Bio.Blast.NCBIWWW in Bio.Blast:

NAME
    Bio.Blast.NCBIWWW - Code to invoke the NCBI BLAST server over the internet.

DESCRIPTION
    This module provides code to work with the WWW version of BLAST
    provided by the NCBI. https://blast.ncbi.nlm.nih.gov/

FUNCTIONS
    qblast(program, database, sequence, url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi', auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, short_query=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_

## 1.1. Nucleotide BLAST

In [17]:
!ls

dir  notebook.ipynb  nuc_seq.fasta  obsolete  prot_seq.fasta


In [18]:
nuc_record = SeqIO.read("nuc_seq.fasta", format = "fasta")
len(nuc_record)

774

In [8]:
result_handle = NCBIWWW.qblast("blastn", "nt", nuc_record.seq)
blast_result = SearchIO.read(result_handle,"blast-xml")

In [9]:
print(blast_result[0:2])

Program: blastn (2.14.1+)
  Query: No (774)
         definition line
 Target: nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|2621968847|emb|OY783932.1|  Severe acute respiratory...
            1      1  gi|2621965820|emb|OY785339.1|  Severe acute respiratory...


In [10]:
Seq = blast_result[0]
print(f"Sequence ID: {Seq.id}")
print(f"Sequence Description: {Seq.description}")

details = Seq[0]
print(f"E-value: {details.evalue}")

# An E-value of 0.0 means the hit is an exact or near-exact match to our query sequence

Sequence ID: gi|2621968847|emb|OY783932.1|
Sequence Description: Severe acute respiratory syndrome coronavirus 2 genome assembly, chromosome: 1
E-value: 0.0
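
Reading the E-value off the top hit works for one record, but a BLAST result usually holds many hits. The following is a minimal sketch of filtering hits by E-value. `filter_hits`, `FakeHit` and the `1e-5` cutoff are hypothetical names and values chosen for illustration; the helper only assumes that each hit has an `id` and can be iterated to yield HSP-like objects with an `evalue`, as `blast_result[0]` and `Seq[0]` do above. Stand-in objects are used so the block runs without sending a query to NCBI.

```python
from types import SimpleNamespace


def filter_hits(hits, max_evalue=1e-5):
    """Return (hit_id, best_evalue) pairs for hits whose best HSP passes the cutoff.

    Hypothetical helper: it relies only on each hit having an ``id`` and
    being iterable over HSP-like objects that expose ``evalue``.
    """
    kept = []
    for hit in hits:
        best = min(hsp.evalue for hsp in hit)
        if best <= max_evalue:
            kept.append((hit.id, best))
    # Smallest E-value (most significant) first.
    return sorted(kept, key=lambda pair: pair[1])


class FakeHit(list):
    """Stand-in for a SearchIO hit: a list of HSPs plus an ``id``."""

    def __init__(self, hit_id, evalues):
        super().__init__(SimpleNamespace(evalue=e) for e in evalues)
        self.id = hit_id


# Toy data, not taken from the BLAST run above.
toy_hits = [
    FakeHit("hit_a", [0.0]),
    FakeHit("hit_b", [3.0, 1e-20]),
    FakeHit("hit_c", [0.5]),
]
significant = filter_hits(toy_hits, max_evalue=1e-5)
print(significant)
```

With a real result, `filter_hits(blast_result)` should work the same way, since the query result iterates over hits and each hit iterates over its HSPs.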


In [11]:
print(f"alignment:\n{details.aln}")

alignment:
Alignment with 2 rows and 774 columns
ATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAA...GGT No
ATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAA...GGT gi|2621968847|emb|OY783932.1|
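
The printed alignment is truncated, so it is hard to judge by eye how closely the two rows agree. A small percent-identity calculation makes this explicit. This is a sketch under assumptions: `percent_identity` is a hypothetical helper, gap columns are counted as mismatches (other conventions exist), and the rows below are toy strings rather than the real sequences.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percentage of aligned columns where both rows carry the same residue.

    Columns where either row has a gap are counted as mismatches.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        raise ValueError("aligned rows must not be empty")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)


# Toy rows for illustration only, not the sequences from the BLAST result.
query_row = "ATCGCTCCAG"
hit_row = "ATCGCTC-AG"
identity = percent_identity(query_row, hit_row)
print(f"{identity:.1f}% identity")
```

For the real alignment, the two rows could be taken as `str(details.aln[0].seq)` and `str(details.aln[1].seq)`.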


## 1.2. Protein BLAST

In [19]:
prot_record = SeqIO.read("prot_seq.fasta", format="fasta")
len(prot_record)

258

In [13]:
result_handle = NCBIWWW.qblast("blastp", "pdb", prot_record.seq)
blast_result = SearchIO.read(result_handle, "blast-xml")

In [14]:
print(blast_result[0:2])

Program: blastp (2.14.1+)
  Query: unnamed (258)
         protein product
 Target: pdb
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  pdb|8ELJ|A  Chain A, Spike glycoprotein [Severe acute r...
            1      1  pdb|7CAB|A  Chain A, Spike glycoprotein [Severe acute r...


In [15]:
Seq = blast_result[0]
print(f"Sequence ID: {Seq.id}")
print(f"Sequence Description: {Seq.description}")

details = Seq[0]
print(f"E-value: {details.evalue}")

Sequence ID: pdb|8ELJ|A
Sequence Description: Chain A, Spike glycoprotein [Severe acute respiratory syndrome coronavirus 2]
E-value: 0.0


In [16]:
print(f"alignment:\n {details.aln}")

alignment:
 Alignment with 2 rows and 258 columns
IAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLY...PIG unnamed
IAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLY...PIG pdb|8ELJ|A


------------------------------------------------------

# 2. ENTREZ

### Import Modules

In [17]:
from Bio import Entrez

In [18]:
help(Entrez)

Help on package Bio.Entrez in Bio:

NAME
    Bio.Entrez - Provides code to access NCBI over the WWW.

DESCRIPTION
    The main Entrez web page is available at:
    http://www.ncbi.nlm.nih.gov/Entrez/
    
    Entrez Programming Utilities web page is available at:
    http://www.ncbi.nlm.nih.gov/books/NBK25501/
    
    This module provides a number of functions like ``efetch`` (short for
    Entrez Fetch) which will return the data as a handle object. This is
    a standard interface used in Python for reading data from a file, or
    in this case a remote network connection, and provides methods like
    ``.read()`` or offers iteration over the contents line by line. See
    also "What the heck is a handle?" in the Biopython Tutorial and
    Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
    http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
    The handle returned by these functions can be either in text mode or
    in binary mode, depending on the data requested a

In [19]:
Entrez.email = "datacyclopes@gmail.com"

In [20]:
handle = Entrez.einfo()
record = Entrez.read(handle)
record["DbList"]

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

## 2.1. PUBMED

In [21]:
handle = Entrez.einfo(db="pubmed")
record = Entrez.read(handle)
record["DbInfo"]["Description"]

'PubMed bibliographic record'

In [22]:
record["DbInfo"]["Count"]

'36525710'

In [24]:
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
record["IdList"]

['37810457', '37668712', '37577677', '36818783', '36245797', '36094101', '35497637', '35496474', '35402671', '34735950', '34484417', '34434786', '34189012', '33994075', '33902722', '33809815', '33242467', '32044951', '31762715', '31278684']
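
`Entrez.esearch` is a thin wrapper around an HTTP request to the NCBI E-utilities. To show what is being asked of the server, the sketch below assembles a comparable URL by hand. `build_esearch_url` is a hypothetical helper; Biopython adds further parameters of its own (such as the tool name and the email set above), so the real request will not be identical. The default `retmax` of 20 is why the search above returned 20 IDs.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20, email=None):
    """Assemble an ESearch URL (hypothetical helper, for illustration)."""
    params = {"db": db, "term": term, "retmax": retmax}
    if email:
        params["email"] = email
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = build_esearch_url("pubmed", "biopython")
print(url)

# Parse the query string back to confirm what was encoded.
query = parse_qs(urlsplit(url).query)
print(query)
```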

In [25]:
handle = Entrez.esummary(db="pubmed", id='37810457, 37668712')
records = Entrez.parse(handle)


for record in records:
    print(record['AuthorList'], record['Title'], record['PubDate'], record['FullJournalName'])

['Arora C', 'De Oliveira Rosa N', 'Matic M', 'Cascone M', 'Miglionico P', 'Raimondi F'] EXPANSION: a webserver to explore the functional consequences of protein-coding alternative splice variants in cancer genomics. 2023 Bioinformatics advances
['Rustagi V', 'Gupta SRR', 'Bajaj M', 'Singh A', 'Singh IK'] PepAnalyzer: predicting peptide properties using its sequence. 2023 Oct Amino acids


In [26]:
handle = Entrez.efetch(db="pubmed", id="19811691")
print(handle.read())

b'<?xml version="1.0" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2023//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_230101.dtd">\n<PubmedArticleSet>\n<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM"><PMID Version="1">19811691</PMID><DateCompleted><Year>2010</Year><Month>02</Month><Day>12</Day></DateCompleted><DateRevised><Year>2021</Year><Month>10</Month><Day>20</Day></DateRevised><Article PubModel="Electronic"><Journal><ISSN IssnType="Electronic">1471-2105</ISSN><JournalIssue CitedMedium="Internet"><Volume>10 Suppl 11</Volume><Issue>Suppl 11</Issue><PubDate><Year>2009</Year><Month>Oct</Month><Day>08</Day></PubDate></JournalIssue><Title>BMC bioinformatics</Title><ISOAbbreviation>BMC Bioinformatics</ISOAbbreviation></Journal><ArticleTitle>Exploratory visual analysis of conserved domains on multiple sequence alignments.</ArticleTitle><Pagination><StartPage>S7</StartPage><MedlinePgn>S7</MedlinePgn></Pagination><ELocationID EIdType="doi
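
The raw XML above is hard to read as a byte string. As a sketch, the standard library can pull out a few fields. `summarize_articles` is a hypothetical helper, and `sample_xml` is a trimmed excerpt built from the fields visible in the output above, not the full record.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the efetch output shown above (not the complete record).
sample_xml = b"""<?xml version="1.0" ?>
<PubmedArticleSet>
<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">19811691</PMID>
<Article PubModel="Electronic"><Journal>
<JournalIssue CitedMedium="Internet"><PubDate><Year>2009</Year></PubDate></JournalIssue>
<Title>BMC bioinformatics</Title></Journal>
<ArticleTitle>Exploratory visual analysis of conserved domains on multiple sequence alignments.</ArticleTitle>
</Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def summarize_articles(xml_bytes):
    """Return one dict per PubmedArticle with PMID, title, journal and year."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for article in root.iter("PubmedArticle"):
        base = "MedlineCitation/Article"
        rows.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "title": article.findtext(f"{base}/ArticleTitle"),
            "journal": article.findtext(f"{base}/Journal/Title"),
            "year": article.findtext(f"{base}/Journal/JournalIssue/PubDate/Year"),
        })
    return rows


articles = summarize_articles(sample_xml)
print(articles[0])
```

Within Biopython, passing the `efetch` handle to `Entrez.read` instead of calling `handle.read()` returns an already parsed structure, which is usually the more convenient route.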

## 2.2. Nucleotide

In [27]:
handle = Entrez.esearch(db="nucleotide", retmax=10, term="Severe acute respiratory syndrome")
record = Entrez.read(handle)
record["IdList"]

['2623675418', '2623675407', '2623675395', '2623675383', '2623675371', '2623675359', '2623675347', '2623675335', '2623675323', '2623675311']

In [28]:
handle = Entrez.efetch(db="nucleotide", id='2623675418', rettype="gb", retmode="text")
print(handle.read())

LOCUS       OR840572               29648 bp    RNA     linear   VRL 22-NOV-2023
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate
            SARS-CoV-2/human/USA/OSPHL09231/2023 ORF1ab polyprotein (ORF1ab),
            ORF1a polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein
            (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6
            protein (ORF6), ORF7a protein (ORF7a), and ORF7b (ORF7b) genes,
            complete cds; ORF8 gene, complete sequence; and nucleocapsid
            phosphoprotein (N) and ORF10 protein (ORF10) genes, complete cds.
ACCESSION   OR840572
VERSION     OR840572.1
KEYWORDS    .
SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
  ORGANISM  Severe acute respiratory syndrome coronavirus 2
            Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
            Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae;
            Betacoronavirus; Sarbecovirus; Sev

In [29]:
handle = Entrez.esearch(db='nucleotide', term='accD[Gene Name] AND "E. coli"[Organism]', retmax=20)
result_list = Entrez.read(handle)

In [30]:
id_list = result_list['IdList']
count = result_list['Count']

print(id_list)
print("\n")
print(count)

['2125548482', '2125547083', '2125545051', '2125544115', '2125543024', '2125541822', '2125536048', '2125532027', '2125529603', '2125527650', '2125514834', '2125511716', '2125507780', '2125503277', '2125499772', '2125499237', '2125493107', '2125481180', '2125479731', '2125477219']


246849


In [31]:
handle.close()
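
The search above matched 246849 records but returned only 20 IDs, because `retmax` caps the page size. Retrieving more means paging with `retstart` and `retmax`. The sketch below only computes the page boundaries; `page_offsets` and the page size of 10000 are assumptions made for illustration, and no request is sent.

```python
def page_offsets(total, page_size):
    """Yield (retstart, retmax) pairs that cover `total` records."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)


# Entrez returns Count as a string, so convert it before doing arithmetic.
total = int("246849")
pages = list(page_offsets(total, 10000))
print(len(pages), pages[0], pages[-1])
```

Each pair could then be passed on as `Entrez.esearch(db=..., term=..., retstart=start, retmax=size)`, ideally with a pause between requests to respect NCBI usage limits.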

------------------------------------------------------

# 3. PDB

### Import Modules

In [1]:
from Bio.PDB import PDBParser, PDBList

In [2]:
help(PDBList)

Help on class PDBList in module Bio.PDB.PDBList:

class PDBList(builtins.object)
 |  Quick access to the structure lists on the PDB or its mirrors.
 |  
 |  This class provides quick access to the structure lists on the
 |  PDB server or its mirrors. The structure lists contain
 |  four-letter PDB codes, indicating that structures are
 |  new, have been modified or are obsolete. The lists are released
 |  on a weekly basis.
 |  
 |  It also provides a function to retrieve PDB files from the server.
 |  To use it properly, prepare a directory /pdb or the like,
 |  where PDB files are stored.
 |  
 |  All available file formats (PDB, PDBx/mmCif, PDBML, mmtf) are supported.
 |  Please note that large structures (containing >62 chains
 |  and/or 99999 ATOM lines) are no longer stored as a single PDB file
 |  and by default (when PDB format selected) are not downloaded.
 |  
 |  Large structures can be downloaded in other formats, including PDBx/mmCif
 |  or as a .tar file (a collection of 

In [3]:
pdbl = PDBList()
pdbl.retrieve_pdb_file("7BYR", file_format="pdb", pdir="dir")

Downloading PDB structure '7BYR'...


'dir/pdb7byr.ent'
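
The returned path shows the naming used for `file_format="pdb"`: `pdb`, the lower-cased code, then `.ent`. A sketch that derives this path, so a script can check for a local copy before parsing, could look as follows. `expected_pdb_path` and `has_local_copy` are hypothetical helpers, and the naming rule is inferred from the output above rather than from the Biopython documentation.

```python
import os


def expected_pdb_path(pdb_code, pdir):
    """Path where a PDB-format download is expected (inferred naming rule)."""
    return os.path.join(pdir, f"pdb{pdb_code.lower()}.ent")


def has_local_copy(pdb_code, pdir):
    """True if the expected file already exists on disk."""
    return os.path.isfile(expected_pdb_path(pdb_code, pdir))


path = expected_pdb_path("7BYR", "dir")
print(path)
print(has_local_copy("7BYR", "dir"))
```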

In [4]:
parser = PDBParser()
structure = parser.get_structure("7BYR","dir/pdb7byr.ent")



In [5]:
for chain in structure[0]:
    print(f"chainid: {chain.id}")

chainid: A
chainid: B
chainid: C
chainid: H
chainid: L
chainid: D
chainid: E
chainid: F
chainid: G
chainid: I
chainid: J


In [6]:
resolution = structure.header["resolution"]
resolution  # in angstroms

3.84

In [7]:
keywords = structure.header["keywords"]
keywords

'sars-cov-2, antigen, rbd, neutralizing antibody, viral protein'

------------------------------------------------------

# 4. EXPASY

## 4.1. PROSITE

### Import Modules

In [8]:
from Bio import ExPASy
from Bio.ExPASy import Prosite

In [9]:
help(Prosite)

Help on module Bio.ExPASy.Prosite in Bio.ExPASy:

NAME
    Bio.ExPASy.Prosite - Parser for the prosite dat file from Prosite at ExPASy.

DESCRIPTION
    See https://www.expasy.org/prosite/
    
    Tested with:
     - Release 20.43, 10-Feb-2009
     - Release 2017_03 of 15-Mar-2017.
    
    Functions:
     - read                  Reads a Prosite file containing one Prosite record
     - parse                 Iterates over records in a Prosite file.
    
    Classes:
     - Record                Holds Prosite data.

CLASSES
    builtins.object
        Record
    
    class Record(builtins.object)
     |  Holds information from a Prosite record.
     |  
     |  Main attributes:
     |   - name           ID of the record.  e.g. ADH_ZINC
     |   - type           Type of entry.  e.g. PATTERN, MATRIX, or RULE
     |   - accession      e.g. PS00387
     |   - created        Date the entry was created.  (MMM-YYYY for releases
     |     before January 2017, DD-MMM-YYYY since January 2017)
 

In [23]:
handle = ExPASy.get_prosite_raw('PS51442')
record = Prosite.read(handle)

In [24]:
print(record.description)

Coronavirus main protease (M-pro) domain profile.


In [25]:
print(record.pdb_structs[:10])  # first 10 PDB structures listed for the coronavirus main protease domain profile

['1LVO', '1P9S', '1P9U', '1Q2W', '1UJ1', '1UK2', '1UK3', '1UK4', '1WOF', '1Z1I']


In [26]:
handle = ExPASy.get_prosite_raw('PS00001')
record = Prosite.read(handle)
print(record.pattern)  # N-glycosylation site: Asn, any residue except Pro, Ser or Thr, any residue except Pro

N-{P}-[ST]-{P}.
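
In PROSITE syntax, `{P}` means "any residue except proline" and `[ST]` means "serine or threonine". To make that concrete, the sketch below translates a pattern into a Python regular expression and searches a toy sequence. `prosite_to_regex` is a hypothetical helper that covers only the common elements (`x`, `[...]`, `{...}`, repeat counts, and `<`/`>` anchors outside brackets); it is not a complete implementation of the PROSITE syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression string."""
    pattern = pattern.strip().rstrip(".")
    regex = ""
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:  # e.g. x(2,4) becomes .{2,4}
            element, repeat = counted.group(1), "{" + counted.group(2) + "}"
        if element == "x":  # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # excluded residues
        else:
            core = element  # single residue or [..] set
        regex += prefix + core + repeat + suffix
    return regex


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Toy sequence for illustration: NGSA matches, NPSA does not (Pro at position 2).
toy_seq = "AANGSAANPSA"
matches = [(m.start(), m.group()) for m in re.finditer(regex, toy_seq)]
print(matches)
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping sites would need a lookahead-based search.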


## 4.2. ScanProsite

### Import Modules

In [27]:
from Bio.ExPASy import ScanProsite

In [28]:
prot_record = SeqIO.read("prot_seq.fasta", format="fasta")
len(prot_record.seq)

258

In [29]:
handle = ScanProsite.scan(seq=prot_record.seq, mirror="https://prosite.expasy.org/")
result = ScanProsite.read(handle)

HTTPError: HTTP Error 308: Permanent Redirect

In [30]:
result.n_match

NameError: name 'result' is not defined

In [31]:
result[0]

NameError: name 'result' is not defined
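
The two `NameError`s follow directly from the failed `scan` call, since `result` was never assigned. The HTTP 308 itself may come from the trailing slash in `mirror`. Assuming `ScanProsite.scan` appends a path that starts with `/` to the mirror string (an assumption about Biopython internals, not verified here), the request URL would contain `//`, which a server may answer with a redirect. The sketch below only illustrates the string assembly; `build_scan_url`, `normalize_mirror` and `SCAN_PATH` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed path for illustration; the one Biopython uses may differ.
SCAN_PATH = "/cgi-bin/prosite/scanprosite/PSScan.cgi"


def build_scan_url(mirror, seq, output="xml"):
    """Join mirror, path and query the naive way (no slash handling)."""
    query = urlencode({"seq": seq, "output": output})
    return f"{mirror}{SCAN_PATH}?{query}"


def normalize_mirror(mirror):
    """Drop trailing slashes so the joined URL has a single separator."""
    return mirror.rstrip("/")


mirror = "https://prosite.expasy.org/"
with_slash = build_scan_url(mirror, "IAPGQTGK")
without_slash = build_scan_url(normalize_mirror(mirror), "IAPGQTGK")
print(with_slash)
print(without_slash)
```

If the assumption holds, retrying with the default mirror, for example `ScanProsite.scan(seq=str(prot_record.seq))`, or with the mirror written without the trailing slash, might avoid the redirect.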

------------------------------------------------------

# 5. KEGG

### Import Modules

In [32]:
from Bio.KEGG import REST, Enzyme  # access KEGG enzyme records and their associated pathways and genes

In [33]:
help(Enzyme)

Help on package Bio.KEGG.Enzyme in Bio.KEGG:

NAME
    Bio.KEGG.Enzyme - Code to work with the KEGG Enzyme database.

DESCRIPTION
    Functions:
     - parse - Returns an iterator giving Record objects.
    
    Classes:
     - Record - Holds the information from a KEGG Enzyme record.

PACKAGE CONTENTS


CLASSES
    builtins.object
        Record
    
    class Record(builtins.object)
     |  Holds info from a KEGG Enzyme record.
     |  
     |  Attributes:
     |   - entry       The EC number (withou the 'EC ').
     |   - name        A list of the enzyme names.
     |   - classname   A list of the classification terms.
     |   - sysname     The systematic name of the enzyme.
     |   - reaction    A list of the reaction description strings.
     |   - substrate   A list of the substrates.
     |   - product     A list of the products.
     |   - inhibitor   A list of the inhibitors.
     |   - cofactor    A list of the cofactors.
     |   - effector    A list of the effectors.
    

In [37]:
request = REST.kegg_get("ec:5.4.2.2")  # EC (Enzyme Commission) number
open("ec_5.4.2.2.txt", "w").write(request.read())

274949

In [38]:
records = Enzyme.parse(open("ec_5.4.2.2.txt"))
record = list(records)[0]
record.classname

['Isomerases;',
 'Intramolecular transferases;',
 'Phosphotransferases (phosphomutases)']

In [39]:
record.pathway  # all pathways associated with this enzyme

[('PATH', 'ec00010', 'Glycolysis / Gluconeogenesis'),
 ('PATH', 'ec00030', 'Pentose phosphate pathway'),
 ('PATH', 'ec00052', 'Galactose metabolism'),
 ('PATH', 'ec00230', 'Purine metabolism'),
 ('PATH', 'ec00500', 'Starch and sucrose metabolism'),
 ('PATH', 'ec00520', 'Amino sugar and nucleotide sugar metabolism'),
 ('PATH', 'ec00521', 'Streptomycin biosynthesis'),
 ('PATH', 'ec01100', 'Metabolic pathways'),
 ('PATH', 'ec01110', 'Biosynthesis of secondary metabolites'),
 ('PATH', 'ec01120', 'Microbial metabolism in diverse environments')]

In [40]:
record.genes[:10]

[('HSA', ['5236', '55276']),
 ('PTR', ['456908', '461162']),
 ('PPS', ['100977295', '100993927']),
 ('GGO', ['101128874', '101131551']),
 ('PON', ['100190836', '100438793']),
 ('NLE', ['100596081', '100600656']),
 ('HMH', ['116456694', '116457795']),
 ('MCC', ['100424648', '699401']),
 ('MCF', ['101925921', '102130622']),
 ('MTHB', ['126935012', '126954887'])]

In [41]:
# Collect the KEGG organism codes (e.g. HSA for human) from the (organism, gene IDs) pairs
list_genes = [org for org, _gene_ids in record.genes]

print(list_genes[:10])

['HSA', 'PTR', 'PPS', 'GGO', 'PON', 'NLE', 'HMH', 'MCC', 'MCF', 'MTHB']
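
The list above keeps only the organism codes. To look up individual genes, the organism code and gene ID are normally combined into a KEGG identifier of the form `hsa:5236`. A minimal sketch of that flattening step follows; `to_kegg_gene_ids` is a hypothetical helper, and `sample_genes` copies the first two entries of `record.genes` shown earlier.

```python
def to_kegg_gene_ids(genes):
    """Flatten (organism, [gene IDs]) pairs into 'org:gene' identifiers."""
    return [
        f"{org.lower()}:{gene_id}"
        for org, gene_ids in genes
        for gene_id in gene_ids
    ]


# First two entries of record.genes from the output above.
sample_genes = [
    ("HSA", ["5236", "55276"]),
    ("PTR", ["456908", "461162"]),
]
gene_ids = to_kegg_gene_ids(sample_genes)
print(gene_ids)
```

An identifier built this way could then be passed to `REST.kegg_get`, in the same way as `"ec:5.4.2.2"` above.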


------------------------------------------------------