# ACCESS BIOINFORMATICS DATABASES WITH BIOPYTHON

1. [NCBI](#1.-NCBI)<br>
    1.1. [Nucleotide BLAST](#1.1.-Nucleotide-BLAST)<br>
    1.2. [Protein BLAST](#1.2.-Protein-BLAST)
    
2. [ENTREZ](#2.-ENTREZ)<br>
    2.1. [PUBMED](#2.1.-PUBMED)<br>
    2.2. [Nucleotide](#2.2.-Nucleotide)
    
3. [PDB](#3.-PDB)

4. [EXPASY](#4.-EXPASY)<br>
    4.1. [PROSITE](#4.1.-PROSITE)<br>
    4.2. [ScanProsite](#4.2.-ScanProsite)
    
5. [KEGG](#5.-KEGG)

# 1. NCBI

### Import Modules

In [1]:
from Bio.Blast import NCBIWWW
from Bio import SeqIO, SearchIO

In [2]:
help(NCBIWWW)

Help on module Bio.Blast.NCBIWWW in Bio.Blast:

NAME
    Bio.Blast.NCBIWWW - Code to invoke the NCBI BLAST server over the internet.

DESCRIPTION
    This module provides code to work with the WWW version of BLAST
    provided by the NCBI. https://blast.ncbi.nlm.nih.gov/
    
    Variables:
    
        - email        Set the Blast email parameter (default is None).
        - tool         Set the Blast tool parameter (default is ``biopython``).

FUNCTIONS
    qblast(program, database, sequence, url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi', auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, serv

## 1.1. Nucleotide BLAST

In [3]:
!ls

notebook.ipynb	nuc_seq.fasta  prot_seq.fasta


In [4]:
nuc_record = SeqIO.read("nuc_seq.fasta", format="fasta")
len(nuc_record)

774

In [5]:
nuc_record.description

'MT598137.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/IRN/PN-2142-S/2020 surface glycoprotein (S) gene, partial cds'

In [6]:
nuc_record.seq


Seq('ATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAAATTACCAGAT...GGT')

In [7]:
result_handle = NCBIWWW.qblast("blastn", "nt", nuc_record.seq)
blast_result = SearchIO.read(result_handle, "blast-xml")
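
A remote `qblast` search can take minutes, so it may be worth keeping the raw XML on disk and re-parsing it with `SearchIO.read(path, "blast-xml")` instead of querying NCBI again. The sketch below is one way to do that; the helper name `blast_to_file` and the file name in the usage note are assumptions, not part of Biopython.

```python
from pathlib import Path


def blast_to_file(program, database, sequence, xml_path, **qblast_kwargs):
    """Run qblast once and keep the raw XML; reuse the file on later calls.

    Hypothetical helper: extra keyword arguments (e.g. hitlist_size, expect)
    are passed straight through to NCBIWWW.qblast.
    """
    xml_path = Path(xml_path)
    if xml_path.exists():
        return xml_path  # cached result, no network request

    # Imported lazily so that the cached path works without a connection.
    from Bio.Blast import NCBIWWW

    result_handle = NCBIWWW.qblast(program, database, sequence, **qblast_kwargs)
    try:
        data = result_handle.read()
    finally:
        result_handle.close()

    # The handle may yield text or bytes depending on the Biopython version.
    if isinstance(data, bytes):
        xml_path.write_bytes(data)
    else:
        xml_path.write_text(data)
    return xml_path
```

Usage, assuming the `nuc_record` loaded above: `xml_file = blast_to_file("blastn", "nt", nuc_record.seq, "nuc_blast.xml", hitlist_size=10)` followed by `blast_result = SearchIO.read(xml_file, "blast-xml")`.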

In [8]:
print(blast_result[0:2])


Program: blastn (2.16.0+)
  Query: No (774)
         definition line
 Target: core_nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|2633241341|emb|OY967334.1|  Severe acute respiratory...
            1      1  gi|2521539964|emb|OX660655.1|  Severe acute respiratory...


In [9]:
hit = blast_result[0]  # top-ranked hit
print(f"Sequence ID: {hit.id}")
print(f"Sequence Description: {hit.description}")

details = hit[0]  # first HSP (high-scoring pair) of that hit
print(f"E-value: {details.evalue}")

Sequence ID: gi|2633241341|emb|OY967334.1|
Sequence Description: Severe acute respiratory syndrome coronavirus 2 genome assembly, complete genome: monopartite
E-value: 0.0
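
The cell above inspects only the first hit. To scan every hit and keep those below an E-value cutoff, a small loop is enough. This is a sketch: the function name `best_hits` and the default threshold are assumptions, and it relies only on the `id`, `evalue` and `bitscore` attributes that `SearchIO` exposes for BLAST XML results.

```python
def best_hits(query_result, max_evalue=1e-50):
    """Return (hit id, E-value, bit score) for hits whose best HSP passes the cutoff."""
    rows = []
    for hit in query_result:
        # A hit can contain several HSPs; keep the one with the lowest E-value.
        best_hsp = min(hit, key=lambda hsp: hsp.evalue)
        if best_hsp.evalue <= max_evalue:
            rows.append((hit.id, best_hsp.evalue, best_hsp.bitscore))
    return rows
```

With the search result from above, `best_hits(blast_result, max_evalue=1e-100)` would return one tuple per retained hit.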


In [10]:
print(f"alignment:\n{details.aln}")

alignment:
Alignment with 2 rows and 774 columns
ATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAA...GGT No
ATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAA...GGT gi|2633241341|emb|OY967334.1|


## 1.2. Protein BLAST

In [11]:
prot_record = SeqIO.read("prot_seq.fasta", format="fasta")
len(prot_record)

258

In [13]:
result_handle = NCBIWWW.qblast("blastp", "pdb", prot_record.seq)
blast_result = SearchIO.read(result_handle, "blast-xml")

In [14]:
print(blast_result[0:2])

Program: blastp (2.16.0+)
  Query: unnamed (258)
         protein product
 Target: pdb
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  pdb|8ELJ|A  Chain A, Spike glycoprotein [Severe acute r...
            1      1  pdb|7CAB|A  Chain A, Spike glycoprotein [Severe acute r...


In [15]:
hit = blast_result[0]  # top-ranked hit
print(f"Sequence ID: {hit.id}")
print(f"Sequence Description: {hit.description}")

details = hit[0]  # first HSP (high-scoring pair) of that hit
print(f"E-value: {details.evalue}")

Sequence ID: pdb|8ELJ|A
Sequence Description: Chain A, Spike glycoprotein [Severe acute respiratory syndrome coronavirus 2]
E-value: 0.0


In [16]:
print(f"alignment:\n {details.aln}")

alignment:
 Alignment with 2 rows and 258 columns
IAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLY...PIG unnamed
IAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLY...PIG pdb|8ELJ|A


------------------------------------------------------

# 2. ENTREZ

### Import Modules

In [43]:
from Bio import Entrez

In [22]:
help(Entrez)

Help on package Bio.Entrez in Bio:

NAME
    Bio.Entrez - Provides code to access NCBI over the WWW.

DESCRIPTION
    The main Entrez web page is available at:
    http://www.ncbi.nlm.nih.gov/Entrez/
    
    Entrez Programming Utilities web page is available at:
    http://www.ncbi.nlm.nih.gov/books/NBK25501/
    
    This module provides a number of functions like ``efetch`` (short for
    Entrez Fetch) which will return the data as a handle object. This is
    a standard interface used in Python for reading data from a file, or
    in this case a remote network connection, and provides methods like
    ``.read()`` or offers iteration over the contents line by line. See
    also "What the heck is a handle?" in the Biopython Tutorial and
    Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
    http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
    The handle returned by these functions can be either in text mode or
    in binary mode, depending on the data requested an

In [44]:
Entrez.email = "datacyclopes@gmail.com"

In [45]:
handle = Entrez.einfo()
record = Entrez.read(handle)
record["DbList"]

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

## 2.1. PUBMED

In [35]:
handle = Entrez.einfo(db="pubmed")
record = Entrez.read(handle)
record["DbInfo"]["Description"]

'PubMed bibliographic record'

In [36]:
record["DbInfo"]["Count"]

'37859381'

In [37]:
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
record["IdList"]

['38808697', '38650605', '38365590', '38235175', '37810457', '37668712', '36818783', '36245797', '36094101', '35497637', '35496474', '35402671', '34735950', '34484417', '34434786', '34189012', '33994075', '33902722', '33809815', '33242467']

In [47]:
handle = Entrez.esummary(db="pubmed", id="38808697,38650605")
records = Entrez.parse(handle)

for record in records:
    print(record['AuthorList'], record['Title'], record['PubDate'], record['FullJournalName'])

['Ullah S', 'Rahman W', 'Ullah F', 'Ullah A', 'Ahmad G', 'Ijaz M', 'Ullah H', 'Sharafmal DM'] The HABD: Home of All Biological Databases Empowering Biological Research With Cutting-Edge Database Systems. 2024 May Current protocols
['Sulkowski A', 'Bouton C', 'Swanson C'] Syn-CpG-Spacer: A Panel web app for synonymous recoding of viral genomes with CpG dinucleotides. 2024 Apr 3 Journal of open source software
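
The raw `print` above runs the fields together. A small formatter makes each summary easier to read. This is a sketch that uses only the four keys printed above; the function name `format_citation` and the `max_authors` parameter are assumptions.

```python
def format_citation(summary, max_authors=3):
    """Build a one-line citation from an esummary record (hypothetical helper)."""
    authors = list(summary["AuthorList"])
    shown = ", ".join(authors[:max_authors])
    if len(authors) > max_authors:
        shown += ", et al"
    return (
        f"{shown}. {summary['Title']} "
        f"{summary['FullJournalName']} ({summary['PubDate']})."
    )
```

Inside the loop above, `print(format_citation(record))` could replace the multi-field `print`.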


In [48]:
handle = Entrez.efetch(db="pubmed", id="19811691")
print(handle.read())

b'<?xml version="1.0" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2024//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_240101.dtd">\n<PubmedArticleSet>\n<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM"><PMID Version="1">19811691</PMID><DateCompleted><Year>2010</Year><Month>02</Month><Day>12</Day></DateCompleted><DateRevised><Year>2021</Year><Month>10</Month><Day>20</Day></DateRevised><Article PubModel="Electronic"><Journal><ISSN IssnType="Electronic">1471-2105</ISSN><JournalIssue CitedMedium="Internet"><Volume>10 Suppl 11</Volume><Issue>Suppl 11</Issue><PubDate><Year>2009</Year><Month>Oct</Month><Day>08</Day></PubDate></JournalIssue><Title>BMC bioinformatics</Title><ISOAbbreviation>BMC Bioinformatics</ISOAbbreviation></Journal><ArticleTitle>Exploratory visual analysis of conserved domains on multiple sequence alignments.</ArticleTitle><Pagination><StartPage>S7</StartPage><MedlinePgn>S7</MedlinePgn></Pagination><ELocationID EIdType="doi

## 2.2. Nucleotide

In [49]:
handle = Entrez.esearch(db="nucleotide", retmax=10, term="Severe acute respiratory syndrome")
record = Entrez.read(handle)
record["IdList"]

['1677537253', '1677531637', '1677530470', '1677498759', '1677486604', '2820736972', '2820736955', '2820736938', '2820736920', '2820736903']

In [50]:
handle = Entrez.efetch(db="nucleotide", id='1677537253', rettype="gb", retmode="text")
print(handle.read())

LOCUS       NM_001025366            3660 bp    mRNA    linear   PRI 12-OCT-2024
DEFINITION  Homo sapiens vascular endothelial growth factor A (VEGFA),
            transcript variant 1, mRNA.
ACCESSION   NM_001025366
VERSION     NM_001025366.3
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3660)
  AUTHORS   Gutierrez-Zepeda,B.M., Gomez-Del Toro,M.M., Ortiz-Soto,D.J.,
            Becerra-Loaiza,D.S., Quiroz-Bolanos,A.F., Topete,A.,
            Franco-Topete,R.A., Daneri-Navarro,A., Del Toro-Arreola,A. and
            Quintero-Ramos,A.
  TITLE     The VEGFA rs3025039 Variant Is a Risk Factor for Breast Cancer in
            Mexican Women
  JOURNAL   Int J Mol Sci 25 (18), 10172 (2024)
   PUBMED   39337657
  REMARK    GeneRIF: The VEGFA rs30250

In [51]:
handle = Entrez.esearch(db='nucleotide', term='accD[Gene Name] AND "E. coli"[Organism]', retmax=20)
result_list = Entrez.read(handle)

In [52]:
id_list = result_list['IdList']
count = result_list['Count']

print(id_list)
print("\n")
print(count)

['2822569355', '2822496522', '2822472277', '2822472270', '2822415480', '2822413527', '2822413524', '2822413508', '2822398882', '2822398879', '2822398878', '2822398877', '2822398874', '2822398859', '2822398853', '2822265898', '2822247679', '2822025126', '2821912788', '2821911646']


287253


In [53]:
handle.close()
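
The search above matched 287253 records but returned only the 20 IDs requested with `retmax`. Further IDs can be collected in batches by combining `retstart` with `retmax`. The sketch below shows the idea; `batch_offsets` and `search_ids` are hypothetical helpers, the batch size is an arbitrary choice, and NCBI may cap how far a search can be paged for some databases.

```python
def batch_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def search_ids(db, term, limit, batch_size=500):
    """Collect up to `limit` IDs for a query, one esearch call per batch."""
    from Bio import Entrez  # lazy import; Entrez.email should already be set

    handle = Entrez.esearch(db=db, term=term, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for retstart, retmax in batch_offsets(min(total, limit), batch_size):
        handle = Entrez.esearch(db=db, term=term, retstart=retstart, retmax=retmax)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

For example, `search_ids("nucleotide", 'accD[Gene Name] AND "E. coli"[Organism]', limit=1000)` would issue one counting request and two batch requests.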

------------------------------------------------------

# 3. PDB

### Import Modules

In [55]:
from Bio.PDB import PDBParser, PDBList

In [56]:
help(PDBList)

Help on class PDBList in module Bio.PDB.PDBList:

class PDBList(builtins.object)
 |  PDBList(server='https://files.wwpdb.org', pdb=None, obsolete_pdb=None, verbose=True)
 |  
 |  Quick access to the structure lists on the PDB or its mirrors.
 |  
 |  This class provides quick access to the structure lists on the
 |  PDB server or its mirrors. The structure lists contain
 |  four-letter PDB codes, indicating that structures are
 |  new, have been modified or are obsolete. The lists are released
 |  on a weekly basis.
 |  
 |  It also provides a function to retrieve PDB files from the server.
 |  To use it properly, prepare a directory /pdb or the like,
 |  where PDB files are stored.
 |  
 |  All available file formats (PDB, PDBx/mmCif, PDBML, mmtf) are supported.
 |  Please note that large structures (containing >62 chains
 |  and/or 99999 ATOM lines) are no longer stored as a single PDB file
 |  and by default (when PDB format selected) are not downloaded.
 |  
 |  Large structures ca

In [58]:
pdbl = PDBList()
pdbl.retrieve_pdb_file("7BYR", file_format="pdb", pdir="dir")

Downloading PDB structure '7byr'...


'dir/pdb7byr.ent'

In [59]:
parser = PDBParser()
structure = parser.get_structure("7BYR", "dir/pdb7byr.ent")

In [60]:
for chain in structure[0]:
    print(f"chainid: {chain.id}")

chainid: A
chainid: B
chainid: C
chainid: H
chainid: L
chainid: D
chainid: E
chainid: F
chainid: G
chainid: I
chainid: J
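
Beyond listing chain IDs, it is often useful to know how many residues each chain holds. The sketch below counts standard residues only, relying on the Bio.PDB convention that a residue `id` is a `(hetfield, resseq, icode)` tuple whose hetfield is a single space for standard residues. The function name `residues_per_chain` is an assumption.

```python
def residues_per_chain(model):
    """Map chain ID to the number of standard (non-hetero, non-water) residues."""
    counts = {}
    for chain in model:
        counts[chain.id] = sum(1 for residue in chain if residue.id[0] == " ")
    return counts
```

With the structure parsed above, `residues_per_chain(structure[0])` returns a dictionary keyed by the chain IDs printed in the previous cell.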


In [62]:
keywords = structure.header["keywords"]
keywords

'sars-cov-2, antigen, rbd, neutralizing antibody, viral protein'

------------------------------------------------------

# 4. EXPASY

## 4.1. PROSITE

### Import Modules

In [63]:
from Bio import ExPASy
from Bio.ExPASy import Prosite

In [64]:
help(Prosite)

Help on module Bio.ExPASy.Prosite in Bio.ExPASy:

NAME
    Bio.ExPASy.Prosite - Parser for the prosite dat file from Prosite at ExPASy.

DESCRIPTION
    See https://www.expasy.org/prosite/
    
    Tested with:
     - Release 20.43, 10-Feb-2009
     - Release 2017_03 of 15-Mar-2017.
    
    Functions:
     - read                  Reads a Prosite file containing one Prosite record
     - parse                 Iterates over records in a Prosite file.
    
    Classes:
     - Record                Holds Prosite data.

CLASSES
    builtins.object
        Record
    
    class Record(builtins.object)
     |  Holds information from a Prosite record.
     |  
     |  Main attributes:
     |   - name           ID of the record.  e.g. ADH_ZINC
     |   - type           Type of entry.  e.g. PATTERN, MATRIX, or RULE
     |   - accession      e.g. PS00387
     |   - created        Date the entry was created.  (MMM-YYYY for releases
     |     before January 2017, DD-MMM-YYYY since January 2017)
 

In [65]:
handle = ExPASy.get_prosite_raw('PS51442')
record = Prosite.read(handle)

In [66]:
print(record.description)

Coronavirus main protease (M-pro) domain profile.


In [68]:
print(record.pdb_structs[:10])

[]


In [70]:
handle = ExPASy.get_prosite_raw('PS00001')
record = Prosite.read(handle)
print(record.pattern)

N-{P}-[ST]-{P}.
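
The pattern `N-{P}-[ST]-{P}.` reads: N, then any residue except P, then S or T, then any residue except P. To search a sequence locally, the PROSITE syntax can be translated into a Python regular expression. The sketch below covers the common elements only (`x`, `[...]`, `{...}`, repeat counts, and `<`/`>` termini outside brackets); both function names are assumptions, and unusual patterns may need more care.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.strip().rstrip(".")                  # drop the trailing period
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}: any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4): repeat counts
    regex = regex.replace("x", ".")                      # x: any residue
    regex = regex.replace("<", "^").replace(">", "$")    # sequence termini
    return regex


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    # A lookahead lets matches overlap, which PROSITE-style scans usually report.
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(str(sequence))]
```

For example, `find_motif(prot_record.seq, record.pattern)` would list candidate sites for the pattern printed above in the protein loaded earlier.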


## 4.2. ScanProsite

### Import Modules

In [71]:
from Bio.ExPASy import ScanProsite

In [72]:
prot_record = SeqIO.read("prot_seq.fasta", format="fasta")
len(prot_record.seq)

258

In [73]:
handle = ScanProsite.scan(seq=prot_record.seq, mirror="https://prosite.expasy.org/")
result = ScanProsite.read(handle)

In [74]:
result.n_match

1

In [75]:
result[0]

{'sequence_ac': 'USERSEQ1',
 'start': 1,
 'stop': 118,
 'signature_ac': 'PS51921',
 'score': '32.871',
 'level': '0'}
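
Each ScanProsite hit is a plain dictionary, and the score arrives as a string. The sketch below pulls out the matched region and converts the score to a float. It assumes that `start` and `stop` are 1-based and inclusive, and the function name `summarise_hits` is hypothetical.

```python
def summarise_hits(sequence, hits):
    """Return signature accession, matched region and numeric score per hit."""
    rows = []
    for hit in hits:
        score = hit.get("score")  # may be missing for some hit types
        rows.append(
            {
                "signature_ac": hit["signature_ac"],
                # Assumption: coordinates are 1-based and inclusive.
                "region": str(sequence)[hit["start"] - 1 : hit["stop"]],
                "score": float(score) if score is not None else None,
            }
        )
    return rows
```

With the objects above, `summarise_hits(prot_record.seq, result)` would return one entry for the `PS51921` match.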

------------------------------------------------------

# 5. KEGG

### Import Modules

In [76]:
from Bio.KEGG import REST, Enzyme

In [77]:
help(Enzyme)

Help on package Bio.KEGG.Enzyme in Bio.KEGG:

NAME
    Bio.KEGG.Enzyme - Code to work with the KEGG Enzyme database.

DESCRIPTION
    Functions:
     - parse - Returns an iterator giving Record objects.
    
    Classes:
     - Record - Holds the information from a KEGG Enzyme record.

PACKAGE CONTENTS


CLASSES
    builtins.object
        Record
    
    class Record(builtins.object)
     |  Holds info from a KEGG Enzyme record.
     |  
     |  Attributes:
     |   - entry       The EC number (without the 'EC ').
     |   - name        A list of the enzyme names.
     |   - classname   A list of the classification terms.
     |   - sysname     The systematic name of the enzyme.
     |   - reaction    A list of the reaction description strings.
     |   - substrate   A list of the substrates.
     |   - product     A list of the products.
     |   - inhibitor   A list of the inhibitors.
     |   - cofactor    A list of the cofactors.
     |   - effector    A list of the effectors.
   

In [78]:
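request = REST.kegg_get("ec:5.4.2.2")
# Write the record to disk; the context manager closes the file afterwards.
with open("ec_5.4.2.2", "w") as out_handle:
    n_chars = out_handle.write(request.read())
n_chars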

297798

In [79]:
# Parse the saved file and take its first (and only) record.
with open("ec_5.4.2.2") as in_handle:
    record = next(Enzyme.parse(in_handle))
record.classname

['Isomerases;',
 'Intramolecular transferases;',
 'Phosphotransferases (phosphomutases)']

In [80]:
record.pathway

[('PATH', 'ec00010', 'Glycolysis / Gluconeogenesis'),
 ('PATH', 'ec00030', 'Pentose phosphate pathway'),
 ('PATH', 'ec00052', 'Galactose metabolism'),
 ('PATH', 'ec00230', 'Purine metabolism'),
 ('PATH', 'ec00500', 'Starch and sucrose metabolism'),
 ('PATH', 'ec00520', 'Amino sugar and nucleotide sugar metabolism'),
 ('PATH', 'ec00521', 'Streptomycin biosynthesis'),
 ('PATH', 'ec01100', 'Metabolic pathways'),
 ('PATH', 'ec01110', 'Biosynthesis of secondary metabolites'),
 ('PATH', 'ec01120', 'Microbial metabolism in diverse environments')]

In [81]:
record.genes[:10]

[('HSA', ['5236', '55276']),
 ('PTR', ['456908', '461162']),
 ('PPS', ['100977295', '100993927']),
 ('GGO', ['101128874', '101131551']),
 ('PON', ['100190836', '100438793']),
 ('PPYG', ['129034752', '129035286']),
 ('NLE', ['100596081', '100600656']),
 ('HMH', ['116456694', '116457795']),
 ('SSYN', ['129458637', '129464875']),
 ('MCC', ['100424648', '699401'])]

In [94]:
# Each entry in record.genes is (organism code, list of gene IDs);
# collect the organism codes only.
organisms = [org for org, gene_ids in record.genes]

print(organisms[:10])

['HSA', 'PTR', 'PPS', 'GGO', 'PON', 'PPYG', 'NLE', 'HMH', 'SSYN', 'MCC']
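
The `(organism, gene IDs)` pairs can also be flattened into KEGG gene identifiers of the form `org:id` (for example `hsa:5236`), which is the form that `REST.kegg_get` accepts for gene entries. A minimal sketch, with `kegg_gene_ids` as a hypothetical helper name:

```python
def kegg_gene_ids(genes):
    """Flatten [(org, [id, ...]), ...] into lowercase 'org:id' identifiers."""
    return [
        f"{org.lower()}:{gene_id}"
        for org, gene_ids in genes
        for gene_id in gene_ids
    ]
```

For example, `kegg_gene_ids(record.genes[:1])` gives the two human entries, `hsa:5236` and `hsa:55276`.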


------------------------------------------------------