# ACCESSING BIOINFORMATICS DATABASES WITH BIOPYTHON

1. [NCBI](#1.-NCBI)<br>
    1.1. [Nucleotide BLAST](#1.1.-Nucleotide-BLAST)<br>
    1.2. [Protein BLAST](#1.2.-Protein-BLAST)
    
2. [ENTREZ](#2.-ENTREZ)<br>
    2.1. [PUBMED](#2.1.-PUBMED)<br>
    2.2. [Nucleotide](#2.2.-Nucleotide)
    
3. [PDB](#3.-PDB)

4. [EXPASY](#4.-EXPASY)<br>
    4.1. [PROSITE](#4.1.-PROSITE)<br>
    4.2. [ScanProsite](#4.2.-ScanProsite)
    
5. [KEGG](#5.-KEGG)

# 1. NCBI

### Import Modules

In [1]:
from Bio.Blast import NCBIWWW
from Bio import SeqIO, SearchIO

In [2]:
help(NCBIWWW)

Help on module Bio.Blast.NCBIWWW in Bio.Blast:

NAME
    Bio.Blast.NCBIWWW - Code to invoke the NCBI BLAST server over the internet.

DESCRIPTION
    This module provides code to work with the WWW version of BLAST
    provided by the NCBI. https://blast.ncbi.nlm.nih.gov/

FUNCTIONS
    qblast(program, database, sequence, url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi', auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, short_query=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_

## 1.1. Nucleotide BLAST

In [3]:
!ls

dir  notebook.ipynb  nuc_seq.fasta  obsolete  prot_seq.fasta


In [4]:
nuc_record=SeqIO.read("nuc_seq.fasta",format="fasta")
len(nuc_record)

774

In [5]:
# nuc_record.description  # description line of the sequence
# nuc_record.seq  # the sequence itself, a string of nucleotides (A, C, G, T)
result_handle = NCBIWWW.qblast("blastn", "nt", nuc_record.seq)  # results are returned in XML format
blast_result = SearchIO.read(result_handle, "blast-xml")


In [6]:
print(blast_result[0:2])

Program: blastn (2.15.0+)
  Query: No (774)
         definition line
 Target: nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|2716167577|gb|PP645709.1|  Severe acute respiratory ...
            1      1  gi|2716164760|gb|PP645492.1|  Severe acute respiratory ...


In [7]:
Seq = blast_result[0]
print(f"Sequence ID: {Seq.id}")
print(f"Sequence Description: {Seq.description}")

details=Seq[0]
print(f"E-value: {details.evalue}")

Sequence ID: gi|2716167577|gb|PP645709.1|
Sequence Description: Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/RI-RISHL-023586/2020 ORF1ab polyprotein (ORF1ab), ORF1a polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
E-value: 0.0
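
The hit IDs printed above are pipe-delimited (`gi|...|gb|...|`). The following is a minimal sketch of extracting the GenBank accession with plain string handling. The helper names `split_hit_id` and `genbank_accession` are hypothetical (they are not part of Biopython), and the sketch assumes the accession directly follows the `gb` tag, as it does in the ID shown above.

```python
def split_hit_id(hit_id):
    """Split a pipe-delimited hit ID into its non-empty fields."""
    return [field for field in hit_id.split("|") if field]


def genbank_accession(hit_id):
    """Return the field that follows the 'gb' tag, or None if there is none."""
    fields = split_hit_id(hit_id)
    if "gb" in fields:
        position = fields.index("gb")
        if position + 1 < len(fields):
            return fields[position + 1]
    return None


# ID copied from the BLAST output above
print(genbank_accession("gi|2716167577|gb|PP645709.1|"))
```

IDs without a `gb` tag, such as the `pdb|8ELJ|A` style IDs returned by a search against `pdb`, make the helper return `None`.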


In [8]:
print(f"alignment:\n{details.aln}")

alignment:
Alignment with 2 rows and 774 columns
ATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAA...GGT No
ATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAA...GGT gi|2716167577|gb|PP645709.1|


## 1.2. Protein BLAST

In [9]:
prot_record = SeqIO.read("prot_seq.fasta", format="fasta")
len(prot_record)

258

In [10]:
result_handle = NCBIWWW.qblast("blastp", "pdb", prot_record.seq)
blast_result = SearchIO.read(result_handle, "blast-xml")

In [11]:
print(blast_result[0:2])

Program: blastp (2.15.0+)
  Query: unnamed (258)
         protein product
 Target: pdb
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  pdb|8ELJ|A  Chain A, Spike glycoprotein [Severe acute r...
            1      1  pdb|7CAB|A  Chain A, Spike glycoprotein [Severe acute r...


In [12]:
Seq = blast_result[0]
print(f"Sequence ID: {Seq.id}")
print(f"Sequence Description: {Seq.description}")

details = Seq[0]
print(f"E-value: {details.evalue}")

Sequence ID: pdb|8ELJ|A
Sequence Description: Chain A, Spike glycoprotein [Severe acute respiratory syndrome coronavirus 2]
E-value: 0.0


In [13]:
print(f"alignment:\n {details.aln}")

alignment:
 Alignment with 2 rows and 258 columns
IAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLY...PIG unnamed
IAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLY...PIG pdb|8ELJ|A


------------------------------------------------------

# 2. ENTREZ

### Import Modules

In [14]:
from Bio import Entrez

In [15]:
help(Entrez)

Help on package Bio.Entrez in Bio:

NAME
    Bio.Entrez - Provides code to access NCBI over the WWW.

DESCRIPTION
    The main Entrez web page is available at:
    http://www.ncbi.nlm.nih.gov/Entrez/
    
    Entrez Programming Utilities web page is available at:
    http://www.ncbi.nlm.nih.gov/books/NBK25501/
    
    This module provides a number of functions like ``efetch`` (short for
    Entrez Fetch) which will return the data as a handle object. This is
    a standard interface used in Python for reading data from a file, or
    in this case a remote network connection, and provides methods like
    ``.read()`` or offers iteration over the contents line by line. See
    also "What the heck is a handle?" in the Biopython Tutorial and
    Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
    http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
    The handle returned by these functions can be either in text mode or
    in binary mode, depending on the data requested a

In [16]:
Entrez.email = "datacyclopes@gmail.com"

In [17]:
handle = Entrez.einfo()
record = Entrez.read(handle)
record["DbList"]

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

## 2.1. PUBMED

In [18]:
handle = Entrez.einfo(db="pubmed")
record = Entrez.read(handle)
record["DbInfo"]["Description"]

'PubMed bibliographic record'

In [19]:
record["DbInfo"]["Count"]


'37208799'

In [20]:
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
record["IdList"]


['38650605', '38365590', '38235175', '37810457', '37668712', '36818783', '36245797', '36094101', '35497637', '35496474', '35402671', '34735950', '34484417', '34434786', '34189012', '33994075', '33902722', '33809815', '33242467', '32044951']

In [21]:
handle = Entrez.esummary(db="pubmed", id='38650605, 38365590')
records = Entrez.parse(handle)


for record in records:
    print(record['AuthorList'],record['Title'],record['PubDate'],record['FullJournalName'])


['Sulkowski A', 'Bouton C', 'Swanson C'] Syn-CpG-Spacer: A Panel web app for synonymous recoding of viral genomes with CpG dinucleotides. 2024 Apr 3 Journal of open source software
['Bessa-Silva A'] Fasta2Structure: a user-friendly tool for converting multiple aligned FASTA files to STRUCTURE format. 2024 Feb 15 BMC bioinformatics
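
`Entrez.esummary` was given several IDs above as one comma-separated string. When an `IdList` is long, it may be convenient to split it into smaller comma-separated batches first. The sketch below shows one way to do that; `batch_ids` is a hypothetical helper, and the batch size of 2 is an arbitrary value chosen for illustration.

```python
def batch_ids(id_list, batch_size=2):
    """Yield comma-separated ID strings holding at most batch_size IDs each."""
    for start in range(0, len(id_list), batch_size):
        yield ",".join(id_list[start:start + batch_size])


# The first three PubMed IDs from the esearch result above
pubmed_ids = ["38650605", "38365590", "38235175"]
for id_string in batch_ids(pubmed_ids):
    print(id_string)
```

Each yielded string could then be passed as the `id` argument of `Entrez.esummary`, in the same way as the literal string used in the cell above.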


In [22]:
handle = Entrez.efetch(db="pubmed", id="19811691")
print(handle.read())



b'<?xml version="1.0" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2024//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_240101.dtd">\n<PubmedArticleSet>\n<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM"><PMID Version="1">19811691</PMID><DateCompleted><Year>2010</Year><Month>02</Month><Day>12</Day></DateCompleted><DateRevised><Year>2021</Year><Month>10</Month><Day>20</Day></DateRevised><Article PubModel="Electronic"><Journal><ISSN IssnType="Electronic">1471-2105</ISSN><JournalIssue CitedMedium="Internet"><Volume>10 Suppl 11</Volume><Issue>Suppl 11</Issue><PubDate><Year>2009</Year><Month>Oct</Month><Day>08</Day></PubDate></JournalIssue><Title>BMC bioinformatics</Title><ISOAbbreviation>BMC Bioinformatics</ISOAbbreviation></Journal><ArticleTitle>Exploratory visual analysis of conserved domains on multiple sequence alignments.</ArticleTitle><Pagination><StartPage>S7</StartPage><MedlinePgn>S7</MedlinePgn></Pagination><ELocationID EIdType="doi

## 2.2. Nucleotide

In [23]:
handle = Entrez.esearch(db="nucleotide",retmax=10 ,term="Severe acute respiratory syndrome")
record = Entrez.read(handle)
record["IdList"]


['2727771785', '2727771771', '2727771757', '2727771744', '2727771730', '2727771715', '2727771699', '2727771685', '2727771670', '2727771657']

In [24]:
handle =Entrez.efetch(db="nucleotide",id="1993774296",rettype='gb' ,retmode="text")
print(handle.read())



LOCUS       MW656254               29832 bp    RNA     linear   VRL 23-NOV-2021
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate
            SARS-CoV-2/human/USA/NC-SLPH-0027/2020 ORF1ab polyprotein (ORF1ab),
            ORF1a polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein
            (ORF3a), envelope protein (E), membrane glycoprotein (M), and ORF6
            protein (ORF6) genes, complete cds; ORF7a protein (ORF7a) and ORF7b
            (ORF7b) genes, partial cds; and ORF8 protein (ORF8), nucleocapsid
            phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds.
ACCESSION   MW656254
VERSION     MW656254.1
KEYWORDS    purposeofsampling:baselinesurveillance.
SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
  ORGANISM  Severe acute respiratory syndrome coronavirus 2
            Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
            Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronav

In [25]:
handle = Entrez.esearch(db='nucleotide', term='accD[Gene Name] AND "E. coli"[Organism]', retmax="20")
result_list = Entrez.read(handle)

In [26]:
id_list = result_list['IdList']
count = result_list['Count']

print(id_list)
print("\n")
print(count)


['2727925337', '2727925307', '2727924808', '2727924260', '2727923901', '2727923899', '2727682446', '2727614230', '2727614229', '2727614228', '2727613749', '2727613748', '2727613747', '2727613727', '2727581472', '2727577636', '2727577630', '2727577629', '2727577627', '2727577625']


267924
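
The search term above combines two field-tagged clauses with `AND`. As a rough sketch, such terms can be assembled from (value, field tag) pairs. `build_term` is a hypothetical helper, not an Entrez function, and it only covers the simple all-`AND` case. Note also that `Count` comes back as a string, so it needs converting before any arithmetic.

```python
def build_term(clauses):
    """Join (value, field tag) pairs into a single Entrez term using AND."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in clauses)


term = build_term([("accD", "Gene Name"), ('"E. coli"', "Organism")])
print(term)

# 'Count' is returned as a string, as seen in the output above
total_hits = int("267924")
print(total_hits > 20)  # more hits exist than the 20 requested via retmax
```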


In [27]:
handle.close() 


------------------------------------------------------

# 3. PDB

### Import Modules

In [28]:
from Bio.PDB import PDBParser,PDBList

In [29]:
help(PDBList)

Help on class PDBList in module Bio.PDB.PDBList:

class PDBList(builtins.object)
 |  Quick access to the structure lists on the PDB or its mirrors.
 |  
 |  This class provides quick access to the structure lists on the
 |  PDB server or its mirrors. The structure lists contain
 |  four-letter PDB codes, indicating that structures are
 |  new, have been modified or are obsolete. The lists are released
 |  on a weekly basis.
 |  
 |  It also provides a function to retrieve PDB files from the server.
 |  To use it properly, prepare a directory /pdb or the like,
 |  where PDB files are stored.
 |  
 |  All available file formats (PDB, PDBx/mmCif, PDBML, mmtf) are supported.
 |  Please note that large structures (containing >62 chains
 |  and/or 99999 ATOM lines) are no longer stored as a single PDB file
 |  and by default (when PDB format selected) are not downloaded.
 |  
 |  Large structures can be downloaded in other formats, including PDBx/mmCif
 |  or as a .tar file (a collection of 

In [30]:
pdbl=PDBList()
pdbl.retrieve_pdb_file("7BYR",file_format="pdb",pdir="dir")

Structure exists: 'dir/pdb7byr.ent' 


'dir/pdb7byr.ent'
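
The path returned above follows the pattern `pdb<code>.ent`, with the PDB code in lower case. A small sketch that rebuilds this path, for instance to check whether a structure is already on disk, might look as follows. `local_pdb_path` is a hypothetical helper, and the naming pattern is inferred only from the output above for `file_format="pdb"`; other formats are not covered.

```python
import os


def local_pdb_path(pdb_code, pdir):
    """Build the expected local file path for a PDB-format download."""
    return os.path.join(pdir, f"pdb{pdb_code.lower()}.ent")


path = local_pdb_path("7BYR", "dir")
print(path, os.path.exists(path))
```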

In [31]:
parser =PDBParser()
structure =parser.get_structure("7BYR","dir/pdb7byr.ent")



In [32]:
for chain in structure[0]:
    print(f"Chain ID: {chain.id}")

Chain ID: A
Chain ID: B
Chain ID: C
Chain ID: H
Chain ID: L
Chain ID: D
Chain ID: E
Chain ID: F
Chain ID: G
Chain ID: I
Chain ID: J


In [33]:
resolution = structure.header["resolution"]
resolution

3.84

In [34]:
keywords =structure.header["keywords"]
keywords

'sars-cov-2, antigen, rbd, neutralizing antibody, viral protein'

------------------------------------------------------

# 4. EXPASY

## 4.1. PROSITE

### Import Modules

In [35]:
from Bio import ExPASy
from Bio.ExPASy import Prosite

In [36]:
help(Prosite)

Help on module Bio.ExPASy.Prosite in Bio.ExPASy:

NAME
    Bio.ExPASy.Prosite - Parser for the prosite dat file from Prosite at ExPASy.

DESCRIPTION
    See https://www.expasy.org/prosite/
    
    Tested with:
     - Release 20.43, 10-Feb-2009
     - Release 2017_03 of 15-Mar-2017.
    
    Functions:
     - read                  Reads a Prosite file containing one Prosite record
     - parse                 Iterates over records in a Prosite file.
    
    Classes:
     - Record                Holds Prosite data.

CLASSES
    builtins.object
        Record
    
    class Record(builtins.object)
     |  Holds information from a Prosite record.
     |  
     |  Main attributes:
     |   - name           ID of the record.  e.g. ADH_ZINC
     |   - type           Type of entry.  e.g. PATTERN, MATRIX, or RULE
     |   - accession      e.g. PS00387
     |   - created        Date the entry was created.  (MMM-YYYY for releases
     |     before January 2017, DD-MMM-YYYY since January 2017)
 

In [37]:
handle = ExPASy.get_prosite_raw("PS51442")
record = Prosite.read(handle)

In [38]:
print(record.description)

Coronavirus main protease (M-pro) domain profile.


In [39]:
print(record.pdb_structs[:10])

['1LVO', '1P9S', '1P9U', '1Q2W', '1UJ1', '1UK2', '1UK3', '1UK4', '1WOF', '1Z1I']


In [40]:
handle = ExPASy.get_prosite_raw('PS00001')  # PROSITE accessions start with "PS"
record = Prosite.read(handle)
print(record.pattern)
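
For entries of type PATTERN, `record.pattern` holds the motif in PROSITE pattern syntax. The sketch below translates a simplified subset of that syntax into a Python regular expression so the motif can be searched in a sequence string. `prosite_to_regex` is a hypothetical helper, not part of Biopython. It assumes elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, and `<` / `>` as terminal anchors outside brackets. The example pattern and test sequence are illustrative only.

```python
import re


def prosite_to_regex(pattern):
    """Translate a (simplified) PROSITE pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m)
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        if low and high:
            repeat = "{" + low + "," + high + "}"
        elif low:
            repeat = "{" + low + "}"
        else:
            repeat = ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
print(re.search(regex, "MKNGTA"))
```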

## 4.2. ScanProsite

### Import Modules

In [None]:
from Bio.ExPASy import ScanProsite

In [None]:
prot_record = SeqIO.read("prot_seq.fasta", format="fasta")
len(prot_record.seq)

In [None]:
handle = ScanProsite.scan(seq=prot_record.seq, mirror="https://prosite.expasy.org")
result = ScanProsite.read(handle)

In [None]:
result.n_match

In [None]:
result[0]

------------------------------------------------------

# 5. KEGG

### Import Modules

In [None]:
from Bio.KEGG import REST, Enzyme

In [None]:
help(Enzyme)

In [None]:
request = REST.kegg_get("ec:5.4.2.2")
with open("ec_5.4.2.2.txt", "w") as out_file:
    out_file.write(request.read())

In [None]:
with open("ec_5.4.2.2.txt") as in_file:
    record = list(Enzyme.parse(in_file))[0]  # the file holds a single enzyme record
record.classname

In [None]:
record.pathway

In [None]:
record.genes[:10]

In [None]:
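# Each entry in record.genes is an (organism code, [gene IDs]) tuple;
# this loop collects the organism codes.
list_genes = []
for x, y in record.genes:
    list_genes += x.split("\n")

print(list_genes[:10])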
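
Because `record.genes` pairs each organism code with a list of gene IDs, a flat list of individual genes needs a nested loop. The sketch below shows this on made-up placeholder data, so the organism codes and gene IDs are not real KEGG entries. `flatten_genes` is a hypothetical helper, and the lower-case `organism:gene` form is assumed here as the output format.

```python
def flatten_genes(genes):
    """Turn (organism, [gene IDs]) tuples into 'organism:gene' strings."""
    return [
        f"{organism.lower()}:{gene_id}"
        for organism, gene_ids in genes
        for gene_id in gene_ids
    ]


# Placeholder data with the same shape as record.genes (not real entries)
example_genes = [("ORG1", ["101", "102"]), ("ORG2", ["g7"])]
print(flatten_genes(example_genes))
```

With a parsed record, the same call would be `flatten_genes(record.genes)`.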

------------------------------------------------------