Python for Bioinformatics
-----------------------------

![title](https://s3.amazonaws.com/py4bio/tapabiosmall.png)

This Jupyter notebook is intented to be used alongside the book [Python for Bioinformatics](http://py3.us/)



**Note:** Before opening the file, this file should be accesible from this Jupyter notebook. In order to do so, the following commands will download these files from Github and extract them into a directory called samples.

In [0]:
!curl https://raw.githubusercontent.com/Serulab/Py4Bio/master/samples/samples.tar.bz2 -o samples.tar.bz2
!mkdir samples
!tar xvfj samples.tar.bz2 -C samples

Chapter 9: Introduction to Biopython
-----------------------------

In [0]:
import platform
platform.platform()

'Linux-4.4.0-81-generic-x86_64-with-debian-stretch-sid'

In [0]:
!pip install biopython

Collecting biopython
[?25l  Downloading https://files.pythonhosted.org/packages/83/3d/e0c8a993dbea1136be90c31345aefc5babdd5046cd52f81c18fc3fdad865/biopython-1.76-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)
[K     |████████████████████████████████| 2.3MB 2.8MB/s 
Installing collected packages: biopython
Successfully installed biopython-1.76


Test Installation
-----------------

In [0]:
import Bio
Bio.__version__

'1.68'

In [0]:
import Bio.Alphabet
Bio.Alphabet.ThreeLetterProtein.letters

['Ala',
 'Asx',
 'Cys',
 'Asp',
 'Glu',
 'Phe',
 'Gly',
 'His',
 'Ile',
 'Lys',
 'Leu',
 'Met',
 'Asn',
 'Pro',
 'Gln',
 'Arg',
 'Ser',
 'Thr',
 'Sec',
 'Val',
 'Trp',
 'Xaa',
 'Tyr',
 'Glx']

In [0]:
from Bio.Alphabet import IUPAC
IUPAC.IUPACProtein.letters

'ACDEFGHIKLMNPQRSTVWY'

In [0]:
IUPAC.unambiguous_dna.letters

'GATC'

In [0]:
IUPAC.ambiguous_dna.letters

'GATCRYWSMKHBVDN'

In [0]:
IUPAC.ExtendedIUPACProtein.letters

'ACDEFGHIKLMNPQRSTVWYBXZJUO'

In [0]:
IUPAC.ExtendedIUPACDNA.letters

'GATCBDSW'

In [0]:
from Bio.Seq import Seq
import Bio.Alphabet
seq = Seq('CCGGGTT', Bio.Alphabet.IUPAC.unambiguous_dna)
seq.transcribe()

Seq('CCGGGUU', IUPACUnambiguousRNA())

In [0]:
seq.translate()



Seq('PG', IUPACProtein())

In [0]:
rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna)
rna_seq.transcribe()

ValueError: RNA cannot be transcribed!

In [0]:
rna_seq.translate()



Seq('PG', IUPACProtein())

In [0]:
rna_seq.back_transcribe()

Seq('CCGGGTT', IUPACUnambiguousDNA())

Tip: The Transcribe Function in Biopython
-----------------------------------------

In [0]:
from Bio.Seq import translate, transcribe, back_transcribe
dnaseq = 'ATGGTATAA'
translate(dnaseq)


'MV*'

In [0]:
transcribe(dnaseq)

'AUGGUAUAA'

In [0]:
rnaseq = transcribe(dnaseq)
translate(rnaseq)

'MV*'

In [0]:
back_transcribe(rnaseq)

'ATGGTATAA'

Seq Objects as a String
-----------------------

In [0]:
seq = Seq('CCGGGTTAACGTA',Bio.Alphabet.IUPAC.unambiguous_dna)
seq[:5]

Seq('CCGGG', IUPACUnambiguousDNA())

In [0]:
len(seq)

13

In [0]:
print(seq)

CCGGGTTAACGTA


MutableSeq
----------

In [0]:
seq[0] = 'T'

TypeError: 'Seq' object does not support item assignment

In [0]:
mut_seq = seq.tomutable()
mut_seq

MutableSeq('CCGGGTTAACGTA', IUPACUnambiguousDNA())

In [0]:
mut_seq[0] = 'T'
mut_seq

MutableSeq('TCGGGTTAACGTA', IUPACUnambiguousDNA())

In [0]:
mut_seq.reverse()
mut_seq

MutableSeq('ATGCAATTGGGCT', IUPACUnambiguousDNA())

In [0]:
mut_seq.complement()

In [0]:
mut_seq

MutableSeq('TACGTTAACCCGA', IUPACUnambiguousDNA())

In [0]:
mut_seq.reverse_complement()
mut_seq

MutableSeq('TCGGGTTAACGTA', IUPACUnambiguousDNA())

SeqRecord
---------

In [0]:
from Bio.SeqRecord import SeqRecord
SeqRecord(seq, id='001', name='MHC gene')

SeqRecord(seq=Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA()), id='001', name='MHC gene', description='<unknown description>', dbxrefs=[])

In [0]:
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
rec = SeqRecord(Seq('mdstnvrsgmksrkkkpkttvidddddcmtcsacqs'
                        'klvkisditkvsldyintmrgntlacaacgsslkll',
                generic_protein),
                id = 'P20994.1', name = 'P20994',
                description = 'Protein A19',
                dbxrefs = ['Pfam:PF05077', 'InterPro:IPR007769',
                           'DIP:2186N'])
rec.annotations['note'] = 'A simple note'
print(rec)

ID: P20994.1
Name: P20994
Description: Protein A19
Database cross-references: Pfam:PF05077, InterPro:IPR007769, DIP:2186N
Number of features: 0
/note=A simple note
Seq('mdstnvrsgmksrkkkpkttvidddddcmtcsacqsklvkisditkvsldyint...kll', ProteinAlphabet())


Align
-----

**Listing 9.1:** Using Align module

In [0]:
from Bio.Alphabet import generic_protein
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

seq1 = 'MHQAIFIYQIGYPLKSGYIQSIRSPEYDNW'
seq2 = 'MH--IFIYQIGYALKSGYIQSIRSPEY-NW'
seq_rec_1 = SeqRecord(Seq(seq1, generic_protein), id = 'asp')
seq_rec_2 = SeqRecord(Seq(seq2, generic_protein), id = 'unk')
align = MultipleSeqAlignment([seq_rec_1, seq_rec_2])
print(align)

ProteinAlphabet() alignment with 2 rows and 30 columns
MHQAIFIYQIGYPLKSGYIQSIRSPEYDNW asp
MH--IFIYQIGYALKSGYIQSIRSPEY-NW unk


In [0]:
seq3 = 'M---IFIYQIGYAAKSGYIQSIRSPEY--W'
seq_rec_3 = SeqRecord(Seq(seq3, generic_protein), id = 'cas')
align.extend([seq_rec_3])
print(align)

ProteinAlphabet() alignment with 3 rows and 30 columns
MHQAIFIYQIGYPLKSGYIQSIRSPEYDNW asp
MH--IFIYQIGYALKSGYIQSIRSPEY-NW unk
M---IFIYQIGYAAKSGYIQSIRSPEY--W cas


In [0]:
align[0]

SeqRecord(seq=Seq('MHQAIFIYQIGYPLKSGYIQSIRSPEYDNW', ProteinAlphabet()), id='asp', name='<unknown name>', description='<unknown description>', dbxrefs=[])

In [0]:
print(align[:2,5:11])

ProteinAlphabet() alignment with 2 rows and 6 columns
FIYQIG asp
FIYQIG unk


In [0]:
len(align)

3

In [0]:
from Bio.SeqUtils.ProtParam import ProteinAnalysis
for seq in align:
    print(ProteinAnalysis(str(seq.seq)).isoelectric_point())

6.50421142578125
8.16033935546875
8.13848876953125


AlignIO
--------

In [0]:
from Bio import AlignIO
align = AlignIO.read('samples/cas9align.fasta', 'fasta')
print(align)

SingleLetterAlphabet() alignment with 8 rows and 1407 columns
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD J7M7J1
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A0C6FZC2
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A1C2CVQ9
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A1C2CV43
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD Q48TU5
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD M4YX12
MKKPYSIGLDIGTNSVGWAVVTDDYKVPAKKMKVLGNTDKSHIK...GGD A0A0E2EP65
--------------------------------------------...GED A0A150NVN1


In [0]:
from Bio import AlignIO
for alignment in AlignIO.parse('samples/example.aln', 'clustal'):
    print(len(alignment))

6


In [0]:
from Bio import AlignIO
AlignIO.convert(open('samples/cas9align.fasta'), 'fasta', 'cas9align.aln', 'clustal')

1

AlignInfo
---------

In [0]:
from Bio import AlignIO
from Bio.Align.AlignInfo import SummaryInfo
from Bio.Alphabet import ProteinAlphabet

align = AlignIO.read('samples/cas9align.fasta', 'fasta', alphabet=ProteinAlphabet())
summary = SummaryInfo(align)
print(summary.information_content())

4951.072487965924


In [0]:
summary.dumb_consensus(consensus_alpha=ProteinAlphabet())

Seq('MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGALLFD...GGD', ProteinAlphabet())

In [0]:
summary.gap_consensus(consensus_alpha=ProteinAlphabet())

Seq('MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGALLFD...GGD', ProteinAlphabet())

In [0]:
print(summary.alignment)

ProteinAlphabet() alignment with 8 rows and 1407 columns
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD J7M7J1
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A0C6FZC2
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A1C2CVQ9
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A1C2CV43
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD Q48TU5
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD M4YX12
MKKPYSIGLDIGTNSVGWAVVTDDYKVPAKKMKVLGNTDKSHIK...GGD A0A0E2EP65
--------------------------------------------...GED A0A150NVN1


In [0]:
print(summary.pos_specific_score_matrix())

    -   A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
M  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D  1.0 0.0 0.0 6.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
K  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
K  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Y  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0
S  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0
I  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
G  1.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
L  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D  1.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0

In [0]:
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = 'clustalw2'
ccli = ClustalwCommandline(clustalw_exe, infile="samples/input4align.fasta", outfile='../../aoutput.aln')
print(ccli)


clustalw2 -infile=samples/input4align.fasta -outfile=../../aoutput.aln


In [0]:
 clustalw_exe = 'clustalw2'

In [0]:
clustalw_exe='c:\\windows\\program file\\clustal\\clustalw.exe'

In [0]:
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = 'clustalw2'
ccli = ClustalwCommandline(clustalw_exe,
infile="samples/input4align.fasta", outfile='../../aoutput.aln')
ccli()

ApplicationError: Non-zero return code 127 from 'clustalw2 -infile=samples/input4align.fasta -outfile=../../aoutput.aln', message '/bin/sh: 1: clustalw2: not found'

In [0]:
from Bio import AlignIO
seqs = AlignIO.read('samples/aoutput.aln', 'clustal')
seqs[0]

FileNotFoundError: [Errno 2] No such file or directory: 'samples/aoutput.aln'

In [0]:
seqs[1]

NameError: name 'seqs' is not defined

In [0]:
seqs[2]

NameError: name 'seqs' is not defined

In [0]:
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = 'clustalw2'
ccli = ClustalwCommandline(clustalw_exe,
infile="input4align.fasta", outfile='../../aoutput.aln',
pwgapopen=5)
print(ccli)

ImportError: No module named 'Bio'

In [0]:
from Bio.Align.Applications import ClustalwCommandline
ccli = ClustalwCommandline()
help(ccli)

ImportError: No module named 'Bio'

SeqIO
----------


In [0]:
from Bio import SeqIO
f_in = open('samples/a19.gp')
seq = SeqIO.parse(f_in, 'genbank')
next(seq)

SeqRecord(seq=Seq('MGHHHHHHHHHHSSGHIDDDDKHMLEMDSTNVRSGMKSRKKKPKTTVIDDDDDC...FAS', IUPACProtein()), id='AAX78491.1', name='AAX78491', description='unknown [synthetic construct].', dbxrefs=[])

In [0]:
f_in = open('samples/a19.gp')
SeqIO.read(f_in, 'genbank')

SeqRecord(seq=Seq('MGHHHHHHHHHHSSGHIDDDDKHMLEMDSTNVRSGMKSRKKKPKTTVIDDDDDC...FAS', IUPACProtein()), id='AAX78491.1', name='AAX78491', description='unknown [synthetic construct].', dbxrefs=[])

**Listing 9.2:** readfasta.py: Read a FASTA file

In [0]:
from Bio import SeqIO

FILE_IN = 'samples/3seqs.fas'

with open(FILE_IN) as fh:
    for record in SeqIO.parse(fh, 'fasta'):
        id_ = record.id
        seq = record.seq
        print('Name: {0}, size: {1}'.format(id_, len(seq)))

Name: Protein-X, size: 38
Name: Protein-Y, size: 62
Name: Protein-Z, size: 60


**Listing 9.3:** rwfasta.py: Read a file and write it as a FASTA sequence

In [0]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
with open('samples/NC2033.txt') as fh:
    with open('NC2033.fasta','w') as f_out:
        rawseq = fh.read().replace('\n','')
        record = (SeqRecord(Seq(rawseq),'NC2033.txt','',''),)
        SeqIO.write(record, f_out,'fasta')

In [0]:
from Bio import SeqIO
fo_handle = open('myseqs.fasta','w')
readseq = SeqIO.parse(open('samples/myseqs.gbk'), 'genbank')
SeqIO.write(readseq, fo_handle, "fasta")
fo_handle.close()

FileNotFoundError: [Errno 2] No such file or directory: 'samples/myseqs.gbk'

In [0]:
from Bio import AlignIO
fn = open('samples/secu3.aln')
align = AlignIO.read(fn, 'clustal')
print(align)

FileNotFoundError: [Errno 2] No such file or directory: 'samples/secu3.aln'

**Listing 9.4:** Alignments

In [0]:
fi = open('samples/example.aln')
with open('samples/example.phy', 'w') as fo:
    align = AlignIO.read(fi, 'clustal')
    AlignIO.write([align], fo, 'phylip')

BLAST
-----

**Listing 9.5:** runblastn.py: Running a local NCBI BLAST

In [0]:
from Bio.Blast.Applications imp
BLAST_EXE = '~/opt/ncbi-blast-2
f_in = '../../samples/seq3.txt'
b_db = 'db/samples/TAIR8cds'
blastn_cline = blastn(cmd=BLAST
                      evalue=.0
rh, eh = blastn_cline()

SyntaxError: invalid syntax (<ipython-input-26-eabf2fcca57f>, line 1)

In [0]:
 rh.readline()

NameError: name 'rh' is not defined

In [0]:
rh.readline()

NameError: name 'rh' is not defined

In [0]:
 eh.readline()

NameError: name 'eh' is not defined

In [0]:
fh = open('testblast.xml','w')
fh.write(rh.read())
fh.close()

NameError: name 'rh' is not defined

In [0]:
from Bio.Blast import NCBIXML
for blast_record in NCBIXML.parse(rh):
# Do something with blast_record

SyntaxError: unexpected EOF while parsing (<ipython-input-30-96ccfdc6ac51>, line 3)

**Listing 9.7:** BLASTparser1.py: Extract alignments title from a BLAST output

In [0]:
from Bio.Blast import NCBIXML
with open('samples/sampleXblast.xml') as xmlfh:
    for record in NCBIXML.parse(xmlfh):
        for align in record.alignments:
            print(align.title)

gi|114816|sp|P04252.1|BAHG_VITST RecName: Full=Bacterial hemoglobin; AltName: Full=Soluble cytochrome O
gi|52000645|sp|Q9RC40.1|HMP_BACHD RecName: Full=Flavohemoprotein; AltName: Full=Flavohemoglobin; AltName: Full=Hemoglobin-like protein; AltName: Full=Nitric oxide dioxygenase; Short=NO oxygenase; Short=NOD
gi|52000637|sp|Q8ETH0.1|HMP_OCEIH RecName: Full=Flavohemoprotein; AltName: Full=Flavohemoglobin; AltName: Full=Hemoglobin-like protein; AltName: Full=Nitric oxide dioxygenase; Short=NO oxygenase; Short=NOD


In [0]:
 align.length

406

In [0]:
align.hit_id

'gi|52000637|sp|Q8ETH0.1|HMP_OCEIH'

In [0]:
 align.hit_def

'RecName: Full=Flavohemoprotein; AltName: Full=Flavohemoglobin; AltName: Full=Hemoglobin-like protein; AltName: Full=Nitric oxide dioxygenase; Short=NO oxygenase; Short=NOD'

In [0]:
align.hsps

[<Bio.Blast.Record.HSP at 0x7f35ec3716d8>,
 <Bio.Blast.Record.HSP at 0x7f35ec371710>]

In [0]:
from Bio.Blast import NCBIXML
threshold = 0.0001
xmlfh = open('samples/other.xml')
blast_record = next(NCBIXML.parse(open(xmlfh)))
for align in blast_record.alignments:
    if align.hsps[0].expect < threshold:
        print(align.accession)    

TypeError: invalid file: <_io.TextIOWrapper name='samples/other.xml' mode='r' encoding='ANSI_X3.4-1968'>

In [0]:
from Bio.Data import IUPACData
IUPACData.ambiguous_dna_values['M']

'AC'

In [0]:
IUPACData.ambiguous_dna_values['H']

'ACT'

In [0]:
IUPACData.ambiguous_dna_values['X']

'GATC'

**Listing 9.9:** protwwbiopy.py: Protein weight calculator with Biopython

In [0]:
from Bio.Data.IUPACData import protein_weights as pw
protseq = input('Enter your protein sequence: ')
total_w = 0
for aa in protseq:
    total_w += pw.get(aa.upper(),0)
total_w -= 18*(len(protseq)-1)
print('The net weight is: {0}'.format(total_w))

Enter your protein sequence: ADVSMTGATATATAT
The net weight is: 1368.6815000000001


In [0]:
from Bio.Data.CodonTable import unambiguous_dna_by_id
bact_trans=unambiguous_dna_by_id[11]
bact_trans.forward_table['GTC']

'V'

In [0]:
 bact_trans.back_table['R']

'CGT'

In [0]:
from Bio.Data import CodonTable
print (CodonTable.generic_by_id[2])

Table 2 Vertebrate Mitochondrial, SGC1

  |  U      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
U | UUU F   | UCU S   | UAU Y   | UGU C   | U
U | UUC F   | UCC S   | UAC Y   | UGC C   | C
U | UUA L   | UCA S   | UAA Stop| UGA W   | A
U | UUG L   | UCG S   | UAG Stop| UGG W   | G
--+---------+---------+---------+---------+--
C | CUU L   | CCU P   | CAU H   | CGU R   | U
C | CUC L   | CCC P   | CAC H   | CGC R   | C
C | CUA L   | CCA P   | CAA Q   | CGA R   | A
C | CUG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | AUU I(s)| ACU T   | AAU N   | AGU S   | U
A | AUC I(s)| ACC T   | AAC N   | AGC S   | C
A | AUA M(s)| ACA T   | AAA K   | AGA Stop| A
A | AUG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GUU V   | GCU A   | GAU D   | GGU G   | U
G | GUC V   | GCC A   | GAC D   | GGC G   | C
G | GUA V   | GCA A   | GAA E   | GGA G   | A
G | GUG V(s)| GCG A   | GAG E   | GGG G   

eUtils: Retrieving Bibliography
-----------

**Listing 9.12:** entrez1.py: Retrieve and display data from PubMed

In [0]:
from Bio import Entrez
my_em = 'user@example.com'
db = "pubmed"
# Search de Entrez website using esearch from eUtils
# esearch returns a handle (called h_search)
h_search = Entrez.esearch(db=db, email=my_em,
                         term='python and bioinformatics')
# Parse the result with Entrez.read()
record = Entrez.read(h_search)
# Get the list of Ids returned by previous search
res_ids = record["IdList"]
# For each id in the list
for r_id in res_ids:
   # Get summary information for each id
   h_summ = Entrez.esummary(db=db, id=r_id, email=my_em)
   # Parse the result with Entrez.read()
   summ = Entrez.read(h_summ)
   print(summ[0]['Title'])
   print(summ[0]['DOI'])
   print('==============================================')

URLError: <urlopen error Tunnel connection failed: 403 Forbidden>

eUtils: Retrieving Gene Information
-------------------

**Listing 9.13:** entrez2.py: Retrieve and display data from PubMed

In [0]:
from Bio import Entrez
my_em = 'user@example.com'
db = "gene"
term = 'cobalamin synthase homo sapiens'
h_search = Entrez.esearch(db=db, email=my_em, term=term)
record = Entrez.read(h_search)
res_ids = record["IdList"]
for r_id in res_ids:
   h_summ = Entrez.esummary(db=db, id=r_id, email=my_em)
   s = Entrez.read(h_summ)
   print(r_id)
   name = s['DocumentSummarySet']['DocumentSummary'][0]['Name']
   print(name)
   su = s['DocumentSummarySet']['DocumentSummary'][0]['Summary']
   print(su)
   print('==============================================')

URLError: <urlopen error Tunnel connection failed: 403 Forbidden>

In [0]:
n = "nucleotide"
handle = Entrez.efetch(db=n, id="326625", rettype='fasta')
print (handle.read())

Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.


URLError: <urlopen error Tunnel connection failed: 403 Forbidden>

In [0]:
handle = Entrez.efetch(db=n, id="326625", retmode='xml')
record[0]['GBSeq_moltype']

Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.


URLError: <urlopen error Tunnel connection failed: 403 Forbidden>

In [0]:
 record[0]['GBSeq_sequence']

TypeError: 'Blast' object does not support indexing

In [0]:
 record[0]['GBSeq_organism']

NameError: name 'record' is not defined

In [0]:
from Bio.PDB.PDBParser import PDBParser
pdbfn = '../../samples/1FAT.pdb'
parser = PDBParser(PERMISSIVE=1)
structure = parser.get_structure("1fat", pdbfn)

FileNotFoundError: [Errno 2] No such file or directory: '../../samples/1FAT.pdb'

In [0]:
 structure.child_list

NameError: name 'structure' is not defined

In [0]:
model = structure[0]
model.child_list

NameError: name 'structure' is not defined

In [0]:
chain = model['B']
chain.child_list[:5]


NameError: name 'model' is not defined

In [0]:
residue = chain[4]
residue.child_list

NameError: name 'chain' is not defined

In [0]:
atom = residue['CB']
atom.bfactor

NameError: name 'residue' is not defined

**Listing 9.14:** pdb2.py: Parse a gzipped PDB file

In [0]:
import gzip
import io
from Bio.PDB.PDBParser import PDBParser

def disorder(structure):
   for chain in structure[0].get_list():
       for residue in chain.get_list():
           for atom in residue.get_list():
               if atom.is_disordered():
                   print(residue, atom)
   return None

pdbfn = 'samples/pdb1apk.ent.gz'
handle = gzip.GzipFile(pdbfn)
handle = io.StringIO(handle.read().decode('utf-8'))
parser = PDBParser()
structure = parser.get_structure('test', handle)
disorder(structure)

<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom P>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom O1P>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom O2P>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom O5*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom C5*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom C4*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom O4*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom C3*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom O3*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom C2*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom O2*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom C1*>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom N9>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom C8>
<Residue CMP het=H_CMP resseq=401 icode= > <Disordered Atom N7>
<Residue CMP het=H_CMP resseq=

PROSITE
------

In [0]:
from Bio import Prosite
handle = open("prosite.dat")
records = Prosite.parse(handle)
for r in records:
    print(r.accession)
    print(r.name)
    print(r.description)
    print(r.pattern)
    print(r.created)
    print(r.pdoc)
    print("===================================")

ImportError: cannot import name 'Prosite'

In [0]:
from Bio import Restriction
Restriction.EcoRI

AttributeError: type object 'RestrictionType' has no attribute 'size'

AttributeError: type object 'RestrictionType' has no attribute 'size'

TypeError: 'NoneType' object is not iterable

In [0]:
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA
alfa = IUPACAmbiguousDNA()
gi1942535 = Seq('CGCGAATTCGCG', alfa
Restriction.EcoRI.search(gi1942535)

SyntaxError: invalid syntax (<ipython-input-66-0cf64976e820>, line 5)

In [0]:
Restriction.EcoRI.catalyse(gi1942535)

NameError: name 'gi1942535' is not defined

In [0]:
enz1 = Restriction.EcoRI
enz2 = Restriction.HindIII
batch1 = Restriction.RestrictionBatch([enz1, enz2])
batch1.search(gi1942535)

NameError: name 'gi1942535' is not defined

In [0]:
dd = batch1.search(gi1942535)
dd.get(Restriction.EcoRI)

NameError: name 'gi1942535' is not defined

In [0]:
 dd.get(Restriction.HindIII)

NameError: name 'dd' is not defined

In [0]:
batch1.add(Restriction.EarI)
batch1

RestrictionBatch(['EarI', 'EcoRI', 'HindIII'])

In [0]:
batch1.remove(Restriction.EarI)
batch1

RestrictionBatch(['EcoRI', 'HindIII'])

In [0]:
batch2 = Restriction.CommOnly

In [0]:
an1 = Restriction.Analysis(batch1,gi1942535)
an1.full()    

NameError: name 'gi1942535' is not defined

In [0]:
an1.print_that()

NameError: name 'an1' is not defined

In [0]:
an1.print_as('map')
an1.print_that()

NameError: name 'an1' is not defined

In [0]:
an1.only_between(1,8)

NameError: name 'an1' is not defined

DNA Utils
--------

In [0]:
from Bio.SeqUtils import GC
GC('gacgatcggtattcgtag')    

50.0

In [0]:
from Bio.SeqUtils import MeltingTemp
MeltingTemp.Tm_staluc('tgcagtacgtatcgt')

42.211472744873504

In [0]:
 print('%.2f'%MeltingTemp.Tm_staluc('tgcagtacgtatcgt'))

42.21


In [0]:
from Bio.SeqUtils import CheckSum
myseq = 'acaagatgccattgtcccccggcctcctgctgctgct'
CheckSum.gcg(myseq)

1149

In [0]:
CheckSum.crc32(myseq)

2188528553

In [0]:
CheckSum.crc64(myseq)

'CRC-A2CFDBE6AB3F7CFF'

In [0]:
CheckSum.seguid(myseq)

'9V7Kf19tfPA5TntEP75YiZEm/9U'

Protein Utils
------------

**Listing 9.15:** protparam.py: Apply PropParam functions to a group of proteins

In [0]:
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.SeqUtils import ProtParamData
from Bio import SeqIO
with open('samples/pdbaa') as fh:
   for rec in SeqIO.parse(fh,'fasta'):
       myprot = ProteinAnalysis(str(rec.seq))
       print(myprot.count_amino_acids())
       print(myprot.get_amino_acids_percent())
       print(myprot.molecular_weight())
       print(myprot.aromaticity())
       print(myprot.instability_index())
       print(myprot.flexibility())
       print(myprot.isoelectric_point())
       print(myprot.secondary_structure_fraction())
       print(myprot.protein_scale(ProtParamData.kd, 9, .4))

{'F': 9, 'L': 14, 'E': 8, 'P': 10, 'W': 0, 'T': 8, 'S': 10, 'C': 4, 'M': 2, 'R': 3, 'D': 10, 'V': 5, 'K': 5, 'I': 8, 'G': 17, 'Q': 6, 'A': 21, 'N': 7, 'H': 2, 'Y': 6}
{'F': 0.05806451612903226, 'L': 0.09032258064516129, 'E': 0.05161290322580645, 'P': 0.06451612903225806, 'W': 0.0, 'T': 0.05161290322580645, 'S': 0.06451612903225806, 'C': 0.025806451612903226, 'A': 0.13548387096774195, 'R': 0.01935483870967742, 'V': 0.03225806451612903, 'K': 0.03225806451612903, 'D': 0.06451612903225806, 'G': 0.10967741935483871, 'I': 0.05161290322580645, 'M': 0.012903225806451613, 'Q': 0.03870967741935484, 'N': 0.04516129032258064, 'Y': 0.03870967741935484, 'H': 0.012903225806451613}
16229.94529999999
0.0967741935483871
24.29296774193547
[0.9806904761904762, 0.9747380952380953, 0.9736309523809523, 0.9731666666666666, 0.9723214285714286, 0.9627857142857142, 0.9687261904761906, 0.9978095238095237, 0.9725357142857143, 0.9962976190476192, 0.9756785714285715, 0.9804047619047621, 0.9990357142857142, 0.9922738

**Listing 9.16:** phd1.py: Extract data from a .phd.1 file

In [0]:
import pprint
from Bio.Sequencing import Phd
fn = 'samples/phd1'
with open(fn) as fh:
    rp = Phd.read(fh)
# All the comments are in a dictionary
pprint.pprint(rp.comments)
# Sequence information
print('Sequence: %s' % rp.seq)
# Quality information for each base
print('Quality: %s' % rp.sites)

{'abi_thumbprint': 0,
 'call_method': 'phred',
 'chem': 'term',
 'chromat_file': '34_222_(80-A03-19).b.ab1',
 'dye': 'big',
 'phred_version': '0.020425.c',
 'quality_levels': 99,
 'time': 'Fri Feb 13 09:16:11 2004',
 'trace_array_max_index': 10867,
 'trace_array_min_index': 0,
 'trace_peak_area_ratio': 0.1467,
 'trim': (3, 391, 0.05)}
Sequence: ctccgtcggaacatcatcggatcctatcacagagtttttgaacgagttctcgatgatttctcttgaggaaacagtcccatcatcggtgccaacagcaaccaaagtgttcgttaacgggcaatgggttggtattcacaagaacccggcggatctcgttgacactcttcgaagcttgcgtcgaacgactgatgttactccggaactgtcagttgtgcgtgatgtgtctgacaaggaattgcgcttgtacacggatggcggtcgtatagcccgaccactcttcgtcgttaacgaggagcagcgactcgcacttaagcgcgatatgttggagcggctcaatcctgacccggaaactggagagcgcatctcttactggaatgatcttatttcggagggtatgggttgagtacctggacactgaggaagacgtaactgtcctgatcgccttggtcgccgtcctattatagcgctttccagtaaggcggtgaacaccgttgactactctggctcacgtaacttatcttcacattccatattttctgtccactactgccttacaaccctgatgtctgcgctcgtgcgtgcagattgccctccagaaattcgtctttgtcgttcgtagttaggtggtttgtacattccctcatctttaatac

In [0]:
from Bio import SeqIO
fn = '../../samples/phd1'
fh = open(fn)
seqs = SeqIO.parse(fh,'phd')
seqs = SeqIO.parse(fh,'phd')
for s in seqs:
    print(s.seq)

FileNotFoundError: [Errno 2] No such file or directory: '../../samples/phd1'

In [0]:
from Bio.Sequencing import Ace
fn='836CLEAN-100.fasta.cap.ace'
acefilerecord=Ace.read(open(fn))
acefilerecord.ncontigs

FileNotFoundError: [Errno 2] No such file or directory: '836CLEAN-100.fasta.cap.ace'

In [0]:
acefilerecord.nreads

NameError: name 'acefilerecord' is not defined

In [0]:
acefilerecord.wa[0].info

NameError: name 'acefilerecord' is not defined

In [0]:
acefilerecord.wa[0].date

NameError: name 'acefilerecord' is not defined

**Listing 9.17:** ace.py: Retrieve data from an “.ace” file

In [0]:
from Bio.Sequencing import Ace

fn = 'samples/contig1.ace'
acefilerecord = Ace.read(open(fn))

# For each contig:
for ctg in acefilerecord.contigs:
    print('==========================================')
    print('Contig name: %s'%ctg.name)
    print('Bases: %s'%ctg.nbases)
    print('Reads: %s'%ctg.nreads)
    print('Segments: %s'%ctg.nsegments)
    print('Sequence: %s'%ctg.sequence)
    print('Quality: %s'%ctg.quality)
    # For each read in contig:
    for read in ctg.reads:
        print('Read name: %s'%read.rd.name)
        print('Align start: %s'%read.qa.align_clipping_start)
        print('Align end: %s'%read.qa.align_clipping_end)
        print('Qual start: %s'%read.qa.qual_clipping_start)
        print('Qual end: %s'%read.qa.qual_clipping_end)
        print('Read sequence: %s'%read.rd.sequence)
        print('==========================================')

Contig name: Contig1
Bases: 856
Reads: 2
Segments: 31
Sequence: aatacgGGATTGCCCTAGTAACGGCGAGTGAAGCGGCAACAGCTCAAATTTGAAATCTGGCCCCCCGGCCCGAGTTGTAATTTGTAGAGG*ATGCTTCTGGGTAGCGACCGGTCTAAGTTCCTCGGAACAGGACGTCATAGAGGGTGAGAATCCCGTATGCGACCGGCCCGCGCCCTCCACGTAGCTCCTTCGACGAGTCGAGTTGTTTGGGAATGCAGCTCTAAATGGGAGGTAAATTTCTTCTAAAGCTAAATACCGGCCAGAGACCGATAGCGCACAAGTAGAGTGATCGAAAGATGAAAAGCACTTTGGAAAGAGAGTTAAAAAGCACGTGAAATTGTTGAAAGGGAAGCGCTTACAACCAGACTTTGGGGCGGTGTTCCGCCGGTCTTCTGACCGGTCCACTCGCCGTCCCGAGGCCAACATCATCTGGGACCGCCGGACAAGACCTCAGGAATGTGGCTCCCCCTCGGGGGAGTGTTATAGCCTGTGGTGATGCGGCGCGTCCCGGGTGAGGTCCGCGCTTCGGCAAGGATGTTGGCGTAATGGTTGTCAGCGGCCCGTCTTGAAACACGGACCAAGGAGTCTAACATCTATGCGAGTGTTCGGGTGTCAAACCCCTACGCGGAATGAAAGTGAACGGAGGTGGGAAGGGTAACCTGCACCATCGACCGATCCTGATGTCCTCGGATGGATTTGAGTAAGAGCATAGCTGTTGGGACCCGAAAGATGGTGAACTATGCCTGAATAGGGTGAAGCCAGAGGAAACTCTGGTGGAGGCTCGCAGCGGTTCTGACGTGCAAATCGATCGTCAAATTTGGGTATAGGGGCGAAAGActatcgaAcATCTAGtac
Quality: [0, 0, 0, 0, 0, 0, 22, 23, 25, 28, 34, 47, 61, 46, 39, 34, 34, 30, 30,

**Listing 9.18:** Retrieve data from a SwissProt file

In [0]:
from Bio import SwissProt
with open('samples/spfile.txt') as fh:
        records = SwissProt.parse(fh)
        for record in records:
            print('Entry name: %s' % record.entry_name)
            print('Accession(s): %s' % ','.join(record.accessions))
            print('Keywords: %s' % ','.join(record.keywords))
            print('Sequence: %s' % record.sequence)

Entry name: 6PGL_ECOLC
Accession(s): B1IXL9
Keywords: Acetylation,Carbohydrate metabolism,Glucose metabolism,Hydrolase
Sequence: MKQTVYIASPESQQIHVWNLNHEGALTLTQVVDVPGQVQPMVVSPDKRYLYVGVRPEFRVLAYRIAPDDGALTFAAESALPGSPTHISTDHQGQFVFVGSYNAGNVSVTRLEDGLPVGVVDVVEGLDGCHSANISPDNRTLWVPALKQDRICLFTVSDDGHLVAQDPAEVTTVEGAGPRHMVFHPNEQYAYCVNELNSSVDVWELKDPHGNIECVQTLDMMPENFSDTRWAADIHITPDGRHLYACDRTASLITVFSVSEDGSVLSKEGFQPTETQPRGFNVDHSGKYLIAAGQKSHHISVYEIVGEQGLLHEKGRYAVGQGPMWVVVNAH


**Listing 9.19:** Attributes of a SwissProt record

In [0]:
from Bio import SwissProt
with open('samples/spfile.txt') as fh:
    record = next(SwissProt.parse(fh))
    for att in dir(record):
        if not att.startswith('__'):
                print(att, getattr(record,att))

accessions ['B1IXL9']
annotation_update ('05-JUL-2017', 0)
comments ['FUNCTION: Catalyzes the hydrolysis of 6-phosphogluconolactone to 6-phosphogluconate. {ECO:0000255|HAMAP-Rule:MF_01605}.', 'CATALYTIC ACTIVITY: 6-phospho-D-glucono-1,5-lactone + H(2)O = 6- phospho-D-gluconate. {ECO:0000255|HAMAP-Rule:MF_01605}.', 'PATHWAY: Carbohydrate degradation; pentose phosphate pathway; D- ribulose 5-phosphate from D-glucose 6-phosphate (oxidative stage): step 2/3. {ECO:0000255|HAMAP-Rule:MF_01605}.', 'SIMILARITY: Belongs to the cycloisomerase 2 family. {ECO:0000255|HAMAP-Rule:MF_01605}.']
created ('20-MAY-2008', 0)
cross_references [('EMBL', 'CP000946', 'ACA78522.1', '-', 'Genomic_DNA'), ('RefSeq', 'WP_000815435.1', 'NC_010468.1'), ('ProteinModelPortal', 'B1IXL9', '-'), ('SMR', 'B1IXL9', '-'), ('EnsemblBacteria', 'ACA78522', 'ACA78522', 'EcolC_2895'), ('KEGG', 'ecl:EcolC_2895', '-'), ('eggNOG', 'ENOG4105HHQ', 'Bacteria'), ('eggNOG', 'COG2706', 'LUCA'), ('HOGENOM', 'HOG000257418', '-'), ('KO', 'K

VER DE ACA PARA ABAJO SI ESTO ESTA BIEN O HABRIA QUE PONERLO ARRIBA!!!!
-------------------

*How to download a file*

In [0]:
!curl https://s3.amazonaws.com/py4bio/cas9align.fasta -o cas9align.fasta

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11532  100 11532    0     0   2754      0  0:00:04  0:00:04 --:--:--  2755


In [0]:
with open('cas9align.fasta') as f_in:
    print(f_in.read())

>J7M7J1
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAE
ATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFG
NIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSD
VDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGN
LIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAKNLSDAI
LLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYA
GYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHLGELH
AILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEE
VVDKGASAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFL
SGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEISGV---EDRFNASLGTYHDL
LKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTYAHLFDDKVMKQLKRRRYT
GWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRNFMQLIHDDSLTFKEDIQKAQVSGQG
DSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVIEMARENQTTQKGQKNS
RERMKRIEEGIKEL----GSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRL
SDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLIT
QRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREV
KVITLKSKLVSDFRKD

In [0]:
print(dir(align))

['__add__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_alphabet', '_append', '_records', '_str_line', 'add_sequence', 'annotations', 'append', 'extend', 'format', 'get_alignment_length', 'get_all_seqs', 'get_column', 'get_seq_by_num', 'sort']


In [0]:
from Bio import AlignIO
AlignIO.write(align, 'cas9align.phy', 'phylip')

1

In [0]:
from Bio.Alphabet import ProteinAlphabet

In [0]:
align._alphabet = ProteinAlphabet()

In [0]:
print(align)

ProteinAlphabet() alignment with 8 rows and 1407 columns
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD J7M7J1
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A0C6FZC2
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A1C2CVQ9
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD A0A1C2CV43
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD Q48TU5
MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIK...GGD M4YX12
MKKPYSIGLDIGTNSVGWAVVTDDYKVPAKKMKVLGNTDKSHIK...GGD A0A0E2EP65
--------------------------------------------...GED A0A150NVN1


In [0]:
print(align[3:,:5])

SingleLetterAlphabet() alignment with 5 rows and 5 columns
MDKKY A0A1C2CV43
MDKKY Q48TU5
MDKKY M4YX12
MKKPY A0A0E2EP65
----- A0A150NVN1


In [0]:
with open('cas9align.aln') as f_in:
    print(f_in.read())

CLUSTAL X (1.81) multiple sequence alignment


J7M7J1                              MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGA
A0A0C6FZC2                          MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGA
A0A1C2CVQ9                          MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGA
A0A1C2CV43                          MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGA
Q48TU5                              MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGA
M4YX12                              MDKKYSIGLDIGTNSVGWAVITDDYKVPSKKFKVLGNTDRHSIKKNLIGA
A0A0E2EP65                          MKKPYSIGLDIGTNSVGWAVVTDDYKVPAKKMKVLGNTDKSHIKKNLLGA
A0A150NVN1                          --------------------------------------------------

J7M7J1                              LLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHR
A0A0C6FZC2                          LLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHR
A0A1C2CVQ9                          LLFDSGETAEATRLKRTARRRYTRRKNRIRYLQEIFSSEMSKVDDS

In [0]:
from Bio.Align.AlignInfo import print_info_content
print_info_content(align)

AttributeError: 'MultipleSeqAlignment' object has no attribute 'ic_vector'

In [0]:
summary.pos_specific_score_matrix()

ValueError: Alignment contains a sequence with                                 an incompatible alphabet.

In [0]:
from Bio import Alphabet 
for record in align:
    print(Alphabet._get_base_alphabet(record.seq.alphabet))

SingleLetterAlphabet()
SingleLetterAlphabet()
SingleLetterAlphabet()
SingleLetterAlphabet()
SingleLetterAlphabet()
SingleLetterAlphabet()
SingleLetterAlphabet()
SingleLetterAlphabet()


In [0]:
!curl https://raw.githubusercontent.com/Serulab/Py4Bio/master/cas9align.fasta -o archivo2.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    15  100    15    0     0      2      0  0:00:07  0:00:05  0:00:02     3


In [0]:
with open('archivo2.txt') as fh:
    print(fh.read())

404: Not Found



In [0]:
# http://biopython.org/DIST/docs/api/Bio.Align.Applications._ClustalOmega.ClustalOmegaCommandline-class.html
