In [88]:
import importlib

importlib.reload(toolbox.fasta)
importlib.reload(toolbox.hamming)
importlib.reload(toolbox.alignment)
importlib.reload(toolbox.pdb)

<module 'toolbox.pdb' from '/Users/eugen/code/bioinformatics-toolbox/toolbox/pdb.py'>

In [2]:
from os.path import join

DATA_FOLDER = "sample_data"

## FASTA

We can load a FASTA file like this:

In [33]:
import toolbox.fasta

sequences = toolbox.fasta.parse(join(DATA_FOLDER, "test.fasta"))
sequences = list(sequences)
seq = sequences[0]
seq

Sequence(description='sp|P83673|LYS1_CRAVI Lysozyme 1 OS=Crassostrea virginica OX=6565 GN=lysoz1 PE=1 SV=3', name='sp|P83673|LYS1_CRAVI', sequence='MNGLILFCAVVFATAVCTYGSDAPCLRAGGRCQHDSITCSGRYRTGLCSGGVRRRCCVPSSSNSGSFSTGMVSQQCLRCICNVESGCRPIGCHWDVNSDSCGYFQIKRAYWIDCGSPGGDWQTCANNLACSSRCVQAYMARYHRRSGCSNSCESFARIHNGGPRGCRNSNTEGYWRRVQAQGCN')

We can access the name of the sequence...


In [34]:
seq.name

'sp|P83673|LYS1_CRAVI'

...its description...

In [35]:
seq.description

'sp|P83673|LYS1_CRAVI Lysozyme 1 OS=Crassostrea virginica OX=6565 GN=lysoz1 PE=1 SV=3'

...its content...

In [37]:
seq.sequence

'MNGLILFCAVVFATAVCTYGSDAPCLRAGGRCQHDSITCSGRYRTGLCSGGVRRRCCVPSSSNSGSFSTGMVSQQCLRCICNVESGCRPIGCHWDVNSDSCGYFQIKRAYWIDCGSPGGDWQTCANNLACSSRCVQAYMARYHRRSGCSNSCESFARIHNGGPRGCRNSNTEGYWRRVQAQGCN'

...and its length.

In [38]:
len(seq)

184

A `Sequence` also supports iteration and indexing (including slices), so we can loop through it and access the residues by position.

In [40]:
for r in seq[10:15]:
    print(r.upper())

V
F
A
T
A
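To make the behaviour above concrete, here is a minimal sketch of how such a parser could work. It is not the code in `toolbox.fasta`: the names `FastaRecord` and `parse_fasta` are hypothetical, and the rule that the name is the first whitespace-delimited token of the description is an assumption inferred from the output shown earlier.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FastaRecord:
    """Hypothetical stand-in for the toolbox's Sequence class."""

    description: str
    sequence: str

    @property
    def name(self) -> str:
        # Assumption: the name is the first whitespace-delimited token.
        return self.description.split(maxsplit=1)[0] if self.description else ""

    def __len__(self) -> int:
        return len(self.sequence)

    def __getitem__(self, index):
        # Delegating to str gives integer indexing, slicing and iteration.
        return self.sequence[index]


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one record per '>' header found in an iterable of lines."""
    description = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if description is not None:
                yield FastaRecord(description, "".join(chunks))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)  # sequence data may span several lines
    if description is not None:
        yield FastaRecord(description, "".join(chunks))


# Toy input; an open file handle works the same way.
records = list(parse_fasta([">sp|P1|DEMO toy protein", "MNGL", "ILFC"]))
```

Because the parser only needs an iterable of lines, it can be fed a list in a quick check or an open file handle in real use.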


## Hamming distance

Let's try the Hamming distance on simple strings first, so that we can check it's working correctly.

In [7]:
import toolbox.hamming

for a, b in [("ABCD", "CBDD"), ("HOKSZA", "HOKZSA"), ("WYBITUL", "WYTUBIL")]:
    print(f"The distance between {a} and {b} is {toolbox.hamming.distance(a, b)}")

The distance between ABCD and CBDD is 2
The distance between HOKSZA and HOKZSA is 2
The distance between WYBITUL and WYTUBIL is 4


Now, let's use the sequences we loaded above (notice how the function works for strings as well as sequences). First, the distance from a sequence to itself should be 0...

In [8]:
toolbox.hamming.distance(seq, seq)

0

Second, we can only measure distances between sequences of equal length. Let's pull up the other sequence from the FASTA file and compute its distance to the first one.

In [41]:
seq2 = sequences[1]
toolbox.hamming.distance(seq, seq2)

ValueError: The sequences should have identical lengths, but the lengths are 184, 607
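The behaviour in this section boils down to counting mismatched positions after a length check. The sketch below is one possible way to write it, not necessarily how `toolbox.hamming.distance` is implemented; the function name `hamming_distance` is hypothetical, and the error message simply mirrors the one shown above.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError(
            "The sequences should have identical lengths, "
            f"but the lengths are {len(a)}, {len(b)}"
        )
    # zip pairs up items position by position; True counts as 1 in sum().
    return sum(x != y for x, y in zip(a, b))


pairs = [("ABCD", "CBDD"), ("HOKSZA", "HOKZSA"), ("WYBITUL", "WYTUBIL")]
distances = [hamming_distance(a, b) for a, b in pairs]  # [2, 2, 4]
```

Anything that supports `len()` and iteration works here, which is why strings and sequence objects can be treated alike.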

## Alignment

The `align` function returns a dictionary with the Levenshtein edit distance and the optimal alignment(s).

Let's start with simple strings again...

In [51]:
import toolbox.alignment

A = "clock"
B = "lacks"

res = toolbox.alignment.align(A, B)
res["distance"]

3

...and to view the alignment(s):

In [52]:
for s1, s2 in res["alignments"]:
    print("".join(s1))
    print("".join(s2))

clock-
-lacks


If there are several optimal alignments, we get all of them.

In [54]:
res = toolbox.alignment.align("ABBBD", "ABVD")
for s1, s2 in res["alignments"]:
    print("".join(s1))
    print("".join(s2))
    print()

ABBBD
ABV-D

ABBBBBD
AB-VV-D



And it works for `Sequence` objects, too! Let's switch off the backtracking, though, to save some time.



In [55]:
toolbox.alignment.align(seq, seq2, skip_backtrack=True)

{'distance': 511}
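When only the distance is needed, the dynamic-programming table never has to be kept in full, which is roughly what skipping the backtracking buys. The following is a minimal sketch of the distance-only computation with unit costs for insertion, deletion and substitution. The function name `levenshtein` is hypothetical, and this is not claimed to be how `toolbox.alignment.align` works internally.

```python
def levenshtein(a, b):
    """Edit distance with unit costs, keeping only two rows of the DP table."""
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, y in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,             # delete x from a
                    current[j - 1] + 1,          # insert y into a
                    previous[j - 1] + (x != y),  # match or substitute
                )
            )
        previous = current
    return previous[-1]


distance = levenshtein("clock", "lacks")  # 3, matching the example above
```

Memory use is proportional to the length of the second sequence, while the running time remains proportional to the product of both lengths.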

## PDB Parser

Next up is a parser for PDB files. The code is just a thin layer over `Bio.PDB`, and we can work directly with its objects to obtain the whole model...

In [89]:
import toolbox.pdb

bsa = toolbox.pdb.Structure("3v03", join(DATA_FOLDER, "pdb3v03.pdb"))

bsa.structure



<Structure id=3v03>

...or all of the chains...

In [90]:
list(bsa.structure.get_chains())

[<Chain id=A>, <Chain id=B>]

...or all of its residues...

In [91]:
list(bsa.structure.get_residues())[:10] # truncated

[<Residue HIS het=  resseq=3 icode= >,
 <Residue LYS het=  resseq=4 icode= >,
 <Residue SER het=  resseq=5 icode= >,
 <Residue GLU het=  resseq=6 icode= >,
 <Residue ILE het=  resseq=7 icode= >,
 <Residue ALA het=  resseq=8 icode= >,
 <Residue HIS het=  resseq=9 icode= >,
 <Residue ARG het=  resseq=10 icode= >,
 <Residue PHE het=  resseq=11 icode= >,
 <Residue LYS het=  resseq=12 icode= >]

...or finally, all of its atoms.

In [92]:
list(bsa.structure.get_atoms())[:10] # truncated

[<Atom N>,
 <Atom CA>,
 <Atom C>,
 <Atom O>,
 <Atom CB>,
 <Atom N>,
 <Atom CA>,
 <Atom C>,
 <Atom O>,
 <Atom CB>]

We can get a general overview of the structure by looking at its `summary`.

In [93]:
bsa.summary

{'models:': 1, 'chains': 2, 'residues': 1211, 'atoms': 8810}

We can also compute its spatial width.

In [94]:
bsa.width

146.49719889539784
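The notebook does not show how `width` is defined. One plausible reading is the largest distance between any two atoms, and the sketch below computes that for a plain list of coordinates. Both the definition and the function name `max_pairwise_distance` are assumptions for illustration, and the brute-force loop is quadratic in the number of atoms.

```python
from itertools import combinations
from math import dist


def max_pairwise_distance(coords):
    """Largest Euclidean distance between any two points (0.0 if fewer than two)."""
    coords = list(coords)
    # Brute force over all unordered pairs: simple, but O(n^2).
    return max((dist(p, q) for p, q in combinations(coords, 2)), default=0.0)


toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
toy_width = max_pairwise_distance(toy_coords)  # 5.0
```

For a structure with several thousand atoms this means tens of millions of pairs, so a real implementation would probably vectorise the computation or restrict it to a subset of points.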

Last but not least, we can show the neighbours of a HETATM ligand in the structure (or any atom at all, in fact). 
Either the neighbouring residues...

In [97]:
atom = bsa.structure[0]['A'][10]['CB']
bsa.search_around(atom, radius=4, level="R")

[<Residue PHE het=  resseq=11 icode= >,
 <Residue GLU het=  resseq=6 icode= >,
 <Residue ILE het=  resseq=7 icode= >,
 <Residue HIS het=  resseq=9 icode= >,
 <Residue ARG het=  resseq=10 icode= >]

...or neighbouring atoms...

In [99]:
bsa.search_around(atom, radius=4, level="A")

[<Atom NE>,
 <Atom O>,
 <Atom CD>,
 <Atom N>,
 <Atom CA>,
 <Atom CG>,
 <Atom CB>,
 <Atom O>,
 <Atom C>,
 <Atom C>,
 <Atom O>,
 <Atom N>]

...or alternatively neighbouring chains or models, by passing `"C"` or `"M"` as `level`, respectively. That doesn't seem particularly useful to me, but it's possible nonetheless.
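The idea behind the `level` argument can be illustrated without any PDB machinery: find the atoms within the radius first, then report either the atoms themselves or the distinct residues they belong to. This is a brute-force sketch on toy data, not the toolbox's implementation; the function name `neighbours_within`, the tuple layout and the inclusive radius check are all assumptions.

```python
from math import dist


def neighbours_within(center, atoms, radius, level="A"):
    """Return atoms ("A") or their distinct residues ("R") within radius of center.

    atoms is an iterable of (residue_id, atom_name, (x, y, z)) tuples.
    """
    hits = [(res, name) for res, name, coord in atoms if dist(center, coord) <= radius]
    if level == "A":
        return hits
    if level == "R":
        # dict.fromkeys removes duplicates while preserving first-seen order.
        return list(dict.fromkeys(res for res, _ in hits))
    raise ValueError(f"Unsupported level: {level!r}")


toy_atoms = [
    ("ARG10", "CB", (0.0, 0.0, 0.0)),
    ("ARG10", "CA", (1.5, 0.0, 0.0)),
    ("PHE11", "N", (0.0, 3.0, 0.0)),
    ("LYS12", "CA", (9.0, 9.0, 9.0)),
]
near_atoms = neighbours_within((0.0, 0.0, 0.0), toy_atoms, radius=4.0, level="A")
near_residues = neighbours_within((0.0, 0.0, 0.0), toy_atoms, radius=4.0, level="R")
```

As in the outputs above, the query atom and its own residue appear among the results, since they trivially lie within the radius.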