In [8]:
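import importlib

import toolbox.alignment
import toolbox.fasta
import toolbox.hamming

# Reload the toolbox modules so that local edits are picked up without restarting the kernel.
importlib.reload(toolbox.fasta)
importlib.reload(toolbox.hamming)
importlib.reload(toolbox.alignment)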

<module 'toolbox.alignment' from '/Users/eugen/code/bioinformatics-toolbox/toolbox/alignment.py'>

## FASTA

We can load a FASTA file like this:

In [1]:
import toolbox.fasta

sequences = toolbox.fasta.parse("../sample_data/test.fasta")
sequences = list(sequences)
sequences

[Sequence(description='sp|P83673|LYS1_CRAVI Lysozyme 1 OS=Crassostrea virginica OX=6565 GN=lysoz1 PE=1 SV=3', name='sp|P83673|LYS1_CRAVI', sequence='MNGLILFCAVVFATAVCTYGSDAPCLRAGGRCQHDSITCSGRYRTGLCSGGVRRRCCVPSSSNSGSFSTGMVSQQCLRCICNVESGCRPIGCHWDVNSDSCGYFQIKRAYWIDCGSPGGDWQTCANNLACSSRCVQAYMARYHRRSGCSNSCESFARIHNGGPRGCRNSNTEGYWRRVQAQGCN'),
 Sequence(description='sp|P02769|ALBU_BOVIN Albumin OS=Bos taurus OX=9913 GN=ALB PE=1 SV=4', name='sp|P02769|ALBU_BOVIN', sequence='MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFT

`parse` returns a generator, which we converted to a list above. Let's pull out the first sequence to take a closer look at it.

In [2]:
seq = sequences[0]
seq


Sequence(description='sp|P83673|LYS1_CRAVI Lysozyme 1 OS=Crassostrea virginica OX=6565 GN=lysoz1 PE=1 SV=3', name='sp|P83673|LYS1_CRAVI', sequence='MNGLILFCAVVFATAVCTYGSDAPCLRAGGRCQHDSITCSGRYRTGLCSGGVRRRCCVPSSSNSGSFSTGMVSQQCLRCICNVESGCRPIGCHWDVNSDSCGYFQIKRAYWIDCGSPGGDWQTCANNLACSSRCVQAYMARYHRRSGCSNSCESFARIHNGGPRGCRNSNTEGYWRRVQAQGCN')

We can access the name of the sequence, its description, its content and its length.


In [3]:
f"The length of sequence {seq.description} is {len(seq)}"

'The length of sequence sp|P83673|LYS1_CRAVI Lysozyme 1 OS=Crassostrea virginica OX=6565 GN=lysoz1 PE=1 SV=3 is 184'
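
For readers curious about what happens under the hood, here is a minimal sketch of a generator-based FASTA parser. It is not the `toolbox.fasta` implementation: `FastaRecord` and `parse_fasta` are hypothetical names, and the sketch assumes that the name is the first whitespace-delimited token of the header line, which matches the output shown above.

```python
import io
from dataclasses import dataclass
from typing import Iterator, List, Optional, TextIO


@dataclass(frozen=True)
class FastaRecord:
    """Hypothetical stand-in for the toolbox's Sequence type."""

    description: str
    name: str
    sequence: str

    def __len__(self) -> int:
        # The length of a record is the length of its residue string.
        return len(self.sequence)


def _build(description: str, chunks: List[str]) -> FastaRecord:
    # Assumption: the name is the first whitespace-delimited token of the header.
    parts = description.split(maxsplit=1)
    name = parts[0] if parts else ""
    return FastaRecord(description, name, "".join(chunks))


def parse_fasta(handle: TextIO) -> Iterator[FastaRecord]:
    """Lazily yield one record per '>' header found in the handle."""
    description: Optional[str] = None
    chunks: List[str] = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if description is not None:
                yield _build(description, chunks)
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)  # sequence data may span several lines
    if description is not None:
        yield _build(description, chunks)


demo = io.StringIO(">sp|X1|DEMO_A first demo record\nMNGL\nILF\n>sp|X2|DEMO_B second\nMKWV\n")
records = list(parse_fasta(demo))
print(records[0].name, len(records[0]))  # sp|X1|DEMO_A 7
```

Because the parser is a generator, records are only read as they are consumed, which is why the notebook wraps the result in `list(...)` before indexing.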

## Hamming distance

Let's try the Hamming distance on mock strings first, so that we can check that it works correctly.

In [4]:
import toolbox.hamming

for a, b in [("ABCD", "CBDD"), ("HOKSZA", "HOKZSA"), ("WYBITUL", "WYTUBIL")]:
    print(f"The distance between {a} and {b} is {toolbox.hamming.distance(a, b)}")

The distance between ABCD and CBDD is 2
The distance between HOKSZA and HOKZSA is 2
The distance between WYBITUL and WYTUBIL is 4


Now, let's use the sequences we loaded above. The distance of a sequence to itself should be 0:

In [5]:
toolbox.hamming.distance(seq, seq)

0

The Hamming distance is only defined for sequences of equal length. Let's take the other sequence from the FASTA file and try to compute its distance to the first one. Since the lengths differ, we expect an error.

In [6]:
seq2 = sequences[1]
toolbox.hamming.distance(seq, seq2)

ValueError: The sequences should have identical lengths, but the lengths are 184, 607
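
The behaviour above can be reproduced in a few lines. This is a sketch rather than the code in `toolbox.hamming`: `hamming_distance` is a hypothetical helper that assumes both inputs support `len()` and iteration, and its error message simply mirrors the one shown above.

```python
def hamming_distance(a, b) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError(
            "The sequences should have identical lengths, "
            f"but the lengths are {len(a)}, {len(b)}"
        )
    # Each mismatching pair contributes True (1) to the sum.
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("ABCD", "CBDD"))        # 2
print(hamming_distance("WYBITUL", "WYTUBIL"))  # 4
```

Checking the lengths up front matters because `zip` silently stops at the shorter input and would otherwise return a misleading distance.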

## Alignment

The `align` function returns a tuple: the Levenshtein edit distance and the optimal alignments. Each alignment is a pair of gapped sequences, with `-` marking a gap.

So, for strings:

In [16]:
import toolbox.alignment

A = "xabj"
B = "x"

length, alignments = toolbox.alignment.align(A, B)
length

3

...and to view the alignment(s):

In [19]:
for s1, s2 in alignments:
    print("".join(s1))
    print("".join(s2))

xabj
x---
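
To illustrate how an edit distance and an alignment can be obtained together, here is a minimal sketch of the classic dynamic-programming approach with a traceback. It is not the `toolbox.alignment` code: `align_once` is a hypothetical helper, it assumes unit costs for substitutions, insertions and deletions, and it returns only one optimal alignment rather than all of them.

```python
from typing import List, Tuple


def align_once(a: str, b: str, gap: str = "-") -> Tuple[int, str, str]:
    """Return the Levenshtein distance and one optimal alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] is the edit distance between a[:i] and b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or substitute
                dist[i - 1][j] + 1,                           # gap in b
                dist[i][j - 1] + 1,                           # gap in a
            )

    # Walk back from the bottom-right corner, preferring diagonal moves.
    top: List[str] = []
    bottom: List[str] = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            top.append(a[i - 1])
            bottom.append(gap)
            i -= 1
        else:
            top.append(gap)
            bottom.append(b[j - 1])
            j -= 1
    return dist[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


distance, s1, s2 = align_once("xabj", "x")
print(distance)  # 3
print(s1)        # xabj
print(s2)        # x---
```

The table needs `len(a) * len(b)` cells, so this approach gets expensive for long sequences in both time and memory.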


If there is more than one optimal alignment, we can view all of them.

In [21]:
length, alignments = toolbox.alignment.align("ABBBD", "ABVD")
for s1, s2 in alignments:
    print("".join(s1))
    print("".join(s2))
    print()

ABBBD
ABV-D

ABBBBBD
AB-VV-D



And it works for `Sequence` objects, too! I wouldn't recommend trying, though, because there are a _lot_ of possible alignments.
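
To get a feel for how quickly the number of equally good alignments can grow, one can count them without listing them. The sketch below is not part of `toolbox`: `count_optimal_alignments` is a hypothetical helper that assumes unit costs for substitutions, insertions and deletions, and it treats every distinct sequence of match, substitute and gap steps as a separate alignment.

```python
from typing import Tuple


def count_optimal_alignments(a: str, b: str) -> Tuple[int, int]:
    """Return (edit distance, number of optimal alignments) for a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    ways = [[0] * cols for _ in range(rows)]
    # Along the borders there is exactly one way: gaps only.
    for i in range(rows):
        dist[i][0], ways[i][0] = i, 1
    for j in range(cols):
        dist[0][j], ways[0][j] = j, 1
    for i in range(1, rows):
        for j in range(1, cols):
            candidates = (
                (dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]), ways[i - 1][j - 1]),
                (dist[i - 1][j] + 1, ways[i - 1][j]),
                (dist[i][j - 1] + 1, ways[i][j - 1]),
            )
            best = min(cost for cost, _ in candidates)
            dist[i][j] = best
            # Sum the counts of every predecessor that achieves the optimum.
            ways[i][j] = sum(count for cost, count in candidates if cost == best)
    return dist[-1][-1], ways[-1][-1]


print(count_optimal_alignments("AAAA", "AA"))          # (2, 6)
print(count_optimal_alignments("A" * 20, "A" * 10))    # (10, 184756)
```

Even two short, repetitive strings already have more than a hundred thousand optimal alignments, so enumerating all of them for proteins with hundreds of residues can quickly become impractical.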

