# Float peptides
If you have epitope coordinates, you can "float" your epitopes so they
line up with the parent sequence. `float_peptides()` simply prepends gap
characters (`-`) to each epitope sequence, one for every position that comes
before the epitope's start.
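
To make the idea concrete, here is a minimal pure-Python sketch of floating a single epitope. It is not the epimap implementation; the helper name `float_sequence` and its parameters are hypothetical and only mirror the behaviour described above (number of gaps = start position minus the index of the first position).

```python
def float_sequence(seq, start, index=1, gap="-"):
    """Prepend one gap character for every position before `start`.

    Hypothetical helper for illustration only, not part of epimap.
    """
    n_gaps = int(start) - index  # positions that come before the epitope
    if n_gaps < 0:
        raise ValueError("start must not be smaller than index")
    return gap * n_gaps + seq


# One-based coordinates: an epitope starting at position 2 gets one gap
print(float_sequence("CTAC", start=2, index=1))
# Zero-based coordinates: an epitope starting at position 6 gets six gaps
print(float_sequence("FRGN", start=6, index=0))
```

The same start coordinate yields a different number of gaps depending on `index`, which is why the index has to be stated explicitly.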

Let's generate a random sequence and some random epitopes. We're using `index=1`, which means the first position in the sequence is 1.

In [2]:
import pandas as pd
from epimap import map, utils

# Define an example sequence and a set of epitopes
sequence = utils.random_seq(10)
epitopes = utils.random_epitopes(sequence, n=5, epitope_lengths=(3,6), index=1, includeend=True)
print(sequence)
epitopes

PCTACWPVHY


   start  end     seq  length
0      2    5    CTAC       4
1      7   10    PVHY       4
2      7   10    PVHY       4
3      4    9  ACWPVH       6
4      7   10    PVHY       4


Float the epitope sequences so they sit at the correct location in
an alignment. Note that you have to tell `float_peptides()` that `index=1`.


In [3]:
print(sequence)
for epi in map.float_peptides(epitopes, index=1):
    print(epi)


PCTACWPVHY
-CTAC
------PVHY
------PVHY
---ACWPVH
------PVHY


If you use `id_col` to specify a column of the epitope dataframe, `float_peptides()`
will return a list of Biopython `SeqRecord` objects. These can be saved as a FASTA file
or another type of sequence alignment file.

In [4]:
sequence = "abcdefghi"
epitopes = pd.DataFrame(
    {
        'name': ["A", "B", "C", "D", "E"],
        'start': [1.0,2,3,4,6],
        'end': [4,5,6,7,9],
        'seq': ['abc','bcd','cde','def','fgh']
    }
)
print(epitopes)
map.float_peptides(epitopes, index=1, id_col="name")

  name  start  end  seq
0    A    1.0    4  abc
1    B    2.0    5  bcd
2    C    3.0    6  cde
3    D    4.0    7  def
4    E    6.0    9  fgh


[SeqRecord(seq=Seq('abc'), id='A', name='<unknown name>', description='<unknown description>', dbxrefs=[]),
 SeqRecord(seq=Seq('-bcd'), id='B', name='<unknown name>', description='<unknown description>', dbxrefs=[]),
 SeqRecord(seq=Seq('--cde'), id='C', name='<unknown name>', description='<unknown description>', dbxrefs=[]),
 SeqRecord(seq=Seq('---def'), id='D', name='<unknown name>', description='<unknown description>', dbxrefs=[]),
 SeqRecord(seq=Seq('-----fgh'), id='E', name='<unknown name>', description='<unknown description>', dbxrefs=[])]
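
With Biopython installed, `Bio.SeqIO.write(records, path, "fasta")` is the usual way to save a list of `SeqRecord` objects. To show what ends up in the file, the sketch below writes FASTA text by hand using only the standard library. It assumes plain `(id, sequence)` tuples instead of `SeqRecord` objects, and `write_fasta` is a hypothetical helper, not an epimap or Biopython function.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs to `handle` in FASTA format.

    Hypothetical stand-in for illustration; sequences are wrapped at `width`.
    """
    for identifier, seq in records:
        handle.write(f">{identifier}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Floated epitopes as (id, sequence) pairs, taken from the example above
floated = [("A", "abc"), ("B", "-bcd"), ("C", "--cde")]

buffer = io.StringIO()  # swap for open("epitopes.fasta", "w") to write a file
write_fasta(floated, buffer)
print(buffer.getvalue())
```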

If your epitope positions use a zero index, the function call is the
same except we set `index=0` when calling `float_peptides()`.
We'll demonstrate this with another example using `index=0`.

In [5]:
sequence = utils.random_seq(10)
epitopes = utils.random_epitopes(sequence, n=5, epitope_lengths=(3,6), index=0, includeend=True)
print(sequence)
epitopes

HQHYQMFRGN


   start  end     seq  length
0      6    9    FRGN       4
1      1    6  QHYQMF       6
2      2    5    HYQM       4
3      5    9   MFRGN       5
4      5    9   MFRGN       5


In [6]:
print(sequence)
for epi in map.float_peptides(epitopes, index=0):
    print(epi)


HQHYQMFRGN
------FRGN
-QHYQMF
--HYQM
-----MFRGN
-----MFRGN


If your epitope table doesn't use "start" and "seq" columns,
you can set custom column names with `start_col` and `seq_col`.

In [7]:
epitopes = epitopes.copy()
epitopes.columns = [c.upper() for c in epitopes.columns]
epitopes

   START  END     SEQ  LENGTH
0      6    9    FRGN       4
1      1    6  QHYQMF       6
2      2    5    HYQM       4
3      5    9   MFRGN       5
4      5    9   MFRGN       5


In [8]:
map.float_peptides(epitopes, start_col="START", seq_col="SEQ", index=0)


['------FRGN', '-QHYQMF', '--HYQM', '-----MFRGN', '-----MFRGN']

## Alignment accuracy
You can score how well a floating epitope matches its parent sequence.
This helps identify epitopes that don't match their parent sequence;
consistently low scores can also indicate that the wrong index was used.

The alignment score can be calculated with `score_epitope_alignment()`.
This function takes a floating epitope sequence and the parent sequence.
It returns a tuple: the first element is the proportion of epitope positions that match,
and the second is a list of booleans marking whether each position matches.

In [17]:
sequence = utils.random_seq(10)
epitopes = utils.random_epitopes(sequence, 5, epitope_lengths=(3,6), index=1, includeend=True)
floating_epitopes = map.float_peptides(epitopes, index=1)

epitope = floating_epitopes[-1]
print(sequence)
print(epitope)

map.score_epitope_alignment(epitope, sequence)

YHEQWWNCTR
-----WNCTR


(1.0, [True, True, True, True, True])
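
The scoring described above can be sketched in plain Python. This is an illustrative assumption about how such a score might be computed, not the epimap source: `score_alignment` is a hypothetical helper that compares every non-gap position of the floating epitope with the same position in the parent sequence.

```python
def score_alignment(floating, parent, gap="-"):
    """Return (proportion of matches, per-position match flags).

    Only non-gap positions of `floating` are compared. Positions that
    run past the end of `parent` count as mismatches.
    Hypothetical helper for illustration only.
    """
    matches = [
        i < len(parent) and residue == parent[i]
        for i, residue in enumerate(floating)
        if residue != gap
    ]
    score = sum(matches) / len(matches) if matches else 0.0
    return score, matches


parent = "YHEQWWNCTR"
# Correctly floated epitope: every position matches
print(score_alignment("-----WNCTR", parent))
# Same epitope floated one position too far left, as a wrong index would do
print(score_alignment("----WNCTR", parent))
```

In the second call only the first residue happens to match, so the score drops sharply. That is the kind of signal that suggests an off-by-one index.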

Use a list comprehension to calculate the score for all epitopes.

In [None]:
[map.score_epitope_alignment(epi, sequence) for epi in floating_epitopes]


[(1.0, [True, True, True, True, True, True]),
 (1.0, [True, True, True, True, True]),
 (1.0, [True, True, True, True]),
 (1.0, [True, True, True, True]),
 (1.0, [True, True, True, True, True])]

## Locate epitopes
Given epitopes that have been aligned to a parent sequence, for example the output of `float_peptides()`,
what are the coordinates of the epitopes in that alignment?

We can use `locate_peptide()` for this. It returns the first and last non-gap positions of an aligned epitope.
With the arguments `index` and `includeend` we can set which coordinate system we want to use.

Let's demonstrate with an example set of aligned epitopes.

In [24]:
import pandas as pd
from epimap import map
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

epi1 = SeqRecord(Seq("abc"), description="epi1")
epi2 = SeqRecord(Seq("--cdef"), description="epi2")
epi3 = SeqRecord(Seq("-----fghi"), description="epi3")
epi4 = SeqRecord(Seq("--cdef---"), description="epi4")
epi5 = SeqRecord(Seq("--cd-f---"), description="epi5")

al = [
    epi1, epi2, epi3, epi4, epi5
]
for r in al:
    print(r.seq)


abc
--cdef
-----fghi
--cdef---
--cd-f---


We create a table of the aligned sequences.

In [26]:
al_df = pd.DataFrame({
    'description': [r.description for r in al],
    'peptide': [str(r.seq.replace("-", "")) for r in al],
    'seq': [str(r.seq) for r in al]
})
al_df

  description peptide        seq
0        epi1     abc        abc
1        epi2    cdef     --cdef
2        epi3    fghi  -----fghi
3        epi4    cdef  --cdef---
4        epi5     cdf  --cd-f---


Using the column of aligned sequences, you can get their non-gap start and end coordinates.
`locate_peptide()` returns a tuple of the start and end.
We use this tuple to add start and end coordinates to the table.

Note that this also works for complex alignments where there are gaps inside the epitope.

In [29]:
pos = al_df.seq.apply(map.locate_peptide, index=0, includeend=False)
al_df[['start','end']] = pd.DataFrame(pos.tolist())

al_df

  description peptide        seq  start  end
0        epi1     abc        abc      0    3
1        epi2    cdef     --cdef      2    6
2        epi3    fghi  -----fghi      5    9
3        epi4    cdef  --cdef---      2    6
4        epi5     cdf  --cd-f---      2    6


In [33]:
# Python uses zero-based indexing and slices that exclude
# the end position. However, you can request IEDB-style
# coordinates (index=1, includeend=True) if you'd like
al_df.seq.apply(map.locate_peptide, index=1, includeend=True)


0    (1, 3)
1    (3, 6)
2    (6, 9)
3    (3, 6)
4    (3, 6)
Name: seq, dtype: object
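
To see how `index` and `includeend` interact, here is a small standalone sketch of the coordinate logic. It is an assumption based on the outputs shown above rather than the epimap source, and `locate` is a hypothetical helper name.

```python
def locate(aligned, index=0, includeend=False, gap="-"):
    """Return (start, end) of the non-gap span of an aligned sequence.

    `index` is the number given to the first alignment column.
    With includeend=True, `end` is the last non-gap position itself;
    otherwise it is one past that position, as in Python slicing.
    Hypothetical helper for illustration only.
    """
    positions = [i for i, residue in enumerate(aligned) if residue != gap]
    if not positions:
        raise ValueError("sequence contains only gaps")
    start = positions[0] + index
    end = positions[-1] + index + (0 if includeend else 1)
    return start, end


for seq in ["abc", "--cdef", "-----fghi", "--cd-f---"]:
    zero_based = locate(seq, index=0, includeend=False)
    one_based = locate(seq, index=1, includeend=True)
    print(seq, zero_based, one_based)
```

Internal gaps, as in `--cd-f---`, do not affect the result because only the first and last non-gap columns are used.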