# Edit-based distance to canonical UniProt sequences
### Hamming + normalized col
### Levenshtein + normalized col 
---

## Pairwise sequence alignment is the process of comparing only two strings
* edit distance is expensive to compute for longer strings
* Hamming needs strings of the same length, but I like that it looks at positions: a single insertion is one edit, yet it can throw off every downstream position by a lot

* Hamming and Levenshtein are both edit based, but Hamming penalizes positional differences whereas Levenshtein does not

### Levenshtein distance = 
- In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.[1]
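
To make the definition concrete, here is a minimal sketch of the classic dynamic-programming computation of the Levenshtein distance. The function name `levenshtein` is only illustrative; it is not the implementation used by the libraries imported later in this notebook.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] = distance between the first i-1 characters of a
    # and the first j characters of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("MARIA", "MMARIA"))    # 1
```

The table is filled row by row, so time grows with `len(a) * len(b)`, which is why the distance gets expensive for long protein sequences.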


### Damerau–Levenshtein distance = 
- differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).[4][2]

##### From the elementary costs, we set
> $\delta_E(x, y) = \min\{\operatorname{cost}(\sigma) : \sigma \in S_{x,y}\}$

where $S_{x,y}$ is the set of sequences of elementary edit operations that transform $x$ into $y$, and the cost of an element $\sigma \in S_{x,y}$ is the sum of the costs of the edit operations in the sequence $\sigma$. The function $\delta_E$ is then a distance on $\Sigma^*$, and it is called the edit distance (Damerau-Levenshtein distance).

---

### Relationship with other edit distance metrics
There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,
1. the Levenshtein distance allows deletion, insertion and substitution;
2. the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
3. the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
4. the Hamming distance allows only substitution, hence, it only applies to strings of the same length.
> Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

**Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value by the formula**

> 1 - (edit distance / length of the larger of the two strings)
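
As a small sketch of this formula (the function name `normalized_similarity` and the empty-string convention are assumptions made for illustration):

```python
def normalized_similarity(edit_distance: int, a: str, b: str) -> float:
    """1 - (edit distance / length of the longer of the two strings)."""
    longest = max(len(a), len(b))
    if longest == 0:
        # assumption: two empty strings are treated as fully similar
        return 1.0
    return 1 - edit_distance / longest


# one edit separates "MARIA" and "MMARIA", and the longer string has 6 characters
print(normalized_similarity(1, "MARIA", "MMARIA"))  # 1 - 1/6
```

A normalized distance is the complement of this value, `edit distance / longer length`, so 0 means identical in that convention.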

The Jaro algorithm measures the characters two strings have in common, counting a character as shared only when its positions in the two strings are no more than half the length of the longer string apart, with consideration for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. Jaro and Jaro-Winkler are suited for comparing smaller strings like words and names.

Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings being compared. In general, though, both kinds of algorithm can be expensive when each string must be compared to every other string: with millions of strings in a data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.

# I want to use both the Hamming distance and the Levenshtein distance
- both are edit distance algorithms; Hamming considers positional differences and therefore only allows substitutions
- Levenshtein allows substitution, deletion, and insertion of 1 AA (amino acid)
- if the insertion is at the last position in the string
    > example: MARIA + P, then both are 1
    
- if the insertion is at the first position in the string
    > example: M + MARIA, then Levenshtein is still 1 but Hamming is 5; normalized Hamming is 5/6
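
The two examples in these notes can be reproduced with a short pure-Python sketch of Hamming distance. It assumes that positions past the end of the shorter string count as mismatches, which is the convention implied by the 5 and 5/6 values in the notes. The helper names are hypothetical.

```python
from itertools import zip_longest


def hamming(a: str, b: str) -> int:
    """Count mismatched positions. Positions beyond the end of the
    shorter string are paired with None and so count as mismatches."""
    return sum(ca != cb for ca, cb in zip_longest(a, b))


def hamming_normalized(a: str, b: str) -> float:
    """Mismatched positions divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return hamming(a, b) / longest if longest else 0.0


print(hamming("MARIA", "MARIAP"))             # 1: insertion at the end
print(hamming("MARIA", "MMARIA"))             # 5: insertion at the start shifts everything
print(hamming_normalized("MARIA", "MMARIA"))  # 5/6
```

Levenshtein is 1 in both cases, so a large gap between the two metrics hints at a shift caused by an insertion or deletion rather than many independent substitutions.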

In [1]:
# packages
import os
import sys
import numpy as np
import pandas as pd
from ast import literal_eval # for mismap_score func
import difflib
import jellyfish
import textdistance

In [2]:
sys.path.append("/Users/mariapalafox/Desktop/Toolbox/")
from all_funx import *

In [3]:
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.options.display.max_seq_items = 2000

In [18]:
m = "MARIA"
m2 = "AAR"
m3 = "MARIAA"
m4 = "MMARIA"
print(jellyfish.hamming_distance(m, m4))
print(textdistance.hamming.normalized_distance(m, m4))
print(textdistance.hamming.normalized_distance(m4, m))
# 5/6 = 0.8333333333333334, so normalized Hamming uses the longer length as the denominator and the number of mismatched positions as the numerator
print(textdistance.levenshtein(m4, m))

5

In [38]:
def identicalSequences(dfEnsp, ref_dic, newcolresult, hamming, hammingNorm, levenshtein, levenshteinNorm):
    """Compare each ENSP sequence to its canonical UniProt sequence.

    Adds an identity column ('True'/'False' strings) plus raw and normalized
    Hamming and Levenshtein distance columns to dfEnsp, then returns dfEnsp.
    """
    res = []
    ham = []
    hamnorm = []
    lev = []
    levnorm = []
    serSeq = dfEnsp['proSequence'].copy()
    serID = dfEnsp['UniProtSP_xref'].copy()
    for inx, val in serSeq.items():
        # strip whitespace so hidden characters cannot change the comparison
        p = str(val).strip()
        idd = str(serID[inx])
        # canonical UniProt sequence linked to this ENSP ID
        mypep = ref_dic[idd]
        # identity flag: the distances below are computed the same way for both outcomes
        res.append('True' if mypep == p else 'False')
        # Hamming distance is edit based and penalizes positional differences
        ham.append(textdistance.hamming(mypep, p))
        # normalized Hamming = number of mismatched positions / length of the longer sequence
        hamnorm.append(textdistance.hamming.normalized_distance(mypep, p))
        # Levenshtein is edit based but does not penalize position: an insertion at position 1 is just 1 difference
        lev.append(textdistance.levenshtein(mypep, p))
        levnorm.append(textdistance.levenshtein.normalized_distance(mypep, p))

    # add new columns
    dfEnsp.loc[:, newcolresult] = res
    dfEnsp.loc[:, hamming] = ham
    dfEnsp.loc[:, hammingNorm] = hamnorm
    dfEnsp.loc[:, levenshtein] = lev
    dfEnsp.loc[:, levenshteinNorm] = levnorm
    return dfEnsp

In [4]:
os.chdir("/Users/mariapalafox/Box Sync/CODE_DATA/dir_MAPpaper/TSV_UNIPROT_xref")

['v96Homo_sapiens.GRCh38.pep.all.fa', 'v97Homo_sapiens.GRCh37.pep.all.fa', 'sharedv85_10272_uniqueENSP.csv', 'MISMAP_SCORED_differentNumUKBID_2124_v96_notTrue4myIdentityScore_4623.csv', 'Homo_sapiens.GRCh38.97.uniprot.tsv', 'MISMAP_SCORED_differentNumUKBID_3885_v97_True4myIdentityScore_6052.csv', 'uniprotIDs3979.csv', 'groupedMISMAP_score_85_FALSEidentity_1805.csv', 'v85_labeledEverUKBID_xref100_sourceref100_15.csv', 'MISMAP_SCORE_xref_SharedID_allReleases_3979_UKBkey.csv', '.DS_Store', 'MISMAP_SCORED_differentNumUKBID_3878_v92_True4myIdentityScore_5957.csv', 'MISMAP_SCORED_v94_ENSP_posDict_checked_10699.csv', 'v94_labeledEverUKBID_xref100_sourceref100_6097.csv', 'MISMAP_SCORED_v97_ENSP_posDict_checked_10650.csv', 'v85_labeledEverUKBID_xref100_sourceref100_14.csv', 'MISMAP_SCORED_described_10750.csv', 'MISMAP_SCORED_v96_ENSP_posDict_checked_10750.csv', 'Homo_sapiens.GRCh38.92.uniprot.tsv', 'v96_labeledEverUKBID_xref100_sourceref100_6152.csv', 'uniprotIDs_3979.csv', 'MISMAP_SCORED_sameN

In [5]:
# creating dictionary of canonical uniprot sequences
refuniprot = pd.read_csv("MISMAP_SCORE_UniprotFastaCKabund_w_UniprotCYSLYSpositionsLabeled_3979.csv")

In [11]:
refuniprot.drop(['Unnamed: 0'],inplace=True,axis=1)
refuniprot.head(3)

Unnamed: 0,UniProtSP_xref,UniProt_length,proSequence,C_abundance,K_abundance,in3979xref,total_targets,pos_dict
0,Q9HAS0,396,MLPSLQESMDGDEKELESSEEGGSAEERRLEPPSSSHYCLYSYRGS...,0.0303,0.0581,True,1,{182: 'C'}
1,Q86X76,327,MLGFITRPPHRFLSLLCPGLRIPQLSVLCAQPRPRAMAISSSSCEL...,0.0459,0.0245,True,3,"{161: 'K', 165: 'C', 203: 'C'}"
2,Q9NQR4,276,MTSFRLALIQLQISSIKSDNVTRACSFIREAATQGAKIVSLPECFN...,0.0254,0.0652,True,7,"{44: 'C', 52: 'K', 123: 'K', 130: 'K', 146: 'C..."


In [12]:
# stripped the white spaces from uniprot seq col
ukb = refuniprot['proSequence'].apply(lambda x: x.strip())
ukb.head(3)
refuniprot.proSequence = ukb

In [13]:
refuniprot['proSequence'].apply(lambda x: len(x))
# length matches df

0        396
1        327
2        276
3        301
4       2471
5        535
6        596
7        503
8        281
9        548
10       127
11       767
12      1436
13      2090
14       331
15       400
16       172
17      1960
18      2157
19       587
20       706
21       456
22      1025
23       205
24       390
25       377
26       436
27       887
28       271
29      1249
        ... 
3949     311
3950     391
3951     289
3952     633
3953    3433
3954     119
3955     213
3956     466
3957    3174
3958     699
3959     313
3960     601
3961     508
3962    5183
3963     425
3964    1555
3965    1090
3966    3258
3967     310
3968     123
3969     475
3970     153
3971    2620
3972     649
3973     489
3974    1375
3975    1888
3976    1464
3977     944
3978     313
Name: proSequence, Length: 3979, dtype: int64

In [14]:
# are all uniprot seq unique?
seq = refuniprot.proSequence.tolist()
print(len(seq))
print(len(set(seq)))
# can't use sequence as key since there is one duplicate

3979
3978


In [15]:
DUP = refuniprot[refuniprot.duplicated(['proSequence'], keep=False)]
DUP
# CCZ1 Gene(Protein Coding) 
# CCZ1 Homolog, Vacuolar Protein Trafficking And Biogenesis Associated

Unnamed: 0,UniProtSP_xref,UniProt_length,proSequence,C_abundance,K_abundance,in3979xref,total_targets,pos_dict
176,P86791,482,MAAAAAGAGSGPWAAQEKQFPPALLSFFIYNPRFGPREGQEENKIL...,0.0166,0.0726,True,1,{358: 'C'}
2704,P86790,482,MAAAAAGAGSGPWAAQEKQFPPALLSFFIYNPRFGPREGQEENKIL...,0.0166,0.0726,True,1,{358: 'C'}


In [16]:
# creating dictionary from ID and Sequence columns
ref_dic = dict(zip(refuniprot.UniProtSP_xref, refuniprot.proSequence))
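
The reasoning behind keying the dictionary by ID rather than by sequence can be sketched with toy data. The IDs and sequences below are made up for illustration, and the sketch uses plain Python rather than the notebook's DataFrame.

```python
from collections import defaultdict

# toy stand-ins for (UniProt ID, sequence) rows; not real data
records = [("ID_A", "MARIA"), ("ID_B", "MARIAP "), ("ID_C", "MARIA")]

# group IDs by (stripped) sequence to find sequences shared by several IDs
ids_by_seq = defaultdict(list)
for uid, seq in records:
    ids_by_seq[seq.strip()].append(uid)
shared = {seq: ids for seq, ids in ids_by_seq.items() if len(ids) > 1}
print(shared)  # {'MARIA': ['ID_A', 'ID_C']}

# keying by ID keeps every record, as long as the IDs themselves are unique
ref = {uid: seq.strip() for uid, seq in records}
assert len(ref) == len(records), "duplicate IDs would silently overwrite entries"
```

Keying by sequence would have collapsed `ID_A` and `ID_C` into one entry, which is the situation the two identical 482-residue entries create in the real table.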

---

# adding distance values to csv of each release including TRUE / FALSE identity scores 
- all distance scores are relative to the canonical UniProtKB (UKB) sequence that the ENSP ID is linked to

In [35]:
os.chdir("/Users/mariapalafox/Box Sync/CODE_DATA/dir_MAPpaper/TSV_UNIPROT_xref/MISMAP_ENSPseq_myIdentity_2canonUKBseq/")

In [36]:
# ensembl TRUE AND FALSE all rows
v85= pd.read_csv("MISMAP_SCORE_fasta85_CanonicalUKB_identitycol_distanceScores_10272.csv")  
v92= pd.read_csv("MISMAP_SCORE_fasta92_CanonicalUKB_identitycol_distanceScores_10479.csv")  
v94= pd.read_csv("MISMAP_SCORE_fasta94_CanonicalUKB_identitycol_distanceScores_10699.csv")  
v96= pd.read_csv("MISMAP_SCORE_fasta96_CanonicalUKB_identitycol_distanceScores_10750.csv")  
v97= pd.read_csv("MISMAP_SCORE_fasta97_CanonicalUKB_identitycol_distanceScores_10650.csv")

In [41]:
v85 = identicalSequences(v85, ref_dic, "matchedUKBcanonical", "hamming_distance", "hamming_normalized_dist", "levenshtein_distance", "levenshtein_normalized_dist")
checkColumnValues(v85,'matchedUKBcanonical')

  matchedUKBcanonical  Count
0                True   5925
1               False   4347


In [40]:
v92 = identicalSequences(v92, ref_dic, "matchedUKBcanonical", "hamming_distance", "hamming_normalized_dist", "levenshtein_distance", "levenshtein_normalized_dist")
checkColumnValues(v92, 'matchedUKBcanonical')

  matchedUKBcanonical  Count
0                True   5957
1               False   4522


In [42]:
v94 = identicalSequences(v94, ref_dic, "matchedUKBcanonical", "hamming_distance", "hamming_normalized_dist", "levenshtein_distance", "levenshtein_normalized_dist")
checkColumnValues(v94, 'matchedUKBcanonical')

  matchedUKBcanonical  Count
0                True   6097
1               False   4602


In [43]:
v96 = identicalSequences(v96, ref_dic, "matchedUKBcanonical", "hamming_distance", "hamming_normalized_dist", "levenshtein_distance", "levenshtein_normalized_dist")
checkColumnValues(v96, 'matchedUKBcanonical')

  matchedUKBcanonical  Count
0                True   6127
1               False   4623


In [44]:
v97 = identicalSequences(v97, ref_dic, "matchedUKBcanonical", "hamming_distance", "hamming_normalized_dist", "levenshtein_distance", "levenshtein_normalized_dist")
checkColumnValues(v97, 'matchedUKBcanonical')

  matchedUKBcanonical  Count
0                True   6052
1               False   4598


In [48]:
# dropping 1 version of score columns
v85.drop(['Unnamed: 0','hammingScore', 'hamNormalizedScore','levenshteinScore'],inplace=True,axis=1)
v92.drop(['Unnamed: 0','hammingScore', 'hamNormalizedScore','levenshteinScore'],inplace=True,axis=1)
v94.drop(['Unnamed: 0','hammingScore', 'hamNormalizedScore','levenshteinScore'],inplace=True,axis=1)
v96.drop(['Unnamed: 0','hammingScore', 'hamNormalizedScore','levenshteinScore'],inplace=True,axis=1)
v97.drop(['Unnamed: 0','hammingScore', 'hamNormalizedScore','levenshteinScore'],inplace=True,axis=1)

In [49]:
v85.head(4)

Unnamed: 0,ENSPv,ENSP,Length,proSequence,MISMAP_SCORE_ENSP,UniProtSP_xref,Identical_UKB_seq,matchedUKBcanonical,hamming_distance,hamming_normalized_dist,levenshtein_distance,levenshtein_normalized_dist
0,ENSP00000313454.4,ENSP00000313454,1052,MEGSEPVAAHQGEEASCSSWGTGSTNKNLPIMSTASVEIDDALYSR...,True,A0AVT1,True,True,0,0.0,0,0.0
1,ENSP00000399234.2,ENSP00000399234,389,MEGSEPVAAHQGEEASCSSWGTGSTNKNLPIMSTASVEIDDALYSR...,True,A0AVT1,False,False,684,0.65019,663,0.630228
2,ENSP00000251527.5,ENSP00000251527,893,MTPPSRAEAGVRRSRVPSEGRWRGAEPPGISASTQPASAGRAARHC...,True,A0FGR8,False,False,859,0.932682,70,0.076004
3,ENSP00000279907.7,ENSP00000279907,1464,MAGIIKKQILKHLSRFTKNLSPDKINLSTLKGEGELKNLELDEEVL...,True,A0JNW5,True,True,0,0.0,0,0.0


In [50]:
v97.head(10)

Unnamed: 0,ENSPv,ENSP,Length,proSequence,MISMAP_SCORE_ENSP,UniProtSP_xref,Identical_UKB_seq,matchedUKBcanonical,hamming_distance,hamming_normalized_dist,levenshtein_distance,levenshtein_normalized_dist
0,ENSP00000313454.4,ENSP00000313454,1052,MEGSEPVAAHQGEEASCSSWGTGSTNKNLPIMSTASVEIDDALYSR...,True,A0AVT1,True,True,0,0.0,0,0.0
1,ENSP00000399234.2,ENSP00000399234,389,MEGSEPVAAHQGEEASCSSWGTGSTNKNLPIMSTASVEIDDALYSR...,True,A0AVT1,False,False,684,0.65019,663,0.630228
2,ENSP00000251527.6,ENSP00000251527,845,MSGARGEGPEAGAGGAGGRAAPENPGGVLSVELPGLLAQLARSFAL...,True,A0FGR8,False,False,859,0.932682,77,0.083605
3,ENSP00000279907.7,ENSP00000279907,1464,MAGIIKKQILKHLSRFTKNLSPDKINLSTLKGEGELKNLELDEEVL...,True,A0JNW5,True,True,0,0.0,0,0.0
4,ENSP00000349285.3,ENSP00000349285,522,MAGIIKKQILKHLSRFTKNLSPDKINLSTLKGEGELKNLELDEEVL...,True,A0JNW5,False,False,948,0.647541,942,0.643443
5,ENSP00000260777.11,ENSP00000260777,571,MVIEEVNFMQNHLEIEKTCRESAEALATKLNKENKTLKRISMLYMA...,True,A0MZ66,False,False,583,0.92393,60,0.095087
6,ENSP00000347532.4,ENSP00000347532,631,MNSSDEEKQLQLITSLKEQAIGEYEDLRAENQKTKEKCDKIRQERD...,True,A0MZ66,True,True,0,0.0,0,0.0
7,ENSP00000376635.4,ENSP00000376635,498,MVIEEVNFMQNHLEIEKTCRESAEALATKLNKENKTLKRISMLYMA...,True,A0MZ66,False,False,587,0.930269,133,0.210777
8,ENSP00000376636.3,ENSP00000376636,558,MNSSDEEKQLQLITSLKEQAIGEYEDLRAENQKTKEKCDKIRQERD...,True,A0MZ66,False,False,73,0.115689,73,0.115689
9,ENSP00000480109.1,ENSP00000480109,456,MNSSDEEKQLQLITSLKEQAIGEYEDLRAENQKTKEKCDKIRQERD...,True,A0MZ66,False,False,178,0.282092,175,0.277338


In [51]:
# describing the column
v85.describe()

Unnamed: 0,Length,hamming_distance,hamming_normalized_dist,levenshtein_distance,levenshtein_normalized_dist
count,10272.0,10272.0,10272.0,10272.0,10272.0
mean,598.666083,229.175623,0.268769,75.42796,0.082515
std,931.677181,951.349554,0.375971,541.687178,0.171341
min,35.0,0.0,0.0,0.0,0.0
25%,248.0,0.0,0.0,0.0,0.0
50%,429.0,0.0,0.0,0.0,0.0
75%,715.0,270.0,0.604929,41.0,0.078397
max,35991.0,34325.0,0.999272,34137.0,0.993799


In [52]:
v92.describe()

Unnamed: 0,Length,hamming_distance,hamming_normalized_dist,levenshtein_distance,levenshtein_normalized_dist
count,10479.0,10479.0,10479.0,10479.0,10479.0
mean,582.769539,215.04781,0.278229,67.820689,0.08246
std,877.420592,752.27439,0.380984,402.689523,0.167596
min,35.0,0.0,0.0,0.0,0.0
25%,245.0,0.0,0.0,0.0,0.0
50%,425.0,0.0,0.0,0.0,0.0
75%,703.0,274.0,0.646998,41.0,0.085388
max,34350.0,32142.0,0.997551,28825.0,0.974703


In [53]:
v94.describe()

Unnamed: 0,Length,hamming_distance,hamming_normalized_dist,levenshtein_distance,levenshtein_normalized_dist
count,10699.0,10699.0,10699.0,10699.0,10699.0
mean,584.026545,213.942985,0.276695,67.65436,0.082414
std,873.117756,746.158567,0.380208,399.61136,0.167505
min,35.0,0.0,0.0,0.0,0.0
25%,245.0,0.0,0.0,0.0,0.0
50%,426.0,0.0,0.0,0.0,0.0
75%,703.0,272.5,0.638012,41.0,0.085356
max,34350.0,32142.0,0.997551,28825.0,0.974703


In [54]:
v96.describe()

Unnamed: 0,Length,hamming_distance,hamming_normalized_dist,levenshtein_distance,levenshtein_normalized_dist
count,10750.0,10750.0,10750.0,10750.0,10750.0
mean,584.467535,214.605767,0.277247,67.994605,0.082616
std,872.508523,745.282945,0.380588,399.397148,0.167981
min,35.0,0.0,0.0,0.0,0.0
25%,245.0,0.0,0.0,0.0,0.0
50%,425.0,0.0,0.0,0.0,0.0
75%,706.0,274.0,0.641072,41.0,0.085306
max,34350.0,32142.0,0.997551,28825.0,0.974703


In [55]:
v97.describe()

Unnamed: 0,Length,hamming_distance,hamming_normalized_dist,levenshtein_distance,levenshtein_normalized_dist
count,10650.0,10650.0,10650.0,10650.0,10650.0
mean,582.773052,215.720188,0.27893,68.339061,0.083227
std,873.306108,748.005374,0.381136,400.803172,0.168604
min,35.0,0.0,0.0,0.0,0.0
25%,244.0,0.0,0.0,0.0,0.0
50%,425.0,0.0,0.0,0.0,0.0
75%,705.0,276.0,0.646541,41.0,0.085901
max,34350.0,32142.0,0.997551,28825.0,0.974703


In [56]:
# saving ensembl release csv file with identity and distance scores
v85.to_csv("MISMAP_SCORE_fasta85_CanonicalUKB_identitycol_distanceScores_10272.csv")  
v92.to_csv("MISMAP_SCORE_fasta92_CanonicalUKB_identitycol_distanceScores_10479.csv")  
v94.to_csv("MISMAP_SCORE_fasta94_CanonicalUKB_identitycol_distanceScores_10699.csv")  
v96.to_csv("MISMAP_SCORE_fasta96_CanonicalUKB_identitycol_distanceScores_10750.csv")  
v97.to_csv("MISMAP_SCORE_fasta97_CanonicalUKB_identitycol_distanceScores_10650.csv")

---

## ALSO did QC for my identity score values
- confirmed that no hidden characters alter the identity score (used Python `strip()`); the results were unchanged after QC

- confirmed the TRUE and FALSE identity score results are correct: every sequence in the TRUE files is identical to my canonical sequence, and no sequence in the FALSE files has an identical match

- MISMAP data code written in 19_09_11_Ensembl_versionIDs_proteinSeq_differences_CYSLYScoordinatesTESTING

```text
  Identical_UKB_seq  Count
0              True   5925
1             False   4347

  Identical_UKB_seq  Count
0              True   5957
1             False   4522

  Identical_UKB_seq  Count
0              True   6097
1             False   4602

  Identical_UKB_seq  Count
0              True   6127
1             False   4623

  Identical_UKB_seq  Count
0              True   6052
1             False   4598
```

## TODO- 2 scripts for QC
1. script that confirms the TRUE/FALSE results by comparing the UniProt canonical sequences (in dict form) to the FASTA files
    - I have each FASTA file, separated into True and False
    - double-check the True and False files by adding a column giving the specific reason a False sequence does not match
    
    - added distance metrics 
        - Hamming distance, allows substitutions only
        - Levenshtein distance, allows single substitutions, insertions, and deletions only
    
2. script that confirms which UniProt IDs do not have a canonical sequence match in any of the release FASTA files: 49 canonical sequences in total that SHOULD NOT BE PRESENT IN ANY RELEASE

# TODO 2  confirm these 49 IDs do not have a canonical sequence match in any of the Ensembl release fasta files

In [1]:
# FOUND THESE UNIPROT IDS have canonical sequences that are not in the fasta of ANY RELEASES!!!
# DOUBLE CHECK THAT ALL OF THESE SEQUENCES ARE NOT IN THE FASTA FILES

noSeqMatchers = ['A0FGR8',
 'A6NNF4',
 'O14965',
 'O15061',
 'O15392',
 'O43708',
 'O60645',
 'O75400',
 'P02765',
 'P07686',
 'P11182',
 'P11532',
 'P11586',
 'P16278',
 'P17927',
 'P18887',
 'P20839',
 'P30837',
 'P36639',
 'P53990',
 'Q03001',
 'Q12912',
 'Q13459',
 'Q14135',
 'Q14558',
 'Q15170',
 'Q68E01',
 'Q6PKG0',
 'Q86TG7',
 'Q8IY17',
 'Q8NBJ7',
 'Q8NBT2',
 'Q8NCA5',
 'Q8WWI1',
 'Q8WX93',
 'Q96ME1',
 'Q99729',
 'Q9BV68',
 'Q9BX63',
 'Q9BZ29',
 'Q9NRG7',
 'Q9P2N6',
 'Q9UJ41',
 'Q9UM54',
 'Q9UMY4',
 'Q9UNH6',
 'Q9UNH7',
 'Q9Y520',
 'Q9Y679']

In [2]:
len(noSeqMatchers)

49
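
One possible way to run this check is sketched below, assuming the release files are plain multi-line FASTA and that a "match" means exact string equality. The function names `read_fasta` and `ids_with_match`, and the toy records, are hypothetical; in practice the handle would be an opened release FASTA file and the dictionary would be `ref_dic` restricted to `noSeqMatchers`.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ids_with_match(canonical, fasta_handle):
    """Return the IDs in `canonical` whose sequence exactly equals a FASTA record."""
    fasta_seqs = {seq for _, seq in read_fasta(fasta_handle)}
    return sorted(uid for uid, seq in canonical.items() if seq in fasta_seqs)


# toy stand-ins; real use would pass open("<release>.pep.all.fa") instead
toy_fasta = io.StringIO(">ENSP_TOY1\nMARIA\nPQR\n>ENSP_TOY2\nMMARIA\n")
toy_canonical = {"ID_A": "MARIAPQR", "ID_B": "MARIA"}
print(ids_with_match(toy_canonical, toy_fasta))  # ['ID_A']
```

For the 49 IDs, the expectation is an empty list for every release file; any ID that comes back would contradict the earlier result.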

In [6]:
enspA3KN83 = "MVEPGQDLLLAALSESGISPNDLFDIDGGDAGLATPMPTPSVQQSVPLSALELGLETEAAVPVKQEPETVPTPALLNVRQPPSTTTFVLNQINHLPPLGSTIVMTKTPPVTTNRQTITLTKFIQTTASTRPSVSAPTVRNAMTSAPSKDQVQLKDLLKNNSLNELMKLKPPANIAQPVATAATDVSNGTVKKESSNKEGARMWINDMKMRSFSPTMKVPVVKEDDEPEEEDEEEMGHAETYAEYMPIKLKIGLRHPDAVVETSSLSSVTPPDVWYKTSISEETIDNGWLSALQLEAITYAAQQHETFLPNGDRAGFLIGDGAGVGKGRTIAGIIYENYLLSRKRALWFSVSNDLKYDAERDLRDIGAKNILVHSLNKFKYGKISSKHNGSVKKGVIFATYSSLIGESQSGGKYKTRLKQLLHWCGDDFDGVIVFDECHKAKNLCPVGSSKPTKTGLAVLELQNKLPKARVVYASATGASEPRNMAYMNRLGIWGEGTPFREFSDFIQAVERRGVGAMEIVAMDMKLRGMYIARQLSFTGVTFKIEEVLLSQSYVKMYNKAVKLWVIARERFQQAADLIDAEQRMKKSMWGQFWSAHQRFFKYLCIASKVKRVVQLAREEIKNGKCVVIGLQSTGEARTLEALEEGGGELNDFVSTAKGVLQSLIEKHFPAPDRKKLYSLLGIDLTAPSNNSSPRDSPCKENKIKKRKGEEITREAKKARKVGGLTGSSSDDSGSESDASDNEESDYESSKNMSSGDDDDFNPFLDESNEDDENDPWLIRKDHKKNKEKKKKKSIDPDSIQSALLASGLGSKRPSFSSTPVISPAPNSTPANSNTNSNSSLITSQDAVERAQQMKKDLLDKLEKLAEDLPPNTLDELIDELGGPENVAEMTGRKGRVVSNDDGSISYESRSELDVPVEILNITEKQRFMDGDKNIAIISEAASSGISLQADRRAKNQRRRVHMTLELPWSADRAIQQFGRTHRSNQVTAPEYVFLISELAGEQRFASIVAKRLESLGALTHGDRRATESRDLSRFNFDNKYGRNALEIVMKSIVNLDSPMVSPPPDYPGEFFKDVRQGLIGVGLINVEDRSGILTLDKDYNNIGKFLNRILGMEVHQQNALFQYFADTLTAVVQNAKKNGRYDMGILDLGSGDEKVRKSDVKKFLTPGYSTSGHVELYTISVERGMSWEEATKIWAELTGPDDGFYLSLQIRNNKKTAILVKEVNPKKKLFLVYRPNTGKQLKLEIYADLKKKYKKVVSDDALMHWLDQYNSSADTCTHAYWRGNCKKASLGLVCEIGLRCRTYYVLCGSVLSVWTKVEGVLASVSGTNVKMQIVRLRTEDGQRIVGLIIPANCVSPLVNLLSTSDQSQQLAVQQKQLWQQHHPQSITNLSNA"

ukbA3KN83 = "MVEPGQDLLLAALSESGISPNDLFDIDGGDAGLATPMPTPSVQQSVPLSALELGLETEAAVPVKQEPETVPTPALLNVRQQPPSTTTFVLNQINHLPPLGSTIVMTKTPPVTTNRQTITLTKFIQTTASTRPSVSAPTVRNAMTSAPSKDQVQLKDLLKNNSLNELMKLKPPANIAQPVATAATDVSNGTVKKESSNKEGARMWINDMKMRSFSPTMKVPVVKEDDEPEEEDEEEMGHAETYAEYMPIKLKIGLRHPDAVVETSSLSSVTPPDVWYKTSISEETIDNGWLSALQLEAITYAAQQHETFLPNGDRAGFLIGDGAGVGKGRTIAGIIYENYLLSRKRALWFSVSNDLKYDAERDLRDIGAKNILVHSLNKFKYGKISSKHNGSVKKGVIFATYSSLIGESQSGGKYKTRLKQLLHWCGDDFDGVIVFDECHKAKNLCPVGSSKPTKTGLAVLELQNKLPKARVVYASATGASEPRNMAYMNRLGIWGEGTPFREFSDFIQAVERRGVGAMEIVAMDMKLRGMYIARQLSFTGVTFKIEEVLLSQSYVKMYNKAVKLWVIARERFQQAADLIDAEQRMKKSMWGQFWSAHQRFFKYLCIASKVKRVVQLAREEIKNGKCVVIGLQSTGEARTLEALEEGGGELNDFVSTAKGVLQSLIEKHFPAPDRKKLYSLLGIDLTAPSNNSSPRDSPCKENKIKKRKGEEITREAKKARKVGGLTGSSSDDSGSESDASDNEESDYESSKNMSSGDDDDFNPFLDESNEDDENDPWLIRKDHKKNKEKKKKKSIDPDSIQSALLASGLGSKRPSFSSTPVISPAPNSTPANSNTNSNSSLITSQDAVERAQQMKKDLLDKLEKLAEDLPPNTLDELIDELGGPENVAEMTGRKGRVVSNDDGSISYESRSELDVPVEILNITEKQRFMDGDKNIAIISEAASSGISLQADRRAKNQRRRVHMTLELPWSADRAIQQFGRTHRSNQVTAPEYVFLISELAGEQRFASIVAKRLESLGALTHGDRRATESRDLSRFNFDNKYGRNALEIVMKSIVNLDSPMVSPPPDYPGEFFKDVRQGLIGVGLINVEDRSGILTLDKDYNNIGKFLNRILGMEVHQQNALFQYFADTLTAVVQNAKKNGRYDMGILDLGSGDEKVRKSDVKKFLTPGYSTSGHVELYTISVERGMSWEEATKIWAELTGPDDGFYLSLQIRNNKKTAILVKEVNPKKKLFLVYRPNTGKQLKLEIYADLKKKYKKVVSDDALMHWLDQYNSSADTCTHAYWRGNCKKASLGLVCEIGLRCRTYYVLCGSVLSVWTKVEGVLASVSGTNVKMQIVRLRTEDGQRIVGLIIPANCVSPLVNLLSTSDQSQQLAVQQKQLWQQHHPQSITNLSNA"

# the ENSP sequence is from v85



In [7]:
import difflib
output_list = [li for li in difflib.ndiff(enspA3KN83, ukbA3KN83) if li[0] != ' ']
output_list

['+ Q']
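
`ndiff` reports what changed but not where. A sketch using `difflib.SequenceMatcher.get_opcodes()` also reports the position of each edit. The helper name `describe_edits` and the short fragments are illustrative; the fragments are taken from around the difference between the two sequences above.

```python
import difflib


def describe_edits(a: str, b: str):
    """List the non-matching regions between a and b as
    (operation, start index in a, text in a, text in b)."""
    # autojunk=False turns off the heuristic that ignores very frequent
    # elements in long sequences, which can matter for protein strings
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, i1, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# short fragments around the difference: the second has an extra Q
for edit in describe_edits("NVRQPPS", "NVRQQPPS"):
    print(edit)
```

When the inserted residue sits next to an identical residue, as with the doubled Q here, the reported index can fall on either side of the repeat, so it is best read as approximate to within the repeated run.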