UPDATED SPLITS WITH SPECIES: Preprocess UniProt data (only signal peptides (SPs) that are experimentally verified and those verified by sequence analysis) and split into train/val/test with tokens for the species.

First, go through and split the sequences into the signal peptide and the remainder of the sequence. 
Discard sequences where the signal peptide does not start at the first position. Then, discard sequences
where the signal peptide is not between 10 and 70 amino acids, inclusive. Also discard sequences where 
the remaining sequence is not strictly longer than the signal peptide. 
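
The filtering rules above can be sketched as a single function. This is a minimal illustration, not the code that produced the dataset: the function name, the constants, and the 1-based `sp_start`/`sp_end` input format are assumptions.

```python
# Hypothetical helper; names and input format are assumptions for illustration.
MIN_SP_LEN = 10  # minimal signal peptide length, inclusive
MAX_SP_LEN = 70  # maximal signal peptide length, inclusive


def split_and_filter(sequence, sp_start, sp_end):
    """Return (signal_peptide, mature_protein), or None if the entry is discarded.

    sp_start and sp_end are 1-based, inclusive positions of the signal peptide.
    """
    # The signal peptide must start at the first position.
    if sp_start != 1:
        return None
    # The signal peptide must be between 10 and 70 amino acids, inclusive.
    sp_len = sp_end - sp_start + 1
    if not (MIN_SP_LEN <= sp_len <= MAX_SP_LEN):
        return None
    signal_peptide = sequence[:sp_end]
    mature = sequence[sp_end:]
    # The remaining sequence must be strictly longer than the signal peptide.
    if len(mature) <= len(signal_peptide):
        return None
    return signal_peptide, mature


print(split_and_filter("M" * 20 + "A" * 50, 1, 20) is not None)  # kept
print(split_and_filter("M" * 20 + "A" * 20, 1, 20))              # None: not strictly longer
```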

In one training dataset, keep the first 100 amino acids of the mature protein. In another training dataset, keep the first 95, 100, and 105 amino acids of each mature protein as separate examples to vary the length of the protein sequences. This way, we get "more" training data from each example.
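
One reading of the two training variants described above, as a small sketch. The function names are hypothetical, and deduplicating with a set is an assumption about how proteins shorter than 95 amino acids (which yield identical copies) would be handled.

```python
def truncate_fixed(pairs, length=100):
    """Keep the first `length` amino acids of each mature protein."""
    return [(sp, prot[:length]) for sp, prot in pairs]


def truncate_varied(pairs, lengths=(95, 100, 105)):
    """Emit one copy of each pair per truncation length, dropping exact duplicates."""
    out = set()
    for sp, prot in pairs:
        for n in lengths:
            out.add((sp, prot[:n]))
    return sorted(out)


# Toy pairs (not real sequences): one long protein, one shorter than 95 aa.
pairs = [("MKK", "A" * 120), ("MLL", "C" * 50)]
augmented = truncate_varied(pairs)
print(len(augmented))  # 4: three lengths for the long protein, one copy of the short one
```

Slicing past the end of a Python string is safe, so proteins shorter than the requested length are simply kept whole.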

Remove examples where the SP is identical and the protein sequences are more than 0.5 (50%) the same.
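
A rough sketch of this redundancy filter follows. The notes do not say how similarity was measured, so the use of `difflib.SequenceMatcher.ratio()` as the score, the greedy keep-first strategy, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    # Assumed similarity measure; autojunk is disabled because its heuristic
    # would treat common amino acids in long sequences as junk.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def drop_redundant(pairs, threshold=0.5):
    """Greedily keep a pair unless an already-kept pair has the same SP
    and a protein with similarity above `threshold`."""
    kept_by_sp = defaultdict(list)
    kept = []
    for sp, prot in pairs:
        if any(similarity(prot, other) > threshold for other in kept_by_sp[sp]):
            continue
        kept_by_sp[sp].append(prot)
        kept.append((sp, prot))
    return kept


# Toy examples (not real sequences).
examples = [
    ("MKT", "ACDEFGHIKL"),
    ("MKT", "ACDEFGHIKM"),  # same SP, nearly identical protein: dropped
    ("MKT", "WWWWWWWWWW"),  # same SP, unrelated protein: kept
    ("MAA", "ACDEFGHIKL"),  # different SP: kept
]
print(drop_redundant(examples))
```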

For each example, also save the organism. All organisms with fewer than 5 examples get lumped together as token 0: 'AAUnknown'. There are a total of 754 species tokens.
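
The species-token assignment could look roughly like the sketch below. The function name is hypothetical, and ordering the remaining organisms alphabetically after 'AAUnknown' is an assumption (the name suggests it was chosen to sort first).

```python
from collections import Counter

UNKNOWN = 'AAUnknown'  # token 0, shared by all rare organisms


def build_species_tokens(organisms, min_examples=5):
    """Map each organism to an integer token; rare organisms share token 0."""
    counts = Counter(organisms)
    frequent = sorted(org for org, n in counts.items() if n >= min_examples)
    vocab = [UNKNOWN] + frequent
    token_of = {org: i for i, org in enumerate(vocab)}
    # Organisms with fewer than min_examples examples are absent from token_of.
    tokens = [token_of.get(org, 0) for org in organisms]
    return tokens, vocab


# Toy organism list for illustration.
organisms = ["Homo sapiens"] * 5 + ["Mus musculus"] * 6 + ["Rare species"] * 2
tokens, vocab = build_species_tokens(organisms)
print(vocab)
print(tokens)
```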

There are a total of 32263 examples. 

Finally, shuffle the signal peptide/mature protein pairs and set aside 20% each as test and validation sets. The split is 19359/6452/6452. 
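
The split sizes can be checked with a short sketch. The function name and the seed are assumptions; only the 20% holdout fraction and the total of 32263 examples come from the notes above.

```python
import random


def shuffle_and_split(pairs, holdout_fraction=0.2, seed=1):
    """Shuffle, then set aside `holdout_fraction` each for test and validation."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_holdout = int(len(pairs) * holdout_fraction)
    test = pairs[:n_holdout]
    val = pairs[n_holdout:2 * n_holdout]
    train = pairs[2 * n_holdout:]
    return train, val, test


# Integers stand in for the 32263 signal peptide/mature protein pairs.
train, val, test = shuffle_and_split(range(32263))
print(len(train), len(val), len(test))  # 19359 6452 6452
```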

Minimal SP Length: 10 AA
Maximal SP Length: 70 AA
 
Defaults from SignalP http://www.cbs.dtu.dk/services/SignalP/instructions.php#limits
 
Minimal Protein Length: Longer than Signal Peptide
Maximal Protein Length: truncated to 70 -> according to SignalP’s SI (below)
https://images.nature.com/full/nature-assets/nmeth/journal/v8/n10/extref/nmeth.1701-S1.pdf

**Remade filtered datasets (50%, 75%, 90%, 95%, and 99%) with no test batch. Previously, some of the signal peptide/protein sequence pairs were split into test batches, but this is unnecessary because the test batch should consist of protein sequences from Zach's Excel "initial_enzymes_1" analysis.**

In [2]:
%matplotlib inline
import pickle
import random
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

import csv

In [12]:
# read in datasets from csv (tab-delimited; the first column holds the protein sequence)
dataset_75 = []
dataset_90 = []
dataset_95 = []
dataset_99 = []

# dataset_2.csv ... dataset_5.csv feed the four lists below;
# dataset_1.csv (presumably the 50% set) is read but its rows are not kept
targets = {1: dataset_75, 2: dataset_90, 3: dataset_95, 4: dataset_99}

for i in range(5):
    filename = "dataset_" + str(i + 1) + ".csv"
    with open(filename, "r") as f:
        reader = csv.reader(f, delimiter="\t")
        for line in reader: # reads in each row of the csv
            if i in targets:
                targets[i].append(line[0])

print(len(dataset_75))
print(len(dataset_90))
print(len(dataset_95))
print(len(dataset_99))

22120
27687
28218
28546


In [16]:
dataset_99

['RKPRRKRWTGHLETSKPSHLYKKNLDVTKIRTGKPRPLLRVEDHDFTMRPAFGGPAIPVGVDVQVESLDSISEVDMDFTMTLYLRHYWRDERLAFPSSSNRSMTFDGRLVKKIWVPDVFFVHSKRSFTHDTTTDNIMLRVFPDGHVLYSMRITVTAMCNMDFSHFPLDSQTCSLELESYAYTDEDLMLYWKNGDESLKTDEKISLSQFLIQKFHTTSRLAFYSSTGWYNRLYINFTLRRHIFFFLLQTYFPATLMVMLSWVSFWIDHRAVPARVSLGIMTVLTMSTIITGVNASMPRVSYIRAVDIYLWVSFVFVFLSVLEYAAVNYLTTVQEQKERKLRDKFPCTCGMLHSRTMTLDGSYSESEANSLAGYPRSHILPEEERQDKIVVHLALNSELTSSRKKGLLKGQMGLYIFQNTHAIDKYSRLIFPAFYIVFNLIYWSVFS',
 'DILAYNFENASQTFEDLPARFGYRLPAEGLKGFLINSKPENACEPIVPPPLKDNSSGTFIVLIRRLDCNFDIKVLNAQRAGYKAAIVHNVDSDDLISMGSNDIDTLKKIDIPSVFIGESSANSLKDEFTYEKGGHIILVPELSLPLEYYLIPFLIIVGICLILIVIFMITKFVQDRHRNRRNRLRKDQLKKLPVHKFKKGDEYDVCAICLEEYEDGDKLRILPCSHAYHCKCVDPWLTKTKKTCPVCKQKVVPSQGDSDSDTDSSQEENQVSEHTPLLPPSASARTQSFGSLSESHSHHNMTESSDYEDDDNEETDSSDADNEITDHSVVVQLQPNGEQDYNIANTV',
 'TRSTQKESVADNAGMLAGGIKDVPANENDLQLQELARFAVNEHNQKANALLGFEKLVKAKTQVVAGTMYYLTIEVKDGEVKKLYEAKVWEKPWENFKQLQEFKPVEEGASA',
 'GMHVLRYGYTGIFDDTSHMTLTVVGIFDGQHFFTYHVNSSDKASSRANGTISWMANVSAAYPTYLDGERAKGDLIFNQTE

In [3]:
# load in prot sequences from dataset
df = pd.read_excel('../dataset.xls')
si_c = df['Signal peptides'].values
pr_c = df['Prot Sequences'].values

In [4]:
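# collect (signal peptide, protein) pairs whose full protein sequence is in each filtered dataset
# (the lists keep the name "triplets", but each entry is a pair; no species token is attached)
triplets_75 = []
triplets_90 = []
triplets_95 = []
triplets_99 = []

# sets make each membership test O(1) instead of a scan over the whole list
set_75 = set(dataset_75)
set_90 = set(dataset_90)
set_95 = set(dataset_95)
set_99 = set(dataset_99)

for si, pr in zip(si_c, pr_c):
    if pr in set_75:
        triplets_75.append((si, pr))
    if pr in set_90:
        triplets_90.append((si, pr))
    if pr in set_95:
        triplets_95.append((si, pr))
    if pr in set_99:
        triplets_99.append((si, pr))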

print(len(triplets_75))
print(len(triplets_90))
print(len(triplets_95))
print(len(triplets_99))

22120
27687
28218
28546


In [5]:
# Remove exact duplicates
triplets_75 = list(set(triplets_75))
print(len(triplets_75))

triplets_90 = list(set(triplets_90))
print(len(triplets_90))

triplets_95 = list(set(triplets_95))
print(len(triplets_95))

triplets_99 = list(set(triplets_99))
print(len(triplets_99))

22115
27678
28208
28535


In [6]:
random.seed(a=1)
random.shuffle(triplets_75)
random.shuffle(triplets_90)
random.shuffle(triplets_95)
random.shuffle(triplets_99)

In [7]:
# 80/20 train/validation split of each shuffled dataset
L = int(len(triplets_75) * 0.8)
train_75 = triplets_75[:L]
val_75 = triplets_75[L:]
print(len(train_75), len(val_75)) # check split sizes

L = int(len(triplets_90) * 0.8)
train_90 = triplets_90[:L]
val_90 = triplets_90[L:]
print(len(train_90), len(val_90)) # check split sizes

L = int(len(triplets_95) * 0.8)
train_95 = triplets_95[:L]
val_95 = triplets_95[L:]
print(len(train_95), len(val_95)) # check split sizes

L = int(len(triplets_99) * 0.8)
train_99 = triplets_99[:L]
val_99 = triplets_99[L:]
print(len(train_99), len(val_99)) # check split sizes

17692 4423
22142 5536
22566 5642
22828 5707


In [8]:
# Truncate prot seqs in train and val to at most 100 aa (shorter proteins are kept whole)
# dataset where training has prot seqs of length 100

train_len_75 = [(si, pr[:100]) for si, pr in train_75]
val_len_75 = [(si, pr[:100]) for si, pr in val_75]

train_75 = train_len_75
val_75 = val_len_75

train_len_90 = [(si, pr[:100]) for si, pr in train_90]
val_len_90 = [(si, pr[:100]) for si, pr in val_90]

train_90 = train_len_90
val_90 = val_len_90

train_len_95 = [(si, pr[:100]) for si, pr in train_95]
val_len_95 = [(si, pr[:100]) for si, pr in val_95]

train_95 = train_len_95
val_95 = val_len_95

train_len_99 = [(si, pr[:100]) for si, pr in train_99]
val_len_99 = [(si, pr[:100]) for si, pr in val_99]

train_99 = train_len_99
val_99 = val_len_99

In [9]:
with open('../../6-14-18_filtered_data/train_75.pkl', 'wb') as f:
    pickle.dump(train_75, f)
with open('../../6-14-18_filtered_data/validate_75.pkl', 'wb') as f:
    pickle.dump(val_75, f)

with open('../../6-14-18_filtered_data/train_90.pkl', 'wb') as f:
    pickle.dump(train_90, f)
with open('../../6-14-18_filtered_data/validate_90.pkl', 'wb') as f:
    pickle.dump(val_90, f)

with open('../../6-14-18_filtered_data/train_95.pkl', 'wb') as f:
    pickle.dump(train_95, f)
with open('../../6-14-18_filtered_data/validate_95.pkl', 'wb') as f:
    pickle.dump(val_95, f)
    
with open('../../6-14-18_filtered_data/train_99.pkl', 'wb') as f:
    pickle.dump(train_99, f)
with open('../../6-14-18_filtered_data/validate_99.pkl', 'wb') as f:
    pickle.dump(val_99, f)

In [10]:
# vary prot length in training set
maximum = 105
leng1 = 0
leng2 = 0

train_vlen = []
train_vlen1 = []
train_vlen2 = []

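####### keep three copies of each training pair: prot truncated by 10 aa, by 5 aa, and untruncated
####### (if len(pr) >= maximum, the copies are instead capped at 95, 100, and 105 aa)
####### NOTE: as run here, train_75 was already truncated to 100 aa in the cell above, so
####### len(pr) < maximum always holds and the copies are at most 90, 95, and 100 aa long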
for t in train_75:
    si, pr = t
    if len(pr) < maximum:
        leng1 = len(pr) - 10
        leng2 = len(pr) - 5
        train_vlen.append((si, pr[:leng1]))
        train_vlen1.append((si, pr[:leng2]))
        train_vlen2.append((si, pr))
    else:
        train_vlen.append((si, pr[:95]))
        train_vlen1.append((si, pr[:100]))
        train_vlen2.append((si, pr[:105]))

print(len(train_vlen))
print(len(train_vlen1))
print(len(train_vlen2))

train_75 = []
train_75 = train_vlen + train_vlen1 + train_vlen2

print(len(train_75))
# Remove exact duplicates
train_75 = list(set(train_75))
print(len(train_75))

17692
17692
17692
53076
52978


In [11]:
train_vlen = []
train_vlen1 = []
train_vlen2 = []

####### keep three copies of each training pair: prot truncated by 10 aa, by 5 aa, and untruncated
####### (if len(pr) >= maximum, the copies are instead capped at 95, 100, and 105 aa)
for t in train_90:
    si, pr = t
    if len(pr) < maximum:
        leng1 = len(pr) - 10
        leng2 = len(pr) - 5
        train_vlen.append((si, pr[:leng1]))
        train_vlen1.append((si, pr[:leng2]))
        train_vlen2.append((si, pr))
    else:
        train_vlen.append((si, pr[:95]))
        train_vlen1.append((si, pr[:100]))
        train_vlen2.append((si, pr[:105]))

print(len(train_vlen))
print(len(train_vlen1))
print(len(train_vlen2))

train_90 = []
train_90 = train_vlen + train_vlen1 + train_vlen2

print(len(train_90))
# Remove exact duplicates
train_90 = list(set(train_90))
print(len(train_90))

22142
22142
22142
66426
66311


In [12]:
train_vlen = []
train_vlen1 = []
train_vlen2 = []

####### keep three copies of each training pair: prot truncated by 10 aa, by 5 aa, and untruncated
####### (if len(pr) >= maximum, the copies are instead capped at 95, 100, and 105 aa)
for t in train_95:
    si, pr = t
    if len(pr) < maximum:
        leng1 = len(pr) - 10
        leng2 = len(pr) - 5
        train_vlen.append((si, pr[:leng1]))
        train_vlen1.append((si, pr[:leng2]))
        train_vlen2.append((si, pr))
    else:
        train_vlen.append((si, pr[:95]))
        train_vlen1.append((si, pr[:100]))
        train_vlen2.append((si, pr[:105]))

print(len(train_vlen))
print(len(train_vlen1))
print(len(train_vlen2))

train_95 = []
train_95 = train_vlen + train_vlen1 + train_vlen2

print(len(train_95))
# Remove exact duplicates
train_95 = list(set(train_95))
print(len(train_95))

22566
22566
22566
67698
67568


In [13]:
train_vlen = []
train_vlen1 = []
train_vlen2 = []

####### keep three copies of each training pair: prot truncated by 10 aa, by 5 aa, and untruncated
####### (if len(pr) >= maximum, the copies are instead capped at 95, 100, and 105 aa)
for t in train_99:
    si, pr = t
    if len(pr) < maximum:
        leng1 = len(pr) - 10
        leng2 = len(pr) - 5
        train_vlen.append((si, pr[:leng1]))
        train_vlen1.append((si, pr[:leng2]))
        train_vlen2.append((si, pr))
    else:
        train_vlen.append((si, pr[:95]))
        train_vlen1.append((si, pr[:100]))
        train_vlen2.append((si, pr[:105]))

print(len(train_vlen))
print(len(train_vlen1))
print(len(train_vlen2))

train_99 = []
train_99 = train_vlen + train_vlen1 + train_vlen2

print(len(train_99))
# Remove exact duplicates
train_99 = list(set(train_99))
print(len(train_99))

22828
22828
22828
68484
68347


In [14]:
# dump data with sp, prot in *_augmented.pkl files (training datasets with three length variants of each prot seq, from the augmentation cells above)

with open('../../6-14-18_filtered_data/train_augmented_75.pkl', 'wb') as f:
    pickle.dump(train_75, f)
with open('../../6-14-18_filtered_data/train_augmented_90.pkl', 'wb') as f:
    pickle.dump(train_90, f)
with open('../../6-14-18_filtered_data/train_augmented_95.pkl', 'wb') as f:
    pickle.dump(train_95, f)
with open('../../6-14-18_filtered_data/train_augmented_99.pkl', 'wb') as f:
    pickle.dump(train_99, f)

In [15]:
with open('../../6-14-18_filtered_data/train_augmented_75.pkl', 'rb') as f:
    t = pickle.load(f)
t[0]

('MFGTLLLYCFFLATVPALA',
 'ETGGERQLSPEKSEIWGPGLKADVVLPARYFYIQAVDTSGNKFTSSPGEKVFQVKVSAPEEQFTRVGVQVLDRKDGSFIVRYRMYASYKNLKVEIKFQGQ')

In [None]:
train = [('MRLSTAQLIAIAYYMLSIGATVPQVDG', 'QGETEEALIQKRSYDYYQEPCDDYPQQQQQQEPCDYPQQQQQEEPCDYPQQQPQEPCDYPQQPQEPCDYPQQPQEPCDYPQQPQEPCDNPPQPDV', 121), ('MLTPRVLRALGWTGLFFLLLSPSNVLG', 'ASLSRDLETPPFLSFDPSNISINGAPLTEVPHAPSTESVSTNSESTNEHTITETTGKNAYIHNNASTDKQNANDTHKTPNILCDTEEVFVFLNET', 260)]
leng1 = 0
leng2 = 0
lst = []

for t in train:
    si, pr, sp = t
    leng1 = len(pr) - 10
    leng2 = len(pr) - 5
    lst.append(pr[:leng1])
    lst.append(pr[:leng2])
    lst.append(pr)

print(len(lst))
lst = list(set(lst)) # remove exact duplicates
print(len(lst))

In [8]:
with open('../../6-14-18_filtered_data/train_augmented_99.pkl', 'rb') as f:
    t = pickle.load(f)

In [9]:
t

[('MRVLFILFLFYFYTYTEA',
  'QQYYPIDPTGKCEQYIGDSVITPCNSIYVSSTANQSTAMLSLNLYFSLLGGSSASCQNAYTFATLCSTYLPECEIFIDNSTNKEIAMPKRVCLDT'),
 ('MFGTLLLYCFFLATVPALA',
  'ETGGERQLSPEKSEIWGPGLKADVVLPARYFYIQAVDTSGNKFTSSPGEKVFQVKVSAPEEQFTRVGVQVLDRKDGSFIVRYRMYASYKNLKVEIKFQGQ'),
 ('MKSLLLLAFFLSFFFGSLLA',
  'RHLPTSSHPSHHHVGMTGALKRQRRRPDTVQVAGSRLPDCSHACGSCSPCRLVMVSFVCASVEEAETCPMAYKCMCNNKSYPVP'),
 ('MPGLKRILTVTILALWLPHPGNA',
  'QQQCTNGFDLDRQSGQCLDIDECRTIPEACRGDMMCVNQNGGYLCIPRTNPVYRGPYSNPYSTSYSGPYPAAAPPVPASNYPTISRPLVC'),
 ('MSSALAYMLLVLSISLLNG',
  'QSPPGKPEIHKCRSPDKETFTCWWNPGSDGGLPTNYSLTYSKEGEKNTYECPDYKTSGPNSCFFSKQYTSIWKIYIITVNATNEMGSSTSDPLYVDVTYI'),
 ('MASSQRYFALLALFAVSLKFCYC',
  'QNETIDVAGSGTAGVTWYGEPFGAGSTGGACGYGSAVANPPLYAMVSAGGPSLFNNGKGCGTCYQVVCIGHPACSGSPITVTITDECPGG'),
 ('MKPPILVFIVYLLQLRDCQC',
  'APTGKDRTSIREDPKGFSKAGEIDVDEEVKKALIGMKQMKILMERREEEHSKLMRTLKKCREEKQEALKLMNEVQEHLEEEERLCQVSLMDSWDE'),
 ('MGRGQIIMILVGLLCLANESYS',
  'DVIAKSSLQMCENTGNSDDPYNVVDQKACEKKLIVTLSVRSGQNGTEFLKAVTNVSKVYDQTEKEMARLYNPFIITLAKTPV

In [10]:
with open('../../6-14-18_filtered_data/validate_99.pkl', 'rb') as f:
    t = pickle.load(f)

In [11]:
t

[('MMAKATMAFCFLLMLTTVMLP',
  'TEGKTIAGRTDCEQHTDCSAASGPVYCCQDSDCCGGVDYICTNYGQCVRHF'),
 ('MLMRLYTFFAAALLACCAAA',
  'GPLHPELPQLVGKSWIPDWWFPFPRPSTRAATTTTTPATSTTGLATTTTKPTTTSSKPVTPTPQPATSTAQPAISSTANATATATASSASTSTTSSSTSA'),
 ('MARMGLAGAAGRWWGLALGLTAFFLPGTHT',
  'QVVQVNDSMYGFIGTDVVLHCSFANPLPSVKITQVTWQKASNGSKQNMAIYNPTMGVSVLPPYEKRVEFLRPSFIDGTIRLSGLELEDEGMYICEFATFP'),
 ('MNKLIRRAVTIFAVTSVASLFA',
  'SGVLETSMAESLSTNVISLADTKAKETTSHQKDRKARKNHQNRTSVVRKEVTAVRDTKAVEPRQDSCFGKMYTVKVNDDRNVEIVQSVPEYATVGSPYPI'),
 ('MKGIFLVVQLGFSIMVFLFLAAVNW',
  'YQGSELVSDRFDWNYTAKLSKLLNGIDAVSSPKQISQLDFFIYSAKHYPVMSALMIISFLYVLAALFLLIYSVKCNKQEIHLDC'),
 ('MAALPVLVLVLLLACGGPRAAG',
  'QKRKEMVLSEKVSQLMEWTSKRSVIRMNGDKFRRLVKAPPRNYSVIVMFTALQPHRQCVVCKQADEEYQVLANSWRYSSAFTNKIFFAMVDFDEGSDVFQ'),
 ('MQSLLLALLLLPVCSPGGA',
  'SNLHEDLTLLRTDLALRLYRSVAAAGNQTNLVLSPAGAFIPLELLQFGARGNTGRQLAQALGYTVHDPKVREWLQTVYAVLPSTNPGAKLELACTLYVQT'),
 ('MAMLWLAVLLTCGAPAAL',
  'LPTSGVGCPSRCDPASCAPAPTNCPAGETALRCGCCPVCAAAEWERCGEGPEDPLCASGLRCVKNGGVARCQCPSNLPVCGSDGKTYP