This notebook shows how to build a predicted spectral library with Seq2MS from a text file produced by in silico digestion of a FASTA file.

In [None]:
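# Import the necessary libraries
import math
from pyteomics import mgf, mass
import argparse
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
# Project helpers; expected to provide embed_maxquant, asnp32, write_msp,
# masked_spectral_distance, MAX_PEPTIDE_LENGTH and encoding_dimension
from utils import *
import matplotlib.pyplot as plt
import random
import sys
import pickle
import tensorflow.keras as k
# Note: K refers to the Keras backend (an earlier "keras as K" alias was shadowed by this import)
from tensorflow.keras import backend as K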

First, we load the list of peptides we want to predict spectra for. (Note: different digestion tools may produce a different output layout and need a different import step.)

In [None]:
data = pd.read_csv('digested_proteins.txt', sep='\t')
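
Because digestion tools differ in their output layout, it can help to check the header before going further. The following is a minimal sketch, assuming a tab-separated file whose header carries the three columns read later in this notebook (Sequence, Monoisotopic_Mass, Protein_Name). The helper name missing_columns is hypothetical and not part of Seq2MS, and the example row is made up.

```python
import csv
import io

# Columns this notebook reads from the digestion output.
REQUIRED_COLUMNS = ("Sequence", "Monoisotopic_Mass", "Protein_Name")


def missing_columns(handle, required=REQUIRED_COLUMNS, delimiter="\t"):
    """Return the required column names absent from the file's header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Illustrative in-memory file; the row below is invented for the example.
example = io.StringIO(
    "Protein_Name\tSequence\tMonoisotopic_Mass\n"
    "PROT1\tPEPTIDEK\t927.45\n"
)
print(missing_columns(example))
```

With a real file, an open file handle such as open('digested_proteins.txt', newline='') can be passed in place of the StringIO object. An empty list means all expected columns are present.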

Convert the text file into our import format by extracting the sequence, mass and protein name of each peptide. Then we add the charge state that we want to predict spectra for (here, 2). The peptides are filtered to lengths 7 to 25, and sequences containing the residue codes X, U or O are dropped.

In [None]:
# Collect the fields we need from each row of the digestion output
df = []
for i in range(len(data)):
    row = data.iloc[i]
    df.append({'Sequence': row['Sequence'], 'Mass': row['Monoisotopic_Mass'], 'len': len(row['Sequence']), 'Protein': row['Protein_Name']})

print(len(df))
df = pd.DataFrame(df)

# Keep peptides of length 7 to 25
df = df[df['Sequence'].str.len() <= 25]
df = df[df['Sequence'].str.len() >= 7]

# Drop sequences containing the residue codes X, U or O
df = df[~df['Sequence'].str.contains('X')]
df = df[~df['Sequence'].str.contains('U')]
df = df[~df['Sequence'].str.contains('O')]

# Predict for doubly charged, unmodified peptides
df['Charge'] = 2
df['Modified sequence'] = df['Sequence']
df['Modification'] = ''

data = pd.DataFrame(df)
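
The filtering rules in the cell above can also be written as a single predicate, which is convenient for checking individual sequences. This is a minimal sketch in plain Python; the name is_predictable and the constants are hypothetical helpers introduced for illustration, not part of Seq2MS.

```python
# Residue codes excluded by the filter above.
EXCLUDED_RESIDUES = frozenset("XUO")
MIN_LENGTH = 7
MAX_LENGTH = 25


def is_predictable(sequence):
    """Mirror the DataFrame filter: length 7 to 25 and no X, U or O."""
    if not MIN_LENGTH <= len(sequence) <= MAX_LENGTH:
        return False
    return EXCLUDED_RESIDUES.isdisjoint(sequence)


# Only the first sequence passes: the others are too short, contain X, or are too long.
peptides = ["PEPTIDEK", "SHORT", "PEPTXDEK", "A" * 26]
print([p for p in peptides if is_predictable(p)])
```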

Next, we encode the peptide data as input for the model.

In [None]:
embedded_data = asnp32([embed_maxquant(data.iloc[i]) for i in range(len(data))])

To feed the encoded data to the model in batches, we wrap it in a Keras Sequence generator.

In [None]:
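class input_generator(k.utils.Sequence):
    """Yield (encoded peptides, peptide metadata) batches for prediction."""
    def __init__(self, spectra, data, batch_size):
        self.spectra = spectra  # encoded peptides (model input)
        self.batch_size = batch_size
        self.data = data  # peptide table, row-aligned with the encodings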

    def __len__(self):
        return math.ceil(len(self.spectra) / self.batch_size)

    def __getitem__(self, idx):
        start_idx = idx * self.batch_size
        end_idx = min(start_idx + self.batch_size, len(self.spectra))

        return (self.spectra[start_idx: end_idx], self.data[start_idx: end_idx])

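# types and shapes describe the encoded input; they are not used further in this notebook
types = (tf.float16)
shapes = ((MAX_PEPTIDE_LENGTH+2, encoding_dimension))
# batch_size is assumed to be defined already (for example by utils); set it here if it is not
generator = input_generator(embedded_data, data, batch_size)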
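
To see how the generator splits the data, the index arithmetic from __len__ and __getitem__ can be reproduced on its own. This is a minimal sketch under the same slicing logic; batch_bounds is a hypothetical helper, and the sizes 10 and 4 are chosen only for illustration.

```python
import math


def batch_bounds(n_items, batch_size):
    """List the (start, end) slice bounds that a batched generator walks through."""
    n_batches = math.ceil(n_items / batch_size)
    return [
        (idx * batch_size, min((idx + 1) * batch_size, n_items))
        for idx in range(n_batches)
    ]


# 10 encoded peptides in batches of 4 give two full batches and one partial batch.
print(batch_bounds(10, 4))
```

The last batch is shorter whenever the number of peptides is not a multiple of the batch size, so no peptide is dropped.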

Now that the input data is ready, we can load the model from disk and compile it for use.

In [None]:
pm = k.models.load_model('seq2ms_pretrained_model', custom_objects={'masked_spectral_distance': masked_spectral_distance})
# learning_rate replaces the deprecated lr argument
pm.compile(optimizer=k.optimizers.Adam(learning_rate=0.0003), loss=masked_spectral_distance, metrics=[tf.keras.metrics.CosineSimilarity(axis=1), masked_spectral_distance])

Using model.predict, we iterate over the batches, predict the spectra, and write them along with the corresponding peptide labels to the output library. The output library is written to disk as 'example_library.msp'.

In [None]:
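# The with block closes the file automatically, so no explicit f.close() is needed
with open('example_library.msp', 'w') as f:
    for batch_inputs, batch_peptides in generator:
        pred_spectra = pm.predict(batch_inputs, verbose=2)
        write_msp(f, pred_spectra, batch_peptides)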