This notebook prepares files for use with AutoRT. The first part builds the data set used to train AutoRT.
The second part formats each individual file so it can be run through AutoRT. AutoRT expects a specific input format: modifications must be encoded as numbers, the peptide column must be named "x", and the observed retention time column must be named "y".

In [1]:
import os
import mokapot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append("..")  # make the parent directory importable so data_loader can be found
import data_loader as dl

The purpose of this function is to clean up the original 'before' data so that we are not counting decoys or duplicate scans.

In [2]:
def filter_data(df, prob_column):
    #drop decoys
    df = df[df["decoy"]==False]
    #sort ascending by q-value so the best match for each scan comes first
    df = df.sort_values(prob_column)
    #drop duplicate scans, keeping the lowest q-value (best) row per scan
    df = df.drop_duplicates(subset=["scan"], keep="first")
    
    return df
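
As a quick illustration of what the filter does, here is a minimal sketch on a made-up table. The column names mirror the ones used in `filter_data`, but the values are invented and the steps are written inline so the cell runs on its own.

```python
import pandas as pd

# Toy table; the values are made up for illustration only.
toy = pd.DataFrame({
    "scan":   [1, 1, 2, 3],
    "decoy":  [False, False, True, False],
    "QValue": [0.05, 0.00, 0.00, 0.01],
})

# Same three steps as filter_data: drop decoys, sort ascending by q-value,
# then keep the first (lowest q-value) row for each scan.
targets = toy[~toy["decoy"]]
best_per_scan = (
    targets.sort_values("QValue")
           .drop_duplicates(subset=["scan"], keep="first")
)
print(best_per_scan)
```

Scan 2 disappears because its only row is a decoy, and scan 1 keeps the row with q-value 0.00 rather than 0.05.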

AutoRT requires that all modifications be input as numbers. Here we format the modifications accordingly: 1 represents oxidation on M, 2 represents carbamidomethyl on C, and 3 represents carbamidomethyl on U.

In [3]:
def format_oxidation(row, column, to_replace):
    peptide = row[column]
    replace_with = "1"
    if pd.isna(peptide):
        new_pep = peptide
    else:
        if to_replace in peptide:
            new_pep = peptide.replace(to_replace, replace_with)
        else:
            new_pep = peptide
    return new_pep


def format_carbamidomethyl(row, column, to_replace):
    peptide = row[column]
    replace_with = "2"
    if pd.isna(peptide):
        new_pep = peptide
    else:
        if to_replace in peptide:
            new_pep = peptide.replace(to_replace, replace_with)
        else:
            new_pep = peptide
    return new_pep


def format_carbamidomethyl2(row, column, to_replace):
    peptide = row[column]
    replace_with = "3"
    if pd.isna(peptide):
        new_pep = peptide
    else:
        if to_replace in peptide:
            new_pep = peptide.replace(to_replace, replace_with)
        else:
            new_pep = peptide
    return new_pep
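
The three functions above differ only in the replacement code, so the same idea can be sketched with a single lookup table. `MOD_CODES` and `encode_mods` are hypothetical names introduced here for illustration; the tag strings and codes are the ones used in this notebook.

```python
import pandas as pd

# Modification tags as they appear in this notebook, mapped to numeric codes.
MOD_CODES = {
    "[Common Variable:Oxidation on M]": "1",
    "[Common Fixed:Carbamidomethyl on C]": "2",
    "[Common Fixed:Carbamidomethyl on U]": "3",
}

def encode_mods(peptide):
    """Hypothetical helper: replace every known tag with its numeric code."""
    if pd.isna(peptide):
        return peptide  # leave missing values untouched
    for tag, code in MOD_CODES.items():
        peptide = peptide.replace(tag, code)
    return peptide

example = "ELTSTC[Common Fixed:Carbamidomethyl on C]SPIISK"
print(encode_mods(example))
```

With a helper like this, a column could be converted with `df["Full Sequence"].apply(encode_mods)` instead of one row-wise `apply` per modification.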

Occasionally a scan comes back with multiple candidate peptides. We keep only the first one, because AutoRT cannot handle multiple peptides separated by a "|".

In [4]:
#pulling only one peptide out
def format_peptide(row):
    string = row
    if '|' in string:
        spot = string.find('|')
        string = string[ :spot]
    
    return string
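
For reference, the same behaviour can be sketched with `str.split`. `first_peptide` is a hypothetical name, and the input strings are made up.

```python
def first_peptide(sequence):
    """Hypothetical equivalent of format_peptide using str.split."""
    # split("|", 1)[0] returns the whole string when no "|" is present
    return sequence.split("|", 1)[0]

print(first_peptide("MLVSGAGDIK|MIVSGAGDLK"))
print(first_peptide("GDFCIQVGR"))
```

Like `format_peptide`, this assumes the input is a string; a missing value would need to be handled before calling it.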

We keep only the scans with a q-value of 0 to use as training data for AutoRT.

In [5]:
#pulling the best scans based off of qvalue. 
df = dl.clean_metamorph("2ng_rep5")
df = filter_data(df, "QValue")
#Taking only the best scoring
df = df[df["QValue"]== 0.0]

df["RT_formatted_peptides"] = df["Full Sequence"].apply(format_peptide)

df["RT_formatted_peptides"] = df.apply(lambda row: format_oxidation(row, "RT_formatted_peptides", "[Common Variable:Oxidation on M]"), axis=1)
df["RT_formatted_peptides"] = df.apply(lambda row: format_carbamidomethyl(row, "RT_formatted_peptides", "[Common Fixed:Carbamidomethyl on C]"), axis=1)


In [6]:
df = df.filter(['RT_formatted_peptides', 'Scan Retention Time'])
df.rename(columns = {'RT_formatted_peptides' : 'x', 'Scan Retention Time': "y"}, inplace = True)
df.to_csv("RT_training.tsv", sep ='\t')

Unnamed: 0,x,y
0,LVQDVANNTNEEAGDGTTTATVLAR,54.35835
7285,ELTSTC2SPIISK,46.61025
7286,MLVSGAGDIK,46.86372
7287,GDFC2IQVGR,55.17063
7288,TLQTISLLGYMK,86.17753
...,...,...
3647,VAQVAEITYGQK,46.79825
3634,MIAAVDTDSPR,41.81118
3646,SITILSTPEGTSAAC2K,57.68496
3645,IWSVPNASC2VQVVR,69.54589
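
The `Unnamed: 0` column in the output above is the dataframe index, which `to_csv` writes by default. Whether AutoRT tolerates that extra column is not something this notebook establishes, so the sketch below only shows how the column appears and how `index=False` leaves it out. The two rows are copied from the output above.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"x": ["ELTSTC2SPIISK", "MLVSGAGDIK"], "y": [46.61025, 46.86372]},
    index=[7285, 7286],
)

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index, sep="\t")
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index, sep="\t").columns)

# index=False writes only the x and y columns.
without_index = io.StringIO()
toy.to_csv(without_index, sep="\t", index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index, sep="\t").columns)

print(cols_with_index)
print(cols_without_index)
```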


Formatting the 2ng and 0.2ng files so that they can be run through AutoRT.

In [7]:
all_files = ["2ng_rep1", "2ng_rep2", "2ng_rep3", "2ng_rep4", "2ng_rep5", "2ng_rep6", 
             "0.2ng_rep1", "0.2ng_rep2", "0.2ng_rep3", "0.2ng_rep4", "0.2ng_rep5", "0.2ng_rep6"]

AutoRT will not accept the amino acid "U". Only 2 scans in all of our data contain it, so we simply remove them.
The loop below goes through all of the files and prepares each one to be run through AutoRT.

In [8]:
#Getting the data to predict
for file in all_files:
    df = dl.clean_metamorph(file)

    df = df.loc[~(df["peptide"].str.contains("U"))] #removing the U

    df["RT_formatted_peptides"] = df["Full Sequence"].apply(format_peptide)


    #formatting the modifications
    df["RT_formatted_peptides"] = df.apply(lambda row: format_oxidation(row, "RT_formatted_peptides", "[Common Variable:Oxidation on M]"), axis=1)
    df["RT_formatted_peptides"] = df.apply(lambda row: format_carbamidomethyl(row, "RT_formatted_peptides", "[Common Fixed:Carbamidomethyl on C]"), axis=1)
    df["RT_formatted_peptides"] = df.apply(lambda row: format_carbamidomethyl2(row, "RT_formatted_peptides", "[Common Fixed:Carbamidomethyl on U]"), axis=1)

    
    df.rename(columns = {'RT_formatted_peptides' : 'x', 'Scan Retention Time': "y"}, inplace = True)

    df.to_csv("data_for_AutoRT/" + file + "_to_predict.tsv", sep='\t') 
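
Before handing the files to AutoRT, it can help to check that no unconverted modification tags or unsupported residues remain. The sketch below is one way to do that. It assumes that a fully formatted peptide contains only the 20 standard amino-acid letters plus digit codes; that assumption and the example peptides are illustrative and are not taken from AutoRT's documentation.

```python
import pandas as pd

# Assumption: valid peptides use only the 20 standard residues and digits.
VALID_PATTERN = r"^[ACDEFGHIKLMNPQRSTVWY0-9]+$"

peptides = pd.Series([
    "ELTSTC2SPIISK",                 # formatted correctly
    "M1LVSGAGDIK",                   # formatted correctly
    "AC[Some Other:Modification]K",  # made-up tag that was never converted
    "SEUK",                          # contains U
])

is_valid = peptides.str.match(VALID_PATTERN)
invalid = peptides[~is_valid].tolist()
print(invalid)
```

Any peptide that shows up in `invalid` either carries a modification that has no numeric code yet or contains a residue outside the assumed alphabet.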