In [1]:
import os
import mokapot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append("..")  # make the parent directory importable so data_loader can be found
import data_loader as dl

The function below cleans the original 'before' data so that decoys and duplicate scans are not counted. Mokapot needs the decoys, so they are removed only from the 'before' data.

In [2]:
def filter_data(df, prob_column='QValue'):
    # drop decoys
    df = df[df["decoy"]==False]
    # sort by q-value so the best-scoring PSM for each scan comes first
    df = df.sort_values(prob_column)
    # drop duplicate scans, keeping the highest-scoring (lowest q-value) PSM
    df = df.drop_duplicates(subset=["scan"], keep="first")

    return df
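
As a quick check of the behavior described above, the sketch below applies the same three steps to a small made-up table. The scan numbers, q-values, and decoy flags are invented for illustration and are not taken from the MSGF+ output.

```python
import pandas as pd

# Toy PSM table (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "scan":   [1, 1, 2, 3, 3],
    "QValue": [0.020, 0.001, 0.005, 0.000, 0.010],
    "decoy":  [False, False, False, True, False],
})

# Same steps as filter_data: drop decoys, sort by q-value, keep the best PSM per scan
kept = (
    toy[toy["decoy"] == False]
    .sort_values("QValue")
    .drop_duplicates(subset=["scan"], keep="first")
)

print(kept)
```

Because decoys are removed before deduplication, scan 3 keeps its target PSM even though a decoy for the same scan had a lower q-value.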

MokaPot needs a Boolean column that marks whether each PSM is a target (True) or a decoy (False). The "make_decoy_col_msgf" function generates that column, treating any protein whose name starts with "XXX_" as a decoy.

In [3]:
def make_decoy_col_msgf(row):
    # Decoy proteins carry the "XXX_" prefix: return True for targets, False for decoys
    return not row["Protein"].startswith("XXX_")
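
The same column can also be built with a single vectorized expression instead of a row-wise `apply`. The sketch below is an equivalent alternative, not the code used in this notebook. The protein names are made up, and it assumes decoys are marked with the `XXX_` prefix as in the function above.

```python
import pandas as pd

# Hypothetical protein identifiers, for illustration only
proteins = pd.DataFrame({
    "Protein": ["sp|P12345|ABC_HUMAN", "XXX_sp|P12345|ABC_HUMAN", "sp|Q67890|DEF_HUMAN"],
})

# True for targets, False for decoys (names starting with "XXX_")
proteins["target_column"] = ~proteins["Protein"].str.startswith("XXX_")

print(proteins)
```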

This function returns the "before" dataset. We clean the data by dropping decoys and duplicate scans so that we can count how many scans originally fall at or under a specific q-value cutoff.

In [4]:
def get_PreMokaPot_data(file):
    msg_cleaned_df = dl.clean_msgfplus(file)
    msg_cleaned_df = filter_data(msg_cleaned_df)
    return msg_cleaned_df

Here we read in the data and clean it up before running it through MokaPot. We drop any duplicate scans before the data is fed into MokaPot. We also rename the scan number column so that all of our files match in the end and can be merged into one megascript.

In [5]:
#Reading in the data and formatting it for MokaPot
def get_data_for_MokaPot(file):
    msg_df = dl.clean_msgfplus(file)
    
    #adding target column for MokaPot
    msg_df["target_column"] = msg_df.apply(make_decoy_col_msgf, axis = 1)
    
    #sort by qvalue and drop duplicate scans
    msg_df = msg_df.sort_values('QValue')
    msg_df = msg_df.drop_duplicates(subset=["scan"], keep="first") #keep highest scoring
    
    msg_df = msg_df.rename(columns = {"scan": "ScanNr"})
    
    return msg_df

This is where we run each input file through MokaPot. Each file we read in is analyzed and the results are saved in a separate output file.
MokaPot requires feature columns, which it uses to assign each scan a new q-value. The columns from MSGF+ that we have chosen to use as features are: 'IsotopeError', 'PrecursorError(ppm)', 'DeNovoScore', 'MSGFScore', and 'SpecEValue'. 'Charge' is also included as a one-hot encoded feature, per the recommendation of the author of MokaPot.
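
To make the one-hot encoding of 'Charge' concrete, the sketch below runs `pd.get_dummies` on a made-up column of charge states. The values are hypothetical. It only illustrates the shape of the indicator columns that get appended as features.

```python
import pandas as pd

# Hypothetical charge states, for illustration only
toy = pd.DataFrame({"Charge": [2, 3, 2, 4]})

# One indicator column per distinct charge state, named Charge_<value>
charge_feat = pd.get_dummies(toy["Charge"], prefix="Charge")
toy = pd.concat([toy, charge_feat], axis=1)

print(list(charge_feat.columns))
print(toy)
```

Since the indicator columns are derived from the values present in each file, a file that lacks a given charge state would presumably end up with a different set of feature columns. This matters only if models or features are compared across files.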

In [6]:
#List the names of all input files here
file_names = ["2ng_rep1", "2ng_rep2", "2ng_rep3", "2ng_rep4", "2ng_rep5", "2ng_rep6",
             "0.2ng_rep1", "0.2ng_rep2", "0.2ng_rep3", "0.2ng_rep4", "0.2ng_rep5", "0.2ng_rep6"]

In [9]:
for file in file_names:
    msg_cleaned_df = get_PreMokaPot_data(file)
    msg_df = get_data_for_MokaPot(file)

    # One-hot encode the charge state and append the indicator columns
    charge_feat = pd.get_dummies(msg_df["Charge"], prefix="Charge")
    msg_df = pd.concat([msg_df, charge_feat], axis=1)

    # Features MokaPot uses to rescore each PSM
    feature_columns = list(charge_feat.columns) + [
        "IsotopeError", "PrecursorError(ppm)", "DeNovoScore", "MSGFScore", "SpecEValue",
    ]

    msg_for_MP = mokapot.dataset.LinearPsmDataset(
        msg_df,
        target_column="target_column",
        spectrum_columns="ScanNr",
        peptide_column="peptide",
        protein_column=None,
        group_column=None,
        feature_columns=feature_columns,
        copy_data=True,
    )
    
    results, models = mokapot.brew(msg_for_MP)
    
    results_df = results.psms
    results_df.to_csv("MokaPot_Output/MsgfPlus/msg_" + file + ".csv")
    
    print("The number of PSMs found at or below 0.01 for " + file + ":")
    # msg_df still contains decoys here; msg_cleaned_df holds the decoy-free "before" data
    print("\tMSGF+: " + str(len(msg_df[msg_df['QValue'] <= 0.01])))
    print("\tMSGF+ and MokaPot: " + str(len(results.psms[results.psms['mokapot q-value'] <= 0.01])))

The number of PSMs found at or below 0.01 for 2ng_rep1:
	MSGF+: 12014
	MSGF+ and MokaPot: 11931

The number of PSMs found at or below 0.01 for 2ng_rep2:
	MSGF+: 11808
	MSGF+ and MokaPot: 11727

The number of PSMs found at or below 0.01 for 2ng_rep3:
	MSGF+: 9684
	MSGF+ and MokaPot: 9580

The number of PSMs found at or below 0.01 for 2ng_rep4:
	MSGF+: 9513
	MSGF+ and MokaPot: 9421

The number of PSMs found at or below 0.01 for 2ng_rep5:
	MSGF+: 14406
	MSGF+ and MokaPot: 14314

The number of PSMs found at or below 0.01 for 2ng_rep6:
	MSGF+: 13520
	MSGF+ and MokaPot: 13392

The number of PSMs found at or below 0.01 for 0.2ng_rep1:
	MSGF+: 6322
	MSGF+ and MokaPot: 6261

The number of PSMs found at or below 0.01 for 0.2ng_rep2:
	MSGF+: 6305
	MSGF+ and MokaPot: 6249

The number of PSMs found at or below 0.01 for 0.2ng_rep3:
	MSGF+: 4505
	MSGF+ and MokaPot: 4442

The number of PSMs found at or below 0.01 for 0.2ng_rep4:
	MSGF+: 4268
	MSGF+ and MokaPot: 4221

The number of PSMs found at or below 0.01 for 0.2ng_rep5:
	MSGF+: 4887
	MSGF+ and MokaPot: 4839

The number of PSMs found at or below 0.01 for 0.2ng_rep6:
	MSGF+: 4387
	MSGF+ and MokaPot: 4345
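
The per-file counts printed above are easier to compare side by side. The sketch below copies those numbers into a dictionary and prints the absolute and relative change for each file. The `counts` dictionary and the tab-separated layout are illustrative choices, not part of the original analysis.

```python
# PSM counts at the 0.01 q-value cutoff, copied from the output above:
# file name -> (MSGF+, MSGF+ and MokaPot)
counts = {
    "2ng_rep1": (12014, 11931),
    "2ng_rep2": (11808, 11727),
    "2ng_rep3": (9684, 9580),
    "2ng_rep4": (9513, 9421),
    "2ng_rep5": (14406, 14314),
    "2ng_rep6": (13520, 13392),
    "0.2ng_rep1": (6322, 6261),
    "0.2ng_rep2": (6305, 6249),
    "0.2ng_rep3": (4505, 4442),
    "0.2ng_rep4": (4268, 4221),
    "0.2ng_rep5": (4887, 4839),
    "0.2ng_rep6": (4387, 4345),
}

print("file\tMSGF+\tMokaPot\tchange\tchange (%)")
for name, (before, after) in counts.items():
    change = after - before
    pct = 100.0 * change / before
    print(f"{name}\t{before}\t{after}\t{change:+d}\t{pct:+.2f}")
```

In every file the MokaPot count is slightly lower than the MSGF+ count.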
