In [1]:
import os
import mokapot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append("..")  # make the parent directory importable so data_loader can be found
import data_loader as dl

The "filter_data" function cleans up the original 'before' data so that we do not count decoys or duplicate scans. MokaPot needs the decoys, so they are removed only from the 'before' data.
Any scan with a "nan" value in the precursor intensity column has that value replaced with 0.

In [2]:
def filter_data(df, cutoff, prob_column='PEP'):
    #note: cutoff is not used inside this function; the caller applies the threshold
    #drop decoys
    df = df[df["decoy"]==False]
    #sort by the probability column (PEP by default), lowest first
    df = df.sort_values(prob_column)
    #drop duplicate scans
    df = df.drop_duplicates(subset=["scan"], keep="first") #keep best scoring (lowest PEP)
    #replace any "nan" precursor intensity with 0 (returns a new frame, avoiding chained assignment)
    df = df.fillna({'Precursor Intensity': 0})
    
    return df
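
To make the order of operations concrete, here is a minimal sketch of the same steps on a made-up table. The column names follow the function above; the values are invented purely for illustration and do not come from our data.

```python
import numpy as np
import pandas as pd

# Toy PSM table; every value here is invented for illustration.
toy = pd.DataFrame({
    "scan": [1, 1, 2, 3],
    "decoy": [False, False, True, False],
    "PEP": [0.20, 0.01, 0.05, 0.03],
    "Precursor Intensity": [100.0, np.nan, 50.0, 75.0],
})

targets = toy[toy["decoy"] == False]            # drop decoys (scan 2 is removed)
targets = targets.sort_values("PEP")            # lowest PEP first
targets = targets.drop_duplicates(subset=["scan"], keep="first")  # one row per scan
targets = targets.fillna({"Precursor Intensity": 0})              # nan -> 0

print(targets)
```

Scan 1 appears twice, and only its lower-PEP row survives; that row had a missing intensity, which becomes 0.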

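MokaPot needs a boolean column that marks whether each scan is a target (True) or a decoy (False). The "make_decoy_col_maxquant" function generates that column: MaxQuant flags decoys with a "+" in the "Reverse" column, so those rows get False.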

In [3]:
def make_decoy_col_maxquant(row):
    #a "+" in the Reverse column marks a decoy, so return False (not a target) for those rows
    return not row["Reverse"].startswith("+")
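
As a quick check of the labeling logic, the sketch below applies the function to a toy "Reverse" column and compares it with a vectorized equivalent. It assumes that "Reverse" holds strings with no missing values (which we take to be handled upstream by the data loader); the toy values are invented.

```python
import pandas as pd

# Toy frame: "+" marks a decoy hit, an empty string marks a target.
toy = pd.DataFrame({"Reverse": ["", "+", "", "+"]})

def make_decoy_col_maxquant(row):
    # True for targets, False for decoys
    return not row["Reverse"].startswith("+")

row_wise = toy.apply(make_decoy_col_maxquant, axis=1)

# Vectorized alternative; assumes every entry in "Reverse" is a string.
vectorized = ~toy["Reverse"].str.startswith("+")

print(row_wise.tolist())
print(vectorized.tolist())
```

Both approaches give the same labels; the vectorized form is usually faster on large tables.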

Here we read in the data and clean it up before we run it through MokaPot. 
We drop any duplicate scans before the data is fed into MokaPot. We also rename the scan number column so that all of our files will match in the end and we can merge them into one megascript. 

In [4]:
#Reading in the data and formatting it for MokaPot
def get_data_for_MokaPot(filename):
    mq_df =  dl.clean_maxquant(filename)
    #True for targets, False for decoys
    mq_df["target_column"] = mq_df.apply(make_decoy_col_maxquant, axis = 1)
    
    #Dropping any rows that are missing the sequence
    mq_df['Sequence'] = mq_df['Sequence'].replace(' ', np.nan)
    mq_df = mq_df.dropna(subset=['Sequence'])
    
    #sort and drop duplicate scans
    mq_df = mq_df.sort_values('PEP')
    mq_df = mq_df.drop_duplicates(subset=["scan"], keep="first") #keep best scoring (lowest PEP)
    
    #replace any "nan" precursor intensity with 0
    mq_df = mq_df.fillna({'Precursor Intensity': 0})
    
    #rename the scan column so it matches our other files
    mq_df = mq_df.rename(columns = {"scan": "ScanNr"})
    
    return mq_df

This returns a dataset with the decoys, duplicate scans, and rows without a sequence removed. It has not been run through MokaPot. Its purpose is to let us count how many scans we originally have at or under a specific cutoff. 

In [5]:
def get_PreMokaPot_data(filename):
    mq_df =  dl.clean_maxquant(filename)
    
    #Formatting and dropping any rows that are missing the sequence (Do we still want this??)
    mq_df['Sequence'] = mq_df['Sequence'].replace(' ', np.nan)
    mq_df = mq_df.dropna(subset=['Sequence'])
    
    #the 0.01 cutoff is passed through but not applied inside filter_data
    mq_df = filter_data(mq_df, 0.01)
    
    return mq_df
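
Since the returned table is not thresholded, the count at a cutoff is taken afterwards. A minimal sketch of that count, using invented PEP values:

```python
import pandas as pd

# Toy cleaned table; PEP values are invented for illustration.
cleaned = pd.DataFrame({"PEP": [0.001, 0.01, 0.02, 0.5]})

cutoff = 0.01
# Boolean mask of rows at or under the cutoff; summing it counts the True values.
n_passing = int((cleaned["PEP"] <= cutoff).sum())

print(n_passing)
```

This is equivalent to taking len() of the filtered table, as done in the loop further down.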


In [8]:
#all the files we want to run through MokaPot here 
file_names = ["2ng_rep1", "2ng_rep2", "2ng_rep3", "2ng_rep4", "2ng_rep5", "2ng_rep6",
             "0.2ng_rep1", "0.2ng_rep2", "0.2ng_rep3", "0.2ng_rep4", "0.2ng_rep5", "0.2ng_rep6"]

This is where we run each input file through MokaPot. Each file we read in is analyzed and the results are saved in a separate output file.

MokaPot requires feature columns, which it uses to assign each scan a new q-value. The MaxQuant columns we have chosen to use as features are: 'Precursor Intensity', 'Score', 'Length', 'Missed cleavages', 'm/z', 'Mass', 'Retention time', and 'Delta score'. 'Charge' is also included as a one-hot encoding, per the recommendation of the author of MokaPot. 
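
For readers unfamiliar with one-hot encoding, the sketch below shows what it does to a toy 'Charge' column: each distinct charge state becomes its own indicator column. The values are invented, and 'Score' stands in for the other numeric features.

```python
import pandas as pd

# Toy table; values are invented for illustration.
toy = pd.DataFrame({"Charge": [2, 3, 2, 4], "Score": [80.1, 55.3, 91.0, 40.2]})

# One indicator column per charge state: Charge_2, Charge_3, Charge_4.
charge_feat = pd.get_dummies(toy["Charge"], prefix="Charge")
toy = pd.concat([toy, charge_feat], axis=1)

# The indicator columns are then listed alongside the numeric features.
feature_columns = list(charge_feat.columns) + ["Score"]

print(toy)
print(feature_columns)
```

Encoding charge this way lets the model treat each charge state separately instead of assuming that charge has a linear effect.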

In [9]:
for file in file_names:
    mq_cleaned_df = get_PreMokaPot_data(file)
    mq_df = get_data_for_MokaPot(file)

    #one-hot encode the charge state
    charge_feat = pd.get_dummies(mq_df["Charge"], prefix="Charge")
    mq_df = pd.concat([mq_df, charge_feat], axis=1)

    feature_columns = list(charge_feat.columns) + [
        "Precursor Intensity", 'Score', 'Length', 'Missed cleavages',
        'm/z', 'Mass', 'Retention time', 'Delta score',
    ]

    mq_for_MP = mokapot.dataset.LinearPsmDataset(mq_df, target_column = "target_column", spectrum_columns = "ScanNr", 
                                                 peptide_column = "peptide", protein_column=None, 
                                                 group_column=None, feature_columns=feature_columns, copy_data=True)
    results, models = mokapot.brew(mq_for_MP)

    results_df = results.psms
    results_df.to_csv("MokaPot_Output/MaxQuant/mq_" + file + ".csv")
    
    #both counts below keep PSMs at or below the 0.01 threshold
    print("\n" + "The number of PSMs found at or above 0.01 for file " + file + ":")   
    print("\t" + "MaxQuant: " + str(len(mq_cleaned_df[mq_cleaned_df['PEP'] <= 0.01])))
    print("\t""MaxQuant and MokaPot: " + str(len(results.psms[results.psms['mokapot q-value'] <= 0.01]))) 


The number of PSMs found at or above 0.01 for file 2ng_rep1:
	MaxQuant: 7900
	MaxQuant and MokaPot: 12050

The number of PSMs found at or above 0.01 for file 2ng_rep2:
	MaxQuant: 8210
	MaxQuant and MokaPot: 11372

The number of PSMs found at or above 0.01 for file 2ng_rep3:
	MaxQuant: 6067
	MaxQuant and MokaPot: 9449

The number of PSMs found at or above 0.01 for file 2ng_rep4:
	MaxQuant: 6401
	MaxQuant and MokaPot: 9145

The number of PSMs found at or above 0.01 for file 2ng_rep5:
	MaxQuant: 11196
	MaxQuant and MokaPot: 14226

The number of PSMs found at or above 0.01 for file 2ng_rep6:
	MaxQuant: 10361
	MaxQuant and MokaPot: 13270

The number of PSMs found at or above 0.01 for file 0.2ng_rep1:
	MaxQuant: 4096
	MaxQuant and MokaPot: 6172

The number of PSMs found at or above 0.01 for file 0.2ng_rep2:
	MaxQuant: 4077
	MaxQuant and MokaPot: 6035

The number of PSMs found at or above 0.01 for file 0.2ng_rep3:
	MaxQuant: 2809
	MaxQuant and MokaPot: 4429

The number of PSMs found at or ab