# MAPSI Tutorial

## Overview
MAPSI (MzML And Processed Spectra Interface) helps you combine mzML files with their corresponding PSM, peptide, and protein quantification files and view the result as a single multiIndexed pandas dataframe.

Depending on whether you are using MetaMorpheus or MSFragger files, you will want to import the corresponding parser script. In this tutorial I will import both parser scripts.

In [1]:
from MSFragger_Parser import parse_files as msfragger_parser
from MetaMorpheus_Parser import parse_files as MetaMorpheus_parser

In [12]:
import pandas as pd

## Test Files

In [2]:
# mzML test file
mzml_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1.mzML"

# MSFragger test files
msfragger_psm_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\psm1.tsv"
msfragger_peptide_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\combined_peptide.tsv"
msfragger_protein_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\combined_protein.tsv"

# MetaMorpheus test files
mzml_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1.mzML"
mm_psm_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1-calib_PSMs.psmtsv"
mm_peptideQ_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedPeptides.tsv"
mm_protein_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedProteinGroups.tsv"

# output test file
outfile = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\tester.tsv"

## Inputs
#### input_files
You may pass a list of 1 to 4 file paths in any order. If you only want to read one file, you can pass its path as a single string.

In [3]:
# let's try combining a psm and peptide quantification file
psm_and_peptide_df = msfragger_parser(input_files=[msfragger_peptide_file_path, msfragger_psm_file_path])
psm_and_peptide_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,File Name,Modified Peptide,Charge,Retention,Observed Mass,Calibrated Observed Mass,Observed M/Z,Calibrated Observed M/Z,Calculated Peptide Mass,Calculated M/Z,...,Ex_Auto_K13_30umTA_02ngQC_60m_1 Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_1 MaxLFQ Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_2 MaxLFQ Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_1 MaxLFQ Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Intensity
Protein Accession,Peptide,Scan Number,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
A0A096LP55,EQCEQLEK,5237,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,EQCEQLEK,2,1776.8865,1062.4706,1062.4664,532.2426,532.2405,1062.4651,532.2398,...,0.0,0.0,0.0,0.0,219563.42,234511.55,0.0,0.0,0.0,0.0
A6NHQ2,TNIIPVLEDAR,14755,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,4218.5518,1239.6882,1239.6842,620.8514,620.8494,1239.6823,620.8484,...,523321.72,649737.94,814822.1,1081377.4,560720.75,659080.6,523321.72,649737.94,814822.1,1081377.4
A6NKT7,NSIPEPIDPLFK,16986,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,4795.0662,1368.7386,1368.7316,685.3766,685.3731,1368.7288,685.3717,...,0.0,64618.39,0.0,0.0,33486.43,34693.87,0.0,64618.39,0.0,0.0
A8MWD9,GNSIIMLEALER,18334,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,5152.8337,1344.7134,1344.7035,673.364,673.359,1344.707,673.3608,...,409638.97,376445.3,231965.92,487477.6,501898.06,230573.27,414935.78,203264.5,234965.34,493780.9
B0I1T2,DFLFQDFK,18706,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,5252.6033,1058.5103,1058.5061,530.2624,530.2603,1058.5072,530.2609,...,0.0,0.0,0.0,0.0,59026.492,104090.06,0.0,0.0,0.0,0.0


In [4]:
# we can also just load data from one file
psm_df = msfragger_parser(input_files=msfragger_psm_file_path)
psm_df.head()

converting input_files from a string into list


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,File Name,Modified Peptide,Prev AA,Next AA,Peptide Length,Charge,Retention,Observed Mass,Calibrated Observed Mass,Observed M/Z,...,Protein Start,Protein End,Intensity,Assigned Modifications,Observed Modifications,Is Unique,Protein,Entry Name,Gene,Protein Description
Protein Accession,Peptide,Scan Number,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
A0A096LP55,EQCEQLEK,5237,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,EQCEQLEK,R,C,8,2,1776.8865,1062.4706,1062.4664,532.2426,...,35,42,200838.4,3C(57.0215),,False,sp|A0A096LP55|QCR6L_HUMAN,QCR6L_HUMAN,UQCRHL,"Cytochrome b-c1 complex subunit 6-like, mitoch..."
A6NHQ2,TNIIPVLEDAR,14755,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,R,H,11,2,4218.5518,1239.6882,1239.6842,620.8514,...,220,230,517661.1,,,True,sp|A6NHQ2|FBLL1_HUMAN,FBLL1_HUMAN,FBLL1,rRNA/tRNA 2'-O-methyltransferase fibrillarin-l...
A6NKT7,NSIPEPIDPLFK,16986,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,K,H,12,2,4795.0662,1368.7386,1368.7316,685.3766,...,621,632,30740.525,,,False,sp|A6NKT7|RGPD3_HUMAN,RGPD3_HUMAN,RGPD3,RanBP2-like and GRIP domain-containing protein 3
A8MWD9,GNSIIMLEALER,18334,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,R,-,12,2,5152.8337,1344.7134,1344.7035,673.364,...,64,75,445926.78,,,False,sp|A8MWD9|RUXGL_HUMAN,RUXGL_HUMAN,SNRPGP15,Putative small nuclear ribonucleoprotein G-lik...
B0I1T2,DFLFQDFK,18706,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,R,R,8,2,5252.6033,1058.5103,1058.5061,530.2624,...,537,544,54186.293,,,True,sp|B0I1T2|MYO1G_HUMAN,MYO1G_HUMAN,MYO1G,Unconventional myosin-Ig
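
The message `converting input_files from a string into list` suggests that a lone string is wrapped in a list before parsing. A minimal sketch of that kind of normalization, assuming a hypothetical helper `as_file_list` (this is not MAPSI's own code):

```python
def as_file_list(input_files):
    """Return input_files as a list of paths (hypothetical helper, not part of MAPSI)."""
    if isinstance(input_files, str):
        # A single path given as a string becomes a one-element list.
        return [input_files]
    # Any other iterable of paths is copied into a new list.
    return list(input_files)


single = as_file_list("psm1.tsv")
several = as_file_list(["combined_peptide.tsv", "psm1.tsv"])
```

Either way, the rest of the code can then treat the input as a list.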


Note that MAPSI cannot process more than one file of each type in a single call. To combine several files of the same type (e.g. multiple mzML and PSM files), run each set of corresponding files through MAPSI and then concatenate the resulting dataframes:
```python
list_of_mzml_file_paths = ['mzml_1', 'mzml_2', 'mzml_3']
list_of_psm_file_paths = ['psm_1', 'psm_2', 'psm_3']
# pair each mzML file with its corresponding PSM file
list_of_joined_dataframes = []
for mzml_file_path, psm_file_path in zip(list_of_mzml_file_paths, list_of_psm_file_paths):
    # join data from the mzML and PSM files and append it to 'list_of_joined_dataframes'
    new_df = msfragger_parser(input_files=[mzml_file_path, psm_file_path])
    list_of_joined_dataframes.append(new_df)
# concatenate the per-run dataframes into one
joined_df = pd.concat(list_of_joined_dataframes)
```
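
When dataframes from several runs are concatenated, scan numbers from different runs can coincide. One way to keep them apart is the `keys` argument of `pd.concat`, which adds an outer index level. The sketch below uses made-up toy dataframes in place of real parser output, and the level name `Run` is an assumption chosen for illustration:

```python
import pandas as pd

# Toy stand-ins for the dataframes returned by the parser (made-up values).
run_1 = pd.DataFrame({"Scan Number": [1, 2], "Intensity": [10.0, 20.0]}).set_index("Scan Number")
run_2 = pd.DataFrame({"Scan Number": [1, 2], "Intensity": [30.0, 40.0]}).set_index("Scan Number")

# `keys` adds an outer index level, so identical scan numbers from
# different runs stay distinguishable after concatenation.
labelled_df = pd.concat([run_1, run_2], keys=["run_1", "run_2"], names=["Run"])

# Rows are now addressed by (run label, scan number).
intensity_run_2_scan_1 = labelled_df.loc[("run_2", 1), "Intensity"]
```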

MAPSI recognizes file types by the file extension and name. If MAPSI is having trouble recognizing your files, make sure that:
1. The mzML file extension is `.mzml`
2. The PSM file name contains `psm`
3. The peptide file name contains `peptide` or `pep`
4. The protein file name contains `protein` or `prot`
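
The rules above can be sketched as a small classifier. `guess_file_type` is a hypothetical helper written for this tutorial, not MAPSI's actual implementation; the case-insensitive matching and the order of the checks are assumptions:

```python
from pathlib import PureWindowsPath


def guess_file_type(file_path):
    """Guess the file type from a path (hypothetical helper, not part of MAPSI)."""
    # PureWindowsPath accepts both backslash and forward-slash separators.
    path = PureWindowsPath(file_path)
    name = path.name.lower()
    if path.suffix.lower() == ".mzml":
        return "mzml"
    if "psm" in name:
        return "psm"
    if "pep" in name:  # also matches "peptide"
        return "peptide"
    if "prot" in name:  # also matches "protein"
        return "protein"
    return None  # unrecognized file


file_types = [
    guess_file_type(p)
    for p in ["C:\\data\\run_1.mzML", "psm1.tsv", "combined_peptide.tsv", "combined_protein.tsv"]
]
```

Running a check like this on your own file names before calling the parser can help spot a file that would not be recognized.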

#### output_file_path
This is an optional parameter which allows you to save your dataframe as a `.tsv` file. 

In [18]:
# ex:
msfragger_parser(input_files=[msfragger_protein_file_path, msfragger_peptide_file_path], output_file_path=outfile).head()

Dataframe saved.


Unnamed: 0_level_0,Unnamed: 1_level_0,Prev AA,Next AA,Start,End,Peptide Length,Charges,Protein Description,Mapped Genes,Mapped Proteins,Ex_Auto_J3_30umTB_02ngQC_60m_1 Spectral Count_peptide,...,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Unique Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Unique Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Unique Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_1 MaxLFQ Total Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_2 MaxLFQ Total Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_1 MaxLFQ Total Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Total Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Total Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Total Intensity,Indistinguishable Proteins
Protein Accession,Peptide,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
A0A096LP55,EQCEQLEK,R,C,35,42,8,2,"Cytochrome b-c1 complex subunit 6-like, mitoch...",UQCRH,sp|P07919|QCR6_HUMAN,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,sp|P07919|QCR6_HUMAN
A6NDG6,TILTLTGVSTLGDVK,K,N,276,290,15,2,Glycerol-3-phosphate phosphatase,,,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
A6NHL2,SFGGGTGSGFTSLLMER,R,L,147,163,17,2,Tubulin alpha chain-like 3,,,0,...,0.0,0.0,0.0,3170630.8,2816038.0,3657354.0,3326739.8,2417819.5,4331914.5,
A6NHQ2,TNIIPVLEDAR,R,H,220,230,11,2,rRNA/tRNA 2'-O-methyltransferase fibrillarin-l...,,,1,...,0.0,0.0,0.0,629399.44,553218.6,720522.44,609172.75,678163.0,1068284.8,
A6NKT7,NSIPEPIDPLFK,K,H,621,632,12,2,RanBP2-like and GRIP domain-containing protein 3,"RANBP2, RGPD1, RGPD2, RGPD4, RGPD5, RGPD8","sp|O14715|RGPD8_HUMAN, sp|P0DJD0|RGPD1_HUMAN, ...",1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"sp|O14715|RGPD8_HUMAN, sp|P0DJD0|RGPD1_HUMAN, ..."


#### columns_to_keep
This optional parameter takes a list of the columns you would like to keep (MAPSI also accepts a str if you only want one column). It is recommended that you check the column names in your original files before running MAPSI.

In [7]:
my_dataframe = MetaMorpheus_parser(input_files=[mm_peptideQ_file_path, mm_psm_file_path, mm_protein_file_path], columns_to_keep=['Scan Number', 'Protein Accession', 'Peptide', 'Gene Name', 'Base Sequence', 'Score'])
my_dataframe.head()

Your inputs: [['C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedPeptides.tsv', 'C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1-calib_PSMs.psmtsv', 'C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedProteinGroups.tsv'], None, ['Scan Number', 'Protein Accession', 'Peptide', 'Gene Name', 'Base Sequence', 'Score'], None, None, None, None]




| Protein Accession | Peptide | Scan Number | Gene Name | Base Sequence | Score |
|---|---|---|---|---|---|
| A0A075B6X5 | EPELLLK | 24419 | primary:TRAV18 | EPELLLK | 6.399 |
| A0A075B6X5 | EPELLLK | 24592 | primary:TRAV18 | EPELLLK | 6.293 |
| A0A096LP49 | LQDVAGR | 6514 | primary:CCDC187 | LQDVAGR | 6.208 |
| A0A0B4J2D5\|P0DPI2 | GVEVTVGHEQEEGGK | 8901 | primary:GATD3B\|primary:GATD3 | GVEVTVGHEQEEGGK | 11.07 |
| A0A0B4J2D5\|P0DPI2 | NLSTFAVDGK | 16610 | primary:GATD3B\|primary:GATD3 | NLSTFAVDGK | 9.268 |



If one of the requested columns is not found in the dataframe, the entire dataframe is returned so that you do not miss any important information. To select columns manually, follow the outline below:
```python
example_columns_to_keep = ['Column_1', 'Column_2', 'Column_3']
my_dataframe = my_dataframe[example_columns_to_keep]
```
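
To avoid the fallback described above, you can check the requested names against a file's header first. The sketch below assumes a hypothetical helper `missing_columns` and uses a small in-memory table in place of a real file. Keep in mind that index columns such as `Protein Accession` may be named differently in the raw file than in the final dataframe.

```python
import io

import pandas as pd


def missing_columns(file_or_buffer, wanted):
    """Return the requested column names absent from a tab-separated file's header."""
    # nrows=0 reads only the header row, so this is cheap even for large files.
    header = pd.read_table(file_or_buffer, nrows=0).columns
    return [column for column in wanted if column not in header]


# Toy file contents standing in for a real PSM file.
toy_file = io.StringIO("Scan Number\tPeptide\tScore\n101\tEPELLLK\t6.399\n")
absent = missing_columns(toy_file, ["Scan Number", "Score", "Gene Name"])
```

An empty result means every requested column is present in that file.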

In [8]:
modified_df = my_dataframe[['Gene Name', 'Score']]
modified_df.head()

| Protein Accession | Peptide | Scan Number | Gene Name | Score |
|---|---|---|---|---|
| A0A075B6X5 | EPELLLK | 24419 | primary:TRAV18 | 6.399 |
| A0A075B6X5 | EPELLLK | 24592 | primary:TRAV18 | 6.293 |
| A0A096LP49 | LQDVAGR | 6514 | primary:CCDC187 | 6.208 |
| A0A0B4J2D5\|P0DPI2 | GVEVTVGHEQEEGGK | 8901 | primary:GATD3B\|primary:GATD3 | 11.07 |
| A0A0B4J2D5\|P0DPI2 | NLSTFAVDGK | 16610 | primary:GATD3B\|primary:GATD3 | 9.268 |


#### rows_to_keep: (scans_to_keep, peptides_to_keep, proteins_to_keep)
In addition to selecting columns to keep, you can also select specific rows to keep in your dataframe. This is split into 3 optional parameters which let you select rows by scan number, peptide sequence, and/or protein accession. Note that each of these parameters must be passed as a `list` of `str` or as a single `str`.
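
If your scan numbers come from another analysis as integers, they can be converted to strings before being passed in. A minimal sketch; the variable names are illustrative only:

```python
# Scan numbers held as integers, e.g. from an earlier analysis step.
scan_numbers = [16819, 17761, 8942]

# The row-selection parameters expect strings, so convert each value.
scans_as_str = [str(scan) for scan in scan_numbers]
```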

In [21]:
scans = ['16819', '17761', '8942', '9686', '10645', '17633', '17638']
proteins = ['B5ME19', 'E9PAV3']
msfragger_parser(input_files=[msfragger_psm_file_path, msfragger_peptide_file_path], scans_to_keep=scans, proteins_to_keep=proteins)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,File Name,Modified Peptide,Charge,Retention,Observed Mass,Calibrated Observed Mass,Observed M/Z,Calibrated Observed M/Z,Calculated Peptide Mass,Calculated M/Z,...,Ex_Auto_K13_30umTA_02ngQC_60m_1 Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_1 MaxLFQ Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_2 MaxLFQ Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_1 MaxLFQ Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Intensity
Protein Accession,Peptide,Scan Number,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
B5ME19,CLEEFELLGK,16819,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,CLEEFELLGK,2,4750.5885,1236.6061,1236.6012,619.3103,619.3079,1236.606,619.3103,...,0.0,0.0,104613.39,224767.39,244778.5,208354.9,0.0,0.0,104613.39,224767.39
B5ME19,ELLGQGLLLR,17761,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,5000.1031,1110.6814,1110.6769,556.348,556.3457,1110.676,556.3453,...,303160.78,347066.38,286371.53,458405.72,355898.0,215320.58,303160.78,347066.38,286371.53,458405.72
B5ME19,IMQNTDPHSQEYVEHLK,8942,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,4,2727.5402,2067.9734,2067.9666,518.0006,517.9989,2067.9683,517.9994,...,0.0,157719.05,0.0,0.0,148305.56,144439.25,0.0,157719.05,0.0,0.0
B5ME19,LNEILQAR,9686,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,2918.0127,955.5461,955.5394,478.7803,478.777,955.545,478.7798,...,0.0,0.0,202957.69,0.0,326863.56,310425.34,0.0,0.0,202957.69,0.0
E9PAV3,IEDLSQQAQLAAAEK,10645,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,3163.9378,1613.8329,1613.8262,807.9237,807.9204,1613.8259,807.9202,...,1095534.0,1065425.1,248189.31,742463.75,1702008.6,1788910.5,1095534.0,1065425.1,0.0,742463.75
E9PAV3,NILFVITKPDVYK,17633,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,3,4965.9947,1549.8988,1549.8953,517.6402,517.639,1548.8915,517.3044,...,503796.4,397470.34,166437.53,764465.0,276094.22,255739.28,503804.8,397476.97,166440.31,417387.44
E9PAV3,NILFVITKPDVYK,17638,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,4967.3657,1548.8954,1548.8916,775.455,775.4531,1548.8915,775.453,...,503796.4,397470.34,166437.53,764465.0,276094.22,255739.28,503804.8,397476.97,166440.31,417387.44


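If one of the requested values is not found in the corresponding index level, the dataframe will not be filtered on that level. In the example below, scan `10000` does not exist, so the scan filter is skipped while the protein filter is still applied.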

In [25]:
scans_1 = ['16819', '17761', '8942', '9686', '10645', '17633', '17638', '10000']
proteins_1 = ['B5ME19', 'E9PAV3']
selected_rows_df = msfragger_parser(input_files=[msfragger_psm_file_path, msfragger_peptide_file_path], scans_to_keep=scans_1, proteins_to_keep=proteins_1)
selected_rows_df

The following scans were not found in the dataframe: [10000]
To ensure that you have all the information needed, the dataframe will not be filted by Scan Number


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,File Name,Modified Peptide,Charge,Retention,Observed Mass,Calibrated Observed Mass,Observed M/Z,Calibrated Observed M/Z,Calculated Peptide Mass,Calculated M/Z,...,Ex_Auto_K13_30umTA_02ngQC_60m_1 Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_1 MaxLFQ Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_2 MaxLFQ Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_1 MaxLFQ Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Intensity
Protein Accession,Peptide,Scan Number,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
B5ME19,CLEEFELLGK,16819,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,CLEEFELLGK,2,4750.5885,1236.6061,1236.6012,619.3103,619.3079,1236.606,619.3103,...,0.0,0.0,104613.39,224767.39,244778.5,208354.9,0.0,0.0,104613.39,224767.39
B5ME19,ELLGQGLLLR,17761,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,5000.1031,1110.6814,1110.6769,556.348,556.3457,1110.676,556.3453,...,303160.78,347066.38,286371.53,458405.72,355898.0,215320.58,303160.78,347066.38,286371.53,458405.72
B5ME19,GCILTLVER,15367,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,GCILTLVER,2,4375.291,1059.5825,1059.5787,530.7985,530.7966,1059.5746,530.7946,...,210961.16,232756.39,242643.52,465316.62,256856.95,241219.42,210961.16,232756.39,242643.52,465316.62
B5ME19,GTEITHAVVIK,9078,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,2762.2254,1166.6719,1166.6656,584.3432,584.3401,1166.6659,584.3402,...,0.0,0.0,0.0,0.0,242487.4,158317.23,0.0,0.0,0.0,0.0
B5ME19,IMQNTDPHSQEYVEHLK,8942,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,4,2727.5402,2067.9734,2067.9666,518.0006,517.9989,2067.9683,517.9994,...,0.0,157719.05,0.0,0.0,148305.56,144439.25,0.0,157719.05,0.0,0.0
B5ME19,LNEILQAR,9686,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,2918.0127,955.5461,955.5394,478.7803,478.777,955.545,478.7798,...,0.0,0.0,202957.69,0.0,326863.56,310425.34,0.0,0.0,202957.69,0.0
B5ME19,TCHSFIINEK,9463,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,TCHSFIINEK,3,2860.9772,1247.601,1247.5967,416.8743,416.8728,1247.5968,416.8729,...,0.0,0.0,297304.62,0.0,239002.52,265927.22,0.0,0.0,297304.62,0.0
E9PAV3,IEDLSQQAQLAAAEK,10645,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,3163.9378,1613.8329,1613.8262,807.9237,807.9204,1613.8259,807.9202,...,1095534.0,1065425.1,248189.31,742463.75,1702008.6,1788910.5,1095534.0,1065425.1,0.0,742463.75
E9PAV3,NILFVITKPDVYK,17633,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,3,4965.9947,1549.8988,1549.8953,517.6402,517.639,1548.8915,517.3044,...,503796.4,397470.34,166437.53,764465.0,276094.22,255739.28,503804.8,397476.97,166440.31,417387.44
E9PAV3,NILFVITKPDVYK,17638,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,4967.3657,1548.8954,1548.8916,775.455,775.4531,1548.8915,775.453,...,503796.4,397470.34,166437.53,764465.0,276094.22,255739.28,503804.8,397476.97,166440.31,417387.44


If you still want to filter by scan number, you can do so manually as follows:

In [33]:
scans_corrected = [16819, 17761, 8942, 9686, 10645, 17633, 17638]
# reset the index so that 'Scan Number' becomes a regular column we can filter on
corrected_selected_rows_df = selected_rows_df.reset_index()
corrected_selected_rows_df = corrected_selected_rows_df.loc[corrected_selected_rows_df['Scan Number'].isin(scans_corrected)]
corrected_selected_rows_df = corrected_selected_rows_df.set_index(['Protein Accession', 'Peptide', 'Scan Number'])
corrected_selected_rows_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,File Name,Modified Peptide,Charge,Retention,Observed Mass,Calibrated Observed Mass,Observed M/Z,Calibrated Observed M/Z,Calculated Peptide Mass,Calculated M/Z,...,Ex_Auto_K13_30umTA_02ngQC_60m_1 Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_1 MaxLFQ Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_2 MaxLFQ Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_1 MaxLFQ Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Intensity
Protein Accession,Peptide,Scan Number,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
B5ME19,CLEEFELLGK,16819,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,CLEEFELLGK,2,4750.5885,1236.6061,1236.6012,619.3103,619.3079,1236.606,619.3103,...,0.0,0.0,104613.39,224767.39,244778.5,208354.9,0.0,0.0,104613.39,224767.39
B5ME19,ELLGQGLLLR,17761,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,5000.1031,1110.6814,1110.6769,556.348,556.3457,1110.676,556.3453,...,303160.78,347066.38,286371.53,458405.72,355898.0,215320.58,303160.78,347066.38,286371.53,458405.72
B5ME19,IMQNTDPHSQEYVEHLK,8942,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,4,2727.5402,2067.9734,2067.9666,518.0006,517.9989,2067.9683,517.9994,...,0.0,157719.05,0.0,0.0,148305.56,144439.25,0.0,157719.05,0.0,0.0
B5ME19,LNEILQAR,9686,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,2918.0127,955.5461,955.5394,478.7803,478.777,955.545,478.7798,...,0.0,0.0,202957.69,0.0,326863.56,310425.34,0.0,0.0,202957.69,0.0
E9PAV3,IEDLSQQAQLAAAEK,10645,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,3163.9378,1613.8329,1613.8262,807.9237,807.9204,1613.8259,807.9202,...,1095534.0,1065425.1,248189.31,742463.75,1702008.6,1788910.5,1095534.0,1065425.1,0.0,742463.75
E9PAV3,NILFVITKPDVYK,17633,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,3,4965.9947,1549.8988,1549.8953,517.6402,517.639,1548.8915,517.3044,...,503796.4,397470.34,166437.53,764465.0,276094.22,255739.28,503804.8,397476.97,166440.31,417387.44
E9PAV3,NILFVITKPDVYK,17638,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,2,4967.3657,1548.8954,1548.8916,775.455,775.4531,1548.8915,775.453,...,503796.4,397470.34,166437.53,764465.0,276094.22,255739.28,503804.8,397476.97,166440.31,417387.44
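
The same filter can also be written without resetting the index, by testing the `Scan Number` level of the multiIndex directly. The sketch below uses a small toy dataframe (a few index values and charges copied from the tables above) rather than real parser output:

```python
import pandas as pd

# Toy dataframe with the same three index levels as the PSM output.
toy_df = pd.DataFrame(
    {
        "Protein Accession": ["B5ME19", "B5ME19", "E9PAV3"],
        "Peptide": ["CLEEFELLGK", "GCILTLVER", "IEDLSQQAQLAAAEK"],
        "Scan Number": [16819, 15367, 10645],
        "Charge": [2, 2, 2],
    }
).set_index(["Protein Accession", "Peptide", "Scan Number"])

# Build a boolean mask from one index level; scan numbers are integers here.
scans_wanted = [16819, 10645]
mask = toy_df.index.get_level_values("Scan Number").isin(scans_wanted)

# Boolean indexing keeps the multiIndex intact.
filtered_df = toy_df[mask]
```

This keeps the multiIndex in place, so there is no need to call `set_index` again afterwards.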


#### multiIndex
This optional parameter allows you to create a custom multiIndex for your dataframe. The default multiIndexes are as follows:
*   1 mzML file: `['Scan Number']`
*   A dataframe which includes a PSM file: `['Protein Accession', 'Peptide', 'Scan Number']`
*   A peptide quantification file: `['Protein Accession', 'Peptide']`
*   1 protein quantification file: `['Protein Accession']`

Custom multiIndexes are passed as a list (or as a str for a single index level). A list must be ordered according to the desired multiIndex hierarchy, from the outermost level to the innermost.
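
The order of the levels affects how rows are looked up afterwards. The sketch below shows this with plain pandas on a made-up toy dataframe (the run names `run_a` and `run_b` are placeholders, not real files):

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy_df = pd.DataFrame(
    {
        "File Name": ["run_a", "run_a", "run_b"],
        "Protein Accession": ["B5ME19", "E9PAV3", "B5ME19"],
        "Scan Number": [16819, 10645, 16819],
        "Charge": [2, 2, 3],
    }
)

# The list order sets the hierarchy: 'File Name' is the outermost level.
indexed_df = toy_df.set_index(["File Name", "Protein Accession", "Scan Number"]).sort_index()

# The outermost level can be selected with .loc directly ...
run_a_rows = indexed_df.loc["run_a"]
# ... while an inner level can be selected with xs().
b5me19_rows = indexed_df.xs("B5ME19", level="Protein Accession")
```

Putting the level you query most often first keeps such lookups simple.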

In [15]:
my_multiIndex = ['File Name', 'Protein Accession', 'Peptide', 'Scan Number']
custom_multiIndex_df = msfragger_parser(input_files=[msfragger_psm_file_path, msfragger_protein_file_path, msfragger_peptide_file_path], output_file_path=outfile, multiIndex=my_multiIndex)
custom_multiIndex_df.head()

Dataframe saved.


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Modified Peptide,Charge,Retention,Observed Mass,Calibrated Observed Mass,Observed M/Z,Calibrated Observed M/Z,Calculated Peptide Mass,Calculated M/Z,Delta Mass,...,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Unique Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Unique Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Unique Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_1 MaxLFQ Total Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_2 MaxLFQ Total Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_1 MaxLFQ Total Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Total Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Total Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Total Intensity,Indistinguishable Proteins
File Name,Protein Accession,Peptide,Scan Number,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,A0A096LP55,EQCEQLEK,5237,EQCEQLEK,2,1776.8865,1062.4706,1062.4664,532.2426,532.2405,1062.4651,532.2398,0.0013,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,sp|P07919|QCR6_HUMAN
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,A6NHQ2,TNIIPVLEDAR,14755,,2,4218.5518,1239.6882,1239.6842,620.8514,620.8494,1239.6823,620.8484,0.0019,...,0.0,0.0,0.0,629399.44,553218.6,720522.44,609172.75,678163.0,1068284.8,
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,A6NKT7,NSIPEPIDPLFK,16986,,2,4795.0662,1368.7386,1368.7316,685.3766,685.3731,1368.7288,685.3717,0.0028,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"sp|O14715|RGPD8_HUMAN, sp|P0DJD0|RGPD1_HUMAN, ..."
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,A8MWD9,GNSIIMLEALER,18334,,2,5152.8337,1344.7134,1344.7035,673.364,673.359,1344.707,673.3608,-0.0035,...,0.0,0.0,0.0,0.0,246876.1,0.0,199852.67,0.0,0.0,sp|P62308|RUXG_HUMAN
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,B0I1T2,DFLFQDFK,18706,,2,5252.6033,1058.5103,1058.5061,530.2624,530.2603,1058.5072,530.2609,-0.0011,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


Note that the multiIndex is not saved in the output file. To load a saved dataframe and restore its multiIndex, set the index again after reading the file:

In [17]:
loaded_df = pd.read_table(outfile).set_index(my_multiIndex).sort_index()
loaded_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Modified Peptide,Charge,Retention,Observed Mass,Calibrated Observed Mass,Observed M/Z,Calibrated Observed M/Z,Calculated Peptide Mass,Calculated M/Z,Delta Mass,...,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Unique Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Unique Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Unique Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_1 MaxLFQ Total Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_2 MaxLFQ Total Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_1 MaxLFQ Total Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Total Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Total Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Total Intensity,Indistinguishable Proteins
File Name,Protein Accession,Peptide,Scan Number,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,A0A096LP55,EQCEQLEK,5237,EQCEQLEK,2,1776.8865,1062.4706,1062.4664,532.2426,532.2405,1062.4651,532.2398,0.0013,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,sp|P07919|QCR6_HUMAN
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,A6NHQ2,TNIIPVLEDAR,14755,,2,4218.5518,1239.6882,1239.6842,620.8514,620.8494,1239.6823,620.8484,0.0019,...,0.0,0.0,0.0,629399.44,553218.6,720522.44,609172.75,678163.0,1068284.8,
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,A6NKT7,NSIPEPIDPLFK,16986,,2,4795.0662,1368.7386,1368.7316,685.3766,685.3731,1368.7288,685.3717,0.0028,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"sp|O14715|RGPD8_HUMAN, sp|P0DJD0|RGPD1_HUMAN, ..."
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,A8MWD9,GNSIIMLEALER,18334,,2,5152.8337,1344.7134,1344.7035,673.364,673.359,1344.707,673.3608,-0.0035,...,0.0,0.0,0.0,0.0,246876.1,0.0,199852.67,0.0,0.0,sp|P62308|RUXG_HUMAN
interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,B0I1T2,DFLFQDFK,18706,,2,5252.6033,1058.5103,1058.5061,530.2624,530.2603,1058.5072,530.2609,-0.0011,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
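
The save-and-reload round trip can be tried on a small scale without touching the disk. This sketch assumes the saved file stores the index levels as ordinary tab-separated columns, which is consistent with the loading code above but is not confirmed from MAPSI's source. The toy rows reuse a few values from the tables in this tutorial.

```python
import io

import pandas as pd

index_levels = ["Protein Accession", "Peptide", "Scan Number"]
toy_df = pd.DataFrame(
    {
        "Protein Accession": ["A6NHQ2", "A0A096LP55"],
        "Peptide": ["TNIIPVLEDAR", "EQCEQLEK"],
        "Scan Number": [14755, 5237],
        "Charge": [2, 2],
    }
).set_index(index_levels)

# Write the index levels as ordinary columns (assumed flat file layout).
buffer = io.StringIO()
toy_df.reset_index().to_csv(buffer, sep="\t", index=False)

# Read the flat table back and rebuild the multiIndex.
buffer.seek(0)
restored_df = pd.read_table(buffer).set_index(index_levels).sort_index()
```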
