# MAPSI Tutorial

## Overview
MAPSI (MzML And Processed Spectra Interface) helps you combine and view mzML files and their corresponding PSM, peptide, and protein quantification files as a single multi-indexed pandas DataFrame.

Depending on whether you are using MetaMorpheus or MSFragger files, you will want to import the corresponding parser script. In this tutorial I will import both parser scripts.

In [8]:
from MSFragger_Parser import parse_files as msfragger_parser
from MetaMorpheus_Parser import parse_files as MetaMorpheus_parser

## Test Files

In [9]:
# mzML test file
mzml_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1.mzML"

# MSFragger test files
msfragger_psm_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\psm1.tsv"
msfragger_peptide_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\combined_peptide.tsv"
msfragger_protein_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\combined_protein.tsv"

# MetaMorpheus test files (these reuse mzml_file_path defined above)
mm_psm_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1-calib_PSMs.psmtsv"
mm_peptideQ_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedPeptides.tsv"
mm_protein_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedProteinGroups.tsv"

# output test file
outfile = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\tester.tsv"

## Inputs
#### input_files
Pass a list of one to four file paths, in any order. To read a single file, you can pass its path as a string instead of a list.

In [10]:
# let's try combining a psm and peptide quantification file
psm_and_peptide_df = msfragger_parser(input_files=[msfragger_peptide_file_path, msfragger_psm_file_path])
psm_and_peptide_df.head()

Exception: output_file_path must be a str or Path obj

In [None]:
# we can also just load data from one file
psm_df = msfragger_parser(input_files=msfragger_psm_file_path)
psm_df.head()
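
The two accepted shapes of `input_files` (a single string or a list of one to four paths) can be sketched as a small normalization step. This is only an illustration, not MAPSI's actual code: `normalize_input_files` is a hypothetical helper, and raising `ValueError` is an assumption.

```python
def normalize_input_files(input_files):
    """Return input_files as a list of 1 to 4 path strings (hypothetical helper)."""
    # A single file may be given as a plain string; wrap it in a list.
    if isinstance(input_files, str):
        input_files = [input_files]
    files = [str(f) for f in input_files]
    # The tutorial states that between 1 and 4 files are accepted.
    if not 1 <= len(files) <= 4:
        raise ValueError(f"expected 1 to 4 files, got {len(files)}")
    return files


print(normalize_input_files("psm1.tsv"))
print(normalize_input_files(["combined_peptide.tsv", "psm1.tsv"]))
```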

Note that MAPSI cannot process more than one file of each type in a single call. To combine several files of the same type (e.g., multiple mzML and PSM files), run each set of corresponding files through MAPSI and then concatenate the resulting dataframes.
```python
import pandas as pd

list_of_mzml_file_paths = ['mzml_1', 'mzml_2', 'mzml_3']
list_of_psm_file_paths = ['psm_1', 'psm_2', 'psm_3']
# pair each mzML file with its corresponding PSM file
list_of_joined_dataframes = []
for mzml_file_path, psm_file_path in zip(list_of_mzml_file_paths, list_of_psm_file_paths):
    # join data from the mzML and PSM files and append it to 'list_of_joined_dataframes'
    new_df = msfragger_parser(input_files=[mzml_file_path, psm_file_path])
    list_of_joined_dataframes.append(new_df)
# concatenate the joined dataframes into one
joined_df = pd.concat(list_of_joined_dataframes)
```
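
When concatenating results from several runs, it can help to record which run each row came from. The sketch below uses small stand-in dataframes in place of MAPSI output (the run labels, accessions, and scores are made up for illustration) and passes `keys` to `pd.concat`, which adds an outer index level.

```python
import pandas as pd

# Stand-ins for dataframes returned by the parser; values are illustrative only.
index_names = ["Protein Accession", "Peptide"]
run_1 = pd.DataFrame(
    {"Protein Accession": ["P1", "P2"], "Peptide": ["AAK", "CCR"], "Score": [6.4, 9.3]}
).set_index(index_names)
run_2 = pd.DataFrame(
    {"Protein Accession": ["P1"], "Peptide": ["AAK"], "Score": [7.1]}
).set_index(index_names)

# keys= labels each input frame; names= names every level of the resulting index.
joined_df = pd.concat(
    [run_1, run_2],
    keys=["run_1", "run_2"],
    names=["Run"] + index_names,
)
print(joined_df)
```

Rows from a single run can then be retrieved with `joined_df.loc["run_1"]`.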

MAPSI raises the following exception if it cannot recognize a file:

In [11]:
MetaMorpheus_parser(input_files="invalid file")

Your inputs: [['invalid file'], None, None, None, None, None, None]


Exception: Function could not identify invalid file. Please rename your file to contain the file type. Ex: .mzml, psm, peptide/pep, protein/prot

MAPSI recognizes file types by the file extension and name. If MAPSI is having trouble recognizing your files, make sure that:
1. The mzML file extension is `.mzml`
2. The PSM file name contains `psm`
3. The peptide file name contains `peptide` or `pep`
4. The protein file name contains `protein` or `prot`
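
The naming rules above can be sketched as a small lookup function. This is not MAPSI's implementation: `guess_file_type` is a hypothetical helper, and both the case-insensitive matching and the order in which the rules are checked are assumptions.

```python
from pathlib import PureWindowsPath


def guess_file_type(file_path):
    """Guess a file's type from its name (hypothetical helper, not MAPSI's code)."""
    # PureWindowsPath splits on both "\\" and "/", so either path style works.
    path = PureWindowsPath(file_path)
    name = path.name.lower()  # assumption: matching is case-insensitive
    if path.suffix.lower() == ".mzml":
        return "mzml"
    if "psm" in name:
        return "psm"
    if "pep" in name:  # also matches "peptide"
        return "peptide"
    if "prot" in name:  # also matches "protein"
        return "protein"
    raise ValueError(f"could not identify {file_path}")


for example in ["run_1.mzML", "psm1.tsv", "combined_peptide.tsv", "combined_protein.tsv"]:
    print(example, "->", guess_file_type(example))
```

Only the file name is inspected, not the folders above it, so a directory whose name happens to contain `pep` or `prot` does not affect the result in this sketch.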

#### output_file_path
This is an optional parameter which allows you to save your dataframe as a `.tsv` file. 

In [12]:
# ex:
msfragger_parser(input_files=[msfragger_protein_file_path, msfragger_peptide_file_path], output_file_path=outfile)

Your inputs: [['C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\combined_protein.tsv', 'C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\combined_peptide.tsv'], 'C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\tester.tsv', None, None, None, None, None]
Dataframe saved.


Protein Accession,Peptide,Prev AA,Next AA,Start,End,Peptide Length,Charges,Protein Description,Mapped Genes,Mapped Proteins,Ex_Auto_J3_30umTB_02ngQC_60m_1 Spectral Count_peptide,...,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Unique Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Unique Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Unique Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_1 MaxLFQ Total Intensity,Ex_Auto_J3_30umTB_02ngQC_60m_2 MaxLFQ Total Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_1 MaxLFQ Total Intensity,Ex_Auto_K13_30umTA_02ngQC_60m_2 MaxLFQ Total Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_3 MaxLFQ Total Intensity,Ex_Auto_W17_30umTA_02ngQC_60m_4 MaxLFQ Total Intensity,Indistinguishable Proteins
A0A096LP55,EQCEQLEK,R,C,35,42,8,2,"Cytochrome b-c1 complex subunit 6-like, mitoch...",UQCRH,sp|P07919|QCR6_HUMAN,1,...,0.0,0.0,0.0,0.00,0.0,0.00,0.00,0.0,0.0,sp|P07919|QCR6_HUMAN
A6NDG6,TILTLTGVSTLGDVK,K,N,276,290,15,2,Glycerol-3-phosphate phosphatase,,,0,...,0.0,0.0,0.0,0.00,0.0,0.00,0.00,0.0,0.0,
A6NHL2,SFGGGTGSGFTSLLMER,R,L,147,163,17,2,Tubulin alpha chain-like 3,,,0,...,0.0,0.0,0.0,3170630.80,2816038.0,3657354.00,3326739.80,2417819.5,4331914.5,
A6NHQ2,TNIIPVLEDAR,R,H,220,230,11,2,rRNA/tRNA 2'-O-methyltransferase fibrillarin-l...,,,1,...,0.0,0.0,0.0,629399.44,553218.6,720522.44,609172.75,678163.0,1068284.8,
A6NKT7,NSIPEPIDPLFK,K,H,621,632,12,2,RanBP2-like and GRIP domain-containing protein 3,"RANBP2, RGPD1, RGPD2, RGPD4, RGPD5, RGPD8","sp|O14715|RGPD8_HUMAN, sp|P0DJD0|RGPD1_HUMAN, ...",1,...,0.0,0.0,0.0,0.00,0.0,0.00,0.00,0.0,0.0,"sp|O14715|RGPD8_HUMAN, sp|P0DJD0|RGPD1_HUMAN, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens GN=PRDX1 PE=1 SV=1,QGGLGPMNIPLVSDPK,K,R,94,109,16,2,Peroxiredoxin-1,,,2,...,8945861.0,8905544.0,7285309.0,10982488.00,11487941.0,12186659.00,11550702.00,11941962.0,9579125.0,
contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens GN=PRDX1 PE=1 SV=1,QITVNDLPVGR,R,S,141,151,11,2,Peroxiredoxin-1,PRDX2,sp|P32119|PRDX2_HUMAN,1,...,8945861.0,8905544.0,7285309.0,10982488.00,11487941.0,12186659.00,11550702.00,11941962.0,9579125.0,
contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens GN=PRDX1 PE=1 SV=1,RTIAQDYGVLK,K,A,110,120,11,23,Peroxiredoxin-1,,,2,...,8945861.0,8905544.0,7285309.0,10982488.00,11487941.0,12186659.00,11550702.00,11941962.0,9579125.0,
contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens GN=PRDX1 PE=1 SV=1,TIAQDYGVLK,R,A,111,120,10,2,Peroxiredoxin-1,,,1,...,8945861.0,8905544.0,7285309.0,10982488.00,11487941.0,12186659.00,11550702.00,11941962.0,9579125.0,
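
A saved `.tsv` file can be loaded back into a multi-indexed dataframe with pandas. The sketch below is not MAPSI's code: `save_dataframe` is a hypothetical stand-in for the saving step, its type check only mirrors the `output_file_path must be a str or Path obj` message shown earlier, and it assumes the index levels are written as the first columns of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_dataframe(df, output_file_path):
    """Write df as a tab-separated file (hypothetical helper, not MAPSI's code)."""
    if not isinstance(output_file_path, (str, Path)):
        raise TypeError("output_file_path must be a str or Path obj")
    df.to_csv(output_file_path, sep="\t")


# Illustrative stand-in for a MAPSI result, indexed by protein and peptide.
df = pd.DataFrame(
    {
        "Protein Accession": ["A6NHQ2", "A6NKT7"],
        "Peptide": ["TNIIPVLEDAR", "NSIPEPIDPLFK"],
        "Peptide Length": [11, 12],
    }
).set_index(["Protein Accession", "Peptide"])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "tester.tsv"
    save_dataframe(df, out_path)
    # Assumption: the first two columns of the file hold the index levels.
    reloaded = pd.read_csv(out_path, sep="\t", index_col=[0, 1])

print(reloaded)
```

If your file was written with three index levels (for example, when a PSM file is included), `index_col=[0, 1, 2]` would be the corresponding choice.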


#### columns_to_keep
This is an optional parameter which lets you pass a list of the columns you would like to keep (MAPSI also accepts a str if you only want one column). It is recommended that you check the column names in your original files before running MAPSI.
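
One way to check the column names of a tab-separated file is to read only its header with pandas. The sketch below writes a tiny stand-in file first so that it runs on its own; the file name and its three columns are illustrative, and a real PSM file has many more columns.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Tiny stand-in for a PSM file; replace example_path with your own file path.
    example_path = Path(tmp_dir) / "example_psm.tsv"
    example_path.write_text("Scan Number\tBase Sequence\tScore\n24419\tEPELLLK\t6.399\n")

    # nrows=0 returns an empty dataframe that still carries the column names.
    column_names = pd.read_csv(example_path, sep="\t", nrows=0).columns.tolist()

print(column_names)
```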

In [25]:
my_dataframe = MetaMorpheus_parser(input_files=[mm_peptideQ_file_path, mm_psm_file_path, mm_protein_file_path], columns_to_keep=['Scan Number', 'Protein Accession', 'Peptide', 'Gene Name', 'Base Sequence', 'Score'])
my_dataframe.head()

Your inputs: [['C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedPeptides.tsv', 'C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1-calib_PSMs.psmtsv', 'C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedProteinGroups.tsv'], None, ['Scan Number', 'Protein Accession', 'Peptide', 'Gene Name', 'Base Sequence', 'Score'], None, None, None, None]




Protein Accession,Peptide,Scan Number,Gene Name,Base Sequence,Score
A0A075B6X5,EPELLLK,24419,primary:TRAV18,EPELLLK,6.399
A0A075B6X5,EPELLLK,24592,primary:TRAV18,EPELLLK,6.293
A0A096LP49,LQDVAGR,6514,primary:CCDC187,LQDVAGR,6.208
A0A0B4J2D5|P0DPI2,GVEVTVGHEQEEGGK,8901,primary:GATD3B|primary:GATD3,GVEVTVGHEQEEGGK,11.07
A0A0B4J2D5|P0DPI2,NLSTFAVDGK,16610,primary:GATD3B|primary:GATD3,NLSTFAVDGK,9.268



If one of the requested columns is not found in the dataframe, the entire dataframe is returned so that you do not lose any important information. To select columns manually, follow the outline below:
```python
example_columns_to_keep = ['Column_1', 'Column_2', 'Column_3']
my_dataframe = my_dataframe[example_columns_to_keep]
```

In [27]:
modified_df = my_dataframe[['Gene Name', 'Score']]
modified_df.head()

Protein Accession,Peptide,Scan Number,Gene Name,Score
A0A075B6X5,EPELLLK,24419,primary:TRAV18,6.399
A0A075B6X5,EPELLLK,24592,primary:TRAV18,6.293
A0A096LP49,LQDVAGR,6514,primary:CCDC187,6.208
A0A0B4J2D5|P0DPI2,GVEVTVGHEQEEGGK,8901,primary:GATD3B|primary:GATD3,11.07
A0A0B4J2D5|P0DPI2,NLSTFAVDGK,16610,primary:GATD3B|primary:GATD3,9.268
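
The fallback described above (return everything if a requested column is missing) can be sketched as follows. `select_columns` is a hypothetical helper rather than MAPSI's own code, and treating names that are already index levels as "found" is an assumption based on the example output, where `Protein Accession`, `Peptide`, and `Scan Number` appear in the index.

```python
import pandas as pd


def select_columns(df, columns_to_keep):
    """Keep the requested columns, or return df unchanged if any is missing.

    Hypothetical helper that mimics the behavior described in the tutorial.
    """
    if isinstance(columns_to_keep, str):
        columns_to_keep = [columns_to_keep]
    # Assumption: names that are already index levels count as found.
    index_names = set(df.index.names)
    wanted = [c for c in columns_to_keep if c not in index_names]
    if any(c not in df.columns for c in wanted):
        return df  # a requested column is missing: return everything
    return df[wanted]


df = pd.DataFrame(
    {
        "Protein Accession": ["A0A075B6X5", "A0A096LP49"],
        "Peptide": ["EPELLLK", "LQDVAGR"],
        "Gene Name": ["primary:TRAV18", "primary:CCDC187"],
        "Score": [6.399, 6.208],
    }
).set_index(["Protein Accession", "Peptide"])

print(select_columns(df, ["Peptide", "Score"]).columns.tolist())
print(select_columns(df, ["Score", "No Such Column"]).columns.tolist())
```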


In [None]:
# load a single PSM file (passed as a one-element list) and save it as a .tsv file
test1 = msfragger_parser(input_files=[msfragger_psm_file_path], output_file_path=outfile)

<class 'list'>
Your inputs: [['C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\psm1.tsv'], 'C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\tester.tsv', None, None, None, None, None]
Dataframe saved.


In [None]:
test1

Protein Accession,Peptide,Scan Number,File Name,Modified Peptide,Prev AA,Next AA,Peptide Length,Charge,Retention,Observed Mass,Calibrated Observed Mass,Observed M/Z,...,Protein Start,Protein End,Intensity,Assigned Modifications,Observed Modifications,Is Unique,Protein,Entry Name,Gene,Protein Description
A0A096LP55,EQCEQLEK,5237,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,EQCEQLEK,R,C,8,2,1776.8865,1062.4706,1062.4664,532.2426,...,35,42,200838.400,3C(57.0215),,False,sp|A0A096LP55|QCR6L_HUMAN,QCR6L_HUMAN,UQCRHL,"Cytochrome b-c1 complex subunit 6-like, mitoch..."
A6NHQ2,TNIIPVLEDAR,14755,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,R,H,11,2,4218.5518,1239.6882,1239.6842,620.8514,...,220,230,517661.100,,,True,sp|A6NHQ2|FBLL1_HUMAN,FBLL1_HUMAN,FBLL1,rRNA/tRNA 2'-O-methyltransferase fibrillarin-l...
A6NKT7,NSIPEPIDPLFK,16986,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,K,H,12,2,4795.0662,1368.7386,1368.7316,685.3766,...,621,632,30740.525,,,False,sp|A6NKT7|RGPD3_HUMAN,RGPD3_HUMAN,RGPD3,RanBP2-like and GRIP domain-containing protein 3
A8MWD9,GNSIIMLEALER,18334,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,R,-,12,2,5152.8337,1344.7134,1344.7035,673.3640,...,64,75,445926.780,,,False,sp|A8MWD9|RUXGL_HUMAN,RUXGL_HUMAN,SNRPGP15,Putative small nuclear ribonucleoprotein G-lik...
B0I1T2,DFLFQDFK,18706,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,R,R,8,2,5252.6033,1058.5103,1058.5061,530.2624,...,537,544,54186.293,,,True,sp|B0I1T2|MYO1G_HUMAN,MYO1G_HUMAN,MYO1G,Unconventional myosin-Ig
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens GN=PRDX1 PE=1 SV=1,QITVNDLPVGR,12189,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,R,S,11,2,3559.5387,1210.6725,1210.6665,606.3435,...,141,151,8893934.000,,,False,contam_sp|Q06830|PRDX1_HUMAN,contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 O...,,Peroxiredoxin-1
contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens GN=PRDX1 PE=1 SV=1,RTIAQDYGVLK,10981,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,K,A,11,2,3249.6437,1262.7015,1262.6985,632.3580,...,110,120,750390.750,,,True,contam_sp|Q06830|PRDX1_HUMAN,contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 O...,,Peroxiredoxin-1
contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens GN=PRDX1 PE=1 SV=1,RTIAQDYGVLK,10985,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,K,A,11,3,3250.6423,1262.7026,1262.6973,421.9081,...,110,120,311785.160,,,True,contam_sp|Q06830|PRDX1_HUMAN,contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 O...,,Peroxiredoxin-1
contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens GN=PRDX1 PE=1 SV=1,TIAQDYGVLK,11345,interact-Ex_Auto_J3_30umTB_02ngQC_60m_1.pep.xml,,R,A,10,2,3343.4081,1106.6042,1106.5964,554.3094,...,111,120,5151185.000,,,True,contam_sp|Q06830|PRDX1_HUMAN,contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 O...,,Peroxiredoxin-1
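
Because the returned dataframes are multi-indexed, rows are selected by index level rather than by ordinary column filters. The sketch below rebuilds a few rows from the MetaMorpheus example output earlier in this tutorial and shows three standard pandas operations; it does not call MAPSI itself.

```python
import pandas as pd

# A few rows copied from the MetaMorpheus example output earlier in this tutorial.
df = pd.DataFrame(
    {
        "Protein Accession": ["A0A075B6X5", "A0A075B6X5", "A0A096LP49"],
        "Peptide": ["EPELLLK", "EPELLLK", "LQDVAGR"],
        "Scan Number": [24419, 24592, 6514],
        "Score": [6.399, 6.293, 6.208],
    }
).set_index(["Protein Accession", "Peptide", "Scan Number"])

# All rows for one protein: select on the outermost index level.
one_protein = df.loc["A0A075B6X5"]

# All rows for one peptide: xs() selects on any named level.
one_peptide = df.xs("LQDVAGR", level="Peptide")

# Turn the index levels back into ordinary columns.
flat_df = df.reset_index()

print(one_protein)
print(one_peptide)
print(flat_df.columns.tolist())
```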
