# MAPSI Tutorial

## Overview
MAPSI (MzML And Processed Spectra Interface) combines mzML files with their corresponding PSM, peptide, and protein quantification files into a single multiIndexed pandas dataframe, so that the data can be viewed together.

Depending on whether you are using MetaMorpheus or MSFragger files, you will want to import the corresponding parser script. In this tutorial I will import both parser scripts.

In [14]:
from MSFragger_Parser import parse_files as msfragger_parser
from MetaMorpheus_Parser import parse_files as MetaMorpheus_parser

In [15]:
import pandas as pd

## Test Files

In [9]:
# mzML test file
mzml_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1.mzML"

# MSFragger test files
msfragger_psm_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\psm1.tsv"
msfragger_peptide_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\combined_peptide.tsv"
msfragger_protein_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\combined_protein.tsv"

# MetaMorpheus test files
mm_psm_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1-calib_PSMs.psmtsv"
mm_peptideQ_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedPeptides.tsv"
mm_protein_file_path = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\AllQuantifiedProteinGroups.tsv"

# output test file
outfile = "C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\msfragger\\tester.tsv"

## Inputs
#### input_files
You may pass a list of 1 to 4 file paths in any order. If you only want to read in one file, you can pass its path as a single string.

In [None]:
# let's try combining a psm and peptide quantification file
psm_and_peptide_df = msfragger_parser(input_files=[msfragger_peptide_file_path, msfragger_psm_file_path])
psm_and_peptide_df.head()

In [None]:
# we can also just load data from one file
psm_df = msfragger_parser(input_files=msfragger_psm_file_path)
psm_df.head()

Note that MAPSI cannot process more than one file of each type in a single call. To combine several files of the same type (e.g., multiple mzML and PSM files), run each set of corresponding files through MAPSI and then concatenate the resulting dataframes:
```python
list_of_mzml_file_paths = ['mzml_1', 'mzml_2', 'mzml_3']
list_of_psm_file_paths = ['psm_1', 'psm_2', 'psm_3']
list_of_joined_dataframes = []
# iterate over the corresponding mzML and PSM files in pairs
for mzml_file_path, psm_file_path in zip(list_of_mzml_file_paths, list_of_psm_file_paths):
    # join data from the mzML and PSM files and append it to 'list_of_joined_dataframes'
    new_df = msfragger_parser(input_files=[mzml_file_path, psm_file_path])
    list_of_joined_dataframes.append(new_df)
# concatenate the joined dataframes into one
joined_df = pd.concat(list_of_joined_dataframes)
```
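
If you also want to keep track of which run each row came from, `pd.concat` accepts a `keys` argument that adds an outer index level. The sketch below uses small hand-made dataframes in place of MAPSI output; the run labels and the `Run` level name are illustrative assumptions, not something MAPSI produces.

```python
import pandas as pd

# Toy stand-ins for two dataframes returned by MAPSI (all values are made up)
df_run_1 = pd.DataFrame(
    {"Scan Number": [1, 2], "Score": [10.5, 7.2]}
).set_index("Scan Number")
df_run_2 = pd.DataFrame(
    {"Scan Number": [1, 3], "Score": [8.1, 9.9]}
).set_index("Scan Number")

# 'keys' adds an outer index level, so identical scan numbers
# from different runs stay distinguishable after concatenation
joined_df = pd.concat(
    [df_run_1, df_run_2],
    keys=["run_1", "run_2"],
    names=["Run", "Scan Number"],
)

# select every row that came from the second run
run_2_rows = joined_df.loc["run_2"]
```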

MAPSI recognizes file types by the file extension and name. If MAPSI is having trouble recognizing your files, make sure that:
1. The mzML file extension is `.mzml`
2. The PSM file name contains `psm`
3. The peptide file name contains `peptide` or `pep`
4. The protein file name contains `protein` or `prot`
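
As a quick check before calling the parser, the naming rules above can be written as a small helper. This is only a sketch of the listed rules, not MAPSI's own detection code: `guess_file_type` is a hypothetical name, and the order in which the rules are tested is an assumption.

```python
from pathlib import PureWindowsPath


def guess_file_type(file_path):
    """Hypothetical helper: apply the naming rules listed above to one path."""
    # PureWindowsPath splits on both '\\' and '/', so only the file name is checked
    name = PureWindowsPath(file_path).name.lower()
    if name.endswith(".mzml"):
        return "mzml"
    if "psm" in name:
        return "psm"
    if "pep" in name:  # also matches "peptide"
        return "peptide"
    if "prot" in name:  # also matches "protein"
        return "protein"
    return None  # none of the rules matched


for path in ["C:\\data\\psm1.tsv", "C:\\data\\combined_peptide.tsv", "C:\\data\\results.tsv"]:
    print(path, "->", guess_file_type(path))
```

A return value of `None` suggests the file should be renamed before it is passed to MAPSI.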

#### output_file_path
This is an optional parameter which allows you to save your dataframe as a `.tsv` file. 

In [None]:
# ex:
msfragger_parser(input_files=[msfragger_protein_file_path, msfragger_peptide_file_path], output_file_path=outfile).head()

#### columns_to_keep
This optional parameter takes a list of the columns you would like to keep (MAPSI also accepts a `str` if you only want one column). It is recommended that you check the column names in your original files before running MAPSI.

In [None]:
my_dataframe = MetaMorpheus_parser(input_files=[mm_peptideQ_file_path, mm_psm_file_path, mm_protein_file_path], columns_to_keep=['Scan Number', 'Protein Accession', 'Peptide', 'Gene Name', 'Base Sequence', 'Score'])
my_dataframe.head()


If one of the requested columns is not found in the dataframe, the entire dataframe is returned so that you do not miss any important information. To select the columns to keep manually, follow the outline below:
```python
example_columns_to_keep = ['Column_1', 'Column_2', 'Column_3']
my_dataframe = my_dataframe[example_columns_to_keep]
```

In [None]:
modified_df = my_dataframe[['Gene Name', 'Score']]
modified_df.head()
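
Because a single unknown column name causes the full dataframe to be returned, it can help to check the names first. The sketch below works on a toy dataframe with made-up values; the variable names are illustrative. Note that in pandas, index levels are not listed in `.columns`, so only regular columns are checked here.

```python
import pandas as pd

# Toy dataframe standing in for MAPSI output (all values are made up)
toy_df = pd.DataFrame({
    "Gene Name": ["GENE1", "GENE2"],
    "Score": [12.3, 8.7],
    "Base Sequence": ["PEPTIDEK", "SAMPLER"],
})

wanted_columns = ["Gene Name", "Score", "Not A Column"]

# split the requested names into those that exist and those that do not
present = [col for col in wanted_columns if col in toy_df.columns]
missing = [col for col in wanted_columns if col not in toy_df.columns]

if missing:
    print(f"Columns not found: {missing}")

# keep only the columns that actually exist
subset_df = toy_df[present]
```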

#### rows_to_keep: (scans_to_keep, peptides_to_keep, proteins_to_keep)
In addition to selecting columns to keep, you can also select specific rows to keep in your dataframe. This is split into 3 optional parameters which allow you to select rows based on scan number, peptide sequence, and/or protein accession. Each of these parameters must be a `list` of `str` or a single `str`.

In [None]:
scans = ['16819', '17761', '8942', '9686', '10645', '17633', '17638']
proteins = ['B5ME19', 'E9PAV3']
msfragger_parser(input_files=[msfragger_psm_file_path, msfragger_peptide_file_path], scans_to_keep=scans, proteins_to_keep=proteins)

If any value passed to one of these parameters is not found in the data, the dataframe is not filtered on that parameter at all. The call below illustrates this by adding one more scan, `'10000'`, to the list.

In [None]:
scans_1 = ['16819', '17761', '8942', '9686', '10645', '17633', '17638', '10000']
proteins_1 = ['B5ME19', 'E9PAV3']
selected_rows_df = msfragger_parser(input_files=[msfragger_psm_file_path, msfragger_peptide_file_path], scans_to_keep=scans_1, proteins_to_keep=proteins_1)
selected_rows_df

If you still want to filter on the values that are present, you can filter the returned dataframe manually:

In [None]:
scans_corrected = [16819, 17761, 8942, 9686, 10645, 17633, 17638]
# Reset the index so that 'Scan Number' becomes a regular column we can filter on
corrected_selected_rows_df = selected_rows_df.reset_index()
corrected_selected_rows_df = corrected_selected_rows_df.loc[corrected_selected_rows_df['Scan Number'].isin(scans_corrected)]
corrected_selected_rows_df = corrected_selected_rows_df.set_index(['Protein Accession', 'Peptide', 'Scan Number'])
corrected_selected_rows_df
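
An alternative that avoids resetting and rebuilding the index is to filter on the index level directly with `get_level_values`. The sketch below uses a toy dataframe with made-up values; with real data, the values in the list should have the same type as the index level (the manual example above uses integers).

```python
import pandas as pd

# Toy dataframe with the same index levels as a PSM-based MAPSI dataframe
toy_df = pd.DataFrame({
    "Protein Accession": ["P1", "P1", "P2"],
    "Peptide": ["AAK", "CCR", "DDK"],
    "Scan Number": [100, 200, 300],
    "Score": [5.0, 6.0, 7.0],
}).set_index(["Protein Accession", "Peptide", "Scan Number"])

# 999 is not in the data; isin simply never matches it
scans_wanted = [100, 300, 999]

# boolean mask built from one index level, leaving the MultiIndex intact
mask = toy_df.index.get_level_values("Scan Number").isin(scans_wanted)
filtered_df = toy_df.loc[mask]
```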

#### multiIndex
This optional parameter allows you to create a custom multiIndex for your dataframe. The default multiIndexes are as follows:
*   1 mzML file: `['Scan Number']`
*   A dataframe which includes a PSM file: `['Protein Accession', 'Peptide', 'Scan Number']`
*   A peptide quantification file: `['Protein Accession', 'Peptide']`
*   1 protein quantification file: `['Protein Accession']`

Custom multiIndexes are passed as a list (or a `str` for a single index level). A list must be ordered according to the desired multiIndex hierarchy.

In [None]:
my_multiIndex = ['File Name', 'Protein Accession', 'Peptide', 'Scan Number']
custom_multiIndex_df = msfragger_parser(input_files=[msfragger_psm_file_path, msfragger_protein_file_path, msfragger_peptide_file_path], output_file_path=outfile, multiIndex=my_multiIndex)
custom_multiIndex_df.head()
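
To see why the order of the levels matters, the sketch below builds a MultiIndex on a toy dataframe with plain pandas and selects rows from it. The data is made up, and `set_index` is used here only to illustrate the hierarchy; it is an assumption that MAPSI builds its index in a comparable way.

```python
import pandas as pd

# Toy data standing in for a combined dataframe (all values are made up)
toy_df = pd.DataFrame({
    "File Name": ["run_a", "run_a", "run_b"],
    "Protein Accession": ["P1", "P2", "P1"],
    "Peptide": ["AAK", "DDK", "AAK"],
    "Score": [5.0, 6.0, 7.0],
})

# the list order sets the hierarchy: 'File Name' is the outermost level
by_file = toy_df.set_index(["File Name", "Protein Accession", "Peptide"]).sort_index()

# the outermost level can be selected directly with .loc
run_a_rows = by_file.loc["run_a"]

# inner levels are easiest to select with .xs, regardless of file
p1_rows = by_file.xs("P1", level="Protein Accession")
```

Putting the level you query most often first keeps the common selections short.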

Note that the multiIndex is not saved in the output file. To load a saved dataframe and restore its multiIndex, do the following:

In [None]:
loaded_df = pd.read_table(outfile).set_index(my_multiIndex).sort_index()
loaded_df.head()
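
The same round trip can be tried without touching any files. The sketch below writes a toy dataframe to an in-memory buffer as a flat tab-separated table and rebuilds the index on load. It assumes the saved file stores the index levels as ordinary columns, which is what the cell above relies on; the data is made up.

```python
import io

import pandas as pd

index_levels = ["Protein Accession", "Peptide"]

# Toy multiIndexed dataframe (all values are made up)
toy_df = pd.DataFrame({
    "Protein Accession": ["P2", "P1"],
    "Peptide": ["DDK", "AAK"],
    "Intensity": [2000.0, 1500.0],
}).set_index(index_levels)

# write a flat tab-separated table to memory instead of a file on disk
buffer = io.StringIO()
toy_df.reset_index().to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

# rebuild and sort the multiIndex when loading
loaded_toy_df = pd.read_table(buffer).set_index(index_levels).sort_index()
```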

# Additional Info about Input Files
## mzML Files

### columns


In [16]:
mzml_df = MetaMorpheus_parser(input_files=mzml_file_path)

Your inputs: [['C:\\Users\\Sarah Curtis\\OneDrive - BYU\\Documents\\Single Cell Team Documents\\API_dev\\MetaM\\2ng\\Ex_Auto_J3_30umTB_2ngQC_60m_1.mzML'], None, None, None, None, None, None]


Exception: Function could not identify ex_auto_j3_30umtb_2ngqc_60m_1.mzml. Please rename your file to contain the file type. Ex: .mzml, psm, peptide/pep, protein/prot