# PSM readers

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
# Helper packages
import io
from copy import copy
import numpy as np
import pandas as pd 

# alphabase
from alphabase.psm_reader import psm_reader_provider, psm_reader_yaml

## Background 

The `alphabase.psm_reader` module provides a unifying interface to read PSM tables from different search engines and file formats. It is designed to be easy to use, and to provide a consistent output format in the form of `pandas.DataFrame`s, regardless of the input file format.

### Introduction to peptide spectrum matches (PSMs)

Peptide spectrum matches (PSMs) are the primary output of proteomics search engines. In a PSM table, each row typically represents a single peptide-spectrum-match, i.e. a peptide sequence that the proteomics search engine identified to be compatible with an observed mass spectrum in a given sample. PSM tables contain information about both 1) the *peptide sequence*, 2) the *spectrum*, as well as 3) the *score* assigned to the PSM by the search engine. 

A minimal PSM table could look something like this:

| sample_id | peptide | confidence_score | scan_id |
|-------------|---------|-------| -------|
| 1           | PEPTIDE | 0.99  | 1234   |

In this example, the search engine identified the peptide **PEPTIDE** as a match to the spectrum 1234 sample with `ID 1`, and assigned a confidence score of `0.99` to this match. 

### Search engine outputs

In reality, PSM tables are significantly more complex than this, as they contain additional information on both the spectrum (e.g. the sample run, detector type, precursor intensities, ...), the peptide (e.g. the protein it belongs to, the modifications it carries), and the peptide search (quality control measures). This additional information can be extremely useful for downstream analyses, but also makes PSM tables more difficult to work with, as the exact names may differ between search engines, versions, and file formats. 

#### Unifying properties 

Alphabase aligns the column names to a unified vocabulary, as defined in the `alphabase.psm_reader.psm_reader_yaml` mapping. We can explore this standardization, that facilitates cross-engine comparisons. Note that some search engines use version-dependent names for the same property (indicated by lists) and alphabase can deal with this ambiguity. To support both Bruker and Thermo data, we did not use `Scan Number` in the output dataframe but `spec_idx` (starts with 0). `spec_idx = scan_num - 1` in thermo data. 

In [3]:
# Generate a dataframe that maps the report-specific column names to the unified column names based on alphabase mapping
psm_column_mapping = (
    pd.DataFrame.from_dict({
        mapping.get("reader_type", "psm"): mapping.get("column_mapping", {}) for k, mapping in psm_reader_yaml.items() if k != "modification_mappings"
    })
    .T
    .rename_axis(columns="alphabase (unified name)")
    )

# Order by importance (number of search engines with corresponding column)
columns_ordered = psm_column_mapping.isna().mean(axis=0).sort_values(ascending=True)
psm_column_mapping = (
    psm_column_mapping.loc[:, columns_ordered.index]
    .sort_index(axis=0)
    .dropna(how="all", axis=1)
)

# Compute summary
summary = (
    psm_column_mapping
    .agg([lambda x: x.isna().mean()])
)



## Visualize the mapping of PSM columns to alphabase unified columns
# Stylize with pandas CSS class 
headers = {
    'selector': 'th',#'th:not(.index_name)',
    'props': 'background-color: #18456d; color: white;'
}
cell_hover = {  # for row hover use <tr> instead of <td>
    'selector': 'td:hover',
    'props': [('background-color', '#ffffb3')]
}
index_names = {
    'selector': '.index_name',
    'props': 'font-style: italic; font-weight:bold; position: sticky'
}

summary_row = {"selector": "tbody tr:last-child", "props": [("background-color", "#efefef"), ("font-weight", "bold")]}

caption = {
    "selector": "caption",
    "props": "caption-side: top; font-style: italic; font-size: 12pt; text-align:left; margin-bottom: 10pt;"
}

# Visualize
psm_column_mapping_stylized = (
    psm_column_mapping
        .style

        .concat(
            summary
                .style
                .relabel_index(["Missing"])
                .bar(color='#cccccc', vmin=0, vmax=1)
        )
        .set_caption("Mapping of PSM columns to alphabase unified columns")
        .set_table_styles(
            [headers, cell_hover, index_names, summary_row, caption]
        )

)

psm_column_mapping_stylized

alphabase (unified name),raw_name,charge,rt,mobility,proteins,sequence,score,uniprot_ids,genes,ccs,fdr,scan_num,decoy,precursor_mz,query_id,modified_sequence,intensity,rt_stop,rt_start,peptide_fdr,mods,fragment_intensity,fragment_mz,fragment_type,fragment_charge,fragment_series,fragment_loss_type,scannr,spec_idx,protein_fdr
alphadia,run,charge,rt_observed,mobility,proteins,sequence,score,uniprot_ids,genes,ccs,fdr,,,,,,intensity,rt_stop,rt_start,,mods,,,,,,,,,
alphapept,raw_name,charge,rt,mobility,,,score,,,,q_value,scan_no,decoy,mz,query_idx,,,,,,,,,,,,,,raw_idx,
diann,Run,Precursor.Charge,RT,"['IM', 'IonMobility']",Protein.Names,Stripped.Sequence,CScore,Protein.Ids,Genes,CCS,Q.Value,MS2.Scan,,,,,,RT.Stop,RT.Start,,,,,,,,,,,
library_reader_base,ReferenceRun,PrecursorCharge,"['RT', 'iRT', 'Tr_recalibrated', 'RetentionTime', 'NormalizedRetentionTime']","['Mobility', 'IonMobility', 'PrecursorIonMobility']","['ProteinId', 'ProteinID', 'ProteinName', 'Protein Name']","['PeptideSequence', 'StrippedPeptide']",,"['UniProtIds', 'UniProtID', 'UniprotId']","['GeneName', 'Genes', 'Gene']",CCS,,,,PrecursorMz,,"['ModifiedPeptideSequence', 'ModifiedPeptide']",,,,,,"['LibraryIntensity', 'RelativeIntensity', 'RelativeFragmentIntensity', 'RelativeFragmentIonIntensity']",['ProductMz'],"['FragmentType', 'FragmentIonType', 'ProductType', 'ProductIonType']","['FragmentCharge', 'FragmentIonCharge', 'ProductCharge', 'ProductIonCharge']","['FragmentSeriesNumber', 'FragmentNumber']","['FragmentLossType', 'FragmentIonLossType', 'ProductLossType', 'ProductIonLossType']",,,
maxquant,Raw file,Charge,Retention time,"['Mobility', 'IonMobility', 'K0', '1/K0']",Proteins,Sequence,Score,,"['Gene Names', 'Gene names']",CCS,,"['Scan number', 'MS/MS scan number', 'MS/MS Scan Number', 'Scan index']",Reverse,m/z,,,Intensity,,,,,,,,,,,,,
msfragger_pepxml,raw_name,assumed_charge,retention_time_sec,ion_mobility,protein,peptide,expect,,,,,start_scan,,,spectrum,,,,,,,,,,,,,,,
pfind,raw_name,Charge,RT,,Proteins,Sequence,Final_Score,Proteins,,,Q-value,Scan_No,"['Target/Decoy', 'Targe/Decoy']",,File_Name,,,,,,,,,,,,,,,
sage,filename,charge,rt,mobility,proteins,stripped_peptide,sage_discriminant_score,,,,spectrum_q,,is_decoy,,,peptide,,,,peptide_q,,,,,,,,scannr,,protein_q
spectronaut,ReferenceRun,PrecursorCharge,"['RT', 'iRT', 'Tr_recalibrated', 'RetentionTime', 'NormalizedRetentionTime']","['Mobility', 'IonMobility', 'PrecursorIonMobility']","['Protein Name', 'ProteinId', 'ProteinID', 'ProteinName', 'ProteinGroup', 'ProteinGroups']","['StrippedPeptide', 'PeptideSequence']",,"['UniProtIds', 'UniProtID', 'UniprotId']","['Genes', 'Gene', 'GeneName', 'GeneNames']",CCS,,,,PrecursorMz,,,,,,,,,,,,,,,,
spectronaut_report,R.FileName,charge,"['EG.ApexRT', 'EG.MeanApexRT']",['FG.ApexIonMobility'],"['PG.ProteinNames', 'PG.ProteinGroups']",,,PG.UniProtIds,PG.Genes,,,,,,,,,,,,,,,,,,,,,


#### Unifying peptide modifications 

Alphabase further unifies representations of peptide modifications between the different search engines to the community-driven unimod format.

E.g. the MaxQuant-internal representations of phosphorylated serines are mapped to the unimod representation:

| alphabase/UniMod | MaxQuant |
|------------------|----------|
| Phospho@S | S(Phospho (S)), S(Phospho (ST)), S(Phospho (STY)), S(Phospho (STYDH)), S(ph), pS |

See `alphabase.psm_reader.psm_reader_yaml["modification_mappings"]` for all mappings as parsed dictionaries and `alphabase.constants.const_files.psm_reader_yaml` for the underlying file.

## Code | Read and parse PSM tables

The alphabase `psm_reader` module provides single one-liners to parse proteomics PSM reports to a dataframe for most common search engines via its `alphabase.psm_reader.psm_reader_provider` factory. 

### Available readers 

`alphabase.psm_reader.psm_reader_provider` has registered some basic reader classes. A list of implemented readers can be accessed via its `reader_dict` property: 

In [4]:
all_registered_readers = psm_reader_provider.reader_dict.keys()

# Display all registered readers
sep = "\n\t- "
print("Registered readers in alphabase:", sep.join(sorted(all_registered_readers)), sep=sep)

Registered readers in alphabase:
	- alphadia
	- alphadia_parquet
	- alphapept
	- diann
	- maxquant
	- msfragger
	- msfragger_pepxml
	- msfragger_psm_tsv
	- openswath
	- pfind
	- pfind3
	- sage_parquet
	- sage_tsv
	- speclib_tsv
	- spectronaut
	- spectronaut_report
	- swath


### Interact with the reader provider

#### Example 1 - MaxQuant

We demonstrate how to interact with PSM tables via alphabase based on a minimal example output of the MaxQuant search engine. 

First, let's provide some minimal input, which is the header of a real MaxQuant report

In [5]:
maxquant_example = io.StringIO(
'''Raw file	Scan number	Scan index	Sequence	Length	Missed cleavages	Modifications	Modified sequence	Oxidation (M) Probabilities	Oxidation (M) Score diffs	Acetyl (Protein_N-term)	Oxidation (M)	Proteins	Charge	Fragmentation	Mass analyzer	Type	Scan event number	Isotope index	m/z	Mass	Mass error [ppm]	Mass error [Da]	Simple mass error [ppm]	Retention time	PEP	Score	Delta score	Score diff	Localization prob	Combinatorics	PIF	Fraction of total spectrum	Base peak fraction	Precursor full scan number	Precursor Intensity	Precursor apex fraction	Precursor apex offset	Precursor apex offset time	Matches	Intensities	Mass deviations [Da]	Mass deviations [ppm]	Masses	Number of matches	Intensity coverage	Peak coverage	Neutral loss level	ETD identification type	Reverse	All scores	All sequences	All modified sequences	Reporter PIF	Reporter fraction	id	Protein group IDs	Peptide ID	Mod. peptide ID	Evidence ID	Oxidation (M) site IDs
20190402_QX1_SeVW_MA_HeLa_500ng_LC11	81358	73979	AAAAAAAAAPAAAATAPTTAATTAATAAQ	29	0	Unmodified	_(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation (M))PAAAATAPTTAATTAATAAQ_			0	0	sp|P37108|SRP14_HUMAN	3	HCD	FTMS	MULTI-MSMS	13	1	790.07495	2367.203	0.35311	0.00027898	-0.061634807	70.261	0.012774	41.423	36.666	NaN	NaN	1	0	0	0	81345	10653955	0.0338597821787898	-11	0.139877319335938	y1;y2;y3;y4;y11;y1-NH3;y2-NH3;a2;b2;b3;b4;b5;b6;b7;b8;b9;b11;b12;b6(2+);b8(2+);b13(2+);b18(2+)	2000000;2000000;300000;400000;200000;1000000;400000;300000;600000;1000000;2000000;3000000;3000000;3000000;3000000;2000000;600000;500000;1000000;2000000;300000;200000	5.2861228709844E-06;-6.86980268369553E-05;-0.00238178789771837;0.000624715964988809;-0.0145624692099773;-0.000143471782706683;-0.000609501446461991;-0.000524972720768346;0.00010190530804266;5.8620815195809E-05;0.000229901232955854;-0.000108750048696038;-0.000229593152369034;0.00183148682538103;0.00276641182404092;0.000193118923334623;0.00200988580445483;0.000102216846016745;5.86208151389656E-05;0.000229901232955854;-0.00104559184393338;0.00525030008475369	0.0359413365445091;-0.314964433555295;-8.23711898839045;1.60102421155213;-14.8975999917227;-1.10320467763838;-3.03102462870716;-4.56152475051625;0.712219104095465;0.273777366204575;0.806231096969562;-0.305312183824154;-0.537399178230218;3.67572664689217;4.85930954169285;0.301587577451224;2.48616190909398;0.116225745519871;0.273777365939099;0.806231096969562;-2.19774169175011;7.53961026980589	147.076413378177;218.113601150127;289.153028027798;390.197699998035;977.50437775671;130.050013034583;201.087592852046;115.087114392821;143.081402136892;214.118559209185;285.155501716567;356.192954155649;427.230188786552;498.265241494374;569.301420357176;640.341107437877;808.429168310795;879.468189767554;214.118559209185;285.155501716567;475.757386711244;696.362265007215	22	0.262893575628735	0.0826446280991736	None	Unknown		41.4230894199432;4.75668724862449;3.9515580701967	AAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQILQGK;PVTLWITVTHMQADEVSVWR	_AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVSVTQILQGK_;_PVTLWITVTHMQADEVSVWR_			0	1443	0	0	0	
20190402_QX1_SeVW_MA_HeLa_500ng_LC11	81391	74010	AAAAAAAAAAPAAAATAPTTAATTAATAAQ	29	0	Unmodified	_AAAAAAAAAPAAAATAPTTAATTAATAAQ_			0	0	sp|P37108|SRP14_HUMAN	2	HCD	FTMS	MULTI-MSMS	14	0	1184.6088	2367.203	0.037108	4.3959E-05	1.7026696	70.287	7.1474E-09	118.21	100.52	NaN	NaN	1	0	0	0	81377	9347701	0.166790347889974	-10	0.12664794921875	y1;y2;y3;y4;y5;y9;y12;y13;y14;y20;y13-H2O;y20-H2O;y1-NH3;y20-NH3;b3;b4;b5;b6;b7;b8;b9;b11;b12;b13;b14;b15;b16;b19;b15-H2O;b16-H2O	500000;600000;200000;400000;200000;100000;200000;1000000;200000;300000;200000;100000;100000;70000;300000;900000;2000000;3000000;5000000;8000000;6000000;600000;800000;600000;200000;300000;200000;300000;300000;1000000	-0.000194444760495571;0.000149986878682284;0.000774202587820128;-0.0002445094036716;0.000374520568641401;-0.00694293246522193;-0.0109837291331587;-0.0037745820627606;-0.000945546471939451;0.00152326440706929;0.00506054832726477;0.00996886361417637;6.25847393393997E-05;-0.024881067836759;-3.11821549132674E-05;-0.000183099230639527;0.000161332473453513;0.000265434980121881;0.000747070697229901;0.000975534518261156;0.00101513939785036;0.00651913000274362;0.0058584595163893;0.00579536744021425;0.00131097834105276;-0.0131378531671089;0.00472955218901916;-0.00161006322559842;-0.00201443239325272;0.0227149399370319	-1.32206444236914;0.687655553213019;2.6775131607882;-0.626628140021726;0.811995006209331;-8.6203492854282;-10.1838066275079;-3.21078702288986;-0.758483069159249;0.881072738747222;4.37168212373889;5.82682888353564;0.481236695337485;-14.5343986203644;-0.145630261806375;-0.642102166533079;0.452935954800214;0.621293379181583;1.49934012872483;1.71355878380837;1.58531240493271;8.06399202403175;6.6614096214532;6.09718023739784;1.28333378040908;-11.7030234519348;3.96235146626144;-1.07856912288932;-1.82370619437775;19.3220953109188	147.07661310906;218.113382465221;289.149872037312;390.198569223404;461.235063981231;805.411965958065;1078.54847749073;1175.59403219566;1246.62831694787;1728.87474561429;1157.57463237897;1710.85573532879;130.049806978061;1711.87460084504;214.118649012155;285.155914717031;356.192684073126;427.22969375842;498.266325910503;569.303211234482;640.340285417402;808.424659066597;879.462433524883;950.49961040476;1021.54120858166;1122.60333588727;1193.62258226971;1492.77704268533;1104.58164778019;1175.59403219566	30	0.474003002083763	0.167630057803468	None	Unknown		118.209976573419;17.6937689289157;17.2534171481793	AAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSVLYLK;VGSSVPSKASELVVMGDHDAARR	_AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxidation (M))QSEQLQSVLYLK_;_VGSSVPSKASELVVMGDHDAARR_			1	1443	0	0	1	
20190402_QX1_SeVW_MA_HeLa_500ng_LC11	107307	98306	AAAAAAAGDSDSWDADAFSVEDPVRK	26	1	Acetyl (Protein_N-term)	_(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_			1	0	sp|O75822|EIF3J_HUMAN	3	HCD	FTMS	MULTI-MSMS	10	2	879.06841	2634.1834	-0.93926	-0.00082567	-3.2012471	90.978	2.1945E-12	148.95	141.24	NaN	NaN	1	0	0	0	107297	10193939	0.267970762043589	-8	0.10211181640625	y1;y2;y4;y5;y6;y7;y8;y9;y10;y11;y12;y13;y14;y15;y17;y18;y19;y20;y21;y23;y21-H2O;y1-NH3;y19-NH3;y14(2+);y16(2+);y22(2+);a2;b2;b3;b4;b5;b6;b7	300000;200000;3000000;600000;1000000;500000;2000000;1000000;1000000;1000000;90000;1000000;400000;900000;1000000;400000;3000000;2000000;1000000;400000;100000;200000;200000;80000;100000;200000;200000;2000000;5000000;5000000;5000000;2000000;300000	1.34859050149316E-07;-6.05140996867704E-06;2.27812602133781E-05;0.00128986659160546;-0.00934536073077652;0.000941953783126337;-0.00160424237344614;-0.00239257341399934;-0.00111053968612396;-0.00331340710044969;0.00330702864630439;0.000963683996815234;0.00596290290945944;-0.00662057038289277;-0.0117122701335575;0.00777853472800416;0.0021841542961738;0.000144322111736983;-0.00087403893667215;0.0197121595674616;-0.021204007680808;-0.000308954599830713;-0.026636719419912;-0.0137790992353075;0.00596067266928912;-0.0077053835773313;9.11402199221811E-06;-0.000142539300128419;-0.000251999832926231;1.90791054137662E-05;-0.00236430185879044;-9.54583337602344E-05;-0.000556959493223985	0.000916705048437201;-0.0199575598103408;0.0456231928690862;2.09952637717462;-12.5708704058425;1.11808305811426;-1.72590731777249;-2.22239181008062;-0.967696370445928;-2.62418809422166;2.47964286628144;0.665205752892023;3.64753748704453;-3.84510115530963;-6.08782672045773;3.81508105974837;1.04209904973991;0.0666012719936656;-0.390545453668809;8.28224925531311;-9.55133250134922;-2.37499239179248;-12.8127653858411;-16.846761946123;6.48662354975264;-6.67117082062383;0.0580151981289049;-0.770098855873447;-0.983876895688683;0.0583162347158579;-5.93738717724506;-0.203431522818505;-1.03087538746314	147.112804035741;303.21392125011;499.33507018564;614.360746132308;743.413974455831;842.472101057517;929.506675663573;1076.57587791081;1147.61170966489;1262.6408555643;1333.67134891635;1448.700635293;1634.77494902759;1721.81956091078;1923.88362405243;2038.89107627957;2095.9181343836;2166.95728800359;2237.99542015244;2380.04906152953;2220.00518543488;130.0865640237;2078.92040615582;817.907873297785;918.917619246831;1155.02717356753;157.097144992378;185.0922112678;256.129434516133;327.166277224995;398.205774393759;469.240619338034;540.278194626993	33	0.574496146107112	0.14410480349345	None	Unknown		148.951235201399;7.71201258444522;7.36039532447559	AAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDGER;HTLTSFWNFKAGCEEKCYSNR	_(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_;_PSRQESELM(Oxidation (M))WQWVDQRSDGER_;_HTLTSFWNFKAGCEEKCYSNR_			2	625	1	1	2	'''
)

# Parse with pandas for visualization purposes
pd.read_csv(copy(maxquant_example), sep="\t")

Unnamed: 0,Raw file,Scan number,Scan index,Sequence,Length,Missed cleavages,Modifications,Modified sequence,Oxidation (M) Probabilities,Oxidation (M) Score diffs,...,All sequences,All modified sequences,Reporter PIF,Reporter fraction,id,Protein group IDs,Peptide ID,Mod. peptide ID,Evidence ID,Oxidation (M) site IDs
0,20190402_QX1_SeVW_MA_HeLa_500ng_LC11,81358,73979,AAAAAAAAAPAAAATAPTTAATTAATAAQ,29,0,Unmodified,_(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation ...,,,...,AAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQ...,_AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVS...,,,0,1443,0,0,0,
1,20190402_QX1_SeVW_MA_HeLa_500ng_LC11,81391,74010,AAAAAAAAAAPAAAATAPTTAATTAATAAQ,29,0,Unmodified,_AAAAAAAAAPAAAATAPTTAATTAATAAQ_,,,...,AAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSV...,_AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxid...,,,1,1443,0,0,1,
2,20190402_QX1_SeVW_MA_HeLa_500ng_LC11,107307,98306,AAAAAAAGDSDSWDADAFSVEDPVRK,26,1,Acetyl (Protein_N-term),_(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV...,,,...,AAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDG...,_(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV...,,,2,625,1,1,2,


Then use the `psm_reader_provider.get_reader` method to get the maxquant-report reader. Use the `import_file` method to read the file, which is directly returned as a pandas DataFrame. 

In [6]:
maxquant_reader = psm_reader_provider.get_reader('maxquant')

# Import the file or a bytestream
maxquant_report = maxquant_reader.import_file(maxquant_example)

# The parsed PSM is also stored in the reader class as `psm_df` attribute
# maxquant_report = maxquant_reader.psm_df

maxquant_report



Unnamed: 0,sequence,charge,rt,scan_num,raw_name,precursor_mz,score,proteins,decoy,spec_idx,mods,mod_sites,nAA,rt_norm
0,AAAAAAAAAAPAAAATAPTTAATTAATAAQ,2,70.287,81391,20190402_QX1_SeVW_MA_HeLa_500ng_LC11,1184.6088,118.21,sp|P37108|SRP14_HUMAN,0,81390,,,30,0.772571


#### Example 2 - Set custom arguments 

One can also customize the reader by setting specific arguments. For example, one can set more stringent `fdr` filters (default: $fdr=0.01$). We showcase this on the example of a DIANN PSM report table.

In [7]:
diann_tsv_example = io.StringIO(r'''File.Name	Run	Protein.Group	Protein.Ids	Protein.Names	Genes	PG.Quantity	PG.Normalised	PG.MaxLFQ	Genes.Quantity	Genes.Normalised	Genes.MaxLFQ	Genes.MaxLFQ.Unique	Modified.Sequence	Stripped.Sequence	Precursor.Id	Precursor.Charge	Q.Value	Global.Q.Value	Protein.Q.Value	PG.Q.Value	Global.PG.Q.Value	GG.Q.Value	Translated.Q.Value	Proteotypic	Precursor.Quantity	Precursor.Normalised	Precursor.Translated	Quantity.Quality	RT	RT.Start	RT.Stop	iRT	Predicted.RT	Predicted.iRT	Lib.Q.Value	Ms1.Profile.Corr	Ms1.Area	Evidence	Spectrum.Similarity	Mass.Evidence	CScore	Decoy.Evidence	Decoy.CScore	Fragment.Quant.Raw	Fragment.Quant.Corrected	Fragment.Correlations	MS2.Scan	IM	iIM	Predicted.IM	Predicted.iIM
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636.d	20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636	Q9UH36	Q9UH36		SRRD	3296.49	3428.89	3428.89	3296.49	3428.89	3428.89	3428.89	(UniMod:1)AAAAAAALESWQAAAPR	AAAAAAALESWQAAAPR	(UniMod:1)AAAAAAALESWQAAAPR2	2	3.99074e-05	1.96448e-05	0.000159821	0.000159821	0.000146135	0.000161212	0	1	3296.49	3428.89	3296.49	0.852479	19.9208	19.8731	19.9685	123.9	19.8266	128.292	0	0.960106	5308.05	1.96902	0.683134	0.362287	0.999997	1.23691	3.43242e-05	1212.01;2178.03;1390.01;1020.01;714.008;778.008;	1212.01;1351.73;887.591;432.92;216.728;732.751;	0.956668;0.757581;0.670497;0.592489;0.47072;0.855203;	30053	1.19708	1.19328	1.19453	1.19469
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642.d	20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642	Q9UH36	Q9UH36		SRRD	2365	2334.05	2334.05	2365	2334.05	2334.05	2334.05	(UniMod:1)AAAAAAALESWQAAAPR	AAAAAAALESWQAAAPR	(UniMod:1)AAAAAAALESWQAAAPR2	2	0.000184434	1.96448e-05	0.000596659	0.000596659	0.000146135	0.000604961	0	1	2365	2334.05	2365	0.922581	19.905	19.8573	19.9527	123.9	19.782	128.535	0	0.940191	4594.04	1.31068	0.758988	0	0.995505	0.28633	2.12584e-06	1209.02;1210.02;1414.02;1051.01;236.003;130.002;	1209.02;1109.89;732.154;735.384;0;46.0967;	0.919244;0.937624;0.436748;0.639369;0.296736;0.647924;	30029	1.195	1.19328	1.19381	1.19339
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648.d	20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648	Q9UH36	Q9UH36		SRRD	1664.51	1635.46	1635.47	1664.51	1635.46	1635.47	1635.47	(UniMod:1)AAAAAAALESWQAAAPR	AAAAAAALESWQAAAPR	(UniMod:1)AAAAAAALESWQAAAPR2	2	0.000185123	1.96448e-05	0.000307409	0.000307409	0.000146135	0.000311332	0	1	1664.51	1635.46	1664.51	0.811147	19.8893	19.8416	19.937	123.9	19.7567	128.896	0	0.458773	6614.06	1.7503	0.491071	0.00111683	0.997286	1.92753	2.80543e-05	744.01;1708.02;1630.02;1475.02;0;533.006;	322.907;808.594;577.15;536.033;0;533.006;	0.760181;0.764072;0.542005;0.415779;0;0.913438;	30005	1.19409	1.19328	1.19323	1.19308
''')

pd.read_csv(copy(diann_tsv_example), sep="\t")

Unnamed: 0,File.Name,Run,Protein.Group,Protein.Ids,Protein.Names,Genes,PG.Quantity,PG.Normalised,PG.MaxLFQ,Genes.Quantity,...,Decoy.Evidence,Decoy.CScore,Fragment.Quant.Raw,Fragment.Quant.Corrected,Fragment.Correlations,MS2.Scan,IM,iIM,Predicted.IM,Predicted.iIM
0,F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_...,20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp...,Q9UH36,Q9UH36,,SRRD,3296.49,3428.89,3428.89,3296.49,...,1.23691,3.4e-05,1212.01;2178.03;1390.01;1020.01;714.008;778.008;,1212.01;1351.73;887.591;432.92;216.728;732.751;,0.956668;0.757581;0.670497;0.592489;0.47072;0....,30053,1.19708,1.19328,1.19453,1.19469
1,F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_...,20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp...,Q9UH36,Q9UH36,,SRRD,2365.0,2334.05,2334.05,2365.0,...,0.28633,2e-06,1209.02;1210.02;1414.02;1051.01;236.003;130.002;,1209.02;1109.89;732.154;735.384;0;46.0967;,0.919244;0.937624;0.436748;0.639369;0.296736;0...,30029,1.195,1.19328,1.19381,1.19339
2,F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_...,20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp...,Q9UH36,Q9UH36,,SRRD,1664.51,1635.46,1635.47,1664.51,...,1.92753,2.8e-05,744.01;1708.02;1630.02;1475.02;0;533.006;,322.907;808.594;577.15;536.033;0;533.006;,0.760181;0.764072;0.542005;0.415779;0;0.913438;,30005,1.19409,1.19328,1.19323,1.19308


By passing the more stringent `fdr` filter ($fdr_{\text{stringent}} = 10^{-4}$) in the second function call, two precursors with an fdr of $\sim0.0002$ are removed from the resulting table

In [8]:
# Read PSM reports with one liners
diann_psm_standard = psm_reader_provider.get_reader('diann').import_file(copy(diann_tsv_example))
diann_psm_custom_fdr = psm_reader_provider.get_reader('diann', fdr=1e-4).import_file(copy(diann_tsv_example))

print("Number of observations (Standard filter):", len(diann_psm_standard))
print("Number of observations (Stringent filter):", len(diann_psm_custom_fdr))

Number of observations (Standard filter): 3
Number of observations (Stringent filter): 1


## Conclusion

Overall, this tutorial 

- Explained how `alphabase` maps different search engine outputs to a unified format
- Provides examples on how to read PSM tables from different search engines
- Gives an overview over the available and implemented readers