# AlphaPept workflow and files

## Core
The core function of AlphaPept is `interface.run_complete_workflow()`. This function requires the settings as a dictionary containing the parameters and file paths. On disk, settings are stored as `*.yaml` files. When called, the core function runs a complete workflow based on the given settings.

<img src="images/workflow/core.png" align="center" style="width:600px"/>


## GUI

When the AlphaPept GUI is started, either via the shortcut created by the one-click installer or via Python (`python -m alphapept gui`), the AlphaPept server starts. It can be accessed via a browser and provides a graphical user interface (GUI) to the AlphaPept functionality. The server extends the core function to a processing framework.
The server is centered around three folders, `Queue`, `Failed`, and `Finished`, which are created in the `.alphapept`-folder in the user's home directory. Whenever a new `*.yaml`-file is found in the `Queue`-folder, the server hands it over to the core function and starts processing. There are three ways to add files to the `Queue`-folder:
1. Via the `New experiment`-tab in the GUI
2. Manually copying a `*.yaml`-file into the `Queue`-folder
3. Automatically via the `File watcher`.

The `File watcher` can be set up to monitor a folder; whenever a new file matching pre-defined settings is copied to the folder, it will create a `*.yaml`-file and add it to the `Queue`-folder.
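
The manual route (option 2) can also be scripted. The following is a minimal sketch, assuming the default `Queue` location described above; the function name `queue_settings_file` and its `queue_dir` parameter are hypothetical helpers for illustration and not part of the AlphaPept API.

```python
import shutil
from pathlib import Path
from typing import Optional


def queue_settings_file(yaml_path: str, queue_dir: Optional[str] = None) -> Path:
    """Copy a settings file into the Queue folder and return the new path."""
    source = Path(yaml_path)
    if source.suffix.lower() != ".yaml":
        raise ValueError(f"Expected a *.yaml settings file, got: {source.name}")

    # Assumption: the default location is ~/.alphapept/Queue, as described above.
    target_dir = Path(queue_dir) if queue_dir else Path.home() / ".alphapept" / "Queue"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / source.name
    if target.exists():
        # Refuse to overwrite a file that is already waiting in the queue
        raise FileExistsError(f"{target} is already queued")

    shutil.copy2(source, target)
    return target
```

Calling `queue_settings_file("my_experiment.yaml")` would then place a copy in the queue, where a running server should pick it up.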

Whenever an experiment succeeds, summary information of the experiment is appended to the `*.yaml`-file, and the file is moved to the `Finished`-folder. As the `*.yaml`-file is very small (~kB), it is intended to serve as a history of processed files.

Whenever an experiment fails, the `*.yaml`-file will be moved to the `Failed`-folder. It can be moved from there to the `Queue`-folder for reprocessing.
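
Moving failed runs back for reprocessing can be sketched in a few lines. This assumes the folder layout described above (`Failed` and `Queue` inside `.alphapept` in the home directory); `requeue_failed` and `base_dir` are hypothetical names chosen for this example.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def requeue_failed(base_dir: Optional[str] = None) -> List[Path]:
    """Move every *.yaml file from Failed to Queue and return the new paths."""
    # Assumption: the server folders live in ~/.alphapept unless stated otherwise.
    base = Path(base_dir) if base_dir else Path.home() / ".alphapept"
    failed_dir = base / "Failed"
    queue_dir = base / "Queue"
    queue_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for yaml_file in sorted(failed_dir.glob("*.yaml")):
        target = queue_dir / yaml_file.name
        if target.exists():
            continue  # skip rather than overwrite an already queued file
        shutil.move(str(yaml_file), str(target))
        moved.append(target)
    return moved
```

It is usually worth fixing the cause of the failure (for example, a wrong file path in the settings) before requeueing.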


### History and Results

AlphaPept screens all `*.yaml`-files in the `Finished`-folder and plots a run history based on the summary information. This is especially useful for QC or comparison purposes. Additionally, the `*.yaml`-files can be used to investigate the results of a run.

<img src="images/workflow/gui.png" align="center" style="width:600px"/>

## Output Files

For each run, AlphaPept creates several output files:
- For each raw file, there will be a `.ms_data.hdf`-file with raw-specific data, such as `feature_table`, `first_search`, `second_search` and `peptide_fdr`.
- For the entire experiment, there will be a `results.hdf` (name can be defined in the settings), which contains experiment-specific data, such as `protein_fdr` and the `protein_table` (containing quantified proteins over all files).
- In addition to the `results.hdf`, there will be a `*.yaml`-file which contains the run settings and summary information of the run. This `*.yaml`-file can serve as a template to process other files with the same settings.
- If a database is created from `FASTA`-files, there will be a `database.hdf` (name can be defined in the settings). It contains theoretical spectra and can be reused for other experiments, which speeds up the total analysis time.

The `ms_data.hdf`, `results.hdf` and database containers can be accessed via the `alphapept.io` library. The GUI also allows you to explore these files. Additionally, the `results.hdf` can be loaded directly via the pandas package (e.g. `pd.read_hdf('results.hdf', 'protein_table')`).

For easier access, AlphaPept directly exports the most relevant tables as `*.csv`:
- `results.csv`: The search results after `protein_fdr`.
- `results_proteins.csv`: The quantified proteins per file.

## Column headers

Below is a description of the column headers in the output files. 

### protein_fdr

Name | Description |
--- | --- |
abs_delta_m_ppm | absolute value  of `delta_m` in ppm
b-H2O_hits | b-ion hit with a water loss
b-NH3_hits | b-ion hit with an NH3 loss
b_hits | number of b ion hits
charge | charge of the peptide
db_idx | index to the theoretical database
decoy | is the sequence a decoy or a hit (Yes / No)
decoys_cum | cumulative number of decoys in table (used for FDR calculation)
delta_m |  mean mass delta when comparing experimental fragments to theoretical fragments when searching
delta_m_ppm | `delta_m` in ppm
dist | a metric used to measure the distance of an MS1 feature (quantification) to a matching MS2 spectrum (identification). This is important in the mapping of MS1 features to MS2 spectra 
fasta_index | index to the fasta file that you use for searching
fdr | calculated false discovery rate for this peptide in the table. As the PSM score decreases, more decoys are found and the FDR increases until it reaches the FDR threshold cutoff, below which no further hits are counted.
feature_idx | index to feature table from feature finding
feature_rank | multiple MS1 features can be mapped to a single MS2 spectrum. The rank indicates how close the feature was to the spectrum in comparison to other features in close distance.
fwhm | fwhm of the feature
hits | total number of b- and y-ion hits. A hit occurs when a theoretical fragment can be found within the `frag_tol` of an experimentally recorded fragment
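
To make the `delta_m_ppm` and `hits` columns more concrete, here is a minimal sketch of a ppm deviation and of counting fragment hits within a tolerance. It assumes that the tolerance is given in ppm relative to the theoretical fragment mass and that the deviation is computed as experimental minus theoretical; both are assumptions for illustration. The function names are hypothetical, and this is not AlphaPept's actual search code.

```python
from bisect import bisect_left
from typing import Sequence


def ppm_error(experimental: float, theoretical: float) -> float:
    """Signed mass deviation in parts per million (assumed sign convention)."""
    return (experimental - theoretical) / theoretical * 1e6


def count_hits(
    theoretical: Sequence[float],
    experimental: Sequence[float],
    frag_tol_ppm: float,
) -> int:
    """Count theoretical fragments that have an experimental peak within tolerance.

    `experimental` must be sorted in ascending order.
    """
    hits = 0
    for mass in theoretical:
        window = mass * frag_tol_ppm * 1e-6  # tolerance in absolute mass units
        # First experimental peak at or above the lower bound of the window
        idx = bisect_left(experimental, mass - window)
        if idx < len(experimental) and experimental[idx] <= mass + window:
            hits += 1
    return hits
```

With a tolerance of 20 ppm, a theoretical fragment at 100.0 matches an experimental peak at 100.001 (about 10 ppm away), while a fragment at 200.0 does not match a peak at 200.01 (about 50 ppm away).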


The remaining columns (from `fwhm` onward) are: `fwhm`, `hits`, `index`, `int_apex`, `int_ratio`, `int_sum`, `ion_idx`, `ion_int`, `ion_types`, `mass`, `matched_int`, `matched_int_ratio`, `matched_ion_fraction`, `mz`, `n_AA`, `n_internal`, `n_ions`, `n_missed`, `naked_sequence`, `o_mass`, `o_mass_ppm`, `o_mass_ppm_raw`, `o_mass_raw`, `precursor`, `q_value`, `query_idx`, `rank`, `rank_precursor`, `raw_idx`, `raw_rank`, `rt`, `rt_apex`, `rt_end`, `rt_start`, `scan_no`, `score`, `score_precursor`, `sequence`, `target`, `target_cum`, `target_precursor`, `total_int`, `x_tandem`, `y-H2O_hits`, `y-NH3_hits`, `y_hits`, `filename`, `shortname`, `protein`, `protein_group`, `razor`, `protein_idx`, `decoy_protein`, `n_possible_proteins`, `index_protein_group`, `score_protein_group`, `target_protein_group`, `target_cum_protein_group`, `decoys_cum_protein_group`, `fdr_protein_group`, `q_value_protein_group`.


Name | Description |
--- | --- |
fdr | calculated false discovery rate for this peptide, where the highest-scoring spectra have an FDR closest to 0. As the PSM score decreases, the FDR increases until it reaches the FDR threshold cutoff, below which no further hits are counted
fdr_protein | calculated false discovery rate for a protein, where the highest-scoring peptide has an FDR closest to 0. As the PSM score decreases, the FDR increases until it reaches the FDR threshold cutoff, below which no further peptides are counted
index | ?
index_protein | ?
int_apex | intensity at the apex (top) of the spectral feature
int_ratio | ?
int_sum | ? sum of the intensity of the feature
ion_idx | index of the matched ions
mass | (m/z)
matched_int | ?
matched_int_ratio | how much of the total intensity for a specific peptide could be matched. The top-scoring feature has the highest matched intensity ratio
matched_ion_fraction | ?
mz | mass / charge
n_AA | number of amino acids in a sequence
n_internal | number of internal modifications
n_ions | number of matched b and y ions and water losses
n_missed | number of missed cleavages
naked_sequence | amino acid sequence without PTMs
o_mass | the offset (difference) between theoretical and experimental mass
o_mass_ppm | the offset (difference) between theoretical and experimental mass in parts per million
precursor | sequence with appended charge information
protein | name of the protein that is matched
protein_group | name of the protein group that is matched
q_value | another metric used in estimating the rank of a PSM; follows a similar trend to the FDR
q_value_protein | another metric used in estimating the rank of a protein; follows a similar trend to the FDR
query_idx | since one feature can belong to multiple spectra, the query index can be used to differentiate between distinct feature matches to a single experimental spectrum (see raw_idx)
rank | rank for each spectrum
rank_precursor | rank for each precursor (with charge)
raw_idx | the index of the actual spectrum during a run (cf. query_idx)
raw_rank | rank for each spectrum
razor | whether this protein is a razor protein or not
rt | retention time of the elution peak
rt_apex | retention time at the apex of the feature
rt_end | the retention time when the peak finishes eluting
rt_start | the retention time when the peak starts to elute
scan_no | the scan number of the PSM as reported by the mass spectrometer
score | matching score for the PSM
score_precursor | matching score for the PSM with charge
score_protein | matching score for the peptide
sequence | amino acid sequence containing the PTMs
target | is the sequence a target (Yes / No)
target_cum | the cumulative number of targets
target_cum_protein | cumulative number of hits for a target protein (see decoy_cum_protein)
target_precursor | ? can the target peptide be matched to a precursor?
target_protein | ? can the target be matched to a protein?
total_int | total intensity
x_tandem | hypergeometric score of b ions and y ions and the matched intensity ratios. The highest-scoring peptide will typically also have the highest X! Tandem score
y-H2O_hits | y-ion hit with a water loss
y-NH3_hits | y-ion hit with an NH3 loss
y_hits | number of y ion hits
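
The relationship between the `score`, `target_cum`, `decoys_cum`, `fdr` and `q_value` columns can be illustrated with the standard target-decoy estimate. The following is a minimal sketch of that general technique, assuming the FDR is estimated as the ratio of cumulative decoys to cumulative targets and the q-value as the lowest FDR at which a hit would still be accepted. The function name is hypothetical, and AlphaPept's own implementation may differ in detail.

```python
from typing import List, Tuple


def target_decoy_fdr(psms: List[Tuple[float, bool]]) -> List[dict]:
    """Take (score, is_decoy) pairs and return rows sorted by descending score."""
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)

    rows = []
    target_cum = 0
    decoys_cum = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys_cum += 1
        else:
            target_cum += 1
        # Estimated FDR when accepting everything down to this score
        fdr = decoys_cum / target_cum if target_cum else 1.0
        rows.append({
            "score": score,
            "decoy": is_decoy,
            "target_cum": target_cum,
            "decoys_cum": decoys_cum,
            "fdr": fdr,
        })

    # q-value: running minimum of the FDR, taken from the lowest score upward,
    # so that it never increases as the score improves.
    q_value = float("inf")
    for row in reversed(rows):
        q_value = min(q_value, row["fdr"])
        row["q_value"] = q_value
    return rows
```

For five hits with scores 10, 9, 8, 7 and 6, where only the hit with score 8 is a decoy, the FDR column reads 0, 0, 0.5, 0.33, 0.25, and the q-values read 0, 0, 0.25, 0.25, 0.25. Filtering at a threshold then amounts to keeping all rows whose q-value is at or below the cutoff.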

## Downstream analysis

AlphaPept offers some basic plots in the results section (e.g., volcano, heatmap, and PCA). The `*.csv`-format should be generic enough to use with multiple other tools. Feel free to reach out if you have ideas for plots or find that the output format is not supported or is missing required columns. To reach out, report an issue [here](https://github.com/MannLabs/alphapept/issues/new/choose) or send an email to opensource@alphapept.com.
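
As an illustration of how generic the `*.csv` export is, the following sketch reads a protein table with the Python standard library and log2-transforms the intensity columns. It rests on two assumptions: that the first column (with an empty header) holds the protein identifier, and that the intensity columns of interest contain the marker `LFQ` in their name. Check both against the header of your own `results_proteins.csv`. The function name `load_log2_intensities` is hypothetical.

```python
import csv
import math
from typing import Dict


def load_log2_intensities(
    csv_path: str, intensity_marker: str = "LFQ"
) -> Dict[str, Dict[str, float]]:
    """Return {protein: {column: log2(intensity)}} for columns containing the marker."""
    table: Dict[str, Dict[str, float]] = {}
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Assumption: column 0 is the protein identifier; pick matching data columns.
        columns = [
            (i, name)
            for i, name in enumerate(header)
            if i > 0 and intensity_marker in name
        ]
        for row in reader:
            if not row:
                continue
            values = {}
            for i, name in columns:
                cell = row[i].strip() if i < len(row) else ""
                try:
                    intensity = float(cell)
                except ValueError:
                    continue  # empty cell: protein not quantified in this file
                # Zero intensities cannot be log-transformed (NaN fails this test too)
                if intensity > 0:
                    values[name] = math.log2(intensity)
            table[row[0]] = values
    return table
```

Missing values are simply left out of the inner dictionaries, so downstream code can decide how to handle proteins that were not quantified in every file.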

### Using with Perseus

Perseus offers a generic table import, so you can directly use the `results_proteins.csv`.

#### Example: Volcano-Plot
An excellent tutorial for creating volcano-plots with Perseus can be found [here](http://www.coxdocs.org/doku.php?id=perseus:user:use_cases:interactions).

Below is a quickstart for using AlphaPept with Perseus (tested with `1.6.15.0`). The dataset used here is `PXD006109` from the test runner (multi-species quantification test) with six files (three per group).

1. Open Perseus.
<img src="images/workflow/perseus_0.PNG" align="center"/>

2. Drag and drop the `results_proteins.csv` in the central pane of Perseus. The `Generic matrix upload`-window will open.
<img src="images/workflow/perseus_1.PNG" align="center"/>

3. Select the appropriate columns (e.g., LFQ for LFQ-intensities) and assign them to Main with the `>`-button. The first row is empty; assign it to Text. Click `OK`, and the table should be loaded.
<img src="images/workflow/perseus_2.PNG" align="center"/>

4. Click on the `f(x)`-button and press `OK` on the window that opens to apply a `log2(x)`-transformation.
<img src="images/workflow/perseus_3.PNG" align="center"/>

5. Click on `Annot. rows` > `Categorical annotation rows` to assign a group for each file. Select multiple entries and click on the checkmark to assign multiple groups at the same time. Click `OK` to close the window.
<img src="images/workflow/perseus_4.PNG" align="center"/>

6. Click on the `Volcano plot`-symbol in the upper right `Analysis`-column. For the tutorial, we keep the standard settings and press `OK`.

7. You can double-click on the small volcano plot to show the plot.
<img src="images/workflow/perseus_5.PNG" align="center"/>

Enjoy your volcano-plot.