# Explore MaxQuant (MQ) output files of single runs

The `project/10_training_data.ipynb` notebook extracts the information to be used as training data. For a specific file, one could also use the retention time analysis to identify _valid_ co-occurring peptides to be used during training. Potentially, this preprocessing step can also be applied at inference time.

This notebook contains some relevant analyses for a specific `txt` output folder in the current project.

##### Analysis overview

> Report for example data

- relation between `peptides.txt` and `evidence.txt`

In [None]:
import logging
import os
from pathlib import Path
import random

import ipywidgets as widgets
import pandas as pd
# pd.options.display.float_format = '{:,.1f}'.format

from vaep.io.mq import FASTA_KEYS, MaxQuantOutput, MaxQuantOutputDynamic
from vaep.io import search_files, search_subfolders

##################
##### CONFIG #####
##################

from config import FIGUREFOLDER
from config import FOLDER_RAW_DATA
from config import FOLDER_KEY  # defines how filenames are parsed for use as indices
from config import FOLDER_DATA  # project folder for storing the data

print(f"Search Raw-Files on path: {FOLDER_RAW_DATA}")

In [None]:
from datetime import datetime

#Delete Jupyter notebook root logger handler
logger = logging.getLogger()
logger.handlers = []

# logger = logging.getLogger(mq_output.folder.stem)
logger = logging.getLogger('vaep')
logger.setLevel(logging.INFO)

c_handler = logging.StreamHandler()
c_handler.setLevel(logging.INFO)


date_log_file = "{:%y%m%d_%H%M}".format(datetime.now())
f_handler = logging.FileHandler(f"log_01_explore_raw_MQ_{date_log_file}.txt")
f_handler.setLevel(logging.INFO)

c_format = logging.Formatter(
    '%(name)s - %(levelname)-8s %(message)s ')

c_handler.setFormatter(c_format)
f_handler.setFormatter(c_format)

logger.handlers = []  #remove any handler in case you reexecute the cell
logger.addHandler(c_handler)
logger.addHandler(f_handler)

In [None]:
logger.handlers

In [None]:
folders = search_subfolders(path=FOLDER_RAW_DATA, depth=1, exclude_root=True)
w_folder = widgets.Dropdown(options=folders, description='Select a folder')
w_folder

In [None]:
mq_output = MaxQuantOutput(folder=w_folder.value)

## Some important columns

Grouped by a namedtuple allowing attribute access

In [None]:
from vaep.io.mq import mq_col
mq_col

## `peptides.txt`

> For reference on final "result"

In [None]:
pd.options.display.max_columns = len(mq_output.peptides.columns)
mq_output.peptides

`peptides.txt` contains aggregated peptides

In [None]:
intensities = mq_output.peptides.Intensity
intensities

Not all peptides are associated with a protein or gene by MQ, although there is evidence for the peptide. This is due to potential contaminants in the medium, which MQ encodes by default with the `CON__` prefix.

In [None]:
mq_output.peptides[FASTA_KEYS].isna().sum()

## `evidence.txt` 

`evidence.txt`
- contains the retention time for peptides
- has repeated measures of the same sequences, which are all aggregated in `peptides.txt`
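
As a small illustration of the repeated measures, the toy frame below (invented values, not MaxQuant output) is indexed by peptide sequence, so repeated measurements appear as a non-unique index. The name `evidence_toy` and its columns are assumptions made for this sketch.

```python
import pandas as pd

# Toy stand-in for evidence.txt indexed by sequence (values are made up).
evidence_toy = pd.DataFrame(
    {
        "Charge": [2, 3, 2, 2],
        "Retention time": [10.2, 10.4, 55.0, 31.7],
        "Intensity": [1.0e6, 4.0e5, 2.5e6, 7.0e5],
    },
    index=pd.Index(["AAAK", "AAAK", "CCCR", "DDDK"], name="Sequence"),
)

# Number of evidence rows per peptide sequence.
n_rows_per_peptide = evidence_toy.index.value_counts()
# Sequences that were measured more than once.
repeated = n_rows_per_peptide[n_rows_per_peptide > 1].index.to_list()
print(repeated)
```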


In [None]:
pd.options.display.max_columns = len(mq_output.evidence.columns)
mq_output.evidence

In [None]:
mq_output.evidence.Charge.value_counts().sort_index()

In [None]:
mask = mq_output.evidence[mq_col.RETENTION_TIME] != mq_output.evidence[mq_col.CALIBRATED_RETENTION_TIME]
print("Number of non-matching retention times between calibrated and non-calibrated column:", mask.sum())

# try:
#     assert mask.sum() == 0, "More than one replica?"
# except AssertionError as e:
#     logger.warning(e)
assert mask.sum() == 0, "More than one replica?"

Using only one quality control sample leaves the initial retention time as is.

In [None]:
rt = mq_output.evidence[mq_col.CALIBRATED_RETENTION_TIME]

In [None]:
pep_measured_freq_in_evidence = rt.index.value_counts()
pep_measured_freq_in_evidence.iloc[:10]  # top10 repeatedly measured peptides

In [None]:
max_observed_pep_evidence = pep_measured_freq_in_evidence.index[0]
mq_output.evidence.loc[
    max_observed_pep_evidence,
    :
]

The retention time index is non-unique.

In [None]:
print('The retention time index is unique: {}'.format(rt.index.is_unique))

Peptides observed more than once at different times.

In [None]:
mask_duplicated = rt.index.duplicated(keep=False)
rt_duplicates = rt.loc[mask_duplicated]
rt_duplicates

In [None]:
mq_output.evidence.loc[mask_duplicated, [
    'Charge', 'Calibrated retention time', 'Intensity']]

Calculate the median and the standard deviation of the retention time per peptide
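
A minimal sketch of this per-peptide aggregation on toy retention times (the values and the name `rt_toy` are made up). Note that peptides with a single measurement get a missing standard deviation, since pandas uses the sample standard deviation by default.

```python
import pandas as pd

# Toy retention times in minutes, indexed by peptide sequence.
rt_toy = pd.Series(
    [10.0, 12.0, 50.0, 30.0, 90.0],
    index=pd.Index(["AAAK", "AAAK", "CCCR", "DDDK", "DDDK"], name="Sequence"),
    name="Calibrated retention time",
)

# One row per sequence with the median and standard deviation.
rt_summary_toy = rt_toy.groupby(level=0).agg(["median", "std"])
print(rt_summary_toy)
```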

In [None]:
_agg_functions = ['median', 'std']

rt_summary = rt.groupby(level=0).agg(_agg_functions)
rt_summary

Let's look at several percentiles of both the median and the standard deviation of the retention time (the columns are independent of each other)

In [None]:
rt_summary.describe(percentiles=[0.8, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99])

In [None]:
rt_summary['median']

A large standard deviation indicates that the measurements of a peptide originate from widely spread time points (in min).

### Peptides observed several times at different points of the experimental run

In [None]:
mask = rt_summary['std'] > 40.0
mask_indices = mask[mask].index
rt.loc[mask_indices]

Peptides with different RT have different charges.

In [None]:
mq_output.evidence.loc[mask_indices]

Model evaluation possibility: hold out peptides with several measurements in an experiment, predict their value, and see which intensity measurement the prediction matches more closely.

In [None]:
from numpy import random
_peptide = random.choice(mask_indices)

In [None]:
mq_output.evidence.loc[_peptide]

`Type` column indicates if peptide is based on one or more MS-MS spectra.

In [None]:
mq_output.peptides.loc[_peptide].to_frame().T

## Differences in intensities b/w peptides.txt and evidence.txt


The intensity reported in `peptides.txt` corresponds roughly to the sum of the intensities found in different scans:
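
The comparison can be sketched on toy data as follows. The frames `evidence_toy` and `peptides_toy` and the tolerance `rtol=1e-6` are assumptions of this example, not values taken from MaxQuant.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for evidence.txt and peptides.txt (made-up intensities).
evidence_toy = pd.DataFrame(
    {"Intensity": [1.0e6, 4.0e5, 2.5e6]},
    index=pd.Index(["AAAK", "AAAK", "CCCR"], name="Sequence"),
)
peptides_toy = pd.DataFrame(
    {"Intensity": [1.4e6, 2.4e6]},
    index=pd.Index(["AAAK", "CCCR"], name="Sequence"),
)

# Sum the evidence intensities per sequence.
summed = evidence_toy["Intensity"].groupby(level=0).sum()
# Align both series on the sequence index before comparing.
summed, reported = summed.align(peptides_toy["Intensity"], join="inner")
matches = pd.Series(np.isclose(summed, reported, rtol=1e-6), index=summed.index)
print(matches)
```

Using a relative tolerance instead of an exact comparison avoids flagging pure floating-point noise as a mismatch.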

In [None]:
from numpy.testing import assert_almost_equal

col_intensity = mq_col.INTENSITY
try:

    assert_almost_equal(
        _pep_int_evidence := mq_output.evidence.loc[_peptide, col_intensity].sum(),
        _pep_int_peptides := mq_output.peptides.loc[_peptide, col_intensity],
        err_msg='Mismatch between evidence.txt summed peptide intensities to reported peptide intensities in peptides.txt')
except AssertionError as e:
    # use the configured 'vaep' logger instead of the (emptied) root logger
    logger.error(
        f"{e}\n Difference: {_pep_int_evidence - _pep_int_peptides:,.2f}")

In [None]:
mq_output.evidence.loc[_peptide, col_intensity]

In [None]:
mq_output.peptides.loc[_peptide, col_intensity]

Make this comparison for all peptides

In [None]:
# string aliases avoid the pandas deprecation of passing builtins such as `sum`
_pep_int_evidence = mq_output.evidence.groupby(
    level=0).agg({col_intensity: ["sum", "size"]})
_pep_int_evidence.columns = [col_intensity, 'count']
_pep_int_evidence

In [None]:
_diff = _pep_int_evidence[col_intensity] - \
    mq_output.peptides[col_intensity].astype(float)
mask_diff = _diff != 0.0
_pep_int_evidence.loc[mask_diff].describe()

In [None]:
_diff.loc[mask_diff]

In [None]:
_diff[mask_diff].describe()

Several smaller and larger differences in an intensity range way below the detection limit arise for some sequences. 

### Ideas on source of difference
 - Are all peptides (sequences) which are based on single observations in `evidence.txt` represented as is in `peptides.txt`?
 - How many peptides with more than one PTM have non-zero differences between the sum of intensity values in `evidence.txt` and the respective value in `peptides.txt`?
 - Maybe some peptides are filtered based on their assignment as contaminant (`CON__`)?
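
The first idea could be checked along these lines. This is only a sketch on a hypothetical per-peptide table; the column names `evidence_sum`, `count` and `peptides_intensity` and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table: summed evidence intensity, number of evidence rows and
# the intensity reported in peptides.txt (made-up values).
comparison = pd.DataFrame(
    {
        "evidence_sum": [1.4e6, 2.5e6, 7.0e5, 3.0e5],
        "count": [2, 1, 1, 3],
        "peptides_intensity": [1.4e6, 2.5e6, 6.5e5, 2.0e5],
    },
    index=pd.Index(["AAAK", "CCCR", "DDDK", "EEEK"], name="Sequence"),
)
comparison["differs"] = comparison["evidence_sum"] != comparison["peptides_intensity"]
comparison["observations"] = np.where(comparison["count"] == 1, "single", "multiple")

# Number of differing peptides among single and multiple observations.
n_differing = comparison.groupby("observations")["differs"].sum()
print(n_differing)
```

If differences also occur for peptides with a single evidence row, summation over repeated scans cannot be the only source of the mismatch.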

In [None]:
# ToDo see above

In [None]:
_diff_indices = _diff[mask_diff].index
# some pep-seq in peptides.txt not in evidence.txt
_diff_indices = _diff_indices.intersection(mq_output.evidence.index.unique())

In [None]:
from numpy import random
sample_index = random.choice(_diff_indices)

In [None]:
mq_output.evidence.loc[sample_index]

In [None]:
mq_output.peptides.loc[sample_index].to_frame().T

### Modifications

In [None]:
mq_output.evidence.Modifications.value_counts()

### Potential contaminant peptides

The `CON__` entries are possible contaminations resulting from sample preparation, e.g. when using a serum:

```python
data_fasta['ENSEMBL:ENSBTAP00000024146']
data_fasta['P12763'] # bovine serum protein -> present in cell cultures and in list of default contaminants in MQ
data_fasta['P00735'] # also bovine serum protein
```

In [None]:
mask = mq_output.peptides['Potential contaminant'].notna()
mq_output.peptides.loc[mask]

### Aggregate identifiers in evidence.txt

In [None]:
fasta_keys = ["Proteins", "Leading proteins",
              "Leading razor protein", "Gene names"]
mq_output.evidence[fasta_keys]

The protein assignment information is not entirely unique for each group of peptides.

## Align intensities and retention time (RT) for peptides

- intensities are values reported in `peptides.txt`
- some (few) peptides in `peptides.txt` are not in `evidence.txt`; their intensity is then probably zero
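
A small sketch of such an alignment on toy data. It uses the `indicator` option of `DataFrame.merge` to flag peptides without any evidence row; the names `intensities_toy` and `rt_summary_toy` and all values are made up for this example.

```python
import pandas as pd

# Toy intensities from peptides.txt and a toy retention time summary.
intensities_toy = pd.Series(
    [1.4e6, 2.4e6, 0.0],
    index=pd.Index(["AAAK", "CCCR", "FFFK"], name="Sequence"),
    name="Intensity",
)
rt_summary_toy = pd.DataFrame(
    {"median": [11.0, 50.0], "std": [1.4, float("nan")]},
    index=pd.Index(["AAAK", "CCCR"], name="Sequence"),
)

# Left join keeps every peptide; `_merge` records where each row came from.
aligned = intensities_toy.to_frame().merge(
    rt_summary_toy, left_index=True, right_index=True, how="left", indicator=True
)
# Peptides reported in peptides.txt without any evidence row.
only_in_peptides = aligned.index[aligned["_merge"] == "left_only"].to_list()
print(only_in_peptides)
```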

In [None]:
intensities.index

In [None]:
seq_w_summed_intensities = intensities.to_frame().merge(
    rt_summary, left_index=True, right_index=True, how='left')

In [None]:
seq_w_summed_intensities

In [None]:
mask = ~mq_output.evidence.reset_index(
)[["Sequence", "Proteins", "Gene names"]].duplicated()
mask.index = mq_output.evidence.index

In [None]:
diff_ = seq_w_summed_intensities.index.unique().difference(mask.index.unique())
diff_.to_list()

In [None]:
# mq_output.msms.set_index('Sequence').loc['GIPNMLLSEEETES']

In [None]:
# There is no evidence, but then it is reported in peptides?!
# Is this the case for more than one MQ-RUN (last or first not written to file?)
try:
    if len(diff_) > 0:
        mq_output.evidence.loc[diff_]
except KeyError as e:
    logger.error(e)

In [None]:
mq_output.peptides.loc[diff_]

### Option: Peptide scan with highest score for repeatedly measured peptides

- only select one of repeated peptide scans, namely the one with the highest score
- discards information, no summation of peptide intensities
- yields unique retention time per peptide, by discarding additional information
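
One way to sketch this option on toy data: sort by score and keep the last row per sequence. Breaking ties between equal scores by the higher intensity is an assumption of this example, as are the frame `evidence_toy` and its values.

```python
import pandas as pd

# Toy evidence table with repeated scans per sequence (made-up values).
evidence_toy = pd.DataFrame(
    {
        "Sequence": ["AAAK", "AAAK", "CCCR", "CCCR"],
        "Score": [80.0, 120.0, 95.0, 95.0],
        "Intensity": [1.0e6, 4.0e5, 2.0e6, 2.5e6],
        "Retention time": [10.2, 10.4, 55.0, 55.3],
    }
)

# Sort so that the best score (ties broken by intensity) comes last per
# sequence, then keep only that last row.
best_scan = (
    evidence_toy.dropna(subset=["Intensity"])
    .sort_values(by=["Sequence", "Score", "Intensity"])
    .drop_duplicates(subset="Sequence", keep="last")
    .set_index("Sequence")
)
print(best_scan)
```

The result has exactly one retention time per peptide, at the cost of dropping the intensities of all other scans.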

In [None]:
COL_SCORE = 'Score'
mq_output.evidence.groupby(level=0)[COL_SCORE].max()

In [None]:
mask_max_per_seq = mq_output.evidence.groupby(
    level=0)[COL_SCORE].transform("max").eq(mq_output.evidence[COL_SCORE])
mask_intensity_not_na = mq_output.evidence.Intensity.notna()
mask = mask_max_per_seq & mask_intensity_not_na

This leads to a non-unique mapping, as some scores are exactly the same for two scans of the same peptide.

In [None]:
mask_duplicates = mq_output.evidence.loc[mask].sort_values(
    mq_col.INTENSITY).index.duplicated()
sequences_duplicated = mq_output.evidence.loc[mask].index[mask_duplicates]
mq_output.evidence.loc[mask].loc[sequences_duplicated, [
    COL_SCORE, mq_col.INTENSITY, mq_col.RETENTION_TIME]]  # .groupby(level=0).agg({mq_col.INTENSITY : max})

In [None]:
mask = mq_output.evidence.reset_index().sort_values(
    by=["Sequence", "Score", mq_col.INTENSITY]).duplicated(subset=["Sequence", "Score"], keep='last')
_sequences = mq_output.evidence.index[mask]
mq_output.evidence.loc[_sequences, [
    "Score", "Retention time", mq_col.INTENSITY, "Proteins"]]

- random, non missing intensity?

In [None]:
aggregators = ["Sequence", "Score", mq_col.INTENSITY]
mask_intensity_not_na = mq_output.evidence.Intensity.notna()
seq_max_score_max_intensity = mq_output.evidence.loc[mask_intensity_not_na].reset_index(
)[aggregators+["Proteins", "Gene names"]].sort_values(by=aggregators).set_index("Sequence").groupby(level=0).last()
seq_max_score_max_intensity

In [None]:
# drop NA intensities first.
assert seq_max_score_max_intensity.Intensity.isna().sum() == 0

Certain peptides have no Protein or gene assigned.

In [None]:
seq_max_score_max_intensity.isna().sum()

In [None]:
mask_seq_selected_not_assigned = seq_max_score_max_intensity.Proteins.isna(
) | seq_max_score_max_intensity["Gene names"].isna()
seq_max_score_max_intensity.loc[mask_seq_selected_not_assigned]

These might be candidates for evaluating predictions, as the information is measured, but unknown. If they cannot be assigned, the closest fit on different genes with model predictions could be a criterion for selection.

## Create dumps of intensities in `peptides.txt`

In [None]:
# mq_output.evidence.loc["AAAGGGGGGAAAAGR"]

In [None]:
# ToDo: dump this?
mq_output.dump_intensity(folder='data/peptides_txt_intensities/')

## Create dumps per gene

Some hundred peptides map to more than two genes 

In [None]:
from vaep.pandas import length

seq_max_score_max_intensity[mq_col.GENE_NAMES].str.split(";"
                                                         ).apply(lambda x: length(x)
                                                                 ).value_counts(
).sort_index()

Mostly unique genes associated with a peptide.

### Select sensible training data per gene
- sequence coverage information?
- minimal number or minimal sequence coverage, otherwise discarded
- multiple genes:
    - select first and add reference in others
    - split and dump repeatedly
    
Load fasta-file information

In [None]:
import json

import config

# `config` is imported as a top-level module here, so `src.config` is undefined
with open(config.FN_FASTA_DB) as f:
    data_fasta = json.load(f)
print(f'Number of proteins in fasta file DB: {len(data_fasta)}')

In [None]:
# schema validation? Load class with schema?
# -> Fasta-File creation should save schema with it

### Fasta Entries considered as contaminants by MQ

In [None]:
mask_potential_contaminant = mq_output.peptides['Potential contaminant'] == '+'
contaminants = mq_output.peptides.loc[mask_potential_contaminant, [mq_col.PROTEINS, mq_col.LEADING_RAZOR_PROTEIN]]
contaminants.head()

In [None]:
unique_cont = contaminants[mq_col.PROTEINS].str.split(';').to_list()
set_all = set().union(*unique_cont)
set_cont = {x.split('CON__')[-1] for x in set_all if 'CON__' in x}
set_proteins_to_remove = set_all.intersection(set_cont)
set_proteins_to_remove

List of proteins which are both in the fasta file and potential contaminants
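
The set logic can be illustrated without MaxQuant data. The entries below are made-up values of the `Proteins` column (the identifier `Q9XYZ1` is purely hypothetical):

```python
# Made-up `Proteins` entries of peptides flagged as potential contaminants.
proteins_column = [
    "CON__P12763;P12763",
    "CON__ENSEMBL:ENSBTAP00000024146",
    "CON__P00735;Q9XYZ1",
]

# Collect every identifier mentioned in any entry.
all_ids = set()
for entry in proteins_column:
    all_ids.update(entry.split(";"))

# Identifiers with the `CON__` prefix removed.
contaminant_ids = {x.split("CON__")[-1] for x in all_ids if "CON__" in x}
# Identifiers listed both as contaminant and as regular protein.
listed_twice = all_ids & contaminant_ids
print(sorted(listed_twice))
```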

In [None]:
mask = mq_output.peptides[mq_col.LEADING_RAZOR_PROTEIN].isin(set_proteins_to_remove)
mq_output.peptides.loc[mask, 'Potential contaminant'].value_counts() # ToDo: Remove potential contaminants, check evidence.txt

### `id_map`: Find genes based on fasta file

Using `ID_MAP`, all protein entries for that gene are queried and combined.

In [None]:
# # slow! discarded for now

# from config import FN_ID_MAP

# with open(FN_ID_MAP) as f:
#     id_map = json.load(f)
# id_map = pd.read_json(FN_ID_MAP, orient="split")

# protein_groups_per_gene = id_map.groupby(by="gene")
# gene_found = []
# for name, gene_data in protein_groups_per_gene:

#     _peptides = set()
#     for protein_id in gene_data.index:
#         _peptides = _peptides.union(p for p_list in data_fasta[protein_id]['peptides']
#                                       for p in p_list)

#     # select intersection of theoretical peptides for gene with observed peptides
#     _matched = mq_output.peptides.index.intersection(_peptides)
#     # add completness?
#     if not _matched.empty and len(_matched) > 3:
#         gene_found.append(name)
#         #
#         if not len(gene_found) % 500 :
#             print(f"Found {len(gene_found):6}")
# print(f"Total: {len(gene_found):5}")

Compare this with the entries in the `Gene names` column of `peptides.txt`

> Mapping is non-unique. MQ has no threshold on the number of identified peptides. (How many (unique) peptides does MQ need?)

### `peptides.txt`: Multiple Genes per peptides

- can gene name be collapsed meaningfully?
- some gene groups share common stem -> can this be used?

In [None]:
mq_output.peptides[mq_col.GENE_NAMES].head(10)

In [None]:
import vaep.io.mq as mq

gene_sets_unique = mq_output.peptides["Gene names"].unique()

N_GENE_SETS = len(gene_sets_unique)
print(f'There are {N_GENE_SETS} unique sets of genes.')
assert N_GENE_SETS != 0, 'No genes?'

genes_single_unique = mq.get_set_of_genes(gene_sets_unique)
N_GENE_SINGLE_UNIQUE = len(genes_single_unique)

mq.validate_gene_set(N_GENE_SINGLE_UNIQUE, N_GENE_SETS)

How often do gene names appear in unique sets?
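
The counting idea can be sketched with the standard library. This is an illustration on made-up gene sets and not necessarily how `mq.count_genes_in_sets` is implemented.

```python
from collections import Counter

# Made-up unique gene sets as they could appear in the `Gene names` column.
gene_sets = ["ACTB", "ACTB;ACTG1", "TUBB;TUBB4B", "TUBB", "GAPDH"]

# In how many unique sets does each gene occur?
gene_counts = Counter(
    gene for gene_set in gene_sets for gene in gene_set.split(";")
)
# How many genes occur in exactly 1, 2, ... unique sets?
frequency_of_counts = Counter(gene_counts.values())
print(gene_counts)
print(frequency_of_counts)
```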

In [None]:
genes_counted_each_in_unique_sets = pd.Series(mq.count_genes_in_sets(
    gene_sets=gene_sets_unique))

title_ = 'Frequency of counts for each gene in unique set of genes'

ax = genes_counted_each_in_unique_sets.value_counts().sort_index().plot(
    kind='bar',
    title=title_,
    xlabel='Count of a gene',
    ylabel='Frequency of counts',
    ax=None,
)
fig = ax.get_figure()

fig_folder = FIGUREFOLDER / mq_output.folder.stem
fig_folder.mkdir(parents=True, exist_ok=True)  # also create missing parent folders
fig.savefig(fig_folder / f'{title_}.pdf')

Unique gene sets with more than one gene:

In [None]:
gene_sets_unique = pd.Series(gene_sets_unique).dropna()

mask_more_than_one_gene = gene_sets_unique.str.contains(';')
gene_sets_unique.loc[mask_more_than_one_gene]

### Long format for genes - `peptides_with_single_gene`

Expand the rows for sets of genes using [`pandas.DataFrame.explode`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html).

Does a group of peptides map to only one unique set of genes? Genes can have more than one protein.
  - first build groups
  - then see matches (see further below)
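
A minimal sketch of the long format on toy data. Dropping peptides without a gene name before exploding is an assumption of this example and may differ from what `mq.get_peptides_with_single_gene` does.

```python
import pandas as pd

# Toy peptides table with semicolon-separated gene names (made-up values).
peptides_toy = pd.DataFrame(
    {
        "Gene names": ["ACTB;ACTG1", "GAPDH", None],
        "Intensity": [1.4e6, 2.4e6, 5.0e5],
    },
    index=pd.Index(["AAAK", "CCCR", "DDDK"], name="Sequence"),
)

# Split the gene sets into lists, then create one row per gene.
long_format = (
    peptides_toy.dropna(subset=["Gene names"])
    .assign(**{"Gene names": lambda df: df["Gene names"].str.split(";")})
    .explode("Gene names")
)
print(long_format)
```

The peptide index is repeated for every gene, so the intensity of `AAAK` appears twice in the long format.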
  

In [None]:
peptides_with_single_gene = mq.get_peptides_with_single_gene(
    peptides=mq_output.peptides)
peptides_with_single_gene

In [None]:
peptides_with_single_gene.dtypes

In [None]:
print(
    f"DataFrame has due to unfolding now {len(peptides_with_single_gene)} instead of {len(mq_output.peptides)} rows")

Should peptides from potential contaminants be considered?

In [None]:
mask = peptides_with_single_gene['Proteins'].str.contains('CON__')
peptides_with_single_gene.loc[mask]

In [None]:
_mask_con = peptides_with_single_gene.loc[mask, mq_col.PROTEINS].str.split(";"
                                                                           ).apply(lambda x: [True if "CON_" in item else False for item in x]
                                                                                   ).apply(all)

assert _mask_con.sum() == 0, "There are peptides resulting only from potential contaminants: {}".format(
    ", ".join(str(x) for x in peptides_with_single_gene.loc[mask, mq_col.PROTEINS].loc[_mask_con].index))

In [None]:
peptides_per_gene = peptides_with_single_gene.value_counts(mq_col.GENE_NAMES)
peptides_per_gene


#### Find genes based on `Gene names` column in long-format data set

More efficient as it does not query unnecessary data or data twice.

In [None]:
protein_groups_per_gene = peptides_with_single_gene.groupby(
    by=mq_col.GENE_NAMES, dropna=True)

gene_data = protein_groups_per_gene.get_group(peptides_per_gene.index[3])
gene_data

In [None]:
list_of_proteins = gene_data[mq_col.PROTEINS].str.split(';').to_list()
set_of_proteins = set().union(*list_of_proteins)
set_of_proteins = {x for x in set_of_proteins if 'CON__' not in x}
set_of_proteins

In [None]:
gene_data[mq_col.PROTEINS].value_counts() # combine? select first in case of a CON_ as leading razor protein?

In [None]:
protein_id = set_of_proteins.pop()
print(protein_id)
data_fasta[protein_id]['seq']

In [None]:
data_fasta[protein_id]

### Sample completeness
Find a sample with a certain completeness level:

In [None]:
peps_exact_cleaved = mq.find_exact_cleaved_peptides_for_razor_protein(
    gene_data, fasta_db=data_fasta)
peps_exact_cleaved[:10]

Then compare the list of possible peptides derived from the fasta file, assuming no miscleavages, with the set of found peptides.

- How many unique exact-cleaved peptides can be mapped to any peptide found in the sample (**completeness**)?
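
To make the idea concrete, here is a sketch with the standard library only. It assumes a simple tryptic rule (cleave after K or R unless followed by P, no missed cleavages) and defines completeness as the share of unique exact-cleaved peptides that were observed. The helper names and the sequence are hypothetical, and the functions in `vaep.io.mq` may work differently.

```python
import re


def cleave_tryptic(sequence):
    """Split after K or R unless followed by P (no missed cleavages)."""
    # Splitting on zero-width matches requires Python 3.7 or newer.
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]


def completeness(exact_cleaved, observed):
    """Share of unique exact-cleaved peptides that were observed."""
    exact_cleaved = set(exact_cleaved)
    if not exact_cleaved:
        return 0.0
    return len(exact_cleaved & set(observed)) / len(exact_cleaved)


protein_seq = "MKAARPEPTIDEKGGR"  # made-up sequence
theoretical = cleave_tryptic(protein_seq)
observed = {"MK", "GGR", "NOTINPROTEIN"}
print(theoretical, completeness(theoretical, observed))
```

Observed peptides that do not stem from an exact cleavage of the protein do not count towards the completeness.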

In [None]:
peps_in_data = gene_data.index

mq.calculate_completness_for_sample(
    peps_exact_cleaved=peps_exact_cleaved, 
    peps_in_data=peps_in_data)

The number of peptides found can then be used to calculate the completeness.

Select candidates by completeness of training data in single samples and save by experiment name.

In [None]:
mq_output.folder.stem  # needs to go to root?

### GeneData accessor?

- [Registering custom accessors tutorial](https://pandas.pydata.org/pandas-docs/stable/development/extending.html#registering-custom-accessors)
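
A minimal, runnable sketch of such an accessor. The accessor name `gene_toy`, the expected columns and the `n_peptides` property are chosen for illustration only and are not part of `vaep`.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("gene_toy")
class GeneToyAccessor:
    """Hypothetical accessor; name and columns are chosen for this sketch."""

    COLS_EXPECTED = {"Intensity", "Gene names"}

    def __init__(self, pandas_obj):
        self._validate(pandas_obj)
        self._obj = pandas_obj

    @classmethod
    def _validate(cls, df):
        missing = cls.COLS_EXPECTED - set(df.columns)
        if missing:
            # The pandas extending guide suggests AttributeError for unsuitable data.
            raise AttributeError(
                f"Expected columns not in DataFrame: {sorted(missing)}")

    @property
    def n_peptides(self):
        """Number of rows (peptides) in the underlying DataFrame."""
        return len(self._obj)


df = pd.DataFrame({"Intensity": [1.0e6, 2.0e6], "Gene names": ["ACTB", "ACTB"]})
print(df.gene_toy.n_peptides)
```

The validation runs when the accessor is first used on a DataFrame, so frames without the expected columns fail early on `df.gene_toy`.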

In [None]:
# @pd.api.extensions.register_dataframe_accessor('gene')
# class GeneDataAccessor:

#     COL_INTENSITY  = mq_col.INTENSITY
#     COL_RAZOR_PROT = 'Leading razor protein'
#     COL_PROTEINS   = 'Proteins'
#     COL_GENE_NAME  = 'Gene names'

#     COLS_EXPECTED = {COL_INTENSITY, COL_RAZOR_PROT, COL_PROTEINS, COL_GENE_NAME}

#     def __init__(self, pandas_df):
#         self._validate(df=pandas_df)

#     @classmethod
#     def _validate(cls, df):
#         """Verify if expected columns and layout apply to pandas.DataFrame (view)"""
#         _found_columns = cls.COLS_EXPECTED.intersection(df.columns)
#         if not _found_columns == cls.COLS_EXPECTED:
#             raise AttributeError("Expected columns not in DataFrame: {}".format(
#                     list(cls.COLS_EXPECTED - _found_columns)))
#         if len(df[cls.COL_RAZOR_PROT].unique()) != 1:
#             raise ValueError("Non-unique razor-protein in DataFrame")


# # GeneDataAccessor(gene_data.drop(mq_col.INTENSITY, axis=1))
# # GeneDataAccessor(gene_data)
# # gene_data.drop(mq_col.INTENSITY, axis=1).gene
# gene_data.gene

### Gene Data Mapper?

In [None]:
class GeneDataMapper:

    COL_INTENSITY = mq_col.INTENSITY
    COL_RAZOR_PROT = mq_col.LEADING_RAZOR_PROTEIN
    COL_PROTEINS = mq_col.PROTEINS
    COL_GENE_NAME = mq_col.GENE_NAMES

    COLS_EXPECTED = {COL_INTENSITY, COL_RAZOR_PROT,
                     COL_PROTEINS, COL_GENE_NAME}

    def __init__(self, pandas_df, fasta_dict):
        self._validate(df=pandas_df)
        self._df = pandas_df
        self._fasta_dict = fasta_dict

        # self.log?

    @classmethod
    def _validate(cls, df):
        """Verify if expected columns and layout apply to pandas.DataFrame (view)"""
        _found_columns = cls.COLS_EXPECTED.intersection(df.columns)
        if not _found_columns == cls.COLS_EXPECTED:
            raise AttributeError("Expected columns not in DataFrame: {}".format(
                list(cls.COLS_EXPECTED - _found_columns)))
        if len(df[cls.COL_RAZOR_PROT].unique()) != 1:
            raise ValueError(
                "Non-unique razor-protein in DataFrame: ", df[cls.COL_RAZOR_PROT].unique())

    def __repr__(self):
        return f"{self.__class__.__name__} at {id(self)}"


GeneDataMapper(gene_data, data_fasta)

### Dump samples as json

- select unique gene-names in set (have to be shared by all peptides)
- dump peptide intensities as json from `peptides.txt`

In [None]:
peptides_with_single_gene  # long-format with repeated peptide information by gene

In [None]:
root_logger = logging.getLogger()
root_logger.handlers = []
root_logger.handlers

In [None]:
genes_counted_each_in_unique_sets = pd.Series(mq.count_genes_in_sets(
    gene_sets=gene_sets_unique))

# # ToDo: Develop
# class MaxQuantTrainingDataExtractor():
#     """Class to extract training data from `MaxQuantOutput`."""

#     def __init__(self, out_folder):
#         self.out_folder = Path(out_folder)
#         self.out_folder.mkdir(exist_ok=True)
#         self.fname_template = '{gene}.json'

completeness_per_gene = mq.ExtractFromPeptidesTxt(
    out_folder='train', mq_output_object=mq_output, fasta_db=data_fasta)()

In [None]:
# same code fails in `vaep.io.mq`, ABC needed?
isinstance(mq_output, MaxQuantOutput), type(mq_output)

#### Descriptive statistics

In [None]:
s_completeness = pd.Series(completeness_per_gene, name='completeness_by_gene')
s_completeness.describe()

In [None]:
N_BINS = 20
ax = s_completeness.plot(kind='hist',
                         bins=N_BINS,
                         xticks=[x/100 for x in range(0, 101, 5)],
                         figsize=(10, 5),
                         rot=90,
                         title=f"Frequency of proportion of observed exact peptides (completeness) per razor protein from 0 to 1 in {N_BINS} bins"
                               f"\nin sample {mq_output.folder.stem}")

_ = ax.set_xlabel(
    "Proportion of exactly observed peptides (including up to 2 mis-cleavages)")

fig = ax.get_figure()
fig.tight_layout()
fig.savefig(FIGUREFOLDER / mq_output.folder.stem / 'freq_completeness.png')

Based on completeness, select valid training data.

In [None]:
# continuously decrease this number in the scope of the project
mask = s_completeness > .6
s_completeness.loc[mask]