The first section of the paper **answers the following question**:

# Which de novo algorithm performs best?

A key question is how to define 'best'. Several metrics will be used for this: **peptide and amino acid recall**, as defined in the review by *W. Bittremieux*.


Previously, in the first section of the *Denis Beslic* paper, 3 plot types were used:
1. PR curves for 3 different enzymes (amino acid precision vs. amino acid recall) across different score thresholds
2. Barplots of the AUC of the PR curves in 1
3. Barplots for 6 different enzymes with the following metrics:
    1. Peptide recall
    2. Amino acid recall
    3. Amino acid precision

Together, these plots conclude the performance section of the DB (*Denis Beslic*) paper.

## The flow of this section

### **Accuracy of de novo tools**

We start by asking which de novo tool performs best.

Part 1: De novo sequencing models

<ins>*1.1. accuracy*</ins>

To answer this, the standard metrics of the field will be used: peptide and amino acid recall and precision. However, these give only partial insight into the relative performance of the tools. Therefore, the next part establishes the 'redundancy' between tools by generating heatmaps.
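
A minimal sketch of how these metrics could be computed, assuming one ground-truth peptide per spectrum and at most one prediction per tool, both given as lists of residues. The function names are hypothetical, and the amino acid matching is simplified to position-wise identity; the mass-tolerance-based matching commonly used in the field is not implemented here.

```python
from typing import Dict, Optional, Sequence


def matched_residues(pred: Sequence[str], truth: Sequence[str]) -> int:
    """Count positions where prediction and ground truth carry the same residue.

    Simplification (assumption): position-wise identity, not mass-based matching.
    """
    return sum(p == t for p, t in zip(pred, truth))


def accuracy_metrics(
    predictions: Sequence[Optional[Sequence[str]]],
    truths: Sequence[Sequence[str]],
) -> Dict[str, float]:
    """Return peptide recall, amino acid recall and amino acid precision.

    A prediction of None means the tool gave no answer for that spectrum.
    """
    n_correct_peptides = 0
    n_matched = 0
    n_predicted_residues = 0
    n_truth_residues = 0
    for pred, truth in zip(predictions, truths):
        n_truth_residues += len(truth)
        if pred is None:
            continue  # counts against recall, not against precision
        n_predicted_residues += len(pred)
        n_matched += matched_residues(pred, truth)
        if list(pred) == list(truth):
            n_correct_peptides += 1
    return {
        "peptide_recall": n_correct_peptides / len(truths) if truths else 0.0,
        "aa_recall": n_matched / n_truth_residues if n_truth_residues else 0.0,
        "aa_precision": n_matched / n_predicted_residues if n_predicted_residues else 0.0,
    }
```

Spectra without a prediction lower both recall values but leave amino acid precision untouched, which is why recall and precision can diverge between tools that predict for different fractions of the spectra.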

<ins>*1.2. overlap*</ins>

After seeing that some tools perform similarly, the question remains whether they correctly predict the same spectra or whether each tool performs better on spectra with different characteristics. The latter could indicate that a given network architecture is more suitable for certain spectra. Looking at the overlap of correctly predicted spectra answers this question.

A follow-up question is whether the tools also overlap on spectra that are not correctly predicted. Such agreement can indicate false positives in the database ground truth and is worth storing for future analysis. This future analysis relates to rescoring/reranking and the question of ambiguity.
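
Both overlap questions can be sketched as follows, assuming each tool's results are available as a mapping from spectrum ID to predicted peptide. The Jaccard index is one possible overlap measure for the heatmap, chosen here for illustration; all names (`overlap_matrix`, `shared_wrong_predictions`, `tool_a`, `tool_b`) are hypothetical.

```python
from typing import Dict, List, Set


def overlap_matrix(correct: Dict[str, Set[str]]) -> List[List[float]]:
    """Pairwise Jaccard overlap between the correctly predicted spectra of each tool.

    `correct` maps a tool name to the set of spectrum IDs it predicted correctly.
    Rows and columns follow the insertion order of the dictionary.
    """
    tools = list(correct)
    matrix = []
    for a in tools:
        row = []
        for b in tools:
            union = correct[a] | correct[b]
            shared = correct[a] & correct[b]
            row.append(len(shared) / len(union) if union else 0.0)
        matrix.append(row)
    return matrix


def shared_wrong_predictions(
    pred_a: Dict[str, str], pred_b: Dict[str, str], truth: Dict[str, str]
) -> Set[str]:
    """Spectrum IDs where both tools agree on a peptide that differs from the ground truth."""
    return {
        sid
        for sid, expected in truth.items()
        if sid in pred_a
        and sid in pred_b
        and pred_a[sid] == pred_b[sid]
        and pred_a[sid] != expected
    }


correct = {"tool_a": {"s1", "s2", "s3"}, "tool_b": {"s2", "s3", "s4"}}
matrix = overlap_matrix(correct)  # values to feed into a heatmap
```

The set returned by `shared_wrong_predictions` is the candidate list to store for the later ambiguity analysis: spectra where independent tools agree with each other but not with the database search result.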

<br>

Part 2: Refinement de novo models

A second type of de novo model refines the original de novo sequencing results in the hope of correcting small sequencing mistakes. The tools under evaluation are *Spectralis*, *InstaNovo+*, and *MS2Rescore* (pNovo3 alternative). The effect of these models on the original predictions can be subdivided into 3 distinct parts.

<ins>*2.1. sequence changes*</ins>

Firstly, these models **change the sequence** of the predicted peptide. This is the most obvious effect of such a tool. Most use evolutionary algorithms or diffusion models to do this. To investigate this effect, one could simply count how many of the changed peptide sequences were originally correct, and how many of the originally false ones became correct after perturbation. A second analysis investigates which amino acids (or tags) are prone to change, as this might indicate recurrent ambiguous sections or the propensity of an algorithm to pay attention to a particular site. (check Levenshtein distance)
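
A minimal sketch of both analyses, assuming peptides are compared as plain strings (or residue lists) and that the prediction before refinement, the prediction after refinement, and the ground truth are available per spectrum. The function names are hypothetical; the Levenshtein distance is the standard dynamic-programming formulation.

```python
from collections import Counter
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (x != y),  # substitution (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def refinement_transitions(
    before: Sequence[str], after: Sequence[str], truth: Sequence[str]
) -> Counter:
    """Count (status before, status after) pairs for predictions that were changed."""
    counts: Counter = Counter()
    for b, a, t in zip(before, after, truth):
        if b == a:
            continue  # the refinement model left this prediction untouched
        status_before = "correct" if b == t else "incorrect"
        status_after = "correct" if a == t else "incorrect"
        counts[(status_before, status_after)] += 1
    return counts
```

The transition counts map directly onto the bars of a before/after barplot, while `levenshtein(before, after)` per spectrum gives a distribution of how much of each sequence was changed.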

<ins>*2.2. Re-ordering*</ins>

Secondly, irrespective of any sequence change, the model assigns **a new score** to each PSM. This is important because a de novo tool can be confident about a prediction without having integrated all available information, such as retention time, ion mobility, and simpler metrics such as ion coverage. Ignoring this information can lead to a suboptimal scoring function. Here, rescoring can re-order the PSMs across the results and thus keep the PR curve high for longer (more correct predictions are scored higher than false predictions).
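
The effect of re-ordering on the PR curve can be illustrated with a small sketch. It assumes one PSM per spectrum, defines recall as the fraction of all spectra that are correctly predicted above the score threshold, and ignores score ties; `pr_curve` and `auc` are hypothetical helper names and the scores are made-up toy values.

```python
from typing import List, Sequence, Tuple


def pr_curve(scores: Sequence[float], is_correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (recall, precision) points while lowering the score threshold PSM by PSM."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = len(scores)
    points = []
    n_correct = 0
    for k, i in enumerate(order, start=1):
        n_correct += is_correct[i]
        points.append((n_correct / total, n_correct / k))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the PR curve using the trapezoidal rule."""
    if not points:
        return 0.0
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Toy example: the same four PSMs, scored by the original tool and by a rescoring model.
is_correct = [True, False, True, True]
original_scores = [0.90, 0.95, 0.50, 0.60]  # the false PSM is ranked first
rescored_scores = [0.90, 0.20, 0.70, 0.80]  # the false PSM is pushed to the bottom

auc_original = auc(pr_curve(original_scores, is_correct))
auc_rescored = auc(pr_curve(rescored_scores, is_correct))
```

The predictions themselves are identical in both cases; only their order changes, which is enough to raise the AUC.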

<ins>*2.3. Re-ranking*</ins>

The third aspect, and the focus of tools such as *pNovo3*, *PostNovo*, and *RankNovo*, is **reranking**. Distinct from rescoring and re-ordering spectra, this aspect focuses on reranking peptide predictions within a single spectrum. Indeed, in some cases a lower-ranked hit is the correct prediction, yet, for reasons similar to those in re-ordering, the scoring function may have been suboptimal. This aspect could indicate issues in the search strategy used when building up the sequence from amino acid probabilities, such as beam search in autoregressive models or graph search in models such as PepNet and pi-PrimeNovo.
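
A minimal sketch of the within-spectrum analysis, assuming each spectrum comes with a list of candidate peptides and their scores plus one ground-truth peptide. The function names and the toy candidates are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (peptide, score)


def rank_of_correct(candidates: List[Candidate], truth: str) -> Optional[int]:
    """Return the 1-based rank of the ground-truth peptide, or None if it is absent."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for rank, (peptide, _score) in enumerate(ranked, start=1):
        if peptide == truth:
            return rank
    return None


def correct_rank_histogram(spectra: Iterable[Tuple[List[Candidate], str]]) -> Counter:
    """Count, over all spectra, at which rank the correct peptide was found."""
    return Counter(rank_of_correct(candidates, truth) for candidates, truth in spectra)


spectra = [
    ([("AAK", 0.9), ("AAR", 0.8)], "AAK"),  # correct peptide ranked first
    ([("GGK", 0.7), ("GGR", 0.6)], "GGR"),  # correct peptide ranked second
    ([("CCK", 0.5)], "DDK"),  # correct peptide not among the candidates
]
histogram = correct_rank_histogram(spectra)
```

Running this once with the original scores and once with the scores of a reranking model shows how many correct peptides move from a lower rank to rank 1. The `None` bucket counts spectra where no amount of reranking can help, since the correct peptide was never proposed.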

After these 3 aspects are investigated, a clearer picture might emerge of how beneficial post-processing models are.



## The evaluation in figures

The following figures will be generated in the coming notebooks:

### 1_1_accuracy

The standard metrics are computed for each tool.

- PR plots of amino acid and peptide precision against coverage
- Barplots of peptide recall, amino acid recall, and amino acid precision


### 1_2_overlap

- Heatmap on correct predictions
- Heatmap on (commonly) false predictions

### 2_1_refinement_seq

- Barplot with the following bars, indicating the change in 'correctness' after perturbation
    - correct vs incorrect originally
    - correct vs incorrect after perturbation
- Barplot (?) showing counts of the sequence changes (in terms of sequence alignment?)
- KDE plot with hue (correctness) showing the sequence alignment between the prediction before and after perturbation, to show how much of the sequence was changed

### 2_2_refinement_score

- PR plots (engine by engine) showing the different scorings for each model

### 2_3_refinement_re-ordering

- Barplot (hue: rescoring model) with the rank on the x-axis and the number of correct predictions as bar height.
- (?) Boxplot showing the score difference between a correct top-ranked prediction and its lower-ranked candidates, versus the score difference between an incorrect top-ranked prediction and the correct lower-ranked candidate. This could indicate ambiguity (?)
