The following question is central to the second part of the paper, on errors:

# Why do the models make mistakes and what are these mistakes?

The causes of sequencing errors made by de novo algorithms can be broadly divided into two categories: (i) errors due to a suboptimal de novo algorithm, or (ii) inherent limitations of noisy data that prevent fully accurate inference. In this section, the question of errors in de novo sequencing is addressed through two subquestions:

1. What do the errors look like?
2. Are these errors related to data characteristics?

The final section of this part serves as a link to the third and final section of the paper.

<ins>*1 Sequence errors*</ins>

First, the error types made by the algorithms will be explored. Recent papers subdivide these errors in the following ways:
1. Dennis Beslic:
    - 1 -> 1/2 amino acid replacement
    - 2 -> 2 amino acid replacement
    - x -> x amino acid replacement where x in [3,4,5,6]
    - inversion of first or last 3 amino acids

2. InstaNovo:
    - Subsequence error
    - Completely wrong
    - Subsequence reordering
    - Added (or dropped) tokens

3. Spectralis:
    - Levenshtein distance

It is clear that errors can be represented in multiple ways. 

Of interest to me is where these errors occur and what their nature is. Thus, in this section, the errors will be counted by location and type.

*1.1. Error location*

(Should we first distinguish between large and small mistakes?)

It would be interesting if the errors are located at sites where there is no sequencing evidence. Alternatively, errors accumulating at the N- or C-terminal sites could be caused by the low intensity of terminal fragment ions, especially for longer sequences.

- n- or c-terminal
- missing fragmentation sites
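
A minimal sketch of how the location of an error could be determined, assuming that the ground-truth and predicted peptides are plain one-letter strings without modifications. The function names and the terminal window of 3 residues (mirroring the "first or last 3 amino acids" criterion listed earlier) are illustrative assumptions and are not taken from any of the cited tools. The sketch reports one span running from the first to the last differing residue, which is a simplification when a prediction contains several separate errors.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def locate_error(truth: str, pred: str, window: int = 3) -> dict:
    """Locate the differing span (half-open, in ground-truth coordinates)."""
    if truth == pred:
        return {"region": None, "n_term": False, "c_term": False}
    prefix = shared_prefix_len(truth, pred)
    # Cap the suffix so it cannot overlap the prefix in the shorter sequence.
    max_suffix = min(len(truth), len(pred)) - prefix
    suffix = min(shared_prefix_len(truth[::-1], pred[::-1]), max_suffix)
    start, end = prefix, len(truth) - suffix
    return {
        "region": (start, end),
        # Hypothetical rule: the span touches the first/last `window` residues.
        "n_term": start < window,
        "c_term": end > len(truth) - window,
    }


if __name__ == "__main__":
    print(locate_error("PEPTIDEK", "EPPTIDEK"))
    print(locate_error("PEPTIDEK", "PEPTDIEK"))
```

Whether a span overlaps a missing fragmentation site would require the annotated spectrum and is therefore left out of this sketch.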

*1.2. Error type*

Irrespective of where the errors occur, it is interesting to see how they manifest. Are they isobaric errors at sites lacking fragmentation evidence? Are they well-known errors such as N/Q deamidation (N -> D, Q -> E), or other isobaric 1 -> 1/2 amino acid replacements?
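
As a rough illustration of how such error types could be detected, the sketch below extracts the differing segment between ground truth and prediction and compares the segment masses. It assumes unmodified peptides, standard monoisotopic residue masses, and a hypothetical tolerance of 0.01 Da for calling two segments isobaric; the function names are made up for this example.

```python
# Standard monoisotopic residue masses (Da) of the 20 amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def _shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def differing_segments(truth: str, pred: str) -> tuple:
    """Strip the shared prefix and suffix; return what differs in each."""
    prefix = _shared_prefix_len(truth, pred)
    max_suffix = min(len(truth), len(pred)) - prefix
    suffix = min(_shared_prefix_len(truth[::-1], pred[::-1]), max_suffix)
    return truth[prefix:len(truth) - suffix], pred[prefix:len(pred) - suffix]


def classify_error(truth: str, pred: str, tol: float = 0.01):
    """Describe the error; returns None when both sequences are identical."""
    if truth == pred:
        return None
    t_seg, p_seg = differing_segments(truth, pred)
    delta = sum(MONO[a] for a in p_seg) - sum(MONO[a] for a in t_seg)
    return {
        "truth_segment": t_seg,
        "pred_segment": p_seg,
        "size": f"{len(t_seg)} -> {len(p_seg)}",
        "isobaric": abs(delta) <= tol,  # tolerance is an assumption
        "deamidation_like": (t_seg, p_seg) in {("N", "D"), ("Q", "E")},
    }


if __name__ == "__main__":
    print(classify_error("PEPNIDEK", "PEPGGIDEK"))  # N vs GG
    print(classify_error("PEPQK", "PEPEK"))         # Q vs E
```

The tolerance decides which pairs count as isobaric: K and Q differ by about 0.036 Da, so they are separated at 0.01 Da but would be merged at a wider, low-resolution tolerance.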

Lastly, to gauge the severity of the sequencing errors, a simple metric can be computed and compared across all tools, namely **the Levenshtein distance** as used in the *Spectralis* paper.
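
For reference, the Levenshtein distance between two peptide strings can be computed with the standard dynamic-programming recurrence. The sketch below is a generic implementation, not the code used in the Spectralis paper; the `equate_il` option (treating I and L as the same residue because they have identical mass) is an assumption added for illustration and is off by default.

```python
def levenshtein(a: str, b: str, equate_il: bool = False) -> int:
    """Minimum number of single-residue insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    if equate_il:
        # Assumption: I and L are indistinguishable, so map both to L.
        a, b = a.replace("I", "L"), b.replace("I", "L")
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("PEPTIDE", "PEPTLDE"))
    print(levenshtein("PEPNK", "PEPGGK"))
```

Note that an isobaric 1 -> 2 replacement such as N by GG counts as two edits, so the distance measures sequence dissimilarity and not mass disagreement.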


<ins>*2 Spectrum characteristics*</ins>

Irrespective of the actual error that was made, the reason the model makes a mistake might lie in a specific spectrum characteristic or in the nature of the peptide that needs to be predicted. Here, the following features will be evaluated for their propensity to induce errors in de novo models:

- Missing fragmentation sites
- Explained intensity in percent (the inverse of the noise in the spectrum)
- Length of the peptide (strongly related to missing fragmentation sites)
- Charge state
- ...

More spectrum characteristics can be included in the future, but these are it for now.
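
To make the first two features concrete, the sketch below computes missing fragmentation sites and explained intensity for one peptide-spectrum pair. It is a simplified illustration under several assumptions: only singly charged b- and y-ions are considered, peptides are unmodified, peaks are matched with a hypothetical absolute tolerance of 0.02 Da, and a site counts as missing when neither its b- nor its y-ion is observed. The function names are invented for this example.

```python
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def fragment_mzs(peptide: str) -> list:
    """Singly charged (b, y) m/z pair for every cleavage site."""
    masses = [MONO[aa] for aa in peptide]
    total = sum(masses)
    sites, running = [], 0.0
    for m in masses[:-1]:
        running += m
        b_ion = running + PROTON
        y_ion = (total - running) + WATER + PROTON
        sites.append((b_ion, y_ion))
    return sites


def spectrum_features(peptide: str, peaks: list, tol: float = 0.02) -> dict:
    """peaks is a list of (mz, intensity) tuples."""
    matched, missing = set(), []
    for site, ions in enumerate(fragment_mzs(peptide), start=1):
        hits = [k for k, (mz, _) in enumerate(peaks)
                if any(abs(mz - ion) <= tol for ion in ions)]
        if hits:
            matched.update(hits)
        else:
            missing.append(site)  # site i lies between residue i and i+1
    total_intensity = sum(inten for _, inten in peaks)
    explained = sum(peaks[k][1] for k in matched)
    pct = 100.0 * explained / total_intensity if total_intensity else 0.0
    return {"missing_sites": missing, "explained_intensity_pct": pct}


if __name__ == "__main__":
    y1 = fragment_mzs("GAK")[1][1]            # y1 ion of GAK
    toy_peaks = [(y1, 60.0), (300.0, 40.0)]   # one matched peak, one noise
    print(spectrum_features("GAK", toy_peaks))
```

Peptide length and charge state need no computation beyond reading them from the PSM, so they are not shown here.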

<ins>*3 Are errors truly errors?*</ins>

The final section explores the possibility that some errors made by the models are in fact not errors, but the result of search space reduction during database construction. This possibility is explored by rescoring de novo results with the ms2rescore approach after training mokapot models on the database results. The rationale is that the model will pick up features that contribute to good hits. If a better hit exists outside the search space, e.g. a de novo sequence, it will receive a higher score than the database PSM. Both the binary outcome (higher, lower) and the size of the gap will be interesting to explore.
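
The comparison itself is simple once both PSMs of a spectrum carry a rescoring score. The sketch below assumes that these scores have already been produced by the rescoring pipeline and are available as plain numbers; it does not call ms2rescore or mokapot, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredPair:
    """Rescoring scores of the database and de novo PSM of one spectrum."""
    spectrum_id: str
    db_score: float
    denovo_score: float


def compare_scores(pairs: list) -> list:
    """Return the binary outcome and the score gap for every spectrum."""
    rows = []
    for p in pairs:
        gap = p.denovo_score - p.db_score  # positive: de novo scores higher
        if gap > 0:
            outcome = "higher"
        elif gap < 0:
            outcome = "lower"
        else:
            outcome = "equal"
        rows.append({"spectrum_id": p.spectrum_id,
                     "denovo_vs_db": outcome,
                     "gap": gap})
    return rows


if __name__ == "__main__":
    example = [
        ScoredPair("scan_1", db_score=1.5, denovo_score=2.0),
        ScoredPair("scan_2", db_score=3.0, denovo_score=0.5),
    ]
    for row in compare_scores(example):
        print(row)
```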

In the paper, we will focus on two aspects.

Firstly, if a de novo PSM is scored higher or lower, this is interpretable in terms of features. Therefore, several features used to calculate the ms2rescore score will be investigated. We expect the retention times and ion mobilities to fit better for database hits when de novo is bad, etc... [see notebook: 0_merge_results in PXD028735]

Secondly, the de novo hits might differ only slightly from the true hits according to the score-difference boxplots, raising the question of whether some of these mistakes are truly mistakes or simply unsolvable with de novo sequencing. (stacked barplot here with missing fragmentations and error types [notebook: get_features_from_psm in scrap_notes])


---


By concluding the section with a questioning of the mistakes made by de novo sequencing altogether, we can go deeper into this aspect and steer readers towards the advantages and disadvantages of database and de novo searching. Furthermore, this section allows us to make some claims about what the FDR of de novo sequencing results should look like. Indeed, instead of looking run-wide at the characteristics of PSMs, spectrum by spectrum, we should look within a single spectrum and ask whether it is solvable in a larger search space. Search space is the biggest determinant of solvability here: with databases, we make solvability artificially high, which is not necessarily a bad thing. The final section will deal with the issue of solvability and make some preliminary attempts at addressing it.



# Plots to be made

## 1_error_seq

- Plot with Levenshtein distances per tool (boxplot / **barplot**). x-axis: Levenshtein distance, y-axis: PSM count, hue: tool
- Filter PSMs on levenshtein distance [0, 2]: Countplot errors per tool
- Barplot showing N-term, C-term, MF errors

## 2_error_spectrum

- Stacked barplots per feature when taking the best de novo match. Place a countplot beside it showing how many times a PSM of a given tool was selected.
    - pep length
    - MF sites
    - charge
    - Explained intensity (bins)
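
The counts behind these plots could be prepared as sketched below, assuming each de novo PSM is a dictionary with a spectrum identifier, the tool that produced it, a score, a correctness flag and the feature values. All field names and the tool names in the example are hypothetical, and "best" is taken to mean the highest-scoring de novo PSM per spectrum, which is an assumption. Continuous features such as explained intensity would need to be binned before tallying.

```python
from collections import Counter


def best_per_spectrum(psms: list) -> list:
    """Keep the highest-scoring de novo PSM for every spectrum."""
    best = {}
    for psm in psms:
        current = best.get(psm["spectrum_id"])
        if current is None or psm["score"] > current["score"]:
            best[psm["spectrum_id"]] = psm
    return list(best.values())


def tally(psms: list, feature: str) -> tuple:
    """Counts for the countplot (per tool) and the stacked barplot
    (per feature value, split by correct / incorrect)."""
    best = best_per_spectrum(psms)
    per_tool = Counter(p["tool"] for p in best)
    stacked = Counter((p[feature], p["correct"]) for p in best)
    return per_tool, stacked


if __name__ == "__main__":
    toy = [
        {"spectrum_id": "s1", "tool": "tool_a", "score": 0.9,
         "correct": True, "charge": 2},
        {"spectrum_id": "s1", "tool": "tool_b", "score": 0.4,
         "correct": False, "charge": 2},
        {"spectrum_id": "s2", "tool": "tool_b", "score": 0.7,
         "correct": False, "charge": 3},
    ]
    per_tool, stacked = tally(toy, "charge")
    print(per_tool)
    print(stacked)
```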

## 3_error_question

- Plot score differences in boxplots.
- Take the non-matching PSMs and split them into higher and lower. Show the distribution of RT, MS2PIP, (IM), hyperscore (...) when de novo is better (plot above) or worse (plot below). The hue also shows the database feature for those spectra
