The final section of the paper will arguably be the most interesting. The central question of this part is the following:

# Can de novo handle the large search space better than database searches would?

Indeed, this is the central question to answer before deciding to shout 'ALL HAIL DE NOVO, our savior, begone with databases'.

The task of de novo sequencing is much harder than database searching, as the search space becomes unrestricted given a pre-defined set of tokens, e.g. amino acids and modifications. The question is more layered than one would first expect. We will subdivide it into three parts and argue each of them before coming to the conclusion of the paper.

### Question 1: What would database search engines choose?

What developers of de novo algorithms focus on most is something that is often obscured in database searching, and more specifically in its post-processing step, rescoring. A database search considers x candidates per spectrum (with or without allowing a mass shift) and keeps only the top-ranking one according to a very simple rule. In Sage, this is the summed intensity of the annotated fragment ions. After this step, rescoring comes into play to evaluate the top-scoring PSMs per spectrum across the whole run, essentially making the FDR dataset-dependent (something we want, as each dataset is unique). In de novo, we can also choose to output a single peptide; however, during the decision on this particular peptide, the whole complex rescoring part, including fragment intensity, retention time (?) prediction and other more complex features, is used indirectly. Database searching would be more akin to de novo sequencing if the rescoring module were applied to a larger set of peptides within a single spectrum, and the best PSM were then chosen among them.

To make this difficult-to-grasp distinction concrete, I want to show a nice example. If I expand the database search space with the sequences proposed by the de novo algorithms, what happens to the database results? The small percentage (x) increase in search space drastically changes the results of the database search. The only reason is that internally (in Sage) a closely resembling peptide sequence explains more intensity within the spectrum, becomes the top-ranked hit and is thus selected for further analysis with rescoring, while all ambiguity with respect to lower-ranked hits for that spectrum is ignored in the process.
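
A toy sketch of this effect, assuming the top hit is picked purely by the summed intensity of annotated fragment ions as described above. The `Candidate` class, the `top_hit` helper, the peptide strings and the intensity values are all invented for illustration; this is not Sage's actual code or output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    peptide: str
    matched_intensity: float  # hypothetical: fraction of spectrum intensity explained


def top_hit(candidates):
    """Keep only the candidate that explains the most intensity."""
    return max(candidates, key=lambda c: c.matched_intensity)


# Invented candidates for one spectrum from the classic database search.
database_candidates = [
    Candidate("PEPTIDEK", 0.62),
    Candidate("PEPTIDER", 0.30),
]

# Invented, closely resembling sequence contributed by a de novo engine.
de_novo_candidate = Candidate("PEPTLDEK", 0.65)

before = top_hit(database_candidates)
after = top_hit(database_candidates + [de_novo_candidate])

# The gap that flips the top hit is tiny, and it is discarded after selection.
gap = after.matched_intensity - before.matched_intensity
print(before.peptide, "->", after.peptide, f"(gap: {gap:.2f})")
```

Only `after` would be passed on to rescoring; the near-tie with the original database hit is no longer visible downstream.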
 
Indeed, the results of this experiment are directly related to the concept of ambiguity, something we aim to solve by restricting the database, yet completely ignore once a top hit has been selected for a spectrum. When benchmarking de novo, we expect the algorithm to choose a possibly suboptimal sequence within a pre-defined, restricted, yet unknown search space.


### Question 2: Are the de novo sequences inside the cloud of ambiguity for those spectra?

To further support the ambiguity argument, a second analysis would demonstrate that some false de novo predictions lie in the ambiguity area of a spectrum. I still find this ambiguity area difficult to define clearly, but the idea is as follows. Some spectra do not contain conclusive evidence for one particular sequence. That's that. No matter the approach, database or de novo, the answer will remain uncertain. A side effect, demonstrated here, is that those spectra have a bunch of peptide sequences with near-equal scores, where neither the de novo nor the database sequence scores highest, yet both lie within a normal distribution of other ambiguous cases. The difficulty is that ambiguity already starts with two sequences, so designing a metric for this is not straightforward.
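
Since the metric is still undefined, the following is only a possible starting point, not the metric this paper will use. It sketches two candidate measures over the scores of all peptides considered for one spectrum: the number of candidates within a margin of the best score, and the normalized entropy of a softmax over the scores. The function names, the `margin` parameter and the example scores are assumptions made for illustration.

```python
import math


def near_top_count(scores, margin):
    """Number of candidates whose score is within `margin` of the best one."""
    best = max(scores)
    return sum(1 for s in scores if best - s <= margin)


def normalized_entropy(scores):
    """Entropy of a softmax over the scores, scaled to [0, 1].

    0 means one candidate clearly dominates; 1 means all candidates tie.
    """
    if len(scores) < 2:
        return 0.0
    best = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - best) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(scores))


# Invented scores for an ambiguous and for a clear-cut spectrum.
ambiguous_scores = [10.0, 9.8, 9.7, 3.0]
clear_scores = [100.0, 0.0]

print(near_top_count(ambiguous_scores, margin=0.5))
print(normalized_entropy(ambiguous_scores), normalized_entropy(clear_scores))
```

Both measures already react to two near-tied sequences, which is the case described above as the hard one; whether either is sensitive enough on real score distributions is an open question.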


### Question 3: Databases remove the ambiguity problem, and no amount of de novo will be able to solve it effectively

This section is for the de novo prophets who promise that de novo will be the end of database searching. That will only be the case if the data is perfect, which it never is. This section will be short and builds on question 2, with the addition that the database makes the ambiguity metrics explode in favor of clarity.

This will conclude the evaluation of de novo algorithms and thus the paper.

# Plots to make

### 1_search_engine_results

Perform the experiment as follows:
1. Search 1: Classic search
2. Search 2: Classic search with added de novo sequences (ignore the scores)
3. Rescore both by using the models trained on results from search 1.
4. Analyze the difference. Do the de novo sequences score higher? If not, then the Sage internals went wrong. If yes, does this showcase a false positive? Is the database wrong?
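
The comparison in step 4 could be sketched as below. This is a minimal illustration under assumptions: each search result is taken to be a plain dict mapping a spectrum ID to a `(peptide, rescored_score)` tuple, and `compare_searches` is a hypothetical helper, not part of Sage or any rescoring tool. The spectrum IDs, peptides and scores are invented.

```python
def compare_searches(search1, search2, de_novo_peptides):
    """Count how top hits change once de novo sequences are added.

    search1, search2: dict of spectrum_id -> (peptide, rescored_score)
    de_novo_peptides: set of sequences that were added for search 2
    """
    summary = {
        "unchanged": 0,
        "switched_to_de_novo": 0,
        "switched_to_other": 0,
        "de_novo_scores_higher": 0,
    }
    for spectrum_id, (pep1, score1) in search1.items():
        if spectrum_id not in search2:
            continue  # spectrum has no hit in search 2
        pep2, score2 = search2[spectrum_id]
        if pep1 == pep2:
            summary["unchanged"] += 1
        elif pep2 in de_novo_peptides:
            summary["switched_to_de_novo"] += 1
            if score2 > score1:
                summary["de_novo_scores_higher"] += 1
        else:
            summary["switched_to_other"] += 1
    return summary


# Invented toy results, both rescored with the models from search 1.
search1 = {"s1": ("PEPTIDEK", 0.90), "s2": ("AAAGK", 0.40), "s3": ("LLLR", 0.70)}
search2 = {"s1": ("PEPTIDEK", 0.90), "s2": ("AAGAK", 0.55), "s3": ("LLIR", 0.60)}
de_novo_peptides = {"AAGAK", "LLIR"}

summary = compare_searches(search1, search2, de_novo_peptides)
print(summary)
```

Spectra counted under `switched_to_de_novo` but not under `de_novo_scores_higher` are the cases where the search engine's top-hit rule and the rescoring model disagree.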

The plot should raise questions answered in section 2.

### 2_ambiguity_metrics

How to define this metric is still a WIP.

- Boxplot: metrics when the de novo engines are correct. Next to it, a boxplot with the metrics when de novo is wrong in all cases, or in some cases
- Pick out an example showing where the de novo sequences and the database sequence lie within the large cloud of candidates in a highly ambiguous spectrum

### 3_ambiguity_reduction_db

Before and after boxplot showing how the ambiguity metrics change.