In [None]:
#| default_exp scoring.ml_scoring_base

# Base Class of ML Scoring Methods

In [None]:
from alphabase.scoring.ml_scoring_base import *

ML-based rescoring has two key modules: feature extraction and the rescoring algorithm. Both are designed to be as flexible as possible so that they can be extended later.

## Feature extraction

Feature extraction matters more than the choice of ML method, so we designed a flexible architecture for it. A feature extractor inherits from `BaseFeatureExtractor`, must re-implement `BaseFeatureExtractor.extract_features`, and tells the ML methods which features it extracted through `BaseFeatureExtractor.feature_list`.

For example, if we have two feature extractors, `AlphaPeptFE` and `AlphaPeptDeepFE`:

```python
class AlphaPeptFE(BaseFeatureExtractor):
    def extract_features(self, psm_df):
        psm_df['ap_f1'] = ...
        self._feature_list.append('ap_f1')
        psm_df['ap_f2'] = ...
        self._feature_list.append('ap_f2')

class AlphaPeptDeepFE(BaseFeatureExtractor):
    def extract_features(self, psm_df):
        psm_df['ad_f1'] = ...
        self._feature_list.append('ad_f1')
        psm_df['ad_f2'] = ...
        self._feature_list.append('ad_f2')
```

We can easily design a new feature extractor that combines these two, or more, feature extractors:

```python
class CombFE(BaseFeatureExtractor):
    def __init__(self):
        self.fe_list = [AlphaPeptFE(),AlphaPeptDeepFE()]

    def extract_features(self, psm_df):
        for fe in self.fe_list:
            fe.extract_features(psm_df)

    @property
    def feature_list(self):
        f_set = set()
        for fe in self.fe_list:
            f_set.update(fe.feature_list)
        return list(f_set)
```

This is useful when DL features are optional: for instance, the combined extractor can include `AlphaPeptDeepFE` only when AlphaPeptDeep is installed.
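
A minimal sketch of such an optional composition, using stand-in classes rather than the real alphabase ones. `find_spec` from the standard library checks whether a package is importable without importing it. The module name `peptdeep` and the helpers `build_extractors` and `merged_feature_list` are assumptions made for illustration.

```python
from importlib.util import find_spec


class BasicFE:
    """Stand-in for an extractor that is always available."""
    feature_list = ["score", "nAA", "charge"]


class DeepFE:
    """Stand-in for an extractor that needs an optional package."""
    feature_list = ["ad_f1", "ad_f2", "score"]


def build_extractors(optional_module="peptdeep"):
    """Return extractor instances, adding DeepFE only if the module exists."""
    extractors = [BasicFE()]
    if find_spec(optional_module) is not None:
        extractors.append(DeepFE())
    return extractors


def merged_feature_list(extractors):
    """Merge feature names, dropping duplicates but keeping first-seen order."""
    merged = {}
    for fe in extractors:
        merged.update(dict.fromkeys(fe.feature_list))
    return list(merged)


print(merged_feature_list(build_extractors()))
```

Merging through a `dict` instead of a `set` keeps the feature columns in the same order on every run, which can make trained models easier to compare.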

## Rescoring Algorithm

`Percolator` (Käll et al. 2007), a rescoring algorithm based on semi-supervised learning, is still the most widely used in MS-based proteomics. Therefore, we use `Percolator` as the base rescoring class; subclasses can re-implement its methods to support different algorithms as well as different ML models:

1. Rescoring algorithm. We have provided the base rescoring code structure in `Percolator`. If we are going to support DiaNN's brute-force supervised learning methods, we can define the class like this:

```python
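import numpy as np
import pandas as pd

class DiaNNRescoring(Percolator):
    def __init__(self):
        super().__init__()
        # Disable target filtration on FDR: same as DiaNN, different from Percolator
        self.training_fdr = 100000

    # Training hook; its name and signature are assumed here, check the Percolator source
    def _train(self, train_t_df, train_d_df):
        train_df = pd.concat((train_t_df, train_d_df))
        train_label = np.ones(len(train_df), dtype=np.int32)
        train_label[len(train_t_df):] = 0  # targets come first (1), decoys after (0)
        self._ml_model.fit(
            train_df[self.feature_list].values,
            train_label
        )

    def rescore(self, psm_df):
        # We don't need iteration anymore, but cross validation is still necessary
        psm_df = self._cv_score(psm_df)
        return self._estimate_fdr(psm_df)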

2. ML models. We personally prefer `Percolator` with a linear classifier (SVM or LogisticRegression), but as a framework we should support different ML models. We can easily switch to a random forest with `self.ml_model = RandomForestClassifier()`. We can also use a DL model for rescoring, provided it offers sklearn-like `fit()` and `decision_function()` APIs.
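
The model contract can be illustrated without sklearn. In the sketch below, `ToyLinearModel`, `ToyProbModel` and `get_scores` are hypothetical names, not part of alphabase. The helper prefers `decision_function()` and falls back to the positive-class column of `predict_proba()`, which is one plausible way to treat both kinds of model uniformly.

```python
class ToyLinearModel:
    """Hypothetical model exposing decision_function(), like a linear SVM."""
    def __init__(self, weights):
        self.weights = weights

    def fit(self, X, y):
        return self  # a real model would learn the weights here

    def decision_function(self, X):
        return [sum(w * x for w, x in zip(self.weights, row)) for row in X]


class ToyProbModel:
    """Hypothetical model exposing only predict_proba(), like a random forest."""
    def fit(self, X, y):
        return self

    def predict_proba(self, X):
        # One [P(decoy), P(target)] pair per row, driven by the first feature.
        return [[1.0 - min(row[0], 1.0), min(row[0], 1.0)] for row in X]


def get_scores(model, X):
    """Return one score per row, higher meaning more target-like."""
    if hasattr(model, "decision_function"):
        return list(model.decision_function(X))
    return [proba[1] for proba in model.predict_proba(X)]


X = [[1.0, 2.0], [0.25, 1.0]]
print(get_scores(ToyLinearModel([1.0, 0.5]), X))
print(get_scores(ToyProbModel(), X))
```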

In [None]:
#| hide
from nbdev.showdoc import show_doc

### Properties of `Percolator`

In [None]:
show_doc(Percolator.ml_model)

---

[source](https://github.com/MannLabs/alphabase/blob/main/alphabase/scoring/ml_scoring_base.py#L46){target="_blank" style="float:right; font-size:smaller"}

### Percolator.ml_model

>      Percolator.ml_model ()

ML model in Percolator.
It can be an sklearn model or any other model that implements
the methods `fit()` and `decision_function()` (or `predict_proba()`)
with the same interface as sklearn models.

In [None]:
show_doc(Percolator.feature_extractor)

---

[source](https://github.com/MannLabs/alphabase/blob/main/alphabase/scoring/ml_scoring_base.py#L66){target="_blank" style="float:right; font-size:smaller"}

### Percolator.feature_extractor

>      Percolator.feature_extractor ()

The feature extractor inherited from `BaseFeatureExtractor`

In [None]:
show_doc(Percolator.feature_list)

---

[source](https://github.com/MannLabs/alphabase/blob/main/alphabase/scoring/ml_scoring_base.py#L37){target="_blank" style="float:right; font-size:smaller"}

### Percolator.feature_list

>      Percolator.feature_list ()

Get extracted feature_list. Property, read-only

### Methods of `Percolator`

In [None]:
show_doc(Percolator.extract_features)

---

[source](https://github.com/MannLabs/alphabase/blob/main/alphabase/scoring/ml_scoring_base.py#L69){target="_blank" style="float:right; font-size:smaller"}

### Percolator.extract_features

>      Percolator.extract_features (psm_df:pandas.core.frame.DataFrame, *args,
>                                   **kwargs)

Extract features for rescoring.

*args and **kwargs are used for 
`self.feature_extractor.extract_features`.

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| psm_df | DataFrame | PSM DataFrame |
| args |  |  |
| kwargs |  |  |
| **Returns** | **DataFrame** | **psm_df with feature columns appended inplace.** |

In [None]:
show_doc(Percolator.rescore)

---

[source](https://github.com/MannLabs/alphabase/blob/main/alphabase/scoring/ml_scoring_base.py#L95){target="_blank" style="float:right; font-size:smaller"}

### Percolator.rescore

>      Percolator.rescore (df:pandas.core.frame.DataFrame)

Estimate ML scores and then FDRs (q-values)

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| df | DataFrame | psm_df |
| **Returns** | **DataFrame** | **psm_df with `ml_score` and `fdr` columns updated inplace** |
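
For readers unfamiliar with target-decoy FDR estimation, the following generic sketch shows how q-values can be derived from scores and decoy labels. It is not necessarily the exact procedure used inside `Percolator.rescore`, and `estimate_q_values` is a hypothetical helper.

```python
def estimate_q_values(scores, decoys):
    """Target-decoy q-values; decoys[i] is 1 for a decoy PSM, 0 for a target."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    n_target = n_decoy = 0
    for i in order:
        if decoys[i]:
            n_decoy += 1
        else:
            n_target += 1
        # FDR at this score threshold: decoys / targets accepted so far.
        fdrs.append(n_decoy / max(n_target, 1))

    # q-value: the smallest FDR reachable at this threshold or a looser one.
    q_values = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q_values[order[rank]] = running_min
    return q_values


scores = [9.0, 8.0, 7.0, 6.0, 5.0]
decoys = [0, 0, 1, 0, 1]
print(estimate_q_values(scores, decoys))
```

The backward pass makes q-values monotonic in the score, so a better-scoring PSM never receives a worse q-value than a lower-scoring one.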

In [None]:
show_doc(Percolator.run_rescore_workflow)

---

[source](https://github.com/MannLabs/alphabase/blob/main/alphabase/scoring/ml_scoring_base.py#L171){target="_blank" style="float:right; font-size:smaller"}

### Percolator.run_rescore_workflow

>      Percolator.run_rescore_workflow (psm_df:pandas.core.frame.DataFrame,
>                                       *args, **kwargs)

Run percolator workflow:

- self.extract_features()
- self.rescore()

*args and **kwargs are used for 
`self.feature_extractor.extract_features`.

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| psm_df | DataFrame | PSM DataFrame |
| args |  |  |
| kwargs |  |  |
| **Returns** | **DataFrame** | **psm_df with feature columns appended and `ml_score` and `fdr` columns updated in place.** |

In [None]:
show_doc(Percolator.run_rerank_workflow)

---

[source](https://github.com/MannLabs/alphabase/blob/main/alphabase/scoring/ml_scoring_base.py#L117){target="_blank" style="float:right; font-size:smaller"}

### Percolator.run_rerank_workflow

>      Percolator.run_rerank_workflow (top_k_psm_df:pandas.core.frame.DataFrame,
>                                      rerank_column:str='spec_idx', *args,
>                                      **kwargs)

Run percolator workflow with reranking 
the peptides for each spectrum.

- self.extract_features()
- self.rescore()

*args and **kwargs are used for 
`self.feature_extractor.extract_features`.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| top_k_psm_df | DataFrame |  | PSM DataFrame |
| rerank_column | str | spec_idx | The column used to rerank PSMs. <br><br>For example, use the following code to select <br>the top-ranked peptide for each spectrum.<br>```<br>rerank_column = 'spec_idx' # scan_num<br>idx = top_k_psm_df.groupby(<br>    ['raw_name',rerank_column]<br>)['ml_score'].idxmax()<br>psm_df = top_k_psm_df.loc[idx].copy()<br>``` |
| args |  |  |  |
| kwargs |  |  |  |
| **Returns** | **DataFrame** |  |  |
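
The selection step described for `rerank_column` can be tried on a toy table. This sketch only reproduces the `groupby(...).idxmax()` snippet from the table above; the `sequence` column and its values are made up for illustration.

```python
import pandas as pd

# Two candidate peptides for each of two spectra in one raw file.
top_k_psm_df = pd.DataFrame({
    "raw_name": ["raw", "raw", "raw", "raw"],
    "spec_idx": [0, 0, 1, 1],
    "sequence": ["PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDED"],
    "ml_score": [1.5, 3.0, 2.0, -0.5],
})

rerank_column = "spec_idx"
# Row label of the highest-scoring candidate within each (raw file, spectrum) group.
idx = top_k_psm_df.groupby(["raw_name", rerank_column])["ml_score"].idxmax()
psm_df = top_k_psm_df.loc[idx].copy()
print(psm_df)
```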

## Simple Examples

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'score': list(np.random.uniform(0,100,100))+list(np.random.uniform(0,10,100)),
    'nAA': list(np.random.randint(7,30,200)),
    'charge': list(np.random.randint(2,4,200)),
    'decoy': [0]*100+[1]*100,
    'spec_idx': np.repeat(np.arange(100),2),
    'raw_name': 'raw',
})
perc = Percolator()
perc.min_training_sample = 10
perc.run_rescore_workflow(df)

Unnamed: 0,score,nAA,charge,decoy,spec_idx,raw_name,ml_score,fdr
0,99.851979,26,3,0,18,raw,138.142766,0.000000
1,98.746052,7,3,0,12,raw,133.867779,0.000000
2,97.415167,16,2,0,16,raw,133.447761,0.000000
3,96.857314,14,3,0,15,raw,131.877318,0.000000
4,94.606208,17,3,0,48,raw,128.785713,0.000000
...,...,...,...,...,...,...,...,...
195,0.346523,18,2,1,89,raw,-17.008649,0.979798
196,0.703782,15,3,1,82,raw,-17.292748,0.989899
197,0.058571,22,3,1,77,raw,-17.352293,1.000000
198,0.901983,9,2,1,64,raw,-17.357704,1.000000


In [None]:
perc.run_rerank_workflow(df, rerank_column='spec_idx')

Unnamed: 0,score,nAA,charge,decoy,spec_idx,raw_name,ml_score,fdr
54,44.986000,25,2,0,0,raw,23.239871,0.000000
6,94.020658,7,3,0,1,raw,61.762973,0.000000
73,23.028068,14,2,0,2,raw,5.346026,0.000000
17,79.163537,28,3,0,3,raw,50.584693,0.000000
36,61.673923,23,2,0,4,raw,36.500728,0.000000
...,...,...,...,...,...,...,...,...
170,2.306086,7,3,1,95,raw,-11.475920,0.744898
105,8.107192,8,2,1,96,raw,-6.765049,0.191011
92,9.717331,10,3,1,97,raw,-5.459666,0.044944
143,4.381494,29,3,1,98,raw,-9.100027,0.565217
