In [None]:
#---#| default_exp scoring.ml_scoring

# Base Class of ML Scoring Methods

In [None]:
from alphabase.scoring.ml_scoring import *

ML-based rescoring has two key modules: feature extraction and the rescoring algorithm. We designed both modules to be as flexible as possible for future extensions.

## Feature extraction

The feature extractor is more important than the ML methods, so we designed a flexible architecture for feature extraction. A feature extractor inherits from `BaseFeatureExtractor`, must re-implement `BaseFeatureExtractor.extract_features`, and tells the ML methods which features it extracted through `BaseFeatureExtractor.feature_list`.

For example, if we have two feature extractors, `AlphaPeptFE` and `AlphaPeptDeepFE`:

```python
class AlphaPeptFE(BaseFeatureExtractor):
    def extract_features(self, psm_df):
        psm_df['ap_f1'] = ...
        self._feature_list.append('ap_f1')
        psm_df['ap_f2'] = ...
        self._feature_list.append('ap_f2')

class AlphaPeptDeepFE(BaseFeatureExtractor):
    def extract_features(self, psm_df):
        psm_df['ad_f1'] = ...
        self._feature_list.append('ad_f1')
        psm_df['ad_f2'] = ...
        self._feature_list.append('ad_f2')
```

We can easily design a new feature extractor that combines these two, or more, feature extractors:

```python
class CombFE(BaseFeatureExtractor):
    def __init__(self):
        self.fe_list = [AlphaPeptFE(),AlphaPeptDeepFE()]

    def extract_features(self, psm_df):
        for fe in self.fe_list:
            fe.extract_features(psm_df)

    @property
    def feature_list(self):
        f_set = set()
        for fe in self.fe_list:
            f_set.update(fe.feature_list)
        return list(f_set)
```

This is useful for rescoring with DL features, for instance, to include `AlphaPeptDeepFE` only when AlphaPeptDeep is installed and to fall back to the other extractors when it is not.
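
As a rough, self-contained sketch of that idea, the combined extractor can decide at construction time whether the DL-based extractor is available. Several names here are assumptions for illustration: the `BaseFeatureExtractor` below is only a stand-in for the real base class, `build_extractors` and its `dl_module` parameter are hypothetical, the module name `peptdeep` is assumed to be the import name of AlphaPeptDeep, and a plain `dict` stands in for the PSM dataframe.

```python
import importlib.util


class BaseFeatureExtractor:
    """Stand-in for the real base class (assumption), so the sketch runs alone."""

    def __init__(self):
        self._feature_list = []

    @property
    def feature_list(self):
        # Deduplicate while keeping first-seen order
        return list(dict.fromkeys(self._feature_list))

    def extract_features(self, psm_df):
        raise NotImplementedError


class AlphaPeptFE(BaseFeatureExtractor):
    def extract_features(self, psm_df):
        psm_df['ap_f1'] = 0.0  # placeholder feature value
        self._feature_list.append('ap_f1')
        return psm_df


class AlphaPeptDeepFE(BaseFeatureExtractor):
    def extract_features(self, psm_df):
        psm_df['ad_f1'] = 0.0  # placeholder feature value
        self._feature_list.append('ad_f1')
        return psm_df


def build_extractors(dl_module='peptdeep'):
    """Hypothetical helper: add the DL extractor only if `dl_module` is importable."""
    fe_list = [AlphaPeptFE()]
    if importlib.util.find_spec(dl_module) is not None:
        fe_list.append(AlphaPeptDeepFE())
    return fe_list


class CombFE(BaseFeatureExtractor):
    def __init__(self, dl_module='peptdeep'):
        super().__init__()
        self.fe_list = build_extractors(dl_module)

    def extract_features(self, psm_df):
        for fe in self.fe_list:
            psm_df = fe.extract_features(psm_df)
        return psm_df

    @property
    def feature_list(self):
        feats = []
        for fe in self.fe_list:
            feats.extend(fe.feature_list)
        return list(dict.fromkeys(feats))
```

With this layout the rescoring code only sees `CombFE.feature_list`, so it does not need to know which extractors were actually available.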

## Rescoring Algorithm

The `Percolator` rescoring algorithm (Käll et al. 2007), which is based on semi-supervised learning, is still the most widely used in MS-based proteomics. Therefore, we use `Percolator` as the base rescoring class, and others can re-implement its methods for different algorithms as well as different ML models.

1. Rescoring algorithm. The base rescoring code structure is provided in `Percolator`. To support DiaNN's brute-force supervised learning method, for example, we can define a class like this:

```python
class DiaNNRescoring(Percolator):
    def __init__(self):
        super().__init__()
        # Disable target filtration on FDR, which is the same as DiaNN
        # but different from Percolator
        self.training_fdr = 100000

    def rescore(self, psm_df):
        # We don't need iteration anymore, but cross validation is still
        # necessary. The model is fitted once on targets and decoys:
        #     self._ml_model.fit(train_df[self.feature_list].values, train_label)
        psm_df = self._cv_score(psm_df)
        return self._estimate_fdr(psm_df)
```

2. ML models. Personally, we prefer `Percolator` with a linear classifier (SVM or LogisticRegression). As a framework, however, it should support different ML models. We can easily switch to a random forest with `self.ml_model = RandomForestClassifier()`. We can also use a DL model that provides sklearn-like `fit()` and `decision_function()` APIs for rescoring.
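
To illustrate what "sklearn-like `fit()` and `decision_function()`" can mean in practice, here is a minimal sketch of a thin adapter for a model that only exposes `predict_proba()`. It assumes the rescoring class calls nothing beyond these two methods; `ProbaAdapter` and `ConstantProbaModel` are hypothetical names used only for this example and are not part of the library.

```python
import math


class ProbaAdapter:
    """Hypothetical adapter: adds decision_function() on top of predict_proba()."""

    def __init__(self, model, eps=1e-12):
        self.model = model
        self.eps = eps  # keeps probabilities away from exactly 0 or 1

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def decision_function(self, X):
        # Convert the positive-class probability into a log-odds score,
        # so that larger values mean "more target-like"
        scores = []
        for row in self.model.predict_proba(X):
            p = min(max(row[1], self.eps), 1.0 - self.eps)
            scores.append(math.log(p / (1.0 - p)))
        return scores


class ConstantProbaModel:
    """Toy model for the demo: predicts the training target fraction for every row."""

    def fit(self, X, y):
        self.p = sum(y) / len(y)
        return self

    def predict_proba(self, X):
        return [[1.0 - self.p, self.p] for _ in X]


model = ProbaAdapter(ConstantProbaModel()).fit([[0.0], [1.0]], [0, 1])
scores = model.decision_function([[0.5]])
```

The adapter returns a plain list; if the surrounding code expects a NumPy array, the result can be wrapped accordingly.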

## Simple Examples

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'score': list(np.random.uniform(0,100,100))+list(np.random.uniform(0,50,100)),
    'nAA': list(np.random.randint(7,30,200)),
    'charge': list(np.random.randint(2,4,200)),
    'decoy': [0]*100+[1]*100,
    'spec_idx': np.repeat(np.arange(100),2),
    'raw_name': 'raw',
})
perc = Percolator()
perc.min_training_sample = 10
df = perc.run_rescore_workflow(df)
print(np.sum(df.fdr<=0.01))
df

54


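| | score | nAA | charge | decoy | spec_idx | raw_name | ml_score | fdr |
|---|---|---|---|---|---|---|---|---|
| 0 | 99.701130 | 25 | 3 | 0 | 49 | raw | 73.806177 | 0.0 |
| 1 | 97.371486 | 16 | 2 | 0 | 7 | raw | 70.674098 | 0.0 |
| 2 | 96.520631 | 20 | 2 | 0 | 0 | raw | 69.463321 | 0.0 |
| 3 | 94.903888 | 10 | 3 | 0 | 27 | raw | 66.648127 | 0.0 |
| 4 | 94.770806 | 29 | 3 | 0 | 45 | raw | 66.620960 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 0.954094 | 7 | 2 | 1 | 50 | raw | -70.609985 | 1.0 |
| 196 | 0.778568 | 13 | 2 | 1 | 92 | raw | -70.814078 | 1.0 |
| 197 | 0.823450 | 26 | 3 | 0 | 12 | raw | -70.992752 | 1.0 |
| 198 | 0.814022 | 10 | 3 | 1 | 84 | raw | -71.147809 | 1.0 |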
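
The `fdr` column above is presumably derived from the `decoy` labels after sorting by `ml_score`. As a rough illustration of target-decoy FDR estimation in general (not the library's actual implementation), the sketch below assumes FDR is estimated as the number of decoys divided by the number of targets above each score, and then made monotone as q-values. The function name `target_decoy_fdr` is hypothetical.

```python
def target_decoy_fdr(scores, decoys):
    """Estimate q-values from scores and 0/1 decoy labels (illustrative sketch)."""
    # Walk through PSMs from the best score to the worst
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdr = [0.0] * len(scores)
    n_target = n_decoy = 0
    for i in order:
        if decoys[i]:
            n_decoy += 1
        else:
            n_target += 1
        # Assumed estimator: decoys / targets among everything scored at least this well
        fdr[i] = n_decoy / max(n_target, 1)
    # Turn raw FDR into q-values: the lowest FDR reachable at this score or worse
    running_min = float('inf')
    for i in reversed(order):
        running_min = min(running_min, fdr[i])
        fdr[i] = min(running_min, 1.0)
    return fdr


scores = [10, 9, 8, 7, 6, 5]
decoys = [0, 0, 1, 0, 1, 1]
q_values = target_decoy_fdr(scores, decoys)
n_confident = sum(q <= 0.01 for q in q_values)
```

Counting rows with `q <= 0.01` mirrors the `np.sum(df.fdr<=0.01)` call in the cell above.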


In [None]:
df = perc.run_rerank_workflow(df, rerank_column='spec_idx')
print(np.sum(df.fdr<=0.01))
df

37


| | score | nAA | charge | decoy | spec_idx | raw_name | ml_score | fdr |
|---|---|---|---|---|---|---|---|---|
| 2 | 96.520631 | 20 | 2 | 0 | 0 | raw | 57.943887 | 0.000000 |
| 32 | 73.452338 | 7 | 2 | 0 | 1 | raw | 30.871846 | 0.000000 |
| 71 | 45.333046 | 27 | 3 | 0 | 2 | raw | -8.457047 | 0.180328 |
| 51 | 49.989088 | 11 | 3 | 0 | 3 | raw | 0.375916 | 0.000000 |
| 122 | 25.828193 | 9 | 2 | 0 | 4 | raw | -30.234443 | 0.576923 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 109 | 30.181791 | 22 | 3 | 1 | 95 | raw | -26.878013 | 0.480000 |
| 92 | 38.893862 | 17 | 3 | 1 | 96 | raw | -14.861347 | 0.333333 |
| 90 | 39.019470 | 18 | 2 | 1 | 97 | raw | -15.036974 | 0.333333 |
| 147 | 17.662388 | 22 | 3 | 1 | 98 | raw | -42.846490 | 0.729412 |
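
The reranked output above keeps one row per `spec_idx`. Assuming that reranking means keeping only the best-scoring PSM within each group of the rerank column, the selection step could look like the sketch below. It uses plain dictionaries instead of a dataframe to stay self-contained, and `keep_best_per_group` is a hypothetical helper, not a library function.

```python
def keep_best_per_group(psms, group_key='spec_idx', score_key='ml_score'):
    """Keep the highest-scoring record for every value of `group_key`."""
    best = {}
    for psm in psms:
        group = psm[group_key]
        # The first record wins ties because only strictly better scores replace it
        if group not in best or psm[score_key] > best[group][score_key]:
            best[group] = psm
    # Return one record per group, ordered by the group value
    return [best[group] for group in sorted(best)]


psms = [
    {'spec_idx': 0, 'decoy': 0, 'ml_score': 3.2},
    {'spec_idx': 0, 'decoy': 1, 'ml_score': -1.0},
    {'spec_idx': 1, 'decoy': 1, 'ml_score': 0.4},
    {'spec_idx': 1, 'decoy': 0, 'ml_score': 0.1},
]
top_psms = keep_best_per_group(psms)
```

With pandas, the same selection is commonly written as `df.loc[df.groupby('spec_idx')['ml_score'].idxmax()]`.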


In [None]:
df = pd.DataFrame({
    'score': list(np.random.uniform(0,100,100))+list(np.random.uniform(0,50,100)),
    'nAA': list(np.random.randint(7,30,200)),
    'charge': list(np.random.randint(2,4,200)),
    'decoy': [0]*100+[1]*100,
    'spec_idx': np.repeat(np.arange(100),2),
    'raw_name': 'raw',
})
perc = SupervisedPercolator()
perc.min_training_sample = 10
df = perc.run_rescore_workflow(df)
print(np.sum(df.fdr<=0.01))
df

50


| | score | nAA | charge | decoy | spec_idx | raw_name | ml_score | fdr |
|---|---|---|---|---|---|---|---|---|
| 1 | 94.479489 | 22 | 3 | 0 | 3 | raw | 3.660554 | 0.000000 |
| 14 | 84.871839 | 27 | 3 | 0 | 40 | raw | 3.369519 | 0.000000 |
| 2 | 94.021258 | 15 | 3 | 0 | 27 | raw | 3.312799 | 0.000000 |
| 6 | 92.068047 | 17 | 3 | 0 | 26 | raw | 3.298997 | 0.000000 |
| 5 | 92.155700 | 28 | 2 | 0 | 16 | raw | 3.289403 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | 4.059851 | 13 | 2 | 1 | 59 | raw | -2.185752 | 0.979798 |
| 186 | 6.425728 | 8 | 2 | 1 | 88 | raw | -2.287918 | 0.989899 |
| 195 | 2.260293 | 9 | 2 | 1 | 56 | raw | -2.467961 | 1.000000 |
| 197 | 1.738457 | 7 | 2 | 1 | 60 | raw | -2.588545 | 1.000000 |
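
The rescoring section notes that cross validation remains necessary for supervised rescoring, so that no PSM is scored by a model that was trained on it. The sketch below shows the general k-fold idea under simple assumptions; it is not the library's `_cv_score`. The names `cross_val_scores`, `fit_midpoint`, and `score_margin` are hypothetical, and the "model" is a toy threshold on a single feature.

```python
import random


def cross_val_scores(rows, labels, fit, score, k=3, seed=0):
    """Score every row with a model trained only on the other folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    out = [None] * len(rows)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in test_idx:
            out[i] = score(model, rows[i])
    return out


def fit_midpoint(xs, ys):
    # Toy model: threshold halfway between the target mean and the decoy mean
    targets = [x for x, y in zip(xs, ys) if y == 1]
    decoys = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(targets) / len(targets) + sum(decoys) / len(decoys)) / 2


def score_margin(threshold, x):
    # Positive means "above the threshold", i.e. more target-like
    return x - threshold


xs = [9.0, 8.5, 8.0, 7.5, 7.0, 6.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5]
ys = [1] * 6 + [0] * 6  # 1 = target, 0 = decoy
cv_scores = cross_val_scores(xs, ys, fit_midpoint, score_margin, k=3)
```

The `fit` and `score` callables are passed in so that the same loop works with any model, which matches the framework's goal of swapping ML models freely.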
