---
title: "NIST (part 2): Traditional ML: Gradient boosting"

date: last-modified

author:

- name: Ralf Gabriels

  orcid: 0000-0002-1679-1711

  affiliations:
    - VIB-UGent Center for Medical Biotechnology, VIB, Belgium
    - Department of Biomolecular Medicine, Ghent University, Belgium

---

[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ProteomicsML/ProteomicsML/blob/main/tutorials/fragmentation/_nist-2-traditional-ml-gradient-boosting.ipynb)

## 2.1 Feature engineering

In [None]:
import numpy as np
import pandas as pd

In [None]:
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
aa_properties = {
    "basicity": np.array([37,35,59,129,94,0,210,81,191,81,106,101,117,115,343,49,90,60,134,104]),
    "helicity": np.array([68,23,33,29,70,58,41,73,32,73,66,38,0,40,39,44,53,71,51,55]),
    "hydrophobicity": np.array([51,75,25,35,100,16,3,94,0,94,82,12,0,22,22,21,39,80,98,70]),
    "pI": np.array([32,23,0,4,27,32,48,32,69,32,29,26,35,28,79,29,28,31,31,28]),
}

properties_df = pd.DataFrame(aa_properties, index=amino_acids)
properties_df

In [None]:
# Peptide input
# Feature engineering settings

properties = np.array([
    [37,35,59,129,94,0,210,81,191,81,106,101,117,115,343,49,90,60,134,104],  # basicity
    [68,23,33,29,70,58,41,73,32,73,66,38,0,40,39,44,53,71,51,55],  # helicity
    [51,75,25,35,100,16,3,94,0,94,82,12,0,22,22,21,39,80,98,70],  # hydrophobicity
    [32,23,0,4,27,32,48,32,69,32,29,26,35,28,79,29,28,31,31,28],  # pI
])

quantiles = [0, 0.25, 0.5, 0.75, 1]
aa_indices = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
aa_to_index = np.vectorize(lambda aa: aa_indices[aa])

def encode_peptide(sequence, charge):
    # 4 properties * 5 quantiles * 3 ion types + 4 properties * 4 site + 2 global
    n_features = 78
    n_ions = len(sequence) - 1

    # Encode amino acids as integers to index amino acid properties for peptide sequence
    peptide_indexed = aa_to_index(np.array(list(sequence)))
    peptide_properties = properties[:, peptide_indexed]

    # Empty peptide_features array
    peptide_features = np.full((n_ions, n_features), np.nan)

    for b_ion_number in range(1, n_ions + 1):
        # Calculate quantiles of features across peptide, b-ion, and y-ion
        peptide_quantiles = np.hstack(
            np.quantile(peptide_properties, quantiles, axis=1).transpose()
        )
        b_ion_quantiles = np.hstack(
            np.quantile(peptide_properties[:,:b_ion_number], quantiles, axis=1).transpose()
        )
        y_ion_quantiles = np.hstack(
            np.quantile(peptide_properties[:,b_ion_number:], quantiles, axis=1).transpose()
        )

        # Properties on specific sites: nterm, frag-1, frag+1, cterm
        specific_site_indexes = np.array([0, b_ion_number - 1, b_ion_number, -1])
        specific_site_properties = np.hstack(peptide_properties[:, specific_site_indexes].transpose())

        # Global features: Length and charge
        global_features = np.array([len(sequence), int(charge)])

        # Assign to peptide_features array
        peptide_features[b_ion_number - 1, 0:20] = peptide_quantiles
        peptide_features[b_ion_number - 1, 20:40] = b_ion_quantiles
        peptide_features[b_ion_number - 1, 40:60] = y_ion_quantiles
        peptide_features[b_ion_number - 1, 60:76] = specific_site_properties
        peptide_features[b_ion_number - 1, 76:78] = global_features

    return peptide_features


def generate_feature_names():
    feature_names = []
    for level in ["peptide", "b", "y"]:
        for aa_property in ["basicity", "helicity", "hydrophobicity", "pi"]:
            for quantile in ["min", "q1", "q2", "q3", "max"]:
                feature_names.append("_".join([level, aa_property, quantile]))
    for site in ["nterm", "fragmin1", "fragplus1", "cterm"]:
        for aa_property in ["basicity", "helicity", "hydrophobicity", "pi"]:
            feature_names.append("_".join([site, aa_property]))
        
    feature_names.extend(["length", "charge"])
    return feature_names

Let's test it with a single peptide:

In [None]:
peptide_features = pd.DataFrame(encode_peptide("RALFGARIELS", 2), columns=generate_feature_names())
peptide_features
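
To make the column order of the quantile features easier to follow, here is a minimal sketch of the `np.quantile(...).transpose()` plus `np.hstack` pattern used in `encode_peptide`. The property values and the names `toy_properties`, `per_quantile`, and `flat` are made up for illustration and are not part of the tutorial data.

```python
import numpy as np

# Toy property matrix: 2 properties (rows) x 4 residues (columns).
# The values are made up for illustration only.
toy_properties = np.array([
    [0.0, 10.0, 20.0, 30.0],  # hypothetical property A
    [5.0, 5.0, 1.0, 3.0],     # hypothetical property B
])
quantiles = [0, 0.25, 0.5, 0.75, 1]

# With axis=1, np.quantile returns an array of shape (n_quantiles, n_properties)
per_quantile = np.quantile(toy_properties, quantiles, axis=1)

# Transposing gives (n_properties, n_quantiles). np.hstack then concatenates
# the rows, so all five quantiles of property A come first, followed by the
# five quantiles of property B.
flat = np.hstack(per_quantile.transpose())

print(per_quantile.shape)
print(flat)
```

This property-major order is the same one that `generate_feature_names` follows: for each level, the property loop is the outer loop and the quantile loop is the inner one.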

## 2.2 Getting the target intensities

In [None]:
peptide_targets =  pd.DataFrame({
    "b_target": spectrum["parsed_intensity"]["b"],
    "y_target": spectrum["parsed_intensity"]["y"],
})
peptide_targets

In [None]:
peptide_targets =  pd.DataFrame({
    "b_target": spectrum["parsed_intensity"]["b"],
    "y_target": spectrum["parsed_intensity"]["y"][::-1],
})
peptide_targets

In [None]:
features = encode_peptide(spectrum["sequence"], spectrum["charge"])
targets = np.stack([spectrum["parsed_intensity"]["b"], spectrum["parsed_intensity"]["y"][::-1]], axis=1)
spectrum_id = np.full(shape=(targets.shape[0], 1), fill_value=1, dtype=np.uint32)  # Repeat id for all ions

In [None]:
pd.DataFrame(np.hstack([spectrum_id, features, targets]), columns=["spectrum_id"] + generate_feature_names() + ["b_target",  "y_target"])

Note the `[::-1]` after `spectrum["parsed_intensity"]["y"]`. Remember why we do this?
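
As a reminder, the sketch below shows which b- and y-ions end up on the same row after the reversal. It assumes that both intensity arrays are ordered by ion number (b1, b2, ... and y1, y2, ...), which is what the reversal above implies. The peptide `PEPTIDE` and the variable names are hypothetical.

```python
# Hypothetical 7-residue peptide, so there are 6 cleavage sites
sequence = "PEPTIDE"
n_ions = len(sequence) - 1

# Ion labels in the assumed storage order: by ion number
b_ions = [f"b{i}" for i in range(1, n_ions + 1)]
y_ions = [f"y{i}" for i in range(1, n_ions + 1)]

# Reversing the y-series puts y6 next to b1, y5 next to b2, and so on,
# so that each row describes the two fragments of one cleavage site.
pairs = list(zip(b_ions, y_ions[::-1]))

for b_label, y_label in pairs:
    # The ion numbers of a complementary pair sum to the peptide length
    assert int(b_label[1:]) + int(y_label[1:]) == len(sequence)

print(pairs)
```

Without the reversal, row 1 would combine b1 with y1, which originate from cleavages at opposite ends of the peptide, while the features in that row describe only one cleavage site.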

Let's get a full feature/target table for all spectra in our dataset. Note that
this might take some time, sometimes up to 30 minutes. To skip this step, simply
download the file with pre-encoded features and targets, and load it in the
second cell below.

In [None]:
tables = []
for i, spectrum in progress.track(enumerate(spectrum_list)):
    features = encode_peptide(spectrum["sequence"], spectrum["charge"])
    targets = np.stack([spectrum["parsed_intensity"]["b"], spectrum["parsed_intensity"]["y"][::-1]], axis=1)
    spectrum_id = np.full(shape=(targets.shape[0], 1), fill_value=i, dtype=np.uint32)  # Repeat id for all ions
    table = np.hstack([spectrum_id, features, targets])
    tables.append(table)

full_table = np.vstack(tables)

spectra_encoded = pd.DataFrame(full_table, columns=["spectrum_id"] + generate_feature_names() + ["b_target",  "y_target"])
spectra_encoded.to_feather("human_hcd_tryp_best_spectra_encoded.feather")

In [None]:
# Uncomment this step to load in pre-encoded features from a file:
# spectra_encoded = pd.read_feather("human_hcd_tryp_best_spectra_encoded.feather")

In [None]:
spectra_encoded

This is the data we will use for training. Note that each spectrum comprises
multiple lines: one line per b/y-ion couple. The only thing left to do is to
split the data into train, validation, and test sets, according to the
peptide-level split we made earlier.

In [None]:
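# Select rows by spectrum ID (not by row index), so that all ions of a spectrum end up in the same set
spectra_encoded_trainval = spectra_encoded[spectra_encoded["spectrum_id"].isin(train_val_spectra.index)]
spectra_encoded_test = spectra_encoded[spectra_encoded["spectrum_id"].isin(test_spectra.index)]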
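
The idea of a spectrum-level split on an ion-level table can be sketched on a toy example. The table, the ID lists, and all names prefixed with `toy_` are hypothetical; the sketch only illustrates that every row of a given spectrum should land in exactly one set.

```python
import pandas as pd

# Toy ion-level table: three spectra with 2, 3, and 2 ion rows each
toy_encoded = pd.DataFrame({
    "spectrum_id": [0, 0, 1, 1, 1, 2, 2],
    "b_target": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
})

# Hypothetical spectrum-level split
toy_train_ids = [0, 2]
toy_test_ids = [1]

# Select rows through the spectrum_id column, not through the row index
toy_train = toy_encoded[toy_encoded["spectrum_id"].isin(toy_train_ids)]
toy_test = toy_encoded[toy_encoded["spectrum_id"].isin(toy_test_ids)]

# No spectrum should contribute rows to both sets
overlap = set(toy_train["spectrum_id"]) & set(toy_test["spectrum_id"])
print(len(toy_train), len(toy_test), overlap)
```

Keeping all ions of a spectrum together avoids leaking information between the training and test sets, since rows from the same spectrum share their peptide-level features.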

## 2.3 Hyperparameter optimization and model selection

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
reg =  GradientBoostingRegressor()

X_train = spectra_encoded_trainval.drop(columns=["spectrum_id", "b_target",  "y_target"])
y_train = spectra_encoded_trainval["b_target"]
X_test = spectra_encoded_test.drop(columns=["spectrum_id", "b_target",  "y_target"])
y_test = spectra_encoded_test["b_target"]

reg.fit(X_train, y_train)

In [None]:
y_test_pred = reg.predict(X_test)

In [None]:
np.corrcoef(y_test, y_test_pred)[0][1]
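
For readers unfamiliar with `np.corrcoef`: it returns a correlation matrix, and the `[0][1]` element is the Pearson correlation between the two inputs. The sketch below checks this against a manual calculation. The intensity values are made up.

```python
import numpy as np

# Made-up observed and predicted intensities
observed = np.array([0.0, 0.1, 0.4, 0.8, 1.0])
predicted = np.array([0.1, 0.2, 0.3, 0.9, 0.8])

# np.corrcoef returns a 2x2 matrix for two 1D inputs: the diagonal holds
# the self-correlations (1.0), the off-diagonal the Pearson correlation.
corr_matrix = np.corrcoef(observed, predicted)
pearson_from_matrix = corr_matrix[0][1]

# Manual Pearson correlation: covariance divided by the product of the
# standard deviations, written here with centered vectors.
obs_centered = observed - observed.mean()
pred_centered = predicted - predicted.mean()
pearson_manual = (obs_centered @ pred_centered) / np.sqrt(
    (obs_centered @ obs_centered) * (pred_centered @ pred_centered)
)

print(pearson_from_matrix, pearson_manual)
```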

Let's see if we can do better by optimizing some hyperparameters!

In [None]:
from hyperopt import fmin, hp, tpe, Trials, space_eval, STATUS_OK

In [None]:
def objective(n_estimators):
    # Define algorithm
    reg =  GradientBoostingRegressor(n_estimators=n_estimators)

    # Fit model
    reg.fit(X_train, y_train)

    # Test model
    y_test_pred = reg.predict(X_test)
    correlation = np.corrcoef(y_test, y_test_pred)[0][1]
    
    return {'loss': -correlation, 'status': STATUS_OK}
    

In [None]:
best_params = fmin(
    fn=objective,
    space=hp.randint('n_estimators', 10, 1000),
    algo=tpe.suggest,
    max_evals=10,
)

In [None]:
best_params

Success! Initially, the default value of 100 estimators was used. According to
this hyperopt run, using 874 estimators results in a more performant model.