**Project:** Machine Learning Practice with Scikit-Learn

**Goal:**
This is a practice exercise on machine-learning-guided protein engineering using Python's scikit-learn and the GFP dataset from [Saito et al. (2018)](https://doi.org/10.1021/acssynbio.8b00155). This page gives step-by-step instructions that should work on any local computer or supercomputer (with slight modifications).

**Steps:**
1. Data Collection
2. Data Processing
3. Amino Acid Feature Generation
4. Model Selection
5. Model Construction
6. Model Prediction

**Step 1: Data Collection**
About the dataset (`../data/umetsu/Umetsu_GFP.csv`): The dataset contains 155 GFP variants: 153 variants from single-point or random multiple mutagenesis, plus reference GFP and reference YFP. The columns are `Sequence`, `Intensity`, and `Change`. A sigmoidal function is later used to generate the `Score` column.  

`Sequence`: GFP variants<br>
`Intensity`: Intensity<br>
`Change`: Change to the intensity<br>

**Prerequisites:**
JupyterLab is an open-source, web-based interactive development environment (IDE) for working with Jupyter notebooks, code, and data; it is a more feature-rich version of the classic Jupyter Notebook. It is generally a good idea to install JupyterLab in a dedicated environment to avoid conflicts with other packages.

To create a new conda environment, run:<br>
	&emsp;&emsp;&emsp;`conda create -n mlearn python=3.12`<br>
	&emsp;&emsp;&emsp;`conda activate mlearn`

*(alternatively, create conda environment from the provided yml file: `conda env create -f mlearn.yml`)*<br>

To install JupyterLab, run:<br>
    &emsp;&emsp;&emsp;`conda install -c conda-forge jupyterlab`

Once JupyterLab is installed, you can start it by running:<br>
    &emsp;&emsp;&emsp;`jupyter-lab`

This will launch JupyterLab in your default web browser. It will typically open at `http://localhost:8888` (or another port if 8888 is already in use).

In [1]:
# When running in Google Colab
# (i) Create a copy of data and notebook in your own drive
# Mount Drive
#from google.colab import drive
#drive.mount('/content/drive')

In [2]:
# install required packages
!pip install pandas peptides scikit-learn numpy scipy

Collecting pandas
  Using cached pandas-2.3.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting peptides
  Using cached peptides-0.5.0-py3-none-any.whl.metadata (49 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.8.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting numpy
  Using cached numpy-2.4.0-cp312-cp312-macosx_14_0_arm64.whl.metadata (6.6 kB)
Collecting scipy
  Using cached scipy-1.16.3-cp312-cp312-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting joblib>=1.3.0 (from scikit-learn)
  Using cached joblib-1.5.3-py3-none-any.whl.metadata (5.5 kB)
Collecting threadpoolctl>=3.2.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Using cached pandas-2.3.3-cp312-cp312-macosx_11_0_arm64.whl (10.7 MB)
Using cached peptides-0.5.0-py3-none-any.whl (71 kB)
Using cached scikit_learn-1.8.0-cp312-cp312-macosx_12_0_arm64.whl (8.1 MB)
Using cached numpy-2.4.0-cp312-cp312-macosx_14_0_arm64.whl (5.2 MB)
Using cached scipy-1.16

In [2]:
# Check sklearn version
import sklearn
print(sklearn.__version__)

1.8.0


**Step 2: Data processing**
<br>Load the dataset and generate the `Score` column by running the sigmoidal function on `Intensity` and `Change` values.

**Load the dataset**

In [3]:
import pandas as pd

csv_file = "../data/umetsu/Umetsu_GFP.csv"
df = pd.read_csv(csv_file)

print(df.head(2))
print(len(df))

  Sequence  Intensity    Change
0     SSHT    1.00000  1.000000
1     GAYF   10.36132  7.463609
155


**Add the `Score` column based on sigmoidal function**

In [4]:
# Expected Score for the first two rows: -0.75, -0.00164257656472999
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

thsh = 1.0

df['Score'] = (
    sigmoid(df['Intensity'] - thsh)
    * sigmoid(df['Change'] - thsh)
    - 1.0
)

print(df.head(2))

  Sequence  Intensity    Change     Score
0     SSHT    1.00000  1.000000 -0.750000
1     GAYF   10.36132  7.463609 -0.001643
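
Written out, the score computed in the cell above is the product of two threshold-shifted sigmoids, minus one. The symbols are chosen here for illustration only: $I$ is `Intensity`, $C$ is `Change`, and $t$ is `thsh`.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\mathrm{Score} = \sigma(I - t)\,\sigma(C - t) - 1, \qquad t = 1
```

For the first row, `SSHT` ($I = C = 1$), both sigmoids evaluate to $\sigma(0) = 0.5$, so $\mathrm{Score} = 0.5 \times 0.5 - 1 = -0.75$, which matches the printed value. Because each sigmoid lies strictly between 0 and 1, this formula yields scores between -1 and 0, with values near 0 indicating that both `Intensity` and `Change` are well above the threshold.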


**Step 3: Generate amino acid features**
<br>Each amino acid can be represented by a vector of physicochemical properties such as hydrophobicity, charge, and size. A total of 8 different protein descriptor sets will be used in this study: Z-scales, VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM and ProtFP ([van Westen et al., 2013](https://pmc.ncbi.nlm.nih.gov/articles/PMC3848949/)). Here, the `peptides` Python package is used to compute descriptors for each sequence.

In [5]:
import peptides

# Define a function that computes per-residue descriptors for each sequence
def compute_features(df, seqcol, feature):
    # Available features: blosum_indices fasgai_vectors ms_whim_scores protfp_descriptors st_scales t_scales vhse_scale z_scales
    # and many more (refer to the peptides package on GitHub: https://github.com/althonos/peptides.py)

    # Compute the descriptor vector of every residue in every sequence
    data1=[[list(getattr(peptides.Peptide(a_a), feature)()) for a_a in list(seq)] for seq in df[seqcol]]
    # Flatten the per-residue vectors into one feature row per sequence
    data2=[[s for j in k for s in j] for k in data1]

    # Build the feature matrix (X), one row per sequence
    X = pd.DataFrame(data2)
    
    return X

# Define arguments
RANDOM_STATE=0
seqcol = 'Sequence'
feature = 't_scales'

# Get features for all sequences in the dataset (X)
X = compute_features(df, seqcol, feature)

# Combine the sequence, feature and score columns
features = pd.concat([df[seqcol], X, df['Score']],axis=1)
print(features.head(2))


  Sequence      0     1     2     3     4     5     6     7     8  ...    11  \
0     SSHT  -7.44 -0.65  0.68 -0.17  1.58 -7.44 -0.65  0.68 -0.17  ... -1.31   
1     GAYF -10.61 -1.21 -0.12  0.75  3.25 -9.11 -1.63  0.63  1.04  ... -0.47   

     12    13    14    15    16    17    18    19     Score  
0  0.01 -1.81 -0.21 -5.97 -0.62  1.11  0.31  0.95 -0.750000  
1  0.07 -1.67 -0.35  0.49 -0.94 -0.63 -1.27 -0.44 -0.001643  

[2 rows x 22 columns]
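
The flattening step in `compute_features` can be illustrated without the `peptides` package. The sketch below uses a hypothetical descriptor table with two made-up values per residue (these are not real T-scale values) and a hypothetical helper `flatten_sequence`; it only shows how per-residue vectors are concatenated in sequence order.

```python
# Hypothetical 2-value descriptors per residue (made-up numbers, for illustration only)
toy_descriptors = {
    "S": [0.1, -0.2],
    "H": [0.5, 0.3],
    "T": [-0.4, 0.0],
}

def flatten_sequence(seq, table):
    """Concatenate the descriptor vectors of each residue, in sequence order."""
    return [value for residue in seq for value in table[residue]]

row = flatten_sequence("SSHT", toy_descriptors)
print(len(row))  # 4 residues x 2 values per residue
print(row)
```

With the real T-scales, each residue contributes 5 values, so a four-residue sequence yields the 20 feature columns (`0` to `19`) seen in the output above.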


**Step 4: Model selection**
<br>To identify which model works best with the GFP dataset, several models are tested: `GaussianProcessRegressor()`, `Lasso()`, `RandomForestRegressor()` and `SVR()`.
<br>The following steps are performed for each model in `model_selection_*.py`:
* Data loading: Load the data, `X = amino acid features (0:ncol-2)`, and `Y = Score (ncol-2)`
* Data processing: Standardize the features with `StandardScaler` (`fit_transform()`) so that each feature has a mean of 0 and a standard deviation of 1
* Scoring metric: Define the metric used for calculating the performance score: `r2`, `rmse`, `pearson`, `spearman`
* Initialize the selected model and search for the best hyperparameters using grid search (`GridSearchCV()`)
* Use k-fold cross-validation (`KFold()` and `cross_validate()`) to evaluate the model's performance
* Obtain the best parameters and the performance score
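
The steps above can be sketched as a nested cross-validation on synthetic data. This is a minimal illustration, not the notebook's own code: the data are random stand-ins for the GFP features, the grid is deliberately small, and the scaler is placed inside a `Pipeline` (a variation on the workflow described above) so that it is refitted on each training fold only.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the GFP dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=60)

def spearman_score(y_true, y_pred):
    """Spearman rank correlation between true and predicted values."""
    return spearmanr(y_true, y_pred)[0]

scoring = make_scorer(spearman_score)

# Scaling is part of the estimator, so each CV split fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
param_grid = {"svr__C": [0.1, 1.0, 10.0], "svr__gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter search
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # performance estimate

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring=scoring)
res = cross_validate(search, X_demo, y_demo, cv=outer_cv,
                     scoring=scoring, return_estimator=True)

for score, est in zip(res["test_score"], res["estimator"]):
    print(round(score, 3), est.best_params_)
print("mean score:", res["test_score"].mean())
```

The inner loop chooses hyperparameters and the outer loop scores the whole search procedure on folds it never saw, which is why each outer fold can report a different best parameter set.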


In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import make_scorer
from scipy.stats import spearmanr, pearsonr

# Define the custom correlation metrics and a helper that maps a metric name to a scorer
def pearsonr_metric(y_true, y_pred):
    r = pearsonr(x=y_true, y=y_pred)
    return r[0] 

def spearmanr_metric(y_true, y_pred):
    r = spearmanr(a=y_true, b=y_pred)
    return r[0] 

def set_scoring(metric):
    if metric == 'r2':
        return 'r2'
    elif metric == 'rmse':
        return 'neg_root_mean_squared_error'
    elif metric == 'pearson':
        return make_scorer(pearsonr_metric)
    elif metric == 'spearman':
        return make_scorer(spearmanr_metric)
    else:
        raise ValueError(f'wrong metric: {metric}')

In [7]:
# Define features and labels
X = features.iloc[:, 1:-1]
y = features['Score']

# Define metric
metric = 'spearman'

**Model selection on SVR, Lasso and GPR**

In [8]:
# Define function to run model selection with nested CV
import warnings
    
def model_selection(default_model, X, y, metric, RANDOM_STATE, param_grid=None):
    # Define params
    N_SPLITS=5

    # Load metric
    scoring = set_scoring(metric)

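    # Scale the features
    # Note: the scaler is fitted on all rows before the CV split, so the held-out folds contribute to the scaling statistics
    scaler = StandardScaler()
    X = scaler.fit_transform(X)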

    # Define outer CV
    outer_cv = KFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)
    
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # ignores all warnings
        
        if param_grid: # run optimization if param_grid provided
            # Only use nested (inner and outer) CV when optimization is performed
            inner_cv = KFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)
            model = GridSearchCV(estimator=default_model, param_grid=param_grid, cv=inner_cv, scoring=scoring, verbose=1) # inner CV
            # Perform cross-validation with Kfold
            res = cross_validate(estimator=model, X=X, y=y, cv=outer_cv, scoring=scoring, return_estimator=True, verbose=1) # outer CV
            
            print('cv scores and parameters:')
            for i in range(N_SPLITS):
                print(res['test_score'][i], res['estimator'][i].best_params_)
        
        
        else:
            model = default_model
            # Perform cross-validation with Kfold
            res = cross_validate(estimator=model, X=X, y=y, cv=outer_cv, scoring=scoring, return_estimator=True, verbose=1) # outer CV
            
            print('cv scores:')
            for i in range(N_SPLITS):
                print(res['test_score'][i])
        
        print('mean score:')
        print(res['test_score'].mean())


In [9]:
# try for SVR
from sklearn.svm import SVR
model = SVR()
grid = {'gamma': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], 'C': [1e-2, 1e-1, 1e0, 1e1, 1e2], 'epsilon': [1e-4, 1e-3, 1e-2, 1e-1, 1e0]}

# Run default parameters
print("-- Default hyperparameters (SVR) -- ")
model_selection(model, X, y, metric, RANDOM_STATE)

# Optimize hyperparameters
print("-- Optimized hyperparameters (SVR) -- ")
model_selection(model, X, y, metric, RANDOM_STATE, grid)

-- Default hyperparameters (SVR) -- 
cv scores:
0.44083942799449627
0.6310900484738401
0.719378995620727
0.5255243547711439
0.42597137568529864
mean score:
0.5485608405091011
-- Optimized hyperparameters (SVR) -- 
Fitting 5 folds for each of 125 candidates, totalling 625 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


Fitting 5 folds for each of 125 candidates, totalling 625 fits
Fitting 5 folds for each of 125 candidates, totalling 625 fits
Fitting 5 folds for each of 125 candidates, totalling 625 fits
Fitting 5 folds for each of 125 candidates, totalling 625 fits
cv scores and parameters:
0.5123481409133896 {'C': 0.1, 'epsilon': 0.01, 'gamma': 0.1}
0.6298310159581817 {'C': 10.0, 'epsilon': 0.0001, 'gamma': 0.1}
0.7182777648582003 {'C': 0.01, 'epsilon': 0.1, 'gamma': 0.1}
0.5459841461137357 {'C': 0.01, 'epsilon': 0.01, 'gamma': 0.01}
0.42597137568529864 {'C': 0.1, 'epsilon': 0.01, 'gamma': 0.1}
mean score:
0.5664824887057611


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.2s finished


In [10]:
# try for Lasso

from sklearn.linear_model import Lasso
model = Lasso(max_iter=100000)
grid = {'alpha': [1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 5e-1, 1, 2, 5, 1e+1, 2e+1, 5e+1, 1e+2]}

# Run default parameters
print("-- Default hyperparameters (Lasso) -- ")
model_selection(Lasso(), X, y, metric, RANDOM_STATE)

# Optimize hyperparameters
print("-- Optimized hyperparameters (Lasso) -- ")
model_selection(model, X, y, metric, RANDOM_STATE, grid)


-- Default hyperparameters (Lasso) -- 
cv scores:
nan
nan
nan
nan
nan
mean score:
nan
-- Optimized hyperparameters (Lasso) -- 
Fitting 5 folds for each of 13 candidates, totalling 65 fits
Fitting 5 folds for each of 13 candidates, totalling 65 fits
Fitting 5 folds for each of 13 candidates, totalling 65 fits
Fitting 5 folds for each of 13 candidates, totalling 65 fits
Fitting 5 folds for each of 13 candidates, totalling 65 fits
cv scores and parameters:
0.45380529352374616 {'alpha': 0.01}
0.5904862498438523 {'alpha': 0.01}
0.727913534030311 {'alpha': 0.01}
0.5109102180978641 {'alpha': 0.01}
0.4250247726282202 {'alpha': 0.01}
mean score:
0.5416280136247987


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
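
The `nan` scores for the default Lasso are most likely caused by constant predictions. With the default `alpha=1.0`, the L1 penalty can shrink every coefficient to zero when the target values are small in magnitude, and the Spearman correlation is undefined for a constant vector. The sketch below reproduces this effect on synthetic data (an assumption: it does not use the GFP dataset).

```python
import warnings

import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Lasso

# Synthetic data with a small-magnitude target (assumption: not the GFP dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = 0.1 * X_demo[:, 0] + 0.01 * rng.normal(size=100)

# Default penalty strength: all coefficients are shrunk to zero
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
y_pred = lasso.predict(X_demo)
print("non-zero coefficients:", np.count_nonzero(lasso.coef_))
print("distinct predictions:", np.unique(y_pred).size)

# Spearman correlation is undefined for a constant input; SciPy returns nan
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    rho = spearmanr(y_demo, y_pred)[0]
print("Spearman with alpha=1.0:", rho)

# A weaker penalty keeps the informative coefficient and gives a defined score
weak = Lasso(alpha=0.01).fit(X_demo, y_demo)
rho_weak = spearmanr(y_demo, weak.predict(X_demo))[0]
print("Spearman with alpha=0.01:", rho_weak)
```

This would be consistent with the grid search above selecting the smallest value in the grid, `alpha=0.01`, in every fold.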


In [11]:
# try for GPR
from sklearn.gaussian_process import GaussianProcessRegressor
# n_restarts_optimizer controls how many times the optimizer restarts
# from different random starting points when fitting the kernel hyperparameters.
# No param_grid is passed for GPR, so no grid search is performed here.

# Run default parameters
print("-- Default hyperparameters (GPR) -- ")
model_selection(GaussianProcessRegressor(), X, y, metric, RANDOM_STATE)

# Optimize hyperparameters
print("-- Optimized hyperparameters (GPR) -- ")
model = GaussianProcessRegressor(n_restarts_optimizer=10, normalize_y=True, random_state=RANDOM_STATE)
model_selection(model, X, y, metric, RANDOM_STATE)

-- Default hyperparameters (GPR) -- 
cv scores:
-0.41019283674354196
-0.2987054643399872
-0.24997938309361661
-0.22505770476850992
-0.411772329829122
mean score:
-0.3191415437549555
-- Optimized hyperparameters (GPR) -- 
cv scores:
0.49781065531998825
0.6304605322160108
0.7075407649235625
0.6547133229629379
0.42597137568529864
mean score:
0.5832993302215596


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
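
A caveat worth checking: according to the scikit-learn documentation, when no `kernel` is passed the default is `ConstantKernel(1.0, constant_value_bounds="fixed") * RBF(1.0, length_scale_bounds="fixed")`. In that case there are no free kernel hyperparameters for `n_restarts_optimizer` to act on, and the improvement in the second run above would then come from `normalize_y=True`. The sketch below, on synthetic data, inspects the number of tunable hyperparameters and shows one way to pass a kernel with tunable bounds (the kernel choice and `alpha` value are illustrative assumptions).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Synthetic stand-in data (assumption: not the GFP dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=30)

# Default kernel: bounds are fixed, so nothing is optimized during fit
default_gpr = GaussianProcessRegressor(
    n_restarts_optimizer=10, normalize_y=True, random_state=0
).fit(X_demo, y_demo)
print(default_gpr.kernel_, "| tunable hyperparameters:", default_gpr.kernel_.n_dims)

# Explicit kernel with default (non-fixed) bounds: hyperparameters are optimized
tunable_kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
tuned_gpr = GaussianProcessRegressor(
    kernel=tunable_kernel, alpha=1e-2,
    n_restarts_optimizer=2, normalize_y=True, random_state=0
).fit(X_demo, y_demo)
print(tuned_gpr.kernel_, "| tunable hyperparameters:", tuned_gpr.kernel_.n_dims)
```

Whether a tunable kernel helps on the GFP data is not tested here; it would need to be evaluated with the same cross-validation as the other models.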


**Step 5: Model Construction: Construct the final model that will be used for prediction**
<br>The following steps are performed for each model in `model_construction_*.py`:
* Data loading: Load the data, `X = amino acid features (0:ncol-2)`, and `Y = Score (ncol-2)`
* Data processing: Standardize (`StandardScaler`) and transform the features (`fit_transform()`)
* Scoring metric: Define the metric used for calculating the performance score: `r2`, `rmse`, `pearson`, `spearman`
* Initialize the selected model and search for the best hyperparameters using grid search (`GridSearchCV()`)
* Use k-fold cross-validation (`KFold()` and `cross_validate()`) to evaluate the model's performance
* Train the model with the training dataset (`fit()`)
* Obtain the best parameters (`.best_params_`) and the performance score (`.best_score_`)
* Save the best estimator (`.best_estimator_`) using `pickle` (not demonstrated here)

Here, three models will be constructed: the linear model `Lasso()`, the support vector machine `SVR()`, and the Gaussian process model `GaussianProcessRegressor()`.

In [12]:
# Define function to construct the final model (grid search on the full dataset, without an outer CV)
import warnings
from sklearn.metrics import mean_squared_error, r2_score
    
def model_construction(default_model, X, y, metric, RANDOM_STATE, param_grid=None):
    # Define params
    N_SPLITS=5

    # Load metric
    scoring = set_scoring(metric)

    # Scale the features
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Define outer CV
    outer_cv = KFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)
    
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # ignores all warnings
        
        if param_grid: # run optimization if param_grid provided
            # Only use nested (inner and outer) CV when optimization is performed
            inner_cv = KFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)
            model = GridSearchCV(estimator=default_model, param_grid=param_grid, cv=inner_cv, scoring=scoring, verbose=1) # inner CV
            
            # No outer cross-validation is performed here
            # Run the grid search on the full dataset and refit the model with the best hyperparameters
            model.fit(X=X, y=y)
            
            print('best parameter:', model.best_params_)
            print('best score:', model.best_score_)

            # get best estimator
            bsmodel = model.best_estimator_

        else:
            bsmodel = default_model
            
            # Fit the model with its default hyperparameters
            bsmodel.fit(X=X, y=y)

            # Evaluate the prediction on the training data
            y_pred = bsmodel.predict(X)
            if metric == 'r2':
                val = r2_score(y, y_pred)
            elif metric == 'rmse':
                # negative RMSE, consistent with 'neg_root_mean_squared_error'
                val = - np.sqrt(mean_squared_error(y, y_pred))
            elif metric == 'pearson':
                val = pearsonr_metric(y, y_pred)
            elif metric == 'spearman':
                val = spearmanr_metric(y, y_pred)
            else:
                raise ValueError(f'wrong metric: {metric}')
            
            print('default score:', val)

    return bsmodel, scaler
    

In [13]:
# try for SVR
model = SVR()
svr_grid = {'gamma': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], 'C': [1e-2, 1e-1, 1e0, 1e1, 1e2], 'epsilon': [1e-4, 1e-3, 1e-2, 1e-1, 1e0]}

# Run default parameters
print("-- Default hyperparameters (SVR) -- ")
sdef_model, sdef_scaler = model_construction(model, X, y, metric, RANDOM_STATE)

# Optimize hyperparameters
print("-- Optimized hyperparameters (SVR) -- ")
sbs_model, sbs_scaler = model_construction(model, X, y, metric, RANDOM_STATE, svr_grid)

-- Default hyperparameters (SVR) -- 
default score: 0.5862369793351073
-- Optimized hyperparameters (SVR) -- 
Fitting 5 folds for each of 125 candidates, totalling 625 fits
best parameter: {'C': 0.1, 'epsilon': 0.01, 'gamma': 0.1}
best score: 0.5854679124322926


In [14]:
# try for Lasso
model = Lasso(max_iter=100000)
lasso_grid = {'alpha': [1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 5e-1, 1, 2, 5, 1e+1, 2e+1, 5e+1, 1e+2]}

# Run default parameters
print("-- Default hyperparameters (Lasso) -- ")
ldef_model, ldef_scaler = model_construction(model, X, y, metric, RANDOM_STATE)

# Optimize hyperparameters
print("-- Optimized hyperparameters (Lasso) -- ")
lbs_model, lbs_scaler = model_construction(model, X, y, metric, RANDOM_STATE, lasso_grid)

-- Default hyperparameters (Lasso) -- 
default score: nan
-- Optimized hyperparameters (Lasso) -- 
Fitting 5 folds for each of 13 candidates, totalling 65 fits
best parameter: {'alpha': 0.01}
best score: 0.5416280136247987


In [15]:
# try for GPR
# Run default parameters
print("-- Default hyperparameters (GPR) -- ")
gdef_model, gdef_scaler = model_construction(GaussianProcessRegressor(), X, y, metric, RANDOM_STATE)

# Optimize hyperparameters
print("-- Optimized hyperparameters (GPR) -- ")
model = GaussianProcessRegressor(n_restarts_optimizer=10, normalize_y=True, random_state=RANDOM_STATE)
gbs_model, gbs_scaler = model_construction(model, X, y, metric, RANDOM_STATE)

-- Default hyperparameters (GPR) -- 
default score: 0.6183705474834948
-- Optimized hyperparameters (GPR) -- 
default score: 0.6183810096914947


In [17]:
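# run this to save the model and scaler for future use
# (outsuf is not defined in this notebook; set it to your output file prefix first)
# import pickle
# pickle.dump(gbs_model, open(outsuf + ".model.pickle", 'wb'))
# pickle.dump(gbs_scaler, open(outsuf + ".scaler.pickle", 'wb'))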
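
As a minimal sketch of the save-and-reload round trip, the example below pickles a small model and its scaler to a temporary directory and checks that the reloaded pair gives the same predictions. The data, the `demo_*` names, and the `out_prefix` path are hypothetical stand-ins, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the GFP dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=40)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = Lasso(alpha=0.01).fit(demo_scaler.transform(X_demo), y_demo)

# Hypothetical output prefix inside a temporary directory
out_prefix = os.path.join(tempfile.mkdtemp(), "demo")

# Save the model and the scaler; both are needed at prediction time
with open(out_prefix + ".model.pickle", "wb") as fh:
    pickle.dump(demo_model, fh)
with open(out_prefix + ".scaler.pickle", "wb") as fh:
    pickle.dump(demo_scaler, fh)

# Reload and confirm the predictions are unchanged
with open(out_prefix + ".model.pickle", "rb") as fh:
    loaded_model = pickle.load(fh)
with open(out_prefix + ".scaler.pickle", "rb") as fh:
    loaded_scaler = pickle.load(fh)

same = np.allclose(
    demo_model.predict(demo_scaler.transform(X_demo)),
    loaded_model.predict(loaded_scaler.transform(X_demo)),
)
print("predictions identical after reload:", same)
```

Pickled estimators should be reloaded with the same scikit-learn version that created them, and pickle files should only be loaded from trusted sources.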

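In the cross-validation of Step 4, the GPR with `normalize_y=True` reached the highest mean Spearman correlation (0.583, compared with 0.566 for the optimized SVR and 0.542 for the optimized Lasso), so GPR appears to outperform the other models. Therefore, we will use this model for prediction.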

**Step 6: Model Prediction**
<br>After evaluating several models based on their performance metrics, the best model can be saved for prediction on unseen or real data. For example, the model can be used to predict the fluorescence intensity of variants, given the amino acid features as the input.
<br>The following steps are performed for the saved model in `model_prediction_*.py`:
* Data loading: Load the data, `X = amino acid features (0:ncol-2)`, and `Y = Score (ncol-2)`
* Data processing: Standardize the features with the scaler fitted during training (`transform()`, not `fit_transform()`)
* Scoring metric: Define the metric used for calculating the performance score: `r2`, `rmse`, `pearson`, `spearman`
* Load the selected model that had been previously saved (`pickle.load()`)
* Make predictions (`predict(X)`)
* Print the performance score
* Print the predicted values
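
The difference between reusing the training scaler and refitting a scaler on new data can be sketched as follows. The arrays are synthetic and the new data are deliberately shifted, an assumption made only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic training data and shifted "new" data (assumption: illustrative only)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
X_new = rng.normal(loc=2.0, scale=1.0, size=(50, 3))

# Correct: fit on the training data, then only transform the new data
train_scaler = StandardScaler().fit(X_train)
X_new_ok = train_scaler.transform(X_new)

# Problematic: refitting on the new data erases its shift relative to training
X_new_refit = StandardScaler().fit_transform(X_new)

print("mean after transform():    ", X_new_ok.mean(axis=0).round(2))
print("mean after fit_transform():", X_new_refit.mean(axis=0).round(2))
```

With `transform()`, the new data keep their offset relative to the training distribution, so the model sees them on the scale it was trained on. With `fit_transform()`, every feature is re-centred to zero and that information is lost.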

**Load pre-computed dataset for prediction**
<br>The dataset contains pre-computed T-scale features and score for a set of GFP mutants (n=160,000)

In [18]:
pred_file = "../data/umetsu/Umetsu_GFP_T-scale_pred.csv"
pred_df = pd.read_csv(pred_file)
print(pred_df.head(2))

  Sequence      f   f.1   f.2   f.3   f.4   f.5   f.6   f.7   f.8  ...  f.11  \
0     SSHT  -7.44 -0.65  0.68 -0.17  1.58 -7.44 -0.65  0.68 -0.17  ... -1.31   
1     GAYF -10.61 -1.21 -0.12  0.75  3.25 -9.11 -1.63  0.63  1.04  ... -0.47   

   f.12  f.13  f.14  f.15  f.16  f.17  f.18  f.19     Score  
0  0.01 -1.81 -0.21 -5.97 -0.62  1.11  0.31  0.95 -0.750000  
1  0.07 -1.67 -0.35  0.49 -0.94 -0.63 -1.27 -0.44 -0.001643  

[2 rows x 22 columns]


In [19]:
X = pred_df.iloc[:, 1:-1]
y = pred_df['Score']
print (X.head(2))
print(y.head(2))

       f   f.1   f.2   f.3   f.4   f.5   f.6   f.7   f.8   f.9  f.10  f.11  \
0  -7.44 -0.65  0.68 -0.17  1.58 -7.44 -0.65  0.68 -0.17  1.58 -1.01 -1.31   
1 -10.61 -1.21 -0.12  0.75  3.25 -9.11 -1.63  0.63  1.04  2.26  2.08 -0.47   

   f.12  f.13  f.14  f.15  f.16  f.17  f.18  f.19  
0  0.01 -1.81 -0.21 -5.97 -0.62  1.11  0.31  0.95  
1  0.07 -1.67 -0.35  0.49 -0.94 -0.63 -1.27 -0.44  
0   -0.750000
1   -0.001643
Name: Score, dtype: float64


**Perform prediction on the prediction list**

In [29]:
# Define a function for model prediction

def model_prediction(df, model, scaler, metric):
    
    # Define X and y here
    X = df.iloc[:, 1:-1]
    y = df['Score']
    
    # Transform the features with transform() only (not fit_transform()), so the scaler is not refitted on the new data
    # The scaler must be the same one used for training and hyperparameter optimization
    X_scaled = scaler.transform(X)
    
    # Make prediction
    y_pred = model.predict(X_scaled)

    # Evaluate the prediction
    if metric == 'r2':
        val = r2_score(y, y_pred)
    elif metric == 'rmse':
        # negative RMSE, consistent with 'neg_root_mean_squared_error'
        val = - np.sqrt(mean_squared_error(y, y_pred))
    elif metric == 'pearson':
        val = pearsonr_metric(y, y_pred)
    elif metric == 'spearman':
        val = spearmanr_metric(y, y_pred)
    else:
        raise ValueError(f'wrong metric: {metric}')
    
    print('score:', val)

    # Create a dataframe with the sequences, reference scores and predictions
    pred = pd.concat([df['Sequence'], df['Score'], pd.DataFrame(y_pred, columns=['Pred'])], axis=1)
    
    return pred

**Make prediction**

In [30]:
g_pred = model_prediction(pred_df, gbs_model, gbs_scaler, metric)
l_pred = model_prediction(pred_df, lbs_model, lbs_scaler, metric)
s_pred = model_prediction(pred_df, sbs_model, sbs_scaler, metric)



score: 0.037923251242282005
score: -0.001856800304612085
score: 0.014895402756190319


The evaluation scores are close to zero, but this is expected given the large number of data points in the prediction list (n=160,000) compared with the 155 variants used for training.

In [31]:
# Get the 10 mutants with the highest predicted scores

results = g_pred.sort_values(by='Pred', ascending=False)
results.head(10)

| | Sequence | Score | Pred |
|---|---|---|---|
| 1 | GAYF | -0.001643 | -0.001643 |
| 2 | CCFV | -0.22449 | -0.22449 |
| 7 | ASSV | -0.319873 | -0.319873 |
| 20 | GSHT | -0.330087 | -0.330087 |
| 56276 | GAHF | 1.0 | -0.330359 |
| 13 | SSHY | -0.372963 | -0.372963 |
| 56376 | GAFF | 1.0 | -0.392235 |
| 56480 | GAYY | 1.0 | -0.407431 |
| 56471 | GAYH | 1.0 | -0.417859 |
| 22 | TSHT | -0.418958 | -0.418958 |


**References:** <br>
* [Fluorescence TAPE benchmark dataset](https://github.com/songlab-cal/tape)
* [Peptides - amino acid descriptors](https://peptides.readthedocs.io/en/stable/)
* [scikit-learn: Nested versus non-nested cross-validation](https://scikit-learn.org/1.5/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
* [Model selection: Nested Cross-Validation for Machine Learning with Python](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/)
* [Model selection: Training-validation-test split and cross-validation done right](https://machinelearningmastery.com/training-validation-test-split-and-cross-validation-done-right/)
* [Model construction: How to Train a Final Machine Learning Model](https://machinelearningmastery.com/train-final-machine-learning-model/)
* [Model training: Embrace Randomness in Machine Learning](https://machinelearningmastery.com/randomness-in-machine-learning/)
