<img src="https://github.com/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/img/nb_logo.png?raw=1" width="600">

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/prediction.ipynb)


This notebook is adapted from the [Meta Research](https://research.facebook.com/) ESM example notebook ([here](https://github.com/facebookresearch/esm/blob/main/examples/sup_variant_prediction.ipynb)) and uses the output of the [Embeddings notebook](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/embeddings.ipynb).

In [None]:
# Install requirements
!pip install h5py > /dev/null

# Predicting Variant Effect from Sequence Embeddings

In this notebook we will use the embeddings we generated in embeddings.ipynb to train and optimize various machine learning models in sklearn.

Each observation in our dataset, which we created in embeddings.ipynb (and which can also be found in the git repo at _data/per_protein_embeddings.h5.zip_), contains:
- value: an embedded representation of the mutated β-lactamase sequence
- key: `{index}|beta-lactamase_{mutation}|{scaled_varient_effect}`, where `scaled_varient_effect` is the target value: the scaled effect of the mutation
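As a minimal sketch of how such a key breaks apart, the snippet below splits one key on the `|` separator. The key string and the `parse_key` helper are hypothetical and only illustrate the pattern; the notebook itself does this with pandas in `emb_to_df()`, which keeps the `beta-lactamase_` prefix in the mutation column.

```python
# Hypothetical key following the pattern described above; the values are made up.
example_key = "0|beta-lactamase_H24C|0.52"


def parse_key(key: str) -> dict:
    """Split a '|'-separated key into its three named parts."""
    index_value, name, effect = key.split("|")
    # The mutation code follows the 'beta-lactamase_' prefix.
    mutation = name.split("_", 1)[1]
    return {
        "index_value": index_value,
        "mutation": mutation,
        "scaled_varient_effect": float(effect),  # the regression target
    }


parsed = parse_key(example_key)
print(parsed)
```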

**Goal:**
Train a regression model to predict the "effect" score of a $\beta$-lactamase variant given its embedding.

In [None]:
# imports
import h5py
import zipfile
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy

# for fine-tuning
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# dimensionality reduction
from sklearn.decomposition import PCA

# models
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# path to files
zip_path = '../data/per_protein_embeddings.h5.zip' # local path
#zip_path = 'per_protein_embeddings.h5.zip' # on Colab
filename = 'per_protein_embeddings.h5'

In [None]:
# unzip the embeddings file into the current working directory
# (comment this cell out if the file has already been extracted)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    # extract the filename of interest
    zip_ref.extract(filename)

In [None]:
# functions to help us with reading in our dataset

def read_hdf5(path:str) -> dict:
    '''
    read in the h5 file to a dictionary
    '''
    # empty dict to hold the embeddings
    weights = {}
    
    # open file
    with h5py.File(path, 'r') as f:
        # store each embedding under its key
        for key in f.keys():
            weights[key] = list(f[key])
            
    return weights

def emb_to_df(emb:dict) -> pd.DataFrame:
    '''
    Takes the dictionary from read_hdf5() to create a dataframe. 
    This function is super specific to per_protein_embeddings.h5
    '''
    # to dataframe
    df_seq = pd.DataFrame.from_dict(emb, orient='index').reset_index()
    
    # additional formatting for our purposes
    # making each part of key its own column
    header = df_seq['index'].str.split('|', expand=True).rename(
        columns={
            0:'index_value',
            1:'mutation',
            2:'scaled_varient_effect'
        }
    )
    
    # combining with sequence embeddings
    df = pd.concat([header, df_seq.drop('index', axis=1)], axis=1)

    # target column to float dtype
    df['scaled_varient_effect'] = df['scaled_varient_effect'].astype(float)
            
    return df

### The Dataset

Here, we read in the embeddings as a dataframe and take a look. 

In [None]:
# load in embeddings
embeddings = read_hdf5(filename)

# to dataframe
df = emb_to_df(embeddings)

In [None]:
# What does our data look like? 
print(df.shape)
print(f'Missing data? {df.isna().any().any()}')
display(df.head())

Next, we prepare the dataset by separating the features (the embedding only) from the target (`scaled_varient_effect`).

In [None]:
# target
y = df['scaled_varient_effect']

# isolating features
X = df.drop(['index_value', 'mutation', 'scaled_varient_effect'], axis=1)

# check
print(X.shape, y.shape)

### Train / Test Split

Here we follow the Envision paper and use 80% of the data for training. Note, however, that the authors of the original ESM notebook report that pre-trained ESM embeddings require fewer downstream training examples to reach the same level of performance.

The test set will be used at the end of our notebook to assess how the model performs on never before seen data. 

The training set will be used to train the models and fine-tune hyperparameters via a validation set (i.e., an intermediary 'test' set), which we will incorporate later in the notebook via cross-validation.
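To make the relationship between the test set and the validation folds concrete, here is a small sketch on toy data. The array sizes and the names `X_toy`, `y_toy` are assumptions for illustration only. It holds out 20% as a test set and then splits the remaining training data into 5 folds with `KFold`, which is the splitter `GridSearchCV` uses for regressors when given `cv=5`.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy stand-in for the embeddings: 100 samples, 8 features (values are arbitrary).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 8))
y_toy = rng.normal(size=100)

# Hold out 20% of the data as the final test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Split the training data into 5 folds; each fold serves once as the validation set.
kf = KFold(n_splits=5)
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf.split(X_tr)]

print(X_tr.shape, X_te.shape)
print(fold_sizes)
```

The test rows never appear in any fold, so the final test score stays an estimate on unseen data.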

In [None]:
# additional model import (the other sklearn classes were already imported at the top of the notebook)
from sklearn.linear_model import LogisticRegression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=42)

### PCA

Principal Component Analysis (PCA) is a popular technique for dimensionality reduction. Given `n_features` (1024 in our case), PCA computes a new set of orthogonal `components`, ordered by how much of the variance in the data each one captures.

Using only the first few `components` reduces the number of dimensions (i.e., columns), which enables downstream models to be trained faster with minimal loss in performance.

For the example below, we arbitrarily set the number of `components` to 50, but feel free to change it!
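One less arbitrary way to pick the number of components is to look at the cumulative explained variance. The sketch below uses synthetic data (an assumption, not the notebook's embeddings) and a hypothetical 95% target to find the smallest number of components that reaches it.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 30 features driven by 5 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 30))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(200, 30))

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X_toy)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.95  # hypothetical threshold
n_needed = int(np.argmax(cum_var >= target)) + 1

print(n_needed, cum_var[n_needed - 1])
```

In the grid search later in the notebook, the number of components is instead treated as a hyperparameter and chosen by validation score.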

In [None]:
num_pca_components = 50

# instantiate
pca = PCA(num_pca_components)

# fit to the data and keep only num_pca_components
X_train_pca = pca.fit_transform(X_train)

In [None]:
# how much variance in the data is captured in the selected components?

# calculating variance captured
exp_var = sum(pca.explained_variance_ratio_)

print(f'{exp_var} of variance is captured in {num_pca_components} PCs.')

### Visualize Embeddings

Here, we plot the first two principal components on the x- and y- axes. Each point is then colored by its scaled effect (what we want to predict).

Visually, we can see a separation based on color/effect, suggesting that our representations are useful for this task, without any task-specific training!

In [None]:
fig_dims = (7, 6)
fig, ax = plt.subplots(figsize=fig_dims)
sc = ax.scatter(X_train_pca[:,0], X_train_pca[:,1], c=y_train, marker='.')
ax.set_xlabel('PCA first principal component')
ax.set_ylabel('PCA second principal component')
plt.colorbar(sc, label='Variant Effect')
plt.show()

### Setting up our pipeline process

Sklearn has a [`Pipeline` class](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that allows us to chain together pre-processing and model training steps. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. We will later create a list of `params` consisting of the different hyperparameters we wish to fine-tune using cross-validation.

The sequence in the `Pipeline()` will be:

- A Dimensionality Reduction technique to reduce the number of dimensions ( `PCA()` ), and
- Training a regressor model on the training dataset

where the data will be passed from one transformer to the next in that order. 
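The following sketch, on toy data with hypothetical names (`toy_pipe`, `X_fit`, `X_new`), illustrates what the pipeline does under the hood: the PCA step is fitted on the training data only, and the same fitted transform is then applied to new data before the regressor predicts. Fitting the two steps by hand should give the same predictions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Toy data (values are arbitrary).
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(60, 10))
y_fit = X_fit[:, 0] + 0.1 * rng.normal(size=60)
X_new = rng.normal(size=(5, 10))

# Pipeline version: PCA followed by a k-nearest-neighbors regressor.
toy_pipe = Pipeline([
    ('reduce_dim', PCA(n_components=3, svd_solver='full')),
    ('model', KNeighborsRegressor(n_neighbors=3)),
])
toy_pipe.fit(X_fit, y_fit)
pipe_preds = toy_pipe.predict(X_new)

# Manual version: fit PCA on the training data, then reuse it for the new data.
pca_step = PCA(n_components=3, svd_solver='full').fit(X_fit)
knn_step = KNeighborsRegressor(n_neighbors=3).fit(pca_step.transform(X_fit), y_fit)
manual_preds = knn_step.predict(pca_step.transform(X_new))

print(np.allclose(pipe_preds, manual_preds))
```

Because the transform is refitted inside every cross-validation fold, the validation data never influences the principal components.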

In [None]:
pipe = Pipeline(
    [
        ('reduce_dim', PCA(n_components=50)),
        ('model', KNeighborsRegressor()) 
    ]
)

Without any hyperparameter tuning, we can see how the model performs by fitting our pipeline to the training data and scoring it on both the train and test sets. Here, a [score](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.score) (the coefficient of determination, $R^2$) closer to 1.0 is better.
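For regressors in sklearn, `.score()` returns the coefficient of determination $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on made-up numbers and compares it with `sklearn.metrics.r2_score`; the arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values.
y_true = np.array([0.1, 0.4, 0.5, 0.9, 1.2])
y_pred = np.array([0.2, 0.3, 0.6, 0.8, 1.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# A model that always predicts the mean of y_true scores 0.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_pred), r2_mean)
```

A score of 1.0 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible for models that do worse than that.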

In [None]:
pipe.fit(X_train, y_train)
print(pipe.score(X_train, y_train))
print(pipe.score(X_test, y_test))

- How do we know if this is the best model?
- How do we know if we included the right number of components?
- What about the hyperparameters `n_neighbors`, `leaf_size`, `p` and `weights`? Can we fine-tune these to get a model with a better score on the test set?
- What about another algorithm, like SVR?

### Initialize parameter-grid for fine-tuning

We will let [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) investigate which algorithm and algorithm-specific hyperparameters result in the best model (i.e., highest score) by completing a cross-validated grid-search over a parameter grid we provide.

We will need to provide the options for `GridSearchCV` to explore. When selecting the optimal set of parameters to include we should consider the algorithm, our problem and what we know about the data. 

From [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) we know that `GridSearchCV()` expects `param_grid` to be a dictionary, or a list of dictionaries, that maps each parameter name to a list of values to try.
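To get a feel for how large such a search becomes, the sketch below counts the candidate combinations with sklearn's `ParameterGrid`. The grid mirrors the shape of the grids defined in the next cell, but uses the placeholder strings `'knn'` and `'svr'` instead of estimator objects, and assumes 5 cross-validation folds.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder grids: same structure as a list of dictionaries of lists.
toy_grid = [
    {
        'model': ['knn'],  # placeholder string, not an estimator
        'model__n_neighbors': [5, 10],
        'model__weights': ['uniform', 'distance'],
        'model__leaf_size': [15, 30],
        'model__p': [1, 2],
        'reduce_dim__n_components': [50, 100],
    },
    {
        'model': ['svr'],  # placeholder string, not an estimator
        'model__C': [0.01, 0.1, 1.0, 10.0],
        'model__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'reduce_dim__n_components': [50, 100],
    },
]

# Each dictionary is expanded separately; the totals are then added up.
n_candidates = len(ParameterGrid(toy_grid))
n_fits = n_candidates * 5  # one fit per candidate per fold

print(n_candidates, n_fits)
```

Every extra value in a list multiplies the number of candidates within its dictionary, so grids grow quickly.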

In [None]:
# Save the parameters to be tuned as a list of dictionaries

# Including 'reduce_dim__n_components' lets the search determine which number of
# principal components gives the best score on the validation data

kn_grid = {
    'model': [KNeighborsRegressor()],
    'model__n_neighbors': [5, 10],
    'model__weights': ['uniform', 'distance'],
    'model__leaf_size' : [15, 30],
    'model__p' : [1, 2],
    'reduce_dim__n_components': [50, 100]
}

svr_grid = {
    'model': [SVR(gamma='scale', degree=3)],
    'model__C': [0.01, 0.1, 1.0, 10.0],
    'model__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'reduce_dim__n_components': [50, 100]
}

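# Note: lr_grid is not included in `params` below. LogisticRegression is a
# classifier and cannot be fit to a continuous target such as ours.
lr_grid = {
    'model': [LogisticRegression(max_iter=100000)],
    'model__C': [0.01, 0.1, 1],
    'reduce_dim__n_components': [50, 100]
}

params = [kn_grid, svr_grid]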

### Run Grid Search

(This will take a few minutes on a single core)


In [None]:
# instantiating the grid search
best_model = GridSearchCV(
    estimator=pipe, 
    param_grid=params, 
    scoring='r2',
    cv=5, # 5 folds (validation)
    verbose=3,
    n_jobs=-1 # use all cores; for verbose output during the run, set n_jobs to 1
)

In [None]:
# running
best_model.fit(X_train, y_train)

So which model and hyperparameters did the grid search select as the best?

In [None]:
# look at parameters
best_model.best_estimator_.get_params()

SVR with an RBF kernel and a $C$ value of 1.0, trained on 100 principal components (that's only about 10% of the number of features we started with!), was selected as the best estimator.

Using the GridSearchCV `.predict()` method will [call predict on the best found parameters.](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.predict)
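As a small illustration of this behaviour, the sketch below runs a grid search on toy data (the data and the `search` object are assumptions for illustration only). With the default `refit=True`, the best configuration is refitted on the whole training set, so `search.predict()` and `search.best_estimator_.predict()` should agree.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Toy regression data (values are arbitrary).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 4))
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=80)

search = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={'n_neighbors': [1, 3, 5]},
    scoring='r2',
    cv=5,
)
search.fit(X_toy, y_toy)

# Best hyperparameters and their mean cross-validated score.
print(search.best_params_)
print(search.best_score_)

# predict() on the search object delegates to the refitted best estimator.
same = np.allclose(search.predict(X_toy), search.best_estimator_.predict(X_toy))
print(same)
```

The per-candidate validation scores are available in `search.cv_results_['mean_test_score']` if you want to compare the runners-up.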

Let's take a look at how the model scores on the train and (most importantly) the test set, sequences it has never seen before.  

In [None]:
print(f'training set score {best_model.score(X_train, y_train)}')
print(f'test set score {best_model.score(X_test, y_test)}')

In [None]:
# Spearman rank correlation between the true and predicted effects on the test set
from scipy.stats import spearmanr

preds = best_model.predict(X_test)
print(spearmanr(y_test, preds))

We achieved a Spearman's rho of 0.815 on the test set!

This is in line with our grid-search results, where the same model configuration also had the best validation performance.
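For context on the metric itself: Spearman's rho is the Pearson correlation computed on the ranks of the values, so it measures whether predictions order the variants correctly rather than whether they match the exact scores. The sketch below shows this on made-up numbers with a monotonic but nonlinear relationship.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Made-up data: y increases with x, but not linearly.
x = np.array([0.1, 0.5, 1.0, 2.0, 3.0, 4.5])
y = np.exp(x)

rho, _ = spearmanr(x, y)  # rank correlation
r, _ = pearsonr(x, y)     # linear correlation

# Spearman's rho equals Pearson's r computed on the ranks.
rho_from_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print(rho, r, rho_from_ranks)
```

Here the ordering is preserved perfectly, so rho reaches its maximum even though the linear correlation does not.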

In conclusion, our downstream model was able to use fixed pre-trained ESM embedding representations and obtain a decent result.