<img src="https://github.com/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/img/nb_logo.png?raw=1" width="600">

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/prediction.ipynb)


This notebook is adapted from the [supervised variant prediction example](https://github.com/facebookresearch/esm/blob/main/examples/sup_variant_prediction.ipynb) by [Meta Research](https://research.facebook.com/). It takes the output of the [Embeddings notebook](https://colab.research.google.com/github/Multiomics-Analytics-Group/course_protein_language_modeling/blob/main/notebooks/embeddings.ipynb) as its input.

In [None]:
# Install requirements
!pip install h5py > /dev/null

In [None]:
# Imports needed to load the embeddings and build the feature matrix below
import h5py
import torch

In [None]:
def read_hdf5(path):
    """Read every dataset in an HDF5 file into a dict mapping key -> list of values."""
    weights = {}
    with h5py.File(path, 'r') as f:  # open file read-only
        for key in f.keys():
            weights[key] = list(f[key])
    return weights

In [None]:
per_protein_path = "./protT5/output/per_protein_embeddings.h5"

embeddings = read_hdf5(path=per_protein_path)

In [None]:
embeddings.keys()

In [None]:
import numpy as np

In [None]:
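ys = []
Xs = []
num_na = 0  # total number of NaN values across all embeddings

for key in embeddings:
    # The scaled effect (regression target) is the last '|'-separated field of the key
    scaled_effect = key.split('|')[-1]
    ys.append(float(scaled_effect))
    embs = np.array(embeddings[key])
    num_na += np.count_nonzero(np.isnan(embs))
    Xs.append(torch.from_numpy(embs))

# Stack into a single (n_samples, n_features) array
Xs = torch.stack(Xs, dim=0).numpy()
print(len(ys))
print(Xs.shape)
print(f'NaN values in embeddings: {num_na}')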
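
The label parsing above relies on each key ending in `|<scaled effect>`. A minimal sketch of that convention, using made-up keys (the identifiers and values are hypothetical, not taken from the dataset, and `parse_scaled_effect` is a helper introduced only for illustration):

```python
# Hypothetical keys following the "<name>|<scaled_effect>" convention assumed above
example_keys = ["protein_M1A|0.93", "protein_K7R|1.05", "protein_G12D|0.10"]


def parse_scaled_effect(key):
    """Return the float after the last '|' in an embedding key."""
    return float(key.split("|")[-1])


example_ys = [parse_scaled_effect(k) for k in example_keys]
print(example_ys)
```

If a key does not end in a numeric field, `float()` raises a `ValueError`, which is a quick way to spot entries that do not follow the expected naming scheme.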

### Train / Test Split

Following the Envision paper, we use 80% of the data for training. The authors of the original ESM notebook note, however, that pre-trained embeddings need fewer downstream training examples to reach the same level of performance.

In [None]:
import scipy.stats  # explicit submodule import so scipy.stats.spearmanr is available
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

import matplotlib.pyplot as plt
import pandas as pd

In [None]:
train_size = 0.8
Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs, ys, train_size=train_size, random_state=42)

### PCA

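Principal Component Analysis (PCA) is a popular technique for dimensionality reduction. Given `n_features` input dimensions (the embedding size, `Xs.shape[1]`), PCA computes a new, smaller set of `X` features that "best explain the data." We've found that this enables downstream models to be trained faster with minimal loss in performance. In this notebook, the projection is used for the visualization in the next section, while the grid search is fit on the full embeddings.

Here, we set `X` to 100, but feel free to change it!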


In [None]:
num_pca_components = 100
pca = PCA(num_pca_components)
Xs_train_pca = pca.fit_transform(Xs_train)
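
One way to choose the number of components is to look at the cumulative explained variance. The sketch below uses a synthetic matrix as a stand-in for the embeddings (all `*_demo` names and the 95% threshold are assumptions for illustration); the same attribute, `explained_variance_ratio_`, is available on the fitted `pca` object above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the embedding matrix: 200 samples, 50 features,
# generated from 5 latent directions plus a little noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

pca_demo = PCA(n_components=10).fit(X_demo)

# Fraction of total variance explained by the first k components, for k = 1..10
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print(n_needed)
```

Because the synthetic data has only 5 underlying directions, at most 5 components are needed here; real embeddings will typically need more.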

<a id='viz_embeddings'></a>
## Visualize Embeddings

Here, we plot the first two principal components on the x- and y-axes. Each point is colored by its scaled effect (the value we want to predict).

Visually, we can see a separation based on color/effect, suggesting that our representations are useful for this task, without any task-specific training!

In [None]:
fig_dims = (7, 6)
fig, ax = plt.subplots(figsize=fig_dims)
sc = ax.scatter(Xs_train_pca[:,0], Xs_train_pca[:,1], c=ys_train, marker='.')
ax.set_xlabel('PCA first principal component')
ax.set_ylabel('PCA second principal component')
plt.colorbar(sc, label='Variant Effect')

### Initialize grids for different regression techniques

In [None]:
knn_grid = [
    {
        'n_neighbors': [5, 10],
        'weights': ['uniform', 'distance'],
        'leaf_size' : [15, 30],
        'p' : [1, 2],
    }
    ]

svm_grid = [
    {
        'C' : [0.1, 1.0, 10.0],
        'kernel' : ['linear', 'poly', 'rbf', 'sigmoid'],
        'degree' : [3],
        'gamma': ['scale'],
    }
]

In [None]:
cls_list = [KNeighborsRegressor(), SVR()]
param_grid_list = [knn_grid, svm_grid]

### Run Grid Search

(will take a few minutes on a single core)

In [None]:
result_list = []
grid_list = []
for cls_name, param_grid in zip(cls_list, param_grid_list):
    grid = GridSearchCV(
        estimator = cls_name,
        param_grid = param_grid,
        scoring = 'r2',
        verbose = 1,
        n_jobs = -1 # use all available cores
    )
    grid.fit(Xs_train, ys_train)
    result_list.append(pd.DataFrame.from_dict(grid.cv_results_))
    grid_list.append(grid)
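
The grid search above is fit on the full embeddings (`Xs_train`). If you would rather train on PCA-reduced features, one option is to put the PCA step and the regressor in a scikit-learn `Pipeline`, so that PCA is refit inside every cross-validation fold. This is a sketch on synthetic data, not the setup used in this notebook; the step names `pca` and `model`, the `*_demo` arrays, and the small grid are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic regression problem standing in for (Xs_train, ys_train)
X_demo = rng.normal(size=(120, 30))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=120)

# PCA followed by the regressor; parameters are addressed as "<step>__<param>"
pipe = Pipeline([("pca", PCA(n_components=10)), ("model", SVR())])
pipe_grid = {"model__C": [0.1, 1.0, 10.0], "model__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, pipe_grid, scoring="r2", cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Calling `search.predict` on new data then applies the fitted PCA transform automatically before the regressor.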

<a id='browse'></a>
## Browse the Sweep Results

The following tables show the top 5 parameter settings, based on `mean_test_score`. Given our setup, this should really be thought of as `validation_score`.

### K Nearest Neighbors

In [None]:
result_list[0].sort_values('mean_test_score', ascending=False)[:5]  # best settings first

### SVM

In [None]:
result_list[1].sort_values('mean_test_score', ascending=False)[:5]  # best settings first

### Random Forest

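Random forest is not part of the sweep above: `cls_list` contains only the KNN and SVM regressors, so `result_list` has no third entry. To include it, add a `RandomForestRegressor()` and a matching parameter grid to `cls_list` and `param_grid_list` before running the grid search.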

<a id='eval'></a>
## Evaluation

Now that we have run grid search, each `grid` object contains a `best_estimator_`.

We can use this to evaluate the correlation between our predictions and the true effect scores on the held-out test set.

In [None]:
for grid in grid_list:
    print(grid.best_estimator_.get_params()) # get the model details from the estimator
    print()
    preds = grid.predict(Xs_test)
    print(f'{scipy.stats.spearmanr(ys_test, preds)}')
    print('\n', '-' * 80, '\n')
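
Spearman's rho compares the rank ordering of predictions and true values, not their magnitudes. A small worked example with made-up numbers (all values below are hypothetical):

```python
from scipy import stats

true_effects = [0.1, 0.4, 0.5, 0.9, 1.2]  # made-up target values
same_order = [0.2, 0.3, 0.7, 0.8, 2.5]    # different scale, identical ranking
one_swap = [0.3, 0.2, 0.7, 0.8, 2.5]      # the first two predictions are swapped

rho_same, _ = stats.spearmanr(true_effects, same_order)
rho_swap, _ = stats.spearmanr(true_effects, one_swap)

print(rho_same)  # 1.0: the ranking matches exactly
print(rho_swap)  # approximately 0.9: 1 - 6 * 2 / (5 * (5**2 - 1))
```

A model can therefore score a high rho even when its predictions are on a different scale from the true effects, as long as it orders the variants correctly.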

The SVM performs best on the test set, with a Spearman's rho of 0.78.