## Dataset Walkthrough

The dataset is a standard dataframe importabale through pandas. 

In [1]:
import pandas as pd

df_train = pd.read_csv("af2_dataset_training_labeled.csv.gz", index_col=0)
df_train

Unnamed: 0,annotation_sequence,feat_A,feat_C,feat_D,feat_E,feat_F,feat_G,feat_H,feat_I,feat_K,...,feat_DSSP_10,feat_DSSP_11,feat_DSSP_12,feat_DSSP_13,coord_X,coord_Y,coord_Z,entry,entry_index,y_Ligand
0,M,False,False,False,False,False,False,False,False,False,...,0,0.0,47,-0.0,-26.499001,-4.742000,-35.189999,GEMI5_HUMAN,0,False
1,G,False,False,False,False,False,True,False,False,False,...,0,0.0,0,0.0,-25.158001,-1.342000,-34.104000,GEMI5_HUMAN,1,False
2,Q,False,False,False,False,False,False,False,False,False,...,1,-0.0,-1,-0.0,-21.926001,-1.641000,-32.175999,GEMI5_HUMAN,2,False
3,E,False,False,False,True,False,False,False,False,False,...,706,-0.1,705,-0.0,-22.073999,0.654000,-29.171000,GEMI5_HUMAN,3,False
4,P,False,False,False,False,False,False,False,False,False,...,0,0.0,705,-0.2,-19.783001,2.670000,-26.858999,GEMI5_HUMAN,4,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755,S,False,False,False,False,False,False,False,False,False,...,-3,-0.1,2,-0.4,-19.742001,20.796000,-12.319000,AOC3_HUMAN,755,False
756,H,False,False,False,False,False,False,True,False,False,...,-358,-0.1,-330,-0.1,-16.299000,19.153999,-12.640000,AOC3_HUMAN,756,False
757,G,False,False,False,False,False,True,False,False,False,...,-360,-0.2,-1,-0.1,-13.404000,19.502001,-10.121000,AOC3_HUMAN,757,False
758,G,False,False,False,False,False,True,False,False,False,...,0,0.0,0,0.0,-10.986000,20.320000,-13.016000,AOC3_HUMAN,758,False


In [2]:
df_train.columns

Index(['annotation_sequence', 'feat_A', 'feat_C', 'feat_D', 'feat_E', 'feat_F',
       'feat_G', 'feat_H', 'feat_I', 'feat_K', 'feat_L', 'feat_M', 'feat_N',
       'feat_P', 'feat_Q', 'feat_R', 'feat_S', 'feat_T', 'feat_V', 'feat_W',
       'feat_Y', 'annotation_atomrec', 'feat_PHI', 'feat_PSI', 'feat_TAU',
       'feat_THETA', 'feat_BBSASA', 'feat_SCSASA', 'feat_pLDDT', 'feat_DSSP_H',
       'feat_DSSP_B', 'feat_DSSP_E', 'feat_DSSP_G', 'feat_DSSP_I',
       'feat_DSSP_T', 'feat_DSSP_S', 'feat_DSSP_6', 'feat_DSSP_7',
       'feat_DSSP_8', 'feat_DSSP_9', 'feat_DSSP_10', 'feat_DSSP_11',
       'feat_DSSP_12', 'feat_DSSP_13', 'coord_X', 'coord_Y', 'coord_Z',
       'entry', 'entry_index', 'y_Ligand'],
      dtype='object')

All columns with the `feat_*` prefix are boolean, integer, or float features that describe the residue itself.  These can be used for training a model.  Domain knowledge of these values should not be necessary to participate in the challenge, but we've provided brief descriptions below for anyone who may be interested:

* `feat_[letter]` are one-hot encoded boolean values for each of the 20 possible amino acids.
* `feat_PHI`, `feat_PSI`, `feat_TAU`, `feat_THETA` describe various protein chain bonding angles, computed with [Biopython](https://biopython.org/docs/1.75/api/Bio.PDB.Polypeptide.html).
* `feat_BBSASA`, `feat_SCSASA` describe the solvent accessible surface area, calculated using [FreeSASA](https://freesasa.github.io/).
* `feat_pLDDT` is an AlphaFold2 residue-level prediction confidence value.
* `feat_DSSP_[letter]` are secondary structure assignments by [DSSP].(https://en.wikipedia.org/wiki/DSSP_(algorithm))
* `feat_DSSP_[number]` are other backbone structural features describing backbone hydrogen. bonding networks, also assigned by [DSSP](https://en.wikipedia.org/wiki/DSSP_(algorithm)).

Column `y_Ligand` indicates if the residue (row) belongs to a known binding site or not.  This column is the classification objective for our challenge. 

The remaining columns describe other elements of the protein structure for reference or troubleshooting purposes.  Participants may use this information to to engineer new features/representations in their models if they so choose. These include:
* `annotation_sequence` and `annotation_atomrec`: Residue amino acid in character format.
* `entry`: Protein name, can be looked up on Uniprot for more information about the protein.  Each unique entry is one unique protein structure in this dataset.
* `coord_X`, `coord_Y`, `coord_Z`: XYZ coordinates of the residue in the respective protein structure.  For example, all residues for protein 'QCR1_HUMAN' belong to the same coordinate space, but the coordinate space would shared between two residues (rows) with `entry` values of 'QCR1_HUMAN' and 'PPM1A_HUMAN'.
* `entry_index`: The order of the amino acid within the protein sequence.  As with coordinates, these relationships are only meaningful for rows (residues) that share the same `entry` value.  For example, within QCR1_HUMAN two residues (rows) with `entry_index` 5 and 6 are adjacent (connected) neighbors.

The test dataset has the same format, but is otherwise missing the `y_Ligand` column. 

### Fitting a model to the dataset

In [3]:
X_train = df_train.drop(["y_Ligand"], axis=1)
without_categorical_columns = [col for col in X_train.columns if X_train[col].dtype != "O"]

In [4]:
from sklearn.preprocessing import StandardScaler
dataNormalizer = StandardScaler()
dataNormalizer.fit(X_train[without_categorical_columns])

Norm = dataNormalizer.transform(X_train[without_categorical_columns])

In [5]:
new_df = pd.DataFrame(Norm, columns=df_train[without_categorical_columns].columns)
y_train = df_train["y_Ligand"].to_numpy()
new_df["y_Ligand"] = y_train
new_df

Unnamed: 0,feat_A,feat_C,feat_D,feat_E,feat_F,feat_G,feat_H,feat_I,feat_K,feat_L,...,feat_DSSP_9,feat_DSSP_10,feat_DSSP_11,feat_DSSP_12,feat_DSSP_13,coord_X,coord_Y,coord_Z,entry_index,y_Ligand
0,-0.270969,-0.145918,-0.232492,-0.281446,-0.195631,-0.263770,-0.159571,-0.225497,-0.257845,-0.326453,...,0.791377,0.000918,1.134726,0.765633,0.865698,-0.875853,-0.324502,-1.279786,-0.804361,False
1,-0.270969,-0.145918,-0.232492,-0.281446,-0.195631,3.791187,-0.159571,-0.225497,-0.257845,-0.326453,...,0.885535,0.000918,1.134726,-0.002097,0.865698,-0.823817,-0.159783,-1.239016,-0.802859,False
2,-0.270969,-0.145918,-0.232492,-0.281446,-0.195631,-0.263770,-0.159571,-0.225497,-0.257845,-0.326453,...,0.979693,0.018187,1.134726,-0.018432,0.865698,-0.698402,-0.174269,-1.166637,-0.801357,False
3,-0.270969,-0.145918,-0.232492,3.553076,-0.195631,-0.263770,-0.159571,-0.225497,-0.257845,-0.326453,...,0.979693,12.193081,0.496366,11.513857,0.865698,-0.704145,-0.063084,-1.053827,-0.799855,False
4,-0.270969,-0.145918,-0.232492,-0.281446,-0.195631,-0.263770,-0.159571,-0.225497,-0.257845,-0.326453,...,0.791377,0.000918,1.134726,11.513857,0.006975,-0.615245,0.034584,-0.967032,-0.798353,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497161,-0.270969,-0.145918,-0.232492,-0.281446,-0.195631,-0.263770,-0.159571,-0.225497,-0.257845,-0.326453,...,0.508904,-0.050890,0.496366,0.030572,-0.851747,-0.613654,0.912728,-0.421186,0.329716,False
497162,-0.270969,-0.145918,-0.232492,-0.281446,-0.195631,-0.263770,6.266790,-0.225497,-0.257845,-0.326453,...,0.979693,-6.181511,0.496366,-5.392544,0.436336,-0.480052,0.833179,-0.433237,0.331218,False
497163,-0.270969,-0.145918,-0.232492,-0.281446,-0.195631,3.791187,-0.159571,-0.225497,-0.257845,-0.326453,...,0.697220,-6.216050,-0.141993,-0.018432,0.436336,-0.367714,0.850039,-0.338671,0.332720,False
497164,-0.270969,-0.145918,-0.232492,-0.281446,-0.195631,3.791187,-0.159571,-0.225497,-0.257845,-0.326453,...,1.073851,0.000918,1.134726,-0.002097,0.865698,-0.273886,0.889668,-0.447352,0.334222,False


In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(new_df.drop(["y_Ligand"], axis=1), new_df["y_Ligand"], test_size=0.2, random_state=42)

In [7]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler

np.random.seed(0)
undersample = RandomUnderSampler(sampling_strategy='majority')
X_train, y_train = undersample.fit_resample(X_train, y_train)

In [8]:
# # 10-fold Cross validation

# from sklearn.model_selection import cross_val_score
# from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import StandardScaler
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import StratifiedKFold
# #
# # Create an instance of Pipeline
# #
# pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0, n_estimators=400, min_samples_split=2, 
#                                                                   min_samples_leaf=1, max_features='sqrt', max_depth=None, bootstrap=False))
# #
# # Create an instance of StratifiedKFold which can be used to get indices of different training and test folds
# #
# strtfdKFold = StratifiedKFold(n_splits=10)
# kfold = strtfdKFold.split(X_train, y_train)
# scores = []
# #
# #
# #
# for k, (train, test) in enumerate(kfold):
#     pipeline.fit(X_train.iloc[train, :], y_train.iloc[train])
#     score = pipeline.score(X_train.iloc[test, :], y_train.iloc[test])
#     scores.append(score)
#     print('Fold: %2d, Training/Test Split Distribution: %s, Accuracy: %.3f' % (k+1, np.bincount(y_train.iloc[train]), score))
 
# print('\n\nCross-Validation accuracy: %.3f +/- %.3f' %(np.mean(scores), np.std(scores)))

# # -----------------------------------------------------------------------------------------------------------------------------

# # OUT:
# # Fold:  1, Training/Test Split Distribution: [12472 12472], Accuracy: 0.779
# # Fold:  2, Training/Test Split Distribution: [12472 12472], Accuracy: 0.804
# # Fold:  3, Training/Test Split Distribution: [12472 12472], Accuracy: 0.801
# # Fold:  4, Training/Test Split Distribution: [12472 12472], Accuracy: 0.791
# # Fold:  5, Training/Test Split Distribution: [12472 12472], Accuracy: 0.777
# # Fold:  6, Training/Test Split Distribution: [12472 12472], Accuracy: 0.780
# # Fold:  7, Training/Test Split Distribution: [12472 12473], Accuracy: 0.785
# # Fold:  8, Training/Test Split Distribution: [12472 12473], Accuracy: 0.794
# # Fold:  9, Training/Test Split Distribution: [12473 12472], Accuracy: 0.794
# # Fold: 10, Training/Test Split Distribution: [12473 12472], Accuracy: 0.789
# # Cross-Validation accuracy: 0.789 +/- 0.009

In [9]:
model = RandomForestClassifier(random_state=0, n_estimators=400, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', 
                              max_depth=None, bootstrap=False)
model.fit(X_train, y_train)

In [10]:
# import shap
# from matplotlib import pyplot as plt

# shap_values = shap.TreeExplainer(model).shap_values(X_train)
# shap.summary_plot(shap_values, X_train, plot_type="bar",max_display=10,show=False)
# plt.title("Variable Importance Plot")
# plt.show()

# due to time contraints, the following plot is based on
# model = RandomForestClassifier(random_state=0, n_estimators=10, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', 
#                                max_depth=6, bootstrap=False)

In [11]:
y_test_pred = model.predict(X_test)
y_test_pred

array([0., 0., 0., ..., 0., 0., 1.])

In [12]:
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_pred, pos_label=1)
auc_roc = metrics.auc(fpr, tpr)

precision, recall, _ = metrics.precision_recall_curve(y_test, y_test_pred)
auc_pr = metrics.auc(recall, precision)

print(f"ROC-AUC: {auc_roc} \n PR-AUC {auc_pr}")

# ROC-AUC: 0.7844943915873897 
# PR-AUC 0.45470804279065874

ROC-AUC: 0.7844943915873897 
 PR-AUC 0.45470804279065874


## Submission Instructions

- Run inference on the test set and save the inference results as a csv file, the file should look like this
```
id,Predicted
0,True
1,False
2,True
3,False
....
```
- Submit the csv on Kaggle
- Automatic evaluation will be done with ROC-AUC
- Top submissions will be further evaluated by the mean of ROC-AUC and PR-AUC

In [13]:
df_test = pd.read_csv("af2_dataset_testset_unlabeled.csv.gz", index_col=0)
df_test

Unnamed: 0,annotation_sequence,feat_A,feat_C,feat_D,feat_E,feat_F,feat_G,feat_H,feat_I,feat_K,...,feat_DSSP_9,feat_DSSP_10,feat_DSSP_11,feat_DSSP_12,feat_DSSP_13,coord_X,coord_Y,coord_Z,entry,entry_index
0,M,False,False,False,False,False,False,False,False,False,...,0.0,0,0.0,0,0.0,33.116001,37.023998,38.417000,QCR1_HUMAN,0
1,A,True,False,False,False,False,False,False,False,False,...,-0.0,2,-0.0,0,0.0,35.849998,34.841000,40.185001,QCR1_HUMAN,1
2,A,True,False,False,False,False,False,False,False,False,...,-0.1,0,0.0,2,-0.0,37.087002,31.719999,40.547001,QCR1_HUMAN,2
3,S,False,False,False,False,False,False,False,False,False,...,-0.1,0,0.0,-2,-0.0,38.095001,28.951000,42.321999,QCR1_HUMAN,3
4,V,False,False,False,False,False,False,False,False,False,...,0.0,0,0.0,0,0.0,41.435001,27.417000,43.703999,QCR1_HUMAN,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
474,L,False,False,False,False,False,False,False,False,False,...,-0.5,-3,-0.3,-3,-0.0,47.813999,7.569000,-27.368999,PDE7A_HUMAN,474
475,P,False,False,False,False,False,False,False,False,False,...,-0.1,0,0.0,-3,-0.0,50.228001,8.068000,-30.333000,PDE7A_HUMAN,475
476,Q,False,False,False,False,False,False,False,False,False,...,-0.0,0,0.0,0,0.0,51.507999,4.896000,-31.959999,PDE7A_HUMAN,476
477,E,False,False,False,True,False,False,False,False,False,...,0.0,0,0.0,0,0.0,54.845001,6.372000,-33.125000,PDE7A_HUMAN,477


In [14]:
without_categorical_columns = [col for col in X_test.columns if X_test[col].dtype != "O"]

from sklearn.preprocessing import StandardScaler
dataNormalizer = StandardScaler()
dataNormalizer.fit(X_test[without_categorical_columns])

Norm = dataNormalizer.transform(X_test[without_categorical_columns])

new_df_test = pd.DataFrame(Norm, columns=df_test[without_categorical_columns].columns)
new_df_test

Unnamed: 0,feat_A,feat_C,feat_D,feat_E,feat_F,feat_G,feat_H,feat_I,feat_K,feat_L,...,feat_DSSP_8,feat_DSSP_9,feat_DSSP_10,feat_DSSP_11,feat_DSSP_12,feat_DSSP_13,coord_X,coord_Y,coord_Z,entry_index
0,-0.270980,-0.145197,-0.231649,-0.281274,-0.194068,-0.263201,-0.159106,-0.226494,-0.257697,3.071269,...,-6.496778,0.978908,-5.022392,-0.776392,-4.765542,0.435672,-0.772330,0.863753,0.985557,-0.104000
1,-0.270980,-0.145197,-0.231649,-0.281274,-0.194068,-0.263201,-0.159106,-0.226494,-0.257697,3.071269,...,0.038177,-1.463543,0.038635,-0.141144,-0.032446,0.004540,-0.449937,0.162350,0.441870,-0.569993
2,-0.270980,-0.145197,-0.231649,-0.281274,-0.194068,-0.263201,-0.159106,-0.226494,-0.257697,3.071269,...,0.038177,-1.745364,0.038635,-0.141144,-0.032446,0.004540,-0.017541,-0.349352,-0.867082,-0.284144
3,-0.270980,6.887199,-0.231649,-0.281274,-0.194068,-0.263201,-0.159106,-0.226494,-0.257697,-0.325598,...,0.021203,-1.087781,0.056027,-0.141144,-0.032446,0.004540,-1.100007,0.892065,0.029296,-0.022116
4,-0.270980,-0.145197,-0.231649,-0.281274,-0.194068,-0.263201,-0.159106,-0.226494,-0.257697,-0.325598,...,-0.029719,1.072848,-0.030932,0.494104,0.000537,0.866805,1.149050,-0.539392,-3.096304,-0.302010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99429,-0.270980,-0.145197,-0.231649,-0.281274,-0.194068,-0.263201,-0.159106,4.415134,-0.257697,-0.325598,...,-0.097614,0.884968,0.021243,-0.776392,-0.032446,0.435672,-0.851636,-0.170123,0.619943,0.428990
99430,-0.270980,-0.145197,-0.231649,-0.281274,-0.194068,-0.263201,-0.159106,-0.226494,-0.257697,-0.325598,...,0.004229,0.791027,-0.048324,0.494104,-0.048938,0.435672,-0.363015,0.561483,-0.622738,-0.544684
99431,-0.270980,-0.145197,-0.231649,-0.281274,-0.194068,3.799377,-0.159106,-0.226494,-0.257697,-0.325598,...,0.004229,0.791027,0.021243,-0.776392,-0.048938,0.004540,0.395427,0.544272,0.147109,-0.494065
99432,-0.270980,-0.145197,-0.231649,-0.281274,-0.194068,-0.263201,-0.159106,-0.226494,-0.257697,-0.325598,...,-0.029719,1.072848,0.003851,1.129352,0.000537,0.866805,0.156770,-0.095125,0.041242,2.483530


In [15]:
y_test_submission = model.predict(new_df_test)

In [16]:
s = pd.Series(y_test_submission).astype(bool)
s.name = "Predicted"
s.to_csv("submission.csv")