### Import libraries

In [1]:
import sys
sys.path.append("../models")
sys.path.append("../")

In [2]:
import models.fm4m as fm4m
import pandas as pd
import numpy as np
from sklearn.svm import SVR
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

### Load data

In [3]:
train_df = pd.read_csv("../data/bace/train.csv")
test_df = pd.read_csv("../data/bace/test.csv")

In [4]:
train_df.head()

Unnamed: 0,smiles,CID,Class,Unnamed: 3,pIC50,MW,AlogP,HBA,HBD,RB,...,PEOE6 (PEOE6),PEOE7 (PEOE7),PEOE8 (PEOE8),PEOE9 (PEOE9),PEOE10 (PEOE10),PEOE11 (PEOE11),PEOE12 (PEOE12),PEOE13 (PEOE13),PEOE14 (PEOE14),canvasUID
0,O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2c...,BACE_1,1,,9.154901,431.56979,4.4014,3,2,5,...,53.205711,78.640335,226.85541,107.43491,37.133846,0.0,7.98017,0.0,0.0,1
1,S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...,BACE_3,1,,8.69897,591.74091,2.5499,4,3,11,...,70.365707,47.941147,192.40652,255.75255,23.654478,0.230159,15.87979,0.0,24.663788,3
2,S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...,BACE_5,1,,8.69897,629.71283,3.5086,3,3,11,...,78.945702,39.361153,179.71288,220.4613,23.654478,0.230159,15.87979,0.0,26.100143,5
3,S(=O)(=O)(CCCCC)C[C@@H](NC(=O)c1cccnc1)C(=O)N[...,BACE_7,1,,8.69897,645.78009,3.1973,5,4,18,...,63.830162,52.390511,263.78134,190.54213,45.370659,0.0,23.859961,0.0,24.663788,7
4,O1c2c(cc(cc2)CC)[C@@H]([NH2+]C[C@@H](O)[C@H]2N...,BACE_9,1,,8.60206,556.71503,4.701,4,3,5,...,53.205711,68.418541,299.00003,140.68362,28.755558,0.0,15.87979,6.904104,24.663788,9


In [5]:
xtrain = list(train_df["smiles"].values)
ytrain = list(train_df["Class"].values)

xtest = list(test_df["smiles"].values) 
ytest = list(test_df["Class"].values)
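
Because the `Class` column is used as the classification target, it can be worth checking the label distribution before fitting a downstream model. The following is a minimal sketch, not part of `fm4m`: the helper name `class_balance` is hypothetical, and the toy labels stand in for `ytrain` rather than reproducing the actual BACE data.

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of samples per label, ordered by label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}

# Toy labels for illustration only; in the notebook this would be `ytrain`.
toy_labels = [1, 1, 0, 1, 0]
balance = class_balance(toy_labels)
print(balance)
```

A strongly skewed distribution would be one reason to prefer ROC-AUC, which the runs below report, over plain accuracy.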

### List of available models

#### Base Models

In [6]:
fm4m.avail_models()

Unnamed: 0,Model Name,Description
0,SMI-TED,SMILES based encoder decoder model
1,SELFIES-TED,BART model for string based SELFIES modality
2,MolFormer,MolFormer model for string based SMILES modality
3,MHG-GED,Molecular hypergraph model


#### Downstream Models

In [7]:
fm4m.avail_downstream_models()

Unnamed: 0,Name,Task Type
0,XGBClassifier,Classfication
1,DefaultClassifier,Classfication
2,SVR,Regression
3,Kernel Ridge,Regression
4,Linear Regression,Regression
5,DefaultRegressor,Regression


### Example of single-modal model usage

#### Model Type : SELFIES-TED

In [8]:
result = fm4m.single_modal(model="SELFIES-TED", x_train=xtrain, y_train=ytrain, x_test=xtest, y_test=ytest, downstream_model="DefaultClassifier")

SELFIES-TED


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/1209 [00:00<?, ? examples/s]

Map:   0%|          | 0/152 [00:00<?, ? examples/s]

 Calculating ROC AUC Score ...
ROC-AUC Score: 0.8520
Generating latent plots
Generating latent plots : Done


In [9]:
result[0]

'ROC-AUC Score: 0.8520'
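
As the output above shows, `result[0]` is a formatted string rather than a number. To compare runs numerically, the value can be parsed out. This sketch assumes the string always has the form `'<label>: <number>'` seen here; `parse_score` is a hypothetical helper, not an `fm4m` function.

```python
def parse_score(summary):
    """Extract the number after the last colon, e.g. 'ROC-AUC Score: 0.8520' -> 0.852."""
    _, sep, value = summary.rpartition(":")
    if not sep:
        raise ValueError(f"expected '<label>: <number>', got {summary!r}")
    return float(value.strip())

# String copied from the cell output above.
score = parse_score("ROC-AUC Score: 0.8520")
print(score)
```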

#### Model Type : MHG-GED

In [10]:
result = fm4m.single_modal(model="MHG-GED", x_train=xtrain, y_train=ytrain, x_test=xtest, y_test=ytest, downstream_model="DefaultClassifier")

MHG-GED
 Calculating ROC AUC Score ...
ROC-AUC Score: 0.8690
Generating latent plots
Generating latent plots : Done


In [11]:
result[0]

'ROC-AUC Score: 0.8690'

#### Model Type : SMI-TED

In [12]:
result = fm4m.single_modal(model="SMI-TED", x_train=xtrain, y_train=ytrain, x_test=xtest, y_test=ytest, downstream_model="DefaultClassifier")

SMI-TED
Random Seed: 12345
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Vocab size: 2393
[INFERENCE MODE - smi-ted-Light]


100%|██████████| 12/12 [00:16<00:00,  1.37s/it]
100%|██████████| 1/1 [00:02<00:00,  2.22s/it]


 Calculating ROC AUC Score ...
ROC-AUC Score: 0.8267
Generating latent plots
Generating latent plots : Done


In [13]:
result[0]

'ROC-AUC Score: 0.8267'

### Example of multi-modal model usage

In [14]:
result = fm4m.multi_modal(model_list=["SELFIES-TED","MHG-GED","SMI-TED"], x_train=xtrain, y_train=ytrain, x_test=xtest, y_test=ytest, downstream_model="DefaultClassifier")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/1209 [00:00<?, ? examples/s]

Map:   0%|          | 0/152 [00:00<?, ? examples/s]

Random Seed: 12345
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Vocab size: 2393
[INFERENCE MODE - smi-ted-Light]


100%|██████████| 12/12 [00:16<00:00,  1.38s/it]
100%|██████████| 1/1 [00:02<00:00,  2.25s/it]


Representations loaded successfully
Generating latent plots
Generating latent plots : Done
 Calculating ROC AUC Score ...
ROC-AUC Score: 0.8760


In [15]:
result[0]

'ROC-AUC Score: 0.8760'
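
To put the four runs side by side, the reported scores can be collected and ranked. The values below are copied by hand from the cell outputs in this notebook; the dictionary keys and the `gain` calculation are illustrative choices, and the comparison reflects a single train/test split only.

```python
# ROC-AUC values copied from the outputs of In [8], In [10], In [12] and In [14].
scores = {
    "SELFIES-TED": 0.8520,
    "MHG-GED": 0.8690,
    "SMI-TED": 0.8267,
    "multi-modal": 0.8760,
}

# Rank all runs from highest to lowest ROC-AUC.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_name, best_score = ranked[0]

# Difference between the top run and the best single-modal run.
best_single = max(s for name, s in scores.items() if name != "multi-modal")
gain = best_score - best_single

for name, score in ranked:
    print(f"{name:<12} {score:.4f}")
print(f"gain over best single-modal run: {gain:.4f}")
```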