# SMILES Embeddings with BERT

## Setup

Make sure the parent directory is on the Python path so that the project's packages can be imported.

In [1]:
import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

Load the dataset from which the SMILES embeddings will be extracted.

In [2]:
from loaders.Loaders import CSVLoader

dataset = CSVLoader(dataset_path='../data/HIV_processed.csv', 
                    mols_field='processed_smiles', 
                    labels_fields='HIV_active')

dataset = dataset.create_dataset()
dataset.get_shape()

Mols_shape:  40358
Features_shape:  X not defined!
Labels_shape:  (40358,)
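
For orientation, the loader call above names one column holding the molecules (`processed_smiles`) and one holding the labels (`HIV_active`). The following is a minimal standard-library sketch of that column selection, not the `CSVLoader` implementation; the two sample rows and the `read_columns` helper are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample; only the column names come from the cell above.
SAMPLE_CSV = """processed_smiles,HIV_active
CCO,0
c1ccccc1,1
"""


def read_columns(handle, mols_field, labels_field):
    """Return (molecules, labels) read from an open CSV file handle."""
    mols, labels = [], []
    for row in csv.DictReader(handle):
        mols.append(row[mols_field])           # SMILES string, kept as text
        labels.append(int(row[labels_field]))  # binary activity label
    return mols, labels


mols, labels = read_columns(io.StringIO(SAMPLE_CSV), "processed_smiles", "HIV_active")
print(len(mols), len(labels))
```

Both lists have one entry per row, which mirrors the matching molecule and label counts reported by `get_shape()`.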


## SMILES Embeddings Extraction

Install the required package.

In [4]:
! pip install transformers

Initialize the BERT transformer.

In [3]:
from compoundFeaturization.Transformers.bert import Transformer

sequence_length = 125  # maximum number of tokens per sequence
bert = Transformer(sequence_length)

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
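
The `sequence_length` argument caps the number of tokens per SMILES string. How `Transformer` enforces that cap is not shown in this notebook; a common approach, sketched here as an assumption, is to truncate longer token sequences and right-pad shorter ones to the fixed length. The `pad_or_truncate` helper, the example token ids, and the pad id of 0 are illustrative choices.

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    """Return (ids, attention_mask), both exactly sequence_length long.

    Assumed behavior for illustration only: sequences that are too long
    are cut at the end, and short ones are padded on the right.
    """
    clipped = list(token_ids[:sequence_length])
    n_padding = sequence_length - len(clipped)
    ids = clipped + [pad_id] * n_padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(clipped) + [0] * n_padding
    return ids, attention_mask


ids, mask = pad_or_truncate([101, 7, 8, 102], sequence_length=6)
print(ids)
print(mask)
```

Fixing the length this way lets every SMILES string map to an input of the same shape, at the cost of discarding tokens beyond the limit.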


Extract the embeddings for every SMILES string in the dataset.

In [4]:
dataset = bert.featurize(dataset)

100%|██████████| 40358/40358 [4:36:51<00:00,  2.43it/s]
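
The progress bar reports roughly 2.43 molecules per second, which puts the full run at over four and a half hours. Before committing to a run of that length, it can help to estimate the duration from a throughput measured on a small subset. A minimal sketch; the `estimate_runtime` helper is hypothetical and not part of the project:

```python
from datetime import timedelta


def estimate_runtime(n_items, items_per_second):
    """Estimate total processing time from a measured throughput."""
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    return timedelta(seconds=round(n_items / items_per_second))


# Dataset size and throughput taken from the output above.
estimate = estimate_runtime(40358, 2.43)
print(estimate)
```

The estimate differs from the logged time by a few seconds because the displayed rate is rounded.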


Save the dataset with embeddings to a CSV file.

In [5]:
dataset.save_to_csv('../data/HIV_featurized.csv')