# General-Domain MSPM Training

In this notebook, we train a universal molecular structure prediction model (MSPM) on **one million** compounds curated from [ChEMBL](https://www.ebi.ac.uk/chembl/).

We use [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) as the molecular representation. SMILES is a textual representation of molecules.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

from fastai import *
from fastai.text import *
from utils import *

torch.cuda.set_device(1)  # change to 0 if you only have one GPU



## Data

Read the datasets. The training and validation sets were split randomly with a ratio of 0.98:0.02. The training set is large, so it is stored as a zip file.

In [2]:
train = pd.read_csv('../data/MSPM/ChemBL-LM_train.zip', compression='zip')
valid = pd.read_csv('../data/MSPM/ChemBL-LM_val.csv')
train.shape, valid.shape

((980000, 2), (20000, 2))
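
The split itself is not shown in this notebook. As a rough illustration only, a random 0.98:0.02 split could be sketched with the standard library as below. The function name `random_split`, the seed, and the toy identifiers are hypothetical and are not taken from the original pipeline.

```python
import random

def random_split(items, valid_frac=0.02, seed=0):
    """Shuffle a copy of `items` and split it into (train, valid) lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(items)         # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_frac)
    return shuffled[n_valid:], shuffled[:n_valid]

# Toy stand-in for the one-million-compound table
toy_items = [f"mol_{i}" for i in range(1000)]
train_toy, valid_toy = random_split(toy_items)
print(len(train_toy), len(valid_toy))
```

With one million rows, the same fraction gives the 980,000 / 20,000 shapes reported above.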

Each molecular structure can be represented by multiple SMILES strings. [Josep Arús-Pous et al.](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0393-0) found that randomized SMILES strings improve the quality of molecular generative models, so we randomized the SMILES to augment the dataset. The `Randomized SMILES` are generated by randomizing the atom IDs of a molecular graph (see `randomize_smiles()` in utils.py).

We generated **4** Randomized SMILES for each molecule. It will take a while to run the code.

In [None]:
%%time
train_aug = smiles_augmentation(train,4)
valid_aug = smiles_augmentation(valid,4)

## Prepare data for modeling

Create a path to save the results.

In [6]:
result_path = Path('../results')
name = 'MSPM'
path = result_path/name
path.mkdir(exist_ok=True, parents=True)

mdl_path = path/'models'
mdl_path.mkdir(exist_ok=True)

Next, we tokenize the SMILES and map each token to a unique token ID. (For details, see `MolTokenizer()` in utils.py and [`Tokenizer()`](https://docs.fast.ai/text.transform.html#Tokenizer) in the fastai library.)

In [6]:
tok = Tokenizer(partial(MolTokenizer, special_tokens = special_tokens), n_cpus=6, pre_rules=[], post_rules=[])
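
`MolTokenizer` lives in utils.py and is not reproduced here. The sketch below only illustrates the general idea of SMILES tokenization and token-to-ID mapping. The regular expression, the helper names `tokenize_smiles` and `build_vocab`, and the ID assignment order are all assumptions and may differ from what `MolTokenizer` and fastai actually do.

```python
import re

# Assumed token pattern: bracket atoms such as [C@H] or [nH] first,
# then two-letter halogens, then any remaining single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize_smiles(smiles, bos="[BOS]"):
    """Split a SMILES string into tokens and prepend a start-of-sequence marker."""
    return [bos] + SMILES_TOKEN.findall(smiles)

def build_vocab(token_lists):
    """Assign consecutive integer IDs to tokens in order of first appearance."""
    itos, stoi = [], {}
    for tokens in token_lists:
        for t in tokens:
            if t not in stoi:
                stoi[t] = len(itos)
                itos.append(t)
    return itos, stoi

tokens = tokenize_smiles("c1cc(Cl)sc1[C@H](N)Br")
itos, stoi = build_vocab([tokens])
ids = [stoi[t] for t in tokens]
print(tokens)
print(ids)
```

Matching bracket atoms and `Cl`/`Br` before single characters keeps multi-character symbols intact, which is consistent with tokens such as `[C@H]`, `[nH]`, and `Cl` in the batch shown later.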

Create a [DataBunch](https://docs.fast.ai/text.data.html#TextLMDataBunch) for training the model. This takes a relatively long time, but we only need to run it once and can load the saved files later.

In [None]:
%%time
bs = 128 # batch size

data = TextLMDataBunch.from_df(path, train_aug, valid_aug, bs=bs, tokenizer=tok, 
                              chunksize=50000, text_cols=0, max_vocab=60000, include_bos=False)

One batch of the data. [BOS] is prepended to each SMILES string to mark its start.

In [10]:
data.show_batch()

idx,text
0,) F ) n n c 1 N 1 C [C@H] ( c 2 s c ( Cl ) c c 2 ) N ( C ( C 2 C C S ( = O ) ( = O ) C C 2 ) = O ) C C 1 [BOS] c 1 2 [nH] c n c 1 c c ( C ( N 1 [C@H] 3 C
1,C ( N c 1 n [nH] c 2 n c ( - c 3 c c c o 3 ) c ( Br ) c c 1 2 ) C 1 C C N ( C c 2 c c c c c 2 ) C 1 [BOS] C 1 C 2 ( O O C 3 ( O O 2 ) C C C C C C C
2,O ) n ( C ) c 3 N C 3 = C 2 C ( = O ) c 2 c 3 c c c c 2 ) c c ( C ( F ) ( F ) F ) c 1 ) ( F ) ( F ) F [BOS] N ( C ( = O ) c 1 n ( - c 2 c ( Cl )
3,C ) c c 3 C ) C C 2 ) = O ) c c 1 [BOS] C c 1 c ( Cl ) c c c c 1 N C ( = O ) C C C ( = O ) N / N = C / c 1 c c c [nH] 1 [BOS] C C C 1 C c 2 c ( C ) n
4,( n 2 c c ( C ( N ) = O ) c ( N c 3 c c n c ( F ) c 3 ) n 2 ) C C N ( C ( = O ) O C C ( F ) F ) C [C@@H] 1 F [BOS] s 1 c ( N C ( = O ) C S C c 2
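
Rows in the batch above begin and end mid-molecule, presumably because a language-model data loader concatenates all tokenized texts into one long stream and then cuts it into fixed-length windows. The sketch below imitates that idea on toy data. `make_stream` and `windows` are hypothetical helpers, not fastai functions, and the real loader also handles shuffling and batching.

```python
def make_stream(token_lists, bos="[BOS]"):
    """Concatenate tokenized SMILES into one stream, each preceded by `bos`."""
    stream = []
    for tokens in token_lists:
        stream.append(bos)
        stream.extend(tokens)
    return stream

def windows(stream, seq_len):
    """Cut the stream into consecutive windows of at most `seq_len` tokens."""
    return [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]

# Three toy molecules, already tokenized
toy = [["C", "C", "O"], ["c", "1", "c", "c", "c", "c", "c", "1"], ["N"]]
stream = make_stream(toy)
rows = windows(stream, seq_len=5)
for row in rows:
    print(" ".join(row))
```

Because window boundaries ignore molecule boundaries, [BOS] is the only signal the model gets that a new SMILES string has started.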


Save the databunch.

In [None]:
data.save(f'{name}_databunch')
len(data.vocab.itos),len(data.train_ds)

## Train the Model

Load the databunch generated in the previous section.

In [7]:
bs = 128 # batch size
data_lm = load_data(path, f'{name}_databunch', bs=bs)

Define the [model](https://docs.fast.ai/text.learner.html).

In [8]:
learner = language_model_learner(data_lm, AWD_LSTM, drop_mult=1, pretrained=False)

Model Architecture.

In [10]:
learner.model

SequentialRNN(
  (0): AWD_LSTM(
    (encoder): Embedding(64, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(64, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1152, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1152, 1152, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1152, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=400, out_features=64, bias=True)
    (output_dp): RNNDropout()
  )
)
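
To get a feel for the size of this architecture, the nominal parameter counts can be worked out from the layer shapes printed above. This is back-of-the-envelope arithmetic under the standard PyTorch LSTM layout, not a measurement of the actual model: the helper `lstm_params` is hypothetical, and a count taken from the live `learner.model` may differ, for example if the decoder shares its weights with the embedding or if `WeightDropout` keeps extra copies of some weights.

```python
def lstm_params(n_in, n_hidden):
    """Nominal parameter count of a single-layer, unidirectional PyTorch LSTM."""
    weights = 4 * n_hidden * (n_in + n_hidden)  # input-to-hidden and hidden-to-hidden, 4 gates
    biases = 2 * 4 * n_hidden                   # two bias vectors (ih and hh), 4 gates each
    return weights + biases

# Sizes read off the printed architecture
vocab_size, emb_size, n_hid = 64, 400, 1152

embedding = vocab_size * emb_size
lstm_layers = [
    lstm_params(emb_size, n_hid),  # LSTM(400, 1152)
    lstm_params(n_hid, n_hid),     # LSTM(1152, 1152)
    lstm_params(n_hid, emb_size),  # LSTM(1152, 400)
]
decoder = emb_size * vocab_size + vocab_size  # Linear(400, 64) weights plus bias

print("embedding:", embedding)
print("LSTM layers:", lstm_layers, "sum:", sum(lstm_layers))
print("decoder:", decoder)
```

Almost all of the roughly 20 million nominal parameters sit in the three LSTM layers; with a vocabulary of only 64 tokens, the embedding and decoder are comparatively tiny.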

Finally, we are ready to train the model. I trained the model on a single **Quadro P4000** GPU.

In [10]:
lr = 3e-3
lr *= bs/48  # Scale learning rate by batch size

learner.unfreeze()
learner.fit_one_cycle(10, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.785495,0.745334,0.736138,2:23:47
1,0.798158,0.755145,0.732615,2:23:42
2,0.78008,0.738548,0.737841,2:22:40
3,0.76019,0.722149,0.743182,2:22:45
4,0.747407,0.70636,0.748099,2:27:09
5,0.727602,0.689494,0.753515,2:28:04
6,0.712938,0.673928,0.758721,2:28:44
7,0.689546,0.658265,0.763684,2:29:31
8,0.679796,0.647731,0.767208,2:25:53
9,0.675316,0.644556,0.768328,2:25:04
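
The learning-rate adjustment in the cell above is a linear scaling with batch size. The small sketch below just makes the arithmetic explicit. The helper name `scale_lr` is hypothetical; the base rate 3e-3 and the divisor 48 are taken from the cell, and the notebook does not say why 48 was chosen as the reference batch size.

```python
def scale_lr(base_lr, batch_size, ref_batch_size=48):
    """Scale a base learning rate linearly with the batch size."""
    return base_lr * (batch_size / ref_batch_size)

# Same base rate as in the training cell, for a few batch sizes
for batch_size in (48, 64, 128):
    print(batch_size, scale_lr(3e-3, batch_size))
```

For `bs = 128` this gives a peak learning rate of about 0.008, which is the value passed to `fit_one_cycle`.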


Save both the weights and vocabulary.

In [None]:
lm_fns = [f'{name}_wt', f'{name}_vocab']

learner.save(lm_fns[0], with_opt=False)
learner.data.vocab.save(mdl_path/(lm_fns[1] + '.pkl'))