This is the main script. It reads the tokenized SMILES data, trains a self-supervised masked language model (MLM), and then trains a regressor with a fully connected linear layer to predict the 5 target polymer properties.

### 1. Use the Tokenization notebook to create tokens for the train and test data using TransPolymer

### 2. Train the self-supervised model with masking

In [0]:
%sh
TP_DIR="" # main project directory; set this before running, otherwise the paths below resolve from /
VOCAB="$TP_DIR/tokenizer/vocab.json"
MERGES="$TP_DIR/tokenizer/merges.txt"
TOKENS_TRAIN="/data/train.tokenized.pt" # not added to the repo
OUT_SSL="/Transformer/saved_models/ssl_mlm_ckpt"

export PYTHONPATH="$TP_DIR:$PYTHONPATH"
export TOKENS_TRAIN OUT_SSL

python self_supervised_mlm.py \
  --tp_dir "$TP_DIR" \
  --vocab "$VOCAB" \
  --merges "$MERGES" \
  --tokens_pt "$TOKENS_TRAIN" \
  --out_dir "$OUT_SSL" \
  --max_len 128 --hidden 256 --layers 6 --heads 8 --intermediate 1024 \
  --batch_size 32 --epochs 5 --lr 5e-4
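
For readers unfamiliar with masked language modelling, the masking step can be sketched as below. This is a minimal illustration, not the code in `self_supervised_mlm.py`: the 15% masking rate, the 80/10/10 split between mask token, random token, and unchanged token, and the `-100` ignore label are the common BERT-style defaults and are assumptions here. The function name `mask_tokens` and all of its parameters are hypothetical.

```python
import random

# Assumed convention: positions labelled -100 do not contribute to the loss.
IGNORE_INDEX = -100


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mask_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence under BERT-style masking.

    All defaults are assumptions, not values read from self_supervised_mlm.py.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens (start, end, padding); skip most other positions.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model has to recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the input token unchanged
    return inputs, labels


# Toy sequence: ids 0 and 2 stand in for start/end tokens, 4 for the mask token.
ids = [0, 11, 12, 13, 14, 15, 2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=50,
                             special_ids={0, 2}, mask_prob=0.5,
                             rng=random.Random(0))
```

Only positions whose label is not `-100` are scored, so the encoder learns to reconstruct SMILES tokens from their context without needing any property labels.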

### 3. Train the regressor to predict polymer properties

In [0]:
%sh
TRAIN_DATA="/data/train.csv"  # not added to the repo
TOKENS_TRAIN="/data/train.tokenized.pt" # not added to the repo
SSL_DIR="/Transformer/saved_models/ssl_mlm_ckpt"
OUT_REGRESSOR="/Transformer/saved_models/regressor_ckpt"

export TRAIN_DATA TOKENS_TRAIN SSL_DIR OUT_REGRESSOR

python regressor.py \
  --input_data "$TRAIN_DATA" \
  --tokens_pt "$TOKENS_TRAIN" \
  --ssl_dir "$SSL_DIR" \
  --out_dir "$OUT_REGRESSOR" \
  --epochs 20 --batch_size 32 --lr 2e-4 --freeze_encoder_epochs 3 --pool cls
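
The `--freeze_encoder_epochs 3` and `--pool cls` flags can be read as follows; this is a sketch of the usual meaning of such options, not the code in `regressor.py`. It assumes epochs are counted from 0, that the encoder stays frozen for the first 3 epochs while only the linear head trains, and that `cls` pooling takes the hidden vector at position 0. The helper names are hypothetical, plain Python lists stand in for tensors, and `pool_mean` is shown only for contrast (whether the script offers such an option is not stated here).

```python
def encoder_trainable(epoch, freeze_encoder_epochs):
    """True once the frozen warm-up epochs are over (epochs counted from 0, an assumption)."""
    return epoch >= freeze_encoder_epochs


def pool_cls(hidden_states):
    """Summarise a sequence by the vector at position 0 (the CLS-style start token)."""
    return list(hidden_states[0])


def pool_mean(hidden_states, attention_mask):
    """Average the vectors of non-padding positions; a common alternative to CLS pooling."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no positions")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# With 3 frozen epochs, the encoder starts updating at epoch index 3.
schedule = [encoder_trainable(epoch, 3) for epoch in range(5)]
# schedule == [False, False, False, True, True]

# One toy sequence: 3 positions, hidden size 2, last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
summary = pool_cls(hidden)               # [1.0, 2.0]
average = pool_mean(hidden, [1, 1, 0])   # [2.0, 3.0]
```

Freezing the pretrained encoder at first lets the randomly initialised head settle before gradients reach the encoder weights.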

### 4. Prediction on test data

In [0]:
%sh
TEST_DATA="/data/test.csv" # not added to the repo
TOKENS_TEST="/data/test.tokenized.pt" # not added to the repo
REG_DIR="/Transformer/saved_models/regressor_ckpt" # not added to the repo
OUT="/Transformer/results/predictions_test.csv"

export TOKENS_TEST REG_DIR TEST_DATA OUT

python predict_test.py \
  --tokens_pt "$TOKENS_TEST" \
  --regressor_dir "$REG_DIR" \
  --test_data "$TEST_DATA" \
  --out_csv "$OUT" \
  --batch_size 64
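
The prediction step processes the test set in batches of 64 and writes one row per polymer with 5 predicted values. A minimal sketch of that shape, under stated assumptions: the `id` column, the placeholder column names `target_1` to `target_5`, and both helper functions are hypothetical, and the actual layout of `predictions_test.csv` is defined by `predict_test.py`.

```python
import csv
import io


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def write_predictions(handle, ids, predictions, target_names):
    """Write a header plus one row per sample: id followed by one value per target."""
    writer = csv.writer(handle)
    writer.writerow(["id", *target_names])
    for row_id, values in zip(ids, predictions):
        if len(values) != len(target_names):
            raise ValueError(f"expected {len(target_names)} values for id {row_id}")
        writer.writerow([row_id, *values])


# Placeholder names; the real property names are not given in this notebook.
TARGETS = [f"target_{k}" for k in range(1, 6)]

# 130 dummy samples split with batch_size 64 give batches of 64, 64 and 2.
sample_ids = list(range(130))
sizes = [len(batch) for batch in batches(sample_ids, 64)]

# Write two dummy prediction rows to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_predictions(buffer, [0, 1], [[0.1] * 5, [0.2] * 5], TARGETS)
```

Checking the row width before writing catches a mismatch between the regressor head and the expected number of properties early, rather than producing a malformed CSV.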