This repository contains a script to train and evaluate a convolutional neural network (CNN) that predicts the gas/solvent weight fraction in a (possibly swollen) polymer from pre-calculated electron density distributions.
- Overview
- Reference
- Quick Start Guide
- Advanced
- Dataset structure
- Electron density generation
- TODO
- Notes
The RhNet model is a convolutional neural network aimed at predicting the pressure- and temperature-dependent gas/solvent weight fraction in a (swollen) polymer. The prediction is based on 3D images of the electron density distributions in a solvent molecule and in a polymer monomer/oligomer. Basically, it combines two key ideas:
- to use a CNN to process electron density distribution and generate features to be fed to a dense network (see for example Prediction of Energetic Material Properties from Electronic Structure Using 3D Convolutional Neural Networks)
- to concatenate arrays of a polymer and a solvent features and an array of experimental conditions and feed them to a dense network (see for example Machine Learning for Polymer Swelling in Liquids)
Generally, the resulting architecture resembles some face recognition and face verification models (see for example this lecture by Andrew Ng)
The research paper with a detailed description of the project is published in the Polymer Chemistry journal. A preprint version is available at ChemRxiv. If you find the model or the dataset useful, please cite:
Gromov, O.I. Predicting the Solubility of Gases, Vapors, and Supercritical Fluids in Amorphous Polymers from Electron Density using Convolutional Neural Networks, Polym. Chem., 2024, 15, 1273-1296, https://doi.org/10.1039/D3PY01028G
The repository contains a Dockerfile which adds RhNet dependencies over a TensorFlow NGC container. To run an NGC container one will need:
- NVIDIA Docker
- a Volta-, Turing-, Ampere-, or newer-architecture GPU
For more information on NGC containers and their compatibility refer to NVIDIA Container Support Matrix.
- Clone the repository
git clone https://github.com/Shorku/rhnet
cd rhnet
- Build the NGC container with additional layers
sudo docker build -t rhnet .
The NGC container encapsulates Python, TensorFlow and other necessary components optimized for usage on NVIDIA hardware.
Running docker build will automatically download the NGC container image specified in the Dockerfile, install some additional Python modules, and copy the whole current directory with the RhNet script. Note: quantum-chemical data may be really large. Make sure there is no quantum-chemical data within the current folder while building the rhnet image, otherwise it will be copied as well.
Note: the current models and related routines were prepared with TensorFlow 2.5.0, which runs only with numpy<=1.21, while TensorFlow Model Optimization requires numpy>=1.23. An ugly workaround is to forcefully install numpy==1.21 after the other requirements, which is done while building the Docker container. Somehow it works. Moving to a newer TensorFlow version is planned.
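One simple way to keep bulky density data out of the image is a .dockerignore file. The patterns below are only an illustration (a hypothetical file, not shipped with the repository); adjust them to wherever your quantum-chemical data actually lives:

```
# hypothetical .dockerignore: keep bulky quantum-chemical data out of the build context
cubes/
*.npy
*.cube
```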
- Download or generate the data
  - A pre-trained model (multikernel option) can be downloaded from a separate repository (18.5 MB)
  - Experimental data and the other tables necessary for RhNet to perform training and inference are in the rhnet/data folder. The only exception is the electron densities.
  - Electron density images should be either calculated and formatted as described in the Electron density generation section or downloaded elsewhere: Density images used to train the current model (28.6 GB), Density images used to train the current model (unshifted) (1.1 GB), Full example of the data folder for inference
  - To perform both training and inference the script expects the dataset folder to contain the following files:
    - polymer_mass.csv (required for training and prediction)
    - solvent_mass.csv (required for training and prediction)
    - cubes folder with the electron density images
    - solvent_macro_features.csv (optional for training and prediction; needed only if the model is intended to use solvent macroparameters)
  - To perform inference the script also expects the dataset folder to contain to_predict_ranges.csv (required for inference).
  - To perform training the script also expects the dataset folder to contain:
    - experimental_dataset.csv (required for training)
    - polymers.txt (required for training)
    - solvents.txt (required for training)
  - For an in-depth description of the dataset structure refer to the Dataset structure section.
- Start the NGC container in the interactive mode
mkdir local_dataset_folder_path
mkdir local_model_folder_path
mkdir local_logs_folder_path
sudo docker run --runtime=nvidia -it --shm-size=<available_RAM>g --ulimit memlock=-1 --gpus all --rm --ipc=host -v local_dataset_folder_path:dataset_folder_path -v local_model_folder_path:model_folder_path -v local_logs_folder_path:logs_folder_path rhnet /bin/bash
This command will launch the container and mount the necessary folders:
- the local_dataset_folder_path directory, containing table and quantum-chemical data, as a volume to the dataset_folder_path directory inside the container
- the local_model_folder_path directory, containing a pre-trained model or checkpoints, to the model_folder_path directory in the container
- the local_logs_folder_path directory, containing log files, to the logs_folder_path directory inside the container
These will be accessible both from the host and from the Docker container.
Note: --shm-size=100g, for example, will share 100 GiB between the host and your Docker container. RhNet preloads the quantum-chemical data to RAM and, in the case of huge datasets, will need more RAM than the default amount. Make sure to adjust this parameter.
- Inference
python3 main.py --load_model --exec_mode predict --data_dir dataset_folder_path --model_dir model_folder_path --log_dir logs_folder_path --log_name logfile_name --use_only_mw --use_only_amorph --augment 1 --batch_size 32
- --load_model tells the script to load an already trained and saved model. It will override the --model option.
- --data_dir specifies the path to a folder with the data for prediction. Note, the data for fitting and for prediction have slightly different formats, as described in the Dataset for inference section.
- --model_dir specifies the path to a folder with a pre-trained Keras model. As usual, it is expected to contain a saved_model folder (model_folder_path/saved_model).
- --log_dir specifies the path to a folder where the calculated data will be stored.
- --use_only_mw tells the script to use only the Mw (weight-average molecular mass) feature and to omit the Mn (number-average molecular mass) feature, which is the way the current model is trained. Should be used with our pre-trained models.
- --use_only_amorph tells the script to use only amorphous polymers and to omit the crystallinity feature. Should be used with our pre-trained models.
- --augment 1 tells the script to skip rotations and shifts of the electron density images, which are unnecessary for inference.

Notes:
- If a model was trained with the --augment_onthefly option, it should also be invoked with --augment_onthefly for inference: interpolation smooths out the density images, so without this option the predictions would be inconsistent.
- Training
Current model:
python3 main.py --data_dir /data --model_dir /results --batch_size 8 --epochs 100 --exec_mode train --log_name multikernel --augment 25 --model multikernel --api builtin --use_only_mw --use_only_amorph --store_density cache --xla --amp --activation elu --learning_rate 0.000005 --save_model --log_dir tb_logs/fit --make_even --eval_define val_pairs.csv --checkpoint_every 20 --dnn_l2 0.00001 --cnn_l2 0.00001
Basic:
python3 main.py --exec_mode train --data_dir /data --model_dir /results --batch_size 24 --epochs 20 --log_dir /logs --log_name mylog --augment 25
- --batch_size tells the script the size of the mini-batch
- --epochs tells the script the number of epochs (how many times the whole set will be consumed)
- --augment tells the script how many shifted/rotated electron density 3D images are available

Mixed precision (lower memory consumption, faster evaluation):
python3 main.py --exec_mode train --xla --amp --data_dir /data --model_dir /results --batch_size 24 --epochs 20 --log_dir /logs --log_name mylog --augment 25
- --xla tells the script to enable accelerated linear algebra optimization
- --amp tells the script to enable automatic mixed precision

With L2 regularization:
python3 main.py --dnn_l2 0.005 --cnn_l2 0.005 --exec_mode train --data_dir /data --model_dir /results --batch_size 24 --epochs 20 --log_dir /logs --log_name mylog --augment 25
- --dnn_l2 tells the script to include in the loss function a regularizing term calculated from the weights of the dense part of the net
- --cnn_l2 tells the script to include in the loss function a regularizing term calculated from the weights of the convolutional part of the net

The result will be stored in a checkpoint file in the ./results folder if the --save_model or --checkpoint_every options are invoked.

Notes:
- By default, the script will try to balance the set. First it will generate a lookup table which, for every experimental point, contains all possible combinations of conformations and augmentations of the corresponding polymer and solvent. Then, every epoch, it will sample a random subset so that all polymer-solvent pairs have equal shares within the sampled set, while all experimental points also have equal shares within their polymer-solvent pair's share.
- Augmentation (shift and rotation of the images) is computationally heavy and by default is supposed to be done by a separate script outside the training procedure. The number of available pre-shifted images should be defined by the --augment option. On-the-fly augmentation is turned on using --augment_onthefly. With on-the-fly augmentation the RhNet script will randomly shift and rotate each image while prefetching the next batch of data. This creates a bottleneck, which can be mitigated to some extent by running the data generator in parallel (controlled by the --parallel_preproc keyword); a rough sketch of such an augmentation step is shown below.
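For illustration only, a random shift/rotation of a single density image could be sketched with scipy.ndimage roughly as follows (this is not the repository's implementation; the filename and shift range are made up):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng()

def augment_density(img):
    """Randomly rotate and shift a (dim, dim, dim, 2) density image."""
    # rotate in the plane spanned by two randomly chosen spatial axes
    axes = tuple(rng.choice(3, size=2, replace=False))
    angle = rng.uniform(0.0, 360.0)
    img = ndimage.rotate(img, angle, axes=axes, reshape=False, order=1)
    # shift by up to +/-2 voxels along each spatial axis (no shift along channels)
    shift = np.append(rng.uniform(-2.0, 2.0, size=3), 0.0)
    return ndimage.shift(img, shift, order=1)

cube = np.load('cubes/p_1_1_1_1.npy')  # hypothetical file, shape (80, 80, 80, 2)
augmented = augment_density(cube)
```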
- Training with validation
The model can be evaluated while training by passing the --eval_split flag:
python3 main.py --exec_mode train --eval_split <natural number> --evaluate_every 1 --data_dir /data --model_dir /results --batch_size 24 --epochs 20 --log_dir /logs --log_name mylog --augment 25
- --eval_split n will tell the script to split the set into training and validation subsets. A 1/n fraction of the total set will be assigned to the validation set, while the rest will be used for training.
- --evaluate_every n will tell the script to evaluate the model's performance every n epochs.

Notes:
- The validation set can also be defined using the --eval_define my_val_pairs.csv option; see the General parameters section for details.
- Training with cross-validation
To perform n-fold cross-validation the --fold flag should be passed:
python3 main.py --exec_mode train --fold <natural number> --evaluate_every 1 --data_dir /data --model_dir /results --batch_size 24 --epochs 20 --log_dir /logs --log_name mylog --augment 25
- --fold n will tell the script to randomly split the set into n subsets. The script will then train the model from scratch n times, using each subset in turn as the validation set.

Notes:
- The cross-validation feature hasn't been tested for a long time, sorry.
The following sections provide details of the script structure and available options.
The root directory contains:
- main.py: The entry point of the application.
- Dockerfile: Specifies the container with the basic set of dependencies to run RhNet.
- requirements.txt: Extra requirements for RhNet.
The runtime/ folder encapsulates the runtime logic:
- run.py: Implements the logic for training, evaluation, testing and prediction.
- setup.py: Implements the functions to set up the environment, create necessary directories, and initiate logging.
- arguments.py: Command-line arguments parsing.
The data_loading/ folder encapsulates the dataset loading and construction logic:
- data_loader.py: Implements loading of the necessary tables, construction and splitting of the dataset into train and evaluation subsets, and provides a generator preparing examples for training and evaluation.
The model/ folder contains RhNet and its building blocks:
- blocks.py: Implements the different blocks that are used to assemble RhNet.
- rhnet.py: Defines the model architecture.
The utils/ folder encapsulates some minor utility functions in utils.py.
The data/ folder contains the dataset (except for the calculated electron densities); for details refer to the Dataset structure section.
The qcdata_gen/ folder contains a script which automates the generation of the quantum-chemical data (electron and spin densities); for details refer to the Electron density generation section.
The images/ folder contains README figures.
The following is a list of the available parameters for the main.py script:
- --exec_mode: The execution mode of the script:
  - train: (DEFAULT) trains the model
  - evaluate: performs evaluation on the validation subset
  - predict: loads a model or checkpoint and runs inference. Predictions will be saved to {--log_dir}/prediction_{--log_name}.csv
  - error_analysis: performs evaluation on the whole dataset and saves the whole augmented dataset (without images, but with their indexes) to full_table_{--log_name}.csv and the predicted values to detailed_pred_{--log_name}.csv for subsequent analysis
- --model_dir: The directory to save checkpoints and the trained model. DEFAULT: ./results
- --data_dir: The directory containing the dataset. Required parameter.
- --data_csv: The .csv file containing experimental sorption data. The .csv file should look like this (DEFAULT: experimental_dataset.csv):
expno,polymer,solvent,mn,mw,cryst,tg,dens,pressure,temperature,wa,doi,notes
1,1,11,18000,80000,0,593,1.2,0.002,283.5,0.08,"10.1021/ie3027873","Huntsman Advanced Materials"
2,1,11,18000,80000,0,593,1.2,0.004,283.5,0.10,"10.1021/ie3027873","Huntsman Advanced Materials"
...
- --to_pred_csv: The .csv file containing the parameter ranges intended for prediction. The .csv file should look like this (DEFAULT: to_predict_ranges.csv):
polymer,solvent,mn,mw,cryst,tg,dens,pmin,pmax,npstep,tmin,tmax,ntstep
1,1,18000,80000,0,593,1.2,0.1,1.0,10,298.15,308.15,3
1,10,18000,80000,0,593,1.2,0.1,1.0,10,298.15,308.15,3
...
- --api: Whether to use the Keras builtin or a custom training/evaluation loop:
  - builtin: (DEFAULT) use builtin Keras loops
  - custom: use custom loops
- --store_density: Where to store electron density images:
  - ram: preloads images to RAM before training/evaluation. Saves some time on I/O but can consume a lot of memory
  - file: loads images from disk every time during training/evaluation. Reduces RAM demands but creates an I/O bottleneck
  - cache: loads images from disk during training/evaluation, but stores already loaded images in RAM for all subsequent usages
- --store_sparse: Compress images stored in RAM using their sparsity. Saves memory but adds compression-decompression overhead. DEFAULT: False
- --parallel_preproc: Perform assembly and augmentation of training examples in multiple CPU threads. If set to 0, the number of threads will default to batch_size. DEFAULT: 0
- --timeout: Introduce a delay (in seconds) between the data generators' initialization and their first calls to synchronize their state at the epoch start. DEFAULT: 10
- --log_dir: The directory to store logs. DEFAULT: .
- --log_name: Suffix for different log files. DEFAULT: None
- --log_every: Log performance every n steps (for dllogger and the custom training loop)
- --xla: Enable accelerated linear algebra optimization. DEFAULT: False
- --amp: Enable automatic mixed precision. DEFAULT: False
- --model: Index of the RhNet model to be used. Different models are defined in rhnet.py. DEFAULT: linear
- --activation: Activation function:
  - relu: use ReLU (rectified linear unit) activation
  - sigmoid: use sigmoid activation
  - tanh: use tanh (hyperbolic tangent) activation
  - leaky_relu: use Leaky ReLU (leaky rectified linear unit) activation
  - elu: use ELU (exponential linear unit) activation
- --pooling: Pooling function for the convolutional part of the network:
  - av: downsize images with averaging
  - max: conventional max pooling
- --save_model: Save the model after the training run. DEFAULT: False
- --resume_training: Load the model from the checkpoint in the model_dir. If not enabled, the model_dir will be overwritten. DEFAULT: False
- --load_model: Load the saved model. DEFAULT: False
- --mw: Use only the Mw (weight-average molecular weight) feature and drop the Mn (number-average molecular weight) feature. If not enabled, both Mw and Mn features are used. DEFAULT: False
- --amorph: Use only amorphous/melted polymers from the dataset and drop the cryst feature. DEFAULT: False
- --tg: Use the polymer glass transition temperature as a feature. DEFAULT: False
- --dens: Use the polymer density at normal conditions as a feature. DEFAULT: False
- --bt: Use the solvent's boiling temperature at normal pressure as a feature. DEFAULT: False
- --ctp: Use the solvent's critical temperature and pressure as features. DEFAULT: False
- --augment: Tells the script how many augmented (shifted and rotated) electron density images are available for each molecular geometry. Augmentation itself is done elsewhere. DEFAULT: 25
- --augment_onthefly: Perform image augmentation (random shift and rotation) on the fly in parallel to training. Computationally heavy. It will automatically set --augment to 1 and read (and store) only one image per chemical structure/conformation. For example, only p_{index}_{index}_{index}_1.npy will be loaded and shifted/rotated, while any other p_{index}_{index}_{index}_{index}.npy will be ignored. DEFAULT: False
- --nonint_shift: In conjunction with augment_onthefly, shifts images by non-integer (in voxels) steps if set to True. Otherwise, only integer steps will be performed. Non-integer steps involve interpolation and are computationally heavier. DEFAULT: False
- --make_even: Try to tune sampling weights to equalize the impact of examples with medium and high solvent content, assuming a normal distribution
- --analysis_n: Loop only through the first n examples of the dataset in error analysis mode
- --eval_split: Turns on evaluation during model training. If --eval_split n is specified, the script will randomly sample 1/n of the polymer-solvent pairs into the validation set. DEFAULT: None
- --eval_define: Path to the .csv file which defines the polymer-solvent pairs to be excluded from the training set and included in the validation set for training-time evaluation. The .csv file should look like this (DEFAULT: None):
polymer,solvent
25,10
25,37
16,2
19,45
...
- --fold: Turns on n-fold cross-validation. Randomly splits the dataset into n parts and performs n training runs using each part as a validation subset. Overrides --eval_split. DEFAULT: None
- --holdout_define: Path to the .csv file which defines the polymer-solvent pairs to be excluded from the dataset during the current run. The .csv file has the same structure as described for --eval_define. DEFAULT: None
- --restrict_to_define: Path to the .csv file which defines the polymer-solvent pairs to be used during the current run (everything else will be excluded). The .csv file has the same structure as described for --eval_define. DEFAULT: None
- --batch_size: Size of the minibatch. DEFAULT: 1
- --learning_rate: Adam's learning rate. DEFAULT: 0.0001
- --learning_rate_decay: Learning rate exponential decay coefficient. DEFAULT: 0.0001
- --learning_rate_decay_steps: Learning rate exponential decay steps. DEFAULT: 600000
- --max_steps: Maximum number of steps (batches) for each epoch during a training run. If set to 0, the whole dataset will be consumed. DEFAULT: 0
- --epochs: Sets the number of epochs for the training run. DEFAULT: 1
- --evaluate_every: Perform evaluation during model training every n epochs. DEFAULT: 1
- --checkpoint_every: Save a model checkpoint every n epochs. No checkpoints are saved if set to 0. DEFAULT: 0
- --initial_epoch: Set the initial epoch for model fitting. Applied only to functional models. This option is only needed for correct callback functioning while fitting loaded models; otherwise it has no effect. DEFAULT: 0
- --prune_model: Define the final model sparsity and perform model pruning with a polynomial decay schedule. DEFAULT: None
- --prune_start: Define the model pruning starting step. DEFAULT: 0
- --prune_end: Define the model pruning final step. DEFAULT: None
- --dnn_dropout: Use dropout regularization for the dense part. DEFAULT: None
- --cnn_dropout: Use dropout regularization for the convolutional part. DEFAULT: None
- --dnn_l2: Use L2 regularization for the dense part. DEFAULT: 0
- --cnn_l2: Use L2 regularization for the convolutional part. DEFAULT: 0
- --zdt: Perform a second run of error analysis using zero densities
- --logcm: Save predicted values in cm3(STP)/cm3 instead of weight fraction
To see the full list of available options and their descriptions, use the -h or --help command-line option, for example:
python main.py --help
The following example output is printed when running the model:
usage: main.py [-h] [--exec_mode {train,evaluate,predict,error_analysis,train_and_error_analysis}]
[--model_dir MODEL_DIR] [--data_dir DATA_DIR] [--data_csv DATA_CSV]
[--to_pred_csv TO_PRED_CSV] [--api {builtin,custom}]
[--store_density {ram,file,cache}] [--store_sparse]
[--parallel_preproc PARALLEL_PREPROC] [--timeout TIMEOUT]
[--log_dir LOG_DIR] [--log_name LOG_NAME] [--log_every LOG_EVERY]
[--use_amp] [--use_xla] [--model MODEL]
[--activation {relu,sigmoid,tanh,leaky_relu,elu}]
[--pooling {av,max}] [--save_model] [--resume_training]
[--load_model] [--use_only_mw] [--use_only_amorph] [--use_tg]
[--use_dens] [--use_solvent_boil_temp] [--use_solvent_cryt_point]
[--augment AUGMENT] [--augment_onthefly] [--nonint_shift]
[--even_ratios_distrib] [--analysis_n ANALYSIS_N]
[--eval_split EVAL_SPLIT] [--eval_define EVAL_DEFINE] [--fold FOLD]
[--holdout_define HOLDOUT_DEFINE] [--restrict_to_define RESTRICT_TO_DEFINE]
[--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE]
[--learning_rate_decay LEARNING_RATE_DECAY]
[--learning_rate_decay_steps LEARNING_RATE_DECAY_STEPS]
[--max_steps MAX_STEPS] [--epochs EPOCHS] [--evaluate_every EVALUATE_EVERY]
[--checkpoint_every CHECKPOINT_EVERY]
[--dnn_dropout DNN_DROPOUT] [--cnn_dropout CNN_DROPOUT]
[--dnn_l2 DNN_L2] [--cnn_l2 CNN_L2] [--zero_density_test]
[--log_cm3cm3]
RhNet
optional arguments:
-h, --help show this help message and exit
--exec_mode {train,evaluate,predict,error_analysis,train_and_error_analysis}
Execution mode of running the model
--model_dir MODEL_DIR
Output directory for saved model and checkpoints
--data_dir DATA_DIR Directory with the dataset
--data_csv DATA_CSV CSV file with experimental part of the dataset relative
to data_dir
--to_pred_csv TO_PRED_CSV
CSV file with intended for prediction parameters
ranges
--api {builtin,custom}
Whether to use Keras builtin or custom training loop
--store_density {ram,file,cache}
Where to store density images
--store_sparse Compress images
--parallel_preproc PARALLEL_PREPROC
Perform preprocessing in parallel
--timeout TIMEOUT Delay (sec) to sync data generators init
--log_dir LOG_DIR Output directory for training logs
--log_name LOG_NAME Suffix for different log files
--log_every LOG_EVERY
Log performance every n steps
--use_amp, --amp Train using TF-AMP
--use_xla, --xla Train using XLA
--model MODEL Model index
--activation {relu,sigmoid,tanh,leaky_relu,elu}
Activation function
--pooling {av,max} Pooling function
--save_model Save model after training
--resume_training Resume training from a checkpoint
--load_model Resume training from a saved_model
--use_only_mw, --mw Use only Mw and drop Mn
--use_only_amorph, --amorph
Drop semi-crystalline poly
--use_tg, --tg Use polymer Tg feature
--use_dens, --dens Use polymer density feature
--use_solvent_boil_temp, --bt
Use solvent Tb feature
--use_solvent_cryt_point, --ctp
Use solvent critical point feature
--augment AUGMENT Number of shifted and rotated densities per geometry.
Note: the option affects only indexing, augmentation
itself is done elsewhere
--augment_onthefly Shift and rotate images on the fly
--nonint_shift Use non-integer steps shifting images
--even_ratios_distrib, --make_even
Tune sampling weights to equalize impacts of the
examples with medium and high solvent content
--analysis_n ANALYSIS_N
Number of samples to be used in error analysis. Use
full table if not specified
--eval_split EVAL_SPLIT
Turn on validation during training. Randomly sample
1/val_split examples into validation set
--eval_define EVAL_DEFINE
Turn on validation during training. Filepath to a .csv
file defining polymer-solvent pairs to be used in
evaluation (validation). Will override --eval_split
option.
--fold FOLD Turn on cross-validation during training. Randomly
split the dataset into a defined number of parts for
cross-validation. None - no cross-val. Will override
--eval_split and --eval_define
--holdout_define HOLDOUT_DEFINE
Exclude the data associated with the defined
polymer-solvent pairs from the currently used dataset.
Filepath to a .csv file defining polymer-solvent pairs
to be excluded.
--restrict_to_define RESTRICT_TO_DEFINE
Include in the currently used dataset only the data
associated with the defined polymer-solvent pairs.
Filepath to a .csv file defining polymer-solvent pairs
to be included.
--batch_size BATCH_SIZE
Size of each minibatch per GPU
--learning_rate LEARNING_RATE
Learning rate coefficient for AdamOptimizer
--learning_rate_decay LEARNING_RATE_DECAY
Learning rate exponential decay coefficient
--learning_rate_decay_steps LEARNING_RATE_DECAY_STEPS
Learning rate exponential decay rate
--max_steps MAX_STEPS
Maximum number of steps (batches) in training
If --max_steps 0 consume the whole set
--epochs EPOCHS Number of epochs used in training
--evaluate_every EVALUATE_EVERY
Evaluate performance (validate) every n epochs
--checkpoint_every CHECKPOINT_EVERY
Save checkpoint every ... epochs
--dnn_dropout DNN_DROPOUT
Use dropout regularization for dense part
--cnn_dropout CNN_DROPOUT
Use dropout regularization for conv part
--dnn_l2 DNN_L2 Use l2 regularization for dense part
--cnn_l2 CNN_L2 Use l2 regularization for conv part
--zero_density_test, --zdt
Perform a second run of error analysis using zero
densities
--log_cm3cm3, --logcm Log predicted values in cm3/cm3 instead of weight
fraction
The data/ folder contains the current version of the dataset. It includes the following files:
- data/experimental_dataset.csv: the .csv file containing experimental sorption data:
expno,polymer,solvent,mn,mw,cryst,tg,dens,pressure,temperature,wa,doi,notes
1,1,11,18000,80000,0,593,1.2,0.002,283.5,0.08,"10.1021/ie3027873","Huntsman Advanced Materials"
2,1,11,18000,80000,0,593,1.2,0.004,283.5,0.10,"10.1021/ie3027873","Huntsman Advanced Materials"
...
- data/to_predict_ranges.csv: the (example) .csv file containing the parameter ranges intended for prediction:
polymer,solvent,mn,mw,cryst,tg,dens,pmin,pmax,npstep,tmin,tmax,ntstep
1,1,18000,80000,0,593,1.2,0.1,1.0,10,298.15,308.15,3
1,10,18000,80000,0,593,1.2,0.1,1.0,10,298.15,308.15,3
...
- data/list_of_polymers.csv matches the names of polymers and their ID numbers in the dataset
- data/list_of_solvents.csv matches the names of solvents and their ID numbers in the dataset
- data/geometries.tar.xz provides the geometries of solvents and polymers' repeating units (in XYZ format) which were used to calculate electron densities. Filenames in the archive use the following convention (a small helper for decoding these names is sketched after this list):
  - Filenames contain four indexes separated by _
  - The first index is a letter: either p (polymer) or s (solvent)
  - The second index is the ID number of a polymer specified in data/list_of_polymers.csv or of a solvent specified in data/list_of_solvents.csv
  - The third index is the choice of a particular repeating unit of a polymer and is always 1 for solvents
  - The fourth index enumerates conformational isomers
  - For example, p_3_2_1.xyz contains the geometry of:
    - a polymer, as it starts with p
    - polyethylene (-CH2CH2-)n, which is defined by index 3
    - the -CH2CH2CH2CH2- repeating unit, defined by index 2
    - conformational isomer number 1 of the -CH2CH2CH2CH2- repeating unit
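A tiny helper along these lines (purely illustrative, not part of the repository) can decode such filenames:

```python
def parse_geometry_name(filename):
    """Decode e.g. 'p_3_2_1.xyz' into its components."""
    stem = filename.rsplit('.', 1)[0]
    kind, species_id, cut, conformer = stem.split('_')
    return {'kind': 'polymer' if kind == 'p' else 'solvent',
            'id': int(species_id),   # ID in list_of_polymers.csv / list_of_solvents.csv
            'cut': int(cut),         # repeating-unit choice, always 1 for solvents
            'conformer': int(conformer)}

print(parse_geometry_name('p_3_2_1.xyz'))
# {'kind': 'polymer', 'id': 3, 'cut': 2, 'conformer': 1}
```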
- data/smiles.tar.xz contains polymers' repeating units and solvents in SMILES format. It is provided within the dataset to simplify its use for fitting models with molecular representations other than the electron density. Dangling bonds in polymers' repeating units are capped with Fr atoms. Filenames in the archive use the following convention:
  - Filenames contain two indexes separated by _
  - The first index is a letter: either p (polymer) or s (solvent)
  - The second index is the ID number of a polymer specified in data/list_of_polymers.csv or of a solvent specified in data/list_of_solvents.csv
- data/polymers.txt contains a Python dictionary defining the number of available conformations for each polymer repeating unit (called a "cut" in the code for brevity); a minimal reading sketch follows the example below:
{polymer_ID: [cut0_confs, cut1_confs...],
polymer_ID: [cut0_confs],
polymer_ID: [cut0_confs, cut1_confs, cut2_confs...],
...}
For example:
{1: [92],
2: [18, 40],
3: [3, 9],
4: [3, 20, 6],
...}
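Because the file stores a plain Python dictionary literal, it can be read, for instance, with ast.literal_eval (an illustrative sketch, not the loader used by the script):

```python
import ast

with open('data/polymers.txt') as f:
    polymer_confs = ast.literal_eval(f.read())

# e.g. polymer 4 has three repeating-unit choices ("cuts") with 3, 20 and 6 conformers
print(polymer_confs[4])  # [3, 20, 6]
```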
- data/polymer_mass.csv defines the molar masses (g/mol) of the polymer repeating units; the repeating units' (cuts') indexing is 1-based:
polymer, cut, poly_mass
polymer_ID, cut_ID, mass
polymer_ID, cut_ID, mass
...
For example:
polymer,cut,poly_mass
1,1,552.61
2,1,114.16
...
- data/solvents.txt contains a Python dictionary defining the number of available conformations for each solvent:
{solvent_ID: num_confs,
solvent_ID: num_confs,
...}
For example:
{1: 1,
2: 1,
3: 3,
...}
- data/solvent_mass.csv defines the molar masses (g/mol) of the solvents:
solvent, solv_mass
solvent_ID, mass
solvent_ID, mass
...
For example:
solvent,solv_mass
45,74.09
9,120.03
...
- data/solvent_macro_features.csv defines solvent macroscopic features (optional):
solvent,bt,ct,cp
solvent_ID, boiling temp(K), critical temp(K), critical pressure(MPa)
solvent_ID, boiling temp(K), critical temp(K), critical pressure(MPa)
...
For example:
solvent,bt,ct,cp
1,194.7,304.1,7.38
2,111.6,190.6,4.6
...
- data/val_pairs.csv defines the polymer-solvent pairs to be excluded from the training set and included in the validation set for training-time evaluation:
polymer,solvent
polymer_ID, solvent_ID
polymer_ID, solvent_ID
...
For example:
polymer,solvent
25,10
19,45
...
- data/test_pairs.csv defines the polymer-solvent pairs to be excluded from the training and validation sets and used for the after-training test evaluation:
polymer,solvent
polymer_ID, solvent_ID
polymer_ID, solvent_ID
...
For example:
polymer,solvent
13,1
24,1
...
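For orientation, the tables above can be loaded and combined with pandas, for example joining the experimental points with the optional solvent macro-features (an illustrative sketch, not the script's own loading code):

```python
import pandas as pd

exp = pd.read_csv('data/experimental_dataset.csv')
macro = pd.read_csv('data/solvent_macro_features.csv')

# attach solvent boiling/critical parameters to every experimental point
full = exp.merge(macro, on='solvent', how='left')
print(full[['polymer', 'solvent', 'pressure', 'temperature', 'wa', 'bt', 'ct', 'cp']].head())
```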
A single example consists of two images (the polymer repeating unit's density and the solvent's density) and a vector (table row) of macroscopic features such as pressure, temperature, the polymer's molar mass, etc.
The density images have two channels: an electron density channel and a spin density channel. The spin density channel is mostly used to represent the position and orientation of the polymer repeating unit's dangling bonds, which connect it to the other repeating units in a real polymer.
Available macroscopic features (listed in the order they are expected by the model) are:
features = ['mn', # polymer number average molar weight
'mw', # polymer weight average molar weight
'cryst', # polymer crystallinity degree
'pressure', # experimental pressure
'temperature', # experimental temperature
'tg', # polymer glass transition temperature
'dens', # polymer density
'bt', # solvent boiling temperature
'ct', # solvent critical temperature
'cp' # solvent critical pressure
]
Some of the features can be omitted using command-line arguments (see the Advanced section). The current version of the pre-trained model uses the ['mw', 'pressure', 'temperature'] set of features.
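As an illustration of how the flags map onto this list (a sketch, not the repository's code), the subset used by the pre-trained model is obtained by dropping mn (--use_only_mw) and cryst (--use_only_amorph) and leaving the optional features at their default off state:

```python
features = ['mn', 'mw', 'cryst', 'pressure', 'temperature',
            'tg', 'dens', 'bt', 'ct', 'cp']

use_only_mw = True       # --use_only_mw drops 'mn'
use_only_amorph = True   # --use_only_amorph drops 'cryst'
optional_off = {'tg', 'dens', 'bt', 'ct', 'cp'}  # off unless --tg/--dens/--bt/--ctp

selected = [f for f in features
            if not (use_only_mw and f == 'mn')
            and not (use_only_amorph and f == 'cryst')
            and f not in optional_off]
print(selected)  # ['mw', 'pressure', 'temperature']
```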
The model's input signature is as follows:
import tensorflow as tf
fit_input_signature= \
((tf.TensorSpec(shape=(img_dim, # polymer electron/spin
img_dim, # density image
img_dim, 2),
dtype=precision),
tf.TensorSpec(shape=(img_dim, # solvent electron/spin
img_dim, # density image
img_dim, 2),
dtype=precision),
tf.TensorSpec(shape=(macro_feature_number,), # polymer/solvent macro
dtype=precision)), # features
tf.TensorSpec(shape=(1,), # value to be predicted
dtype=precision),)
inference_input_signature= \
((tf.TensorSpec(shape=(img_dim, # polymer electron/spin
img_dim, # density image
img_dim, 2),
dtype=precision),
tf.TensorSpec(shape=(img_dim, # solvent electron/spin
img_dim, # density image
img_dim, 2),
dtype=precision),
tf.TensorSpec(shape=(macro_feature_number,), # polymer/solvent macro
dtype=precision),)) # features
The current pre-trained model uses img_dim = 80 and macro_feature_number = 3.
By default precision = tf.float32; if the --amp key is invoked, precision = tf.float16.
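With this signature, a tf.data input pipeline can be assembled from a Python generator, roughly as follows (a sketch that feeds dummy zero tensors through a hypothetical example_generator; the real examples come from the data loader):

```python
import tensorflow as tf

img_dim, macro_feature_number = 80, 3
precision = tf.float32  # tf.float16 when --amp is used

def example_generator():
    # placeholder generator yielding ((polymer_img, solvent_img, macro), target)
    polymer = tf.zeros((img_dim, img_dim, img_dim, 2), dtype=precision)
    solvent = tf.zeros((img_dim, img_dim, img_dim, 2), dtype=precision)
    macro = tf.zeros((macro_feature_number,), dtype=precision)
    target = tf.zeros((1,), dtype=precision)
    yield (polymer, solvent, macro), target

dataset = (tf.data.Dataset
           .from_generator(example_generator,
                           output_signature=fit_input_signature)
           .batch(8)
           .prefetch(tf.data.AUTOTUNE))
```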
For inference, the data/ folder should contain the following files:
- to_predict_ranges.csv
- list_of_polymers.csv
- list_of_solvents.csv
- polymer_mass.csv
- solvent_mass.csv
- cubes directory with density images
For example, suppose the to_predict_ranges.csv file looks like this:
polymer,solvent,mn,mw,cryst,tg,dens,pmin,pmax,npstep,tmin,tmax,ntstep
13,1,29000,140000,0,243,1.2304,2,20,19,393.15,453.15,3
This will request a calculation of the absorption of the gas with ID number 1 in the polymer with ID number 13:
- the cubes directory should contain the polymer repeating unit density in a p_13.npy file and the solvent molecule density in an s_1.npy file
- the number-average (mn) and weight-average (mw) molar weights of the polymer are set to 29000 and 140000, respectively
- the degree of crystallinity (cryst) is set to 0 (ignored in the current model)
- the polymer glass transition temperature (tg) is set to 243 K (ignored in the current model)
- the polymer density (dens) is set to 1.2304 g/cm3 (ignored in the current model)
- the model will be applied to calculate ntstep isotherms with minimal temperature tmin and maximal temperature tmax, in this example at 393.15 K, 423.15 K, and 453.15 K
- each isotherm will be calculated for pressures in the range pmin-pmax with npstep points: at 2 MPa, 3 MPa, ..., 20 MPa, as reproduced in the sketch below
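The pressure/temperature grid implied by that row can be reproduced with numpy (illustration only):

```python
import numpy as np

pmin, pmax, npstep = 2.0, 20.0, 19
tmin, tmax, ntstep = 393.15, 453.15, 3

pressures = np.linspace(pmin, pmax, npstep)      # 2, 3, ..., 20 MPa
temperatures = np.linspace(tmin, tmax, ntstep)   # 393.15, 423.15, 453.15 K

# one prediction per (temperature, pressure) combination
grid = [(t, p) for t in temperatures for p in pressures]
print(len(grid))  # 57 points: 3 isotherms x 19 pressures
```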
For training, the data/ folder should contain the following files:
- experimental_dataset.csv
- list_of_polymers.csv
- list_of_solvents.csv
- polymer_mass.csv
- solvent_mass.csv
- polymers.txt
- solvents.txt
- cubes directory with density images
Refer to the Dataset overview section for details.
The repository contains a script to generate and optionally augment 3D images of the electron density distribution in a format suitable for the RhNet model.
The qcdata_gen script is intended to perform two main tasks:
- generate quantum-chemical data for training
- generate quantum-chemical data for inference
The workflows in these two cases differ. For a general description of the training data generation refer to the paper from the Reference section.
To run the script one will need:
- ORCA - An ab initio, DFT and semiempirical SCF-MO package
- RDKit* - Open-Source Cheminformatics Software
- XTB* - Semiempirical Extended Tight-Binding Program Package
- conformers* - a standalone script for identical structures filtering
Notes:
- RDKit, XTB, and conformers are only needed for training dataset generation
Download and install prerequisites:
- Generation of both training and inference data requires ORCA
package. ORCA is available as a ready-to-use binary package and doesn't
require (for the time being) compilation and installation. For more details
refer to the official documentation or
other resources.
There are two options for the script to locate ORCA executables.
Either ORCA path can be retrieved from $ORCA environmental variable, or it can
be provided as a command-line option
--orca_path <orca_path>
- RDKit is only needed to generate training examples. Here is an RDKit installation example that worked for me (make sure to use your username, the correct Python path, and the most recent RDKit version):
sudo apt install libboost-all-dev
sudo apt install zlib1g-dev
sudo apt install libfreetype6-dev libfreetype6
sudo apt install libeigen3-dev
sudo apt install cmake
wget https://github.com/rdkit/rdkit/archive/Release_2021_03_1.tar.gz
cd Release_2021_03_1/rdkit-Release_2021_03_1/
mkdir build
cd build
export RDBASE=/home/username/Release_2021_03_1/rdkit-Release_2021_03_1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/username/Release_2021_03_1/rdkit-Release_2021_03_1/lib
cmake -D PYTHON_EXECUTABLE=/usr/bin/python3 -D RDK_INSTALL_INTREE=OFF -D CMAKE_INSTALL_PREFIX=/home/username/rdkit ..
make -j 12
make install
ln -s /home/username/rdkit/lib/python3.8/site-packages/rdkit /home/username/.local/lib/python3.8/site-packages
- XTB is only needed to generate training examples. For XTB setup instructions refer to the documentation. Once downloaded and compiled be sure to make it executable and move it to the ORCA directory:
chmod +x xtb
mv xtb orca_path/otool_xtb
- The conformers script is only needed to generate training examples. There are two options for the qcdata_gen script to locate the conformers script: either the conformers path can be retrieved from the $CONFORMERS environment variable, or it can be provided as a command-line option --conf_path <conf_path>. Once downloaded, make it executable:
chmod +x conformers
The script can consume multiple .xyz files from the input directory, perform single-point DFT calculations, and convert the resulting electron and spin densities to the .npy format suitable for the RhNet model (a simplified sketch of this .cube-to-.npy conversion is given after the option list below).
python3 qcdata_gen/main.py --exec_mode cube_to_predict --pal <ncores> --mol_dir <input_path> --out_dir <output_path> --orca_path <orca_path> --use_name_convention
- --exec_mode cube_to_predict will instruct the script to perform only single-point calculations on the user-supplied geometries. No shifts or rotations will be applied to the electron densities
- --use_name_convention suggests that a structure with an xyz-filename starting with s_ is a solvent molecule and one starting with p_ is a polymer repeating unit with two dangling bonds
- --pal <ncores> number of CPU cores to use in DFT calculations
- --mol_dir <input_path> a directory where the script will look for input .xyz files
- --out_dir <output_path> a directory where temporary files and the results will be stored
- --orca_path <orca_path> a directory where ORCA binaries are stored; can also be defined in the ORCA environment variable
The script can consume multiple .mol files from the input directory.
Basic:
python3 qcdata_gen/main.py --exec_mode conf_and_cube --pal <ncores> --mol_dir <input_path> --out_dir <output_path> --orca_path <orca_path> --conf_path <conf_path> --use_name_convention --rdkit_nconf <nconf> --rdkit_thresh <thresh>
- --exec_mode conf_and_cube will instruct the script to generate a number of conformations using RDKit, optimize geometries using RDKit and ORCA (successively at the XTB and DFT levels), and finally perform single-point calculations to generate density images, by default with translational and rotational augmentation
- --pal <ncores> number of CPU cores to use in DFT calculations
- --mol_dir <input_path> a directory where the script will look for input .mol files
- --out_dir <output_path> a directory where temporary files and the results will be stored
- --orca_path <orca_path> a directory where ORCA binaries are stored; can also be defined in the ORCA environment variable
- --conf_path <conf_path> a directory where the conformers script is stored; can also be defined in the CONFORMERS environment variable
- --use_name_convention suggests that a structure with an xyz-filename starting with s_ is a solvent molecule and one starting with p_ is a polymer repeating unit with two dangling bonds
- --rdkit_nconf <nconf> maximal number of conformers generated by RDKit
- --rdkit_thresh <thresh> conformer difference threshold for RDKit
- --exec_mode: The execution mode of the script:
  - conf: search for conformations and optimize their geometries
  - cube: perform single-point calculations and generate density images
  - conf_and_cube: (DEFAULT) search for conformations, optimize their geometries, perform single-point calculations, and generate density images
  - cube_to_predict: perform only single-point calculations on the user-supplied geometries. No shifts or rotations will be applied to the electron densities
- --mol_dir: input .mol or .xyz structures directory
- --out_dir: a directory where temporary files and the results will be stored
- --pal: number of CPU cores to use in DFT calculations. DEFAULT: 12
- --conf_level: Highest level of geometry optimization. By default, the script will successively optimize geometries at the molecular mechanics level in RDKit, then at the XTB and DFT levels in ORCA:
  - ff: optimize geometries at the molecular mechanics level in RDKit only
  - semi: optimize geometries at the molecular mechanics level in RDKit and the XTB level in ORCA
  - dft: (DEFAULT) optimize geometries at the molecular mechanics level in RDKit and the XTB and DFT levels in ORCA
  - ff_dft: omit XTB optimization
- --dft_scheme: Choice of DFT methods for optimization and electron density calculation (DEFAULT: normal):
  - normal: r2SCAN-3c for optimization and B3LYP/def2-TZVP for density
  - fast: B97-3c for optimization and PBE/def2-SVP for density
- --use_name_convention: treat a structure with an xyz-filename starting with s_ as a solvent molecule and one starting with p_ as a polymer repeating unit. In polymer repeating unit geometries the 8 last atoms are assumed to be screening Me- groups and will be removed for the single-point calculation and density image generation
- --component: treat all structures in the xyz-files as (DEFAULT: None):
  - solvent: solvent - calculate images for the whole molecule
  - polymer: polymer - the 8 last atoms are assumed to be screening Me- groups and will be removed for the single-point calculation and density image generation
- --rdkit_thresh: RDKit conformer difference criterion. DEFAULT: 0.1
- --rdkit_nconf: max number of conformers RDKit will generate. DEFAULT: 1000
- --rdkit_thresh_keep: energy window to keep conformers within during optimization in RDKit, kJ/mol. DEFAULT: 20
- --orca_thresh_keep: energy window to keep conformers within during optimization in ORCA, kJ/mol. DEFAULT: 5
- --cube_n: dimension of the output density images, points. DEFAULT: 80
- --cube_spacing: spacing in the output density images, Angstroem. DEFAULT: 0.4
- --cube_aug: number of translationally and rotationally augmented density images to be generated, max is 25. If on-the-fly rotational and translational augmentation is chosen for training, --cube_aug 1 could probably be set. DEFAULT: 25
- --extend_cube: density resolution increment. The original .cube files generated by ORCA will have --extend_cube times better resolution than requested by --cube_n. The density image resolution is then decreased to the target value via averaging instead of interpolation (see the sketch below). DEFAULT: 6
- --orca_path: ORCA binaries location; can also be taken from the ORCA environment variable
- --conf_path: conformers script location; can also be taken from the CONFORMERS environment variable
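The averaging mentioned for --extend_cube amounts to simple block averaging of the oversampled grid; conceptually (not the script's actual code):

```python
import numpy as np

cube_n, extend = 80, 6                       # --cube_n 80, --extend_cube 6 (defaults)
# stand-in for the oversampled density computed by ORCA (480x480x480 points)
fine = np.zeros((cube_n * extend,) * 3, dtype=np.float32)

# average every 6x6x6 block into one voxel of the target 80x80x80 image
coarse = fine.reshape(cube_n, extend, cube_n, extend, cube_n, extend).mean(axis=(1, 3, 5))
print(coarse.shape)                          # (80, 80, 80)
```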
To see the full list of available options and their descriptions, use the -h or --help command-line option, for example:
python qcdata_gen/main.py --help
The following example output is printed:
usage: main.py [-h] [--exec_mode {conf,cube,conf_and_cube,cube_to_predict}]
[--mol_dir MOL_DIR] --out_dir OUT_DIR [--pal PAL]
[--conf_level {ff,semi,dft,ff_dft}] [--dft_scheme {normal,fast}]
[--use_name_convention] [--component {solvent,polymer}]
[--rdkit_thresh RDKIT_THRESH] [--rdkit_nconf RDKIT_NCONF]
[--rdkit_thresh_keep RDKIT_THRESH_KEEP]
[--orca_thresh_keep ORCA_THRESH_KEEP] [--cube_n CUBE_N]
[--cube_spacing CUBE_SPACING] [--cube_aug CUBE_AUG]
[--extend_cube EXTEND_CUBE] [--orca_path ORCA_PATH]
[--conf_path CONF_PATH]
ConfGen
optional arguments:
-h, --help show this help message and exit
--exec_mode {conf,cube,conf_and_cube,cube_to_predict}
Generate conformers and/or cubes
--mol_dir MOL_DIR Input .mol or .xyz structures directory
--out_dir OUT_DIR Directory with output data
--pal PAL Number of CPU cores to use in calculations
--conf_level {ff,semi,dft,ff_dft}
Highest level of conf geometry optimization,
successively optimize at ff, xtb and dft levels by
default. For example xtb option will do ff and
xtb optimizations. ff-dft will omit xtb step.
--dft_scheme {normal,fast}
Choice of DFT methods for optimization and electron
density calculation
--use_name_convention
Suggest structure with xyz-filename starting with: 's_'
to be a solvent molecule and with 'p_' to be a monomer
- 8 last atoms are suggested to be screening Me- groups
and will be removed for cube eval
--component {solvent,polymer}
Suggest all structures in xyz-files to be: solvent -
calculate/interpolate cube for the whole molecule
polymer - 8 last atoms are suggested to be screening
Me- groups and will be removed for cube eval
--rdkit_thresh RDKIT_THRESH
Conformers energy difference criteria
--rdkit_nconf RDKIT_NCONF
Max number of conformers rdkit will generate
--rdkit_thresh_keep RDKIT_THRESH_KEEP
Energy window to keep conformers within, kJ/mol
--orca_thresh_keep ORCA_THRESH_KEEP
Energy window to keep conformers within, kJ/mol
--cube_n CUBE_N Dimension of interpolated .cube, points
--cube_spacing CUBE_SPACING
Spacing in interpolated .cube, Angstroem
--cube_aug CUBE_AUG Number of augmented .cubes, max is 25
--extend_cube EXTEND_CUBE
Number of points to average cube density
--orca_path ORCA_PATH
ORCA location, can also be taken from env
--conf_path CONF_PATH
CONFORMERS location, can also be taken from env
- complete documentation
- inference utility
- XYZ (geometry) to ready-to-use NPY (image) automation (for inference)
- convert pressure and temperature back to MPa and K in inference log
- convert predicted values to weight fractions in inference log
- add option to choose printing predicted values in cm3(STP)/cm3
- prune and upload fitted models
- utility for nice inference reports (Matplotlib + LaTeX?)
- option for a lighter (RI-PBE/def2-SVP?) electron density calculation
- test automation
- compactize model definition code
- refactor stuff triggering SettingWithCopyWarning
- on-the-fly translational and rotational augmentation
- sparse convolutions
- unify feature selection
- add ORCA docker
- check smiles.tar.xz for non-minimal polymer repeating units
- check smiles.tar.xz for presence of S-O bonds instead of S=O bonds
- add to qcdata_gen script a check for a single-atom input geometry
- merge solvents 31 and 63 - both are SO2
- disable RDKit import in cube_to_predict mode
- 10.3390/polym14030596. Reason: Table 2 disagrees with Figure 11.
- 10.1002/pen.760211409. Reason: microsphere samples seem to be too micro =)
- 10.1016/j.supflu.2022.105534. Reason: In section 2.3.1 it is stated, that the (P and T independent) integral molar absorptivity was used, while in section 2.3.2.2 and Figure 5 the (P and T dependent?) intensity based extinction coefficient is clearly used.
- 10.1002/polb.22209. Reason: the measured data deviates from other sources by up to 100%.
- 10.1080/19443994.2014.981931. Reason: LiCl was not removed from mPBI samples
- 10.1021/ma302155j. Reason: temperature is specified as "room temperature"
- 10.1016/j.supflu.2013.12.021. Reason: PEG melting temperatures are quite likely wrongly determined to fall in range 410-440 K, sorption isotherms have strange discontinuities
- 10.1002/polb.1181. Dual mode sorption parameters are apparently incorrect