# Group 9 - Omkar Chaudhari, Akshay Gurumoorthi, Rishabh Puri, Matthew Too

# Property Driven Molecule Generation using Conditional Normalizing Flows


## Abstract
This work presents a method for generating molecules with target properties using conditional
normalising flows, and validates the properties of the generated molecules with xtb and ORCA.
An autoregressive normalising flow is trained on the QM9 dataset using TensorFlow and DeepChem,
with each molecule represented as the one-hot encoding of its SELFIES string. During training,
the model is conditioned on the DFT-computed properties of the molecules in the dataset. New
molecules are then sampled through the learned bijections by passing a condition vector that
holds the properties of interest. The SELFIES strings output by the model are converted back to
SMILES, written to ORCA input files using RDKit and Python, and validated with ORCA. This tests
whether conditioning can direct the model towards generating molecules with desired properties.
Such models could be useful for generating new molecules in domains such as drug discovery, the
semiconductor industry, and renewable materials.


## Code Citation

The base code is taken from DeepChem's normalising flow tutorial: https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Training_a_Normalizing_Flow_on_QM9.ipynb

The initial SMILES processing is the same as in the tutorial, but the model implementation has been changed and was written by us. Conditioning is not implemented in the DeepChem library's normalising flow model.

Additionally, we have added more parameter controls, early stopping, and DFT-conditioned sampling from the model. DeepChem's `nfm.flow.sample()` cannot draw conditioned samples, so we use TensorFlow Probability's `tfd.TransformedDistribution.sample` to sample conditioned molecules from the learned chain of bijectors.
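
To make the idea of conditioned sampling concrete, here is a minimal pure-Python sketch of a one-dimensional conditional affine bijector. It is a toy stand-in, not the model trained in this notebook: `shift_and_log_scale` is a hypothetical hand-written conditioner, whereas the real model learns this mapping with a masked autoregressive network. The sketch shows how the same base noise maps to different samples under different conditions, and how the log-density follows from the change of variables.

```python
import math
import random

def shift_and_log_scale(condition):
    """Hypothetical conditioner: maps a scalar condition to (shift, log_scale)."""
    return 2.0 * condition, 0.5 * condition

def forward(z, condition):
    """Sampling direction: base noise z -> data x, given the condition."""
    shift, log_scale = shift_and_log_scale(condition)
    return z * math.exp(log_scale) + shift

def inverse(x, condition):
    """Density-evaluation direction: data x -> base noise z."""
    shift, log_scale = shift_and_log_scale(condition)
    return (x - shift) * math.exp(-log_scale)

def log_prob(x, condition):
    """log p(x | c) = log N(z; 0, 1) - log_scale, with z = inverse(x, c)."""
    _, log_scale = shift_and_log_scale(condition)
    z = inverse(x, condition)
    base_log_prob = -0.5 * (z * z + math.log(2.0 * math.pi))
    return base_log_prob - log_scale

def sample(condition, rng):
    """Draw standard-normal noise and push it through the conditioned bijector."""
    return forward(rng.gauss(0.0, 1.0), condition)

rng = random.Random(0)
print([round(sample(-1.0, rng), 3) for _ in range(3)])  # centred near -2
print([round(sample(1.0, rng), 3) for _ in range(3)])   # centred near +2
```

Training minimises the negative of `log_prob` over the data; sampling only needs `forward` plus the condition, which is the role the condition vector plays in the sampling cell later in the notebook.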



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
import os
os.environ["TF_USE_LEGACY_KERAS"] = "1"

In [None]:

!pip install --pre deepchem
!pip install selfies


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

import deepchem as dc
from deepchem.data import NumpyDataset
from deepchem.splits import RandomSplitter
from tensorflow.keras.callbacks import EarlyStopping

import rdkit
from rdkit import Chem

from IPython.display import Image, display

import selfies as sf

import tensorflow as tf
import tensorflow_probability as tfp

from rdkit.Chem.Fingerprints.FingerprintMols import FingerprintMol
from rdkit.DataStructs import FingerprintSimilarity

from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

tfd = tfp.distributions
tfb = tfp.bijectors
tfk = tf.keras

tfk.backend.set_floatx('float64')

In [None]:
tasks, datasets, transformers = dc.molnet.load_qm9(featurizer='ECFP')
qm9_dataset = datasets[0]

In [None]:
print(qm9_dataset.y[0])

In [None]:
qm9_dataset.y

In [None]:
task_names = qm9_dataset.get_task_names()
print(task_names)

In [None]:
zero_mask = (qm9_dataset.w == 0)
zero_count = np.sum(zero_mask)
print(zero_count)

In [None]:
dataset_size = 20000
df = pd.DataFrame(data={'smiles': qm9_dataset.ids})
dft_features = pd.DataFrame(data=qm9_dataset.y, columns=task_names)
data = pd.concat([df[['smiles']], dft_features], axis=1).sample(dataset_size, random_state=42)
data.reset_index(inplace=True, drop=True)

In [None]:
data

In [None]:
sampled_data = data.sample(1, random_state=42).reset_index(drop=True)
sampled_data

In [None]:
def preprocess_smiles(smiles):
    if '.' in smiles:
        return None
    return sf.encoder(smiles)

def keys_int(symbol_to_int):
    return {i: key for i, key in enumerate(symbol_to_int.keys())}

In [None]:


data['selfies'] = data['smiles'].apply(preprocess_smiles)
data.dropna(subset=['selfies'], inplace=True)
data.reset_index(drop=True, inplace=True)

data['len'] = data['smiles'].str.len()
data.sort_values(by='len', inplace=True)


constraints = sf.get_semantic_constraints()
constraints['?'] = 3
sf.set_semantic_constraints(constraints)

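# Print the constraints now in effect (calling set_semantic_constraints() with
# no arguments would reset them to the defaults and return None)
print(sf.get_semantic_constraints())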

In [None]:
selfies_list = data['selfies'].to_numpy()
selfies_alphabet = sf.get_alphabet_from_selfies(selfies_list)
selfies_alphabet.add('[nop]')
selfies_alphabet = sorted(selfies_alphabet)

largest_selfie_len = max(sf.len_selfies(s) for s in selfies_list)
symbol_to_int = {c: i for i, c in enumerate(selfies_alphabet)}
onehots = sf.batch_selfies_to_flat_hot(selfies_list, symbol_to_int, largest_selfie_len)
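
The cell above pads every SELFIES string to the same length with `[nop]` and flattens the per-symbol one-hot rows into one long vector. The sketch below is an illustrative pure-Python reimplementation of that layout on a toy alphabet. It is not the `selfies` library's code, and the `toy_` names and helper functions are made up for this example.

```python
def selfies_symbols(selfies):
    """Split a SELFIES string such as '[C][=O]' into ['[C]', '[=O]']."""
    return ["[" + part for part in selfies.split("[") if part]

def to_flat_hot(selfies, symbol_to_int, pad_to_len):
    """Encode one SELFIES string as a flattened one-hot list, padded with [nop]."""
    symbols = selfies_symbols(selfies)
    symbols += ["[nop]"] * (pad_to_len - len(symbols))
    flat = []
    for symbol in symbols:
        row = [0] * len(symbol_to_int)
        row[symbol_to_int[symbol]] = 1
        flat.extend(row)
    return flat

def from_flat_hot(flat, int_to_symbol):
    """Decode a flattened one-hot list; this toy decoder drops [nop] padding."""
    width = len(int_to_symbol)
    symbols = []
    for start in range(0, len(flat), width):
        row = flat[start:start + width]
        symbol = int_to_symbol[row.index(max(row))]
        if symbol != "[nop]":
            symbols.append(symbol)
    return "".join(symbols)

toy_alphabet = sorted({"[C]", "[O]", "[=O]", "[nop]"})
toy_symbol_to_int = {s: i for i, s in enumerate(toy_alphabet)}
toy_int_to_symbol = {i: s for s, i in toy_symbol_to_int.items()}

encoded = to_flat_hot("[C][=O]", toy_symbol_to_int, pad_to_len=4)
print(len(encoded))                               # 4 positions x 4 symbols = 16
print(from_flat_hot(encoded, toy_int_to_symbol))  # [C][=O]
```

Because `[nop]` starts with a lowercase letter, it sorts after the other symbols in this toy alphabet and takes the last column of each row.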

In [None]:
dft_features_values = data[task_names].to_numpy(dtype='float64')

# Uniform dequantization: add noise in [0, 1) to every one-hot entry so the flow
# models a continuous density; flooring a sample recovers the discrete value.
onehot_tensor = tf.convert_to_tensor(onehots, dtype='float64')
noise_tensor = tf.random.uniform(shape=onehot_tensor.shape, minval=0, maxval=1, dtype='float64')
input_tensor = tf.add(onehot_tensor, noise_tensor)

# DFT properties used as the conditioning input
dft_tensor = tf.convert_to_tensor(dft_features_values, dtype='float64')
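
The noise added above is what lets a continuous flow be trained on discrete one-hot data. As a small sketch in pure Python (independent of TensorFlow, with made-up helper names), adding uniform noise in [0, 1) and later applying floor and clip returns the original one-hot vector, which is the same recovery step the sampling code uses.

```python
import math
import random

def dequantize(onehot, rng):
    """Add independent uniform noise in [0, 1) to each entry."""
    return [value + rng.random() for value in onehot]

def requantize(values):
    """Floor each entry and clip it to {0, 1}."""
    return [min(max(math.floor(v), 0), 1) for v in values]

rng = random.Random(42)
toy_onehot = [0, 1, 0, 0, 0, 0, 1, 0]
noisy = dequantize(toy_onehot, rng)
recovered = requantize(noisy)
print(recovered == toy_onehot)  # True
```

The clip matters only for model samples, which can fall outside [0, 2); dequantized training data never does.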

In [None]:
dataset = NumpyDataset(X=input_tensor, y=dft_tensor)
splitter = RandomSplitter()
train, val, test = splitter.train_valid_test_split(dataset=dataset, seed=42)
print("Training, validation, and test splits created.")

In [None]:
dim_smiles = input_tensor.shape[-1]
dim_dft = dft_tensor.shape[-1]

In [None]:
event_shape = (dim_smiles,)
conditional_event_shape = (dim_dft,)

In [None]:


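num_layers = 12
hidden_units = [128, 128]
flow_layers = []

# Conditional MADE network. Note: this single instance is passed to every MAF
# layer below, so all layers share its weights.
Made = tfb.AutoregressiveNetwork(
    params=2,
    hidden_units=hidden_units,
    activation='relu',
    event_shape=event_shape,
    conditional=True,
    conditional_event_shape=conditional_event_shape,
    conditional_input_layers='all_layers',
)

for i in range(num_layers):
    # The name 'maf{i}' is what make_bijector_kwargs matches to route the condition
    flow_layers.append(tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=Made, name=f'maf{i}'))
    # Random (unseeded) permutation so each layer sees a different variable ordering
    flow_layers.append(tfb.Permute(tf.cast(np.random.permutation(np.arange(0, dim_smiles)), tf.int32)))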

print("Flow layers defined.")

base_dist = tfd.MultivariateNormalDiag(loc=tf.zeros(dim_smiles, dtype=tf.float64), scale_diag=tf.ones(dim_smiles, dtype=tf.float64))
chain_bijector = tfb.Chain(list(reversed(flow_layers)))
distribution = tfd.TransformedDistribution(
    distribution=base_dist,
    bijector=chain_bijector
)

import re

def make_bijector_kwargs(bijector, name_to_kwargs):
    """Build the nested bijector_kwargs dict, matching bijector names against regexes."""
    if hasattr(bijector, 'bijectors'):
        return {b.name: make_bijector_kwargs(b, name_to_kwargs) for b in bijector.bijectors}
    for name_regex, kwargs in name_to_kwargs.items():
        if re.match(name_regex, bijector.name):
            return kwargs
    return {}

# Construct and compile the conditional normalizing flow model
x_ = tfk.Input(shape=(dim_smiles,), dtype=tf.float64)
c_ = tfk.Input(shape=(dim_dft,), dtype=tf.float64)
log_prob_ = distribution.log_prob(x_, bijector_kwargs=make_bijector_kwargs(chain_bijector, {'maf.': {'conditional_input': c_}}))
model = tfk.Model([x_, c_], log_prob_)

model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.0001),
              loss=lambda _, log_prob: -log_prob)

early_stopping = EarlyStopping(
    monitor='val_loss',       # Monitor validation loss
    patience=5,               # Number of epochs with no improvement before stopping
    restore_best_weights=True # Restore the best weights after stopping
)

batch_size = 128
max_epochs = 100
n = len(train.X)
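
To see what `make_bijector_kwargs` produces without building a TensorFlow graph, the sketch below runs the same function on stand-in objects. `SimpleNamespace` is used here only to mimic the `name` and `bijectors` attributes; the names in `toy_chain` are illustrative and not read from the real chain.

```python
import re
from types import SimpleNamespace

def make_bijector_kwargs(bijector, name_to_kwargs):
    """Same logic as the notebook's helper, repeated so the sketch is self-contained."""
    if hasattr(bijector, 'bijectors'):
        return {b.name: make_bijector_kwargs(b, name_to_kwargs) for b in bijector.bijectors}
    for name_regex, kwargs in name_to_kwargs.items():
        if re.match(name_regex, bijector.name):
            return kwargs
    return {}

# Stand-in for a chain of alternating MAF and permutation bijectors
toy_chain = SimpleNamespace(
    name="chain",
    bijectors=[
        SimpleNamespace(name="permute_1"),
        SimpleNamespace(name="maf1"),
        SimpleNamespace(name="permute"),
        SimpleNamespace(name="maf0"),
    ],
)

routed = make_bijector_kwargs(toy_chain, {"maf.": {"conditional_input": "c"}})
print(routed)
```

Only the entries whose names start with `maf` followed by one more character receive the `conditional_input`; the permutation layers get empty kwargs.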

In [None]:
# Training the model


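# Standardise the conditioning properties with training-set statistics and
# apply the same transform to the validation set.
y_mean = np.mean(train.y, axis=0)
y_std = np.std(train.y, axis=0)

history = model.fit(
    x=[train.X, (train.y - y_mean) / y_std],
    y=np.zeros((n, 0), dtype=np.float64),
    batch_size=batch_size,
    epochs=max_epochs,
    steps_per_epoch=n // batch_size,
    shuffle=True,
    verbose=True,
    validation_data=(
        [val.X, (val.y - y_mean) / y_std],
        np.zeros((len(val.X), 0), dtype=np.float64),
    ),
    callbacks=[early_stopping]
)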


In [None]:
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Plot the training and validation loss
plt.figure(figsize=(8, 6))
plt.plot(train_loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.grid()
plot_filename = f'training_validation_loss_layers{num_layers}_units{hidden_units}.png'
plt.savefig(plot_filename, dpi=300, bbox_inches='tight')
plt.show()

In [None]:
from rdkit.Chem import Draw

def sample_and_save(dataset, num_samples=5):
    # Convert dataset to a NumPy array with float64 for numerical operations
    dataset = np.array(dataset, dtype=object)  # Convert to a general NumPy array first
    smiles = dataset[:, 0]  # Extract the SMILES strings
    features = dataset[:, 1:].astype('float64')  # Convert the rest to float64 for numerical processing
    
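    # Column names for DFT conditioning values
    dft_columns = ['mu', 'alpha', 'homo', 'lumo', 'gap', 'r2', 'zpve', 'cv', 'u0', 'u298', 'h298', 'g298']
    # Standardise with the training-set statistics used in model.fit. Computing
    # them from `dataset` itself gives a standard deviation of 0 (and NaN
    # conditions) when it holds a single row, as in the call below.
    conditioning_mean = np.mean(train.y, axis=0)
    conditioning_std = np.std(train.y, axis=0)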
    
    results = []

    for vector_num, (smile, conditioning_values) in enumerate(zip(smiles, features)):
        conditioning_values_scaled = (conditioning_values - conditioning_mean) / conditioning_std
        
        condition_vector = tf.convert_to_tensor(
            np.tile(conditioning_values_scaled, (num_samples, 1)),
            dtype=tf.float64
        )

        sampled_smiles = distribution.sample(
            (num_samples,),
            bijector_kwargs=make_bijector_kwargs(chain_bijector, {'maf.': {'conditional_input': condition_vector}})
        )
        sampled_smiles = tf.math.floor(sampled_smiles)
        sampled_smiles = tf.clip_by_value(sampled_smiles, 0, 1)
        mols_list = sampled_smiles.numpy().tolist()
        
        for mol in mols_list:
            for j in range(largest_selfie_len):
                row = mol[len(selfies_alphabet) * j: len(selfies_alphabet) * (j + 1)]
                if all(elem == 0 for elem in row):
                    mol[len(selfies_alphabet) * (j+1) - 1] = 1

        int_mol = keys_int(symbol_to_int)
        generated_selfies = sf.batch_flat_hot_to_selfies(mols_list, int_mol)
        valid_selfies, valid_smiles = [], []

        for selfies in generated_selfies:
            try:
                # Use a separate name so the outer `smiles` array is not shadowed
                decoded_smiles = sf.decoder(selfies)
                if Chem.MolFromSmiles(decoded_smiles, sanitize=True) is not None:
                    valid_selfies.append(selfies)
                    valid_smiles.append(decoded_smiles)
            except Exception:
                continue

        gen_mols = [Chem.MolFromSmiles(sm) for sm in valid_smiles]
        original_mol = Chem.MolFromSmiles(smile)

        # Compute Tanimoto similarity between generated molecules and the original molecule
        def tanimoto_similarity(query_mol, generated_mols):
            query_fp = FingerprintMol(query_mol)
            similarities = []
            for gen_mol in generated_mols:
                if gen_mol:
                    gen_fp = FingerprintMol(gen_mol)
                    similarities.append(FingerprintSimilarity(query_fp, gen_fp))
                else:
                    similarities.append(None)  # Handle invalid molecules
            return similarities

        similarities = tanimoto_similarity(original_mol, gen_mols)

        # Save the image for the original SMILES
        if original_mol:
            original_filename = f"original_layers{num_layers}_units{hidden_units}_vector{vector_num}.png"
            Draw.MolToFile(original_mol, original_filename)

        # Save images for the generated molecules
        for sample_num, mol in enumerate(gen_mols):
            if mol:
                generated_filename = f"generated_layers{num_layers}_units{hidden_units}_sample{sample_num + 1}_vector{vector_num}.png"
                Draw.MolToFile(mol, generated_filename)

        # Append results for each molecule
        results.append({
            "Original_SMILES": smile,
            **{col: val for col, val in zip(dft_columns, conditioning_values)},
            **{f"Generated_SMILES_{i+1}": valid_smiles[i] if i < len(valid_smiles) else None for i in range(num_samples)},
            **{f"Similarity_Score_{i+1}": round(similarities[i], 3) if i < len(similarities) and similarities[i] is not None else None for i in range(num_samples)},
        })

    # Convert results into a DataFrame
    df = pd.DataFrame(results)
    output_file = f"generated_molecules_layers{num_layers}_units{hidden_units}.xlsx"
    df.to_excel(output_file, index=False)
    print(f"Results and images saved to {output_file}")
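
The similarity scores saved above come from RDKit's `FingerprintSimilarity`, whose default metric is Tanimoto. As a small illustration of that metric itself, here is a pure-Python sketch on fingerprints represented as sets of on-bit indices. The bit sets are made up and do not correspond to any real molecule.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union

fp_query = {1, 4, 7, 9}
fp_generated = {1, 4, 8}
print(round(tanimoto(fp_query, fp_generated), 3))  # 2 shared bits / 5 total = 0.4
```

A score of 1.0 means the two fingerprints are identical, and 0.0 means they share no on-bits.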


In [None]:
sample_and_save(sampled_data)
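
The abstract mentions writing ORCA input files for the generated molecules, a step this notebook does not include. The sketch below shows one way such a file could be assembled once 3D coordinates are available. The function name `build_orca_input`, the keyword line `B3LYP def2-SVP Opt`, and the water coordinates are all placeholder assumptions for illustration, not the settings used in this work. In practice the symbols and coordinates would come from a 3D embedding of the generated SMILES (for example with RDKit), which is omitted here.

```python
def build_orca_input(symbols, coords, charge=0, multiplicity=1,
                     keywords="B3LYP def2-SVP Opt"):
    """Return the text of a simple ORCA input file with Cartesian coordinates.

    The keyword line is an assumed example; replace it with the method,
    basis set, and job type actually required.
    """
    lines = [f"! {keywords}", "", f"* xyz {charge} {multiplicity}"]
    for symbol, (x, y, z) in zip(symbols, coords):
        lines.append(f"{symbol:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    lines.append("*")
    return "\n".join(lines) + "\n"

# Approximate water geometry in Angstrom, used only as a placeholder
water_symbols = ["O", "H", "H"]
water_coords = [
    (0.000000, 0.000000, 0.117300),
    (0.000000, 0.757200, -0.469200),
    (0.000000, -0.757200, -0.469200),
]

orca_text = build_orca_input(water_symbols, water_coords)
print(orca_text)
```

The returned string can be written to a `.inp` file, one per generated molecule, for example by reading the SMILES back from the spreadsheet saved by `sample_and_save`.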