# RNN-based molecule generation

Laurent Cetinsoy

In this hands-on we want to generate molecule formulas for de novo drug discovery.

For that we need generative models. Generative models go beyond classification or simple regression: they are able to generate data that look like the data they were trained on.

There are many families of generative models:

- Bayesian models, such as graphical models
- Recurrent models (for sequence generation, such as text)
- Variational autoencoders
- Generative adversarial networks
- Flow and diffusion models


In this hands-on we will start by training a character-based RNN to generate molecules written as SMILES strings.


We want to feed SMILES representations of molecules to an RNN.
The basic idea is to train it to predict the next SMILES token of a molecule given the previous ones.

For instance, from the molecule "CC(=O)NC1=CC=C(O)C=C1" we may give the model

X = "CC(=O)N"
y = "C"

and ask the RNN to learn to predict y given X,

like a standard language model!
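To make the idea concrete, here is a minimal sketch that lists every (prefix, next character) pair of the example molecule. The helper name `next_token_pairs` is an assumption made for this illustration; it is not part of the notebook.

```python
def next_token_pairs(smiles):
    """Return every (prefix, next_character) pair of a SMILES string."""
    return [(smiles[:i], smiles[i]) for i in range(1, len(smiles))]


pairs = next_token_pairs("CC(=O)NC1=CC=C(O)C=C1")
# Show the first few training pairs; the seventh one is the (X, y) pair above.
for prefix, target in pairs[:7]:
    print(repr(prefix), "->", repr(target))
```

Each molecule of length n therefore yields n - 1 training examples, which is why even a small set of molecules produces many (X, y) pairs.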


## RNN Language model


A language model is a model which predicts the next token of a sequence given the previous ones:

$ P(X_t | X_{t-1}, X_{t-2}, ..., X_{t-p})  $


This model can be learned with a recurrent neural network:

$ y = P(X_t | X_{t-1}, X_{t-2}, ..., X_{t-p}) = RNN_{\theta} (X_{t-1}, X_{t-2}, ..., X_{t-p})  $
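Applied token by token, these conditionals give the probability of a whole sequence through the chain rule. With the same notation, and assuming for simplicity that the model conditions on all previous tokens rather than only the last $p$:

$ P(X_1, ..., X_T) = \prod_{t=1}^{T} P(X_t | X_{t-1}, ..., X_1) $

Training then typically amounts to minimizing the negative log-likelihood of the corpus, which corresponds to the cross-entropy loss used when compiling the model later in this notebook:

$ L(\theta) = - \sum_{t=1}^{T} \log P_{\theta}(X_t | X_{t-1}, ..., X_1) $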


In order to train such a model you need a corpus of data.

There are two main ways to tokenize it: word-level models or character-level models.

For character-level models, an interesting resource is: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Explain briefly the difference between a word-based language model and a character-based language model.
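As a hint, the two tokenization schemes can be compared on a toy sentence. The sentence below is a made-up example, not taken from the dataset.

```python
text = "the cat sat on the mat"

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()
# Character-level: one token per character, spaces included.
char_tokens = list(text)

# Sequence length and vocabulary for each scheme.
print(len(word_tokens), sorted(set(word_tokens)))
print(len(char_tokens), sorted(set(char_tokens)))
```

The character-level version has a much smaller vocabulary but longer sequences. For SMILES strings the character level is the natural starting point, with the caveat that two-letter atoms such as Cl or Br are split into two tokens.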

## Loading the data

Download the following dataset: https://github.com/joeymach/Leveraging-VAE-to-generate-molecules

In [1]:
!unzip /content/Leveraging-VAE-to-generate-molecules-master.zip

Archive:  /content/Leveraging-VAE-to-generate-molecules-master.zip
bdb6ecc45027b920d97e85e1a2b3ab8945759792
   creating: Leveraging-VAE-to-generate-molecules-master/
  inflating: Leveraging-VAE-to-generate-molecules-master/250k_smiles.csv  
  inflating: Leveraging-VAE-to-generate-molecules-master/README.md  
  inflating: Leveraging-VAE-to-generate-molecules-master/VAE_model_250k.ipynb  


Import pandas and load the first 1000 rows

In [91]:
import numpy as np
import pandas as pd

In [3]:
path = "Leveraging-VAE-to-generate-molecules-master/250k_smiles.csv"
df = pd.read_csv(path, nrows=1000)

Display the first rows of the dataframe

In [4]:
df.head()

Unnamed: 0,smiles,logP,qed,SAS
0,CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1\n,5.0506,0.702012,2.084095
1,C[C@@H]1CC(Nc2cncc(-c3nncn3C)c2)C[C@@H](C)C1\n,3.1137,0.928975,3.432004
2,N#Cc1ccc(-c2ccc(O[C@@H](C(=O)N3CCCC3)c3ccccc3)...,4.96778,0.599682,2.470633
3,CCOC(=O)[C@@H]1CCCN(C(=O)c2nc(-c3ccc(C)cc3)n3c...,4.00022,0.690944,2.822753
4,N#CC1=C(SCC(=O)Nc2cccc(Cl)c2)N=C([O-])[C@H](C#...,3.60956,0.789027,4.035182


## Processing the data

We need to do the following things:

- convert SMILES tokens to numbers
- build SMILES token sequences and the corresponding label pairs

Compute the length of the longest SMILES string

In [16]:
max_len = df['smiles'].str.len().max()


Code a function **unic_characters(string)** which returns the unique characters in a string


In [30]:
def unic_characters(string):
    return sorted(list(set(string)))

Concatenate all SMILES strings of the pandas dataframe and use **unic_characters** to get the unique characters

In [33]:
all_smiles = ''.join(df['smiles'])
unique_chars = unic_characters(all_smiles)
unique_chars

['\n',
 '#',
 '(',
 ')',
 '+',
 '-',
 '/',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '=',
 '@',
 'B',
 'C',
 'F',
 'H',
 'I',
 'N',
 'O',
 'S',
 '[',
 '\\',
 ']',
 'c',
 'l',
 'n',
 'o',
 'r',
 's']

Code a function **map_char_to_int(unic_chars)** which returns a dictionary where each character is assigned an integer value.
Add a character to mark the end of the molecule (like "\n")


Code a function **map_int_to_char(unic_chars)** which returns the reverse mapping.

If you want, you can merge both functions in a class

For each SMILES molecule, add the ending token to it (in this dataset every SMILES string already ends with "\n")

In [34]:
class MapClass:
    def __init__(self, unic_chars):
        self.unic_characters = unic_chars

    def map_char_to_int(self):
        char_to_int = dict((c, i) for i, c in enumerate(self.unic_characters))
        return char_to_int

    def map_int_to_char(self):
        int_to_char = dict((i, c) for i, c in enumerate(self.unic_characters))
        return int_to_char

In [55]:
map_class = MapClass(unique_chars)

In [56]:
char_to_int = map_class.map_char_to_int()
print(char_to_int)

{'\n': 0, '#': 1, '(': 2, ')': 3, '+': 4, '-': 5, '/': 6, '1': 7, '2': 8, '3': 9, '4': 10, '5': 11, '6': 12, '7': 13, '=': 14, '@': 15, 'B': 16, 'C': 17, 'F': 18, 'H': 19, 'I': 20, 'N': 21, 'O': 22, 'S': 23, '[': 24, '\\': 25, ']': 26, 'c': 27, 'l': 28, 'n': 29, 'o': 30, 'r': 31, 's': 32}


In [57]:
int_to_char = map_class.map_int_to_char()
int_to_char

{0: '\n',
 1: '#',
 2: '(',
 3: ')',
 4: '+',
 5: '-',
 6: '/',
 7: '1',
 8: '2',
 9: '3',
 10: '4',
 11: '5',
 12: '6',
 13: '7',
 14: '=',
 15: '@',
 16: 'B',
 17: 'C',
 18: 'F',
 19: 'H',
 20: 'I',
 21: 'N',
 22: 'O',
 23: 'S',
 24: '[',
 25: '\\',
 26: ']',
 27: 'c',
 28: 'l',
 29: 'n',
 30: 'o',
 31: 'r',
 32: 's'}

In [58]:
def smile_to_int(smile):
    encoded_smile = [char_to_int[char] for char in smile]
    return encoded_smile


def int_to_smile(encoded_smile):
    smile = ''.join([int_to_char[i] for i in encoded_smile])
    return smile

In [59]:
df['encoded_smiles'] = df['smiles'].apply(smile_to_int)
df['decoded_smiles'] = df['encoded_smiles'].apply(int_to_smile)
assert (df['smiles'] == df['decoded_smiles']).all()

In [60]:
input_sequences = []
output_labels = []

for encoded_smile in df['encoded_smiles']:
    for i in range(1, len(encoded_smile)):
        input_sequence = encoded_smile[:i]
        output_label = encoded_smile[i]
        input_sequences.append(input_sequence)
        output_labels.append(output_label)

In [61]:
df

Unnamed: 0,smiles,logP,qed,SAS,encoded_smiles,decoded_smiles
0,CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1\n,5.05060,0.702012,2.084095,"[17, 17, 2, 17, 3, 2, 17, 3, 27, 7, 27, 27, 27...",CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1\n
1,C[C@@H]1CC(Nc2cncc(-c3nncn3C)c2)C[C@@H](C)C1\n,3.11370,0.928975,3.432004,"[17, 24, 17, 15, 15, 19, 26, 7, 17, 17, 2, 21,...",C[C@@H]1CC(Nc2cncc(-c3nncn3C)c2)C[C@@H](C)C1\n
2,N#Cc1ccc(-c2ccc(O[C@@H](C(=O)N3CCCC3)c3ccccc3)...,4.96778,0.599682,2.470633,"[21, 1, 17, 27, 7, 27, 27, 27, 2, 5, 27, 8, 27...",N#Cc1ccc(-c2ccc(O[C@@H](C(=O)N3CCCC3)c3ccccc3)...
3,CCOC(=O)[C@@H]1CCCN(C(=O)c2nc(-c3ccc(C)cc3)n3c...,4.00022,0.690944,2.822753,"[17, 17, 22, 17, 2, 14, 22, 3, 24, 17, 15, 15,...",CCOC(=O)[C@@H]1CCCN(C(=O)c2nc(-c3ccc(C)cc3)n3c...
4,N#CC1=C(SCC(=O)Nc2cccc(Cl)c2)N=C([O-])[C@H](C#...,3.60956,0.789027,4.035182,"[21, 1, 17, 17, 7, 14, 17, 2, 23, 17, 17, 2, 1...",N#CC1=C(SCC(=O)Nc2cccc(Cl)c2)N=C([O-])[C@H](C#...
...,...,...,...,...,...,...
995,ClCCc1nc2cccnc2n1CCn1cccn1\n,2.10930,0.672897,2.506188,"[17, 28, 17, 17, 27, 7, 29, 27, 8, 27, 27, 27,...",ClCCc1nc2cccnc2n1CCn1cccn1\n
996,CC[C@@](C)([C@@H]([NH3+])c1cc(Br)ccc1F)N1CCOCC1\n,2.37210,0.910031,3.884291,"[17, 17, 24, 17, 15, 15, 26, 2, 17, 3, 2, 24, ...",CC[C@@](C)([C@@H]([NH3+])c1cc(Br)ccc1F)N1CCOCC1\n
997,Cc1ccc(NC(=O)c2cc3ccccc3oc2=O)c([N+](=O)[O-])c1\n,3.26192,0.444910,1.998848,"[17, 27, 7, 27, 27, 27, 2, 21, 17, 2, 14, 22, ...",Cc1ccc(NC(=O)c2cc3ccccc3oc2=O)c([N+](=O)[O-])c1\n
998,CC1(C)OC[C@H]([C@H]2O[C@@H]3OC(C)(C)O[C@@H]3[C...,0.35910,0.696755,4.270988,"[17, 17, 7, 2, 17, 3, 22, 17, 24, 17, 15, 19, ...",CC1(C)OC[C@H]([C@H]2O[C@@H]3OC(C)(C)O[C@@H]3[C...


## Building the dataset

Now we will create the dataset so that it has the right shape for our Keras LSTM model

Remember that Keras recurrent models expect a 3D array of shape (n_examples, seq_len, n_features)



What will n_features be in our case?

n_features will be 1.

Each element of the input sequence is a single integer representing a character. The RNN processes these integers one time step at a time.
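For comparison, the sketch below encodes the same string in two ways: as integers (one feature per time step, the choice made here) and as one-hot vectors, a common alternative in which n_features equals the vocabulary size. The toy vocabulary and the helper names are assumptions made for this illustration.

```python
vocab = ["\n", "(", ")", "=", "C", "N", "O"]  # toy vocabulary, not the full one
char_to_index = {c: i for i, c in enumerate(vocab)}


def encode_as_integers(smiles):
    # One feature per time step: shape (seq_len, 1).
    return [[char_to_index[c]] for c in smiles]


def encode_one_hot(smiles):
    # One feature per vocabulary entry: shape (seq_len, len(vocab)).
    rows = []
    for c in smiles:
        row = [0] * len(vocab)
        row[char_to_index[c]] = 1
        rows.append(row)
    return rows


as_int = encode_as_integers("CC(=O)N")
as_one_hot = encode_one_hot("CC(=O)N")
print(len(as_int), len(as_int[0]))          # seq_len, n_features
print(len(as_one_hot), len(as_one_hot[0]))  # seq_len, n_features
```

The integer encoding is compact but imposes an arbitrary ordering on the characters; the one-hot encoding avoids that at the cost of a wider input.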


Code a function **build_X_and_y(string, i_char, seq_length)** which takes a string, a position **i_char** and a **seq_length**.


It should create X by taking all characters between i_char and i_char + seq_length,
create y by taking the character that follows the X sequence,
and return X and y.

Test your function on the following string "" with seq_length = 4 and i = [1, 2, 3]

In [62]:
def build_X_and_y(string, i_char, seq_length):
    X = string[i_char:i_char + seq_length]
    y = string[i_char + seq_length]
    return X, y


build_X_and_y("testicules", i_char=1, seq_length=4)

('esti', 'c')
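The same function can be slid along a string by varying `i_char`. The sketch below repeats the definition so that it runs on its own, and uses a short SMILES fragment chosen here as a hypothetical test string.

```python
def build_X_and_y(string, i_char, seq_length):
    X = string[i_char:i_char + seq_length]
    y = string[i_char + seq_length]
    return X, y


example = "CC(=O)NC"  # hypothetical test string
# One window per starting position; each y is the character right after X.
windows = [build_X_and_y(example, i_char=i, seq_length=4) for i in [1, 2, 3]]
print(windows)
```

Note that the function raises an `IndexError` as soon as `i_char + seq_length` reaches the end of the string, so the last valid position is `len(string) - seq_length - 1`.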

Using build_X_and_y and map_char_to_int, build a list named X_train and a list named y_train

In [88]:
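X_train, y_train = [], []
for smiles in df['smiles']:
    # Caution: this reassigns the global max_len computed earlier, so after
    # the loop it holds the length of the last molecule, not the maximum.
    max_len = len(smiles)
    # Prefix lengths 0 .. max_len - 2: X is smiles[:i], y is smiles[i].
    # The trailing '\n' is therefore never used as a target.
    for i in range(max_len - 1):
        X, y = build_X_and_y(smiles, 0, i)
        X = [char_to_int[char] for char in X]
        y = char_to_int[y]
        X_train.append(X)
        y_train.append(y)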

Create numpy arrays from the lists

In [93]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [94]:
X_train = pad_sequences(X_train, padding='pre')

In [95]:
X_train = np.array(X_train)
y_train = np.array(y_train)
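One detail worth checking at this point: `pad_sequences` pads with 0 by default, and in the mapping built earlier 0 is also the index of "\n", so padding and the end token become indistinguishable. A possible way around this, sketched below in plain Python, is to reserve index 0 for padding and start the real characters at 1. The helpers `build_vocab` and `pad_pre` are hypothetical names for this illustration; `pad_pre` only imitates left padding and is not the Keras function.

```python
PAD = 0  # index reserved for padding


def build_vocab(chars):
    # Start real characters at 1 so that 0 is never a real token.
    return {c: i + 1 for i, c in enumerate(sorted(set(chars)))}


def pad_pre(sequence, maxlen, value=PAD):
    # Keep the last maxlen items, then left-pad up to maxlen.
    sequence = sequence[-maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence


vocab = build_vocab("CC(=O)N\n")
encoded = [vocab[c] for c in "CC(\n"]
padded = pad_pre(encoded, maxlen=6)
print(padded)
```

With such a shifted mapping, the output layer would need `len(vocab) + 1` units so that every label index stays in range.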

Reshape the X numpy array to (n_examples, seq_length, 1)

In [97]:
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_train.shape

(44242, 104, 1)

Normalize X by dividing each value by the total number of unique characters

In [99]:
X_train = X_train / len(unique_chars)
X_train

array([[[0.        ],
        [0.        ],
        [0.        ],
        ...,
        [0.        ],
        [0.        ],
        [0.        ]],

       [[0.        ],
        [0.        ],
        [0.        ],
        ...,
        [0.        ],
        [0.        ],
        [0.51515152]],

       [[0.        ],
        [0.        ],
        [0.        ],
        ...,
        [0.        ],
        [0.51515152],
        [0.51515152]],

       ...,

       [[0.        ],
        [0.        ],
        [0.        ],
        ...,
        [0.51515152],
        [0.66666667],
        [0.24242424]],

       [[0.        ],
        [0.        ],
        [0.        ],
        ...,
        [0.66666667],
        [0.24242424],
        [0.09090909]],

       [[0.        ],
        [0.        ],
        [0.        ],
        ...,
        [0.24242424],
        [0.09090909],
        [0.81818182]]])
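To read such an array, the normalization can be undone by multiplying by the vocabulary size and rounding. The sketch below copies the 33 characters found earlier; `denormalize` is a hypothetical helper written for this illustration.

```python
unique_chars = ['\n', '#', '(', ')', '+', '-', '/', '1', '2', '3', '4', '5',
                '6', '7', '=', '@', 'B', 'C', 'F', 'H', 'I', 'N', 'O', 'S',
                '[', '\\', ']', 'c', 'l', 'n', 'o', 'r', 's']


def denormalize(values):
    # Undo the division by the vocabulary size, then look the character up.
    return [unique_chars[round(v * len(unique_chars))] for v in values]


# Three values taken from the array printed above.
print(denormalize([0.51515152, 0.66666667, 0.24242424]))
```

The value 0.51515152 is 17/33, the index of 'C'. The value 0 decodes to "\n", which is also what the padding looks like after normalization.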

Import Keras and build a network with (at least) two LSTM layers of 128 units each.

You can also add Dropout layers

Do you think you should use return_sequences=True? If yes, when?


Add a Dense layer on top with the appropriate activation function and number of units


In [101]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout


def build_model(seq_length):
    model = Sequential()
    model.add(LSTM(128, input_shape=(seq_length, 1), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(len(unique_chars), activation='softmax'))
    return model


model = build_model(X_train.shape[1])



Compile the model with the appropriate loss function and the adam optimizer

In [102]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
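As a sanity check on the loss values to expect, the per-example loss can be written by hand: it is minus the log of the probability assigned to the true class. The function below is a plain-Python sketch of that formula for a single example, not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(probabilities, label):
    # Loss for one example: minus the log-probability given to the true class.
    return -math.log(probabilities[label])


n_classes = 33  # size of the vocabulary found earlier
uniform = [1.0 / n_classes] * n_classes
# An untrained model is close to uniform, so its loss is close to log(33).
print(sparse_categorical_crossentropy(uniform, label=17))
```

A uniform prediction over 33 characters gives log(33), about 3.4965, so a first-epoch loss near 3.5 simply means the model has not learned anything yet.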

Train the model for 20 epochs on 10 examples (yes, you read that correctly) and check that the model overfits!

In [103]:
model.fit(X_train[:10], y_train[:10], epochs=20, batch_size=10)

Epoch 1/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - loss: 3.4996
Epoch 2/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 188ms/step - loss: 3.4725
Epoch 3/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 176ms/step - loss: 3.4558
Epoch 4/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 287ms/step - loss: 3.4282
Epoch 5/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 160ms/step - loss: 3.3942
Epoch 6/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 164ms/step - loss: 3.3339
Epoch 7/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 307ms/step - loss: 3.2616
Epoch 8/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 288ms/step - loss: 3.0505
Epoch 9/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 158ms/step - loss: 2.6794
Epoch 10/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 153ms/step - loss: 2.1625
Epoch 11/20


<keras.src.callbacks.history.History at 0x7f6cc0104400>

If it does not overfit, try to fix the data preparation and the model architecture until it does

In [107]:
def build_model_overfit(seq_length):
    model = Sequential()
    model.add(LSTM(128, input_shape=(seq_length, 1), return_sequences=True))
    model.add(LSTM(128))
    model.add(Dense(len(unique_chars), activation='softmax'))
    return model


model = build_model_overfit(X_train.shape[1])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(X_train[:10], y_train[:10], epochs=20, batch_size=1)

Epoch 1/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 5s 202ms/step - loss: 3.4824
Epoch 2/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 2s 100ms/step - loss: 2.9019
Epoch 3/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 97ms/step - loss: 1.6286
Epoch 4/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 102ms/step - loss: 1.4950
Epoch 5/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 101ms/step - loss: 1.9145
Epoch 6/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 101ms/step - loss: 1.8157
Epoch 7/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 102ms/step - loss: 1.6987
Epoch 8/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 101ms/step - loss: 1.6029
Epoch 9/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 104ms/step - loss: 1.4778
Epoch 10/20
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 114ms/step - los

<keras.src.callbacks.history.History at 0x7f6cbb50bd60>

Create a function **make_prediction(seed_start)** which takes a starting string sequence and uses it to generate a molecule


In [108]:
def make_prediction(seed_start):
    generated_molecule = seed_start
    for i in range(max_len):
        encoded_seed = [char_to_int[char] for char in generated_molecule]
        encoded_seed = pad_sequences([encoded_seed], maxlen=X_train.shape[1], padding='pre')
        encoded_seed = np.array(encoded_seed).reshape(1, X_train.shape[1], 1) / len(unique_chars)

        prediction = model.predict(encoded_seed, verbose=0)[0]

        predicted_index = np.argmax(prediction)
        predicted_char = int_to_char[predicted_index]

        generated_molecule += predicted_char

        if predicted_char == '\n':
            break

    return generated_molecule

Generate a molecule with your overfitted model

In [109]:
make_prediction("CC(=O")

'CC(=OCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC'
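The repeated "C" is typical of greedy decoding: `np.argmax` always picks the most likely character, so the generation easily falls into a loop. A common alternative is to sample from the predicted distribution, optionally sharpened or flattened by a temperature. The function below is a minimal sketch of that idea in plain Python; its name and signature are assumptions made for this illustration.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    # Rescale the log-probabilities by 1/temperature (softmax with temperature).
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(logit - highest) for logit in logits]
    # Draw one index with probability proportional to its weight.
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_with_temperature([0.1, 0.7, 0.2], temperature=0.8, rng=rng))
```

A temperature close to 0 behaves like argmax, while values around 1 keep the diversity of the model's distribution. In `make_prediction`, the `np.argmax(prediction)` call could be swapped for such a sampler.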

Create a model checkpoint so that the model is saved after each epoch.
If you train on a platform that stops unexpectedly, you do not lose your training

In [115]:
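from tensorflow.keras.callbacks import ModelCheckpoint

# {epoch:02d} is filled in by Keras, giving one file per saved epoch.
checkpoint_filepath = '/content/model_checkpoints/model_epoch_{epoch:02d}.keras'
model_checkpoint_callback = ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,  # save the full model, not only the weights
    monitor='loss',
    mode='min',
    # False saves at every epoch, as asked above; with True a file is only
    # written when the monitored loss improves.
    save_best_only=False,
    save_freq='epoch'
)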
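If the training is interrupted, the most recent checkpoint has to be located before resuming. The helper below is a sketch that assumes the file naming pattern used above; `latest_checkpoint` is a hypothetical name, and the commented lines only indicate how the result might be used with Keras.

```python
import glob
import os
import re


def latest_checkpoint(directory):
    """Return (path, epoch) of the highest-numbered checkpoint, or None."""
    pattern = re.compile(r"model_epoch_(\d+)\.keras$")
    found = []
    for path in glob.glob(os.path.join(directory, "model_epoch_*.keras")):
        match = pattern.search(os.path.basename(path))
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None
    epoch, path = max(found)
    return path, epoch


# Hypothetical usage after an interrupted run:
# path, epoch = latest_checkpoint('/content/model_checkpoints')
# model = tf.keras.models.load_model(path)
# model.fit(X_train, y_train, epochs=10, initial_epoch=epoch, batch_size=256)
```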

Now go to your favorite platform (Colab or something else) and train the model on the whole dataset for 10 epochs with a batch size of 256

It should take a long time, so either follow the class or go take a nap

In [116]:
model = build_model(X_train.shape[1])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

In [117]:
model.fit(X_train, y_train, epochs=10, batch_size=256, callbacks=[model_checkpoint_callback])

Epoch 1/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 192s 1s/step - loss: 2.7925
Epoch 2/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 196s 1s/step - loss: 2.5498
Epoch 3/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 197s 1s/step - loss: 2.3362
Epoch 4/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 190s 1s/step - loss: 2.1190
Epoch 5/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 201s 1s/step - loss: 1.8634
Epoch 6/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 198s 1s/step - loss: 1.7094
Epoch 7/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 201s 1s/step - loss: 1.6209
Epoch 8/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 200s 1s/step - loss: 1.5628
Epoch 9/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 180s 1s/step - loss: 1.4991
Epoch 10/10
173/173 ━━━━━━━━━━━━━━━━━━━━ 204s 1s

<keras.src.callbacks.history.History at 0x7f6cbad14e20>

Generate between 100 and 1000 molecules.

Create a list keeping the molecules that have between 10 and 50 atoms

In [123]:
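import random

def generate_molecules(model, char_to_int, int_to_char, X_train, num_molecules):
    # Note: make_prediction reads the global model and mappings, so the
    # arguments here are kept only to preserve the original signature.
    molecules = []
    for _ in range(num_molecules):
        # Seed with the first 1 to 5 characters of a random training molecule.
        seed_start = random.choice(df['smiles'])[:random.randint(1, 5)]
        generated_molecule = make_prediction(seed_start)
        # Rough proxy for the atom count: every letter counts, so two-letter
        # symbols such as Cl or Br and explicit H are over-counted.
        atom_count = sum(1 for char in generated_molecule if char.isalpha())
        if 10 <= atom_count <= 50:
            molecules.append(generated_molecule)
    return molecules

num_molecules_to_generate = 10 # TODO, add more molecules.
generated_molecules = generate_molecules(model, char_to_int, int_to_char, X_train, num_molecules_to_generate)
print(f"Generated {len(generated_molecules)} molecules with 10-50 atoms.")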

Generated 10 molecules with 10-50 atoms.


In [124]:
generated_molecules

['CCNC(=O)NCCC(=O)NCCCC(=O)NCCCC2)c1ccccc1',
 'Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c1',
 'CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc',
 'Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c',
 'CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc',
 'CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc',
 'O=C(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc1)',
 'Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c1cc',
 'CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc',
 'CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc']
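The list above contains several exact repeats: only 6 of the 10 strings are distinct. A simple diversity indicator is the fraction of unique strings, sketched below on a toy list. The `uniqueness` helper is a hypothetical name, and the sample is made up rather than model output.

```python
from collections import Counter


def uniqueness(molecules):
    """Fraction of distinct strings among the generated molecules."""
    return len(set(molecules)) / len(molecules) if molecules else 0.0


sample = ["CCO", "CCO", "CCN", "c1ccccc1"]  # toy list, not model output
print(uniqueness(sample))
# The most frequent string and how many times it appears.
print(Counter(sample).most_common(1))
```

A low value usually points to greedy decoding or to a model that has collapsed onto a few frequent patterns.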

With RDKit, compute the Quantitative Estimate of Drug-likeness (QED) of each molecule in this subset

In [125]:
!pip install rdkit-pypi

Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Downloading rdkit_pypi-2022.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29.4/29.4 MB 57.5 MB/s eta 0:00:00
Installing collected packages: rdkit-pypi
Successfully installed rdkit-pypi-2022.9.5


In [126]:
from rdkit import Chem
from rdkit.Chem import QED

for smiles in generated_molecules:
    try:
        # MolFromSmiles returns None when the string cannot be parsed.
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            qed_value = QED.qed(mol)
            print(f"SMILES: {smiles}, QED: {qed_value}")
        else:
            print(f"Invalid SMILES string: {smiles}")
    except Exception as e:
        print(f"Error processing SMILES {smiles}: {e}")

Invalid SMILES string: CCNC(=O)NCCC(=O)NCCCC(=O)NCCCC2)c1ccccc1
Invalid SMILES string: Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c1
Invalid SMILES string: CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc
Invalid SMILES string: Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c
Invalid SMILES string: CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc
Invalid SMILES string: CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc
Invalid SMILES string: O=C(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc1)
Invalid SMILES string: Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c1cc
Invalid SMILES string: CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc
Invalid SMILES string: CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc


[16:02:41] SMILES Parse Error: extra close parentheses while parsing: CCNC(=O)NCCC(=O)NCCCC(=O)NCCCC2)c1ccccc1
[16:02:41] SMILES Parse Error: Failed parsing SMILES 'CCNC(=O)NCCC(=O)NCCCC(=O)NCCCC2)c1ccccc1' for input: 'CCNC(=O)NCCC(=O)NCCCC(=O)NCCCC2)c1ccccc1'
[16:02:41] SMILES Parse Error: extra close parentheses while parsing: Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c1
[16:02:41] SMILES Parse Error: Failed parsing SMILES 'Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c1' for input: 'Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c1'
[16:02:41] SMILES Parse Error: unclosed ring for input: 'CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc'
[16:02:41] SMILES Parse Error: extra close parentheses while parsing: Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c
[16:02:41] SMILES Parse Error: Failed parsing SMILES 'Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c' for input: 'Cc1ccc(C)cc1CCCCCC(=O)NCCCCC(=O)NCCC2)c'
[16:02:41] SMILES Parse Error: unclosed ring for input: 'CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc'
[16:02:41] SMILES Parse Er
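Many of the errors above are "extra close parentheses", which can be detected without RDKit. The function below is a small sketch of such a pre-filter; `has_balanced_parentheses` is a hypothetical helper, not an RDKit function, and it says nothing about ring closures or valences.

```python
def has_balanced_parentheses(smiles):
    depth = 0
    for char in smiles:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "("
                return False
    # Every opened branch must also have been closed.
    return depth == 0


print(has_balanced_parentheses("CC(=O)NC1=CC=C(O)C=C1"))
print(has_balanced_parentheses("CCNC(=O)NCCC(=O)NCCCC(=O)NCCCC2)c1ccccc1"))
```

A string such as 'CCC(CCCCCCC(=O)NCCCC(=O)NCCCC2)c1ccccc' passes this check even though RDKit rejects it for an unclosed ring, so the filter only removes one class of invalid output.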

Bonus 1: Using RDKit, compute the quantitative estimate of drug-likeness (QED) of your generated molecules.

Bonus 2: Try to adapt a transformer model training from Hugging Face to see if it is better