While typing FreeSolv molecules with smirnoff99Frosst, I found 4 pairs of molecules that are potentially duplicated in the FreeSolv set. Below is the code snippet I used to find them, followed by its output:
import glob

from openeye import oechem
from openforcefield.utils import read_molecules

# Path to the untarred contents of mol2files_sybyl.tar.gz
DBpath = "/FreeSolv/mol2files_sybyl/*.mol2"

# Isomeric SMILES -> molecule / file name most recently seen with that SMILES
isosmiles_to_mol = {}
smi_to_file = {}

for file in glob.glob(DBpath):
    mol = read_molecules(file, verbose=False)[0]
    f = file.split('/')[-1]
    c_mol = oechem.OEMol(mol)
    oechem.OEAddExplicitHydrogens(c_mol)
    smi = oechem.OECreateIsoSmiString(mol)
    if smi in isosmiles_to_mol:
        # Same SMILES seen before: print this file next to the earlier one
        print("File: %35s %35s" % (f, smi_to_file[smi]))
        print("Title: %35s %35s" % (c_mol.GetTitle(), isosmiles_to_mol[smi].GetTitle()))
        print("SMILES: %35s %35s" % (smi, oechem.OECreateIsoSmiString(isosmiles_to_mol[smi])))
        print('\n')
    isosmiles_to_mol[smi] = c_mol
    smi_to_file[smi] = f
# OUTPUT:
#File: mobley_4689084.mol2 mobley_352111.mol2
#Title: 2-acetoxyethyl acetate 2-acetoxyethyl acetate
#SMILES: CC(=O)OCCOC(=O)C CC(=O)OCCOC(=O)C
#
#
#File: mobley_9897248.mol2 mobley_819018.mol2
#Title: (2Z)-3,7-dimethylocta-2,6-dien-1-ol (2E)-3,7-dimethylocta-2,6-dien-1-ol
#SMILES: CC(=CCCC(=CCO)C)C CC(=CCCC(=CCO)C)C
#
#
#File: mobley_9913368.mol2 mobley_4465023.mol2
#Title: (E)-1,2-dichloroethylene (Z)-1,2-dichloroethylene
#SMILES: C(=CCl)Cl C(=CCl)Cl
#
#
#File: mobley_9979854.mol2 mobley_628086.mol2
#Title: (2R)-1,1,1-trifluoropropan-2-ol (2S)-1,1,1-trifluoropropan-2-ol
#SMILES: CC(C(F)(F)F)O CC(C(F)(F)F)O
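
As a follow-up, here is a minimal sketch for sorting the four hits above into "same title" and "different titles". Only the first pair shares a title. The other three differ by an E/Z or R/S label while their SMILES strings are identical, so they may be distinct stereoisomers whose stereochemistry is not captured in the SMILES generated here, rather than true duplicates. The names `records`, `group_by_smiles`, and `classify` are hypothetical helpers, not part of FreeSolv or openforcefield; the data is copied by hand from the output above, so the sketch runs without OpenEye.

```python
from collections import defaultdict

# (file name, title, SMILES) copied by hand from the output above.
records = [
    ("mobley_4689084.mol2", "2-acetoxyethyl acetate", "CC(=O)OCCOC(=O)C"),
    ("mobley_352111.mol2", "2-acetoxyethyl acetate", "CC(=O)OCCOC(=O)C"),
    ("mobley_9897248.mol2", "(2Z)-3,7-dimethylocta-2,6-dien-1-ol", "CC(=CCCC(=CCO)C)C"),
    ("mobley_819018.mol2", "(2E)-3,7-dimethylocta-2,6-dien-1-ol", "CC(=CCCC(=CCO)C)C"),
    ("mobley_9913368.mol2", "(E)-1,2-dichloroethylene", "C(=CCl)Cl"),
    ("mobley_4465023.mol2", "(Z)-1,2-dichloroethylene", "C(=CCl)Cl"),
    ("mobley_9979854.mol2", "(2R)-1,1,1-trifluoropropan-2-ol", "CC(C(F)(F)F)O"),
    ("mobley_628086.mol2", "(2S)-1,1,1-trifluoropropan-2-ol", "CC(C(F)(F)F)O"),
]


def group_by_smiles(records):
    """Return {smiles: [(file, title), ...]} for SMILES seen more than once."""
    groups = defaultdict(list)
    for fname, title, smi in records:
        groups[smi].append((fname, title))
    return {smi: entries for smi, entries in groups.items() if len(entries) > 1}


def classify(entries):
    """Label a group by whether all of its titles agree."""
    titles = {title for _, title in entries}
    return "same title" if len(titles) == 1 else "different titles"


for smi, entries in group_by_smiles(records).items():
    files = ", ".join(fname for fname, _ in entries)
    print("%-20s %-18s %s" % (smi, classify(entries), files))
```

Grouping into lists, rather than keeping only the last molecule seen per SMILES, also reports every member if three or more files ever share a SMILES string.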