# Delaney Dataset

**ESOL: Estimating Aqueous Solubility Directly from Molecular Structure**

John S. Delaney

Journal of Chemical Information and Computer Sciences **2004** 44 (3), 1000-1005

DOI: 10.1021/ci034243x

The \( R^2 \) (coefficient of determination) between the ESOL-predicted and the measured log solubility in the Delaney dataset is approximately \( 0.811 \). In other words, the ESOL predictions account for roughly 81% of the variance in the measured values. An \( R^2 \) close to 1 indicates close agreement, so ESOL is a reasonably accurate baseline for the compounds in this dataset.
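As a reminder of what this number measures, here is a minimal sketch of the coefficient of determination in plain Python. The function name `r_squared` and the toy values are illustrative assumptions, not taken from the dataset; `sklearn.metrics.r2_score`, used later in this notebook, computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared prediction errors
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the true values around their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy log-solubility values (made up for illustration, not from the dataset)
measured = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

print(r_squared(measured, predicted))
```

A perfect prediction gives exactly 1, and a model that always predicts the mean of the measured values gives 0.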

In [None]:
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor

In [None]:
# # Load the dataset into a Pandas DataFrame
# df = pd.read_csv('delaney-processed.csv')

### Fields in the Delaney Dataset

1. **`Compound ID`:** A unique identifier for each compound in the dataset.
  
2. **`ESOL predicted log solubility in mols per litre`:** The base-10 logarithm of the compound's aqueous solubility, with solubility expressed in moles per liter (mol/L), as predicted by the ESOL (Estimated SOLubility) model.

3. **`Minimum Degree`:** The minimum degree of any atom in the molecular graph of the compound. It represents the least number of edges connected to any vertex in the graph.

4. **`Molecular Weight`:** The molecular weight of the compound, usually measured in g/mol.

5. **`Number of H-Bond Donors`:** The number of hydrogen bond donors present in the compound.

6. **`Number of Rings`:** The number of ring structures present in the compound.

7. **`Number of Rotatable Bonds`:** The number of rotatable bonds in the molecule, typically single, non-ring bonds about which parts of the molecule can rotate.

8. **`Polar Surface Area`:** The polar surface area of the molecule, generally measured in square Angstroms (\( \text{Å}^2 \)).

9. **`measured log solubility in mols per litre`:** The base-10 logarithm of the experimentally measured aqueous solubility, with solubility expressed in moles per liter (mol/L). This is generally treated as the 'ground truth' for training and evaluating predictive models.

10. **`smiles`:** The Simplified Molecular Input Line Entry System (SMILES) representation of the compound. It's a string that represents the structural formula of the compound.

This dataset is frequently used for training machine learning models to predict the aqueous solubility of new, untested compounds based on these features.
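Of the fields above, `Minimum Degree` is probably the least familiar. The sketch below illustrates the idea on two small molecular graphs. It assumes hydrogens are left implicit (a heavy-atom graph) and that each bond is given as a pair of atom indices; the helper name `minimum_degree` is hypothetical, and how the dataset's column was actually computed is not stated here.

```python
def minimum_degree(bonds):
    """Smallest number of bonds on any atom, given bonds as (atom, atom) index pairs.

    Assumes the molecule has at least one bond.
    """
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return min(degree.values())

# Heavy-atom graph of ethanol (SMILES "CCO"): C0-C1 and C1-O2
ethanol_bonds = [(0, 1), (1, 2)]
# Heavy-atom graph of cyclopropane (SMILES "C1CC1"): a three-membered ring
cyclopropane_bonds = [(0, 1), (1, 2), (2, 0)]

print(minimum_degree(ethanol_bonds))
print(minimum_degree(cyclopropane_bonds))
```

In the chain, the terminal atoms have a single neighbor; in the ring, every atom has two.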

In [None]:
# df.columns

In [None]:
# df.head()

In [None]:
# df.shape

In [None]:
# df2 = df.drop(['Compound ID', 'smiles', 'ESOL predicted log solubility in mols per litre'], axis=1)

In [None]:
# df2.info()

In [None]:
# X = df2.drop(['measured log solubility in mols per litre'], axis=1)
# y = df2['measured log solubility in mols per litre']

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# xgb_model = XGBRegressor(objective='reg:squarederror')

In [None]:
# xgb_model.fit(X_train, y_train)

In [None]:
# y_pred = xgb_model.predict(X_test)

In [None]:
# Generate the metrics
print('Mean Squared Error: ', mean_squared_error(y_test, y_pred))
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_pred))
print('R2 Score: ', r2_score(y_test, y_pred))

# Adding Morgan Fingerprints

In [None]:
# Function to convert SMILES to Morgan fingerprint
def smiles_to_morgan(smiles, radius=2, nBits=1024):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Unparseable SMILES: return None so the row can be dropped with dropna()
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nBits)
    return np.array(fp)

In [None]:
# # Generate Morgan fingerprints for each SMILES string
# df['Morgan_Fingerprint'] = df['smiles'].apply(smiles_to_morgan)

In [None]:
# # Drop rows where Morgan fingerprints could not be generated
# df.dropna(subset=['Morgan_Fingerprint'], inplace=True)

In [None]:
# # Convert Morgan fingerprints from list to separate columns
# morgan_df = pd.DataFrame(df['Morgan_Fingerprint'].to_list(), columns=[f'Bit_{i}' for i in range(1024)])

In [None]:
# df['Morgan_Fingerprint'][0].shape

In [None]:
# Prepare features and target variable
X = df.drop(columns=['Compound ID', 'ESOL predicted log solubility in mols per litre', 'smiles', 'measured log solubility in mols per litre', 'Morgan_Fingerprint'])
X = pd.concat([X.reset_index(drop=True), morgan_df.reset_index(drop=True)], axis=1)

In [None]:
# y = df['measured log solubility in mols per litre']

In [None]:
# X

In [None]:
# # Split the data into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# X_test

In [None]:
# # Initialize and train the XGBoost model
# xgb_model = XGBRegressor(objective='reg:squarederror')
# xgb_model.fit(X_train, y_train)

In [None]:
# # Make predictions on the test set
# y_pred = xgb_model.predict(X_test)

In [None]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [None]:
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Absolute Error: {mae}")
print(f"R2 Score: {r2}")

In [None]:
# Create a DataFrame to compare true and predicted values
comparison_df = pd.DataFrame({'True_Values': y_test, 'Predictions': y_pred})

# Reset the index for better visualization
comparison_df.reset_index(drop=True, inplace=True)

# Display the comparison DataFrame
print(comparison_df.head(10))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create the regression plot
sns.regplot(x=y_test, y=y_pred, scatter_kws={"color": "blue"}, line_kws={"color": "red"})
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.title('Regression Plot: True vs Predicted Values')
plt.show()


# Saving the model for later

In [None]:
# # Save the model

# xgb_model.save_model('xgb_model.json')

In [None]:
# # Load the model

# xgb_model = XGBRegressor()
# xgb_model.load_model('xgb_model.json')

In [61]:
y_pred = xgb_model.predict(X_test)


In [62]:
y_test.iloc[0]

-2.54

In [65]:
y_pred[0]

-2.3087633
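To put the two values above in context, the following sketch converts them back from log units to solubility in mol/L. It assumes the logarithm is base 10, as is conventional for log solubility; the variable names are illustrative.

```python
measured_log_s = -2.54        # y_test.iloc[0] from the output above
predicted_log_s = -2.3087633  # y_pred[0] from the output above

# Absolute error in log10 units
abs_error_log = abs(predicted_log_s - measured_log_s)

# Undo the base-10 logarithm to recover solubility in mol/L
measured_mol_per_l = 10 ** measured_log_s
predicted_mol_per_l = 10 ** predicted_log_s

# Fold error: how many times larger the predicted solubility is
fold_error = predicted_mol_per_l / measured_mol_per_l

print(f"absolute error: {abs_error_log:.3f} log units")
print(f"measured:  {measured_mol_per_l:.5f} mol/L")
print(f"predicted: {predicted_mol_per_l:.5f} mol/L")
print(f"fold error: {fold_error:.2f}x")
```

Because the target is logarithmic, a fixed error in log units corresponds to a fixed multiplicative error in solubility, which is why the fold error equals 10 raised to the absolute log error.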