# Solubility Prediction by Random Forest and RDKit

LogP, also known as the partition coefficient, measures a molecule's lipophilicity, or its affinity for a lipid (fat) environment versus a water environment. It's expressed as the logarithm of the ratio of the concentrations of the molecule in the two phases (typically octanol and water). A higher LogP value indicates greater lipophilicity, meaning the molecule is more soluble in fats.

Hydrogen bond donors are atoms or molecules that possess a hydrogen atom directly bonded to a more electronegative atom, such as oxygen (O), nitrogen (N), or fluorine (F). 

## 1. Create database

In [None]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
import numpy as np

# Generate a list of SMILES (valid and a few invalid)
smiles_list = [
    "CCO", "O=C(O)c1ccccc1OC", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", 
    "C([C@@H]([C@@H]([C@H](O)CO)O)O)O", "CC(=O)OC1=CC=CC=C1C(=O)O",
    "CC(C)CC1=CC=C(C=C1)O", "C1=CC=C(C=C1)O", "C1CCCCC1", 
    "C1=CC=NC(=C1)N", "C(CO)NC(C)C", "InvalidSMILES123", 
    "C1CCOC1", "C1=CN=C2C(=N1)C=NC=N2", "NaCl", 
    # ... Add more SMILES (repeat with variations) ...
]

# Generate synthetic solubility based on descriptors
data = []
for smiles in smiles_list * 10:  # Repeat to create 200+ entries
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        # Calculate descriptors
        logp = Descriptors.MolLogP(mol)
        mol_wt = Descriptors.MolWt(mol)
        h_bond_donors = Descriptors.NumHDonors(mol)
        
        # Simulate solubility: higher solubility for lower logP, lower molecular weight
        solubility = 0.5 - 0.02 * logp + 0.001 * mol_wt - 0.1 * h_bond_donors
        solubility += np.random.normal(0, 0.05)  # Add noise
        
        # Normalize: Clip solubility between 0 and 1
        solubility = np.clip(solubility, 0.05, 0.95)
        data.append({"SMILES": smiles, "Solubility": round(solubility, 2)})

# Create DataFrame and save
df = pd.DataFrame(data)
df.to_csv("sample_solubility_2.csv", index=False)

## 2. Prepare the data and calculate descriptions
The amount of features we import in our file has great impact. 
Too much features without PCA leads to overfitting. This is what I saw when I imported 3d data in current code.  
  
  Warning: I didn't normalize new features. I should check them again.  
  So I have 221 features with deifferent weights!

In [55]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import AllChem

# Load the dataset
data = pd.read_csv("sample_solubility_2.csv")

def get_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        desc = Descriptors.CalcMolDescriptors(mol)
        desc["PolarSurfaceArea"] = rdMolDescriptors.CalcTPSA(mol)  # 2D Descriptor

        # Generate 3D Conformer (required for 3D descriptors)
        mol = Chem.AddHs(mol)  # Add hydrogens
        if AllChem.EmbedMolecule(mol, AllChem.ETKDG()) == 0:  # Ensure embedding succeeds
            desc["RadiusOfGyration"] = rdMolDescriptors.CalcRadiusOfGyration(mol)  # 3D Descriptor
            desc["Asphericity"] = rdMolDescriptors.CalcAsphericity(mol)  # 3D Descriptor
            desc["Eccentricity"] = rdMolDescriptors.CalcEccentricity(mol)  # 3D Descriptor
        else:
            desc["RadiusOfGyration"] = None
            desc["Asphericity"] = None
            desc["Eccentricity"] = None

        return desc
    return None

# Apply the function to create a descriptor DataFrame
descriptor_list = []
for idx, row in data.iterrows():
    desc = get_descriptors(row["SMILES"])
    if desc is not None:
        desc["SMILES"] = row["SMILES"]  # Track valid SMILES
        desc["Solubility"] = row["Solubility"]
        descriptor_list.append(desc)

# Create final DataFrame
df = pd.DataFrame(descriptor_list).dropna()
print(f"Valid molecules: {len(df)}")

# Save to CSV
df.to_csv("molecular_descriptors.csv", index=False)
print("Saved descriptors to molecular_descriptors.csv")


Valid molecules: 120
Saved descriptors to molecular_descriptors.csv


### My other failed trys:

## 3. Preprocess    
Main parts:  
1. Split
2. Normalize (if didn't in previous part)  
Had major problams with this part.  
The changes I can make to improve my results:    
  a. remove ouliers: Didn't work well with my sample.  
b. Selecting best features while omitting others to avoid overfitting. I have to recieve the name of selected features.  
c. Handle missing values.

In [56]:
# Load your generated descriptor data
df = pd.read_csv("molecular_descriptors.csv")

# Split into features (X) and target (y)
X = df.drop(["SMILES", "Solubility"], axis=1)
y = df["Solubility"]

# Handle missing values (if any)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(X)

# Split into train/test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [57]:
print(f"X_train = {X_train.shape} , X_test = {X_test.shape}, y_train = {y_train.shape}, y_test = {y_test.shape}")

X_train = (96, 221) , X_test = (24, 221), y_train = (96,), y_test = (24,)


In [58]:
#feature selection
from sklearn.decomposition import PCA

#Decreasing the number of features from 221 to 10
pca = PCA(n_components=10)
pca.fit(X_train)

X_new_train= pca.transform(X_train)
X_new_test = pca.transform(X_test)
#y_new_train = y_train_clean
X_train.shape , X_new_train.shape, y_new_train.shape
#Check the decrease in features

((96, 221), (96, 10), (91,))

### Some advanced code to remove ouliers whitch didn't work well:

## 4. Training the model
I used GridSearchCV to optimize the parameters of my randomforest model.

In [59]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}

# Initialize and fit grid search
model = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='neg_root_mean_squared_error')
grid_search.fit(X_new_train, y_train)

# Get best model
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

# Now use best_model for predictions
best_model.fit(X_new_train, y_train)

Best Parameters: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 100}


### If you couldn't run calculate calculate_regression_metrics, Do these:

In [60]:
#Verify Data Types
print(y_train.dtype)  # Should output `float64` or `int64`
print(y_test.dtype)   # Should not output `object`

float64
float64


In [61]:
print(type(X_train))  # If it's a DataFrame, use .values. If it's an array, skip.
# .values:
#Ensure Input Shapes
# Check that X_train, X_test, y_train, and y_test are NumPy arrays or Pandas DataFrames (not lists or other objects):
# Convert to NumPy arrays if needed
#X_train = X_train.values
#X_test = X_test.values
#y_train = y_train.values.ravel()  # Flatten to 1D array
#y_test = y_test.values.ravel()

<class 'numpy.ndarray'>


## 5. Evaluate

RMSE (Root Mean Squared Error): Measures the average prediction error (lower = better).
Example: RMSE = 0.5 means predictions are off by ~0.5 units (e.g., logS) on average.<br>
<br> 

R² (R-squared): Measures how much variance your model explains (1 = perfect, 0 = no better than the mean). <br>Train R² ≈ Test R² (e.g., Train R² = 0.85, Test R² = 0.80). Test R² ≥ 0.7 is often good.

In [62]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

def calculate_regression_metrics(y_train, y_test, y_pred_train, y_pred_test):
    # Calculate metrics for training set
    mse_train = mean_squared_error(y_train, y_pred_train)
    rmse_train = np.sqrt(mse_train)  # Compute RMSE manually
    #rmse_train = mean_squared_error(y_train, y_pred_train, squared=False)
    r2_train = r2_score(y_train, y_pred_train)
    
    # Calculate metrics for test set
    mse_test = mean_squared_error(y_train, y_pred_train)
    rmse_test = np.sqrt(mse_test)  # Compute RMSE manually
    #rmse_test = mean_squared_error(y_test, y_pred_test, squared=False)
    r2_test = r2_score(y_test, y_pred_test)
    
    print(f"Train RMSE: {rmse_train:.2f}, Train R²: {r2_train:.2f}")
    print(f"Test RMSE: {rmse_test:.2f}, Test R²: {r2_test:.2f}")

In [63]:
# Predict solubilities
y_pred_train_rf = best_model.predict(X_new_train)
y_pred_test_rf = best_model.predict(X_new_test)

# Evaluate
print("Best Parameters:", grid_search.best_params_)
#calculate_regression_metrics(y_train, y_test, y_pred_train_rf, y_pred_test_rf)
calculate_regression_metrics(y_train, y_test, y_pred_train_rf, y_pred_test_rf)

Best Parameters: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 100}
Train RMSE: 0.03, Train R²: 0.95
Test RMSE: 0.03, Test R²: 0.58
