<a href="https://colab.research.google.com/github/ChemistZee/ml_for_molecules/blob/main/Classical_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Training classical ML models

In the QM9 dataset, every entry has a HOMO-LUMO gap, which is a continuous value. We therefore treat the prediction as a supervised regression task.

Classical ML models include linear models, support vector machines, decision trees, etc. A list of the algorithms available in the ``scikit-learn`` package can be found [here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning).

Here, we will train some of those ML models to predict the HOMO-LUMO gap.

In [1]:
# import the pandas library
import pandas as pd

# load the dataframe as CSV from URL.
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# we will use 20% of the dataset for demo
dataset = df[["smiles","gap"]].sample(frac=0.2)

### Molecular Representation

We will use molecular fingerprints as the representation for the molecules, computed with the featurizer from deepchem.

In [2]:
# install rdkit and deepchem
! pip install rdkit
! pip install deepchem

Collecting rdkit
  Downloading rdkit-2025.9.3-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (4.2 kB)
Downloading rdkit-2025.9.3-cp312-cp312-manylinux_2_28_x86_64.whl (36.4 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2025.9.3
Collecting deepchem
  Downloading deepchem-2.5.0-py3-none-any.whl.metadata (1.1 kB)
Downloading deepchem-2.5.0-py3-none-any.whl (552 kB)
Installing collected packages: deepchem
Successfully installed deepchem-2.5.0


In [6]:
# import deepchem and rdkit
import deepchem as dc
from rdkit import Chem

# create the featurizer object
# we will set the radius=2, size=100 as before
featurizer = dc.feat.CircularFingerprint(size=100, radius=2)

# apply to the dataset
dataset["fp"] = dataset["smiles"].apply(featurizer.featurize)

# featurize returns a 2D array with one row; keep that row as a plain list for training
dataset["fp"] = dataset["fp"].apply(lambda x: list(x[0]))

In [5]:
dataset['fp'].dtype

dtype('O')
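
The ``dtype('O')`` above means that each cell of the ``fp`` column holds a Python object (here, a list), not a number. scikit-learn expects a 2D numeric array, which is why the later cells call ``.values.tolist()`` before fitting. A minimal sketch of that conversion, using a made-up toy dataframe (the name ``toy`` and the 4-bit fingerprints are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: three molecules with 4-bit fingerprints
toy = pd.DataFrame({
    "fp": [
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
})

# Each cell is a Python list, so pandas stores the column with object dtype
print(toy["fp"].dtype)

# Stack the per-row lists into one 2D float array (rows = molecules, columns = bits)
X = np.array(toy["fp"].tolist())
print(X.shape, X.dtype)
```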

We will split the dataset randomly using Fast-ML.

In [7]:
# install Fast-ML; we avoid the deepchem splitters because they would require converting between DataFrame and Dataset objects
! pip install fast_ml

Collecting fast_ml
  Downloading fast_ml-3.68-py3-none-any.whl.metadata (12 kB)
Downloading fast_ml-3.68-py3-none-any.whl (42 kB)
Installing collected packages: fast_ml
Successfully installed fast_ml-3.68


In [8]:
# import the function that splits the data into train, valid and test sets
# note: the split is applied to the fp and gap columns
from fast_ml.model_development import train_valid_test_split

# we will split the dataset as train-valid-test = 0.8:0.1:0.1
X_train, y_train, X_valid, y_valid, \
X_test, y_test = train_valid_test_split(dataset[["fp","gap"]], target = "gap", train_size=0.8,
                                        valid_size=0.1, test_size=0.1)

# look at the dataset
X_train

| | fp |
|---|---|
| 75724 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... |
| 126792 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... |
| 108098 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... |
| 9352 | [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 101187 | [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| ... | ... |
| 68295 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... |
| 126160 | [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... |
| 89489 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... |
| 76227 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... |
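
For comparison, the same 0.8:0.1:0.1 random split can be sketched with scikit-learn alone by calling ``train_test_split`` twice. This is an alternative, not what the notebook uses; the toy arrays and the fixed ``random_state`` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each, and one target per sample
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split: keep 80% for training, set 20% aside
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Second split: divide the remaining 20% equally into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_valid), len(X_test))
```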


### Linear regression model
We see that the X dataframes hold the fingerprints in the ``fp`` column. We will use those as input for training the ML models.

Let us begin with the ``LinearRegression`` model, which implements the ordinary least squares method. You can find more details [here](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares).

In [9]:
# import the model
from sklearn.linear_model import LinearRegression

# create the model object
lr = LinearRegression()

# fit the model with x=fp and y=gap
model = lr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())

To check the quality of the linear fit, we can use the validation dataset. The ``score`` method computes the R<sup>2</sup> value; the closer R<sup>2</sup> is to 1, the better the fit.

In [10]:
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

0.5917523424481731
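
For reference, the R<sup>2</sup> returned by ``score`` is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal pure-Python sketch of that formula follows; the helper name ``r_squared`` and the toy values are illustrative and not part of the notebook.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the targets
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [0.20, 0.25, 0.30, 0.35]
mean_value = sum(y_true) / len(y_true)

# Perfect predictions leave no residual error
perfect = r_squared(y_true, y_true)

# Always predicting the mean does no better than the baseline
baseline = r_squared(y_true, [mean_value] * len(y_true))

print(perfect, baseline)
```

A model that does worse than always predicting the mean gets a negative R<sup>2</sup>, so the value is not bounded below by 0.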

Let us look at the model's predictions for the first 10 entries of the test dataset.

In [11]:
model.predict(X_test["fp"].values.tolist())[:10]

array([0.28308352, 0.22297801, 0.21183692, 0.29364882, 0.24408212,
       0.23275133, 0.2850813 , 0.30508494, 0.30582675, 0.24077273])

The corresponding HOMO-LUMO gaps in the test dataset are:

In [13]:
y_test.values[:10].tolist() #tolist() is optional

[0.6221,
 0.1378,
 0.2733,
 0.2819,
 0.2459,
 0.2311,
 0.3165,
 0.2944,
 0.2956,
 0.2131]
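
To compare the two lists at a glance, one option is to compute the absolute error per molecule and the mean absolute error. The sketch below copies the ten predicted and ten reference values printed above; the variable names are illustrative.

```python
# Values copied from the two outputs above
predicted = [0.28308352, 0.22297801, 0.21183692, 0.29364882, 0.24408212,
             0.23275133, 0.2850813, 0.30508494, 0.30582675, 0.24077273]
actual = [0.6221, 0.1378, 0.2733, 0.2819, 0.2459,
          0.2311, 0.3165, 0.2944, 0.2956, 0.2131]

# Absolute error for each of the ten molecules
abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]

# Mean absolute error over the ten molecules
mae = sum(abs_errors) / len(abs_errors)

# Position of the worst prediction
worst = abs_errors.index(max(abs_errors))

print(f"MAE over 10 molecules: {mae:.4f}")
print(f"Largest error at position {worst}: "
      f"predicted {predicted[worst]:.4f}, actual {actual[worst]:.4f}")
```

Most of these predictions land close to the reference value, while the first molecule (gap 0.6221) is missed by a wide margin.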

### Support vector machine regression (SVR) model

The code barely changes: we use ``SVR`` instead of ``LinearRegression``.

In [14]:
# import the model class
from sklearn.svm import SVR

# create the model object
svr = SVR() # using default values for the parameters

# fit the model with x=fp and y=gap
model = svr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())

Again, we compute the R<sup>2</sup>:

In [15]:
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

0.20050955099689127

The R<sup>2</sup> with the default SVR (0.20) is lower than with linear regression (0.59). We can change the model parameters to see if we get any improvement. The model parameters can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR).

We will change the *kernel* from the default *rbf* to *linear* and see if that helps.

In [16]:
svr = SVR(kernel="linear")
model = svr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

0.3450953411641855
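
Another parameter that may be worth trying is ``epsilon``. Its default of 0.1 is large compared with the gap values shown above (roughly 0.1 to 0.6), so many training points can fall inside the epsilon-tube and contribute no loss. Whether this explains the low score on QM9 is an assumption that has not been checked here. The sketch below only illustrates the effect on synthetic data with a similarly narrow target range; the data, the weights and the ``scores`` dictionary are made up for the example.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in for fingerprints: 300 samples, 20 binary features
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 20)).astype(float)

# Targets in a narrow range around 0.25, linear in the features plus small noise
w = rng.normal(size=20)
y = 0.25 + 0.01 * (X @ w) + rng.normal(scale=0.002, size=300)

# Simple hold-out split: first 240 samples for training, the rest for validation
X_train, y_train = X[:240], y[:240]
X_valid, y_valid = X[240:], y[240:]

# Fit a linear-kernel SVR for several epsilon values and record the validation R^2
scores = {}
for eps in (0.1, 0.01, 0.001):
    model = SVR(kernel="linear", epsilon=eps).fit(X_train, y_train)
    scores[eps] = model.score(X_valid, y_valid)
    print(f"epsilon={eps}: R^2 = {scores[eps]:.3f}")
```

On the real dataset the same loop could be run with ``X_train["fp"].values.tolist()`` and ``y_train.values.tolist()`` in place of the synthetic arrays, scoring on the validation split as before.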