### Training classical ML models

In the QM9 dataset, every entry has a HOMO-LUMO gap, which is a continuous value. We therefore treat this as a supervised learning problem with a regression task.

Classical ML models include linear models, support vector machines, decision trees, etc. A list of the algorithms available in the ``scikit-learn`` package can be found [here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning).

Here, we will train some of those ML models to predict the HOMO-LUMO gap.

In [1]:
# import the pandas library
import pandas as pd

# load the CSV from the URL into a dataframe
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# we will use a random 20% of the dataset for the demo
dataset = df[["smiles","gap"]].sample(frac=0.2)

### Molecular Representation

We will use molecular fingerprints as the representation for the molecules, computed with the featurizer from ``deepchem``.

In [2]:
# install rdkit and deepchem
! pip install rdkit
! pip install deepchem

Collecting rdkit
  Downloading rdkit-2023.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.5 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2023.9.1
Collecting deepchem
  Downloading deepchem-2.7.1-py3-none-any.whl (693 kB)
Collecting scipy<1.9 (from deepchem)
  Downloading scipy-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)
Installing collected packages: scipy, deepchem
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.3
    Uninstalling scipy-1.11.3:
      Successfully uninstalled scipy-1.11.3
ERROR: pip's dependency resolver does

In [3]:
# import deepchem and rdkit
import deepchem as dc
from rdkit import Chem

# create the featurizer object
# we will set the radius=2, size=100 as before
featurizer = dc.feat.CircularFingerprint(size=100, radius=2)

# apply to the dataset
dataset["fp"] = dataset["smiles"].apply(featurizer.featurize)

# fp is a multi-dimensional array, but we want a plain list for training
dataset["fp"] = dataset["fp"].apply(lambda x: list(x[0]))
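
To give a feel for what a fixed-size fingerprint is, here is a toy sketch. It is not the algorithm used by RDKit or ``deepchem``: it simply hashes a handful of hypothetical substructure labels into a bit vector of length ``size``. The ``toy_fingerprint`` helper and the fragment labels are assumptions made up for illustration.

```python
import hashlib

def toy_fingerprint(fragments, size=100):
    """Fold a collection of substructure labels into a fixed-length bit vector."""
    bits = [0.0] * size
    for frag in fragments:
        # stable hash of the label, reduced modulo the vector length
        digest = hashlib.sha256(frag.encode("utf-8")).hexdigest()
        bits[int(digest, 16) % size] = 1.0
    return bits

# hypothetical fragment labels, for illustration only
fragments = ["C", "O", "C-C", "C-O", "C-C-O"]
fp = toy_fingerprint(fragments, size=100)

print(len(fp))       # the vector length equals `size` for any molecule
print(int(sum(fp)))  # number of set bits, at most len(fragments)
```

With a small ``size`` such as 100, two different fragments can land on the same bit, so short fingerprints trade some information for compactness.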



We will split the dataset randomly using Fast-ML.

In [4]:
# install Fast-ML
! pip install fast_ml

Collecting fast_ml
  Downloading fast_ml-3.68-py3-none-any.whl (42 kB)
Installing collected packages: fast_ml
Successfully installed fast_ml-3.68


In [5]:
# import the function to split into train-valid-test
from fast_ml.model_development import train_valid_test_split

# we will split the dataset as train-valid-test = 0.8:0.1:0.1
X_train, y_train, X_valid, y_valid, \
X_test, y_test = train_valid_test_split(dataset[["fp","gap"]], target = "gap", train_size=0.8,
                                        valid_size=0.1, test_size=0.1)

# look at the dataset
X_train

Unnamed: 0,fp
130861,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, ..."
124193,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ..."
122350,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
8916,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
118544,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...
51923,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
21004,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, ..."
12710,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ..."
47167,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ..."
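
The same 0.8:0.1:0.1 random split can be sketched without Fast-ML by shuffling row positions and slicing them. This is a minimal illustration, not how ``train_valid_test_split`` is implemented; the helper name ``split_indices`` and the fixed ``seed`` are assumptions added here so the split is repeatable.

```python
import random

def split_indices(n_rows, train_size=0.8, valid_size=0.1, seed=0):
    """Return shuffled, non-overlapping train/valid/test row positions."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded so the split is repeatable

    n_train = int(n_rows * train_size)
    n_valid = int(n_rows * valid_size)

    train = positions[:n_train]
    valid = positions[n_train:n_train + n_valid]
    test = positions[n_train + n_valid:]  # whatever remains
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With a pandas dataframe, the returned positions could be passed to ``.iloc`` to select the rows of each subset.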


### Linear regression model
We see that the X dataframes have a column with the fingerprints. We will use those as input for training the ML models.

Let us begin with the ``LinearRegression`` model. This is the ordinary least squares method. You can find more details [here](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares).

In [6]:
# import the model
from sklearn.linear_model import LinearRegression

# create the model object
lr = LinearRegression()

# fit the model with x=fp and y=gap
model = lr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())

To check the quality of the linear fit, we can use the validation dataset. The ``score`` method computes the R<sup>2</sup> value; the closer R<sup>2</sup> is to 1, the better the fit.

In [7]:
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

0.6174072347956134
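
For reference, R<sup>2</sup> is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand; the function name ``r2_score_manual`` and the four gap values are assumptions for illustration, not data from QM9.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# hypothetical gap values, for illustration only
y_true = [0.20, 0.25, 0.30, 0.35]

perfect = r2_score_manual(y_true, y_true)        # predictions equal the targets
baseline = r2_score_manual(y_true, [0.275] * 4)  # always predicting the mean
print(perfect, baseline)
```

A model that always predicts the mean of the targets scores about 0, and a model that does worse than that gets a negative R<sup>2</sup>.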

Let the model predict 10 values from the test dataset.

In [8]:
model.predict(X_test["fp"].values.tolist())[:10]

array([0.29790964, 0.25696517, 0.25668655, 0.25561846, 0.22533646,
       0.20936748, 0.3010466 , 0.23551685, 0.21772902, 0.24318356])

The corresponding HOMO-LUMO gaps in the test dataset are:

In [9]:
y_test.values[:10]

array([0.3199, 0.2619, 0.252 , 0.2335, 0.2831, 0.1368, 0.2976, 0.2329,
       0.2643, 0.2554])
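
Comparing the two arrays by eye is hard, so a quick per-sample error helps. The sketch below copies the ten predicted and actual values printed above and computes the mean absolute error, in the same units as the ``gap`` column. The variable names are chosen here for illustration.

```python
# values copied from the two outputs above
predicted = [0.29790964, 0.25696517, 0.25668655, 0.25561846, 0.22533646,
             0.20936748, 0.3010466, 0.23551685, 0.21772902, 0.24318356]
actual = [0.3199, 0.2619, 0.252, 0.2335, 0.2831,
          0.1368, 0.2976, 0.2329, 0.2643, 0.2554]

# absolute error for each of the ten molecules
abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]
mae = sum(abs_errors) / len(abs_errors)

# position of the molecule with the largest error
worst = max(range(len(abs_errors)), key=abs_errors.__getitem__)

print(f"mean absolute error: {mae:.4f}")
print(f"largest error at position {worst}: {abs_errors[worst]:.4f}")
```

Ten samples are too few to judge the model, but the same calculation applies unchanged to the full test set.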

### Support vector machine regression (SVR) model

The code changes very little: we use ``SVR`` instead of ``LinearRegression``.

In [10]:
# import the model class
from sklearn.svm import SVR

# create the model object
svr = SVR()

# fit the model with x=fp and y=gap
model = svr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())

Again, we compute the R<sup>2</sup>:

In [11]:
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

0.23532637621080643

The R<sup>2</sup> is lower with SVR than with linear regression. We can change the model parameters to see if we get any improvement. The model parameters can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR).

We will change the *kernel* from the default *rbf* to *linear* and see if that helps.

In [12]:
# use a linear kernel instead of the default rbf, then refit and score
svr = SVR(kernel="linear")
model = svr.fit(X_train["fp"].values.tolist(),y_train.values.tolist())
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())

0.41949513794965376
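
The *kernel* controls how SVR measures the similarity between two fingerprints. As a rough sketch of the difference, the linear kernel is a dot product, while the RBF kernel decays exponentially with the squared distance. The ``gamma`` value and the two short bit vectors below are assumptions for illustration; ``scikit-learn`` derives ``gamma`` from the data by default.

```python
import math

def linear_kernel(x, y):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.1):
    """exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# two short hypothetical bit vectors sharing two set bits
fp_a = [1.0, 0.0, 1.0, 1.0, 0.0]
fp_b = [1.0, 1.0, 1.0, 0.0, 0.0]

k_lin = linear_kernel(fp_a, fp_b)  # for bit vectors: number of shared set bits
k_rbf = rbf_kernel(fp_a, fp_b)     # equals 1.0 when the vectors are identical
print(k_lin, k_rbf)
```

Which kernel works better depends on the data and on the other hyperparameters, so the comparison above is one data point, not a general rule.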

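### Random forest regression model

Finally, we train a random forest regressor with default parameters and compute the R<sup>2</sup> on the validation dataset.

In [13]:
# import the model class
from sklearn.ensemble import RandomForestRegressor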

In [14]:
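# create the model object with default parameters
rf = RandomForestRegressor()
# fit the model with x=fp and y=gap
model = rf.fit(X_train["fp"].values.tolist(),y_train.values.tolist())
# compute the R^2 on the validation dataset
model.score(X_valid["fp"].values.tolist(),y_valid.values.tolist())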

0.7942323667002544
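
To compare the models side by side, the validation scores reported above can be collected and ranked. The sketch below only reuses the four numbers printed in this section; the dictionary labels are chosen here for readability. Since the 20% subsample is drawn without a fixed seed, a rerun will give somewhat different values.

```python
# validation R^2 values copied from the outputs above
valid_scores = {
    "LinearRegression": 0.6174072347956134,
    "SVR (rbf)": 0.23532637621080643,
    "SVR (linear)": 0.41949513794965376,
    "RandomForestRegressor": 0.7942323667002544,
}

# rank the models from best to worst validation score
ranking = sorted(valid_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<24}{score:.3f}")

best_model = ranking[0][0]
```

The model chosen this way should then be evaluated once on the held-out test dataset, which has not been used for any of these comparisons.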