<a href="https://colab.research.google.com/github/ChemistZee/ml_for_molecules/blob/main/Molecular_Representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Featurizers

Python packages such as ``deepchem`` provide several featurizers, and we will look at a few of them here. Detailed documentation is available [here](https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html).

Unlike the dataset splitters, featurizers do not require converting the dataset into a deepchem object. We can apply them directly to a column of the pandas dataframe.

In [1]:
# install deepchem and rdkit
! pip install deepchem
! pip install rdkit

Successfully installed deepchem-2.5.0
Successfully installed rdkit-2025.9.3


As before, we will use the QM9 dataset with the HOMO-LUMO gap as the target. We will apply the featurizers to the entire dataset and then split it randomly.
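The random split mentioned above is not shown in this section, so here is a minimal sketch of one way to do it with NumPy index shuffling. The function name `random_split_indices`, the 80/20 fraction, and the seed are illustrative assumptions, not part of the original notebook or of deepchem.

```python
import numpy as np

def random_split_indices(n_samples, frac_train=0.8, seed=0):
    """Return shuffled, non-overlapping train and test row positions (hypothetical helper)."""
    rng = np.random.default_rng(seed)          # seeded generator for reproducibility
    permuted = rng.permutation(n_samples)      # random ordering of 0..n_samples-1
    n_train = int(round(frac_train * n_samples))
    return permuted[:n_train], permuted[n_train:]

# example with 10 rows: 8 go to train, 2 go to test
train_idx, test_idx = random_split_indices(10)
print(len(train_idx), len(test_idx))
```

Once the dataframe has been featurized, the positions could be used as `dataset.iloc[train_idx]` and `dataset.iloc[test_idx]`.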

In [2]:
# import the pandas library
import pandas as pd

# load the dataframe as CSV from URL.
# If you upload the file to Colab, replace the URL with the file name
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")

# keep only the smiles and gap columns and sample 10% of the rows
dataset = df[["smiles","gap"]].sample(frac=0.1)

We will use the ``CircularFingerprint`` (Morgan fingerprint) and ``RDKitDescriptors`` featurizers from deepchem. Documentation on the available featurizers can be found [here](https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html).

### CircularFingerprint

In [3]:
# import deepchem and rdkit
import deepchem as dc
from rdkit import Chem

# create the featurizer object
# we will set radius=2 and size=100 as before
featurizer = dc.feat.CircularFingerprint(size=100, radius=2)



To test, we will apply the featurizer to ethane.

In [4]:
featurizer.featurize("CC")



array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0.]])

The output is a NumPy array of shape (1, 100), one row per input molecule, and the code is simpler than the pure RDKit implementation. We can now apply the featurizer to the dataset. This may take a while.

In [5]:
dataset["fp"] = dataset["smiles"].apply(featurizer.featurize)



In [6]:
# looking at the top 5 entries
dataset.head()

| | smiles | gap | fp |
|---|---|---|---|
| 38646 | C1C2CC3C=CCC1C23 | 0.2609 | [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0,... |
| 109477 | CCC1C(=O)CNC1=O | 0.2097 | [[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0,... |
| 41206 | C1OC2C3CC12OCO3 | 0.2781 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,... |
| 84889 | OC1C=CCCC11CO1 | 0.2576 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
| 52066 | C1CC1(COC=O)C#N | 0.2828 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
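Each entry in the `fp` column is a nested array of shape (1, 100), whereas most machine learning libraries expect a single matrix with one row per molecule. The sketch below shows one way to stack such entries with NumPy. It uses placeholder arrays rather than real fingerprints, and the variable names are illustrative assumptions.

```python
import numpy as np

# placeholder per-molecule outputs, each shaped (1, 100) like the fingerprint above
# (dummy values, not real fingerprints)
per_molecule = [np.zeros((1, 100)) for _ in range(3)]
per_molecule[0][0, 37] = 1.0  # switch on one bit in the first placeholder

# stack along the first axis to get one row per molecule
X = np.vstack(per_molecule)
print(X.shape)  # (3, 100)
```

With the dataframe from this notebook, the same idea would likely be `np.vstack(dataset["fp"].tolist())`, assuming every row was featurized to the same shape.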


### RDKitDescriptors

This featurizer uses RDKit to compute a list of chemical descriptors such as molecular weight, number of valence electrons, and maximum and minimum partial charge. By default, the length of the list is 208, although the exact count can depend on the installed RDKit version.

The code below featurizes ethane.

In [7]:
# create the featurizer
featurizer = dc.feat.RDKitDescriptors()

# apply it on ethane
featurizer.featurize("CC")

array([[ 2.        ,  2.        ,  2.        ,  2.        ,  0.37278556,
         3.        , 30.07      , 24.022     , 30.04695019, 14.        ,
         0.        , -0.06826238, -0.06826238,  0.06826238,  0.06826238,
         1.        ,  1.        ,  1.        , 13.011     , 11.011     ,
         0.93173762, -1.06826238,  1.1441    , -0.8559    ,  3.503     ,
         1.503     ,  1.        ,  1.        ,  0.        ,  2.        ,
         2.        ,  2.        ,  1.        ,  1.        ,  1.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  2.        ,  0.        ,
         0.        , 15.10419314,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        , 13.8474744 ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , 13.8474744 ,  0.        ,  0. 

In [8]:
dataset['rdkit_desc'] = dataset['smiles'].apply(featurizer.featurize)

In [9]:
dataset.head(5)

| | smiles | gap | fp | rdkit_desc |
|---|---|---|---|---|
| 38646 | C1C2CC3C=CCC1C23 | 0.2609 | [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0,... | [[2.455185185185185, 2.455185185185185, 1.0300... |
| 109477 | CCC1C(=O)CNC1=O | 0.2097 | [[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0,... | [[10.732407407407408, 10.732407407407408, 0.03... |
| 41206 | C1OC2C3CC12OCO3 | 0.2781 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,... | [[5.383101851851852, 5.383101851851852, 0.1140... |
| 84889 | OC1C=CCCC11CO1 | 0.2576 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... | [[9.287638888888889, 9.287638888888889, 0.1440... |
| 52066 | C1CC1(COC=O)C#N | 0.2828 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... | [[9.66360260770975, 9.66360260770975, 0.270833... |
