# Lesson 2: Using RDKit to Extract Molecular Descriptors

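**Objective:** Learn how to convert SMILES strings into RDKit molecule objects and compute simple descriptors from them.
We'll use a small mock dataset in the style of BitterDB, with a SMILES string and a binary bitterness label (1 = bitter, 0 = not bitter) for each compound, and compute properties relevant to bitterness prediction. Because the data is mock, treat the SMILES as illustrative rather than verified structures: the "Sucrose" entry, for example, encodes a hexose (MolWt 180.156 in the output below), not sucrose itself (about 342.3 g/mol).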

In [1]:
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen, rdMolDescriptors, Lipinski
import pandas as pd

In [2]:
# Example dataset with SMILES and Bitter label
data = {
    'Name': ['Caffeine', 'Quinine', 'Saccharin', 'Sucrose', 'Denatonium'],
    'SMILES': [
        'Cn1cnc2c1c(=O)n(c(=O)n2C)C',
        'CC1=C(C(=O)NC2=CC=CC=C12)C3=CC=CC=C3O',
        'C1=CC=C(C=C1)C(=O)NS(=O)(=O)O',
        'C(C1C(C(C(C(O1)O)O)O)O)O',
        'CC[N+](C)(C)CCC1=CC=C(C=C1)C(=O)[O-]'
    ],
    'Bitter': [1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)
df

|  | Name | SMILES | Bitter |
|---|---|---|---|
| 0 | Caffeine | `Cn1cnc2c1c(=O)n(c(=O)n2C)C` | 1 |
| 1 | Quinine | `CC1=C(C(=O)NC2=CC=CC=C12)C3=CC=CC=C3O` | 1 |
| 2 | Saccharin | `C1=CC=C(C=C1)C(=O)NS(=O)(=O)O` | 1 |
| 3 | Sucrose | `C(C1C(C(C(C(O1)O)O)O)O)O` | 0 |
| 4 | Denatonium | `CC[N+](C)(C)CCC1=CC=C(C=C1)C(=O)[O-]` | 1 |


In [3]:
# Convert SMILES to RDKit molecules
df['Mol'] = df['SMILES'].apply(Chem.MolFromSmiles)
df['Mol'].head()

0    <rdkit.Chem.rdchem.Mol object at 0x11c0ca7a0>
1    <rdkit.Chem.rdchem.Mol object at 0x11c0ca8f0>
2    <rdkit.Chem.rdchem.Mol object at 0x11c0ca810>
3    <rdkit.Chem.rdchem.Mol object at 0x11c0ca960>
4    <rdkit.Chem.rdchem.Mol object at 0x11c0ca9d0>
Name: Mol, dtype: object
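
All five SMILES above parse cleanly, but `Chem.MolFromSmiles` returns `None` (and logs an error) instead of raising an exception when a string cannot be parsed. With real data it is worth checking for failed parses before computing descriptors. The sketch below is one way to do that; the three-entry series and its deliberately malformed second SMILES are hypothetical and not part of the lesson dataset.

```python
from rdkit import Chem
import pandas as pd

# Hypothetical mini-series: the second SMILES is deliberately malformed
smiles = pd.Series(['Cn1cnc2c1c(=O)n(c(=O)n2C)C', 'C1CC(', 'CCO'])

# Failed parses come back as None rather than raising an exception
mols = smiles.apply(Chem.MolFromSmiles)

# notna() treats None as missing, so it doubles as a validity mask
valid_mask = mols.notna()
print(f"{(~valid_mask).sum()} SMILES failed to parse:")
print(smiles[~valid_mask].tolist())

# Keep only the molecules that parsed successfully
clean_mols = mols[valid_mask]
```

Filtering on the mask before calling a descriptor function avoids passing `None` to RDKit, which would raise an error partway through the `apply`.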

In [4]:
# Define a function to calculate descriptors
def calc_descriptors(mol):
    return pd.Series({
        'FormalCharge': Chem.GetFormalCharge(mol),
        'MolWt': Descriptors.MolWt(mol),
        'LogP': Crippen.MolLogP(mol),
        'NumHDonors': Lipinski.NumHDonors(mol),
        'NumHAcceptors': Lipinski.NumHAcceptors(mol),
        'TPSA': rdMolDescriptors.CalcTPSA(mol),
        'NumRotatableBonds': rdMolDescriptors.CalcNumRotatableBonds(mol)
    })

In [5]:
# Apply descriptor calculation to all molecules
descriptor_df = df['Mol'].apply(calc_descriptors)
descriptor_df

|  | FormalCharge | MolWt | LogP | NumHDonors | NumHAcceptors | TPSA | NumRotatableBonds |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 194.194 | -1.0293 | 0.0 | 6.0 | 61.82 | 0.0 |
| 1 | 0.0 | 251.285 | 3.20912 | 2.0 | 2.0 | 53.09 | 1.0 |
| 2 | 0.0 | 201.203 | 0.2192 | 2.0 | 3.0 | 83.47 | 2.0 |
| 3 | 0.0 | 180.156 | -3.2214 | 5.0 | 6.0 | 110.38 | 1.0 |
| 4 | 0.0 | 221.3 | 0.6889 | 0.0 | 2.0 | 40.13 | 5.0 |
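
The count-like columns in the descriptor table above show up as floats (`0.0`, `6.0`) because `calc_descriptors` returns a single `pd.Series` that mixes integers and floats, and pandas stores such a series as `float64`. If integer counts are preferred, one option is to cast those columns back afterwards. The sketch below uses two rows copied by hand from the table so that it runs without RDKit; `count_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Two rows copied from the descriptor table above (rows 0 and 1)
descriptor_df = pd.DataFrame({
    'FormalCharge': [0.0, 0.0],
    'MolWt': [194.194, 251.285],
    'NumHDonors': [0.0, 2.0],
    'NumHAcceptors': [6.0, 2.0],
    'NumRotatableBonds': [0.0, 1.0],
})

# Columns that are whole-number counts by definition
count_cols = ['FormalCharge', 'NumHDonors', 'NumHAcceptors', 'NumRotatableBonds']

# astype with a dict casts only the listed columns and leaves the rest untouched
descriptor_df = descriptor_df.astype({col: int for col in count_cols})
print(descriptor_df.dtypes)
```

`MolWt` stays a float, while the four count columns become integers.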


In [6]:
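# Combine the descriptors with the original dataset (rows align on the shared index)
full_df = pd.concat([df, descriptor_df], axis=1)
# Display without the Mol column; drop() returns a copy, so full_df itself still contains 'Mol'
full_df.drop(columns='Mol')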

|  | Name | SMILES | Bitter | FormalCharge | MolWt | LogP | NumHDonors | NumHAcceptors | TPSA | NumRotatableBonds |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caffeine | `Cn1cnc2c1c(=O)n(c(=O)n2C)C` | 1 | 0.0 | 194.194 | -1.0293 | 0.0 | 6.0 | 61.82 | 0.0 |
| 1 | Quinine | `CC1=C(C(=O)NC2=CC=CC=C12)C3=CC=CC=C3O` | 1 | 0.0 | 251.285 | 3.20912 | 2.0 | 2.0 | 53.09 | 1.0 |
| 2 | Saccharin | `C1=CC=C(C=C1)C(=O)NS(=O)(=O)O` | 1 | 0.0 | 201.203 | 0.2192 | 2.0 | 3.0 | 83.47 | 2.0 |
| 3 | Sucrose | `C(C1C(C(C(C(O1)O)O)O)O)O` | 0 | 0.0 | 180.156 | -3.2214 | 5.0 | 6.0 | 110.38 | 1.0 |
| 4 | Denatonium | `CC[N+](C)(C)CCC1=CC=C(C=C1)C(=O)[O-]` | 1 | 0.0 | 221.3 | 0.6889 | 0.0 | 2.0 | 40.13 | 5.0 |


### YOU TRY 🧪
Can you filter the molecules to show only those that meet both of these conditions?
- A molecular weight below 350
- A LogP value greater than 1

Use `full_df.query()` or boolean masking.
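
To show the two syntaxes without giving away the answer, the sketch below filters on different columns: TPSA below 70 and at least one rotatable bond. The values are copied by hand from the combined table so that the example runs without RDKit; the `demo` frame and these thresholds are illustrative choices, not part of the exercise.

```python
import pandas as pd

# Values copied from the combined table above
demo = pd.DataFrame({
    'Name': ['Caffeine', 'Quinine', 'Saccharin', 'Sucrose', 'Denatonium'],
    'TPSA': [61.82, 53.09, 83.47, 110.38, 40.13],
    'NumRotatableBonds': [0, 1, 2, 1, 5],
})

# Option 1: query() takes the condition as a string of column names
by_query = demo.query('TPSA < 70 and NumRotatableBonds >= 1')

# Option 2: boolean masking; each condition needs its own parentheses,
# and conditions are combined with & rather than the keyword `and`
by_mask = demo[(demo['TPSA'] < 70) & (demo['NumRotatableBonds'] >= 1)]

print(by_query)
```

Both approaches select the same rows, so the choice is mostly a matter of readability. `query()` tends to be easier to read for simple conditions, while masking is handy when the condition is built up programmatically.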

## Next Step
In Lesson 3, we'll use these descriptors to train a simple machine learning model to predict bitterness!