# Lesson 2: Using RDKit to Extract Molecular Descriptors

**Objective:** Learn how to convert SMILES strings into RDKit molecule objects and compute simple molecular descriptors from them.
We'll use a small mock dataset in the style of BitterDB, with a SMILES string and a binary bitterness label for each compound, and compute properties that may be relevant to bitterness prediction.

In [None]:
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen, rdMolDescriptors, Lipinski
import pandas as pd

In [None]:
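# Example dataset: compound names, SMILES strings, and a binary Bitter label (1 = bitter, 0 = not bitter)
data = {
    'Name': ['Caffeine', 'Quinine', 'Saccharin', 'Sucrose', 'Denatonium'],
    'SMILES': [
        'Cn1cnc2c1c(=O)n(c(=O)n2C)C',               # caffeine
        'COc1ccc2nccc(C(O)C3CC4CCN3CC4C=C)c2c1',    # quinine (stereochemistry omitted)
        'O=C1NS(=O)(=O)c2ccccc12',                  # saccharin
        'OCC1OC(OC2(CO)OC(CO)C(O)C2O)C(O)C(O)C1O',  # sucrose (stereochemistry omitted)
        'CC[N+](CC)(Cc1ccccc1)CC(=O)Nc1c(C)cccc1C'  # denatonium cation (counter-ion omitted)
    ],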
    'Bitter': [1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)
df

In [None]:
# Convert SMILES to RDKit molecules (MolFromSmiles returns None for a string it cannot parse)
df['Mol'] = df['SMILES'].apply(Chem.MolFromSmiles)
df['Mol'].head()
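
Because `Chem.MolFromSmiles` returns `None` instead of raising an error when parsing fails, it can be worth checking for failed conversions before computing descriptors. The following is a minimal sketch of one way to do that; the helper name `parse_smiles_column` and the deliberately invalid string `'not_a_smiles'` are assumptions made for illustration and are not part of the lesson's dataset.

```python
from rdkit import Chem
import pandas as pd

def parse_smiles_column(smiles):
    """Hypothetical helper: parse a Series of SMILES strings.

    Returns the parsed molecules and a boolean mask marking the rows
    where RDKit could not parse the string (MolFromSmiles returned None).
    """
    mols = smiles.apply(Chem.MolFromSmiles)
    failed = mols.isna()
    return mols, failed

# Two valid SMILES and one deliberately invalid string (illustrative only)
demo = pd.Series(['CCO', 'c1ccccc1', 'not_a_smiles'])
mols, failed = parse_smiles_column(demo)

print(demo[failed].tolist())  # → ['not_a_smiles']
```

Rows flagged by the mask could then be dropped or corrected before the descriptor step, since the RDKit descriptor functions expect a valid molecule object.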

In [None]:
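# Define a function that calculates a small set of descriptors for one molecule
def calc_descriptors(mol):
    """Return a pandas Series of simple descriptors for one RDKit molecule."""
    return pd.Series({
        'FormalCharge': Chem.GetFormalCharge(mol),        # net formal charge
        'MolWt': Descriptors.MolWt(mol),                  # average molecular weight
        'LogP': Crippen.MolLogP(mol),                     # Wildman-Crippen logP estimate
        'NumHDonors': Lipinski.NumHDonors(mol),           # hydrogen-bond donors
        'NumHAcceptors': Lipinski.NumHAcceptors(mol),     # hydrogen-bond acceptors
        'TPSA': rdMolDescriptors.CalcTPSA(mol),           # topological polar surface area (Å²)
        'NumRotatableBonds': rdMolDescriptors.CalcNumRotatableBonds(mol)  # rotatable bonds
    })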
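
Before applying a descriptor function to a whole column, it can help to spot-check the underlying RDKit calls on a molecule whose properties are easy to verify by hand. The sketch below uses ethanol, which is an assumption chosen for illustration and is not part of the lesson's dataset: it has one hydroxyl group, so it should have one hydrogen-bond donor, one acceptor, and no formal charge.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, rdMolDescriptors

# Ethanol is used only as an easy-to-verify test molecule (assumption for this sketch)
ethanol = Chem.MolFromSmiles('CCO')

mol_wt = Descriptors.MolWt(ethanol)            # average molecular weight
h_donors = Lipinski.NumHDonors(ethanol)        # the single O-H group
h_acceptors = Lipinski.NumHAcceptors(ethanol)  # the single oxygen atom
tpsa = rdMolDescriptors.CalcTPSA(ethanol)      # polar surface area contributed by the O-H
formal_charge = Chem.GetFormalCharge(ethanol)  # neutral molecule

print(f'MolWt={mol_wt:.2f}, HBD={h_donors}, HBA={h_acceptors}, '
      f'TPSA={tpsa:.2f}, charge={formal_charge}')
```

If the printed values disagree with what you expect for such a simple molecule, the problem is more likely in the SMILES string or the choice of function than in the dataset itself.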

In [None]:
# Apply descriptor calculation to all molecules
descriptor_df = df['Mol'].apply(calc_descriptors)
descriptor_df

In [None]:
# Combine with original dataset
full_df = pd.concat([df, descriptor_df], axis=1)
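# Display the table without the Mol column (drop returns a new DataFrame; full_df itself is unchanged)
full_df.drop(columns='Mol')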

### YOU TRY 🧪
Can you filter `full_df` to show only the molecules that meet both of these conditions?
- Molecular weight (`MolWt`) below 350
- LogP (`LogP`) greater than 1

Use `full_df.query()` or boolean masking.
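
As a hint, here is a sketch of both filtering styles applied to a different question, so the exercise itself is left to you. It builds a small toy table (the molecules, the name `toy`, and the thresholds are assumptions chosen for illustration) and keeps only rows with at least one hydrogen-bond donor and a TPSA below 30.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Lipinski, rdMolDescriptors

# Toy table, separate from the lesson's dataset (illustrative molecules only)
toy = pd.DataFrame({
    'Name': ['Ethanol', 'Benzene', 'Acetic acid'],
    'SMILES': ['CCO', 'c1ccccc1', 'CC(=O)O'],
})
mols = toy['SMILES'].apply(Chem.MolFromSmiles)
toy['NumHDonors'] = mols.apply(Lipinski.NumHDonors)
toy['TPSA'] = mols.apply(rdMolDescriptors.CalcTPSA)

# Style 1: query() takes the condition as a string of column names
by_query = toy.query('NumHDonors >= 1 and TPSA < 30')

# Style 2: boolean masking; each condition needs its own parentheses around &
by_mask = toy[(toy['NumHDonors'] >= 1) & (toy['TPSA'] < 30)]

print(by_query['Name'].tolist())  # → ['Ethanol']
```

Both styles select the same rows; `query()` is often easier to read, while masking is handy when a column name is not a valid Python identifier.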

## Next Step
In Lesson 3, we'll use these descriptors to train a simple machine learning model to predict bitterness!