# Lesson 1: Exploring Molecular Datasets with Pandas

**Objective:** Learn how to explore chemical datasets using `pandas`.

We'll use a small dataset inspired by **BitterDB** — a resource for bitter and non-bitter compounds.

In [1]:
# Import pandas
import pandas as pd

In [2]:
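# Create a small mock BitterDB dataset.
# The SMILES strings are illustrative only and several do not match the named
# compound: the "Sucrose" entry, for example, encodes a hexose (C6H12O6),
# not the disaccharide sucrose (C12H22O11).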
data = {
    "Name": ["Caffeine", "Quinine", "Saccharin", "Sucrose", "Denatonium"],
    "SMILES": [
        "Cn1cnc2c1c(=O)n(c(=O)n2C)C",
        "CC1=C(C(=O)NC2=CC=CC=C12)C3=CC=CC=C3O",
        "C1=CC=C(C=C1)C(=O)NS(=O)(=O)O",
        "C(C1C(C(C(C(O1)O)O)O)O)O",
        "CC[N+](C)(C)CCC1=CC=C(C=C1)C(=O)[O-]"
    ],
    "Bitter": [1, 1, 1, 0, 1],
    "Class": ["Alkaloid", "Alkaloid", "Artificial Sweetener", "Sugar", "Quaternary Ammonium"]
}
df = pd.DataFrame(data)
df

| | Name | SMILES | Bitter | Class |
|---|---|---|---|---|
| 0 | Caffeine | `Cn1cnc2c1c(=O)n(c(=O)n2C)C` | 1 | Alkaloid |
| 1 | Quinine | `CC1=C(C(=O)NC2=CC=CC=C12)C3=CC=CC=C3O` | 1 | Alkaloid |
| 2 | Saccharin | `C1=CC=C(C=C1)C(=O)NS(=O)(=O)O` | 1 | Artificial Sweetener |
| 3 | Sucrose | `C(C1C(C(C(C(O1)O)O)O)O)O` | 0 | Sugar |
| 4 | Denatonium | `CC[N+](C)(C)CCC1=CC=C(C=C1)C(=O)[O-]` | 1 | Quaternary Ammonium |


### Instructor Note
- Ask students: *Which of these molecules are bitter? Why might that be?*
- Highlight structure-property relationship exploration.

## Exploring the BitterDB Data

In [3]:
# What are the column names?
df.columns

Index(['Name', 'SMILES', 'Bitter', 'Class'], dtype='object')

In [4]:
# How many molecules are in the dataset?
df.shape

(5, 4)

In [5]:
# Get summary statistics for numeric columns
df.describe()

| | Bitter |
|---|---|
| count | 5.0 |
| mean | 0.8 |
| std | 0.447214 |
| min | 0.0 |
| 25% | 1.0 |
| 50% | 1.0 |
| 75% | 1.0 |
| max | 1.0 |
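
Because `Bitter` is a 0/1 column, its mean is simply the fraction of molecules labeled bitter, and the `std` row is the sample standard deviation (dividing by n - 1). A minimal sketch that reproduces both numbers with only the standard library; the variable names are illustrative:

```python
import math
import statistics

# The Bitter column from the dataset above
bitter = [1, 1, 1, 0, 1]

# Mean of a 0/1 column = fraction of rows equal to 1
mean = statistics.mean(bitter)

# Sample standard deviation (divides by n - 1)
std = statistics.stdev(bitter)

# The same calculation written out by hand
n = len(bitter)
manual_std = math.sqrt(sum((x - mean) ** 2 for x in bitter) / (n - 1))

print(mean, round(std, 6), round(manual_std, 6))
```

Both `std` and `manual_std` round to the 0.447214 reported by `describe()`.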


In [6]:
# Count how many molecules are bitter vs not bitter
df['Bitter'].value_counts()

Bitter
1    4
0    1
Name: count, dtype: int64

In [7]:
# How many molecules per Class?
df['Class'].value_counts()

Class
Alkaloid                2
Artificial Sweetener    1
Sugar                   1
Quaternary Ammonium     1
Name: count, dtype: int64
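
The two `value_counts()` calls above look at `Bitter` and `Class` separately. As a sketch of one way to view them together, `pd.crosstab` counts molecules for each combination of class and label. The dataset is rebuilt here (without the SMILES column) so the block runs on its own; `class_by_bitter` is just an illustrative name:

```python
import pandas as pd

# Rebuild the relevant columns of the mock dataset
df = pd.DataFrame({
    "Name": ["Caffeine", "Quinine", "Saccharin", "Sucrose", "Denatonium"],
    "Bitter": [1, 1, 1, 0, 1],
    "Class": ["Alkaloid", "Alkaloid", "Artificial Sweetener", "Sugar",
              "Quaternary Ammonium"],
})

# Rows: chemical class; columns: Bitter label (0 or 1); cells: molecule counts
class_by_bitter = pd.crosstab(df["Class"], df["Bitter"])
print(class_by_bitter)
```

With only five molecules the table is mostly zeros and ones, but the same call scales to larger datasets.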

In [8]:
# Optional: Filter only bitter molecules
df[df['Bitter'] == 1]

| | Name | SMILES | Bitter | Class |
|---|---|---|---|---|
| 0 | Caffeine | `Cn1cnc2c1c(=O)n(c(=O)n2C)C` | 1 | Alkaloid |
| 1 | Quinine | `CC1=C(C(=O)NC2=CC=CC=C12)C3=CC=CC=C3O` | 1 | Alkaloid |
| 2 | Saccharin | `C1=CC=C(C=C1)C(=O)NS(=O)(=O)O` | 1 | Artificial Sweetener |
| 4 | Denatonium | `CC[N+](C)(C)CCC1=CC=C(C=C1)C(=O)[O-]` | 1 | Quaternary Ammonium |
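
The filter above uses a single boolean condition. A short sketch of how conditions can be combined: each comparison goes in its own parentheses and they are joined with `&` (and) or `|` (or), while `isin` tests membership in a list. The dataset is rebuilt without the SMILES column so the block is self-contained; the result names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Caffeine", "Quinine", "Saccharin", "Sucrose", "Denatonium"],
    "Bitter": [1, 1, 1, 0, 1],
    "Class": ["Alkaloid", "Alkaloid", "Artificial Sweetener", "Sugar",
              "Quaternary Ammonium"],
})

# Two conditions joined with & (both must be true for a row to be kept)
bitter_alkaloids = df[(df["Bitter"] == 1) & (df["Class"] == "Alkaloid")]

# Membership test: keep rows whose Class is in the given list
sweet_classes = df[df["Class"].isin(["Sugar", "Artificial Sweetener"])]

print(bitter_alkaloids["Name"].tolist())
print(sweet_classes["Name"].tolist())
```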


### YOU TRY 🧪
Filter the dataset to show only **non-bitter** molecules. Can you sort them alphabetically by name?

In [9]:
# Keep only non-bitter molecules, then sort alphabetically by name
df[df['Bitter'] == 0].sort_values('Name')

| | Name | SMILES | Bitter | Class |
|---|---|---|---|---|
| 3 | Sucrose | `C(C1C(C(C(C(O1)O)O)O)O)O` | 0 | Sugar |
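
With a single non-bitter molecule, sorting has no visible effect. To see alphabetical sorting in action, here is a sketch that applies `sort_values` to the whole table (rebuilt without the SMILES column so it runs on its own); the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Caffeine", "Quinine", "Saccharin", "Sucrose", "Denatonium"],
    "Bitter": [1, 1, 1, 0, 1],
})

# Ascending (A to Z) is the default; the original row labels are preserved
by_name = df.sort_values("Name")

# Pass ascending=False for Z to A
by_name_desc = df.sort_values("Name", ascending=False)

print(by_name["Name"].tolist())
print(by_name_desc["Name"].tolist())
```

Note that `sort_values` returns a new DataFrame and leaves `df` unchanged.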


## What's Next?
In the next lesson, we'll use a tool called **RDKit** to turn each molecule's SMILES string into **descriptors** like molecular weight and logP — the kinds of features used in machine learning.


# Lesson 2: Calculating Molecular Descriptors with RDKit

In [10]:

# Import necessary libraries
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen, Lipinski

# Recreate the BitterDB dataset from Lesson 1
data = {
    "Name": ["Caffeine", "Quinine", "Saccharin", "Sucrose", "Denatonium"],
    "SMILES": [
        "Cn1cnc2c1c(=O)n(c(=O)n2C)C",
        "CC1=C(C(=O)NC2=CC=CC=C12)C3=CC=CC=C3O", 
        "C1=CC=C(C=C1)C(=O)NS(=O)(=O)O",
        "C(C1C(C(C(C(O1)O)O)O)O)O",
        "CC[N+](C)(C)CCC1=CC=C(C=C1)C(=O)[O-]"
    ],
    "Bitter": [1, 1, 1, 0, 1],
    "Class": ["Alkaloid", "Alkaloid", "Artificial Sweetener", "Sugar", "Quaternary Ammonium"]
}
df = pd.DataFrame(data)

# Convert SMILES to RDKit molecule objects
df['Molecule'] = df['SMILES'].apply(Chem.MolFromSmiles)

# Check if any molecules failed to parse
print("Molecules that failed to parse:")
print(df[df['Molecule'].isna()]['Name'].tolist())

# Calculate basic molecular descriptors
df['MolWt'] = df['Molecule'].apply(lambda mol: Descriptors.MolWt(mol) if mol else None)
df['LogP'] = df['Molecule'].apply(lambda mol: Crippen.MolLogP(mol) if mol else None)
df['NumHDonors'] = df['Molecule'].apply(lambda mol: Descriptors.NumHDonors(mol) if mol else None)
df['NumHAcceptors'] = df['Molecule'].apply(lambda mol: Descriptors.NumHAcceptors(mol) if mol else None)

# Display the enhanced dataset
print("Enhanced dataset with molecular descriptors:")
df_display = df.drop('Molecule', axis=1)  # Remove molecule objects for cleaner display
print(df_display)

# Calculate TPSA and rotatable-bond count
# (not part of Lipinski's Rule of Five, but commonly reported alongside it)
df['TPSA'] = df['Molecule'].apply(lambda mol: Descriptors.TPSA(mol) if mol else None)
df['NumRotatableBonds'] = df['Molecule'].apply(lambda mol: Descriptors.NumRotatableBonds(mol) if mol else None)

# Check Lipinski's Rule of Five compliance
def check_lipinski(row):
    """Count Lipinski Rule of Five violations (None if descriptors are missing)"""
    if pd.isna(row['MolWt']):
        return None
    
    violations = 0
    if row['MolWt'] > 500: violations += 1
    if row['LogP'] > 5: violations += 1
    if row['NumHDonors'] > 5: violations += 1
    if row['NumHAcceptors'] > 10: violations += 1
    
    return violations

df['Lipinski_Violations'] = df.apply(check_lipinski, axis=1)

print("\nLipinski's Rule of Five Analysis:")
print(df[['Name', 'MolWt', 'LogP', 'NumHDonors', 'NumHAcceptors', 'Lipinski_Violations']])

# Explore relationship between descriptors and bitterness
print("\nDescriptor statistics by bitterness:")
descriptor_cols = ['MolWt', 'LogP', 'NumHDonors', 'NumHAcceptors', 'TPSA']
bitter_stats = df.groupby('Bitter')[descriptor_cols].mean()
print(bitter_stats)

# Additional useful descriptors for cheminformatics
df['NumHeavyAtoms'] = df['Molecule'].apply(lambda mol: Descriptors.HeavyAtomCount(mol) if mol else None)
df['NumAromaticRings'] = df['Molecule'].apply(lambda mol: Descriptors.NumAromaticRings(mol) if mol else None)

print("\nAdditional structural descriptors:")
additional_cols = ['Name', 'NumHeavyAtoms', 'NumAromaticRings']
print(df[additional_cols])

# Create a function to calculate multiple descriptors at once
def calculate_descriptors(smiles):
    """Calculate a comprehensive set of molecular descriptors from SMILES"""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {desc: None for desc in ['MolWt', 'LogP', 'NumHDonors', 'NumHAcceptors', 
                                       'TPSA', 'NumRotatableBonds', 'NumHeavyAtoms', 
                                       'NumAromaticRings']}
    
    return {
        'MolWt': Descriptors.MolWt(mol),
        'LogP': Crippen.MolLogP(mol),
        'NumHDonors': Descriptors.NumHDonors(mol),
        'NumHAcceptors': Descriptors.NumHAcceptors(mol),
        'TPSA': Descriptors.TPSA(mol),
        'NumRotatableBonds': Descriptors.NumRotatableBonds(mol),
        'NumHeavyAtoms': Descriptors.HeavyAtomCount(mol),
        'NumAromaticRings': Descriptors.NumAromaticRings(mol)
    }

# Example of using the function on a new molecule
new_smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin
aspirin_descriptors = calculate_descriptors(new_smiles)
print(f"\nAspirin descriptors: {aspirin_descriptors}")

# Save the enhanced dataset
df_final = df.drop('Molecule', axis=1)  # Remove molecule objects before saving
# df_final.to_csv('bitterdb_with_descriptors.csv', index=False)
print("\nDataset ready for machine learning!")
print(f"Shape: {df_final.shape}")
print(f"Columns: {df_final.columns.tolist()}")

Molecules that failed to parse:
[]
Enhanced dataset with molecular descriptors:
         Name                                 SMILES  Bitter  \
0    Caffeine             Cn1cnc2c1c(=O)n(c(=O)n2C)C       1   
1     Quinine  CC1=C(C(=O)NC2=CC=CC=C12)C3=CC=CC=C3O       1   
2   Saccharin          C1=CC=C(C=C1)C(=O)NS(=O)(=O)O       1   
3     Sucrose               C(C1C(C(C(C(O1)O)O)O)O)O       0   
4  Denatonium   CC[N+](C)(C)CCC1=CC=C(C=C1)C(=O)[O-]       1   

                  Class    MolWt     LogP  NumHDonors  NumHAcceptors  
0              Alkaloid  194.194 -1.02930           0              6  
1              Alkaloid  251.285  3.20912           2              2  
2  Artificial Sweetener  201.203  0.21920           2              3  
3                 Sugar  180.156 -3.22140           5              6  
4   Quaternary Ammonium  221.300  0.68890           0              2  

Lipinski's Rule of Five Analysis:
         Name    MolWt     LogP  NumHDonors  NumHAcceptors  \
0    Caffein
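
The rule-of-five count in the cell above can be checked by hand against the descriptor values printed in the first output table. A minimal sketch in plain Python, with the values copied from that output; the `count_violations` helper and the `extreme` example are hypothetical and only for illustration:

```python
def count_violations(mol_wt, logp, h_donors, h_acceptors):
    """Count how many of the four Rule of Five thresholds are exceeded."""
    checks = [
        mol_wt > 500,       # molecular weight above 500
        logp > 5,           # LogP above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)

# (MolWt, LogP, NumHDonors, NumHAcceptors) as printed in the output above
descriptors = {
    "Caffeine":   (194.194, -1.02930, 0, 6),
    "Quinine":    (251.285,  3.20912, 2, 2),
    "Saccharin":  (201.203,  0.21920, 2, 3),
    "Sucrose":    (180.156, -3.22140, 5, 6),
    "Denatonium": (221.300,  0.68890, 0, 2),
}

violations = {name: count_violations(*vals) for name, vals in descriptors.items()}
print(violations)

# Hypothetical descriptor values that break all four thresholds
extreme = count_violations(600.0, 6.0, 6, 11)
print(extreme)
```

The thresholds are strict inequalities, so the "Sucrose" entry with exactly 5 donors does not count as a violation.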

## Proposed "BitterSOL" Model

Based on the literature and chemical intuition about bitterness, I'd suggest this simplified logistic model, which is linear in the descriptors:

$$P(\text{Bitter}) = \operatorname{logistic}(\beta_0 + \beta_1 \cdot \text{LogP} + \beta_2 \cdot \text{MW} + \beta_3 \cdot \text{TPSA} + \beta_4 \cdot \text{Charge})$$

The key descriptors would be:

- **LogP**: lipophilicity (bitter compounds often need some lipophilicity to interact with taste receptors)
- **Molecular Weight (MW)**: size matters for receptor binding
- **TPSA**: Topological Polar Surface Area (captures the "surface properties" mentioned in BitterPredict)
- **Formal Charge**: total charge (specifically mentioned as important in BitterPredict)

This parallels ESOL's simplicity but uses descriptors more relevant to taste perception. The key differences from ESOL:

- Uses TPSA instead of aromatic proportion (more relevant to bitter receptor binding)
- Uses formal charge instead of rotatable bonds (charge interactions are crucial for taste)
- Uses logistic regression instead of linear regression (since we're predicting binary bitter/not-bitter)
Would you like me to show how to implement this in your lesson? We could:

- Calculate these four descriptors for your BitterDB molecules
- Fit a simple logistic regression model
- Compare it to the complexity of modern ML approaches like BitterPredict
- Let students experiment with different descriptor combinations

This would make a great educational tool showing how domain knowledge can guide feature selection for predictive modeling!