# Chemical Data Analysis Workflow: Filtering Compounds by Molecular Weight

This notebook demonstrates a complete workflow for processing chemical data:
1. Loading compound data from CSV
2. Converting SMILES to molecular objects
3. Calculating molecular descriptors
4. Filtering based on criteria
5. Exporting results

**Required Libraries**: pandas, rdkit

## Step 1: Import Required Libraries

In [1]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
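import warnings

# Silence Python warnings to keep notebook output compact.
# RDKit reports SMILES parse errors through its own logger, so those still appear.
warnings.filterwarnings('ignore')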

## Step 2: Load Compound Data

Load a CSV file containing compound names and SMILES strings. The file should have at least two columns:
- `Name`: Compound name
- `SMILES`: SMILES notation for the molecular structure

In [2]:
# Load compounds from CSV file
df = pd.read_csv("compounds.csv")

print(f"Loaded {len(df)} compounds from file\n")
print("First few entries:")
df.head(3)

Loaded 10 compounds from file

First few entries:


|   | Name | SMILES |
|---|------|--------|
| 0 | Aspirin | CC(=O)OC1=CC=CC=C1C(=O)O |
| 1 | Caffeine | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
| 2 | Ibuprofen | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O |
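
Checking that the expected columns are present can catch a malformed input file before any chemistry runs. The sketch below shows one possible check. The helper name `missing_columns` and the inline sample data are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ("Name", "SMILES")


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the DataFrame."""
    return [col for col in required if col not in frame.columns]


# Inline stand-in for compounds.csv so the example runs on its own
sample_csv = io.StringIO(
    "Name,SMILES\n"
    "Aspirin,CC(=O)OC1=CC=CC=C1C(=O)O\n"
    "Ethanol,CCO\n"
)
sample_df = pd.read_csv(sample_csv)

missing = missing_columns(sample_df)
if missing:
    raise ValueError(f"Input file is missing required columns: {missing}")
print(f"All required columns present: {list(REQUIRED_COLUMNS)}")
```

In the notebook itself, the same helper could be called on `df` immediately after `pd.read_csv("compounds.csv")`.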


## Step 3: Convert SMILES to Molecular Objects

RDKit's `MolFromSmiles()` function parses a SMILES string and creates a molecule object that can be used for descriptor calculations. A SMILES string that cannot be parsed yields `None` rather than raising an exception.

In [3]:
# Convert SMILES to molecular objects
df["Mol"] = df["SMILES"].apply(Chem.MolFromSmiles)

# Count valid and invalid molecules
valid_count = df["Mol"].notnull().sum()
invalid_count = df["Mol"].isnull().sum()

print(f"Successfully converted {valid_count} SMILES strings to molecule objects")
print(f"Invalid SMILES: {invalid_count}")

Successfully converted 10 SMILES strings to molecule objects
Invalid SMILES: 0
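
`MolFromSmiles()` expects a string, so a blank CSV cell (which pandas reads as `NaN`) would typically raise an error instead of returning `None`. A more defensive parser might look like the sketch below, which assumes that blank and non-string entries should simply be treated as invalid. The helper name `safe_mol_from_smiles` is hypothetical.

```python
from rdkit import Chem


def safe_mol_from_smiles(smiles):
    """Parse a SMILES string, returning None for blank, non-string, or invalid input."""
    if not isinstance(smiles, str) or not smiles.strip():
        return None
    # MolFromSmiles itself returns None when the string cannot be parsed
    return Chem.MolFromSmiles(smiles.strip())


examples = ["CCO", "C1CC(", "", float("nan")]
for entry in examples:
    status = "valid" if safe_mol_from_smiles(entry) is not None else "invalid"
    print(f"{entry!r}: {status}")
```

It can be swapped in for the direct call: `df["Mol"] = df["SMILES"].apply(safe_mol_from_smiles)`.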


## Step 4: Remove Invalid Molecules

Before calculating descriptors, remove any rows with invalid SMILES (where `Mol` is `None`).

In [4]:
# Drop rows where molecule object is None (invalid SMILES).
# .copy() makes the result an independent DataFrame, so adding columns later is safe.
df = df[df["Mol"].notnull()].copy()

print(f"Dataset after removing invalid molecules: {len(df)} compounds")

Dataset after removing invalid molecules: 10 compounds
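
When some entries do fail to parse, it is often useful to keep a record of what was dropped. The sketch below splits a toy DataFrame into valid and rejected rows. The data is made up for illustration, and plain strings stand in for RDKit molecule objects so the example needs only pandas.

```python
import pandas as pd

# Toy frame: None in "Mol" stands in for a SMILES string that failed to parse,
# and the "<mol>" strings stand in for RDKit molecule objects
toy = pd.DataFrame(
    {
        "Name": ["Ethanol", "Broken entry", "Benzene"],
        "SMILES": ["CCO", "C1CC(", "C1=CC=CC=C1"],
        "Mol": ["<mol>", None, "<mol>"],
    }
)

is_valid = toy["Mol"].notna()
rejected = toy.loc[~is_valid, ["Name", "SMILES"]]
valid = toy.loc[is_valid].reset_index(drop=True)

print(f"Kept {len(valid)} rows, rejected {len(rejected)}")
print(rejected.to_string(index=False))
```

Resetting the index on the valid rows is optional. It keeps row labels contiguous after rows are removed.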


## Step 5: Calculate Molecular Weight

The `Descriptors.MolWt()` function returns the average molecular weight: the sum of the standard atomic weights of all atoms in the molecule, including the hydrogens that SMILES leaves implicit.
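
As a worked check of that sum, the arithmetic for ethanol (C2H6O) can be done by hand. The atomic weights below are standard values rounded to three decimals and are an assumption of this sketch. RDKit uses its own internal table.

```python
# Standard atomic weights in g/mol, rounded to three decimals (assumed values)
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "O": 15.999}

# Ethanol, C2H6O: atom counts include the hydrogens that SMILES leaves implicit
ethanol_formula = {"C": 2, "H": 6, "O": 1}

ethanol_mw = sum(
    ATOMIC_WEIGHT[element] * count for element, count in ethanol_formula.items()
)
print(f"Ethanol: {ethanol_mw:.3f} g/mol")  # 2*12.011 + 6*1.008 + 15.999 = 46.069
```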

In [5]:
# Calculate molecular weights for valid molecules
df["MolWt"] = df["Mol"].apply(Descriptors.MolWt)

# Display statistics
print("Molecular weight statistics:")
print(f"  Mean: {df['MolWt'].mean():.2f} g/mol")
print(f"  Min: {df['MolWt'].min():.2f} g/mol")
print(f"  Max: {df['MolWt'].max():.2f} g/mol")
print(f"  Std: {df['MolWt'].std():.2f} g/mol")

Molecular weight statistics:
  Mean: 234.15 g/mol
  Min: 46.07 g/mol
  Max: 584.68 g/mol
  Std: 165.43 g/mol


## Step 6: View Complete Dataset

Display the dataset with calculated molecular weights:

In [6]:
# Display dataset (excluding Mol column for readability)
df[["Name", "SMILES", "MolWt"]].head()

|   | Name | SMILES | MolWt |
|---|------|--------|-------|
| 0 | Aspirin | CC(=O)OC1=CC=CC=C1C(=O)O | 180.159 |
| 1 | Caffeine | CN1C=NC2=C1C(=O)N(C(=O)N2C)C | 194.194 |
| 2 | Ibuprofen | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O | 206.285 |
| 3 | Ethanol | CCO | 46.069 |
| 4 | Benzene | C1=CC=CC=C1 | 78.114 |


## Step 7: Filter Compounds by Molecular Weight

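Apply a molecular weight cutoff of 500 g/mol, the molecular weight criterion from Lipinski's Rule of Five for drug-likeness. Compounds above this cutoff are more likely to have poor oral bioavailability. The code uses a strict `<` comparison, so a compound weighing exactly 500 g/mol would be excluded.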

In [7]:
# Filter compounds with molecular weight less than 500 g/mol
mw_cutoff = 500
initial_count = len(df)
filtered = df[df["MolWt"] < mw_cutoff].copy()
final_count = len(filtered)

print(f"Compounds before filtering: {initial_count}")
print(f"Compounds after filtering (MW < {mw_cutoff}): {final_count}")
print(f"Removed: {initial_count - final_count} compounds ({(initial_count - final_count)/initial_count*100:.1f}%)")

Compounds before filtering: 10
Compounds after filtering (MW < 500): 8
Removed: 2 compounds (20.0%)
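
Lipinski's criterion is usually stated as a molecular weight of at most 500, whereas the cell above uses a strict `<`. The two differ only for a compound that lands exactly on the cutoff. The toy values below are made up for illustration (they are not taken from `compounds.csv`) and show the difference.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Name": ["below", "on_cutoff", "above"],
        "MolWt": [499.9, 500.0, 500.1],
    }
)
mw_cutoff = 500

strict = toy[toy["MolWt"] < mw_cutoff]
inclusive = toy[toy["MolWt"] <= mw_cutoff]

print("Strict (<):    ", strict["Name"].tolist())
print("Inclusive (<=):", inclusive["Name"].tolist())
```

For real compounds a weight of exactly 500.000 g/mol is unlikely, so the choice rarely changes results, but it is worth stating which convention a filter follows.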


## Step 8: Display Filtered Results

In [8]:
print("Filtered compounds (first 5):")
filtered[["Name", "MolWt"]].head()

Filtered compounds (first 5):


|   | Name | MolWt |
|---|------|-------|
| 0 | Aspirin | 180.159 |
| 1 | Caffeine | 194.194 |
| 2 | Ibuprofen | 206.285 |
| 3 | Ethanol | 46.069 |
| 4 | Benzene | 78.114 |


## Step 9: Save Filtered Results

Export the filtered dataset to a new CSV file for further analysis or sharing with collaborators.

In [9]:
# Save to CSV (excluding the Mol object column)
output_columns = ["Name", "SMILES", "MolWt"]
filtered[output_columns].to_csv("filtered_compounds.csv", index=False)

print("✓ Filtered results saved to 'filtered_compounds.csv'")
print(f"  File contains {len(filtered)} compounds")

✓ Filtered results saved to 'filtered_compounds.csv'
  File contains 8 compounds
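
A quick round trip (write, then read back) is one way to confirm that the exported file has the intended columns. The sketch below uses a temporary directory and a two-row toy DataFrame, both assumptions made so the example runs on its own without touching `filtered_compounds.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the filtered DataFrame; the "Mol" column holds placeholders
toy_filtered = pd.DataFrame(
    {
        "Name": ["Ethanol", "Benzene"],
        "SMILES": ["CCO", "C1=CC=CC=C1"],
        "MolWt": [46.069, 78.114],
        "Mol": ["<mol>", "<mol>"],
    }
)
output_columns = ["Name", "SMILES", "MolWt"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "filtered_compounds.csv")
    # index=False keeps the row index out of the file
    toy_filtered[output_columns].to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
print(f"Rows written: {len(reloaded)}")
```

Without `index=False`, the row index would be written as an unnamed first column and would come back as `Unnamed: 0` when the file is reloaded.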


## Output Explanation

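The final output is a DataFrame containing the compounds that passed the filter. Key points:

**Column Structure**:
- `Name`: Compound name, unchanged from the input
- `SMILES`: Input SMILES string (written to the CSV, omitted from the Step 8 display)
- `MolWt`: Calculated molecular weight in g/mol (displayed to 3 decimal places)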

**Data Quality**:
- All displayed compounds have molecular weight < 500 g/mol
- Invalid SMILES strings were removed in earlier steps
- Molecular weights are average values computed from standard atomic weights, not monoisotopic (exact) masses

**Chemical Interpretation**:
- **Ethanol (46.07 g/mol)**: Simple alcohol, the lightest compound in the dataset
- **Benzene (78.11 g/mol)**: Simplest aromatic hydrocarbon
- **Aspirin (180.16 g/mol)**: Common analgesic, drug-like properties
- **Caffeine (194.19 g/mol)**: Stimulant, heterocyclic compound
- **Ibuprofen (206.29 g/mol)**: NSAID, good oral bioavailability

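All retained compounds satisfy the molecular weight cutoff applied here (< 500 g/mol). Molecular weight is only one of the four Rule of Five criteria, so this filter is a first screen rather than evidence of favorable oral bioavailability.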

## Workflow Summary

This notebook demonstrated a complete cheminformatics workflow:

1. **Data Loading**: Read compound data from external CSV files
2. **Structure Parsing**: Convert SMILES strings to RDKit molecule objects
3. **Quality Control**: Identify and remove invalid molecular structures
4. **Descriptor Calculation**: Compute molecular weight for all valid compounds
5. **Data Filtering**: Apply molecular weight cutoff for drug-likeness
6. **Result Export**: Save filtered data for downstream analysis

### Key Concepts Illustrated:

- **Pandas integration**: Efficient handling of chemical datasets
- **RDKit capabilities**: SMILES parsing and descriptor calculation
- **Data quality**: Handling invalid structures gracefully
- **Filtering logic**: Applying a chemical criterion (the molecular weight cutoff from Lipinski's Rule of Five)
- **Reproducibility**: Complete workflow from input to output

### Extending This Workflow:

This template can be adapted for:
- Calculating additional descriptors (LogP, TPSA, H-bond donors/acceptors)
- Multi-criteria filtering (drug-likeness, ADMET properties)
- Batch processing of large compound libraries
- Integration with machine learning pipelines
- Automated reporting and visualization
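
As a starting point for the first two extensions, the sketch below computes several descriptors and applies a multi-criteria filter. The helper name `descriptor_table`, the two-compound input, and the column labels are illustrative assumptions. The thresholds are the commonly quoted Rule of Five values, and `MolLogP` is RDKit's estimated (Crippen) LogP rather than a measured one.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Descriptor functions from rdkit.Chem.Descriptors, keyed by output column label
DESCRIPTOR_FUNCS = {
    "MolWt": Descriptors.MolWt,
    "LogP": Descriptors.MolLogP,
    "TPSA": Descriptors.TPSA,
    "HBD": Descriptors.NumHDonors,
    "HBA": Descriptors.NumHAcceptors,
}


def descriptor_table(smiles_by_name):
    """Build a DataFrame of descriptors, skipping SMILES that fail to parse."""
    rows = []
    for name, smiles in smiles_by_name.items():
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        row = {"Name": name}
        row.update({label: func(mol) for label, func in DESCRIPTOR_FUNCS.items()})
        rows.append(row)
    return pd.DataFrame(rows)


table = descriptor_table(
    {
        "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
        "Ethanol": "CCO",
    }
)

# Multi-criteria filter: every condition must hold for a row to be kept
passes = table[
    (table["MolWt"] <= 500)
    & (table["LogP"] <= 5)
    & (table["HBD"] <= 5)
    & (table["HBA"] <= 10)
]
print(passes[["Name", "MolWt", "LogP"]].round(2))
```

Keeping the descriptor functions in a dictionary means a new descriptor can be added with a single entry, without changing the loop.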

### Representative of Modern Computational Chemistry:

This workflow exemplifies how Python combines:
- **Chemical knowledge** (molecular structure, drug-likeness)
- **Programming skills** (data manipulation, logical filtering)
- **Statistical thinking** (descriptor analysis, cutoff selection)
- **Practical utility** (reproducible, shareable, scalable)

The same principles apply to more complex tasks in drug discovery, materials design, and chemical data mining.

## Exercises

Try modifying this notebook to:

1. Calculate additional descriptors: LogP, TPSA, number of rotatable bonds
2. Filter by multiple criteria (e.g., MW < 500 AND LogP < 5)
3. Create a visualization of molecular weight distribution
4. Identify compounds that violate Lipinski's Rule of Five
5. Compare filtered vs. rejected compounds using boxplots