# FLPCO2DB Project: CO₂ Activation by Frustrated Lewis Pairs
## Machine Learning for Molecular Property Prediction

**Project:** A Machine Learning Framework for CO₂ Activation by Frustrated Lewis Pairs  
**Team:** Error 404  
**Course:** 06-731 Molecular Machine Learning  

---

## Project Overview

### The Challenge

Rising atmospheric CO₂ is a critical global challenge. **Frustrated Lewis Pairs (FLPs)** offer a promising metal-free approach to CO₂ capture and activation through cooperative acid-base chemistry. However, the vast combinatorial space of Lewis acid and base pairs makes experimental screening impractical.

### Your Mission

Build a machine learning framework to:
1. **Predict CO₂ binding energies** for novel FLP combinations
2. **Rank FLP candidates** for experimental validation
3. **Discover design principles** for optimal CO₂ activation

### What You Have

- **Curated FLP-CO₂ database** with DFT-computed structures and energies (133 entries)
- **Reference workflows** from `mml_studio_07` for molecular ML
- **Computational tools**: RDKit, XTB, morfeus, scikit-learn

### What You Need to Build

1. **Feature engineering pipeline** (fingerprints + QM descriptors)
2. **Baseline ML models** (Ridge, Lasso, Random Forest)
3. **Model evaluation framework** with proper cross-validation
4. **Candidate ranking system** with uncertainty quantification

---

## Learning Objectives

By completing this project, you will:

1. **Apply molecular parametrization** to a real chemistry problem
2. **Engineer features** from both 2D structure and 3D geometry
3. **Train and evaluate** regression models for molecular property prediction
4. **Interpret models** to extract chemical insights
5. **Handle real-world data** with missing values and chemical complexity
6. **Make predictions** with uncertainty estimates for experimental validation

---

## Setup and Imports

In [1]:
# Standard imports
import sys
import os
from pathlib import Path

# Add parent directory to path to import from src/
sys.path.append(str(Path.cwd().parent))

# Import utilities (includes RDKit, XTB setup, plotting functions)
from utils import *

# ML libraries
from sklearn.linear_model import Ridge, Lasso, BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# FLPCO2DB specific imports
import yaml

---

# Part 1: Data Exploration

## Loading the FLPCO2DB Registry

The FLPCO2DB registry contains curated data for 133 Frustrated Lewis Pairs with:
- **Provenance**: XYZ file paths, source papers
- **Structure**: SMILES strings, InChI keys
- **Energies**: Recovered DFT binding energies (gas and solution phase)
- **QC Flags**: Data quality indicators

Let's start by loading and exploring the registry:

In [2]:
# Load the central registry
registry_path = Path.cwd().parent / "data" / "processed" / "co2_registry.yaml"

with open(registry_path, 'r') as f:
    registry = yaml.safe_load(f)

print("Registry Overview:")
print(f"  Dataset version: {registry['dataset_version']}")
print(f"  Generated: {registry['generated_at']}")
print(f"\nDataset Statistics:")
for key, value in registry['counts'].items():
    print(f"  {key}: {value}")

print(f"\nTotal entries: {len(registry['entries'])}")

Registry Overview:
  Dataset version: 0.1.0
  Generated: 2025-11-06T17:44:46.144544

Dataset Statistics:
  flps_total: 133
  with_xyz_co2: 132
  with_energy_co2: 132
  overlap: 131
  smiles_validated: 0
  smiles_failed: 3

Total entries: 133


## Loading Individual Entry Files

Each FLP has a detailed entry file with complete information. Let's examine one:

In [3]:
# Load a sample entry
entries_dir = Path.cwd().parent / "data" / "processed" / "entries"
sample_entry_path = entries_dir / "1.yaml"

with open(sample_entry_path, 'r') as f:
    sample_entry = yaml.safe_load(f)

print("Sample FLP Entry Structure:")
print(f"FLP ID: {sample_entry['flp_id']}")
print(f"FLP Code: {sample_entry['flp_code']}")
print(f"\nAvailable XYZ files:")
for key, path in sample_entry['provenance']['xyz_paths'].items():
    print(f"  {key}: {path}")
print(f"\nQC Flags:")
for key, value in sample_entry['qc_flags'].items():
    print(f"  {key}: {value}")

Sample FLP Entry Structure:
FLP ID: 1
FLP Code: None

Available XYZ files:
  flp: data/raw/xyz/1/1.xyz
  co2: data/raw/xyz/1/1CO2.xyz

QC Flags:
  has_xyz_flp: True
  has_xyz_co2: True
  has_recovered_energy: True
  join_ok: True
  smiles_validation_passed: False


## 🎯 Exercise 1.1: Build a Dataset DataFrame

**Task:** Create a pandas DataFrame that compiles key information from all entries.

**Include these columns:**
- `flp_id`: FLP identifier
- `smiles_flp`: SMILES string for bare FLP (if available)
- `has_xyz_flp`: Boolean for bare FLP structure
- `has_xyz_co2`: Boolean for CO₂ adduct structure
- `has_energy`: Boolean for recovered binding energy
- Target variable (you need to extract this from `energies_recovered`!)

**Hints:**
- Loop through all entry files in `entries_dir`
- Check the `energies_recovered` field - which energy should you use as the target?
- Some entries may have missing data - handle this appropriately
- Remember: More negative ΔG = stronger CO₂ binding!

**Question:** What should be your ML target variable? Gas-phase or solution-phase energy? E, H, or G?

In [4]:
# TODO: Your code here
# Load all entries and compile into a DataFrame

data = []

# Hint: Use Path.glob() to iterate over all .yaml files
# for entry_file in entries_dir.glob("*.yaml"):
#     ...

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (0, 0)
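
One possible shape for the per-entry step is sketched below. Only `flp_id` and the `qc_flags` keys are taken from the sample entry printed above; the layout of the SMILES and `energies_recovered` fields is not shown in this notebook, so the key paths are passed in as parameters and the ones used in the demo (`structure/smiles_flp`, `energies_recovered/dG_solution`) are hypothetical placeholders. The helper names `dig` and `entry_to_row` are also made up for illustration.

```python
from typing import Any, Dict, Sequence


def dig(node: Any, path: Sequence[str], default: Any = None) -> Any:
    """Follow `path` through nested dicts; return `default` if a key is missing."""
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def entry_to_row(entry: Dict[str, Any],
                 smiles_path: Sequence[str],
                 target_path: Sequence[str]) -> Dict[str, Any]:
    """Flatten one parsed entry into a row; missing fields become None or False."""
    qc = entry.get("qc_flags") or {}
    return {
        "flp_id": entry.get("flp_id"),
        "smiles_flp": dig(entry, smiles_path),
        "has_xyz_flp": bool(qc.get("has_xyz_flp", False)),
        "has_xyz_co2": bool(qc.get("has_xyz_co2", False)),
        "has_energy": bool(qc.get("has_recovered_energy", False)),
        "target": dig(entry, target_path),
    }


# Demo entry: the "structure" and "energies_recovered" sub-keys and the
# energy value are invented for this example, not read from the database.
demo_entry = {
    "flp_id": 1,
    "qc_flags": {"has_xyz_flp": True, "has_xyz_co2": True,
                 "has_recovered_energy": True},
    "structure": {"smiles_flp": None},
    "energies_recovered": {"dG_solution": -5.0},
}

row = entry_to_row(demo_entry,
                   smiles_path=("structure", "smiles_flp"),
                   target_path=("energies_recovered", "dG_solution"))
print(row)
```

In the notebook, each file from `entries_dir.glob("*.yaml")` would be parsed with `yaml.safe_load` and the resulting row appended to `data` before calling `pd.DataFrame(data)`. Passing the key paths explicitly keeps the target choice (gas or solution phase; E, H, or G) a visible decision rather than a hard-coded one.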


## Data Quality Assessment

Before ML modeling, assess your data quality:

In [5]:
# TODO: Explore your data
# - How many entries have complete data?
# - What's the distribution of your target variable?
# - Are there any outliers?
# - How many entries have SMILES strings?

---

# Part 2: Molecular Parametrization

## Fingerprint-Based Features

Morgan fingerprints encode 2D structural information. For FLPs, we need fingerprints for:
1. **Bare FLP** (Lewis acid + Lewis base)
2. **CO₂ adduct** (FLP bound to CO₂)

### Review: Morgan Fingerprints from mml_studio_07

Key parameters:
- `radius`: Circular neighborhood size (typically 2-3)
- `nBits`: Fingerprint length (1024, 2048, or 4096)

Example from studio 7:

In [6]:
# Example: Generate fingerprint for ibuprofen (from studio 7)
example_smiles = "CC(C)Cc1ccc(cc1)[C@@H](C)C(=O)O"
example_mol = Chem.MolFromSmiles(example_smiles)
example_fp = AllChem.GetMorganFingerprintAsBitVect(example_mol, radius=2, nBits=2048)

# Convert to numpy array for ML
fp_array = np.array(example_fp)
print(f"Fingerprint shape: {fp_array.shape}")
print(f"Non-zero bits: {fp_array.sum()} / {len(fp_array)} ({100*fp_array.sum()/len(fp_array):.1f}%)")

Fingerprint shape: (2048,)
Non-zero bits: 25 / 2048 (1.2%)




## 🎯 Exercise 2.1: Generate FLP Fingerprints

**Task:** Create a function to generate Morgan fingerprints for your FLP dataset.

**Design decisions you need to make:**
1. Which SMILES to use? (bare FLP, CO₂ adduct, or both?)
2. What radius and nBits?
3. How to handle missing SMILES strings?

**Hint:** Consider creating features that capture the **change** upon CO₂ binding!

In [7]:
def generate_fingerprints(df, radius=2, nBits=2048):
    """
    Generate Morgan fingerprints for FLPs.
    
    Args:
        df: DataFrame with SMILES columns
        radius: Morgan fingerprint radius
        nBits: Fingerprint length
        
    Returns:
        X: Feature matrix (n_samples, nBits)
    """
    # TODO: Implement fingerprint generation
    # Consider:
    # - Which molecules to fingerprint?
    # - How to combine multiple fingerprints?
    # - Error handling for invalid SMILES
    
    pass

# Test your function
# X_fp = generate_fingerprints(df)
# print(f"Feature matrix shape: {X_fp.shape}")
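
One way to act on the hint about capturing change is to keep both fingerprints and add their signed difference. The sketch below works on plain bit arrays (as produced by `np.array(fingerprint)` in the ibuprofen example) and assumes both a bare-FLP and an adduct fingerprint exist for the entry, which the notebook does not guarantee. The function name `combine_fingerprints` and the 8-bit toy vectors are illustrative only.

```python
import numpy as np


def combine_fingerprints(fp_flp, fp_adduct):
    """Concatenate bare-FLP bits, adduct bits, and their signed difference.

    Difference values: +1 bit appears on binding, -1 bit disappears, 0 unchanged.
    """
    a = np.asarray(fp_flp, dtype=np.int8)
    b = np.asarray(fp_adduct, dtype=np.int8)
    if a.shape != b.shape:
        raise ValueError("fingerprints must have the same length")
    return np.concatenate([a, b, b - a])


# Toy 8-bit vectors standing in for 2048-bit Morgan fingerprints
fp_flp = [1, 0, 1, 0, 0, 1, 0, 0]
fp_adduct = [1, 1, 0, 0, 0, 1, 0, 1]

features = combine_fingerprints(fp_flp, fp_adduct)
print(features.shape)
print(features[-8:])
```

The concatenation triples the feature count, which matters with roughly 130 samples; using the difference block alone is a leaner alternative worth comparing.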

---

## Quantum Mechanical Features

### Review: QM Descriptors from mml_studio_07

QM descriptors provide physically motivated features:
- **Atomic charges** (XTB, NPA, Mulliken)
- **Orbital energies** (HOMO, LUMO, gaps)
- **Steric parameters** (buried volume, sterimol)

Key challenge: **Conformational dependence** - QM properties depend on 3D geometry!

### Workflow from Studio 7:

1. Generate conformer ensemble
2. Optimize each conformer with XTB
3. Compute QM descriptors
4. Apply Boltzmann averaging (optional)
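
Step 4 can be written out explicitly. Each conformer i receives weight w_i = exp(-ΔE_i / RT) / Σ_j exp(-ΔE_j / RT), where ΔE_i is its energy relative to the lowest conformer. The sketch below assumes energies already converted to kcal/mol (XTB reports Hartree; 1 Hartree ≈ 627.51 kcal/mol). The conformer energies and descriptor values are made up.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Normalized Boltzmann weights from conformer energies in kcal/mol."""
    e_min = min(energies_kcal)
    rt = R_KCAL * temperature
    factors = [math.exp(-(e - e_min) / rt) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(values, energies_kcal, temperature=298.15):
    """Boltzmann-weighted mean of a per-conformer descriptor."""
    weights = boltzmann_weights(energies_kcal, temperature)
    return sum(w * v for w, v in zip(weights, values))


# Made-up example: three conformers with relative energies and a descriptor
rel_energies = [0.0, 0.5, 2.0]      # kcal/mol
descriptor = [-9.8, -9.5, -9.1]     # e.g. a HOMO energy in eV

weights = boltzmann_weights(rel_energies)
averaged = boltzmann_average(descriptor, rel_energies)
print([round(w, 3) for w in weights], round(averaged, 3))
```

Subtracting the minimum energy before exponentiating leaves the weights unchanged and avoids overflow when absolute energies are large in magnitude.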

## Loading 3D Structures from XYZ Files

Good news! You already have DFT-optimized XYZ files for each FLP. Let's load one:

In [8]:
# Example: Load and visualize an FLP structure
sample_xyz_path = Path.cwd().parent / sample_entry['provenance']['xyz_paths']['flp']

# Visualize with py3Dmol (from utils.py)
MolTo3DView(str(sample_xyz_path))


## Computing QM Descriptors with XTB

You can use autodE (already imported in utils.py) to compute QM descriptors with XTB:

In [9]:
# Example: Load XYZ and compute properties with autodE
# (This is an example - you'll need to adapt for your needs)

# mol = ade.Molecule(str(sample_xyz_path))
# mol.single_point(method=ade.methods.XTB())
# 
# # Access properties
# energy = mol.energy  # in Hartrees
# charges = mol.partial_charges  # atomic charges

## 🎯 Exercise 2.2: Extract QM Features

**Task:** Design and extract QM-based features for FLP-CO₂ binding prediction.

**Potential features to consider:**
- Atomic charges on Lewis acid and Lewis base atoms
- HOMO/LUMO energies and gaps
- Dipole moments
- Molecular volume or size metrics
- Differences between bare FLP and CO₂ adduct

**Questions to answer:**
1. Which atoms are the Lewis acid and Lewis base centers? (Hint: Check the reference papers!)
2. What chemical principles suggest which descriptors might be important?
3. How can you capture the "frustration" in the Lewis pair?

**Advanced:** Use morfeus package for steric descriptors (buried volume, sterimol parameters)

In [10]:
def extract_qm_features(entry):
    """
    Extract QM features from FLP entry.
    
    Args:
        entry: Dictionary with entry data (including xyz_paths)
        
    Returns:
        features: Dictionary of QM descriptors
    """
    # TODO: Implement QM feature extraction
    # You'll need to:
    # 1. Load XYZ coordinates
    # 2. Compute descriptors (or use pre-computed if available)
    # 3. Identify key atoms (LA and LB centers)
    # 4. Extract relevant properties
    
    pass

# Test on a sample entry
# qm_features = extract_qm_features(sample_entry)
# print("QM features:", qm_features.keys())
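
A purely geometric descriptor that needs no QM run is the distance between the Lewis acid and Lewis base centers, which is one candidate proxy for "frustration". The sketch below parses XYZ-format text and measures that distance. Treating the first B atom as the acid and the first P atom as the base is an assumption made only for this toy geometry; the actual centers for each entry have to be identified from the reference papers. The helper names are hypothetical.

```python
import math


def read_xyz(text):
    """Parse XYZ-format text into a list of (symbol, x, y, z) tuples."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atoms = []
    for line in lines[2:2 + n_atoms]:  # skip atom count and comment line
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return atoms


def first_index(atoms, symbol):
    """Index of the first atom with the given element symbol."""
    for idx, atom in enumerate(atoms):
        if atom[0] == symbol:
            return idx
    raise ValueError(f"no {symbol} atom found")


def distance(atoms, i, j):
    """Euclidean distance between atoms i and j, in the file's units."""
    return math.dist(atoms[i][1:], atoms[j][1:])


# Toy geometry, not a real FLP: B at the origin, P at (3, 4, 0)
toy_xyz = """3
toy example
B 0.0 0.0 0.0
P 3.0 4.0 0.0
H 0.0 0.0 1.2
"""

atoms = read_xyz(toy_xyz)
d_la_lb = distance(atoms, first_index(atoms, "B"), first_index(atoms, "P"))
print(f"LA-LB distance: {d_la_lb:.2f}")
```

In the notebook the text would come from `Path(...).read_text()` on the `flp` path in `xyz_paths`. Computing the same distance for the `co2` structure gives a before/after pair for the difference features suggested above.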

---

# Part 3: Machine Learning Models

## Baseline Models

Following best practices from mml_studio_07, we'll train multiple baseline models:

1. **Ridge Regression**: L2 regularization (good for correlated features)
2. **Lasso Regression**: L1 regularization (feature selection)
3. **Bayesian Ridge**: Uncertainty quantification
4. **Random Forest**: Non-linear relationships

### Review: Model Training from Studio 7

In [11]:
# Example: Ridge regression with cross-validation (from studio 7 patterns)

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Assuming you have X (features) and y (targets)
# model = Ridge(alpha=1.0)
# 
# # 5-fold cross-validation
# cv_scores = cross_val_score(model, X, y, cv=5, 
#                             scoring='neg_mean_absolute_error')
# mae = -cv_scores.mean()
# print(f"Cross-validated MAE: {mae:.3f} kcal/mol")

## 🎯 Exercise 3.1: Build ML Pipeline

**Task:** Implement a complete ML pipeline for CO₂ binding energy prediction.

**Requirements:**
1. Train/test split (or cross-validation)
2. Multiple model types
3. Hyperparameter tuning
4. Performance metrics (MAE, RMSE, R²)
5. Model comparison

**Key decisions:**
- How to split data? (random, stratified, or leave-group-out?)
- Which hyperparameters to tune?
- How to combine fingerprint and QM features?
- How to handle small dataset size (n~130)?

In [12]:
# TODO: Implement your ML pipeline

# 1. Prepare features and targets
# X = ...
# y = ...

# 2. Split data
# from sklearn.model_selection import train_test_split

# 3. Train models
# models = {
#     'Ridge': Ridge(),
#     'Lasso': Lasso(),
#     'Random Forest': RandomForestRegressor()
# }

# 4. Evaluate and compare
# results = {}
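
With only about 130 samples, a single train/test split gives a noisy error estimate. One common option, sketched here on synthetic data rather than the FLP features, is repeated k-fold cross-validation with the scaler placed inside a pipeline so that it is refit on each training fold and no test-fold statistics leak into training. The data, `alpha`, and the fold counts are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and targets
rng = np.random.default_rng(0)
X = rng.normal(size=(130, 10))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=130)

# Scaling lives inside the pipeline, so each fold is scaled independently
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# 5 folds repeated 10 times gives 50 MAE estimates
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = -cross_val_score(model, X, y, cv=cv,
                          scoring="neg_mean_absolute_error")

print(f"MAE: {scores.mean():.3f} +/- {scores.std():.3f} "
      f"over {len(scores)} folds")
```

Reporting the spread across folds alongside the mean makes model comparisons more honest: two models whose MAE ranges overlap heavily are not meaningfully different at this dataset size.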

## Model Evaluation and Visualization

Create diagnostic plots to assess model performance:

In [13]:
# TODO: Create evaluation visualizations
# 1. Parity plots (predicted vs actual)
# 2. Residual plots
# 3. Feature importance (for interpretable models)
# 4. Learning curves (performance vs training set size)

# Use plotter() function from utils.py for consistent styling
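
The numbers that usually annotate a parity plot can be written out directly. The sketch below computes MAE, RMSE, and R² with NumPy using the standard definitions, which match the `sklearn.metrics` functions imported in the setup cell. The four energies are made up.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for paired true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_pred - y_true
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "MAE": float(np.mean(np.abs(resid))),
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
    }


# Made-up binding energies in kcal/mol
y_true = [-10.0, -8.0, -12.0, -6.0]
y_pred = [-9.0, -8.0, -13.0, -7.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```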

---

# Part 4: Model Interpretation

## Feature Importance Analysis

Understanding **why** your model makes predictions is crucial for:
- Building trust in predictions
- Extracting chemical insights
- Guiding experimental design

### Linear Model Coefficients

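For Ridge/Lasso, coefficient magnitudes indicate feature importance, provided the features are on a comparable scale (for example, standardized before fitting):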

In [14]:
# Example: Visualize Ridge coefficients
# model = Ridge(alpha=1.0)
# model.fit(X_train, y_train)
#
# # Get feature importances
# importances = np.abs(model.coef_)
# top_features = np.argsort(importances)[-20:]  # Top 20
#
# # Plot
# plt.figure(figsize=(10, 8))
# plt.barh(range(len(top_features)), importances[top_features])
# plt.xlabel('|Coefficient|')
# plt.title('Top 20 Important Features')
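
Coefficient magnitudes are only comparable across features that share a scale, which matters when binary fingerprint bits are mixed with QM descriptors in different units. The synthetic sketch below shows a raw Ridge fit ranking a minor feature first simply because it is measured in small units, while the standardized fit recovers the feature that actually drives the target. All data and scales are invented.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Three features on very different scales
X = np.column_stack([
    rng.normal(scale=100.0, size=n),  # feature 0: large units, dominant effect
    rng.normal(scale=0.01, size=n),   # feature 1: tiny units, minor effect
    rng.normal(scale=1.0, size=n),    # feature 2: irrelevant
])
y = 0.05 * X[:, 0] + 50.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit once on raw features and once with standardization in a pipeline
raw = Ridge(alpha=1.0).fit(X, y)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

raw_rank = np.argsort(-np.abs(raw.coef_))
scaled_coef = scaled[-1].coef_  # coefficients of the Ridge step
scaled_rank = np.argsort(-np.abs(scaled_coef))

print("raw ranking:   ", raw_rank.tolist())
print("scaled ranking:", scaled_rank.tolist())
```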

## 🎯 Exercise 4.1: Interpret Your Model

**Task:** Extract chemical insights from your best model.

**Questions to answer:**
1. Which structural features are most important for CO₂ binding?
2. Do the important features make chemical sense?
3. Can you visualize important fingerprint bits? (Use `Chem.Draw.DrawMorganBits` from studio 7)
4. What do QM features tell you about the binding mechanism?
5. How do your findings compare to known FLP chemistry literature?

**Deliverable:** Write a short interpretation connecting ML results to chemical principles.

In [15]:
# TODO: Implement model interpretation
# - Feature importance analysis
# - Fingerprint bit visualization (for top important bits)
# - Chemical interpretation

---

# Part 5: Candidate Screening and Ranking

## Prediction with Uncertainty

For experimental validation, you need:
1. **Predicted binding energy** (more negative = stronger binding)
2. **Prediction uncertainty** (to prioritize confident predictions)

### Bayesian Ridge for Uncertainty Quantification

From studio 7, Bayesian Ridge returns a predictive standard deviation, from which approximate prediction intervals follow:

In [16]:
# Example: Bayesian Ridge with uncertainty
# from sklearn.linear_model import BayesianRidge
#
# model = BayesianRidge()
# model.fit(X_train, y_train)
#
# # Predictions with uncertainty
# y_pred, y_std = model.predict(X_test, return_std=True)
#
# # Approximate 95% prediction intervals (assumes a Gaussian predictive distribution)
# ci_lower = y_pred - 1.96 * y_std
# ci_upper = y_pred + 1.96 * y_std
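
Before trusting such intervals for ranking, it helps to check how often held-out values actually fall inside them. The sketch below computes empirical coverage from arrays of true values, predictions, and predicted standard deviations; for well-calibrated 95% intervals the result should sit near 0.95 on a reasonably sized test set. The function name and the four data points are made up.

```python
import numpy as np


def interval_coverage(y_true, y_pred, y_std, z=1.96):
    """Fraction of true values inside y_pred +/- z * y_std."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_std = np.asarray(y_std, dtype=float)
    inside = np.abs(y_true - y_pred) <= z * y_std
    return float(inside.mean())


# Made-up held-out energies, predictions, and uncertainties (kcal/mol)
y_true = [-10.0, -8.0, -12.0, -6.0]
y_pred = [-9.5, -8.2, -10.0, -6.1]
y_std = [0.5, 0.5, 0.5, 0.5]

coverage = interval_coverage(y_true, y_pred, y_std)
print(f"empirical coverage: {coverage:.2f}")
```

Coverage well below the nominal level suggests the model is overconfident, in which case uncertainty-based ranking should be treated with caution.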

## 🎯 Exercise 5.1: Screen and Rank FLP Candidates

**Task:** Create a ranking system for FLP candidates.

**Approach:**
1. Apply your trained model to all FLPs in the database
2. Compute predicted binding energies and uncertainties
3. Rank by predicted binding strength (more negative ΔG)
4. Consider uncertainty in ranking (balance exploration vs exploitation)

**Deliverables:**
- Top 10 predicted strongest CO₂ binders
- Top 10 most uncertain predictions (for active learning)
- Visualization of predicted vs known binding energies

**Advanced:** Implement an acquisition function for active learning (e.g., expected improvement)

In [17]:
# TODO: Implement candidate ranking
# 1. Generate predictions for all FLPs
# 2. Rank by binding energy
# 3. Consider uncertainty
# 4. Create ranking table
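
Two simple ways to fold uncertainty into the ranking are sketched below for a minimization target (more negative is stronger binding). The lower confidence bound μ − κσ favors candidates that could plausibly be very strong binders. Expected improvement for minimization is EI = (y_best − μ)Φ(z) + σφ(z) with z = (y_best − μ)/σ, assuming a Gaussian predictive distribution. The function names, κ, and the three candidates are illustrative only.

```python
import math

import numpy as np


def lower_confidence_bound(mu, sigma, kappa=1.0):
    """Optimistic score for minimization: smaller means more promising."""
    return np.asarray(mu, dtype=float) - kappa * np.asarray(sigma, dtype=float)


def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(best - Y, 0)] with Y ~ N(mu, sigma^2)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    ei = np.zeros_like(mu)
    for i, (m, s) in enumerate(zip(mu, sigma)):
        if s <= 0.0:  # no uncertainty: improvement is deterministic
            ei[i] = max(best - m, 0.0)
            continue
        z = (best - m) / s
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        ei[i] = (best - m) * cdf + s * pdf
    return ei


# Made-up predictions (kcal/mol) for three hypothetical candidates
mu = [-12.0, -10.0, -11.5]
sigma = [0.5, 3.0, 0.2]
best_known = -12.0  # strongest binding energy observed so far (made up)

lcb = lower_confidence_bound(mu, sigma, kappa=1.0)
ei = expected_improvement(mu, sigma, best_known)
lcb_order = np.argsort(lcb)        # ascending: most promising first
ei_order = np.argsort(-ei)         # descending: largest EI first
print("LCB order:", lcb_order.tolist())
print("EI order: ", ei_order.tolist())
```

Setting `kappa=0` reduces the lower confidence bound to pure exploitation (ranking by predicted energy alone), while larger values push the ranking toward uncertain candidates, which is the exploration side of the trade-off mentioned in the approach.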

---

# Part 6: Project Milestones and Next Steps

## Week 1-2: Foundation ✅

- [ ] Load and explore FLPCO2DB registry
- [ ] Generate molecular fingerprints
- [ ] Extract QM descriptors
- [ ] Build baseline ML models
- [ ] Establish performance metrics

## Week 2-3: Refinement 🔄

- [ ] Optimize hyperparameters
- [ ] Engineer new features
- [ ] Try advanced models (GNNs?)
- [ ] Implement uncertainty quantification
- [ ] Interpret model predictions

## Week 3-4: Validation and Discovery 🎯

- [ ] Rank FLP candidates
- [ ] Design new FLPs for validation
- [ ] Run DFT calculations on top candidates
- [ ] Compare ML predictions with DFT
- [ ] Iterate with active learning

## Week 4: Presentation 📊

- [ ] Final report with chemical insights
- [ ] Presentation slides
- [ ] Code and data release

---

## Additional Resources

### Codebase Structure

```
FLPCO2DB/
├── data/
│   ├── raw/              # Original XYZ files and CSVs
│   └── processed/        # Curated registry
│       ├── co2_registry.yaml
│       └── entries/      # Individual FLP entries
├── src/flpco2/          # Database management tools
│   ├── cli.py           # Command-line interface
│   ├── registry_builder.py
│   └── smiles_utils.py
├── notebooks/           # Your workspace!
└── reference/           # Reference materials
    └── mml_studio_07/   # Studio 7 notebooks
```

### Useful Commands

```bash
# Inspect entries
flpco2 inspect 108

# View statistics
flpco2 stats

# Export to CSV
flpco2 export --output flp_data.csv --format csv
```

### Key Papers

1. Ye et al. (2025) - The FLPDB paper (in `reference/`)
2. Khan et al. (2023) - Review of FLP-CO₂ chemistry
3. Stephan (2015) - FLP concept and catalysis

### Getting Help

- Review `reference/mml_studio_07/` for ML workflows
- Check `reference/Project Proposal_MML.md` for project objectives
- RDKit documentation: https://www.rdkit.org/docs/
- Scikit-learn tutorials: https://scikit-learn.org/stable/tutorial/

---

**Good luck with your project! Remember: The goal isn't just to build a model, but to discover chemical insights that could guide experimental design.** 🧪🤖