# 01 — Data loading and featurization

This notebook builds the modelling dataset used in later notebooks:

1. Load two public melting point datasets (a large “full” dataset and a smaller curated subset).
2. Clean and merge them, keeping curated values when duplicate SMILES occur.
3. Generate RDKit descriptors and Morgan fingerprints from SMILES.
4. Create a final hold-out test set (5%) that is never used during training or hyperparameter tuning.
5. Save processed CSVs to `data/processed/`.

## Data sources and merge logic

Two datasets are combined:

- **Full dataset**: larger, noisier, includes a `donotuse` flag for problematic entries.
- **Curated dataset**: smaller, higher quality.

When the same SMILES appears in both datasets, the curated entry is kept and the full entry is dropped.
Rare `source` values are also grouped into a single `other` category, which reduces the number of
unique categories and keeps downstream categorical handling feasible.
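
The merge rules above can be sketched in pandas as follows. This is a minimal illustration, not the code in `mp.data_loading`: the function name `merge_datasets`, the `min_count` threshold, dropping rows where `donotuse` is set, and comparing SMILES as exact strings are all assumptions.

```python
import pandas as pd

def merge_datasets(full: pd.DataFrame, curated: pd.DataFrame, min_count: int = 2) -> pd.DataFrame:
    """Hypothetical helper: merge the two datasets, preferring curated rows."""
    # Assumption: flagged rows in the full dataset are dropped before merging.
    full = full[~full["donotuse"].astype(bool)].drop(columns="donotuse")

    # Put curated rows first so that keep="first" retains them on duplicate SMILES.
    merged = pd.concat([curated, full], ignore_index=True)
    merged = merged.drop_duplicates(subset="smiles", keep="first").reset_index(drop=True)

    # Group sources with fewer than `min_count` rows into "other".
    counts = merged["source"].value_counts()
    rare = counts[counts < min_count].index
    merged["source"] = merged["source"].where(~merged["source"].isin(rare), "other")
    return merged

# Toy data; the values are illustrative only.
full = pd.DataFrame({
    "smiles": ["CCO", "CCC", "CCCC", "c1ccccc1", "CCN"],
    "mpC": [-110.0, -188.0, -138.0, 5.5, -81.0],
    "source": ["a", "a", "a", "b", "a"],
    "donotuse": [False, False, False, False, True],
})
curated = pd.DataFrame({"smiles": ["CCO"], "mpC": [-114.1], "source": ["c"]})

merged = merge_datasets(full, curated, min_count=2)
```

Placing the curated frame first in the concatenation is what makes `drop_duplicates(keep="first")` implement the "curated wins" rule without an explicit join.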

## Cleaning and filtering rules

Rows are removed if:

- `smiles` or `mpC` is missing
- SMILES cannot be parsed by RDKit
- The molecule is **charged** (formal charge ≠ 0)
- The molecule contains elements outside a fixed allowed set (typical organic chemistry elements)
- The molecule has no hydrogen atom attached to a carbon atom (so that only organic compounds are kept)

This filtering aims to remove salts/ions/metals and other cases that can make featurization unstable
or introduce large distribution shifts in descriptors.
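
As a rough sketch, the structural rules could be expressed as a single RDKit predicate. The element set `ALLOWED_ELEMENTS` and the function name `keep_molecule` are hypothetical, and "charged" is interpreted here as a non-zero net formal charge; the project code may define either differently.

```python
from rdkit import Chem

# Hypothetical allowed set; the notebook's actual list is not shown here.
ALLOWED_ELEMENTS = {"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"}

def keep_molecule(smiles) -> bool:
    """Return True if the molecule passes all structural filters."""
    if not isinstance(smiles, str):
        return False  # missing value
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False  # RDKit could not parse the SMILES
    if Chem.GetFormalCharge(mol) != 0:
        return False  # assumption: "charged" means non-zero net formal charge
    if any(atom.GetSymbol() not in ALLOWED_ELEMENTS for atom in mol.GetAtoms()):
        return False  # contains an element outside the allowed set
    # Require at least one carbon atom that carries a hydrogen.
    return any(
        atom.GetAtomicNum() == 6 and atom.GetTotalNumHs() > 0
        for atom in mol.GetAtoms()
    )

results = {s: keep_molecule(s) for s in ["CCO", "c1ccccc1", "[Na+].[Cl-]", "CC(=O)[O-]", "O=C=O"]}
```

In this sketch sodium chloride is rejected by the element check rather than the charge check, because its net charge is zero.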

## Feature generation

For each molecule the following is computed:

### RDKit descriptors
A set of 2D descriptors covering:
- basic physchem properties (e.g., MolWt, LogP, TPSA, HBD/HBA)
- ring counts and aromaticity metrics
- connectivity / shape indices (Kappa, Chi indices)
- surface-area-like surrogates (e.g., LabuteASA)
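
For illustration, a handful of descriptors from each group can be computed with `rdkit.Chem.Descriptors`. The selection below and the helper name `basic_descriptors` are assumptions; the full set used by `get_features` may be larger or different.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def basic_descriptors(smiles: str) -> dict:
    """Hypothetical helper: compute a small, illustrative set of 2D descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        # Basic physchem properties
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "HBD": Descriptors.NumHDonors(mol),
        "HBA": Descriptors.NumHAcceptors(mol),
        # Ring counts and aromaticity
        "RingCount": Descriptors.RingCount(mol),
        "NumAromaticRings": Descriptors.NumAromaticRings(mol),
        # Connectivity / shape indices
        "Kappa1": Descriptors.Kappa1(mol),
        "Chi0": Descriptors.Chi0(mol),
        # Surface-area surrogate
        "LabuteASA": Descriptors.LabuteASA(mol),
    }

phenol = basic_descriptors("c1ccccc1O")
```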

### Morgan fingerprints
- Radius: 2
- Size: 2048 bits
- Stored as binary columns `FP_0 ... FP_2047`

These features are computed from the SMILES alone, without using the target (`mpC`), so computing them before the train/test split does not leak target information.
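
The Morgan fingerprint settings listed above could be reproduced roughly as follows. This sketch uses RDKit's `rdFingerprintGenerator` API, which is an assumption about the implementation; `morgan_frame` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

N_BITS = 2048
# Radius-2 Morgan generator producing a fixed-size bit vector.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=N_BITS)

def morgan_frame(smiles_list) -> pd.DataFrame:
    """Hypothetical helper: one row per molecule, one 0/1 column per bit."""
    rows = [
        morgan_gen.GetFingerprintAsNumPy(Chem.MolFromSmiles(s))
        for s in smiles_list
    ]
    columns = [f"FP_{i}" for i in range(N_BITS)]
    return pd.DataFrame(np.vstack(rows), columns=columns)

fp_df = morgan_frame(["CCO", "c1ccccc1"])
```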

## Dataset split and outputs

The featurized data is split once into a hold-out **final test set** and a training/validation set, and three files are written:

- `final_test.csv`: 5% random sample (seeded), never used during training or tuning
- `train_val.csv`: remaining 95%, used for cross-validation and model selection
- `full_dataset.csv`: the complete featurized dataset (after cleaning/filtering)

All outputs are written to `data/processed/`.
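
A quick sanity check for this kind of split is sketched below on a toy frame; `toy` stands in for the featurized dataset and is purely illustrative. It checks that the two parts are disjoint, that together they cover every row, and that the seed makes the sample reproducible.

```python
import pandas as pd

# Toy stand-in for the featurized frame (200 rows; values are arbitrary).
toy = pd.DataFrame({"mpC": range(200)})

test = toy.sample(frac=0.05, random_state=42)
train_val = toy.drop(test.index)

# The two parts are disjoint and together cover every row exactly once.
assert test.index.intersection(train_val.index).empty
assert len(test) + len(train_val) == len(toy)

# Re-running with the same seed selects the same rows.
again = toy.sample(frac=0.05, random_state=42)
assert test.index.equals(again.index)
```

Note that `drop(test.index)` relies on the index being unique; with duplicate index labels it would remove more rows than were sampled.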

In [None]:
from mp.data_loading import get_data
from mp.featurize import get_features
from mp.io import get_repo_root

In [None]:
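ROOT = get_repo_root()

# Load the merged dataset, then compute descriptors and fingerprints.
data = get_data(root=ROOT)
data_features = get_features(data)

# Hold out a seeded 5% random sample as the final test set.
# Dropping by index assumes the index of data_features is unique.
final_test_df = data_features.sample(frac=0.05, random_state=42)
train_val_data = data_features.drop(final_test_df.index)

# Create the output directory if it does not exist yet.
out_dir = ROOT / "data" / "processed"
out_dir.mkdir(parents=True, exist_ok=True)

# index=True writes the row index, so rows can be traced across the three files.
data_features.to_csv(out_dir / "full_dataset.csv", index=True)
train_val_data.to_csv(out_dir / "train_val.csv", index=True)
final_test_df.to_csv(out_dir / "final_test.csv", index=True)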