# ADMET Safety Model - Multi-Property Drug Safety Prediction

## Overview

This notebook implements an **ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity)** safety filter built from Random Forest models trained on RDKit molecular descriptors. It is **Stage 2** of the drug discovery pipeline.

### Features:
- Four separately trained ADMET predictors (Toxicity, Clinical Toxicity, BBB Permeability, Solubility)
- RDKit molecular descriptor calculation
- Random Forest classification and regression
- Model persistence and loading
- Comprehensive evaluation metrics
- SMILES-based compound filtering

### Datasets Used:
- **Tox21**: Toxicity prediction (12 targets)
- **ClinTox**: Clinical trial toxicity
- **BBBP**: Blood-Brain Barrier Permeability
- **ESOL (Delaney)**: Aqueous Solubility

**Author:** Bio-ScreenNet Team  
**Date:** 2025

## 1. Import Required Libraries

In [1]:
import os
import sys
import warnings
import gzip
import joblib
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Union

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_squared_error, r2_score, mean_absolute_error
)
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

# RDKit imports
try:
    from rdkit import Chem
    from rdkit.Chem import Descriptors, AllChem
    from rdkit import RDLogger
    RDLogger.DisableLog('rdApp.*')  # Disable RDKit warnings
    print("✓ RDKit imported successfully")
except ImportError as exc:
    # Raise instead of calling sys.exit(), so the notebook cell fails with a clear error
    raise ImportError(
        "RDKit is not installed. Install it with: conda install -c conda-forge rdkit"
    ) from exc

warnings.filterwarnings('ignore')
print("✓ All libraries imported successfully")

✓ RDKit imported successfully
✓ All libraries imported successfully


## 2. Molecular Descriptor Calculator

This class calculates molecular descriptors from SMILES strings using RDKit.

### Features Calculated:
- **Physicochemical descriptors**: the Lipinski properties (Molecular Weight, LogP, H-bond Donors/Acceptors) plus TPSA
- **Structural features**: Rotatable bonds, Aromatic/Aliphatic rings
- **Morgan Fingerprints**: 512-bit circular fingerprints (radius=2)
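
The layout of the resulting feature vector can be sketched without RDKit. The labels below are hypothetical (the notebook leaves `feature_names` unset); only the column order and the 8 + 512 split are taken from `calculate_descriptors`.

```python
# Hypothetical feature labels; only the order and sizes mirror calculate_descriptors.
BASIC_DESCRIPTOR_NAMES = [
    "mol_weight", "logp", "h_donors", "h_acceptors", "tpsa",
    "rotatable_bonds", "aromatic_rings", "aliphatic_rings",
]
N_FINGERPRINT_BITS = 512  # nBits passed to the Morgan fingerprint call


def build_feature_names(n_bits: int = N_FINGERPRINT_BITS) -> list:
    """Return one label per column: scalar descriptors first, then fingerprint bits."""
    return BASIC_DESCRIPTOR_NAMES + [f"morgan_bit_{i}" for i in range(n_bits)]


feature_names = build_feature_names()
print(len(feature_names))                    # number of columns in the descriptor matrix
print(feature_names[:3], feature_names[8])   # first scalar labels, first fingerprint bit
```

A list like this could be stored on the model to make feature importances readable, since the first eight columns are interpretable and the remaining 512 are hashed substructure bits.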

In [2]:
class MolecularDescriptorCalculator:
    """Calculate molecular descriptors from SMILES strings using RDKit."""

    @staticmethod
    def calculate_descriptors(smiles: str) -> Optional[np.ndarray]:
        """
        Calculate molecular descriptors for a given SMILES string.

        Args:
            smiles: SMILES representation of molecule

        Returns:
            Array of molecular descriptors or None if calculation fails
        """
        try:
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:
                return None

            # Calculate Lipinski descriptors
            mw = Descriptors.MolWt(mol)
            logp = Descriptors.MolLogP(mol)
            hbd = Descriptors.NumHDonors(mol)
            hba = Descriptors.NumHAcceptors(mol)
            tpsa = Descriptors.TPSA(mol)

            # Additional descriptors
            rot_bonds = Descriptors.NumRotatableBonds(mol)
            aromatic_rings = Descriptors.NumAromaticRings(mol)
            aliphatic_rings = Descriptors.NumAliphaticRings(mol)

            # Fingerprint-based descriptors (Morgan fingerprint)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=512)
            fp_array = np.array(fp)

            # Combine all descriptors
            basic_descriptors = np.array([
                mw, logp, hbd, hba, tpsa, rot_bonds,
                aromatic_rings, aliphatic_rings
            ])

            descriptors = np.concatenate([basic_descriptors, fp_array])
            return descriptors

        except Exception as e:
            print(f"Error calculating descriptors for {smiles}: {e}")
            return None

    @staticmethod
    def batch_calculate_descriptors(smiles_list: List[str], show_progress: bool = True) -> Tuple[np.ndarray, List[int]]:
        """
        Calculate descriptors for a list of SMILES.

        Args:
            smiles_list: List of SMILES strings
            show_progress: Whether to show progress bar

        Returns:
            Tuple of (descriptor_matrix, valid_indices)
        """
        descriptors = []
        valid_indices = []

        iterator = enumerate(smiles_list)
        if show_progress:
            iterator = tqdm(iterator, total=len(smiles_list), desc="Calculating descriptors")

        for idx, smiles in iterator:
            desc = MolecularDescriptorCalculator.calculate_descriptors(smiles)
            if desc is not None:
                descriptors.append(desc)
                valid_indices.append(idx)

        return np.array(descriptors), valid_indices

print("✓ MolecularDescriptorCalculator class defined")

✓ MolecularDescriptorCalculator class defined


## 3. ADMET Safety Model Class

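This is the main class for all ADMET prediction tasks. The cell below defines dataset loading, data preparation, and model persistence; the training and prediction methods are attached to the class in later cells.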

In [3]:
class ADMETSafetyModel:
    """
    Comprehensive ADMET Safety Prediction Model.

    This class handles multiple ADMET properties including:
    - Toxicity (Tox21)
    - Clinical Toxicity (ClinTox)
    - Blood-Brain Barrier Permeability (BBBP)
    - Aqueous Solubility (ESOL)
    """

    def __init__(self, data_dir: Optional[str] = None, model_dir: Optional[str] = None):
        """
        Initialize ADMET Safety Model.

        Args:
            data_dir: Directory containing ADMET datasets
            model_dir: Directory to save/load trained models
        """
        if data_dir is None:
            data_dir = os.path.join(os.path.expanduser("~"), ".deepchem", "datasets")
        if model_dir is None:
            # Use relative path from notebook location
            model_dir = os.path.join("..", "models", "admet_models")

        self.data_dir = Path(data_dir)
        self.model_dir = Path(model_dir)
        self.model_dir.mkdir(parents=True, exist_ok=True)

        self.models = {}
        self.scalers = {}
        self.feature_names = None
        self.results = {}

        print(f"ADMET Model initialized")
        print(f"Data directory: {self.data_dir}")
        print(f"Model directory: {self.model_dir}")

    def load_dataset(self, dataset_name: str) -> Optional[pd.DataFrame]:
        """
        Load ADMET dataset from file.

        Args:
            dataset_name: Name of dataset (tox21, clintox, bbbp, sider, delaney)

        Returns:
            DataFrame containing the dataset or None if loading fails
        """
        dataset_files = {
            'tox21': 'tox21.csv.gz',
            'clintox': 'clintox.csv.gz',
            'bbbp': 'BBBP.csv',
            'sider': 'sider.csv.gz',
            'delaney': 'delaney-processed.csv'
        }

        if dataset_name not in dataset_files:
            print(f"Unknown dataset: {dataset_name}")
            return None

        file_path = self.data_dir / dataset_files[dataset_name]

        if not file_path.exists():
            print(f"Dataset file not found: {file_path}")
            return None

        try:
            print(f"\nLoading {dataset_name} dataset from {file_path}...")

            if file_path.suffix == '.gz':
                df = pd.read_csv(file_path, compression='gzip')
            else:
                df = pd.read_csv(file_path)

            print(f"Loaded {len(df)} samples from {dataset_name}")
            print(f"Columns: {list(df.columns)}")
            return df

        except Exception as e:
            print(f"Error loading {dataset_name}: {e}")
            return None

    def prepare_data(self, df: pd.DataFrame, smiles_col: str, target_cols: List[str]) -> Tuple:
        """
        Prepare data for training by calculating molecular descriptors.

        Args:
            df: DataFrame containing SMILES and target columns
            smiles_col: Name of SMILES column
            target_cols: Names of target columns

        Returns:
            Tuple of (X, y, valid_df)
        """
        print(f"\nPreparing data...")
        print(f"SMILES column: {smiles_col}")
        print(f"Target columns: {target_cols}")

        # Calculate descriptors
        X, valid_indices = MolecularDescriptorCalculator.batch_calculate_descriptors(
            df[smiles_col].tolist(), show_progress=True
        )

        # Filter valid samples
        valid_df = df.iloc[valid_indices].reset_index(drop=True)
        y = valid_df[target_cols].to_numpy(dtype=float)  # float dtype so np.isnan works below

        # Remove samples with missing targets
        valid_mask = ~np.isnan(y).any(axis=1)
        X = X[valid_mask]
        y = y[valid_mask]
        valid_df = valid_df[valid_mask].reset_index(drop=True)

        print(f"Final dataset: {len(X)} samples with {X.shape[1]} features")
        print(f"Target shape: {y.shape}")

        return X, y, valid_df

    def save_model(self, model_name: str, model, scaler):
        """Save trained model and scaler to disk."""
        model_path = self.model_dir / f"{model_name}_model.pkl"
        scaler_path = self.model_dir / f"{model_name}_scaler.pkl"

        joblib.dump(model, model_path)
        joblib.dump(scaler, scaler_path)

        print(f"Model saved to {model_path}")

    def load_model(self, model_name: str) -> bool:
        """Load trained model and scaler from disk."""
        model_path = self.model_dir / f"{model_name}_model.pkl"
        scaler_path = self.model_dir / f"{model_name}_scaler.pkl"

        if not model_path.exists() or not scaler_path.exists():
            print(f"Model files not found for {model_name}")
            return False

        try:
            self.models[model_name] = joblib.load(model_path)
            self.scalers[model_name] = joblib.load(scaler_path)
            print(f"Loaded {model_name} model from {model_path}")
            return True
        except Exception as e:
            print(f"Error loading {model_name} model: {e}")
            return False

print("✓ ADMETSafetyModel base class defined")

✓ ADMETSafetyModel base class defined


## 4. Training Methods - Toxicity Model

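Train a binary toxicity classifier on the **Tox21** dataset. The 12 assay targets are collapsed into a single label (toxic if any assay is positive), and compounds with a missing result in any assay are dropped before training.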
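
The label construction used in the cell below can be sketched in plain Python. The toy table has three assays instead of twelve and its values are made up; the function name `any_positive_labels` is hypothetical.

```python
import math

# Toy assay table: each row is one compound, each column one assay result
# (1.0 = active, 0.0 = inactive, NaN = not measured). Values are made up.
NAN = float("nan")
assay_rows = [
    [0.0, 0.0, 0.0],   # fully measured, inactive everywhere
    [0.0, 1.0, 0.0],   # fully measured, active in one assay
    [NAN, 1.0, 0.0],   # one missing value, so the row is dropped
    [0.0, NAN, NAN],   # missing values, so the row is dropped
]


def any_positive_labels(rows):
    """Keep fully measured rows, then label a row 1 if any assay is active."""
    complete = [r for r in rows if not any(math.isnan(v) for v in r)]
    labels = [int(sum(r) > 0) for r in complete]
    return complete, labels


complete_rows, labels = any_positive_labels(assay_rows)
print(len(assay_rows), "->", len(complete_rows), "rows kept; labels:", labels)
```

Note that the third row is discarded even though it already shows an active assay. The same filter is why the run in this notebook keeps 3074 of the 7831 Tox21 compounds.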

In [4]:
def train_toxicity_model(self, test_size: float = 0.2, random_state: int = 42) -> Optional[Dict]:
    """
    Train toxicity prediction model using Tox21 dataset.
    """
    print("\n" + "="*80)
    print("TRAINING TOXICITY MODEL (Tox21)")
    print("="*80)

    # Load Tox21 dataset
    df = self.load_dataset('tox21')
    if df is None:
        return None

    # Tox21 has 12 toxicity targets
    target_cols = [col for col in df.columns if col.startswith('NR-') or col.startswith('SR-')]
    smiles_col = 'smiles'

    # Prepare data
    X, y, valid_df = self.prepare_data(df, smiles_col, target_cols)

    # Create binary toxicity label: toxic if any target is 1
    y_binary = (y.sum(axis=1) > 0).astype(int)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_binary, test_size=test_size, random_state=random_state, stratify=y_binary
    )

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train Random Forest model
    print("\nTraining Random Forest Classifier...")
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=20,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=random_state,
        n_jobs=-1,
        verbose=0
    )
    model.fit(X_train_scaled, y_train)

    # Evaluate
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

    results = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'f1': f1_score(y_test, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_test, y_pred_proba),
        'confusion_matrix': confusion_matrix(y_test, y_pred),
        'n_train': len(X_train),
        'n_test': len(X_test)
    }

    # Save model
    self.models['toxicity'] = model
    self.scalers['toxicity'] = scaler
    self.save_model('toxicity', model, scaler)

    # Print results
    print("\n" + "-"*80)
    print("TOXICITY MODEL RESULTS")
    print("-"*80)
    print(f"Accuracy:  {results['accuracy']:.4f}")
    print(f"Precision: {results['precision']:.4f}")
    print(f"Recall:    {results['recall']:.4f}")
    print(f"F1-Score:  {results['f1']:.4f}")
    print(f"ROC-AUC:   {results['roc_auc']:.4f}")
    print(f"\nConfusion Matrix:")
    print(results['confusion_matrix'])
    print(f"\nTrain samples: {results['n_train']}")
    print(f"Test samples:  {results['n_test']}")

    self.results['toxicity'] = results
    return results

# Add method to class
ADMETSafetyModel.train_toxicity_model = train_toxicity_model
print("✓ Toxicity training method added")

✓ Toxicity training method added


## 5. Training Methods - Clinical Toxicity Model

Train a model to predict clinical trial toxicity using the **ClinTox** dataset.
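
ClinTox is heavily imbalanced, so accuracy alone can be misleading. As a rough check, the sketch below recomputes the reported metrics from the ClinTox confusion matrix printed by the training run later in this notebook and compares accuracy with a majority-class baseline. The helper `binary_metrics` is a hypothetical name, not part of the notebook.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive the usual binary metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Accuracy obtained by always predicting the more frequent class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }


# Counts from the ClinTox confusion matrix printed by this notebook's run:
# [[272, 2], [21, 1]] in scikit-learn's [[tn, fp], [fn, tp]] layout.
metrics = binary_metrics(tn=272, fp=2, fn=21, tp=1)
for name, value in metrics.items():
    print(f"{name:18s} {value:.4f}")
```

For these counts the model's accuracy sits slightly below the majority-class baseline, which suggests that recall and ROC-AUC are the more informative numbers for this model.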

In [5]:
def train_clintox_model(self, test_size: float = 0.2, random_state: int = 42) -> Optional[Dict]:
    """
    Train clinical toxicity prediction model using ClinTox dataset.
    """
    print("\n" + "="*80)
    print("TRAINING CLINICAL TOXICITY MODEL (ClinTox)")
    print("="*80)

    df = self.load_dataset('clintox')
    if df is None:
        return None

    target_cols = ['CT_TOX']
    smiles_col = 'smiles'

    X, y, valid_df = self.prepare_data(df, smiles_col, target_cols)
    y = y.ravel()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    print("\nTraining Random Forest Classifier...")
    model = RandomForestClassifier(
        n_estimators=100, max_depth=20, min_samples_split=5,
        min_samples_leaf=2, random_state=random_state, n_jobs=-1, verbose=0
    )
    model.fit(X_train_scaled, y_train)

    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

    results = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'f1': f1_score(y_test, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_test, y_pred_proba),
        'confusion_matrix': confusion_matrix(y_test, y_pred),
        'n_train': len(X_train),
        'n_test': len(X_test)
    }

    self.models['clintox'] = model
    self.scalers['clintox'] = scaler
    self.save_model('clintox', model, scaler)

    print("\n" + "-"*80)
    print("CLINICAL TOXICITY MODEL RESULTS")
    print("-"*80)
    print(f"Accuracy:  {results['accuracy']:.4f}")
    print(f"Precision: {results['precision']:.4f}")
    print(f"Recall:    {results['recall']:.4f}")
    print(f"F1-Score:  {results['f1']:.4f}")
    print(f"ROC-AUC:   {results['roc_auc']:.4f}")
    print(f"\nConfusion Matrix:")
    print(results['confusion_matrix'])

    self.results['clintox'] = results
    return results

ADMETSafetyModel.train_clintox_model = train_clintox_model
print("✓ ClinTox training method added")

✓ ClinTox training method added


## 6. Training Methods - BBB Permeability Model

Train a model to predict Blood-Brain Barrier permeability using the **BBBP** dataset.
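
The classifiers in this notebook pass `stratify=y` to `train_test_split` so that the train and test sets keep the same class ratio. The idea can be sketched in plain Python as follows. This is an illustration of the principle, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class sends about test_size of its members to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # randomize within the class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 3:1 class ratio (made-up data, not BBBP itself)
labels = [1] * 75 + [0] * 25
train_idx, test_idx = stratified_split(labels)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_pos_rate)
```

Without stratification, a small test set drawn from an imbalanced dataset can end up with very few minority-class compounds, which makes precision and recall unstable.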

In [6]:
def train_bbbp_model(self, test_size: float = 0.2, random_state: int = 42) -> Optional[Dict]:
    """
    Train Blood-Brain Barrier Permeability model using BBBP dataset.
    """
    print("\n" + "="*80)
    print("TRAINING BBB PERMEABILITY MODEL (BBBP)")
    print("="*80)

    df = self.load_dataset('bbbp')
    if df is None:
        return None

    target_cols = ['p_np']
    smiles_col = 'smiles'

    X, y, valid_df = self.prepare_data(df, smiles_col, target_cols)
    y = y.ravel()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    print("\nTraining Random Forest Classifier...")
    model = RandomForestClassifier(
        n_estimators=100, max_depth=20, min_samples_split=5,
        min_samples_leaf=2, random_state=random_state, n_jobs=-1, verbose=0
    )
    model.fit(X_train_scaled, y_train)

    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

    results = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'f1': f1_score(y_test, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_test, y_pred_proba),
        'confusion_matrix': confusion_matrix(y_test, y_pred),
        'n_train': len(X_train),
        'n_test': len(X_test)
    }

    self.models['bbbp'] = model
    self.scalers['bbbp'] = scaler
    self.save_model('bbbp', model, scaler)

    print("\n" + "-"*80)
    print("BBB PERMEABILITY MODEL RESULTS")
    print("-"*80)
    print(f"Accuracy:  {results['accuracy']:.4f}")
    print(f"Precision: {results['precision']:.4f}")
    print(f"Recall:    {results['recall']:.4f}")
    print(f"F1-Score:  {results['f1']:.4f}")
    print(f"ROC-AUC:   {results['roc_auc']:.4f}")
    print(f"\nConfusion Matrix:")
    print(results['confusion_matrix'])

    self.results['bbbp'] = results
    return results

ADMETSafetyModel.train_bbbp_model = train_bbbp_model
print("✓ BBBP training method added")

✓ BBBP training method added


## 7. Training Methods - Solubility Model

Train a regression model to predict aqueous solubility using the **ESOL (Delaney)** dataset.
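
The regression metrics reported for this model (R², RMSE, MAE) can be written out by hand as a reference. The sketch below uses made-up log-solubility values, and `regression_metrics` is a hypothetical helper. The notebook itself relies on scikit-learn's `r2_score`, `mean_squared_error`, and `mean_absolute_error`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE and MAE for paired true/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Made-up log-solubility values (log mol/L), for illustration only
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]
print(regression_metrics(y_true, y_pred))
```

RMSE and MAE are in the same units as the target (log mol/L), so an RMSE of 1.0 corresponds to a tenfold error in predicted solubility.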

In [7]:
def train_solubility_model(self, test_size: float = 0.2, random_state: int = 42) -> Optional[Dict]:
    """
    Train aqueous solubility prediction model using ESOL (Delaney) dataset.
    """
    print("\n" + "="*80)
    print("TRAINING SOLUBILITY MODEL (ESOL/Delaney)")
    print("="*80)

    df = self.load_dataset('delaney')
    if df is None:
        return None

    target_cols = ['measured log solubility in mols per litre']
    smiles_col = 'smiles'

    X, y, valid_df = self.prepare_data(df, smiles_col, target_cols)
    y = y.ravel()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    print("\nTraining Random Forest Regressor...")
    model = RandomForestRegressor(
        n_estimators=100, max_depth=20, min_samples_split=5,
        min_samples_leaf=2, random_state=random_state, n_jobs=-1, verbose=0
    )
    model.fit(X_train_scaled, y_train)

    y_pred = model.predict(X_test_scaled)

    results = {
        'r2': r2_score(y_test, y_pred),
        'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
        'mae': mean_absolute_error(y_test, y_pred),
        'n_train': len(X_train),
        'n_test': len(X_test)
    }

    self.models['solubility'] = model
    self.scalers['solubility'] = scaler
    self.save_model('solubility', model, scaler)

    print("\n" + "-"*80)
    print("SOLUBILITY MODEL RESULTS")
    print("-"*80)
    print(f"R² Score:  {results['r2']:.4f}")
    print(f"RMSE:      {results['rmse']:.4f}")
    print(f"MAE:       {results['mae']:.4f}")
    print(f"\nTrain samples: {results['n_train']}")
    print(f"Test samples:  {results['n_test']}")

    self.results['solubility'] = results
    return results

ADMETSafetyModel.train_solubility_model = train_solubility_model
print("✓ Solubility training method added")

✓ Solubility training method added


## 8. Train All Models Method

In [8]:
def train_all_models(self) -> Dict:
    """
    Train all ADMET models.
    """
    print("\n" + "="*80)
    print("TRAINING ALL ADMET MODELS")
    print("="*80)

    all_results = {}

    models_to_train = [
        ('toxicity', self.train_toxicity_model),
        ('clintox', self.train_clintox_model),
        ('bbbp', self.train_bbbp_model),
        ('solubility', self.train_solubility_model)
    ]

    for model_name, train_func in models_to_train:
        try:
            result = train_func()
            if result:
                all_results[model_name] = result
        except Exception as e:
            print(f"\nError training {model_name} model: {e}")
            import traceback
            traceback.print_exc()

    # Print summary
    print("\n" + "="*80)
    print("TRAINING SUMMARY")
    print("="*80)
    for model_name, result in all_results.items():
        print(f"\n{model_name.upper()}:")
        if 'accuracy' in result:
            print(f"  Accuracy: {result['accuracy']:.4f}")
            print(f"  ROC-AUC:  {result['roc_auc']:.4f}")
        elif 'r2' in result:
            print(f"  R² Score: {result['r2']:.4f}")
            print(f"  RMSE:     {result['rmse']:.4f}")

    return all_results

ADMETSafetyModel.train_all_models = train_all_models
print("✓ Train all models method added")

✓ Train all models method added


## 9. Prediction Method

In [9]:
def predict_admet(self, smiles: Union[str, List[str]]) -> Dict:
    """
    Predict ADMET properties for given SMILES.
    """
    if isinstance(smiles, str):
        smiles = [smiles]

    results = {
        'smiles': smiles,
        'predictions': []
    }

    for smile in smiles:
        descriptors = MolecularDescriptorCalculator.calculate_descriptors(smile)

        if descriptors is None:
            results['predictions'].append({
                'valid': False,
                'error': 'Invalid SMILES or descriptor calculation failed'
            })
            continue

        descriptors = descriptors.reshape(1, -1)
        prediction = {'valid': True}

        for model_name in ['toxicity', 'clintox', 'bbbp', 'solubility']:
            if model_name in self.models:
                model = self.models[model_name]
                scaler = self.scalers[model_name]
                X_scaled = scaler.transform(descriptors)

                if model_name == 'solubility':
                    pred = model.predict(X_scaled)[0]
                    prediction[model_name] = float(pred)
                else:
                    pred_class = model.predict(X_scaled)[0]
                    pred_proba = model.predict_proba(X_scaled)[0]
                    prediction[model_name] = {
                        'class': int(pred_class),
                        'probability': float(pred_proba[1])
                    }

        results['predictions'].append(prediction)

    return results

ADMETSafetyModel.predict_admet = predict_admet
print("✓ Prediction method added")

✓ Prediction method added
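
The dictionary returned by `predict_admet` can feed a simple compound filter. The sketch below is one possible way to do that. The cut-off values and the helper `passes_safety_filter` are assumptions for illustration, and the example `results` dictionary is hand-written in the shape `predict_admet` returns rather than taken from a trained model.

```python
# Hypothetical cut-offs; tune them for the project at hand.
MAX_TOX_PROBABILITY = 0.5
MIN_LOG_SOLUBILITY = -6.0  # log mol/L


def passes_safety_filter(prediction: dict) -> bool:
    """Return True if a single predict_admet entry clears every cut-off."""
    if not prediction.get("valid", False):
        return False
    for key in ("toxicity", "clintox"):
        entry = prediction.get(key)
        if entry is not None and entry["probability"] >= MAX_TOX_PROBABILITY:
            return False
    solubility = prediction.get("solubility")
    if solubility is not None and solubility < MIN_LOG_SOLUBILITY:
        return False
    return True


# Hand-written example in the shape predict_admet returns (not real model output)
results = {
    "smiles": ["CCO", "c1ccccc1", "not_a_smiles"],
    "predictions": [
        {"valid": True, "toxicity": {"class": 0, "probability": 0.10},
         "clintox": {"class": 0, "probability": 0.05}, "solubility": 0.8},
        {"valid": True, "toxicity": {"class": 1, "probability": 0.70},
         "clintox": {"class": 0, "probability": 0.20}, "solubility": -2.0},
        {"valid": False, "error": "Invalid SMILES or descriptor calculation failed"},
    ],
}

kept = [s for s, p in zip(results["smiles"], results["predictions"])
        if passes_safety_filter(p)]
print(kept)
```

Models that were not trained or loaded are simply absent from each prediction entry, so the filter skips any key it does not find instead of failing.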


## 10. Initialize and Train Models

Now let's initialize the ADMET model and train all models.

In [10]:
# Initialize model
print("="*80)
print("ADMET SAFETY MODEL - DRUG DISCOVERY PIPELINE")
print("="*80)

admet_model = ADMETSafetyModel()

# Train all models
print("\nStarting model training...")
results = admet_model.train_all_models()

ADMET SAFETY MODEL - DRUG DISCOVERY PIPELINE
ADMET Model initialized
Data directory: C:\Users\Hoang Nhan\.deepchem\datasets
Model directory: ..\models\admet_models

Starting model training...

TRAINING ALL ADMET MODELS

TRAINING TOXICITY MODEL (Tox21)



Loading tox21 dataset from C:\Users\Hoang Nhan\.deepchem\datasets\tox21.csv.gz...
Loaded 7831 samples from tox21
Columns: ['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53', 'mol_id', 'smiles']

Preparing data...
SMILES column: smiles
Target columns: ['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53']


Calculating descriptors: 100%|██████████| 7831/7831 [00:07<00:00, 992.56it/s] 




Final dataset: 3074 samples with 520 features
Target shape: (3074, 12)

Training Random Forest Classifier...


Model saved to ..\models\admet_models\toxicity_model.pkl

--------------------------------------------------------------------------------
TOXICITY MODEL RESULTS
--------------------------------------------------------------------------------
Accuracy:  0.8000
Precision: 1.0000
Recall:    0.0889
F1-Score:  0.1633
ROC-AUC:   0.7205

Confusion Matrix:
[[480   0]
 [123  12]]

Train samples: 2459
Test samples:  615

TRAINING CLINICAL TOXICITY MODEL (ClinTox)

Loading clintox dataset from C:\Users\Hoang Nhan\.deepchem\datasets\clintox.csv.gz...
Loaded 1484 samples from clintox
Columns: ['smiles', 'FDA_APPROVED', 'CT_TOX']

Preparing data...
SMILES column: smiles
Target columns: ['CT_TOX']


Calculating descriptors: 100%|██████████| 1484/1484 [00:02<00:00, 645.81it/s]




Final dataset: 1480 samples with 520 features
Target shape: (1480, 1)

Training Random Forest Classifier...


Model saved to ..\models\admet_models\clintox_model.pkl

--------------------------------------------------------------------------------
CLINICAL TOXICITY MODEL RESULTS
--------------------------------------------------------------------------------
Accuracy:  0.9223
Precision: 0.3333
Recall:    0.0455
F1-Score:  0.0800
ROC-AUC:   0.8112

Confusion Matrix:
[[272   2]
 [ 21   1]]

TRAINING BBB PERMEABILITY MODEL (BBBP)

Loading bbbp dataset from C:\Users\Hoang Nhan\.deepchem\datasets\BBBP.csv...
Loaded 2050 samples from bbbp
Columns: ['num', 'name', 'p_np', 'smiles']

Preparing data...
SMILES column: smiles
Target columns: ['p_np']


Calculating descriptors: 100%|██████████| 2050/2050 [00:03<00:00, 551.94it/s]




Final dataset: 2039 samples with 520 features
Target shape: (2039, 1)

Training Random Forest Classifier...


Model saved to ..\models\admet_models\bbbp_model.pkl

--------------------------------------------------------------------------------
BBB PERMEABILITY MODEL RESULTS
--------------------------------------------------------------------------------
Accuracy:  0.9069
Precision: 0.8983
Recall:    0.9904
F1-Score:  0.9421
ROC-AUC:   0.9392

Confusion Matrix:
[[ 61  35]
 [  3 309]]

TRAINING SOLUBILITY MODEL (ESOL/Delaney)

Loading delaney dataset from C:\Users\Hoang Nhan\.deepchem\datasets\delaney-processed.csv...
Loaded 1128 samples from delaney
Columns: ['Compound ID', 'ESOL predicted log solubility in mols per litre', 'Minimum Degree', 'Molecular Weight', 'Number of H-Bond Donors', 'Number of Rings', 'Number of Rotatable Bonds', 'Polar Surface Area', 'measured log solubility in mols per litre', 'smiles']

Preparing data...
SMILES column: smiles
Target columns: ['measured log solubility in mols per litre']


Calculating descriptors: 100%|██████████| 1128/1128 [00:00<00:00, 1160.48it/s]




Final dataset: 1128 samples with 520 features
Target shape: (1128, 1)

Training Random Forest Regressor...


Model saved to ..\models\admet_models\solubility_model.pkl

--------------------------------------------------------------------------------
SOLUBILITY MODEL RESULTS
--------------------------------------------------------------------------------
R² Score:  0.8701
RMSE:      0.7836
MAE:       0.5416

Train samples: 902
Test samples:  226

TRAINING SUMMARY

TOXICITY:
  Accuracy: 0.8000
  ROC-AUC:  0.7205

CLINTOX:
  Accuracy: 0.9223
  ROC-AUC:  0.8112

BBBP:
  Accuracy: 0.9069
  ROC-AUC:  0.9392

SOLUBILITY:
  R² Score: 0.8701
  RMSE:     0.7836
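
Accuracy alone can flatter a classifier on imbalanced data. A minimal sketch comparing each reported accuracy with a majority-class baseline (always predicting the more frequent class), using only the confusion matrices printed above; the dictionary and function names are illustrative, not from the notebook:

```python
# Confusion matrices copied from the results above (rows: true class, columns: predicted class).
confusion = {
    "toxicity": [[480, 0], [123, 12]],
    "clintox": [[272, 2], [21, 1]],
    "bbbp": [[61, 35], [3, 309]],
}


def baseline_and_accuracy(cm):
    """Return (majority-class baseline accuracy, model accuracy) for a 2x2 matrix."""
    row_totals = [sum(row) for row in cm]  # number of true samples per class
    total = sum(row_totals)
    baseline = max(row_totals) / total     # always predict the larger class
    accuracy = (cm[0][0] + cm[1][1]) / total
    return baseline, accuracy


for name, cm in confusion.items():
    baseline, accuracy = baseline_and_accuracy(cm)
    print(f"{name:9s} baseline={baseline:.4f} model={accuracy:.4f} gain={accuracy - baseline:+.4f}")
```

On these test splits the ClinTox model scores slightly below its baseline (273 versus 274 correct out of 296), which is why ROC-AUC and recall are more informative than accuracy for that task.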


## 11. Test Predictions

Run the trained models on three well-known drugs (ibuprofen, aspirin, and caffeine) as a quick sanity check.

In [11]:
print("\n" + "="*80)
print("TESTING PREDICTIONS")
print("="*80)

# Test with common drug compounds
test_smiles = [
    "CC(C)Cc1ccc(cc1)C(C)C(O)=O",  # Ibuprofen
    "CC(=O)Oc1ccccc1C(=O)O",  # Aspirin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # Caffeine
]

drug_names = ["Ibuprofen", "Aspirin", "Caffeine"]

print("\nTest compounds:")
for i, (name, smile) in enumerate(zip(drug_names, test_smiles), 1):
    print(f"{i}. {name}: {smile}")

predictions = admet_model.predict_admet(test_smiles)

print("\n" + "="*80)
print("PREDICTION RESULTS")
print("="*80)

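# Each prediction is a dict: 'valid' marks whether the compound could be processed
# (otherwise 'error' holds the reason). Classifier outputs are dicts with 'class'
# and 'probability'; the solubility output is a single number.
for i, (name, smile, pred) in enumerate(zip(drug_names, predictions['smiles'], predictions['predictions']), 1):
    print(f"\n{i}. {name}")
    print(f"   SMILES: {smile}")
    if pred['valid']:
        print("   \n   ADMET Properties:")
        for prop, value in pred.items():
            if prop == 'valid':
                continue  # bookkeeping flag, not an ADMET property
            if isinstance(value, dict):
                print(f"     - {prop}: Class={value['class']}, Probability={value['probability']:.4f}")
            else:
                print(f"     - {prop}: {value:.4f}")
    else:
        print(f"   Error: {pred['error']}")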


TESTING PREDICTIONS

Test compounds:
1. Ibuprofen: CC(C)Cc1ccc(cc1)C(C)C(O)=O
2. Aspirin: CC(=O)Oc1ccccc1C(=O)O
3. Caffeine: CN1C=NC2=C1C(=O)N(C(=O)N2C)C



PREDICTION RESULTS

1. Ibuprofen
   SMILES: CC(C)Cc1ccc(cc1)C(C)C(O)=O
   
   ADMET Properties:
     - toxicity: Class=0, Probability=0.1761
     - clintox: Class=0, Probability=0.0457
     - bbbp: Class=0, Probability=0.4981
     - solubility: -3.2710

2. Aspirin
   SMILES: CC(=O)Oc1ccccc1C(=O)O
   
   ADMET Properties:
     - toxicity: Class=0, Probability=0.1082
     - clintox: Class=0, Probability=0.2296
     - bbbp: Class=1, Probability=0.6984
     - solubility: -1.7281

3. Caffeine
   SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C
   
   ADMET Properties:
     - toxicity: Class=0, Probability=0.3750
     - clintox: Class=0, Probability=0.0436
     - bbbp: Class=1, Probability=0.9319
     - solubility: -1.1852
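
The solubility values are on the ESOL target scale, which the dataset column names as "measured log solubility in mols per litre". Assuming that is a base-10 logarithm (the usual logS convention), a prediction can be converted back to a molar concentration as sketched below; `logs_to_molar` is a hypothetical helper, not part of the notebook:

```python
def logs_to_molar(log_s):
    """Convert a log10 solubility (mol/L) into a molar concentration (assumes base 10)."""
    return 10.0 ** log_s


# Predicted values copied from the output above.
predicted_log_s = {"Ibuprofen": -3.2710, "Aspirin": -1.7281, "Caffeine": -1.1852}

for name, log_s in predicted_log_s.items():
    print(f"{name}: logS={log_s:.4f} -> {logs_to_molar(log_s):.2e} mol/L")
```

Because the scale is logarithmic, the roughly two-unit gap between ibuprofen and caffeine corresponds to about a hundredfold difference in predicted molar solubility.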


## 12. Summary

Display a final summary of the trained models and the suggested next steps.

In [12]:
print("\n" + "="*80)
print("ADMET MODEL TRAINING COMPLETED SUCCESSFULLY!")
print("="*80)
print(f"\nModels saved in: {admet_model.model_dir.absolute()}")
print("\nAvailable models:")
for model_name in admet_model.models.keys():
    print(f"  ✓ {model_name}")

print("\n" + "="*80)
print("NEXT STEPS")
print("="*80)
print("1. Use these models to filter compounds in your drug discovery pipeline")
print("2. Integrate with Stage 1 (Target Prediction) and Stage 3 (Activity Prediction)")
print("3. Deploy models for production use via Streamlit or API")
print("4. Continue refining models with additional data")


ADMET MODEL TRAINING COMPLETED SUCCESSFULLY!

Models saved in: D:\Major\DA_for_LS\final_DA\Computational-Drug-Discovery\notebooks\..\models\admet_models

Available models:
  ✓ toxicity
  ✓ clintox
  ✓ bbbp
  ✓ solubility

NEXT STEPS
1. Use these models to filter compounds in your drug discovery pipeline
2. Integrate with Stage 1 (Target Prediction) and Stage 3 (Activity Prediction)
3. Deploy models for production use via Streamlit or API
4. Continue refining models with additional data
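
Next step 1 can be sketched as a simple rule-based filter over the prediction dictionaries. This is an illustration only: `passes_safety_filter` and its thresholds are hypothetical, and the dictionary layout is inferred from the printing loop in Section 11 (`valid`, per-property `class`/`probability`, numeric `solubility`). It also assumes `probability` is the positive-class probability, as the Section 11 output suggests. The sample values are copied from that output.

```python
def passes_safety_filter(pred, max_tox_prob=0.5, max_clintox_prob=0.5, min_log_s=-4.0):
    """Return True if a prediction dict clears the (hypothetical) safety thresholds."""
    if not pred.get("valid", False):
        return False  # unparsable or failed compounds are rejected outright
    if pred["toxicity"]["probability"] >= max_tox_prob:
        return False
    if pred["clintox"]["probability"] >= max_clintox_prob:
        return False
    return pred["solubility"] >= min_log_s


# Prediction values copied from the Section 11 output.
sample_predictions = {
    "Ibuprofen": {"valid": True,
                  "toxicity": {"class": 0, "probability": 0.1761},
                  "clintox": {"class": 0, "probability": 0.0457},
                  "bbbp": {"class": 0, "probability": 0.4981},
                  "solubility": -3.2710},
    "Aspirin": {"valid": True,
                "toxicity": {"class": 0, "probability": 0.1082},
                "clintox": {"class": 0, "probability": 0.2296},
                "bbbp": {"class": 1, "probability": 0.6984},
                "solubility": -1.7281},
    "Caffeine": {"valid": True,
                 "toxicity": {"class": 0, "probability": 0.3750},
                 "clintox": {"class": 0, "probability": 0.0436},
                 "bbbp": {"class": 1, "probability": 0.9319},
                 "solubility": -1.1852},
}

kept = [name for name, pred in sample_predictions.items() if passes_safety_filter(pred)]
print("Compounds passing the filter:", kept)
```

Given the low recall of the two toxicity classifiers reported earlier, a stricter probability cutoff than 0.5 may be worth considering; the right value depends on how many false positives the downstream stages can tolerate.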
