# 🧠 Neural Network Regression: DAT Binding Prediction

**Goal**: Predict pKi values (binding strength) using Neural Networks (Deep Learning)

**Dataset**: 541 compounds with RDKit descriptors  
**Target**: pKi (continuous variable)  
**Method**: Deep Neural Network + 70/15/15 Train/Val/Test Split

**Key Differences from Tree Models:**
- Uses neural network architecture (multiple dense layers)
- Uses a 70/15/15 train/validation/test split instead of a single train/test split
- Validation set drives early stopping, so the test set stays untouched until the final evaluation
- More sensitive to feature scaling
- Non-linear activation functions (ReLU)
- Dropout for regularization
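
The last two points can be illustrated with a minimal NumPy sketch. It is not the Keras implementation: `relu` and `dropout` are hypothetical helper names, and the input arrays are invented. The sketch assumes the usual "inverted dropout" convention, where surviving units are rescaled by `1 / (1 - rate)` during training so the expected activation stays the same.

```python
import numpy as np

def relu(x):
    """ReLU activation: keep positive values, clamp negatives to zero."""
    return np.maximum(x, 0.0)

def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units,
    then rescale the survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)

# Invented example inputs
activations = relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0]))
dropped = dropout(np.ones(10_000), rate=0.3, rng=rng)

print(activations)                  # negatives become 0, positives pass through
print(round(dropped.mean(), 2))     # stays close to 1 because of the rescaling
```

Dropout is only active during training; at prediction time the layer passes its inputs through unchanged.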

---


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import confusion_matrix, classification_report

# TensorFlow/Keras for Neural Networks
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks

print("✅ Libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")


## 📂 Step 1: Load Processed Data

**Source:** `processed_DAT_rdkit_features.csv` (from dataanalyse.ipynb)


In [None]:
# Load processed RDKit features
df_rdkit = pd.read_csv('processed_DAT_rdkit_features.csv')

# Prepare features and target
X = df_rdkit.drop(['ChEMBL_ID', 'pKi'], axis=1)
y = df_rdkit['pKi']

print("="*60)
print("üìÇ DATA LOADED")
print("="*60)
print(f"Total compounds: {len(df_rdkit)}")
print(f"Features: {X.shape[1]} RDKit descriptors")
print(f"Target: pKi (range: {y.min():.2f} - {y.max():.2f})")
print("="*60)


## 🔧 Step 2: Train/Validation/Test Split (70/15/15)

**Critical for Neural Networks:**
- Training set (70%): For learning weights
- Validation set (15%): For early stopping & hyperparameter tuning
- Test set (15%): For final evaluation only
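
With 541 compounds the three shares cannot come out at exactly 70/15/15. The sketch below works out the expected set sizes for a two-stage split (30% held out, then halved). It assumes the held-out share is rounded up when the fraction does not divide evenly, which is how `train_test_split` is understood to handle a float `test_size`; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out share up."""
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 541                                    # compounds in the dataset
n_train, n_temp = split_sizes(n_total, 0.30)     # first split: train vs. temp
n_val, n_test = split_sizes(n_temp, 0.50)        # second split: val vs. test

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name:<5} {n:>3}  ({n / n_total:.1%})")
```

Under that assumption the sizes are 378, 81, and 82 compounds, so the percentages printed by the split cell can differ from 70/15/15 by a few tenths of a percent.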


In [None]:
# First split: 70% train, 30% temp (for val+test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, shuffle=True
)

# Second split: split temp into 50/50 (15% each of total)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, shuffle=True
)

print("="*60)
print("üìä TRAIN/VALIDATION/TEST SPLIT (70/15/15)")
print("="*60)
print(f"Training set: {len(X_train)} compounds ({len(X_train)/len(X)*100:.1f}%)")
print(f"Validation set: {len(X_val)} compounds ({len(X_val)/len(X)*100:.1f}%)")
print(f"Test set: {len(X_test)} compounds ({len(X_test)/len(X)*100:.1f}%)")
print(f"\npKi ranges:")
print(f"   Train: {y_train.min():.2f} - {y_train.max():.2f}")
print(f"   Val:   {y_val.min():.2f} - {y_val.max():.2f}")
print(f"   Test:  {y_test.min():.2f} - {y_test.max():.2f}")
print("="*60)


In [None]:
# Scale features (CRITICAL for Neural Networks!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print("✅ Features scaled using StandardScaler (fit on the training set only)")
print("   Training features now have mean ≈ 0 and std ≈ 1")


## 🏗️ Step 3: Build Neural Network Architecture

**Architecture:**
- Input layer: 17 features (RDKit descriptors)
- Hidden layer 1: 128 neurons + ReLU + Dropout(30%)
- Hidden layer 2: 64 neurons + ReLU + Dropout(20%)
- Hidden layer 3: 32 neurons + ReLU
- Output layer: 1 neuron (regression output)
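
As a sanity check on the architecture above, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add none. The sketch assumes the 17 input features stated in the list; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width followed by the width of each dense layer
layer_widths = [17, 128, 64, 32, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_widths, layer_widths[1:])]
total_params = sum(per_layer)

print(per_layer)      # [2304, 8256, 2080, 33]
print(total_params)   # 12673
```

If `model.summary()` reports a different total, the CSV most likely holds a different number of descriptor columns than 17.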


In [None]:
# Build Neural Network model
model = keras.Sequential([
    # Input layer: one value per scaled descriptor
    keras.Input(shape=(X_train_scaled.shape[1],)),

    # First Hidden Layer
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),  # 30% dropout for regularization
    
    # Second Hidden Layer
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),  # 20% dropout
    
    # Third Hidden Layer
    layers.Dense(32, activation='relu'),
    
    # Output Layer (regression)
    layers.Dense(1)  # Single output: pKi value
])

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='mse',  # Mean Squared Error for regression
    metrics=['mae']  # Mean Absolute Error as metric
)

print("="*60)
print("üèóÔ∏è NEURAL NETWORK ARCHITECTURE")
print("="*60)
model.summary()
print("="*60)


## 🎯 Step 4: Train Neural Network with Early Stopping

**Training Configuration:**
- Epochs: up to 500 (training stops earlier if `val_loss` stops improving)
- Batch size: 32
- Early stopping: patience=20 (stop if val_loss doesn't improve for 20 epochs)
- Restore best weights: Yes
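
The patience rule can be sketched in plain Python. This is a simplified illustration, not the Keras source: `early_stopping_epochs` is a hypothetical function, the loss values are invented, and epochs are counted from 0.

```python
def early_stopping_epochs(val_losses, patience):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far. Epoch indices are 0-based.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    # Ran out of epochs before patience was exhausted
    return best_epoch, len(val_losses) - 1

# Invented validation losses for illustration
toy_val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74]
best_epoch, stopped_epoch = early_stopping_epochs(toy_val_losses, patience=3)
print(best_epoch, stopped_epoch)   # 2 5
```

With "restore best weights" enabled, the model kept after training corresponds to `best_epoch`, not to the epoch at which training stopped.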


In [None]:
# Early stopping callback (ESSENTIAL for Neural Networks!)
early_stop = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=20,
    restore_best_weights=True,
    verbose=1
)

print("üöÄ Training Neural Network...")
print("   Using early stopping to prevent overfitting\n")

# Train the model
history = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_val_scaled, y_val),
    epochs=500,
    batch_size=32,
    callbacks=[early_stop],
    verbose=1
)

print("\n✅ Training completed!")


## 📊 Step 5: Training History Visualization


In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss plot
axes[0].plot(history.history['loss'], label='Training Loss', linewidth=2)
axes[0].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Loss (MSE)', fontsize=12, fontweight='bold')
axes[0].set_title('Model Loss During Training', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# MAE plot
axes[1].plot(history.history['mae'], label='Training MAE', linewidth=2)
axes[1].plot(history.history['val_mae'], label='Validation MAE', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12, fontweight='bold')
axes[1].set_ylabel('MAE', fontsize=12, fontweight='bold')
axes[1].set_title('Mean Absolute Error During Training', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nüìä Training stopped at epoch {len(history.history['loss'])}")


## 📈 Step 6: Model Evaluation
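
For reference, the three metrics reported in this step can be written out directly in NumPy. This is a minimal sketch with invented toy values; `regression_metrics` is a hypothetical helper, and the evaluation cell itself relies on the scikit-learn functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R², RMSE and MAE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(residuals ** 2)),
        "mae": np.mean(np.abs(residuals)),
    }

# Invented pKi values for illustration
toy_true = [5.0, 6.0, 7.0, 8.0]
toy_pred = [5.5, 6.0, 6.5, 8.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

RMSE and MAE are in pKi units, so an MAE of 0.25 means predictions are off by a quarter of a log unit on average. R² is unitless and depends on how widely the true values in each split are spread.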


In [None]:
# Make predictions
y_train_pred = model.predict(X_train_scaled, verbose=0).flatten()
y_val_pred = model.predict(X_val_scaled, verbose=0).flatten()
y_test_pred = model.predict(X_test_scaled, verbose=0).flatten()

# Calculate metrics
train_r2 = r2_score(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)

val_r2 = r2_score(y_val, y_val_pred)
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
val_mae = mean_absolute_error(y_val, y_val_pred)

test_r2 = r2_score(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_mae = mean_absolute_error(y_test, y_test_pred)

print("="*80)
print("üìä NEURAL NETWORK MODEL PERFORMANCE")
print("="*80)
print(f"\n{'Metric':<15} {'Training':<20} {'Validation':<20} {'Test':<20}")
print("-"*80)
print(f"{'R¬≤ Score':<15} {train_r2:<20.4f} {val_r2:<20.4f} {test_r2:<20.4f}")
print(f"{'RMSE':<15} {train_rmse:<20.4f} {val_rmse:<20.4f} {test_rmse:<20.4f}")
print(f"{'MAE':<15} {train_mae:<20.4f} {val_mae:<20.4f} {test_mae:<20.4f}")
print("-"*80)

# Overfitting analysis
overfit_r2 = train_r2 - test_r2
print(f"\nüîç Overfitting Analysis:")
print(f"   R¬≤ difference (train - test): {overfit_r2:.4f}")
if overfit_r2 > 0.1:
    print(f"   ‚ö†Ô∏è  Potential overfitting")
elif overfit_r2 > 0.05:
    print(f"   ‚ö° Mild overfitting")
else:
    print(f"   ‚úÖ Good generalization!")
print("="*80)


## 📈 Step 7: Prediction Visualizations


In [None]:
# Actual vs Predicted plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Training
axes[0].scatter(y_train, y_train_pred, alpha=0.6, s=40, edgecolors='black')
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual pKi', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Predicted pKi', fontsize=11, fontweight='bold')
axes[0].set_title(f'Training Set\nR² = {train_r2:.4f}', fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3)

# Validation
axes[1].scatter(y_val, y_val_pred, alpha=0.6, s=40, edgecolors='black', color='orange')
axes[1].plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual pKi', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Predicted pKi', fontsize=11, fontweight='bold')
axes[1].set_title(f'Validation Set\nR² = {val_r2:.4f}', fontsize=12, fontweight='bold')
axes[1].grid(alpha=0.3)

# Test
axes[2].scatter(y_test, y_test_pred, alpha=0.6, s=40, edgecolors='black', color='green')
axes[2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[2].set_xlabel('Actual pKi', fontsize=11, fontweight='bold')
axes[2].set_ylabel('Predicted pKi', fontsize=11, fontweight='bold')
axes[2].set_title(f'Test Set\nR² = {test_r2:.4f}', fontsize=12, fontweight='bold')
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()


## 🎯 Step 8: Classification Performance (Confusion Matrix)
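
The regression output is turned into three categories by thresholding pKi at 6.0 and 8.0. The sketch below shows a vectorized version of that binning using `np.digitize`. It is offered as an alternative under the same thresholds, not as the notebook's own function: `classify_pki_vectorized` is a hypothetical name and the example values are invented.

```python
import numpy as np

LABELS = np.array(["Low", "Medium", "High"])
THRESHOLDS = [6.0, 8.0]   # Low: pKi < 6.0, Medium: 6.0 <= pKi < 8.0, High: pKi >= 8.0

def classify_pki_vectorized(pki_values):
    """Map each pKi value to a category label via its bin index."""
    bin_index = np.digitize(pki_values, THRESHOLDS)   # 0, 1 or 2
    return LABELS[bin_index]

# Invented values, including both thresholds
toy_pki = np.array([5.9, 6.0, 7.99, 8.0, 9.3])
toy_categories = classify_pki_vectorized(toy_pki)
print(toy_categories)   # ['Low' 'Medium' 'Medium' 'High' 'High']
```

Because the categories come from thresholding a continuous prediction, a small regression error near 6.0 or 8.0 is enough to move a compound into the neighbouring class. Classification accuracy should therefore be read alongside RMSE and MAE.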


In [None]:
# Classification function
def classify_pKi(pKi_values):
    return np.array(['Low' if pKi < 6.0 else 'Medium' if pKi < 8.0 else 'High' for pKi in pKi_values])

# Convert to categories (test set)
y_test_cat = classify_pKi(y_test)
y_test_pred_cat = classify_pKi(y_test_pred)

# Confusion matrix
cm = confusion_matrix(y_test_cat, y_test_pred_cat, labels=['Low', 'Medium', 'High'])

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Purples',
            xticklabels=['Low', 'Medium', 'High'],
            yticklabels=['Low', 'Medium', 'High'],
            cbar_kws={'label': 'Count'})
plt.xlabel('Predicted Category', fontsize=12, fontweight='bold')
plt.ylabel('Actual Category', fontsize=12, fontweight='bold')
test_acc = np.trace(cm) / cm.sum() * 100
plt.title(f'Neural Network - Test Set\nClassification Accuracy: {test_acc:.2f}%', 
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("üìä CLASSIFICATION REPORT (Test Set)")
print("="*60)
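# zero_division=0 avoids warnings when a category is never predicted in the small test set
print(classification_report(y_test_cat, y_test_pred_cat, labels=['Low', 'Medium', 'High'], zero_division=0))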
print("="*60)


## 📊 Step 9: Final Summary


In [None]:
print("="*80)
print("üéØ FINAL SUMMARY - NEURAL NETWORK REGRESSION (NO PCA)")
print("="*80)
print(f"\nüìä Dataset:")
print(f"   Total compounds: {len(df_rdkit)}")
print(f"   Training: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"   Validation: {len(X_val)} ({len(X_val)/len(X)*100:.1f}%)")
print(f"   Test: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f"   Features: {X.shape[1]} RDKit descriptors (no PCA)")

print(f"\nüèóÔ∏è Model Architecture:")
print(f"   Layers: Dense(128) ‚Üí Dense(64) ‚Üí Dense(32) ‚Üí Dense(1)")
print(f"   Activation: ReLU")
print(f"   Dropout: 30%, 20%")
print(f"   Optimizer: Adam (lr=0.001)")
print(f"   Total parameters: {model.count_params():,}")

print(f"\nüèÜ Best Model Performance (Test Set):")
print(f"   R¬≤ Score: {test_r2:.4f}")
print(f"   RMSE: {test_rmse:.4f}")
print(f"   MAE: {test_mae:.4f}")
print(f"   Classification Accuracy: {test_acc:.2f}%")

print(f"\nüí° Key Insights:")
print(f"   ‚Ä¢ Neural networks require 70/15/15 split (train/val/test)")
print(f"   ‚Ä¢ Early stopping essential to prevent overfitting")
print(f"   ‚Ä¢ Model stopped at epoch {len(history.history['loss'])} (early stopping worked!)")
print(f"   ‚Ä¢ Validation set used for monitoring, test set untouched until final eval")
print(f"   ‚Ä¢ NN performance is {'competitive with' if test_r2 > 0.5 else 'comparable to'} tree-based models")

print("\n" + "="*80)
print("✅ Neural Network Analysis Complete!")
print("="*80)
