Toxicity Dataset : https://archive.ics.uci.edu/dataset/728/toxicity-2

The dataset includes 171 molecules designed for functional domains of CRY1, a core clock protein responsible for generating the circadian rhythm. Of these, 56 molecules are toxic and the rest are non-toxic.

The data consists of a complete set of 1203 molecular descriptors and needs feature selection before classification, since some of the features are redundant.

Introductory Paper:
Structure-based design and classifications of small molecules regulating the circadian rhythm period
By Seref Gul, F. Rahim, Safak Isin, Fatma Yilmaz, Nuri Ozturk, M. Turkay, I. Kavakli. 2021
https://www.semanticscholar.org/paper/Structure-based-design-and-classifications-of-small-Gul-Rahim/5944836c47bc7d1a2b0464a9a1db94d4bc7f28ce
Published in Scientific Reports

# Import necessary libraries

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.metrics import classification_report, roc_curve
from sklearn.feature_selection import VarianceThreshold
import warnings
warnings.filterwarnings('ignore')


# Load the toxicity dataset

In [11]:
import pandas as pd

# Read the CSV file
data = pd.read_csv("./data.csv")

# Display basic information about the dataset
print("Dataset shape:", data.shape)
print("\nFirst few rows:")
print(data.head())
print("\nColumn names:")
print(data.columns.tolist())

# Separate features and target
# Assuming the last column or a column named 'Class' contains the target
if 'Class' in data.columns:
    X = data.drop('Class', axis=1)
    y = data['Class']
else:
    # Assume last column is the target
    X = data.iloc[:, :-1]
    y = data.iloc[:, -1]

print(f"\nFeatures shape: {X.shape}")
print(f"Target shape: {y.shape}")

Dataset shape: (171, 1204)

First few rows:
   MATS3v  nHBint10  MATS3s  MATS3p  nHBDon_Lipinski  minHBint8  MATS3e  \
0  0.0908         0  0.0075  0.0173                0        0.0 -0.0436   
1  0.0213         0  0.1144 -0.0410                0        0.0  0.1231   
2  0.0018         0 -0.0156 -0.0765                2        0.0 -0.1138   
3 -0.0251         0 -0.0064 -0.0894                3        0.0 -0.0747   
4  0.0135         0  0.0424 -0.0353                0        0.0 -0.0638   

   MATS3c  minHBint2  MATS3m  ...   WTPT-4   WTPT-5  ETA_EtaP_L  ETA_EtaP_F  \
0  0.0409        0.0  0.1368  ...   0.0000   0.0000      0.1780      1.5488   
1 -0.0316        0.0  0.1318  ...   8.8660  19.3525      0.1739      1.3718   
2 -0.1791        0.0  0.0615  ...   5.2267  27.8796      0.1688      1.4395   
3 -0.1151        0.0  0.0361  ...   7.7896  24.7336      0.1702      1.4654   
4  0.0307        0.0  0.0306  ...  12.3240  19.7486      0.1789      1.4495   

   ETA_EtaP_B  nT5Ring  SHdNH 

# EDA

In [12]:
# Basic data exploration
print("\n=== DATA EXPLORATION ===")
print(f"Shape of features (X): {X.shape}")
print(f"Shape of target (y): {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nClass balance:")
print(y.value_counts(normalize=True))

# Check target data type and unique values
print(f"\nTarget data type: {y.dtype}")
print(f"Unique target values: {y.unique()}")


=== DATA EXPLORATION ===
Shape of features (X): (171, 1203)
Shape of target (y): (171,)

Target distribution:
Class
NonToxic    115
Toxic        56
Name: count, dtype: int64

Class balance:
Class
NonToxic    0.672515
Toxic       0.327485
Name: proportion, dtype: float64

Target data type: object
Unique target values: ['NonToxic' 'Toxic']


In [13]:
# Check for missing values
print(f"\nMissing values in features: {X.isnull().sum().sum()}")
print(f"Missing values in target: {y.isnull().sum()}")



Missing values in features: 0
Missing values in target: 0


# Preprocessing

In [14]:
# Handle missing values if any
print("=== PREPROCESSING ===")
print(f"Missing values in features: {X.isnull().sum().sum()}")
print(f"Missing values in target: {y.isnull().sum()}")

if X.isnull().sum().sum() > 0:
    # Step 1: Drop columns with too many missing values
    missing_threshold = 0.3  # Drop columns with >30% missing
    missing_prop = X.isnull().sum() / len(X)
    cols_to_drop = missing_prop[missing_prop > missing_threshold].index
    X = X.drop(columns=cols_to_drop)

    # Step 2: Impute the remaining missing values with the column median
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='median')
    X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
    print(f"Missing values imputed")

# Encode the target as binary: NonToxic = 1 (positive class), Toxic = 0.
# Precision, recall, and F1 reported later therefore describe the NonToxic class.
y_binary = (y == 'NonToxic').astype(int)

# Verify the binary conversion
print("\nBinary target distribution:")
print(y_binary.value_counts())
print(f"Class balance: {y_binary.value_counts(normalize=True)}")

# Double-check the conversion is correct
print(f"\nMapping verification:")
print(f"Original 'NonToxic' → Binary 1: {y_binary[y == 'NonToxic'].unique()}")
print(f"Original 'Toxic' → Binary 0: {y_binary[y == 'Toxic'].unique()}")

=== PREPROCESSING ===
Missing values in features: 0
Missing values in target: 0

Binary target distribution:
Class
1    115
0     56
Name: count, dtype: int64
Class balance: Class
1    0.672515
0    0.327485
Name: proportion, dtype: float64

Mapping verification:
Original 'NonToxic' → Binary 1: [1]
Original 'Toxic' → Binary 0: [0]


In [15]:
# Feature preprocessing
print("\n=== FEATURE PREPROCESSING ===")

# Remove constant features
constant_filter = VarianceThreshold(threshold=0)
X_filtered = constant_filter.fit_transform(X)
constant_columns = X.columns[~constant_filter.get_support()]
print(f"Removed {len(constant_columns)} constant features")

# Remove quasi-constant features (variance < 0.01)
quasi_constant_filter = VarianceThreshold(threshold=0.01)
X_filtered = quasi_constant_filter.fit_transform(X_filtered)
selected_features = X.columns[constant_filter.get_support()][quasi_constant_filter.get_support()]
X_filtered = pd.DataFrame(X_filtered, columns=selected_features)
print(f"Remaining features after variance filtering: {X_filtered.shape[1]}")

# Remove highly correlated features
correlation_matrix = X_filtered.corr().abs()
upper_triangle = correlation_matrix.where(
    np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)
)
high_corr_features = [column for column in upper_triangle.columns 
                      if any(upper_triangle[column] > 0.95)]
X_filtered = X_filtered.drop(columns=high_corr_features)
print(f"Removed {len(high_corr_features)} highly correlated features")
print(f"Final feature count: {X_filtered.shape[1]}")


=== FEATURE PREPROCESSING ===
Removed 0 constant features
Remaining features after variance filtering: 994
Removed 434 highly correlated features
Final feature count: 560
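
The variance and correlation filters above are fitted on all 171 molecules before the train/test split, so held-out rows influence which descriptors survive. One way to avoid this is to wrap the filters in a scikit-learn `Pipeline` so that they are refitted on each training fold. The sketch below is a minimal illustration on synthetic data, not the notebook's actual workflow; `CorrelationFilter`, `X_demo`, and `y_demo` are hypothetical names introduced here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Hypothetical helper: drop a column if |corr| with an earlier column > threshold."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
        # Entry [i, j] with i < j compares earlier column i with later column j
        upper = np.triu(corr, k=1)
        self.keep_ = ~(upper > self.threshold).any(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float)[:, self.keep_]


# Synthetic stand-in data (assumption: not the toxicity descriptors)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 30))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 1e-3 * rng.normal(size=120)  # near-duplicate of column 0
y_demo = (X_demo[:, 2] + 0.5 * rng.normal(size=120) > 0).astype(int)

pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.01)),
    ("correlation", CorrelationFilter(threshold=0.95)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(C=1.0, max_iter=2000)),
])

# Every step is refitted inside each training fold, so held-out rows never
# influence which features are kept or how they are scaled
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"Mean CV AUC: {scores.mean():.4f}")
```

The same `pipe` object could be passed to `GridSearchCV` in place of a bare estimator, with parameter names prefixed by the step name (for example `clf__C`).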


# Data Splitting

In [16]:
# Stratified train/test split so both sets keep the original class ratio
from sklearn.model_selection import StratifiedKFold

# Add some randomness to address potential ordering issues
np.random.seed(42)
shuffle_idx = np.random.permutation(len(X_filtered))
X_shuffled = X_filtered.iloc[shuffle_idx].reset_index(drop=True)
y_shuffled = y_binary.iloc[shuffle_idx].reset_index(drop=True)

X_train, X_test, y_train, y_test = train_test_split(
    X_shuffled, y_shuffled, test_size=0.2, random_state=42, 
    stratify=y_shuffled, shuffle=True
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {X_train_scaled.shape}")
print(f"Test set size: {X_test_scaled.shape}")

# Check class distribution in train and test sets
print(f"\nTrain set class distribution:")
print(pd.Series(y_train).value_counts(normalize=True))
print(f"\nTest set class distribution:")
print(pd.Series(y_test).value_counts(normalize=True))

Training set size: (136, 560)
Test set size: (35, 560)

Train set class distribution:
Class
1    0.669118
0    0.330882
Name: proportion, dtype: float64

Test set class distribution:
Class
1    0.685714
0    0.314286
Name: proportion, dtype: float64


# Evaluation function

In [17]:
# Define evaluation function
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Evaluate a classification model and return metrics"""
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Probabilities for AUC
    if hasattr(model, 'predict_proba'):
        y_train_proba = model.predict_proba(X_train)[:, 1]
        y_test_proba = model.predict_proba(X_test)[:, 1]
    else:
        y_train_proba = model.decision_function(X_train)
        y_test_proba = model.decision_function(X_test)
    
    # Calculate metrics
    metrics = {
        'model': model_name,
        'train_accuracy': accuracy_score(y_train, y_train_pred),
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'train_auc': roc_auc_score(y_train, y_train_proba),
        'test_auc': roc_auc_score(y_test, y_test_proba),
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1': f1_score(y_test, y_test_pred)
    }
    
    return metrics, y_test_pred, y_test_proba
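
`evaluate_model` falls back to `decision_function` when a model has no `predict_proba`. That works for AUC because AUC depends only on how the scores rank the samples, so any strictly increasing transform of the scores leaves it unchanged. The sketch below illustrates this with made-up labels and scores; `y_true` and `scores` are hypothetical values, not taken from the toxicity data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and raw decision scores (illustrative only)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([-1.2, 0.3, 0.8, 2.1, -0.4, 0.1, 1.5, 0.9])

# The logistic sigmoid maps scores into (0, 1) without changing their order
proba = 1.0 / (1.0 + np.exp(-scores))

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_proba = roc_auc_score(y_true, proba)
print(auc_from_scores, auc_from_proba)
```

Here 13 of the 16 positive/negative pairs are ranked correctly, so both calls give the same AUC of 13/16.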

In [18]:
# Initialize results storage
results = []
all_predictions = {}
all_probabilities = {}

In [19]:
print("\n" + "="*80)
print("COMPREHENSIVE MODEL COMPARISON: ORDINARY VS PENALIZED REGRESSION")
print("="*80)

print("""
This analysis compares the following models:
1. Ordinary Logistic Regression (no regularization) - Baseline
2. Ridge Regression (L2 penalty) - Shrinks coefficients
3. Lasso Regression (L1 penalty) - Feature selection + shrinkage  
4. Elastic Net (L1 + L2 penalty) - Combines both approaches
""")


COMPREHENSIVE MODEL COMPARISON: ORDINARY VS PENALIZED REGRESSION

This analysis compares the following models:
1. Ordinary Logistic Regression (no regularization) - Baseline
2. Ridge Regression (L2 penalty) - Shrinks coefficients
3. Lasso Regression (L1 penalty) - Feature selection + shrinkage  
4. Elastic Net (L1 + L2 penalty) - Combines both approaches



# Model training and evaluation

## Ordinary Logistic Regression

In [20]:
# 0. Ordinary Logistic Regression (Baseline)
print("\n0. ORDINARY LOGISTIC REGRESSION (BASELINE)")
print("-" * 50)

# No regularization - this is our baseline to compare against penalized methods
ordinary_lr = LogisticRegression(penalty=None, max_iter=2000, solver='lbfgs')
ordinary_lr.fit(X_train_scaled, y_train)

# Evaluate ordinary logistic regression
ordinary_metrics, ordinary_pred, ordinary_proba = evaluate_model(
    ordinary_lr, X_train_scaled, X_test_scaled, y_train, y_test, 'Ordinary LR'
)
results.append(ordinary_metrics)
all_predictions['Ordinary LR'] = ordinary_pred
all_probabilities['Ordinary LR'] = ordinary_proba

print(f"Training Accuracy: {ordinary_metrics['train_accuracy']:.4f}")
print(f"Test Accuracy: {ordinary_metrics['test_accuracy']:.4f}")
print(f"Test AUC: {ordinary_metrics['test_auc']:.4f}")
print(f"Precision: {ordinary_metrics['precision']:.4f}")
print(f"Recall: {ordinary_metrics['recall']:.4f}")
print(f"F1-Score: {ordinary_metrics['f1']:.4f}")

# Check for overfitting
overfitting = ordinary_metrics['train_accuracy'] - ordinary_metrics['test_accuracy']
print(f"Overfitting Gap (Train - Test Accuracy): {overfitting:.4f}")
if overfitting > 0.05:
    print("⚠️  Significant overfitting detected - penalized methods should help!")
else:
    print("✓ Low overfitting - but regularization may still improve generalization")


0. ORDINARY LOGISTIC REGRESSION (BASELINE)
--------------------------------------------------
Training Accuracy: 1.0000
Test Accuracy: 0.6286
Test AUC: 0.5909
Precision: 0.7619
Recall: 0.6667
F1-Score: 0.7111
Overfitting Gap (Train - Test Accuracy): 0.3714
⚠️  Significant overfitting detected - penalized methods should help!


## Ridge Regression

In [21]:
# Define stratified cross-validation
from sklearn.model_selection import StratifiedKFold
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


# 1. Ridge Regression (L2 Regularization)
print("\n1. RIDGE REGRESSION (L2 REGULARIZATION)")
print("-" * 50)

# Use broad parameter range for comprehensive comparison
ridge_params = {'C': np.logspace(-6, 6, 25)}  # Broader range
ridge = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=2000)

# Use StratifiedKFold to ensure balanced folds
ridge_cv = GridSearchCV(
    ridge, ridge_params, 
    cv=stratified_cv, 
    scoring='roc_auc', 
    n_jobs=-1
)

ridge_cv.fit(X_train_scaled, y_train)

print(f"Best Ridge parameter (C): {ridge_cv.best_params_['C']:.6f}")
print(f"Best CV AUC score: {ridge_cv.best_score_:.4f}")

# Evaluate best model
ridge_metrics, ridge_pred, ridge_proba = evaluate_model(
    ridge_cv.best_estimator_, X_train_scaled, X_test_scaled, y_train, y_test, 'Ridge'
)
results.append(ridge_metrics)
all_predictions['Ridge'] = ridge_pred
all_probabilities['Ridge'] = ridge_proba

# Display results
print(f"Training Accuracy: {ridge_metrics['train_accuracy']:.4f}")
print(f"Test Accuracy: {ridge_metrics['test_accuracy']:.4f}")
print(f"Test AUC: {ridge_metrics['test_auc']:.4f}")
print(f"Precision: {ridge_metrics['precision']:.4f}")
print(f"Recall: {ridge_metrics['recall']:.4f}")
print(f"F1-Score: {ridge_metrics['f1']:.4f}")

# Compare with ordinary LR
improvement_auc = ridge_metrics['test_auc'] - ordinary_metrics['test_auc']
improvement_acc = ridge_metrics['test_accuracy'] - ordinary_metrics['test_accuracy']
print(f"\nImprovement over Ordinary LR:")
print(f"  AUC: {improvement_auc:+.4f}")
print(f"  Accuracy: {improvement_acc:+.4f}")


1. RIDGE REGRESSION (L2 REGULARIZATION)
--------------------------------------------------
Best Ridge parameter (C): 0.000001
Best CV AUC score: 0.4270
Training Accuracy: 0.6691
Test Accuracy: 0.6857
Test AUC: 0.5530
Precision: 0.6857
Recall: 1.0000
F1-Score: 0.8136

Improvement over Ordinary LR:
  AUC: -0.0379
  Accuracy: +0.0571
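
The Ridge metrics above (accuracy 0.6857, precision 0.6857, recall 1.0000) are what a model that labels every test molecule NonToxic would produce. The sketch below checks this by scoring such a trivial predictor. It assumes the 35-molecule test split contains 24 NonToxic and 11 Toxic molecules, which is inferred from the printed class proportions (0.685714 and 0.314286); `y_test_demo` and `y_pred_demo` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Assumed test split: 24 NonToxic (1) and 11 Toxic (0), inferred from the proportions
y_test_demo = np.array([1] * 24 + [0] * 11)
# A trivial predictor that labels every molecule NonToxic
y_pred_demo = np.ones_like(y_test_demo)

acc = accuracy_score(y_test_demo, y_pred_demo)
prec = precision_score(y_test_demo, y_pred_demo)
rec = recall_score(y_test_demo, y_pred_demo)
f1 = f1_score(y_test_demo, y_pred_demo)
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1-Score:  {f1:.4f}")

# The same predictions scored with Toxic (label 0) as the positive class
toxic_recall = recall_score(y_test_demo, y_pred_demo, pos_label=0)
print(f"Toxic recall: {toxic_recall:.4f}")
```

Because NonToxic is encoded as the positive class, a recall of 1.0 here means no toxic molecule is flagged at all. Reporting metrics with `pos_label=0` makes that visible.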


## Lasso Regression

In [22]:
# 2. Lasso Regression (L1 Regularization) - IMPROVED ANALYSIS
print("\n2. LASSO REGRESSION (L1 REGULARIZATION) - IMPROVED ANALYSIS")
print("-" * 60)

# Test multiple C ranges to understand the behavior
print("🔍 DIAGNOSTIC: Testing different regularization strengths...")

# Start with very weak regularization to see if any features can be selected
test_c_values = [100, 50, 20, 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01]
diagnostic_results = []

for c_val in test_c_values:
    test_lasso = LogisticRegression(penalty='l1', solver='liblinear', C=c_val, max_iter=2000)
    test_lasso.fit(X_train_scaled, y_train)
    n_selected = np.sum(test_lasso.coef_[0] != 0)
    # Test-set AUC is printed for diagnosis only; it should not guide the choice of C
    test_proba = test_lasso.predict_proba(X_test_scaled)[:, 1]
    test_auc = roc_auc_score(y_test, test_proba)
    
    diagnostic_results.append({
        'C': c_val,
        'n_features': n_selected,
        'test_auc': test_auc,
        'max_coef': np.max(np.abs(test_lasso.coef_[0])) if n_selected > 0 else 0
    })
    
    print(f"C={c_val:6.2f}: {n_selected:4d} features, AUC={test_auc:.4f}, max|coef|={np.max(np.abs(test_lasso.coef_[0])):.6f}")

# Find the range where features start being selected
non_zero_results = [r for r in diagnostic_results if r['n_features'] > 0]
if non_zero_results:
    min_c_with_features = min(r['C'] for r in non_zero_results)
    max_c_with_features = max(r['C'] for r in non_zero_results)
    print(f"\n📊 Features selected in C range: [{min_c_with_features:.2f}, {max_c_with_features:.2f}]")
    
    # Use a focused range around where features are actually selected
    if min_c_with_features <= 10:
        lasso_params = {'C': np.logspace(np.log10(max(min_c_with_features/2, 0.1)), 
                                        np.log10(min(max_c_with_features*2, 100)), 20)}
    else:
        lasso_params = {'C': np.logspace(1, 2, 20)}  # Focus on C=[10, 100]
    
    print(f"🎯 Using focused C range: [{min(lasso_params['C']):.3f}, {max(lasso_params['C']):.3f}]")
else:
    print("\n⚠️  WARNING: No features selected even with very weak regularization (C=100)")
    print("This strongly suggests that predictor effects are extremely weak relative to noise.")
    print("Using minimal regularization range for analysis...")
    lasso_params = {'C': np.logspace(1, 3, 20)}  # C from 10 to 1000

# Create stratified CV if not already defined
if 'stratified_cv' not in locals():
    stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform grid search with the focused parameter range
lasso = LogisticRegression(penalty='l1', solver='liblinear', max_iter=2000)
lasso_cv = GridSearchCV(
    lasso, lasso_params, 
    cv=stratified_cv, 
    scoring='roc_auc',
    n_jobs=-1
)

print(f"\n🔄 Running Grid Search with {len(lasso_params['C'])} parameter values...")
lasso_cv.fit(X_train_scaled, y_train)

print(f"✅ Best Lasso parameter (C): {lasso_cv.best_params_['C']:.6f}")
print(f"✅ Best CV AUC score: {lasso_cv.best_score_:.4f}")

# Evaluate best model
lasso_metrics, lasso_pred, lasso_proba = evaluate_model(
    lasso_cv.best_estimator_, X_train_scaled, X_test_scaled, y_train, y_test, 'Lasso'
)
results.append(lasso_metrics)
all_predictions['Lasso'] = lasso_pred
all_probabilities['Lasso'] = lasso_proba

# Detailed coefficient analysis
lasso_coef = lasso_cv.best_estimator_.coef_[0]
n_selected_features = np.sum(lasso_coef != 0)
max_coef_magnitude = np.max(np.abs(lasso_coef)) if n_selected_features > 0 else 0

# Display comprehensive results
print(f"\n📈 LASSO RESULTS:")
print(f"Training Accuracy: {lasso_metrics['train_accuracy']:.4f}")
print(f"Test Accuracy: {lasso_metrics['test_accuracy']:.4f}")
print(f"Test AUC: {lasso_metrics['test_auc']:.4f}")
print(f"Precision: {lasso_metrics['precision']:.4f}")
print(f"Recall: {lasso_metrics['recall']:.4f}")
print(f"F1-Score: {lasso_metrics['f1']:.4f}")

print(f"\n🎯 FEATURE SELECTION ANALYSIS:")
print(f"Features selected: {n_selected_features}/{X_train_scaled.shape[1]} ({n_selected_features/X_train_scaled.shape[1]*100:.1f}%)")
print(f"Maximum coefficient magnitude: {max_coef_magnitude:.6f}")

if n_selected_features > 0:
    # Show the most important selected features
    feature_importance = pd.DataFrame({
        'feature_idx': range(len(lasso_coef)),
        'coefficient': lasso_coef,
        'abs_coefficient': np.abs(lasso_coef)
    })
    
    selected_features = feature_importance[feature_importance['coefficient'] != 0]
    selected_features = selected_features.sort_values('abs_coefficient', ascending=False)
    
    print(f"\n🏆 TOP 10 SELECTED FEATURES:")
    for i, (_, row) in enumerate(selected_features.head(10).iterrows(), 1):
        direction = "↑" if row['coefficient'] > 0 else "↓"
        print(f"{i:2d}. Feature_{int(row['feature_idx']):3d} {direction} {row['coefficient']:8.4f}")
    
    # Compare with ordinary LR if available
    if 'ordinary_metrics' in locals():
        improvement_auc = lasso_metrics['test_auc'] - ordinary_metrics['test_auc']
        improvement_acc = lasso_metrics['test_accuracy'] - ordinary_metrics['test_accuracy']
        print(f"\n📊 IMPROVEMENT OVER ORDINARY LR:")
        print(f"  AUC improvement: {improvement_auc:+.4f}")
        print(f"  Accuracy improvement: {improvement_acc:+.4f}")
        print(f"  Feature reduction: {X_train_scaled.shape[1] - n_selected_features} features removed")
        
        if improvement_auc > 0.01:
            print("✅ Meaningful improvement achieved through regularization")
        elif improvement_auc > -0.01:
            print("➖ Modest performance with significant dimensionality reduction")
        else:
            print("⚠️ Performance decreased - regularization may be too strong")
else:
    print("\n❌ NO FEATURES SELECTED - COMPLETE FEATURE ELIMINATION")
    print("\n🔬 DETAILED ANALYSIS:")
    print(f"   • Best C parameter: {lasso_cv.best_params_['C']:.6f}")
    print(f"   • All {X_train_scaled.shape[1]} coefficients set to exactly zero")
    print(f"   • Model defaults to predicting class proportions (AUC ≈ 0.5)")
    print(f"   • This indicates extremely weak signal-to-noise ratio")

    print(f"\n💡 IMPLICATIONS:")
    print(f"   • Individual features have negligible predictive power")
    print(f"   • High dimensionality (p={X_train_scaled.shape[1]}) vs sample size (n={X_train_scaled.shape[0]})")
    print(f"   • Possible multicollinearity masking true relationships")
    print(f"   • Data may require different modeling approaches (ensemble methods, dimensionality reduction)")

    print(f"\n🎯 RECOMMENDATIONS:")
    print(f"   • Consider PCA or other dimensionality reduction before modeling")
    print(f"   • Try ensemble methods (Random Forest, Gradient Boosting)")
    print(f"   • Investigate feature engineering opportunities")
    print(f"   • Consider non-linear modeling approaches")

# Show CV scores distribution for transparency
cv_scores = lasso_cv.cv_results_['mean_test_score']
print(f"\n📊 CROSS-VALIDATION SCORE DISTRIBUTION:")
print(f"   Mean CV AUC: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
print(f"   Min CV AUC:  {np.min(cv_scores):.4f}")
print(f"   Max CV AUC:  {np.max(cv_scores):.4f}")

if np.std(cv_scores) < 0.01:
    print("   ✅ Very stable across different C values")
elif np.std(cv_scores) < 0.05:
    print("   ✅ Reasonably stable performance")
else:
    print("   ⚠️ High variance across C values - consider model stability")


2. LASSO REGRESSION (L1 REGULARIZATION) - IMPROVED ANALYSIS
------------------------------------------------------------
🔍 DIAGNOSTIC: Testing different regularization strengths...
C=100.00:  105 features, AUC=0.5114, max|coef|=5.352088
C= 50.00:  103 features, AUC=0.5076, max|coef|=5.096339
C= 20.00:   90 features, AUC=0.5303, max|coef|=4.379703
C= 10.00:   92 features, AUC=0.5265, max|coef|=3.826230
C=  5.00:   87 features, AUC=0.5265, max|coef|=2.938843
C=  2.00:   88 features, AUC=0.5455, max|coef|=1.553257
C=  1.00:   80 features, AUC=0.5492, max|coef|=1.012076
C=  0.50:   70 features, AUC=0.5720, max|coef|=0.781104
C=  0.10:    4 features, AUC=0.5114, max|coef|=0.099917
C=  0.05:    0 features, AUC=0.5000, max|coef|=0.000000
C=  0.01:    0 features, AUC=0.5000, max|coef|=0.000000

📊 Features selected in C range: [0.10, 100.00]
🎯 Using focused C range: [0.100, 100.000]

🔄 Running Grid Search with 20 parameter values...
✅ Best Lasso parameter (C): 0.100000
✅ Best C
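
The diagnostic loop above reports AUC on the test set for every value of `C`. A variant that leaves the test set untouched is to estimate AUC by cross-validation on the training data while still counting the non-zero coefficients. The sketch below shows the idea on synthetic data generated with `make_classification`; `X_demo`, `y_demo`, and `path` are hypothetical names, and the numbers it prints say nothing about the toxicity dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption: not the toxicity data)
X_demo, y_demo = make_classification(
    n_samples=136, n_features=60, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
path = []
for c_val in [10, 1, 0.1, 0.01]:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=c_val, max_iter=2000)
    # AUC is estimated on held-out training folds, so no test data is consulted
    cv_auc = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc").mean()
    # Sparsity is read from a fit on the full training data
    n_nonzero = int(np.sum(model.fit(X_demo, y_demo).coef_[0] != 0))
    path.append({"C": c_val, "n_features": n_nonzero, "cv_auc": cv_auc})
    print(f"C={c_val:6.2f}: {n_nonzero:3d} features, CV AUC={cv_auc:.4f}")
```

For brevity the scaler here is fitted once on all of `X_demo`; in a stricter setup it would sit inside a `Pipeline` so that it is refitted per fold.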

## Elastic Net

In [23]:
# 3. Elastic Net (L1 + L2 Regularization)
print("\n3. ELASTIC NET (L1 + L2 REGULARIZATION)")
print("-" * 50)

# Use broader parameter search for better results
from sklearn.linear_model import SGDClassifier
elastic_params = {
    'alpha': np.logspace(-6, 2, 15),
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
elastic = SGDClassifier(
    loss='log_loss', 
    penalty='elasticnet', 
    max_iter=2000, 
    random_state=42,
    tol=1e-4
)

# Use StratifiedKFold to ensure balanced folds
elastic_cv = GridSearchCV(
    elastic, elastic_params, 
    cv=stratified_cv, 
    scoring='roc_auc',
    n_jobs=-1
)

elastic_cv.fit(X_train_scaled, y_train)

print(f"Best Elastic Net parameters:")
print(f"  Alpha: {elastic_cv.best_params_['alpha']:.6f}")
print(f"  L1 ratio: {elastic_cv.best_params_['l1_ratio']:.2f}")
print(f"Best CV AUC score: {elastic_cv.best_score_:.4f}")

# Evaluate best model
elastic_metrics, elastic_pred, elastic_proba = evaluate_model(
    elastic_cv.best_estimator_, X_train_scaled, X_test_scaled, y_train, y_test, 'Elastic Net'
)
results.append(elastic_metrics)
all_predictions['Elastic Net'] = elastic_pred
all_probabilities['Elastic Net'] = elastic_proba

# Display results
print(f"Training Accuracy: {elastic_metrics['train_accuracy']:.4f}")
print(f"Test Accuracy: {elastic_metrics['test_accuracy']:.4f}")
print(f"Test AUC: {elastic_metrics['test_auc']:.4f}")
print(f"Precision: {elastic_metrics['precision']:.4f}")
print(f"Recall: {elastic_metrics['recall']:.4f}")
print(f"F1-Score: {elastic_metrics['f1']:.4f}")

# Compare with ordinary LR
improvement_auc = elastic_metrics['test_auc'] - ordinary_metrics['test_auc']
improvement_acc = elastic_metrics['test_accuracy'] - ordinary_metrics['test_accuracy']
print(f"\nImprovement over Ordinary LR:")
print(f"  AUC: {improvement_auc:+.4f}")
print(f"  Accuracy: {improvement_acc:+.4f}")

# Interpret the L1 ratio
l1_ratio = elastic_cv.best_params_['l1_ratio']
if l1_ratio < 0.3:
    print(f"  Regularization: Mostly Ridge-like (L2 dominant)")
elif l1_ratio > 0.7:
    print(f"  Regularization: Mostly Lasso-like (L1 dominant)")
else:
    print(f"  Regularization: Balanced L1/L2 combination")


3. ELASTIC NET (L1 + L2 REGULARIZATION)
--------------------------------------------------
Best Elastic Net parameters:
  Alpha: 0.138950
  L1 ratio: 0.90
Best CV AUC score: 0.5080
Training Accuracy: 0.6691
Test Accuracy: 0.6857
Test AUC: 0.5000
Precision: 0.6857
Recall: 1.0000
F1-Score: 0.8136

Improvement over Ordinary LR:
  AUC: -0.0909
  Accuracy: +0.0571
  Regularization: Mostly Lasso-like (L1 dominant)
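
The Elastic Net above is fitted with `SGDClassifier`, which is parameterized by `alpha` and trained by stochastic gradient descent. An alternative, shown here only as a sketch, is `LogisticRegression` with `penalty="elasticnet"` and the `saga` solver, which uses the same inverse-strength parameter `C` as the Ridge and Lasso cells. The data below is synthetic and the grid is deliberately small; `X_demo`, `y_demo`, and `param_grid` are hypothetical names, and nothing here indicates how this model would perform on the toxicity data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption: not the toxicity data)
X_demo, y_demo = make_classification(
    n_samples=136, n_features=40, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# saga is the LogisticRegression solver that supports the elastic-net penalty
enet = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000)
param_grid = {
    "C": np.logspace(-2, 1, 4),    # inverse regularization strength
    "l1_ratio": [0.1, 0.5, 0.9],   # 0 = pure L2, 1 = pure L1
}
search = GridSearchCV(
    enet,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```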


# Comparison of models

In [24]:
# Create comprehensive results comparison
results_df = pd.DataFrame(results)

print("\n" + "="*80)
print("COMPREHENSIVE MODEL COMPARISON RESULTS")
print("="*80)

# Display results table
print("\nPerformance Metrics Table:")
print(results_df.round(4))

# Calculate relative improvements over ordinary logistic regression
print("\nRelative Improvements over Ordinary Logistic Regression:")
baseline_metrics = results_df[results_df['model'] == 'Ordinary LR'].iloc[0]
for idx, row in results_df.iterrows():
    if row['model'] != 'Ordinary LR':
        auc_improvement = row['test_auc'] - baseline_metrics['test_auc']
        acc_improvement = row['test_accuracy'] - baseline_metrics['test_accuracy']
        overfitting_reduction = (baseline_metrics['train_accuracy'] - baseline_metrics['test_accuracy']) - \
                               (row['train_accuracy'] - row['test_accuracy'])
        print(f"\n{row['model']}:")
        print(f"  AUC improvement: {auc_improvement:+.4f}")
        print(f"  Accuracy improvement: {acc_improvement:+.4f}")
        print(f"  Overfitting reduction: {overfitting_reduction:+.4f}")

# Find best performing model
best_auc_idx = results_df['test_auc'].idxmax()
best_model = results_df.loc[best_auc_idx]
print(f"\n🏆 Best Model (by AUC): {best_model['model']} with AUC = {best_model['test_auc']:.4f}")


COMPREHENSIVE MODEL COMPARISON RESULTS

Performance Metrics Table:
         model  train_accuracy  test_accuracy  train_auc  test_auc  precision  \
0  Ordinary LR          1.0000         0.6286     1.0000    0.5909     0.7619   
1        Ridge          0.6691         0.6857     0.7221    0.5530     0.6857   
2        Lasso          0.6691         0.6857     0.6947    0.5114     0.6857   
3  Elastic Net          0.6691         0.6857     0.5000    0.5000     0.6857   

   recall      f1  
0  0.6667  0.7111  
1  1.0000  0.8136  
2  1.0000  0.8136  
3  1.0000  0.8136  

Relative Improvements over Ordinary Logistic Regression:

Ridge:
  AUC improvement: -0.0379
  Accuracy improvement: +0.0571
  Overfitting reduction: +0.3880

Lasso:
  AUC improvement: -0.0795
  Accuracy improvement: +0.0571
  Overfitting reduction: +0.3880

Elastic Net:
  AUC improvement: -0.0909
  Accuracy improvement: +0.0571
  Overfitting reduction: +0.3880

🏆 Best Model (by AUC): Ordinary LR with AUC = 0.5909


In [25]:
# CROSS-VALIDATION STABILITY ANALYSIS
print("\n" + "="*80)
print("CROSS-VALIDATION STABILITY ANALYSIS")
print("="*80)

# Perform detailed cross-validation for each model
cv_results = {}
models_for_cv = {}

# Prepare models for CV analysis
if 'ordinary_lr' in locals():
    models_for_cv['Ordinary LR'] = ordinary_lr
if 'ridge_cv' in locals():
    models_for_cv['Ridge'] = ridge_cv.best_estimator_
if 'lasso_cv' in locals():
    models_for_cv['Lasso'] = lasso_cv.best_estimator_
if 'elastic_cv' in locals():
    models_for_cv['Elastic Net'] = elastic_cv.best_estimator_

print("Performing 10-fold cross-validation for stability assessment...")

for name, model in models_for_cv.items():
    # Perform cross-validation with multiple metrics
    cv_scores_auc = cross_val_score(model, X_train_scaled, y_train, cv=10, scoring='roc_auc')
    cv_scores_acc = cross_val_score(model, X_train_scaled, y_train, cv=10, scoring='accuracy')
    cv_scores_f1 = cross_val_score(model, X_train_scaled, y_train, cv=10, scoring='f1')
    
    cv_results[name] = {
        'auc_mean': cv_scores_auc.mean(),
        'auc_std': cv_scores_auc.std(),
        'auc_scores': cv_scores_auc,
        'acc_mean': cv_scores_acc.mean(),
        'acc_std': cv_scores_acc.std(),
        'acc_scores': cv_scores_acc,
        'f1_mean': cv_scores_f1.mean(),
        'f1_std': cv_scores_f1.std(),
        'f1_scores': cv_scores_f1
    }
    
    print(f"\n{name}:")
    # Note: mean ± 2·std of the fold scores describes fold-to-fold spread;
    # it is a rough band, not a formal 95% confidence interval for the mean
    print(f"  AUC: {cv_scores_auc.mean():.4f} ± {cv_scores_auc.std()*2:.4f} (95% CI)")
    print(f"  Accuracy: {cv_scores_acc.mean():.4f} ± {cv_scores_acc.std()*2:.4f}")
    print(f"  F1-Score: {cv_scores_f1.mean():.4f} ± {cv_scores_f1.std()*2:.4f}")

    # Stability assessment
    if cv_scores_auc.std() < 0.02:
        stability = "🟢 Very Stable"
    elif cv_scores_auc.std() < 0.05:
        stability = "🟡 Moderately Stable"
    else:
        stability = "🔴 Unstable"
    print(f"  Stability: {stability} (std = {cv_scores_auc.std():.4f})")

print(f"\n📊 STABILITY RANKING (by AUC standard deviation):")
stability_ranking = sorted(cv_results.items(), key=lambda x: x[1]['auc_std'])
# Use `stats` as the loop variable so the global `results` list is not overwritten
for i, (name, stats) in enumerate(stability_ranking, 1):
    print(f"{i}. {name:<15} (std = {stats['auc_std']:.4f})")


CROSS-VALIDATION STABILITY ANALYSIS
Performing 10-fold cross-validation for stability assessment...

Ordinary LR:
  AUC: 0.4131 ± 0.1296 (95% CI)
  Accuracy: 0.4637 ± 0.1643
  F1-Score: 0.5439 ± 0.2437
  Stability: 🔴 Unstable (std = 0.0648)

Ridge:
  AUC: 0.4919 ± 0.2339 (95% CI)
  Accuracy: 0.6698 ± 0.0553
  F1-Score: 0.8019 ± 0.0395
  Stability: 🔴 Unstable (std = 0.1169)

Lasso:
  AUC: 0.4975 ± 0.2262 (95% CI)
  Accuracy: 0.6626 ± 0.0485
  F1-Score: 0.7968 ± 0.0349
  Stability: 🔴 Unstable (std = 0.1131)

Elastic Net:
  AUC: 0.5000 ± 0.0000 (95% CI)
  Accuracy: 0.6698 ± 0.0553
  F1-Score: 0.8019 ± 0.0395
  Stability: 🟢 Very Stable (std = 0.0000)

📊 STABILITY RANKING (by AUC standard deviation):
1. Elastic Net     (std = 0.0000)
2. Ordinary LR     (std = 0.0648)
3. Lasso           (std = 0.1131)
4. Ridge           (std = 0.1169)
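
The values printed after "±" in the stability output are twice the standard deviation of the fold scores, which measures fold-to-fold spread rather than uncertainty about the mean. A common approximation for an interval around the mean score is a Student's t interval, mean ± t · s / √k. The sketch below compares the two quantities using only the standard library; the fold scores in `fold_auc` are made-up illustrative values, not notebook output, and 2.262 is the two-sided 95% critical value of t with 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical AUC scores from a 10-fold run (illustrative values only)
fold_auc = [0.40, 0.55, 0.35, 0.60, 0.45, 0.50, 0.30, 0.65, 0.52, 0.48]

k = len(fold_auc)
mean_auc = statistics.mean(fold_auc)

# Spread as printed by the notebook: 2 x population std (NumPy's default ddof=0)
spread = 2 * statistics.pstdev(fold_auc)

# t-interval for the mean: sample std (ddof=1) divided by sqrt(k)
T_CRIT_DF9 = 2.262  # two-sided 95% critical value, 9 degrees of freedom
half_width = T_CRIT_DF9 * statistics.stdev(fold_auc) / math.sqrt(k)
ci_low, ci_high = mean_auc - half_width, mean_auc + half_width

print(f"Mean AUC:       {mean_auc:.4f}")
print(f"2 x std spread: ±{spread:.4f}")
print(f"95% t-interval: [{ci_low:.4f}, {ci_high:.4f}]")
```

Cross-validation folds share training data, so the fold scores are not independent and even the t interval is only a rough guide.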
