Toxicity Dataset : https://archive.ics.uci.edu/dataset/728/toxicity-2

The dataset includes 171 molecules designed for functional domains of a core clock protein, CRY1, responsible for generating circadian rhythm. 56 of the molecules are toxic and the remaining 115 are non-toxic.

The data consists of a complete set of 1203 molecular descriptors and needs feature selection before classification, since some of the features are redundant.
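
The notebook below does not perform the feature selection mentioned above. As a minimal sketch of one way to remove redundant descriptors, the snippet drops constant columns and then one column from each highly correlated pair. The helper name `drop_redundant_features`, the 0.95 threshold, and the toy data are assumptions for illustration, not part of the original study.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def drop_redundant_features(X: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    """Drop constant columns, then one column from each highly correlated pair."""
    # Step 1: remove zero-variance (constant) descriptors.
    vt = VarianceThreshold(threshold=0.0)
    vt.fit(X)
    X_var = X.loc[:, vt.get_support()]

    # Step 2: absolute correlations, upper triangle only so each pair is seen once.
    corr = X_var.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return X_var.drop(columns=to_drop)


# Toy demonstration on synthetic data (not the toxicity descriptors).
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 1.0,  # perfectly correlated with "a"
    "b": b,
    "const": np.ones(50),     # zero variance
})
reduced = drop_redundant_features(demo)
print(list(reduced.columns))
```

In a real run the filter would be fitted on the training split only and then applied to the test split, so that no test information leaks into the selection.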

Introductory Paper:
Structure-based design and classifications of small molecules regulating the circadian rhythm period
By Seref Gul, F. Rahim, Safak Isin, Fatma Yilmaz, Nuri Ozturk, M. Turkay, I. Kavakli. 2021
https://www.semanticscholar.org/paper/Structure-based-design-and-classifications-of-small-Gul-Rahim/5944836c47bc7d1a2b0464a9a1db94d4bc7f28ce

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance
from sklearn.utils.class_weight import compute_class_weight
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import RidgeClassifier, SGDClassifier
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load and preprocess data
data = pd.read_csv("/kaggle/input/toxicity/data.csv")
X = data.drop('Class', axis=1) if 'Class' in data.columns else data.iloc[:, :-1]
y = data['Class'] if 'Class' in data.columns else data.iloc[:, -1]
y_binary = (y == 'NonToxic').astype(int)

print(f"Dataset shape: {data.shape}")
print(f"Class distribution:\n{y.value_counts()}")
print(f"Class balance:\n{y.value_counts(normalize=True)}")

Dataset shape: (171, 1204)
Class distribution:
Class
NonToxic    115
Toxic        56
Name: count, dtype: int64
Class balance:
Class
NonToxic    0.672515
Toxic       0.327485
Name: proportion, dtype: float64


In [3]:
# Shuffle rows, then make a stratified 80/20 split (train_test_split also shuffles by default)
np.random.seed(42)
shuffle_idx = np.random.permutation(len(X))
X_shuffled, y_shuffled = X.iloc[shuffle_idx].reset_index(drop=True), y_binary.iloc[shuffle_idx].reset_index(drop=True)
X_train, X_test, y_train, y_test = train_test_split(X_shuffled, y_shuffled, test_size=0.2, random_state=42, stratify=y_shuffled)

In [4]:
# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [5]:
print(f"Training set: {X_train_scaled.shape}, Test set: {X_test_scaled.shape}")

Training set: (136, 1203), Test set: (35, 1203)
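
With only 35 test molecules, a single hold-out split gives noisy estimates. The sketch below shows one possible alternative: wrapping the scaler and classifier in a pipeline and scoring with repeated stratified cross-validation. The synthetic data from `make_classification` is a stand-in of roughly similar size and class balance, and the choice of logistic regression with `C=0.1` is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with a similar size and class balance (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=171, n_features=200, n_informative=10,
    weights=[0.33, 0.67], random_state=42,
)

# The scaler lives inside the pipeline, so it is re-fitted on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=2000))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

Keeping the scaler inside the pipeline avoids computing scaling statistics on rows that are later used for validation.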


In [6]:
# Define models to compare
models = {
    # === LINEAR MODELS ===
    'LR_No_Penalty': LogisticRegression(penalty=None, max_iter=2000, solver='lbfgs'),
    'LR_Ridge_C1': LogisticRegression(penalty='l2', C=1.0, max_iter=2000, solver='lbfgs'),
    'LR_Ridge_C0.1': LogisticRegression(penalty='l2', C=0.1, max_iter=2000, solver='lbfgs'),
    'LR_Ridge_C10': LogisticRegression(penalty='l2', C=10.0, max_iter=2000, solver='lbfgs'),
    'LR_Lasso_C1': LogisticRegression(penalty='l1', C=1.0, max_iter=2000, solver='saga'),
    'LR_Lasso_C0.1': LogisticRegression(penalty='l1', C=0.1, max_iter=2000, solver='saga'),
    'LR_ElasticNet_L1_0.5': LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, C=1.0, max_iter=2000),
    'LR_ElasticNet_L1_0.7': LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.7, C=1.0, max_iter=2000),
    'Ridge_Classifier': RidgeClassifier(alpha=1.0),
    'SGD_Classifier': SGDClassifier(loss='log_loss', max_iter=2000, random_state=42),
    
    # === DISCRIMINANT ANALYSIS ===
    'LDA': LinearDiscriminantAnalysis(),
    'QDA': QuadraticDiscriminantAnalysis(),
    
    # === NAIVE BAYES ===
    'Naive_Bayes': GaussianNB(),
    
    # === TREE-BASED MODELS ===
    'Decision_Tree_D5': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Decision_Tree_D10': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Decision_Tree_D20': DecisionTreeClassifier(max_depth=20, random_state=42),
    'Decision_Tree_Unpruned': DecisionTreeClassifier(random_state=42),
    
    # === ENSEMBLE MODELS - BAGGING ===
    'Random_Forest_N50': RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42),
    'Random_Forest_N100': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    'Random_Forest_N200': RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
    'Random_Forest_Deep': RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42),
    'Extra_Trees_N100': ExtraTreesClassifier(n_estimators=100, max_depth=10, random_state=42),
    
    # === ENSEMBLE MODELS - BOOSTING ===
    'AdaBoost_N50': AdaBoostClassifier(n_estimators=50, random_state=42, algorithm='SAMME'),
    'AdaBoost_N100': AdaBoostClassifier(n_estimators=100, random_state=42, algorithm='SAMME'),
    'GradientBoosting_N50': GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
    'GradientBoosting_N100': GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42),
    'XGBoost_D3_N50': XGBClassifier(max_depth=3, n_estimators=50, random_state=42, eval_metric='logloss', use_label_encoder=False),
    'XGBoost_D3_N100': XGBClassifier(max_depth=3, n_estimators=100, random_state=42, eval_metric='logloss', use_label_encoder=False),
    'XGBoost_D5_N100': XGBClassifier(max_depth=5, n_estimators=100, random_state=42, eval_metric='logloss', use_label_encoder=False),
    
    # === SVM VARIATIONS ===
    'SVM_Linear': SVC(kernel='linear', probability=True, random_state=42),
    'SVM_RBF_C1': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
    'SVM_RBF_C10': SVC(kernel='rbf', C=10.0, probability=True, random_state=42),
    'SVM_Poly_D2': SVC(kernel='poly', degree=2, probability=True, random_state=42),
    'SVM_Poly_D3': SVC(kernel='poly', degree=3, probability=True, random_state=42),
    
    # === K-NEAREST NEIGHBORS ===
    'KNN_K3': KNeighborsClassifier(n_neighbors=3),
    'KNN_K5': KNeighborsClassifier(n_neighbors=5),
    'KNN_K7': KNeighborsClassifier(n_neighbors=7),
    'KNN_K10': KNeighborsClassifier(n_neighbors=10),
    
    # === NEURAL NETWORKS ===
    # Note: early_stopping only takes effect with solver='sgd' or 'adam'; it is ignored for 'lbfgs'
    'NN_Small': MLPClassifier(hidden_layer_sizes=(25,), max_iter=1000, random_state=42, early_stopping=True, solver='lbfgs'),
    'NN_Medium': MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=1000, random_state=42, early_stopping=True, solver='lbfgs'),
    'NN_Large': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42, early_stopping=True, solver='lbfgs'),
    'NN_Deep': MLPClassifier(hidden_layer_sizes=(100, 50, 25), max_iter=1000, random_state=42, early_stopping=True, solver='lbfgs'),
    'NN_Adam': MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=1000, random_state=42, early_stopping=True, solver='adam'),
}

In [7]:
# Evaluation function
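def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit a classifier and return train/test accuracy and AUC plus test precision, recall, and F1.

    Precision, recall, and F1 use scikit-learn's default pos_label=1, which is the
    NonToxic class under the encoding above, so they describe the majority class.
    """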
    X_train_selected, X_test_selected = X_train, X_test
    model.fit(X_train_selected, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train_selected)
    y_test_pred = model.predict(X_test_selected)
    
    # Probabilities
    if hasattr(model, 'predict_proba'):
        y_train_proba = model.predict_proba(X_train_selected)[:, 1]
        y_test_proba = model.predict_proba(X_test_selected)[:, 1]
    else:
        y_train_proba = model.decision_function(X_train_selected)
        y_test_proba = model.decision_function(X_test_selected)
    
    return {
        'train_acc': accuracy_score(y_train, y_train_pred),
        'test_acc': accuracy_score(y_test, y_test_pred),
        'train_auc': roc_auc_score(y_train, y_train_proba),
        'test_auc': roc_auc_score(y_test, y_test_proba),
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1': f1_score(y_test, y_test_pred)
    }
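
Because `y_binary` encodes NonToxic as 1, the precision, recall, and F1 returned above describe the majority class. A minimal sketch of scoring the Toxic class instead, using the `pos_label` argument of the scikit-learn metric functions. The toy labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels using the notebook's encoding: 0 = Toxic, 1 = NonToxic.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
# A degenerate classifier that always predicts NonToxic.
y_pred = np.ones_like(y_true)

# Default pos_label=1 scores the NonToxic class: recall looks perfect.
recall_nontoxic = recall_score(y_true, y_pred)

# pos_label=0 scores the Toxic class: no toxic molecule is recovered.
recall_toxic = recall_score(y_true, y_pred, pos_label=0)
precision_toxic = precision_score(y_true, y_pred, pos_label=0, zero_division=0)
f1_toxic = f1_score(y_true, y_pred, pos_label=0, zero_division=0)

print(recall_nontoxic, recall_toxic, precision_toxic, f1_toxic)
```

A recall of 1.0 under the default setting can therefore coincide with a model that never flags a toxic molecule.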

In [8]:
# Train and evaluate all models (WITHOUT class weights)
print("\n" + "="*80)
print("TRAINING MODELS WITHOUT CLASS WEIGHTS")
print("="*80)
results = []
for name, model in models.items():
    print(f"Training {name}...")
    metrics = evaluate_model(model, X_train_scaled, X_test_scaled, y_train, y_test)
    metrics['model'] = name
    results.append(metrics)
    print(f"  Test Accuracy: {metrics['test_acc']:.4f}, Test AUC: {metrics['test_auc']:.4f}")



TRAINING MODELS WITHOUT CLASS WEIGHTS
Training LR_No_Penalty...
  Test Accuracy: 0.5714, Test AUC: 0.5871
Training LR_Ridge_C1...
  Test Accuracy: 0.5714, Test AUC: 0.5644
Training LR_Ridge_C0.1...
  Test Accuracy: 0.5714, Test AUC: 0.5758
Training LR_Ridge_C10...
  Test Accuracy: 0.5714, Test AUC: 0.5606
Training LR_Lasso_C1...
  Test Accuracy: 0.6000, Test AUC: 0.5455
Training LR_Lasso_C0.1...
  Test Accuracy: 0.6857, Test AUC: 0.5644
Training LR_ElasticNet_L1_0.5...
  Test Accuracy: 0.6000, Test AUC: 0.5417
Training LR_ElasticNet_L1_0.7...
  Test Accuracy: 0.6000, Test AUC: 0.5417
Training Ridge_Classifier...
  Test Accuracy: 0.5143, Test AUC: 0.5492
Training SGD_Classifier...
  Test Accuracy: 0.5714, Test AUC: 0.4924
Training LDA...
  Test Accuracy: 0.6000, Test AUC: 0.5473
Training QDA...
  Test Accuracy: 0.5143, Test AUC: 0.4981
Training Naive_Bayes...
  Test Accuracy: 0.4286, Test AUC: 0.5076
Training Decision_Tree_D5...
  Test Accuracy: 0.6286, Test AUC: 0.5795
Training Decisi

In [9]:
# Create results DataFrame
results_df = pd.DataFrame(results)
results_df = results_df[['model', 'train_acc', 'test_acc', 'train_auc', 'test_auc', 'precision', 'recall', 'f1']]
results_df = results_df.sort_values('test_acc', ascending=False).reset_index(drop=True)

print("\n=== MODEL COMPARISON (NO CLASS WEIGHTS) ===")
print(results_df.to_string(index=False))


=== MODEL COMPARISON (NO CLASS WEIGHTS) ===
                 model  train_acc  test_acc  train_auc  test_auc  precision   recall       f1
           SVM_Poly_D2   0.720588  0.685714   0.011966  0.382576   0.685714 1.000000 0.813559
            SVM_RBF_C1   0.727941  0.685714   0.010501  0.382576   0.685714 1.000000 0.813559
         LR_Lasso_C0.1   0.669118  0.685714   0.723077  0.564394   0.685714 1.000000 0.813559
           SVM_Poly_D3   0.720588  0.685714   0.006105  0.333333   0.685714 1.000000 0.813559
     Random_Forest_N50   1.000000  0.657143   1.000000  0.638258   0.714286 0.833333 0.769231
                KNN_K3   0.757353  0.657143   0.779487  0.695076   0.750000 0.750000 0.750000
                KNN_K5   0.713235  0.657143   0.757021  0.702652   0.714286 0.833333 0.769231
                KNN_K7   0.727941  0.657143   0.728816  0.666667   0.714286 0.833333 0.769231
               KNN_K10   0.654412  0.657143   0.682051  0.655303   0.730769 0.791667 0.760000
  GradientBoost
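
The top rows of this table share a test accuracy of 0.685714 with recall 1.0, which is what a model that always predicts NonToxic would score. A minimal sketch of that majority-class baseline follows. The label counts (45/91 in training, 11/24 in test) are read off the class weights and confusion matrices printed later in this notebook, and the all-zero feature matrices are placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Label counts taken from the notebook output: 0 = Toxic, 1 = NonToxic.
y_train_demo = np.array([0] * 45 + [1] * 91)
y_test_demo = np.array([0] * 11 + [1] * 24)

# Features are irrelevant to a majority-class baseline, so zeros suffice.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(f"Majority-class baseline accuracy: {baseline_acc:.6f}")
```

Any model whose test accuracy does not exceed this value has not demonstrably learned anything beyond the class prior on this split.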

In [10]:
# ===== FEATURE IMPORTANCE ANALYSIS =====
print("\n" + "="*80)
print("FEATURE IMPORTANCE COMPARISON WITH ORIGINAL STUDY")
print("="*80)

# Original study's important features
original_features = ['MDEC-23', 'MATS2v', 'ATSC8s', 'VE3_Dt', 'CrippenMR', 'SpMax7_Bhe',
                     'SpMin1_Bhs', 'C1SP2', 'GATS8e', 'GATS8s', 'SpMax5_Bhv', 'VE3_Dzi', 'VPC-4']

feature_names = X.columns.tolist()


FEATURE IMPORTANCE COMPARISON WITH ORIGINAL STUDY


In [11]:
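def get_feature_importance(model, model_name, X_train, X_test, y_train, y_test):
    """Return (importance_df, method) for an already-fitted model.

    Uses impurity-based importances, absolute coefficients, or permutation
    importance on the test set, depending on what the model exposes.
    Relies on the global feature_names list for column labels.
    """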
    # Tree-based models: use built-in feature_importances_
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
        method = "Built-in (Impurity-based)"
    # Linear models: use absolute coefficient values
    elif hasattr(model, 'coef_'):
        importances = np.abs(model.coef_[0])
        method = "Coefficients"
    # Other models: use permutation importance
    else:
        perm_importance = permutation_importance(
            model, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
        )
        importances = perm_importance.importances_mean
        method = "Permutation"
    
    # Create DataFrame with feature importance
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    return importance_df, method

In [12]:
# Extract feature importance for each trained model
print(f"\nOriginal study identified {len(original_features)} important features using DTC:")
print(original_features)
print("\n" + "-"*80)


Original study identified 13 important features using DTC:
['MDEC-23', 'MATS2v', 'ATSC8s', 'VE3_Dt', 'CrippenMR', 'SpMax7_Bhe', 'SpMin1_Bhs', 'C1SP2', 'GATS8e', 'GATS8s', 'SpMax5_Bhv', 'VE3_Dzi', 'VPC-4']

--------------------------------------------------------------------------------


In [13]:
feature_comparison = {}

for name, model in models.items():
    print(f"\n### {name} ###")
    
    # Get feature importance
    importance_df, method = get_feature_importance(model, name, X_train_scaled, X_test_scaled, y_train, y_test)
    
    # Get top 13 features (same number as original study)
    top_13 = importance_df.head(13)
    top_13_features = top_13['feature'].tolist()
    
    # Calculate overlap with original study
    overlap = set(top_13_features) & set(original_features)
    overlap_count = len(overlap)
    overlap_pct = (overlap_count / len(original_features)) * 100
    
    print(f"Method: {method}")
    print(f"\nTop 13 Features:")
    print(top_13.to_string(index=False))
    print(f"\nOverlap with original study: {overlap_count}/{len(original_features)} ({overlap_pct:.1f}%)")
    if overlap:
        print(f"Matching features: {sorted(overlap)}")
    
    feature_comparison[name] = {
        'top_13': top_13_features,
        'overlap_count': overlap_count,
        'overlap_features': sorted(overlap),
        'method': method
    }



### LR_No_Penalty ###
Method: Coefficients

Top 13 Features:
        feature  importance
           JGI7    4.427035
       maxsssCH    4.049126
        nHBint3    3.663130
      topoShape    3.568342
         ALogp2    3.528732
         WTPT-2    3.461760
PetitjeanNumber    3.400879
          C3SP3    3.381391
      minHBint4    3.334651
      maxHBint4    3.254575
        minssNH    3.167231
        maxssNH    3.122925
        nF6Ring    3.050619

Overlap with original study: 0/13 (0.0%)

### LR_Ridge_C1 ###
Method: Coefficients

Top 13 Features:
  feature  importance
   ALogp2    0.485300
 BCUTw-1l    0.450650
 maxsssCH    0.418850
    minsF    0.418486
    maxsF    0.407373
    nBase    0.396368
  minssNH    0.381735
   MATS1s    0.373555
  maxssNH    0.368521
maxHBint5    0.357965
  VE3_Dzs    0.346970
    C3SP3    0.345447
minHCsatu    0.336533

Overlap with original study: 0/13 (0.0%)

### LR_Ridge_C0.1 ###
Method: Coefficients

Top 13 Features:
  feature  importance
   ALogp2 

In [14]:
# Summary comparison table
print("\n" + "="*80)
print("SUMMARY: OVERLAP WITH ORIGINAL STUDY")
print("="*80)
summary_df = pd.DataFrame({
    'Model': list(feature_comparison.keys()),
    'Overlap Count': [v['overlap_count'] for v in feature_comparison.values()],
    'Overlap %': [(v['overlap_count'] / len(original_features)) * 100 for v in feature_comparison.values()],
    'Method': [v['method'] for v in feature_comparison.values()]
}).sort_values('Overlap Count', ascending=False)

print(summary_df.to_string(index=False))


SUMMARY: OVERLAP WITH ORIGINAL STUDY
                 Model  Overlap Count  Overlap %                    Method
          AdaBoost_N50              2  15.384615 Built-in (Impurity-based)
         AdaBoost_N100              1   7.692308 Built-in (Impurity-based)
              NN_Large              1   7.692308               Permutation
             NN_Medium              1   7.692308               Permutation
           SVM_RBF_C10              0   0.000000               Permutation
  GradientBoosting_N50              0   0.000000 Built-in (Impurity-based)
 GradientBoosting_N100              0   0.000000 Built-in (Impurity-based)
        XGBoost_D3_N50              0   0.000000 Built-in (Impurity-based)
       XGBoost_D3_N100              0   0.000000 Built-in (Impurity-based)
       XGBoost_D5_N100              0   0.000000 Built-in (Impurity-based)
            SVM_Linear              0   0.000000              Coefficients
            SVM_RBF_C1              0   0.000000              
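
To put these overlap counts in context, it helps to know how much overlap would arise by chance. The sketch below assumes each model's top 13 were drawn uniformly at random from the 1203 descriptors, which ignores correlations among descriptors and is only a rough reference point. Under that assumption the overlap follows a hypergeometric distribution.

```python
from math import comb

n_features = 1203  # descriptors in the dataset
k_original = 13    # features highlighted by the original study
k_top = 13         # features taken from each model's ranking

# Expected overlap when k_top features are drawn uniformly at random.
expected_overlap = k_top * k_original / n_features


def p_overlap(j: int) -> float:
    """Hypergeometric probability of exactly j matches."""
    return (
        comb(k_original, j)
        * comb(n_features - k_original, k_top - j)
        / comb(n_features, k_top)
    )


p_at_least_one = 1.0 - p_overlap(0)
p_at_least_two = p_at_least_one - p_overlap(1)

print(f"Expected overlap by chance: {expected_overlap:.3f}")
print(f"P(overlap >= 1): {p_at_least_one:.3f}")
print(f"P(overlap >= 2): {p_at_least_two:.4f}")
```

The expected chance overlap is 169/1203, roughly 0.14 features per model, so a table dominated by zeros is not surprising in itself.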

In [15]:
# Find features commonly selected across multiple models
print("\n" + "="*80)
print("FEATURES SELECTED BY MULTIPLE MODELS (in top 13)")
print("="*80)

all_top_features = []
for comp in feature_comparison.values():
    all_top_features.extend(comp['top_13'])

feature_counts = pd.Series(all_top_features).value_counts()
frequent_features = feature_counts[feature_counts >= 3]

if len(frequent_features) > 0:
    print(f"\nFeatures selected by 3+ models:")
    for feat, count in frequent_features.items():
        in_original = "✓" if feat in original_features else " "
        print(f"  [{in_original}] {feat}: {count}/{len(models)} models")
else:
    print("No features were consistently selected across 3+ models")


FEATURES SELECTED BY MULTIPLE MODELS (in top 13)

Features selected by 3+ models:
  [ ] AATSC4m: 14/43 models
  [ ] SpDiam_Dt: 12/43 models
  [ ] minssNH: 12/43 models
  [ ] ALogp2: 10/43 models
  [ ] maxssNH: 10/43 models
  [ ] nBase: 9/43 models
  [ ] minsF: 9/43 models
  [ ] maxsF: 8/43 models
  [ ] ATSC3v: 8/43 models
  [ ] ATS2p: 8/43 models
  [ ] maxsssCH: 8/43 models
  [ ] MATS1s: 8/43 models
  [ ] MATS3v: 7/43 models
  [ ] ATS2s: 7/43 models
  [ ] ATS2v: 7/43 models
  [ ] JGI7: 7/43 models
  [ ] C3SP3: 7/43 models
  [ ] AATSC4i: 7/43 models
  [ ] AATSC4c: 6/43 models
  [ ] maxHBint4: 6/43 models
  [ ] SHBint10: 6/43 models
  [ ] minHCsatu: 6/43 models
  [ ] minHBint4: 6/43 models
  [ ] CrippenLogP: 6/43 models
  [ ] ATS5v: 6/43 models
  [ ] ATS3m: 5/43 models
  [ ] SpMax4_Bhm: 5/43 models
  [ ] MATS5i: 5/43 models
  [ ] BCUTw-1l: 5/43 models
  [ ] maxHBint5: 5/43 models
  [ ] topoShape: 5/43 models
  [ ] AATSC4e: 5/43 models
  [ ] ATSC2e: 4/43 models
  [ ] GATS7m: 4/43 models


In [16]:
# Save detailed comparison
comparison_results = []
for model_name, comp in feature_comparison.items():
    for i, feat in enumerate(comp['top_13'], 1):
        comparison_results.append({
            'model': model_name,
            'rank': i,
            'feature': feat,
            'in_original_study': feat in original_features
        })

comparison_df = pd.DataFrame(comparison_results)
comparison_df
# comparison_df.to_csv('feature_importance_comparison.csv', index=False)
# print("\n✓ Feature importance saved to 'feature_importance_comparison.csv'")

             model  rank    feature  in_original_study
0    LR_No_Penalty     1       JGI7              False
1    LR_No_Penalty     2   maxsssCH              False
2    LR_No_Penalty     3    nHBint3              False
3    LR_No_Penalty     4  topoShape              False
4    LR_No_Penalty     5     ALogp2              False
..             ...   ...        ...                ...
554        NN_Adam     9    minssNH              False
555        NN_Adam    10     ATSC8p              False
556        NN_Adam    11    AATSC8v              False
557        NN_Adam    12     minaaN              False


In [17]:
# ===== CLASS WEIGHT COMPARISON =====
print("\n" + "="*80)
print("TRAINING MODELS WITH CLASS WEIGHTS")
print("="*80)

# Calculate class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
weight_dict = {0: class_weights[0], 1: class_weights[1]}
print(f"\nComputed class weights: {weight_dict}")
print(f"Toxic (0): {class_weights[0]:.3f}, NonToxic (1): {class_weights[1]:.3f}")

# Calculate scale_pos_weight for XGBoost
n_toxic = np.sum(y_train == 0)
n_nontoxic = np.sum(y_train == 1)
scale_pos_weight = n_toxic / n_nontoxic
print(f"XGBoost scale_pos_weight: {scale_pos_weight:.3f}")


TRAINING MODELS WITH CLASS WEIGHTS

Computed class weights: {0: 1.511111111111111, 1: 0.7472527472527473}
Toxic (0): 1.511, NonToxic (1): 0.747
XGBoost scale_pos_weight: 0.495
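
The printed weights follow from the 'balanced' formula, n_samples / (n_classes * count_c). A short check by hand, assuming the training split holds 45 Toxic and 91 NonToxic rows (counts inferred from the printed weights and the 136-row training set):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training label counts implied by the printed weights: 45 Toxic (0), 91 NonToxic (1).
y_demo = np.array([0] * 45 + [1] * 91)

# 'balanced' weight for class c: n_samples / (n_classes * count_c).
classes, counts = np.unique(y_demo, return_counts=True)
manual = len(y_demo) / (len(classes) * counts)

# The same quantity as computed by scikit-learn.
sk = compute_class_weight("balanced", classes=classes, y=y_demo)

# XGBoost convention: negatives / positives, with NonToxic (1) as the positive class.
spw = counts[0] / counts[1]

print(dict(zip(classes.tolist(), manual.round(3).tolist())), round(float(spw), 3))
```

Since NonToxic is the positive class here, `scale_pos_weight` comes out below 1, which down-weights the majority class rather than up-weighting the minority one.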


In [18]:
# Define models WITH class weights
models_weighted = {
    # === LINEAR MODELS ===
    'LR_No_Penalty': LogisticRegression(penalty=None, max_iter=2000, solver='lbfgs', class_weight='balanced'),
    'LR_Ridge_C1': LogisticRegression(penalty='l2', C=1.0, max_iter=2000, solver='lbfgs', class_weight='balanced'),
    'LR_Ridge_C0.1': LogisticRegression(penalty='l2', C=0.1, max_iter=2000, solver='lbfgs', class_weight='balanced'),
    'LR_Ridge_C10': LogisticRegression(penalty='l2', C=10.0, max_iter=2000, solver='lbfgs', class_weight='balanced'),
    'LR_Lasso_C1': LogisticRegression(penalty='l1', C=1.0, max_iter=2000, solver='saga', class_weight='balanced'),
    'LR_Lasso_C0.1': LogisticRegression(penalty='l1', C=0.1, max_iter=2000, solver='saga', class_weight='balanced'),
    'LR_ElasticNet_L1_0.5': LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, C=1.0, max_iter=2000, class_weight='balanced'),
    'LR_ElasticNet_L1_0.7': LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.7, C=1.0, max_iter=2000, class_weight='balanced'),
    'Ridge_Classifier': RidgeClassifier(alpha=1.0, class_weight='balanced'),
    'SGD_Classifier': SGDClassifier(loss='log_loss', max_iter=2000, random_state=42, class_weight='balanced'),
    
    # === DISCRIMINANT ANALYSIS ===
    # LDA and QDA do not support class_weight parameter
    'LDA': LinearDiscriminantAnalysis(),
    'QDA': QuadraticDiscriminantAnalysis(),
    
    # === NAIVE BAYES ===
    # GaussianNB does not support class_weight parameter
    'Naive_Bayes': GaussianNB(),
    
    # === TREE-BASED MODELS ===
    'Decision_Tree_D5': DecisionTreeClassifier(max_depth=5, random_state=42, class_weight='balanced'),
    'Decision_Tree_D10': DecisionTreeClassifier(max_depth=10, random_state=42, class_weight='balanced'),
    'Decision_Tree_D20': DecisionTreeClassifier(max_depth=20, random_state=42, class_weight='balanced'),
    'Decision_Tree_Unpruned': DecisionTreeClassifier(random_state=42, class_weight='balanced'),
    
    # === ENSEMBLE MODELS - BAGGING ===
    'Random_Forest_N50': RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42, class_weight='balanced'),
    'Random_Forest_N100': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, class_weight='balanced'),
    'Random_Forest_N200': RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42, class_weight='balanced'),
    'Random_Forest_Deep': RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42, class_weight='balanced'),
    'Extra_Trees_N100': ExtraTreesClassifier(n_estimators=100, max_depth=10, random_state=42, class_weight='balanced'),
    
    # === ENSEMBLE MODELS - BOOSTING ===
    # AdaBoost does not support class_weight parameter directly
    'AdaBoost_N50': AdaBoostClassifier(n_estimators=50, random_state=42, algorithm='SAMME'),
    'AdaBoost_N100': AdaBoostClassifier(n_estimators=100, random_state=42, algorithm='SAMME'),
    # GradientBoosting does not support class_weight parameter directly
    'GradientBoosting_N50': GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
    'GradientBoosting_N100': GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42),
    # XGBoost uses scale_pos_weight instead of class_weight
    'XGBoost_D3_N50': XGBClassifier(max_depth=3, n_estimators=50, scale_pos_weight=scale_pos_weight, random_state=42, eval_metric='logloss', use_label_encoder=False),
    'XGBoost_D3_N100': XGBClassifier(max_depth=3, n_estimators=100, scale_pos_weight=scale_pos_weight, random_state=42, eval_metric='logloss', use_label_encoder=False),
    'XGBoost_D5_N100': XGBClassifier(max_depth=5, n_estimators=100, scale_pos_weight=scale_pos_weight, random_state=42, eval_metric='logloss', use_label_encoder=False),
    
    # === SVM VARIATIONS ===
    'SVM_Linear': SVC(kernel='linear', probability=True, random_state=42, class_weight='balanced'),
    'SVM_RBF_C1': SVC(kernel='rbf', C=1.0, probability=True, random_state=42, class_weight='balanced'),
    'SVM_RBF_C10': SVC(kernel='rbf', C=10.0, probability=True, random_state=42, class_weight='balanced'),
    'SVM_Poly_D2': SVC(kernel='poly', degree=2, probability=True, random_state=42, class_weight='balanced'),
    'SVM_Poly_D3': SVC(kernel='poly', degree=3, probability=True, random_state=42, class_weight='balanced'),
    
    # === K-NEAREST NEIGHBORS ===
    # KNN does not support class_weight parameter
    'KNN_K3': KNeighborsClassifier(n_neighbors=3),
    'KNN_K5': KNeighborsClassifier(n_neighbors=5),
    'KNN_K7': KNeighborsClassifier(n_neighbors=7),
    'KNN_K10': KNeighborsClassifier(n_neighbors=10),
    
    # === NEURAL NETWORKS ===
    # MLPClassifier does not support class_weight parameter
    'NN_Small': MLPClassifier(hidden_layer_sizes=(25,), max_iter=1000, random_state=42, early_stopping=True, solver='lbfgs'),
    'NN_Medium': MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=1000, random_state=42, early_stopping=True, solver='lbfgs'),
    'NN_Large': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42, early_stopping=True, solver='lbfgs'),
    'NN_Deep': MLPClassifier(hidden_layer_sizes=(100, 50, 25), max_iter=1000, random_state=42, early_stopping=True, solver='lbfgs'),
    'NN_Adam': MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=1000, random_state=42, early_stopping=True, solver='adam'),
}
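
Several estimators above are left unweighted because they take no `class_weight` argument. For those whose `fit` method accepts `sample_weight` (GradientBoostingClassifier, AdaBoostClassifier, and GaussianNB do), per-row weights are one possible workaround. A minimal sketch on synthetic data, which is an assumption and not the toxicity set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data as a stand-in (assumption, not the toxicity set).
X_demo, y_demo = make_classification(
    n_samples=136, n_features=20, weights=[0.33, 0.67], random_state=42
)

# One weight per row, inversely proportional to its class frequency.
sample_weight = compute_sample_weight(class_weight="balanced", y=y_demo)

gb_weighted = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
gb_weighted.fit(X_demo, y_demo, sample_weight=sample_weight)

# Each class now carries the same total weight.
totals = [sample_weight[y_demo == c].sum() for c in np.unique(y_demo)]
print(totals)
```

Using this inside the notebook would require `evaluate_model` to pass a `sample_weight` through to `fit`, which the current function does not do.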

In [19]:
# Train weighted models
results_weighted = []
for name, model in models_weighted.items():
    print(f"Training {name} (weighted)...")
    metrics = evaluate_model(model, X_train_scaled, X_test_scaled, y_train, y_test)
    metrics['model'] = name
    results_weighted.append(metrics)
    print(f"  Test Accuracy: {metrics['test_acc']:.4f}, Test AUC: {metrics['test_auc']:.4f}")


Training LR_No_Penalty (weighted)...
  Test Accuracy: 0.6000, Test AUC: 0.5682
Training LR_Ridge_C1 (weighted)...
  Test Accuracy: 0.5714, Test AUC: 0.5644
Training LR_Ridge_C0.1 (weighted)...
  Test Accuracy: 0.5714, Test AUC: 0.5758
Training LR_Ridge_C10 (weighted)...
  Test Accuracy: 0.5429, Test AUC: 0.5606
Training LR_Lasso_C1 (weighted)...
  Test Accuracy: 0.6000, Test AUC: 0.5417
Training LR_Lasso_C0.1 (weighted)...
  Test Accuracy: 0.5143, Test AUC: 0.5492
Training LR_ElasticNet_L1_0.5 (weighted)...
  Test Accuracy: 0.5714, Test AUC: 0.5455
Training LR_ElasticNet_L1_0.7 (weighted)...
  Test Accuracy: 0.6000, Test AUC: 0.5417
Training Ridge_Classifier (weighted)...
  Test Accuracy: 0.4857, Test AUC: 0.5530
Training SGD_Classifier (weighted)...
  Test Accuracy: 0.5714, Test AUC: 0.4962
Training LDA (weighted)...
  Test Accuracy: 0.6000, Test AUC: 0.5473
Training QDA (weighted)...
  Test Accuracy: 0.5143, Test AUC: 0.4981
Training Naive_Bayes (weighted)...
  Test Accuracy: 0.4286,

In [20]:
# Create comparison DataFrames
results_df_weighted = pd.DataFrame(results_weighted)

In [21]:
# Merge for side-by-side comparison
comparison = pd.merge(
    results_df[['model', 'test_acc', 'precision', 'recall', 'f1']],
    results_df_weighted[['model', 'test_acc', 'precision', 'recall', 'f1']],
    on='model',
    suffixes=('_original', '_weighted')
)

In [22]:
# Change in each metric (weighted minus original); negative values mean class weighting lowered the score
comparison['acc_change'] = comparison['test_acc_weighted'] - comparison['test_acc_original']
comparison['recall_change'] = comparison['recall_weighted'] - comparison['recall_original']
comparison['f1_change'] = comparison['f1_weighted'] - comparison['f1_original']
comparison

                  model  test_acc_original  precision_original  recall_original  f1_original  test_acc_weighted  precision_weighted  recall_weighted  f1_weighted  acc_change  recall_change  f1_change
0           SVM_Poly_D2           0.685714            0.685714              1.0     0.813559           0.485714                0.75            0.375          0.5        -0.2         -0.625  -0.313559
1            SVM_RBF_C1           0.685714            0.685714              1.0     0.813559           0.514286            0.733333         0.458333     0.564103   -0.171429      -0.541667  -0.249457
2         LR_Lasso_C0.1           0.685714            0.685714              1.0     0.813559           0.514286            0.684211         0.541667     0.604651   -0.171429      -0.458333  -0.208908
3           SVM_Poly_D3           0.685714            0.685714              1.0     0.813559           0.428571                 1.0         0.166667     0.285714   -0.257143      -0.833333  -0.527845
4     Random_Forest_N50           0.657143            0.714286         0.833333     0.769231           0.628571            0.689655         0.833333     0.754717   -0.028571            0.0  -0.014514
5                KNN_K3           0.657143                0.75             0.75         0.75           0.657143                0.75             0.75         0.75         0.0            0.0        0.0
6                KNN_K5           0.657143            0.714286         0.833333     0.769231           0.657143            0.714286         0.833333     0.769231         0.0            0.0        0.0
7                KNN_K7           0.657143            0.714286         0.833333     0.769231           0.657143            0.714286         0.833333     0.769231         0.0            0.0        0.0
8               KNN_K10           0.657143            0.730769         0.791667         0.76           0.657143            0.730769         0.791667         0.76         0.0            0.0        0.0
9  GradientBoosting_N50           0.657143            0.730769         0.791667         0.76           0.657143            0.730769         0.791667         0.76         0.0            0.0        0.0


In [23]:
# Detailed confusion matrix comparison
print("\n" + "="*100)
print("CONFUSION MATRIX COMPARISON (Original vs Weighted)")
print("="*100)

for name, model_original in models.items():
    model_weighted = models_weighted[name]
    
    # Get predictions
    y_pred_original = model_original.predict(X_test_scaled)
    y_pred_weighted = model_weighted.predict(X_test_scaled)
    
    # Confusion matrices
    cm_original = confusion_matrix(y_test, y_pred_original)
    cm_weighted = confusion_matrix(y_test, y_pred_weighted)
    
    print(f"\n### {name} ###")
    print("\nOriginal (No Class Weights):")
    print(f"                Predicted Toxic    Predicted NonToxic")
    print(f"Actual Toxic          {cm_original[0,0]:3d}                 {cm_original[0,1]:3d}")
    print(f"Actual NonToxic       {cm_original[1,0]:3d}                 {cm_original[1,1]:3d}")
    
    print("\nWith Class Weights:")
    print(f"                Predicted Toxic    Predicted NonToxic")
    print(f"Actual Toxic          {cm_weighted[0,0]:3d}                 {cm_weighted[0,1]:3d}")
    print(f"Actual NonToxic       {cm_weighted[1,0]:3d}                 {cm_weighted[1,1]:3d}")
    
    # Calculate recall for minority class
    recall_toxic_orig = cm_original[0,0] / (cm_original[0,0] + cm_original[0,1]) if (cm_original[0,0] + cm_original[0,1]) > 0 else 0
    recall_toxic_weighted = cm_weighted[0,0] / (cm_weighted[0,0] + cm_weighted[0,1]) if (cm_weighted[0,0] + cm_weighted[0,1]) > 0 else 0
    
    print(f"\nRecall for Toxic class: {recall_toxic_orig:.3f} → {recall_toxic_weighted:.3f} (Δ={recall_toxic_weighted-recall_toxic_orig:+.3f})")
    
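    # Missed toxic molecules: actual Toxic (row 0) predicted as NonToxic (column 1),
    # i.e. false negatives when Toxic is treated as the class of interest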
    fn_orig = cm_original[0,1]
    fn_weighted = cm_weighted[0,1]
    print(f"False Negatives (Toxic → NonToxic): {fn_orig} → {fn_weighted} (Δ={fn_weighted-fn_orig:+d})")



CONFUSION MATRIX COMPARISON (Original vs Weighted)

### LR_No_Penalty ###

Original (No Class Weights):
                Predicted Toxic    Predicted NonToxic
Actual Toxic            5                   6
Actual NonToxic         9                  15

With Class Weights:
                Predicted Toxic    Predicted NonToxic
Actual Toxic            6                   5
Actual NonToxic         9                  15

Recall for Toxic class: 0.455 → 0.545 (Δ=+0.091)
False Negatives (Toxic → NonToxic): 6 → 5 (Δ=-1)

### LR_Ridge_C1 ###

Original (No Class Weights):
                Predicted Toxic    Predicted NonToxic
Actual Toxic            4                   7
Actual NonToxic         8                  16

With Class Weights:
                Predicted Toxic    Predicted NonToxic
Actual Toxic            4                   7
Actual NonToxic         8                  16

Recall for Toxic class: 0.364 → 0.364 (Δ=+0.000)
False Negatives (Toxic → NonToxic): 7 → 7 (Δ=+0)

### LR_Ridge_C0.1 #

In [24]:
# Save all comparisons
# comparison.to_csv('class_weight_comparison.csv', index=False)
# results_df.to_csv('model_results_original.csv', index=False)
# results_df_weighted.to_csv('model_results_weighted.csv', index=False)
# print("\n✓ All results saved to CSV files")
# print("  - model_results_original.csv")
# print("  - model_results_weighted.csv")
# print("  - class_weight_comparison.csv")
# print("  - feature_importance_comparison.csv")