# Feature Selection & Model Comparison

## Objective:
1. Apply multiple feature selection algorithms to identify the most influential features
2. Compare model performance with all features vs selected features
3. Evaluate training time reduction

## Feature Selection Algorithms:
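- **Boruta** - Wrapper method that tests each feature against shuffled "shadow" copies using a Random Forest
- **RFE (Recursive Feature Elimination)** - Wrapper method that repeatedly refits a model and drops the least important feature
- **Correlation-based Feature Selection** - Filter method based on correlation with the target and between features
- **ContrastFS** - Contrast-based selection (Section 2.4 uses ensemble feature-importance selection as an alternative)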

## Models:
- LightGBM
- XGBClassifier
- CatBoost

## Evaluation Metrics:
- Accuracy
- Precision
- Recall
- F1-Score
- MCC (Matthews Correlation Coefficient)
- Training Time
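
For reference, MCC can be computed directly from the four counts of a binary confusion matrix. The sketch below is a plain-Python illustration with made-up counts; the helper name `mcc_from_counts` is hypothetical, and the notebook itself relies on `sklearn.metrics.matthews_corrcoef`.

```python
import math

def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

# Illustrative counts only, not taken from the dataset.
perfect = mcc_from_counts(tp=50, tn=50, fp=0, fn=0)    # every prediction correct
chance = mcc_from_counts(tp=25, tn=25, fp=25, fn=25)   # no better than guessing
print(perfect, chance)
```

Unlike accuracy, MCC stays near 0 for a classifier that ignores the inputs, even when the classes are imbalanced.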

In [3]:
# Install required packages (pip skips any that are already installed)
%pip install boruta lightgbm xgboost catboost scikit-learn pandas numpy






In [4]:
# Import required libraries
import pandas as pd
import numpy as np
import time
import warnings
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    f1_score, matthews_corrcoef, classification_report
)

# Models
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Boruta
from boruta import BorutaPy

print("All libraries imported successfully!")

All libraries imported successfully!


## Step 1: Load Dataset

In [5]:
# Load the dataset with 63 features
df = pd.read_csv('new_dataset/PhiUSIIL_Phishing_URL_63_Features.csv')

print("=" * 70)
print("DATASET INFORMATION")
print("=" * 70)
print(f"Number of Rows: {len(df)}")
print(f"Number of Columns: {len(df.columns)}")
print(f"\nColumn Names:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i}. {col}")

df.head()

DATASET INFORMATION
Number of Rows: 235795
Number of Columns: 64

Column Names:
  1. FILENAME
  2. URL
  3. URLLength
  4. Domain
  5. DomainLength
  6. IsDomainIP
  7. TLD
  8. URLSimilarityIndex
  9. CharContinuationRate
  10. TLDLegitimateProb
  11. URLCharProb
  12. TLDLength
  13. NoOfSubDomain
  14. HasObfuscation
  15. NoOfObfuscatedChar
  16. ObfuscationRatio
  17. NoOfLettersInURL
  18. LetterRatioInURL
  19. NoOfDegitsInURL
  20. DegitRatioInURL
  21. NoOfEqualsInURL
  22. NoOfQMarkInURL
  23. NoOfAmpersandInURL
  24. NoOfOtherSpecialCharsInURL
  25. SpacialCharRatioInURL
  26. IsHTTPS
  27. LineOfCode
  28. LargestLineLength
  29. HasTitle
  30. Title
  31. DomainTitleMatchScore
  32. URLTitleMatchScore
  33. HasFavicon
  34. Robots
  35. IsResponsive
  36. NoOfURLRedirect
  37. NoOfSelfRedirect
  38. HasDescription
  39. NoOfPopup
  40. NoOfiFrame
  41. HasExternalFormSubmit
  42. HasSocialNet
  43. HasSubmitButton
  44. HasHiddenFields
  45. HasPasswordField
  46. Bank
  4

| Unnamed: 0 | FILENAME | URL | URLLength | Domain | DomainLength | IsDomainIP | TLD | URLSimilarityIndex | CharContinuationRate | TLDLegitimateProb | ... | NoOfExternalRef | Unnamed: 55 | has_no_www | num_slashes | num_hyphens | URL_Profanity_Prob | URL_NumberOf_Profanity | URLContent_Profanity_Prob | URLContent_NumberOf_Profanity | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 521848.txt | https://www.southbankmosaics.com | 32 | www.southbankmosaics.com | 24 | 0 | com | 100.0 | 1.0 | 0.522907 | ... | 124 | | 0 | 2 | 0 | 0.012189 | 1 | 0.01188 | 1 | 1 |
| 1 | 31372.txt | https://www.uni-mainz.de | 24 | www.uni-mainz.de | 16 | 0 | de | 100.0 | 0.666667 | 0.03265 | ... | 217 | | 0 | 2 | 1 | 0.027988 | 0 | 0.019723 | 0 | 1 |
| 2 | 597387.txt | https://www.voicefmradio.co.uk | 30 | www.voicefmradio.co.uk | 22 | 0 | uk | 100.0 | 0.866667 | 0.028555 | ... | 5 | | 0 | 2 | 0 | 0.015063 | 0 | 0.000294 | 1 | 1 |
| 3 | 554095.txt | https://www.sfnmjournal.com | 27 | www.sfnmjournal.com | 19 | 0 | com | 100.0 | 1.0 | 0.522907 | ... | 31 | | 0 | 2 | 0 | 0.012189 | 0 | 0.0 | 0 | 1 |
| 4 | 151578.txt | https://www.rewildingargentina.org | 34 | www.rewildingargentina.org | 26 | 0 | org | 100.0 | 1.0 | 0.079963 | ... | 85 | | 0 | 2 | 0 | 0.005476 | 0 | 0.002091 | 48 | 1 |


In [6]:
# Prepare features and target
# Exclude non-feature columns (URL, FILENAME, etc.)
exclude_cols = ['URL', 'FILENAME', 'Domain', 'TLD', 'Title']
target_col = 'label'  # Adjust if your target column has different name

# Find the target column
if 'label' in df.columns:
    target_col = 'label'
elif 'Label' in df.columns:
    target_col = 'Label'
elif 'CLASS_LABEL' in df.columns:
    target_col = 'CLASS_LABEL'
else:
    # Fail fast instead of continuing with a target column that does not exist
    raise KeyError(f"Target column not found. Available columns: {df.columns.tolist()}")

print(f"Target column: {target_col}")
print(f"Target distribution:\n{df[target_col].value_counts()}")

Target column: label
Target distribution:
label
1    134850
0    100945
Name: count, dtype: int64


In [7]:
# Prepare feature matrix X and target vector y
# Get only numeric columns for features
feature_cols = [col for col in df.columns 
                if col not in exclude_cols + [target_col] 
                and df[col].dtype in ['int64', 'float64', 'int32', 'float32']]

print(f"Number of feature columns: {len(feature_cols)}")
print(f"\nFeature columns:")
for i, col in enumerate(feature_cols, 1):
    print(f"  {i}. {col}")

X = df[feature_cols].copy()
y = df[target_col].copy()

# Encode target if needed
if y.dtype == 'object':
    le = LabelEncoder()
    y = le.fit_transform(y)
    print(f"\nTarget encoded: {le.classes_}")

# Handle missing values
X = X.fillna(0)

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")

Number of feature columns: 58

Feature columns:
  1. URLLength
  2. DomainLength
  3. IsDomainIP
  4. URLSimilarityIndex
  5. CharContinuationRate
  6. TLDLegitimateProb
  7. URLCharProb
  8. TLDLength
  9. NoOfSubDomain
  10. HasObfuscation
  11. NoOfObfuscatedChar
  12. ObfuscationRatio
  13. NoOfLettersInURL
  14. LetterRatioInURL
  15. NoOfDegitsInURL
  16. DegitRatioInURL
  17. NoOfEqualsInURL
  18. NoOfQMarkInURL
  19. NoOfAmpersandInURL
  20. NoOfOtherSpecialCharsInURL
  21. SpacialCharRatioInURL
  22. IsHTTPS
  23. LineOfCode
  24. LargestLineLength
  25. HasTitle
  26. DomainTitleMatchScore
  27. URLTitleMatchScore
  28. HasFavicon
  29. Robots
  30. IsResponsive
  31. NoOfURLRedirect
  32. NoOfSelfRedirect
  33. HasDescription
  34. NoOfPopup
  35. NoOfiFrame
  36. HasExternalFormSubmit
  37. HasSocialNet
  38. HasSubmitButton
  39. HasHiddenFields
  40. HasPasswordField
  41. Bank
  42. Pay
  43. Crypto
  44. HasCopyrightInfo
  45. NoOfImage
  46. NoOfCSS
  47. NoOfJS
  48. 

In [8]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("=" * 70)
print("DATA SPLIT")
print("=" * 70)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Number of features: {X_train.shape[1]}")

DATA SPLIT
Training set: 188636 samples
Test set: 47159 samples
Number of features: 58


---
## Step 2: Feature Selection Algorithms

### 2.1 Boruta Feature Selection

In [9]:
# Boruta Feature Selection
print("=" * 70)
print("BORUTA FEATURE SELECTION")
print("=" * 70)

# Initialize Random Forest for Boruta
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)

# Initialize Boruta
boruta_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=42, max_iter=100)

# Fit Boruta
start_time = time.time()
boruta_selector.fit(X_train.values, y_train)
boruta_time = time.time() - start_time

# Get selected features
boruta_features = X_train.columns[boruta_selector.support_].tolist()

print(f"\nBoruta completed in {boruta_time:.2f} seconds")
print(f"Selected {len(boruta_features)} features out of {X_train.shape[1]}")
print(f"\nSelected features:")
for i, f in enumerate(boruta_features, 1):
    print(f"  {i}. {f}")

BORUTA FEATURE SELECTION
Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	58
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	58
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	58
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	58
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	58
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	58
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	58
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	51
Tentative: 	7
Rejected: 	0
Iteration: 	9 / 100
Confirmed: 	51
Tentative: 	7
Rejected: 	0
Iteration: 	10 / 100
Confirmed: 	51
Tentative: 	7
Rejected: 	0
Iteration: 	11 / 100
Confirmed: 	51
Tentative: 	6
Rejected: 	1
Iteration: 	12 / 100
Confirmed: 	52
Tentative: 	5
Rejected: 	1
Iteration: 	13 / 100
Confirmed: 	52
Tentative: 	5
Rejected: 	1
Iteration: 	14 / 100
Confirmed: 	52
Tentative: 	5
Rejected: 	1
Iteration: 	15 / 100
Confirmed: 	52
Tentative: 	5
Rejected: 	1
Iteration: 	16 / 100
Confirmed: 	52
Ten
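
The idea behind Boruta can be sketched without the library: each round, every feature is shuffled into a "shadow" copy, and a real feature scores a hit only if it is more important than the best shadow. The sketch below is a simplified illustration, not BorutaPy's implementation. It swaps in absolute Pearson correlation for the Random Forest importance and only counts hits, whereas Boruta applies a statistical test to the hit counts to confirm or reject features. All names and the toy data are hypothetical.

```python
import random

def abs_pearson(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5

def count_shadow_hits(features, target, n_rounds=20, seed=42):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        # Shadows keep each feature's value distribution but break
        # any relationship with the target.
        shadow_scores = []
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_scores.append(abs_pearson(shadow, target))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_pearson(values, target) > best_shadow:
                hits[name] += 1
    return hits

# Toy data (hypothetical): "signal" copies the target, "noise" is unrelated to it.
target = [0, 1] * 20
toy = {
    "signal": list(target),
    "noise": [0, 0, 1, 1] * 10,
}
hits = count_shadow_hits(toy, target)
print(hits)
```

A feature that wins in nearly every round would be confirmed, one that almost never wins would be rejected, and the rest stay tentative until more rounds are run.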

### 2.2 RFE (Recursive Feature Elimination)

In [10]:
# RFE Feature Selection
print("=" * 70)
print("RFE (RECURSIVE FEATURE ELIMINATION)")
print("=" * 70)

# Use LightGBM as base estimator for RFE
lgbm_estimator = LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)

# Select top 20 features (adjust as needed)
n_features_to_select = 20
rfe_selector = RFE(estimator=lgbm_estimator, n_features_to_select=n_features_to_select, step=1)

start_time = time.time()
rfe_selector.fit(X_train, y_train)
rfe_time = time.time() - start_time

# Get selected features
rfe_features = X_train.columns[rfe_selector.support_].tolist()

print(f"\nRFE completed in {rfe_time:.2f} seconds")
print(f"Selected {len(rfe_features)} features out of {X_train.shape[1]}")
print(f"\nSelected features:")
for i, f in enumerate(rfe_features, 1):
    print(f"  {i}. {f}")

RFE (RECURSIVE FEATURE ELIMINATION)

RFE completed in 116.74 seconds
Selected 20 features out of 58

Selected features:
  1. URLLength
  2. URLSimilarityIndex
  3. CharContinuationRate
  4. TLDLegitimateProb
  5. URLCharProb
  6. NoOfSubDomain
  7. NoOfLettersInURL
  8. LetterRatioInURL
  9. SpacialCharRatioInURL
  10. IsHTTPS
  11. LineOfCode
  12. LargestLineLength
  13. HasDescription
  14. NoOfImage
  15. NoOfCSS
  16. NoOfJS
  17. NoOfSelfRef
  18. NoOfEmptyRef
  19. NoOfExternalRef
  20. URL_Profanity_Prob
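
The elimination loop itself is simple, and writing it out shows why RFE is slow. With `step=1`, going from 58 to 20 features takes 38 elimination rounds, each of which refits the estimator. The sketch below is a minimal illustration of the loop, not scikit-learn's implementation. The function names and the fixed toy importances are hypothetical, and in practice the scores come from refitting a model on the surviving features.

```python
def recursive_elimination(names, score_fn, n_to_keep, step=1):
    """Drop the lowest-scoring features until n_to_keep remain."""
    remaining = list(names)
    n_fits = 0
    while len(remaining) > n_to_keep:
        scores = score_fn(remaining)  # stands in for refitting on the survivors
        n_fits += 1
        n_drop = min(step, len(remaining) - n_to_keep)
        weakest = sorted(remaining, key=lambda name: scores[name])[:n_drop]
        remaining = [name for name in remaining if name not in weakest]
    return remaining, n_fits

# Hypothetical fixed importances; a real estimator would recompute them each round.
toy_importance = {"a": 0.50, "b": 0.10, "c": 0.30, "d": 0.05}

def toy_score_fn(active):
    return {name: toy_importance[name] for name in active}

selected, n_fits = recursive_elimination(toy_importance, toy_score_fn, n_to_keep=2)
print(selected, n_fits)

# Elimination rounds for the notebook's settings (58 features, keep 20, step=1).
rounds_in_notebook = (58 - 20) // 1
print(rounds_in_notebook)
```

Raising `step` cuts the number of refits roughly in proportion, at the cost of a coarser ranking.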


### 2.3 Correlation-based Feature Selection

In [11]:
# Correlation-based Feature Selection
print("=" * 70)
print("CORRELATION-BASED FEATURE SELECTION")
print("=" * 70)

start_time = time.time()

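# Calculate the absolute Pearson correlation of each feature with the target.
# Re-index the target to X_train so rows align even if y_train is a NumPy array
# (which is the case when the label-encoding branch above runs).
target_series = pd.Series(np.asarray(y_train), index=X_train.index)
correlations = pd.DataFrame()
correlations['feature'] = feature_cols
correlations['correlation'] = [abs(X_train[col].corr(target_series)) for col in feature_cols]
correlations = correlations.sort_values('correlation', ascending=False)

# Select features whose absolute correlation with the target meets the threshold
correlation_threshold = 0.1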
corr_features = correlations[correlations['correlation'] >= correlation_threshold]['feature'].tolist()

# Also remove highly correlated features among themselves
corr_matrix = X_train[corr_features].corr().abs()
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.95)]
corr_features_final = [f for f in corr_features if f not in to_drop]

corr_time = time.time() - start_time

print(f"\nCorrelation-based selection completed in {corr_time:.2f} seconds")
print(f"Selected {len(corr_features_final)} features out of {X_train.shape[1]}")
print(f"\nTop 10 correlated features with target:")
print(correlations.head(10).to_string(index=False))
print(f"\nSelected features:")
for i, f in enumerate(corr_features_final, 1):
    print(f"  {i}. {f}")

CORRELATION-BASED FEATURE SELECTION

Correlation-based selection completed in 1.36 seconds
Selected 39 features out of 58

Top 10 correlated features with target:
              feature  correlation
   URLSimilarityIndex     0.860443
         HasSocialNet     0.783682
     HasCopyrightInfo     0.742820
       HasDescription     0.690587
           has_no_www     0.668396
              IsHTTPS     0.612900
DomainTitleMatchScore     0.583463
      HasSubmitButton     0.578994
         IsResponsive     0.548483
   URLTitleMatchScore     0.538363

Selected features:
  1. URLSimilarityIndex
  2. HasSocialNet
  3. HasCopyrightInfo
  4. HasDescription
  5. has_no_www
  6. IsHTTPS
  7. DomainTitleMatchScore
  8. HasSubmitButton
  9. IsResponsive
  10. SpacialCharRatioInURL
  11. HasHiddenFields
  12. HasFavicon
  13. num_slashes
  14. URLCharProb
  15. CharContinuationRate
  16. HasTitle
  17. DegitRatioInURL
  18. Robots
  19. LetterRatioInURL
  20. Pay
  21. NoOfSelfRef
  22. NoOfJS
  23. NoO
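
The two stages of this filter (relevance to the target, then redundancy among the survivors) can be traced on a tiny example. The sketch below mirrors the logic of the cell above in plain Python, under the same thresholds of 0.1 and 0.95. The function names and the toy columns are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_filter(features, target, relevance=0.1, redundancy=0.95):
    """Keep relevant features, then drop those that duplicate a higher-ranked one."""
    score = {name: abs(pearson(values, target)) for name, values in features.items()}
    ranked = sorted(
        (name for name in features if score[name] >= relevance),
        key=lambda name: score[name],
        reverse=True,
    )
    kept = []
    for i, name in enumerate(ranked):
        # Same rule as the upper-triangle check: compare against every
        # feature ranked above this one.
        redundant = any(
            abs(pearson(features[name], features[prev])) > redundancy
            for prev in ranked[:i]
        )
        if not redundant:
            kept.append(name)
    return kept

# Toy data (hypothetical): "b" nearly duplicates "a"; "c" is unrelated to the target.
target = [0, 0, 0, 0, 1, 1, 1, 1]
toy = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [1, 2, 3, 5, 4, 6, 7, 8],
    "c": [1, 2, 2, 1, 1, 2, 2, 1],
}
kept = correlation_filter(toy, target)
print(kept)
```

Here `c` fails the relevance threshold, and `b` passes it but is dropped because its correlation with the higher-ranked `a` exceeds 0.95.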

### 2.4 Feature Importance-based Selection (Alternative to ContrastFS)

In [12]:
# Feature Importance-based Selection using ensemble of models
print("=" * 70)
print("ENSEMBLE FEATURE IMPORTANCE SELECTION")
print("=" * 70)

start_time = time.time()

# Clean feature names to remove special characters for LightGBM compatibility
import re
def clean_feature_name(name):
    # Replace special characters with underscores
    return re.sub(r'[^a-zA-Z0-9_]', '_', str(name))

# Create cleaned column names mapping
clean_cols = {col: clean_feature_name(col) for col in X_train.columns}
X_train_clean = X_train.rename(columns=clean_cols)

# One row per feature, keyed by the original (uncleaned) names;
# X_train columns are in feature_cols order, so the importances line up
importances = pd.DataFrame({'feature': feature_cols})

# LightGBM importance (estimator settings assumed to match the RFE base estimator)
lgbm = LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
lgbm.fit(X_train_clean, y_train)
importances['lgbm'] = lgbm.feature_importances_

# XGBoost importance
xgb = XGBClassifier(n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='logloss', verbosity=0)
xgb.fit(X_train_clean, y_train)
importances['xgb'] = xgb.feature_importances_

# CatBoost importance
cat = CatBoostClassifier(n_estimators=100, random_state=42, verbose=0)
cat.fit(X_train_clean, y_train)
importances['catboost'] = cat.feature_importances_

# Normalize each model's importances by its own maximum, then average
for col in ['lgbm', 'xgb', 'catboost']:
    importances[col] = importances[col] / importances[col].max()

importances['avg_importance'] = importances[['lgbm', 'xgb', 'catboost']].mean(axis=1)
importances = importances.sort_values('avg_importance', ascending=False)

# Select features whose average normalized importance meets the threshold
importance_threshold = 0.1  # Average of the per-model normalized importances must be >= 0.1
ensemble_features = importances[importances['avg_importance'] >= importance_threshold]['feature'].tolist()

ensemble_time = time.time() - start_time

print(f"\nEnsemble selection completed in {ensemble_time:.2f} seconds")
print(f"Selected {len(ensemble_features)} features out of {X_train.shape[1]}")
print(f"\nTop 15 features by importance:")
print(importances.head(15)[['feature', 'avg_importance']].to_string(index=False))
print(f"\nSelected features:")
for i, f in enumerate(ensemble_features, 1):
    print(f"  {i}. {f}")

ENSEMBLE FEATURE IMPORTANCE SELECTION
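
The normalization step matters because the three libraries report importances on different scales (LightGBM's default, for example, is a raw split count). Dividing each model's scores by its own maximum puts them on a common 0 to 1 scale before averaging. The sketch below works through that arithmetic in plain Python. The function name and all importance values are hypothetical.

```python
def average_normalized_importance(raw):
    """raw maps model -> {feature: importance}; returns feature -> mean max-normalized score."""
    normalized = {
        model: {feat: value / max(scores.values()) for feat, value in scores.items()}
        for model, scores in raw.items()
    }
    features = next(iter(raw.values())).keys()
    return {
        feat: sum(normalized[model][feat] for model in raw) / len(raw)
        for feat in features
    }

# Hypothetical raw importances on three different scales.
raw = {
    "lgbm": {"x": 200, "y": 50, "z": 10},
    "xgb": {"x": 0.80, "y": 0.15, "z": 0.05},
    "catboost": {"x": 60.0, "y": 30.0, "z": 10.0},
}
avg = average_normalized_importance(raw)
selected = [feat for feat, value in avg.items() if value >= 0.1]
print(avg)
print(selected)
```

With these numbers `y` averages (0.25 + 0.1875 + 0.5) / 3 = 0.3125 and is kept, while `z` averages about 0.093 and falls below the 0.1 threshold.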

---
## Step 3: Compare Selected Features

In [None]:
# Summary of all feature selection methods
print("=" * 70)
print("FEATURE SELECTION SUMMARY")
print("=" * 70)

selection_summary = {
    'Method': ['All Features', 'Boruta', 'RFE', 'Correlation-based', 'Ensemble Importance'],
    'Num Features': [
        len(feature_cols),
        len(boruta_features),
        len(rfe_features),
        len(corr_features_final),
        len(ensemble_features)
    ],
    'Selection Time (s)': [
        0,
        round(boruta_time, 2),
        round(rfe_time, 2),
        round(corr_time, 2),
        round(ensemble_time, 2)
    ]
}

summary_df = pd.DataFrame(selection_summary)
print(summary_df.to_string(index=False))

# Find common features across all methods
common_features = set(boruta_features) & set(rfe_features) & set(corr_features_final) & set(ensemble_features)
print(f"\nCommon features across all methods: {len(common_features)}")
for f in common_features:
    print(f"  - {f}")

FEATURE SELECTION SUMMARY
             Method  Num Features  Selection Time (s)
       All Features            58                0.00
             Boruta            52             1674.75
                RFE            20               50.58
  Correlation-based            39                1.52
Ensemble Importance             8               11.43

Common features across all methods: 7
  - NoOfExternalRef
  - IsHTTPS
  - LetterRatioInURL
  - URLSimilarityIndex
  - URLCharProb
  - LineOfCode
  - SpacialCharRatioInURL


---
## Step 4: Model Training and Evaluation

In [None]:
# Function to train and evaluate a model
def train_and_evaluate(model, X_train, X_test, y_train, y_test, model_name):
    """
    Train a model and return evaluation metrics.
    
    Returns:
        dict: Dictionary containing all metrics
    """
    # Training
    start_time = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    # Prediction
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
    mcc = matthews_corrcoef(y_test, y_pred)
    
    return {
        'Model': model_name,
        'Accuracy': round(accuracy, 7),
        'Precision': round(precision, 7),
        'Recall': round(recall, 7),
        'F1-Score': round(f1, 7),
        'MCC': round(mcc, 7),
        'Training Time (s)': round(training_time, 4)
    }

print("Evaluation function defined successfully!")

Evaluation function defined successfully!
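
A single wall-clock measurement can vary noticeably between runs (the RFE selection time, for instance, is reported as 116.74 s in its own cell and 50.58 s in the summary table). One way to get steadier numbers is to repeat the fit and take the median. The sketch below is one possible helper, not part of the notebook. The name `median_fit_time` and the stand-in workload are assumptions.

```python
import statistics
import time

def median_fit_time(fit_fn, repeats=5):
    """Run fit_fn several times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fit_fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload; in the notebook this would be something like
# lambda: model.fit(X_train, y_train) on a freshly constructed model.
elapsed = median_fit_time(lambda: sum(i * i for i in range(10_000)), repeats=5)
print(f"median time: {elapsed:.6f}s")
```

The median is less sensitive than the mean to one slow run caused by background load or first-call overhead.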


In [None]:
# Prepare feature sets for evaluation
feature_sets = {
    'All Features': feature_cols,
    'Boruta': boruta_features,
    'RFE': rfe_features,
    'Correlation': corr_features_final,
    'Ensemble': ensemble_features
}

# Store all results
all_results = []

print("=" * 70)
print("TRAINING MODELS WITH DIFFERENT FEATURE SETS")
print("=" * 70)


In [None]:
# Function to clean feature names for LightGBM compatibility
import re
def clean_feature_names(df):
    clean_cols = {col: re.sub(r'[^a-zA-Z0-9_]', '_', str(col)) for col in df.columns}
    return df.rename(columns=clean_cols)

# Train and evaluate all models with all feature sets
for fs_name, features in feature_sets.items():
    print(f"\n{'='*70}")
    print(f"Feature Set: {fs_name} ({len(features)} features)")
    print("="*70)
    
    if len(features) == 0:
        print("No features selected, skipping...")
        continue
    
    # Prepare data with selected features
    X_train_fs = X_train[features]
    X_test_fs = X_test[features]
    
    # Clean feature names for LightGBM compatibility
    X_train_fs_clean = clean_feature_names(X_train_fs)
    X_test_fs_clean = clean_feature_names(X_test_fs)
    
    # LightGBM
    print("\nTraining LightGBM...")
    lgbm_model = LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)
    result = train_and_evaluate(lgbm_model, X_train_fs_clean, X_test_fs_clean, y_train, y_test, 'LightGBM')
    result['Feature Set'] = fs_name
    result['Num Features'] = len(features)
    all_results.append(result)
    print(f"  Accuracy: {result['Accuracy']}, F1: {result['F1-Score']}, Time: {result['Training Time (s)']}s")
    
    # XGBoost
    print("Training XGBoost...")
    xgb_model = XGBClassifier(n_estimators=200, random_state=42, use_label_encoder=False, 
                               eval_metric='logloss', verbosity=0)
    result = train_and_evaluate(xgb_model, X_train_fs, X_test_fs, y_train, y_test, 'XGBoost')
    result['Feature Set'] = fs_name
    result['Num Features'] = len(features)
    all_results.append(result)
    print(f"  Accuracy: {result['Accuracy']}, F1: {result['F1-Score']}, Time: {result['Training Time (s)']}s")
    
    # CatBoost
    print("Training CatBoost...")
    cat_model = CatBoostClassifier(n_estimators=200, random_state=42, verbose=0)
    result = train_and_evaluate(cat_model, X_train_fs, X_test_fs, y_train, y_test, 'CatBoost')
    result['Feature Set'] = fs_name
    result['Num Features'] = len(features)
    all_results.append(result)
    print(f"  Accuracy: {result['Accuracy']}, F1: {result['F1-Score']}, Time: {result['Training Time (s)']}s")

print("\n" + "="*70)
print("ALL TRAINING COMPLETED!")
print("="*70)


Feature Set: All Features (58 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 2.7544s
Training XGBoost...
  Accuracy: 1.0, F1: 1.0, Time: 1.9234s
Training CatBoost...
  Accuracy: 1.0, F1: 1.0, Time: 7.0513s

Feature Set: Boruta (52 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 2.4067s
Training XGBoost...
  Accuracy: 1.0, F1: 1.0, Time: 1.9527s
Training CatBoost...
  Accuracy: 1.0, F1: 1.0, Time: 8.6595s

Feature Set: RFE (20 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 1.9545s
Training XGBoost...
  Accuracy: 1.0, F1: 1.0, Time: 0.8153s
Training CatBoost...
  Accuracy: 1.0, F1: 1.0, Time: 5.4736s

Feature Set: Correlation (39 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 2.203s
Training XGBoost...
  Accuracy: 1.0, F1: 1.0, Time: 1.4441s
Training CatBoost...
  Accuracy: 1.0, F1: 1.0, Time: 6.1428s

Feature Set: Ensemble (8 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 1.5252s
Training XGBoost...
  A

---
## Step 5: Results Comparison

In [None]:
# Create results dataframe
results_df = pd.DataFrame(all_results)

# Reorder columns
column_order = ['Feature Set', 'Num Features', 'Model', 'Accuracy', 'Precision', 
                'Recall', 'F1-Score', 'MCC', 'Training Time (s)']
results_df = results_df[column_order]

print("=" * 100)
print("COMPLETE RESULTS TABLE")
print("=" * 100)
print(results_df.to_string(index=False))

COMPLETE RESULTS TABLE
 Feature Set  Num Features    Model  Accuracy  Precision  Recall  F1-Score  MCC  Training Time (s)
All Features            58 LightGBM       1.0        1.0     1.0       1.0  1.0             2.7544
All Features            58  XGBoost       1.0        1.0     1.0       1.0  1.0             1.9234
All Features            58 CatBoost       1.0        1.0     1.0       1.0  1.0             7.0513
      Boruta            52 LightGBM       1.0        1.0     1.0       1.0  1.0             2.4067
      Boruta            52  XGBoost       1.0        1.0     1.0       1.0  1.0             1.9527
      Boruta            52 CatBoost       1.0        1.0     1.0       1.0  1.0             8.6595
         RFE            20 LightGBM       1.0        1.0     1.0       1.0  1.0             1.9545
         RFE            20  XGBoost       1.0        1.0     1.0       1.0  1.0             0.8153
         RFE            20 CatBoost       1.0        1.0     1.0       1.0  1.0       

In [None]:
# Pivot table for better visualization - by Model
print("=" * 100)
print("ACCURACY COMPARISON BY MODEL AND FEATURE SET")
print("=" * 100)

accuracy_pivot = results_df.pivot(index='Feature Set', columns='Model', values='Accuracy')
print(accuracy_pivot.to_string())

print("\n" + "=" * 100)
print("F1-SCORE COMPARISON BY MODEL AND FEATURE SET")
print("=" * 100)

f1_pivot = results_df.pivot(index='Feature Set', columns='Model', values='F1-Score')
print(f1_pivot.to_string())

print("\n" + "=" * 100)
print("TRAINING TIME COMPARISON BY MODEL AND FEATURE SET")
print("=" * 100)

time_pivot = results_df.pivot(index='Feature Set', columns='Model', values='Training Time (s)')
print(time_pivot.to_string())

ACCURACY COMPARISON BY MODEL AND FEATURE SET
Model         CatBoost  LightGBM  XGBoost
Feature Set                              
All Features       1.0       1.0      1.0
Boruta             1.0       1.0      1.0
Correlation        1.0       1.0      1.0
Ensemble           1.0       1.0      1.0
RFE                1.0       1.0      1.0

F1-SCORE COMPARISON BY MODEL AND FEATURE SET
Model         CatBoost  LightGBM  XGBoost
Feature Set                              
All Features       1.0       1.0      1.0
Boruta             1.0       1.0      1.0
Correlation        1.0       1.0      1.0
Ensemble           1.0       1.0      1.0
RFE                1.0       1.0      1.0

TRAINING TIME COMPARISON BY MODEL AND FEATURE SET
Model         CatBoost  LightGBM  XGBoost
Feature Set                              
All Features    7.0513    2.7544   1.9234
Boruta          8.6595    2.4067   1.9527
Correlation     6.1428    2.2030   1.4441
Ensemble        4.9935    1.5252   0.8117
RFE             5.

In [None]:
# Calculate performance change (selected features vs all features)
print("=" * 100)
print("PERFORMANCE COMPARISON: SELECTED FEATURES vs ALL FEATURES")
print("=" * 100)

comparison_results = []

for model in ['LightGBM', 'XGBoost', 'CatBoost']:
    all_features_row = results_df[(results_df['Feature Set'] == 'All Features') & 
                                   (results_df['Model'] == model)].iloc[0]
    
    for fs in ['Boruta', 'RFE', 'Correlation', 'Ensemble']:
        fs_row = results_df[(results_df['Feature Set'] == fs) & 
                            (results_df['Model'] == model)]
        if len(fs_row) == 0:
            continue
        fs_row = fs_row.iloc[0]
        
        # Calculate changes
        accuracy_change = fs_row['Accuracy'] - all_features_row['Accuracy']
        f1_change = fs_row['F1-Score'] - all_features_row['F1-Score']
        time_reduction = all_features_row['Training Time (s)'] - fs_row['Training Time (s)']
        time_reduction_pct = (time_reduction / all_features_row['Training Time (s)']) * 100 if all_features_row['Training Time (s)'] > 0 else 0
        feature_reduction = all_features_row['Num Features'] - fs_row['Num Features']
        feature_reduction_pct = (feature_reduction / all_features_row['Num Features']) * 100
        
        comparison_results.append({
            'Model': model,
            'Feature Set': fs,
            'Num Features': fs_row['Num Features'],
            'Feature Reduction': f"{feature_reduction_pct:.1f}%",
            'Accuracy Change': f"{accuracy_change:+.4f}",
            'F1 Change': f"{f1_change:+.4f}",
            'Time Reduction': f"{time_reduction_pct:.1f}%"
        })

comparison_df = pd.DataFrame(comparison_results)
print(comparison_df.to_string(index=False))

PERFORMANCE COMPARISON: SELECTED FEATURES vs ALL FEATURES
   Model Feature Set  Num Features Feature Reduction Accuracy Change F1 Change Time Reduction
LightGBM      Boruta            52             10.3%         +0.0000   +0.0000          12.6%
LightGBM         RFE            20             65.5%         +0.0000   +0.0000          29.0%
LightGBM Correlation            39             32.8%         +0.0000   +0.0000          20.0%
LightGBM    Ensemble             8             86.2%         +0.0000   +0.0000          44.6%
 XGBoost      Boruta            52             10.3%         +0.0000   +0.0000          -1.5%
 XGBoost         RFE            20             65.5%         +0.0000   +0.0000          57.6%
 XGBoost Correlation            39             32.8%         +0.0000   +0.0000          24.9%
 XGBoost    Ensemble             8             86.2%         +0.0000   +0.0000          57.8%
CatBoost      Boruta            52             10.3%         +0.0000   +0.0000         -22.8%
Ca
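
The percentages in this table follow from one formula, (baseline - value) / baseline × 100, applied to both training time and feature count. As a quick check, the sketch below recomputes three entries from the timings reported in the results table. The helper name `pct_reduction` is hypothetical.

```python
def pct_reduction(baseline, value):
    """Percentage reduction relative to the baseline (negative means an increase)."""
    return (baseline - value) / baseline * 100 if baseline > 0 else 0.0

# Timings and feature counts taken from the results table above.
lightgbm_rfe_time = pct_reduction(2.7544, 1.9545)    # LightGBM: All Features vs RFE
xgboost_boruta_time = pct_reduction(1.9234, 1.9527)  # XGBoost: All Features vs Boruta
rfe_feature_count = pct_reduction(58, 20)            # 58 features reduced to 20

print(f"{lightgbm_rfe_time:.1f}%")
print(f"{xgboost_boruta_time:.1f}%")
print(f"{rfe_feature_count:.1f}%")
```

A negative value, as for XGBoost with Boruta, means the run with fewer features was slower. With differences this small, that is within the noise of a single timing run.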

In [None]:
# Find the best configuration
print("=" * 100)
print("BEST CONFIGURATIONS")
print("=" * 100)

# Best by Accuracy
best_accuracy = results_df.loc[results_df['Accuracy'].idxmax()]
print(f"\nBest Accuracy: {best_accuracy['Accuracy']}")
print(f"  Model: {best_accuracy['Model']}")
print(f"  Feature Set: {best_accuracy['Feature Set']} ({best_accuracy['Num Features']} features)")

# Best by F1-Score
best_f1 = results_df.loc[results_df['F1-Score'].idxmax()]
print(f"\nBest F1-Score: {best_f1['F1-Score']}")
print(f"  Model: {best_f1['Model']}")
print(f"  Feature Set: {best_f1['Feature Set']} ({best_f1['Num Features']} features)")

# Best by MCC
best_mcc = results_df.loc[results_df['MCC'].idxmax()]
print(f"\nBest MCC: {best_mcc['MCC']}")
print(f"  Model: {best_mcc['Model']}")
print(f"  Feature Set: {best_mcc['Feature Set']} ({best_mcc['Num Features']} features)")

# Best efficiency (high accuracy with low training time)
# Create efficiency score: accuracy / training_time
results_df['Efficiency'] = results_df['Accuracy'] / (results_df['Training Time (s)'] + 0.001)
best_efficiency = results_df.loc[results_df['Efficiency'].idxmax()]
print(f"\nBest Efficiency (Accuracy/Time):")
print(f"  Model: {best_efficiency['Model']}")
print(f"  Feature Set: {best_efficiency['Feature Set']} ({best_efficiency['Num Features']} features)")
print(f"  Accuracy: {best_efficiency['Accuracy']}, Time: {best_efficiency['Training Time (s)']}s")

BEST CONFIGURATIONS

Best Accuracy: 1.0
  Model: LightGBM
  Feature Set: All Features (58 features)

Best F1-Score: 1.0
  Model: LightGBM
  Feature Set: All Features (58 features)

Best MCC: 1.0
  Model: LightGBM
  Feature Set: All Features (58 features)

Best Efficiency (Accuracy/Time):
  Model: XGBoost
  Feature Set: Ensemble (8 features)
  Accuracy: 1.0, Time: 0.8117s


---
## Step 6: Final Summary and Recommendations

In [None]:
# Final Summary
print("=" * 100)
print("FINAL SUMMARY AND RECOMMENDATIONS")
print("=" * 100)

print("""
FEATURE SELECTION ANALYSIS:
""")

# Print feature selection summary
print(f"1. Original Features: {len(feature_cols)}")
print(f"2. Boruta Selected: {len(boruta_features)} ({(len(boruta_features)/len(feature_cols)*100):.1f}% of original)")
print(f"3. RFE Selected: {len(rfe_features)} ({(len(rfe_features)/len(feature_cols)*100):.1f}% of original)")
print(f"4. Correlation Selected: {len(corr_features_final)} ({(len(corr_features_final)/len(feature_cols)*100):.1f}% of original)")
print(f"5. Ensemble Selected: {len(ensemble_features)} ({(len(ensemble_features)/len(feature_cols)*100):.1f}% of original)")

print("""
KEY FINDINGS:
""")

# Calculate average metrics for all features vs selected features
all_features_avg = results_df[results_df['Feature Set'] == 'All Features'][['Accuracy', 'F1-Score', 'Training Time (s)']].mean()
selected_avg = results_df[results_df['Feature Set'] != 'All Features'][['Accuracy', 'F1-Score', 'Training Time (s)']].mean()

print(f"Average with ALL Features:")
print(f"  - Accuracy: {all_features_avg['Accuracy']:.4f}")
print(f"  - F1-Score: {all_features_avg['F1-Score']:.4f}")
print(f"  - Training Time: {all_features_avg['Training Time (s)']:.4f}s")

print(f"\nAverage with SELECTED Features:")
print(f"  - Accuracy: {selected_avg['Accuracy']:.4f}")
print(f"  - F1-Score: {selected_avg['F1-Score']:.4f}")
print(f"  - Training Time: {selected_avg['Training Time (s)']:.4f}s")

accuracy_diff = selected_avg['Accuracy'] - all_features_avg['Accuracy']
time_diff = all_features_avg['Training Time (s)'] - selected_avg['Training Time (s)']
time_diff_pct = (time_diff / all_features_avg['Training Time (s)']) * 100 if all_features_avg['Training Time (s)'] > 0 else 0

print(f"\nDIFFERENCE:")
print(f"  - Accuracy: {accuracy_diff:+.4f}")
print(f"  - Training Time Reduction: {time_diff_pct:.1f}%")

FINAL SUMMARY AND RECOMMENDATIONS

FEATURE SELECTION ANALYSIS:

1. Original Features: 58
2. Boruta Selected: 52 (89.7% of original)
3. RFE Selected: 20 (34.5% of original)
4. Correlation Selected: 39 (67.2% of original)
5. Ensemble Selected: 8 (13.8% of original)

KEY FINDINGS:

Average with ALL Features:
  - Accuracy: 1.0000
  - F1-Score: 1.0000
  - Training Time: 3.9097s

Average with SELECTED Features:
  - Accuracy: 1.0000
  - F1-Score: 1.0000
  - Training Time: 3.1985s

DIFFERENCE:
  - Accuracy: +0.0000
  - Training Time Reduction: 18.2%


In [None]:
# Save results to CSV
results_df.to_csv('feature_selection_results.csv', index=False)
print("Results saved to: feature_selection_results.csv")

# Display final table
print("\n" + "=" * 100)
print("FINAL RESULTS TABLE")
print("=" * 100)
print(results_df.drop(columns=['Efficiency']).to_string(index=False))

Results saved to: feature_selection_results.csv

FINAL RESULTS TABLE
 Feature Set  Num Features    Model  Accuracy  Precision  Recall  F1-Score  MCC  Training Time (s)
All Features            58 LightGBM       1.0        1.0     1.0       1.0  1.0             2.7544
All Features            58  XGBoost       1.0        1.0     1.0       1.0  1.0             1.9234
All Features            58 CatBoost       1.0        1.0     1.0       1.0  1.0             7.0513
      Boruta            52 LightGBM       1.0        1.0     1.0       1.0  1.0             2.4067
      Boruta            52  XGBoost       1.0        1.0     1.0       1.0  1.0             1.9527
      Boruta            52 CatBoost       1.0        1.0     1.0       1.0  1.0             8.6595
         RFE            20 LightGBM       1.0        1.0     1.0       1.0  1.0             1.9545
         RFE            20  XGBoost       1.0        1.0     1.0       1.0  1.0             0.8153
         RFE            20 CatBoost     