# Feature Selection & Model Comparison

## Objective:
1. Apply multiple feature selection algorithms to identify the most influential features
2. Compare model performance with all features vs selected features
3. Evaluate training time reduction

## Feature Selection Algorithms:
- **Boruta** - Wrapper method using Random Forest
- **RFE (Recursive Feature Elimination)** - Wrapper method that repeatedly drops the least important feature
- **Correlation-based Feature Selection** - Filter method
- **ContrastFS (Contrastive Feature Selection)** - Filter method based on class separability

## Models:
- LightGBM
- XGBClassifier
- CatBoost

## Evaluation Metrics:
- Accuracy
- Precision
- Recall
- F1-Score
- MCC (Matthews Correlation Coefficient)
- Training Time
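
As a quick reference for the metrics listed above, the sketch below computes them from the four confusion-matrix counts of a binary classifier. The counts are made-up illustration values, not results from this notebook, and `binary_metrics` is a hypothetical helper. Precision, recall and F1 are shown for the positive class only, whereas the evaluation later in the notebook uses weighted averages.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # MCC uses all four cells, so it stays informative when classes are imbalanced
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Illustrative counts only (hypothetical values)
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC ranges from -1 to 1 and is 0 for a classifier that does no better than chance, which makes it a useful complement to accuracy.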

In [1]:
# Install required packages if not available
!pip install boruta lightgbm xgboost catboost scikit-learn pandas numpy






In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import time
import warnings
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    f1_score, matthews_corrcoef, classification_report
)

# Models
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Boruta
from boruta import BorutaPy

print("All libraries imported successfully!")

All libraries imported successfully!


## Step 1: Load Dataset

In [3]:
# Load the dataset (63 columns: 57 numeric features, 5 text/ID columns, and the label)
df = pd.read_csv('new_dataset/PhiUSIIL_Phishing_URL_63_Features.csv')

print("=" * 70)
print("DATASET INFORMATION")
print("=" * 70)
print(f"Number of Rows: {len(df)}")
print(f"Number of Columns: {len(df.columns)}")
print(f"\nColumn Names:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i}. {col}")

df.head()

DATASET INFORMATION
Number of Rows: 235795
Number of Columns: 63

Column Names:
  1. FILENAME
  2. URL
  3. URLLength
  4. Domain
  5. DomainLength
  6. IsDomainIP
  7. TLD
  8. URLSimilarityIndex
  9. CharContinuationRate
  10. TLDLegitimateProb
  11. URLCharProb
  12. TLDLength
  13. NoOfSubDomain
  14. HasObfuscation
  15. NoOfObfuscatedChar
  16. ObfuscationRatio
  17. NoOfLettersInURL
  18. LetterRatioInURL
  19. NoOfDegitsInURL
  20. DegitRatioInURL
  21. NoOfEqualsInURL
  22. NoOfQMarkInURL
  23. NoOfAmpersandInURL
  24. NoOfOtherSpecialCharsInURL
  25. SpacialCharRatioInURL
  26. IsHTTPS
  27. LineOfCode
  28. LargestLineLength
  29. HasTitle
  30. Title
  31. DomainTitleMatchScore
  32. URLTitleMatchScore
  33. HasFavicon
  34. Robots
  35. IsResponsive
  36. NoOfURLRedirect
  37. NoOfSelfRedirect
  38. HasDescription
  39. NoOfPopup
  40. NoOfiFrame
  41. HasExternalFormSubmit
  42. HasSocialNet
  43. HasSubmitButton
  44. HasHiddenFields
  45. HasPasswordField
  46. Bank
  4

Unnamed: 0,FILENAME,URL,URLLength,Domain,DomainLength,IsDomainIP,TLD,URLSimilarityIndex,CharContinuationRate,TLDLegitimateProb,...,NoOfEmptyRef,NoOfExternalRef,has_no_www,num_slashes,num_hyphens,URL_Profanity_Prob,URL_NumberOf_Profanity,URLContent_Profanity_Prob,URLContent_NumberOf_Profanity,label
0,521848.txt,https://www.southbankmosaics.com,32,www.southbankmosaics.com,24,0,com,100.0,1.0,0.522907,...,0,124,0,2,0,0.012189,1,0.01188,1,1
1,31372.txt,https://www.uni-mainz.de,24,www.uni-mainz.de,16,0,de,100.0,0.666667,0.03265,...,0,217,0,2,1,0.027988,0,0.019723,0,1
2,597387.txt,https://www.voicefmradio.co.uk,30,www.voicefmradio.co.uk,22,0,uk,100.0,0.866667,0.028555,...,2,5,0,2,0,0.015063,0,0.000294,1,1
3,554095.txt,https://www.sfnmjournal.com,27,www.sfnmjournal.com,19,0,com,100.0,1.0,0.522907,...,1,31,0,2,0,0.012189,0,0.0,0,1
4,151578.txt,https://www.rewildingargentina.org,34,www.rewildingargentina.org,26,0,org,100.0,1.0,0.079963,...,1,85,0,2,0,0.005476,0,0.002091,48,1


In [4]:
# Prepare features and target
# Exclude identifier and free-text columns that are not numeric features
exclude_cols = ['URL', 'FILENAME', 'Domain', 'TLD', 'Title']

# Find the target column among the common naming variants
target_candidates = ['label', 'Label', 'CLASS_LABEL']
target_col = next((c for c in target_candidates if c in df.columns), None)
if target_col is None:
    # Fail early instead of continuing with a column that does not exist
    raise KeyError(
        "Target column not found! Please specify manually. "
        f"Available columns: {df.columns.tolist()}"
    )

print(f"Target column: {target_col}")
print(f"Target distribution:\n{df[target_col].value_counts()}")

Target column: label
Target distribution:
label
1    134850
0    100945
Name: count, dtype: int64


In [5]:
# Prepare feature matrix X and target vector y
# Get only numeric columns for features
feature_cols = [col for col in df.columns 
                if col not in exclude_cols + [target_col] 
                and df[col].dtype in ['int64', 'float64', 'int32', 'float32']]

print(f"Number of feature columns: {len(feature_cols)}")
print(f"\nFeature columns:")
for i, col in enumerate(feature_cols, 1):
    print(f"  {i}. {col}")

X = df[feature_cols].copy()
y = df[target_col].copy()

# Encode target if needed
if y.dtype == 'object':
    le = LabelEncoder()
    y = le.fit_transform(y)
    print(f"\nTarget encoded: {le.classes_}")

# Handle missing values
X = X.fillna(0)

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")

Number of feature columns: 57

Feature columns:
  1. URLLength
  2. DomainLength
  3. IsDomainIP
  4. URLSimilarityIndex
  5. CharContinuationRate
  6. TLDLegitimateProb
  7. URLCharProb
  8. TLDLength
  9. NoOfSubDomain
  10. HasObfuscation
  11. NoOfObfuscatedChar
  12. ObfuscationRatio
  13. NoOfLettersInURL
  14. LetterRatioInURL
  15. NoOfDegitsInURL
  16. DegitRatioInURL
  17. NoOfEqualsInURL
  18. NoOfQMarkInURL
  19. NoOfAmpersandInURL
  20. NoOfOtherSpecialCharsInURL
  21. SpacialCharRatioInURL
  22. IsHTTPS
  23. LineOfCode
  24. LargestLineLength
  25. HasTitle
  26. DomainTitleMatchScore
  27. URLTitleMatchScore
  28. HasFavicon
  29. Robots
  30. IsResponsive
  31. NoOfURLRedirect
  32. NoOfSelfRedirect
  33. HasDescription
  34. NoOfPopup
  35. NoOfiFrame
  36. HasExternalFormSubmit
  37. HasSocialNet
  38. HasSubmitButton
  39. HasHiddenFields
  40. HasPasswordField
  41. Bank
  42. Pay
  43. Crypto
  44. HasCopyrightInfo
  45. NoOfImage
  46. NoOfCSS
  47. NoOfJS
  48. 

In [6]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("=" * 70)
print("DATA SPLIT")
print("=" * 70)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Number of features: {X_train.shape[1]}")

DATA SPLIT
Training set: 188636 samples
Test set: 47159 samples
Number of features: 57


---
## Step 2: Feature Selection Algorithms

### 2.1 Boruta Feature Selection

**How Boruta Works:**
- Boruta is a wrapper algorithm that uses Random Forest to determine feature importance
- It creates "shadow features": copies of the original features whose values are shuffled, which breaks any relationship with the target
- In each iteration it compares every original feature's importance against the maximum importance among the shadow features
- Features that beat the best shadow feature significantly more often than chance are confirmed as important
- Features that beat it significantly less often are rejected; those still undecided when `max_iter` is reached remain tentative

**Why Features Are Selected:**
- A feature is selected only if it carries more signal about the class than a permuted (pure-noise) copy of the data
- The decision rests on a statistical test over many iterations, which makes the selection less sensitive to a single Random Forest fit
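
One round of the shadow-feature comparison described above can be sketched as follows. This is a simplified illustration on synthetic data, not BorutaPy's implementation: the real algorithm repeats the round many times and applies a statistical test to the accumulated hit counts. The data, the variable names and the forest settings are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): column 0 is informative, columns 1-2 are noise
signal = rng.normal(size=n)
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([signal, rng.normal(size=n), rng.normal(size=n)])

# Shadow features: every column permuted independently, destroying its link to y
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_aug = np.hstack([X, X_shadow])

# Fit one forest on real + shadow columns and split the importances back apart
rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
rf.fit(X_aug, y)
real_imp = rf.feature_importances_[: X.shape[1]]
shadow_imp = rf.feature_importances_[X.shape[1]:]

# A "hit" means the real feature beat the best shadow feature in this round
hits = real_imp > shadow_imp.max()
print("importances:", np.round(real_imp, 3))
print("best shadow:", round(float(shadow_imp.max()), 3))
print("hits:", hits)
```

A noise column can score an occasional hit by luck in a single round, which is why Boruta bases its decision on many rounds rather than one.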

In [7]:
# Boruta Feature Selection
print("=" * 70)
print("BORUTA FEATURE SELECTION")
print("=" * 70)

# Initialize Random Forest for Boruta
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)

# Initialize Boruta
boruta_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=42, max_iter=100)

# Fit Boruta
start_time = time.time()
boruta_selector.fit(X_train.values, y_train)
boruta_time = time.time() - start_time

# Get selected features
boruta_features = X_train.columns[boruta_selector.support_].tolist()

# Get feature rankings from Boruta
# ranking_: 1 = confirmed, 2 = tentative, 3 or more = rejected
boruta_ranking = pd.DataFrame({
    'feature': X_train.columns,
    'rank': boruta_selector.ranking_,
    'selected': boruta_selector.support_
})

# The fitted Boruta selector is used here only for the selection itself.
# Importance scores for display come from a separate Random Forest fit on all features.
rf_for_importance = RandomForestClassifier(n_jobs=-1, class_weight='balanced', 
                                            max_depth=5, random_state=42, n_estimators=100)
rf_for_importance.fit(X_train, y_train)

# Create importance scores
boruta_importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_for_importance.feature_importances_,
    'boruta_rank': boruta_selector.ranking_,
    'selected': boruta_selector.support_
})

# Sort by importance
boruta_importance_df = boruta_importance_df.sort_values('importance', ascending=False)

print(f"\nBoruta completed in {boruta_time:.2f} seconds")
print(f"Selected {len(boruta_features)} features out of {X_train.shape[1]}")

# Display Top 10 most important features
print(f"\nTop 10 most important features (Boruta):")
print(boruta_importance_df.head(10)[['feature', 'importance']].to_string(index=False))

print(f"\n" + "="*70)
print("BORUTA ALGORITHM EXPLANATION")
print("="*70)
print("""
Why these features are selected:

1. URLSimilarityIndex: Measures how similar a URL is to legitimate URLs.
   - High importance because phishing URLs often have low similarity to real sites.

2. HasSocialNet/HasCopyrightInfo: Presence of social media links/copyright info.
   - Legitimate sites usually have these; phishing sites often don't.

3. IsHTTPS: Whether the site uses secure connection.
   - Phishing sites often lack SSL certificates.

4. LineOfCode/LargestLineLength: Code structure metrics.
   - Phishing pages often have simpler or copied code patterns.

5. URL structure features (NoOfSubDomain, URLLength, etc.):
   - Phishing URLs tend to be longer, have more subdomains.

Boruta confirms features by comparing them against "shadow" (shuffled) features.
Only features that consistently beat random noise are selected.
""")

print(f"\nAll selected features:")
for i, f in enumerate(boruta_features, 1):
    print(f"  {i}. {f}")

BORUTA FEATURE SELECTION
Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	49
Tentative: 	8
Rejected: 	0
Iteration: 	9 / 100
Confirmed: 	49
Tentative: 	8
Rejected: 	0
Iteration: 	10 / 100
Confirmed: 	49
Tentative: 	8
Rejected: 	0
Iteration: 	11 / 100
Confirmed: 	49
Tentative: 	8
Rejected: 	0
Iteration: 	12 / 100
Confirmed: 	49
Tentative: 	8
Rejected: 	0
Iteration: 	13 / 100
Confirmed: 	49
Tentative: 	8
Rejected: 	0
Iteration: 	14 / 100
Confirmed: 	49
Tentative: 	8
Rejected: 	0
Iteration: 	15 / 100
Confirmed: 	49
Tentative: 	8
Rejected: 	0
Iteration: 	16 / 100
Confirmed: 	49
Ten

### 2.2 RFE (Recursive Feature Elimination)

**How RFE Works:**
- RFE starts with all features and recursively removes the least important ones
- At each iteration, it trains a model and ranks features by importance
- The least important feature(s) are removed
- This process continues until the desired number of features is reached

**Why Features Are Selected:**
- Features that survive to the end are consistently important across multiple model iterations
- The ranking reflects the order in which features were eliminated (later = more important)
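
The elimination loop described above can be written out by hand. The sketch below is a minimal illustration under stated assumptions: it uses ordinary least squares on a synthetic regression target and takes the absolute coefficient of standardized features as the importance, rather than the LightGBM classifier used in this notebook. `rfe_linear` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Manual RFE: refit least squares and drop the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    eliminated = []  # feature indices in order of removal (earliest = least important)
    while len(remaining) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        worst = int(np.argmin(np.abs(coef)))
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize so coefficients are comparable

# Synthetic target (assumption): only features 0 and 3 matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=400)
y = y - y.mean()

kept, eliminated = rfe_linear(X, y, n_keep=2)
print("kept:", kept)
print("elimination order:", eliminated)
```

The model is refit after every removal, so the importance of the surviving features is re-estimated without the dropped ones. This is what distinguishes RFE from a single ranking pass.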

In [8]:
# RFE Feature Selection
print("=" * 70)
print("RFE (RECURSIVE FEATURE ELIMINATION)")
print("=" * 70)

# Use LightGBM as base estimator for RFE
lgbm_estimator = LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)

# Select top 20 features (adjust as needed)
n_features_to_select = 20
rfe_selector = RFE(estimator=lgbm_estimator, n_features_to_select=n_features_to_select, step=1)

start_time = time.time()
rfe_selector.fit(X_train, y_train)
rfe_time = time.time() - start_time

# Get selected features
rfe_features = X_train.columns[rfe_selector.support_].tolist()

# Get feature rankings (lower = better, 1 = selected)
rfe_ranking_df = pd.DataFrame({
    'feature': X_train.columns,
    'rfe_rank': rfe_selector.ranking_,
    'selected': rfe_selector.support_
})

# RFE's fitted estimator only holds importances for the surviving features,
# so train a separate LightGBM on all features to get importance scores for display
lgbm_for_importance = LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
lgbm_for_importance.fit(X_train, y_train)

rfe_importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': lgbm_for_importance.feature_importances_,
    'rfe_rank': rfe_selector.ranking_,
    'selected': rfe_selector.support_
})

# Sort by importance
rfe_importance_df = rfe_importance_df.sort_values('importance', ascending=False)

print(f"\nRFE completed in {rfe_time:.2f} seconds")
print(f"Selected {len(rfe_features)} features out of {X_train.shape[1]}")

# Display Top 10 most important features
print(f"\nTop 10 most important features (RFE with LightGBM):")
print(rfe_importance_df.head(10)[['feature', 'importance']].to_string(index=False))

print(f"\n" + "="*70)
print("RFE ALGORITHM EXPLANATION")
print("="*70)
print("""
Why these features are selected:

RFE uses LightGBM's built-in feature importance (split counts by default) to rank features.
At each step the model is refit and the feature with the lowest importance is eliminated.

1. LineOfCode/LargestLineLength: Highest importance - page complexity metrics have high predictive power.
   - Phishing pages often have minimal/copied code.

2. NoOfExternalRef: Number of external references indicates site legitimacy.
   - Legitimate sites reference many external resources.

3. URLCharProb: Probability distribution of characters in URL.
   - Phishing URLs often contain unusual character patterns.

4. IsHTTPS: Security indicator that can distinguish many phishing attempts.

5. URLSimilarityIndex: Used in fewer splits here, but still strongly separates phishing/legitimate.

RFE tends to be robust because features that stay important throughout recursive
elimination are unlikely to be noise.
""")

print(f"\nSelected features:")
for i, f in enumerate(rfe_features, 1):
    print(f"  {i}. {f}")

RFE (RECURSIVE FEATURE ELIMINATION)

RFE completed in 131.33 seconds
Selected 20 features out of 57

Top 10 most important features (RFE with LightGBM):
              feature  importance
           LineOfCode         499
    LargestLineLength         471
      NoOfExternalRef         277
          URLCharProb         269
     LetterRatioInURL         242
SpacialCharRatioInURL         180
              IsHTTPS         161
            URLLength         128
               NoOfJS         104
   URLSimilarityIndex         102

RFE ALGORITHM EXPLANATION

Why these features are selected:

RFE uses LightGBM's built-in feature importance (split counts by default) to rank features.
At each step the model is refit and the feature with the lowest importance is eliminated.

1. LineOfCode/LargestLineLength: Highest importance - page complexity metrics have high predictive power.
   - Phishing pages often have minimal/copied code.

2. NoOfExternalRef:

### 2.3 Correlation-based Feature Selection

**How Correlation-based Selection Works:**
- Calculates the absolute Pearson correlation coefficient between each feature and the target
- Features whose absolute correlation with the target reaches a threshold (0.1 in this notebook) are kept
- Among those, a feature is removed if it is highly correlated (above 0.95 here) with a feature ranked higher by target correlation, which limits multicollinearity

**Why Features Are Selected:**
- High absolute correlation with the target = strong linear relationship with the outcome
- Removing redundant features reduces computational cost without discarding much information
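
The two stages (relevance filter, then redundancy pruning) can be sketched on synthetic data as below. This is an illustrative variant, not the notebook's exact code: it prunes greedily against the features already kept, and the column names `a`, `b`, `c` and the thresholds are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data (assumption): a is relevant, b nearly duplicates a, c is irrelevant
y = rng.integers(0, 2, size=n)
a = y + 0.5 * rng.normal(size=n)
b = a + 0.01 * rng.normal(size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])
names = ["a", "b", "c"]

# Stage 1: keep features whose |correlation with the target| reaches the threshold
relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
order = np.argsort(-relevance)                      # most relevant first
candidates = [j for j in order if relevance[j] >= 0.1]

# Stage 2: walk down the ranking and skip features that duplicate a kept one
selected = []
for j in candidates:
    if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= 0.95 for k in selected):
        selected.append(j)

print("relevance:", dict(zip(names, np.round(relevance, 3))))
print("selected:", [names[j] for j in selected])
```

Processing candidates in order of relevance means that, of two near-duplicate features, the one more strongly correlated with the target is the one that survives.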

In [9]:
# Correlation-based Feature Selection
print("=" * 70)
print("CORRELATION-BASED FEATURE SELECTION")
print("=" * 70)

start_time = time.time()

# Calculate absolute correlation of each feature with the target.
# The target is re-indexed to match X_train so values align row by row,
# even if y_train is a NumPy array (e.g. after label encoding).
y_train_aligned = pd.Series(np.asarray(y_train), index=X_train.index)
correlations = pd.DataFrame()
correlations['feature'] = feature_cols
correlations['correlation'] = [abs(X_train[col].corr(y_train_aligned)) for col in feature_cols]
correlations = correlations.sort_values('correlation', ascending=False)

# Select features whose absolute correlation with the target meets the threshold
correlation_threshold = 0.1
corr_features = correlations[correlations['correlation'] >= correlation_threshold]['feature'].tolist()

# Among the kept features (ordered by target correlation), drop any feature that is
# highly correlated with an earlier, more relevant one
corr_matrix = X_train[corr_features].corr().abs()
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.95)]
corr_features_final = [f for f in corr_features if f not in to_drop]

corr_time = time.time() - start_time

print(f"\nCorrelation-based selection completed in {corr_time:.2f} seconds")
print(f"Selected {len(corr_features_final)} features out of {X_train.shape[1]}")

# Display Top 10 correlated features
print(f"\nTop 10 correlated features with target:")
print(correlations.head(10).to_string(index=False))

print(f"\n" + "="*70)
print("CORRELATION-BASED ALGORITHM EXPLANATION")
print("="*70)
print("""
Why these features are selected:

1. URLSimilarityIndex (0.86): Highest correlation - legitimate URLs share patterns.
   - Phishing URLs typically deviate from these patterns significantly.

2. HasSocialNet (0.78): Social media presence is a legitimacy indicator.
   - Most legitimate sites have social media integration.

3. HasCopyrightInfo (0.74): Copyright notices indicate professional websites.
   - Phishing sites rarely include proper legal information.

4. HasDescription (0.69): Meta descriptions indicate proper SEO practices.
   - Legitimate sites usually have proper metadata.

5. has_no_www (0.67): URL structure indicator.
   - Unusual URL patterns often indicate phishing.

6. IsHTTPS (0.61): Security certificate presence.
   - Legitimate sites increasingly use HTTPS; phishing sites less so.

Note: Correlation measures LINEAR relationship only. Some important features
might have non-linear relationships not captured by this method.
""")

print(f"\nSelected features:")
for i, f in enumerate(corr_features_final, 1):
    print(f"  {i}. {f}")

CORRELATION-BASED FEATURE SELECTION

Correlation-based selection completed in 1.65 seconds
Selected 39 features out of 57

Top 10 correlated features with target:
              feature  correlation
   URLSimilarityIndex     0.860443
         HasSocialNet     0.783682
     HasCopyrightInfo     0.742820
       HasDescription     0.690587
           has_no_www     0.668396
              IsHTTPS     0.612900
DomainTitleMatchScore     0.583463
      HasSubmitButton     0.578994
         IsResponsive     0.548483
   URLTitleMatchScore     0.538363

CORRELATION-BASED ALGORITHM EXPLANATION

Why these features are selected:

1. URLSimilarityIndex (0.86): Highest correlation - legitimate URLs share patterns.
   - Phishing URLs typically deviate from these patterns significantly.

2. HasSocialNet (0.78): Social media presence is a legitimacy indicator.
   - Most legitimate sites have social media integration.

3. HasCopyrightInfo (0.74): Copyright notices indicate professional websites.
   - Phis

### 2.4 ContrastFS (Contrastive Feature Selection)

**How ContrastFS Works:**
- Measures each feature's ability to separate different classes, one feature at a time
- Calculates **intra-class variance** (samples within the same class should be similar)
- Calculates **inter-class distance** as the squared distance between class means (samples from different classes should be different)
- Computes **Contrast Score** = inter-class distance / intra-class variance
- Higher contrast score = better discriminative power

**Why Features Are Selected:**
- Selected features maximize separation between classes while keeping samples of the same class close together
- Based on contrastive learning principles: 'push apart' different classes, 'pull together' same class
- Model-agnostic: the score measures class separability directly rather than relying on a single model's importance, but it evaluates each feature in isolation and so cannot capture feature interactions
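
For a two-class problem the score described above reduces to the squared distance between the two class means divided by the pooled within-class variance. The sketch below computes it for one clearly separating and one overlapping synthetic feature. It also checks numerically that, with class proportions p and q, the score s relates to the squared Pearson correlation as r² = pqs / (1 + pqs), which may explain why this ranking and the correlation ranking agree so closely on a binary target. `contrast_score` is a simplified, hypothetical stand-in for the notebook's function (no sampling, no epsilon term).

```python
import numpy as np

def contrast_score(x, y):
    """Squared distance between the two class means over the pooled within-class variance."""
    x0, x1 = x[y == 0], x[y == 1]
    within = (x0.var() * len(x0) + x1.var() * len(x1)) / len(x)
    between = (x1.mean() - x0.mean()) ** 2
    return between / within

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
separating = y + 0.3 * rng.normal(size=2000)   # class means clearly apart
overlapping = rng.normal(size=2000)            # unrelated to the class

s_sep = contrast_score(separating, y)
s_ovl = contrast_score(overlapping, y)

# Relation to the Pearson correlation for a binary target
p = y.mean()
q = 1 - p
r = np.corrcoef(separating, y)[0, 1]
from_score = p * q * s_sep / (1 + p * q * s_sep)

print(f"separating: {s_sep:.3f}, overlapping: {s_ovl:.5f}")
print(f"r^2 = {r ** 2:.6f}, p*q*s / (1 + p*q*s) = {from_score:.6f}")
```

Because the ratio is unchanged when a feature is rescaled, standardizing the features beforehand does not alter the ranking.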

In [26]:
# ContrastFS (Contrastive Feature Selection)
print("=" * 70)
print("CONTRASTFS (CONTRASTIVE FEATURE SELECTION)")
print("=" * 70)

start_time = time.time()

def contrastfs_score(X, y, n_samples=5000):
    """
    Calculate ContrastFS scores for each feature.
    
    ContrastFS measures feature importance based on contrastive learning principles:
    - Good features should maximize inter-class distance (different classes far apart)
    - Good features should minimize intra-class distance (same class samples close)
    - Contrast Score = inter-class distance / intra-class distance
    
    Parameters:
    -----------
    X : DataFrame - Feature matrix
    y : array-like - Target labels
    n_samples : int - Number of samples to use (for efficiency)
    
    Returns:
    --------
    DataFrame with feature names and their contrast scores
    """
    # Sample data for efficiency (large datasets)
    y_arr = np.asarray(y)
    if len(X) > n_samples:
        # Local generator: same draw as np.random.seed(42), without touching global state
        rng = np.random.RandomState(42)
        indices = rng.choice(len(X), n_samples, replace=False)
        X_sample = X.iloc[indices].values
        y_sample = y_arr[indices]
    else:
        X_sample = X.values
        y_sample = y_arr
    
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_sample)
    
    # Get unique classes
    classes = np.unique(y_sample)
    n_features = X_scaled.shape[1]
    
    contrast_scores = []
    
    for feat_idx in range(n_features):
        feature_values = X_scaled[:, feat_idx]
        
        # Calculate intra-class variance (samples within same class)
        intra_class_var = 0
        for c in classes:
            class_mask = (y_sample == c)
            class_values = feature_values[class_mask]
            if len(class_values) > 1:
                intra_class_var += np.var(class_values) * len(class_values)
        intra_class_var /= len(y_sample)
        
        # Calculate inter-class distance (distance between class means)
        class_means = []
        class_sizes = []
        for c in classes:
            class_mask = (y_sample == c)
            class_values = feature_values[class_mask]
            class_means.append(np.mean(class_values))
            class_sizes.append(len(class_values))
        
        # Weighted inter-class distance
        inter_class_dist = 0
        total_pairs = 0
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                weight = class_sizes[i] * class_sizes[j]
                inter_class_dist += weight * (class_means[i] - class_means[j]) ** 2
                total_pairs += weight
        
        if total_pairs > 0:
            inter_class_dist /= total_pairs
        
        # Contrast Score = inter-class distance / (intra-class variance + epsilon)
        epsilon = 1e-10  # Avoid division by zero
        contrast_score = inter_class_dist / (intra_class_var + epsilon)
        contrast_scores.append(contrast_score)
    
    # Create results DataFrame
    results_df = pd.DataFrame({
        'feature': X.columns,
        'contrast_score': contrast_scores
    })
    results_df = results_df.sort_values('contrast_score', ascending=False)
    
    return results_df

# Apply ContrastFS
contrastfs_df = contrastfs_score(X_train, y_train, n_samples=10000)

# Normalize scores to [0, 1]
contrastfs_df['normalized_score'] = contrastfs_df['contrast_score'] / contrastfs_df['contrast_score'].max()

# Select features with contrast score above threshold (top percentile or fixed threshold)
# Method 1: Top N features
n_top_features = 20
contrastfs_features = contrastfs_df.head(n_top_features)['feature'].tolist()

# Method 2: Threshold-based (features with score > 10% of max)
# threshold = 0.1
# contrastfs_features = contrastfs_df[contrastfs_df['normalized_score'] >= threshold]['feature'].tolist()

contrastfs_time = time.time() - start_time

print(f"\nContrastFS completed in {contrastfs_time:.2f} seconds")
print(f"Selected {len(contrastfs_features)} features out of {X_train.shape[1]}")

# Display Top 10 features by contrast score
print(f"\nTop 10 features by ContrastFS score:")
print(contrastfs_df.head(10)[['feature', 'contrast_score', 'normalized_score']].to_string(index=False))

print(f"\n" + "="*70)
print("CONTRASTFS ALGORITHM EXPLANATION")
print("="*70)
print("""
Why these features are selected:

ContrastFS uses contrastive learning principles to evaluate features:

1. HIGH CONTRAST SCORE features can effectively separate classes:
   - Large distance between class means (inter-class)
   - Small variance within each class (intra-class)

2. The algorithm calculates for each feature:
   - Intra-class variance: How spread out samples are within the same class
   - Inter-class distance: How far apart the class centers are
   - Contrast Score = Inter-class Distance / Intra-class Variance

3. Interpretation:
   - High score = Feature creates clear separation between phishing & legitimate
   - Low score = Feature values overlap significantly between classes

4. Advantages over tree-based importance:
   - Model-agnostic: Does not depend on any specific classifier
   - Direct measurement: Scores the class separability of each feature on its own
   - No model fitting: A purely statistical measure, though it ignores feature interactions

5. Features with highest contrast scores are most effective at:
   - Distinguishing phishing URLs from legitimate ones
   - Creating decision boundaries in the feature space
""")

print(f"\nSelected features (Top {n_top_features} by ContrastFS):")
for i, f in enumerate(contrastfs_features, 1):
    print(f"  {i}. {f}")

# Store for later comparison (rename to match expected variable name)
contrast_features = contrastfs_features  # Keep compatibility with rest of notebook
contrast_time = contrastfs_time

CONTRASTFS (CONTRASTIVE FEATURE SELECTION)

ContrastFS completed in 0.06 seconds
Selected 20 features out of 57

Top 10 features by ContrastFS score:
              feature  contrast_score  normalized_score
   URLSimilarityIndex       11.191154          1.000000
         HasSocialNet        6.384796          0.570522
     HasCopyrightInfo        5.110844          0.456686
       HasDescription        3.708168          0.331348
           has_no_www        3.396335          0.303484
              IsHTTPS        2.524480          0.225578
      HasSubmitButton        2.098836          0.187544
DomainTitleMatchScore        2.095428          0.187240
         IsResponsive        1.843700          0.164746
   URLTitleMatchScore        1.630732          0.145716

CONTRASTFS ALGORITHM EXPLANATION

Why these features are selected:

ContrastFS uses contrastive learning principles to evaluate features:

1. HIGH CONTRAST SCORE features can effectively separate classes:
   - Large distance between 

---
## Step 3: Compare Selected Features

In [27]:
# Summary of all feature selection methods
print("=" * 70)
print("FEATURE SELECTION SUMMARY")
print("=" * 70)

selection_summary = {
    'Method': ['All Features', 'Boruta', 'RFE', 'Correlation-based', 'ContrastFS'],
    'Num Features': [
        len(feature_cols),
        len(boruta_features),
        len(rfe_features),
        len(corr_features_final),
        len(contrast_features)
    ],
    'Selection Time (s)': [
        0,
        round(boruta_time, 2),
        round(rfe_time, 2),
        round(corr_time, 2),
        round(contrast_time, 2)
    ]
}

summary_df = pd.DataFrame(selection_summary)
print(summary_df.to_string(index=False))

# Find common features across all methods
common_features = set(boruta_features) & set(rfe_features) & set(corr_features_final) & set(contrast_features)
print(f"\nCommon features across all methods: {len(common_features)}")
for f in common_features:
    print(f"  - {f}")

FEATURE SELECTION SUMMARY
           Method  Num Features  Selection Time (s)
     All Features            57                0.00
           Boruta            52             1983.55
              RFE            20              131.33
Correlation-based            39                1.65
       ContrastFS            20                0.06

Common features across all methods: 6
  - URLSimilarityIndex
  - HasDescription
  - URLCharProb
  - CharContinuationRate
  - SpacialCharRatioInURL
  - IsHTTPS
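
The strict intersection above can be relaxed into a vote count, which shows how many methods agree on each feature. The sketch below is a minimal illustration; the feature lists are shortened, hypothetical stand-ins for `boruta_features`, `rfe_features`, `corr_features_final` and `contrast_features`, not the actual selections.

```python
from collections import Counter

# Hypothetical, shortened selections used only for illustration
selections = {
    "Boruta": ["URLSimilarityIndex", "IsHTTPS", "LineOfCode"],
    "RFE": ["URLSimilarityIndex", "LineOfCode", "URLCharProb"],
    "Correlation": ["URLSimilarityIndex", "IsHTTPS", "HasSocialNet"],
    "ContrastFS": ["URLSimilarityIndex", "IsHTTPS", "HasSocialNet"],
}

# One vote per method per feature
votes = Counter(f for feats in selections.values() for f in set(feats))

in_all = sorted(f for f, n in votes.items() if n == len(selections))
in_most = sorted(f for f, n in votes.items() if n >= 3)

print("Selected by all methods:", in_all)
print("Selected by at least 3 methods:", in_most)
```

Sorting the result also gives a stable print order, which iterating over a plain `set` does not guarantee.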


### Top 10 Features Comparison Across All Methods

In [29]:
# Create a comprehensive Top 10 comparison table
print("=" * 100)
print("TOP 10 FEATURES COMPARISON ACROSS ALL METHODS")
print("=" * 100)

# Get Top 10 from each method
top10_comparison = pd.DataFrame({
    'Rank': range(1, 11),
    'Boruta (RF Importance)': boruta_importance_df.head(10)['feature'].values,
    'RFE (LightGBM)': rfe_importance_df.head(10)['feature'].values,
    'Correlation': correlations.head(10)['feature'].values,
    'ContrastFS': contrastfs_df.head(10)['feature'].values
})

print(top10_comparison.to_string(index=False))

# Create importance scores table
print(f"\n" + "="*100)
print("TOP 10 FEATURES WITH SCORES")
print("="*100)

top10_scores = pd.DataFrame({
    'Rank': range(1, 11),
    'Boruta Feature': boruta_importance_df.head(10)['feature'].values,
    'Boruta Score': boruta_importance_df.head(10)['importance'].round(4).values,
    'RFE Feature': rfe_importance_df.head(10)['feature'].values,
    'RFE Score': rfe_importance_df.head(10)['importance'].round(4).values,
    'Corr Feature': correlations.head(10)['feature'].values,
    'Corr Score': correlations.head(10)['correlation'].round(4).values,
    'ContrastFS Feature': contrastfs_df.head(10)['feature'].values,
    'ContrastFS Score': contrastfs_df.head(10)['normalized_score'].round(4).values
})

print(top10_scores.to_string(index=False))

TOP 10 FEATURES COMPARISON ACROSS ALL METHODS
 Rank Boruta (RF Importance)        RFE (LightGBM)           Correlation            ContrastFS
    1     URLSimilarityIndex            LineOfCode    URLSimilarityIndex    URLSimilarityIndex
    2             LineOfCode     LargestLineLength          HasSocialNet          HasSocialNet
    3        NoOfExternalRef       NoOfExternalRef      HasCopyrightInfo      HasCopyrightInfo
    4            NoOfSelfRef           URLCharProb        HasDescription        HasDescription
    5                 NoOfJS      LetterRatioInURL            has_no_www            has_no_www
    6         HasDescription SpacialCharRatioInURL               IsHTTPS               IsHTTPS
    7              NoOfImage               IsHTTPS DomainTitleMatchScore       HasSubmitButton
    8           HasSocialNet             URLLength       HasSubmitButton DomainTitleMatchScore
    9                NoOfCSS                NoOfJS          IsResponsive          IsResponsive
   1

---
## Step 4: Model Training and Evaluation

In [30]:
# Function to train and evaluate a model
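def train_and_evaluate(model, X_train, X_test, y_train, y_test, model_name):
    """
    Train a model and return evaluation metrics.

    Precision, recall and F1 are weighted averages over the classes;
    the training time covers the fit() call only.
    
    Returns:
        dict: Dictionary containing all metrics
    """
    # Training (perf_counter is a monotonic, high-resolution timer)
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time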
    
    # Prediction
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
    mcc = matthews_corrcoef(y_test, y_pred)
    
    return {
        'Model': model_name,
        'Accuracy': round(float(accuracy), 4),
        'Precision': round(float(precision), 4),
        'Recall': round(float(recall), 4),
        'F1-Score': round(float(f1), 4),
        'MCC': round(float(mcc), 4),
        'Training Time (s)': round(float(training_time), 4)
    }

print("Evaluation function defined successfully!")

Evaluation function defined successfully!
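A single timed call to `fit` can vary a lot between runs: in the outputs of this notebook, LightGBM on all 57 features takes 27.0539s in one run and 3.7092s in another. The sketch below shows one way to reduce that noise by repeating the fit and reporting the median. The helper name `median_fit_time` and the stand-in workload `dummy_fit` are assumptions for illustration and are not part of the notebook.

```python
import statistics
import time


def median_fit_time(fit_fn, repeats=5):
    """Call fit_fn() `repeats` times; return (median, min, max) duration in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), min(durations), max(durations)


# Hypothetical stand-in workload so the sketch runs on its own
def dummy_fit():
    sum(i * i for i in range(50_000))


median_s, fastest_s, slowest_s = median_fit_time(dummy_fit, repeats=5)
print(f"median={median_s:.4f}s  min={fastest_s:.4f}s  max={slowest_s:.4f}s")
```

In the notebook, `fit_fn` could be something like `lambda: sklearn.base.clone(model).fit(X_train, y_train)`, so that every repeat starts from an unfitted estimator.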


In [31]:
# Prepare feature sets for evaluation
feature_sets = {
    'All Features': feature_cols,
    'Boruta': boruta_features,
    'RFE': rfe_features,
    'Correlation': corr_features_final,
    'ContrastFS': contrastfs_features,
}

# Store all results
all_results = []

print("=" * 70)
print("TRAINING MODELS WITH DIFFERENT FEATURE SETS")
print("=" * 70)

TRAINING MODELS WITH DIFFERENT FEATURE SETS


In [32]:
# Function to clean feature names for LightGBM compatibility
import re
def clean_feature_names(df):
    clean_cols = {col: re.sub(r'[^a-zA-Z0-9_]', '_', str(col)) for col in df.columns}
    return df.rename(columns=clean_cols)

# Train and evaluate all models with all feature sets
for fs_name, features in feature_sets.items():
    print(f"\n{'='*70}")
    print(f"Feature Set: {fs_name} ({len(features)} features)")
    print("="*70)
    
    if len(features) == 0:
        print("No features selected, skipping...")
        continue
    
    # Prepare data with selected features
    X_train_fs = X_train[features]
    X_test_fs = X_test[features]
    
    # Clean feature names for LightGBM compatibility
    X_train_fs_clean = clean_feature_names(X_train_fs)
    X_test_fs_clean = clean_feature_names(X_test_fs)
    
    # LightGBM
    print("\nTraining LightGBM...")
    lgbm_model = LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)
    result = train_and_evaluate(lgbm_model, X_train_fs_clean, X_test_fs_clean, y_train, y_test, 'LightGBM')
    result['Feature Set'] = fs_name
    result['Num Features'] = len(features)
    all_results.append(result)
    print(f"  Accuracy: {result['Accuracy']}, F1: {result['F1-Score']}, Time: {result['Training Time (s)']}s")
    
    # XGBoost
    print("Training XGBoost...")
    # use_label_encoder is deprecated and ignored by recent XGBoost releases, so it is omitted
    xgb_model = XGBClassifier(n_estimators=200, random_state=42,
                               eval_metric='logloss', verbosity=0)
    result = train_and_evaluate(xgb_model, X_train_fs, X_test_fs, y_train, y_test, 'XGBoost')
    result['Feature Set'] = fs_name
    result['Num Features'] = len(features)
    all_results.append(result)
    print(f"  Accuracy: {result['Accuracy']}, F1: {result['F1-Score']}, Time: {result['Training Time (s)']}s")
    
    # CatBoost
    print("Training CatBoost...")
    cat_model = CatBoostClassifier(n_estimators=200, random_state=42, verbose=0)
    result = train_and_evaluate(cat_model, X_train_fs, X_test_fs, y_train, y_test, 'CatBoost')
    result['Feature Set'] = fs_name
    result['Num Features'] = len(features)
    all_results.append(result)
    print(f"  Accuracy: {result['Accuracy']}, F1: {result['F1-Score']}, Time: {result['Training Time (s)']}s")

print("\n" + "="*70)
print("ALL TRAINING COMPLETED!")
print("="*70)


Feature Set: All Features (57 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 27.0539s
Training XGBoost...
  Accuracy: 1.0, F1: 1.0, Time: 4.9349s
Training CatBoost...
  Accuracy: 1.0, F1: 1.0, Time: 10.1252s

Feature Set: Boruta (52 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 3.8738s
Training XGBoost...
  Accuracy: 1.0, F1: 1.0, Time: 3.053s
Training CatBoost...
  Accuracy: 1.0, F1: 1.0, Time: 15.3915s

Feature Set: RFE (20 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 32.988s
Training XGBoost...
  Accuracy: 1.0, F1: 1.0, Time: 2.9907s
Training CatBoost...
  Accuracy: 1.0, F1: 1.0, Time: 10.3145s

Feature Set: Correlation (39 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 3.1376s
Training XGBoost...
  Accuracy: 1.0, F1: 1.0, Time: 17.9066s
Training CatBoost...
  Accuracy: 1.0, F1: 1.0, Time: 19.4922s

Feature Set: ContrastFS (20 features)

Training LightGBM...
  Accuracy: 1.0, F1: 1.0, Time: 11.8563s
Training XGBo

---
## Step 5: Results Comparison

In [16]:
# Create results dataframe
results_df = pd.DataFrame(all_results)

# Reorder columns
column_order = ['Feature Set', 'Num Features', 'Model', 'Accuracy', 'Precision', 
                'Recall', 'F1-Score', 'MCC', 'Training Time (s)']
results_df = results_df[column_order]

print("=" * 100)
print("COMPLETE RESULTS TABLE")
print("=" * 100)
print(results_df.to_string(index=False))

COMPLETE RESULTS TABLE
 Feature Set  Num Features    Model  Accuracy  Precision  Recall  F1-Score  MCC  Training Time (s)
All Features            57 LightGBM       1.0        1.0     1.0       1.0  1.0             3.7092
All Features            57  XGBoost       1.0        1.0     1.0       1.0  1.0             2.6607
All Features            57 CatBoost       1.0        1.0     1.0       1.0  1.0            17.9088
      Boruta            52 LightGBM       1.0        1.0     1.0       1.0  1.0            20.9070
      Boruta            52  XGBoost       1.0        1.0     1.0       1.0  1.0             3.2267
      Boruta            52 CatBoost       1.0        1.0     1.0       1.0  1.0            12.2842
         RFE            20 LightGBM       1.0        1.0     1.0       1.0  1.0             3.8918
         RFE            20  XGBoost       1.0        1.0     1.0       1.0  1.0             1.9027
         RFE            20 CatBoost       1.0        1.0     1.0       1.0  1.0       

In [17]:
# Pivot table for better visualization - by Model
print("=" * 100)
print("ACCURACY COMPARISON BY MODEL AND FEATURE SET")
print("=" * 100)

accuracy_pivot = results_df.pivot(index='Feature Set', columns='Model', values='Accuracy')
print(accuracy_pivot.to_string())

print("\n" + "=" * 100)
print("F1-SCORE COMPARISON BY MODEL AND FEATURE SET")
print("=" * 100)

f1_pivot = results_df.pivot(index='Feature Set', columns='Model', values='F1-Score')
print(f1_pivot.to_string())

print("\n" + "=" * 100)
print("TRAINING TIME COMPARISON BY MODEL AND FEATURE SET")
print("=" * 100)

time_pivot = results_df.pivot(index='Feature Set', columns='Model', values='Training Time (s)')
print(time_pivot.to_string())

ACCURACY COMPARISON BY MODEL AND FEATURE SET
Model         CatBoost  LightGBM  XGBoost
Feature Set                              
All Features       1.0       1.0      1.0
Boruta             1.0       1.0      1.0
Correlation        1.0       1.0      1.0
Ensemble           1.0       1.0      1.0
RFE                1.0       1.0      1.0

F1-SCORE COMPARISON BY MODEL AND FEATURE SET
Model         CatBoost  LightGBM  XGBoost
Feature Set                              
All Features       1.0       1.0      1.0
Boruta             1.0       1.0      1.0
Correlation        1.0       1.0      1.0
Ensemble           1.0       1.0      1.0
RFE                1.0       1.0      1.0

TRAINING TIME COMPARISON BY MODEL AND FEATURE SET
Model         CatBoost  LightGBM  XGBoost
Feature Set                              
All Features   17.9088    3.7092   2.6607
Boruta         12.2842   20.9070   3.2267
Correlation     9.7604    6.1745  19.0364
Ensemble        8.9671    2.5944   1.3097
RFE            16.
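`DataFrame.pivot` requires each (Feature Set, Model) pair to appear exactly once. Because `all_results` is reset in a different cell from the training loop, running the training cell twice appends duplicate rows, and `pivot` then raises a `ValueError`. The sketch below, on made-up timing values, shows `pivot_table` with an explicit aggregation as a more tolerant alternative.

```python
import pandas as pd

# Made-up values; the RFE/XGBoost pair appears twice, as after a repeated training run
results = pd.DataFrame({
    "Feature Set": ["All Features", "All Features", "RFE", "RFE", "RFE"],
    "Model": ["LightGBM", "XGBoost", "LightGBM", "XGBoost", "XGBoost"],
    "Training Time (s)": [4.0, 2.0, 3.0, 1.0, 3.0],
})

try:
    results.pivot(index="Feature Set", columns="Model", values="Training Time (s)")
except ValueError as exc:
    print("pivot failed:", exc)

# pivot_table aggregates duplicate pairs instead of failing
time_pivot = results.pivot_table(
    index="Feature Set", columns="Model",
    values="Training Time (s)", aggfunc="mean",
)
print(time_pivot.to_string())
```

Averaging is only one choice; `aggfunc="min"` or `"median"` may suit timing data better.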

In [18]:
# Calculate performance change (selected features vs all features)
print("=" * 100)
print("PERFORMANCE COMPARISON: SELECTED FEATURES vs ALL FEATURES")
print("=" * 100)

comparison_results = []

for model in ['LightGBM', 'XGBoost', 'CatBoost']:
    all_features_row = results_df[(results_df['Feature Set'] == 'All Features') & 
                                   (results_df['Model'] == model)].iloc[0]
    
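    # Take the feature-set names from the results so the loop follows whatever was trained
    for fs in [name for name in results_df['Feature Set'].unique() if name != 'All Features']: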
        fs_row = results_df[(results_df['Feature Set'] == fs) & 
                            (results_df['Model'] == model)]
        if len(fs_row) == 0:
            continue
        fs_row = fs_row.iloc[0]
        
        # Calculate changes
        accuracy_change = fs_row['Accuracy'] - all_features_row['Accuracy']
        f1_change = fs_row['F1-Score'] - all_features_row['F1-Score']
        time_reduction = all_features_row['Training Time (s)'] - fs_row['Training Time (s)']
        time_reduction_pct = (time_reduction / all_features_row['Training Time (s)']) * 100 if all_features_row['Training Time (s)'] > 0 else 0
        feature_reduction = all_features_row['Num Features'] - fs_row['Num Features']
        feature_reduction_pct = (feature_reduction / all_features_row['Num Features']) * 100
        
        comparison_results.append({
            'Model': model,
            'Feature Set': fs,
            'Num Features': fs_row['Num Features'],
            'Feature Reduction': f"{feature_reduction_pct:.1f}%",
            'Accuracy Change': f"{accuracy_change:+.4f}",
            'F1 Change': f"{f1_change:+.4f}",
            'Time Reduction': f"{time_reduction_pct:.1f}%"
        })

comparison_df = pd.DataFrame(comparison_results)
print(comparison_df.to_string(index=False))

PERFORMANCE COMPARISON: SELECTED FEATURES vs ALL FEATURES
   Model Feature Set  Num Features Feature Reduction Accuracy Change F1 Change Time Reduction
LightGBM      Boruta            52              8.8%         +0.0000   +0.0000        -463.7%
LightGBM         RFE            20             64.9%         +0.0000   +0.0000          -4.9%
LightGBM Correlation            39             31.6%         +0.0000   +0.0000         -66.5%
LightGBM    Ensemble             8             86.0%         +0.0000   +0.0000          30.1%
 XGBoost      Boruta            52              8.8%         +0.0000   +0.0000         -21.3%
 XGBoost         RFE            20             64.9%         +0.0000   +0.0000          28.5%
 XGBoost Correlation            39             31.6%         +0.0000   +0.0000        -615.5%
 XGBoost    Ensemble             8             86.0%         +0.0000   +0.0000          50.8%
CatBoost      Boruta            52              8.8%         +0.0000   +0.0000          31.4%
Ca
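The "Time Reduction" column is the baseline time minus the selected-feature time, divided by the baseline. A negative value therefore means training got slower. The sketch below reproduces two rows of the table from the training times reported earlier; the function name `time_reduction_pct` is an assumption for illustration.

```python
def time_reduction_pct(baseline_s, selected_s):
    """Percentage of training time saved relative to the all-features baseline."""
    if baseline_s <= 0:
        return 0.0
    return (baseline_s - selected_s) / baseline_s * 100


# LightGBM: all features 3.7092s, Boruta 20.9070s (slower, so negative)
lgbm_boruta = time_reduction_pct(3.7092, 20.9070)
# XGBoost: all features 2.6607s, Ensemble 1.3097s (faster, so positive)
xgb_ensemble = time_reduction_pct(2.6607, 1.3097)

print(f"LightGBM / Boruta:  {lgbm_boruta:.1f}%")   # -463.7%
print(f"XGBoost / Ensemble: {xgb_ensemble:.1f}%")  # 50.8%
```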

In [19]:
# Find the best configuration
print("=" * 100)
print("BEST CONFIGURATIONS")
print("=" * 100)

# Best by Accuracy
best_accuracy = results_df.loc[results_df['Accuracy'].idxmax()]
print(f"\nBest Accuracy: {best_accuracy['Accuracy']}")
print(f"  Model: {best_accuracy['Model']}")
print(f"  Feature Set: {best_accuracy['Feature Set']} ({best_accuracy['Num Features']} features)")

# Best by F1-Score
best_f1 = results_df.loc[results_df['F1-Score'].idxmax()]
print(f"\nBest F1-Score: {best_f1['F1-Score']}")
print(f"  Model: {best_f1['Model']}")
print(f"  Feature Set: {best_f1['Feature Set']} ({best_f1['Num Features']} features)")

# Best by MCC
best_mcc = results_df.loc[results_df['MCC'].idxmax()]
print(f"\nBest MCC: {best_mcc['MCC']}")
print(f"  Model: {best_mcc['Model']}")
print(f"  Feature Set: {best_mcc['Feature Set']} ({best_mcc['Num Features']} features)")

# Best efficiency (high accuracy with low training time)
# Create efficiency score: accuracy / training_time
results_df['Efficiency'] = results_df['Accuracy'] / (results_df['Training Time (s)'] + 0.001)
best_efficiency = results_df.loc[results_df['Efficiency'].idxmax()]
print(f"\nBest Efficiency (Accuracy/Time):")
print(f"  Model: {best_efficiency['Model']}")
print(f"  Feature Set: {best_efficiency['Feature Set']} ({best_efficiency['Num Features']} features)")
print(f"  Accuracy: {best_efficiency['Accuracy']}, Time: {best_efficiency['Training Time (s)']}s")

BEST CONFIGURATIONS

Best Accuracy: 1.0
  Model: LightGBM
  Feature Set: All Features (57 features)

Best F1-Score: 1.0
  Model: LightGBM
  Feature Set: All Features (57 features)

Best MCC: 1.0
  Model: LightGBM
  Feature Set: All Features (57 features)

Best Efficiency (Accuracy/Time):
  Model: XGBoost
  Feature Set: Ensemble (8 features)
  Accuracy: 1.0, Time: 1.3097s
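Every configuration reaches a score of 1.0, and `idxmax` returns the first row when values are tied. That is why "All Features / LightGBM" is reported as best for accuracy, F1-score, and MCC: it is simply the first row. A minimal sketch of one alternative is to list every tied row and break the tie by training time. The rows below are copied from the results table; the tie-break rule itself is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# Four rows copied from the complete results table
results = pd.DataFrame({
    "Feature Set": ["All Features", "All Features", "RFE", "RFE"],
    "Model": ["LightGBM", "XGBoost", "LightGBM", "XGBoost"],
    "Accuracy": [1.0, 1.0, 1.0, 1.0],
    "Training Time (s)": [3.7092, 2.6607, 3.8918, 1.9027],
})

first_by_idxmax = results["Accuracy"].idxmax()  # first of the tied rows

# Keep every row that shares the top score, then prefer the fastest one
best_score = results["Accuracy"].max()
tied = results[results["Accuracy"] == best_score]
winner = tied.sort_values("Training Time (s)").iloc[0]

print(f"{len(tied)} configurations tie at accuracy {best_score}")
print(f"Fastest among them: {winner['Model']} on {winner['Feature Set']}")
```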


---
## Step 6: Final Summary and Recommendations

In [20]:
# Final Summary
print("=" * 100)
print("FINAL SUMMARY AND RECOMMENDATIONS")
print("=" * 100)

print("""
FEATURE SELECTION ANALYSIS:
""")

# Print feature selection summary
print(f"1. Original Features: {len(feature_cols)}")
print(f"2. Boruta Selected: {len(boruta_features)} ({(len(boruta_features)/len(feature_cols)*100):.1f}% of original)")
print(f"3. RFE Selected: {len(rfe_features)} ({(len(rfe_features)/len(feature_cols)*100):.1f}% of original)")
print(f"4. Correlation Selected: {len(corr_features_final)} ({(len(corr_features_final)/len(feature_cols)*100):.1f}% of original)")
print(f"5. Ensemble Selected: {len(ensemble_features)} ({(len(ensemble_features)/len(feature_cols)*100):.1f}% of original)")

print("""
KEY FINDINGS:
""")

# Calculate average metrics for all features vs selected features
all_features_avg = results_df[results_df['Feature Set'] == 'All Features'][['Accuracy', 'F1-Score', 'Training Time (s)']].mean()
selected_avg = results_df[results_df['Feature Set'] != 'All Features'][['Accuracy', 'F1-Score', 'Training Time (s)']].mean()

print(f"Average with ALL Features:")
print(f"  - Accuracy: {all_features_avg['Accuracy']:.4f}")
print(f"  - F1-Score: {all_features_avg['F1-Score']:.4f}")
print(f"  - Training Time: {all_features_avg['Training Time (s)']:.4f}s")

print(f"\nAverage with SELECTED Features:")
print(f"  - Accuracy: {selected_avg['Accuracy']:.4f}")
print(f"  - F1-Score: {selected_avg['F1-Score']:.4f}")
print(f"  - Training Time: {selected_avg['Training Time (s)']:.4f}s")

accuracy_diff = selected_avg['Accuracy'] - all_features_avg['Accuracy']
time_diff = all_features_avg['Training Time (s)'] - selected_avg['Training Time (s)']
time_diff_pct = (time_diff / all_features_avg['Training Time (s)']) * 100 if all_features_avg['Training Time (s)'] > 0 else 0

print(f"\nDIFFERENCE:")
print(f"  - Accuracy: {accuracy_diff:+.4f}")
print(f"  - Training Time Reduction: {time_diff_pct:.1f}%")

FINAL SUMMARY AND RECOMMENDATIONS

FEATURE SELECTION ANALYSIS:

1. Original Features: 57
2. Boruta Selected: 52 (91.2% of original)
3. RFE Selected: 20 (35.1% of original)
4. Correlation Selected: 39 (68.4% of original)
5. Ensemble Selected: 8 (14.0% of original)

KEY FINDINGS:

Average with ALL Features:
  - Accuracy: 1.0000
  - F1-Score: 1.0000
  - Training Time: 8.0929s

Average with SELECTED Features:
  - Accuracy: 1.0000
  - F1-Score: 1.0000
  - Training Time: 8.8685s

DIFFERENCE:
  - Accuracy: +0.0000
  - Training Time Reduction: -9.6%


In [21]:
# Save results to CSV
results_df.to_csv('feature_selection_results.csv', index=False)
print("Results saved to: feature_selection_results.csv")

# Display final table
print("\n" + "=" * 100)
print("FINAL RESULTS TABLE")
print("=" * 100)
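# errors='ignore' keeps this cell working even if the Efficiency column was never created
print(results_df.drop(columns=['Efficiency'], errors='ignore').to_string(index=False))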

Results saved to: feature_selection_results.csv

FINAL RESULTS TABLE
 Feature Set  Num Features    Model  Accuracy  Precision  Recall  F1-Score  MCC  Training Time (s)
All Features            57 LightGBM       1.0        1.0     1.0       1.0  1.0             3.7092
All Features            57  XGBoost       1.0        1.0     1.0       1.0  1.0             2.6607
All Features            57 CatBoost       1.0        1.0     1.0       1.0  1.0            17.9088
      Boruta            52 LightGBM       1.0        1.0     1.0       1.0  1.0            20.9070
      Boruta            52  XGBoost       1.0        1.0     1.0       1.0  1.0             3.2267
      Boruta            52 CatBoost       1.0        1.0     1.0       1.0  1.0            12.2842
         RFE            20 LightGBM       1.0        1.0     1.0       1.0  1.0             3.8918
         RFE            20  XGBoost       1.0        1.0     1.0       1.0  1.0             1.9027
         RFE            20 CatBoost     

---
## Algorithm Summary

### Feature Selection Methods Comparison

| Method | Type | Pros | Cons | Best For |
|--------|------|------|------|----------|
| **Boruta** | Wrapper | Comprehensive, statistical validation | Slow, computationally expensive | When you need robust feature selection |
| **RFE** | Wrapper | Model-specific, returns an exact number of features | Repeated refits are costly; result depends on the chosen estimator | Reducing to a fixed feature budget |
| **Correlation** | Filter | Very fast, interpretable | Only captures linear relationships | Initial exploration |
| **Ensemble** | Hybrid | Robust across models, reduces bias | Requires multiple models | Production systems |
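The limitation listed for the correlation filter can be shown on synthetic data. In this small sketch, assuming the filter uses the Pearson coefficient, a feature that determines the target exactly through a quadratic relationship gets a correlation of roughly zero and would be discarded.

```python
import numpy as np

# Synthetic feature, symmetric around zero
x = np.linspace(-1.0, 1.0, 201)

y_linear = 2.0 * x + 1.0   # linear dependence on x
y_quadratic = x ** 2       # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"Pearson r, linear target:    {r_linear:.3f}")
print(f"Pearson r, quadratic target: {r_quadratic:.3f}")
```

Tree-based wrappers such as Boruta do not share this blind spot, which is one reason the methods can disagree on rankings.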

### Key Features Identified

1. **URLSimilarityIndex** - Ranked first by Boruta, Correlation, and ContrastFS; RFE (LightGBM) ranks LineOfCode first instead
2. **IsHTTPS** - Security indicator, important for phishing detection
3. **HasSocialNet/HasCopyrightInfo** - Legitimacy indicators
4. **LineOfCode/LargestLineLength** - Website complexity metrics
5. **NoOfExternalRef** - External reference patterns
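Agreement between methods can be checked by counting how many top lists each feature appears in. The sketch below uses ranks 1 to 9 copied from the top-10 comparison table, because rank 10 is cut off in the printed output; the counts therefore describe those nine ranks only.

```python
from collections import Counter

# Ranks 1-9 per method, copied from the comparison table (rank 10 is not visible)
top_features = {
    "Boruta": ["URLSimilarityIndex", "LineOfCode", "NoOfExternalRef", "NoOfSelfRef",
               "NoOfJS", "HasDescription", "NoOfImage", "HasSocialNet", "NoOfCSS"],
    "RFE": ["LineOfCode", "LargestLineLength", "NoOfExternalRef", "URLCharProb",
            "LetterRatioInURL", "SpacialCharRatioInURL", "IsHTTPS", "URLLength", "NoOfJS"],
    "Correlation": ["URLSimilarityIndex", "HasSocialNet", "HasCopyrightInfo",
                    "HasDescription", "has_no_www", "IsHTTPS", "DomainTitleMatchScore",
                    "HasSubmitButton", "IsResponsive"],
    "ContrastFS": ["URLSimilarityIndex", "HasSocialNet", "HasCopyrightInfo",
                   "HasDescription", "has_no_www", "IsHTTPS", "HasSubmitButton",
                   "DomainTitleMatchScore", "IsResponsive"],
}

# Count the number of methods that list each feature
counts = Counter(name for names in top_features.values() for name in set(names))
n_methods = len(top_features)

for feature, n in counts.most_common():
    if n >= 3:
        print(f"{feature}: {n}/{n_methods} methods")
```

Within these ranks no feature is selected by all four methods; URLSimilarityIndex, HasSocialNet, HasDescription, and IsHTTPS each appear in three.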