# Feature Selection & Model Training 
## With Default Hyperparameters and Results to 7 Decimal Places

**Based on the paper:** Prasad & Chandra (Computers & Security 136, 2024)

**Improvements:**
- Metrics are reported to 7 decimal places
- Hyperparameters use default values
- Data leakage is investigated

In [1]:
# Install required packages
!pip install boruta lightgbm xgboost catboost scikit-learn pandas numpy -q




In [2]:
import pandas as pd
import numpy as np
import time
import warnings
import re
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    f1_score, matthews_corrcoef, classification_report, confusion_matrix
)

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from boruta import BorutaPy

print("All libraries imported successfully!")

All libraries imported successfully!


## Step 1: Load Dataset

In [3]:
# Load the dataset
df = pd.read_csv('new_dataset/PhiUSIIL_Phishing_URL_63_Features.csv')

print("=" * 70)
print("DATASET INFORMATION")
print("=" * 70)
print(f"Number of Rows: {len(df)}")
print(f"Number of Columns: {len(df.columns)}")
print(f"\nColumn Names:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i}. {col}")

df.head()

DATASET INFORMATION
Number of Rows: 235795
Number of Columns: 64

Column Names:
  1. FILENAME
  2. URL
  3. URLLength
  4. Domain
  5. DomainLength
  6. IsDomainIP
  7. TLD
  8. URLSimilarityIndex
  9. CharContinuationRate
  10. TLDLegitimateProb
  11. URLCharProb
  12. TLDLength
  13. NoOfSubDomain
  14. HasObfuscation
  15. NoOfObfuscatedChar
  16. ObfuscationRatio
  17. NoOfLettersInURL
  18. LetterRatioInURL
  19. NoOfDegitsInURL
  20. DegitRatioInURL
  21. NoOfEqualsInURL
  22. NoOfQMarkInURL
  23. NoOfAmpersandInURL
  24. NoOfOtherSpecialCharsInURL
  25. SpacialCharRatioInURL
  26. IsHTTPS
  27. LineOfCode
  28. LargestLineLength
  29. HasTitle
  30. Title
  31. DomainTitleMatchScore
  32. URLTitleMatchScore
  33. HasFavicon
  34. Robots
  35. IsResponsive
  36. NoOfURLRedirect
  37. NoOfSelfRedirect
  38. HasDescription
  39. NoOfPopup
  40. NoOfiFrame
  41. HasExternalFormSubmit
  42. HasSocialNet
  43. HasSubmitButton
  44. HasHiddenFields
  45. HasPasswordField
  46. Bank
  4

| | FILENAME | URL | URLLength | Domain | DomainLength | IsDomainIP | TLD | URLSimilarityIndex | CharContinuationRate | TLDLegitimateProb | ... | NoOfExternalRef | Unnamed: 55 | has_no_www | num_slashes | num_hyphens | URL_Profanity_Prob | URL_NumberOf_Profanity | URLContent_Profanity_Prob | URLContent_NumberOf_Profanity | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 521848.txt | https://www.southbankmosaics.com | 32 | www.southbankmosaics.com | 24 | 0 | com | 100.0 | 1.0 | 0.522907 | ... | 124 | | 0 | 2 | 0 | 0.012189 | 1 | 0.01188 | 1 | 1 |
| 1 | 31372.txt | https://www.uni-mainz.de | 24 | www.uni-mainz.de | 16 | 0 | de | 100.0 | 0.666667 | 0.03265 | ... | 217 | | 0 | 2 | 1 | 0.027988 | 0 | 0.019723 | 0 | 1 |
| 2 | 597387.txt | https://www.voicefmradio.co.uk | 30 | www.voicefmradio.co.uk | 22 | 0 | uk | 100.0 | 0.866667 | 0.028555 | ... | 5 | | 0 | 2 | 0 | 0.015063 | 0 | 0.000294 | 1 | 1 |
| 3 | 554095.txt | https://www.sfnmjournal.com | 27 | www.sfnmjournal.com | 19 | 0 | com | 100.0 | 1.0 | 0.522907 | ... | 31 | | 0 | 2 | 0 | 0.012189 | 0 | 0.0 | 0 | 1 |
| 4 | 151578.txt | https://www.rewildingargentina.org | 34 | www.rewildingargentina.org | 26 | 0 | org | 100.0 | 1.0 | 0.079963 | ... | 85 | | 0 | 2 | 0 | 0.005476 | 0 | 0.002091 | 48 | 1 |


In [4]:
# Prepare features and target
exclude_cols = ['URL', 'FILENAME', 'Domain', 'TLD', 'Title', 'Unnamed: 55']
target_col = 'label'

# Get only numeric columns for features
feature_cols = [col for col in df.columns 
                if col not in exclude_cols + [target_col] 
                and df[col].dtype in ['int64', 'float64', 'int32', 'float32']]

print(f"Number of feature columns: {len(feature_cols)}")
print(f"Target column: {target_col}")
print(f"Target distribution:\n{df[target_col].value_counts()}")

X = df[feature_cols].copy()
y = df[target_col].copy()

# Handle missing values
X = X.fillna(0)

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")

Number of feature columns: 57
Target column: label
Target distribution:
label
1    134850
0    100945
Name: count, dtype: int64

Feature matrix shape: (235795, 57)
Target vector shape: (235795,)


## Step 2: Check Feature Correlation with Label

In [5]:
# Calculate correlation with label
correlations = {}
for col in feature_cols:
    correlations[col] = abs(df[col].corr(df[target_col]))

# Sort by correlation
sorted_corr = sorted(correlations.items(), key=lambda x: x[1], reverse=True)

print("TOP 10 FEATURES MOST CORRELATED WITH LABEL:")
print("-" * 60)
for i, (feat, corr) in enumerate(sorted_corr[:10], 1):
    warning = " ⚠ SUSPICIOUS!" if corr > 0.95 else ""
    print(f"{i:2d}. {feat:35s} : {corr:.7f}{warning}")

TOP 10 FEATURES MOST CORRELATED WITH LABEL:
------------------------------------------------------------
 1. URLSimilarityIndex                  : 0.8603580
 2. HasSocialNet                        : 0.7842545
 3. HasCopyrightInfo                    : 0.7433575
 4. HasDescription                      : 0.6902318
 5. has_no_www                          : 0.6684359
 6. IsHTTPS                             : 0.6128735
 7. DomainTitleMatchScore               : 0.5849046
 8. HasSubmitButton                     : 0.5785609
 9. IsResponsive                        : 0.5486075
10. URLTitleMatchScore                  : 0.5394187
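
A linear correlation below the 0.95 cutoff does not rule out leakage: a feature can separate the two classes with a single threshold while its Pearson correlation stays far lower. The sketch below is one possible complementary check, assuming numeric features and a 0/1 label. `best_threshold_accuracy` is a hypothetical helper that is not part of the original notebook, and the toy values are made up for illustration.

```python
import numpy as np

def best_threshold_accuracy(values, labels):
    """Best accuracy of a one-feature rule `value > t` (either polarity)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(values, kind="stable")
    v, y = values[order], labels[order]
    n = len(y)
    # pos_left[i] / neg_left[i]: positives / negatives among the i smallest values
    pos_left = np.concatenate(([0], np.cumsum(y)))
    neg_left = np.arange(n + 1) - pos_left
    total_pos = pos_left[-1]
    # Rule "predict 1 to the right of cut i": correct = negatives left + positives right
    correct = neg_left + (total_pos - pos_left)
    # A cut is only meaningful between two distinct values (or at either end)
    valid = np.ones(n + 1, dtype=bool)
    valid[1:n] = v[1:] != v[:-1]
    correct = correct[valid]
    # n - correct is the score of the opposite polarity ("predict 1 to the left")
    return float(np.maximum(correct, n - correct).max() / n)

# Toy data (illustrative only): perfectly separable, yet weakly correlated
toy_values = [0.1, 0.2, 0.3, 5.0, 50.0, 500.0]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_corr = abs(np.corrcoef(toy_values, toy_labels)[0, 1])
toy_acc = best_threshold_accuracy(toy_values, toy_labels)
print(f"|corr| = {toy_corr:.3f}, best single-threshold accuracy = {toy_acc:.3f}")
```

On the notebook's data this could be called as `best_threshold_accuracy(X[col], y)` for each `col` in `feature_cols`; any feature that scores close to 1.0 on its own deserves a closer look before training.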


## Step 3: Split Data

In [6]:
# Split data into train and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("=" * 70)
print("DATA SPLIT")
print("=" * 70)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Number of features: {X_train.shape[1]}")

DATA SPLIT
Training set: 188636 samples
Test set: 47159 samples
Number of features: 57
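
A random split can place identical feature rows in both sets, which inflates test scores. The following is a minimal sketch of one way to count such overlaps, assuming both frames share the same columns and missing values have already been filled. `count_test_rows_seen_in_train` is a hypothetical helper and the toy frames are illustrative, not taken from the dataset.

```python
import pandas as pd

def count_test_rows_seen_in_train(train_df, test_df):
    """Count test rows whose feature values also occur, exactly, in the training set."""
    if list(train_df.columns) != list(test_df.columns):
        raise ValueError("train and test must have the same columns in the same order")
    # Plain tuples are hashable, so each membership test is a set lookup.
    # Caveat: NaN does not compare equal to NaN, so fill missing values first.
    train_rows = set(train_df.itertuples(index=False, name=None))
    return sum(row in train_rows for row in test_df.itertuples(index=False, name=None))

# Toy frames (illustrative only)
toy_train = pd.DataFrame({"URLLength": [32, 24, 30], "IsHTTPS": [1, 1, 0]})
toy_test = pd.DataFrame({"URLLength": [24, 99], "IsHTTPS": [1, 0]})
overlap = count_test_rows_seen_in_train(toy_train, toy_test)
print(f"{overlap} of {len(toy_test)} toy test rows also appear in the training set")
```

With the notebook's variables the call would be `count_test_rows_seen_in_train(X_train, X_test)`. A large overlap would suggest deduplicating before the split.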


## Step 4: Define Evaluation Function (7 Decimal Places)

In [7]:
def train_and_evaluate(model, X_train, X_test, y_train, y_test, model_name):
    """
    Train a model and return evaluation metrics with 7 decimal precision.
    """
    # Training
    start_time = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    # Prediction
    y_pred = model.predict(X_test)
    
    # Calculate metrics with HIGH PRECISION (7 decimals)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
    mcc = matthews_corrcoef(y_test, y_pred)
    
    return {
        'Model': model_name,
        'Accuracy': round(accuracy, 7),
        'Precision': round(precision, 7),
        'Recall': round(recall, 7),
        'F1-Score': round(f1, 7),
        'MCC': round(mcc, 7),
        'Training Time (s)': round(training_time, 4)
    }

print("Evaluation function defined with 7 decimal precision!")

Evaluation function defined with 7 decimal precision!
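
Note that `train_and_evaluate` passes `average='weighted'`, which averages the per-class scores weighted by class support instead of scoring only the positive class. The plain-Python sketch below illustrates the difference for precision on a tiny made-up example; it is a simplified illustration, not scikit-learn's implementation, and both function names are hypothetical.

```python
def per_class_precision(y_true, y_pred, cls):
    """Precision for one class: of the samples predicted as `cls`, how many are `cls`."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == cls]
    if not predicted:
        return 0.0  # mirrors zero_division=0
    return sum(t == cls for t in predicted) / len(predicted)

def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's share of y_true."""
    n = len(y_true)
    return sum(
        per_class_precision(y_true, y_pred, cls) * y_true.count(cls) / n
        for cls in sorted(set(y_true))
    )

# Toy labels (illustrative only): one positive sample is misclassified as negative
toy_true = [1, 1, 1, 0]
toy_pred = [1, 1, 0, 0]
binary = per_class_precision(toy_true, toy_pred, cls=1)
weighted = weighted_precision(toy_true, toy_pred)
print(f"binary precision (class 1): {binary:.7f}")
print(f"weighted precision:         {weighted:.7f}")
```

When comparing against published figures, it helps to confirm which averaging mode the paper reports, since the two can differ whenever any sample is misclassified.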


## Step 5: Define Models with Default Hyperparameters

In [8]:
# Clean feature names for LightGBM compatibility
def clean_feature_names(df):
    clean_cols = {col: re.sub(r'[^a-zA-Z0-9_]', '_', str(col)) for col in df.columns}
    return df.rename(columns=clean_cols)

# Define models with default hyperparameters where marked (as likely used in the paper)
models = {
    'LightGBM': LGBMClassifier(
        n_estimators=100,      # default
        learning_rate=0.1,     # default
        max_depth=-1,          # unlimited (default)
        num_leaves=31,         # default
        random_state=42,
        verbose=-1
    ),
    'XGBoost': XGBClassifier(
        n_estimators=100,      # default
        learning_rate=0.3,     # default
        max_depth=6,           # default
        random_state=42,
        eval_metric='logloss',
        verbosity=0
    ),
    'CatBoost': CatBoostClassifier(
        iterations=100,        # NOT the default (1000); reduced to match n_estimators above
        learning_rate=0.03,    # set explicitly; if omitted, CatBoost picks a rate automatically
        depth=6,               # default
        random_state=42,
        verbose=0
    )
}

print("Models defined with hyperparameters:")
print("\nLightGBM:")
print("  - n_estimators: 100")
print("  - learning_rate: 0.1")
print("  - max_depth: -1 (unlimited)")
print("  - num_leaves: 31")
print("\nXGBoost:")
print("  - n_estimators: 100")
print("  - learning_rate: 0.3")
print("  - max_depth: 6")
print("\nCatBoost:")
print("  - iterations: 100")
print("  - learning_rate: 0.03")
print("  - depth: 6")

Models defined with hyperparameters:

LightGBM:
  - n_estimators: 100
  - learning_rate: 0.1
  - max_depth: -1 (unlimited)
  - num_leaves: 31

XGBoost:
  - n_estimators: 100
  - learning_rate: 0.3
  - max_depth: 6

CatBoost:
  - iterations: 100
  - learning_rate: 0.03
  - depth: 6


## Step 6: Train and Evaluate Models

In [9]:
# Clean feature names for LightGBM
X_train_clean = clean_feature_names(X_train)
X_test_clean = clean_feature_names(X_test)

# Train and evaluate all models
results = []

print("=" * 80)
print("TRAINING MODELS")
print("=" * 80)

for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    
    # Use clean feature names for LightGBM
    if model_name == 'LightGBM':
        result = train_and_evaluate(model, X_train_clean, X_test_clean, y_train, y_test, model_name)
    else:
        result = train_and_evaluate(model, X_train, X_test, y_train, y_test, model_name)
    
    results.append(result)
    print(f"  Accuracy:  {result['Accuracy']:.7f}")
    print(f"  Precision: {result['Precision']:.7f}")
    print(f"  Recall:    {result['Recall']:.7f}")
    print(f"  F1-Score:  {result['F1-Score']:.7f}")
    print(f"  MCC:       {result['MCC']:.7f}")
    print(f"  Time:      {result['Training Time (s)']}s")

print("\n" + "=" * 80)
print("ALL TRAINING COMPLETED!")
print("=" * 80)

TRAINING MODELS

Training LightGBM...
  Accuracy:  1.0000000
  Precision: 1.0000000
  Recall:    1.0000000
  F1-Score:  1.0000000
  MCC:       1.0000000
  Time:      3.8439s

Training XGBoost...
  Accuracy:  1.0000000
  Precision: 1.0000000
  Recall:    1.0000000
  F1-Score:  1.0000000
  MCC:       1.0000000
  Time:      1.8487s

Training CatBoost...
  Accuracy:  1.0000000
  Precision: 1.0000000
  Recall:    1.0000000
  F1-Score:  1.0000000
  MCC:       1.0000000
  Time:      4.7094s

ALL TRAINING COMPLETED!


## Step 7: Results Table (7 Decimal Precision)

In [10]:
# Create results dataframe
results_df = pd.DataFrame(results)

# Display with formatting
print("=" * 100)
print("COMPLETE RESULTS TABLE (7 DECIMAL PRECISION)")
print("=" * 100)
print(results_df.to_string(index=False))

# Save to CSV
results_df.to_csv('model_results_7decimals.csv', index=False)
print("\nResults saved to: model_results_7decimals.csv")

COMPLETE RESULTS TABLE (7 DECIMAL PRECISION)
   Model  Accuracy  Precision  Recall  F1-Score  MCC  Training Time (s)
LightGBM       1.0        1.0     1.0       1.0  1.0             3.8439
 XGBoost       1.0        1.0     1.0       1.0  1.0             1.8487
CatBoost       1.0        1.0     1.0       1.0  1.0             4.7094

Results saved to: model_results_7decimals.csv
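
`DataFrame.to_string` prints floats in their shortest form, so a value rounded to 7 places still shows as `1.0` in the table above. One possible way to force fixed 7-decimal output is the `formatters` argument. The one-row frame below is a stand-in for `results_df`, reusing the LightGBM values reported earlier.

```python
import pandas as pd

# Stand-in for results_df (values copied from the LightGBM row above)
demo_df = pd.DataFrame([{
    "Model": "LightGBM",
    "Accuracy": 1.0,
    "MCC": 1.0,
    "Training Time (s)": 3.8439,
}])

# Apply a fixed 7-decimal format to the metric columns only,
# leaving the training time in its default representation.
metric_cols = ["Accuracy", "MCC"]
formatters = {col: "{:.7f}".format for col in metric_cols}

table = demo_df.to_string(index=False, formatters=formatters)
print(table)
```

The same `formatters` dictionary, extended with `Precision`, `Recall` and `F1-Score`, could be passed to `results_df.to_string`.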


## Step 8: Confusion Matrices

In [11]:
print("=" * 80)
print("CONFUSION MATRICES")
print("=" * 80)

for model_name, model in models.items():
    if model_name == 'LightGBM':
        y_pred = model.predict(X_test_clean)
    else:
        y_pred = model.predict(X_test)
    
    cm = confusion_matrix(y_test, y_pred)
    
    print(f"\n{model_name}:")
    print(f"  TN (True Negative):  {cm[0][0]:>6}")
    print(f"  FP (False Positive): {cm[0][1]:>6}")
    print(f"  FN (False Negative): {cm[1][0]:>6}")
    print(f"  TP (True Positive):  {cm[1][1]:>6}")

CONFUSION MATRICES

LightGBM:
  TN (True Negative):   20189
  FP (False Positive):      0
  FN (False Negative):      0
  TP (True Positive):   26970

XGBoost:
  TN (True Negative):   20189
  FP (False Positive):      0
  FN (False Negative):      0
  TP (True Positive):   26970

CatBoost:
  TN (True Negative):   20189
  FP (False Positive):      0
  FN (False Negative):      0
  TP (True Positive):   26970
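
As a cross-check, accuracy and MCC can be recomputed directly from the four confusion-matrix counts. The sketch below uses the standard MCC formula with the counts printed above; `mcc_from_counts` is a hypothetical helper, and returning 0 for an empty row or column is a common convention that is assumed here.

```python
import math

def mcc_from_counts(tn, fp, fn, tp):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def accuracy_from_counts(tn, fp, fn, tp):
    """Share of correctly classified samples."""
    return (tn + tp) / (tn + fp + fn + tp)

# Counts reported above (identical for all three models)
tn, fp, fn, tp = 20189, 0, 0, 26970
print(f"Samples:  {tn + fp + fn + tp}")
print(f"Accuracy: {accuracy_from_counts(tn, fp, fn, tp):.7f}")
print(f"MCC:      {mcc_from_counts(tn, fp, fn, tp):.7f}")
```

The counts sum to 47159, which matches the test set size from the split step.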


## Step 9: Comparison with Paper Results

In [12]:
# Paper results (Prasad & Chandra, Computers & Security 136, 2024)
paper_results = {
    'LightGBM': {'Accuracy': 0.9999, 'Precision': 0.99991, 'Recall': 0.99993, 'F1-Score': 0.99992, 'MCC': 0.99981},
    'XGBoost': {'Accuracy': 0.99993, 'Precision': 0.99993, 'Recall': 0.99994, 'F1-Score': 0.99994, 'MCC': 0.99985},
    'CatBoost': {'Accuracy': 0.99987, 'Precision': 0.99981, 'Recall': 0.99996, 'F1-Score': 0.99989, 'MCC': 0.99974}
}

print("=" * 100)
print("COMPARISON: YOUR RESULTS vs PAPER RESULTS")
print("=" * 100)
print()
print(f"{'Model':<12} {'Your Accuracy':>15} {'Paper Accuracy':>15} {'Difference':>15}")
print("-" * 60)

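for result in results:
    model = result['Model']
    your_acc = result['Accuracy']
    paper_acc = paper_results.get(model, {}).get('Accuracy')
    if paper_acc is None:
        # No published figure for this model: print N/A instead of subtracting
        print(f"{model:<12} {your_acc:>15.7f} {'N/A':>15} {'N/A':>15}")
        continue
    diff = your_acc - paper_acc
    print(f"{model:<12} {your_acc:>15.7f} {paper_acc:>15.5f} {diff:>+15.7f}")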

COMPARISON: YOUR RESULTS vs PAPER RESULTS

Model          Your Accuracy  Paper Accuracy      Difference
------------------------------------------------------------
LightGBM           1.0000000         0.99990      +0.0001000
XGBoost            1.0000000         0.99993      +0.0000700
CatBoost           1.0000000         0.99987      +0.0001300
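
To put these differences in perspective, an accuracy gap can be converted into an approximate number of misclassified samples. The sketch below assumes, purely for illustration, that the paper's test set had the same size as this notebook's (47159); the paper's actual split may differ. `implied_errors` is a hypothetical helper.

```python
n_test = 47159  # this notebook's test set size; assumed here, not taken from the paper

# Accuracies quoted from the paper in the cell above
paper_accuracy = {"LightGBM": 0.9999, "XGBoost": 0.99993, "CatBoost": 0.99987}

def implied_errors(accuracy, n_samples):
    """Approximate number of misclassified samples implied by an accuracy value."""
    return round((1.0 - accuracy) * n_samples)

for name, acc in paper_accuracy.items():
    errors = implied_errors(acc, n_test)
    print(f"{name:<10} accuracy={acc:.5f} -> about {errors} errors out of {n_test}")
```

Under that assumption, the gap between the paper's figures and a perfect score corresponds to only a handful of samples.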


## Summary

### Hyperparameters Used:

| Model | n_estimators | learning_rate | max_depth | Other |
|-------|-------------|---------------|-----------|--------|
| LightGBM | 100 | 0.1 | -1 (unlimited) | num_leaves=31 |
| XGBoost | 100 | 0.3 | 6 | - |
| CatBoost | 100 (`iterations`) | 0.03 | 6 (`depth`) | - |

In [13]:
print("=" * 100)
print("NOTEBOOK COMPLETED")
print("=" * 100)
print()
print("Files created:")
print("  - model_results_7decimals.csv")
print()
print("All metrics displayed with 7 decimal precision.")

NOTEBOOK COMPLETED

Files created:
  - model_results_7decimals.csv

All metrics displayed with 7 decimal precision.
