# Phishing URL Tree-Based Model Experiments

This notebook explores various tree-based models using the Kaggle phishing URL dataset.

For the tree-based models, we will be experimenting with:

1. Random Forest
2. XGBoost
3. LightGBM
4. CatBoost

## Setup and Imports

In [8]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import hstack
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Tree-based models
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Import ModelSaver
import sys
import os
sys.path.append(os.path.abspath('.'))
from save_model import ModelSaver

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

In [30]:
# Safe TruncatedSVD wrapper to clamp n_components at fit time
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.decomposition import TruncatedSVD

class SafeTruncatedSVD(TransformerMixin, BaseEstimator):
    """
    Adapter for TruncatedSVD that ensures n_components <= n_features at fit time.

    This prevents ValueError when Optuna suggests more components than the TF-IDF
    matrix actually has. The wrapper will create an internal TruncatedSVD with the
    safe number of components and delegate fit/transform calls to it.
    """
    def __init__(self, n_components=100, random_state=None):
        self.n_components = n_components
        self.random_state = random_state
        self._svd = None

    def fit(self, X, y=None):
        # X can be dense or sparse; get n_features = number of columns
        n_features = X.shape[1]
        safe_n = max(1, min(self.n_components, n_features))
        self._svd = TruncatedSVD(n_components=safe_n, random_state=self.random_state)
        self._svd.fit(X)
        self.n_components_ = safe_n
        return self

    def transform(self, X):
        return self._svd.transform(X)

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)
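
To make the clamping rule concrete, here is a small standalone sketch of the arithmetic that `SafeTruncatedSVD.fit` applies before building the inner `TruncatedSVD`. The helper name `clamp_n_components` is hypothetical and exists only for illustration; it is not part of the notebook or of scikit-learn.

```python
def clamp_n_components(requested: int, n_features: int) -> int:
    """Return a component count between 1 and n_features (inclusive)."""
    return max(1, min(requested, n_features))


# A request for 200 components on a 150-column TF-IDF matrix is clamped down.
print(clamp_n_components(200, 150))   # 150
# A request that already fits is left untouched.
print(clamp_n_components(100, 5000))  # 100
# Degenerate requests are raised to at least one component.
print(clamp_n_components(0, 5000))    # 1
```

The clamped value is what the wrapper stores in `n_components_`, which is why later code reads that attribute instead of the requested `n_components`.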


In [9]:
# Configuration
SAVE_MODELS = True
SEED = 42
np.random.seed(SEED)

# Check for Google Drive (if running in Colab)
use_drive = False
try:
    from google.colab import drive
    drive.mount('/content/drive')
    use_drive = True
    drive_root = '/content/drive/MyDrive/fraud-grp-proj/'
except ImportError:
    pass

In [10]:
# Load train and test datasets
train_df = pd.read_csv('dataset/train.csv')
test_df = pd.read_csv('dataset/test.csv')

train_w_features_df = pd.read_csv('dataset/df_train_feature_engineered.csv')
test_w_features_df = pd.read_csv('dataset/df_test_feature_engineered.csv')

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

print(f"Train with features shape: {train_w_features_df.shape}")
print(f"Test with features shape: {test_w_features_df.shape}")

# Prepare text data for TF-IDF
X_text = train_df['url'].values
X_text_test = test_df['url'].values

Train shape: (9143, 2)
Test shape: (2286, 2)
Train with features shape: (9143, 78)
Test with features shape: (2286, 78)


Because tree-based models are insensitive to feature scaling and tend to cope well with redundant features, we use the full feature set here, including both the original and the transformed features, unlike our approach for the linear and neural network models.
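
One reason this is reasonable: a tree split depends only on the ordering of a feature's values, so a monotonic transform such as `log1p` leads to the same partition of the training data. The sketch below illustrates this on synthetic data, using a single scikit-learn `DecisionTreeClassifier` as a stand-in for the ensembles in this notebook. The data, the threshold of 300, and the variable names are all made up for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Toy count-like feature and a noisy threshold label (synthetic, not from the dataset).
x = rng.integers(1, 1000, size=300).astype(float)
y = ((x > 300) ^ (rng.random(300) < 0.1)).astype(int)

raw = x.reshape(-1, 1)
logged = np.log1p(x).reshape(-1, 1)  # monotonic transform of the same feature

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(raw, y)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(logged, y)

# Both trees split the training samples in the same places.
same = np.array_equal(tree_raw.predict(raw), tree_log.predict(logged))
print("Identical training predictions:", same)
```

Keeping both the original and the transformed copy of a feature therefore adds little new information for a tree, but it is also unlikely to hurt.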

In [11]:
# Prepare X and y
non_text_cols = train_w_features_df.select_dtypes(exclude=[object]).columns.tolist()
if 'target' in non_text_cols:
    non_text_cols.remove('target')

# We will use the DataFrames directly
X_train_df = train_w_features_df.copy()
y_train = train_w_features_df['target'].values

X_test_df = test_w_features_df.copy()
if 'target' in test_w_features_df.columns:
    y_test = test_w_features_df['target'].values
else:
    y_test = np.zeros(len(test_w_features_df))

print(f"Numeric features: {len(non_text_cols)}")
print(f"Total features (including url): {len(non_text_cols) + 1}")


Numeric features: 72
Total features (including url): 73


## Training Models

Now let's move on to training the models. We use the `ModelSaver` utility to standardize how metrics and models are stored for later evaluation.

Since we found that combined features worked best for linear models, we will focus on combined features (TF-IDF + Numeric) for tree-based models as well. To keep the feature space manageable, we apply truncated SVD to the TF-IDF features to reduce their dimensionality before combining them with the numeric features.

Ultimately, we will be experimenting with:
1. Numeric features only
2. Combined features (TF-IDF + SVD + Numeric)
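
To give a feel for what the TF-IDF step sees, the sketch below lists the character n-grams of a URL for lengths 3 to 5, the default range used in the experiments. It is a simplified pure-Python approximation of character-level tokenization, not scikit-learn's implementation, and the URL and the helper name `char_ngrams` are made up for the example.

```python
def char_ngrams(text: str, min_n: int, max_n: int) -> list[str]:
    """List every character n-gram of length min_n..max_n, shortest first."""
    text = text.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


url = "paypa1.com/login"  # made-up example URL
grams = char_ngrams(url, 3, 5)

print(len(grams))   # 39
print(grams[:5])    # ['pay', 'ayp', 'ypa', 'pa1', 'a1.']
```

Even a short URL produces dozens of overlapping n-grams, which is why the vocabulary is capped with `max_features` and then compressed with SVD.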

In [31]:
def run_tree_experiment(model_class, model_name, model_params, experiment_name, X_train, y_train, X_test, numeric_features, text_feature=None, save_model=True, n_svd_components=100, tfidf_max_features=5000, tfidf_ngram_range=(3,5), **kwargs):
    print(f"\n=== Running Experiment: {experiment_name} ({model_name}) ===")
    print(f"Saving Model: {save_model}")

    saver = None
    if save_model:
        if use_drive:
            base_path = drive_root + "experiments"
        else:
            base_path = "experiments"
        saver = ModelSaver(base_path=base_path)
        saver.start_experiment(
            experiment_name=experiment_name,
            model_type=model_name,
            vectorizer="Tfidf+SVD" if text_feature else "Numeric",
            vectorizer_params={'max_features': tfidf_max_features, 'ngram_range': tfidf_ngram_range, 'n_components': n_svd_components} if text_feature else {},
            model_params=model_params,
            n_folds=5,
            save_format="pickle"
        )

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    fold_test_preds = []
    
    for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train), start=1):
        print(f"\n--- Fold {fold}/5 ---")
        
        # Split data
        X_train_fold = X_train.iloc[train_idx]
        y_train_fold = y_train[train_idx]
        X_val_fold = X_train.iloc[val_idx]
        y_val_fold = y_train[val_idx]

        transformers = []
        # We will compute text SVD feature names after pipeline.fit in case
        # n_components is clamped by the SafeTruncatedSVD wrapper.
        feature_names_out = []

        # 1. Text Pipeline (TF-IDF + SVD)
        if text_feature:
            text_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(max_features=tfidf_max_features, analyzer='char', ngram_range=tfidf_ngram_range)),
                ('svd', SafeTruncatedSVD(n_components=n_svd_components, random_state=SEED))
            ])
            transformers.append(('text', text_pipeline, text_feature))

        # 2. Numeric Pipeline
        if numeric_features:
            transformers.append(('numeric', 'passthrough', numeric_features))
            feature_names_out.extend(numeric_features)

        # Column Transformer
        preprocessor = ColumnTransformer(transformers)

        # 3. Full Pipeline
        pipeline = Pipeline([
            ('preprocessor', preprocessor),
            ('classifier', model_class(**model_params))
        ])

        # Train
        pipeline.fit(X_train_fold, y_train_fold)

        # If we used SVD, now compute the actual number of components the
        # SafeTruncatedSVD used. ColumnTransformer emits columns in transformer
        # order ('text' first, then 'numeric'), so the SVD names must come first.
        if text_feature:
            # Access the fitted SVD transformer:
            fitted_svd = pipeline.named_steps['preprocessor'].named_transformers_['text'].named_steps['svd']
            n_svd_fitted = getattr(fitted_svd, 'n_components_', n_svd_components)
            feature_names_out = [*[f'svd_{i}' for i in range(n_svd_fitted)], *feature_names_out]

        # Validation predictions
        val_probs = pipeline.predict_proba(X_val_fold)[:, 1]
        val_preds = (val_probs > 0.5).astype(int)

        # Calculate metrics
        tn, fp, fn, tp = confusion_matrix(y_val_fold, val_preds).ravel()
        
        metrics = {
            'fold': fold,
            'accuracy': accuracy_score(y_val_fold, val_preds),
            'precision': precision_score(y_val_fold, val_preds, zero_division=0),
            'recall': recall_score(y_val_fold, val_preds, zero_division=0),
            'f1': f1_score(y_val_fold, val_preds, zero_division=0),
            'roc_auc': roc_auc_score(y_val_fold, val_probs),
            'TP': int(tp),
            'FP': int(fp),
            'TN': int(tn),
            'FN': int(fn),
            'train_size': len(train_idx),
            'val_size': len(val_idx)
        }
        
        print(f"Fold {fold} Val AUC: {metrics['roc_auc']:.4f}")

        # Test predictions
        test_probs = pipeline.predict_proba(X_test)[:, 1]
        fold_test_preds.append(test_probs)

        if save_model and saver:
            saver.add_fold(
                fold_model=pipeline,
                fold_metric=metrics,
                test_predictions=test_probs,
                feature_names=feature_names_out
            )

    if save_model and saver:
        saver.finalize_experiment(**kwargs)
        print(f"Experiment saved to {saver._exp_dir}")

    return pipeline
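
The function collects one vector of test probabilities per fold in `fold_test_preds` and hands each one to `ModelSaver`, whose internals are not shown here. One common way to combine such per-fold predictions is a simple average, sketched below with made-up probabilities. This is an assumption about how the folds could be ensembled, not a description of what `ModelSaver` does.

```python
import numpy as np

# Hypothetical per-fold phishing probabilities for three test URLs.
fold_test_preds = [
    np.array([0.90, 0.20, 0.65]),
    np.array([0.80, 0.10, 0.45]),
    np.array([0.85, 0.30, 0.55]),
]

# Average across folds (axis 0), then apply the same 0.5 cut-off used for validation.
mean_probs = np.mean(np.vstack(fold_test_preds), axis=0)
labels = (mean_probs > 0.5).astype(int)

print(mean_probs)
print(labels)
```

Averaging smooths out fold-to-fold variation: the third URL is below the cut-off in one fold but is still labelled positive by the averaged score.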


### 1. Random Forest

#### 1.1. Numeric Features

In [13]:
# Default params
rf_params = {
    'random_state': SEED,
    'verbose': 0
}

run_tree_experiment(
    RandomForestClassifier, 
    "RandomForest", 
    rf_params, 
    "exp_2_random_forest_numeric", 
    X_train=X_train_df, 
    y_train=y_train, 
    X_test=X_test_df, 
    numeric_features=non_text_cols,
    text_feature=None, 
    save_model=SAVE_MODELS
)


=== Running Experiment: exp_2_random_forest_numeric (RandomForest) ===
Saving Model: True
Experiment 'exp_2_random_forest_numeric' initialized at: experiments/exp_2_random_forest_numeric
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Fold 1 Val AUC: 0.9739
  Fold 1/5 saved | ROC AUC: 0.9739

--- Fold 2/5 ---
Fold 2 Val AUC: 0.9716
  Fold 2/5 saved | ROC AUC: 0.9716

--- Fold 3/5 ---
Fold 3 Val AUC: 0.9705
  Fold 3/5 saved | ROC AUC: 0.9705

--- Fold 4/5 ---
Fold 4 Val AUC: 0.9700
  Fold 4/5 saved | ROC AUC: 0.9700

--- Fold 5/5 ---
Fold 5 Val AUC: 0.9707
  Fold 5/5 saved | ROC AUC: 0.9707

Finalizing experiment...
  Predictions saved to experiments/exp_2_random_forest_numeric/exp_



#### 1.2. TF-IDF + SVD + Engineered Features

In [14]:
run_tree_experiment(
    RandomForestClassifier, 
    "RandomForest", 
    rf_params, 
    "exp_2_random_forest_all", 
    X_train=X_train_df, 
    y_train=y_train, 
    X_test=X_test_df, 
    numeric_features=non_text_cols, 
    text_feature='url', 
    save_model=SAVE_MODELS
)


=== Running Experiment: exp_2_random_forest_all (RandomForest) ===
Saving Model: True
Experiment 'exp_2_random_forest_all' initialized at: experiments/exp_2_random_forest_all
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Fold 1 Val AUC: 0.9822
  Fold 1/5 saved | ROC AUC: 0.9822

--- Fold 2/5 ---
Fold 2 Val AUC: 0.9794
  Fold 2/5 saved | ROC AUC: 0.9794

--- Fold 3/5 ---
Fold 3 Val AUC: 0.9809
  Fold 3/5 saved | ROC AUC: 0.9809

--- Fold 4/5 ---
Fold 4 Val AUC: 0.9838
  Fold 4/5 saved | ROC AUC: 0.9838

--- Fold 5/5 ---
Fold 5 Val AUC: 0.9784
  Fold 5/5 saved | ROC AUC: 0.9784

Finalizing experiment...
  Predictions saved to experiments/exp_2_random_forest_all/exp_2_random_forest_



Since the combined features outperformed the numeric-only features for our baseline Random Forest (per-fold validation ROC AUC of 0.9784 to 0.9838 versus 0.9700 to 0.9739), we will use them for the rest of the tree-based models as well.
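
The size of that gap can be checked directly from the per-fold values printed above. The snippet below uses only those logged numbers. The population standard deviation is used for the spread, which is an assumption about how the experiment summaries compute their ± figures.

```python
from statistics import mean, pstdev

# Validation ROC AUC per fold, copied from the two Random Forest runs above.
rf_numeric = [0.9739, 0.9716, 0.9705, 0.9700, 0.9707]
rf_combined = [0.9822, 0.9794, 0.9809, 0.9838, 0.9784]

for name, scores in [("numeric", rf_numeric), ("combined", rf_combined)]:
    print(f"{name:>8}: {mean(scores):.4f} ± {pstdev(scores):.4f}")

gain = mean(rf_combined) - mean(rf_numeric)
print(f"Mean AUC gain from adding TF-IDF + SVD: {gain:.4f}")
```

Every combined-feature fold scores higher than every numeric-only fold, so the improvement is not driven by a single lucky split.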

### 2. XGBoost

In [15]:
# Default XGBoost params
xgb_params = {
    'random_state': SEED,
    'verbosity': 0
}

run_tree_experiment(
    XGBClassifier, 
    "XGBoost", 
    xgb_params, 
    "exp_2_xgboost_all", 
    X_train=X_train_df, 
    y_train=y_train, 
    X_test=X_test_df, 
    numeric_features=non_text_cols, 
    text_feature='url', 
    save_model=SAVE_MODELS
)


=== Running Experiment: exp_2_xgboost_all (XGBoost) ===
Saving Model: True
Experiment 'exp_2_xgboost_all' initialized at: experiments/exp_2_xgboost_all
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Fold 1 Val AUC: 0.9870
  Fold 1/5 saved | ROC AUC: 0.9870

--- Fold 2/5 ---
Fold 2 Val AUC: 0.9837
  Fold 2/5 saved | ROC AUC: 0.9837

--- Fold 3/5 ---
Fold 3 Val AUC: 0.9833
  Fold 3/5 saved | ROC AUC: 0.9833

--- Fold 4/5 ---
Fold 4 Val AUC: 0.9858
  Fold 4/5 saved | ROC AUC: 0.9858

--- Fold 5/5 ---
Fold 5 Val AUC: 0.9818
  Fold 5/5 saved | ROC AUC: 0.9818

Finalizing experiment...
  Predictions saved to experiments/exp_2_xgboost_all/exp_2_xgboost_all_prediction.csv

✓ Experiment 'exp_2_xgboost_all' finalized!
  Location: experiments/exp_2_xgboost_all
  Folds completed: 5
  Best fold: 1 (ROC AUC: 0.9870)
  Average ROC AUC: 0.9843 ± 0.0019
Experiment saved to experiments/exp_2_xgboost_all




### 3. LightGBM

In [16]:
# Default LightGBM params
lgbm_params = {
    'random_state': SEED,
    'verbose': -1
}

run_tree_experiment(
    LGBMClassifier, 
    "LightGBM", 
    lgbm_params, 
    "exp_2_lgbm_all", 
    X_train=X_train_df, 
    y_train=y_train, 
    X_test=X_test_df, 
    numeric_features=non_text_cols, 
    text_feature='url', 
    save_model=SAVE_MODELS
)


=== Running Experiment: exp_2_lgbm_all (LightGBM) ===
Saving Model: True
Experiment 'exp_2_lgbm_all' initialized at: experiments/exp_2_lgbm_all
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Fold 1 Val AUC: 0.9855
  Fold 1/5 saved | ROC AUC: 0.9855

--- Fold 2/5 ---
Fold 2 Val AUC: 0.9840
  Fold 2/5 saved | ROC AUC: 0.9840

--- Fold 3/5 ---
Fold 3 Val AUC: 0.9820
  Fold 3/5 saved | ROC AUC: 0.9820

--- Fold 4/5 ---
Fold 4 Val AUC: 0.9841
  Fold 4/5 saved | ROC AUC: 0.9841

--- Fold 5/5 ---
Fold 5 Val AUC: 0.9817
  Fold 5/5 saved | ROC AUC: 0.9817

Finalizing experiment...
  Predictions saved to experiments/exp_2_lgbm_all/exp_2_lgbm_all_prediction.csv

✓ Experiment 'exp_2_lgbm_all' finalized!
  Location: experiments/exp_2_lgbm_all
  Folds completed: 5
  Best fold: 1 (ROC AUC: 0.9855)
  Average ROC AUC: 0.9834 ± 0.0014
Experiment saved to experiments/exp_2_lgbm_all




### 4. CatBoost

In [17]:
# Default CatBoost params
catboost_params = {
    'random_state': SEED,
    'verbose': 0
}

run_tree_experiment(
    CatBoostClassifier, 
    "CatBoost", 
    catboost_params, 
    "exp_2_catboost_all", 
    X_train=X_train_df, 
    y_train=y_train, 
    X_test=X_test_df, 
    numeric_features=non_text_cols, 
    text_feature='url', 
    save_model=SAVE_MODELS
)


=== Running Experiment: exp_2_catboost_all (CatBoost) ===
Saving Model: True
Experiment 'exp_2_catboost_all' initialized at: experiments/exp_2_catboost_all
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Fold 1 Val AUC: 0.9862
  Fold 1/5 saved | ROC AUC: 0.9862

--- Fold 2/5 ---
Fold 2 Val AUC: 0.9857
  Fold 2/5 saved | ROC AUC: 0.9857

--- Fold 3/5 ---
Fold 3 Val AUC: 0.9843
  Fold 3/5 saved | ROC AUC: 0.9843

--- Fold 4/5 ---
Fold 4 Val AUC: 0.9878
  Fold 4/5 saved | ROC AUC: 0.9878

--- Fold 5/5 ---
Fold 5 Val AUC: 0.9829
  Fold 5/5 saved | ROC AUC: 0.9829

Finalizing experiment...
  Predictions saved to experiments/exp_2_catboost_all/exp_2_catboost_all_prediction.csv

✓ Experiment 'exp_2_catboost_all' finalized!
  Location: experiments/exp_2_catboost_all
  Folds completed: 5
  Best fold: 4 (ROC AUC: 0.9878)
  Average ROC AUC: 0.9854 ± 0.0017
Experiment saved to experiments/exp_2_catboost_all




In [None]:
# Pre-calculate features for Optuna to speed up tuning
print("Pre-calculating features for Optuna...")

# 1. Text Features (TF-IDF + SVD)
tfidf = TfidfVectorizer(max_features=5000, analyzer='char', ngram_range=(3, 5))
svd = SafeTruncatedSVD(n_components=100, random_state=SEED)

X_text_tfidf = tfidf.fit_transform(X_train_df['url'])
X_text_svd = svd.fit_transform(X_text_tfidf)

# 2. Numeric Features
X_numeric = X_train_df[non_text_cols].values

# 3. Combine
X_combined = np.hstack([X_text_svd, X_numeric])
y = y_train

print(f"Combined features shape: {X_combined.shape}")

Pre-calculating features for Optuna...
Combined features shape: (9143, 172)


## Optuna Hyperparameter Tuning

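Now we can perform hyperparameter tuning using Optuna for the best tree-based model, CatBoost, which had the highest average ROC AUC (0.9854) among the models above. The search tunes CatBoost on the combined features (TF-IDF + SVD + numeric) and also tunes the TF-IDF and SVD settings.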
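
Because every experiment used the same seeded `StratifiedKFold` splits, the boosted models can also be compared fold by fold. The sketch below counts how often CatBoost comes out ahead, using only the per-fold ROC AUC values logged above. The helper `fold_wins` is a hypothetical name introduced for this illustration.

```python
# Per-fold validation ROC AUC copied from the experiment logs above.
xgboost = [0.9870, 0.9837, 0.9833, 0.9858, 0.9818]
lightgbm = [0.9855, 0.9840, 0.9820, 0.9841, 0.9817]
catboost = [0.9862, 0.9857, 0.9843, 0.9878, 0.9829]


def fold_wins(candidate, baseline):
    """Count folds where candidate scores strictly higher than baseline."""
    return sum(c > b for c, b in zip(candidate, baseline))


print("CatBoost beats XGBoost on", fold_wins(catboost, xgboost), "of 5 folds")
print("CatBoost beats LightGBM on", fold_wins(catboost, lightgbm), "of 5 folds")
```

The margins are small (about 0.001 to 0.004 AUC per fold), so CatBoost is a reasonable but not overwhelming choice for tuning.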

In [32]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.exceptions import TrialPruned

# Caches for TF-IDF and SVD transforms so repeated combinations don't recompute
# across Optuna trials. This is an in-memory cache and will persist while the
# kernel/notebook session is active.
tfidf_cache = {}
svd_cache = {}

print(f"Optuna version: {optuna.__version__}")


Optuna version: 4.6.0
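
The two dictionaries act as memoization tables keyed by tuples of hyperparameters. The sketch below shows the pattern on its own, with a dummy function standing in for the expensive TF-IDF fit. All names in it (`expensive_transform`, `cached_transform`, `calls`) are hypothetical.

```python
cache = {}
calls = {"n": 0}  # counts how often the expensive step actually runs


def expensive_transform(max_features, ngram_range):
    """Stand-in for fitting TF-IDF; returns a dummy result."""
    calls["n"] += 1
    return f"matrix(max_features={max_features}, ngram_range={ngram_range})"


def cached_transform(max_features, ngram_range):
    key = (max_features, ngram_range)  # tuples are hashable, so usable as dict keys
    if key not in cache:
        cache[key] = expensive_transform(max_features, ngram_range)
    return cache[key]


cached_transform(5000, (3, 5))
cached_transform(5000, (3, 5))  # second call is served from the cache
cached_transform(1000, (1, 3))
print(calls["n"])  # 2
```

The cache only pays off when different trials land on the same key, which is more likely for the coarse TF-IDF settings than for the finer-grained SVD component count.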


In [33]:
def objective(trial):
    # -------------------------
    # Hyperparameter Search Space
    # -------------------------
    params = {
        'iterations': trial.suggest_int('iterations', 300, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3, log=True),

        # Depth (CatBoost supports depth 1–16 but 4–12 is typically optimal)
        'depth': trial.suggest_int('depth', 3, 12),

        # L2 regularization
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-6, 200.0, log=True),

        # Bootstrap
        'bootstrap_type': trial.suggest_categorical(
            'bootstrap_type', 
            ['Bayesian', 'Bernoulli', 'MVS']
        ),

        # Feature bagging / randomness
        'random_strength': trial.suggest_float('random_strength', 1e-8, 50.0, log=True),

        # Leaf estimation
        'leaf_estimation_iterations': trial.suggest_int('leaf_estimation_iterations', 1, 10),

        # Growing policy
        'grow_policy': trial.suggest_categorical(
            'grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']
        ),

        # Other CatBoost settings
        'task_type': 'CPU',
        'eval_metric': 'AUC',
        'use_best_model': True,
        'random_seed': SEED,
        'verbose': False
    }

    # If Bayesian bootstrap → bagging_temperature is meaningful
    if params['bootstrap_type'] == 'Bayesian':
        params['bagging_temperature'] = trial.suggest_float('bagging_temperature', 0.0, 10.0)
    elif params['bootstrap_type'] == 'Bernoulli':
        params['subsample'] = trial.suggest_float('subsample', 0.5, 1.0)

    # -------------------------
    # TF-IDF & SVD Search Space
    # -------------------------
    # Number of TF-IDF features (chars) and n-gram range
    max_features = trial.suggest_categorical("max_features", [1000, 5000, 10000])

    # Choose min ngram size then choose max ngram size >= min
    ngram_min = trial.suggest_int("ngram_min", 1, 3)
    ngram_max = trial.suggest_int("ngram_max", ngram_min, 5)
    ngram_range = (ngram_min, ngram_max)

    # TruncatedSVD components for dimensionality reduction
    n_svd_components = trial.suggest_int("n_svd_components", 25, 200)

    # -------------------------
    # Cross-Validation
    # -------------------------
    # Use caching for TF-IDF and SVD transforms to avoid recomputation across
    # trials of the same hyperparameter combinations.
    tfidf_key = (max_features, ngram_range)
    if tfidf_key in tfidf_cache:
        X_text_tfidf_trial = tfidf_cache[tfidf_key]
    else:
        tfidf_trial = TfidfVectorizer(max_features=max_features, analyzer='char', ngram_range=ngram_range)
        X_text_tfidf_trial = tfidf_trial.fit_transform(X_train_df['url'])
        tfidf_cache[tfidf_key] = X_text_tfidf_trial

    # Clamp SVD components based on actual TF-IDF dimensionality to avoid
    # TruncatedSVD errors when trial suggests more components than available
    # features.
    max_possible_svd = X_text_tfidf_trial.shape[1]
    safe_n_svd = max(1, min(n_svd_components, max_possible_svd))

    svd_key = (max_features, ngram_range, safe_n_svd)
    if svd_key in svd_cache:
        X_text_svd_trial = svd_cache[svd_key]
    else:
        svd_trial = SafeTruncatedSVD(n_components=n_svd_components, random_state=SEED)
        X_text_svd_trial = svd_trial.fit_transform(X_text_tfidf_trial)
        # Record the number of components actually used by the SafeTruncatedSVD
        actual_svd_used = getattr(svd_trial, 'n_components_', safe_n_svd)
        # Use the actual number in the cache key so future trials reuse the correct matrix
        svd_cache[(max_features, ngram_range, actual_svd_used)] = X_text_svd_trial

    X_numeric_trial = X_train_df[non_text_cols].values

    X_combined_trial = np.hstack([X_text_svd_trial, X_numeric_trial])

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    cv_scores = []

    for train_idx, val_idx in skf.split(X_combined_trial, y):
        X_train, X_val = X_combined_trial[train_idx], X_combined_trial[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model = CatBoostClassifier(**params)

        model.fit(
            X_train, y_train,
            eval_set=(X_val, y_val),
            early_stopping_rounds=100,
            verbose=False
        )

        val_probs = model.predict_proba(X_val)[:, 1]
        roc_auc = roc_auc_score(y_val, val_probs)
        cv_scores.append(roc_auc)

        # Tell Optuna the fold's intermediate value for pruning
        trial.report(roc_auc, step=len(cv_scores))

        if trial.should_prune():
            raise optuna.TrialPruned()

    return np.mean(cv_scores)
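
The `trial.report` / `trial.should_prune` pair lets a pruner stop unpromising trials early. As a rough illustration of the median rule, the sketch below prunes a trial when its best value so far falls below the median of what earlier trials reported at the same step. It is a simplified pure-Python approximation, not Optuna's implementation: it ignores start-up trials and warm-up steps, and the AUC values are made up.

```python
from statistics import median


def should_prune(current_best, previous_at_step):
    """Median rule for a maximised metric: prune if below the median so far."""
    if not previous_at_step:  # nothing to compare against yet
        return False
    return current_best < median(previous_at_step)


# Hypothetical fold-1 AUCs reported by earlier trials.
earlier_fold1 = [0.977, 0.983, 0.982, 0.937, 0.981]

print(should_prune(0.950, earlier_fold1))  # True: below the median of 0.981
print(should_prune(0.985, earlier_fold1))  # False
```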

In [34]:
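# Note: `objective` reports folds as steps 1..5, so with n_warmup_steps=5 the
# pruner can only act once the final fold has been reported.
study = optuna.create_study(direction='maximize',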
                            sampler=TPESampler(seed=SEED),
                            pruner=MedianPruner(n_startup_trials=10, n_warmup_steps=5))
study.optimize(objective, n_trials=60, show_progress_bar=True)

print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

[I 2025-11-21 14:20:35,514] A new study created in memory with name: no-name-6dc2d7a3-825d-4b6d-b37f-8637f0cecf03
[I 2025-11-21 14:21:46,923] Trial 0 finished with value: 0.9766854623730424 and parameters: {'iterations': 937, 'learning_rate': 0.20218499516556737, 'depth': 10, 'l2_leaf_reg': 0.09321419094969498, 'bootstrap_type': 'Bayesian', 'random_strength': 2.517772329704955, 'leaf_estimation_iterations': 7, 'grow_policy': 'Lossguide', 'bagging_temperature': 8.324426408004218, 'max_features': 1000, 'ngram_min': 1, 'ngram_max': 3, 'n_svd_components': 101}. Best is trial 0 with value: 0.9766854623730424.


[I 2025-11-21 14:22:06,223] Trial 1 finished with value: 0.9829309153655004 and parameters: {'iterations': 795, 'learning_rate': 0.013411788774467914, 'depth': 4, 'l2_leaf_reg': 0.00026613469190492437, 'bootstrap_type': 'MVS', 'random_strength': 8.642313649427712e-07, 'leaf_estimation_iterations': 6, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 3, 'ngram_max': 5, 'n_svd_components': 78}. Best is trial 1 with value: 0.9829309153655004.


[I 2025-11-21 14:22:39,592] Trial 2 finished with value: 0.9824669573936167 and parameters: {'iterations': 466, 'learning_rate': 0.023942042625390268, 'depth': 7, 'l2_leaf_reg': 1.0304882592628883e-05, 'bootstrap_type': 'MVS', 'random_strength': 3.235186184144167e-06, 'leaf_estimation_iterations': 7, 'grow_policy': 'Lossguide', 'max_features': 5000, 'ngram_min': 3, 'ngram_max': 5, 'n_svd_components': 130}. Best is trial 1 with value: 0.9829309153655004.


[I 2025-11-21 14:22:46,051] Trial 3 finished with value: 0.9371991075885688 and parameters: {'iterations': 1868, 'learning_rate': 0.00020309496639348267, 'depth': 4, 'l2_leaf_reg': 2.3737396376160788e-06, 'bootstrap_type': 'Bernoulli', 'random_strength': 1.0911896618348131, 'leaf_estimation_iterations': 4, 'grow_policy': 'Depthwise', 'subsample': 0.9010984903770198, 'max_features': 5000, 'ngram_min': 1, 'ngram_max': 1, 'n_svd_components': 168}. Best is trial 1 with value: 0.9829309153655004.


[I 2025-11-21 14:24:20,777] Trial 4 finished with value: 0.9817991302681701 and parameters: {'iterations': 1502, 'learning_rate': 0.03426465148409957, 'depth': 10, 'l2_leaf_reg': 4.117625787042547e-06, 'bootstrap_type': 'MVS', 'random_strength': 0.011100686452941145, 'leaf_estimation_iterations': 4, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 2, 'n_svd_components': 150}. Best is trial 1 with value: 0.9829309153655004.


[I 2025-11-21 14:28:10,779] Trial 5 finished with value: 0.9635169980706377 and parameters: {'iterations': 1594, 'learning_rate': 0.00894599958022983, 'depth': 10, 'l2_leaf_reg': 0.012560648348468359, 'bootstrap_type': 'Bayesian', 'random_strength': 1.1128476528057723e-07, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'bagging_temperature': 9.07566473926093, 'max_features': 10000, 'ngram_min': 1, 'ngram_max': 1, 'n_svd_components': 75}. Best is trial 1 with value: 0.9829309153655004.


[I 2025-11-21 14:31:48,248] Trial 6 finished with value: 0.9833962294561289 and parameters: {'iterations': 574, 'learning_rate': 0.170872222175711, 'depth': 11, 'l2_leaf_reg': 0.18109380011293338, 'bootstrap_type': 'Bayesian', 'random_strength': 4.538401911292445, 'leaf_estimation_iterations': 6, 'grow_policy': 'Depthwise', 'bagging_temperature': 1.1005192452767676, 'max_features': 10000, 'ngram_min': 3, 'ngram_max': 3, 'n_svd_components': 114}. Best is trial 6 with value: 0.9833962294561289.


[I 2025-11-21 14:31:53,719] Trial 7 finished with value: 0.9334480894202146 and parameters: {'iterations': 1010, 'learning_rate': 0.0005919646713104138, 'depth': 4, 'l2_leaf_reg': 0.0006346783219386272, 'bootstrap_type': 'Bayesian', 'random_strength': 0.0658506088548387, 'leaf_estimation_iterations': 4, 'grow_policy': 'SymmetricTree', 'bagging_temperature': 4.972485058923855, 'max_features': 1000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 34}. Best is trial 6 with value: 0.9833962294561289.


[I 2025-11-21 14:32:07,971] Trial 8 finished with value: 0.9821926807422006 and parameters: {'iterations': 773, 'learning_rate': 0.14392976524142465, 'depth': 5, 'l2_leaf_reg': 1.5950587338700585e-05, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.03303866673140829, 'leaf_estimation_iterations': 8, 'grow_policy': 'Depthwise', 'subsample': 0.8161529152967897, 'max_features': 1000, 'ngram_min': 3, 'ngram_max': 3, 'n_svd_components': 57}. Best is trial 6 with value: 0.9833962294561289.


[I 2025-11-21 14:34:23,259] Trial 9 finished with value: 0.977138505721458 and parameters: {'iterations': 369, 'learning_rate': 0.01133982655605009, 'depth': 9, 'l2_leaf_reg': 1.3730807086609565e-06, 'bootstrap_type': 'MVS', 'random_strength': 4.911054624152157e-07, 'leaf_estimation_iterations': 7, 'grow_policy': 'Depthwise', 'max_features': 10000, 'ngram_min': 3, 'ngram_max': 3, 'n_svd_components': 141}. Best is trial 6 with value: 0.9833962294561289.


Best trial: 6. Best value: 0.983396:  18%|█▊        | 11/60 [19:51<2:23:31, 175.74s/it]

[I 2025-11-21 14:40:26,636] Trial 10 pruned. 


Best trial: 11. Best value: 0.986129:  20%|██        | 12/60 [20:32<1:47:52, 134.84s/it]

[I 2025-11-21 14:41:07,958] Trial 11 finished with value: 0.986129205122628 and parameters: {'iterations': 1211, 'learning_rate': 0.06144852846196334, 'depth': 7, 'l2_leaf_reg': 1.8205682678119093, 'bootstrap_type': 'MVS', 'random_strength': 1.22217038178843e-08, 'leaf_estimation_iterations': 5, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 3, 'ngram_max': 5, 'n_svd_components': 96}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  22%|██▏       | 13/60 [21:27<1:26:42, 110.70s/it]

[I 2025-11-21 14:42:03,102] Trial 12 finished with value: 0.985803377643591 and parameters: {'iterations': 1275, 'learning_rate': 0.07559197067362484, 'depth': 7, 'l2_leaf_reg': 10.738416054181831, 'bootstrap_type': 'MVS', 'random_strength': 49.293120933132414, 'leaf_estimation_iterations': 2, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 3, 'ngram_max': 5, 'n_svd_components': 110}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  23%|██▎       | 14/60 [22:31<1:14:04, 96.61s/it] 

[I 2025-11-21 14:43:07,154] Trial 13 finished with value: 0.9844198374609396 and parameters: {'iterations': 1321, 'learning_rate': 0.06452731339943558, 'depth': 7, 'l2_leaf_reg': 42.291234850837256, 'bootstrap_type': 'MVS', 'random_strength': 0.0002891868816110248, 'leaf_estimation_iterations': 2, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 3, 'ngram_max': 5, 'n_svd_components': 95}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  25%|██▌       | 15/60 [23:22<1:02:11, 82.92s/it]

[I 2025-11-21 14:43:58,333] Trial 14 finished with value: 0.9854244634663887 and parameters: {'iterations': 1187, 'learning_rate': 0.06031452325555029, 'depth': 6, 'l2_leaf_reg': 3.3228236338402066, 'bootstrap_type': 'MVS', 'random_strength': 1.1116658854427293e-08, 'leaf_estimation_iterations': 2, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 122}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  27%|██▋       | 16/60 [24:25<56:26, 76.96s/it]  

[I 2025-11-21 14:45:01,477] Trial 15 finished with value: 0.9760316619058604 and parameters: {'iterations': 1444, 'learning_rate': 0.0017626515885854367, 'depth': 8, 'l2_leaf_reg': 6.495365120741536, 'bootstrap_type': 'MVS', 'random_strength': 2.9289404808546025e-05, 'leaf_estimation_iterations': 3, 'grow_policy': 'Lossguide', 'max_features': 5000, 'ngram_min': 3, 'ngram_max': 5, 'n_svd_components': 55}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  28%|██▊       | 17/60 [24:51<44:09, 61.63s/it]

[I 2025-11-21 14:45:27,437] Trial 16 finished with value: 0.9859962521095653 and parameters: {'iterations': 1941, 'learning_rate': 0.07068128692057088, 'depth': 6, 'l2_leaf_reg': 0.852954127145171, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.0024416030895080655, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.5034631172155324, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 93}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  30%|███       | 18/60 [25:35<39:20, 56.20s/it]

[I 2025-11-21 14:46:11,000] Trial 17 finished with value: 0.9809906354545328 and parameters: {'iterations': 1990, 'learning_rate': 0.004424257646350249, 'depth': 6, 'l2_leaf_reg': 0.5420498648448743, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.0017915993554375592, 'leaf_estimation_iterations': 10, 'grow_policy': 'SymmetricTree', 'subsample': 0.5075106058128437, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 83}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  32%|███▏      | 19/60 [25:45<28:59, 42.42s/it]

[I 2025-11-21 14:46:21,327] Trial 18 finished with value: 0.9785766705659988 and parameters: {'iterations': 1778, 'learning_rate': 0.02719543115795785, 'depth': 3, 'l2_leaf_reg': 0.011567881024102061, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.0014184820491679516, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.5102909300514106, 'max_features': 1000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 59}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  33%|███▎      | 20/60 [26:22<27:12, 40.82s/it]

[I 2025-11-21 14:46:58,407] Trial 19 finished with value: 0.9840635884942837 and parameters: {'iterations': 1699, 'learning_rate': 0.26744980254532535, 'depth': 8, 'l2_leaf_reg': 190.55790345266533, 'bootstrap_type': 'Bernoulli', 'random_strength': 1.7379366661673426e-05, 'leaf_estimation_iterations': 9, 'grow_policy': 'SymmetricTree', 'subsample': 0.6739882765223845, 'max_features': 5000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 27}. Best is trial 11 with value: 0.986129205122628.


Best trial: 11. Best value: 0.986129:  35%|███▌      | 21/60 [26:44<22:52, 35.20s/it]

[I 2025-11-21 14:47:20,515] Trial 20 pruned. 


Best trial: 11. Best value: 0.986129:  37%|███▋      | 22/60 [27:33<24:53, 39.31s/it]

[I 2025-11-21 14:48:09,402] Trial 21 finished with value: 0.985617025525305 and parameters: {'iterations': 1268, 'learning_rate': 0.08308476677787864, 'depth': 7, 'l2_leaf_reg': 9.625681126006175, 'bootstrap_type': 'MVS', 'random_strength': 38.21531828774961, 'leaf_estimation_iterations': 2, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 3, 'ngram_max': 5, 'n_svd_components': 106}. Best is trial 11 with value: 0.986129205122628.


Best trial: 22. Best value: 0.98624:  38%|███▊      | 23/60 [28:09<23:31, 38.15s/it] 

[I 2025-11-21 14:48:44,851] Trial 22 finished with value: 0.9862395510108788 and parameters: {'iterations': 1369, 'learning_rate': 0.0776581239764595, 'depth': 5, 'l2_leaf_reg': 0.07018664741539302, 'bootstrap_type': 'MVS', 'random_strength': 0.38380522262621763, 'leaf_estimation_iterations': 3, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 132}. Best is trial 22 with value: 0.9862395510108788.


Best trial: 22. Best value: 0.98624:  40%|████      | 24/60 [28:53<23:55, 39.88s/it]

[I 2025-11-21 14:49:28,750] Trial 23 finished with value: 0.9850182026824253 and parameters: {'iterations': 1107, 'learning_rate': 0.03714603343297576, 'depth': 5, 'l2_leaf_reg': 0.07088787690547108, 'bootstrap_type': 'MVS', 'random_strength': 0.12740641082112356, 'leaf_estimation_iterations': 3, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 154}. Best is trial 22 with value: 0.9862395510108788.


Best trial: 22. Best value: 0.98624:  42%|████▏     | 25/60 [29:10<19:23, 33.23s/it]

[I 2025-11-21 14:49:46,476] Trial 24 finished with value: 0.9856758056720301 and parameters: {'iterations': 1421, 'learning_rate': 0.09103528952221822, 'depth': 5, 'l2_leaf_reg': 0.0010389519315000724, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.39949249978566537, 'leaf_estimation_iterations': 5, 'grow_policy': 'SymmetricTree', 'subsample': 0.9719575953274973, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 129}. Best is trial 22 with value: 0.9862395510108788.


Best trial: 22. Best value: 0.98624:  43%|████▎     | 26/60 [29:40<18:17, 32.27s/it]

[I 2025-11-21 14:50:16,502] Trial 25 finished with value: 0.982402209068507 and parameters: {'iterations': 1567, 'learning_rate': 0.016656373557577323, 'depth': 3, 'l2_leaf_reg': 0.8683283745631538, 'bootstrap_type': 'MVS', 'random_strength': 0.012475659004089387, 'leaf_estimation_iterations': 3, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 173}. Best is trial 22 with value: 0.9862395510108788.


Best trial: 26. Best value: 0.986253:  45%|████▌     | 27/60 [29:55<14:45, 26.84s/it]

[I 2025-11-21 14:50:30,662] Trial 26 finished with value: 0.9862530310876159 and parameters: {'iterations': 1898, 'learning_rate': 0.12493198024132458, 'depth': 6, 'l2_leaf_reg': 0.04539261143050395, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.0006215956700117067, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.6538808638903314, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 69}. Best is trial 26 with value: 0.9862530310876159.


Best trial: 26. Best value: 0.986253:  47%|████▋     | 28/60 [30:13<12:55, 24.23s/it]

[I 2025-11-21 14:50:48,800] Trial 27 finished with value: 0.9849052465991915 and parameters: {'iterations': 1753, 'learning_rate': 0.27499124086021665, 'depth': 9, 'l2_leaf_reg': 0.034890958578164814, 'bootstrap_type': 'MVS', 'random_strength': 0.00042253752334996467, 'leaf_estimation_iterations': 3, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 1, 'ngram_max': 5, 'n_svd_components': 68}. Best is trial 26 with value: 0.9862530310876159.


Best trial: 26. Best value: 0.986253:  48%|████▊     | 29/60 [30:19<09:47, 18.94s/it]

[I 2025-11-21 14:50:55,412] Trial 28 finished with value: 0.9845684369916435 and parameters: {'iterations': 1112, 'learning_rate': 0.11801360414468982, 'depth': 5, 'l2_leaf_reg': 0.002184685147652068, 'bootstrap_type': 'MVS', 'random_strength': 0.00013521591662114539, 'leaf_estimation_iterations': 5, 'grow_policy': 'SymmetricTree', 'max_features': 5000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 43}. Best is trial 26 with value: 0.9862530310876159.


Best trial: 26. Best value: 0.986253:  50%|█████     | 30/60 [31:16<15:10, 30.35s/it]

[I 2025-11-21 14:51:52,396] Trial 29 finished with value: 0.9805619224410493 and parameters: {'iterations': 836, 'learning_rate': 0.04418649259470406, 'depth': 8, 'l2_leaf_reg': 0.21808464477022932, 'bootstrap_type': 'Bernoulli', 'random_strength': 7.483163307291707, 'leaf_estimation_iterations': 4, 'grow_policy': 'Lossguide', 'subsample': 0.6900270598997096, 'max_features': 1000, 'ngram_min': 1, 'ngram_max': 2, 'n_svd_components': 137}. Best is trial 26 with value: 0.9862530310876159.


Best trial: 26. Best value: 0.986253:  52%|█████▏    | 31/60 [32:36<21:45, 45.03s/it]

[I 2025-11-21 14:53:11,657] Trial 30 pruned. 


Best trial: 31. Best value: 0.986583:  53%|█████▎    | 32/60 [32:52<17:00, 36.46s/it]

[I 2025-11-21 14:53:28,122] Trial 31 finished with value: 0.9865829397436123 and parameters: {'iterations': 1955, 'learning_rate': 0.18921420337878742, 'depth': 6, 'l2_leaf_reg': 0.061193623928132775, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.003288105994975753, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.5942450426078664, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 89}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  55%|█████▌    | 33/60 [33:02<12:53, 28.63s/it]

[I 2025-11-21 14:53:38,496] Trial 32 finished with value: 0.9853938800189967 and parameters: {'iterations': 1869, 'learning_rate': 0.23011251522401283, 'depth': 6, 'l2_leaf_reg': 0.04396865687033217, 'bootstrap_type': 'Bernoulli', 'random_strength': 3.195333636229999e-06, 'leaf_estimation_iterations': 2, 'grow_policy': 'SymmetricTree', 'subsample': 0.6137087693233002, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 71}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  57%|█████▋    | 34/60 [33:16<10:29, 24.19s/it]

[I 2025-11-21 14:53:52,332] Trial 33 finished with value: 0.9860226709144607 and parameters: {'iterations': 1808, 'learning_rate': 0.17100918825960082, 'depth': 5, 'l2_leaf_reg': 0.19378959942484034, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.007382588425264806, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.7755397454957573, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 84}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  58%|█████▊    | 35/60 [33:38<09:45, 23.42s/it]

[I 2025-11-21 14:54:13,938] Trial 34 finished with value: 0.9841929108872997 and parameters: {'iterations': 1935, 'learning_rate': 0.017280961522531207, 'depth': 4, 'l2_leaf_reg': 0.00014308483051049623, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.7971838622217775, 'leaf_estimation_iterations': 3, 'grow_policy': 'SymmetricTree', 'subsample': 0.6014262861408481, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 117}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  60%|██████    | 36/60 [34:12<10:35, 26.50s/it]

[I 2025-11-21 14:54:47,628] Trial 35 finished with value: 0.9846603042902305 and parameters: {'iterations': 1375, 'learning_rate': 0.044896311967959464, 'depth': 7, 'l2_leaf_reg': 0.004127534810487766, 'bootstrap_type': 'MVS', 'random_strength': 7.042257347916371e-06, 'leaf_estimation_iterations': 6, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 65}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  62%|██████▏   | 37/60 [34:38<10:08, 26.46s/it]

[I 2025-11-21 14:55:13,979] Trial 36 finished with value: 0.9823386198416106 and parameters: {'iterations': 1674, 'learning_rate': 0.007637871743675233, 'depth': 6, 'l2_leaf_reg': 0.05780348923573327, 'bootstrap_type': 'Bernoulli', 'random_strength': 8.371738306897286e-07, 'leaf_estimation_iterations': 4, 'grow_policy': 'SymmetricTree', 'subsample': 0.5961546020323498, 'max_features': 5000, 'ngram_min': 3, 'ngram_max': 5, 'n_svd_components': 48}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  63%|██████▎   | 38/60 [35:08<10:08, 27.67s/it]

[I 2025-11-21 14:55:44,474] Trial 37 finished with value: 0.9849790557755705 and parameters: {'iterations': 1527, 'learning_rate': 0.027708281613903703, 'depth': 4, 'l2_leaf_reg': 0.02880257999492868, 'bootstrap_type': 'MVS', 'random_strength': 3.2834460045509575, 'leaf_estimation_iterations': 2, 'grow_policy': 'Lossguide', 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 78}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  65%|██████▌   | 39/60 [35:59<12:06, 34.62s/it]

[I 2025-11-21 14:56:35,302] Trial 38 finished with value: 0.9862121612322884 and parameters: {'iterations': 1184, 'learning_rate': 0.14417221116534543, 'depth': 7, 'l2_leaf_reg': 0.2765116798372541, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.022292416834924816, 'leaf_estimation_iterations': 1, 'grow_policy': 'Depthwise', 'subsample': 0.7522749056280356, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 99}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  67%|██████▋   | 40/60 [36:31<11:13, 33.69s/it]

[I 2025-11-21 14:57:06,820] Trial 39 finished with value: 0.9849964150374338 and parameters: {'iterations': 1868, 'learning_rate': 0.11262101835997881, 'depth': 5, 'l2_leaf_reg': 0.34229411604613036, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.033760435719986666, 'leaf_estimation_iterations': 1, 'grow_policy': 'Depthwise', 'subsample': 0.7347035720917937, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 3, 'n_svd_components': 126}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  68%|██████▊   | 41/60 [37:54<15:22, 48.58s/it]

[I 2025-11-21 14:58:30,139] Trial 40 pruned. 


Best trial: 31. Best value: 0.986583:  70%|███████   | 42/60 [38:31<13:29, 44.96s/it]

[I 2025-11-21 14:59:06,669] Trial 41 finished with value: 0.9852098034486024 and parameters: {'iterations': 1054, 'learning_rate': 0.18104903143339138, 'depth': 7, 'l2_leaf_reg': 2.622922162888081, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.02258041394270416, 'leaf_estimation_iterations': 2, 'grow_policy': 'Depthwise', 'subsample': 0.7243542842346589, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 98}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  72%|███████▏  | 43/60 [39:18<12:55, 45.63s/it]

[I 2025-11-21 14:59:53,874] Trial 42 finished with value: 0.983197912199854 and parameters: {'iterations': 1189, 'learning_rate': 0.10731308730637283, 'depth': 6, 'l2_leaf_reg': 0.12765650839954312, 'bootstrap_type': 'Bayesian', 'random_strength': 0.08988602088790965, 'leaf_estimation_iterations': 6, 'grow_policy': 'Depthwise', 'bagging_temperature': 2.985998406092822, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 4, 'n_svd_components': 87}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  73%|███████▎  | 44/60 [40:36<14:45, 55.34s/it]

[I 2025-11-21 15:01:11,859] Trial 43 finished with value: 0.9844928221861968 and parameters: {'iterations': 886, 'learning_rate': 0.0482797054642083, 'depth': 7, 'l2_leaf_reg': 1.7949056605261524, 'bootstrap_type': 'Bernoulli', 'random_strength': 1.6086636437797766, 'leaf_estimation_iterations': 1, 'grow_policy': 'Depthwise', 'subsample': 0.5733000617069124, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 3, 'n_svd_components': 114}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  75%|███████▌  | 45/60 [41:07<12:00, 48.02s/it]

[I 2025-11-21 15:01:42,813] Trial 44 finished with value: 0.9852116506796149 and parameters: {'iterations': 1274, 'learning_rate': 0.20048389925765386, 'depth': 6, 'l2_leaf_reg': 0.10133259031410209, 'bootstrap_type': 'Bernoulli', 'random_strength': 11.855789256104105, 'leaf_estimation_iterations': 7, 'grow_policy': 'Depthwise', 'subsample': 0.6657683381193292, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 139}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  77%|███████▋  | 46/60 [41:40<10:12, 43.72s/it]

[I 2025-11-21 15:02:16,484] Trial 45 pruned. 


Best trial: 31. Best value: 0.986583:  78%|███████▊  | 47/60 [41:48<07:05, 32.75s/it]

[I 2025-11-21 15:02:23,643] Trial 46 pruned. 


Best trial: 31. Best value: 0.986583:  80%|████████  | 48/60 [42:50<08:20, 41.70s/it]

[I 2025-11-21 15:03:26,211] Trial 47 finished with value: 0.9828069020609368 and parameters: {'iterations': 1494, 'learning_rate': 0.1437394756487372, 'depth': 8, 'l2_leaf_reg': 0.009074376818070577, 'bootstrap_type': 'MVS', 'random_strength': 8.896454583473539e-08, 'leaf_estimation_iterations': 8, 'grow_policy': 'Depthwise', 'max_features': 5000, 'ngram_min': 3, 'ngram_max': 4, 'n_svd_components': 146}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  82%|████████▏ | 49/60 [43:45<08:22, 45.69s/it]

[I 2025-11-21 15:04:21,207] Trial 48 finished with value: 0.9848668459108703 and parameters: {'iterations': 1040, 'learning_rate': 0.06396937091815429, 'depth': 9, 'l2_leaf_reg': 25.269184817671803, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.000657955432004961, 'leaf_estimation_iterations': 3, 'grow_policy': 'Lossguide', 'subsample': 0.7861817175521112, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 78}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  83%|████████▎ | 50/60 [44:06<06:21, 38.19s/it]

[I 2025-11-21 15:04:41,920] Trial 49 finished with value: 0.9819655228722131 and parameters: {'iterations': 924, 'learning_rate': 0.02159608241806573, 'depth': 5, 'l2_leaf_reg': 0.020550011427716126, 'bootstrap_type': 'MVS', 'random_strength': 6.38641023171963e-05, 'leaf_estimation_iterations': 2, 'grow_policy': 'SymmetricTree', 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 3, 'n_svd_components': 199}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  85%|████████▌ | 51/60 [44:50<05:59, 39.91s/it]

[I 2025-11-21 15:05:25,824] Trial 50 finished with value: 0.9853543516301564 and parameters: {'iterations': 1359, 'learning_rate': 0.09197557867965088, 'depth': 6, 'l2_leaf_reg': 3.7939010195910585, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.0034082466713861743, 'leaf_estimation_iterations': 1, 'grow_policy': 'Lossguide', 'subsample': 0.631240767282233, 'max_features': 10000, 'ngram_min': 3, 'ngram_max': 4, 'n_svd_components': 108}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  87%|████████▋ | 52/60 [45:01<04:10, 31.34s/it]

[I 2025-11-21 15:05:37,190] Trial 51 finished with value: 0.9864925238034423 and parameters: {'iterations': 1858, 'learning_rate': 0.18345701319851812, 'depth': 5, 'l2_leaf_reg': 0.1958858884388215, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.007702813679384675, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.7705594424051627, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 84}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 31. Best value: 0.986583:  88%|████████▊ | 53/60 [45:12<02:57, 25.32s/it]

[I 2025-11-21 15:05:48,460] Trial 52 finished with value: 0.9859241478279847 and parameters: {'iterations': 1956, 'learning_rate': 0.18987124103298753, 'depth': 5, 'l2_leaf_reg': 0.6989035083519433, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.012630133612502653, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.8765693486170589, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 97}. Best is trial 31 with value: 0.9865829397436123.


Best trial: 53. Best value: 0.986626:  90%|█████████ | 54/60 [45:40<02:36, 26.12s/it]

[I 2025-11-21 15:06:16,453] Trial 53 finished with value: 0.9866262837372476 and parameters: {'iterations': 1829, 'learning_rate': 0.05526068316585797, 'depth': 6, 'l2_leaf_reg': 0.08843744861674385, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.05514096618048006, 'leaf_estimation_iterations': 2, 'grow_policy': 'SymmetricTree', 'subsample': 0.554882152952546, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 88}. Best is trial 53 with value: 0.9866262837372476.


Best trial: 53. Best value: 0.986626:  92%|█████████▏| 55/60 [45:47<01:41, 20.30s/it]

[I 2025-11-21 15:06:23,180] Trial 54 finished with value: 0.9847951165700886 and parameters: {'iterations': 1826, 'learning_rate': 0.13470158203716107, 'depth': 4, 'l2_leaf_reg': 0.10373688011934697, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.0010487023047703013, 'leaf_estimation_iterations': 2, 'grow_policy': 'SymmetricTree', 'subsample': 0.5613676101700192, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 62}. Best is trial 53 with value: 0.9866262837372476.


Best trial: 53. Best value: 0.986626:  93%|█████████▎| 56/60 [46:07<01:20, 20.08s/it]

[I 2025-11-21 15:06:42,721] Trial 55 finished with value: 0.986013171280414 and parameters: {'iterations': 1890, 'learning_rate': 0.055057106625156206, 'depth': 6, 'l2_leaf_reg': 0.0035199906764759033, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.06735234984783779, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.5504981299427472, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 77}. Best is trial 53 with value: 0.9866262837372476.


Best trial: 53. Best value: 0.986626:  95%|█████████▌| 57/60 [46:19<00:53, 17.70s/it]

[I 2025-11-21 15:06:54,862] Trial 56 finished with value: 0.9862395130719983 and parameters: {'iterations': 1726, 'learning_rate': 0.08573677770696077, 'depth': 5, 'l2_leaf_reg': 0.27157506434389866, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.024715437479974768, 'leaf_estimation_iterations': 1, 'grow_policy': 'SymmetricTree', 'subsample': 0.7181470120153333, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 88}. Best is trial 53 with value: 0.9866262837372476.


Best trial: 53. Best value: 0.986626:  97%|█████████▋| 58/60 [46:36<00:34, 17.48s/it]

[I 2025-11-21 15:07:11,841] Trial 57 finished with value: 0.9852815602623665 and parameters: {'iterations': 1699, 'learning_rate': 0.03547520915181842, 'depth': 4, 'l2_leaf_reg': 0.04036936644694786, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.143908919691032, 'leaf_estimation_iterations': 2, 'grow_policy': 'SymmetricTree', 'subsample': 0.7078098894103345, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 87}. Best is trial 53 with value: 0.9866262837372476.


Best trial: 53. Best value: 0.986626:  98%|█████████▊| 59/60 [46:50<00:16, 16.54s/it]

[I 2025-11-21 15:07:26,200] Trial 58 finished with value: 0.9822545096050387 and parameters: {'iterations': 1998, 'learning_rate': 0.06903171910910771, 'depth': 3, 'l2_leaf_reg': 0.07345954395867821, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.04309236164173069, 'leaf_estimation_iterations': 2, 'grow_policy': 'SymmetricTree', 'subsample': 0.6339876924645074, 'max_features': 1000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 74}. Best is trial 53 with value: 0.9866262837372476.


Best trial: 53. Best value: 0.986626: 100%|██████████| 60/60 [47:08<00:00, 47.14s/it]

[I 2025-11-21 15:07:44,030] Trial 59 pruned. 
Number of finished trials: 60
Best trial: {'iterations': 1829, 'learning_rate': 0.05526068316585797, 'depth': 6, 'l2_leaf_reg': 0.08843744861674385, 'bootstrap_type': 'Bernoulli', 'random_strength': 0.05514096618048006, 'leaf_estimation_iterations': 2, 'grow_policy': 'SymmetricTree', 'subsample': 0.554882152952546, 'max_features': 10000, 'ngram_min': 2, 'ngram_max': 5, 'n_svd_components': 88}
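
The trial log above shows that the search space is conditional: `subsample` appears only on trials with `bootstrap_type='Bernoulli'`, `bagging_temperature` only on `'Bayesian'` trials, and `'MVS'` trials carry neither. The sketch below uses key sets transcribed from trials 6, 8, and 9 to list such conditional parameters and to check whether any trial holds every parameter. The helper names are illustrative and not part of the notebook. If no trial holds the full set, this may be why Optuna plotting code that needs every parameter on a trial later warns about missing parameters.

```python
# Keys shared by every finished trial in the log above.
COMMON = {
    "iterations", "learning_rate", "depth", "l2_leaf_reg", "bootstrap_type",
    "random_strength", "leaf_estimation_iterations", "grow_policy",
    "max_features", "ngram_min", "ngram_max", "n_svd_components",
}

# Parameter keys transcribed from three logged trials (values omitted).
trial_param_keys = {
    6: COMMON | {"bagging_temperature"},  # bootstrap_type='Bayesian'
    8: COMMON | {"subsample"},            # bootstrap_type='Bernoulli'
    9: COMMON,                            # bootstrap_type='MVS'
}


def conditional_keys(keys_by_trial):
    """Return parameter names that are absent from at least one trial."""
    all_keys = set().union(*keys_by_trial.values())
    shared = set.intersection(*keys_by_trial.values())
    return all_keys - shared


def trials_with_all_keys(keys_by_trial):
    """Return trial numbers whose parameter set covers every observed key."""
    all_keys = set().union(*keys_by_trial.values())
    return [number for number, keys in keys_by_trial.items() if keys >= all_keys]


print(sorted(conditional_keys(trial_param_keys)))
print(trials_with_all_keys(trial_param_keys))
```

With these three trials, the two bootstrap-specific parameters are reported as conditional and no trial covers the full parameter set.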





In [35]:
best_params = study.best_params.copy()
best_params['random_seed'] = SEED
best_params['verbose'] = 0
best_params['task_type'] = 'CPU'

# Derive TF-IDF and SVD best params from Optuna study
best_tfidf_max = best_params.get('max_features', 5000)
best_ngram_min = best_params.get('ngram_min', 3)
best_ngram_max = best_params.get('ngram_max', 5)
best_ngram_range = (best_ngram_min, best_ngram_max)

best_n_svd_components = best_params.get('n_svd_components', 100)

# Filter out text-processing params so that only model params are passed to CatBoost
text_param_keys = {'max_features', 'ngram_min', 'ngram_max', 'n_svd_components'}
model_param_keys = [k for k in best_params if k not in text_param_keys]
catboost_model_params = {k: best_params[k] for k in model_param_keys}

optuna_info = {
    "n_trials": len(study.trials),  # 60 for this study; avoids a hard-coded count
    "best_params": study.best_params,
    "best_value": study.best_value,
    "study_path": "optuna_study.pkl",
    "tfidf_max_features": best_tfidf_max,
    "tfidf_ngram_range": best_ngram_range,
    "n_svd_components": best_n_svd_components
}

print("Running final experiment with best parameters...")
run_tree_experiment(
    CatBoostClassifier, 
    "CatBoost_Optuna", 
    catboost_model_params, 
    "exp_2_catboost_optuna", 
    X_train=X_train_df, 
    y_train=y_train, 
    X_test=X_test_df, 
    numeric_features=non_text_cols, 
    text_feature='url', 
    save_model=SAVE_MODELS,
    optuna_study=study,
    optuna_params=optuna_info,
    n_svd_components=best_n_svd_components,
    tfidf_max_features=best_tfidf_max,
    tfidf_ngram_range=best_ngram_range
)


Running final experiment with best parameters...

=== Running Experiment: exp_2_catboost_optuna (CatBoost_Optuna) ===
Saving Model: True
Experiment 'exp_2_catboost_optuna' initialized at: experiments/exp_2_catboost_optuna
Mode: Incremental saving (5 folds)

--- Fold 1/5 ---
Fold 1 Val AUC: 0.9874
  Fold 1/5 saved | ROC AUC: 0.9874

--- Fold 2/5 ---
Fold 2 Val AUC: 0.9868
  Fold 2/5 saved | ROC AUC: 0.9868

--- Fold 3/5 ---
Fold 3 Val AUC: 0.9860
  Fold 3/5 saved | ROC AUC: 0.9860

--- Fold 4/5 ---
Fold 4 Val AUC: 0.9893
  Fold 4/5 saved | ROC AUC: 0.9893

--- Fold 5/5 ---
Fold 5 Val AUC: 0.9832
  Fold 5/5 saved | ROC AUC: 0.9832

Finalizing experiment...


[W 2025-11-21 15:08:35,423] Your study has only completed trials with missing parameters.


  Optuna plots saved to experiments/exp_2_catboost_optuna/optuna_plots
  Predictions saved to experiments/exp_2_catboost_optuna/exp_2_catboost_optuna_prediction.csv

✓ Experiment 'exp_2_catboost_optuna' finalized!
  Location: experiments/exp_2_catboost_optuna
  Folds completed: 5
  Best fold: 4 (ROC AUC: 0.9893)
  Average ROC AUC: 0.9865 ± 0.0020
Experiment saved to experiments/exp_2_catboost_optuna
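
The reported summary can be cross-checked against the five fold scores printed above. This is a minimal sketch using only the standard library. The fold values are copied from the log at four-decimal precision, so the result is approximate. The code that computes the summary inside `ModelSaver` is not shown here, so the choice of standard deviation is inferred rather than confirmed.

```python
from statistics import mean, pstdev, stdev

# Validation ROC AUC per fold, as printed in the output above.
fold_auc = [0.9874, 0.9868, 0.9860, 0.9893, 0.9832]

avg = mean(fold_auc)
spread_population = pstdev(fold_auc)  # divides by n (ddof=0)
spread_sample = stdev(fold_auc)       # divides by n - 1 (ddof=1)

print(f"mean={avg:.4f}  pstdev={spread_population:.4f}  stdev={spread_sample:.4f}")
# prints: mean=0.9865  pstdev=0.0020  stdev=0.0022
```

The population standard deviation reproduces the logged `0.9865 ± 0.0020`, which suggests the summary divides by the number of folds rather than by one fewer.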


parameter,value
steps,"[('preprocessor', ...), ('classifier', ...)]"
transform_input,None
memory,None
verbose,False

parameter,value
transformers,"[('text', ...), ('numeric', ...)]"
remainder,'drop'
sparse_threshold,0.3
n_jobs,None
transformer_weights,None
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

parameter,value
input,'content'
encoding,'utf-8'
decode_error,'strict'
strip_accents,None
lowercase,True
preprocessor,None
tokenizer,None
analyzer,'char'
stop_words,None
token_pattern,'(?u)\\b\\w\\w+\\b'

parameter,value
n_components,88
random_state,42