# 1. <a id='toc1_'></a>[Features Selection](#toc0_)

This notebook contains the code to select the most important features for the model. 

Here, a filter method is used to select the features. Features are scored using the `feature_importance_` attribute of the model and ranked, and the top 10 are kept. The selected features are then used to train the model.

Filter-based feature selection methods for unsupervised data typically rely on statistical measures or heuristic approaches to rank features based on their intrinsic characteristics rather than on a specific learning algorithm. Here are a few filter-based methods along with how you might associate each selected feature with its importance:

+ Variance Threshold:</br>
Compute the variance of each feature. Features with low variance are less informative and can be removed.
Associate the variance value directly as the feature importance.

+ Correlation Coefficient:</br>
Calculate the correlation coefficient between each pair of features.
Features highly correlated with other features might contain redundant information. You can select one of each highly correlated pair or remove one randomly.
Associate the absolute value of the correlation coefficient as the feature importance.

+ Mutual Information:</br>
Measure the mutual information between each feature and the cluster labels.
Features with high mutual information are more informative for clustering.
Associate the mutual information value as the feature importance.

+ Distance-based Methods:</br>
Compute the distance between instances in the feature space and analyze the distribution of distances.
Features that contribute to larger distances between instances might be more important for clustering.
Associate the distance measure (e.g., mean distance or median distance) as the feature importance.
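
The first two scores in the list above can be sketched on a small synthetic matrix. This is a minimal illustration using NumPy only; the toy columns and their names are assumptions made for the example and are not part of the trajectory dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)

# Hypothetical toy features (not taken from the notebook's data)
X = np.column_stack([
    a,                                     # informative feature
    2.0 * a + 0.01 * rng.normal(size=n),   # near-duplicate of the first column
    rng.normal(size=n),                    # independent feature
    np.full(n, 3.0),                       # constant feature
])
names = ["a", "a_copy", "b", "const"]

# Variance score: one value per feature, zero for the constant column
variances = X.var(axis=0)

# Redundancy score: largest absolute correlation with any other feature.
# Correlation is undefined for constant columns, so they are excluded first.
non_constant = variances > 0
corr = np.abs(np.corrcoef(X[:, non_constant], rowvar=False))
np.fill_diagonal(corr, 0.0)
max_abs_corr = corr.max(axis=0)

for name, var in zip(names, variances):
    print(f"{name:>7}: variance = {var:.4f}")
```

Here `a` and `a_copy` are almost perfectly correlated, so one of the two would be a candidate for removal, while `const` would be dropped by any positive variance threshold.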

+ ``sklearn.feature_selection`` module is used for feature selection/dimensionality reduction.
+ Goal:
  + Improve estimators' accuracy scores
  + Avoid overfitting
  + Reduce the computational cost
  + Improve the comprehensibility of the model
+ There are three main strategies:
  + Univariate statistics: Select the best features based on univariate statistical tests
  + Model-based selection: Use a supervised model to judge the importance of each feature
  + Iterative selection: Build a model on initial features and then iteratively remove the least important feature
+ Feature selection methods can also be categorised into:
  + Filter methods: Select features based on their scores in various statistical tests
  + Wrapper methods: Select features based on the performance of a model trained with the selected features
  + Embedded methods: Select features based on the importance of their contribution to the model
+ Feature selection can be done in five ways:
  + **SelectKBest**: Select features according to the k highest scores
  + **SelectPercentile**: Select features according to a percentile of the highest scores
  + **SelectFpr**: Select features based on a false positive rate test
  + **SelectFdr**: Select features based on an estimated false discovery rate
  + **SelectFwe**: Select features based on family-wise error 
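
As a rough illustration of the univariate selectors listed above, the sketch below applies `SelectKBest` and `SelectPercentile` to the Iris toy dataset. Both selectors need labels, so this is a supervised example; the choice of `f_classif` as the score function and of `k=2` and `percentile=50` are assumptions made for the illustration, not settings used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score
kbest = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_k = kbest.transform(X)

# Keep the top 50% of the features ranked by the same score
pct = SelectPercentile(score_func=f_classif, percentile=50).fit(X, y)
X_p = pct.transform(X)

print("F-scores:", kbest.scores_)
print("SelectKBest mask:", kbest.get_support())
print("Shapes:", X_k.shape, X_p.shape)
```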

**Table of contents**<a id='toc0_'></a>    
1. [Features Selection](#toc1_)    
1.1. [Dependencies and paths](#toc1_1_)    
1.2. [Load the data](#toc1_2_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=true
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## 1.1. <a id='toc1_1_'></a>[Dependencies and paths](#toc0_)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
## DEPENDENCIES >>>
import os
import sys
from typing import List, Tuple, Dict, Any, Optional, Callable, Union
from pathlib import Path

import joblib
from functools import partial

# Add root directory to path for imports >
root_dir = Path.cwd().resolve().parent
if root_dir.exists():
    sys.path.append(str(root_dir))
else:
    raise FileNotFoundError('Root directory not found')

# import custom libraries >
from src.load import load_multiple_trajectoryCollection_parallel_pickle as lmtp
from src.load import load_datasets, load_df_to_dataset
from src.traj_dataloader import (TrajectoryDataset, 
                                 create_dataloader, 
                                 separate_files_by_season, 
                                 split_data, 
                                 get_files,
                                 AISDataset,
                                 )
from src.scaler import CustomMinMaxScaler, reduce_resolution

from datetime import datetime, timedelta

import dotsi
import itertools
import pickle

import numpy as np
import pandas as pd

# torch libraries >
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch.optim as optim
import torch.nn as nn
from torchvision import datasets
from torchvision.transforms import ToTensor
from torchvision.io import read_image

# sklearn libraries >
import sklearn as sk
from sklearn.model_selection import (train_test_split, 
                                     GridSearchCV, 
                                     RandomizedSearchCV)#, HalvingGridSearchCV, HalvingRandomSearchCV)
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score 
# from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline, make_pipeline

# Features selection >
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import ( mutual_info_classif,
                                       SelectKBest,
                                       chi2,
                                       VarianceThreshold,
                                       RFE,
                                       )
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from skfeature.function.similarity_based import fisher_score

# Hyperopt >
import optuna
import warnings
warnings.filterwarnings("ignore")

# Plot >
import matplotlib.pyplot as plt
import seaborn as sns
import scienceplots  # https://github.com/garrettj403/SciencePlots?tab=readme-ov-file
plt.style.use(['science', 'grid', 'notebook'])  # , 'ieee'

# Multiprocessing >
from concurrent.futures import ProcessPoolExecutor
from functools import partial

# Toy datasets >
from sklearn.datasets import load_iris  # Sample dataset

# %matplotlib inline
%matplotlib widget



In [3]:
## FLAGS & GLOBAL VALUES >>>

# Down sample the resolution
DOWN_SAMPLE = False  # used with SCALE and SAVE_SCALE: if True, the scaled data has down-sampled resolution; if False, it keeps the original resolution.

# Explore
EXPLORE = True

# Debug
DEBUG = True

# Develop
DEVELOP = True

# HYPERPARAMETER OPTIMISATION
HYPEROPT = True

if HYPEROPT:
    OPTUNA = False # Optimise using Optuna
    GRIDSEARCH = True  # Optimise using GridSearchCV
    RANDOMSEARCH = False  # Optimise using RandomizedSearchCV

# SAVE SELECTED FEATURES in root / models / selected_features
SAVE_SELECT_FEATURES = True

# WORKING SERVER
AVAILABLE_SERVERS = ['ZS', 'PLOEN', 'KIEL', 'WYK']
CURRENT_SERVER = AVAILABLE_SERVERS[0]

# seed
split_seed = 42

# If DOWN_SAMPLE, define the target time resolution
targeted_resolution_min = 1  # minute

# TODO: The following features are corrupted because they contain NaNs. Fix this. For now, these columns are dropped
corrupted_features = ["stopped", "abs_ccs", "curv"]


# Use up to 90% of the available CPU cores on 'ZS' and up to 70% on the other servers
n_jobs = joblib.cpu_count()
print("Number of CPUs available:", n_jobs)
if CURRENT_SERVER == 'ZS':
    n_jobs = int(0.9 * n_jobs)
else:
    n_jobs = int(0.7 * n_jobs)
print("Number of CPUs to use:", n_jobs)

Number of CPUs available: 64
Number of CPUs to use: 57


In [4]:
## PATHS >>>
# data dir
data_dir = root_dir / 'data'
data_dir = data_dir.resolve()
if not data_dir.exists():
    raise FileNotFoundError('Data directory not found')

if CURRENT_SERVER == 'ZS':
    # assets dir  # TODO: Used temporarily during the feature selection process. Remove this!
    assets_dir = data_dir / 'assets'
    assets_dir = assets_dir.resolve()
    if not assets_dir.exists():
        raise FileNotFoundError(f'Assets directory in {CURRENT_SERVER} not found')
else:
    # aistraj dir
    assets_dir = data_dir / 'local' / 'aistraj'
    assets_dir = assets_dir.resolve()
    if not assets_dir.exists():
        raise FileNotFoundError('Assets directory not found')

    # train-validate-test (tvt) dir
    tvt_assets_dir = assets_dir / 'tvt_assets'
    tvt_assets_dir = tvt_assets_dir.resolve()
    if not tvt_assets_dir.exists():
        raise FileNotFoundError('Train-Validate-Test Assets directory not found')

    # tvt: extended pickle dir
    tvt_extended_dir = tvt_assets_dir / 'extended'
    tvt_extended_dir = tvt_extended_dir.resolve()
    if not tvt_extended_dir.exists():
        raise FileNotFoundError('TVT Extended Pickled Data directory not found')

    # tvt: scaled pickle dir
    tvt_scaled_dir = tvt_assets_dir / 'scaled'
    tvt_scaled_dir = tvt_scaled_dir.resolve()
    if not tvt_scaled_dir.exists():
        raise FileNotFoundError('TVT Scaled Pickled Data directory not found')

    # tvt: logs dir
    tvt_logs_dir = tvt_assets_dir / 'logs'
    tvt_logs_dir = tvt_logs_dir.resolve()
    if not tvt_logs_dir.exists():
        raise FileNotFoundError('TVT logs directory not found')
  
  
# models dir
models_dir = root_dir / 'models'
models_dir = models_dir.resolve()
if not models_dir.exists():
    raise FileNotFoundError('Models directory not found')    

# Selected Features dir
selected_features_dir = models_dir / 'selected_features'
selected_features_dir = selected_features_dir.resolve()
if not selected_features_dir.exists():
    raise FileNotFoundError('selected features directory not found')

## 1.2. <a id='toc1_2_'></a>[Load the data](#toc0_)

+ Select the paths of the scaled datasets

In [5]:
import_paths = {'train': None, 'validate': None, 'test': None}

if DOWN_SAMPLE:
    import_paths = {
                    'train': tvt_scaled_dir / 'scaled_cleaned_downsampled_extended_train_df.parquet',
                    'validate': tvt_scaled_dir / 'scaled_cleaned_downsampled_extended_validate_df.parquet',
                    'test': tvt_scaled_dir / 'scaled_cleaned_downsampled_extended_test_df.parquet'
                    }
else:  
    if CURRENT_SERVER != 'ZS':
        import_paths = {
                        'train': tvt_scaled_dir / 'scaled_cleaned_extended_train_df.parquet',
                        'validate': tvt_scaled_dir / 'scaled_cleaned_extended_validate_df.parquet',
                        'test': tvt_scaled_dir / 'scaled_cleaned_extended_test_df.parquet'
                        }
    else:
        import_paths = {
                        'train': assets_dir / 'scaled_cleaned_extended_train_df.parquet',
                        'validate': assets_dir / 'scaled_cleaned_extended_validate_df.parquet',
                        'test': assets_dir / 'scaled_cleaned_extended_test_df.parquet'
                        }
        
# Assets container >
train_df, validate_df, test_df = None, None, None
assets = {'train': train_df, 'validate': validate_df, 'test': test_df}

+ Load the train set

In [6]:
# %%time
# if not DEVELOP:  # Data is huge! don't use for exploring and developing
#     train_df = load_df_to_dataset(data_path=import_paths['train'], use_dask=False).data  # Load the train dataset

+ Load the validate set

In [7]:
%%time
validate_df = load_df_to_dataset(import_paths['validate'], use_dask=False).data  # Load the validate dataset

CPU times: user 3.75 s, sys: 7.07 s, total: 10.8 s
Wall time: 1.81 s


In [8]:
if EXPLORE:
    columns = validate_df.columns
    print(f"Num. Cols: {len(columns)}: {columns}")
    print()
    print(f"Num. Samples: {validate_df.shape[0]}")

Num. Cols: 24: Index(['epoch', 'datetime', 'obj_id', 'traj_id', 'month_sin', 'month_cos',
       'hour_sin', 'hour_cos', 'season', 'part_of_day', 'aad', 'cdd',
       'dir_ccs', 'cog_c', 'rot_c', 'distance_c', 'dist_ww', 'dist_ra',
       'dist_cl', 'dist_ma', 'speed_c', 'acc_c', 'lon', 'lat'],
      dtype='object')

Num. Samples: 14705500


In [9]:
if EXPLORE:
    display(validate_df.describe())

Unnamed: 0,epoch,datetime,obj_id,traj_id,month_sin,month_cos,hour_sin,hour_cos,season,part_of_day,...,rot_c,distance_c,dist_ww,dist_ra,dist_cl,dist_ma,speed_c,acc_c,lon,lat
count,14705500.0,14705500,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,...,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0,14705500.0
mean,1666974000.0,2022-10-28 16:13:28.685266944,237556300.0,0.8652038,0.06713849,-0.5350965,0.06619752,-0.3610991,0.7613928,0.7803357,...,-0.0001199664,8.994076e-06,0.004779191,0.00282363,0.005197064,0.00282363,0.1092504,-0.5415504,0.1716905,0.06982459
min,1648080000.0,2022-03-24 00:00:00,0.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,...,-0.05,0.0,0.003090376,0.001545662,0.003682328,0.001545662,-0.6552631,-2205075.0,0.0,0.0
25%,1654656000.0,2022-06-08 02:32:17.500000,211341600.0,0.0,-0.5,-1.0,-0.5,-0.8660254,0.0,0.0,...,-0.0003949996,1.798412e-06,0.003843537,0.002221549,0.004462599,0.002221549,-0.5003267,-0.5611262,0.1707412,0.0692803
50%,1663483000.0,2022-09-18 06:32:10,211844700.0,0.0,1.224647e-16,-0.8660254,1.224647e-16,-0.5,0.0,1.0,...,0.0,7.910451e-06,0.004786457,0.002484184,0.005204943,0.002484184,0.003653112,-0.0003837946,0.1715423,0.06953888
75%,1683541000.0,2023-05-08 10:24:40,245399000.0,1.0,0.5,-1.83697e-16,0.7071068,6.123234000000001e-17,1.0,1.0,...,0.0003929777,1.374012e-05,0.005713276,0.00338841,0.005995282,0.00338841,0.4871682,0.4392827,0.1725441,0.07059271
max,1688170000.0,2023-07-01 00:00:00,1000000000.0,32.0,1.0,0.8660254,1.0,1.0,3.0,2.0,...,0.05,1.0,1.0,1.0,1.0,1.0,83007.95,9325.662,0.9216453,1.0
std,13601220.0,,64350240.0,2.382782,0.6518146,0.5331998,0.6910082,0.6226822,0.9608422,0.7965457,...,0.009468249,0.0003591384,0.001086893,0.0009401402,0.0009237291,0.0009401402,42.12552,791.3628,0.001133481,0.00114291


In [10]:
if EXPLORE:
    validate_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14705500 entries, 0 to 14705499
Data columns (total 24 columns):
 #   Column       Dtype         
---  ------       -----         
 0   epoch        int64         
 1   datetime     datetime64[ns]
 2   obj_id       int64         
 3   traj_id      int64         
 4   month_sin    float64       
 5   month_cos    float64       
 6   hour_sin     float64       
 7   hour_cos     float64       
 8   season       int64         
 9   part_of_day  int64         
 10  aad          float64       
 11  cdd          float64       
 12  dir_ccs      float64       
 13  cog_c        float64       
 14  rot_c        float64       
 15  distance_c   float64       
 16  dist_ww      float64       
 17  dist_ra      float64       
 18  dist_cl      float64       
 19  dist_ma      float64       
 20  speed_c      float64       
 21  acc_c        float64       
 22  lon          float64       
 23  lat          float64       
dtypes: datetime64[ns](1), 

+ Load the test set

In [11]:
# %%time
# if not DEVELOP:  # Data is huge! don't use for exploring and developing
#     test_df = load_df_to_dataset(import_paths['test'], use_dask=False).data  # Load the test dataset

+ Concat the datasets

In [12]:
# Concatenate the datasets >
asset_df = validate_df  # pd.concat([train_df, validate_df, test_df], axis=0)

# # Sort the dataset by epoch >
# asset_df = asset_df.sort_values(by='epoch', ascending=True)

# # Reset the index >
# asset_df = asset_df.reset_index(drop=True)

# # Display the dataset's head >
# if EXPLORE:
#     asset_df.head()

## Filter-based features selection

In [13]:
cols_not_to_study = ['epoch', 'datetime', 'obj_id', 'traj_id', 'stopped', 'curv']

# Check that the column in cols_not_to_study are in the dataset, otherwise remove them from the list >
cols_not_to_study = [col for col in cols_not_to_study if col in asset_df.columns]

print(f"Cols not to study: {cols_not_to_study}")


Cols not to study: ['epoch', 'datetime', 'obj_id', 'traj_id']


### Variance Threshold Method

+ The variance threshold method is a simple unsupervised feature selection method. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
+ One of the main assumptions of this method is that features with a higher variance may contain more useful information. In practice, variance thresholding may not be very useful for regression tasks, but it can be useful for classification tasks, especially for binary classification and clustering tasks.
+ The variance threshold method is a simple and effective method for feature selection. It is a good starting point for feature selection and is especially useful for removing noisy and irrelevant features.
+ Feature variance can be used as a measure of feature importance. Features with low variance are less informative and can be removed.
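
The default behaviour described above can be checked on a tiny matrix. This is a minimal sketch; the array values and the second threshold of `0.5` are made up for the illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 3 are constant, columns 1 and 2 vary
X = np.array([
    [0.0, 2.0, 0.0, 3.0],
    [0.0, 1.0, 4.0, 3.0],
    [0.0, 1.0, 1.0, 3.0],
])

# Default threshold=0.0: only zero-variance features are removed
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)
kept = selector.get_support()

# A stricter (hypothetical) threshold also removes the low-variance column 1
strict = VarianceThreshold(threshold=0.5)
X_strict = strict.fit_transform(X)

print("Variances:", selector.variances_)
print("Kept with default threshold:", kept)
print("Shapes:", X_reduced.shape, X_strict.shape)
```

Because the threshold is compared against raw variances, its meaning depends on how the features were scaled beforehand.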

#### Define the Optuna objective function for the optimisation of ``threshold`` hyperparameter

In [14]:
%%time
# Create a copy of the dataset and drop the columns not to study >
df = asset_df.drop(columns=cols_not_to_study)

CPU times: user 384 µs, sys: 259 µs, total: 643 µs
Wall time: 657 µs


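+ Find the best threshold value for the variance threshold method using Optuna. Each candidate threshold is scored with the silhouette score of a k-means clustering fitted on the selected features.
  > **NOTE**:</br> The number of clusters is assumed to be $30$ (the `n_clusters` value set in the optimisation cell).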
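
The scoring idea can be sketched on synthetic blobs: for each candidate threshold, select features, cluster, and compute the silhouette score. Everything in this sketch is an assumption made for illustration (the toy data, three clusters, the two candidate thresholds). The `sample_size` argument of `silhouette_score` is used because the exact score needs all pairwise distances, which is likely impractical on millions of rows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import silhouette_score

# Toy data: 4 clustered features scaled to [0, 1] plus one near-constant feature
X, _ = make_blobs(n_samples=600, centers=3, n_features=4, random_state=0)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
rng = np.random.default_rng(0)
X = np.column_stack([X, 0.5 + 1e-3 * rng.normal(size=len(X))])

scores, n_kept = {}, {}
for threshold in (0.0, 0.01):
    X_sel = VarianceThreshold(threshold=threshold).fit_transform(X)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_sel)
    # Score on a random subsample to keep the pairwise-distance cost bounded
    scores[threshold] = silhouette_score(X_sel, labels, sample_size=300, random_state=0)
    n_kept[threshold] = X_sel.shape[1]

for threshold, score in scores.items():
    print(f"threshold={threshold}: kept {n_kept[threshold]} features, silhouette={score:.3f}")
```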
    

In [15]:
def variance_threshold_feature_selection(data: pd.DataFrame, threshold: float) -> Tuple[VarianceThreshold, pd.DataFrame]:
    """
    Perform feature selection using variance threshold.
    Assign the feature_importance based on the normalised variance of the features. 
    The lower the variance, the less important the feature.

    Args:
        data (pd.DataFrame): The input DataFrame containing the features.
        threshold (float): The threshold value for variance.

    Returns:
        Tuple[VarianceThreshold, pd.DataFrame]:
            [VarianceThreshold]: is the fitted VarianceThreshold object.
            [pd.DataFrame]: is the selected features in descending order of importance.
                            The DataFrame contains two columns:
                                - `selected_features`: The selected features.
                                - `feature_importance`: The corresponding feature importance values.
    """
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(data)
    support = selector.get_support()  # boolean mask of the retained features

    # L2 normalisation of the variances. Normalizer works row-wise on 2D input,
    # so the 1D variance vector is reshaped to a single row and flattened again.
    variances = selector.variances_.reshape(1, -1)
    feature_importance = Normalizer(norm='l2').fit_transform(variances).ravel()

    # Keep only the retained features so both columns have the same length,
    # then sort by importance in descending order >
    fs_df = pd.DataFrame({'selected_features': data.columns[support],
                          'feature_importance': feature_importance[support]})
    fs_df = fs_df.sort_values(by='feature_importance', ascending=False).reset_index(drop=True)

    return selector, fs_df
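
`Normalizer` rescales each row of a 2D array to unit norm, so a 1D vector of variances has to be passed as a single row. The short check below, with made-up variance values, suggests that this is equivalent to dividing the vector by its Euclidean norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical variances for three features
variances = np.array([3.0, 4.0, 0.0])

# Reshape to one row (1 sample x 3 features), normalise, then flatten back
normalised = Normalizer(norm="l2").fit_transform(variances.reshape(1, -1)).ravel()

# Same result computed by hand
manual = variances / np.linalg.norm(variances)

print(normalised)
```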

In [16]:

## Define an objective function for Optuna >>
if HYPEROPT:
    if OPTUNA:  # Use Optuna for hyperparameter optimisation
        def objective(trial, 
                      df: pd.DataFrame, 
                      cluster: Callable,
                      n_clusters: int, 
                      random_state: Optional[int]=42, 
                      score_metric: Optional[Callable]=silhouette_score,
                      steps: Optional[float]=0.1):
            """Optimization objective function for feature selection.

            This function takes a trial object, a DataFrame, and optional parameters for the number of clusters and random state.
            It performs feature selection using the VarianceThreshold method and trains a clustering model (e.g., KMeans) on the selected features.
            The silhouette score is then calculated and returned as the optimization objective.

            Args:
                trial (optuna.Trial): The trial object used for optimization.
                df (pd.DataFrame): The input DataFrame containing the features.
                cluster (Callable): The clustering algorithm to be used.
                n_clusters (int): The number of clusters for the clustering algorithm.
                random_state (int, optional): The random state for reproducibility. Defaults to 42.
                score_metric (Callable, optional): The scoring metric used to evaluate the clustering model. Defaults to sklearn.metrics.silhouette_score.
                steps (float, optional): The step size for the threshold search space. Defaults to 0.1.

            Returns:
                float: The silhouette score of the clustering model trained on the selected features.
            """
            # Print the current trial number
            print("Running Trial Number:", trial.number)
            
            # Define the search space for the threshold: values in [0, 1] spaced by `steps`
            # (`suggest_float` with `step` replaces the deprecated `suggest_discrete_uniform`)
            threshold = trial.suggest_float('threshold', 0.0, 1.0, step=steps)
            
            # Instantiate the VarianceThreshold object with the suggested threshold
            selector, _ = variance_threshold_feature_selection(df, threshold)
            
            # Apply the selector to the data
            x_selected = selector.transform(df)
            
            # Train a clustering model (e.g., KMeans) on the selected features
            clusterer = cluster(n_clusters=n_clusters, random_state=random_state)
            clusters = clusterer.fit_predict(x_selected)
            
            # Calculate silhouette score
            silhouette = score_metric(x_selected, clusters)
            return silhouette

In [17]:
# ## TOY >>
# X = df

# # Define the threshold range
# threshold_range = np.linspace(0, 0.5, 5)

# # Define the parameter grid for RandomizedSearchCV
# param_grid = {'vt__threshold': threshold_range}

# # Initialize the pipeline with VarianceThreshold and KMeans clustering
# pipeline = Pipeline([
#     ('vt', VarianceThreshold()),
#     ('kmeans', KMeans(n_clusters=30))
# ])

# # Define a function to compute silhouette score
# def silhouette_scorer(estimator, X):
#     labels = estimator.predict(X)
#     return silhouette_score(X, labels)

# # Initialize RandomizedSearchCV
# random_search = RandomizedSearchCV(estimator=pipeline,
#                                    param_distributions=param_grid,
#                                    scoring=silhouette_scorer,
#                                    n_iter=20,  # Adjust the number of iterations as needed
#                                    cv=5,       # Adjust cross-validation folds as needed
#                                    random_state=42)

# # Fit RandomizedSearchCV
# random_search.fit(X)

# # Print the best parameters and best score
# print("Best threshold:", random_search.best_params_)
# print("Best silhouette score:", random_search.best_score_)


In [18]:
%%time
# Hyperparameter Opt >
best_threshold = None
if HYPEROPT:
        # Common parameters for the optimisation >
        params = {'cluster': KMeans,
                  'n_clusters': 30,
                  'random_state': 42,
                  'metric': silhouette_score,
                  'n_iter': 100,
                  'step': 0.2,
                  'n_jobs': n_jobs}
        if OPTUNA:
                study_params = {'direction': 'maximize'}

                # Create a study object and optimize the objective function >
                study = optuna.create_study(direction=study_params['direction'])

                # Use the validation set only for optimisation >
                study.optimize(partial(objective,
                                       df=df,
                                       cluster=params['cluster'],
                                       n_clusters=params['n_clusters'],
                                       random_state=params['random_state'],
                                       score_metric=params['metric'],
                                       steps=params['step']), 
                        n_trials=params['n_iter'],
                        n_jobs=params['n_jobs'])
                # study.optimize(lambda trial: objective(trial, 
                #                                        df=df, 
                #                                        n_clusters=study_params.n_clusters, 
                #                                        random_state=split_seed), 
                #                n_trials=study_params.n_trials,
                #                n_jobs=study_params.n_jobs)
                # study.optimize(objective, n_trials=study_params.n_trials)

                # Get the best threshold
                best_threshold = study.best_params['threshold']
                print("Best Threshold:", best_threshold)

                # Free up memory >
                del study

        if GRIDSEARCH:
                # Define the parameter grid for GridSearchCV
                param_grid = {'vt__threshold': np.arange(0, 1, params['step'])}

                # Initialize the pipeline with VarianceThreshold and KMeans clustering
                clusterer = params['cluster']
                pipeline = Pipeline([('vt', VarianceThreshold()),
                                     ('kmeans', clusterer(n_clusters=params['n_clusters'], 
                                                          random_state=params['random_state']))
                                     ])

                # Define a function to compute silhouette score
                def silhouette_scorer(estimator, X):
                        labels = estimator.predict(X)
                        return silhouette_score(X, labels)

                # Initialize GridSearchCV
                grid_search = GridSearchCV(estimator=pipeline,
                                           param_grid=param_grid,
                                           scoring=silhouette_scorer,
                                           cv=None,
                                           n_jobs=params['n_jobs'],
                                           verbose=1)

                # Fit GridSearchCV
                grid_search.fit(df)

                # Print the best parameters and best score
                # (pipeline parameters are keyed as '<step name>__<parameter>')
                best_threshold = grid_search.best_params_['vt__threshold']
                
                print("Best threshold:", best_threshold)
                print("Best silhouette score:", grid_search.best_score_)

                # Free up memory >
                del grid_search
                
        if RANDOMSEARCH:
                # Define the parameter grid for RandomizedSearchCV
                param_grid = {'vt__threshold': np.arange(0, 1, params['step'])}

                # Initialize the pipeline with VarianceThreshold and KMeans clustering
                clusterer = params['cluster']
                pipeline = Pipeline([('vt', VarianceThreshold()),
                                     ('kmeans', clusterer(n_clusters=params['n_clusters'], 
                                                          random_state=params['random_state']))
                                     ])

                # Define a function to compute silhouette score
                def silhouette_scorer(estimator, X):
                        labels = estimator.predict(X)
                        return silhouette_score(X, labels)

                # Initialize RandomizedSearchCV
                random_search = RandomizedSearchCV(estimator=pipeline,
                                                   param_distributions=param_grid,
                                                   scoring=silhouette_scorer,
                                                   n_iter=params['n_iter'],
                                                   cv=None,
                                                   random_state=params['random_state'],
                                                   n_jobs=params['n_jobs'],
                                                   verbose=1)

                # Fit RandomizedSearchCV
                random_search.fit(df)

                # Print the best parameters and best score
                # (pipeline parameters are keyed as '<step name>__<parameter>')
                best_threshold = random_search.best_params_['vt__threshold']
                
                print("Best threshold:", best_threshold)
                print("Best silhouette score:", random_search.best_score_)

                # Free up memory >
                del random_search


Fitting 5 folds for each of 5 candidates, totalling 25 fits
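
The grid search over a `VarianceThreshold` + `KMeans` pipeline can be reproduced in miniature as follows. This is only a sketch on synthetic blobs: the data, the three clusters, the two candidate thresholds, and `cv=3` are assumptions chosen so it runs quickly. The point it illustrates is that the tuned value is stored in `best_params_` under the pipeline key `vt__threshold`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data scaled to [0, 1]
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

pipeline = Pipeline([
    ("vt", VarianceThreshold()),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

def silhouette_scorer(estimator, X, y=None):
    """Score a fitted pipeline by the silhouette of its predicted labels."""
    labels = estimator.predict(X)
    return silhouette_score(X, labels)

search = GridSearchCV(
    estimator=pipeline,
    param_grid={"vt__threshold": [0.0, 0.01]},  # '<step name>__<parameter>'
    scoring=silhouette_scorer,
    cv=3,
)
search.fit(X)  # no labels are needed for this unsupervised pipeline

best_threshold = search.best_params_["vt__threshold"]
print("Best threshold:", best_threshold)
print("Best silhouette score:", search.best_score_)
```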


In [None]:
%%time
# If HYPEROPT, then use the optimised threshold, otherwise use the default threshold >
threshold = None

if HYPEROPT:
    threshold = best_threshold
else:
    threshold = 0.1
    
# Select features and return scores >
selector, fs_df = variance_threshold_feature_selection(df, threshold)
# features_selected = selector.transform(df)

display(fs_df)
    
    
    

# best_selector = VarianceThreshold(threshold=threshold)
# X_selected = best_selector.fit_transform(df)

# # Print out the selected features
# selected_features = df.columns[best_selector.get_support(indices=True)]


# # # Sort the selected features in alphabetical order
# # selected_features = sorted(selected_features)

# # Since the VTM does not provide a weight for each selected feature, we will create a uniform distribution of weights >
# weight = 1 / len(selected_features)  # Calculate the weight for each selected feature
# weights = [weight] * len(selected_features)  # Create a uniform distribution of weights

# selected_features_vtm['selected_features'] = selected_features
# selected_features_vtm['threshold'] = weights

# print("Selected Features:"), display(selected_features_vtm)

# Free up memory
del df

In [None]:
# Save the selected features to the models directory >
if SAVE_SELECT_FEATURES:
    fs_df.to_csv(selected_features_dir / 'new_selected_features_vtm.csv', index=False)
    print("Selected Features saved to:", selected_features_dir / 'new_selected_features_vtm.csv')