# Forward Feature Selection

The purpose of this notebook is to perform forward feature selection, starting from the importance scores produced by the Random Forest (RF) feature selection performed in another notebook. The clustering algorithm tested is DBSCAN, and the goal is to maximize V-Measure while minimizing RMSE.
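
As a quick illustration of the metric being maximized, the sketch below computes homogeneity, completeness, and V-Measure with scikit-learn on toy labels. The label lists are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Toy ground-truth labels and two hypothetical clusterings (not from the dataset)
labels_true = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]  # identical grouping, different cluster ids
single_cluster = [0, 0, 0, 0]  # every sample merged into one cluster

# A clustering that matches the labels up to renaming scores 1.0 on all three
h, c, v = homogeneity_completeness_v_measure(labels_true, same_partition)
print(h, c, v)

# Merging everything is complete (each class sits in one cluster) but not
# homogeneous, so the V-Measure (harmonic mean of the two) drops to zero
h_m, c_m, v_m = homogeneity_completeness_v_measure(labels_true, single_cluster)
print(h_m, c_m, v_m)
```

V-Measure is invariant to how cluster ids are named, which is why it suits DBSCAN output, where ids are arbitrary.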

## Libraries and Configurations

Import configuration files

In [32]:
from configparser import ConfigParser

config = ConfigParser()
config.read("../config.ini")

['../config.ini']

Import **data libraries**

In [33]:
import pandas as pd

Import **other libraries**

In [34]:
from rich.progress import Progress
from rich import traceback

traceback.install()

<bound method InteractiveShell.excepthook of <ipykernel.zmqshell.ZMQInteractiveShell object at 0x76a27b90fe50>>

Custom helper scripts

In [35]:
%cd ..
from scripts import plotHelper, encodingHelper
%cd data_exploration_cleaning

/home/bacci/COMPACT/notebooks
/home/bacci/COMPACT/notebooks/data_exploration_cleaning


## Feature Importance

In [36]:
feature_importances_csv = (
    config["DEFAULT"]["reports_path"]
    + "CSV/feature_selection/DISSECTED_importances_RF.csv"
)

feature_importances = pd.read_csv(feature_importances_csv).sort_values(
    by=["Importance"], ascending=False
)

These are the features ranked by the Random Forest classifier, which was trained to classify the `Label` column. They are sorted by importance in descending order.
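
The notebook that produced this CSV is not shown here. As a rough sketch only, a table with the same `Feature` and `Importance` columns could be built from a fitted `RandomForestClassifier` as below. The synthetic data, the `feature_i` names, and the hyperparameters are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real features and labels live in the other notebook
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
demo_columns = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

# Hyperparameters are placeholders, not the ones used for the real ranking
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_demo, y_demo)

# Impurity-based importances, one per column, sorted like the CSV above
demo_importances = pd.DataFrame(
    {"Feature": demo_columns, "Importance": rf.feature_importances_}
).sort_values(by=["Importance"], ascending=False)
print(demo_importances)
```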

In [37]:
feature_importances

Unnamed: 0,Feature,Importance
3,Length,0.151389
2,Vendor Specific Tags,0.110161
21,Min_MPDCU_Start_Spacing,0.070928
23,RX_Highest_Supported_Data_Rate,0.04719
40,Interworking,0.044109
42,WNM_Notification,0.031174
13,DSSS_CCK,0.030882
41,QoS_Map,0.029089
30,Extended_Channel_Switching,0.028832
18,SM_Power_Save,0.027387


## Import Data

In [38]:
# Combined dataframe
burst_csv = (
    config["DEFAULT"]["interim_path"] + "dissected/selected_burst_dissected_df.csv"
)

In [39]:
df = pd.read_csv(burst_csv, index_col=0)

We remove the `MAC Address` column, since it is not used for feature selection.

In [40]:
df = df.drop(columns=["MAC Address"])

## Split Columns

In [41]:
X = df.drop(columns=["Label"])
y = df["Label"]

In [42]:
X

Unnamed: 0,Vendor Specific Tags,Length,DSSS_CCK,SM_Power_Save,Min_MPDCU_Start_Spacing,RX_Highest_Supported_Data_Rate,Extended_Channel_Switching,WNM_Sleep_Mode,DMS,Interworking,QoS_Map,WNM_Notification,Operating_Mode_Notification
0,2,279,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,11,123,0.0,3.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,62,132,0.0,3.0,6.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,62,132,0.0,3.0,6.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
5,62,143,0.0,3.0,6.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4410,62,131,0.0,3.0,6.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4411,1,156,0.0,0.0,5.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
4412,62,132,0.0,3.0,6.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4413,62,143,0.0,3.0,6.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


## Normalize Data

In [43]:
# Replace missing values with a numeric sentinel (-1) so columns stay numeric
X.fillna(-1, inplace=True)

In [45]:
from sklearn.preprocessing import MinMaxScaler

In [46]:
scaler = MinMaxScaler()

# Coerce every column to numeric and drop rows that contain non-numeric values
X_numeric = X.apply(pd.to_numeric, errors="coerce").dropna()

# Scale the numeric values using MinMaxScaler
X_normalized = pd.DataFrame(scaler.fit_transform(X_numeric), columns=X_numeric.columns)

In [47]:
X_normalized

Unnamed: 0,Vendor Specific Tags,Length,DSSS_CCK,SM_Power_Save,Min_MPDCU_Start_Spacing,RX_Highest_Supported_Data_Rate,Extended_Channel_Switching,WNM_Sleep_Mode,DMS,Interworking,QoS_Map,WNM_Notification,Operating_Mode_Notification
0,0.016393,1.000000,0.0,1.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.163934,0.118644,0.0,1.0,0.833333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.000000,0.169492,0.0,1.0,1.000000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,1.000000,0.169492,0.0,1.0,1.000000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,1.000000,0.231638,0.0,1.0,1.000000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3436,1.000000,0.163842,0.0,1.0,1.000000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3437,0.000000,0.305085,0.0,0.0,0.833333,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
3438,1.000000,0.169492,0.0,1.0,1.000000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3439,1.000000,0.231638,0.0,1.0,1.000000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
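
To make the scaling step concrete, the sketch below applies `MinMaxScaler` to a small toy frame and checks it against the formula x' = (x - min) / (max - min), computed per column. The toy values are illustrative only. Passing `index=toy.index` is an optional variation that keeps the original row index, which can help when rows must later be matched back to labels.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with two columns (illustrative values, not the real dataset)
toy = pd.DataFrame(
    {"Vendor Specific Tags": [1, 2, 11, 62], "Length": [102, 123, 132, 279]},
    index=[0, 2, 3, 4],  # gapped index, as after filtering rows
)

# Scale each column to [0, 1]; keep column names and the original index
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

# Same result computed by hand: (x - min) / (max - min) per column
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```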


## Forward Feature Selection

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import homogeneity_completeness_v_measure

In [48]:
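# Candidate features, ordered by RF importance (the "Feature" column, not the column labels)
remaining_features = feature_importances["Feature"].tolist()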

In [None]:
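# Ranked candidate features, restricted to columns present in X_normalized
remaining_features = [
    f for f in feature_importances["Feature"] if f in X_normalized.columns
]
# Labels for the rows kept in X_numeric (assumes a unique index);
# X_normalized has a reset index, so labels are used positionally
y_true = y.loc[X_numeric.index].to_numpy()

# DBSCAN hyperparameters: placeholders (scikit-learn defaults), to be tuned
eps = 0.5
min_samples = 5

selected_features = []
best_homogeneity = 0
best_completeness = 0

# Stop when a full pass over the remaining features adds nothing
improved = True
while improved and remaining_features:
    improved = False
    for feature in list(remaining_features):
        candidate_features = selected_features + [feature]
        X_selected = X_normalized[candidate_features]

        # Train DBSCAN model
        dbscan_model = DBSCAN(eps=eps, min_samples=min_samples)
        clusters = dbscan_model.fit_predict(X_selected)

        # Calculate homogeneity and completeness
        homogeneity, completeness, _ = homogeneity_completeness_v_measure(
            y_true, clusters
        )

        # Check for improvement
        if homogeneity > best_homogeneity and completeness > best_completeness:
            best_homogeneity = homogeneity
            best_completeness = completeness
            selected_features = candidate_features
            improved = True

    # Update remaining features
    remaining_features = [
        f for f in remaining_features if f not in selected_features
    ]

# Train final model
final_dbscan_model = DBSCAN(eps=eps, min_samples=min_samples)
final_clusters = final_dbscan_model.fit_predict(X_normalized[selected_features])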
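
The cell above accepts a feature when both homogeneity and completeness improve. Since the stated goal is to maximize V-Measure, one possible variation is to score candidates on V-Measure directly and add only the single best feature per round. The sketch below shows that variant on synthetic data. The function name `forward_select`, the `eps` and `min_samples` values, and the demo columns are all hypothetical and would need tuning for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score
from sklearn.preprocessing import MinMaxScaler


def forward_select(X, y, ranked_features, eps=0.05, min_samples=5):
    """Greedy forward selection: each round adds the feature with the best V-Measure."""
    selected, best_score = [], 0.0
    remaining = list(ranked_features)
    while remaining:
        scores = {}
        for feature in remaining:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
                X[selected + [feature]]
            )
            scores[feature] = v_measure_score(y, labels)
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] <= best_score:
            break  # no candidate improves the score, so stop
        best_score = scores[best_feature]
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, best_score


# Synthetic demo: two blob-derived features plus one uniform-noise feature
points, y_demo = make_blobs(
    n_samples=150, centers=3, n_features=2, cluster_std=0.5, random_state=0
)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(points, columns=["blob_0", "blob_1"])
X_demo["noise"] = rng.uniform(size=len(X_demo))
X_demo = pd.DataFrame(MinMaxScaler().fit_transform(X_demo), columns=X_demo.columns)

demo_features = ["blob_0", "blob_1", "noise"]
selected, score = forward_select(X_demo, y_demo, demo_features)
print(selected, round(score, 3))
```

Adding one feature per round makes the result less dependent on the order in which candidates are visited, at the cost of more DBSCAN fits per round.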