# Lab 6

Scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer Wisconsin
    * Iris plants (3 classes)
    * Optical recognition of handwritten digits (10 classes)
    * Wine (3 classes)

* Regression Datasets:
    * Boston house prices (removed in scikit-learn 1.2)
    * Diabetes
    * Linnerud (multi-output regression)
    * California Housing

* Image:
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition

* NLP:
    * The 20 newsgroups text dataset
    * Reuters Corpus Volume I

* Other:
    * Forest covertypes (tabular, multi-class classification)
    * Kddcup 99 (intrusion detection)
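
The small bundled datasets can be loaded through `sklearn.datasets` without any download. The snippet below is a minimal sketch that loads two of the classification datasets listed above and prints their shapes and class counts; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris, load_wine

# return_X_y=True returns the feature matrix and the target vector directly
X_iris, y_iris = load_iris(return_X_y=True)
X_wine, y_wine = load_wine(return_X_y=True)

# Shape is (n_samples, n_features); the set size is the number of classes
print(X_iris.shape, len(set(y_iris)))
print(X_wine.shape, len(set(y_wine)))
```

Larger datasets such as Kddcup 99 use the `fetch_*` functions instead, which download the data on first use.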

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare the classification performance of 3 different classifiers.
    * Separate the data into train, validation, and test sets.
    * Use accuracy as the metric for assessing performance. 
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm from question 1. Create an ensemble of at least 25 models and use it for the classification task. Identify the top and bottom 10% of the data in terms of the uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1? 

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 data points and redo question 6, using leave-one-out as the evaluation method.

8. Use a feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve when using fewer features?

In [4]:
import pandas as pd
file_name="kddcup.data.corrected"
df=pd.read_csv(file_name, header=None)
print(df.head())

   0    1     2   3    4      5   6   7   8   9   ...  32   33   34    35  \
0   0  tcp  http  SF  215  45076   0   0   0   0  ...   0  0.0  0.0  0.00   
1   0  tcp  http  SF  162   4528   0   0   0   0  ...   1  1.0  0.0  1.00   
2   0  tcp  http  SF  236   1228   0   0   0   0  ...   2  1.0  0.0  0.50   
3   0  tcp  http  SF  233   2032   0   0   0   0  ...   3  1.0  0.0  0.33   
4   0  tcp  http  SF  239    486   0   0   0   0  ...   4  1.0  0.0  0.25   

    36   37   38   39   40       41  
0  0.0  0.0  0.0  0.0  0.0  normal.  
1  0.0  0.0  0.0  0.0  0.0  normal.  
2  0.0  0.0  0.0  0.0  0.0  normal.  
3  0.0  0.0  0.0  0.0  0.0  normal.  
4  0.0  0.0  0.0  0.0  0.0  normal.  

[5 rows x 42 columns]


## Quick look at the data

In [5]:
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [6]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [7]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [9]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [10]:
len(np.unique(D["target"]))

23
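
The targets are byte strings with a trailing period, and the class sizes matter when judging accuracy later. The sketch below decodes the labels and counts samples per class. It runs on a tiny hypothetical array that stands in for `D["target"]`, so the counts shown are not the real KDD class frequencies.

```python
import numpy as np

# Hypothetical stand-in for D["target"]: an object array of byte strings
target = np.array([b"normal.", b"smurf.", b"smurf.", b"neptune.", b"smurf."], dtype=object)

# Decode bytes to str and strip the trailing period
labels = np.array([t.decode("utf-8").rstrip(".") for t in target])

# Count samples per class and list the most frequent class first
classes, counts = np.unique(labels, return_counts=True)
order = np.argsort(-counts)
class_counts = {str(classes[i]): int(counts[i]) for i in order}
print(class_counts)
```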

In [11]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

## Question 1

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load Dataset
# Copy the list so that D["feature_names"] is not modified when the cell is rerun
feature_names = list(D["feature_names"]) + ["label"]  # Add the target column name
file_path = "kddcup.data.corrected"
data = pd.read_csv(file_path, header=None, names=feature_names)

# Step 2: Preprocess Data (Encoding)
categorical_features = ['protocol_type', 'service', 'flag', 'land', 'logged_in', 'is_host_login', 'is_guest_login']
for feature in categorical_features:
    le = LabelEncoder()
    data[feature] = le.fit_transform(data[feature])

# Encode the target column
data['label'] = LabelEncoder().fit_transform(data['label'])

# Split features and target
X = data.drop(columns=['label'])
y = data['label']

# Step 3: Split Dataset into Train, Validation, and Test Sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Check shapes before scaling
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Step 4: Scale Features
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform validation and test data
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Ensure shapes match after scaling
print(f"X_train_scaled shape: {X_train_scaled.shape}, y_train shape: {y_train.shape}")

# Step 5: Define Classifiers and Hyperparameter Grids
classifiers = {
    'RandomForest': RandomForestClassifier(random_state=42),
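    # Note: SVC fit time grows at least quadratically with the number of samples,
    # so tuning it on the full training split may not finish in reasonable time
    'SVM': SVC(random_state=42),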
    'KNN': KNeighborsClassifier()
}

param_grids = {
    'RandomForest': {'n_estimators': [50, 100], 'max_depth': [10, 20, None]},
    'SVM': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'KNN': {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
}

# Step 6: Train and Optimize Classifiers
best_models = {}
val_accuracies = {}

for name, clf in classifiers.items():
    print(f"Optimizing {name}...")
    grid_search = GridSearchCV(clf, param_grids[name], cv=3, scoring='accuracy', verbose=1)
    grid_search.fit(X_train_scaled, y_train)  # Use scaled training data
    best_models[name] = grid_search.best_estimator_
    
    # Evaluate on validation set
    y_pred_val = grid_search.best_estimator_.predict(X_val_scaled)
    accuracy = accuracy_score(y_val, y_pred_val)
    val_accuracies[name] = accuracy
    print(f"{name}: Best Params: {grid_search.best_params_}, Validation Accuracy: {accuracy:.4f}")

# Step 7: Test the Best Model
best_model_name = max(val_accuracies, key=val_accuracies.get)
best_model = best_models[best_model_name]

y_pred_test = best_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred_test)

print(f"\nBest Model: {best_model_name}")
print(f"Test Accuracy: {test_accuracy:.4f}")


X_train shape: (3428901, 41), y_train shape: (3428901,)
X_val shape: (734765, 41), y_val shape: (734765,)
X_test shape: (734765, 41), y_test shape: (734765,)
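
With about 3.4 million training rows, a full grid search is expensive. One possible workaround, sketched below, is to tune on a stratified subsample and to place the scaler inside a `Pipeline` so that every cross-validation fold fits its own scaler. The data here is synthetic (`make_classification`), and the subsample size of 500 is an arbitrary assumption chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the KDD features (the real X, y would be used instead)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                     n_classes=3, random_state=42)

# Stratified subsample: cheaper search, class proportions preserved
X_sub, _, y_sub, _ = train_test_split(X_demo, y_demo, train_size=500,
                                      stratify=y_demo, random_state=42)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = {"knn__n_neighbors": [3, 5, 7], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
search.fit(X_sub, y_sub)
print(search.best_params_, round(search.best_score_, 4))
```

The best configuration found on the subsample can then be refit once on the full training split before it is compared on the validation set.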


## Question 2

In [None]:
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Step 1: Use the Best Model from Exercise 1
best_base_model = best_models[best_model_name]

# Step 2: Create an Ensemble of 25 Models
ensemble_model = BaggingClassifier(
    estimator=best_base_model,  # this parameter was named base_estimator before scikit-learn 1.2
    n_estimators=25, 
    random_state=42
)

# Train the ensemble model on the training data
ensemble_model.fit(X_train_scaled, y_train)

# Step 3: Evaluate the Ensemble Model on the Test Set
y_pred_test_ensemble = ensemble_model.predict(X_test_scaled)
ensemble_accuracy = accuracy_score(y_test, y_pred_test_ensemble)
print(f"Ensemble Test Accuracy: {ensemble_accuracy:.4f}")

# Step 4: Estimate Uncertainty for Each Data Point
# Use the averaged class probabilities when available, otherwise fall back to vote counts
if hasattr(ensemble_model, "predict_proba"):
    probabilities = ensemble_model.predict_proba(X_test_scaled)
    uncertainty = 1 - np.max(probabilities, axis=1)  # 1 - confidence in the most likely class
else:
    predictions = np.array([estimator.predict(X_test_scaled) for estimator in ensemble_model.estimators_]).T
    # Keep one value per sample: the share of estimators that disagree with the majority vote
    majority_votes = np.array([np.bincount(row.astype(int)).max() for row in predictions])
    uncertainty = 1 - majority_votes / predictions.shape[1]

# Step 5: Identify Top and Bottom 10% Based on Uncertainty
num_points = len(uncertainty)
top_10_percent_indices = np.argsort(uncertainty)[-int(0.1 * num_points):]  # Top 10% most uncertain
bottom_10_percent_indices = np.argsort(uncertainty)[:int(0.1 * num_points)]  # Bottom 10% least uncertain

# Extract the corresponding data points
top_10_percent = X_test.iloc[top_10_percent_indices]
bottom_10_percent = X_test.iloc[bottom_10_percent_indices]

# Output statistics
print(f"Top 10% Uncertainty: {uncertainty[top_10_percent_indices].mean():.4f}")
print(f"Bottom 10% Uncertainty: {uncertainty[bottom_10_percent_indices].mean():.4f}")
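
The cell above measures uncertainty as one minus the largest class probability. An alternative that uses the whole probability vector is the Shannon entropy of the prediction. The sketch below is not part of the original solution; `predictive_entropy` is a hypothetical helper and `probs` is a made-up stand-in for the output of `predict_proba`.

```python
import numpy as np

def predictive_entropy(probabilities):
    """Shannon entropy (in nats) of each row of class probabilities."""
    p = np.clip(probabilities, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

# Hypothetical predict_proba output for three samples and two classes
probs = np.array([[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]])
entropy = predictive_entropy(probs)

# Indices sorted from most to least uncertain
most_uncertain_first = np.argsort(-entropy)
print(entropy, most_uncertain_first)
```

Entropy is zero for a fully confident prediction and reaches its maximum, the log of the number of classes, when all classes are equally likely.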


## Question 3

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.metrics import accuracy_score

# Step 1: Use Feature Selection Algorithms
# Select top 10 features using two different algorithms

# Algorithm 1: ANOVA F-statistic
selector_f_classif = SelectKBest(score_func=f_classif, k=10)
X_train_f_classif = selector_f_classif.fit_transform(X_train_scaled, y_train)
X_val_f_classif = selector_f_classif.transform(X_val_scaled)
X_test_f_classif = selector_f_classif.transform(X_test_scaled)

selected_features_f_classif = selector_f_classif.get_support(indices=True)
print("Selected Features by ANOVA F-statistic:", selected_features_f_classif)

# Algorithm 2: Mutual Information
selector_mutual_info = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_mutual_info = selector_mutual_info.fit_transform(X_train_scaled, y_train)
X_val_mutual_info = selector_mutual_info.transform(X_val_scaled)
X_test_mutual_info = selector_mutual_info.transform(X_test_scaled)

selected_features_mutual_info = selector_mutual_info.get_support(indices=True)
print("Selected Features by Mutual Information:", selected_features_mutual_info)

# Step 2: Retrain Classifiers with Selected Features
from sklearn.base import clone

retrain_results = {}
retrained_models = {}
selected_sets = {
    "f_classif": (X_train_f_classif, X_val_f_classif, X_test_f_classif, "ANOVA F-statistic"),
    "mutual_info": (X_train_mutual_info, X_val_mutual_info, X_test_mutual_info, "Mutual Information"),
}

# Fit a fresh clone per (classifier, selector) pair so that one fit does not overwrite another
for method, (X_tr, X_va, _, method_label) in selected_sets.items():
    for name, clf in classifiers.items():
        model = clone(clf).fit(X_tr, y_train)
        accuracy = accuracy_score(y_val, model.predict(X_va))
        retrain_results[(name, method)] = accuracy
        retrained_models[(name, method)] = model
        print(f"{name} ({method_label}): Validation Accuracy = {accuracy:.4f}")

# Step 3: Test the Best Classifier on Selected Features
# Keys are (classifier, selector) tuples, which avoids splitting names such as "f_classif" on "_"
best_retrain_model_type, feature_selection_method = max(retrain_results, key=retrain_results.get)
best_retrain_model_name = f"{best_retrain_model_type}_{feature_selection_method}"
best_retrain_model = retrained_models[(best_retrain_model_type, feature_selection_method)]
X_test_selected = selected_sets[feature_selection_method][2]

y_pred_test = best_retrain_model.predict(X_test_selected)
test_accuracy = accuracy_score(y_test, y_pred_test)

print(f"\nBest Retrained Model: {best_retrain_model_name}")
print(f"Test Accuracy with Selected Features: {test_accuracy:.4f}")
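
The selectors above print column indices, which are hard to read. The sketch below shows one way to map the selected columns back to names and to see which features both methods agree on. It uses synthetic data and hypothetical names (`feature_0`, ...); with the lab data, the names would come from `X.columns`. The `mutual_info_seeded` wrapper is an illustrative helper that fixes the random seed of `mutual_info_classif`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Synthetic stand-in data; the feature names are hypothetical placeholders
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                     n_redundant=0, random_state=0)
names = np.array([f"feature_{i}" for i in range(X_demo.shape[1])])

def mutual_info_seeded(X, y):
    # mutual_info_classif is stochastic, so fix the seed for repeatability
    return mutual_info_classif(X, y, random_state=0)

anova = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)
mi = SelectKBest(score_func=mutual_info_seeded, k=3).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
anova_names = set(names[anova.get_support()])
mi_names = set(names[mi.get_support()])
print("ANOVA:", sorted(anova_names))
print("Mutual information:", sorted(mi_names))
print("Selected by both:", sorted(anova_names & mi_names))
```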


## Question 4

In [None]:
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Step 1: Use the same dataset without labels
X_train_cluster = X_train_scaled
X_test_cluster = X_test_scaled

# Step 2: Define Clustering Algorithms
clustering_algorithms = {
    "KMeans": KMeans(n_clusters=len(np.unique(y_train)), random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "Agglomerative": AgglomerativeClustering(n_clusters=len(np.unique(y_train)))
}

# Step 3: Train and Evaluate Clustering Algorithms
clustering_results = {}

for name, model in clustering_algorithms.items():
    print(f"Running {name}...")
    # DBSCAN and AgglomerativeClustering have no predict method, so fit each
    # model directly on the test split and score the resulting labels
    cluster_labels_test = model.fit_predict(X_test_cluster)
    
    # Evaluate clustering using Adjusted Rand Index (ARI)
    ari_score = adjusted_rand_score(y_test, cluster_labels_test)
    
    # Silhouette score needs at least two distinct labels; sample_size bounds
    # the pairwise-distance cost on a split this large
    if len(set(cluster_labels_test)) > 1:
        silhouette = silhouette_score(X_test_cluster, cluster_labels_test,
                                      sample_size=10000, random_state=42)
    else:
        silhouette = -1  # Silhouette score is undefined for a single cluster
    
    clustering_results[name] = {
        "ARI": ari_score,
        "Silhouette Score": silhouette
    }
    
    print(f"{name} - Adjusted Rand Index: {ari_score:.4f}, Silhouette Score: {silhouette:.4f}")

# Step 4: Map Clusters to Classes (for KMeans only as an example)
kmeans = KMeans(n_clusters=len(np.unique(y_train)), random_state=42)
kmeans.fit(X_train_cluster)
cluster_labels_test = kmeans.predict(X_test_cluster)

# Map each cluster to the majority class in that cluster
y_test_array = np.asarray(y_test)  # positional indexing; y_test keeps the original row labels
cluster_to_class_map = {}
for cluster in np.unique(cluster_labels_test):
    members = cluster_labels_test == cluster
    # Labels are non-negative integers from LabelEncoder, so bincount finds the mode
    majority_class = np.bincount(y_test_array[members]).argmax()
    cluster_to_class_map[cluster] = majority_class

print("\nCluster-to-Class Mapping (KMeans):")
for cluster, cls in cluster_to_class_map.items():
    print(f"Cluster {cluster}: Class {cls}")
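
A majority-class mapping hides how mixed each cluster is. A contingency table of clusters against classes, together with a per-cluster purity, gives a more direct answer to whether each class has its own cluster. The sketch below uses eight hypothetical samples; the class names and cluster assignments are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical true classes and cluster assignments for eight samples
y_true = np.array(["normal", "normal", "normal", "smurf", "smurf", "smurf", "neptune", "neptune"])
clusters = np.array([0, 0, 1, 2, 2, 2, 1, 1])

# Rows are clusters, columns are classes, cells are sample counts
table = pd.crosstab(pd.Series(clusters, name="cluster"), pd.Series(y_true, name="class"))
print(table)

# Purity of each cluster: share of its samples that belong to its majority class
purity = table.max(axis=1) / table.sum(axis=1)
majority_class = table.idxmax(axis=1)
print(pd.DataFrame({"majority_class": majority_class, "purity": purity}))
```

A class that never appears as a majority class in any row has no cluster of its own under that clustering.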


## Question 5

In [None]:
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
import numpy as np
from sklearn.metrics import silhouette_score

# Step 1: Use the Top and Bottom 10% Data Identified in Exercise 2
X_top_10 = X_test_scaled[top_10_percent_indices]
X_bottom_10 = X_test_scaled[bottom_10_percent_indices]

print(f"Top 10% Data Shape: {X_top_10.shape}, Bottom 10% Data Shape: {X_bottom_10.shape}")

# Step 2: Define Clustering Algorithms
clustering_algorithms = {
    "KMeans": KMeans(n_clusters=2, random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "Agglomerative": AgglomerativeClustering(n_clusters=2)
}

# Step 3: Cluster Top 10% and Bottom 10%
cluster_results = {}

for subset_name, subset_data in [("Top 10%", X_top_10), ("Bottom 10%", X_bottom_10)]:
    cluster_results[subset_name] = {}
    for name, model in clustering_algorithms.items():
        print(f"Clustering {subset_name} Data with {name}...")
        cluster_labels = model.fit_predict(subset_data)
        
        # Evaluate clustering with Silhouette Score
        if len(set(cluster_labels)) > 1:  # Check for more than 1 cluster
            silhouette = silhouette_score(subset_data, cluster_labels)
        else:
            silhouette = -1  # Undefined silhouette score for single cluster
        
        cluster_results[subset_name][name] = {
            "Cluster Labels": cluster_labels,
            "Silhouette Score": silhouette
        }
        print(f"{name} - Silhouette Score: {silhouette:.4f}")

# Step 4: Analyze Clusters
# Example: Analyze characteristics of clusters for KMeans on Top 10% data
kmeans_top_labels = cluster_results["Top 10%"]["KMeans"]["Cluster Labels"]
kmeans_top_clusters = {label: X_top_10[kmeans_top_labels == label] for label in np.unique(kmeans_top_labels)}

print("\nCharacteristics of Clusters in Top 10% (KMeans):")
for label, cluster_data in kmeans_top_clusters.items():
    print(f"Cluster {label}: Mean Feature Values:\n{np.mean(cluster_data, axis=0)}")
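
The means printed above are in standardized units, which makes the cluster characteristics hard to interpret. One option, sketched here on a small hypothetical two-feature array, is to map cluster centres back to the original units with `StandardScaler.inverse_transform`. The names `scaler_demo` and `kmeans_demo` are illustrative and separate from the objects used in the lab cells.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data with two well-separated groups
X_raw = np.array([[1.0, 100.0], [2.0, 110.0], [1.5, 105.0],
                  [10.0, 900.0], [11.0, 910.0], [10.5, 905.0]])

scaler_demo = StandardScaler()
X_std = scaler_demo.fit_transform(X_raw)

kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_std)

# Cluster centres live in standardized space; map them back to original units
centers_original = scaler_demo.inverse_transform(kmeans_demo.cluster_centers_)
print(centers_original)
```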


## Question 6

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import precision_score, recall_score, f1_score

# Step 1: Load the "SA" Dataset
# Assuming `SA_dataset.csv` is the file containing the data
file_path = "SA_dataset.csv"
data = pd.read_csv(file_path)

# Split into features and labels
X = data.drop(columns=["label"])  
y = data["label"]  # 0 = normal, 1 = anomaly

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Define Anomaly Detection Algorithms
anomaly_detectors = {
    "IsolationForest": IsolationForest(random_state=42),
    "OneClassSVM": OneClassSVM(kernel='rbf', gamma=0.1, nu=0.1),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, novelty=True)
}

# Step 3: Train and Evaluate Each Algorithm
results = {}

for name, model in anomaly_detectors.items():
    print(f"Training {name}...")
    model.fit(X_scaled)  # Fit the model
    
    # Predict anomalies: scikit-learn detectors return -1 for outliers and 1 for inliers.
    # With novelty=True, LocalOutlierFactor exposes predict like the other two detectors,
    # although it is intended for scoring new data rather than the training set.
    y_pred = model.predict(X_scaled)
    
    # Convert to the label convention used in y: 1 = anomaly, 0 = normal
    y_pred = np.where(y_pred == -1, 1, 0)
    
    # Evaluate performance
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    
    results[name] = {
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1
    }
    print(f"{name} - Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")

# Step 4: Display Results
print("\nAnomaly Detection Results:")
for name, metrics in results.items():
    print(f"{name}: Precision = {metrics['Precision']:.4f}, Recall = {metrics['Recall']:.4f}, F1 Score = {metrics['F1 Score']:.4f}")
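
Precision, recall, and F1 depend on the threshold each detector applies internally. A threshold-free comparison is possible by ranking points with the continuous anomaly score and computing ROC AUC. The sketch below does this for `IsolationForest` on synthetic data (Gaussian inliers plus a few far-away points); it is an illustration of the idea, not a result on the SA dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic data: 200 inliers around the origin and 10 scattered far-away outliers
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(6.0, 10.0, size=(10, 2)) * rng.choice([-1, 1], size=(10, 2))
X_demo = np.vstack([inliers, outliers])
y_demo = np.r_[np.zeros(200, dtype=int), np.ones(10, dtype=int)]  # 1 = anomaly

detector = IsolationForest(random_state=42).fit(X_demo)

# score_samples is higher for normal points, so negate it to get an anomaly score
anomaly_score = -detector.score_samples(X_demo)
auc = roc_auc_score(y_demo, anomaly_score)
print(f"ROC AUC: {auc:.4f}")
```

`OneClassSVM` and `LocalOutlierFactor` (with `novelty=True`) also provide `score_samples`, so the same comparison can be applied to all three detectors.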


## Question 7

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

# Step 1: Create a Subsample of 250 Data Points
data_sampled = data.sample(n=250, random_state=42)  # Sample 250 points from the dataset
X_subsample = data_sampled.drop(columns=["label"])  
y_subsample = data_sampled["label"]

# Scale the subsample
X_subsample_scaled = scaler.fit_transform(X_subsample)

# Step 2: Define Anomaly Detection Algorithms
anomaly_detectors = {
    "IsolationForest": IsolationForest(random_state=42),
    "OneClassSVM": OneClassSVM(kernel='rbf', gamma=0.1, nu=0.1),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, novelty=True)
}

# Step 3: Leave-One-Out Cross-Validation
loo = LeaveOneOut()
results_loo = {}

for name, model in anomaly_detectors.items():
    print(f"Running Leave-One-Out for {name}...")
    y_true_loo = y_subsample.to_numpy()
    y_pred_all = np.zeros(len(y_true_loo), dtype=int)
    
    for train_index, test_index in loo.split(X_subsample_scaled):
        X_train_loo, X_test_loo = X_subsample_scaled[train_index], X_subsample_scaled[test_index]
        
        # Train on the 249 remaining points (the detectors do not use labels)
        model.fit(X_train_loo)
        
        # scikit-learn detectors return -1 for outliers and 1 for inliers;
        # store 1 = anomaly, 0 = normal for the held-out point
        y_pred_all[test_index] = np.where(model.predict(X_test_loo) == -1, 1, 0)
    
    # Precision, recall, and F1 are not meaningful on a single sample,
    # so compute them once over all 250 held-out predictions
    precision = precision_score(y_true_loo, y_pred_all, zero_division=0)
    recall = recall_score(y_true_loo, y_pred_all, zero_division=0)
    f1 = f1_score(y_true_loo, y_pred_all, zero_division=0)
    
    results_loo[name] = {
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1
    }
    print(f"{name} - Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")

# Step 4: Display Results
print("\nLeave-One-Out Results:")
for name, metrics in results_loo.items():
    print(f"{name}: Precision = {metrics['Precision']:.4f}, Recall = {metrics['Recall']:.4f}, F1 Score = {metrics['F1 Score']:.4f}")
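
A purely random sample of 250 points can end up with very few anomalies, or none, which makes precision and recall unstable. One possible alternative is a stratified subsample that preserves the anomaly ratio. The sketch below uses hypothetical labels with 10% anomalies and a placeholder feature column; the 10% ratio is an assumption made only for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 1,000 samples with 10% anomalies (1 = anomaly)
y_demo = np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)]
X_demo = np.arange(1000).reshape(-1, 1)  # placeholder feature column

# train_size=250 keeps 250 rows; stratify preserves the anomaly ratio
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=250, stratify=y_demo, random_state=42
)
print(len(y_small), int(y_small.sum()))
```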


## Question 8

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

# Step 1: Use Feature Selection to Identify the Top 5 Features
# Use both ANOVA F-statistic and Mutual Information

# ANOVA F-statistic
selector_f_classif = SelectKBest(score_func=f_classif, k=5)
X_train_f_classif = selector_f_classif.fit_transform(X_subsample_scaled, y_subsample)
X_test_f_classif = selector_f_classif.transform(X_subsample_scaled)

# Mutual Information
selector_mutual_info = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_mutual_info = selector_mutual_info.fit_transform(X_subsample_scaled, y_subsample)
X_test_mutual_info = selector_mutual_info.transform(X_subsample_scaled)

print("Top Features by ANOVA F-statistic:", selector_f_classif.get_support(indices=True))
print("Top Features by Mutual Information:", selector_mutual_info.get_support(indices=True))

# Step 2: Retrain Anomaly Detection Algorithms with Selected Features
results_feature_selection = {}

for name, model in anomaly_detectors.items():
    for method, X_train_selected, X_test_selected in [
        ("ANOVA", X_train_f_classif, X_test_f_classif),
        ("Mutual Info", X_train_mutual_info, X_test_mutual_info)
    ]:
        print(f"Training {name} with {method} selected features...")
        model.fit(X_train_selected)
        
        # Predict (all three detectors expose predict; LOF does because novelty=True)
        y_pred = model.predict(X_test_selected)
        
        # scikit-learn returns -1 for outliers and 1 for inliers; map to 1 = anomaly, 0 = normal
        y_pred = np.where(y_pred == -1, 1, 0)
        
        # Evaluate
        precision = precision_score(y_subsample, y_pred, zero_division=1)
        recall = recall_score(y_subsample, y_pred, zero_division=1)
        f1 = f1_score(y_subsample, y_pred, zero_division=1)
        
        results_feature_selection[f"{name}_{method}"] = {
            "Precision": precision,
            "Recall": recall,
            "F1 Score": f1
        }
        print(f"{name} ({method}): Precision = {precision:.4f}, Recall = {recall:.4f}, F1 Score = {f1:.4f}")

# Step 3: Compare Results
print("\nFeature Selection Results:")
for name, metrics in results_feature_selection.items():
    print(f"{name}: Precision = {metrics['Precision']:.4f}, Recall = {metrics['Recall']:.4f}, F1 Score = {metrics['F1 Score']:.4f}")
