# Feature selection

## Description
This notebook performs feature selection using Recursive Feature Elimination (RFE) with a Random Forest classifier and evaluates the selected features using stratified k-fold cross-validation.

## Objectives

Feature Selection with RFE:

- **RFE** repeatedly fits the model, drops the least important features according to the model's importance scores, and stops once the specified number of features remains. The order of elimination yields a ranking that identifies the most informative features for the classification task.

Using Random Forest for RFE:

- **A Random Forest classifier** is used as the estimator for RFE because it copes with high-dimensional data, captures interactions between features, and exposes feature importance scores. Random Forests are comparatively resistant to overfitting. Note that scikit-learn's implementation expects numerical input, so categorical features would need to be encoded first.

Cross-Validation:

- **Stratified k-fold cross-validation** is used to evaluate the selected features. The data is split into k folds that each preserve the class distribution of the target variable. RFE is fitted on the training folds only, which keeps information from the validation fold out of the feature selection and so avoids data leakage. Evaluating across k different validation subsets gives a more reliable estimate of generalization performance than a single split.
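
The two mechanisms above can be illustrated on toy data. The following is a minimal sketch, not the notebook's pipeline: it assumes a synthetic dataset from `make_classification` (7 features, roughly 80/20 class balance), and all names ending in `_demo` are hypothetical. It shows how `step` groups features into shared ranks and how stratified folds keep the class share close to the overall share.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (an assumption): 200 samples, 7 features, imbalanced binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

# RFE down to one feature, dropping 3 features per iteration: 7 -> 4 -> 1.
rfe_demo = RFE(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    n_features_to_select=1,
    step=3,
)
rfe_demo.fit(X_demo, y_demo)

# Features dropped in the same iteration share a rank, so the ranks here are 1, 2 and 3.
rank_counts = np.bincount(rfe_demo.ranking_)[1:]

# Stratified folds keep the minority-class share close to the overall share.
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_shares = [y_demo[val_idx].mean() for _, val_idx in skf_demo.split(X_demo, y_demo)]

print("features per rank:", rank_counts.tolist())
print("overall minority share:", round(float(y_demo.mean()), 3))
print("per-fold minority share:", [round(float(s), 3) for s in fold_shares])
```

Because only rank 1 marks a retained feature, a coarse `step` trades ranking resolution for speed: a smaller `step` separates the ranks more finely at the cost of more model fits.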

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from utils import load_and_preprocess_table_data

import warnings
warnings.filterwarnings("ignore")

# Load data
config = "no_resample_cloud_disturbance_weights_3Y"
data = load_and_preprocess_table_data(config)

# Define features and target
features = ['amplitude_red', 'cos_phase_red', 'sin_phase_red', 'offset_red',
            'amplitude_green', 'cos_phase_green', 'sin_phase_green', 'offset_green',
            'amplitude_blue', 'cos_phase_blue', 'sin_phase_blue', 'offset_blue',
            'amplitude_crswir', 'cos_phase_crswir', 'sin_phase_crswir', 'offset_crswir', 
            'elevation', 'cos_aspect', 'sin_aspect']
target = 'phen'

X = data[features]
y = data[target]

# Initialize the model
model = RandomForestClassifier(n_estimators=30, random_state=42, n_jobs=-1)

# Perform stratified k-fold cross-validation
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# DataFrame to store feature rankings and performance
metrics_df = pd.DataFrame(columns=['Fold', 'Feature', 'Ranking'])
performance_df = []

for fold, (train_idx, val_idx) in tqdm(enumerate(skf.split(X, y)), total=n_splits):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

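    # Perform RFE: drop 3 features per iteration until a single feature remains.
    # RFE fits a clone of `model`, so the original estimator stays unfitted here.
    rfe = RFE(estimator=model, n_features_to_select=1, step=3)
    rfe.fit(X_train, y_train)

    # Get ranking of features (1 = retained; features dropped in the same iteration share a rank)
    ranking = rfe.ranking_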
    
    # Append feature ranking to DataFrame
    fold_ranking_df = pd.DataFrame({
        'Fold': fold + 1,
        'Feature': features,
        'Ranking': ranking
    })
    metrics_df = pd.concat([metrics_df, fold_ranking_df], ignore_index=True)

    # Evaluate the selected features on the validation fold.
    # Rank 1 marks the features RFE retained; with n_features_to_select=1 this is a single feature.
    selected_features = [feature for feature, rank in zip(features, ranking) if rank == 1]
    X_train_selected = X_train[selected_features]
    X_val_selected = X_val[selected_features]

    model.fit(X_train_selected, y_train)
    y_val_pred = model.predict(X_val_selected)
    f1 = f1_score(y_val, y_val_pred, average='weighted')

    # Append performance to the list
    performance_df.append({'Fold': fold + 1, 'F1 Score': f1})

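# Convert performance list to DataFrame
performance_df = pd.DataFrame(performance_df)

# Concatenating onto an empty DataFrame can leave these columns as object dtype;
# cast them back to integers so the plots treat 'Ranking' as numeric
metrics_df[['Fold', 'Ranking']] = metrics_df[['Fold', 'Ranking']].astype(int)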

# Calculate mean and std of F1 scores
mean_f1 = performance_df['F1 Score'].mean()
std_f1 = performance_df['F1 Score'].std()

print(f'Mean F1 Score: {mean_f1}')
print(f'Std F1 Score: {std_f1}')

# Plot feature ranking (each bar is the mean rank across folds; lower means retained longer)
plt.figure(figsize=(12, 8))
sns.barplot(x='Ranking', y='Feature', data=metrics_df.sort_values('Ranking'), palette='viridis')
plt.title('Feature Importance Ranking (RFE)')
plt.xlabel('Ranking')
plt.ylabel('Feature')
plt.show()

# Save feature rankings and performance to CSV
metrics_df.to_csv('feature_ranking_rfe_cv.csv', index=False)
performance_df.to_csv('performance_rfe_cv.csv', index=False)

print("Cross-validation complete.")

In [None]:
# Plot feature ranking and save the figure
plt.figure(figsize=(6, 8))
sns.barplot(x='Ranking', y='Feature', data=metrics_df.sort_values('Ranking'), palette='viridis')

# Customize the plot to be clean and lean
plt.title('Feature Importance Ranking (RFE)', fontsize=16)
plt.xlabel('Ranking', fontsize=14)
plt.ylabel('Feature', fontsize=14)
plt.grid(True, which='both', axis='x', linestyle='--', linewidth=0.5)
plt.tick_params(axis='both', which='major', labelsize=12)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

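# Save the figure (create the output directory first so savefig does not fail)
import os
os.makedirs('images', exist_ok=True)
plt.savefig('images/feature_importance_ranking.png', bbox_inches='tight')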
plt.show()
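
As a complement to the bar plot, the per-fold rankings can also be summarised numerically. The following is a minimal sketch, assuming a small hypothetical table shaped like `metrics_df`; the values in `demo_rankings` are invented for illustration and are not results from this notebook.

```python
import pandas as pd

# Hypothetical long-format rankings shaped like metrics_df (values are made up).
demo_rankings = pd.DataFrame({
    "Fold":    [1, 1, 1, 2, 2, 2],
    "Feature": ["elevation", "offset_red", "cos_aspect"] * 2,
    "Ranking": [1, 2, 3, 2, 1, 3],
})

# Mean and spread of each feature's rank across folds; a lower mean means
# the feature was retained longer by RFE.
rank_summary = (
    demo_rankings.groupby("Feature")["Ranking"]
    .agg(["mean", "std"])
    .sort_values("mean")
)
print(rank_summary)
```

A low mean with a low standard deviation suggests a feature that is ranked consistently well across folds, whereas a high standard deviation suggests the ranking depends on the particular split.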
