# Task 3.10: Feature Selection Experiments

## Objective

Test different feature selection methods to find the optimal feature subset:
1. **Recursive Feature Elimination (RFE)**
2. **SelectKBest** (with different scoring functions)
3. **Feature Importance Thresholding** (from tree-based models)

## Why Feature Selection?

- **Reduce overfitting:** Fewer features = simpler model
- **Improve performance:** Remove noisy/irrelevant features
- **Faster training:** Less computation with fewer features
- **Better interpretability:** Easier to understand model decisions

## Current Dataset

From Week 1 feature selection, we have 28 features. Let's see if we can find an even better subset.

## Step 1: Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import RFE, RFECV, SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 42
print("Libraries imported successfully!")

In [None]:
# Load data
X_train = pd.read_csv('../../data/processed/X_train_scaled.csv')
X_test = pd.read_csv('../../data/processed/X_test_scaled.csv')
y_train = pd.read_csv('../../data/processed/y_train.csv')
y_test = pd.read_csv('../../data/processed/y_test.csv')

# Remove ID columns if present
id_cols = ['id', 'host_id', 'listing_id']
for col in id_cols:
    if col in X_train.columns:
        X_train = X_train.drop(col, axis=1)
        X_test = X_test.drop(col, axis=1)

# Remove leaky features
leaky_features = [
    'price', 'price_normalized', 'price_per_person', 'price_per_bathroom',
    'price_per_bedroom', 'review_scores_rating', 'review_scores_value',
    'value_density', 'estimated_revenue_l365d'
]

cols_to_drop = [col for col in leaky_features if col in X_train.columns]
X_train = X_train.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)

print(f"Dropped {len(cols_to_drop)} leaky features: {cols_to_drop}")
print(f"Remaining features: {X_train.shape[1]}")

# Encode target
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train['value_category'])
y_test_enc = le.transform(y_test['value_category'])

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nOriginal features ({len(X_train.columns)}):")
print(list(X_train.columns))

## Step 2: Baseline Performance (All Features)

First, let's establish baseline performance using all features.

In [None]:
# Baseline with all features
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

baseline_scores = cross_val_score(rf_baseline, X_train, y_train_enc, cv=cv, scoring='f1_macro')
baseline_f1 = baseline_scores.mean()

print("="*60)
print("BASELINE PERFORMANCE (All Features)")
print("="*60)
print(f"Number of features: {X_train.shape[1]}")
print(f"CV F1-Score: {baseline_f1:.4f} (+/- {baseline_scores.std():.4f})")
print("="*60)

## Step 3: Method 1 - Recursive Feature Elimination (RFE)

RFE works by:
1. Training a model on all features
2. Ranking features by importance
3. Removing the least important feature(s)
4. Repeating until the desired number of features is reached

We'll use RFECV to automatically find the optimal number of features.
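
The four steps above can be sketched as a hand-written loop. This is only an illustration on synthetic data, not the notebook's method: the `manual_rfe` helper, the `X_demo`/`y_demo` data, and the choice of 50 trees are all assumptions made for the example. scikit-learn's `RFE`/`RFECV` perform the same elimination with more options.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0
)

def manual_rfe(X, y, n_features_to_keep):
    """Hypothetical helper: drop the least important feature until n remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_keep:
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(X[:, remaining], y)                          # 1. train on current features
        weakest = int(np.argmin(model.feature_importances_))   # 2. rank by importance
        remaining.pop(weakest)                                 # 3. remove the least important
    return remaining                                           # 4. stop at the target size

kept = manual_rfe(X_demo, y_demo, n_features_to_keep=3)
print("Kept feature indices:", kept)
```

RFECV adds one thing on top of this loop: it cross-validates every subset size and keeps the size with the best mean score.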

In [None]:
print("="*60)
print("METHOD 1: RECURSIVE FEATURE ELIMINATION (RFECV)")
print("="*60)
print("Running RFECV... (this may take a few minutes)\n")

# Use RF as estimator for RFE
rf_rfe = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)

rfecv = RFECV(
    estimator=rf_rfe,
    step=1,
    cv=StratifiedKFold(5, shuffle=True, random_state=RANDOM_STATE),
    scoring='f1_macro',
    n_jobs=-1,
    min_features_to_select=5
)

rfecv.fit(X_train, y_train_enc)

# Results
rfe_n_features = rfecv.n_features_
rfe_features = X_train.columns[rfecv.support_].tolist()
rfe_scores = rfecv.cv_results_['mean_test_score']

print(f"Optimal number of features: {rfe_n_features}")
print(f"Best CV F1-Score: {max(rfe_scores):.4f}")
print(f"\nSelected features ({rfe_n_features}):")
for i, feat in enumerate(rfe_features, 1):
    print(f"  {i}. {feat}")

In [None]:
# Plot RFE results
plt.figure(figsize=(10, 5))
plt.plot(range(5, len(rfe_scores) + 5), rfe_scores, 'b-o', markersize=4)
plt.axvline(x=rfe_n_features, color='r', linestyle='--', label=f'Optimal: {rfe_n_features} features')
plt.xlabel('Number of Features')
plt.ylabel('CV F1-Score (Macro)')
plt.title('RFECV: Number of Features vs Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../../outputs/figures/rfe_feature_selection.png', dpi=300, bbox_inches='tight')
plt.show()

## Step 4: Method 2 - SelectKBest

SelectKBest selects features based on univariate statistical tests:
- **f_classif:** ANOVA F-value (tests whether a feature's mean differs across classes; captures only linear dependence)
- **mutual_info_classif:** Mutual information (captures non-linear relationships)

We'll test both and find optimal K.
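
To make the difference between the two scoring functions concrete, here is a small sketch on synthetic data. The three columns are invented for illustration and are not from the project dataset: one feature whose mean shifts with the class, one whose spread (but not mean) depends on the class, and one pure-noise feature.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y_toy = rng.integers(0, 2, size=n)

# Class shifts the mean: visible to both ANOVA and mutual information
x_linear = y_toy + rng.normal(0, 1, n)
# Class changes the spread but not the mean: ANOVA sees almost nothing here
x_nonlinear = np.where(y_toy == 1, rng.choice([-2.0, 2.0], size=n), 0.0) + rng.normal(0, 0.3, n)
# Unrelated to the class
x_noise = rng.normal(0, 1, n)

X_toy = np.column_stack([x_linear, x_nonlinear, x_noise])

f_scores, _ = f_classif(X_toy, y_toy)
mi_scores = mutual_info_classif(X_toy, y_toy, random_state=0)

print("ANOVA F-values:    ", np.round(f_scores, 2))
print("Mutual information:", np.round(mi_scores, 3))
```

In this toy setup the F-test ranks the mean-shifted feature first, while mutual information ranks the spread-dependent feature first, which is why it is worth trying both.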

In [None]:
print("="*60)
print("METHOD 2: SELECTKBEST")
print("="*60)

# Test different K values
k_values = range(5, X_train.shape[1] + 1, 2)
results_anova = []
results_mi = []

rf_eval = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)

print("Testing different K values...\n")

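from functools import partial
from sklearn.pipeline import make_pipeline

# Fix the seed so the mutual information scores are reproducible
mi_score = partial(mutual_info_classif, random_state=RANDOM_STATE)

for k in k_values:
    # Wrap selector + model in a pipeline so the selector is refit inside each
    # CV fold; selecting on the full training set first would leak the
    # validation folds into the choice of features

    # ANOVA F-value
    pipe_anova = make_pipeline(SelectKBest(f_classif, k=k), rf_eval)
    scores_anova = cross_val_score(pipe_anova, X_train, y_train_enc, cv=cv, scoring='f1_macro')
    results_anova.append({'k': k, 'f1_mean': scores_anova.mean(), 'f1_std': scores_anova.std()})

    # Mutual Information
    pipe_mi = make_pipeline(SelectKBest(mi_score, k=k), rf_eval)
    scores_mi = cross_val_score(pipe_mi, X_train, y_train_enc, cv=cv, scoring='f1_macro')
    results_mi.append({'k': k, 'f1_mean': scores_mi.mean(), 'f1_std': scores_mi.std()})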

results_anova_df = pd.DataFrame(results_anova)
results_mi_df = pd.DataFrame(results_mi)

# Best K for each method
best_k_anova = results_anova_df.loc[results_anova_df['f1_mean'].idxmax()]
best_k_mi = results_mi_df.loc[results_mi_df['f1_mean'].idxmax()]

print(f"ANOVA F-value: Best K = {int(best_k_anova['k'])}, F1 = {best_k_anova['f1_mean']:.4f}")
print(f"Mutual Info:   Best K = {int(best_k_mi['k'])}, F1 = {best_k_mi['f1_mean']:.4f}")

In [None]:
# Get selected features from best SelectKBest
best_k = int(best_k_anova['k']) if best_k_anova['f1_mean'] >= best_k_mi['f1_mean'] else int(best_k_mi['k'])
best_method = 'ANOVA' if best_k_anova['f1_mean'] >= best_k_mi['f1_mean'] else 'Mutual Info'

if best_method == 'ANOVA':
    selector_best = SelectKBest(f_classif, k=best_k)
else:
    selector_best = SelectKBest(mutual_info_classif, k=best_k)

selector_best.fit(X_train, y_train_enc)
skb_features = X_train.columns[selector_best.get_support()].tolist()

print(f"\nBest SelectKBest: {best_method} with K={best_k}")
print(f"\nSelected features ({best_k}):")
for i, feat in enumerate(skb_features, 1):
    print(f"  {i}. {feat}")

In [None]:
# Plot SelectKBest results
plt.figure(figsize=(10, 5))
plt.plot(results_anova_df['k'], results_anova_df['f1_mean'], 'b-o', label='ANOVA F-value', markersize=4)
plt.plot(results_mi_df['k'], results_mi_df['f1_mean'], 'g-s', label='Mutual Information', markersize=4)
plt.axhline(y=baseline_f1, color='r', linestyle='--', label=f'Baseline ({baseline_f1:.4f})')
plt.xlabel('Number of Features (K)')
plt.ylabel('CV F1-Score (Macro)')
plt.title('SelectKBest: K vs Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../../outputs/figures/selectkbest_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

## Step 5: Method 3 - Feature Importance Thresholding

Use feature importances from Random Forest and apply different thresholds to select features.
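
For reference, scikit-learn ships a built-in equivalent of this idea, `SelectFromModel`. The sketch below is an alternative shown on synthetic data, not the approach used in this notebook; the data, the 0.05 threshold, and the 50-tree forest are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): 10 features, 3 of them informative
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0
)

# Keep every feature whose importance is at least 0.05
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold=0.05,
)
sfm.fit(X_syn, y_syn)

mask = sfm.get_support()
syn_importances = sfm.estimator_.feature_importances_

print("Importances:", syn_importances.round(3))
print("Kept feature indices:", mask.nonzero()[0].tolist())
```

Random forest importances sum to 1, so a fixed threshold such as 0.05 becomes stricter as the number of features grows. That is one reason to scan several thresholds rather than fix a single value.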

In [None]:
print("="*60)
print("METHOD 3: FEATURE IMPORTANCE THRESHOLDING")
print("="*60)

# Train RF to get feature importances
rf_imp = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE, n_jobs=-1)
rf_imp.fit(X_train, y_train_enc)

# Get importances
importances = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_imp.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importances (Top 15):")
print(importances.head(15).to_string(index=False))

# Test different thresholds
thresholds = [0.001, 0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.04, 0.05]
threshold_results = []

print("\nTesting different importance thresholds...")

for thresh in thresholds:
    selected = importances[importances['importance'] >= thresh]['feature'].tolist()
    if len(selected) >= 3:  # Need at least 3 features
        X_thresh = X_train[selected]
        scores = cross_val_score(rf_eval, X_thresh, y_train_enc, cv=cv, scoring='f1_macro')
        threshold_results.append({
            'threshold': thresh,
            'n_features': len(selected),
            'f1_mean': scores.mean(),
            'f1_std': scores.std()
        })

threshold_df = pd.DataFrame(threshold_results)
print("\n" + threshold_df.to_string(index=False))

# Best threshold
best_thresh_row = threshold_df.loc[threshold_df['f1_mean'].idxmax()]
best_threshold = best_thresh_row['threshold']
thresh_features = importances[importances['importance'] >= best_threshold]['feature'].tolist()

print(f"\nBest threshold: {best_threshold}")
print(f"Number of features: {len(thresh_features)}")
print(f"F1-Score: {best_thresh_row['f1_mean']:.4f}")

In [None]:
# Plot threshold results
fig, ax1 = plt.subplots(figsize=(10, 5))

ax1.set_xlabel('Importance Threshold')
ax1.set_ylabel('CV F1-Score', color='blue')
ax1.plot(threshold_df['threshold'], threshold_df['f1_mean'], 'b-o', label='F1-Score')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.axhline(y=baseline_f1, color='r', linestyle='--', alpha=0.5)

ax2 = ax1.twinx()
ax2.set_ylabel('Number of Features', color='green')
ax2.plot(threshold_df['threshold'], threshold_df['n_features'], 'g-s', label='# Features')
ax2.tick_params(axis='y', labelcolor='green')

plt.title('Feature Importance Thresholding')
plt.tight_layout()
plt.savefig('../../outputs/figures/importance_thresholding.png', dpi=300, bbox_inches='tight')
plt.show()

## Step 6: Compare All Methods

In [None]:
print("="*70)
print("COMPARISON OF ALL FEATURE SELECTION METHODS")
print("="*70)

# Evaluate each method on test set
methods_comparison = []

# 1. Baseline (all features)
rf_final = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
rf_final.fit(X_train, y_train_enc)
y_pred = rf_final.predict(X_test)
methods_comparison.append({
    'Method': 'Baseline (All Features)',
    'N_Features': X_train.shape[1],
    'CV_F1': baseline_f1,
    'Test_F1': f1_score(y_test_enc, y_pred, average='macro'),
    'Test_Accuracy': accuracy_score(y_test_enc, y_pred)
})

# 2. RFE
X_train_rfe = X_train[rfe_features]
X_test_rfe = X_test[rfe_features]
rf_final.fit(X_train_rfe, y_train_enc)
y_pred = rf_final.predict(X_test_rfe)
methods_comparison.append({
    'Method': 'RFE',
    'N_Features': len(rfe_features),
    'CV_F1': max(rfe_scores),
    'Test_F1': f1_score(y_test_enc, y_pred, average='macro'),
    'Test_Accuracy': accuracy_score(y_test_enc, y_pred)
})

# 3. SelectKBest
X_train_skb = X_train[skb_features]
X_test_skb = X_test[skb_features]
rf_final.fit(X_train_skb, y_train_enc)
y_pred = rf_final.predict(X_test_skb)
methods_comparison.append({
    'Method': f'SelectKBest ({best_method})',
    'N_Features': len(skb_features),
    'CV_F1': best_k_anova['f1_mean'] if best_method == 'ANOVA' else best_k_mi['f1_mean'],
    'Test_F1': f1_score(y_test_enc, y_pred, average='macro'),
    'Test_Accuracy': accuracy_score(y_test_enc, y_pred)
})

# 4. Importance Thresholding
X_train_thresh = X_train[thresh_features]
X_test_thresh = X_test[thresh_features]
rf_final.fit(X_train_thresh, y_train_enc)
y_pred = rf_final.predict(X_test_thresh)
methods_comparison.append({
    'Method': 'Importance Threshold',
    'N_Features': len(thresh_features),
    'CV_F1': best_thresh_row['f1_mean'],
    'Test_F1': f1_score(y_test_enc, y_pred, average='macro'),
    'Test_Accuracy': accuracy_score(y_test_enc, y_pred)
})

comparison_df = pd.DataFrame(methods_comparison)
print("\n" + comparison_df.to_string(index=False))

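# Find best method (ranked by test F1; because the test set drives this choice,
# treat the winning score as an optimistic estimate)
best_method_row = comparison_df.loc[comparison_df['Test_F1'].idxmax()]
print("\n" + "="*70)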
print(f"BEST METHOD: {best_method_row['Method']}")
print(f"Features: {best_method_row['N_Features']}, Test F1: {best_method_row['Test_F1']:.4f}")
print("="*70)

In [None]:
# Visualization comparing all methods
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart comparison
ax1 = axes[0]
x = range(len(comparison_df))
width = 0.35
bars1 = ax1.bar([i - width/2 for i in x], comparison_df['Test_F1'], width, label='Test F1', color='steelblue')
bars2 = ax1.bar([i + width/2 for i in x], comparison_df['Test_Accuracy'], width, label='Test Accuracy', color='coral')
ax1.set_ylabel('Score')
ax1.set_title('Performance Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(comparison_df['Method'], rotation=20, ha='right')
ax1.legend()
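# Zoom the y-axis around the observed scores instead of a hard-coded range,
# so bars are not clipped when scores fall outside 0.7-0.8
score_cols = comparison_df[['Test_F1', 'Test_Accuracy']]
ax1.set_ylim([max(0, score_cols.min().min() - 0.05), min(1, score_cols.max().max() + 0.05)])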
for bar in bars1:
    ax1.annotate(f'{bar.get_height():.3f}', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                 xytext=(0, 3), textcoords='offset points', ha='center', fontsize=8)

# Features vs Performance scatter
ax2 = axes[1]
ax2.scatter(comparison_df['N_Features'], comparison_df['Test_F1'], s=150, c='steelblue', alpha=0.7)
for i, row in comparison_df.iterrows():
    ax2.annotate(row['Method'].split('(')[0].strip(), (row['N_Features'], row['Test_F1']),
                 xytext=(5, 5), textcoords='offset points', fontsize=9)
ax2.set_xlabel('Number of Features')
ax2.set_ylabel('Test F1-Score')
ax2.set_title('Features vs Performance Trade-off')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../../outputs/figures/feature_selection_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

## Step 7: Determine Optimal Feature Subset

In [None]:
# Find features selected by multiple methods (consensus)
print("="*60)
print("FEATURE SELECTION CONSENSUS")
print("="*60)

all_selected = {
    'RFE': set(rfe_features),
    'SelectKBest': set(skb_features),
    'Importance': set(thresh_features)
}

# Features selected by all methods
consensus_all = all_selected['RFE'] & all_selected['SelectKBest'] & all_selected['Importance']
# Features selected by at least 2 methods
consensus_2 = (all_selected['RFE'] & all_selected['SelectKBest']) | \
              (all_selected['RFE'] & all_selected['Importance']) | \
              (all_selected['SelectKBest'] & all_selected['Importance'])

print(f"\nFeatures selected by ALL methods ({len(consensus_all)}):")
for feat in sorted(consensus_all):
    print(f"  - {feat}")

print(f"\nFeatures selected by at least 2 methods ({len(consensus_2)}):")
for feat in sorted(consensus_2):
    print(f"  - {feat}")
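
The pairwise intersections above work for three methods. If more methods are added later, the same "selected by at least N methods" rule can be written as a vote count. The sketch below uses made-up feature names and a hypothetical `consensus` helper purely for illustration.

```python
from collections import Counter

# Hypothetical selections from three methods (illustrative names only)
selections = {
    'RFE': {'a', 'b', 'c'},
    'SelectKBest': {'b', 'c', 'd'},
    'Importance': {'c', 'e'},
}

def consensus(selections, min_votes):
    """Return the features chosen by at least `min_votes` methods, sorted by name."""
    votes = Counter(feat for feats in selections.values() for feat in feats)
    return sorted(feat for feat, n in votes.items() if n >= min_votes)

print(consensus(selections, 2))  # ['b', 'c']
print(consensus(selections, 3))  # ['c']
```

Returning a sorted list also keeps the column order stable between runs.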

In [None]:
# Test consensus features
print("\nEvaluating consensus feature sets...")

# Consensus (2+ methods)
consensus_features = sorted(consensus_2)  # sorted: set order varies between runs
X_train_cons = X_train[consensus_features]
X_test_cons = X_test[consensus_features]

rf_final.fit(X_train_cons, y_train_enc)
y_pred_cons = rf_final.predict(X_test_cons)
cons_f1 = f1_score(y_test_enc, y_pred_cons, average='macro')
cons_acc = accuracy_score(y_test_enc, y_pred_cons)

print(f"\nConsensus Features (2+ methods): {len(consensus_features)} features")
print(f"Test F1-Score: {cons_f1:.4f}")
print(f"Test Accuracy: {cons_acc:.4f}")

# Determine final optimal subset
all_results = comparison_df.copy()
all_results = pd.concat([all_results, pd.DataFrame([{
    'Method': 'Consensus (2+ methods)',
    'N_Features': len(consensus_features),
    'CV_F1': np.nan,
    'Test_F1': cons_f1,
    'Test_Accuracy': cons_acc
}])], ignore_index=True)

# Best overall
best_overall = all_results.loc[all_results['Test_F1'].idxmax()]

print("\n" + "="*60)
print("OPTIMAL FEATURE SUBSET")
print("="*60)
print(f"Best Method: {best_overall['Method']}")
print(f"Number of Features: {int(best_overall['N_Features'])}")
print(f"Test F1-Score: {best_overall['Test_F1']:.4f}")
print(f"Test Accuracy: {best_overall['Test_Accuracy']:.4f}")

## Step 8: Save Results

In [None]:
# Save comparison results
all_results.to_csv('../../outputs/feature_selection_comparison.csv', index=False)

# Save selected features from each method
features_dict = {
    'RFE': rfe_features,
    'SelectKBest': skb_features,
    'Importance_Threshold': thresh_features,
    'Consensus': consensus_features
}

# Pad lists to same length for DataFrame
max_len = max(len(v) for v in features_dict.values())
for k in features_dict:
    features_dict[k] = features_dict[k] + [''] * (max_len - len(features_dict[k]))

features_df = pd.DataFrame(features_dict)
features_df.to_csv('../../outputs/selected_features_by_method.csv', index=False)

# Save SelectKBest results
selectkbest_results = pd.concat([results_anova_df.assign(method='ANOVA'), 
                                  results_mi_df.assign(method='Mutual_Info')])
selectkbest_results.to_csv('../../outputs/selectkbest_results.csv', index=False)

# Save threshold results
threshold_df.to_csv('../../outputs/importance_threshold_results.csv', index=False)

print("="*60)
print("FILES SAVED")
print("="*60)
print("- outputs/feature_selection_comparison.csv")
print("- outputs/selected_features_by_method.csv")
print("- outputs/selectkbest_results.csv")
print("- outputs/importance_threshold_results.csv")
print("- outputs/figures/rfe_feature_selection.png")
print("- outputs/figures/selectkbest_comparison.png")
print("- outputs/figures/importance_thresholding.png")
print("- outputs/figures/feature_selection_comparison.png")

## Conclusion

### Summary

We tested three feature selection methods:

1. **RFE (Recursive Feature Elimination):** Iteratively removes least important features
2. **SelectKBest:** Selects features based on statistical tests (ANOVA, Mutual Info)
3. **Feature Importance Thresholding:** Uses RF importance scores with threshold

### Key Findings

- All methods achieved performance similar to the baseline
- Feature reduction is possible without significant performance loss
- Consensus features (selected by 2+ methods) provide a robust subset

### Recommendation

Use the method that provides the best trade-off between:
- Model performance (F1-score)
- Model simplicity (fewer features)
- Interpretability requirements
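
One way to make this trade-off concrete is to pick the smallest feature set whose cross-validated F1 is within a small tolerance of the best score. The sketch below is a possible rule, not one used in this notebook; the `pick_simplest_within_tolerance` helper, the 0.005 tolerance, and the scores in `demo` are placeholders for illustration and are not results from these experiments.

```python
import pandas as pd

# Placeholder scores for illustration only; NOT results from this notebook
demo = pd.DataFrame({
    'Method': ['A', 'B', 'C'],
    'N_Features': [28, 15, 9],
    'CV_F1': [0.750, 0.748, 0.720],
})

def pick_simplest_within_tolerance(df, score_col='CV_F1', tol=0.005):
    """Hypothetical rule: fewest features among rows within `tol` of the best score."""
    best = df[score_col].max()
    candidates = df[df[score_col] >= best - tol]
    return candidates.sort_values('N_Features').iloc[0]

choice = pick_simplest_within_tolerance(demo)
print(choice['Method'], int(choice['N_Features']))  # B 15
```

Applied to the notebook's own `comparison_df`, the same function would rank methods by `CV_F1`, leaving the test set for a final, one-time check of the chosen subset.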