# Module 08: Feature Selection Methods

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 70 minutes  
**Prerequisites**: Module 07 (Text Feature Engineering)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand why feature selection matters and the curse of dimensionality
2. Apply filter methods (correlation, chi-square, mutual information)
3. Implement wrapper methods (RFE, forward/backward selection)
4. Use embedded methods (Lasso, tree-based feature importance)
5. Compare feature selection methods and visualize performance vs. feature count
6. Choose the appropriate feature selection method for different scenarios

## 1. Why Feature Selection Matters

**More features ≠ Better models!**

**Problems with too many features**:
- **Curse of dimensionality**: The feature space grows exponentially with the number of features, so a fixed-size dataset covers it more and more sparsely
- **Overfitting**: Model memorizes noise instead of learning patterns
- **Slow training**: More features = more computation
- **Poor interpretability**: Hard to understand which features matter
- **Multicollinearity**: Highly correlated features make coefficients unstable and importances hard to interpret
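
One way to see the curse of dimensionality numerically: for random points, the nearest and farthest neighbours of a query point become almost equally distant as the number of dimensions grows. The sketch below assumes uniformly random data, and the helper name `distance_contrast` is illustrative rather than part of any library.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Return (max_dist - min_dist) / min_dist from one query point to all others."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = points[0]
    # Euclidean distance from the query point to every other point
    dists = np.linalg.norm(points[1:] - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Compare the relative contrast across increasing dimensionality
contrasts = {d: distance_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

A small contrast means "near" and "far" points are hard to tell apart, which is one reason distance-based and density-based reasoning degrades when many irrelevant features are present.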

**Benefits of feature selection**:
- ✅ Better generalization (less overfitting)
- ✅ Faster training and prediction
- ✅ Improved model interpretability
- ✅ Reduced storage and memory requirements

**Three main approaches**:
1. **Filter methods**: Score each feature with a statistical test (fast, model-agnostic)
2. **Wrapper methods**: Search feature subsets using model performance (slower, often more accurate)
3. **Embedded methods**: Select features as part of model training (a middle ground)

## 2. Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_breast_cancer

# Feature selection methods
from sklearn.feature_selection import (
    SelectKBest, chi2, f_classif, mutual_info_classif,
    RFE, SequentialFeatureSelector,
    SelectFromModel
)

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.tree import DecisionTreeClassifier

# Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)

print("✓ Setup complete!")

## 3. The Curse of Dimensionality

Let's demonstrate why too many features can hurt model performance.

In [None]:
# Demonstrate curse of dimensionality
n_samples = 200
feature_counts = [5, 10, 20, 50, 100, 200]
results = []

for n_features in feature_counts:
    # Create dataset with mostly irrelevant features
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=5,  # Only 5 features are actually useful!
        n_redundant=0,
        n_repeated=0,
        random_state=42
    )
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train model
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    
    # Evaluate
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    results.append({
        'Num Features': n_features,
        'Train Accuracy': train_score,
        'Test Accuracy': test_score,
        'Overfitting': train_score - test_score
    })

results_df = pd.DataFrame(results)
print("The Curse of Dimensionality:")
print("="*60)
print(results_df)
print("\nNotice: as noise features are added, test accuracy tends to drop and the train-test gap widens.")
print("Only 5 features are informative, rest are noise.")

In [None]:
# Visualize curse of dimensionality
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Performance vs number of features
axes[0].plot(results_df['Num Features'], results_df['Train Accuracy'], 
            marker='o', label='Train Accuracy', linewidth=2)
axes[0].plot(results_df['Num Features'], results_df['Test Accuracy'], 
            marker='s', label='Test Accuracy', linewidth=2)
axes[0].set_xlabel('Number of Features')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Curse of Dimensionality\n(Only 5 features are informative)', 
                  fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Overfitting measure
axes[1].bar(results_df['Num Features'], results_df['Overfitting'], 
           color='coral', edgecolor='black')
axes[1].set_xlabel('Number of Features')
axes[1].set_ylabel('Overfitting (Train - Test Accuracy)')
axes[1].set_title('Overfitting Increases with Irrelevant Features', 
                  fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Key insight: Adding irrelevant features hurts performance!")
print("Solution: Feature selection to remove noise and keep signal.")

## 4. Create Dataset for Feature Selection Demo

We'll use the Breast Cancer Wisconsin dataset - a classic binary classification problem.

In [None]:
# Load breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

print(f"Dataset shape: {X.shape}")
print(f"  - {X.shape[0]} samples")
print(f"  - {X.shape[1]} features")
print(f"\nTarget distribution:")
print(f"  - Malignant: {(y==0).sum()}")
print(f"  - Benign: {(y==1).sum()}")
print(f"\nFeature names:")
print(list(X.columns[:10]), "...")
print(f"\nFirst few rows:")
X.head()

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features (important for some methods)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

## 5. Filter Methods: Statistical Tests

**Filter methods** evaluate features independently using statistical tests:
- Fast and scalable
- Model-agnostic
- Don't consider feature interactions

**Common filter methods**:
1. **Correlation**: Linear relationship with target
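2. **Chi-square (χ²)**: Independence test for categorical or count data (scikit-learn's `chi2` requires non-negative feature values)
3. **ANOVA F-statistic**: Ratio of between-class to within-class variance of a feature
4. **Mutual Information**: Any statistical dependence, including non-linear relationships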
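
Chi-square is listed above but not demonstrated in the cells that follow, so here is a minimal sketch. It assumes the Breast Cancer Wisconsin data used later in this notebook, whose features are all non-negative, and it uses suffixed variable names (`X_chi`, `selector_chi2`, and so on) that are illustrative only. Note that `chi2` is designed for counts and frequencies; on continuous measurements the scores depend on feature scale, so treat this as an API illustration rather than a recommended ranking.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

data_chi = load_breast_cancer(as_frame=True)
X_chi, y_chi = data_chi.data, data_chi.target

X_chi_train, X_chi_test, y_chi_train, y_chi_test = train_test_split(
    X_chi, y_chi, test_size=0.3, random_state=42, stratify=y_chi
)

# chi2 raises an error on negative inputs, so check before fitting
if (X_chi_train.values < 0).any():
    raise ValueError("chi2 needs non-negative features; rescale first")

# Fit on the training split only, keeping the 10 highest-scoring features
selector_chi2 = SelectKBest(score_func=chi2, k=10)
selector_chi2.fit(X_chi_train, y_chi_train)

chi2_scores = pd.Series(selector_chi2.scores_, index=X_chi.columns)
selected_features_chi2 = X_chi.columns[selector_chi2.get_support()].tolist()

print(chi2_scores.sort_values(ascending=False).head(10))
print(f"\nSelected features: {selected_features_chi2}")
```

If a dataset contains negative values, one common option is to rescale with `MinMaxScaler` or to bin the features before applying `chi2`.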

### 5.1 Correlation-Based Selection

In [None]:
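# Calculate correlation with target
# Note: Pearson correlation with a binary target is the point-biserial correlation
# y_train is a NumPy array, so give it X_train's index; corrwith aligns on index labels
correlations = X_train.corrwith(pd.Series(y_train, index=X_train.index))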
correlations_abs = correlations.abs().sort_values(ascending=False)

print("Top 10 features by absolute correlation with target:")
print(correlations_abs.head(10))

# Visualize
plt.figure(figsize=(12, 8))
correlations_abs.plot(kind='barh', color='steelblue', edgecolor='black')
plt.xlabel('Absolute Correlation with Target')
plt.title('Feature Correlation with Target', fontsize=12, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# Select top k features
k = 10
top_features_corr = correlations_abs.head(k).index.tolist()
print(f"\nSelected top {k} features: {top_features_corr}")
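
Ranking by correlation with the target says nothing about redundancy between the features themselves, which is the multicollinearity issue raised in Section 1. The following sketch drops any column whose absolute correlation with an earlier column exceeds a threshold. The helper `drop_highly_correlated` and the 0.9 cut-off are assumptions chosen for illustration, not a standard API or a recommended value.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

def drop_highly_correlated(df, threshold=0.9):
    """Drop every column whose |correlation| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

features = load_breast_cancer(as_frame=True).data
reduced, dropped = drop_highly_correlated(features, threshold=0.9)

print(f"Kept {reduced.shape[1]} of {features.shape[1]} features")
print(f"Dropped: {dropped}")
```

In a real workflow the correlation matrix would be computed on the training split only. The result also depends on column order, because the first column of each correlated group is the one that survives.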

### 5.2 ANOVA F-test Selection

In [None]:
# Use ANOVA F-statistic for feature selection
k = 10
selector_f = SelectKBest(score_func=f_classif, k=k)
selector_f.fit(X_train, y_train)

# Get scores
f_scores = pd.Series(selector_f.scores_, index=X.columns)
f_scores_sorted = f_scores.sort_values(ascending=False)

print(f"Top {k} features by ANOVA F-statistic:")
print(f_scores_sorted.head(k))

# Selected features
selected_features_f = X.columns[selector_f.get_support()].tolist()
print(f"\nSelected features: {selected_features_f}")

### 5.3 Mutual Information Selection

In [None]:
# Mutual Information - captures non-linear relationships
k = 10
selector_mi = SelectKBest(score_func=mutual_info_classif, k=k)
selector_mi.fit(X_train, y_train)

# Get scores
mi_scores = pd.Series(selector_mi.scores_, index=X.columns)
mi_scores_sorted = mi_scores.sort_values(ascending=False)

print(f"Top {k} features by Mutual Information:")
print(mi_scores_sorted.head(k))

# Selected features
selected_features_mi = X.columns[selector_mi.get_support()].tolist()
print(f"\nSelected features: {selected_features_mi}")

In [None]:
# Compare different filter methods
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Correlation
top10_corr = correlations_abs.head(10)
axes[0].barh(range(len(top10_corr)), top10_corr.values, color='skyblue', edgecolor='black')
axes[0].set_yticks(range(len(top10_corr)))
axes[0].set_yticklabels(top10_corr.index, fontsize=8)
axes[0].set_xlabel('Absolute Correlation')
axes[0].set_title('Top 10 by Correlation', fontweight='bold')
axes[0].invert_yaxis()

# F-statistic
top10_f = f_scores_sorted.head(10)
axes[1].barh(range(len(top10_f)), top10_f.values, color='lightcoral', edgecolor='black')
axes[1].set_yticks(range(len(top10_f)))
axes[1].set_yticklabels(top10_f.index, fontsize=8)
axes[1].set_xlabel('F-Statistic')
axes[1].set_title('Top 10 by ANOVA F-test', fontweight='bold')
axes[1].invert_yaxis()

# Mutual Information
top10_mi = mi_scores_sorted.head(10)
axes[2].barh(range(len(top10_mi)), top10_mi.values, color='lightgreen', edgecolor='black')
axes[2].set_yticks(range(len(top10_mi)))
axes[2].set_yticklabels(top10_mi.index, fontsize=8)
axes[2].set_xlabel('Mutual Information')
axes[2].set_title('Top 10 by Mutual Information', fontweight='bold')
axes[2].invert_yaxis()

plt.tight_layout()
plt.show()

print("Notice: Different methods may select different features!")
print("With a binary target, correlation and the F-test give the same ranking; MI can also capture non-linear relationships.")

## 6. Wrapper Methods: Using Model Performance

**Wrapper methods** use actual model performance to select features:
- Often more accurate than filter methods, since features are judged by the model that will use them
- Computationally expensive
- Consider feature interactions

**Common wrapper methods**:
1. **Recursive Feature Elimination (RFE)**: Iteratively remove least important features
2. **Forward Selection**: Start with 0, add features one by one
3. **Backward Selection**: Start with all, remove features one by one

### 6.1 Recursive Feature Elimination (RFE)

In [None]:
# RFE with Logistic Regression
k = 10
estimator = LogisticRegression(max_iter=1000, random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=k)

# Fit RFE
rfe.fit(X_train_scaled, y_train)

# Get selected features
selected_features_rfe = X.columns[rfe.support_].tolist()
feature_ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()

print(f"RFE selected top {k} features:")
print(selected_features_rfe)
print(f"\nFeature ranking (1 = selected):")
print(feature_ranking.head(15))

In [None]:
# Visualize RFE ranking
plt.figure(figsize=(12, 8))
colors = ['green' if r == 1 else 'gray' for r in feature_ranking.values]
plt.barh(range(len(feature_ranking)), feature_ranking.values, color=colors, edgecolor='black')
plt.yticks(range(len(feature_ranking)), feature_ranking.index, fontsize=8)
plt.xlabel('Ranking (1 = Selected)')
plt.title(f'RFE Feature Ranking (Top {k} in green)', fontsize=12, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### 6.2 Forward/Backward Sequential Selection

In [None]:
# Forward Selection (starts with 0 features, adds one at a time)
k = 10
estimator = LogisticRegression(max_iter=1000, random_state=42)

# Note: This can be slow, so we use a simpler estimator
sfs_forward = SequentialFeatureSelector(
    estimator, 
    n_features_to_select=k,
    direction='forward',
    cv=3  # Use cross-validation
)

print("Running forward selection (this may take a minute)...")
sfs_forward.fit(X_train_scaled, y_train)

selected_features_forward = X.columns[sfs_forward.get_support()].tolist()
print(f"\nForward Selection - selected {k} features:")
print(selected_features_forward)
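
Backward selection uses the same `SequentialFeatureSelector` class with `direction='backward'`: it starts from all features and removes one at a time. The sketch below is self-contained and, purely to keep the runtime short, assumes only the first 10 columns of the breast cancer data (the "mean" measurements) and keeps 5 of them. The `_small` variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_small = load_breast_cancer(as_frame=True)
# Assumption for speed: restrict to the first 10 columns
X_small = data_small.data.iloc[:, :10]
y_small = data_small.target

X_small_train, X_small_test, y_small_train, y_small_test = train_test_split(
    X_small, y_small, test_size=0.3, random_state=42, stratify=y_small
)
X_small_train_scaled = StandardScaler().fit_transform(X_small_train)

# Backward selection: start with all 10 features, drop one per step until 5 remain
sfs_backward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000, random_state=42),
    n_features_to_select=5,
    direction='backward',
    cv=3,
)
sfs_backward.fit(X_small_train_scaled, y_small_train)

selected_features_backward = X_small.columns[sfs_backward.get_support()].tolist()
print(f"Backward Selection - selected features: {selected_features_backward}")
```

Backward selection fits more models than forward selection when the target subset is small relative to the total, so on the full 30-feature set expect it to run noticeably longer than the forward pass above.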

## 7. Embedded Methods: Built-in Feature Selection

**Embedded methods** perform feature selection during model training:
- Balance between filter and wrapper methods
- Model-specific
- Usually cheaper than wrapper methods, since selection comes out of a single training run

**Common embedded methods**:
1. **Lasso (L1 Regularization)**: Shrinks some coefficients to exactly zero
2. **Tree-based importance**: From Random Forest, XGBoost, etc.
3. **Ridge (L2 Regularization)**: Shrinks coefficients but doesn't zero them out, so it does not select features on its own
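
The difference between L1 and L2 penalties can be checked by counting coefficients that are exactly zero. This is a minimal sketch on the breast cancer data, assuming the same `liblinear` solver and `C=0.1` used in the next subsection. It fits on the full dataset only because the goal is to compare sparsity, not to measure accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, solver="liblinear", C=0.1, random_state=42)
    clf.fit(X_demo_scaled, y_demo)
    # Count coefficients that the penalty pushed to exactly zero
    zero_counts[penalty] = int(np.sum(clf.coef_[0] == 0))
    print(f"{penalty}: {zero_counts[penalty]} of {clf.coef_.shape[1]} coefficients are exactly zero")
```

Only the L1 penalty produces exact zeros, which is why it can act as a feature selector while L2 mainly stabilizes the coefficients.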

### 7.1 Lasso (L1 Regularization)

In [None]:
# L1-penalized logistic regression (the classification counterpart of Lasso)
# LogisticRegression was already imported in Setup

# L1 regularization encourages sparsity; a smaller C means stronger regularization
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, random_state=42)
lasso.fit(X_train_scaled, y_train)

# Get non-zero coefficients
coefficients = pd.Series(lasso.coef_[0], index=X.columns)
non_zero = coefficients[coefficients != 0]

print(f"Lasso selected {len(non_zero)} features (non-zero coefficients):")
print(non_zero.sort_values(key=abs, ascending=False))

In [None]:
# Visualize Lasso coefficients
plt.figure(figsize=(12, 8))
coefficients_sorted = coefficients.abs().sort_values(ascending=False)
colors = ['green' if c != 0 else 'lightgray' for c in coefficients_sorted.values]
plt.barh(range(len(coefficients_sorted)), coefficients_sorted.values, 
        color=colors, edgecolor='black')
plt.yticks(range(len(coefficients_sorted)), coefficients_sorted.index, fontsize=8)
plt.xlabel('Absolute Coefficient Value')
plt.title('Lasso Coefficients (Green = Selected, Gray = Zero)', fontsize=12, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print(f"Lasso automatically selected {len(non_zero)} features by setting others to zero.")

### 7.2 Tree-Based Feature Importance

In [None]:
# Random Forest feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances_sorted = importances.sort_values(ascending=False)

print("Top 10 features by Random Forest importance:")
print(importances_sorted.head(10))

# Select top k features
k = 10
selected_features_rf = importances_sorted.head(k).index.tolist()
print(f"\nSelected top {k} features: {selected_features_rf}")

In [None]:
# Visualize Random Forest feature importance
plt.figure(figsize=(12, 8))
importances_sorted.plot(kind='barh', color='forestgreen', edgecolor='black')
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance', fontsize=12, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
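
`SelectFromModel` is imported in Setup but not used in the cells above. It wraps the "fit a model, then keep features above an importance threshold" pattern in a transformer. The sketch below assumes a Random Forest and `threshold="median"`, which keeps features whose importance is at least the median. Both choices and the `_sfm` variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

data_sfm = load_breast_cancer(as_frame=True)
X_sfm_train, X_sfm_test, y_sfm_train, y_sfm_test = train_test_split(
    data_sfm.data, data_sfm.target, test_size=0.3, random_state=42, stratify=data_sfm.target
)

# Fit the forest and keep features with importance >= the median importance
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
sfm.fit(X_sfm_train, y_sfm_train)

selected_features_sfm = X_sfm_train.columns[sfm.get_support()].tolist()
X_sfm_train_reduced = sfm.transform(X_sfm_train)
X_sfm_test_reduced = sfm.transform(X_sfm_test)

print(f"Kept {len(selected_features_sfm)} of {X_sfm_train.shape[1]} features")
print(selected_features_sfm)
```

Because it is a transformer, `SelectFromModel` can be placed inside a scikit-learn `Pipeline`, so the selection is refit whenever the pipeline is refit.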

## 8. Compare All Feature Selection Methods

Let's compare performance of different feature selection approaches.

In [None]:
# Compare all methods
k = 10

feature_sets = {
    'All Features': list(X.columns),
    'Correlation (top 10)': top_features_corr,
    'ANOVA F-test (top 10)': selected_features_f,
    'Mutual Information (top 10)': selected_features_mi,
    'RFE (top 10)': selected_features_rfe,
    'Forward Selection (top 10)': selected_features_forward,
    'Lasso (auto)': non_zero.index.tolist(),
    'Random Forest (top 10)': selected_features_rf
}

results = []

for name, features in feature_sets.items():
    # Get feature indices
    feature_indices = [X.columns.get_loc(f) for f in features]
    
    # Select features from scaled data
    X_train_subset = X_train_scaled[:, feature_indices]
    X_test_subset = X_test_scaled[:, feature_indices]
    
    # Train model
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_subset, y_train)
    
    # Evaluate
    train_acc = model.score(X_train_subset, y_train)
    test_acc = model.score(X_test_subset, y_test)
    
    results.append({
        'Method': name,
        'Num Features': len(features),
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc
    })

results_df = pd.DataFrame(results)
print("\nFeature Selection Method Comparison:")
print("="*80)
results_df

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Test accuracy comparison
results_df_sorted = results_df.sort_values('Test Accuracy', ascending=True)
colors = ['red' if m == 'All Features' else 'steelblue' for m in results_df_sorted['Method']]
axes[0].barh(results_df_sorted['Method'], results_df_sorted['Test Accuracy'], 
            color=colors, edgecolor='black')
axes[0].set_xlabel('Test Accuracy')
axes[0].set_title('Test Accuracy by Feature Selection Method', fontsize=12, fontweight='bold')
axes[0].set_xlim([0.9, 1.0])
axes[0].grid(True, alpha=0.3, axis='x')

# Feature count vs accuracy
axes[1].scatter(results_df['Num Features'], results_df['Test Accuracy'], 
               s=100, alpha=0.6, edgecolor='black')
for idx, row in results_df.iterrows():
    axes[1].annotate(row['Method'], 
                    (row['Num Features'], row['Test Accuracy']),
                    fontsize=7, ha='left')
axes[1].set_xlabel('Number of Features')
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Feature Count vs Performance', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey observations:")
print("- Using ALL features doesn't guarantee best performance")
print("- Good feature selection can match or beat using all features")
print("- Different methods select different features but achieve similar performance")

## 9. Performance vs. Number of Features

Let's see how performance changes as we vary the number of selected features.

In [None]:
# Test different numbers of features using Mutual Information
feature_counts = [1, 2, 5, 10, 15, 20, 30]
performance_results = []

for k in feature_counts:
    # Select top k features using Mutual Information
    # (scaled inputs help LogisticRegression converge)
    selector = SelectKBest(score_func=mutual_info_classif, k=min(k, X.shape[1]))
    selector.fit(X_train_scaled, y_train)
    
    X_train_selected = selector.transform(X_train_scaled)
    X_test_selected = selector.transform(X_test_scaled)
    
    # Train model
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_selected, y_train)
    
    # Evaluate
    train_acc = model.score(X_train_selected, y_train)
    test_acc = model.score(X_test_selected, y_test)
    
    performance_results.append({
        'Num Features': k,
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc
    })

perf_df = pd.DataFrame(performance_results)
print("Performance vs Number of Features:")
print(perf_df)

In [None]:
# Visualize performance curve
plt.figure(figsize=(12, 6))
plt.plot(perf_df['Num Features'], perf_df['Train Accuracy'], 
        marker='o', label='Train Accuracy', linewidth=2)
plt.plot(perf_df['Num Features'], perf_df['Test Accuracy'], 
        marker='s', label='Test Accuracy', linewidth=2)
plt.xlabel('Number of Features Selected')
plt.ylabel('Accuracy')
plt.title('Model Performance vs Number of Features\n(Using Mutual Information Selection)', 
         fontsize=12, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

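# Find the best number of features on this test split
# Caution: picking k from test accuracy makes that accuracy optimistic;
# use cross-validation on the training set when tuning k for real
optimal_k = perf_df.loc[perf_df['Test Accuracy'].idxmax(), 'Num Features']
optimal_acc = perf_df['Test Accuracy'].max()

print(f"\nBest number of features on this split: {optimal_k}")
print(f"Best test accuracy: {optimal_acc:.4f}")
print(f"\nDiminishing returns after ~{optimal_k} features!")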
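
The cell above reads the best feature count off the test set, which is fine for illustration but biases the reported accuracy. A sketch of one alternative is to tune `k` by cross-validation on the training data and touch the test set once at the end. It assumes `f_classif` in place of mutual information (to keep it fast and deterministic), 5-fold CV, and illustrative `_cv` variable names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_breast_cancer(return_X_y=True)
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
    X_cv, y_cv, test_size=0.3, random_state=42, stratify=y_cv
)

# Scaling, selection, and the classifier are refit inside every CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

k_grid = [1, 2, 5, 10, 15, 20, 30]
search = GridSearchCV(pipe, param_grid={"select__k": k_grid}, cv=5, scoring="accuracy")
search.fit(X_cv_train, y_cv_train)

cv_best_k = search.best_params_["select__k"]
held_out_acc = search.score(X_cv_test, y_cv_test)  # test set used once, after tuning

print(f"Best k by cross-validation: {cv_best_k}")
print(f"Mean CV accuracy at best k: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {held_out_acc:.4f}")
```

The held-out accuracy from this procedure is a fairer estimate than the maximum of the test curve, because `k` was chosen without looking at the test rows.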

## 10. Exercise Section

### Exercise 1: Feature Selection on Synthetic Data

Create a dataset with only 5 informative features out of 50 total. Apply feature selection and verify it identifies the correct features.

In [None]:
# Exercise 1: Synthetic data with known informative features

# Create dataset
X_syn, y_syn = make_classification(
    n_samples=500,
    n_features=50,
    n_informative=5,
    n_redundant=0,
    n_repeated=0,
    shuffle=False,  # Keep the informative features in the first 5 columns
    random_state=42
)

X_syn_df = pd.DataFrame(X_syn, columns=[f'feature_{i}' for i in range(50)])

# With shuffle=False, the first 5 features are the informative ones
# (the default shuffle=True would permute the columns)
true_informative = ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4']

print(f"Dataset: {X_syn_df.shape}")
print(f"True informative features: {true_informative}")

# TODO:
# 1. Apply Mutual Information to select top 5 features
# 2. Apply Random Forest to select top 5 features
# 3. Compare with true informative features
# 4. Calculate how many you got correct

# Your code here:


In [None]:
# Solution to Exercise 1

# Split data
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn_df, y_syn, test_size=0.3, random_state=42
)

# 1. Mutual Information
selector_mi = SelectKBest(score_func=mutual_info_classif, k=5)
selector_mi.fit(X_syn_train, y_syn_train)
selected_mi = X_syn_df.columns[selector_mi.get_support()].tolist()

# 2. Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_syn_train, y_syn_train)
importances = pd.Series(rf.feature_importances_, index=X_syn_df.columns)
selected_rf = importances.nlargest(5).index.tolist()

# 3. Compare with true features
print("True informative features:")
print(true_informative)
print("\nMutual Information selected:")
print(selected_mi)
print("\nRandom Forest selected:")
print(selected_rf)

# 4. Calculate accuracy
correct_mi = len(set(selected_mi) & set(true_informative))
correct_rf = len(set(selected_rf) & set(true_informative))

print(f"\nMutual Information correctly identified: {correct_mi}/5 features")
print(f"Random Forest correctly identified: {correct_rf}/5 features")

print("\nNote: With synthetic data, good feature selection should identify most/all informative features!")

### Exercise 2: Compare Filter vs Wrapper Methods

On the breast cancer dataset, compare the speed of filter methods vs wrapper methods.

In [None]:
# Exercise 2: Speed comparison

import time

# TODO: Measure execution time for:
# 1. Mutual Information (filter method)
# 2. RFE (wrapper method)
# Compare both speed and accuracy

# Your code here:


In [None]:
# Solution to Exercise 2

import time

k = 10

# 1. Mutual Information (Filter)
start = time.time()
selector_mi = SelectKBest(score_func=mutual_info_classif, k=k)
selector_mi.fit(X_train, y_train)
X_train_mi = selector_mi.transform(X_train)
X_test_mi = selector_mi.transform(X_test)

model_mi = LogisticRegression(max_iter=1000, random_state=42)
model_mi.fit(X_train_mi, y_train)
acc_mi = model_mi.score(X_test_mi, y_test)
time_mi = time.time() - start

# 2. RFE (Wrapper)
start = time.time()
estimator = LogisticRegression(max_iter=1000, random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=k)
rfe.fit(X_train_scaled, y_train)
X_train_rfe = rfe.transform(X_train_scaled)
X_test_rfe = rfe.transform(X_test_scaled)

model_rfe = LogisticRegression(max_iter=1000, random_state=42)
model_rfe.fit(X_train_rfe, y_train)
acc_rfe = model_rfe.score(X_test_rfe, y_test)
time_rfe = time.time() - start

# Compare
print("Speed and Accuracy Comparison:")
print("="*50)
print(f"{'Method':<25} {'Time (s)':<12} {'Accuracy'}")
print("-"*50)
print(f"{'Mutual Information':<25} {time_mi:<12.4f} {acc_mi:.4f}")
print(f"{'RFE (Wrapper)':<25} {time_rfe:<12.4f} {acc_rfe:.4f}")
print("="*50)
print(f"\nRFE took {time_rfe/time_mi:.1f}x the time of Mutual Information")
print(f"Accuracy difference: {abs(acc_rfe - acc_mi):.4f}")
print("\nTrade-off: filter methods usually scale better to many features;")
print("wrapper methods cost more per feature but account for the model.")

### Exercise 3: Optimal Feature Count

Find the optimal number of features that maximizes test accuracy while minimizing feature count.

In [None]:
# Exercise 3: Find optimal k

# TODO:
# 1. Test k values from 1 to 30
# 2. For each k, use Random Forest feature importance to select features
# 3. Plot test accuracy vs k
# 4. Find the "elbow point" - where adding more features doesn't help much

# Your code here:


In [None]:
# Solution to Exercise 3

# Train Random Forest once to get all importances
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns)
features_ranked = importances.sort_values(ascending=False).index

# Test different k values
k_values = range(1, 31)
results = []

for k in k_values:
    # Select top k features
    selected = features_ranked[:k].tolist()
    
    # Train model
    X_train_k = X_train[selected]
    X_test_k = X_test[selected]
    
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_k, y_train)
    
    acc = model.score(X_test_k, y_test)
    results.append({'k': k, 'accuracy': acc})

results_df = pd.DataFrame(results)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(results_df['k'], results_df['accuracy'], marker='o', linewidth=2)
plt.axvline(x=results_df.loc[results_df['accuracy'].idxmax(), 'k'], 
           color='red', linestyle='--', label='Best k')
plt.xlabel('Number of Features (k)')
plt.ylabel('Test Accuracy')
plt.title('Finding Optimal Number of Features', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

best_k = results_df.loc[results_df['accuracy'].idxmax(), 'k']
best_acc = results_df['accuracy'].max()

print(f"Optimal number of features: {best_k}")
print(f"Best accuracy: {best_acc:.4f}")
print(f"\nDiminishing returns after k={best_k}")
print(f"Using only {best_k}/{X.shape[1]} features ({best_k/X.shape[1]*100:.1f}% of total)!")

## 11. Summary

### Key Takeaways

1. **More features ≠ better performance**
   - Curse of dimensionality is real
   - Irrelevant features add noise and cause overfitting
   - Feature selection improves generalization

2. **Three main approaches to feature selection**:
   - **Filter methods**: Statistical tests (fast, model-agnostic)
   - **Wrapper methods**: Use model performance (slow, accurate)
   - **Embedded methods**: Built into training (balanced)

3. **Filter methods** (Correlation, F-test, Mutual Information):
   - ✅ Fast and scalable
   - ✅ Model-agnostic
   - ❌ Don't consider feature interactions
   - ❌ May miss important feature combinations

4. **Wrapper methods** (RFE, Forward/Backward Selection):
   - ✅ Often more accurate
   - ✅ Consider feature interactions
   - ❌ Computationally expensive
   - ❌ Risk of overfitting to validation set

5. **Embedded methods** (Lasso, Tree importance):
   - ✅ Good balance of speed and accuracy
   - ✅ Integrated with training
   - ✅ Regularization helps limit overfitting
   - ❌ Model-specific

6. **Optimal feature count often much less than total**:
   - Diminishing returns after certain point
   - Plot performance vs feature count to find elbow
   - Consider trade-off between accuracy and complexity

### When to Use Each Method

**Filter Methods**:
- ✅ High-dimensional data (1000s of features)
- ✅ Quick baseline
- ✅ Preprocessing before wrapper methods
- ✅ Exploratory analysis

**Wrapper Methods**:
- ✅ Small to medium feature sets (<100 features)
- ✅ When accuracy is critical
- ✅ Enough computational resources
- ✅ Need optimal feature subset

**Embedded Methods**:
- ✅ Using L1-regularized models (Lasso, L1 logistic regression)
- ✅ Tree-based models (RF, XGBoost)
- ✅ Production pipelines
- ✅ Good default choice

### Best Practices

1. **Always split data first**: Fit selectors on the training data only to avoid data leakage
2. **Start with filter methods**: Quick baseline
3. **Use domain knowledge**: Don't blindly trust statistics
4. **Plot performance curves**: Visualize feature count vs accuracy
5. **Consider interpretability**: Fewer features = easier to explain
6. **Cross-validate**: Ensure robust feature selection
7. **Compare multiple methods**: Different methods may find different features
8. **Monitor overfitting**: Train vs test performance gap
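
To make the first and sixth practices concrete, the sketch below shows how selecting features before cross-validation can inflate scores. It assumes pure-noise features and random labels, so any accuracy well above 50% is an artifact. The sizes (100 samples, 2000 features, `k=20`) and variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # pure noise: no feature carries signal
y_noise = rng.integers(0, 2, size=100)   # random binary labels

# Leaky: features are chosen using ALL rows, then the reduced data is cross-validated
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: selection sits inside the pipeline, so it is refit on each training fold
honest_pipe = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_acc = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print(f"Selection before CV (leaky): {leaky_acc:.3f}")
print(f"Selection inside CV (honest): {honest_acc:.3f}")
```

The leaky estimate looks good only because the selector already saw the validation rows. Wrapping the selector in a pipeline keeps every fold's held-out data unseen during selection.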

### What's Next?

**Module 09**: Feature Importance and Interpretability - Learn to understand which features matter and why

### Additional Resources

- [Scikit-learn Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
- [Feature Selection Guide](https://machinelearningmastery.com/feature-selection-machine-learning-python/)
- [Curse of Dimensionality Explained](https://towardsdatascience.com/curse-of-dimensionality-2092410f3d27)

---

**Congratulations!** You've completed Module 08. You now understand:
- Why feature selection matters and the curse of dimensionality
- Filter methods (correlation, F-test, mutual information)
- Wrapper methods (RFE, sequential selection)
- Embedded methods (Lasso, tree importance)
- How to compare methods and find optimal feature count

Ready to dive deeper into feature importance? Let's move to **Module 09: Feature Importance and Interpretability**!