# Feature Selection — Keep the Signal, Remove the Noise

**Objective**: Learn **why**, **when**, and **how** to select the **most important features** to improve model performance, speed, and interpretability.

---

## 1. What is Feature Selection?

**Definition**: The process of **automatically or manually** selecting a **subset of relevant features** for building a model.

### Why Do We Need It?

| Problem | Caused By | Solution |
|--------|-----------|----------|
| **Overfitting** | Too many irrelevant features | Reduce features |
| **Slow Training** | High-dimensional data | Fewer features |
| **Curse of Dimensionality** | Too many dimensions, so distances between points become less informative | Dimensionality reduction |
| **Poor Interpretability** | 100+ features | Select top 5–10 |
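
The curse-of-dimensionality row can be made concrete with a small experiment. The sketch below is illustrative only: the function name `relative_contrast`, the uniform random data, and the point counts are all assumptions, not part of the notebook. It measures how much farther the farthest point is than the nearest one from a random query; as the number of dimensions grows, that gap tends to shrink relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """Return (max - min) / min distance from one random query to random points."""
    points = rng.random((n_points, n_dims))   # uniform points in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Compare low- and high-dimensional spaces
contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

A small contrast means the nearest and farthest neighbours are almost equally far away, which is why distance-based models tend to struggle when many irrelevant features are present.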

> **"More features ≠ better model"**

---

## 2. Real-World Example: Diabetes Risk Prediction

Each patient record has 7 features. Two example rows are shown here; the simulation below generates 100 patients:

| Patient | BMI | Glucose | Age | BP | Insulin | Noise1 | Noise2 | Diabetic? |
|--------|-----|---------|-----|----|---------|--------|--------|-----------|
| 1 | 25 | 90 | 30 | 120 | 100 | 42 | 88 | 0 |
| 2 | 35 | 180 | 55 | 140 | 300 | 11 | 77 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

**Goal**: Predict diabetes using **only meaningful features**

In [5]:
import pandas as pd
import numpy as np

# Simulate diabetes data
np.random.seed(42)
n_samples = 100

df = pd.DataFrame({
    'bmi': np.random.normal(30, 5, n_samples),
    'glucose': np.random.normal(120, 30, n_samples),
    'age': np.random.randint(25, 70, n_samples),
    'blood_pressure': np.random.normal(120, 15, n_samples),
    'insulin': np.random.normal(150, 50, n_samples),
    'noise1': np.random.randn(n_samples),
    'noise2': np.random.randn(n_samples)
})

# Create target: high glucose + high BMI → diabetic
df['diabetic'] = ((df['glucose'] > 140) & (df['bmi'] > 30)).astype(int)

print(f"Dataset: {df.shape[0]} samples, {df.shape[1]-1} features")
df.head()

Dataset: 100 samples, 7 features


|   | bmi | glucose | age | blood_pressure | insulin | noise1 | noise2 | diabetic |
|---|-----|---------|-----|----------------|---------|--------|--------|----------|
| 0 | 32.483571 | 77.538878 | 26 | 151.696682 | 122.537091 | -0.752447 | -0.508406 | 0 |
| 1 | 29.308678 | 107.38064 | 50 | 112.523901 | 27.040332 | 1.673748 | -1.241841 | 0 |
| 2 | 33.238443 | 109.718565 | 41 | 115.134068 | 152.870647 | 0.043703 | 0.183512 | 0 |
| 3 | 37.615149 | 95.931682 | 64 | 127.869135 | 107.794613 | 1.073682 | 0.353158 | 0 |
| 4 | 28.829233 | 115.161429 | 57 | 167.8055 | 217.940832 | 0.076048 | 0.531917 | 0 |


---

## 3. Feature Selection Methods — Theory + Math

### 1. Filter Methods (Statistical Tests)

**Idea**: Score each feature's **statistical association** with the target, then rank the features by that score

| Method | Formula | Use Case |
|--------|--------|----------|
| **Correlation** | $ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $ | Continuous features; continuous or binary target |
| **Chi-Square** | $ \chi^2 = \sum \frac{(O-E)^2}{E} $ (observed vs. expected counts) | Categorical features, categorical target |
| **ANOVA F-test** | $ F = \frac{\text{variance between groups}}{\text{variance within groups}} $ | Continuous features, categorical target |

**Pros**: Fast, model-agnostic  
**Cons**: Ignores feature interactions
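
The ANOVA F-test row in the table above can be tried with scikit-learn's `SelectKBest` and `f_classif`. This is a minimal sketch, not the notebook's own method: `make_demo_data`, `X_demo`, and `y_demo` are hypothetical names, and the data is rebuilt here with the same distributions and target rule as the simulated diabetes data so that the block runs on its own. The choice `k=2` is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

def make_demo_data(n_samples=100, seed=42):
    """Hypothetical helper: rebuild a dataset like the notebook's df."""
    rng = np.random.RandomState(seed)
    X = pd.DataFrame({
        'bmi': rng.normal(30, 5, n_samples),
        'glucose': rng.normal(120, 30, n_samples),
        'age': rng.randint(25, 70, n_samples),
        'blood_pressure': rng.normal(120, 15, n_samples),
        'insulin': rng.normal(150, 50, n_samples),
        'noise1': rng.randn(n_samples),
        'noise2': rng.randn(n_samples),
    })
    y = ((X['glucose'] > 140) & (X['bmi'] > 30)).astype(int)
    return X, y

X_demo, y_demo = make_demo_data()

# Score every feature with the ANOVA F-test and keep the k best
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X_demo, y_demo)

f_scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
kept = X_demo.columns[selector.get_support()].tolist()

print(f_scores.round(2))
print("Kept by F-test:", kept)
```

With a binary target the F-test ranks features in the same order as the absolute correlation, so it is mainly useful when the target has more than two classes.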

In [6]:
# Correlation with target
correlations = df.drop('diabetic', axis=1).corrwith(df['diabetic']).abs().sort_values(ascending=False)

print("Feature Correlation with Diabetic:")
print(correlations.round(3))

Feature Correlation with Diabetic:
glucose           0.439
bmi               0.307
age               0.217
noise2            0.094
noise1            0.094
insulin           0.092
blood_pressure    0.042
dtype: float64


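> **Insight**: `glucose` and `bmi` rank highest, while `noise1` and `noise2` are close to zero. Note that `age` still scores 0.217 even though the target was built only from `glucose` and `bmi`: with 100 samples, a filter score can pick up chance correlations.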

### 2. Wrapper Methods (Search-Based)

**Idea**: Train model on **subsets** of features → pick best

#### Recursive Feature Elimination (RFE)

1. Train model on all features
2. Remove **least important** feature
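3. Repeat until the desired number of features remains

**Pros**: Accurate, because features are judged by their effect on the actual model  
**Cons**: Slow. RFE refits the model once per elimination step, and an exhaustive search over all subsets grows exponentially with the number of features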

In [7]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X = df.drop('diabetic', axis=1)
y = df['diabetic']

model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=3)
rfe.fit(X, y)

selected = X.columns[rfe.support_].tolist()
print("RFE Selected Features:", selected)

RFE Selected Features: ['bmi', 'glucose', 'noise1']
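
The three RFE steps can also be written out by hand. The sketch below is an illustration under stated assumptions, not the notebook's own code: `make_demo_data`, `X_demo`, `y_demo`, `remaining`, and `eliminated` are hypothetical names, and the data is rebuilt so the block is self-contained. It also standardizes the features first, which the cell above does not do. Logistic-regression coefficients depend on feature scale, so on raw data a small-scale column such as `noise1` can receive a comparatively large coefficient; that is one likely reason it survives in the output above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def make_demo_data(n_samples=100, seed=42):
    """Hypothetical helper: rebuild a dataset like the notebook's df."""
    rng = np.random.RandomState(seed)
    X = pd.DataFrame({
        'bmi': rng.normal(30, 5, n_samples),
        'glucose': rng.normal(120, 30, n_samples),
        'age': rng.randint(25, 70, n_samples),
        'blood_pressure': rng.normal(120, 15, n_samples),
        'insulin': rng.normal(150, 50, n_samples),
        'noise1': rng.randn(n_samples),
        'noise2': rng.randn(n_samples),
    })
    y = ((X['glucose'] > 140) & (X['bmi'] > 30)).astype(int)
    return X, y

X_demo, y_demo = make_demo_data()

# Assumption: standardize so coefficient magnitudes are comparable across features
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_demo), columns=X_demo.columns)

remaining = list(X_scaled.columns)
eliminated = []
while len(remaining) > 3:
    # Step 1: train on the features that are still in play
    clf = LogisticRegression(max_iter=1000).fit(X_scaled[remaining], y_demo)
    # Step 2: drop the feature with the smallest absolute coefficient
    weakest = remaining[int(np.argmin(np.abs(clf.coef_[0])))]
    remaining.remove(weakest)
    eliminated.append(weakest)
    # Step 3: the loop repeats until three features remain

print("Eliminated (in order):", eliminated)
print("Remaining:", remaining)
```

The loop makes the cost visible: one model fit per removed feature, here four fits to go from seven features to three.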


### 3. Embedded Methods (Model Built-in)

**Idea**: Let the **model decide** importance during training

#### Tree-Based Importance

- Measures **how much each feature reduces impurity**
- Works with Random Forest, XGBoost

**Pros**: Captures interactions  
**Cons**: Impurity-based importance is biased toward high-cardinality features (those with many distinct values)

In [8]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Top 4 Important Features (Tree-based):")
print(importance.head(4).round(3))

Top 4 Important Features (Tree-based):
glucose    0.411
bmi        0.208
age        0.116
insulin    0.077
dtype: float64
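
Importances only rank features; turning the ranking into a selection needs a cutoff. One option in scikit-learn is `SelectFromModel`, sketched below under assumptions: the `threshold='mean'` cutoff is an illustrative choice, and `make_demo_data`, `X_demo`, and `y_demo` are hypothetical names used to rebuild the simulated data so the block runs standalone.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def make_demo_data(n_samples=100, seed=42):
    """Hypothetical helper: rebuild a dataset like the notebook's df."""
    rng = np.random.RandomState(seed)
    X = pd.DataFrame({
        'bmi': rng.normal(30, 5, n_samples),
        'glucose': rng.normal(120, 30, n_samples),
        'age': rng.randint(25, 70, n_samples),
        'blood_pressure': rng.normal(120, 15, n_samples),
        'insulin': rng.normal(150, 50, n_samples),
        'noise1': rng.randn(n_samples),
        'noise2': rng.randn(n_samples),
    })
    y = ((X['glucose'] > 140) & (X['bmi'] > 30)).astype(int)
    return X, y

X_demo, y_demo = make_demo_data()

# Keep only features whose importance is at least the mean importance
forest = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(forest, threshold='mean')
selector.fit(X_demo, y_demo)

kept = X_demo.columns[selector.get_support()].tolist()
importances = pd.Series(selector.estimator_.feature_importances_, index=X_demo.columns)

print("Mean importance:", round(importances.mean(), 3))
print("Kept by SelectFromModel:", kept)
```

Because the importances sum to 1, the mean threshold here is 1/7; a stricter or looser cutoff can be passed as a number or as a string such as `'median'`.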


#### L1 Regularization (Lasso)

**Formula**: $ \min (\text{loss} + \lambda \sum |w_i|) $

→ Forces **unimportant weights to zero**

**Pros**: Built-in selection  
**Cons**: Only for models with explicit weights (e.g. linear or logistic regression), and the result depends on feature scaling

In [9]:
from sklearn.linear_model import LogisticRegression

lasso = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
lasso.fit(X, y)

selected_lasso = X.columns[lasso.coef_[0] != 0].tolist()
print("Lasso Selected Features:", selected_lasso)

Lasso Selected Features: ['bmi', 'glucose', 'age', 'blood_pressure', 'insulin']
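
How many weights reach zero depends on the penalty strength. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` plays the role of a larger $\lambda$ in the formula above. The sketch below sweeps a few values of `C`; the grid, the standardization step, and the names `make_demo_data`, `X_scaled`, and `counts` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def make_demo_data(n_samples=100, seed=42):
    """Hypothetical helper: rebuild a dataset like the notebook's df."""
    rng = np.random.RandomState(seed)
    X = pd.DataFrame({
        'bmi': rng.normal(30, 5, n_samples),
        'glucose': rng.normal(120, 30, n_samples),
        'age': rng.randint(25, 70, n_samples),
        'blood_pressure': rng.normal(120, 15, n_samples),
        'insulin': rng.normal(150, 50, n_samples),
        'noise1': rng.randn(n_samples),
        'noise2': rng.randn(n_samples),
    })
    y = ((X['glucose'] > 140) & (X['bmi'] > 30)).astype(int)
    return X, y

X_demo, y_demo = make_demo_data()

# Assumption: standardize so the L1 penalty treats all features alike
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_demo), columns=X_demo.columns)

counts = {}
for C in (0.001, 0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    clf.fit(X_scaled, y_demo)
    nonzero = X_scaled.columns[clf.coef_[0] != 0].tolist()
    counts[C] = len(nonzero)
    print(f"C={C:<6} -> {len(nonzero)} non-zero weights: {nonzero}")
```

In practice `C` would be chosen by cross-validation rather than left at the default `1.0`.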


---

## 4. Comparison Summary

| Method | Speed | Accuracy | Captures Interaction? |
|--------|-------|----------|------------------------|
| **Filter (Correlation)** | Fast | Low | No |
| **Wrapper (RFE)** | Slow | High | Depends on the wrapped model |
| **Embedded (Trees)** | Medium | High | Yes |
| **Embedded (Lasso)** | Fast | Medium | No |

---

## 5. Best Practices

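1. **Use multiple methods** → keep the features they agree on (intersection or majority vote)
2. **Validate with cross-validation**, running the selection inside each fold so the held-out data never influences it
3. **Don’t remove correlated features blindly** → they may help
4. **For production**: Fit the selector once on training data, then reuse the same feature list at inference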
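
Practices 2 and 4 can be combined by putting the selector inside a scikit-learn `Pipeline`. The sketch below is one possible setup, not a prescribed one: the choice of `SelectKBest` with `k=2`, the ROC AUC metric, the 5-fold split, and the names `make_demo_data`, `pipelines`, and `results` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_demo_data(n_samples=100, seed=42):
    """Hypothetical helper: rebuild a dataset like the notebook's df."""
    rng = np.random.RandomState(seed)
    X = pd.DataFrame({
        'bmi': rng.normal(30, 5, n_samples),
        'glucose': rng.normal(120, 30, n_samples),
        'age': rng.randint(25, 70, n_samples),
        'blood_pressure': rng.normal(120, 15, n_samples),
        'insulin': rng.normal(150, 50, n_samples),
        'noise1': rng.randn(n_samples),
        'noise2': rng.randn(n_samples),
    })
    y = ((X['glucose'] > 140) & (X['bmi'] > 30)).astype(int)
    return X, y

X_demo, y_demo = make_demo_data()

pipelines = {
    'all features': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    # The selector is a pipeline step, so it is refit on each training fold
    'top-2 by F-test': make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=2), LogisticRegression(max_iter=1000)),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='roc_auc')
    results[name] = scores.mean()
    print(f"{name:<16} mean ROC AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Once a pipeline like this is fitted on the full training set, the same object can be saved and reused for prediction, so the selected columns never have to be tracked separately.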

---

## 6. Final Selected Features (Consensus)

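```python
top_features = ['glucose', 'bmi']
```

Only `glucose` and `bmi` are selected by all four methods, which matches how the target was constructed. `insulin` appears only in the tree-based and Lasso results.

**Drop**: `noise1`, `noise2`, `age`, `blood_pressure`, `insulin`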
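
The agreement between methods can be checked directly from the outputs printed earlier. In the sketch below the feature lists are copied from those outputs; the top-3 cutoff for the correlation ranking and the top-4 cutoff for the tree ranking are assumptions, as are the names `selections`, `votes`, and `consensus`.

```python
from collections import Counter

# Feature lists copied from the outputs shown earlier in this notebook
selections = {
    'filter (top 3 by |r|)': ['glucose', 'bmi', 'age'],
    'rfe': ['bmi', 'glucose', 'noise1'],
    'trees (top 4)': ['glucose', 'bmi', 'age', 'insulin'],
    'lasso': ['bmi', 'glucose', 'age', 'blood_pressure', 'insulin'],
}

# Count how many methods picked each feature
votes = Counter(f for feats in selections.values() for f in feats)

# Strict consensus: features chosen by every method
consensus = sorted(set.intersection(*(set(f) for f in selections.values())))

for feature, n in votes.most_common():
    print(f"{feature:<15} picked by {n} of {len(selections)} methods")
print("Selected by every method:", consensus)
# Selected by every method: ['bmi', 'glucose']
```

A strict intersection is the most conservative rule. A majority vote (at least 3 of 4 methods) would also admit `age`, which illustrates why the choice of rule should be validated against model performance.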

**Key Takeaway**:
> **Feature selection = better model + faster + interpretable**  
> **Use Filter for speed, Embedded for accuracy**  
> **Always validate impact on performance**

---
**End of Notebook**