
# Feature Selection Techniques — Full Guide (Theory + Code)

**Feature selection** is the process of selecting the most relevant features (columns) for use in model training. It helps:

* Improve model performance
* Reduce overfitting
* Reduce training time
* Improve interpretability

---

##### 1. **Correlation-Based Feature Selection**

##### What is it?

* Correlation (typically the Pearson coefficient) measures the strength of the linear relationship between two features.
* Highly correlated features (multicollinearity) carry overlapping information and can make coefficient estimates unstable, especially in linear models.
* You can drop one feature from each highly correlated pair.



##### When to use:

* When you suspect **multicollinearity** (especially in linear regression).
* When you want to reduce feature redundancy.













In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Calculate the correlation matrix (numeric columns only)
corr_matrix = df.corr(numeric_only=True)

# 2. Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

# 3. Drop one feature from each pair with |correlation| > threshold
def remove_highly_correlated(df, threshold=0.9):
    corr_matrix = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is checked once
    upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
    to_drop = [column for column in upper_tri.columns if (upper_tri[column] > threshold).any()]
    return df.drop(columns=to_drop)

# Note: if df still contains the target column, separate it first so it cannot be dropped
df_filtered = remove_highly_correlated(df, threshold=0.9)
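
To make the upper-triangle trick concrete, here is a minimal sketch on a made-up toy DataFrame. The column names (`a`, `a_scaled`, `noise`) and the helper name `drop_highly_correlated` are hypothetical and chosen only for illustration; the helper mirrors the function above but also returns the list of dropped columns.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.9):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Strict upper triangle: each pair appears once and the diagonal is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Toy data (assumption: synthetic, for illustration only)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3.0 * a + 1.0,      # exact linear function of "a"
    "noise": rng.normal(size=200),  # generated independently of "a"
})

toy_filtered, dropped = drop_highly_correlated(toy, threshold=0.9)
print("Dropped:", dropped)
print("Kept:", toy_filtered.columns.tolist())
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is dropped, so the result depends on column order.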

##### 2. **Variance Threshold (Low Variance Filter)**

##### What is it?

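* A feature with **low variance** barely changes across samples, so it carries little information and can often be removed.
* Variance depends on a feature's scale, so compare features on comparable scales, and apply this filter before standardizing (standardized features all have unit variance).
* Especially useful in high-dimensional data (e.g., text or image features).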

##### When to use:

* When many features are **binary** or one-hot encoded **categorical** variables.
* When preprocessing **text or image data**.
* To remove uninformative columns before modeling.
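
For binary features, the threshold can be derived rather than guessed: a 0/1 column that equals 1 with probability p has variance p(1 - p). The sketch below, on a made-up toy matrix, removes columns that take the same value in more than roughly 80% of samples; the 80% cutoff is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy binary matrix (assumption: made up for illustration), 6 samples x 3 features
X_bin = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
])

# A Bernoulli(p) feature has variance p * (1 - p); p = 0.8 gives a threshold of about 0.16
p = 0.8
binary_selector = VarianceThreshold(threshold=p * (1 - p))
X_bin_kept = binary_selector.fit_transform(X_bin)

print("Per-feature variances:", binary_selector.variances_)
print("Kept mask:", binary_selector.get_support())
```

Here the first column is 0 in five of six rows, so its variance (about 0.14) falls under the threshold and it is removed, while the other two columns are kept.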



In [None]:

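import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# VarianceThreshold needs numeric input, so keep numeric columns only
df_numeric = df.select_dtypes(include='number')

# Remove features with variance below 0.01
selector = VarianceThreshold(threshold=0.01)
reduced = selector.fit_transform(df_numeric)

# Rebuild a DataFrame with the surviving column names and the original index
selected_columns = df_numeric.columns[selector.get_support()]
df_reduced = pd.DataFrame(reduced, columns=selected_columns, index=df.index)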


##### 3. **SelectKBest (Univariate Feature Selection)**

##### What is it?

* Selects the **k** features with the highest scores on a univariate statistical test against the target.
* Common scoring functions:

  * `f_classif` (ANOVA F-test) for classification
  * `f_regression` for regression
  * `chi2` for classification with non-negative features (e.g., counts or one-hot encoded categories)
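
Swapping the scoring function is a one-argument change. As a small sketch, the example below uses `chi2` on scikit-learn's bundled iris dataset, chosen here only because its measurements are all non-negative, which `chi2` requires; the variable names are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris features are non-negative measurements, so chi2 can be applied
iris = load_iris(as_frame=True)
X_iris, y_iris = iris.data, iris.target

# Keep the 2 features with the highest chi-squared statistic
chi2_selector = SelectKBest(score_func=chi2, k=2)
X_top2 = chi2_selector.fit_transform(X_iris, y_iris)

kept = X_iris.columns[chi2_selector.get_support()].tolist()
print("Kept features:", kept)
print("Shape after selection:", X_top2.shape)
```

For a regression target, `f_regression` would be passed as `score_func` in the same way.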


##### When to use:

* You want to select **top K features** before modeling.
* Especially useful for quick filtering before model selection.



In [None]:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns='target')
y = df['target']

# Select the top 5 features using the ANOVA F-test (X must be numeric, without NaNs)
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

# Get selected features
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features.tolist())

# Rebuild a DataFrame, keeping the original index
X_selected = pd.DataFrame(X_new, columns=selected_features, index=X.index)
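
When the selected features feed a model that is evaluated with cross-validation, one common pattern is to put the selector inside a scikit-learn `Pipeline`, so the scores are computed on each training fold only. The sketch below assumes synthetic data from `make_classification` and a logistic regression model; both are illustrative choices, not part of the example above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Scaling, selection, and the model are all refit on each training fold
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=5),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Fitting the selector on the full dataset before splitting would let information from the held-out folds influence which features are chosen.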



##### Other Feature Selection Techniques

| Technique                               | Description                                                                      |
| --------------------------------------- | -------------------------------------------------------------------------------- |
| **Recursive Feature Elimination (RFE)** | Repeatedly fits a model and removes the weakest features                         |
| **L1 Regularization (Lasso)**           | Shrinks coefficients so that uninformative features get zero weight              |
| **Tree-based Feature Importance**       | Uses models like Random Forest or XGBoost to rank features                       |
| **Mutual Information**                  | Measures any dependency (including non-linear) between a feature and the target  |
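
The four techniques in the table can be sketched side by side with scikit-learn. This is a minimal illustration on synthetic regression data from `make_regression`; the estimators, `alpha=1.0`, and `n_features_to_select=3` are arbitrary assumptions, and a Random Forest stands in for the tree-based models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, SelectFromModel, mutual_info_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: for illustration only), 3 of 10 features informative
X_syn, y_syn = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_syn = StandardScaler().fit_transform(X_syn)

# RFE (wrapper): refit the model repeatedly, dropping the weakest feature each round
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X_syn, y_syn)
print("RFE keeps:", np.flatnonzero(rfe.support_))

# Lasso (embedded): keep features whose L1-penalized coefficient is non-negligible
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_syn, y_syn)
print("Lasso keeps:", np.flatnonzero(lasso_selector.get_support()))

# Tree importance (embedded): rank features by impurity-based importance
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_syn, y_syn)
forest_rank = np.argsort(forest.feature_importances_)[::-1]
print("Forest ranking (best first):", forest_rank)

# Mutual information (filter): score each feature's dependency on the target
mi_scores = mutual_info_regression(X_syn, y_syn, random_state=0)
print("MI ranking (best first):", np.argsort(mi_scores)[::-1])
```

The methods need not agree on the exact subset, so comparing their outputs is a reasonable sanity check before committing to one.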

---

## 🧠 Summary Table

| Technique          | Type     | When to Use                                   |
| ------------------ | -------- | --------------------------------------------- |
| Correlation        | Filter   | To remove multicollinearity                   |
| Variance Threshold | Filter   | To remove features with little or no variance |
| SelectKBest        | Filter   | To select top-k features based on score       |
| RFE                | Wrapper  | When using a model to recursively remove vars |
| Lasso (L1)         | Embedded | When using linear models with regularization  |
| Tree Importance    | Embedded | When using tree-based models                  |

