# C4: FEATURE ENGINEERING

## Feature Extraction
- **Definition:** Creating new features from raw data, especially from unstructured sources like text, images, or audio.
- **Examples:**
  - **Text:** Word counts, TF-IDF scores, word embeddings (Word2Vec, GloVe, BERT)
  - **Images:** Edges, color histograms, CNN features
  - **Time Series:** Rolling averages, seasonality/trend indicators, Fourier transforms
- **Need:** Converts raw/unstructured data into numeric features that can be effectively used by ML models.
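
A minimal sketch of the text and time-series examples above, assuming scikit-learn and pandas are installed. The toy sentences, the `sales` series, and the 3-step window are hypothetical values chosen only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical data)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Word counts: one column per vocabulary term
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF: counts reweighted so terms shared by many documents weigh less
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

print("Vocabulary:", sorted(count_vec.vocabulary_))
print("Count matrix shape:", counts.shape)
print("TF-IDF matrix shape:", tfidf.shape)

# Time series: 3-step rolling average (the first two entries are NaN)
sales = pd.Series([10, 12, 11, 15, 14, 18])
rolling_mean = sales.rolling(window=3).mean()
print(rolling_mean.tolist())
```

Both vectorizers return sparse matrices with one row per document, which most scikit-learn estimators accept directly.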

## Feature Transformation
- **Definition:** Changing existing features into more useful forms to improve model performance.
- **Common Techniques:**
  1. Log transformation – Reduces right skew in positive-valued data (use log(1 + x) when zeros occur)
  2. Square root / Box-Cox transformations – Stabilize variance and make distributions more symmetric (Box-Cox requires strictly positive values)
  3. Binning (discretization) – Convert continuous values into categorical intervals, e.g., equal-width or quantile-based buckets
  4. Polynomial features – Add interaction terms or higher-order terms
  5. Encoding – One-hot encoding for nominal categories; label encoding implies an order, so it suits ordinal categories
- **Need:** Brings relationships closer to linear, lets simple models capture non-linear effects, and can improve interpretability.
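
The techniques above can be sketched on a small table as follows. This is an illustrative example, not a recipe: the `income`, `age`, and `city` columns, the bin edges, and the group labels are all assumptions made up for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with one heavily right-skewed column (income)
df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 120_000, 1_000_000],
    "age": [22, 35, 47, 58, 63],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

# Log transformation: log1p computes log(1 + x), so zeros are allowed
df["log_income"] = np.log1p(df["income"])
print("Skew before:", round(df["income"].skew(), 2))
print("Skew after: ", round(df["log_income"].skew(), 2))

# Binning: intervals are (0, 30], (30, 50], (50, 100]
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"]
)

# One-hot encoding: one indicator column per city
city_dummies = pd.get_dummies(df["city"], prefix="city")

# Polynomial features of degree 2: a, b, a^2, a*b, b^2
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["age", "log_income"]])

print(df[["age", "age_group"]])
print(city_dummies.columns.tolist())
print("Polynomial feature matrix shape:", poly_features.shape)
```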

## Feature Scaling
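- **Why Needed:** Distance-based, gradient-based, and regularized algorithms (e.g., K-Means, KNN, SVM, PCA, Neural Networks, regularized Linear/Logistic Regression) are sensitive to feature scale; tree-based models are largely unaffected.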
- **Techniques:**
  - **Min-Max Normalization:**
    - Formula: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
    - Range: [0, 1]
    - **Use Case:** When bounded values are required, e.g., image pixel scaling
  - **Standardization (Z-score scaling):**
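    - Formula: $z = \frac{x - \mu}{\sigma}$
    - Result: Mean = 0, Standard Deviation = 1
    - **Use Case:** A good default when features are roughly Gaussian or the algorithm expects zero-centered inputs. Less distorted by outliers than min-max scaling, but $\mu$ and $\sigma$ are still affected by them.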
  - **Robust Scaling (Median & IQR):**
    - Formula: $x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}$
    - **Use Case:** Works well with heavy-tailed distributions and extreme outliers.
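
To compare the three scalers, the sketch below applies each to a single hypothetical column containing one extreme value (100). It assumes scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` with default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature, five samples; the last value is an outlier (made-up data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_minmax = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)
x_standard = StandardScaler().fit_transform(x)  # (x - mean) / std
x_robust = RobustScaler().fit_transform(x)      # (x - median) / IQR

print("Min-max: ", np.round(x_minmax.ravel(), 3))
print("Standard:", np.round(x_standard.ravel(), 3))
print("Robust:  ", np.round(x_robust.ravel(), 3))
```

With min-max scaling the four ordinary values are squeezed into roughly [0, 0.03] because the outlier defines the maximum. Robust scaling keeps them spread between -1 and 0.5, since the median (3) and IQR (2) ignore the extreme value.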

## Feature Selection
- **Goal:** Keep only the most informative features while removing irrelevant or noisy variables to reduce overfitting and improve efficiency.
- **Methods:**
  1. **Filter Methods:** Statistical tests (Chi-square, ANOVA, Mutual Information, Correlation thresholding)
  2. **Wrapper Methods:**
     - Forward Selection – Start with no features, add one at a time if model improves
     - Backward Elimination – Start with all features, remove least significant step by step
     - Stepwise Selection – Combination of forward and backward
     - Recursive Feature Elimination (RFE) – Train model, remove least important features iteratively
  3. **Embedded Methods:**
     - L1 Regularization (Lasso)
     - Tree-based models (e.g., Random Forest, XGBoost feature importance)
- **Benefits:** Can improve model accuracy, reduces computation, lowers the risk of overfitting, and enhances interpretability.
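
As one possible illustration of a filter method, the sketch below scores each feature of scikit-learn's bundled diabetes dataset independently and keeps the five highest-scoring ones. The choices of `f_regression` (a univariate F-test based on each feature's correlation with the target) and `k=5` are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)

# Score every feature on its own against the target; no model is trained
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = X.columns[selector.get_support()].tolist()

print(scores.round(1))
print("Selected by filter:", selected)
```

Because each feature is scored in isolation, a filter method is fast but cannot detect features that are only useful in combination.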


```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# ---------------------------
# 1. Load Dataset
# ---------------------------
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)

# ---------------------------
# 2. Define the Base Estimator (RFE fits it internally)
# ---------------------------
model = LinearRegression()

# ---------------------------
# 3. Apply RFE
# ---------------------------
rfe = RFE(model, n_features_to_select=5)  # keep top 5 features
rfe.fit(X, y)

# ---------------------------
# 4. Results
# ---------------------------
# ranking_ == 1 marks a selected feature; larger values were eliminated earlier
print("Selected Features:", X.columns[rfe.support_].tolist())
print("Ranking of Features:", rfe.ranking_)
```

Output:

```text
Selected Features: ['bmi', 'bp', 's1', 's2', 's5']
Ranking of Features: [6 2 1 1 1 1 4 3 1 5]
```
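
For comparison with the RFE wrapper above, here is a sketch of an embedded method on the same dataset: L1 regularization (Lasso) shrinks some coefficients to exactly zero during training, so selection happens as part of model fitting. The value `alpha=1.0` is an arbitrary assumption; in practice it would be tuned, for example by cross-validation.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)

# Larger alpha means a stronger L1 penalty and more zeroed coefficients
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

coefs = pd.Series(lasso.coef_, index=X.columns)
kept = coefs[coefs != 0].index.tolist()

print(coefs.round(2))
print("Features with non-zero coefficients:", kept)
```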
