# Feature Engineering and Selection

In this notebook, we will explore:
- What is Feature Engineering?
- Common Feature Engineering techniques
- Feature Selection methods
- Example workflow with a dataset

## 1. What is Feature Engineering?
- Process of creating new input features from existing data.
- Helps improve model performance by providing more informative features.
- Examples:
  - Combining columns (e.g., BMI = weight/height^2)
  - Extracting date/time features (day, month, year)
  - Encoding categorical variables
  - Handling missing values
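
The four kinds of transformation listed above can be sketched on a small toy table. This is a minimal illustration rather than part of the breast cancer workflow: the DataFrame and its column names (`weight_kg`, `height_m`, `visit_date`, `smoker`) are hypothetical, and median imputation is just one of several reasonable ways to fill missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names and values are illustrative assumptions.
df = pd.DataFrame({
    "weight_kg": [70.0, 82.5, np.nan, 55.0],
    "height_m": [1.75, 1.80, 1.62, 1.60],
    "visit_date": ["2023-01-15", "2023-06-30", "2024-02-29", "2024-11-05"],
    "smoker": ["no", "yes", "no", "yes"],
})

# Handling missing values: fill the missing weight with the column median.
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())

# Combining columns: BMI = weight / height^2.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Extracting date/time features from a parsed datetime column.
df["visit_date"] = pd.to_datetime(df["visit_date"])
df["visit_year"] = df["visit_date"].dt.year
df["visit_month"] = df["visit_date"].dt.month
df["visit_day"] = df["visit_date"].dt.day

# Encoding a categorical variable as one-hot indicator columns.
df = pd.get_dummies(df, columns=["smoker"], dtype=int)

df.head()
```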

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

X.head()

## 2. Example of Feature Engineering

- Creating a new feature: Ratio of mean area to mean smoothness.
- Ratios or other combinations of columns can expose relationships that no single column captures on its own.

In [None]:
X['area_smoothness_ratio'] = X['mean area'] / X['mean smoothness']
X[['mean area', 'mean smoothness', 'area_smoothness_ratio']].head()

## 3. Feature Selection
- Reduces dimensionality and training time, and can improve generalization by removing noisy or redundant features.
- Types:
  1. **Filter methods** – statistical tests (e.g., correlation, ANOVA)
  2. **Wrapper methods** – use model performance (e.g., RFE)
  3. **Embedded methods** – feature importance from models (e.g., Random Forest, Lasso)
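
Wrapper methods are listed above but not demonstrated elsewhere in this notebook, so here is a minimal sketch of Recursive Feature Elimination (RFE). It reloads the dataset under separate names (`X_demo`, `y_demo`) so it does not touch the notebook's `X`. The choice of logistic regression as the wrapped estimator, the standardization step, and `n_features_to_select=5` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Reload the data under separate names (illustrative; not the notebook's X and y).
data_demo = load_breast_cancer()
X_demo = pd.DataFrame(data_demo.data, columns=data_demo.feature_names)
y_demo = pd.Series(data_demo.target)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_demo)

# RFE repeatedly fits the estimator and drops the weakest feature (step=1)
# until only n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X_scaled, y_demo)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features.
rfe_features = X_demo.columns[rfe.support_]
rfe_features
```

Because RFE refits the model once per eliminated feature, it costs more than a filter method, which is the usual trade-off with wrapper approaches.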

### 3.1 Filter Method: SelectKBest (ANOVA F-test)

In [None]:
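# Keep the 5 features with the highest ANOVA F-scores against the target
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)  # NumPy array holding only the selected columns
# get_support() returns a boolean mask over the input columns
selected_features = X.columns[selector.get_support()]
selected_features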
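
To see why particular features were kept, it can help to look at the scores behind the selection. The sketch below refits `SelectKBest` on a fresh copy of the original 30 columns (the names `X_raw`, `y_raw`, `kbest`, and `anova_table` are introduced here for illustration) and tabulates the F-score and p-value of each feature.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Fresh copy of the original features (illustrative names).
data_raw = load_breast_cancer()
X_raw = pd.DataFrame(data_raw.data, columns=data_raw.feature_names)
y_raw = pd.Series(data_raw.target)

kbest = SelectKBest(score_func=f_classif, k=5).fit(X_raw, y_raw)

# scores_ and pvalues_ hold one ANOVA F-statistic and p-value per input column.
anova_table = pd.DataFrame(
    {
        "f_score": kbest.scores_,
        "p_value": kbest.pvalues_,
        "selected": kbest.get_support(),
    },
    index=X_raw.columns,
).sort_values("f_score", ascending=False)

anova_table.head(10)
```

The F-test scores each feature on its own, so two highly correlated features can both rank near the top even though the second adds little new information.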

### 3.2 Embedded Method: Feature Importance from Random Forest

In [None]:
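# Fit a random forest on all features; random_state makes the run reproducible
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Impurity-based importances, one value per column (they sum to 1)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head(10)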
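
Lasso is the other embedded method named earlier. Since the target here is a class label, the sketch below swaps in an L1-penalized linear SVM (`LinearSVC`) as the classification analogue of Lasso and wraps it in `SelectFromModel`, which keeps the features whose coefficients are not shrunk to zero. The value `C=0.05` and the standardization step are assumptions; a smaller `C` means stronger regularization and fewer surviving features.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Fresh copy of the data under illustrative names.
data_l1 = load_breast_cancer()
X_l1 = pd.DataFrame(data_l1.data, columns=data_l1.feature_names)
y_l1 = pd.Series(data_l1.target)

# L1 penalties are scale-sensitive, so standardize the features first.
X_l1_scaled = StandardScaler().fit_transform(X_l1)

# The L1 penalty drives some coefficients exactly to zero;
# SelectFromModel keeps the features with non-zero coefficients.
l1_svc = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=5000, random_state=0)
l1_selector = SelectFromModel(l1_svc).fit(X_l1_scaled, y_l1)

l1_features = X_l1.columns[l1_selector.get_support()]
l1_features
```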

## 4. Splitting Data After Feature Selection

In [None]:
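# Note: selected_features was chosen using all rows (Section 3.1), so the test rows
# influenced the selection and scores on X_test may be optimistic.
X_selected = X[selected_features]
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)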

X_train.shape, X_test.shape
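
Selecting features on the full dataset and splitting afterwards lets information from the test rows leak into the selection. One way to avoid this, sketched below, is to split first and put the selector inside a scikit-learn `Pipeline`, so it is fit on the training rows only. The variable names and the `stratify` argument are additions for illustration and are not part of the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Load the original 30 features as a DataFrame and the target as a Series.
X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

# Split BEFORE any selection so the test rows never influence it.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# The pipeline fits the selector and the model on the training rows only.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", RandomForestClassifier(random_state=42)),
])
pipe.fit(X_tr, y_tr)

# Features chosen from the training data, and accuracy on the held-out rows.
kept = X_tr.columns[pipe.named_steps["select"].get_support()]
test_accuracy = pipe.score(X_te, y_te)

kept, test_accuracy
```

The same pipeline object can be passed to `cross_val_score`, in which case the selection is repeated inside each fold.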

## ✅ Summary
- **Feature Engineering** creates new useful features from raw data.
- **Feature Selection** reduces unnecessary features:
  - Filter methods → Statistical tests
  - Wrapper methods → Model-based evaluation
  - Embedded methods → Importance from algorithms
- Both steps help build efficient, high-performing ML models, provided they are fit on training data only so that test scores stay honest.