# Feature Engineering in Machine Learning
Feature engineering is one of the most critical steps in a machine learning pipeline, because a model can only learn from the signal its input features expose.

### What is Feature Engineering?
Feature Engineering is the process of creating, transforming, or selecting input features to improve model performance. It involves using domain knowledge and data transformation techniques to help the algorithm better understand the data.

##### Why is Feature Engineering Important?
- Boosts model accuracy
- Reduces overfitting/underfitting
- Simplifies complex data
- Helps models interpret hidden patterns

### Core Steps in Feature Engineering

##### 1. Feature Creation (Deriving New Features)
Creating new features from existing ones to provide more useful signals.

📌 Examples:
- BMI = Weight / Height^2
- Age from DOB
- Total_Amount = Quantity * Unit_Price
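
The three derived features above can be sketched with pandas. The sample values, the column names (`weight_kg`, `height_m`, `dob`, `quantity`, `unit_price`), and the fixed reference date are assumptions made for illustration; BMI here assumes weight in kilograms and height in metres.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative assumptions.
df = pd.DataFrame({
    "weight_kg": [70.0, 85.0],
    "height_m": [1.75, 1.80],
    "dob": pd.to_datetime(["1990-05-15", "2000-12-01"]),
    "quantity": [3, 10],
    "unit_price": [2.50, 1.20],
})

# Fixed reference date so the example is reproducible.
as_of = pd.Timestamp("2024-01-01")

# BMI = weight / height^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Age in completed years: subtract one year if the birthday
# has not yet occurred in the reference year.
had_birthday = (df["dob"].dt.month < as_of.month) | (
    (df["dob"].dt.month == as_of.month) & (df["dob"].dt.day <= as_of.day)
)
df["age"] = as_of.year - df["dob"].dt.year - (~had_birthday).astype(int)

# Total amount = quantity * unit price
df["total_amount"] = df["quantity"] * df["unit_price"]

print(df[["bmi", "age", "total_amount"]])
```

In practice the reference date would usually be the date of the event being modelled rather than a constant.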

##### 2. Feature Transformation
Transforming the data to meet the assumptions of algorithms or to normalize the range.

Common Transformations:

| Transformation      | Description                  | Example                 |
| ------------------- | ---------------------------- | ----------------------- |
| **Log Transform**   | Handles skewed data          | `log(x + 1)`            |
| **Square Root**     | Reduces moderate right skew  | `sqrt(x)`               |
| **Standardization** | Mean = 0, Std = 1            | `(x - mean)/std`        |
| **Normalization**   | Scale to [0, 1]              | `(x - min)/(max - min)` |
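
A minimal NumPy sketch of the four transformations in the table, applied to a small made-up array. The standardization below uses the population standard deviation (NumPy's default, `ddof=0`), which is an assumption; some tools use the sample standard deviation instead.

```python
import numpy as np

# Illustrative right-skewed, non-negative values (made up for this sketch).
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

log_x = np.log1p(x)                             # log(x + 1), defined at x = 0
sqrt_x = np.sqrt(x)                             # requires x >= 0
standardized = (x - x.mean()) / x.std()         # mean 0, std 1 (ddof=0)
min_max = (x - x.min()) / (x.max() - x.min())   # rescaled to [0, 1]

print(log_x.round(3))
print(standardized.round(3))
print(min_max.round(3))
```

When a model is evaluated on held-out data, the mean, standard deviation, minimum, and maximum should come from the training split only and then be reused on the test split.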


##### 3. Encoding Categorical Variables
Categorical features need to be converted to numeric format.

Encoding Techniques:
  
| Technique            | Use Case                     | Example                           |
| -------------------- | ---------------------------- | --------------------------------- |
| **Label Encoding**   | Ordinal Data                 | `{'Low':0, 'Medium':1, 'High':2}` |
| **One-Hot Encoding** | Nominal Data                 | `pd.get_dummies()`                |
| **Binary Encoding**  | High-cardinality categorical | About `log2(n)` columns for `n` categories |
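
The three techniques in the table can be sketched with pandas alone. The `priority` and `color` columns are hypothetical, and the binary encoding is written by hand here (integer code per category, then one column per bit) rather than taken from a dedicated library.

```python
import pandas as pd

# Hypothetical data: one ordinal column and one nominal column.
df = pd.DataFrame({
    "priority": ["Low", "High", "Medium", "Low"],
    "color": ["red", "green", "blue", "green"],
})

# Label (ordinal) encoding: an explicit mapping preserves the order.
priority_order = {"Low": 0, "Medium": 1, "High": 2}
df["priority_encoded"] = df["priority"].map(priority_order)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Binary encoding by hand: assign each category an integer code,
# then split that code into its binary digits.
codes = df["color"].astype("category").cat.codes.to_numpy()
n_bits = max(1, int(codes.max()).bit_length())
binary = pd.DataFrame(
    {f"color_bit{b}": (codes >> b) & 1 for b in range(n_bits)},
    index=df.index,
)

print(one_hot)
print(binary)
```

With three colors, one-hot encoding produces three columns and the binary version two; the gap widens as the number of categories grows.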


##### 4. Handling Missing Data
Missing values can bias a model or break training outright, since many algorithms do not accept NaN inputs.

📌 Strategies:
- Mean/Median/Mode Imputation
- Forward Fill / Backward Fill
- Using a placeholder (e.g., -999); this suits tree-based models better than models that treat the value as a real magnitude
- ML-based Imputation (KNN, MICE)
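
A short pandas sketch of the simpler strategies, using made-up series. ML-based imputation (KNN, MICE) needs a modelling library and is not shown. The `was_missing` indicator is an extra assumption of this sketch: it keeps the information that a value was absent, which imputation would otherwise erase.

```python
import numpy as np
import pandas as pd

# Made-up numeric series with two gaps.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 100.0])

mean_filled = s.fillna(s.mean())       # mean of the observed values
median_filled = s.fillna(s.median())   # more robust to extreme values
ffilled = s.ffill()                    # carry the last observation forward
bfilled = s.bfill()                    # pull the next observation backward
placeholder = s.fillna(-999)           # sentinel value

# Optional indicator so the model can still see which rows were imputed.
was_missing = s.isna().astype(int)

# Mode imputation is the usual choice for categorical columns.
cat = pd.Series(["a", None, "b", "a"])
mode_filled = cat.fillna(cat.mode().iloc[0])

print(median_filled.tolist())
print(mode_filled.tolist())
```

Forward and backward fill only make sense when row order is meaningful, as in time series.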

##### 5. Handling Outliers
Outliers can heavily affect model performance.

📌 Techniques:
- Z-score: If |z| > 3, it's likely an outlier
- IQR Method: Outliers = `x < Q1 - 1.5 * IQR` or `x > Q3 + 1.5 * IQR`
- Capping: Replace values beyond a max/min threshold with the threshold itself
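
The three techniques can be compared on a small made-up sample with one extreme value. This is a sketch under the assumption that the IQR fences double as the capping thresholds; other thresholds (for example fixed percentiles) are equally common.

```python
import numpy as np

# Made-up sample with one extreme value.
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 3

# IQR rule: flag points outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (x < lower) | (x > upper)

# Capping: clip values to the IQR fences instead of dropping them.
capped = np.clip(x, lower, upper)

print(z_outliers)
print(iqr_outliers)
print(capped)
```

On this sample the z-score rule flags nothing: the extreme value inflates the mean and standard deviation enough that its own z-score stays below 3, while the IQR rule does flag it. That is one reason the IQR rule is often preferred for small or heavily skewed samples.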

##### 6. Feature Selection
Removing irrelevant or redundant features speeds up training and reduces the risk of overfitting.

📌 Methods:
- Filter Methods: Correlation, Chi-Square
- Wrapper Methods: Recursive Feature Elimination (RFE)
- Embedded Methods: Lasso (L1), Tree-based models (e.g., Feature Importance in RandomForest)
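
As a sketch of the filter approach only, the snippet below scores features by their correlation with the target and drops one feature from each highly correlated pair. The synthetic data, the feature names, and the 0.95 redundancy threshold are all assumptions chosen for illustration; wrapper and embedded methods need a model and are not shown.

```python
import numpy as np
import pandas as pd

# Synthetic data: one informative feature, a near-duplicate of it,
# and one feature unrelated to the target.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
X = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal * 2.0 + rng.normal(scale=0.01, size=n),
    "noise": rng.normal(size=n),
})
y = pd.Series(3.0 * signal + rng.normal(scale=0.5, size=n), index=X.index)

# Relevance: absolute Pearson correlation of each feature with the target.
relevance = X.corrwith(y).abs()

# Redundancy: look only at the upper triangle of the feature-feature
# correlation matrix so each pair is considered once, then drop the
# later feature of any pair above the threshold.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
selected = [c for c in X.columns if c not in redundant]

print(relevance.round(3))
print("redundant:", redundant)
print("selected:", selected)
```

Pearson correlation only captures linear relationships, so a low score does not prove a feature is useless.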



##### 7. Discretization (Binning)
Converts continuous data into categorical bins.

📌 Example:
Convert age into:
- 0–18: Child
- 19–35: Young Adult
- 36–60: Adult
- 61+: Senior
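
The age bins can be produced with `pd.cut`. This sketch assumes whole-number ages and assigns age 60 to the Adult bin, so Senior starts at 61; the sample ages are made up.

```python
import pandas as pd

# Made-up ages, including values on the bin boundaries.
ages = pd.Series([5, 18, 19, 35, 36, 60, 61, 90])

# Bin edges are right-inclusive by default: (0, 18], (18, 35], (35, 60], (60, inf].
# include_lowest=True also puts age 0 in the first bin.
bins = [0, 18, 35, 60, float("inf")]
labels = ["Child", "Young Adult", "Adult", "Senior"]

age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "age_group": age_group}))
```

The result is an ordered categorical, so it can be ordinal-encoded or one-hot encoded afterwards, depending on the model.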



##### 8. Polynomial Features / Interaction Features
Creating products of features (interaction terms) or powers of a single feature (polynomial terms) lets simpler models, such as linear ones, capture non-linear relationships.
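
A minimal sketch of degree-2 expansion with pandas. The helper `add_degree2_features` and the `length`/`width` columns are hypothetical names introduced here; libraries such as scikit-learn provide a more general `PolynomialFeatures` transformer.

```python
from itertools import combinations_with_replacement

import pandas as pd


def add_degree2_features(df, columns):
    """Return a copy of df with squares and pairwise products of `columns`."""
    out = df.copy()
    for a, b in combinations_with_replacement(columns, 2):
        # Same column twice gives a polynomial term; two columns give an interaction.
        name = f"{a}^2" if a == b else f"{a}*{b}"
        out[name] = df[a] * df[b]
    return out


# Hypothetical data: the product length * width is the area.
df = pd.DataFrame({"length": [2.0, 3.0], "width": [4.0, 5.0]})
expanded = add_degree2_features(df, ["length", "width"])

print(expanded)
```

The number of generated columns grows quadratically with the number of input columns, so this is usually applied to a handful of selected features.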

##### 9. Datetime Feature Extraction

Extract meaningful components from date/time columns:
- Year, Month, Day
- Day of week
- Is weekend
- Time delta between dates
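
These components map directly onto the pandas `.dt` accessor. The `order_date` and `ship_date` columns and their values are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical order data with two datetime columns.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-10"]),
    "ship_date": pd.to_datetime(["2024-03-04", "2024-03-03", "2024-03-15"]),
})

df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day"] = df["order_date"].dt.day

# dayofweek: Monday = 0 ... Sunday = 6
df["day_of_week"] = df["order_date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"] >= 5

# Time delta between two dates, in whole days.
df["days_to_ship"] = (df["ship_date"] - df["order_date"]).dt.days

print(df)
```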

##### 10. Text Feature Engineering
For NLP tasks:
- Bag of Words
- TF-IDF
- Word Embeddings
- Keyword extraction
- Sentiment scores
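
Bag of Words and TF-IDF are simple enough to sketch in plain Python. The three documents are made up, tokenization is a naive whitespace split, and the TF-IDF formula is the unsmoothed textbook variant (`tf = count / document length`, `idf = log(N / document frequency)`); libraries typically use smoothed or normalized variants, so their numbers will differ. Word embeddings and sentiment scores rely on pretrained models and are not shown.

```python
import math
from collections import Counter

# Made-up corpus.
docs = ["the cat sat", "the dog sat", "the cat chased the dog"]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [d.lower().split() for d in docs]
counts = [Counter(doc) for doc in tokenized]
vocab = sorted({token for doc in tokenized for token in doc})

# Bag of Words: one row per document, one count per vocabulary term.
bow = [[c[t] for t in vocab] for c in counts]

# TF-IDF: term frequency weighted by how rare the term is across documents.
n_docs = len(tokenized)
doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
tfidf = [
    [(c[t] / len(doc)) * idf[t] for t in vocab]
    for c, doc in zip(counts, tokenized)
]

print(vocab)
for row in bow:
    print(row)
for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that appears in every document, like "the" here, gets an IDF of zero and therefore contributes nothing to the TF-IDF vectors.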