<a href="https://colab.research.google.com/github/KhushnurAnjum26/Data-Analysis/blob/main/Feature_Engineering_Theory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🧠 0. Feature Engineering (The Foundation)**


Feature engineering is the process of turning raw data into features that improve model performance. It is one of the most important steps in the machine learning pipeline.

### **🔍 Why It Matters:**
A good model with bad features will perform poorly.

A simple model with great features can perform extremely well.

**🧰 What it includes:**

Creating new features

Cleaning data

Transforming, scaling, and encoding

Selecting relevant features

# **🔄 1. Feature Transformation**

Transforming existing features into a more useful format. This helps models understand the data better or handle it more efficiently.

📌 Purpose:

Reduce skewness

Handle nonlinear relationships

Normalize distributions

🔧 Common Techniques:
Log transformation:

Example: Price = [10, 100, 1000] → log10(Price) = [1, 2, 3]

Power transformations (e.g., square root, cube)

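Box-Cox / Yeo-Johnson transformations to bring a distribution closer to normal (Box-Cox requires strictly positive values; Yeo-Johnson also accepts zero and negative values)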
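The log and power transformations listed above can be sketched in a few lines. This is a minimal illustration in plain Python, reusing the `Price` values from the example; the `counts` list and the variable names are made up for the sketch.

```python
import math

prices = [10, 100, 1000]  # values from the Price example above

# Base-10 log: each tenfold increase becomes a step of 1.
log_prices = [math.log10(p) for p in prices]

# Square root: a milder power transformation for non-negative data.
sqrt_prices = [math.sqrt(p) for p in prices]

# log1p(x) = ln(1 + x) is a common choice when zeros are present,
# because log(0) is undefined. The counts here are hypothetical.
counts = [0, 9, 99]
log1p_counts = [math.log1p(c) for c in counts]

print(log_prices)    # approximately [1.0, 2.0, 3.0]
print(sqrt_prices)
print(log1p_counts)
```

The raw prices span two orders of magnitude, while the log-transformed values are evenly spaced, which is why log transforms are often tried first on right-skewed features.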

## **🚫 1.1. Missing Values Imputation**

Missing values (nulls/NaNs) can cause many ML algorithms to fail. You must handle them properly.

🔧 Methods:

Mean/Median/Mode:

Use the mean (or the median when the data is skewed) for continuous features and the mode for categorical features

Forward/Backward fill:

Use previous or next known value

Model-based imputation:

Use regression or KNN to estimate missing values
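The simpler imputation methods above can be sketched as follows. This is an illustrative example in plain Python; the `ages` and `cities` data and the helper names `fill_with` and `forward_fill` are assumptions made for the sketch, and model-based imputation is left out.

```python
import statistics

ages = [25, None, 31, 40, None, 29]        # continuous feature with gaps
cities = ["Pune", None, "Delhi", "Pune"]   # categorical feature with a gap

def fill_with(values, fill_value):
    """Replace every None with fill_value."""
    return [fill_value if v is None else v for v in values]

def forward_fill(values):
    """Carry the last known value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

known_ages = [a for a in ages if a is not None]
known_cities = [c for c in cities if c is not None]

ages_mean = fill_with(ages, statistics.mean(known_ages))
ages_median = fill_with(ages, statistics.median(known_ages))
cities_mode = fill_with(cities, statistics.mode(known_cities))
ages_ffill = forward_fill(ages)

print(ages_mean)
print(ages_median)
print(cities_mode)
print(ages_ffill)
```

Forward fill only makes sense when the row order carries meaning, as in a time series.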

## **🔤 1.2. Handling Categorical Values**

Many models can only work with numbers, not strings. Categorical variables must be encoded.

🔧 Methods:

Label Encoding: Assigns a number to each category
(e.g., Male = 0, Female = 1)

One-Hot Encoding: Creates binary columns for each category
(e.g., Color = Red, Blue → [1, 0], [0, 1])

Target Encoding: Uses the mean of the target variable grouped by category

⚠️ Tip:
Avoid one-hot encoding with too many categories (e.g., Zip Codes).
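The three encodings above can be compared on one small column. This is a sketch in plain Python; the `colors` and `target` values are hypothetical, and the sorted category order is just one possible convention.

```python
colors = ["Red", "Blue", "Red", "Green"]
target = [1, 0, 1, 0]  # hypothetical binary target, one value per row

# Label encoding: map each category to an integer.
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red']
label_map = {cat: i for i, cat in enumerate(categories)}
labels = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category.
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Target encoding: replace each category with the mean target of its group.
sums, counts = {}, {}
for c, t in zip(colors, target):
    sums[c] = sums.get(c, 0) + t
    counts[c] = counts.get(c, 0) + 1
target_means = {c: sums[c] / counts[c] for c in sums}
target_encoded = [target_means[c] for c in colors]

print(labels)
print(one_hot)
print(target_encoded)
```

Label encoding imposes an order (here Blue < Green < Red) that may not exist in the data. Target encoding should be fitted on training data only, otherwise the target leaks into the features.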


## **🚨 1.3. Outlier Detection**

Outliers are unusual values that can skew model performance and affect results.

🔧 Methods:

IQR Method:

A value $x$ is an outlier if

$$x < Q_1 - 1.5 \times \mathrm{IQR} \quad \text{or} \quad x > Q_3 + 1.5 \times \mathrm{IQR}$$

Z-score Method:

A value is an outlier if $|z| > 3$, where $z = (x - \mu) / \sigma$

Visualization:

Boxplots, scatter plots
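Both numeric rules above can be applied to the same data. The sketch below uses plain Python with made-up `values`; the quartile convention (`method="inclusive"`) is one choice among several, and other tools may give slightly different quartiles.

```python
import statistics

# Hypothetical measurements with one suspicious value (95).
values = [10, 12, 11, 13, 12, 11, 14, 12, 13, 11,
          12, 10, 13, 12, 11, 12, 14, 13, 12, 95]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]

# Z-score rule: flag values more than 3 standard deviations from the mean.
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_outliers = [v for v in values if abs((v - mean) / std) > 3]

print(lower, upper, iqr_outliers)
print(z_outliers)
```

The outlier itself inflates the mean and standard deviation, so the Z-score rule can miss extreme values in very small samples. The IQR rule is less affected because quartiles are robust to extremes.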


## **⚖️ 1.4. Feature Scaling**

Different features have different ranges. Scaling ensures no feature dominates another due to its unit size.

🔧 Methods:

Min-Max Scaling (0 to 1):

$$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Standardization (Z-score):

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$

📌 Why:
Essential for distance-based algorithms such as KNN and SVM, and for models trained with gradient descent. Tree-based models are largely unaffected by feature scale.
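The two scaling formulas in this section translate directly into code. This is a minimal sketch in plain Python; `heights_cm` and the function names are illustrative, and the standard deviation used is the population one.

```python
import statistics

heights_cm = [150, 160, 170, 180, 190]  # hypothetical feature

def min_max_scale(xs):
    """Rescale to [0, 1]. Assumes max(xs) != min(xs)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to mean 0 and scale to standard deviation 1.
    Assumes the feature is not constant (sigma != 0)."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

scaled = min_max_scale(heights_cm)
standardized = standardize(heights_cm)

print(scaled)
print(standardized)
```

In a real pipeline the minimum, maximum, mean, and standard deviation would be computed on the training set and then reused on validation and test data.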


# **🧱 2. Feature Construction**

Creating new features that better describe the problem to the model, using existing raw data.

🧠 Example Ideas:
From Date → extract Year, Month, Day

Combine Price × Quantity = Total Spend

From Text → Count words or calculate sentiment
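The example ideas above can be combined into one small function. This is a sketch in plain Python; the `orders` records, their field names, and `build_features` are hypothetical, and the sentiment idea is left out because it needs a trained model or lexicon.

```python
from datetime import date

# Hypothetical raw records.
orders = [
    {"date": "2024-03-15", "price": 20.0, "quantity": 3,
     "review": "Great product, fast delivery"},
    {"date": "2023-11-02", "price": 5.5, "quantity": 10,
     "review": "Okay"},
]

def build_features(order):
    """Derive date parts, total spend, and a word count from one record."""
    d = date.fromisoformat(order["date"])
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "total_spend": order["price"] * order["quantity"],
        "review_word_count": len(order["review"].split()),
    }

features = [build_features(o) for o in orders]
for row in features:
    print(row)
```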



# **✅ 3. Feature Selection**

Reducing the number of features by keeping only the most relevant ones speeds up training, reduces overfitting, and can improve accuracy.

🔧 Methods:
Filter methods: Chi-square, ANOVA, correlation

Wrapper methods: Recursive Feature Elimination (RFE)

Embedded methods: Lasso Regression (L1), Tree-based models
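As an illustration of a filter method, the sketch below ranks features by their absolute Pearson correlation with the target and keeps those above a threshold. It uses plain Python; the data, the feature names, and the 0.5 cutoff are all assumptions chosen for the example, not recommended values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical housing data: two informative features and one irrelevant one.
features = {
    "size_sqm": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 5],
    "door_number": [7, 3, 9, 1, 5],
}
price = [150, 180, 240, 310, 360]

THRESHOLD = 0.5  # arbitrary cutoff for this sketch
scores = {name: abs(pearson(col, price)) for name, col in features.items()}
selected = [name for name, score in scores.items() if score >= THRESHOLD]

print(scores)
print(selected)
```

Correlation only captures linear relationships with a numeric target, and it scores each feature in isolation. Wrapper and embedded methods evaluate features in the context of a model instead.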


# **🧬 4. Feature Extraction**

Transforming original features into a new feature space (usually fewer dimensions) while keeping important information.

🔧 Techniques:

PCA (Principal Component Analysis):

Reduces dimensionality by creating new components (linear combinations of the original features) that explain the maximum variance

LDA (Linear Discriminant Analysis): A supervised technique that finds directions which best separate the classes

TF-IDF / Word2Vec: For extracting info from text

Autoencoders: Neural network-based compression
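To make the PCA idea concrete, here is a sketch restricted to two features, where the eigenvalues of the covariance matrix have a closed form. It uses plain Python with hypothetical data; real use would rely on a linear algebra library and handle any number of features.

```python
import math

# Two correlated features (hypothetical data).
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
n = len(xs)

# 1. Center each feature.
mx, my = sum(xs) / n, sum(ys) / n
cx = [x - mx for x in xs]
cy = [y - my for y in ys]

# 2. Sample covariance matrix [[a, b], [b, c]].
a = sum(u * u for u in cx) / (n - 1)
c = sum(v * v for v in cy) / (n - 1)
b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)

# 3. Eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# 4. Unit eigenvector for the largest eigenvalue (assumes b != 0).
vx, vy = b, lam1 - a
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# 5. Project the centered data onto the first principal component.
pc1 = [u * vx + v * vy for u, v in zip(cx, cy)]
explained_ratio = lam1 / (lam1 + lam2)

print(pc1)
print(explained_ratio)
```

The single column `pc1` replaces the two original features, and `explained_ratio` reports the share of total variance it retains.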
