## Feature Engineering – Definition

Feature Engineering is the process of transforming raw data into meaningful and structured features that can be effectively used by machine learning models to learn patterns and make accurate predictions.


### Feature Engineering Architecture

![Feature Engineering Architecture](architecture.png)

This diagram shows the basic feature engineering pipeline from raw data to
model-ready features.


## Feature Transformation
Feature Transformation refers to modifying existing features to make them more suitable for machine learning models.  
It changes the **representation or scale** of data without creating new information.

**Examples:**
- Scaling (StandardScaler, MinMaxScaler)
- Normalization
- Log / Power transformations
- Encoding categorical variables

**Goal:** Improve model performance and convergence.

---

## Feature Construction
Feature Construction is the process of **creating new features** from existing ones using domain knowledge or mathematical operations.

**Examples:**
- Creating `age` from `date_of_birth`
- Combining features (`total_amount = price × quantity`)
- Polynomial features
- Interaction terms

**Goal:** Add new information that helps the model learn better patterns.

---

## Feature Selection
Feature Selection is the process of **choosing the most relevant features** and removing redundant or irrelevant ones.

**Examples:**
- Correlation-based selection
- Variance threshold
- Recursive Feature Elimination (RFE)
- Feature importance from models

**Goal:** Reduce overfitting, improve interpretability, and speed up training.

---

## Feature Extraction
Feature Extraction transforms raw data into a **new feature space**, often reducing dimensionality.

**Examples:**
- PCA
- LDA
- Autoencoders
- TF-IDF (for text)

**Goal:** Capture important patterns while reducing data complexity.

---

## Key Differences (Summary)

| Technique | Creates New Features | Removes Features | Changes Representation |
|---------|---------------------|------------------|------------------------|
| Feature Transformation | No | No | Yes |
| Feature Construction | Yes | No | No |
| Feature Selection | No | Yes | No |
| Feature Extraction | Yes | Yes | Yes |


# Feature Transformation

## 1. Missing Value Treatment
1) Remove missing values
2) Fill missing values
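
Removal can be sketched with pandas `dropna`. The toy DataFrame and its column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Goa"],
})

rows_dropped = df.dropna()                # drop every row that has any missing value
age_required = df.dropna(subset=["age"])  # drop rows only when "age" is missing
cols_dropped = df.dropna(axis=1)          # drop every column that has any missing value
```

Dropping is simplest when only a small share of rows is affected; otherwise filling usually preserves more data.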

### Missing Value Filling Techniques

Missing values can negatively impact machine learning models. Below are commonly used techniques to fill missing data, chosen based on data type and problem context.

---

### a. Mean Imputation
Replaces missing values with the mean of the feature.

**Best suited for:**
- Numerical data
- Data without extreme outliers

**Limitation:**
- Sensitive to outliers

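```python
# Assign the result back; fillna(..., inplace=True) on a selected column may not update df in recent pandas versions
df["column"] = df["column"].fillna(df["column"].mean())
```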


### b. Median Imputation
Replaces missing values with the median of the feature.

**Best suited for:**
- Numerical data with outliers
- Skewed distributions

**Advantage:**
- Robust to outliers
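
A minimal sketch of median imputation, assuming a hypothetical `income` column that contains one extreme value:

```python
import numpy as np
import pandas as pd

# Hypothetical income column with one extreme value and one gap
df = pd.DataFrame({"income": [30_000, 32_000, np.nan, 35_000, 1_000_000]})

# The median ignores how large the extreme value is, unlike the mean
median_value = df["income"].median()
df["income"] = df["income"].fillna(median_value)
```

Here the gap is filled with 33,500, whereas the mean of the same column would be pulled far upward by the single large value.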

### c. Mode Imputation
Replaces missing values with the most frequent value.

**Best suited for:**
- Categorical features

**Limitation:**
- May introduce bias if one category dominates
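
Mode imputation might look like the sketch below; the `color` column is a made-up example.

```python
import pandas as pd

# Hypothetical categorical column with two gaps
df = pd.DataFrame({"color": ["red", "blue", None, "red", None]})

# mode() returns a Series because ties are possible; take the first entry
most_frequent = df["color"].mode()[0]
df["color"] = df["color"].fillna(most_frequent)
```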

### d. Forward Fill (ffill)
Fills missing values using the previous valid observation.

**Best suited for:**
- Time-series data
- Sequential data where order matters

### e. Backward Fill (bfill)
Fills missing values using the next valid observation.

**Best suited for:**
- Time-series data
- Sequential data
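
The two fill directions can be compared on a small series. This is only a sketch; the daily temperature readings are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with gaps
temps = pd.Series(
    [20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = temps.ffill()   # each gap copies the last value seen before it
backward = temps.bfill()  # each gap copies the next value after it
```

Note that `bfill` leaves the final gap unfilled because no later observation exists, and `ffill` would do the same for a gap at the very start.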

## 2. Handling Categorical Values

Categorical variables represent qualitative data and must be converted into numerical form before being used in most machine learning models.

---

### Types of Categorical Variables
- **Nominal**: Categories with no inherent order  
  (e.g., color, city, gender)
- **Ordinal**: Categories with a meaningful order  
  (e.g., low, medium, high)a
---

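### a. Label Encoding
Assigns a unique integer to each category.

**Best suited for:**
- Ordinal categorical variables (when the assigned integers follow the real order)
- Target labels in classification

**Limitation:**
- Imposes an artificial order on nominal data
- `LabelEncoder` assigns integers in sorted (alphabetical) order, which may not match the intended ranking

```python
from sklearn.preprocessing import LabelEncoder

# Categories are numbered in sorted order, e.g. "high" -> 0, "low" -> 1, "medium" -> 2
le = LabelEncoder()
df["column"] = le.fit_transform(df["column"])
```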

### b. One-Hot Encoding

One-Hot Encoding creates separate binary (0/1) columns for each category in a categorical feature.

**Best suited for:**
- Nominal categorical variables
- Algorithms that assume no order between categories

**Limitation:**
- Can significantly increase dimensionality when the number of categories is large

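```python
# Assumes pandas is imported as pd; drop_first=True drops one redundant dummy column per feature
df = pd.get_dummies(df, columns=["column"], drop_first=True)
```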

### c. Ordinal Encoding

Ordinal Encoding converts categorical values into numerical values based on a predefined order.

**Best suited for:**
- Ordered categorical variables

```python
from sklearn.preprocessing import OrdinalEncoder

# List the categories from lowest to highest so the integers follow that order
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["column"] = encoder.fit_transform(df[["column"]]).ravel()
```

### d. Frequency Encoding

Frequency Encoding replaces each category with its frequency count in the dataset.

**Best suited for:**
- High-cardinality categorical features

```python
freq = df["column"].value_counts()     # number of rows per category
df["column"] = df["column"].map(freq)
```

## 3. Outlier Detection

Outliers are data points that significantly deviate from the majority of observations. They can distort statistical analysis and negatively impact the performance of many machine learning models.

---

### Why Outlier Detection is Important
- Prevents skewed model learning
- Improves model stability and accuracy
- Helps identify data entry errors or rare but important events

---

### a. Z-Score Method

Measures how many standard deviations a data point is from the mean.

**Best suited for:**
- Normally distributed data

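**Limitation:**
- The mean and standard deviation are themselves distorted by extreme values, so a large outlier can mask smaller ones

```python
from scipy import stats

# Flag rows that lie more than 3 standard deviations from the mean
z_scores = stats.zscore(df["column"])
df_outliers = df[abs(z_scores) > 3]
```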
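### b. IQR (Interquartile Range) Method
Uses the spread between the 25th percentile (Q1) and the 75th percentile (Q3), where IQR = Q3 − Q1. Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are commonly flagged as outliers.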
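
The IQR rule can be sketched as follows. The 1.5 multiplier is the usual convention rather than a fixed requirement, and the `value` column is a made-up example.

```python
import pandas as pd

# Hypothetical column with one obviously extreme value
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 95]})

q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1

# Fences placed 1.5 * IQR beyond the quartiles (conventional choice)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

df_outliers = df[(df["value"] < lower) | (df["value"] > upper)]
df_clean = df[df["value"].between(lower, upper)]
```

Because quartiles barely move when a few extreme values are present, this rule tends to be more robust than the Z-score on skewed data.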

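### c. Box Plot Method
A visual form of the IQR rule: points drawn beyond the whiskers (by convention 1.5 × IQR from the box) are treated as potential outliers.

### d. Percentile-Based Method
Treats values below a chosen lower percentile or above a chosen upper percentile (for example, the 1st and 99th) as outliers, which can then be removed or capped.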
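
A minimal sketch of percentile-based handling, assuming the 1st and 99th percentiles as cutoffs (the cutoffs and the synthetic data are illustrative choices):

```python
import numpy as np
import pandas as pd

# Synthetic data, used only for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(loc=50, scale=5, size=1000)})

# Assumed cutoffs: 1st and 99th percentiles
low, high = df["value"].quantile([0.01, 0.99])

# Option 1: flag the rows outside the cutoffs
df_outliers = df[(df["value"] < low) | (df["value"] > high)]

# Option 2: cap (winsorize) instead of removing
df["value_capped"] = df["value"].clip(lower=low, upper=high)
```

Capping keeps every row, which can matter when the dataset is small or the extreme rows carry other useful columns.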

## 4. Feature Scaling

Feature Scaling is the process of standardizing or normalizing numerical features so that they are on a similar scale. This prevents features with larger magnitudes from dominating model learning.

---

### Why Feature Scaling is Important
- Ensures fair contribution of all features
- Improves model convergence
- Essential for distance-based and gradient-based algorithms

---

### Models That REQUIRE Scaling
- Linear Regression (when regularized or trained with gradient descent)
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
- Neural Networks
- K-Means Clustering

### Models That DO NOT Require Scaling
- Decision Trees and other tree-based models (e.g., Random Forest)


### Common Scaling Methods

| Scaling Method | Handles Outliers | Output Range |
| -------------- | ---------------- | ------------ |
| StandardScaler | No               | Unbounded    |
| MinMaxScaler   | No               | 0 to 1       |
| RobustScaler   | Yes              | Unbounded    |
| MaxAbsScaler   | No               | -1 to 1      |
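
As a sketch of how a scaler is typically applied, the example below fits `StandardScaler` on the training split and reuses the learned statistics on the test split. The synthetic age and income features are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical features on very different scales: age (years) and income
X = np.column_stack([
    rng.integers(18, 70, size=200),
    rng.normal(50_000, 15_000, size=200),
])
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```

Swapping in `MinMaxScaler` or `RobustScaler` follows the same fit-on-train, transform-on-test pattern.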


# Feature Construction

Feature Construction is the process of **creating new features** from existing data using domain knowledge, mathematical operations, or logical rules to improve a model’s ability to learn patterns.

---

### Why Feature Construction is Important

* Adds new information not explicitly present in raw data
* Improves model performance without changing algorithms
* Helps capture real-world relationships between variables

---

### Common Feature Construction Techniques

#### 1. Mathematical Operations

Create new features using arithmetic combinations.

**Examples:**

* Sum, difference, ratio, product

```python
df["total_price"] = df["price"] * df["quantity"]
df["price_per_unit"] = df["total_price"] / df["units"]
```

---

#### 2. Polynomial Features

Generate interaction and higher-degree terms.

**Best suited for:**

* Linear models capturing non-linear relationships

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```

---

#### 3. Date and Time Features

Extract meaningful components from datetime columns.

**Examples:**

* Year, month, day, weekday, hour

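```python
# The .dt accessor needs a datetime column, so convert first (assumes pandas is imported as pd)
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek  # Monday=0, Sunday=6
```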

---

#### 4. Aggregation-Based Features

Create features using group-level statistics.

**Examples:**

* Mean, median, count per group

```python
df["avg_salary_by_dept"] = df.groupby("department")["salary"].transform("mean")
```

---

#### 5. Flag / Indicator Features

Create binary features to represent conditions.

```python
df["is_weekend"] = df["dayofweek"].isin([5, 6]).astype(int)
```

---

#### 6. Text-Derived Features (Basic)

Create features from text length or counts.

```python
df["text_length"] = df["review"].str.len()
# .str.len() on the split lists keeps missing reviews as NaN instead of raising an error
df["word_count"] = df["review"].str.split().str.len()
```

---

### Best Practices

* Use domain knowledge whenever possible
* Avoid creating too many features blindly
* Validate feature usefulness with EDA or model performance
* Watch out for data leakage
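
One way leakage can creep into constructed features is through group statistics computed over all rows, including test rows. A hedged sketch of a safer pattern, using made-up `train` and `test` frames, learns the statistics from the training rows and maps them onto both splits:

```python
import pandas as pd

# Hypothetical train and test splits
train = pd.DataFrame({"department": ["A", "A", "B", "B"], "salary": [50, 70, 100, 120]})
test = pd.DataFrame({"department": ["A", "B", "C"], "salary": [65, 90, 80]})

# Learn group statistics from the training rows only
dept_mean = train.groupby("department")["salary"].mean()
global_mean = train["salary"].mean()

# Apply the learned mapping to both splits; unseen groups fall back to the global mean
train["avg_salary_by_dept"] = train["department"].map(dept_mean)
test["avg_salary_by_dept"] = test["department"].map(dept_mean).fillna(global_mean)
```

The fallback to the global mean for unseen departments is one possible choice; a sentinel value or a separate indicator column would also work.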

---

### Key Takeaways

* Feature Construction creates **new information**
* Often more impactful than model tuning
* Quality matters more than quantity


# Feature Selection 

Feature Selection is the process of selecting a subset of the most relevant features from the dataset and removing redundant or irrelevant features before training a machine learning model.

---

### Why Feature Selection is Important

* Reduces overfitting by removing noise
* Improves model performance and generalization
* Decreases training time and computational cost
* Improves model interpretability

---

### Types of Feature Selection Methods

#### 1. Filter Methods

Filter methods select features based on statistical measures, independent of any machine learning model.

**Common techniques:**

* Correlation analysis
* Variance threshold
* Chi-square test

```python
# Absolute pairwise correlations; pairs close to 1 are candidates for removal
corr_matrix = df.corr(numeric_only=True).abs()
```
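
Going from a correlation matrix to an actual removal step could look like the sketch below. The 0.9 threshold and the synthetic columns are assumptions, not values taken from the text.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is almost a rescaled duplicate of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

threshold = 0.9  # assumed cutoff; tune for the problem at hand
corr_matrix = df.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

df_reduced = df.drop(columns=to_drop)
```

This keeps the first feature of each highly correlated pair and drops the later one; which of the two to keep is often better decided with domain knowledge.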

---

#### 2. Wrapper Methods

Wrapper methods evaluate feature subsets by training a model and measuring performance.

**Common techniques:**

* Forward Selection
* Backward Elimination
* Recursive Feature Elimination (RFE)

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)
```

---

#### 3. Embedded Methods

Embedded methods perform feature selection as part of the model training process.

**Common techniques:**

* Lasso (L1 regularization)
* Tree-based feature importance

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X, y)
importances = model.feature_importances_
```
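
The Lasso variant listed above can be sketched with scikit-learn's `SelectFromModel`. The synthetic dataset and `alpha=1.0` are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which carry signal
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

lasso = Lasso(alpha=1.0)  # alpha is an assumed value; larger alpha removes more features
selector = SelectFromModel(lasso).fit(X_scaled, y)

selected_mask = selector.get_support()     # True for features with non-zero coefficients
X_selected = selector.transform(X_scaled)
```

Features whose coefficients are shrunk to (near) zero are discarded, so the regularization strength doubles as the selection knob.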

---

### When to Use Feature Selection

* High-dimensional datasets
* Presence of multicollinearity
* Limited computational resources
* Need for model interpretability

---

### Best Practices

* Perform feature selection after the train-test split, fitting the selector on the training data only to avoid data leakage
* Combine domain knowledge with statistical methods
* Avoid removing features blindly
* Validate selected features using cross-validation
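
One way to follow these practices together is to put the selector inside a pipeline so that every cross-validation fold refits it on its own training portion. This is a sketch on synthetic data; `k=5` and the choice of `SelectKBest` with `f_classif` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, used only for illustration
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=5)),  # k=5 is an assumed choice
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold refits the scaler and the selector on its own training portion
scores = cross_val_score(pipeline, X, y, cv=5)
mean_accuracy = scores.mean()
```

Selecting features on the full dataset before cross-validation would let information from the validation folds influence the selection and inflate the scores.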

---

### Key Takeaways

* Feature Selection removes irrelevant features
* Helps simplify models and improve generalization
* Different methods suit different problems


# Feature Extraction 

Feature Extraction is the process of transforming raw or high-dimensional data into a new set of features that captures the most important information while reducing complexity. It creates a new feature space that is more suitable for machine learning models.

---

### Why Feature Extraction is Important
- Reduces dimensionality of data
- Removes redundant and noisy information
- Improves model performance and training speed
- Helps handle high-dimensional datasets

---

### Common Feature Extraction Techniques

#### a. Principal Component Analysis (PCA)
PCA converts correlated numerical features into a smaller set of uncorrelated components while preserving maximum variance.

**Best suited for:**
- Numerical data
- Dimensionality reduction
- Multicollinearity issues

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
```

#### b. Linear Discriminant Analysis (LDA)

LDA projects data into a lower-dimensional space while maximizing class separability.

**Best suited for:**
- Supervised classification problems

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# n_components can be at most min(n_classes - 1, n_features)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
```

#### c. Text Feature Extraction

Transforms unstructured text into numerical features.

**Common techniques:**
- Bag of Words
- TF-IDF

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(text_data)
```

#### d. Autoencoders

Autoencoders are neural networks that learn compressed representations of data.

**Best suited for:**
- Large-scale datasets
- Deep learning applications


### Feature Selection vs. Feature Extraction

| Aspect                   | Feature Selection | Feature Extraction |
| ------------------------ | ----------------- | ------------------ |
| Creates new features     | No                | Yes                |
| Removes features         | Yes               | Yes                |
| Dimensionality reduction | Partial           | Strong             |
| Interpretability         | High              | Lower              |

### Best Practices

- Apply feature extraction after scaling (for PCA and LDA)
- Choose the number of components carefully
- Perform extraction after the train-test split, fitting on the training data only
- Validate impact on model performance
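
A hedged sketch that combines these practices: scaling and PCA are chained in one pipeline, fitted on the training split only, and the number of components is chosen by a target share of explained variance. The 95% target and the Iris dataset are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (0.95 is an assumed target)
extractor = make_pipeline(StandardScaler(), PCA(n_components=0.95))

X_train_pca = extractor.fit_transform(X_train)  # fit on the training split only
X_test_pca = extractor.transform(X_test)        # reuse the fitted scaler and PCA

explained = extractor.named_steps["pca"].explained_variance_ratio_
```

Inspecting `explained` shows how much variance each retained component carries, which helps judge whether fewer components would still be acceptable.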

### Key Takeaways

- Feature Extraction creates a new feature space
- It is useful for high-dimensional and complex data
- Improves efficiency but reduces interpretability
