# 📘 Chapter 3: Feature Engineering and Feature Selection

This chapter covers the critical processes of **feature engineering** and **feature selection**, which lie at the heart of building effective, scalable machine learning systems. These processes are essential in both training and inference pipelines, especially in production ML environments.


## 📌 Why Feature Engineering Matters

> *"Coming up with features is difficult, time-consuming, and requires expert knowledge."* — Andrew Ng

Feature engineering is the process of transforming raw data into features that models can learn from effectively. It directly influences:
- Model convergence speed
- Predictive performance
- Compute and storage efficiency

It involves an iterative process of projecting, transforming, reducing, or combining data to extract meaningful signals.

---

## 🔁 Matching Training and Serving

During training, statistics such as the mean or standard deviation are usually computed over the full dataset. At serving time, each input is processed independently, so those statistics must be reused exactly as they were computed during training. Any mismatch introduces **training-serving skew**, a critical production bug.


```python
import numpy as np

def compute_feature_std(feature_column):
    """Return the standard deviation of a feature over the full training column."""
    return np.std(feature_column)
```
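To make the skew concern concrete, here is a minimal sketch of computing statistics once at training time and reusing them for individual serving requests. The data and the helper name `standardize_for_serving` are hypothetical and only illustrate the pattern.

```python
import numpy as np

# Hypothetical training data for a single numeric feature.
train_values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Compute the statistics once, over the full training set.
train_mean = train_values.mean()
train_std = train_values.std()

def standardize_for_serving(value, mean=train_mean, std=train_std):
    """Apply the stored training-time statistics to a single serving input."""
    return (value - mean) / std

# Each request is transformed with the stored statistics,
# never with statistics recomputed from the request itself.
print(standardize_for_serving(14.0))  # 0.0, because 14.0 is the training mean
```

In a real pipeline the stored statistics would be persisted alongside the model so that training and serving share one source of truth.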

---

## 🔧 Common Preprocessing Operations

Feature engineering involves several recurring transformations:
- Data cleansing
- Normalization / Standardization
- Bucketizing
- One-hot encoding
- Dimensionality reduction
- Image & text feature transformation

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

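X = [[1], [2], [3]]
# Min-max normalization rescales values to the range [0, 1].
X_norm = MinMaxScaler().fit_transform(X)
# Standardization rescales values to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)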
```
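One-hot encoding appears in the list above but not in the scaling example. The following is a small sketch using scikit-learn's `OneHotEncoder`; the color values are made-up sample data, and `handle_unknown="ignore"` is one possible policy for categories that were never seen during fitting.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with three distinct values.
colors = [["red"], ["green"], ["blue"], ["green"]]

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colors).toarray()

# Categories are stored in sorted order: blue, green, red.
print(encoder.categories_[0])
print(encoded)

# A category that was not present when the encoder was fitted.
unseen = encoder.transform([["purple"]]).toarray()
print(unseen)
```

Fitting the encoder once and reusing it for new inputs keeps the column order identical between training and serving.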

---

## 🧹 Data Cleaning Example

```python
df = pd.DataFrame({"timestamp": ["00:00", "12:00", "00:00"]})
# Drop rows whose timestamp is "00:00" (assumed here to be an invalid placeholder value).
df_cleaned = df[df["timestamp"] != "00:00"]
```

---

## 🪣 Bucketizing Numerical Features

Bucketizing converts numerical values into ranges (bins), often making patterns easier to model.

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 4, 5, 6, 7, 8])
buckets = pd.qcut(values, q=3, labels=["low", "medium", "high"])
```
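Quantile boundaries are statistics of the training data, so they need to be reused when bucketizing new values. As a sketch, `pd.qcut` can return the learned bin edges via `retbins=True`, and `pd.cut` can then apply them; the `serving_values` below are hypothetical.

```python
import pandas as pd

train_values = pd.Series([1, 2, 2, 3, 4, 5, 6, 7, 8])
labels = ["low", "medium", "high"]

# retbins=True also returns the bin edges learned from the training data.
train_buckets, edges = pd.qcut(train_values, q=3, labels=labels, retbins=True)

# Reuse the learned edges for new values instead of recomputing quantiles.
serving_values = pd.Series([2, 5, 8])
serving_buckets = pd.cut(
    serving_values, bins=edges, labels=labels, include_lowest=True
)
print(list(serving_buckets))  # ['low', 'medium', 'high']
```

Values outside the training range fall outside every bin and become `NaN`, so a production pipeline would need an explicit policy for them, such as clipping.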

---


## ➕ Feature Crosses

Combining features to capture interactions (e.g. day × hour → hour of week):

```python
import pandas as pd

df = pd.DataFrame({"day": [1, 2], "hour": [10, 12]})
# Each day spans 24 hours; with a zero-indexed day this yields values from 0 to 167.
df["hour_of_week"] = df["day"] * 24 + df["hour"]
```
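Feature crosses are also commonly built from categorical features. The sketch below concatenates two hypothetical categorical columns (`day` and `hour_bucket`) into one crossed feature and one-hot encodes the result; the column names and values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["Mon", "Mon", "Sat"],
    "hour_bucket": ["morning", "evening", "morning"],
})

# Cross the two categorical features by concatenating their values.
df["day_x_hour"] = df["day"] + "_" + df["hour_bucket"]

# One-hot encode the crossed feature so each combination gets its own column.
crossed = pd.get_dummies(df["day_x_hour"])
print(list(crossed.columns))  # ['Mon_evening', 'Mon_morning', 'Sat_morning']
```

The number of crossed categories grows multiplicatively with the cardinality of the inputs, which is why crosses of high-cardinality features are often hashed or embedded instead.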

---

## 📉 Dimensionality Reduction

Used when the input space is large but can be compressed (e.g. PCA, UMAP):

```python
from sklearn.decomposition import PCA

X = [[1, 2], [3, 4], [5, 6]]
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
```
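Choosing the number of components is often easier in terms of retained variance. As a sketch on synthetic data (the three generated features are an assumption for illustration), passing a float to `n_components` asks scikit-learn's `PCA` to keep as many components as needed to explain that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))

# Three synthetic features: the second is almost a linear copy of the first,
# the third is independent noise.
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Keep the smallest number of components that explains at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

Because two of the three features are nearly redundant, two components are enough to pass the 95% threshold here.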

---


## 🧠 Feature Selection: Why & How

Having too many features:
- Increases training time & cost
- Introduces overfitting risk
- Slows inference

Goal: **Keep only predictive features**.

### 📊 Filter Methods

```python
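# Assumes `df` is an all-numeric DataFrame that includes a "target" column.
cor = df.corr()
# Absolute correlation of each feature with the target. The target itself is
# dropped, since its self-correlation of 1.0 would always pass the threshold.
cor_target = cor["target"].drop("target").abs()
selected_features = cor_target[cor_target > 0.8].index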
```

---

### 🧪 Univariate Selection

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

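def univariate_selection(X, y, k=10):
    """Return the names of the k features with the highest chi-squared scores."""
    # Score features on the training split only, so the held-out data stays unseen.
    X_train, _, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.2)
    # chi2 requires non-negative inputs, hence min-max scaling to [0, 1].
    X_scaled = MinMaxScaler().fit_transform(X_train)
    selector = SelectKBest(score_func=chi2, k=k)
    selector.fit(X_scaled, y_train)
    return X.columns[selector.get_support()]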
```
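`chi2` is only one possible score function. Mutual information, listed in the summary table later in this chapter, can be plugged into the same `SelectKBest` interface. The sketch below uses scikit-learn's bundled breast cancer dataset purely as example data; the choice of `k=5` is arbitrary.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Example data only: 30 numeric features and a binary target.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# mutual_info_classif uses a randomized estimator, so fix the seed for repeatability.
score_func = partial(mutual_info_classif, random_state=0)

selector = SelectKBest(score_func=score_func, k=5)
selector.fit(X, y)
top_features = X.columns[selector.get_support()]
print(list(top_features))
```

Unlike `chi2`, mutual information does not require non-negative inputs, so no min-max scaling step is needed here.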

---

### 🔁 Recursive Feature Elimination (RFE)

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

def run_rfe(X, y, k):
    model = RandomForestClassifier()
    rfe = RFE(model, n_features_to_select=k)
    rfe.fit(X, y)
    return X.columns[rfe.get_support()]
```
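Beyond the selected subset, `RFE` also records the order in which features were eliminated. The following sketch runs it on synthetic data from `make_classification` (the dataset shape, feature names, and estimator settings are assumptions for illustration) and inspects the `ranking_` attribute.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 8 features, of which 3 are informative.
X_arr, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

model = RandomForestClassifier(n_estimators=50, random_state=0)
rfe = RFE(model, n_features_to_select=3)
rfe.fit(X, y)

# ranking_ is 1 for selected features; larger values were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)
```

Because RFE refits the model after every elimination step, its cost grows with the number of features, which is the usual trade-off of wrapper methods.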

---


### 🌲 Embedded Methods

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

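# Assumes X (feature matrix) and y (labels) are already defined.
model = RandomForestClassifier()
model.fit(X, y)
# prefit=True reuses the fitted model. For tree ensembles the default threshold
# keeps features whose importance is at least the mean importance.
selector = SelectFromModel(model, prefit=True)
X_selected = selector.transform(X)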
```
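An L1 penalty is another common embedded approach, since it drives some coefficients to exactly zero. This is a sketch using `LogisticRegression` with scikit-learn's bundled breast cancer data as example input; the regularization strength `C=0.1` is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Example data only: 30 numeric features and a binary target.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Standardize first so the penalty treats all features on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty zeroes out coefficients of less useful features;
# a smaller C means stronger regularization and fewer surviving features.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X_scaled, y)

kept = X.columns[selector.get_support()]
print(len(kept), "features kept out of", X.shape[1])
```

Selection happens as a side effect of training, so no separate search over feature subsets is required.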

---


## 🤖 Tokenization with TF Transform

For LLMs, text data is tokenized into input IDs:


```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as tf_text

START_TOKEN_ID = 101
END_TOKEN_ID = 102

def preprocessing_fn(inputs):
    tokenizer = tf_text.BertTokenizer(...)
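    text = tf_text.normalize_utf8(inputs["message"])
    # Tokenize into wordpieces, then flatten the word and wordpiece dimensions.
    tokens = tokenizer.tokenize(text).merge_dims(-2, -1)
    # combine_segments expects a list of segments and adds the start/end IDs.
    tokens, type_ids = tf_text.combine_segments(
        [tokens], start_of_sequence_id=START_TOKEN_ID, end_of_segment_id=END_TOKEN_ID)
    # Pad or truncate both outputs to a fixed length so they become dense tensors.
    tokens, mask_ids = tf_text.pad_model_inputs(tokens, max_seq_length=128)
    type_ids, _ = tf_text.pad_model_inputs(type_ids, max_seq_length=128)
    return {"input_ids": tokens, "input_mask": mask_ids, "input_type_ids": type_ids}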
```
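To show what the three outputs look like, here is a plain-Python sketch of the same wrap, truncate, and pad logic for a single example. It is not the TensorFlow Text implementation: `build_model_inputs`, `PAD_TOKEN_ID`, and the sample token IDs are hypothetical, and the sequence length is shortened to 8 for readability.

```python
START_TOKEN_ID = 101
END_TOKEN_ID = 102
PAD_TOKEN_ID = 0  # Assumed padding ID for this illustration.

def build_model_inputs(token_ids, max_seq_length=8):
    """Wrap token IDs with start/end markers, then truncate or pad to a fixed length."""
    ids = [START_TOKEN_ID] + list(token_ids) + [END_TOKEN_ID]
    # Naive truncation; a long input would lose its end marker here.
    ids = ids[:max_seq_length]
    padding = max_seq_length - len(ids)
    return {
        "input_ids": ids + [PAD_TOKEN_ID] * padding,
        # 1 marks real tokens, 0 marks padding positions.
        "input_mask": [1] * len(ids) + [0] * padding,
        # A single segment, so every position has type 0.
        "input_type_ids": [0] * max_seq_length,
    }

example = build_model_inputs([7592, 2088])
print(example["input_ids"])   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(example["input_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The mask lets the model ignore padding positions, which is why it is returned alongside the token IDs.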

---

## 🧠 Summary Tables

### 🔧 Feature Engineering Keywords & Techniques

Feature engineering is the process of transforming raw data into input features that help machine learning models learn better and more efficiently. It includes scaling, encoding, aggregating, and creating new features.

| **Keyword**               | **Definition**                                                                 |
|---------------------------|-------------------------------------------------------------------------------|
| Normalization             | Scaling features to a range (e.g., [0,1]) using min-max scaling               |
| Standardization           | Rescaling to zero mean and unit variance (z-score)                           |
| One-Hot Encoding          | Converting categories to binary indicator vectors                            |
| Bucketizing               | Grouping numeric features into discrete bins                                 |
| Feature Crosses           | Combining multiple features into one                                          |
| Dimensionality Reduction  | Reducing feature space size while preserving information (e.g. PCA, UMAP)     |
| Embeddings                | Dense representations for high-cardinality features (e.g. text, categories)  |
| Data Cleansing            | Removing or correcting invalid/missing data                                  |
| Text Tokenization         | Splitting text into tokens for NLP models                                     |
| TF Transform (part of TFX) | Scalable, consistent feature transformation library for TensorFlow        |
| Training–Serving Skew     | Discrepancy between training and inference transformations                   |


### 🧠 Feature Selection Keywords & Techniques

Feature selection is the process of choosing the subset of the original features that contributes most to model performance, while reducing noise, overfitting, and computational cost.

| **Technique**             | **Type**         | **Definition**                                                                 |
|---------------------------|------------------|-------------------------------------------------------------------------------|
| Filter Methods            | Supervised/Unsupervised | Based on stats like correlation, chi-square                             |
| Wrapper Methods           | Supervised       | Evaluate subsets using model performance (e.g., RFE)                         |
| Embedded Methods          | Supervised       | Feature selection built into model (e.g., tree importance, L1 penalty)       |
| Univariate Selection      | Filter           | Evaluate features independently via stat tests (e.g., chi2, F-test)          |
| Recursive Feature Elimination | Wrapper    | Iteratively removes least important features                                |
| SelectFromModel           | Embedded         | Keep features with high importance scores from trained model                |
| Mutual Information        | Filter           | Measures feature-target dependency                                           |
| Forward Selection         | Wrapper          | Add features iteratively based on best improvement                          |
| Backward Elimination      | Wrapper          | Remove features iteratively starting from all                               |

---

## ✅ Benefits of TensorFlow Transform (part of TFX)

- Prevents training-serving skew
- Runs at scale via Apache Beam
- Produces a transform graph along with the transformed data
- Allows preprocessing to be deployed with the model

---

## 🧠 Conclusion

- **Feature engineering** enhances learnability and efficiency.
- **Feature selection** improves performance and scalability.
- Use scalable tools (TFX, Beam) in production.
- In GenAI/LLMs, **example selection** is as critical as feature work.

> Focus on **data quality over data quantity**. What your model learns depends entirely on what you give it.

---

✅ *This notebook helps you implement Chapter 3 of* Machine Learning Production Systems *in real-world pipelines.*
