## Scikit-learn Preprocessing — Theory-Only 



---

## 1. What you MUST learn from this section
Learn **concepts + rules**, not formulas or toy examples.

### Core ideas to keep
- Scaling exists to **stabilize optimization**, not to “improve accuracy magically”
- Different data → different scaler
- Fit scaler on **train only**
- Pipelines exist to prevent **data leakage**

Everything else is secondary.
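
The "fit on train only" rule can be sketched as follows. This is a minimal illustration, assuming synthetic data and an arbitrary 75/25 split; the array shapes and `random_state` values are placeholders, not from the original notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Statistics (mean, std) are learned from the training split only
scaler = StandardScaler().fit(X_train)

# The same train statistics are reused for both splits; the test split is never fit
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Calling `fit` (or `fit_transform`) on the full `X` before splitting would let test-set statistics leak into training.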

---

## 2. Standardization (StandardScaler) — ACTUALLY IMPORTANT

### What it really does
- Subtracts mean (centering)
- Divides by standard deviation (scaling)

### When you MUST use it
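- Linear Regression (when regularized, e.g. Ridge / Lasso, or trained by gradient descent; plain least squares predicts the same with or without scaling)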
- Logistic Regression
- SVM
- PCA
- Any model with:
  - Regularization
  - Distance / dot products

### When you can skip it
- Tree-based models (Decision Tree, RF, GBM)
- Rule-based models
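
A rough way to see why trees don't care: splits depend on the ordering of feature values, and standardization preserves that ordering. The sketch below assumes synthetic data, a shallow tree, and one artificially inflated feature (all illustrative choices); predictions should agree with and without scaling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; one feature is blown up to a very different scale
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000.0
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)

# Same tree settings, trained once on raw and once on scaled features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_raw.fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_scaled.fit(scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Fraction of test points where both trees predict the same class
agreement = np.mean(pred_raw == pred_scaled)
print(f"prediction agreement: {agreement:.2f}")
```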

### What to ignore
- “Gaussian” wording  
  → Model **does not require** normal distribution  
- Exact numeric example  
  → Just illustration

---

## 3. Term: **Sparsity** (important, not optional)

### What sparsity means
- Data with **many zeros**
- Example: a one-hot encoded row such as `[0, 0, 1, 0, 0]`

### Where sparse data comes from
- One-hot encoding
- Bag-of-words / TF-IDF
- High-dimensional categorical data

### Why it matters
- Sparse matrices store **only non-zero values**
- Efficient memory + speed

### Why centering breaks sparsity
- Centering subtracts the column mean from every entry, including the zeros
- Zeros → non-zero (each becomes `-mean`)
- The matrix turns dense and memory explodes

**Rule**
> Never center sparse data
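
The densification can be seen by centering a sparse matrix by hand. A minimal sketch, assuming a random 1000 x 50 matrix with 1% density (sizes chosen only for illustration):

```python
import numpy as np
from scipy import sparse

# Random sparse matrix: about 1% of the entries are non-zero
X = sparse.random(1000, 50, density=0.01, format="csr", random_state=0)
stored_before = X.nnz  # number of values actually stored

# Manual centering: subtract each column's mean from every entry
col_means = np.asarray(X.mean(axis=0)).ravel()
X_centered = X.toarray() - col_means  # forced to go dense

# Almost every former zero is now -mean, i.e. non-zero
nonzero_after = np.count_nonzero(X_centered)
print(stored_before, nonzero_after)
```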

---

## 4. MinMaxScaler — When it makes sense

### What it does
- Scales features to fixed range (usually 0–1)

### Use it when
- You need bounded values
- Feature std is tiny or unstable
- Some NN setups

### DO NOT use when
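- Strong outliers exist (one extreme value squeezes everything else into a narrow band)
- Data distribution can shift a lot (values outside the training min/max land outside the target range)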

### Ignore
- Formula derivation
- `min_`, `scale_` attributes unless debugging
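
The outlier problem is easy to see on a tiny column. A minimal sketch, assuming five made-up values where the last one is an outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Four ordinary values and one extreme value (illustrative data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_scaled = MinMaxScaler().fit_transform(X)

# The outlier maps to 1; the ordinary values are squeezed close to 0
print(X_scaled.ravel())
```

The four ordinary values end up in a small fraction of the [0, 1] range, so most of the range is spent on a single point.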

---

## 5. MaxAbsScaler — Why it exists

### What it does
- Divides by max absolute value
- Output in [-1, 1]

### Why it’s special
- Does **not center**
- Preserves zeros

### When to use
- Sparse data
- Features already centered near zero

**Mental rule**
> Sparse data → MaxAbsScaler
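
A minimal sketch of the mental rule, assuming a random sparse matrix with mixed-sign values (the shape, density, and value range are illustrative):

```python
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Random sparse matrix; shift and stretch the stored values to roughly [-50, 50)
X = sparse.random(200, 20, density=0.05, format="csr", random_state=0)
X.data = (X.data - 0.5) * 100.0

X_scaled = MaxAbsScaler().fit_transform(X)

# Still sparse, same number of stored values, magnitudes bounded by 1
print(sparse.issparse(X_scaled), X.nnz, X_scaled.nnz)
print(abs(X_scaled).max())
```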

---

## 6. Scaling sparse data — non-negotiable rules

### Allowed
- MaxAbsScaler
- StandardScaler(with_mean=False)

### Forbidden
- StandardScaler() default
- RobustScaler.fit() on sparse data

### Why sklearn yells at you
- Silent centering would densify the matrix = RAM explosion
- Better to raise an error up front than to quietly exhaust memory
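
Both sides of the rule for `StandardScaler` can be sketched in a few lines, assuming a small random sparse matrix (illustrative only):

```python
from scipy import sparse
from sklearn.preprocessing import StandardScaler

X = sparse.random(200, 20, density=0.05, format="csr", random_state=0)

# Forbidden: the default (with_mean=True) refuses to center sparse input
try:
    StandardScaler().fit(X)
    centering_error = None
except ValueError as exc:
    centering_error = str(exc)
print(centering_error)

# Allowed: scale by the standard deviation only, without centering
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
print(sparse.issparse(X_scaled), X_scaled.nnz == X.nnz)
```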

---

## 7. Outliers — what matters

### What is an outlier
- Extreme value far from majority
- Skews mean and std

### Problem
- StandardScaler gets distorted

### Solution
- RobustScaler
- Uses median + IQR
- Resistant to outliers

### Tradeoff
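- Outliers are not removed; they stay extreme after scaling, they just stop dominating the scale
- Slightly slower (needs quantiles, not just mean and std)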
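
The distortion and the fix can be compared side by side. A minimal sketch, assuming six made-up values where the last one is an outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values and one extreme value (illustrative data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)  # centers on median, scales by IQR

# How much of the scaled range the five ordinary values occupy
inlier_spread_standard = standard[:5].max() - standard[:5].min()
inlier_spread_robust = robust[:5].max() - robust[:5].min()
print(inlier_spread_standard, inlier_spread_robust)
```

With `StandardScaler` the outlier inflates the standard deviation, so the ordinary values collapse to nearly the same number; with `RobustScaler` they keep a usable spread.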

---

## 8. Kernel centering — SKIP THIS

### Only relevant if
- You manually compute kernel matrices
- You know what a Gram matrix is
- You are deep into kernel theory

### For you
- **Ignore completely**

---

## 9. What to IGNORE safely
- Toy numeric arrays
- Exact formulas
- Attribute introspection examples
- KernelCenterer section
- CSR vs CSC details (for now)

---

## 10. Final mental map (remember this)
- Linear models → scale
- Tree models → don’t care
- Sparse data → never center
- Outliers → RobustScaler
- Pipelines → no leakage
- Scaling fixes **optimization**, not **data quality**



## 7.3.1 Standardization, or mean removal and variance scaling

```python
from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)

scaler.mean_
# array([1.        , 0.        , 0.33333333])

scaler.scale_
# array([0.81649658, 0.81649658, 1.24721913])

X_scaled = scaler.transform(X_train)
# X_scaled = (X - mean_) / scale_
X_scaled
# array([[ 0.        , -1.22474487,  1.33630621],
#        [ 1.22474487,  0.        , -0.26726124],
#        [-1.22474487,  1.22474487, -1.06904497]])

X_scaled.mean(axis=0)
# array([0., 0., 0.])

X_scaled.std(axis=0)
# array([1., 1., 1.])
```

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
```
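
The last cell stops at the imports. One way they might be combined is sketched below, assuming synthetic data from `make_classification` and `random_state=42` (both illustrative choices, not taken from the notes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumption for illustration)
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline fits the scaler on X_train only, then fits the model on the scaled data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# At scoring time, X_test is transformed with the train statistics before prediction
score = pipe.score(X_test, y_test)
print(score)
```

Because scaling happens inside the pipeline, there is no separate step where test data could accidentally be used for fitting.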