## Preprocessing

### 1 - Standardization, or mean removal and variance scaling

**Why We Standardize Data:**

1. Algorithms that use distance
Methods like **KNN, SVM, PCA, K-means** rely on distances or variances between samples, so they are sensitive to the scale of features.  
If one feature ranges from 1–1000 and another from 0–1, the large-scale feature dominates the calculations.  
**Standardization balances them**, so all features contribute equally.

2. Gradient-based optimization
Algorithms like **logistic regression** and **neural networks** converge faster if features are on a similar scale.  
Standardization helps speed up learning and improves stability.

3. Interpreting coefficients
In **linear models**, standardization allows you to **compare the effect of different features directly**, because they’re on the same scale.

---

**When to Use Standardization:**

- Almost always when features are on **different scales**.  
- Particularly important for **distance-based** or **gradient-based models**.  
- Usually **not needed for tree-based models** like Random Forests or XGBoost, since they do not rely on feature scale.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# import the standard scaler:
from sklearn.preprocessing import StandardScaler

In [None]:
X_train = np.array([[ 1., -1.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
# create the scaler:
scaler = StandardScaler()
# fit the scaler:
scaler.fit(X_train)

In [None]:
# return the mean calculated for each col:
scaler.mean_

array([1.        , 0.        , 0.33333333])

In [None]:
# return the std of each col:
scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])

In [None]:
# transform the data:
X_scaled = scaler.transform(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
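The scaled values above can be reproduced by hand, which makes the roles of `mean_` and `scale_` explicit. The following is a minimal sketch that assumes the same `X_train` as above; the names `X_manual` and `X_sklearn` are illustrative only. It subtracts each column's mean and divides by the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# z = (x - mean) / std, computed column by column
col_mean = X_train.mean(axis=0)
col_std = X_train.std(axis=0)  # ddof=0, i.e. the population standard deviation
X_manual = (X_train - col_mean) / col_std

X_sklearn = StandardScaler().fit_transform(X_train)

# Both approaches should agree, and every column should end up
# with mean 0 and standard deviation 1.
print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```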

### 2 - Scaling features to a range

Scaling features to a range means transforming numerical data so that all values fall within a specific interval, commonly 0 to 1 or −1 to 1. This is done to ensure that features with larger numeric values do not dominate those with smaller ones when training a model. We use feature scaling when working with algorithms that are sensitive to the magnitude of data, such as k-nearest neighbors, support vector machines, neural networks, and gradient descent–based models. Scaling helps models train faster, behave more stably, and often achieve better performance.

In [None]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [None]:
# import the MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler

In [None]:
# create a scaler:
min_max_scaler = MinMaxScaler()
# fit and transform the data:
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [None]:
# transform the test data:
X_test = np.array([[-3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

In [None]:
# attributes of the scaler:
min_max_scaler.scale_

array([0.5       , 0.5       , 0.33333333])

In [None]:
min_max_scaler.min_

array([0.        , 0.5       , 0.33333333])
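To connect `scale_` and `min_` to the transformed values, here is a small sketch under the assumption that the default `feature_range=(0, 1)` is used; the variable names `X_manual` and `X_test_manual` are illustrative. It applies the min-max formula `(x - min) / (max - min)` by hand and then shows that `transform` amounts to `X * scale_ + min_`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

# Min-max formula for the default range [0, 1]
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_manual = (X_train - col_min) / (col_max - col_min)

scaler = MinMaxScaler().fit(X_train)

# transform() reuses the statistics learned from the training data
X_test_manual = X_test * scaler.scale_ + scaler.min_

print(np.allclose(X_manual, scaler.transform(X_train)))
print(np.allclose(X_test_manual, scaler.transform(X_test)))
```

Because the minimum and maximum come from the training data only, test values such as `-3` can land outside the `[0, 1]` interval, as the `-1.5` in the output above shows.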

### 3 - Scaling Sparse Data

Scaling sparse data is about adjusting feature magnitudes **without destroying sparsity** (the many zeros that make sparse data memory-efficient).

**Key Idea:**
Sparse data (common in text data like bag-of-words or TF-IDF) contains mostly zeros. **Centering** (subtracting the mean) would turn many zeros into non-zero values, making the data dense. This wastes memory and can crash programs. Therefore, we usually **scale without centering**.

**Why Some Scalers Work and Others Don’t:**

**MaxAbsScaler ✔️**  
Scales each feature by its maximum absolute value. Zeros remain zeros, so sparsity is preserved. This is the **recommended scaler for sparse data**.

**StandardScaler (`with_mean=False`) ✔️**  
Scales by standard deviation but does **not subtract the mean**, preserving sparsity. With the default `with_mean=True`, sparse input raises an error to prevent memory issues.

**RobustScaler ❌ (for fitting)**  
Requires centering using medians, which breaks sparsity. You can only use `transform`, not `fit`, on sparse data.
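The behaviour described above can be checked on a tiny SciPy sparse matrix. This is a sketch with made-up data, and the names `X_sparse`, `X_maxabs`, `X_std`, and `centering_error` are illustrative. It scales with `MaxAbsScaler` and with `StandardScaler(with_mean=False)`, confirms that the result stays sparse, and captures the error raised when centering is requested.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# A small, mostly-zero matrix stored in CSR format (illustrative data)
X_sparse = sparse.csr_matrix(np.array([[0., 2., 0.],
                                       [4., 0., 0.],
                                       [0., -1., 0.]]))

# MaxAbsScaler divides each column by its maximum absolute value
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler without centering also keeps zeros as zeros
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)

# Asking for centering on sparse input is rejected
try:
    StandardScaler(with_mean=True).fit(X_sparse)
    centering_error = None
except ValueError as exc:
    centering_error = str(exc)

print(sparse.issparse(X_maxabs), X_maxabs.nnz == X_sparse.nnz)
print(X_maxabs.toarray())
print(centering_error)
```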


### 4 - Scaling data with outliers
If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use RobustScaler as a drop-in replacement instead. It uses more robust estimates for the center and range of your data.

Robust Scaling – Key Definitions and Formula:

**Median**  
The middle value of the sorted data.

**Q1 (25%)**  
The value below which 25% of the data lies (first quartile).

**Q3 (75%)**  
The value below which 75% of the data lies (third quartile).

**Interquartile Range (IQR)**  
$$
\text{IQR} = Q3 - Q1
$$

**RobustScaler Formula:**
$$
x_{\text{scaled}} = \frac{x - \text{median}}{\text{IQR}}
$$

**Why this is robust to outliers:**
- The **median** is not strongly affected by extreme values.
- The **IQR** focuses on the middle 50% of the data.
- As a result, outliers have minimal influence on the scaling process.


In [None]:
X = [[ 1., -2.,  2.],
     [ -2.,  1.,  3.],
     [ 4.,  1., -2.]]
# import the robust scaler:
from sklearn.preprocessing import RobustScaler


In [None]:
# create a transformer:
transformer = RobustScaler()
# fit the transformer to the data:
transformer.fit(X)

In [None]:
# transform the data:
transformer.transform(X)

array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
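The formula above can be verified directly with NumPy. This is a minimal sketch that assumes the default `RobustScaler` settings (centering on the median, scaling by the 25th to 75th percentile range); `X_manual` and `X_sklearn` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])

# Per-column median and interquartile range
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1

# x_scaled = (x - median) / IQR
X_manual = (X - median) / iqr

X_sklearn = RobustScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
```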

### 5 - Mapping to a Uniform distribution

#### What does “mapping to a uniform distribution” mean?

A **uniform distribution between 0 and 1** means:

- Values are spread **evenly**
- Any number between **0 and 1** is equally likely

When we say **QuantileTransformer maps data to a uniform distribution**, it means:

- It **re-ranks your data based on its order (percentiles)**, not its original numeric values.

So:

- Smallest value → near **0**
- Median value → near **0.5**
- Largest value → near **1**

⚠️ **Important:**  
This transformation **does not preserve distances** between values — it preserves **only their relative order**.


#### What does QuantileTransformer actually do?

For each value in the dataset:

1. Find its **rank / percentile**
2. Replace the original value with that **percentile**
3. The output lies between **0 and 1**

This method is **non-parametric**, meaning:

- It makes **no assumption about the data’s distribution**
- It works well with **skewed, irregular, or unknown distributions**

In [None]:
# load the data and split it into train and test sets:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
# import the transformer:
from sklearn.preprocessing import QuantileTransformer

In [None]:
# create a transformer:
quantile_transformer = QuantileTransformer(random_state=0)

In [None]:
# fit and transform the data:
X_train_trans = quantile_transformer.fit_transform(X_train)



In [None]:
np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])

array([4.3, 5.1, 5.8, 6.5, 7.9])

In [None]:
np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])

array([0.        , 0.23873874, 0.50900901, 0.74324324, 1.        ])
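To make the "replace each value with its percentile" idea concrete, here is a sketch on synthetic skewed data. It assumes `n_quantiles` is set equal to the number of samples so that the learned quantiles coincide with the sorted data; the names `x`, `x_uniform`, and `manual_percentile` are illustrative. Under that assumption, the output should match the normalized rank of each value.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Skewed synthetic feature (illustrative data)
rng = np.random.RandomState(0)
x = rng.lognormal(size=(200, 1))

qt = QuantileTransformer(n_quantiles=200, output_distribution="uniform",
                         random_state=0)
x_uniform = qt.fit_transform(x)

# Manual version: rank of each value, rescaled to [0, 1]
ranks = np.argsort(np.argsort(x[:, 0]))
manual_percentile = ranks / (len(x) - 1)

# The order of the samples is unchanged; only the spacing is
order_preserved = np.array_equal(np.argsort(x[:, 0]),
                                 np.argsort(x_uniform[:, 0]))

print(order_preserved)
print(np.allclose(x_uniform[:, 0], manual_percentile, atol=1e-6))
```

With fewer quantiles than samples, the transformer interpolates between the stored quantiles, so the match with exact ranks becomes approximate.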

### 6 - Mapping to a Gaussian distribution

In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness.

PowerTransformer currently provides two such power transformations, the Yeo-Johnson transform and the Box-Cox transform.

In [None]:
# import the power transformer:
from sklearn.preprocessing import PowerTransformer

In [None]:
# data:
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))

In [None]:
# create the transformer:
pt = PowerTransformer(method="box-cox", standardize=False)

In [None]:
pt.fit_transform(X_lognormal)

array([[ 0.49024349,  0.17881995, -0.1563781 ],
       [-0.05102892,  0.58863195, -0.57612414],
       [ 0.69420008, -0.84857823,  0.10051454]])
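A larger synthetic sample makes the effect of a power transform easier to see. The sketch below assumes log-normal data and uses `scipy.stats.skew` to measure skewness; the names `x`, `x_shifted`, `skew_before`, and `skew_after` are illustrative. It also illustrates a practical difference between the two methods: Box-Cox only accepts strictly positive data, while Yeo-Johnson also accepts zero and negative values.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
x = rng.lognormal(size=(1000, 1))  # strictly positive, right-skewed

# Box-Cox (standardize=True by default, so the output has mean 0 and std 1)
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)
skew_before = skew(x[:, 0])
skew_after = skew(x_boxcox[:, 0])
print(skew_before, skew_after)

# Shifting the data introduces negative values
x_shifted = x - 1.0

# Yeo-Johnson handles them
x_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x_shifted)

# Box-Cox rejects them
try:
    PowerTransformer(method="box-cox").fit(x_shifted)
    box_cox_error = None
except ValueError as exc:
    box_cox_error = str(exc)
print(box_cox_error)
```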

### 7 - Normalization (Vector Normalization)

**Normalization** (in this context) means:

> **Scale each individual sample (row vector) so that its length (norm) becomes 1, while keeping its direction the same.**

- You **do not change relationships inside a sample**
- You **only change its magnitude**
- This is **different from standardization**, which enforces:
  - mean = 0
  - standard deviation = 1

#### What is a Norm?

A **norm** measures the *length* (or size) of a vector.

For a vector:

$$
x = [x_1, x_2, \dots, x_n]
$$

#### Common Norms

**L1 Norm:**
$$
\|x\|_1 = |x_1| + |x_2| + \dots + |x_n|
$$

- Measures total absolute magnitude
- Often used when sparsity matters

**L2 Norm (Most Common):**
$$
\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}
$$

- Euclidean length of the vector
- Dividing a vector by it keeps the vector's direction
- Most commonly used in text and similarity tasks

#### Unit Norm

A vector has a **unit norm** if:

$$
\|x\| = 1
$$

#### Normalization Formula

$$
x_{\text{normalized}} = \frac{x}{\|x\|}
$$

This rescales the vector so its length is exactly 1.



#### Normalization in Text Classification

In text classification, each document is represented as a vector:

$$
\text{Doc}_i = [\text{word}_1\_\text{count}, \text{word}_2\_\text{count}, \dots, \text{word}_N\_\text{count}]
$$

#### Problem Without Normalization

- Long documents have **larger vector values**
- Short documents have **smaller vector values**
- Dot product and similarity measures:
  - Favor **longer documents**
  - Even if content is similar



#### Solution: Normalize Each Document Vector

After normalization:

- Document length **does not matter**
- Only **word distribution** matters
- Similarity depends on **direction**, not magnitude

This is why **TF-IDF + L2 normalization** is standard in:
- Text classification
- Information retrieval
- Document clustering

#### Key Intuition

> **Normalization removes the effect of document length and keeps only the content pattern.**

That makes it ideal for similarity-based methods such as:
- Cosine similarity
- Linear SVM
- k-NN (with cosine distance)
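The document-length effect can be illustrated with made-up word-count vectors. In this sketch, `doc_long` is simply `doc_short` repeated ten times, and `doc_other` has a different word distribution; all three names and their values are hypothetical. Raw dot products grow with document length, while after L2 normalization the dot product equals the cosine similarity and depends only on direction.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical word-count vectors over a 4-word vocabulary
doc_short = np.array([[2., 1., 0., 1.]])
doc_long = 10 * doc_short            # same content, ten times longer
doc_other = np.array([[0., 0., 5., 1.]])

# Raw dot products favor the longer document
raw_short = (doc_short @ doc_short.T).item()
raw_long = (doc_long @ doc_short.T).item()
print(raw_short, raw_long)

# After L2 normalization, the dot product is the cosine similarity
docs = normalize(np.vstack([doc_short, doc_long, doc_other]), norm="l2")
cosine = docs @ docs.T
print(np.round(cosine, 3))
```

The short and long documents end up with a similarity of 1, since they point in the same direction, while the document with a different word distribution scores much lower.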


In [None]:
# data:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

In [None]:
from sklearn import preprocessing
# normalize the data
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [None]:
# import a normalizer:
from sklearn.preprocessing import Normalizer

In [None]:
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer

In [None]:
normalizer.transform(X)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
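The same result can be obtained by dividing each row by its norm, which is all the normalization formula does. This is a minimal sketch using the same `X` as above; `X_l2_manual` and `X_l1_manual` are illustrative names, and the L1 variant is included only for comparison.

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# L2: divide each row by its Euclidean length
l2_norms = np.linalg.norm(X, ord=2, axis=1, keepdims=True)
X_l2_manual = X / l2_norms

# L1: divide each row by the sum of its absolute values
l1_norms = np.abs(X).sum(axis=1, keepdims=True)
X_l1_manual = X / l1_norms

print(np.allclose(X_l2_manual, normalize(X, norm="l2")))
print(np.allclose(X_l1_manual, normalize(X, norm="l1")))
```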

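### 8 - Ordinal Encoder (label encoding for features)

`OrdinalEncoder` maps each categorical feature to integer codes (scikit-learn's `LabelEncoder` is meant for target labels, so `OrdinalEncoder` is used here for features). By default, it raises an error at prediction time when it meets an unknown category.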

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
# data:
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
X

[['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]

In [None]:
# create an encoder:
enc = OrdinalEncoder(encoded_missing_value=-1)

In [None]:
# fit the encoder:
X_encoded = enc.fit_transform(X)
X_encoded

array([[1., 1., 1.],
       [0., 0., 0.]])

In [None]:
X_test = [['male', 'from Europe', 'uses Firefox'], ['female', 'from US', 'uses Safari']]
X_test_encoded = enc.transform(X_test)
X_test_encoded

array([[1., 0., 0.],
       [0., 1., 1.]])

**Handling unknown categories**

In [None]:
# data:
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
X

[['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]

In [None]:
# create an encoder and fit it:
enc = OrdinalEncoder(encoded_missing_value=-1, handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(X)

In [None]:
X_test = [['male', 'from Asia', 'uses Firefox'], ['other', 'from US', 'uses Safari']]
X_test_encoded = enc.transform(X_test)
X_test_encoded

array([[ 1., -1.,  0.],
       [-1.,  1.,  1.]])
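For contrast, the sketch below shows what happens with and without `handle_unknown='use_encoded_value'`. It reuses the training data from above; `strict_enc` and `tolerant_enc` are illustrative names. The default encoder raises a `ValueError` on an unseen category, while the tolerant one encodes it as `-1` and maps it back to `None` in `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
X_new = [['male', 'from Asia', 'uses Firefox']]  # 'from Asia' was never seen

# Default behaviour: unknown categories raise an error
strict_enc = OrdinalEncoder().fit(X)
try:
    strict_enc.transform(X_new)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Tolerant behaviour: unknown categories are encoded as -1
tolerant_enc = OrdinalEncoder(handle_unknown='use_encoded_value',
                              unknown_value=-1).fit(X)
encoded = tolerant_enc.transform(X_new)
decoded = tolerant_enc.inverse_transform(encoded)
print(encoded)
print(decoded)
```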

### 9 - One Hot encoder

In [None]:
# import the one hot encoder:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# create an encoder:
enc = OneHotEncoder()

In [None]:
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Safari']]).toarray()


array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])

In [None]:
# get the categories:
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [None]:
# data:
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]


In [None]:
# categories:
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']

In [None]:
# create an encoder:
enc = OneHotEncoder(categories=[genders, locations, browsers])

In [None]:
# fit the data:
enc.fit(X)

In [None]:
# transform the data:
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()

array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

**Handle unknown categories for unseen data**

In [None]:
# create an encoder:
enc = OneHotEncoder(handle_unknown='infrequent_if_exist')
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from Asia', 'uses Safari']]).toarray()

array([[1., 0., 0., 0., 0., 1.]])

**Drop one category for binary features (`drop='if_binary'`)**

In [None]:
X = [['male', 'US', 'Safari'],
     ['female', 'Europe', 'Firefox'],
     ['female', 'Asia', 'Chrome']]
drop_enc = OneHotEncoder(drop='if_binary').fit(X)
drop_enc.transform(X).toarray()

array([[1., 0., 0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1., 0., 0.]])

In [None]:
drop_enc.categories_

[array(['female', 'male'], dtype=object),
 array(['Asia', 'Europe', 'US'], dtype=object),
 array(['Chrome', 'Firefox', 'Safari'], dtype=object)]

**Drop for binary features and handle unknown categories**

In [None]:
drop_enc = preprocessing.OneHotEncoder(drop='if_binary', sparse_output=False,
                                       handle_unknown='ignore').fit(X)
X_test = [['unknown', 'America', 'IE']]
X_trans = drop_enc.transform(X_test)
X_trans




array([[0., 0., 0., 0., 0., 0., 0.]])

### 10 - Feature binarization

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = Binarizer().fit(X)  # fit does nothing
binarizer

binarizer.transform(X)

array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [None]:
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 0.]])
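Binarization is a plain threshold test: values strictly greater than the threshold become 1, everything else becomes 0. A minimal NumPy sketch of the same operation, with `X_manual` and `X_sklearn` as illustrative names:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
threshold = 1.1

# Manual version: strictly-greater-than comparison, cast to float
X_manual = (X > threshold).astype(float)

X_sklearn = Binarizer(threshold=threshold).fit_transform(X)
print(np.array_equal(X_manual, X_sklearn))
```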

## Missing Values

One type of imputation algorithm is univariate, which imputes values in the i-th feature dimension using only non-missing values in that feature dimension (e.g. SimpleImputer).

By contrast, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. IterativeImputer).

> **By default, the scikit-learn imputers will drop fully empty features, i.e. columns containing only missing values.**

### 1 - Univariate feature imputation

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. This class also allows for different missing values encodings.

**Different strategies for SimpleImputer:**

- `mean`
- `median`
- `most_frequent`
- `constant`: use `fill_value` to set the replacement value
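To compare the strategies side by side, here is a sketch on a small made-up numeric array; the data and the `filled` dictionary are illustrative, and `fill_value=-1.0` is an arbitrary choice for the constant strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Column 0 observes 1, 7, 10; column 1 observes 2, 3, 10
X = np.array([[1., 2.],
              [np.nan, 3.],
              [7., np.nan],
              [10., 10.]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    filled[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# The constant strategy needs an explicit replacement value
filled["constant"] = SimpleImputer(strategy="constant",
                                   fill_value=-1.0).fit_transform(X)

for name, result in filled.items():
    # Show the two imputed cells: row 1 / column 0 and row 2 / column 1
    print(name, result[1, 0], result[2, 1])
```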

In [None]:
# import the imputer:
from sklearn.impute import SimpleImputer

In [None]:
import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]


### 2 - Multivariate feature imputation
A more sophisticated approach is to use the IterativeImputer class, which models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.

In [None]:
# import the imputer:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [None]:
# create an imputer:
imp = IterativeImputer(max_iter=10, random_state=0)

In [None]:
# data:
X = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

In [None]:
# fit and transform the data:
imp.fit_transform(X)

array([[ 1.        ,  2.        ],
       [ 3.        ,  6.        ],
       [ 4.        ,  8.        ],
       [ 1.50004509,  3.        ],
       [ 7.        , 14.00004135]])

In [None]:
X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
print(np.round(imp.transform(X_test)))

[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]


### 3 - Nearest neighbors imputation

If a value is missing in a row, look at similar rows (its nearest neighbors) and use their values to fill in the missing entry.

Similarity is measured with a Euclidean distance that is adjusted to ignore missing values (the `nan_euclidean` metric).

In [None]:
# import the imputer:
from sklearn.impute import KNNImputer

In [None]:
# create an imputer:
imputer = KNNImputer(n_neighbors=2, weights="uniform")

In [None]:
# data:
nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

In [None]:
imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
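The imputed `4.` in the first row can be traced by hand. The sketch below assumes the default `nan_euclidean` metric and uniform weights; `donors`, `nearest`, and `manual_value` are illustrative names. It computes the missing-value-aware distances from row 0 to the rows that have column 2 observed, picks the two closest, and averages their values.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan
X = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]])

# Pairwise distances that skip missing coordinates
dist = nan_euclidean_distances(X)

# Candidate neighbors for row 0: rows where column 2 is observed
donors = [r for r in range(len(X)) if not np.isnan(X[r, 2])]

# Two nearest donors, then a uniform (unweighted) average of their values
nearest = sorted(donors, key=lambda r: dist[0, r])[:2]
manual_value = X[nearest, 2].mean()

imputed = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
print(nearest, manual_value, imputed[0, 2])
```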