# Data Preprocessing

Data preprocessing transforms raw data into a clean representation suitable for modeling. In this lab we demonstrate practical techniques to standardize features, handle missing entries, validate models with cross validation, apply regularization, and compress features with PCA.


### Data Preprocessing Workflow
```text
Raw Data -> [Cleaning] -> [Encoding] -> [Scaling] -> Processed Data
```

In [None]:
# Load Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train[:3]

### Normalization

Many algorithms assume features have comparable scales. Standardization rescales each feature to zero mean and unit variance using

$$z = \frac{x - \mu}{\sigma}.$$

This prevents attributes with large magnitudes from dominating the learning process.


- $z$: standardized value
- $x$: original feature value
- $\mu$: mean of the feature
- $\sigma$: standard deviation of the feature
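
As a quick check of the formula, the following sketch standardizes a small made-up array by hand and compares the result with `StandardScaler`. The array values and the names `X_demo`, `Z_manual`, and `Z_sklearn` are illustrative assumptions, not part of the lab's dataset. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with two features on very different scales (illustrative values)
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

mu = X_demo.mean(axis=0)           # per-feature mean
sigma = X_demo.std(axis=0)         # per-feature population std (ddof=0)
Z_manual = (X_demo - mu) / sigma   # z = (x - mu) / sigma

# StandardScaler applies the same formula column by column
Z_sklearn = StandardScaler().fit_transform(X_demo)
print(np.allclose(Z_manual, Z_sklearn))
```

After standardization both columns are identical, because the second feature is just the first multiplied by 10 and the scale factor cancels.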

### Missing Value Handling

Real-world datasets often contain missing values. Using `SimpleImputer` we can replace them with statistics such as the mean, median, or a constant placeholder. Proper imputation allows models that do not accept NaNs to be trained on imperfect data.

$$\hat{x}_{ij}=\begin{cases}x_{ij}, & x_{ij}\; \text{observed} \\ m_j, & x_{ij}\; \text{missing}\end{cases}$$


- $\hat{x}_{ij}$: imputed value for row $i$, feature $j$
- $x_{ij}$: observed data value
- $m_j$: statistic (e.g., mean) used when $x_{ij}$ is missing
- $i$: row index
- $j$: feature index

In [None]:
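import numpy as np
from sklearn.impute import SimpleImputer

# Simulate missing data: blank out every 40th entry of a copy of the training set
X_missing = X_train.copy()
X_missing.ravel()[::40] = np.nan  # ravel() returns a view here, so X_missing is edited in place

# Replace each NaN with the mean of its column
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X_missing)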
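
To make the piecewise definition concrete, here is a minimal sketch on a hand-written 3×2 array (the values and the names `X_toy`, `imputer`, and `X_filled` are assumptions chosen for illustration). The fitted `statistics_` attribute holds the per-column statistic $m_j$, computed from the observed entries only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny array with one missing entry per column (illustrative values)
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_toy)

# Column means over observed entries: (1 + 3) / 2 = 2 and (2 + 4) / 2 = 3
print(imputer.statistics_)
# Observed entries are unchanged; each NaN is replaced by its column mean
print(X_filled)
```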


### Alternative Scaling

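Min-max scaling linearly maps each feature to the [0, 1] interval. Because the map is linear, it preserves the shape of each feature's distribution, and it is useful when features have known bounds. It does not preserve sparsity in general, since zeros are shifted unless the feature minimum is 0; scikit-learn's `MaxAbsScaler` is the usual choice when sparsity must be kept. The `MinMaxScaler` from scikit-learn performs the min-max transformation.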

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$


- $x'$: scaled value
- $x$: original value
- $x_{\min}$: minimum value of the feature
- $x_{\max}$: maximum value of the feature

In [None]:
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
X_mm = mm.fit_transform(X_train)
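
The sketch below applies the formula by hand to a one-feature toy set and then reuses the fitted scaler on held-out values. The numbers and the names `train_demo` and `test_demo` are assumptions for illustration. It also shows a practical consequence of fitting on training data only: with the default `clip=False`, new values outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_demo = np.array([[1.0], [3.0], [5.0]])   # x_min = 1, x_max = 5
test_demo = np.array([[0.0], [4.0], [6.0]])    # contains values outside [1, 5]

scaler_demo = MinMaxScaler().fit(train_demo)   # learns min and max from training data only
train_mm = scaler_demo.transform(train_demo)
test_mm = scaler_demo.transform(test_demo)

# Same result computed directly from the formula
x_min, x_max = train_demo.min(axis=0), train_demo.max(axis=0)
manual_mm = (train_demo - x_min) / (x_max - x_min)

print(train_mm.ravel())  # 0.0, 0.5, 1.0
print(test_mm.ravel())   # -0.25, 0.75, 1.25
```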


In [None]:
# Standardize features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_scaled = sc.fit_transform(X_train)
X_scaled[:3]

### Cross Validation

Instead of relying on a single train/test split, k-fold cross validation repeatedly partitions the data and averages results to provide a more reliable estimate of generalization performance. The `cross_val_score` function automates this procedure.

$$\mathrm{CV}(f)=\frac{1}{k}\sum_{i=1}^k L\big(f^{(-i)}, D_i\big)$$


- $\mathrm{CV}(f)$: cross-validation estimate of model $f$
- $k$: number of folds
- $L$: loss function
- $f^{(-i)}$: model trained on all folds except $i$
- $D_i$: validation data from fold $i$
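
The formula can be spelled out as an explicit loop. This is a sketch under a few assumptions: it uses the full Iris data rather than the lab's training split, it reports accuracy (so the corresponding loss $L$ would be one minus accuracy), and the names `X_cv`, `y_cv`, and `skf` are illustrative. Passing the same splitter to `cross_val_score` should give the same average.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5)  # k = 5 folds, class proportions kept per fold

fold_scores = []
for train_idx, val_idx in skf.split(X_cv, y_cv):
    # f^(-i): a fresh model trained on every fold except fold i
    model = LogisticRegression(max_iter=1000)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    # Evaluate on the held-out fold D_i
    fold_scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))

manual_cv = np.mean(fold_scores)  # (1/k) * sum over folds
auto_cv = cross_val_score(LogisticRegression(max_iter=1000), X_cv, y_cv, cv=skf).mean()
print(manual_cv, auto_cv)
```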

In [None]:
# Cross validation example
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
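clf = LogisticRegression(max_iter=200)
# Note: X_scaled was standardized on the whole training set, so each validation
# fold has already influenced the scaler; a Pipeline avoids this leakage.
cv_scores = cross_val_score(clf, X_scaled, y_train, cv=5)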
cv_scores
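
Because `X_scaled` was standardized before the folds were created, the scaler's mean and variance were computed with the validation folds included. One way to avoid this, sketched here on the full Iris data with illustrative names (`X_iris`, `y_iris`, `pipe`), is to wrap the scaler and the classifier in a pipeline so that scaling is refit inside each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# The scaler is fit on the training part of each fold only,
# then applied to that fold's validation part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe_scores = cross_val_score(pipe, X_iris, y_iris, cv=5)

print(pipe_scores)
print(pipe_scores.mean())
```

On a small, clean dataset such as Iris the difference from the pre-scaled version is likely to be minor, but the pipeline form keeps the estimate honest when preprocessing has more parameters to learn.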

### Regularization

Regularization discourages overly complex models by adding a penalty term to the loss function. **Ridge** uses an L2 penalty while **Lasso** uses an L1 penalty that can drive coefficients to zero. Tuning the strength of the penalty helps balance bias and variance.

Ridge solves

$$\min_w \sum_i (y_i - w^T x_i)^2 + \alpha \|w\|_2^2,$$

while Lasso replaces the squared norm with $\|w\|_1$.


- $w$: weight vector
- $y_i$: target value for sample $i$
- $x_i$: feature vector for sample $i$
- $\alpha$: regularization strength
- $\|w\|_2^2$: squared L2-norm of $w$

In [None]:
# L2 regularization
from sklearn.linear_model import Ridge
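ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y_train)
# Ridge is a regressor: the class labels 0/1/2 are treated as numeric targets,
# and score() returns R^2 rather than classification accuracy.
ridge.score(sc.transform(X_test), y_test)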
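
To illustrate the difference between the two penalties, the following sketch fits both models to synthetic regression data in which only the first two of six features matter. The data, the true weights `w_true`, and the `alpha` values are assumptions chosen for illustration. Also note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not directly comparable to the one in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_reg = rng.normal(size=(200, 6))
w_true = np.array([4.0, -3.0, 0.0, 0.0, 0.0, 0.0])  # only two informative features
y_reg = X_reg @ w_true + 0.1 * rng.normal(size=200)

ridge_demo = Ridge(alpha=1.0).fit(X_reg, y_reg)   # L2 penalty: shrinks all weights
lasso_demo = Lasso(alpha=0.5).fit(X_reg, y_reg)   # L1 penalty: can set weights exactly to zero

print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Exact zeros (ridge, lasso):",
      int(np.sum(ridge_demo.coef_ == 0)), int(np.sum(lasso_demo.coef_ == 0)))
```

Ridge typically leaves small but nonzero weights on the uninformative features, while Lasso removes them from the model, which is why Lasso is often used for feature selection.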

### Dimension Reduction

Principal Component Analysis (PCA) rotates the feature space to new orthogonal axes ordered by variance. By keeping only the first few components we obtain a compressed representation that often reveals important structure and speeds up downstream algorithms.

The covariance matrix is

$$C=\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T,$$

and projecting onto the first $k$ eigenvectors $W_k$ gives

$$Z = X W_k.$$


- $C$: covariance matrix
- $n$: number of samples
- $x_i$: sample vector $i$
- $\bar{x}$: mean vector of all samples
- $T$: transpose operation

- $Z$: data projected onto principal components
- $X$: centered data matrix
- $W_k$: first $k$ eigenvectors (principal components)

In [None]:
# PCA example
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_pca[:3]
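
The two equations can be checked against scikit-learn with a direct eigendecomposition. This is a sketch with illustrative names (`X_c`, `C`, `W_k`, `Z_eig`) on the full standardized Iris data. scikit-learn normalizes the covariance by $n-1$ rather than $n$, which rescales the eigenvalues but leaves the eigenvectors unchanged. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw, _ = load_iris(return_X_y=True)
X_c = StandardScaler().fit_transform(X_raw)   # standardized, hence centered

n = X_c.shape[0]
C = (X_c.T @ X_c) / n                         # covariance matrix with the 1/n convention

eigvals, eigvecs = np.linalg.eigh(C)          # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # reorder to descending variance
W_k = eigvecs[:, order[:2]]                   # first k = 2 eigenvectors
Z_eig = X_c @ W_k                             # Z = X W_k

Z_sklearn = PCA(n_components=2).fit_transform(X_c)
print(np.allclose(np.abs(Z_eig), np.abs(Z_sklearn), atol=1e-6))
```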

In [None]:
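import matplotlib.pyplot as plt

# Scatter plot of the first two principal components, colored by class label
plt.figure(figsize=(6, 4))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_train, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Projection')
plt.show()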

### Results and Interpretation
The PCA visualization above illustrates how preprocessing condenses four features into two components while preserving class structure. Standardization puts the features on comparable scales, so no single feature dominates the principal components. Imputation and min-max scaling were demonstrated separately and do not feed into this plot.

### Exercises

1. Evaluate `LogisticRegression` on the Iris data with and without standardization. Report accuracy using 5-fold cross validation.
2. Compare `StandardScaler`, `MinMaxScaler`, and `RobustScaler` on a dataset containing outliers. Visualize the scaled feature distributions.
3. Fit a `Ridge` model for several `alpha` values and plot training vs validation scores.
4. Apply PCA to the scaled Iris features and plot the explained variance ratio. Experiment with different numbers of components.


### Hints

- Wrap preprocessing steps and the estimator in a `Pipeline` so that cross validation includes scaling.
- Use `cross_val_score` or `GridSearchCV` for fair comparisons.
- Access `pca.explained_variance_ratio_` to see how much variance each component captures.


### Why Preprocessing Matters

Real-world datasets often contain missing values, inconsistent units, and noisy measurements. Proper preprocessing improves model performance and training stability.

- **Min-max scaling** rescales features to a fixed range $[0, 1]$:

  $$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

- **Standardization** centers features and scales them to unit variance:

  $$z = \frac{x - \mu}{\sigma}$$

### Worked Example: Standardizing Features

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

The resulting array has zero mean and unit variance for each feature, helping gradient-based models converge faster.

### Further Exercises

1. Use `SimpleImputer` to replace missing values in a dataset and compare model performance with and without imputation.
2. Apply `MinMaxScaler` to a dataset with skewed features and plot histograms before and after scaling.
3. Combine preprocessing steps in a `ColumnTransformer` that handles numeric and categorical data.
4. Explore how `RobustScaler` affects models when outliers are present.
5. Implement PCA to reduce the dataset to two components and visualize the result.
6. Create a custom transformer that applies a log transform to skewed features.