# Data Preprocessing

Data preprocessing transforms raw data into a clean representation suitable for modeling. In this lab we demonstrate practical techniques: standardizing features, handling missing entries, validating models with cross-validation, adding regularization, and compressing features with PCA.


In [None]:
# Load the Iris dataset and hold out a test set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train[:3]  # preview the first three training rows

### Normalization

Many algorithms assume features have comparable scales. Standardization rescales each feature to zero mean and unit variance using

$$z = \frac{x - \mu}{\sigma}.$$

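Here $\mu$ and $\sigma$ are the mean and standard deviation of the feature, estimated on the training data. Standardizing prevents attributes with large magnitudes from dominating the learning process.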
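
To make the formula concrete, the following minimal sketch applies it by hand with NumPy and compares the result with `StandardScaler`. The array `X_toy` and its values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Apply z = (x - mu) / sigma column by column
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)  # population standard deviation (ddof=0)
Z_manual = (X_toy - mu) / sigma

# StandardScaler should give the same values
Z_sklearn = StandardScaler().fit_transform(X_toy)
print(np.allclose(Z_manual, Z_sklearn))
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.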


### Missing Value Handling

Real-world datasets often contain missing values. Using `SimpleImputer` we can replace them with statistics such as the mean, median, or a constant placeholder. Proper imputation allows models that do not accept NaNs to be trained on imperfect data.

$$\hat{x}_{ij}=\begin{cases}x_{ij}, & x_{ij}\; \text{observed} \\ m_j, & x_{ij}\; \text{missing}\end{cases}$$


In [None]:
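import numpy as np
from sklearn.impute import SimpleImputer

# Simulate missing data: blank out every 40th entry in a copy of the training set
X_missing = X_train.copy()
X_missing.flat[::40] = np.nan

# Replace each NaN with the mean of its column, computed from the observed values
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X_missing)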
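
As a small check on what the imputer learns, the sketch below fits `SimpleImputer` on a toy array and inspects the fill values $m_j$ stored in `statistics_`. The arrays `X_tr` and `X_te` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training and test data (hypothetical values) with missing entries
X_tr = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan]])
X_te = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy='mean')
X_tr_filled = imputer.fit_transform(X_tr)

# The learned fill values m_j are the column means over the observed entries
print(imputer.statistics_)

# Missing test entries are filled with the training means, not test statistics
X_te_filled = imputer.transform(X_te)
print(X_te_filled)
```

Fitting on the training data and only calling `transform` on new data keeps information from the test set out of the fill values.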


### Alternative Scaling

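Min–max scaling linearly maps each feature to the [0, 1] interval. Because the map is linear, it preserves the shape of each feature's distribution, and it is useful when features have known bounds. It is sensitive to outliers, since a single extreme value determines $x_{\min}$ or $x_{\max}$. Zero entries stay zero only when a feature's minimum is zero, so for sparse data scikit-learn's `MaxAbsScaler` is usually the better fit. The `MinMaxScaler` from scikit-learn performs the min–max transformation.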

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
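
One consequence of the formula is that the [0, 1] range is only guaranteed for the data used to estimate $x_{\min}$ and $x_{\max}$. The sketch below uses made-up one-column arrays (`X_fit`, `X_new`) to show that unseen values outside the fitted range map outside [0, 1] under the default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the scaler learns x_min = 0 and x_max = 10
X_fit = np.array([[0.0], [5.0], [10.0]])
# New values that fall outside the fitted range
X_new = np.array([[12.0], [-2.0]])

scaler = MinMaxScaler().fit(X_fit)

fit_scaled = scaler.transform(X_fit)  # stays inside [0, 1]
new_scaled = scaler.transform(X_new)  # can leave [0, 1]

print(fit_scaled.ravel())
print(new_scaled.ravel())
```

If values outside the interval are a problem for a downstream model, they need to be handled explicitly, for example by clipping.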


In [None]:
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
X_mm = mm.fit_transform(X_train)


In [None]:
# Standardize features: fit the scaler on the training set only
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_scaled = sc.fit_transform(X_train)
X_scaled[:3]  # preview the first three standardized rows

### Cross Validation

Instead of relying on a single train/test split, k-fold cross validation repeatedly partitions the data and averages results to provide a more reliable estimate of generalization performance. The `cross_val_score` function automates this procedure.

$$\mathrm{CV}(f)=\frac{1}{k}\sum_{i=1}^k L\big(f^{(-i)}, D_i\big)$$
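
The formula can be spelled out as an explicit loop. The following sketch is one way to do it, assuming `KFold` for the partition and using accuracy as the per-fold score (so higher is better, unlike a loss $L$). The names `X_all`, `y_all`, and `fold_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X_all, y_all = load_iris(return_X_y=True)

# Partition the data into k = 5 folds D_1, ..., D_5
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X_all):
    # f^(-i): a model trained on every fold except fold i
    model = LogisticRegression(max_iter=1000)
    model.fit(X_all[train_idx], y_all[train_idx])
    # Evaluate on the held-out fold D_i
    fold_scores.append(model.score(X_all[val_idx], y_all[val_idx]))

# The CV estimate is the average over the k folds
cv_estimate = np.mean(fold_scores)
print(fold_scores, cv_estimate)
```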


In [None]:
# Cross validation example: 5-fold accuracy of logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression(max_iter=200)
cv_scores = cross_val_score(clf, X_scaled, y_train, cv=5)
cv_scores  # one accuracy value per fold
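
In the cell above, `X_scaled` was standardized using all training rows before the folds were formed, so each validation fold contributed to the scaler's mean and variance. On Iris the effect is probably small, but a cleaner setup refits the scaler inside every fold. A minimal sketch, assuming `make_pipeline` and repeating the data loading so the block is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline refits the scaler on the training part of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# Pass the unscaled training data; scaling happens inside cross validation
pipe_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(pipe_scores.mean())
```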

### Regularization

Regularization discourages overly complex models by adding a penalty term to the loss function. **Ridge** uses an L2 penalty while **Lasso** uses an L1 penalty that can drive coefficients to zero. Tuning the strength of the penalty helps balance bias and variance.

Ridge solves

$$\min_w \sum_i (y_i - w^T x_i)^2 + \alpha \|w\|_2^2,$$

while Lasso replaces the squared norm with $\|w\|_1$.
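
The difference between the two penalties is easiest to see on data where some features are irrelevant. The sketch below uses synthetic data, generated here purely for illustration, in which only the first two of five features affect the target. The `alpha` values are arbitrary choices, and scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not directly comparable to the one in `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem (hypothetical): only features 0 and 1 matter
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + 0.1 * rng.normal(size=200)

ridge_syn = Ridge(alpha=1.0).fit(X_syn, y_syn)
lasso_syn = Lasso(alpha=0.5).fit(X_syn, y_syn)

# Ridge shrinks coefficients but leaves them nonzero;
# Lasso can set the coefficients of irrelevant features exactly to zero
print("Ridge:", np.round(ridge_syn.coef_, 3))
print("Lasso:", np.round(lasso_syn.coef_, 3))
print("Lasso zero coefficients:", int(np.sum(lasso_syn.coef_ == 0)))
```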


In [None]:
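# L2 regularization
from sklearn.linear_model import Ridge

# Ridge is a regressor, so the class labels 0, 1, 2 are treated as numeric targets
ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y_train)

# score() returns R^2, not accuracy; the test set is scaled with the training statistics
ridge.score(sc.transform(X_test), y_test)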

### Dimension Reduction

Principal Component Analysis (PCA) rotates the feature space to new orthogonal axes ordered by variance. By keeping only the first few components we obtain a compressed representation that often reveals important structure and speeds up downstream algorithms.

The covariance matrix is

$$C=\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T,$$

and projecting the centered data $X$ onto the $k$ eigenvectors of $C$ with the largest eigenvalues, collected as the columns of $W_k$, gives

$$Z = X W_k.$$
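
These two formulas can be checked numerically. The sketch below builds $C$ and $W_k$ with NumPy and compares the projection with scikit-learn's `PCA`. It runs on the raw Iris features to stay self-contained, and names such as `X_centered` and `Z_manual` are introduced here. scikit-learn normalizes by $n-1$ rather than $n$ when reporting variances, which rescales the eigenvalues but leaves the eigenvectors unchanged.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris, _ = load_iris(return_X_y=True)
n = X_iris.shape[0]

# Center the data and form C = (1/n) * sum (x_i - mean)(x_i - mean)^T
X_centered = X_iris - X_iris.mean(axis=0)
C = X_centered.T @ X_centered / n

# eigh returns eigenvalues in ascending order, so sort them descending
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
W_k = eigvecs[:, order[:2]]  # top k = 2 eigenvectors as columns

# Z = X W_k, with X the centered data
Z_manual = X_centered @ W_k
Z_sklearn = PCA(n_components=2).fit_transform(X_iris)

# Eigenvectors are defined only up to sign, so compare absolute values
print(np.allclose(np.abs(Z_manual), np.abs(Z_sklearn)))
```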


In [None]:
# PCA example: keep the two directions of largest variance
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_pca[:3]  # preview the first three projected rows

In [None]:
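# Scatter plot of the first two principal components, colored by class
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_train, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Projection')
plt.show()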

### Exercises

1. Evaluate `LogisticRegression` on the Iris data with and without standardization. Report accuracy using 5-fold cross validation.
2. Compare `StandardScaler`, `MinMaxScaler`, and `RobustScaler` on a dataset containing outliers. Visualize the scaled feature distributions.
3. Fit a `Ridge` model for several `alpha` values and plot training vs validation scores.
4. Apply PCA to the scaled Iris features and plot the explained variance ratio. Experiment with different numbers of components.


### Hints

- Wrap preprocessing steps and the estimator in a `Pipeline` so that cross validation includes scaling.
- Use `cross_val_score` or `GridSearchCV` for fair comparisons.
- Access `pca.explained_variance_ratio_` to see how much variance each component captures.
