# 1. Feature Scaling

Feature scaling is an important preprocessing step in linear regression and other machine learning algorithms. It puts features on comparable numeric ranges, so that no feature influences the fit merely because of the units it is measured in, and it helps gradient-based optimization converge efficiently. Scaling is particularly important for regularized models such as Ridge and Lasso regression, whose penalties depend on the scale of the input features.

## 1.1. Why Feature Scaling Matters

1. **Effect on Coefficient Estimates**

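    - In linear regression, the magnitude of each estimated coefficient depends on the units of its feature. Rescaling a feature by a factor $c$ rescales its ordinary least squares coefficient by $1/c$, while the predictions stay the same. Raw coefficient magnitudes therefore reflect units as much as influence, which can lead to misleading interpretations of the model.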

2. **Optimization Efficiency**

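    - Linear regression is often fit with gradient-based optimization, especially on large datasets (ordinary least squares also has a closed-form solution). When features have very different scales, the loss surface is elongated, so gradient descent needs a small learning rate and may converge slowly or become unstable. Scaling gives a better-conditioned problem and a smoother gradient descent path.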

3. **Comparability of Features**

    - Scaling allows for fair comparison between coefficients in the model, making it easier to interpret their relative importance.
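
The optimization point can be illustrated with a minimal sketch that runs plain batch gradient descent on the same toy problem twice, once on raw features and once on standardized ones. The data, the feature names (`rooms`, `area`, `price`), the helper functions, and the hand-picked learning rates are all assumptions made for illustration.

```python
import statistics


def standardize(column):
    """Rescale one feature column to mean 0 and standard deviation 1."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]


def gradient_descent(rows, targets, lr, tol=1e-6, max_iter=20_000):
    """Minimise mean squared error for targets ~ w0 + w . row.

    Returns (weights, iterations). Stops once every gradient component is
    below tol in absolute value, or gives up after max_iter iterations.
    """
    n, p = len(rows), len(rows[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for iteration in range(1, max_iter + 1):
        grad = [0.0] * (p + 1)
        for row, target in zip(rows, targets):
            err = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)) - target
            grad[0] += 2 * err / n
            for j, xj in enumerate(row, start=1):
                grad[j] += 2 * err * xj / n
        if max(abs(g) for g in grad) < tol:
            return w, iteration
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w, max_iter


# Hypothetical data: number of rooms and floor area in square feet.
rooms = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
area = [1500.0, 900.0, 2100.0, 1200.0, 2600.0, 1800.0]
price = [10 * r + 0.05 * a + 5 for r, a in zip(rooms, area)]

raw_rows = list(zip(rooms, area))
scaled_rows = list(zip(standardize(rooms), standardize(area)))

# Learning rates picked by hand for this data: the raw run diverges with a
# step much larger than 1e-7, while the standardized run is stable at 0.1.
_, raw_iters = gradient_descent(raw_rows, price, lr=1e-7)
_, scaled_iters = gradient_descent(scaled_rows, price, lr=0.1)

print(f"raw features:          {raw_iters} iterations")
print(f"standardized features: {scaled_iters} iterations")
```

On this toy data the standardized run reaches the tolerance well before the iteration cap, while the raw run exhausts its budget: the large `area` values force a tiny learning rate, which then makes progress along the `rooms` and intercept directions extremely slow.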

## 1.2. Importance of Scaling in Regularization

Regularization techniques like Ridge and Lasso introduce penalties based on the magnitude of coefficients. The scale of the features can significantly impact the effectiveness of these penalties:

1. **Ridge Regression (L2 Regularization)**

    - Ridge adds a penalty proportional to the sum of the squared coefficients. A feature measured on a small scale needs a large coefficient to have the same effect on the prediction, so it is shrunk more heavily than an equally informative feature measured on a large scale, leading to suboptimal coefficient estimates.

2. **Lasso Regression (L1 Regularization)**

    - Lasso adds a penalty proportional to the sum of the absolute values of the coefficients. Like Ridge, Lasso penalizes features unevenly depending on their scale, potentially setting the coefficients of important features with small scales to zero.
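
The effect of units on both penalties can be sketched with a single feature, for which Ridge and Lasso have simple closed forms. The sketch assumes the objectives $\sum (y - \beta x)^2 + \lambda \beta^2$ (Ridge) and $\sum (y - \beta x)^2 + \lambda |\beta|$ (Lasso) on centered data; the dose data, the penalty strengths, and the helper names are hypothetical.

```python
import math
import statistics


def center(values):
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def standardize(values):
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def _sums(x, y):
    """Return (sum of x*y, sum of x*x) after centering both variables."""
    xc, yc = center(x), center(y)
    return sum(a * b for a, b in zip(xc, yc)), sum(a * a for a in xc)


def ridge_slope(x, y, lam):
    """Single-feature ridge slope: sxy / (sxx + lam)."""
    sxy, sxx = _sums(x, y)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Single-feature lasso slope: soft-threshold sxy at lam / 2."""
    sxy, sxx = _sums(x, y)
    magnitude = max(abs(sxy) - lam / 2, 0.0) / sxx
    return math.copysign(magnitude, sxy)


def ridge_shrinkage(x, y, lam):
    """Ridge slope divided by the unpenalised slope (1.0 = no shrinkage)."""
    return ridge_slope(x, y, lam) / ridge_slope(x, y, 0.0)


# Hypothetical data: the same dose expressed in milligrams and in grams.
dose_mg = [120.0, 250.0, 310.0, 480.0, 600.0]
dose_g = [d / 1000 for d in dose_mg]
response = [14.0, 27.0, 30.0, 50.0, 61.0]

ridge_lam, lasso_lam = 1.0, 60.0  # illustrative penalty strengths

for name, x in [("mg", dose_mg), ("g", dose_g),
                ("standardized mg", standardize(dose_mg)),
                ("standardized g", standardize(dose_g))]:
    print(f"{name:>16}: ridge keeps {ridge_shrinkage(x, response, ridge_lam):.4f} "
          f"of the slope, lasso slope = {lasso_slope(x, response, lasso_lam):.4f}")
```

With the dose in milligrams the Ridge penalty barely changes the slope, while in grams the same penalty removes most of it and the Lasso penalty sets it to zero. After standardization both unit choices give the same penalized fit, so the result no longer depends on an arbitrary choice of units.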

## 1.3. Types of Feature Scaling

1. **Standardization (Z-score Normalization)**

    - Transforms the data to have a mean of 0 and a standard deviation of 1.
    - **Formula:**

    $$
    z = \frac{x - \mu}{\sigma}
    $$

    Where:
    - $x$ is the feature value.
    - $\mu$ is the mean of the feature.
    - $\sigma$ is the standard deviation of the feature.

2. **Min-Max Scaling**

    - Transforms the data to a fixed range, usually $[0, 1]$.
    - **Formula:**

    $$
    x' = \frac{x - \min(x)}{\max(x) - \min(x)}
    $$

    Where:
    - $x'$ is the scaled feature value.
    - $\min(x)$ and $\max(x)$ are the smallest and largest values of the feature.

3. **Robust Scaling**

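    - Centers the data on the median and divides by the interquartile range, making it robust to outliers.
    - **Formula:**

    $$
    x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}
    $$

    Where:
    - $\text{median}(x)$ is the median of the feature.
    - $\text{IQR}(x)$ is the interquartile range of the feature, the difference between its 75th and 25th percentiles.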
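
The three formulas above can be sketched in plain Python as follows. The function names and the `incomes` data are made up for illustration. The use of the population standard deviation and of the `"inclusive"` quantile method are assumptions, since the formulas do not fix either choice.

```python
import statistics


def standardize(values):
    """z = (x - mean) / std, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """x' = (x - min) / (max - min), mapping the feature onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """x' = (x - median) / IQR, with IQR = Q3 - Q1."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical feature with one outlier (the last value).
# Note: a constant feature would divide by zero in all three functions;
# this sketch does not handle that case.
incomes = [30.0, 35.0, 40.0, 45.0, 50.0, 400.0]

standardized = standardize(incomes)
min_maxed = min_max_scale(incomes)
robust = robust_scale(incomes)

for name, scaled in [("standardized", standardized),
                     ("min-max", min_maxed),
                     ("robust", robust)]:
    print(f"{name:>12}: {[round(v, 3) for v in scaled]}")
```

On this data the outlier compresses the five typical values into a narrow band near 0 under min-max scaling, whereas robust scaling keeps them spread between about -1 and 1 and leaves the outlier as a clearly separated large value.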
