<a href="https://colab.research.google.com/github/ReyhaneTaj/ML_Algorithms/blob/main/Feature_Scaling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Feature Scaling with Standardization and Normalization

When working with data that contains features with different units or scales (e.g., age in years, income in dollars), some machine learning algorithms can perform poorly if these features are not scaled. This is because many algorithms (like KNN, SVM, and gradient descent-based models) are sensitive to the magnitude of the features.
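To make the sensitivity to magnitude concrete, here is a minimal sketch using two made-up records (the ages and incomes are hypothetical values chosen only for illustration). It computes the Euclidean distance that a distance-based model such as KNN would use and shows how much of it comes from each feature.

```python
import numpy as np

# Two hypothetical people described by (age in years, income in dollars).
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

# Per-feature absolute differences: 35 years vs. 1000 dollars.
diff = np.abs(a - b)

# Euclidean distance between the two unscaled records.
distance = np.linalg.norm(a - b)

# Share of the squared distance contributed by each feature.
contribution = diff**2 / np.sum(diff**2)

print("Distance:", distance)
print("Contribution (age, income):", contribution)
```

Even though the age gap is large in human terms, the income column accounts for nearly all of the distance simply because its numbers are bigger.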

### 1. Standardization (Z-score normalization)

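Standardization rescales each feature so that it has a mean of 0 and a standard deviation of 1. This is useful when features have different units, and it is often applied before algorithms that assume roughly Gaussian, zero-centered inputs. Note that standardization only shifts and rescales a feature; it does not change the shape of its distribution.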

**Formula:**

$$
\text{Standardized value} = \frac{X - \mu}{\sigma}
$$

Where:
- $X$ is the original value.
- $\mu$ is the mean of the feature.
- $\sigma$ is the standard deviation of the feature.
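The formula above can be checked by hand with NumPy. The sketch below uses a small illustrative array and compares the manual result with `StandardScaler`; it relies on the fact that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative dataset: two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Column-wise mean and population standard deviation (ddof=0).
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Apply (X - mu) / sigma to every column.
X_manual = (X - mu) / sigma

# Same computation via scikit-learn.
X_sklearn = StandardScaler().fit_transform(X)

print("Manual:\n", X_manual)
print("Matches StandardScaler:", np.allclose(X_manual, X_sklearn))
```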

### 2. Normalization (Min-Max Scaling)

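Normalization rescales the feature values to a fixed range, usually [0, 1]. This is useful when you need the data to be bounded within a certain range. Because the result depends only on the minimum and maximum, a single extreme value can compress all other values into a narrow part of that range.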

**Formula:**

$$
\text{Normalized value} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
$$

Where:
- $X$ is the original value.
- $X_{\text{min}}$ is the minimum value of the feature.
- $X_{\text{max}}$ is the maximum value of the feature.
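As with standardization, the min-max formula can be reproduced directly in NumPy. This is a minimal sketch on an illustrative array, compared against `MinMaxScaler` with its default range of [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small illustrative dataset: two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Column-wise minimum and maximum.
x_min = X.min(axis=0)
x_max = X.max(axis=0)

# Apply (X - X_min) / (X_max - X_min) to every column.
X_manual = (X - x_min) / (x_max - x_min)

# Same computation via scikit-learn (default feature_range is (0, 1)).
X_sklearn = MinMaxScaler().fit_transform(X)

print("Manual:\n", X_manual)
print("Matches MinMaxScaler:", np.allclose(X_manual, X_sklearn))
```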

### Practical Example in Python


```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Example data
data = np.array([[1, 200], [2, 300], [3, 400]])

# Standardization
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print("Standardized Data:\n", standardized_data)

# Normalization
min_max_scaler = MinMaxScaler()
normalized_data = min_max_scaler.fit_transform(data)
print("Normalized Data:\n", normalized_data)
```


## Data Scaling and Data Leakage

Data scaling can cause data leakage if the scaler is fitted on data that includes the test set. Data leakage occurs when information from outside the training dataset is used to create the model, so the model appears to perform well on the test data but fails in real-world scenarios.

### How Data Scaling Can Cause Data Leakage

#### Improper Scaling Across Entire Dataset

If you scale your entire dataset before splitting it into training and test sets, the scaler incorporates information from the test data. The scaling parameters (mean, standard deviation, min, max, etc.) are then influenced by the test data, allowing the model to "see" the test data indirectly. The test set no longer behaves like unseen data, so the measured performance tends to be optimistically biased.

#### Example

Suppose you have a dataset that you want to standardize. If you calculate the mean and standard deviation of the entire dataset (including both the training and test sets) and then scale the data, the values the model trains on already depend on the test rows. The model has therefore been exposed to some information about the test set during training, leading to overly optimistic performance estimates.
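The effect described above can be seen by comparing the parameters a scaler learns in each case. This is a small sketch with made-up numbers: the test row is deliberately extreme so that the difference is easy to spot.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data: four training rows and one extreme test row.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[100.0]])

# Leaky: the scaler is fitted on training and test rows together.
leaky = StandardScaler().fit(np.vstack([X_train, X_test]))

# Clean: the scaler is fitted on the training rows only.
clean = StandardScaler().fit(X_train)

print("Mean fitted on all data:   ", leaky.mean_)
print("Mean fitted on train only: ", clean.mean_)

# The training features the model would see differ between the two cases.
print("Leaky scaled train:\n", leaky.transform(X_train))
print("Clean scaled train:\n", clean.transform(X_train))
```

The leaky scaler's mean and standard deviation are pulled toward the test row, so the scaled training values change even though the training data itself did not.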

### Correct Way to Scale Data

#### Fit on Training Data Only

Fit the scaler (whether it’s a StandardScaler, MinMaxScaler, or another method) only on the training data. This ensures that the scaling parameters are derived solely from the training data, preventing any information from the test data from leaking into the training process.

#### Apply Scaling Separately

After fitting the scaler on the training data, apply the same transformation to both the training and test data. This means that the test data is scaled using the parameters learned from the training data only.
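One common way to enforce the fit-on-train-only rule automatically is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below goes slightly beyond the text: it assumes a `LogisticRegression` classifier and synthetic data (both chosen only for illustration) and uses `cross_val_score`, which refits the whole pipeline, scaler included, on the training portion of each fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): two features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 1000.0 > 0).astype(int)

# The pipeline fits the scaler and then the classifier in one step.
model = make_pipeline(StandardScaler(), LogisticRegression())

# In each fold, the scaler is fitted on that fold's training rows only
# and then applied to the held-out rows.
scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", scores)
```

With this arrangement there is no separate scaling step that could accidentally be run on the full dataset.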


### Practical Example in Python

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# Example data
data = np.array([[1, 200], [2, 300], [3, 400], [4, 500], [5, 600]])
labels = np.array([0, 1, 0, 1, 0])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

# Initialize the scaler and fit on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same transformation to the test data
X_test_scaled = scaler.transform(X_test)

# Now X_train_scaled and X_test_scaled are properly scaled without any data leakage.
```


## Key Takeaways

- **Always** fit the scaler on the training data only.
- **Never** include test data when fitting any preprocessing step; doing so causes data leakage.
- Apply the fitted transformation consistently to both training and test sets.

By following these steps, you can prevent data leakage and ensure that your model's performance estimates are realistic.
