In [None]:
import numpy as np
import matplotlib.pyplot as plt

# **Preprocessing and Scaling:**

### Introduction:
- **Purpose:** Preprocessing is a crucial step in preparing data for machine learning tasks; this section introduces why it matters and how scaling fits into it.

### Scaling Methods:
- **Objective:** Scaling methods aim to standardize or normalize the features of the dataset.
- **Unsupervised Nature:** Scaling methods are considered unsupervised because they do not utilize information about the target variable.

### Importance of Scaling:
- **Motivation:** Scaling becomes essential when features in the dataset have different scales or units.
- **Effects on Algorithms:** Scaling keeps a feature from having a disproportionately high influence on the model merely because its values are numerically larger than those of other features.
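
To make the effect of scale concrete, here is a minimal sketch that compares Euclidean distances before and after standardizing. The data (an age column and an income column) and the helper `euclidean` are made up for illustration and are not part of the original notes.

```python
import numpy as np

# Hypothetical samples: column 0 is age in years, column 1 is income in dollars.
X = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],
    [26.0, 52_000.0],
    [45.0, 48_000.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw features: income differences are numerically much larger than age
# differences, so they decide which samples look "close".
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# Standardize each column (mean 0, standard deviation 1), then compare again.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = euclidean(X_std[0], X_std[1])
d_std_02 = euclidean(X_std[0], X_std[2])

print("raw:   ", d_raw_01, d_raw_02)
print("scaled:", d_std_01, d_std_02)
```

On the raw data, sample 0 is nearer to sample 1 even though their ages differ by 35 years; after standardizing, sample 2 (one year apart in age) becomes the nearer one, because both columns now contribute on a comparable scale.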

### Types of Scaling Methods:
1. **Standardization (Z-score normalization):**
   - **Description:** Scales features to have a mean of 0 and a standard deviation of 1.
   - **Formula:** `z = (x - mean) / standard deviation`

2. **MinMax Scaling:**
   - **Description:** Scales features to a specific range, often between 0 and 1.
   - **Formula:** `x_scaled = (x - min) / (max - min)`

3. **Robust Scaling:**
   - **Description:** Scales features based on median and interquartile range, making it robust to outliers.
   - **Formula:** `x_scaled = (x - median) / (Q3 - Q1)`
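
The three formulas can be sketched directly in NumPy. This is an illustration rather than scikit-learn's implementation: the function names and the sample data are assumptions, robust scaling is centered on the median as in the description above, and the sketch does not handle constant columns (which would divide by zero).

```python
import numpy as np

def standardize(X):
    """z = (x - mean) / standard deviation, computed per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """x_scaled = (x - min) / (max - min), computed per column."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def robust_scale(X):
    """x_scaled = (x - median) / (Q3 - Q1), computed per column."""
    q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)
    return (X - median) / (q3 - q1)

# Hypothetical single-feature data with one outlier (100.0).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(standardize(X).ravel())
print(min_max_scale(X).ravel())
print(robust_scale(X).ravel())
```

With the outlier present, min-max scaling squeezes the four ordinary values into a narrow band near 0, while robust scaling keeps them evenly spread around 0 and pushes only the outlier far away.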

### Preprocessing as an Unsupervised Task:
- **Definition:** Preprocessing, including scaling, is considered an unsupervised task because it does not rely on information about the target variable.
- **Motivation:** Because the transformation is computed from the features alone, the same fitted transformation can be applied consistently to every sample in the dataset.
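
One way to read "consistent" is that the scaling parameters are learned once and then reused for all data. The sketch below assumes a simple train/test split, which the notes do not mention explicitly; the arrays `X` and `y` are hypothetical. Note that `fit` is given only the features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 6 samples, 2 features, plus labels y the scaler never sees.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
              [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# Assumed split: first four samples for training, last two held out.
X_train, X_test = X[:4], X[4:]

scaler = MinMaxScaler()
scaler.fit(X_train)  # only features are passed; y plays no role

X_train_scaled = scaler.transform(X_train)
# The held-out samples are shifted and scaled with the training min and max.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled)
```

The held-out values land outside the range 0 to 1 here, which is expected: they were transformed with the parameters learned from the training samples rather than with their own minimum and maximum.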

### Summary:
The section emphasizes the importance of preprocessing and scaling as essential steps in preparing data for machine learning. Scaling methods, including standardization, MinMax scaling, and robust scaling, are introduced with their respective formulas and purposes. The section concludes by highlighting that preprocessing, including scaling, is fundamentally an unsupervised task, ensuring consistency across the entire dataset.


Example:

In [None]:
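import numpy as np
from sklearn.preprocessing import StandardScaler

# small illustrative feature matrix (assumed here so the cell runs on its own)
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()
# calling fit and transform in sequence (using method chaining)
X_scaled = scaler.fit(X).transform(X)
# same result, but more efficient computation
X_scaled_d = scaler.fit_transform(X)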
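
As a quick sanity check, the following self-contained sketch confirms that the two call styles agree and that each column ends up with mean 0 and standard deviation 1. The matrix `X` is a hypothetical stand-in for whatever data the notebook uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix; any numeric 2-D array works the same way.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 900.0]])

# Style 1: fit, then transform (method chaining).
X_chained = StandardScaler().fit(X).transform(X)
# Style 2: fit and transform in a single call.
X_single = StandardScaler().fit_transform(X)

print(np.allclose(X_chained, X_single))
print(X_single.mean(axis=0), X_single.std(axis=0))
```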