
# Standardization and Normalization

This note covers two important data scaling techniques: **Standardization** and **Normalization**. These methods are used to adjust the range and distribution of data features, which is crucial for many machine learning algorithms.

---

## 1. Introduction

- **Standardization** rescales each feature to have a mean of 0 and a standard deviation of 1. It works well when a feature is approximately normally distributed, although it does not require normality and does not change the shape of the distribution.
- **Normalization** scales data to a fixed range, usually [0, 1]. It is often used when the distribution is unknown or not Gaussian.

---

## 2. Standardization | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2008%20Standardization)

### 2.1. What is Standardization?

Standardization (or Z-score normalization) transforms your data so that it has:
- A mean (μ) of 0
- A standard deviation (σ) of 1

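This puts every feature on a comparable scale, so a feature with a large numeric range (for example, one measured in thousands) does not dominate the model simply because of its units.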

### 2.2. Step-by-Step Process for Standardization

**Step 1:** **Calculate the Mean**  
Compute the mean of the dataset:
$$
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i
$$

**Step 2:** **Calculate the Standard Deviation**  
Determine the standard deviation:
$$
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}
$$

**Step 3:** **Transform the Data**  
Standardize each data point using:
$$
z = \frac{x - \mu}{\sigma}
$$
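
As a worked example of the three steps, take the illustrative values 10, 20, 30, 40, 50 (chosen for easy arithmetic, not drawn from a real dataset):
$$
\mu = \frac{10 + 20 + 30 + 40 + 50}{5} = 30, \qquad
\sigma = \sqrt{\frac{400 + 100 + 0 + 100 + 400}{5}} = \sqrt{200} \approx 14.14
$$
For the first value this gives:
$$
z = \frac{10 - 30}{14.14} \approx -1.41
$$
Applying the same formula to every value yields approximately -1.41, -0.71, 0, 0.71, 1.41.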

### 2.3. Python Example with NumPy

```python
import numpy as np

# Sample data
data = np.array([10, 20, 30, 40, 50])

# Step 1: Calculate mean
mean = np.mean(data)
print("Mean:", mean)

# Step 2: Calculate standard deviation
# (np.std defaults to the population form, ddof=0, matching the 1/n formula above)
std = np.std(data)
print("Standard Deviation:", std)

# Step 3: Standardize the data
standardized_data = (data - mean) / std
print("Standardized Data:", standardized_data)
```
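
A quick sanity check can confirm that the transformed values really have mean 0 and standard deviation 1. The sketch below also contrasts the population standard deviation (divide by n, as in the formula above) with the sample standard deviation (divide by n - 1), since mixing the two is a common source of small mismatches. The variable names are illustrative.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50], dtype=float)

# Population standard deviation (ddof=0) matches the 1/n formula above
z = (data - data.mean()) / data.std()

z_mean = z.mean()
z_std = z.std()
print("Mean of z:", round(z_mean, 10))
print("Std of z:", round(z_std, 10))

# Sample standard deviation (ddof=1) divides by n - 1, so it is slightly larger
population_std = data.std()
sample_std = data.std(ddof=1)
print("Population std:", population_std)
print("Sample std:", sample_std)
```

If the sample standard deviation is used in the denominator instead, the standardized values will have a population standard deviation slightly below 1.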

### 2.4. Using Scikit-Learn's StandardScaler

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Data needs to be a 2D array (e.g., one feature per column)
data = np.array([[10], [20], [30], [40], [50]])

scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print("Standardized Data using StandardScaler:")
print(standardized_data)
```
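
After fitting, the scaler keeps the statistics it learned, which makes it possible to inspect them or undo the transformation. The following is a minimal sketch using the `mean_` and `scale_` attributes and the `inverse_transform` method of `StandardScaler`; the data is the same illustrative column as above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

scaler = StandardScaler()
standardized = scaler.fit_transform(data)

# Parameters learned during fit (one entry per feature/column)
print("Learned mean:", scaler.mean_)
print("Learned std:", scaler.scale_)

# inverse_transform maps standardized values back to the original units
restored = scaler.inverse_transform(standardized)
print("Restored data:", restored.ravel())
```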

---

## 3. Normalization | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2009%20Normalization)

### 3.1. What is Normalization?

Normalization rescales the data to a fixed range, most commonly [0, 1]. This is useful when your data does not follow a Gaussian distribution or when you want to bound all features within the same range.

### 3.2. Step-by-Step Process for Normalization

**Step 1:** **Identify the Minimum and Maximum Values**  
Find the minimum and maximum values in the dataset:
$$
x_{\min} = \min(x_i), \quad x_{\max} = \max(x_i)
$$

**Step 2:** **Rescale the Data**  
Normalize each data point with:
$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$
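
As a worked example, take the illustrative values 10, 20, 30, 40, 50 (chosen for easy arithmetic). The minimum is 10 and the maximum is 50, so the value 20 maps to:
$$
x' = \frac{20 - 10}{50 - 10} = 0.25
$$
Applying the same formula to every value yields 0, 0.25, 0.5, 0.75, 1.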

### 3.3. Python Example with NumPy

```python
import numpy as np

# Sample data
data = np.array([10, 20, 30, 40, 50])

# Step 1: Calculate minimum and maximum
data_min = np.min(data)
data_max = np.max(data)
print("Minimum:", data_min)
print("Maximum:", data_max)

# Step 2: Normalize the data
normalized_data = (data - data_min) / (data_max - data_min)
print("Normalized Data:", normalized_data)
```

### 3.4. Using Scikit-Learn's MinMaxScaler

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Data needs to be a 2D array
data = np.array([[10], [20], [30], [40], [50]])

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print("Normalized Data using MinMaxScaler:")
print(normalized_data)
```
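
The range [0, 1] is only the default. As a hedged sketch, `MinMaxScaler` accepts a `feature_range` argument for other target intervals; the example below rescales the same illustrative column to [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# feature_range sets the target interval (a, b); the default is (0, 1)
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)
print("Scaled to [-1, 1]:", scaled.ravel())
```

The minimum of the column maps to the lower bound and the maximum to the upper bound, with everything else spaced linearly in between.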

---

## 4. When to Use Each Method

- **Standardization** is ideal when:
  - Features are approximately Gaussian (normally distributed).
  - The algorithm is sensitive to feature scale or expects centered inputs (e.g., PCA, or logistic regression trained with regularization or gradient descent).

- **Normalization** is preferred when:
  - You need to bound the data within a specific range.
  - The data does not follow a normal distribution, or the model expects inputs in a small bounded range (e.g., neural networks often train better on inputs scaled to [0, 1]).

---

## 5. Practical Considerations

- **Training vs. Testing:** Always compute scaling parameters (mean, standard deviation, min, max) on the training set and apply the same transformation to the test set.
- **Feature-wise Scaling:** When working with multi-dimensional data, apply scaling to each feature independently.
- **Impact on Models:** Scaling often improves algorithms that rely on distances or gradient-based optimization (e.g., k-nearest neighbors, SVMs, neural networks), because features with large numeric ranges no longer dominate the others.
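
The first two points can be sketched together. The example below is a minimal illustration, assuming a made-up two-feature dataset (the "height" and "income" columns and their distributions are hypothetical): the scaler is fitted on the training split only, learns one mean and standard deviation per column, and then reuses those statistics on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=170, scale=10, size=100),        # e.g. height in cm
    rng.normal(loc=50000, scale=15000, size=100),   # e.g. yearly income
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
# Learn per-column mean and std from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics; do NOT call fit on the test set
X_test_scaled = scaler.transform(X_test)

print("Per-feature training means:", scaler.mean_)
print("Train means after scaling:", X_train_scaled.mean(axis=0))
print("Test means after scaling:", X_test_scaled.mean(axis=0))
```

The scaled test set will usually not have a mean of exactly 0 or a standard deviation of exactly 1. That is expected, because it was transformed with the training statistics rather than its own.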

---
