# Scaling and Normalization for Datasets

# 1. Introduction

Scaling and normalization are two common techniques used to preprocess datasets in machine learning. They aim to transform the data into a more suitable range or distribution to improve the performance of certain algorithms. Let's explore each technique:

### 1. Scaling:

Scaling transforms the values of numerical features so that they share a comparable scale, often a fixed range such as 0 to 1 or -1 to 1. This prevents features with large magnitudes from dominating others in the learning algorithm. Common scaling techniques include:

- Min-max scaling (also known as normalization): It scales the data to a fixed range, usually between 0 and 1. The formula for min-max scaling is:

**X_scaled = (X - X_min) / (X_max - X_min)**

Where X is the original value, X_scaled is the scaled value, and X_min and X_max are the minimum and maximum values of the feature being scaled.

- Standardization: It transforms the data to have zero mean and unit variance. The formula for standardization is:

**X_scaled = (X - mean) / standard_deviation**

Where X is the original value, X_scaled is the scaled value, and mean and standard_deviation are the mean and standard deviation of the feature being scaled.

- Robust scaling: It scales the data using percentiles, which makes it robust to outliers. The formula has the same form as standardization, but it uses the median and the interquartile range (IQR) instead of the mean and standard deviation:

**X_scaled = (X - median) / IQR**

Where IQR is the difference between the 75th and 25th percentiles of the feature.
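The three scaling techniques listed above can be compared side by side on a small array. This is a minimal sketch using NumPy; the values in `x` are made up for illustration, with the last one acting as an outlier.

```python
import numpy as np

# Hypothetical feature values; the last one is an outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: maps the minimum to 0 and the maximum to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation.
# Note: np.std() computes the population standard deviation (ddof=0),
# while pandas' Series.std() defaults to the sample version (ddof=1).
x_standard = (x - x.mean()) / x.std()

# Robust scaling: subtract the median, divide by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
x_robust = (x - median) / (q3 - q1)

print("min-max: ", np.round(x_minmax, 3))
print("standard:", np.round(x_standard, 3))
print("robust:  ", np.round(x_robust, 3))
```

With min-max scaling, the single outlier squeezes the other four values close to 0, whereas robust scaling keeps them evenly spread around the median and leaves only the outlier far away.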

### 2. Normalization:

Normalization aims to transform the distribution of the data to a standard distribution, such as a Gaussian distribution (bell curve). It can be useful for algorithms that assume normally distributed data. Common normalization techniques include:

- Z-score normalization (standardization): It transforms the data to have zero mean and unit variance. This is the same operation as the standardization described under scaling; it shifts and rescales the data but does not change the shape of its distribution.

- Log transformation: It applies the logarithm function to the data, which can reduce right skew and bring the distribution closer to normal. It requires strictly positive values.

- Box-Cox transformation: It is a family of power transformations, controlled by a parameter lambda, that includes the log transformation as a special case (lambda = 0). It can make skewed distributions more Gaussian-like, and like the log transformation it requires strictly positive values.
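To make the log and Box-Cox transformations concrete, here is a small sketch using NumPy only. The helper names `box_cox` and `skewness`, the sample values, and the choice of lambda = 0.5 are all assumptions made for illustration. In practice lambda is usually estimated from the data, for example by `scipy.stats.boxcox`.

```python
import numpy as np

# Hypothetical right-skewed, strictly positive values.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0])

def box_cox(values, lam):
    """Box-Cox power transform for strictly positive values and a given lambda."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(values)  # limiting case as lambda approaches 0
    return (values ** lam - 1.0) / lam

def skewness(values):
    """Population skewness: mean cubed deviation divided by std cubed."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / values.std() ** 3

x_log = np.log(x)           # log transformation
x_bc = box_cox(x, lam=0.5)  # lambda = 0.5 is an arbitrary illustrative choice

print("skewness (raw):    ", skewness(x))
print("skewness (log):    ", skewness(x_log))
print("skewness (Box-Cox):", skewness(x_bc))
```

A skewness near zero indicates a roughly symmetric distribution, so comparing the three printed values gives a quick check of how much each transformation reduces the skew.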

Both scaling and normalization techniques can be applied to the entire dataset or to specific features, depending on the requirements of the problem. It's important to note that the parameters of these techniques (e.g., mean and standard deviation) should be calculated from the training set only, and then reused unchanged to transform the test or validation sets. Computing them on the full dataset would leak information from the test data into training.
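The train/test rule above can be sketched in a few lines. The arrays `train` and `test` are hypothetical values for a single feature; the point is only that the mean and standard deviation come from `train` and are reused for `test`.

```python
import numpy as np

# Hypothetical values of one feature, already split into train and test.
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

# Compute the scaling parameters from the training set only.
mean = train.mean()
std = train.std()

# Apply the same parameters to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("train:", np.round(train_scaled, 3))
print("test: ", np.round(test_scaled, 3))
```

Test values may fall outside the range seen in training (here, 50 is larger than any training value). That is expected and is not a reason to recompute the parameters on the test set.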

# 2. Scaling Examples

In [2]:
import pandas as pd

In [4]:
housing = pd.read_csv('Housing.csv')
housing.head()

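| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |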


This dataset of house prices contains several columns, including "area" (in square feet) and "bedrooms" (the number of bedrooms). To apply min-max scaling to "area", we first need to find the minimum and maximum values of that column.

In [5]:
max_value = housing['area'].max()
min_value = housing['area'].min()

print("Maximum value:", max_value)
print("Minimum value:", min_value)

Maximum value: 16200
Minimum value: 1650


In pandas, you can apply a function to every value in a column using the **apply()** method. On a Series (a single column), **apply()** calls the function once per value; on a DataFrame, it applies the function along either the rows or the columns. Here's an example of a function to map over an entire column in pandas:

In [7]:
def min_max_scaling(x):
    # Scale a single value to the [0, 1] range using min_value and max_value from In [5]
    return (x - min_value) / (max_value - min_value)
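
Putting the pieces together, the following self-contained sketch applies min-max scaling to the first five "area" values shown in the `housing.head()` output, using the minimum (1650) and maximum (16200) printed earlier. The DataFrame name `sample` and the column name `area_scaled` are hypothetical names chosen for this illustration.

```python
import pandas as pd

# First five "area" values from the housing.head() output.
sample = pd.DataFrame({"area": [7420, 8960, 9960, 7500, 7420]})

# Minimum and maximum of the full "area" column, as printed earlier.
min_value, max_value = 1650, 16200

def min_max_scaling(x):
    """Scale a single value to the [0, 1] range."""
    return (x - min_value) / (max_value - min_value)

# apply() calls the function once per element of the Series.
sample["area_scaled"] = sample["area"].apply(min_max_scaling)

# The same result with vectorized arithmetic, which is usually faster.
vectorized = (sample["area"] - min_value) / (max_value - min_value)

print(sample)
```

For a simple arithmetic formula like this one, the vectorized form is generally preferable; `apply()` is more useful when the per-value logic cannot be expressed with column-level operations.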


# 3. Normalization Examples