# Scaling 

Scaling refers to the process of normalizing or standardizing the features of a dataset. Its purpose is to bring all features onto a comparable scale or range so that no single feature dominates the others simply because of its units or magnitude. This matters most for algorithms that rely on distance metrics or gradient-based optimization, such as k-nearest neighbors, support vector machines, and neural networks.

## Min-Max Scaling (Normalization)

This method rescales each feature to a fixed range, usually between 0 and 1. It is achieved by subtracting the feature's minimum value and then dividing by its range (maximum value minus minimum value).

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])

# Create MinMaxScaler object
scaler = MinMaxScaler()

# Fit the scaler to the data and transform the data
scaled_data = scaler.fit_transform(data)

print("Original data:")
print(data)
print("\nScaled data:")
print(scaled_data)
```

Output:

```text
Original data:
[[1 2]
 [3 4]
 [5 6]]

Scaled data:
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]
```
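
To make the arithmetic concrete, the same transformation can be computed by hand with NumPy. This is only a sketch of the formula described above, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, applied column by column; the variable names `col_min`, `col_max`, and `manual_scaled` are chosen here for illustration.

```python
import numpy as np

# Same sample data as above, stored as floats
data = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Per-feature (column-wise) minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# x' = (x - min) / (max - min), broadcast across all rows
manual_scaled = (data - col_min) / (col_max - col_min)

print(manual_scaled)
```

Note that this sketch does not guard against a constant feature, where the range is zero and the division is undefined.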


### Limitations

**Sensitivity to outliers**: Min-Max scaling can be greatly influenced by outliers in the data, especially when the range of the data is very large. Outliers can result in the majority of the data being squeezed into a very small range, leading to loss of information.
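
As a small illustration of this squeezing effect, the sketch below applies the Min-Max formula to a made-up feature in which four values are close together and one value (100) is an outlier. The numbers are invented purely for demonstration.

```python
import numpy as np

# Four typical values plus one outlier (100); illustrative data only
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-Max scaling: the outlier defines the maximum, and so the range
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled.round(4))
```

The outlier maps to 1, while the four typical values are all compressed into a small sliver at the bottom of the [0, 1] interval.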

**May not handle varying distributions well**: Min-Max scaling is a linear transformation, so it changes only the range of a feature, not the shape of its distribution. If a feature is highly skewed, it remains just as skewed after scaling, and most values may still cluster near one end of the [0, 1] interval.

## Standardization

This method rescales each feature to have a mean of 0 and a standard deviation of 1. It is achieved by subtracting the feature's mean and then dividing by its standard deviation.

$$z = \frac{x - \mu}{\sigma}$$

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])

# Create StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform the data
standardized_data = scaler.fit_transform(data)

print("Original data:")
print(data)
print("\nStandardized data:")
print(standardized_data)
```

Output:

```text
Original data:
[[1 2]
 [3 4]
 [5 6]]

Standardized data:
[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
```
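
The result can also be reproduced by hand, which shows where the value 1.2247 comes from. The sketch below assumes the population standard deviation (NumPy's default, `ddof=0`), which is what matches the output above; with the sample standard deviation (`ddof=1`) the values would differ.

```python
import numpy as np

# Same sample data as above, stored as floats
data = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Per-feature mean and population standard deviation (ddof=0)
mean = data.mean(axis=0)
std = data.std(axis=0)

# z = (x - mean) / std, broadcast across all rows
manual_standardized = (data - mean) / std

print(manual_standardized)
```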


### Limitations

**Does not bound the data**: Unlike Min-Max scaling, standardization does not bound the data to a specific range. The transformed data will generally contain negative values, which are not appropriate for algorithms that expect non-negative input (for example, multinomial Naive Bayes) or for interpretations that rely on a fixed range.

**Might not handle extreme outliers well**: Standardization does not pin the output to the observed minimum and maximum, so it is often less distorted by outliers than Min-Max scaling. However, the mean and standard deviation are themselves sensitive to extreme values, so a large outlier can still shift and compress the standardized data.
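
A quick way to see this is to compare the mean and standard deviation of a small feature with and without a single extreme value. The data below is invented for illustration.

```python
import numpy as np

# Four typical values, then the same values plus one outlier (100)
inliers = np.array([1.0, 2.0, 3.0, 4.0])
with_outlier = np.append(inliers, 100.0)

# The outlier pulls both statistics far away from the typical values
for name, x in [("without outlier", inliers), ("with outlier", with_outlier)]:
    print(f"{name}: mean = {x.mean():.2f}, std = {x.std():.2f}")

# Standardize the feature that contains the outlier
z = (with_outlier - with_outlier.mean()) / with_outlier.std()
print(z.round(3))
```

After standardization the four typical values end up packed into a narrow band of negative z-scores, because the outlier has inflated both the mean and the standard deviation.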

**Offers little benefit for scale-invariant algorithms**: Tree-based algorithms such as decision trees and random forests split on one feature at a time using thresholds, so they are largely insensitive to feature scales. Standardization typically leaves their performance unchanged and only adds a preprocessing step.
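
The scale invariance of threshold-based splits can be sketched with a toy decision stump. The helper `best_stump_partition` below is a hypothetical, simplified illustration written for this example (it is not scikit-learn's implementation): it picks the single threshold with the lowest weighted Gini impurity and returns which samples fall on the left side.

```python
import numpy as np

def best_stump_partition(x, y):
    """Return the boolean 'goes left' mask of the best single-threshold split."""
    def gini(labels):
        # Gini impurity of a set of integer class labels
        if labels.size == 0:
            return 0.0
        p = np.bincount(labels) / labels.size
        return 1.0 - np.sum(p ** 2)

    # Candidate thresholds: midpoints between consecutive unique values
    values = np.unique(x)
    thresholds = (values[:-1] + values[1:]) / 2

    best_mask, best_score = None, np.inf
    for t in thresholds:
        left = x <= t
        score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / y.size
        if score < best_score:
            best_mask, best_score = left, score
    return best_mask

# Illustrative one-feature dataset with two classes
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Standardize the feature
x_std = (x - x.mean()) / x.std()

# The chosen split separates the samples identically before and after scaling
same_partition = np.array_equal(
    best_stump_partition(x, y), best_stump_partition(x_std, y)
)
print(same_partition)
```

Standardization preserves the ordering of values within a feature, so the set of possible partitions, and therefore the best one, is unchanged.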

## When to use which one?

Choosing between Min-Max scaling (Normalization) and Standardization depends on the specific characteristics of your data and the requirements of your machine learning algorithm. Here are some guidelines to help you decide:

Min-Max Scaling (Normalization):

1. Use when your data has a known, limited range and few outliers.
2. Useful when your algorithm (such as a neural network or KNN) works best with inputs on a bounded scale such as 0 to 1.
3. Because it is a linear transformation, it preserves the shape of each feature's original distribution and the relative spacing between values within that feature.

Standardization:

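1. Use when your data has no natural bounds or contains moderate outliers. The data does not need to follow a Gaussian distribution, although standardized values are easiest to interpret when it roughly does.
2. Suitable for algorithms that benefit from centered features with comparable variance, such as regularized linear regression, logistic regression, and SVM.
3. Standardization centers the data around 0 and scales it to a standard deviation of 1, so features measured in different units contribute comparably to the model.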

In summary, if your data has a clear minimum and maximum and you need values within a specific range, use Min-Max scaling. If your data is unbounded, might contain outliers, or you are unsure about its distribution, standardization is often the safer default. In practice, it is a good idea to try both scaling methods and see which one yields better performance for your specific problem and algorithm.
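
One way to try both methods is to wrap each scaler in a pipeline and compare cross-validated scores. The sketch below is only an example setup: the wine dataset, the KNN classifier, `n_neighbors=5`, and `cv=5` are assumptions chosen for illustration, not recommendations from the text above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Example dataset whose features have very different magnitudes
X, y = load_wine(return_X_y=True)

candidates = {
    "min-max": MinMaxScaler(),
    "standardization": StandardScaler(),
}

results = {}
for name, scaler in candidates.items():
    # The pipeline refits the scaler on the training part of every fold
    model = make_pipeline(scaler, KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy = {results[name]:.3f}")
```

Putting the scaler inside the pipeline means its statistics are computed from the training folds only, so the held-out fold does not influence the scaling.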