# Data Normalization and Scaling
Data normalization and scaling are preprocessing techniques that put features on a comparable scale. This matters because many machine learning algorithms, especially those trained with gradient descent, converge faster and often perform better when features have similar ranges.

# Normalization
Normalization refers to rescaling numerical features to a specific range, typically between 0 and 1. This is particularly useful when features have different units or scales.

# Scaling
Scaling is a broader term that covers normalization as well as other techniques that change the range or spread of a feature, such as standardization.

In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/cats-dataset/cats_dataset.csv')

# Sklearn Functions

In [3]:
numerical_cols = ['Age (Years)', 'Weight (kg)']

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
df_min_max = df.copy()
df_min_max[numerical_cols] = min_max_scaler.fit_transform(df[numerical_cols])

# Z-Score Scaling
standard_scaler = StandardScaler()
df_z_score = df.copy()
df_z_score[numerical_cols] = standard_scaler.fit_transform(df[numerical_cols])

# Decimal Scaling
def decimal_scaling(data, num_digits):
    return data / (10 ** num_digits)

num_digits = 1  # Adjust number of digits as needed
df_decimal_scaled = df.copy()
df_decimal_scaled[numerical_cols] = decimal_scaling(df[numerical_cols], num_digits)

print("Min-Max Scaled Data:\n", df_min_max)
print("\nZ-Score Scaled Data:\n", df_z_score)
print("\nDecimal Scaled Data:\n", df_decimal_scaled)

Min-Max Scaled Data:
                   Breed  Age (Years)  Weight (kg)          Color  Gender
0          Russian Blue     1.000000     0.714286  Tortoiseshell  Female
1      Norwegian Forest     1.000000     1.000000  Tortoiseshell  Female
2             Chartreux     0.111111     0.142857          Brown  Female
3               Persian     0.666667     0.571429          Sable  Female
4               Ragdoll     0.500000     0.857143          Tabby    Male
..                  ...          ...          ...            ...     ...
995   British Shorthair     1.000000     0.428571           Gray  Female
996   British Shorthair     0.555556     0.000000        Bicolor  Female
997            Savannah     0.611111     0.428571        Bicolor  Female
998  American Shorthair     0.388889     0.142857  Tortoiseshell  Female
999           Chartreux     0.555556     0.285714          Sable  Female

[1000 rows x 5 columns]

Z-Score Scaled Data:
                   Breed  Age (Years)  Weight (kg)     

# User Defined

# Min-Max Scaling
Min-Max scaling linearly transforms features to a specific range, usually 0 to 1. It's calculated as:

*X_scaled = (X - X_min) / (X_max - X_min)*

where:
* X is the original value
* X_min is the minimum value in the feature
* X_max is the maximum value in the feature
* X_scaled is the scaled value
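
As a quick check of the formula, here is a minimal sketch in plain Python (no pandas) applied to the three `Weight (kg)` values used in the small example later in this notebook. The helper name `min_max_scale` and the choice to return zeros for a constant feature are assumptions made for illustration, not part of the original notebook.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula would divide by zero.
        # Returning zeros is an assumption; other conventions exist.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


weights = [7, 9, 3]  # Weight (kg) values from the small example
scaled_weights = min_max_scale(weights)
print(scaled_weights)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.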

# Z-Score Scaling (Standardization)
Z-score scaling transforms features to have a mean of 0 and a standard deviation of 1. It's calculated as:

*X_scaled = (X - mean) / std*

where:
* X is the original value
* mean is the mean of the feature
* std is the standard deviation of the feature
* X_scaled is the scaled value
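
One detail the formula leaves open is which standard deviation to use. pandas' `Series.std()` defaults to the sample standard deviation (dividing by n - 1), while scikit-learn's `StandardScaler` uses the population standard deviation (dividing by n), so the two give slightly different results on small data. The sketch below illustrates the difference with the standard library only; the helper name `z_score_scale` and its `sample` flag are hypothetical.

```python
import statistics


def z_score_scale(values, sample=True):
    """Standardize values with (x - mean) / std.

    sample=True  uses the sample standard deviation (n - 1 in the denominator).
    sample=False uses the population standard deviation (n in the denominator).
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(x - mean) / std for x in values]


ages = [19, 19, 3]  # Age (Years) values from the small example
z_sample = z_score_scale(ages, sample=True)
z_population = z_score_scale(ages, sample=False)
print(z_sample)
print(z_population)
```

With only three rows the gap is noticeable; on a dataset with many rows the two variants are nearly identical.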

# Decimal Scaling
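Decimal scaling divides the values of a feature by a power of 10. It's a simple method to reduce the magnitude of values. Conventionally, the exponent is the smallest integer that brings the largest absolute value below 1; the functions in this notebook take a fixed exponent (`num_digits`) instead.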
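
For comparison, here is a minimal sketch of the conventional variant, where the exponent is derived from the data rather than passed in. It assumes the exponent should be the smallest non-negative integer j such that max(|x|) / 10^j < 1; the function name `decimal_scale_auto` is hypothetical.

```python
def decimal_scale_auto(values):
    """Divide values by 10**j, where j is the smallest non-negative
    integer such that max(|x|) / 10**j < 1. Returns (scaled_values, j)."""
    max_abs = max(abs(x) for x in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [x / (10 ** j) for x in values], j


ages = [19, 19, 3]  # Age (Years) values from the small example
scaled_ages, exponent = decimal_scale_auto(ages)
print(exponent, scaled_ages)
```

For these ages the largest value is 19, so the exponent comes out as 2 and every scaled value falls below 1.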

In [14]:
def min_max_scaling(data, numerical_cols):
    scaled_data = data.copy()
    for col in numerical_cols:
        min_val = data[col].min()
        max_val = data[col].max()
        scaled_data[col] = (data[col] - min_val) / (max_val - min_val)
    return scaled_data

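def z_score_scaling(data, numerical_cols):
    scaled_data = data.copy()
    for col in numerical_cols:
        mean = data[col].mean()
        # pandas .std() defaults to the sample standard deviation (ddof=1),
        # whereas sklearn's StandardScaler uses the population value (ddof=0)
        std = data[col].std()
        scaled_data[col] = (data[col] - mean) / std
    return scaled_data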

def decimal_scaling(data, numerical_cols, num_digits):
    # Divide each numerical column by 10 ** num_digits
    scaled_data = data.copy()
    for col in numerical_cols:
        scaled_data[col] = data[col] / (10 ** num_digits)
    return scaled_data

In [13]:
data = {
    'Breed': ['Russian Blue', 'Norwegian Forest', 'Chartreux'],
    'Age (Years)': [19, 19, 3],
    'Weight (kg)': [7, 9, 3],
    'Color': ['Tortoiseshell', 'Tortoiseshell', 'Brown'],
    'Gender': ['Female', 'Female', 'Female']
}
df = pd.DataFrame(data)

# Specify numerical columns
numerical_cols = ['Age (Years)', 'Weight (kg)']

# Apply scaling functions
df_min_max = min_max_scaling(df, numerical_cols)
df_z_score = z_score_scaling(df, numerical_cols)
df_decimal_scaled = decimal_scaling(df, numerical_cols, 1)

print("Min-Max Scaled Data:\n", df_min_max)
print("\nZ-Score Scaled Data:\n", df_z_score)
print("\nDecimal Scaled Data:\n", df_decimal_scaled)

Min-Max Scaled Data:
               Breed  Age (Years)  Weight (kg)          Color  Gender
0      Russian Blue          1.0     0.666667  Tortoiseshell  Female
1  Norwegian Forest          1.0     1.000000  Tortoiseshell  Female
2         Chartreux          0.0     0.000000          Brown  Female

Z-Score Scaled Data:
               Breed  Age (Years)  Weight (kg)          Color  Gender
0      Russian Blue     0.577350     0.218218  Tortoiseshell  Female
1  Norwegian Forest     0.577350     0.872872  Tortoiseshell  Female
2         Chartreux    -1.154701    -1.091089          Brown  Female

Decimal Scaled Data:
               Breed  Age (Years)  Weight (kg)          Color  Gender
0      Russian Blue          1.9          0.7  Tortoiseshell  Female
1  Norwegian Forest          1.9          0.9  Tortoiseshell  Female
2         Chartreux          0.3          0.3          Brown  Female


# Choosing the right scaling method:

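* Min-Max scaling is suitable when you need values in a fixed, bounded range (such as 0 to 1) and there are no significant outliers, since a single extreme value compresses the rest of the data.
* Z-score scaling is suitable when the feature is roughly normally distributed or the algorithm expects zero-mean, unit-variance inputs. It is less distorted by outliers than Min-Max scaling, although outliers still affect the mean and standard deviation.
* Decimal scaling is a quick and simple method for reducing large values.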
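
To make the outlier point concrete, the following sketch scales a small, made-up feature containing one extreme value with both methods. The data are hypothetical and chosen only for illustration; the z-scores use the population standard deviation.

```python
import statistics

values = [1, 2, 3, 4, 100]  # hypothetical feature with a single outlier

# Min-Max scaling: the outlier defines the top of the range
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Z-score scaling (population standard deviation)
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_scores = [(x - mean) / std for x in values]

print([round(v, 3) for v in min_max])
print([round(v, 3) for v in z_scores])
```

Under Min-Max scaling the four ordinary values are squeezed into a narrow band near 0 because the outlier takes the value 1. The z-scores are not bounded, but the outlier still pulls the mean and inflates the standard deviation, so neither method is fully robust to extreme values.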