Data Preprocessing: Scaling, Encoding, Normalization (with Scikit-learn)

What is data preprocessing?
- Data preprocessing transforms raw data into a clean and usable format.
- It improves model performance and training speed.

Common Tasks:
- Handling missing values.
- Scaling numerical features.
- Encoding categorical variables.
- Normalizing feature ranges.

1. Feature Scaling:
    - Scaling adjusts values so features contribute equally.
    - Common Techniques: StandardScaler (Z-score), MinMaxScaler (0-1 scale)


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Example dataset

data = pd.DataFrame({
    'Salary': [50000, 60000, 55000, 70000, 65000],
    'Age': [25, 30, 28, 35, 32]
})
data

   Salary  Age
0   50000   25
1   60000   30
2   55000   28
3   70000   35
4   65000   32


- Here the Salary values are much larger than the Age values, so a model may treat Salary as more important simply because of its scale.

- Many algorithms (for example, distance-based ones such as KNN, or models trained with gradient descent) give more weight to columns with larger values.

- From the analyst's point of view, both features should start with equal importance, so they need to be brought onto a comparable scale.
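
To make the scale problem concrete, here is a minimal sketch that measures how much each feature contributes to the Euclidean distance between the first two rows of the example table. The variable names are chosen for illustration only, and the effect shown applies to distance-based methods; tree-based models are largely insensitive to feature scale.

```python
import numpy as np

# Two rows from the example table, as (Salary, Age)
a = np.array([50000, 25])
b = np.array([60000, 30])

# Per-feature squared differences that feed into the Euclidean distance
squared_diff = (a - b) ** 2
distance = np.sqrt(squared_diff.sum())

# Share of the squared distance that comes from each feature
share = squared_diff / squared_diff.sum()

print("Distance:", distance)
print("Share from Salary:", share[0])
print("Share from Age:", share[1])
```

Almost all of the distance comes from Salary, even though the Age difference (5 years out of a 10-year range) is just as large in relative terms.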


In [4]:
# Standardization centers each feature around zero with a standard deviation of one.
# This is useful for algorithms that assume roughly Gaussian, zero-centered features.
# It puts all features on a comparable scale, so no feature dominates because of its units.

scaler_standard = StandardScaler()
data_standardized = scaler_standard.fit_transform(data)
data_standardized = pd.DataFrame(data_standardized, columns=data.columns)
print("Standardized Data:")
print(data_standardized)
# Now each feature in data_standardized has a mean of 0 and a standard deviation of 1,
# so neither feature looks more important simply because of its scale.

Standardized Data:
     Salary       Age
0 -1.414214 -1.468051
1  0.000000  0.000000
2 -0.707107 -0.587220
3  1.414214  1.468051
4  0.707107  0.587220


In [7]:
# fit_transform() does two things in one step:
# 1. fit(): calculates the mean and standard deviation of each feature in the dataset.
# 2. transform(): uses these values to standardize the dataset (converts it to Z-scores).

# Z-score formula: Z = (X - mean) / std_dev

# After StandardScaler, each feature has mean = 0 and standard deviation = 1.
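
The two steps can also be run separately, which makes it easier to see what fit() learns. The sketch below rebuilds the example dataset, inspects the learned statistics through the `mean_` and `scale_` attributes, and compares the result with a manual Z-score. One assumption worth stating: the manual version uses `ddof=0` (population standard deviation), which is what StandardScaler uses, whereas pandas defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Salary': [50000, 60000, 55000, 70000, 65000],
    'Age': [25, 30, 28, 35, 32]
})

scaler = StandardScaler()
scaler.fit(data)  # fit(): learn the mean and standard deviation of each column

print("Learned means:", scaler.mean_)
print("Learned standard deviations:", scaler.scale_)

# Manual Z-score with the population standard deviation (ddof=0)
manual = (data - data.mean()) / data.std(ddof=0)

# transform(): apply the learned statistics to the data
from_sklearn = scaler.transform(data)

print("Manual result matches StandardScaler:",
      np.allclose(manual.to_numpy(), from_sklearn))
```

Keeping fit() and transform() separate is also how the same statistics can be reused on new data: fit once, then call transform() on any later rows.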

#### Min-Max Scaling (0 to 1)

In [8]:
# Min-Max Scaling transforms features by scaling each feature to a given range, usually between 0 and 1.
scaler_minmax = MinMaxScaler()
data_minmax_scaled = scaler_minmax.fit_transform(data)
data_minmax_scaled = pd.DataFrame(data_minmax_scaled, columns=data.columns)
print("\nMin-Max Scaled Data:")
print(data_minmax_scaled)
# Now, data_minmax_scaled has values between 0 and 1 for each feature.


Min-Max Scaled Data:
   Salary  Age
0    0.00  0.0
1    0.50  0.5
2    0.25  0.3
3    1.00  1.0
4    0.75  0.7


In [9]:
# Here the values lie between 0 and 1.
# How it is calculated:
# scaled_value = (X - min) / (max - min)
# where X is the original value, min is the minimum value of the feature, and max is the maximum value of the feature.
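
As a quick check of the formula, the sketch below applies it by hand with pandas and compares the result with MinMaxScaler. It rebuilds the same example dataset so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({
    'Salary': [50000, 60000, 55000, 70000, 65000],
    'Age': [25, 30, 28, 35, 32]
})

# scaled_value = (X - min) / (max - min), applied column by column
manual = (data - data.min()) / (data.max() - data.min())

from_sklearn = MinMaxScaler().fit_transform(data)

print(manual)
print("Manual result matches MinMaxScaler:",
      np.allclose(manual.to_numpy(), from_sklearn))
```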

In [10]:
# Difference between Standardization and Min-Max Scaling:
# Standardization centers the data around zero with a standard deviation of one, while Min-Max Scaling scales the data to a fixed range, typically between 0 and 1.
# Both are sensitive to outliers, but Min-Max Scaling more so: a single extreme value sets the min or max and squeezes all other values into a narrow part of the range.
# Standardization has no fixed range, while Min-Max Scaled values always lie between 0 and 1 (on the data used for fitting).
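
To see how an outlier affects each scaler, here is a small sketch. The salary values and the 1,000,000 outlier are made up for illustration. In both outputs the ordinary values end up bunched close together, so neither scaler is immune; for data with strong outliers, scikit-learn's RobustScaler (based on the median and interquartile range) is a common alternative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical salaries; the last value is an artificial outlier
salaries = np.array([[50000], [60000], [55000], [70000], [65000], [1000000]])

standardized = StandardScaler().fit_transform(salaries)
minmax = MinMaxScaler().fit_transform(salaries)

print("Standardized:", standardized.ravel())
print("Min-Max scaled:", minmax.ravel())
```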



2. Encoding Categorical Variables

Convert text labels to a numeric format for ML models.


In [11]:
# There are two common techniques for converting text labels into numerical form:
# 1. Label Encoding
# 2. One-Hot Encoding

# Label encoding assigns each unique category in a feature a unique integer value.
# For example, in a "Color" feature with categories "Red", "Blue", and "Green",
# scikit-learn's LabelEncoder sorts the categories and assigns "Blue" = 0, "Green" = 1, and "Red" = 2.

# One-Hot Encoding creates one binary column for each category in a feature.
# Using the same "Color" feature example, one-hot encoding would create three new columns:
# "Color_Red", "Color_Blue", and "Color_Green".
# Each row would have a 1 in the column corresponding to its category and 0s in the others.


In [12]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Example dataset with categorical feature
data_cat = pd.DataFrame({  
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

data_cat 

   Color
0    Red
1   Blue
2  Green
3   Blue
4    Red


In [16]:
# Label Encoding
label_encoder = LabelEncoder()
data_cat['Color_LabelEncoded'] = label_encoder.fit_transform(data_cat['Color'])
print("Label Encoded Data:")
print(data_cat)

# fit_transform() in Label Encoding:
# 1. fit(): identifies the unique categories in the 'Color' feature and assigns each
#    a unique integer value (in sorted order).
# 2. transform(): replaces each category in the 'Color' feature with its corresponding integer value.
# fit_transform() does both in one step.

Label Encoded Data:
   Color  Color_LabelEncoded
0    Red                   2
1   Blue                   0
2  Green                   1
3   Blue                   0
4    Red                   2
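
The mapping behind these integers can be inspected directly. The sketch below refits a LabelEncoder on the same colors, reads the learned categories from `classes_`, and uses `inverse_transform()` to recover the original labels; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's position is its integer code
print("Classes:", encoder.classes_)
print("Codes:", codes)

# inverse_transform() maps the integer codes back to the original labels
print("Decoded:", encoder.inverse_transform(codes))
```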


In [21]:
# One-Hot Encoding
df = pd.DataFrame({  
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

data_onehot = pd.get_dummies(df, columns=['Color'], prefix='Color')
print("\nOne-Hot Encoded Data:")
print(data_onehot)




One-Hot Encoded Data:
   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
4       False        False       True


### Label Encoder vs One-Hot Encoding

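LabelEncoder:
- Use for: Target (y) only
- Avoid for nominal features (X): the integers imply an order that does not exist

One-Hot (get_dummies):
- Use for: Features (X)
- Avoid for features with many categories (as a rule of thumb, more than about 15), since each category adds a column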

3. Normalization 

Normalize data so that each row vector has magnitude 1.

Why use Normalizer?

Purpose: Scales each row to have unit magnitude (length = 1) while preserving its direction.

Key idea: Converts each row vector into "pure direction" form by dividing it by its Euclidean norm.

When to use?
- Text data (TF-IDF, word counts)
- Clustering and similarity (k-means, cosine similarity)
- Any algorithm sensitive to vector direction but not magnitude (e.g. NLP, recommender systems).



In [23]:
from sklearn.preprocessing import Normalizer
# Normalization
X = np.array([[4, 1, 2, 2],
              [1, 3, 9, 3],
              [5, 7, 5, 1]])
normalizer = Normalizer()
X_normalized = normalizer.fit_transform(X)
print("\nNormalized Data:")
print(X_normalized)

# How does it work?
# For each row, the normalizer computes the L2 norm (Euclidean norm) and divides each element in the row by this norm.
# This scales each row to have a unit norm (length of 1).
# Norm is calculated as: ||X|| = sqrt(x1^2 + x2^2 + ... + xn^2)
# For example, for the first row [4, 1, 2, 2]:
# L2 norm = sqrt(4^2 + 1^2 + 2^2 + 2^2) = sqrt(16 + 1 + 4 + 4) = sqrt(25) = 5
# Normalized row = [4/5, 1/5, 2/5, 2/5] = [0.8, 0.2, 0.4, 0.4]
# Normalization is particularly useful when you want to ensure that each data point (row) has equal weight in distance-based algorithms like KNN or clustering.



Normalized Data:
[[0.8 0.2 0.4 0.4]
 [0.1 0.3 0.9 0.3]
 [0.5 0.7 0.5 0.1]]
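
The row-wise calculation can be reproduced with plain NumPy, which also shows why normalization pairs well with cosine similarity. This is a minimal sketch on the same matrix; it relies on Normalizer's default L2 norm, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4, 1, 2, 2],
              [1, 3, 9, 3],
              [5, 7, 5, 1]])

# L2 norm of each row; keepdims=True keeps the shape (3, 1) so the division broadcasts per row
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = X / row_norms

from_sklearn = Normalizer().fit_transform(X)

print("Row norms:", row_norms.ravel())
print("Manual result matches Normalizer:", np.allclose(manual, from_sklearn))

# For unit-length rows, the dot product of two rows equals their cosine similarity
cosine = manual @ manual.T
print(cosine)
```

The diagonal of `cosine` is 1 because every normalized row has length 1; the off-diagonal entries compare directions only, regardless of the original row magnitudes.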
