# Data Handling and Preprocessing 

## Introduction

Dataset handling and preprocessing are critical steps in Machine Learning and Deep Learning.
The quality of data directly affects the performance, accuracy, and reliability of a model.

Raw data often contains:
- Missing values
- Different scales
- Noise and outliers
- Bias

Preprocessing ensures that the dataset is clean, well-structured, and suitable for training
machine learning models.


## Purpose of Data Splitting

Data splitting is the process of dividing a dataset into separate subsets for training,
validation, and testing.

### Why is data splitting important?
- To evaluate model performance on unseen data
- To detect overfitting
- To tune hyperparameters correctly
- To simulate real-world deployment scenarios

Without a proper split, there is no reliable way to tell whether a model has learned general patterns or has simply memorized the training data.

## Training, Validation, and Testing Sets


### Training Set
- Used to train the model
- Model learns parameters such as weights and biases
- Typically contains the largest portion of the dataset


### Validation Set
- Used for hyperparameter tuning
- Helps in model selection
- Helps detect overfitting during training (for example, via early stopping)
- Not used to update model weights


### Testing Set
- Used only after model training is complete
- Evaluates final model performance
- Should never influence training or validation


## Common Split Ratios (70/20/10)

A commonly used split ratio is:

- 70% Training
- 20% Validation
- 10% Testing

Other popular ratios include:
- 80/20
- 60/20/20

The choice depends on:
- Dataset size
- Model complexity
- Problem domain


The code below splits a dataset into training, validation, and testing sets. First, it separates 70% of the data for training and holds out the remaining 30% as a temporary set. The temporary set is then split again into validation and testing sets. Finally, it prints the shape of each set to confirm the split.

In [1]:
from sklearn.model_selection import train_test_split
import numpy as np

# Sample dataset
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

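# First split: 70% training, 30% temporary hold-out
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split: test_size is relative to the temporary set,
# so 0.33 of the 30% hold-out is roughly 10% of the full dataset
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.33, random_state=42
)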

print("Training set size:", X_train.shape)
print("Validation set size:", X_val.shape)
print("Testing set size:", X_test.shape)


Training set size: (35, 2)
Validation set size: (10, 2)
Testing set size: (5, 2)
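
The second `test_size` in the split above is a fraction of the temporary set, not of the full dataset, which is easy to get wrong when changing ratios. As a minimal sketch, the helper below (a hypothetical `three_way_split`, not part of scikit-learn) converts the requested fractions into absolute sample counts first, relying on the fact that `train_test_split` treats an integer `test_size` as a number of samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.2, test_frac=0.1, seed=42):
    """Split (X, y) into train/validation/test parts.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    n = len(X)
    n_val = round(val_frac * n)
    n_test = round(test_frac * n)

    # An integer test_size is interpreted as an absolute number of samples
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=n_val + n_test, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=n_test, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# 60/20/20 split on 100 samples
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(
    X, y, val_frac=0.2, test_frac=0.2
)
print(len(X_train), len(X_val), len(X_test))
```

Working with counts avoids having to compute the relative fraction for the second split by hand.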


## Random Sampling vs Stratified Sampling


### Random Sampling

Random sampling is a data splitting technique in which samples are selected
purely at random from the dataset, so every data point has the same chance
of ending up in the testing set.

In random sampling, the selection process does not consider the class labels
or target variable distribution. As a result, the proportion of classes in the
training and testing sets may differ from that of the original dataset.

Random sampling is simple to implement and works well when the dataset is large
and balanced. However, for imbalanced datasets, it may lead to biased model
evaluation due to unequal class representation.




The code below performs a random train–test split on the dataset. It randomly selects 80% of the data for training and 20% for testing, without preserving any class distribution (no stratification). The `random_state` argument ensures reproducibility, and the printed shapes confirm the sizes of the training and testing sets.


In [3]:
import numpy as np
from sklearn.model_selection import train_test_split

# Example dataset
X = np.arange(100).reshape(50, 2)   # Features
y = np.arange(50)                  # Target values (all distinct)

# Random sampling (no stratification)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    shuffle=True
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)


Training set shape: (40, 2)
Testing set shape: (10, 2)


### Stratified Sampling

Stratification is a data sampling technique in which the dataset is divided
into homogeneous subgroups called *strata* based on the target variable.
Samples are then drawn from each stratum in such a way that the proportion
of each class is preserved in the training and testing sets.

Stratified sampling is mainly used in **classification problems**, especially
when the dataset is **imbalanced**. It ensures fair representation of all
classes during model training and evaluation.

Unlike random sampling, stratification prevents situations where minority
classes are underrepresented or completely missing in the training or
testing data.


In [4]:
import numpy as np
from sklearn.model_selection import train_test_split

# Example dataset with classes
X = np.arange(100).reshape(50, 2)
y = np.array([0]*25 + [1]*25)  # Balanced classes

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)


Training set shape: (40, 2)
Testing set shape: (10, 2)
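
With balanced classes, the shapes alone do not show what stratification changes. The sketch below uses a made-up 90/10 imbalanced label vector and compares the class counts produced by a plain random split and a stratified split. The data and the seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 90 samples of class 0, 10 samples of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Plain random split: the minority share in the test set depends on the seed
_, _, _, y_test_random = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stratified split: the 90/10 ratio is preserved in both subsets
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("Random test counts:     ", np.bincount(y_test_random, minlength=2))
print("Stratified train counts:", np.bincount(y_train_strat, minlength=2))
print("Stratified test counts: ", np.bincount(y_test_strat, minlength=2))
```

The stratified subsets keep 10% of their samples in the minority class, while the random split gives whatever proportion the shuffle happens to produce.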


### What is Data Leakage?

Data leakage is a serious problem in machine learning that occurs when
information from outside the training dataset is unintentionally used
during model training.

This leaked information gives the model access to data it would not have
in real-world scenarios, leading to unrealistically high performance
during training or evaluation.

As a result, the model performs poorly when deployed on new, unseen data.

#### How Data Leakage Happens

Data leakage can occur due to:
- Fitting preprocessing (scaling, normalization) on the full dataset before splitting
- Using test data during model training
- Creating features using future or target-related information
- Performing feature selection on the entire dataset

Even small leaks can severely distort model evaluation results.


#### Effects of Data Leakage
- Overestimated model accuracy
- Poor generalization to unseen data
- Incorrect model selection
- Failure in real-world deployment

#### How to Avoid Data Leakage
- Always split the data before fitting any preprocessing step
- Fit preprocessing steps only on training data
- Never use test data during training or validation
- Apply the same preprocessing parameters to validation and test sets
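
One way to follow these rules without doing the bookkeeping by hand is to wrap preprocessing and model in a scikit-learn `Pipeline`. In the sketch below, `cross_val_score` refits the whole pipeline on the training part of each fold, so the scaler never sees the held-out part. The synthetic data and the choice of `LogisticRegression` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Scaler and classifier are fitted together on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation: one accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)
print(scores.shape)
```

Because the scaler lives inside the pipeline, its mean and standard deviation are recomputed from the training portion of every fold.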


## Importance of Feature Scaling

Feature scaling is a preprocessing technique used to bring all input features
to a similar scale so that no single feature dominates the learning process
due to its magnitude.

In real-world datasets, different features often have different units and
ranges. For example:
- Age may range from 0 to 100
- Salary may range from 1,000 to 1,000,000
- Distance may be measured in kilometers

Without scaling, machine learning models may give unfair importance to
features with larger numerical values.


### Example of Feature Scale

Consider the following features in a dataset:

- Age: ranges from 18 to 60
- Height (cm): ranges from 150 to 190
- Salary (INR): ranges from 15,000 to 1,000,000

Although all three are important, Salary has a much larger numerical scale
than Age or Height.


### Why Scale Matters

Many machine learning algorithms use:
- Distance calculations
- Gradient optimization

If features are on different scales, the algorithm will give more importance
to features with larger values, even if they are not more important.
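
To make the effect on distance calculations concrete, the sketch below measures how much each feature contributes to the squared Euclidean distance between two people, before and after scaling. The age and salary values are made up, and `squared_diff_share` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (years), salary (INR); values are made up
people = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 90_000.0],
    [40.0, 70_000.0],
])


def squared_diff_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()


# Persons 0 and 1 differ by 35 years of age but only 1,000 INR of salary
raw_share = squared_diff_share(people[0], people[1])

scaled = StandardScaler().fit_transform(people)
scaled_share = squared_diff_share(scaled[0], scaled[1])

print("Raw share (age, salary):   ", raw_share.round(4))
print("Scaled share (age, salary):", scaled_share.round(4))
```

On the raw values almost all of the distance comes from salary, even though the two people differ far more in age relative to the spread of each feature. After scaling, the age difference accounts for most of the distance.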


### Meaning of Scaling a Feature

Scaling a feature means transforming its values so that:
- All features have comparable ranges
- No feature dominates due to its magnitude
- The learning algorithm treats features fairly

Scaling does NOT change the meaning of the data; it only changes how values
are represented numerically.




Imagine measuring:
- Distance in kilometers
- Weight in grams

If used together without adjustment, grams will dominate numerically.
Scaling converts both measurements into a comparable form.


## Min–Max Normalization

Min–Max Normalization is a feature scaling technique that transforms data
to a fixed range, usually between 0 and 1.

It rescales the feature values so that the minimum value becomes 0
and the maximum value becomes 1.

This technique preserves the original distribution shape of the data
but changes the scale.


### When to Use Min–Max Normalization

- When the data has a known fixed range
- When the algorithm requires bounded input values
- Commonly used in Neural Networks and Image Processing

Note that Min–Max normalization is sensitive to outliers, because extreme values determine the minimum and maximum.


In [2]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data
X = np.array([[10], [20], [30], [40], [50]])

# Apply Min-Max Normalization
scaler = MinMaxScaler()
X_minmax = scaler.fit_transform(X)

X_minmax


array([[0.  ],
       [0.25],
       [0.5 ],
       [0.75],
       [1.  ]])
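
The same result can be reproduced by hand with the usual Min–Max formula, x_scaled = (x − min) / (max − min). The sketch below uses a small hypothetical `min_max` function to apply it, then repeats the calculation with one extreme value to show the sensitivity to outliers.

```python
import numpy as np


def min_max(x):
    """Rescale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(min_max(x))

# Replace the largest value with an outlier
x_outlier = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])
print(min_max(x_outlier).round(3))
```

With the outlier present, the first four values are squeezed into roughly the lowest 3% of the range, so most of the [0, 1] interval is left unused.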

## Z-score Standardization

Z-score Standardization (also called Standard Scaling) transforms data
so that it has:
- Mean = 0
- Standard Deviation = 1

This technique measures how many standard deviations a value is
away from the mean.


### Formula for Z-score Standardization

z = (x − μ) / σ

Where:
- μ (mu) is the mean of the feature
- σ (sigma) is the standard deviation


### When to Use Z-score Standardization

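- When data roughly follows a normal (Gaussian) distribution
- When outliers exist (it is less sensitive to them than Min–Max normalization, though not fully robust)
- Commonly recommended for scale-sensitive algorithms such as:
  - Regularized Linear Regression
  - Logistic Regression
  - SVM
  - PCA
  - Neural Networks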


In [6]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sample data
X = np.array([[10], [20], [30], [40], [50]])

# Apply Z-score Standardization
scaler = StandardScaler()
X_zscore = scaler.fit_transform(X)

X_zscore

array([[-1.41421356],
       [-0.70710678],
       [ 0.        ],
       [ 0.70710678],
       [ 1.41421356]])

The code applies Z-score standardization to the dataset.
It rescales the values so that the transformed data has a mean of 0 and a standard deviation of 1, making the feature suitable for machine learning algorithms that are sensitive to feature scale.
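
As a check on the formula z = (x − μ) / σ, the sketch below computes the z-scores directly with NumPy and compares them with the `StandardScaler` output. It relies on `StandardScaler` using the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Manual z-score: per-feature mean and population standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # ddof=0 by default
z_manual = (X - mu) / sigma

z_sklearn = StandardScaler().fit_transform(X)

print("mu:", mu, "sigma:", sigma.round(4))
print("Matches StandardScaler:", np.allclose(z_manual, z_sklearn))
```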

## Mean and Standard Deviation Normalization (Image Data)

In image preprocessing, pixel values are normalized using the dataset's
mean and standard deviation.

Images usually have pixel values in the range [0, 255].
Normalization helps neural networks train faster and more stably.


### Why Image Normalization is Important

- Improves convergence speed
- Reduces the risk of exploding or vanishing gradients
- Ensures stable training in deep neural networks
- Standard practice in CNNs


### Formula for Image Normalization

x_normalized = (x − mean) / standard_deviation

In [8]:
import numpy as np

# Simulated RGB image (height=224, width=224, channels=3)
image = np.random.randint(0, 256, (224, 224, 3))

# Compute mean and std
mean = image.mean()
std = image.std()

# Normalize image
normalized_image = (image - mean) / std

normalized_image.shape

(224, 224, 3)

This code performs mean and standard deviation normalization on an RGB image.
It computes the image’s mean and standard deviation, then normalizes all pixel values so the image has approximately zero mean and unit variance, which helps improve stability and training efficiency in deep learning models.



In deep learning frameworks (PyTorch / TensorFlow), predefined mean and
standard deviation values are often used, especially for pretrained models.
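
As an illustration of predefined statistics, the sketch below applies the per-channel mean and standard deviation commonly used with ImageNet-pretrained models. These values assume RGB channel order and pixel values already scaled to [0, 1]; whether they apply to a given model depends on how that model was trained. The image is random data standing in for a real photo.

```python
import numpy as np

# Per-channel statistics commonly used with ImageNet-pretrained models
# (RGB order, for pixel values scaled to [0, 1])
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

# Simulated RGB image (height=224, width=224, channels=3)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Scale to [0, 1], then normalize each channel via broadcasting
scaled = image.astype(np.float32) / 255.0
normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD

print(normalized.shape)
```

Unlike the single global mean and standard deviation in the previous cell, each color channel here is shifted and scaled by its own pair of values.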


## Consistent Preprocessing for Training and Testing Data

Consistent preprocessing means applying the same transformations
to training, validation, and testing datasets using identical parameters.

This is critical to ensure fair and unbiased model evaluation.


### Correct Rule to Follow

- Fit preprocessing steps ONLY on training data
- Apply (transform) the same preprocessing to validation and test data

Never fit preprocessing on test data.


### Why Consistency is Important

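If preprocessing is fitted on the test data, or separately on each subset:
- Test-set information leaks into preprocessing
- Training and test features end up on inconsistent scales
- Model evaluation becomes invalid and test accuracy misleading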


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample dataset (unseeded, so the exact values vary between runs)
X = np.random.randint(10, 100, (100, 2))
y = np.random.randint(0, 2, 100)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply same scaler to test data
X_test_scaled = scaler.transform(X_test)

X_train_scaled[:3], X_test_scaled[:3]


(array([[-1.17824708,  0.23250366],
        [-0.91949478, -1.04884983],
        [-1.17824708, -1.50352364]]),
 array([[ 1.26141747,  1.51385714],
        [-1.03038863,  1.72052706],
        [ 1.11355901, -0.55284203]]))

This code demonstrates consistent preprocessing to avoid data leakage.
The dataset is first split into training and testing sets.
The StandardScaler is fitted only on the training data and then applied to both training and test sets using the same parameters, ensuring fair and unbiased model evaluation.


### Incorrect Approach (Leads to Data Leakage)

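Fitting a scaler separately on the training and test data, or fitting one
scaler on the full dataset before splitting, lets test-set statistics
influence preprocessing and puts the two sets on inconsistent scales.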
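
The difference can be seen in a minimal sketch on synthetic data (the data and seeds are assumptions for illustration). A scaler fitted on the test set forces the test features to zero mean by construction, whereas a scaler fitted on the training set maps the test data with the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Incorrect: a second scaler is fitted on the test data
X_test_wrong = StandardScaler().fit_transform(X_test)

# Correct: one scaler, fitted on training data, reused for the test data
scaler = StandardScaler().fit(X_train)
X_test_right = scaler.transform(X_test)

print("Separately fitted test means:", X_test_wrong.mean(axis=0).round(3))
print("Train-fitted test means:     ", X_test_right.mean(axis=0).round(3))
```

The train-fitted test means are generally close to, but not exactly, zero. That small offset is expected: it reflects the real difference between the training and test samples, which the incorrect approach hides.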
