# Manual Robust Scaling Normalization

Robust scaling is useful for data with outliers, as it is less sensitive to extreme values than standard scaling, which centers data by the mean and scales by the standard deviation. The robust scaling approach centers the data by the median and scales it by the interquartile range (IQR).

Given a dataset $ X $ with $ n $ samples and $ p $ features, the robust scaling transformation for each feature $ X_{j} $ is defined as:

$$
X_{j}^{\text{scaled}} = \frac{X_{j} - \text{Median}(X_{j})}{\text{IQR}(X_{j})}
$$

where:

- $ \text{Median}(X_{j}) $ is the median of feature $ X_{j} $,
- $ \text{IQR}(X_{j}) = Q_{3} - Q_{1} $ is the interquartile range of feature $ X_{j} $, where $ Q_{3} $ and $ Q_{1} $ represent the 75th and 25th percentiles of $ X_{j} $, respectively.

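After this transformation, each feature with a nonzero IQR has a median of 0 and an IQR of 1. Outliers are not removed or clipped; they simply have little influence on the centering and scaling statistics, so the bulk of the data ends up on a comparable scale across features.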
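As a small worked illustration of the formula, the sketch below scales a made-up five-value sample containing one extreme value, once with the median and IQR and once with the mean and standard deviation. The array `x` is a hypothetical example, not data from this notebook.

```python
import numpy as np

# Hypothetical sample with one extreme value (100.0)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Robust statistics: median and interquartile range
median = np.median(x)                    # 3.0
q1, q3 = np.percentile(x, [25, 75])      # 2.0 and 4.0
iqr = q3 - q1                            # 2.0

# Robust scaling: (x - median) / IQR
robust = (x - median) / iqr

# Standard scaling for comparison: (x - mean) / std
standard = (x - x.mean()) / x.std()

print("robust:  ", robust)
print("standard:", np.round(standard, 3))
```

Under robust scaling the four typical values land at -1, -0.5, 0 and 0.5, while the extreme value stays far away at 48.5. Under standard scaling the outlier inflates both the mean and the standard deviation, so the four typical values are squeezed into a narrow band around -0.5.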

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

In [2]:
# Robust-scale a data series: center by the median, divide by the IQR
def robust_scale(series):
    median = np.median(series)
    q75, q25 = np.percentile(series, [75, 25])
    iqr = q75 - q25
    
    # Avoid division by zero: substitute a tiny constant when the IQR is zero
    if iqr == 0:
        iqr = 1e-9
    
    # Apply robust scaling
    return (series - median) / iqr
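One consequence of the `1e-9` fallback is worth noting: when a feature has zero IQR but is not constant, any value away from the median is divided by `1e-9` and becomes very large. The sketch below shows this on a made-up series and compares it with a hypothetical variant, `robust_scale_unit_fallback`, that divides by 1 instead. The variant is an assumption for illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def robust_scale_unit_fallback(series):
    """Hypothetical variant: center by the median, and leave the
    feature unscaled (divide by 1) when its IQR is zero."""
    median = np.median(series)
    q25, q75 = np.percentile(series, [25, 75])
    iqr = q75 - q25
    if iqr == 0:
        iqr = 1.0
    return (series - median) / iqr

# Made-up series: most values identical, so Q1 == Q3 and the IQR is zero
s = pd.Series([5.0, 5.0, 5.0, 5.0, 9.0])

# What the 1e-9 fallback produces for this series (median is 5.0)
eps_scaled = (s - np.median(s)) / 1e-9

# What the unit fallback produces
unit_scaled = robust_scale_unit_fallback(s)

print(eps_scaled.tolist())
print(unit_scaled.tolist())
```

With the tiny-constant fallback the last value ends up on the order of 4e9, whereas the unit fallback leaves it at 4.0. Which behavior is preferable depends on how downstream steps treat such features.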

In [None]:
# Define the input directory and the columns to exclude from scaling (kept as is)
DIRECTORY = 'hopkins_export/'
drop_ls = [
    "expected_time",
    "flip_time",
    "stim_pos",
    "user_pos",
    "lambda_val",
    "change_rate_x",
]

# List of subject files in the directory
subject_files = [f for f in os.listdir(DIRECTORY) if f.endswith('.csv')]

# Process each subject file
for subject_file in subject_files:
    # Load each subject dataset
    subject_data = pd.read_csv(os.path.join(DIRECTORY, subject_file))
    
    # Separate columns to be scaled and columns to keep as is
    subject_data_drop = subject_data.drop(columns=drop_ls)
    subject_data_keep = subject_data[drop_ls]
    
    # Manually apply robust scaling
    subject_scaled = subject_data_drop.apply(robust_scale)
    
    # Combine scaled and unscaled columns, preserving original column order
    subject_final = pd.concat([subject_data_keep, subject_scaled], axis=1)
    subject_final = subject_final[subject_data.columns]
    
    # Save the scaled data to a new CSV file
    subject_final.to_csv(f'Manual_Robust_Scaling_{subject_file}', index=False)
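A quick sanity check on the scaling step could look like the sketch below. It applies the same `robust_scale` logic to a synthetic DataFrame and confirms that each scaled column has a median near 0 and an IQR near 1. The column names `a` and `b` and the random data are assumptions for illustration; they are not columns from the subject files.

```python
import numpy as np
import pandas as pd

def robust_scale(series):
    # Same logic as the function defined earlier in this notebook
    median = np.median(series)
    q75, q25 = np.percentile(series, [75, 25])
    iqr = q75 - q25
    if iqr == 0:
        iqr = 1e-9
    return (series - median) / iqr

# Synthetic data standing in for the numeric columns of one subject file
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(10.0, 3.0, 200),
    "b": rng.exponential(2.0, 200),
})

# Column-wise robust scaling, as in the processing loop
scaled = df.apply(robust_scale)

# Per-column median and IQR after scaling
medians = scaled.median()
iqrs = scaled.quantile(0.75) - scaled.quantile(0.25)

print(medians)
print(iqrs)
```

The same two summary lines can be run on the scaled columns of an exported CSV to confirm that the scaling behaved as intended.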