##### **Preprocessing of Data**

In [1]:
### 
"""
What is Data Scaling?
Data scaling transforms the values in a dataset onto a common scale, so that features
measured in different units or ranges become comparable. Linear scalers such as
standardization and min-max scaling preserve the relative spacing of values within each feature.
It is typically applied to numerical features.

Why Scale Data?
Balanced Feature Influence: Features on a common scale contribute more evenly to distance-based and gradient-based models.
Improved Algorithm Performance: Many machine learning algorithms perform better or converge faster when features are on a similar scale.
Prevent Dominance: Scaling keeps features with larger numeric ranges from dominating those with smaller ranges.
Numerical Stability: Some algorithms are sensitive to the scale of input features and may become numerically unstable without scaling.
"""
###

In [2]:
# Preparation of data for machine learning (preprocessing)

In [3]:
import numpy as np
from sklearn import preprocessing

In [4]:
input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, 4.3], [2, 2.8, -2.3, 4.5], [-1, 3.3, -1.9, 4.2], [1, 0, 0, 5]]) 
print(input_data)

[[ 3.  -1.5  3.  -6.4]
 [ 0.   3.  -1.3  4.1]
 [ 1.   2.3 -2.9  4.3]
 [ 2.   2.8 -2.3  4.5]
 [-1.   3.3 -1.9  4.2]
 [ 1.   0.   0.   5. ]]


In [5]:
print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))

Mean = [ 1.          1.65       -0.9         2.61666667]
Std deviation = [1.29099445 1.77646653 1.96383978 4.04286065]
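
The column statistics above show that the fourth feature spreads over a much wider range than the first. As a rough illustration of why that matters (a sketch that is not part of the original notebook; the helper `contribution_share` is a hypothetical name introduced here), the following measures how much each feature contributes to the squared Euclidean distance between the first two samples, before and after standardization.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, 4.3],
                       [2, 2.8, -2.3, 4.5], [-1, 3.3, -1.9, 4.2], [1, 0, 0, 5]])

def contribution_share(a, b):
    """Hypothetical helper: fraction of the squared Euclidean distance
    between samples a and b that is due to each feature."""
    sq_diff = (a - b) ** 2
    return sq_diff / sq_diff.sum()

# Shares computed on the raw data
raw_share = contribution_share(input_data[0], input_data[1])

# Shares computed after standardizing each column
scaled = preprocessing.scale(input_data)
scaled_share = contribution_share(scaled[0], scaled[1])

print("Raw share:   ", raw_share.round(2))
print("Scaled share:", scaled_share.round(2))
```

On the raw data the fourth feature accounts for about 70% of the squared distance; after standardization each feature contributes roughly 20% to 30%.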


In [6]:
###
"""
1. Standardization (Z-score Normalization)
Transforms data to have a mean of 0 and a standard deviation of 1.
Formula: X_scaled = (X - μ) / σ
Where μ is the mean and σ is the standard deviation.
"""
###


In [7]:
data_standardized = preprocessing.scale(input_data) 
print(data_standardized)

[[ 1.54919334 -1.77318286  1.98590539 -2.23026897]
 [-0.77459667  0.75993551 -0.2036826   0.36690192]
 [ 0.          0.36589488 -1.01841302  0.41637184]
 [ 0.77459667  0.64735247 -0.71288911  0.46584176]
 [-1.54919334  0.92881007 -0.50920651  0.39163688]
 [ 0.         -0.92881007  0.45828586  0.58951657]]
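
To connect the output above to the formula, the same result can be reproduced by hand. This is a minimal sketch; the names `mu`, `sigma`, and `manual` are introduced here for illustration. It assumes the population standard deviation (`ddof=0`, NumPy's default), which is what `preprocessing.scale` uses.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, 4.3],
                       [2, 2.8, -2.3, 4.5], [-1, 3.3, -1.9, 4.2], [1, 0, 0, 5]])

mu = input_data.mean(axis=0)     # per-column mean
sigma = input_data.std(axis=0)   # per-column population std (ddof=0)

# X_scaled = (X - mu) / sigma, broadcast across rows
manual = (input_data - mu) / sigma

# Compare with the library function
print(np.allclose(manual, preprocessing.scale(input_data)))
```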


In [8]:
print('Mean = ', data_standardized.mean(axis=0).round(2))
print('Std deviation = ', data_standardized.std(axis=0).round(2))

Mean =  [0. 0. 0. 0.]
Std deviation =  [1. 1. 1. 1.]
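
The function `preprocessing.scale` computes its statistics from whatever array it receives. When the same scaling has to be reused on data that arrives later, scikit-learn's `StandardScaler` class stores the learned mean and standard deviation. The sketch below is not from the original notebook, and the split into `train` and `new` rows is purely hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, 4.3],
                       [2, 2.8, -2.3, 4.5], [-1, 3.3, -1.9, 4.2], [1, 0, 0, 5]])

# Hypothetical split for illustration: first four rows as "training" data,
# last two rows as data seen later.
train, new = input_data[:4], input_data[4:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
new_scaled = scaler.transform(new)          # reuses the statistics learned from train

print("Learned mean:", scaler.mean_)
print("Learned std: ", scaler.scale_)
print(new_scaled)
```

Because the statistics come only from `train`, the rows in `new_scaled` will generally not have a mean of exactly 0 or a standard deviation of exactly 1.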


In [9]:
###
"""
2. Min-max Normalization
Transforms data to have a range between 0 and 1.
Formula: X_normalized = (X - min(X)) / (max(X) - min(X))
"""
###


In [10]:
data_normalized = preprocessing.minmax_scale(input_data) 
print(data_normalized)

[[1.         0.         1.         0.        ]
 [0.25       0.9375     0.27118644 0.92105263]
 [0.5        0.79166667 0.         0.93859649]
 [0.75       0.89583333 0.10169492 0.95614035]
 [0.         1.         0.16949153 0.92982456]
 [0.5        0.3125     0.49152542 1.        ]]


In [11]:
print('Min = ', data_normalized.min(axis=0).round(2))
print('Max = ', data_normalized.max(axis=0).round(2))

Min =  [0. 0. 0. 0.]
Max =  [1. 1. 1. 1.]
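
The min-max formula can also be checked by hand, and `minmax_scale` accepts a `feature_range` argument for target ranges other than [0, 1]. The following is a minimal sketch; the variable names `col_min`, `col_max`, `manual`, and `symmetric` are introduced here for illustration.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, 4.3],
                       [2, 2.8, -2.3, 4.5], [-1, 3.3, -1.9, 4.2], [1, 0, 0, 5]])

col_min = input_data.min(axis=0)  # per-column minimum
col_max = input_data.max(axis=0)  # per-column maximum

# X_normalized = (X - min(X)) / (max(X) - min(X)), applied column by column
manual = (input_data - col_min) / (col_max - col_min)
print(np.allclose(manual, preprocessing.minmax_scale(input_data)))

# Same idea, but mapping each column onto [-1, 1] instead of [0, 1]
symmetric = preprocessing.minmax_scale(input_data, feature_range=(-1, 1))
print(symmetric.min(axis=0), symmetric.max(axis=0))
```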


In [12]:
###
"""
3. Robust Scaling
Centers each feature on its median and scales it by its interquartile range (IQR), so the result has a median of 0 and an IQR of 1.
Because the median and IQR are barely affected by extreme values, this scaler is less sensitive to outliers than standardization.
Formula: X_scaled = (X - median(X)) / (quantile(X, 0.75) - quantile(X, 0.25))
"""
###

In [13]:
print('Median= ', np.median(input_data, axis=0))
print('IQR = ', np.percentile(input_data, 75, axis=0) - np.percentile(input_data, 25, axis=0))

Median=  [ 1.    2.55 -1.6   4.25]
IQR =  [1.5   2.375 1.875 0.325]


In [14]:
data_robust_scaled = preprocessing.robust_scale(input_data)
print(data_robust_scaled)

[[  1.33333333  -1.70526316   2.45333333 -32.76923077]
 [ -0.66666667   0.18947368   0.16        -0.46153846]
 [  0.          -0.10526316  -0.69333333   0.15384615]
 [  0.66666667   0.10526316  -0.37333333   0.76923077]
 [ -1.33333333   0.31578947  -0.16        -0.15384615]
 [  0.          -1.07368421   0.85333333   2.30769231]]


In [15]:
print('Median = ', np.median(data_robust_scaled, axis=0).round(2))
print('IQR = ', np.percentile(data_robust_scaled, 75, axis=0) - np.percentile(data_robust_scaled, 25, axis=0))

Median =  [0. 0. 0. 0.]
IQR =  [1. 1. 1. 1.]
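
The robust scaling formula can be reproduced directly from the median and the quartiles. This is a sketch under the assumption that `robust_scale` is called with its defaults (centering on, quantile range 25 to 75), as in the cell above; the names `q25`, `q75`, and `manual` are introduced here for illustration.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, 4.3],
                       [2, 2.8, -2.3, 4.5], [-1, 3.3, -1.9, 4.2], [1, 0, 0, 5]])

median = np.median(input_data, axis=0)                   # per-column median
q25, q75 = np.percentile(input_data, [25, 75], axis=0)   # per-column quartiles

# X_scaled = (X - median) / IQR
manual = (input_data - median) / (q75 - q25)
print(np.allclose(manual, preprocessing.robust_scale(input_data)))

# Fourth column, which contains the outlier -6.4: standardized vs. robust-scaled
print("Standardized: ", preprocessing.scale(input_data)[:, 3].round(2))
print("Robust scaled:", manual[:, 3].round(2))
```

In the fourth column the outlier is mapped far from the rest (about -32.77 in the output above), while the remaining values keep a spread of order 1 rather than being compressed into a narrow band as they are under standardization.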


In [16]:
###
"""
4. Normalization
Rescales each sample (row) independently so that it has unit norm. The norm argument selects which norm is used:

norm='l1': each row is divided by the sum of its absolute values, so the absolute values in the row sum to 1
norm='l2': each row is divided by its Euclidean length, so the squared values in the row sum to 1
norm='max': each row is divided by its largest absolute value, so the maximum absolute value in the row equals 1
"""
###

In [17]:
# axis=1 normalizes each sample (row) across its features; axis=0 would normalize each feature (column) across all samples.
# Per-sample normalization (axis=1) is the default and the more common choice, e.g. when only the direction of a sample vector matters.
data_l1_normalized = preprocessing.normalize(input_data, norm='l1', axis=1)
print(data_l1_normalized)

[[ 0.21582734 -0.10791367  0.21582734 -0.46043165]
 [ 0.          0.35714286 -0.1547619   0.48809524]
 [ 0.0952381   0.21904762 -0.27619048  0.40952381]
 [ 0.17241379  0.24137931 -0.19827586  0.38793103]
 [-0.09615385  0.31730769 -0.18269231  0.40384615]
 [ 0.16666667  0.          0.          0.83333333]]
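
A quick way to confirm what normalization does is to compute the norm of every row afterwards. The sketch below checks the `'l1'` and `'l2'` options with `np.linalg.norm` and reproduces the L1 case by hand; the `'max'` option is left out for brevity, and all variable names are introduced here for illustration.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, 4.3],
                       [2, 2.8, -2.3, 4.5], [-1, 3.3, -1.9, 4.2], [1, 0, 0, 5]])

l1 = preprocessing.normalize(input_data, norm='l1', axis=1)
l2 = preprocessing.normalize(input_data, norm='l2', axis=1)

# Row-wise norms after normalization: every entry should be 1
l1_norms = np.linalg.norm(l1, ord=1, axis=1)  # sum of absolute values per row
l2_norms = np.linalg.norm(l2, ord=2, axis=1)  # Euclidean length per row
print("L1 norms:", l1_norms)
print("L2 norms:", l2_norms)

# Manual L1 normalization: divide each row by the sum of its absolute values
manual_l1 = input_data / np.abs(input_data).sum(axis=1, keepdims=True)
print(np.allclose(manual_l1, l1))
```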


In [None]:
###
"""
5. Binarization
Maps each value to 0 or 1 based on a threshold: values strictly greater than the threshold become 1, all others become 0.
"""
###

In [18]:
data_binarization = preprocessing.binarize(input_data, threshold=0.5)
print(data_binarization)

[[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [1. 1. 0. 1.]
 [1. 1. 0. 1.]
 [0. 1. 0. 1.]
 [1. 0. 0. 1.]]
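
Binarization amounts to an element-wise comparison against the threshold. The sketch below reproduces the result with plain NumPy and probes the boundary case; the array `boundary` is a hypothetical input chosen to show that a value equal to the threshold maps to 0.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, 4.3],
                       [2, 2.8, -2.3, 4.5], [-1, 3.3, -1.9, 4.2], [1, 0, 0, 5]])

# Element-wise comparison: 1.0 where the value exceeds the threshold, else 0.0
manual = (input_data > 0.5).astype(float)
print(np.array_equal(manual, preprocessing.binarize(input_data, threshold=0.5)))

# Hypothetical boundary check: a value equal to the threshold is not greater than it
boundary = np.array([[0.5, 0.51]])
boundary_result = preprocessing.binarize(boundary, threshold=0.5)
print(boundary_result)
```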
