# Data pre-processing


## Normalization

Normalization rescales each feature's values to a smaller, consistent range.

Putting all features on a comparable scale can improve the performance of ML algorithms that are sensitive to feature magnitude, such as distance-based or gradient-based methods.

### Min-Max scaling (x - min) / (max - min)

https://www.educative.io/answers/what-is-data-scaling-and-normalization-in-machine-learning

- scales the values of a numeric feature into a fixed range, typically 0 to 1
- preserves the shape of the original data distribution, because the transformation is linear


In [14]:
import pandas as pd

# Define file paths
balanced_dataset_path = "./data/4-data-balancing/reduced_dataset.csv"
normalized_dataset_path = "./data/5-data-scaling/min_max_normalization.csv"

# Load the dataset in a df
df = pd.read_csv(balanced_dataset_path)

# Define the feature columns selected for Min-max Scaling
features_df = ["empatica_bvp", "empatica_eda", "empatica_temp", "samsung_bvp"]


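# Min-Max scaling: (x - min) / (max - min)
def min_max_scaling(x):
    # Avoid shadowing the built-ins `min` and `max`
    x_min = x.min()
    x_max = x.max()
    # Format to 10 decimal places; note that this turns the column into strings
    return ((x - x_min) / (x_max - x_min)).apply(lambda norm: f"{norm:.10f}")


# Apply Min-Max scaling to each selected feature
for feature in features_df:
    df[feature] = min_max_scaling(df[feature])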

# Write the Min-Max scaled df into a new .csv file
df.to_csv(normalized_dataset_path, index=False)
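
The cell above stores the scaled values as formatted strings, and its formula divides by zero when a column is constant. Below is a minimal sketch of a variant, assuming it is acceptable to keep the columns numeric and to map a constant column to 0.0. Both choices are assumptions and not part of the original pipeline; `min_max_scale_numeric` and the toy data are hypothetical.

```python
import pandas as pd


def min_max_scale_numeric(x: pd.Series) -> pd.Series:
    """Scale a numeric Series into [0, 1], keeping a float dtype."""
    x_min = x.min()
    value_range = x.max() - x_min
    # Assumption: a constant column carries no information, so map it to 0.0
    # instead of dividing by zero.
    if value_range == 0:
        return pd.Series(0.0, index=x.index, name=x.name)
    return (x - x_min) / value_range


# Toy data standing in for sensor columns (hypothetical values)
toy = pd.DataFrame(
    {"empatica_temp": [30.0, 32.0, 34.0], "constant": [5.0, 5.0, 5.0]}
)
scaled = toy.apply(min_max_scale_numeric)
print(scaled)
```

If a fixed number of decimals is still wanted in the file, one option is `df.to_csv(path, index=False, float_format="%.10f")`. Note that `float_format` applies to every float column in the frame, not only to the scaled ones.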

### Z-score normalization (Standardization)

https://www.educative.io/answers/what-is-data-scaling-and-normalization-in-machine-learning

- computes the mean (μ) and standard deviation (σ) of each feature column and scales the values using the z-score formula
- the scaled feature has a mean of 0 and a standard deviation of 1
- useful when the features have different units or different ranges
- centers the data around zero and scales it to unit variance
- a z-score at or near 0 means the data point is at or very close to the mean
- a positive or negative z-score gives the number of standard deviations by which the data point lies above (+) or below (-) the mean

Z = (x − μ) / σ

Where:

- Z : the z-score
- x : the data point
- μ : the mean
- σ : the standard deviation
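
To make the formula concrete, here is a small worked example on toy values. One detail worth noting: pandas' `Series.std()` computes the sample standard deviation (`ddof=1`) by default, while the population standard deviation uses `ddof=0`. The text does not say which one is intended, so the sketch shows both; the values are hypothetical.

```python
import pandas as pd

# Toy values (hypothetical), not taken from the dataset
x = pd.Series([2.0, 4.0, 6.0, 8.0])

mu = x.mean()
sigma_sample = x.std()            # pandas default: ddof=1
sigma_population = x.std(ddof=0)  # population standard deviation

z_sample = (x - mu) / sigma_sample
z_population = (x - mu) / sigma_population

print(z_sample.round(4).tolist())
print(z_population.round(4).tolist())
```

In both cases the scaled values have a mean of 0. The standard deviation is 1 only when it is measured with the same `ddof` that was used for scaling.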


In [15]:
import pandas as pd

# Define file paths
balanced_dataset_path = "./data/4-data-balancing/reduced_dataset.csv"
z_score_normalized_dataset_path = "./data/5-data-scaling/z_score_standardization.csv"

# Load the dataset into a df
df = pd.read_csv(balanced_dataset_path)

# Define the feature columns selected for Z-score Scaling
features_df = ["empatica_bvp", "empatica_eda", "empatica_temp", "samsung_bvp"]


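# Z-score scaling: (x - mean) / std
def z_score_scaling(x):
    mean = x.mean()
    # pandas computes the sample standard deviation (ddof=1) by default
    std = x.std()
    # Format to 10 decimal places; note that this turns the column into strings
    return ((x - mean) / std).apply(lambda stand: f"{stand:.10f}")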


# Apply Z-score scaling to the features
for feature in features_df:
    df[feature] = z_score_scaling(df[feature])

# Write the Z-score scaled df to a new .csv file
df.to_csv(z_score_normalized_dataset_path, index=False)

## Splitting the data into training, testing, and validation datasets

- **training**
  - the data used for training
  - split ID-wise
  - contains the samples from IDs 0, 2, 4, 6, 8
- **testing**
  - the data used for testing
  - split ID-wise
  - contains the samples from ID 9
- **validation**
  - the data used for validation
  - split ID-wise
  - contains the samples from ID 10


In [25]:
import pandas as pd

# Define file paths
min_max_dataset_path = "./data/5-data-scaling/min_max_normalization.csv"
z_score_dataset_path = "./data/5-data-scaling/z_score_standardization.csv"

min_max_training_dataset_path = "./data/6-data-split/min-max/a_training.csv"
min_max_testing_dataset_path = "./data/6-data-split/min-max/b_testing.csv"
min_max_validation_dataset_path = "./data/6-data-split/min-max/c_validation.csv"

z_score_training_dataset_path = "./data/6-data-split/z-score/a_training.csv"
z_score_testing_dataset_path = "./data/6-data-split/z-score/b_testing.csv"
z_score_validation_dataset_path = "./data/6-data-split/z-score/c_validation.csv"

# Map each scaled dataset to its input path and the output paths of its splits
scaled_datasets = {
    "min_max": {
        "path": min_max_dataset_path,
        "training_path": min_max_training_dataset_path,
        "testing_path": min_max_testing_dataset_path,
        "validation_path": min_max_validation_dataset_path,
    },
    "z_score": {
        "path": z_score_dataset_path,
        "training_path": z_score_training_dataset_path,
        "testing_path": z_score_testing_dataset_path,
        "validation_path": z_score_validation_dataset_path,
    },
}

# Define the IDs for each subset
training_ids = [0, 2, 4, 6, 8]
testing_id = 9
validation_id = 10

# Iterate over the scaled datasets; `data` holds the paths for one of them
for data in scaled_datasets.values():
    # Load the scaled dataset stored at key `path`
    df = pd.read_csv(data["path"])

    # Create the split subsets, ID-wise, each with a fresh 0-based index
    training_df = df[df["ID"].isin(training_ids)].reset_index(drop=True)
    testing_df = df[df["ID"] == testing_id].reset_index(drop=True)
    validation_df = df[df["ID"] == validation_id].reset_index(drop=True)

    # Pair each subset with the dictionary key that holds its output path
    subsets = {
        "training_path": training_df,
        "testing_path": testing_df,
        "validation_path": validation_df,
    }

    # Write each subset to its .csv file
    for path_key, subset in subsets.items():
        subset.to_csv(data[path_key], index=False)
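
An ID-wise split should leave no ID in more than one subset. The following is a minimal sketch of that check on toy data, assuming an `ID` column as in the cell above; `split_by_id` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def split_by_id(df, training_ids, testing_ids, validation_ids, id_column="ID"):
    """Return (training, testing, validation) frames selected by ID."""
    groups = [set(training_ids), set(testing_ids), set(validation_ids)]
    # Each ID may belong to one subset only; otherwise samples from the
    # same ID would end up in more than one subset.
    if len(set.union(*groups)) != sum(len(group) for group in groups):
        raise ValueError("an ID is assigned to more than one subset")
    return tuple(
        df[df[id_column].isin(list(group))].reset_index(drop=True)
        for group in groups
    )


# Toy frame (hypothetical values): two samples for some of the IDs above
toy = pd.DataFrame({"ID": [0, 0, 2, 2, 9, 9, 10, 10], "value": list(range(8))})
training_df, testing_df, validation_df = split_by_id(
    toy, training_ids=[0, 2, 4, 6, 8], testing_ids=[9], validation_ids=[10]
)
print(len(training_df), len(testing_df), len(validation_df))
```

Taking lists for all three subsets is a design choice of the sketch; it lets the same helper handle more than one testing or validation ID.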