# Program Description of Dataset Standardization (Module 9)

This module focuses on preprocessing the divided and analyzed datasets. It performs standardization and, optionally, Principal Component Analysis (PCA) on the features of the training set, validation set, and test set. The main functions include:

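- **Standardization**: Scaling each feature to zero mean and unit variance. The mean and standard deviation are estimated on the training set only and then reused for the validation and test sets.
- **PCA (Principal Component Analysis)**: An optional transformation that projects the standardized features onto a set of orthogonal components. As written, the module keeps as many components as there are features, so the data is rotated rather than reduced in dimension. Whether PCA is applied is controlled by the `apply_pca` parameter.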

The module helps to ensure that the data used for modeling or analysis is well-conditioned and suitable for machine learning or statistical techniques.
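
As a minimal sketch of the standardization step, the example below fits a `StandardScaler` on a training split and reuses it on another split. The arrays are synthetic stand-ins and the `*_demo` names are hypothetical; they are not the module's real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the module's real features)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_test_demo = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

# Fit on the training split only, then reuse the same statistics elsewhere
scaler_demo = StandardScaler()
X_train_std = scaler_demo.fit_transform(X_train_demo)
X_test_std = scaler_demo.transform(X_test_demo)

# After scaling, each training feature has zero mean and unit variance
train_mean_ok = np.allclose(X_train_std.mean(axis=0), 0.0)
train_std_ok = np.allclose(X_train_std.std(axis=0), 1.0)

# The test split is scaled with the training mean and standard deviation
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
matches_manual = np.allclose(manual, X_test_std)

print(train_mean_ok, train_std_ok, matches_manual)
```

Fitting on the training split alone keeps information from the validation and test splits out of the scaling statistics, which is why those splits will generally not have exactly zero mean after the transform.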

### Key Parameters:
- **`apply_pca`**: Controls whether PCA is applied after standardization. Standardization is always performed. If `apply_pca` is `True`, the standardized features are additionally transformed with PCA; otherwise, the standardized features are saved as they are.

Contact: zhaohf@ihep.ac.cn

#  Import libraries

In [1]:
import os
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import joblib

# Parameter Settings

## Input file path:
- `dir_data` is the directory where the original datasets are located.
- The feature and label files for the training, validation, and test sets are loaded from this directory.
  Example: `'0926-datasets-2/datasets(JmolNN)'`

## Output file path:
- `dir_output` is the directory where the processed datasets are saved after the transformations have been applied. It is created if it does not exist.
  Example: `'0926-datasets/datasets(JmolNN)-pre-xmu-cn'`

## PCA Conversion flag (`apply_pca`):
- `apply_pca` is a flag indicating whether PCA (Principal Component Analysis) should be applied to the standardized datasets.
- Set to `True` to apply PCA after standardization; set to `False` to save the standardized features without PCA.
- Default is `False` (PCA will not be applied).
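
The effect of the flag can be sketched as follows. This is an illustration under assumptions, not the module's own code: `preprocess_demo` is a hypothetical helper, and the input arrays are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess_demo(X_train, X_other, apply_pca=False):
    """Hypothetical helper: standardize, then optionally apply PCA."""
    scaler = StandardScaler().fit(X_train)
    X_train_out = scaler.transform(X_train)
    X_other_out = scaler.transform(X_other)
    if apply_pca:
        # PCA() keeps min(n_samples, n_features) components by default
        pca = PCA()
        X_train_out = pca.fit_transform(X_train_out)
        X_other_out = pca.transform(X_other_out)
    return X_train_out, X_other_out

# Random stand-in data (hypothetical)
rng = np.random.default_rng(1)
X_a = rng.normal(size=(100, 5))
X_b = rng.normal(size=(20, 5))

std_only, _ = preprocess_demo(X_a, X_b, apply_pca=False)
with_pca, other_pca = preprocess_demo(X_a, X_b, apply_pca=True)

# On the training split, PCA scores are mutually uncorrelated,
# so the off-diagonal entries of their covariance matrix vanish
cov = np.cov(with_pca, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
uncorrelated = np.allclose(off_diag, 0.0, atol=1e-8)

print(std_only.shape, with_pca.shape, uncorrelated)
```

With all components kept, the output has the same shape as the input; only the coordinate axes change.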


In [2]:
# Set the input file path and output directory
dir_data = '0926-datasets/datasets(JmolNN)'
dir_output = os.path.join('0926-datasets', 'datasets(JmolNN)-pre-xmu-cn')
# Check if the input directory exists
if os.path.exists(dir_data):
    print(f"Directory '{dir_data}' exists.")
else:
    raise FileNotFoundError(f"Directory '{dir_data}' does not exist.")

# Create the output directory if it does not exist
os.makedirs(dir_output, exist_ok=True)

# Define a flag to determine whether to apply PCA 
apply_pca = False  # Set this to True if you want to apply PCA

# Load the training, validation, and test datasets (features and labels)
file_train_feature = os.path.join(dir_data, 'xmu_train_JmolNN.txt')
file_train_label = os.path.join(dir_data, 'label_cn_train_JmolNN.txt')
file_valid_feature = os.path.join(dir_data, 'xmu_valid_JmolNN.txt')
file_valid_label = os.path.join(dir_data, 'label_cn_valid_JmolNN.txt')
file_test_feature = os.path.join(dir_data, 'xmu_test_JmolNN.txt')
file_test_label = os.path.join(dir_data, 'label_cn_test_JmolNN.txt')


Directory '0926-datasets/datasets(JmolNN)' exists.
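
The cell above verifies only that the input directory exists; a missing feature or label file would surface later, when the file is loaded. One possible early check is sketched below. `find_missing_files` is a hypothetical helper, and the demonstration uses a temporary directory rather than the real dataset folder.

```python
import os
import tempfile

def find_missing_files(paths):
    """Hypothetical helper: return the paths that are not existing files."""
    return [p for p in paths if not os.path.isfile(p)]

# Demonstration in a temporary directory with one file present, one absent
with tempfile.TemporaryDirectory() as demo_dir:
    present = os.path.join(demo_dir, 'xmu_train_JmolNN.txt')
    absent = os.path.join(demo_dir, 'xmu_valid_JmolNN.txt')
    with open(present, 'w') as fh:
        fh.write('0.0 1.0\n')
    missing = find_missing_files([present, absent])
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)
```

In the notebook, the same helper could be called with the six `file_*` paths, raising `FileNotFoundError` if the returned list is not empty.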


# Main program

In [3]:
# Function: Generate result file path
def generate_file_path(input_file, output_dir, suffix='.txt', pca_applied=False):
    """Build the output path for input_file inside output_dir, keeping its base name."""
    # Strip only the final extension so that dots elsewhere in the name are preserved
    stem = os.path.splitext(os.path.basename(input_file))[0]
    # Append '_pca' suffix only if PCA is applied
    result_filename = f'{stem}{"_pca" if pca_applied else ""}{suffix}'
    return os.path.join(output_dir, result_filename)

# Read input data
X_train = np.loadtxt(file_train_feature)
y_train = np.loadtxt(file_train_label, dtype=float)
X_valid = np.loadtxt(file_valid_feature)
y_valid = np.loadtxt(file_valid_label, dtype=float)
X_test = np.loadtxt(file_test_feature)
y_test = np.loadtxt(file_test_label, dtype=float)

# Standardize the data using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

# Save the StandardScaler model for future use
scaler_path = os.path.join(dir_output, 'scaler.pkl')
joblib.dump(scaler, scaler_path)
print(f"Scaler saved to: {scaler_path}")

# Apply PCA if specified
if apply_pca:
    # Apply PCA and keep every component. PCA() defaults to min(n_samples, n_features),
    # which avoids a ValueError when there are fewer training samples than features
    pca = PCA()
    X_train_pca = pca.fit_transform(X_train)
    X_valid_pca = pca.transform(X_valid)
    X_test_pca = pca.transform(X_test)

    # Generate file paths for the PCA-transformed data
    train_file = generate_file_path(file_train_feature, dir_output, pca_applied=True)
    valid_file = generate_file_path(file_valid_feature, dir_output, pca_applied=True)
    test_file = generate_file_path(file_test_feature, dir_output, pca_applied=True)

    # Save PCA-transformed data to files
    np.savetxt(train_file, X_train_pca)
    np.savetxt(valid_file, X_valid_pca)
    np.savetxt(test_file, X_test_pca)

    print(f"PCA-transformed data has been saved to: {dir_output}")
else:
    # Save the standardized data without PCA
    train_file = generate_file_path(file_train_feature, dir_output)
    valid_file = generate_file_path(file_valid_feature, dir_output)
    test_file = generate_file_path(file_test_feature, dir_output)

    np.savetxt(train_file, X_train)
    np.savetxt(valid_file, X_valid)
    np.savetxt(test_file, X_test)

    print(f"Standardized data has been saved to: {dir_output}")

# Generate file paths for label files and save them
train_label_file = generate_file_path(file_train_label, dir_output)
valid_label_file = generate_file_path(file_valid_label, dir_output)
test_label_file = generate_file_path(file_test_label, dir_output)

np.savetxt(train_label_file, y_train)
np.savetxt(valid_label_file, y_valid)
np.savetxt(test_label_file, y_test)

print(f"Labels have been saved to: {dir_output}")


Scaler saved to: 0926-datasets/datasets(JmolNN)-pre-xmu-cn/scaler.pkl
Standardized data has been saved to: 0926-datasets/datasets(JmolNN)-pre-xmu-cn
Labels have been saved to: 0926-datasets/datasets(JmolNN)-pre-xmu-cn
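
The saved `scaler.pkl` is meant for later reuse. A minimal sketch of that reuse is shown below, assuming the scaler was written with `joblib.dump` as in the cell above. The data is synthetic, and a temporary directory stands in for `dir_output`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical)
rng = np.random.default_rng(2)
X_fit = rng.normal(loc=2.0, scale=4.0, size=(50, 3))
X_new = rng.normal(loc=2.0, scale=4.0, size=(5, 3))

# Save a fitted scaler, then load it back as a later session would
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scaler.pkl')
    joblib.dump(StandardScaler().fit(X_fit), path)
    loaded_scaler = joblib.load(path)

# Apply the stored training statistics to previously unseen samples
X_new_std = loaded_scaler.transform(X_new)

# inverse_transform recovers the original feature scale
X_back = loaded_scaler.inverse_transform(X_new_std)
round_trip_ok = np.allclose(X_back, X_new)

print(X_new_std.shape, round_trip_ok)
```

New samples must have the same number and order of features as the training data used to fit the scaler.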
