# Human Activity Recognition Dataset Recreation

This notebook recreates the UCI HAR Dataset feature vectors (X_train.txt) from the raw Inertial Signals data.

## Dataset Information
- **Sample Rate**: 50 Hz
- **Window Size**: 2.56 seconds (128 samples)
- **Window Overlap**: 50% (64 samples step)
- **Features**: 561 time and frequency domain features
- **Activities**: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING
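
The windowing parameters above can be sketched as follows. This is a minimal illustration only: the notebook loads pre-windowed data, and the helper name `sliding_windows` and the placeholder stream are assumptions made for the example.

```python
import numpy as np

def sliding_windows(stream, window_size=128, step=64):
    """Split a 1-D sample stream into fixed-size windows with the given step."""
    stream = np.asarray(stream)
    n_windows = (len(stream) - window_size) // step + 1
    if n_windows <= 0:
        return np.empty((0, window_size), dtype=stream.dtype)
    starts = np.arange(n_windows) * step
    # One row of sample indices per window
    idx = starts[:, None] + np.arange(window_size)[None, :]
    return stream[idx]

fs = 50                          # sample rate in Hz
stream = np.arange(10 * fs)      # 10 s of placeholder samples (500 values)
windows = sliding_windows(stream)
print(windows.shape)             # (6, 128)
```

With a 64-sample step, the second half of each window is identical to the first half of the next one, which is what the 50% overlap means in practice.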

## Feature Categories
1. **Time Domain**: tBodyAcc, tGravityAcc, tBodyAccJerk, tBodyGyro, tBodyGyroJerk, and their magnitudes
2. **Frequency Domain**: FFT applied to time domain signals (prefix 'f')
3. **Statistics**: mean, std, mad, max, min, sma, energy, iqr, entropy, arCoeff, correlation

In [36]:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.signal import butter, filtfilt
import os

## Load Raw Inertial Signals

Load the 9 raw signal files from the Inertial Signals directory:
- Body accelerometer (X, Y, Z)
- Body gyroscope (X, Y, Z)
- Total accelerometer (X, Y, Z)

In [37]:
# Load inertial signals
data_dir = 'dataset/train/Inertial Signals/'

# Load body accelerometer
body_acc_x = np.loadtxt(data_dir + 'body_acc_x_train.txt')
body_acc_y = np.loadtxt(data_dir + 'body_acc_y_train.txt')
body_acc_z = np.loadtxt(data_dir + 'body_acc_z_train.txt')

# Load body gyroscope
body_gyro_x = np.loadtxt(data_dir + 'body_gyro_x_train.txt')
body_gyro_y = np.loadtxt(data_dir + 'body_gyro_y_train.txt')
body_gyro_z = np.loadtxt(data_dir + 'body_gyro_z_train.txt')

# Load total accelerometer
total_acc_x = np.loadtxt(data_dir + 'total_acc_x_train.txt')
total_acc_y = np.loadtxt(data_dir + 'total_acc_y_train.txt')
total_acc_z = np.loadtxt(data_dir + 'total_acc_z_train.txt')

print(f"Data shape: {body_acc_x.shape}")
print(f"Number of windows: {body_acc_x.shape[0]}")
print(f"Samples per window: {body_acc_x.shape[1]}")

Data shape: (7352, 128)
Number of windows: 7352
Samples per window: 128


In [38]:
# Calculate gravity acceleration from total acceleration
# Gravity is the low-frequency component (< 0.3 Hz)
def apply_gravity_filter(signal, cutoff=0.3, fs=50, order=3):
    """Apply low-pass Butterworth filter to extract gravity component"""
    nyquist = 0.5 * fs
    normal_cutoff = cutoff / nyquist
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return filtfilt(b, a, signal)

# Calculate gravity acceleration for each window
gravity_acc_x = np.array([apply_gravity_filter(window) for window in total_acc_x])
gravity_acc_y = np.array([apply_gravity_filter(window) for window in total_acc_y])
gravity_acc_z = np.array([apply_gravity_filter(window) for window in total_acc_z])

print("Gravity acceleration calculated")

Gravity acceleration calculated
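
Filtering each 128-sample window separately can introduce edge effects at such a low cutoff. As an alternative sketch, and assuming the dataset README's description of the body acceleration files (total acceleration minus gravity) holds, gravity can be recovered by subtraction. The function name `gravity_by_subtraction` and the synthetic arrays are hypothetical.

```python
import numpy as np

def gravity_by_subtraction(total_acc, body_acc):
    """Recover gravity as total minus body acceleration (arrays of equal shape)."""
    total_acc = np.asarray(total_acc, dtype=float)
    body_acc = np.asarray(body_acc, dtype=float)
    if total_acc.shape != body_acc.shape:
        raise ValueError("total_acc and body_acc must have the same shape")
    return total_acc - body_acc

# Synthetic stand-ins for two windows of one axis (values in g)
rng = np.random.default_rng(0)
body = 0.05 * rng.standard_normal((2, 128))
gravity_true = np.full((2, 128), 0.98)
total = body + gravity_true

gravity_est = gravity_by_subtraction(total, body)
print(np.allclose(gravity_est, gravity_true))
```

In the notebook this would correspond to something like `total_acc_x - body_acc_x`; whether it matches the original features better than the per-window filter would need to be checked against the correlation results.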


## Feature Extraction Functions

Define the helper functions used to compute the per-window features.

In [39]:
def calculate_jerk(signal, dt=0.02):
    """Calculate jerk (derivative) of signal"""
    return np.diff(signal, axis=1) / dt

def calculate_magnitude(x, y, z):
    """Calculate Euclidean magnitude of 3D signal"""
    return np.sqrt(x**2 + y**2 + z**2)

def signal_magnitude_area(x, y, z):
    """Calculate signal magnitude area"""
    return np.mean(np.abs(x) + np.abs(y) + np.abs(z), axis=1)

def energy(signal):
    """Calculate energy: sum of squares / N"""
    return np.mean(signal**2, axis=1)

def iqr(signal):
    """Calculate interquartile range"""
    return np.percentile(signal, 75, axis=1) - np.percentile(signal, 25, axis=1)

def entropy(signal):
    """Calculate signal entropy"""
    # Normalize signal to create probability distribution
    signal_normalized = np.abs(signal) / (np.sum(np.abs(signal), axis=1, keepdims=True) + 1e-10)
    # Calculate entropy
    return -np.sum(signal_normalized * np.log(signal_normalized + 1e-10), axis=1)

def arCoeff(signal, order=4):
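    """Approximate AR coefficients by the first `order` normalized autocorrelation lags.

    Note: this is a simple stand-in, not the Burg method.
    """
    coeffs = []
    for window in signal:
        # Autocorrelation of the mean-removed window, non-negative lags only
        centered = window - np.mean(window)
        acf = np.correlate(centered, centered, mode='full')
        acf = acf[len(acf)//2:]
        # Normalize by lag 0; a constant window has zero variance, so return zeros
        acf = acf / acf[0] if acf[0] > 0 else np.zeros_like(acf)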
        ar = []
        for i in range(1, min(order + 1, len(acf))):
            ar.append(acf[i])
        while len(ar) < order:
            ar.append(0.0)
        coeffs.append(ar[:order])
    return np.array(coeffs)

def correlation_coeff(x, y):
    """Calculate correlation coefficient between two signals"""
    corr = []
    for i in range(len(x)):
        if np.std(x[i]) > 0 and np.std(y[i]) > 0:
            corr.append(np.corrcoef(x[i], y[i])[0, 1])
        else:
            corr.append(0.0)
    return np.array(corr)

print("Feature extraction functions defined")

Feature extraction functions defined
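
The `arCoeff` helper above returns autocorrelation lags rather than Burg estimates. For comparison, here is a minimal NumPy sketch of Burg's method. The function name `burg_ar` is hypothetical, and the sign convention (`x[n] + a_1*x[n-1] + ... = e[n]`) is an assumption; the convention used by the original dataset is not confirmed here.

```python
import numpy as np

def burg_ar(window, order=4):
    """Estimate AR coefficients a_1..a_order with Burg's method.

    Convention: x[n] + a_1*x[n-1] + ... + a_p*x[n-p] = e[n].
    """
    x = np.asarray(window, dtype=float)
    x = x - x.mean()              # remove the mean, as arCoeff does
    f = x[1:].copy()              # forward prediction errors
    b = x[:-1].copy()             # backward prediction errors
    a = np.array([1.0])           # coefficient vector, a[0] is always 1
    for _ in range(order):
        denom = np.dot(f, f) + np.dot(b, b)
        # Reflection coefficient minimizing forward + backward error power
        k = -2.0 * np.dot(f, b) / denom if denom > 0 else 0.0
        # Levinson-style update of the coefficient vector
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        # Update both error sequences, then drop one sample from each
        f, b = (f + k * b)[1:], (b + k * f)[:-1]
    return a[1:]

# Synthetic AR(2) series: x[n] = 1.2*x[n-1] - 0.5*x[n-2] + noise
rng = np.random.default_rng(0)
noise = rng.standard_normal(5000)
series = np.zeros(5000)
for n in range(2, 5000):
    series[n] = 1.2 * series[n - 1] - 0.5 * series[n - 2] + noise[n]

# Under the convention above, the estimate should be near [-1.2, 0.5]
print(burg_ar(series, order=2))
```

Applied per window, `np.array([burg_ar(w) for w in signal])` would have the same `(n_windows, 4)` shape as the output of `arCoeff`.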


In [40]:
def extract_features_3d(signal_x, signal_y, signal_z, prefix):
    """Extract time domain features for 3-axial signal"""
    features = {}
    
    # Basic statistics for each axis
    for axis, signal in [('X', signal_x), ('Y', signal_y), ('Z', signal_z)]:
        features[f'{prefix}-mean()-{axis}'] = np.mean(signal, axis=1)
        features[f'{prefix}-std()-{axis}'] = np.std(signal, axis=1)
        features[f'{prefix}-mad()-{axis}'] = np.mean(np.abs(signal - np.mean(signal, axis=1, keepdims=True)), axis=1)
        features[f'{prefix}-max()-{axis}'] = np.max(signal, axis=1)
        features[f'{prefix}-min()-{axis}'] = np.min(signal, axis=1)
        features[f'{prefix}-energy()-{axis}'] = energy(signal)
        features[f'{prefix}-iqr()-{axis}'] = iqr(signal)
        features[f'{prefix}-entropy()-{axis}'] = entropy(signal)
    
    # Signal magnitude area
    features[f'{prefix}-sma()'] = signal_magnitude_area(signal_x, signal_y, signal_z)
    
    # Autoregression coefficients (4 per axis)
    for axis, signal in [('X', signal_x), ('Y', signal_y), ('Z', signal_z)]:
        ar_coeffs = arCoeff(signal, order=4)
        for i in range(4):
            features[f'{prefix}-arCoeff()-{axis},{i+1}'] = ar_coeffs[:, i]
    
    # Correlation between axes
    features[f'{prefix}-correlation()-X,Y'] = correlation_coeff(signal_x, signal_y)
    features[f'{prefix}-correlation()-X,Z'] = correlation_coeff(signal_x, signal_z)
    features[f'{prefix}-correlation()-Y,Z'] = correlation_coeff(signal_y, signal_z)
    
    return features

print("3D feature extraction function defined")

3D feature extraction function defined
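
The `mad()` features above are computed as the mean absolute deviation from the mean. If, as the dataset's features_info.txt indicates, `mad()` denotes the median absolute deviation, a median-based variant may be closer to the original. The sketch below contrasts the two; both helper names are hypothetical.

```python
import numpy as np

def median_abs_deviation(signal):
    """Median absolute deviation from the median, per window (rows = windows)."""
    signal = np.asarray(signal, dtype=float)
    med = np.median(signal, axis=1, keepdims=True)
    return np.median(np.abs(signal - med), axis=1)

def mean_abs_deviation(signal):
    """Mean absolute deviation from the mean, as used in the cell above."""
    signal = np.asarray(signal, dtype=float)
    mean = np.mean(signal, axis=1, keepdims=True)
    return np.mean(np.abs(signal - mean), axis=1)

# One window with a single outlier shows how differently the two behave
window = np.array([[1.0, 2.0, 3.0, 4.0, 100.0]])
print(median_abs_deviation(window))  # [1.]
print(mean_abs_deviation(window))    # [31.2]
```

The median-based version is far less sensitive to outliers, so the two definitions can diverge noticeably on windows that contain spikes.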


In [41]:
def extract_features_magnitude(magnitude, prefix):
    """Extract time domain features for magnitude signal"""
    features = {}
    
    features[f'{prefix}-mean()'] = np.mean(magnitude, axis=1)
    features[f'{prefix}-std()'] = np.std(magnitude, axis=1)
    features[f'{prefix}-mad()'] = np.mean(np.abs(magnitude - np.mean(magnitude, axis=1, keepdims=True)), axis=1)
    features[f'{prefix}-max()'] = np.max(magnitude, axis=1)
    features[f'{prefix}-min()'] = np.min(magnitude, axis=1)
    features[f'{prefix}-sma()'] = np.mean(np.abs(magnitude), axis=1)
    features[f'{prefix}-energy()'] = energy(magnitude)
    features[f'{prefix}-iqr()'] = iqr(magnitude)
    features[f'{prefix}-entropy()'] = entropy(magnitude)
    
    # Autoregression coefficients
    ar_coeffs = arCoeff(magnitude, order=4)
    for i in range(4):
        features[f'{prefix}-arCoeff(){i+1}'] = ar_coeffs[:, i]
    
    return features

print("Magnitude feature extraction function defined")

Magnitude feature extraction function defined


In [42]:
def extract_fft_features_3d(signal_x, signal_y, signal_z, prefix, fs=50):
    """Extract frequency domain features for 3-axial signal"""
    features = {}
    
    for axis, signal in [('X', signal_x), ('Y', signal_y), ('Z', signal_z)]:
        # Apply FFT
        fft_vals = np.fft.fft(signal, axis=1)
        n = signal.shape[1]
        fft_magnitude = np.abs(fft_vals)[:, :n//2]
        freqs = np.fft.fftfreq(n, d=1/fs)[:n//2]
        
        # Basic statistics
        features[f'{prefix}-mean()-{axis}'] = np.mean(fft_magnitude, axis=1)
        features[f'{prefix}-std()-{axis}'] = np.std(fft_magnitude, axis=1)
        features[f'{prefix}-mad()-{axis}'] = np.mean(np.abs(fft_magnitude - np.mean(fft_magnitude, axis=1, keepdims=True)), axis=1)
        features[f'{prefix}-max()-{axis}'] = np.max(fft_magnitude, axis=1)
        features[f'{prefix}-min()-{axis}'] = np.min(fft_magnitude, axis=1)
        features[f'{prefix}-energy()-{axis}'] = energy(fft_magnitude)
        features[f'{prefix}-iqr()-{axis}'] = iqr(fft_magnitude)
        features[f'{prefix}-entropy()-{axis}'] = entropy(fft_magnitude)
        
        # Max index
        features[f'{prefix}-maxInds-{axis}'] = np.argmax(fft_magnitude, axis=1)
        
        # Mean frequency (weighted average)
        mean_freq = []
        for i in range(len(fft_magnitude)):
            total_power = np.sum(fft_magnitude[i])
            if total_power > 0:
                mean_freq.append(np.sum(freqs * fft_magnitude[i]) / total_power)
            else:
                mean_freq.append(0.0)
        features[f'{prefix}-meanFreq()-{axis}'] = np.array(mean_freq)
        
        # Skewness and kurtosis
        features[f'{prefix}-skewness()-{axis}'] = stats.skew(fft_magnitude, axis=1)
        features[f'{prefix}-kurtosis()-{axis}'] = stats.kurtosis(fft_magnitude, axis=1)
    
    # Signal magnitude area
    fft_x = np.abs(np.fft.fft(signal_x, axis=1))[:, :signal_x.shape[1]//2]
    fft_y = np.abs(np.fft.fft(signal_y, axis=1))[:, :signal_y.shape[1]//2]
    fft_z = np.abs(np.fft.fft(signal_z, axis=1))[:, :signal_z.shape[1]//2]
    features[f'{prefix}-sma()'] = np.mean(fft_x + fft_y + fft_z, axis=1)
    
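    # Band energy features: 14 bands per axis (eight 8-bin, four 16-bin, two 24-bin).
    # Note: the key below has no axis suffix, so the Y and Z passes overwrite the X values.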
    bands = [(0, 8), (8, 16), (16, 24), (24, 32), (32, 40), (40, 48), (48, 56), (56, 64),
             (0, 16), (16, 32), (32, 48), (48, 64), (0, 24), (24, 48)]
    
    for axis, signal in [('X', signal_x), ('Y', signal_y), ('Z', signal_z)]:
        fft_vals = np.fft.fft(signal, axis=1)
        n = signal.shape[1]
        fft_magnitude = np.abs(fft_vals)[:, :n//2]
        
        for start, end in bands:
            if end <= fft_magnitude.shape[1]:
                band_energy = np.sum(fft_magnitude[:, start:end]**2, axis=1) / (end - start)
                features[f'{prefix}-bandsEnergy()-{start+1},{end}'] = band_energy
    
    return features

print("FFT feature extraction function defined")

FFT feature extraction function defined
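
Because the band energy keys in `extract_fft_features_3d` carry no axis label, each axis writes to the same keys. One way to keep all three axes is to include the axis in the key, as in this sketch. The function name `band_energy_by_axis` and the `-X`/`-Y`/`-Z` suffix are assumptions; names built this way will not match features.txt directly, so columns would have to be mapped by position instead.

```python
import numpy as np

def band_energy_by_axis(signals, prefix, bands):
    """Compute band energies per axis.

    signals: dict mapping axis label -> array of shape (n_windows, n_samples)
    bands:   list of (start, end) FFT bin ranges, end exclusive
    """
    features = {}
    for axis, signal in signals.items():
        n = signal.shape[1]
        spectrum = np.abs(np.fft.fft(signal, axis=1))[:, :n // 2]
        for start, end in bands:
            if end > spectrum.shape[1]:
                continue  # band does not fit into the available bins
            key = f"{prefix}-bandsEnergy()-{start + 1},{end}-{axis}"
            features[key] = np.sum(spectrum[:, start:end] ** 2, axis=1) / (end - start)
    return features

rng = np.random.default_rng(0)
signals = {axis: rng.standard_normal((3, 128)) for axis in "XYZ"}
feats = band_energy_by_axis(signals, "fBodyAcc", [(0, 8), (8, 16)])
print(sorted(feats))
```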


In [43]:
def extract_fft_features_magnitude(magnitude, prefix, fs=50):
    """Extract frequency domain features for magnitude signal"""
    features = {}
    
    # Apply FFT
    fft_vals = np.fft.fft(magnitude, axis=1)
    n = magnitude.shape[1]
    fft_magnitude = np.abs(fft_vals)[:, :n//2]
    freqs = np.fft.fftfreq(n, d=1/fs)[:n//2]
    
    # Basic statistics
    features[f'{prefix}-mean()'] = np.mean(fft_magnitude, axis=1)
    features[f'{prefix}-std()'] = np.std(fft_magnitude, axis=1)
    features[f'{prefix}-mad()'] = np.mean(np.abs(fft_magnitude - np.mean(fft_magnitude, axis=1, keepdims=True)), axis=1)
    features[f'{prefix}-max()'] = np.max(fft_magnitude, axis=1)
    features[f'{prefix}-min()'] = np.min(fft_magnitude, axis=1)
    features[f'{prefix}-sma()'] = np.mean(np.abs(fft_magnitude), axis=1)
    features[f'{prefix}-energy()'] = energy(fft_magnitude)
    features[f'{prefix}-iqr()'] = iqr(fft_magnitude)
    features[f'{prefix}-entropy()'] = entropy(fft_magnitude)
    
    # Max index
    features[f'{prefix}-maxInds()'] = np.argmax(fft_magnitude, axis=1)
    
    # Mean frequency
    mean_freq = []
    for i in range(len(fft_magnitude)):
        total_power = np.sum(fft_magnitude[i])
        if total_power > 0:
            mean_freq.append(np.sum(freqs * fft_magnitude[i]) / total_power)
        else:
            mean_freq.append(0.0)
    features[f'{prefix}-meanFreq()'] = np.array(mean_freq)
    
    # Skewness and kurtosis
    features[f'{prefix}-skewness()'] = stats.skew(fft_magnitude, axis=1)
    features[f'{prefix}-kurtosis()'] = stats.kurtosis(fft_magnitude, axis=1)
    
    return features

print("FFT magnitude feature extraction function defined")

FFT magnitude feature extraction function defined


## Calculate Derived Signals

Calculate jerk signals and magnitudes for all sensor data.

In [44]:
# Calculate jerk signals (derivative of acceleration and angular velocity)
body_acc_jerk_x = calculate_jerk(body_acc_x)
body_acc_jerk_y = calculate_jerk(body_acc_y)
body_acc_jerk_z = calculate_jerk(body_acc_z)

body_gyro_jerk_x = calculate_jerk(body_gyro_x)
body_gyro_jerk_y = calculate_jerk(body_gyro_y)
body_gyro_jerk_z = calculate_jerk(body_gyro_z)

# Calculate magnitudes
body_acc_mag = calculate_magnitude(body_acc_x, body_acc_y, body_acc_z)
gravity_acc_mag = calculate_magnitude(gravity_acc_x, gravity_acc_y, gravity_acc_z)
body_acc_jerk_mag = calculate_magnitude(body_acc_jerk_x, body_acc_jerk_y, body_acc_jerk_z)
body_gyro_mag = calculate_magnitude(body_gyro_x, body_gyro_y, body_gyro_z)
body_gyro_jerk_mag = calculate_magnitude(body_gyro_jerk_x, body_gyro_jerk_y, body_gyro_jerk_z)

print("Derived signals calculated")
print(f"Jerk signal shape: {body_acc_jerk_x.shape}")
print(f"Magnitude signal shape: {body_acc_mag.shape}")

Derived signals calculated
Jerk signal shape: (7352, 127)
Magnitude signal shape: (7352, 128)
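
The jerk windows are one sample shorter than the other signals (127 vs. 128), which also means their spectra have 63 bins and any band ending at bin 64 is skipped later. A length-preserving derivative is one possible workaround, sketched here with `np.gradient`; the name `jerk_same_length` is hypothetical, and it is not known whether the original dataset computed jerk this way.

```python
import numpy as np

def jerk_same_length(signal, dt=0.02):
    """Time derivative per window that keeps the number of samples unchanged.

    Uses central differences in the interior and one-sided differences
    at the two edges (the behavior of np.gradient).
    """
    return np.gradient(np.asarray(signal, dtype=float), dt, axis=1)

# A linear ramp with slope 3 units per second should give a constant derivative
t = np.arange(128) * 0.02
ramp = (3.0 * t)[None, :]        # shape (1, 128)
jerk = jerk_same_length(ramp)
print(jerk.shape)                # (1, 128)
```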


## Extract All Features

Extract the features from each signal group into a dictionary keyed by feature name. The column order from features.txt is applied in a later step, when the DataFrame is assembled.

In [45]:
print("Extracting features...")
all_features = {}

# 1. tBodyAcc-XYZ (40 features)
print("Extracting tBodyAcc features...")
all_features.update(extract_features_3d(body_acc_x, body_acc_y, body_acc_z, 'tBodyAcc'))

# 2. tGravityAcc-XYZ (40 features)
print("Extracting tGravityAcc features...")
all_features.update(extract_features_3d(gravity_acc_x, gravity_acc_y, gravity_acc_z, 'tGravityAcc'))

# 3. tBodyAccJerk-XYZ (40 features)
print("Extracting tBodyAccJerk features...")
all_features.update(extract_features_3d(body_acc_jerk_x, body_acc_jerk_y, body_acc_jerk_z, 'tBodyAccJerk'))

# 4. tBodyGyro-XYZ (40 features)
print("Extracting tBodyGyro features...")
all_features.update(extract_features_3d(body_gyro_x, body_gyro_y, body_gyro_z, 'tBodyGyro'))

# 5. tBodyGyroJerk-XYZ (40 features)
print("Extracting tBodyGyroJerk features...")
all_features.update(extract_features_3d(body_gyro_jerk_x, body_gyro_jerk_y, body_gyro_jerk_z, 'tBodyGyroJerk'))

# 6. tBodyAccMag (13 features)
print("Extracting tBodyAccMag features...")
all_features.update(extract_features_magnitude(body_acc_mag, 'tBodyAccMag'))

# 7. tGravityAccMag (13 features)
print("Extracting tGravityAccMag features...")
all_features.update(extract_features_magnitude(gravity_acc_mag, 'tGravityAccMag'))

# 8. tBodyAccJerkMag (13 features)
print("Extracting tBodyAccJerkMag features...")
all_features.update(extract_features_magnitude(body_acc_jerk_mag, 'tBodyAccJerkMag'))

# 9. tBodyGyroMag (13 features)
print("Extracting tBodyGyroMag features...")
all_features.update(extract_features_magnitude(body_gyro_mag, 'tBodyGyroMag'))

# 10. tBodyGyroJerkMag (13 features)
print("Extracting tBodyGyroJerkMag features...")
all_features.update(extract_features_magnitude(body_gyro_jerk_mag, 'tBodyGyroJerkMag'))

print(f"Time domain features extracted: {len(all_features)} feature types")

Extracting features...
Extracting tBodyAcc features...
Extracting tGravityAcc features...
Extracting tBodyAccJerk features...
Extracting tBodyGyro features...
Extracting tBodyGyroJerk features...
Extracting tBodyAccMag features...
Extracting tGravityAccMag features...
Extracting tBodyAccJerkMag features...
Extracting tBodyGyroMag features...
Extracting tBodyGyroJerkMag features...
Time domain features extracted: 265 feature types


In [46]:
# 11. fBodyAcc-XYZ (79 features in features.txt)
print("Extracting fBodyAcc features...")
all_features.update(extract_fft_features_3d(body_acc_x, body_acc_y, body_acc_z, 'fBodyAcc'))

# 12. fBodyAccJerk-XYZ (79 features in features.txt)
print("Extracting fBodyAccJerk features...")
all_features.update(extract_fft_features_3d(body_acc_jerk_x, body_acc_jerk_y, body_acc_jerk_z, 'fBodyAccJerk'))

# 13. fBodyGyro-XYZ (79 features in features.txt)
print("Extracting fBodyGyro features...")
all_features.update(extract_fft_features_3d(body_gyro_x, body_gyro_y, body_gyro_z, 'fBodyGyro'))

# 14. fBodyAccMag (13 features)
print("Extracting fBodyAccMag features...")
all_features.update(extract_fft_features_magnitude(body_acc_mag, 'fBodyAccMag'))

# 15. fBodyAccJerkMag (13 features)
print("Extracting fBodyBodyAccJerkMag features...")
all_features.update(extract_fft_features_magnitude(body_acc_jerk_mag, 'fBodyBodyAccJerkMag'))

# 16. fBodyGyroMag (13 features)
print("Extracting fBodyBodyGyroMag features...")
all_features.update(extract_fft_features_magnitude(body_gyro_mag, 'fBodyBodyGyroMag'))

# 17. fBodyGyroJerkMag (13 features)
print("Extracting fBodyBodyGyroJerkMag features...")
all_features.update(extract_fft_features_magnitude(body_gyro_jerk_mag, 'fBodyBodyGyroJerkMag'))

print(f"Total features extracted: {len(all_features)} feature types")

Extracting fBodyAcc features...
Extracting fBodyAccJerk features...
Extracting fBodyGyro features...
Extracting fBodyAccMag features...
Extracting fBodyBodyAccJerkMag features...
Extracting fBodyBodyGyroMag features...
Extracting fBodyBodyGyroJerkMag features...
Total features extracted: 468 feature types


## Create Feature DataFrame

Combine all features into a DataFrame following the exact order from features.txt.

In [47]:
# Load the feature names from features.txt to ensure correct order
feature_names = []
with open('dataset/features.txt', 'r') as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) >= 2:
            feature_names.append(' '.join(parts[1:]))

print(f"Expected number of features: {len(feature_names)}")

# Create DataFrame with features in the correct order
feature_data = []
for feature_name in feature_names:
    if feature_name in all_features:
        feature_data.append(all_features[feature_name])
    else:
        # If feature not found, use zeros
        print(f"Warning: Feature '{feature_name}' not found, using zeros")
        feature_data.append(np.zeros(body_acc_x.shape[0]))

# Transpose to get windows as rows, features as columns
X_train_recreated = np.array(feature_data).T

print(f"Recreated dataset shape: {X_train_recreated.shape}")
print(f"Expected shape: ({body_acc_x.shape[0]}, 561)")

Expected number of features: 561
Recreated dataset shape: (7352, 561)
Expected shape: (7352, 561)


## Normalize Features

Normalize all features to [-1, 1] range to match the original UCI HAR dataset format.

In [48]:
# Normalize each feature to [-1, 1] range
print("Normalizing features to [-1, 1] range...")

X_train_normalized = np.zeros_like(X_train_recreated)

for i in range(X_train_recreated.shape[1]):
    feature_values = X_train_recreated[:, i]
    min_val = np.min(feature_values)
    max_val = np.max(feature_values)
    
    # Avoid division by zero
    if max_val - min_val > 1e-10:
        # Normalize to [-1, 1]
        X_train_normalized[:, i] = 2 * (feature_values - min_val) / (max_val - min_val) - 1
    else:
        # If all values are the same, set to 0
        X_train_normalized[:, i] = 0

print(f"Normalized dataset shape: {X_train_normalized.shape}")
print(f"Min value: {np.min(X_train_normalized):.6f}")
print(f"Max value: {np.max(X_train_normalized):.6f}")

# Update the main variable
X_train_recreated = X_train_normalized

Normalizing features to [-1, 1] range...
Normalized dataset shape: (7352, 561)
Min value: -1.000000
Max value: 1.000000
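
The per-column loop above can also be written in vectorized form, with the fitted min/max kept separate so they can be reused on new data. This is a minimal sketch; `fit_minmax` and `apply_minmax` are hypothetical names, and values outside the training range will fall outside [-1, 1] unless clipped.

```python
import numpy as np

def fit_minmax(X):
    """Return per-column minimum and maximum of a (n_windows, n_features) array."""
    X = np.asarray(X, dtype=float)
    return X.min(axis=0), X.max(axis=0)

def apply_minmax(X, min_val, max_val, eps=1e-10):
    """Scale columns to [-1, 1] using previously fitted min/max values."""
    X = np.asarray(X, dtype=float)
    span = max_val - min_val
    constant = span <= eps
    # Use a dummy span of 1 for constant columns to avoid division by zero
    safe_span = np.where(constant, 1.0, span)
    scaled = 2.0 * (X - min_val) / safe_span - 1.0
    # Constant columns carry no information, so map them to 0
    return np.where(constant, 0.0, scaled)

X_fit = np.array([[0.0, 5.0], [10.0, 5.0]])
lo, hi = fit_minmax(X_fit)
print(apply_minmax(X_fit, lo, hi))                  # [[-1. 0.] [ 1. 0.]]
print(apply_minmax(np.array([[5.0, 5.0]]), lo, hi))  # [[0. 0.]]
```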


## Save Normalization Parameters

Save the min/max values for each feature so they can be used in real-time normalization in main.py.

In [49]:
# Save normalization parameters for real-time use
print("Saving normalization parameters...")

# Create a dictionary to store min and max values for each feature
normalization_params = {}

for i, feature_name in enumerate(feature_names):
    feature_values = np.array(feature_data[i])
    
    min_val = np.min(feature_values)
    max_val = np.max(feature_values)
    
    normalization_params[feature_name] = {
        'min': float(min_val),
        'max': float(max_val),
        'feature_index': i
    }

# Save to JSON file for easy loading in main.py
import json

output_file_json = 'normalization_params.json'
with open(output_file_json, 'w') as f:
    json.dump(normalization_params, f, indent=2)

print(f"Normalization parameters saved to {output_file_json}")
print(f"\nTotal features: {len(normalization_params)}")

# Display first few parameters
print("\nFirst 10 normalization parameters:")
for i, (feature_name, params) in enumerate(list(normalization_params.items())[:10]):
    print(f"{feature_name}:")
    print(f"  min: {params['min']:.6f}, max: {params['max']:.6f}")

print("\nNormalization formula: normalized = 2 * (value - min) / (max - min) - 1")

# Also save as Python code for direct inclusion
output_code_file = 'normalization_params.py'
with open(output_code_file, 'w') as f:
    f.write("# Normalization parameters for feature scaling\n")
    f.write("# Generated from UCI HAR dataset training data\n")
    f.write("# Units: accelerometer in g's, gyroscope in rad/s\n\n")
    f.write("NORMALIZATION_PARAMS = {\n")
    for feature_name, params in normalization_params.items():
        f.write(f"    '{feature_name}': {{'min': {params['min']}, 'max': {params['max']}}},\n")
    f.write("}\n\n")
    f.write("def normalize_feature(value, feature_name):\n")
    f.write("    \"\"\"Normalize a feature value to [-1, 1] range\n")
    f.write("    \n")
    f.write("    Args:\n")
    f.write("        value: Raw feature value (in g's for acceleration, rad/s for gyroscope)\n")
    f.write("        feature_name: Name of the feature (e.g., 'tBodyAcc-mean()-X')\n")
    f.write("    \n")
    f.write("    Returns:\n")
    f.write("        Normalized value in range [-1, 1]\n")
    f.write("    \"\"\"\n")
    f.write("    params = NORMALIZATION_PARAMS.get(feature_name)\n")
    f.write("    if params is None:\n")
    f.write("        raise ValueError(f'Feature {feature_name} not found in normalization parameters')\n")
    f.write("    \n")
    f.write("    min_val = params['min']\n")
    f.write("    max_val = params['max']\n")
    f.write("    \n")
    f.write("    # Avoid division by zero\n")
    f.write("    if abs(max_val - min_val) < 1e-10:\n")
    f.write("        return 0.0\n")
    f.write("    \n")
    f.write("    # Normalize to [-1, 1]\n")
    f.write("    return 2 * (value - min_val) / (max_val - min_val) - 1\n")

print(f"\nNormalization code saved to {output_code_file}")
print("\nUsage in main.py:")
print("  from normalization_params import normalize_feature, NORMALIZATION_PARAMS")
print("  normalized_value = normalize_feature(raw_value, 'tBodyAcc-mean()-X')")
print("\nIMPORTANT: Make sure to convert accelerometer from m/s² to g's first!")
print("  acceleration_g = acceleration_ms2 / 9.80665")

Saving normalization parameters...
Normalization parameters saved to normalization_params.json

Total features: 477

First 10 normalization parameters:
tBodyAcc-mean()-X:
  min: -0.263284, max: 0.148878
tBodyAcc-mean()-Y:
  min: -0.515524, max: 0.533502
tBodyAcc-mean()-Z:
  min: -0.294562, max: 0.366119
tBodyAcc-std()-X:
  min: 0.001413, max: 0.648675
tBodyAcc-std()-Y:
  min: 0.001748, max: 0.327796
tBodyAcc-std()-Z:
  min: 0.003014, max: 0.361280
tBodyAcc-mad()-X:
  min: 0.001122, max: 0.569274
tBodyAcc-mad()-Y:
  min: 0.001294, max: 0.273391
tBodyAcc-mad()-Z:
  min: 0.002383, max: 0.287403
tBodyAcc-max()-X:
  min: -0.032683, max: 1.299912

Normalization formula: normalized = 2 * (value - min) / (max - min) - 1

Normalization code saved to normalization_params.py

Usage in main.py:
  from normalization_params import normalize_feature, NORMALIZATION_PARAMS
  normalized_value = normalize_feature(raw_value, 'tBodyAcc-mean()-X')

IMPORTANT: Make sure to convert accelerometer from m/s² to 
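
The total printed above (477) is smaller than the 561 feature names loaded earlier. That is what happens when a dictionary is keyed by names that repeat: later entries replace earlier ones. Repeated names in features.txt are one plausible cause, assumed here rather than verified. The sketch below shows how to detect repeats and how a list keyed by position keeps every column; both helper names are hypothetical.

```python
from collections import Counter

def find_duplicate_names(names):
    """Return {name: count} for every name that occurs more than once."""
    counts = Counter(names)
    return {name: count for name, count in counts.items() if count > 1}

def params_as_list(names, mins, maxs):
    """Store one record per column so repeated names cannot collide."""
    return [
        {"feature_index": i, "name": name, "min": float(lo), "max": float(hi)}
        for i, (name, lo, hi) in enumerate(zip(names, mins, maxs))
    ]

# Illustrative names only; the third one repeats the second
names = ["tBodyAcc-mean()-X", "fBodyAcc-bandsEnergy()-1,8", "fBodyAcc-bandsEnergy()-1,8"]
mins = [-0.2, 0.0, 0.1]
maxs = [0.1, 2.0, 3.0]

print(find_duplicate_names(names))            # {'fBodyAcc-bandsEnergy()-1,8': 2}
print(len(params_as_list(names, mins, maxs)))  # 3
```

A list serializes to JSON just as easily as a dictionary, and `feature_index` gives the consumer an unambiguous column position.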

## Filter to Desired Features Only

Load `features_DESIRED.txt` which contains the column numbers and names for features we want to keep from the original dataset. Any columns not in this file will be discarded.

In [50]:
# Load the desired features list
print("Loading desired features from features_DESIRED.txt...")
desired_features = []
desired_indices = []

with open('dataset/features_DESIRED.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(maxsplit=1)
        if len(parts) >= 2:
            col_num = int(parts[0])
            feature_name = parts[1]
            desired_features.append(feature_name)
            desired_indices.append(col_num - 1)  # Convert to 0-based index

print(f"Number of desired features: {len(desired_features)}")
print(f"Feature indices range: {min(desired_indices)} to {max(desired_indices)}")

# Filter the recreated dataset to only include desired features
X_train_recreated_filtered = X_train_recreated[:, desired_indices]

print(f"\nFiltered recreated dataset shape: {X_train_recreated_filtered.shape}")
print(f"Expected shape: ({X_train_recreated.shape[0]}, {len(desired_features)})")

Loading desired features from features_DESIRED.txt...
Number of desired features: 87
Feature indices range: 0 to 545

Filtered recreated dataset shape: (7352, 87)
Expected shape: (7352, 87)
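
Since the filter relies on column numbers alone, it can be worth checking that each number in `features_DESIRED.txt` points at the same name in the master list. The sketch below parses lines in the same "number name" format and reports disagreements; the function names and sample data are hypothetical.

```python
def parse_feature_list(lines):
    """Parse 'col_num name' lines into 0-based indices and names."""
    indices, names = [], []
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        indices.append(int(parts[0]) - 1)
        names.append(parts[1])
    return indices, names

def check_against_master(indices, names, master_names):
    """Return (index, wanted_name, found_name) for every disagreement."""
    mismatches = []
    for idx, wanted in zip(indices, names):
        found = master_names[idx] if 0 <= idx < len(master_names) else None
        if found != wanted:
            mismatches.append((idx, wanted, found))
    return mismatches

master = ["tBodyAcc-mean()-X", "tBodyAcc-mean()-Y", "tBodyAcc-mean()-Z"]
lines = ["1 tBodyAcc-mean()-X", "3 tBodyAcc-mean()-Z", "", "2 tBodyAcc-std()-X"]

idx, wanted = parse_feature_list(lines)
print(idx)                                        # [0, 2, 1]
print(check_against_master(idx, wanted, master))  # [(1, 'tBodyAcc-std()-X', 'tBodyAcc-mean()-Y')]
```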


## Save Filtered Dataset

Save both the full dataset and the filtered dataset with only desired features. The filtered dataset is saved as CSV with column headers from `features_DESIRED.txt`.


In [51]:
# Save the full recreated dataset (all 561 features)
output_file_full = 'X_train_recreated_full.txt'
try:
    np.savetxt(output_file_full, X_train_recreated, fmt='%.16f')
    print(f"Full recreated dataset (561 features) saved to {output_file_full}")
except Exception as e:
    print(f"Error saving full dataset: {e}")
    # Retry with an explicit absolute path (os is already imported at the top)
    output_file_full = os.path.join(os.getcwd(), 'X_train_recreated_full.txt')
    np.savetxt(output_file_full, X_train_recreated, fmt='%.16f')
    print(f"Full recreated dataset (561 features) saved to {output_file_full}")

# Save the filtered dataset as CSV with headers
output_file_csv = 'X_train_recreated_filtered.csv'

# Create DataFrame with feature names as column headers
df_filtered = pd.DataFrame(X_train_recreated_filtered, columns=desired_features)

# Save to CSV
df_filtered.to_csv(output_file_csv, index=False)
print(f"Filtered recreated dataset ({len(desired_features)} features) saved to {output_file_csv}")

# Also save as TXT for backward compatibility (optional)
output_file_filtered = 'X_train_recreated_filtered.txt'
np.savetxt(output_file_filtered, X_train_recreated_filtered, fmt='%.16f')
print(f"Filtered recreated dataset ({len(desired_features)} features) also saved to {output_file_filtered} (TXT format)")

# Display first few rows
print(f"\nFirst 5 rows of CSV (first 5 columns):")
print(df_filtered.head())


Full recreated dataset (561 features) saved to X_train_recreated_full.txt
Filtered recreated dataset (87 features) saved to X_train_recreated_filtered.csv
Filtered recreated dataset (87 features) also saved to X_train_recreated_filtered.txt (TXT format)

First 5 rows of CSV (first 5 columns):
   tBodyAcc-mean()-X  tBodyAcc-mean()-Y  tBodyAcc-mean()-Z  tBodyAcc-max()-X  \
0           0.288585          -0.020294          -0.132905         -0.934724   
1           0.278419          -0.016411          -0.123520         -0.943068   
2           0.279653          -0.019467          -0.113462         -0.938692   
3           0.279174          -0.026201          -0.123283         -0.938692   
4           0.276629          -0.016570          -0.115362         -0.942469   

   tBodyAcc-max()-Y  tBodyAcc-max()-Z  tBodyAcc-min()-X  tBodyAcc-min()-Y  \
0         -0.567378         -0.744413          0.852947          0.

## Verification - Compare Desired Features Only

Compare the filtered recreated dataset with the original dataset on the desired features only, so that the correlation statistics reflect just the features we intend to keep.

In [52]:
# Load original X_train.txt and filter to desired features
print("Loading original X_train.txt...")
try:
    X_train_original = np.loadtxt('dataset/train/X_train.txt')
    print(f"Original dataset shape: {X_train_original.shape}")
    
    # Filter original dataset to only include desired features
    X_train_original_filtered = X_train_original[:, desired_indices]
    print(f"Filtered original dataset shape: {X_train_original_filtered.shape}")
    print(f"Filtered recreated dataset shape: {X_train_recreated_filtered.shape}")
    
    # Compare shapes
    if X_train_original_filtered.shape == X_train_recreated_filtered.shape:
        print("✓ Shapes match!")
    else:
        print("✗ Shapes do not match!")
    
    # Calculate correlation for each desired feature
    print(f"\n{'='*60}")
    print("COMPARISON: DESIRED FEATURES ONLY ({} features)".format(len(desired_features)))
    print('='*60)
    
    correlations_filtered = []
    for i in range(X_train_original_filtered.shape[1]):
        corr = np.corrcoef(X_train_original_filtered[:, i], X_train_recreated_filtered[:, i])[0, 1]
        correlations_filtered.append(corr)
    
    correlations_filtered = np.array(correlations_filtered)
    print(f"\nMean correlation across desired features: {np.nanmean(correlations_filtered):.4f}")
    print(f"Median correlation: {np.nanmedian(correlations_filtered):.4f}")
    print(f"Features with correlation > 0.9: {np.sum(correlations_filtered > 0.9)}/{len(correlations_filtered)}")
    print(f"Features with correlation > 0.95: {np.sum(correlations_filtered > 0.95)}/{len(correlations_filtered)}")
    print(f"Features with correlation > 0.99: {np.sum(correlations_filtered > 0.99)}/{len(correlations_filtered)}")
    print(f"Features with correlation > 0.999: {np.sum(correlations_filtered > 0.999)}/{len(correlations_filtered)}")
    
    # Find features with low correlation
    low_corr_features_filtered = np.where(correlations_filtered < 0.9)[0]
    if len(low_corr_features_filtered) > 0:
        print(f"\nFeatures with correlation < 0.9: {len(low_corr_features_filtered)}")
        print("Low-correlation desired features:")
        for idx in low_corr_features_filtered:
            orig_idx = desired_indices[idx]
            print(f"  Feature {orig_idx+1} ({desired_features[idx]}): correlation = {correlations_filtered[idx]:.4f}")
    else:
        print("\n✓ All desired features have correlation ≥ 0.9!")
    
    # Calculate mean absolute error
    mae_filtered = np.mean(np.abs(X_train_original_filtered - X_train_recreated_filtered))
    print(f"\nMean Absolute Error (desired features): {mae_filtered:.6f}")
    
    # Calculate RMSE
    rmse_filtered = np.sqrt(np.mean((X_train_original_filtered - X_train_recreated_filtered)**2))
    print(f"Root Mean Square Error (desired features): {rmse_filtered:.6f}")
    
    # Show comparison for first few samples
    print("\nSample comparison (first 3 windows, first 5 desired features):")
    print("Original:")
    print(X_train_original_filtered[:3, :5])
    print("\nRecreated:")
    print(X_train_recreated_filtered[:3, :5])
    print("\nDifference:")
    print(X_train_original_filtered[:3, :5] - X_train_recreated_filtered[:3, :5])
    
    # Summary statistics
    print(f"\n{'='*60}")
    print("SUMMARY")
    print('='*60)
    print(f"Total desired features: {len(desired_features)}")
    print(f"Perfect matches (>0.999): {np.sum(correlations_filtered > 0.999)}")
    print(f"Excellent matches (>0.99): {np.sum(correlations_filtered > 0.99)}")
    print(f"Good matches (>0.95): {np.sum(correlations_filtered > 0.95)}")
    print(f"Acceptable matches (>0.9): {np.sum(correlations_filtered > 0.9)}")
    print(f"Poor matches (<0.9): {len(low_corr_features_filtered)}")
    
    success_rate = np.sum(correlations_filtered > 0.9) / len(correlations_filtered) * 100
    print(f"\nSuccess rate (correlation > 0.9): {success_rate:.1f}%")
    
except FileNotFoundError:
    print("Error: dataset/train/X_train.txt not found")
except Exception as e:
    print(f"Error loading original dataset: {e}")
    import traceback
    traceback.print_exc()

Loading original X_train.txt...
Original dataset shape: (7352, 561)
Filtered original dataset shape: (7352, 87)
Filtered recreated dataset shape: (7352, 87)
✓ Shapes match!

COMPARISON: DESIRED FEATURES ONLY (87 features)

Mean correlation across desired features: 0.9645
Median correlation: 0.9989
Features with correlation > 0.9: 81/87
Features with correlation > 0.95: 70/87
Features with correlation > 0.99: 61/87
Features with correlation > 0.999: 43/87

Features with correlation < 0.9: 6
Low-correlation desired features:
  Feature 357 (fBodyAccJerk-min()-X): correlation = 0.5993
  Feature 358 (fBodyAccJerk-min()-Y): correlation = 0.5778
  Feature 359 (fBodyAccJerk-min()-Z): correlation = 0.5889
  Feature 433 (fBodyGyro-max()-X): correlation = 0.8193
  Feature 520 (fBodyBodyAccJerkMag-min()): correlation = 0.6619
  Feature 546 (fBodyBodyGyroJerkMag-min()): correlation = 0.6974

Mean Absolute Error (desired features): 0.030257
Root Mean Square Error (desired features): 0.090585

Sample
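The MAE and RMSE reported above are averaged over every window and every desired feature, so a few badly matched features can hide behind a good overall figure. A minimal sketch of a per-feature breakdown follows; `per_feature_errors` and the toy arrays are hypothetical names introduced for illustration, not part of the notebook.

```python
import numpy as np


def per_feature_errors(original, recreated):
    """Return (mae, rmse) arrays with one entry per feature column."""
    diff = np.asarray(original, dtype=float) - np.asarray(recreated, dtype=float)
    mae = np.mean(np.abs(diff), axis=0)
    rmse = np.sqrt(np.mean(diff ** 2, axis=0))
    return mae, rmse


# Toy data: column 0 matches exactly, column 1 does not.
original_demo = np.array([[0.1, 1.0],
                          [0.2, -1.0]])
recreated_demo = np.array([[0.1, 0.0],
                           [0.2, 1.0]])

mae_per_feature, rmse_per_feature = per_feature_errors(original_demo, recreated_demo)

# Column indices ordered from largest to smallest MAE.
worst_first = np.argsort(mae_per_feature)[::-1]
print("MAE per feature:", mae_per_feature)
print("RMSE per feature:", rmse_per_feature)
print("Worst feature index:", worst_first[0])
```

In the notebook, the same function could presumably be called on `X_train_original_filtered` and `X_train_recreated_filtered`, with `desired_features` used to label the worst columns.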

## Save Recreated Dataset

Save the recreated feature vectors to a text file.

In [53]:
# Save to file
output_file = 'X_train_recreated.txt'
np.savetxt(output_file, X_train_recreated, fmt='%.16f')
print(f"Recreated dataset saved to {output_file}")

# Display first few rows and columns
print("\nFirst 5 rows, first 10 features:")
print(X_train_recreated[:5, :10])

Recreated dataset saved to X_train_recreated.txt

First 5 rows, first 10 features:
[[ 0.2885845  -0.02029417 -0.13290515 -0.9952786  -0.98250382 -0.91352645
  -0.99511208 -0.98290823 -0.92352702 -0.9347238 ]
 [ 0.27841882 -0.01641057 -0.12352019 -0.99824528 -0.97435149 -0.96032199
  -0.99880719 -0.97450212 -0.95768622 -0.94306754]
 [ 0.27965306 -0.01946715 -0.1134617  -0.99537956 -0.96588307 -0.97894396
  -0.99651994 -0.96307131 -0.97746859 -0.93869157]
 [ 0.27917393 -0.02620064 -0.12328257 -0.99609149 -0.9828087  -0.9906751
  -0.99709947 -0.98246636 -0.9893025  -0.93869157]
 [ 0.27662876 -0.01656965 -0.11536186 -0.99813862 -0.98011008 -0.99048163
  -0.99832113 -0.9793378  -0.99044113 -0.94246914]]


## Verification (Optional)

Compare the recreated features with the original `X_train.txt` using feature-wise correlations, mean absolute error, and RMSE to verify correctness.

In [54]:
# Load original X_train.txt for comparison
print("Loading original X_train.txt...")
try:
    # Load the whole file into memory (np.loadtxt does not read in chunks)
    X_train_original = np.loadtxt('dataset/train/X_train.txt')
    print(f"Original dataset shape: {X_train_original.shape}")
    print(f"Recreated dataset shape: {X_train_recreated.shape}")
    
    # Compare shapes
    if X_train_original.shape == X_train_recreated.shape:
        print("✓ Shapes match!")
    else:
        print("✗ Shapes do not match!")
    
    # Calculate correlation for each feature
    print("\nCalculating feature-wise correlations...")
    correlations = []
    for i in range(min(X_train_original.shape[1], X_train_recreated.shape[1])):
        corr = np.corrcoef(X_train_original[:, i], X_train_recreated[:, i])[0, 1]
        correlations.append(corr)
    
    correlations = np.array(correlations)
    print(f"Mean correlation across all features: {np.nanmean(correlations):.4f}")
    print(f"Median correlation: {np.nanmedian(correlations):.4f}")
    print(f"Features with correlation > 0.9: {np.sum(correlations > 0.9)}/{len(correlations)}")
    print(f"Features with correlation > 0.95: {np.sum(correlations > 0.95)}/{len(correlations)}")
    print(f"Features with correlation > 0.99: {np.sum(correlations > 0.99)}/{len(correlations)}")
    
    # Find features with low correlation
    low_corr_features = np.where(correlations < 0.8)[0]
    if len(low_corr_features) > 0:
        print(f"\nFeatures with correlation < 0.8: {len(low_corr_features)}")
        print("First 10 low-correlation features:")
        for idx in low_corr_features[:10]:
            print(f"  Feature {idx+1} ({feature_names[idx]}): correlation = {correlations[idx]:.4f}")
    
    # Calculate mean absolute error
    mae = np.mean(np.abs(X_train_original - X_train_recreated))
    print(f"\nMean Absolute Error: {mae:.6f}")
    
    # Calculate RMSE
    rmse = np.sqrt(np.mean((X_train_original - X_train_recreated)**2))
    print(f"Root Mean Square Error: {rmse:.6f}")
    
    # Show comparison for first few samples
    print("\nSample comparison (first 3 windows, first 5 features):")
    print("Original:")
    print(X_train_original[:3, :5])
    print("\nRecreated:")
    print(X_train_recreated[:3, :5])
    print("\nDifference:")
    print(X_train_original[:3, :5] - X_train_recreated[:3, :5])
    
except FileNotFoundError:
    print("Error: dataset/train/X_train.txt not found")
except Exception as e:
    print(f"Error loading original dataset: {e}")

Loading original X_train.txt...
Original dataset shape: (7352, 561)
Recreated dataset shape: (7352, 561)
✓ Shapes match!

Calculating feature-wise correlations...
Mean correlation across all features: 0.6234
Median correlation: 0.9303
Features with correlation > 0.9: 289/561
Features with correlation > 0.95: 257/561
Features with correlation > 0.99: 195/561

Features with correlation < 0.8: 236
First 10 low-correlation features:
  Feature 23 (tBodyAcc-entropy()-X): correlation = 0.2589
  Feature 24 (tBodyAcc-entropy()-Y): correlation = -0.0258
  Feature 25 (tBodyAcc-entropy()-Z): correlation = -0.0642
  Feature 26 (tBodyAcc-arCoeff()-X,1): correlation = -0.7167
  Feature 27 (tBodyAcc-arCoeff()-X,2): correlation = 0.2997
  Feature 28 (tBodyAcc-arCoeff()-X,3): correlation = -0.2317
  Feature 29 (tBodyAcc-arCoeff()-X,4): correlation = -0.1230
  Feature 30 (tBodyAcc-arCoeff()-Y,1): correlation = -0.4860
  Feature 31 (tBodyAcc-arCoeff()-Y,2): correlation = -0.0062
  Feature 32 (tBodyAcc-arC

  c /= stddev[:, None]
  c /= stddev[None, :]
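The two `c /= stddev[...]` lines above appear to be the tail of a NumPy `RuntimeWarning` raised inside `np.corrcoef` when a column has zero standard deviation, which yields a NaN correlation. As a sketch, the per-feature loop can be replaced by a vectorized computation that returns NaN for constant columns without emitting the warning. The name `columnwise_correlation` and the toy arrays are hypothetical.

```python
import numpy as np


def columnwise_correlation(a, b):
    """Pearson correlation between matching columns of a and b.

    Columns that are constant in either array get NaN instead of
    triggering a divide-by-zero warning.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)
    numerator = (a_centered * b_centered).sum(axis=0)
    denominator = np.sqrt((a_centered ** 2).sum(axis=0) * (b_centered ** 2).sum(axis=0))
    result = np.full(a.shape[1], np.nan)
    # Only divide where the denominator is positive; other entries stay NaN.
    np.divide(numerator, denominator, out=result, where=denominator > 0)
    return result


# Toy data: column 0 is perfectly correlated, column 1 perfectly
# anti-correlated, column 2 is constant in the "original".
original_demo = np.array([[1.0, 2.0, 5.0],
                          [2.0, 4.0, 5.0],
                          [3.0, 6.0, 5.0],
                          [4.0, 9.0, 5.0]])
recreated_demo = np.array([[2.0, -2.0, 1.0],
                           [4.0, -4.0, 2.0],
                           [6.0, -6.0, 3.0],
                           [8.0, -9.0, 4.0]])

corr_demo = columnwise_correlation(original_demo, recreated_demo)
print(corr_demo)
```

NaN entries then behave as in the notebook: `np.nanmean` and `np.nanmedian` skip them, and threshold comparisons such as `corr_demo > 0.9` treat them as False.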


## Process Test Dataset

Now let's apply the same processing pipeline to the test dataset to create `X_test_recreated_filtered.csv`.


In [55]:
# Load test inertial signals
print("Loading test dataset inertial signals...")
test_data_dir = 'dataset/test/Inertial Signals/'

# Load body accelerometer
body_acc_x_test = np.loadtxt(test_data_dir + 'body_acc_x_test.txt')
body_acc_y_test = np.loadtxt(test_data_dir + 'body_acc_y_test.txt')
body_acc_z_test = np.loadtxt(test_data_dir + 'body_acc_z_test.txt')

# Load body gyroscope
body_gyro_x_test = np.loadtxt(test_data_dir + 'body_gyro_x_test.txt')
body_gyro_y_test = np.loadtxt(test_data_dir + 'body_gyro_y_test.txt')
body_gyro_z_test = np.loadtxt(test_data_dir + 'body_gyro_z_test.txt')

# Load total accelerometer
total_acc_x_test = np.loadtxt(test_data_dir + 'total_acc_x_test.txt')
total_acc_y_test = np.loadtxt(test_data_dir + 'total_acc_y_test.txt')
total_acc_z_test = np.loadtxt(test_data_dir + 'total_acc_z_test.txt')

print(f"Test data shape: {body_acc_x_test.shape}")
print(f"Number of test windows: {body_acc_x_test.shape[0]}")
print(f"Samples per window: {body_acc_x_test.shape[1]}")


Loading test dataset inertial signals...
Test data shape: (2947, 128)
Number of test windows: 2947
Samples per window: 128
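Because the train and test cells repeat the same nine `np.loadtxt` calls, the loading step could be folded into one helper. The sketch below is one possible shape for it; `load_inertial_signals` and `SIGNAL_NAMES` are hypothetical names, and the demonstration writes small synthetic files to a temporary directory rather than reading the real dataset.

```python
import os
import tempfile

import numpy as np

# The nine signal prefixes used in the Inertial Signals directory.
SIGNAL_NAMES = [
    f"{kind}_{axis}"
    for kind in ("body_acc", "body_gyro", "total_acc")
    for axis in "xyz"
]


def load_inertial_signals(data_dir, split):
    """Load the nine signal files for one split ('train' or 'test') into a dict."""
    signals = {}
    for name in SIGNAL_NAMES:
        path = os.path.join(data_dir, f"{name}_{split}.txt")
        signals[name] = np.loadtxt(path)
    return signals


# Demonstration on synthetic files with the same naming scheme.
with tempfile.TemporaryDirectory() as tmp_dir:
    rng = np.random.default_rng(0)
    for name in SIGNAL_NAMES:
        np.savetxt(os.path.join(tmp_dir, f"{name}_test.txt"), rng.normal(size=(4, 128)))
    demo_signals = load_inertial_signals(tmp_dir, "test")

print(sorted(demo_signals))
print(demo_signals["body_acc_x"].shape)
```

Against the real data the call would presumably be `load_inertial_signals('dataset/test/Inertial Signals/', 'test')`, with the dict keys replacing the individual `*_test` variables.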


In [56]:
# Calculate gravity acceleration for test data
print("Calculating gravity acceleration for test data...")
gravity_acc_x_test = np.array([apply_gravity_filter(window) for window in total_acc_x_test])
gravity_acc_y_test = np.array([apply_gravity_filter(window) for window in total_acc_y_test])
gravity_acc_z_test = np.array([apply_gravity_filter(window) for window in total_acc_z_test])

# Calculate jerk signals for test data
body_acc_jerk_x_test = calculate_jerk(body_acc_x_test)
body_acc_jerk_y_test = calculate_jerk(body_acc_y_test)
body_acc_jerk_z_test = calculate_jerk(body_acc_z_test)

body_gyro_jerk_x_test = calculate_jerk(body_gyro_x_test)
body_gyro_jerk_y_test = calculate_jerk(body_gyro_y_test)
body_gyro_jerk_z_test = calculate_jerk(body_gyro_z_test)

# Calculate magnitudes for test data
body_acc_mag_test = calculate_magnitude(body_acc_x_test, body_acc_y_test, body_acc_z_test)
gravity_acc_mag_test = calculate_magnitude(gravity_acc_x_test, gravity_acc_y_test, gravity_acc_z_test)
body_acc_jerk_mag_test = calculate_magnitude(body_acc_jerk_x_test, body_acc_jerk_y_test, body_acc_jerk_z_test)
body_gyro_mag_test = calculate_magnitude(body_gyro_x_test, body_gyro_y_test, body_gyro_z_test)
body_gyro_jerk_mag_test = calculate_magnitude(body_gyro_jerk_x_test, body_gyro_jerk_y_test, body_gyro_jerk_z_test)

print("Test data derived signals calculated")


Calculating gravity acceleration for test data...
Test data derived signals calculated


In [57]:
# Extract features from test data
print("Extracting features from test data...")
all_features_test = {}

# Time domain features
print("Extracting time domain features...")
all_features_test.update(extract_features_3d(body_acc_x_test, body_acc_y_test, body_acc_z_test, 'tBodyAcc'))
all_features_test.update(extract_features_3d(gravity_acc_x_test, gravity_acc_y_test, gravity_acc_z_test, 'tGravityAcc'))
all_features_test.update(extract_features_3d(body_acc_jerk_x_test, body_acc_jerk_y_test, body_acc_jerk_z_test, 'tBodyAccJerk'))
all_features_test.update(extract_features_3d(body_gyro_x_test, body_gyro_y_test, body_gyro_z_test, 'tBodyGyro'))
all_features_test.update(extract_features_3d(body_gyro_jerk_x_test, body_gyro_jerk_y_test, body_gyro_jerk_z_test, 'tBodyGyroJerk'))

all_features_test.update(extract_features_magnitude(body_acc_mag_test, 'tBodyAccMag'))
all_features_test.update(extract_features_magnitude(gravity_acc_mag_test, 'tGravityAccMag'))
all_features_test.update(extract_features_magnitude(body_acc_jerk_mag_test, 'tBodyAccJerkMag'))
all_features_test.update(extract_features_magnitude(body_gyro_mag_test, 'tBodyGyroMag'))
all_features_test.update(extract_features_magnitude(body_gyro_jerk_mag_test, 'tBodyGyroJerkMag'))

# Frequency domain features
print("Extracting frequency domain features...")
all_features_test.update(extract_fft_features_3d(body_acc_x_test, body_acc_y_test, body_acc_z_test, 'fBodyAcc'))
all_features_test.update(extract_fft_features_3d(body_acc_jerk_x_test, body_acc_jerk_y_test, body_acc_jerk_z_test, 'fBodyAccJerk'))
all_features_test.update(extract_fft_features_3d(body_gyro_x_test, body_gyro_y_test, body_gyro_z_test, 'fBodyGyro'))

all_features_test.update(extract_fft_features_magnitude(body_acc_mag_test, 'fBodyAccMag'))
all_features_test.update(extract_fft_features_magnitude(body_acc_jerk_mag_test, 'fBodyBodyAccJerkMag'))
all_features_test.update(extract_fft_features_magnitude(body_gyro_mag_test, 'fBodyBodyGyroMag'))
all_features_test.update(extract_fft_features_magnitude(body_gyro_jerk_mag_test, 'fBodyBodyGyroJerkMag'))

print(f"Total test features extracted: {len(all_features_test)} feature types")


Extracting features from test data...
Extracting time domain features...
Extracting frequency domain features...
Total test features extracted: 468 feature types


In [58]:
# Create feature matrix for test data in correct order
print("Creating test feature matrix in correct order...")
feature_data_test = []
for feature_name in feature_names:
    if feature_name in all_features_test:
        feature_data_test.append(all_features_test[feature_name])
    else:
        print(f"Warning: Feature '{feature_name}' not found in test data, using zeros")
        feature_data_test.append(np.zeros(body_acc_x_test.shape[0]))

X_test_recreated = np.array(feature_data_test).T

print(f"Test dataset shape: {X_test_recreated.shape}")
print(f"Expected shape: ({body_acc_x_test.shape[0]}, 561)")


Creating test feature matrix in correct order...
Test dataset shape: (2947, 561)
Expected shape: (2947, 561)


In [59]:
# Normalize test data using the SAME normalization parameters from training data
print("Normalizing test features using training data parameters...")

X_test_normalized = np.zeros_like(X_test_recreated)

for i in range(X_test_recreated.shape[1]):
    feature_values = X_test_recreated[:, i]
    
    # Use the min/max from TRAINING data (stored in feature_data[i])
    training_feature_values = np.array(feature_data[i])
    min_val = np.min(training_feature_values)
    max_val = np.max(training_feature_values)
    
    # Avoid division by zero
    if max_val - min_val > 1e-10:
        # Normalize to [-1, 1] using training parameters
        X_test_normalized[:, i] = 2 * (feature_values - min_val) / (max_val - min_val) - 1
    else:
        # If all training values were the same, set to 0
        X_test_normalized[:, i] = 0

print(f"Normalized test dataset shape: {X_test_normalized.shape}")
print(f"Min value: {np.min(X_test_normalized):.6f}")
print(f"Max value: {np.max(X_test_normalized):.6f}")

# Update the main variable
X_test_recreated = X_test_normalized


Normalizing test features using training data parameters...
Normalized test dataset shape: (2947, 561)
Min value: -1.706268
Max value: 1.731001
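The reported minimum (-1.706268) and maximum (1.731001) fall outside [-1, 1] because the scaling uses the training minimum and maximum: any test value beyond the training range maps beyond ±1. The sketch below shows the same fit-on-train, apply-to-test scaling in vectorized form on toy data; `fit_minmax` and `apply_minmax` are hypothetical helper names, not functions from the notebook.

```python
import numpy as np


def fit_minmax(train):
    """Return per-column (min, max) computed on the training data only."""
    return train.min(axis=0), train.max(axis=0)


def apply_minmax(values, min_val, max_val, eps=1e-10):
    """Scale columns to [-1, 1] relative to the given training min/max.

    Columns whose training range is (near) zero are set to 0, mirroring
    the notebook's division-by-zero guard.
    """
    span = max_val - min_val
    scaled = np.zeros_like(values, dtype=float)
    valid = span > eps
    scaled[:, valid] = 2.0 * (values[:, valid] - min_val[valid]) / span[valid] - 1.0
    return scaled


# Toy data: column 0 ranges over [0, 10] in training, column 1 is constant.
train_demo = np.array([[0.0, 3.0],
                       [10.0, 3.0],
                       [5.0, 3.0]])
test_demo = np.array([[12.0, 3.0],
                      [-1.0, 7.0]])

min_val, max_val = fit_minmax(train_demo)
train_scaled = apply_minmax(train_demo, min_val, max_val)
test_scaled = apply_minmax(test_demo, min_val, max_val)

print(train_scaled[:, 0])  # stays within [-1, 1]
print(test_scaled[:, 0])   # exceeds [-1, 1] where test values leave the training range
```

Whether to clip such out-of-range values is a design choice; the notebook leaves them unclipped, which keeps the transform linear and identical for train and test.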


In [60]:
# Filter test data to only include desired features
print("Filtering test data to desired features...")
X_test_recreated_filtered = X_test_recreated[:, desired_indices]

print(f"Filtered test dataset shape: {X_test_recreated_filtered.shape}")
print(f"Expected shape: ({X_test_recreated.shape[0]}, {len(desired_features)})")


Filtering test data to desired features...
Filtered test dataset shape: (2947, 87)
Expected shape: (2947, 87)


In [61]:
# Save test dataset
print("Saving test dataset...")

# Save full test dataset (all 561 features)
output_file_test_full = 'X_test_recreated_full.txt'
np.savetxt(output_file_test_full, X_test_recreated, fmt='%.16f')
print(f"Full test dataset (561 features) saved to {output_file_test_full}")

# Save filtered test dataset as CSV with headers
output_file_test_csv = 'X_test_recreated_filtered.csv'
df_test_filtered = pd.DataFrame(X_test_recreated_filtered, columns=desired_features)
df_test_filtered.to_csv(output_file_test_csv, index=False)
print(f"Filtered test dataset ({len(desired_features)} features) saved to {output_file_test_csv}")

# Also save as TXT for backward compatibility
output_file_test_txt = 'X_test_recreated_filtered.txt'
np.savetxt(output_file_test_txt, X_test_recreated_filtered, fmt='%.16f')
print(f"Filtered test dataset ({len(desired_features)} features) also saved to {output_file_test_txt} (TXT format)")

# Display first few rows
print("\nFirst 5 rows of test CSV (first 5 columns):")
print(df_test_filtered.head())


Saving test dataset...
Full test dataset (561 features) saved to X_test_recreated_full.txt
Filtered test dataset (87 features) saved to X_test_recreated_filtered.csv
Filtered test dataset (87 features) also saved to X_test_recreated_filtered.txt (TXT format)

First 5 rows of test CSV (first 5 columns):
   tBodyAcc-mean()-X  tBodyAcc-mean()-Y  tBodyAcc-mean()-Z  tBodyAcc-max()-X  \
0           0.257178          -0.023285          -0.014654         -0.894088   
1           0.286027          -0.013163          -0.119083         -0.894088   
2           0.275485          -0.026050          -0.118152         -0.939260   
3           0.270298          -0.032614          -0.117520         -0.938610   
4           0.274833          -0.027848          -0.129527         -0.938610   

   tBodyAcc-max()-Y  tBodyAcc-max()-Z  tBodyAcc-min()-X  tBodyAcc-min()-Y  \
0         -0.554577         -0.466223          0.717208          0.635

## Verify Test Dataset (Optional)

Compare the recreated test features with the original `X_test.txt`, restricted to the desired features, to verify correctness.


In [62]:
# Load original X_test.txt and compare
print("Loading original X_test.txt...")
try:
    X_test_original = np.loadtxt('dataset/test/X_test.txt')
    print(f"Original test dataset shape: {X_test_original.shape}")
    print(f"Recreated test dataset shape: {X_test_recreated.shape}")
    
    # Filter original test dataset to only include desired features
    X_test_original_filtered = X_test_original[:, desired_indices]
    print(f"Filtered original test dataset shape: {X_test_original_filtered.shape}")
    print(f"Filtered recreated test dataset shape: {X_test_recreated_filtered.shape}")
    
    # Compare shapes
    if X_test_original_filtered.shape == X_test_recreated_filtered.shape:
        print("✓ Shapes match!")
    else:
        print("✗ Shapes do not match!")
    
    # Calculate correlation for each desired feature
    print(f"\n{'='*60}")
    print("TEST DATASET COMPARISON: DESIRED FEATURES ONLY ({} features)".format(len(desired_features)))
    print('='*60)
    
    correlations_test = []
    for i in range(X_test_original_filtered.shape[1]):
        corr = np.corrcoef(X_test_original_filtered[:, i], X_test_recreated_filtered[:, i])[0, 1]
        correlations_test.append(corr)
    
    correlations_test = np.array(correlations_test)
    print(f"\nMean correlation across desired features: {np.nanmean(correlations_test):.4f}")
    print(f"Median correlation: {np.nanmedian(correlations_test):.4f}")
    print(f"Features with correlation > 0.9: {np.sum(correlations_test > 0.9)}/{len(correlations_test)}")
    print(f"Features with correlation > 0.95: {np.sum(correlations_test > 0.95)}/{len(correlations_test)}")
    print(f"Features with correlation > 0.99: {np.sum(correlations_test > 0.99)}/{len(correlations_test)}")
    print(f"Features with correlation > 0.999: {np.sum(correlations_test > 0.999)}/{len(correlations_test)}")
    
    # Find features with low correlation
    low_corr_features_test = np.where(correlations_test < 0.9)[0]
    if len(low_corr_features_test) > 0:
        print(f"\nFeatures with correlation < 0.9: {len(low_corr_features_test)}")
        print("Low-correlation test features:")
        for idx in low_corr_features_test[:10]:  # Show first 10
            orig_idx = desired_indices[idx]
            print(f"  Feature {orig_idx+1} ({desired_features[idx]}): correlation = {correlations_test[idx]:.4f}")
    else:
        print("\n✓ All desired test features have correlation ≥ 0.9!")
    
    # Calculate mean absolute error
    mae_test = np.mean(np.abs(X_test_original_filtered - X_test_recreated_filtered))
    print(f"\nMean Absolute Error (test, desired features): {mae_test:.6f}")
    
    # Calculate RMSE
    rmse_test = np.sqrt(np.mean((X_test_original_filtered - X_test_recreated_filtered)**2))
    print(f"Root Mean Square Error (test, desired features): {rmse_test:.6f}")
    
    # Summary statistics
    print(f"\n{'='*60}")
    print("TEST DATASET SUMMARY")
    print('='*60)
    print(f"Total desired features: {len(desired_features)}")
    print(f"Perfect matches (>0.999): {np.sum(correlations_test > 0.999)}")
    print(f"Excellent matches (>0.99): {np.sum(correlations_test > 0.99)}")
    print(f"Good matches (>0.95): {np.sum(correlations_test > 0.95)}")
    print(f"Acceptable matches (>0.9): {np.sum(correlations_test > 0.9)}")
    print(f"Poor matches (<0.9): {len(low_corr_features_test)}")
    
    success_rate_test = np.sum(correlations_test > 0.9) / len(correlations_test) * 100
    print(f"\nSuccess rate (correlation > 0.9): {success_rate_test:.1f}%")
    
except FileNotFoundError:
    print("Error: dataset/test/X_test.txt not found")
except Exception as e:
    print(f"Error loading original test dataset: {e}")
    import traceback
    traceback.print_exc()


Loading original X_test.txt...
Original test dataset shape: (2947, 561)
Recreated test dataset shape: (2947, 561)
Filtered original test dataset shape: (2947, 87)
Filtered recreated test dataset shape: (2947, 87)
✓ Shapes match!

TEST DATASET COMPARISON: DESIRED FEATURES ONLY (87 features)

Mean correlation across desired features: 0.9637
Median correlation: 0.9995
Features with correlation > 0.9: 80/87
Features with correlation > 0.95: 72/87
Features with correlation > 0.99: 64/87
Features with correlation > 0.999: 47/87

Features with correlation < 0.9: 7
Low-correlation test features:
  Feature 357 (fBodyAccJerk-min()-X): correlation = 0.5444
  Feature 358 (fBodyAccJerk-min()-Y): correlation = 0.5791
  Feature 359 (fBodyAccJerk-min()-Z): correlation = 0.5656
  Feature 433 (fBodyGyro-max()-X): correlation = 0.8035
  Feature 520 (fBodyBodyAccJerkMag-min()): correlation = 0.6415
  Feature 532 (fBodyBodyGyroMag-max()): correlation = 0.8987
  Feature 546 (fBodyBodyGyroJerkMag-min()): cor