# Human Activity Recognition Data Preprocessing

UCI HAR Dataset preprocessing for LSTM models.

**Dataset**: 7,352 training + 2,947 test samples with 561 sensor features
**Output**: 6-class activity classification with LSTM-ready 3D tensors (128 timesteps, 4 features)

## 1. Import Required Libraries

In [1]:
import numpy as np
import pandas as pd
import os
from pathlib import Path
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
import warnings
warnings.filterwarnings('ignore')

print("✓ All libraries imported successfully")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

✓ All libraries imported successfully
NumPy version: 2.3.5
Pandas version: 2.3.3


## 2. Load UCI HAR Dataset

In [4]:
# Define dataset path
DATASET_PATH = Path('DATASETS')

# Verify dataset directory exists
if not DATASET_PATH.exists():
    print(f"❌ Error: Dataset directory not found at {DATASET_PATH}")
    print("Please ensure the DATASETS folder is in the current working directory")
else:
    print(f"✓ Dataset directory found: {DATASET_PATH}")
    print(f"\nContents of DATASETS folder:")
    for item in DATASET_PATH.iterdir():
        if item.is_file():
            size_mb = item.stat().st_size / (1024**2)
            print(f"  - {item.name} ({size_mb:.2f} MB)")
        elif item.is_dir():
            print(f"  - {item.name}/ (directory)")

✓ Dataset directory found: DATASETS

Contents of DATASETS folder:
  - test.csv (18.43 MB)
  - train.csv (45.91 MB)


## 3. Load Data from CSV

In [5]:
# Load training and test datasets from CSV files
train_csv = DATASET_PATH / 'train.csv'
test_csv = DATASET_PATH / 'test.csv'

print("Loading UCI HAR Dataset...")
print(f"  - Training set: {train_csv}")
print(f"  - Test set: {test_csv}")

# Load datasets
df_train = pd.read_csv(train_csv)
df_test = pd.read_csv(test_csv)

print(f"\n✓ Datasets loaded successfully")
print(f"\nTraining set shape: {df_train.shape}")
print(f"Test set shape: {df_test.shape}")
print(f"\nFirst few columns: {df_train.columns[:5].tolist()}")
print(f"Last few columns: {df_train.columns[-3:].tolist()}")

# Display unique activities
print(f"\nUnique activities in training set:")
print(df_train['Activity'].unique())

Loading UCI HAR Dataset...
  - Training set: DATASETS\train.csv
  - Test set: DATASETS\test.csv

✓ Datasets loaded successfully

Training set shape: (7352, 563)
Test set shape: (2947, 563)

First few columns: ['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z', 'tBodyAcc-std()-X', 'tBodyAcc-std()-Y']
Last few columns: ['angle(Z,gravityMean)', 'subject', 'Activity']

Unique activities in training set:
['STANDING' 'SITTING' 'LAYING' 'WALKING' 'WALKING_DOWNSTAIRS'
 'WALKING_UPSTAIRS']


In [6]:
# Extract features and labels
# Remove 'subject' and 'Activity' columns, keeping only sensor features
X_train = df_train.drop(['subject', 'Activity'], axis=1).values
y_train = df_train['Activity'].values

X_test = df_test.drop(['subject', 'Activity'], axis=1).values
y_test = df_test['Activity'].values

print(f"Extracted sensor features:")
print(f"  Training features shape: {X_train.shape}")
print(f"  Test features shape: {X_test.shape}")
print(f"\nTotal number of features: {X_train.shape[1]}")
print(f"Activities in training set: {len(np.unique(y_train))} classes")
print(f"Activities in test set: {len(np.unique(y_test))} classes")

Extracted sensor features:
  Training features shape: (7352, 561)
  Test features shape: (2947, 561)

Total number of features: 561
Activities in training set: 6 classes
Activities in test set: 6 classes


## 4. Reshape for LSTM

Convert (samples, 561 features) → (samples, 128 timesteps, 4 features), keeping the first 512 of the 561 features.

In [7]:
# Reshape features for LSTM
# Each row holds 561 engineered features computed over one 128-sample window
# (2.56 seconds at 50 Hz), not the raw signal itself.
# We restructure them into a 3D tensor: (samples, 561) -> (samples, 128, features per timestep)
# Note: the resulting "timesteps" axis is a regrouping of feature columns,
# not a true temporal sequence of sensor readings.

# 561 is not divisible by 128, so integer division gives 4 features per timestep
# and only the first 128 * 4 = 512 features are kept; the remaining 49 are dropped.

TIMESTEPS = 128
n_features_per_step = X_train.shape[1] // TIMESTEPS  # 561 // 128 = 4

# Adjust features if not perfectly divisible
total_features = TIMESTEPS * n_features_per_step
X_train_reshaped = X_train[:, :total_features].reshape(-1, TIMESTEPS, n_features_per_step)
X_test_reshaped = X_test[:, :total_features].reshape(-1, TIMESTEPS, n_features_per_step)

print(f"✓ Data reshaped for LSTM input")
print(f"\nOriginal shape: {X_train.shape}")
print(f"Reshaped for LSTM:")
print(f"  - Number of samples: {X_train_reshaped.shape[0]}")
print(f"  - Timesteps (sequence length): {X_train_reshaped.shape[1]}")
print(f"  - Features per timestep: {X_train_reshaped.shape[2]}")
print(f"\nFinal shape for LSTM: {X_train_reshaped.shape}")
print(f"Test set shape: {X_test_reshaped.shape}")

✓ Data reshaped for LSTM input

Original shape: (7352, 561)
Reshaped for LSTM:
  - Number of samples: 7352
  - Timesteps (sequence length): 128
  - Features per timestep: 4

Final shape for LSTM: (7352, 128, 4)
Test set shape: (2947, 128, 4)
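
The truncation in this cell keeps 512 of the 561 features. As a hedged alternative, the sketch below zero-pads the feature axis up to the next multiple of 128 so that nothing is dropped, which gives 5 features per timestep. The helper `reshape_with_padding` and the synthetic `X_demo` array are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def reshape_with_padding(X, timesteps=128):
    """Zero-pad the feature axis to a multiple of `timesteps`, then reshape to 3D.

    Hypothetical helper: an alternative to truncating the extra columns.
    """
    n_samples, n_features = X.shape
    per_step = -(-n_features // timesteps)  # ceiling division
    n_pad = timesteps * per_step - n_features
    X_padded = np.pad(X, ((0, 0), (0, n_pad)), mode="constant")
    return X_padded.reshape(n_samples, timesteps, per_step), n_pad

# Synthetic stand-in for the (samples, 561) feature matrix
X_demo = np.random.default_rng(0).uniform(-1, 1, size=(10, 561))

truncated = X_demo[:, :512].reshape(-1, 128, 4)   # what the notebook does
padded, n_pad = reshape_with_padding(X_demo)      # padding alternative

print(truncated.shape, "drops", 561 - 512, "features")
print(padded.shape, "adds", n_pad, "zero columns")
```

Padding preserves every feature at the cost of a slightly wider input; whether that helps the model is an empirical question this sketch does not answer.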


## 5. Encode Activity Labels to One-Hot Format

In [8]:
# Map activity names to numeric indices
activity_mapping = {
    'WALKING': 0,
    'WALKING_UPSTAIRS': 1,
    'WALKING_DOWNSTAIRS': 2,
    'SITTING': 3,
    'STANDING': 4,
    'LAYING': 5
}

# Create reverse mapping for reference
reverse_mapping = {v: k for k, v in activity_mapping.items()}

print("Activity mapping:")
for activity, idx in sorted(activity_mapping.items(), key=lambda x: x[1]):
    print(f"  {idx}: {activity}")

# Encode labels to numeric format
y_train_encoded = np.array([activity_mapping[activity] for activity in y_train])
y_test_encoded = np.array([activity_mapping[activity] for activity in y_test])

print(f"\n✓ Labels encoded to numeric format")
print(f"Training labels shape: {y_train_encoded.shape}")
print(f"Test labels shape: {y_test_encoded.shape}")

# Convert to one-hot encoding (6 classes)
NUM_CLASSES = 6
y_train_onehot = to_categorical(y_train_encoded, num_classes=NUM_CLASSES)
y_test_onehot = to_categorical(y_test_encoded, num_classes=NUM_CLASSES)

print(f"\n✓ Labels converted to one-hot encoding")
print(f"One-hot training labels shape: {y_train_onehot.shape}")
print(f"One-hot test labels shape: {y_test_onehot.shape}")
print(f"\nExample one-hot vector (first sample):")
print(f"  Label: {reverse_mapping[y_train_encoded[0]]}")
print(f"  One-hot: {y_train_onehot[0]}")

Activity mapping:
  0: WALKING
  1: WALKING_UPSTAIRS
  2: WALKING_DOWNSTAIRS
  3: SITTING
  4: STANDING
  5: LAYING

✓ Labels encoded to numeric format
Training labels shape: (7352,)
Test labels shape: (2947,)

✓ Labels converted to one-hot encoding
One-hot training labels shape: (7352, 6)
One-hot test labels shape: (2947, 6)

Example one-hot vector (first sample):
  Label: STANDING
  One-hot: [0. 0. 0. 0. 1. 0.]
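
Going the other way, from one-hot vectors (or model probabilities) back to activity names, is an `argmax` followed by a lookup in the reverse mapping. The sketch below is a minimal illustration; `decode_one_hot` is a hypothetical helper, and `np.eye` is used here as a NumPy-only stand-in for `to_categorical` so the example has no TensorFlow dependency.

```python
import numpy as np

activity_mapping = {
    'WALKING': 0,
    'WALKING_UPSTAIRS': 1,
    'WALKING_DOWNSTAIRS': 2,
    'SITTING': 3,
    'STANDING': 4,
    'LAYING': 5,
}
reverse_mapping = {v: k for k, v in activity_mapping.items()}

def decode_one_hot(y_onehot):
    """Map each row of a (samples, 6) array to its activity name (hypothetical helper)."""
    return [reverse_mapping[int(i)] for i in np.argmax(y_onehot, axis=1)]

labels = ['STANDING', 'WALKING', 'LAYING']
indices = np.array([activity_mapping[name] for name in labels])
onehot = np.eye(len(activity_mapping), dtype=np.float32)[indices]  # NumPy-only one-hot

decoded = decode_one_hot(onehot)
print(decoded)
```

The same helper works on softmax outputs, since `argmax` picks the most probable class in each row.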


## 6. Normalize Sensor Data

In [9]:
# Normalize the reshaped data
# Flatten to 2D (samples * timesteps, features per timestep) so StandardScaler
# computes one mean and std per feature column, then reshape back to 3D

X_train_flat = X_train_reshaped.reshape(-1, X_train_reshaped.shape[2])
X_test_flat = X_test_reshaped.reshape(-1, X_test_reshaped.shape[2])

print("Normalizing sensor data using StandardScaler...")
print(f"Before normalization:")
print(f"  Training data - Min: {X_train_flat.min():.4f}, Max: {X_train_flat.max():.4f}, Mean: {X_train_flat.mean():.4f}")

# Fit scaler on training data only (to prevent data leakage)
scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train_flat)
X_test_normalized = scaler.transform(X_test_flat)

print(f"\nAfter normalization:")
print(f"  Training data - Min: {X_train_normalized.min():.4f}, Max: {X_train_normalized.max():.4f}, Mean: {X_train_normalized.mean():.4f}")

# Reshape back to 3D for LSTM
X_train_normalized = X_train_normalized.reshape(X_train_reshaped.shape)
X_test_normalized = X_test_normalized.reshape(X_test_reshaped.shape)

print(f"\n✓ Data normalized successfully")
print(f"Training data shape after normalization: {X_train_normalized.shape}")
print(f"Test data shape after normalization: {X_test_normalized.shape}")

Normalizing sensor data using StandardScaler...
Before normalization:
  Training data - Min: -1.0000, Max: 1.0000, Mean: -0.5078

After normalization:
  Training data - Min: -0.9639, Max: 3.1905, Mean: -0.0000

✓ Data normalized successfully
Training data shape after normalization: (7352, 128, 4)
Test data shape after normalization: (2947, 128, 4)
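
The flatten, scale, and reshape sequence amounts to computing one mean and standard deviation per feature column over all samples and timesteps of the training set, then applying those statistics to both sets. A minimal NumPy sketch of the same idea, using broadcasting instead of flattening, is shown below. The arrays are synthetic stand-ins, and the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np

rng = np.random.default_rng(42)
X_tr = rng.uniform(-1, 1, size=(20, 128, 4))  # synthetic training tensor
X_te = rng.uniform(-1, 1, size=(5, 128, 4))   # synthetic test tensor

# Statistics come from the training data only, one value per feature column
mean = X_tr.mean(axis=(0, 1))
std = X_tr.std(axis=(0, 1))           # ddof=0, as in StandardScaler
std = np.where(std == 0, 1.0, std)    # guard against constant columns

# Broadcasting applies the 4 per-column statistics across samples and timesteps
X_tr_norm = (X_tr - mean) / std
X_te_norm = (X_te - mean) / std

print(X_tr_norm.mean(axis=(0, 1)))
print(X_tr_norm.std(axis=(0, 1)))
```

The test set is only close to zero mean and unit variance, not exactly, because its statistics are never used for fitting.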


## 7. Split Dataset into Train, Validation, and Test Sets

In [10]:
# Split training data into train and validation sets
# 80% training, 20% validation from the original training set
# The test set remains separate for final evaluation

TEST_SPLIT_RATIO = 0.2  # fraction of the original training set held out for validation
RANDOM_STATE = 42

print("Splitting dataset into train, validation, and test sets...")

# Split training data
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train_normalized,
    y_train_onehot,
    test_size=TEST_SPLIT_RATIO,
    random_state=RANDOM_STATE,
    stratify=y_train_encoded  # Ensure balanced distribution
)

# Test set remains as is
X_test_final = X_test_normalized
y_test_final = y_test_onehot

print(f"\n✓ Dataset split successfully")
print(f"\nDataset sizes:")
print(f"  Training set: {X_train_final.shape[0]} samples ({X_train_final.shape[0]/len(X_train_normalized)*100:.1f}%)")
print(f"  Validation set: {X_val.shape[0]} samples ({X_val.shape[0]/len(X_train_normalized)*100:.1f}%)")
print(f"  Test set: {X_test_final.shape[0]} samples")
print(f"  Total: {X_train_final.shape[0] + X_val.shape[0] + X_test_final.shape[0]} samples")

# Display class distribution
print(f"\nClass distribution in training set:")
train_counts = y_train_final.sum(axis=0)
for i, count in enumerate(train_counts):
    print(f"  {reverse_mapping[i]}: {int(count)} samples")

print(f"\nClass distribution in validation set:")
val_counts = y_val.sum(axis=0)
for i, count in enumerate(val_counts):
    print(f"  {reverse_mapping[i]}: {int(count)} samples")

Splitting dataset into train, validation, and test sets...

✓ Dataset split successfully

Dataset sizes:
  Training set: 5881 samples (80.0%)
  Validation set: 1471 samples (20.0%)
  Test set: 2947 samples
  Total: 10299 samples

Class distribution in training set:
  WALKING: 981 samples
  WALKING_UPSTAIRS: 858 samples
  WALKING_DOWNSTAIRS: 789 samples
  SITTING: 1029 samples
  STANDING: 1099 samples
  LAYING: 1125 samples

Class distribution in validation set:
  WALKING: 245 samples
  WALKING_UPSTAIRS: 215 samples
  WALKING_DOWNSTAIRS: 197 samples
  SITTING: 257 samples
  STANDING: 275 samples
  LAYING: 282 samples
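
One caveat with a random stratified split: the `subject` column was dropped earlier, so windows from the same person can land in both the training and validation sets. If subject-independent validation matters, a split by subject is one option. The sketch below is an assumption-laden illustration, not the notebook's method; `subject_wise_split` and the synthetic `subjects_demo` array are hypothetical.

```python
import numpy as np

def subject_wise_split(subjects, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) so that no subject appears in both sets.

    Hypothetical helper: holds out whole subjects rather than individual windows.
    """
    rng = np.random.default_rng(seed)
    unique_subjects = np.unique(subjects)
    rng.shuffle(unique_subjects)
    n_val = max(1, int(round(len(unique_subjects) * val_fraction)))
    val_mask = np.isin(subjects, unique_subjects[:n_val])
    return np.where(~val_mask)[0], np.where(val_mask)[0]

# Synthetic example: 10 subjects with 5 windows each
subjects_demo = np.repeat(np.arange(1, 11), 5)
train_idx, val_idx = subject_wise_split(subjects_demo)

print("Validation subjects:", np.unique(subjects_demo[val_idx]))
print("Train windows:", len(train_idx), "Val windows:", len(val_idx))
```

In this notebook the subject IDs would come from `df_train['subject'].values`. Note that a subject-wise split does not guarantee the class balance that `stratify` provides, so the per-class counts should be rechecked afterwards.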


## 8. Prepare Final Dataset for LSTM Model

In [11]:
# Convert all datasets to float32 (optimal for neural networks)
X_train = X_train_final.astype(np.float32)
X_val = X_val.astype(np.float32)
X_test = X_test_final.astype(np.float32)

y_train = y_train_final.astype(np.float32)
y_val = y_val.astype(np.float32)
y_test = y_test_final.astype(np.float32)

print("Final dataset preparation complete!")
print(f"\n✓ All data converted to float32 format")

# Summary of final datasets
print(f"\n{'='*60}")
print(f"FINAL DATASET SUMMARY")
print(f"{'='*60}")

print(f"\nTraining Set (X_train):")
print(f"  Shape: {X_train.shape}")
print(f"  Type: {X_train.dtype}")
print(f"  Min value: {X_train.min():.6f}, Max value: {X_train.max():.6f}")
print(f"  Mean: {X_train.mean():.6f}, Std: {X_train.std():.6f}")

print(f"\nValidation Set (X_val):")
print(f"  Shape: {X_val.shape}")
print(f"  Type: {X_val.dtype}")

print(f"\nTest Set (X_test):")
print(f"  Shape: {X_test.shape}")
print(f"  Type: {X_test.dtype}")

print(f"\nTraining Labels (y_train):")
print(f"  Shape: {y_train.shape}")
print(f"  Type: {y_train.dtype}")
print(f"  One-hot encoded: ✓")

print(f"\nValidation Labels (y_val):")
print(f"  Shape: {y_val.shape}")

print(f"\nTest Labels (y_test):")
print(f"  Shape: {y_test.shape}")

print(f"\n{'='*60}")
print(f"LSTM INPUT SPECIFICATION")
print(f"{'='*60}")
print(f"Input shape per sample: ({X_train.shape[1]}, {X_train.shape[2]})")
print(f"  - Timesteps (sequence length): {X_train.shape[1]}")
print(f"  - Features per timestep: {X_train.shape[2]}")
print(f"\nOutput shape per sample: ({y_train.shape[1]},)")
print(f"  - Number of classes: {y_train.shape[1]}")
print(f"\n✓ Data ready for LSTM model training!")

Final dataset preparation complete!

✓ All data converted to float32 format

FINAL DATASET SUMMARY

Training Set (X_train):
  Shape: (5881, 128, 4)
  Type: float32
  Min value: -0.963937, Max value: 3.190548
  Mean: 0.000183, Std: 0.999976

Validation Set (X_val):
  Shape: (1471, 128, 4)
  Type: float32

Test Set (X_test):
  Shape: (2947, 128, 4)
  Type: float32

Training Labels (y_train):
  Shape: (5881, 6)
  Type: float32
  One-hot encoded: ✓

Validation Labels (y_val):
  Shape: (1471, 6)

Test Labels (y_test):
  Shape: (2947, 6)

LSTM INPUT SPECIFICATION
Input shape per sample: (128, 4)
  - Timesteps (sequence length): 128
  - Features per timestep: 4

Output shape per sample: (6,)
  - Number of classes: 6

✓ Data ready for LSTM model training!


## 9. Verify Data and Generate Summary Statistics

In [12]:
# Check for missing values
print("Checking for missing values...")
print(f"Missing values in X_train: {np.isnan(X_train).sum()}")
print(f"Missing values in X_val: {np.isnan(X_val).sum()}")
print(f"Missing values in X_test: {np.isnan(X_test).sum()}")
print(f"Missing values in y_train: {np.isnan(y_train).sum()}")
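# Fail loudly instead of printing a success message unconditionally
assert not any(np.isnan(a).any() for a in (X_train, X_val, X_test, y_train, y_val, y_test)), "NaN values found!"
print(f"✓ No missing values detected!")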

# Display sample from each set
print(f"\n{'='*60}")
print(f"SAMPLE DATA INSPECTION")
print(f"{'='*60}")

print(f"\nSample from training set (first sample):")
print(f"  Input shape: {X_train[0].shape}")
print(f"  Input dtype: {X_train[0].dtype}")
print(f"  First timestep values: {X_train[0, 0, :]}")
print(f"  Last timestep values: {X_train[0, -1, :]}")
print(f"  Label (one-hot): {y_train[0]}")
# Decoded from the ground-truth one-hot label; no model has made a prediction yet
print(f"  Predicted activity: {reverse_mapping[np.argmax(y_train[0])]}")

# Verify shapes consistency
assert X_train.shape[1:] == X_val.shape[1:], "Train and Val input shapes don't match!"
assert X_train.shape[1:] == X_test.shape[1:], "Train and Test input shapes don't match!"
assert y_train.shape[1] == y_val.shape[1], "Train and Val label shapes don't match!"
assert y_train.shape[1] == y_test.shape[1], "Train and Test label shapes don't match!"
print(f"\n✓ All shape consistency checks passed!")

# Activity distribution across splits
print(f"\n{'='*60}")
print(f"ACTIVITY DISTRIBUTION")
print(f"{'='*60}")

train_dist = y_train.sum(axis=0)
val_dist = y_val.sum(axis=0)
test_dist = y_test.sum(axis=0)

print(f"\n{'Activity':<20} {'Train':>10} {'Val':>10} {'Test':>10}")
print(f"{'-'*50}")
for i in range(NUM_CLASSES):
    activity_name = reverse_mapping[i]
    print(f"{activity_name:<20} {int(train_dist[i]):>10} {int(val_dist[i]):>10} {int(test_dist[i]):>10}")

print(f"\n✓ Data preprocessing completed successfully!")

Checking for missing values...
Missing values in X_train: 0
Missing values in X_val: 0
Missing values in X_test: 0
Missing values in y_train: 0
✓ No missing values detected!

SAMPLE DATA INSPECTION

Sample from training set (first sample):
  Input shape: (128, 4)
  Input dtype: float32
  First timestep values: [ 1.3131514   0.99538964  0.66294235 -0.91170317]
  Last timestep values: [-0.9020974  -0.88405174 -0.96393716  1.6330763 ]
  Label (one-hot): [0. 0. 0. 1. 0. 0.]
  Predicted activity: SITTING

✓ All shape consistency checks passed!

ACTIVITY DISTRIBUTION

Activity                  Train        Val       Test
--------------------------------------------------
WALKING                     981        245        496
WALKING_UPSTAIRS            858        215        471
WALKING_DOWNSTAIRS          789        197        420
SITTING                    1029        257        491
STANDING                   1099        275        532
LAYING                     1125        282        537

✓ Data preprocessing completed successfully!

## 10. Save Preprocessed Data

In [13]:
# Create a data directory if it doesn't exist
data_dir = Path('preprocessed_data')
data_dir.mkdir(exist_ok=True)

# Save datasets using numpy
np.save(data_dir / 'X_train.npy', X_train)
np.save(data_dir / 'X_val.npy', X_val)
np.save(data_dir / 'X_test.npy', X_test)
np.save(data_dir / 'y_train.npy', y_train)
np.save(data_dir / 'y_val.npy', y_val)
np.save(data_dir / 'y_test.npy', y_test)

# Save metadata
metadata = {
    'timesteps': X_train.shape[1],
    'features': X_train.shape[2],
    'num_classes': NUM_CLASSES,
    'activity_mapping': activity_mapping,
    'scaler_mean': scaler.mean_.tolist(),
    'scaler_scale': scaler.scale_.tolist()
}

import json
with open(data_dir / 'metadata.json', 'w') as f:
    json.dump(metadata, f, indent=4)

print(f"✓ Preprocessed data saved to '{data_dir}' directory")
print(f"\nSaved files:")
print(f"  - X_train.npy ({X_train.nbytes / 1024**2:.2f} MB)")
print(f"  - X_val.npy ({X_val.nbytes / 1024**2:.2f} MB)")
print(f"  - X_test.npy ({X_test.nbytes / 1024**2:.2f} MB)")
print(f"  - y_train.npy ({y_train.nbytes / 1024**2:.2f} MB)")
print(f"  - y_val.npy ({y_val.nbytes / 1024**2:.2f} MB)")
print(f"  - y_test.npy ({y_test.nbytes / 1024**2:.2f} MB)")
print(f"  - metadata.json")

print(f"\n📝 To load this data later, use:")
print(f"  X_train = np.load('preprocessed_data/X_train.npy')")
print(f"  y_train = np.load('preprocessed_data/y_train.npy')")
print(f"  # ... and so on for other files")

✓ Preprocessed data saved to 'preprocessed_data' directory

Saved files:
  - X_train.npy (11.49 MB)
  - X_val.npy (2.87 MB)
  - X_test.npy (5.76 MB)
  - y_train.npy (0.13 MB)
  - y_val.npy (0.03 MB)
  - y_test.npy (0.07 MB)
  - metadata.json

📝 To load this data later, use:
  X_train = np.load('preprocessed_data/X_train.npy')
  y_train = np.load('preprocessed_data/y_train.npy')
  # ... and so on for other files
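
The loading hint can be expanded into a small helper that reads all six arrays plus `metadata.json`, and reuses the saved scaler statistics on new data. This is a sketch under stated assumptions: `load_preprocessed` and `scale_new_windows` are hypothetical helpers, and the demo writes tiny placeholder files to a temporary directory so it runs on its own without the real `preprocessed_data/` folder.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

ARRAY_NAMES = ["X_train", "X_val", "X_test", "y_train", "y_val", "y_test"]

def load_preprocessed(data_dir):
    """Load the saved .npy arrays and metadata.json from `data_dir` (hypothetical helper)."""
    data_dir = Path(data_dir)
    arrays = {name: np.load(data_dir / f"{name}.npy") for name in ARRAY_NAMES}
    with open(data_dir / "metadata.json") as f:
        metadata = json.load(f)
    return arrays, metadata

def scale_new_windows(X, metadata):
    """Apply the saved StandardScaler statistics to a (samples, timesteps, features) array."""
    mean = np.asarray(metadata["scaler_mean"], dtype=np.float32)
    scale = np.asarray(metadata["scaler_scale"], dtype=np.float32)
    return (X - mean) / scale

# Demo with placeholder files that mimic the saved layout
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in ARRAY_NAMES:
        shape = (3, 128, 4) if name.startswith("X") else (3, 6)
        np.save(tmp / f"{name}.npy", np.zeros(shape, dtype=np.float32))
    placeholder_meta = {
        "timesteps": 128,
        "features": 4,
        "num_classes": 6,
        "scaler_mean": [0.5] * 4,    # placeholder values, not the real statistics
        "scaler_scale": [2.0] * 4,
    }
    with open(tmp / "metadata.json", "w") as f:
        json.dump(placeholder_meta, f)

    arrays, metadata = load_preprocessed(tmp)
    scaled = scale_new_windows(np.ones((2, 128, 4), dtype=np.float32), metadata)

print({name: arr.shape for name, arr in arrays.items()})
print(scaled[0, 0])
```

With the real output, the call would be `load_preprocessed('preprocessed_data')`. New data passed to `scale_new_windows` must already be reshaped the same way as the training data for the saved statistics to line up.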


## Summary

✓ Preprocessing complete!

**Output files saved to `preprocessed_data/`:**
- X_train.npy: shape (5881, 128, 4)
- X_val.npy: shape (1471, 128, 4)
- X_test.npy: shape (2947, 128, 4)
- y_train.npy, y_val.npy, y_test.npy (one-hot encoded, 6 classes)
- metadata.json

Ready for LSTM model training.