# Exploratory Data Analysis (EDA) for Human Activity Recognition
This notebook explores the Human Activity Recognition dataset to understand its structure and features.

It covers data loading and basic statistics, then prepares the features and labels for training a neural network in PyTorch.

In [1]:
import pandas as pd

train_path = '../data/train.csv'
test_path = '../data/test.csv'

df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

print("Train shape:", df_train.shape)
print("Test shape:", df_test.shape)

df_train.head()

Train shape: (7352, 563)
Test shape: (2947, 563)


Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,1,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,1,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,1,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,1,STANDING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,1,STANDING


In [2]:
# Describe the dataset
df_train.describe()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject
count,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,...,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0
mean,0.274488,-0.017695,-0.109141,-0.605438,-0.510938,-0.604754,-0.630512,-0.526907,-0.60615,-0.468604,...,-0.307009,-0.625294,0.008684,0.002186,0.008726,-0.005981,-0.489547,0.058593,-0.056515,17.413085
std,0.070261,0.040811,0.056635,0.448734,0.502645,0.418687,0.424073,0.485942,0.414122,0.544547,...,0.321011,0.307584,0.336787,0.448306,0.608303,0.477975,0.511807,0.29748,0.279122,8.975143
min,-1.0,-1.0,-1.0,-1.0,-0.999873,-1.0,-1.0,-1.0,-1.0,-1.0,...,-0.995357,-0.999765,-0.97658,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0
25%,0.262975,-0.024863,-0.120993,-0.992754,-0.978129,-0.980233,-0.993591,-0.978162,-0.980251,-0.936219,...,-0.542602,-0.845573,-0.121527,-0.289549,-0.482273,-0.376341,-0.812065,-0.017885,-0.143414,8.0
50%,0.277193,-0.017219,-0.108676,-0.946196,-0.851897,-0.859365,-0.950709,-0.857328,-0.857143,-0.881637,...,-0.343685,-0.711692,0.009509,0.008943,0.008735,-0.000368,-0.709417,0.182071,0.003181,19.0
75%,0.288461,-0.010783,-0.097794,-0.242813,-0.034231,-0.262415,-0.29268,-0.066701,-0.265671,-0.017129,...,-0.126979,-0.503878,0.150865,0.292861,0.506187,0.359368,-0.509079,0.248353,0.107659,26.0
max,1.0,1.0,1.0,1.0,0.916238,1.0,1.0,0.967664,1.0,1.0,...,0.989538,0.956845,1.0,1.0,0.998702,0.996078,1.0,0.478157,1.0,30.0


In [3]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2947 entries, 0 to 2946
Columns: 563 entries, tBodyAcc-mean()-X to Activity
dtypes: float64(561), int64(1), object(1)
memory usage: 12.7+ MB


In [4]:
# Define columns to remove
columns_to_remove = ['subject', 'Activity']

# Features = all columns except 'subject' and 'Activity'
X_train = df_train.drop(columns=columns_to_remove)
y_train = df_train['Activity']

X_test = df_test.drop(columns=columns_to_remove)
y_test = df_test['Activity']

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (7352, 561)
X_test shape: (2947, 561)
y_train shape: (7352,)
y_test shape: (2947,)


### Step 1.3 – Encode Activity Labels

The activity labels in the dataset are currently stored as strings (e.g., "WALKING", "LAYING").
To prepare them for training in a neural network, we need to convert these categorical labels into numeric values.

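We will use `LabelEncoder` from `scikit-learn` to map each activity to a unique integer. The encoder assigns integers in sorted (alphabetical) class order (e.g., LAYING → 0, SITTING → 1, etc.).

This step is required for multi-class classification in PyTorch, because `CrossEntropyLoss` expects the targets as integer class indices.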

In [5]:
from sklearn.preprocessing import LabelEncoder

# Encode the target variable
label_encoder = LabelEncoder()

y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# show the label mapping
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label mapping:", label_mapping)

# Display the first few rows of the encoded target variable
print("Encoded y_train head:")
print(y_train_encoded[:10])

print("Encoded y_test head:")
print(y_test_encoded[:10])



Label mapping: {'LAYING': np.int64(0), 'SITTING': np.int64(1), 'STANDING': np.int64(2), 'WALKING': np.int64(3), 'WALKING_DOWNSTAIRS': np.int64(4), 'WALKING_UPSTAIRS': np.int64(5)}
Encoded y_train head:
[2 2 2 2 2 2 2 2 2 2]
Encoded y_test head:
[2 2 2 2 2 2 2 2 2 2]
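
The integer codes are only useful if they can be mapped back to activity names, for example when reading model predictions. The sketch below uses a small toy label list (illustrative data, not the notebook's files) and hypothetical names such as `toy_encoder` and `predicted_indices` to show that `LabelEncoder` sorts classes alphabetically and that `inverse_transform` recovers the names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels in arbitrary order (illustrative data only)
toy_labels = ["WALKING", "LAYING", "SITTING", "WALKING", "STANDING"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted, so the integer codes follow alphabetical order
print(toy_encoder.classes_.tolist())  # ['LAYING', 'SITTING', 'STANDING', 'WALKING']
print(toy_encoded.tolist())           # [3, 0, 1, 3, 2]

# Map hypothetical predicted class indices back to activity names
predicted_indices = np.array([0, 2, 3])
decoded = toy_encoder.inverse_transform(predicted_indices)
print(decoded.tolist())               # ['LAYING', 'STANDING', 'WALKING']
```

The same call on the notebook's `label_encoder` would decode model outputs, provided the encoder fitted here is the one used at prediction time.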


### Step 1.4 – Normalize Input Features

Neural networks are sensitive to the scale of input features.

To put all 561 features on a comparable scale, we standardize them using `StandardScaler`, which rescales each feature to have a mean of 0 and a standard deviation of 1. The scaler is fitted on the training data only, and the same statistics are then applied to the test data.

This step improves model convergence and stability during training.

In [6]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()

# Fit the scaler on the training data only, then transform both training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Verification: mean and standard deviation of the first five features
print("Mean of first five features in X_train_scaled:", np.round(X_train_scaled.mean(axis=0)[:5], 5))
print("Standard deviation of first five features in X_train_scaled:", np.round(X_train_scaled.std(axis=0)[:5], 5))


Mean of first five features in X_train_scaled: [-0.  0.  0. -0.  0.]
Standard deviation of first five features in X_train_scaled: [1. 1. 1. 1. 1.]
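
To make explicit what `transform` does to data the scaler was not fitted on, here is a minimal sketch on toy arrays (illustrative values; `toy_train`, `toy_test`, and `toy_scaler` are hypothetical names). It checks that the scaler's output equals the manual z-score `(x - mean) / std` computed from the training statistics, using the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative values only): 4 training rows, 2 features
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.5, 25.0], [5.0, 50.0]])

toy_scaler = StandardScaler().fit(toy_train)

# Manual z-score using the training mean and population std (ddof=0)
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)
manual_test = (toy_test - train_mean) / train_std

# The scaler applies the same training statistics to unseen data
assert np.allclose(toy_scaler.transform(toy_test), manual_test)
print(manual_test)
```

Because the test rows are scaled with training statistics, their mean and standard deviation are generally not exactly 0 and 1, which is expected.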


### Step 1.5 – Split Training Set into Train and Validation

To detect overfitting, we need to monitor the model's performance during training on data it is not trained on.

To do this, we split the original training set into two subsets:
- **Training Set**: used to train the model.
- **Validation Set**: used to evaluate performance during training, but not used in backpropagation.

We’ll use `train_test_split` from `scikit-learn`, and stratify the split to maintain the same label distribution across both sets.


In [7]:
from sklearn.model_selection import train_test_split

# Split the training data into training and validation sets
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train_scaled,
    y_train_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_train_encoded # maintain class distribution
)

# Display shapes of the final datasets
print("Train shape:", X_train_final.shape, y_train_final.shape)
print("Validation shape:", X_val.shape, y_val.shape)
print("Train label distribution:", np.bincount(y_train_final))
print("Validation label distribution:", np.bincount(y_val))

Train shape: (5881, 561) (5881,)
Validation shape: (1471, 561) (1471,)
Train label distribution: [1125 1029 1099  981  789  858]
Validation label distribution: [282 257 275 245 197 215]
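
A quick way to confirm that stratification worked is to compare class shares rather than raw counts. The sketch below copies the counts from the output above into two arrays (the variable names are illustrative) and reports the largest per-class difference between the two splits.

```python
import numpy as np

# Class counts copied from the printed output (indices 0..5 follow the label mapping)
train_counts = np.array([1125, 1029, 1099, 981, 789, 858])
val_counts = np.array([282, 257, 275, 245, 197, 215])

# Convert counts to per-class shares of each split
train_share = train_counts / train_counts.sum()
val_share = val_counts / val_counts.sum()

# Largest per-class gap between the two distributions
max_gap = np.abs(train_share - val_share).max()
print(np.round(train_share, 3))
print(np.round(val_share, 3))
print("Largest gap:", round(float(max_gap), 4))
```

With a stratified split the gap should be very small; the remaining differences come from rounding class counts to whole samples.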


### Step 1.6 – Create PyTorch Dataset Class

To work with neural networks in PyTorch, we need to convert our NumPy arrays into a format that can be fed into the model during training.

We define a custom `HumanActivityDataset` class that inherits from `torch.utils.data.Dataset`.

This class will:
- Store feature and label arrays
- Convert them to PyTorch tensors
- Return data in the format `(X, y)` for training

This approach makes it easy to integrate with PyTorch’s `DataLoader` for batching, shuffling, and iteration.


In [8]:
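import torch
from torch.utils.data import Dataset

class HumanActivityDataset(Dataset):
    """Wraps feature and label arrays as tensors for use with a DataLoader."""

    def __init__(self, features, labels):
        # float32 matches the default dtype of PyTorch layer weights
        self.features = torch.tensor(features, dtype=torch.float32)
        # long (int64) class indices are what CrossEntropyLoss expects
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (features, label) pair
        return self.features[idx], self.labels[idx]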

In [9]:
# Instantiate the datasets from the prepared splits
train_dataset = HumanActivityDataset(X_train_final, y_train_final)
val_dataset = HumanActivityDataset(X_val, y_val)
test_dataset = HumanActivityDataset(X_test_scaled, y_test_encoded)

# Inspect one sample
print("Sample [X, y]:", train_dataset[0])
print("Shape X:", train_dataset[0][0].shape)
print("Label y:", train_dataset[0][1])


Sample [X, y]: (tensor([ 1.0772e-01, -6.7335e-02, -2.0617e-01,  2.2605e-01,  4.1395e-01,
         8.1484e-01,  2.8676e-01,  3.3254e-01,  9.4111e-01,  2.9708e-02,
         7.6464e-01,  8.1059e-01, -5.4249e-01, -5.3888e-01, -4.0872e-01,
         4.5775e-01, -2.1444e-01, -2.5998e-02,  4.1385e-01,  4.5782e-01,
         1.5408e-01,  1.0968e+00,  1.1474e+00,  7.7100e-01,  1.2293e+00,
        -6.9908e-01,  5.2526e-01,  5.9928e-01, -1.1594e+00, -1.2123e+00,
         1.3580e+00, -6.2251e-01, -1.4879e-01, -5.4086e-01, -6.4880e-02,
         8.3196e-01, -2.5089e-01,  2.5880e-01,  3.4078e-01,  1.1177e+00,
         4.8415e-01, -6.1365e-01, -1.0332e+00, -2.7412e-01, -2.2917e-01,
        -4.2036e-01, -2.8762e-01, -2.2627e-01, -4.4167e-01,  4.7032e-01,
        -6.3029e-01, -1.0557e+00,  4.8750e-01, -5.8916e-01, -9.9130e-01,
         6.1666e-01,  4.8042e-01, -4.4149e-01, -2.5242e-01, -3.0082e-01,
        -1.9335e-01, -4.5500e-01,  1.7777e-01, -4.6504e-01, -7.9530e-01,
         5.4757e-01, -4.3201e-01,  

### Step 1.7 – Create PyTorch DataLoaders

Now that our `HumanActivityDataset` instances are ready, we can wrap them in the `DataLoader` class provided by PyTorch.

`DataLoader` handles batching, optional shuffling (important for training), and optional parallel loading with worker processes (`num_workers`).

We’ll create three DataLoaders:
- **Train Loader**: with shuffling and batching
- **Validation Loader**: used during training to monitor generalization
- **Test Loader**: used only for final evaluation


In [10]:
from torch.utils.data import DataLoader

# Initial hyperparameters
BATCH_SIZE = 64

# Create DataLoaders for batching and shuffling
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Check the first batch from the training DataLoader
batch = next(iter(train_loader))
X_batch, y_batch = batch
print("X_batch shape:", X_batch.shape)
print("y_batch shape:", y_batch.shape)

X_batch shape: torch.Size([64, 561])
y_batch shape: torch.Size([64])
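
Since `DataLoader` keeps the final partial batch by default (`drop_last=False`), the number of batches per epoch is the sample count divided by the batch size, rounded up. The arithmetic below uses the split sizes printed earlier in this notebook; `split_sizes` and `batch_plan` are illustrative names.

```python
import math

BATCH_SIZE = 64
# Sample counts taken from the shapes printed earlier in this notebook
split_sizes = {"train": 5881, "val": 1471, "test": 2947}

batch_plan = {}
for name, n_samples in split_sizes.items():
    # Number of batches when the last partial batch is kept
    n_batches = math.ceil(n_samples / BATCH_SIZE)
    # Size of that final, possibly smaller, batch
    last_batch = n_samples - (n_batches - 1) * BATCH_SIZE
    batch_plan[name] = (n_batches, last_batch)
    print(f"{name}: {n_batches} batches, last batch has {last_batch} samples")

# train: 92 batches, last batch has 57 samples
# val: 23 batches, last batch has 63 samples
# test: 47 batches, last batch has 3 samples
```

These counts should match `len(train_loader)`, `len(val_loader)`, and `len(test_loader)`. Code that assumes every batch holds exactly 64 samples would break on the last batch of each loader.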


In [11]:
import pickle

with open("../data/processed_data.pkl", "wb") as f:
    pickle.dump(
        (X_train_final, y_train_final, X_val, y_val, X_test_scaled, y_test_encoded),
        f
    )

print("✅ Processed data saved to ../data/processed_data.pkl")


✅ Processed data saved to ../data/processed_data.pkl
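
A later notebook would need to unpack the tuple in the same order it was written. The round trip below is a minimal sketch using small stand-in arrays and a temporary directory (illustrative values and names, not the real file), so it runs without the dataset.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in arrays with the same roles as the saved tuple (illustrative values)
toy_splits = (
    np.zeros((4, 3)), np.array([0, 1, 2, 1]),  # X_train_final, y_train_final
    np.zeros((2, 3)), np.array([0, 2]),        # X_val, y_val
    np.zeros((2, 3)), np.array([1, 1]),        # X_test_scaled, y_test_encoded
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "processed_data.pkl")
    with open(toy_path, "wb") as f:
        pickle.dump(toy_splits, f)

    # Unpack in the same order the tuple was written
    with open(toy_path, "rb") as f:
        X_tr, y_tr, X_va, y_va, X_te, y_te = pickle.load(f)

print(X_tr.shape, y_tr.shape, X_va.shape, X_te.shape)
```

Note that the saved tuple holds only arrays. The fitted `scaler` and `label_encoder` are not included, so they would have to be saved separately if new data needs to be scaled or predictions decoded later. Pickle files should only be loaded from trusted sources.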
