# 2. Pre-processing and Feature Extraction (on Raw Collected Data)

In this notebook, we pre-process the raw data collected from the sensors and extract features from it.

We collected the data with the gForce meter in the `Physics Toolbox Suite` app. The app sampled at the highest frequency our phone allowed, so the samples were not evenly spaced: the sampling frequency drifted between 190 Hz and 210 Hz. This made the data difficult to use as is. To fix this, we split each recording into consecutive 20-millisecond windows and took the mean of the samples in each window, which gives evenly spaced samples at 50 Hz.
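As a minimal sketch of the window-averaging idea, the snippet below bins irregular timestamps into 20 ms windows and averages each bin. It uses plain Python lists rather than the pandas code in this notebook, and the function name `downsample_by_window` and the sample readings are made up for illustration.

```python
from collections import defaultdict


def downsample_by_window(times_ms, values, window_ms=20):
    """Average the samples that fall into the same fixed-width time window."""
    buckets = defaultdict(list)
    for t, v in zip(times_ms, values):
        # Window index: 0 for [0, 20) ms, 1 for [20, 40) ms, and so on.
        buckets[int(t // window_ms)].append(v)
    # One averaged value per non-empty window, in time order.
    return [sum(buckets[k]) / len(buckets[k]) for k in sorted(buckets)]


# Hypothetical irregular timestamps (ms) at roughly 200 Hz, with made-up readings.
times_ms = [0, 5, 11, 16, 20, 24, 29, 35, 41, 46]
values = [1.0, 1.2, 0.8, 1.0, 2.0, 2.2, 1.8, 2.0, 3.0, 3.2]

downsampled = downsample_by_window(times_ms, values)
print(len(downsampled))  # one value per 20 ms window
```

Like the notebook code, this sketch skips windows that contain no samples, so a long gap in the recording would shorten the output instead of being filled in.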

To split the data into training and testing sets, we used the leave-one-subject-out strategy. After processing, we save the data in the same directory structure as the UCI HAR and TSFEL datasets.
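The splitting idea can be sketched as follows: every file from a held-out subject goes to the test set, and everything else goes to the training set. The helper `leave_subjects_out` is hypothetical, and holding out subject 12 is an arbitrary choice for illustration; only the `Subject_<id>.csv` naming pattern comes from this notebook.

```python
def leave_subjects_out(filenames, held_out_subjects):
    """Split per-subject files into train and test lists by subject ID."""
    train, test = [], []
    for name in filenames:
        # 'Subject_9.csv' -> 9
        subject_id = int(name.rsplit('.', 1)[0].split('_')[1])
        if subject_id in held_out_subjects:
            test.append(name)
        else:
            train.append(name)
    return train, test


# File names follow the Subject_<id>.csv pattern used in this notebook.
files = [f"Subject_{i}.csv" for i in range(1, 13)]
train_files, test_files = leave_subjects_out(files, held_out_subjects={12})
print(len(train_files), test_files)
```

Splitting by subject rather than by row keeps all recordings from one person on the same side of the split, so the test score reflects performance on an unseen person.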

Here is an explanation of the directories in this folder:

- `./unprocessed`: This dataset contains the raw g-force data captured directly from the device, sampled at the highest available frequency, which varies between 190 and 210 Hz.

- `./processed`: The dataset has been downsampled to 50 Hz by segmenting the data into 20 ms intervals and computing the average acceleration within each segment.

- `./processed_trimmed`: In this dataset, the first 4.5 seconds (225 rows) of the time series data have been removed, as well as the final 0.5 seconds (25 rows), to mitigate any noise present at the start and end of the recordings.

- `./raw_dataset`: This dataset is divided into training and testing subsets. The testing subset specifically contains activity data for Yash Kokane, while the training subset includes data from all other subjects.

- `./TSFEL_features`: This dataset includes 1173 features extracted using TSFEL (Time Series Feature Extraction Library) from the raw_dataset. It does not contain any train-test splits.

- `./TSFEL_dataset`: This dataset also includes 1173 features extracted using TSFEL from the raw_dataset, but it is organized with predefined train-test splits.

We preprocess the data in the following steps:

- Process the raw data to get evenly spaced samples at 50 Hz.
- Trim the data by removing the first 4.5 seconds and the last 0.5 seconds, leaving a 10-second window.
- Split the data into training and testing datasets using the leave-one-subject-out strategy.
- Extract features from the data using the Time Series Feature Extraction Library (TSFEL).
- Filter the TSFEL features as per the ANOVA F-test.

## Imports

In [None]:
import os
import pandas as pd
from datetime import datetime
import shutil
import tsfel
from pathlib import Path
import numpy as np

## Processing the data to get evenly spaced samples at a frequency of 50 Hz

In [4]:
def process_activities(raw_dir, processed_dir):
    activities = os.listdir(raw_dir)

    for activity in activities:
        raw_activity_dir = os.path.join(raw_dir, activity)
        processed_activity_dir = os.path.join(processed_dir, activity)
        
        os.makedirs(processed_activity_dir, exist_ok=True)
        
        for filename in os.listdir(raw_activity_dir):
            if filename.endswith('.csv'):
                raw_filepath = os.path.join(raw_activity_dir, filename)
                
                data = pd.read_csv(raw_filepath)
                
                data['time'] = pd.to_datetime(data['time'], format='%H:%M:%S:%f')
                data['elapsed_time'] = (data['time'] - data['time'].iloc[0]).dt.total_seconds() * 1000
                
                downsampled_data = []

                interval = 20
                start_time = 0
                
                while start_time < data['elapsed_time'].iloc[-1]:
                    end_time = start_time + interval
                    mask = (data['elapsed_time'] >= start_time) & (data['elapsed_time'] < end_time)
                    group = data[mask]

                    if not group.empty:
                        avg_gFx = group['gFx'].mean()
                        avg_gFy = group['gFy'].mean()
                        avg_gFz = group['gFz'].mean()

                        downsampled_data.append([avg_gFx, avg_gFy, avg_gFz])
                    
                    start_time += interval

                downsampled_df = pd.DataFrame(downsampled_data, columns=['accx', 'accy', 'accz'])
                downsampled_df = downsampled_df.round(7)
                
                processed_filepath = os.path.join(processed_activity_dir, filename)
                downsampled_df.to_csv(processed_filepath, index=False)
                
                print(f"Processed and saved: {processed_filepath}")

process_activities('raw', 'processed')


Processed and saved: processed\LAYING\Subject_1.csv
Processed and saved: processed\LAYING\Subject_10.csv
Processed and saved: processed\LAYING\Subject_11.csv
Processed and saved: processed\LAYING\Subject_12.csv
Processed and saved: processed\LAYING\Subject_2.csv
Processed and saved: processed\LAYING\Subject_3.csv
Processed and saved: processed\LAYING\Subject_4.csv
Processed and saved: processed\LAYING\Subject_5.csv
Processed and saved: processed\LAYING\Subject_6.csv
Processed and saved: processed\LAYING\Subject_7.csv
Processed and saved: processed\LAYING\Subject_8.csv
Processed and saved: processed\LAYING\Subject_9.csv
Processed and saved: processed\SITTING\Subject_1.csv
Processed and saved: processed\SITTING\Subject_10.csv
Processed and saved: processed\SITTING\Subject_11.csv
Processed and saved: processed\SITTING\Subject_12.csv
Processed and saved: processed\SITTING\Subject_2.csv
Processed and saved: processed\SITTING\Subject_3.csv
Processed and saved: processed\SITTING\Subject_4.csv

## Trimming the data to the required duration:
- 10 seconds of data are taken from each recording by removing the first 4.5 seconds and the last 0.5 seconds.
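The conversion from seconds to rows is simple arithmetic at 50 Hz: 4.5 s corresponds to 225 rows and 10 s to 500 rows. A minimal sketch, assuming a 15-second recording held in a plain list (the helper `trim_rows` and its parameter names are hypothetical):

```python
def trim_rows(rows, fs_hz=50, skip_start_s=4.5, keep_s=10.0):
    """Drop the first skip_start_s seconds, then keep the next keep_s seconds."""
    start = int(round(skip_start_s * fs_hz))  # 4.5 s * 50 Hz = 225 rows
    length = int(round(keep_s * fs_hz))       # 10 s * 50 Hz = 500 rows
    return rows[start:start + length]


# Hypothetical 15-second recording at 50 Hz, using row indices as stand-in values.
rows = list(range(750))
trimmed = trim_rows(rows)
print(len(trimmed), trimmed[0], trimmed[-1])
```

With a 750-row recording, the 25 rows after the kept window are exactly the final 0.5 seconds.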

In [5]:
base_dir = 'processed'
output_dir = 'processed_trimmed'

os.makedirs(output_dir, exist_ok=True)

for activity in os.listdir(base_dir):
    activity_dir = os.path.join(base_dir, activity)
    
    if os.path.isdir(activity_dir):
        output_activity_dir = os.path.join(output_dir, activity)
        os.makedirs(output_activity_dir, exist_ok=True)
        
        for filename in os.listdir(activity_dir):
            if filename.endswith('.csv'):
                input_filepath = os.path.join(activity_dir, filename)
                output_filepath = os.path.join(output_activity_dir, filename)
                
                data = pd.read_csv(input_filepath)
                
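                # Skip the first 175 rows (3.5 s at 50 Hz).
                data_trimmed = data.iloc[175:]
                
                # Keep the next 500 rows (10 s at 50 Hz).
                data_trimmed = data_trimmed.iloc[:500]
                
                data_trimmed.to_csv(output_filepath, index=False)
                
                # Rows left over after the 175 skipped and 500 kept rows;
                # fewer than 25 means less than 0.5 s remains at the end.
                remaining_rows = len(data) - 675
                if remaining_rows < 25:
                    print(f"Warning: {filename} in {activity} has only {remaining_rows} rows left after processing.")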
                
                print(f"Processed and saved: {output_filepath}")


Processed and saved: processed_trimmed\LAYING\Subject_1.csv
Processed and saved: processed_trimmed\LAYING\Subject_10.csv
Processed and saved: processed_trimmed\LAYING\Subject_11.csv
Processed and saved: processed_trimmed\LAYING\Subject_12.csv
Processed and saved: processed_trimmed\LAYING\Subject_2.csv
Processed and saved: processed_trimmed\LAYING\Subject_3.csv
Processed and saved: processed_trimmed\LAYING\Subject_4.csv
Processed and saved: processed_trimmed\LAYING\Subject_5.csv
Processed and saved: processed_trimmed\LAYING\Subject_6.csv
Processed and saved: processed_trimmed\LAYING\Subject_7.csv
Processed and saved: processed_trimmed\LAYING\Subject_8.csv
Processed and saved: processed_trimmed\LAYING\Subject_9.csv
Processed and saved: processed_trimmed\SITTING\Subject_1.csv
Processed and saved: processed_trimmed\SITTING\Subject_10.csv
Processed and saved: processed_trimmed\SITTING\Subject_11.csv
Processed and saved: processed_trimmed\SITTING\Subject_12.csv
Processed and saved: processed

## Leave-one-subject-out strategy to split the data into testing and training

In [1]:
base_dir = 'raw_dataset'

train_dir = os.path.join(base_dir, 'Train')
test_dir = os.path.join(base_dir, 'Test')

activity_dirs = ['LAYING', 'SITTING', 'STANDING', 'WALKING', 'WALKING_DOWNSTAIRS', 'WALKING_UPSTAIRS']

files_to_move = ['Subject_9.csv', 'Subject_10.csv', 'Subject_11.csv', 'Subject_12.csv']

for activity in activity_dirs:
    activity_train_dir = os.path.join(train_dir, activity)
    activity_test_dir = os.path.join(test_dir, activity)
    
    os.makedirs(activity_test_dir, exist_ok=True)

    # Move the held-out subjects' files (listed in files_to_move) from Train to Test
    for file_name in files_to_move:
        train_file_path = os.path.join(activity_train_dir, file_name)
        test_file_path = os.path.join(activity_test_dir, file_name)
        
        if os.path.exists(train_file_path):
            shutil.move(train_file_path, test_file_path)
            print(f"Moved {file_name} from {activity_train_dir} to {activity_test_dir}")

for activity in activity_dirs:
    activity_test_dir = os.path.join(test_dir, activity)
    
    test_files = os.listdir(activity_test_dir)
    
    for file_name in test_files:
        if file_name not in files_to_move:
            file_path = os.path.join(activity_test_dir, file_name)
            os.remove(file_path)
            print(f"Removed {file_name} from {activity_test_dir}")


Moved Subject_9.csv from raw_dataset\Train\LAYING to raw_dataset\Test\LAYING
Moved Subject_10.csv from raw_dataset\Train\LAYING to raw_dataset\Test\LAYING
Moved Subject_11.csv from raw_dataset\Train\LAYING to raw_dataset\Test\LAYING
Moved Subject_12.csv from raw_dataset\Train\LAYING to raw_dataset\Test\LAYING
Moved Subject_9.csv from raw_dataset\Train\SITTING to raw_dataset\Test\SITTING
Moved Subject_10.csv from raw_dataset\Train\SITTING to raw_dataset\Test\SITTING
Moved Subject_11.csv from raw_dataset\Train\SITTING to raw_dataset\Test\SITTING
Moved Subject_12.csv from raw_dataset\Train\SITTING to raw_dataset\Test\SITTING
Moved Subject_9.csv from raw_dataset\Train\STANDING to raw_dataset\Test\STANDING
Moved Subject_10.csv from raw_dataset\Train\STANDING to raw_dataset\Test\STANDING
Moved Subject_11.csv from raw_dataset\Train\STANDING to raw_dataset\Test\STANDING
Moved Subject_12.csv from raw_dataset\Train\STANDING to raw_dataset\Test\STANDING
Moved Subject_9.csv from raw_dataset\Train\

## Generating TSFEL Features on our dataset:

In [3]:
base_dir = 'raw_dataset/Train'
output_base_dir = 'TSFEL_dataset/Train'

activities = ['LAYING', 'SITTING', 'STANDING', 'WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS']

for activity in activities:
    activity_dir = os.path.join(base_dir, activity)
    output_activity_dir = os.path.join(output_base_dir, activity)
    Path(output_activity_dir).mkdir(parents=True, exist_ok=True)
    subject_files = [f for f in os.listdir(activity_dir) if f.endswith('.csv')]
    for file in subject_files:
        file_path = os.path.join(activity_dir, file)
        df = pd.read_csv(file_path).iloc[:, :]
        cfg = tsfel.get_features_by_domain() 
        # print(cfg)
        for domain in cfg:
            for feature in cfg[domain]:
                cfg[domain][feature]['use'] = 'yes' # use all features, even ones disabled by default

        features = tsfel.time_series_features_extractor(cfg, df, fs=50) # sampling rate 50 Hz
        subject_id = file.split('.')[0]
        output_file = os.path.join(output_activity_dir, f'{subject_id}.csv')
        features.to_csv(output_file, index=False)


*** Feature extraction started ***



*** Feature extraction finished ***


Moved Subject_4.csv from raw_dataset\Train\LAYING to raw_dataset\Test\LAYING
Moved Subject_8.csv from raw_dataset\Train\LAYING to raw_dataset\Test\LAYING
Moved Subject_12.csv from raw_dataset\Train\LAYING to raw_dataset\Test\LAYING
Moved Subject_4.csv from raw_dataset\Train\SITTING to raw_dataset\Test\SITTING
Moved Subject_8.csv from raw_dataset\Train\SITTING to raw_dataset\Test\SITTING
Moved Subject_12.csv from raw_dataset\Train\SITTING to raw_dataset\Test\SITTING
Moved Subject_4.csv from raw_dataset\Train\STANDING to raw_dataset\Test\STANDING
Moved Subject_8.csv from raw_dataset\Train\STANDING to raw_dataset\Test\STANDING
Moved Subject_12.csv from raw_dataset\Train\STANDING to raw_dataset\Test\STANDING
Moved Subject_4.csv from raw_dataset\Train\WALKING to raw_dataset\Test\WALKING
Moved Subject_8.csv from raw_dataset\Train\WALKING to raw_dataset\Test\WALKING
Moved Subject_12.csv from raw_dataset\Train\WALKING to raw_dataset\Test\WALKING
Moved Subject_4.csv from raw_dataset\Train\WALKI

## Filtering TSFEL Features as per best 60 ANOVA F-Value
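The cell below hard-codes the selected column indices and does not show how they were computed. As a rough illustration of the idea only, here is a pure-Python sketch of the one-way ANOVA F-value for a single feature and a top-k selection over several features. The function names and toy data are made up; in practice a library routine such as scikit-learn's `f_classif` computes the same statistic.

```python
def anova_f_value(groups):
    """One-way ANOVA F statistic for one feature.

    groups: one list of feature values per class.
    Assumes at least two classes and non-zero within-class variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_features(feature_table, k):
    """Return the sorted indices of the k features with the largest F-value."""
    scores = [anova_f_value(groups) for groups in feature_table]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: two features, two classes, three samples per class.
feature_table = [
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],  # classes overlap: low F-value
    [[1.0, 1.1, 0.9], [5.0, 5.1, 4.9]],  # classes well separated: high F-value
]
selected = top_k_features(feature_table, k=1)
print(selected)
```

A high F-value means the class means differ a lot relative to the spread within each class, which is why such features are kept.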

In [9]:
selected_feature_indices = np.array([
    0, 1, 3, 6, 7, 260, 298, 300, 302, 303, 304, 306, 308, 311,
    315, 319, 325, 335, 336, 339, 340, 341, 342, 343, 344, 345,
    346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 357, 358,
    359, 360, 361, 362, 363, 364, 365, 374, 679, 683, 687, 689,
    690, 715, 726, 727, 728, 736, 737, 738, 744, 1058
])

print(selected_feature_indices)

original_dir = 'TSFEL_dataset'
filtered_dir = 'TSFEL_filtereddataset'
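# dirs_exist_ok (Python 3.8+) lets this cell be re-run without a FileExistsError.
shutil.copytree(original_dir, filtered_dir, dirs_exist_ok=True)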

feature_names = None

for root, dirs, files in os.walk(filtered_dir):
    for file in files:
        if file.endswith('.csv'):
            file_path = os.path.join(root, file)
            
            df = pd.read_csv(file_path)
            
            if feature_names is None:
                feature_names = df.columns.tolist()
            
            df_filtered = df.iloc[:, selected_feature_indices]
            
            df_filtered.to_csv(file_path, index=False)

if feature_names is not None:
    print("List of features in the dataset:")
    for i, feature in enumerate(feature_names):
        print(f"{i+1}: {feature}")

print(f"Filtered features saved in the directory: {filtered_dir}")


[   0    1    3    6    7  260  298  300  302  303  304  306  308  311
  315  319  325  335  336  339  340  341  342  343  344  345  346  347
  348  349  350  351  352  353  354  355  357  358  359  360  361  362
  363  364  365  374  679  683  687  689  690  715  726  727  728  736
  737  738  744 1058]
List of features in the dataset:
1: accx_Absolute energy
2: accx_Area under the curve
3: accx_Autocorrelation
4: accx_Average power
5: accx_Centroid
6: accx_Detrended fluctuation analysis
7: accx_ECDF Percentile Count_0
8: accx_ECDF Percentile Count_1
9: accx_ECDF Percentile_0
10: accx_ECDF Percentile_1
11: accx_ECDF_0
12: accx_ECDF_1
13: accx_ECDF_2
14: accx_ECDF_3
15: accx_ECDF_4
16: accx_ECDF_5
17: accx_ECDF_6
18: accx_ECDF_7
19: accx_ECDF_8
20: accx_ECDF_9
21: accx_Entropy
22: accx_FFT mean coefficient_0
23: accx_FFT mean coefficient_1
24: accx_FFT mean coefficient_10
25: accx_FFT mean coefficient_100
26: accx_FFT mean coefficient_101
27: accx_FFT mean coefficient_102
28: accx_FFT 