<div class="alert" style="background-color:#29C5F6; color:white; padding:0px 10px; border-radius:5px;">
    <h1 style='margin:15px 15px; color:#000000; font-size:32px'><b>Data Generation (Processing)</b></h1>
        <h2 style='margin:15px 15px; color:#000000; font-size:24px'>WIreless Sensor Data Mining (WISDM) - Human Activity Recognition Problem</h2>
</div>

This work is part of the **Master's thesis** by **Chau Tran**, supervised by **Prof. Roland Olsson**.

<div class="alert" style="background-color:#29C5F6; border-radius:5px; padding:0px 10px; "><h3 style='margin:15px 15px'>6_1. WISDM v1.1</h3></div>

Source: 
* https://www.cis.fordham.edu/wisdm/dataset.php
* https://github.com/AchillesProject/MLCourse2020/blob/main/Project2/HAR_WISDM_MLCourse_v0_DataExploration.ipynb (access required)

Data Format: **[user],[activity],[timestamp],[x-acceleration],[y-accel],[z-accel]**

Number of examples: 1,098,207

Fields:
* user: 1..36
* activity: {Walking, Jogging, Sitting, Standing, Upstairs, Downstairs}
* timestamp: nanoseconds
* x-acceleration: floating-point values between -20 .. 20
* y-accel: floating-point values between -20 .. 20
* z-accel: floating-point values between -20 .. 20
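
As a small illustration of the record layout above, the sketch below parses two made-up lines (the values are hypothetical, not taken from the dataset). It assumes each record ends with a `;`, which is why the z column arrives as text and must be cleaned before casting.

```python
import io
import pandas as pd

# Two made-up records in the documented layout; the trailing ';' mimics the raw file.
sample = io.StringIO(
    "1,Walking,1000000000,0.5,9.8,-0.3;\n"
    "1,Walking,1050000000,0.7,10.1,-0.2;\n"
)
names = ['user', 'activity', 'timestamp', 'x_axis', 'y_axis', 'z_axis']
df = pd.read_csv(sample, header=None, names=names)

# z_axis is read as text because of the ';' terminator, so strip it before casting.
df['z_axis'] = df['z_axis'].str.rstrip(';').astype(float)
print(df.dtypes)
```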

Each acceleration value is the reading along the corresponding axis of the Android phone's accelerometer. A value of 10 corresponds to 1 g = 9.81 m/s^2, and 0 means no acceleration. The recorded acceleration includes gravitational acceleration toward the center of the Earth, so when the phone is at rest on a flat surface the vertical axis registers approximately ±10.
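
The unit convention can be sketched as a pair of conversions. This is only an illustration under the convention stated above (10 raw units = 1 g); the helper names `to_g` and `to_ms2` and the sample reading are hypothetical.

```python
import numpy as np

UNITS_PER_G = 10.0  # stated convention: a raw reading of 10 corresponds to 1 g
G_MS2 = 9.81        # 1 g expressed in m/s^2

def to_g(raw):
    """Convert raw readings to multiples of g."""
    return np.asarray(raw, dtype=float) / UNITS_PER_G

def to_ms2(raw):
    """Convert raw readings to m/s^2."""
    return to_g(raw) * G_MS2

# Hypothetical reading for a phone lying flat and at rest: gravity only, on one axis.
rest = np.array([0.0, 0.0, 10.0])
magnitude_g = np.linalg.norm(to_g(rest))
print(magnitude_g, to_ms2(rest))
```

A magnitude close to 1 g for a stationary phone is a quick sanity check that the axes have been parsed and scaled as expected.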

Information on version 2 of the dataset: https://archive.ics.uci.edu/ml/datasets/WISDM+Smartphone+and+Smartwatch+Activity+and+Biometrics+Dataset+


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, RobustScaler
import sys

TIME_STEPS_arr = [90, 60, 50, 40]
isSTEPS_arr = [True, False]
SPLIT = 0.5

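def divideData_perUser(data, per=0.5):
    # For each (user, activity) block, the first `per` fraction goes to train
    # and the remainder to validation; X_df keeps the full block.
    train_parts, val_parts, all_parts = [], [], []
    for user in np.unique(data['user']):
        dataPerUser = data[data['user']==user]
        for tag in np.unique(dataPerUser['activity']):
            dataPerActivity = dataPerUser[dataPerUser['activity']==tag]
            n = len(dataPerActivity)
            train_parts.append(dataPerActivity[0:int(n*per)])
            val_parts.append(dataPerActivity[int(n*per):n])
            all_parts.append(dataPerActivity)
    # DataFrame.append is not available in pandas >= 2.0, so concatenate once at the end.
    X_df, train_df, val_df = pd.concat(all_parts), pd.concat(train_parts), pd.concat(val_parts)
    return X_df, train_df, val_df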

# Utils functions for segmenting windows
def windows(data,window_size,step):
    start = 0
    while start< data.count():
        yield int(start), int(start + window_size)
        start+= step
def segment_signal(data, window_size = 90, step=40,columns=[]):
    segments = np.empty((0,window_size,len(columns)))
    labels= np.empty((0))
    for user in np.unique(data['user']):
        userdata = data[(data.user == user)]
        for tag in np.unique(userdata['activity']):
            sub_class_data = userdata[(userdata.activity == tag)]
            for (start, end) in windows(pd.Series(sub_class_data.index.values),window_size,step):
                if end > sub_class_data.shape[0] - 1:
                    end = sub_class_data.shape[0]
                    true_length = end - start
                    remaining_data_length = window_size - true_length
                    start -= remaining_data_length
                if (sub_class_data[start:end].isnull().values.any()):
                    print(sub_class_data[start:end].isnull().sum())
                if(sub_class_data[start:end].shape[0] == window_size):
                    segments = np.vstack([segments,np.dstack([sub_class_data[column][start:end] for column in columns])])
                    labels = np.append(labels, tag) 
    return segments, labels.reshape(-1, 1)

wisdmdataset_path = '../../Datasets/6_wisdm/WISDM_ar_v1.1'
COLUMNS = ['x_axis', 'y_axis', 'z_axis']

rdf = pd.read_csv(f'{wisdmdataset_path}/WISDM_ar_v1.1_raw.txt', header=None, names=['user', 'activity', 'timestamp', 'x_axis', 'y_axis', 'z_axis'])
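# Each record ends with ';', which lands in the z column; strip it before casting.
rdf['z_axis'] = rdf['z_axis'].replace(regex=True, to_replace=r';', value=r'')
rdf['x_axis'] = rdf.x_axis.astype(np.float64)
rdf['y_axis'] = rdf.y_axis.astype(np.float64)
rdf['z_axis'] = rdf.z_axis.astype(np.float64)
# Cast timestamps to float; the result must be assigned back to take effect.
rdf['timestamp'] = rdf['timestamp'].apply(lambda x: float(x))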
rdf.dropna(axis=0, how='any', inplace=True)
rdf['activity'] = LabelEncoder().fit(np.unique(rdf['activity'])).transform(rdf['activity'])

X_df, train_df, val_df = divideData_perUser(rdf, SPLIT)

for isSTEPS in isSTEPS_arr:
    for TIME_STEPS in TIME_STEPS_arr:
        STEP = int(round(TIME_STEPS/2,-1)) if isSTEPS else TIME_STEPS
        print(TIME_STEPS, STEP)

        X, y = segment_signal(X_df, window_size=TIME_STEPS, step=STEP,columns=COLUMNS)
        X_train, y_train = segment_signal(train_df, window_size=TIME_STEPS, step=STEP,columns=COLUMNS)
        X_val, y_val = segment_signal(val_df, window_size=TIME_STEPS, step=STEP,columns=COLUMNS)

        y_train = OneHotEncoder().fit_transform(y_train).toarray()
        y_val = OneHotEncoder().fit_transform(y_val).toarray()
        y =  OneHotEncoder().fit_transform(y).toarray()

        y_train = np.tile(y_train, TIME_STEPS).reshape((y_train.shape[0], TIME_STEPS, y_train.shape[1]))
        y_val   = np.tile(y_val, TIME_STEPS).reshape((y_val.shape[0], TIME_STEPS, y_val.shape[1]))
        y       = np.tile(y, TIME_STEPS).reshape((y.shape[0], TIME_STEPS, y.shape[1]))

        df_train = np.concatenate((X_train, y_train), axis=2).reshape((X_train.shape[0], -1))
        df_val = np.concatenate((X_val, y_val), axis=2).reshape((X_val.shape[0], -1))
        df = np.concatenate((X,y), axis=2).reshape((X.shape[0], -1))
        
        print(X_train.shape, y_train.shape, df_train.shape)
        print(X_val.shape, y_val.shape, df_val.shape)
        print(X.shape, y.shape, df.shape)

        with open(f"{wisdmdataset_path}/../wisdm.ni={3}.no={6}.ts={TIME_STEPS}.os={STEP}.spit={0}.all.csv",'w') as csvfile:
            np.savetxt(csvfile, np.array([[3, 6]]),fmt='%d', delimiter=",")
        with open(f"{wisdmdataset_path}/../wisdm.ni={3}.no={6}.ts={TIME_STEPS}.os={STEP}.spit={0}.all.csv",'a') as csvfile:
            np.savetxt(csvfile, df, fmt='%.4f', delimiter=",")

        with open(f"{wisdmdataset_path}/../wisdm.ni={3}.no={6}.ts={TIME_STEPS}.os={STEP}.spit={int(SPLIT*100)}.train.csv",'w') as csvfile:
            np.savetxt(csvfile, np.array([[3, 6]]),fmt='%d', delimiter=",")
        with open(f"{wisdmdataset_path}/../wisdm.ni={3}.no={6}.ts={TIME_STEPS}.os={STEP}.spit={int(SPLIT*100)}.train.csv",'a') as csvfile:
            np.savetxt(csvfile, df_train, fmt='%.4f', delimiter=",")

        with open(f"{wisdmdataset_path}/../wisdm.ni={3}.no={6}.ts={TIME_STEPS}.os={STEP}.spit={int(SPLIT*100)}.val.csv",'w') as csvfile:
            np.savetxt(csvfile, np.array([[3, 6]]),fmt='%d', delimiter=",")
        with open(f"{wisdmdataset_path}/../wisdm.ni={3}.no={6}.ts={TIME_STEPS}.os={STEP}.spit={int(SPLIT*100)}.val.csv",'a') as csvfile:
            np.savetxt(csvfile, df_val, fmt='%.4f', delimiter=",")

<div class="alert" style="background-color:#29C5F6; border-radius:5px; padding:0px 10px; "><h3 style='margin:15px 15px'>6_2. WISDM v2.0</h3></div>

Source: https://archive.ics.uci.edu/ml/datasets/WISDM+Smartphone+and+Smartwatch+Activity+and+Biometrics+Dataset+

Raw data format: **[subject-id],[activity],[timestamp],[x-accel],[y-accel],[z-accel]**

Number of samples for non-hand-oriented activities (5 activities):
* Phone acceleration: 1,338,067
* Watch acceleration: 1,053,141
* Phone gyroscope:    1,006,749
* Watch gyroscope:      949,933

Fields:
* subject-id: 1600..1650 (51 participants)
* activity: {Walking - A, Jogging - B, Stairs - C, Sitting - D, Standing - E}
* timestamp: microsecond (Unix Time)
* x-acceleration: floating-point (can be positive or negative)
* y-accel: floating-point (can be positive or negative)
* z-accel: floating-point (can be positive or negative)

Accelerometer readings are in m/s^2 and gyroscope readings are in rad/s. Gravitational acceleration on Earth, which is included in the accelerometer readings, is about 9.8 m/s^2.
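
Because the accelerometer and gyroscope streams have different sample counts, they have to be aligned before they can be used as six input channels. The sketch below shows one way to do this, assuming rows are matched on identical (user, activity, timestamp) keys and unmatched rows are dropped; the tiny frames and their values are made up for illustration.

```python
import pandas as pd

keys = ['user', 'activity', 'timestamp']

# Made-up accelerometer and gyroscope samples whose timestamps only partly overlap.
accel = pd.DataFrame({
    'user': [1600, 1600, 1600],
    'activity': ['A', 'A', 'A'],
    'timestamp': [1, 2, 3],
    'x_accel': [0.1, 0.2, 0.3],
})
gyro = pd.DataFrame({
    'user': [1600, 1600, 1600],
    'activity': ['A', 'A', 'A'],
    'timestamp': [2, 3, 4],
    'x_gyro': [0.01, 0.02, 0.03],
})

# An inner join keeps only the timestamps present in both sensors.
merged = pd.merge(accel, gyro, on=keys, how='inner')
print(merged)
```

Only timestamps 2 and 3 survive here, which is why the aligned dataset is smaller than either raw stream.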

In [10]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, RobustScaler
import sys, glob, os

TIME_STEPS_arr = [90, 60, 50, 40]
isSTEPS_arr = [True, False]
SPLIT = 0.5
COLUMNS = ['x_accel', 'y_accel', 'z_accel', 'x_gyro', 'y_gyro', 'z_gyro']
activities_arr = ['A', 'B', 'C', 'D', 'E']
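def divideData_perUser(data, per=0.5):
    # For each (user, activity) block, the first `per` fraction goes to train
    # and the remainder to validation; X_df keeps the full block.
    train_parts, val_parts, all_parts = [], [], []
    for user in np.unique(data['user']):
        dataPerUser = data[data['user']==user]
        for tag in np.unique(dataPerUser['activity']):
            # if tag in activities_arr:
            dataPerActivity = dataPerUser[dataPerUser['activity']==tag]
            n = len(dataPerActivity)
            train_parts.append(dataPerActivity[0:int(n*per)])
            val_parts.append(dataPerActivity[int(n*per):n])
            all_parts.append(dataPerActivity)
    # DataFrame.append is not available in pandas >= 2.0, so concatenate once at the end.
    X_df, train_df, val_df = pd.concat(all_parts), pd.concat(train_parts), pd.concat(val_parts)
    return X_df, train_df, val_df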

# Utils functions for segmenting windows
def windows(data,window_size,step):
    start = 0
    while start< data.count():
        yield int(start), int(start + window_size)
        start+= step
def segment_signal(data, window_size = 90, step=40,columns=[]):
    segments = np.empty((0,window_size,len(columns)))
    labels= np.empty((0))
    for user in np.unique(data['user']):
        userdata = data[(data.user == user)]
        for tag in np.unique(userdata['activity']):
            sub_class_data = userdata[(userdata.activity == tag)]
            for (start, end) in windows(pd.Series(sub_class_data.index.values),window_size,step):
                if end > sub_class_data.shape[0] - 1:
                    end = sub_class_data.shape[0]
                    true_length = end - start
                    remaining_data_length = window_size - true_length
                    start -= remaining_data_length
                if (sub_class_data[start:end].isnull().values.any()):
                    print(sub_class_data[start:end].isnull().sum())
                if(sub_class_data[start:end].shape[0] == window_size):
                    segments = np.vstack([segments,np.dstack([sub_class_data[column][start:end] for column in columns])])
                    labels = np.append(labels, tag) 
    return segments, labels.reshape(-1, 1)

wisdm_phone_path = '../../Datasets/6_wisdm/WISDM_ar_v2.0/wisdm-dataset/'
wisdm_phone_accel_path = '../../Datasets/6_wisdm/WISDM_ar_v2.0/wisdm-dataset/raw/phone/accel'
wisdm_phone_accel_files_mask = os.path.join(wisdm_phone_accel_path, '*.txt')
wisdm_phone_accel_files = sorted(glob.glob(wisdm_phone_accel_files_mask))

wisdm_phone_gyro_path = '../../Datasets/6_wisdm/WISDM_ar_v2.0/wisdm-dataset/raw/phone/gyro'
wisdm_phone_gyro_files_mask = os.path.join(wisdm_phone_gyro_path, '*.txt')
wisdm_phone_gyro_files = sorted(glob.glob(wisdm_phone_gyro_files_mask))

# The sorted file lists are paired positionally, which assumes one accel and one gyro file per subject.
user_frames = []
for accel_file, gyro_file in zip(wisdm_phone_accel_files, wisdm_phone_gyro_files):
    accel_data = pd.read_csv(accel_file, header=None, names=['user', 'activity', 'timestamp', 'x_accel', 'y_accel', 'z_accel'], index_col=['user', 'activity', 'timestamp'])
    # Strip the ';' record terminator from the last column.
    accel_data['z_accel'] = accel_data['z_accel'].replace(regex=True, to_replace=r';', value=r'')
    accel_data = accel_data.loc[~accel_data.index.duplicated(keep='first')]
    gyro_data = pd.read_csv(gyro_file, header=None, names=['user', 'activity', 'timestamp', 'x_gyro', 'y_gyro', 'z_gyro'], index_col=['user', 'activity', 'timestamp'])
    gyro_data['z_gyro'] = gyro_data['z_gyro'].replace(regex=True, to_replace=r';', value=r'')
    gyro_data = gyro_data.loc[~gyro_data.index.duplicated(keep='first')]
    # Align both sensors on (user, activity, timestamp) and keep only rows present in both.
    user_data = pd.concat([accel_data, gyro_data], axis=1).dropna()
    user_frames.append(user_data)
# DataFrame.append is not available in pandas >= 2.0, so concatenate once after the loop.
wisdm_phone_data = pd.concat(user_frames)
    
wisdm_phone_data = wisdm_phone_data.reset_index()
wisdm_phone_data['x_accel'] = wisdm_phone_data.x_accel.astype(np.float64)
wisdm_phone_data['y_accel'] = wisdm_phone_data.y_accel.astype(np.float64)
wisdm_phone_data['z_accel'] = wisdm_phone_data.z_accel.astype(np.float64)
wisdm_phone_data['x_gyro'] = wisdm_phone_data.x_gyro.astype(np.float64)
wisdm_phone_data['y_gyro'] = wisdm_phone_data.y_gyro.astype(np.float64)
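wisdm_phone_data['z_gyro'] = wisdm_phone_data.z_gyro.astype(np.float64)
# Cast timestamps to float; the result must be assigned back to take effect.
wisdm_phone_data['timestamp'] = wisdm_phone_data['timestamp'].apply(lambda x: float(x))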
wisdm_phone_data.dropna(axis=0, how='any', inplace=True)
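# Keep only the five non-hand-oriented activities; drop=True avoids adding a stray 'index' column.
wisdm_phone_data = wisdm_phone_data[wisdm_phone_data.activity.isin(activities_arr)].reset_index(drop=True)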
wisdm_phone_data['activity'] = LabelEncoder().fit(np.unique(wisdm_phone_data['activity'])).transform(wisdm_phone_data['activity'])

X_df, train_df, val_df = divideData_perUser(wisdm_phone_data, SPLIT)

for isSTEPS in isSTEPS_arr:
    for TIME_STEPS in TIME_STEPS_arr:
        STEP = int(round(TIME_STEPS/2,-1)) if isSTEPS else TIME_STEPS
        print(TIME_STEPS, STEP)

        X_train, y_train = segment_signal(train_df, window_size=TIME_STEPS, step=STEP,columns=COLUMNS)
        X_val, y_val = segment_signal(val_df, window_size=TIME_STEPS, step=STEP,columns=COLUMNS)
        X, y = segment_signal(X_df, window_size=TIME_STEPS, step=STEP,columns=COLUMNS)
        
        y_train = OneHotEncoder().fit_transform(y_train).toarray()
        y_val = OneHotEncoder().fit_transform(y_val).toarray()
        y =  OneHotEncoder().fit_transform(y).toarray()

        y_train = np.tile(y_train, TIME_STEPS).reshape((y_train.shape[0], TIME_STEPS, y_train.shape[1]))
        y_val   = np.tile(y_val, TIME_STEPS).reshape((y_val.shape[0], TIME_STEPS, y_val.shape[1]))
        y       = np.tile(y, TIME_STEPS).reshape((y.shape[0], TIME_STEPS, y.shape[1]))

        df_train = np.concatenate((X_train, y_train), axis=2).reshape((X_train.shape[0], -1))
        df_val = np.concatenate((X_val, y_val), axis=2).reshape((X_val.shape[0], -1))
        df = np.concatenate((X,y), axis=2).reshape((X.shape[0], -1))
        
        print(X_train.shape, y_train.shape, df_train.shape)
        print(X_val.shape, y_val.shape, df_val.shape)
        print(X.shape, y.shape, df.shape)

        with open(f"{wisdm_phone_path}/wisdm.ni={6}.no={5}.ts={TIME_STEPS}.os={STEP}.spit={0}.all.csv",'w') as csvfile:
            np.savetxt(csvfile, np.array([[6, 5]]),fmt='%d', delimiter=",")
        with open(f"{wisdm_phone_path}/wisdm.ni={6}.no={5}.ts={TIME_STEPS}.os={STEP}.spit={0}.all.csv",'a') as csvfile:
            np.savetxt(csvfile, df, fmt='%.4f', delimiter=",")

        with open(f"{wisdm_phone_path}/wisdm.ni={6}.no={5}.ts={TIME_STEPS}.os={STEP}.spit={int(SPLIT*100)}.train.csv",'w') as csvfile:
            np.savetxt(csvfile, np.array([[6, 5]]),fmt='%d', delimiter=",")
        with open(f"{wisdm_phone_path}/wisdm.ni={6}.no={5}.ts={TIME_STEPS}.os={STEP}.spit={int(SPLIT*100)}.train.csv",'a') as csvfile:
            np.savetxt(csvfile, df_train, fmt='%.4f', delimiter=",")

        with open(f"{wisdm_phone_path}/wisdm.ni={6}.no={5}.ts={TIME_STEPS}.os={STEP}.spit={int(SPLIT*100)}.val.csv",'w') as csvfile:
            np.savetxt(csvfile, np.array([[6, 5]]),fmt='%d', delimiter=",")
        with open(f"{wisdm_phone_path}/wisdm.ni={6}.no={5}.ts={TIME_STEPS}.os={STEP}.spit={int(SPLIT*100)}.val.csv",'a') as csvfile:
            np.savetxt(csvfile, df_val, fmt='%.4f', delimiter=",")

90 40
(9561, 90, 6) (9561, 90, 5) (9561, 990)
(9561, 90, 6) (9561, 90, 5) (9561, 990)
(19064, 90, 6) (19064, 90, 5) (19064, 990)
60 30
(12742, 60, 6) (12742, 60, 5) (12742, 660)
(12742, 60, 6) (12742, 60, 5) (12742, 660)
(25311, 60, 6) (25311, 60, 5) (25311, 660)
50 20
(19061, 50, 6) (19061, 50, 5) (19061, 550)
(19061, 50, 6) (19061, 50, 5) (19061, 550)
(38022, 50, 6) (38022, 50, 5) (38022, 550)
40 20
(19064, 40, 6) (19064, 40, 5) (19064, 440)
(19064, 40, 6) (19064, 40, 5) (19064, 440)
(38025, 40, 6) (38025, 40, 5) (38025, 440)
90 90
(4287, 90, 6) (4287, 90, 5) (4287, 990)
(4287, 90, 6) (4287, 90, 5) (4287, 990)
(8517, 90, 6) (8517, 90, 5) (8517, 990)
60 60
(6380, 60, 6) (6380, 60, 5) (6380, 660)
(6380, 60, 6) (6380, 60, 5) (6380, 660)
(12744, 60, 6) (12744, 60, 5) (12744, 660)
50 50
(7671, 50, 6) (7671, 50, 5) (7671, 550)
(7671, 50, 6) (7671, 50, 5) (7671, 550)
(15300, 50, 6) (15300, 50, 5) (15300, 550)
40 40
(9565, 40, 6) (9565, 40, 5) (9565, 440)
(9565, 40, 6) (9565, 40, 5) (9565, 4
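
The shapes and step sizes printed above can be reproduced on dummy data. The sketch below is only an illustration with two hypothetical windows: it shows how the overlap step is derived from the window length and why a 90-step window with 6 inputs and 5 one-hot outputs flattens to 990 columns.

```python
import numpy as np

# Overlap step as computed in the loop: half the window, rounded to the nearest ten
# (Python rounds exact ties to the even multiple, so 45 -> 40 and 25 -> 20).
steps = [int(round(ts / 2, -1)) for ts in [90, 60, 50, 40]]
print(steps)  # [40, 30, 20, 20], matching the first four headers in the output above

n_windows, time_steps, n_in, n_out = 2, 90, 6, 5
X = np.zeros((n_windows, time_steps, n_in))   # dummy sensor windows
labels = np.array([0, 3])                     # hypothetical class ids
onehot = np.eye(n_out)[labels]                # shape (2, 5)

# Repeat each window's label at every time step, then append it to the inputs.
y = np.tile(onehot, time_steps).reshape(n_windows, time_steps, n_out)
rows = np.concatenate((X, y), axis=2).reshape(n_windows, -1)
print(X.shape, y.shape, rows.shape)           # third shape is time_steps * (n_in + n_out)
```

Each CSV row therefore interleaves 6 sensor values and 5 label values per time step, giving 90 × 11 = 990 columns for the longest window.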

<div class="alert" style="background-color:#29C5F6; border-radius:5px; padding:0px 10px; "><h3 style='margin:15px 15px'>6_3. WISDM v3.0</h3></div>

Source: https://archive.ics.uci.edu/ml/datasets/WISDM+Smartphone+and+Smartwatch+Activity+and+Biometrics+Dataset+

Raw data format: **[subject-id],[activity],[timestamp],[x-accel],[y-accel],[z-accel]**

Number of samples for non-hand-oriented activities (5 activities):
* Phone acceleration: 1,338,067
* Watch acceleration: 1,053,141
* Phone gyroscope:    1,006,749
* Watch gyroscope:      949,933

Fields:
* subject-id: 1600..1650 (51 participants)
* activity: {Walking - A, Jogging - B, Stairs - C, Sitting - D, Standing - E}
* timestamp: microsecond (Unix Time)
* x-acceleration: floating-point (can be positive or negative)
* y-accel: floating-point (can be positive or negative)
* z-accel: floating-point (can be positive or negative)

Accelerometer readings are in m/s^2 and gyroscope readings are in rad/s. Gravitational acceleration on Earth, which is included in the accelerometer readings, is about 9.8 m/s^2.