# Overview

I used a convolutional neural network (CNN), adapted from a KGP Talkie tutorial, to predict the labels of `test_time_series.csv`. I trained the model on accelerometer data from the WISDM dataset (linked below) and evaluated its performance on the course-provided `train_time_series.csv` and `train_labels.csv`.

I chose the WISDM dataset for two reasons: it has more samples than the course-provided .csv files, and the model scored a higher accuracy when trained on WISDM than when trained on the course-provided training data.

1. **KGP Talkie's Video**:
    https://youtu.be/lUI6VMj43PE
2. **WISDM Dataset**
    http://www.cis.fordham.edu/wisdm/dataset.php

In [1]:
import time
start = time.time() # runtime

## Data Pre-processing

The WISDM dataset's activity labels are Walking, Jogging, Upstairs, Downstairs, Sitting, and Standing. To get a dataset similar to WISDM, I first merged `train_time_series.csv` and `train_labels.csv` into a single pandas dataframe and converted the `activity` column back to label format (e.g., "Walking" instead of 2). Because `train_labels.csv` provides a label only for every 10th observation, I filled the resulting NaN values in the `activity` column by backfilling from the next labeled row. I also edited `test_time_series.csv` and the merged training dataframe (e.g., removing all columns except `x`, `y`, `z`, and `activity`). The merged dataframe was saved as `edited_validation_database.csv` and the edited `test_time_series.csv` was saved as `edited_test_database.csv`.

In [2]:
import pandas as pd


def merge_processing(database):
    del database['Unnamed: 0_y']
    del database['UTC time_x']
    del database['UTC time_y']
    del database['accuracy']
    del database['timestamp']
    del database['Unnamed: 0_x']
    database.columns = ['x', 'y', 'z', 'activity']
    return database


def activity_decoding(database):

    def label_activity(row):
        if row['activity'] == 1:
            return 'Standing'
        elif row['activity'] == 2:
            return 'Walking'
        elif row['activity'] == 3:
            return 'Downstairs'
        elif row['activity'] == 4:
            return 'Upstairs'

    database['activity'] = database.apply(lambda row: label_activity(row), axis=1)

    return database


if __name__ == "__main__":
    # creating merged train CSV (i.e., train time-series + train labels)
    train_labels = pd.read_csv('train_labels.csv')
    train_time_series = pd.read_csv('train_time_series.csv')
    merged_database = pd.merge(left=train_time_series, right=train_labels, how='left', left_on="timestamp", right_on="timestamp")
    merged_database = merge_processing(merged_database)
    #fillna used to turn nan values to correct activities
    merged_database['activity'] = merged_database['activity'].bfill()
    merged_database = activity_decoding(merged_database)

    # Edited test CSV
    test_time_series = pd.read_csv('test_time_series.csv')
    del test_time_series['UTC time']
    del test_time_series['accuracy']
    del test_time_series['Unnamed: 0']
    del test_time_series['timestamp']
    test_time_series.columns = ['x', 'y', 'z']

    # Saved data-frame objects as edited CSVs
    merged_database.to_csv('edited_validation_database.csv')
    test_time_series.to_csv('edited_test_database.csv')
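
The merge-and-backfill step described above can be illustrated on a toy example. This is a minimal sketch, not the notebook's actual data: the frames `time_series` and `labels`, their values, and the column name `label` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train_time_series.csv and train_labels.csv (values are made up).
time_series = pd.DataFrame({
    "timestamp": [100, 200, 300, 400, 500, 600],
    "x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
labels = pd.DataFrame({"timestamp": [300, 600], "label": [2, 4]})

# A left merge keeps every accelerometer row; rows without a label get NaN.
merged = pd.merge(time_series, labels, how="left", on="timestamp")

# Backfill copies each label upward onto the unlabeled rows that precede it.
merged["label"] = merged["label"].bfill().astype(int)

print(merged["label"].tolist())  # [2, 2, 2, 4, 4, 4]
```

Backfilling assigns each unlabeled row the label of the next labeled row, which matches labels that are recorded at the end of each block of observations.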

## More Data Pre-processing

Edits to `WISDM_ar_v1.1_raw.txt`:
1. Formatted to pandas dataframe
2. Dropped timestamp & user-id column
3. Balanced data (e.g., Walking, Standing etc have the same number of samples)
4. Scaled x, y, z columns & converted to float
5. Re-indexed
6. Removed unwanted activities (Jogging, Sitting) that do not exist in course data
7. Dropped every other row of database (since course data is taken at 10 Hz, and WISDM data at 20 Hz)

Similar edits were made to `edited_validation_database.csv` and `edited_test_database.csv`. The edited WISDM database was saved as `scaled_train_database.csv`, the edited `edited_validation_database.csv` was saved as `scaled_validation_database.csv`, and the edited `edited_test_database.csv` was saved as `scaled_test_database.csv`.

In [3]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

# WISDM_formatting() taken from: https://youtu.be/lUI6VMj43PE?t=467
def WISDM_formatting(csv_name): 
    # read all lines and close the file automatically
    with open(csv_name) as type_file:
        lines = type_file.readlines()

    Processed_List = []
    for i, line in enumerate(lines):
        try:
            line = line.split(',')
            last = line[5].split(';')[0]
            last = last.strip()
            if last == '':
                break
            temp = [line[0], line[1], line[2], line[3], line[4], last]
            Processed_List.append(temp)
        except Exception:
            print("Error at line number: ", i)

    columns = ['user', 'activity', 'time', 'x', 'y', 'z']

    edited_database = pd.DataFrame(data=Processed_List, columns=columns)
    edited_database = edited_database.drop(['user', 'time'], axis=1).copy()

    return edited_database

# balance_scale_index() function 
# created using KGP Talkie: https://youtu.be/lUI6VMj43PE
def balance_scale_index(database):
    # defining scaling used
    scale = StandardScaler()

    # making 'x', 'y', 'z' columns float values
    database['x'] = database['x'].astype('float')
    database['y'] = database['y'].astype('float')
    database['z'] = database['z'].astype('float')

    if 'activity' in database.columns:
        # determining number of samples of activity with least samples
        # if database has 'activity' column
        min_value = min(database['activity'].value_counts())

        # taking min_value number of rows from each activity
        Walking = database[database['activity'] == 'Walking'].head(min_value).copy()
        Downstairs = database[database['activity'] == 'Downstairs'].head(min_value).copy()
        Upstairs = database[database['activity'] == 'Upstairs'].head(min_value).copy()
        Standing = database[database['activity'] == 'Standing'].head(min_value).copy()

        # concatenate balanced data (DataFrame.append was removed in pandas 2.0)
        database = pd.concat([Walking, Downstairs, Upstairs, Standing])

        # scale 'x', 'y', 'z' columns from balanced data and create pd.Dataframe()
        x = database[['x', 'y', 'z']]
        x = scale.fit_transform(x)
        scaled_data = pd.DataFrame(data=x, columns=['x', 'y', 'z'])

        # add activity column
        y = database['activity']
        scaled_data['activity'] = y.values

    else:
        # scale 'x', 'y', 'z' columns and create pd.Dataframe()
        x = database[['x', 'y', 'z']]
        x = scale.fit_transform(x)
        scaled_data = pd.DataFrame(data=x, columns=['x', 'y', 'z'])

    scaled_data.index = np.arange(0, len(scaled_data))

    return scaled_data


if __name__ == "__main__":
    WISDM = WISDM_formatting('WISDM_ar_v1.1_raw.txt')
    WISDM = balance_scale_index(WISDM)
    WISDM = WISDM.iloc[::2]

    Validation = pd.read_csv('edited_validation_database.csv')
    Validation = balance_scale_index(Validation)

    Test = pd.read_csv('edited_test_database.csv')
    Test = balance_scale_index(Test)

    WISDM.to_csv("scaled_train_database.csv")
    Validation.to_csv("scaled_validation_database.csv")
    Test.to_csv("scaled_test_database.csv")

Error at line number:  281873
Error at line number:  281874
Error at line number:  281875
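
The balancing and downsampling steps listed above can be sketched on toy data. This is only an illustration under assumptions: the frame `df` and its values are made up, and unlike the cell above this sketch filters out unwanted activities before counting the smallest class.

```python
import pandas as pd

# Toy data: unequal class sizes plus an activity that is absent from the course data.
df = pd.DataFrame({
    "activity": ["Walking"] * 6 + ["Standing"] * 4 + ["Jogging"] * 8,
    "x": range(18),
})

# Keep only the activities that also exist in the course data.
wanted = ["Walking", "Standing"]
kept = df[df["activity"].isin(wanted)]

# Balance: take the first min_count rows of every remaining activity.
min_count = kept["activity"].value_counts().min()
balanced = kept.groupby("activity", sort=False).head(min_count)

# Downsample from 20 Hz to 10 Hz by keeping every other row, then re-index.
downsampled = balanced.iloc[::2].reset_index(drop=True)

print(downsampled["x"].tolist())  # [0, 2, 6, 8]
```

Taking the first rows of each class keeps consecutive samples together, which matters for time-series windows; random sampling would break up the sequences.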


## Functions 
Here I define some functions that I use later on. Data must be formatted in a particular way before it can be fed into a model, and `create_segments()` and `create_segments_and_labels()` handle this formatting (functions taken from KGP Talkie). Additionally, since a neural network cannot interpret string labels such as "Walking", the labels must be encoded as integers. For this I defined `encode_databases()`.

In [4]:
from sklearn.preprocessing import LabelEncoder
import scipy.stats as stats
import numpy as np

# label encoder
label = LabelEncoder()

# encodes databases that have 'activity' columns
def encode_databases(database1, database2):
    database1['label'] = label.fit_transform(database1['activity'].values.ravel())
    database2['label'] = label.transform(database2['activity'].values.ravel())
    return database1, database2

# function create_segments_and_labels() taken from Nils Ackermann
# https://towardsdatascience.com/human-activity-recognition-har-tutorial-with-keras-and-core-ml-part-1-8c05e365dfa0
def create_segments_and_labels(df, time_steps, step, label_name):
    # x, y, z acceleration as features
    n_features = 3
    # Number of steps to advance in each iteration (for me, it should always
    # be equal to the time_steps in order to have no overlap between segments)
    # step = time_steps
    segments = []
    labels = []
    for i in range(0, len(df) - time_steps, step):
        xs = df['x'].values[i: i + time_steps]
        ys = df['y'].values[i: i + time_steps]
        zs = df['z'].values[i: i + time_steps]
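        # retrieve the most frequent label in the segment
        # (np.atleast_1d handles SciPy versions where mode is returned as a scalar or as an array)
        label = np.atleast_1d(stats.mode(df[label_name].values[i: i + time_steps])[0])[0]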
        segments.append([xs, ys, zs])
        labels.append(label)

    # reshape segment & labels
    reshaped_segments = np.asarray(segments, dtype=np.float32).reshape(-1, time_steps, n_features)
    labels = np.asarray(labels)

    return reshaped_segments, labels


def create_segments(df, time_steps, step):
    # x, y, z acceleration features
    n_features = 3
    
    segments = []
    for i in range(0, len(df) - time_steps, step):
        xs = df['x'].values[i: i + time_steps]
        ys = df['y'].values[i: i + time_steps]
        zs = df['z'].values[i: i + time_steps]
        segments.append([xs, ys, zs])

    # reshape segment
    reshaped_segments = np.asarray(segments, dtype=np.float32).reshape(-1, time_steps, n_features)

    return reshaped_segments
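
To make the windowing concrete, the sketch below reproduces only the index arithmetic of the two functions above on a made-up signal of 50 rows. The helper `window_starts` is hypothetical and exists only for illustration.

```python
import numpy as np


def window_starts(n_rows, time_steps, step):
    """Start indices produced by range(0, n_rows - time_steps, step)."""
    return list(range(0, n_rows - time_steps, step))


# 50 rows, windows of 9 rows, advancing 10 rows each time.
starts = window_starts(n_rows=50, time_steps=9, step=10)
print(starts)  # [0, 10, 20, 30, 40]

# Use the row index itself as the signal so the covered rows are easy to see.
signal = np.arange(50)
windows = np.stack([signal[s: s + 9] for s in starts])
print(windows.shape)  # (5, 9)

# Rows that fall between windows because the step is larger than the window.
skipped = sorted(set(range(50)) - set(windows.ravel().tolist()))
print(skipped)  # [9, 19, 29, 39, 49]
```

With a step larger than the window length, the windows never overlap and one row between consecutive windows is not used.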



## Creating the Model

Now that all the functions are defined, I need to do the following:
1. Encode the activity columns of train_database (i.e., the WISDM database) and validation_database (i.e., the course training data) with `encode_databases()`.
2. Format train_database into appropriate segments and labels, then train_test_split them (to have the model test itself with WISDM data). I'll also format validation_database into appropriate segments and labels to later test how well the model can predict the labels of course data.
3. Reshape x_train, x_test, x_val into a shape that can be accepted by the model (4D shape).
4. Create the model (which is taken directly from KGP Talkie)
5. Fit the model on x_train/y_train and evaluate its performance on x_val/y_val.

As the output shows, the model reaches only roughly 45-47% accuracy on the course data (44.8% in the run shown here), which isn't great. This could be due to not having enough data points. It could also be a result of combining two different datasets that were recorded under different conditions, which are not controlled for. I used two accelerometer datasets in order to have more data points; this was the preferred choice because using just one dataset led to lower accuracy.

In [5]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.optimizers import Adam
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import scipy.stats as stats

# defining label encoder
label = LabelEncoder()

# reading CSVs to pandas dataframe objects
train_database = pd.read_csv("scaled_train_database.csv")
validation_database = pd.read_csv("scaled_validation_database.csv")
test_database = pd.read_csv("scaled_test_database.csv")

# encoding databases
train_database, validation_database = encode_databases(train_database, validation_database)

Fs = 10  # sampling rate: 10 because course data is recorded at 10 Hz
frame_size = 9  # number of rows to take for each prediction
hop_size = 10  # number of rows to hop from one frame to the next (hop_size > frame_size, so frames do not overlap and one row is skipped between them)

x_train, y_train = create_segments_and_labels(df=train_database,
                                              time_steps=frame_size,
                                              step=hop_size,
                                              label_name='label'
                                              )

x_train, x_test, y_train, y_test = train_test_split(x_train,
                                                    y_train,
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=y_train
                                                    )

x_val, y_val = create_segments_and_labels(df=validation_database,
                                          time_steps=frame_size,
                                          step=hop_size,
                                          label_name='label'
                                          )


# reshaping: since the model accepts 4D data (samples, rows, columns, channels),
# x_train, x_test and x_val must be reshaped; -1 lets NumPy infer the number of segments
x_train = x_train.reshape(-1, frame_size, 3, 1)
x_test = x_test.reshape(-1, frame_size, 3, 1)
x_val = x_val.reshape(-1, frame_size, 3, 1)

# layering model (from KGP Talkie)
model = Sequential()
model.add(Conv2D(16, (2, 2), activation='relu'))
model.add(Dropout(0.1))
model.add(Conv2D(32, (2, 2), activation="relu"))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
# 6 output units (WISDM has six activities), although only 4 classes remain in the training labels
model.add(Dense(6, activation="softmax"))

# compiling model
model.compile(optimizer=Adam(learning_rate=0.001), loss="sparse_categorical_crossentropy", metrics=['accuracy'])

# fitting model
history = model.fit(x_train, y_train, epochs=15, validation_data=(x_test, y_test), verbose=1)

# evaluating model and print out performance
scores = model.evaluate(x_val, y_val, verbose=1)
print("val_loss:", scores[0], "accuracy:", scores[1])

Train on 568 samples, validate on 143 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
val_loss: 1.3456271589511917 accuracy: 0.44761905
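
The 4D input shape and the two `Conv2D` layers above determine how many features reach the dense layer. Below is a minimal sketch of that shape arithmetic, assuming the Keras defaults of `padding='valid'` and a stride of 1 (the cell above does not override them). The helper `conv2d_valid_shape` is hypothetical and only for illustration.

```python
def conv2d_valid_shape(height, width, kernel=(2, 2)):
    """Output height/width of a stride-1 convolution without padding."""
    return height - kernel[0] + 1, width - kernel[1] + 1


# Each segment enters the model as 9 rows x 3 columns x 1 channel.
h, w = 9, 3

h, w = conv2d_valid_shape(h, w)  # after Conv2D(16, (2, 2))
print((h, w, 16))  # (8, 2, 16)

h, w = conv2d_valid_shape(h, w)  # after Conv2D(32, (2, 2))
print((h, w, 32))  # (7, 1, 32)

# Flatten() turns the last feature map into one vector per segment.
flattened = h * w * 32
print(flattened)  # 224
```

Because the input is only 3 columns wide, two 2x2 convolutions without padding reduce the width to 1, so no further 2x2 layer could be stacked on this input.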


## Predicting Test Data
The test data must be formatted from `test_database` using `create_segments()` and reshaped so that the model can interpret it. Then I use the model to predict the class of each `x_test` segment. This gives a list of integer-encoded classes, which need to be converted to their labels with `label.inverse_transform()`. After that, the labels need to be converted to integers again so that 'Standing' = 1, 'Walking' = 2, etc.

In [6]:
# making test data segments
x_test = create_segments(df=test_database,
                         time_steps=frame_size,
                         step=hop_size
                         )

# re-shaping test data
x_test = x_test.reshape(-1, frame_size, 3, 1)

# predicting results
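# argmax over the softmax outputs gives the predicted class index
# (same result as model.predict_classes(), which newer TensorFlow releases no longer provide)
program_encoded_class_results = np.argmax(model.predict(x_test, verbose=1), axis=1)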
label_class_result = label.inverse_transform(program_encoded_class_results)

# results to be submitted
predicted_labels_coded = []
for i in range(len(label_class_result)):
    if label_class_result[i] == 'Standing':
        predicted_labels_coded.append(1)
    elif label_class_result[i] == 'Walking':
        predicted_labels_coded.append(2)
    elif label_class_result[i] == 'Downstairs':
        predicted_labels_coded.append(3)
    elif label_class_result[i] == 'Upstairs':
        predicted_labels_coded.append(4)

print("There are", len(predicted_labels_coded), "predictions.")
print("The predicted classes are:", predicted_labels_coded)

end = time.time() # timer for runtime
print("The runtime in seconds is:", end - start)

There are 125 predictions.
The predicted classes are: [4, 4, 4, 4, 4, 2, 4, 3, 2, 4, 4, 4, 3, 4, 4, 3, 4, 1, 4, 3, 4, 4, 1, 2, 4, 4, 4, 3, 3, 4, 4, 4, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 2, 4, 4, 2, 4, 4, 4, 4, 4, 1, 3, 4, 3, 4, 4, 3, 4, 4, 2, 2, 2, 3, 4, 3, 3, 3, 2, 2, 4, 4, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 4, 3, 2, 2, 3, 2, 4, 3, 2, 4, 2, 3, 2, 3, 2, 3, 2, 3, 2, 2, 2, 2, 2, 4, 3, 2, 2, 4, 2, 4]
The runtime in seconds is: 7.755272388458252
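
The two-step conversion above (encoded class to label, label to course code) is needed because `LabelEncoder` numbers classes in sorted order, which differs from the course's coding. A small sketch with made-up predictions, where `encoder`, `course_codes`, and `encoded_predictions` are illustrative names that do not appear in the notebook:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integers in sorted (alphabetical) order of the class names.
encoder = LabelEncoder()
encoder.fit(["Walking", "Standing", "Downstairs", "Upstairs"])
print(list(encoder.classes_))  # ['Downstairs', 'Standing', 'Upstairs', 'Walking']

# Course coding of the activities, as used in the cell above.
course_codes = {"Standing": 1, "Walking": 2, "Downstairs": 3, "Upstairs": 4}

# Made-up encoded predictions, decoded to names and then mapped to course codes.
encoded_predictions = [3, 1, 0, 2]
names = encoder.inverse_transform(encoded_predictions)
coded = [course_codes[name] for name in names]
print(coded)  # [2, 1, 3, 4]
```

A dictionary lookup raises a `KeyError` on an unexpected label, whereas the `if`/`elif` chain above would silently skip it and produce a shorter list.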


In [7]:
# appending predicted labels to test_labels.csv and overwriting test_labels.csv file
test_labels = pd.read_csv("test_labels.csv")
test_labels['label'] = predicted_labels_coded
test_labels.to_csv("test_labels.csv")