# Sleeping Dataset Analysis
### Ruixuan Dong
### 04/29/2023

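Preface:
On how to do research:
 - Should each week's update be merged into a single file or written separately? For example, how should the weekly report/ipynb be versioned?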

## Table of Contents
- [1 - Packages](#1)
- [2 - Load Model and Process the Entire Dataset](#2)
- [3 - Two day-to-day generalization scenarios](#3)
    - [3.1 - 7/3 Data Set Selecting](#3-1)
        - [3.1.1 - LightGBM](#3-1-1)
    - [3.2 - One Day Out](#3-2)
- [4 - GAN Method](#4)

<a name='1'></a>
## 1 - Packages

Begin by importing all the packages we'll need during this assignment. 

- [numpy](https://www.numpy.org/) is the fundamental package for scientific computing with Python.
- [pandas](https://pandas.pydata.org/) is the package for working with data frames in Python.
- [matplotlib](http://matplotlib.org) is a library to plot graphs in Python.
- `utilities` provides helper functions for loading and processing the dataset.

In [2]:
# Import packages
import numpy as np
import pandas as pd
from utilities import *
import warnings
warnings.filterwarnings('ignore')
from tensorflow import keras

2023-06-06 18:11:24.439464: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


<a name='2'></a>
## 2 - Load Model and Process the Entire Dataset

**Problem Statement**: We use the "signal_1000_posture_4" dataset in this study. It has 1000 columns in total: the first column is a timestamp, the second column, `Action`, is the label, and the remaining 998 columns are features. We want to use these features to build an efficient classifier that predicts sleeping actions more accurately. As a first step, we load the model that was fitted on a training set drawn from the first 70% of the data from the first 15 days.

Let's test how this model performs on the whole dataset. Load the data by running the cell below.

In [3]:
# Load the model fitted with data augmentation, obtained from last week's work.
model = keras.models.load_model('my_model.h5')
# Load the entire data set
train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, dataset = load_larger_dataset()



In [8]:
pd.DataFrame(dataset.iloc[:, 1].value_counts())

| Action | count |
|---|---|
| go_to_the_bed | 8975 |
| sleep_on_right_side | 2688 |
| sleep_on_left_side | 2571 |
| sleep_on_stomach | 1882 |


In [3]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit the scaler to your dataframe and transform it
test_x_std = scaler.fit_transform(test_set_x_orig)
test_x = pd.DataFrame(test_x_std)
test_set_y = transform_label(test_set_y_orig)

result = model.predict(test_x)
tmp = np.argmax(result, axis=1)

from sklearn.metrics import accuracy_score
print(accuracy_score(test_set_y,tmp))

0.49140085256504484


In [4]:
from sklearn.metrics import confusion_matrix
NN_confusion_matrix = confusion_matrix(test_set_y, tmp).T
print(NN_confusion_matrix)

[[6483 1296 2185 2403]
 [ 458  152  198  152]
 [  90   33   41   30]
 [  47   19    9   10]]


In [5]:
pd.DataFrame(train_set_y_orig.value_counts())

| Action | count |
|---|---|
| go_to_the_bed | 1897 |
| sleep_on_stomach | 382 |
| sleep_on_left_side | 138 |
| sleep_on_right_side | 93 |


In [6]:
pd.DataFrame(test_set_y_orig.value_counts())

| Action | count |
|---|---|
| go_to_the_bed | 7078 |
| sleep_on_right_side | 2595 |
| sleep_on_left_side | 2433 |
| sleep_on_stomach | 1500 |


Probably because the class distributions of the training set and the new test set differ, the old model's accuracy on the test set is poor. The confusion matrix makes this especially clear: most predictions fall into a single class.
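
To make the mismatch concrete, the class counts and the confusion matrix printed above can be turned into proportions. This is a minimal sketch that re-enters the numbers from the outputs above by hand. It assumes that class index 0 corresponds to `go_to_the_bed`, followed by `sleep_on_stomach`, `sleep_on_left_side`, and `sleep_on_right_side` (the column sums of the matrix match the test-set counts in that order), and that, because of the `.T`, rows are predicted classes and columns are true classes.

```python
import numpy as np
import pandas as pd

# Class counts copied from the value_counts() outputs above.
train_counts = pd.Series({
    "go_to_the_bed": 1897, "sleep_on_stomach": 382,
    "sleep_on_left_side": 138, "sleep_on_right_side": 93,
})
test_counts = pd.Series({
    "go_to_the_bed": 7078, "sleep_on_stomach": 1500,
    "sleep_on_left_side": 2433, "sleep_on_right_side": 2595,
})

# Share of each class in the two sets.
shares = pd.DataFrame({
    "train_share": train_counts / train_counts.sum(),
    "test_share": test_counts / test_counts.sum(),
})
print(shares.round(3))

# Transposed confusion matrix copied from above: rows = predicted, columns = true.
cm_t = np.array([
    [6483, 1296, 2185, 2403],
    [458, 152, 198, 152],
    [90, 33, 41, 30],
    [47, 19, 9, 10],
])

accuracy = np.trace(cm_t) / cm_t.sum()
recall = np.diag(cm_t) / cm_t.sum(axis=0)        # correct / true count, per class
predicted_share = cm_t.sum(axis=1) / cm_t.sum()  # how often each class is predicted
print("accuracy:", round(accuracy, 4))
print("per-class recall:", recall.round(3))
print("predicted share:", predicted_share.round(3))
```

Under these assumptions, the model predicts class 0 for about 91% of the test samples, while that class makes up about 52% of the test set and about 76% of the training set. Recall on the three sleeping postures is about 10% or lower.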

<a name = '3'></a>
## 3 - Two day-to-day generalization scenarios

<a name = '3-1'></a>
### 3.1 - 7/3 Data Set Selecting

In [7]:
import random

def load_larger_dataset_first_scenario():
    dataset = pd.read_csv('signal_1000_posture_4.csv')
    dataset = dataset.reset_index(drop = True)
    total_rows = dataset.shape[0]
    train_nrows = int(total_rows * 0.7)

    train = dataset.iloc[:train_nrows, :]
    test = dataset.iloc[train_nrows: , :]

    rows = train.values.tolist()
    random.shuffle(rows)
    train = pd.DataFrame(data=rows, columns=train.columns)

    rows = test.values.tolist()
    random.shuffle(rows)
    test = pd.DataFrame(data=rows, columns=test.columns)

    train_set_x_orig = train.iloc[:, 2:] # train set features
    train_set_y_orig = train.iloc[:, 1] # train set labels

    test_set_x_orig = test.iloc[:, 2:]# test set features
    test_set_y_orig = test.iloc[:, 1] # test set labels


    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, dataset, train, test

train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, dataset, train_set, test_set = load_larger_dataset_first_scenario()

In [8]:
pd.DataFrame(train_set_y_orig.value_counts())

| Action | count |
|---|---|
| go_to_the_bed | 5896 |
| sleep_on_right_side | 2526 |
| sleep_on_left_side | 1451 |
| sleep_on_stomach | 1408 |


In [9]:
pd.DataFrame(test_set_y_orig.value_counts())

| Action | count |
|---|---|
| go_to_the_bed | 3079 |
| sleep_on_left_side | 1120 |
| sleep_on_stomach | 474 |
| sleep_on_right_side | 162 |


In [10]:
train_set_y = transform_label(train_set_y_orig)
test_set_y = transform_label(test_set_y_orig)

Y = pd.get_dummies(train_set_y)
Y = Y.replace({True: 1, False: 0}).astype(float)

from sklearn.preprocessing import StandardScaler

# create a StandardScaler object
scaler = StandardScaler()

# fit the scaler to your dataframe and transform it
train_x_std = scaler.fit_transform(train_set_x_orig)
train_x = pd.DataFrame(train_x_std)
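
The cell above fits the `StandardScaler` on the training features. A common practice, sketched below on synthetic data, is to reuse that fitted scaler for the test features with `transform` rather than fitting a second scaler on the test set, so that both sets are scaled with the same mean and standard deviation. The arrays `X_train` and `X_test` are made-up stand-ins for `train_set_x_orig` and `test_set_x_orig`; whether this would change the accuracy reported in this notebook has not been checked.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real feature matrices (hypothetical data).
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
X_test = rng.normal(loc=0.5, scale=2.0, size=(80, 5))  # deliberately shifted

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_std = scaler.transform(X_test)        # apply the same mean/std to test data

# The training features are centered; the test features keep their shift,
# which is the information a separately fitted scaler would erase.
print(X_train_std.mean(axis=0).round(3))
print(X_test_std.mean(axis=0).round(3))
```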

In [11]:
from keras import backend as K
from tensorflow.keras import layers, regularizers
import tensorflow as tf
warnings.filterwarnings('ignore')

def create_first_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1000,)), # number of input features per sample
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(10000, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(5000, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(200, activation="relu", kernel_regularizer = regularizers.l2(0.01)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(20, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(4, activation=tf.nn.softmax)
    ])
    model.compile(optimizer='adam', loss="categorical_crossentropy", metrics=['acc'])
    return model

In [12]:
first_model = create_first_model()
callbacks = [
    # Stop when the training loss has not improved for 5 consecutive epochs
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5),
    # Write logs for TensorBoard
    tf.keras.callbacks.TensorBoard(
        log_dir='my_log_dir',
        histogram_freq=1,
        embeddings_freq=1,
    ),
]

first_history = first_model.fit(
    train_x.values,
    Y.values,
    validation_split=0.3,
    epochs=100,
    batch_size=100,
    callbacks=callbacks,  # pass the list directly instead of nesting it in another list
);

Epoch 1/100 ... Epoch 38/100 (training log ends at epoch 38 of 100; per-epoch metrics not shown)


In [13]:
scaler = StandardScaler()
# fit the scaler to your dataframe and transform it
test_x_std = scaler.fit_transform(test_set_x_orig)
test_x = pd.DataFrame(test_x_std)

result = first_model.predict(test_x)
tmp = np.argmax(result, axis=1)

from sklearn.metrics import accuracy_score
print(accuracy_score(test_set_y,tmp))

0.47094105480868664


In [14]:
from sklearn.metrics import confusion_matrix
NN_confusion_matrix = confusion_matrix(test_set_y, tmp).T
print(NN_confusion_matrix)

[[2132  307  972   68]
 [ 353   97   50   62]
 [ 145   13   26   10]
 [ 449   57   72   22]]


After getting this result, I considered balancing the dataset with data augmentation methods such as jitter, permutation, or window slicing. However, the model fitted on the augmented dataset performed much worse, reaching an accuracy of only 29% on the test set.
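
For reference, the three augmentation methods named above can be sketched in NumPy as follows. These functions are illustrative only and are not the ones used to produce the 29% result; the names, the noise level `sigma`, the number of segments, and the window `ratio` are all assumptions, and the sine wave stands in for one 1000-point signal row.

```python
import numpy as np

def jitter(x, sigma=0.03, rng=None):
    """Add zero-mean Gaussian noise to every sample point."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, sigma, size=x.shape)

def permutation(x, n_segments=4, rng=None):
    """Split a 1-D signal into segments and shuffle their order."""
    rng = np.random.default_rng() if rng is None else rng
    segments = np.array_split(x, n_segments)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order])

def window_slice(x, ratio=0.9, rng=None):
    """Cut a random contiguous window and stretch it back to the original length."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    win = max(2, int(np.ceil(ratio * n)))
    start = rng.integers(0, n - win + 1)
    window = x[start:start + win]
    # Linear interpolation back to n points.
    return np.interp(np.linspace(0, win - 1, n), np.arange(win), window)

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 1000))  # hypothetical 1000-point signal
augmented = [
    jitter(signal, rng=rng),
    permutation(signal, rng=rng),
    window_slice(signal, rng=rng),
]
print([a.shape for a in augmented])
```

When augmentation is used for class balancing, it would normally be applied only to training rows of the under-represented classes, never to the test set.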

Running time is also still a problem. When I train a larger model with more layers and more neurons, I usually have to wait several hours for the parameters to converge. As a result, I trained far fewer models than last week, which may be another reason for this week's poor results.
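
Part of the running time can be explained by the size of the network. The following back-of-the-envelope count uses the layer widths from `create_first_model` above and the standard formulas: a dense layer has `(inputs + 1) * units` parameters, and Keras' `BatchNormalization` keeps 4 values per feature by default (two trainable, two non-trainable). It is an estimate, not output from `model.summary()`.

```python
# Layer widths taken from create_first_model(): input, hidden layers, output.
widths = [1000, 10000, 5000, 1000, 1000, 200, 20, 4]

# Dense layer: one weight per (input, unit) pair plus one bias per unit.
dense_params = [(n_in + 1) * n_out for n_in, n_out in zip(widths[:-1], widths[1:])]

# BatchNormalization is applied to the input and to every hidden layer output
# (not to the final softmax layer): gamma, beta, moving mean, moving variance.
bn_params = [4 * n for n in widths[:-1]]

print("dense parameters per layer:", dense_params)
print("total dense parameters:", sum(dense_params))
print("total batch-norm parameters:", sum(bn_params))
print("grand total:", sum(dense_params) + sum(bn_params))
```

About 50 million of the roughly 66 million parameters sit in the 10000-to-5000 layer, so narrowing the first two hidden layers would likely be the most direct way to shorten training.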

I then tried LightGBM, a gradient-boosting framework based on decision trees that trains much faster. Although the results (accuracy and confusion matrix) were similar, the running time was shorter.

In [None]:
fit_results = NN_augmentation_fit_44_days()  # renamed from `dict` to avoid shadowing the built-in

In [None]:
tmp = NN_augmentation_get_44_days_prediction()
tmp