## Code

### Preparing the environment

In [3]:
# Import libraries
import os
import re
import pandas as pd
import numpy as np
import dtale

import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# Change directory
dir_old = os.getcwd()    
os.chdir('../../')      # work under "My Documents"
dir_MyDoc = os.getcwd()
target_path = r'DSAI\Kaggle_Competitions\CMI_Detect Sleep States\RawData'    # raw string keeps the backslashes literal
os.chdir(os.path.join(dir_MyDoc, target_path))    
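The directory change above hard-codes Windows separators. A minimal sketch of a separator-independent alternative, assuming the same folder names; the helper `build_data_dir` and the base folder `"My Documents"` are hypothetical and used only for illustration:

```python
from pathlib import Path


def build_data_dir(base, *parts):
    """Join path components without hard-coding a path separator."""
    return Path(base).joinpath(*parts)


# Hypothetical base folder; the remaining components mirror the ones used above
data_dir = build_data_dir("My Documents", "DSAI", "Kaggle_Competitions",
                          "CMI_Detect Sleep States", "RawData")
print(data_dir.as_posix())
```

Files can then be opened with `pd.read_csv(data_dir / "train_events.csv")` without changing the working directory.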

### Overview of **train_events.csv**
First, we load **train_events.csv** and print *df.info()*. The dataset contains five columns: two numeric (night and step), two categorical (series_id and event) and one date-time (timestamp). According to the data description on the competition webpage, “night” and “step” are discrete rather than continuous. Note that “step” and “timestamp” are populated in only 9585 of the 14508 rows.

In [5]:
# Load train_events.csv
file = './train_events.csv'
df = pd.read_csv(file)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14508 entries, 0 to 14507
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   series_id  14508 non-null  object 
 1   night      14508 non-null  int64  
 2   event      14508 non-null  object 
 3   step       9585 non-null   float64
 4   timestamp  9585 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 566.8+ KB
None


Next we run *df.nunique()*. The result shows that the data was collected from 277 accelerometers and that there are only two events, which are our target classes (i.e. onset and wakeup). Together with *df.info()*, this tells us that the 277 accelerometers account for 14508 event rows in total, of which 9585 have a recorded step. Because there are only 7499 unique step values, some step values occur more than once.

In [6]:
print(df.nunique())

series_id     277
night          84
event           2
step         7499
timestamp    9360
dtype: int64
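Whether a repeated step value belongs to the same accelerometer or to different ones cannot be read from *df.nunique()* alone. The following is a minimal sketch of how this could be checked, run here on a made-up toy frame with the same column names; the helper `summarise_steps` is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_events.csv (values are made up for illustration)
toy = pd.DataFrame({
    "series_id": ["a", "a", "a", "a", "b", "b"],
    "night":     [1, 1, 2, 2, 1, 1],
    "event":     ["onset", "wakeup", "onset", "wakeup", "onset", "wakeup"],
    "step":      [100.0, 200.0, np.nan, np.nan, 100.0, 250.0],
})


def summarise_steps(events):
    """Count rows, recorded steps, and repeated step values."""
    recorded = events.dropna(subset=["step"])
    # The same (series_id, step) pair appearing twice would mean two events
    # at one moment on one device
    repeated_within = int(recorded.duplicated(["series_id", "step"]).sum())
    # Step values shared between any rows, regardless of device
    repeated_overall = len(recorded) - recorded["step"].nunique()
    return {
        "rows": len(events),
        "recorded": len(recorded),
        "repeated_within_series": repeated_within,
        "repeated_overall": int(repeated_overall),
    }


summary = summarise_steps(toy)
print(summary)
```

In the toy data the step value 100 appears for both devices, so it is counted as repeated overall but not within a series.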


### Looking into the details of the data
We also run *dtale.show()* and look into the details. 
The 75th percentile of column “night” is 21. However, this does not mean that 75% of the accelerometers (samples) recorded 21 nights of data, since we have not yet grouped the data by series_id. We will do so and count the nights each accelerometer recorded, so that we can identify the number of nights that 75% of accelerometers recorded. This helps us determine the number of samples that need to be removed.
The figure below shows the distribution of the target variable “event”, which is uniform across the two classes. However, the figure illustrating the frequency of events over step shows that “wakeup” has three more counts than “onset”. We need to investigate this contradiction.


In [6]:
d = dtale.show(df)
print(d._main_url)

http://SkyrimsHammer:40000/dtale/main/1
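The two checks described in this section (nights per accelerometer and the balance between the two events) can also be done without the D-Tale interface. A minimal sketch on a made-up toy frame, assuming the column names of **train_events.csv**:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_events.csv (values are made up for illustration)
toy = pd.DataFrame({
    "series_id": ["a"] * 6 + ["b"] * 4 + ["c"] * 2,
    "night":     [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 1, 1],
    "event":     ["onset", "wakeup"] * 6,
    "step":      [10, 20, 30, 40, np.nan, 60, 5, 15, np.nan, np.nan, 7, 9],
})

# Number of nights logged per accelerometer: one value per series_id
nights_per_series = toy.groupby("series_id")["night"].max()
q75 = nights_per_series.quantile(0.75)

# Event counts restricted to rows that actually have a recorded step
event_counts = toy.dropna(subset=["step"])["event"].value_counts()

print(nights_per_series)
print(q75)
print(event_counts)
```

Counting events only among rows with a recorded step is what exposes an imbalance: every night has both an onset and a wakeup row, so the raw counts are always equal.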


### Investigation of the contradictory data on target variable
The output below identifies the accelerometers that recorded an odd number of timesteps. Each of these accelerometers has one night on which only one of “onset” or “wakeup” was recorded, as shown in the table. The records for those nights will be excluded from the rest of the project.

In [7]:
# Identify the accelerometers with suspicious records
gp = df.groupby('series_id')['step'].count() 
print(gp[gp % 2 == 1])

series_id
0ce74d6d2106    63
154fe824ed87    61
44a41bba1ee7    23
efbfc4526d58     9
f8a8da8bdd00    43
Name: step, dtype: int64


In [12]:
target_replace = gp[gp % 2 == 1]
df_contradict = pd.DataFrame({'series_id':target_replace.index, 
                              'night':[20, 30, 10, 7, 17], 
                              'additional event':['onset', 'wakeup', 'wakeup', 'wakeup', 'wakeup']
                              })
df_contradict

| | series_id | night | additional event |
|---|---|---|---|
| 0 | 0ce74d6d2106 | 20 | onset |
| 1 | 154fe824ed87 | 30 | wakeup |
| 2 | 44a41bba1ee7 | 10 | wakeup |
| 3 | efbfc4526d58 | 7 | wakeup |
| 4 | f8a8da8bdd00 | 17 | wakeup |
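The nights and events in the table above were entered by hand. As a hedged alternative, they could be derived from the data by looking for nights with exactly one recorded step. This is a minimal sketch on a made-up toy frame; the helper `find_unpaired_events` is hypothetical and assumes at most one onset row and one wakeup row per night:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_events.csv (values are made up for illustration)
toy = pd.DataFrame({
    "series_id": ["a"] * 6 + ["b"] * 4 + ["c"] * 2,
    "night":     [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 1, 1],
    "event":     ["onset", "wakeup"] * 6,
    "step":      [10, 20, 30, 40, np.nan, 60, 5, 15, np.nan, np.nan, 7, 9],
})


def find_unpaired_events(events):
    """Return the events whose (series_id, night) has exactly one recorded step."""
    recorded = events.dropna(subset=["step"])
    # Number of recorded steps in each row's own (series_id, night) group
    per_night = recorded.groupby(["series_id", "night"])["step"].transform("count")
    cols = ["series_id", "night", "event"]
    return recorded.loc[per_night == 1, cols].reset_index(drop=True)


unpaired = find_unpaired_events(toy)
print(unpaired)
```

Nights where both steps are missing are ignored, since they contain no unpaired event.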


In published research, subjects were required to wear the accelerometer for seven to nine consecutive days. We will compute summary statistics on our data to determine which duration suits our dataset better.

| Reference | days |
|---|---|
| 1,3 | 9 consecutive days |
| 4 | 9 specific days |
| 2 | 7 consecutive days |

### Determining the time interval 1 - preparing the dataset
To achieve this, we first replace the five timesteps that cause the contradictions described above with null values. Then we identify the nights without records and compute the maximum number of consecutive nights for which each accelerometer has records.

In [13]:
# Replace the five timesteps which cause the contradictions
for i in df_contradict.index:
    sid = df_contradict.loc[i, 'series_id']
    night = df_contradict.loc[i, 'night']
    event = df_contradict.loc[i, 'additional event']

    # Null out both step and timestamp of the unpaired event
    mask = (df['series_id'] == sid) & (df['night'] == night) & (df['event'] == event)
    df.loc[mask, ['step', 'timestamp']] = np.nan

# Delete variables that become unnecessary for the remaining tasks
del(target_replace, df_contradict, sid, night, event, i)
   
# Save the updated dataset into .csv
# df.to_csv('./train_events_replacement.csv')

In [14]:
# Identify nights without records
gp = df.groupby('series_id')['step'].count()
gp = pd.DataFrame({'sid': gp.index, 'step_num': gp.values})
gp['empt_night'] = ''


for sid in gp.sid:
    df_temp = df[(df.series_id == sid)]
    idx = gp[gp.sid == sid].index[0]
    nights = []

    # Check if each night has a pair of steps
    empty_night = df_temp[df_temp['step'].isna()]['night']
    empty_night = empty_night.unique()
    gp.at[idx, 'empt_night'] = empty_night.tolist()

    # Coding for the number of consecutive days that an accelerometer collected records
    max_night = df_temp['night'].max()    # df_temp holds a single series_id
    gp.at[idx, 'max_night'] = max_night
    mt_night = gp.at[idx, 'empt_night']

    if mt_night:
        con_night = 0    # run length before night 1, used when the first empty night is night 1
        for i in range(len(mt_night)):
            if mt_night[i] != 1:
                if i == 0:
                    con_night = mt_night[i] - 1
                elif i+1 == len(mt_night) and mt_night[i] < max_night:
                    con_night = max_night - mt_night[i]
                else:
                    con_night = mt_night[i] - mt_night[i-1] - 1
            elif len(mt_night) == 1:
                con_night = max_night - mt_night[i]

            nights.append(con_night)

        gp.at[idx, 'max_cont_night'] = max(nights)
    else:
        gp.at[idx, 'max_cont_night'] = max_night

# Configure the dtype of numeric data
gp['step_num'] = gp['step_num'].astype(np.int8)
gp['max_night'] = gp['max_night'].astype(np.int8)
gp['max_cont_night'] = gp['max_cont_night'].astype(np.int8)

gp.to_csv('./trE_cont_nights.csv')
del(df_temp, sid, idx, nights, empty_night, max_night, mt_night, i, con_night)

In [15]:
# Look into the new table "gp"
d = dtale.show(gp)
print(d._main_url)

http://SkyrimsHammer:40000/dtale/main/1
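The longest run of consecutive nights with records can also be expressed as the largest gap between neighbouring empty nights. The following is a minimal sketch of that idea, not the computation used to build "gp". The helper `longest_recorded_run` is hypothetical. It always considers the stretch before the first empty night, between every pair of empty nights, and after the last one, so its values may differ from those produced by the loop above.

```python
def longest_recorded_run(empty_nights, max_night):
    """Longest stretch of consecutive nights in 1..max_night that has records.

    `empty_nights` lists the nights (within 1..max_night) without records.
    """
    # Sentinels at 0 and max_night + 1 cover the stretches before the first
    # and after the last empty night
    bounds = [0] + sorted(set(empty_nights)) + [max_night + 1]
    return max(later - earlier - 1 for earlier, later in zip(bounds, bounds[1:]))


print(longest_recorded_run([], 10))      # no empty nights
print(longest_recorded_run([3], 10))     # nights 4..10 form the longest run
print(longest_recorded_run([5, 6], 6))   # nights 1..4 form the longest run
```

Applied to "gp", this could be called as `longest_recorded_run(row.empt_night, row.max_night)` for each row.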


### Determining the time interval 2 - Looking into the dataset

We have prepared a dataset "gp" to determine the time interval. Recalling the references, most of the studies used data collected over nine consecutive days, while one used seven consecutive days. We have visualised the data with *dtale.show()* and explain the highlights below.

The table below shows the statistical summary of the maximum number of consecutive nights for which each accelerometer collected data. About 75% of accelerometers collected data on at least five consecutive nights. We run *gp['max_cont_night'].quantile()* and find that selecting accelerometers with at least seven consecutive days of data covers 65% of the samples. Alternatively, if we tighten the criterion to nine consecutive days of data, 57% of the samples are selected.

In [16]:
gp['max_cont_night'].describe()

count    277.000000
mean      12.256318
std        8.249023
min        0.000000
25%        5.000000
50%       11.000000
75%       18.000000
max       35.000000
Name: max_cont_night, dtype: float64

In [17]:
percent = pd.DataFrame({'Day': [gp['max_cont_night'].quantile(0.35), gp['max_cont_night'].quantile(0.43)], 'Percentage' : [1-0.35, 1-0.43],
                        'Number of Samples': [(277*.65), 277*.57]})
percent

| | Day | Percentage | Number of Samples |
|---|---|---|---|
| 0 | 7.0 | 0.65 | 180.05 |
| 1 | 9.0 | 0.57 | 157.89 |
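Multiplying 277 by a quantile level gives fractional sample counts such as 180.05. A minimal sketch of a more direct count, which tallies the accelerometers that meet each threshold; the helper `coverage` and the toy values are hypothetical:

```python
import pandas as pd


def coverage(max_cont_night, thresholds):
    """Count and share of accelerometers with at least `t` consecutive nights."""
    rows = []
    for t in thresholds:
        selected = max_cont_night >= t
        rows.append({
            "min_nights": t,
            "count": int(selected.sum()),
            "share": float(selected.mean()),
        })
    return pd.DataFrame(rows)


# Made-up run lengths standing in for gp['max_cont_night']
toy_runs = pd.Series([0, 3, 5, 7, 7, 9, 12, 18, 20, 35])
table = coverage(toy_runs, [7, 9])
print(table)
```

With the real data the call would be `coverage(gp['max_cont_night'], [7, 9])`, which returns whole-number counts for the seven-day and nine-day criteria.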


Since we want our model to generalise better, and since the longest run of consecutive nights with data is 35, we choose seven consecutive days as our criterion, so about 180 samples will be included in the model training stage. To exploit the dataset, we are going to create a new table that holds the number of steps and the sleep duration recorded each night. The new table is expected to look as follows:

|sid|night|number of steps|sleep hours|sleep minutes|onset hour|wakeup hour|
|---|---|---|---|---|---|---|
|example 1 | 1 | 7000 | 7 | 0 | 22 | 07|
|example 1 | 2 | 8500 | 5 | 20 | 02 | 07 |   

### Exploiting the dataset - creating a new table

We will first extract the year, month, day, hour and minute from the timestamp and store them in the same dataframe. Then we will do the calculation and save the results into a new dataframe in which each accelerometer has one entry per night. This new dataframe is expected to match the layout shown above.

In [18]:
# Extract the year, month, day, hour and minute
df_temp = pd.read_csv('./train_events_replacement.csv', index_col=0)
df_temp = df_temp.dropna()
df_temp['UTC_timestamp'] = pd.to_datetime(df_temp['timestamp'], utc=True)

df_temp['year'] = df_temp['UTC_timestamp'].dt.year
df_temp['month'] = df_temp['UTC_timestamp'].dt.month
df_temp['day'] = df_temp['UTC_timestamp'].dt.day
df_temp['hour'] = df_temp['UTC_timestamp'].dt.hour
df_temp['minute'] = df_temp['UTC_timestamp'].dt.minute
df_temp = df_temp.drop('timestamp', axis=1)
df_temp = df_temp.drop('UTC_timestamp', axis=1)

In [1]:
# Compute the number of steps for each night and store in a new dataframe
# Requires gp (In [14]) and df_temp (In [18]) to exist in the current kernel session
col_sid = []
col_night = []
col_diff = []

for sid in gp['sid'].values:
    max_night = int(gp.loc[gp['sid'] == sid, 'max_night'].iloc[0])
    df_sid = df_temp[df_temp['series_id'] == sid]

    for night in range(1, max_night+1):
        df_night = df_sid[df_sid['night'] == night]
        step_on = df_night[df_night['event'] == 'onset']['step'].values
        step_up = df_night[df_night['event'] == 'wakeup']['step'].values

        # Skip nights without a complete onset/wakeup pair (their rows were dropped as NaN)
        if len(step_on) == 0 or len(step_up) == 0:
            continue
        diff = step_up[0] - step_on[0]

        col_sid.append(sid)
        col_night.append(night)
        col_diff.append(diff)

df_diff = pd.DataFrame({'sid': col_sid,
                        'night': col_night,
                        'step_number': col_diff
                        })


NameError: name 'gp' is not defined
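The per-night table described earlier could also be built without nested loops by pairing each onset row with the wakeup row of the same night. This is a minimal sketch on made-up data, not the notebook's method. The helper `nightly_sleep_table` is hypothetical, and the hours are taken after conversion to UTC (as in the extraction cell above), so they are UTC hours rather than local ones.

```python
import pandas as pd

# Toy stand-in for train_events.csv (values are made up for illustration)
toy = pd.DataFrame({
    "series_id": ["a"] * 4,
    "night": [1, 1, 2, 2],
    "event": ["onset", "wakeup", "onset", "wakeup"],
    "step": [1000.0, 6040.0, 21160.0, 25000.0],
    "timestamp": ["2018-08-14T22:00:00-0400", "2018-08-15T05:00:00-0400",
                  "2018-08-16T02:00:00-0400", "2018-08-16T07:20:00-0400"],
})


def nightly_sleep_table(events):
    """One row per (series_id, night) that has both onset and wakeup."""
    events = events.dropna(subset=["step", "timestamp"]).copy()
    events["ts"] = pd.to_datetime(events["timestamp"], utc=True)

    keys = ["series_id", "night"]
    onset = events[events["event"] == "onset"].set_index(keys)
    wakeup = events[events["event"] == "wakeup"].set_index(keys)
    # The inner join keeps only nights where both events were recorded
    paired = onset.join(wakeup, lsuffix="_onset", rsuffix="_wakeup", how="inner")

    duration = paired["ts_wakeup"] - paired["ts_onset"]
    minutes = (duration.dt.total_seconds() // 60).astype(int)
    return pd.DataFrame({
        "number of steps": (paired["step_wakeup"] - paired["step_onset"]).astype(int),
        "sleep hours": minutes // 60,
        "sleep minutes": minutes % 60,
        "onset hour": paired["ts_onset"].dt.hour,    # UTC hour
        "wakeup hour": paired["ts_wakeup"].dt.hour,  # UTC hour
    }).reset_index()


sleep_table = nightly_sleep_table(toy)
print(sleep_table)
```

Deriving the duration from timestamps avoids any assumption about how long one step lasts.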

## Note - missing entries
While implementing the code to exploit **train_events.csv**, we found that accelerometer 137771d19ca2 has its first night of data missing from the sleep log. Although this does not affect the EDA above, we add the following code to check whether other accelerometers have missing entries.

In [None]:
# Check missing entries
df_check = pd.read_csv('./train_events.csv')
for sid in gp.sid:
    max_night = gp[(gp.sid == sid)].max_night.values[0]
    all_nights = df_check[df_check.series_id == sid]['night'].unique()

    for i in range(1, max_night+1):
        if i not in all_nights:
            print(f'Missing entry identified : \tsid [{sid}] \tnight [{i}]')    

KeyboardInterrupt: 
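The same check can be written so that it depends only on the events table and not on "gp". A minimal sketch under that assumption, where the last night of each accelerometer is taken from the events themselves; the helper `find_missing_nights` and the toy data are hypothetical:

```python
import pandas as pd


def find_missing_nights(events):
    """Map each series_id to the nights absent from 1..max(night)."""
    missing = {}
    for sid, nights in events.groupby("series_id")["night"]:
        expected = set(range(1, int(nights.max()) + 1))
        gaps = sorted(expected - set(nights.tolist()))
        if gaps:
            missing[sid] = gaps
    return missing


# Toy stand-in: series "a" has no rows for night 1
toy = pd.DataFrame({
    "series_id": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "night":     [2, 2, 3, 3, 1, 1, 2, 2],
})
missing = find_missing_nights(toy)
print(missing)
```

Grouping once by series_id avoids filtering the full table for every accelerometer.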

## References
1.	Estimating sleep parameters using an accelerometer without sleep diary
2.	Genetic studies of accelerometer-based sleep measures yield new insights into human sleep behaviour
3.	A Novel, Open Access Method to Assess Sleep Duration Using a Wrist-Worn Accelerometer
4.	Sleep classification from wrist-worn accelerometer data using random forests