## Code

### Preparing the environment

In [1]:
# Import libraries
import os
import pandas as pd
import numpy as np
import dtale

In [2]:
# Change directory
dir_old = os.getcwd()    
os.chdir('../../')      # work under "My Documents"
dir_MyDoc = os.getcwd()
target_path = r'DSAI\Kaggle_Competitions\CMI_Detect Sleep States\RawData'    # raw string: backslashes are path separators, not escapes
os.chdir(os.path.join(dir_MyDoc, target_path))    

### Overview of **train_events.csv**
First, we load **train_events.csv** and print *df.info()*. The dataset contains five columns: two numeric (night and step), two categorical (series_id and event) and one date-time (timestamp, which pandas reads as a plain object column). According to the data description on the competition webpage, “night” and “step” are discrete rather than continuous.

In [39]:
# Load train_events.csv
file = './train_events.csv'
df = pd.read_csv(file)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14508 entries, 0 to 14507
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   series_id  14508 non-null  object 
 1   night      14508 non-null  int64  
 2   event      14508 non-null  object 
 3   step       9585 non-null   float64
 4   timestamp  9585 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 566.8+ KB
None
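
The *df.info()* output reports “step” as float64 and “timestamp” as object, which is how pandas stores integer and date-time columns that contain missing values or unparsed strings. A minimal sketch of tightening these dtypes follows. The toy rows and the timestamp format (ISO 8601 with a UTC offset) are assumptions made for illustration, not values taken from the file.

```python
import pandas as pd

# Toy rows standing in for train_events.csv (all values are made up)
toy = pd.DataFrame({
    "series_id": ["a", "a", "a", "a"],
    "night": [1, 1, 2, 2],
    "event": ["onset", "wakeup", "onset", "wakeup"],
    "step": [4992.0, 10932.0, None, None],
    "timestamp": ["2018-08-14T22:26:00-0400", "2018-08-15T06:41:00-0400", None, None],
})

# The nullable integer dtype keeps missing steps as <NA> instead of forcing float64
toy["step"] = toy["step"].astype("Int64")

# Parse the offset-aware strings; utc=True yields one timezone-aware dtype
toy["timestamp"] = pd.to_datetime(toy["timestamp"], utc=True)

print(toy.dtypes)
```

Converting to UTC is only one option; if local clock time matters for the analysis, the offset would need to be kept or restored separately.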


Next we run *df.nunique()*. The result shows that the data was collected from 277 accelerometers and that there are only two event types, which are our target classes (onset and wakeup). Combined with the *df.info()* output, this tells us that the table holds 14508 event rows for the 277 accelerometers, of which 9585 have a recorded step. Those 9585 recorded steps take only 7499 distinct values, so some step values occur more than once.

In [4]:
print(df.nunique())

series_id     277
night          84
event           2
step         7499
timestamp    9360
dtype: int64
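
A repeated step value does not by itself show that two events were recorded at the same timestep of the same accelerometer, since the same step number can also appear in different series. The sketch below separates the two cases. The helper name `repeated_steps` and the toy data are hypothetical and only illustrate the check.

```python
import pandas as pd

def repeated_steps(events: pd.DataFrame) -> tuple:
    """Count rows whose step value repeats (a) anywhere and (b) within one series."""
    recorded = events.dropna(subset=["step"])
    across = recorded["step"].duplicated(keep=False).sum()
    within = recorded.duplicated(subset=["series_id", "step"], keep=False).sum()
    return int(across), int(within)

# Toy data (made up): step 100 occurs in two different series, never twice in one
toy = pd.DataFrame({
    "series_id": ["a", "a", "b", "b", "c"],
    "event": ["onset", "wakeup", "onset", "wakeup", "onset"],
    "step": [100.0, 200.0, 100.0, 300.0, None],
})

across, within = repeated_steps(toy)
print(across, within)
```

If the second number is zero on the real data, the repeated values come only from different accelerometers sharing a step number.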


### Looking into the details of the data
We also run *dtale.show()* and look into the details. 
The 75th percentile of the column “night” is 21. This does not mean that 75% of the accelerometers (samples) recorded 21 nights of data, however, because we have not yet grouped the data by series_id. We will group the data and count the nights each accelerometer recorded, so that we can identify the number of nights that 75% of the accelerometers reached. This helps us decide how many samples need to be removed.
The figure below shows the distribution of the target variable “event”, which is balanced between the two classes. However, the figure that plots event frequency over step shows that “wakeup” has three more counts than “onset”. We need to investigate this contradiction.
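
The per-accelerometer grouping described above can be sketched as follows. The helper name `nights_per_series` and the toy data are illustrative assumptions. Using the highest night number as the night count also assumes that nights are numbered from 1 without gaps.

```python
import pandas as pd

def nights_per_series(events: pd.DataFrame) -> pd.Series:
    """Highest night number recorded for each series_id."""
    return events.groupby("series_id")["night"].max()

# Toy data (made up): four accelerometers covering 2, 4, 10 and 20 nights
last_night = {"a": 2, "b": 4, "c": 10, "d": 20}
toy = pd.DataFrame(
    [(sid, night, event)
     for sid, last in last_night.items()
     for night in range(1, last + 1)
     for event in ("onset", "wakeup")],
    columns=["series_id", "night", "event"],
)

nights = nights_per_series(toy)
print(nights)

# Percentile over accelerometers, not over event rows
print(nights.quantile(0.75))
```

Taking the percentile after grouping weights every accelerometer once, whereas the percentile of the raw column weights long series more heavily because they contribute more rows.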


In [8]:
d = dtale.show(df)
print(d._main_url)

http://SkyrimsHammer:40000/dtale/main/1


### Investigation of the contradictory data on target variable
The output below identifies the accelerometers that recorded an odd number of timesteps. Each of these accelerometers has one night on which only one of the two events (“onset” or “wakeup”) was recorded, as shown in the table. The records for those nights will be excluded from the rest of the project.

In [40]:
# Identify the accelerometers with suspicious records
gp = df.groupby('series_id')['step'].count() 
print(gp[gp % 2 == 1])

series_id
0ce74d6d2106    63
154fe824ed87    61
44a41bba1ee7    23
efbfc4526d58     9
f8a8da8bdd00    43
Name: step, dtype: int64


In [41]:
target_replace = gp[gp % 2 == 1]
df_contradict = pd.DataFrame({'series_id':target_replace.index, 
                              'night':[20, 30, 10, 7, 17], 
                              'additional event':['onset', 'wakeup', 'wakeup', 'wakeup', 'wakeup']
                              })
df_contradict

| | series_id | night | additional event |
|---|---|---|---|
| 0 | 0ce74d6d2106 | 20 | onset |
| 1 | 154fe824ed87 | 30 | wakeup |
| 2 | 44a41bba1ee7 | 10 | wakeup |
| 3 | efbfc4526d58 | 7 | wakeup |
| 4 | f8a8da8bdd00 | 17 | wakeup |
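
The nights and unpaired events above are hard-coded in the cell. A sketch of deriving them from the data is given below, assuming each (series_id, night) pair has at most one “onset” row and one “wakeup” row. The helper name `find_unpaired_events` and the toy data are hypothetical.

```python
import pandas as pd

def find_unpaired_events(events: pd.DataFrame) -> pd.DataFrame:
    """Return the events recorded on nights where only one event has a step."""
    recorded = events.dropna(subset=["step"])
    # Number of recorded events for each (series_id, night), aligned to the rows
    n_recorded = recorded.groupby(["series_id", "night"])["event"].transform("count")
    unpaired = recorded.loc[n_recorded == 1, ["series_id", "night", "event"]]
    return unpaired.reset_index(drop=True)

# Toy data (made up): series "a" has a lone wakeup on night 2,
# series "b" has no recorded step at all on night 1
toy = pd.DataFrame({
    "series_id": ["a", "a", "a", "a", "b", "b"],
    "night": [1, 1, 2, 2, 1, 1],
    "event": ["onset", "wakeup", "onset", "wakeup", "onset", "wakeup"],
    "step": [10.0, 50.0, None, 120.0, None, None],
})

result = find_unpaired_events(toy)
print(result)
```

Nights where both steps are missing are not reported, because they do not unbalance the event counts.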


In published research, subjects were required to wear the accelerometer for seven to nine consecutive days, as summarized in the table below. We will compute statistics on our dataset to determine which duration suits it better.

| Reference | days |
|---|---|
| 1,3 | 9 consecutive days |
| 4 | 9 specific days |
| 2 | 7 consecutive days |

### Determining the time interval
To determine this, we first replace the five timesteps that cause the contradictions mentioned above with null values. Then we identify the nights without records and compute the maximum number of consecutive nights on which each accelerometer collected records.

In [42]:
# Replace the five timesteps which cause the contradictions
for sid, night, event in df_contradict.itertuples(index=False, name=None):
    # Select the unpaired event row and blank out its step and timestamp
    mask = (df['series_id'] == sid) & (df['night'] == night) & (df['event'] == event)
    df.loc[mask, ['step', 'timestamp']] = np.nan

# Delete variables that become unnecessary for the remaining tasks
del target_replace, df_contradict, mask, sid, night, event
   
# Save the updated dataset into .csv
df.to_csv('./train_events_replacement.csv')

In [56]:
# Identify nights without records
gp = df.groupby('series_id')['step'].count()
gp = pd.DataFrame({'sid': gp.index, 'step_num': gp.values})
gp['empt_night'] = ''

for idx, sid in gp['sid'].items():

    # Extract the data for the particular accelerometer
    df_temp = df[df['series_id'] == sid]

    # Nights on which at least one event has no recorded step
    empty_night = sorted(df_temp.loc[df_temp['step'].isna(), 'night'].unique().tolist())
    gp.at[idx, 'empt_night'] = empty_night

    # Last night recorded for this accelerometer
    max_night = df_temp['night'].max()
    gp.at[idx, 'max_night'] = max_night

    # Longest run of consecutive nights with complete records: the largest gap
    # between neighbouring empty nights, with sentinels at night 0 and at
    # max_night + 1 so that the leading and trailing runs are also counted
    bounds = [0] + empty_night + [max_night + 1]
    gp.at[idx, 'max_cont_night'] = max(b - a - 1 for a, b in zip(bounds, bounds[1:]))

# Configure the dtype of numeric data
gp['step_num'] = gp['step_num'].astype(np.int16)    # two steps per night can exceed the int8 limit of 127
gp['max_night'] = gp['max_night'].astype(np.int8)
gp['max_cont_night'] = gp['max_cont_night'].astype(np.int8)

gp.to_csv('./trE_cont_nights.csv')

# Look into the new table "gp"
d = dtale.show(gp)
print(d._main_url)
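
The consecutive-night count can also be checked in isolation. The sketch below is a standalone helper, not the notebook's own code. The name `longest_complete_run` is hypothetical, and it assumes nights are numbered from 1 to `max_night`.

```python
def longest_complete_run(empty_nights, max_night):
    """Length of the longest run of consecutive nights that are not empty.

    empty_nights: night numbers without a complete record
    max_night:    last night number of the series (nights start at 1)
    """
    # Sentinels at 0 and max_night + 1 turn the leading and trailing runs
    # into ordinary gaps between neighbouring boundaries
    bounds = [0] + sorted(set(empty_nights)) + [max_night + 1]
    return max(b - a - 1 for a, b in zip(bounds, bounds[1:]))

# Worked examples on made-up inputs
print(longest_complete_run([], 10))         # no empty nights
print(longest_complete_run([5], 10))        # nights 6..10 form the longest run
print(longest_complete_run([3, 4, 9], 12))  # nights 5..8 form the longest run
```

The sentinel approach avoids separate branches for the first and last empty night, which is where off-by-one errors tend to appear.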


## References
1. Estimating sleep parameters using an accelerometer without sleep diary
2. Genetic studies of accelerometer-based sleep measures yield new insights into human sleep behaviour
3. A Novel, Open Access Method to Assess Sleep Duration Using a Wrist-Worn Accelerometer
4. Sleep classification from wrist-worn accelerometer data using random forests