# Subject exploration
___

This notebook aims to explore, verify and enrich the subject information. We will look at the information provided in `SC-subjects.csv`, the information contained in the recordings' headers, and the hypnograms. This will allow us to see the different sleep characteristics of the subjects.

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys

# Ensure parent folder is in PYTHONPATH
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
from datetime import datetime, timedelta, timezone
import pandas as pd
import mne 
import numpy as np
from seaborn import regplot
from importlib import reload

from utils import (fetch_data, AGE_SLEEP_RECORDS)
from constants import (SLEEP_STAGES_VALUES,
                       N_STAGES,
                       DATASET_SLEEP_STAGES_VALUES)

In [None]:
ALL_DATASET_SLEEP_STAGE_VALUES = {
    **DATASET_SLEEP_STAGES_VALUES,
    'Sleep stage ?': -1,
    'Movement time': -1
}

WAKE_STAGE = ['Sleep stage W', 'Sleep stage ?', 'Movement time']
SUBJECTS = range(83)
NIGHTS = [1,2]

In [None]:
subject_files = fetch_data(subjects=SUBJECTS, recording=NIGHTS, local_data_path=f"../{AGE_SLEEP_RECORDS}")

In [None]:
df_subject_information = pd.read_csv("../data/SC-subjects.csv", delimiter=';')
df_subject_information = df_subject_information.rename(columns={"sex (F=1)": "sex"}, errors="raise")
df_subject_information['LightsOff'] =  pd.to_datetime(df_subject_information['LightsOff'], format='%H:%M')
df_subject_information.head(5)

## Lights off onset
___

We initially just have the **local start date and local lights off time**. Start date can be found in the recording file's headers, whereas the local lights off times are located in a separate file. Both will be saved in another file, to be later used in the pipeline.

We currently have **153 night recordings** of 82 subjects. Because we have a limited capacity in terms of memory, we will have to discard some part of the recorded signal. As we've seen in the previous section, files contain a whole day of recording (a little bit less than 24 hours).

We will then discard all of the signal recorded before the subject turned the lights off and all of the signal recorded after the subject woke up in the morning. We still have to analyze how many hours of recording will be left.

We will keep this information in our dataframe, where we define **NightDuration** as the timespan, in seconds, between the time at which the subject turned off the lights and the time at which the subject had their last non-wake sleep stage scored.

Even with this reduction, the amount of data remains too large. Since we do not want to handle too much data (remember that our files total 7.6 GB), we will only use the first 20 files, even after dropping channels and part of the wake time of the recordings.

#### Define lights off onset
___

Since both compared times are in the same timezone and are less than 24 hours apart, we will naïvely set them to UTC.

In [None]:
def get_lights_off(raw_data_info, file_index):
    """
    Returns a tuple in which there are:
        - the duration (in seconds) between the beginning of the recording and the time
            at which the subject turned off the lights.
        - the datetime at which the lights were turned off
    """

    raw_data_start_time = datetime.utcfromtimestamp(raw_data_info['meas_date'][0])
    raw_data_lights_off_time = df_subject_information.loc[file_index, 'LightsOff']
    
    if raw_data_lights_off_time.time().hour < 12: # Lights turned off after midnight
        lightoff_date = raw_data_start_time.date() + timedelta(days=1)
    else:                                         # Lights turned off before midnight
        lightoff_date = raw_data_start_time.date()
        
    raw_data_lights_off_time = raw_data_lights_off_time.replace(year=lightoff_date.year, month=lightoff_date.month, day=lightoff_date.day)

    return ((raw_data_lights_off_time - raw_data_start_time).total_seconds(), raw_data_lights_off_time)
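
The date rollover rule used in `get_lights_off` can be checked in isolation. The sketch below is a simplified, standalone version of that rule: `resolve_lights_off` is a hypothetical helper that is not part of this notebook or of `utils`, and the timestamps are made-up values chosen only for illustration.

```python
from datetime import datetime, time, timedelta


def resolve_lights_off(record_start, lights_off_clock):
    """Attach a date to a lights-off clock time (hypothetical helper).

    Same assumption as in get_lights_off: a clock time before noon belongs
    to the day after the recording started, otherwise to the same day.
    """
    lights_off_date = record_start.date()
    if lights_off_clock.hour < 12:
        lights_off_date += timedelta(days=1)
    lights_off = datetime.combine(lights_off_date, lights_off_clock)
    return (lights_off - record_start).total_seconds(), lights_off


# Made-up recording start: 16:00 on an arbitrary day
start = datetime(2000, 1, 1, 16, 0)

before_midnight, _ = resolve_lights_off(start, time(23, 30))
after_midnight, _ = resolve_lights_off(start, time(0, 45))

print(before_midnight / 3600)  # 7.5 hours after the start of the recording
print(after_midnight / 3600)   # 8.75 hours after the start of the recording
```

The rule would misplace a subject who turned off the lights before noon on the day the recording started, which is why it only holds under the assumption stated above.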

#### Define wake-up onset
___

Here, we consider that the subject woke up at the end of the last non-wake stage. This can lead to errors in some cases, because some people might have woken up in the morning and then taken a nap later.

In [None]:
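def find_last_non_wake_annotation(annotations, timestamps):
    """
    Returns the onset (in seconds) of the last annotation that is not
    a wake stage, or None if every annotation is a wake stage.
    """
    scores_with_timestamp = list(zip(annotations, timestamps))

    return next(
        (time for (stage, time) in reversed(scores_with_timestamp) if stage not in WAKE_STAGE),
        None)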
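
The nap caveat mentioned above can be illustrated on a toy hypnogram. This is a minimal sketch with made-up onsets: `last_non_wake_onset` is a hypothetical standalone equivalent of `find_last_non_wake_annotation`, and `WAKE_LIKE` mirrors the `WAKE_STAGE` list.

```python
WAKE_LIKE = ['Sleep stage W', 'Sleep stage ?', 'Movement time']  # mirrors WAKE_STAGE


def last_non_wake_onset(descriptions, onsets):
    """Return the onset of the last annotation that is not wake-like, or None."""
    for stage, onset in zip(reversed(descriptions), reversed(onsets)):
        if stage not in WAKE_LIKE:
            return onset
    return None


# Toy hypnogram (made-up values): a night, a long wake period, then a short nap
descriptions = ['Sleep stage W', 'Sleep stage 2', 'Sleep stage W', 'Sleep stage 1', 'Sleep stage W']
onsets = [0, 28800, 57600, 72000, 73800]

print(last_non_wake_onset(descriptions, onsets))    # 72000: the nap, not the end of the night
print(last_non_wake_onset(['Sleep stage W'], [0]))  # None: no sleep was scored
```

In this toy case the nap, not the end of the night, determines the returned onset, which is the kind of error described above.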


#### Calculate sleep onset and night duration
___

In [None]:

for file_index in range(len(subject_files)):

    data = mne.io.read_raw_edf(subject_files[file_index][0], preload=False, verbose=False)
    data.set_annotations(mne.read_annotations(subject_files[file_index][1]), emit_warning=False)
    
    start_time_timestamp = data.info['meas_date'][0]
    light_off_seconds, light_off_time = get_lights_off(data.info, file_index)
    last_non_wake_seconds = find_last_non_wake_annotation(data.annotations.description, data.annotations.onset)
    
    assert (last_non_wake_seconds - light_off_seconds) % 30 == 0, "Must respect epoch size"
    
    df_subject_information.loc[file_index, 'NightDuration'] = last_non_wake_seconds - light_off_seconds
    df_subject_information.loc[file_index, 'LightsOff'] = light_off_time
    df_subject_information.loc[file_index, 'LightsOffSecond'] = light_off_seconds
    df_subject_information.loc[file_index, 'StartRecord'] = datetime.utcfromtimestamp(start_time_timestamp)
    df_subject_information.loc[file_index, 'StartRecordTimestamp'] = start_time_timestamp
    
    del data

df_subject_information.head(5)
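
As a quick sanity check of the arithmetic in the loop above, the following sketch computes a night duration and its number of 30-second epochs from two offsets. The helper `night_duration` and the offsets are hypothetical and only serve as an illustration.

```python
EPOCH_SECONDS = 30  # epoch length assumed by the assertion in the loop above


def night_duration(lights_off_seconds, last_non_wake_seconds):
    """Return (duration in seconds, number of epochs) for one recording."""
    duration = last_non_wake_seconds - lights_off_seconds
    if duration % EPOCH_SECONDS != 0:
        raise ValueError("Night duration is not a whole number of epochs")
    return duration, int(duration // EPOCH_SECONDS)


# Made-up offsets, in seconds since the start of the recording
duration, n_epochs = night_duration(lights_off_seconds=27000, last_non_wake_seconds=55800)
print(duration / 3600, n_epochs)  # 8.0 hours, 960 epochs
```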

#### Exploring night duration and lights off
___

In [None]:
print(f"Hours of recording: {df_subject_information['NightDuration'].sum()/3600:.3f}")
print(f"Nb of 30s epochs: {df_subject_information['NightDuration'].sum()/30}")

We then check if any file contains unusual information.

In [None]:
plt.xlabel("Number of hours in bed")
plt.ylabel("Number of occurrences")
plt.grid(True)
plt.hist([x/3600 for x in df_subject_information['NightDuration']]);

We can see that, as expected, some recordings have an unusually long night duration, i.e. the span between the lights-off mark and the last non-wake stage. This can be explained by the fact that some people may have taken a nap the following day.


In [None]:
lights_off = df_subject_information['LightsOff']
night_duration = df_subject_information['NightDuration']

plt.figure(figsize=(20, 5))
plt.title("Number of occurrences of each lights-off hour")
plt.xlim(0, 24)
plt.xticks(range(24))
plt.grid(True)
plt.hist(x=lights_off.dt.hour + lights_off.dt.minute/60, bins=48)

plt.show()

In [None]:
df_subject_information.groupby(df_subject_information["LightsOff"].dt.hour).count()["LightsOff"].plot(kind="bar")

The earliest time at which a subject turned off the lights is about 10 p.m., and the latest is about 1:45 a.m.

All of the processed information looks good.

In [None]:
df_subject_information.to_csv("../data/recordings-info.csv", index=False)

## Sleep Characteristics
___


#### Define sleep characteristics
___

In [None]:
def get_sleep_stage_at_onset(annotations, onset):
    """Returns the sleep stage at the specified onset.
    Input
    -------
    annotations: List of OrderedDicts, as returned by the `mne.read_annotations` function
    onset: Time since the start of the recording in seconds.

    Returns
    -------
    Sleep stage: str
    """
    return annotations[
        next(idx for idx, elem in enumerate(annotations) if elem['onset'] > onset) - 1
    ]['description']

def get_sleep_latency(annotations, lights_off_onset):
    """Returns the sleep latency
    Input
    -------
    annotations: List of OrderedDicts, as returned by the `mne.read_annotations` function
    lights_off_onset: Span between the time the record started and the lights were turned off

    Returns
    -------
    Span, in seconds, between the time the lights were turned off and the first non-wake stage
    """
    fell_asleep_onset = annotations[
        next(
            idx for idx, elem in enumerate(annotations) if elem['onset'] > lights_off_onset and elem['description'] not in WAKE_STAGE)
    ]['onset']
    return fell_asleep_onset - lights_off_onset
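
To make the two definitions above concrete, here is a minimal sketch on toy annotations. It does not call MNE: `toy_annotations` is a plain list of dicts with made-up values, and `stage_at` and `latency` are hypothetical loop-based variants of the functions above that return a value instead of raising when nothing matches.

```python
WAKE_LIKE = ['Sleep stage W', 'Sleep stage ?', 'Movement time']  # mirrors WAKE_STAGE

# Toy annotations (made-up values), shaped like the dicts yielded by MNE annotations
toy_annotations = [
    {'onset': 0.0, 'duration': 27300.0, 'description': 'Sleep stage W'},
    {'onset': 27300.0, 'duration': 120.0, 'description': 'Sleep stage 1'},
    {'onset': 27420.0, 'duration': 600.0, 'description': 'Sleep stage 2'},
]
lights_off = 27000.0  # seconds since the start of the recording


def stage_at(annotations, onset):
    """Description of the last annotation starting at or before `onset`."""
    current = annotations[0]['description']
    for annotation in annotations:
        if annotation['onset'] > onset:
            break
        current = annotation['description']
    return current


def latency(annotations, lights_off_onset):
    """Seconds between lights off and the first non-wake annotation after it, or None."""
    for annotation in annotations:
        if annotation['onset'] > lights_off_onset and annotation['description'] not in WAKE_LIKE:
            return annotation['onset'] - lights_off_onset
    return None


print(stage_at(toy_annotations, lights_off))  # Sleep stage W
print(latency(toy_annotations, lights_off))   # 300.0
```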
    

#### Calculate sleep characteristics
___

In [None]:

for i in range(len(df_subject_information)):
    annotations = mne.read_annotations(subject_files[i][1])
    lights_off_onset = df_subject_information.loc[i, 'LightsOffSecond'] 

    df_subject_information.loc[i, 'SleepLatency'] = get_sleep_latency(annotations, lights_off_onset)
    df_subject_information.loc[i, 'SleepStageAtLightsOff'] = get_sleep_stage_at_onset(annotations, lights_off_onset)
    df_subject_information.loc[i, 'TotalSleptTime'] = np.sum([annotation['duration'] for annotation in annotations if annotation['description'] not in WAKE_STAGE])
    df_subject_information.loc[i, 'TotalN1'] = np.sum([annotation['duration'] for annotation in annotations if annotation['description'] == 'Sleep stage 1'])
    df_subject_information.loc[i, 'TotalN2'] = np.sum([annotation['duration'] for annotation in annotations if annotation['description'] == 'Sleep stage 2'])
    df_subject_information.loc[i, 'TotalN3'] = np.sum([annotation['duration'] for annotation in annotations if annotation['description'] in ['Sleep stage 3', 'Sleep stage 4']])
    df_subject_information.loc[i, 'TotalR'] = np.sum([annotation['duration'] for annotation in annotations if annotation['description'] == 'Sleep stage R'])
    df_subject_information.loc[i, 'NbTransitionStade'] = len([annotation for annotation in annotations if annotation['description'] not in ['Sleep stage ?', 'Movement time']])

df_subject_information.head(5)
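
The per-stage totals above iterate over the annotations once per column. As an alternative, the following sketch accumulates all totals in a single pass. It is only an illustration: `stage_totals` and `STAGE_GROUPS` are hypothetical names, the toy annotations are made up, and the total slept time is assumed to be the sum of the four grouped stages, which matches the cell above only when no other non-wake label is present.

```python
from collections import defaultdict

# Grouping assumed here: stages 3 and 4 are merged into N3, as in the cell above
STAGE_GROUPS = {
    'Sleep stage 1': 'TotalN1',
    'Sleep stage 2': 'TotalN2',
    'Sleep stage 3': 'TotalN3',
    'Sleep stage 4': 'TotalN3',
    'Sleep stage R': 'TotalR',
}


def stage_totals(annotations):
    """Sum annotation durations (seconds) per stage group in one pass."""
    totals = defaultdict(float)
    for annotation in annotations:
        group = STAGE_GROUPS.get(annotation['description'])
        if group is not None:  # wake, unknown and movement annotations are skipped
            totals[group] += annotation['duration']
            totals['TotalSleptTime'] += annotation['duration']
    return dict(totals)


# Toy annotations (made-up durations, in seconds)
toy_annotations = [
    {'description': 'Sleep stage W', 'duration': 600.0},
    {'description': 'Sleep stage 1', 'duration': 120.0},
    {'description': 'Sleep stage 2', 'duration': 900.0},
    {'description': 'Sleep stage 3', 'duration': 300.0},
    {'description': 'Sleep stage 4', 'duration': 300.0},
    {'description': 'Sleep stage R', 'duration': 600.0},
    {'description': 'Sleep stage ?', 'duration': 30.0},
]

totals = stage_totals(toy_annotations)
print(totals['TotalN3'], totals['TotalSleptTime'])  # 600.0 2220.0
```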

#### Verify sleep stage at lights off
___

In [None]:
df_subject_information[df_subject_information['SleepStageAtLightsOff'] != 'Sleep stage W']

We can see that some subjects were already asleep at the time they reported turning off the lights.

A possible explanation is an oversight by subjects who forgot to mark the time down, although we have not verified this.

In [None]:
problematic_subject = df_subject_information[df_subject_information['SleepStageAtLightsOff'] != 'Sleep stage W']

In [None]:
def print_hypnogram(annotations, title, lights_off_seconds=None):
    hypnogram_x = [onset for onset in annotations.onset for _ in (0, 1)][1:]
    hypnogram_y = [ALL_DATASET_SLEEP_STAGE_VALUES[stage] for stage in annotations.description for _ in (0, 1)][:-1]
    
    plt.rcParams["figure.figsize"] = (20,5)
    plt.gca().invert_yaxis()
    plt.plot(hypnogram_x, hypnogram_y)
    
    if lights_off_seconds is not None:
        plt.axvline(lights_off_seconds, color='r')
    
    plt.title(title)
    plt.xlabel("Onset (seconds)")
    plt.ylabel("Sleep stage")
    plt.show()
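
The list comprehensions in `print_hypnogram` duplicate every onset and every stage value so that a plain line plot draws horizontal steps. The sketch below reproduces that construction on made-up values, without plotting; `hypnogram_steps` is a hypothetical helper used only to show the resulting points.

```python
def hypnogram_steps(onsets, values):
    """Duplicate each point so that a plain line plot draws horizontal steps."""
    xs = [onset for onset in onsets for _ in (0, 1)][1:]
    ys = [value for value in values for _ in (0, 1)][:-1]
    return list(zip(xs, ys))


# Three made-up annotations starting at 0 s, 30 s and 90 s
points = hypnogram_steps([0, 30, 90], [0, 1, 2])
print(points)  # [(0, 0), (30, 0), (30, 1), (90, 1), (90, 2)]
```

Each stage value is held until the next onset and then jumps vertically. Matplotlib's `plt.step(onsets, values, where='post')` should draw a comparable curve without duplicating the points.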

In [None]:
for subject_idx in problematic_subject.index:
    annotations = mne.read_annotations(subject_files[subject_idx][1])
    info = df_subject_information.iloc[subject_idx]
    print_hypnogram(
        annotations,
        f"Index #{subject_idx} with lights off at {info['LightsOffSecond']} seconds with start time at {info['StartRecord']}",
        info['LightsOffSecond'])

If we want to calculate mean sleep characteristics, we have to exclude these recordings, because their lights-off mark does not match the actual start of the night.

#### Verify sleep latency
___

In [None]:
plt.figure(figsize=(20,8))
plt.title('Sleep latency of every subject')
plt.grid(True)
regplot(x='age', y='SleepLatency', data=df_subject_information.drop(problematic_subject.index), label='Normal recordings')
regplot(x='age', y='SleepLatency', data=problematic_subject, label='Problematic recordings')
plt.legend()


#### Verify night duration
___

In [None]:
sleepy_subjects = df_subject_information[df_subject_information['NightDuration'] > 3600*12]
print("Number of subjects that slept a lot: ", len(sleepy_subjects))
sleepy_subjects.head(5)

In [None]:
for subject_idx in sleepy_subjects.index:
    annotations = mne.read_annotations(subject_files[subject_idx][1])
    info = df_subject_information.iloc[subject_idx]
    print_hypnogram(
        annotations,
        f"Index #{subject_idx} with lights off at {info['LightsOffSecond']} seconds with start time at {info['StartRecord']}",
        info['LightsOffSecond'])

We see that most recordings with a night duration over 12 hours come from subjects who fell asleep again the next day (we can assume they took a nap). If we want to calculate the mean night duration, we have to take these cases into account.
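
To give a sense of how much a single long recording can shift the mean, here is a small sketch on made-up durations. The values and the names `durations` and `regular_nights` are hypothetical; the 12-hour threshold is the one used above.

```python
from statistics import mean

LONG_NIGHT_SECONDS = 3600 * 12  # same threshold as above

# Made-up night durations in seconds: three regular nights and one with a nap
durations = [28800, 27000, 32400, 55800]

regular_nights = [d for d in durations if d <= LONG_NIGHT_SECONDS]

print(mean(durations) / 3600)                 # 10.0 hours with every recording
print(round(mean(regular_nights) / 3600, 2))  # 8.17 hours without the long one
```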