# Data Preparation

This notebook prepares the data for machine learning in three steps:

1. Annotate the dataset (sleep = 0, awake = 1)
2. Signal preparation (scaling, missing data, outliers, smoothing)
3. Subset generation (light, medium, heavy)

## Import


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl
import gc
import joblib

from tqdm import tqdm

## Data Preparation

**Convert timestamp to datetime**


In [2]:
timestamp = [
    pl.col("timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S%z")
]
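
To make the format string concrete, here is a small standard-library sketch that parses one timestamp with the same directives. The sample value is made up for illustration and is not taken from the dataset.

```python
from datetime import datetime, timedelta

# Hypothetical sample value, chosen to match "%Y-%m-%dT%H:%M:%S%z"
sample = "2018-08-14T15:30:00-0400"

# %z consumes the UTC offset, so the result is timezone-aware
parsed = datetime.strptime(sample, "%Y-%m-%dT%H:%M:%S%z")

print(parsed.isoformat())
print(parsed.utcoffset() == timedelta(hours=-4))
```

A string without a UTC offset would not match this format, because `%z` requires one.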

**Min-max normalization**

In [3]:
def min_max_normalization(name: str) -> pl.Expr:
    # Rescale a column to [0, 1] using its minimum and maximum
    col = pl.col(name)
    return ((col - col.min()) / (col.max() - col.min())).cast(pl.Float32)

normalization = [
    min_max_normalization("anglez"),
    min_max_normalization("enmo"),
    pl.col("step").cast(pl.UInt32),
]
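
Min-max normalization maps a column to the range [0, 1] via `(x - min) / (max - min)`. The library-free sketch below applies the formula to a few made-up values. The guard for a constant column is an addition of this sketch and is not part of the cell above.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant input maps to all zeros (sketch-only choice)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


angles = [-90.0, 0.0, 45.0, 90.0]  # made-up values
scaled = min_max(angles)
print(scaled)
```

As written, the notebook cell has no per-series grouping, so the minimum and maximum appear to be taken over the whole column rather than per `series_id`.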


**Data import**

In [4]:
df_signals = pl.scan_parquet("data/train_series.parquet").with_columns(
    timestamp + normalization
).collect(streaming=True)

In [5]:
df_events = pl.scan_csv("data/train_events.csv").with_columns(
    timestamp + [pl.col("step").cast(pl.UInt32)]
).drop_nulls().collect()

**Data cleaning**

In [6]:
"""
# Removing null events and nights with mismatched counts from series_events
mismatches = df_events.group_by(['series_id', 'night']).agg(
    (pl.col('event') == 'onset').sum().alias('onset'),
    (pl.col('event') == 'wakeup').sum().alias('wakeup')
    ).sort(by=['series_id', 'night']).filter(pl.col('onset') != pl.col('wakeup')).select(pl.all().exclude('onset', 'wakeup'))
print(f"The mismatch Onset and Wakeup are : \n {mismatches}")
df_events = df_events.join(mismatches, on=['series_id', 'night'], how='anti')
"""


In [7]:
# Count for each series_id the number of onset and wakeup events
df_events_problem = df_events.group_by(['series_id']).agg(
    (pl.col('event') == 'onset').sum().alias('onset'),
    (pl.col('event') == 'wakeup').sum().alias('wakeup')
    ).sort(by=['series_id'])

In [8]:
# display the series_id with mismatched counts
mismatches = df_events_problem.filter(pl.col('onset') != pl.col('wakeup')).select(pl.all().exclude('onset', 'wakeup'))
print(f"The mismatch Onset and Wakeup are : \n {mismatches}")

The mismatch Onset and Wakeup are : 
 shape: (5, 1)
┌──────────────┐
│ series_id    │
│ ---          │
│ str          │
╞══════════════╡
│ 0ce74d6d2106 │
│ 154fe824ed87 │
│ 44a41bba1ee7 │
│ efbfc4526d58 │
│ f8a8da8bdd00 │
└──────────────┘
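
The same check can be illustrated without Polars. The sketch below counts onset and wakeup events per series on made-up data and keeps the series whose counts differ. The identifiers `a` and `b` are hypothetical.

```python
from collections import Counter

# Made-up (series_id, event) pairs; series "b" is missing one wakeup
events = [
    ("a", "onset"), ("a", "wakeup"),
    ("b", "onset"), ("b", "wakeup"), ("b", "onset"),
]

# Count each (series_id, event) combination
counts = Counter(events)
series_ids = sorted({sid for sid, _ in events})

# A series is mismatched when its onset and wakeup counts differ
mismatched = [
    sid for sid in series_ids
    if counts[(sid, "onset")] != counts[(sid, "wakeup")]
]
print(mismatched)
```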


**Merge data**

In [9]:
df = df_signals.join(df_events, on=['series_id', 'timestamp', 'step'], how='left')
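
A left join keeps every signal row and fills the event columns only where a matching event exists. The simplified sketch below shows this on made-up rows, joining on `(series_id, step)` only; the notebook also joins on `timestamp`.

```python
# Made-up signal rows keyed by (series_id, step) and a sparse event table
signals = [("a", 0), ("a", 1), ("a", 2), ("a", 3)]
events = {("a", 1): "onset", ("a", 3): "wakeup"}

# Left join: every signal row is kept; rows without a match get None
joined = [(sid, step, events.get((sid, step))) for sid, step in signals]
print(joined)
```

Most rows therefore carry no event information after the merge, which matters for any later grouping on event columns.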

**Annotation Sleep // Awake**

In [10]:
df = df.with_columns(
    pl.lit(False).alias('state').cast(pl.Boolean)
)

def compute_state(df_group: pl.DataFrame) -> pl.DataFrame:
    # State encoding: False = asleep, True = awake
    # If the group lacks an onset or a wakeup event, label every row as awake
    events = df_group['event'].unique().to_list()
    if 'onset' not in events or 'wakeup' not in events:
        return df_group.with_columns(pl.lit(True).alias('state'))

    # Steps of the first onset and the first wakeup in the group
    onset_step = df_group.filter(pl.col('event') == 'onset')['step'][0]
    wakeup_step = df_group.filter(pl.col('event') == 'wakeup')['step'][0]

    # Asleep (False) from onset to wakeup inclusive, awake (True) elsewhere
    return df_group.with_columns(
        pl.when(pl.col('step').is_between(onset_step, wakeup_step))
        .then(False)
        .otherwise(True)
        .alias('state')
    )
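
The labeling rule itself (asleep between onset and wakeup inclusive, awake elsewhere) can be sketched without Polars. The version below works from a per-series list of `(onset_step, wakeup_step)` pairs, so it does not depend on which rows carry a `night` value after the join. The function name and the sample intervals are hypothetical.

```python
from bisect import bisect_right


def label_steps(steps, intervals):
    """Return True (awake) or False (asleep) for each step.

    intervals: sorted, non-overlapping (onset_step, wakeup_step) pairs.
    """
    onsets = [onset for onset, _ in intervals]
    states = []
    for step in steps:
        # Index of the last interval that starts at or before this step
        i = bisect_right(onsets, step) - 1
        asleep = i >= 0 and step <= intervals[i][1]
        states.append(not asleep)
    return states


intervals = [(2, 4), (8, 9)]  # made-up sleep periods
states = label_steps(range(11), intervals)
print(states)
```

If `night` is only populated on rows that carry an event, grouping by `night` would expose only those rows to `compute_state`, so an interval-based lookup of this kind may be worth comparing against.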

In [12]:
for series_id in tqdm(df['series_id'].unique()):
    df_serie = df.filter(pl.col('series_id') == series_id).group_by(['night']).map_groups(compute_state)
    df_serie.write_parquet(f"data/train_series/{series_id}.parquet")


  1%|          | 3/277 [00:07<10:58,  2.40s/it]

KeyboardInterrupt:

**Missing Data**

When an annotation is missing, remove signals 6 hours after wakeup and 6 hours before sleep onset.

In [None]:
# For each parquet file (one per time series):
#   1. Sort the rows by timestamp.
#   2. Look for periods of 20 hours without sleep.
#   3. Remove 16 hours from each such period, since its annotations are considered missing.
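
The detection step in the plan above can be sketched as follows. The sketch works on step indices and assumes one step every 5 seconds (720 steps per hour), which is not stated in this notebook. The function name and sample intervals are hypothetical.

```python
STEPS_PER_HOUR = 720  # assumption: one step every 5 seconds


def long_awake_gaps(intervals, min_gap_hours=20):
    """Return (wakeup_step, next_onset_step) pairs at least min_gap_hours apart.

    intervals: (onset_step, wakeup_step) pairs sorted by onset_step.
    """
    min_gap_steps = min_gap_hours * STEPS_PER_HOUR
    gaps = []
    # Compare each wakeup with the onset of the following sleep period
    for (_, wakeup), (next_onset, _) in zip(intervals, intervals[1:]):
        if next_onset - wakeup >= min_gap_steps:
            gaps.append((wakeup, next_onset))
    return gaps


# Made-up nights: a missing annotation leaves a long gap between them
intervals = [(7_000, 14_000), (42_000, 49_000)]
gaps = long_awake_gaps(intervals)
print(gaps)
```

The sketch stops at detection; the window to remove inside each gap is left to the rule described above.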

**Smoothing**

In [None]:
# Your code here ...
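
The notebook leaves the smoothing method open. One common option, shown here purely as an assumption, is a centered moving average. The sketch below is library-free; the function name, window size, and sample signal are hypothetical.

```python
def rolling_mean(values, window=3):
    """Centered moving average; edges average over the samples that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


signal = [0.0, 0.0, 3.0, 0.0, 0.0]  # made-up signal with a single spike
smoothed = rolling_mean(signal, window=3)
print(smoothed)
```

A larger window smooths more aggressively but also blurs the sharp transitions around onset and wakeup, so the window size would need to be tuned against the annotations.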

**Stratified Export**

In [None]:
# Your code here ...