# Data Preparation

This notebook prepares the data for machine learning in three steps:

1. Annotate the dataset (Sleep 0 / Awake 1)
2. Signal Preparation (scaling, missing data, outliers, smoothing)
3. Subset generation (light, medium, heavy)

## Import


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl

from tqdm import tqdm

## Data Preparation

**Convert timestamp to datetime**


In [2]:
timestamp = [
    pl.col("timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S%z")
]

**Min-max normalization**

In [3]:
def min_max_normalization(x):
    # Rescale a Series to [0, 1]; a constant Series yields NaN (division by zero)
    return (x - x.min()) / (x.max() - x.min())

normalization = [
    pl.col("anglez").map_batches(min_max_normalization).cast(pl.Float32),
    pl.col("enmo").map_batches(min_max_normalization).cast(pl.Float32),
    pl.col("step").cast(pl.UInt32),
]


**Data import**

In [4]:
df_signals = pl.scan_parquet("data/train_series.parquet").with_columns(
    timestamp + normalization
).collect(streaming=True)

In [5]:
df_events = pl.scan_csv("data/train_events.csv").with_columns(
    timestamp + [pl.col("step").cast(pl.UInt32)]
).drop_nulls().collect()

**Data cleaning**

In [6]:
# Remove nights whose onset and wakeup counts differ (null events were dropped at import)
mismatches = (
    df_events.group_by(['series_id', 'night'])
    .agg(
        (pl.col('event') == 'onset').sum().alias('onset'),
        (pl.col('event') == 'wakeup').sum().alias('wakeup'),
    )
    .sort(by=['series_id', 'night'])
    .filter(pl.col('onset') != pl.col('wakeup'))
    .select(pl.all().exclude('onset', 'wakeup'))
)
print(f"The mismatch Onset and Wakeup are : \n {mismatches}")
df_events = df_events.join(mismatches, on=['series_id', 'night'], how='anti')


The mismatch Onset and Wakeup are : 
 shape: (5, 2)
┌──────────────┬───────┐
│ series_id    ┆ night │
│ ---          ┆ ---   │
│ str          ┆ i64   │
╞══════════════╪═══════╡
│ 0ce74d6d2106 ┆ 20    │
│ 154fe824ed87 ┆ 30    │
│ 44a41bba1ee7 ┆ 10    │
│ efbfc4526d58 ┆ 7     │
│ f8a8da8bdd00 ┆ 17    │
└──────────────┴───────┘


In [7]:
print(df_events.head(10))

shape: (10, 5)
┌──────────────┬───────┬────────┬───────┬─────────────────────────┐
│ series_id    ┆ night ┆ event  ┆ step  ┆ timestamp               │
│ ---          ┆ ---   ┆ ---    ┆ ---   ┆ ---                     │
│ str          ┆ i64   ┆ str    ┆ u32   ┆ datetime[μs, UTC]       │
╞══════════════╪═══════╪════════╪═══════╪═════════════════════════╡
│ 038441c925bb ┆ 1     ┆ onset  ┆ 4992  ┆ 2018-08-15 02:26:00 UTC │
│ 038441c925bb ┆ 1     ┆ wakeup ┆ 10932 ┆ 2018-08-15 10:41:00 UTC │
│ 038441c925bb ┆ 2     ┆ onset  ┆ 20244 ┆ 2018-08-15 23:37:00 UTC │
│ 038441c925bb ┆ 2     ┆ wakeup ┆ 27492 ┆ 2018-08-16 09:41:00 UTC │
│ …            ┆ …     ┆ …      ┆ …     ┆ …                       │
│ 038441c925bb ┆ 4     ┆ onset  ┆ 57240 ┆ 2018-08-18 03:00:00 UTC │
│ 038441c925bb ┆ 4     ┆ wakeup ┆ 62856 ┆ 2018-08-18 10:48:00 UTC │
│ 038441c925bb ┆ 6     ┆ onset  ┆ 91296 ┆ 2018-08-20 02:18:00 UTC │
│ 038441c925bb ┆ 6     ┆ wakeup ┆ 97860 ┆ 2018-08-20 11:25:00 UTC │
└──────────────┴───────┴────────┴───────┴─────────────────────────┘

In [8]:
# Count for each series_id the number of onset and wakeup events
df_events_problem = df_events.group_by(['series_id']).agg(
    (pl.col('event') == 'onset').sum().alias('onset'),
    (pl.col('event') == 'wakeup').sum().alias('wakeup')
    ).sort(by=['series_id'])

In [9]:
# display the series_id with mismatched counts
mismatches = df_events_problem.filter(pl.col('onset') != pl.col('wakeup')).select(pl.all().exclude('onset', 'wakeup'))
print(f"The mismatch Onset and Wakeup are : \n {mismatches}")

The mismatch Onset and Wakeup are : 
 shape: (0, 1)
┌───────────┐
│ series_id │
│ ---       │
│ str       │
╞═══════════╡
└───────────┘


In [31]:
# Toy signals data for checking the merge logic (overwrites df_signals loaded above)
timestamps = ([f"2022-01-01 {hour:02d}:00:00" for hour in range(24)] + 
             ["2022-01-02 00:00:00"] +
             [f"2022-01-02 {hour:02d}:00:00" for hour in range(1, 13)])

df_signals = pl.DataFrame({
    "series_id": ["A"] * 37 + ["B"] * 37 + ["C"] * 37,
    "timestamp": timestamps * 3,
    "step": list(range(1, 38)) * 3
})

# Toy events data (overwrites df_events loaded above)
df_events = pl.DataFrame({
    "series_id": ["A", "A", "B", "B", "C", "C"],
    "night": [1, 1, 1, 1, 1, 1],
    "event": ["onset", "wakeup", "onset", "wakeup", "onset", "wakeup"],
    "timestamp": ["2022-01-01 02:00:00", "2022-01-01 14:00:00", "2022-01-01 03:00:00",
                 "2022-01-01 15:00:00", "2022-01-01 04:00:00", "2022-01-01 16:00:00"],
    "step": [3, 15, 4, 16, 5, 17]
})


In [37]:
print(df_signals.filter(pl.col('series_id') == 'A'))

shape: (37, 3)
┌───────────┬─────────────────────┬──────┐
│ series_id ┆ timestamp           ┆ step │
│ ---       ┆ ---                 ┆ ---  │
│ str       ┆ str                 ┆ i64  │
╞═══════════╪═════════════════════╪══════╡
│ A         ┆ 2022-01-01 00:00:00 ┆ 1    │
│ A         ┆ 2022-01-01 01:00:00 ┆ 2    │
│ A         ┆ 2022-01-01 02:00:00 ┆ 3    │
│ A         ┆ 2022-01-01 03:00:00 ┆ 4    │
│ …         ┆ …                   ┆ …    │
│ A         ┆ 2022-01-02 09:00:00 ┆ 34   │
│ A         ┆ 2022-01-02 10:00:00 ┆ 35   │
│ A         ┆ 2022-01-02 11:00:00 ┆ 36   │
│ A         ┆ 2022-01-02 12:00:00 ┆ 37   │
└───────────┴─────────────────────┴──────┘


In [33]:
print(df_events)

shape: (6, 5)
┌───────────┬───────┬────────┬─────────────────────┬──────┐
│ series_id ┆ night ┆ event  ┆ timestamp           ┆ step │
│ ---       ┆ ---   ┆ ---    ┆ ---                 ┆ ---  │
│ str       ┆ i64   ┆ str    ┆ str                 ┆ i64  │
╞═══════════╪═══════╪════════╪═════════════════════╪══════╡
│ A         ┆ 1     ┆ onset  ┆ 2022-01-01 02:00:00 ┆ 3    │
│ A         ┆ 1     ┆ wakeup ┆ 2022-01-01 14:00:00 ┆ 15   │
│ B         ┆ 1     ┆ onset  ┆ 2022-01-01 03:00:00 ┆ 4    │
│ B         ┆ 1     ┆ wakeup ┆ 2022-01-01 15:00:00 ┆ 16   │
│ C         ┆ 1     ┆ onset  ┆ 2022-01-01 04:00:00 ┆ 5    │
│ C         ┆ 1     ┆ wakeup ┆ 2022-01-01 16:00:00 ┆ 17   │
└───────────┴───────┴────────┴─────────────────────┴──────┘


In [27]:
# Detect and remove mismatched onsets/wakeups
mismatches = df_events.group_by(['series_id', 'night']).agg(
    (pl.col('event') == 'onset').sum().alias('onset'),
    (pl.col('event') == 'wakeup').sum().alias('wakeup')
).filter(pl.col('onset') != pl.col('wakeup')).select(pl.all().exclude('onset', 'wakeup'))
df_events = df_events.join(mismatches, on=['series_id', 'night'], how='anti')


In [29]:
# Add onset_step and wakeup_step to df_events
df_events = df_events.with_columns(
    onset_step = pl.when(pl.col('event') == 'onset').then(pl.col('step')),
    wakeup_step = pl.when(pl.col('event') == 'wakeup').then(pl.col('step'))
)

# Merge df_signals and df_events
df = df_signals.join(df_events, on=['series_id', 'timestamp', 'step'], how='left')

# Sort by series_id and step
df = df.sort(['series_id', 'step'])

# Forward fill onset_step and wakeup_step
df = df.with_columns(
    onset_step = pl.col('onset_step').forward_fill(),
    wakeup_step = pl.col('wakeup_step').forward_fill()
)

# Calculate state column
df = df.with_columns(
    state = (pl.col('step') >= pl.col('onset_step')) & (pl.col('step') < pl.col('wakeup_step')).fill_null(True)
)

print(df.drop('night'))


shape: (8, 7)
┌───────────┬─────────────────────┬──────┬────────┬────────────┬─────────────┬───────┐
│ series_id ┆ timestamp           ┆ step ┆ event  ┆ onset_step ┆ wakeup_step ┆ state │
│ ---       ┆ ---                 ┆ ---  ┆ ---    ┆ ---        ┆ ---         ┆ ---   │
│ str       ┆ str                 ┆ i64  ┆ str    ┆ i64        ┆ i64         ┆ bool  │
╞═══════════╪═════════════════════╪══════╪════════╪════════════╪═════════════╪═══════╡
│ A         ┆ 2022-01-01 01:00:00 ┆ 1    ┆ null   ┆ null       ┆ null        ┆ null  │
│ A         ┆ 2022-01-01 02:00:00 ┆ 2    ┆ onset  ┆ 2          ┆ null        ┆ true  │
│ A         ┆ 2022-01-01 03:00:00 ┆ 3    ┆ null   ┆ 2          ┆ null        ┆ true  │
│ A         ┆ 2022-01-01 04:00:00 ┆ 4    ┆ wakeup ┆ 2          ┆ 4           ┆ false │
│ B         ┆ 2022-01-01 01:00:00 ┆ 1    ┆ null   ┆ 2          ┆ 4           ┆ false │
│ B         ┆ 2022-01-01 02:00:00 ┆ 2    ┆ null   ┆ 2          ┆ 4           ┆ true  │
│ B         ┆ 2022-01-01 03:0

**Merge data**

In [10]:
# Add onset_step and wakeup_step to df_events
df_events = df_events.with_columns(
    onset_step = pl.when(pl.col('event') == 'onset').then(pl.col('step')),
    wakeup_step = pl.when(pl.col('event') == 'wakeup').then(pl.col('step'))
)

In [11]:
# Merge df_signals and df_events
df = df_signals.join(df_events, on=['series_id', 'timestamp', 'step'], how='left')

# Sort by series_id and step
df = df.sort(['series_id', 'step'])

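# Forward fill onset_step and wakeup_step within each series,
# so that values do not carry over from one series_id to the next
df = df.with_columns(
    onset_step = pl.col('onset_step').forward_fill().over('series_id'),
    wakeup_step = pl.col('wakeup_step').forward_fill().over('series_id')
)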

**Annotation: Sleep / Awake**

In [12]:
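# state is True when onset_step <= step < wakeup_step;
# null comparisons (no onset or wakeup seen yet) default to True
df = df.with_columns(
    state = ((pl.col('step') >= pl.col('onset_step')) & (pl.col('step') < pl.col('wakeup_step'))).fill_null(True)
).drop(['onset_step', 'wakeup_step'])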

**Missing Data**

When an annotation is missing, remove the signals 6 hours after wakeup and 6 hours before sleep onset.

In [13]:
# For each parquet file representing a time series:
# 1. Sort the rows by timestamp.
# 2. Look for periods of 20 hours without sleep.
# 3. Remove a period of 16 hours from each, because we consider its annotations missing.

**Smoothing**

In [14]:
# Your code here ...

**Stratified Export**

In [15]:
# Your code here ...