# Data Preparation

This notebook prepares the data for machine learning in three steps:

1. Annotate the dataset (Sleep = 0, Awake = 1)
2. Signal Preparation (scaling, missing data, outliers, smoothing)
3. Subset generation (light, medium, heavy)

## Import


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl

from tqdm import tqdm

## Data Preparation

**Convert timestamp to datetime**


In [2]:
timestamp = [
    pl.col("timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S%z")
]
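To make the format string concrete, here is a small standard-library sketch. It uses Python's `datetime.strptime` rather than the polars parser, so it only approximates what `str.to_datetime` does, and the timestamp value is made up for illustration.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%dT%H:%M:%S%z"  # same format string as in the cell above


def parse_to_utc(raw: str) -> datetime:
    """Parse a timestamp with a numeric UTC offset and convert it to UTC."""
    return datetime.strptime(raw, FMT).astimezone(timezone.utc)


# Illustrative value, not read from the dataset
parsed = parse_to_utc("2018-08-14T22:26:00-0400")
print(parsed.isoformat())
```

The `%z` directive consumes the offset, which is why the parsed column can be expressed in a single time zone afterwards.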

**Min-max normalization**

In [3]:
min_max_normalization = lambda x: (x - x.min()) / (x.max() - x.min())
normalization = [
    pl.col("anglez").map_batches(min_max_normalization).cast(pl.Float32), 
    pl.col("enmo").map_batches(min_max_normalization).cast(pl.Float32),
    pl.col("step").cast(pl.UInt32),
]
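The formula in the lambda can be checked by hand with a plain-Python sketch. The sample values are made up, and the guard for a constant signal is an addition of this sketch that the notebook expression does not have. In the notebook, the minimum and maximum are taken over the whole column, that is, across all series at once.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant signal: the formula would divide by zero, so return zeros
        # (a choice made for this sketch, not taken from the notebook).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


anglez_sample = [-90.0, -45.0, 0.0, 90.0]  # made-up values
scaled = min_max_normalize(anglez_sample)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```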


**Data import**

In [4]:
df_signals = pl.scan_parquet("data/train_series.parquet").with_columns(
    timestamp + normalization
).collect(streaming=True)

In [5]:
df_events = pl.scan_csv("data/train_events.csv").with_columns(
    timestamp + [pl.col("step").cast(pl.UInt32)]
).drop_nulls().collect()

**Data cleaning**

In [6]:
# Remove nights whose onset and wakeup counts differ (null events were already dropped on load)
mismatches = (
    df_events.group_by(['series_id', 'night'])
    .agg(
        (pl.col('event') == 'onset').sum().alias('onset'),
        (pl.col('event') == 'wakeup').sum().alias('wakeup'),
    )
    .sort(by=['series_id', 'night'])
    .filter(pl.col('onset') != pl.col('wakeup'))
    .select(pl.all().exclude('onset', 'wakeup'))
)
print(f"The mismatch Onset and Wakeup are : \n {mismatches}")
df_events = df_events.join(mismatches, on=['series_id', 'night'], how='anti')


The mismatch Onset and Wakeup are : 
 shape: (5, 2)
┌──────────────┬───────┐
│ series_id    ┆ night │
│ ---          ┆ ---   │
│ str          ┆ i64   │
╞══════════════╪═══════╡
│ 0ce74d6d2106 ┆ 20    │
│ 154fe824ed87 ┆ 30    │
│ 44a41bba1ee7 ┆ 10    │
│ efbfc4526d58 ┆ 7     │
│ f8a8da8bdd00 ┆ 17    │
└──────────────┴───────┘
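The count-then-anti-join logic of this cell can be sketched without polars. The events below are made up, and `drop_mismatched_nights` is a hypothetical helper written for this illustration only.

```python
from collections import Counter

# Made-up events: (series_id, night, event)
events = [
    ("a", 1, "onset"), ("a", 1, "wakeup"),
    ("a", 2, "onset"),                      # wakeup missing: mismatched night
    ("b", 1, "onset"), ("b", 1, "wakeup"),
]


def drop_mismatched_nights(rows):
    """Keep only events from nights that have as many onsets as wakeups."""
    onsets, wakeups = Counter(), Counter()
    for series_id, night, event in rows:
        if event == "onset":
            onsets[(series_id, night)] += 1
        elif event == "wakeup":
            wakeups[(series_id, night)] += 1
    bad = {key for key in onsets.keys() | wakeups.keys() if onsets[key] != wakeups[key]}
    # Equivalent of the anti join: discard every row whose key is in `bad`
    kept = [row for row in rows if (row[0], row[1]) not in bad]
    return kept, bad


clean_events, mismatched = drop_mismatched_nights(events)
print(mismatched)  # {('a', 2)}
```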


In [7]:
print(df_events.head(10))

shape: (10, 5)
┌──────────────┬───────┬────────┬───────┬─────────────────────────┐
│ series_id    ┆ night ┆ event  ┆ step  ┆ timestamp               │
│ ---          ┆ ---   ┆ ---    ┆ ---   ┆ ---                     │
│ str          ┆ i64   ┆ str    ┆ u32   ┆ datetime[μs, UTC]       │
╞══════════════╪═══════╪════════╪═══════╪═════════════════════════╡
│ 038441c925bb ┆ 1     ┆ onset  ┆ 4992  ┆ 2018-08-15 02:26:00 UTC │
│ 038441c925bb ┆ 1     ┆ wakeup ┆ 10932 ┆ 2018-08-15 10:41:00 UTC │
│ 038441c925bb ┆ 2     ┆ onset  ┆ 20244 ┆ 2018-08-15 23:37:00 UTC │
│ 038441c925bb ┆ 2     ┆ wakeup ┆ 27492 ┆ 2018-08-16 09:41:00 UTC │
│ …            ┆ …     ┆ …      ┆ …     ┆ …                       │
│ 038441c925bb ┆ 4     ┆ onset  ┆ 57240 ┆ 2018-08-18 03:00:00 UTC │
│ 038441c925bb ┆ 4     ┆ wakeup ┆ 62856 ┆ 2018-08-18 10:48:00 UTC │
│ 038441c925bb ┆ 6     ┆ onset  ┆ 91296 ┆ 2018-08-20 02:18:00 UTC │
│ 038441c925bb ┆ 6     ┆ wakeup ┆ 97860 ┆ 2018-08-20 11:25:00 UTC │
└──────────────┴───────┴────────┴───────┴─────────────────────────┘

In [8]:
# Count for each series_id the number of onset and wakeup events
df_events_problem = df_events.group_by(['series_id']).agg(
    (pl.col('event') == 'onset').sum().alias('onset'),
    (pl.col('event') == 'wakeup').sum().alias('wakeup')
    ).sort(by=['series_id'])

In [9]:
# display the series_id with mismatched counts
mismatches = df_events_problem.filter(pl.col('onset') != pl.col('wakeup')).select(pl.all().exclude('onset', 'wakeup'))
print(f"The mismatch Onset and Wakeup are : \n {mismatches}")

The mismatch Onset and Wakeup are : 
 shape: (0, 1)
┌───────────┐
│ series_id │
│ ---       │
│ str       │
╞═══════════╡
└───────────┘


**Merge data**

In [2]:
df = df_signals.join_asof(
        df_events,
        on='step',
        by='series_id',
        strategy='backward',
    )

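A backward as-of join attaches to every signal row the most recent event whose `step` is less than or equal to the row's `step`. The sketch below reproduces that matching rule for a single series with the standard library. `asof_backward` is a hypothetical helper, and the signal steps are invented around the first night shown in the events table.

```python
from bisect import bisect_right


def asof_backward(signal_steps, event_steps, event_labels):
    """For each signal step, return the label of the last event at or before it.

    `event_steps` must be sorted in ascending order. Steps before the first
    event get None, mirroring the nulls a backward as-of join produces.
    """
    matched = []
    for step in signal_steps:
        idx = bisect_right(event_steps, step) - 1
        matched.append(event_labels[idx] if idx >= 0 else None)
    return matched


signal_steps = [0, 4991, 4992, 8000, 10932, 12000]
event_steps = [4992, 10932]
event_labels = ["onset", "wakeup"]
result = asof_backward(signal_steps, event_steps, event_labels)
print(result)  # [None, None, 'onset', 'onset', 'wakeup', 'wakeup']
```

Rows recorded before the first event of a series have no match, so their event is null after the join.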

**Annotation Sleep // Awake**

In [3]:
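df = df.with_columns(
    # Awake (True) when the latest event is a wakeup, asleep (False) after an onset.
    # Rows before a series' first event have a null event and are assumed awake.
    state=(pl.col('event').fill_null('wakeup') == 'wakeup')
    )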


**Missing Data**

When an annotation is missing, remove the signal recorded 6 hours after waking and 6 hours before sleep.

In [13]:
# For each time series:
# 1. Sort the samples by timestamp.
# 2. Look for periods of 20 hours or more without sleep.
# 3. Remove 16 hours from each such period, because its annotations are considered missing.
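One way to turn the plan in the comments into code is sketched below for a single series. It rests on several assumptions. One step is taken to be 5 seconds, which is consistent with the events table, where 5,940 steps separate 02:26 and 10:41. An awake period is taken to run from a wakeup to the next onset. The 16-hour window is centred in the period, because the comments do not say where it should sit. `ranges_to_remove` is a hypothetical helper.

```python
STEP_SECONDS = 5  # assumption: one step every 5 seconds
STEPS_PER_HOUR = 3600 // STEP_SECONDS


def ranges_to_remove(events, min_gap_hours=20, remove_hours=16):
    """Return half-open (start, end) step ranges to drop for one series.

    `events` is a list of (step, event) tuples sorted by step. When a wakeup is
    followed by an onset at least `min_gap_hours` later, a window of
    `remove_hours` centred in that awake period is marked for removal.
    """
    min_gap = min_gap_hours * STEPS_PER_HOUR
    window = remove_hours * STEPS_PER_HOUR
    ranges = []
    for (step, event), (next_step, next_event) in zip(events, events[1:]):
        gap = next_step - step
        if event == "wakeup" and next_event == "onset" and gap >= min_gap:
            start = step + (gap - window) // 2
            ranges.append((start, start + window))
    return ranges


# Made-up events for one series
events = [
    (1_000, "onset"), (7_000, "wakeup"),
    (18_000, "onset"), (24_000, "wakeup"),  # 11,000 awake steps before this onset: kept
    (46_000, "onset"), (52_000, "wakeup"),  # 22,000 awake steps before this onset: trimmed
]
to_drop = ranges_to_remove(events)
print(to_drop)  # [(29240, 40760)]
```

Signal rows whose `step` falls inside one of the returned ranges would then be filtered out before training.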

**Smoothing**

In [14]:
# Your code here ...
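The notebook does not say which smoothing method is intended. As one possible starting point, here is a centred moving average in plain Python. The technique, the window size, and the sample values are all assumptions of this sketch.

```python
def moving_average(values, window=3):
    """Centred moving average; the window shrinks at the edges so the
    output has the same length as the input."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


enmo_sample = [0.0, 0.0, 0.75, 0.0, 0.0]  # made-up signal with a single spike
smoothed = moving_average(enmo_sample, window=3)
print(smoothed)  # [0.0, 0.25, 0.25, 0.25, 0.0]
```

A wider window removes more noise but also blurs the sharp transitions around onset and wakeup, so the window size is worth tuning against the labels.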

**Stratified Export**

In [15]:
# Your code here ...