# Data Preparation

This notebook prepares the data for machine learning in three steps:

1. Annotate the dataset (Awake 0 / Sleep 1)
2. Signal preparation (scaling, missing data, outliers, smoothing)
3. Subset generation (light, medium, heavy)

## Import


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl

from tqdm import tqdm

## Data Preparation

**Convert timestamp to datetime**


In [2]:
timestamp = [
    pl.col("timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S%z")
]
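
As a quick sanity check of the format string, the same pattern can be tried with the standard library's `datetime.strptime`. This is only a sketch: the sample value below is made up for illustration and is not taken from the dataset.

```python
from datetime import datetime, timedelta

# Hypothetical sample value; the real data may use a different UTC offset.
sample = "2018-08-14T15:30:00-0400"

# Same format string as in the polars expression above.
parsed = datetime.strptime(sample, "%Y-%m-%dT%H:%M:%S%z")

print(parsed.isoformat())  # 2018-08-14T15:30:00-04:00
```

The `%z` directive keeps the UTC offset, so the parsed value is timezone-aware.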

**Min-max normalization**

In [3]:
min_max_normalization = lambda x: (x - x.min()) / (x.max() - x.min())
normalization = [
    pl.col("anglez").map_batches(min_max_normalization).cast(pl.Float32), 
    pl.col("enmo").map_batches(min_max_normalization).cast(pl.Float32),
    pl.col("step").cast(pl.UInt32),
]
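
To make the formula concrete, here is a minimal plain-Python sketch of min-max scaling on made-up values. The helper name `min_max` and the handling of a constant signal (returning zeros) are assumptions for illustration; the lambda above would divide by zero in that case.

```python
def min_max(values):
    """Scale values to [0, 1]; return all zeros for a constant signal."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # (x - min) / (max - min) would divide by zero here.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


anglez = [-90.0, 0.0, 45.0, 90.0]  # made-up values
scaled = min_max(anglez)
print(scaled)  # [0.0, 0.5, 0.75, 1.0]
```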


**Data import**

In [4]:
df_signals = pl.scan_parquet("data/train_series.parquet").with_columns(
    timestamp + normalization
).collect(streaming=True)

In [5]:
df_events = pl.scan_csv("data/train_events.csv").with_columns(
    timestamp + [pl.col("step").cast(pl.UInt32)]
).drop_nulls().collect()

**Data cleaning**

In [6]:
# Remove nights whose onset and wakeup counts differ from df_events
# (null events were already dropped at import)
mismatches = (
    df_events.group_by(['series_id', 'night'])
    .agg(
        (pl.col('event') == 'onset').sum().alias('onset'),
        (pl.col('event') == 'wakeup').sum().alias('wakeup'),
    )
    .sort(by=['series_id', 'night'])
    .filter(pl.col('onset') != pl.col('wakeup'))
    .select(pl.all().exclude('onset', 'wakeup'))
)
print(f"The mismatch Onset and Wakeup are : \n {mismatches}")
df_events = df_events.join(mismatches, on=['series_id', 'night'], how='anti')


The mismatch Onset and Wakeup are : 
 shape: (5, 2)
┌──────────────┬───────┐
│ series_id    ┆ night │
│ ---          ┆ ---   │
│ str          ┆ i64   │
╞══════════════╪═══════╡
│ 0ce74d6d2106 ┆ 20    │
│ 154fe824ed87 ┆ 30    │
│ 44a41bba1ee7 ┆ 10    │
│ efbfc4526d58 ┆ 7     │
│ f8a8da8bdd00 ┆ 17    │
└──────────────┴───────┘
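
The cleaning step above boils down to counting events per night and then anti-joining the bad nights. A minimal plain-Python sketch of the same idea, on made-up events, might look like this (the tuple layout is an assumption for illustration):

```python
from collections import Counter

# Made-up events: (series_id, night, event)
events = [
    ("a", 1, "onset"), ("a", 1, "wakeup"),
    ("a", 2, "onset"),                      # wakeup missing for this night
    ("b", 1, "onset"), ("b", 1, "wakeup"),
]

counts = Counter(events)                    # occurrences per (series_id, night, event)
nights = {(s, n) for s, n, _ in events}

# Nights where the number of onsets differs from the number of wakeups
mismatched = {
    (s, n) for s, n in nights
    if counts[(s, n, "onset")] != counts[(s, n, "wakeup")]
}

# Anti-join: keep only events whose (series_id, night) is not mismatched
clean = [e for e in events if (e[0], e[1]) not in mismatched]
```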


In [7]:
# Count for each series_id the number of onset and wakeup events
df_events_problem = df_events.group_by(['series_id']).agg(
    (pl.col('event') == 'onset').sum().alias('onset'),
    (pl.col('event') == 'wakeup').sum().alias('wakeup')
    ).sort(by=['series_id'])

In [8]:
# display the series_id with mismatched counts
mismatches = df_events_problem.filter(pl.col('onset') != pl.col('wakeup')).select(pl.all().exclude('onset', 'wakeup'))
print(f"The mismatch Onset and Wakeup are : \n {mismatches}")

The mismatch Onset and Wakeup are : 
 shape: (0, 1)
┌───────────┐
│ series_id │
│ ---       │
│ str       │
╞═══════════╡
└───────────┘


**Merge data**

In [9]:
df = df_signals.join_asof(
        df_events.drop('timestamp'),
        on='step',
        by='series_id',
        strategy='backward',
    )
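
A backward as-of join attaches to each signal row the most recent event at or before its `step`. The sketch below reproduces that behaviour for a single series in plain Python; the helper `asof_backward` and the sample steps are hypothetical. It assumes the event steps are sorted in ascending order, just as `join_asof` expects sorted join keys.

```python
from bisect import bisect_right


def asof_backward(steps, event_steps, event_labels):
    """For each step, return the label of the last event at or before it."""
    out = []
    for s in steps:
        i = bisect_right(event_steps, s)  # number of events with step <= s
        out.append(event_labels[i - 1] if i > 0 else None)
    return out


event_steps = [3, 7]                      # made-up event positions
event_labels = ["onset", "wakeup"]
joined = asof_backward(range(10), event_steps, event_labels)
print(joined)
```

Rows before the first event receive no match (`None` here, null in polars), and the row whose step equals the event step already carries that event.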

**Annotation Sleep // Awake**

In [10]:
df = df.with_columns(
        # 1 = asleep (most recent event is an onset), 0 = awake or no event yet
        state=pl.when(pl.col('event') == 'onset').then(1).otherwise(0),
    ).select(
        pl.all().exclude('event', 'night')
    )

In [11]:
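df = (
    df.with_columns(
        # Next state minus current state, computed within each series so that
        # no transition is detected across the boundary between two series
        delta = pl.col('state').shift(-1).over('series_id') - pl.col('state'),
    ).with_columns(
        wakeup = pl.when(pl.col('delta') == -1).then(True).otherwise(False),
        onset = pl.when(pl.col('delta') == 1).then(True).otherwise(False),
    )
).drop('delta')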
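
The labelling and transition logic above can be traced on a tiny made-up sequence. This plain-Python sketch mirrors the two cells for a single series; the `events` list is invented for illustration.

```python
# Event attached to each row after the as-of join (None = no event yet)
events = [None, None, "onset", "onset", "onset", "wakeup", "wakeup"]

# 1 while the latest event is an onset (asleep), 0 otherwise
state = [1 if e == "onset" else 0 for e in events]

# delta[i] = state[i + 1] - state[i]; the last row has no successor
delta = [nxt - cur for cur, nxt in zip(state, state[1:])] + [None]

onset = [d == 1 for d in delta]
wakeup = [d == -1 for d in delta]

print(state)  # [0, 0, 1, 1, 1, 0, 0]
```

Because the difference looks one row ahead, each flag is set on the sample just before the state changes: here `onset` is true at index 1 and `wakeup` at index 4.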

**Missing Data**

When an annotation is missing, remove signals 6 hours after wakeup and 6 hours before sleep onset.

In [12]:
# For each parquet file representing a time series:
# - sort it by timestamp
# - if a period of 20 hours passes without sleep,
#   remove a 16-hour period, because we consider the annotations missing
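
One way to locate such periods is to scan the events of a series for a wakeup followed by an onset more than 20 hours later. The sketch below only detects the gaps; the helper name, the strict "more than" comparison, and the sample timestamps are assumptions, and how much of each gap to drop is left to the caller.

```python
from datetime import datetime, timedelta


def long_awake_gaps(events, threshold=timedelta(hours=20)):
    """Return (wakeup, next_onset) pairs separated by more than `threshold`.

    `events` is a list of (timestamp, event) tuples sorted by timestamp.
    """
    gaps = []
    last_wakeup = None
    for ts, event in events:
        if event == "wakeup":
            last_wakeup = ts
        elif event == "onset" and last_wakeup is not None:
            if ts - last_wakeup > threshold:
                gaps.append((last_wakeup, ts))
            last_wakeup = None
    return gaps


# Made-up events for one series
events = [
    (datetime(2023, 1, 1, 23), "onset"),
    (datetime(2023, 1, 2, 7), "wakeup"),
    (datetime(2023, 1, 2, 23), "onset"),   # 16 h awake: normal
    (datetime(2023, 1, 3, 7), "wakeup"),
    (datetime(2023, 1, 4, 23), "onset"),   # 40 h awake: a night is likely unannotated
]
gaps = long_awake_gaps(events)
```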

In [None]:
# def make_train_dataset(train_data, train_events, drop_nulls=False) :
    
#     series_ids = train_data['series_id'].unique(maintain_order=True).to_list()
#     X, y = pl.DataFrame(), pl.DataFrame()
#     for idx in tqdm(series_ids) : 
        
#         # Normalizing sample features
#         sample = train_data.filter(pl.col('series_id')==idx).with_columns(
#             [(pl.col(col) / pl.col(col).std()).cast(pl.Float32) for col in feature_cols if col != 'hour']
#         )
        
#         events = train_events.filter(pl.col('series_id')==idx)
        
#         if drop_nulls : 
#             # Removing datapoints on dates where no data was recorded
#             sample = sample.filter(
#                 pl.col('timestamp').dt.date().is_in(events['timestamp'].dt.date())
#             )

**Smoothing**

In [13]:
# Your code here ...
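
As one possible starting point for this step, a centered moving average is a common way to smooth a noisy signal. The sketch below is plain Python on made-up values; the choice of a moving average, the helper name, and the window size are all assumptions, not the method this notebook settles on.

```python
def rolling_mean(values, window=3):
    """Centered moving average (odd window); edges use the available samples."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


enmo = [0.0, 0.0, 3.0, 0.0, 0.0]   # made-up signal with a single spike
smoothed = rolling_mean(enmo, window=3)
print(smoothed)  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

A wider window removes more noise but also blurs the sharp changes around onset and wakeup, so the window size is a trade-off to tune.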

## Train-Test Split

In [16]:
df.group_by('series_id').agg(pl.count().alias('count')).sort(by='count', descending=True)

┌──────────────┬─────────┐
│ series_id    ┆ count   │
│ ---          ┆ ---     │
│ str          ┆ u32     │
╞══════════════╪═════════╡
│ 78569a801a38 ┆ 1433880 │
│ f564985ab692 ┆ 1052820 │
│ fb223ed2278c ┆ 918360  │
│ f56824b503a0 ┆ 846360  │
│ cfeb11428dd7 ┆ 809820  │
│ 062dbd4c95e6 ┆ 778680  │
│ f0482490923c ┆ 761940  │
│ 6ca4f4fca6a2 ┆ 759240  │
│ d043c0ca71cd ┆ 745020  │
│ 12d01911d509 ┆ 744480  │
└──────────────┴─────────┘


In [17]:
wow = df.filter(
    pl.col('series_id') == '78569a801a38'
)

In [18]:
wow.write_parquet('data/78569a801a38.parquet')

**Stratified Export**

In [15]:
# Your code here ...