<img src="https://i.imgur.com/de3OrxO.png">

<center><h1> Detect Sleep States </h1></center>
<center><h1>- bringing down the memory usage -</h1></center>

> 📌 **Competition Scope**: detect the occurrence of *onset* (the beginning of sleep) and *wakeup* (the end of sleep) in the accelerometer series.

### Dataframes are very large

The first thing I tried to do in this competition was look at the data, but the `train_series.parquet` file is HUGE (well over a hundred million rows).

I downloaded the data locally and worked to bring the memory usage down as far as possible. This notebook contains the code I used, the process, and the results.

* `train_events` -> **73% reduction**
* `train_series` -> **86% reduction**

### ○ Libraries

In [None]:
# libraries
import os
import re
import gc
import wandb
import random
import math
from glob import glob
from tqdm import tqdm
from time import time
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import pandas as pd
import numpy as np
import cudf

# env check
warnings.filterwarnings('ignore')
os.environ["WANDB_SILENT"] = "true"
CONFIG = {'competition': '2023_sleep', '_wandb_kernel': 'aot', "source_type": "artifact"}

# color
class clr:
    S = '\033[1m' + '\033[90m'
    E = '\033[0m'
    
my_colors = ["#f79256", "#fbd1a2", "#7dcfb6", "#00b2ca"]

print(clr.S+"Notebook Color Schemes:"+clr.E)
sns.palplot(sns.color_palette(my_colors))
plt.show()

### 🐝 W&B Fork & Run

In order to run this notebook you will need to input your own **secret API key** within the `! wandb login $secret_value_0` line. 

🐝**How do you get your own API key?**

Super simple! Go to **https://wandb.ai/site** -> Login -> Click on your profile in the top right corner -> Settings -> Scroll down to API keys -> copy your very own key (for more info check [this amazing notebook for ML Experiment Tracking on Kaggle](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases)).

<center><img src="https://i.imgur.com/fFccmoS.png" width=500></center>

In [None]:
# 🐝 secrets
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb")

! wandb login $secret_value_0

### ○ Helper Functions Below

In [None]:
# === data discover ===
def get_general_info(df, desc=None):
    
    # 🐝 new exp
    run = wandb.init(project='2023_sleep', name=f'{desc}_data_summary', config=CONFIG)

    print(clr.S+"--- General Info ---"+clr.E)
    print(clr.S+"Data Shape:"+clr.E, df.shape)
    print(clr.S+"Data Cols:"+clr.E, df.columns.tolist())
    print(clr.S+"Total No. of Cols:"+clr.E, len(df.columns.tolist()))
    print(clr.S+"No. Missing Values:"+clr.E, df.isna().sum().sum())
    print(clr.S+"Columns with missing data:"+clr.E, "\n",
          df.isna().sum()[df.isna().sum() != 0], "\n")

    print(clr.S+"--- [object] columns ---"+clr.E)
    str_cols = df.select_dtypes(include=[object]).columns
    for col in str_cols:
        print(clr.S+f"[nunique] {col}:"+clr.E, 
              df[col].nunique())

    print("\n")

    print(clr.S+"--- [numerical] columns ---"+clr.E)
    digit_cols = df.select_dtypes(include=[int, float]).columns
    for col in digit_cols:
        print(clr.S+f"[describe] {col}:"+clr.E, "\n",
              df[col].describe())
        
    # log summary stats to W&B
    wandb.log(
        {"data_shape": len(df),
         "missing_values": df.isna().sum().sum(),
         "unique_id": df.series_id.nunique(),
        }
    )
    wandb.finish()
    print("🐝 Info saved to dashboard.")
            

def get_missing_values_plot(df):
    '''
    Plots missing values barchart for a given dataframe.
    '''
    
    # count missing values
    missing_counts = df.isnull().sum().reset_index()\
                            .sort_values(0, ascending=False)\
                            .reset_index(drop=True)
    missing_counts.columns = ["col_name", "missing_count"]

    # plot
    plt.figure(figsize=(24, 16))
    axs = sns.barplot(y=missing_counts.col_name, x=missing_counts.missing_count, 
                      color=my_colors[0])
    show_values_on_bars(axs, h_v="h", space=0.4)
    plt.xlabel('no. missing values', size=20, weight="bold")
    plt.ylabel('column name', size=20, weight="bold")
    plt.title('Missing Values', size=22, weight="bold")
    plt.show();
            
            
# === plots ===
def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(_x, _y, format(value, ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)
        
        
# === 🐝 w&b ===
def save_dataset_artifact(run_name, artifact_name, path, data_type="dataset"):
    '''Saves dataset to W&B Artifactory.
    run_name: name of the experiment
    artifact_name: under what name should the dataset be stored
    path: path to the dataset'''
    
    run = wandb.init(project='2023_sleep', 
                     name=run_name, 
                     config=CONFIG)
    artifact = wandb.Artifact(name=artifact_name, 
                              type=data_type)
    artifact.add_file(path)

    wandb.log_artifact(artifact)
    wandb.finish()
    print(f"🐝Artifact {artifact_name} has been saved successfully.")
    
    
def create_wandb_plot(x_data=None, y_data=None, x_name=None, y_name=None, title=None, log=None, plot="line"):
    '''Create and save lineplot/barplot in W&B Environment.
    x_data & y_data: Pandas Series containing x & y data
    x_name & y_name: strings containing axis names
    title: title of the graph
    log: string containing name of log'''
    
    data = [[label, val] for (label, val) in zip(x_data, y_data)]
    table = wandb.Table(data=data, columns = [x_name, y_name])
    
    if plot == "line":
        wandb.log({log : wandb.plot.line(table, x_name, y_name, title=title)})
    elif plot == "bar":
        wandb.log({log : wandb.plot.bar(table, x_name, y_name, title=title)})
    elif plot == "scatter":
        wandb.log({log : wandb.plot.scatter(table, x_name, y_name, title=title)})
        
        
def create_wandb_hist(x_data=None, x_name=None, title=None, log=None):
    '''Create and save histogram in W&B Environment.
    x_data: Pandas Series containing x values
    x_name: strings containing axis name
    title: title of the graph
    log: string containing name of log'''
    
    data = [[x] for x in x_data]
    table = wandb.Table(data=data, columns=[x_name])
    wandb.log({log : wandb.plot.histogram(table, x_name, title=title)})

# events.csv

In [None]:
# read data
events = cudf.read_csv("/kaggle/input/child-mind-institute-detect-sleep-states/train_events.csv")
events.head()

In [None]:
get_general_info(events, desc="events")

## 1. Initial memory usage

In [None]:
# initial memory usage
events.memory_usage(deep=True)

In [None]:
# total in MB
events.memory_usage(deep=True).sum() / (1024 * 1024)

In [None]:
# these are the initial dtypes
# for the variables within the dataframe
events.dtypes
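
To see why the `object` columns matter so much, here is a minimal sketch in plain `pandas` (not `cudf`, which the cells above use). The toy frame and its values are made up for illustration; only the `memory_usage(deep=...)` call mirrors what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the events table; the ids are invented, not real series ids
toy = pd.DataFrame({
    "series_id": pd.Series(["aaaaaaaaaaaa", "bbbbbbbbbbbb"] * 500, dtype=object),
    "step": np.arange(1000, dtype=np.int64),
})

shallow = toy.memory_usage(deep=False)  # object column: only the 8-byte pointers
deep = toy.memory_usage(deep=True)      # object column: pointers + string payloads

# The gap between the two is concentrated in the object column
print((deep - shallow) / 1024)
```

Numeric columns report the same size either way, so the object columns are the first place to look when trimming memory.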

## 2. Decrease memory usage

* `series_id`: from object -> uint16
    * there are only 277 unique ids
    * hence I remapped them with an id_map from 0 to 276
    * easier during development, can easily switch back to original id
* `night`: from int64 -> uint16
* `event`: from object -> uint8
    * relabeled as follows: 'onset':'1', 'wakeup':'2'
* `step`: from int64 -> uint32
* `timestamp`: from object -> datetime64
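
The list above can be sketched on a toy frame in plain `pandas`. This is only an illustration under assumptions: the values are invented, `pd.factorize` stands in for the id map built in the next cell, and the toy data has no missing steps or timestamps (missing values would need handling before an unsigned cast in `pandas`).

```python
import numpy as np
import pandas as pd

# Toy events frame; ids, steps and timestamps are invented
events_toy = pd.DataFrame({
    "series_id": pd.Series(["aaa", "aaa", "bbb", "bbb"], dtype=object),
    "night": np.array([1, 1, 1, 1], dtype=np.int64),
    "event": pd.Series(["onset", "wakeup", "onset", "wakeup"], dtype=object),
    "step": np.array([4992, 10932, 5100, 11000], dtype=np.int64),
    "timestamp": pd.Series(["2018-08-14 22:26:00", "2018-08-15 06:41:00",
                            "2018-08-14 21:10:00", "2018-08-15 07:02:00"], dtype=object),
})
before = events_toy.memory_usage(deep=True).sum()

small = pd.DataFrame({
    # factorize assigns codes 0..n_unique-1 in order of first appearance
    "id_map": pd.factorize(events_toy["series_id"])[0].astype(np.uint16),
    "night": events_toy["night"].astype(np.uint16),
    "event": events_toy["event"].map({"onset": 1, "wakeup": 2}).astype(np.uint8),
    "step": events_toy["step"].astype(np.uint32),
    "timestamp": pd.to_datetime(events_toy["timestamp"], format="%Y-%m-%d %H:%M:%S"),
})
after = small.memory_usage(deep=True).sum()

print(f"{before} bytes -> {after} bytes")
```

The chosen widths leave headroom: `uint16` holds values up to 65,535 and `uint32` up to about 4.29 billion, which `np.iinfo` can confirm for any column maximum.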

In [None]:
new_events = events.copy()

# 277 ids (same id's as in series)
# new_events.series_id = new_events.series_id.str.filter_characters({'0':'9'}).astype(np.int64)
# or map it :)
train_id_map = cudf.DataFrame({"series_id": new_events.series_id.unique(),
                               "id_map": new_events.series_id.unique().index})
train_id_map.id_map = train_id_map.id_map.astype(np.uint16)
train_id_map.to_parquet("./train_id_map.parquet", index=False)
new_events = new_events.merge(right=train_id_map, on="series_id").drop(columns="series_id")

# night
new_events.night = new_events.night.astype(np.uint16)
# event relabeled
new_events.event = new_events.event.replace({'onset':'1', 'wakeup':'2'}).astype(np.uint8)
# step
new_events.step = new_events.step.astype(np.uint32)
# timestamp
new_events.timestamp = cudf.to_datetime(new_events.timestamp, format='%Y-%m-%d %H:%M:%S')
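
Switching back to the original ids is a lookup against the saved map. A minimal `pandas` sketch, with a hypothetical three-row map standing in for `train_id_map.parquet`:

```python
import numpy as np
import pandas as pd

# Hypothetical id map (the real one is read from train_id_map.parquet)
id_map_toy = pd.DataFrame({
    "series_id": ["aaa", "bbb", "ccc"],
    "id_map": np.array([0, 1, 2], dtype=np.uint16),
})
compact = pd.DataFrame({
    "id_map": np.array([2, 0, 0, 1], dtype=np.uint16),
    "step": np.array([10, 20, 30, 40], dtype=np.uint32),
})

# how="left" keeps every row of `compact` in its original order;
# validate="many_to_one" raises if an id_map value appears twice in the map
restored = compact.merge(id_map_toy, on="id_map", how="left", validate="many_to_one")
print(restored["series_id"].tolist())  # ['ccc', 'aaa', 'aaa', 'bbb']
```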

## 3. Updated memory

In [None]:
# updated memory usage
new_events.memory_usage(deep=True)

In [None]:
# total in MB
new_events.memory_usage(deep=True).sum() / (1024 * 1024)
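
The reduction percentages quoted at the top of the notebook are the share of the original footprint that was removed. A small helper for that arithmetic; `pct_reduction` is a hypothetical name and the sizes in the example are not the notebook's measurements:

```python
def pct_reduction(before_mb: float, after_mb: float) -> float:
    """Percentage of the original memory footprint that was removed."""
    if before_mb <= 0:
        raise ValueError("before_mb must be positive")
    return 100.0 * (before_mb - after_mb) / before_mb


# Hypothetical sizes in MB, for illustration only
print(round(pct_reduction(200.0, 50.0), 1))  # 75.0
```

In practice both arguments would come from `df.memory_usage(deep=True).sum() / (1024 * 1024)` on the frame before and after the dtype changes.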

In [None]:
new_events.dtypes

### Comparison before & after:

<center><img src="https://i.imgur.com/7qVZHBe.jpg" width=800></center>

# series.parquet

> 📌 **Note**: I cannot showcase the changes I made to `train_series.parquet`, as I get a lot of OOM errors within the Kaggle environment. Plus, the conversion of `timestamp` from dtype `object` to dtype `datetime` took **7 hours to run** :).

## 1. Initial memory usage

<center><img src="https://i.imgur.com/XRwBiCL.jpg" width=600></center>

## 2. Decrease memory usage

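The code is almost the same as for events.csv, but I used `pandas` instead of `cudf` so that the timestamp column could be converted with an `apply()`-style function (`parallel_apply()` from pandarallel, shown further down):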
```
series = pd.read_parquet("/kaggle/input/child-mind-institute-detect-sleep-states/train_series.parquet")
```

I reused the `id_map` built earlier from the events dataset:
```
train_id_map = pd.read_parquet("/kaggle/input/detect-sleep-states-memory-decrease/train_id_map.parquet")
new_series = series.merge(right=train_id_map, on="series_id")\
            .drop(columns="series_id")\
            .reset_index(drop=True)
```
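
As a possible alternative to the merge, the same remapping can be done with `Series.map`, which writes the small integer column directly. This is a sketch on toy data, not what was run on the full file, and I have not measured whether it lowers peak memory; the names `lookup` and `series_toy` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_id_map and train_series
id_map_toy = pd.DataFrame({
    "series_id": ["aaa", "bbb"],
    "id_map": np.array([0, 1], dtype=np.uint16),
})
series_toy = pd.DataFrame({"series_id": ["aaa", "aaa", "bbb"], "step": [0, 1, 0]})

lookup = id_map_toy.set_index("series_id")["id_map"]  # series_id -> id_map
mapped = series_toy["series_id"].map(lookup)

# An id missing from the lookup becomes NaN; fail loudly before casting
if mapped.isna().any():
    raise KeyError("some series_id values are missing from the id map")

new_series_toy = (series_toy.drop(columns="series_id")
                            .assign(id_map=mapped.astype(np.uint16)))
print(new_series_toy.dtypes)
```

One behavioural difference to keep in mind: the default inner merge silently drops rows whose id is not in the map, while this version raises.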

Step convert:
```
new_series.step = new_series.step.astype(np.uint32)
```

Timestamp convert (thank you [@carlmcbrideellis](https://www.kaggle.com/carlmcbrideellis) for help!):
```
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

# Local Time converter
def to_date_time(x):
    import pandas as pd
    return pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S') # utc=True

def to_localize(t):
    import pandas as pd
    return t.tz_localize(None)

new_series["timestamp"] = new_series.timestamp.parallel_apply(to_date_time).parallel_apply(to_localize)
```
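
A vectorized conversion might avoid the per-row `apply` entirely. The sketch below is not what was run for this notebook and is untested on the full file. It assumes the raw strings are ISO-8601 with a UTC offset such as `-0400`; the sample values are invented, and the format would need adjusting if the real strings differ.

```python
import pandas as pd

# Assumption: ISO-8601 strings with a UTC offset; these values are invented
raw = pd.Series(["2018-08-14T15:30:00-0400", "2018-11-05T10:00:05-0500"], dtype=object)

# Option 1: utc=True normalises every offset to UTC, so the whole column
# gets a single datetime dtype; tz_localize(None) then drops the timezone.
# The resulting naive values are in UTC, not local wall-clock time.
naive_utc = pd.to_datetime(raw, utc=True).dt.tz_localize(None)

# Option 2: keep local wall-clock time by cutting off the offset
# (first 19 characters = "YYYY-MM-DDTHH:MM:SS") before parsing
naive_local = pd.to_datetime(raw.str.slice(0, 19), format="%Y-%m-%dT%H:%M:%S")

print(naive_utc.dtype, naive_local.dtype)
```

Option 2 matches what dropping the timezone row by row does (the wall-clock time is kept), while Option 1 shifts everything to UTC, so the choice depends on whether time of day matters downstream.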

## 3. Updated memory

<center><img src="https://i.imgur.com/DBN7WZY.png" width=500></center>

### before & after comparison

<center><img src="https://i.imgur.com/IhRv8EO.jpg" width=800></center>

In [None]:
# 🐝 train_id_map.parquet
save_dataset_artifact(run_name="train_id_map",
                      artifact_name="train_id_map",
                      path="/kaggle/input/detect-sleep-states-memory-decrease/train_id_map.parquet", 
                      data_type="dataset")

In [None]:
# 🐝 train_events.parquet
save_dataset_artifact(run_name="train_events_memory",
                      artifact_name="train_events",
                      path="/kaggle/input/detect-sleep-states-memory-decrease/train_events.parquet", 
                      data_type="dataset")

In [None]:
# 🐝 train_series.parquet
# this saves very fast, considering we deal with ~140mil rows
save_dataset_artifact(run_name="train_series_memory",
                      artifact_name="train_series",
                      path="/kaggle/input/detect-sleep-states-memory-decrease/train_series.parquet", 
                      data_type="dataset")

# new datasets links:

You can find the new datasets here:
* **kaggle datasets**: https://www.kaggle.com/datasets/andradaolteanu/detect-sleep-states-memory-decrease
* 🐝 **W&B artifacts**: https://wandb.ai/andrada/2023_sleep/artifacts/code/source-2023_sleep-None/v1
    * easier for storage & versioning :)
    
### a moment for the pandarallel library

The `timestamp` column is a pain in the behind. Using the plain pandas `Series.apply()` to convert it from object to datetime dtype took ~7hrs on my workstation.

Switching to `parallel_apply()` from `pandarallel` brought that down from 7hrs to just 1hr (I have 14 CPU cores, though). :) What a change in performance! It freed up plenty of time to debug why the column still ends up saved as `object` instead of `datetime`.

<center><img src="https://i.imgur.com/3au3CjS.png" width=800></center>

### 🐝 [my W&B dash](https://wandb.ai/andrada/2023_sleep?workspace=user-andrada)
    
<center><img src="https://i.imgur.com/DoYO51s.jpg"></center>

------

<center><img src="https://i.imgur.com/knxTRkO.png"></center>

### My Specs

* 🖥 Z8 G4 Workstation
* 💾 2 CPUs & 96GB Memory
* 🎮 2x NVIDIA A6000
* 💻 Zbook Studio G9 on the go