# Johnny Stuto CS 575: PreProcessing 1

## Neural Network Implementation 

### Problem Definition: Address the issue of sleep/wake prediction based on data from wrist health mon

- The dataset is a substantial time series of roughly 1 GB representing approximately 35 individuals.
- Parquet is an open-source, column-oriented file format.
- Parquet works well with large, complex data and is known for its efficient compression and its support for many encoding types.
- Data source: https://www.kaggle.com/competitions/child-mind-institute-detect-sleep-states/data
- Size: 986.46 MB

In [1]:
import gc
import warnings
from datetime import datetime
import numpy as np
import pandas as pd
warnings.simplefilter(action='ignore', category=Warning)

In [2]:
train_parquet_path = "train_series.parquet"
test_parquet_path = "test_series.parquet"
train_events_path = "train_events.csv"
train_series = pd.read_parquet(train_parquet_path)
test_series = pd.read_parquet(test_parquet_path)
train_events = pd.read_csv(train_events_path)

In [3]:
# Flag every series whose event log has at least one missing `step`.
series_NaN = train_events.groupby('series_id')['step'].apply(lambda x: x.isnull().any())
# series_NaN.value_counts()
non_NaN_series = series_NaN[~series_NaN].index.tolist()
# Known incomplete events data: 31011ade7c0a, a596ad0b82aa
non_NaN_series.remove('31011ade7c0a')
non_NaN_series.remove('a596ad0b82aa')
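
The filter above can be checked on a toy event table. This is a minimal sketch, assuming two hypothetical series IDs (`a` and `b`) that do not come from the dataset: `b` has one missing `step`, so only `a` should survive.

```python
import pandas as pd

# Hypothetical toy events: series "b" has one missing step.
events = pd.DataFrame({
    "series_id": ["a", "a", "b", "b"],
    "step": [10.0, 20.0, 5.0, None],
})

# True for every series that contains at least one NaN step.
has_nan = events.groupby("series_id")["step"].apply(lambda s: s.isnull().any())

# Keep only the series whose event log is complete.
complete = has_nan[~has_nan].index.tolist()
print(complete)  # ['a']
```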

## Function Overview: `train_genny(series)`

### Purpose
- To fetch and clean training data for a specific series ID.

### Steps
1. **Data Reading**:
    - Reads from `train_series.parquet` to obtain training series data for the provided series ID.
    - Fetches event data from `train_events.csv` and filters it for the specified series ID.

2. **Data Cleaning**:
    - Drops NaN values from the event data.
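    - Converts the `step` column to an integer and maps the `event` column to a binary `awake` label (`onset` → 1, `wakeup` → 0). Each label describes the state that ends at the event, so after the backfill in step 3, rows leading up to an onset are marked awake (1) and rows leading up to a wakeup are marked asleep (0).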

3. **Data Merging**:
    - Joins the series and event data based on the "step" column.
    - Backfills NaN values in the `awake` column, so each row takes the label of the next recorded event.
    
4. **Final Adjustments**:
    - Fills any residual NaN values in the `awake` column, which are the rows after the last recorded event, with 1 (indicating the subject is awake).
    - Converts the `awake` column to an integer type.

5. **Return**:
    - Returns the cleaned training data.
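
The merge, backfill, and fill steps above can be traced on a toy example. This is a minimal sketch, assuming a hypothetical 8-step series with one sleep onset at step 2 and one wakeup at step 5. It mirrors the labeling logic only and does not read the real files.

```python
import pandas as pd

# Hypothetical toy series: one row per step.
series = pd.DataFrame({"step": range(8)})

# Hypothetical toy events: sleep onset at step 2, wakeup at step 5.
events = pd.DataFrame({"step": [2, 5], "event": ["onset", "wakeup"]})
events["awake"] = events["event"].map({"onset": 1, "wakeup": 0})

# Left join keeps every series row; only the event steps receive a label.
toy = pd.merge(series, events[["step", "awake"]], on="step", how="left")

# Backfill: each unlabeled row takes the label of the next event.
toy["awake"] = toy["awake"].bfill()

# Rows after the final event have nothing to backfill from; mark them awake.
toy["awake"] = toy["awake"].fillna(1).astype(int)

print(toy["awake"].tolist())  # [1, 1, 1, 0, 0, 0, 1, 1]
```

Note that the event row itself carries the label of the period that just ended, so step 2 (the onset) is still marked awake and step 5 (the wakeup) is still marked asleep.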



In [4]:
def train_genny(series):
    # Load only the rows for this series ID from the Parquet file.
    train_series = pd.read_parquet("train_series.parquet", filters=[('series_id', '=', series)])
    train_events = pd.read_csv("train_events.csv").query('series_id == @series')
    train_events = train_events.dropna()
    train_events["step"] = train_events["step"].astype("int")
    # onset -> 1 and wakeup -> 0, so the backfill below labels rows by the next event.
    train_events["awake"] = train_events["event"].map({"onset": 1, "wakeup": 0})
    train = pd.merge(train_series, train_events[['step', 'awake']], on='step', how='left')
    train["awake"] = train["awake"].bfill()
    # Rows after the last event have no label to backfill from; mark them awake.
    train["awake"] = train["awake"].fillna(1).astype("int")
    return train

clean_data = []
for series_id in non_NaN_series:
    clean_data.append(train_genny(series_id))
    gc.collect()  # release intermediate frames between series

In [5]:
Zzzs_train = pd.concat(clean_data).reset_index(drop=True)
print(Zzzs_train["series_id"].nunique(), "individual sleep training series")

35 individual sleep training series


In [23]:
file_path = 'C:/temp/clean_zzz.csv'
Zzzs_train.to_csv(file_path, index=False)   