# Features Analysis

This notebook explores the features of the dataset and their relationship with the target variable. It aims to understand the existing features and to generate new ones that can improve the model. The new features are tested with a very simple random forest model to check their relevance to this classification task.

Inspired by https://www.kaggle.com/code/carlmcbrideellis/zzzs-random-forest-model-starter/notebook

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import groupby
import gc
import polars as pl
import seaborn as sns
from sklearn.model_selection import train_test_split
import pickle
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier


In [3]:
df_signals = pl.read_parquet("data/train_series.parquet")
df_events = pl.read_csv("data/train_events.csv")

df_combined = df_signals.join(df_events, on=['series_id', 'timestamp'], how='inner')

In [None]:
# open in write-binary mode ("wb") so the dataframe can be dumped to disk
with open("data/df_combined.pkl", "wb") as f:
   pickle.dump(df_combined, f)

In [None]:
with open("data/df_combined.pkl", "rb") as f:
   df_combined = pickle.load(f)

### Creation of new features based on z-angle

Possible features to be created from the z-angle:
- **diff_z_angle**: difference in z-angle between two consecutive frames
- **z_angle_mean_window**: mean of the z-angle over a window
- **z_angle_std_window**: standard deviation of the z-angle over a window
- **z_angle_min_window**: minimum value of the z-angle over a window
- **z_angle_max_window**: maximum value of the z-angle over a window

Each window spans 2 to 50 frames.
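To make the list above concrete, here is a minimal sketch of these features on toy data. The toy values, the two series ids, and the window length of 3 are assumptions chosen for illustration; only the column name `anglez` and the feature names come from this notebook.

```python
import pandas as pd

# Toy data: two hypothetical series with five anglez readings each.
toy = pd.DataFrame({
    "series_id": ["a"] * 5 + ["b"] * 5,
    "anglez": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 10.0, 10.0, 10.0, 10.0],
})

window = 3  # hypothetical window length, in frames

grouped = toy.groupby("series_id")["anglez"]

# Frame-to-frame difference, restarted at the beginning of each series.
toy["diff_z_angle"] = grouped.diff()

# Rolling statistics computed within each series, so that a window never
# mixes readings from two different recordings.
for stat in ["mean", "std", "min", "max"]:
    toy[f"z_angle_{stat}_window"] = grouped.transform(
        lambda s, stat=stat: getattr(s.rolling(window), stat)()
    )

print(toy)
```

The first `window - 1` rows of each series have no complete window, so their rolling values are NaN and need to be filled or dropped before training.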

In [87]:
def make_features_anglez(df, periods):
   """
   Compute rolling features for anglez, separately for each series

   Param:
   df: pandas dataframe with 'series_id' and 'anglez' columns
   periods: list of window lengths (in frames) to compute the features

   Return:
   df: dataframe with the new features
   """
   # group by series so that a window never spans two recordings
   grouped = df.groupby('series_id')['anglez']
   for p in periods:
      df['angle_z_mean_' + str(p)] = grouped.transform(lambda s: s.rolling(window=p).mean())
   return df

In [88]:
def make_features_enmo(df, periods):
   """
   Compute diff and rolling features for enmo.
   Windows run over the whole dataframe, not per series.
   """
   for p in periods:
      # leading NaNs left by diff/rolling are back-filled before the float16 cast
      df[f'enmo_diff_{p}'] = df['enmo'].diff(periods=p).bfill().astype('float16')
      df[f'enmo_rolling_mean_{p}'] = df['enmo'].rolling(p).mean().bfill().astype('float16')
      df[f'enmo_rolling_std_{p}'] = df['enmo'].rolling(p).std().bfill().astype('float16')
      df[f'enmo_rolling_min_{p}'] = df['enmo'].rolling(p).min().bfill().astype('float16')
      df[f'enmo_rolling_max_{p}'] = df['enmo'].rolling(p).max().bfill().astype('float16')
      df[f'enmo_rolling_median_{p}'] = df['enmo'].rolling(p).median().bfill().astype('float16')
      df[f'enmo_rolling_skew_{p}'] = df['enmo'].rolling(p).skew().bfill().astype('float16')
   return df

In [89]:
def make_features(df, periods=(20,)):
   # replace anglez values by their absolute value
   df['anglez'] = df['anglez'].abs()
   # parse timestamps; utc=True copes with the mixed UTC offsets in the data
   df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
   df = make_features_anglez(df, periods)
   df = make_features_enmo(df, periods)
   return df

In [91]:
df.head()

| Unnamed: 0 | series_id | night | event | step | timestamp |
|---|---|---|---|---|---|
| 0 | 038441c925bb | 1 | onset | 4992.0 | 2018-08-14T22:26:00-0400 |
| 1 | 038441c925bb | 1 | wakeup | 10932.0 | 2018-08-15T06:41:00-0400 |
| 2 | 038441c925bb | 2 | onset | 20244.0 | 2018-08-15T19:37:00-0400 |
| 3 | 038441c925bb | 2 | wakeup | 27492.0 | 2018-08-16T05:41:00-0400 |
| 4 | 038441c925bb | 3 | onset | 39996.0 | 2018-08-16T23:03:00-0400 |


In [90]:
periods = [10,20,50]
df = make_features(df, periods)

KeyError: 'anglez'

In [None]:
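# exclude the target ('event') and the identifier columns so the label does not leak into X;
# raw 'enmo' stays excluded as before
features = [col for col in df.columns if col not in ['series_id', 'timestamp', 'event', 'enmo']]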
features

In [None]:
X = df[features]
y = df["event"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# save some memory
del X
del y
gc.collect();

In [None]:
classifier = RandomForestClassifier(random_state=42,n_jobs=-1)

classifier.fit(X_train, y_train)

# save some memory
del X_train, y_train
gc.collect();

In [None]:
predict = classifier.predict(X_test)

In [None]:
print(classification_report(y_test,predict))
print(confusion_matrix(y_test,predict))
print(accuracy_score(y_test, predict))
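
The introduction says the engineered features are tested with a random forest to check their relevance. One common way to read relevance off a fitted forest is its impurity-based `feature_importances_` attribute. The sketch below is only an illustration: it uses synthetic data from `make_classification` and hypothetical feature names rather than this notebook's dataframe, so its numbers say nothing about the sleep dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the engineered features (names are hypothetical).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the forest fitted above, the same pattern would presumably be `pd.Series(classifier.feature_importances_, index=features)`. Impurity-based importances tend to favor high-cardinality features, so they are best treated as a rough ranking.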