# Notebook to perform basic data preprocessing and cleaning
Mainly used for missing-value detection, dataframe manipulation and indexing, and, where useful, writing new .csv files.

In [3]:
# Imports
import numpy as np
import pandas as pd
from local_paths import local_paths
from test_ing_algo.common.constants import keys

In [11]:
# Data loading
heart_rate_df = pd.read_csv(local_paths.HEART_RATE_CSV)
interruptions_df = pd.read_csv(local_paths.INTERRUPTIONS_CSV)
locations_df = pd.read_csv(local_paths.LOCATIONS_CSV)
onskin_df = pd.read_csv(local_paths.ONSKIN_CSV)
all_df = [heart_rate_df, interruptions_df, locations_df, onskin_df]

In [13]:
# Check basic feature information
for df in all_df:
    print(df.columns)

Index(['time', 'heart_rate', 'cnfs'], dtype='object')
Index(['time', 'any_motion', 'high_g'], dtype='object')
Index(['time', 'x', 'y'], dtype='object')
Index(['time', 'onskin'], dtype='object')


As stated in the problem, a time series is provided for each of the desired features.

In [14]:
# Compare dataframe shapes
for df in all_df:
    print(df.shape)

(76155, 3)
(19148, 3)
(55672, 3)
(19148, 2)


There is a clear mismatch in row counts between the features (from 19148 to 76155 rows). This may be caused by different sampling rates, different recording periods within the day, missing data, or a number of other possibilities.

Next steps:
- Get a basic feel of how the data is structured
- Search for missing values
- Bind the timeframes of each df
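
As a minimal sketch of the "search for missing values" step, the helper below counts missing entries per column. The name `missing_summary` and the toy dataframe are hypothetical and only illustrate the idea; in the notebook the helper would be applied to each dataframe in `all_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the loaded dataframes (values are made up).
toy_df = pd.DataFrame(
    {
        "time": ["2025-01-22 00:00:00", "2025-01-22 00:00:01", "2025-01-22 00:00:02"],
        "heart_rate": [60, np.nan, 59],
        "cnfs": [50, 50, np.nan],
    }
)


def missing_summary(df):
    """Return the count and share of missing values for each column."""
    counts = df.isna().sum()
    return pd.DataFrame({"n_missing": counts, "share": counts / len(df)})


summary = missing_summary(toy_df)
print(summary)
```

Note that `isna()` only catches real `NaN`/`NaT` entries. Sentinel values that encode "no reading" (if the devices use any) would need a separate check.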

In [18]:
for df in all_df:
    print(df.head(1))
    print(df.tail(1))
    print()

                      time  heart_rate  cnfs
0  2025-01-22 23:00:00.000          60    50
                          time  heart_rate  cnfs
76154  2025-01-21 23:00:03.000          59    50

                  time  any_motion  high_g
0  2025-01-22 23:00:00           0      10
                      time  any_motion  high_g
19147  2025-01-21 23:00:00           1      12

                      time  x  y
0  2025-01-22 23:00:00.000  0  0
                          time     x    y
55671  2025-01-21 23:00:03.000 -1068  283

                  time  onskin
0  2025-01-22 23:00:00       3
                      time  onskin
19147  2025-01-21 23:00:00       3



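All of the features start and end at approximately the same time, with valid values: the timestamps lie between 2025-01-21 23:00 and 2025-01-22 23:00. Note that the first row of each file holds the latest timestamp, so the rows are not in ascending time order. If the timestamps are stored in UTC, this range is exactly one calendar day in UTC+1, so one can conclude that, as stated in the problem, the data is intended to cover January 22nd in the UTC+1 timezone (e.g. the "Europe/Madrid" timezone).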
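
The timezone reasoning can be checked with a short sketch. It assumes the CSV timestamps are naive values recorded in UTC, which the data itself does not confirm; the three sample timestamps are made up to sit at the edges and in the middle of the observed range.

```python
import pandas as pd

# Hypothetical naive timestamps, ASSUMED to be recorded in UTC.
naive_idx = pd.to_datetime(
    ["2025-01-21 23:00:03", "2025-01-22 12:00:00", "2025-01-22 22:59:59"]
)

# Attach the assumed source timezone, then convert to local time.
local_idx = naive_idx.tz_localize("UTC").tz_convert("Europe/Madrid")

# Collect the distinct local calendar dates covered by the samples.
local_dates = sorted({ts.date().isoformat() for ts in local_idx})
print(local_dates)
```

Under this assumption all three samples fall on the same local date, 2025-01-22, because Madrid is on UTC+1 in January.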

One can assume that the differing row counts per feature stem mainly from different sampling frequencies or from missing data. A simple plot of each time series may help clarify which is the case.
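
Alongside a plot, the sampling-frequency hypothesis can be tested numerically. The sketch below estimates the typical gap between consecutive timestamps; `median_sampling_interval` is a hypothetical helper, the column name `"time"` is taken from the printed columns, and the two toy dataframes are made up.

```python
import pandas as pd


def median_sampling_interval(df, time_col="time"):
    """Median gap between consecutive timestamps, after sorting them."""
    times = pd.to_datetime(df[time_col]).sort_values()
    return times.diff().median()


# Toy frames (made up): one sampled every second, one every four seconds.
fast_df = pd.DataFrame(
    {"time": pd.date_range("2025-01-22", periods=10, freq=pd.Timedelta(seconds=1))}
)
slow_df = pd.DataFrame(
    {"time": pd.date_range("2025-01-22", periods=5, freq=pd.Timedelta(seconds=4))}
)

fast_step = median_sampling_interval(fast_df)
slow_step = median_sampling_interval(slow_df)
print(fast_step, slow_step)
```

The median is used rather than the mean so that a few long gaps caused by missing data do not distort the estimate. Comparing the median gap with the maximum gap is one way to tell a low sampling rate apart from dropped samples.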

In [None]:
# Set date as index for all dataframes and order them
hr_df = heart_rate_df.set_index(keys.TIME)
hr_df.index = pd.to_datetime(hr_df.index)
hr_df = hr_df.sort_index()
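
Since the same three steps apply to every dataframe, they could be wrapped in a small helper. This is only a sketch: `index_by_time` is a hypothetical name, and it assumes `keys.TIME` resolves to the `"time"` column shown in the printed columns.

```python
import pandas as pd


def index_by_time(df, time_col="time"):
    """Set the time column as a DatetimeIndex and sort chronologically."""
    out = df.set_index(time_col)
    out.index = pd.to_datetime(out.index)
    return out.sort_index()


# Toy frame (made up) with rows in descending time order.
toy_df = pd.DataFrame(
    {"time": ["2025-01-22 23:00:00", "2025-01-21 23:00:03"], "onskin": [3, 3]}
)
toy_indexed = index_by_time(toy_df)
print(toy_indexed.index.is_monotonic_increasing)
```

In the notebook this could be applied as `[index_by_time(df, keys.TIME) for df in all_df]`, leaving the original dataframes untouched.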
