# Match the HBN Actigraphy Dataset with the train series
This notebook shows the code we used to determine whether the HBN actigraphy dataset we found is the same as the training dataset.

In [51]:
import pandas as pd
import os

In [47]:
# Retrieve the training data
train_events: pd.DataFrame = pd.read_csv("../../data/raw/train_events.csv")
train_series: pd.DataFrame = pd.read_parquet("../../data/raw/train_series.parquet")
train_series["anglez"] = train_series["anglez"].round(3)

def get_series_names(batch: str) -> list[str]:
    return list(map(lambda x: x.split('_')[-1], os.listdir("../../data/processed/" + batch)))

def preprocess(batch: str, file: str) -> pd.DataFrame:
    try:
        hbn_data: pd.DataFrame = pd.read_csv("../../data/processed/" + batch + "/output_" + file + "/meta/csv/" + file + ".gt3x.RData.csv")
    except FileNotFoundError: # Some directories are empty, so add this to not crash
        hbn_data = pd.DataFrame()
        hbn_data['anglez'] = []
        hbn_data['enmo'] = []
        hbn_data['timestamp'] = []
        return hbn_data
    hbn_data['anglez'] = hbn_data['anglez'].astype('float32')
    hbn_data['enmo'] = hbn_data['ENMO'].astype('float32').round(3)
    hbn_data.drop("ENMO", axis=1, inplace=True)
    return hbn_data

# Retrieve the HBN Actigraphy Dataset
hbn_series_directories: list[str] = ["GGIR_processing", "Batch_2", "Batch_3", "Batch_4"]

# Store the list of all files together with the name of the batch
hbn_series_names: list[tuple[str, list[str]]] = [(batch, get_series_names(batch)) for batch in hbn_series_directories]

# Preprocess the data
hbn_series: list[tuple[str, list[pd.DataFrame]]] = [(batch, [preprocess(batch, file) for file in files]) for batch, files in hbn_series_names]

# Match the data from the train series with the HBN series
def match(hbn_series: pd.DataFrame) -> pd.DataFrame:
    merged_df: pd.DataFrame = pd.merge(train_series, hbn_series, on=['timestamp', 'anglez'], suffixes=('_train', '_hbn'), how='inner')
    merged_count: pd.DataFrame = pd.merge(train_series['series_id'].value_counts(), merged_df['series_id'].value_counts(), on=['series_id'], suffixes=('_train', '_hbn'), how='inner')
    result_df: pd.DataFrame = merged_count[merged_count['count_train'] == merged_count['count_hbn']]
    return result_df

print(hbn_series)

[('GGIR_processing', [                       timestamp     anglez   enmo
0       2018-12-19T15:45:00-0500 -85.633598  0.000
1       2018-12-19T15:45:05-0500 -85.633598  0.000
2       2018-12-19T15:45:10-0500 -85.633598  0.000
3       2018-12-19T15:45:15-0500 -85.633598  0.000
4       2018-12-19T15:45:20-0500 -85.633598  0.000
...                          ...        ...    ...
513535  2019-01-18T08:59:35-0500   3.631800  0.048
513536  2019-01-18T08:59:40-0500   0.452800  0.061
513537  2019-01-18T08:59:45-0500  -3.636300  0.108
513538  2019-01-18T08:59:50-0500  -0.324000  0.164
513539  2019-01-18T08:59:55-0500   6.555500  0.126

[513540 rows x 3 columns],                        timestamp     anglez   enmo
0       2019-02-27T15:45:00-0500 -26.057199  0.664
1       2019-02-27T15:45:05-0500 -46.754398  0.946
2       2019-02-27T15:45:10-0500 -48.178200  0.031
3       2019-02-27T15:45:15-0500 -43.887100  0.056
4       2019-02-27T15:45:20-0500 -56.901299  0.063
...                          ...

In [None]:
# Match all HBN series with the train series
matches = [(batch, list(map(match, files))) for batch, files in hbn_series]

# Print the results
for batch in matches:
    print(batch)

Unfortunately, as you can see, there are no matches. It is possible that the HBN actigraphy dataset was preprocessed differently from the training data.
We could try to redo the preprocessing ourselves so that both datasets are processed in the same way, but this would take a lot of time:
we would not only need to understand and write our own R code, but also find out which configuration was used to preprocess the data.

It is, of course, also possible that I made a mistake in the code above. After all, I did find earlier that many of the timestamps match up.
The data is also from about the same time range (2017-2019), so it may be worth looking into this further.
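
One thing that may be worth checking first: `match` joins on exact `anglez` values, while `train_series["anglez"]` is rounded to 3 decimals and the HBN `anglez` is only cast to `float32` without rounding. Exact float equality could therefore fail even for identical recordings. Below is a minimal sketch of a more tolerant check on toy data: join on `timestamp` alone and compare `anglez` with `np.isclose`. The function name `match_with_tolerance`, the tolerance `atol=0.01`, and the toy frames are assumptions for illustration. They are not part of the original analysis, and a suitable tolerance would have to be chosen by looking at the real data.

```python
import numpy as np
import pandas as pd


def match_with_tolerance(train: pd.DataFrame, hbn: pd.DataFrame, atol: float = 0.01) -> pd.DataFrame:
    """Hypothetical helper: per train series, count the rows that share a
    timestamp with `hbn` and the rows that also agree on anglez within `atol`."""
    # Join on timestamp only, so float precision cannot block the join.
    merged = pd.merge(train, hbn, on="timestamp", suffixes=("_train", "_hbn"), how="inner")
    # Compare the angles with an absolute tolerance instead of exact equality.
    merged["close"] = np.isclose(merged["anglez_train"], merged["anglez_hbn"], atol=atol)
    summary = merged.groupby("series_id").agg(
        shared_timestamps=("close", "size"),
        close_anglez=("close", "sum"),
    )
    # Total number of rows per series in the train data, aligned on series_id.
    summary["train_rows"] = train["series_id"].value_counts().reindex(summary.index)
    summary["fraction_matched"] = summary["close_anglez"] / summary["train_rows"]
    return summary.sort_values("fraction_matched", ascending=False)


# Toy data (made up): series "a" agrees with the HBN frame up to rounding, series "b" does not.
train = pd.DataFrame({
    "series_id": ["a", "a", "b"],
    "timestamp": ["t0", "t1", "t0"],
    "anglez": [1.234, 5.679, 40.0],
})
hbn = pd.DataFrame({"timestamp": ["t0", "t1"], "anglez": [1.2341, 5.6789]})

result = match_with_tolerance(train, hbn)
print(result)
```

A `fraction_matched` close to 1 for some series would suggest that the recordings are the same and that only the rounding differs. Low values everywhere would point towards a genuinely different preprocessing.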