# Initial Analysis of File Structure
author: Spencer Weston 

In [58]:
import seedir as sd
import os 
import json
import pandas as pd

## Directory Structure 

Each numbered directory name is a randomly generated ID for one subject. The `AndroidAPS Uploader` directory contains the data uploaded through AndroidAPS (Android Artificial Pancreas System). I'm not sure what differences, if any, exist in the AndroidAPS data.

*Open Question 1: What differences, if any, exist between AndroidAPS and other data?*
* File structure is very different! 

In [14]:
# My local data is stored in the same directory as the repository root
dirs = os.listdir("../../n=183_OpenAPS_Data_Commons_August_2021_UNZIPPED")
androidAPS = os.listdir("../../n=183_OpenAPS_Data_Commons_August_2021_UNZIPPED/AndroidAPS Uploader")
print("normal directories: ", dirs[:10], "...")
print("androidAPS directories: ", androidAPS[:10], "...")
print(f"normal directory count: {len(dirs)-1}, androidAPS directory count: {len(androidAPS)}, total count: {len(dirs) + len(androidAPS) -1}")

normal directories:  ['00221634', '00309157', '00897741', '01352464', '02033176', '02199852', '04762925', '05274556', '07886752', '12689381'] ...
androidAPS directories:  ['01177138', '01739655', '01949240', '10540336', '20777653', '23340371', '23863411', '25401109', '26691577', '27553507'] ...
normal directory count: 145, androidAPS directory count: 38, total count: 183
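
The `len(dirs) - 1` arithmetic above relies on `AndroidAPS Uploader` being the only non-subject entry in the root. As a minimal sketch (the helper name `list_subject_dirs` and its `aaps_name` parameter are hypothetical, not part of the original notebook), the two groups can be separated explicitly so that stray files don't skew the counts:

```python
import os


def list_subject_dirs(root, aaps_name="AndroidAPS Uploader"):
    """Return (normal_ids, aaps_ids) as sorted lists of subdirectory names.

    Assumes the layout shown above: subject folders sit directly under
    `root`, and AndroidAPS subjects sit under `root/aaps_name`.
    """
    # Keep only real directories, and exclude the AndroidAPS container itself
    normal = sorted(
        name
        for name in os.listdir(root)
        if name != aaps_name and os.path.isdir(os.path.join(root, name))
    )

    aaps = []
    aaps_root = os.path.join(root, aaps_name)
    if os.path.isdir(aaps_root):
        aaps = sorted(
            name
            for name in os.listdir(aaps_root)
            if os.path.isdir(os.path.join(aaps_root, name))
        )
    return normal, aaps
```

With the data root used above, `normal, aaps = list_subject_dirs("../../n=183_OpenAPS_Data_Commons_August_2021_UNZIPPED")` followed by `len(normal)` and `len(aaps)` should reproduce the two counts, provided the folder layout matches this assumption.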


Here's the file structure of a normal subject's data upload:

In [7]:
sd.seedir("../../n=183_OpenAPS_Data_Commons_August_2021_UNZIPPED/00221634", style='emoji')

📁 00221634/
└─📁 direct-sharing-31/
  ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05.json
  ├─📁 00221634_devicestatus_2018-03-01_to_2018-08-05_csv/
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_aa.csv
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ab.csv
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ac.csv
  │ └─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ad.csv
  ├─📁 00221634_devicestatus_2018-03-01_to_2018-08-05_parts/
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_aa
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_aa.json
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ab
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ab.json
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ac
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ac.json
  │ ├─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ad
  │ └─📄 00221634_devicestatus_2018-03-01_to_2018-08-05_ad.json
  ├─📄 00221634_entries_2018-03-01_to_2018-08-05.json
  ├─

Here's the file structure of an androidAPS subject:

In [15]:
sd.seedir("../../n=183_OpenAPS_Data_Commons_August_2021_UNZIPPED/AndroidAPS Uploader/01177138", style='emoji')

📁 01177138/
└─📁 direct-sharing-396/
  ├─📁 upload-num183-ver1-date20210730T151214-appid6b0354b7367e4decb2afd261ff3b563a/
  │ ├─📄 ApplicationInfo.csv
  │ ├─📄 ApplicationInfo.json
  │ ├─📄 APSData.csv
  │ ├─📄 APSData.json
  │ ├─📄 BgReadings.csv
  │ ├─📄 BgReadings.json
  │ ├─📄 CareportalEvents.csv
  │ ├─📄 CareportalEvents.json
  │ ├─📄 DeviceInfo.csv
  │ ├─📄 DeviceInfo.json
  │ ├─📄 DisplayInfo.csv
  │ ├─📄 DisplayInfo.json
  │ ├─📄 Preferences.csv
  │ ├─📄 Preferences.json
  │ ├─📄 TemporaryBasals.csv
  │ ├─📄 TemporaryBasals.json
  │ ├─📄 Treatments.csv
  │ ├─📄 Treatments.json
  │ ├─📄 UploadInfo.csv
  │ └─📄 UploadInfo.json
  ├─📄 upload-num183-ver1-date20210730T151214-appid6b0354b7367e4decb2afd261ff3b563a.zip
  ├─📁 upload-num184-ver1-date20210731T151225-appid6b0354b7367e4decb2afd261ff3b563a/
  │ ├─📄 ApplicationInfo.csv
  │ ├─📄 ApplicationInfo.json
  │ ├─📄 APSData.csv
  │ ├─📄 APSData.json
  │ ├─📄 BgReadings.csv
  │ ├─📄 BgReadings.json
  │ ├─📄 CareportalEvents.csv
  │ ├─📄 CareportalEvents.json
  │ ├

For both androidAPS and normal uploads, we have a plethora of files. The challenge is figuring out which are necessary for analysis and which can be discarded. [This repo](https://github.com/danamlewis/OpenHumansDataTools) contains many data processing scripts. Reading through the scripts, it appears some of these have already been applied to the data.
*  [unzip-split-csvify-OpenHumans-data.sh](https://github.com/danamlewis/OpenHumansDataTools/blob/master/bin/unzip-split-csvify-OpenHumans-data.sh)
    * On *normal uploads*, this script unzips files, converts to json, and finally converts to .csv
*  [unzip-csvify-AAPS-OpenHumans-data.sh](https://github.com/danamlewis/OpenHumansDataTools/blob/master/bin/unzip-csvify-AAPS-OpenHumans-data.sh)
    * On *Android APS uploads*, this script unzips files, converts to json, and finally converts to .csv
    
Given the presence of .json and .csv files, these two scripts seem to have been run. 
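
One way to back up this impression is to tally file extensions under a subject folder. The following is only a sketch; `count_extensions` is a hypothetical helper, and it only looks at the last extension (so `x.json.gz` counts as `.gz`):

```python
import os
from collections import Counter


def count_extensions(root):
    """Count files under `root` (recursively) by their final extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext keeps only the last suffix; files without one are
            # grouped under the label "(none)"
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts
```

Running this on a normal subject folder and on an AndroidAPS subject folder should show `.json` and `.csv` entries if both scripts were applied; any `(none)` entries would correspond to extensionless files such as those in the `_parts` folders.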

## Normal CSV's

Here, one of the initial folders is explored in more detail. First, we can ignore the `.gz` files, since they have already been unzipped. That leaves the `devicestatus`, `entries`, `profile`, and `treatments` record types, each of which has a `.json` file plus folders ending in `_csv` and `_parts` (`entries` is the exception: its CSV folder ends in `.json_csv` and it has no `_parts` folder). We'll go through each of `devicestatus`, `entries`, `profile`, and `treatments` to look at their data. Here, we look at subject `00221634`. 

In [46]:
subject_id = "00221634"
folder = f"..\\..\\n=183_OpenAPS_Data_Commons_August_2021_UNZIPPED\\{subject_id}\\direct-sharing-31"

In [47]:
os.listdir(folder)

['00221634_devicestatus_2018-03-01_to_2018-08-05.json',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_csv',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_parts',
 '00221634_entries_2018-03-01_to_2018-08-05.json',
 '00221634_entries_2018-03-01_to_2018-08-05.json_csv',
 '00221634_profile_2018-03-01_to_2018-08-05.json',
 '00221634_profile_2018-03-01_to_2018-08-05_csv',
 '00221634_profile_2018-03-01_to_2018-08-05_parts',
 '00221634_treatments_2018-03-01_to_2018-08-05.json',
 '00221634_treatments_2018-03-01_to_2018-08-05_csv',
 '00221634_treatments_2018-03-01_to_2018-08-05_parts',
 'devicestatus_2018-03-01_to_2018-08-05.json.gz',
 'entries_2018-03-01_to_2018-08-05.json.gz',
 'profile_2018-03-01_to_2018-08-05.json.gz',
 'treatments_2018-03-01_to_2018-08-05.json.gz']
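
The names in this listing follow a regular pattern: an optional subject ID, a record type, a date range, and a suffix. As a sketch, assuming every normal upload follows the same pattern as this one subject (which has not been verified here), a regular expression can split the names into parts. `NAME_PATTERN` and `parse_upload_name` are hypothetical names introduced for illustration:

```python
import re

# Pattern inferred from the single listing above; other subjects may differ.
NAME_PATTERN = re.compile(
    r"^(?:(?P<subject>\d+)_)?"  # optional subject ID prefix
    r"(?P<record>devicestatus|entries|profile|treatments)_"
    r"(?P<start>\d{4}-\d{2}-\d{2})_to_(?P<end>\d{4}-\d{2}-\d{2})"
    r"(?P<suffix>.*)$"  # e.g. ".json", "_csv", "_parts", ".json.gz"
)


def parse_upload_name(name):
    """Split a file or folder name into its parts, or return None."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None
```

For example, `[parse_upload_name(n) for n in os.listdir(folder)]` would make it easy to group entries by `record` and spot irregular suffixes like the `.json_csv` folder for `entries`.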

### Device Status

In [116]:
# File locations
device_status = "00221634_devicestatus_2018-03-01_to_2018-08-05"
json_file = f"{folder}\\{device_status}.json"
parts_fold = f"{folder}\\{device_status}_parts"
csv_fold = f"{folder}\\{device_status}_csv"

**JSON**

In [64]:
with open(json_file) as f:
    raw_json = json.load(f)
# raw_json is a very long file, so it isn't displayed in full here

In [79]:
print("Top level type:", type(raw_json))
print("Top level element count:", len(raw_json))
print()
print("First element:", raw_json[0])
print()
print("Second element:", raw_json[1])
print()
print("Top level keys:", raw_json[0].keys())
print("Pump keys:", raw_json[0]['pump'].keys())
print("Pump Extended keys:", raw_json[0]['pump']['extended'].keys())

Top level type: <class 'list'>
Top level element count: 53877

First element: {'NSCLIENT_ID': 1533427185282, '_id': '5b663df1f64f437f0a9d94db', 'created_at': '2018-08-04T23:59:45Z', 'pump': {'extended': {'ActiveProfile': 'MM - v4', 'PumpIOB': 1.37, 'BaseBasalRate': 0.52, 'LastBolus': '05.08.2018 01:01:00', 'Version': '2.0c-dev-04cba772b-2018.08.01-20:34', 'LastBolusAmount': 1}, 'reservoir': 108, 'clock': '2018-08-04T23:59:45Z', 'status': {'timestamp': '2018-08-04T23:59:40Z', 'status': 'normal'}, 'battery': {'percent': 75}}, 'device': 'G6QS5C', 'uploaderBattery': 100}

Second element: {'NSCLIENT_ID': 1533425993154, '_id': '5b663949f64f437f0a9d94d5', 'created_at': '2018-08-04T23:39:53Z', 'pump': {'extended': {'ActiveProfile': 'MM - v4', 'PumpIOB': 1.77, 'BaseBasalRate': 0.52, 'LastBolus': '05.08.2018 01:01:00', 'Version': '2.0c-dev-04cba772b-2018.08.01-20:34', 'LastBolusAmount': 1}, 'reservoir': 108, 'clock': '2018-08-04T23:39:53Z', 'status': {'timestamp': '2018-08-04T23:39:53Z', 'status

I could keep printing out the full JSON structure, but it isn't too complex: the element printouts above show all the data for each element. We'll come back to this data in a moment to make sure the downstream files match it.
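
Only the first two elements were inspected, so other elements may carry keys that these two lack. A minimal sketch for surveying the whole list follows; the helper names `iter_key_paths` and `key_path_counts` are hypothetical, and the slash-joined path format is just a convenient choice for display:

```python
from collections import Counter


def iter_key_paths(obj, prefix=""):
    """Yield a slash-delimited path for every leaf value in nested data."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            yield from iter_key_paths(value, path)
    elif isinstance(obj, list):
        # List positions become path segments, e.g. "d/0", "d/1"
        for index, value in enumerate(obj):
            path = f"{prefix}/{index}" if prefix else str(index)
            yield from iter_key_paths(value, path)
    else:
        yield prefix


def key_path_counts(records):
    """Count in how many records each key path appears."""
    counts = Counter()
    for record in records:
        counts.update(set(iter_key_paths(record)))
    return counts
```

Calling `key_path_counts(raw_json).most_common()` would list every key path with the number of elements that contain it, which shows at a glance which fields are always present and which are rare.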

**Parts**

Here's what's in the parts folder. The OpenHumansDataTools page indicates that the files are split to keep each file at a reasonable size, which is the source of the ascending `aa`, `ab`, `ac`, etc. series of file endings. 

In [92]:
aa_fold = f"{parts_fold}\\00221634_devicestatus_2018-03-01_to_2018-08-05_aa"
aa_json = f"{parts_fold}\\00221634_devicestatus_2018-03-01_to_2018-08-05_aa.json"
os.listdir(parts_fold)

['00221634_devicestatus_2018-03-01_to_2018-08-05_aa',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_aa.json',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ab',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ab.json',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ac',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ac.json',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ad',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ad.json']

Manually checking the files with no file extension shows that they hold compact, line-delimited JSON: every element of the full JSON is written on a single line of text, so the file as a whole does not conform to JSON syntax. Let's compare the `aa` files.

In [94]:
with open(aa_fold) as f:
    # Read only the first line and save it for later comparison
    first_aa_line = f.readline()
print(first_aa_line)
with open(aa_json) as f:
    aa_json_obj = json.load(f)
print(aa_json_obj[0])

{"NSCLIENT_ID":1533427185282,"_id":"5b663df1f64f437f0a9d94db","created_at":"2018-08-04T23:59:45Z","pump":{"extended":{"ActiveProfile":"MM - v4","PumpIOB":1.37,"BaseBasalRate":0.52,"LastBolus":"05.08.2018 01:01:00","Version":"2.0c-dev-04cba772b-2018.08.01-20:34","LastBolusAmount":1},"reservoir":108,"clock":"2018-08-04T23:59:45Z","status":{"timestamp":"2018-08-04T23:59:40Z","status":"normal"},"battery":{"percent":75}},"device":"G6QS5C","uploaderBattery":100}

{'NSCLIENT_ID': 1533427185282, '_id': '5b663df1f64f437f0a9d94db', 'created_at': '2018-08-04T23:59:45Z', 'pump': {'extended': {'ActiveProfile': 'MM - v4', 'PumpIOB': 1.37, 'BaseBasalRate': 0.52, 'LastBolus': '05.08.2018 01:01:00', 'Version': '2.0c-dev-04cba772b-2018.08.01-20:34', 'LastBolusAmount': 1}, 'reservoir': 108, 'clock': '2018-08-04T23:59:45Z', 'status': {'timestamp': '2018-08-04T23:59:40Z', 'status': 'normal'}, 'battery': {'percent': 75}}, 'device': 'G6QS5C', 'uploaderBattery': 100}


Looking at the above objects, we see that they're equivalent. 
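
The comparison above covers only the first element. As a sketch of how it could be extended to a whole part, the extensionless file can be parsed line by line and compared with the matching `.json` file. The helper names are hypothetical, and UTF-8 encoding is an assumption about these files:

```python
import json


def read_json_lines(path):
    """Parse a file that holds one JSON object per line."""
    records = []
    with open(path, encoding="utf-8") as f:  # encoding is an assumption
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def parts_match(lines_path, json_path):
    """Return True if both files hold the same records in the same order."""
    with open(json_path, encoding="utf-8") as f:
        array_records = json.load(f)
    return read_json_lines(lines_path) == array_records
```

With the variables defined earlier, `parts_match(aa_fold, aa_json)` would check the entire `aa` part rather than just its first element.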

**CSV**

Now, let's check the csv's. Like above, we have a series of .csv's with endings such as `ab`. These appear to mirror the parts from above.

In [95]:
os.listdir(csv_fold)

['00221634_devicestatus_2018-03-01_to_2018-08-05_aa.csv',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ab.csv',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ac.csv',
 '00221634_devicestatus_2018-03-01_to_2018-08-05_ad.csv']
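
If the CSV parts really are consecutive slices of one dataset, they can be stacked back into a single frame. This is a sketch under that assumption; `load_csv_parts` is a hypothetical helper, and sorting by file name is assumed to restore the `aa`, `ab`, `ac`, ... order:

```python
import glob
import os

import pandas as pd


def load_csv_parts(csv_folder):
    """Read every .csv file in `csv_folder` and stack them into one frame."""
    # glob.escape guards against special characters in the folder path
    pattern = os.path.join(glob.escape(csv_folder), "*.csv")
    paths = sorted(glob.glob(pattern))
    # low_memory=False reads each file in one pass so pandas infers one
    # dtype per column instead of guessing chunk by chunk
    frames = [pd.read_csv(path, low_memory=False) for path in paths]
    # Columns missing from a part are filled with NaN; raises ValueError
    # if the folder contains no .csv files
    return pd.concat(frames, ignore_index=True, sort=False)
```

For instance, `load_csv_parts(csv_fold)` would give one frame whose row count could then be compared against `len(raw_json)`.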

In [96]:
df = pd.read_csv(f"{csv_fold}\\00221634_devicestatus_2018-03-01_to_2018-08-05_aa.csv")



In [97]:
df.head()

Unnamed: 0,NSCLIENT_ID,_id,created_at,pump/extended/ActiveProfile,pump/extended/PumpIOB,pump/extended/BaseBasalRate,pump/extended/LastBolus,pump/extended/Version,pump/extended/LastBolusAmount,pump/reservoir,...,openaps/enacted/predBGs/ZT/44,openaps/enacted/predBGs/ZT/45,openaps/enacted/predBGs/ZT/46,openaps/enacted/predBGs/ZT/47,openaps/enacted/predBGs/COB/45,openaps/enacted/predBGs/COB/46,openaps/enacted/predBGs/UAM/47,openaps/enacted/predBGs/COB/47,openaps/suggested/carbsReq,openaps/enacted/carbsReq
0,1533427000000.0,5b663df1f64f437f0a9d94db,2018-08-04T23:59:45Z,MM - v4,1.37,0.52,05.08.2018 01:01:00,2.0c-dev-04cba772b-2018.08.01-20:34,1.0,108.0,...,,,,,,,,,,
1,1533426000000.0,5b663949f64f437f0a9d94d5,2018-08-04T23:39:53Z,MM - v4,1.77,0.52,05.08.2018 01:01:00,2.0c-dev-04cba772b-2018.08.01-20:34,1.0,108.0,...,,,,,,,,,,
2,1533425000000.0,5b663496f64f437f0a9d94d0,2018-08-04T23:19:50Z,MM - v4,2.55,0.52,05.08.2018 01:01:00,2.0c-dev-04cba772b-2018.08.01-20:34,1.0,108.0,...,,,,,,,,,,
3,1533424000000.0,5b663049f64f437f0a9d94cb,2018-08-04T23:01:09Z,MM - v4,2.55,0.52,05.08.2018 01:01:00,2.0c-dev-04cba772b-2018.08.01-20:34,1.0,108.0,...,,,,,,,,,,
4,1533424000000.0,5b663011f64f437f0a9d94c9,2018-08-04T23:00:33Z,MM - v4,1.57,0.52,05.08.2018 00:10:00,2.0c-dev-04cba772b-2018.08.01-20:34,1.5,109.0,...,,,,,,,,,,


**Equivalence check**

Now, let's check that all the data files we've seen to this point hold equivalent data. Systematically doing this for all rows would be a significant task, so only the first elements will be compared.

In [113]:
print("Raw json: \n", raw_json[0])
print()
print("Json parts \n", aa_json_obj[0])
print()
print("Compressed json file \n", first_aa_line)
print()
print("csv \n", df.iloc[0,0:20].to_dict())

Raw json: 
 {'NSCLIENT_ID': 1533427185282, '_id': '5b663df1f64f437f0a9d94db', 'created_at': '2018-08-04T23:59:45Z', 'pump': {'extended': {'ActiveProfile': 'MM - v4', 'PumpIOB': 1.37, 'BaseBasalRate': 0.52, 'LastBolus': '05.08.2018 01:01:00', 'Version': '2.0c-dev-04cba772b-2018.08.01-20:34', 'LastBolusAmount': 1}, 'reservoir': 108, 'clock': '2018-08-04T23:59:45Z', 'status': {'timestamp': '2018-08-04T23:59:40Z', 'status': 'normal'}, 'battery': {'percent': 75}}, 'device': 'G6QS5C', 'uploaderBattery': 100}

Json parts 
 {'NSCLIENT_ID': 1533427185282, '_id': '5b663df1f64f437f0a9d94db', 'created_at': '2018-08-04T23:59:45Z', 'pump': {'extended': {'ActiveProfile': 'MM - v4', 'PumpIOB': 1.37, 'BaseBasalRate': 0.52, 'LastBolus': '05.08.2018 01:01:00', 'Version': '2.0c-dev-04cba772b-2018.08.01-20:34', 'LastBolusAmount': 1}, 'reservoir': 108, 'clock': '2018-08-04T23:59:45Z', 'status': {'timestamp': '2018-08-04T23:59:40Z', 'status': 'normal'}, 'battery': {'percent': 75}}, 'device': 'G6QS5C', 'uploa

In [115]:
len(df.columns)

445

The CSV takes a different format than the prior files: nested JSON keys are flattened into slash-delimited column names such as `pump/extended/ActiveProfile`. However, it contains the same information. Note that the CSV has 445 columns. Most of these appear to be mostly null, but this will need to be analyzed further. If we look at the first ~20 columns (roughly the number of elements in the JSON records), we see that equivalent data is stored. Note that there are some ambiguities. For example, the column `uploaderBattery` has the correct value of 100 while `uploader/battery` is null. 
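
The eyeball comparison could be turned into a repeatable check by flattening a JSON record into the same slash-delimited keys and comparing it against a CSV row. The sketch below is one way to do that, not the method used by the OpenHumansDataTools scripts. The helper names are hypothetical, and the flattening rule (list positions as path segments) is inferred from column names such as `openaps/enacted/predBGs/ZT/44`:

```python
import math


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"a/b/0": value} form."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = f"{prefix}/{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def is_missing(value):
    """True for None and for float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def row_mismatches(record, row):
    """Return {column: (json_value, csv_value)} where the two disagree."""
    mismatches = {}
    for column, expected in flatten(record).items():
        actual = row.get(column)
        if is_missing(actual):
            mismatches[column] = (expected, None)
        elif isinstance(expected, (int, float)) and not isinstance(expected, bool):
            # The CSV stores numbers as floats, so compare numerically
            try:
                same = math.isclose(float(actual), float(expected))
            except (TypeError, ValueError):
                same = False
            if not same:
                mismatches[column] = (expected, actual)
        elif str(actual) != str(expected):
            mismatches[column] = (expected, actual)
    return mismatches
```

As a usage example, `row_mismatches(raw_json[0], df.iloc[0].to_dict())` returns an empty dict when every JSON value appears in the matching CSV column. Looping over `zip(raw_json, df.to_dict("records"))` would extend the check to more rows, assuming the CSV rows keep the same order as the JSON elements.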

The main conclusion here is that the .csv files hold all of the pertinent information for the normal file uploads. This means much of the data we have can be di

## Android APS models