# Demo notebook for analyzing application data

## Introduction

Application data refers to the information about which apps are open at a certain time. These data can reveal important information about people's circadian rhythm, social patterns, and activity. Application data is event data: it cannot be sampled at a regular frequency, so we only have information about the events that occurred. There are two main issues with application data: (1) missing data detection, and (2) privacy concerns.

Regarding missing data detection, we may never know if all events were detected and reported. Unfortunately, there is little we can do about this. Nevertheless, we can take into account some factors that may interfere with the correct detection of all events (e.g. when the phone's battery is depleted). Therefore, to correctly process application data, we need to take into account other information like the battery status of the phone.
Regarding the privacy concerns, application names can reveal too much about a subject, for example, when the subject uses an uncommon app. Consequently, we need to try anonymizing the data by grouping the apps. `niimpy` provides a map with some of the common apps.

To address both of these issues, `niimpy` includes the function `extract_features_app` to clean, downsample, and extract features from application data while taking into account factors like the battery level and naming groups. This function employs other functions to extract the following features:

- `app_count`: number of times an app group has been used 
- `app_duration`: how long an app group has been used

In addition, the app module has one internal function that helps classify the apps into groups.

In the following, we will analyze the sample application data provided by `niimpy` to illustrate how these features are extracted.

## Read data

In [1]:
import niimpy
import niimpy.preprocessing.application as app
import warnings
warnings.filterwarnings("ignore")

In [2]:
data = niimpy.read_csv('/m/cs/scratch/networks/trianaa1/Paper3/niimpy/niimpy/sampledata/singleuser_AwareApplicationNotifications.csv', tz='Europe/Helsinki')
data.shape

(132, 6)

There are 132 datapoints with 6 columns in the dataset. Let us have a quick look at the data:

In [3]:
data.head()

| | user | device | time | application_name | package_name | datetime |
|---|---|---|---|---|---|---|
| 2019-08-05 14:02:51.009999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1565003000.0 | Android System | android | 2019-08-05 14:02:51.009999872+03:00 |
| 2019-08-05 14:02:58.009999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1565003000.0 | Android System | android | 2019-08-05 14:02:58.009999872+03:00 |
| 2019-08-05 14:03:17.009999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1565003000.0 | Google Play Music | com.google.android.music | 2019-08-05 14:03:17.009999872+03:00 |
| 2019-08-05 14:02:55.009999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1565003000.0 | Google Play Music | com.google.android.music | 2019-08-05 14:02:55.009999872+03:00 |
| 2019-08-05 14:03:31.009999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1565003000.0 | Gmail | com.google.android.gm | 2019-08-05 14:03:31.009999872+03:00 |


The dataframe seems to be complete. It is indexed by timestamps, and it has a column (*application_name*) naming the application that triggered each event.

#### A few words on missing data
Missing data is difficult to detect in application data. Firstly, this sensor is triggered by events and not sampled at a fixed frequency. Secondly, differences between phones, operating systems, and settings affect how easily app events are detected. Thirdly, events not related to the application sensor may affect its behavior, e.g. the battery running out. Unfortunately, we can only correct missing data for events such as the screen turning off, by using data from the screen sensor and the battery level.

#### A few words on grouping the apps
As previously mentioned, the application name may reveal too much about a subject and privacy problems may arise. A possible solution to this problem is to classify the apps into more generic groups. For example, apps like WhatsApp, Signal, and Telegram are commonly used for texting, so we can group them under the label *texting*. `niimpy` provides a default mapping, but this should be adapted to the characteristics of the sample, since app availability varies by country.
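The grouping idea can be sketched with a plain dictionary lookup. The mapping below is hypothetical and only for illustration (it is not `niimpy`'s default map); the labels `comm`, `leisure`, and `na` are borrowed from the group names that appear in the feature output of this notebook, and the helper name `group_app` is an assumption.

```python
# Hypothetical mapping for illustration only; this is not niimpy's default map.
APP_GROUPS = {
    "WhatsApp": "comm",
    "Signal": "comm",
    "Telegram": "comm",
    "Gmail": "comm",
    "Google Play Music": "leisure",
}


def group_app(application_name, mapping=APP_GROUPS, default="na"):
    """Return the generic group for an app name, or `default` if it is unmapped."""
    return mapping.get(application_name, default)


names = ["Gmail", "Google Play Music", "Android System", "Telegram"]
groups = [group_app(name) for name in names]
print(groups)
```

Apps missing from the mapping fall into a catch-all group, so an uncommon app name never reaches the output.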

#### A few words on the role of the battery and screen
As mentioned before, sometimes the screen may be OFF and these events will not be caught by the application data. For example, we can open an app and leave it open until the phone screen turns off automatically. Another example is when the battery is depleted and the phone shuts down automatically. Having this information is crucial for correctly computing how long a subject used each app group. `niimpy`'s application module is adapted to take into account both the screen and battery data.
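To make the reasoning concrete, here is a minimal sketch of how interruptions can cap app durations. It is not `niimpy`'s implementation: the function name `app_durations`, the input layout, and the rule "an app session ends at the next app event or the next interruption, whichever comes first" are assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def app_durations(app_events, interruptions):
    """Sum seconds per app group, ending each session at the next app event
    or the next interruption (screen off / shutdown), whichever comes first.

    app_events: list of (datetime, app_group), sorted by time.
    interruptions: sorted list of datetimes.
    """
    totals = {}
    for i, (start, group) in enumerate(app_events):
        candidates = []
        if i + 1 < len(app_events):
            candidates.append(app_events[i + 1][0])
        # First interruption strictly after the session start
        j = bisect_right(interruptions, start)
        if j < len(interruptions):
            candidates.append(interruptions[j])
        if not candidates:
            continue  # no known end: the duration cannot be determined
        end = min(candidates)
        totals[group] = totals.get(group, 0.0) + (end - start).total_seconds()
    return totals


events = [
    (datetime(2019, 8, 5, 14, 0, 0), "comm"),
    (datetime(2019, 8, 5, 14, 0, 30), "leisure"),
    (datetime(2019, 8, 5, 14, 10, 0), "comm"),
]
screen_off = [datetime(2019, 8, 5, 14, 2, 0)]
totals = app_durations(events, screen_off)
print(totals)
```

Without the screen-off event, the `leisure` session would be counted until 14:10:00 instead of ending at 14:02:00.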
Let's load the screen and battery data:

In [5]:
bat_data = niimpy.read_csv('/m/cs/scratch/networks/trianaa1/Paper3/niimpy/niimpy/sampledata/multiuser_AwareBattery.csv', tz='Europe/Helsinki')
screen_data = niimpy.read_csv('/m/cs/scratch/networks/trianaa1/Paper3/niimpy/niimpy/sampledata/multiuser_AwareScreen.csv', tz='Europe/Helsinki')

In [6]:
bat_data.head()

| | user | device | time | battery_level | battery_status | battery_health | battery_adaptor | datetime |
|---|---|---|---|---|---|---|---|---|
| 2020-01-09 02:20:02.924999936+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1578529000.0 | 74 | 3 | 2 | 0 | 2020-01-09 02:20:02.924999936+02:00 |
| 2020-01-09 02:21:30.405999872+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1578529000.0 | 73 | 3 | 2 | 0 | 2020-01-09 02:21:30.405999872+02:00 |
| 2020-01-09 02:24:12.805999872+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1578529000.0 | 72 | 3 | 2 | 0 | 2020-01-09 02:24:12.805999872+02:00 |
| 2020-01-09 02:35:38.561000192+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1578530000.0 | 72 | 2 | 2 | 0 | 2020-01-09 02:35:38.561000192+02:00 |
| 2020-01-09 02:35:38.953000192+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1578530000.0 | 72 | 2 | 2 | 2 | 2020-01-09 02:35:38.953000192+02:00 |


The dataframe looks fine. In this case, we are interested in the battery_status information. This is standard information provided by Android. However, if the dataframe has this information in a column with a different name, we can use the argument `battery_column_name` and input our custom battery column name (see Extracting features, customized features). 

In [7]:
screen_data.head()

| | user | device | time | screen_status | datetime |
|---|---|---|---|---|---|
| 2020-01-09 02:06:41.573999872+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1578528000.0 | 0 | 2020-01-09 02:06:41.573999872+02:00 |
| 2020-01-09 02:09:29.152000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1578529000.0 | 1 | 2020-01-09 02:09:29.152000+02:00 |
| 2020-01-09 02:09:32.790999808+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1578529000.0 | 3 | 2020-01-09 02:09:32.790999808+02:00 |
| 2020-01-09 02:11:41.996000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1578529000.0 | 0 | 2020-01-09 02:11:41.996000+02:00 |
| 2020-01-09 02:16:19.010999808+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1578529000.0 | 1 | 2020-01-09 02:16:19.010999808+02:00 |


This dataframe looks fine too. In this case, we are interested in the screen_status information. However, if the dataframe has this information in a column with a different name, we can use the argument `screen_column_name` and input our custom screen column name (see Extracting features, customized features). 

## Extracting features

To extract app features, we use the function `extract_features_app`. This function takes four inputs: a dataframe with the app observations, two dataframes with the data from the battery and screen sensors, and a dictionary of customizable arguments. The battery and screen dataframes can be empty if we do not have such information. The function has some default parameters; let's have a look at those first.

### Default option

The default option will compute all features in 30-minute aggregation windows. To use the `extract_features_app` function with its default options, simply call the function. Remember to include battery and screen data when available.

In [8]:
default = app.extract_features_app(data, bat_data, screen_data, features=None)

computing app_count...
computing app_duration...


The function prints the name of each feature as it is computed, so you can track its progress. Now let's have a look at the outputs:

In [10]:
default.head()

| user | app_group | datetime | count | duration |
|---|---|---|---|---|
| iGyXetHE3S8u | comm | 2019-08-05 14:00:00+03:00 | 86 | 37.0 |
| iGyXetHE3S8u | leisure | 2019-08-05 14:00:00+03:00 | 20 | 7.0 |
| iGyXetHE3S8u | na | 2019-08-05 14:00:00+03:00 | 19 | 9.0 |
| iGyXetHE3S8u | work | 2019-08-05 14:00:00+03:00 | 7 | 6.0 |


The function output is also a dataframe where each column stands for a feature. The indexes are subjects, app groups, and timestamps. 
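As a rough illustration of where this index structure comes from, the sketch below counts toy events per user, app group, and 30-minute window with plain pandas. It is not the code `niimpy` runs internally; the toy values and the user label `u1` are made up for the example.

```python
import pandas as pd

# Toy events, already mapped to app groups (values are illustrative)
events = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u1"],
        "app_group": ["na", "na", "leisure", "comm"],
    },
    index=pd.to_datetime(
        [
            "2019-08-05 14:02:51",
            "2019-08-05 14:02:58",
            "2019-08-05 14:03:17",
            "2019-08-05 14:35:00",
        ]
    ),
)

# Count events per user, app group, and 30-minute window of the index
counts = (
    events.groupby(["user", "app_group", pd.Grouper(freq="30min")])
    .size()
    .rename("count")
)
print(counts)
```

The result is a series with a three-level index (user, app group, window start), which is the same shape as the feature table above.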

The default option can also be run in the absence of battery data. In this case, simply input an empty dataframe in the second position of the `extract_features_app` function.

In [26]:
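import pandas as pd
import niimpy.preprocessing.screen as s

# An empty dataframe stands in for the missing battery data
empty_data = pd.DataFrame()
default = s.extract_features_screen(data, empty_data, features=None)
default.tail()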

computing screen_off...
computing screen_count...
computing screen_duration...
computing screen_duration_min...
computing screen_duration_max...
computing screen_duration_mean...
computing screen_duration_median...
computing screen_duration_std...
computing screen_first_unlock...


| user | | screen_off | screen_on_count | screen_off_count | screen_use_count | screen_on_durationtotal | screen_off_durationtotal | screen_use_durationtotal | screen_on_durationminimum | screen_off_durationminimum | screen_use_durationminimum | ... | screen_on_durationmean | screen_off_durationmean | screen_use_durationmean | screen_on_durationmedian | screen_off_durationmedian | screen_use_durationmedian | screen_on_durationstd | screen_off_durationstd | screen_use_durationstd | datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jd9INuQ5BBlW | 2020-01-09 22:00:00+02:00 | NaN | 1.0 | 1.0 | 0.0 | 154.643 | 0.011 | NaN | 154.643 | 0.011 | NaN | ... | 154.643 | 0.011 | NaN | 154.643 | 0.011 | NaN | NaN | NaN | NaN | NaT |
| jd9INuQ5BBlW | 2020-01-09 22:30:00+02:00 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| jd9INuQ5BBlW | 2020-01-09 23:00:00+02:00 | NaN | 4.0 | 3.0 | 0.0 | 6.931 | 0.025 | NaN | 2.079 | 0.008 | NaN | ... | 2.310333 | 0.008333 | NaN | 2.262 | 0.008 | NaN | 0.258906 | 0.000577 | NaN | NaT |
| iGyXetHE3S8u | 2019-08-05 00:00:00+03:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019-08-05 14:03:42.322000128+03:00 |
| jd9INuQ5BBlW | 2020-01-09 00:00:00+02:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2020-01-09 02:16:19.010999808+02:00 |


### Customized features

The `extract_features_screen` function can also be customized. We can:
- extract some of the features (not all)
- modify the aggregation periods

All of these modifications need to be inside the dictionary input. 

Let's see how to use this to compute only some of the features. To do so, we need to create a dictionary where the keys are the names of the features we want to compute, and the values are empty dictionaries.

In [27]:
custom = {}
custom['screen_duration_max'] = {}
custom['screen_count'] = {}

In [30]:
custom_output = s.extract_features_screen(data, bat_data, features=custom)
custom_output.head()

computing screen_duration_max...
computing screen_count...


| user | | screen_on_durationmaximum | screen_off_durationmaximum | screen_use_durationmaximum | screen_on_count | screen_off_count | screen_use_count |
|---|---|---|---|---|---|---|---|
| iGyXetHE3S8u | 2019-08-05 14:00:00+03:00 | 100.365 | 1084.703 | 0.345 | 4 | 4 | 4 |
| iGyXetHE3S8u | 2019-08-05 14:30:00+03:00 | 15.402 | 284779.15 | 0.119 | 2 | 2 | 2 |
| iGyXetHE3S8u | 2019-08-05 15:00:00+03:00 | NaN | NaN | NaN | 0 | 0 | 0 |
| iGyXetHE3S8u | 2019-08-05 15:30:00+03:00 | NaN | NaN | NaN | 0 | 0 | 0 |
| iGyXetHE3S8u | 2019-08-05 16:00:00+03:00 | NaN | NaN | NaN | 0 | 0 | 0 |


As we see, this time only two features were computed in a 30-min aggregated period. Now, let's compute another set of features with different aggregation windows. For that, we rely on the arguments from the `pandas.DataFrame.resample` function. 

For this example, we will aggregate the features `screen_count` and `screen_duration`. The screen count will be computed on a daily basis and the screen duration will be computed in 5-hour periods with a 5-min offset.

In [39]:
features = {"screen_count":{"screen_column_name":"screen_status","resample_args":{"rule":"1D"}},
               "screen_duration":{"screen_column_name":"screen_status","resample_args":{"rule":"5H","offset":"5min"}}}

As we see, we have an input dictionary in which the main keys are the names of the features to compute. For each feature, we also have a dictionary. This new dictionary has some other arguments, mainly the name of the column that we would like to use for the computation and another dictionary named `resample_args`. The `screen_column_name` is the column name where the screen status is stored; this helps in case our dataframe has some other naming conventions. The `resample_args` dictionary contains the arguments to pass for the resampling (see `pandas.DataFrame.resample`).
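The effect of a `resample_args` dictionary can be explored directly in pandas. The sketch below unpacks such a dictionary into `resample` on a toy series; forwarding the dictionary with `**` is an assumption about how the arguments could be consumed, not a description of `niimpy`'s internals, and the timestamps are made up.

```python
import pandas as pd

# One row per event; the value 1 lets a sum act as an event count
events = pd.Series(
    1,
    index=pd.to_datetime(
        ["2020-01-09 02:06:41", "2020-01-09 02:09:29", "2020-01-09 11:00:00"]
    ),
)

# Same structure as the `resample_args` entries above
# (lowercase "5h" is used here; "5H" is deprecated in recent pandas releases)
resample_args = {"rule": "5h", "offset": "5min"}

# Unpack the dictionary straight into pandas' resample
windowed = events.resample(**resample_args).sum()
print(windowed)
```

With a 5-minute offset the 5-hour windows start at 00:05, 05:05, 10:05, and so on, rather than on the hour.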

In [40]:
custom_output = s.extract_features_screen(data, bat_data, features=features)
custom_output.head()

computing screen_count...
computing screen_duration...


| user | | screen_on_count | screen_off_count | screen_use_count | screen_on_durationtotal | screen_off_durationtotal | screen_use_durationtotal |
|---|---|---|---|---|---|---|---|
| iGyXetHE3S8u | 2019-08-05 00:00:00+03:00 | 6.0 | 6.0 | 6.0 | NaN | NaN | NaN |
| iGyXetHE3S8u | 2019-08-06 00:00:00+03:00 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN |
| iGyXetHE3S8u | 2019-08-07 00:00:00+03:00 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN |
| iGyXetHE3S8u | 2019-08-08 00:00:00+03:00 | 6.0 | 6.0 | 6.0 | NaN | NaN | NaN |
| iGyXetHE3S8u | 2019-08-09 00:00:00+03:00 | 2.0 | 2.0 | 2.0 | NaN | NaN | NaN |


In [41]:
custom_output.tail()

| user | | screen_on_count | screen_off_count | screen_use_count | screen_on_durationtotal | screen_off_durationtotal | screen_use_durationtotal |
|---|---|---|---|---|---|---|---|
| jd9INuQ5BBlW | 2020-01-09 00:05:00+02:00 | NaN | NaN | NaN | 54.894 | 1674.82 | 656.738 |
| jd9INuQ5BBlW | 2020-01-09 05:05:00+02:00 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 |
| jd9INuQ5BBlW | 2020-01-09 10:05:00+02:00 | NaN | NaN | NaN | 135.896 | 1455.094 | 425.015 |
| jd9INuQ5BBlW | 2020-01-09 15:05:00+02:00 | NaN | NaN | NaN | 221.362001 | 24.673 | 667.703 |
| jd9INuQ5BBlW | 2020-01-09 20:05:00+02:00 | NaN | NaN | NaN | 557.914999 | 46.079 | 169.079 |


The output is once again a dataframe. In this case, two aggregations are shown. The first one is the daily aggregation computed for the `screen_count` feature. The second one is the 5-hour aggregation with a 5-min offset for the `screen_duration` feature. Note that because `screen_duration` is not aggregated daily, its columns are NaN at the daily timestamps. Similarly, because `screen_count` is not aggregated in 5-hour windows, its columns are NaN at the 5-hour timestamps for all subjects.

## Implementing own features

We can implement our own customized features easily. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by user and timestamps.
To make the feature readily available in the default options, we need to add the *screen* prefix to the new function (e.g. `screen_my-new-feature`).

In [42]:
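def screen_last_unlock(df, bat, feature_functions=None):
    # Fall back to an empty configuration so the lookups below are safe
    if feature_functions is None:
        feature_functions = {}
    if "screen_column_name" not in feature_functions:
        col_name = "screen_status"
    else:
        col_name = feature_functions["screen_column_name"]
    if "resample_args" not in feature_functions:
        feature_functions["resample_args"] = {"rule":"30T"}

    # Helper functions assumed to be in scope (e.g. imported from the screen module)
    df2 = screen_util(df, bat, feature_functions)
    df2 = screen_event_classification(df2, feature_functions)

    # Latest screen-on timestamp per user and day
    result = df2[df2.on==1].groupby("user").resample(rule='1D').max()
    result = result[["datetime"]]
    return result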

Then, we can call our new function using the `extract_features_screen` function.

In [43]:
customized_features = s.extract_features_screen(data, bat_data, features={"screen_last_unlock": {}})

computing screen_last_unlock...


NameError: name 'screen_last_unlock' is not defined