# 01 Time Windows

Featuretools has strong support for handling time. Features can be built from data available right up to the moment a prediction is made, without risk of information leakage, and incomplete records can still be used where they carry useful information.

This can be used to control:
- when a data record becomes available
- when specific columns within a record become available

In [1]:
import pandas as pd
import numpy as np
import featuretools as ft
from create_data import make_attendances_dataframe

In [2]:
df = make_attendances_dataframe(15)

#### Setting time index
Use "time_index" to set the time at which each record becomes available.

In [3]:
es = ft.EntitySet('Hospital')

es = es.entity_from_dataframe(entity_id='attendances',
                               dataframe=df,
                               index='atten_id',
                               time_index='arrival_datetime')

In [4]:
df.head()

Unnamed: 0,atten_id,pat_id,arrival_datetime,time_in_department,ambulance_arrival,departure_datetime,gender
5,1005,4680,2018-01-01 02:49:00,197,1,2018-01-01 06:06:00,0
9,1009,8345,2018-01-01 05:39:00,122,0,2018-01-01 07:41:00,1
4,1004,8342,2018-01-01 08:07:00,89,1,2018-01-01 09:36:00,0
0,1000,442,2018-01-01 08:15:00,59,0,2018-01-01 09:14:00,0
6,1006,3699,2018-01-01 11:21:00,303,1,2018-01-01 16:24:00,1


#### Cutoff times

A cutoff time defines the datetime at which a prediction is to be made; no information after this point is used. To use cutoff times, create a dataframe to pass to DFS that contains the unique id (e.g. atten_id) and a cutoff time for each row.

The cutoff dataframe may also contain multiple cutoff times for the same unique id.
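The cutoff rule can be sketched in plain Python, independent of Featuretools. This is only an illustration of the idea, not the library's implementation: `ARRIVALS` and `is_visible` are hypothetical names, the arrival times are copied from the attendances shown earlier, and treating the cutoff boundary as inclusive is an assumption. The last row shows the same id appearing with a second, later cutoff.

```python
from datetime import datetime

# Hypothetical lookup mirroring three rows of the attendances table.
ARRIVALS = {
    1005: datetime(2018, 1, 1, 2, 49),
    1009: datetime(2018, 1, 1, 5, 39),
    1004: datetime(2018, 1, 1, 8, 7),
}


def is_visible(atten_id, cutoff):
    """True if the record's time index is at or before the cutoff.

    Treating the boundary as inclusive is an assumption of this sketch.
    """
    return ARRIVALS[atten_id] <= cutoff


# One row per (id, cutoff) pair; an id may appear more than once.
cutoff_rows = [
    (1005, datetime(2018, 1, 1, 6, 0)),
    (1009, datetime(2018, 1, 1, 6, 0)),
    (1004, datetime(2018, 1, 1, 6, 0)),
    (1004, datetime(2018, 1, 1, 9, 0)),  # second, later cutoff for the same id
]

visibility = [(atten_id, cutoff, is_visible(atten_id, cutoff))
              for atten_id, cutoff in cutoff_rows]

for atten_id, cutoff, visible in visibility:
    print(atten_id, cutoff, "features computed" if visible else "all NaN")
```

Under these assumptions attendance 1004 is hidden at 06:00 but visible at 09:00, because it arrives at 08:07.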

In [7]:
ct = pd.DataFrame()

ct['atten_id'] = [1005,1009, 1004]

ct['time'] = pd.to_datetime(['2018-01-01 06:00',
                              '2018-01-01 06:00',
                              '2018-01-01 06:00'])

# The label column is optional and is not modified by DFS; it can be used to pass labels for prediction.
ct['label'] = [True, True, False]

ct

Unnamed: 0,atten_id,time,label
0,1005,2018-01-01 06:00:00,True
1,1009,2018-01-01 06:00:00,True
2,1004,2018-01-01 06:00:00,False


In [9]:
fm, features = ft.dfs(entityset=es,
                       target_entity='attendances',
                       cutoff_time=ct,
                       cutoff_time_in_index=True)
 

fm

atten_id,time,pat_id,time_in_department,ambulance_arrival,gender,DAY(arrival_datetime),DAY(departure_datetime),YEAR(arrival_datetime),YEAR(departure_datetime),MONTH(arrival_datetime),MONTH(departure_datetime),WEEKDAY(arrival_datetime),WEEKDAY(departure_datetime),label
1004,2018-01-01 06:00:00,,,,,,,,,,,,,False
1005,2018-01-01 06:00:00,4680.0,197.0,1.0,0.0,1.0,1.0,2018.0,2018.0,1.0,1.0,0.0,0.0,True
1009,2018-01-01 06:00:00,8345.0,122.0,0.0,1.0,1.0,1.0,2018.0,2018.0,1.0,1.0,0.0,0.0,True


We can see that DFS has only calculated features for the attendances (of the three provided in the ct dataframe) that were available at 6am on 1st Jan 2018. Attendance 1004 has not occurred yet (its arrival_datetime is after the cutoff), so NaNs are returned for this row.

An example use of this might be creating data to predict "time_in_department" or "admission_flag" for the patients who are in a department at 6am.

One problem in this case is restricting the data to the columns that would be available at the time of prediction; e.g. "time_in_department" would not be available for this prediction, yet it appears in the feature matrix above. We can restrict the columns by using a secondary time index.



#### Setting secondary time index

Use "secondary_time_index" to define when additional information in a record becomes available, by providing a dictionary. The dictionary in the example below indicates that at the time given by "departure_datetime" the listed columns become available ("time_in_department" in this example).

In [21]:
import featuretools.variable_types as vtypes

data_variable_types = {'atten_id': vtypes.Id,
                       'pat_id': vtypes.Id,
                       'arrival_datetime': vtypes.Datetime,
                       'time_in_department': vtypes.Numeric,
                       'departure_datetime': vtypes.Datetime,
                       'gender': vtypes.Boolean,
                       'ambulance_arrival': vtypes.Boolean}
#es = ft.EntitySet('Hospital')
es = es.entity_from_dataframe(entity_id='attendances',
                              dataframe=df,
                              index='atten_id',
                              time_index='arrival_datetime',
                              # dictionary: datetime column -> columns that become available at that time
                              secondary_time_index={'departure_datetime': ['time_in_department']},
                              variable_types=data_variable_types)

In [22]:
fm, features = ft.dfs(entityset=es,
                       target_entity='attendances',
                       cutoff_time=ct,
                       cutoff_time_in_index=True)
 

fm

atten_id,time,pat_id,gender,ambulance_arrival,DAY(arrival_datetime),DAY(departure_datetime),YEAR(arrival_datetime),YEAR(departure_datetime),MONTH(arrival_datetime),MONTH(departure_datetime),WEEKDAY(arrival_datetime),WEEKDAY(departure_datetime),label
1004,2018-01-01 06:00:00,,,,,,,,,,,,False
1005,2018-01-01 06:00:00,4680.0,0.0,1.0,1.0,,2018.0,,1.0,,0.0,,True
1009,2018-01-01 06:00:00,8345.0,1.0,0.0,1.0,,2018.0,,1.0,,0.0,,True
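The masking seen in this output can be sketched for a single record in plain Python. This is a rough illustration under stated assumptions, not Featuretools' implementation: `mask_record` is a hypothetical helper, the values are those of attendance 1005, and a column is assumed to stay hidden until its secondary time is at or before the cutoff. Consistent with the NaNs for the departure_datetime features above, the sketch also hides the secondary time column itself.

```python
from datetime import datetime


def mask_record(record, cutoff, secondary_time_index):
    """Return a copy of record with not-yet-available values set to None.

    secondary_time_index maps a datetime column to the columns that only
    become available at that time (the datetime column itself included).
    """
    masked = dict(record)
    for time_col, columns in secondary_time_index.items():
        if record[time_col] > cutoff:
            for col in list(columns) + [time_col]:
                masked[col] = None
    return masked


# Values copied from attendance 1005 in the table shown earlier.
record = {
    "atten_id": 1005,
    "pat_id": 4680,
    "arrival_datetime": datetime(2018, 1, 1, 2, 49),
    "time_in_department": 197,
    "departure_datetime": datetime(2018, 1, 1, 6, 6),
}
secondary = {"departure_datetime": ["time_in_department"]}

at_six = mask_record(record, datetime(2018, 1, 1, 6, 0), secondary)
at_seven = mask_record(record, datetime(2018, 1, 1, 7, 0), secondary)

print(at_six["time_in_department"], at_seven["time_in_department"])
```

At the 06:00 cutoff the patient has not yet departed (06:06), so "time_in_department" is hidden while "pat_id" remains visible; at 07:00 both are available.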


#### Training windows

Whilst a cutoff time excludes data after a given datetime, a "training window" limits how far back before the cutoff data can be used when calculating a particular feature matrix.
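As a minimal sketch of the idea, a training window can be thought of as a time interval ending at the cutoff. The helper `in_training_window` below is hypothetical, and the exact boundary handling (open at the start, closed at the cutoff) is an assumption of this sketch rather than a statement about how Featuretools treats the edges.

```python
from datetime import datetime, timedelta


def in_training_window(event_time, cutoff, window):
    """True if event_time lies in the interval (cutoff - window, cutoff]."""
    return cutoff - window < event_time <= cutoff


cutoff = datetime(2018, 1, 1, 6, 0)
window = timedelta(hours=24)

# Arrival of attendance 1005: before the cutoff and inside the window.
inside = in_training_window(datetime(2018, 1, 1, 2, 49), cutoff, window)

# Hypothetical older event: before the cutoff but outside the window.
too_old = in_training_window(datetime(2017, 12, 30, 23, 0), cutoff, window)

# Arrival of attendance 1004: after the cutoff, so excluded regardless.
too_late = in_training_window(datetime(2018, 1, 1, 8, 7), cutoff, window)

print(inside, too_old, too_late)
```

With the sample data used here, every arrival before the 06:00 cutoff also falls within the preceding 24 hours, so a 24-hour window would not be expected to change which attendances are used.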

In [17]:
es.add_last_time_indexes()

In [20]:
window_fm, window_features = ft.dfs(entityset=es,
                                     target_entity="attendances",
                                     cutoff_time=ct,
                                     cutoff_time_in_index=True,
                                     training_window="24 hours")

window_fm

atten_id,time,pat_id,gender,ambulance_arrival,DAY(arrival_datetime),DAY(departure_datetime),YEAR(arrival_datetime),YEAR(departure_datetime),MONTH(arrival_datetime),MONTH(departure_datetime),WEEKDAY(arrival_datetime),WEEKDAY(departure_datetime),label
1004,2018-01-01 06:00:00,,,,,,,,,,,,False
1005,2018-01-01 06:00:00,4680.0,0.0,1.0,1.0,,2018.0,,1.0,,0.0,,True
1009,2018-01-01 06:00:00,8345.0,1.0,0.0,1.0,,2018.0,,1.0,,0.0,,True
