# 00 Featuretools and Entity Sets

#### Intro
- If you are completely new to the Featuretools library, we suggest you check out the documentation here: https://docs.featuretools.com/
- These notebooks aim to introduce some possible uses of Featuretools applied to patient hospital data.
- We demonstrate how to generate features from individual patient transactions.
- We work with a synthetic dataset of hospital patient stays.

In [1]:
import pandas as pd
import numpy as np
import featuretools as ft

#### Get synthetic patient dataset

In [2]:
from create_data import make_attendances_dataframe

In [3]:
df = make_attendances_dataframe(15)

We see that we have individual records of patient attendances in an ED (emergency department). Each record has a unique field, atten_id, and several non-unique fields: pat_id, arrival_datetime, departure_datetime, time_in_department (length of stay in minutes), gender of the patient, and ambulance_arrival, a 0/1 flag indicating whether the patient arrived by ambulance.

In [4]:
df.head()

Unnamed: 0,atten_id,pat_id,arrival_datetime,time_in_department,ambulance_arrival,departure_datetime,gender
0,1000,3814,2018-01-01 08:15:00,59,0,2018-01-01 09:14:00,0
1,1001,1847,2018-01-01 13:08:00,256,0,2018-01-01 17:24:00,1
2,1002,7573,2018-01-01 22:43:00,324,1,2018-01-02 04:07:00,1
3,1003,8807,2018-01-01 18:11:00,326,1,2018-01-01 23:37:00,1
4,1004,13380,2018-01-01 08:07:00,89,1,2018-01-01 09:36:00,0
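The `create_data` module is not shown in this notebook. For readers who do not have it, the sketch below builds a dataframe with the same columns as the output above. The function name `make_attendances_sketch`, the value ranges and the `seed` parameter are all assumptions made for illustration; it is not the original generator.

```python
import numpy as np
import pandas as pd


def make_attendances_sketch(n_rows, seed=0):
    """Hypothetical stand-in for make_attendances_dataframe (not the original code)."""
    rng = np.random.default_rng(seed)

    # Random arrival minute within a two-day window starting 2018-01-01 (assumed range)
    start = pd.Timestamp("2018-01-01")
    arrival = start + pd.to_timedelta(rng.integers(0, 2 * 24 * 60, size=n_rows), unit="min")

    # Length of stay in minutes (assumed range)
    stay = rng.integers(30, 360, size=n_rows)

    return pd.DataFrame({
        "atten_id": np.arange(1000, 1000 + n_rows),        # unique per attendance
        "pat_id": rng.integers(1, 15000, size=n_rows),     # may repeat across rows
        "arrival_datetime": arrival,
        "time_in_department": stay,
        "ambulance_arrival": rng.integers(0, 2, size=n_rows),
        "departure_datetime": arrival + pd.to_timedelta(stay, unit="min"),
        "gender": rng.integers(0, 2, size=n_rows),
    })


sketch_df = make_attendances_sketch(15)
```

As in the real data, departure_datetime is simply arrival_datetime plus time_in_department minutes.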


Featuretools requires dataframes to be loaded into an EntitySet.

In [5]:
es = ft.EntitySet('Hospital')

es = es.entity_from_dataframe(entity_id='attendances',
                               dataframe=df,
                               index='atten_id',
                               time_index='arrival_datetime')
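The column passed as `index` has to identify each row uniquely. A quick pandas check before loading can catch problems early. This is a minimal sketch on toy data; the helper name `check_index` is hypothetical and not part of Featuretools.

```python
import pandas as pd

# Toy attendances; in the notebook this would be the df loaded above
attendances = pd.DataFrame({
    "atten_id": [1000, 1001, 1002],
    "pat_id": [3814, 1847, 3814],
})


def check_index(frame, column):
    """Raise early if `column` cannot serve as a unique, non-null index."""
    if frame[column].isna().any():
        raise ValueError(f"{column} contains missing values")
    if not frame[column].is_unique:
        raise ValueError(f"{column} contains duplicates")
    return True


index_ok = check_index(attendances, "atten_id")  # atten_id is unique, so this passes
```

Calling `check_index(attendances, "pat_id")` on the toy data raises a `ValueError`, because patient 3814 appears twice.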

It's possible to view the entities, their shapes and their relationships in the current entity set:

In [6]:
es

Entityset: Hospital
  Entities:
    attendances [Rows: 15, Columns: 7]
  Relationships:
    No relationships

In [7]:
es['attendances']

Entity: attendances
  Variables:
    atten_id (dtype: index)
    pat_id (dtype: numeric)
    arrival_datetime (dtype: datetime_time_index)
    time_in_department (dtype: numeric)
    ambulance_arrival (dtype: numeric)
    departure_datetime (dtype: datetime)
    gender (dtype: numeric)
  Shape:
    (Rows: 15, Columns: 7)

We can create new entities from existing ones by 'normalising', e.g. to produce a patient-level dataset:

In [8]:
es.normalize_entity(base_entity_id='attendances',
                   new_entity_id='patients',
                   index='pat_id',
                   make_time_index=False)

Entityset: Hospital
  Entities:
    attendances [Rows: 15, Columns: 7]
    patients [Rows: 15, Columns: 1]
  Relationships:
    attendances.pat_id -> patients.pat_id

We can see that ft has added a second entity to the entity set, and has also added the relationship between the two tables.

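Every relationship listed is one-to-many: one row in the parent entity (here, a patient) can match many rows in the child entity (attendances). The listing is written as child -> parent.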
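To build intuition for what normalising produces, the same idea can be sketched in plain pandas: take the distinct pat_id values as a parent table and check that each attendance maps to exactly one patient. This is only an illustration on toy data, not how Featuretools implements it internally.

```python
import pandas as pd

attendances = pd.DataFrame({
    "atten_id": [1000, 1001, 1002, 1003],
    "pat_id": [3814, 1847, 3814, 7573],   # patient 3814 attends twice
    "time_in_department": [59, 256, 324, 326],
})

# Parent table: one row per distinct pat_id
patients = attendances[["pat_id"]].drop_duplicates().reset_index(drop=True)

# validate="many_to_one" makes pandas raise if pat_id is not unique in patients,
# i.e. if the relationship were not one patient to many attendances
joined = attendances.merge(patients, on="pat_id", how="left", validate="many_to_one")
```

In the notebook's 15-row sample every patient attends only once, which is why both entities show 15 rows.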

In [9]:
es

Entityset: Hospital
  Entities:
    attendances [Rows: 15, Columns: 7]
    patients [Rows: 15, Columns: 1]
  Relationships:
    attendances.pat_id -> patients.pat_id

#### Creating features! 

ft allows you to create features very quickly once you have your data loaded, using Deep Feature Synthesis (DFS).

For example, creating features for each unique "attendance":

In [10]:
fm, features = ft.dfs(entityset=es,
                     target_entity='attendances')

fm.sample(3)

Unnamed: 0_level_0,pat_id,time_in_department,ambulance_arrival,gender,DAY(arrival_datetime),DAY(departure_datetime),YEAR(arrival_datetime),YEAR(departure_datetime),MONTH(arrival_datetime),MONTH(departure_datetime),...,patients.SKEW(attendances.time_in_department),patients.SKEW(attendances.ambulance_arrival),patients.SKEW(attendances.gender),patients.MIN(attendances.time_in_department),patients.MIN(attendances.ambulance_arrival),patients.MIN(attendances.gender),patients.MEAN(attendances.time_in_department),patients.MEAN(attendances.ambulance_arrival),patients.MEAN(attendances.gender),patients.COUNT(attendances)
atten_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1014,4681,51,1,1,2,2,2018,2018,1,1,...,,,,51,1,1,51,1,1,1
1013,13379,239,1,1,2,2,2018,2018,1,1,...,,,,239,1,1,239,1,1,1
1000,3814,59,0,0,1,1,2018,2018,1,1,...,,,,59,0,0,59,0,0,1
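The SKEW columns in the sample above are empty. A likely explanation, assuming the statistic is computed the way pandas computes sample skewness, is that every patient in this sample has a single attendance, and skewness is undefined for fewer than three observations. A small pandas check of that behaviour:

```python
import pandas as pd

# One attendance for a patient, as in the sampled rows above
one_visit = pd.Series([51])

# Three attendances for a hypothetical repeat patient
three_visits = pd.Series([51, 239, 59])

skew_one = one_visit.skew()      # NaN: fewer than three observations
skew_three = three_visits.skew() # defined once there are three values
```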


In [11]:
features

[<Feature: pat_id>,
 <Feature: time_in_department>,
 <Feature: ambulance_arrival>,
 <Feature: gender>,
 <Feature: DAY(arrival_datetime)>,
 <Feature: DAY(departure_datetime)>,
 <Feature: YEAR(arrival_datetime)>,
 <Feature: YEAR(departure_datetime)>,
 <Feature: MONTH(arrival_datetime)>,
 <Feature: MONTH(departure_datetime)>,
 <Feature: WEEKDAY(arrival_datetime)>,
 <Feature: WEEKDAY(departure_datetime)>,
 <Feature: patients.SUM(attendances.time_in_department)>,
 <Feature: patients.SUM(attendances.ambulance_arrival)>,
 <Feature: patients.SUM(attendances.gender)>,
 <Feature: patients.STD(attendances.time_in_department)>,
 <Feature: patients.STD(attendances.ambulance_arrival)>,
 <Feature: patients.STD(attendances.gender)>,
 <Feature: patients.MAX(attendances.time_in_department)>,
 <Feature: patients.MAX(attendances.ambulance_arrival)>,
 <Feature: patients.MAX(attendances.gender)>,
 <Feature: patients.SKEW(attendances.time_in_department)>,
 <Feature: patients.SKEW(attendances.ambulance_arriva

Featuretools uses "primitives" to define new features. It is possible to specify which primitives to use when calling DFS.

- transform primitives - define which functions are applied within a table
- aggregation primitives - define which functions are applied when grouping a table (and moving up a level)
- max_depth - parameter that limits how deep the generated features can go, i.e. how many primitives can be stacked across levels
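The difference between the two kinds of primitive can be illustrated with rough pandas equivalents. The sketch below uses toy data and reuses the feature names Featuretools prints; it is an approximation for intuition, not the library's implementation.

```python
import pandas as pd

attendances = pd.DataFrame({
    "atten_id": [1000, 1001, 1002],
    "pat_id": [3814, 1847, 3814],
    "arrival_datetime": pd.to_datetime(
        ["2018-01-01 08:15", "2018-01-01 13:08", "2018-01-02 22:43"]),
    "time_in_department": [59, 256, 324],
}).set_index("atten_id")

# Transform primitive 'day': computed row by row within the attendances table
attendances["DAY(arrival_datetime)"] = attendances["arrival_datetime"].dt.day

# Aggregation primitives 'mean' and 'count': computed per patient (one level up) ...
per_patient = attendances.groupby("pat_id")["time_in_department"].agg(["mean", "count"])

# ... then carried back down to each attendance of that patient
attendances["patients.MEAN(attendances.time_in_department)"] = (
    attendances["pat_id"].map(per_patient["mean"]))
attendances["patients.COUNT(attendances)"] = (
    attendances["pat_id"].map(per_patient["count"]))
```

Here patient 3814 has two attendances, so both of their rows receive a mean stay of 191.5 minutes and a count of 2.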

In [12]:
fm, features = ft.dfs(entityset=es,
                     target_entity='attendances',
                     agg_primitives=['mean','count'],
                     trans_primitives=['day'],
                     max_depth = 2)

fm.sample(3)

Unnamed: 0_level_0,pat_id,time_in_department,ambulance_arrival,gender,DAY(arrival_datetime),DAY(departure_datetime),patients.MEAN(attendances.time_in_department),patients.MEAN(attendances.ambulance_arrival),patients.MEAN(attendances.gender),patients.COUNT(attendances)
atten_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1011,11909,81,0,1,2,2,81,0,1,1
1000,3814,59,0,0,1,1,59,0,0,1
1012,174,53,1,0,2,2,53,1,0,1


In [13]:
features

[<Feature: pat_id>,
 <Feature: time_in_department>,
 <Feature: ambulance_arrival>,
 <Feature: gender>,
 <Feature: DAY(arrival_datetime)>,
 <Feature: DAY(departure_datetime)>,
 <Feature: patients.MEAN(attendances.time_in_department)>,
 <Feature: patients.MEAN(attendances.ambulance_arrival)>,
 <Feature: patients.MEAN(attendances.gender)>,
 <Feature: patients.COUNT(attendances)>]

Featuretools has built-in primitives, listed below, but you can also write custom ones.

In [14]:
ft.list_primitives().sample(5)

Unnamed: 0,name,type,description
36,time_since_previous,transform,Compute the time since the previous instance.
45,year,transform,Transform a Datetime feature into the year.
51,cum_min,transform,Calculates the min of previous values of an in...
37,cum_max,transform,Calculates the max of previous values of an in...
38,not,transform,"For each value of the base feature, negates th..."


## Further notes:

Sometimes you may wish to set data types explicitly to prevent Featuretools from making wrong assumptions. The data type determines which primitives can be applied to a variable.

In [15]:
import featuretools.variable_types as vtypes

data_variable_types = {
    'atten_id': vtypes.Id,
    'pat_id': vtypes.Id,
    'arrival_datetime': vtypes.Datetime,
    'time_in_department': vtypes.Numeric,
    'departure_datetime': vtypes.Datetime,
    'gender': vtypes.Boolean,
    'ambulance_arrival': vtypes.Boolean,
}

In [16]:
es = ft.EntitySet('Hospital')
es = es.entity_from_dataframe(entity_id='attendances',
                              dataframe=df,
                              index='atten_id',
                              variable_types=data_variable_types)
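A complementary step, sketched here in plain pandas, is to make the dataframe's own dtypes match the intent before loading, so that 0/1 flags are not stored as integers and timestamps are not left as strings. This is a different technique from `variable_types` and does not replace it; how Featuretools then infers the types is not verified here.

```python
import pandas as pd

# Toy frame with the kind of raw dtypes a CSV load might produce
raw = pd.DataFrame({
    "atten_id": [1000, 1001],
    "arrival_datetime": ["2018-01-01 08:15:00", "2018-01-01 13:08:00"],
    "ambulance_arrival": [0, 1],
    "gender": [0, 1],
})

typed = raw.assign(
    arrival_datetime=pd.to_datetime(raw["arrival_datetime"]),  # string -> datetime64
    ambulance_arrival=raw["ambulance_arrival"].astype(bool),   # 0/1 -> False/True
    gender=raw["gender"].astype(bool),
)
```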