# Feature primitives
Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive constrains only its input and output data types, it can be applied across datasets and stacked with other primitives to create new calculations.

## Why primitives?
The space of potential functions that humans use to create a feature is expansive. By breaking common feature engineering calculations down into primitive components, we are able to capture the underlying structure of the features humans create today.

A primitive constrains only its input and output data types, so a calculation known in one domain can be transferred to another. Consider a feature that data scientists often calculate for transactional or event-log data: **average time between events**. This feature is valuable for predicting fraudulent behavior or future customer engagement.

DFS achieves the same feature by stacking two primitives, `"time_since_previous"` and `"mean"`:

In [2]:
# load data
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset = True)

In [3]:
feature_defs = ft.dfs(entityset = es, 
                      target_entity = "customers", 
                      agg_primitives = ['mean'], 
                      trans_primitives = ['time_since_previous'], 
                      features_only = True)

In [4]:
feature_defs


[<Feature: zip_code>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: MEAN(sessions.MEAN(transactions.amount))>,
 <Feature: MEAN(sessions.time_since_previous_by_customer_id)>]
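To make the stacking concrete, here is a minimal plain-Python sketch of what a "mean of time since previous" feature computes for a single customer. It does not use Featuretools; the helper names and the sample timestamps are hypothetical and only illustrate the idea of feeding a transform's output into an aggregation.

```python
from datetime import datetime
from statistics import mean


def time_since_previous(timestamps):
    """Transform-style step: seconds since the previous timestamp.

    The first element has no predecessor, so it is reported as None.
    """
    if not timestamps:
        return []
    gaps = [None]
    for previous, current in zip(timestamps, timestamps[1:]):
        gaps.append((current - previous).total_seconds())
    return gaps


def mean_ignoring_missing(values):
    """Aggregation-style step: mean of the values that are present."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# Hypothetical session start times for one customer.
session_starts = [
    datetime(2014, 1, 1, 0, 0, 0),
    datetime(2014, 1, 1, 0, 10, 0),
    datetime(2014, 1, 1, 0, 40, 0),
]

gaps = time_since_previous(session_starts)   # one value per session
avg_gap = mean_ignoring_missing(gaps)        # one value per customer
print(gaps, avg_gap)
```

The transform keeps one value per row, and the aggregation collapses those rows into a single number, which is the shape of the `MEAN(sessions.time_since_previous_by_customer_id)` feature above.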

A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. Deep Feature Synthesis uses this to generate several different summaries of the time since the previous event.

In [5]:
feature_matrix, feature_defs = ft.dfs(entityset = es, 
                                      target_entity = "customers", 
                                      agg_primitives = ['mean', 'max', 'min', 'skew', 'std'], 
                                      trans_primitives = ['time_since_previous'])

In [13]:
feature_defs

[<Feature: zip_code>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: MAX(transactions.amount)>,
 <Feature: MIN(transactions.amount)>,
 <Feature: SKEW(transactions.amount)>,
 <Feature: STD(transactions.amount)>,
 <Feature: MEAN(sessions.MEAN(transactions.amount))>,
 <Feature: MEAN(sessions.MAX(transactions.amount))>,
 <Feature: MEAN(sessions.MIN(transactions.amount))>,
 <Feature: MEAN(sessions.SKEW(transactions.amount))>,
 <Feature: MEAN(sessions.STD(transactions.amount))>,
 <Feature: MEAN(sessions.time_since_previous_by_customer_id)>,
 <Feature: MAX(sessions.MEAN(transactions.amount))>,
 <Feature: MAX(sessions.MIN(transactions.amount))>,
 <Feature: MAX(sessions.SKEW(transactions.amount))>,
 <Feature: MAX(sessions.STD(transactions.amount))>,
 <Feature: MAX(sessions.time_since_previous_by_customer_id)>,
 <Feature: MIN(sessions.MEAN(transactions.amount))>,
 <Feature: MIN(sessions.MAX(transactions.amount))>,
 <Feature: MIN(sessions.SKEW(transactions.amount))>,
 <Feature: MIN(sessions.ST

In [7]:
feature_matrix[["MEAN(sessions.time_since_previous_by_customer_id)",
                "MAX(sessions.time_since_previous_by_customer_id)",
                "MIN(sessions.time_since_previous_by_customer_id)",
                "STD(sessions.time_since_previous_by_customer_id)",
                "SKEW(sessions.time_since_previous_by_customer_id)"]]

| customer_id | MEAN(sessions.time_since_previous_by_customer_id) | MAX(sessions.time_since_previous_by_customer_id) | MIN(sessions.time_since_previous_by_customer_id) | STD(sessions.time_since_previous_by_customer_id) | SKEW(sessions.time_since_previous_by_customer_id) |
|---|---|---|---|---|---|
| 1 | 3502.777778 | 7670.0 | 1040.0 | 1849.029943 | |
| 2 | 2655.714286 | 7085.0 | 910.0 | 2167.930924 | |
| 3 | 6971.25 | 11245.0 | 1430.0 | 4260.940881 | |
| 4 | 2405.0 | 7605.0 | 910.0 | 2213.274884 | |
| 5 | 9316.666667 | 15860.0 | 3705.0 | 5005.797195 | |
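The parameterized enumeration can be pictured as applying a table of aggregation functions to the same transformed values. The following is a rough plain-Python sketch, not Featuretools code; the gap values and the naming scheme are assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical gaps (in seconds) between one customer's consecutive sessions.
gaps = [600.0, 1800.0, 1200.0]

# Each entry plays the role of one aggregation primitive.
aggregations = {
    "MEAN": mean,
    "MAX": max,
    "MIN": min,
    "STD": stdev,  # statistics.stdev is the sample standard deviation
}

# One summary feature per aggregation, named after the primitive that produced it.
summaries = {
    f"{name}(time_since_previous)": func(gaps)
    for name, func in aggregations.items()
}

for feature_name, value in summaries.items():
    print(feature_name, value)
```

Adding another entry to the dictionary yields another feature without touching the rest of the code, which mirrors how adding a name to `agg_primitives` widens the feature list.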


## Aggregation vs. Transform Primitives
In the example above, we use two types of primitives.

**Aggregation primitives**: These primitives take related instances as input and output a single value. They are applied across a parent-child relationship in an entity set. Examples: `"count"`, `"sum"`, `"avg_time_between"`.

**Transform primitives**: These primitives take one or more variables from an entity as input and output a new variable for that entity. They are applied to a single entity. Examples: `"hour"`, `"time_since_previous"`, `"absolute"`.
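One way to see the difference is in the shape of the result: a transform returns one value per input row, while an aggregation returns one value per parent instance. The sketch below uses plain Python with a hypothetical list of transaction rows; the function names are illustrative and are not Featuretools primitives.

```python
from collections import defaultdict

# Hypothetical child rows: each transaction belongs to a parent session.
transactions = [
    {"session_id": 1, "amount": -5.0},
    {"session_id": 1, "amount": 20.0},
    {"session_id": 2, "amount": 7.5},
]


def absolute(values):
    """Transform-style: same number of outputs as inputs."""
    return [abs(v) for v in values]


def total(values):
    """Aggregation-style: many inputs, one output."""
    return sum(values)


# Transform: applied within the transactions table, row by row.
abs_amounts = absolute([row["amount"] for row in transactions])

# Aggregation: child rows are grouped by their parent, then reduced.
amounts_by_session = defaultdict(list)
for row in transactions:
    amounts_by_session[row["session_id"]].append(row["amount"])
sum_by_session = {sid: total(vals) for sid, vals in amounts_by_session.items()}

print(abs_amounts)
print(sum_by_session)
```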

For a DataFrame that lists and describes each built-in primitive in Featuretools, call `ft.list_primitives()`.

In [60]:
print(ft.list_primitives().shape)
ft.list_primitives().sample(5)

(62, 3)


| | name | type | description |
|---|---|---|---|
| 59 | second | transform | Transform a Datetime feature into the second. |
| 15 | time_since_last | aggregation | Time since last related instance. |
| 21 | latitude | transform | Returns the first value of the tuple base feat... |
| 11 | skew | aggregation | Computes the skewness of a data set. |
| 8 | mode | aggregation | Finds the most common element in a categorical... |


## Defining Custom Primitives
The library of primitives in Featuretools is constantly expanding. Users can define their own primitives using the APIs below. To define a primitive, a user will:

- Specify the type of primitive: `Aggregation` or `Transform`
- Define the input and output data types
- Write a function in Python to do the calculation
- Annotate with attributes to constrain how it is applied

Once a primitive is defined, it can stack with existing primitives to generate complex patterns. This enables primitives known to be important for one domain to be transferred automatically to another.

### Simple Custom Primitives

In [70]:
from featuretools.primitives import make_agg_primitive, make_trans_primitive
from featuretools.variable_types import Text, Numeric

def absolute(column):
    return abs(column)

Absolute = make_trans_primitive(function = absolute, 
                                input_types = [Numeric], 
                                return_type = Numeric)

Above, we used `make_trans_primitive` and a Python function we defined to create a new transform primitive that can be used with Deep Feature Synthesis. We also annotated the input data types the primitive can be applied to and the data type it returns.

Similarly, we can make a new aggregation primitive using `make_agg_primitive`.

In [72]:
def maximum(column):
    return max(column) 

Maximum = make_agg_primitive(function = maximum, 
                             input_types = [Numeric], 
                             return_type = Numeric)

Because we defined an aggregation primitive, the function takes in a list of values but only returns one.

Now that we’ve defined two primitives, we can use them with the dfs function as if they were built-in primitives.

In [77]:
feature_matrix, feature_defs = ft.dfs(entityset = es, 
                                      target_entity = 'sessions', 
                                      agg_primitives = [Maximum], 
                                      trans_primitives = [Absolute], 
                                      max_depth = 2)
feature_defs

[<Feature: customer_id>,
 <Feature: device>,
 <Feature: MAXIMUM(transactions.amount)>,
 <Feature: customers.zip_code>,
 <Feature: MAXIMUM(transactions.ABSOLUTE(amount))>,
 <Feature: ABSOLUTE(MAXIMUM(transactions.amount))>,
 <Feature: customers.MAXIMUM(transactions.amount)>]

In [74]:
feature_matrix[["customers.MAXIMUM(transactions.amount)", "MAXIMUM(transactions.ABSOLUTE(amount))"]].head(5)

| session_id | customers.MAXIMUM(transactions.amount) | MAXIMUM(transactions.ABSOLUTE(amount)) |
|---|---|---|
| 1 | 149.95 | 147.23 |
| 2 | 149.95 | 148.14 |
| 3 | 148.17 | 141.66 |
| 4 | 147.73 | 147.73 |
| 5 | 149.15 | 124.29 |
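Two of the generated features, `MAXIMUM(transactions.ABSOLUTE(amount))` and `ABSOLUTE(MAXIMUM(transactions.amount))`, stack the same primitives in opposite orders, and the order can change the result. A small plain-Python sketch, using hypothetical amounts that include a negative value so that the two orders differ:

```python
# Hypothetical transaction amounts for one session (not taken from the mock data).
amounts = [-150.0, 20.0, 99.5]

# MAXIMUM(ABSOLUTE(amount)): transform each row first, then aggregate.
max_of_abs = max(abs(a) for a in amounts)

# ABSOLUTE(MAXIMUM(amount)): aggregate first, then transform the single result.
abs_of_max = abs(max(amounts))

print(max_of_abs, abs_of_max)
```

When every input is non-negative the two features coincide, so whether both are worth keeping depends on the data.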


### Word Count Example
Here we define a function, `word_count`, which counts the number of words in each row of an input and returns a list of the counts.

In [101]:
def word_count(column):
    word_counts = []
    for value in column:
        words = value.split(None)
        word_counts.append(len(words))
    return word_counts
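As a quick sanity check, the function can be called directly on a list of strings before it is wrapped as a primitive. The sample strings below are hypothetical; the function body is repeated so the snippet runs on its own.

```python
def word_count(column):
    word_counts = []
    for value in column:
        # split(None) splits on any run of whitespace and drops empty strings
        words = value.split(None)
        word_counts.append(len(words))
    return word_counts


# Hypothetical text values, including an empty string.
sample_comments = ["great product", "would  buy again", ""]
counts = word_count(sample_comments)
print(counts)
```

Note that this version assumes every value is a string; a missing value such as `None` or `NaN` has no `split` method and would raise an `AttributeError`, so real text columns may need a guard for that case.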

In [115]:
WordCount = make_trans_primitive(function = word_count, 
                                 input_types = [Text], 
                                 return_type = Numeric)

In [112]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="sessions",
                                      agg_primitives=["sum", "mean", "std"],
                                      trans_primitives=[WordCount])

In [113]:
feature_defs

In [116]:
# These features do not exist in the mock customer dataset used here.
# See the original example: https://docs.featuretools.com/automated_feature_engineering/primitives.html#word-count-examplem

# feature_matrix[["customers.WORD_COUNT(favorite_quote)", 
#                 "STD(log.WORD_COUNT(comments))", 
#                 "SUM(log.WORD_COUNT(comments))", 
#                 "MEAN(log.WORD_COUNT(comments))"]]

### Multiple Input Types
If a primitive requires multiple features as input, `input_types` has multiple elements. For example, `[Numeric, Numeric]` means the primitive requires two Numeric features as input. Below is an example of a primitive that has multiple input features.

In [138]:
from featuretools.variable_types import Datetime, Timedelta, Variable
import pandas as pd

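def mean_sunday(numeric, datetime):
    '''
    Finds the mean of non-null values of a feature that occurred on Sundays
    '''
    # pandas numbers weekdays from Monday=0 through Sunday=6
    days = pd.DatetimeIndex(datetime).weekday.values
    df = pd.DataFrame({'numeric': numeric, 'time': days})
    # NOTE: a true Sunday filter would compare against 6. Every transaction in
    # this mock dataset falls on weekday 2 (Wednesday), so 2 is used here to
    # produce non-null output.
    return df[df['time'] == 2]['numeric'].mean()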

MeanSunday = make_agg_primitive(function=mean_sunday,
                                input_types=[Numeric, Datetime],
                                return_type=Numeric)


feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="sessions",
                                  agg_primitives=[MeanSunday],
                                  trans_primitives=[],
                                  max_depth=1)

In [139]:
features

[<Feature: customer_id>,
 <Feature: device>,
 <Feature: MEAN_SUNDAY(transactions.amount, transaction_time)>,
 <Feature: customers.zip_code>]

In [140]:
pd.DatetimeIndex(es['transactions'].df.transaction_time).weekday.unique()

Int64Index([2], dtype='int64', name='transaction_time')

In [141]:
feature_matrix

| session_id | customer_id | device | MEAN_SUNDAY(transactions.amount, transaction_time) | customers.zip_code |
|---|---|---|---|---|
| 1 | 1 | desktop | 77.84625 | 60091 |
| 2 | 1 | desktop | 89.533 | 60091 |
| 3 | 5 | mobile | 67.13 | 2139 |
| 4 | 3 | mobile | 82.1728 | 2139 |
| 5 | 2 | tablet | 65.031818 | 2139 |
| 6 | 1 | desktop | 70.699412 | 60091 |
| 7 | 2 | desktop | 71.148571 | 2139 |
| 8 | 2 | mobile | 63.326111 | 2139 |
| 9 | 1 | desktop | 83.244667 | 60091 |
| 10 | 2 | mobile | 66.718667 | 2139 |
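The weekday numbering matters for a primitive like this one. As a rough illustration, here is a plain-Python version of a two-input aggregation with the weekday made an explicit parameter; `mean_on_weekday` and the sample data are hypothetical and are not part of Featuretools. Python's `datetime.weekday()` uses the same convention as pandas: Monday is 0 and Sunday is 6.

```python
from datetime import datetime
from statistics import mean

SUNDAY = 6     # datetime.weekday(): Monday is 0, Sunday is 6
WEDNESDAY = 2


def mean_on_weekday(values, timestamps, weekday=SUNDAY):
    """Mean of the non-null values whose timestamp falls on the given weekday."""
    selected = [
        value
        for value, stamp in zip(values, timestamps)
        if value is not None and stamp.weekday() == weekday
    ]
    return mean(selected) if selected else None


# Hypothetical amounts: two on Sunday 2014-01-05, one on Wednesday 2014-01-01.
amounts = [10.0, 30.0, 50.0]
times = [
    datetime(2014, 1, 5, 9, 0),
    datetime(2014, 1, 5, 18, 0),
    datetime(2014, 1, 1, 12, 0),
]

sunday_mean = mean_on_weekday(amounts, times)
wednesday_mean = mean_on_weekday(amounts, times, weekday=WEDNESDAY)
print(sunday_mean, wednesday_mean)
```

Both inputs are passed positionally in the order given by `input_types`, which is why the numeric values come first and the datetimes second in this sketch as well.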


In [121]:
# See the original example: https://docs.featuretools.com/automated_feature_engineering/primitives.html#multiple-input-types

# feature_matrix[["MEAN_SUNDAY(log.value, datetime)", 
#                  "MEAN_SUNDAY(log.value_2, datetime)"]]