In [1]:
import featuretools as ft
import pandas as pd

# Why is DFS not creating aggregation features?
One common issue you might run into involves aggregation features. You may have created your entityset and applied DFS to create features, only to find that no aggregation features were created.
- This is most likely because your entityset contains a single entity, so DFS has nothing to aggregate over. You need at least 2 entities connected by a relationship. Featuretools looks for a relationship and aggregates based on it.

Let's look at a simple example.

In [2]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="transactions",
                              dataframe=transactions_df,
                              index="transaction_id")
es

Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 11]
  Relationships:
    No relationships

Notice how we only have 1 entity in our entityset. If we try to create aggregation features on this entityset, it will not be possible because aggregation features need at least 2 entities joined by a relationship.

In [3]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="transactions")
feature_defs

[<Feature: session_id>,
 <Feature: product_id>,
 <Feature: amount>,
 <Feature: customer_id>,
 <Feature: device>,
 <Feature: zip_code>,
 <Feature: DAY(transaction_time)>,
 <Feature: DAY(session_start)>,
 <Feature: DAY(join_date)>,
 <Feature: DAY(date_of_birth)>,
 <Feature: YEAR(transaction_time)>,
 <Feature: YEAR(session_start)>,
 <Feature: YEAR(join_date)>,
 <Feature: YEAR(date_of_birth)>,
 <Feature: MONTH(transaction_time)>,
 <Feature: MONTH(session_start)>,
 <Feature: MONTH(join_date)>,
 <Feature: MONTH(date_of_birth)>,
 <Feature: WEEKDAY(transaction_time)>,
 <Feature: WEEKDAY(session_start)>,
 <Feature: WEEKDAY(join_date)>,
 <Feature: WEEKDAY(date_of_birth)>]

None of the above features are aggregation features. To fix this issue, you can add another entity to your entityset.

There are a couple of ways to add an entity to your entityset:

**Solution #1 - You can add a new entity if you have additional data.**

In [4]:
products_df = data["products"]
es = es.entity_from_dataframe(entity_id="products",
                              dataframe=products_df,
                              index="product_id")
es

Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    No relationships

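Notice how we now have an additional entity in our entityset. The `Relationships` section is still empty, though. DFS only aggregates across entities that are linked, so `products` still needs a relationship to `transactions` (on `product_id`, for example via `add_relationship`) before it can contribute aggregation features.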

**Solution #2 - You can normalize an existing entity.**

In [5]:
es = es.normalize_entity(base_entity_id="transactions",
                         new_entity_id="sessions",
                         index="session_id",
                         make_time_index="session_start",
                         additional_variables=["device", "customer_id", "zip_code", "session_start", "join_date"])
es

Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 6]
  Relationships:
    transactions.session_id -> sessions.session_id

Notice how we have an additional entity, `sessions`, in our entityset, along with a relationship linking `transactions` to it.

In [6]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="transactions")
feature_defs[:-10]

[<Feature: session_id>,
 <Feature: product_id>,
 <Feature: amount>,
 <Feature: DAY(transaction_time)>,
 <Feature: DAY(date_of_birth)>,
 <Feature: YEAR(transaction_time)>,
 <Feature: YEAR(date_of_birth)>,
 <Feature: MONTH(transaction_time)>,
 <Feature: MONTH(date_of_birth)>,
 <Feature: WEEKDAY(transaction_time)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: sessions.device>,
 <Feature: sessions.customer_id>,
 <Feature: sessions.zip_code>,
 <Feature: sessions.SUM(transactions.amount)>,
 <Feature: sessions.STD(transactions.amount)>,
 <Feature: sessions.MAX(transactions.amount)>,
 <Feature: sessions.SKEW(transactions.amount)>,
 <Feature: sessions.MIN(transactions.amount)>,
 <Feature: sessions.MEAN(transactions.amount)>,
 <Feature: sessions.COUNT(transactions)>]

Now we have successfully created aggregation features, a few of which are:
- `<Feature: sessions.SUM(transactions.amount)>`
- `<Feature: sessions.STD(transactions.amount)>`
- `<Feature: sessions.MAX(transactions.amount)>`
- `<Feature: sessions.SKEW(transactions.amount)>`
- `<Feature: sessions.MIN(transactions.amount)>`
- `<Feature: sessions.MEAN(transactions.amount)>`
- `<Feature: sessions.COUNT(transactions)>`
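
To make concrete what a feature such as `sessions.SUM(transactions.amount)` represents, here is a minimal sketch in plain Python. It is not how Featuretools computes features internally, and the three rows are made-up illustration data rather than the mock customer dataset. It only shows why a parent/child relationship is required: without a key to group on, there is nothing to aggregate over.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child rows: each transaction points at its parent session.
transactions = [
    {"transaction_id": 1, "session_id": 1, "amount": 10.0},
    {"transaction_id": 2, "session_id": 1, "amount": 30.0},
    {"transaction_id": 3, "session_id": 2, "amount": 5.0},
]

# Group child amounts by the parent key (the relationship column).
amounts_by_session = defaultdict(list)
for row in transactions:
    amounts_by_session[row["session_id"]].append(row["amount"])

# One aggregated row per parent, analogous to SUM / MEAN / COUNT features.
session_features = {
    session_id: {
        "SUM(transactions.amount)": sum(amounts),
        "MEAN(transactions.amount)": mean(amounts),
        "COUNT(transactions)": len(amounts),
    }
    for session_id, amounts in amounts_by_session.items()
}
```

With a single entity there is no `session_id`-style parent key that belongs to another entity, which is why DFS falls back to transform features only.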

## Why am I getting this error `AssertionError: Index is not unique on dataframe`?
One error you might run into involves the index of your entity. You may be trying to create an entity and hitting this error.
- This is because each entity in your entityset needs a unique index.

Let's look at a simple example.

In [7]:
product_df = pd.DataFrame({'id': [1, 2, 3, 4, 4],
                           'rating': [3.5, 4.0, 4.5, 1.5, 5.0]})
product_df

|   | id | rating |
|---|----|--------|
| 0 | 1 | 3.5 |
| 1 | 2 | 4.0 |
| 2 | 3 | 4.5 |
| 3 | 4 | 1.5 |
| 4 | 4 | 5.0 |


Notice how the `id` column contains the value `4` twice. If you try to create an entity with this dataframe and use `id` as the index, you will run into an error.

In [8]:
es = ft.EntitySet(id="product_data")
es = es.entity_from_dataframe(entity_id="products",
                              dataframe=product_df,
                              index="id")

AssertionError: Index is not unique on dataframe (Entity products)
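
Before choosing a fix, it can help to see which values are actually duplicated. The following is a small sketch in plain Python over the same `id` values as above; the variable names are illustrative only. With pandas, a similar check is available through `Series.is_unique` and `Series.duplicated()`.

```python
from collections import Counter

# Same values as the `id` column in the example dataframe.
ids = [1, 2, 3, 4, 4]

# Count occurrences and keep every value that appears more than once.
counts = Counter(ids)
duplicates = sorted(value for value, n in counts.items() if n > 1)

# A column can serve as an index only if no value repeats.
is_unique = not duplicates

print(duplicates)  # [4]
print(is_unique)   # False
```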

To fix this issue, you can do a couple of things:

**Solution #1 - You can create a unique index on your dataframe.**

In [9]:
product_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                           'rating': [3.5, 4.0, 4.5, 1.5, 5.0]})
product_df

|   | id | rating |
|---|----|--------|
| 0 | 1 | 3.5 |
| 1 | 2 | 4.0 |
| 2 | 3 | 4.5 |
| 3 | 4 | 1.5 |
| 4 | 5 | 5.0 |


In [10]:
es = es.entity_from_dataframe(entity_id="products",
                              dataframe=product_df,
                              index="id")

**Solution #2 - Set `make_index` to True in your call to `entity_from_dataframe` to create a new index on that data.**
- `make_index` creates a unique index for each row from the row's position among all the rows (0, 1, 2, ...). The name passed as `index` must not already be a column in the dataframe.

In [11]:
product_df = pd.DataFrame({'id': [1, 2, 3, 4, 4],
                           'rating': [3.5, 4.0, 4.5, 1.5, 5.0]})

es = ft.EntitySet(id="product_data")
es = es.entity_from_dataframe(entity_id="products",
                              dataframe=product_df,
                              index="product_id",
                              make_index=True)
es['products'].df

|   | product_id | id | rating |
|---|------------|----|--------|
| 0 | 0 | 1 | 3.5 |
| 1 | 1 | 2 | 4.0 |
| 2 | 2 | 3 | 4.5 |
| 3 | 3 | 4 | 1.5 |
| 4 | 4 | 4 | 5.0 |
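
The generated `product_id` column in the output above is simply the row position. As a rough illustration of that idea only (not the Featuretools implementation), the same result can be sketched in plain Python with `enumerate`:

```python
# Same rows as the example dataframe, including the duplicated id 4.
rows = [
    {"id": 1, "rating": 3.5},
    {"id": 2, "rating": 4.0},
    {"id": 3, "rating": 4.5},
    {"id": 4, "rating": 1.5},
    {"id": 4, "rating": 5.0},
]

# Prepend a positional index; the original (non-unique) id column is kept.
indexed_rows = [
    {"product_id": position, **row} for position, row in enumerate(rows)
]
```

Because the new index comes from position rather than from the data, it is unique even though `id` is not.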


### What is the difference between `copy_variables` and `additional_variables`?
One function you may use to create an entity is `normalize_entity`. This function creates a new entity and a relationship from the unique values of a variable in an existing entity. It takes two similar but distinct arguments, `copy_variables` and `additional_variables`, and it is easy to confuse them:

- `additional_variables` will remove variables from the base entity, and move them to the new entity. 
- `copy_variables` will keep the variables in the base entity, and copy them to the new entity.

In [12]:
data = pd.DataFrame({'product_id': [1, 2, 3, 4, 5],
                     'os': ['android', 'ios', 'android', 'ios', 'windows'],
                     'storage': [64, 32, 64, 32, 16],
                     'price': [900, 1000, 800, 900, 1000], 
                     'rating': [3.5, 4.0, 4.5, 1.5, 5.0]})

es = ft.EntitySet(id="product_data")
es = es.entity_from_dataframe(entity_id="products",
                              dataframe=data,
                              index="product_id")

Before we normalize to create a new entity, let's look at the base entity.

In [13]:
es['products'].df.head()

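|   | product_id | os | storage | price | rating |
|---|------------|----|---------|-------|--------|
| 1 | 1 | android | 64 | 900 | 3.5 |
| 2 | 2 | ios | 32 | 1000 | 4.0 |
| 3 | 3 | android | 64 | 800 | 4.5 |
| 4 | 4 | ios | 32 | 900 | 1.5 |
| 5 | 5 | windows | 16 | 1000 | 5.0 |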


Notice the `storage` and `price` columns.

In [14]:
es = es.normalize_entity(base_entity_id="products",
                         new_entity_id="device",
                         index="os",
                         additional_variables=["storage"],
                         copy_variables=["price"])

We normalized the columns to create a new entity. 
- For `additional_variables`, `storage` will be removed from the `products` entity, and moved to the new `device` entity. 
- For `copy_variables`, `price` will be copied from the `products` entity to the new `device` entity. 

Let's see this in the actual Entityset.

In [15]:
es['products'].df.head()

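|   | product_id | os | price | rating |
|---|------------|----|-------|--------|
| 1 | 1 | android | 900 | 3.5 |
| 2 | 2 | ios | 1000 | 4.0 |
| 3 | 3 | android | 800 | 4.5 |
| 4 | 4 | ios | 900 | 1.5 |
| 5 | 5 | windows | 1000 | 5.0 |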


Notice above how `price` is still in the `products` entity, while `storage` is not: it has been moved to the `device` entity, as seen below.

In [16]:
es['device'].df.head()

|   | os | storage | price |
|---|----|---------|-------|
| android | android | 64 | 900 |
| ios | ios | 32 | 1000 |
| windows | windows | 16 | 1000 |
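
The move-versus-copy behaviour can also be mimicked in a few lines of plain Python. This is an illustrative stand-in under stated assumptions, not the Featuretools implementation: the `normalize` helper is hypothetical, and it assumes the new entity keeps the first row seen for each unique index value, which matches the `device` output above.

```python
def normalize(rows, index, additional_variables=(), copy_variables=()):
    """Hypothetical helper mirroring the move/copy semantics described above."""
    moved = set(additional_variables)
    carried = [index, *additional_variables, *copy_variables]

    # New entity: one row per unique index value (first occurrence wins).
    new_entity = {}
    for row in rows:
        new_entity.setdefault(row[index], {name: row[name] for name in carried})

    # Base entity: drop only the moved columns; copied columns stay put.
    base_entity = [
        {name: value for name, value in row.items() if name not in moved}
        for row in rows
    ]
    return base_entity, list(new_entity.values())


# Same data as the `products` example.
products = [
    {"product_id": 1, "os": "android", "storage": 64, "price": 900, "rating": 3.5},
    {"product_id": 2, "os": "ios", "storage": 32, "price": 1000, "rating": 4.0},
    {"product_id": 3, "os": "android", "storage": 64, "price": 800, "rating": 4.5},
    {"product_id": 4, "os": "ios", "storage": 32, "price": 900, "rating": 1.5},
    {"product_id": 5, "os": "windows", "storage": 16, "price": 1000, "rating": 5.0},
]

base, device = normalize(
    products,
    index="os",
    additional_variables=["storage"],
    copy_variables=["price"],
)
```

Here `base` no longer has a `storage` key but still has `price`, while `device` holds `os`, `storage`, and `price` for each of the three operating systems.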


### Why am I getting the error `LookupError: Time index not found in dataframe` when using the `normalize_entity` function?

