## Feature Engineering

**Your Python Jupyter notebook should be configured with more than 8 GB of memory.**

In this series of tutorials, we will build a recommender system for fashion items. It consists of two models: a *retrieval model* and a *ranking model*. The retrieval model quickly narrows a large collection of items down to a small set of candidates. This speed comes at the cost of granularity, which is why we also train a ranking model: it only has to score the candidates, so it can afford to use more features than the retrieval model.
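
To make the two-stage idea concrete, here is a minimal sketch using random NumPy vectors. The embeddings, the `retrieve` and `rank` helpers, and the `extra_features` array are all hypothetical stand-ins for illustration; they are not the models built later in this series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1,000 items and one user, each a 16-dimensional embedding.
item_embeddings = rng.normal(size=(1000, 16))
user_embedding = rng.normal(size=16)

# Stand-in for the richer features that only the ranking stage can afford to use.
extra_features = rng.normal(size=1000)


def retrieve(user_vec, item_vecs, k):
    """Cheap first stage: score every item with a dot product and keep the top k."""
    scores = item_vecs @ user_vec
    # argpartition selects the k best items without fully sorting all scores.
    return np.argpartition(-scores, k - 1)[:k]


def rank(user_vec, item_vecs, candidate_ids, extra):
    """Second stage: re-score only the candidates, using the additional features."""
    scores = item_vecs[candidate_ids] @ user_vec + extra[candidate_ids]
    return candidate_ids[np.argsort(-scores)]


candidates = retrieve(user_embedding, item_embeddings, k=50)
ranked = rank(user_embedding, item_embeddings, candidates, extra_features)
top_10 = ranked[:10]
```

The retrieval step touches all 1,000 items but does very little work per item, while the ranking step does more work per item but only for the 50 candidates.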

### Data

We will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.

<!-- https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data

For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customer who did not make any purchase during that time are excluded from the scoring. -->

The full dataset contains images of all products, but here we will simply use the tabular data. We have three data sources:
- `articles.csv`: info about fashion items.
- `customers.csv`: info about users.
- `transactions_train.csv`: info about transactions.

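You can use the *hopsworks* library to download these files locally, assuming that they are stored in the `Resources` directory of your Hopsworks project. Note that the code below reads `articles` and `transactions_train` from Parquet copies (`articles.parquet` and `transactions_train.parquet`) rather than from the original CSV files.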

In [1]:
import hopsworks

connection = hopsworks.connection()
project = connection.get_project()
dataset_api = project.get_dataset_api()

for file in ["articles.parquet", "customers.csv", "transactions_train.parquet"]:
   dataset_api.download(f"Resources/{file}", overwrite=True)

Connected. Call `.close()` to terminate connection gracefully.


Downloading: 0.000%|          | 0/6445685 elapsed<00:00 remaining<?

Downloading: 0.000%|          | 0/207135859 elapsed<00:00 remaining<?

Downloading: 0.000%|          | 0/778112758 elapsed<00:00 remaining<?

In [2]:
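import pandas as pd

# The files were downloaded locally above, so no path prefix is needed.
# To read from the project's file system instead, uncomment the two lines below.
# from hops import hdfs
# path = hdfs.project_path() + "/Resources/"
path = ""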

In [3]:
articles_df = pd.read_parquet(path + "articles.parquet")
articles_df["article_id"] = articles_df["article_id"].astype(str)
articles_df.head(3)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.


In [4]:
customers_df = pd.read_csv(path + "customers.csv")
customers_df.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [5]:
trans_df = pd.read_parquet(path + "transactions_train.parquet")
trans_df["article_id"] = trans_df["article_id"].astype(str)
trans_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [6]:
# Reduce the size of the data frame by keeping only the last 106,012 transactions.
trans_df = trans_df.drop(trans_df.index[:-106012])

In [7]:
trans_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106012 entries, 31682312 to 31788323
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   t_dat             106012 non-null  object 
 1   customer_id       106012 non-null  object 
 2   article_id        106012 non-null  object 
 3   price             106012 non-null  float64
 4   sales_channel_id  106012 non-null  int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 4.9+ MB


In [8]:
# Parse the whole column at once instead of row by row.
trans_df["t_dat"] = pd.to_datetime(trans_df["t_dat"])
trans_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
31682312,2020-09-19,bcd1866baeb9b5db248ef4cd98b42f7d19088cc44e7f5c...,506098006,0.030492,2
31682313,2020-09-19,bcd5ce079c7e2c7acf725eae469f132e45f423a14908da...,865929003,0.016932,1
31682314,2020-09-19,bce6c908d439d9593e6200a595eaa2cc7ffa431c666292...,893053001,0.044051,2
31682315,2020-09-19,bce6c908d439d9593e6200a595eaa2cc7ffa431c666292...,806916002,0.050831,2
31682316,2020-09-19,bce6c908d439d9593e6200a595eaa2cc7ffa431c666292...,831460001,0.013542,2


In [9]:
trans_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106012 entries, 31682312 to 31788323
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   t_dat             106012 non-null  datetime64[ns]
 1   customer_id       106012 non-null  object        
 2   article_id        106012 non-null  object        
 3   price             106012 non-null  float64       
 4   sales_channel_id  106012 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 4.9+ MB


In [10]:
print(f"There are {len(trans_df):,} transactions in total.")

There are 106,012 transactions in total.


This is more data than we need for the tutorial, so we will work with a smaller subset, which we generate by sampling 25,000 customers and keeping only their transactions.

In [11]:
N_USERS = 25_000

# Consider only customers with age defined.
customers_df.dropna(inplace=True, subset=["age"])
customer_subset_df = customers_df.sample(N_USERS, random_state=27)
trans_df = trans_df.merge(customer_subset_df["customer_id"])

print(f"Subset has {len(trans_df):,} transactions in total.")

Subset has 2,099 transactions in total.
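
The sampling step relies on `merge` defaulting to an inner join: only transactions whose `customer_id` appears in the sampled customers survive. The sketch below illustrates this on a hypothetical toy frame (the values are made up and not taken from the H&M data).

```python
import pandas as pd

# Hypothetical toy data for illustration only.
customers = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "age": [31.0, None, 45.0, 22.0],
})
transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "d", "d"],
    "article_id": ["1", "2", "1", "3", "2", "3"],
})

# Keep customers with a known age, then draw a reproducible sample.
known_age = customers.dropna(subset=["age"])
sampled = known_age.sample(2, random_state=27)

# An inner join on customer_id acts as a filter on the transactions.
subset = transactions.merge(sampled[["customer_id"]], on="customer_id")
```

Because each customer appears only once in `sampled`, the join cannot duplicate transactions; it can only drop them.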


### Feature Engineering

Next, we do some feature engineering.

The time of year at which a purchase was made should be a strong predictor, since seasonality is a major factor in fashion purchases. Here, we use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), we map each month to a point on the unit circle using sine and cosine.

In [12]:
import numpy as np

# TODO - this is a transformation. We are applying it before we write to the FG.
# We should instead apply it as a transformation fn to the feature-view

# Map month to range [0,11].
month = trans_df["t_dat"].dt.month - 1
C = 2*np.pi/12

# Map month to the unit circle.
trans_df["month_sin"] = np.sin(month*C)
trans_df["month_cos"] = np.cos(month*C)
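
As a quick sanity check of the cyclical encoding, the following sketch measures the Euclidean distance between encoded months. The helper names `encode_month` and `distance` are made up for this illustration.

```python
import numpy as np


def encode_month(month):
    """Map a calendar month (1-12) to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * (month - 1) / 12
    return np.array([np.sin(angle), np.cos(angle)])


def distance(month_a, month_b):
    """Euclidean distance between two encoded months."""
    return float(np.linalg.norm(encode_month(month_a) - encode_month(month_b)))


jan_to_feb = distance(1, 2)
jan_to_dec = distance(1, 12)
jan_to_jul = distance(1, 7)

print(f"{jan_to_feb:.3f} {jan_to_dec:.3f} {jan_to_jul:.3f}")
```

January is equally far from February and December (about 0.518), while January and July sit on opposite sides of the circle (distance 2.0). A plain integer encoding would instead place January and December 11 units apart.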

We'll also remove columns with null values.

In [13]:
customers_df.dropna(axis=1, inplace=True)
articles_df.dropna(axis=1, inplace=True)
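
Note the difference between dropping rows and dropping columns. The sketch below uses a hypothetical three-row frame to show both, and why the order of the two steps matters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror customers_df, values are made up.
df = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "FN": [np.nan, 1.0, np.nan],
    "age": [49.0, 25.0, np.nan],
})

# Default axis=0 with a subset: drop rows where age is missing.
rows_with_age = df.dropna(subset=["age"])

# axis=1: drop every column that contains at least one missing value.
complete_columns = df.dropna(axis=1)

# Dropping rows first means age has no gaps left and survives the column drop.
rows_then_columns = rows_with_age.dropna(axis=1)
```

Here `complete_columns` keeps only `customer_id`, whereas `rows_then_columns` keeps both `customer_id` and `age`. This mirrors the notebook, where rows without an age were removed from `customers_df` before the column-wise drop.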

Finally, we convert the `t_dat` datetime values to Unix epoch milliseconds.

In [14]:
trans_df.t_dat = trans_df.t_dat.values.astype(np.int64) // 10 ** 6
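
The integer division by `10 ** 6` works because the values are stored as nanoseconds since the Unix epoch. The sketch below makes that assumption explicit by casting to `datetime64[ns]` first, and then checks the result with a round trip. The two dates are just sample values.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-09-20", "2020-09-19"]))

# Cast to nanosecond resolution explicitly, then view the values as integers.
epoch_ns = dates.to_numpy().astype("datetime64[ns]").astype(np.int64)

# 1 millisecond = 10**6 nanoseconds.
epoch_ms = epoch_ns // 10**6

# Round trip: interpret the integers as milliseconds since the epoch.
restored = pd.to_datetime(epoch_ms, unit="ms")
```

If the column were stored at a different resolution, dividing by `10 ** 6` without the explicit cast would silently produce the wrong unit, which is why the cast is worth spelling out.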

### Feature Groups

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.

Before we can create a feature group we need to connect to our feature store.

In [15]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


To create a feature group we need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group.
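
Before choosing a primary key, it can be useful to check whether the key columns actually identify rows uniquely in your data frame. The following is a small pandas sketch; `count_duplicate_keys` is a hypothetical helper and not part of the Hopsworks API, and the toy values are made up.

```python
import pandas as pd


def count_duplicate_keys(df, primary_key):
    """Count rows whose primary-key values already appeared in an earlier row."""
    return int(df.duplicated(subset=primary_key).sum())


# Hypothetical toy frame: customer "a" buys article "1" on two different days.
transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": ["1", "1", "2"],
    "t_dat": ["2020-09-19", "2020-09-20", "2020-09-19"],
})

dupes = count_duplicate_keys(transactions, ["customer_id", "article_id"])
dupes_with_time = count_duplicate_keys(
    transactions, ["customer_id", "article_id", "t_dat"]
)
```

In this toy example the pair (`customer_id`, `article_id`) repeats once, while adding `t_dat` makes every row distinct.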

In [16]:
customers_fg = fs.create_feature_group(
    name="customers",
    description="Customer data.",
    primary_key=["customer_id"],
    online_enabled=True
)

Here we have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, we have only specified some metadata for the feature group. It does not yet store any data, nor does it have a schema defined for the data. To make the feature group persistent, we populate it with its associated data using the `insert` function.

In [17]:
customers_fg.insert(customers_df)

Feature Group created successfully, explore it at 
https://2176a0f0-3503-11ed-be64-b1a4781e5f0a.cloud.hopsworks.ai/p/135/fs/83/fg/14


Uploading Dataframe: 0.00% |          | Rows 0/1356119 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://2176a0f0-3503-11ed-be64-b1a4781e5f0a.cloud.hopsworks.ai/p/135/jobs/named/customers_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f2eb67f9c40>, None)

Let's do the same thing for the rest of the data frames.

In [18]:
articles_fg = fs.create_feature_group(
    name="articles",
    description="Fashion item data.",
    primary_key=["article_id"],
    online_enabled=True
)
articles_fg.insert(articles_df)



Feature Group created successfully, explore it at 
https://2176a0f0-3503-11ed-be64-b1a4781e5f0a.cloud.hopsworks.ai/p/135/fs/83/fg/15


Uploading Dataframe: 0.00% |          | Rows 0/105542 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://2176a0f0-3503-11ed-be64-b1a4781e5f0a.cloud.hopsworks.ai/p/135/jobs/named/articles_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f31432af520>, None)

In [19]:
trans_fg = fs.create_feature_group(
    name="transactions",
    version=1,
    description="Transaction data.",
    primary_key=["customer_id", "article_id"], 
    online_enabled=True,
    event_time="t_dat"
)
trans_fg.insert(trans_df)

Feature Group created successfully, explore it at 
https://2176a0f0-3503-11ed-be64-b1a4781e5f0a.cloud.hopsworks.ai/p/135/fs/83/fg/18


Uploading Dataframe: 0.00% |          | Rows 0/2099 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://2176a0f0-3503-11ed-be64-b1a4781e5f0a.cloud.hopsworks.ai/p/135/jobs/named/transactions_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f2eb30c6610>, None)

You should now be able to inspect the feature groups in the Hopsworks UI.

### Next Steps

In the next notebook we'll create a dataset that we can train a retrieval model on.