##### Copyright 2020 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Taking advantage of context features

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/recommenders/examples/context_features"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/recommenders/blob/main/docs/examples/context_features.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/recommenders/blob/main/docs/examples/context_features.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/recommenders/docs/examples/context_features.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

In [the featurization tutorial](featurization) we incorporated multiple features beyond just user and movie identifiers into our models, but we haven't explored whether those features improve model accuracy.

Many factors affect whether features beyond ids are useful in a recommender model:

1. __Importance of context__: if user preferences are relatively stable across contexts and time, context features may not provide much benefit. If, however, users' preferences are highly contextual, adding context can improve the model significantly. For example, day of the week may be an important feature when deciding whether to recommend a short clip or a movie: users may only have time to watch short content during the week, but can relax and enjoy a full-length movie during the weekend. Similarly, query timestamps may play an important role in modelling popularity dynamics: one movie may be highly popular around the time of its release, but its popularity may decay quickly afterwards. Conversely, other movies may be evergreens that are happily watched time and time again.
2. __Data sparsity__: using non-id features may be critical if data is sparse. With few observations available for a given user or item, the model may struggle with estimating a good per-user or per-item representation. To build an accurate model, other features such as item categories, descriptions, and images have to be used to help the model generalize beyond the training data. This is especially relevant in [cold-start](https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)) situations, where relatively little data is available on some items or users.
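As a rough illustration of the kind of context feature mentioned above, the sketch below derives day of week and a weekend flag from a Unix timestamp. The helper name `time_context` and the choice of UTC are assumptions made for this example; they are not part of the tutorial's code.

```python
from datetime import datetime, timezone


def time_context(unix_seconds: int) -> dict:
    """Derive simple context features from a Unix timestamp (interpreted as UTC)."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    day_of_week = moment.weekday()  # Monday is 0, Sunday is 6
    return {
        "day_of_week": day_of_week,
        "is_weekend": day_of_week >= 5,
        "hour_of_day": moment.hour,
    }


# 1970-01-01 00:00 UTC was a Thursday; two days later is a Saturday.
thursday = time_context(0)
saturday = time_context(2 * 24 * 60 * 60)
print(thursday)
print(saturday)
```

Features like these could then be embedded in the same way as the other categorical features used later in this notebook.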

In this tutorial, we'll experiment with using features beyond movie titles and user ids in our MovieLens model.

## Preliminaries

We first import the necessary packages.

In [18]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets


In [19]:
import os
import tempfile

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

We follow [the featurization tutorial](featurization) and keep the user id, timestamp, and movie title features.

In [157]:
ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "timestamp": x["timestamp"],
})
movies = movies.map(lambda x: x["movie_title"])


In [2]:
import pandas as pd
masterdf = pd.read_csv('s3a://hluan/hm/sampled_10_users_transactions.csv', dtype={"article_id": "str"})
masterdf.head(3)

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,sample_prob,train_test
0,2018-09-20,002b3c0a44a22c45a8d62ea9d2b88d1a89e335f8b84003...,673531001,0.008458,2,0.05903,0
1,2018-09-20,002b3c0a44a22c45a8d62ea9d2b88d1a89e335f8b84003...,464277014,0.022017,2,0.05903,0
2,2018-09-20,002b3c0a44a22c45a8d62ea9d2b88d1a89e335f8b84003...,464277014,0.022017,2,0.05903,0


In [3]:
from datetime import datetime
masterdf['t_dat'] = pd.to_datetime(masterdf['t_dat'])

In [4]:
masterdf['t_dat'] = datetime(2020, 9, 22) - masterdf['t_dat']
masterdf['t_dat'] = (masterdf['t_dat'].dt.days / 7).astype('int')
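The cell above replaces each purchase date with the number of whole weeks between it and a reference date. A minimal standard-library sketch of the same arithmetic is shown below; the function name `weeks_before_reference` is hypothetical and used only for illustration.

```python
from datetime import date

REFERENCE_DATE = date(2020, 9, 22)  # same reference date as in the cell above


def weeks_before_reference(purchase_date: date) -> int:
    """Whole weeks elapsed between purchase_date and the reference date."""
    return (REFERENCE_DATE - purchase_date).days // 7


oldest = weeks_before_reference(date(2018, 9, 20))   # 733 days earlier
recent = weeks_before_reference(date(2020, 9, 16))   # 6 days earlier
one_week = weeks_before_reference(date(2020, 9, 15)) # 7 days earlier
print(oldest, recent, one_week)
```

For dates on or before the reference date, floor division gives the same result as the truncating `astype('int')` used above; larger values mean older purchases.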

In [5]:
master_df_part = masterdf[['t_dat', 'customer_id', 'article_id', 'price']]

In [6]:
master_df_part.head()

Unnamed: 0,t_dat,customer_id,article_id,price
0,104,002b3c0a44a22c45a8d62ea9d2b88d1a89e335f8b84003...,673531001,0.008458
1,104,002b3c0a44a22c45a8d62ea9d2b88d1a89e335f8b84003...,464277014,0.022017
2,104,002b3c0a44a22c45a8d62ea9d2b88d1a89e335f8b84003...,464277014,0.022017
3,104,005c9fb2ba6c49b2098a662f64a9124ef95cbec5fcf4eb...,625939005,0.003373
4,104,005c9fb2ba6c49b2098a662f64a9124ef95cbec5fcf4eb...,508184020,0.024136


In [7]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that plain column assignment on a slice of `masterdf` would trigger.
master_df_part = master_df_part.assign(rating=1)


In [8]:
# Build the interactions table: one row per (customer, article, week).
# Repeated purchases within the same week are summed, so 'rating' becomes
# the quantity purchased and 'price' the total amount paid.
interactions_dict = master_df_part.groupby(
    ['customer_id', 'article_id', 't_dat'])[['rating', 'price']].sum().reset_index()
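To make the effect of the group-by concrete, here is a small pure-Python sketch of the same aggregation. The helper `aggregate_purchases` and the customer label `"c1"` are made up for this example; the article ids and prices are taken from the rows displayed earlier.

```python
from collections import defaultdict


def aggregate_purchases(rows):
    """Sum 'rating' and 'price' per (customer_id, article_id, t_dat) key."""
    totals = defaultdict(lambda: {"rating": 0, "price": 0.0})
    for row in rows:
        key = (row["customer_id"], row["article_id"], row["t_dat"])
        totals[key]["rating"] += row["rating"]
        totals[key]["price"] += row["price"]
    return dict(totals)


rows = [
    {"customer_id": "c1", "article_id": "464277014", "t_dat": 104, "rating": 1, "price": 0.022017},
    {"customer_id": "c1", "article_id": "464277014", "t_dat": 104, "rating": 1, "price": 0.022017},
    {"customer_id": "c1", "article_id": "673531001", "t_dat": 104, "rating": 1, "price": 0.008458},
]
purchases = aggregate_purchases(rows)
for key, value in purchases.items():
    print(key, value)
```

The duplicated purchase collapses into a single row whose `rating` is 2, which is why `rating` can be read as a purchase count after this step.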



In [9]:
interactions_dict

Unnamed: 0,customer_id,article_id,t_dat,rating,price
0,00018385675844f7a6babbed41b5655b5727fb16483b6e...,0535455002,99,1,0.020322
1,00018385675844f7a6babbed41b5655b5727fb16483b6e...,0616849012,99,1,0.008458
2,00018385675844f7a6babbed41b5655b5727fb16483b6e...,0621020001,103,1,0.033881
3,00018385675844f7a6babbed41b5655b5727fb16483b6e...,0626813002,99,1,0.008458
4,00018385675844f7a6babbed41b5655b5727fb16483b6e...,0626813004,99,1,0.008458
...,...,...,...,...,...
2841724,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0832321002,13,1,0.016932
2841725,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0832520001,26,1,0.025407
2841726,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0835008005,26,1,0.045746
2841727,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0840567001,15,1,0.030492


In [10]:
interactions_dict = interactions_dict[interactions_dict['t_dat'] <= 5]

In [11]:
input_data_path = ""
customers = pd.read_csv(f'{input_data_path}customers.csv', dtype={"article_id": "str"})
articles = pd.read_csv(f'{input_data_path}articles.csv', dtype={"article_id": "str"})

In [12]:
articles['product_group_code'] = articles['product_group_name'].astype('category').cat.codes
articles['index_code_id'] = articles['index_code'].astype('category').cat.codes

In [122]:
articles['color-code'] = articles['colour_group_code'].astype('str') + articles['perceived_colour_value_id'].astype('str') \
    + articles['perceived_colour_master_id'].astype('str')

In [12]:
articles['color-code']

0          945
1         1039
2         1119
3          945
4         1039
          ... 
105537     945
105538     945
105539     945
105540     945
105541    1119
Name: color-code, Length: 105542, dtype: object

In [13]:
articles.columns

Index(['article_id', 'product_code', 'prod_name', 'product_type_no',
       'product_type_name', 'product_group_name', 'graphical_appearance_no',
       'graphical_appearance_name', 'colour_group_code', 'colour_group_name',
       'perceived_colour_value_id', 'perceived_colour_value_name',
       'perceived_colour_master_id', 'perceived_colour_master_name',
       'department_no', 'department_name', 'index_code', 'index_name',
       'index_group_no', 'index_group_name', 'section_no', 'section_name',
       'garment_group_no', 'garment_group_name', 'detail_desc',
       'product_group_code', 'index_code_id', 'color-code'],
      dtype='object')

In [13]:
articles_processed = articles[['article_id', 
       'prod_name', 
       'product_type_name', 'product_group_name', 
       'graphical_appearance_name', 'colour_group_name',
       'perceived_colour_value_name',
       'perceived_colour_master_name',
       'department_name', 'index_name',
       'index_group_name', 'section_name',
       'garment_group_name']]

In [14]:
bins = [0, 35, 45, 55, 120]
labels = [1,2,3,4]
customers['age_group'] = pd.cut(customers['age'], bins=bins, labels=labels)
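`pd.cut` uses right-closed intervals by default, so the bins above are (0, 35], (35, 45], (45, 55] and (55, 120], and values outside them become missing. The standard-library sketch below mimics that behaviour; the function `age_group` is a hypothetical stand-in, not a pandas API.

```python
from bisect import bisect_left

BINS = [0, 35, 45, 55, 120]  # same edges as in the cell above
LABELS = [1, 2, 3, 4]


def age_group(age):
    """Return the label of the right-closed bin containing age, or None if outside all bins."""
    if age is None:
        return None
    index = bisect_left(BINS, age)
    if 1 <= index <= len(LABELS):
        return LABELS[index - 1]
    return None


groups = [age_group(a) for a in (0, 20, 35, 36, 55, 120, 121, None)]
print(groups)
```

Ages that fall outside the bins (or are missing) map to `None` here, which corresponds to the missing `age_group` values that are filled in a later cell.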

In [15]:
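# Parentheses replace the backslash continuations, including the dangling
# one that previously ended the last line.
interactions_dict = (
    interactions_dict[['t_dat', 'customer_id', 'article_id', 'rating', 'price']]
    .merge(articles_processed, on='article_id', how='left')
    .merge(customers[['customer_id', 'age_group']], on='customer_id', how='left')
)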

In [23]:
# interactions_dict['section_no'] = interactions_dict['section_no'].astype('str')

In [150]:
interactions_dict.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2841729 entries, 0 to 2841728
Data columns (total 18 columns):
 #   Column                        Dtype   
---  ------                        -----   
 0   t_dat                         int64   
 1   customer_id                   object  
 2   article_id                    object  
 3   rating                        int64   
 4   price                         float64 
 5   prod_name                     object  
 6   product_type_name             object  
 7   product_group_name            object  
 8   graphical_appearance_name     object  
 9   colour_group_name             object  
 10  perceived_colour_value_name   object  
 11  perceived_colour_master_name  object  
 12  department_name               object  
 13  index_name                    object  
 14  index_group_name              object  
 15  section_name                  object  
 16  garment_group_name            object  
 17  age_group                     category
dtypes:

In [129]:
articles_processed.isnull().sum()

article_id                      0
prod_name                       0
product_type_name               0
product_group_name              0
graphical_appearance_name       0
colour_group_name               0
perceived_colour_value_name     0
perceived_colour_master_name    0
department_name                 0
index_name                      0
index_group_name                0
section_name                    0
garment_group_name              0
dtype: int64

In [16]:
interactions_dict['age_group'] = interactions_dict['age_group'].fillna(1)
# interactions_dict['detail_desc'] = interactions_dict['detail_desc'].fillna("")


In [20]:
interactions_dict = {name: np.array(value) for name, value in interactions_dict.items()}
ratings = tf.data.Dataset.from_tensor_slices(interactions_dict)

# movies_dict = articles_processed.map(lambda x: x["article_id"])
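`tf.data.Dataset.from_tensor_slices` applied to a dictionary of equal-length columns yields one dictionary per row. The pure-Python generator below sketches that column-to-row slicing; `slice_columns` is a hypothetical helper for illustration and does not reproduce TensorFlow's tensor conversion.

```python
def slice_columns(columns):
    """Yield one {name: value} dict per row from a dict of equal-length columns."""
    names = list(columns)
    length = len(columns[names[0]])
    if any(len(columns[name]) != length for name in names):
        raise ValueError("all columns must have the same length")
    for i in range(length):
        yield {name: columns[name][i] for name in names}


columns = {
    "customer_id": ["c1", "c2"],   # made-up ids for the example
    "t_dat": [3, 5],
    "price": [0.0085, 0.0220],
}
sliced = list(slice_columns(columns))
print(sliced)
```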

In [21]:
ratings

<TensorSliceDataset shapes: {t_dat: (), customer_id: (), article_id: (), rating: (), color-code: (), detail_desc: (), section_no: (), age_group: ()}, types: {t_dat: tf.int64, customer_id: tf.string, article_id: tf.string, rating: tf.int64, color-code: tf.string, detail_desc: tf.string, section_no: tf.int64, age_group: tf.int64}>

In [22]:
# articles_processed['detail_desc'] = articles_processed['detail_desc'].fillna("")

In [24]:

items_dict = {name: np.array(value) for name, value in articles_processed.items()}
items = tf.data.Dataset.from_tensor_slices(items_dict)

In [165]:
ratings

<TensorSliceDataset shapes: {t_dat: (), customer_id: (), article_id: (), rating: (), price: (), prod_name: (), product_type_name: (), product_group_name: (), graphical_appearance_name: (), colour_group_name: (), perceived_colour_value_name: (), perceived_colour_master_name: (), department_name: (), index_name: (), index_group_name: (), section_name: (), garment_group_name: (), age_group: ()}, types: {t_dat: tf.int64, customer_id: tf.string, article_id: tf.string, rating: tf.int64, price: tf.float64, prod_name: tf.string, product_type_name: tf.string, product_group_name: tf.string, graphical_appearance_name: tf.string, colour_group_name: tf.string, perceived_colour_value_name: tf.string, perceived_colour_master_name: tf.string, department_name: tf.string, index_name: tf.string, index_group_name: tf.string, section_name: tf.string, garment_group_name: tf.string, age_group: tf.int64}>

In [158]:
items

<TensorSliceDataset shapes: {article_id: (), prod_name: (), product_type_name: (), product_group_name: (), graphical_appearance_name: (), colour_group_name: (), perceived_colour_value_name: (), perceived_colour_master_name: (), department_name: (), index_name: (), index_group_name: (), section_name: (), garment_group_name: ()}, types: {article_id: tf.string, prod_name: tf.string, product_type_name: tf.string, product_group_name: tf.string, graphical_appearance_name: tf.string, colour_group_name: tf.string, perceived_colour_value_name: tf.string, perceived_colour_master_name: tf.string, department_name: tf.string, index_name: tf.string, index_group_name: tf.string, section_name: tf.string, garment_group_name: tf.string}>

In [25]:
interactions = ratings.map(lambda x: {
    'customer_id': x['customer_id'], 
    'age_group': x['age_group'],
    'article_id': x['article_id'], 
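    # Inside Dataset.map, AutoGraph turns int(...) into a cast to tf.int32.
    'rating': int(x['rating']),
    # Caution: prices in this data are small fractions (e.g. 0.0085), so this
    # cast truncates them to 0. tf.cast(x['price'], tf.float32) would keep them.
    'price': int(x['price']),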
    "t_dat": x["t_dat"],
    "prod_name": x['prod_name'],
    "product_type_name": x['product_type_name'],
    "product_group_name": x['product_group_name'],
    "graphical_appearance_name": x['graphical_appearance_name'],
    "colour_group_name": x['colour_group_name'],
    "perceived_colour_value_name": x['perceived_colour_value_name'],
    "perceived_colour_master_name": x['perceived_colour_master_name'],
    "department_name": x['department_name'],
    "index_name": x['index_name'],
    "index_group_name": x['index_group_name'],
    "section_name": x['section_name'],
    "garment_group_name": x['garment_group_name'],
})

articles = items.map(lambda x: x['article_id'])
age_group = ratings.map(lambda x: x['age_group'])


In [26]:
items = items.map(lambda x:
                 {    'article_id': x['article_id'], 
    "prod_name": x['prod_name'],
    "product_type_name": x['product_type_name'],
    "product_group_name": x['product_group_name'],
    "graphical_appearance_name": x['graphical_appearance_name'],
    "colour_group_name": x['colour_group_name'],
    "perceived_colour_value_name": x['perceived_colour_value_name'],
    "perceived_colour_master_name": x['perceived_colour_master_name'],
    "department_name": x['department_name'],
    "index_name": x['index_name'],
    "index_group_name": x['index_group_name'],
    "section_name": x['section_name'],
    "garment_group_name": x['garment_group_name'],
})

In [27]:
interactions

<MapDataset shapes: {customer_id: (), age_group: (), article_id: (), rating: (), price: (), t_dat: (), prod_name: (), product_type_name: (), product_group_name: (), graphical_appearance_name: (), colour_group_name: (), perceived_colour_value_name: (), perceived_colour_master_name: (), department_name: (), index_name: (), index_group_name: (), section_name: (), garment_group_name: ()}, types: {customer_id: tf.string, age_group: tf.int64, article_id: tf.string, rating: tf.int32, price: tf.int32, t_dat: tf.int64, prod_name: tf.string, product_type_name: tf.string, product_group_name: tf.string, graphical_appearance_name: tf.string, colour_group_name: tf.string, perceived_colour_value_name: tf.string, perceived_colour_master_name: tf.string, department_name: tf.string, index_name: tf.string, index_group_name: tf.string, section_name: tf.string, garment_group_name: tf.string}>

In [28]:
items

<MapDataset shapes: {article_id: (), prod_name: (), product_type_name: (), product_group_name: (), graphical_appearance_name: (), colour_group_name: (), perceived_colour_value_name: (), perceived_colour_master_name: (), department_name: (), index_name: (), index_group_name: (), section_name: (), garment_group_name: ()}, types: {article_id: tf.string, prod_name: tf.string, product_type_name: tf.string, product_group_name: tf.string, graphical_appearance_name: tf.string, colour_group_name: tf.string, perceived_colour_value_name: tf.string, perceived_colour_master_name: tf.string, department_name: tf.string, index_name: tf.string, index_group_name: tf.string, section_name: tf.string, garment_group_name: tf.string}>

We also do some housekeeping to prepare feature vocabularies.

In [29]:
timestamps = np.concatenate(list(ratings.map(lambda x: x["t_dat"]).batch(100)))

max_timestamp = timestamps.max()
min_timestamp = timestamps.min()

timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=105,
)


unique_movie_titles = np.unique(np.concatenate(list(items.batch(1000).map(lambda x: x['article_id']))))
unique_user_ids = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["customer_id"]))))
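The bucket boundaries computed above are later passed to a `Discretization` layer followed by an `Embedding` with `len(timestamp_buckets) + 1` rows. The standard-library sketch below shows why one extra row is needed: `n` boundaries split the number line into `n + 1` buckets. The helpers `linspace` and `bucketize` are simplified stand-ins written for this example.

```python
from bisect import bisect_right


def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (simplified np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def bucketize(value, boundaries):
    """Bucket index for value; a value equal to a boundary goes to the upper bucket."""
    return bisect_right(boundaries, value)


boundaries = linspace(0, 100, num=5)  # [0.0, 25.0, 50.0, 75.0, 100.0]
bucket_ids = [bucketize(v, boundaries) for v in (-3, 0, 30, 100, 250)]
num_buckets = len(boundaries) + 1
print(boundaries, bucket_ids, num_buckets)
```

Values below the first boundary land in bucket 0 and values at or above the last boundary land in bucket `len(boundaries)`, so the embedding table needs `len(boundaries) + 1` rows.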

In [31]:
price = np.concatenate(list(ratings.map(lambda x: x["price"]).batch(100)))

max_price = price.max()
min_price = price.min()

price_buckets = np.linspace(
    min_price, max_price, num=100,
)

In [32]:
unique_movie_titles

array([b'0108775015', b'0108775044', b'0108775051', ..., b'0956217002',
       b'0957375001', b'0959461001'], dtype=object)

In [33]:
unique_user_ids

array([b'0003e867a930d0d6842f923d6ba7c9b77aba33fe2a0fbf4672f30b3e622fec55',
       b'000b5ee14437ff127c1093eaa2ae3cc8801cfa5dd0b66fd775de26ca7e2265c3',
       b'000ee56f745271e72ae8b5680a416a4fbf8acf6a690ab2df92ee58505e6d0136',
       ...,
       b'fffb287f12aea1204e9eabd5e02eaf7f3ed5f9abecd9a4cb06cd9ecd793a996f',
       b'fffd0248a95c2e49fee876ff93598e2e20839e51b9b7678aab75d9e8f9f3c6c8',
       b'ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e4747568cac33e8c541831'],
      dtype=object)

In [173]:
interactions_dict.keys()

dict_keys(['t_dat', 'customer_id', 'article_id', 'rating', 'price', 'prod_name', 'product_type_name', 'product_group_name', 'graphical_appearance_name', 'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name', 'department_name', 'index_name', 'index_group_name', 'section_name', 'garment_group_name', 'age_group'])

In [34]:
feature_names = ['prod_name', 'product_type_name', 'product_group_name', 'graphical_appearance_name', 'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name', 'department_name', 'index_name', 'index_group_name', 'section_name', 'garment_group_name', 'age_group']

vocabularies = {}

for feature_name in feature_names:
  vocab = interactions.batch(1_000).map(lambda x: x[feature_name])
  vocabularies[feature_name] = np.unique(np.concatenate(list(vocab)))
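These vocabularies are later fed to `StringLookup` and `IntegerLookup` layers, each followed by an `Embedding` with `len(vocabulary) + 1` rows. The sketch below illustrates the indexing this relies on, assuming the layers' default of a single out-of-vocabulary index placed at 0 when no mask token is used. `build_lookup` and `lookup` are hypothetical helpers, not Keras APIs.

```python
def build_lookup(vocabulary):
    """Map each known token to 1..len(vocabulary); index 0 is kept for unknown tokens."""
    return {token: index for index, token in enumerate(vocabulary, start=1)}


def lookup(token, table):
    """Return the index of token, or 0 if it is not in the vocabulary."""
    return table.get(token, 0)


vocabulary = sorted({"Dress", "Trousers", "Dress", "Sweater"})  # unique, sorted
table = build_lookup(vocabulary)
indices = [lookup(t, table) for t in ("Dress", "Sweater", "Trousers", "Hat")]
embedding_rows = len(vocabulary) + 1
print(vocabulary, indices, embedding_rows)
```

Any token not seen when the vocabulary was built shares the embedding at index 0.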

In [35]:
# desc = items.map(lambda x: x['detail_desc'])

In [36]:
# color_code = items.map(lambda x: x['color-code'])

In [37]:
# unique_color_code = np.unique(np.concatenate(list(items.batch(1_000).map(
#     lambda x: x["color-code"]))))

In [None]:
# unique_color_code

In [None]:
# section = items.map(lambda x: x['section_no'])

In [108]:
items

<MapDataset shapes: {article_id: (), color-code: (), section_no: ()}, types: {article_id: tf.string, color-code: tf.string, section_no: tf.string}>

In [62]:
unique_section_code = np.unique(np.concatenate(list(items.batch(1_000).map(
    lambda x: x["section_no"]))))

In [78]:
# unique_section_code = np.array(unique_section_code, dtype='str')
# unique_section_code = np.array(unique_section_code, dtype='object')

In [88]:
unique_section_code

array(['2', '4', '5', '6', '8', '11', '14', '15', '16', '17', '18', '19',
       '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30',
       '31', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49',
       '50', '51', '52', '53', '55', '56', '57', '58', '60', '61', '62',
       '64', '65', '66', '70', '71', '72', '76', '77', '79', '80', '82',
       '97'], dtype=object)

## Model definition

### Query model

We start with the user model defined in [the featurization tutorial](featurization) as the first layer of our model, tasked with converting raw input examples into feature embeddings. However, we change it slightly so that the context features can be turned on or off, which makes it easier to demonstrate their effect on the model. In the code below, the `use_timestamps` parameter controls this: when it is `False`, the model uses only the customer id embedding; when it is `True`, it also uses the timestamp, price, and `age_group` features.

In [35]:
class UserModel(tf.keras.Model):
  
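    def __init__(self, use_timestamps):
        super().__init__()
        self.embedding_dimension = 32
        # When False, call() returns only the customer id embedding and the
        # timestamp, price, and age_group layers below are not created.
        self._use_timestamps = use_timestamps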

        self.user_embedding = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_ids) + 1, self.embedding_dimension),
        ])
        
        str_features = []
        int_features = ['age_group']
        
        self._all_features = str_features + int_features
        self._embeddings = {}
        if use_timestamps:
            self.timestamp_embedding = tf.keras.Sequential([
              tf.keras.layers.Discretization(timestamp_buckets.tolist()),
              tf.keras.layers.Embedding(len(timestamp_buckets) + 1, self.embedding_dimension),
            ])
            self.normalized_timestamp = tf.keras.layers.Normalization(
              axis=None
            )

            self.normalized_timestamp.adapt(timestamps)
            
            self.price_embedding = tf.keras.Sequential([
              tf.keras.layers.Discretization(price_buckets.tolist()),
              tf.keras.layers.Embedding(len(price_buckets) + 1, self.embedding_dimension),
            ])
            self.normalized_price = tf.keras.layers.Normalization(
              axis=None
            )

            self.normalized_price.adapt(price)
            
            for feature_name in str_features:
              vocabulary = vocabularies[feature_name]
              self._embeddings[feature_name] = tf.keras.Sequential(
                  [tf.keras.layers.StringLookup(
                      vocabulary=vocabulary, mask_token=None),
                   tf.keras.layers.Embedding(len(vocabulary) + 1,
                                             self.embedding_dimension)
            ])

            # Compute embeddings for int features.
            for feature_name in int_features:
              vocabulary = vocabularies[feature_name]
              self._embeddings[feature_name] = tf.keras.Sequential(
                  [tf.keras.layers.IntegerLookup(
                      vocabulary=vocabulary, mask_value=None),
                   tf.keras.layers.Embedding(len(vocabulary) + 1,
                                             self.embedding_dimension)
            ])

    def call(self, inputs):
        if not self._use_timestamps:
            return self.user_embedding(inputs["customer_id"])
        embeddings = [
            self.user_embedding(inputs["customer_id"]),
            self.timestamp_embedding(inputs["t_dat"]),
            tf.reshape(self.normalized_timestamp(inputs["t_dat"]), (-1, 1)),
            self.price_embedding(inputs["price"]),
            tf.reshape(self.normalized_price(inputs["price"]), (-1, 1)),]
        for feature_name in self._all_features:
            embedding_fn = self._embeddings[feature_name]
            embeddings.append(embedding_fn(inputs[feature_name]))
        return tf.concat(embeddings, axis=1)

Note that our use of timestamp features in this tutorial interacts with our choice of training-test split in an undesirable way. Because we have split our data randomly rather than chronologically (a chronological split would ensure that events in the test dataset happen later than those in the training set), our model can effectively learn from the future. This is unrealistic: after all, we cannot train a model today on data from tomorrow.

This means that adding time features to the model lets it learn _future_ interaction patterns. We do this for illustration purposes only: the MovieLens dataset itself is very dense and, unlike many real-world datasets, does not benefit greatly from features beyond user ids and movie titles.

This caveat aside, real-world models may well benefit from other time-based features such as time of day or day of the week, especially if the data has strong seasonal patterns.
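For comparison, a chronological split can be sketched as follows. This is a minimal illustration under the assumption that each event carries a timestamp where larger means later; the function `chronological_split` and the sample events are hypothetical and are not used elsewhere in this notebook.

```python
def chronological_split(events, time_key, train_fraction=0.8):
    """Sort events by time and put the earliest train_fraction of them in the training set."""
    ordered = sorted(events, key=lambda event: event[time_key])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


events = [
    {"user_id": "u1", "timestamp": 50},
    {"user_id": "u2", "timestamp": 10},
    {"user_id": "u1", "timestamp": 40},
    {"user_id": "u3", "timestamp": 30},
    {"user_id": "u2", "timestamp": 20},
]
train, test = chronological_split(events, time_key="timestamp")
print([e["timestamp"] for e in train], [e["timestamp"] for e in test])
```

With a split like this, every test event happens no earlier than every training event, so time features cannot leak future information into training. Note that a feature such as "weeks before a reference date" runs in the opposite direction (larger means older), so the sort order would need to be reversed for it.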

### Candidate model

For simplicity, we'll keep the candidate model fixed. Again, we copy it from the [featurization](featurization) tutorial:

In [36]:
class MovieModel(tf.keras.Model):
  
    def __init__(self):
        super().__init__()
        self.embedding_dimension = 32

        self.title_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
              vocabulary=unique_movie_titles, mask_token=None),
            tf.keras.layers.Embedding(len(unique_movie_titles) + 1, self.embedding_dimension)
        ])

        str_features = ['prod_name', 'product_type_name', 'product_group_name', 'graphical_appearance_name', 
                        'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name', 'department_name', 
                        'index_name', 'index_group_name', 'section_name', 'garment_group_name', ]
        int_features = []
        
        self._all_features = str_features + int_features
        self._embeddings = {}
        
            
        for feature_name in str_features:
          vocabulary = vocabularies[feature_name]
          self._embeddings[feature_name] = tf.keras.Sequential(
              [tf.keras.layers.StringLookup(
                  vocabulary=vocabulary, mask_token=None),
               tf.keras.layers.Embedding(len(vocabulary) + 1,
                                         self.embedding_dimension)
        ])

        # Compute embeddings for int features.
        for feature_name in int_features:
          vocabulary = vocabularies[feature_name]
          self._embeddings[feature_name] = tf.keras.Sequential(
              [tf.keras.layers.IntegerLookup(
                  vocabulary=vocabulary, mask_value=None),
               tf.keras.layers.Embedding(len(vocabulary) + 1,
                                         self.embedding_dimension)
        ])

    def call(self, inputs):
        embeddings = []
        embeddings.append(self.title_embedding(inputs['article_id']))
        for feature_name in self._all_features:
            embedding_fn = self._embeddings[feature_name]
            embeddings.append(embedding_fn(inputs[feature_name]))
        return tf.concat(embeddings, axis=1)

### Combined model

With both `UserModel` and `MovieModel` defined, we can put together a combined model and implement our loss and metrics logic.

Here we're building a retrieval model. For a refresher on how this works, see the [Basic retrieval](basic_retrieval.ipynb) tutorial.

Note that we also need to make sure that the query model and candidate model output embeddings of compatible size. Because we'll be varying their sizes by adding more features, the easiest way to accomplish this is to use a dense projection layer after each model:
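To see why the projection is needed, the arithmetic below tallies the concatenated output widths of the two towers as they are defined in this notebook (32-dimensional embeddings plus two scalar normalized features in the user tower). The function names are hypothetical and exist only for this worked example.

```python
EMBEDDING_DIM = 32


def user_tower_width(use_timestamps: bool) -> int:
    """Width of the concatenated UserModel output."""
    if not use_timestamps:
        return EMBEDDING_DIM  # customer id embedding only
    parts = [
        EMBEDDING_DIM,  # customer_id embedding
        EMBEDDING_DIM,  # bucketized t_dat embedding
        1,              # normalized t_dat
        EMBEDDING_DIM,  # bucketized price embedding
        1,              # normalized price
        EMBEDDING_DIM,  # age_group embedding
    ]
    return sum(parts)


def item_tower_width(num_extra_features: int = 12) -> int:
    """Width of the concatenated MovieModel output: article_id plus the string features."""
    return EMBEDDING_DIM * (1 + num_extra_features)


print(user_tower_width(False), user_tower_width(True), item_tower_width())
```

The widths differ (32 or 130 for queries versus 416 for candidates), so a dot product between the raw tower outputs is not defined. A final `Dense(32)` layer on each side maps both to 32 dimensions.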



In [37]:
class MovielensModel(tfrs.models.Model):

  def __init__(self, use_timestamps):
    super().__init__()
    self.query_model = tf.keras.Sequential([
      UserModel(use_timestamps),
      tf.keras.layers.Dense(32)
    ])
    self.candidate_model = tf.keras.Sequential([
      MovieModel(),
      tf.keras.layers.Dense(32)
    ])
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=items.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):
    # Pass each tower only the features it consumes. Restricting the query
    # inputs to a fixed set of keys ensures that the training inputs have the
    # same structure as the query inputs. Otherwise the discrepancy in input
    # structure would cause an error when loading the query model after saving it.
    query_embeddings = self.query_model({
        "customer_id": features["customer_id"],
        "t_dat": features["t_dat"],
        "price": features["price"],
        "age_group": features["age_group"],
    })
    movie_embeddings = self.candidate_model({
        "article_id": features["article_id"],
        "prod_name": features["prod_name"],
        "product_type_name": features["product_type_name"],
        "product_group_name": features["product_group_name"],
        "graphical_appearance_name": features["graphical_appearance_name"],
        "colour_group_name": features["colour_group_name"],
        "perceived_colour_value_name": features["perceived_colour_value_name"],
        "perceived_colour_master_name": features["perceived_colour_master_name"],
        "department_name": features["department_name"],
        "index_name": features["index_name"],
        "index_group_name": features["index_group_name"],
        "section_name": features["section_name"],
        "garment_group_name": features["garment_group_name"],
    })

    return self.task(query_embeddings, movie_embeddings)
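
To make the role of the projection layers concrete, here is a minimal NumPy sketch rather than the TensorFlow code itself. The tower widths 130 and 416 are taken from the shape error reported later in this notebook; the batch size, the random weights and the helper name `dense` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tower outputs for a batch of 4 examples, before any projection.
query_features = rng.normal(size=(4, 130))
candidate_features = rng.normal(size=(4, 416))

# Without a projection the query-by-candidate score matrix cannot be formed.
try:
    query_features @ candidate_features.T
    mismatch = False
except ValueError:
    mismatch = True

def dense(x, out_dim, rng):
    """Affine map x @ W + b, i.e. a dense layer with no activation."""
    # Random weights stand in for the trained kernel of a Dense(32) layer.
    kernel = rng.normal(scale=0.05, size=(x.shape[1], out_dim))
    bias = np.zeros(out_dim)
    return x @ kernel + bias

query_embeddings = dense(query_features, 32, rng)
candidate_embeddings = dense(candidate_features, 32, rng)

# Both towers now output 32-dimensional vectors, so every query in the batch
# can be scored against every candidate in the batch.
scores = query_embeddings @ candidate_embeddings.T
print(mismatch, scores.shape)
```

Projecting to a shared width keeps the two towers free to grow independently as features are added, since only the final layer has to agree on a size.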

In [264]:
class MovielensModel(tfrs.models.Model):

  def __init__(self, use_timestamps):
    super().__init__()
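    # Variant with the projection layers disabled: the two towers then emit
    # embeddings of different widths.
    self.query_model = tf.keras.Sequential([
      UserModel(use_timestamps),
      # tf.keras.layers.Dense(32)
    ])
    self.candidate_model = tf.keras.Sequential([
      MovieModel(),
      # tf.keras.layers.Dense(32)
    ])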
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=items.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):
    # Pass each tower only the features it consumes. Restricting the query
    # inputs to a fixed set of keys ensures that the training inputs have the
    # same structure as the query inputs. Otherwise the discrepancy in input
    # structure would cause an error when loading the query model after saving it.
    query_embeddings = self.query_model({
        "customer_id": features["customer_id"],
        "t_dat": features["t_dat"],
        "price": features["price"],
        "age_group": features["age_group"],
    })
    movie_embeddings = self.candidate_model({
        "article_id": features["article_id"],
        "prod_name": features["prod_name"],
        "product_type_name": features["product_type_name"],
        "product_group_name": features["product_group_name"],
        "graphical_appearance_name": features["graphical_appearance_name"],
        "colour_group_name": features["colour_group_name"],
        "perceived_colour_value_name": features["perceived_colour_value_name"],
        "perceived_colour_master_name": features["perceived_colour_master_name"],
        "department_name": features["department_name"],
        "index_name": features["index_name"],
        "index_group_name": features["index_group_name"],
        "section_name": features["section_name"],
        "garment_group_name": features["garment_group_name"],
    })

    return self.task(query_embeddings, movie_embeddings)

## Experiments

### Prepare the data

We first split the data into a training set and a testing set.

In [38]:
tf.random.set_seed(42)
shuffled = interactions.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()
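
The split above relies on shuffling once with a fixed seed and then slicing. The plain-Python sketch below mirrors that idea on a small, hypothetical list of record ids; it does not use `tf.data`, and the sizes are scaled down for illustration.

```python
import random

# Hypothetical stand-in for the interactions dataset: 100 record ids.
records = list(range(100))

# Shuffle once with a fixed seed and never reshuffle. This is the role played
# by seed=42 and reshuffle_each_iteration=False in the cell above.
shuffled = records[:]
random.Random(42).shuffle(shuffled)

train = shuffled[:80]      # analogous to shuffled.take(80_000)
test = shuffled[80:100]    # analogous to shuffled.skip(80_000).take(20_000)

# Because the order is frozen, no record can appear in both splits.
overlap = set(train) & set(test)
print(len(train), len(test), len(overlap))
```

If the dataset were reshuffled on every iteration, `take` and `skip` would draw from different orderings, and examples could leak between the training and test sets.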

### Baseline: no timestamp features

We're ready to try out our first model: let's start with not using timestamp features to establish our baseline.

In [294]:
model = MovielensModel(use_timestamps=False)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=3)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]

print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")

Consider rewriting this model with the Functional API.
Epoch 1/3

UnimplementedError:  Cast int64 to string is not supported
	 [[node sequential_211/movie_model_29/sequential_210/string_lookup_114/Cast (defined at <ipython-input-276-f4b9da47b25b>:30) ]] [Op:__inference_train_function_398809]

Errors may have originated from an input operation.
Input Source operations connected to node sequential_211/movie_model_29/sequential_210/string_lookup_114/Cast:
 IteratorGetNext (defined at <ipython-input-294-438035174cc8>:4)

Function call stack:
train_function


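The run above stops with `UnimplementedError: Cast int64 to string is not supported`, raised inside a `StringLookup` layer of the movie model, so it does not produce a baseline figure for this dataset. For reference, the MovieLens version of this tutorial reports a baseline top-100 accuracy of around 0.2.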
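
For readers unfamiliar with the metric, a top-k accuracy of this kind counts a query as correct when the item that was actually interacted with is among the k highest-scoring candidates. The NumPy sketch below illustrates that definition on made-up scores; the function name `top_k_accuracy` and the data are assumptions for illustration, and the TFRS implementation may differ in its details.

```python
import numpy as np

def top_k_accuracy(scores, true_index, k):
    """Fraction of queries whose true candidate is among the k best scores."""
    # Indices of the k highest-scoring candidates per query. The order within
    # those k does not matter for this metric.
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (top_k == true_index[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical scores for 3 queries over 5 candidates.
scores = np.array([
    [0.9, 0.1, 0.3, 0.2, 0.0],  # true candidate 0 is ranked 1st
    [0.2, 0.8, 0.1, 0.7, 0.6],  # true candidate 4 is ranked 3rd
    [0.5, 0.4, 0.3, 0.2, 0.1],  # true candidate 4 is ranked 5th
])
true_index = np.array([0, 4, 4])

print(top_k_accuracy(scores, true_index, k=1))  # one query out of three hits
print(top_k_accuracy(scores, true_index, k=3))  # two queries out of three hit
```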



### Capturing time dynamics with time features

Does the result change if we add time features?

In [39]:
model = MovielensModel(use_timestamps=True)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=30)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
    
print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")

Consider rewriting this model with the Functional API.
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Top-100 accuracy (train): 0.99.
Top-100 accuracy (test): 0.13.


In [266]:
model = MovielensModel(use_timestamps=True)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=10)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
    
print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")

Consider rewriting this model with the Functional API.
Epoch 1/10


ValueError: in user code:

    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/keras/engine/training.py:853 train_function  *
        return step_function(self, iterator)
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow_recommenders/tasks/retrieval.py:132 call  *
        scores = tf.linalg.matmul(
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py:206 wrapper  **
        return target(*args, **kwargs)
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3655 matmul
        a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py:5714 mat_mul
        name=name)
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:750 _apply_op_helper
        attrs=attr_protos, op_def=op_def)
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py:601 _create_op_internal
        compute_device)
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py:3569 _create_op_internal
        op_def=op_def)
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py:2042 __init__
        control_input_ops, op_def)
    /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py:1883 _create_c_op
        raise ValueError(str(e))

    ValueError: Dimensions must be equal, but are 130 and 416 for '{{node retrieval_8/MatMul}} = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=true](sequential_152/user_model_12/concat, sequential_166/movie_model_11/concat)' with input shapes: [?,130], [?,416].


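With time features and 30 epochs of training, top-100 accuracy reaches 0.99 on the training set but only 0.13 on the test set. A gap that large points to overfitting rather than a real gain, and because the baseline cell above ended in an error there is no reference figure to compare against. The 10-epoch run fails for a different reason: with the projection layers disabled, the query and candidate towers output embeddings of width 130 and 416, which the retrieval task cannot multiply.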

In [310]:
# pip install scann
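
The cells that follow build a ScaNN index over the candidate embeddings and query it for recommendations. ScaNN performs approximate nearest-neighbour search; the NumPy sketch below swaps in exact brute-force search to show what such an index computes. The function name `brute_force_top_k`, the 2-dimensional embeddings and the identifiers are all hypothetical.

```python
import numpy as np

def brute_force_top_k(query_embeddings, candidate_embeddings, candidate_ids, k):
    """Exact top-k retrieval by dot-product score."""
    scores = query_embeddings @ candidate_embeddings.T
    # Sort each row by descending score and keep the first k columns.
    order = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, candidate_ids[order]

# Made-up candidates and queries for illustration.
candidate_ids = np.array(["a", "b", "c", "d"])
candidate_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
])
query_embeddings = np.array([
    [1.0, 0.2],
    [0.0, 2.0],
])

top_scores, top_ids = brute_force_top_k(
    query_embeddings, candidate_embeddings, candidate_ids, k=2)
print(top_ids.tolist())  # [['a', 'c'], ['b', 'c']]
```

Like the index used in this notebook, the sketch returns scores first and identifiers second, which is why the recommendation cell reads element `[1]` of the result.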

In [269]:
scann_index = tfrs.layers.factorized_top_k.ScaNN(model.query_model)


In [270]:
scann_index.index_from_dataset(
    tf.data.Dataset.zip((
        items.batch(128).map(lambda x: x['article_id']),
        items.batch(128).map(model.candidate_model),
    ))
)

Consider rewriting this model with the Functional API.

<tensorflow_recommenders.layers.factorized_top_k.ScaNN at 0x7f296831d390>

In [199]:
unique_user_ids

array([b'00018385675844f7a6babbed41b5655b5727fb16483b6ea51d5798a6ab947344',
       b'00019d6c20e0fbb551af18c57149af4707ec016bb0decdf064cdae15ab1569a8',
       b'000253f6914890557a88d0b91288ce85fae9332dac43ee5445c33e3891df6fd3',
       ...,
       b'ffff4c4e8b57b633c1ddf8fbd53db16b962cf831baf9ed67c6a53d86e167a35b',
       b'ffff8f9ecdce722b5bab97fff68a6d1866492209bfe5242c50d2a10a652fb5ef',
       b'ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e4747568cac33e8c541831'],
      dtype=object)

In [244]:
list(interactions.take(1).as_numpy_iterator())

[{'customer_id': b'00018385675844f7a6babbed41b5655b5727fb16483b6ea51d5798a6ab947344',
  'age_group': 4,
  'article_id': b'0535455002',
  'rating': 1,
  'price': 0,
  't_dat': 99,
  'prod_name': b'Lastday',
  'product_type_name': b'Blouse',
  'product_group_name': b'Garment Upper body',
  'graphical_appearance_name': b'Jacquard',
  'colour_group_name': b'Dark Red',
  'perceived_colour_value_name': b'Dark',
  'perceived_colour_master_name': b'Orange',
  'department_name': b'Blouse',
  'index_name': b'Ladieswear',
  'index_group_name': b'Ladieswear',
  'section_name': b'Womens Everyday Collection',
  'garment_group_name': b'Blouses'}]

In [276]:
# Get recommendations.

for row in test.batch(3).take(1):
    print(list(row))
    print(f"Best recommendations: {scann_index(row)[1].numpy()[:, :5].tolist()}")

['customer_id', 'age_group', 'article_id', 'rating', 'price', 't_dat', 'prod_name', 'product_type_name', 'product_group_name', 'graphical_appearance_name', 'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name', 'department_name', 'index_name', 'index_group_name', 'section_name', 'garment_group_name']
array([b'01e5bd53e72a6bc2c923fa646c41d2250cee46d9820a143c5ca8f1e2ea9fdff2',
       b'0cc663c22bc8b0a52bf5aa5948f76b43987b382e5347b7b8bed952173e02b2de',
       b'07c0075ab098b0807b511e2e44abe0cf245feb61527cb6191dd8f83a487a89e5'],
      dtype=object)>, 'age_group': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1, 1, 1])>, 'article_id': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'0488697002', b'0714790008', b'0685284002'], dtype=object)>, 'rating': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 1, 2], dtype=int32)>, 'price': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([0, 0, 0], dtype=int32)>, 't_dat': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([75, 52, 98])>, 'prod_name': <tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'Nohar Sneaker', b'Mom HW Ankle Consc', b'Bowy skirt'],
      dtype=object)>, 'product_type_name': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Sneakers', b'Trousers', b'Skirt'], dtype=object)>, 'product_group_name': <tf.Tensor: s

Best recommendations: [[b'0617249020', b'0617245003', b'0802087001', b'0680263013', b'0718086002'], [b'0677848008', b'0732412002', b'0739953003', b'0270382004', b'0699867001'], [b'0529008010', b'0561814002', b'0687036007', b'0631837001', b'0251510001']]


In [None]:
[b'0663261002', b'0651273003', b'0619464003', b'0695632006', b'0535455002']

[b'0535455002', b'0616849012', b'0621020001', b'0626813002',
       b'0626813004', b'0651273003', b'0651273004', b'0667916002',

In [231]:
model.query_model()

TypeError: 'Sequential' object does not support indexing

In [226]:
scann = tfrs.layers.factorized_top_k.ScaNN(num_reordering_candidates=100)
scann.index_from_dataset(
    tf.data.Dataset.zip((lots_of_movies, lots_of_movies_embeddings))
)

NameError: name 'lots_of_movies' is not defined

## Next Steps

This tutorial shows that even simple models can become more accurate when incorporating more features. However, to get the most out of your features it's often necessary to build larger, deeper models. Have a look at the [deep retrieval tutorial](deep_recommenders) to explore this in more detail.