# A BERT, Transformer, and ReLU-Based Recommendation System



**Author:** Farhan Yousuf<br>
**Date created:** 2022/04/17<br>
**Last modified:** 2022/04/18<br>
**Description:** Utilizing the Movielens data, supplemented by text features, to produce an architecture consisting of BERT, a transformer layer, and a 3-layer ReLU-based neural network for movie rating prediction.

## Introduction

The original Colab notebook that inspired this work demonstrates the [Behavior Sequence Transformer (BST)](https://arxiv.org/abs/1905.06874)
model, by Qiwei Chen et al., using the [Movielens dataset](https://grouplens.org/datasets/movielens/).
This dataset was augmented with additional information about the movies in the form of summaries (overviews). A BERT model processes the tokenized overviews, and its output is concatenated with that of the BST model, which itself aims to predict the rating of a target movie from the following inputs:

1. A fixed-length *sequence* of `movie_ids` watched by a user.
2. A fixed-length *sequence* of the `ratings` for the movies watched by a user.
3. A *set* of user features, including `user_id`, `sex`, `occupation`, and `age_group`.
4. A *set* of `genres` for each movie in the input sequence and the target movie.
5. A `target_movie_id` for which to predict the rating.

Note that the list above does not cover every feature used: BERT outputs were also produced for the movie overviews, and their hidden states are concatenated with the transformer output later on.

The BST model was modified by including the movie categorical features (in this case the genres) in the embedding of each movie. These encodings were further augmented by incorporating numerical features into the model. The data was divided into sequences, with the last element being the target, or label, and all others comprising the feature space. As such, the positions of all data within the sequence were taken into consideration.

Note that this example should be run with TensorFlow 2.4 or higher.

## The dataset

We use the [1M version of the Movielens dataset](https://grouplens.org/datasets/movielens/1m/).
The dataset includes around 1 million ratings from 6000 users on 4000 movies,
along with some user features and movie genres. In addition, the timestamp of each user-movie
rating is provided, which allows creating sequences of movie ratings for each user,
as expected by the BST model.

## Setup

In [None]:
!pip install transformers



In [None]:
import os
import math
from zipfile import ZipFile
from urllib.request import urlretrieve
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import StringLookup
from transformers import BertTokenizer, TFBertModel, BertConfig

## Prepare the data

### Download and prepare the DataFrames

The MovieLens data can be downloaded as a zip archive. For this project, an additional metadata file supplies information on each movie, including text (an overview) and numerical features (budget, vote average, etc.).
The downloaded folder will contain three data files: `users.dat`, `movies.dat`,
and `ratings.dat`.

In [None]:
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip", "movielens.zip")
ZipFile("movielens.zip", "r").extractall()

Then, we load the data into pandas DataFrames with their proper column names.

In [None]:
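# The "::" separator is multi-character, so the Python parsing engine is
# requested explicitly; this avoids pandas' ParserWarning about falling back.
users = pd.read_csv(
    "ml-1m/users.dat",
    sep="::",
    engine="python",
    names=["user_id", "sex", "age_group", "occupation", "zip_code"],
)

ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    engine="python",
    names=["user_id", "movie_id", "rating", "unix_timestamp"],
)

movies = pd.read_csv(
    "ml-1m/movies.dat",
    sep="::",
    engine="python",
    names=["movie_id", "title", "genres"],
    encoding="ISO-8859-1",
)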
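As a small illustration of the file layout, the sketch below parses two illustrative rows in the `users.dat` format from an in-memory buffer. The name `sample_users` and the use of `io.StringIO` are only for demonstration and are not part of the notebook.

```python
import io

import pandas as pd

# Two illustrative rows in the layout user_id::sex::age_group::occupation::zip_code.
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

sample_users = pd.read_csv(
    sample,
    sep="::",
    engine="python",  # multi-character separators need the Python parser
    names=["user_id", "sex", "age_group", "occupation", "zip_code"],
)
print(sample_users)
```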



The movie metadata file is not part of the online archive and must therefore be supplied by other means; here, Google Drive is used. The following code connects this notebook to Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Here, we do some simple data processing to fix the data types of the columns.

In [None]:
users["user_id"] = users["user_id"].apply(lambda x: f"user_{x}")
users["age_group"] = users["age_group"].apply(lambda x: f"group_{x}")
users["occupation"] = users["occupation"].apply(lambda x: f"occupation_{x}")

movies["movie_id"] = movies["movie_id"].apply(lambda x: f"movie_{x}")

ratings["movie_id"] = ratings["movie_id"].apply(lambda x: f"movie_{x}")
ratings["user_id"] = ratings["user_id"].apply(lambda x: f"user_{x}")
ratings["rating"] = ratings["rating"].apply(lambda x: float(x))

Each movie has multiple genres. We split them into separate columns in the `movies`
DataFrame. These will be our categorical features.

In [None]:
genres = [
    "Action",
    "Adventure",
    "Animation",
    "Children's",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
]

for genre in genres:
    movies[genre] = movies["genres"].apply(
        lambda values: int(genre in values.split("|"))
    )
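The multi-hot encoding performed by the loop above can be sketched on a single genre string. The helper `multi_hot` and the shortened genre list are hypothetical and only serve to show the resulting vector.

```python
# Illustrative subset of the genre list above.
genre_subset = ["Action", "Comedy", "Drama", "Romance"]


def multi_hot(genre_string, vocabulary):
    """Encode a pipe-separated genre string as a 0/1 vector over `vocabulary`."""
    present = set(genre_string.split("|"))
    return [int(genre in present) for genre in vocabulary]


vector = multi_hot("Comedy|Romance", genre_subset)
print(vector)
```

Each position in the vector corresponds to one genre, so a movie with several genres has several ones.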


In [None]:
meta = pd.read_csv('/content/drive/My Drive/movies_metadata.csv')



The instructions for this project include aligning the overview with each movie based on year and title, which is done below.

In [None]:
meta['year'] = meta['release_date'].str[:4]
meta['year_and_title'] = meta['title'] + " (" + meta['year'] + ')'

# Drop rows without a key, then drop every key that is not unique.
# Reassigning (rather than using inplace=True on a filtered frame) avoids
# pandas' SettingWithCopyWarning.
meta = meta.dropna(subset=['year_and_title'])
meta = meta.drop_duplicates(subset="year_and_title", keep=False)
movies = pd.merge(movies, meta[['year_and_title', 'budget', 'overview', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']],
                  how='left', left_on='title', right_on='year_and_title')
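To show how the title-and-year key behaves, here is a minimal sketch on made-up rows. Only the column names follow the notebook; the frames `toy_movies` and `toy_meta` and their contents are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook.
toy_movies = pd.DataFrame(
    {
        "movie_id": ["movie_1", "movie_2"],
        "title": ["Toy Story (1995)", "Unknown Film (1990)"],
    }
)
toy_meta = pd.DataFrame(
    {
        "title": ["Toy Story", "Heat", "Heat"],
        "release_date": ["1995-10-30", "1995-12-15", "1995-01-01"],
        "overview": ["Toys come to life.", "A heist.", "Another film."],
    }
)

# Build the "Title (YYYY)" key that matches the MovieLens title format.
toy_meta["year_and_title"] = (
    toy_meta["title"] + " (" + toy_meta["release_date"].str[:4] + ")"
)
# keep=False drops every row whose key is ambiguous (both "Heat (1995)" rows).
toy_meta = toy_meta.drop_duplicates(subset="year_and_title", keep=False)

merged = pd.merge(
    toy_movies,
    toy_meta[["year_and_title", "overview"]],
    how="left",
    left_on="title",
    right_on="year_and_title",
)
print(merged[["movie_id", "overview"]])
```

Because the merge is a left join, movies without a matching key keep their row and receive `NaN` in the metadata columns.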



Similarly, the ratings DataFrame, which feeds the model after further processing, is merged with the movies DataFrame to pick up the additional features.

In [None]:
ratings = pd.merge(ratings, movies[['movie_id', 'budget', 'overview', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']],
                   how='left', left_on='movie_id', right_on='movie_id')

NaN values must be filled or dropped.

In [None]:
ratings["budget"] = ratings["budget"].astype(float)
ratings["popularity"] = ratings["popularity"].astype(float)
ratings["revenue"] = ratings["revenue"].astype(float)
ratings["runtime"] = ratings["runtime"].astype(float)
ratings["vote_average"] = ratings["vote_average"].astype(float)
ratings["vote_count"] = ratings["vote_count"].astype(float)

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may not apply to the DataFrame.
ratings["budget"] = ratings["budget"].fillna(ratings["budget"].mean())
ratings["popularity"] = ratings["popularity"].fillna(ratings["popularity"].mean())
ratings["revenue"] = ratings["revenue"].fillna(ratings["revenue"].mean())
ratings["runtime"] = ratings["runtime"].fillna(ratings["runtime"].mean())
ratings["vote_average"] = ratings["vote_average"].fillna(ratings["vote_average"].mean())
ratings["vote_count"] = ratings["vote_count"].fillna(ratings["vote_count"].mean())

In [None]:
ratings["overview"] = ratings["overview"].fillna("No overview available")

My earlier attempts at this project ran into an exploding-gradient problem (the loss appeared as NaN). To address this, I standardized the numerical columns to zero mean and unit variance, as shown below; `vote_average` is left on its original scale.

In [None]:
ratings["budget"] = (ratings["budget"] - ratings["budget"].mean())/ ratings["budget"].std()
ratings["popularity"] = (ratings["popularity"] - ratings["popularity"].mean()) / ratings["popularity"].std()
ratings["revenue"] = (ratings["revenue"] - ratings["revenue"].mean())/ratings["revenue"].std()
ratings["runtime"] = (ratings["runtime"] - ratings["runtime"].mean())/ratings["runtime"].std()
ratings["vote_count"] = (ratings["vote_count"] - ratings["vote_count"].mean())/ratings["vote_count"].std()
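The two preprocessing steps, mean imputation followed by standardization, can be checked on a tiny made-up column. The values below are invented; note that pandas' `std()` uses the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Made-up budget values with one missing entry.
budget = pd.Series([10.0, 20.0, None, 30.0])

# Mean imputation: the missing value becomes the mean of the observed ones.
budget = budget.fillna(budget.mean())

# Standardize to zero mean and unit (sample) standard deviation.
standardized = (budget - budget.mean()) / budget.std()
print(standardized.round(3).tolist())
```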

In [None]:
ratings["budget"] = ratings["budget"].astype(str)
ratings["popularity"] = ratings["popularity"].astype(str)
ratings["revenue"] = ratings["revenue"].astype(str)
ratings["runtime"] = ratings["runtime"].astype(str)
ratings["vote_average"] = ratings["vote_average"].astype(str)
ratings["vote_count"] = ratings["vote_count"].astype(str)

BERT was used to encode the textual data. However, the text cannot be fed directly to BERT; a BERT-oriented tokenizer must be used.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Collect each distinct overview once so that BERT is run once per unique text.
overview_unique = list(set(movies["overview"].dropna()))


Model configuration was set such that the hidden states were output, to be concatenated with the transformer output later on.

In [None]:
config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
model = TFBertModel.from_pretrained("bert-base-uncased", config=config)
# Tokenize each overview separately, padding or truncating it to 64 tokens.
overview_tokens = [
    tokenizer.encode(sent, add_special_tokens=True, max_length=64, padding="max_length", truncation=True, return_tensors='tf')
    for sent in overview_unique
]


In [None]:
output = []
for token in overview_tokens:
  output.append(model(token))

In [None]:
hidden_states_for_all = [states[2] for states in output]
last_2_layers_hidden_states = [tf.math.reduce_mean(hidden_states[-2:], axis=0) for hidden_states in hidden_states_for_all]
sentence_embeddings = [tf.math.reduce_mean(last_2_layer_hidden_state, axis=1) for last_2_layer_hidden_state in last_2_layers_hidden_states]
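The pooling in the cell above can be sketched with NumPy on stand-in data. The random arrays below only mimic the shapes of `bert-base-uncased` hidden states (13 layers including the embedding output, hidden size 768); they are not real BERT outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the hidden states of one overview: 13 arrays of shape
# (batch=1, tokens=64, hidden=768). Random values, not real BERT output.
hidden_states = [rng.normal(size=(1, 64, 768)) for _ in range(13)]

# Average the last two layers element-wise -> shape (1, 64, 768).
last_two_mean = np.mean(np.stack(hidden_states[-2:]), axis=0)
# Average over the token axis -> one 768-dimensional vector per overview.
sentence_embedding = last_two_mean.mean(axis=1)
print(sentence_embedding.shape)
```

Note that a plain mean over the token axis also averages over padding positions; masking them out would be a possible refinement, not something the notebook does.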

In [None]:
arrays = []
for embedding in sentence_embeddings:
  arrays.append(embedding.numpy())

In [None]:
array_strings = []
for array in arrays:
  # Each array has shape (1, 768); serialize its single row as space-separated floats.
  array_strings.append(" ".join(array[0].astype(str)))
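The embeddings are stored as strings so that they can travel through the CSV files written later. A minimal round trip, assuming a 4-dimensional stand-in vector instead of the 768-dimensional one, looks like this:

```python
import numpy as np

# Stand-in embedding with the (1, hidden) shape produced above; 4 dims for brevity.
embedding = np.array([[0.5, -1.25, 2.0, 0.0]], dtype=np.float32)

# Serialize: one space-separated string per overview.
serialized = " ".join(embedding[0].astype(str))

# Parse back, mirroring the string splitting done later in the input pipeline.
restored = np.array([float(x) for x in serialized.split(" ")], dtype=np.float32)
print(serialized)
```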

In [None]:
embeddings = {"embeddings": array_strings}
embedding_df = pd.DataFrame(data=embeddings)

In [None]:
# array_strings was built in the same order as overview_unique, so the list can be assigned directly.
embedding_df["overview"] = overview_unique

In [None]:
movies = pd.merge(movies, embedding_df[['embeddings', 'overview']], 
                  how='left', left_on='overview', right_on='overview')

In [None]:
ratings = pd.merge(ratings, movies[['embeddings', 'movie_id']],
                   how='left', left_on='movie_id', right_on='movie_id')

In [None]:
ratings.dropna(inplace=True)

### Transform the movie ratings data into sequences

Here, we sort the ratings data using the `unix_timestamp`, and then group the
`movie_id` values and the `rating` values by `user_id`.

The output DataFrame will have a record for each `user_id`, with two ordered lists
(sorted by rating datetime): the movies they have rated, and their ratings of these movies.

In [None]:
ratings_group = ratings.sort_values(by=["unix_timestamp"]).groupby("user_id")

# Each column is built from the matching column of the grouped ratings.
ratings_data = pd.DataFrame(
    data={
        "user_id": list(ratings_group.groups.keys()),
        "movie_ids": list(ratings_group.movie_id.apply(list)),
        "ratings": list(ratings_group.rating.apply(list)),
        "embeddings": list(ratings_group.embeddings.apply(list)),
        "timestamps": list(ratings_group.unix_timestamp.apply(list)),
        "budget": list(ratings_group.budget.apply(list)),
        "overview": list(ratings_group.overview.apply(list)),
        "popularity": list(ratings_group.popularity.apply(list)),
        "revenue": list(ratings_group.revenue.apply(list)),
        "runtime": list(ratings_group.runtime.apply(list)),
        "vote_average": list(ratings_group.vote_average.apply(list)),
        "vote_count": list(ratings_group.vote_count.apply(list))
    }
)
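The sort-then-group pattern can be seen on a handful of made-up ratings. The frame `toy_ratings` is invented for illustration and is deliberately out of chronological order.

```python
import pandas as pd

# Made-up ratings, deliberately out of chronological order.
toy_ratings = pd.DataFrame(
    {
        "user_id": ["user_1", "user_2", "user_1", "user_1"],
        "movie_id": ["movie_3", "movie_9", "movie_1", "movie_2"],
        "rating": [5.0, 2.0, 4.0, 3.0],
        "unix_timestamp": [300, 150, 100, 200],
    }
)

# Sorting first means each per-user list comes out in chronological order.
toy_group = toy_ratings.sort_values(by=["unix_timestamp"]).groupby("user_id")
toy_sequences = pd.DataFrame(
    {
        "user_id": list(toy_group.groups.keys()),
        "movie_ids": list(toy_group.movie_id.apply(list)),
        "ratings": list(toy_group.rating.apply(list)),
    }
)
print(toy_sequences)
```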


Now, let's split the `movie_ids` list into a set of sequences of a fixed length.
We do the same for the `ratings`. Set the `sequence_length` variable to change the length
of the input sequence to the model. You can also change the `step_size` to control the
number of sequences to generate for each user.

In [None]:
sequence_length = 4
step_size = 2


def create_sequences(values, window_size, step_size):
    sequences = []
    start_index = 0
    while True:
        end_index = start_index + window_size
        seq = values[start_index:end_index]
        if len(seq) < window_size:
            seq = values[-window_size:]
            if len(seq) == window_size:
                sequences.append(seq)
            break
        sequences.append(seq)
        start_index += step_size
    return sequences


ratings_data.movie_ids = ratings_data.movie_ids.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.ratings = ratings_data.ratings.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.budget = ratings_data.budget.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)
ratings_data.overview = ratings_data.overview.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.embeddings = ratings_data.embeddings.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.popularity = ratings_data.popularity.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.revenue = ratings_data.revenue.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.runtime = ratings_data.runtime.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.vote_average = ratings_data.vote_average.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.vote_count = ratings_data.vote_count.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

del ratings_data["timestamps"]
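A worked example makes the windowing behavior easier to follow. The function below is a copy of `create_sequences` from the cell above, repeated only so that the snippet runs on its own; the integer lists stand in for a user's rating history.

```python
# Copy of `create_sequences` from the cell above so the snippet runs on its own.
def create_sequences(values, window_size, step_size):
    sequences = []
    start_index = 0
    while True:
        end_index = start_index + window_size
        seq = values[start_index:end_index]
        if len(seq) < window_size:
            # Tail shorter than a full window: fall back to the last full window.
            seq = values[-window_size:]
            if len(seq) == window_size:
                sequences.append(seq)
            break
        sequences.append(seq)
        start_index += step_size
    return sequences


seven = create_sequences([1, 2, 3, 4, 5, 6, 7], window_size=4, step_size=2)
# -> [[1, 2, 3, 4], [3, 4, 5, 6], [4, 5, 6, 7]]
eight = create_sequences([1, 2, 3, 4, 5, 6, 7, 8], window_size=4, step_size=2)
# -> [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8], [5, 6, 7, 8]]
short = create_sequences([1, 2, 3], window_size=4, step_size=2)
# -> [] (fewer items than one window)
```

When the history length lines up with the stride, as in `eight`, the final window is emitted twice; dropping such duplicates may be worth considering. Users with fewer ratings than one window contribute no sequences.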

After that, we process the output to have each sequence in a separate record in
the DataFrame. In addition, we join the user features with the ratings data.

In [None]:
ratings_data_movies = ratings_data[["user_id", "movie_ids"]].explode(
    "movie_ids", ignore_index=True
)
ratings_data_rating = ratings_data[["ratings"]].explode("ratings", ignore_index=True)
ratings_data_budget = ratings_data[["budget"]].explode("budget", ignore_index=True)
ratings_data_overview = ratings_data[["overview"]].explode("overview", ignore_index=True)
ratings_data_embeddings = ratings_data[["embeddings"]].explode("embeddings", ignore_index=True)
ratings_data_popularity = ratings_data[["popularity"]].explode("popularity", ignore_index=True)
ratings_data_revenue = ratings_data[["revenue"]].explode("revenue", ignore_index=True)
ratings_data_runtime = ratings_data[["runtime"]].explode("runtime", ignore_index=True)
ratings_data_vote_average = ratings_data[["vote_average"]].explode("vote_average", ignore_index=True)
ratings_data_vote_count = ratings_data[["vote_count"]].explode("vote_count", ignore_index=True)
ratings_data_transformed = pd.concat([ratings_data_movies, ratings_data_rating, ratings_data_budget, ratings_data_popularity, ratings_data_overview,
                                      ratings_data_embeddings, ratings_data_revenue, ratings_data_runtime, ratings_data_vote_average, ratings_data_vote_count], axis=1)
ratings_data_transformed = ratings_data_transformed.join(
    users.set_index("user_id"), on="user_id"
)
ratings_data_transformed.movie_ids = ratings_data_transformed.movie_ids.apply(
    lambda x: ",".join(x)
)
ratings_data_transformed.overview = ratings_data_transformed.overview.apply(
    lambda x: "???".join([str(v) for v in x])
)
ratings_data_transformed.embeddings = ratings_data_transformed.embeddings.apply(
    lambda x: "?".join([str(v) for v in x])
)

ratings_data_transformed.ratings = ratings_data_transformed.ratings.apply(
    lambda x: ",".join([str(v) for v in x])
)
ratings_data_transformed.budget = ratings_data_transformed.budget.apply(
    lambda x: ",".join([str(v) for v in x])
)
ratings_data_transformed.popularity = ratings_data_transformed.popularity.apply(
    lambda x: ",".join([str(v) for v in x])
)
ratings_data_transformed.revenue = ratings_data_transformed.revenue.apply(
    lambda x: ",".join([str(v) for v in x])
)
ratings_data_transformed.runtime = ratings_data_transformed.runtime.apply(
    lambda x: ",".join([str(v) for v in x])
)
ratings_data_transformed.vote_average = ratings_data_transformed.vote_average.apply(
    lambda x: ",".join([str(v) for v in x])
)
ratings_data_transformed.vote_count = ratings_data_transformed.vote_count.apply(
    lambda x: ",".join([str(v) for v in x])
)

del ratings_data_transformed["zip_code"]

ratings_data_transformed.rename(
    columns={"movie_ids": "sequence_movie_ids", "ratings": "sequence_ratings"},
    inplace=True,
)

With `sequence_length` of 4 and `step_size` of 2, we end up with 498,623 sequences.

Finally, we split the data into training and testing splits, with 85% and 15% of
the instances, respectively, and store them to CSV files.

In [None]:
random_selection = np.random.rand(len(ratings_data_transformed.index)) <= 0.85
train_data = ratings_data_transformed[random_selection]
test_data = ratings_data_transformed[~random_selection]

train_data.to_csv("train_data.csv", index=False, sep="|", header=False)
test_data.to_csv("test_data.csv", index=False, sep="|", header=False)
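Because each row is assigned by an independent random draw, the 85/15 proportions are approximate rather than exact. The sketch below shows the same masking idea with a seeded generator for reproducibility; the seed and `n_rows` are assumptions for illustration, as the notebook does not fix a seed.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the split is reproducible
n_rows = 10_000                  # hypothetical number of sequences
train_mask = rng.random(n_rows) <= 0.85

# Row positions assigned to each split; every row lands in exactly one.
train_indices = np.flatnonzero(train_mask)
test_indices = np.flatnonzero(~train_mask)
print(len(train_indices), len(test_indices))
```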

## Define metadata

In [None]:
CSV_HEADER = list(ratings_data_transformed.columns)

CATEGORICAL_FEATURES_WITH_VOCABULARY = {
    "user_id": list(users.user_id.unique()),
    "movie_id": list(movies.movie_id.unique()),
    "sex": list(users.sex.unique()),
    "age_group": list(users.age_group.unique()),
    "occupation": list(users.occupation.unique()),
}

NUMERICAL_FEATURES = ["budget", "popularity", "revenue", "runtime", "vote_average", "vote_count"]

USER_FEATURES = ["sex", "age_group", "occupation"]

MOVIE_FEATURES = ["genres"]

## Create `tf.data.Dataset` for training and evaluation

In [None]:

def get_dataset_from_csv(csv_file_path, shuffle=False, batch_size=128):
    def process(features):
        movie_ids_string = features["sequence_movie_ids"]
        sequence_movie_ids = tf.strings.split(movie_ids_string, ",").to_tensor()

        # The last movie id in the sequence is the target movie.
        features["target_movie_id"] = sequence_movie_ids[:, -1]
        features["sequence_movie_ids"] = sequence_movie_ids[:, :-1]

        # Embeddings
        embeddings_string = features["embeddings"]
        sequence_embeddings = tf.strings.to_number(
            tf.strings.split(tf.strings.split(embeddings_string, "?"), " "), tf.dtypes.float32 
        ).to_tensor()
        features["target_embedding"] = sequence_embeddings[:, -1]
        features["sequence_embedding"] = sequence_embeddings[:, :-1]
        
        # Overview
        overview_string = features["overview"]
        sequence_overview = tf.strings.split(overview_string, "???").to_tensor()
        features["target_overview"] = sequence_overview[:, -1]
        features["sequence_overview"] = sequence_overview[:, :-1]

        ratings_string = features["sequence_ratings"]
        sequence_ratings = tf.strings.to_number(
            tf.strings.split(ratings_string, ","), tf.dtypes.float32
        ).to_tensor()

        # The last rating in the sequence is the target for the model to predict.
        target = sequence_ratings[:, -1]
        features["sequence_ratings"] = sequence_ratings[:, :-1]

        # Budget
        budget_string = features["budget"]
        sequence_budget = tf.strings.to_number(tf.strings.split(budget_string, ","), tf.dtypes.float32).to_tensor()
        features["target_budget"] = sequence_budget[:, -1]
        features["sequence_budget"] = sequence_budget[:, :-1]

        # Popularity
        popularity_string = features["popularity"]
        sequence_popularity = tf.strings.to_number(tf.strings.split(popularity_string, ","), tf.dtypes.float32).to_tensor()
        features["target_popularity"] = sequence_popularity[:, -1]
        features["sequence_popularity"] = sequence_popularity[:, :-1]

        # Revenue
        revenue_string = features["revenue"]
        sequence_revenue = tf.strings.to_number(tf.strings.split(revenue_string, ","), tf.dtypes.float32).to_tensor()
        features["target_revenue"] = sequence_revenue[:, -1]
        features["sequence_revenue"] = sequence_revenue[:, :-1]

        # Runtime
        runtime_string = features["runtime"]
        sequence_runtime = tf.strings.to_number(tf.strings.split(runtime_string, ","), tf.dtypes.float32).to_tensor()
        features["target_runtime"] = sequence_runtime[:, -1]
        features["sequence_runtime"] = sequence_runtime[:, :-1]

        # Vote Average
        vote_average_string = features["vote_average"]
        sequence_vote_average = tf.strings.to_number(tf.strings.split(vote_average_string, ","), tf.dtypes.float32).to_tensor()
        features["target_vote_average"] = sequence_vote_average[:, -1]
        features["sequence_vote_average"] = sequence_vote_average[:, :-1]

        # Vote Count
        vote_count_string = features["vote_count"]
        sequence_vote_count = tf.strings.to_number(tf.strings.split(vote_count_string, ","), tf.dtypes.float32).to_tensor()
        features["target_vote_count"] = sequence_vote_count[:, -1]
        features["sequence_vote_count"] = sequence_vote_count[:, :-1]

        return features, target

    dataset = tf.data.experimental.make_csv_dataset(
        csv_file_path,
        batch_size=batch_size,
        column_names=CSV_HEADER,
        num_epochs=1,
        header=False,
        field_delim="|",
        shuffle=shuffle,
    ).map(process)

    return dataset
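The parsing done inside `process` can be mirrored in plain Python on a single made-up row, which makes the string formats explicit. The row values and the 2-dimensional embeddings are invented; the real embeddings have 768 dimensions.

```python
# Made-up row values in the string formats written to the CSV files.
row = {
    "sequence_movie_ids": "movie_1,movie_2,movie_3,movie_4",
    "sequence_ratings": "4.0,3.0,5.0,2.0",
    "embeddings": "0.1 0.2?0.3 0.4?0.5 0.6?0.7 0.8",  # 2 dims instead of 768
}

movie_ids = row["sequence_movie_ids"].split(",")
ratings = [float(r) for r in row["sequence_ratings"].split(",")]
# "?" separates movies, a space separates the components of one embedding.
embeddings = [[float(x) for x in vec.split(" ")] for vec in row["embeddings"].split("?")]

# The last element of every sequence belongs to the target movie.
features = {
    "sequence_movie_ids": movie_ids[:-1],
    "target_movie_id": movie_ids[-1],
    "sequence_ratings": ratings[:-1],
    "sequence_embedding": embeddings[:-1],
    "target_embedding": embeddings[-1],
}
target = ratings[-1]  # the rating the model should predict
print(features["target_movie_id"], target)
```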


## Create model inputs

In [None]:

def create_model_inputs():
    return {
        "user_id": layers.Input(name="user_id", shape=(1,), dtype=tf.string),
        "sequence_movie_ids": layers.Input(
            name="sequence_movie_ids", shape=(sequence_length - 1,), dtype=tf.string
        ),
        "target_movie_id": layers.Input(
            name="target_movie_id", shape=(1,), dtype=tf.string
        ),
        "sequence_embedding": layers.Input(
            name="sequence_embedding", shape=(sequence_length - 1, 768), dtype=tf.float32
        ),
        "target_embedding": layers.Input(
            name="target_embedding", shape=(1, 768), dtype=tf.float32
        ),
        "sequence_ratings": layers.Input(
            name="sequence_ratings", shape=(sequence_length - 1,), dtype=tf.float32
        ),
        "sequence_budget": layers.Input(
            name="sequence_budget", shape=(sequence_length - 1,), dtype=tf.float32
        ),
        "target_budget": layers.Input(
            name="target_budget", shape=(1,), dtype=tf.float32
        ),
        "sequence_overview": layers.Input(
            name="sequence_overview", shape=(sequence_length - 1,), dtype=tf.string
        ),
        "target_overview": layers.Input(
            name="target_overview", shape=(1,), dtype=tf.string
        ),
        "sequence_popularity": layers.Input(
            name="sequence_popularity", shape=(sequence_length - 1,), dtype=tf.float32
        ),
        "target_popularity": layers.Input(
            name="target_popularity", shape=(1,), dtype=tf.float32
        ),
        "sequence_revenue": layers.Input(
            name="sequence_revenue", shape=(sequence_length - 1,), dtype=tf.float32
        ),
        "target_revenue": layers.Input(
            name="target_revenue", shape=(1,), dtype=tf.float32
        ),
        "sequence_runtime": layers.Input(
            name="sequence_runtime", shape=(sequence_length - 1,), dtype=tf.float32
        ),
        "target_runtime": layers.Input(
            name="target_runtime", shape=(1,), dtype=tf.float32
        ),
        "sequence_vote_average": layers.Input(
            name="sequence_vote_average", shape=(sequence_length - 1,), dtype=tf.float32
        ),
        "target_vote_average": layers.Input(
            name="target_vote_average", shape=(1,), dtype=tf.float32
        ),
        "sequence_vote_count": layers.Input(
            name="sequence_vote_count", shape=(sequence_length - 1,), dtype=tf.float32
        ),
        "target_vote_count": layers.Input(
            name="target_vote_count", shape=(1,), dtype=tf.float32
        ),

        "sex": layers.Input(name="sex", shape=(1,), dtype=tf.string),
        "age_group": layers.Input(name="age_group", shape=(1,), dtype=tf.string),
        "occupation": layers.Input(name="occupation", shape=(1,), dtype=tf.string),
    }


## Encode input features

The `encode_input_features` method works as follows:

1. Each categorical user feature is encoded using `layers.Embedding`, with an embedding
dimension equal to the square root of the vocabulary size of the feature.
The embeddings of these features are concatenated to form a single input tensor.

2. Each movie in the movie sequence and the target movie is encoded using `layers.Embedding`,
where the dimension size is the square root of the number of movies.

3. A multi-hot genres vector for each movie is concatenated with its embedding vector,
and processed using a non-linear `layers.Dense` to output a vector of the same movie
embedding dimensions.

4. A positional embedding is added to each movie embedding in the sequence, and then
multiplied by its rating from the ratings sequence. Unlike the original implementation, the embeddings also incorporate other numerical features as factors.

5. The target movie embedding is concatenated to the sequence movie embeddings, producing
a tensor with the shape of `[batch size, sequence length, embedding size]`, as expected
by the attention layer for the transformer architecture.

6. The method returns a tuple of three elements:  `encoded_transformer_features`,
`encoded_other_features`, and `encoded_bert_features`.
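Steps 2, 4, and 5 can be sketched with NumPy to make the tensor shapes explicit. All arrays below are random stand-ins for learned embeddings, and `num_movies` is an illustrative vocabulary size; this is a shape sketch under those assumptions, not the Keras implementation.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
batch, seq_len = 2, 3          # seq_len corresponds to sequence_length - 1
num_movies = 3883              # illustrative vocabulary size
dim = int(math.sqrt(num_movies))  # square-root rule for the embedding size

movie_emb = rng.normal(size=(batch, seq_len, dim))  # stand-in movie embeddings
pos_emb = rng.normal(size=(seq_len, dim))           # stand-in position embeddings
ratings = rng.uniform(1, 5, size=(batch, seq_len))  # made-up ratings
target_emb = rng.normal(size=(batch, 1, dim))       # stand-in target movie embedding

# Step 4: add positions (broadcast over the batch), then scale by the rating.
scaled = (movie_emb + pos_emb) * ratings[..., np.newaxis]
# Step 5: append the target movie along the sequence axis.
transformer_input = np.concatenate([scaled, target_emb], axis=1)
print(dim, transformer_input.shape)
```

The additional numerical features (budget, popularity, and so on) enter the same way as the rating here, as further multiplicative factors on each movie's embedding.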

In [None]:

def encode_input_features(
    inputs,
    include_user_id=True,
    include_user_features=True,
    include_movie_features=True,
):

    encoded_transformer_features = []
    encoded_other_features = []
    encoded_bert_features = []

    other_feature_names = []
    if include_user_id:
        other_feature_names.append("user_id")
    if include_user_features:
        other_feature_names.extend(USER_FEATURES)

    ## Encode user features
    for feature_name in other_feature_names:
        # Convert the string input values into integer indices.
        vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
        idx = StringLookup(vocabulary=vocabulary, mask_token=None, num_oov_indices=0)(
            inputs[feature_name]
        )
        # Compute embedding dimensions
        embedding_dims = int(math.sqrt(len(vocabulary)))
        # Create an embedding layer with the specified dimensions.
        embedding_encoder = layers.Embedding(
            input_dim=len(vocabulary),
            output_dim=embedding_dims,
            name=f"{feature_name}_embedding",
        )
        # Convert the index values to embedding representations.
        encoded_other_features.append(embedding_encoder(idx))

    ## Create a single embedding vector for the user features
    if len(encoded_other_features) > 1:
        encoded_other_features = layers.concatenate(encoded_other_features)
    elif len(encoded_other_features) == 1:
        encoded_other_features = encoded_other_features[0]
    else:
        encoded_other_features = None

    ## Create a movie embedding encoder
    movie_vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY["movie_id"]
    movie_embedding_dims = int(math.sqrt(len(movie_vocabulary)))
    # Create a lookup to convert string values to integer indices.
    movie_index_lookup = StringLookup(
        vocabulary=movie_vocabulary,
        mask_token=None,
        num_oov_indices=0,
        name="movie_index_lookup",
    )
    # Create an embedding layer with the specified dimensions.
    movie_embedding_encoder = layers.Embedding(
        input_dim=len(movie_vocabulary),
        output_dim=movie_embedding_dims,
        name="movie_embedding",
    )
    # Create a vector lookup for movie genres.
    genre_vectors = movies[genres].to_numpy()
    movie_genres_lookup = layers.Embedding(
        input_dim=genre_vectors.shape[0],
        output_dim=genre_vectors.shape[1],
        embeddings_initializer=tf.keras.initializers.Constant(genre_vectors),
        trainable=False,
        name="genres_vector",
    )
    # Create a processing layer for genres.
    movie_embedding_processor = layers.Dense(
        units=movie_embedding_dims,
        activation="relu",
        name="process_movie_embedding_with_genres",
    )

    ## Define a function to encode a given movie id.
    def encode_movie(movie_id):
        # Convert the string input values into integer indices.
        movie_idx = movie_index_lookup(movie_id)
        movie_embedding = movie_embedding_encoder(movie_idx)
        encoded_movie = movie_embedding
        if include_movie_features:
            movie_genres_vector = movie_genres_lookup(movie_idx)
            encoded_movie = movie_embedding_processor(
                layers.concatenate([movie_embedding, movie_genres_vector])
            )
        return encoded_movie

    ## Encoding target_movie_id
    target_movie_id = inputs["target_movie_id"]
    encoded_target_movie = encode_movie(target_movie_id)

    ## Encoding sequence movie_ids.
    sequence_movies_ids = inputs["sequence_movie_ids"]
    encoded_sequence_movies = encode_movie(sequence_movies_ids)
    # Create positional embedding.
    position_embedding_encoder = layers.Embedding(
        input_dim=sequence_length,
        output_dim=movie_embedding_dims,
        name="position_embedding",
    )
    # The input sequence holds sequence_length - 1 movies; the last element is the target.
    positions = tf.range(start=0, limit=sequence_length - 1, delta=1)
    encoded_positions = position_embedding_encoder(positions)
    # Retrieve sequence ratings to incorporate them into the encoding of the movie.
    sequence_ratings = tf.expand_dims(inputs["sequence_ratings"], -1)

    # Retrieve the other numerical sequences to incorporate them into the encoding of the movie.
    sequence_budget = tf.expand_dims(inputs["sequence_budget"], -1)
    sequence_popularity = tf.expand_dims(inputs["sequence_popularity"], -1)
    sequence_revenue = tf.expand_dims(inputs["sequence_revenue"], -1)
    sequence_runtime = tf.expand_dims(inputs["sequence_runtime"], -1)
    sequence_vote_average = tf.expand_dims(inputs["sequence_vote_average"], -1)
    sequence_vote_count = tf.expand_dims(inputs["sequence_vote_count"], -1)
    # Add the positional encoding to the movie encodings, then multiply the result
    # by the ratings and the other numerical sequences.
    encoded_sequence_movies_with_position_and_rating = layers.Multiply()(
        [
            (encoded_sequence_movies + encoded_positions),
            sequence_ratings,
            sequence_budget,
            sequence_popularity,
            sequence_revenue,
            sequence_runtime,
            sequence_vote_average,
            sequence_vote_count,
        ]
    )

    # Construct the transformer inputs.
    for encoded_movie in tf.unstack(
        encoded_sequence_movies_with_position_and_rating, axis=1
    ):
        encoded_transformer_features.append(tf.expand_dims(encoded_movie, 1))
    encoded_transformer_features.append(encoded_target_movie)
    
    # Collect the BERT embedding input for the target movie.
    encoded_bert_features.append(inputs["target_embedding"])

    encoded_transformer_features = layers.concatenate(
        encoded_transformer_features, axis=1
    )

    return encoded_transformer_features, encoded_other_features, encoded_bert_features

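The combination step in `encode_input_features` depends on broadcasting: the position embeddings have no batch axis, and each numerical sequence gains a trailing axis of size 1 from `tf.expand_dims`. The NumPy sketch below mirrors those shapes with small hypothetical sizes and only two of the numerical sequences. The position table is a stand-in for the learned embedding, not values taken from the model.

```python
import numpy as np

# Hypothetical sizes: seq_len plays the role of sequence_length - 1.
batch, seq_len, dims = 2, 3, 4

# Stand-in for the encoded movie sequence, shape (batch, seq_len, dims).
encoded_movies = np.ones((batch, seq_len, dims))

# Stand-in for the learned position embedding table, shape (seq_len, dims).
position_table = np.arange(seq_len * dims, dtype=float).reshape(seq_len, dims)
encoded_positions = position_table[np.arange(seq_len)]

# Per-position numerical features, each expanded to shape (batch, seq_len, 1).
ratings = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 1.0]])[..., np.newaxis]
runtime = np.full((batch, seq_len, 1), 0.5)
runtime[0, 1, 0] = 0.0  # a zero-valued feature at one position

# Positions broadcast over the batch axis; the features broadcast over dims.
combined = (encoded_movies + encoded_positions) * ratings * runtime

print(combined.shape)  # (2, 3, 4)
print(combined[0, 1])  # all zeros, because runtime is 0 at that position
```

Because the features are combined by multiplication, a zero in any numerical sequence zeroes the whole vector for that position. This is worth keeping in mind if a feature can legitimately take the value zero.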

## Developing Model Architecture

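The model applies a single transformer block to the encoded movie sequence, flattens its output, and concatenates it with the BERT features (plus any other enabled features). The result is fed to a fully connected ReLU network that predicts the rating.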

In [None]:
include_user_id = False
include_user_features = False
include_movie_features = False

hidden_units = [256, 128, 128]
dropout_rate = 0.5
num_heads = 3


def create_model():
    inputs = create_model_inputs()
    transformer_features, other_features, bert_features = encode_input_features(
        inputs, include_user_id, include_user_features, include_movie_features
    )

    # Create a multi-headed attention layer.
    attention_output = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=transformer_features.shape[2], dropout=dropout_rate
    )(transformer_features, transformer_features)

    # Transformer block.
    attention_output = layers.Dropout(dropout_rate)(attention_output)
    x1 = layers.Add()([transformer_features, attention_output])
    x1 = layers.LayerNormalization()(x1)
    x2 = layers.LeakyReLU()(x1)
    x2 = layers.Dense(units=x2.shape[-1])(x2)
    x2 = layers.Dropout(dropout_rate)(x2)
    transformer_features = layers.Add()([x1, x2])
    transformer_features = layers.LayerNormalization()(transformer_features)
    features = layers.Flatten()(transformer_features)
    features = layers.concatenate(
        [features, layers.Reshape([bert_features[0].shape[-1]])(bert_features[0])]
    )

    # Include the other features.
    if other_features is not None:
        features = layers.concatenate(
            [features, layers.Reshape([other_features.shape[-1]])(other_features)]
        )

    # Fully-connected layers.
    for num_units in hidden_units:
        features = layers.Dense(num_units)(features)
        features = layers.BatchNormalization()(features)
        features = layers.ReLU()(features)
        features = layers.Dropout(dropout_rate)(features)

    outputs = layers.Dense(units=1)(features)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


model = create_model()

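The data flow after the attention layer in `create_model` can be traced in isolation. The NumPy sketch below is a rough, inference-time approximation under stated assumptions: the sizes are hypothetical, random arrays stand in for the attention output and the BERT input, dropout is omitted, layer normalization uses its initial scale of 1 and offset of 0, and the BERT input is assumed to carry an extra axis of size 1 (which is what the `Reshape` call suggests).

```python
import numpy as np


def layer_norm(x, eps=1e-3):
    """Normalize over the last axis (scale 1, offset 0, as at initialization)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
batch, seq_len, dims, bert_dims = 2, 4, 8, 16  # hypothetical sizes

transformer_features = rng.normal(size=(batch, seq_len, dims))
attention_output = rng.normal(size=(batch, seq_len, dims))  # stand-in values

# First residual connection followed by layer normalization.
x1 = layer_norm(transformer_features + attention_output)

# Leaky ReLU (slope 0.3 for negative inputs) followed by a dense projection.
weights = rng.normal(size=(dims, dims))
bias = np.zeros(dims)
x2 = np.where(x1 > 0, x1, 0.3 * x1) @ weights + bias

# Second residual connection followed by layer normalization.
block_output = layer_norm(x1 + x2)

# Flatten the sequence axis and append the BERT vector.
flat = block_output.reshape(batch, -1)
bert = rng.normal(size=(batch, 1, bert_dims))  # assumed input shape
features = np.concatenate([flat, bert.reshape(batch, bert_dims)], axis=1)

print(features.shape)  # (2, 48), i.e. seq_len * dims + bert_dims columns
```

The width of the vector entering the fully connected layers is therefore `seq_len * dims + bert_dims`, plus the width of any other enabled features.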


## Training & Testing



In [None]:
# Compile the model.


#model.compile(
#    optimizer=keras.optimizers.Adagrad(learning_rate=0.01),
#    loss=keras.losses.MeanSquaredError(),
#    metrics=[keras.metrics.MeanAbsoluteError()],
#)

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0),
    loss=keras.losses.MeanSquaredError(),
    metrics=[keras.metrics.MeanAbsoluteError()],
)

# Read the training data.
train_dataset = get_dataset_from_csv("train_data.csv", shuffle=True, batch_size=265)

# Fit the model with the training data.
model.fit(train_dataset, epochs=5)

# Read the test data.
test_dataset = get_dataset_from_csv("test_data.csv", batch_size=265)

# Evaluate the model on the test data.
# model.evaluate returns the loss (MSE) followed by the compiled metric (MAE).
_, mae = model.evaluate(test_dataset, verbose=0)
print(f"Test MAE: {round(mae, 3)}")

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test MAE: 0.793

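The model is trained on mean squared error but reported with mean absolute error, so the `Test MAE` value is the second item returned by `model.evaluate` and is not an RMSE. As a minimal sketch, the three quantities can be compared on a few made-up ratings (hypothetical values, not taken from the dataset):

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true ratings and predictions.
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 3.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = math.sqrt(mse)

print(f"MSE:  {mse}")   # 0.5625
print(f"MAE:  {mae}")   # 0.625
print(f"RMSE: {rmse}")  # 0.75
```

MAE is in the same units as the ratings, so a test MAE of 0.793 means predictions are off by roughly 0.8 rating points on average.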

Several optimizers were tried; SGD with momentum and gradient-norm clipping (`clipnorm=1.0`) appeared to be the most effective.
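To make the optimizer settings concrete, here is a minimal NumPy sketch of one update for a single weight tensor, assuming the standard (non-Nesterov) momentum rule and per-tensor norm clipping. The helper names `clip_by_norm` and `sgd_momentum_step` are hypothetical and are not part of Keras; the hyperparameters match the `model.compile` call.

```python
import numpy as np


def clip_by_norm(grad, clipnorm):
    """Rescale grad so that its L2 norm does not exceed clipnorm."""
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        return grad * (clipnorm / norm)
    return grad


def sgd_momentum_step(w, grad, velocity, learning_rate=0.01, momentum=0.9, clipnorm=1.0):
    """One SGD update with momentum applied to the clipped gradient."""
    g = clip_by_norm(grad, clipnorm)
    velocity = momentum * velocity - learning_rate * g
    return w + velocity, velocity


w = np.array([1.0, -2.0])
velocity = np.zeros_like(w)
grad = np.array([3.0, 4.0])  # L2 norm 5, so it is rescaled to [0.6, 0.8]

w, velocity = sgd_momentum_step(w, grad, velocity)
print(velocity)  # approximately [-0.006, -0.008]
print(w)         # approximately [0.994, -2.008]
```

Clipping bounds the size of each step regardless of how large the raw gradient is, while momentum carries part of the previous step forward.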

## Conclusion

The BST model uses the Transformer layer in its architecture to capture the sequential signals underlying
users’ behavior sequences for recommendation.

This project combined numerical, textual, and categorical features in a single model. Each feature type required different handling, and preprocessing made up the majority of the work. I found that keeping the numerical features as continuous values was more effective than bucketizing them and treating them as categorical.
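The difference between the two treatments of a numerical feature can be illustrated with a small sketch. The runtimes, bucket boundaries, and min-max scaling below are all hypothetical and chosen only for illustration; they are not the preprocessing used in this project.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into."""
    return bisect_right(boundaries, value)


def min_max_scale(value, low, high):
    """Map value linearly from [low, high] to [0, 1]."""
    return (value - low) / (high - low)


# Hypothetical movie runtimes in minutes and hypothetical bucket boundaries.
runtimes = [88, 95, 121, 124, 180]
boundaries = [90, 120, 150]

buckets = [bucketize(r, boundaries) for r in runtimes]
scaled = [min_max_scale(r, min(runtimes), max(runtimes)) for r in runtimes]

print(buckets)  # [0, 1, 2, 2, 3]
print([round(s, 3) for s in scaled])
```

Bucketizing maps 121 and 124 to the same category, whereas the continuous value keeps them distinct and preserves their ordering and spacing.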

The instructions left some room for interpretation. The provided paper that uses the hidden states of BERT extracts the embeddings and uses them as the embedding layer of the neural network classifier (a regression model in this project). The architecture described in the instructions appears to differ: the BERT outputs are concatenated with the transformer output and then fed into the neural network, rather than forming part of the network architecture itself.
