# Collaborative Filtering Using TensorFlow & Keras
In our [Recommender Course](https://www.codingforentrepreneurs.com/courses/recommender/) we build a Django-based recommendation engine that uses the Surprise ML package (among other things). This guide helps you upgrade that ML component by using Keras and a neural network.


Recommended requirements for running this notebook:
- GPU-accelerated / CUDA-enabled environment
- Cloud-based service such as Google Colab, Deepnote, and/or Paperspace
- [Recommender](https://github.com/codingforentrepreneurs/recommender) code forked/cloned/downloaded, open-source datasets loaded in, and Recommender models exported
- To export the [Recommender](https://github.com/codingforentrepreneurs/recommender)'s datasets, run the functions `export_rating_dataset_task` and `export_movies_dataset_task` in `exports/tasks.py`
- After you run these functions, the movies dataset will be located in `local-cdn/media/exports/movies/latest.csv` and the ratings dataset in `local-cdn/media/exports/ratings/latest.csv`



This code was inspired by and adapted from the following posts:
- [Fast.ai's Collaborative Filtering Lesson](https://course.fast.ai/Lessons/lesson7.html)
- [How to create a Recommendation System from scratch using Keras](https://antonai.blog/how-to-create-a-recommendation-system-from-scratch-using-keras/) from the Antonai Blog
- [Collaborative Filtering for Movie Recommendations](https://keras.io/examples/structured_data/collaborative_filtering_movielens/) from the Keras Docs


### Open this notebook in...

[<img src="https://deepnote.com/buttons/launch-in-deepnote-white-small.svg">](https://deepnote.com/launch?url=https://github.com/codingforentrepreneurs/recommender/blob/main/src/nbs/Example%20Collaborative%20Filtering%20with%20Tensorflow%20Keras.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/codingforentrepreneurs/recommender/blob/main/src/nbs/Example%20Collaborative%20Filtering%20with%20Tensorflow%20Keras.ipynb)

[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/codingforentrepreneurs/recommender/blob/main/src/nbs/Example%20Collaborative%20Filtering%20with%20Tensorflow%20Keras.ipynb)

In [2]:
# !pip install tensorflow scikit-learn matplotlib pandas  # the PyPI name is scikit-learn; "sklearn" is a deprecated placeholder

Collecting tensorflow
  Downloading tensorflow-2.13.1-cp38-cp38-win_amd64.whl (1.9 kB)
Collecting sklearn
  Using cached sklearn-0.0.post11.tar.gz (3.6 kB)
Collecting matplotlib
  Downloading matplotlib-3.7.4-cp38-cp38-win_amd64.whl (7.5 MB)


ERROR: Could not find a version that satisfies the requirement tensorflow-intel==2.13.1; platform_system == "Windows" (from tensorflow) (from versions: 0.0.1, 2.10.0.dev20220728, 2.10.0rc0, 2.10.0rc1, 2.10.0rc2, 2.10.0rc3, 2.10.0, 2.10.1, 2.11.0rc0, 2.11.0rc1, 2.11.0rc2, 2.11.0, 2.11.1, 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0)
ERROR: No matching distribution found for tensorflow-intel==2.13.1; platform_system == "Windows" (from tensorflow)


In [3]:
import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pathlib
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [25]:
# if using a cloud provider, upload your files to an "exports folder"
# exports_dir = pathlib.Path().resolve() / 'exports' 

# if running this notebook from the root of the Recommender project
exports_dir = pathlib.Path().resolve().parent / 'data' / 'local-cdn' / 'media' / 'exports'

movies_exports = exports_dir / 'movies' / 'latest.csv'
ratings_exports = exports_dir / 'ratings' / 'latest.csv'
print(movies_exports.exists(), ratings_exports.exists())

True True


Load in the movies dataset

In [26]:
movies_df = pd.read_csv(movies_exports)

# add a "trend" column to combine the count of ratings with the movie's average rating
movies_df['trend'] = movies_df['rating_count'] * movies_df['rating_avg']
movies_df['movieIdx'] = movies_df['movieIdx'].astype(int)
movies_df['movieId'] = movies_df['movieId'].astype(int)

print(movies_df.shape)
movies_df.head()

(42277, 6)


Unnamed: 0,release_date,rating_count,rating_avg,movieId,movieIdx,trend
0,1995-10-30,247.0,3.97,1,0,980.59
1,1986-10-16,59.0,3.24,3,1,191.16
2,1995-12-22,13.0,2.46,4,2,31.98
3,1995-12-09,56.0,3.36,5,3,188.16
4,1995-12-15,53.0,3.34,7,4,177.02


Load in the entire ratings dataset

In [27]:
rating_df = pd.read_csv(ratings_exports)
print(rating_df.shape)
rating_df.head()

(80913, 3)


Unnamed: 0,userId,movieId,rating
0,1,300665,4
1,1,439502,4
2,1,271404,5
3,671,4896,5
4,671,4963,5


Join the movies dataset and ratings dataset.

In [28]:
df = rating_df.copy()
df['userId'] = df['userId'].astype(int)
df['movieId'] = df['movieId'].astype(int)
df = df.join(movies_df, on='movieId', rsuffix='_movie_df')
df.sort_values(by=['trend'], inplace=True, ascending=False)
print(df.shape)
df.head()

(80913, 9)


Unnamed: 0,userId,movieId,rating,release_date,rating_count,rating_avg,movieId_movie_df,movieIdx,trend
13824,564,183,5,2000-02-09,311.0,4.57,318.0,183.0,1421.27
13832,564,206,5,1994-07-06,341.0,4.15,356.0,206.0,1415.15
17617,547,206,1,1994-07-06,341.0,4.15,356.0,206.0,1415.15
18388,536,206,4,1994-07-06,341.0,4.15,356.0,206.0,1415.15
31361,451,206,4,1994-07-06,341.0,4.15,356.0,206.0,1415.15
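
One detail worth checking in the output above: the `movieId` and `movieId_movie_df` columns disagree (183 vs. 318, 206 vs. 356). `DataFrame.join(other, on='movieId')` matches the caller's `movieId` column against the *index* of `other`, not against its `movieId` column. The sketch below uses small made-up frames (`toy_movies` and `toy_ratings` are hypothetical, not the real exports) to contrast that behavior with a key-based `merge`, which may be closer to what is intended here.

```python
import pandas as pd

# Hypothetical toy data standing in for the exported CSVs
toy_movies = pd.DataFrame({"movieId": [10, 20, 30], "movieIdx": [0, 1, 2]})
toy_ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [20, 1, 30], "rating": [4, 5, 3]})

# join(on=...) looks up toy_ratings.movieId in toy_movies' RangeIndex (0, 1, 2)
joined = toy_ratings.join(toy_movies, on="movieId", rsuffix="_movie_df")

# merge(on=...) matches the movieId columns of both frames
merged = toy_ratings.merge(toy_movies, on="movieId", how="left")

print(joined[["movieId", "movieId_movie_df", "movieIdx"]])
print(merged[["movieId", "movieIdx"]])
```

In the toy data, the rating for `movieId` 1 is joined to the row at position 1 (whose `movieId` is 20), while `merge` leaves it unmatched. Switching the notebook to `merge` would change the shapes and counts reported in the cells that follow.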


Note the number of rated movies that have no match in the movies dataset. They may be missing for a few reasons:
- The initial dataset had invalid ids (from the MovieLens dataset) (most likely)
- Movies have been deleted from the Recommender database (likely)
- Incorrect datatypes (unlikely but possible)

In [9]:
missing_data = df[df['movieIdx'].isna()]

number_of_missing_movies = len(missing_data.movieId.unique().tolist())
print(number_of_missing_movies, 'movie ids missing that were rated')

2733 movie ids missing that were rated
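
As a cross-check on the count above, the same question can be asked without a join: which rated ids never appear in the movies table? This is a minimal sketch on made-up data (`toy_movies` and `toy_ratings` are hypothetical), and it also verifies the "incorrect datatypes" possibility by comparing the key columns' dtypes first.

```python
import pandas as pd

# Hypothetical toy data; id 99 has ratings but no movie row
toy_movies = pd.DataFrame({"movieId": [1, 3, 4]})
toy_ratings = pd.DataFrame({"userId": [1, 1, 2, 2], "movieId": [1, 99, 4, 99]})

# Confirm both key columns share a dtype before comparing them
same_dtype = toy_movies["movieId"].dtype == toy_ratings["movieId"].dtype

# Rated ids with no matching row in the movies table
is_unmatched = ~toy_ratings["movieId"].isin(toy_movies["movieId"])
unmatched = toy_ratings.loc[is_unmatched, "movieId"]

print(same_dtype, unmatched.nunique(), "movie ids missing that were rated")
```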


Drop rows that lack a `movieIdx` value. Note that `dropna()` without arguments drops rows with a `NaN` in any column:

In [10]:
training_df = df.copy().dropna()
training_df['movieIdx'] = training_df['movieIdx'].astype(int)
training_df.shape

(53219, 9)
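
If the goal is to drop only rows without a `movieIdx`, `dropna` accepts a `subset` argument. The sketch below, on a hypothetical three-row frame, shows how the two calls can differ when another column (here `release_date`) also has gaps. Whether that matters for the real export depends on its data, which is not checked here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 lacks movieIdx, row 2 lacks release_date
toy = pd.DataFrame({
    "movieIdx": [0.0, np.nan, 2.0],
    "release_date": ["1995-10-30", "1986-10-16", None],
})

any_nan_dropped = toy.dropna()                     # drops rows 1 and 2
idx_nan_dropped = toy.dropna(subset=["movieIdx"])  # drops only row 1

print(len(any_nan_dropped), len(idx_nan_dropped))
```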

In [11]:
user_ids = training_df["userId"].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}



movie_ids = training_df["movieIdx"].unique().tolist()

df = training_df.copy()
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieIdx"]

num_users = len(user2user_encoded)
num_movies = len(movie_ids)

df["rating"] = training_df["rating"].values.astype(np.float32)
# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])

print(
    "Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating: {}".format(
        num_users, num_movies, min_rating, max_rating
    )
)

Number of users: 671, Number of Movies: 3324, Min rating: 1.0, Max rating: 5.0


In [12]:
df = df.sample(frac=1, random_state=42)
x = df[["user", "movie"]].values
# Normalize the targets between 0 and 1. Makes it easy to train.
y = df["rating"].apply(lambda x: (x - min_rating) / (max_rating - min_rating)).values
# Assuming training on 90% of the data and validating on 10%.
train_indices = int(0.9 * df.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)
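
The normalization in the cell above is min-max scaling, and it has a simple inverse. The sketch below shows both directions on a made-up array using vectorized NumPy instead of `apply`. In this notebook the model rescales its sigmoid output to the rating range and `model.fit` is called with the raw `rating` column, so the normalized `y` appears to go unused.

```python
import numpy as np

# Hypothetical ratings on the same 1-5 scale as the notebook
ratings = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
min_rating, max_rating = ratings.min(), ratings.max()

# Forward: map [min_rating, max_rating] onto [0, 1]
scaled = (ratings - min_rating) / (max_rating - min_rating)

# Inverse: map [0, 1] back onto the original rating range
restored = scaled * (max_rating - min_rating) + min_rating

print(scaled)
```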

In [13]:
from tensorflow.keras.layers import Embedding, multiply, concatenate, Flatten, Input, Dense
from tensorflow.keras import optimizers as opt

# from sklearn.model_selection import train_test_split
# from keras.layers import Input, Embedding, Flatten, Dot, Dense, Concatenate
# Import Model from tensorflow.keras so it matches the layers imported above
from tensorflow.keras.models import Model



EMBEDDING_SIZE = 500
num_unique_users = num_users
num_unique_movies = num_movies
users_input = Input(shape=(1,), name="users_input")
users_embedding = Embedding(num_unique_users + 1, EMBEDDING_SIZE, name="users_embeddings")(users_input)
users_bias = Embedding(num_unique_users + 1, 1, name="users_bias")(users_input)

movies_input = Input(shape=(1,), name="movies_input")
movies_embedding = Embedding(num_unique_movies + 1, EMBEDDING_SIZE, name="movies_embedding")(movies_input)
movies_bias = Embedding(num_unique_movies + 1, 1, name="movies_bias")(movies_input)

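# Element-wise product of the embeddings; despite the variable name, this is not yet a dot product
dot_product_users_movies = multiply([users_embedding, movies_embedding])
# The bias embeddings have shape (batch, 1, 1) and broadcast across the embedding dimension
input_terms = dot_product_users_movies + users_bias + movies_bias
input_terms = Flatten(name="fl_inputs")(input_terms)
# output = Dense(1, activation="relu", name="output")(input_terms)

# Sigmoid keeps the output in (0, 1); rescale it to the rating range
output = Dense(1, activation="sigmoid", name="output")(input_terms)
output = output * (max_rating - min_rating) + min_rating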


model = Model(inputs=[users_input, movies_input], outputs=output)

opt_adam = opt.Adam(learning_rate=0.005)
model.compile(optimizer=opt_adam, loss='mse', metrics=['mean_absolute_error'])
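
To make the architecture above easier to follow, here is a rough NumPy sketch of the forward pass for a batch of (user, movie) pairs. The sizes and random weights are toy assumptions, and `predict` is a hypothetical helper, not part of Keras. It mirrors the layers in order: embedding lookup, element-wise product, broadcast biases, a single dense unit, sigmoid, and rescaling to the rating range.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, dim = 4, 6, 3          # toy sizes, not the notebook's
min_rating, max_rating = 1.0, 5.0

# Randomly initialised stand-ins for the trained weights
user_emb = rng.normal(size=(n_users, dim))
movie_emb = rng.normal(size=(n_movies, dim))
user_bias = rng.normal(size=(n_users, 1))
movie_bias = rng.normal(size=(n_movies, 1))
dense_w = rng.normal(size=(dim, 1))       # stands in for the Dense(1) kernel
dense_b = 0.0

def predict(user_idx, movie_idx):
    # Element-wise product of the embedding rows; biases broadcast across dim
    terms = user_emb[user_idx] * movie_emb[movie_idx] + user_bias[user_idx] + movie_bias[movie_idx]
    logits = terms @ dense_w + dense_b    # weighted sum over the embedding dimension
    prob = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return prob * (max_rating - min_rating) + min_rating

preds = predict(np.array([0, 1, 3]), np.array([5, 2, 0]))
print(preds.shape)
```

Note that the summation usually associated with a dot product happens inside the dense layer here, as a learned weighted sum over the element-wise product.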

In [14]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 users_input (InputLayer)    [(None, 1)]                  0         []                            
                                                                                                  
 movies_input (InputLayer)   [(None, 1)]                  0         []                            
                                                                                                  
 users_embeddings (Embeddin  (None, 1, 500)               336000    ['users_input[0][0]']         
 g)                                                                                               
                                                                                                  
 movies_embedding (Embeddin  (None, 1, 500)               1662500   ['movies_input[0][0]']    

In [15]:
df_train, df_val = train_test_split(df, random_state=42, test_size=0.2, stratify=df.rating)

In [31]:
df_train.rating

55624    4.0
12495    2.0
70897    4.0
63905    5.0
1934     5.0
        ... 
14438    4.0
27626    3.0
9477     4.0
49427    3.0
58063    5.0
Name: rating, Length: 42575, dtype: float32

In [17]:
history = model.fit(
    x=[df_train.user.to_numpy(), df_train.movie.to_numpy()],
    y=df_train.rating.to_numpy(),
    batch_size=200,
    epochs=10,
    verbose=1,
    validation_data=([df_val.user.to_numpy(), df_val.movie.to_numpy()],df_val.rating.to_numpy()))

Epoch 1/10


InvalidArgumentError: Graph execution error:

Detected at node 'model/movies_bias/embedding_lookup' defined at (most recent call last):
    [... ipykernel, asyncio, and IPython frames omitted ...]
    File "C:\Users\tandu\AppData\Local\Temp\ipykernel_24700\2630691634.py", line 1, in <module>
      history = model.fit(
    [... Keras training and functional-model frames omitted ...]
    File "e:\Subjects\PBL6-Movie Recommender System\recommender\venv\lib\site-packages\keras\src\layers\core\embedding.py", line 272, in call
      out = tf.nn.embedding_lookup(self.embeddings, inputs)
Node: 'model/movies_bias/embedding_lookup'
indices[13,0] = 38061 is not in [0, 3325)
	 [[{{node model/movies_bias/embedding_lookup}}]] [Op:__inference_train_function_1317]
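
The last lines of the error point at the likely cause: index 38061 was looked up in an embedding table with 3325 rows. The tables were sized from the number of unique movies (3324 + 1), while the `movie` column holds raw `movieIdx` values, which can be much larger because the movies table has 42277 rows. One possible remedy, mirroring what the notebook already does for users, is to remap movie indices to a contiguous range. The names `movie2movie_encoded` and `movie_encoded2movie` below are hypothetical, and the input list is made up.

```python
# Hypothetical raw indices, sparse like the notebook's movieIdx values
raw_movie_idx = [183, 206, 38061, 206, 183]

# Assign each distinct raw index a position in 0..num_movies-1
movie_ids = sorted(set(raw_movie_idx))
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for x, i in movie2movie_encoded.items()}

encoded = [movie2movie_encoded[x] for x in raw_movie_idx]
num_movies = len(movie_ids)

# Every encoded index now fits an embedding table with num_movies rows
fits = max(encoded) < num_movies
print(encoded, num_movies, fits)
```

If a mapping like this were adopted, the prediction cells would also need to translate encoded indices back through `movie_encoded2movie` before filtering `movies_df` by `movieIdx`. An alternative would be to size the movie embeddings by the largest `movieIdx` plus one, at the cost of many unused rows.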

In [None]:
number_of_preds = 100
movies = df.sample(n=number_of_preds).movie.to_list()
user_list = df.sample(n=1).user.to_list() * number_of_preds
use_id = False
if use_id:
    user_list = [user2user_encoded.get(1)] * number_of_preds
preds = model.predict(x=[np.array(user_list), np.array(movies)])
preds

In [None]:
suggestions = []
user_id = userencoded2user.get(user_list[0])

suggestions_df = movies_df.copy()[movies_df['movieIdx'].isin(movies)]
suggestions_df['userId'] = user_id

suggestions_df['score'] = suggestions_df['movieIdx'].apply(lambda x: preds[movies.index(x)][0])

for i, movieIdx in enumerate(movies):
    pred_rank = preds[i][0]
    print(user_id, movieIdx, pred_rank)

In [None]:
user_ratings = rating_df.copy()[rating_df.userId == suggestions_df.userId.tolist()[0]]
user_ratings.rating.describe()

In [None]:
suggestions_df.sort_values(by=['score'], inplace=True, ascending=False)
suggestions_df.head()
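
The ranking step can also be done directly on the prediction array. The sketch below assumes `preds` has shape `(n, 1)` as returned by `model.predict` for this model, and uses made-up scores. Because the movies are sampled from rating rows, the same movie can appear more than once, so repeats are skipped.

```python
import numpy as np

# Hypothetical movieIdx values and predicted ratings
movies = [12, 7, 33, 7, 5]
preds = np.array([[3.1], [4.6], [2.2], [4.6], [4.9]])

scores = preds[:, 0]
order = np.argsort(-scores, kind="stable")   # highest score first

top_k, seen = [], set()
for i in order:
    if movies[i] not in seen:                # skip repeated movies
        seen.add(movies[i])
        top_k.append((movies[i], float(scores[i])))
    if len(top_k) == 3:
        break

print(top_k)
```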

Save the model for reuse

In [None]:
model.save("my-model.h5")