# **Recommender Systems or Why Your Phone Isn't Actually Spying on You**

<img src="https://drive.google.com/uc?id=1vpzhvPNf6fMdW1eettSzHHFeggNUzRZB" />

<a href="https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2023/blob/main/practicals/Indaba_2023_Prac_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> [Change colab link to point to prac.]

© Deep Learning Indaba 2023. Apache License 2.0.

**Authors:** Amrit Purshotam, Jama Mohamud

**Reviewers:** Kyle Taylor

**Introduction:**

Recommender systems are probably one of the most ubiquitous types of machine learning model we encounter in our online lives. They influence what we see in our social media feeds, the products we buy, the music we listen to, the food we eat, and the movies we watch. Sometimes they're so good that people feel their phone is spying on their conversations! In this prac, we hope to convince you that this isn't the case (mostly), and to take you through some of the techniques popularly used in industry to recommend the content you see online, by building our very own movie recommender system.

**Topics:**

Content: Machine Learning, Recommender Systems, Approximate Nearest Neighbours

Level: <font color='grey'>`Beginner`</font>

**Aims/Learning Objectives:**

- General architecture of a recommender system.
- Techniques for making recommendations.
- Serving recommendations efficiently in production.

**Prerequisites:**

[Knowledge required for this prac. You can link a relevant parallel track session, blogs, papers, courses, topics etc.]

**Outline:**

[Points that link to each section. Auto-generate following the instructions [here](https://stackoverflow.com/questions/67458990/how-to-automatically-generate-a-table-of-contents-in-colab-notebook).]

**Before you start:**

For this practical, you will need to use a GPU to speed up training. To do this, go to the "Runtime" menu in Colab, select "Change runtime type" and then in the popup menu, choose "GPU" in the "Hardware accelerator" box.

**Suggested experience level in this topic:**

| Level | Experience |
| --- | --- |
| `Beginner` | It is my first time being introduced to this work. |
| `Intermediate` | I have done some basic courses/intros on this topic. |
| `Advanced` | I work in this area/topic daily. |

In [None]:
# @title **Paths to follow:** What is your level of experience in the topics presented in this notebook? (Run Cell)
experience = "advanced" #@param ["beginner", "intermediate", "advanced"]

sections_to_follow=""

if experience == "beginner":
  sections_to_follow="Introduction -> 1.1 Subsection -> 2.1 Subsection -> Conclusion -> Feedback"
elif experience == "intermediate":
  sections_to_follow="Introduction -> 1.2 Subsection -> 2.2 Subsection -> Conclusion -> Feedback"
elif experience == "advanced":
  sections_to_follow="Introduction -> 1.3 Subsection -> 2.3 Subsection -> Conclusion -> Feedback"

print(f"Based on your experience, it is advised you follow these -- {sections_to_follow} sections. Note this is just a guideline.")

## Installation and Imports

In [None]:
## Install and import anything required. Capture hides the output from the cell.
# @title Install and import required packages. (Run Cell)

import tensorflow_datasets as tfds
from flax import linen as nn
from jax import numpy as jnp
import jax
import optax
from flax import struct
from clu import metrics
from flax.training import train_state

from typing import Iterable, Mapping, Sequence, Tuple
import pandas as pd
import tensorflow as tf

In [None]:
# @title Helper methods

def get_dataset(ds_name: str) -> pd.DataFrame:
  ds, info = tfds.load(f'movielens/{ds_name}', data_dir="./data", with_info=True)
  df = tfds.as_dataframe(ds['train'], info)
  df = df.astype({'user_id': int, 'movie_id': int})
  df.loc[:, 'movie_title'] = df['movie_title'].str.decode("utf-8")
  return df

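def cross_tabulate(df: pd.DataFrame, num_samples: int = 10) -> pd.DataFrame:
  pivot_df = df.pivot(index='user_id', columns='movie_id', values='user_rating')
  # Sampling rows of df (rather than unique ids) favours the more active users and more popular movies.
  sampled_users = df['user_id'].sample(num_samples).unique()
  sampled_movies = df['movie_id'].sample(num_samples).unique()
  # Keep users with at least one rating among the sampled movies and blank out missing ratings for display.
  return pivot_df.loc[sampled_users, sampled_movies].dropna(axis=0, thresh=1).fillna("")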

def make_mapping(id_set: Iterable[str]) -> Mapping[str, int]:
  return {id_str: i for (i, id_str) in enumerate(id_set)}

def densify_column_values(df: pd.DataFrame, col_name: str) -> Tuple[pd.Series, Sequence[str], Mapping[str, int]]:
  col_values = sorted(set(df[col_name]))
  col_ids_map = make_mapping(col_values)
  return df[col_name].apply(lambda col_id: col_ids_map[col_id]), col_values, col_ids_map

def to_tfds(df: pd.DataFrame) -> tf.data.Dataset:
  # Map each column name to the dtype it should have in the dataset.
  fields = {
    'user_id': tf.int32,
    'item_id': tf.int32,
    'user_rating': tf.float32,
    'timestamp': tf.int32,
  }

  tensor_slices = {
      field: tf.cast(df[field].values, dtype=field_type)
      for field, field_type in fields.items()
  }

  return tf.data.Dataset.from_tensor_slices(tensor_slices)

In [None]:
class Config:
    DATASET = "latest-small-ratings" # all the options are 100k-ratings 1m-ratings 20m-ratings 25m-ratings latest-small-ratings
    SEED = 42
    EMB_DIM = 50
    LR = 5e-3
    BATCH_SIZE = 64
    NUM_EPOCHS = 3

config = Config()

## **A Real World Scenario**

Imagine you're going shopping for a new book. You enter the store and start walking around scanning the shelves of books. Something catches your interest, you pause, take the book down and inspect the cover, and perhaps also read the blurb. You continue to do this until you find something you like at which point you purchase the book and leave.

Now imagine you've read the book, quite enjoyed it, and want to buy another one. You go back to the store, browse around, and occasionally inspect some books that catch your interest. This time an assistant approaches you and asks if you need help. You gladly accept, and you mention the books you were looking at as well as the one you purchased recently. Since they've been working there a long time and have helped many customers, they have a good idea of what you may like and recommend a short list of books for you to look at. You do so and eventually settle on one to purchase.

**Exercise**. Let's pause here for a moment and dig deeper into this scenario.

- From all the books in the store, what does it say about the ones you paused to look at, the ones you ignored, and the ones you purchased?
- What does it say about you as the reader and your preferences? Could there be other people like you?
- How was the assistant able to narrow down all the books in the store to just a few from which you actually purchased one?

<img src="https://drive.google.com/uc?id=1kG2UcOybxNTWCCI1E3IwH1R--yVqpi_u" />

Source: Kim Falk. *Practical Recommender Systems*. 2019. Manning.

The answers to these questions get to the heart of recommender systems. Your behaviour probably wasn't random but instead had some structure and logic to it. You also likely have particular preferences for some genres and/or authors. It follows that, even without knowing anything else about the books, the ones you looked at and purchased matched those preferences and the ones you ignored most likely did not. Additionally, based on your preferences, what was catching your interest, and what you had previously read, the store assistant was able to stitch together a rough profile of you. They then thought about previous customers similar to you and what those customers had purchased. This is how they were able to shortlist relevant books for you to look at.

As you have probably guessed by now, the store assistant is the recommender system in this example. Instead of a person, though, we want to build something that can learn the latent structure between all the books, all our customers, and how well they match each other, and then use it to make recommendations. In the following sections we will learn how to do this. But first, let's go over the general architecture of a recommender system.

## **Recommendation System Architecture**

### Overview

![picture](https://drive.google.com/uc?id=1m7ARSm_4PAaN6qP969HeHIGUfOjHX3Y9)

Source: [System Design Interview](https://medium.com/double-pointer/system-design-interview-recommendation-system-design-as-used-by-youtube-netflix-etc-c457aaec3ab)

The above diagram generalises our previous scenario from books to searches, songs, and movies (note that a given system will usually deal with only one of these item types), while the user is you as before. Notice how the user interacts with these *items* and how those interactions are sent to the recommender system as feedback. The recommender system processes this feedback, retrieves relevant items from its database, and serves them to the user. The user, in turn, interacts further with these items, and the loop continues until some terminating criterion is met. Perhaps they found a movie they like and started watching, or they purchased a book as in our previous example, or they simply left.

### Zooming in

Let's zoom in on the recommender system itself by looking at how YouTube described theirs in 2016, in their seminal paper on the topic. Being YouTube, the items here are videos, which number in the millions (and most likely in the billions as of writing in 2023). Pay special attention to the blue stages in the diagram below: the number of items going into each stage goes down, hence the funnel shape.

Looking at the first one, the job of the *candidate generation* stage is to efficiently *retrieve* relevant items from your database (which is why this stage is also known in the literature as *retrieval*). Speed is of the utmost importance here, so some leeway is allowed in terms of relevancy as long as the number of items we reduce down to is manageable for the downstream parts of the recommender. This lookup speed is achieved partly by the techniques we discuss and implement later in this practical, but also by limiting the number of features that feed into this stage.

The second stage, *ranking* (also known as *scoring*), takes these items and sorts them using additional features of both the user and the item, with a machine learning model optimised for some target we care about. For YouTube this target is watch time; for an e-commerce website it might be the likelihood of purchasing the item. Another reason for a ranking stage is that you may have multiple candidate generators and need a way to combine their results in some optimal way before they are shown to the user. It's also important to note that this ranking stage is not always necessary. Sometimes the retrieval step is good enough, and leaving ranking out keeps the complexity of the system down, reduces implementation time, and lowers maintenance overhead and therefore costs.

![picture](https://drive.google.com/uc?id=1SmTqpqZFwrf4-3CmXp3vVQl8er6d4XAn)

Source: *Deep Neural Networks for YouTube Recommendations* (Covington, Adams and Sargin, 2016)

Finally, this last diagram, courtesy of the recommendations team at NVIDIA, expands on the above ideas by defining two more stages, namely *filtering* and *ordering*. After the retrieval / candidate generation stage, we may find that some of the items, while relevant, aren't useful and so need to be filtered out. In an e-commerce scenario this could be an item that's out of stock, or in the case of a social media platform, a post from a person or on a topic you've blocked / muted. The *ordering* step, which takes place after ranking / scoring, then refines the order of the items according to some business logic, for example promoting on-sale items or perhaps even sponsored placements.

![picture](https://drive.google.com/uc?id=12fWnK5lrtdUX79GXT3tzOp6Fr78mKh57)

Source: [Recommender Systems, Not Just Recommender Models](https://medium.com/nvidia-merlin/recommender-systems-not-just-recommender-models-485c161c755e)
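
To make the four stages concrete, here is a minimal sketch of how they might chain together. Everything in it is a hypothetical stand-in: the item names, the two score tables (one for a cheap retrieval model, one for a richer ranking model), and the out-of-stock and sponsored sets are invented for illustration, and a real system would replace each function with a model or a service call.

```python
from typing import Dict, List, Set

# Hypothetical toy catalogue: scores from a cheap retrieval model.
RETRIEVAL_SCORES: Dict[str, float] = {
    "book_a": 0.9, "book_b": 0.8, "book_c": 0.7, "book_d": 0.2, "book_e": 0.1,
}
# Hypothetical scores from a richer (and slower) ranking model.
RANKING_SCORES: Dict[str, float] = {
    "book_a": 0.3, "book_b": 0.95, "book_c": 0.6, "book_d": 0.5, "book_e": 0.4,
}
OUT_OF_STOCK: Set[str] = {"book_c"}  # assumed filtering rule
SPONSORED: Set[str] = {"book_a"}     # assumed business rule


def retrieve(num_candidates: int) -> List[str]:
    """Candidate generation: cheaply narrow the catalogue to a few items."""
    by_score = sorted(RETRIEVAL_SCORES, key=RETRIEVAL_SCORES.get, reverse=True)
    return by_score[:num_candidates]


def filter_items(candidates: List[str]) -> List[str]:
    """Filtering: drop candidates that are relevant but not useful."""
    return [item for item in candidates if item not in OUT_OF_STOCK]


def rank(candidates: List[str]) -> List[str]:
    """Ranking: re-sort the survivors with the richer model."""
    return sorted(candidates, key=RANKING_SCORES.get, reverse=True)


def order(ranked: List[str]) -> List[str]:
    """Ordering: apply business logic, here moving sponsored items to the front."""
    sponsored = [item for item in ranked if item in SPONSORED]
    organic = [item for item in ranked if item not in SPONSORED]
    return sponsored + organic


def recommend(num_candidates: int = 3) -> List[str]:
    return order(rank(filter_items(retrieve(num_candidates))))


recommendations = recommend()
print(recommendations)
```

Note how each stage only ever sees the output of the previous one, so the expensive ranking model is never run on the full catalogue.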

The rest of this practical will now focus primarily on the candidate generation stage of a recommender system. We hope this introduction and overview of recommender systems helps put into context the specific piece we will be building out. In particular, we will be learning about Collaborative Filtering and Graph Neural Networks for recommendations. But first, let's explore the data.

## **The MovieLens Dataset**

Since we don't have access to an actual streaming service's watch history, we will instead use a dataset called [MovieLens](https://grouplens.org/datasets/movielens/). The full dataset contains 25 million ratings across 62 thousand movies, created by 162 thousand users. However, to keep processing and training times low, we will be using a subset of 100 thousand ratings from over 9 thousand movies and 600 users. Let's have a look.

In [None]:
df = get_dataset(config.DATASET)
df.head(5)[['user_id', 'movie_title', 'user_rating', 'timestamp']]

What if we cross-tabulate a sample of this data to get an alternative view?

In [None]:
cross_tabulate(df)

The table displayed above shows some of the more popular movies and users. The empty cells are what we want our model to learn to fill in i.e. the movies we presume a user has not yet watched because they have yet to rate it. Then once we make these predictions, we can figure out which of those movies they're most likely to enjoy.
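
The idea of "empty cells to fill in" can be sketched on a toy example. The users, titles, and ratings below are hypothetical stand-ins for MovieLens rows, and plain Python lists are used instead of a DataFrame so the structure is easy to see.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical (user, movie, rating) triples standing in for MovieLens rows.
ratings: List[Tuple[str, str, float]] = [
    ("alice", "Toy Story", 5.0),
    ("alice", "Heat", 3.0),
    ("bob", "Toy Story", 4.0),
    ("bob", "Jumanji", 2.0),
]

users = sorted({user for user, _, _ in ratings})
movies = sorted({movie for _, movie, _ in ratings})

# Cross-tabulate: one row per user, one column per movie, None where no rating exists.
observed: Dict[Tuple[str, str], float] = {(u, m): r for u, m, r in ratings}
matrix: List[List[Optional[float]]] = [
    [observed.get((user, movie)) for movie in movies] for user in users
]


def unrated_movies(user: str) -> List[str]:
    """Return the movies whose cell is empty for this user."""
    row = matrix[users.index(user)]
    return [movie for movie, rating in zip(movies, row) if rating is None]


print(movies)
for user, row in zip(users, matrix):
    print(user, row)
print(unrated_movies("alice"))
```

The `None` entries are exactly the cells a model would be asked to predict; sorting those predictions for a user then gives a list of candidate recommendations.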

## **Dataset Preparation**

Now let's prepare our dataset for training. We will map our users and movies to contiguous integer indices starting from 0.
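
As a small illustration of this mapping, the sketch below densifies a handful of made-up raw user ids. The ids are hypothetical; the approach (sort the unique values, then enumerate them) mirrors what the `densify_column_values` helper does.

```python
from typing import Dict, List

# Hypothetical raw ids with gaps and repeats, as they might appear in a ratings table.
raw_user_ids: List[int] = [610, 42, 42, 7, 610]

# Sort the unique ids so the mapping is deterministic, then number them from 0.
unique_ids = sorted(set(raw_user_ids))
id_to_index: Dict[int, int] = {raw_id: index for index, raw_id in enumerate(unique_ids)}

# Replace every raw id with its dense index.
dense_ids = [id_to_index[raw_id] for raw_id in raw_user_ids]

print(id_to_index)
print(dense_ids)
# The sorted list doubles as the reverse lookup: unique_ids[index] gives back the raw id.
print([unique_ids[index] for index in dense_ids])
```

Dense indices are convenient because they can be used directly as row numbers in a lookup table, with no unused rows for ids that never occur.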

In [None]:
df['user_id'], user_list, user_to_id_mapping = densify_column_values(df, 'user_id')
df['item_id'], movie_list, movie_to_id_mapping = densify_column_values(df, 'movie_title')
df.head(5)[['user_id', 'item_id', 'user_rating', 'timestamp']]

Then we will split our data randomly into a train and validation set and create our dataloaders that will feed data into our training process later. The exact details here aren't important so don't worry if you don't fully understand the below code.

In [None]:
val_df = df.sample(frac=0.2, random_state=config.SEED)  # seeded so the split is reproducible
train_df = df[~df.index.isin(val_df.index)]

train_ds = (
  to_tfds(train_df)
  .repeat(config.NUM_EPOCHS)
  .shuffle(1024)
  .batch(config.BATCH_SIZE, drop_remainder=False)
  .prefetch(1)
)
val_ds = (
  to_tfds(val_df)
  .shuffle(1024)
  .batch(config.BATCH_SIZE, drop_remainder=False)
  .prefetch(1)
)

We are now ready to learn and implement Collaborative Filtering.

## **Collaborative Filtering**

Collaborative Filtering uses similarities between users and items simultaneously to provide recommendations.

![picture](https://drive.google.com/uc?id=1qxPGR6bXjDbigYxvFhE-bbeUK8mS-1Ys)
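
One common way to realise collaborative filtering is matrix factorisation: each user and each item gets a small embedding vector, and the predicted rating is their dot product. The sketch below is a minimal pure-Python version under stated assumptions, not necessarily the model this practical goes on to build. The toy ratings, the hyperparameter values, and the plain SGD update are all chosen for illustration.

```python
import random
from typing import List, Tuple

random.seed(42)

# Hypothetical toy data: (user_index, item_index, rating) with dense indices.
ratings: List[Tuple[int, int, float]] = [
    (0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 5.0), (2, 2, 2.0),
]
NUM_USERS, NUM_ITEMS = 3, 3
EMB_DIM, LR, NUM_EPOCHS = 2, 0.05, 200  # illustrative values only


def init_embeddings(num_rows: int) -> List[List[float]]:
    """Create one small random vector per user or item."""
    return [[random.uniform(-0.5, 0.5) for _ in range(EMB_DIM)] for _ in range(num_rows)]


user_emb = init_embeddings(NUM_USERS)
item_emb = init_embeddings(NUM_ITEMS)


def predict(user: int, item: int) -> float:
    """Predicted rating is the dot product of the user and item embeddings."""
    return sum(u * v for u, v in zip(user_emb[user], item_emb[item]))


def mean_squared_error() -> float:
    return sum((predict(u, i) - r) ** 2 for u, i, r in ratings) / len(ratings)


initial_mse = mean_squared_error()
for _ in range(NUM_EPOCHS):
    for user, item, rating in ratings:
        error = predict(user, item) - rating
        for k in range(EMB_DIM):
            # Gradient of 0.5 * error**2 with respect to each embedding entry.
            u_k, v_k = user_emb[user][k], item_emb[item][k]
            user_emb[user][k] -= LR * error * v_k
            item_emb[item][k] -= LR * error * u_k
final_mse = mean_squared_error()

print(f"MSE before training: {initial_mse:.3f}, after: {final_mse:.3f}")
print(f"Predicted rating for the unseen pair (user 0, item 2): {predict(0, 2):.2f}")
```

Only observed ratings contribute to the loss, yet the learned embeddings can still be combined for any user and item pair, which is how the empty cells of the rating table get filled in.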

## Similarity Measures

<img src="https://drive.google.com/uc?id=1cqEj6omOtiSjcVf9EKtlBQP59orNuiVa" width="50%" />
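
As a hedged illustration, the sketch below implements three similarity measures that are commonly used to compare embedding vectors: dot product, cosine similarity, and Euclidean distance. It assumes these are the measures of interest here; the vectors are hypothetical two-dimensional embeddings chosen so the results are easy to check by hand.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Larger when vectors point the same way and when they are long."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the normalised vectors, so only the angle matters."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical embeddings.
user = [1.0, 0.0]
movie_a = [2.0, 0.0]  # same direction as the user, but a longer vector
movie_b = [0.0, 1.0]  # orthogonal to the user

for name, movie in [("movie_a", movie_a), ("movie_b", movie_b)]:
    print(
        name,
        dot_product(user, movie),
        cosine_similarity(user, movie),
        euclidean_distance(user, movie),
    )
```

Note that `movie_a` scores higher than `movie_b` on dot product and cosine similarity, while for Euclidean distance the better match is the one with the lower value. The dot product is also sensitive to vector length, whereas cosine similarity is not.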

### Tensorboard Projector

## **Graph Neural Networks**

[Background/content for the section.]

## Conclusion
**Summary:**

[Summary of the main points/takeaways from the prac.]

**Next Steps:**

[Next steps for people who have completed the prac, like optional reading (e.g. blogs, papers, courses, youtube videos). This could also link to other pracs.]

**Appendix:**

[Anything (probably math heavy stuff) we don't have space for in the main practical sections.]

**References:**

[References for any content used in the notebook.]

For other practicals from the Deep Learning Indaba, please visit [here](https://github.com/deep-learning-indaba/indaba-pracs-2023).

## Feedback

Please provide feedback that we can use to improve our practicals in the future.

In [None]:
# @title Generate Feedback Form. (Run Cell)
from IPython.display import HTML

HTML(
    """
<iframe
  src="https://forms.gle/Cg9aoa7czoZCYqxF7"
  width="80%"
  height="1200px">
  Loading...
</iframe>
"""
)

<img src="https://baobab.deeplearningindaba.com/static/media/indaba-logo-dark.d5a6196d.png" width="50%" />