# Building Recommender Systems

A recommender system, or a recommendation system (sometimes replacing 'system' with a synonym such as a platform or an engine), is a subclass of information filtering systems that provide suggestions for items most pertinent to a particular user.

Recommender systems are used in many areas. Commonly recognized examples include playlist generators for video and music services, product recommenders for online stores, content recommenders for social media platforms, and open web content recommenders.

![recommend](https://miro.medium.com/max/1200/1*E8c4PEwsogQQWJPErGda2A.gif)

[Source](https://medium.com/analytics-vidhya/how-create-image-recomendation-system-3dcc5edf1597).

In this tutorial, we will build different recommendation systems, using movies as the content we want to promote. We will cover three common types of recommender systems:

-   **Demographic Filtering:** _They offer generalized recommendations to every user based on demographic patterns of the type of content we want to recommend (e.g., movie popularity, genre, etc.). The basic idea behind this system is that content that is more popular and critically acclaimed will have a higher probability of being liked by the average audience._

-   **Content-Based Filtering:** _The general idea behind these recommender systems is that if a person likes a particular item, he or she will also like an item that is similar to it. Hence, similar content has a higher probability of being recommended._

-   **Collaborative Filtering:** _This system matches persons with similar interests and provides recommendations based on this matching. The general idea is that similar people will prefer the same type of content._

Let us start with **Demographic Filtering.**

For this notebook, we are using a couple of datasets from the Hugging Face Hub. To access them, we are going to use the `datasets` library.

> **Note**: all datasets and models related to the course and repo are in the Hub.

We can use the `load_dataset` function to download our data straight from the Hub 🤗. This function returns a `DatasetDict`, from which we can extract our data. You can also convert this dataset object into a `pandas.DataFrame`, which gives you all the functionality that `pandas` offers.

In [1]:
# the `-q` (quiet) flag makes the installation less verbose
!pip install datasets -q

In [2]:
from datasets import load_dataset

# load the datasets from the hub
movie_dataset = load_dataset('dieineb/tmdb_5000_movies')

# turn the datasets into a pandas.DataFrame
movie_dataset = movie_dataset['train'].to_pandas()

display(movie_dataset.head())

Unnamed: 0,id,budget,genres,homepage,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew
0,5,4000000,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 35, ""name...",,"[{""id"": 612, ""name"": ""hotel""}, {""id"": 613, ""na...",en,Four Rooms,It's Ted the Bellhop's first night on the job....,22.87623,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...",...,4300000,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Twelve outrageous guests. Four scandalous requ...,Four Rooms,6.5,530,"[{""cast_id"": 42, ""character"": ""Ted the Bellhop...","[{""credit_id"": ""52fe420dc3a36847f800012d"", ""de..."
1,11,11000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 28, ""...",http://www.starwars.com/films/star-wars-episod...,"[{""id"": 803, ""name"": ""android""}, {""id"": 4270, ...",en,Star Wars,Princess Leia is captured and held hostage by ...,126.393695,"[{""name"": ""Lucasfilm"", ""id"": 1}, {""name"": ""Twe...",...,775398007,121.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"A long time ago in a galaxy far, far away...",Star Wars,8.1,6624,"[{""cast_id"": 3, ""character"": ""Luke Skywalker"",...","[{""credit_id"": ""52fe420dc3a36847f8000437"", ""de..."
2,12,94000000,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",http://movies.disney.com/finding-nemo,"[{""id"": 494, ""name"": ""father son relationship""...",en,Finding Nemo,"Nemo, an adventurous young clownfish, is unexp...",85.688789,"[{""name"": ""Pixar Animation Studios"", ""id"": 3}]",...,940335536,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"There are 3.7 trillion fish in the ocean, they...",Finding Nemo,7.6,6122,"[{""cast_id"": 8, ""character"": ""Marlin (voice)"",...","[{""credit_id"": ""52fe420ec3a36847f80006b1"", ""de..."
3,13,55000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",,"[{""id"": 422, ""name"": ""vietnam veteran""}, {""id""...",en,Forrest Gump,A man with a low IQ has accomplished great thi...,138.133331,"[{""name"": ""Paramount Pictures"", ""id"": 4}]",...,677945399,142.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"The world will never be the same, once you've ...",Forrest Gump,8.2,7927,"[{""cast_id"": 7, ""character"": ""Forrest Gump"", ""...","[{""credit_id"": ""52fe420ec3a36847f800076b"", ""de..."
4,14,15000000,"[{""id"": 18, ""name"": ""Drama""}]",http://www.dreamworks.com/ab/,"[{""id"": 255, ""name"": ""male nudity""}, {""id"": 29...",en,American Beauty,"Lester Burnham, a depressed suburban father in...",80.878605,"[{""name"": ""DreamWorks SKG"", ""id"": 27}, {""name""...",...,356296601,122.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Look closer.,American Beauty,7.9,3313,"[{""cast_id"": 6, ""character"": ""Lester Burnham"",...","[{""credit_id"": ""52fe420ec3a36847f8000809"", ""de..."


To perform a recommendation based on **Demographic Filtering**, we first need to:

-   Define a metric to score movies.
-   Calculate the score for every movie.
-   Sort the scores and recommend the best-rated movies.

We could use each movie's average rating as the score, but that would not be fair: a movie with an 8.9 average rating from only three votes should not rank above a movie with a 7.8 average rating from 40 votes.

To work around this issue, we can use IMDB's weighted rating formula:

$$\text{Weighted Rating (WR)} = (\frac{v}{v+m} \times R) + (\frac{m}{v+m} \times C)$$

where:

-   $v$ is the number of votes for the movie.
-   $m$ is the minimum votes required to be listed.
-   $R$ is the average rating of the movie.
-   $C$ is the mean vote across the whole dataset.
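
To see how the formula treats the two movies from the example above, here is a minimal sketch. The values `m = 25` and `C = 6.0` are assumptions chosen only for illustration; they are not computed from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average rating R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only
m = 25    # assumed minimum number of votes required to be listed
C = 6.0   # assumed mean vote across the whole dataset

few_votes = weighted_rating(v=3, R=8.9, m=m, C=C)
many_votes = weighted_rating(v=40, R=7.8, m=m, C=C)

print(f"8.9 average, 3 votes  -> {few_votes:.2f}")
print(f"7.8 average, 40 votes -> {many_votes:.2f}")
```

With few votes, the score stays close to $C$; as $v$ grows, it approaches the movie's own average $R$. Under these assumed values, the movie with 40 votes ends up ranked above the one with three.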

Below, we implement this rating system, apply it to our dataset, and sort all movies by their weighted rating score.

In [3]:

# Average vote across all movies in the dataset
C = movie_dataset['vote_average'].mean()

# For a movie to feature in the charts, it must have at least as many votes as 90% of the movies in the list
M = movie_dataset['vote_count'].quantile(0.9)

# Keep only the qualifying movies (copy so that we do not modify a view of the original DataFrame)
recommender_list = movie_dataset.loc[movie_dataset['vote_count'] >= M].copy()

def IMDB_weighted_rating(x, M=M, C=C):
    """
    Calculates the weighted rating for a movie based on its
    average rating (R), the number of votes it received (v),
    and two external variables M and C. The formula used is
    based on the IMDB formula.

    Parameters:
    -----------
        x : pandas.Series
            A row of the DataFrame containing the movie information,
            including the 'vote_count' and 'vote_average' entries.
        M : int or float
            The minimum number of votes required to be listed in
            the top rated movies.
        C : int or float
            The mean vote across the whole dataset.

    Returns:
    --------
        float : The weighted rating for the movie.
    """
    v = x['vote_count']
    R = x['vote_average']

    # Calculation based on the IMDB formula
    return (v/(v+M) * R) + (M/(M+v) * C)

# Apply the IMDB_weighted_rating function in the recommender_list dataframe
recommender_list['score'] = recommender_list.apply(IMDB_weighted_rating, axis=1)

# Display movies sorted by score
display(recommender_list[['title', 'vote_count', 'vote_average', 'score']].sort_values('score', ascending=False))

Unnamed: 0,title,vote_count,vote_average,score
127,The Shawshank Redemption,8205,8.5,8.059258
230,Fight Club,9413,8.3,7.939256
68,The Dark Knight,12002,8.2,7.920020
305,Pulp Fiction,8428,8.3,7.904645
2955,Inception,13752,8.1,7.863239
...,...,...,...,...
3380,Green Lantern,2487,5.1,5.521697
3442,A Good Day to Die Hard,3493,5.2,5.507643
3884,After Earth,2532,5.0,5.459420
3440,Independence Day: Resurgence,2491,4.9,5.406234


Now we have a recommendation system that indicates the best movies _"overall"_.

However, demographic recommender systems _are not sensitive to the interests and tastes of a particular user_. For this, we need **Content-Based Filtering**.

In this recommender system, the movie's content (_overview, cast, crew, keyword, tagline_, etc.) is used to find its similarity with other movies. Then, we recommend movies that are most similar to the ones the user already watched (and supposedly liked).

We can achieve this by assessing the similarity of, for example, the synopsis of different movies.

In [9]:
from IPython.display import Markdown as MK

for i in range(3):
    display(MK(f'''### {movie_dataset['original_title'][i]}\n\n_{movie_dataset['overview'][i]}_\n\n'''))

### Four Rooms

_It's Ted the Bellhop's first night on the job...and the hotel's very unusual guests are about to place him in some outrageous predicaments. It seems that this evening's room service is serving up one unbelievable happening after another._



### Star Wars

_Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice in the Empire._



### Finding Nemo

_Nemo, an adventurous young clownfish, is unexpectedly taken from his Great Barrier Reef home to a dentist's office aquarium. It's up to his worrisome father Marlin and a friendly but forgetful fish Dory to bring Nemo home -- meeting vegetarian sharks, surfer dude turtles, hypnotic jellyfish, hungry seagulls, and more along the way._



For this, we can compute the _"Term Frequency - Inverse Document Frequency"_ (TF-IDF) vectors for each synopsis (one could achieve a similar result by using [embeddings](https://huggingface.co/blog/getting-started-with-embeddings)).

- Term frequency (`TF`) is the relative frequency of a word in a document $(\frac{\text{term instances}}{\text{total instances}})$.
- Inverse Document Frequency (`IDF`) measures how rare a term is across the corpus: it is the logarithm of the inverse fraction of documents containing the term, $\log(\frac{\text{number of documents}}{\text{documents with term}})$.
- The overall importance of each word to the documents in which it appears is equal to $TF \times IDF$.

This will give you a matrix where each column represents a word in the vocabulary (represented by their TF-IDF value), and each row represents a movie in the dataset.
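
As a rough illustration of the definitions above, the sketch below computes TF-IDF by hand for three made-up one-line synopses. It uses the plain textbook formula; `TfidfVectorizer` applies a smoothed IDF and normalizes each row by default, so its numbers will differ. All names and texts here are hypothetical.

```python
import math
from collections import Counter

# Made-up synopses, for illustration only
docs = [
    "space marine explores alien planet",
    "alien invasion threatens planet earth",
    "young fish lost in ocean",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}

def tfidf_row(doc):
    """Return one TF-IDF value per vocabulary word for a tokenized document."""
    counts = Counter(doc)
    return [
        (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
        for word in vocab
    ]

# Rows are documents (movies), columns are vocabulary words
tfidf_rows = [tfidf_row(doc) for doc in tokenized]

for word in ("alien", "fish"):
    col = vocab.index(word)
    print(word, [round(row[col], 3) for row in tfidf_rows])
```

Here `fish` appears in a single synopsis, so it receives a higher weight in that row than `alien`, which appears in two of the three.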

Fortunately, `scikit-learn` can do all this heavy lifting for you.

> Note: We used the same technique to allow users to explore the dataset of our instruction-tuned language model series ([`Aira`](https://huggingface.co/spaces/nicholasKluge/Aira-Demo)).

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer instance
# Stop words are commonly used words that are often filtered out because they offer
# little meaningful information for analysis or machine learning models
# E.g., the, is, at, which, and on.
tfidf = TfidfVectorizer(stop_words='english')


# Replace NaN with an empty string
movie_dataset['overview'] = movie_dataset['overview'].fillna('')

# Fit the TfidfVectorizer to the text corpus
tfidf_matrix = tfidf.fit_transform(movie_dataset['overview'])

# Output the shape of tfidf_matrix
print(f'Number of movies: {tfidf_matrix.shape[0]}.')
print(f'Size of the vocabulary used to describe them: {tfidf_matrix.shape[1]} words.')

Number of movies: 4803.
Size of the vocabulary used to describe them: 20978 words.


With the TF-IDF matrix in hand, we can now compute a similarity score between the created vectors. There are several functions that could achieve this result:

- [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).
- [Manhattan Distance](https://en.wikipedia.org/wiki/Taxicab_geometry).
- [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
- [Jaccard Similarity](https://en.wikipedia.org/wiki/Jaccard_index).
- [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
- _and many others..._

In this notebook, we will be using the cosine similarity score.


$$\text{similarity} = \cos(\theta) = \frac{\textbf{A} \cdot \textbf{B}}{\lVert \textbf{A} \rVert \lVert \textbf{B} \rVert} = \frac{\sum\limits_{i=1}^{n} A_i B_i}{\sqrt{\sum\limits_{i=1}^{n} A_i^2} \sqrt{\sum\limits_{i=1}^{n} B_i^2}}$$

where:

- $\textbf{A}$ and $\textbf{B}$ are two vectors.
- $\theta$ is the angle between them.

This formula calculates the cosine of the angle between the two vectors, which measures how similar their directions are. In general, the similarity score ranges from $-1$ to $1$, with values closer to $1$ indicating higher similarity. Because TF-IDF vectors have no negative components, the scores here fall between $0$ and $1$.

This formula is commonly used in natural language processing and information retrieval to measure the similarity between two documents or two sets of features. Many modern [retrieval augmented language models](https://arxiv.org/abs/2302.00083) rely on this type of vector search to improve their performance.
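
A minimal `numpy` sketch of the cosine similarity formula follows. The vectors are arbitrary examples, not taken from the dataset, and the helper name `cosine` is our own.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine([1, 2, 0], [2, 4, 0])   # parallel vectors
orthogonal = cosine([1, 0, 0], [0, 3, 0])       # no shared components
opposite = cosine([1, 2, 0], [-1, -2, 0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change its score, since only the direction matters.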


In [12]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(cosine_sim.shape)

(4803, 4803)


We now have a movies $\times$ movies similarity matrix that tells us how similar the synopsis of each movie is to the synopsis of every other movie.

Now, we can define a function that takes in a movie title as an input and outputs a list of the most similar movies!

In [15]:
import pandas as pd

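# Map each title to its row index. When a title appears more than once, keep the first occurrence
indices = pd.Series(movie_dataset.index, index=movie_dataset['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def get_recommendations(title, cosine_sim=cosine_sim):
    """Display the five movies most similar to `title` according to `cosine_sim`."""
    # Row index of the movie that matches the title
    idx = indices[title]

    # Pair every movie index with its similarity to the chosen movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip position 0 (the movie itself) and keep the next five
    sim_scores = sim_scores[1:6]

    movie_indices = [x[0] for x in sim_scores]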

    display(MK(f'### Recommendations for "{title}"\n\n{"_" * 3}\n'))
    for movie in movie_indices:
        a = movie_dataset['title'].iloc[movie]
        b = movie_dataset['overview'].iloc[movie]
        display(MK(f'#### Title: {a}.\n\n**Synopsis:** _{b}_\n\n'))



get_recommendations('Avatar')

### Recommendations for "Avatar"

___


#### Title: Apollo 18.

**Synopsis:** _Officially, Apollo 17 was the last manned mission to the moon. But a year later in 1973, three American astronauts were sent on a secret mission to the moon funded by the US Department of Defense. What you are about to see is the actual footage which the astronauts captured on that mission. While NASA denies it's authenticity, others say it's the real reason we've never gone back to the moon._



#### Title: The American.

**Synopsis:** _Dispatched to a small Italian town to await further orders, assassin Jack embarks on a double life that may be more relaxing than is good for him._



#### Title: The Matrix.

**Synopsis:** _Set in the 22nd century, The Matrix tells the story of a computer hacker who joins a group of underground insurgents fighting the vast and powerful computers who now rule the earth._



#### Title: The Inhabited Island.

**Synopsis:** _On the threshold of 22nd century, furrowing the space, protagonist from the Free Search Group makes emergency landing on an unknown planet where he must stay. People who are living on this planet have remained at the stone level of the 20th century, with its social problems, miserable ecology and shaky world.._



#### Title: Tears of the Sun.

**Synopsis:** _Navy SEAL Lieutenant A.K. Waters and his elite squadron of tactical specialists are forced to choose between their duty and their humanity, between following orders by ignoring the conflict that surrounds them, or finding the courage to follow their conscience and protect a group of innocent refugees. When the democratic government of Nigeria collapses and the country is taken over by a ruthless military dictator, Waters, a fiercely loyal and hardened veteran is dispatched on a routine mission to retrieve a Doctors Without Borders physician._



Let us use the same technique, but instead of using synopsis as our metadata, let us use `['cast', 'crew', 'keywords', 'genres']`.

In [16]:
from ast import literal_eval
import numpy as np

features = ['cast', 'crew', 'keywords', 'genres']

for feature in features:
    movie_dataset[feature] = movie_dataset[feature].apply(literal_eval)

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return []

movie_dataset['director'] = movie_dataset['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    movie_dataset[feature] = movie_dataset[feature].apply(get_list)

display(movie_dataset[['title', 'cast', 'director', 'keywords', 'genres']].head())

Unnamed: 0,title,cast,director,keywords,genres
0,Four Rooms,"[Tim Roth, Antonio Banderas, Jennifer Beals]",Allison Anders,"[hotel, new year's eve, witch]","[Crime, Comedy]"
1,Star Wars,"[Mark Hamill, Harrison Ford, Carrie Fisher]",George Lucas,"[android, galaxy, hermit]","[Adventure, Action, Science Fiction]"
2,Finding Nemo,"[Albert Brooks, Ellen DeGeneres, Alexander Gould]",Andrew Stanton,"[father son relationship, harbor, underwater]","[Animation, Family]"
3,Forrest Gump,"[Tom Hanks, Robin Wright, Gary Sinise]",Robert Zemeckis,"[vietnam veteran, hippie, mentally disabled]","[Comedy, Drama, Romance]"
4,American Beauty,"[Kevin Spacey, Annette Bening, Thora Birch]",Sam Mendes,"[male nudity, female nudity, adultery]",[Drama]


Almost every time we work with text data, some cleaning is necessary. Below, we remove the spaces from every string and convert it to lowercase, so that a multi-word name such as "Tim Roth" becomes a single token (`timroth`).

In [17]:
def clean_data(x):
    """
    Cleans the input data by converting all strings
    to lowercase and removing any spaces.

    Args:
        x (str or list): The input data to clean.

    Returns:
        str or list: The cleaned data. If the input is a
        string, a cleaned string is returned.
        If the input is a list of strings, a list of
        cleaned strings is returned. If the input is
        not a string or a list, an empty string is returned.
    """
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    return ''

features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    movie_dataset[feature] = movie_dataset[feature].apply(clean_data)

display(movie_dataset[['title', 'cast', 'director', 'keywords', 'genres']].head())

Unnamed: 0,title,cast,director,keywords,genres
0,Four Rooms,"[timroth, antoniobanderas, jenniferbeals]",allisonanders,"[hotel, newyear'seve, witch]","[crime, comedy]"
1,Star Wars,"[markhamill, harrisonford, carriefisher]",georgelucas,"[android, galaxy, hermit]","[adventure, action, sciencefiction]"
2,Finding Nemo,"[albertbrooks, ellendegeneres, alexandergould]",andrewstanton,"[fathersonrelationship, harbor, underwater]","[animation, family]"
3,Forrest Gump,"[tomhanks, robinwright, garysinise]",robertzemeckis,"[vietnamveteran, hippie, mentallydisabled]","[comedy, drama, romance]"
4,American Beauty,"[kevinspacey, annettebening, thorabirch]",sammendes,"[malenudity, femalenudity, adultery]",[drama]


Now, we will group all these features into one big "_soup feature_", i.e., a giant string containing all information about these separated features.

> Note: The creation of a new feature by modifying and combining other features is what we call [`feature engineering`](https://en.wikipedia.org/wiki/Feature_engineering).

In [18]:
def create_soup_feature(x):
    """
    Create a feature by combining information about the
    keywords, cast, director, and genres of a movie.

    Args:
        - x (pd.Series): A Pandas series containing
        information about a movie.

    Returns:
        - str: A string that represents the combined
        features of the movie.
    """
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

movie_dataset['soup'] = movie_dataset.apply(create_soup_feature, axis=1)

display(movie_dataset['soup'][0])

"hotel newyear'seve witch timroth antoniobanderas jenniferbeals allisonanders crime comedy"

Now, we can use the same function we wrote before, changing the similarity matrix, and get recommendations that take into account the director, genre, actors, and all the other features we selected.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

# The CountVectorizer converts a collection of text documents to a matrix of token counts
# i.e., a tokenization method that takes into account how many times each token appears
count = CountVectorizer(stop_words='english')

count_matrix = count.fit_transform(movie_dataset['soup'])

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

get_recommendations('Avatar', cosine_sim2)

### Recommendations for "Avatar"

___


#### Title: Clash of the Titans.

**Synopsis:** _Born of a god but raised as a man, Perseus is helpless to save his family from Hades, vengeful god of the underworld. With nothing to lose, Perseus volunteers to lead a dangerous mission to defeat Hades before he can seize power from Zeus and unleash hell on earth. Battling unholy demons and fearsome beasts, Perseus and his warriors will only survive if Perseus accepts his power as a god, defies fate and creates his own destiny._



#### Title: The Mummy: Tomb of the Dragon Emperor.

**Synopsis:** _Archaeologist Rick O'Connell travels to China, pitting him against an emperor from the 2,000-year-old Han dynasty who's returned from the dead to pursue a quest for world domination. This time, O'Connell enlists the help of his wife and son to quash the so-called 'Dragon Emperor' and his abuse of supernatural power._



#### Title: The Monkey King 2.

**Synopsis:** _Taking place 500 years after the Havoc in Heaven, the Tang Priest is appointed by Buddha to go to the West to fetch the sacred scriptures, only to accidentally free the Monkey King. With Lady White (Gong Li) aiming to break up the team assembled to defeat her, the Monkey King must fight in order to save his world!_



#### Title: Fantastic 4: Rise of the Silver Surfer.

**Synopsis:** _The Fantastic Four return to the big screen as a new and all powerful enemy threatens the Earth. The seemingly unstoppable 'Silver Surfer', but all is not what it seems and there are old and new enemies that pose a greater threat than the intrepid superheroes realize._



#### Title: The Time Machine.

**Synopsis:** _Hoping to alter the events of the past, a 19th century inventor instead travels 800,000 years into the future, where he finds humankind divided into two warring races._



The above system is only capable of suggesting movies that are similar to other movies. That is, it is not capable of capturing tastes and providing recommendations across genres. Also, the system doesn't capture the personal tastes and biases of a user. However, with **Collaborative Filtering** we can address this flaw.

In this section, we demonstrate [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) using the [MovieLens dataset](https://www.kaggle.com/datasets/abhikjha/movielens-100k) from the Hub 🤗.

In [20]:
# Load the datasets from the hub
user_data = load_dataset('dieineb/movielens_user_ratings')

# Turn the datasets into a pandas.DataFrame
user_data = user_data['train'].to_pandas()

display(user_data)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


The steps to build a recommender system with this approach will be:

- Map user ID to a "_user vector_".
- Map movie ID to a "_movie vector_".
- Compute the dot product between the user vector and the movie vector to predict the rating (both vectors are [`embeddings`](https://en.wikipedia.org/wiki/Word_embedding)).
- Train the embeddings via gradient descent using all known user-movie pairs.

> Note: In machine learning, `embeddings` refer to a type of representation for data that maps high-dimensional data points to low-dimensional vectors. These vectors capture the key features or attributes of the data points in a compressed form. For example, in `NLP`, words or syllables can be represented as embeddings that capture their semantic meaning.
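
The first three steps can be sketched with plain `numpy`. The embedding tables below are filled with random numbers, and the sizes (`n_users`, `n_movies`, `dim`) are made up for illustration; in the real model, these tables are learned by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, dim = 4, 6, 3   # toy sizes, assumptions for illustration

# One row per user / movie: these play the role of the embedding tables
user_vectors = rng.normal(size=(n_users, dim))
movie_vectors = rng.normal(size=(n_movies, dim))
user_bias = np.zeros(n_users)
movie_bias = np.zeros(n_movies)

def predict(user_idx, movie_idx):
    """Score a batch of (user, movie) pairs with a dot product plus biases."""
    u = user_vectors[user_idx]            # shape (batch, dim)
    m = movie_vectors[movie_idx]          # shape (batch, dim)
    score = (u * m).sum(axis=1)           # one dot product per pair
    score = score + user_bias[user_idx] + movie_bias[movie_idx]
    return 1.0 / (1.0 + np.exp(-score))   # sigmoid squashes into (0, 1)

pair_users = np.array([0, 0, 3])
pair_movies = np.array([1, 5, 2])
scores = predict(pair_users, pair_movies)
print(scores)
```

Each entry of `scores` corresponds to one (user, movie) pair. The dot product is taken row by row, so the pairs in a batch do not influence each other.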

In [23]:
# Get a unique list of users
user_ids = user_data["userId"].unique().tolist()

# Create a look up table to convert users to encodings and vice-versa
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}

# Get a unique list of movies
movie_ids = user_data["movieId"].unique().tolist()

# Create a look up table to convert movies to encodings and vice-versa
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}

# Create the features and targets for our ML model
user_data["user"] = user_data["userId"].map(user2user_encoded)
user_data["movie"] = user_data["movieId"].map(movie2movie_encoded)
user_data["rating"] = user_data["rating"].values.astype(np.float32)

# Get the min and max rating values (for normalization purposes)
min_rating = min(user_data["rating"])
max_rating = max(user_data["rating"])

display(user_data)
print(f"Number of users: {len(user2user_encoded)}.")
print(f"Number of Movies: {len(movie_encoded2movie)}.")


Unnamed: 0,userId,movieId,rating,timestamp,user,movie
0,1,1,4.0,964982703,0,0
1,1,3,4.0,964981247,0,1
2,1,6,4.0,964982224,0,2
3,1,47,5.0,964983815,0,3
4,1,50,5.0,964982931,0,4
...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,609,3120
100832,610,168248,5.0,1493850091,609,2035
100833,610,168250,5.0,1494273047,609,3121
100834,610,168252,5.0,1493846352,609,1392


Number of users: 610.
Number of Movies: 9724.


Now we shuffle our dataset, separate features and labels, and split the data into a training set and a validation set.

In [24]:
# Shuffle the dataset
user_data = user_data.sample(frac=1, random_state=42)

# Define input features
x = user_data[["user", "movie"]].values

# Normalize the targets between 0 and 1
y = user_data["rating"].apply(lambda x: (x - min_rating) / (max_rating - min_rating)).values

# Use 90% of data for training
train_indices = int(0.9 * user_data.shape[0])

x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)

print('Training set: ', x_train.shape)
print('Validation set: ', x_val.shape)

Training set:  (90752, 2)
Validation set:  (10084, 2)


Below, we implement a Keras model by subclassing `keras.Model`, just like in our PyTorch tutorial. The model embeds both users and movies into 50-dimensional vectors and then computes a similarity score between user and movie embeddings via a dot product, adding a per-movie and per-user bias at the end.

The final output of the model will be a score of how likely the user will "like" other movies based on their "_movie history_."
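
Because the targets were scaled to the $[0, 1]$ range before training, a predicted score can be mapped back to the original rating scale by inverting that scaling. The sketch below assumes `min_rating = 0.5` and `max_rating = 5.0`; these are placeholder values, since the notebook computes the real ones from the data, and the helper names are hypothetical.

```python
def normalize(rating, min_rating, max_rating):
    """Scale a rating to the [0, 1] range (same formula used for the targets)."""
    return (rating - min_rating) / (max_rating - min_rating)

def denormalize(score, min_rating, max_rating):
    """Map a score in [0, 1] back to the original rating scale."""
    return score * (max_rating - min_rating) + min_rating

# Assumed bounds, for illustration only
min_rating, max_rating = 0.5, 5.0

score = normalize(4.0, min_rating, max_rating)
recovered = denormalize(score, min_rating, max_rating)

print(f"normalized: {score:.3f}, recovered rating: {recovered:.1f}")
```

Under these assumed bounds, a model output of `0.5` would correspond to a rating halfway between the minimum and the maximum.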

In [25]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


class RecommenderNet(keras.Model):
    """
    Neural network model for movie recommendation.

    Args:
        num_users (int): Number of unique users in the dataset.
        num_movies (int): Number of unique movies in the dataset.
        embedding_size (int): Size of the embedding vectors
        for users and movies.

    Attributes:
        user_embedding (keras.layers.Embedding): Embedding
        layer for users.
        user_bias (keras.layers.Embedding): Bias layer for users.
        movie_embedding (keras.layers.Embedding): Embedding layer
        for movies.
        movie_bias (keras.layers.Embedding): Bias layer for movies.

    Methods:
        call(inputs): Computes the output of the model for a
        batch of inputs.
    """
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.user_bias = layers.Embedding(num_users, 1)
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.movie_bias = layers.Embedding(num_movies, 1)

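    def call(self, inputs):
        # inputs[:, 0] holds the encoded user IDs, inputs[:, 1] the encoded movie IDs
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        movie_bias = self.movie_bias(inputs[:, 1])
        # Per-example dot product, shape (batch, 1). Note that tf.tensordot(..., 2)
        # would also contract over the batch axis and return a single scalar
        dot_user_movie = tf.reduce_sum(user_vector * movie_vector, axis=1, keepdims=True)
        # Add the per-user and per-movie biases
        x = dot_user_movie + user_bias + movie_bias
        # The sigmoid squashes the score into [0, 1], matching the normalized ratings
        return tf.nn.sigmoid(x)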


model = RecommenderNet(len(user2user_encoded), len(movie_encoded2movie), 50)

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001))

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")


history = model.fit(
    x=x_train,
    y=y_train,
    batch_size=64,
    epochs=10,
    verbose=1,
    validation_data=(x_val, y_val),
)

Version:  2.13.0
Eager mode:  True
GPU is NOT AVAILABLE
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
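
A model like `RecommenderNet` scores each (user, movie) pair as the dot product of two embeddings plus two bias terms, squashed by a sigmoid. The NumPy sketch below illustrates that computation on a toy batch. The shapes and random values are assumptions made only for illustration, not values from the trained model. It also shows why the dot product should be taken per pair: contracting both axes collapses the whole batch into a single number.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration
batch_size, embedding_size = 4, 3

# Stand-ins for the embedding and bias lookups of a batch of (user, movie) pairs
user_vectors = rng.normal(size=(batch_size, embedding_size))
movie_vectors = rng.normal(size=(batch_size, embedding_size))
user_bias = rng.normal(size=(batch_size, 1))
movie_bias = rng.normal(size=(batch_size, 1))


def sigmoid(x):
    """Map real-valued scores into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# One dot product per (user, movie) pair: sum over the embedding axis only
per_pair_dot = np.sum(user_vectors * movie_vectors, axis=1, keepdims=True)

# Contracting both axes (axes=2) yields a single scalar for the whole batch
whole_batch_dot = np.tensordot(user_vectors, movie_vectors, axes=2)

# Final score for each pair, shape (batch_size, 1)
scores = sigmoid(per_pair_dot + user_bias + movie_bias)

print("per-pair shape:", per_pair_dot.shape)
print("whole-batch shape:", np.shape(whole_batch_dot))
print("scores shape:", scores.shape)
```

Because the sigmoid keeps every score strictly between 0 and 1, this setup presumably expects the training targets to be rescaled to that range as well.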


Now that we have a trained recommendation system, let us load the list of movies it can recommend from.

In [26]:
# Load the datasets from the hub
movies_df = load_dataset('dieineb/movielens_movies')

# Turn the datasets into a pandas.DataFrame
movies_df = movies_df['train'].to_pandas()

display(movies_df)

| Unnamed: 0 | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | Action\|Animation\|Comedy\|Fantasy |
| 9738 | 193583 | No Game No Life: Zero (2017) | Animation\|Comedy\|Fantasy |
| 9739 | 193585 | Flint (2017) | Drama |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action\|Animation |


Let us take user `42` and see the top recommendations our model produces for them.

In [29]:
userID = 42

# Movies that user 42 has seen
movies_watched_by_user_42 = user_data[user_data.userId == userID]
movies_watched = movies_df[movies_df["movieId"].isin(movies_watched_by_user_42.movieId.values)]["movieId"]

# Movies not seen by user 42
movies_not_watched = movies_df[~movies_df["movieId"].isin(movies_watched_by_user_42.movieId.values)]["movieId"]
movies_not_watched = list(set(movies_not_watched).intersection(set(movie2movie_encoded.keys())))
movies_not_watched = [[movie2movie_encoded.get(x)] for x in movies_not_watched]

# Get the encoded index for the selected user
user_encoder = user2user_encoded.get(userID)

# Create an array to store recommendations
user_movie_array = np.hstack(([[user_encoder]] * len(movies_not_watched), movies_not_watched))

# Generate recommendations with the trained model
ratings = model.predict(user_movie_array, verbose=0).flatten()

# Select the 10 highest-scoring unseen movies (best first)
top_ratings_indices = ratings.argsort()[-10:][::-1]
recommended_movie_ids = [movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_ratings_indices]

# The five movies the user rated highest, shown first for context
top_movies_user = (movies_watched_by_user_42.sort_values(by="rating", ascending=False).head(5).movieId.values)
movie_df_rows = movies_df[movies_df["movieId"].isin(top_movies_user)]

# Note: isin() returns rows in movies_df order, not in order of predicted rating
recommended_movies = movies_df[movies_df["movieId"].isin(recommended_movie_ids)]

# Print the results
display(MK(f'''
### Recommendations for user: ${userID}$\n\n---'''))

for row in movie_df_rows.itertuples():
    display(MK(f"**{row.title}** | _{row.genres}_."))

display(MK("\n\n### Top 10 movie recommendations\n\n---"))

for row in recommended_movies.itertuples():
    display(MK(f"**{row.title}** | _{row.genres}_."))

display(MK('---'))


### Recommendations for user: $42$

---

**American President, The (1995)** | _Comedy|Drama|Romance_.

**Right Stuff, The (1983)** | _Drama_.

**Gattaca (1997)** | _Drama|Sci-Fi|Thriller_.

**Airplane! (1980)** | _Comedy_.

**City Slickers (1991)** | _Comedy|Western_.



### Top 10 movie recommendations

---

**Fargo (1996)** | _Comedy|Crime|Drama|Thriller_.

**Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)** | _Comedy|War_.

**Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)** | _Drama|Film-Noir|Romance_.

**His Girl Friday (1940)** | _Comedy|Romance_.

**Streetcar Named Desire, A (1951)** | _Drama_.

**Cool Hand Luke (1967)** | _Drama_.

**Boondock Saints, The (2000)** | _Action|Crime|Drama|Thriller_.

**Double Indemnity (1944)** | _Crime|Drama|Film-Noir_.

**Spirited Away (Sen to Chihiro no kamikakushi) (2001)** | _Adventure|Animation|Fantasy_.

**Eternal Sunshine of the Spotless Mind (2004)** | _Drama|Romance|Sci-Fi_.

---
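
The top-10 selection above relies on `argsort`, which sorts in ascending order, so the last `k` indices are taken and then reversed. Filtering a DataFrame with `isin` afterwards returns rows in the DataFrame's own order, so the printed list is not necessarily sorted by predicted score. Below is a minimal sketch of one way to keep the ranking, using made-up titles and scores that are purely hypothetical.

```python
import numpy as np

# Hypothetical predicted scores for five unseen movies
titles = np.array(["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"])
predicted = np.array([0.61, 0.93, 0.40, 0.88, 0.75])

k = 3

# argsort is ascending: take the last k indices, then reverse for best-first
top_k = predicted.argsort()[-k:][::-1]

# Index with top_k directly so the ranked order is preserved
ranked = [(str(titles[i]), float(predicted[i])) for i in top_k]

for rank, (title, score) in enumerate(ranked, start=1):
    print(f"{rank}. {title} ({score:.2f})")
```

The same idea carries over to a DataFrame: indexing or reordering by the ranked list of IDs, rather than filtering with a membership test, keeps the best-scoring item first.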

We explored three of the main methodologies for creating recommender systems:

- **Demographic Filtering.**
- **Content-Based Filtering.**
- **Collaborative Filtering.**

Good recommenders usually combine all of these techniques, but this is a good start if you want to master recommendation systems through ML. If you want to know more about recommender systems powered by ML models, we recommend "_[The Use of Machine Learning Algorithms in Recommender Systems: A Systematic Review](https://arxiv.org/ftp/arxiv/papers/1511/1511.05263.pdf)_" as a good place to start! 🙃

-----

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).