# Embeddings for Recommendation Systems

As mentioned earlier, the concept of embeddings is useful in many domains beyond natural language processing. In industry, embeddings are widely used for recommendation systems, for example to suggest movies, products, or videos.

In this project, we use the Word2Vec algorithm to embed movies based on real user behavior. We treat each movie as a token and each user’s watch history as a sentence. By learning from which movies frequently appear together in user watchlists, the model learns vector representations that capture movie similarity.

These embeddings can then be used to recommend movies that are often watched together or share similar viewing patterns.

The dataset we use is the **MovieLens Latest Small** dataset from GroupLens, which contains roughly 100,000 ratings from about 600 users on about 9,000 movies. We transform these user ratings into watchlists and use them as training data for the embedding model.

The figure below illustrates the idea of treating each user’s watch history as a sequence of movie IDs, similar to how a sentence is a sequence of words.

![User watchlists containing movie IDs](../assets/videos_playlists.png)

*Figure: For movie embeddings that capture similarity, we use a dataset made up of a collection of user watchlists, each containing a list of movies.*
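
To make the sentence analogy concrete, the sketch below lists the (target, context) pairs a windowed model would see for one short watchlist. The helper `context_pairs` and the toy watchlist are hypothetical and for illustration only; gensim builds its training examples internally and may differ in detail (for example, it can randomly shrink the window).

```python
def context_pairs(watchlist, window=2):
    """Return (target, context) pairs for every movie in one watchlist."""
    pairs = []
    for i, target in enumerate(watchlist):
        # Neighbours up to `window` positions to the left and right
        start = max(0, i - window)
        end = min(len(watchlist), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, watchlist[j]))
    return pairs


# Toy watchlist using titles from the dataset
toy_watchlist = ["Toy Story (1995)", "Heat (1995)", "Fargo (1996)"]

for target, context in context_pairs(toy_watchlist, window=1):
    print(f"{target!r} -> {context!r}")
```

Movies that keep showing up in each other's context across many watchlists are pushed toward similar vectors during training.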


We first train the model, then give it a few movies and see what it recommends in response.



### Training a Movie Embedding Model

We start by loading the MovieLens dataset, which contains user ratings and movie metadata such as titles and genres. We then transform the ratings into watchlists, where each watchlist is the list of movies one user has rated, which we treat as a proxy for what that user watched. These watchlists are the training sequences for the embedding model.



In [2]:
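import pandas as pd
import kagglehub
from gensim.models import Word2Vec

# Download the dataset and get the path of the local copy
path = kagglehub.dataset_download("grouplens/movielens-latest-small")

# ratings.csv: one row per (user, movie) rating; movies.csv: titles and genres
ratings = pd.read_csv(f"{path}/ratings.csv")
movies = pd.read_csv(f"{path}/movies.csv")

# Attach the title and genres to every rating
data = ratings.merge(movies, on="movieId")

# One watchlist per user: the titles of all movies that user rated
watchlists = data.groupby("userId")["title"].apply(list).tolist()
# Drop single-movie watchlists, which carry no co-occurrence signal
watchlists = [lst for lst in watchlists if len(lst) > 1]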




In [3]:
print('Watchlist #1:\n ', watchlists[0], '\n')
print('Watchlist #2:\n ', watchlists[1])


Watchlist #1:
  ['Toy Story (1995)', 'Grumpier Old Men (1995)', 'Heat (1995)', 'Seven (a.k.a. Se7en) (1995)', 'Usual Suspects, The (1995)', 'From Dusk Till Dawn (1996)', 'Bottle Rocket (1996)', 'Braveheart (1995)', 'Rob Roy (1995)', 'Canadian Bacon (1995)', 'Desperado (1995)', 'Billy Madison (1995)', 'Clerks (1994)', 'Dumb & Dumber (Dumb and Dumber) (1994)', 'Ed Wood (1994)', 'Star Wars: Episode IV - A New Hope (1977)', 'Pulp Fiction (1994)', 'Stargate (1994)', 'Tommy Boy (1995)', 'Clear and Present Danger (1994)', 'Forrest Gump (1994)', 'Jungle Book, The (1994)', 'Mask, The (1994)', 'Blown Away (1994)', 'Dazed and Confused (1993)', 'Fugitive, The (1993)', 'Jurassic Park (1993)', 'Mrs. Doubtfire (1993)', "Schindler's List (1993)", 'So I Married an Axe Murderer (1993)', 'Three Musketeers, The (1993)', 'Tombstone (1993)', 'Dances with Wolves (1990)', 'Batman (1989)', 'Silence of the Lambs, The (1991)', 'Pinocchio (1940)', 'Fargo (1996)', 'Mission: Impossible (1996)', 'James and the Giant

In [4]:
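from gensim.models import Word2Vec

# Train a Word2Vec model on the watchlists (each movie title is one token)
model = Word2Vec(
    sentences=watchlists,
    vector_size=50,  # dimensionality of each movie embedding
    window=5,        # context: up to 5 movies on each side of the target
    min_count=2,     # ignore movies that occur fewer than 2 times overall
    workers=4        # number of training threads
)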
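
One thing to check before training is the order of titles inside each watchlist, because with `window=5` only nearby titles count as context. The first watchlist printed above appears to follow the row order of `ratings.csv` (which looks like ascending `movieId`) rather than the order in which the user rated the movies. Below is a minimal sketch of ordering by time instead, written in plain Python on made-up rows so that it runs without the dataset. `build_watchlists` and the sample rows are hypothetical; only the idea of sorting on the `timestamp` column of `ratings.csv` carries over.

```python
from collections import defaultdict


def build_watchlists(rows):
    """Group (user_id, title, timestamp) rows into per-user lists ordered by time."""
    per_user = defaultdict(list)
    for user_id, title, timestamp in rows:
        per_user[user_id].append((timestamp, title))

    watchlists = []
    for user_id in sorted(per_user):
        # Sort each user's ratings chronologically, then keep only the titles
        ordered = [title for _, title in sorted(per_user[user_id])]
        if len(ordered) > 1:  # same filter as above: drop single-movie lists
            watchlists.append(ordered)
    return watchlists


# Made-up rows for illustration: (userId, title, timestamp)
rows = [
    (1, "Toy Story (1995)", 300),
    (1, "Heat (1995)", 100),
    (1, "Fargo (1996)", 200),
    (2, "Heat (1995)", 50),
]

print(build_watchlists(rows))
```

With pandas, the equivalent would be sorting `data` by `["userId", "timestamp"]` before the `groupby`. Another option is a window wider than the longest watchlist, so that every movie in a list counts as context for every other. Whether either choice improves recommendations on this dataset would need to be checked.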


In [5]:
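movie_id = 1  # movieId to use as the query

# Look up the title, since the model's vocabulary is keyed by title
movie_title = movies.loc[movies.movieId == movie_id, "title"].values[0]
print("Movie:", movie_title)

# The 10 movies whose embeddings are closest to the query by cosine similarity
model.wv.most_similar(positive=[movie_title], topn=10)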



Movie: Toy Story (1995)


[('Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 0.9934884905815125),
 ('Babe (1995)', 0.9932944774627686),
 ('GoldenEye (1995)', 0.9926254749298096),
 ('Jumanji (1995)', 0.9920353889465332),
 ('Clueless (1995)', 0.9899606704711914),
 ('Seven (a.k.a. Se7en) (1995)', 0.989814281463623),
 ('Casino (1995)', 0.9895472526550293),
 ('Heat (1995)', 0.9892486929893494),
 ('Leaving Las Vegas (1995)', 0.9891033172607422),
 ('American President, The (1995)', 0.987956702709198)]
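
The scores returned by `most_similar` are cosine similarities between embedding vectors. The sketch below reproduces that kind of ranking on made-up 3-dimensional vectors in plain Python. `cosine`, `rank_similar`, and the vectors themselves are hypothetical stand-ins for the trained model, not values taken from it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(query, vectors, topn=2):
    """Rank every other key by cosine similarity to `query`, highest first."""
    scores = [
        (key, cosine(vectors[query], vec))
        for key, vec in vectors.items()
        if key != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up embeddings for illustration only
toy_vectors = {
    "Toy Story (1995)": [1.0, 0.0, 0.0],
    "Babe (1995)": [0.9, 0.1, 0.0],
    "Heat (1995)": [0.0, 1.0, 0.0],
}

print(rank_similar("Toy Story (1995)", toy_vectors))
```

When every score sits close to 0.99, as in the output above, the top neighbours are hard to tell apart by similarity alone, so it is worth checking the ranking against a few movies you know well.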

In [6]:
print(movies.iloc[2172])

movieId                     2888
title      Drive Me Crazy (1999)
genres            Comedy|Romance
Name: 2172, dtype: object


In [7]:
def print_recommendations(movie_id, topn=5):
    """
    Given a movieId, return the metadata rows of the top-N most similar movies.

    Returns a message string if the movie is missing from the dataset or from
    the model vocabulary. Rows come back in `movies` table order, not ranked
    by similarity.
    """

    # Convert movieId to movie title
    row = movies.loc[movies.movieId == movie_id, "title"]

    if row.empty:
        return f"Movie ID {movie_id} not found in dataset."

    movie_title = row.values[0]

    # Check if the movie exists in the model vocabulary
    if movie_title not in model.wv.key_to_index:
        return f"Movie '{movie_title}' not in vocabulary."

    # Get the top-N nearest neighbours as (title, similarity) pairs
    similar_movies = model.wv.most_similar(positive=[movie_title], topn=topn)

    similar_titles = [title for title, _ in similar_movies]

    # Look up the metadata rows for those titles
    return movies[movies["title"].isin(similar_titles)]
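
Filtering with `isin` returns rows in the order of the `movies` table, so the similarity ranking and the scores are lost. Below is a minimal sketch of keeping both, using a plain dictionary in place of the `movies` DataFrame and a hard-coded list in place of a `model.wv.most_similar` result. `ranked_recommendations`, `toy_genres`, and `toy_similar` are hypothetical names, and the scores are made up for illustration.

```python
def ranked_recommendations(similar, genres_by_title):
    """Keep the similarity order and attach genres to each (title, score) pair."""
    rows = []
    for rank, (title, score) in enumerate(similar, start=1):
        rows.append(
            {
                "rank": rank,
                "title": title,
                "similarity": round(score, 3),
                # Titles missing from the lookup get None instead of raising
                "genres": genres_by_title.get(title),
            }
        )
    return rows


# Hypothetical stand-ins for the movies table and a most_similar() result
toy_genres = {
    "Seven (a.k.a. Se7en) (1995)": "Mystery|Thriller",
    "Braveheart (1995)": "Action|Drama|War",
}
toy_similar = [
    ("Braveheart (1995)", 0.91),
    ("Seven (a.k.a. Se7en) (1995)", 0.88),
]

for row in ranked_recommendations(toy_similar, toy_genres):
    print(row)
```

With pandas, one way to get the same effect is to build a DataFrame from the `(title, score)` pairs and left-merge it with `movies` on `title`, since a left merge keeps the key order of the left frame.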


In [8]:
print_recommendations(345)

|     | movieId | title | genres |
| --- | --- | --- | --- |
| 297 | 339 | While You Were Sleeping (1995) | Comedy\|Romance |
| 306 | 348 | Bullets Over Broadway (1994) | Comedy |
| 307 | 349 | Clear and Present Danger (1994) | Action\|Crime\|Drama\|Thriller |
| 308 | 350 | Client, The (1994) | Drama\|Mystery\|Thriller |
| 311 | 353 | Crow, The (1994) | Action\|Crime\|Fantasy\|Thriller |


In [9]:
print_recommendations(842)



|     | movieId | title | genres |
| --- | --- | --- | --- |
| 637 | 810 | Kazaam (1996) | Children\|Comedy\|Fantasy |
| 668 | 880 | Island of Dr. Moreau, The (1996) | Sci-Fi\|Thriller |
| 676 | 892 | Twelfth Night (1996) | Comedy\|Drama\|Romance |
| 711 | 930 | Notorious (1946) | Film-Noir\|Romance\|Thriller |
| 742 | 969 | African Queen, The (1951) | Adventure\|Comedy\|Romance\|War |


In [10]:
print_recommendations(50)

|     | movieId | title | genres |
| --- | --- | --- | --- |
| 43 | 47 | Seven (a.k.a. Se7en) (1995) | Mystery\|Thriller |
| 44 | 48 | Pocahontas (1995) | Animation\|Children\|Drama\|Musical\|Romance |
| 52 | 58 | Postman, The (Postino, Il) (1994) | Comedy\|Drama\|Romance |
| 92 | 104 | Happy Gilmore (1996) | Comedy |
| 97 | 110 | Braveheart (1995) | Action\|Drama\|War |
