# Embeddings for Recommendation Systems

As mentioned earlier, embeddings are useful well beyond text. In industry, for example, they are widely used in recommendation systems.

We’ll use the word2vec algorithm to embed songs using human-made music playlists. The idea is to treat each song as we would a word or token, and each playlist as a sentence. The resulting embeddings can then be used to recommend songs that often appear together in playlists.
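To make the "playlist as sentence" idea concrete, here is a minimal sketch of the (center, context) pairs that a window-based model looks at. The function name `skipgram_pairs` and the IDs are hypothetical. Real implementations differ in the details: Gensim, for instance, trains CBOW by default, which uses the same windows but predicts the center item from its context.

```python
def skipgram_pairs(sequence, window):
    """Return (center, context) pairs for items at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        end = min(len(sequence), i + window + 1)
        for j in range(start, end):
            if j != i:  # an item is not its own context
                pairs.append((center, sequence[j]))
    return pairs

# Hypothetical playlist of item IDs, stored as strings like word tokens
playlist = ["50", "181", "172", "1"]
pairs = skipgram_pairs(playlist, window=1)
print(pairs)
```

Items that share many contexts across playlists end up with similar vectors, which is what makes the embeddings usable for recommendations.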

The dataset we’ll use was collected by Shuo Chen from Cornell University. It contains playlists from hundreds of radio stations around the US. Figure 2-17 demonstrates this dataset.

![Three playlists containing watched video IDs](../assets/videos_playlists.png)

Figure 2-17. For video embeddings that capture video similarity we’ll use a dataset made up of a collection of playlists, each containing a list of videos.


Let’s demonstrate the end product before we look at how it’s built: we give it a few songs and see what it recommends in response.



### Training a Song Embedding Model

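We’ll start by loading the dataset. The code below applies the same idea to the MovieLens 100K ratings file: each user’s list of rated movies plays the role of a playlist, and each movie ID plays the role of a song. The metadata (movie titles) is loaded in a later cell: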



In [1]:
import pandas as pd
from urllib import request

# Download the MovieLens 100K ratings file (movie recommendation data)
data_url = "https://files.grouplens.org/datasets/movielens/ml-100k/u.data"
data = request.urlopen(data_url)

lines = data.read().decode("utf-8").split("\n")
rows = [line.split("\t") for line in lines if len(line.split("\t")) == 4]
df = pd.DataFrame(rows, columns=["user_id", "movie_id", "rating", "timestamp"])

df["user_id"] = df["user_id"].astype(int)
df["movie_id"] = df["movie_id"].astype(int)

# Load song metadata
# songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
# songs_file = songs_file.read().decode("utf-8").split('\n')
# songs = [s.rstrip().split('\t') for s in songs_file]
# songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
# songs_df = songs_df.set_index('id')

In [2]:
print("Number of rows:", len(df))
print("Number of unique users:", df["user_id"].nunique())
print("Number of unique movies:", df["movie_id"].nunique())

print("\nSample of the dataset:")
print(df.head())

print("\nMost watched movies:")
print(df["movie_id"].value_counts().head())

print("\nUsers most active:")
print(df["user_id"].value_counts().head())

Number of rows: 100000
Number of unique users: 943
Number of unique movies: 1682

Sample of the dataset:
   user_id  movie_id rating  timestamp
0      196       242      3  881250949
1      186       302      3  891717742
2       22       377      1  878887116
3      244        51      2  880606923
4      166       346      1  886397596

Most watched movies:
movie_id
50     583
258    509
100    508
181    507
294    485
Name: count, dtype: int64

Users most active:
user_id
405    737
655    685
13     636
450    540
276    518
Name: count, dtype: int64


In [6]:
movielists = (
    df.groupby("user_id")["movie_id"]
    .apply(lambda x: list(map(str, x.tolist())))
    .tolist()
)
print("\nmovielist #1:", movielists[0])
print("movielist #2:", movielists[1])


movielist #1: ['61', '189', '33', '160', '20', '202', '171', '265', '155', '117', '47', '222', '253', '113', '227', '17', '90', '64', '92', '228', '266', '121', '114', '132', '74', '134', '98', '186', '221', '84', '31', '70', '60', '177', '27', '260', '145', '174', '159', '82', '56', '272', '80', '229', '140', '225', '235', '120', '125', '215', '6', '104', '49', '206', '76', '72', '185', '96', '213', '233', '258', '81', '78', '212', '143', '151', '51', '175', '107', '218', '209', '259', '108', '262', '12', '14', '97', '44', '53', '163', '210', '184', '157', '201', '150', '183', '248', '208', '128', '242', '148', '112', '193', '264', '219', '232', '236', '252', '200', '180', '250', '85', '91', '10', '254', '129', '241', '130', '255', '103', '118', '54', '267', '24', '86', '196', '39', '164', '230', '36', '23', '224', '73', '67', '65', '190', '100', '226', '243', '154', '214', '161', '62', '188', '102', '69', '170', '38', '9', '246', '22', '21', '179', '187', '135', '68', '146', '176', 
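Word2Vec's `window` depends on the order of items in each list. The `groupby` above keeps each user's rows in file order, which may not match the order in which the movies were rated. As a minimal sketch, assuming we want chronological sequences, the rows can be sorted by timestamp before grouping. The rows and the helper `build_sequences` below are hypothetical and use only the standard library:

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id, rating, timestamp) rows in arbitrary order
rows = [
    (1, 242, 3, 300),
    (2, 51, 2, 150),
    (1, 302, 4, 100),
    (2, 346, 1, 50),
    (1, 377, 1, 200),
]

def build_sequences(rows):
    """Group movie IDs by user, ordered by rating timestamp."""
    by_user = defaultdict(list)
    for user_id, movie_id, _rating, timestamp in rows:
        by_user[user_id].append((timestamp, movie_id))
    # Sorting the (timestamp, movie_id) tuples orders each user's events in time
    return {
        user: [str(movie) for _, movie in sorted(events)]
        for user, events in by_user.items()
    }

sequences = build_sequences(rows)
print(sequences)
```

With pandas, a comparable effect could be obtained by calling `df.sort_values("timestamp")` before the `groupby`.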

In [7]:
meta_url = "https://files.grouplens.org/datasets/movielens/ml-100k/u.item"
meta_raw = request.urlopen(meta_url).read().decode("latin-1").split("\n")

metadata = {}

for row in meta_raw:
    parts = row.split("|")
    if len(parts) > 2:
        movie_id = parts[0]
        title = parts[1]
        metadata[movie_id] = title

print("Movietitle:", metadata["1"])

Movietitle: Toy Story (1995)


Based on the official [Gensim Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html), here is a description of each parameter passed to `Word2Vec` in the next code snippet:

* **`sentences` (here, `movielists`):** The input data. It must be an iterable of lists of tokens (in our case, the movie IDs rated by each user, stored as strings).
* **`vector_size=32`:** The dimensionality of the embedding vectors. This is the number of features in the hidden layer of the neural network used to represent each item.
* **`window=20`:** The maximum distance between the current and predicted word within a sentence. A larger window captures more global context.
* **`negative=50`:** Specifies how many "noise words" are drawn for **negative sampling**. The Gensim documentation gives 5 to 20 as the usual range, and the original word2vec paper suggests 5 to 20 for small datasets and 2 to 5 for large ones. Here it is set higher (50), so each training example is contrasted with more noise samples, at the cost of slower training.
* **`min_count=1`:** The model ignores all words with a total frequency lower than this. Setting it to 1 ensures every movie that appears in the lists is included in the vocabulary.
* **`workers=4`:** The number of worker threads used to train the model, allowing for multicore parallelization to speed up training.
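To illustrate where the "noise words" come from, the sketch below builds the sampling distribution commonly used for negative sampling: item counts raised to the power 0.75 and then normalized (0.75 is also the default of Gensim's `ns_exponent` parameter). The function name `noise_distribution` and the toy lists are hypothetical; this is not Gensim's internal code.

```python
from collections import Counter

def noise_distribution(sequences, exponent=0.75):
    """Probability of drawing each item as a negative sample."""
    counts = Counter(item for seq in sequences for item in seq)
    # Raising counts to a power below 1 flattens the distribution
    weights = {item: count ** exponent for item, count in counts.items()}
    total = sum(weights.values())
    return {item: weight / total for item, weight in weights.items()}

# Hypothetical lists of item IDs
toy_lists = [["50", "181", "50"], ["50", "1"], ["50", "181"]]
probs = noise_distribution(toy_lists)
for item, p in sorted(probs.items(), key=lambda pair: -pair[1]):
    print(item, round(p, 3))
```

Compared with raw frequencies, popular items are drawn somewhat less often and rare items somewhat more often.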

In [8]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [9]:
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    movielists,
    vector_size=32,
    window=20,
    negative=50,
    min_count=1,
    workers=4
)

In [10]:
# song_id = 2172

# # Ask the model for songs similar to song #2172
# model.wv.most_similar(positive=str(song_id))

movie_id = "50"  # Star Wars (1977)

print("Most similar movies to ID=50 (Star Wars):")
print(model.wv.most_similar(positive=[movie_id]))

Most similar movies to ID=50 (Star Wars):
[('473', 0.9755904674530029), ('181', 0.9755661487579346), ('290', 0.9741015434265137), ('369', 0.9723175764083862), ('3', 0.9715413451194763), ('109', 0.9715346097946167), ('108', 0.9710854887962341), ('841', 0.970744252204895), ('1061', 0.9705591797828674), ('620', 0.9697421193122864)]
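The scores returned by `most_similar` are cosine similarities between embedding vectors. As a minimal sketch of that ranking, here it is in plain Python on hypothetical 2-dimensional vectors (the real model uses 32 dimensions); `rank_by_cosine` is an illustrative helper, not a Gensim function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query, vectors, topn=2):
    """Rank every other item by cosine similarity to `query`."""
    scores = [
        (key, cosine_similarity(vectors[query], vec))
        for key, vec in vectors.items()
        if key != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical embeddings
toy_vectors = {"A": [1.0, 0.0], "B": [0.8, 0.6], "C": [0.0, 1.0]}
ranking = rank_by_cosine("A", toy_vectors)
print(ranking)
```

When every score is very close to 1, as in the output above, the ranking carries little signal and the top results should be read with caution.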


In [14]:
def print_movie_recommendations(movie_id, topn=5):
    # most_similar returns (movie_id, cosine_similarity) pairs
    similar = model.wv.most_similar(positive=[movie_id], topn=topn)

    print(f"\nRecommendations for movie {movie_id}:")
    for mid, _score in similar:
        # Look up the title; fall back to "Unknown" if the ID has no metadata
        print(mid, " : ", metadata.get(mid, "Unknown"))

print_movie_recommendations("50")


Recommendations for movie 50:
473  :  James and the Giant Peach (1996)
181  :  Return of the Jedi (1983)
290  :  Fierce Creatures (1997)
369  :  Black Sheep (1996)
3  :  Four Rooms (1995)


In [15]:
print_movie_recommendations("100")


Recommendations for movie 100:
9  :  Dead Man Walking (1995)
475  :  Trainspotting (1996)
276  :  Leaving Las Vegas (1995)
150  :  Swingers (1996)
13  :  Mighty Aphrodite (1995)


In [16]:
print_movie_recommendations("181")


Recommendations for movie 181:
596  :  Hunchback of Notre Dame, The (1996)
274  :  Sabrina (1995)
455  :  Jackie Chan's First Strike (1996)
473  :  James and the Giant Peach (1996)
756  :  Father of the Bride Part II (1995)
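Recommendations can also start from several seed items at once. With Gensim this is done by passing more IDs in the list, for example `model.wv.most_similar(positive=["50", "181"])`. The sketch below shows roughly what such a query does, under the assumption that the seed vectors are unit-normalized and averaged; the helper names and the toy vectors are hypothetical.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def recommend(seeds, vectors, topn=2):
    """Rank non-seed items by cosine similarity to the mean seed vector."""
    normalized = {key: unit(vec) for key, vec in vectors.items()}
    dim = len(next(iter(normalized.values())))
    mean = [sum(normalized[s][d] for s in seeds) / len(seeds) for d in range(dim)]
    query = unit(mean)
    scores = [
        (key, sum(q * x for q, x in zip(query, vec)))
        for key, vec in normalized.items()
        if key not in seeds  # never recommend the seeds themselves
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical embeddings
toy_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0], "D": [-1.0, 0.0]}
recs = recommend(["A", "B"], toy_vectors)
print(recs)
```

Item `C` points in the same direction as the average of `A` and `B`, so it ranks first.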
