# Recommending songs by embeddings

**NOTE:** This notebook is based on the tutorial in Chapter 2 of *[Hands-On Large Language Models](https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/)* by [Jay Alammar](https://www.linkedin.com/in/jalammar/) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

The idea here is that we have a collection of song playlists like these...

- Rosanna * Billie Jean * Let's Go Crazy * etc.
- Fade to Black * Between the Lines * One * etc.

...and the word embedding model will learn similar embeddings for songs that tend to appear near each other across many playlists. We can then use those similarities to generate a new playlist from a single song.

In [None]:
# !pip install gensim # we use gensim to train a word2vec model

In [None]:
## Import modules we'll need
import urllib.request
from gensim.models import word2vec # We will train a word2vec model with playlist data
import pandas as pd # we'll use pandas to format data

In [None]:
## Read in a tab-delimited file that contains song id numbers
## paired with song names and artists.
# id_to_title = pd.read_csv("song_hash.txt", sep="\t", 
#                           header=None, 
#                           names=["id", "title", "artist"])
id_to_title = pd.read_csv("https://raw.githubusercontent.com/StatQuest/embeddings_for_recommendations/main/song_hash.txt", 
                          sep="\t", 
                          header=None, 
                          names=["id", "title", "artist"])
id_to_title.head() # print out the first few rows

----

# Import the playlist data

In [None]:
## NOTE: The data files were originally created by Shuo Chen (shuochen@cs.cornell.edu) 
##       in the Dept. of Computer Science, Cornell University.
## I downloaded them from here: https://www.cs.cornell.edu/~shuochen/lme/data_page.html
##
## open() opens the file...
## read() reads it in...
## split('\n') splits it into lines, one playlist per line
## [2:] skips the first two lines of metadata
# data = open("train.txt", "r").read().split('\n')[2:]

data = urllib.request.urlopen('https://raw.githubusercontent.com/StatQuest/embeddings_for_recommendations/main/train.txt')
data = data.read().decode("utf-8").split('\n')[2:]

In [None]:
## Remove playlists with only one song
playlists = [s.rstrip().split() for s in data if len(s.split()) > 1]
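
To make the parsing and filtering steps above concrete, here is a minimal sketch that runs the same operations on a tiny made-up string instead of the real `train.txt`. The contents of `raw_text` and the name `toy_playlists` are assumptions for illustration only. The real file is much larger, but the source comments describe it the same way: metadata lines first, then playlists.

```python
# Toy stand-in for train.txt: two metadata lines, then one playlist per line.
# The contents are made up for illustration.
raw_text = "metadata line 1\nmetadata line 2\n0 1 2 \n3 \n4 5 \n"

# Split the text into lines and skip the two metadata lines
lines = raw_text.split('\n')[2:]

# Keep only playlists with more than one song. split() with no arguments
# ignores trailing whitespace, and the final empty line is dropped because
# it contains zero songs.
toy_playlists = [line.split() for line in lines if len(line.split()) > 1]

print(toy_playlists)
```

Note that each song id stays a string after `split()`, which is why the model is later queried with `str(song_id)`.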

In [None]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

In [None]:
## Train a word embedding model with our playlists
## NOTE: By default Word2Vec uses the "CBOW" (continuous bag of words) method for 
##       training. CBOW uses surrounding words to predict a word in the middle.
##       For example, if the training set was "Troll2 is great", then
##       CBOW would use "Troll2" and "great" to predict "is".
## vector_size : dimensionality of the word vectors.
## negative : If > 0, negative sampling will be used, 
##            and specifies how many “noise words” should be drawn (usually between 5-20).
## min_count : Ignores all words with total frequency lower than this.
## workers : Use these many worker threads to train the model
## NOTE: The value I selected for the arguments allowed for relatively fast training and 
##       worked well enough.
model = word2vec.Word2Vec(playlists, vector_size=32, negative=10, min_count=1, workers=4)
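
The CBOW idea described in the comments above can be sketched in a few lines of plain Python. This is only an illustration of how (context, target) training pairs could be formed with a fixed window; the function name `cbow_pairs` and the `window` value are hypothetical, and gensim's actual training procedure differs in its details.

```python
def cbow_pairs(tokens, window=1):
    """Return (context, target) pairs using a fixed window on each side.

    A simplified illustration of CBOW, not gensim's implementation.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` tokens before
        right = tokens[i + 1:i + 1 + window]      # up to `window` tokens after
        pairs.append((left + right, target))
    return pairs

# The example from the comments: "Troll2" and "great" are used to predict "is"
for context, target in cbow_pairs(["Troll2", "is", "great"], window=1):
    print(context, "->", target)
```

With playlists, the "words" are song ids, so songs that keep showing up in each other's context windows end up with similar embeddings.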

In [None]:
song_id = 3822 # Billie Jean - Michael Jackson
# song_id = 2172 # Fade To Black - Metallica
# song_id = 842 # California Love - 2Pac

In [None]:
id_to_title.iloc[song_id]

In [None]:
## find the most similar songs
new_playlist = pd.DataFrame(model.wv.most_similar(positive=str(song_id)),
                            columns=["id", "sim"])  

In [None]:
new_playlist
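
The `sim` column comes from cosine similarity between embedding vectors. As a rough sketch of what that ranking involves, the example below scores made-up 2-dimensional vectors in plain Python. The names `cosine_similarity`, `most_similar_sketch`, and `toy_vectors` are hypothetical, and the real model uses the 32-dimensional vectors it learned during training.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2-dimensional "embeddings" keyed by made-up song ids
toy_vectors = {
    "1": [1.0, 0.0],
    "2": [0.9, 0.1],
    "3": [0.0, 1.0],
}

def most_similar_sketch(query_id, vectors, topn=2):
    """Rank every other song by cosine similarity to the query song."""
    query = vectors[query_id]
    scores = [(song_id, cosine_similarity(query, vec))
              for song_id, vec in vectors.items() if song_id != query_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar_sketch("1", toy_vectors))
```

Song "2" points in nearly the same direction as song "1", so it ranks first, while song "3" is perpendicular to it and scores 0.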

In [None]:
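## Print out the song names and artists for the new playlist
## NOTE: most_similar() returns the song ids as strings, so we convert
##       them to integers before using them as row positions with iloc.
id_to_title.iloc[new_playlist["id"].astype(int)]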

# Bam!!!