
# [DATA-04E] Example - yes.com playlists data

## Introduction

This example uses playlist data to illustrate the use of **vector embeddings** in **recommender systems**. The data were collected by Shuo Chen from the Department of Computer Science, Cornell University. The playlists were crawled from Yes.com, a website providing radio playlists from hundreds of radio stations in the United States. At that time, through the web-based API http://api.yes.com, one could retrieve the playlist record of a specified station for the last 7 days.

Chen collected as many playlists as possible by searching for all the possible genres at all the possible stations. The data collection lasted from December 2010 to May 2011. This led to a data set of 11,137 playlists, containing 75,262 different songs and 2,840,553 transitions.

## The data set

The playlists come in the text file `playlists.txt`, with one playlist per line. Each playlist is a sequence of numbers, which are the song IDs (from 0 to the total number of songs minus one), separated by white space. In this file, each line ends with a space.

A second file, `songs.txt`, contains data on the songs. Each line corresponds to one song, and has the format `Integer_ID \t Title \t Artist \n` (`\t` is the tab character in Python). When data were missing, a dash was placed instead.

Sources:

1. S Chen, JL Moore, D Turnbull & T Joachims (2012), *Playlist Prediction via Metric Embedding*, ACM Conference on Knowledge Discovery and Data Mining (KDD).

2. JL Moore, S Chen, T Joachims & D Turnbull (2012), *Learning to Embed Songs and Tags for Playlists Prediction*, International Society for Music Information Retrieval (ISMIR).

## Importing the playlists data

Since the playlists data set is not tabular, we do not import it to a Pandas data frame. Instead, we use the package Requests to import it as a string. First, we import the package.

In [None]:
import requests

The file `playlists.txt` has the same path as other data sources used in this course.

In [None]:
path = 'https://raw.githubusercontent.com/MCanela-1954/Data/main/'

We apply the Requests function `get()` to the URL that results from appending the file name to this path. If the request succeeds, the attribute `.text` of the returned object is a string with the content of the text file.

In [None]:
playlists = requests.get(path + 'playlists.txt').text

We **split** this string into a list of strings, one for each line. This can be done with the method `.split()`, using the newline character `\n` as the **separator**. Note that, if the text ends with a newline, the last element of the list is an empty string.

In [None]:
playlists = playlists.split('\n')

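We now split every string in this list into a list of strings, this time using the default separator, which is white space. With the default separator, leading and trailing spaces are ignored, so the space at the end of each line does no harm.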

In [None]:
playlists = [p.split() for p in playlists]
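
The behavior of the two splits can be checked on a toy string. The following is a minimal sketch: the string `raw` is a made-up stand-in for the file content, and the assumption that the text ends with a newline may or may not hold for the real file. If it does, the last "playlist" is an empty list, which can be filtered out.

```python
# Toy stand-in for the content of playlists.txt (not the real data).
# Each line ends with a space, and the text ends with a newline (assumption).
raw = "0 1 2 \n3 4 \n"

# Splitting on the newline character leaves an empty string at the end
lines = raw.split("\n")
print(lines)

# The default separator ignores the trailing space of each line
parsed = [line.split() for line in lines]
print(parsed)

# Drop the empty playlists, if any
toy_playlists = [p for p in parsed if p]
print(toy_playlists)
```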

Now, `playlists` is a list of lists, each one containing the IDs of the songs of the corresponding playlist. Since the IDs have been read from a text file, they are strings, not integers. Let us check the number of playlists.

In [None]:
len(playlists)
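
The same kind of check can be extended to the other figures quoted in the introduction. The sketch below uses a toy list of playlists and counts a transition as a pair of consecutive songs within a playlist. That definition is our assumption, so the counts obtained on the real data may not match the published figures exactly.

```python
# Toy playlists (made-up IDs, stored as strings as in the parsed data)
toy_playlists = [['0', '1', '2'], ['2', '3'], ['1']]

# Number of playlists
n_playlists = len(toy_playlists)

# Number of different songs across all playlists
n_songs = len({song for p in toy_playlists for song in p})

# A playlist with k songs contributes k - 1 transitions (assumed definition)
n_transitions = sum(len(p) - 1 for p in toy_playlists if p)

print(n_playlists, n_songs, n_transitions)
```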

We can also take a look at a couple of these playlists.

In [None]:
print(playlists[0])

In [None]:
print(playlists[-1])

## Importing the songs data

Since the songs data set comes as a table, we can use the Pandas function `read_csv()` to import the data to a data frame. Note that we specify the tab as the separator here, since this is not a true "comma" separated file. We also have to specify the column names, because the source file has no **header**.

In [None]:
import pandas as pd
songs = pd.read_csv(path + 'songs.txt', sep='\t', header=None)
songs.columns = ['id', 'title', 'artist']
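
The role of these arguments can be seen on a small in-memory example. This is only a sketch: the text `toy_text` is made up, and treating the dash as a missing value with `na_values` is an optional choice that the import above does not make.

```python
import io
import pandas as pd

# Toy stand-in for songs.txt: ID, title and artist separated by tabs, no header
toy_text = "0\tSong A\tArtist X\n1\tSong B\t-\n2\t-\tArtist Z\n"

toy_songs = pd.read_csv(
    io.StringIO(toy_text),
    sep="\t",                         # tab-separated, not comma-separated
    header=None,                      # the first line is data, not column names
    names=["id", "title", "artist"],  # column names supplied by us
    na_values="-",                    # optional: read the dash as a missing value
)
print(toy_songs)
```

Passing `names` directly to `read_csv()` is equivalent to assigning `songs.columns` after the import.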

Let us check the contents of this data frame.

In [None]:
songs.info()

In [None]:
songs.head()

## Encoding the playlists

To be able to measure the similarity between songs, we use an **embedding representation**. We use **Word2Vec** to create the embedding vectors from the playlists. Word2Vec comes in the Python package `gensim`.

This package may not be preinstalled in Colab notebooks, in which case we have to install it with `pip`. The **exclamation mark** tells Colaboratory that this is a command to be executed in the shell (Colab notebooks run in a Linux virtual machine).

In [None]:
!pip install gensim

*Note*. On your own computer, you install a package only once, but in Colaboratory it is removed when the notebook is disconnected. So, in practice, the package is reinstalled every time you start a new session.

Now we can import the class `Word2Vec` from the subpackage `gensim.models`.

In [None]:
from gensim.models import Word2Vec

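With this class, we can train a Word2Vec model on the playlists data, treating each playlist as a sentence and each song ID as a word. We use here **embedding dimension** 32, but you can try other choices. Longer vectors take more time and memory to train, so don't make them too long.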

In [None]:
model = Word2Vec(playlists, vector_size=32)
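
One default worth keeping in mind is `min_count=5`: Gensim ignores the tokens whose total frequency is below that value, so songs that appear fewer than 5 times in the playlists get no vector. The following sketch uses only the standard library and toy data to show the effect. It is not Gensim's own code, and the threshold 3 is a hypothetical value chosen to fit the toy example.

```python
from collections import Counter

# Toy playlists (made-up IDs, stored as strings)
toy_playlists = [
    ['0', '1', '2'],
    ['0', '1', '3'],
    ['0', '1'],
    ['0', '2'],
    ['0', '1', '2'],
]
min_count = 3  # hypothetical threshold for this toy example (Gensim's default is 5)

# Count how many times each song appears across all playlists
counts = Counter(song for p in toy_playlists for song in p)

# Songs below the threshold would be left out of the vocabulary
kept = sorted(song for song, n in counts.items() if n >= min_count)
dropped = sorted(song for song, n in counts.items() if n < min_count)
print(kept, dropped)
```

Asking the trained model for the vector of a dropped song raises a `KeyError`, so it can be useful to run a count like this before choosing a song to start from.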

The call `Word2Vec()` returns a model object, which stores the training parameters, the vocabulary and the embedding vectors. The vectors are packed in the attribute `model.wv`.

In [None]:
type(model.wv)

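Though this is a Gensim object of a specific type, the vectors can be extracted by indexing. Note that an integer index such as `0` refers to a position in the model's vocabulary, not to the song with ID 0. To get the vector of a specific song, use its ID as a string, as in `model.wv['12']`.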

In [None]:
model.wv[0]

## Recommendation

For this example, we select the song *Who's That Chick*, by *Rihanna*.

In [None]:
print(songs.iloc[12])

The recommended songs will be those most similar to the song with ID 12. We get them with the method `.most_similar()`. Since the song IDs were read from the text file as strings, the key is `'12'`, not `12`. The parameter `positive` specifies the items that the results should be similar to, while `negative` specifies items that the results should be dissimilar to. The default number of results is 10 (`topn=10`), so we restrict the search to the top 5 here.

Note that the vectors created by Gensim Word2Vec are not normalized, but we don't have to care, because `.most_similar()` uses the **cosine similarity**, which depends only on the angle between two vectors, not on their lengths.
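
To see why the length of the vectors does not matter, here is a minimal sketch of the cosine similarity in plain Python. The function `cosine_similarity()` and the vectors `u` and `v` are made up for illustration and are not part of Gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 3.0]
v = [2.0, 0.5, 1.0]

# Similarity of the original vectors
sim = cosine_similarity(u, v)

# Rescaling either vector by a positive factor leaves the similarity unchanged
sim_rescaled = cosine_similarity([10 * a for a in u], [0.1 * b for b in v])

print(sim, sim_rescaled)
```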

In [None]:
recs = model.wv.most_similar(positive='12', topn=5)
recs

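The top-5 recommendation is returned by Gensim as a list of pairs, each pair containing a song ID and the corresponding similarity. These pairs are called **tuples** in Python. We extract the song IDs as an ordinary list, converting them from strings to integers so that they can be used for positional indexing in the songs data frame:

In [None]:
reclist = [int(rec[0]) for rec in recs]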
reclist
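
Tuple unpacking gives a convenient way to use both elements of each pair. In the sketch below, `toy_recs` imitates the structure of the output of `.most_similar()` and `toy_titles` is a hypothetical lookup table. All the IDs, titles and similarities are made up. Since the playlists were parsed as strings, the keys are strings, and they have to be converted with `int()` before being used as integer positions.

```python
# Made-up (key, similarity) pairs, imitating the output of most_similar()
toy_recs = [('68', 0.97), ('4', 0.95), ('31', 0.93)]

# Hypothetical lookup table from song ID to title
toy_titles = {68: 'Song A', 4: 'Song B', 31: 'Song C'}

# Unpack each tuple into its two elements
for key, score in toy_recs:
    song_id = int(key)  # keys are strings, IDs in the table are integers
    print(f"{song_id:>3}  {toy_titles[song_id]:<8}  {score:.2f}")

# Integer IDs only, discarding the similarities
rec_ids = [int(key) for key, _ in toy_recs]
print(rec_ids)
```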

We can go back to the songs data to see which songs these are. At the top of the list, we get the song with ID 68, which is *Boom Boom Pow*, by *Black Eyed Peas*. We hope you agree with that.

In [None]:
songs.iloc[reclist]

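*Note*. If you replicate this example, you will find that Gensim Word2Vec does not create exactly the same vectors every time you train the model, so your recommendations may differ from ours. According to the Gensim documentation, a fully reproducible run requires a fixed `seed` and a single worker thread (`workers=1`), to eliminate "ordering jitter from OS thread scheduling". Reproducibility between interpreter launches also requires setting the environment variable `PYTHONHASHSEED`.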