# <span style="color:red">Deep Learning Project: Music Playlist Generation based on Spotify Playlists</span>

This project was carried out by Alexandre Felix and Jérémy Houdé for the Deep Learning course in the summer semester (SoSe) 2023. The goal was to generate playlists automatically, based on the [Spotify Million Playlist Dataset](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge). Given one or more songs, the method should append new, relevant songs to the initial playlist.

## <span style="color:green">Spotify Million Playlist Dataset</span>

The dataset is distributed as a ZIP file that contains multiple JSON slices. Each JSON file holds 1,000 playlists and is named as follows:
```
mpd.slice.{start_playlist_ID}-{end_playlist_ID}.json
```
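
The naming scheme matters when the slices have to be read in playlist order, because a plain alphabetical sort puts `10000-10999` before `2000-2999`. A minimal sketch of a numeric sort key is shown here. The helper `slice_start_id` and the `data/` prefix in the example names are assumptions made for illustration and are not part of the notebook.

```python
import re

# Hypothetical helper: extract the first playlist ID from a slice name
# such as "mpd.slice.1000-1999.json".
SLICE_PATTERN = re.compile(r"mpd\.slice\.(\d+)-(\d+)\.json$")


def slice_start_id(filename: str) -> int:
    """Return the start playlist ID encoded in a slice file name."""
    match = SLICE_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"not a slice file: {filename}")
    return int(match.group(1))


# Example names; the "data/" folder inside the archive is an assumption.
names = [
    "data/mpd.slice.10000-10999.json",
    "data/mpd.slice.0-999.json",
    "data/mpd.slice.2000-2999.json",
]

# Sort numerically by start ID instead of alphabetically.
ordered = sorted(names, key=slice_start_id)
print(ordered)
```

Sorting with this key keeps "the first n files" equivalent to "the first n × 1,000 playlists".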

Each JSON slice has the following structure:
```json
{
    "info": { ... },
    "playlists": [
        {
            "name": "musical",
            "collaborative": "false",
            "pid": 5,
            "modified_at": 1493424000,
            "num_albums": 7,
            "num_tracks": 12,
            "num_followers": 1,
            "num_edits": 2,
            "duration_ms": 2657366,
            "num_artists": 6,
            "tracks": [
                {
                    "pos": 0,
                    "artist_name": "Degiheugi",
                    "track_uri": "spotify:track:7vqa3sDmtEaVJ2gcvxtRID",
                    "artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
                    "track_name": "Finalement",
                    "album_uri": "spotify:album:2KrRMJ9z7Xjoz1Az4O6UML",
                    "duration_ms": 166264,
                    "album_name": "Dancing Chords and Fireflies"
                }
            ]
        }
    ]
}
```

We are interested in the tracks stored in each playlist and in the metadata of each track.

### Load data

The unzipped data takes up more than 40 GB. Because of RAM and VRAM limitations, we only use a subset for our experiments: the first 20,000 playlists, contained in the first 20 files. We first tried 200 and 150 files, but both needed too much RAM on our computers.
When extracting the data, we are only interested in the `playlists` array.
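
Since memory is the limiting factor, one option is to trim each playlist while it is being read, so that unused fields never accumulate. The following is only a sketch under assumptions: the generator `iter_trimmed_playlists`, the constant `KEPT_TRACK_KEYS` and the tiny in-memory archive with made-up values are illustrative and not part of the notebook. Peak memory is still at least one full slice, because each JSON file is parsed as a whole.

```python
import io
import json
from typing import Iterator
from zipfile import ZipFile

# Track fields kept for the experiments; everything else is dropped early.
KEPT_TRACK_KEYS = ("track_uri", "artist_uri", "album_uri", "duration_ms")


def iter_trimmed_playlists(zip_source, slice_names: list[str]) -> Iterator[dict]:
    """Yield one trimmed playlist at a time from the given slices."""
    with ZipFile(zip_source) as archive:
        for name in slice_names:
            with archive.open(name) as handle:
                current_slice = json.load(handle)  # one slice in memory at a time
            for playlist in current_slice["playlists"]:
                yield {
                    "num_tracks": playlist["num_tracks"],
                    "duration_ms": playlist["duration_ms"],
                    "tracks": [
                        {key: track[key] for key in KEPT_TRACK_KEYS}
                        for track in playlist["tracks"]
                    ],
                }


# Tiny in-memory archive with made-up values, only to exercise the generator.
demo_slice = {
    "info": {},
    "playlists": [{
        "name": "demo", "pid": 0, "num_tracks": 1, "duration_ms": 1000,
        "tracks": [{
            "pos": 0, "artist_name": "Artist", "track_name": "Track",
            "album_name": "Album", "track_uri": "spotify:track:demo",
            "artist_uri": "spotify:artist:demo",
            "album_uri": "spotify:album:demo", "duration_ms": 1000,
        }],
    }],
}
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("data/mpd.slice.0-999.json", json.dumps(demo_slice))
buffer.seek(0)

trimmed = list(iter_trimmed_playlists(buffer, ["data/mpd.slice.0-999.json"]))
print(trimmed)
```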

In [1]:
import json
import fnmatch
from tqdm import tqdm # progression bar
from zipfile import ZipFile

def load_zip_data(zip_file: str, num_slices: int) -> list[dict]:
    with ZipFile(zip_file) as zipfiles: # open ZIP file
        file_list = zipfiles.namelist()

        json_files = fnmatch.filter(file_list, "*.json")
        # sort the slices numerically by their first playlist ID ("mpd.slice.{start}-{end}.json")
        json_files.sort(key=lambda filename: int(filename.split('.')[2].split('-')[0]))
        playlists: list[dict] = []

        for filename in tqdm(json_files[:num_slices]): # for each json file
            with zipfiles.open(filename) as json_file:
                current_slice = json.loads(json_file.read())
                playlists.extend(current_slice['playlists']) # add new playlists

        return playlists

We now load the first 20,000 playlists into a pandas DataFrame.


We keep the following data:
```json
{
    "playlists": [
        {
            "num_tracks": 12,
            "duration_ms": 2657366,
            "tracks": [
                {
                    "track_uri": "spotify:track:7vqa3sDmtEaVJ2gcvxtRID",
                    "artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
                    "album_uri": "spotify:album:2KrRMJ9z7Xjoz1Az4O6UML",
                    "duration_ms": 166264
                }
            ]
        }
    ]
}
```
- We consider the artist and album URIs and the song duration to be important data for each song.
- Unfortunately, the genre is not stored in the original dataset.

This project requires the following folders:
- `./data`
- `./models`

Please place `spotify_million_playlist_dataset.zip` in the `./data` folder, or adapt the paths in the cells.

In [2]:
import pandas as pd

zip_file = 'data/spotify_million_playlist_dataset.zip'
playlists_dict = load_zip_data(zip_file, 20)

playlists = pd.DataFrame(playlists_dict, columns = ["tracks", "num_tracks", "duration_ms"])

100%|██████████| 20/20 [00:07<00:00,  2.68it/s]


We now remove the unnecessary data from each track. We keep only the URIs as IDs, and we assume that the duration is an important factor in the design of a playlist.

In [3]:
# remove the track fields we do not need
for tracks in playlists['tracks']:
    for track in tracks:
        for key in ('pos', 'artist_name', 'track_name', 'album_name'):
            track.pop(key, None)  # pop() with a default keeps the cell safe to re-run

### First 20,000 playlists

In [4]:
playlists

Unnamed: 0,tracks,num_tracks,duration_ms
0,[{'track_uri': 'spotify:track:0UaMYEvWZi0ZqiDO...,52,11532414
1,[{'track_uri': 'spotify:track:2HHtWyy5CgaQbC7X...,39,11656470
2,[{'track_uri': 'spotify:track:74tqql9zP6JjF5hj...,64,14039958
3,[{'track_uri': 'spotify:track:4WJ7UMD4i6DOPzyX...,126,28926058
4,[{'track_uri': 'spotify:track:4iCGSi1RonREsPtf...,17,4335282
...,...,...,...
19995,[{'track_uri': 'spotify:track:64j3Bd62HTe0pclk...,18,4614171
19996,[{'track_uri': 'spotify:track:09OojFvtrM9YRzRj...,24,4675554
19997,[{'track_uri': 'spotify:track:4lNznSUjByH5zWpP...,106,28912970
19998,[{'track_uri': 'spotify:track:1yy2DlSDtEt90d54...,36,9374114


### Unique tracks

We store all tracks in a separate DataFrame. This will be useful later for the track embeddings and for the random baseline method `get_random_next_song`.
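
To make the deduplication step concrete, here is a pandas-free sketch of the same idea on toy data: flatten the tracks of all playlists and keep the first copy of every identical record. The function `unique_tracks` and the toy URIs are made up for illustration. As with `drop_duplicates` without a `subset`, two records only count as duplicates when all of their fields are equal.

```python
def unique_tracks(playlists: list[dict]) -> list[dict]:
    """Flatten all playlist tracks and keep the first copy of each identical record."""
    seen: dict[tuple, dict] = {}
    for playlist in playlists:
        for track in playlist["tracks"]:
            # A record is identified by all of its (field, value) pairs.
            key = tuple(sorted(track.items()))
            seen.setdefault(key, track)
    return list(seen.values())  # dicts preserve insertion order


# Toy data with made-up URIs: track "t1" appears in both playlists.
toy_playlists = [
    {"tracks": [
        {"track_uri": "t1", "artist_uri": "a1", "album_uri": "al1", "duration_ms": 1000},
        {"track_uri": "t2", "artist_uri": "a1", "album_uri": "al2", "duration_ms": 2000},
    ]},
    {"tracks": [
        {"track_uri": "t1", "artist_uri": "a1", "album_uri": "al1", "duration_ms": 1000},
    ]},
]

toy_tracks = unique_tracks(toy_playlists)
print([track["track_uri"] for track in toy_tracks])
```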

In [5]:
tracks = pd.json_normalize(playlists_dict, record_path=['tracks'])
tracks.drop_duplicates(inplace=True, ignore_index=True)

# free RAM
del playlists_dict

tracks

Unnamed: 0,track_uri,artist_uri,album_uri,duration_ms
0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863
1,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800
2,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933
3,spotify:track:1AWQoqb9bSvzTjaLralEkT,spotify:artist:31TPClRtHm23RisEBtV3X7,spotify:album:6QPkyl04rXwTGlGlcYaRoW,267266
4,spotify:track:1lzr43nnXAijIGYnCT8M8H,spotify:artist:5EvFsr3kj42KNv97ZEnqij,spotify:album:6NmFmPX56pcLBOFMhIiKvF,227600
...,...,...,...,...
263464,spotify:track:2MQ9NWMZfi0qPUhDR6sRCL,spotify:artist:278ZYwGhdK6QTzE3MFePnP,spotify:album:0kIXzVzbFuUf5kxM8US67m,297053
263465,spotify:track:4eOptezifAi7VpOoz9lu4r,spotify:artist:2ye2Wgw4gimLv2eAKyk1NB,spotify:album:6VeUJmkLCGWRiF8j6RrIEx,301293
263466,spotify:track:2FvIkVNVEmiAaasafDSWSV,spotify:artist:6PWU6JQvvYv5sz5FOODHg6,spotify:album:2rkBQR8GIeP8XlEYrp6DsM,250506
263467,spotify:track:48ifRcXHbUjc1moUjJcwhx,spotify:artist:6UfoTQXaV3DuqtDVjZIxwZ,spotify:album:1p5T4GozRHLUxtaLN46sLz,319760


Check if duplicates are removed:

In [6]:
song = tracks.loc[tracks['track_uri'] == 'spotify:track:2jFlMILIQzs7lSFudG9lbo']
song

Unnamed: 0,track_uri,artist_uri,album_uri,duration_ms
40,spotify:track:2jFlMILIQzs7lSFudG9lbo,spotify:artist:6wPhSqRtPu1UhRCDX5yaDJ,spotify:album:0ylxpXE00fVxh6d60tevT8,229360


### Save or reload data

In [7]:
# exec on every opening
tracks_filename = 'data/tracks.json'
playlists_filename = 'data/playlists.json'

Save data as json files:

In [8]:
playlists.to_json(playlists_filename)
tracks.to_json(tracks_filename)

When needed, reload processed tracks and playlists from json files:

In [19]:
# exec on every opening
import pandas as pd

tracks = pd.read_json(tracks_filename)
playlists = pd.read_json(playlists_filename)

## <span style="color:green">Defining baseline methods</span>

Before looking at the architecture of the neural network, we first define baseline methods to compare against.
Each method takes k songs (tracks) with the selected data as input and returns a new song that could fit with them.

Here we test both methods on a few playlists and analyse the results. Later, we will evaluate both baseline methods against the neural network using k-grams, treating playlist generation as a binary problem.

### Random baseline

The random baseline picks one random song from the track list:

In [60]:
# exec on every opening

def get_random_next_song(song_uris: list[str] | None = None) -> str:
    """Return the URI of a randomly sampled track (`song_uris` is not used)."""
    return tracks['track_uri'].sample().iloc[0]

We execute the method with the first 3 tracks from the first playlist:

In [10]:
## URIs of the first 3 songs from playlist 1:
song_uris = [song['track_uri'] for song in playlists.iloc[0]['tracks'][:3]]
num_songs = 5

print(f'initial playlist: {song_uris}')
for i in range(num_songs):
    new_song = get_random_next_song()
    song_uris.append(new_song)

print(f'New playlist: {song_uris}')

initial playlist: ['spotify:track:0UaMYEvWZi0ZqiDOoHU3YI', 'spotify:track:6I9VzXrHxO9rA9A5euc8Ak', 'spotify:track:0WqIKmW4BTrj3eJFmnCKMv']
New playlist: ['spotify:track:0UaMYEvWZi0ZqiDOoHU3YI', 'spotify:track:6I9VzXrHxO9rA9A5euc8Ak', 'spotify:track:0WqIKmW4BTrj3eJFmnCKMv', 'spotify:track:1QpIoZbeUeTBoQzJtN7MM6', 'spotify:track:0ccllnXp7eTzhgPwotvDla', 'spotify:track:7aWqIsFi7vghmGS6fPgoyk', 'spotify:track:4gmtKUjYw3JuuujVUwReiN', 'spotify:track:4ES2gMcCQshE8EWU0FY8vW']


In [11]:
tracks[tracks['track_uri'].isin(song_uris)]

Unnamed: 0,track_uri,artist_uri,album_uri,duration_ms
0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863
1,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800
2,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933
239787,spotify:track:0ccllnXp7eTzhgPwotvDla,spotify:artist:1Bd4UVlqlaKEXYRG3wgrCK,spotify:album:3EI8YIXlPsYfyDXaAh3O7P,195106
598591,spotify:track:4ES2gMcCQshE8EWU0FY8vW,spotify:artist:3aVoqlJOYx31lH1gibGDt3,spotify:album:31EfLOgVsfPS4oZGBVlLRB,215466
660711,spotify:track:4gmtKUjYw3JuuujVUwReiN,spotify:artist:5AUTN6tMncnOnYgJK1VM6K,spotify:album:0El44GWelAsA0K93a2GRQC,202933
699854,spotify:track:7aWqIsFi7vghmGS6fPgoyk,spotify:artist:3iCEJiyLrmF5bHH6w1vIz6,spotify:album:5x58fNEURk7ZzfgWNdKRWx,295278
1141530,spotify:track:1QpIoZbeUeTBoQzJtN7MM6,spotify:artist:1iIxNEvPPmdFIIP0tdpw6G,spotify:album:7KOiQyPp08Sl9I1INZv51U,167497


In [12]:
playlist = [(index, song['track_uri']) for (index, song) in enumerate(playlists.iloc[0]['tracks'])]
list(filter(lambda item: item[1] in song_uris, playlist)) # songs in playlist 1 and in generated playlist

[(0, 'spotify:track:0UaMYEvWZi0ZqiDOoHU3YI'),
 (1, 'spotify:track:6I9VzXrHxO9rA9A5euc8Ak'),
 (2, 'spotify:track:0WqIKmW4BTrj3eJFmnCKMv')]

This method is fast, but we do not expect it to return a song that fits well. As in the example above, the sampled songs may never occur together in a playlist, and they do not necessarily have anything in common.
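
Because the baseline ignores its input, it can also return a song that is already in the playlist. The sketch below is a possible variant, not the method used in this notebook: it draws only from tracks that are not in the playlist yet and uses a seeded generator so that runs are reproducible. The function name, the toy catalogue and the seed are assumptions.

```python
import random


def random_next_song_without_repeats(all_uris, playlist, rng):
    """Pick a random track URI that is not in the playlist yet, or None if none is left."""
    already_used = set(playlist)
    candidates = [uri for uri in all_uris if uri not in already_used]
    return rng.choice(candidates) if candidates else None


rng = random.Random(0)              # fixed seed, chosen arbitrarily
catalogue = ["a", "b", "c", "d"]    # toy stand-in for tracks['track_uri']
playlist = ["a", "b"]

# Extend the playlist until the catalogue is exhausted.
while (new_song := random_next_song_without_repeats(catalogue, playlist, rng)) is not None:
    playlist.append(new_song)

print(playlist)
```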

### Get next song based on the previous k songs

This method scans every playlist and looks for the song that most often occurs after the k previous songs. The order matters: only songs placed after the input songs are returned. The order of the input songs matters as well, which is why the method may sometimes fail to find a next song.

This method is far slower than the first one, but we expect it to return only relevant songs, when they exist.
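
The counting rule can be illustrated on a toy example with single-letter track names. This is a simplified sketch of the idea described above, using plain lists instead of the DataFrame. The function `next_song_by_cooccurrence` and the toy playlists are made up for illustration.

```python
from collections import Counter


def next_song_by_cooccurrence(seed: list[str], playlists: list[list[str]]):
    """Return the track that most often follows the seed tracks (in seed order), or None."""
    counts: Counter = Counter()
    for playlist in playlists:
        if not all(song in playlist for song in seed):
            continue  # the playlist must contain every seed track
        matched = 0   # number of seed tracks matched so far, in order
        for track in playlist:
            if matched < len(seed) and track == seed[matched]:
                matched += 1
                continue
            if matched < len(seed) or track in seed:
                continue  # still before the end of the seed, or a seed track itself
            counts[track] += 1  # the track occurs after the complete seed
    return counts.most_common(1)[0][0] if counts else None


toy_playlists = [
    ["a", "b", "c", "d"],
    ["a", "b", "d", "e"],
    ["b", "a", "c"],           # wrong order: "b" before "a", nothing is counted
    ["x", "a", "y", "b", "d"], # gaps between seed tracks are allowed
]

best = next_song_by_cooccurrence(["a", "b"], toy_playlists)
missing = next_song_by_cooccurrence(["c", "a"], toy_playlists)
print(best, missing)
```

In this example `d` follows the seed `a, b` in three playlists, while the seed `c, a` never occurs in that order, so no next song is found.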

In [44]:
# exec on every opening
import pandas as pd

def get_next_song(song_uris: list[str]) -> str | None:
    """Return the song that most often follows the given songs (in the same order) across all playlists."""
    song_occurrences: dict[str, int] = {}

    for tracks in playlists['tracks']:
        track_uris = [track['track_uri'] for track in tracks]

        if not all(song in track_uris for song in song_uris): # skip playlists that do not contain every input song
            continue

        songs_count = 0 # number of input songs matched so far, in order

        for track_uri in track_uris:
            if (songs_count < len(song_uris) and
                track_uri == song_uris[songs_count]):
                songs_count += 1
                continue
            if songs_count < len(song_uris) or track_uri in song_uris:
                continue

            # current track_uri is not in song_uris and occurs after the k previous songs
            song_occurrences[track_uri] = song_occurrences.get(track_uri, 0) + 1

    if not song_occurrences: # no next song could be found
        return None

    return max(song_occurrences, key=song_occurrences.get) # most frequent follower (first one wins on ties)

# helper to count the occurrences of a particular song in the playlists
def song_occurrence(song_uri: str, playlists: pd.DataFrame) -> int:
    """Return the number of playlists that contain the given song."""
    count = 0

    for tracks in playlists['tracks']:
        track_uris = [track['track_uri'] for track in tracks]

        if song_uri in track_uris: # the song occurs in this playlist
            count += 1

    return count

#### Test with the first 3 songs of the first two playlists:

We now test our method with the first 3 songs of the first two playlists. Afterwards, we check whether the returned songs are from the same album or artist, and whether they appear in that order in the complete playlist.

In [45]:
## URIs of the first 3 songs from playlist 1:
song_uris = [song['track_uri'] for song in playlists.iloc[0]['tracks'][:3]]
num_songs = 5

print(f'initial playlist: {song_uris}')
for i in range(num_songs):
    new_song = get_next_song(song_uris)
    
    if new_song:
        song_uris.append(new_song)

print(f'New playlist: {song_uris}')

initial playlist: ['spotify:track:0UaMYEvWZi0ZqiDOoHU3YI', 'spotify:track:6I9VzXrHxO9rA9A5euc8Ak', 'spotify:track:0WqIKmW4BTrj3eJFmnCKMv']
New playlist: ['spotify:track:0UaMYEvWZi0ZqiDOoHU3YI', 'spotify:track:6I9VzXrHxO9rA9A5euc8Ak', 'spotify:track:0WqIKmW4BTrj3eJFmnCKMv', 'spotify:track:2gam98EZKrF9XuOkU13ApN', 'spotify:track:0uqPG793dkDDN7sCUJJIVC', 'spotify:track:6GIrIt2M39wEGwjCQjGChX', 'spotify:track:4E5P1XyAFtrjpiIxkydly4', 'spotify:track:3H1LCvO3fVsK2HPguhbml0']


In [46]:
tracks[tracks['track_uri'].isin(song_uris)]

Unnamed: 0,track_uri,artist_uri,album_uri,duration_ms
0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863
1,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800
2,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933
10,spotify:track:2gam98EZKrF9XuOkU13ApN,spotify:artist:2jw70GZXlAI8QzWeY2bgRc,spotify:album:2yboV2QBcVGEhcRlYuPpDT,242293
21,spotify:track:0uqPG793dkDDN7sCUJJIVC,spotify:artist:1yxSLGMDHlW21z4YXirZDS,spotify:album:1bNyYpkDRovmErm4QeDrpJ,272533
31,spotify:track:6GIrIt2M39wEGwjCQjGChX,spotify:artist:0vWCyXMrrvMlCcepuOJaGI,spotify:album:4WqgusSAgXkrjbXzqdBY68,206520
32,spotify:track:4E5P1XyAFtrjpiIxkydly4,spotify:artist:5tKXB9uuebKE34yowVaU3C,spotify:album:44hyrGuZKAvITbmrlhryf8,182306
33,spotify:track:3H1LCvO3fVsK2HPguhbml0,spotify:artist:7bXgB6jMjp9ATFy66eO08Z,spotify:album:1UtE4zAlSE2TlKmTFgrTg5,277106


In [47]:
playlist = [(index, song['track_uri']) for (index, song) in enumerate(playlists.iloc[0]['tracks'])]
list(filter(lambda item: item[1] in song_uris, playlist))  # songs in playlist 1 and in generated playlist

[(0, 'spotify:track:0UaMYEvWZi0ZqiDOoHU3YI'),
 (1, 'spotify:track:6I9VzXrHxO9rA9A5euc8Ak'),
 (2, 'spotify:track:0WqIKmW4BTrj3eJFmnCKMv'),
 (10, 'spotify:track:2gam98EZKrF9XuOkU13ApN'),
 (21, 'spotify:track:0uqPG793dkDDN7sCUJJIVC'),
 (31, 'spotify:track:6GIrIt2M39wEGwjCQjGChX'),
 (32, 'spotify:track:4E5P1XyAFtrjpiIxkydly4'),
 (33, 'spotify:track:3H1LCvO3fVsK2HPguhbml0'),
 (51, 'spotify:track:6GIrIt2M39wEGwjCQjGChX')]

In [48]:
for song_uri in song_uris:
    print(f'Song: {song_uri}, occurrence: {song_occurrence(song_uri, playlists)}')

Song: spotify:track:0UaMYEvWZi0ZqiDOoHU3YI, occurrence: 136
Song: spotify:track:6I9VzXrHxO9rA9A5euc8Ak, occurrence: 249
Song: spotify:track:0WqIKmW4BTrj3eJFmnCKMv, occurrence: 350
Song: spotify:track:2gam98EZKrF9XuOkU13ApN, occurrence: 331
Song: spotify:track:0uqPG793dkDDN7sCUJJIVC, occurrence: 179
Song: spotify:track:6GIrIt2M39wEGwjCQjGChX, occurrence: 99
Song: spotify:track:4E5P1XyAFtrjpiIxkydly4, occurrence: 238
Song: spotify:track:3H1LCvO3fVsK2HPguhbml0, occurrence: 177


The 8 songs are not all from the same artist or album, but they seem to occur together often in the 20,000 playlists. The 5 new songs are not exactly the next 5 songs of playlist one, but they do appear in playlist one in the same order.

We now repeat the test with the second playlist, before analysing the final playlist.

In [49]:
## URIs of the first 3 songs from playlist 2:
song_uris = [song['track_uri'] for song in playlists.iloc[1]['tracks'][:3]]
num_songs = 5

print(f'initial playlist: {song_uris}')
for i in range(num_songs):
    new_song = get_next_song(song_uris)
    
    if new_song:
        song_uris.append(new_song)

print(f'New playlist: {song_uris}')

initial playlist: ['spotify:track:2HHtWyy5CgaQbC7XSoOb0e', 'spotify:track:1MYYt7h6amcrauCOoso3Gx', 'spotify:track:3x2mJ2bjCIU70NrH49CtYR']
New playlist: ['spotify:track:2HHtWyy5CgaQbC7XSoOb0e', 'spotify:track:1MYYt7h6amcrauCOoso3Gx', 'spotify:track:3x2mJ2bjCIU70NrH49CtYR', 'spotify:track:1Pm3fq1SC6lUlNVBGZi3Em', 'spotify:track:1NXTEkIeRL59NK61QuhYUl', 'spotify:track:3RGlJJFkWEavxeRQr9ivAd', 'spotify:track:0e9hR1vTrzlUvFH5PgA9rY', 'spotify:track:7dkbEHIMLoeuG4zXGmzhEH']


In [50]:
tracks[tracks['track_uri'].isin(song_uris)]

Unnamed: 0,track_uri,artist_uri,album_uri,duration_ms
51,spotify:track:2HHtWyy5CgaQbC7XSoOb0e,spotify:artist:26bcq2nyj5GB7uRr558iQg,spotify:album:4PT9VulQaQP6XR1xBI2x1W,243773
52,spotify:track:1MYYt7h6amcrauCOoso3Gx,spotify:artist:7zdmbPudNX4SQJXnYIuCTC,spotify:album:3q8vR3PFV8kG1m1Iv8DpKq,70294
53,spotify:track:3x2mJ2bjCIU70NrH49CtYR,spotify:artist:7zdmbPudNX4SQJXnYIuCTC,spotify:album:3q8vR3PFV8kG1m1Iv8DpKq,65306
54,spotify:track:1Pm3fq1SC6lUlNVBGZi3Em,spotify:artist:7zdmbPudNX4SQJXnYIuCTC,spotify:album:3q8vR3PFV8kG1m1Iv8DpKq,108532
55,spotify:track:1NXTEkIeRL59NK61QuhYUl,spotify:artist:7zdmbPudNX4SQJXnYIuCTC,spotify:album:3q8vR3PFV8kG1m1Iv8DpKq,214268
56,spotify:track:3RGlJJFkWEavxeRQr9ivAd,spotify:artist:7zdmbPudNX4SQJXnYIuCTC,spotify:album:3q8vR3PFV8kG1m1Iv8DpKq,110219
57,spotify:track:0e9hR1vTrzlUvFH5PgA9rY,spotify:artist:7zdmbPudNX4SQJXnYIuCTC,spotify:album:60wUpRwDRF1jmViHaW2yu4,207520
58,spotify:track:7dkbEHIMLoeuG4zXGmzhEH,spotify:artist:6BKWwLs98ZY3ifhCDNGvLk,spotify:album:38tJMNu2lPatR7xnPchOOB,226000


In [51]:
playlist = [(index, song['track_uri']) for (index, song) in enumerate(playlists.iloc[1]['tracks'])]
list(filter(lambda item: item[1] in song_uris, playlist)) # songs in playlist 2 and in generated playlist

[(0, 'spotify:track:2HHtWyy5CgaQbC7XSoOb0e'),
 (1, 'spotify:track:1MYYt7h6amcrauCOoso3Gx'),
 (2, 'spotify:track:3x2mJ2bjCIU70NrH49CtYR'),
 (3, 'spotify:track:1Pm3fq1SC6lUlNVBGZi3Em'),
 (4, 'spotify:track:1NXTEkIeRL59NK61QuhYUl'),
 (5, 'spotify:track:3RGlJJFkWEavxeRQr9ivAd'),
 (6, 'spotify:track:0e9hR1vTrzlUvFH5PgA9rY'),
 (7, 'spotify:track:7dkbEHIMLoeuG4zXGmzhEH')]

In [52]:
for song_uri in song_uris:
    print(f'Song: {song_uri}, occurrence: {song_occurrence(song_uri, playlists)}')

Song: spotify:track:2HHtWyy5CgaQbC7XSoOb0e, occurrence: 271
Song: spotify:track:1MYYt7h6amcrauCOoso3Gx, occurrence: 1
Song: spotify:track:3x2mJ2bjCIU70NrH49CtYR, occurrence: 1
Song: spotify:track:1Pm3fq1SC6lUlNVBGZi3Em, occurrence: 1
Song: spotify:track:1NXTEkIeRL59NK61QuhYUl, occurrence: 2
Song: spotify:track:3RGlJJFkWEavxeRQr9ivAd, occurrence: 2
Song: spotify:track:0e9hR1vTrzlUvFH5PgA9rY, occurrence: 3
Song: spotify:track:7dkbEHIMLoeuG4zXGmzhEH, occurrence: 1


Even with 20,000 playlists, the next songs are generally the songs that directly follow in the initial playlist. They are not necessarily from the same artist or album. The reason is that some tracks or sequences of tracks occur only once in the playlists.

#### Drawback
The main drawback is that we cannot find a new song for an unseen sequence of songs, as in the example below, where the playlist combines 2 songs from each of two playlists:

In [43]:
my_playlist = [song['track_uri'] for song in playlists.iloc[1]['tracks'][:2]] # first 2 songs from playlist 2
my_playlist.extend([song['track_uri'] for song in playlists.iloc[0]['tracks'][:2]]) # add the first 2 songs from playlist 1

print(my_playlist)
new_song = get_next_song(my_playlist)

if not new_song:
    print("next song not found.")
else:
    print(f"new song: {new_song}") # not expected: no track should be found

['spotify:track:2HHtWyy5CgaQbC7XSoOb0e', 'spotify:track:1MYYt7h6amcrauCOoso3Gx', 'spotify:track:0UaMYEvWZi0ZqiDOoHU3YI', 'spotify:track:6I9VzXrHxO9rA9A5euc8Ak']
next song not found.


## <span style="color:green">Defining a song embedding / representation</span>

### Get tracks context

The computed track embedding should place tracks close to each other when:
- the tracks are in the same playlists (we assume that tracks in the same playlist tend to have related genres)
- the tracks are from the same album
- the tracks are from the same artist

That is why, in addition to the collected playlists, we generate one list of tracks per artist and one per album as extra context. These lists of track URIs play the role of documents whose words are tracks, so that we can use Word2Vec from gensim. Before that, we convert each track URI to a unique number.
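
The construction of the context can be sketched on toy data without pandas. The function `build_contexts` and the short URIs below are made up for illustration. The example also shows why every track ends up with at least three occurrences: once per playlist it appears in, once in its artist list and once in its album list.

```python
from collections import Counter, defaultdict


def build_contexts(playlists: list[list[str]], track_table: list[dict]) -> list[list[str]]:
    """Return all playlists plus one track list per artist and one per album."""
    by_artist: dict[str, list[str]] = defaultdict(list)
    by_album: dict[str, list[str]] = defaultdict(list)
    for track in track_table:
        by_artist[track["artist_uri"]].append(track["track_uri"])
        by_album[track["album_uri"]].append(track["track_uri"])
    return [*playlists, *by_artist.values(), *by_album.values()]


# Toy data with made-up URIs.
toy_playlists = [["t1", "t2"], ["t2", "t3"]]
toy_track_table = [
    {"track_uri": "t1", "artist_uri": "A", "album_uri": "X"},
    {"track_uri": "t2", "artist_uri": "A", "album_uri": "Y"},
    {"track_uri": "t3", "artist_uri": "B", "album_uri": "Y"},
]

contexts = build_contexts(toy_playlists, toy_track_table)

# Count how often each track occurs over the whole context.
occurrences = Counter(uri for context in contexts for uri in context)
print(len(contexts), dict(occurrences))
```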

In [23]:
uri_playlists = playlists['tracks'].apply(lambda tracks: [song['track_uri'] for song in tracks])

# Group tracks by artists uri
artists_tracks = tracks.groupby('artist_uri')['track_uri'].apply(list)

# Group tracks by album uri
albums_tracks = tracks.groupby('album_uri')['track_uri'].apply(list)

# whole context
context_tracks = [
    *uri_playlists.values,
    *artists_tracks.values,
    *albums_tracks.values
]

print(len(context_tracks))

192070


A track can occur many times in `uri_playlists` but only once in `artists_tracks` and once in `albums_tracks`, so its occurrences in the original playlists carry more weight. **How could this be improved for artists and albums?**

### Tracks to numbers

In [9]:
# exec on every opening
# dicts for switching between tracks and numbers
number2track = dict((index, track) for index, track in enumerate(tracks['track_uri'].values))
track2number = dict((track, index) for index, track in number2track.items())

### Train a Track2Vec model

Word2Vec offers two methods: skip-gram and continuous bag of words (CBOW). We use skip-gram, as it is reported to work well with infrequent words in the dataset ([NLP 101: Word2Vec — Skip-gram and CBOW](https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314), [Google archive word2vec](https://code.google.com/archive/p/word2vec/)). The method learns the embedding of a word from its context, which consists of the surrounding words.
As we already saw, some songs occur only once in the first 20,000 playlists. Such tracks occur only three times in the context: once in the 20,000 playlists, once in an album list and once in an artist list.

Based on the computed track context, we compute a track embedding with 300 dimensions, using a window size of 6 for the skip-gram method. We expect the vector components to be small real numbers, roughly within [-1, 1], although Word2Vec does not enforce such a bound.
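
To show what "context" means for skip-gram, the sketch below lists the (center, context) training pairs for a toy sequence of track numbers. The function `skipgram_pairs` is made up for illustration and is a simplification. Real Word2Vec implementations also shrink the window at random and subsample frequent tokens, which is left out here.

```python
def skipgram_pairs(sequence: list[int], window: int) -> list[tuple[int, int]]:
    """Return every (center, context) pair whose distance is at most `window`."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a track is not its own context
                pairs.append((center, sequence[j]))
    return pairs


# Toy playlist of track numbers with a window of 1 (the notebook uses 6).
pairs = skipgram_pairs([10, 11, 12, 13], window=1)
print(pairs)
```

With a window of 6, each track in a playlist is paired with up to 6 tracks before it and 6 tracks after it.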

In [25]:
from gensim.models.word2vec import Word2Vec

# convert each track to its corresponding number -> sequences for word2vec
context_tracks_numbers = [ [track2number[track_uri] for track_uri in tracks_list] for tracks_list in context_tracks]
print("Number of playlists", len(context_tracks_numbers))

# embedding dimensionality
num_features = 300 

# how often does a word have to occur to be considered
min_word_count = 3 # song occurs in minimum one playlist, in exactly one album and belongs to one artists -> 3 times

# number of threads to use
num_workers = 8

# window size
window_size = 6

# subsampling parameter for word2vec
subsampling = 1e-3

track2vec = Word2Vec(context_tracks_numbers, 
                     workers=num_workers,
                     vector_size=num_features,
                     min_count=min_word_count,
                     sg=1, # 1: skip-gram, 0: CBOW
                     window=window_size,
                     sample=subsampling)

del context_tracks # free RAM

Number of playlists 192070


### Save or load trained track2vec model

In [26]:
track2vec.save('./models/track2vec.model')

In [10]:
# exec on every opening
from gensim.models.word2vec import Word2Vec

track2vec = Word2Vec.load('./models/track2vec.model')

### Helper methods for switching between track URIs and vectors

In [11]:
# exec on every opening
import numpy as np

def convert_track_to_vec(track_uri: str) -> np.ndarray:
    """Convert the track uri to a dense vector"""
    return track2vec.wv[track2number[track_uri]]

def convert_vec_to_track(vector: np.ndarray) -> str:
    """Convert a dense vector to a track uri"""
    return number2track[track2vec.wv.similar_by_vector(vector, topn=1)[0][0]]

In [28]:
track = tracks['track_uri'].iloc[0]

print("initial track", track)
vector = convert_track_to_vec(track)
new_track = convert_vec_to_track(vector)

print("After converting to vector and backwards:", new_track)

initial track spotify:track:0UaMYEvWZi0ZqiDOoHU3YI
After converting to vector and backwards: spotify:track:0UaMYEvWZi0ZqiDOoHU3YI


In [61]:
# exec on every opening

def get_next_similar_song(song_uris: list[str]) -> str:
    """
    select the next similar song based on a computed embedding: track2vec.
    """
    encoded_playlist = [ track2number[song_uri] for song_uri in song_uris ] # most_similar uses track_numbers as vocabulary
    similar_songs = track2vec.wv.most_similar(encoded_playlist, topn = 1) # [(number, sim(number, playlist))]
    
    return number2track[similar_songs[0][0]]

### Testing track embedding

#### Test with 3 tracks from the same artist and album

In [31]:
song_uris = [song['track_uri'] for song in playlists.iloc[1]['tracks'][1:4]]
num_songs = 5

print(f'initial playlist: {song_uris}')
for i in range(num_songs):
    new_song = get_next_similar_song(song_uris)
    
    if new_song:
        song_uris.append(new_song)

print(f'New playlist: {song_uris}')

initial playlist: ['spotify:track:1MYYt7h6amcrauCOoso3Gx', 'spotify:track:3x2mJ2bjCIU70NrH49CtYR', 'spotify:track:1Pm3fq1SC6lUlNVBGZi3Em']
New playlist: ['spotify:track:1MYYt7h6amcrauCOoso3Gx', 'spotify:track:3x2mJ2bjCIU70NrH49CtYR', 'spotify:track:1Pm3fq1SC6lUlNVBGZi3Em', 'spotify:track:2l9HEwPU6udcyYN2gDi0nn', 'spotify:track:6sYFewKgcRSabu9R8hjUem', 'spotify:track:33SGhp4QaGuTrFlsMPoPkn', 'spotify:track:4Umy8mk1bBdWYoxKB8354f', 'spotify:track:5tairRxXpTHPXaJvq2ISFf']


In [32]:
tracks[tracks['track_uri'].isin(song_uris[3:])]

Unnamed: 0,track_uri,artist_uri,album_uri,duration_ms
281107,spotify:track:4Umy8mk1bBdWYoxKB8354f,spotify:artist:4Eon8wJtGHP3JNL7lB8Qml,spotify:album:6keE26GqNkLzprE0RBBzFJ,126666
864780,spotify:track:6sYFewKgcRSabu9R8hjUem,spotify:artist:6Gp8BbKAqPO3R0UAYkm8J0,spotify:album:2t34tkX7TR5rYKMu3WYO0z,151813
1520004,spotify:track:2l9HEwPU6udcyYN2gDi0nn,spotify:artist:7b85ve82Sh36a3UAx74wut,spotify:album:5B7NDPESn71gfih59EEy3s,248165
2123183,spotify:track:5tairRxXpTHPXaJvq2ISFf,spotify:artist:58lV9VcRSjABbAbfWS6skp,spotify:album:4SIeviZcnTJ86k3LUAj2yu,315360
2871947,spotify:track:33SGhp4QaGuTrFlsMPoPkn,spotify:artist:5RTA06B6cts3a6I7iMqKGu,spotify:album:0mDSPMGVdWORtoEuxhyMJx,224676


In [33]:
playlist = [(index, song['track_uri']) for (index, song) in enumerate(playlists.iloc[1]['tracks'])]
list(filter(lambda item: item[1] in song_uris, playlist)) # songs in playlist 2 and in generated playlist

[(1, 'spotify:track:1MYYt7h6amcrauCOoso3Gx'),
 (2, 'spotify:track:3x2mJ2bjCIU70NrH49CtYR'),
 (3, 'spotify:track:1Pm3fq1SC6lUlNVBGZi3Em')]

The returned similar tracks are not from the same artist or album as the input playlist. They are also different from the tracks returned by the baseline method `get_next_song`. Here the similarity is based more on songs that frequently occur together: since we used skip-gram, the embedding was computed from the tracks that occur before and after a given song in the playlists.
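
The ranking behind such a query can be sketched in plain Python. To our understanding, a query with several input tracks averages their length-normalized vectors and ranks all other tracks by cosine similarity to that mean. The code below is only an illustration of this idea with made-up 2D vectors. The names `normalize` and `most_similar_to_playlist` are hypothetical and this is not gensim's implementation.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar_to_playlist(playlist: list[str], embeddings: dict[str, list[float]]) -> str:
    """Return the track (outside the playlist) closest to the mean playlist vector."""
    unit = {key: normalize(vec) for key, vec in embeddings.items()}
    dim = len(next(iter(unit.values())))
    mean = normalize([sum(unit[key][d] for key in playlist) / len(playlist) for d in range(dim)])
    # Cosine similarity of unit vectors is their dot product; input tracks are excluded.
    scores = {
        key: sum(a * b for a, b in zip(mean, vec))
        for key, vec in unit.items()
        if key not in playlist
    }
    return max(scores, key=scores.get)


# Made-up 2D embeddings: "d" points in a similar direction to "a" and "b", "c" does not.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.8, 0.3]}

closest = most_similar_to_playlist(["a", "b"], toy_embeddings)
print(closest)
```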

#### Test with one track

In [34]:
song_uris = [playlists.iloc[1]['tracks'][1]['track_uri']]
num_songs = 5

print(f'initial playlist: {song_uris}')
for i in range(num_songs):
    new_song = get_next_similar_song(song_uris)
    
    if new_song:
        song_uris.append(new_song)

print(f'New playlist: {song_uris}')

initial playlist: ['spotify:track:1MYYt7h6amcrauCOoso3Gx']
New playlist: ['spotify:track:1MYYt7h6amcrauCOoso3Gx', 'spotify:track:2l9HEwPU6udcyYN2gDi0nn', 'spotify:track:6sYFewKgcRSabu9R8hjUem', 'spotify:track:5tairRxXpTHPXaJvq2ISFf', 'spotify:track:546FkIJm7xhVZV0TWxYOlo', 'spotify:track:5hgalo6f5axSbdhq2GlYZz']


In [35]:
tracks[tracks['track_uri'].isin(song_uris)]

Unnamed: 0,track_uri,artist_uri,album_uri,duration_ms
53,spotify:track:1MYYt7h6amcrauCOoso3Gx,spotify:artist:7zdmbPudNX4SQJXnYIuCTC,spotify:album:3q8vR3PFV8kG1m1Iv8DpKq,70294
864780,spotify:track:6sYFewKgcRSabu9R8hjUem,spotify:artist:6Gp8BbKAqPO3R0UAYkm8J0,spotify:album:2t34tkX7TR5rYKMu3WYO0z,151813
896871,spotify:track:546FkIJm7xhVZV0TWxYOlo,spotify:artist:56E5XajgEQr7pQNK4C10RF,spotify:album:4waqpSbSeFM8xuw7ULTVV4,290146
1458745,spotify:track:5hgalo6f5axSbdhq2GlYZz,spotify:artist:0tw9OwFgbaPVDCBHYSsUMN,spotify:album:5uMWxkCaJN1iNbOeMY9AWu,243948
1520004,spotify:track:2l9HEwPU6udcyYN2gDi0nn,spotify:artist:7b85ve82Sh36a3UAx74wut,spotify:album:5B7NDPESn71gfih59EEy3s,248165
2123183,spotify:track:5tairRxXpTHPXaJvq2ISFf,spotify:artist:58lV9VcRSjABbAbfWS6skp,spotify:album:4SIeviZcnTJ86k3LUAj2yu,315360


## <span style="color:green">Defining a neural network</span>


### Training, validation, and test data

We first split the 20,000 playlists into training, validation, and test data:
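
A 60/20/20 split done in two stages needs a different fraction in each stage: first 40% of the data is set aside, then that part is halved. The arithmetic can be checked with a few lines of plain Python. This is only a sanity check under the assumption of simple rounding. The exact rounding of a splitting library may differ by one sample for other dataset sizes.

```python
n_playlists = 20_000
train_perc, val_perc, test_perc = 0.60, 0.20, 0.20

# Stage 1: hold out validation + test together.
first_stage_fraction = val_perc + test_perc
n_tmp = round(n_playlists * first_stage_fraction)
n_train = n_playlists - n_tmp

# Stage 2: split the held-out part; the fraction is relative to that part.
second_stage_fraction = test_perc / (val_perc + test_perc)
n_test = round(n_tmp * second_stage_fraction)
n_val = n_tmp - n_test

print(f"Train: {n_train}, Val: {n_val}, Test: {n_test}")
```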

In [12]:
# exec on every opening
from sklearn.model_selection import train_test_split

# percentages used for training, validation, and test data
train_perc = 0.60
val_perc = 0.20
test_perc = 0.20

# formatted playlists with only the track_uris
uri_playlists = playlists['tracks'].apply(lambda tracks: [song['track_uri'] for song in tracks])

# extract training data and temporary validation-test data
playlists_train, playlists_tmp = train_test_split(uri_playlists, test_size=val_perc+test_perc, random_state=0)

# split validation-test data into validation and test data
playlists_val, playlists_test = train_test_split(playlists_tmp, test_size=test_perc/(val_perc+test_perc), random_state=0)

# save playlists_test for baseline evaluation later
playlists_test_baseline = playlists_test.copy(deep=True)

# free RAM
del uri_playlists
del playlists_tmp

# convert each track uri to the corresponding vector
track_uri2vec = lambda tracks: [convert_track_to_vec(track_uri) for track_uri in tracks]

playlists_train = playlists_train.apply(track_uri2vec)
playlists_val = playlists_val.apply(track_uri2vec)
playlists_test = playlists_test.apply(track_uri2vec)

print(f'Train: {len(playlists_train)}, Val: {len(playlists_val)}, Test: {len(playlists_test)}')

Train: 12000, Val: 4000, Test: 4000


We now split the training, validation, and test data into k-grams. Each X[i] contains k consecutive tracks and y[i] contains the track that follows them. After that, each X is converted to a 3D tensor with shape (number of samples, k or window size, number of embedding dimensions), and each y to a 2D matrix with shape (number of samples, number of embedding dimensions).
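
A toy example makes the effect of k and the step size visible. The sketch below uses letters instead of embedding vectors. The function `k_grams` is a stand-alone illustration of the splitting rule and is named differently from the notebook functions on purpose.

```python
def k_grams(tracks: list, k: int, step: int = 1) -> tuple[list[list], list]:
    """Return windows of k tracks (X) and the track following each window (y)."""
    X, y = [], []
    for i in range(0, len(tracks) - k, step):
        X.append(tracks[i:i + k])
        y.append(tracks[i + k])
    return X, y


# 7 tracks, k = 3, step = 2 -> windows start at positions 0 and 2.
X, y = k_grams(list("abcdefg"), k=3, step=2)
print(X, y)

# With k = 10 and step = 10, a playlist of 25 tracks yields 2 samples,
# and a playlist of 10 tracks or fewer yields none.
X_long, y_long = k_grams(list(range(25)), k=10, step=10)
X_short, y_short = k_grams(list(range(10)), k=10, step=10)
print(len(X_long), len(X_short))
```

Short playlists therefore contribute no samples, and a larger step reduces the overlap between consecutive windows.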

In [13]:
# exec on every opening
# split playlists into k-grams (k tracks + next track) -> for training and evaluation
import pandas as pd

def playlist_k_gram(tracks: list, k: int, step: int = 1) -> tuple[list[list], list]:
    """Split a list of tracks into k-grams: X[i] contains a sequence of k tracks, y[i] the track that follows it."""
    X: list[list] = []
    y: list = []
        
    for i in range(0, len(tracks) - k, step):
        X.append(tracks[i:i+k])
        y.append(tracks[i+k])
    
    return X, y

def playlists_k_gram(playlists: pd.Series, k: int, step: int = 1) -> tuple[list[list], list]:
    """Split every playlist into k-grams: X[i] contains a sequence of k tracks, y[i] the track that follows it."""
    X: list[list] = []
    y: list = []
        
    for playlist in playlists:
        playlist_x, playlist_y = playlist_k_gram(playlist, k, step)
        X.extend(playlist_x)
        y.extend(playlist_y)
    
    return X, y

In [14]:
import numpy as np

# exec on every opening
# params for k-gram
step = 10
playlist_max_len = 10 # k

# splitting in X, y with k grams
# convert X to a 3D tensor -> X: (#k-grams, k, embedding_dim) and y to a 2D matrix -> y: (#k-grams, embedding_dim)
X_playlists_train, y_playlists_train = (np.array(item) for item in playlists_k_gram(playlists_train, playlist_max_len, step))
X_playlists_val, y_playlists_val = (np.array(item) for item in playlists_k_gram(playlists_val, playlist_max_len, step))
X_playlists_test, y_playlists_test = (np.array(item) for item in playlists_k_gram(playlists_test, playlist_max_len, step))

print(f'X -> Train: {X_playlists_train.shape}, Val: {X_playlists_val.shape}, Test: {X_playlists_test.shape}')
print(f'y -> Train: {y_playlists_train.shape}, Val: {y_playlists_val.shape}, Test: {y_playlists_test.shape}')

X -> Train: (73164, 10, 300), Val: (24328, 10, 300), Test: (24770, 10, 300)
y -> Train: (73164, 300), Val: (24328, 300), Test: (24770, 300)
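
To see what the k-gram splitting produces, here is a toy example with letters instead of embedding vectors. The helper `sliding_k_grams` is a hypothetical, condensed copy of `playlist_k_gram` from above, repeated only so that the snippet runs on its own.

```python
def sliding_k_grams(tracks, k, step=1):
    """Return (X, y): windows of k tracks and the track following each window."""
    X, y = [], []
    for i in range(0, len(tracks) - k, step):
        X.append(tracks[i:i + k])
        y.append(tracks[i + k])
    return X, y

toy_playlist = ['a', 'b', 'c', 'd', 'e', 'f']

# step=1: every possible window of 3 tracks that still has a successor
X, y = sliding_k_grams(toy_playlist, k=3, step=1)
print(X)  # [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e']]
print(y)  # ['d', 'e', 'f']

# a playlist with at most k tracks has no successor and yields no sample
X_short, y_short = sliding_k_grams(['a', 'b', 'c'], k=3)
print(X_short, y_short)  # [] []
```

With `step = 10` and `k = 10` as in the cell above, the windows of one playlist start at different multiples of 10 and therefore do not overlap, and playlists with 10 or fewer tracks contribute no samples. This is why the first dimension of X counts k-grams rather than playlists.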


In [15]:
# free RAM
del playlists_train
del playlists_val
del playlists_test

del playlists
del tracks

### Designing the neural network

In [16]:
# turn off tensorflow info and warning messages
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

from keras.metrics import Precision
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.callbacks import CSVLogger

# Build the model
embedding_dim = X_playlists_train.shape[2] # 300

model = Sequential()
model.add(LSTM(128, input_shape=(playlist_max_len, embedding_dim)))
model.add(Dense(embedding_dim, activation='softmax')) # 300 values in [-1, 1] -> softmax is not a good choice
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy', Precision()]) # TODO adapt loss and last layer to generate a vec representing a track
model.summary()

# Train the model
epochs = 20
batch_size = 128
csv_logger = CSVLogger('training.log', separator=',', append=False)

history = model.fit(X_playlists_train, y_playlists_train,
          batch_size=batch_size, epochs=epochs,
          validation_data=(X_playlists_val, y_playlists_val),
          verbose=1, callbacks=[csv_logger])

# Evaluate the model (here only on the first two test samples)
score = model.evaluate(X_playlists_test[:2], y_playlists_test[:2], verbose=1)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 128)               219648    
                                                                 
 dense (Dense)               (None, 300)               38700     
                                                                 
Total params: 258,348
Trainable params: 258,348
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20

KeyboardInterrupt: 

In [11]:
print(score)

[-0.6705567836761475, 0.5, 0.0]
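
The negative loss in the output above is consistent with the remark in the model code that softmax is not a good fit here. Softmax outputs are positive and sum to 1, so they cannot reproduce an embedding vector with negative entries, and categorical cross-entropy is only guaranteed to be non-negative for non-negative targets such as one-hot vectors. The sketch below illustrates this with the textbook definition of the loss and made-up numbers; the Keras implementation additionally clips the predictions, so the exact values would differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred):
    """Textbook form: -sum(y_true * log(y_pred))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

y_pred = softmax([0.3, -1.2, 0.8, 0.1])

# proper one-hot target: the loss is non-negative
one_hot = [0.0, 0.0, 1.0, 0.0]
loss_one_hot = categorical_crossentropy(one_hot, y_pred)

# embedding-like target with negative entries (illustrative values only)
embedding_target = [-0.4, -0.7, 0.2, -0.1]
loss_embedding = categorical_crossentropy(embedding_target, y_pred)

print(min(y_pred) > 0, loss_one_hot >= 0, loss_embedding < 0)  # True True True
```

One possible direction for the TODO in the model code, not tested here, would be a linear or `tanh` output layer combined with a regression-style loss such as mean squared error or cosine similarity.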


### Evolution of loss and accuracy

In [None]:
print(history.history.keys()) # Precision?

In [None]:
import matplotlib.pyplot as plt

history_dict = history.history
train_loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(train_loss) + 1)
plt.title('Training and validation loss')
plt.plot(epochs, train_loss, label='Training loss')
plt.plot(epochs, val_loss, label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid()
plt.show()

In [None]:
import matplotlib.pyplot as plt

history_dict = history.history
train_acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']

epochs = range(1, len(train_acc) + 1)
plt.title('Training and validation accuracy')
plt.plot(epochs, train_acc, label='Training accuracy')
plt.plot(epochs, val_acc, label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid()
plt.show()
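
If training is interrupted before `model.fit` returns, `history` is never assigned and the plotting cells above cannot run. The `CSVLogger` callback used during training writes one row per completed epoch to `training.log`, so the curves can in principle be rebuilt from that file. The sketch below assumes the file has a header row with an `epoch` column (counted from 0) followed by the metric names; `read_training_log` is a hypothetical helper, and the sample rows are placeholders, not real results.

```python
import csv
import io

def read_training_log(lines):
    """Parse CSVLogger-style rows into {column name: [value per epoch]}."""
    columns = {}
    for row in csv.DictReader(lines):
        for name, value in row.items():
            columns.setdefault(name, []).append(float(value))
    return columns

# made-up placeholder rows, only to show the assumed file layout
sample = io.StringIO(
    "epoch,accuracy,loss,val_accuracy,val_loss\n"
    "0,0.10,1.50,0.09,1.60\n"
    "1,0.12,1.40,0.11,1.55\n"
)
log = read_training_log(sample)

# shift to 1-based epoch numbers for the x axis of the plots
epochs_seen = [int(e) + 1 for e in log['epoch']]
print(epochs_seen, log['loss'], log['val_loss'])
```

For the real file, the same function can be called as `with open('training.log', newline='') as f: log = read_training_log(f)`, and `log['loss']` and `log['val_loss']` can then replace `history_dict['loss']` and `history_dict['val_loss']` in the plotting code.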

### Save and load the model

In [None]:
model.save('../models/spotify_playlist.h5')

In [None]:
# turn off tensorflow info and warning messages
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

from keras.models import load_model
# Recreate the exact same model, including its weights and the optimizer
model = load_model('../models/spotify_playlist.h5')

# Show the model architecture
if model:
    model.summary()

### Free data

In [9]:
# free ram
del X_playlists_train
del y_playlists_train
del X_playlists_val
del y_playlists_val
del X_playlists_test
del y_playlists_test

NameError: name 'X_playlists_train' is not defined

## <span style="color:green">Evaluation of each method</span>

With k-grams: given k songs, try to find the directly following song.
2nd variation: given k songs, try to find a song that occurs in the same playlist after these k songs.
3rd variation: given k songs, does the next song have a common album / artist?
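
The three criteria can be sketched as small predicate functions. All names below are hypothetical, `following` stands for the tracks that come after the k input songs in the original playlist, and `meta` is an assumed lookup from track URI to its artist and album URIs. For the 3rd variation the sketch takes one possible reading, namely comparing the predicted song with the actual next song.

```python
def hit_exact(predicted, following):
    """Variation 1: the prediction equals the direct next track."""
    return bool(following) and predicted == following[0]

def hit_later(predicted, following):
    """Variation 2: the prediction occurs anywhere later in the playlist."""
    return predicted in following

def hit_same_artist_or_album(predicted, following, meta):
    """Variation 3 (one reading): prediction shares artist or album with the next track."""
    if not following:
        return False
    p, t = meta.get(predicted), meta.get(following[0])
    if p is None or t is None:
        return False
    return p['artist_uri'] == t['artist_uri'] or p['album_uri'] == t['album_uri']

# toy data with shortened, made-up identifiers
meta = {
    't1': {'artist_uri': 'A', 'album_uri': 'X'},
    't2': {'artist_uri': 'A', 'album_uri': 'Y'},
    't3': {'artist_uri': 'B', 'album_uri': 'Z'},
}
following = ['t1', 't3']

print(hit_exact('t3', following))                       # False
print(hit_later('t3', following))                       # True
print(hit_same_artist_or_album('t2', following, meta))  # True
```

Each variation is more lenient than the previous one in this toy case, so comparing the three scores would show how far off a wrong prediction is. A prediction of `None`, which the baseline methods may return, counts as a miss in all three functions.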

In [55]:
from typing import Callable
from tqdm import tqdm # progression bar

def eval_baseline_method(X: list[list[str]],
                         y: list[str],
                         method: Callable[[list[str]], str|None]) -> dict[str, int]:
    score = {
        'true': 0,
        'total': len(X)
    }
    
    for index in tqdm(range(len(X))):
        next_song = method(X[index])
        
        if next_song == y[index]:
            score['true'] += 1

    return score

In [38]:
# exec on every opening
# reload tracks and playlists for baseline methods
import pandas as pd

tracks = pd.read_json(tracks_filename)
playlists = pd.read_json(playlists_filename)

In [53]:
# test data for evaluation of baseline methods -> 20% of the 20,000 playlists
X_test_baseline, y_test_baseline = playlists_k_gram(playlists_test_baseline, playlist_max_len, step)

In [62]:
## maybe test for unseen sequence of songs too
score = eval_baseline_method(X_test_baseline, y_test_baseline, get_random_next_song)
print(f'random: {score}, Precision: {score["true"]/score["total"]}')

100%|██████████| 24770/24770 [02:17<00:00, 180.06it/s]

random: {'true': 0, 'total': 24770}, Precision: 0.0





In [63]:
score = eval_baseline_method(X_test_baseline, y_test_baseline, get_next_song)
print(f'table scan: {score}, Precision: {score["true"]/score["total"]}')

  0%|          | 52/24770 [00:09<1:15:28,  5.46it/s]


KeyboardInterrupt: 

In [64]:
score = eval_baseline_method(X_test_baseline, y_test_baseline, get_next_similar_song)
print(f'track2vec and cosine similarity: {score}, Precision: {score["true"]/score["total"]}')

100%|██████████| 24770/24770 [08:08<00:00, 50.73it/s]

track2vec and cosine similarity: {'true': 87, 'total': 24770}, Precision: 0.003512313282196205





In [None]:
del playlists_test_baseline