# Problem description

The data was collected with the Spotify API.

How do you decide which playlist a given track belongs in? In this project I built an ML model that ranks playlists by their relevance to a specific track.

The goal is to develop a model that uses data on existing playlists, artists, and their tracks to produce a ranked list of the most relevant playlists for a given track.

Data description:
- author_name - artist name;
- album_name - album name;
- track_name - track name;
- track_id - track id;
- playlist_id - playlist id;
- description - playlist description;
- playlist_name - playlist name;
- owner_name - name of the playlist curator;
- score - membership label: whether the track is in the playlist (1 = yes, 0 = no).

# Import

In [1]:
import pandas as pd
import numpy as np
import re
import joblib

from catboost import CatBoostRanker, Pool
from sklearn.metrics import ndcg_score
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

RAND = 10

In [2]:
df_train = pd.read_parquet('../../data/dataset_for_train')
df_train

Unnamed: 0,author_name,album_name,track_name,track_id,playlist_id,description,playlist_name,owner_name,score
0,NNIK,night walk,night walk,3J0iKUpvfpvz8U7qMfNAc2,1TAYkyS1BlrhRX7gklVW8v,Grab a coffee and get into the mood. IG: @lofi...,lofi beats - deep focus,Jannik Wagner,1
1,Dumage,Sunset Gradient,Sunset Gradient,4R6lnvy5FiovOoA2VO8HHl,1TAYkyS1BlrhRX7gklVW8v,Grab a coffee and get into the mood. IG: @lofi...,lofi beats - deep focus,Jannik Wagner,1
2,Endemico,Yokohama,lullaby of mechanics,039KxZdCDLbaTsxP9hDsu5,1TAYkyS1BlrhRX7gklVW8v,Grab a coffee and get into the mood. IG: @lofi...,lofi beats - deep focus,Jannik Wagner,1
3,Calmure,Wishing well,Wishing well,6ZtbYL3O4edmywZh37XQR0,1TAYkyS1BlrhRX7gklVW8v,Grab a coffee and get into the mood. IG: @lofi...,lofi beats - deep focus,Jannik Wagner,1
4,Characterising Runs,Just A Thought,Just A Thought,2zfUoVsjpG0RR35ar7SuCI,1TAYkyS1BlrhRX7gklVW8v,Grab a coffee and get into the mood. IG: @lofi...,lofi beats - deep focus,Jannik Wagner,1
...,...,...,...,...,...,...,...,...,...
363145,Melw Crew,JDB #1,J'ENTASSE,1MnNLBSnAG1NqiawYYyh4H,762uEHCOYgUEIwt1vNCIGl,"Los Angeles, California-based house queen LP G...",ThisSongIsSick Presents: The Remedy 038 Ft. LP...,THIS SONG IS SICK,0
363146,Lyane Hegemann,Seelenfeuer,Ich frag mich - Radio Mix,7m5ATd6Mp93C81hM294hzP,520wgP4UblITpzpVs32qK0,"Los Angeles, California-based house queen LP G...",ThisSongIsSick Presents: The Remedy 038 Ft. LP...,THIS SONG IS SICK,0
363147,Lost Frequencies Calum Scott,Where Are You Now,Where Are You Now,3uUuGVFu1V7jTQL60S1r8z,3NW9pdxBVW3PWnUPURZYD2,"Los Angeles, California-based house queen LP G...",ThisSongIsSick Presents: The Remedy 038 Ft. LP...,THIS SONG IS SICK,0
363148,BLKCITY RAIZA BIZA Mo Muse JessB Blaze the Emp...,Flying,Flying,6CI9XrryKhtQYEs6Us2wNb,2Mn7JHYzCaSgMaoWBdlwga,"Los Angeles, California-based house queen LP G...",ThisSongIsSick Presents: The Remedy 038 Ft. LP...,THIS SONG IS SICK,0


In [3]:
df_playlists = pd.read_parquet('../../data/full/playlists_features.parquet')
df_playlists

Unnamed: 0,description,id,playlist_name,owner_name
0,A Playlist with the Best Remixes of Popular So...,6TqkCoZ6mboVBlEkGDFZw9,Best Remixes of Popular Songs 🔥 New Hits 2023 🔥,daveepa
1,A Playlist with the best Techno Remixes of Pop...,5tC6QbgGXsERRpYyjTIMM8,Techno Remixes of Popular Songs 🔥 TikTok Techn...,daveepa
2,Party Music 2023 🔥 This is a playlist with the...,03Uyooblz44mbIFH6gLP5h,Party Music 2023 🔥 Best of EDM & Electro House,daveepa
3,No cap 🧢 only Deep House Bangers. Music by Joh...,4J50w2KWihoa3hn4MXtTjB,Deep House Bangers nobody tells you about 🔥,daveepa
4,No cap 🧢 only Tech House Bangers. Music by FIS...,5v09TzvXZWz4EH50EUUJkC,Tech House Bangers nobody tells you about 🔥,daveepa
...,...,...,...,...
27961,Dein Magazin und Shop für Musik. Täglich aktua...,4WWdQfpWYJR1pW2ge4x5vl,Pretty in Noise Top 200,prettyinnoise.de
27962,,7wGH9IK738EZC53iuaK3Wn,OS24 Druckfrisch,prettyinnoise.de
27963,Am 10. und 11. September 2021 sollen Bands aus...,1Jlg41rPMNWjVIL6l4Dgt3,HELLSEATIC 2021,prettyinnoise.de
27964,Mit der unregelmäßigen Serie der The End of Tr...,1DS83PzxbuWU7wvlzJgUo1,The End of Travelling,prettyinnoise.de


In [4]:
def data_merge(data: pd.DataFrame,
               author_name: str,
               album_name: str,
               track_name: str) -> pd.DataFrame:
    """
    Builds the dataset by adding the track columns to the playlist data.
    Note: the columns are inserted into `data` in place.
    :param data: playlist dataset
    :param author_name: artist name
    :param album_name: album name
    :param track_name: track name
    :return: dataset
    """
    data.insert(0, 'author_name', author_name)
    data.insert(1, 'album_name', album_name)
    data.insert(2, 'track_name', track_name)
    
    return data


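def clean_text(data: str) -> str:
    """
    Cleans a single text value (applied element-wise via Series.apply)
    :param data: raw string
    :return: cleaned string
    """
    data = data.lower()
    data = re.sub(r'[^\sa-zA-Z0-9@\[\]]', ' ', data)  # replace punctuation and non-ASCII characters with spaces
    data = re.sub(r'\w*\d+\w*', '', data)  # remove tokens that contain digits
    data = re.sub(r'[^\w\s]', ' ', data)  # replace the remaining symbols with spaces
    data = re.sub(r'\s{2,}', ' ', data)  # collapse repeated whitespace
    
    return data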


def data_drop(data: pd.DataFrame, drop_columns: list) -> pd.DataFrame:
    """
    Drops features
    :param data: dataset
    :param drop_columns: list of features to drop
    :return: dataset
    """
    return data.drop(columns=drop_columns)
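To make the effect of the cleaning steps concrete, here is a minimal stand-alone sketch. The name `clean_text_sketch` and the sample strings are illustrative and not part of the notebook; the regular expressions are the same ones used in `clean_text`.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Stand-alone copy of the cleaning steps, for illustration only."""
    text = text.lower()
    text = re.sub(r'[^\sa-zA-Z0-9@\[\]]', ' ', text)  # punctuation and non-ASCII -> space
    text = re.sub(r'\w*\d+\w*', '', text)              # drop tokens that contain digits
    text = re.sub(r'[^\w\s]', ' ', text)               # remaining symbols -> space
    text = re.sub(r'\s{2,}', ' ', text)                # collapse whitespace runs
    return text

print(repr(clean_text_sketch("Hits 2023: Deep-House!")))  # 'hits deep house '
print(repr(clean_text_sketch("Täglich")))                 # 't glich'
```

Two side effects are worth keeping in mind: a single trailing space can remain, and non-ASCII letters are replaced by spaces, which is why German descriptions show up as `t glich` in the cleaned outputs.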

# EDA

In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363150 entries, 0 to 363149
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   author_name    363150 non-null  object
 1   album_name     363150 non-null  object
 2   track_name     363150 non-null  object
 3   track_id       363150 non-null  object
 4   playlist_id    363150 non-null  object
 5   description    363150 non-null  object
 6   playlist_name  363150 non-null  object
 7   owner_name     363150 non-null  object
 8   score          363150 non-null  int64 
dtypes: int64(1), object(8)
memory usage: 24.9+ MB


There are no missing values.

In [6]:
# clean the text columns
df_train['author_name'] = df_train['author_name'].apply(clean_text)
df_train['album_name'] = df_train['album_name'].apply(clean_text)
df_train['track_name'] = df_train['track_name'].apply(clean_text)
df_train['playlist_name'] = df_train['playlist_name'].apply(clean_text)
df_train['owner_name'] = df_train['owner_name'].apply(clean_text)
df_train['description'] = df_train['description'].apply(clean_text)

In [21]:
# sort by playlist_id and score
df_train.sort_values(by=['playlist_id', 'score'], inplace=True)  # required for computing the metric per track_id
df_train.reset_index(drop=True, inplace=True)
df_train

Unnamed: 0,author_name,album_name,track_name,track_id,playlist_id,description,playlist_name,owner_name,score
0,tyla ayra starr,girl next door,girl next door,000N4CJL8IjQ0f2I4grgBO,37i9dQZF1DWUzFXarNiofw,sad life sad you sad mode amp lt submitmusic t...,s a d m o d e,b r o k e n h e a r t s,0
1,tyla ayra starr,girl next door,girl next door,000N4CJL8IjQ0f2I4grgBO,37i9dQZF1DWUzFXarNiofw,clozee hits the mark perfectly with these rich...,thissongissick presents the remedy ft clozee,this song is sick,0
2,tyla ayra starr,girl next door,girl next door,000N4CJL8IjQ0f2I4grgBO,37i9dQZF1DWUzFXarNiofw,press updated every week check all the playlis...,bershka fashion store music,street music,1
3,,elva,general error,000QWvZpHrBIVrW4dGbaVI,2pEF8IfgpycuKlWTN2LSTq,best gaming music it s gaming time enjoy the b...,gaming music,emkaymusic,0
4,,elva,general error,000QWvZpHrBIVrW4dGbaVI,2pEF8IfgpycuKlWTN2LSTq,,kitchen,sarge malone,0
...,...,...,...,...,...,...,...,...,...
363145,spastic ink,ink complete,mosquito brain surgery,7zzbwY4h8Q6kI1ZX3gB1B3,1Xx9ITzxRrLZgPZIpvyQcS,soittolista quot instrumentti teknisen death m...,nauhaton basso metallissa,kaaos zine,1
363146,pink sweat,heaven,heaven,7zzl5nLmPPkAwk2MRIUpa4,5nn1h2PI3eXe6U5UKCYrQm,terrace space ibiza s house amp trance dance ...,ecstasy of sunny terrace space ibiza,vacho,0
363147,pink sweat,heaven,heaven,7zzl5nLmPPkAwk2MRIUpa4,5nn1h2PI3eXe6U5UKCYrQm,some pop songs for a chill atmosphere,chill pill,vadim kobal,1
363148,angus maude,orange hoodie,orange hoodie,7zznnb4017w7BwU5tMiBi9,5b3VPRubQlA9iM9fyfTnB5,,sanu radio,spotify,0


# Modeling

In [22]:
# split into training, test, and validation sets
X = df_train.drop('score', axis=1)
y = df_train['score']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    shuffle=False,
                                                    random_state=RAND)

X_train_, X_val, y_train_, y_val = train_test_split(X_train,
                                                    y_train,
                                                    test_size=0.16,
                                                    shuffle=False,
                                                    random_state=RAND)  

X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
X_train_.reset_index(drop=True, inplace=True)
X_val.reset_index(drop=True, inplace=True)
y_train_.reset_index(drop=True, inplace=True)
y_val.reset_index(drop=True, inplace=True)

eval_set = [(X_val, y_val)]
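With `shuffle=False` the split is purely positional, so a boundary can fall in the middle of a run of rows that share the same group id, and that group then appears on both sides of the split. The sketch below shows one way to move the boundary to the next group edge. The helper `group_safe_cut` is hypothetical and not part of the notebook, and it assumes that rows of the same group are adjacent.

```python
def group_safe_cut(group_ids: list, test_size: float) -> int:
    """Return a row index near the requested split that does not cut a group.

    Assumes rows of the same group are adjacent (data sorted by group id).
    """
    n = len(group_ids)
    cut = int(round(n * (1 - test_size)))
    cut = min(max(cut, 1), n)  # keep at least one training row
    # move the cut forward until it sits on a group boundary
    while cut < n and group_ids[cut] == group_ids[cut - 1]:
        cut += 1
    return cut

groups = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']
cut = group_safe_cut(groups, test_size=0.3)
train_groups, test_groups = groups[:cut], groups[cut:]
print(cut, set(train_groups) & set(test_groups))  # no group is shared
```

The resulting index could then be used to slice `X` and `y` positionally instead of calling `train_test_split`.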

## Baseline

In [13]:
cat_features = [0, 1, 2, 4, 5]
text_features = [3]

train_gr = X_train_['playlist_id']
val_gr = X_val['playlist_id']

train = Pool(data=X_train_.drop(['playlist_id', 'track_id'], axis=1),
             label=y_train_,
             group_id=train_gr.values,
             cat_features=cat_features,
             text_features=text_features)

val = Pool(data=X_val.drop(['playlist_id', 'track_id'], axis=1),
            label=y_val.values,
            group_id=val_gr.values,
            cat_features=cat_features,
            text_features=text_features)

clf = CatBoostRanker(random_state=RAND,
                     loss_function='YetiRank',
                     eval_metric='NDCG:top=10;type=Exp')

clf.fit(train,
        eval_set=val,
        verbose=False)

<catboost.core.CatBoostRanker at 0x7f8bdae1ad60>
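The integer indices in `cat_features` and `text_features` refer to column positions after `playlist_id` and `track_id` have been dropped. The short sketch below maps them back to column names, assuming the column order shown by `df_train.info()`.

```python
# Column order of X, as listed in df_train.info() (without the score label)
columns = ['author_name', 'album_name', 'track_name', 'track_id',
           'playlist_id', 'description', 'playlist_name', 'owner_name']

# The Pool is built after dropping the two id columns
feature_columns = [c for c in columns if c not in ('playlist_id', 'track_id')]

cat_features = [0, 1, 2, 4, 5]
text_features = [3]

print([feature_columns[i] for i in cat_features])
print([feature_columns[i] for i in text_features])
```

Under that assumption, `description` is the only text feature and the remaining five columns are treated as categorical.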

## With hyperparameters

In [None]:
cat_features = [0, 1, 2, 4, 5]
text_features = [3]

parameters = {'loss_function': 'YetiRank',
              'iterations': 5000,
              'eval_metric': 'NDCG:top=10;type=Exp',
              'metric_period': 50,
              'random_seed': 0,
              'depth': 6,
              'eta': 0.35,
              'use_best_model': True,
              'early_stopping_rounds': 1000}

train_gr = X_train_['playlist_id']
val_gr = X_val['playlist_id']

train = Pool(data=X_train_.drop(['playlist_id', 'track_id'], axis=1),
             label=y_train_,
             group_id=train_gr.values,
             cat_features=cat_features,
             text_features=text_features)

val = Pool(data=X_val.drop(['playlist_id', 'track_id'], axis=1),
           label=y_val,
           group_id=val_gr.values,
           cat_features=cat_features,
           text_features=text_features)

clf_final = CatBoostRanker(**parameters)

clf_final.fit(train,
              eval_set=val)

# Metrics

In [32]:
clf_final = joblib.load('../../model/clf_final.joblib')

cat_features = [0, 1, 2, 4, 5]
text_features = [3]

test = Pool(X_test.drop(['playlist_id', 'track_id'], axis=1), 
            cat_features=cat_features, 
            text_features=text_features)

clf_pred = clf.predict(test)
clf_final_pred = clf_final.predict(test)

## My_function (playlists ranked relative to a track)

In [27]:
def track_sort(data: pd.DataFrame, y_test: pd.Series, y_pred: np.ndarray) -> tuple:
    """
    Builds nested lists grouped by track (rows of the same track must be adjacent)
    :param data: dataframe
    :param y_test: flag indicating whether the track is in the playlist
    :param y_pred: CatBoost predictions
    :return: list of flags, list of CatBoost predictions
    """
    score_lst = []
    pred_lst = []
    lst1, lst2 = [], []

    for i, j, k in zip(range(len(data['track_id'])), y_test, y_pred):
        lst1.append(j)
        lst2.append(k)
        if i == len(data['track_id'])-1 or data['track_id'][i] != data['track_id'][i+1]:
            score_lst.append(lst1)
            pred_lst.append(lst2)
            lst1, lst2 = [], []
            
    return score_lst, pred_lst
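The same grouping of adjacent rows can be sketched with `itertools.groupby`. The function name `group_by_track` and the toy data are illustrative only. Like `track_sort`, it relies on rows of the same track being consecutive.

```python
from itertools import groupby

def group_by_track(track_ids, labels, preds):
    """Split labels and predictions into per-track lists (adjacent rows only)."""
    score_lst, pred_lst = [], []
    rows = zip(track_ids, labels, preds)
    for _, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        score_lst.append([label for _, label, _ in group])
        pred_lst.append([pred for _, _, pred in group])
    return score_lst, pred_lst

track_ids = ['t1', 't1', 't2', 't3', 't3', 't3']
labels = [0, 1, 1, 0, 0, 1]
preds = [0.2, 0.9, 0.4, -1.0, 0.1, 0.7]
print(group_by_track(track_ids, labels, preds))
# ([[0, 1], [1], [0, 0, 1]], [[0.2, 0.9], [0.4], [-1.0, 0.1, 0.7]])
```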

In [28]:
def ndcg_at_k(score: list, pred: list, k: int) -> float:
    """
    Computes the mean NDCG
    :param score: per-track lists of playlist-membership flags
    :param pred: per-track lists of CatBoost predictions
    :param k: rank cutoff for NDCG
    :return: mean NDCG
    """
    ndcg_lst = []
    for i, j in zip(score, pred):
        ndcg = 0
        if len(i) == 1:
            if i[0] == 0:
                ndcg = 0
            else: 
                ndcg = 1
        else:
            ndcg = ndcg_score([i], [j], k=k)
            # only groups with more than one candidate are appended,
            # so single-candidate groups do not affect the mean
            ndcg_lst.append(ndcg)
    ndcg_mean = np.mean(ndcg_lst)
    return ndcg_mean
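For reference, the per-track value that `ndcg_score` computes with its default settings (linear gain, base-2 logarithmic discount) can be written as below. Here $rel_i$ is the true label of the playlist placed at rank $i$ by the predicted scores, $n$ is the number of candidates for the track, and $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of the ideal ordering. With binary labels, the exponential gain $2^{rel_i} - 1$ gives the same values as the linear gain.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{\min(k,\,n)} \frac{rel_i}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```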

In [33]:
score_lst_clf, pred_lst_clf = track_sort(X_test, 
                                         y_test, 
                                         clf_pred)

score_lst_clf_final, pred_lst_clf_final = track_sort(X_test, 
                                                     y_test, 
                                                     clf_final_pred)

In [34]:
ndcg_mean_clf = ndcg_at_k(score_lst_clf, 
                          pred_lst_clf, 10)

ndcg_mean_clf_final = ndcg_at_k(score_lst_clf_final, 
                                pred_lst_clf_final, 10)

## Sklearn.metrics (playlists ranked relative to a track)

In [38]:
ndcg_clf = ndcg_score([y_test], [clf_pred], k=10)
ndcg_clf_final = ndcg_score([y_test], [clf_final_pred], k=10)
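Passing the whole test set as a single list treats every row as a candidate of one query. In that setting NDCG@k only checks whether the k highest-scored rows overall are positives, so it can reach 1.0 even when the ranking within individual tracks is imperfect. This may be why the pooled value differs from the per-track mean. The sketch below illustrates the effect on made-up data with a small pure-Python NDCG (linear gain, log2 discount); all names and numbers are hypothetical.

```python
import math

def ndcg_at_k(labels, scores, k):
    """NDCG@k with linear gain and log2 discount for one list of candidates."""
    def dcg(ordered):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ordered[:k]))
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    by_score = [rel for _, rel in ranked]
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(by_score) / ideal if ideal > 0 else 0.0

# Two toy tracks, each with two candidate playlists: (labels, scores)
track_a = ([0, 1], [2.0, 1.0])  # the wrong playlist is scored higher
track_b = ([1, 0], [5.0, 0.5])  # the right playlist is scored higher

per_track = [ndcg_at_k(labels, scores, k=1) for labels, scores in (track_a, track_b)]
mean_ndcg = sum(per_track) / len(per_track)

# Pool every row into one "query", as ndcg_score([y_test], [pred]) does
pooled_labels = track_a[0] + track_b[0]
pooled_scores = track_a[1] + track_b[1]
pooled_ndcg = ndcg_at_k(pooled_labels, pooled_scores, k=1)

print(mean_ndcg, pooled_ndcg)  # 0.5 1.0
```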

## Metrics table

In [48]:
metrics = pd.DataFrame({'name':['ndcg_mean_clf', 'ndcg_mean_clf_final'], 
                       'ndcg_mean':[ndcg_mean_clf, ndcg_mean_clf_final], 
                        'ndcg':[ndcg_clf, ndcg_clf_final]})
metrics

Unnamed: 0,name,ndcg_mean,ndcg
0,ndcg_mean_clf,0.790902,1.0
1,ndcg_mean_clf_final,0.792355,1.0


**Conclusion:**
Based on these results, I choose clf_final (the model trained with pre-selected hyperparameters) for further use: its mean per-track NDCG@10 is slightly higher (0.792 vs. 0.791).

# Evaluate

As an example of the model's predictions, I use the artist tyla ayra starr and the track girl next door.

In [50]:
df_playlists = pd.read_parquet('../../data/full/playlists_features.parquet')
df_playlists

Unnamed: 0,description,id,playlist_name,owner_name
0,A Playlist with the Best Remixes of Popular So...,6TqkCoZ6mboVBlEkGDFZw9,Best Remixes of Popular Songs 🔥 New Hits 2023 🔥,daveepa
1,A Playlist with the best Techno Remixes of Pop...,5tC6QbgGXsERRpYyjTIMM8,Techno Remixes of Popular Songs 🔥 TikTok Techn...,daveepa
2,Party Music 2023 🔥 This is a playlist with the...,03Uyooblz44mbIFH6gLP5h,Party Music 2023 🔥 Best of EDM & Electro House,daveepa
3,No cap 🧢 only Deep House Bangers. Music by Joh...,4J50w2KWihoa3hn4MXtTjB,Deep House Bangers nobody tells you about 🔥,daveepa
4,No cap 🧢 only Tech House Bangers. Music by FIS...,5v09TzvXZWz4EH50EUUJkC,Tech House Bangers nobody tells you about 🔥,daveepa
...,...,...,...,...
27961,Dein Magazin und Shop für Musik. Täglich aktua...,4WWdQfpWYJR1pW2ge4x5vl,Pretty in Noise Top 200,prettyinnoise.de
27962,,7wGH9IK738EZC53iuaK3Wn,OS24 Druckfrisch,prettyinnoise.de
27963,Am 10. und 11. September 2021 sollen Bands aus...,1Jlg41rPMNWjVIL6l4Dgt3,HELLSEATIC 2021,prettyinnoise.de
27964,Mit der unregelmäßigen Serie der The End of Tr...,1DS83PzxbuWU7wvlzJgUo1,The End of Travelling,prettyinnoise.de


## Preprocessing

In [52]:
def pipeline_preprocess(data: pd.DataFrame) -> pd.DataFrame:
    """
    Data preprocessing pipeline
    :param data: dataset
    :return: dataset
    """
    # data_merge
    author_name = 'tyla ayra starr'
    album_name = 'girl next door'
    track_name = 'girl next door'
    
    data = data_merge(data=data, 
                      author_name=author_name,
                      album_name=album_name,
                      track_name=track_name,)
    
    # clean_text
    data['owner_name'] = data['owner_name'].apply(clean_text)
    data['description'] = data['description'].apply(clean_text)
    data['playlist_name'] = data['playlist_name'].apply(clean_text)
    data['author_name'] = data['author_name'].apply(clean_text)
    data['album_name'] = data['album_name'].apply(clean_text)
    data['track_name'] = data['track_name'].apply(clean_text)

    # data_drop
    drop_columns = ['id']
    data = data_drop(data=data, drop_columns=drop_columns)
    
    return data

In [53]:
%%time
data_proc = pipeline_preprocess(data=df_playlists)
data_proc

CPU times: user 1.26 s, sys: 15.5 ms, total: 1.27 s
Wall time: 1.27 s


Unnamed: 0,author_name,album_name,track_name,description,playlist_name,owner_name
0,tyla ayra starr,girl next door,girl next door,a playlist with the best remixes of popular so...,best remixes of popular songs new hits,daveepa
1,tyla ayra starr,girl next door,girl next door,a playlist with the best techno remixes of pop...,techno remixes of popular songs tiktok techno ...,daveepa
2,tyla ayra starr,girl next door,girl next door,party music this is a playlist with the best o...,party music best of edm electro house,daveepa
3,tyla ayra starr,girl next door,girl next door,no cap only deep house bangers music by john s...,deep house bangers nobody tells you about,daveepa
4,tyla ayra starr,girl next door,girl next door,no cap only tech house bangers music by fisher...,tech house bangers nobody tells you about,daveepa
...,...,...,...,...,...,...
27961,tyla ayra starr,girl next door,girl next door,dein magazin und shop f r musik t glich aktual...,pretty in noise top,prettyinnoise de
27962,tyla ayra starr,girl next door,girl next door,,druckfrisch,prettyinnoise de
27963,tyla ayra starr,girl next door,girl next door,am und september sollen bands aus thrash death...,hellseatic,prettyinnoise de
27964,tyla ayra starr,girl next door,girl next door,mit der unregelm igen serie der the end of tra...,the end of travelling,prettyinnoise de


## Evaluate

In [56]:
def evaluate(data: pd.DataFrame) -> pd.DataFrame:
    """
    Makes predictions
    :param data: dataset
    :return: dataset with predictions
    """
    model = joblib.load('../../model/clf_final.joblib')

    cat_features = [0, 1, 2, 4, 5]
    text_features = [3]

    data_pool = Pool(data=data, 
                     cat_features=cat_features, 
                     text_features=text_features)
    
    data['predict'] = model.predict(data_pool)
    
    return data

In [57]:
data_proc = evaluate(data=data_proc)
data_proc

Unnamed: 0,author_name,album_name,track_name,description,playlist_name,owner_name,predict
0,tyla ayra starr,girl next door,girl next door,a playlist with the best remixes of popular so...,best remixes of popular songs new hits,daveepa,0.894864
1,tyla ayra starr,girl next door,girl next door,a playlist with the best techno remixes of pop...,techno remixes of popular songs tiktok techno ...,daveepa,-1.250476
2,tyla ayra starr,girl next door,girl next door,party music this is a playlist with the best o...,party music best of edm electro house,daveepa,0.308195
3,tyla ayra starr,girl next door,girl next door,no cap only deep house bangers music by john s...,deep house bangers nobody tells you about,daveepa,0.069829
4,tyla ayra starr,girl next door,girl next door,no cap only tech house bangers music by fisher...,tech house bangers nobody tells you about,daveepa,0.885137
...,...,...,...,...,...,...,...
27961,tyla ayra starr,girl next door,girl next door,dein magazin und shop f r musik t glich aktual...,pretty in noise top,prettyinnoise de,-3.154409
27962,tyla ayra starr,girl next door,girl next door,,druckfrisch,prettyinnoise de,-3.180707
27963,tyla ayra starr,girl next door,girl next door,am und september sollen bands aus thrash death...,hellseatic,prettyinnoise de,-4.901734
27964,tyla ayra starr,girl next door,girl next door,mit der unregelm igen serie der the end of tra...,the end of travelling,prettyinnoise de,-4.057409


In [60]:
def pred_playlist(data: pd.DataFrame, top: int) -> list:
    """
    Builds a list of the top relevant playlists
    :param data: dataset
    :param top: number of top relevant playlists
    :return: list
    """
    df_playlists = pd.read_parquet('../../data/full/playlists_features.parquet')
    
    data['name'] = df_playlists['playlist_name']  # original (uncleaned) names, aligned by index
    data.sort_values('predict', ascending=False, inplace=True)
    top_playlist_lst = data['name'].head(top).tolist()
    
    return top_playlist_lst

In [61]:
top_playlist_lst = pred_playlist(data=data_proc, top=10)
top_playlist_lst

['Background Lo-fi',
 '20 Jahre Defected –\xa0selected hits by FAZEmag',
 'Schlagerliebe - Ich find Schlager toll',
 'Jürgen Drews - Die größten Hits - Ich find Schlager toll',
 'Monsoon - Vize, Leony, Niklas Dee - Famous Bootlegs',
 'Escena Latinoamericana',
 'Spannende Hörbücher für den Sommer: Sherlock Holmes /  Edgar Allan Poe / Pater Brown / Edgar Wallace',
 'Melodic Therapy | Mellow Melodies | Soulful Sounds | Chillhop',
 'Liquid Drum & Bass || Chill DnB || Melodic Drum n Bass 2023',
 'Chill Your Mind • Cozy Music • Studying • Reading • yaeow - long way home']
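The top-N selection itself does not need a full sort. A minimal sketch with `heapq.nlargest` is shown below; the function name `top_playlists` and the scores are made up for illustration and are not model output.

```python
import heapq

def top_playlists(names, scores, top):
    """Return the `top` playlist names with the highest predicted scores."""
    best = heapq.nlargest(top, zip(scores, names), key=lambda pair: pair[0])
    return [name for _, name in best]

names = ['lofi beats', 'gaming music', 'chill pill', 'sanu radio']
scores = [0.89, -1.25, 0.31, 0.07]  # made-up scores
print(top_playlists(names, scores, top=2))  # ['lofi beats', 'chill pill']
```

The predicted values are used here only to order the playlists; the sketch does not assume they are probabilities.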

**Conclusion:**
In this project I ran an experiment and built a CatBoostRanker model that predicts the relevance of playlists for a given track and returns a list of the most suitable playlists for it. I therefore consider the project goal achieved.