# Spotify KNeighbors Analysis - Artist Recommender 🎧

# Introduction

This notebook tries to find the best alternatives to your favorite musician. The input data is a list of about 30,000 artists from all over the world, described by their most common style features. <br>
In the first half of the notebook (sections 3 and 4) the data is described and visualized. The second half (section 5 onwards) fits a simple KNeighborsRegressor model from scikit-learn to determine the nearest neighbors of the artists in the data set. Finally, a recommender returns the most similar artists for a queried musician.<br>

The recommender is also deployed, so you can try it without the notebook: [The Artist Recommender](https://artist-recommender.herokuapp.com/). The app is built with Streamlit and hosted on Heroku, so longer loading times might occur.

# Imports 📚

In [None]:
# TABLES & VIS
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from ast import literal_eval
import numpy as np

# TEXT NORMALIZATION
import regex as re
import unicodedata

# ML
from sklearn.neighbors import KNeighborsRegressor
from joblib import dump, load
import pickle

# The Dataset 👨‍💻

The data set I am working on is provided on [Kaggle](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data_by_artist_o.csv).<br>
- It has a total of ~29,000 artists and their audio features.
- It is a grouped data set from a bigger set of tracks. The features are aggregated by mean (for numerical) and mode (for categorical values). 

In [None]:
df = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data_by_artist_o.csv')
df.shape

In [None]:
# a glance at the data set
df.sample(5)

In [None]:
df.dtypes
# data types seem to be correctly evaluated

# Provided Features 🎼

## Overview

In [None]:
print(f'The data set has the following features provided \n{list(df.columns)}')

In [None]:
print(f'The data set has {(df.dtypes != "object").sum()} numerical features')

In [None]:
df.describe()

**Note**<br>
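Most audio features are scaled from 0 to 1. The exceptions discussed here are ```loudness```,  ```key``` and ```popularity``` (```tempo```, ```duration_ms``` and ```count``` are not on a 0 to 1 scale either). <br>
- ```loudness``` ranges from -60 to 1.34 and is measured in dB. Lower (more negative) values represent _quieter_ recordings.
- ```key``` ranges from 0 to 11 (pitch classes from the note C to B; B is written as H in German notation)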
- ```popularity``` seems to be in a range of 0 to 100 with a highest rating of 93 points.

## Feature relation

In [None]:
# Pairplot to inspect the correlations between the features
# removing features which don't seem to have any descriptive correlation or are the counterparts of other features
## such as acousticness<>speechiness
pair = sns.pairplot(df.drop(columns=['genres', 
                                     'artists', 
                                     'count', 
                                     'mode', 
                                     'key', 
                                     'duration_ms', 
                                     'instrumentalness',
                                     'acousticness']))
pair

**Note**<br>
We can see a slight positive correlation between the following features over the whole data set:<br>
- ```valence``` vs. ```loudness```
- ```danceability``` vs. ```loudness```
- ```danceability``` vs. ```speechiness```
- ```danceability``` vs. ```tempo```

No single feature determines whether an artist is **popular**, although higher speechiness tends to go along with lower popularity.
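
The popularity observation can be checked numerically by ranking the features by their correlation with ```popularity```. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not taken from the data set. On the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy stand-in for df; the numbers are invented for illustration only
toy = pd.DataFrame({
    'popularity':  [10, 20, 30, 40, 50],
    'speechiness': [0.9, 0.7, 0.5, 0.3, 0.1],
    'loudness':    [-20, -15, -12, -8, -5],
    'artists':     ['a', 'b', 'c', 'd', 'e'],
})

# Pearson correlation of every numeric feature with popularity,
# sorted from most negative to most positive
pop_corr = (toy.select_dtypes('number')
               .corr()['popularity']
               .drop('popularity')
               .sort_values())
print(pop_corr)
```

Features at the top of the printed series move against popularity, features at the bottom move with it.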

In [None]:
# Another way of analysing correlations - The Heatmap
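# restrict to numeric columns; string columns such as 'artists' and 'genres' cannot be correlated
corr = df.select_dtypes('number').corr()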
plt.figure(figsize=(10,7))
heatmap = sns.heatmap(corr, cmap='coolwarm')

## Feature distributions

In [None]:
# Plotting distributions
df.hist(bins=50, figsize=(25,15))
plt.show()

**Note**<br>
- ```liveness``` is skewed to the right => most artists show little liveness
- ```loudness``` is skewed to the left => most artists are relatively loud, with a tail of quieter ones
- ```key``` 'G' is used the most among the given artists (see the further analysis below)
- ```instrumentalness``` is close to 0 for a large share of artists, which would suggest many non-instrumental artists (e.g. audio books).
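
The direction of a skew read off a histogram can be double-checked with the sample skewness: a positive value means a long tail to the right, a negative value a long tail to the left. A minimal sketch, assuming invented toy values rather than the real columns:

```python
import pandas as pd

# Invented values that mimic the two shapes discussed above
toy = pd.DataFrame({
    'liveness': [0.1, 0.1, 0.1, 0.2, 0.9],   # long tail to the right
    'loudness': [-40, -10, -9, -8, -8],      # long tail to the left
})

# Sample skewness per column: > 0 right-skewed, < 0 left-skewed
skew = toy.skew()
print(skew)
```

On the real data, `df.select_dtypes('number').skew()` would give the same measure for every numeric feature.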

... _let's analyze the keys and modes a bit more_

In [None]:
# Joining the key and mode together to have them plotted separately
key_mode = pd.DataFrame()
semi = ['C', 'Csharp', 'D', 'Dsharp', 'E', 'F', 'Fsharp', 'G', 'Gsharp', 'A', 'Asharp', 'H', 'C']
minmaj = {0.0:'min', 1.0:'maj'}

In [None]:
key_mode['mode_str'] = df['mode'].replace(minmaj)
key_mode['key_str'] =  df['key'].apply(lambda x: semi[x])
key_mode['key_mode'] = key_mode['key_str'] + '_' + key_mode['mode_str']

In [None]:
plt.figure(figsize=(14,7))
sns.histplot(key_mode['key_mode'].sort_values(), kde=True)
plt.xticks(rotation=90)
plt.title('Distribution of Keys', size=15)
plt.show()

**Note**<br>
```Gsharp_maj``` appears to be the most common key by far among the artists. This may be due to its harmony and easy playability (e.g. on a guitar 🎸).

In [None]:
# unpacking all the genres in the lists of genres
genres = pd.DataFrame()
genres['genres'] = df.genres.apply(literal_eval)
avail_genres = []
for lst in genres.genres:
    # spaces become underscores so the genre names can later serve as column names
    avail_genres.extend(x.replace(' ', '_') for x in lst)

In [None]:
# getting the count per genre
from collections import Counter
counts = Counter(avail_genres)  # one pass over the list instead of list.count() per element
genres_ = pd.DataFrame.from_dict(counts, 
                                 orient='index', 
                                 columns=['count']).sort_values(by='count', 
                                                                ascending=False)

In [None]:
# plotting the 50 most common genres
genres_.head(50).plot(kind='bar', figsize=(15,7))
plt.xticks(rotation=90)
plt.show()

In [None]:
print(f'The 5 most common genres among the artists are \n{list(genres_.head(5).index)}')
print(f'\nIn total there are \033[1m{genres_.shape[0]} available genres\033[0m')

## Missing Values

In [None]:
# checking for NaNs
print(f'There are {sum(df.isnull().sum())} NaN values in the data set')

In [None]:
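# Inspecting genres
# the genres are stored as strings such as "[]", so they are parsed before their length is measured
n_empty = (df["genres"].apply(literal_eval).str.len() == 0).sum()
print(f'However, there are {n_empty} rows with an empty genre list')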

In [None]:
# Inspecting instrumentalness
df[df['instrumentalness']==0]['speechiness'].hist(figsize=(9,4), bins=20)
plt.title('Speechiness at instrumentalness == 0')
print(f'A total of {df[df["instrumentalness"]==0].shape[0]} entries have 0 instrumentalness')
plt.show()

**Note**<br>
The assumption from [section 4.3](#Feature-distributions) that audio books explain the low instrumentalness could not be verified. I suspect that these zeros are missing values instead.

In [None]:
# inspecting popularity
print(f'A total of {df[df["popularity"]==0].shape[0]} entries have 0 popularity')

**Summary**<br>
Missing values in the features ```genres```, ```instrumentalness``` and ```popularity``` should either be imputed or dropped from the data set to avoid an unbalanced and biased model.
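
The three checks above can be collected in one small table of "implicit" missing values, i.e. entries that are not NaN but carry no information. This is a minimal sketch on an invented frame; `toy` and `implicit_missing` are hypothetical names, and treating zeros as missing is the assumption made in this notebook, not a property documented for the data set.

```python
import pandas as pd
from ast import literal_eval

# Invented rows that mimic how the columns are stored in the CSV
toy = pd.DataFrame({
    'genres': ["['rock']", "[]", "['pop', 'dance pop']", "[]"],
    'instrumentalness': [0.0, 0.3, 0.0, 0.0],
    'popularity': [55, 0, 12, 0],
})

implicit_missing = pd.Series({
    # genres are stringified lists, so parse them before measuring the length
    'genres': (toy['genres'].apply(literal_eval).str.len() == 0).sum(),
    # zeros are assumed to stand for "not available"
    'instrumentalness': (toy['instrumentalness'] == 0).sum(),
    'popularity': (toy['popularity'] == 0).sum(),
})
print(implicit_missing)
```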

# Preprocessing

## Genres

Looking at the ```genres``` feature, I see potential to use it in the model. However, the genres are stored as stringified lists and therefore have to be encoded in separate columns.
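
To illustrate what such an encoding looks like, here is a minimal sketch on three invented artists. The names `toy`, `top` and `encoded` are hypothetical. Each genre string is parsed into a real list first, so that membership is tested per genre and `'pop'` does not accidentally match inside `'dance pop'`.

```python
import pandas as pd
from ast import literal_eval

# Invented artists; genres are stored as stringified lists, as in the CSV
toy = pd.DataFrame({
    'artists': ['A', 'B', 'C'],
    'genres': ["['rock', 'dance pop']", "['pop']", "[]"],
})
top = ['rock', 'pop', 'dance_pop']  # assumed list of genres to encode

# Parse the strings and apply the underscore naming used for the columns
parsed = toy['genres'].apply(
    lambda s: [g.replace(' ', '_') for g in literal_eval(s)])

# One 0/1 column per genre (multi-hot encoding)
encoded = pd.DataFrame(
    {g: parsed.apply(lambda lst, g=g: int(g in lst)) for g in top})
print(pd.concat([toy['artists'], encoded], axis=1))
```

An artist with several genres gets several ones in its row, and an artist without genres gets only zeros.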

In [None]:
# Copying to have easy access to original values
df_enc = df.copy()

In [None]:
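# populating the top 50 genres as OHE in df_enc
# artists without any top 50 genre will be assigned to 'rock'
top_50 = genres_.head(50).index
def OHE_genres(df=df_enc, top_50=top_50, fillna='rock'):
    df_cop = df.copy()
    # parse the stringified lists and use the same '_' naming as in top_50,
    # so that genres are matched exactly instead of by substring
    parsed = df_cop['genres'].apply(
        lambda x: {g.replace(' ', '_') for g in literal_eval(x)})
    for g in top_50:
        df_cop[g] = parsed.apply(lambda s: int(g in s))
    # only rows without any top 50 genre fall back to the fillna genre
    no_genre = df_cop[top_50].sum(axis=1) == 0
    if fillna not in df_cop.columns:
        df_cop[fillna] = 0
    df_cop.loc[no_genre, fillna] = 1

    return df_cop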

In [None]:
df_enc = OHE_genres()

In [None]:
pd.set_option('display.max_columns', None)
df_enc.head(5)

## Popularity

Although we have a lot of missing ```popularity``` entries, I would like to keep the feature in the data set. The recommender should have the ability to recommend artists which are similarly popular.

In [None]:
# replacing with the median (distribution is quite normal)
median_pop = df_enc[df_enc['popularity'] > 0].popularity.median()
df_enc['popularity'] = df_enc.popularity.replace(0.0, median_pop)

In [None]:
df_enc.popularity.min()

## Instrumentalness

About 65% of the ```instrumentalness``` values are close to 0 (<0.1). The feature will therefore be dropped, since any imputation attempt would lead to a biased model.

In [None]:
df_enc = df_enc.drop(columns='instrumentalness')

In [None]:
df_enc.head(3)

## Target

**Note**<br>
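For the artist recommender we only want to find out how close the artists are to each other, so we don't need a real ```target```. A constant dummy column is added only because fitting a KNeighborsRegressor expects a target argument.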
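
Since no target is involved, scikit-learn's unsupervised `NearestNeighbors` class could serve as an alternative that needs no dummy column. This is a minimal sketch on invented 2D points, not the model used in this notebook; `X_toy` is a hypothetical stand-in for the feature matrix.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four invented points forming two tight pairs
X_toy = np.array([[0.0, 0.0],
                  [0.1, 0.0],
                  [1.0, 1.0],
                  [0.9, 1.0]])

# fit() takes only the features, no target is required
nn = NearestNeighbors(n_neighbors=2).fit(X_toy)

# distances and positional indices of the 2 closest points to the first row
dist, idx = nn.kneighbors(X_toy[[0]])
print(idx[0], dist[0])
```

The first returned neighbor is the queried point itself (distance 0), which is why a recommender has to skip it.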

In [None]:
df_enc['target'] = 0

In [None]:
# saving for the app
# df_enc.to_csv('../raw_data/preprocessed.csv')

In [None]:
df_enc.drop(columns=['artists', 'genres', 'target']).shape

# Model

Let's use a very simple k-nearest-neighbors model to find the artists that lie close to each other, measured by the given input features.

In [None]:
# Define X and y
X = df_enc.drop(columns=['artists', 'genres', 'target']) # Remove non numerical features
y = df_enc['target']

knn_model = KNeighborsRegressor().fit(X, y) # Instantiate and train model
# save the model for production
dump(knn_model, '../model.joblib')
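
One design point worth considering: the model uses a Euclidean distance by default, so features with a large numeric range can dominate the neighbor search over features scaled from 0 to 1. A possible remedy, not applied in this notebook, is to standardize the features first. The sketch below uses invented rows with two hypothetical columns (tempo in BPM and danceability) to show how the nearest neighbor can change.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Invented rows: [tempo in BPM, danceability in 0..1]
X_toy = np.array([[120.0, 0.90],
                  [121.0, 0.10],
                  [130.0, 0.88],
                  [ 60.0, 0.50]])

# Raw features: differences in tempo dominate the distance
_, raw_idx = NearestNeighbors(n_neighbors=2).fit(X_toy).kneighbors(X_toy[[0]])

# Standardized features: both columns contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X_toy)
_, scaled_idx = (NearestNeighbors(n_neighbors=2)
                 .fit(X_scaled)
                 .kneighbors(X_scaled[[0]]))

# index of the nearest *other* row, before and after scaling
print(raw_idx[0][1], scaled_idx[0][1])
```

Without scaling, the first row is matched to the row with almost the same tempo but very different danceability; after scaling, it is matched to the row that is similar in both features.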

# Recommender

In [None]:
def normalize(inpt):
    """
    removing punctuation, collapsing and trimming whitespace and replacing special chars with their nearest ASCII relatives.
    inpt: a string you wish to normalize.
    returns: the normalized string.
    """
    reg = re.sub(r'[^\w\s]', ' ', inpt.lower())
    white = re.sub(r'\s+', ' ', reg).strip()
    uni = unicodedata.normalize('NFKD', white).encode('ascii','ignore').decode('utf8')
    return uni

In [None]:
def finder(artist, data=df_enc):
    """
    will find your favourite artist in a given dataframe under the column name 'artists'.
    Pass the artist name (str) and the data frame (pd.DataFrame).
    returns: a pd.DataFrame containing the row of your searched artist
    """
    df_cop = data.copy()
    df_cop['artists'] = df_cop.artists.apply(lambda x: normalize(x))
    artist = normalize(artist)
    print(f'looking for normalized term {artist}')
    try:
        # try to find the whole search term in artists via bool indexing
        ix = df_cop.artists[df_cop.artists == artist].index[0]
        print('found via bool indexing')
        return df_cop.loc[[ix]]  # one-row DataFrame that keeps the column dtypes
    except IndexError:
        # no exact match was found: tokenize search term
        search = artist.split(' ')
        # get index of searched term
        ixs = []
        for i, row in df_cop.iterrows():
            splt = row['artists'].lower().split(' ')
            found = [x in splt for x in search]
            if sum(found) >= len(found)/2:
                ixs.append(i)
        if len(ixs) > 0:
            print('found alternatives via tokenized search')
            print('did you mean...')
            return list(df_cop.artists[ixs])
        else:
            print('could not find your artist. Please refine your search.')
            return 'please refine your search'

In [None]:
def recommend_artist(artist, model=knn_model, df=df_enc, neighbors=3):
    """
    will find the nearest neighbors of the desired artist.
    pass the artist's name, the fitted model and the pd.DataFrame suiting the model.
    returns a list of recommended artists similar to the input artist.
    """
    inpt = finder(artist, data=df)
    if isinstance(inpt, pd.DataFrame):
        features = inpt.drop(columns=['artists', 'genres', 'target'])
        # ask for one extra neighbor, since the closest point is normally the queried artist itself
        nearest = model.kneighbors(features, n_neighbors=neighbors+1)
        indexes = list(nearest[1][0])  # positional indices of the neighbors
        return list(df.artists.iloc[indexes[1:]])
    else:
        print(inpt)

## Try it out

In [None]:
artist = input('Type your favourite artist:')

In [None]:
recommend_artist(artist)

**Note**<br>
The **recommender** is insensitive to case, special characters and punctuation and therefore quite stable for a small project like this. However, it only recognizes fully written words and does not auto-complete partial ones (like 'Emine' or 'beatle'). Try it out yourself and have fun! 🥁
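
Partial names could be handled with fuzzy string matching, for example with `difflib.get_close_matches` from the Python standard library. This is a minimal sketch and not part of the recommender above; `suggest`, the candidate list and the `cutoff` value are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical list of already normalized artist names
names = ['eminem', 'the beatles', 'beyonce', 'metallica']

def suggest(query, candidates, n=3, cutoff=0.6):
    """Return up to n candidates that are similar to the query, best match first."""
    return get_close_matches(query.lower(), candidates, n=n, cutoff=cutoff)

print(suggest('Emine', names))
print(suggest('beatle', names))
```

A lower `cutoff` returns more (and looser) suggestions, a higher one only near-exact matches.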

**Try it out online**<br>
[THE SPOTIFY RECOMMENDER APP](https://artist-recommender.herokuapp.com/)