In [None]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

### Data Loading

In [None]:
data = pd.read_csv("../input/spotify-dataset-19212020-160k-tracks/data.csv")

Checking if data loaded correctly.

In [None]:
data.head()

Checking data's shape.

In [None]:
data.shape

### Data Preprocessing

Let's check if our data has any missing values.

In [None]:
data.isna().sum()

There is no missing data, which is fine for us!

### Exploratory Data Analysis

Now we will visualize the data to understand how music has changed over the century.

In [None]:
viz_data = data.drop(columns=['id', 'name', 'artists', 'release_date', 'year'])

plt.figure(figsize=(50, 50))
# subplot indices are 1-based; enumerate from 1 so the first column is plotted too
for i, col in enumerate(tqdm(viz_data.columns), start=1):
    # the 7x2 grid assumes at most 14 feature columns
    plt.subplot(7, 2, i)
    sns.barplot(x=data.year, y=viz_data[col])
    plt.xticks(rotation=45)
plt.show()

As the plots show, music has become more energetic and more popular over time. Loudness decreased and duration increased.

#### Correlation

Let's check whether any features correlate with each other.

#### Pearson Correlation

In [None]:
plt.subplots(figsize=(12, 8))
sns.heatmap(viz_data.corr(), annot=True, square=True)
plt.show()

As we can see, loudness correlates strongly with energy and acousticness, and acousticness correlates with energy.
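
A heatmap can be hard to read when there are many features, so it may help to list the strongest pairs directly. The following is a minimal sketch on made-up toy values, not on the Spotify data; `top_correlated_pairs` is a hypothetical helper name introduced here for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # rank by absolute value but keep the signed correlation in the result
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# toy values standing in for viz_data (illustrative only)
toy = pd.DataFrame({
    "energy":       [0.1, 0.4, 0.6, 0.9],
    "loudness":     [-20.0, -12.0, -8.0, -3.0],
    "acousticness": [0.9, 0.7, 0.3, 0.1],
})
top = top_correlated_pairs(toy, n=2)
print(top)
```

On the real data one would call it as `top_correlated_pairs(viz_data)`; the sign of each value shows whether the relationship is positive or negative, which the word "correlates" alone does not say.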

### Feature Engineering

In [None]:
data.head(2)

Since we will use the data to compute distances between songs, and the features have very different ranges, we will create a function that normalizes a column to the [0, 1] range.

In [None]:
def normalize_column(col):
    # min-max scaling: maps the column to [0, 1] and overwrites it in the global `data`
    max_d = data[col].max()
    min_d = data[col].min()
    data[col] = (data[col] - min_d)/(max_d - min_d)
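
For reference, min-max scaling computes `(x - min) / (max - min)` for each value. Below is a small sketch of a variant that returns a new Series instead of modifying a global frame and that guards against constant columns, where `max - min` would be zero. The name `min_max_normalize` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def min_max_normalize(series):
    """Scale a numeric Series to [0, 1]; a constant Series maps to all zeros."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # avoid division by zero when every value is identical
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - lo) / (hi - lo)

# toy frame (illustrative values only)
toy = pd.DataFrame({"tempo": [60.0, 120.0, 180.0], "mode": [1, 1, 1]})
normalized = toy.apply(min_max_normalize)  # applied column by column
print(normalized)
```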

Now, let's get all the numerical columns and normalize them.

In [None]:
num_types = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# `num` is a separate frame: the loop below normalizes `data`, while `num` keeps the raw values
num = data.select_dtypes(include=num_types)

for col in num.columns:
    normalize_column(col)

Let's check that the data was transformed correctly.

In [None]:
data.head(3)

Seems like everything worked fine.

Songs from different genres can have quite similar characteristics, which is a problem for a distance-based recommender.

For example, Nicki Minaj songs won't be an accurate recommendation for Slayer songs.

That's why we will create a new feature that separates songs from different groups.

We will use KMeans clustering with 10 clusters for this.

In [None]:
from sklearn.cluster import KMeans

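# cluster labels can differ between runs unless random_state is set
km = KMeans(n_clusters=10)
# cluster on the normalized columns of `data`; `num` still holds the raw values
cat = km.fit_predict(data[num.columns])
data['cat'] = cat
# scale the cluster label to [0, 1] like the other features
normalize_column('cat')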
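
The notebook fixes the number of clusters at 10 without justifying it. One common way to sanity-check a cluster count is to look at how KMeans inertia falls as `n_clusters` grows (the "elbow" heuristic). The following is a minimal sketch on synthetic blobs rather than the Spotify features; the data, the range of `k`, and `random_state=0` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# synthetic stand-in for normalized features: three well-separated blobs in 2-D
centers = np.array([[0.1, 0.1], [0.5, 0.9], [0.9, 0.2]])
points = np.vstack([c + 0.03 * rng.standard_normal((50, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    km_k = KMeans(n_clusters=k, n_init=10, random_state=0)
    km_k.fit(points)
    # inertia_ is the sum of squared distances of samples to their closest centroid
    inertias[k] = km_k.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

On data like this the inertia typically drops sharply up to the true number of blobs and flattens afterwards; on real features the bend is usually less clear, so the result is a hint rather than a rule.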

Let's check the result.

In [None]:
data.cat[:10]

Seems like everything is fine, let's move on.

### Recommendation System

Our data has numeric features such as acousticness, danceability, and energy. We will use them to find the songs most similar to a given one.

In [None]:
caya=data[data.name=='Come As You Are']
caya.head(3)

As there can be several versions of the same song (see the example above), we will always use a single one: the first match in the data, which is the oldest version if the rows are ordered by year.

Now, let's create a class that will make recommendations for our songs.

To measure how different two songs are, we will compute the Manhattan distance between the chosen song and every other song.

The songs with the smallest distances are the recommendations.
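
If the row order cannot be relied on, the oldest version could be selected explicitly by sorting the matches by `year`. This is a minimal sketch on made-up toy rows; `oldest_version` is a hypothetical helper, not part of the notebook, and the years are illustrative values only.

```python
import pandas as pd

def oldest_version(df, song_name):
    """Return the earliest row whose name matches song_name (case-insensitive)."""
    matches = df[df["name"].str.lower() == song_name.lower()]
    if matches.empty:
        raise KeyError(f"no track named {song_name!r}")
    # sort by year so that head(1) really is the oldest release
    return matches.sort_values("year").head(1)

# toy rows (illustrative values only)
toy = pd.DataFrame({
    "name": ["Come As You Are", "Other Song", "Come As You Are"],
    "year": [1994, 2000, 1991],
})
oldest = oldest_version(toy, "come as you are")
print(oldest)
```

The same approach works after normalization, since min-max scaling preserves the ordering of `year`.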

In [None]:
class SpotifyRecommender():
    def __init__(self, rec_data):
        # the data the recommender works with
        self.rec_data_ = rec_data

    # replace the data if needed
    def change_data(self, rec_data):
        self.rec_data_ = rec_data

    # returns the `amount` songs closest to `song_name`
    def get_recommendations(self, song_name, amount=1):
        distances = []
        is_match = self.rec_data_.name.str.lower() == song_name.lower()
        if not is_match.any():
            raise ValueError(f"Song not found: {song_name}")
        # the first matching row is the reference song
        song = self.rec_data_[is_match].head(1).values[0]
        # drop every version of the reference song; copy so a column can be added safely
        res_data = self.rec_data_[~is_match].copy()
        for r_song in tqdm(res_data.values):
            dist = 0
            for col in np.arange(len(res_data.columns)):
                # hard-coded positions of the columns excluded from the distance;
                # they depend on the column order of this dataset
                if col not in [1, 6, 12, 14, 18]:
                    # Manhattan distance: sum of absolute differences per feature
                    dist = dist + np.absolute(float(song[col]) - float(r_song[col]))
            distances.append(dist)
        res_data['distance'] = distances
        # sort ascending by distance
        res_data = res_data.sort_values('distance')
        columns = ['artists', 'name']
        return res_data[columns][:amount]
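
The nested Python loops in `get_recommendations` visit every cell of every row, which can be slow on a dataset of this size. As an alternative sketch (not the notebook's implementation), the same Manhattan distance can be computed with NumPy broadcasting. The function name `manhattan_recommend`, the explicit `feature_cols` argument, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def manhattan_recommend(df, song_name, feature_cols, amount=1):
    """Rank tracks by L1 distance to the first row matching song_name."""
    is_target = df["name"].str.lower() == song_name.lower()
    if not is_target.any():
        raise KeyError(f"no track named {song_name!r}")
    target = df.loc[is_target, feature_cols].iloc[0].to_numpy(dtype=float)
    candidates = df.loc[~is_target].copy()
    # broadcasting subtracts the target vector from every candidate row at once
    diffs = candidates[feature_cols].to_numpy(dtype=float) - target
    candidates["distance"] = np.abs(diffs).sum(axis=1)
    return candidates.sort_values("distance")[["artists", "name"]].head(amount)

# toy tracks (illustrative values only)
toy = pd.DataFrame({
    "artists": ["A", "B", "C", "D"],
    "name": ["Song 1", "Song 2", "Song 3", "Song 4"],
    "energy": [0.9, 0.8, 0.1, 0.85],
    "acousticness": [0.1, 0.2, 0.9, 0.1],
})
result = manhattan_recommend(toy, "song 1", ["energy", "acousticness"], amount=2)
print(result)
```

Passing the feature columns by name also avoids relying on hard-coded column positions, which silently break if the column order changes.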

Let's create an instance of our SpotifyRecommender.

In [None]:
recommender = SpotifyRecommender(data)

#### Nirvana - Come As You Are

In [None]:
recommender.get_recommendations('come as you are', 5)

The results look reasonable: all the recommended songs have a similar genre and sound.

Let's test our function on other songs.

#### Motörhead - Ace Of Spades 

In [None]:
recommender.get_recommendations('ace of spades', 5)

#### 50 Cent - In Da Club

In [None]:
recommender.get_recommendations('in da club', 5)

#### Lil Skies - Red Roses (feat. Landon Cube)

In [None]:
recommender.get_recommendations('Red Roses (feat. Landon Cube)', 5)

The recommender seems to work well and gives plausible recommendations for the songs we tested.

### That's all. Thank you for reading this notebook, you can upvote it, if you find it useful!


### Good luck!