# Mock Spotify Song Recommender Model & Explainer
_Author: Jonathan Finger_

## Imports

Here we import pandas, NumPy, and the two scikit-learn tools the model relies on: `CountVectorizer` and `cosine_similarity`.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Helper Functions

Next we define a few helper functions that look up a song's details by row index, or its row index by title.

In [2]:
# These helpers read the global DataFrame `df`, which is created in the Mock Data section.
def title_from_index(index):
    return str(df.loc[index, "Title"])

def artist_from_index(index):
    return str(df.loc[index, "Artist"])

def genre_from_index(index):
    return str(df.loc[index, "Genre"])

def index_from_title(title):
    # Row index of the first song whose title matches exactly
    return int(df.index[df["Title"] == title][0])

## Mock Data

Next we generate some mock data: a small dataset of eight example songs. `Year` is stored as a string so that it can be concatenated with the other fields later.

In [3]:
data = [
    ["You Should Be Sad", "Halsey", "2020", "Pop"], ["Old Town Road", "Lil Nas X", "2019", "Country"],
    ["Godzilla", "Eminem", "2020", "Rap"], ["Piano Sonata No. 14", "Beethoven", "1801", "Classical"],
    ["Moonshine", "Caravan Palace", "2020", "Electroswing"], ["Sucker", "Jonas Brothers", "2019", "Pop"],
    ["ADHD", "Joyner Lucas", "2019", "Rap"], ["The Real Slim Shady", "Eminem", "2000", "Rap"],
]
df = pd.DataFrame(data, columns=['Title', 'Artist', 'Year', 'Genre'])

And we can see the resulting mock dataframe below.

In [4]:
df

| | Title | Artist | Year | Genre |
|---|---|---|---|---|
| 0 | You Should Be Sad | Halsey | 2020 | Pop |
| 1 | Old Town Road | Lil Nas X | 2019 | Country |
| 2 | Godzilla | Eminem | 2020 | Rap |
| 3 | Piano Sonata No. 14 | Beethoven | 1801 | Classical |
| 4 | Moonshine | Caravan Palace | 2020 | Electroswing |
| 5 | Sucker | Jonas Brothers | 2019 | Pop |
| 6 | ADHD | Joyner Lucas | 2019 | Rap |
| 7 | The Real Slim Shady | Eminem | 2000 | Rap |


## Model preparation

First, we define the columns (features) that we think are important.

In [5]:
features = ['Title', 'Artist', 'Year', 'Genre']

For the mock model, I am using something called __cosine similarity__. The idea is to take all of the music metadata (title, artist, year, genre, and of course whatever the real Spotify API has) and mush them together into one string, for example: `Old Town Road Lil Nas X 2019 Country`. Cosine similarity borrows from geometry/trigonometry to gauge how similar two things are. Imagine an analog clock with the hands pointing to 12:15: the hands form roughly a right angle (about 90 degrees, little hand up, big hand to the right). Now imagine the clock at 12:05. The hands make a much smaller angle because the points they indicate (12 o'clock with the little hand, and 5 minutes with the big hand) are closer together. In the model, each song's string becomes a vector of word counts that plays the role of a clock hand: the more words two songs share, the smaller the angle between their vectors.

So, now look at the Eminem songs in the chart. When we smash their values together, they will share `Eminem` and `Rap`, and maybe even the same year if they are on the same album. Because these songs share so many values, the angle between them is small (very similar). The songs by `Beethoven` and `Jonas Brothers`, on the other hand, share nothing, so their angle is large (not similar).
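To make the angle idea concrete, here is a minimal sketch that computes cosine similarity by hand for a few of the combined strings, using only the standard library. The `tokenize` and `cosine` helpers are illustrative names that are not part of the notebook, and the simple whitespace split only approximates what a real tokenizer does.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; a rough stand-in for a real tokenizer
    return text.lower().split()

def cosine(a, b):
    # Count the words in each string, then compute dot(a, b) / (|a| * |b|)
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[word] * cb[word] for word in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

slim_shady = "The Real Slim Shady Eminem 2000 Rap"
godzilla = "Godzilla Eminem 2020 Rap"
sonata = "Piano Sonata No. 14 Beethoven 1801 Classical"

sim_eminem = cosine(slim_shady, godzilla)  # two shared words: "eminem" and "rap"
sim_sonata = cosine(slim_shady, sonata)    # no shared words

print(round(sim_eminem, 3))
print(sim_sonata)
```

A similarity of 0 corresponds to a 90-degree angle (nothing in common), and a similarity of 1 corresponds to a 0-degree angle (identical word counts).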

To start, the function below will create a new column with the smushed string values.

In [6]:
# Create new column that is combination of features
def row_concat(row):
    return row['Title'] + " " + row['Artist'] + " " + row['Year'] + " " + row["Genre"]

In [7]:
#Use new function
df['combined row'] = df.apply(row_concat, axis = 1)

We can see the new smushed column below:

In [8]:
df

| | Title | Artist | Year | Genre | combined row |
|---|---|---|---|---|---|
| 0 | You Should Be Sad | Halsey | 2020 | Pop | You Should Be Sad Halsey 2020 Pop |
| 1 | Old Town Road | Lil Nas X | 2019 | Country | Old Town Road Lil Nas X 2019 Country |
| 2 | Godzilla | Eminem | 2020 | Rap | Godzilla Eminem 2020 Rap |
| 3 | Piano Sonata No. 14 | Beethoven | 1801 | Classical | Piano Sonata No. 14 Beethoven 1801 Classical |
| 4 | Moonshine | Caravan Palace | 2020 | Electroswing | Moonshine Caravan Palace 2020 Electroswing |
| 5 | Sucker | Jonas Brothers | 2019 | Pop | Sucker Jonas Brothers 2019 Pop |
| 6 | ADHD | Joyner Lucas | 2019 | Rap | ADHD Joyner Lucas 2019 Rap |
| 7 | The Real Slim Shady | Eminem | 2000 | Rap | The Real Slim Shady Eminem 2000 Rap |


Below I use `CountVectorizer` to count how many times each word appears in each smushed string, which gives a matrix with one row per song and one column per word. Then I use the `cosine_similarity` function to get a similarity score for every possible pairing of songs.

In [9]:
# Try out CountVectorizer 
cv = CountVectorizer()
count_matrix = cv.fit_transform(df['combined row'])

In [10]:
#Find similarity
cosine_simil = cosine_similarity(count_matrix)

Let's test it out using a classic: `The Real Slim Shady` by `Eminem`. It should pick other Eminem songs first, and maybe other rap songs if it is robust.

In [11]:
user_test_title = "The Real Slim Shady"

In [12]:
#Use helper function to find index (#) of title.
song_index = index_from_title(user_test_title)

In [13]:
#Create list of similar songs
similar_songs =  list(enumerate(cosine_simil[song_index]))

In [14]:
#Sort list by most similar to least
sorted_similar_songs = sorted(similar_songs,key=lambda x:x[1],reverse=True)

In [15]:
# Print up to 10 of the most similar songs, skipping the query song itself
recommendations = [song for song in sorted_similar_songs if song[0] != song_index][:10]
for rank, (idx, _score) in enumerate(recommendations, start=1):
    print(f'No. {rank} "{title_from_index(idx)}" by {artist_from_index(idx)} ({genre_from_index(idx)})')

No. 1 "Godzilla" by Eminem (Rap)
No. 2 "ADHD" by Joyner Lucas (Rap)
No. 3 "You Should Be Sad" by Halsey (Pop)
No. 4 "Old Town Road" by Lil Nas X (Country)
No. 5 "Piano Sonata No. 14" by Beethoven (Classical)
No. 6 "Moonshine" by Caravan Palace (Electroswing)
No. 7 "Sucker" by Jonas Brothers (Pop)


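Both rap songs come first, with the other Eminem song ranked ahead of the other rap song. The remaining five songs share no words with the query, so their similarity is 0 and they simply appear in their original order.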

## Notes

This is only a basic version of a machine learning model. I started with choosing a song title as the seed and giving back similar songs. This was just to try things out.

_Note:_ I remember us talking about choosing genre as our seed. That approach won't work with our current setup, at least not for what I can do in a couple of days. To seed on `genre`, you'd recommend popular songs in that `genre` based on what the user already listened to, or use a collaborative model (users A and B both like The Beatles, so which songs does user B listen to that user A might like?). We would need users (or mock users) with playlists already established, and that sounds a bit ambitious off the bat.
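As a rough illustration of the collaborative idea, the sketch below recommends songs to one user from the playlist of whichever other user overlaps with them the most. Everything here is hypothetical: the `playlists` dictionary, the user names, and the `recommend_from_similar_user` helper are made up for the example, and Jaccard overlap is just one simple choice of user-to-user similarity.

```python
def jaccard(a, b):
    # Share of songs two playlists have in common, between 0 and 1
    return len(a & b) / len(a | b)

def recommend_from_similar_user(user, playlists):
    # Find the other user whose playlist overlaps most with `user`'s,
    # then suggest the songs that user has which `user` does not.
    mine = playlists[user]
    others = [name for name in playlists if name != user]
    closest = max(others, key=lambda name: jaccard(mine, playlists[name]))
    return closest, sorted(playlists[closest] - mine)

# Hypothetical mock users and playlists
playlists = {
    "user_a": {"Godzilla", "The Real Slim Shady"},
    "user_b": {"Godzilla", "The Real Slim Shady", "ADHD"},
    "user_c": {"Sucker", "You Should Be Sad"},
}

closest, suggestions = recommend_from_similar_user("user_a", playlists)
print(closest, suggestions)
```

Even this toy version needs per-user listening data, which is exactly what the current mock dataset lacks.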

I already see some non-ideal situations too. For example, if two titles start with "the", I don't think that should add to similarity. But these are speed bumps to tackle later as I iterate on the model, and especially once I get real data from the Spotify API.
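One possible way to handle the "the" problem, sketched below, is scikit-learn's built-in English stop-word list, enabled with `CountVectorizer(stop_words="english")`. This is only a sketch on two strings, the second of which is made up; whether the built-in list suits real Spotify metadata is an open question, and `pair_similarity` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two strings that share only the word "the"; the second one is made up
rows = [
    "The Real Slim Shady Eminem 2000 Rap",
    "The Placeholder Track Artist 1999 Pop",
]

def pair_similarity(vectorizer):
    # Cosine similarity between the two strings under the given vectorizer
    counts = vectorizer.fit_transform(rows)
    return cosine_similarity(counts)[0][1]

with_the = pair_similarity(CountVectorizer())
without_the = pair_similarity(CountVectorizer(stop_words="english"))

print(round(with_the, 3))     # greater than 0: "the" counts as a shared word
print(round(without_the, 3))  # 0.0 once "the" is filtered out
```

A custom list can also be passed as `stop_words=[...]` if the built-in one removes words that matter for song titles.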