# Song Recommender Model

### Notes:
1. numerical variables. (can do research and find which features are relevant for model)
2. scale
3. train k means (+pickle)

- create model

user input
- search for track id
- get audio features
- scale
- predict cluster
- recommend a random song from the same cluster
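
The user-input steps above can be sketched roughly as follows. This is only an illustration on a tiny synthetic table: the `songs` frame, the three feature columns, the `recommend` helper and the choice of 4 clusters are all assumptions made for the example, and the lookup of a track id and its audio features is replaced by passing the features in directly.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the song database (all values are made up).
rng = np.random.default_rng(0)
songs = pd.DataFrame({
    "id": [f"track_{i}" for i in range(40)],
    "danceability": rng.random(40),
    "energy": rng.random(40),
    "valence": rng.random(40),
})
feature_cols = ["danceability", "energy", "valence"]

# Fit the scaler and the model once, on the numeric features only.
scaler = StandardScaler().fit(songs[feature_cols])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=1234)
songs["cluster"] = kmeans.fit_predict(scaler.transform(songs[feature_cols]))


def recommend(audio_features, exclude_id=None):
    """Scale the input features, predict their cluster and pick a random song from it.

    Returns (song_id, cluster); song_id is None if the cluster has no other song.
    """
    row = pd.DataFrame([audio_features], columns=feature_cols)
    cluster = kmeans.predict(scaler.transform(row))[0]
    candidates = songs[(songs["cluster"] == cluster) & (songs["id"] != exclude_id)]
    if candidates.empty:
        return None, cluster
    return candidates["id"].sample(n=1, random_state=0).item(), cluster


# Use the first song as the "user input" and ask for a different song like it.
seed_features = songs[feature_cols].iloc[0].to_dict()
recommended_id, seed_cluster = recommend(seed_features, exclude_id="track_0")
print(recommended_id, seed_cluster)
```

The scaler fitted on the training data is reused for the input song, so the input lands in the same feature space as the clustered songs.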

X
X_noid = X.drop(columns=["id"])

kmeans.labels_
X["cluster"] = kmeans.labels_



Tip. Go through whole process first and get quick and dirty model.



To train the model, use only numerical values.
Dropping some numerical features can potentially improve the model (recommended; needs research).


### Which feature to use
For a K-means model that recommends songs based on user input, focusing on Spotify audio features that capture the essence of a song's mood, energy, and danceability can be particularly effective. Consider these features:

- Danceability: Reflects the suitability of a song for dancing, based on tempo, rhythm stability, beat strength, and overall regularity.
- Energy: Measures intensity and activity, capturing the dynamic feel of a song.
- Valence: Indicates the musical positiveness conveyed by a track, which can help in understanding the emotional context.
- Tempo: The speed or pace of a song, which is fundamental in matching songs with a similar vibe.
- Acousticness: Helps in distinguishing between acoustic and more electronic/synthetic music.
- Instrumentalness: Useful for identifying songs with a focus on instrumentation, which might be preferred by users interested in instrumental tracks.
- Liveness: Could be considered if the presence of live audience sounds or a "live" feel is important for similarity.
- Speechiness: Can be useful to filter out tracks with more spoken words, distinguishing between music tracks and those with significant vocal content like podcasts or audiobooks.

These features collectively can capture the essence of what a user might enjoy in a song, beyond just genre or artist. By clustering songs using these features, your K-means model can identify and recommend songs that share similar characteristics with the user's input song, potentially leading to more personalized and satisfying recommendations.

Remember, the choice of features might require iteration; you might start with these and adjust based on the performance of your recommendation system and user feedback.
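
As a rough sketch of how such a feature subset could be applied, the snippet below keeps only the candidate columns that are actually present in a DataFrame. The `CANDIDATE_FEATURES` list, the `select_features` helper and the `demo` frame are hypothetical names and values used only for illustration.

```python
import pandas as pd

# Candidate audio features discussed above (lowercase, as in Spotify's audio features).
CANDIDATE_FEATURES = [
    "danceability", "energy", "valence", "tempo",
    "acousticness", "instrumentalness", "liveness", "speechiness",
]


def select_features(df, candidates=CANDIDATE_FEATURES):
    """Return only the candidate feature columns that exist in df, in candidate order."""
    present = [col for col in candidates if col in df.columns]
    return df[present]


# Made-up example rows; "id" and "key" are not candidates and get dropped.
demo = pd.DataFrame({
    "id": ["a", "b"],
    "danceability": [0.5, 0.7],
    "energy": [0.8, 0.4],
    "tempo": [120.0, 98.0],
    "key": [1, 5],
})
X_demo = select_features(demo)
print(list(X_demo.columns))
```

Keeping the list in one place makes it easier to iterate: removing or adding a feature is a one-line change before re-running the scaling and clustering steps.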

## To do
- Increase the number of clusters: with approx. 20,000 songs there should be a lot of clusters.

## Importing libraries

In [None]:
import numpy as np
import pandas as pd
import pickle
from sklearn import datasets # sklearn comes with some toy datasets to practice
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

## Dataset preparation

In [None]:
song_database_df = pd.read_csv('../data/full_af.csv',index_col=0)
song_database_df = song_database_df.drop(columns = ["mode", "key", "duration_ms", "liveness"])

In [None]:
#getting only numeric features for model
song_database_numeric_noid_df = song_database_df.select_dtypes(include=['number'])

In [None]:
#getting df with numeric features plus id
song_database_numeric_id_df = pd.concat([song_database_df["id"],song_database_numeric_noid_df], axis = 1)

## Scaling features

In [None]:
X = song_database_numeric_noid_df
X

In [None]:
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns = X.columns)

## Clustering with K-Means

## Feature selection

In [None]:
X_scaled_df


In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

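# Fit PCA on the scaled features and project them onto 6 principal components
# (fit_transform is needed: transform alone fails on an unfitted PCA)
pca = PCA(n_components=6)
X_pca = pca.fit_transform(X_scaled)
pd.DataFrame(X_pca)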



In [None]:
# The explained variance ratio for each principal component
explained_variance_ratio = pca.explained_variance_ratio_

# Print explained variance ratio
for i, ratio in enumerate(explained_variance_ratio):
    print(f"Principal Component {i+1}: {ratio:.2f}")

In [None]:
cumulative_explained_variance = explained_variance_ratio.cumsum()

# Print cumulative explained variance
for i, cum_ratio in enumerate(cumulative_explained_variance):
    print(f"Total variance explained by the first {i+1} components: {cum_ratio:.2f}")
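
Instead of fixing the number of components by hand, one option is to let scikit-learn pick it from a variance target: passing a float between 0 and 1 as `n_components` (with `svd_solver="full"`) keeps as many components as are needed to exceed that share of explained variance. The sketch below uses synthetic data, and the 0.90 threshold is an arbitrary assumption, not a value taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 8 correlated columns built from 3 latent factors plus noise.
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 8)) + 0.05 * rng.normal(size=(200, 8))

# A float n_components is read as the share of variance to retain.
pca_demo = PCA(n_components=0.90, svd_solver="full")
X_demo_pca = pca_demo.fit_transform(X_demo)

print("components kept:", pca_demo.n_components_)
print("variance retained:", pca_demo.explained_variance_ratio_.cumsum()[-1])
```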

## Investigating optimum number of clusters

In [None]:
K = range(2, 50)
inertia = []

for k in K:
    print("Training a K-Means model with {} clusters! ".format(k))
    print()
    kmeans = KMeans(n_clusters = k,
                init="k-means++",
                n_init = "auto",
                max_iter= 50,
                algorithm="elkan",
                random_state=1234)
    kmeans.fit(X_pca)
    inertia.append(kmeans.inertia_)

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16,8))
plt.plot(K, inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('inertia')
plt.xticks(np.arange(min(K), max(K)+1, 1.0))
plt.title('Elbow Method showing the optimal k')

#### Silhouette

In [None]:
K = range(7,50)
silhouette = []

for k in K:
    print("Training a K-Means model with {} clusters! ".format(k))
    print()
    kmeans = KMeans(n_clusters=k,
                    init="k-means++",
                    n_init = "auto",
                    max_iter= 50,
                    algorithm="elkan",
                    random_state=1234)
    
    kmeans.fit(X_pca)
    
    silhouette.append(silhouette_score(X_pca, kmeans.predict(X_pca)))


plt.figure(figsize=(16,8))
plt.plot(K, silhouette, 'bx-')
plt.xlabel('k')
plt.ylabel('silhouette score')
plt.xticks(np.arange(min(K), max(K)+1, 1.0))
plt.title('Silhouette Method showing the optimal k')
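
Besides reading the plot, the k with the highest silhouette score can be picked programmatically. This is a minimal sketch on synthetic blobs; the data, the range of k values and the variable names are assumptions for illustration, and the highest score is only one heuristic among several for choosing k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data standing in for the PCA-transformed songs.
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

k_values = list(range(2, 9))
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=1234).fit_predict(X_demo)
    scores.append(silhouette_score(X_demo, labels))

# Index of the best score maps back to the corresponding k.
best_k = k_values[int(np.argmax(scores))]
print("best k by silhouette:", best_k)
```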

In [None]:
from yellowbrick.cluster import SilhouetteVisualizer
import matplotlib.pyplot as plt

# Assuming X_scaled_df is your scaled data prepared for KMeans
model = KMeans(n_clusters=41,  # assumed value, matching the final model below; adjust as needed
               init="k-means++",
               n_init = 5,
               max_iter= 50,
               algorithm="elkan",
               random_state=1234)

# Specify the size of the figure
visualizer = SilhouetteVisualizer(model, colors='yellowbrick', size=(1080, 1100))
visualizer.fit(X_scaled_df)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

## Kmeans model

In [None]:
kmeans = KMeans(n_clusters=41,
               init="k-means++",
               n_init = "auto",
               max_iter= 100,
               algorithm="elkan",
               random_state=1234)
kmeans.fit(X_pca)
print(kmeans.inertia_)

In [None]:
silhouette_score(X_pca, kmeans.predict(X_pca))

### Cluster info

In [None]:
labels = kmeans.labels_
labels

In [None]:
np.unique(labels)

In [None]:
clusters = kmeans.predict(X_pca)
pd.Series(clusters).value_counts().sort_index()

## Adding clusters to dataset

In [None]:
#X_df = pd.DataFrame(X)
X["cluster"] = clusters
song_database_numeric_id_df["cluster"] = clusters
song_database_numeric_id_df

### Testing clusters by ear

In [None]:
from IPython.display import IFrame

def play_song(track_id):
    iframe = IFrame(src="https://open.spotify.com/embed/track/"+track_id,
       width="320",
       height="80",
       frameborder="0",
       allowtransparency="true",
       allow="encrypted-media",
      )
    display(iframe)

In [None]:
#find songs in same cluster
track = song_database_numeric_id_df[song_database_numeric_id_df["cluster"] == 7].sample()
track_id = track["id"].item()
play_song(track_id)

## Exporting models,scalers and dataframes

### Model and scaler

In [None]:
import pickle

def save(model, filename = "filename.pickle"):
    with open(filename, "wb") as f:
        pickle.dump(model, f)

In [None]:
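# The K-Means model was fitted on PCA output, so the fitted PCA is exported as well
# ("model/pca.pickle" is an assumed file name; the model/ folder must already exist)
save(scaler, "model/scaler.pickle")
save(pca, "model/pca.pickle")
save(kmeans, "model/kmeans_4.pickle")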
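
To use the exported objects later, they have to be loaded and applied in the same order as during training: scale, then PCA, then predict. The round trip below is a sketch on synthetic data written to a temporary folder; the file names, the `load` helper and the `*_demo` objects are assumptions for the example and do not touch the real `model/` files.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # synthetic stand-in for the audio features

# Training side: scaler -> PCA -> K-Means, each fitted on the previous step's output.
scaler_demo = StandardScaler().fit(X_demo)
pca_demo = PCA(n_components=3).fit(scaler_demo.transform(X_demo))
kmeans_demo = KMeans(n_clusters=4, n_init=10, random_state=1234).fit(
    pca_demo.transform(scaler_demo.transform(X_demo))
)

# Export all three fitted objects to a temporary folder.
out_dir = Path(tempfile.mkdtemp())
for name, obj in [("scaler", scaler_demo), ("pca", pca_demo), ("kmeans", kmeans_demo)]:
    with open(out_dir / f"{name}.pickle", "wb") as f:
        pickle.dump(obj, f)


def load(path):
    """Load a pickled object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Prediction side: load the objects and apply them in training order.
scaler_loaded = load(out_dir / "scaler.pickle")
pca_loaded = load(out_dir / "pca.pickle")
kmeans_loaded = load(out_dir / "kmeans.pickle")

new_song = X_demo[:1]
cluster_before = kmeans_demo.predict(pca_demo.transform(scaler_demo.transform(new_song)))[0]
cluster_after = kmeans_loaded.predict(pca_loaded.transform(scaler_loaded.transform(new_song)))[0]
print(cluster_before, cluster_after)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, and only from trusted sources.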

### Dataset


In [None]:
song_database_numeric_id_df.to_csv("../data/audio_features_db_df.csv")