# Moosic Baseline Model :: Iteration v1


* combining datasets as done prior [data preprocessing][data management]
* train test split [modelling]
* baseline model [discussion needed]
* baseline model sketch and implementation


## Importing required libraries




In [None]:
# IMPORT LIBRARIES


try:

    import numpy as np
    import pandas as pd

    # databases - sql
    #from dotenv import dotenv_values
    #import sqlalchemy

    # visualisation
    import seaborn as sns
    import matplotlib.pyplot as plt

    # modelling - evaluation metrics
    from sklearn.model_selection import train_test_split
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    from sklearn.metrics import euclidean_distances
    from sklearn.metrics.pairwise import sigmoid_kernel
    from sklearn.metrics.pairwise import cosine_similarity


    from sklearn.pipeline import Pipeline
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA


except ImportError as error:
    print(f"Installation of the required dependencies necessary! {error}")

    %pip install numpy
    %pip install pandas
    #%pip install dotenv
    #%pip install sqlalchemy
    %pip install seaborn
    %pip install matplotlib
    %pip install scikit-learn

    print(f"Successful installation of the required dependencies necessary")


import warnings
warnings.filterwarnings('ignore')



## Loading the data

In [None]:
df_artists = pd.read_csv('../data/moosic-raw/spotify_600k_artists.csv', low_memory=False)
df_artists.head().T

In [None]:
df_tracks = pd.read_csv('../data/moosic-raw/spotify_600k_tracks.csv', low_memory=False)
df_tracks.head().T

## Data Overview Artists

| column | additional information |
|--------|------------------------|
| id | id of artist |
| followers | number of followers | 
| genres | genres associated with artist |
| name | name of artist |
| popularity | popularity of artist in range 0 to 100 |

## Data Overview Tracks

| column | additional information |
|--------|------------------------|
| id | id of track |
| name | name of track | 
| popularity | popularity of track in range 0 to 100 |
| duration_ms | duration of songs in ms |
| explicit | whether it contains explicit content or not |
| artists | artists who created the track | 
| id_artists | id of artists who created the track |
| release_date | date of release |
| danceability | how danceable a song is in range 0 to 1 |
| energy | how energized a song is in range 0 to 1 |
| key | The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1 |
| loudness | The overall loudness of a track in decibels (dB) |
| mode |  Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0 |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic |
| instrumentalness | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry) |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration | 
| time_signature | An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of 3/4 to 7/4 |

In [None]:
df_artists.info()

&rarr; Some missing values in columns _followers_ and _name_!

In [None]:
df_tracks.info()

&rarr; Some missing values in column _name_!

In [None]:
df_artists.nunique()

In [None]:
df_tracks.nunique()

In [None]:
# get shape of the tracks dataframe

print(f"Track data: There are {df_tracks.shape[0]} observations and {df_tracks.shape[1]} feature variables ")
print('----------'*10)

df_tracks.shape

In [None]:
# check for number of null values in each columns

df_tracks.isnull().sum()

The tracks CSV contains all the data we need, so from here on we work only with the tracks data.
Before modelling, we split the data into a training and a test set to avoid data leakage.

In [None]:
# Drop rows with NaNs in column name (the only column with missing values)

df_tracks = df_tracks.dropna(subset=['name'])

In [None]:
# re-check for number of null values in each columns

df_tracks.isnull().sum()

In [None]:
# check for duplicates

df_tracks.duplicated()

In [None]:
df_tracks.head()

We have duplicated track names. Are these different tracks that share a name but come from different artists, or do we have duplicate entries for the same track?

In [None]:
# Check for completely identical rows
identical_rows = df_tracks[df_tracks.duplicated(keep=False)]

# Print completely identical rows
print("Completely identical rows:")
print(identical_rows)

In [None]:
# get count of duplicated values in tracks dataframe

display(df_tracks.duplicated().value_counts())
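
One way to separate the two cases is to count duplicates over a subset of columns instead of over whole rows. The following is a minimal sketch on an invented toy frame; only the column names `name` and `artists` come from the tracks data, and the rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_tracks; the rows are invented for illustration only.
toy_tracks = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "name": ["Intro", "Intro", "Intro", "Sunrise"],
    "artists": ["['Artist A']", "['Artist A']", "['Artist B']", "['Artist C']"],
})

# Rows that share a track name, regardless of the artist
same_name = toy_tracks.duplicated(subset=["name"], keep=False)

# Rows that share both name and artists: candidates for true duplicates
same_name_and_artist = toy_tracks.duplicated(subset=["name", "artists"], keep=False)

print("rows sharing a name:", int(same_name.sum()))
print("rows sharing name and artists:", int(same_name_and_artist.sum()))
```

If the second count is much smaller than the first, most repeated names belong to different artists rather than to duplicated entries.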

In [None]:
# show top 5 rows of data (transposed)

df_tracks.head().T

In [None]:
# Train-test Split

# Defining X and y
features = df_tracks.columns.tolist()
features.remove('name')

X = df_tracks[features]
y = df_tracks['name']

print(X.shape)
print(y.shape)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)  # , stratify=y)

# Check the shape of the data sets
print("X_train:", X_train.shape)
print("y_train:", y_train.shape)
print("X_test:", X_test.shape)
print("y_test:", y_test.shape)

When we use 

 ```
stratify=y 
```

we get `ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.`

This follows from how stratification works. With `stratify=y`, the split tries to preserve the proportion of every class of `y` in both the training and the test set, which requires at least two members per class. Our `y` is the track name, and many track names occur only once, so these classes cannot be distributed over both sets.

Stratifying on a (nearly) unique label is therefore not meaningful here, and we keep the random, non-stratified split.
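
The behaviour can be reproduced on a tiny example. This is a sketch under the assumption that every label is unique, as is the case for most track names; the toy values are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: every "name" occurs exactly once, like most track names
toy_X = pd.DataFrame({"valence": [0.1, 0.4, 0.6, 0.9]})
toy_y = pd.Series(["Song A", "Song B", "Song C", "Song D"], name="name")

# A stratified split needs at least two members per class and therefore fails
stratify_failed = False
try:
    train_test_split(toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)
except ValueError as error:
    stratify_failed = True
    print(f"Stratified split failed: {error}")

# The plain random split works
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, shuffle=True
)
print("train:", X_tr.shape, "test:", X_te.shape)
```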


---

# BASELINE MODELS SKETCH AND IMPLEMENTATION

------------------------------------

> We have 586672 observations and 20 feature variables in our current (initial) dataset

```python

all_features =  ['id', 'name',
    'popularity', 'duration_ms', 'explicit',
    'artists', 'id_artists','release_date',
    'danceability','energy', 'key',
    'loudness', 'mode', 'speechiness',
    'acousticness', 'instrumentalness',
    'liveness','valence','tempo',
    'time_signature']
```

#### Baseline model sketch


> baseline recommender: 

* baseline focus : 
    - similarity : measure similarity based on valence mood category and genre; among tracks whose popularity exceeds the threshold (0.7 as sketched; popularity is stored on a 0 to 100 scale, so this presumably means 70), recommend the top N song names and/or the most similar content
    - top N : group by valence mood category and genre; among tracks above the same popularity threshold, recommend the top 10 song names and/or the most similar content

* baseline features :
    - all features + genre? except id, id_artists
    - song name, acousticness, and popularity


* recommend top 10 songs(name) based on similarity measure of genre (group by valence)


> next - main recommender model : 

* main model next :  mood groups/labels based on valence value
* recommend top N = 10 songs(name) based on mood label and popularity by genre/name/artist


* baseline recommender function:

    - similarity measure of track , genre , valence(category), artists
    - feat_col = ['valence', 'genre', 'name of music track', 'artists', 'popularity', 'acousticness']
    - metric : similarity metric, top N
    - TSNE, KMeans, SVD, PCA


* Metric : Top_N, cosine similarity




In [None]:
# IMPORT LIBRARIES


try:

    import numpy as np
    import pandas as pd

    # databases - sql
    #from dotenv import dotenv_values
    #import sqlalchemy

    # visualisation
    import seaborn as sns
    import matplotlib.pyplot as plt

    # split data - avoid data leakage
    from sklearn.model_selection import train_test_split

    # preprocessing, scaling
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler

    # modelling
    from sklearn.cluster import KMeans

    # cross validation, hyperparameter tuning
    #from surprise.model_selection import GridSearchCV
    from sklearn.model_selection import GridSearchCV

    # metrics
    from sklearn.metrics import euclidean_distances
    from sklearn.metrics.pairwise import sigmoid_kernel
    from sklearn.metrics.pairwise import cosine_similarity

    # high dimensional usage - dimensionality reduction
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA

    # text converter/vectorizer
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer

    # pipeline
    from sklearn.pipeline import Pipeline



except ImportError as error:
    print(f"Installation of the required dependencies necessary! {error}")

    %pip install numpy
    %pip install pandas
    #%pip install dotenv
    #%pip install sqlalchemy
    %pip install seaborn
    %pip install matplotlib
    %pip install scikit-learn
    
    %pip install scikit-surprise
    %pip install sentence-transformers

    print(f"Successful installation of the required dependencies necessary")


import warnings
warnings.filterwarnings('ignore')



In [None]:
# all features

all_features =  ['id', 'name',
    'popularity', 'duration_ms', 'explicit',
    'artists', 'id_artists','release_date',
    'danceability','energy', 'key',
    'loudness', 'mode', 'speechiness',
    'acousticness', 'instrumentalness',
    'liveness','valence','tempo',
    'time_signature']



In [None]:
# descriptive statistics summary


df_tracks[all_features].describe().T.style \
        .format("{:.2f}")





In [None]:
# display data

df_tracks.head()



### Clustering data based on mood (valence) and other audio features






In [None]:
#

# similarity recommend measure
# queries based on similar mood tracks
# modify column -- 
    # cols = valence .. n most high values of valence = 1. 
    # what are the top N mood-based-music in our data/ recommendation list?
# columns = ['mood_label', 'mood_class', 'music_name', 'acousticness','danceability','energy','instrumentalness','liveness','speechiness','valence']
# mood-emotions scale.. between 0 to 1 , group into 8 

# n_songs = 10 # the 10 closest neighbouring tracks
# filter_based_on_cols = ["valence", "acousticness", "energy", "danceability"]
# sad_tracks = {"valence": 0, "acousticness": 0}
# ... other moods and audio features
# happy_tracks = {"valence": 1, "acousticness": 1 }

# search music track based on mood
# for music in mood_based_query( query_input = mood_based_data column).retrieve(top_n_songs):
    # trackname = get_track_name(moosic_data[name?])
    # search_music_name = search(music_name)

# mood based recommender
# music mood classification/prediction
# curate spotify music data to identify and tag music with moods
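
The query idea in the notes above could be sketched as a nearest-neighbour lookup: describe a mood as a target profile and return the tracks closest to it. This is only an illustration on invented toy data; `closest_tracks` is a hypothetical helper, and Euclidean distance is an assumption, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy tracks; the values are invented for illustration only.
toy = pd.DataFrame({
    "name": ["Track A", "Track B", "Track C", "Track D", "Track E"],
    "valence": [0.05, 0.95, 0.10, 0.80, 0.50],
    "acousticness": [0.20, 0.90, 0.30, 0.85, 0.50],
})

def closest_tracks(data, target_profile, n_songs=2):
    """Return the n_songs rows closest (Euclidean distance) to target_profile."""
    cols = list(target_profile)
    target = np.array([target_profile[c] for c in cols], dtype=float)
    distances = np.linalg.norm(data[cols].to_numpy() - target, axis=1)
    return data.assign(distance=distances).nsmallest(n_songs, "distance")

# Mood profiles as written in the notes above
sad_tracks = {"valence": 0, "acousticness": 0}
happy_tracks = {"valence": 1, "acousticness": 1}

sad_result = closest_tracks(toy, sad_tracks)
happy_result = closest_tracks(toy, happy_tracks)
print(sad_result)
print(happy_result)
```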







In [None]:
# create new column 'mood labels'(text) and 'mood_class'(nominal value) 



#mood_labels = ["happy", "exuberant", "energetic", "frantic", "anxious/sad", "depression", "calm", "content"]
#mood_valence_values =  [1.0, 0.875, 0.75, 0.625, 0.5, 0.375, 0.25, 0.125, 0.0]

mood_labels = ["content", "calm", "depression", "anxious/sad", "frantic", "energetic", "exuberant", "happy"]
mood_class = [0, 1, 2, 3, 4, 5, 6, 7]
mood_valence_values =  [0.0 , 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]

# copy of data and info

moosic_data = df_tracks.copy(deep=True)

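# include_lowest=True keeps tracks with valence == 0.0 in the first bin instead of labelling them NaN
moosic_data["mood_labels"] = (pd.cut(moosic_data["valence"], bins=mood_valence_values, labels=mood_labels, include_lowest=True)).astype('string')
moosic_data["mood_class"] = (pd.cut(moosic_data["valence"], bins=mood_valence_values, labels=mood_class, include_lowest=True)).astype('Int64')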
moosic_data = moosic_data.rename({'name': 'music_name', 'id': 'music_id', 'mode': 'modality', 'popularity': 'music_popularity'}, axis=1)
moosic_data.head(10)
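
The bins used by `pd.cut` are right-closed by default, so a valence of exactly 0.0 falls outside the first bin unless `include_lowest=True` is passed. A small sketch with the labels and bin edges from the cell above and a few invented valence values:

```python
import pandas as pd

mood_labels = ["content", "calm", "depression", "anxious/sad",
               "frantic", "energetic", "exuberant", "happy"]
mood_valence_values = [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]

# Invented valence values, including both bin boundaries 0.0 and 1.0
valence = pd.Series([0.0, 0.125, 0.126, 0.5, 1.0])

default_cut = pd.cut(valence, bins=mood_valence_values, labels=mood_labels)
inclusive_cut = pd.cut(valence, bins=mood_valence_values, labels=mood_labels,
                       include_lowest=True)

print(pd.DataFrame({"valence": valence,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```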


In [None]:
# specify features, drop id, id_artists
# unsupervised : kmeans clustering
# elbow method to get best number of clusters

df_moosic = moosic_data.copy(deep=True)
cat_data = df_moosic.select_dtypes(include=['object', 'string'])
cat_features = cat_data.columns.tolist()

print(cat_features)

num_data = df_moosic.select_dtypes(include=['int64', 'float64'])
num_features = num_data.columns.tolist()

print(num_features)




In [None]:
# moosic baseline recommender : similarity based on feature variable
#baseline focus : when grouped by valence mood category and genre, 
# if music is in similar queried mood_class as others then recommend top 10 song names and/or most similar content
# not normalize or scaled
# no outliers removed 

# work in progress!!! to-do


# test
''' 
baseline_params_0 = {
                'feat_cols' : ['music_name', 'popularity', 'danceability','energy',
                'loudness', 'acousticness', 'instrumentalness', 'tempo'],
                'target_col' : 'valence',
                'agg_type' : 'mean',
                'top_n' : 10
        }
'''

def baseline_moosic_v1(dataset, feat_cols= "mood_labels", target_col = "music_name", agg_type = "count", top_n = 10, *args, **kwargs):

    data = dataset.groupby(feat_cols).agg({target_col : agg_type}).reset_index()
    data = data.sort_values(target_col, ascending=False)
    top_n_recommend = data.head(top_n)

    return top_n_recommend





In [None]:



# recommendations based on the valence (mood) feature
# mood_label, mood_class, name of music track


baseline_params_0 = {
                'feat_cols' : ['mood_labels', 'mood_class'],
                'target_col' : 'music_name',
                'agg_type' : 'count',
                'top_n' : 10
        }

baseline_0_stats = moosic_data[baseline_params_0['feat_cols']].describe().T.style \
        .format("{:.2f}")

display(baseline_0_stats)

print("_________"*10)

top_n_val = baseline_moosic_v1(moosic_data, **baseline_params_0)

top_n_val.sort_values(by=['mood_class'], ascending=True)



In [None]:
# test
mood_filter = 'calm'
moosic_data['mood_labels'].str.contains(mood_filter).any()

In [None]:
## mood based baseline recommender
## recommend music based on similar mood classes/label

## Work in progress (WIP): 
    # mood filter or query can be a string based on mood label or mood class scale from 0 to 7
    # i.e. mood_filter = 'calm' or mood_filter = 1


def baseline_moosic_v2(dataset, mood_filter = 'calm', feat_cols= "mood_labels", 
                        target_col = "music_name", agg_type = "count", top_n = 10, *args, **kwargs):

    # feat_cols and agg_type are accepted so that the parameter dicts of
    # baseline_moosic_v1 can be reused; they are not needed for filtering by mood

    # select the tracks of the requested mood, by label (text) or by class (0 to 7)
    if isinstance(mood_filter, str):
        mask = dataset['mood_labels'].eq(mood_filter)
    elif isinstance(mood_filter, int):
        mask = dataset['mood_class'].eq(mood_filter)
    else:
        raise TypeError('Enter the mood either as mood label (text) or as mood class value (0 to 7)')

    matches = dataset[mask.fillna(False)]
    if matches.empty:
        raise ValueError(f'No tracks found for mood : {mood_filter}')

    print(f'''Here are the top {top_n} music recommendations based on your current mood : {mood_filter} ''')

    # random sample of tracks from the requested mood group
    top_n_recommend = matches.sample(n=min(top_n, len(matches)))
    top_n_recommend = top_n_recommend[[target_col, 'artists', 'mood_labels', 'mood_class']]
    print(top_n_recommend)

    return top_n_recommend






In [None]:

# recommendations for baseline model

mood_based_params_0 = {
                'feat_cols' : ['mood_labels', 'mood_class'],
                'target_col' : 'music_name',
                'agg_type' : 'count',
                'top_n' : 10
        }


moosic_baseline_stats = moosic_data[mood_based_params_0['feat_cols']].describe().T.style \
        .format("{:.2f}")

display(moosic_baseline_stats)

print("_________"*10)

moosic_baseline_recommender = baseline_moosic_v2(moosic_data, **mood_based_params_0)

moosic_baseline_recommender.sort_values(by=['mood_class'], ascending=True)




In [None]:

#df_mdata = df_data.drop(columns=["id", "id_artists"], axis=1)
#df_mdata = df_data.drop(columns=[*cat_features], axis=1)

# numeric features used for clustering ('popularity' was renamed to 'music_popularity' above)
# note: the features are not scaled here, so tempo (BPM) dominates the distances
df_mdata = moosic_data[['music_popularity', 'valence', 'tempo']]

print('shape of data', df_mdata.shape)
print('______'*20)

# elbow method: within-cluster sum of squares (WCSS) for k = 5 ... 19
k_values = range(5, 20)
wcss_1 = []
for i in k_values:
    km1 = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    km1.fit(df_mdata)
    wcss_1.append(km1.inertia_)
#print('wcss1', wcss_1)

fig0=plt.figure(figsize=(10,6))  
fig0.patch.set_facecolor('#f6f5f5')
plt.plot(k_values, wcss_1)
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
fig0.text(0.5,0.4,"The best k-value is here")
plt.show()
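
Because popularity (0 to 100), valence (0 to 1) and tempo (BPM) live on very different scales, it may be worth standardizing the features before running the elbow method. The following is a minimal sketch on synthetic data; the value ranges and the range of k are assumptions chosen for a quick run, not results from the tracks data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three clustering features
rng = np.random.default_rng(0)
toy_features = np.column_stack([
    rng.uniform(0, 100, 300),   # popularity-like column
    rng.uniform(0, 1, 300),     # valence-like column
    rng.uniform(60, 200, 300),  # tempo-like column
])

# Standardize so that every feature contributes comparably to the distances
scaled = StandardScaler().fit_transform(toy_features)

# Elbow method: collect the within-cluster sum of squares for each k
k_values = list(range(2, 7))
wcss = []
for k in k_values:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, max_iter=300, random_state=0)
    km.fit(scaled)
    wcss.append(km.inertia_)

for k, inertia in zip(k_values, wcss):
    print(f"k={k}: wcss={inertia:.1f}")
```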




In [None]:
# train and predict kmeans model with the best number of clusters values given, n_clusters = 6


km1 = KMeans(n_clusters = 6, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means1 = km1.fit_predict(moosic_data[['music_popularity', 'valence', 'tempo']])

---

[algo] define  model based on similarity of possible target features and focus

---


In [None]:
# TSNE (T-distributed Stochastic Neighbor Embedding) ML model 

"""
# t-SNE (t-distributed Stochastic Neighbor Embedding)

- unsupervised approach
- randomized, nonlinear dimensionality reduction technique
- for high (multi-) dimensional data, e.g. the N = 18 dimension space (aside from id & artist id) of our data
- embeds high-dimensional data in a low-dimensional space of 2 or 3 dimensions, mainly for visualisation
- i.e. maps high-dim data to low-dim space
- generally regarded as less sensitive to outliers than linear methods such as PCA

- algorithm :
    - randomized approach for dimensionality reduction (not deterministic like PCA), so results depend on random_state
    - finds patterns in the data based on the similarity of data points
    - similarity measure : pairwise conditional probability that a data point chooses another data point with similar features as its neighbour
    - general form of a conditional probability : P(a | b) = P(a ∩ b) / P(b)
    - then minimizes the difference (Kullback-Leibler divergence) between the similarities measured in high-dim and low-dim space
    - so that neighbouring data points stay close together in the low-dim space

    - space/time complexity : quadratic time and space in the number of data points (exact method)

- standardize the data before use so that features on large scales (e.g. tempo) do not dominate the pairwise distances



"""

#standardized_data = StandardScaler().fit_transform(data)
#print(standardized_data.shape)


#dftracks = df_tracks.drop(['id', 'id_artists'], axis = 1)
dftracks = moosic_data[['music_popularity', 'valence', 'tempo']]

model = TSNE(n_components = 2, random_state = 0)
tsne_data = model.fit_transform(dftracks.head(500))
plt.figure(figsize = (7, 7))
plt.scatter(tsne_data[:,0], tsne_data[:,1])
plt.show()




In [None]:
dftracks0 = moosic_data[['music_popularity', 'valence', 'tempo']]

model0 = TSNE(n_components = 3, random_state = 0)
tsne_data0 = model0.fit_transform(dftracks0.head(500))
plt.figure(figsize = (7, 7))
# 3-dimensional embedding; only the first two components are plotted here
plt.scatter(tsne_data0[:,0], tsne_data0[:,1])
plt.show()


In [None]:
dftracks1 = moosic_data[['music_popularity', 'valence']] #moosic_data[['valence', 'tempo']]

model1 = TSNE(n_components = 2, random_state = 0)
tsne_data1 = model1.fit_transform(dftracks1.head(500))
plt.figure(figsize = (7, 7))
plt.scatter(tsne_data1[:,0], tsne_data1[:,1])
plt.show()

In [None]:
standardized_data = StandardScaler().fit_transform(moosic_data[num_features])
print(standardized_data.shape)
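
The t-SNE cells above embed the raw features. A variant that standardizes first could look like the sketch below. It runs on synthetic data, and the sample size and `perplexity` are assumptions chosen for a quick run.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for popularity, valence and tempo
rng = np.random.default_rng(0)
toy_features = np.column_stack([
    rng.uniform(0, 100, 100),
    rng.uniform(0, 1, 100),
    rng.uniform(60, 200, 100),
])

# Standardize first so that no single feature dominates the pairwise distances
toy_scaled = StandardScaler().fit_transform(toy_features)

# Embed into two dimensions; perplexity must be smaller than the number of samples
toy_embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(toy_scaled)
print(toy_embedding.shape)
```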





In [None]:
# Cosine similarity measure 

"""
# Cosine similarity measure

- similarity measure between two feature vectors, not a model: nothing is trained
- cos(a, b) = (a . b) / (||a|| * ||b||), the cosine of the angle between the vectors a and b
- ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction)
- depends only on the direction of the vectors, not on their magnitude
- usable for high (multi-) dimensional data, e.g. the N = 18 dimension space (aside from id & artist id) of our data

- usage for the recommender :
    - represent each track as a vector of (scaled) audio features
    - compute the similarity between the query track and all other tracks
    - recommend the top N tracks with the highest similarity

    - space/time complexity : the full pairwise similarity matrix is quadratic in the number of data points




"""

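
A cosine-similarity lookup for the recommender might be sketched as follows. The toy tracks are invented, `most_similar` is a hypothetical helper, and standardizing the features before computing the similarity is an assumption rather than a decision made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Toy tracks; the values are invented for illustration only.
toy = pd.DataFrame({
    "music_name": ["Track A", "Track B", "Track C", "Track D"],
    "valence": [0.90, 0.85, 0.10, 0.15],
    "energy": [0.80, 0.90, 0.20, 0.10],
})

# Scale the features, then compute the pairwise cosine similarity matrix
features = StandardScaler().fit_transform(toy[["valence", "energy"]])
similarity = cosine_similarity(features)

def most_similar(track_name, top_n=1):
    """Return the top_n tracks most similar to track_name, excluding itself."""
    idx = toy.index[toy["music_name"] == track_name][0]
    scores = pd.Series(similarity[idx], index=toy["music_name"]).drop(track_name)
    return scores.nlargest(top_n)

print(most_similar("Track A"))
```

For the full tracks data, the pairwise matrix would be too large to hold in memory, so the similarity would have to be computed for one query track at a time.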


