# Moosic Baseline Model :: Iteration v1


* combining datasets as done prior [data preprocessing][data management]
* train test split [modelling]
* baseline model [discussion needed]
* baseline model sketch and implementation


## Importing required libraries




In [None]:
# IMPORT LIBRARIES


try:

    import numpy as np
    import pandas as pd

    # databases - sql
    #from dotenv import dotenv_values
    #import sqlalchemy

    # visualisation
    import seaborn as sns
    import matplotlib.pyplot as plt

    # modelling - evaluation metrics
    from sklearn.model_selection import train_test_split
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    from sklearn.metrics import euclidean_distances
    from sklearn.metrics.pairwise import sigmoid_kernel
    from sklearn.metrics.pairwise import cosine_similarity


    from sklearn.pipeline import Pipeline
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA


except ImportError as error:
    print(f"Installation of the required dependencies necessary! {error}")

    %pip install numpy
    %pip install pandas
    #%pip install python-dotenv
    #%pip install sqlalchemy
    %pip install seaborn
    %pip install matplotlib
    %pip install scikit-learn

    print("Dependencies installed. Re-run this cell to import them.")


# suppress warnings regardless of whether the imports above succeeded
import warnings
warnings.filterwarnings('ignore')



## Loading the data

In [None]:
df_artists = pd.read_csv('../data/moosic-raw/spotify_600k_artists.csv', low_memory=False)
df_artists.head().T

In [None]:
df_tracks = pd.read_csv('../data/moosic-raw/spotify_600k_tracks.csv', low_memory=False)
df_tracks.head().T

## Data Overview Artists

| column | additional information |
|--------|------------------------|
| id | id of artist |
| followers | number of followers | 
| genres | genres associated with artist |
| name | name of artist |
| popularity | popularity of artist in range 0 to 100 |

## Data Overview Tracks

| column | additional information |
|--------|------------------------|
| id | id of track |
| name | name of track | 
| popularity | popularity of track in range 0 to 100 |
| duration_ms | duration of songs in ms |
| explicit | whether it contains explicit content or not |
| artists | artists who created the track | 
| id_artists | id of artists who created the track |
| release_date | date of release |
| danceability | how danceable a song is in range 0 to 1 |
| energy | how energized a song is in range 0 to 1 |
| key | The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1 |
| loudness | The overall loudness of a track in decibels (dB) |
| mode |  Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0 |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic |
| instrumentalness | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry) |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration | 
| time_signature | An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). Values range from 3 to 7, indicating time signatures of 3/4 to 7/4 |

In [None]:
df_artists.info()

&rarr; Some missing values in columns _followers_ and _name_!

In [None]:
df_tracks.info()

&rarr; Some missing values in column _name_!

In [None]:
df_artists.nunique()

In [None]:
df_tracks.nunique()

In [None]:
# get shape of the tracks dataframe

print(f"Track data: There are {df_tracks.shape[0]} observations and {df_tracks.shape[1]} feature variables ")
print('----------'*10)

df_tracks.shape

In [None]:
# check for number of null values in each columns

df_tracks.isnull().sum()

The tracks CSV contains all the data we need, so from here on we work only with the tracks data.
We will split it into training and test sets before any modelling to avoid data leakage.
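
As a rough illustration of what avoiding leakage means in practice, the sketch below fits a `StandardScaler` on the training split only and then reuses it on the test split. The `toy` DataFrame and its values are made up for illustration; only the column names follow the tracks table above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for df_tracks (values are illustrative, not real data)
toy = pd.DataFrame({
    "tempo": [90.0, 120.0, 128.0, 75.0, 140.0, 100.0, 110.0, 95.0],
    "loudness": [-12.0, -6.5, -5.0, -15.0, -4.2, -8.0, -7.1, -10.3],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

# Fit the scaler on the training rows only, then apply it to both splits,
# so no statistics from the test rows influence the transformation
scaler = StandardScaler().fit(toy_train)
toy_train_scaled = scaler.transform(toy_train)
toy_test_scaled = scaler.transform(toy_test)

print(toy_train_scaled.shape, toy_test_scaled.shape)
```

Fitting the scaler on the full data before splitting would let the test rows influence the training features, which is the kind of leakage the split is meant to prevent.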

In [None]:
# Drop rows with a missing track name (name is the only column with NaNs)

df_tracks = df_tracks.dropna(subset=['name'])

In [None]:
# re-check for number of null values in each columns

df_tracks.isnull().sum()

In [None]:
# check for duplicates

df_tracks.duplicated()

In [None]:
df_tracks.head()

We have duplicated track names. Are these tracks with the same name from different artists, or do we have duplicates of the same track?
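
One possible way to separate the two cases is to compare duplicates on `name` alone with duplicates on the `name` and `artists` pair. The sketch below uses a small made-up DataFrame called `toy`; on the real data it would be replaced by `df_tracks`.

```python
import pandas as pd

# Toy stand-in for df_tracks (values are illustrative, not real data)
toy = pd.DataFrame({
    "name": ["Home", "Home", "Home", "Stay"],
    "artists": ["['A']", "['B']", "['A']", "['C']"],
    "popularity": [50, 40, 50, 60],
})

# Rows that share a track name, regardless of artist
same_name = toy.duplicated(subset=["name"], keep=False)

# Rows that share both track name and artists: likely true duplicates or re-releases
same_name_and_artist = toy.duplicated(subset=["name", "artists"], keep=False)

print("rows sharing a name:", same_name.sum())
print("rows sharing name and artists:", same_name_and_artist.sum())
```

If the second count is much smaller than the first, most repeated names belong to different artists rather than to duplicated tracks.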

In [None]:
# Check for completely identical rows
identical_rows = df_tracks[df_tracks.duplicated(keep=False)]

# Print completely identical rows
print("Completely identical rows:")
print(identical_rows)

In [None]:
# get count of duplicated values in tracks dataframe

display(df_tracks.duplicated().value_counts())

In [None]:
# show top 5 rows of data (transposed)

df_tracks.head().T

In [None]:
# Train-test Split

# Defining X and y
features = df_tracks.columns.tolist()
features.remove('name')

X = df_tracks[features]
y = df_tracks['name']

print(X.shape)
print(y.shape)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)  # , stratify=y)

# Check the shape of the data sets
print("X_train:", X_train.shape)
print("y_train:", y_train.shape)
print("X_test:", X_test.shape)
print("y_test:", y_test.shape)

When we pass `stratify=y` to `train_test_split`, we get a `ValueError`: "The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2."

This follows from how stratification works. A stratified split tries to preserve the proportion of each class in both the training and the test set, so every class needs at least two members. Here `y` is the track name, and many names occur only once, so there are not enough samples per class to stratify.

The options are to take a plain random (non-stratified) sample, as done above, or to stratify on a target whose classes all have enough members.
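
The error can be reproduced on a toy example. In the sketch below, `toy_X` and `toy_y` are made-up data in which one label occurs only once; the stratified call fails, while the same call without `stratify` succeeds.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: the label "Unique title" has a single member
toy_X = pd.DataFrame({"valence": [0.1, 0.4, 0.5, 0.9, 0.3]})
toy_y = pd.Series(["Home", "Home", "Stay", "Stay", "Unique title"])

try:
    train_test_split(toy_X, toy_y, test_size=0.4, random_state=42, stratify=toy_y)
    stratify_failed = False
except ValueError as error:
    stratify_failed = True
    print(f"Stratified split failed: {error}")

# Without stratify the same call succeeds
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.4, random_state=42)
print(X_tr.shape, X_te.shape)
```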

## Baseline models: sketch and implementation

------------------------------------

```python

all_features =  ['id', 'name',
    'popularity', 'duration_ms', 'explicit',
    'artists', 'id_artists','release_date',
    'danceability','energy', 'key',
    'loudness', 'mode', 'speechiness',
    'acousticness', 'instrumentalness',
    'liveness','valence','tempo',
    'time_signature']
```
* baseline_0 recommender:
    - baseline focus: when grouped by valence mood category and genre, if popularity is greater than 0.7, recommend the top 10 song names and/or the most similar content
    - baseline features: song name, acousticness, and popularity
    - recommend top 10 songs (name)


* baseline_1 recommender:
    - baseline focus: when grouped by valence mood category and genre, if popularity is greater than 0.7, recommend the top N song names and/or the most similar content
    - baseline features: all features + genre?
    - recommend top 10 songs (name) based on valence and popularity

* baseline_2 recommender:
    - baseline focus: when grouped by valence mood category and genre, if popularity is greater than 0.7, recommend the top N song names and/or the most similar content
    - baseline features: all features + genre?, except id and id_artists
    - recommend top 10 songs (name) based on valence and popularity

* baseline_3 recommender:
    - baseline focus: when grouped by valence mood category and genre, if popularity is greater than 0.7, recommend the top N song names and/or the most similar content
    - baseline features: ['name', 'popularity', 'duration_ms', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
    - recommend top 10 songs (name) based on valence and popularity

* Note: popularity is stored on a 0 to 100 scale in the raw data, so the 0.7 threshold corresponds to 70 unless popularity is first rescaled to the range 0 to 1.


* Metric: Top_N, cosine similarity
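
As a rough sketch of how the Top_N and cosine similarity metric could be combined, the hypothetical helper `top_n_similar` below scales the chosen features and ranks all other tracks by cosine similarity to a seed track. The function name, its parameters, and the `toy` data are assumptions for illustration, not the final baseline.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler


def top_n_similar(df, seed_name, feature_cols, top_n=10):
    """Return the top_n tracks most similar to seed_name by cosine similarity."""
    mask = (df["name"] == seed_name).to_numpy()
    if not mask.any():
        raise KeyError(f"No track named {seed_name!r}")
    seed_pos = int(mask.argmax())  # position of the first matching row

    # Scale features so that no single column dominates the similarity
    scaled = StandardScaler().fit_transform(df[feature_cols])
    scores = cosine_similarity(scaled[[seed_pos]], scaled).ravel()

    result = df[["name"]].assign(similarity=scores)
    keep = np.arange(len(df)) != seed_pos  # exclude the seed track itself
    return result[keep].sort_values("similarity", ascending=False).head(top_n)


# Toy stand-in for df_tracks (values are illustrative, not real data)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "valence": [0.9, 0.85, 0.1, 0.5],
    "energy": [0.8, 0.75, 0.2, 0.5],
})

recs = top_n_similar(toy, "A", ["valence", "energy"], top_n=2)
print(recs)
```

On the full tracks data the similarity is computed for one seed row against all rows, which avoids building the full pairwise matrix.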




In [None]:
# all features

all_features =  ['id', 'name',
    'popularity', 'duration_ms', 'explicit',
    'artists', 'id_artists','release_date',
    'danceability','energy', 'key',
    'loudness', 'mode', 'speechiness',
    'acousticness', 'instrumentalness',
    'liveness','valence','tempo',
    'time_signature']



In [None]:
# descriptive statistics summary


df_tracks[all_features].describe().T.style \
        .format("{:.2f}")





In [None]:
# moosic baseline recommender: similarity based on feature variables
# baseline focus: when grouped by valence mood category and genre, if popularity
# is greater than 0.7 then recommend top 10 song names and/or most similar content
# features are not normalized or scaled
# no outliers removed

# work in progress!!! to-do
def baseline_moosic_v1(dataset, feat_cols, target_col="valence", agg_type="mean", top_n=10):
    """Group by feat_cols, aggregate target_col with agg_type, and return the top_n groups."""

    data = dataset.groupby(feat_cols).agg({target_col: agg_type}).reset_index()
    data = data.sort_values(target_col, ascending=False)
    top_n_recommend = data.head(top_n)

    return top_n_recommend




In [None]:
# recommendations based on the valence (mood) feature


baseline_params_0 = {
                'feat_cols' : ['name', 'popularity', 'danceability','energy',
                'loudness', 'acousticness', 'instrumentalness', 'tempo'],
                'target_col' : 'valence',
                'agg_type' : 'mean',
                'top_n' : 10
        }

baseline_0_stats = df_tracks[baseline_params_0['feat_cols']].describe().T.style \
        .format("{:.2f}")

display(baseline_0_stats)

print("_________"*10)

top_n_val = baseline_moosic_v1(df_tracks, **baseline_params_0)

top_n_val


In [None]:
# group by valence and popularity, then count the track names in each group
# alternatives to consider: mean popularity? max/min tempo?
# show top 15

baseline_params_1 = {
                'feat_cols' : ['valence', 'popularity'],
                'target_col' : 'name',
                'agg_type' : 'count',
                'top_n' : 15
        }

baseline_1_stats = df_tracks[baseline_params_1['feat_cols']].describe().T.style \
        .format("{:.2f}")

display(baseline_1_stats)

print("_________"*10)

top_n_val1 = baseline_moosic_v1(df_tracks, **baseline_params_1)

top_n_val1

In [None]:
#baseline focus : when grouped by valence mood category and genre, if popularity 
#is greater than 0.7 then recommend top 10 song names and/or most similar content


# redo function
# valence bin: needs a curated column or clusters

#df_test = df_tracks.groupby('valence').agg({target_col : agg_type}).reset_index()
#df_test
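
The valence bin idea noted above could be sketched with `pd.cut`. Everything specific here is an assumption for illustration: the helper name `top_n_by_mood`, the bin edges, the labels `low`/`neutral`/`high`, the `valence_mood` column, and the popularity threshold of 70 on the raw 0 to 100 scale. Genre is left out because it is not a column of the tracks data.

```python
import pandas as pd


def top_n_by_mood(df, min_popularity=70, top_n=10):
    """Return the top_n most popular tracks within each valence mood bin."""
    binned = df.assign(
        valence_mood=pd.cut(
            df["valence"],
            bins=[0.0, 0.33, 0.66, 1.0],        # assumed bin edges
            labels=["low", "neutral", "high"],  # assumed mood labels
            include_lowest=True,
        )
    )
    # Keep only tracks above the popularity threshold
    popular = binned[binned["popularity"] > min_popularity]
    return (
        popular.sort_values("popularity", ascending=False)
        .groupby("valence_mood", observed=True)
        .head(top_n)[["name", "valence_mood", "popularity"]]
    )


# Toy stand-in for df_tracks (values are illustrative, not real data)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "valence": [0.9, 0.8, 0.2, 0.5, 0.95],
    "popularity": [80, 90, 75, 60, 72],
})

recs = top_n_by_mood(toy, top_n=2)
print(recs)
```

Clustering the audio features (for example with `KMeans`) would be an alternative to fixed bins, at the cost of less interpretable mood categories.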


In [None]:
# supervised vs unsupervised?

# prediction model of song tracks mood type label?
# split dataset into train and test with the target variable as valence (mood)
# define the input data feature, X_data and the focus target, y_data
# depending on algorithm, how much data to be used


X_data = df_tracks.drop(columns=["valence"])
y_data = df_tracks['valence']


X_train1, X_test1, y_train1, y_test1 = train_test_split(X_data, y_data, test_size=0.25, random_state=42, shuffle=True)

dat_shape = {
        "X_train" : X_train1.shape, "y_train": y_train1.shape,
        "X_test" : X_test1.shape, "y_test" : y_test1.shape
        }

dat_shape