In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [3]:
df = pd.read_csv('../csv/trimmed_dataset.csv')
df

Unnamed: 0,melodic/bittersweet/sentimental/romantic,spiritual/atmospheric/surreal/dense/mysterious,progressive/dense/epic,energetic/raw/rebellious/noisy/angry-warm/calm/natural/acoustic/pastoral,cold/dark/sad/atmospheric/anxious/mysterious/serious-quirky/happy
0,0.428571,0.000000,0.000000,-0.538462,-0.481481
1,0.714286,0.703704,0.428571,0.733333,-0.444444
2,0.619048,0.000000,0.000000,-0.087179,0.037037
3,0.904762,0.407407,0.428571,0.400000,0.111111
4,0.523810,0.481481,0.523810,0.000000,-0.555556
...,...,...,...,...,...
420,0.428571,0.925926,0.428571,0.733333,-0.555556
421,0.809524,0.555556,0.000000,-0.005128,-0.444444
422,0.619048,0.000000,0.000000,0.230769,0.074074
423,0.523810,0.629630,0.619048,-0.087179,-0.666667


In this notebook we'll be comparing our clustering output using different subsets of our dataset:
- only binary features;
- only continuous features;
- all features.

The purpose is not to compare the features themselves, but to check what effect their type (binary vs. continuous) has on clustering quality. The subset with all features will also be used to check whether a 15-dimensional space already suffers from the "curse of dimensionality" (more about it in 02_Data_Modeling.ipynb) for these 449 entries.

As discussed in the previous notebook, clustering is optimal in a dataset with only **n** binary features when it has 2^n clusters (n binary features can take at most 2^n distinct combinations of values). For a dataset composed entirely of continuous features, it's simpler: the more clusters the better. Since we want to compare both subsets at their best, the number of features and clusters chosen for the binary-only subset will also be used for the continuous subset.

For 5 features, the binary subset would perform optimally at 32 clusters, which is too many for the 449 entries we currently have. For 4 features, it performs optimally at 16 clusters, which seems more reasonable. So 4 features will be chosen for both the binary and the continuous subsets. The criterion for choosing features will be variance.
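
The arithmetic behind this choice can be sketched as follows. The sample count (449) and the feature counts (4 and 5) come from the text above; the variable names are only illustrative.

```python
from itertools import product

n_samples = 449  # number of albums, as stated above

# n binary features can take at most 2**n distinct value combinations.
clusters_for = {n: len(list(product([0, 1], repeat=n))) for n in (4, 5)}

for n_features, n_clusters in clusters_for.items():
    # Average cluster size if the albums were spread evenly (they usually are not).
    avg_size = n_samples / n_clusters
    print(f"{n_features} binary features: {n_clusters} clusters, "
          f"about {avg_size:.1f} albums per cluster on average")
```

An even spread is the best case; with skewed binary features, some clusters end up far smaller than this average.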

Every subset will be standardized separately using StandardScaler before running k-means clustering.
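
As a small illustration of what standardization does to a binary feature, the sketch below scales a made-up 0/1 column (not taken from the dataset). The column ends up with zero mean and unit variance, but it still takes only two distinct values, so the "focal point" structure of binary data is preserved.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical binary column: 3 albums out of 10 carry the descriptor.
x = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float).reshape(-1, 1)

x_scaled = StandardScaler().fit_transform(x)

# Rounding guards against floating-point noise when counting distinct values.
distinct_values = np.unique(x_scaled.round(6))
print(distinct_values)
```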

In [3]:
variances_df = df.var()
variances_sorted = variances_df.sort_values(ascending=False)
variances_sorted

rhythmic                                                                    0.558585
poetic                                                                      0.517221
energetic/raw/rebellious/noisy/angry-warm/calm/natural/acoustic/pastoral    0.516981
eclectic                                                                    0.493060
cold/dark/sad/atmospheric/anxious/mysterious/serious-quirky/happy           0.387733
anthemic                                                                    0.323688
heavy                                                                       0.253666
urban                                                                       0.228126
humorous                                                                    0.214560
progressive/dense/epic                                                      0.176385
spiritual/atmospheric/surreal/dense/mysterious                              0.160897
melodic/bittersweet/sentimental/romantic                         

In [4]:
# Therefore our subsets will be:
df_bin = df[['rhythmic', 'poetic', 'eclectic', 'anthemic']]
df_cont = df[['energetic/raw/rebellious/noisy/angry-warm/calm/natural/acoustic/pastoral', 'cold/dark/sad/atmospheric/anxious/mysterious/serious-quirky/happy', 'progressive/dense/epic', 'spiritual/atmospheric/surreal/dense/mysterious']]
df_all = df.copy()  # copy, so that adding a 'labels' column later does not modify df
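
The columns above were picked by reading the sorted variances. A possible way to automate the same selection is sketched below. It assumes that a column is binary when it holds at most two distinct values; the helper `top_variance_columns` and the toy frame are hypothetical, not part of the notebook.

```python
import pandas as pd

def top_variance_columns(frame: pd.DataFrame, k: int = 4):
    """Return (binary, continuous) lists with the k highest-variance columns of each type."""
    # Assumption: at most two distinct values means the column is binary.
    is_binary = frame.nunique() <= 2
    variances = frame.var().sort_values(ascending=False)
    binary_cols = [c for c in variances.index if is_binary[c]][:k]
    continuous_cols = [c for c in variances.index if not is_binary[c]][:k]
    return binary_cols, continuous_cols

# Tiny made-up frame, only to show the call.
toy = pd.DataFrame({
    "bin_a": [0, 1, 0, 1],
    "bin_b": [0, 0, 0, 1],
    "cont_a": [0.1, 0.9, -0.5, 0.3],
    "cont_b": [0.2, 0.25, 0.3, 0.2],
})
binary_cols, continuous_cols = top_variance_columns(toy, k=1)
print(binary_cols, continuous_cols)
```

On the real data, the call would be something like `top_variance_columns(df, k=4)`, provided `df` contains only feature columns.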


### 1. df_bin

In [47]:
X = df_bin.to_numpy()
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

clusterNum = 16
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(X)
labels = k_means.labels_
df_bin = df_bin.assign(labels=k_means.labels_)  # returns a new DataFrame, so no SettingWithCopyWarning

In [48]:
# Extracting album names from "named_df" (full_descriptor_dataset.csv) and labels from "df_bin", then merging them on the row index into one dataframe "merged_df"
named_df = pd.read_csv('../csv/full_descriptor_dataset.csv')
merged_df = pd.merge(df_bin['labels'], named_df[['name']], left_index=True, right_index=True)

In [49]:
merged_df

Unnamed: 0,labels,name
0,3,Living in Darkness
1,6,Souvenirs d'un autre monde
2,1,People Who Can Eat People Are the Luckiest Peo...
3,4,Funeral
4,6,Neon Bible
...,...,...
444,6,In the Aeroplane Over the Sea
445,1,McCartney
446,5,Hail to the Thief
447,6,Wish You Were Here


In [50]:
label_name_list = merged_df.groupby('labels')['name'].apply(list)
print(label_name_list)

labels
0     [Humbug, Low, Construção, Cartola, Cartola, Sc...
1     [People Who Can Eat People Are the Luckiest Pe...
2     [Blink-182, Carlos, Erasmo..., Swing Lo Magell...
3     [Living in Darkness, AM, The B-52's, Bad Brain...
4     [Funeral, No Control, The Rise and Fall of Zig...
5     [Mask, Paul's Boutique, Check Your Head, Ill C...
6     [Souvenirs d'un autre monde, Neon Bible, Suck ...
7     [Whatever People Say I Am, That's What I'm Not...
8     [Rubber Soul, The Beatles [White Album], Abbey...
9     [The Suburbs, Surf's Up, Mutations, The Kick I...
10    [Reflektor, Licensed to Ill, London Calling, S...
11    [Damaged, Take Off Your Pants and Jacket, Diam...
12    [On Land and in the Sea, nimrod., Kylmälle maa...
13    [Energy, A dança da solidão, Da lama ao caos, ...
14    [Rumours, The Age of Adz, A Fever You Can't Sw...
15                [Cabeça dinossauro, The Black Parade]
Name: name, dtype: object
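
The truncated lists above hide how many albums each label holds. A minimal sketch for counting cluster sizes is shown here on made-up labels; in the notebook, `merged_df['labels']` would take the place of `labels`, and the cut-off `small_threshold` is a hypothetical value.

```python
import pandas as pd

# Made-up labels standing in for merged_df['labels'].
labels = pd.Series([0, 0, 0, 1, 2, 2, 2, 2, 3], name="labels")

# Number of albums per label, ordered by label.
cluster_sizes = labels.value_counts().sort_index()

# Labels holding very few albums (hypothetical cut-off).
small_threshold = 2
small_clusters = cluster_sizes[cluster_sizes <= small_threshold]
print(cluster_sizes)
print(small_clusters)
```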


In [51]:
# Saving in .txt file
with open('../Outputs/df_bin_output.txt', 'w', encoding='utf-8') as file:
    for label, names in label_name_list.items():
        # One line per label, followed by a blank line
        file.write(f"{label}: {', '.join(names)}\n\n")

Let's repeat the same process with df_cont and df_all.
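
Since the same steps run for every subset, they could be wrapped in one function. The sketch below is one possible refactoring, not the code used to produce the outputs: the helper name `cluster_subset` is hypothetical, and `random_state` is an added parameter (the notebook does not set one) that makes runs repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_subset(frame, n_clusters=16, n_init=12, random_state=None):
    """Standardize the columns of `frame`, run k-means, and return the labels as a Series."""
    X = StandardScaler().fit_transform(frame.to_numpy())
    model = KMeans(init="k-means++", n_clusters=n_clusters,
                   n_init=n_init, random_state=random_state)
    # Keep the original index so the labels can be merged with album names.
    return pd.Series(model.fit_predict(X), index=frame.index, name="labels")

# Made-up data, only to show the call.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(60, 4)), columns=list("abcd"))
toy_labels = cluster_subset(toy, n_clusters=3, random_state=0)
print(toy_labels.value_counts().sort_index())
```

Returning a separate Series, instead of writing a `labels` column into the subset, also keeps the feature frame free of non-feature columns if a cell is re-run.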

### 2. df_cont

In [52]:
X = df_cont.to_numpy()
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

k_means.fit(X)
labels = k_means.labels_
df_cont = df_cont.assign(labels=k_means.labels_)  # returns a new DataFrame, so no SettingWithCopyWarning


In [53]:
merged_df = pd.merge(df_cont['labels'], named_df[['name']], left_index=True, right_index=True)

In [54]:
label_name_list = merged_df.groupby('labels')['name'].apply(list)

In [55]:
with open('../Outputs/df_cont_output.txt', 'w', encoding='utf-8') as file:
    for label, names in label_name_list.items():
        # One line per label, followed by a blank line
        file.write(f"{label}: {', '.join(names)}\n\n")

### 3. df_all

In [56]:
X = df_all.to_numpy()
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

k_means.fit(X)
labels = k_means.labels_
df_all['labels'] = k_means.labels_

In [57]:
merged_df = pd.merge(df_all['labels'], named_df[['name']], left_index=True, right_index=True)

In [58]:
label_name_list = merged_df.groupby('labels')['name'].apply(list)

In [59]:
with open('../Outputs/df_all_output.txt', 'w', encoding='utf-8') as file:
    for label, names in label_name_list.items():
        # One line per label, followed by a blank line
        file.write(f"{label}: {', '.join(names)}\n\n")

## Thoughts about the outputs

Outputs can be checked in the "Outputs" folder. **Note**: Since the albums were all chosen from my listening list, this evaluation will inevitably come from a personal perspective. That isn't inappropriate: clustering algorithms belong to the "unsupervised learning" paradigm, so there are no ground-truth labels to evaluate against.

**df_bin_output.txt**: Here we can observe a phenomenon that already showed up in lower dimensions when we plotted a 2D histogram for ('rhythmic' x 'poetic'): some focal points have far more data density than others, to the point that 4 of our 16 clusters hold 6 or fewer albums, a very small number. That doesn't mean these binary features can't contribute to our model, but the way data disperses along binary features hints that they can't be used alone;

**df_cont_output.txt**: Better than df_bin_output, but still not good enough. Some labels have good cohesion in sound, but most are fairly mixed and some have no cohesion at all;

**df_all_output.txt**: Worse than df_cont_output. Not a single label has reasonable cohesion, for which I can speculate 2 possible causes:
1) 15 features might indeed be too many dimensions for the number of samples we currently have (449). To use all 15 features we would need many more samples, in order to fill the "blank" space that this many dimensions creates in our vector space;
2) The behavior we observed in the binary features' distributions isn't exclusive to binary-only subsets. Instead, the binary features project their skewed distribution among focal points onto the full data distribution. **Note:** At first I thought this wouldn't happen, but in hindsight that was naive, since 10 of the 15 features are binary.
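
Both causes can be probed by counting how many rows share each distinct combination of binary values. The sketch below uses made-up data: 449 rows and 10 binary columns match the counts mentioned above, but the probabilities in `probs` are assumptions chosen only to produce a skewed distribution. On the real data, the 10 binary columns of `df` would replace `toy`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed probability of each hypothetical binary feature being 1.
probs = np.array([0.5, 0.4, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05])
toy = pd.DataFrame((rng.random((449, 10)) < probs).astype(int),
                   columns=[f"b{i}" for i in range(10)])

# Rows per distinct 0/1 pattern ("focal point"), largest first.
pattern_sizes = toy.groupby(list(toy.columns)).size().sort_values(ascending=False)

n_possible = 2 ** toy.shape[1]   # 1024 possible patterns for 10 binary features
n_observed = len(pattern_sizes)  # patterns that actually occur in the sample
print(f"{n_observed} of {n_possible} possible patterns observed")
print(pattern_sizes.head())
```

If a handful of patterns hold most of the rows while most possible patterns stay empty, the binary features are likely dominating the geometry that k-means sees.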