# Selecting Features

After executing "01_HTML_Scrapper.ipynb" we're left with the list saved in "sorted_descriptor_list.txt", which contains 160+ descriptors.

k-Means clustering moves cluster centroids through the vector space so as to minimize the distance between each centroid and the data points assigned to it. That makes it a distance-based algorithm, and therefore very sensitive to the number of dimensions and prone to the "curse of dimensionality", a well-known phenomenon in Data Science and Machine Learning. It basically states that, as the number of dimensions increases linearly, the volume of our vector space increases exponentially, so we need exponentially more observations to fill the same percentage of the space as we would in lower dimensions. Here's a good learning source: https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/.
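
To make the exponential growth concrete, here is a minimal sketch that is not part of the original pipeline. It assumes we want the same number of evenly spaced sample points along every axis; the function name `samples_needed` and the choice of 10 points per axis are illustrative assumptions.

```python
def samples_needed(points_per_axis: int, dims: int) -> int:
    """Number of grid points needed to keep the same per-axis
    resolution as the number of dimensions grows."""
    return points_per_axis ** dims


# Hypothetical resolution: 10 sample points along every axis.
for dims in (1, 2, 3, 10):
    print(dims, samples_needed(10, dims))
```

Each extra dimension multiplies the required number of points by the per-axis resolution, which is why a dataset that looks dense in a few dimensions becomes sparse in many.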

Moral of the story: unless we want to gather much, MUCH more album descriptor data than we currently have, we should keep our dimensionality low, which is a recommended practice in most ML problems. That means we can't just turn all 160+ descriptors into 160+ features of our dataset. In this notebook we'll model a dataset whose features are based on these 160+ descriptors, losing or distorting as little information as possible while keeping the number of features (dimensions) as low as possible. To do that, we'll use three strategies:

1 - Removing descriptors that have no relation to sound

2 - Grouping similar-enough descriptors together

3 - Removing unpopular descriptors

## 1 - Removing descriptors that have no relation to sound

All descriptors were taken from RateYourMusic. Here's their definition for what all music descriptors mean: https://rateyourmusic.com/music_descriptor/

As you can see, not all descriptors are "sound-based"; some describe a common lyrical theme that appears on the album. However, an aggressive rock album and a soft acoustic album can both address the same lyrical theme ("politics", for instance), and we don't want to cluster these albums together, since our criterion is "sound-based". Therefore, we'll be removing all descriptors that have no relation to sound.

Note: some descriptors refer to both sound and imagery conveyed in the lyrics. These still bear some relation to sound, so they will be kept.

After some analysis, the descriptors below were found not to be "sound-based". Most did not occur in "sorted_descriptor_list.txt"; the ones that did are marked with an asterisk (\*) by the name.

Form
    - ballad
    - carol
    - children's music
    - fairy tale
    - lullaby
    - nursery rhyme
    - concept album
    - rock opera
    - concerto
    - ensemble
    - a cappella
    - androgynous vocals
    - chamber music
    - string quartet
    - choral
    - female vocalist
    - instrumental
    - male vocalist
    - non-binary vocalist
    - orchestral
    - vocal group
    - hymn
    - jingle
    - madrigal
    - mashup
    - medley
    - monologue
    - novelty
    - opera
    - oratorio
    - parody
    - poem
    - section
    - interlude
    - intro
    - movement
    - outro
    - reprise
    - silence
    - skit
    - sonata
    - stem
    - suite
    - symphony
    - tone poem
    - waltz

Lyrics
    - theme
    - abstract
    - alienation
    - conscious
    - crime
    - death
    - suicide
    - drugs
    - alcohol
    - educational
    - fantasy
    - folklore
    - hedonistic
    - history
    - holiday
    - Christmas
    - Halloween
    - ideology
    - anti-religious
    - pagan
    - political
    - anarchism
    - nationalism
    - propaganda
    - protest
    - religious
    - Christian
    - Islamic
    - satanic
    - introspective
    - LGBT
    - love
    - breakup
    - misanthropic
    - mythology
    - nature
    - occult
    - paranormal
    - patriotic
    - philosophical
    - existential
    - nihilistic
    - science fiction
    - self-hatred
    - sexual
    - sports
    - violence
    - war

Technique
    - composition
    - aleatory
    - generative music
    - improvisation
    - uncommon time signatures
    - production
    - lobit
    - sampling
    - Wall of Sound
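
The removal step can be sketched as a simple filter over `(descriptor, count)` pairs. This is only an illustration: the helper name `remove_descriptors` is hypothetical, the counts given to the lyrical descriptors are placeholders rather than values from the scraped list, and the original removal may well have been done by hand.

```python
def remove_descriptors(pairs, excluded):
    """Return the (descriptor, count) pairs whose descriptor is not in `excluded`."""
    excluded = set(excluded)
    return [(name, count) for name, count in pairs if name not in excluded]


# Hypothetical sample: the counts for 'political' and 'love' are placeholders,
# not values taken from "sorted_descriptor_list.txt".
sample = [('melodic', 429), ('political', 0), ('energetic', 377), ('love', 0)]
not_sound_based = ['political', 'love', 'ballad']

print(remove_descriptors(sample, not_sound_based))
```

Names in the exclusion list that never occur in the data (such as 'ballad' here) are simply ignored, which matches the fact that most of the descriptors listed above did not occur in the scraped list.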

In [1]:
# After removing these, we're left with:

sorted_descriptors_1 = [('melodic', 429),
 ('energetic', 377),
 ('rhythmic', 319),
 ('passionate', 279),
 ('playful', 274),
 ('raw', 233),
 ('anxious', 229),
 ('bittersweet', 217),
 ('warm', 206),
 ('rebellious', 200),
 ('psychedelic', 198),
 ('melancholic', 190),
 ('poetic', 186),
 ('quirky', 179),
 ('eclectic', 176),
 ('atmospheric', 162),
 ('sarcastic', 159),
 ('surreal', 146),
 ('lush', 145),
 ('urban', 144),
 ('nocturnal', 143),
 ('dark', 135),
 ('progressive', 135),
 ('longing', 132),
 ('mellow', 125),
 ('humorous', 116),
 ('noisy', 116),
 ('lonely', 112),
 ('dense', 110),
 ('summer', 110),
 ('uplifting', 107),
 ('manic', 107),
 ('sentimental', 104),
 ('anthemic', 102),
 ('hypnotic', 99),
 ('romantic', 98),
 ('cryptic', 95),
 ('angry', 93),
 ('repetitive', 88),
 ('aggressive', 86),
 ('sombre', 86),
 ('pessimistic', 85),
 ('avant-garde', 83),
 ('mysterious', 73),
 ('complex', 71),
 ('satirical', 68),
 ('heavy', 67),
 ('lo-fi', 64),
 ('ominous', 62),
 ('cold', 62),
 ('soothing', 61),
 ('ethereal', 60),
 ('depressive', 57),
 ('epic', 55),
 ('tropical', 53),
 ('soft', 52),
 ('technical', 51),
 ('dissonant', 51),
 ('pastoral', 50),
 ('chaotic', 50),
 ('autumn', 48),
 ('acoustic', 47),
 ('apocalyptic', 45),
 ('sad', 43),
 ('serious', 43),
 ('calm', 43),
 ('spiritual', 42),
 ('happy', 38),
 ('peaceful', 34),
 ('minimalistic', 32),
 ('vulgar', 31),
 ('party', 31),
 ('futuristic', 31),
 ('mechanical', 31),
 ('optimistic', 27),
 ('suspenseful', 27),
 ('sensual', 23),
 ('space', 23),
 ('apathetic', 22),
 ('disturbing', 21),
 ('triumphant', 21),
 ('deadpan', 20),
 ('spring', 19),
 ('winter', 19),
 ('lethargic', 18),
 ('boastful', 16),
 ('tribal', 14),
 ('aquatic', 14),
 ('sparse', 14),
 ('desert', 13),
 ('polyphonic', 13),
 ('funereal', 13),
 ('meditative', 10),
 ('ritualistic', 9),
 ('atonal', 9),
 ('rain', 9),
 ('forest', 8),
 ('scary', 8),
 ('martial', 8),
 ('hateful', 6),
 ('infernal', 6),
 ('medieval', 6),
 ('seasonal', 5),
 ('microtonal', 4),
 ('natural', 1)]

In [2]:
len(sorted_descriptors_1)

105

## 2 - Grouping similar-enough descriptors together

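Example: 'rain', 'forest', 'desert', 'aquatic', 'tropical' and 'natural' can all be generalised under the same name. Doing this inevitably loses some information and fine detail, but only to a small degree, while the benefit of reducing dimensions is far greater. In each line below, the name on the left is the group and the names on the right are its members; a group's count is the sum of its members' counts (for example, 'warm' = 206 + 110 = 316).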

'natural' -> 'natural', 'rain', 'forest', 'desert', 'aquatic', 'tropical', 'seasonal', 'autumn', 'spring'
'dark' -> 'dark', 'funereal', 'infernal', 'ominous', 'scary', 'disturbing', 'apocalyptic'
'sad' -> 'sad', 'depressive', 'lonely', 'melancholic', 'sombre', 'pessimistic', 'hateful'
'warm' -> 'warm', 'summer'
'cold' -> 'cold', 'winter', 'nocturnal'
'angry' -> 'angry', 'aggressive'
'calm' -> 'calm', 'meditative', 'mellow', 'soothing', 'peaceful', 'soft'
'energetic' -> 'energetic', 'manic'
'happy' -> 'happy', 'playful', 'uplifting', 'triumphant', 'optimistic'
'futuristic' -> 'futuristic', 'space'
'humorous' -> 'humorous', 'sarcastic', 'vulgar', 'satirical'
'spiritual' -> 'spiritual', 'ethereal', 'hypnotic'
'minimalistic' -> 'minimalistic', 'repetitive', 'sparse'
'progressive' -> 'progressive', 'microtonal', 'complex', 'polyphonic', 'avant-garde', 'atonal', 'technical'
'bittersweet' -> 'bittersweet', 'longing'
'surreal' -> 'surreal', 'psychedelic', 'lush'
'apathetic' -> 'apathetic', 'lethargic', 'deadpan'
'mysterious' -> 'mysterious', 'cryptic'
'noisy' -> 'noisy', 'chaotic', 'dissonant'
'sentimental' -> 'sentimental', 'passionate'
'raw' -> 'raw', 'lo-fi'
'urban' -> 'urban', 'party'
'romantic' -> 'romantic', 'sensual'
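
The grouping above amounts to summing the counts of each group's members. Below is a minimal sketch of that idea, using only two of the groups and their counts from the list in step 1; the function name `group_descriptors` and the dictionary layout are assumptions made for illustration, not the notebook's actual code.

```python
from collections import defaultdict


def group_descriptors(pairs, groups):
    """Sum the counts of every member descriptor into its group name.

    Descriptors that belong to no group are kept under their own name.
    """
    # Invert the mapping: member descriptor -> group name.
    member_to_group = {
        member: group for group, members in groups.items() for member in members
    }
    totals = defaultdict(int)
    for name, count in pairs:
        totals[member_to_group.get(name, name)] += count
    return dict(totals)


# A small subset of the groups and counts listed above.
groups = {
    'warm': ['warm', 'summer'],
    'cold': ['cold', 'winter', 'nocturnal'],
}
pairs = [('warm', 206), ('nocturnal', 143), ('summer', 110),
         ('cold', 62), ('winter', 19), ('melodic', 429)]

print(group_descriptors(pairs, groups))
```

Inverting the mapping once keeps the lookup per descriptor cheap and makes it easy to spot a descriptor that was accidentally assigned to two groups, since the later assignment would silently overwrite the earlier one.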

In [3]:
# After grouping these, we're left with:

sorted_descriptors_2 = [('melodic', 429),
 ('energetic', 484),
 ('rhythmic', 319),
 ('raw', 297),
 ('anxious', 229),
 ('bittersweet', 349),
 ('warm', 316),
 ('rebellious', 200),
 ('poetic', 186),
 ('quirky', 179),
 ('eclectic', 176),
 ('atmospheric', 162),
 ('surreal', 489),
 ('urban', 175),
 ('dark', 290),
 ('progressive', 366),
 ('humorous', 374),
 ('noisy', 217),
 ('dense', 110),
 ('sentimental', 383),
 ('anthemic', 102),
 ('romantic', 121),
 ('angry', 179),
 ('mysterious', 168),
 ('heavy', 67),
 ('cold', 224),
 ('epic', 55),
 ('pastoral', 50),
 ('acoustic', 47),
 ('sad', 579),
 ('serious', 43),
 ('calm', 325),
 ('spiritual', 201),
 ('happy', 467),
 ('minimalistic', 134),
 ('futuristic', 54),
 ('apathetic', 60),
 ('natural', 170),
 ('mechanical', 31),
 ('suspenseful', 27),
 ('boastful', 16),
 ('tribal', 14),
 ('ritualistic', 9),
 ('martial', 8)]

In [4]:
len(sorted_descriptors_2)

44

## 3 - Removing unpopular descriptors that don't fit in with any others

mechanical

suspenseful

boastful

tribal

ritualistic

martial
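
Dropping these six descriptors and ordering what remains by count could look like the sketch below. The helper name `drop_and_sort` is hypothetical, and the input is only a handful of entries taken from the grouped list above.

```python
def drop_and_sort(pairs, to_drop):
    """Remove the descriptors in `to_drop`, then sort by count, highest first."""
    to_drop = set(to_drop)
    kept = [(name, count) for name, count in pairs if name not in to_drop]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# A few entries from the grouped list above.
pairs = [('melodic', 429), ('energetic', 484), ('mechanical', 31),
         ('surreal', 489), ('martial', 8)]
unpopular = ['mechanical', 'suspenseful', 'boastful',
             'tribal', 'ritualistic', 'martial']

print(drop_and_sort(pairs, unpopular))
```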

In [5]:
# After removing these, we're left with:

sorted_descriptors_3 = [('sad', 579),
 ('surreal', 489),
 ('energetic', 484),
 ('happy', 467),
 ('melodic', 429),
 ('sentimental', 383),
 ('humorous', 374),
 ('progressive', 366),
 ('bittersweet', 349),
 ('calm', 325),
 ('rhythmic', 319),
 ('warm', 316),
 ('raw', 297),
 ('dark', 290),
 ('anxious', 229),
 ('cold', 224),
 ('noisy', 217),
 ('spiritual', 201),
 ('rebellious', 200),
 ('poetic', 186),
 ('angry', 179),
 ('quirky', 179),
 ('eclectic', 176),
 ('urban', 175),
 ('natural', 170),
 ('mysterious', 168),
 ('atmospheric', 162),
 ('minimalistic', 134),
 ('romantic', 121),
 ('dense', 110),
 ('anthemic', 102),
 ('heavy', 67),
 ('apathetic', 60),
 ('epic', 55),
 ('futuristic', 54),
 ('pastoral', 50),
 ('acoustic', 47),
 ('serious', 43)]

In [6]:
len(sorted_descriptors_3)

38

As things stand, we just went from 160+ descriptors to 38, which is great progress for now. However, we're not done with data modelling, and we'll continue to reduce this number later.