# __Big Data Project__ - _Free Music Archive Analysis_
### *M2 EcoStat*
By _Olivier Robert_, _Cyprien Cambus_, _Rémi Perrichon_ and _Baptiste Hessel_.

## TensorFlow Model

The goal of this notebook is to build a neural network for multiclass classification.

We will use the temporal features extracted from the audio to predict the main genre of a track.

The problem is that about half of the values of the target variable, the _genre_top_ column, are missing. One of the main challenges of this notebook has therefore been to construct a new target variable by replacing each _NaN_ with the most popular genre of the track found in the column _genres_.

That way, we have been able to keep almost 100,000 rows of the dataset to build a model.

The second challenge was the very large number of classes to predict (more than 50 at the beginning).

For this notebook you need to use the Colab data (i.e. the regular data, not the version prepared for Databricks).


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import sklearn.utils, sklearn.preprocessing, sklearn.decomposition, sklearn.svm
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import RepeatedKFold, StratifiedKFold
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from keras.models import Sequential
from keras.layers import Dense
from google.colab import drive
import IPython.display as ipd
import ast, math

In [None]:
# You may need to replace the part
# "Drive/Final_Project_Big_Data_M2/Data/fma_metadata" by the location you placed
# the dataset
drive.mount("/content/drive")
%cd /content/drive/My\ Drive/Final_Project_Big_Data_M2/Data/fma_metadata

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/.shortcut-targets-by-id/1CsEfNPMaW5wHrONjMaS-sUcmCNwzq4y7/Final_Project_Big_Data_M2/Data/fma_metadata



## Loading of the dataset



In [None]:
genres = pd.read_csv("genres.csv", index_col=0)
features = pd.read_csv("features.csv", index_col=0, header=[0, 1, 2])
tracks = pd.read_csv("tracks.csv", index_col=0, header=[0, 1])
echonest = pd.read_csv("echonest.csv", index_col=0, header=[0, 1, 2])

The _features_ dataset contains the variables that will allow us to predict the genre of the tracks.

The _tracks_ dataset contains the target variable as well as some information about the tracks.

_genres_ is a small dataset that mostly maps the id of a genre to the title of that genre.

_echonest_ contains some features extracted from the time series associated with the tracks. They are available for only a small share of the tracks (13,129 of 106,574), which is why we do not include them in this analysis.
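Because _features.csv_ is read with three header rows, its columns form a three-level index, and selecting on the first level returns a whole family of features at once. The sketch below illustrates this on a toy frame; the level names and the column tuples are assumptions made for the example, not values read from the file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for features.csv: three header levels, indexed by track_id.
# The level names and tuples below are illustrative assumptions.
columns = pd.MultiIndex.from_tuples(
    [("mfcc", "mean", "01"), ("mfcc", "mean", "02"),
     ("mfcc", "std", "01"), ("chroma_cens", "mean", "01")],
    names=["feature", "statistics", "number"])
toy_features = pd.DataFrame(np.arange(12.0).reshape(3, 4),
                            index=pd.Index([2, 3, 5], name="track_id"),
                            columns=columns)

# Selecting on the first level keeps every column of that feature family
mfcc = toy_features["mfcc"]
print(mfcc.shape)

# A deeper selection narrows the result to one statistic
mfcc_mean = toy_features["mfcc"]["mean"]
print(list(mfcc_mean.columns))
```

This is the same kind of selection as `features_reduced['mfcc']`, which is used later as the model input.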


## Data Preparation

In [None]:
COLUMNS = [('track', 'tags'), ('album', 'tags'), ('artist', 'tags'),
                   ('track', 'genres'), ('track', 'genres_all')]

# These columns store Python lists as strings; ast.literal_eval parses them
# back into lists without executing arbitrary code
for column in COLUMNS:
    tracks[column] = tracks[column].map(ast.literal_eval)

# we convert the following columns to date format
COLUMNS = [('track', 'date_created'), ('track', 'date_recorded'),
            ('album', 'date_created'), ('album', 'date_released'),
            ('artist', 'date_created'), ('artist', 'active_year_begin'),
            ('artist', 'active_year_end')]
for column in COLUMNS:
    tracks[column] = pd.to_datetime(tracks[column])

# There are 3 subsets to train the model on different sizes of data
# We also convert to categorical type
SUBSETS = ('small', 'medium', 'large')
try:
    tracks['set', 'subset'] = tracks['set', 'subset'].astype(
            'category', categories=SUBSETS, ordered=True)
except (ValueError, TypeError):
    tracks['set', 'subset'] = tracks['set', 'subset'].astype(
              pd.CategoricalDtype(categories=SUBSETS, ordered=True))

COLUMNS = [('track', 'genre_top'), ('track', 'license'),
            ('album', 'type'), ('album', 'information'),
            ('artist', 'bio')]
# Convert each of the above columns of tracks to categorical type
for column in COLUMNS:
    tracks[column] = tracks[column].astype('category')
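As a small illustration of the first conversion step, the sketch below shows what `ast.literal_eval` does to a list-like column once it has been read from a CSV file. The series `raw` is a made-up example, not data from _tracks.csv_.

```python
import ast
import pandas as pd

# Toy column: list-like values arrive from read_csv as plain strings
raw = pd.Series(["[21]", "[76, 103]", "[]"], name="genres")
print(type(raw.iloc[1]))

# literal_eval only parses Python literals (lists, numbers, strings, ...)
parsed = raw.map(ast.literal_eval)
print(parsed.iloc[1])

# Once parsed, the values behave like real lists
total_ids = sum(len(ids) for ids in parsed)
print(total_ids)
```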

### Verification of the shapes of the datasets

_tracks_ and _features_ must have the same number of rows to train a model.

In [None]:
print("{:<25}{:<8}{:>8}".format("", 'lines', 'columns'))
print("{:->42s}".format(""))
print("{:<25}{:<8}{:>8}".format("genres.shape", *genres.shape))
print("{:<25}{:<8}{:>8}".format("echonest.shape", *echonest.shape))
print("{:<25}{:<8}{:>8}".format("tracks.shape", *tracks.shape))
print("{:<25}{:<8}{:>8}".format("features.shape", *features.shape))

                         lines    columns
------------------------------------------
genres.shape             163            4
echonest.shape           13129        249
tracks.shape             106574        52
features.shape           106574       518


In [None]:
ipd.display(tracks['track'].head())
ipd.display(tracks['album'].head())
ipd.display(tracks['artist'].head())
ipd.display(tracks['set'].head())

track_id,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
2,256000,0,,2008-11-26 01:48:12,2008-11-26,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,256000,0,,2008-11-26 01:48:14,2008-11-26,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,256000,0,,2008-11-26 01:48:20,2008-11-26,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
10,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
20,256000,0,,2008-11-26 01:48:56,2008-01-01,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level


track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type
2,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album
3,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album
5,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album
10,0,2008-11-26 01:45:08,2008-02-06,,4,6,,47632,,[],Constant Hitmaker,2,Album
20,0,2008-11-26 01:45:05,2009-01-06,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album


track_id,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page
2,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,[awol],http://www.AzillionRecords.blogspot.com,
3,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,[awol],http://www.AzillionRecords.blogspot.com,
5,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,[awol],http://www.AzillionRecords.blogspot.com,
10,NaT,NaT,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"[philly, kurt vile]",http://kurtvile.com,
20,1990-01-01,2011-01-01,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"[instrumentals, experimental pop, post punk, e...",,


track_id,split,subset
2,training,small
3,training,medium
5,training,small
10,training,small
20,training,large


### Interesting features

The columns listed in the table below could help improve the predictions of the genre classification model.

| tracks['track']      |   tracks['album']  |   tracks['artist']   |
| :------------------- | :----------------: | -------------------: |
| duration             |  favorites         | active_year_end      |
| favorites            |     listens        | active_year_begin    |
| interest             |                    |      comments        |
| license              |                    |  favorites           |
| listens              |                    | name                 |




### Decomposition of the data into subsets and train-test labels


Each row of the _tracks_ and _features_ datasets has two labels:
* one is the split: __training__, __validation__ or __test__
* the other is the subset: __small__, __medium__ or __large__

These labels simplify the training and validation of a model.

In [None]:
# Let's look at the number of data in each subset
train = tracks['set', 'split'] == 'training'
test = tracks['set', 'split'] == 'test'
for subset in ['small', 'medium', 'large']:
    sub = tracks['set', 'subset'] <= subset
    args = [subset, *tracks.loc[sub].shape, *tracks.loc[sub & train].shape]
    args += [*tracks.loc[sub & test].shape]
    print("{:<10}{:>8}   train ({:<3}{:>5})   test ({:<3}{:>5})".format(*args))
    print("{:-^60}".format(""))

small         8000   train (52  6400)   test (52   800)
------------------------------------------------------------
medium       25000   train (52 19922)   test (52  2573)
------------------------------------------------------------
large       106574   train (52 84353)   test (52 11263)
------------------------------------------------------------
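The comparison `tracks['set', 'subset'] <= subset` in the cell above works because the column was converted to an ordered categorical type, so that each subset includes the smaller ones. A minimal sketch on a toy series (the values are made up):

```python
import pandas as pd

SUBSETS = ("small", "medium", "large")
subset = pd.Series(["small", "medium", "small", "large"]).astype(
    pd.CategoricalDtype(categories=SUBSETS, ordered=True))

# With an ordered dtype, <= compares positions in SUBSETS, not the strings
in_medium = subset <= "medium"
print(in_medium.tolist())

# Every row belongs to the 'large' subset
all_in_large = bool((subset <= "large").all())
print(all_in_large)
```

Without `ordered=True`, pandas would refuse the `<=` comparison rather than compare the labels alphabetically.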


## Study of the target column __genre_top__

The column _genre_top_ is our target variable, so we inspect it closely in order to keep as many rows as possible with a coherent genre.

In [None]:
col_name = ('track', 'genre_top')
df_genre_top = tracks[[('track', 'title'), col_name]].groupby([col_name])
print(df_genre_top.count())
tot = df_genre_top.count().sum()
print("\nThe number of valid lines: {}".format(tot))

                     track
                     title
(track, genre_top)        
Blues                  110
Classical             1230
Country                194
Easy Listening          24
Electronic            9371
Experimental         10608
Folk                  2803
Hip-Hop               3552
Instrumental          2079
International         1389
Jazz                   571
Old-Time / Historic    554
Pop                   2332
Rock                 14182
Soul-RnB               175
Spoken                 423

The number of valid lines: track  title    49597
dtype: int64


We can see that the column _tracks[('track', 'genre_top')]_ contains missing values: only 49,597 of the 106,574 tracks have a top genre. An idea is to use one of the genres in the column _genres_ to replace each missing value with a coherent genre.

#### Creation of a dictionary that will help us replace _NaN_ values

In [None]:
# Construction of a dictionary {genre id: genre title}.
# The index of genres is the genre id, so each id appears only once.
dict_genres = genres['title'].to_dict()

print("Number of different genres {}.\n".format(len(dict_genres)))
for genre_id, title_genre in dict_genres.items():
    print("{:4.0f}{:->30s}".format(genre_id, str(title_genre)))

Number of different genres 163.

   1-------------------Avant-Garde
   2-----------------International
   3-------------------------Blues
   4--------------------------Jazz
   5---------------------Classical
   6-----------------------Novelty
   7------------------------Comedy
   8-----------Old-Time / Historic
   9-----------------------Country
  10---------------------------Pop
  11-------------------------Disco
  12--------------------------Rock
  13----------------Easy Listening
  14----------------------Soul-RnB
  15--------------------Electronic
  16-----------------Sound Effects
  17--------------------------Folk
  18--------------------Soundtrack
  19--------------------------Funk
  20------------------------Spoken
  21-----------------------Hip-Hop
  22-----------------Audio Collage
  25--------------------------Punk
  26---------------------Post-Rock
  27-------------------------Lo-Fi
  30--------------Field Recordings
  31-------------------------Metal
  32------------------

#### The number of tracks associated with each genre of the dataset

In [None]:
# Let's look at all the genres in the dataset
dico_nbtracks = {}
# We use the values in the column genres that contains the ids of all genres
# associated to a track
for id_track, genre_track in tracks[('track', 'genres')].items():
    for unique_genre in genre_track:
        title_g = dict_genres[unique_genre]
        dico_nbtracks[title_g] = dico_nbtracks.get(title_g, 0) + 1

st_dico_nbtracks = dict(sorted(dico_nbtracks.items(),
                               key=lambda x: x[1],
                               reverse=True))

print("{:<30}   {:>20}".format("genre", "nb_tracks"))
print("{:-^60}".format(""))
for title_genre, nb_tracks in st_dico_nbtracks.items():
    print("{:<30s}   {:>20}".format(title_genre, nb_tracks))

sum_genres = sum((nb for nb in st_dico_nbtracks.values()))
print("{:-^60}".format(""))
print("\nThe total nb of genres associated to the tracks {}".format(sum_genres))
print("It makes {:.2f} genres per track".format(sum_genres/len(tracks)))


genre                                       nb_tracks
------------------------------------------------------------
Experimental                                    24912
Electronic                                      23866
Avant-Garde                                      8693
Rock                                             8038
Noise                                            7268
Ambient                                          7206
Experimental Pop                                 7144
Folk                                             7105
Pop                                              6362
Electroacoustic                                  6110
Instrumental                                     6055
Lo-Fi                                            6041
Hip-Hop                                          5922
Ambient Electronic                               5723
Soundtrack                                       5575
Indie-Rock                                       5432
Punk                 
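The same count can also be sketched without an explicit loop, using `explode` and `value_counts`. The names `toy_dict_genres` and `toy_genres` below are hypothetical miniature versions of `dict_genres` and of the column `tracks[('track', 'genres')]`.

```python
import pandas as pd

# Hypothetical miniature versions of dict_genres and of the genres column
toy_dict_genres = {10: "Pop", 21: "Hip-Hop", 76: "Experimental Pop"}
toy_genres = pd.Series([[21], [21], [10], [76, 10]], index=[2, 3, 10, 20])

# explode() gives one row per (track, genre id); map() turns ids into titles
toy_counts = toy_genres.explode().map(toy_dict_genres).value_counts()
print(toy_counts.to_dict())

# Average number of genres per track
genres_per_track = toy_counts.sum() / len(toy_genres)
print(genres_per_track)
```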

### Creation of a new target variable with more values

A first idea is to create a new column similar to _genre_top_, in which each NaN is replaced by the genre of the _genres_ column that has the highest number of tracks in the dataset.
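To make the rule concrete, here is a simplified sketch of it for a single track. The function `fill_top_genre` and its arguments are hypothetical names introduced for illustration; it only covers the case of a missing top genre and works directly on genre titles rather than ids.

```python
import math

def fill_top_genre(top_genre, genre_titles, nb_tracks, min_tracks=1000):
    """Return top_genre, or the most frequent candidate when it is missing.

    Returns None when no candidate genre reaches min_tracks.
    """
    missing = isinstance(top_genre, float) and math.isnan(top_genre)
    if not missing:
        return top_genre
    candidates = [g for g in genre_titles if nb_tracks.get(g, 0) >= min_tracks]
    if not candidates:
        return None
    return max(candidates, key=lambda g: nb_tracks[g])

# Track counts taken from the table above
toy_counts = {"Experimental Pop": 7144, "Folk": 7105, "Pop": 6362}
kept = fill_top_genre("Hip-Hop", ["Hip-Hop"], toy_counts)
filled = fill_top_genre(float("nan"), ["Experimental Pop", "Folk", "Pop"],
                        toy_counts)
dropped = fill_top_genre(float("nan"), ["Unknown"], toy_counts)
print(kept, filled, dropped)
```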

In [None]:
# st_dico_nbtracks -> {genre_title: nb_tracks}
# dict_genres -> {id_genre: genre_title}
new_column = []
n1 = ('track', 'genres')
n2 = ('track', 'genre_top')
nan_indexes = []
# We iterate on the columns genres and genre_top to replace each NaN in
# genre_top by the most popular genre in genres
for (index, list_genres), (_, top_genre) in zip(tracks[n1].items(),
                                                tracks[n2].items()):
    if pd.isna(top_genre):
        # list with tuples [(genre_title, nb_tracks), ...]
        genre_and_pop = [(dict_genres[id_genre],
                          st_dico_nbtracks[dict_genres[id_genre]])
                         for id_genre in list_genres]
        # We sort this list by the number of tracks
        st_genre_and_pop = sorted(genre_and_pop,
                                  key=lambda x: x[1],
                                  reverse=True)
        # We keep only genres with at least 1000 tracks in the dataset
        filter_st_genre_and_pop = [gp for gp in st_genre_and_pop
                                   if gp[1] >= 1000]
        if len(filter_st_genre_and_pop) > 0:
            # We add the most popular genre
            new_column.append(filter_st_genre_and_pop[0][0])
        else:
            # No usable genre: this row will be removed, so the value
            # appended here is only a placeholder
            new_column.append("IDM")
            nan_indexes.append(index)
    else:
        # There is a top_genre: we keep it
        new_column.append(top_genre)
        # If the top_genre has fewer than 1000 tracks we will delete this row
        if st_dico_nbtracks[top_genre] < 1000:
            nan_indexes.append(index)

# Printing of the results
set_genres = set(new_column)
not_popular_top_genres = [gr for gr in set_genres
                          if st_dico_nbtracks[gr] < 1000]
nb = len(set_genres) - len(not_popular_top_genres)
print("There are {:*^11} different genres in the new column".format(nb))
print("{:-^75}".format(""))
print("We must drop {:>11} indexes.".format(len(nan_indexes)))
print("{:-^75}".format(""))
for different_genre in set_genres:
    args = [different_genre, st_dico_nbtracks[different_genre]]
    print("genre new col {:^25s} nb tracks {:->20}".format(*args))


There are ****54***** different genres in the new column
---------------------------------------------------------------------------
We must drop        4146 indexes.
---------------------------------------------------------------------------
genre new col        Psych-Folk         nb tracks ----------------2267
genre new col           Punk            nb tracks ----------------5421
genre new col         Classical         nb tracks ----------------2101
genre new col          Garage           nb tracks ----------------3373
genre new col     Singer-Songwriter     nb tracks ----------------4162
genre new col          Ambient          nb tracks ----------------7206
genre new col        Minimalism         nb tracks ----------------1392
genre new col  Contemporary Classical   nb tracks ----------------1239
genre new col         Post-Rock         nb tracks ----------------1477
genre new col      Electroacoustic      nb tracks ----------------6110
genre new col           Drone           nb trac

We can see that some genres with fewer than 1,000 tracks remain in the new column. They are in fact genres that come from the column _genre_top_. The corresponding rows are also dropped, since these genres would be very hard to classify with so little data.

We end up with a new column with 54 different genres. It would probably be difficult to classify efficiently with so many classes, but we have dropped only 4,146 rows.
A next step could be to aggregate close labels to reduce this high number of classes.

### We create a new column _target_genre_ with no _NaN_ values.

We will try to predict the classes in that column.

In [None]:
tracks[('track', 'target_genre')] = new_column
tracks_reduced = tracks.copy()
before = "Initial shape of tracks"
after = "New shape of tracks_reduced"
print("{:30s} {:>20}".format(before, str(tracks_reduced.shape)))
tracks_reduced.drop(nan_indexes, inplace=True)
print("{:30s} {:>20}".format(after, str(tracks_reduced.shape)))
print("\n{:-^75}\n".format(""))

# We should also do that on features
before2 = "Initial shape of features"
after2 = "New shape of features_reduced"
features_reduced = features.copy()
print("{:30s} {:>20}".format(before2, str(features_reduced.shape)))
features_reduced.drop(nan_indexes, inplace=True)
print("{:30s} {:>20}".format(after2, str(features_reduced.shape)))

Initial shape of tracks                (106574, 53)
New shape of tracks_reduced            (102428, 53)

---------------------------------------------------------------------------

Initial shape of features             (106574, 518)
New shape of features_reduced         (102428, 518)


#### Verification of the genres in the column target_genre

In [None]:
# Let's look at the classes and the number of tracks in the dataset
list_see = [(genre_new_col, st_dico_nbtracks[genre_new_col])
            for genre_new_col in set(tracks_reduced[('track', 'target_genre')])]
st_list_see = list(sorted(list_see, key=lambda x: x[1], reverse=True))
for genre_new_col, nb_tr in st_list_see:
    print("{:30} {:>10} tracks".format(genre_new_col, nb_tr))


Experimental                        24912 tracks
Electronic                          23866 tracks
Avant-Garde                          8693 tracks
Rock                                 8038 tracks
Noise                                7268 tracks
Ambient                              7206 tracks
Experimental Pop                     7144 tracks
Folk                                 7105 tracks
Pop                                  6362 tracks
Electroacoustic                      6110 tracks
Instrumental                         6055 tracks
Lo-Fi                                6041 tracks
Hip-Hop                              5922 tracks
Ambient Electronic                   5723 tracks
Soundtrack                           5575 tracks
Indie-Rock                           5432 tracks
Punk                                 5421 tracks
Improv                               4261 tracks
Singer-Songwriter                    4162 tracks
IDM                                  3472 tracks
Garage              

Let's merge some genres into a larger, related genre:

* Power-Pop into Pop
* Minimal Electronic into Electronic
* Dubstep into techno
* Hip-Hop Beats into Hip-Hop
* Chiptune and Chip Music into techno
* Contemporary Classical into Classical
* Freak-Folk into Folk
* Free-Jazz into Jazz
* Minimalism into Instrumental
* Post-Rock, Indie-Rock and Noise-Rock into Rock
* Post-Punk into Punk
* Ambient Electronic into Ambient
Let's drop:
* Hardcore
* Unclassifiable
* Sound Art
* Dance
* Blues
* Sound Collage
* Synth Pop
* International
* Trip-Hop
* Punk
* Improv
* Singer-Songwriter
* IDM
* Garage
* Drone
* Glitch
* Psych-Rock
* Psych-Folk
* Downtempo
* Techno

#### Aggregation of some classes and removal of genres with very few tracks in the dataset

In [None]:
dico_remplacement = {'Power-Pop': 'Pop',
                     'Minimal Electronic': 'Electronic',
                     'Dubstep': 'techno',
                     'Hip-Hop beats': 'Hip-Hop',
                     'Chiptune': 'techno',
                     'Chip Music': 'techno',
                     'Contemporary Classical': 'Classical',
                     'Freak-Folk': 'Folk',
                     'Free-Jazz': 'Jazz',
                     'Minimalism': 'Instrumental',
                     'Post-rock': 'Rock', 
                     'Indie-Rock': 'Rock',
                     'Noise-Rock': 'Rock',
                     'Post-Punk': 'Punk',
                     'Ambient Electronic ': 'Ambient'}

to_drop = ['Hardcore', 'Unclassifiable', 'Sound Art', 'Dance', 'Blues']
to_drop += ['Sound Collage', 'Synth Pop', 'International', 'Trip-Hop']
to_drop += ['Punk', 'Improv', 'Singer-Songwriter', 'IDM', 'Garage', 'Drone']
to_drop += ['Glitch', 'Psych-Rock', 'Psych-Folk', 'Downtempo', 'Techno']                                  

new_column_2, bad_indexes = [], []
rplcmt, dro = 0, 0
for index, target_gr in tracks_reduced[('track', 'target_genre')].items():
    # We replace by a bigger connected genre
    if target_gr in dico_remplacement.keys():
        new_column_2.append(dico_remplacement[target_gr])
        rplcmt += 1
    # We check if this is to drop
    elif target_gr in to_drop:
        # We will drop that anyway
        new_column_2.append("Minimalism")
        bad_indexes.append(index)
        dro += 1
    else:
        new_column_2.append(target_gr)

length_tr = tracks_reduced.shape[0]
nb_classes = len(set(new_column_2))
print("We have done {} replacements and dropped {} lines.".format(rplcmt, dro))
print("\n{:-^75}\n".format("Verification"))
print("{:<25} {:>15}".format("length of new column_2", len(new_column_2)))
print("{:<25} {:>15}".format("length of tracks_reduces", length_tr))
print("\n{:-^75}\n".format(""))
print("Number of classes: {}.".format(nb_classes))


We have done 1112 replacements and dropped 4709 lines.

-------------------------------Verification--------------------------------

length of new column_2             102428
length of tracks_reduces           102428

---------------------------------------------------------------------------

Number of classes: 25.
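Since the test `target_gr in dico_remplacement` is an exact string match, a key that differs from the genre title by its case or by a trailing space is silently never applied. The sketch below shows one possible sanity check on toy data. `known_titles` and `toy_mapping` are hypothetical names; the titles are spelled as they appear in the outputs of this notebook.

```python
# Genre titles spelled as in the outputs of this notebook
known_titles = {"Post-Rock", "Hip-Hop Beats", "Indie-Rock",
                "Ambient Electronic", "Pop", "Rock"}
# Hypothetical excerpt of a replacement mapping
toy_mapping = {"Post-rock": "Rock", "Indie-Rock": "Rock",
               "Ambient Electronic ": "Ambient"}

# Keys that match no known title exactly
unmatched = sorted(key for key in toy_mapping if key not in known_titles)
print(unmatched)

# A whitespace- and case-insensitive lookup suggests the intended title
normalised = {title.strip().lower(): title for title in known_titles}
suggestions = {key: normalised.get(key.strip().lower()) for key in unmatched}
print(suggestions)
```

Running such a check against the titles in `dict_genres` before building the new column is one way to confirm that every planned replacement is actually applied.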


In [None]:
# We add the new column to the dataset
tracks_reduced[('track', 'target_genre')] = new_column_2
bef = "Initial shape of tracks"
aft = "New shape of tracks_reduced"
print("{:30s} {:>20}".format(bef, str(tracks_reduced.shape)))
tracks_reduced.drop(bad_indexes, inplace=True)
print("{:30s} {:>20}".format(aft, str(tracks_reduced.shape)))
print("\n{:-^75}\n".format(""))

# We should also do that on features
bef2 = "Initial shape of features"
aft2 = "New shape of features_reduced"
print("{:30s} {:>20}".format(bef2, str(features_reduced.shape)))
features_reduced.drop(bad_indexes, inplace=True)
print("{:30s} {:>20}".format(aft2, str(features_reduced.shape)))


Initial shape of tracks                (102428, 53)
New shape of tracks_reduced             (97719, 53)

---------------------------------------------------------------------------

Initial shape of features             (102428, 518)
New shape of features_reduced          (97719, 518)


### Classes aggregation using a Random Forest Classifier

The idea is to use the predicted probabilities of a random forest classifier to detect similar types of music.

For each genre, we average the vectors of predicted probabilities over all the tracks of that genre, which gives one mean vector per genre.

We proceed in 2 steps:

1. We create a dictionary that maps each genre to the positions of the tracks of that genre
2. We use these positions to isolate the predictions for each genre and construct a mean vector whose length is the number of classes
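The two steps can be sketched on toy data as a single `groupby`. The probabilities below are made up, and `classes` stands for the column order of the probability matrix, which is an assumption of the example.

```python
import numpy as np
import pandas as pd

classes = np.array(["Folk", "Pop", "Rock"])   # assumed column order
true_genre = np.array(["Folk", "Rock", "Rock", "Pop"])
# Hypothetical predicted probabilities, one row per track
probas = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.3, 0.1, 0.6],
                   [0.2, 0.5, 0.3]])

# Label the columns, then average the rows of each true genre
mean_probas = pd.DataFrame(probas, columns=classes).groupby(true_genre).mean()
print(mean_probas.round(2))

# Genre with the highest mean probability, for each true genre
closest = mean_probas.idxmax(axis=1).to_dict()
print(closest)
```

Two genres whose mean vectors put a large weight on each other would be natural candidates for aggregation.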

In [None]:
# Let's first gather in a dict the indexes of each genre in 'target_genre'
# Before, we need to create a column with regular indexes
n = len(tracks_reduced)
tracks_reduced[('track', 'false_index')] = [i for i in range(n)]

dict_indexes_genres = {}
for genre in set(tracks_reduced[('track', 'target_genre')]):
    sub_df = tracks_reduced[tracks_reduced[('track', 'target_genre')] == genre]
    dict_indexes_genres[genre] = sub_df[('track', 'false_index')].values

for i, (genre, indexes) in enumerate(dict_indexes_genres.items()):
    print("{:<25s} some indexes {:>6} {:>6} {:>6}".format(genre, *indexes[:3]))
    if i == 5:
        break


Punk                      some indexes  16716  16717  16753
Classical                 some indexes   2579   2580   4783
Ambient                   some indexes    563    564    565
Post-Rock                 some indexes  29141  29142  69169
Electroacoustic           some indexes  13729  13730  13731
Instrumental              some indexes    591    592    593


#### Let's see what a random forest predicts for each genre. We can use these predictions to detect close genres.

In [None]:
# It takes around 5 min to fit the RFC.
# The probabilities below are computed on the same rows used for fitting.
Y = tracks_reduced[('track', 'target_genre')]
X = features_reduced['mfcc']
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X, Y)
X_preds = clf.predict_proba(X)

In [None]:
# Mean predictions for each class
dict_mean_probas_genre = {}
for genre, indexes in dict_indexes_genres.items():
    mean_probas = np.mean(X_preds[indexes], axis=0)
    dict_mean_probas_genre[genre] = mean_probas

dict_tmp = {i: genre for i, genre in enumerate(dict_mean_probas_genre.keys())}

phrase = "mean probabilities on each track of that genre"
print("{:<30} {:<55}".format("genre", phrase))
print("{:-^90}".format(""))
for genre, mean_vect in dict_mean_probas_genre.items():
    clean_mean_vect = [(round(elt, 2), dict_tmp[i])
                       for i, elt in enumerate(mean_vect)]
    interesting_preds = list(filter(lambda x: x[0] > 0.15, clean_mean_vect))
    correct = True if interesting_preds[0][1] == genre else False
    print("{:<30} {:<55} {}".format(genre, str(interesting_preds), correct))


genre                          mean probabilities on each track of that genre         
------------------------------------------------------------------------------------------
Field Recordings               [(0.66, 'Noise')]                                       False
Folk                           [(0.72, 'Post-Rock')]                                   False
techno                         [(0.16, 'Jazz'), (0.65, 'Avant-Garde')]                 False
Hip-Hop Beats                  [(0.63, 'Lo-Fi')]                                       False
Pop                            [(0.66, 'Instrumental')]                                False
Jazz                           [(0.65, 'Soundtrack')]                                  False
Musique Concrete               [(0.67, 'Ambient')]                                     False
Electroacoustic                [(0.65, 'Pop')]                                         False
Noise                          [(0.65, 'Experimental')]                       

We can see that it is not an easy task, as the mean predictions of the RFC are almost always wrong.

Unfortunately, we cannot use these results to aggregate classes since they are not good enough.
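One point is worth checking when reading such a table. In scikit-learn, column `i` of `predict_proba` corresponds to `clf.classes_[i]`, where the labels are sorted. That order generally differs from the iteration order of a dictionary built from a set of genres. The sketch below, on random toy data, shows a column-to-genre mapping derived from `classes_`. The names `clf_toy`, `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 4))
y_toy = np.array(["Rock", "Folk", "Pop"] * 20)   # labels in arbitrary order

clf_toy = RandomForestClassifier(n_estimators=5, random_state=0)
clf_toy.fit(X_toy, y_toy)
probas = clf_toy.predict_proba(X_toy)

# classes_ holds the sorted labels, in the column order of predict_proba
print(clf_toy.classes_.tolist())

# Column index -> genre mapping that is safe to use with predict_proba
index_to_genre = dict(enumerate(clf_toy.classes_.tolist()))
print(index_to_genre)
```

Building the equivalent of `dict_tmp` from `clf.classes_` would guarantee that each mean probability is reported under the right genre name.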

## Neural Network for Multiclass Classification


### Model with 24 classes

In [None]:
# We use an architecture with 3 dense layers: 2 hidden layers and a softmax
# output layer
def get_model(n_inputs, n_outputs):
    model = Sequential()
    model.add(Dense(30, input_dim=n_inputs, kernel_initializer='he_uniform',
                    activation='relu'))
    model.add(Dense(25, activation = 'relu'))
    model.add(Dense(n_outputs, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
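To clarify what the last layer and the loss compute, here is a NumPy sketch of a softmax followed by the categorical cross-entropy. It is a simplified illustration, not the Keras implementation; the function names and the `eps` clipping value are choices made for this example.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(onehot, probas, eps=1e-7):
    # Mean over samples of -log(probability given to the true class)
    clipped = np.clip(probas, eps, 1.0)
    return float(-np.mean(np.sum(onehot * np.log(clipped), axis=1)))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
onehot = np.array([[1, 0, 0],
                   [0, 0, 1]])
probas = softmax(logits)
loss = categorical_crossentropy(onehot, probas)
print(probas.sum(axis=1))   # each row sums to 1
print(loss)
```

With 24 classes, a model that predicts a uniform distribution has a loss of log(24), about 3.18, which gives a baseline for reading the training loss.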


### Cross-Validation of the model

In [None]:
Y = tracks_reduced[('track', 'target_genre')].to_numpy()
X = features_reduced['mfcc'].to_numpy()
print('{} features, {} classes'.format(X.shape[1], np.unique(Y).size))

# cross-validation
n_inputs, n_outputs = X.shape[1], np.unique(Y).size
skf = StratifiedKFold(n_splits=3)

liste_res = []
for i, (train, test) in enumerate(skf.split(X, Y)):
    x_train, x_test = X[train], X[test]
    y_train, y_test = Y[train], Y[test]

    # Verification number of classes in train and test
    if np.unique(y_train).size != np.unique(y_test).size:
        end = "number of classes in train and test"
        print("Bad shuffling. There is a different " + end)
        cl_ytr = np.unique(y_train).size
        cl_yte = np.unique(y_test).size
        print("y_train {} != {} y_test".format(cl_ytr, cl_yte))
    else:
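        # fit the encoder on the training labels and reuse it for the test
        # labels, so that both one-hot matrices share the same columns
        binarizer = LabelBinarizer().fit(y_train)
        labels_onehot_train = binarizer.transform(y_train)
        labels_onehot_test = binarizer.transform(y_test)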
        model = get_model(n_inputs, n_outputs)
        # fit model
        model.fit(x_train, labels_onehot_train, verbose=0, epochs=100)
        # make a prediction on the test set
        yhat = model.predict(x_test)
        # threshold the probabilities at 0.5 to get 0/1 predictions
        # (a row stays all zeros when no class probability exceeds 0.5)
        yhat = yhat.round()
        # calculate accuracy
        acc = accuracy_score(labels_onehot_test, yhat)
        # store result
        liste_res.append(acc)
        print("fold {} accuracy {:*^11.4f}".format(i, acc))

print("Mean accuracy {:-^15.3f}".format(np.mean(liste_res)))


140 features, 24 classes
fold 0 accuracy **0.2279***
fold 1 accuracy **0.2273***
Mean accuracy -----0.228-----
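
Note that `yhat.round()` thresholds each softmax probability at 0.5, so a track for which no class reaches 0.5 gets an all-zero row and is counted as an error by `accuracy_score`. The sketch below, on made-up probabilities, compares this exact-match accuracy with the accuracy obtained by picking the most probable class with `argmax`. It only illustrates the difference between the two decoding rules and is not a re-run of the model above.

```python
import numpy as np

# Hypothetical softmax outputs for 4 tracks and 3 classes
probs = np.array([
    [0.70, 0.20, 0.10],   # confident and correct
    [0.40, 0.35, 0.25],   # correct class is first, but below 0.5
    [0.30, 0.45, 0.25],   # correct class is first in rank, but below 0.5
    [0.10, 0.30, 0.60],   # confident and wrong
])
y_true = np.array([0, 0, 1, 1])
onehot_true = np.eye(3)[y_true]

# Decoding used in the notebook: threshold each probability at 0.5
rounded = probs.round()
acc_round = np.mean(np.all(rounded == onehot_true, axis=1))
n_empty = int((rounded.sum(axis=1) == 0).sum())

# Alternative decoding: always pick the most probable class
acc_argmax = np.mean(probs.argmax(axis=1) == y_true)

print(acc_round)    # 0.25
print(acc_argmax)   # 0.75
print(n_empty)      # 2 rows received no label at all
```

With many classes, the probabilities are spread more thinly and fewer rows exceed the 0.5 threshold, so the accuracies reported with `round()` can be read as a conservative estimate.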


#### Let's look at the detail for each class

In [None]:
print(sklearn.metrics.classification_report(labels_onehot_test, yhat))

              precision    recall  f1-score   support

           0       0.25      0.00      0.00      1048
           1       0.00      0.00      0.00       289
           2       0.00      0.00      0.00       742
           3       0.40      0.19      0.26       667
           4       0.00      0.00      0.00       207
           5       0.53      0.35      0.42     10228
           6       0.62      0.21      0.31     14022
           7       0.00      0.00      0.00      1521
           8       0.00      0.00      0.00       214
           9       0.64      0.09      0.16      3147
          10       0.58      0.23      0.33      2196
          11       0.00      0.00      0.00        32
          12       0.00      0.00      0.00        51
          13       0.07      0.00      0.00      1784
          14       0.00      0.00      0.00       482
          15       0.00      0.00      0.00       484
          16       0.00      0.00      0.00         8
          17       0.00    



The model effectively predicts only 8 of the 24 classes.

Since it is unable to predict the classes that have few examples, let's focus on the main classes only.
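
One simple way to choose which classes to keep is to rank the genres by number of tracks and retain the most frequent ones. The sketch below uses a made-up list of labels and a hypothetical `n_main` cut-off; it is only an illustration and not necessarily how the `to_predict` list in the next cells was built.

```python
from collections import Counter

# Made-up target labels standing in for the 'target_genre' column
labels = ['Rock', 'Rock', 'Electronic', 'Folk', 'Rock',
          'Electronic', 'Jazz', 'Folk', 'Electronic', 'Blues']

n_main = 3  # hypothetical number of classes to keep
counts = Counter(labels)
main_classes = [genre for genre, _ in counts.most_common(n_main)]

# Keep only the samples whose label is one of the main classes
kept = [label for label in labels if label in main_classes]
print(main_classes, len(kept), "of", len(labels), "samples kept")
```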

### Classification Model with 8 classes

In [None]:
# Redefine get_model with 4 layers (3 hidden layers and the softmax output)
def get_model(n_inputs, n_outputs):
    model = Sequential()
    model.add(Dense(20, input_dim=n_inputs, kernel_initializer='he_uniform', activation='relu'))
    model.add(Dense(15, activation = 'relu'))
    model.add(Dense(10, activation = 'relu'))
    model.add(Dense(n_outputs, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

In [None]:
to_predict = ['Experimental',
              'Electronic',  
              'Avant-Garde',
              'Rock',
              'Noise',
              'Ambient',
              'Experimental Pop',
              'Folk']       

Y = tracks_reduced[('track', 'target_genre')]
X = features_reduced['mfcc']

# Keep only the tracks whose target genre is listed in to_predict
df = tracks_reduced[Y.isin(to_predict)]

to_drop = set(X.index) - set(df.index)
print("We have dropped {} indexes".format(len(to_drop)))
Y = df[('track', 'target_genre')]
X = features_reduced.drop(list(to_drop))
X = X['mfcc']
print("X.shape", X.shape)
print("Y.shape", Y.shape)


We have dropped 17851 indexes
X.shape (79868, 140)
Y.shape (79868,)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.2)

print('{} training examples, {} testing examples'.format(y_train.size, y_test.size))
print('{} features, {} classes'.format(X_train.shape[1], np.unique(y_train).size))

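# fit the encoder on the training labels and reuse it for the test labels,
# so that both one-hot matrices share the same columns
binarizer = LabelBinarizer().fit(y_train)
labels_onehot_train = binarizer.transform(y_train)
labels_onehot_test = binarizer.transform(y_test)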

# Without cross-validation
n_inputs, n_outputs = X_train.shape[1], np.unique(y_train).size
model = get_model(n_inputs, n_outputs)

# fit model
model.fit(X_train, labels_onehot_train, verbose=0, epochs=100)

# make a prediction on the test set
yhat = model.predict(X_test)

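# threshold the probabilities at 0.5 to get 0/1 predictions
# (a row stays all zeros when no class probability exceeds 0.5)
yhat = yhat.round()

# calculate accuracy (exact match between one-hot rows)
acc = accuracy_score(labels_onehot_test, yhat)

# print the result
print("accuracy {:*^11.4f}".format(acc))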

63894 training examples, 15974 testing examples
140 features, 8 classes
accuracy **0.4283***
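
When the labels of the two splits are one-hot encoded, the columns are only comparable if both splits use the same class order. Below is a minimal NumPy sketch of an encoding fitted on the training labels and reused on the test labels, with made-up genres. The helper `one_hot` is hypothetical and stands in for `LabelBinarizer.fit` followed by `transform`.

```python
import numpy as np

def one_hot(labels, classes):
    """One column per entry of `classes`, in that fixed order."""
    labels = np.asarray(labels)
    return (labels[:, None] == classes[None, :]).astype(int)

y_train = np.array(['Rock', 'Folk', 'Electronic', 'Rock'])
y_test = np.array(['Rock', 'Rock', 'Folk'])   # 'Electronic' is absent here

classes = np.unique(y_train)                  # sorted: Electronic, Folk, Rock
train_onehot = one_hot(y_train, classes)
test_onehot = one_hot(y_test, classes)

# Both matrices have 3 columns, even though the test split has only 2 genres
print(train_onehot.shape, test_onehot.shape)
```

An encoder fitted on `y_test` alone would not know about `'Electronic'`, and its columns would no longer line up with the outputs of the model.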


Let's look at the detail of the results for each class.

In [None]:
print(sklearn.metrics.classification_report(labels_onehot_test, yhat))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       399
           1       0.00      0.00      0.00       314
           2       0.70      0.38      0.50      4182
           3       0.62      0.52      0.56      5637
           4       0.71      0.01      0.02       603
           5       0.65      0.23      0.34      1233
           6       0.44      0.03      0.05       267
           7       0.71      0.61      0.66      3339

   micro avg       0.66      0.43      0.52     15974
   macro avg       0.48      0.22      0.26     15974
weighted avg       0.64      0.43      0.49     15974
 samples avg       0.43      0.43      0.43     15974
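
The gap between the micro-averaged precision (0.66) and recall (0.43) can be read as a side effect of rounding the softmax outputs: each row of `yhat` contains at most one predicted label, so precision is computed over the rows that received a label while recall is computed over all rows. Here is a back-of-the-envelope sketch based on the figures printed above (approximate, since the report keeps only two decimals):

```python
n_test = 15974           # test examples, from the output above
accuracy = 0.4283        # share of rows predicted exactly right
micro_precision = 0.66   # from the classification report

# Rows with the correct label predicted (true positives)
true_positives = accuracy * n_test
# Precision = true positives / rows that received a label
rows_with_label = true_positives / micro_precision
share_without_label = 1 - rows_with_label / n_test

print("about {:.0%} of test rows get no predicted label".format(share_without_label))
```

Under these assumptions, roughly one third of the test tracks receive no label at all, which weighs on the reported accuracy.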





## Model with the 4 main classes

As in the Databricks notebook, we try to predict only the main classes. Here, however, we have more rows, since we have filled in some of the missing values in the target column.

In [None]:
to_predict = ['Electronic', 'Rock', 'Folk', 'Hip-Hop']       

Y = tracks_reduced[('track', 'target_genre')]
X = features_reduced['mfcc']

# Keep only the tracks whose target genre is listed in to_predict
df2 = tracks_reduced[Y.isin(to_predict)]

to_drop = set(X.index) - set(df2.index)
print("We have dropped {} indexes".format(len(to_drop)))
Y = df2[('track', 'target_genre')]
X = features_reduced.drop(list(to_drop))
X = X['mfcc']
print("X.shape", X.shape)
print("Y.shape", Y.shape)


We have dropped 49593 indexes
X.shape (48126, 140)
Y.shape (48126,)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.2)

print('{} training examples, {} testing examples'.format(y_train.size, y_test.size))
print('{} features, {} classes'.format(X_train.shape[1], np.unique(y_train).size))

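# fit the encoder on the training labels and reuse it for the test labels,
# so that both one-hot matrices share the same columns
binarizer = LabelBinarizer().fit(y_train)
labels_onehot_train = binarizer.transform(y_train)
labels_onehot_test = binarizer.transform(y_test)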

# Without cross-validation
n_inputs, n_outputs = X_train.shape[1], np.unique(y_train).size
model = get_model(n_inputs, n_outputs)

# fit model
model.fit(X_train, labels_onehot_train, verbose=0, epochs=100)

# make a prediction on the test set
yhat = model.predict(X_test)

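# threshold the probabilities at 0.5 to get 0/1 predictions
# (a row stays all zeros when no class probability exceeds 0.5)
yhat = yhat.round()

# calculate accuracy (exact match between one-hot rows)
acc = accuracy_score(labels_onehot_test, yhat)

# print the result
print("accuracy {:*^11.4f}".format(acc))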

38500 training examples, 9626 testing examples
140 features, 4 classes
accuracy **0.6968***


In [None]:
print(sklearn.metrics.classification_report(labels_onehot_test, yhat))

              precision    recall  f1-score   support

           0       0.78      0.79      0.78      4036
           1       0.72      0.54      0.61      1265
           2       0.76      0.34      0.47       914
           3       0.83      0.75      0.79      3411

   micro avg       0.79      0.70      0.74      9626
   macro avg       0.77      0.60      0.66      9626
weighted avg       0.79      0.70      0.73      9626
 samples avg       0.70      0.70      0.70      9626





We can see that the model performs reasonably well when there are only 4 classes to predict (accuracy of about 0.70). There is probably not enough data to predict more classes. It would also be interesting to use more features to predict the genres.