# <font color=red>DATA GATHERING II: MUSIC GENRES AND SUBGENRES</font>

In [1]:
import pandas as pd
import numpy as np
import time
import math
import tqdm
import warnings
warnings.filterwarnings('ignore')

## <font color=blue>1) Genres and subgenres</font>

Sources: https://www.musicgenreslist.com/ plus other genres listed in Musicbrainz. Total: 927 genres, grouped into 14 main genres:

- Blues
- Classical
- Country
- Electronic
- Folk
- Hip Hop
- Jazz
- Latin
- Pop
- Punk
- Rhythm & Blues (R&B) / Soul
- Rock
- World (local music genres from specific regions of the world)
- Others (This category contains all the subgenres I haven't been able to classify in the previous categories)

According to Musicbrainz's Genre description in https://wiki.musicbrainz.org/Genre:

"Genres are currently supported in MusicBrainz as part of the tag system.

Some tags (the ones in the genre list) are automatically read and presented as genres."

What we want for our visualization is to have, for each release, its main genre and, where possible, its subgenre. To do so, I have copied Musicbrainz's "genre list" into a csv file. There are 419 elements considered as genres by Musicbrainz, but for our study we'll consider them as our subgenres.

Of course, I wasn't familiar with all the genres appearing in the list so, in order to classify them, I looked at their definition in Wikipedia and chose the best main genre for each. If Wikipedia provided no definition, I searched for the genre in Google and listened to a representative song in order to make a decision.

In [2]:
all_genres = pd.read_csv('Main_genre_list.csv', sep='\t', header=0, encoding='utf-8')
all_genres.head()

Unnamed: 0,Main_genre,subgenre
0,Others,2 tone
1,Electronic,2 step
2,Electronic,4 beat
3,Electronic,4×4
4,Electronic,8bit


As we read before, Musicbrainz's genre list (subgenre for us) is part of their tag system. Let's import the Musicbrainz's "tags" table and try to identify, from its elements, the ones that are genres.

In [3]:
tags = pd.read_csv('Musicbrainz/Tables_used/tags.txt',sep='\t', header=None, engine='c', usecols=[0,1])
tags.columns = ['tag_id','tag_name']
tags.head()

Unnamed: 0,tag_id,tag_name
0,95,finnish
1,23,slovak
2,801,iowa
3,4,groundbreaking
4,130,taiwanese


In [4]:
#How many tags are there?
tags['tag_id'].nunique()

86806

In [5]:
#What do the tags look like?
tags.tag_name.value_counts()

herb recordings                                                                                         2
ur so fail                                                                                              2
indie rock                                                                                              2
enigma                                                                                                  2
universidad austral de chile                                                                            2
post rock                                                                                               2
prog rock                                                                                               2
materialeyes                                                                                            2
campus miraflores                                                                                       2
concept album                                 

As we can see, the tags list contains the genres but also other (more subjective) expressions that some users have chosen as representative for the music entity. 

We will add columns to this tags dataframe to distinguish which of them are actually genres/subgenres. As we will do the matching by tag_name, we have to format the tag_names as the ones in all_genres: without punctuation and in lower case.

In [6]:
#We first normalize in lower case the tag_names:
tags['tag_name'] = tags['tag_name'].str.lower()

In [7]:
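#We replace the punctuation with a space.
#Note: str.replace receives the whole string as a single pattern rather than as a set of characters,
#so tags such as "pop-jazz" or "club/dance" keep their punctuation (see the outputs further down):
tags['tag_name'] = tags['tag_name'].str.replace('#!?()*-%"/\,<>:$@.',' ')
#We remove leading & trailing spaces: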
tags['tag_name'] = tags['tag_name'].str.strip()
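To replace each punctuation character individually, one option is a regular-expression character class. The following is a minimal sketch on made-up tag names, not the cell that produced the outputs in this notebook; `sample` and `PUNCT_PATTERN` are hypothetical names introduced for the illustration.

```python
import pandas as pd

# Hypothetical sample of raw tag names (not taken from the Musicbrainz dump).
sample = pd.Series(['Pop-Jazz', 'club/dance', '  Hip Hop! ', 'r&b'])

# Character class: every character listed between the brackets is replaced on its own.
PUNCT_PATTERN = r'[#!?()*\-%"/\\,<>:$@.]'

normalized = (
    sample.str.lower()
          .str.replace(PUNCT_PATTERN, ' ', regex=True)
          .str.replace(r'\s+', ' ', regex=True)   # collapse repeated spaces
          .str.strip()
)
print(normalized.tolist())
```

If this variant were used, the `subgenre` column of `all_genres` would need the same treatment for the merge on names to keep matching.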

In [8]:
#And now we can do the merging:
tags_genres = pd.merge(tags, all_genres, how='left', left_on='tag_name', right_on='subgenre')
tags_genres.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre
0,95,finnish,,
1,23,slovak,,
2,801,iowa,,
3,4,groundbreaking,,
4,130,taiwanese,,


In [9]:
#How many subgenres did we identify?
pd.notna(tags_genres['Main_genre']).value_counts()

False    86021
True       785
Name: Main_genre, dtype: int64

In [10]:
#What kind of tag_names haven't been associated with a Main genre?
tags_genres[tags_genres['Main_genre'].isnull()]

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre
0,95,finnish,,
1,23,slovak,,
2,801,iowa,,
3,4,groundbreaking,,
4,130,taiwanese,,
5,134,thai,,
6,154,war,,
7,52,netlabel,,
8,101,cotm,,
9,82,punkrock,,


As we can see above, some of the tags that don't have a Main genre associated could be easily classified (for instance: "punkrock", or "dark metal"). 

Musicbrainz does not consider those tag names as subgenres, but they do tell us something about the release's main genre. We will treat them as subgenres and identify their main genre.

What I will do now is retrieve more information about these genreless tag_names in order to be able to classify them:

In [11]:
#Creating a specific dataframe for them:
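#.copy() gives an independent dataframe, so the columns added below are not written to a slice of tags_genres:
genreless = tags_genres[pd.notna(tags_genres.tag_name) & pd.isnull(tags_genres.Main_genre)].copy()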
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre
0,95,finnish,,
1,23,slovak,,
2,801,iowa,,
3,4,groundbreaking,,
4,130,taiwanese,,


In [12]:
#We create new columns to retrieve some information about the content of each tag:
genreless['Blues'] = np.nan
genreless['Classical'] = np.nan
genreless['Country'] = np.nan
genreless['Electronic'] = np.nan
genreless['Folk'] = np.nan
genreless['Hip_Hop'] = np.nan
genreless['Jazz'] = np.nan
genreless['Latin'] = np.nan
genreless['Pop'] = np.nan
genreless['Punk'] = np.nan
genreless['RB'] = np.nan
genreless['Rock'] = np.nan
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock
0,95,finnish,,,,,,,,,,,,,,
1,23,slovak,,,,,,,,,,,,,,
2,801,iowa,,,,,,,,,,,,,,
3,4,groundbreaking,,,,,,,,,,,,,,
4,130,taiwanese,,,,,,,,,,,,,,


In [13]:
#We create a column tag_name_clean where the text is formatted (remove punctuation, concatenate all words):
punctuation = ['#','!','?','(',')','*','-','%',' ',',',"'",'.','"','/','<','>',':']
genreless['tag_name_clean'] = genreless['tag_name'].apply(lambda x: ''.join(c for c in x if c not in punctuation))
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock,tag_name_clean
0,95,finnish,,,,,,,,,,,,,,,finnish
1,23,slovak,,,,,,,,,,,,,,,slovak
2,801,iowa,,,,,,,,,,,,,,,iowa
3,4,groundbreaking,,,,,,,,,,,,,,,groundbreaking
4,130,taiwanese,,,,,,,,,,,,,,,taiwanese


In [14]:
#We create a pattern of words that could be associated with each genre:
Blues = 'blues'
Classical = 'classical|symphony|orchestra|stringquartet'
Country = 'country'
Electronic = 'electronic|electr|dance|house'
Folk = 'folk'
Hip_Hop = 'hiphop|rap'
Jazz = 'jazz|jamband'
Latin = 'latin'
Pop = 'pop'
Punk = 'punk'
#Note: tag_name_clean is lower case, so the 'R&B' alternative below cannot match as written:
RB = 'rhythmandblues|rythmandblues|R&B'
Rock = 'rock|metal'

In [15]:
#And now we fill each genre column by searching if the column tag_name_clean contains the patterns:
genreless.Blues = np.where(genreless.tag_name_clean.str.contains(Blues), 'Blues', np.nan)
genreless.Classical = np.where(genreless.tag_name_clean.str.contains(Classical), 'Classical', np.nan)
genreless.Country = np.where(genreless.tag_name_clean.str.contains(Country), 'Country', np.nan)
genreless.Electronic = np.where(genreless.tag_name_clean.str.contains(Electronic), 'Electronic', np.nan)
genreless.Folk = np.where(genreless.tag_name_clean.str.contains(Folk), 'Folk', np.nan)
genreless.Hip_Hop = np.where(genreless.tag_name_clean.str.contains(Hip_Hop), 'Hip Hop', np.nan)
genreless.Jazz = np.where(genreless.tag_name_clean.str.contains(Jazz), 'Jazz', np.nan)
genreless.Latin = np.where(genreless.tag_name_clean.str.contains(Latin), 'Latin', np.nan)
genreless.Pop = np.where(genreless.tag_name_clean.str.contains(Pop), 'Pop', np.nan)
genreless.Punk = np.where(genreless.tag_name_clean.str.contains(Punk), 'Punk', np.nan)
genreless.RB = np.where(genreless.tag_name_clean.str.contains(RB), 'RB', np.nan)
genreless.Rock = np.where(genreless.tag_name_clean.str.contains(Rock), 'Rock', np.nan)

In [16]:
genreless.head(1000)

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock,tag_name_clean
0,95,finnish,,,,,,,,,,,,,,,finnish
1,23,slovak,,,,,,,,,,,,,,,slovak
2,801,iowa,,,,,,,,,,,,,,,iowa
3,4,groundbreaking,,,,,,,,,,,,,,,groundbreaking
4,130,taiwanese,,,,,,,,,,,,,,,taiwanese
5,134,thai,,,,,,,,,,,,,,,thai
6,154,war,,,,,,,,,,,,,,,war
7,52,netlabel,,,,,,,,,,,,,,,netlabel
8,101,cotm,,,,,,,,,,,,,,,cotm
9,82,punkrock,,,,,,,,,,,,Punk,,Rock,punkrock
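The pattern search used here is plain substring matching, so short patterns can also fire inside unrelated words. A small sketch on made-up tag names shows the effect; `patterns` holds three of the patterns defined earlier, while `sample` and `matches` are names introduced only for this illustration.

```python
import pandas as pd

# Three of the patterns from the notebook; the sample tags are hypothetical.
patterns = {'Hip_Hop': 'hiphop|rap', 'Pop': 'pop', 'Rock': 'rock|metal'}
sample = pd.Series(['punkrock', 'traphouse', 'popular', 'gangstarap', 'finnish'])

# One boolean column per genre: True when the pattern occurs anywhere in the tag.
matches = pd.DataFrame({genre: sample.str.contains(pattern, regex=True)
                        for genre, pattern in patterns.items()})
matches.index = sample.values
print(matches)
```

Tags like "traphouse" or "popular" are flagged because "rap" and "pop" occur inside them, which is worth keeping in mind when reading the counts obtained with this technique.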


In [17]:
genreless.replace('nan', np.nan, inplace=True)
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock,tag_name_clean
0,95,finnish,,,,,,,,,,,,,,,finnish
1,23,slovak,,,,,,,,,,,,,,,slovak
2,801,iowa,,,,,,,,,,,,,,,iowa
3,4,groundbreaking,,,,,,,,,,,,,,,groundbreaking
4,130,taiwanese,,,,,,,,,,,,,,,taiwanese


What we want now is to identify the tag_names that contain more than one Main genre (e.g. "poprock") and decide which is the main genre for them.

In [18]:
#We create a column "genre_counts" that sums the number of genres identified for each tag_name:
genreless['genre_counts'] = genreless.iloc[:,4:16].notnull().sum(axis=1)
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock,tag_name_clean,genre_counts
0,95,finnish,,,,,,,,,,,,,,,finnish,0
1,23,slovak,,,,,,,,,,,,,,,slovak,0
2,801,iowa,,,,,,,,,,,,,,,iowa,0
3,4,groundbreaking,,,,,,,,,,,,,,,groundbreaking,0
4,130,taiwanese,,,,,,,,,,,,,,,taiwanese,0


In [19]:
#We gather all the genres in a new column:
start = time.time()

#Object column, so that it can hold the strings written in the loop:
genreless['genres'] = None
genreless.reset_index(drop=True, inplace=True)

for i in tqdm.tqdm(range(len(genreless))):
    if genreless.loc[i, 'genre_counts'] != 0:
        #Boolean mask over the genre columns of row i:
        a = genreless.loc[i, "Blues":"Rock"].notna()
        #Store the names of the matched genres, e.g. "['Punk' 'Rock']":
        genreless.loc[i, 'genres'] = str(a[a].index.values)

end = time.time()
print((end-start)/60)

100%|██████████| 86019/86019 [05:14<00:00, 273.91it/s]

5.234478032588958
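The row-by-row loop takes a few minutes on the full table. As a possible alternative, the same information can be gathered from the boolean mask of the genre columns in one pass. This is only a sketch on a toy frame with a shortened column list; it stores Python lists instead of the string representation used in this notebook, so later cells would need small adjustments if it were adopted.

```python
import numpy as np
import pandas as pd

genre_cols = ['Blues', 'Pop', 'Punk', 'Rock']  # shortened list for the example

# Toy frame mimicking the layout of `genreless` (values are made up).
toy = pd.DataFrame({
    'tag_name': ['punkrock', 'finnish', 'dark metal'],
    'Blues': [np.nan, np.nan, np.nan],
    'Pop':   [np.nan, np.nan, np.nan],
    'Punk':  ['Punk', np.nan, np.nan],
    'Rock':  ['Rock', np.nan, 'Rock'],
})

# True where a genre pattern matched.
flags = toy[genre_cols].notna()
toy['genre_counts'] = flags.sum(axis=1)

# Names of the matched columns for each row; rows without any match stay missing.
toy['genres'] = pd.Series(
    [[col for col, hit in zip(genre_cols, row) if hit] or np.nan
     for row in flags.to_numpy()],
    index=toy.index,
)
print(toy[['tag_name', 'genre_counts', 'genres']])
```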





In [20]:
#We can now get rid of the intermediary columns:
genreless.drop(labels=['subgenre','Blues', 'Classical', 'Country',
       'Electronic', 'Folk', 'Hip_Hop', 'Jazz', 'Latin', 'Pop',
       'Punk', 'RB', 'Rock', 'tag_name_clean'], axis=1, inplace=True)
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,genre_counts,genres
0,95,finnish,,0,
1,23,slovak,,0,
2,801,iowa,,0,
3,4,groundbreaking,,0,
4,130,taiwanese,,0,


In [21]:
#We can fill the main genre column for the ones that have just 1 genre identified:
genreless.Main_genre = np.where(genreless.genre_counts.isin([1]), genreless.genres,genreless.Main_genre )

In [22]:
#How many did we identify?
genreless.Main_genre.isnull().value_counts()

True     76580
False     9439
Name: Main_genre, dtype: int64

Not bad: we were able to retrieve the Main genre for 9,439 tags via this technique.

What we want now is to analyze the cases where there is more than one main genre identified:

In [23]:
multiple_genre = genreless[genreless['genre_counts'] >1]
multiple_genre.head(100)

Unnamed: 0,tag_id,tag_name,Main_genre,genre_counts,genres
9,82,punkrock,,2,['Punk' 'Rock']
13,52611,electro justice rock bbc one madeon remix daft...,,3,['Electronic' 'Punk' 'Rock']
247,58451,echo park echopark rock pop rockpop guildford ...,,2,['Pop' 'Rock']
445,729,pop-jazz,,2,['Jazz' 'Pop']
535,34728,dance acid jazz,,2,['Electronic' 'Jazz']
563,898,irish folk rock,,2,['Folk' 'Rock']
612,31371,popunk,,2,['Pop' 'Punk']
661,1055,jazz metal,,2,['Jazz' 'Rock']
676,1083,piano pop rock,,2,['Pop' 'Rock']
680,1089,neo-classical metal,,2,['Classical' 'Rock']


#### Establishing dominant genres: 

In order to classify the tags that have been associated with more than one Main genre, we need some criteria. From my perspective, some Main genres are dominant over others.

Again, music genre can be very subjective in some cases: some people would consider The Beatles a rock band, while I personally think they produced Pop music (maybe PopRock, but definitely not Rock music as I see it).

As this project is done by myself, even if I try to be as objective as possible, I need to input my personal criteria and here they are:

 - If a tag has the genre "Electronic" associated, I consider it as Electronic music. 
 - If a tag isn't associated with Electronic music but with Punk music, I consider it as Punk music.
 - If a tag isn't included in the above and has the genre Pop in it, I consider it as Pop.
 - If a tag isn't included in the above and has the genre Rock in it, I consider it as Rock.

However, I will use these criteria only if the number of Main genres identified is two. I think the cases where more than 2 Main genres are identified are probably incorrect tags (for instance, "bossa-nova latin world pop folk jazz flamenco").
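The dominance rules listed above amount to a fixed priority order applied only to tags with exactly two genres. A minimal sketch of that logic as a standalone function follows; `DOMINANCE_ORDER` and `pick_main_genre` are hypothetical names used for illustration, and the function takes a list of genre names rather than the string stored in the `genres` column.

```python
# Priority order taken from the rules listed above: earlier entries win.
DOMINANCE_ORDER = ['Electronic', 'Punk', 'Pop', 'Rock']

def pick_main_genre(genres):
    """Return the dominant genre for a tag matched to exactly two genres, else None."""
    if len(genres) != 2:
        return None  # one genre, or more than two: not covered by these rules
    for candidate in DOMINANCE_ORDER:
        if candidate in genres:
            return candidate
    return None  # e.g. ['Folk', 'Jazz']: none of the four rules applies

print(pick_main_genre(['Punk', 'Rock']))
print(pick_main_genre(['Electronic', 'Jazz']))
print(pick_main_genre(['Folk', 'Jazz']))
```

Pairs such as Folk and Jazz fall through all four rules and stay unclassified, which is consistent with a number of tags still having no Main genre after this step.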

In [24]:
#We drop the rows for which we didn't retrieve any genre at all:
genreless.dropna(subset=['genres'], axis=0, inplace=True)

In [25]:
#We also drop the rows for which the genre count is greater than 2:
genreless.drop(genreless[genreless['genre_counts'] > 2].index, inplace=True)

In [26]:
start = time.time()

#Filling the Main_genre column for our multiple-tagged rows:

genreless.reset_index(drop=True, inplace=True)

#Dominance order: the first genre of this list found in the tag wins
for i in tqdm.tqdm(range(len(genreless))):
    if genreless.loc[i, 'genre_counts'] != 2:
        continue
    for genre in ['Electronic', 'Punk', 'Pop', 'Rock']:
        if genre in genreless.loc[i, 'genres']:
            genreless.loc[i, 'Main_genre'] = genre
            break

end = time.time()
print((end-start)/60)

100%|██████████| 11299/11299 [00:45<00:00, 246.65it/s]

0.7635457833607991





In [27]:
#We remove the punctuation in Main_genre:
genreless['Main_genre'] = genreless['Main_genre'].str.strip('[]').str.strip("'")
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,genre_counts,genres
0,82,punkrock,Punk,2,['Punk' 'Rock']
1,137,dark metal,Rock,1,['Rock']
2,34257,jazz blaxploitation,Jazz,1,['Jazz']
3,142,hardcore metal,Rock,1,['Rock']
4,33903,thrash death metal,Rock,1,['Rock']


In [28]:
#We delete the useless columns:
genreless.drop(labels=['genre_counts', 'genres'], axis=1, inplace=True)

In [29]:
#How many did we identify this time?
genreless.Main_genre.isnull().value_counts()

False    11169
True       130
Name: Main_genre, dtype: int64

We have identified an extra 1730 tag names in this last step. We are now ready to input this information into our tags_genres dataframe: 

In [30]:
#Do the merging:
tags_all = pd.merge(tags_genres, genreless[['tag_id','Main_genre']], how='left', on='tag_id')

In [31]:
tags_all.head()

Unnamed: 0,tag_id,tag_name,Main_genre_x,subgenre,Main_genre_y
0,95,finnish,,,
1,23,slovak,,,
2,801,iowa,,,
3,4,groundbreaking,,,
4,130,taiwanese,,,


In [32]:
#How many rows did we have without Main genre?
tags_all.Main_genre_x.isnull().value_counts()

True     86021
False      785
Name: Main_genre_x, dtype: int64

In [33]:
#Fill the missing values of Main_genre_x with the Main_genre_y values we just retrieved:
tags_all.Main_genre_x = np.where(tags_all.Main_genre_x.isnull(), tags_all.Main_genre_y,tags_all.Main_genre_x )

In [34]:
#How many do we have now?
tags_all.Main_genre_x.isnull().value_counts()

True     74852
False    11954
Name: Main_genre_x, dtype: int64

So we have been able to identify the Main genre for 11,954 tags in total: this will be very useful in the next steps.
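The fill step used above (keep the genre from the curated list when present, otherwise take the one found by the pattern search) can also be expressed with `Series.fillna`. The sketch below uses a toy frame with the same `_x` / `_y` column suffixes that `pd.merge` produces; the rows are made up.

```python
import numpy as np
import pandas as pd

# Toy version of the merged table (values are made up).
merged = pd.DataFrame({
    'tag_name': ['jazz', 'punkrock', 'finnish'],
    'Main_genre_x': ['Jazz', np.nan, np.nan],   # from the curated genre list
    'Main_genre_y': [np.nan, 'Punk', np.nan],   # from the pattern search
})

# Keep the curated value when present, otherwise fall back to the pattern-based one.
merged['Main_genre'] = merged['Main_genre_x'].fillna(merged['Main_genre_y'])
merged = merged.drop(columns=['Main_genre_x', 'Main_genre_y'])
print(merged)
```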

In [35]:
#We can delete the useless columns:
tags_all.drop(labels=['subgenre', 'Main_genre_y'], axis=1, inplace=True)
#And rename the Main genre column:
tags_all.rename(columns={'Main_genre_x':'Main_genre'}, inplace=True)

## <font color=blue>2) Release genre</font>

### Data from Musicbrainz.org

Musicbrainz provides a table with all the release groups that have been tagged by its users. What we'll do next is retrieve those tags and select the ones that are part of the genres list.

In [36]:
#We import our main dataframe from the previous notebook:
df = pd.read_csv('Dataframe_with_origin.csv', sep='\t', header=0, encoding='utf-8')
df.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name_x,area_id,area_name,ISO_code,ISO_country,lat,long
0,2265346,Le 1,2042812,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,2291833,TedeuzeM,68613.0,Aix-en-Provence,,FR,46.0,2.0
1,1772538,devil jokes,1656147,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,1653884,yzome,9655.0,Seattle,US-WA,US,47.0417,-122.8958
2,1247979,!,1234953,2009-01-01,834659.0,9d02b2a1-c9a7-46aa-8674-adf38c44d81a,874079,Gatuzo,53.0,Croatia,HR,HR,45.1667,15.5
3,1571374,!,1497879,2010-01-01,674029.0,27a3d370-5430-42c0-8de4-1a7635d781b2,1440641,С.К.А.Й.,219.0,Ukraine,UA,UA,49.0,32.0
4,1528674,!,1463719,2014-01-01,491638.0,ed962474-bb85-47f9-b108-073184f09bc8,491638,Rusko,222.0,United States,US,US,38.0,-97.0


In [37]:
release_groups = pd.read_csv('Musicbrainz/Tables_used/release_group.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
release_groups.columns = ['group_id','group_mbid','release_group_name']
release_groups.head()

Unnamed: 0,group_id,group_mbid,release_group_name
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable
3,28,c554da1a-c1aa-30c3-b0bb-44b1b837de33,Piece and Love
4,60,06729175-db17-3443-add7-921739a92762,Ultimate Alternative Wavers


In [38]:
release_groups['group_id'].nunique()

1745126

In [39]:
len(release_groups)

1745126

In [40]:
group_tag = pd.read_csv('Musicbrainz/Tables_used/release_group_tag.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
group_tag.columns = ['group_id','tag_id','tag_counts']
group_tag.head()

Unnamed: 0,group_id,tag_id,tag_counts
0,93688,150,1
1,906692,1371,1
2,906692,6948,1
3,617615,11,1
4,617615,545,1


In [41]:
#We can now merge the release groups with the tag ids and tag counts:
Table = pd.merge(release_groups, group_tag, how='left', on='group_id')
Table.head()

Unnamed: 0,group_id,group_mbid,release_group_name,tag_id,tag_counts
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,,
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,41017.0,2.0
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,1053.0,2.0
3,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,1230.0,1.0
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,71.0,3.0


In [42]:
#And finally have our release groups associated with their genres:
release_group_genre = pd.merge(Table, tags_all, how='left', on='tag_id')
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,tag_id,tag_counts,tag_name,Main_genre
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,,,,
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,41017.0,2.0,alternative/indie rock,Rock
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,1053.0,2.0,swing,Jazz
3,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,1230.0,1.0,dixieland,Jazz
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,71.0,3.0,jazz,Jazz


Let's stop here for a while and check one of the releases that has several genre tags associated. Let's do this with one of the most popular releases of all time: the album "Thriller", by the king of Pop music, Michael Jackson.

In [43]:
release_group_genre[release_group_genre['group_mbid']=='f32fab67-77dd-3937-addc-9062e28e4c37']

Unnamed: 0,group_id,group_mbid,release_group_name,tag_id,tag_counts,tag_name,Main_genre
1429052,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,7282.0,2.0,vendu,
1429053,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,642.0,2.0,disco,Electronic
1429054,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,7935.0,1.0,discothèque,
1429055,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,24521.0,0.0,80 s and 90 s pop,Pop
1429056,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,1060.0,1.0,dance-pop,Electronic
1429057,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,303.0,3.0,funk,Others
1429058,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,11.0,0.0,electronic,Electronic
1429059,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,41021.0,2.0,club/dance,Electronic
1429060,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,76.0,1.0,dance,Electronic
1429061,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,41027.0,3.0,contemporary r&b,R&B/Soul


As we can see, "Pop" is the most used tag for this group so we should keep it as the release's genre.

As music genre is a very subjective feature, in order to be as "objective" as possible, we'll follow the majority of the votes to choose the subgenre and main genre of each release group.

To do so, we will sort the release_group_genre dataframe by number of counts and keep the top tag for each release group.
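A small sketch of this "top tag per group" selection on made-up votes is shown below. It also includes a variant that first discards tags without a Main genre, which is not what this notebook does but illustrates how the two choices differ; `votes` and `top_tag` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up tag votes for two release groups.
votes = pd.DataFrame({
    'group_id':   [1, 1, 1, 2, 2],
    'tag_name':   ['british', 'rock', 'pop', 'jazz', 'swing'],
    'tag_counts': [5, 3, 1, 2, 4],
    'Main_genre': [np.nan, 'Rock', 'Pop', 'Jazz', 'Jazz'],
})

def top_tag(df):
    """Keep the row with the highest tag_counts for each group_id."""
    ordered = df.sort_values(['group_id', 'tag_counts'], ascending=[True, False])
    return ordered.drop_duplicates(subset='group_id', keep='first')

top_any = top_tag(votes)                                  # approach described above
top_genre = top_tag(votes.dropna(subset=['Main_genre']))  # variant: genre tags only

print(top_any[['group_id', 'tag_name', 'Main_genre']])
print(top_genre[['group_id', 'tag_name', 'Main_genre']])
```

In the toy data, group 1 ends up with the non-genre tag "british" under the first approach and with "rock" under the variant, so the choice affects how many groups receive a Main genre.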

In [44]:
#We sort by group_id and tag_counts:
release_group_genre.sort_values(['group_id','tag_counts'], ascending=[True,False], inplace=True)
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,tag_id,tag_counts,tag_name,Main_genre
312152,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,1186.0,2.0,acid rap,Hip_Hop
312153,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,92310.0,1.0,oldest release group #2,
737291,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,1498.0,7.0,trip hop,Electronic
737302,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,12.0,6.0,downtempo,Electronic
737293,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,11.0,5.0,electronic,Electronic


In [45]:
#And now we can drop the duplicate group_ids, keeping the top tags:
release_group_genre.drop_duplicates(subset=['group_id'],keep='first', inplace=True)
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,tag_id,tag_counts,tag_name,Main_genre
312152,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,1186.0,2.0,acid rap,Hip_Hop
737291,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,1498.0,7.0,trip hop,Electronic
1756939,11,c6fe6a2b-0ed6-3d2c-b9ce-ddd5421a3452,Hot,71.0,3.0,jazz,Jazz
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,41017.0,2.0,alternative/indie rock,Rock
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,71.0,3.0,jazz,Jazz


What we want now is to combine our main dataframe with this new genre information we just retrieved:

In [46]:
#We merge both dataframes:
main_df = pd.merge(df, release_group_genre[['group_id','tag_id','tag_counts','tag_name','Main_genre']], how='left', on='group_id')
main_df.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name_x,area_id,area_name,ISO_code,ISO_country,lat,long,tag_id,tag_counts,tag_name,Main_genre
0,2265346,Le 1,2042812,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,2291833,TedeuzeM,68613.0,Aix-en-Provence,,FR,46.0,2.0,,,,
1,1772538,devil jokes,1656147,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,1653884,yzome,9655.0,Seattle,US-WA,US,47.0417,-122.8958,,,,
2,1247979,!,1234953,2009-01-01,834659.0,9d02b2a1-c9a7-46aa-8674-adf38c44d81a,874079,Gatuzo,53.0,Croatia,HR,HR,45.1667,15.5,,,,
3,1571374,!,1497879,2010-01-01,674029.0,27a3d370-5430-42c0-8de4-1a7635d781b2,1440641,С.К.А.Й.,219.0,Ukraine,UA,UA,49.0,32.0,,,,
4,1528674,!,1463719,2014-01-01,491638.0,ed962474-bb85-47f9-b108-073184f09bc8,491638,Rusko,222.0,United States,US,US,38.0,-97.0,,,,


In [47]:
len(main_df)

1282170

In [48]:
main_df['release_id'].nunique()

1282170

In [49]:
#For how many releases do we have the main genre now?
main_df.Main_genre.isnull().value_counts()

True     1144323
False     137847
Name: Main_genre, dtype: int64

In [50]:
#We export the retrieved releases into a dataframe, and the pending into another:
retrieved1 = main_df[main_df['Main_genre'].notnull()]
#The pending rows, without the columns related to genre (drop returns a new dataframe):
pending1 = main_df[main_df['Main_genre'].isnull()].drop(labels=['tag_id', 'tag_counts', 'tag_name', 'Main_genre'], axis=1)

So, according to the above results, we have for now the genre for only 137,847 releases, out of a total of 1,282,170 (about 11% of our dataframe).

## <font color=blue>3) Artist genre</font>

In order to retrieve more genres, the next step is to retrieve the artists' genres (as we did for the release groups) and add them to our main_df.

Note: by doing this, we are assuming that each band or artist always produces the same musical genre. This is not always accurate (especially if we look at the subgenres). In general, however, most bands and artists stay in the same musical line during their professional lives and can be categorized into the same "Main genre". Again, this is an assumption that we need to make in order to retrieve more info for this project.

For that, we'll first use Musicbrainz's artist_tag table and follow the same process as before.

In [51]:
artist_tag = pd.read_csv('Musicbrainz/Tables_used/artist_tag.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
artist_tag.columns = ['artist_id','tag_id','tag_counts']
artist_tag.head()

Unnamed: 0,artist_id,tag_id,tag_counts
0,468800,29,2
1,522545,63294,1
2,31390,173,1
3,108404,271,1
4,108404,7,1


In [52]:
#We merge it with the tags_all dataframe:
artist_tag_genre = pd.merge(artist_tag, tags_all, how='left', on='tag_id')
artist_tag_genre.head()

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre
0,468800,29,2,progressive rock,Rock
1,522545,63294,1,austrian composer,
2,31390,173,1,polish,
3,108404,271,1,hard rock,Rock
4,108404,7,1,rock,Rock


In [53]:
#We retrieve the artist name:
artists = pd.read_csv('Musicbrainz/Tables_used/artist.txt',sep='\t', header=None, engine='c', usecols=[0,2])
artists.columns = ['artist_id','artist_name']
artists.head()

Unnamed: 0,artist_id,artist_name
0,805192,WIK▲N
1,371203,Pete Moutso
2,273232,Zachary
3,101060,The Silhouettes
4,145773,Aric Leavitt


In [54]:
#We merge it with the artist dataframe to see the names for each artist:
artist_genre = pd.merge(artist_tag_genre, artists[['artist_id','artist_name']], on='artist_id', how='left')
artist_genre.head()

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
0,468800,29,2,progressive rock,Rock,Citadel
1,522545,63294,1,austrian composer,,Robert Fuchs
2,31390,173,1,polish,,Behemoth
3,108404,271,1,hard rock,Rock,Blake
4,108404,7,1,rock,Rock,Blake


In [55]:
#We sort by artist_id and tag_counts:
artist_genre.sort_values(['artist_id','tag_counts'], ascending=[True,False], inplace=True)
artist_genre.head(20)

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
302542,1,21,11,special purpose artist,,Various Artists
311787,1,119673,1,absolute voices,,Various Artists
302432,1,1769,0,chicago,,Various Artists
302439,1,107130,0,kindly fixme,,Various Artists
302440,1,112334,0,legion of von,,Various Artists
302445,1,104932,0,megafixme,,Various Artists
302448,1,115238,0,my sharona,,Various Artists
302462,1,115626,0,cdx,,Various Artists
302471,1,107131,0,please rename to [various artists],,Various Artists
302483,1,107127,0,do not using this for artist credits,,Various Artists


In [56]:
#And now we can drop the duplicate artist_ids, keeping the top tags:
artist_genre.drop_duplicates(subset=['artist_id'],keep='first', inplace=True)
artist_genre.head(20)

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
302542,1,21,11,special purpose artist,,Various Artists
208397,4,1,10,trip-hop,,Massive Attack
10306,6,171,2,british,,Apartment 26
106813,7,98,0,bogus artist,,Dr. Evil
32,9,1600,1,european,,Robert Miles
583,10,1661,1,warp,,Vincent Gallo
4653,11,71,2,jazz,Jazz,Squirrel Nut Zippers
158380,12,304,1,country,Country,Giant Sand
149903,15,1600,1,european,,Éric Serra
10421,16,111,1,american,,William S. Burroughs


In [57]:
#We add this new information into our pending1 dataframe:
main_df2 = pd.merge(pending1, artist_genre, how='left', on='artist_id')
main_df2.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name_x,area_id,area_name,ISO_code,ISO_country,lat,long,tag_id,tag_counts,tag_name,Main_genre,artist_name
0,2265346,Le 1,2042812,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,2291833,TedeuzeM,68613.0,Aix-en-Provence,,FR,46.0,2.0,,,,,
1,1772538,devil jokes,1656147,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,1653884,yzome,9655.0,Seattle,US-WA,US,47.0417,-122.8958,,,,,
2,1247979,!,1234953,2009-01-01,834659.0,9d02b2a1-c9a7-46aa-8674-adf38c44d81a,874079,Gatuzo,53.0,Croatia,HR,HR,45.1667,15.5,,,,,
3,1571374,!,1497879,2010-01-01,674029.0,27a3d370-5430-42c0-8de4-1a7635d781b2,1440641,С.К.А.Й.,219.0,Ukraine,UA,UA,49.0,32.0,7.0,1.0,rock,Rock,С.К.А.Й.
4,1528674,!,1463719,2014-01-01,491638.0,ed962474-bb85-47f9-b108-073184f09bc8,491638,Rusko,222.0,United States,US,US,38.0,-97.0,30.0,2.0,dubstep,Electronic,Rusko


In [58]:
main_df2.isnull().sum(axis=0)

release_id            0
release_group         3
group_id              0
release_year          0
artist_id            95
artist_mbid          95
credit_id             0
artist_name_x        99
area_id               0
area_name             0
ISO_code           7122
ISO_country          26
lat                   0
long                  0
tag_id           650075
tag_counts       650075
tag_name         651499
Main_genre       855291
artist_name      650077
dtype: int64

In [59]:
len(main_df2)

1144323

Not bad: we now have "only" 855,291 releases with no Main genre, so we have just retrieved the info for an extra 289,032 releases using the artists' information. In total, we now have 426,879 releases with their genre information, about 33% of our dataframe.

In [60]:
#We delete the last column and split the dataframe again:
main_df2.drop(labels=['artist_name'], axis=1, inplace=True)
retrieved2 = main_df2[main_df2['Main_genre'].notnull()]
#The pending rows, without the columns related to genre (drop returns a new dataframe):
pending2 = main_df2[main_df2['Main_genre'].isnull()].drop(labels=['tag_id', 'tag_counts', 'tag_name', 'Main_genre'], axis=1)

In [61]:
len(retrieved2)

289032
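As a quick cross-check of the coverage so far, the shares can be recomputed from the counts printed by the cells above: 137847 releases matched through release-group tags, 289032 through artist tags, and 1282170 releases overall. This is a minimal sketch; the variable names are only illustrative.

```python
# Counts copied from the cell outputs above.
total_releases = 1282170        # len(main_df)
from_release_groups = 137847    # releases with a Main_genre after the release-group merge
from_artists = 289032           # len(retrieved2)

share_groups = from_release_groups / total_releases
share_total = (from_release_groups + from_artists) / total_releases

print(f"release-group tags only: {share_groups:.1%}")
print(f"release-group + artist tags: {share_total:.1%}")
```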

### Data from Wikidata Query with SPARQL

In [62]:
#Open the files and load them into dataframes with the same column names (to match with our main dataframe later):
musicians = pd.read_csv('wikidata/query_wikidata_musicians.csv',sep=',', encoding='utf-8', usecols=[3,4])
musicians.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)
singers = pd.read_csv('wikidata/query_wikidata_singers.csv',sep=',', encoding='utf-8', usecols=[3,4])
singers.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)
bands = pd.read_csv('wikidata/query_wikidata_bands.csv',sep=',', encoding='utf-8', usecols=[3,4])
bands.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)

In [63]:
#Now we can concatenate the 3 dataframes into one:
wiki_df = pd.concat([musicians, singers, bands])
wiki_df.head()

Unnamed: 0,artist_genre,artist_mbid
0,,
1,opera,b972f589-fb0e-474e-b64a-803b0364fa75
2,classical music,b972f589-fb0e-474e-b64a-803b0364fa75
3,symphony,b972f589-fb0e-474e-b64a-803b0364fa75
4,concerto,b972f589-fb0e-474e-b64a-803b0364fa75


In [64]:
#We merge the dataframe with tags_all to retrieve tag_id and Main_genre:
wiki_genres = pd.merge(wiki_df, tags_all, how='left', left_on='artist_genre', right_on='tag_name')
wiki_genres.head()

Unnamed: 0,artist_genre,artist_mbid,tag_id,tag_name,Main_genre
0,,,32232.0,,
1,,,80586.0,,
2,opera,b972f589-fb0e-474e-b64a-803b0364fa75,480.0,opera,Classical
3,classical music,b972f589-fb0e-474e-b64a-803b0364fa75,2092.0,classical music,Classical
4,classical music,b972f589-fb0e-474e-b64a-803b0364fa75,54585.0,classical music,Classical
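The output above shows two effects of the left merge: "classical music" is matched twice (two tag_ids share that name), and rows with a missing artist_genre are matched against tags with a missing tag_name, because pandas matches null keys against each other. The sketch below reproduces both effects on toy data and shows one way to avoid them; the toy frames and the choice of keeping the first tag per name are assumptions, not what the notebook does:

```python
import numpy as np
import pandas as pd

wiki = pd.DataFrame({
    "artist_genre": ["opera", "classical music", np.nan],
    "artist_mbid": ["a", "a", "b"],
})

# Hypothetical lookup with a repeated tag_name and a missing tag_name.
lookup = pd.DataFrame({
    "tag_id": [480, 2092, 54585, 32232],
    "tag_name": ["opera", "classical music", "classical music", np.nan],
    "Main_genre": ["Classical", "Classical", "Classical", np.nan],
})

# Naive merge: the duplicated key and the null key both add rows.
naive = pd.merge(wiki, lookup, how="left",
                 left_on="artist_genre", right_on="tag_name")

# Cleaned lookup: no null names, one row per tag_name.
clean_lookup = (lookup.dropna(subset=["tag_name"])
                      .drop_duplicates(subset=["tag_name"], keep="first"))

# validate="many_to_one" makes pandas raise if the right keys repeat.
safe = pd.merge(wiki, clean_lookup, how="left",
                left_on="artist_genre", right_on="tag_name",
                validate="many_to_one")

print(len(naive), len(safe))
```

In this notebook the extra rows are harmless, because the later `drop_duplicates` on artist_mbid keeps one row per artist anyway, but cleaning the lookup first keeps the intermediate dataframe smaller.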


In [65]:
#We drop the rows that don't have any artist_mbid (as we won't be able to match them):
wiki_genres.dropna(subset=['artist_mbid'], axis=0, inplace=True)

Some artists appear more than once (once per genre tag), so we sort the rows by artist_mbid and Main_genre and keep the first row for each artist. Wikidata gives us no tag_count field, so we can't really know which genre is the main one: the row we keep is simply the one whose Main_genre comes first alphabetically (rows with no Main_genre are sorted last).

In [66]:
#We sort by artist_mbid and Main_genre:
wiki_genres.sort_values(['artist_mbid','Main_genre'], inplace=True)
wiki_genres.head()

Unnamed: 0,artist_genre,artist_mbid,tag_id,tag_name,Main_genre
112398,flamenco,00010eb3-ebfe-4965-81ef-0ac64cd49fde,367.0,flamenco,Latin
314081,,000200d1-1176-4859-b39c-669bde26ecea,32232.0,,
314082,,000200d1-1176-4859-b39c-669bde26ecea,80586.0,,
163208,,00026532-1fe3-45fb-a0df-34aec04a1319,32232.0,,
163209,,00026532-1fe3-45fb-a0df-34aec04a1319,80586.0,,


In [67]:
#And now we can drop the duplicate artist_mbids, keeping the top rows:
wiki_genres.drop_duplicates(subset=['artist_mbid'],keep='first', inplace=True)
wiki_genres.head()

Unnamed: 0,artist_genre,artist_mbid,tag_id,tag_name,Main_genre
112398,flamenco,00010eb3-ebfe-4965-81ef-0ac64cd49fde,367.0,flamenco,Latin
314081,,000200d1-1176-4859-b39c-669bde26ecea,32232.0,,
163208,,00026532-1fe3-45fb-a0df-34aec04a1319,32232.0,,
271722,reggae,00034ede-a1f1-4219-be39-02f36853373e,267.0,reggae,World
125067,J-pop,0003fd17-b083-41fe-83a9-d550bd4f00a1,,,
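To make the effect of "sort, then keep the first row" concrete, here is a small sketch on toy data (the artists and genres are made up). With the default `na_position='last'` of `sort_values`, an artist only ends up with a null Main_genre when none of its rows has one:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "artist_mbid": ["a", "a", "a", "b", "b"],
    "Main_genre": ["Rock", np.nan, "Classical", np.nan, np.nan],
})

# Sort alphabetically within each artist (nulls go last), then keep
# the first row per artist.
first = (toy.sort_values(["artist_mbid", "Main_genre"])
            .drop_duplicates(subset=["artist_mbid"], keep="first"))

picked = first.set_index("artist_mbid")["Main_genre"]
print(picked)
```

Artist "a" gets "Classical" only because it sorts before "Rock", which is the arbitrary part of this strategy; artist "b" stays null and is removed by the following `dropna`.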


In [68]:
#We can also drop the null values in Main_genre, as they won't add any information later:
wiki_genres.dropna(subset=['Main_genre'], axis=0, inplace=True)
#And the column artist_genre, as it's the same as tag_name now:
wiki_genres.drop(labels=['artist_genre'], axis=1, inplace=True)

In [69]:
#Now we can input this new information into our main dataframe:
main_df3 = pd.merge(pending2, wiki_genres, how='left', on='artist_mbid')
main_df3.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name_x,area_id,area_name,ISO_code,ISO_country,lat,long,tag_id,tag_name,Main_genre
0,2265346,Le 1,2042812,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,2291833,TedeuzeM,68613.0,Aix-en-Provence,,FR,46.0,2.0,,,
1,1772538,devil jokes,1656147,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,1653884,yzome,9655.0,Seattle,US-WA,US,47.0417,-122.8958,,,
2,1247979,!,1234953,2009-01-01,834659.0,9d02b2a1-c9a7-46aa-8674-adf38c44d81a,874079,Gatuzo,53.0,Croatia,HR,HR,45.1667,15.5,,,
3,1947276,! (LP/2017),1791666,2017-01-01,627929.0,e95d5542-401d-4319-9918-7cbbc507b758,627929,Chapelier Fou,73.0,France,FR,FR,46.0,2.0,,,
4,1218357,! -attention-,1210730,1998-01-01,495678.0,9b9f6d5f-e633-4736-86c9-828bcbd10e98,495678,20th Century,107.0,Japan,JP,JP,36.0,138.0,,,


In [70]:
main_df3.isnull().sum(axis=0)

release_id            0
release_group         2
group_id              0
release_year          0
artist_id            95
artist_mbid          95
credit_id             0
artist_name_x        99
area_id               0
area_name             0
ISO_code           5746
ISO_country          19
lat                   0
long                  0
tag_id           696997
tag_name         696997
Main_genre       696997
dtype: int64

We now have 767,079 releases with no main genre, so the Wikidata information let us retrieve the genre for an extra 159,812 releases. In total, 595,684 releases now have their genre information, which is 44% of our dataframe.

In [71]:
#We split again the dataframe in two, and keep retrieving:
retrieved3 = main_df3[main_df3['Main_genre'].notnull()]
#And remove the columns related to genre from the pending rows to build the pending3 dataframe:
pending3 = main_df3[main_df3['Main_genre'].isnull()].drop(labels=['tag_id','tag_name', 'Main_genre'], axis=1)

### Data from the Million Song Dataset

In [72]:
#We open the file where some tracks have their genre associated:
tracks = pd.read_csv('1M_songs/msd_tagtraum_cd2c.csv', header=0, usecols = [0,1])
tracks.head()

Unnamed: 0,track_id,majority_genre
0,TRAAAAK128F9318786,Rock
1,TRAAAAW128F429D538,Rap
2,TRAAADJ128F4287B47,Rock
3,TRAAADZ128F9348C2E,Latin
4,TRAAAED128E0783FAB,Jazz


As we can see, these tracks already have a majority genre assigned. Which genres are there?

In [73]:
tracks.majority_genre.value_counts()

Rock          75013
Electronic    21865
Jazz          14700
Pop           12967
Rap           11001
RnB            9811
Metal          9224
Country        8983
Reggae         7970
Blues          6219
Folk           4188
Punk           3275
Latin          3113
World          1919
New Age        1153
Name: majority_genre, dtype: int64

Luckily, their groups are very similar to our main genres, so we only need to rename a few of them to fit our classification:

- "Rap" will be changed to "Hip Hop"
- "RnB" will be changed to "R&B/Soul"
- "Metal" will be changed to "Rock"
- "Reggae" will be changed to "World"
- "New Age" will be changed to "Others"
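The renaming rules listed above can be written as a dictionary and applied to the genre column only. This is a minimal sketch on toy data: the track ids are placeholders, and restricting the replacement to `majority_genre` (rather than the whole dataframe) is a choice made here so that no other column can be touched by accident:

```python
import pandas as pd

# Mapping from the dataset's genre labels to our main genres.
GENRE_MAP = {
    "Rap": "Hip Hop",
    "RnB": "R&B/Soul",
    "Metal": "Rock",
    "Reggae": "World",
    "New Age": "Others",
}

# Placeholder track ids, for illustration only.
toy = pd.DataFrame({
    "track_id": ["T1", "T2", "T3", "T4", "T5", "T6", "T7"],
    "majority_genre": ["Rock", "Rap", "Metal", "Reggae",
                       "New Age", "RnB", "Jazz"],
})

# Labels missing from the mapping (e.g. "Jazz") are left unchanged.
toy["majority_genre"] = toy["majority_genre"].replace(GENRE_MAP)
print(toy["majority_genre"].value_counts())
```

After the replacement, "Metal" tracks are counted under "Rock", which is why the Rock total grows in the notebook's output.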

In [74]:
tracks.replace({'Rap':'Hip Hop', 'RnB':'R&B/Soul', 'Metal':'Rock', 'Reggae': 'World', 'New Age':'Others'}, inplace=True)
tracks.majority_genre.value_counts()

Rock          84237
Electronic    21865
Jazz          14700
Pop           12967
Hip Hop       11001
World          9889
R&B/Soul       9811
Country        8983
Blues          6219
Folk           4188
Punk           3275
Latin          3113
Others         1153
Name: majority_genre, dtype: int64

In [75]:
#We open the file where we can match track_id and artist_mbid:
tracks_metadata = pd.read_csv('1M_songs/track_metadata.csv', header=0, usecols = [0,5])
tracks_metadata.head()

Unnamed: 0,track_id,artist_mbid
0,TRMMMYQ128F932D901,357ff05d-848a-44cf-b608-cb34b5701ae5
1,TRMMMKD128F425225D,8d7ef530-a6fd-4f8f-b2e2-74aec765e0f9
2,TRMMMRX128F93187D9,3d403d44-36ce-465c-ad43-ae877e65adc4
3,TRMMMCH128F425532C,12be7648-7094-495f-90e6-df4189d68615
4,TRMMMWA128F426B589,


In [76]:
#We drop the rows with no value in artist_mbid:
tracks_metadata.dropna(subset=['artist_mbid'], axis=0, inplace=True)

In [77]:
#We merge the tracks dataframe with tracks metadata to retrieve the genre by artist:
artist_genre_1m = pd.merge(tracks, tracks_metadata, how='left', on='track_id')
artist_genre_1m.head()

Unnamed: 0,track_id,majority_genre,artist_mbid
0,TRAAAAK128F9318786,Rock,6ae6a016-91d7-46cc-be7d-5e8e5d320c54
1,TRAAAAW128F429D538,Hip Hop,e77e51a5-4761-45b3-9847-2051f811e366
2,TRAAADJ128F4287B47,Rock,3cf5a3be-25ef-4408-98fe-e66fee536be1
3,TRAAADZ128F9348C2E,Latin,7a273984-edd9-4451-9c4d-39b38f05ebcd
4,TRAAAED128E0783FAB,Jazz,e0e9d279-37d5-4493-99b8-5a21309502f6


In [78]:
artist_genre_1m.duplicated(subset='artist_mbid').value_counts()

True     170381
False     21020
dtype: int64

Most artists appear in more than one track, and some of them have tracks in more than one genre. So what we will do is count the tracks per artist and genre, and keep for each artist the genre with the most tracks:

In [79]:
artists_scores = pd.DataFrame(pd.pivot_table(artist_genre_1m, index=['artist_mbid'], columns=['majority_genre'], aggfunc='count')) 

In [80]:
artists_scores.head()

Unnamed: 0_level_0,track_id,track_id,track_id,track_id,track_id,track_id,track_id,track_id,track_id,track_id,track_id,track_id,track_id
majority_genre,Blues,Country,Electronic,Folk,Hip Hop,Jazz,Latin,Others,Pop,Punk,R&B/Soul,Rock,World
artist_mbid,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
0002f649-8285-4a72-b847-b3854e1a449c,,,,,,,,,,,,12.0,
00034ede-a1f1-4219-be39-02f36853373e,,,,,,,,,,,,11.0,
0004537a-4b12-43eb-a023-04009e738d2e,,,1.0,,,,,,,,,,
00077d46-7b4a-4761-9eed-c7dd435fa5ff,,,,,,,,,,,,2.0,
000842dd-08e9-485f-a9b6-8ada9f1c4a12,,,,,,,,,,,,,1.0


In [81]:
artists_scores.reset_index(inplace=True)
artists_scores.columns = artists_scores.columns.droplevel(0)
artists_scores.rename(columns={artists_scores.columns[0]: 'artist_mbid'}, inplace=True)
artists_scores.head()

majority_genre,artist_mbid,Blues,Country,Electronic,Folk,Hip Hop,Jazz,Latin,Others,Pop,Punk,R&B/Soul,Rock,World
0,0002f649-8285-4a72-b847-b3854e1a449c,,,,,,,,,,,,12.0,
1,00034ede-a1f1-4219-be39-02f36853373e,,,,,,,,,,,,11.0,
2,0004537a-4b12-43eb-a023-04009e738d2e,,,1.0,,,,,,,,,,
3,00077d46-7b4a-4761-9eed-c7dd435fa5ff,,,,,,,,,,,,2.0,
4,000842dd-08e9-485f-a9b6-8ada9f1c4a12,,,,,,,,,,,,,1.0


In [82]:
#We create a column "genre_counts" that counts the number of different genres identified for each artist:
artists_scores['genre_counts'] = artists_scores.iloc[:,1:].notnull().sum(axis=1)
artists_scores.head()

majority_genre,artist_mbid,Blues,Country,Electronic,Folk,Hip Hop,Jazz,Latin,Others,Pop,Punk,R&B/Soul,Rock,World,genre_counts
0,0002f649-8285-4a72-b847-b3854e1a449c,,,,,,,,,,,,12.0,,1
1,00034ede-a1f1-4219-be39-02f36853373e,,,,,,,,,,,,11.0,,1
2,0004537a-4b12-43eb-a023-04009e738d2e,,,1.0,,,,,,,,,,,1
3,00077d46-7b4a-4761-9eed-c7dd435fa5ff,,,,,,,,,,,,2.0,,1
4,000842dd-08e9-485f-a9b6-8ada9f1c4a12,,,,,,,,,,,,,1.0,1


In [83]:
#Is there any artist with more than one genre?
artists_scores[artists_scores['genre_counts'] >1]

majority_genre,artist_mbid,Blues,Country,Electronic,Folk,Hip Hop,Jazz,Latin,Others,Pop,Punk,R&B/Soul,Rock,World,genre_counts
5,000ba849-700e-452e-8858-0db591587e4a,,,,,,,,,2.0,,,4.0,,2
9,0019749d-ee29-4a5f-ab17-6bfa11deb969,,,16.0,,,2.0,,,,,,,,2
17,0039c7ae-e1a7-4a7d-9b49-0cbc716821a6,,,,,,,,,1.0,,,24.0,,2
21,00467da8-2a92-498f-8b10-a80889bcded7,,,,,,,,,,1.0,,29.0,,2
32,006f0783-c5a0-458b-a9da-f8551f7ebe77,,,,,,,,,1.0,,,19.0,,2
39,0092dc2a-38ca-4b01-94dd-5334bba14059,,,,,,,,,1.0,,1.0,,,2
41,00950dec-8f3a-4a17-9717-e7872a954d8b,,,13.0,,,,1.0,,,,,,,2
69,00e8c9fb-546f-481e-b529-6ec23f3b3f72,,,,,2.0,,,,2.0,,1.0,,,3
71,00ed154e-8679-42f0-8f42-e59bd7e185af,,,,,,,,,5.0,,,4.0,,2
72,00ef5e52-582b-4d53-a03a-bbd5b3084197,,,12.0,,,,,,1.0,,,,,2


In [84]:
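#We create a column containing the top score (the highest track count) for each row.
#Note: max(axis=1) runs over every numeric column, genre_counts included, so when an artist's
#genre_counts is higher than all of its per-genre counts, top_score will not match any genre column:
artists_scores['top_score'] = artists_scores.max(axis=1)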
artists_scores.head()

majority_genre,artist_mbid,Blues,Country,Electronic,Folk,Hip Hop,Jazz,Latin,Others,Pop,Punk,R&B/Soul,Rock,World,genre_counts,top_score
0,0002f649-8285-4a72-b847-b3854e1a449c,,,,,,,,,,,,12.0,,1,12.0
1,00034ede-a1f1-4219-be39-02f36853373e,,,,,,,,,,,,11.0,,1,11.0
2,0004537a-4b12-43eb-a023-04009e738d2e,,,1.0,,,,,,,,,,,1,1.0
3,00077d46-7b4a-4761-9eed-c7dd435fa5ff,,,,,,,,,,,,2.0,,1,2.0
4,000842dd-08e9-485f-a9b6-8ada9f1c4a12,,,,,,,,,,,,,1.0,1,1.0


In [85]:
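#We retrieve the most scored genre for each artist:

artists_scores['Main_genre'] = np.nan

#Column labels: position 0 is artist_mbid and positions 1 to 13 are the genres:
a = artists_scores.columns.values

for i in tqdm.tqdm(range(len(artists_scores))):
    for j in range(1,14):
        #If several genres tie for the top score, the last one (in column order) is kept:
        if artists_scores.iloc[i, j] == artists_scores.iloc[i]['top_score']:
            #.loc avoids chained assignment (the index is 0..n-1 after reset_index):
            artists_scores.loc[i, 'Main_genre'] = a[j]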

100%|██████████| 21019/21019 [06:17<00:00, 55.72it/s]
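The row-by-row loop takes several minutes. As a possible alternative, the same "genre with the most tracks per artist" idea can be sketched with `pd.crosstab` and `idxmax`, which work on whole columns at once. This is not the notebook's method, and the toy data is made up; note also that on a tie `idxmax` returns the first genre in column order, whereas the loop above keeps the last one:

```python
import pandas as pd

toy = pd.DataFrame({
    "artist_mbid": ["a", "a", "a", "b", "b", "c"],
    "majority_genre": ["Rock", "Rock", "Pop", "Pop", "Jazz", "World"],
})

# One row per artist, one column per genre, cells = number of tracks
# (0 instead of NaN when the artist has no track in that genre).
counts = pd.crosstab(toy["artist_mbid"], toy["majority_genre"])

# Label of the column holding the highest count in each row.
main_genre = counts.idxmax(axis=1).rename("Main_genre").reset_index()
print(main_genre)
```

Because the maximum is taken over the genre columns only, helper columns such as a genre count cannot take part in it, so every artist gets a genre.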


In [86]:
artists_scores.head()

majority_genre,artist_mbid,Blues,Country,Electronic,Folk,Hip Hop,Jazz,Latin,Others,Pop,Punk,R&B/Soul,Rock,World,genre_counts,top_score,Main_genre
0,0002f649-8285-4a72-b847-b3854e1a449c,,,,,,,,,,,,12.0,,1,12.0,Rock
1,00034ede-a1f1-4219-be39-02f36853373e,,,,,,,,,,,,11.0,,1,11.0,Rock
2,0004537a-4b12-43eb-a023-04009e738d2e,,,1.0,,,,,,,,,,,1,1.0,Electronic
3,00077d46-7b4a-4761-9eed-c7dd435fa5ff,,,,,,,,,,,,2.0,,1,2.0,Rock
4,000842dd-08e9-485f-a9b6-8ada9f1c4a12,,,,,,,,,,,,,1.0,1,1.0,World


In [87]:
#Is there any null value in Main_genre?
artists_scores.Main_genre.isnull().value_counts()

False    20683
True       336
Name: Main_genre, dtype: int64

In [88]:
#We can drop them:
artists_scores.dropna(subset=['Main_genre'], axis=0, inplace=True)
#And drop the unnecessary columns:
artists_final = artists_scores[['artist_mbid', 'Main_genre']].copy()
artists_final.head()

majority_genre,artist_mbid,Main_genre
0,0002f649-8285-4a72-b847-b3854e1a449c,Rock
1,00034ede-a1f1-4219-be39-02f36853373e,Rock
2,0004537a-4b12-43eb-a023-04009e738d2e,Electronic
3,00077d46-7b4a-4761-9eed-c7dd435fa5ff,Rock
4,000842dd-08e9-485f-a9b6-8ada9f1c4a12,World


In [89]:
#And we can input this information into our pending3 dataframe:
main_df4 = pd.merge(pending3, artists_final, how='left', on='artist_mbid')
#How many releases did we retrieve the genre for in this last step?
main_df4.Main_genre.isnull().value_counts()

True     656736
False     40261
Name: Main_genre, dtype: int64

We now have 726,197 releases with no main genre, so the Million Song Dataset let us retrieve the genre for an extra 40,882 releases. In total, 636,566 releases now have their genre information, which is 47% of our dataframe.

We'll split the dataframe in two like we did before, and move on to the last part of this notebook.

In [90]:
#We split again the dataframe in two, and keep retrieving:
retrieved4 = main_df4[main_df4['Main_genre'].notnull()]
pending4 = main_df4[main_df4['Main_genre'].isnull()]

### Extend the artist genre to the whole dataframe

The idea of this last stage is to check whether, for the same artist, some releases have a main genre and others don't (this could have happened in the first stage, when we retrieved the genre by release group).

To do so, we will concatenate all our retrieved and pending dataframes, and check it:

In [91]:
#We first create new columns in the dataframes to be able to concatenate them:
retrieved3['tag_counts'] = np.nan
retrieved4['tag_counts'] = np.nan
retrieved4['tag_id'] = np.nan
retrieved4['tag_name'] = np.nan
pending4['tag_counts'] = np.nan
pending4['tag_id'] = np.nan
pending4['tag_name'] = np.nan

In [92]:
#We can concatenate them:
main_df5 = pd.concat([retrieved1, retrieved2, retrieved3, retrieved4, pending4 ], ignore_index=True)
main_df5.head()

Unnamed: 0,ISO_code,ISO_country,Main_genre,area_id,area_name,artist_id,artist_mbid,artist_name_x,credit_id,group_id,lat,long,release_group,release_id,release_year,tag_counts,tag_id,tag_name
0,US,US,Electronic,222.0,United States,109013.0,f26c72d3-e52c-467b-b651-679c73d8e1a7,!!!,109013,150660,38.0,-97.0,!!!,9236,2001-01-01,1.0,1031.0,leftfield
1,DE,DE,Hip_Hop,81.0,Germany,576051.0,8e4e1dff-978a-482f-8d18-30d7998dd209,Prinz Pi,576051,651366,51.0,9.0,!Donnerwetter!,311893,2006-01-01,1.0,150.0,hip-hop
2,NL,NL,Electronic,150.0,Netherlands,45223.0,734fa82c-864e-468b-bee4-944cb4b1952b,Speedy J,45223,85169,52.5,5.75,!ive,57031,1995-01-01,1.0,11.0,electronic
3,US,US,Rock,222.0,United States,21416.0,3ec17e85-9284-4f4c-8831-4e56c2354cdb,Reba McEntire,21416,438466,38.0,-97.0,# 1's,78797,2005-01-01,2.0,719.0,country rock
4,SK,SK,Electronic,189.0,Slovakia,283685.0,e4b17c6c-2951-4513-a811-9ae7ad959c4a,Olga+Jozef,283685,690152,48.6667,19.5,#03,359385,1999-01-01,1.0,11.0,electronic


In [93]:
len(main_df5)

1282170

Now we want to count, for each artist that has a main genre, how many releases fall under each genre:

In [94]:
count = main_df5.groupby(by=['artist_id', 'Main_genre'], axis=0, as_index=False).count()
count.head(20)

Unnamed: 0,artist_id,Main_genre,ISO_code,ISO_country,area_id,area_name,artist_mbid,artist_name_x,credit_id,group_id,lat,long,release_group,release_id,release_year,tag_counts,tag_id,tag_name
0,4.0,Electronic,80,80,80,80,80,80,80,80,80,80,80,80,80,34,80,80
1,6.0,Pop,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
2,6.0,Rock,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2
3,9.0,Electronic,21,21,21,21,21,21,21,21,21,21,21,21,21,10,21,21
4,9.0,Others,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
5,10.0,Electronic,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
6,10.0,Rock,4,4,4,4,4,4,4,4,4,4,4,4,4,2,2,2
7,11.0,Jazz,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12
8,12.0,Country,36,36,36,36,36,36,36,36,36,36,36,36,36,36,36,36
9,12.0,Rock,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6


In [95]:
#We sort by artist_id and by the count of releases for each genre (after the groupby, the artist_mbid column holds that count), to select the main genre for each artist:
count.sort_values(['artist_id','artist_mbid'], ascending=[True,False], inplace=True)
count.head()

Unnamed: 0,artist_id,Main_genre,ISO_code,ISO_country,area_id,area_name,artist_mbid,artist_name_x,credit_id,group_id,lat,long,release_group,release_id,release_year,tag_counts,tag_id,tag_name
0,4.0,Electronic,80,80,80,80,80,80,80,80,80,80,80,80,80,34,80,80
2,6.0,Rock,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2
1,6.0,Pop,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
3,9.0,Electronic,21,21,21,21,21,21,21,21,21,21,21,21,21,10,21,21
4,9.0,Others,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


In [96]:
#And now we can drop the duplicate artist_ids, keeping for each artist the genre with the most releases:
count.drop_duplicates(subset=['artist_id'],keep='first', inplace=True)
count.head()

Unnamed: 0,artist_id,Main_genre,ISO_code,ISO_country,area_id,area_name,artist_mbid,artist_name_x,credit_id,group_id,lat,long,release_group,release_id,release_year,tag_counts,tag_id,tag_name
0,4.0,Electronic,80,80,80,80,80,80,80,80,80,80,80,80,80,34,80,80
2,6.0,Rock,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2
3,9.0,Electronic,21,21,21,21,21,21,21,21,21,21,21,21,21,10,21,21
6,10.0,Rock,4,4,4,4,4,4,4,4,4,4,4,4,4,2,2,2
7,11.0,Jazz,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12


In [97]:
#We extract the information we need:
all_artists_genre = count[['artist_id', 'Main_genre']].copy()

In [98]:
len(all_artists_genre)

90872

In [99]:
#Finally, we can merge our pending4 dataframe with this last one, and see if we retrieved more info:
main_df6 = pd.merge(pending4, all_artists_genre, how='left', on='artist_id')

In [100]:
main_df6.Main_genre_y.isnull().value_counts()

True     584967
False     71769
Name: Main_genre_y, dtype: int64
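The propagation step above (count releases per artist and genre, keep the most frequent genre, merge it back onto the pending releases) can be condensed as follows. This is a sketch on toy data with made-up ids; the column name `n_releases` is an assumption introduced here:

```python
import numpy as np
import pandas as pd

releases = pd.DataFrame({
    "release_id": [1, 2, 3, 4, 5, 6],
    "artist_id": [10, 10, 10, 10, 20, 30],
    "Main_genre": ["Rock", "Rock", "Pop", np.nan, np.nan, "Jazz"],
})

# Most frequent genre per artist, computed on tagged releases only.
per_artist = (
    releases.dropna(subset=["Main_genre"])
    .groupby(["artist_id", "Main_genre"]).size()
    .reset_index(name="n_releases")
    .sort_values(["artist_id", "n_releases"], ascending=[True, False])
    .drop_duplicates(subset=["artist_id"], keep="first")
    [["artist_id", "Main_genre"]]
)

# Dropping the empty Main_genre column first avoids _x/_y suffixes.
pending = releases[releases["Main_genre"].isnull()].drop(columns=["Main_genre"])
filled = pending.merge(per_artist, how="left", on="artist_id")
print(filled)
```

Release 4 inherits "Rock" from its artist's other releases, while release 5 stays empty because artist 20 has no tagged release at all.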

Thankfully, by applying this last strategy, we have identified the genre for an extra 73,772 releases, which means that we now have a total of 710,338 releases with their genre (52% of the dataset).

Now, we'll gather all the information retrieved in a single file, and the information pending in another file:

In [101]:
retrieved1.columns

Index(['release_id', 'release_group', 'group_id', 'release_year', 'artist_id',
       'artist_mbid', 'credit_id', 'artist_name_x', 'area_id', 'area_name',
       'ISO_code', 'ISO_country', 'lat', 'long', 'tag_id', 'tag_counts',
       'tag_name', 'Main_genre'],
      dtype='object')

In [102]:
main_df6.columns

Index(['release_id', 'release_group', 'group_id', 'release_year', 'artist_id',
       'artist_mbid', 'credit_id', 'artist_name_x', 'area_id', 'area_name',
       'ISO_code', 'ISO_country', 'lat', 'long', 'Main_genre_x', 'tag_counts',
       'tag_id', 'tag_name', 'Main_genre_y'],
      dtype='object')

In [103]:
#We first rename a column in main_df6:
main_df6.rename(columns={'Main_genre_y':'Main_genre'}, inplace=True)
#And delete the now useless Main_genre_x column (it comes from pending4, where it is always null):
main_df6.drop(labels=['Main_genre_x'], axis=1, inplace=True)

In [104]:
#We split the dataframe in two one last time:
retrieved5 = main_df6[main_df6['Main_genre'].notnull()]
pending5 = main_df6[main_df6['Main_genre'].isnull()]

In [105]:
#Now we can concatenate the 5 retrieved dataframes:
data_out = pd.concat([retrieved1, retrieved2, retrieved3, retrieved4, retrieved5 ], ignore_index=True)
data_out.head()

Unnamed: 0,ISO_code,ISO_country,Main_genre,area_id,area_name,artist_id,artist_mbid,artist_name_x,credit_id,group_id,lat,long,release_group,release_id,release_year,tag_counts,tag_id,tag_name
0,US,US,Electronic,222.0,United States,109013.0,f26c72d3-e52c-467b-b651-679c73d8e1a7,!!!,109013,150660,38.0,-97.0,!!!,9236,2001-01-01,1.0,1031.0,leftfield
1,DE,DE,Hip_Hop,81.0,Germany,576051.0,8e4e1dff-978a-482f-8d18-30d7998dd209,Prinz Pi,576051,651366,51.0,9.0,!Donnerwetter!,311893,2006-01-01,1.0,150.0,hip-hop
2,NL,NL,Electronic,150.0,Netherlands,45223.0,734fa82c-864e-468b-bee4-944cb4b1952b,Speedy J,45223,85169,52.5,5.75,!ive,57031,1995-01-01,1.0,11.0,electronic
3,US,US,Rock,222.0,United States,21416.0,3ec17e85-9284-4f4c-8831-4e56c2354cdb,Reba McEntire,21416,438466,38.0,-97.0,# 1's,78797,2005-01-01,2.0,719.0,country rock
4,SK,SK,Electronic,189.0,Slovakia,283685.0,e4b17c6c-2951-4513-a811-9ae7ad959c4a,Olga+Jozef,283685,690152,48.6667,19.5,#03,359385,1999-01-01,1.0,11.0,electronic


In [106]:
#Do we have our 710,338 releases?
len(data_out)

697203

In [108]:
#It looks like we have 2 different names for Hip Hop:
data_out.Main_genre = np.where(data_out.Main_genre == 'Hip_Hop', 'Hip Hop',data_out.Main_genre)

In [109]:
data_out.Main_genre.value_counts()

Rock          205816
Electronic    135086
Pop            93196
Jazz           52777
Classical      42344
Hip Hop        30882
Others         25851
Punk           23033
Folk           20690
Country        14259
R&B/Soul       13924
Latin          13622
Blues          13544
World          12179
Name: Main_genre, dtype: int64

In [110]:
#We export the dataframes:
data_out.to_csv('data_out.csv', sep='\t', index=False, encoding='utf-8')
pending5.to_csv('data_pending.csv', sep='\t', index=False, encoding='utf-8')