# <font color=red>DATA GATHERING II: MUSIC GENRES AND SUBGENRES</font>

In [1]:
import pandas as pd
import numpy as np
import time
import math
import tqdm
import re
import warnings
warnings.filterwarnings('ignore')

## <font color=blue>1) Genres and subgenres</font>

According to Musicbrainz's Genre description in https://wiki.musicbrainz.org/Genre:

"Genres are currently supported in MusicBrainz as part of the tag system.

Some tags (the ones in the genre list) are automatically read and presented as genres."

What we want for our visualization is to have, for each release, its main genre. To do so, I have copied Musicbrainz's "genre list" into a csv file. There are 419 elements considered as genres by Musicbrainz but for our study we'll consider them as our subgenres.

I have also added all the subgenres appearing in this website:https://www.musicgenreslist.com/ and classified all of them into 14 Main genre categories:

- Blues
- Classical
- Country
- Electronic
- Folk
- Hip Hop
- Jazz
- Latin
- Pop
- Punk
- Rythm & Blues (R&B) / Soul
- Rock
- World (local music genres from specific regions of the world)
- Others (This category contains all the subgenres I haven't been able to classify in the previous categories)


Of course, I wasn't familiar with all the genres appearing in the list so, in order to classify those, I looked at their definition in wikipedia and chose the best main genre for them. If no definition was provided by wikipedia, I searched for them in Google and listened to a representative song in order to make a decision.

In [2]:
all_genres = pd.read_csv('Main_genre_list.csv', sep='\t', header=0, encoding='utf-8')
all_genres.head()

Unnamed: 0,Main_genre,subgenre
0,Blues,acoustic blues
1,Blues,african blues
2,Blues,blues
3,Blues,blues music
4,Blues,blues rock


As we read before, Musicbrainz's genre list (subgenre for us) is part of their tag system. Let's import the Musicbrainz's "tags" table and try to identify, from its elements, the ones that are genres.

In [3]:
tags = pd.read_csv('Musicbrainz/Tables_used/tags.txt',sep='\t', header=None, engine='c', usecols=[0,1])
tags.columns = ['tag_id','tag_name']
tags.head()

Unnamed: 0,tag_id,tag_name
0,95,finnish
1,23,slovak
2,801,iowa
3,4,groundbreaking
4,130,taiwanese


In [4]:
#How many tags are there?
tags['tag_id'].nunique()

86806

In [5]:
#What do the tags look like?
tags.tag_name.value_counts()

new age                                                                                                             2
post rock                                                                                                           2
music box                                                                                                           2
punk rock                                                                                                           2
lejos del fuego                                                                                                     2
rock                                                                                                                2
space music                                                                                                         2
indie rock                                                                                                          2
brian eno                                               

As we can see, the tags list contains the genres but also other (more subjective) expressions that some users have chosen as representative for the music entity. 

We will add columns to this tags dataframe to distinguish which of them are actually genres/subgenres. As we will do the matching by tag_name, we have to format the tag_names as the ones in all_genres: without punctuation and in lower case.

In [6]:
#We first normalize in lower case the tag_names:
tags['tag_name'] = tags['tag_name'].str.lower()

In [7]:
#We replace the punctuation with a space:
tags['tag_name'] = tags['tag_name'].apply(lambda x: re.sub(r"[^\w ]", " ", str(x), 0, re.MULTILINE))
#We remove leading & trainling spaces:
tags['tag_name'] = tags['tag_name'].str.strip()

In [8]:
#And now we can do the merging:
tags_genres = pd.merge(tags, all_genres, how='left', left_on='tag_name', right_on='subgenre')
tags_genres.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre
0,95,finnish,,
1,23,slovak,,
2,801,iowa,,
3,4,groundbreaking,,
4,130,taiwanese,,


In [9]:
#How many subgenres did we identify?
pd.notna(tags_genres['Main_genre']).value_counts()

False    85679
True      1127
Name: Main_genre, dtype: int64

In [10]:
#What kind of tag_names haven't been associated with a Main genre?
tags_genres[tags_genres['Main_genre'].isnull()]

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre
0,95,finnish,,
1,23,slovak,,
2,801,iowa,,
3,4,groundbreaking,,
4,130,taiwanese,,
5,134,thai,,
6,154,war,,
7,52,netlabel,,
8,101,cotm,,
9,82,punkrock,,


As we can see above, some of the tags that don't have a Main genre associated could be easily classified (for instance: "punkrock", or "dark metal"). 

Those tag names are not considered as a subgenre by Musicbrainz but they do provide us with some information about the release main genre. We will consider them as subgenre and identify their main genre.

What I will do now is to retrieve more information about these genreless tag_names in order to be able to classiffy them:

In [11]:
#Creating a specific dataframe for them:
genreless = tags_genres[pd.notna(tags_genres.tag_name) & pd.isnull(tags_genres.Main_genre)]
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre
0,95,finnish,,
1,23,slovak,,
2,801,iowa,,
3,4,groundbreaking,,
4,130,taiwanese,,


In [12]:
#We create new columns to retrieve some information about the content of each tag:
genreless['Blues'] = np.nan
genreless['Classical'] = np.nan
genreless['Country'] = np.nan
genreless['Electronic'] = np.nan
genreless['Folk'] = np.nan
genreless['Hip_Hop'] = np.nan
genreless['Jazz'] = np.nan
genreless['Latin'] = np.nan
genreless['Pop'] = np.nan
genreless['Punk'] = np.nan
genreless['RB'] = np.nan
genreless['Rock'] = np.nan
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock
0,95,finnish,,,,,,,,,,,,,,
1,23,slovak,,,,,,,,,,,,,,
2,801,iowa,,,,,,,,,,,,,,
3,4,groundbreaking,,,,,,,,,,,,,,
4,130,taiwanese,,,,,,,,,,,,,,


In [13]:
#We create a column tag_name_clean where the text is formatted (remove punctuation, concatenate all words):
genreless['tag_name_clean'] = genreless['tag_name'].apply(lambda x: re.sub(r"[^\w]", "", str(x), 0, re.MULTILINE))
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock,tag_name_clean
0,95,finnish,,,,,,,,,,,,,,,finnish
1,23,slovak,,,,,,,,,,,,,,,slovak
2,801,iowa,,,,,,,,,,,,,,,iowa
3,4,groundbreaking,,,,,,,,,,,,,,,groundbreaking
4,130,taiwanese,,,,,,,,,,,,,,,taiwanese


In [14]:
#We create a pattern of words that could be associated with each genre:
Blues = 'blues'
Classical = 'classical|symphony|orchestra|stringquartet|pianist|opera|soprano|symph'
Country = 'country'
Electronic = 'electronic|electr|house|techno'
Folk = 'folk'
Hip_Hop = 'hiphop|rap|gangsta'
Jazz = 'jazz|jamband'
Latin = 'latin|reggaeton'
Pop = 'pop'
Punk = 'punk'
RB = 'rhythmandblues|rythmandblues|R&B'
Rock = 'rock|metal'

In [15]:
#And now we fill each genre column by searching if the column tag_name_clean contains the patterns:
genreless.Blues = np.where(genreless.tag_name_clean.str.contains(Blues), 'Blues', np.nan)
genreless.Classical = np.where(genreless.tag_name_clean.str.contains(Classical), 'Classical', np.nan)
genreless.Country = np.where(genreless.tag_name_clean.str.contains(Country), 'Country', np.nan)
genreless.Electronic = np.where(genreless.tag_name_clean.str.contains(Electronic), 'Electronic', np.nan)
genreless.Folk = np.where(genreless.tag_name_clean.str.contains(Folk), 'Folk', np.nan)
genreless.Hip_Hop = np.where(genreless.tag_name_clean.str.contains(Hip_Hop), 'Hip Hop', np.nan)
genreless.Jazz = np.where(genreless.tag_name_clean.str.contains(Jazz), 'Jazz', np.nan)
genreless.Latin = np.where(genreless.tag_name_clean.str.contains(Latin), 'Latin', np.nan)
genreless.Pop = np.where(genreless.tag_name_clean.str.contains(Pop), 'Pop', np.nan)
genreless.Punk = np.where(genreless.tag_name_clean.str.contains(Punk), 'Punk', np.nan)
genreless.RB = np.where(genreless.tag_name_clean.str.contains(RB), 'RB', np.nan)
genreless.Rock = np.where(genreless.tag_name_clean.str.contains(Rock), 'Rock', np.nan)

In [16]:
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock,tag_name_clean
0,95,finnish,,,,,,,,,,,,,,,finnish
1,23,slovak,,,,,,,,,,,,,,,slovak
2,801,iowa,,,,,,,,,,,,,,,iowa
3,4,groundbreaking,,,,,,,,,,,,,,,groundbreaking
4,130,taiwanese,,,,,,,,,,,,,,,taiwanese


In [17]:
genreless.replace('nan', np.nan, inplace=True)
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock,tag_name_clean
0,95,finnish,,,,,,,,,,,,,,,finnish
1,23,slovak,,,,,,,,,,,,,,,slovak
2,801,iowa,,,,,,,,,,,,,,,iowa
3,4,groundbreaking,,,,,,,,,,,,,,,groundbreaking
4,130,taiwanese,,,,,,,,,,,,,,,taiwanese


What we want now, is to identify the tag_name which can contain more than 1 Main genre (e.g: "poprock"), and decide which is the main genre for them.

In [18]:
#We create a column "genre_counts" that counts the number of genres identified for each tag_name:
genreless['genre_counts'] = genreless.iloc[:,4:16].notnull().sum(axis=1)
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,subgenre,Blues,Classical,Country,Electronic,Folk,Hip_Hop,Jazz,Latin,Pop,Punk,RB,Rock,tag_name_clean,genre_counts
0,95,finnish,,,,,,,,,,,,,,,finnish,0
1,23,slovak,,,,,,,,,,,,,,,slovak,0
2,801,iowa,,,,,,,,,,,,,,,iowa,0
3,4,groundbreaking,,,,,,,,,,,,,,,groundbreaking,0
4,130,taiwanese,,,,,,,,,,,,,,,taiwanese,0


In [19]:
#We gather all the genres in a new column:
def add_genres(i):
    genre_list = genreless.loc[i,"Blues":"Rock"]
    return [x for x in genre_list if x is not np.nan]

In [20]:
genreless.reset_index(drop=True, inplace=True)
genreless['genres'] = [add_genres(row) for row in range(len(genreless))]

In [21]:
#We can now get rid of the intermediary columns:
genreless.drop(labels=['subgenre','Blues', 'Classical', 'Country',
       'Electronic', 'Folk', 'Hip_Hop', 'Jazz', 'Latin', 'Pop',
       'Punk', 'RB', 'Rock', 'tag_name_clean'], axis=1, inplace=True)
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,genre_counts,genres
0,95,finnish,,0,[]
1,23,slovak,,0,[]
2,801,iowa,,0,[]
3,4,groundbreaking,,0,[]
4,130,taiwanese,,0,[]


In [22]:
#We can fill the main genre column for the ones that have just 1 genre identified:
genreless.Main_genre = np.where(genreless.genre_counts.isin([1]), genreless.genres,genreless.Main_genre )

In [23]:
#How many did we identify?
genreless.Main_genre.isnull().value_counts()

True     76392
False     9287
Name: Main_genre, dtype: int64

Not bad: we were able to retrieve the Main genre for 9.287 tags via this technique.

What we want now is to analyze the cases where there is more than one main genre identified:

In [24]:
genreless[genreless['genre_counts'] >1].head(100)

Unnamed: 0,tag_id,tag_name,Main_genre,genre_counts,genres
9,82,punkrock,,2,"[Punk, Rock]"
13,52611,electro justice rock bbc one madeon remix daft...,,3,"[Electronic, Punk, Rock]"
248,58451,echo park echopark rock pop rockpop guildford ...,,2,"[Pop, Rock]"
445,729,pop jazz,,2,"[Jazz, Pop]"
562,898,irish folk rock,,2,"[Folk, Rock]"
611,31371,popunk,,2,"[Pop, Punk]"
660,1055,jazz metal,,2,"[Jazz, Rock]"
675,1083,piano pop rock,,2,"[Pop, Rock]"
679,1089,neo classical metal,,2,"[Classical, Rock]"
687,1111,electro rock,,2,"[Electronic, Rock]"


#### Establishing dominant genres: 

In order to classify the tags that have been associated with more than one Main genre, we need to use some criteria. From my perspective, I think there are some Main genres that are dominant against others.

Again, music genre is something that can be very subjective in some cases: some people would consider The Beattles as a rock band, while I personally think they produced Pop music (maybe PopRock, but definitely not Rock music as I see it). 

As this project is done by myself, even if I try to be as objective as possible, I need to input my personal criteria and here they are:

 - If a tag has the genre "Electronic" associated, I consider it as Electronic music. 
 - If a tag isn't associated with Electronic music but with Punk music, I consider it as Punk music.
 - If a tag isn't included in the above and has the genre Pop in it, I consider it as Pop.
 - If a tag isn't included in the above and has the genre Rock in it, I consider it as Rock.
 - If a tag isn't included in the above and has the genre Hip Hop in it, I consider it as Hip Hop.
 - If a tag isn't included in the above and has the genre Jazz in it, I consider it as Jazz.
 - If a tag isn't included in the above and has the genre Folk in it, I consider it as Folk.
 - If a tag isn't included in the above and has the genre Blues in it, I consider it as Blues.
 - If a tag isn't included in the above and has the genre Latin in it, I consider it as Latin.
 - If a tag isn't included in the above and has the genre Classical in it, I consider it as Classical.

However, I will use this criteria only if the number of Main genres identified are two. I think the cases where there are more than 2 Main genres identified are probably incorrect tags (like, for instance "bossa-nova latin world pop folk jazz flamenco").

In [25]:
#We drop the rows for which we didn't retrieve any genre at all:
genreless.dropna(subset=['genres'], axis=0, inplace=True)

In [26]:
#We drop also the rows for whose the tag count is greater than 2:
genreless.drop(genreless[genreless['genre_counts'] > 2].index, inplace=True)

In [27]:
start = time.time()

#Filling the Main_genre column for our multiplt-tagged rows:

genreless.reset_index(drop=True, inplace=True)

for i in tqdm.tqdm(range(len(genreless))):
    if genreless['genre_counts'][i] == 2 and 'Electronic' in genreless['genres'][i]:
        genreless['Main_genre'][i] = 'Electronic'
    elif genreless['genre_counts'][i] == 2 and 'Punk' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Punk'
    elif genreless['genre_counts'][i] == 2 and 'Pop' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Pop'       
    elif genreless['genre_counts'][i] == 2 and 'Rock' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Rock'
    elif genreless['genre_counts'][i] == 2 and 'Hip_Hop' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Hip Hop'
    elif genreless['genre_counts'][i] == 2 and 'Jazz' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Jazz'
    elif genreless['genre_counts'][i] == 2 and 'Folk' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Folk'
    elif genreless['genre_counts'][i] == 2 and 'Blues' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Blues'
    elif genreless['genre_counts'][i] == 2 and 'Latin' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Latin'
    elif genreless['genre_counts'][i] == 2 and 'Classical' in genreless['genres'][i] and pd.isnull(genreless['Main_genre'][i]):
        genreless['Main_genre'][i] = 'Classical'
    else:
        pass

end = time.time()
print((end-start)/60)

100%|██████████| 85423/85423 [01:14<00:00, 1146.16it/s]

1.242217739423116





In [28]:
#We remove the punctuation in Main_genre:
genreless['Main_genre'] = genreless['Main_genre'].apply(lambda x: re.sub(r"[^\w]", "", str(x), 0, re.MULTILINE))
genreless.head()

Unnamed: 0,tag_id,tag_name,Main_genre,genre_counts,genres
0,95,finnish,,0,[]
1,23,slovak,,0,[]
2,801,iowa,,0,[]
3,4,groundbreaking,,0,[]
4,130,taiwanese,,0,[]


In [29]:
#We delete the useless columns:
genreless.drop(labels=['genre_counts', 'genres'], axis=1, inplace=True)

In [30]:
genreless.replace('nan', np.nan, inplace=True)

In [31]:
#How many did we identify this time?
genreless.Main_genre.isnull().value_counts()

True     74359
False    11064
Name: Main_genre, dtype: int64

We have identified an extra 1777 tag names in this last step. We are now ready to input this information into our tags_genres dataframe: 

In [32]:
#We first drop the Null values in Main_genre (those will be in genreless):
tags_genres.dropna(subset=['Main_genre'], axis=0, inplace=True)
#And the column subgenre which is not useful anymore:
tags_genres.drop(labels=['subgenre'], axis=1, inplace=True)

In [33]:
#Do the merging:
tags_all = pd.concat([tags_genres, genreless], ignore_index=True)
tags_all.head()

Unnamed: 0,tag_id,tag_name,Main_genre
0,24,digital hardcore,Electronic
1,28,raggacore,Electronic
2,79,techstep,Electronic
3,30,dubstep,Electronic
4,122,visual kei,Rock


In [34]:
#How many tags do we have in total with a Main genre associated?
tags_all.Main_genre.isnull().value_counts()

True     74359
False    12191
Name: Main_genre, dtype: int64

So we have been able to identify the Main genre for 12.191 tags in total: this will be very useful in the next steps.

## <font color=blue>2) Release genre</font>

### Data from Musicbrainz.org

Musicbrainz provides a table with all the release groups which have been tagged by their users. What we'll do next, is to retrieve those tags and select the ones that are part of the genres list.

In [35]:
#We import our main dataframe from the previous notebook:
df = pd.read_csv('Dataframe_with_origin_2.csv', sep='\t', header=0, encoding='utf-8')
df.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name,area_id,area_name,subdivision_name,country_name,latitude,longitude
0,4,From the Choirgirl Hotel,876990,1998-01-01,60.0,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193
1,8,Scarlet's Walk,90019,2002-01-01,60.0,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193
2,11,Glory of the 80's,95360,1999-01-01,60.0,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193
3,15,Llanfairpwllgwyngyllgogerychwyndrobwllantysili...,94305,1995-01-01,20211.0,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712
4,16,Something 4 the Weekend,94303,1996-01-01,20211.0,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712


In [36]:
release_groups = pd.read_csv('Musicbrainz/Tables_used/release_group.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
release_groups.columns = ['group_id','group_mbid','release_group_name']
release_groups.head()

Unnamed: 0,group_id,group_mbid,release_group_name
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable
3,28,c554da1a-c1aa-30c3-b0bb-44b1b837de33,Piece and Love
4,60,06729175-db17-3443-add7-921739a92762,Ultimate Alternative Wavers


In [37]:
release_groups['group_id'].nunique()

1745126

In [38]:
len(release_groups)

1745126

In [39]:
group_tag = pd.read_csv('Musicbrainz/Tables_used/release_group_tag.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
group_tag.columns = ['group_id','tag_id','tag_counts']
group_tag.head()

Unnamed: 0,group_id,tag_id,tag_counts
0,93688,150,1
1,906692,1371,1
2,906692,6948,1
3,617615,11,1
4,617615,545,1


In [40]:
#We can now merge the release groups with the tag ids and tag counts:
Table = pd.merge(release_groups, group_tag, how='left', on='group_id')
Table.head()

Unnamed: 0,group_id,group_mbid,release_group_name,tag_id,tag_counts
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,,
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,41017.0,2.0
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,1053.0,2.0
3,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,1230.0,1.0
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,71.0,3.0


In [41]:
#And finally have our release groups associated with their genres:
release_group_genre = pd.merge(Table, tags_all, how='left', on='tag_id')
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,tag_id,tag_counts,tag_name,Main_genre
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,,,,
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,41017.0,2.0,alternative indie rock,Rock
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,1053.0,2.0,swing,Jazz
3,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,1230.0,1.0,dixieland,Jazz
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,71.0,3.0,jazz,Jazz


Let's stop here for a while and check one of the releases that has several genre tags associated. Let's do this with one of the most popular releases of all times: the album "Thriller", by the king of Pop music: Michael Jackson. 

In [42]:
release_group_genre[release_group_genre['group_mbid']=='f32fab67-77dd-3937-addc-9062e28e4c37']

Unnamed: 0,group_id,group_mbid,release_group_name,tag_id,tag_counts,tag_name,Main_genre
1429052,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,7282.0,2.0,vendu,
1429053,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,642.0,2.0,disco,Pop
1429054,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,7935.0,1.0,discothèque,
1429055,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,24521.0,0.0,80 s and 90 s pop,Pop
1429056,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,1060.0,1.0,dance pop,Pop
1429057,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,303.0,3.0,funk,Others
1429058,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,11.0,0.0,electronic,Electronic
1429059,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,41021.0,2.0,club dance,Electronic
1429060,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,76.0,1.0,dance,Electronic
1429061,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,41027.0,3.0,contemporary r b,


As we can see, "Pop" is the most used tag for this group so we should keep it as the release's genre.

As music genre is a very subjective feature, in order to be as "objective" as possible, we'll take into consideration the majority of the votes to chose the subgenre and main genre of each release group.

To do so, we group the release_group_genre dataframe by Main_genre and number of tag counts and keep the top genre for each release group.

In [43]:
release_scores = pd.pivot_table(release_group_genre,values='tag_counts', index=['group_id', 'Main_genre'], aggfunc=np.sum, fill_value=0, margins=True)
release_scores.reset_index(level=['group_id','Main_genre'], inplace=True)
release_scores.head()

Unnamed: 0,group_id,Main_genre,tag_counts
0,2,HipHop,2
1,4,Electronic,32
2,4,Pop,3
3,4,Rock,3
4,11,Folk,1


To avoid incorrect taggings, we will take into consideration only the tags that have more than one vote:

In [44]:
release_scores_filtered = release_scores[release_scores['tag_counts'] > 1]

In [45]:
#We sort by group_id and tag_counts:
release_scores_filtered.sort_values(['group_id','tag_counts'], ascending=[True,False], inplace=True)
release_scores_filtered.head()

Unnamed: 0,group_id,Main_genre,tag_counts
0,2,HipHop,2
1,4,Electronic,32
2,4,Pop,3
3,4,Rock,3
5,11,Jazz,5


In [46]:
#And now we can drop the duplicate group_ids, keeping the top Main_genre:
release_scores_filtered.drop_duplicates(subset=['group_id'],keep='first', inplace=True)

What we want now is to combine our main dataframe with this new genre information we just retrieved:

In [47]:
#We merge both dataframes:
main_df = pd.merge(df, release_scores_filtered, how='left', on='group_id')
main_df.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name,area_id,area_name,subdivision_name,country_name,latitude,longitude,Main_genre,tag_counts
0,4,From the Choirgirl Hotel,876990,1998-01-01,60.0,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193,Rock,5.0
1,8,Scarlet's Walk,90019,2002-01-01,60.0,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193,Rock,3.0
2,11,Glory of the 80's,95360,1999-01-01,60.0,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193,Pop,2.0
3,15,Llanfairpwllgwyngyllgogerychwyndrobwllantysili...,94305,1995-01-01,20211.0,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712,Rock,4.0
4,16,Something 4 the Weekend,94303,1996-01-01,20211.0,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712,,


In [48]:
len(main_df)

593528

In [49]:
main_df['release_id'].nunique()

593528

In [50]:
#For how many releases do we have the main genre now?
main_df.Main_genre.isnull().value_counts()

True     545943
False     47585
Name: Main_genre, dtype: int64

In [51]:
main_df.columns

Index(['release_id', 'release_group', 'group_id', 'release_year', 'artist_id',
       'artist_mbid', 'credit_id', 'artist_name', 'area_id', 'area_name',
       'subdivision_name', 'country_name', 'latitude', 'longitude',
       'Main_genre', 'tag_counts'],
      dtype='object')

In [52]:
#We export the retrieved releases into a dataframe, and the pending into another:
retrieved1 = main_df[main_df['Main_genre'].notnull()]
pending1 = main_df[main_df['Main_genre'].isnull()]
#And remove the columns related to genre in the pending1 dataframe:
pending1.drop(labels=['tag_counts', 'Main_genre'], axis=1, inplace=True)

So, according to the above results, we have for now the genre for only 47.585 releases, under a total of 593.528 (8% of our dataframe only).

## <font color=blue>3) Artist genre</font>

In order to retrieve more genres, the next step is retrieving the artists' genre (the same we did for the release groups), and add them to our main_df.

Note: by doing this, we are assuming that each band or artist always produces the same musical genre. This is not 100% always accurate (especially if we look at the subgenres). However in general, we can say that the majority of the bands/artists usually stay in the same musical line during their professional lives and they can be categorized into the same "Main genre". Again, this is an assumption that we need to make in order to retrieve more info for this project.

For that, we'll use first Musicbrainz's artist_tag table and we'll follow the same process we did before.

In [53]:
artist_tag = pd.read_csv('Musicbrainz/Tables_used/artist_tag.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
artist_tag.columns = ['artist_id','tag_id','tag_counts']
artist_tag.head()

Unnamed: 0,artist_id,tag_id,tag_counts
0,468800,29,2
1,522545,63294,1
2,31390,173,1
3,108404,271,1
4,108404,7,1


In [54]:
#We merge it with the tags_genres dataframe:
artist_tag_genre = pd.merge(artist_tag, tags_all, how='left', on='tag_id')
artist_tag_genre.head()

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre
0,468800,29,2,progressive rock,Rock
1,522545,63294,1,austrian composer,
2,31390,173,1,polish,
3,108404,271,1,hard rock,Rock
4,108404,7,1,rock,Rock


In [55]:
#We drop the artists that don't have a Main genre:
artist_tag_genre.dropna(subset=['Main_genre'], axis=0, inplace=True)

In [56]:
#We retrieve the artist name:
artists = pd.read_csv('Musicbrainz/Tables_used/artist.txt',sep='\t', header=None, engine='c', usecols=[0,2])
artists.columns = ['artist_id','artist_name']
artists.head()

Unnamed: 0,artist_id,artist_name
0,805192,WIK▲N
1,371203,Pete Moutso
2,273232,Zachary
3,101060,The Silhouettes
4,145773,Aric Leavitt


In [57]:
len(artists)

1476425

In [58]:
#We remove the vague artists in our main dataframe:
labels = ['[unknown]','[nature sounds]','[dialogue]','[christmas music]', '[no artist]', '[church chimes]','Various Artists','[language instruction]']
artists.drop(artists[artists['artist_name'].isin(labels)].index, axis=0, inplace=True)

In [59]:
#We merge it with the artist dataframe to see the names for each artist:
artist_genre = pd.merge(artists, artist_tag_genre, on='artist_id', how='left')
artist_genre.head()

Unnamed: 0,artist_id,artist_name,tag_id,tag_counts,tag_name,Main_genre
0,805192,WIK▲N,,,,
1,371203,Pete Moutso,,,,
2,273232,Zachary,,,,
3,101060,The Silhouettes,,,,
4,145773,Aric Leavitt,,,,


In [60]:
#We drop the artists that don't have any Main_genre associated:
artist_genre.dropna(subset=['Main_genre'], axis=0, inplace=True)

We follow the same scoring procedure that we did with the releases:

In [61]:
artist_scores = pd.pivot_table(artist_genre,values='tag_counts', index=['artist_id', 'Main_genre'], aggfunc=np.sum, fill_value=0, margins=True)
artist_scores.reset_index(level=['artist_id','Main_genre'], inplace=True)
artist_scores.head()

Unnamed: 0,artist_id,Main_genre,tag_counts
0,4,Electronic,26
1,4,HipHop,1
2,4,Rock,0
3,6,Electronic,1
4,6,Jazz,1


To avoid incorrect taggings, we will take into consideration only the tags that have more than one vote:

In [62]:
artist_scores_filtered = artist_scores[artist_scores['tag_counts'] > 1]

In [63]:
#We sort by group_id and tag_counts:
artist_scores_filtered.sort_values(['artist_id','tag_counts'], ascending=[True,False], inplace=True)
artist_scores_filtered.head()

Unnamed: 0,artist_id,Main_genre,tag_counts
0,4,Electronic,26
5,6,Rock,4
6,9,Electronic,10
9,11,Jazz,3
16,17,Rock,11


In [64]:
#And now we can drop the duplicate artist_ids, keeping the top Main_genre:
artist_scores_filtered.drop_duplicates(subset=['artist_id'],keep='first', inplace=True)

In [65]:
#We add this new information into our pending1 dataframe:
main_df2 = pd.merge(pending1, artist_scores_filtered, how='left', on='artist_id')
main_df2.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name,area_id,area_name,subdivision_name,country_name,latitude,longitude,Main_genre,tag_counts
0,16,Something 4 the Weekend,94303,1996-01-01,20211,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712,Rock,9.0
1,17,If You Don’t Want Me to Destroy You,94657,1996-01-01,20211,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712,Rock,9.0
2,18,Hermann ♥’s Pauline,94298,1997-01-01,20211,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712,Rock,9.0
3,19,The International Language of Screaming,94301,1997-01-01,20211,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712,Rock,9.0
4,22,Wish You Were Dead,166285,1996-01-01,51977,a13eb6dc-2708-4135-8eb3-de042e373ead,51977,Scheer,115532.0,County Londonderry,Northern Ireland,United Kingdom,54.787715,-6.492314,,


In [66]:
main_df2.isnull().sum(axis=0)

release_id               0
release_group            1
group_id                 0
release_year             0
artist_id                0
artist_mbid              0
credit_id                0
artist_name              2
area_id                 19
area_name                3
subdivision_name    416983
country_name             0
latitude                 0
longitude                0
Main_genre          373952
tag_counts          373952
dtype: int64

In [67]:
len(main_df2)

545943

Not bad: we have now "only" 373.952 releases with no Main genre, so we have just retrieved the info for an extra 171.991 releases using the artists' information. In total, we have for now 219.576 releases with their genre information, so 37% of our Dataframe.

In [68]:
#We split the dataframe again:
retrieved2 = main_df2[main_df2['Main_genre'].notnull()]
pending2 = main_df2[main_df2['Main_genre'].isnull()]
#And remove the columns related to genre in the pending2 dataframe:
pending2.drop(labels=['tag_counts', 'Main_genre'], axis=1, inplace=True)

In [69]:
len(retrieved2)

171991

### Data from Wikidata Query with SPARQL

In [70]:
#Open the files and load them into dataframes with the same column names (to match with our main dataframe later):
musicians = pd.read_csv('wikidata/query_wikidata_musicians.csv',sep=',', encoding='utf-8', usecols=[3,4])
musicians.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)
singers = pd.read_csv('wikidata/query_wikidata_singers.csv',sep=',', encoding='utf-8', usecols=[3,4])
singers.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)
bands = pd.read_csv('wikidata/query_wikidata_bands.csv',sep=',', encoding='utf-8', usecols=[3,4])
bands.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)

In [71]:
#Now we can concatenate the 3 dataframes into one:
wiki_df = pd.concat([musicians, singers, bands])
wiki_df.head()

Unnamed: 0,artist_genre,artist_mbid
0,,
1,opera,b972f589-fb0e-474e-b64a-803b0364fa75
2,classical music,b972f589-fb0e-474e-b64a-803b0364fa75
3,symphony,b972f589-fb0e-474e-b64a-803b0364fa75
4,concerto,b972f589-fb0e-474e-b64a-803b0364fa75


In [72]:
#We merge the dataframe with the tags_genres to retrieve tag_id and Main_genre:
wiki_genres = pd.merge(wiki_df, tags_all, how='left', left_on='artist_genre', right_on='tag_name')
wiki_genres.head()

Unnamed: 0,artist_genre,artist_mbid,tag_id,tag_name,Main_genre
0,,,32232.0,,
1,,,80586.0,,
2,opera,b972f589-fb0e-474e-b64a-803b0364fa75,480.0,opera,Classical
3,classical music,b972f589-fb0e-474e-b64a-803b0364fa75,2092.0,classical music,Classical
4,classical music,b972f589-fb0e-474e-b64a-803b0364fa75,54585.0,classical music,Classical


In [73]:
#We drop the rows that don't have any artist_mbid (as we won't be able to match them):
wiki_genres.dropna(subset=['artist_mbid'], axis=0, inplace=True)

As some artists appear more than once (if they have more than one tag), we will have to score again the top Main genre. This time, however, as we don't have any tag_counts we will keep the top Main genre directly.

In [74]:
wiki_scores = pd.pivot_table(wiki_genres, index=['artist_mbid', 'Main_genre'], aggfunc='count')
wiki_scores.reset_index(level=['artist_mbid','Main_genre'], inplace=True)
wiki_scores.drop(labels=['tag_id', 'tag_name'], axis=1, inplace=True)
wiki_scores.head()

Unnamed: 0,artist_mbid,Main_genre,artist_genre
0,00010eb3-ebfe-4965-81ef-0ac64cd49fde,Latin,1
1,00034ede-a1f1-4219-be39-02f36853373e,World,3
2,0004537a-4b12-43eb-a023-04009e738d2e,Electronic,2
3,00050e90-e93a-4b06-b233-8899d437d201,Rock,2
4,00077d46-7b4a-4761-9eed-c7dd435fa5ff,Rock,5


In [75]:
#We sort by artist_id and artist_genre:
wiki_scores.sort_values(['artist_mbid','artist_genre'], ascending=[True,False], inplace=True)
wiki_scores.head()

Unnamed: 0,artist_mbid,Main_genre,artist_genre
0,00010eb3-ebfe-4965-81ef-0ac64cd49fde,Latin,1
1,00034ede-a1f1-4219-be39-02f36853373e,World,3
2,0004537a-4b12-43eb-a023-04009e738d2e,Electronic,2
3,00050e90-e93a-4b06-b233-8899d437d201,Rock,2
4,00077d46-7b4a-4761-9eed-c7dd435fa5ff,Rock,5


In [76]:
#And now we can drop the duplicate artist_ids, keeping the top Main_genre:
wiki_scores.drop_duplicates(subset=['artist_mbid'],keep='first', inplace=True)

In [77]:
#Now we can input this new information into our main dataframe:
main_df3 = pd.merge(pending2, wiki_scores, how='left', on='artist_mbid')
main_df3.drop(labels=['artist_genre'], axis=1, inplace=True)
main_df3.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name,area_id,area_name,subdivision_name,country_name,latitude,longitude,Main_genre
0,22,Wish You Were Dead,166285,1996-01-01,51977,a13eb6dc-2708-4135-8eb3-de042e373ead,51977,Scheer,115532.0,County Londonderry,Northern Ireland,United Kingdom,54.787715,-6.492314,Rock
1,123,Brigitte Fontaine,173514,1972-01-01,30270,3e356f2a-3501-4e68-8eff-f77d582c066a,30270,Brigitte Fontaine,46952.0,Morlaix,,France,48.202047,-2.932644,Others
2,148,Star Trek: Nemesis: Music From the Original Mo...,138371,2002-01-01,1338,db403e3d-4753-45d6-8fcb-94db03e882e1,1338,Jerry Goldsmith,7956.0,Pasadena,,United States,36.778261,-119.417932,
3,151,Star Trek: First Contact: Original Motion Pict...,6647,1996-01-01,1338,db403e3d-4753-45d6-8fcb-94db03e882e1,1338,Jerry Goldsmith,7956.0,Pasadena,,United States,36.778261,-119.417932,
4,179,Joy Will Find a Way,164532,1975-01-01,298,254b70d3-4aec-4c64-ac95-b13a1dbb30cb,298,Bruce Cockburn,5107.0,Ottawa,,Canada,51.253775,-85.323214,Rock


In [78]:
main_df3.isnull().sum(axis=0)

release_id               0
release_group            1
group_id                 0
release_year             0
artist_id                0
artist_mbid              0
credit_id                0
artist_name              2
area_id                 14
area_name                3
subdivision_name    266254
country_name             0
latitude                 0
longitude                0
Main_genre          263776
dtype: int64

We have now 263.593 releases with no Main genre, so we have just retrieved the info for an extra 110.359 releases using the artists' information. In total, we have for now 329.935 releases with their genre information, so 55% of our Dataframe.

In [79]:
#We split again the dataframe in two, and keep retrieving:
retrieved3 = main_df3[main_df3['Main_genre'].notnull()]
pending3 = main_df3[main_df3['Main_genre'].isnull()]
#And remove the columns related to genre in the pending2 dataframe:
pending3.drop(labels=['Main_genre'], axis=1, inplace=True)

## Data from 1 Million Songs Dataset

In [80]:
#We open the file where some tracks have their genre associated:
tracks = pd.read_csv('1M_songs/msd_tagtraum_cd2c.csv', header=0, usecols = [0,1])
tracks.head()

Unnamed: 0,track_id,majority_genre
0,TRAAAAK128F9318786,Rock
1,TRAAAAW128F429D538,Rap
2,TRAAADJ128F4287B47,Rock
3,TRAAADZ128F9348C2E,Latin
4,TRAAAED128E0783FAB,Jazz


As we can see, these track have already a majority genre established. Which ones are there?

In [81]:
tracks.majority_genre.value_counts()

Rock          75013
Electronic    21865
Jazz          14700
Pop           12967
Rap           11001
RnB            9811
Metal          9224
Country        8983
Reggae         7970
Blues          6219
Folk           4188
Punk           3275
Latin          3113
World          1919
New Age        1153
Name: majority_genre, dtype: int64

Luckily, their groups are very similar to our Main genres so we'll just need to make a few changes of names in order for them to fit our classification:

- "Rap" will be changed to "Hip Hop"
- "RnB" will be changed to "R&B/Soul"
- "Metal" will be changed to "Rock"
- "Reggae" will be changed to "World"
- "New Age" will be changed to "Others"

In [82]:
tracks.replace({'Rap':'Hip Hop', 'RnB':'R&B/Soul', 'Metal':'Rock', 'Reggae': 'World', 'New Age':'Others'}, inplace=True)
tracks.majority_genre.value_counts()

Rock          84237
Electronic    21865
Jazz          14700
Pop           12967
Hip Hop       11001
World          9889
R&B/Soul       9811
Country        8983
Blues          6219
Folk           4188
Punk           3275
Latin          3113
Others         1153
Name: majority_genre, dtype: int64

In [83]:
#We open the file where we can match track_id and artist_mbid:
tracks_metadata = pd.read_csv('1M_songs/track_metadata.csv', header=0, usecols = [0,5])
tracks_metadata.head()

Unnamed: 0,track_id,artist_mbid
0,TRMMMYQ128F932D901,357ff05d-848a-44cf-b608-cb34b5701ae5
1,TRMMMKD128F425225D,8d7ef530-a6fd-4f8f-b2e2-74aec765e0f9
2,TRMMMRX128F93187D9,3d403d44-36ce-465c-ad43-ae877e65adc4
3,TRMMMCH128F425532C,12be7648-7094-495f-90e6-df4189d68615
4,TRMMMWA128F426B589,


In [84]:
#We drop the rows with no value in artist_mbid:
tracks_metadata.dropna(subset=['artist_mbid'], axis=0, inplace=True)

In [85]:
#We merge the tracks dataframe with tracks metadata to retrieve the genre by artist:
artist_genre_1m = pd.merge(tracks, tracks_metadata, how='left', on='track_id')
artist_genre_1m.drop(labels=['track_id'], axis=1, inplace=True)
artist_genre_1m.head()

Unnamed: 0,majority_genre,artist_mbid
0,Rock,6ae6a016-91d7-46cc-be7d-5e8e5d320c54
1,Hip Hop,e77e51a5-4761-45b3-9847-2051f811e366
2,Rock,3cf5a3be-25ef-4408-98fe-e66fee536be1
3,Latin,7a273984-edd9-4451-9c4d-39b38f05ebcd
4,Jazz,e0e9d279-37d5-4493-99b8-5a21309502f6


In [86]:
artist_genre_1m.duplicated(subset='artist_mbid').value_counts()

True     170381
False     21020
dtype: int64

There is more than one genre associated with each artist so, we will repeat our scoring procedure:

In [87]:
artist_genre_1m['count'] = 1
scores_1m = pd.pivot_table(artist_genre_1m, index=['artist_mbid', 'majority_genre'], aggfunc='count')
scores_1m.reset_index(level=['artist_mbid','majority_genre'], inplace=True)
scores_1m.head()

Unnamed: 0,artist_mbid,majority_genre,count
0,0002f649-8285-4a72-b847-b3854e1a449c,Rock,12
1,00034ede-a1f1-4219-be39-02f36853373e,Rock,11
2,0004537a-4b12-43eb-a023-04009e738d2e,Electronic,1
3,00077d46-7b4a-4761-9eed-c7dd435fa5ff,Rock,2
4,000842dd-08e9-485f-a9b6-8ada9f1c4a12,World,1


In [88]:
#We sort by artist_id and artist_genre:
scores_1m.sort_values(['artist_mbid','count'], ascending=[True,False], inplace=True)
scores_1m.head()

Unnamed: 0,artist_mbid,majority_genre,count
0,0002f649-8285-4a72-b847-b3854e1a449c,Rock,12
1,00034ede-a1f1-4219-be39-02f36853373e,Rock,11
2,0004537a-4b12-43eb-a023-04009e738d2e,Electronic,1
3,00077d46-7b4a-4761-9eed-c7dd435fa5ff,Rock,2
4,000842dd-08e9-485f-a9b6-8ada9f1c4a12,World,1


In [89]:
#And now we can drop the duplicate artist_ids, keeping the top Main_genre:
scores_1m.drop_duplicates(subset=['artist_mbid'],keep='first', inplace=True)
scores_1m.drop(labels=['count'], axis=1, inplace=True)

In [90]:
#Is there any null value in Main_genre?
scores_1m.majority_genre.isnull().value_counts()

False    21019
Name: majority_genre, dtype: int64

In [91]:
#And we can input this information into our pending3 dataframe:
main_df4 = pd.merge(pending3, scores_1m, how='left', on='artist_mbid')
#How many releases did we retrieve the genre for in this last step?
main_df4.majority_genre.isnull().value_counts()

True     234410
False     29366
Name: majority_genre, dtype: int64

In [92]:
main_df4.rename(columns={'majority_genre':'Main_genre'}, inplace=True)

We have now 234.410 releases with no Main genre, so we have just retrieved the info for an extra 29.366 releases using the 1 Million Songs data. In total, we have for now 359.301 releases with their genre information, so 60% of our Dataframe.

We'll split the dataframe in two like we did before, and step to the next part of this notebook.

In [93]:
#We split again the dataframe in two, and keep retrieving:
retrieved4 = main_df4[main_df4['Main_genre'].notnull()]
pending4 = main_df4[main_df4['Main_genre'].isnull()]

In [94]:
pending4.drop(labels=['Main_genre'], axis=1, inplace=True)

### Extend artist genre into all the dataframe:

The idea of this last stage is to verify if, for the same artist, some releases have a main genre and others don't (this could have happened in the first stage, when we retrieved the genre by release group).

In order to do it, we will concatenate all our dataframes retrieved and pending, and check it:

In [95]:
retrieved1.drop(labels=['tag_counts'], axis=1, inplace=True)
retrieved2.drop(labels=['tag_counts'], axis=1, inplace=True)

In [96]:
#We can concatenate them:
main_df5 = pd.concat([retrieved1, retrieved2, retrieved3, retrieved4, pending4 ], ignore_index=True)
main_df5.head()

Unnamed: 0,Main_genre,area_id,area_name,artist_id,artist_mbid,artist_name,country_name,credit_id,group_id,latitude,longitude,release_group,release_id,release_year,subdivision_name
0,Rock,22284.0,Newton,60,c0b2500e-0cef-4130-869d-732b23ed9df5,Tori Amos,United States,60,876990,35.759573,-79.0193,From the Choirgirl Hotel,4,1998-01-01,
1,Rock,22284.0,Newton,60,c0b2500e-0cef-4130-869d-732b23ed9df5,Tori Amos,United States,60,90019,35.759573,-79.0193,Scarlet's Walk,8,2002-01-01,
2,Pop,22284.0,Newton,60,c0b2500e-0cef-4130-869d-732b23ed9df5,Tori Amos,United States,60,95360,35.759573,-79.0193,Glory of the 80's,11,1999-01-01,
3,Rock,3813.0,Cardiff,20211,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,Super Furry Animals,United Kingdom,20211,94305,52.130661,-3.783712,Llanfairpwllgwyngyllgogerychwyndrobwllantysili...,15,1995-01-01,
4,Rock,5107.0,Ottawa,298,254b70d3-4aec-4c64-ac95-b13a1dbb30cb,Bruce Cockburn,Canada,298,164531,51.253775,-85.323214,Bruce Cockburn,23,1970-01-01,


In [97]:
len(main_df5)

593528

Now we want to select all the artists that have a Main genre associated, and count the number of releases that they have for each genre (this will be also like the scoring technique we used previously)

In [98]:
copy_maindf = main_df5[['artist_id', 'Main_genre']].copy()
copy_maindf['count'] = 1

In [99]:
scores_maindf = pd.pivot_table(copy_maindf, index=['artist_id', 'Main_genre'], aggfunc='count')
scores_maindf.reset_index(level=['artist_id','Main_genre'], inplace=True)
scores_maindf.head()

Unnamed: 0,artist_id,Main_genre,count
0,1.0,Classical,1
1,1.0,Electronic,1
2,1.0,World,1
3,4.0,Electronic,58
4,9.0,Electronic,16


In [100]:
#We sort by artist_id and artist_genre:
scores_maindf.sort_values(['artist_id','count'], ascending=[True,False], inplace=True)
scores_maindf.head()

Unnamed: 0,artist_id,Main_genre,count
0,1.0,Classical,1
1,1.0,Electronic,1
2,1.0,World,1
3,4.0,Electronic,58
4,9.0,Electronic,16


In order to avoid incorrectly assigning a genre to an artist who only has 1 release associated with it, in this case we'll consider only the release_count by genre above 1:

In [101]:
scores_maindf.drop(scores_maindf[scores_maindf['count'].isin([1])].index, axis=0, inplace=True)

In [102]:
#And now we can drop the duplicate artist_ids, keeping the top Main_genre:
scores_maindf.drop_duplicates(subset=['artist_id'],keep='first', inplace=True)
scores_maindf.drop(labels=['count'], axis=1, inplace=True)

In [103]:
len(scores_maindf)

29311

In [104]:
#Finally, we can merge our pending4 dataframe with this last one, and see if we retrieved more info:
main_df6 = pd.merge(pending4, scores_maindf, how='left', on='artist_id')

In [105]:
main_df6.Main_genre.isnull().value_counts()

True     222715
False     11695
Name: Main_genre, dtype: int64

Thankfully, by applying this last strategy, we have identified the genre for and extra 11.695 releases, which means that we have now a total of 370.996 releases with their genre (62% of the dataset).

Now, we'll gather all the information retrieved in a single file, and the information pending in another file:

In [106]:
#We split again the dataframe in two, and keep retrieving:
retrieved5 = main_df6[main_df6['Main_genre'].notnull()]
pending5 = main_df6[main_df6['Main_genre'].isnull()]

In [107]:
#Now we can concatenate the 5 retrieved dataframes:
all_retrieved = pd.concat([retrieved1, retrieved2, retrieved3, retrieved4, retrieved5 ], ignore_index=True)
all_retrieved.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name,area_id,area_name,subdivision_name,country_name,latitude,longitude,Main_genre
0,4,From the Choirgirl Hotel,876990,1998-01-01,60,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193,Rock
1,8,Scarlet's Walk,90019,2002-01-01,60,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193,Rock
2,11,Glory of the 80's,95360,1999-01-01,60,c0b2500e-0cef-4130-869d-732b23ed9df5,60,Tori Amos,22284.0,Newton,,United States,35.759573,-79.0193,Pop
3,15,Llanfairpwllgwyngyllgogerychwyndrobwllantysili...,94305,1995-01-01,20211,c5f5dc27-3059-49c0-ae45-5009a01bb9ec,20211,Super Furry Animals,3813.0,Cardiff,,United Kingdom,52.130661,-3.783712,Rock
4,23,Bruce Cockburn,164531,1970-01-01,298,254b70d3-4aec-4c64-ac95-b13a1dbb30cb,298,Bruce Cockburn,5107.0,Ottawa,,Canada,51.253775,-85.323214,Rock


In [108]:
all_retrieved.Main_genre.value_counts()

Rock          123152
Electronic     56456
Pop            45312
Classical      34767
Jazz           28053
Hip Hop        16256
Others          9322
Country         9209
Folk            8860
Punk            8717
Blues           8215
Latin           7614
R&B/Soul        7482
World           6494
HipHop           904
Name: Main_genre, dtype: int64

In [109]:
#It looks like we have 2 different names for Hip Hop:
all_retrieved.Main_genre = np.where(all_retrieved.Main_genre == 'Hip_Hop', 'Hip Hop',all_retrieved.Main_genre)

In [110]:
#We export the pending dataframe:
pending5.to_csv('data_pending_2.csv', sep='\t', index=False, encoding='utf-8')

## <font color=blue>4) Pending data retrieval with Wikipedia</font>

In this last step, we'll use the information we retrieved in Wikipedia (see auxiliary notebook "Wikipedia artists information retrieval", part 2).

In [111]:
wikipedia_artists = pd.read_csv('Wikipedia_genres_retrieved.csv', sep='\t', header=0, encoding='utf-8')
wikipedia_artists.head()

Unnamed: 0,artist_id,Main_genre,subgenre
0,562672.0,Rock,rock
1,153755.0,Pop,c-pop
2,279956.0,Pop,j-pop
3,210784.0,Rock,folk rock
4,35358.0,Pop,pop


In [112]:
#We can now merge our pending5 dataframe with the info retrieved in wikipedia:
pending5.drop(labels=['Main_genre'], axis=1, inplace=True)
retrieved_wikipedia = pd.merge(pending5, wikipedia_artists, how='left', on='artist_id')
retrieved_wikipedia.head()

Unnamed: 0,release_id,release_group,group_id,release_year,artist_id,artist_mbid,credit_id,artist_name,area_id,area_name,subdivision_name,country_name,latitude,longitude,Main_genre,subgenre
0,148,Star Trek: Nemesis: Music From the Original Mo...,138371,2002-01-01,1338,db403e3d-4753-45d6-8fcb-94db03e882e1,1338,Jerry Goldsmith,7956.0,Pasadena,,United States,36.778261,-119.417932,,
1,151,Star Trek: First Contact: Original Motion Pict...,6647,1996-01-01,1338,db403e3d-4753-45d6-8fcb-94db03e882e1,1338,Jerry Goldsmith,7956.0,Pasadena,,United States,36.778261,-119.417932,,
2,353,Angelscore,86104,1996-01-01,52464,3a7ef526-8a67-48dd-bb0c-1a62305fc22b,52464,Chainsuck,7488.0,Brookline,,United States,42.407211,-71.382437,,
3,428,Barbecue Music,47731,2000-01-01,32915,9620843a-2b7f-4a5d-bfde-46da601caa97,32915,Uncle Brian,19680.0,Salisbury,,United Kingdom,52.355518,-1.17432,,
4,430,It Just Seems Right,111349,2003-01-01,32915,9620843a-2b7f-4a5d-bfde-46da601caa97,32915,Uncle Brian,19680.0,Salisbury,,United Kingdom,52.355518,-1.17432,,


In [113]:
#How many releases did we identify the genre for in this last step?
retrieved_wikipedia.Main_genre.isnull().value_counts()

True     196279
False     26436
Name: Main_genre, dtype: int64

Thanks to Wikipedia, we have identified the genre for and extra 26.436 releases, which means that we have now a total of 397.432 releases with their genre (67% of the dataset).

We can now put all of them together and export our file for the visualization:

In [114]:
#We drop the rows for which we don't have Main genre:
retrieved_wikipedia.dropna(subset=['Main_genre'], axis=0, inplace=True)

In [115]:
#Adding an extra column "count" for later:
retrieved_wikipedia['count'] = 1

In [116]:
final = pd.concat([all_retrieved, retrieved_wikipedia], ignore_index=True)
final.to_csv('Final_dataframe_visualization.csv', sep='\t', index=False, encoding='utf-8')