# <font color=red>DATA GATHERING II: MUSIC GENRES AND SUBGENRES</font>

In [24]:
import pandas as pd
import numpy as np
import time
#import tqdm

## <font color=blue>1) Genres and subgenres</font>

Source: https://www.musicgenreslist.com/ plus others from Musicbrainz. In total, 945 subgenres classified into 14 main genres:

- Blues
- Classical
- Country
- Electronic
- Folk
- Hip Hop
- Jazz
- Latin
- Pop
- Punk
- Rhythm & Blues (R&B) / Soul
- Rock
- World (local music genres from specific regions of the world)
- Others (This category contains all the subgenres I haven't been able to classify in the previous categories)

In [70]:
all_genres = pd.read_csv('Main_genre_list.csv', sep='\t', header=0, encoding='utf-8')
all_genres.head()

Unnamed: 0,Main_genre,subgenre
0,Blues,blues music
1,Blues,acoustic blues
2,Blues,african blues
3,Blues,blues
4,Blues,blues rock


# CONTINUE FROM HERE

## <font color=blue>2) Release genre</font>

### Data from Musicbrainz.org

In [25]:
#We import our main dataframe from the previous notebook:
df = pd.read_csv('Dataframe_with_origin.csv', sep='\t', header=0, encoding='utf-8')
df.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,origin_code,origin_name_x,origin_ISO_code,origin_ISO_country,origin_code_type,is_duplicated
0,2379918,2133727,Ella lo que quiere (moombahton remix),2392457,222.0,United States,US,1.0,2014-01-01,998875.0,65e1233a-8183-4b22-95f8-3a5a674fe4b4,DKB Dkuba,222.0,United States,US,US,1.0,False
1,2379914,2133716,We're Gonna Fly (dance version),2392457,222.0,United States,US,1.0,2012-01-01,998875.0,65e1233a-8183-4b22-95f8-3a5a674fe4b4,DKB Dkuba,222.0,United States,US,US,1.0,False
2,2379913,2133723,Listen to What the Man Said,2734,221.0,United Kingdom,GB,1.0,1975-01-01,2734.0,d922d727-240a-4432-9a88-05a7cf9bc403,Wings,221.0,United Kingdom,GB,GB,1.0,False
3,2379911,2133721,Vi är såå lyckliga som bara olyckliga människo...,2392476,202.0,Sweden,SE,1.0,2019-01-01,73159.0,35e991b9-abf7-41dc-ab0e-0ca947463808,bob hund,202.0,Sweden,SE,SE,1.0,False
4,2379910,2133720,L’Ingénieux romanesque,67276,73.0,France,FR,1.0,2009-01-01,67276.0,831094a1-8295-459d-bc64-ed25e6cc1192,Boris Vian,73.0,France,FR,FR,1.0,False


According to Musicbrainz's Genre description in https://wiki.musicbrainz.org/Genre:

"Genres are currently supported in MusicBrainz as part of the tag system.

Some tags (the ones in the genre list) are automatically read and presented as genres."

What we want for our visualization is to have, for each release, its main genre and, where available, its subgenre. To do so, I have copied Musicbrainz's "genre list" into a CSV file. Musicbrainz considers 419 elements to be genres, but for our study we'll treat them as our subgenres.

I have manually classified all of these subgenres into 14 categories or "Main genres":

- Blues
- Classical
- Country
- Electronic
- Folk
- Heavy Metal
- Hip Hop
- Jazz
- Latin
- Pop
- Punk
- Rhythm & Blues (R&B)
- Rock
- Others (This category contains all the subgenres I haven't been able to classify in the previous categories)

Of course, I wasn't familiar with all the genres in the list, so to classify the unfamiliar ones I looked up their definition on Wikipedia and chose the best-fitting main genre. If Wikipedia provided no definition, I searched for the genre on Google and listened to a representative song before making a decision.

In [26]:
#Let's see what the genres and subgenres look like:
genres = pd.read_csv('Musicbrainz/Tables_used/genres.csv',sep='\t', encoding='utf-8')
genres.head()

Unnamed: 0,Main_genre,Subgenre
0,Electronic,acid house
1,Electronic,acid jazz
2,Electronic,acid techno
3,Blues,acoustic blues
4,Rock,acoustic rock


As we read before, Musicbrainz's genre list (our subgenres) is part of their tag system. Let's import Musicbrainz's "tags" table and try to identify which of its elements are genres.

In [27]:
tags = pd.read_csv('Musicbrainz/Tables_used/tags.txt',sep='\t', header=None, engine='c', usecols=[0,1])
tags.columns = ['tag_id','tag_name']
tags.head()

Unnamed: 0,tag_id,tag_name
0,95,finnish
1,23,slovak
2,801,iowa
3,4,groundbreaking
4,130,taiwanese


In [28]:
#How many tags are there?
tags['tag_id'].nunique()

86806

In [29]:
#What do the tags look like?
tags.tag_name.value_counts()

acid folk                                                                                               2
classical music                                                                                         2
rock music                                                                                              2
rock independiente                                                                                      2
mr puaz                                                                                                 2
west wales                                                                                              2
fred seibert                                                                                            2
la escena                                                                                               2
mike oldfield                                                                                           2
プラスチックのcd箱（2枚）について、少しがっかりした… それは新たのきれいなケースになって

As we can see, the tags list contains the genres but also other (more subjective) expressions that some users have chosen as representative for the music entity. 

We will add columns to this tags dataframe to distinguish which of them are actually genres/subgenres:

In [30]:
#First, we change the Subgenre column name to tag_name in our genre file, to be able to join both dataframes:
genres.rename(columns={'Subgenre':'tag_name'}, inplace=True)
tags_genres = pd.merge(tags, genres, how='left', on='tag_name')
tags_genres.head()

Unnamed: 0,tag_id,tag_name,Main_genre
0,95,finnish,
1,23,slovak,
2,801,iowa,
3,4,groundbreaking,
4,130,taiwanese,


In [31]:
#Did we identify all the 419 genres in our dataframe?
pd.notna(tags_genres['Main_genre']).value_counts()

False    86380
True       426
Name: Main_genre, dtype: int64

In [32]:
#We retrieved 7 more, are there duplicates?
table = tags_genres.dropna(subset=['Main_genre'], axis=0).groupby('tag_name').count()
table[table['tag_id'] != 1]

tag_name,tag_id,Main_genre
alternative rock,2,2
hard rock,2,2
hip hop,2,2
indie rock,2,2
new age,2,2
pop punk,2,2
pop rap,2,2
pop rock,2,2
progressive rock,2,2
psychedelic rock,2,2


It seems that we have 12 subgenres repeated twice in our tags_genres dataframe. That means each of them probably has 2 different tag_ids:

In [33]:
list_duplicates = table[table['tag_id'] != 1].index.tolist()
tags_genres[tags_genres['tag_name'].isin(list_duplicates)]

Unnamed: 0,tag_id,tag_name,Main_genre
13595,1182,pop rap,Hip Hop
14217,133,punk rock,Rock
14238,235,hip hop,Hip Hop
14373,7,rock,Rock
15338,1100,pop punk,Punk
15380,618,new age,Others
15534,29,progressive rock,Rock
16100,284,indie rock,Rock
16528,271,hard rock,Rock
16616,1091,pop rock,Rock


Indeed, each of them has two tag_ids, so we need to keep both in order not to lose information later on.
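
As a side check, one way to list every genre name that maps to more than one tag_id is `duplicated(keep=False)`. The sketch below runs on a small made-up frame with the same columns as tags_genres; the ids 9001 and 9002 are invented for illustration and are not real Musicbrainz tag ids.

```python
import pandas as pd

# Toy stand-in for tags_genres (tag ids 9001/9002 are made up for illustration).
toy_tags_genres = pd.DataFrame({
    "tag_id": [7, 9001, 235, 9002, 95],
    "tag_name": ["rock", "rock", "hip hop", "hip hop", "finnish"],
    "Main_genre": ["Rock", "Rock", "Hip Hop", "Hip Hop", None],
})

# Keep only the rows that were identified as genres.
genre_rows = toy_tags_genres.dropna(subset=["Main_genre"])

# keep=False marks every occurrence of a repeated tag_name, not just the later ones.
repeated = genre_rows[genre_rows["tag_name"].duplicated(keep=False)]

# Collect the tag_ids that share each name.
ids_per_name = repeated.groupby("tag_name")["tag_id"].apply(list).to_dict()
print(ids_per_name)
```

Because both tag_ids stay in the frame, a later merge on tag_id still matches tags that users attached through either id.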

Musicbrainz provides a table with all the release groups that have been tagged by its users. What we'll do next is retrieve those tags and select the ones that are part of the genre list.

In [34]:
release_groups = pd.read_csv('Musicbrainz/Tables_used/release_group.txt',sep='\t', header=None, engine='c', usecols=[0,1,2,3])
release_groups.columns = ['group_id','group_mbid','release_group_name','artist_credit']
release_groups.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,627364
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,12
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11
3,28,c554da1a-c1aa-30c3-b0bb-44b1b837de33,Piece and Love,26
4,60,06729175-db17-3443-add7-921739a92762,Ultimate Alternative Wavers,44


In [35]:
release_groups['group_id'].nunique()

1745126

In [36]:
len(release_groups)

1745126

In [37]:
group_tag = pd.read_csv('Musicbrainz/Tables_used/release_group_tag.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
group_tag.columns = ['group_id','tag_id','tag_counts']
group_tag.head()

Unnamed: 0,group_id,tag_id,tag_counts
0,93688,150,1
1,906692,1371,1
2,906692,6948,1
3,617615,11,1
4,617615,545,1


In [38]:
#We can now merge the release groups with the tag ids and tag counts:
Table = pd.merge(release_groups, group_tag, how='left', on='group_id')
Table.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,627364,,
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,12,41017.0,2.0
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,1053.0,2.0
3,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,1230.0,1.0
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,71.0,3.0


In [39]:
#And finally have our release groups associated with their genres:
release_group_genre = pd.merge(Table, tags_genres, how='left', on='tag_id')
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,627364,,,,
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,12,41017.0,2.0,alternative/indie rock,
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,1053.0,2.0,swing,Jazz
3,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,1230.0,1.0,dixieland,
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,71.0,3.0,jazz,Jazz


Let's stop here for a while and check one of the releases that has several genre tags associated. Let's do this with one of the most popular releases of all time: the album "Thriller", by the King of Pop, Michael Jackson.

In [40]:
release_group_genre[release_group_genre['group_mbid']=='f32fab67-77dd-3937-addc-9062e28e4c37']

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
1429052,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,7282.0,2.0,vendu,
1429053,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,642.0,2.0,disco,Electronic
1429054,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,7935.0,1.0,discothèque,
1429055,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,24521.0,0.0,80 s and 90 s pop,
1429056,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,1060.0,1.0,dance-pop,Electronic
1429057,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,303.0,3.0,funk,Others
1429058,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,11.0,0.0,electronic,Electronic
1429059,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,41021.0,2.0,club/dance,
1429060,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,76.0,1.0,dance,Electronic
1429061,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,41027.0,3.0,contemporary r&b,R&B


As we can see, "Pop" is the most used tag for this group so we should keep it as the release's genre.

As music genre is a very subjective feature, in order to be as "objective" as possible, we'll use the majority of the votes to choose the subgenre and main genre of each release group.

To do so, we will sort the release_group_genre dataframe by number of counts and keep the top tag for each release group.
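
The idea can be sketched on a small made-up frame before applying it to the full table. The column names match release_group_genre, but the values are only illustrative.

```python
import pandas as pd

# Toy stand-in for release_group_genre (values are illustrative).
toy = pd.DataFrame({
    "group_id": [4, 4, 4, 13, 13],
    "tag_name": ["electronic", "trip hop", "downtempo", "swing", "jazz"],
    "tag_counts": [5, 7, 6, 2, 3],
})

# Sort so that, within each group, the most-voted tag comes first...
ordered = toy.sort_values(["group_id", "tag_counts"], ascending=[True, False])

# ...then keep only that first row per group.
top_tag = ordered.drop_duplicates(subset=["group_id"], keep="first")
print(top_tag)
```

When two tags of the same group have the same count, which one is kept depends on the row order after sorting, so ties are resolved arbitrarily.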

In [41]:
#We sort by group_id and tag_counts:
release_group_genre.sort_values(['group_id','tag_counts'], ascending=[True,False], inplace=True)
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
312152,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,1,1186.0,2.0,acid rap,
312153,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,1,92310.0,1.0,oldest release group #2,
737291,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,4,1498.0,7.0,trip hop,Hip Hop
737302,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,4,12.0,6.0,downtempo,Electronic
737293,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,4,11.0,5.0,electronic,Electronic


In [42]:
#And now we can drop the duplicate group_ids, keeping the top tags:
release_group_genre.drop_duplicates(subset=['group_id'],keep='first', inplace=True)
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
312152,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,1,1186.0,2.0,acid rap,
737291,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,4,1498.0,7.0,trip hop,Hip Hop
1756939,11,c6fe6a2b-0ed6-3d2c-b9ce-ddd5421a3452,Hot,11,71.0,3.0,jazz,Jazz
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,12,41017.0,2.0,alternative/indie rock,
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,71.0,3.0,jazz,Jazz


What we want now is to combine our main dataframe with this new genre information we just retrieved:

In [43]:
#We merge both dataframes:
main_df = pd.merge(df, release_group_genre, how='left', on='group_id')
main_df.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,origin_ISO_country,origin_code_type,is_duplicated,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
0,2379918,2133727,Ella lo que quiere (moombahton remix),2392457,222.0,United States,US,1.0,2014-01-01,998875.0,...,US,1.0,False,2b11e3a6-8a8f-462e-80c9-76ad6381defe,Ella lo que quiere (moombahton remix),2392457.0,,,,
1,2379914,2133716,We're Gonna Fly (dance version),2392457,222.0,United States,US,1.0,2012-01-01,998875.0,...,US,1.0,False,a06a9783-9fec-4d32-86bd-a8235e76c95d,We're Gonna Fly,2392457.0,,,,
2,2379913,2133723,Listen to What the Man Said,2734,221.0,United Kingdom,GB,1.0,1975-01-01,2734.0,...,GB,1.0,False,f2606a94-bb60-4103-8897-5c35d2bc3cde,Listen to What the Man Said,2734.0,,,,
3,2379911,2133721,Vi är såå lyckliga som bara olyckliga människo...,2392476,202.0,Sweden,SE,1.0,2019-01-01,73159.0,...,SE,1.0,False,ac8f26ba-ed9d-4e91-93ce-c6d0ada3feaf,Vi är såå lyckliga som bara olyckliga människo...,2392476.0,,,,
4,2379910,2133720,L’Ingénieux romanesque,67276,73.0,France,FR,1.0,2009-01-01,67276.0,...,FR,1.0,False,c64f2a59-0ab8-46b9-99e3-28d58929850c,L’Ingénieux romanesque,67276.0,,,,


In [44]:
len(main_df)

1362763

In [45]:
#For how many releases do we have the main genre now?
main_df.Main_genre.isnull().value_counts()

True     1240437
False     122326
Name: Main_genre, dtype: int64

So, according to the results above, we currently have the genre for just 122,326 releases out of a total of 1,362,763 (only 9%).
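
Since this coverage figure is recomputed after every enrichment step, it can be handy to wrap it in a small helper. The function name `genre_coverage` is hypothetical and the frame below is a toy example.

```python
import pandas as pd

def genre_coverage(frame: pd.DataFrame, column: str = "Main_genre") -> float:
    """Return the fraction of rows whose `column` is not missing."""
    return frame[column].notna().mean()

# Toy example: 2 of 5 releases have a main genre.
toy = pd.DataFrame({"Main_genre": ["Rock", None, None, "Jazz", None]})
print(f"{genre_coverage(toy):.0%}")
```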

## <font color=blue>3) Artist genre</font>

In order to retrieve more genres, the next step is to retrieve each artist's genre (as we did for the release groups) and add it to our main_df.

Note: by doing this, we are assuming that each band or artist always produces music of the same genre. This is not always accurate (especially at the subgenre level). In general, however, most bands and artists stay within the same musical line throughout their careers and can be assigned to a single "Main genre". Again, this is an assumption that we need to make in order to retrieve more information for this project.

For that, we'll first use Musicbrainz's artist_tag table and follow the same process as before.

In [46]:
artist_tag = pd.read_csv('Musicbrainz/Tables_used/artist_tag.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
artist_tag.columns = ['artist_id','tag_id','tag_counts']
artist_tag.head()

Unnamed: 0,artist_id,tag_id,tag_counts
0,468800,29,2
1,522545,63294,1
2,31390,173,1
3,108404,271,1
4,108404,7,1


In [47]:
#We merge it with the tags_genres dataframe:
temp = pd.merge(artist_tag, tags_genres, how='left', on='tag_id')
temp.head()

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre
0,468800,29,2,progressive rock,Rock
1,522545,63294,1,austrian composer,
2,31390,173,1,polish,
3,108404,271,1,hard rock,Rock
4,108404,7,1,rock,Rock


In [48]:
artists= pd.read_csv('Musicbrainz/Tables_used/artist.txt',sep='\t', header=None, engine='c', usecols=[0,2])
artists.columns = ['artist_id','artist_name']
artists.head()

Unnamed: 0,artist_id,artist_name
0,805192,WIK▲N
1,371203,Pete Moutso
2,273232,Zachary
3,101060,The Silhouettes
4,145773,Aric Leavitt


In [49]:
#We merge it with the artist dataframe to see the names for each artist:
artist_genre = pd.merge(temp, artists[['artist_id','artist_name']], on='artist_id', how='left')
artist_genre.head()

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
0,468800,29,2,progressive rock,Rock,Citadel
1,522545,63294,1,austrian composer,,Robert Fuchs
2,31390,173,1,polish,,Behemoth
3,108404,271,1,hard rock,Rock,Blake
4,108404,7,1,rock,Rock,Blake


In [50]:
#We sort by artist_id and tag_counts:
artist_genre.sort_values(['artist_id','tag_counts'], ascending=[False,False], inplace=True)
artist_genre.head(20)

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
313038,1797176,133,1,punk rock,Rock,Spit
313037,1797175,235,1,hip hop,Hip Hop,MC Kresha
313035,1797174,68,1,melodic death metal,Heavy Metal,Spit
313036,1797174,94,1,metalcore,Heavy Metal,Spit
313025,1797170,49,1,instrumental,Others,Their Methlab
313026,1797170,7,1,rock,Rock,Their Methlab
313027,1797170,48394,1,heavy psych,,Their Methlab
313028,1797170,709,1,psychedelic rock,Rock,Their Methlab
313029,1797170,93,1,stoner rock,Rock,Their Methlab
313030,1797170,16,1,post-rock,Rock,Their Methlab


In [51]:
#And now we can drop the duplicate artist_ids, keeping the top tags:
artist_genre.drop_duplicates(subset=['artist_id'],keep='first', inplace=True)
artist_genre.head(20)

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
313038,1797176,133,1,punk rock,Rock,Spit
313037,1797175,235,1,hip hop,Hip Hop,MC Kresha
313035,1797174,68,1,melodic death metal,Heavy Metal,Spit
313025,1797170,49,1,instrumental,Others,Their Methlab
313010,1797148,88,1,punk,Punk,School Damage
312981,1797111,652,1,producer,,LuxrayBeats
312982,1797110,583,1,actress,,Suzan Shine
312983,1797109,2345,1,dancer,,Helen Schmiedle
312984,1797108,2345,1,dancer,,Dioni Birmpili
312985,1797107,2345,1,dancer,,Natascha Böhler
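
In the output above, some artists end up with a top tag that is not a genre (for example "producer" or "dancer"), so their Main_genre stays empty even if one of their other tags might be a genre. One possible variation, which is not what this notebook does, is to discard the non-genre tags before selecting the top one. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy stand-in for artist_genre (artist ids and counts are illustrative).
toy = pd.DataFrame({
    "artist_id": [1, 1, 2, 2, 3],
    "tag_name": ["producer", "hip hop", "rock", "hard rock", "dancer"],
    "tag_counts": [3, 1, 1, 2, 1],
    "Main_genre": [None, "Hip Hop", "Rock", "Rock", None],
})

# Variation: discard the tags that are not in the genre list first...
genre_only = toy.dropna(subset=["Main_genre"])

# ...and then keep the most-voted remaining tag per artist.
top_genre = (
    genre_only
    .sort_values(["artist_id", "tag_counts"], ascending=[True, False])
    .drop_duplicates(subset=["artist_id"], keep="first")
)
print(top_genre)
```

The trade-off is that artists with no genre tag at all (artist 3 here) disappear from the result, which is harmless for a later left merge but changes the row count of this intermediate table.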


In [52]:
#We add this new information into our main dataframe:
main_df2 = pd.merge(main_df, artist_genre, how='left', on='artist_id')
main_df2.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,artist_credit,tag_id_x,tag_counts_x,tag_name_x,Main_genre_x,tag_id_y,tag_counts_y,tag_name_y,Main_genre_y,artist_name
0,2379918,2133727,Ella lo que quiere (moombahton remix),2392457,222.0,United States,US,1.0,2014-01-01,998875.0,...,2392457.0,,,,,1371.0,1.0,latin,Latin,DKB Dkuba
1,2379914,2133716,We're Gonna Fly (dance version),2392457,222.0,United States,US,1.0,2012-01-01,998875.0,...,2392457.0,,,,,1371.0,1.0,latin,Latin,DKB Dkuba
2,2379913,2133723,Listen to What the Man Said,2734,221.0,United Kingdom,GB,1.0,1975-01-01,2734.0,...,2734.0,,,,,7.0,2.0,rock,Rock,Wings
3,2379911,2133721,Vi är såå lyckliga som bara olyckliga människo...,2392476,202.0,Sweden,SE,1.0,2019-01-01,73159.0,...,2392476.0,,,,,66.0,2.0,swedish,,bob hund
4,2379910,2133720,L’Ingénieux romanesque,67276,73.0,France,FR,1.0,2009-01-01,67276.0,...,67276.0,,,,,71.0,1.0,jazz,Jazz,Boris Vian


In [53]:
main_df2.isnull().sum(axis=0)

release_id                  0
group_id                    0
release_group               4
credit_id                   0
release_area                0
release_area_name           0
release_ISO_code           15
release_code_type      228659
release_year                0
artist_id                 151
artist_mbid               151
artist_name_x             155
origin_code             80839
origin_name_x           80839
origin_ISO_code        107607
origin_ISO_country      80940
origin_code_type        81779
is_duplicated               0
group_mbid               1370
release_group_name       1374
artist_credit            1370
tag_id_x              1193393
tag_counts_x          1193393
tag_name_x            1194160
Main_genre_x          1240437
tag_id_y               758007
tag_counts_y           758007
tag_name_y             759601
Main_genre_y          1042893
artist_name            758009
dtype: int64

In order to determine how much extra information we have retrieved in this last step, we need to put all the genre information into the same column.

We will repeat the procedure we followed for the origin columns: if the release has a Main genre, we leave it as is. If the value is missing, we fill it with the artist's Main genre.

In [54]:
main_df2.Main_genre_x = np.where(main_df2.Main_genre_x.isnull(), main_df2.Main_genre_y, main_df2.Main_genre_x)
main_df2.tag_id_x = np.where(main_df2.Main_genre_x.isnull(), main_df2.tag_id_y, main_df2.tag_id_x)
main_df2.tag_name_x = np.where(main_df2.Main_genre_x.isnull(), main_df2.tag_name_y, main_df2.tag_name_x)
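
Note that the three `np.where` calls above run in sequence, so the second and third ones test `Main_genre_x` after the first call has already filled it. A sketch of an alternative that evaluates the condition once and reuses it for every column is shown below on a toy frame. It is not the code that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for main_df2 (values are illustrative).
toy = pd.DataFrame({
    "tag_name_x":   [np.nan, "disco", np.nan],
    "Main_genre_x": [np.nan, "Electronic", np.nan],
    "tag_name_y":   ["latin", "rock", np.nan],
    "Main_genre_y": ["Latin", "Rock", np.nan],
})

# Evaluate the condition once, before any column is modified.
missing_release_genre = toy["Main_genre_x"].isnull()

# Fill every release-level column from the artist-level one using the same mask.
for col in ["tag_name", "Main_genre"]:
    toy[f"{col}_x"] = np.where(missing_release_genre, toy[f"{col}_y"], toy[f"{col}_x"])

print(toy[["tag_name_x", "Main_genre_x"]])
```

With a shared mask, the tag name and the main genre of a row always come from the same source (release or artist).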

In [55]:
main_df2.isnull().sum(axis=0)

release_id                  0
group_id                    0
release_group               4
credit_id                   0
release_area                0
release_area_name           0
release_ISO_code           15
release_code_type      228659
release_year                0
artist_id                 151
artist_mbid               151
artist_name_x             155
origin_code             80839
origin_name_x           80839
origin_ISO_code        107607
origin_ISO_country      80940
origin_code_type        81779
is_duplicated               0
group_mbid               1370
release_group_name       1374
artist_credit            1370
tag_id_x               965243
tag_counts_x          1193393
tag_name_x             967083
Main_genre_x           979223
tag_id_y               758007
tag_counts_y           758007
tag_name_y             759601
Main_genre_y          1042893
artist_name            758009
dtype: int64

In [56]:
len(main_df2)

1362763

In [57]:
main_df2.columns

Index(['release_id', 'group_id', 'release_group', 'credit_id', 'release_area',
       'release_area_name', 'release_ISO_code', 'release_code_type',
       'release_year', 'artist_id', 'artist_mbid', 'artist_name_x',
       'origin_code', 'origin_name_x', 'origin_ISO_code', 'origin_ISO_country',
       'origin_code_type', 'is_duplicated', 'group_mbid', 'release_group_name',
       'artist_credit', 'tag_id_x', 'tag_counts_x', 'tag_name_x',
       'Main_genre_x', 'tag_id_y', 'tag_counts_y', 'tag_name_y',
       'Main_genre_y', 'artist_name'],
      dtype='object')

Not bad: we now have "only" 979,223 releases with no main genre, so this step retrieved the genre for an extra 261,214 releases. In total, 383,540 releases now have genre information, which is 28% of our dataframe.

In [58]:
#We rename the merged columns and delete the extra ones.
#Note: after the rename, two columns are called 'artist_name', and drop() removes both of them:
main_df2.rename(columns={'artist_name_x':'artist_name','tag_id_x':'tag_id','tag_name_x':'tag_name','Main_genre_x':'Main_genre'}, inplace=True)
main_df2.drop(labels=['tag_counts_x','tag_id_y','tag_counts_y','tag_name_y','Main_genre_y','artist_name'], axis=1, inplace=True)

### Data from Wikidata Query with SPARQL

In [59]:
#Open the files and load them into dataframes with the same column names (to match with our main dataframe later):
musicians = pd.read_csv('wikidata/query_wikidata_musicians.csv',sep=',', encoding='utf-8', usecols=[3,4])
musicians.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)
singers = pd.read_csv('wikidata/query_wikidata_singers.csv',sep=',', encoding='utf-8', usecols=[3,4])
singers.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)
bands = pd.read_csv('wikidata/query_wikidata_bands.csv',sep=',', encoding='utf-8', usecols=[3,4])
bands.rename(columns={'genreLabel':'artist_genre','MusicBrainz_artist_ID':'artist_mbid'}, inplace=True)

In [60]:
#Now we can concatenate the 3 dataframes into one:
wiki_df = pd.concat([musicians, singers, bands])
wiki_df.head()

Unnamed: 0,artist_genre,artist_mbid
0,,
1,opera,b972f589-fb0e-474e-b64a-803b0364fa75
2,classical music,b972f589-fb0e-474e-b64a-803b0364fa75
3,symphony,b972f589-fb0e-474e-b64a-803b0364fa75
4,concerto,b972f589-fb0e-474e-b64a-803b0364fa75


In [61]:
#We merge the dataframe with the tags_genres to retrieve tag_id and Main_genre:
wiki_genres = pd.merge(wiki_df, tags_genres, how='left', left_on='artist_genre', right_on='tag_name')
wiki_genres.head()

Unnamed: 0,artist_genre,artist_mbid,tag_id,tag_name,Main_genre
0,,,32232.0,,
1,,,80586.0,,
2,opera,b972f589-fb0e-474e-b64a-803b0364fa75,480.0,opera,Classical
3,classical music,b972f589-fb0e-474e-b64a-803b0364fa75,2092.0,classical music,
4,classical music,b972f589-fb0e-474e-b64a-803b0364fa75,54585.0,classical music,


# HERE: IDENTIFY AS MANY AS POSSIBLE FROM TAG_NAME

As we can see above in the 4th and 5th rows, some tag names (such as "classical music") are not matched to a main genre but could easily be classified.

I will export them into a CSV file and tag them manually.

As some artists appear more than once (if they have more than one tag), we will sort them by artist_mbid and Main_genre and keep the first appearance for each artist. In this case, we don't have a tag_count field so we can't really know which is the main one.
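
One detail makes this approach reasonable: `sort_values` places missing values last by default (`na_position='last'`), so for each artist any row with a known main genre comes before the rows without one. A small sketch with placeholder identifiers ("mbid-a" and "mbid-b" are not real MBIDs):

```python
import numpy as np
import pandas as pd

# Toy stand-in for wiki_genres ("mbid-a"/"mbid-b" are placeholder identifiers).
toy = pd.DataFrame({
    "artist_mbid": ["mbid-a", "mbid-a", "mbid-b", "mbid-b"],
    "artist_genre": ["classical music", "opera", "J-pop", "reggae"],
    "Main_genre": [np.nan, "Classical", np.nan, "Others"],
})

# NaN values go last within each artist, so rows with a known Main_genre come first.
ordered = toy.sort_values(["artist_mbid", "Main_genre"])

# Keeping the first row per artist therefore prefers a row with a main genre.
one_per_artist = ordered.drop_duplicates(subset=["artist_mbid"], keep="first")
print(one_per_artist)
```

When an artist has several known main genres, the alphabetically first one is kept, which is an arbitrary choice.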

In [62]:
#We sort by artist_mbid and Main_genre:
wiki_genres.sort_values(['artist_mbid','Main_genre'], inplace=True)
wiki_genres.head()

Unnamed: 0,artist_genre,artist_mbid,tag_id,tag_name,Main_genre
112398,flamenco,00010eb3-ebfe-4965-81ef-0ac64cd49fde,367.0,flamenco,Folk
314081,,000200d1-1176-4859-b39c-669bde26ecea,32232.0,,
314082,,000200d1-1176-4859-b39c-669bde26ecea,80586.0,,
163208,,00026532-1fe3-45fb-a0df-34aec04a1319,32232.0,,
163209,,00026532-1fe3-45fb-a0df-34aec04a1319,80586.0,,


In [63]:
#And now we can drop the duplicate artist_mbids, keeping the top rows:
wiki_genres.drop_duplicates(subset=['artist_mbid'],keep='first', inplace=True)
wiki_genres.head()

Unnamed: 0,artist_genre,artist_mbid,tag_id,tag_name,Main_genre
112398,flamenco,00010eb3-ebfe-4965-81ef-0ac64cd49fde,367.0,flamenco,Folk
314081,,000200d1-1176-4859-b39c-669bde26ecea,32232.0,,
163208,,00026532-1fe3-45fb-a0df-34aec04a1319,32232.0,,
271722,reggae,00034ede-a1f1-4219-be39-02f36853373e,267.0,reggae,Others
125067,J-pop,0003fd17-b083-41fe-83a9-d550bd4f00a1,,,


In [64]:
#Now we can input this new information into our main dataframe:
main_df3 = pd.merge(main_df2, wiki_genres, how='left', on='artist_mbid')
main_df3.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,group_mbid,release_group_name,artist_credit,tag_id_x,tag_name_x,Main_genre_x,artist_genre,tag_id_y,tag_name_y,Main_genre_y
0,2379918,2133727,Ella lo que quiere (moombahton remix),2392457,222.0,United States,US,1.0,2014-01-01,998875.0,...,2b11e3a6-8a8f-462e-80c9-76ad6381defe,Ella lo que quiere (moombahton remix),2392457.0,,,Latin,,,,
1,2379914,2133716,We're Gonna Fly (dance version),2392457,222.0,United States,US,1.0,2012-01-01,998875.0,...,a06a9783-9fec-4d32-86bd-a8235e76c95d,We're Gonna Fly,2392457.0,,,Latin,,,,
2,2379913,2133723,Listen to What the Man Said,2734,221.0,United Kingdom,GB,1.0,1975-01-01,2734.0,...,f2606a94-bb60-4103-8897-5c35d2bc3cde,Listen to What the Man Said,2734.0,,,Rock,power pop,339.0,power pop,Punk
3,2379911,2133721,Vi är såå lyckliga som bara olyckliga människo...,2392476,202.0,Sweden,SE,1.0,2019-01-01,73159.0,...,ac8f26ba-ed9d-4e91-93ce-c6d0ada3feaf,Vi är såå lyckliga som bara olyckliga människo...,2392476.0,66.0,swedish,,rock music,1501.0,rock music,
4,2379910,2133720,L’Ingénieux romanesque,67276,73.0,France,FR,1.0,2009-01-01,67276.0,...,c64f2a59-0ab8-46b9-99e3-28d58929850c,L’Ingénieux romanesque,67276.0,,,Jazz,jazz,71.0,jazz,Jazz


In order to determine how much extra information we have retrieved in this last step, we need to put all the genre information into the same column.

We will repeat the procedure we followed in previous steps: if the release has a Main_genre_x and tag_name_x, we leave it as is. If the value is missing, we fill it with the data from wikidata (tag_name_y, Main_genre_y).

In [65]:
#Where the release has neither a tag name nor a main genre, we fill them with the Wikidata values:
main_df3.tag_name_x = np.where(np.logical_and(main_df3.tag_name_x.isnull(),main_df3.Main_genre_x.isnull()), main_df3.tag_name_y, main_df3.tag_name_x)
main_df3.Main_genre_x = np.where(np.logical_and(main_df3.tag_name_x.isnull(),main_df3.Main_genre_x.isnull()), main_df3.Main_genre_y, main_df3.Main_genre_x)
main_df3.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,group_mbid,release_group_name,artist_credit,tag_id_x,tag_name_x,Main_genre_x,artist_genre,tag_id_y,tag_name_y,Main_genre_y
0,2379918,2133727,Ella lo que quiere (moombahton remix),2392457,222.0,United States,US,1.0,2014-01-01,998875.0,...,2b11e3a6-8a8f-462e-80c9-76ad6381defe,Ella lo que quiere (moombahton remix),2392457.0,,,Latin,,,,
1,2379914,2133716,We're Gonna Fly (dance version),2392457,222.0,United States,US,1.0,2012-01-01,998875.0,...,a06a9783-9fec-4d32-86bd-a8235e76c95d,We're Gonna Fly,2392457.0,,,Latin,,,,
2,2379913,2133723,Listen to What the Man Said,2734,221.0,United Kingdom,GB,1.0,1975-01-01,2734.0,...,f2606a94-bb60-4103-8897-5c35d2bc3cde,Listen to What the Man Said,2734.0,,,Rock,power pop,339.0,power pop,Punk
3,2379911,2133721,Vi är såå lyckliga som bara olyckliga människo...,2392476,202.0,Sweden,SE,1.0,2019-01-01,73159.0,...,ac8f26ba-ed9d-4e91-93ce-c6d0ada3feaf,Vi är såå lyckliga som bara olyckliga människo...,2392476.0,66.0,swedish,,rock music,1501.0,rock music,
4,2379910,2133720,L’Ingénieux romanesque,67276,73.0,France,FR,1.0,2009-01-01,67276.0,...,c64f2a59-0ab8-46b9-99e3-28d58929850c,L’Ingénieux romanesque,67276.0,,,Jazz,jazz,71.0,jazz,Jazz


In [66]:
main_df3.isnull().sum(axis=0)

release_id                  0
group_id                    0
release_group               4
credit_id                   0
release_area                0
release_area_name           0
release_ISO_code           15
release_code_type      228659
release_year                0
artist_id                 151
artist_mbid               151
origin_code             80839
origin_name_x           80839
origin_ISO_code        107607
origin_ISO_country      80940
origin_code_type        81779
is_duplicated               0
group_mbid               1370
release_group_name       1374
artist_credit            1370
tag_id_x               965243
tag_name_x             867739
Main_genre_x           979223
artist_genre           944823
tag_id_y               873411
tag_name_y             979692
Main_genre_y          1111353
dtype: int64

In [68]:
#And now we can delete and rename some columns:
main_df3.rename(columns={'origin_name_x':'origin_name','tag_id_x':'tag_id','tag_name_x':'tag_name','Main_genre_x':'main_genre'}, inplace=True)
main_df3.drop(labels=['group_mbid','release_group_name','artist_credit','tag_name_y','tag_id_y','Main_genre_y'], axis=1, inplace=True)
main_df3.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,origin_code,origin_name,origin_ISO_code,origin_ISO_country,origin_code_type,is_duplicated,tag_id,tag_name,main_genre,artist_genre
0,2379918,2133727,Ella lo que quiere (moombahton remix),2392457,222.0,United States,US,1.0,2014-01-01,998875.0,...,222.0,United States,US,US,1.0,False,,,Latin,
1,2379914,2133716,We're Gonna Fly (dance version),2392457,222.0,United States,US,1.0,2012-01-01,998875.0,...,222.0,United States,US,US,1.0,False,,,Latin,
2,2379913,2133723,Listen to What the Man Said,2734,221.0,United Kingdom,GB,1.0,1975-01-01,2734.0,...,221.0,United Kingdom,GB,GB,1.0,False,,,Rock,power pop
3,2379911,2133721,Vi är såå lyckliga som bara olyckliga människo...,2392476,202.0,Sweden,SE,1.0,2019-01-01,73159.0,...,202.0,Sweden,SE,SE,1.0,False,66.0,swedish,,rock music
4,2379910,2133720,L’Ingénieux romanesque,67276,73.0,France,FR,1.0,2009-01-01,67276.0,...,73.0,France,FR,FR,1.0,False,,,Jazz,jazz


In [69]:
#How much information did we retrieve in this last step?
main_df3.isnull().sum(axis=0)

release_id                 0
group_id                   0
release_group              4
credit_id                  0
release_area               0
release_area_name          0
release_ISO_code          15
release_code_type     228659
release_year               0
artist_id                151
artist_mbid              151
origin_code            80839
origin_name            80839
origin_ISO_code       107607
origin_ISO_country     80940
origin_code_type       81779
is_duplicated              0
tag_id                965243
tag_name              867739
main_genre            979223
artist_genre          944823
dtype: int64

So, according to the output above, 867,739 releases still have no tag name, down from 967,083: Wikidata has provided a tag for an extra 99,344 releases. The number of releases with no main genre is unchanged at 979,223.

As for the origin name, this step does not change it: 80,839 releases still have no origin name.

Now that we have gathered as much data as possible, we will export our main dataframe and keep polishing our information in the next notebook.