# <font color=red>DATA GATHERING II: MUSIC GENRES AND SUBGENRES</font>

## <font color=blue>1) iNSERT</font>

### Data from Musicbrainz.org

In [61]:
import pandas as pd
import numpy as np
import time
#import tqdm

## 5) Adding music genres to our dataframe

According to Musicbrainz's Genre description in https://wiki.musicbrainz.org/Genre:

"Genres are currently supported in MusicBrainz as part of the tag system.

Some tags (the ones in the genre list) are automatically read and presented as genres."

For our visualization we want each release's main genre and, where available, its subgenre. To that end, I copied Musicbrainz's "genre list" into a CSV file. Musicbrainz considers its 419 entries to be genres, but for this study we'll treat them as subgenres.

I have manually classified all of these subgenres into 14 categories or "Main genres":

- Blues
- Classical
- Country
- Electronic
- Folk
- Heavy Metal
- Hip Hop
- Jazz
- Latin
- Pop
- Punk
- Rhythm & Blues (R&B)
- Rock
- Others (This category contains all the subgenres I haven't been able to classify in the previous categories)

Of course, I wasn't familiar with every genre in the list. To classify the unfamiliar ones, I looked up their definition on Wikipedia and chose the main genre that fit best. If Wikipedia had no definition, I searched for the genre on Google and listened to a representative song before deciding.

In [85]:
#Let's see what the genres and subgenres look like:
genres = pd.read_csv('Musicbrainz/Tables_used/genres.csv',sep='\t', encoding='utf-8')
genres.head()

Unnamed: 0,Main_genre,Subgenre
0,Electronic,acid house
1,Electronic,acid jazz
2,Electronic,acid techno
3,Blues,acoustic blues
4,Rock,acoustic rock


As we read above, Musicbrainz's genre list (our subgenres) is part of its tag system. Let's import Musicbrainz's "tags" table and try to identify which of its elements are genres.

In [86]:
tags = pd.read_csv('Musicbrainz/Tables_used/tags.txt',sep='\t', header=None, engine='c', usecols=[0,1])
tags.columns = ['tag_id','tag_name']
tags.head()

Unnamed: 0,tag_id,tag_name
0,95,finnish
1,23,slovak
2,801,iowa
3,4,groundbreaking
4,130,taiwanese


In [87]:
#How many tags are there?
tags['tag_id'].nunique()

86806

In [88]:
#What do the tags look like?
tags.tag_name.value_counts()

rock & roll           2
alt rock              2
yanni                 2
rhythm & blues        2
little drummer boy    2
acustica uach         2
classical music       2
pop rock

As we can see, the tags list contains the genres but also other (more subjective) expressions that some users have chosen as representative for the music entity. 

We will add columns to this tags dataframe to distinguish which of them are actually genres/subgenres:

In [89]:
#First, we change the Subgenre column name to tag_name in our genre file, to be able to join both dataframes:
genres.rename(columns={'Subgenre':'tag_name'}, inplace=True)
tags_genres = pd.merge(tags, genres, how='left', on='tag_name')
tags_genres.head()

Unnamed: 0,tag_id,tag_name,Main_genre
0,95,finnish,
1,23,slovak,
2,801,iowa,
3,4,groundbreaking,
4,130,taiwanese,
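
To make the effect of the left merge easier to see, here is a minimal sketch on made-up data. The toy rows and the `_merge` indicator column are illustrative assumptions and are not part of the notebook's pipeline.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
tags = pd.DataFrame({"tag_id": [1, 2, 3],
                     "tag_name": ["finnish", "jazz", "acid house"]})
genres = pd.DataFrame({"Main_genre": ["Jazz", "Electronic"],
                       "tag_name": ["jazz", "acid house"]})

# how='left' keeps every tag; indicator=True adds a '_merge' column
# telling whether a matching row was found in `genres`
tags_genres = pd.merge(tags, genres, how="left", on="tag_name", indicator=True)

# Tags found in the genre list are marked 'both', the rest 'left_only'
n_genre_tags = int((tags_genres["_merge"] == "both").sum())
print(tags_genres)
print(n_genre_tags)
```

Tags that are not in the genre list keep a missing `Main_genre`, which is what the later steps rely on to tell genre tags from free-form tags.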


In [90]:
#Did we identify all the 419 genres in our dataframe?
pd.notna(tags_genres['Main_genre']).value_counts()

False    86380
True       426
Name: Main_genre, dtype: int64

In [91]:
#We retrieved 7 more, are there duplicates?
table = tags_genres.dropna(subset=['Main_genre'], axis=0).groupby('tag_name').count()
table[table['tag_id'] != 1]

tag_name,tag_id,Main_genre
alternative rock,2,2
hard rock,2,2
hip hop,2,2
indie rock,2,2
new age,2,2
pop punk,2,2
pop rap,2,2
pop rock,2,2
progressive rock,2,2
psychedelic rock,2,2


It seems that we have 12 subgenres repeated twice in our tags_genres dataframe. That means they probably have 2 different tag_id's each:

In [92]:
list_duplicates = table[table['tag_id'] != 1].index.tolist()
tags_genres[tags_genres['tag_name'].isin(list_duplicates)]

Unnamed: 0,tag_id,tag_name,Main_genre
13595,1182,pop rap,Hip Hop
14217,133,punk rock,Rock
14238,235,hip hop,Hip Hop
14373,7,rock,Rock
15338,1100,pop punk,Punk
15380,618,new age,Others
15534,29,progressive rock,Rock
16100,284,indie rock,Rock
16528,271,hard rock,Rock
16616,1091,pop rock,Rock


Indeed, each of them has two tag_ids, so we need to keep both in order not to lose information later on.
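
As a compact way to spot such cases, the sketch below counts distinct ids per tag name on a toy table. The tag_ids 900 and 901 are invented for the example.

```python
import pandas as pd

# Toy table: 'rock' and 'hip hop' each appear under two tag_ids (900/901 are made up)
tags_genres = pd.DataFrame({
    "tag_id": [7, 900, 235, 901, 71],
    "tag_name": ["rock", "rock", "hip hop", "hip hop", "jazz"],
    "Main_genre": ["Rock", "Rock", "Hip Hop", "Hip Hop", "Jazz"],
})

# Count distinct tag_ids per tag_name and keep the names with more than one
ids_per_name = tags_genres.groupby("tag_name")["tag_id"].nunique()
duplicated_names = sorted(ids_per_name[ids_per_name > 1].index)

# keep=False marks every row of a duplicated name, not only the repeats
dup_rows = tags_genres[tags_genres.duplicated(subset="tag_name", keep=False)]
print(duplicated_names)
print(dup_rows)
```

Because later merges are done on `tag_id`, keeping both rows means a release tagged with either id still resolves to the same main genre.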

Musicbrainz provides a table with all the release groups that have been tagged by its users. Next, we'll retrieve those tags and select the ones that are part of the genre list.

In [93]:
release_groups = pd.read_csv('Musicbrainz/Tables_used/release_group.txt',sep='\t', header=None, engine='c', usecols=[0,1,2,3])
release_groups.columns = ['group_id','group_mbid','release_group_name','artist_credit']
release_groups.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,627364
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,12
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11
3,28,c554da1a-c1aa-30c3-b0bb-44b1b837de33,Piece and Love,26
4,60,06729175-db17-3443-add7-921739a92762,Ultimate Alternative Wavers,44


In [94]:
release_groups['group_id'].nunique()

1745126

In [95]:
len(release_groups)

1745126

In [96]:
group_tag = pd.read_csv('Musicbrainz/Tables_used/release_group_tag.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
group_tag.columns = ['group_id','tag_id','tag_counts']
group_tag.head()

Unnamed: 0,group_id,tag_id,tag_counts
0,93688,150,1
1,906692,1371,1
2,906692,6948,1
3,617615,11,1
4,617615,545,1


In [97]:
#We can now merge the release groups with the tag ids and tag counts:
Table = pd.merge(release_groups, group_tag, how='left', on='group_id')
Table.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,627364,,
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,12,41017.0,2.0
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,1053.0,2.0
3,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,1230.0,1.0
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,71.0,3.0
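
The `tag_id` and `tag_counts` values now show a trailing `.0` because a left merge introduces missing values for untagged groups, and a plain integer column cannot hold them. A small sketch with toy data, assuming a pandas version that supports the nullable `Int64` dtype:

```python
import pandas as pd

# Toy tables: the last release group has no tags at all
release_groups = pd.DataFrame({"group_id": [12, 13, 1964563]})
group_tag = pd.DataFrame({"group_id": [12, 13],
                          "tag_id": [41017, 71],
                          "tag_counts": [2, 3]})

merged = pd.merge(release_groups, group_tag, how="left", on="group_id")
# The untagged group has no match, so tag_id holds NaN and becomes float64
float_dtype = str(merged["tag_id"].dtype)

# Nullable 'Int64' (capital I) keeps whole numbers and shows gaps as <NA>
merged["tag_id"] = merged["tag_id"].astype("Int64")
int_dtype = str(merged["tag_id"].dtype)
print(float_dtype, int_dtype)
print(merged)
```

The notebook keeps the float columns, which is harmless here because the ids are only used as merge keys.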


In [98]:
#And finally have our release groups associated with their genres:
release_group_genre = pd.merge(Table, tags_genres, how='left', on='tag_id')
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
0,1964563,f59da930-70ba-4992-a346-7ed2d8e3cda8,Wande,627364,,,,
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,12,41017.0,2.0,alternative/indie rock,
2,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,1053.0,2.0,swing,Jazz
3,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,1230.0,1.0,dixieland,
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,71.0,3.0,jazz,Jazz


Let's pause here and check one of the releases that has several genre tags associated with it. We'll use one of the most popular releases of all time: the album "Thriller", by the King of Pop, Michael Jackson.

In [99]:
release_group_genre[release_group_genre['group_mbid']=='f32fab67-77dd-3937-addc-9062e28e4c37']

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
1429052,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,7282.0,2.0,vendu,
1429053,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,642.0,2.0,disco,Electronic
1429054,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,7935.0,1.0,discothèque,
1429055,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,24521.0,0.0,80 s and 90 s pop,
1429056,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,1060.0,1.0,dance-pop,Electronic
1429057,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,303.0,3.0,funk,Others
1429058,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,11.0,0.0,electronic,Electronic
1429059,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,41021.0,2.0,club/dance,
1429060,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,76.0,1.0,dance,Electronic
1429061,61656,f32fab67-77dd-3937-addc-9062e28e4c37,Thriller,519,41027.0,3.0,contemporary r&b,R&B


As we can see, "Pop" is the most used tag for this group so we should keep it as the release's genre.

Since music genre is a very subjective feature, we'll try to be as "objective" as possible by letting the majority of the votes choose the subgenre and main genre of each release group.

To do so, we will sort the release_group_genre dataframe by tag count and keep the top tag for each release group.

In [100]:
#We sort by group_id and tag_counts:
release_group_genre.sort_values(['group_id','tag_counts'], ascending=[True,False], inplace=True)
release_group_genre.head()

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
312152,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,1,1186.0,2.0,acid rap,
312153,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,1,92310.0,1.0,oldest release group #2,
737291,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,4,1498.0,7.0,trip hop,Hip Hop
737302,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,4,12.0,6.0,downtempo,Electronic
737293,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,4,11.0,5.0,electronic,Electronic


In [101]:
#And now we can drop the duplicate group_ids, keeping the top tags:
release_group_genre.drop_duplicates(subset=['group_id'],keep='first', inplace=True)
release_group_genre.head(20)

Unnamed: 0,group_id,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
312152,2,e8bee759-9efc-35c2-93d7-09ace9123467,Eclectic Electric,1,1186.0,2.0,acid rap,
737291,4,8b6f133a-2fdf-3cc2-b84d-1c889adc0939,Blue Lines,4,1498.0,7.0,trip hop,Hip Hop
1756939,11,c6fe6a2b-0ed6-3d2c-b9ce-ddd5421a3452,Hot,11,71.0,3.0,jazz,Jazz
1,12,2b10653e-655d-34fe-9db4-77242d817a17,Chore of Enchantment,12,41017.0,2.0,alternative/indie rock,
4,13,0eac6659-d590-3eb7-8c13-ed8b3fdf4ef7,The Inevitable,11,71.0,3.0,jazz,Jazz
1206877,21,bdd77e94-7917-3aa4-97de-501c53b1d343,The Best of the Art of Noise: Art Works,20,11.0,2.0,electronic,Electronic
312321,24,555dce82-41bf-397a-a487-997b54bee515,Emusic: The Extreme Collection,1,,,,
5,28,c554da1a-c1aa-30c3-b0bb-44b1b837de33,Piece and Love,26,,,,
271,30,2c644807-3b5d-39d4-8c65-dec603bf3f3a,Let It Be,28,41017.0,1.0,alternative/indie rock,
1089677,37,857c3dff-efec-387e-8e07-5b6bdb746afa,Liz Story,1097111,507.0,1.0,piano,
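
Note that the top tag of a release group is not always a genre (for example "acid rap" or "piano" above have no main genre). The sketch below, on made-up data, contrasts the approach used here with a possible variant that only ranks tags belonging to the genre list. The helper `top_row_per_group` and the toy rows are hypothetical, and the variant is not what this notebook does.

```python
import pandas as pd

# Toy data: group 2's most voted tag is not in the genre list
rgg = pd.DataFrame({
    "group_id":   [2, 2, 4, 4],
    "tag_name":   ["acid rap", "hip hop", "trip hop", "downtempo"],
    "tag_counts": [2, 1, 7, 6],
    "Main_genre": [None, "Hip Hop", "Hip Hop", "Electronic"],
})

def top_row_per_group(df):
    # Highest tag_counts first within each group, then keep the first row
    ordered = df.sort_values(["group_id", "tag_counts"], ascending=[True, False])
    return ordered.drop_duplicates(subset=["group_id"], keep="first")

# Approach used in this notebook: top tag overall
top_any_tag = top_row_per_group(rgg)
# Variant: drop non-genre tags first, then take the top remaining tag
top_genre_tag = top_row_per_group(rgg.dropna(subset=["Main_genre"]))

print(top_any_tag)
print(top_genre_tag)
```

With the variant, release groups that have no genre tag at all disappear from the result, so they would have to be restored with a left merge before combining with the main dataframe.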


What we want now is to combine our main dataframe (which we exported at the end of part 4 of this notebook) with this new genre information we just retrieved:

In [102]:
#We open our main dataframe:
dataframe = pd.read_csv('dataframe.csv',sep='\t', encoding='utf-8')
dataframe.head()

Unnamed: 0,release_id,group_id_x,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name,origin_name,origin_code,group_id_y
0,2163750,1962329,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,Philadelphia,3.0,
1,1846605,1713833,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,[Worldwide],,1713833.0
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,France,1.0,1609358.0
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,Aix-en-Provence,3.0,
4,1772538,1656147,devil jokes,1653884,240.0,[Worldwide],,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,yzome,Seattle,3.0,


In [103]:
#Now, we will join our main dataframe with release_group_genre:

#1) We need to make small changes in dataframe for that:
dataframe.rename(columns={'group_id_x':'group_id'}, inplace=True)

#2) And now we can merge both:
main_df = pd.merge(dataframe, release_group_genre, how='left', on='group_id')
main_df.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,...,origin_name,origin_code,group_id_y,group_mbid,release_group_name,artist_credit,tag_id,tag_counts,tag_name,Main_genre
0,2163750,1962329,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,...,Philadelphia,3.0,,a4c06c86-d969-4f71-8bc3-85ecdc97e8de,,2205562.0,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,...,[Worldwide],,1713833.0,33da2e5d-c5ae-482c-b91a-ef85dcaf19f9,,1503027.0,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,...,France,1.0,1609358.0,310a41db-b741-43de-9f80-f89bff6428ed,Beaux Soirs De Paris,1324142.0,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,...,Aix-en-Provence,3.0,,fdc9dffd-d231-4d31-b4b3-9a892762e78d,Le 1,2291833.0,,,,
4,1772538,1656147,devil jokes,1653884,240.0,[Worldwide],,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,...,Seattle,3.0,,a522be5f-a3ce-4117-8b03-8016b6d57d0e,devil jokes,1653884.0,,,,


In [104]:
#For how many releases do we have the main genre now?
main_df.isnull().sum(axis=0)

release_id                  0
group_id                    0
release_group               7
credit_id                   0
release_area                0
release_area_name           0
release_code_type      228606
release_year                0
artist_id                   0
artist_mbid                 0
artist_name                 0
origin_name                 0
origin_code             50065
group_id_y             841903
group_mbid               1370
release_group_name       1374
artist_credit            1370
tag_id                1193250
tag_counts            1193250
tag_name              1194017
Main_genre            1240292
dtype: int64

In [105]:
len(main_df)

1362614

So, according to the above results, we have so far retrieved the genre for just 122,322 releases out of a total of 1,362,614 (only 9%).
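
The figures come straight from the two outputs above (the null count of `Main_genre` and `len(main_df)`). As a quick check of the arithmetic:

```python
total_releases = 1_362_614       # len(main_df)
missing_main_genre = 1_240_292   # nulls in the Main_genre column

# Releases with a genre = all releases minus those with a missing genre
with_genre = total_releases - missing_main_genre
share = with_genre / total_releases

print(with_genre)          # 122322
print(f"{share:.1%}")      # 9.0%
```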

To retrieve more genres, the next step is to get each artist's genre (the same way we did for the release groups) and add it to our main_df.

Note: by doing this, we are assuming that each band or artist always produces music of the same genre. This is not always accurate, especially at the subgenre level. In general, however, most bands and artists stay within the same musical line throughout their careers and can be assigned a single "Main genre". Again, this is an assumption we need to make in order to retrieve more information for this project.

For that, we'll first use Musicbrainz's artist_tag table and follow the same process as before.

In [106]:
artist_tag = pd.read_csv('Musicbrainz/Tables_used/artist_tag.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
artist_tag.columns = ['artist_id','tag_id','tag_counts']
artist_tag.head()

Unnamed: 0,artist_id,tag_id,tag_counts
0,468800,29,2
1,522545,63294,1
2,31390,173,1
3,108404,271,1
4,108404,7,1


In [107]:
#We merge it with the tags_genres dataframe:
temp = pd.merge(artist_tag, tags_genres, how='left', on='tag_id')
temp.head()

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre
0,468800,29,2,progressive rock,Rock
1,522545,63294,1,austrian composer,
2,31390,173,1,polish,
3,108404,271,1,hard rock,Rock
4,108404,7,1,rock,Rock


In [108]:
#We merge it with the artist dataframe (beginning of notebook), to see the names for each artist:
artist_genre = pd.merge(temp, artists[['artist_id','artist_name']], on='artist_id', how='left')
artist_genre.head()

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
0,468800,29,2,progressive rock,Rock,Citadel
1,522545,63294,1,austrian composer,,Robert Fuchs
2,31390,173,1,polish,,Behemoth
3,108404,271,1,hard rock,Rock,Blake
4,108404,7,1,rock,Rock,Blake


In [109]:
#We sort by artist_id and tag_counts:
artist_genre.sort_values(['artist_id','tag_counts'], ascending=[False,False], inplace=True)
artist_genre.head(20)

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
313038,1797176,133,1,punk rock,Rock,Spit
313037,1797175,235,1,hip hop,Hip Hop,MC Kresha
313035,1797174,68,1,melodic death metal,Heavy Metal,Spit
313036,1797174,94,1,metalcore,Heavy Metal,Spit
313025,1797170,49,1,instrumental,Others,Their Methlab
313026,1797170,7,1,rock,Rock,Their Methlab
313027,1797170,48394,1,heavy psych,,Their Methlab
313028,1797170,709,1,psychedelic rock,Rock,Their Methlab
313029,1797170,93,1,stoner rock,Rock,Their Methlab
313030,1797170,16,1,post-rock,Rock,Their Methlab


In [110]:
#And now we can drop the duplicate artist_ids, keeping the top tags:
artist_genre.drop_duplicates(subset=['artist_id'],keep='first', inplace=True)
artist_genre.head(20)

Unnamed: 0,artist_id,tag_id,tag_counts,tag_name,Main_genre,artist_name
313038,1797176,133,1,punk rock,Rock,Spit
313037,1797175,235,1,hip hop,Hip Hop,MC Kresha
313035,1797174,68,1,melodic death metal,Heavy Metal,Spit
313025,1797170,49,1,instrumental,Others,Their Methlab
313010,1797148,88,1,punk,Punk,School Damage
312981,1797111,652,1,producer,,LuxrayBeats
312982,1797110,583,1,actress,,Suzan Shine
312983,1797109,2345,1,dancer,,Helen Schmiedle
312984,1797108,2345,1,dancer,,Dioni Birmpili
312985,1797107,2345,1,dancer,,Natascha Böhler


In [111]:
#We add this new information into our main dataframe:
main_df2 = pd.merge(main_df, artist_genre, how='left', on='artist_id')
main_df2.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,...,artist_credit,tag_id_x,tag_counts_x,tag_name_x,Main_genre_x,tag_id_y,tag_counts_y,tag_name_y,Main_genre_y,artist_name_y
0,2163750,1962329,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,...,2205562.0,,,,,,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,...,1503027.0,,,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,...,1324142.0,,,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,...,2291833.0,,,,,,,,,
4,1772538,1656147,devil jokes,1653884,240.0,[Worldwide],,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,...,1653884.0,,,,,,,,,


In [112]:
main_df2.isnull().sum(axis=0)

release_id                  0
group_id                    0
release_group               7
credit_id                   0
release_area                0
release_area_name           0
release_code_type      228606
release_year                0
artist_id                   0
artist_mbid                 0
artist_name_x               0
origin_name                 0
origin_code             50065
group_id_y             841903
group_mbid               1370
release_group_name       1374
artist_credit            1370
tag_id_x              1193250
tag_counts_x          1193250
tag_name_x            1194017
Main_genre_x          1240292
tag_id_y               757860
tag_counts_y           757860
tag_name_y             759454
Main_genre_y          1042744
artist_name_y          757860
dtype: int64

In order to determine how much extra information we have retrieved in this last step, we need to combine all the genre-related information into a single column.

We will repeat the procedure we followed earlier for the origin columns: if the release has a Main genre, we leave it as is. If the value is missing, we fill it with the artist's Main genre.
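
The fill rule described above can be sketched on toy data. The `np.where` form mirrors the style used in this notebook; `fillna` with a second Series is shown as an equivalent pandas idiom. The sample values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Main_genre_x": ["Rock", None, None],    # genre from the release group
    "Main_genre_y": ["Pop", "Jazz", None],   # genre from the artist
})

# np.where form: take y only where x is missing, otherwise keep x
via_where = np.where(df["Main_genre_x"].isnull(),
                     df["Main_genre_y"], df["Main_genre_x"])

# Equivalent idiom: fill the gaps in x with the aligned values of y
via_fillna = df["Main_genre_x"].fillna(df["Main_genre_y"])

print(list(via_where))
print(list(via_fillna))
```

In both forms the release group's own genre wins whenever it exists, and rows where both sources are empty stay missing.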

In [113]:
main_df2.Main_genre_x = np.where(main_df2.Main_genre_x.isnull(), main_df2.Main_genre_y, main_df2.Main_genre_x)
main_df2.tag_id_x = np.where(main_df2.tag_id_x.isnull(), main_df2.tag_id_y, main_df2.tag_id_x)
main_df2.tag_name_x = np.where(main_df2.tag_name_x.isnull(), main_df2.tag_name_y, main_df2.tag_name_x)

In [114]:
main_df2.isnull().sum(axis=0)

release_id                  0
group_id                    0
release_group               7
credit_id                   0
release_area                0
release_area_name           0
release_code_type      228606
release_year                0
artist_id                   0
artist_mbid                 0
artist_name_x               0
origin_name                 0
origin_code             50065
group_id_y             841903
group_mbid               1370
release_group_name       1374
artist_credit            1370
tag_id_x               711056
tag_counts_x          1193250
tag_name_x             712688
Main_genre_x           979078
tag_id_y               757860
tag_counts_y           757860
tag_name_y             759454
Main_genre_y          1042744
artist_name_y          757860
dtype: int64

Not bad: we now have "only" 979,078 releases with no Main genre, so this step retrieved the genre for an extra 261,214 releases. In total, 383,536 releases now have genre information, which is 28% of our dataframe.

In [115]:
#We change the columns' names and delete extra columns:
main_df2.rename(columns={'artist_name_x':'artist_name','tag_id_x':'tag_id','tag_name_x':'tag_name','Main_genre_x':'Main_genre'}, inplace=True)
main_df2.drop(labels=['tag_counts_x','tag_id_y','tag_counts_y','tag_name_y','Main_genre_y','artist_name_y'], axis=1, inplace=True)

### Data from Wikidata Query with SPARQL

In [122]:
#We merge the dataframe with the tags_genres to retrieve tag_id and Main_genre:
wiki_genres = pd.merge(wiki_df, tags_genres, how='left', on='tag_name')
wiki_genres.head()

Unnamed: 0,artist_name,tag_name,artist_mbid,origin_name,tag_id,Main_genre
0,Wolfgang Amadeus Mozart,opera,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,480.0,Classical
1,Wolfgang Amadeus Mozart,classical music,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,2092.0,
2,Wolfgang Amadeus Mozart,classical music,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,54585.0,
3,Wolfgang Amadeus Mozart,symphony,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,2806.0,Classical
4,Wolfgang Amadeus Mozart,concerto,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,22331.0,


As some artists appear more than once (if they have more than one tag), we will sort them by artist_mbid and Main_genre and keep the first appearance for each artist.
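
The reason sorting by Main_genre helps is that `sort_values` places missing values last by default, so a classified row comes before unclassified ones for the same artist. A minimal sketch with made-up rows:

```python
import pandas as pd

wiki = pd.DataFrame({
    "artist_mbid": ["b", "a", "a", "b"],
    "tag_name":    ["concerto", "classical music", "opera", "symphony"],
    "Main_genre":  [None, None, "Classical", "Classical"],
})

# na_position='last' is the default; it is spelled out here for clarity
ordered = wiki.sort_values(["artist_mbid", "Main_genre"], na_position="last")

# The first row per artist is now one with a Main_genre, if any exists
first_per_artist = ordered.drop_duplicates(subset=["artist_mbid"], keep="first")
print(first_per_artist)
```

When an artist has several classified tags, the one kept is simply the first main genre in alphabetical order, since Wikidata provides no vote counts to rank them.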

In [123]:
#We sort by artist_mbid and Main_genre:
wiki_genres.sort_values(['artist_mbid','Main_genre'], inplace=True)
wiki_genres.head()

Unnamed: 0,artist_name,tag_name,artist_mbid,origin_name,tag_id,Main_genre
51691,La Niña de los Peines,flamenco,00010eb3-ebfe-4965-81ef-0ac64cd49fde,Seville,367.0,Folk
149832,Silvery,,000200d1-1176-4859-b39c-669bde26ecea,,32232.0,
149833,Silvery,,000200d1-1176-4859-b39c-669bde26ecea,,80586.0,
80726,Deidre McCalla,,00026532-1fe3-45fb-a0df-34aec04a1319,,32232.0,
80727,Deidre McCalla,,00026532-1fe3-45fb-a0df-34aec04a1319,,80586.0,


In [124]:
#And now we can drop the duplicate artist_mbids, keeping the top rows:
wiki_genres.drop_duplicates(subset=['artist_mbid'],keep='first', inplace=True)
wiki_genres.head()

Unnamed: 0,artist_name,tag_name,artist_mbid,origin_name,tag_id,Main_genre
51691,La Niña de los Peines,flamenco,00010eb3-ebfe-4965-81ef-0ac64cd49fde,Seville,367.0,Folk
149832,Silvery,,000200d1-1176-4859-b39c-669bde26ecea,,32232.0,
80726,Deidre McCalla,,00026532-1fe3-45fb-a0df-34aec04a1319,,32232.0,
122405,O Rappa,reggae,00034ede-a1f1-4219-be39-02f36853373e,,267.0,Others
60490,Natsumi Abe,J-pop,0003fd17-b083-41fe-83a9-d550bd4f00a1,Muroran,,


In [125]:
#Now we can input this new information into our main dataframe:
main_df3 = pd.merge(main_df2, wiki_genres[['tag_name','artist_mbid','origin_name','tag_id','Main_genre']], how='left', on='artist_mbid')
main_df3.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,...,group_mbid,release_group_name,artist_credit,tag_id_x,tag_name_x,Main_genre_x,tag_name_y,origin_name_y,tag_id_y,Main_genre_y
0,2163750,1962329,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,...,a4c06c86-d969-4f71-8bc3-85ecdc97e8de,,2205562.0,,,,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,...,33da2e5d-c5ae-482c-b91a-ef85dcaf19f9,,1503027.0,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,...,310a41db-b741-43de-9f80-f89bff6428ed,Beaux Soirs De Paris,1324142.0,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,...,fdc9dffd-d231-4d31-b4b3-9a892762e78d,Le 1,2291833.0,,,,,,,
4,1772538,1656147,devil jokes,1653884,240.0,[Worldwide],,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,...,a522be5f-a3ce-4117-8b03-8016b6d57d0e,devil jokes,1653884.0,,,,,,,


In [126]:
main_df3.isnull().sum(axis=0)

release_id                  0
group_id                    0
release_group               7
credit_id                   0
release_area                0
release_area_name           0
release_code_type      228606
release_year                0
artist_id                   0
artist_mbid                 0
artist_name                 0
origin_name_x               0
origin_code             50065
group_id_y             841903
group_mbid               1370
release_group_name       1374
artist_credit            1370
tag_id_x               711056
tag_name_x             712688
Main_genre_x           979078
tag_name_y             944819
origin_name_y         1026990
tag_id_y               873407
Main_genre_y          1111349
dtype: int64

In order to determine how much extra information we have retrieved in this last step, we need to combine all the genre and origin information into single columns.

We will repeat the procedure we followed in previous steps: if the release has a Main genre or origin name, we leave it as is. If the value is missing, we fill it with the data from Wikidata.

In [127]:
#First, let's see which type of information we have in wiki_genres as origin:
wiki_genres.origin_name.value_counts()

New York City    1273
Los Angeles      1209
London            995
Tokyo             750
Paris             440
Chicago           436
Brooklyn          390
Seoul             376
Philadelphia      348
Toronto           333
Berlin            324
Stockholm         323
San Francisco     289
Seattle           282
Boston            268
Moscow            267
California        266
Detroit           248
Montreal          248
Oslo              219
Rome              215
Helsinki

It looks like the origin info we got from Wikidata is more detailed than country level, so we'll use it in the cases where Musicbrainz's origin info is either null or only a country:

In [128]:
#Are there null values in the origin name?
wiki_genres.isnull().sum(axis=0)

artist_name        0
tag_name       40880
artist_mbid        0
origin_name    43548
tag_id          5453
Main_genre     65836
dtype: int64

In [129]:
#Now we can input the origin data into the column origin_name_x:
main_df3.origin_name_x = np.where(main_df3.origin_code.isnull(), main_df3.origin_name_y, main_df3.origin_name_x)
main_df3.origin_name_x = np.where(np.logical_and(main_df3.origin_code.isin([1]), main_df3.origin_name_y.notnull()), main_df3.origin_name_y, main_df3.origin_name_x)
main_df3.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,...,group_mbid,release_group_name,artist_credit,tag_id_x,tag_name_x,Main_genre_x,tag_name_y,origin_name_y,tag_id_y,Main_genre_y
0,2163750,1962329,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,...,a4c06c86-d969-4f71-8bc3-85ecdc97e8de,,2205562.0,,,,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,...,33da2e5d-c5ae-482c-b91a-ef85dcaf19f9,,1503027.0,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,...,310a41db-b741-43de-9f80-f89bff6428ed,Beaux Soirs De Paris,1324142.0,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,...,fdc9dffd-d231-4d31-b4b3-9a892762e78d,Le 1,2291833.0,,,,,,,
4,1772538,1656147,devil jokes,1653884,240.0,[Worldwide],,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,...,a522be5f-a3ce-4117-8b03-8016b6d57d0e,devil jokes,1653884.0,,,,,,,


In [130]:
main_df3.isnull().sum(axis=0)

release_id                  0
group_id                    0
release_group               7
credit_id                   0
release_area                0
release_area_name           0
release_code_type      228606
release_year                0
artist_id                   0
artist_mbid                 0
artist_name                 0
origin_name_x           49524
origin_code             50065
group_id_y             841903
group_mbid               1370
release_group_name       1374
artist_credit            1370
tag_id_x               711056
tag_name_x             712688
Main_genre_x           979078
tag_name_y             944819
origin_name_y         1026990
tag_id_y               873407
Main_genre_y          1111349
dtype: int64

In [137]:
#Now we input the genre & tag information for the rows where Main_genre_x is null.
#Note: Main_genre_x is updated by the first line, so the two lines below are evaluated
#against the updated column and only fill tags for rows whose genre is still missing:
main_df3.Main_genre_x = np.where(np.logical_and(main_df3.Main_genre_x.isnull(),main_df3.Main_genre_y.notnull()), main_df3.Main_genre_y, main_df3.Main_genre_x)
main_df3.tag_id_x = np.where(np.logical_and(main_df3.Main_genre_x.isnull(),main_df3.tag_id_y.notnull()), main_df3.tag_id_y, main_df3.tag_id_x)
main_df3.tag_name_x = np.where(np.logical_and(main_df3.Main_genre_x.isnull(),main_df3.tag_name_y.notnull()), main_df3.tag_name_y, main_df3.tag_name_x)
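
In the cell above, Main_genre_x is overwritten before the tag columns are processed, so the later conditions see the updated column. If the intent is to fill genre and tag from the same rows, one option is to compute the mask once beforehand. The following is only a sketch of that variant on toy data, not the code that produced the counts shown in this notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Main_genre_x": [None, None, "Rock"],
    "tag_name_x":   [None, "piano", "rock"],
    "Main_genre_y": ["Classical", None, "Pop"],
    "tag_name_y":   ["opera", "concerto", "pop"],
})

# Decide once which rows lack a genre, before any column is overwritten
needs_fill = df["Main_genre_x"].isnull()

for col in ["Main_genre", "tag_name"]:
    x, y = f"{col}_x", f"{col}_y"
    # Within the flagged rows, copy y only where it actually has a value
    df[x] = np.where(needs_fill & df[y].notnull(), df[y], df[x])

print(df[["Main_genre_x", "tag_name_x"]])
```

Here the first row receives both the genre and the tag from the `_y` columns, while the third row, which already had a genre, is left untouched.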

In [138]:
#And now we can delete and rename some columns:
main_df3.rename(columns={'origin_name_x':'origin_name','tag_id_x':'tag_id','tag_name_x':'tag_name','Main_genre_x':'main_genre'}, inplace=True)
main_df3.drop(labels=['group_id_y','group_mbid','release_group_name','artist_credit','tag_name_y','origin_name_y','tag_id_y','Main_genre_y'], axis=1, inplace=True)
main_df3.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name,origin_name,origin_code,tag_id,tag_name,main_genre
0,2163750,1962329,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,Philadelphia,3.0,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,France,1.0,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,Aix-en-Provence,3.0,,,
4,1772538,1656147,devil jokes,1653884,240.0,[Worldwide],,2016-01-01,1363025.0,c941ad72-8b13-4940-8d99-0ed9becad2d7,yzome,Seattle,3.0,,,


In [139]:
#How much information did we retrieve in this last step?
main_df3.isnull().sum(axis=0)

release_id                0
group_id                  0
release_group             7
credit_id                 0
release_area              0
release_area_name         0
release_code_type    228606
release_year              0
artist_id                 0
artist_mbid               0
artist_name               0
origin_name           49524
origin_code           50065
tag_id               612242
tag_name             660762
main_genre           863174
dtype: int64

So, according to the above, we now have 863,174 releases with no musical genre: Wikidata has provided genre information for an extra 115,904 releases.

As for the origin name, 49,524 releases now have no origin name, compared with the 50,065 releases whose origin_code was missing before this step.
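
To see why missing origin names appear at all, the two-step origin rule used earlier can be replayed on toy rows. The sample values, including "Tacoma", are made up; the sketch assumes, as in this notebook, that origin_code 1 marks a country-level origin.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "origin_name_x": ["[Worldwide]", "[Worldwide]", "France", "Seattle"],
    "origin_code":   [np.nan, np.nan, 1.0, 3.0],
    "origin_name_y": [None, "Oslo", "Paris", "Tacoma"],   # from Wikidata
})

# Rule 1: no origin_code -> take the Wikidata value, even when it is missing
step1 = np.where(df["origin_code"].isnull(),
                 df["origin_name_y"], df["origin_name_x"])

# Rule 2: country-level origin (code 1) -> take the Wikidata value if present
use_wiki = df["origin_code"].isin([1]) & df["origin_name_y"].notnull()
result = pd.Series(np.where(use_wiki, df["origin_name_y"], step1))

print(list(result))
```

The first row ends up with a missing origin name because neither source has a usable value, which matches the way rows without an origin_code are counted above.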

Now that we have gathered as much data as possible, we will export our main dataframe and keep polishing our information in the next notebook.

In [141]:
main_df3.to_csv('main_dataframe.csv', sep=',', index=None, encoding='utf-8')