# <font color=red>DATA GATHERING I: MUSIC RELEASES AND THEIR GEOGRAPHICAL ORIGIN</font>

## <font color=blue>1) Artist information</font>

### Data from Musicbrainz.org

In [61]:
import pandas as pd
import numpy as np
import reverse_geocoder #pip install reverse geocoder in console
import time
#import tqdm

In [62]:
artists= pd.read_csv('Musicbrainz/Tables_used/artist.txt',sep='\t', header=None, engine='c', usecols=[0,1,2,11,17])
artists.columns = ['artist_id','artist_mbid','artist_name','start_area1', 'start_area2']
artists.head()

Unnamed: 0,artist_id,artist_mbid,artist_name,start_area1,start_area2
0,805192,8972b1c1-6482-4750-b51f-596d2edea8b1,WIK▲N,,
1,371203,49add228-eac5-4de8-836c-d75cde7369c3,Pete Moutso,,
2,273232,c112a400-af49-4665-8bba-741531d962a1,Zachary,,
3,101060,ca3f3ee1-c4a7-4bac-a16a-0b888a396c6b,The Silhouettes,222.0,7707.0
4,145773,7b4a548e-a01a-49b7-82e7-b49efeb9732c,Aric Leavitt,,


In [63]:
#Let's see how many artists we have:
artists['artist_id'].nunique()

1476425

In [64]:
#How much info we have for each artist?
artists.isnull().sum(axis=0)

artist_id            0
artist_mbid          0
artist_name          8
start_area1     808442
start_area2    1274001
dtype: int64

What are the "start_area1" and "start_area2"? If we look at Musicbrainz's field description for each artist (https://musicbrainz.org/doc/Artist), we can see that:

Area: The artist area, as the name suggests, indicates the area with which an artist is primarily identified with. It is often, but not always, its birth/formation country.

We will keep this information as the artist's origin for later.

We need to incorporate as well the table called "artist credit", which gives us the artist credit_id. We will use this field to join later on each release with its artist:

In [65]:
artists_credit= pd.read_csv('Musicbrainz/Tables_used/artist_credit_name.txt',sep='\t', header=None, engine='c', usecols=[0,2,3])
artists_credit.columns = ['credit_id','artist_id','artist_name']
artists_credit.head()

Unnamed: 0,credit_id,artist_id,artist_name
0,578352,578352,Gustav Ruppke
1,273232,273232,Zachary
2,153193,153193,The High Level Ranters
3,32262,32262,Georges Brassens
4,1389968,1171184,Harvard of the South


In [66]:
#Let's join the artists with their credit id and verify that the matching is good:
df = pd.merge(artists, artists_credit, how='left', on='artist_id')
df.head()

Unnamed: 0,artist_id,artist_mbid,artist_name_x,start_area1,start_area2,credit_id,artist_name_y
0,805192,8972b1c1-6482-4750-b51f-596d2edea8b1,WIK▲N,,,822846.0,WIK▲N
1,371203,49add228-eac5-4de8-836c-d75cde7369c3,Pete Moutso,,,,
2,273232,c112a400-af49-4665-8bba-741531d962a1,Zachary,,,273232.0,Zachary
3,101060,ca3f3ee1-c4a7-4bac-a16a-0b888a396c6b,The Silhouettes,222.0,7707.0,101060.0,The Silhouettes
4,145773,7b4a548e-a01a-49b7-82e7-b49efeb9732c,Aric Leavitt,,,145773.0,Aric Leavitt


In [67]:
#It looks like it makes sense. Please note that the credit id is sometimes equal to the artist_id, but not always:
df['check'] = df['artist_id'] - df['credit_id']
df['check'].nunique()

1270628

In [68]:
df.isnull().sum(axis=0)

artist_id              0
artist_mbid            0
artist_name_x         15
start_area1      1120376
start_area2      2109027
credit_id         461241
artist_name_y     461253
check             461241
dtype: int64

In [69]:
#We can now get rid of check and the duplicate artist_name column:
df.drop(labels=['check','artist_name_y'], axis=1, inplace=True)
df.head()

Unnamed: 0,artist_id,artist_mbid,artist_name_x,start_area1,start_area2,credit_id
0,805192,8972b1c1-6482-4750-b51f-596d2edea8b1,WIK▲N,,,822846.0
1,371203,49add228-eac5-4de8-836c-d75cde7369c3,Pete Moutso,,,
2,273232,c112a400-af49-4665-8bba-741531d962a1,Zachary,,,273232.0
3,101060,ca3f3ee1-c4a7-4bac-a16a-0b888a396c6b,The Silhouettes,222.0,7707.0,101060.0
4,145773,7b4a548e-a01a-49b7-82e7-b49efeb9732c,Aric Leavitt,,,145773.0


## <font color=blue>2) Release information</font>

### Data from Musicbrainz.org

The objective of this project is to visualize when each artist released for the first time a certain CD/Album/Single etc.

If we look at the "releases" table:

In [70]:
releases = pd.read_csv('Musicbrainz/Tables_used/release.txt',sep='\t', header=None, engine='c', usecols=[0,2,3,4])
releases.columns = ['release_id','release_group','credit_id','group_id']
releases.head()

Unnamed: 0,release_id,release_group,credit_id,group_id
0,9,A Sorta Fairytale,60,896742
1,10,A Sorta Fairytale,60,896742
2,11,Glory of the 80's,60,95360
3,12,Silent All These Years,60,104189
4,26,Demons,20211,94299


We can see, in the first 2 rows, that the same CD/Album can be released/remastered many times. According to Musicbrainz's field description for each release (https://musicbrainz.org/doc/Release):

"A MusicBrainz release represents the unique release (i.e. issuing) of a product on a specific date with specific release information such as the country, label, barcode and packaging. If you walk into a store and purchase an album or single, they are each represented in MusicBrainz as one release".

If we look at another release-related field in Musicbrainz, we find the "release group" (https://musicbrainz.org/doc/Release_Group):

"A release group, just as the name suggests, is used to group several different releases into a single logical entity. Every release belongs to one, and only one release group.

Both release groups and releases are "albums" in a general sense, but with an important difference: a release is something you can buy as media such as a CD or a vinyl record, while a release group embraces the overall concept of an album -- it doesn't matter how many CDs or editions/versions it had."

By reading these descriptions, we can clearly see that the release group is the table we are looking for as it represents a single creation, no matter how many times it has been edited or released afterwards. So we will have to keep the first release id for each release group.

In [71]:
release_country = pd.read_csv('Musicbrainz/Tables_used/release_country.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
release_country.columns = ['release_id','area_id','release_year']
release_country.head()

Unnamed: 0,release_id,area_id,release_year
0,3,81,1997.0
1,1427792,107,2014.0
2,9,81,2002.0
3,10,221,2002.0
4,11,81,1999.0


In [72]:
df2 = pd.merge(releases, release_country, how='left', on='release_id')
df2.head()

Unnamed: 0,release_id,release_group,credit_id,group_id,area_id,release_year
0,9,A Sorta Fairytale,60,896742,81.0,2002.0
1,10,A Sorta Fairytale,60,896742,221.0,2002.0
2,11,Glory of the 80's,60,95360,81.0,1999.0
3,12,Silent All These Years,60,104189,81.0,1997.0
4,26,Demons,20211,94299,107.0,1998.0


In [73]:
#Let's see how many releases we have:
df2['release_id'].nunique()

2198457

In [74]:
df2.isnull().sum(axis=0)

release_id            0
release_group         7
credit_id             0
group_id              0
area_id          287376
release_year     341983
dtype: int64

In [75]:
#We want to keep only the releases which have a release year, so we can drop the others:
df2.dropna(subset=['release_year'], axis=0, inplace=True)
df2['release_year'] = df2.release_year.astype(int,inplace=True)
df2['release_id'].nunique()

1859982

In [76]:
#Let's analyze the year column:
pd.options.display.max_rows = 2000
df2.groupby('release_year').count()

Unnamed: 0_level_0,release_id,release_group,credit_id,group_id,area_id
release_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,2,2,2,2,2
4,1,1,1,1,1
5,5,5,5,5,5
7,1,1,1,1,1
8,2,2,2,2,2
10,3,3,3,3,3
14,1,1,1,1,1
17,4,4,4,4,4
18,1,1,1,1,1
19,3,3,3,3,3


By looking at the different year values, and, in order to have enough values per year, we could drop the rows whose year is below 1890 and above 2019. Our visualization would have 130 years, which is pretty good.

In [77]:
df2.drop(df2[df2['release_year'] < 1890].index , inplace=True)
df2.drop(df2[df2['release_year'] >2019].index , inplace=True)
df2.sort_values(by=['release_year']).head()

Unnamed: 0,release_id,release_group,credit_id,group_id,area_id,release_year
1266766,386919,Visions of Paradise Waltz,97546,712605,222.0,1890
1266956,386830,German Ballad with Variations,97546,712514,222.0,1890
1266958,386829,German Ballad with Variations,97546,712514,222.0,1890
1266960,386828,Mountain Bells Polka,97546,712513,222.0,1890
1266961,386827,Mountain Bells Polka,97546,712513,222.0,1890


In [78]:
#Converting the year column to datetime for later:
df2['release_year'] = pd.to_datetime(df2['release_year'].astype(str), format='%Y')
df2.dtypes

release_id                int64
release_group            object
credit_id                 int64
group_id                  int64
area_id                 float64
release_year     datetime64[ns]
dtype: object

In [79]:
#We sort by release id and year (we could have 2 release groups with the same name but produced by different artists):
df2.sort_values(['release_group','release_year','credit_id'], ascending=[True,True,True], inplace=True)
df2.head()

Unnamed: 0,release_id,release_group,credit_id,group_id,area_id,release_year
2026273,2163750,,2205562,1962329,240.0,2014-01-01
1648516,1846605,,1503027,1713833,240.0,2015-01-01
1250325,1714060,Beaux Soirs De Paris,1324142,1609358,73.0,1995-01-01
2116340,2265346,Le 1,2291833,2042812,240.0,2018-01-01
1748061,1895266,M2Music HitDisc Vol. 1,1,1751021,222.0,2006-01-01


In [80]:
df2[df2['release_group'] == 'Artaxerxes']

Unnamed: 0,release_id,release_group,credit_id,group_id,area_id,release_year
1836724,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01
1910376,2379252,Artaxerxes,2392005,2132682,221.0,2009-01-01
1909444,2379244,Artaxerxes,2392011,2133192,222.0,2011-01-01


In [81]:
#Now we can delete the duplicate releases and keep the ones who were first released:
df2.drop_duplicates(subset=['release_group','credit_id'],keep='first', inplace=True)
df2['release_id'].nunique()

1499614

In [82]:
#Just to double-check:
df2[df2['release_group'] == 'Artaxerxes']

Unnamed: 0,release_id,release_group,credit_id,group_id,area_id,release_year
1836724,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01
1909444,2379244,Artaxerxes,2392011,2133192,222.0,2011-01-01


## <font color=blue>3) Matching releases with artists</font>

Now that we have both artist and releases dataframes, we can join them:

In [85]:
df3 = pd.merge(df2, df, how='left', on='credit_id')
df3.head()

Unnamed: 0,release_id,release_group,credit_id,group_id,area_id,release_year,artist_id,artist_mbid,artist_name_x,start_area1,start_area2
0,2163750,,2205562,1962329,240.0,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,
1,1846605,,1503027,1713833,240.0,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,
2,1714060,Beaux Soirs De Paris,1324142,1609358,73.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,
3,2265346,Le 1,2291833,2042812,240.0,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,
4,1895266,M2Music HitDisc Vol. 1,1,1751021,222.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,


In [25]:
df3.isnull().sum(axis=0)

release_id            0
release_group         4
credit_id             0
group_id              0
area_id               0
release_year          0
artist_id           151
artist_mbid         151
artist_name_x       155
start_area1      430452
start_area2      959581
dtype: int64

In [26]:
df3['release_id'].nunique()

1499614

In [27]:
len(df3)

1724524

In [28]:
df3[df3['release_group']=='Artaxerxes']

Unnamed: 0,release_id,release_group,credit_id,group_id,area_id,release_year,artist_id,artist_mbid,artist_name_x,start_area1,start_area2
119493,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,391603.0,e3062782-ab7b-41bc-8e65-aeea16dc1a89,Ian Partridge,221.0,1178.0
119494,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,124232.0,4e7f1926-8704-4545-a1a1-ded91651c884,Thomas Arne,221.0,1178.0
119495,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,688791.0,f34e9da4-2ee7-4f27-aa34-adc5db791bec,Christopher Robson,,
119496,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,1129787.0,c33f733e-2bf4-402b-9455-1a293601a1cd,Patricia Spence,,
119497,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,1104538.0,5680c729-615b-47e2-969e-27a087c572fb,Philippa Hyde,221.0,
119498,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,402986.0,70af5d9a-c6e0-4fcf-9cde-4d3d00e0fcb0,The Parley of Instruments,221.0,1178.0
119499,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,183632.0,954d1c83-259f-4a25-8878-10c19bb097af,Catherine Bott,221.0,
119500,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,87510.0,857588a5-b7aa-4f72-a87b-8f03dca60e30,Roy Goodman,221.0,30926.0
119501,2378622,Artaxerxes,2392005,2132682,240.0,1996-01-01,1078968.0,93da7aaa-250b-46e1-b5ef-0ad78d46dc3f,Richard Edgar‐Wilson,,
119502,2379244,Artaxerxes,2392011,2133192,222.0,2011-01-01,854064.0,a87f2b39-84c7-4888-935c-d41943bd7971,Classical Opera Company,221.0,


If we look at the above, we can see that there is one line per each artist that participated for each release ID.

As we don't want to show duplicate releases, we need to keep only one artist per release. We will keep the first artist appearing for each release (even though we know this is not 100% accurate, but we have to avoid duplicates). This will afftect 224.910 rows under a total of 1.499.614 unique releases, so 14% of our dataset.

In [29]:
#Now we can delete the duplicate releases and keep the ones who were first released:
df3.drop_duplicates(subset=['release_id'],keep='first', inplace=True)
df3['release_id'].nunique()

1499614

In [30]:
len(df3)

1499614

## <font color=blue>4) Geographical data</font>

### Data from Musicbrainz.org

The idea of the visualization is to see where each gender comes from, so, ideally, we would have to look at the artists origins (start area: last 2 columns of our dataframe).

In our dataframe df3, the 5th column "area_id" is related to the area where the release was produced. This isn't directly related to the origin of an artist/band, as many artists have to record their works in different countries/or areas.

Let's see for how many releases we have that information:

In [31]:
df3.isnull().sum(axis=0)

release_id            0
release_group         4
credit_id             0
group_id              0
area_id               0
release_year          0
artist_id           151
artist_mbid         151
artist_name_x       155
start_area1      404503
start_area2      876562
dtype: int64

In Musicbrainz's database, we have some tables related to the areas. Let's see how we can use them to input more geographical information into our dataframe:

In [14]:
areas = pd.read_csv('Musicbrainz/Tables_used/area.txt',sep='\t', header=None, engine='python', usecols=[0,2,3])
areas.columns = ['area_id','area_name','code_type']
areas.head()

Unnamed: 0,area_id,area_name,code_type
0,15449,Greccio,4.0
1,38,Canada,1.0
2,43,Chile,1.0
3,44,China,1.0
4,36,Cambodia,1.0


In [15]:
#Let's see the area types we have:
area_types = pd.read_csv('Musicbrainz/Tables_used/area_type.txt',sep='\t', header=None, engine='python', usecols=[1,3,4], error_bad_lines=False)
area_types.columns = ['type','code_type','definition']
area_types.head(20)

Unnamed: 0,type,code_type,definition
0,Country,1,Country is used for areas included (or previou...
1,Subdivision,2,Subdivision is used for the main administrativ...
2,County,7,County is used for smaller administrative divi...
3,Municipality,4,Municipality is used for small administrative ...
4,City,3,"City is used for settlements of any size, incl..."
5,District,5,District is used for a division of a large cit...
6,Island,6,Island is used for islands and atolls which do...


ISO tables: In order to retrieve the ISO code for the countries and states, Musicbrainz provides us with 2 tables which contain: area_id and their ISO code (for area code_types 1 and 2: country and subdivision). These are international standard codes set by the International organization for Standardization (www.iso.org).

We will add this information to our areas dataframe, as this will be usefull for our visualization.

In [16]:
#First, we load the first ISO file:
ISO1 = pd.read_csv('Musicbrainz/Tables_used/iso_3166_1.txt',sep='\t', header=None, engine='python')
ISO1.columns = ['area_id','ISO_code']

Note: as the only countries for which we would like to retrieve the subdivision are large countries (USA, Canada and Australia) for our visualization to work, we can at this early stage remove from ISO2 the rows not related to those countries:

In [17]:
#Loading ISO2 file:
ISO2 = pd.read_csv('Musicbrainz/Tables_used/iso_3166_2.txt',sep=',', header=None, engine='python')
ISO2.columns = ['area_id','ISO_code', 'ISO_country']
ISO2_target = pd.concat([ISO2[ISO2['ISO_country'] == 'CA'],ISO2[ISO2['ISO_country'] == 'US'],ISO2[ISO2['ISO_country'] == 'AU']])
ISO2_target.head()

Unnamed: 0,area_id,ISO_code,ISO_country
602,312,CA-AB,CA
603,313,CA-BC,CA
604,314,CA-MB,CA
605,315,CA-NB,CA
606,316,CA-NL,CA


In [18]:
#We drop the column ISO_country to concatenate:
ISO2_target.drop(labels='ISO_country', axis=1, inplace=True)

In [19]:
#Now, we can add both ISO dataframes together:
ISO_codes = pd.concat([ISO1, ISO2_target])
ISO_codes.head()

Unnamed: 0,area_id,ISO_code
0,1,AF
1,2,AL
2,3,DZ
3,4,AS
4,5,AD


In [20]:
#And finally, we can merge the ISO codes into the areas dataframe:
areas_ISO = pd.merge(areas, ISO_codes, how='left', on='area_id')
areas_ISO.head()

Unnamed: 0,area_id,area_name,code_type,ISO_code
0,15449,Greccio,4.0,
1,38,Canada,1.0,CA
2,43,Chile,1.0,CL
3,44,China,1.0,CN
4,36,Cambodia,1.0,KH


In [39]:
#Add the areas information to our main dataframe for the column "area_id":
df4 = pd.merge(df3, areas_ISO, how='left', on='area_id')
df4.head()

Unnamed: 0,release_id,release_group,credit_id,group_id,area_id,release_year,artist_id,artist_mbid,artist_name_x,start_area1,start_area2,area_name,code_type,ISO_code
0,2163750,,2205562,1962329,240.0,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,,[Worldwide],,XW
1,1846605,,1503027,1713833,240.0,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,[Worldwide],,XW
2,1714060,Beaux Soirs De Paris,1324142,1609358,73.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,France,1.0,FR
3,2265346,Le 1,2291833,2042812,240.0,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,,[Worldwide],,XW
4,1895266,M2Music HitDisc Vol. 1,1,1751021,222.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,United States,1.0,US


In [40]:
#Rearranging dataframe columns to have a clearer dataframe and do the next merging:
df4 = df4[['release_id','group_id','release_group','credit_id','area_id','area_name','ISO_code','code_type','release_year','artist_id','artist_mbid','artist_name_x','start_area1','start_area2']]
df4.rename(columns={'area_id':'release_area','area_name':'release_area_name','code_type':'release_code_type','ISO_code':'release_ISO_code','start_area1':'area_id'}, inplace=True)
df4.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,area_id,start_area2
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,


In [41]:
#Add the start area name and type to our main dataframe for the column "area id"(which was "start area 1" before):
df5 = pd.merge(df4, areas_ISO, how='left', on='area_id')
df5.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,area_id,start_area2,area_name,code_type,ISO_code
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,,Philadelphia,3.0,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,,Aix-en-Provence,3.0,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,


In [42]:
#Rearranging dataframe columns to have a clearer dataframe:
df5 = df5[['release_id','group_id','release_group','credit_id','release_area','release_area_name','release_ISO_code','release_code_type','release_year','artist_id','artist_mbid','artist_name_x','area_id','area_name','ISO_code','code_type','start_area2']]
df5.rename(columns={'area_id':'artist_area1','area_name':'artist_area_name1','ISO_code':'artist_ISO_code1','code_type':'artist_code_type1','start_area2':'area_id'}, inplace=True)
df5.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area1,artist_area_name1,artist_ISO_code1,artist_code_type1,area_id
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,Philadelphia,,3.0,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,Aix-en-Provence,,3.0,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,


In [43]:
#Add the start area 2 name and type to our main dataframe for the column "area id"(which was "start area 2" before):
df6 = pd.merge(df5, areas_ISO, how='left', on='area_id')
df6.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area1,artist_area_name1,artist_ISO_code1,artist_code_type1,area_id,area_name,code_type,ISO_code
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,Philadelphia,,3.0,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,Aix-en-Provence,,3.0,,,,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,,,,


In [44]:
#Renaming columns:
df6.rename(columns={'area_id':'artist_area2','area_name':'artist_area_name2','code_type':'artist_code_type2', 'ISO_code':'artist_ISO_code2'}, inplace=True)
df6.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area1,artist_area_name1,artist_ISO_code1,artist_code_type1,artist_area2,artist_area_name2,artist_code_type2,artist_ISO_code2
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,Philadelphia,,3.0,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,Aix-en-Provence,,3.0,,,,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,,,,


Now that we have the names of the different areas, let's check what kind of information we have in those columns.

As we said before, we prefer to keep the artist area preferably, as it represents more the real origin of the music.

1) Artist area 1:

In [45]:
df6.artist_area_name1.value_counts()

United States                                                 273545
United Kingdom                                                133070
Japan                                                          83908
Germany                                                        67463
France                                                         45927
Italy                                                          27215
Sweden                                                         24983
Canada                                                         23619
Finland                                                        21981
Netherlands                                                    18101
Australia                                                      17738
Spain                                                          16090
Russia                                                         13821
Brazil                                                         11142
Belgium                           

In [46]:
df6.artist_code_type1.value_counts()

1.0    954997
3.0    112299
2.0     24835
4.0      3058
5.0      2431
7.0       254
6.0       114
Name: artist_code_type1, dtype: int64

As we can see, the majority of the artists' start area type we have is related to countries. This would be good for our visualization except for big countries like USA, Canada or Australia, for which we would prefer to retrieve at least the artist's state/subdivision, to have a clearer view of the music's origin.

Also, we noticed that we have some area names that don't give us much information: "Worldwide", "Europe", "South Australia", etc.

2) Artist area 2:

In [47]:
df6.artist_area_name2.value_counts()

London                                         23087
Los Angeles                                    14173
New York                                       12554
Chicago                                         8353
Tokyo                                           7784
Paris                                           6395
Brooklyn                                        6307
Berlin                                          5941
Philadelphia                                    5276
Detroit                                         4659
San Francisco                                   4574
Toronto                                         4068
Boston                                          4036
Seattle                                         3938
Seoul                                           3800
Stockholm                                       3449
Melbourne                                       3308
Hamburg                                         3259
United Kingdom                                

In [48]:
df6.artist_code_type2.value_counts()

3.0    482663
2.0     61579
1.0     32265
5.0     25649
4.0     20562
7.0      2490
6.0       557
Name: artist_code_type2, dtype: int64

It looks like this second column could be giving us more detailed information about the artist's origin (only 31K rows have countries).

### Data from simplemaps.com

There is a free downloadable file in https://simplemaps.com/data/world-cities, which provides us with the names of the major cities in the world, as well as their country, subdivision and ISO code (for the country).

I have downloaded the csv version and we'll use it to classify our areas columns.

In [10]:
cities = pd.read_csv('worldcities.csv', sep=',', usecols=[1,4,5], encoding='utf-8')
cities.columns = ['area_name','country','country_ISO']
cities.head()

Unnamed: 0,area_name,country,country_ISO
0,Malisheve,Kosovo,XK
1,Prizren,Kosovo,XK
2,Zubin Potok,Kosovo,XK
3,Kamenice,Kosovo,XK
4,Viti,Kosovo,XK


Note: as we will merge the above dataframe later with our main dataframe using the "area_name" as key, we need to make sure that we remove the areas which have the same name, as we don't want to input the wrong information:

In [11]:
cities.drop_duplicates(subset='area_name', keep=False, inplace=True)

What we will do next is: 

1) For the rows whose artist_area_name1 is a country, we save it as the release origin. For the rows who have a subdivision, we keep only the ones related to USA, Canada or Australia (ie: the only ones who have area code =2 and ISO code not null, as we decided earlier when we generated the ISO2_target dataframe).

2) For the rows which don't have country or subdivision, we will match the artist_area_name1 with its country using the cities dataframe.


3) For the rows we don't have information with artist_area_name1, we will look into artist_area_name2 and repeat steps 1 & 2.

#### 1) Keeping the origin for rows who have already a country or subdivision in artist_area1:

In [51]:
# Adding the origin columns for the rows who have a subdivision (ie: USA, Canada and Australia):
df6['origin_code'] = np.nan
df6['origin_name'] = np.nan
df6['origin_ISO_code'] = np.nan
df6['origin_code_type'] = np.nan
df6.origin_code = np.where(np.logical_and(df6.artist_code_type1.isin([2]),df6.artist_ISO_code1.notnull()) , df6.artist_area1, df6.origin_code)
df6.origin_name = np.where(np.logical_and(df6.artist_code_type1.isin([2]),df6.artist_ISO_code1.notnull()) , df6.artist_area_name1, df6.origin_name)
df6.origin_ISO_code = np.where(np.logical_and(df6.artist_code_type1.isin([2]),df6.artist_ISO_code1.notnull()) , df6.artist_ISO_code1, df6.origin_ISO_code)
df6.origin_code_type = np.where(np.logical_and(df6.artist_code_type1.isin([2]),df6.artist_ISO_code1.notnull()) , df6.artist_code_type1, df6.origin_code_type)
df6.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,artist_ISO_code1,artist_code_type1,artist_area2,artist_area_name2,artist_code_type2,artist_ISO_code2,origin_code,origin_name,origin_ISO_code,origin_code_type
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,...,,3.0,,,,,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,...,,,,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,...,,,,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,...,,3.0,,,,,,,,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,...,,,,,,,,,,


In [52]:
#Now we add the information related to the country:
df6.origin_code = np.where(df6.artist_code_type1.isin([1]), df6.artist_area1, df6.origin_code)
df6.origin_name = np.where(df6.artist_code_type1.isin([1]), df6.artist_area_name1, df6.origin_name)
df6.origin_ISO_code = np.where(df6.artist_code_type1.isin([1]), df6.artist_ISO_code1, df6.origin_ISO_code)
df6.origin_code_type = np.where(df6.artist_code_type1.isin([1]), df6.artist_code_type1, df6.origin_code_type)

#### 2) Matching cities in artist_area_name1 with cities dataframe:

As we mentioned earlier, the merging between our main dataframe and the cities is going to be done by area_name. 

There are many cities in the world with the same name and we need to make sure that we don't take them into consideration, to avoid mismatching data.

We will analyze separately the artist_area_name1 items and match them with the cities dataframe (only for the not duplicated city names).

In [53]:
cities_to_match = df6[df6['artist_code_type1'] == 3]
cities_to_match.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,artist_ISO_code1,artist_code_type1,artist_area2,artist_area_name2,artist_code_type2,artist_ISO_code2,origin_code,origin_name,origin_ISO_code,origin_code_type
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,...,,3.0,,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,...,,3.0,,,,,,,,
5,1772538,1656147,devil jokes,1653884,240.0,[Worldwide],XW,,2016-01-01,1363025.0,...,,3.0,,,,,,,,
22,890832,426451,!!! En boka zerrada...,300620,194.0,Spain,ES,1.0,2001-01-01,300620.0,...,,3.0,,,,,,,,
54,2232738,2017354,!NADA!,1629941,240.0,[Worldwide],XW,,2015-01-01,1345721.0,...,,3.0,,,,,,,,


In [54]:
#We remove the duplicate city names:
cities_to_match.drop_duplicates(subset='artist_area_name1', keep=False, inplace=True)
#We remove the columns that we don't need:
columns = ['group_id', 'release_group', 'credit_id', 'release_area',
       'release_area_name', 'release_ISO_code', 'release_code_type',
       'release_year', 'artist_id', 'artist_mbid', 'artist_name_x',
       'artist_area1', 'artist_area2', 'artist_area_name2',
       'artist_code_type2', 'artist_ISO_code2', 'origin_code', 'origin_name',
       'origin_ISO_code', 'origin_code_type']
cities_to_match.drop(labels=columns, axis=1, inplace=True)

cities_to_match.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


Unnamed: 0,release_id,artist_area_name1,artist_ISO_code1,artist_code_type1
618,1960646,Barbières,,3.0
4981,2055697,Torreón,,3.0
5346,1320982,Mariupol,,3.0
5559,1891407,Eschwege,,3.0
6995,2313686,Haedo,,3.0


In [55]:
#And now we can do the merging with the cities dataframe:
cities_matched = pd.merge(cities_to_match, cities, how='left', left_on='artist_area_name1', right_on='area_name')
cities_matched.head(20)

Unnamed: 0,release_id,artist_area_name1,artist_ISO_code1,artist_code_type1,area_name,country,country_ISO
0,1960646,Barbières,,3.0,,,
1,2055697,Torreón,,3.0,,,
2,1320982,Mariupol,,3.0,Mariupol,Ukraine,UA
3,1891407,Eschwege,,3.0,,,
4,2313686,Haedo,,3.0,,,
5,1711186,Lutsk,,3.0,Lutsk,Ukraine,UA
6,1584076,Tomball,,3.0,Tomball,United States,US
7,1447250,Platteville,,3.0,Platteville,United States,US
8,1709159,Pinneberg,,3.0,,,
9,2318988,Piekary Śląskie,,3.0,,,


In [56]:
#Now we can input this information into our main dataframe:
df7 = pd.merge(df6, cities_matched, how='left', on='release_id')
df7.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,origin_code,origin_name,origin_ISO_code,origin_code_type,artist_area_name1_y,artist_ISO_code1_y,artist_code_type1_y,area_name,country,country_ISO
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,...,,,,,,,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,...,,,,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,...,,,,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,...,,,,,,,,,,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,...,,,,,,,,,,


In [57]:
#We input the info in the last columns into our origin columns:
df7.origin_code = np.where(df7.country_ISO.notnull(), df7.artist_area1, df7.origin_code)
df7.origin_name = np.where(df7.country_ISO.notnull(), df7.artist_area_name1_x, df7.origin_name)
df7.origin_ISO_code = np.where(df7.country_ISO.notnull(), df7.country_ISO, df7.origin_ISO_code)
df7.origin_code_type = np.where(df7.country_ISO.notnull(), df7.artist_code_type1_x, df7.origin_ISO_code)

In [58]:
#And now we can delete all the columns related to artist area1 and the ones generated in the last step:
remove = ['artist_area1', 'artist_area_name1_x', 'artist_ISO_code1_x','artist_code_type1_x','artist_area_name1_y',
       'artist_ISO_code1_y', 'artist_code_type1_y', 'area_name', 'country',
       'country_ISO']
df7.drop(labels=remove, axis=1, inplace=True)
df7.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area2,artist_area_name2,artist_code_type2,artist_ISO_code2,origin_code,origin_name,origin_ISO_code,origin_code_type
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,,,,,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,,,,,,,,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,,,,


In [59]:
#For how many releases do we have the origin ISO code so far?
df7.origin_ISO_code.isnull().value_counts()

False    965150
True     539852
Name: origin_ISO_code, dtype: int64

Not bad: we have 965.150 releases matched with their ISO code already (this means that we will be able to easily plot them with Tableau or any other visualization tool that supports ISO standards for geographical data).

We will follow now the last step: step 3

#### 3) For the rows we don't have information with artist_area_name1, we will look into artist_area_name2 and repeat steps 1 & 2.

In [60]:
#How much information do we have in the artist_code_type2 column?
df7.artist_code_type2.value_counts()

3.0    482663
2.0     61579
1.0     32265
5.0     25649
4.0     20562
7.0      2490
6.0       557
Name: artist_code_type2, dtype: int64

In [61]:
#We add the subdivision information from USA, Canada and Australia into the origin columns:
df7.origin_code = np.where(np.logical_and(df7.artist_code_type2.isin([2]),df7.artist_ISO_code2.notnull()) , df7.artist_area2, df7.origin_code)
df7.origin_name = np.where(np.logical_and(df7.artist_code_type2.isin([2]),df7.artist_ISO_code2.notnull()) , df7.artist_area_name2, df7.origin_name)
df7.origin_ISO_code = np.where(np.logical_and(df7.artist_code_type2.isin([2]),df7.artist_ISO_code2.notnull()) , df7.artist_ISO_code2, df7.origin_ISO_code)
df7.origin_code_type = np.where(np.logical_and(df7.artist_code_type2.isin([2]),df7.artist_ISO_code2.notnull()) , df7.artist_code_type2, df7.origin_code_type)

In [62]:
#Adding the country information for the rows that have their origin empty and artist_code_type2 = 1:
df7.origin_code = np.where(np.logical_and(df7.artist_code_type2.isin([1]), df7.origin_ISO_code.isnull()), df7.artist_area2, df7.origin_code)
df7.origin_name = np.where(np.logical_and(df7.artist_code_type2.isin([1]), df7.origin_ISO_code.isnull()), df7.artist_area_name2, df7.origin_name)
df7.origin_ISO_code = np.where(np.logical_and(df7.artist_code_type2.isin([1]), df7.origin_ISO_code.isnull()), df7.artist_ISO_code2, df7.origin_ISO_code)
df7.origin_code_type = np.where(np.logical_and(df7.artist_code_type2.isin([1]), df7.origin_ISO_code.isnull()), df7.artist_code_type2, df7.origin_code_type)

In [63]:
#We repeat step 2 for the cities in artist_area_name2:
cities_to_match2 = df7[df7['artist_code_type2'] == 3] #Select the cities to match
#We remove the duplicate city names:
cities_to_match2.drop_duplicates(subset='artist_area_name2', keep=False, inplace=True)
#We remove the columns that we don't need:
columns = ['group_id', 'release_group', 'credit_id', 'release_area',
       'release_area_name', 'release_ISO_code', 'release_code_type',
       'release_year', 'artist_id', 'artist_mbid', 'artist_name_x',
       'artist_area2', 'artist_code_type2',
       'artist_ISO_code2', 'origin_code', 'origin_name', 'origin_ISO_code',
       'origin_code_type']
cities_to_match2.drop(labels=columns, axis=1, inplace=True)
cities_to_match2.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Unnamed: 0,release_id,artist_area_name2
173,1474213,Roy
618,1960646,Barbières
3206,992765,Wilbraham
3907,1983355,Moorpark
5836,2213232,Asnières


In [64]:
#And now we can do the merging with the cities dataframe:
cities_matched2 = pd.merge(cities_to_match2, cities, how='left', left_on='artist_area_name2', right_on='area_name')
cities_matched2.head(20)

Unnamed: 0,release_id,artist_area_name2,area_name,country,country_ISO
0,1474213,Roy,Roy,United States,US
1,1960646,Barbières,,,
2,992765,Wilbraham,,,
3,1983355,Moorpark,Moorpark,United States,US
4,2213232,Asnières,,,
5,1610325,Serang,Serang,Indonesia,ID
6,529227,Los Mochis,Los Mochis,Mexico,MX
7,949508,Danvers,,,
8,614278,Dzerzhinsk,Dzerzhinsk,Russia,RU
9,118159,Ardee,,,


In [65]:
#Now we can input this information into our main dataframe:
df8 = pd.merge(df7, cities_matched2, how='left', on='release_id')
df8.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,...,artist_code_type2,artist_ISO_code2,origin_code,origin_name,origin_ISO_code,origin_code_type,artist_area_name2_y,area_name,country,country_ISO
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,...,,,,,,,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,...,,,,,,,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,...,,,,,,,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,...,,,,,,,,,,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,...,,,,,,,,,,


In [66]:
#We input the info in the last columns into our origin columns:
df8.origin_code = np.where(df8.country_ISO.notnull(), df8.artist_area2, df8.origin_code)
df8.origin_name = np.where(df8.country_ISO.notnull(), df8.artist_area_name2_x, df8.origin_name)
df8.origin_ISO_code = np.where(df8.country_ISO.notnull(), df8.country_ISO, df8.origin_ISO_code)
df8.origin_code_type = np.where(df8.country_ISO.notnull(), df8.artist_code_type2, df8.origin_ISO_code)

In [67]:
#And now we can delete all the columns related to artist area2 and the ones generated in the last step:
remove = ['artist_area2', 'artist_area_name2_x', 'artist_code_type2','artist_ISO_code2','artist_area_name2_y', 'area_name', 'country','country_ISO']
df8.drop(labels=remove, axis=1, inplace=True)
df8.head()

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,origin_code,origin_name,origin_ISO_code,origin_code_type
0,2163750,1962329,,2205562,240.0,[Worldwide],XW,,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,,,,
1,1846605,1713833,,1503027,240.0,[Worldwide],XW,,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,
2,1714060,1609358,Beaux Soirs De Paris,1324142,73.0,France,FR,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,
3,2265346,2042812,Le 1,2291833,240.0,[Worldwide],XW,,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,,,,
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,


In [68]:
#For how many releases do we have the origin ISO code so far, now?
df8.origin_ISO_code.isnull().value_counts()

False    971572
True     533430
Name: origin_ISO_code, dtype: int64

### Data from the 1 million songs dataset

Between 2011 and 2012, there was a Music Information Retrieval challenge called "Million Song Dataset". The majority of the data contained was provided by The Echo Nest (today known as Spotify).

At the bottom of the following website, there are links to download the Dataset:

https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset

As we won't use the whole dataset (just some of the tables), you don't need to download them: they will be attached in the repo.

In [69]:
artists_locations = pd.read_csv('1M_songs/artist_location.csv',sep='<SEP>', header=None, engine='python')

In [73]:
artists_locations.columns = ['artist_id','lat','long','artist_name','location_name']
artists_locations.head()

Unnamed: 0,artist_id,lat,long,artist_name,location_name
0,ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz
1,AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN"
2,ARHJJ771187FB5B581,51.59678,-0.33556,Screaming Lord Sutch,"Harrow, Middlesex, England"
3,ARJ8YLL1187FB3CA93,40.69626,-73.83301,Morton Gould,"Richmond Hill, NY"
4,ARYBAGV11ECC836DAC,43.58828,-79.64372,Crash Parallel,Mississauga


In [74]:
#How many artist are there?
artists_locations['artist_id'].nunique()

13850

In this dataset, there is also another table which provides us with some extra information, especially the artist Musicbrainz's id (which will be very helpful to make the link with our main dataframe later).

In [75]:
metadata = pd.read_csv('1M_songs/track_metadata.csv',sep=',', header=0, engine='python', usecols=['artist_id','artist_mbid'])
metadata.head()

Unnamed: 0,artist_id,artist_mbid
0,ARYZTJS1187B98C555,357ff05d-848a-44cf-b608-cb34b5701ae5
1,ARMVN3U1187FB3A1EB,8d7ef530-a6fd-4f8f-b2e2-74aec765e0f9
2,ARGEKB01187FB50750,3d403d44-36ce-465c-ad43-ae877e65adc4
3,ARNWYLR1187B9B2F9C,12be7648-7094-495f-90e6-df4189d68615
4,AREQDTE1269FB37231,


In [125]:
#We drop the rows withour artist_mbid (as we can't link them with our df)
metadata.dropna(subset=['artist_mbid'],axis=0, inplace=True)
#We merge artist_locations and metadata dataframes:
a = pd.merge(artists_locations,metadata,how='left',on='artist_id', copy=False)
a.dropna(subset=['artist_mbid'],axis=0, inplace=True)
a.head()

Unnamed: 0,artist_id,lat,long,artist_name,location_name,artist_mbid
0,ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz,0bd9755c-c86d-431c-bc28-ef908b8a9821
1,ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz,0bd9755c-c86d-431c-bc28-ef908b8a9821
2,AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN",d4620364-82ec-4c34-9265-a2b72dfa8e3e
3,AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN",d4620364-82ec-4c34-9265-a2b72dfa8e3e
4,AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN",d4620364-82ec-4c34-9265-a2b72dfa8e3e


In [126]:
#We get rid of the duplicate rows:
a.drop_duplicates(subset='artist_id', inplace=True)
a.head()

Unnamed: 0,artist_id,lat,long,artist_name,location_name,artist_mbid
0,ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz,0bd9755c-c86d-431c-bc28-ef908b8a9821
2,AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN",d4620364-82ec-4c34-9265-a2b72dfa8e3e
33,ARHJJ771187FB5B581,51.59678,-0.33556,Screaming Lord Sutch,"Harrow, Middlesex, England",e1079a78-75d4-4a1a-aef1-0be051386598
64,ARJ8YLL1187FB3CA93,40.69626,-73.83301,Morton Gould,"Richmond Hill, NY",4db4e744-3007-4386-b87d-9653acfe0464
78,ARYBAGV11ECC836DAC,43.58828,-79.64372,Crash Parallel,Mississauga,b0d85cf7-b73b-4a5d-bf31-a82493c3a8a8


As we can see above, the column "location_name" provides us with some geographical information but, for instance, in the first row, we don't really know the country where Santa Cruz is located.

Luckily, we have a pair of coordinates that we can use to retrieve more geographical detail for each row:

In [127]:
#We first create a new column called "coords" in which we'll gather both latitude and longitude:
a['coords'] = list(zip(a.lat, a.long))
coords = tuple(a['coords'].values.tolist())
#And now we use the reverse_geocoder utility to retrieve info for each pair of coordinates:
address = reverse_geocoder.search(coords)
a['address'] = address
a.head()

Unnamed: 0,artist_id,lat,long,artist_name,location_name,artist_mbid,coords,address
0,ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz,0bd9755c-c86d-431c-bc28-ef908b8a9821,"(-16.96595, -61.14804)","{'lat': '-16.43333', 'lon': '-60.9', 'name': '..."
2,AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN",d4620364-82ec-4c34-9265-a2b72dfa8e3e,"(46.44231, -93.36586)","{'lat': '46.53301', 'lon': '-93.71025', 'name'..."
33,ARHJJ771187FB5B581,51.59678,-0.33556,Screaming Lord Sutch,"Harrow, Middlesex, England",e1079a78-75d4-4a1a-aef1-0be051386598,"(51.596779999999995, -0.33555999999999997)","{'lat': '51.58342', 'lon': '-0.3386', 'name': ..."
64,ARJ8YLL1187FB3CA93,40.69626,-73.83301,Morton Gould,"Richmond Hill, NY",4db4e744-3007-4386-b87d-9653acfe0464,"(40.696259999999995, -73.83301)","{'lat': '40.68149', 'lon': '-73.83652', 'name'..."
78,ARYBAGV11ECC836DAC,43.58828,-79.64372,Crash Parallel,Mississauga,b0d85cf7-b73b-4a5d-bf31-a82493c3a8a8,"(43.58828, -79.64372)","{'lat': '43.5789', 'lon': '-79.6583', 'name': ..."


In [82]:
#Let's see how is this new infor formatted:
a['address'][0]

OrderedDict([('lat', '-16.43333'),
             ('lon', '-60.9'),
             ('name', 'Concepcion'),
             ('admin1', 'Santa Cruz'),
             ('admin2', ''),
             ('cc', 'BO')])

In [84]:
#What about the second row?
a['address'][2]

OrderedDict([('lat', '46.53301'),
             ('lon', '-93.71025'),
             ('name', 'Aitkin'),
             ('admin1', 'Minnesota'),
             ('admin2', 'Aitkin County'),
             ('cc', 'US')])

It looks like we would need the fields "admin1" and "cc" (which seems to contain the country ISO code). Let's extract that information for each row:

In [128]:
#We reset the index for the following loop to work:
a.reset_index(drop=True, inplace=True)

In [132]:
#We create 2 empty columns:
start = time.time()
a['state'] = np.nan
a['country_ISO'] = np.nan

#And fill them with the info we need:

for i in range(len(a)):
    address = list(a['address'][i].items())
    a['state'][i] = address[3][1]
    a['country_ISO'][i] = address[5][1]
end = time.time()
print(end-start)
#We check the result:
a.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()


3217.9600045681


Unnamed: 0,artist_id,lat,long,artist_name,location_name,artist_mbid,coords,address,state,country_ISO
0,ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz,0bd9755c-c86d-431c-bc28-ef908b8a9821,"(-16.96595, -61.14804)","{'lat': '-16.43333', 'lon': '-60.9', 'name': '...",Santa Cruz,BO
1,AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN",d4620364-82ec-4c34-9265-a2b72dfa8e3e,"(46.44231, -93.36586)","{'lat': '46.53301', 'lon': '-93.71025', 'name'...",Minnesota,US
2,ARHJJ771187FB5B581,51.59678,-0.33556,Screaming Lord Sutch,"Harrow, Middlesex, England",e1079a78-75d4-4a1a-aef1-0be051386598,"(51.596779999999995, -0.33555999999999997)","{'lat': '51.58342', 'lon': '-0.3386', 'name': ...",England,GB
3,ARJ8YLL1187FB3CA93,40.69626,-73.83301,Morton Gould,"Richmond Hill, NY",4db4e744-3007-4386-b87d-9653acfe0464,"(40.696259999999995, -73.83301)","{'lat': '40.68149', 'lon': '-73.83652', 'name'...",New York,US
4,ARYBAGV11ECC836DAC,43.58828,-79.64372,Crash Parallel,Mississauga,b0d85cf7-b73b-4a5d-bf31-a82493c3a8a8,"(43.58828, -79.64372)","{'lat': '43.5789', 'lon': '-79.6583', 'name': ...",Ontario,CA


Note: the above loop took 55 minutes to run aprox.

In [134]:
#We drop the unnecessary columns:
a.drop(labels=['artist_id','lat','long', 'artist_name','coords','address'], axis=1, inplace=True)

In [135]:
#We input the retrieved information into our main dataframe:
df9 = pd.merge(df8, a, how='left', on='artist_mbid')
df9.columns

Index(['release_id', 'group_id', 'release_group', 'credit_id', 'release_area',
       'release_area_name', 'release_ISO_code', 'release_code_type',
       'release_year', 'artist_id', 'artist_mbid', 'artist_name_x',
       'origin_code', 'origin_name', 'origin_ISO_code', 'origin_code_type',
       'location_name', 'state', 'country_ISO'],
      dtype='object')

The column country ISO will be used as origin_ISO_code for the rows where the country isn't US, Canada or Australia. For these 3 countries, we need to create an extra column in which we retrieve also the subdivision.

To do so, we can use our areas_ISO dataframe:

In [141]:
areas_ISO[areas_ISO['code_type'] == 2]

Unnamed: 0,area_id,area_name,code_type,ISO_code
25,1949,Borovnica,2.0,
74,1969,Dravograd,2.0,
113,2205,Agio Oros,2.0,
165,4695,Roma,2.0,
185,1950,Bovec,2.0,
221,1951,Brda,2.0,
222,306,Virginia,2.0,US-VA
228,261,Maryland,2.0,US-MD
230,2004,Litija,2.0,
231,2113,Mirna Peč,2.0,


In [142]:
df10 = pd.merge(df9, areas_ISO, how='left', left_on='state', right_on='area_name')
df10.columns

Index(['release_id', 'group_id', 'release_group', 'credit_id', 'release_area',
       'release_area_name', 'release_ISO_code', 'release_code_type',
       'release_year', 'artist_id', 'artist_mbid', 'artist_name_x',
       'origin_code', 'origin_name', 'origin_ISO_code', 'origin_code_type',
       'location_name', 'state', 'country_ISO', 'area_id', 'area_name',
       'code_type', 'ISO_code'],
      dtype='object')

Now we can do the final merge: 

- For the rows that have the column 'origin_ISO_code empty and 'ISO code' not empty: we keep the ISO code as origin_ISO_code (those will be rows belonging to USA, Canada and Australia.
- For the rows that have the column 'origin_ISO_code empty and 'country_ISO' not empty: we keep the country_ISO as origin_ISO code

In [144]:
#First for USA, Canada and Australia:
df10.origin_ISO_code = np.where(np.logical_and(df10.ISO_code.notnull(),df10.origin_ISO_code.isnull()),df10.ISO_code,df10.origin_ISO_code)
df10.origin_name = np.where(np.logical_and(df10.ISO_code.notnull(),df10.origin_ISO_code.isnull()),df10.area_name,df10.origin_name)
df10.origin_code_type = np.where(np.logical_and(df10.ISO_code.notnull(),df10.origin_ISO_code.isnull()),df10.code_type,df10.origin_code_type)
#Then, for the rest:
df10.origin_ISO_code = np.where(np.logical_and(df10.country_ISO.notnull(),df10.origin_ISO_code.isnull()),df10.country_ISO,df10.origin_ISO_code)
df10.origin_name = np.where(np.logical_and(df10.country_ISO.notnull(),df10.origin_ISO_code.isnull()),df10.area_name,df10.origin_name)
df10.origin_code_type = np.where(np.logical_and(df10.country_ISO.notnull(),df10.origin_ISO_code.isnull()),1,df10.origin_code_type)

In [145]:
#We can drop the unnecesary columns:
columns = ['location_name','state', 'country_ISO', 'area_id', 'area_name', 'code_type', 'ISO_code']
df10.drop(labels=columns, axis=1, inplace=True)

In [149]:
#Dropping duplicated lines after the merging:
df10.drop_duplicates(subset=['release_id'],keep='first', inplace=True)

In [150]:
#For how many releases do we have the origin ISO code now, after this last step?
df10.origin_ISO_code.isnull().value_counts()

False    977235
True     522379
Name: origin_ISO_code, dtype: int64

So, before this last step using data from 1 million songs, we had 971.572 releases with ISO code, now we have an extra 5663 releases with that info.

### Data from Wikidata Query with SPARQL

https://query.wikidata.org/

1) Musicians

SELECT ?musician ?musicianLabel ?genre ?genreLabel ?MusicBrainz_artist_ID ?place_of_birth ?place_of_birthLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?musician wdt:P106 wd:Q639669.
  OPTIONAL { ?musician wdt:P136 ?genre. }
  OPTIONAL { ?musician wdt:P434 ?MusicBrainz_artist_ID. }
  OPTIONAL { ?musician wdt:P19 ?place_of_birth. }
}


--> Export to csv file: query_wikidata_musicians.csv

2) Singers

SELECT ?musician ?musicianLabel ?genre ?genreLabel ?MusicBrainz_artist_ID ?place_of_birth ?place_of_birthLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?musician wdt:P106 wd:Q177220.
  OPTIONAL { ?musician wdt:P136 ?genre. }
  OPTIONAL { ?musician wdt:P434 ?MusicBrainz_artist_ID. }
  OPTIONAL { ?musician wdt:P19 ?place_of_birth. }
}

--> Export to csv file: query_wikidata_singers.csv

3) Bands

SELECT ?band ?bandLabel ?genre ?genreLabel ?MusicBrainz_artist_ID ?location_of_formation ?location_of_formationLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?band wdt:P31 wd:Q215380.
  OPTIONAL { ?band wdt:P136 ?genre. }
  OPTIONAL { ?band wdt:P434 ?MusicBrainz_artist_ID. }
  OPTIONAL { ?band wdt:P740 ?location_of_formation. }
}

--> Export to csv file: query_wikidata_bands.csv

In [3]:
#Open the files and load them into dataframes with the same column names (to match with our main dataframe later):
musicians = pd.read_csv('wikidata/query_wikidata_musicians.csv',sep=',', encoding='utf-8', usecols=[4,6])
musicians.rename(columns={'MusicBrainz_artist_ID':'artist_mbid','place_of_birthLabel':'origin_name'}, inplace=True)
singers = pd.read_csv('wikidata/query_wikidata_singers.csv',sep=',', encoding='utf-8', usecols=[4,6])
singers.rename(columns={'MusicBrainz_artist_ID':'artist_mbid','place_of_birthLabel':'origin_name'}, inplace=True)
bands = pd.read_csv('wikidata/query_wikidata_bands.csv',sep=',', encoding='utf-8', usecols=[4,6])
bands.rename(columns={'MusicBrainz_artist_ID':'artist_mbid','location_of_formationLabel':'origin_name'}, inplace=True)

In [152]:
bands.head()

Unnamed: 0,artist_mbid,origin_name
0,f26c72d3-e52c-467b-b651-679c73d8e1a7,Sacramento
1,f26c72d3-e52c-467b-b651-679c73d8e1a7,Sacramento
2,f26c72d3-e52c-467b-b651-679c73d8e1a7,Sacramento
3,a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432,Dublin
4,a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432,Dublin


In [4]:
#Now we can concatenate the 3 dataframes into one:
wiki_df = pd.concat([musicians, singers, bands])
wiki_df.head()

Unnamed: 0,artist_mbid,origin_name
0,,Cherbourg-en-Cotentin
1,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg
2,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg
3,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg
4,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg


In [154]:
len(wiki_df)

226926

In [5]:
#We can directly drop the rows which don't have a musicbrainz's id (we need and ID to join with our main df):
wiki_df.dropna(subset=['artist_mbid'], axis=0, inplace=True)

In [6]:
#Let's see how many artists we have:
wiki_df['artist_mbid'].nunique()

96110

In [7]:
#Drop duplicated artist_mbid:
wiki_df.drop_duplicates(subset=['artist_mbid'],keep='first', inplace=True)

In [8]:
#What kind of information do we have in the column origin_name?:
wiki_df.origin_name.value_counts()

New York City            1273
Los Angeles              1209
London                    995
Tokyo                     750
Paris                     440
Chicago                   436
Brooklyn                  390
Seoul                     376
Philadelphia              348
Toronto                   333
Berlin                    324
Stockholm                 323
San Francisco             289
Seattle                   282
Boston                    268
Moscow                    267
California                266
Montreal                  248
Detroit                   248
Oslo                      219
Rome                      215
Rio de Janeiro            212
Helsinki                  212
Vienna                    201
Istanbul                  200
Atlanta                   197
Liverpool                 196
Buenos Aires              188
Manchester                187
Nashville                 187
                         ... 
Niebüll                     1
Isernia                     1
Indian Hil

It looks like we have city names, so we can use our "cities" dataframe to retrieve the country ISO codes:

In [12]:
wiki_ISO = pd.merge(wiki_df, cities, how='left', left_on='origin_name', right_on='area_name')
wiki_ISO.head()

Unnamed: 0,artist_mbid,origin_name,area_name,country,country_ISO
0,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,Salzburg,Austria,AT
1,b58165ba-ac55-49a1-8855-caf16c68f5f2,Sète,,,
2,d135874d-9cae-4fef-97e3-36acbd9f5a26,Chicago,Chicago,United States,US
3,75167b8b-44e4-407b-9d35-effe87b223cf,Toronto,Toronto,Canada,CA
4,4b585938-f271-45e2-b19a-91c634b5e396,Bexleyheath,Bexleyheath,United Kingdom,GB


In order to retrieve the ISO codes for USA, Canada and Australia, we can use our areas_ISO dataframe too:

In [21]:
wiki_ISO1 = pd.merge(wiki_ISO, areas_ISO[['area_name','ISO_code']], how='left', on='area_name')
wiki_ISO1.head()

Unnamed: 0,artist_mbid,origin_name,area_name,country,country_ISO,ISO_code
0,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,Salzburg,Austria,AT,
1,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,Salzburg,Austria,AT,
2,b972f589-fb0e-474e-b64a-803b0364fa75,Salzburg,Salzburg,Austria,AT,
3,b58165ba-ac55-49a1-8855-caf16c68f5f2,Sète,,,,
4,d135874d-9cae-4fef-97e3-36acbd9f5a26,Chicago,Chicago,United States,US,


In [22]:
#Drop duplicated artist_mbid:
wiki_ISO1.drop_duplicates(subset=['artist_mbid'],keep='first', inplace=True)
#And delete the extra column "origin_name"
wiki_ISO1.drop(labels='origin_name', axis=1, inplace=True)

In [212]:
##CHECKPOINT: EXPORT DF10:
df10.to_csv('df10.csv', sep='\t', index=False, encoding='utf-8' )

In [23]:
#START FROM CHECKPOINT: IMPORT DF10
df10 = pd.read_csv('df10.csv', sep='\t', header=0, encoding='utf-8')

In [24]:
#And we can merge that information into our main dataframe:
df11 = pd.merge(df10, wiki_ISO1, how='left', on='artist_mbid')
df11.columns

Index(['release_id', 'group_id', 'release_group', 'credit_id', 'release_area',
       'release_area_name', 'release_ISO_code', 'release_code_type',
       'release_year', 'artist_id', 'artist_mbid', 'artist_name_x',
       'origin_code', 'origin_name', 'origin_ISO_code', 'origin_code_type',
       'area_name', 'country', 'country_ISO', 'ISO_code'],
      dtype='object')

Now, as we did in previous steps, we can input the information in these new columns into our origin columns (first, for the rows related to USA, Canada and Australia, then for the rest):

In [25]:
#First for USA, Canada and Australia:
df11.origin_ISO_code = np.where(np.logical_and(df11.ISO_code.notnull(),df11.origin_ISO_code.isnull()),df11.ISO_code,df11.origin_ISO_code)
df11.origin_name = np.where(np.logical_and(df11.ISO_code.notnull(),df11.origin_ISO_code.isnull()),df11.area_name,df11.origin_name)
df11.origin_code_type = np.where(np.logical_and(df11.ISO_code.notnull(),df11.origin_ISO_code.isnull()),2,df10.origin_code_type)
#Then, for the rest:
df11.origin_ISO_code = np.where(np.logical_and(df11.country_ISO.notnull(),df11.origin_ISO_code.isnull()),df11.country_ISO,df11.origin_ISO_code)
df11.origin_name = np.where(np.logical_and(df11.country_ISO.notnull(),df11.origin_ISO_code.isnull()),df11.area_name,df11.origin_name)
df11.origin_code_type = np.where(np.logical_and(df11.country_ISO.notnull(),df11.origin_ISO_code.isnull()),1,df11.origin_code_type)

In [26]:
#Drop unnecessary columns:
columns = ['area_name','country','country_ISO','ISO_code']
df11.drop(labels=columns, axis=1, inplace=True)

In [27]:
#For how many releases do we have the origin ISO code now, after this last step?
df11.origin_ISO_code.isnull().value_counts()

False    984158
True     515456
Name: origin_ISO_code, dtype: int64

We have retrieved the information for an extra 6.923 releases in this last step, thanks to Wikidata Query.

As we have retrieved geographical data from 4 different sources already, we need to analyze what we have left.

Who are the artists for which we don't have any origin information?

In [28]:
unknown_area = df11[df11['origin_ISO_code'].isnull()]

In [29]:
#How many artists are there?
unknown_area['artist_mbid'].nunique()

188359

So, according to the above line, we have 188.359 artists with unknown or vague origin. Let's have a closer look:

In [30]:
unknown_artist = unknown_area.groupby('artist_name_x').count().sort_values('release_id',ascending=False)
unknown_artist.head(1000)

Unnamed: 0_level_0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,origin_code,origin_name,origin_ISO_code,origin_code_type
artist_name_x,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Various Artists,134847,134847,134847,134847,134847,134847,134847,122627,134847,134847,134847,0,0,0,0
Berliner Philharmoniker,550,550,550,550,550,550,550,482,550,550,550,0,0,0,0
[language instruction],278,278,278,278,278,278,278,189,278,278,278,0,0,0,0
Dwelling of Duels,180,180,180,180,180,180,180,0,180,180,180,0,0,0,0
[nature sounds],170,170,170,170,170,170,170,167,170,170,170,0,0,0,0
Tom Jones,160,160,160,160,160,160,160,143,160,160,160,0,0,0,0
Michael Koser,139,139,139,139,139,139,139,139,139,139,139,0,0,0,0
Peerless Orchestra,122,122,122,122,122,122,122,122,122,122,122,0,0,0,0
Edison Concert Band,115,115,115,115,115,115,115,115,115,115,115,0,0,0,0
Daniel Menche,115,115,115,115,115,115,115,72,115,115,115,0,0,0,0


In [182]:
#From what we can see above, the category "Various Artists" has many releases assigned:
df11[df11['artist_name_x']=='Various Artists']

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,origin_code,origin_name,origin_ISO_code,origin_code_type
4,1895266,1751021,M2Music HitDisc Vol. 1,1,222.0,United States,US,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
26,356044,14028,!!!Here Ain't the Sonics!!!,1,222.0,United States,US,1.0,1993-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
41,1623578,1539062,!Go Hit,1298824,81.0,Germany,DE,1.0,1998-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
44,475440,785494,"!JBL, Volume 2: PROGRESSIVE",1,194.0,Spain,ES,1.0,2004-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
45,62055,28102,!K7,1,194.0,Spain,ES,1.0,2000-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
46,1053661,1078102,!K7 2011 Sampler,1,240.0,[Worldwide],XW,,2011-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
47,62061,147591,!K7 Compilation,1,81.0,Germany,DE,1.0,2003-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
48,1447852,147591,!K7 Compilation,1298824,81.0,Germany,DE,1.0,2003-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
49,1012140,955241,!K7 Spring 2002,1,221.0,United Kingdom,GB,1.0,2002-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,
50,2319247,2084339,!Kollections 02: Classics,1,240.0,[Worldwide],XW,,2017-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,


If we look in detail into these releases, we can see that most of them are music compilations (hence the generic category "Various Artists"). As they are music compilations, that means that the tracks included were originally released before by their genuine author, so we shouldn't take them into account (to avoid duplicates). Also, as we don't have an artist name for them, it will be impossible to retrieve the origin.

We will delete those rows from our dataframe later.

Let's analyze more in detail who are the rest of artists that have many releases, and decide what to do with them.

In [183]:
#Unknown artist:
df11[df11['artist_name_x']=='[unknown]']

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,origin_code,origin_name,origin_ISO_code,origin_code_type
7287,546135,843736,100 Beste Kinderliedjes (disc 1),97546,150.0,Netherlands,NL,1.0,1998-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
8817,557295,852580,101 Children's Songs and Nursery Rhymes,97546,221.0,United Kingdom,GB,1.0,2008-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
9663,1232110,1222115,12 Chart Buster Hits: Volume 11,97546,221.0,United Kingdom,GB,1.0,1974-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
9664,665743,936259,12 Chartbuster Hits,97546,221.0,United Kingdom,GB,1.0,1974-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
9817,1235032,1224582,12 Makamda Yaylı Tanbur Taksimleri,97546,214.0,Turkey,TR,1.0,2004-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
9918,638247,914460,12 Tops: Volume 20,97546,221.0,United Kingdom,GB,1.0,1974-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
10852,1208592,1202487,14 Makamda Keman Taksimleri,97546,214.0,Turkey,TR,1.0,2004-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
11319,686485,952465,15 chansons et comptines pour votre bébé,97546,73.0,France,FR,1.0,2004-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
11705,1247647,1234682,16 Makamda Ud Taksimleri,97546,214.0,Turkey,TR,1.0,2004-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,
16973,1015709,616974,20 Golden Guitar Hits,97546,221.0,United Kingdom,GB,1.0,1988-01-01,97546.0,125ec42a-7229-4250-afc5-e057484327fe,[unknown],,,CA-NB,


The category "unknown" seems to contain music compilations too.

In [184]:
#"Language instruction" artist:
df11[df11['artist_name_x']=='[language instruction]']

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,origin_code,origin_name,origin_ISO_code,origin_code_type
11186,821781,1057780,15 Minute French,597116,222.0,United States,US,1.0,2005-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
11187,822699,1058470,15 Minute Italian,597116,222.0,United States,US,1.0,2006-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
30861,536732,836303,450 Nouveaux Exercices Grammaire Niveau Avancé,1964330,73.0,France,FR,1.0,2005-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
51254,1921485,1771345,A break in/The Police,597116,240.0,[Worldwide],XW,,2007-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
51811,1920797,1770754,A new telephone number,597116,240.0,[Worldwide],XW,,2006-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
73778,2107126,1916950,All Audio Spanish - Basic-Intermediate Disc 1,597116,222.0,United States,US,1.0,1999-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
78295,685879,731447,All-Audio Spanish,597116,222.0,United States,US,1.0,1997-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
81802,1381788,1341586,Alter ego 2,1340398,73.0,France,FR,1.0,2006-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
82122,1921562,1771391,Alternative forms of energy,597116,240.0,[Worldwide],XW,,2011-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,
100674,1921270,1771164,Apologies and excuses,597116,240.0,[Worldwide],XW,,2007-01-01,597116.0,80a8851f-444c-4539-892b-ad2a49292aa9,[language instruction],,,,


As its name suggests, these releases are language courses recorded, so they are not music and they are out of our scope too.

The same would apply to the categories [nature sounds], [dialogue] or [christmas music].

We can now delete from our dataframes all these cathegories, and see what we have left.

In [31]:
#In our main dataframe:
labels = ['[nature sounds]','[dialogue]','[christmas music]', 'Various Artists','[unknown]','[language instruction]']
df11.artist_name_x = np.where(df11.artist_name_x.isin(labels), np.nan, df11.artist_name_x)
df11.dropna(subset=['artist_name_x'], axis=0, inplace=True)

In [43]:
#In our unknown_area dataframe:
unknown_area.artist_name_x = np.where(unknown_area.artist_name_x.isin(labels), np.nan, unknown_area.artist_name_x)
unknown_area.dropna(subset=['artist_name_x'], axis=0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [37]:
#So after this deletion of some releases, how many do we have left to retrieve the origin?
df11.origin_ISO_code.isnull().value_counts()

False    982785
True     379823
Name: origin_ISO_code, dtype: int64

In [41]:
#And how many artists do they represent?
unknown_area.artist_mbid.nunique()

188356

In [46]:
#Wich are the ones that produce the most?
unknown_artist = unknown_area.groupby('artist_name_x').count().sort_values('release_id',ascending=False)
unknown_artist.head(1000)

Unnamed: 0_level_0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,origin_code,origin_name,origin_ISO_code,origin_code_type
artist_name_x,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Berliner Philharmoniker,550,550,550,550,550,550,550,482,550,550,550,0,0,0,0
Dwelling of Duels,180,180,180,180,180,180,180,0,180,180,180,0,0,0,0
Tom Jones,160,160,160,160,160,160,160,143,160,160,160,0,0,0,0
Michael Koser,139,139,139,139,139,139,139,139,139,139,139,0,0,0,0
Peerless Orchestra,122,122,122,122,122,122,122,122,122,122,122,0,0,0,0
Edison Concert Band,115,115,115,115,115,115,115,115,115,115,115,0,0,0,0
Daniel Menche,115,115,115,115,115,115,115,72,115,115,115,0,0,0,0
Minniva,114,114,114,114,114,114,114,113,114,114,114,0,0,0,0
Bibi & Tina,109,109,109,109,109,109,109,109,109,109,109,0,0,0,0
[no artist],108,108,108,108,108,108,108,103,108,108,108,0,0,0,0


In [51]:
df11[df11['artist_name_x'] == 'Berliner Philharmoniker']

Unnamed: 0,release_id,group_id,release_group,credit_id,release_area,release_area_name,release_ISO_code,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,origin_code,origin_name,origin_ISO_code,origin_code_type
7266,1880145,1739508,100 Best Berliner Philharmoniker,64981,240.0,[Worldwide],XW,,2011-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
9414,1894928,1052100,111 Years of Deutsche Grammophon: The Collecto...,1862166,241.0,Europe,XE,,2010-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
9426,1894967,1052100,111 Years of Deutsche Grammophon: The Collecto...,1058583,241.0,Europe,XE,,2010-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
9458,1895039,1052100,111 Years of Deutsche Grammophon: The Collecto...,1862486,241.0,Europe,XE,,2010-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
9814,2244651,2026726,12 Londoner Symphonien,2275446,241.0,Europe,XE,,1990-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
27200,461081,773437,3 Symphonies / The Rock,1337011,81.0,Germany,DE,1.0,1996-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
30903,196588,498346,46 Symphonien,1106042,81.0,Germany,DE,1.0,1996-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
31969,201571,442660,5. Symphonie,1382320,241.0,Europe,XE,,1996-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
31970,974264,386285,5. Symphonie / Kindertotenlieder,1115711,81.0,Germany,DE,1.0,1985-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,
33090,2213987,481592,"6 ""Paris"" Symphonies",1499235,81.0,Germany,DE,1.0,1994-01-01,64981.0,dea28aa9-1086-4ffa-8739-0ccc759de1ce,Berliner Philharmoniker,,,,


# SEGUIR DESDE AQUI: VOLVER A CRUZAR DATOS DE ARTISTAS CON START AREA 1 Y 2 !!!!! Y DESPUES ASUMIR QUE CADA RELEASE TIENE SU ORIGEN DONDE SE PRODUCE. PASAR A NOTEBOOK DE GENERO