# Data gathering

## 1) Artist information

In [132]:
import pandas as pd
import numpy as np
import time
#!pip install pygeocoder
#from pygeocoder import Geocoder #If you want to follow the geocoding later, you will need your own Google Maps API key
#import matplotlib.pyplot as plt
#%matplotlib inline
from tqdm import tqdm

In [2]:
artists= pd.read_csv('Musicbrainz/Tables_used/artist.txt',sep='\t', header=None, engine='c', usecols=[0,1,2,11,17])
artists.columns = ['artist_id','artist_mbid','artist_name','start_area1', 'start_area2']
artists.head()

Unnamed: 0,artist_id,artist_mbid,artist_name,start_area1,start_area2
0,805192,8972b1c1-6482-4750-b51f-596d2edea8b1,WIK▲N,,
1,371203,49add228-eac5-4de8-836c-d75cde7369c3,Pete Moutso,,
2,273232,c112a400-af49-4665-8bba-741531d962a1,Zachary,,
3,101060,ca3f3ee1-c4a7-4bac-a16a-0b888a396c6b,The Silhouettes,222.0,7707.0
4,145773,7b4a548e-a01a-49b7-82e7-b49efeb9732c,Aric Leavitt,,


In [3]:
#Let's see how many artists we have:
artists['artist_id'].nunique()

1476425

In [4]:
#How much info do we have for each artist?
artists.isnull().sum(axis=0)

artist_id            0
artist_mbid          0
artist_name          8
start_area1     808442
start_area2    1274001
dtype: int64
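
To put these counts in perspective, they can be expressed as shares of the table. Assuming one row per artist (1,476,425 rows), roughly 55% of artists have no "start_area1" and roughly 86% have no "start_area2". A minimal sketch of that calculation on a made-up toy frame (`toy` and its values are illustrative, not taken from the dump):

```python
import pandas as pd

# Toy stand-in for the artists table; values are made up for illustration.
toy = pd.DataFrame({'artist_id': [1, 2, 3, 4],
                    'start_area1': [222.0, None, None, None]})

# isnull() gives booleans, so the column mean is the fraction of missing values.
missing_share = toy.isnull().mean()
print(missing_share)
```

On the real dataframe the same idea would be `artists.isnull().mean()`.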

What are the "start_area1" and "start_area2"? If we look at Musicbrainz's field description for each artist (https://musicbrainz.org/doc/Artist), we can see that:

Area: The artist area, as the name suggests, indicates the area with which an artist is primarily identified with. It is often, but not always, its birth/formation country.

We will keep this information as the artist's origin for later.

We also need to incorporate the table called "artist credit", which gives us the artist credit_id. We will use this field later on to join each release with its artist:

In [5]:
artists_credit= pd.read_csv('Musicbrainz/Tables_used/artist_credit_name.txt',sep='\t', header=None, engine='c', usecols=[0,2,3])
artists_credit.columns = ['credit_id','artist_id','artist_name']
artists_credit.head()

Unnamed: 0,credit_id,artist_id,artist_name
0,578352,578352,Gustav Ruppke
1,273232,273232,Zachary
2,153193,153193,The High Level Ranters
3,32262,32262,Georges Brassens
4,1389968,1171184,Harvard of the South


In [6]:
#Let's join the artists with their credit id and verify that the matching is good:
df = pd.merge(artists, artists_credit, how='left', on='artist_id')
df.head()

Unnamed: 0,artist_id,artist_mbid,artist_name_x,start_area1,start_area2,credit_id,artist_name_y
0,805192,8972b1c1-6482-4750-b51f-596d2edea8b1,WIK▲N,,,822846.0,WIK▲N
1,371203,49add228-eac5-4de8-836c-d75cde7369c3,Pete Moutso,,,,
2,273232,c112a400-af49-4665-8bba-741531d962a1,Zachary,,,273232.0,Zachary
3,101060,ca3f3ee1-c4a7-4bac-a16a-0b888a396c6b,The Silhouettes,222.0,7707.0,101060.0,The Silhouettes
4,145773,7b4a548e-a01a-49b7-82e7-b49efeb9732c,Aric Leavitt,,,145773.0,Aric Leavitt


In [7]:
#It looks like it makes sense. Please note that the credit id is sometimes equal to the artist_id, but not always:
df['check'] = df['artist_id'] - df['credit_id']
df['check'].nunique()

1270628

In [8]:
df.isnull().sum(axis=0)

artist_id              0
artist_mbid            0
artist_name_x         15
start_area1      1120376
start_area2      2109027
credit_id         461241
artist_name_y     461253
check             461241
dtype: int64
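
A more direct way to check a left join like this one is the `indicator` parameter of `pd.merge`, which labels each row by where it came from. The sketch below uses two made-up toy tables (`artists_toy`, `credit_toy`) rather than the real data. It shows the two effects visible in the counts above: artists without any credit get a null credit_id, and artists with several credits are repeated.

```python
import pandas as pd

# Toy stand-ins for the `artists` and `artists_credit` tables (values are made up).
artists_toy = pd.DataFrame({'artist_id': [1, 2, 3],
                            'artist_name': ['A', 'B', 'C']})
credit_toy = pd.DataFrame({'credit_id': [10, 11, 12],
                           'artist_id': [1, 1, 3]})

# indicator=True adds a `_merge` column telling where each row came from.
merged = pd.merge(artists_toy, credit_toy, how='left', on='artist_id', indicator=True)

# Artists with no credit at all show up as 'left_only'.
unmatched = int((merged['_merge'] == 'left_only').sum())
# Artists with several credits are repeated once per credit.
extra_rows = len(merged) - artists_toy['artist_id'].nunique()
print(unmatched, extra_rows)
```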

In [9]:
#We can now get rid of check and the duplicate artist_name column:
df.drop(labels=['check','artist_name_y'], axis=1, inplace=True)
df.head()

Unnamed: 0,artist_id,artist_mbid,artist_name_x,start_area1,start_area2,credit_id
0,805192,8972b1c1-6482-4750-b51f-596d2edea8b1,WIK▲N,,,822846.0
1,371203,49add228-eac5-4de8-836c-d75cde7369c3,Pete Moutso,,,
2,273232,c112a400-af49-4665-8bba-741531d962a1,Zachary,,,273232.0
3,101060,ca3f3ee1-c4a7-4bac-a16a-0b888a396c6b,The Silhouettes,222.0,7707.0,101060.0
4,145773,7b4a548e-a01a-49b7-82e7-b49efeb9732c,Aric Leavitt,,,145773.0


## 2) Release information

The objective of this project is to visualize when each artist first released a given CD/album/single, etc.

If we look at the "releases" table:

In [10]:
releases = pd.read_csv('Musicbrainz/Tables_used/release.txt',sep='\t', header=None, engine='c', usecols=[0,2,3])
releases.columns = ['release_id','release_group','credit_id']
releases.head()

Unnamed: 0,release_id,release_group,credit_id
0,9,A Sorta Fairytale,60
1,10,A Sorta Fairytale,60
2,11,Glory of the 80's,60
3,12,Silent All These Years,60
4,26,Demons,20211


We can see, in the first 2 rows, that the same CD/Album can be released/remastered many times. According to Musicbrainz's field description for each release (https://musicbrainz.org/doc/Release):

"A MusicBrainz release represents the unique release (i.e. issuing) of a product on a specific date with specific release information such as the country, label, barcode and packaging. If you walk into a store and purchase an album or single, they are each represented in MusicBrainz as one release".

If we look at another release-related field in Musicbrainz, we find the "release group" (https://musicbrainz.org/doc/Release_Group):

"A release group, just as the name suggests, is used to group several different releases into a single logical entity. Every release belongs to one, and only one release group.

Both release groups and releases are "albums" in a general sense, but with an important difference: a release is something you can buy as media such as a CD or a vinyl record, while a release group embraces the overall concept of an album -- it doesn't matter how many CDs or editions/versions it had."

By reading these descriptions, we can clearly see that the release group is the table we are looking for as it represents a single creation, no matter how many times it has been edited or released afterwards. So we will have to keep the first release id for each release group.
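
The idea of keeping only the earliest release of each release group can be sketched on toy data. The frame below is made up (the years in particular are illustrative), and the `groupby(...).idxmin()` approach is just one possible way to express it, not necessarily the one used in this notebook:

```python
import pandas as pd

# Made-up releases: two editions of each release group.
toy = pd.DataFrame({
    'release_id': [9, 10, 11, 12],
    'release_group': ['A Sorta Fairytale', 'A Sorta Fairytale', 'Demons', 'Demons'],
    'credit_id': [60, 60, 20211, 20211],
    'release_year': [2003, 2002, 1998, 1999],
})

# Index label of the earliest year within each (release_group, credit_id) pair.
first_idx = toy.groupby(['release_group', 'credit_id'])['release_year'].idxmin()

# Keep only those rows: one row per release group and artist credit.
first_releases = toy.loc[first_idx].sort_values('release_id').reset_index(drop=True)
print(first_releases)
```

Grouping on the credit id as well as the name avoids merging two release groups that share a title but belong to different artists.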

In [11]:
release_country = pd.read_csv('Musicbrainz/Tables_used/release_country.txt',sep='\t', header=None, engine='c', usecols=[0,1,2])
release_country.columns = ['release_id','area_id','release_year']
release_country.head()

Unnamed: 0,release_id,area_id,release_year
0,3,81,1997.0
1,1427792,107,2014.0
2,9,81,2002.0
3,10,221,2002.0
4,11,81,1999.0


In [12]:
df2 = pd.merge(releases, release_country, how='left', on='release_id')
df2.head()

Unnamed: 0,release_id,release_group,credit_id,area_id,release_year
0,9,A Sorta Fairytale,60,81.0,2002.0
1,10,A Sorta Fairytale,60,221.0,2002.0
2,11,Glory of the 80's,60,81.0,1999.0
3,12,Silent All These Years,60,81.0,1997.0
4,26,Demons,20211,107.0,1998.0


In [13]:
#Let's see how many releases we have:
df2['release_id'].nunique()

2198457

In [14]:
df2.isnull().sum(axis=0)

release_id            0
release_group         7
credit_id             0
area_id          287376
release_year     341983
dtype: int64

In [15]:
#We want to keep only the releases which have a release year, so we can drop the others:
df2.dropna(subset=['release_year'], axis=0, inplace=True)
df2['release_year'] = df2['release_year'].astype(int)
df2['release_id'].nunique()

1859982

In [16]:
#Let's analyze the year column:
pd.options.display.max_rows = 2000
df2.groupby('release_year').count()

Unnamed: 0_level_0,release_id,release_group,credit_id,area_id
release_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,2,2,2,2
4,1,1,1,1
5,5,5,5,5
7,1,1,1,1
8,2,2,2,2
10,3,3,3,3
14,1,1,1,1
17,4,4,4,4
18,1,1,1,1
19,3,3,3,3


Looking at the different year values, and in order to have enough values per year, we can drop the rows whose year is below 1890 or above 2019. Our visualization will then cover 130 years (1890 to 2019 inclusive), which is pretty good.

In [17]:
df2.drop(df2[df2['release_year'] < 1890].index , inplace=True)
df2.drop(df2[df2['release_year'] >2019].index , inplace=True)
df2.sort_values(by=['release_year']).head()

Unnamed: 0,release_id,release_group,credit_id,area_id,release_year
1266766,386919,Visions of Paradise Waltz,97546,222.0,1890
1266956,386830,German Ballad with Variations,97546,222.0,1890
1266958,386829,German Ballad with Variations,97546,222.0,1890
1266960,386828,Mountain Bells Polka,97546,222.0,1890
1266961,386827,Mountain Bells Polka,97546,222.0,1890


In [18]:
#Converting the year column to datetime for later:
df2['release_year'] = pd.to_datetime(df2['release_year'].astype(str), format='%Y')
df2.dtypes

release_id                int64
release_group            object
credit_id                 int64
area_id                 float64
release_year     datetime64[ns]
dtype: object
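
A small sketch of what this conversion does, on made-up years: each year becomes January 1st of that year, and `.dt.year` recovers the integer. Note that filtering the years before converting matters, because nanosecond-resolution pandas timestamps cannot represent dates before 1677, so values such as year 1 or 17 could not be converted this way.

```python
import pandas as pd

years = pd.Series([1890, 2002, 2019])  # illustrative values

# '%Y' parses a 4-digit year; month and day default to January 1st.
as_dates = pd.to_datetime(years.astype(str), format='%Y')
back_to_years = as_dates.dt.year

# Lower bound of nanosecond-resolution timestamps.
earliest_supported_year = pd.Timestamp.min.year
print(as_dates, earliest_supported_year)
```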

In [19]:
#We sort by release group, year and credit id (we could have 2 release groups with the same name but produced by different artists):
df2.sort_values(['release_group','release_year','credit_id'], ascending=[True,True,True], inplace=True)
df2.head()

Unnamed: 0,release_id,release_group,credit_id,area_id,release_year
2026273,2163750,,2205562,240.0,2014-01-01
1648516,1846605,,1503027,240.0,2015-01-01
1250325,1714060,Beaux Soirs De Paris,1324142,73.0,1995-01-01
2116340,2265346,Le 1,2291833,240.0,2018-01-01
1748061,1895266,M2Music HitDisc Vol. 1,1,222.0,2006-01-01


In [20]:
df2[df2['release_group'] == 'Artaxerxes']

Unnamed: 0,release_id,release_group,credit_id,area_id,release_year
1836724,2378622,Artaxerxes,2392005,240.0,1996-01-01
1910376,2379252,Artaxerxes,2392005,221.0,2009-01-01
1909444,2379244,Artaxerxes,2392011,222.0,2011-01-01


In [21]:
#Now we can delete the duplicate releases and keep the ones that were released first:
df2.drop_duplicates(subset=['release_group','credit_id'],keep='first', inplace=True)
df2['release_id'].nunique()

1499614

In [23]:
#Just to double-check:
df2[df2['release_group'] == 'Artaxerxes']

Unnamed: 0,release_id,release_group,credit_id,area_id,release_year
1836724,2378622,Artaxerxes,2392005,240.0,1996-01-01
1909444,2379244,Artaxerxes,2392011,222.0,2011-01-01


## 3) Matching releases with artists

Now that we have both artist and releases dataframes, we can join them:

In [24]:
df3 = pd.merge(df2, df, how='left', on='credit_id')
df3.head()

Unnamed: 0,release_id,release_group,credit_id,area_id,release_year,artist_id,artist_mbid,artist_name_x,start_area1,start_area2
0,2163750,,2205562,240.0,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,
1,1846605,,1503027,240.0,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,
2,1714060,Beaux Soirs De Paris,1324142,73.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,
3,2265346,Le 1,2291833,240.0,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,
4,1895266,M2Music HitDisc Vol. 1,1,222.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,


In [25]:
df3.isnull().sum(axis=0)

release_id            0
release_group         4
credit_id             0
area_id               0
release_year          0
artist_id           151
artist_mbid         151
artist_name_x       155
start_area1      430452
start_area2      959581
dtype: int64

In [26]:
df3['release_id'].nunique()

1499614

In [27]:
len(df3)

1724524

In [28]:
df3[df3['release_group']=='Artaxerxes']

Unnamed: 0,release_id,release_group,credit_id,area_id,release_year,artist_id,artist_mbid,artist_name_x,start_area1,start_area2
119493,2378622,Artaxerxes,2392005,240.0,1996-01-01,391603.0,e3062782-ab7b-41bc-8e65-aeea16dc1a89,Ian Partridge,221.0,1178.0
119494,2378622,Artaxerxes,2392005,240.0,1996-01-01,124232.0,4e7f1926-8704-4545-a1a1-ded91651c884,Thomas Arne,221.0,1178.0
119495,2378622,Artaxerxes,2392005,240.0,1996-01-01,688791.0,f34e9da4-2ee7-4f27-aa34-adc5db791bec,Christopher Robson,,
119496,2378622,Artaxerxes,2392005,240.0,1996-01-01,1129787.0,c33f733e-2bf4-402b-9455-1a293601a1cd,Patricia Spence,,
119497,2378622,Artaxerxes,2392005,240.0,1996-01-01,1104538.0,5680c729-615b-47e2-969e-27a087c572fb,Philippa Hyde,221.0,
119498,2378622,Artaxerxes,2392005,240.0,1996-01-01,402986.0,70af5d9a-c6e0-4fcf-9cde-4d3d00e0fcb0,The Parley of Instruments,221.0,1178.0
119499,2378622,Artaxerxes,2392005,240.0,1996-01-01,183632.0,954d1c83-259f-4a25-8878-10c19bb097af,Catherine Bott,221.0,
119500,2378622,Artaxerxes,2392005,240.0,1996-01-01,87510.0,857588a5-b7aa-4f72-a87b-8f03dca60e30,Roy Goodman,221.0,30926.0
119501,2378622,Artaxerxes,2392005,240.0,1996-01-01,1078968.0,93da7aaa-250b-46e1-b5ef-0ad78d46dc3f,Richard Edgar‐Wilson,,
119502,2379244,Artaxerxes,2392011,222.0,2011-01-01,854064.0,a87f2b39-84c7-4888-935c-d41943bd7971,Classical Opera Company,221.0,


Looking at the above, we can see that there is one row for each artist that participated in each release ID.

As we don't want to show duplicate releases, we need to keep only one artist per release. We will keep the first artist appearing for each release (we know this is not 100% accurate, but we have to avoid duplicates). This removes 224,910 rows out of the 1,724,524 in the joined dataframe (about 13%), leaving one row for each of the 1,499,614 unique releases.
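
The number of rows removed by this step is the number of rows minus the number of unique release ids (1,724,524 - 1,499,614 = 224,910 with the figures above). A minimal sketch on a made-up frame (`toy` is illustrative) that also counts how many releases have more than one artist:

```python
import pandas as pd

# Made-up join result: release 100 has three artists, 300 has two.
toy = pd.DataFrame({'release_id': [100, 100, 100, 200, 300, 300],
                    'artist_id': [1, 2, 3, 4, 5, 6]})

# Number of artist rows per release.
artists_per_release = toy.groupby('release_id')['artist_id'].size()
multi_artist_releases = int((artists_per_release > 1).sum())

# Rows that disappear when only one artist per release is kept.
rows_dropped = len(toy) - toy['release_id'].nunique()

# keep='first' retains the first artist row of each release.
deduped = toy.drop_duplicates(subset=['release_id'], keep='first')
print(multi_artist_releases, rows_dropped)
```

Note that the rows dropped and the releases affected are different quantities: a release with three artists loses two rows but counts as one affected release.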

In [29]:
#Now we can delete the duplicate rows and keep only the first artist listed for each release:
df3.drop_duplicates(subset=['release_id'],keep='first', inplace=True)
df3['release_id'].nunique()

1499614

In [30]:
len(df3)

1499614

## 4) Geographical data

The idea of the visualization is to see where each genre comes from, so, ideally, we would look at the artists' origins (start area: the last 2 columns of our dataframe).

In our dataframe df3, the 4th column, "area_id", is related to the area where the release was produced. This isn't directly related to the origin of an artist/band, as many artists have to record their works in different countries or areas.

Let's see for how many releases we have that information:

In [31]:
df3.isnull().sum(axis=0)

release_id            0
release_group         4
credit_id             0
area_id               0
release_year          0
artist_id           151
artist_mbid         151
artist_name_x       155
start_area1      404503
start_area2      876562
dtype: int64

In Musicbrainz's database, we have some tables related to the areas. Let's see how we can use them to input more geographical information into our dataframe:

In [32]:
areas = pd.read_csv('Musicbrainz/Tables_used/area.txt',sep='\t', header=None, engine='python', usecols=[0,2,3])
areas.columns = ['area_id','area_name','code_type']
areas.head()

Unnamed: 0,area_id,area_name,code_type
0,15449,Greccio,4.0
1,38,Canada,1.0
2,43,Chile,1.0
3,44,China,1.0
4,36,Cambodia,1.0


In [33]:
#Let's see the area types we have:
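#Note: on pandas >= 1.3 the option is on_bad_lines='skip'; older versions used error_bad_lines=False instead
area_types = pd.read_csv('Musicbrainz/Tables_used/area_type.txt',sep='\t', header=None, engine='python', usecols=[1,3,4], on_bad_lines='skip')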
area_types.columns = ['type','code_type','definition']
area_types.head(20)

Unnamed: 0,type,code_type,definition
0,Country,1,Country is used for areas included (or previou...
1,Subdivision,2,Subdivision is used for the main administrativ...
2,County,7,County is used for smaller administrative divi...
3,Municipality,4,Municipality is used for small administrative ...
4,City,3,"City is used for settlements of any size, incl..."
5,District,5,District is used for a division of a large cit...
6,Island,6,Island is used for islands and atolls which do...


In [34]:
#Add the area name and type to our main dataframe for the column "area_id":
df4 = pd.merge(df3, areas, how='left', on='area_id')
df4.head()

Unnamed: 0,release_id,release_group,credit_id,area_id,release_year,artist_id,artist_mbid,artist_name_x,start_area1,start_area2,area_name,code_type
0,2163750,,2205562,240.0,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,,[Worldwide],
1,1846605,,1503027,240.0,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,[Worldwide],
2,1714060,Beaux Soirs De Paris,1324142,73.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,France,1.0
3,2265346,Le 1,2291833,240.0,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,,[Worldwide],
4,1895266,M2Music HitDisc Vol. 1,1,222.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,United States,1.0


In [35]:
#Rearranging dataframe columns to have a clearer dataframe:
df4 = df4[['release_id','release_group','credit_id','area_id','area_name','code_type','release_year','artist_id','artist_mbid','artist_name_x','start_area1','start_area2']]
df4.rename(columns={'area_id':'release_area','area_name':'release_area_name','code_type':'release_code_type','start_area1':'area_id'}, inplace=True)
df4.head()

Unnamed: 0,release_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,area_id,start_area2
0,2163750,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,
1,1846605,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,
2,1714060,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,
3,2265346,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,
4,1895266,M2Music HitDisc Vol. 1,1,222.0,United States,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,


In [36]:
#Add the start area name and type to our main dataframe for the column "area id"(which was "start area 1" before):
df5 = pd.merge(df4, areas, how='left', on='area_id')
df5.head()

Unnamed: 0,release_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,area_id,start_area2,area_name,code_type
0,2163750,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,,Philadelphia,3.0
1,1846605,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,
2,1714060,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,
3,2265346,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,,Aix-en-Provence,3.0
4,1895266,M2Music HitDisc Vol. 1,1,222.0,United States,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,


In [37]:
#Rearranging dataframe columns to have a clearer dataframe:
df5 = df5[['release_id','release_group','credit_id','release_area','release_area_name','release_code_type','release_year','artist_id','artist_mbid','artist_name_x','area_id','area_name','code_type','start_area2']]
df5.rename(columns={'area_id':'artist_area1','area_name':'artist_area_name1','code_type':'artist_code_type1','start_area2':'area_id'}, inplace=True)
df5.head()

Unnamed: 0,release_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area1,artist_area_name1,artist_code_type1,area_id
0,2163750,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,Philadelphia,3.0,
1,1846605,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,
2,1714060,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,
3,2265346,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,Aix-en-Provence,3.0,
4,1895266,M2Music HitDisc Vol. 1,1,222.0,United States,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,


In [38]:
#Add the start area 2 name and type to our main dataframe for the column "area id"(which was "start area 2" before):
df6 = pd.merge(df5, areas, how='left', on='area_id')
df6.head()

Unnamed: 0,release_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area1,artist_area_name1,artist_code_type1,area_id,area_name,code_type
0,2163750,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,Philadelphia,3.0,,,
1,1846605,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,,
2,1714060,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,,
3,2265346,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,Aix-en-Provence,3.0,,,
4,1895266,M2Music HitDisc Vol. 1,1,222.0,United States,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,,


In [39]:
#Renaming columns:
df6.rename(columns={'area_id':'artist_area2','area_name':'artist_area_name2','code_type':'artist_code_type2'}, inplace=True)
df6.head()

Unnamed: 0,release_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area1,artist_area_name1,artist_code_type1,artist_area2,artist_area_name2,artist_code_type2
0,2163750,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,Philadelphia,3.0,,,
1,1846605,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,,
2,1714060,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,,
3,2265346,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,Aix-en-Provence,3.0,,,
4,1895266,M2Music HitDisc Vol. 1,1,222.0,United States,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,,
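
The three merge-and-rename rounds above could also be expressed as a lookup with `Series.map`, which avoids temporarily renaming columns to "area_id". This is only an alternative sketch on made-up toy frames (`areas_toy`, `df_toy` and the new column names are illustrative), not the approach used in this notebook:

```python
import pandas as pd

# Toy stand-in for the `areas` table.
areas_toy = pd.DataFrame({'area_id': [222, 7707, 73],
                          'area_name': ['United States', 'Philadelphia', 'France'],
                          'code_type': [1.0, 3.0, 1.0]})
# Toy stand-in for the main dataframe, with float ids and a missing value.
df_toy = pd.DataFrame({'release_area': [222.0, 73.0],
                       'start_area1': [7707.0, None]})

# Index the areas by id so each column can be used as a lookup Series.
lookup = areas_toy.set_index('area_id')

for col in ['release_area', 'start_area1']:
    # Ids without a match (including NaN) simply stay NaN.
    df_toy[col + '_name'] = df_toy[col].map(lookup['area_name'])
    df_toy[col + '_type'] = df_toy[col].map(lookup['code_type'])
print(df_toy)
```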


Now that we have the names of the different areas, let's check what kind of information we have in those columns.

As we said before, we prefer to use the artist area, as it better represents the real origin of the music.

1) Artist area 1:

In [40]:
df6.artist_area_name1.value_counts()

United States                                                 273415
United Kingdom                                                133067
Japan                                                          83908
Germany                                                        67463
France                                                         45927
Italy                                                          27215
Sweden                                                         24982
Canada                                                         23619
Finland                                                        21981
Netherlands                                                    18101
Australia                                                      17738
Spain                                                          16090
Russia                                                         13821
Brazil                                                         11142
Belgium                           

In [41]:
df6.artist_code_type1.value_counts()

1.0    949862
3.0    112208
2.0     24835
4.0      3058
5.0      2429
7.0       254
6.0       114
Name: artist_code_type1, dtype: int64

As we can see, most of the artists' start areas are countries. This would be good for our visualization, except for big countries like the USA, Canada or Australia, for which we would prefer to retrieve at least the artist's state to get a clearer view of the music's origin.

Also, we noticed that we have some area names that don't give us much information: "Worldwide", "Europe", "South Australia", etc.

2) Artist area 2:

In [42]:
df6.artist_area_name2.value_counts()

London                                         23087
Los Angeles                                    14173
New York                                       12434
Chicago                                         8353
Tokyo                                           7784
Paris                                           6395
Brooklyn                                        6258
Berlin                                          5941
Philadelphia                                    5274
Detroit                                         4659
San Francisco                                   4574
Toronto                                         4068
Boston                                          3959
Seattle                                         3938
Seoul                                           3800
Stockholm                                       3448
Melbourne                                       3308
Hamburg                                         3259
United Kingdom                                

In [43]:
df6.artist_code_type2.value_counts()

3.0    481180
2.0     61532
1.0     31001
5.0     25596
4.0     20562
7.0      2487
6.0       556
Name: artist_code_type2, dtype: int64

It looks like this second column could be giving us more detailed information about the artist's origin (only 31K rows have countries). 

We will keep the detail in "artist_area_name2" and "artist_code_type2" as the origin for the rows that have that information, and fill the other rows with "artist_area_name1" and "artist_code_type1".

In [157]:
#First, we rename our columns:
df6.rename(columns={'artist_area_name2':'origin_name','artist_code_type2':'origin_code'}, inplace=True)
df6.head()

Unnamed: 0,release_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area1,artist_area_name1,artist_code_type1,artist_area2,origin_name,origin_code
0,2163750,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,7707.0,Philadelphia,3.0,,,
1,1846605,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,,
2,1714060,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,,
3,2265346,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,68613.0,Aix-en-Provence,3.0,,,
4,1895266,M2Music HitDisc Vol. 1,1,222.0,United States,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,,


In [158]:
#And now we can fill the NaNs with the values in "artist_area_name1" and "artist_code_type1":
df6['origin_name'] = df6['origin_name'].fillna(df6['artist_area_name1'])
df6['origin_code'] = df6['origin_code'].fillna(df6['artist_code_type1'])
#We can also delete some columns that we don't need anymore:
df6.drop(labels=['artist_area1','artist_area_name1','artist_code_type1','artist_area2'], axis=1, inplace=True)

In [159]:
df6.head()

Unnamed: 0,release_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,origin_name,origin_code
0,2163750,,2205562,240.0,[Worldwide],,2014-01-01,1654312.0,d10d6441-dcc1-4202-93bf-0c0acf72913a,Soul Glo,Philadelphia,3.0
1,1846605,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,
2,1714060,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,
3,2265346,Le 1,2291833,240.0,[Worldwide],,2018-01-01,1720981.0,a69efb5f-0b28-4328-8ff0-44d8d6f39755,TedeuzeM,Aix-en-Provence,3.0
4,1895266,M2Music HitDisc Vol. 1,1,222.0,United States,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,


In [160]:
#Now, let's see what information we have for these new columns:
df6.origin_name.value_counts()

United States                                  81824
United Kingdom                                 52861
Japan                                          42326
Germany                                        34085
London                                         26075
France                                         26057
Los Angeles                                    16492
Sweden                                         15460
Italy                                          15210
New York                                       14259
Finland                                        13767
Netherlands                                    10886
Chicago                                         9463
Spain                                           9225
Canada                                          8819
Australia                                       8720
Russia                                          8592
Tokyo                                           8367
Berlin                                        

In [161]:
df6.origin_code.value_counts()

3.0    544417
1.0    433886
2.0     74659
5.0     26638
4.0     22369
7.0      2623
6.0       612
Name: origin_code, dtype: int64

In [162]:
#Now, let's see how many empty rows we have:
df6.isnull().sum(axis=0)

release_id                0
release_group             4
credit_id                 0
release_area              0
release_area_name         0
release_code_type    241017
release_year              0
artist_id               151
artist_mbid             151
artist_name_x           155
origin_name          392302
origin_code          394410
dtype: int64

In [58]:
df6['release_id'].nunique()

1499614

In [59]:
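#And how many values are equal to "Worldwide" or "Europe"?
#(This cell and the ones below appear to have been run while the column was still named 'origin_area'; with the renamed dataframe above, use 'origin_name'.)
df6.loc[df6['origin_area'] == '[Worldwide]'].count()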

release_id           1981
release_group        1981
credit_id            1981
release_area         1981
release_area_name    1981
release_code_type    1679
release_year         1981
artist_id            1981
artist_mbid          1981
artist_name_x        1981
artist_area1         1981
artist_area_name1    1981
artist_code_type1       0
artist_area2          199
artist_area_name2     199
artist_code_type2     163
origin_area          1981
origin_area_code        0
dtype: int64

In [60]:
df6.loc[df6['origin_area'] == 'Europe'].count()

release_id           376
release_group        376
credit_id            376
release_area         376
release_area_name    376
release_code_type    284
release_year         376
artist_id            376
artist_mbid          376
artist_name_x        376
artist_area1         376
artist_area_name1    376
artist_code_type1     13
artist_area2         115
artist_area_name2    115
artist_code_type2     77
origin_area          376
origin_area_code       0
dtype: int64

So we have 392,302 rows with no origin name at all (see the null counts above), plus 1,981 rows with the value "[Worldwide]" and 376 with the value "Europe".

Our objective now is to try to find information about those artists in other sources:

In [79]:
#We create a new dataframe for the rows that have missing or vague values:
a = df6.loc[df6['origin_area'].isnull()]
b = df6.loc[df6['origin_area'] == 'Europe']
c = df6.loc[df6['origin_area'] == '[Worldwide]'] 
unknown_area = pd.concat([a, b, c ], ignore_index=True)
unknown_area.origin_area.value_counts()

[Worldwide]    1981
Europe          376
Name: origin_area, dtype: int64
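
The same selection can be written as a single boolean mask instead of three filters and a concat. A minimal sketch on a made-up frame (`toy` and the `vague` list are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'artist_id': [1, 2, 3, 4, 4],
                    'origin_area': ['London', None, 'Europe', '[Worldwide]', '[Worldwide]']})

# Values considered too vague to be a useful origin.
vague = ['Europe', '[Worldwide]']

# Missing OR vague; isin() returns False for NaN, hence the explicit isnull().
mask = toy['origin_area'].isnull() | toy['origin_area'].isin(vague)
unknown_toy = toy[mask]

# Distinct artists behind those rows.
n_artists = unknown_toy['artist_id'].nunique()
print(len(unknown_toy), n_artists)
```

Unlike the concat with `ignore_index=True`, the mask keeps the original row order and index labels, which makes it easier to trace rows back to the main dataframe.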

In [80]:
unknown_area.head()

Unnamed: 0,release_id,release_group,credit_id,release_area,release_area_name,release_code_type,release_year,artist_id,artist_mbid,artist_name_x,artist_area1,artist_area_name1,artist_code_type1,artist_area2,artist_area_name2,artist_code_type2,origin_area,origin_area_code
0,1846605,,1503027,240.0,[Worldwide],,2015-01-01,1112115.0,7b52c77b-1a34-439d-a285-3a7c69cb5b1a,Ben Bennett,,,,,,,,
1,1714060,Beaux Soirs De Paris,1324142,73.0,France,1.0,1995-01-01,1122795.0,71b8451c-c10a-400e-9544-101f34ab2522,Soixante Étages,,,,,,,,
2,1895266,M2Music HitDisc Vol. 1,1,222.0,United States,1.0,2006-01-01,1.0,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,,,,,,,,
3,1494610,!,1367808,107.0,Japan,1.0,2006-01-01,1154943.0,2b0e7ee2-a1d0-45d9-9291-2d269bea9160,三田村管打団?,,,,,,,,
4,1247979,!,874079,53.0,Croatia,1.0,2009-01-01,834659.0,9d02b2a1-c9a7-46aa-8674-adf38c44d81a,Gatuzo,53.0,Croatia,1.0,,,,,


In [82]:
#How many unique artists are there with no area info?
unknown_area['artist_id'].nunique()

265089

So, according to the output above, we have 265,089 artists with an unknown or vague origin. We will try to find more information about them.

In [83]:
unknown_area.to_csv('unknown_area.csv')

In [63]:
z = pd.DataFrame({'Last_Name': ['Smith', None, 'Brown'], 
                   'First_Name': ['John', 'Mike', 'Bill'],
                   'Age': [35, 45, None]})
z.head()

Unnamed: 0,Last_Name,First_Name,Age
0,Smith,John,35.0
1,,Mike,45.0
2,Brown,Bill,


In [64]:
x = z[z.Age.notnull()]
x.head()

Unnamed: 0,Last_Name,First_Name,Age
0,Smith,John,35.0
1,,Mike,45.0


In [71]:
type(x[x['First_Name']=='John'])

pandas.core.frame.DataFrame

In [77]:
z[z.Last_Name.notnull()].values

array([['Smith', 'John', 35.0],
       ['Brown', 'Bill', nan]], dtype=object)

In [66]:
#Combine boolean masks (not filtered dataframes) with &; each comparison needs its own parentheses:
x = x.copy()
x['origin_area'] = np.where((x['First_Name']=='John') & x['Last_Name'].notnull(), x['Last_Name'], x['First_Name'])
x.head()