# <font color=red>WIKIPEDIA ARTIST INFORMATION RETRIEVAL</font>

This is an auxiliary notebook in which we will use the Wikipedia API to retrieve information about the artists whose origin and/or genre we weren't able to identify in our main notebooks.

In [1]:
import wikipedia #!pip install wikipedia in console first
import requests
import json
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import time
import pycountry #!pip install pycountry in console
from pygeocoder import Geocoder #If you want to follow the geocoding later, you will need your own Google Maps API key
import tqdm
import re

We load the unknown_artist_origin dataframe that we had left from our notebook "Data_gathering_releases_origin":

In [96]:
%run -i 'Wikipedia_script.py'

In [5]:
#Load data:
data_1 = pd.read_csv('Data_out/unknown_artist_origin_2.csv',\
                     sep='\t', header=0,usecols=[5, 6], encoding='utf-8')
data_1.columns

Index(['artist_id', 'artist_name'], dtype='object')

In [6]:
len(data_1)

483640

To see how many releases each artist has, we create a "count" column:

In [7]:
data_1['count'] = 1
data_1.head()

Unnamed: 0,artist_id,artist_name,count
0,59115.0,Busted,1
1,59115.0,Busted,1
2,59115.0,Busted,1
3,59115.0,Busted,1
4,118094.0,Genie Nilsson and Troy Nilsson,1


In [9]:
#We can now group by artist_id and sum the counts:
count = data_1[['artist_id', 'count']].groupby(by='artist_id').sum()
count.reset_index(drop=False, inplace=True)
count.sort_values(by='count', ascending=False, inplace=True)
count.head()

Unnamed: 0,artist_id,count
82390,559517.0,267
2701,33800.0,260
82926,562672.0,249
11779,102893.0,244
73594,505638.0,233
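
As an alternative sketch, the same per-artist release count can be obtained without the helper "count" column by using `groupby(...).size()`. The small dataframe below is made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy data standing in for data_1 (illustrative values, not the real file).
toy = pd.DataFrame({
    'artist_id': [59115.0, 59115.0, 59115.0, 118094.0],
    'artist_name': ['Busted', 'Busted', 'Busted',
                    'Genie Nilsson and Troy Nilsson'],
})

# size() counts the rows in each group, so no column of ones is needed.
toy_count = (toy.groupby('artist_id')
                .size()
                .reset_index(name='count')
                .sort_values(by='count', ascending=False))
print(toy_count)
```

Either approach should give the same numbers; `size()` simply avoids adding a temporary column to the dataframe.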


In [10]:
#And input the artist names:
artists_sorted = pd.merge(count, data_1[['artist_id','artist_name']], how='left', on='artist_id')
artists_sorted.drop_duplicates(subset='artist_id', keep='first', inplace=True)
artists_sorted.head(20)

Unnamed: 0,artist_id,count,artist_name
0,559517.0,267,The Cherry Blues Project
267,33800.0,260,Duke Ellington & His Orchestra
527,562672.0,249,Vitamin String Quartet
776,102893.0,244,Die drei ???
1020,505638.0,233,Senmuth
1253,41636.0,215,モーニング娘。
1468,118813.0,211,Stefan Wolf
1679,647066.0,200,Glee Cast
1879,1.0,188,Artistes variés
2067,1127509.0,180,Dwelling of Duels


For the foreign names, we need to get a new column from the Musicbrainz table "artist":

In [11]:
names = pd.read_csv('Data_in/Musicbrainz/artist.txt', \
                    sep='\t', header=None, engine='c', usecols=[0,3])
names.columns = ['artist_id','artist_name']
names.head()

Unnamed: 0,artist_id,artist_name
0,805192,WIK▲N
1,371203,"Moutso, Pete"
2,273232,Zachary
3,101060,"Silhouettes, The"
4,145773,"Leavitt, Aric"


In [12]:
#Bring that information into our previous dataframe:
artists_names = pd.merge(artists_sorted, names, how='left', on='artist_id')
artists_names.head(20)

Unnamed: 0,artist_id,count,artist_name_x,artist_name_y
0,559517.0,267,The Cherry Blues Project,"Cherry Blues Project, The"
1,33800.0,260,Duke Ellington & His Orchestra,"Ellington, Duke & His Orchestra"
2,562672.0,249,Vitamin String Quartet,Vitamin String Quartet
3,102893.0,244,Die drei ???,"drei ???, Die"
4,505638.0,233,Senmuth,Senmuth
5,41636.0,215,モーニング娘。,Morning Musume.
6,118813.0,211,Stefan Wolf,"Wolf, Stefan"
7,647066.0,200,Glee Cast,Glee Cast
8,1.0,188,Artistes variés,Various Artists
9,1127509.0,180,Dwelling of Duels,Dwelling of Duels


In [14]:
#We remove the punctuation:
artists_names['artist_name'] = artists_names['artist_name_y'].apply(lambda x: \
                                                                     re.sub(r"[^\w ]", " ", str(x), 0, re.MULTILINE))
artists_names.head()

Unnamed: 0,artist_id,count,artist_name_x,artist_name_y,artist_name
0,559517.0,267,The Cherry Blues Project,"Cherry Blues Project, The",Cherry Blues Project The
1,33800.0,260,Duke Ellington & His Orchestra,"Ellington, Duke & His Orchestra",Ellington Duke His Orchestra
2,562672.0,249,Vitamin String Quartet,Vitamin String Quartet,Vitamin String Quartet
3,102893.0,244,Die drei ???,"drei ???, Die",drei Die
4,505638.0,233,Senmuth,Senmuth,Senmuth


In [15]:
#And we reverse the order of "artist_name":
artists_names['name_formatted'] = artists_names['artist_name'].apply(lambda x: reverse(str(x)))
artists_names.head()

Unnamed: 0,artist_id,count,artist_name_x,artist_name_y,artist_name,name_formatted
0,559517.0,267,The Cherry Blues Project,"Cherry Blues Project, The",Cherry Blues Project The,Cherry Blues Project The
1,33800.0,260,Duke Ellington & His Orchestra,"Ellington, Duke & His Orchestra",Ellington Duke His Orchestra,Ellington Duke His Orchestra
2,562672.0,249,Vitamin String Quartet,Vitamin String Quartet,Vitamin String Quartet,Vitamin String Quartet
3,102893.0,244,Die drei ???,"drei ???, Die",drei Die,drei Die
4,505638.0,233,Senmuth,Senmuth,Senmuth,Senmuth
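
The `reverse` helper used above is defined in Wikipedia_script.py, which is not shown in this notebook. As a rough sketch of the idea, and assuming the sort names follow a "Last, First" pattern, a hypothetical helper (`sort_name_to_display` is our own name, not the script's) could look like this:

```python
def sort_name_to_display(sort_name):
    """Turn a 'Last, First' style sort name into 'First Last' order.

    Hypothetical helper: the notebook's own reverse() lives in
    Wikipedia_script.py and may behave differently.
    """
    head, sep, tail = sort_name.partition(',')
    if not sep:
        # No comma: the name is already in display order.
        return sort_name.strip()
    return f"{tail.strip()} {head.strip()}"

print(sort_name_to_display('Cherry Blues Project, The'))
print(sort_name_to_display('Wolf, Stefan'))
print(sort_name_to_display('Senmuth'))
```

A comma-based approach like this one only works if it runs before the punctuation is removed, since it relies on the comma to find the split point.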


In [16]:
artists_names.drop(labels=['artist_name_x', 'artist_name_y','artist_name'], axis=1, inplace=True)

In [17]:
len(artists_names)

214130

In the following two cells, I split the artists_names dataframe into chunks and retrieved the metadata for each chunk.

Please note that each chunk takes between 14 and 25 minutes to run, so completing all the chunks took more than 50 hours.

The completed chunks are included in the repo, so the information is available at all times.
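
The splitting itself happens inside `splitdf_1st`, which is defined in Wikipedia_script.py. As a minimal sketch of how such a split can be computed, the hypothetical helper below returns the row ranges for consecutive chunks. The chunk size of 1000 and the total of 2500 rows are illustrative assumptions, not values taken from the script.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, stop) row positions for consecutive chunks."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]

bounds = chunk_bounds(2500, 1000)
print(bounds)  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Each pair could then be used as `artists_names.iloc[start:stop]` and saved to its own file, so that an interrupted run only needs to repeat the chunk it was working on.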

In [16]:
#splitdf_1st(artists_names)

In [17]:
#retrieve_metadata_first_round(0,297)

### Importing all the retrieved information

In [22]:
concat_chunks_first_round(0,297)

100%|██████████| 296/296 [00:04<00:00, 65.99it/s] 


In [23]:
df = pd.read_csv('Data_out/Wikipedia_chunks_all_first_round.csv', sep='\t', header=0, encoding='utf-8')
df.head()

Unnamed: 0,artist_id,count,name_formatted,birth_place,genre
0,88814.0,298,Arthur Francis Collins,,REDIRECT [[Arthur Collins (singer)
1,559517.0,267,The Cherry Blues Project,,
2,33800.0,260,Duke & His Orchestra Ellington,,
3,562672.0,249,Vitamin String Quartet,"[[Los Angeles, California]], United States",Rock music|Rock
4,102893.0,244,Die drei ???,,


In [24]:
len(df)

296177

In [25]:
df.dtypes

artist_id         float64
count               int64
name_formatted     object
birth_place        object
genre              object
dtype: object

### 1) Geographical data

The artists for which we didn't have the origin are in data_1. We will retrieve only the origin for the artist_id contained in that dataframe:

In [26]:
#How many artists are there?
data_1.drop_duplicates(subset='artist_id', keep='first', inplace=True)
len(data_1)

214131

In [27]:
#We keep the artist_id's in a list:
pending_origin = data_1.artist_id.values.tolist()

In [28]:
retrieved_origin = df[df['artist_id'].isin(pending_origin)]
len(retrieved_origin)

214104

In [29]:
retrieved_origin.drop(labels=['count', 'genre'], axis=1, inplace=True)
retrieved_origin.head()

Unnamed: 0,artist_id,name_formatted,birth_place
1,559517.0,The Cherry Blues Project,
2,33800.0,Duke & His Orchestra Ellington,
3,562672.0,Vitamin String Quartet,"[[Los Angeles, California]], United States"
4,102893.0,Die drei ???,
6,505638.0,Senmuth,


In [30]:
retrieved_origin.notnull().sum(axis=0)

artist_id         214104
name_formatted    214103
birth_place        20582
dtype: int64

We have been able to retrieve the birth place for 20,582 artists. Let's see how we can format that information:

In [31]:
retrieved_origin1 = retrieved_origin[retrieved_origin['birth_place'].notnull()].copy()
retrieved_origin1.head()

Unnamed: 0,artist_id,name_formatted,birth_place
3,562672.0,Vitamin String Quartet,"[[Los Angeles, California]], United States"
16,210784.0,The Alfee,"[[Tokyo]], [[Japan]]"
17,618288.0,Enid Blyton,"[[East Dulwich]], [[London]], England"
26,368737.0,Pritam,"[[Kolkata]], [[West Bengal]], [[India]]"
28,9617.0,Léo Ferré,[[Monaco]]


Using the script country_state_functions, we will extract the city, state and country for each of the rows:

In [34]:
%run -i 'country_state_functions.py'

In [35]:
retrieved_origin1['area'] = retrieved_origin1['birth_place'].apply(get_country_state_city_check)
retrieved_origin1.head()

Unnamed: 0,artist_id,name_formatted,birth_place,area
3,562672.0,Vitamin String Quartet,"[[Los Angeles, California]], United States","[Los Angeles, California, [United States]]"
16,210784.0,The Alfee,"[[Tokyo]], [[Japan]]","[Tokyo, , [Japan]]"
17,618288.0,Enid Blyton,"[[East Dulwich]], [[London]], England","[East Dulwich, , [United Kingdom]]"
26,368737.0,Pritam,"[[Kolkata]], [[West Bengal]], [[India]]","[Kolkata, , [India]]"
28,9617.0,Léo Ferré,[[Monaco]],"[Monaco, , [Monaco]]"
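
The function `get_country_state_city_check` comes from country_state_functions.py and is not reproduced here. To illustrate only the first part of the job, stripping the `[[...]]` wiki-link markup and splitting on commas, here is a small sketch. The names `WIKILINK` and `split_birth_place` are our own, and the notebook's function evidently does more (its output maps "England" to "United Kingdom", for example), which this sketch does not attempt.

```python
import re

# Matches [[target]] or [[target|label]]; keeps the label if there is one,
# otherwise the target.
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")

def split_birth_place(raw):
    """Strip wiki-link markup and split a birth_place string on commas."""
    plain = WIKILINK.sub(r"\1", raw)
    return [part.strip() for part in plain.split(',') if part.strip()]

print(split_birth_place('[[Los Angeles, California]], United States'))
print(split_birth_place('[[Kolkata]], [[West Bengal]], [[India]]'))
print(split_birth_place('[[Monaco]]'))
```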


In [36]:
#Reset index before next step:
retrieved_origin1.reset_index(drop=True, inplace=True)

In [37]:
#Split the areas into 3 columns. First, the city:
retrieved_origin1['city'] = [retrieved_origin1['area'][row][0] for row in range(len(retrieved_origin1))]

In [38]:
#Then, the state:
retrieved_origin1['state'] = [retrieved_origin1['area'][row][1] for row in range(len(retrieved_origin1))]

In [39]:
#And lastly, the country:
retrieved_origin1['country'] = [str(retrieved_origin1['area'][row][2]).strip('[]').strip("'")\
                                for row in range(len(retrieved_origin1))]
retrieved_origin1.head()

Unnamed: 0,artist_id,name_formatted,birth_place,area,city,state,country
0,562672.0,Vitamin String Quartet,"[[Los Angeles, California]], United States","[Los Angeles, California, [United States]]",Los Angeles,California,United States
1,210784.0,The Alfee,"[[Tokyo]], [[Japan]]","[Tokyo, , [Japan]]",Tokyo,,Japan
2,618288.0,Enid Blyton,"[[East Dulwich]], [[London]], England","[East Dulwich, , [United Kingdom]]",East Dulwich,,United Kingdom
3,368737.0,Pritam,"[[Kolkata]], [[West Bengal]], [[India]]","[Kolkata, , [India]]",Kolkata,,India
4,9617.0,Léo Ferré,[[Monaco]],"[Monaco, , [Monaco]]",Monaco,,Monaco


We will now try to match the retrieved areas with the geocoded areas that we defined in our first notebook, so that we can get their coordinates:

In [40]:
areas = pd.read_csv('Data_out/subdivisions_all.csv',\
                    sep='\t', header=0, encoding='utf-8')
areas.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude
0,262,,Alaska,United States,64.200841,-149.493673
1,339,,Sachsen-Anhalt,Germany,51.950265,11.692273
2,263,,Alabama,United States,32.318231,-86.902298
3,261,,Maryland,United States,39.045755,-76.641271
4,264,,Arkansas,United States,35.20105,-91.831833


In [41]:
#FillNa with empty string for the next step:
areas['area_name'].fillna(value='', inplace=True)
areas.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude
0,262,,Alaska,United States,64.200841,-149.493673
1,339,,Sachsen-Anhalt,Germany,51.950265,11.692273
2,263,,Alabama,United States,32.318231,-86.902298
3,261,,Maryland,United States,39.045755,-76.641271
4,264,,Arkansas,United States,35.20105,-91.831833


In [42]:
#We create a column that concatenates the area_name , subdivision and the country:
areas['area_match'] = areas['area_name'] + areas['subdivision_name'] + areas['country_name']
areas['area_match'] = areas['area_match'].str.replace(' ', '', regex=False)
areas.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude,area_match
0,262,,Alaska,United States,64.200841,-149.493673,AlaskaUnitedStates
1,339,,Sachsen-Anhalt,Germany,51.950265,11.692273,Sachsen-AnhaltGermany
2,263,,Alabama,United States,32.318231,-86.902298,AlabamaUnitedStates
3,261,,Maryland,United States,39.045755,-76.641271,MarylandUnitedStates
4,264,,Arkansas,United States,35.20105,-91.831833,ArkansasUnitedStates


In [43]:
#And another one that concatenates the area name and the country (this will help to match
#the rows that don't have a subdivision in our retrieved_origin1 dataframe):
areas['area_match2'] = areas['area_name'] + areas['country_name']
areas['area_match2'] = areas['area_match2'].str.replace(' ', '', regex=False)
areas.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude,area_match,area_match2
0,262,,Alaska,United States,64.200841,-149.493673,AlaskaUnitedStates,UnitedStates
1,339,,Sachsen-Anhalt,Germany,51.950265,11.692273,Sachsen-AnhaltGermany,Germany
2,263,,Alabama,United States,32.318231,-86.902298,AlabamaUnitedStates,UnitedStates
3,261,,Maryland,United States,39.045755,-76.641271,MarylandUnitedStates,UnitedStates
4,264,,Arkansas,United States,35.20105,-91.831833,ArkansasUnitedStates,UnitedStates


We do the same in our retrieved_origin1 dataframe:

In [44]:
#FillNa with empty string for the next step:
retrieved_origin1['city'].fillna(value='', inplace=True)
retrieved_origin1['state'].fillna(value='', inplace=True)
retrieved_origin1.head()

Unnamed: 0,artist_id,name_formatted,birth_place,area,city,state,country
0,562672.0,Vitamin String Quartet,"[[Los Angeles, California]], United States","[Los Angeles, California, [United States]]",Los Angeles,California,United States
1,210784.0,The Alfee,"[[Tokyo]], [[Japan]]","[Tokyo, , [Japan]]",Tokyo,,Japan
2,618288.0,Enid Blyton,"[[East Dulwich]], [[London]], England","[East Dulwich, , [United Kingdom]]",East Dulwich,,United Kingdom
3,368737.0,Pritam,"[[Kolkata]], [[West Bengal]], [[India]]","[Kolkata, , [India]]",Kolkata,,India
4,9617.0,Léo Ferré,[[Monaco]],"[Monaco, , [Monaco]]",Monaco,,Monaco


In [45]:
#We create a column that concatenates the city , state and the country:
retrieved_origin1['area_match'] = retrieved_origin1['city'] \
                                + retrieved_origin1['state']\
                                + retrieved_origin1['country']
retrieved_origin1['area_match'] = retrieved_origin1['area_match'].str.replace(' ', '', regex=False)
retrieved_origin1.head()

Unnamed: 0,artist_id,name_formatted,birth_place,area,city,state,country,area_match
0,562672.0,Vitamin String Quartet,"[[Los Angeles, California]], United States","[Los Angeles, California, [United States]]",Los Angeles,California,United States,LosAngelesCaliforniaUnitedStates
1,210784.0,The Alfee,"[[Tokyo]], [[Japan]]","[Tokyo, , [Japan]]",Tokyo,,Japan,TokyoJapan
2,618288.0,Enid Blyton,"[[East Dulwich]], [[London]], England","[East Dulwich, , [United Kingdom]]",East Dulwich,,United Kingdom,EastDulwichUnitedKingdom
3,368737.0,Pritam,"[[Kolkata]], [[West Bengal]], [[India]]","[Kolkata, , [India]]",Kolkata,,India,KolkataIndia
4,9617.0,Léo Ferré,[[Monaco]],"[Monaco, , [Monaco]]",Monaco,,Monaco,MonacoMonaco


In [46]:
#We remove the word "Prefecture" to be able to match the Japanese areas:
retrieved_origin1['area_match'] = retrieved_origin1['area_match'].str.replace('Prefecture', '')

In [47]:
#And another one that concatenates the subdivision and the country (this will be to match some US states):
retrieved_origin1['area_match3'] = retrieved_origin1['state'] \
                                + retrieved_origin1['country']
retrieved_origin1['area_match3'] = retrieved_origin1['area_match3'].str.replace(' ', '', regex=False)
retrieved_origin1.head()

Unnamed: 0,artist_id,name_formatted,birth_place,area,city,state,country,area_match,area_match3
0,562672.0,Vitamin String Quartet,"[[Los Angeles, California]], United States","[Los Angeles, California, [United States]]",Los Angeles,California,United States,LosAngelesCaliforniaUnitedStates,CaliforniaUnitedStates
1,210784.0,The Alfee,"[[Tokyo]], [[Japan]]","[Tokyo, , [Japan]]",Tokyo,,Japan,TokyoJapan,Japan
2,618288.0,Enid Blyton,"[[East Dulwich]], [[London]], England","[East Dulwich, , [United Kingdom]]",East Dulwich,,United Kingdom,EastDulwichUnitedKingdom,UnitedKingdom
3,368737.0,Pritam,"[[Kolkata]], [[West Bengal]], [[India]]","[Kolkata, , [India]]",Kolkata,,India,KolkataIndia,India
4,9617.0,Léo Ferré,[[Monaco]],"[Monaco, , [Monaco]]",Monaco,,Monaco,MonacoMonaco,Monaco


In [48]:
len(retrieved_origin1)

20582

In [49]:
#We can now match with our areas dataframe: first, for the US states with area_match3
a = retrieved_origin1[['artist_id', 'area_match', 'area_match3']].copy()
retrieved_coords1 = pd.merge(a, areas, how='left', left_on='area_match3', right_on='area_match')
retrieved_coords1.head()

Unnamed: 0,artist_id,area_match_x,area_match3,area_id,area_name,subdivision_name,country_name,latitude,longitude,area_match_y,area_match2
0,562672.0,LosAngelesCaliforniaUnitedStates,CaliforniaUnitedStates,266.0,,California,United States,36.778261,-119.417932,CaliforniaUnitedStates,UnitedStates
1,210784.0,TokyoJapan,Japan,,,,,,,,
2,618288.0,EastDulwichUnitedKingdom,UnitedKingdom,,,,,,,,
3,368737.0,KolkataIndia,India,,,,,,,,
4,9617.0,MonacoMonaco,Monaco,,,,,,,,


In [50]:
first_match = retrieved_coords1[retrieved_coords1['latitude'].notnull()]
to_drop = ['area_match_x', 'area_match_y','area_match2', 'area_match3']
first_match.drop(labels=to_drop, axis=1, inplace=True)

In [51]:
len(first_match)

4907

In [52]:
#Preparing the pending areas to match for the second round:
b = retrieved_coords1[retrieved_coords1['latitude'].isnull()]
pending_match = b[['artist_id', 'area_match_x', 'area_match3']].copy()

In [53]:
#We can now match with our areas dataframe on area_match2:
retrieved_coords2 = pd.merge(pending_match, areas, how='left', \
                             left_on='area_match_x', right_on='area_match2')
retrieved_coords2.head()

Unnamed: 0,artist_id,area_match_x,area_match3,area_id,area_name,subdivision_name,country_name,latitude,longitude,area_match,area_match2
0,210784.0,TokyoJapan,Japan,,,,,,,,
1,618288.0,EastDulwichUnitedKingdom,UnitedKingdom,80655.0,East Dulwich,England,United Kingdom,52.355518,-1.17432,EastDulwichEnglandUnitedKingdom,EastDulwichUnitedKingdom
2,368737.0,KolkataIndia,India,5090.0,Kolkata,West Bengal,India,22.986757,87.854975,KolkataWestBengalIndia,KolkataIndia
3,9617.0,MonacoMonaco,Monaco,,,,,,,,
4,60474.0,ParisFrance,France,4434.0,Paris,Île-de-France,France,48.84992,2.637041,ParisÎle-de-FranceFrance,ParisFrance


In [54]:
second_match = retrieved_coords2[retrieved_coords2['latitude'].notnull()]
to_drop = ['area_match_x','area_match', 'area_match2', 'area_match3']
second_match.drop(labels=to_drop, axis=1, inplace=True)

In [55]:
len(second_match)

7238

In [56]:
c = retrieved_coords2[retrieved_coords2['latitude'].isnull()]
pending_match2 = c[['artist_id', 'area_match_x']].copy()
pending_match2.head()

Unnamed: 0,artist_id,area_match_x
0,210784.0,TokyoJapan
3,9617.0,MonacoMonaco
6,10238.0,
8,187926.0,
10,813033.0,


In [57]:
#We can now match with our areas dataframe on area_match_x:
retrieved_coords3 = pd.merge(pending_match2, areas, how='left', left_on='area_match_x', right_on='area_match')
retrieved_coords3.head()

Unnamed: 0,artist_id,area_match_x,area_id,area_name,subdivision_name,country_name,latitude,longitude,area_match,area_match2
0,210784.0,TokyoJapan,397.0,,Tokyo,Japan,35.676192,139.650311,TokyoJapan,Japan
1,9617.0,MonacoMonaco,,,,,,,,
2,10238.0,,,,,,,,,
3,187926.0,,,,,,,,,
4,813033.0,,,,,,,,,


In [58]:
retrieved_coords3.isnull().sum(axis=0)

artist_id               0
area_match_x            0
area_id             10082
area_name           10082
subdivision_name    10082
country_name        10082
latitude            10082
longitude           10082
area_match          10082
area_match2         10082
dtype: int64

In [59]:
third_match = retrieved_coords3[retrieved_coords3['latitude'].notnull()]
to_drop = ['area_match', 'area_match2']
third_match.drop(labels=to_drop, axis=1, inplace=True)

In [60]:
len(third_match)

607
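
The three matching rounds above follow one pattern: left-join the pending rows on a key, keep the rows that received coordinates, and pass the rest on to the next key. The sketch below expresses that pattern as a loop. The helper `match_round`, the column names `city_key` and `state_key`, and the four-row toy data are assumptions made for illustration; the coordinates are copied from the outputs shown above.

```python
import pandas as pd

def match_round(pending, areas, left_key, right_key):
    """Left-join pending rows to areas; return (matched, still_pending)."""
    merged = pd.merge(pending, areas[[right_key, 'latitude', 'longitude']],
                      how='left', left_on=left_key, right_on=right_key)
    found = merged['latitude'].notnull()
    matched = merged.loc[found, ['artist_id', 'latitude', 'longitude']]
    still_pending = merged.loc[~found, list(pending.columns)]
    return matched, still_pending

# Toy versions of the areas and pending dataframes.
areas = pd.DataFrame({
    'area_match': ['CaliforniaUnitedStates', 'KolkataWestBengalIndia', 'TokyoJapan'],
    'area_match2': ['UnitedStates', 'KolkataIndia', 'Japan'],
    'latitude': [36.778261, 22.986757, 35.676192],
    'longitude': [-119.417932, 87.854975, 139.650311],
})
pending = pd.DataFrame({
    'artist_id': [562672, 368737, 210784, 9617],
    'city_key': ['LosAngelesCaliforniaUnitedStates', 'KolkataIndia',
                 'TokyoJapan', 'MonacoMonaco'],
    'state_key': ['CaliforniaUnitedStates', 'India', 'Japan', 'Monaco'],
})

# Same order as in the notebook: state+country first, then city+country,
# then the full key against area_match.
rounds = [('state_key', 'area_match'),
          ('city_key', 'area_match2'),
          ('city_key', 'area_match')]

matches = []
for left_key, right_key in rounds:
    matched, pending = match_round(pending, areas, left_key, right_key)
    matches.append(matched)

all_matched = pd.concat(matches, ignore_index=True)
print(all_matched)
print(pending)
```

One caveat of any left-join approach is that duplicated keys in the areas table would multiply rows, which is presumably why the notebook later drops duplicate artist_id values.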

In [61]:
#What are the areas pending?
pending_areas = retrieved_coords3[retrieved_coords3['latitude'].isnull()][['artist_id', 'area_match_x']].copy()
pending_areas.head()

Unnamed: 0,artist_id,area_match_x
1,9617.0,MonacoMonaco
2,10238.0,
3,187926.0,
4,813033.0,
5,265728.0,


In [62]:
#We define a function to insert a space before each capital letter:
def spacer(text):
    return re.sub(r'([A-Z])', r' \1', text)

In [63]:
#We split the elements in area_match_x:
pending_areas.reset_index(drop=True, inplace=True)
pending_areas['list'] = [spacer(pending_areas['area_match_x'][row]).split() for row in range(len(pending_areas))]
pending_areas.head()

Unnamed: 0,artist_id,area_match_x,list
0,9617.0,MonacoMonaco,"[Monaco, Monaco]"
1,10238.0,,[]
2,187926.0,,[]
3,813033.0,,[]
4,265728.0,,[]


In [64]:
#Remove empty rows:
for i in tqdm.tqdm(range(len(pending_areas))):
    if len(pending_areas['list'][i]) == 0:
        pending_areas.drop(index=i, inplace=True)

100%|██████████| 10082/10082 [00:04<00:00, 2318.70it/s]


In [65]:
#Split the list in 2 columns:
pending_areas.reset_index(drop=True, inplace=True)
pending_areas['first'] = [pending_areas['list'][row][0]for row in range(len(pending_areas))]
pending_areas['country'] = [pending_areas['list'][row][-1]for row in range(len(pending_areas))]
pending_areas.head()

Unnamed: 0,artist_id,area_match_x,list,first,country
0,9617.0,MonacoMonaco,"[Monaco, Monaco]",Monaco,Monaco
1,89675.0,JapanJapan,"[Japan, Japan]",Japan,Japan
2,492317.0,JapanJapan,"[Japan, Japan]",Japan,Japan
3,192540.0,PragueCzechia,"[Prague, Czechia]",Prague,Czechia
4,40043.0,SriGanganagarIndia,"[Sri, Ganganagar, India]",Sri,India


For those rows, it looks like we only got the country (not enough detail for our visualization) or, in some cases, an area that hasn't been identified.

Let's remove the rows for which we have only a country and try to retrieve the coordinates of the rest:

In [66]:
#Remove rows for which first=country
for i in tqdm.tqdm(range(len(pending_areas))):
    if pending_areas['first'][i] == pending_areas['country'][i]:
        pending_areas.drop(index=i, inplace=True)

100%|██████████| 3889/3889 [00:00<00:00, 5905.60it/s]


In [67]:
#Drop the rows that have United and Kingdom:
pending_areas.reset_index(drop=True, inplace=True)
for i in tqdm.tqdm(range(len(pending_areas))):
    if pending_areas['first'][i] == 'United' and pending_areas['country'][i] == 'Kingdom':
        pending_areas.drop(index=i, inplace=True)

100%|██████████| 2931/2931 [00:00<00:00, 37905.25it/s]


In [68]:
#How many do we have to identify?
len(pending_areas)

2839
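
The row-by-row loops above (empty lists, country-only rows, and "United ... Kingdom" rows) can also be written as boolean masks, which avoids dropping rows while iterating. This is a sketch on made-up rows that mirror the columns used above; the artist_id 0.0 is a placeholder, not a real record.

```python
import pandas as pd

toy = pd.DataFrame({
    'artist_id': [9617.0, 10238.0, 192540.0, 0.0],
    'list': [['Monaco', 'Monaco'], [], ['Prague', 'Czechia'],
             ['United', 'Kingdom']],
})

# Keep only the rows whose token list is not empty.
toy = toy[toy['list'].str.len() > 0].copy()

# First and last token of each list.
toy['first'] = toy['list'].str[0]
toy['country'] = toy['list'].str[-1]

# Rows that only name a country, and rows that are just "United Kingdom".
only_country = toy['first'] == toy['country']
uk_only = (toy['first'] == 'United') & (toy['country'] == 'Kingdom')

kept = toy[~only_country & ~uk_only].reset_index(drop=True)
print(kept)
```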

In [69]:
pending_areas.head()

Unnamed: 0,artist_id,area_match_x,list,first,country
0,192540.0,PragueCzechia,"[Prague, Czechia]",Prague,Czechia
1,40043.0,SriGanganagarIndia,"[Sri, Ganganagar, India]",Sri,India
2,116590.0,ThebesGreece,"[Thebes, Greece]",Thebes,Greece
3,220297.0,DoumpiaGreece,"[Doumpia, Greece]",Doumpia,Greece
4,39678.0,PalaiokastritsaGreece,"[Palaiokastritsa, Greece]",Palaiokastritsa,Greece


In [70]:
#Create a column with the area_match name well formatted:
pending_areas.reset_index(drop=True, inplace=True)
pending_areas['area_name'] = [spacer(pending_areas['area_match_x'][row]) for row in range(len(pending_areas))]
pending_areas.drop(labels=['area_match_x', 'list', 'first'], axis=1, inplace=True)
pending_areas.head()

Unnamed: 0,artist_id,country,area_name
0,192540.0,Czechia,Prague Czechia
1,40043.0,India,Sri Ganganagar India
2,116590.0,Greece,Thebes Greece
3,220297.0,Greece,Doumpia Greece
4,39678.0,Greece,Palaiokastritsa Greece


And now we use our geocoding tool again:

If you want to reproduce the geocoding, please run the following commands (note that the resulting file is provided):

API_key = "YOUR_API_KEY"

to_search = pending_areas['area_name'].values.tolist()

coordinates = []

start = time.time()

for i in to_search:
    
    try:
        result = Geocoder(api_key=API_key).geocode(i).coordinates
        
        coordinates.append(result)
        
    except Exception:
        result = np.nan
        
        coordinates.append(result)
        
    
pending_areas['coordinates'] = coordinates

pending_areas.to_csv('Google_API/pending_areas.csv',index=None, sep="\t")

end = time.time()

print((end-start)/60)

pending_areas.head()

#### Note: the above loop took 24 minutes to run on my computer.
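
Since each lookup is a paid, rate-limited API call, one possible refinement is to geocode each distinct area name only once. The sketch below shows that idea with a stand-in geocoder that makes no network calls. `geocode_all` and `fake_geocoder` are hypothetical names introduced here for illustration.

```python
def geocode_all(names, geocode_fn):
    """Geocode each name once, reusing results for repeated names.

    geocode_fn is any callable that takes a place name and returns a
    (latitude, longitude) tuple, or raises an exception on failure.
    """
    cache = {}
    results = []
    for name in names:
        if name not in cache:
            try:
                cache[name] = geocode_fn(name)
            except Exception:
                # Keep going; a failed lookup is recorded as None.
                cache[name] = None
        results.append(cache[name])
    return results

# Stand-in geocoder for demonstration only.
calls = []
def fake_geocoder(name):
    calls.append(name)
    table = {'Prague Czechia': (50.0755381, 14.4378005)}
    return table[name]  # raises KeyError for unknown places

coords = geocode_all(['Prague Czechia', 'Nowhere', 'Prague Czechia'],
                     fake_geocoder)
print(coords)
print(len(calls))
```

With pygeocoder, the callable could wrap the same call used in the loop above, that is, `Geocoder(api_key=API_key).geocode(name).coordinates`.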

In [71]:
pending_areas = pd.read_csv('Data_in/Google_API/pending_areas.csv', header=0, sep="\t")
pending_areas.head()

Unnamed: 0,artist_id,country,area_name,coordinates
0,192540.0,Czechia,Prague Czechia,"(50.0755381, 14.4378005)"
1,40043.0,India,Sri Ganganagar India,"(29.9038399, 73.87719009999999)"
2,116590.0,Greece,Thebes Greece,"(38.322579, 23.3204309)"
3,220297.0,Greece,Doumpia Greece,"(40.5123246, 23.3490202)"
4,39678.0,Greece,Palaiokastritsa Greece,"(39.6751982, 19.7081324)"


In [72]:
#First, we drop the rows for which we didn't retrieve any coordinate:
pending_areas.dropna(subset=['coordinates'], axis=0, inplace=True)

In [73]:
#We split latitude and longitude:
coords_df = pd.DataFrame(pending_areas['coordinates'].str.strip('()').str.split(',').values.tolist())

In [74]:
coords_df.head()

Unnamed: 0,0,1
0,50.0755381,14.4378005
1,29.9038399,73.87719009999999
2,38.322579,23.3204309
3,40.5123246,23.3490202
4,39.6751982,19.7081324
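
Note that splitting the string on "," leaves both parts as strings, and the second part keeps the space that followed the comma. As a sketch of an alternative, the coordinate string can be parsed directly into two floats with `ast.literal_eval`; `parse_coordinates` is a hypothetical helper name.

```python
import ast

def parse_coordinates(text):
    """Parse a string like '(50.07, 14.43)' into a (lat, lon) pair of floats."""
    lat, lon = ast.literal_eval(text)
    return float(lat), float(lon)

lat, lon = parse_coordinates('(29.9038399, 73.87719009999999)')
print(lat, lon)
```

Applied to the dataframe, this could be used as `pending_areas['coordinates'].apply(parse_coordinates)`, assuming every remaining row holds a well-formed pair.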


In [75]:
#Reset the index so that both dataframes align row by row (coords_df has a fresh 0..n-1 index):
result = pd.concat([pending_areas.reset_index(drop=True), coords_df], axis=1)
result.head()

Unnamed: 0,artist_id,country,area_name,coordinates,0,1
0,192540.0,Czechia,Prague Czechia,"(50.0755381, 14.4378005)",50.0755381,14.4378005
1,40043.0,India,Sri Ganganagar India,"(29.9038399, 73.87719009999999)",29.9038399,73.87719009999999
2,116590.0,Greece,Thebes Greece,"(38.322579, 23.3204309)",38.322579,23.3204309
3,220297.0,Greece,Doumpia Greece,"(40.5123246, 23.3490202)",40.5123246,23.3490202
4,39678.0,Greece,Palaiokastritsa Greece,"(39.6751982, 19.7081324)",39.6751982,19.7081324


In [76]:
#Drop unnecessary column:
result.drop(labels=['coordinates'], axis=1, inplace=True)
#Change column name:
result.rename(columns={'country':'country_name',0:'latitude', 1:'longitude'}, inplace=True)
#Adding empty columns so that the format is the same as in our matched dataframes:
result['area_id'] = np.nan
result['subdivision_name'] = np.nan

In [77]:
result.head()

Unnamed: 0,artist_id,country_name,area_name,latitude,longitude,area_id,subdivision_name
0,192540.0,Czechia,Prague Czechia,50.0755381,14.4378005,,
1,40043.0,India,Sri Ganganagar India,29.9038399,73.87719009999999,,
2,116590.0,Greece,Thebes Greece,38.322579,23.3204309,,
3,220297.0,Greece,Doumpia Greece,40.5123246,23.3490202,,
4,39678.0,Greece,Palaiokastritsa Greece,39.6751982,19.7081324,,


In [78]:
#Merging our 4 dataframes:
all_retrieved = pd.concat([first_match, second_match, third_match, result], ignore_index=True)
all_retrieved.head()

Unnamed: 0,area_id,area_match_x,area_name,artist_id,country_name,latitude,longitude,subdivision_name
0,266.0,,,562672.0,United States,36.7783,-119.418,California
1,276.0,,,119635.0,United States,40.6331,-89.3985,Illinois
2,295.0,,,523680.0,United States,40.7128,-74.006,New York
3,292.0,,,1037860.0,United States,40.0583,-74.4057,New Jersey
4,266.0,,,11108.0,United States,36.7783,-119.418,California


In [79]:
#Drop duplicate artist_id:
all_retrieved.drop_duplicates(subset='artist_id', keep='first', inplace=True)
#And drop any row without coordinates:
all_retrieved.dropna(subset=['latitude'], axis=0, inplace=True)
#Drop unnecessary column:
all_retrieved.drop(labels=['area_match_x'], axis=1, inplace=True)

Now that we have retrieved as many coordinates as possible for our pending artists, we can put them all in a single file and use it as input for the last step of the notebook "Data_gathering_releases_origin".

In [80]:
#For how many artists did we retrieve the origin with Wikipedia?
len(all_retrieved)

12608

In [81]:
#We can now export the dataframe and input it into our Data_gathering_releases_origin notebook as final step:
all_retrieved.to_csv('Data_out/Wikipedia_retrieved_origins.csv', sep='\t', index=False, encoding='utf-8')

### 2) Music genres

In this last step, we take the pending information from the notebook "Data_gathering_music_genre" and we try to retrieve it using our Wikipedia tool.

In [82]:
genre_pending = pd.read_csv('Data_out/data_pending_2.csv', sep='\t', header=0,usecols=[4,7], encoding='utf-8')
genre_pending.head()

Unnamed: 0,artist_id,artist_name
0,26.0,Meg Lee Chin
1,68.0,Sarge
2,68.0,Sarge
3,332.0,Anton Karas
4,1313.0,Queen Ida


In the first round, we already retrieved the genre for all the artists for which we didn't have the origin either. Let's see how many we searched:

In [83]:
df.head()

Unnamed: 0,artist_id,count,name_formatted,birth_place,genre
0,88814.0,298,Arthur Francis Collins,,REDIRECT [[Arthur Collins (singer)
1,559517.0,267,The Cherry Blues Project,,
2,33800.0,260,Duke & His Orchestra Ellington,,
3,562672.0,249,Vitamin String Quartet,"[[Los Angeles, California]], United States",Rock music|Rock
4,102893.0,244,Die drei ???,,


In [84]:
len(df)

296177

In [85]:
genre_pending.artist_id.nunique()

94265

In [86]:
genre_pending.drop_duplicates(subset='artist_id', keep='first', inplace=True)

In [87]:
#We put the artist_ids on a list:
already_searched = df.artist_id.values.tolist()

What we want now is to exclude the artists in genre_pending whose genre we have already searched for, to avoid repeating work.

In [88]:
pending_artists = genre_pending[~genre_pending['artist_id'].isin(already_searched)]

In [89]:
#How many do we have to search for now?
len(pending_artists)

5101

In [92]:
pending_artists.head()

Unnamed: 0,artist_id,artist_name
22,3461.0,George Shearing
68,104164.0,Lincoln Mayorga
123,4276.0,Georgi Robev
142,494355.0,Ralph Blane
144,4356.0,Georges Delerue


Like we did in the first round, we are now going to search for these extra artists in Wikipedia:

In [97]:
splitdf_2nd(pending_artists) #We will have only 6 chunks this time

In [98]:
retrieve_metadata_second_round(0,6)

  0%|          | 0/1000 [00:00<?, ?it/s]

Starting with chunk 0: 


100%|██████████| 1000/1000 [23:20<00:00,  1.10it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]

Starting with chunk 1: 


100%|██████████| 1000/1000 [18:48<00:00,  1.02it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]

Starting with chunk 2: 


100%|██████████| 1000/1000 [16:18<00:00,  1.07s/it]
  0%|          | 0/1000 [00:00<?, ?it/s]

Starting with chunk 3: 


100%|██████████| 1000/1000 [16:34<00:00,  1.14s/it]
  0%|          | 0/1000 [00:00<?, ?it/s]

Starting with chunk 4: 


100%|██████████| 1000/1000 [16:03<00:00,  1.03it/s]
  0%|          | 0/101 [00:00<?, ?it/s]

Starting with chunk 5: 


100%|██████████| 101/101 [01:31<00:00,  1.38it/s]


### Importing all the retrieved information

In [99]:
#We retrieve all the information:
concat_chunks_second_round(0,6)

100%|██████████| 5/5 [00:00<00:00, 314.34it/s]


In [100]:
df2 = pd.read_csv('Data_out/Wikipedia_chunks_all_second_round.csv', sep='\t', header=0, usecols=[0,1,3], encoding='utf-8')
df2.head()

Unnamed: 0,artist_id,artist_name,genre
0,3461.0,George Shearing,Jazz
1,104164.0,Lincoln Mayorga,Pop music
2,4276.0,Georgi Robev,
3,494355.0,Ralph Blane,Musical theatre
4,4356.0,Georges Delerue,Film score


We now have all the information regarding music genres in our dataframes df and df2. We'll put them together and identify the genres for all of them:

In [101]:
df_copy = df[['artist_id', 'genre']].copy()

In [102]:
identify_genre = pd.concat([df_copy, df2[['artist_id', 'genre']].copy()], ignore_index=True)
identify_genre.head()

Unnamed: 0,artist_id,genre
0,88814.0,REDIRECT [[Arthur Collins (singer)
1,559517.0,
2,33800.0,
3,562672.0,Rock music|Rock
4,102893.0,


In [103]:
#We can directly drop the rows that contain null values:
identify_genre.dropna(subset=['genre'], axis=0, inplace=True)
len(identify_genre)

105267

In [104]:
#What type of information do we have in this genre column?
identify_genre.genre.value_counts()

Jazz

We can see that the most common format when a music genre appears in the column "genre" is "Genre music|Genre", so we can identify the genre by splitting that string on the "|" character:

In [105]:
identify_genre['genre_split'] = identify_genre['genre'].str.split('|')

In [106]:
#We keep the last item of the list in "genre_split":
identify_genre.reset_index(drop=True, inplace=True)
identify_genre['genre'] = identify_genre['genre_split'].str[-1]
identify_genre.head()

Unnamed: 0,artist_id,genre,genre_split
0,88814.0,REDIRECT [[Arthur Collins (singer),[REDIRECT [[Arthur Collins (singer)]
1,562672.0,Rock,"[Rock music, Rock]"
2,1221150.0,''Daniel Alexander''' may refer to:\n\n* [[Dan...,[''Daniel Alexander''' may refer to:\n\n* [[Da...
3,505638.0,REDIRECT [[Senenmut,[REDIRECT [[Senenmut]
4,153755.0,C-pop,[C-pop]

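As a small illustration of the split-and-keep-last step above, the sketch below applies it to a few toy strings. The dataframe `toy` and the column `label` are hypothetical names used only for this example; the values imitate formats seen in the "genre" column.

```python
import pandas as pd

# Toy values imitating formats seen in the "genre" column (illustrative only)
toy = pd.DataFrame({'genre': ['Rock music|Rock', 'C-pop', 'REDIRECT [[Senenmut']})

# Split on the pipe and keep the last piece; strings without a pipe
# (including redirect text) pass through unchanged
toy['label'] = toy['genre'].str.split('|').str[-1]
print(toy['label'].tolist())
```

Note that entries without a pipe are not cleaned by this step, which is why redirect strings still appear in the output above.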

In [107]:
#We can now drop the "genre_split" column:
identify_genre.drop(labels=['genre_split'], axis=1, inplace=True)

Now, we'll use our "Main_genre_list" file to try to identify the music genres appearing in the column "genre":

In [108]:
all_genres = pd.read_csv('Data_in/Main_genre_list.csv', sep='\t', header=0, encoding='utf-8')
all_genres.head()

Unnamed: 0,Main_genre,subgenre
0,Blues,acoustic blues
1,Blues,african blues
2,Blues,blues
3,Blues,blues music
4,Blues,blues rock


In [109]:
#We convert the genre column to lowercase so that we can match it by name against the genre list:
identify_genre['genre'] = identify_genre['genre'].str.lower()
identify_genre.head()

Unnamed: 0,artist_id,genre
0,88814.0,redirect [[arthur collins (singer)
1,562672.0,rock
2,1221150.0,''daniel alexander''' may refer to:\n\n* [[dan...
3,505638.0,redirect [[senenmut
4,153755.0,c-pop

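Since the match is an exact string comparison, stray whitespace could prevent a match just as a difference in case would. The following is a minimal sketch, assuming such whitespace might be present (the notebook does not confirm it); `raw` and `normalized` are hypothetical names.

```python
import pandas as pd

# Hypothetical raw values with mixed case and surrounding whitespace
raw = pd.Series([' Rock ', 'C-pop', 'J-Pop\n'])

# Strip leading/trailing whitespace, then lowercase, before exact matching
normalized = raw.str.strip().str.lower()
print(normalized.tolist())
```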

In [110]:
#And we do the merging:
genres_matched = pd.merge(identify_genre, all_genres, how='left', left_on='genre', right_on='subgenre')
genres_matched.head()

Unnamed: 0,artist_id,genre,Main_genre,subgenre
0,88814.0,redirect [[arthur collins (singer),,
1,562672.0,rock,Rock,rock
2,1221150.0,''daniel alexander''' may refer to:\n\n* [[dan...,,
3,505638.0,redirect [[senenmut,,
4,153755.0,c-pop,Pop,c-pop

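A left merge like the one above can be checked with two optional `pd.merge` parameters. The sketch below uses hypothetical miniature tables (`left`, `right`), not the notebook's data: `validate='many_to_one'` raises an error if a subgenre appears more than once in the lookup table, and `indicator=True` adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Hypothetical miniature versions of identify_genre and all_genres
left = pd.DataFrame({'artist_id': [1.0, 2.0, 3.0],
                     'genre': ['rock', 'film score', 'c-pop']})
right = pd.DataFrame({'Main_genre': ['Rock', 'Pop'],
                      'subgenre': ['rock', 'c-pop']})

# validate='many_to_one' raises MergeError if a subgenre is listed twice in
# the lookup table, which would otherwise silently duplicate artist rows
checked = pd.merge(left, right, how='left', left_on='genre',
                   right_on='subgenre', validate='many_to_one',
                   indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for unmatched ones
print(checked['_merge'].value_counts())
```

Whether the real "Main_genre_list" file contains repeated subgenres is not known here; if it does, the validation would fail and the list would need deduplicating first.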

In [111]:
#How many did we identify?
genres_matched.Main_genre.notnull().sum(axis=0)

17692

In [112]:
#We save that into a new dataframe (an explicit copy, so that the in-place drop doesn't act on a slice of genres_matched):
genres_retrieved = genres_matched[genres_matched['Main_genre'].notnull()].copy()
genres_retrieved.drop(labels=['genre'], axis=1, inplace=True)
genres_retrieved.head()

Unnamed: 0,artist_id,Main_genre,subgenre
1,562672.0,Rock,rock
4,153755.0,Pop,c-pop
8,279956.0,Pop,j-pop
10,210784.0,Rock,folk rock
15,35358.0,Pop,pop


In [113]:
#What do we have pending?
genres_matched[genres_matched['Main_genre'].isnull()].head()

Unnamed: 0,artist_id,genre,Main_genre,subgenre
0,88814.0,redirect [[arthur collins (singer),,
2,1221150.0,''daniel alexander''' may refer to:\n\n* [[dan...,,
3,505638.0,redirect [[senenmut,,
5,1338.0,film score,,
6,41636.0,redirect [[morning musume,,

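To get a rough idea of what the unmatched values consist of, one option is to label each by its apparent kind and count the labels. This is only a sketch: the function `describe_unmatched` and its three categories are assumptions made for illustration, and the sample values are abridged from the output above.

```python
import pandas as pd

def describe_unmatched(value):
    """Label an unmatched, lowercased 'genre' string by its apparent kind."""
    if value.startswith('redirect'):
        return 'redirect'
    if 'may refer to' in value:
        return 'disambiguation'
    return 'other'

# Sample values abridged from the output above
pending = pd.Series(["redirect [[arthur collins (singer)",
                     "''daniel alexander''' may refer to:",
                     "redirect [[senenmut",
                     "film score",
                     "redirect [[morning musume"])

kinds = pending.map(describe_unmatched)
print(kinds.value_counts())
```

The "other" bucket would hold values such as "film score", which look like genres but are absent from the genre list, so it may be worth reviewing separately.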

It looks like these expressions, such as redirects and disambiguation text, don't give us much information about the artist's genre. We'll now export the file with the retrieved genres and use it in the last step of the "Data_gathering_music_genre" notebook:

In [114]:
genres_retrieved.to_csv('Data_out/Wikipedia_genres_retrieved.csv', sep='\t', index=False, encoding='utf-8')