# MusicBrainz: types of areas and geocoding

The purpose of this notebook is to retrieve as much detail as possible for each area code that MusicBrainz uses to describe the geographical location of the entities in its database.

For each area code, we want to know both the country and the subdivision it belongs to, so that the plots in the final visualization have enough detail. We will also retrieve each subdivision's coordinates using Pygeocoder and the Google Maps API.

In [1]:
import pandas as pd
import numpy as np
#!pip install pygeocoder
from pygeocoder import Geocoder #If you want to follow the geocoding later, you will
                                #need your own Google Maps API key
import time
import tqdm
import warnings
warnings.filterwarnings('ignore')

In [2]:
areas = pd.read_csv('Data_in/Musicbrainz/area.txt',sep='\t', header=None, engine='python', usecols=[0,2,3])
areas.columns = ['area_id','area_name','code_type']
areas.head()

Unnamed: 0,area_id,area_name,code_type
0,15449,Greccio,4.0
1,38,Canada,1.0
2,43,Chile,1.0
3,44,China,1.0
4,36,Cambodia,1.0


In [3]:
#Let's see the area types we have:
area_types = pd.read_csv('Data_in/Musicbrainz/area_type.txt',sep='\t', header=None, engine='python', usecols=[1,3,4])
area_types.columns = ['type','code_type','definition']
area_types.head(20)

Unnamed: 0,type,code_type,definition
0,Country,1,Country is used for areas included (or previou...
1,Subdivision,2,Subdivision is used for the main administrativ...
2,County,7,County is used for smaller administrative divi...
3,Municipality,4,Municipality is used for small administrative ...
4,City,3,"City is used for settlements of any size, incl..."
5,District,5,District is used for a division of a large cit...
6,Island,6,Island is used for islands and atolls which do...


MusicBrainz provides a table that gives the immediate parent area of any area whose code_type is not 1 (country). We will use that table to get the level of detail we need for each area:

In [4]:
#Parent & child areas:
parent_child = pd.read_csv('Data_in/Musicbrainz/l_area_area.txt',sep='\t', header=None, engine='python', usecols=[2, 3])
parent_child.columns = ['parent_area', 'child_area']
parent_child.head()

Unnamed: 0,parent_area,child_area
0,222,262
1,81,339
2,222,263
3,222,261
4,222,264


In [5]:
parent_child.duplicated(subset='child_area').value_counts()

False    117421
True        154
dtype: int64

In [6]:
parent_child.drop_duplicates(subset='child_area', keep='first', inplace=True)

In [7]:
parent_type1 = pd.merge(parent_child, areas, how='left', left_on='parent_area', right_on='area_id')
parent_type1.drop(labels=['area_id'], axis=1, inplace=True)
parent_type1.rename(columns={'area_name':'parent_name1','code_type':'parent_code_type1'}, inplace=True)
parent_type1.head()

Unnamed: 0,parent_area,child_area,parent_name1,parent_code_type1
0,222,262,United States,1.0
1,81,339,Germany,1.0
2,222,263,United States,1.0
3,222,261,United States,1.0
4,222,264,United States,1.0


In [8]:
parent_child_type1 = pd.merge(parent_type1, areas, how='left', left_on='child_area', right_on='area_id')
parent_child_type1.rename(columns={'area_name':'child_name1','code_type':'child_code_type1'}, inplace=True)
parent_child_type1.drop(labels=['area_id'], axis=1, inplace=True)
parent_child_type1.head()

Unnamed: 0,parent_area,child_area,parent_name1,parent_code_type1,child_name1,child_code_type1
0,222,262,United States,1.0,Alaska,2.0
1,81,339,Germany,1.0,Sachsen-Anhalt,2.0
2,222,263,United States,1.0,Alabama,2.0
3,222,261,United States,1.0,Maryland,2.0
4,222,264,United States,1.0,Arkansas,2.0


In [9]:
#Rearranging the columns for better visibility:
order = ['parent_area','parent_name1', 'parent_code_type1','child_area','child_name1','child_code_type1']
parent_child_type1_ordered = parent_child_type1.reindex(columns=order)
parent_child_type1_ordered.head()

Unnamed: 0,parent_area,parent_name1,parent_code_type1,child_area,child_name1,child_code_type1
0,222,United States,1.0,262,Alaska,2.0
1,81,Germany,1.0,339,Sachsen-Anhalt,2.0
2,222,United States,1.0,263,Alabama,2.0
3,222,United States,1.0,261,Maryland,2.0
4,222,United States,1.0,264,Arkansas,2.0


In [10]:
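#Keeping only the areas whose immediate parent is a country (code_type 1);
#.copy() avoids modifying a slice of parent_child_type1_ordered:
subdivisions = parent_child_type1_ordered[parent_child_type1_ordered['parent_code_type1'] == 1].copy()
subdivisions.rename(columns={'parent_area':'country_area',
                             'parent_name1':'country_name',
                             'child_area':'subdivision_area', 'child_name1':'subdivision_name'}, inplace=True)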
subdivisions.head()

Unnamed: 0,country_area,country_name,parent_code_type1,subdivision_area,subdivision_name,child_code_type1
0,222,United States,1.0,262,Alaska,2.0
1,81,Germany,1.0,339,Sachsen-Anhalt,2.0
2,222,United States,1.0,263,Alabama,2.0
3,222,United States,1.0,261,Maryland,2.0
4,222,United States,1.0,264,Arkansas,2.0


In [11]:
len(subdivisions)

3783

# Retrieving coordinates for each area:

Pygeocoder with Google Maps API key:

In [12]:
subdivisions['search_coords'] = subdivisions['subdivision_name'] + ', ' + subdivisions['country_name']
subdivisions.head()

Unnamed: 0,country_area,country_name,parent_code_type1,subdivision_area,subdivision_name,child_code_type1,search_coords
0,222,United States,1.0,262,Alaska,2.0,"Alaska, United States"
1,81,Germany,1.0,339,Sachsen-Anhalt,2.0,"Sachsen-Anhalt, Germany"
2,222,United States,1.0,263,Alabama,2.0,"Alabama, United States"
3,222,United States,1.0,261,Maryland,2.0,"Maryland, United States"
4,222,United States,1.0,264,Arkansas,2.0,"Arkansas, United States"


To reproduce the geocoding, run the following commands with your own API key (the resulting file is already included in the repo files):

API_key = "YOUR_API_KEY"

to_search = subdivisions['search_coords'].values.tolist()

coordinates = []

#Creating the geocoder once instead of on every iteration:
geocoder = Geocoder(api_key=API_key)

start = time.time()

for i in to_search:

    try:
        result = geocoder.geocode(i).coordinates

    except Exception:
        #Any failed lookup is stored as NaN and dropped later:
        result = np.nan

    coordinates.append(result)

subdivisions['coordinates'] = coordinates

subdivisions.to_csv('Data_in/Google_API/subdivisions.csv', index=False, sep="\t")

end = time.time()

#Elapsed time in minutes:
print((end-start)/60)

subdivisions.head()

#### Note: the above loop took 30 minutes to run on my computer.
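Since the loop is slow, it may help to avoid repeated lookups and to keep going when a single request fails. The following is a minimal sketch of that idea, not the code used to produce the attached file: `geocode_all` and `fake_geocode` are hypothetical names, and `fake_geocode` is a stand-in with no network access, used only so the example runs without an API key.

```python
import time

def geocode_all(queries, geocode, pause=0.0):
    """Geocode each distinct query once; failed lookups map to None.

    `geocode` is any callable that takes a query string and returns a
    (latitude, longitude) tuple, e.g. a thin wrapper around a real geocoder.
    """
    cache = {}
    for query in queries:
        if query in cache:
            continue  # already looked up, skip the repeated call
        try:
            cache[query] = geocode(query)
        except Exception:
            cache[query] = None  # keep going; missing values are handled later
        time.sleep(pause)  # optional pause between requests
    return [cache[q] for q in queries]


# Stand-in geocoder used only for illustration (assumption, no network access).
def fake_geocode(query):
    table = {"Alaska, United States": (64.2008413, -149.4936733)}
    return table[query]  # raises KeyError for unknown places


coords = geocode_all(
    ["Alaska, United States", "Nowhere, Atlantis", "Alaska, United States"],
    fake_geocode,
)
print(coords)
```

The returned list has the same length and order as the input, so it can be assigned directly to a DataFrame column.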

In [13]:
subdivisions = pd.read_csv('Data_in/Google_API/subdivisions.csv', sep="\t")
subdivisions.head()

Unnamed: 0,country_area,country_name,parent_code_type1,subdivision_area,subdivision_name,child_code_type1,search_coords,coordinates
0,222,United States,1.0,262,Alaska,2.0,"Alaska, United States","(64.2008413, -149.4936733)"
1,81,Germany,1.0,339,Sachsen-Anhalt,2.0,"Sachsen-Anhalt, Germany","(51.9502649, 11.6922734)"
2,222,United States,1.0,263,Alabama,2.0,"Alabama, United States","(32.3182314, -86.902298)"
3,222,United States,1.0,261,Maryland,2.0,"Maryland, United States","(39.0457549, -76.64127119999999)"
4,222,United States,1.0,264,Arkansas,2.0,"Arkansas, United States","(35.20105, -91.8318334)"


Next, we split the coordinates into two columns, "latitude" and "longitude", which will be the y and x axes of our visualization, respectively:

In [14]:
lat = []
lng = []

subdivisions['coordinates'] = subdivisions['coordinates'].str.strip('()').str.split(',')

for row in tqdm.tqdm(range(len(subdivisions))):
    
    try:
        lat.append(subdivisions['coordinates'][row][0])
        lng.append(subdivisions['coordinates'][row][1])
    
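    #Rows without coordinates hold NaN instead of a list, so indexing them fails:
    except (TypeError, IndexError):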
        lat.append(np.nan)
        lng.append(np.nan)

subdivisions['latitude'] = lat
subdivisions['longitude'] = lng


subdivisions.head()

100%|██████████| 3783/3783 [00:00<00:00, 50463.39it/s]


Unnamed: 0,country_area,country_name,parent_code_type1,subdivision_area,subdivision_name,child_code_type1,search_coords,coordinates,latitude,longitude
0,222,United States,1.0,262,Alaska,2.0,"Alaska, United States","[64.2008413, -149.4936733]",64.2008413,-149.4936733
1,81,Germany,1.0,339,Sachsen-Anhalt,2.0,"Sachsen-Anhalt, Germany","[51.9502649, 11.6922734]",51.9502649,11.6922734
2,222,United States,1.0,263,Alabama,2.0,"Alabama, United States","[32.3182314, -86.902298]",32.3182314,-86.902298
3,222,United States,1.0,261,Maryland,2.0,"Maryland, United States","[39.0457549, -76.64127119999999]",39.0457549,-76.64127119999999
4,222,United States,1.0,264,Arkansas,2.0,"Arkansas, United States","[35.20105, -91.8318334]",35.20105,-91.8318334
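The loop above leaves latitude and longitude as strings. As a hedged alternative, the same split can be done without an explicit loop and converted to floats in one pass. This sketch uses a toy `demo` frame (an assumption for illustration) that mimics the `coordinates` column as read back from the CSV, including one missing value.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the 'coordinates' column as read back from the CSV.
demo = pd.DataFrame({"coordinates": ["(64.2008413, -149.4936733)", np.nan]})

# Strip the parentheses and split on the comma into two columns (0 and 1).
parts = demo["coordinates"].str.strip("()").str.split(",", expand=True)

# Remove stray spaces and convert to floats; missing rows stay NaN.
demo["latitude"] = pd.to_numeric(parts[0].str.strip(), errors="coerce")
demo["longitude"] = pd.to_numeric(parts[1].str.strip(), errors="coerce")

print(demo[["latitude", "longitude"]])
```

Having numeric columns makes it easier to use the values directly as plot axes later on.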


In [15]:
subdivisions.isnull().sum(axis=0)

country_area          0
country_name          0
parent_code_type1     0
subdivision_area      0
subdivision_name      0
child_code_type1      0
search_coords         0
coordinates          19
latitude             19
longitude            19
dtype: int64

In [16]:
subdivisions.dropna(subset=['latitude'], axis=0, inplace=True)

In [17]:
subdivisions_list = subdivisions.subdivision_area.values.tolist()
type(subdivisions_list)

list

In [18]:
#We keep as pending the ones that were not saved as subdivisions:
pending1 = parent_child_type1_ordered[~parent_child_type1_ordered.child_area.isin(subdivisions_list)]

In [19]:
#We do the second merging:
parent_child_type2 = pd.merge(pending1, subdivisions[['subdivision_area',\
                                                      'subdivision_name', \
                                                      'country_area', \
                                                      'country_name', \
                                                      'latitude', \
                                                      'longitude']],\
                              how='left', left_on='parent_area', right_on='subdivision_area')
parent_child_type2.drop(labels=['subdivision_area'], axis=1, inplace=True)
parent_child_type2.head()

Unnamed: 0,parent_area,parent_name1,parent_code_type1,child_area,child_name1,child_code_type1,subdivision_name,country_area,country_name,latitude,longitude
0,80688,South Kingstown,3.0,118562,Wakefield,5.0,,,,,
1,13141,Karlskoga Municipality,4.0,118563,Karlskoga,3.0,,,,,
2,104294,Woodbury County,7.0,25347,Correctionville,3.0,,,,,
3,2842,Rangpur,2.0,4131,Rangpur,2.0,Rangpur,18.0,Bangladesh,25.7438916,89.275227
4,432,England,2.0,3867,Kent,2.0,England,221.0,United Kingdom,52.3555177,-1.1743197


In [20]:
subdivisions2 = parent_child_type2[parent_child_type2['country_area'].notnull()]
len(subdivisions2)

32603

In [21]:
subdivisions2.head()

Unnamed: 0,parent_area,parent_name1,parent_code_type1,child_area,child_name1,child_code_type1,subdivision_name,country_area,country_name,latitude,longitude
3,2842,Rangpur,2.0,4131,Rangpur,2.0,Rangpur,18.0,Bangladesh,25.7438916,89.275227
4,432,England,2.0,3867,Kent,2.0,England,221.0,United Kingdom,52.3555177,-1.1743197
5,432,England,2.0,3872,Leicestershire,2.0,England,221.0,United Kingdom,52.3555177,-1.1743197
6,2768,Sumatera,2.0,4585,Sumatera Utara,2.0,Sumatera,100.0,Indonesia,-0.589724,101.3431058
7,1794,Flanders,2.0,4141,Antwerpen,2.0,Flanders,21.0,Belgium,51.0108706,3.7264613


In [22]:
subdivisions2_list = subdivisions2.child_area.values.tolist()

In [23]:
#We keep as pending the ones that were not saved as subdivisions2:
pending2 = parent_child_type2[~parent_child_type2.child_area.isin(subdivisions2_list)]
pending2.drop(labels=['subdivision_name','country_area', \
                      'country_name', 'latitude', 'longitude'], axis=1, inplace=True)

In [24]:
#And we do the third merging:
parent_child_type3 = pd.merge(pending2, subdivisions2, how='left', left_on='parent_area', right_on='child_area')
parent_child_type3.head()

Unnamed: 0,parent_area_x,parent_name1_x,parent_code_type1_x,child_area_x,child_name1_x,child_code_type1_x,parent_area_y,parent_name1_y,parent_code_type1_y,child_area_y,child_name1_y,child_code_type1_y,subdivision_name,country_area,country_name,latitude,longitude
0,80688,South Kingstown,3.0,118562,Wakefield,5.0,,,,,,,,,,,
1,13141,Karlskoga Municipality,4.0,118563,Karlskoga,3.0,484.0,Örebro,2.0,13141.0,Karlskoga Municipality,4.0,Örebro,202.0,Sweden,59.2752626,15.2134105
2,104294,Woodbury County,7.0,25347,Correctionville,3.0,274.0,Iowa,2.0,104294.0,Woodbury County,7.0,Iowa,222.0,United States,41.8780025,-93.097702
3,1178,London,3.0,3899,Redbridge,2.0,432.0,England,2.0,1178.0,London,3.0,England,221.0,United Kingdom,52.3555177,-1.1743197
4,4024,Tyne and Wear,2.0,3936,South Tyneside,2.0,432.0,England,2.0,4024.0,Tyne and Wear,2.0,England,221.0,United Kingdom,52.3555177,-1.1743197


In [25]:
subdivisions3 = parent_child_type3[parent_child_type3['country_area'].notnull()]
subdivisions3.head()

Unnamed: 0,parent_area_x,parent_name1_x,parent_code_type1_x,child_area_x,child_name1_x,child_code_type1_x,parent_area_y,parent_name1_y,parent_code_type1_y,child_area_y,child_name1_y,child_code_type1_y,subdivision_name,country_area,country_name,latitude,longitude
1,13141,Karlskoga Municipality,4.0,118563,Karlskoga,3.0,484.0,Örebro,2.0,13141.0,Karlskoga Municipality,4.0,Örebro,202.0,Sweden,59.2752626,15.2134105
2,104294,Woodbury County,7.0,25347,Correctionville,3.0,274.0,Iowa,2.0,104294.0,Woodbury County,7.0,Iowa,222.0,United States,41.8780025,-93.097702
3,1178,London,3.0,3899,Redbridge,2.0,432.0,England,2.0,1178.0,London,3.0,England,221.0,United Kingdom,52.3555177,-1.1743197
4,4024,Tyne and Wear,2.0,3936,South Tyneside,2.0,432.0,England,2.0,4024.0,Tyne and Wear,2.0,England,221.0,United Kingdom,52.3555177,-1.1743197
7,104301,Des Moines County,7.0,26608,Danville,3.0,274.0,Iowa,2.0,104301.0,Des Moines County,7.0,Iowa,222.0,United States,41.8780025,-93.097702


In [26]:
len(subdivisions3)

79359

In [27]:
subdivisions3_list = subdivisions3.child_area_x.values.tolist()

In [28]:
#We keep as pending the ones that were not saved as subdivisions3:
pending3 = parent_child_type3[~parent_child_type3.child_area_x.isin(subdivisions3_list)]
to_drop = ['parent_area_y', 'parent_name1_y', 'parent_code_type1_y',\
           'child_area_y', 'child_name1_y', 'child_code_type1_y',\
           'subdivision_name', 'country_area', 'country_name', 'latitude', 'longitude']
pending3.drop(labels=to_drop, axis=1, inplace=True)
pending3.head()

Unnamed: 0,parent_area_x,parent_name1_x,parent_code_type1_x,child_area_x,child_name1_x,child_code_type1_x
0,80688,South Kingstown,3.0,118562,Wakefield,5.0
5,3014,Basnāhira paḷāta,2.0,4728,Kŏḷamba,2.0
6,3014,Basnāhira paḷāta,2.0,4730,Kaḷutara,2.0
16,195,Sri Lanka,1.0,3014,Basnāhira paḷāta,2.0
73,117547,City of Marion,5.0,117548,Edwardstown,5.0


In [29]:
#And we do the fourth merging:
parent_child_type4 = pd.merge(pending3, subdivisions3, how='left', left_on='parent_area_x', right_on='child_area_x')
parent_child_type4.head()

Unnamed: 0,parent_area_x_x,parent_name1_x_x,parent_code_type1_x_x,child_area_x_x,child_name1_x_x,child_code_type1_x_x,parent_area_x_y,parent_name1_x_y,parent_code_type1_x_y,child_area_x_y,...,parent_name1_y,parent_code_type1_y,child_area_y,child_name1_y,child_code_type1_y,subdivision_name,country_area,country_name,latitude,longitude
0,80688,South Kingstown,3.0,118562,Wakefield,5.0,101979.0,Washington County,7.0,80688.0,...,Rhode Island,2.0,101979.0,Washington County,7.0,Rhode Island,222.0,United States,41.5800945,-71.4774291
1,3014,Basnāhira paḷāta,2.0,4728,Kŏḷamba,2.0,,,,,...,,,,,,,,,,
2,3014,Basnāhira paḷāta,2.0,4730,Kaḷutara,2.0,,,,,...,,,,,,,,,,
3,195,Sri Lanka,1.0,3014,Basnāhira paḷāta,2.0,,,,,...,,,,,,,,,,
4,117547,City of Marion,5.0,117548,Edwardstown,5.0,5141.0,Adelaide,3.0,117547.0,...,South Australia,2.0,5141.0,Adelaide,3.0,South Australia,13.0,Australia,-30.0002315,136.2091547


In [30]:
subdivisions4 = parent_child_type4[parent_child_type4['country_area'].notnull()]
subdivisions4.head()

Unnamed: 0,parent_area_x_x,parent_name1_x_x,parent_code_type1_x_x,child_area_x_x,child_name1_x_x,child_code_type1_x_x,parent_area_x_y,parent_name1_x_y,parent_code_type1_x_y,child_area_x_y,...,parent_name1_y,parent_code_type1_y,child_area_y,child_name1_y,child_code_type1_y,subdivision_name,country_area,country_name,latitude,longitude
0,80688,South Kingstown,3.0,118562,Wakefield,5.0,101979.0,Washington County,7.0,80688.0,...,Rhode Island,2.0,101979.0,Washington County,7.0,Rhode Island,222.0,United States,41.5800945,-71.4774291
4,117547,City of Marion,5.0,117548,Edwardstown,5.0,5141.0,Adelaide,3.0,117547.0,...,South Australia,2.0,5141.0,Adelaide,3.0,South Australia,13.0,Australia,-30.0002315,136.2091547
5,68690,Antwerp,3.0,117549,Antwerp,5.0,4141.0,Antwerpen,2.0,68690.0,...,Flanders,2.0,4141.0,Antwerpen,2.0,Flanders,21.0,Belgium,51.0108706,3.7264613
7,68690,Antwerp,3.0,117550,Berchem,5.0,4141.0,Antwerpen,2.0,68690.0,...,Flanders,2.0,4141.0,Antwerpen,2.0,Flanders,21.0,Belgium,51.0108706,3.7264613
14,68690,Antwerp,3.0,117551,Berendrecht-Zandvliet-Lillo,5.0,4141.0,Antwerpen,2.0,68690.0,...,Flanders,2.0,4141.0,Antwerpen,2.0,Flanders,21.0,Belgium,51.0108706,3.7264613


In [31]:
len(subdivisions4)

1575

In [32]:
subdivisions4_list = subdivisions4.child_area_x_x.values.tolist()

In [33]:
#We keep as pending the ones that were not saved as subdivisions4:
pending4 = parent_child_type4[~parent_child_type4.child_area_x_x.isin(subdivisions4_list)]
to_drop = ['parent_area_x_y', 'parent_name1_x_y', 'parent_code_type1_x_y',\
           'child_area_x_y', 'child_name1_x_y', 'child_code_type1_x_y',\
           'parent_area_y', 'parent_name1_y', 'parent_code_type1_y',\
           'child_area_y', 'child_name1_y', 'child_code_type1_y',\
           'subdivision_name', 'country_area','country_name', 'latitude', 'longitude']
pending4.drop(labels=to_drop, axis=1, inplace=True)
pending4.head()

Unnamed: 0,parent_area_x_x,parent_name1_x_x,parent_code_type1_x_x,child_area_x_x,child_name1_x_x,child_code_type1_x_x
1,3014,Basnāhira paḷāta,2.0,4728,Kŏḷamba,2.0
2,3014,Basnāhira paḷāta,2.0,4730,Kaḷutara,2.0
3,195,Sri Lanka,1.0,3014,Basnāhira paḷāta,2.0
6,3014,Basnāhira paḷāta,2.0,4729,Gampaha,2.0
8,66,Eritrea,1.0,3335,Al Janūbī,2.0


In [34]:
#And we do the fifth and final merging:
parent_child_type5 = pd.merge(pending4, subdivisions4, how='left', \
                              left_on='parent_area_x_x', right_on='child_area_x_x')

In [35]:
subdivisions5 = parent_child_type5[parent_child_type5['country_area'].notnull()]
subdivisions5.head()

Unnamed: 0,parent_area_x_x_x,parent_name1_x_x_x,parent_code_type1_x_x_x,child_area_x_x_x,child_name1_x_x_x,child_code_type1_x_x_x,parent_area_x_x_y,parent_name1_x_x_y,parent_code_type1_x_x_y,child_area_x_x_y,...,parent_name1_y,parent_code_type1_y,child_area_y,child_name1_y,child_code_type1_y,subdivision_name,country_area,country_name,latitude,longitude
23,13253,Londonderry,3.0,39878,Portstewart,3.0,3835.0,Derry,2.0,13253.0,...,Northern Ireland,2.0,115532.0,County Londonderry,2.0,Northern Ireland,221.0,United Kingdom,54.7877149,-6.4923145
38,117775,La Haute-Saint-Charles,5.0,117776,Lac-Saint-Charles,5.0,7715.0,Quebec City,3.0,117775.0,...,Quebec,2.0,117265.0,Capitale-Nationale,4.0,Quebec,38.0,Canada,46.8138783,-71.2079809
40,101160,Puurs,3.0,91108,Breendonk,3.0,118521.0,Puurs-Sint-Amands,4.0,101160.0,...,Flanders,2.0,4141.0,Antwerpen,2.0,Flanders,21.0,Belgium,51.0108706,3.7264613
82,30947,Altrincham,3.0,35055,Broadheath,5.0,3941.0,Trafford,2.0,30947.0,...,England,2.0,4021.0,Greater Manchester,2.0,England,221.0,United Kingdom,52.3555177,-1.1743197
99,106205,Sorel‐Tracy,3.0,117916,Sorel,5.0,117394.0,Pierre-De Saurel Regional County Municipality,7.0,106205.0,...,Quebec,2.0,117095.0,Montérégie,4.0,Quebec,38.0,Canada,46.8138783,-71.2079809


In [36]:
len(subdivisions5)

19
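Each of the merges above climbs one level of the parent/child hierarchy. As a rough sketch of the same idea in a single pass, the parent links can be followed in a loop until a known subdivision is reached. The function name `resolve_subdivision`, the `max_depth` limit and the toy dictionaries are assumptions made for illustration; the ids are taken from the tables shown above (Karlskoga, Karlskoga Municipality, Örebro, Sweden, Alaska, United States).

```python
def resolve_subdivision(area_id, parent_of, subdivision_ids, max_depth=10):
    """Follow parent links from area_id until a known subdivision is reached.

    Returns the subdivision id, or None if no subdivision is found within
    max_depth steps (the limit also guards against cycles in the data).
    """
    current = area_id
    for _ in range(max_depth):
        if current in subdivision_ids:
            return current
        current = parent_of.get(current)
        if current is None:
            return None
    return None


# child -> immediate parent, as in the parent_child table.
parent_of = {118563: 13141, 13141: 484, 484: 202, 262: 222}
# Areas whose immediate parent is a country (the first "subdivisions" table).
subdivision_ids = {484, 262}

print(resolve_subdivision(118563, parent_of, subdivision_ids))  # Karlskoga
print(resolve_subdivision(262, parent_of, subdivision_ids))     # Alaska
print(resolve_subdivision(202, parent_of, subdivision_ids))     # Sweden (a country)
```

Here `subdivision_ids` is assumed to contain only areas whose immediate parent is a country, which matches how the first `subdivisions` dataframe was built.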

For each subdivisions dataframe, we need to keep the following data:

- child area, child_area_x, etc. (or subdivision_area for the first "subdivisions" df): it will be our area_id
- child name (it will be our area_name)
- subdivision name
- country name
- latitude
- longitude
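The selection described in the list above is the same for every level; only the suffixes of the id and name columns change. One possible way to express it once is a small helper. This is a sketch only: `tidy_level`, `KEEP` and the one-row `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

# The six columns every level should end up with.
KEEP = ["area_id", "area_name", "subdivision_name",
        "country_name", "latitude", "longitude"]

def tidy_level(df, id_col, name_col):
    """Return a copy of df reduced to the six final columns, renamed consistently."""
    out = df.rename(columns={id_col: "area_id", name_col: "area_name"})
    return out[KEEP].copy()


# One-row toy frame shaped like the third-level table.
demo = pd.DataFrame({
    "child_area_x": [118563],
    "child_name1_x": ["Karlskoga"],
    "parent_area_x": [13141],
    "subdivision_name": ["Örebro"],
    "country_name": ["Sweden"],
    "latitude": [59.2752626],
    "longitude": [15.2134105],
})

tidy = tidy_level(demo, "child_area_x", "child_name1_x")
print(tidy.columns.tolist())
```

Returning a copy keeps the original merged dataframe untouched, which avoids modifying a filtered slice in place.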

In [37]:
#Dropping unnecessary columns:
to_drop = ['country_area', 'parent_code_type1', 'child_code_type1', 'search_coords', 'coordinates']
subdivisions.drop(labels=to_drop, axis=1, inplace=True)
#Creating empty column "area_name" as these are pure subdivisions:
subdivisions['area_name'] = np.nan
#Rearranging the columns for better visibility:
order = ['subdivision_area','area_name', 'subdivision_name', 'country_name','latitude','longitude']
s_ordered = subdivisions.reindex(columns=order)
#Changing column names:
s_ordered.rename(columns={'subdivision_area':'area_id',}, inplace=True)
s_ordered.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude
0,262,,Alaska,United States,64.2008413,-149.4936733
1,339,,Sachsen-Anhalt,Germany,51.9502649,11.6922734
2,263,,Alabama,United States,32.3182314,-86.902298
3,261,,Maryland,United States,39.0457549,-76.64127119999999
4,264,,Arkansas,United States,35.20105,-91.8318334


In [38]:
#Dropping unnecessary columns:
to_drop = ['parent_area', 'parent_code_type1', 'parent_name1', 'child_code_type1', 'country_area']
subdivisions2.drop(labels=to_drop, axis=1, inplace=True)
#Changing column names:
subdivisions2.rename(columns={'child_area':'area_id','child_name1':'area_name'}, inplace=True)
subdivisions2.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude
3,4131,Rangpur,Rangpur,Bangladesh,25.7438916,89.275227
4,3867,Kent,England,United Kingdom,52.3555177,-1.1743197
5,3872,Leicestershire,England,United Kingdom,52.3555177,-1.1743197
6,4585,Sumatera Utara,Sumatera,Indonesia,-0.589724,101.3431058
7,4141,Antwerpen,Flanders,Belgium,51.0108706,3.7264613


In [39]:
s3 = subdivisions3[['child_area_x', 'child_name1_x',\
                    'subdivision_name','country_name', 'latitude', 'longitude']].copy()
#Changing column names:
s3.rename(columns={'child_area_x':'area_id','child_name1_x':'area_name'}, inplace=True)

In [39]:
s3.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude
1,118563,Karlskoga,Örebro,Sweden,59.2752626,15.2134105
2,25347,Correctionville,Iowa,United States,41.8780025,-93.097702
3,3899,Redbridge,England,United Kingdom,52.3555177,-1.1743197
4,3936,South Tyneside,England,United Kingdom,52.3555177,-1.1743197
7,26608,Danville,Iowa,United States,41.8780025,-93.097702


In [40]:
s4 = subdivisions4[['child_area_x_x', 'child_name1_x_x', \
                    'subdivision_name','country_name', 'latitude', 'longitude']].copy()
#Changing column names:
s4.rename(columns={'child_area_x_x':'area_id','child_name1_x_x':'area_name',}, inplace=True)
s4.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude
0,118562,Wakefield,Rhode Island,United States,41.5800945,-71.4774291
4,117548,Edwardstown,South Australia,Australia,-30.0002315,136.2091547
5,117549,Antwerp,Flanders,Belgium,51.0108706,3.7264613
7,117550,Berchem,Flanders,Belgium,51.0108706,3.7264613
14,117551,Berendrecht-Zandvliet-Lillo,Flanders,Belgium,51.0108706,3.7264613


In [41]:
s5 = subdivisions5[['child_area_x_x_x', 'child_name1_x_x_x',\
                    'subdivision_name', 'country_name', 'latitude', 'longitude']].copy()
#Changing column names:
s5.rename(columns={'child_area_x_x_x':'area_id','child_name1_x_x_x':'area_name'}, inplace=True)
s5.head()

Unnamed: 0,area_id,area_name,subdivision_name,country_name,latitude,longitude
23,39878,Portstewart,Northern Ireland,United Kingdom,54.7877149,-6.4923145
38,117776,Lac-Saint-Charles,Quebec,Canada,46.8138783,-71.2079809
40,91108,Breendonk,Flanders,Belgium,51.0108706,3.7264613
82,35055,Broadheath,England,United Kingdom,52.3555177,-1.1743197
99,117916,Sorel,Quebec,Canada,46.8138783,-71.2079809


Now we can concatenate these five dataframes, so that most of MusicBrainz's area_ids are paired with the coordinates of their subdivision:

In [42]:
subdivisions_all = pd.concat([s_ordered, subdivisions2, s3, s4, s5], ignore_index=True)

In [43]:
#Are there any duplicated area_ids?
subdivisions_all.duplicated(subset='area_id').value_counts()

False    117320
dtype: int64
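Besides checking for duplicates, it can be useful to see which child areas did not make it into the final table. The sketch below shows one way to do that on toy data; `all_children` and `resolved` are stand-ins (assumptions) for `parent_child['child_area']` and `subdivisions_all`, and 3014 is used as the unresolved id because it remains pending in the tables above.

```python
import pandas as pd

# Stand-ins for parent_child['child_area'] and subdivisions_all.
all_children = pd.Series([262, 339, 118563, 3014], name="child_area")
resolved = pd.DataFrame({"area_id": [262, 339, 118563]})

# Child areas that are absent from the final table.
missing = all_children[~all_children.isin(resolved["area_id"])]

# Share of child areas that were resolved.
coverage = 1 - len(missing) / len(all_children)

print(missing.tolist(), coverage)
```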

In [44]:
#Exporting the dataframe:
subdivisions_all.to_csv('Data_out/subdivisions_all.csv', sep='\t', index=False, encoding='utf-8')