# Map Layers
**Pre-processing Dataset**

This notebook prepares the map base-layer dataset.

First, we scale the $x, y$ coordinates to integer values with a linear transformation.
Next, we annotate the labelled coordinates with a `tier`, as a proxy for the "Wikipedia Vital Articles" list. This lets us control which labels are shown when.

Finally, we write the prepared dataset to disk as JSON-serialized objects.

The file naming convention is `l{a, b, c}-{data}.json`, which reads as `layer-{a, b, c}-{data-type}`. Don't read too much into it: as long as we use a consistent convention, we'll be fine.
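
As a small illustration of the naming convention, the sketch below builds a file name from a layer letter and a data type. The helper `layer_filename` is hypothetical and not part of this notebook; it only assumes that the layers are `a`, `b`, and `c`, as stated above.

```python
def layer_filename(layer, data):
    """Build `l{layer}-{data}.json`, e.g. layer 'b' and data 'labels'."""
    # Assumption: only layers a, b and c exist, as in the convention above.
    if layer not in ('a', 'b', 'c'):
        raise ValueError(f'unknown layer: {layer!r}')
    return f'l{layer}-{data}.json'


name = layer_filename('b', 'labels')
print(name)  # lb-labels.json
```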

In [1]:
import pandas as pd
import json
import requests

from pydash import py_

In [3]:
# Fetch the base-layer map as a dataframe
new_basemap = '/Users/prashantsinha/Downloads/dotAtlas_en.json'

group_map = 'https://ilearn.cri-paris.org/prod/api/map/group?group_id=beta'

df_mbase = pd.read_json(new_basemap)

df_mbase.head()


Unnamed: 0,label,wikidata_id,x,y
0,Antoine_Meillet,Q347001,-1.395,1.536
1,Linear_algebra,Q82571,-1.345,7.258
2,Politics_of_Argentina,Q1154647,-2.209,1.937
3,Austria,Q40,-4.543,0.163
4,Arc_de_Triomphe,Q64436,-2.022,-0.048


In [9]:
# These magic values were chosen later; without them we can't sync these coords with the dataset from the server.
xmin, ymin = -30, -30

z = 1e3

df_mbase['x_t'] = (df_mbase
                   .x
                   .apply(lambda x: (x - xmin) * z)
                   .round()
                   .astype('int32'))
df_mbase['y_t'] = (df_mbase
                   .y
                   .apply(lambda y: (y - ymin) * z)
                   .round()
                   .astype('int32'))

df_mbase.head()

Unnamed: 0,label,labelOpacity,markerSize,portal,x,y,x_t,y_t
0,,0.3,0.2,sci,-8.12,-4.301,21880,25699
1,,0.3,0.2,sci,-11.263,-3.278,18737,26722
2,,0.3,0.2,sci,-10.163,-6.365,19837,23635
3,,0.3,0.2,sci,-10.697,-2.326,19303,27674
4,,0.3,0.2,sci,-10.684,-3.34,19316,26660
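
To make the arithmetic concrete, here is a minimal plain-Python sketch of the same shift-and-scale step, checked against the first row of the output above, together with its approximate inverse. The helper names `to_grid` and `from_grid` are hypothetical; the notebook itself only uses the `apply` calls shown in the cell.

```python
XMIN, YMIN = -30, -30  # offsets from the cell above
Z = 1e3                # scale factor from the cell above


def to_grid(value, offset, scale=Z):
    """Shift by `offset`, scale, and round to the nearest integer."""
    return int(round((value - offset) * scale))


def from_grid(grid_value, offset, scale=Z):
    """Undo `to_grid`; exact only up to the rounding step."""
    return grid_value / scale + offset


# First row of the output above: x = -8.12, y = -4.301
x_t = to_grid(-8.12, XMIN)
y_t = to_grid(-4.301, YMIN)
print(x_t, y_t)  # 21880 25699

# Going back recovers the original coordinate to within 1 / Z.
x_back = from_grid(x_t, XMIN)
```

Since the inputs carry three decimal places and `Z` is 1000, the rounding step loses nothing here; with a smaller scale factor the inverse would only be approximate.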


In [10]:
# We want to keep the "tier" information according to the "Wikipedia Vital Articles"
# hierarchy. The `markerSize` property is a direct proxy for the 8 levels, which we
# transform to integers and add as the column `tier`.

df_mbase['tier'] = (df_mbase
                    .markerSize
                    .apply(lambda x: x * 10)
                    .astype('int32'))

df_mbase.tail()

Unnamed: 0,label,labelOpacity,markerSize,portal,x,y,x_t,y_t,tier
120389,,0.3,0.2,soc,7.591,-2.499,37591,27501,2
120390,,0.3,0.2,soc,9.026,-2.412,39026,27588,2
120391,,0.3,0.2,soc,11.275,-2.056,41275,27944,2
120392,,0.3,0.2,soc,12.214,-1.102,42214,28898,2
120393,SOCIÉTÉ,1.0,0.1,soc,10.0,-3.0,40000,27000,1


In [11]:
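# We'll filter the rows with labels. A boolean mask with `.loc` is used here because
# `.iloc` is positional and only works with index labels on a default integer index.

df_labels = (df_mbase
             .loc[df_mbase.label.notna()]
             .sort_values(by='tier'))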

# ... and ensure that the labels are not `_` separated.
df_labels['label'] = df_labels.label.str.replace('_', ' ')

# Dump out the label, tier, portal, x_t, and y_t columns
# We'll rename `x_t` and `y_t` by `x` and `y`.
columns = ['label', 'portal', 'tier', 'x_t', 'y_t']

df_lb_labels = df_labels[columns].rename(columns={'x_t': 'x', 'y_t': 'y'})

# et voila:
df_lb_labels.head()


Unnamed: 0,label,portal,tier,x,y
120393,SOCIÉTÉ,soc,1,40000,27000
85392,HISTOIRE,hist,1,30500,25000
111949,SPORT ET LOISIRS,spo,1,39500,34000
104702,ARTS,art,1,35000,30000
65197,GÉOGRAPHIE,geo,1,30000,35500
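
The `tier` column is what lets a consumer decide which labels to show. The notebook does not show that downstream step, so the following is only a sketch under assumptions: the function `visible_labels` and its `max_tier` cutoff are hypothetical, it assumes that a lower tier means a more prominent label, and the tier-2 record is a made-up placeholder added so that the filter has something to exclude.

```python
# Records in the shape of the table above. The first two rows come from the
# output; the third is a made-up placeholder, not real data.
labels = [
    {'label': 'SOCIÉTÉ', 'portal': 'soc', 'tier': 1, 'x': 40000, 'y': 27000},
    {'label': 'HISTOIRE', 'portal': 'hist', 'tier': 1, 'x': 30500, 'y': 25000},
    {'label': 'Placeholder article', 'portal': 'hist', 'tier': 2, 'x': 30600, 'y': 25100},
]


def visible_labels(records, max_tier):
    """Keep records whose tier is at most `max_tier`.

    Assumption: lower tiers are more prominent, so a zoomed-out view
    would pass a small `max_tier` and a zoomed-in view a larger one.
    """
    return [r for r in records if r['tier'] <= max_tier]


zoomed_out = visible_labels(labels, max_tier=1)
zoomed_in = visible_labels(labels, max_tier=2)
print([r['label'] for r in zoomed_out])
```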


In [12]:
# Write it out to disk.
# [!] NOTE: Ensure `force_ascii` is False. We want to keep utf-8 as much as possible.
#           However, within the Jupyter environment that alone does not seem to be
#           enough, so we use a proper file pointer with explicit utf-8 encoding.

with open('./lb-labels.json', 'w', encoding='utf-8') as fp:
  df_lb_labels.to_json(fp, orient='records', force_ascii=False)
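
One way to confirm that the labels really stay UTF-8 on disk is to read the raw bytes back. The sketch below uses the standard-library `json` module as a stand-in for `to_json` (its `ensure_ascii=False` plays the role of `force_ascii=False`), writes to a temporary directory rather than `./lb-labels.json`, and uses a single record taken from the output above.

```python
import json
import os
import tempfile

records = [{'label': 'SOCIÉTÉ', 'portal': 'soc', 'tier': 1, 'x': 40000, 'y': 27000}]

# Assumption: a throwaway directory is used so the real output file is untouched.
path = os.path.join(tempfile.mkdtemp(), 'lb-labels.json')

with open(path, 'w', encoding='utf-8') as fp:
    json.dump(records, fp, ensure_ascii=False)

# Inspect the raw bytes: escaped output would contain `\u00c9` sequences.
with open(path, 'rb') as fp:
    raw = fp.read()
has_escapes = b'\\u' in raw

# Decoding with utf-8 gives back the original records.
with open(path, encoding='utf-8') as fp:
    loaded = json.load(fp)

print(has_escapes, loaded[0]['label'])
```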