# Map Layers
**Pre-processing Dataset**

The goal of this notebook is to prepare the map base-layer dataset.

First, we scale the $x, y$ coordinates to integer values through a linear transformation.
Next, we annotate the labelled coordinates with a `tier`, a proxy for the level in the "Wikipedia Vital Articles" list. This lets us control which labels are shown when.

We then write the prepared dataset to disk as JSON-serialized objects.

The file naming convention is `l{a, b, c}-{data}.json`, which reads as `layer-{a, b, c}-{data-type}`. The exact scheme matters less than applying it consistently.
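
As a minimal sketch of the naming convention, the helper below builds a file name from a layer letter and a data type. The function name `layer_filename` is hypothetical and not part of the notebook; it only illustrates how a name such as `lb-labels.json` decomposes.

```python
def layer_filename(layer: str, data: str) -> str:
    """Build a file name following the `l{layer}-{data}.json` convention.

    Hypothetical helper for illustration; the notebook writes names by hand.
    """
    if layer not in {"a", "b", "c"}:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"l{layer}-{data}.json"


print(layer_filename("b", "labels"))  # lb-labels.json
```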

In [3]:
import pandas as pd
import json
import requests

from pydash import py_

In [6]:
# Fetch the base-layer map as a dataframe
mbase_url = 'https://noop-pub.s3.amazonaws.com/opt/atlas/atlas-optimal-02.json'

group_map = 'https://ilearn.cri-paris.org/prod/api/map/group?group_id=beta'

df_mbase = pd.read_json(mbase_url)

df_mbase.head()

group_map_pts = requests.get(group_map).json()['concepts']

pd.DataFrame.from_records(group_map_pts)


Unnamed: 0,elevation,title_en,title_fr,trueskill,wikidata_id,x_map_fr,y_map_fr
0,0.426,Linker (computing),,"{'mu': 42.598, 'sigma': 11...",Q523796,,
1,0.424,Subroutine,Sous-programme,"{'mu': 42.444, 'sigma': 14...",Q190686,-6.046,5.370
2,0.474,C (programming language),C (langage),"{'mu': 47.412, 'sigma': 12...",Q15777,-6.016,5.320
3,0.395,Lagrangian mechanics,Équations de Lagrange,"{'mu': 39.471, 'sigma': 11...",Q324669,-9.157,-0.150
4,0.388,Bletchley Park,Bletchley Park,"{'mu': 38.757, 'sigma': 10...",Q155921,-8.426,3.458
5,0.346,MSX,MSX,"{'mu': 34.551, 'sigma': 9....",Q853547,,
6,0.396,Lambda calculus,Lambda-calcul,"{'mu': 39.577, 'sigma': 8....",Q242028,-6.114,5.848
7,0.391,Combinatory logic,Logique combinatoire,"{'mu': 39.119, 'sigma': 7....",Q1481571,-9.790,-0.237
8,0.387,Alan Turing,Alan Turing,"{'mu': 38.716, 'sigma': 6....",Q7251,0.171,-4.776
9,0.458,RISC-V,RISC-V,"{'mu': 45.816, 'sigma': 10...",Q17637401,-6.715,5.354


### Linear Transformation

We scale the $x$ and $y$ coordinates linearly to integers and shift them along both axes
so that every value lies in the positive integer domain.

The transformation is implemented as follows:

1. Take the position vector $ \vec s = \langle x, y \rangle $ and the shift vector

    $ \vec \delta = \langle \min(x), \min(y) \rangle $

2. Choose a scalar $ z = 10^n $, where $ n $ is the desired number of decimal digits of precision.

3. Apply the linear transformation to $ \vec s $ and round the result to the nearest integer

    $ \vec s_i = z(\vec s - \vec \delta) $
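
As a worked example of these steps, here is a plain-Python sketch for a single point. The function name `to_grid` is hypothetical. The defaults assume the fixed offset of $-30$ on both axes and $z = 10^3$ used in the next cell, rather than the minima of the data.

```python
def to_grid(x, y, x_min=-30.0, y_min=-30.0, z=1000):
    """Shift by (x_min, y_min), scale by z, and round to the nearest integer.

    Hypothetical helper; the defaults mirror the constants used in the notebook.
    """
    return round((x - x_min) * z), round((y - y_min) * z)


# First row of the base-layer dataframe: x = -8.12, y = -4.301
print(to_grid(-8.12, -4.301))  # (21880, 25699)
```

With three digits of precision, a point at $(-8.12, -4.301)$ lands on the integer grid at $(21880, 25699)$.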

In [9]:
# We chose these magic values later; otherwise we can't sync these coords with the dataset from the server.
xmin, ymin = -30, -30

z = 1e3

# Shift, scale, round, and cast each axis to integers (vectorized over the column)
df_mbase['x_t'] = ((df_mbase.x - xmin) * z).round().astype('int32')
df_mbase['y_t'] = ((df_mbase.y - ymin) * z).round().astype('int32')

df_mbase.head()

Unnamed: 0,label,labelOpacity,markerSize,portal,x,y,x_t,y_t
0,,0.3,0.2,sci,-8.12,-4.301,21880,25699
1,,0.3,0.2,sci,-11.263,-3.278,18737,26722
2,,0.3,0.2,sci,-10.163,-6.365,19837,23635
3,,0.3,0.2,sci,-10.697,-2.326,19303,27674
4,,0.3,0.2,sci,-10.684,-3.34,19316,26660


In [10]:
# We want to keep the "tier" information according to the "Wikipedia Vital Articles"
# hierarchy. The `markerSize` property is a direct proxy for the 8 levels, which we
# transform to integers and add to the column `tier`.

df_mbase['tier'] = (df_mbase
                    .markerSize
                    .apply(lambda x: x * 10)
                    .round()  # round first so float error cannot be truncated by the cast
                    .astype('int32'))

df_mbase.tail()

Unnamed: 0,label,labelOpacity,markerSize,portal,x,y,x_t,y_t,tier
120389,,0.3,0.2,soc,7.591,-2.499,37591,27501,2
120390,,0.3,0.2,soc,9.026,-2.412,39026,27588,2
120391,,0.3,0.2,soc,11.275,-2.056,41275,27944,2
120392,,0.3,0.2,soc,12.214,-1.102,42214,28898,2
120393,SOCIÉTÉ,1.0,0.1,soc,10.0,-3.0,40000,27000,1
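
The tier conversion, and the way tiers can gate which labels are shown, can be sketched in plain Python. Both function names are hypothetical. The filter assumes that lower tiers are the more prominent labels, and the second record is made up for illustration.

```python
def marker_size_to_tier(marker_size: float) -> int:
    """Map a markerSize such as 0.2 to an integer tier such as 2."""
    # round() rather than int(): int() truncates, so a product that lands
    # just below a whole number would lose a tier.
    return round(marker_size * 10)


def labels_up_to_tier(labels, max_tier):
    """Hypothetical helper: keep labels whose tier is at most `max_tier`."""
    return [row for row in labels if row["tier"] <= max_tier]


# Illustrative records; only the first one appears in the notebook output.
labels = [
    {"label": "SOCIÉTÉ", "tier": marker_size_to_tier(0.1)},
    {"label": "made-up tier-3 label", "tier": marker_size_to_tier(0.3)},
]

print(labels_up_to_tier(labels, max_tier=1))
```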


In [11]:
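# We'll keep only the rows that have a label.
# Select with a boolean mask: `.iloc` is positional and would only work here
# because the index happens to be the default 0..n-1 range.
df_labels = (df_mbase
             .loc[df_mbase.label.notna()]
             .sort_values(by='tier'))

# ... and ensure that the labels are not `_` separated.
df_labels['label'] = df_labels.label.str.replace('_', ' ', regex=False)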

# Keep only the label, portal, tier, x_t, and y_t columns.
# We'll rename `x_t` and `y_t` to `x` and `y`.
columns = ['label', 'portal', 'tier', 'x_t', 'y_t']

df_lb_labels = df_labels[columns].rename(columns={'x_t': 'x', 'y_t': 'y'})

# et voila:
df_lb_labels.head()


Unnamed: 0,label,portal,tier,x,y
120393,SOCIÉTÉ,soc,1,40000,27000
85392,HISTOIRE,hist,1,30500,25000
111949,SPORT ET LOISIRS,spo,1,39500,34000
104702,ARTS,art,1,35000,30000
65197,GÉOGRAPHIE,geo,1,30000,35500


In [12]:
# Write it out to disk.
# [!] NOTE: Ensure `force_ascii` is False so that non-ASCII characters are kept as UTF-8
#           instead of being escaped. In the Jupyter environment this alone does not
#           seem to be enough, so we use a proper file pointer with explicit utf-8 encoding.

with open('./lb-labels.json', 'w', encoding='utf-8') as fp:
    df_lb_labels.to_json(fp, orient='records', force_ascii=False)
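
To check that accented labels survive the round trip, the sketch below writes one record and reads it back. It uses the standard-library `json` module as a stand-in for pandas: `ensure_ascii=False` plays the role of `force_ascii=False`. The temporary file path is an assumption made so the example does not overwrite the real output.

```python
import json
import os
import tempfile

# One record in the same shape as the `orient='records'` output above.
records = [{"label": "SOCIÉTÉ", "portal": "soc", "tier": 1, "x": 40000, "y": 27000}]

path = os.path.join(tempfile.mkdtemp(), "lb-labels.json")

# Explicit utf-8 file pointer, and no ASCII escaping of non-ASCII characters.
with open(path, "w", encoding="utf-8") as fp:
    json.dump(records, fp, ensure_ascii=False)

with open(path, encoding="utf-8") as fp:
    raw = fp.read()

print("SOCIÉTÉ" in raw)            # the label is stored as readable UTF-8
print(json.loads(raw) == records)  # and parses back to the same records

# For comparison, the default (ASCII-only) output escapes the accented letters.
escaped = json.dumps(records)
print("\\u00c9" in escaped)
```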