# Geo-Level-ID Analysis

In this notebook, we examine the geo-level-id features and try to find the real-world coordinates associated with them. For this, we use external data from [microdata.cbs.gov.np](https://microdata.cbs.gov.np/index.php/catalog/69/data-dictionary/F1?file_name=Main_Table). First, we load the required dependencies.

In [1]:
import pandas as pd
import json

from src.visualization.visualize import get_verbose_value_counts

Load the required datasets.

In [2]:
train_data = pd.read_csv('../data/raw/train_values.csv')
dist = pd.read_csv('../data/external/dist.csv')
vcode = pd.read_csv('../data/external/vcode.csv')

## Geo-Level-1-ID

We assume that geo-level-1-id represents Nepalese districts. To test this, we compare the relative occurrences of the ids in the training data with the relative occurrences of the districts in the external data.

In [3]:
# Parse the JSON category list stored in the first row of the data dictionary.
districts_json = json.loads(dist['var_catgry'][0])
# Build the frame from a list of records instead of concatenating row by row,
# which also yields a clean 0..n-1 index without a separate reset_index call.
districts_df = pd.DataFrame([
    {
        'id': district['value'],
        'name': district['labl'],
        'count': district['stats'][0]['value'],
    }
    for district in districts_json
])
districts_df

Unnamed: 0,id,name,count
0,7,Dhankuta,4491
1,9,Sankhuwasabha,4966
2,10,Bhojpur,9242
3,11,Solukhumbu,17000
4,12,Okhaldhunga,36480
5,13,Khotang,12678
6,20,Sindhuli,62900
7,21,Ramechhap,54345
8,22,Dolakha,57691
9,23,Sindhupalchok,83589


In [4]:
districts_df['count'] = districts_df['count'].astype('int')
districts_df['relative'] = districts_df['count'] / districts_df['count'].sum()
districts_df['relative'] = districts_df['relative'].apply(lambda x: '{0:.2%}'.format(x))
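# Note: 'relative' holds formatted strings at this point, so the sort is lexicographic.
# It matches the numeric order only as long as every share is below 10%.
districts_df = districts_df.sort_values(by='relative', ascending=False).reset_index(drop=True)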
districts_df.columns = ['id (ext)', 'name (ext)', 'count (ext)', 'relative (ext)']

display(districts_df)

Unnamed: 0,id (ext),name (ext),count (ext),relative (ext)
0,24,Kavrepalanchok,90497,9.08%
1,31,Makwanpur,85188,8.55%
2,30,Dhading,85255,8.55%
3,23,Sindhupalchok,83589,8.39%
4,36,Gorkha,74389,7.46%
5,28,Nuwakot,72871,7.31%
6,20,Sindhuli,62900,6.31%
7,22,Dolakha,57691,5.79%
8,21,Ramechhap,54345,5.45%
9,27,Kathmandu,48133,4.83%


In [5]:
geo_level_1_id_df = get_verbose_value_counts(train_data['geo_level_1_id'])
geo_level_1_id_df.columns = ['id (raw)', 'count (raw)', 'relative (raw)']
display(geo_level_1_id_df)

Unnamed: 0,id (raw),count (raw),relative (raw)
0,6,24381,9.36%
1,26,22615,8.68%
2,10,22079,8.47%
3,17,21813,8.37%
4,8,19080,7.32%
5,7,18994,7.29%
6,20,17216,6.61%
7,21,14889,5.71%
8,4,14568,5.59%
9,27,12532,4.81%
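
`get_verbose_value_counts` is a project helper whose source is not shown in this notebook. Judging only from the output columns above, it appears to return each id together with its absolute count and its share formatted as a percentage. A minimal sketch under that assumption follows; the function name `verbose_value_counts` and the toy series are hypothetical and not the project's code.

```python
import pandas as pd


def verbose_value_counts(series: pd.Series) -> pd.DataFrame:
    """Hypothetical stand-in for get_verbose_value_counts (assumed behavior)."""
    counts = series.value_counts()  # sorted by descending count
    return pd.DataFrame({
        'id': counts.index,
        'count': counts.to_numpy(),
        # Format each share the same way the notebook formats the external data.
        'relative': [f'{share:.2%}' for share in counts / counts.sum()],
    })


# Toy input, not the competition data.
toy = pd.Series([6, 6, 6, 26, 26, 10])
result = verbose_value_counts(toy)
print(result)
```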


In [11]:
comparison_df = districts_df.join(geo_level_1_id_df)
display(comparison_df[['name (ext)', 'relative (ext)', 'relative (raw)', 'id (raw)']])

Unnamed: 0,name (ext),relative (ext),relative (raw),id (raw)
0,Kavrepalanchok,9.08%,9.36%,6
1,Makwanpur,8.55%,8.68%,26
2,Dhading,8.55%,8.47%,10
3,Sindhupalchok,8.39%,8.37%,17
4,Gorkha,7.46%,7.32%,8
5,Nuwakot,7.31%,7.29%,7
6,Sindhuli,6.31%,6.61%,20
7,Dolakha,5.79%,5.71%,21
8,Ramechhap,5.45%,5.59%,4
9,Kathmandu,4.83%,4.81%,27
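
The rank-wise comparison in the table above can also be expressed as a number. The following sketch converts the percentage strings back to fractions and computes the absolute gap per rank. It uses only the ten rows shown above; the helper `to_fraction` and the column `abs diff` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Shares copied from the comparison table above (the ten rows shown).
comparison = pd.DataFrame({
    'name (ext)': ['Kavrepalanchok', 'Makwanpur', 'Dhading', 'Sindhupalchok', 'Gorkha',
                   'Nuwakot', 'Sindhuli', 'Dolakha', 'Ramechhap', 'Kathmandu'],
    'relative (ext)': ['9.08%', '8.55%', '8.55%', '8.39%', '7.46%',
                       '7.31%', '6.31%', '5.79%', '5.45%', '4.83%'],
    'relative (raw)': ['9.36%', '8.68%', '8.47%', '8.37%', '7.32%',
                       '7.29%', '6.61%', '5.71%', '5.59%', '4.81%'],
})


def to_fraction(percent_strings: pd.Series) -> pd.Series:
    """Convert strings such as '9.08%' to floats such as 0.0908."""
    return percent_strings.str.rstrip('%').astype(float) / 100


# Absolute difference between the external and the raw share at each rank.
comparison['abs diff'] = (
    to_fraction(comparison['relative (ext)']) - to_fraction(comparison['relative (raw)'])
).abs()

print(comparison[['name (ext)', 'abs diff']])
print('largest gap:', comparison['abs diff'].max())
```

Among these ten rows, the largest gap is 0.30 percentage points (Sindhuli, 6.31% vs. 6.61%).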


We see very similar relative occurrences at each rank and therefore assume that we have matched each geo_level_1_id to the correct district. Let's map the geo-level-1-ids to their districts.

In [13]:
# Pair each raw id with the district name that shares its rank in the comparison table.
district_mapping = dict(zip(comparison_df['id (raw)'], comparison_df['name (ext)']))
print(dict(sorted(district_mapping.items())))

{0: 'Syangja', 1: 'Kaski', 2: 'Baglung', 3: 'Bhaktapur', 4: 'Ramechhap', 5: 'Parbat', 6: 'Kavrepalanchok', 7: 'Nuwakot', 8: 'Gorkha', 9: 'Chitwan', 10: 'Dhading', 11: 'Lalitpur', 12: 'Khotang', 13: 'Okhaldhunga', 14: 'Gulmi', 15: 'Bhojpur', 16: 'Solukhumbu', 17: 'Sindhupalchok', 18: 'Rasuwa', 19: 'Myagdi', 20: 'Sindhuli', 21: 'Dolakha', 22: 'Tanahu', 23: 'Dhankuta', 24: 'Sankhuwasabha', 25: 'Lamjung', 26: 'Makwanpur', 27: 'Kathmandu', 28: 'Nawalparasi', 29: 'Arghakhanchi', 30: 'Palpa'}


Afterwards, we will replace these districts with their corresponding latitudes/longitudes. These have to be found manually, e.g., from [wikipedia.org](https://en.wikipedia.org/wiki/List_of_districts_of_Nepal).

In [8]:
coordinate_mapping = {
    'Arghakhanchi': [28.000833, 83.246667],
    'Baglung': [28.266667, 83.6],
    'Bhaktapur': [27.672222, 85.427778],
    'Bhojpur': [27.1725, 87.048056],
    'Chitwan': [27.583333, 84.5],
    'Dhading': [27.933253, 84.865694],
    'Dhankuta': [26.981389, 87.343333],
    'Dolakha': [27.732542, 86.178946],
    'Gorkha': [28.283333, 84.683333],
    'Gulmi': [28.066667, 83.25],
    'Kaski': [28.3333, 84],
    'Kathmandu': [27.7, 85.3],
    'Kavrepalanchok': [27.530719, 85.537105],
    'Khotang': [27.2, 86.783333],
    'Lalitpur': [27.666667, 85.316667],
    'Lamjung': [28.226365, 84.376373],
    'Makwanpur': [27.416667, 85.033333],
    'Myagdi': [28.528294, 83.476009],
    'Nawalparasi': [27.533333, 83.666667],
    'Nuwakot': [27.878689, 85.138722],
    'Okhaldhunga': [27.316667, 86.5],
    'Palpa': [27.8666, 83.55],
    'Parbat': [28.2213, 83.7122],
    'Ramechhap': [27.400429, 86.027412],
    'Rasuwa': [28.116667, 85.283333],
    'Sankhuwasabha': [27.366667, 87.216667],
    'Sindhuli': [27.252, 85.97],
    'Sindhupalchok': [27.951295, 85.69572],
    'Solukhumbu': [27.70242, 86.677237],
    'Syangja': [28.096944, 83.821944],
    'Tanahu': [27.917373, 84.193726]
}
print(coordinate_mapping)


Now, we can replace the geo-level-1-ids with their corresponding real-world coordinates.
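
The notebook does not show this replacement step, so the following is only a sketch of how it might look. It chains the two dictionaries (id to district name, district name to coordinates) and applies them to a toy frame. The output column names `geo_level_1_lat` and `geo_level_1_lon` and the frame `toy_train` are assumptions made for illustration.

```python
import pandas as pd

# Excerpts of the two mappings built above.
district_mapping = {6: 'Kavrepalanchok', 26: 'Makwanpur', 10: 'Dhading'}
coordinate_mapping = {
    'Kavrepalanchok': [27.530719, 85.537105],
    'Makwanpur': [27.416667, 85.033333],
    'Dhading': [27.933253, 84.865694],
}

# Toy stand-in for train_data; only the column used here is included.
toy_train = pd.DataFrame({'geo_level_1_id': [6, 26, 10, 6]})

# Chain the two lookups: id -> district name -> [latitude, longitude].
id_to_coords = {
    geo_id: coordinate_mapping[name] for geo_id, name in district_mapping.items()
}

# Hypothetical column names; the id column is dropped because it is being replaced.
toy_train['geo_level_1_lat'] = toy_train['geo_level_1_id'].map(lambda i: id_to_coords[i][0])
toy_train['geo_level_1_lon'] = toy_train['geo_level_1_id'].map(lambda i: id_to_coords[i][1])
toy_train = toy_train.drop(columns='geo_level_1_id')
print(toy_train)
```

Building `id_to_coords` once keeps the per-row work to a single dictionary lookup, and an id without a known district raises a `KeyError` instead of silently producing missing coordinates.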

## Geo-Level-2-ID

We try the same approach for geo-level-2-ids. We find that there are too many different ids for this approach to be feasible.

In [9]:
vcode_json = json.loads(vcode['var_catgry'][0])
vcode_df = pd.DataFrame()
# Use a loop variable that does not shadow the `vcode` DataFrame loaded above.
for entry in vcode_json:
    vcode_df = pd.concat([
        vcode_df,
        pd.DataFrame({
            'id': [entry['value']],
            'name': [entry['labl']]
        })
    ])
vcode_df

Unnamed: 0,id,name
0,71,Ahale
0,72,Ankhisalla
0,73,Arkhaule Jitpur
0,74,Basantatar
0,77,Bodhe
...,...,...
0,5138,Simalpani
0,5139,Sitapur
0,5140,Subarnakhal
0,5141,Thada


In [10]:
geo_level_2_id_df = get_verbose_value_counts(train_data['geo_level_2_id'])
geo_level_2_id_df.columns = ['id (raw)', 'count (raw)', 'relative (raw)']
display(geo_level_2_id_df)

Unnamed: 0,id (raw),count (raw),relative (raw)
0,39,4038,1.55%
1,158,2520,0.97%
2,181,2080,0.80%
3,1387,2040,0.78%
4,157,1897,0.73%
...,...,...,...
1409,1330,1,0.00%
1410,1043,1,0.00%
1411,637,1,0.00%
1412,989,1,0.00%


First off, vcode does not contain any counts, so the frequency comparison used for the districts is not possible. Furthermore, the number of distinct ids differs between geo-level-2 and vcode. Lastly, there is an enormous number of different vcodes and geo-level-2-ids, which would make finding the corresponding latitudes/longitudes too time-consuming. We conclude that it is not feasible to find coordinates for geo-level-2-ids.
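
The cardinality mismatch mentioned above can be checked directly. The sketch below uses toy series in place of `vcode_df['id']` and `train_data['geo_level_2_id']`; the variable names and values are illustrative only.

```python
import pandas as pd

# Toy stand-ins; in the notebook these would be vcode_df['id'] and train_data['geo_level_2_id'].
toy_vcode_ids = pd.Series([71, 72, 73, 74, 77, 5138])
toy_geo_level_2_ids = pd.Series([39, 158, 181, 39, 158])

n_vcodes = toy_vcode_ids.nunique()
n_geo_ids = toy_geo_level_2_ids.nunique()
print(f'distinct vcodes: {n_vcodes}, distinct geo-level-2 ids: {n_geo_ids}')

# A rank-based pairing like the one used for districts needs the same number of ids on both sides.
same_cardinality = n_vcodes == n_geo_ids
print('same number of ids:', same_cardinality)
```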