# Test Wiki backlinks coverage
We've already collected the backlink counts with [get_geonames_wiki_backlink_count.py](get_geonames_wiki_backlink_count.py) and stored them in [wiki_backlink_counts.tsv](/hg190/corpora/GeoNames/wiki_backlink_counts.tsv), so let's see how much of GeoNames is covered by this data.

This will help us determine if we need to extend the coverage to smaller toponyms (e.g. streets) with a back-off strategy: e.g. get the backlinks for a toponym's containing city, state, country, etc.

In [2]:
import pandas as pd
from data_helpers import load_simplified_geonames_data, load_geocorpora_data

## Load data

In [74]:
backlinks_data = pd.read_csv('/hg190/corpora/GeoNames/wiki_backlink_counts.tsv', sep='\t', index_col=False, encoding='utf-8')
geonames_data = load_simplified_geonames_data()

In [75]:
geonames_data.head()

Unnamed: 0,geonames_ID,name,alternate_names,latitude,longitude,feature_class,feature_code,population,alternate_name_count
0,2986043,Pic de Font Blanca,"Pic de Font Blanca,Pic du Port",42.64991,1.53335,T,PK,0,2
1,2994701,Roc Mélé,"Roc Mele,Roc Meler,Roc Mélé",42.58765,1.74028,T,MT,0,3
2,3007683,Pic des Langounelles,Pic des Langounelles,42.61203,1.47364,T,PK,0,1
3,3017832,Pic de les Abelletes,"Pic de la Font-Negre,Pic de la Font-Nègre,Pic ...",42.52535,1.73343,T,PK,0,3
4,3017833,Estany de les Abelletes,"Estany de les Abelletes,Etang de Font-Negre,Ét...",42.52915,1.73362,H,LK,0,3


In [76]:
# merge with geonames data
geonames_data = pd.merge(geonames_data, backlinks_data.loc[:, ['backlink_count', 'geoname_ID']], 
                         left_on='geonames_ID', right_on='geoname_ID', how='outer')
geonames_data.drop('geoname_ID', axis=1, inplace=True)
geonames_data.head()

Unnamed: 0,geonames_ID,name,alternate_names,latitude,longitude,feature_class,feature_code,population,alternate_name_count,backlink_count
0,2986043.0,Pic de Font Blanca,"Pic de Font Blanca,Pic du Port",42.64991,1.53335,T,PK,0.0,2.0,
1,2994701.0,Roc Mélé,"Roc Mele,Roc Meler,Roc Mélé",42.58765,1.74028,T,MT,0.0,3.0,
2,3007683.0,Pic des Langounelles,Pic des Langounelles,42.61203,1.47364,T,PK,0.0,1.0,
3,3017832.0,Pic de les Abelletes,"Pic de la Font-Negre,Pic de la Font-Nègre,Pic ...",42.52535,1.73343,T,PK,0.0,3.0,
4,3017833.0,Estany de les Abelletes,"Estany de les Abelletes,Etang de Font-Negre,Ét...",42.52915,1.73362,H,LK,0.0,3.0,


In [77]:
geonames_data.loc[:, 'geonames_ID'] = geonames_data.loc[:, 'geonames_ID'].fillna(0, inplace=False).astype(int)

In [78]:
# replace nans
geonames_data.loc[:, 'backlink_count'] = geonames_data.loc[:, 'backlink_count'].fillna(-1, inplace=False)

In [79]:
print('%d/%d GeoNames entries have Wiki data'%
      (geonames_data[geonames_data.loc[:, 'backlink_count'] >= 0].shape[0], geonames_data.shape[0]))

525329/11686810 GeoNames entries have Wiki data


So only about 4.5% of all GeoNames entries (525,329 of 11,686,810) have Wiki data, which is extremely sparse.

We will need to extend the counts.
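As a cross-check on the coverage figure, here is a minimal sketch on toy data (the IDs and counts are made up). It assumes a left merge is acceptable here: the outer merge above turned `geonames_ID` into floats, which suggests that some backlink rows had no matching GeoNames entry and may have been counted in the total.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
geonames = pd.DataFrame({
    'geonames_ID': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
})
backlinks = pd.DataFrame({
    'geoname_ID': [2, 4, 99],      # 99 has no matching GeoNames row
    'backlink_count': [10, 0, 5],
})

# A left merge keeps exactly one row per GeoNames entry (given unique IDs),
# so geonames_ID stays an integer column and the unmatched backlink row
# (ID 99) does not inflate the denominator.
merged = geonames.merge(backlinks, left_on='geonames_ID', right_on='geoname_ID',
                        how='left').drop(columns='geoname_ID')

has_wiki = merged['backlink_count'].notna()
coverage = has_wiki.mean()
print('%d/%d entries covered (%.1f%%)' % (has_wiki.sum(), len(merged), 100 * coverage))
```

With this approach there is no need for a `-1` sentinel; `notna()` separates covered from uncovered entries directly.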

What kinds of entries do not have Wiki data?

In [81]:
geonames_data.shape

(11686810, 10)

In [83]:
import numpy as np
# pd.np was removed in pandas 2.0, so use numpy directly
np.random.seed(123)
non_wiki_data = geonames_data[geonames_data.loc[:, 'backlink_count'] == -1]
# shuffle data
non_wiki_data = non_wiki_data.loc[np.random.choice(non_wiki_data.index, size=non_wiki_data.shape[0], replace=False), :]
non_wiki_data.head(100)

Unnamed: 0,geonames_ID,name,alternate_names,latitude,longitude,feature_class,feature_code,population,alternate_name_count,backlink_count
1459317,1555591,Tianqiao,,30.61667,116.06667,P,PPL,0.0,1.0,-1.0
7495944,1334926,Mirza Hill,,34.10899,71.14543,T,HLL,0.0,1.0,-1.0
4071142,10603180,Bargāon,,27.49393,82.67275,P,PPL,0.0,1.0,-1.0
7639191,11095983,Shambaliak Jhal,,28.08794,66.79847,H,STMI,0.0,1.0,-1.0
10980560,7154846,Dillon County Fire Department Station 1,"Dillon County Fire Department,Dillon County Fi...",34.41260,-79.37802,S,BLDG,0.0,2.0,-1.0
8188035,8059397,Maloyuz,"Malojuz,Maloyuz,Малоюз",55.06330,57.58450,H,STM,0.0,3.0,-1.0
11113686,7644832,Holiday Inn Express Ft. Lauderdale Airport / C...,,26.09116,-80.15644,S,HTL,0.0,1.0,-1.0
11244183,9847280,Best Western Bayfront Inn,,29.89541,-81.31136,S,HTL,0.0,1.0,-1.0
1261732,235820,Zigbo,,4.82907,22.99818,H,STM,0.0,1.0,-1.0
566461,3294256,Bugojuše,,45.07222,18.37806,P,PPL,0.0,1.0,-1.0


Lots of small toponyms!

- `Maloyuz` => stream
- `Best Western Bayfront Inn` => hotel
- `Interchange 343` => highway exit
- `Qinggangcun` => village?
- `Gora Beskesik` => hill
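Eyeballing a shuffled sample only goes so far. One way to quantify which kinds of entries lack Wiki data is to group by feature class and code. The sketch below uses a toy table; the first four rows mirror the sample above, while `Somewhere` and its count are hypothetical.

```python
import pandas as pd

# Toy sample mimicking the merged table; -1 marks "no Wiki data" as above.
sample = pd.DataFrame({
    'name': ['Tianqiao', 'Mirza Hill', 'Maloyuz', 'Best Western Bayfront Inn', 'Somewhere'],
    'feature_class': ['P', 'T', 'H', 'S', 'P'],
    'feature_code': ['PPL', 'HLL', 'STM', 'HTL', 'PPL'],
    'backlink_count': [-1, -1, -1, -1, 12],
})

missing = sample['backlink_count'] < 0
# Fraction of entries without Wiki data, per feature class.
missing_rate = missing.groupby(sample['feature_class']).mean().sort_values(ascending=False)
# Most common feature codes among the uncovered entries.
missing_codes = sample.loc[missing, 'feature_code'].value_counts()
print(missing_rate)
print(missing_codes)
```

Run on the full `geonames_data`, the same two aggregations would show whether the gap is concentrated in classes such as `S` (spots, buildings) and `H` (streams, lakes).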

## Extend backlink counts
The original GeoNames data has administrative and country containment, which we can use to extend the backlink counts to smaller entities.

TODO: this naive extension still won't cover every comparison, in particular when the available levels don't match. For example, if TOPO_A has Wiki data at both admin4 and admin1 but TOPO_B only has it at admin1, do we weight the levels differently?
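The back-off idea can be sketched as follows. This is only an illustration under stated assumptions: `region_keys`, `backoff_count`, and the `region_counts` dictionary are hypothetical names, the counts are made up, and regions are keyed by dotted paths such as `AR.05.14147` in the style of the GeoNames admin code files.

```python
import pandas as pd

# Admin levels from largest to smallest; the country sits above admin1.
ADMIN_LEVELS = ['admin1', 'admin2', 'admin3', 'admin4']

def region_keys(row):
    """List (level, key) pairs from the smallest known region up to the country."""
    parts = [row['country_code']]
    keys = [('country', parts[0])]
    for level in ADMIN_LEVELS:
        code = row.get(level)
        if code is None or pd.isna(code):
            break  # deeper levels are unknown for this toponym
        parts.append(str(code))
        keys.append((level, '.'.join(parts)))
    return keys[::-1]

def backoff_count(row, region_counts):
    """Return (level, count) for the smallest containing region with a count."""
    for level, key in region_keys(row):
        if key in region_counts:
            return level, region_counts[key]
    return None, -1  # nothing found, same sentinel as the notebook

# Hypothetical backlink counts per region (made-up numbers).
region_counts = {'AR': 5000, 'AR.05': 120, 'AD': 800}

topo_a = {'country_code': 'AR', 'admin1': '05', 'admin2': '14147'}
topo_b = {'country_code': 'AD', 'admin1': '00'}
topo_c = {'country_code': 'ZZ'}

print(backoff_count(topo_a, region_counts))
print(backoff_count(topo_b, region_counts))
print(backoff_count(topo_c, region_counts))
```

Returning the level alongside the count keeps the information needed to address the mismatch question later, for example by comparing two toponyms only at the largest level they share.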

In [29]:
col_names = ['geonames_ID', 'country_code', 'cc2', 'admin1', 'admin2', 'admin3', 'admin4']
use_cols = [0, 8, 9, 10, 11, 12, 13]
geonames_admin_data = pd.read_csv('/hg190/corpora/GeoNames/allCountries.zip', sep='\t',
                                  compression='zip', 
                                  header=None, usecols=use_cols, names=col_names, 
                                  index_col=False, encoding='utf-8')
geonames_admin_data.head()


Unnamed: 0,geonames_ID,country_code,cc2,admin1,admin2,admin3,admin4
0,2986043,AD,,00,,,
1,2994701,AD,"AD,FR",00,,,
2,3007683,AD,"AD,FR",00,,,
3,3017832,AD,FR,A9,66.0,663.0,66146.0
4,3017833,AD,FR,A9,,,


In [49]:
# join with names so we have an idea of what we're doing
geonames_admin_data = pd.merge(geonames_admin_data, geonames_data.loc[:, ['geonames_ID', 'name']], 
                               on='geonames_ID', how='inner')
geonames_admin_data.head(50)

Unnamed: 0,geonames_ID,country_code,cc2,admin1,admin2,admin3,admin4,name
0,2986043,AD,,00,,,,Pic de Font Blanca
1,2994701,AD,"AD,FR",00,,,,Roc Mélé
2,3007683,AD,"AD,FR",00,,,,Pic des Langounelles
3,3017832,AD,FR,A9,66.0,663.0,66146.0,Pic de les Abelletes
4,3017833,AD,FR,A9,,,,Estany de les Abelletes
5,3023203,AD,,00,,,,Port Vieux de la Coume d’Ose
6,3029315,AD,"AD,FR",B3,9.0,91.0,9139.0,Port de la Cabanette
7,3034945,AD,,00,,,,Port Dret
8,3038814,AD,,07,,,,Costa de Xurius
9,3038815,AD,,04,,,,Font de la Xona


In [52]:
geonames_admin_data.iloc[10000:10050, :]

Unnamed: 0,geonames_ID,country_code,cc2,admin1,admin2,admin3,admin4,name
10000,1121064,AF,,26,,,,Zamch Āb
10001,1121065,AF,,9,,,,Darah-ye Zambūr Khān
10002,1121066,AF,,9,,,,Zanbūr Khān
10003,1121067,AF,,41,,,,Zanbūrkhū
10004,1121068,AF,,39,,,,Zambūray
10005,1121069,AF,,26,,,,Zanbūrak
10006,1121070,AF,,10,,,,Zamboli
10007,1121071,AF,,30,,,,Zambukan
10008,1121072,AF,,11,,,,Zamīn Kaj
10009,1121073,AF,,37,,,,Zambar


Notes:

- `00` as the admin1 region means that it cannot be resolved to an administrative region.
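- `NaN` as the cc2 code means that the entry lists no alternate country codes (cc2 holds a comma-separated list of additional ISO country codes), so the feature is presumably not shared with or claimed by another country.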
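To resolve a toponym's admin1 region to its own GeoNames ID, the `country_code` and `admin1` columns can be joined into the `CC.admin1` key format used by the admin1 code file. The sketch below uses toy rows copied from the tables in this notebook and assumes the admin codes are kept as strings. The `9` shown for some AF rows above suggests pandas parsed part of the column as numbers, in which case a key like `AF.9` would probably not match.

```python
import pandas as pd

# Toy rows taken from the tables in this notebook; admin codes kept as strings
# so that leading zeros survive.
toponyms = pd.DataFrame({
    'geonames_ID': [3038814, 3038815, 2986043],
    'name': ['Costa de Xurius', 'Font de la Xona', 'Pic de Font Blanca'],
    'country_code': ['AD', 'AD', 'AD'],
    'admin1': ['07', '04', '00'],
})
admin_1 = pd.DataFrame({
    'country.admin1': ['AD.04', 'AD.07'],
    'geonames_ID': [3040131, 3041566],
})

# Build the lookup key and rename the region ID so it does not clash
# with the toponym's own geonames_ID.
toponyms['country.admin1'] = toponyms['country_code'] + '.' + toponyms['admin1']
lookup = admin_1.rename(columns={'geonames_ID': 'admin1_geonames_ID'})
resolved = toponyms.merge(lookup, on='country.admin1', how='left')
print(resolved[['name', 'country.admin1', 'admin1_geonames_ID']])
```

The `00` row stays unresolved (`NaN`), which is consistent with the first note: such entries would have to back off to the country level.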

First: collect the backlink count for all available admin regions.

In [58]:
# country
country_data_cols = ['ISO', 'Population', 'geonameid']
country_data = pd.read_csv('/hg190/corpora/GeoNames/countryInfo.txt', sep='\t', comment='#', usecols=country_data_cols)
# admin 1
admin_1_data_cols = [0, 3]
admin_1_data_col_names = ['country.admin1', 'geonames_ID']
admin_1_data = pd.read_csv('/hg190/corpora/GeoNames/admin1CodesASCII.txt', sep='\t', 
                          header=None, usecols=admin_1_data_cols, names=admin_1_data_col_names)
# admin 2
admin_2_data_cols = [0, 3]
admin_2_data_col_names = ['country.admin2.geonames_ID', 'geonames_ID']
admin_2_data = pd.read_csv('/hg190/corpora/GeoNames/admin2Codes.txt', sep='\t', 
                           header=None, usecols=admin_2_data_cols, names=admin_2_data_col_names)
# admin 3

# admin 4

In [61]:
admin_1_data.head(20)

Unnamed: 0,country.admin1,geonames_ID
0,AD.06,3039162
1,AD.05,3039676
2,AD.04,3040131
3,AD.03,3040684
4,AD.02,3041203
5,AD.07,3041566
6,AD.08,3338529
7,AE.07,290595
8,AE.05,291075
9,AE.03,292224


In [63]:
test_country_admin1 = 'AF.01'
print(admin_1_data[admin_1_data.loc[:, 'country.admin1'] == test_country_admin1])

   country.admin1  geonames_ID
43          AF.01      1147745


In [65]:
admin_2_data.head(1000).tail(100)

Unnamed: 0,country.admin2.geonames_ID,geonames_ID
900,AR.05.14147,3836128
901,AR.02.10091,3836129
902,AR.18.70098,3836188
903,AR.10.38077,3836444
904,AR.10.38070,3836473
905,AR.23.94007,3836551
906,AR.13.50105,3836668
907,AR.10.38063,3836752
908,AR.01.06763,3836848
909,AR.13.50098,3836982


In [None]:
# ordered admin columns from smallest => largest
ORDERED_ADMINS = ['admin4', 'admin3', 'admin2', 'admin1']
def expand_data(geo_id, admin_data):
    """
    For a given ID without data,
    we assign it data by starting at the
    smallest available administrative region and backing
    off to larger regions until we get a hit.
    """
    g_data = admin_data[admin_data.loc[:, 'geonames_ID'] == geo_id].iloc[0, :]
    