# Get Wiki data for GeoNames
For every GeoNames ID:

- determine whether it has a Wiki page
- determine whether that Wiki page has a counter for incoming links (backlinks)
- mine that counter
- add the count as a new feature

In [1]:
import pandas as pd

## Load data

In [2]:
geonames_data = pd.read_csv('/hg190/corpora/GeoNames/allCountriesSimplified.tsv', sep='\t', index_col=False, encoding='utf-8')

In [3]:
geonames_data.head()

Unnamed: 0,geonames_ID,name,alternate_names,latitude,longitude,feature_class,feature_code,population,alternate_name_count
0,2986043,Pic de Font Blanca,"Pic de Font Blanca,Pic du Port",42.64991,1.53335,T,PK,0,2
1,2994701,Roc Mélé,"Roc Mele,Roc Meler,Roc Mélé",42.58765,1.74028,T,MT,0,3
2,3007683,Pic des Langounelles,Pic des Langounelles,42.61203,1.47364,T,PK,0,1
3,3017832,Pic de les Abelletes,"Pic de la Font-Negre,Pic de la Font-Nègre,Pic ...",42.52535,1.73343,T,PK,0,3
4,3017833,Estany de les Abelletes,"Estany de les Abelletes,Etang de Font-Negre,Ét...",42.52915,1.73362,H,LK,0,3


The basic procedure is to query each GeoNames ID and read the Wiki link from the response.

GeoNames ID --> Wiki ID --> Wiki stats

JUST KIDDING: it turns out that some (all?) of the Wiki links are already in the [alternate names file](http://download.geonames.org/export/dump/alternateNames.zip), stored as alternate names whose `isolanguage` value is `link`. We downloaded that file [here](/hg190/corpora/GeoNames/alternateNames.txt).

In [7]:
alternate_names = pd.read_csv('/hg190/corpora/GeoNames/alternateNames.txt', sep='\t', header=None, index_col=False, encoding='utf-8')
alternate_names.fillna('', inplace=True)
alternate_names.columns = ['alternateNameId', 'geonameid', 'isolanguage', 
                           'alternate name', 'isPreferredName', 'isShortName', 
                           'isColloquial', 'isHistoric']
alternate_names.head()

Unnamed: 0,alternateNameId,geonameid,isolanguage,alternate name,isPreferredName,isShortName,isColloquial,isHistoric
0,1282063,2986043,,Pic du Port,,,,
1,4274490,2986043,,Pic de Font Blanca,,,,
2,1284819,2994701,,Roc Mélé,,,,
3,1284820,2994701,,Roc Meler,,,,
4,4285256,3007683,,Pic des Langounelles,,,,


In [18]:
print('%d total alt names'%(alternate_names.shape[0]))
print('%d total unique toponyms'%(alternate_names.loc[:, 'geonameid'].nunique()))

12901541 total alt names
5887606 total unique toponyms


In [8]:
alternate_names_with_isolanguage = alternate_names[alternate_names.loc[:, 'isolanguage'] != '']
print('%d/%d names with isolanguage'%(alternate_names_with_isolanguage.shape[0],
                                      alternate_names.shape[0]))
alternate_names_with_isolanguage.head()

6760586/12901541 names with isolanguage


Unnamed: 0,alternateNameId,geonameid,isolanguage,alternate name,isPreferredName,isShortName,isColloquial,isHistoric
17,2968580,3038816,link,http://en.wikipedia.org/wiki/Xixerella,,,,
23,6905524,3038832,link,http://en.wikipedia.org/wiki/Vila%2C_Andorra,,,,
42,2181137,3038886,ca,Pic de Tristaina,1.0,,,
46,11292426,3038926,link,http://en.wikipedia.org/wiki/Monturull,,,,
49,6900415,3038987,link,http://en.wikipedia.org/wiki/Sorn%C3%A0s,,,,


In [37]:
import re
# Match Wikipedia article URLs in any language edition (dots escaped so they match literally).
wiki_matcher = re.compile(r'http://.*wikipedia\.org.*/wiki/.*')
if wiki_matcher.match('http://en.wikipedia.org/wiki/Xixerella'):
    print('match')
# Boolean mask over the rows that carry a language code, then the filtered frame.
has_wiki_link = alternate_names_with_isolanguage.loc[:, 'alternate name'].apply(lambda x: wiki_matcher.match(x) is not None)
alternate_names_with_wiki = alternate_names_with_isolanguage[has_wiki_link]
# Keep one row per toponym to count how many toponyms have at least one link.
alternate_names_with_wiki_unique = alternate_names_with_wiki.drop_duplicates('geonameid', inplace=False)
print('%d/%d names with wiki link'%(alternate_names_with_wiki.shape[0],
                                    alternate_names.shape[0]))
print('%d/%d unique names with wiki link'%(alternate_names_with_wiki_unique.shape[0],
                                           alternate_names.shape[0]))

match
558960/12901541 names with wiki link
548367/12901541 unique names with wiki link


In [38]:
alternate_names_with_wiki.head()

Unnamed: 0,alternateNameId,geonameid,isolanguage,alternate name,isPreferredName,isShortName,isColloquial,isHistoric
17,2968580,3038816,link,http://en.wikipedia.org/wiki/Xixerella,,,,
23,6905524,3038832,link,http://en.wikipedia.org/wiki/Vila%2C_Andorra,,,,
46,11292426,3038926,link,http://en.wikipedia.org/wiki/Monturull,,,,
49,6900415,3038987,link,http://en.wikipedia.org/wiki/Sorn%C3%A0s,,,,
58,2968581,3038999,link,http://en.wikipedia.org/wiki/Soldeu,,,,


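OK! About 9% of the toponyms in the alternate names file (548367 of 5887606) have a Wiki link. The counts printed above divide by all 12901541 alternate-name rows, which gives about 4%. Either way the share is small, which makes sense since the bulk of toponyms are highly local.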
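
The two shares can be checked directly from the counts printed in the cells above. This is only a quick arithmetic sketch; the variable names are made up for illustration.

```python
# Counts copied from the cell outputs above.
total_alt_names = 12901541     # rows in alternateNames.txt
total_toponyms = 5887606       # unique geonameid values in that file
toponyms_with_wiki = 548367    # unique geonameid values with a Wiki link

# Share relative to all alternate-name rows (what the print statements report).
share_of_rows = toponyms_with_wiki / total_alt_names
# Share relative to unique toponyms (what "toponyms with a Wiki link" means).
share_of_toponyms = toponyms_with_wiki / total_toponyms

print('%.1f%% of alternate-name rows' % (100 * share_of_rows))
print('%.1f%% of unique toponyms' % (100 * share_of_toponyms))
```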

Where are the duplicate rows coming from? (note the difference in "names with wiki link" versus "unique names")

In [39]:
geoname_counts = alternate_names_with_wiki.loc[:, 'geonameid'].value_counts().sort_values(inplace=False, ascending=False)
geoname_counts.head()

311046     14
1850147     8
1668341     8
4140963     8
6252001     6
Name: geonameid, dtype: int64

In [40]:
test_id = 311046
print(alternate_names_with_wiki[alternate_names_with_wiki.loc[:, 'geonameid'] == test_id])

          alternateNameId  geonameid isolanguage  \
11090128          2922851     311046        link   
11090129          3047645     311046        link   
11090131          7948538     311046        link   
11090132          7948539     311046        link   
11090133          7948540     311046        link   
11090134          7948541     311046        link   
11090135          7948542     311046        link   
11090136          7948543     311046        link   
11090137          7948544     311046        link   
11090138          7948545     311046        link   
11090139          7948546     311046        link   
11090140          7948547     311046        link   
11090141          7948548     311046        link   
11090142          7948549     311046        link   

                                             alternate name isPreferredName  \
11090128                http://en.wikipedia.org/wiki/Smyrna                   
11090129  http://ru.wikipedia.org/wiki/%D0%98%D0%B7%D0%B...  

OK! The duplicate rows just point to different languages.
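
Looking at one ID is suggestive, but the same claim can be checked for every toponym at once: if each duplicate row is a different language edition, the number of links per `geonameid` equals the number of distinct languages. The sketch below runs on a toy frame; only the English Smyrna and Xixerella URLs come from the data above, the second Smyrna row is hypothetical, and the `lang` column name is an assumption.

```python
import pandas as pd

# Toy stand-in for alternate_names_with_wiki.
toy_links = pd.DataFrame({
    'geonameid': [311046, 311046, 3038816],
    'alternate name': [
        'http://en.wikipedia.org/wiki/Smyrna',
        'http://de.wikipedia.org/wiki/Smyrna',  # hypothetical second-language row
        'http://en.wikipedia.org/wiki/Xixerella',
    ],
})

# Language code = first label of the host name (e.g. "en" in en.wikipedia.org).
toy_links['lang'] = toy_links['alternate name'].str.extract(
    r'^https?://([^./]+)\.', expand=False)

langs_per_toponym = toy_links.groupby('geonameid')['lang'].nunique()
links_per_toponym = toy_links.groupby('geonameid').size()

# True only if no toponym has two links in the same language.
all_duplicates_are_languages = (langs_per_toponym == links_per_toponym).all()
print(langs_per_toponym)
print(all_duplicates_are_languages)
```

On the real data, the same three lines would be applied to `alternate_names_with_wiki` instead of `toy_links`.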

Let's collect the Wiki page titles to feed to a backlink calculator ([here](https://dispenser.info.tm/~dispenser/cgi-bin/backlinkscount.py?title=Title)).

In [41]:
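# NOTE: wiki_title_matcher is defined again in the next cell; it is compiled here
# as well so that this check also works when the notebook is run top to bottom.
wiki_title_matcher = re.compile(r'.*(?<=wiki/)(.*)')
# Sanity check: stop at the first URL from which no title can be extracted.
for a, a_row in alternate_names_with_wiki.iterrows():
    a_url = a_row.loc['alternate name']
    if not wiki_title_matcher.match(a_url):
        print('bad URL %s'%(a_url))
        break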

In [42]:
# Title: everything after the last "wiki/" in the URL.
wiki_title_matcher = re.compile(r'.*(?<=wiki/)(.*)')
# Language: the first host label, e.g. "en" in en.wikipedia.org.
wiki_lang_matcher = re.compile(r'http://([a-z]+)\..*')
test_title = 'http://en.wikipedia.org/wiki/Smyrna'
test_match = wiki_title_matcher.match(test_title)
if test_match:
    print(test_match.group(1))
    print(wiki_lang_matcher.match(test_title).group(1))
wiki_titles = alternate_names_with_wiki.loc[:, 'alternate name'].apply(lambda x: wiki_title_matcher.match(x).group(1))
wiki_title_lang = alternate_names_with_wiki.loc[:, 'alternate name'].apply(lambda x: wiki_lang_matcher.match(x).group(1))

Smyrna
en
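
Some of the links above are percent-encoded (for example `Sorn%C3%A0s` and `Vila%2C_Andorra`), and the regular expressions keep that encoding in the extracted title. As an alternative sketch, the standard library's `urllib.parse` can split the URL and decode the title in one step. The helper name `split_wiki_url` is an assumption for illustration, and whether the backlink calculator wants encoded or decoded titles is not verified here.

```python
from urllib.parse import urlsplit, unquote

WIKI_HOST_SUFFIX = '.wikipedia.org'
WIKI_PATH_PREFIX = '/wiki/'

def split_wiki_url(url):
    """Return (language, decoded title) for a Wikipedia article URL, else None."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.endswith(WIKI_HOST_SUFFIX):
        return None
    if not parts.path.startswith(WIKI_PATH_PREFIX):
        return None
    lang = host[:-len(WIKI_HOST_SUFFIX)]
    # unquote turns percent-escapes such as %C3%A0 back into characters.
    title = unquote(parts.path[len(WIKI_PATH_PREFIX):])
    return lang, title

print(split_wiki_url('http://en.wikipedia.org/wiki/Smyrna'))
print(split_wiki_url('http://en.wikipedia.org/wiki/Sorn%C3%A0s'))
print(split_wiki_url('http://en.wikipedia.org/wiki/Vila%2C_Andorra'))
```

Returning `None` for anything that does not look like an article URL plays the same role as the "bad URL" check in the earlier cell.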


Now! For each title, query the backlink calculator.

In [51]:
import requests
backlink_url = 'https://dispenser.info.tm/~dispenser/cgi-bin/backlinkscount.py?title=%s'
wiki_titles_unique = wiki_titles.unique()
test_title = 'Smyrna'
test_url = backlink_url%(test_title)

In [53]:
%%timeit
test_response = requests.get(test_url)
print(test_response.text)

1926

1926

1926

1926

1 loop, best of 3: 558 ms per loop


Looks like "Smyrna" got 1926 backlinks. Is that right?

Yup! We see [here](https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Smyrna&limit=500) that "Smyrna" has 1926 backlinks, so it appears that the measure is accurate.

Let's migrate this code to a separate Python file to mine properly.
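
A minimal sketch of what that separate file could look like, using only the standard library. The function names (`fetch_backlink_count`, `mine_backlinks`), the half-second delay, and the choice to record failures as `None` are assumptions for illustration, not the script that was actually written. It also assumes the endpoint replies with a bare integer, as it did for "Smyrna". At roughly 0.56 s per request, as timed above, the 558960 link rows would need on the order of 87 hours if queried one after another, so skipping titles that were already fetched matters.

```python
import time
import urllib.parse
import urllib.request

# URL template taken from the notebook cell above.
BACKLINK_URL = 'https://dispenser.info.tm/~dispenser/cgi-bin/backlinkscount.py?title=%s'

def fetch_backlink_count(title, timeout=10):
    """Query the backlink counter for one title; return an int, or None."""
    # Decode first so that already percent-encoded titles are not encoded twice.
    safe_title = urllib.parse.quote(urllib.parse.unquote(title), safe='')
    with urllib.request.urlopen(BACKLINK_URL % safe_title, timeout=timeout) as response:
        text = response.read().decode('utf-8').strip()
    return int(text) if text.isdigit() else None

def mine_backlinks(titles, fetch=fetch_backlink_count, delay=0.5, known=None):
    """Collect backlink counts for titles, skipping those already in known."""
    counts = dict(known or {})
    for title in titles:
        if title in counts:
            continue  # already fetched (or a duplicate title)
        try:
            counts[title] = fetch(title)
        except OSError:
            # Network errors and timeouts: record the gap and move on.
            counts[title] = None
        time.sleep(delay)  # be polite to the server
    return counts

# Offline demonstration with a stand-in fetcher (no network access needed).
fake_counts = {'Smyrna': 1926}
result = mine_backlinks(['Smyrna', 'Smyrna', 'Xixerella'],
                        fetch=fake_counts.get, delay=0)
print(result)
```

Passing the fetcher as a parameter keeps the loop testable without hitting the server; the `known` argument lets an interrupted run resume from previously saved counts.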