In [1]:
import pandas as pd
import utils

# Create a toponym matching training dataset

This notebook shows how to generate a WikiGazetteer (that is, a gazetteer generated from Wikipedia entries with coordinates), and how to build a toponym matching training set for DeezyMatch. The resulting dataset will have the following format:

Toponym 1 | Toponym 2 | Matching
----------|-----------|---------
Cala Egos | La Fuensanta | False
Cala Egos | Cala Pada | False
Cala Egos | Cala Nova | False
Cala Egos | Cala Egos | True
Cala Egos | Caló de ses Egos | True
Cala Egos | Caló de ses Egües | True

Note that this process may take several hours!

## Step 1: Generate a WikiGazetteer

WikiGazetteer is a gazetteer based on Wikipedia and enriched with GeoNames data. To build a WikiGazetteer as a MySQL database for a specific Wikipedia language and version, follow [these instructions](https://github.com/Living-with-machines/lwm_GIR19_resolving_places/tree/master/gazetteer_construction).

***

<font color='darkred'>

**Note:** You can skip this step if you wish to create a toponym matching training set from a different gazetteer (e.g. GeoNames). Step 2 expects its input in the following format:

| &nbsp; | name | wikititle | latitude | longitude | source
| -- | -- | -- | -- | -- | --
| 416894 | Sierra del Caurel | Sierra_del_Caurel | 42.5833 | -7.16667 | wikimain
| 416895 | Serra do Courel | Sierra_del_Caurel | 42.5833 | -7.16667 | wikiredirect
| 416896 | Sierra del Courel | Sierra_del_Caurel | 42.5833 | -7.16667 | wikiredirect

The first (unnamed) column contains the alternate name id. Column `name` contains an alternate name of the location identified by columns `wikititle`, `latitude` and `longitude` (if you are using your own gazetteer, `wikititle` can hold the location id instead). As the example table shows, there should be one row for each alternate name of each location. Column `source` records where the alternate name comes from (this column can be empty).

</font>
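
If you are bringing your own gazetteer, the following is a minimal sketch of how a table in this format could be assembled with pandas. The `locations` records, their `loc_001`-style ids, and the variable names are hypothetical and only illustrate the "one row per alternate name" layout; the coordinates are copied from the examples in this notebook.

```python
import pandas as pd

# Hypothetical source data: one record per location, each with its alternate names.
locations = [
    {"id": "loc_001", "latitude": 42.5833, "longitude": -7.16667,
     "names": ["Sierra del Caurel", "Serra do Courel", "Sierra del Courel"]},
    {"id": "loc_002", "latitude": 38.8199, "longitude": 0.159539,
     "names": ["Torre del Gerro"]},
]

# Expand to one row per alternate name of each location.
rows = [
    {"name": name, "wikititle": loc["id"], "latitude": loc["latitude"],
     "longitude": loc["longitude"], "source": ""}  # source may stay empty
    for loc in locations
    for name in loc["names"]
]

own_gaz = pd.DataFrame(rows, columns=["name", "wikititle", "latitude", "longitude", "source"])
print(own_gaz)
```

Saving the result with `own_gaz.to_pickle(...)` would let the `pd.read_pickle` call in Step 2 load it.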

***

Once you have generated a WikiGazetteer, you can create a minimal version of it directly from MySQL. Minimal means that it contains only alternate names, location ids, and coordinates, in the format shown in the table above. To do so, update the MySQL user name, the password, and the name of your WikiGazetteer DB below:

In [2]:
wikigaz_db = "wikiGazES" # Name of your WikiGazetteer DB in the MySQL server
username_db = "testGazetteer" # Your MySQL user name
password_db = "1234" # Your MySQL password
min_wikigaz = "wikigaz_es" # Output name for the minimal WikiGazetteer from which we will create a training set

In [3]:
# Create a minimal wikigazetteer from MySQL server:
wgdf = utils.create_minimal_gaz(wikigaz_db, username_db, password_db, min_wikigaz)

## Step 2: Create a training set

Create a toponym matching training set from a local WikiGazetteer (or your own gazetteer in the right format).
* You can restrict your WikiGazetteer based on one or multiple bounding boxes (for testing purposes, it can be useful to make this fairly small)

In [4]:
# If you have skipped Step 1, load your own gazetteer here, updating this line as needed:
wgdf = pd.read_pickle(min_wikigaz + ".pkl")

In [5]:
wgdf.head()

Unnamed: 0,name,wikititle,latitude,longitude,source
0,Stupinigi,Stupinigi,44.9833,7.6,wikimain
2,Orto botanico Lorenzo Rota,"Jardín_Botánico_de_Bergamo_""Lorenzo_Rota""",45.7078,9.65778,wikiredirect
3,Moen,Moen,7.43333,151.85,wikimain
4,Torre del Gerro,Torre_del_Gerro,38.8199,0.159539,wikimain
5,Cerro Bayo,Cerro_Bayo_(desambiguación),-25.4167,-68.5833,wikimain


In [6]:
# Each bounding box is a list with coordinate limits: [W, S, E, N]
bboxes = [[-9.86, 35.84, 3.7, 43.88], # Iberian Peninsula
          [-18.71, 27.26, -12.99, 29.79], # Canary Islands
          [-120.6, -56.7, -32.4, 33.9]] # Ibero-America

# For testing: a small bounding box (the Balearic Islands).
# This assignment overrides the list above; comment it out to use the full set of bounding boxes.
bboxes = [[0.9, 38.42, 4.76, 40.42]] # Balearic Islands

In [7]:
# Filter minimal gazetteer by coordinate bounding boxes:
wgdf = wgdf[wgdf.apply(lambda x: utils.filter_gaz_by_bbox(x["latitude"], x["longitude"], bboxes), axis=1)]
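
The implementation of `utils.filter_gaz_by_bbox` is not shown in this notebook. As a rough sketch of the kind of check it presumably performs, the hypothetical helper `in_any_bbox` below tests whether a point falls inside at least one `[W, S, E, N]` box. Inclusive bounds and the absence of any antimeridian handling are assumptions made for this illustration.

```python
def in_any_bbox(lat, lon, bboxes):
    """Return True if (lat, lon) lies inside at least one [W, S, E, N] box."""
    return any(w <= lon <= e and s <= lat <= n for w, s, e, n in bboxes)


balearic = [[0.9, 38.42, 4.76, 40.42]]

print(in_any_bbox(39.57, 2.65, balearic))    # a point on Mallorca: True
print(in_any_bbox(44.9833, 7.6, balearic))   # Stupinigi, from the table above: False
```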

In [8]:
# Show size of filtered gazetteer (number of rows is the number of altname-location combinations)
wgdf.shape

(1341, 5)

In [9]:
# Create the toponym matching training dataset

titles_per_chunk = 1000 # number of titles processed per chunk
kilometre_distance = 20 # minimum distance in km for negative toponym pairs (i.e.
                        # toponyms of locations closer than this many km will not
                        # be selected as negative matches)
N = "default" # Number of CPUs (default: all available CPUs)
dataset_name = "wikigaz_es_topmatching.txt" # Name of the output file where the
                                            # toponym matching dataset is stored

# Create the dataset:
utils.create_pairmatch_dataset(N, titles_per_chunk, wgdf, kilometre_distance, dataset_name)

100%|██████████| 1/1 [00:00<00:00,  6.32it/s]
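
To make the role of `kilometre_distance` more concrete, here is a minimal sketch of a distance test for candidate negative pairs. It uses the haversine great-circle formula with a mean Earth radius of 6371 km. Whether `utils.create_pairmatch_dataset` computes distances this way, and whether its threshold is inclusive, are assumptions; the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def far_enough_for_negative(loc_a, loc_b, min_km=20):
    """True if two (lat, lon) locations are at least min_km apart (assumed inclusive)."""
    return haversine_km(*loc_a, *loc_b) >= min_km


# Two points 0.1 degrees of latitude apart are roughly 11 km apart,
# so they would be rejected as a negative pair with a 20 km threshold.
print(far_enough_for_negative((39.5, 2.6), (39.6, 2.6)))
```

Excluding nearby locations from the negative pairs reduces the chance that two names of the same or an overlapping place are labelled as non-matching.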


In [12]:
# View the last 20 entries of your toponym matching dataset
pd.read_csv(dataset_name, sep="\t", names=["Toponym1", "Toponym2", "Matching"]).tail(20)

Unnamed: 0,Toponym1,Toponym2,Matching
1414,Avenida Jaime III,Sant Mateu d Albarca,False
1415,Avenida Jaime III,Avenida Jaime III,True
1416,Aeródromo de Son Bonet,Estación de Verge de Lluc,False
1417,Aeródromo de Son Bonet,Aeródromo de Son Bonet,True
1418,Ullaró,Torre Asima,False
1419,Ullaró,Ullaró,True
1420,Ullaró,Soller,False
1421,Ullaró,Ullaro,True
1422,Buñola,Sa Cabaneta,False
1423,Buñola,El Terreno,False
