## Toponym Interlinking via Ensemble Methods

This notebook combines the following three approaches, via weighted averaging and stacking, to improve accuracy on the toponym interlinking task:
- BERT model, as implemented in the notebook "1-BERT.ipynb"
- String-similarity features, as implemented in https://github.com/LinkGeoML/LGM-Interlinking and used to train a Random Forest classifier
- Siamese RNN, as implemented in https://github.com/LuisPB7/StringMatching


Load the train, validation (val) and test datasets.

In [1]:
import numpy as np
import pandas as pd
import os

train_df = pd.read_csv('data/train.csv')
val_df = pd.read_csv('data/val.csv')
test_df = pd.read_csv('data/test.csv')

train_labels = train_df['label']
val_labels = val_df['label']
test_labels = test_df['label']

print('Number of train instances:', train_df.shape[0])
print('Number of val instances:', val_df.shape[0])
print('Number of test instances:', test_df.shape[0])

Number of train instances: 1999994
Number of val instances: 499999
Number of test instances: 2499991


### BERT

Load the BERT predictions created by the code in "1-BERT.ipynb".

In [2]:
bert_val_preds = np.load('preds/bert_val_preds.npy')
bert_test_preds = np.load('preds/bert_test_preds.npy')

print(bert_val_preds.shape, bert_test_preds.shape)

(499999, 2) (2499991, 2)


In [3]:
from sklearn.metrics import accuracy_score

print('Val accuracy:', accuracy_score(val_labels, np.argmax(bert_val_preds, axis=1)))
print('Test accuracy:', accuracy_score(test_labels, np.argmax(bert_test_preds, axis=1)))

Val accuracy: 0.8867977735955472
Test accuracy: 0.8869371929738947
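
Each row of a prediction array holds two class scores, and `np.argmax(..., axis=1)` picks the index of the larger one as the predicted label. A minimal sketch on made-up numbers; the helper name `accuracy_from_probs` and the toy values are illustrative and not part of the notebook:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy_from_probs(labels, probs):
    """Accuracy of the argmax decision for an (n_samples, n_classes) score array."""
    return accuracy_score(labels, np.argmax(probs, axis=1))

# Toy data: three samples, two classes (hypothetical values).
toy_probs = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.6, 0.4]])
toy_labels = np.array([0, 1, 1])

# The argmax decisions are [0, 1, 0], so two of the three labels match.
toy_acc = accuracy_from_probs(toy_labels, toy_probs)
```

The same helper could be applied to any of the three models' arrays, since they all end up in the `(n_samples, 2)` layout.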


### Similarity-ML

Build the string-similarity features defined in https://github.com/LinkGeoML/LGM-Interlinking.

The repository's code was minimally adapted to make the following snippet work.

In [4]:
from LGM_Interlinking.interlinking.helpers import StaticValues
from LGM_Interlinking.interlinking import pre_process
from LGM_Interlinking.interlinking.sim_measures import LGMSimVars
from LGM_Interlinking.interlinking.features import Features

encoding = 'global'
LGMSimVars.per_metric_optValues = StaticValues.opt_values[encoding]
pre_process.extract_freqterms('data/train.csv', encoding)

train_feats = Features()
train_feats.load_data('data/train.csv', encoding)
train_feats = train_feats.build()

val_feats = Features()
val_feats.load_data('data/val.csv', encoding)
val_feats = val_feats.build()

test_feats = Features()
test_feats.load_data('data/test.csv', encoding)
test_feats = test_feats.build()

train_feats.shape, val_feats.shape, test_feats.shape

((1999994, 43), (499999, 43), (2499991, 43))

Train a Random Forest classifier on the training features, then predict class probabilities for the val and test sets and save them.

In [5]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)
rf.fit(train_feats, train_labels)

rf_val_preds = rf.predict_proba(val_feats)
rf_test_preds = rf.predict_proba(test_feats)

np.save('preds/rf_val_preds.npy', rf_val_preds)
np.save('preds/rf_test_preds.npy', rf_test_preds)

In [6]:
print('Val accuracy:', accuracy_score(val_labels, np.argmax(rf_val_preds, axis=1)))
print('Test accuracy:', accuracy_score(test_labels, np.argmax(rf_test_preds, axis=1)))

Val accuracy: 0.8598177196354393
Test accuracy: 0.8602514969053888
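
`predict_proba` returns one column per class, ordered as in the fitted estimator's `classes_` attribute, so the `argmax` index equals the label only when the labels are `0` and `1` in that order. A small self-contained check on synthetic data; the data and the `toy_` names are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)  # synthetic binary labels

toy_rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
toy_probs = toy_rf.predict_proba(X)

# Column j of predict_proba corresponds to toy_rf.classes_[j].
classes_in_order = list(toy_rf.classes_) == [0, 1]

# Mapping the argmax indices through classes_ reproduces predict().
same = np.array_equal(toy_rf.classes_[np.argmax(toy_probs, axis=1)],
                      toy_rf.predict(X))
```

For the notebook's classifier, the equivalent check would be to inspect `rf.classes_` after fitting.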


### Siamese-RNN

Train the Siamese RNN defined in https://github.com/LuisPB7/StringMatching (or load its saved weights if they are available), then run it on the val and test sets.

The repository's code was minimally adapted to make the following snippet work.

In [7]:
import torch
from StringMatching.RSModel import RSModel
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_lightning import Trainer

model = RSModel('rs', 'train', 'val')

early_stop_callback = EarlyStopping(
    monitor='loss',
    min_delta=0.00,
    patience=3,
    verbose=True,
    mode='min'
)

trainer = Trainer(gpus=1, show_progress_bar=True,
                  max_nb_epochs=20, early_stop_callback=early_stop_callback)

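try:
    # Reuse previously trained weights when a checkpoint is available.
    model.load_state_dict(torch.load(os.path.join(os.getcwd(), 'StringMatching/weights/rs-train.pt')), strict=True)
    print("Successfully loaded weights")
except Exception:
    # No usable checkpoint: train from scratch.
    trainer.fit(model)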
trainer.test(model)


model = RSModel('rs', 'train', 'test')
model.load_state_dict(torch.load(os.path.join(os.getcwd(), 'StringMatching/weights/rs-train.pt')), strict=True)
trainer.test(model)

INFO:root:gpu available: True, used: True
INFO:root:VISIBLE GPUS: 0
INFO:root:
              Name     Type Params
0             lin1   Linear   28 K
1             lin2   Linear   61  
2             relu     ReLU    0  
3          dropout  Dropout    0  
4          sigmoid  Sigmoid    0  
5          Encoder  Encoder  158 K
6  Encoder.dropout  Dropout    0  
7     Encoder.relu     ReLU    0  
8     Encoder.gru1      GRU   92 K
9     Encoder.gru2      GRU   65 K


Epoch 12: 100%|██████████| 7813/7813 [06:16<00:00, 21.21batch/s, batch_idx=7812, gpu=0, loss=0.243, v_num=0]

INFO:root:Epoch 00012: early stopping


Epoch 12: 100%|██████████| 7813/7813 [06:17<00:00, 20.71batch/s, batch_idx=7812, gpu=0, loss=0.243, v_num=0]


Testing: 100%|██████████| 1954/1954 [01:10<00:00, 27.53batch/s]



Testing: 100%|██████████| 9766/9766 [05:48<00:00, 28.02batch/s]


Load the Siamese RNN predictions for the val and test sets.

In [8]:
rs_val_preds = np.load('preds/rs_val_preds.npy')
rs_test_preds = np.load('preds/rs_test_preds.npy')

print(rs_val_preds.shape, rs_test_preds.shape)

(499999, 1) (2499991, 1)


The Siamese RNN outputs a single number per sample, which is treated here as the score `p` of the positive class. This cell converts it to the two-column format `[1 - p, p]` used by the first two models.

In [9]:
def to_two_columns(p):
    # p holds the positive-class score; prepend the complement 1 - p.
    # Arrays that already have two columns are returned unchanged, because
    # this cell overwrites the files loaded above and may be re-run.
    return p if p.shape[1] == 2 else np.hstack([1 - p, p])

rs_val_preds = to_two_columns(rs_val_preds)
np.save('preds/rs_val_preds.npy', rs_val_preds)

rs_test_preds = to_two_columns(rs_test_preds)
np.save('preds/rs_test_preds.npy', rs_test_preds)

print(rs_val_preds.shape, rs_test_preds.shape)

(499999, 2) (2499991, 2)


In [10]:
print('Val accuracy:', accuracy_score(val_labels, np.argmax(rs_val_preds, axis=1)))
print('Test accuracy:', accuracy_score(test_labels, np.argmax(rs_test_preds, axis=1)))

Val accuracy: 0.8917057834115668
Test accuracy: 0.8916136098089953
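
Both ensembling steps that follow assume the three prediction arrays are row-aligned and share the `(n_samples, 2)` layout. A hedged sketch of a check that could be run first; `check_alignment` is a hypothetical helper, and the toy arrays stand in for the real predictions:

```python
import numpy as np

def check_alignment(n_samples, **preds):
    """Raise ValueError if any named array is not of shape (n_samples, 2)."""
    for name, p in preds.items():
        if p.shape != (n_samples, 2):
            raise ValueError(f"{name}: expected {(n_samples, 2)}, got {p.shape}")
    return True

# Toy stand-ins for two models' predictions on two samples.
toy_a = np.array([[0.7, 0.3], [0.1, 0.9]])
toy_b = np.array([[0.6, 0.4], [0.2, 0.8]])
ok = check_alignment(2, a=toy_a, b=toy_b)
```

In this notebook it could be called as `check_alignment(len(val_labels), bert=bert_val_preds, rf=rf_val_preds, rs=rs_val_preds)`.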


## Averaging

Search, on the val set, for the weights that maximise accuracy when averaging the three models' predictions. Candidate weights lie on a grid with step 0.05 and must sum to 1.

In [11]:
from itertools import product

weights = [p for p in product(np.arange(0.0, 1.05, 0.05), repeat=3) if sum(p) == 1.0]
best_weights, best_acc = None, 0

for w in weights:
    val_preds = np.average([bert_val_preds, rf_val_preds, rs_val_preds], axis=0, weights=w)
    acc = accuracy_score(val_labels, np.argmax(val_preds, axis=1))
    if acc > best_acc:
        best_acc = acc
        best_weights = w

best_weights, best_acc

((0.15000000000000002, 0.35000000000000003, 0.5), 0.9073398146796293)
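
The candidate list in the cell above is filtered with an exact floating-point test (`sum(p) == 1.0`). Because grid values such as `0.30000000000000004` are not exact multiples of 0.05, some valid combinations can fail that test and are never tried. One possible alternative, sketched here under the assumption that a step of 1/20 is intended, enumerates integer counts and divides afterwards; `simplex_grid` is a hypothetical helper name:

```python
from itertools import product

import numpy as np

def simplex_grid(n_models=3, steps=20):
    """All weight tuples on a 1/steps grid whose entries sum to 1."""
    # Work with integer counts so the sum-to-one check is exact.
    return [tuple(c / steps for c in counts)
            for counts in product(range(steps + 1), repeat=n_models)
            if sum(counts) == steps]

weights = simplex_grid()

# For comparison: the float grid filtered with an exact equality test.
float_grid = [p for p in product(np.arange(0.0, 1.05, 0.05), repeat=3)
              if sum(p) == 1.0]

print(len(weights), len(float_grid))
```

Whether the extra candidates would change the selected weights cannot be known without re-running the search.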

Evaluate the weighted average on the test set, using the weights selected on the val set.

In [12]:
test_preds = np.average([bert_test_preds, rf_test_preds, rs_test_preds], axis=0, weights=best_weights)
print('Test accuracy:', accuracy_score(test_labels, np.argmax(test_preds, axis=1)))

Test accuracy: 0.9073948666215198


## Stacking

Concatenate the three models' val predictions into a six-column feature matrix and train an MLP on it as a 'meta-classifier'. Evaluate it on the stacked test predictions.

In [13]:
from sklearn.neural_network import MLPClassifier

val_preds = np.hstack([bert_val_preds, rf_val_preds, rs_val_preds])
test_preds = np.hstack([bert_test_preds, rf_test_preds, rs_test_preds])

mlp = MLPClassifier(learning_rate='adaptive').fit(val_preds, val_labels)
preds = mlp.predict(test_preds)
print('Test accuracy:', accuracy_score(test_labels, preds))

Test accuracy: 0.9082272696181706
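
The stacking step can be illustrated end to end on synthetic data. The sketch below simulates three base models, stacks their two-column outputs into six features, fits a meta-classifier on the validation part and scores it on the test part. It swaps in `LogisticRegression` for the MLP to keep the example fast, and every name and number in it is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def fake_model_probs(labels, noise):
    """Simulate a base model: noisy positive-class score p, returned as [1 - p, p]."""
    p = np.clip(labels + rng.normal(scale=noise, size=labels.shape), 0.0, 1.0)
    return np.column_stack([1 - p, p])

toy_val_y = rng.integers(0, 2, size=500)
toy_test_y = rng.integers(0, 2, size=500)

# Three simulated base models of differing quality (hypothetical noise levels).
toy_val_X = np.hstack([fake_model_probs(toy_val_y, s) for s in (0.4, 0.5, 0.6)])
toy_test_X = np.hstack([fake_model_probs(toy_test_y, s) for s in (0.4, 0.5, 0.6)])

# Fit the meta-classifier on val predictions, evaluate on test predictions.
meta = LogisticRegression(max_iter=1000).fit(toy_val_X, toy_val_y)
toy_test_acc = accuracy_score(toy_test_y, meta.predict(toy_test_X))
```

As in the notebook, the meta-classifier never sees the test labels; it is fitted only on predictions for data the base models were not trained on.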
