# Training and using a DeezyMatch model

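This notebook shows two ways of working with DeezyMatch: loading and using an existing model (Option 1), and training a new model and then using it (Option 2).

We start by importing some libraries, and the `pipeline` and `ranking` modules from the `geoparser` folder: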

In [None]:
import os
import sys
from pathlib import Path

sys.path.insert(0, os.path.abspath(os.path.pardir))
from geoparser import pipeline, ranking
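
The `sys.path.insert` line above makes the parent of the current working directory importable, which only finds `geoparser` if the notebook is run from a folder that sits next to it (an assumption about the repository layout). A minimal sketch of what that path resolves to, using only the standard library:

```python
import os
import sys
from pathlib import Path

# os.path.pardir is the string "..", so this is the absolute path of the
# parent of the current working directory.
parent_dir = os.path.abspath(os.path.pardir)

# The same location spelled with pathlib, for comparison.
same_location = parent_dir == str(Path.cwd().parent)

# Add it to the front of the import search path unless it is already listed.
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

print(parent_dir, same_location)
```

If the import of `geoparser` fails, printing `parent_dir` like this is a quick way to check which directory is actually being searched.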

### Option 1: Load and use an existing DeezyMatch model

Create a `myranker` object of the `Ranker` class.

In [None]:
myranker = ranking.Ranker(
    method="deezymatch", # Here we're telling the ranker to use DeezyMatch.
    resources_path="../resources/wikidata/", # Here, the path to the Wikidata resources.
    mentions_to_wikidata=dict(), # We'll store the mentions-to-wikidata model here, leave it like this.
    wikidata_to_mentions=dict(), # We'll store the wikidata-to-mentions model here, leave it like this.
    strvar_parameters=dict(), # Parameters to create the string pair dataset (it can be left empty because it's not used if the model already exists).
    deezy_parameters={
        # Paths and filenames of DeezyMatch models and data:
        "dm_path": str(Path("../resources/deezymatch/").resolve()), # Path to the DeezyMatch directory where the model is saved.
        "dm_cands": "wkdtalts", # Name of the folder containing the wikidata candidates.
        "dm_model": "w2v_ocr", # Name of the DeezyMatch model.
        "dm_output": "deezymatch_on_the_fly", # Name of the file where the output of DeezyMatch will be stored. Feel free to change that.
        # Ranking measures:
        "ranking_metric": "faiss", # Metric used by DeezyMatch to rank the candidates.
        "selection_threshold": 25, # Threshold for that metric.
        "num_candidates": 3, # Number of name variations to retrieve for a string (e.g. for the query "Londcn", "London", "Londra", and "Londres" are three different variations in our gazetteer).
        "search_size": 3, # That should be the same as `num_candidates`.
        "verbose": False, # Whether to see the DeezyMatch progress or not.
        # DeezyMatch training:
        "overwrite_training": False, # You can choose to overwrite the model if it exists: in this case we're loading an existing model, so that should be False.
        "w2v_ocr_path": "", # Path to the w2v model used to generate the DeezyMatch pairs training set. Can be empty if the DeezyMatch model already exists.
        "w2v_ocr_model": "", # Name of the w2v model used to generate the DeezyMatch pairs training set. Can be empty if the DeezyMatch model already exists.
        "do_test": False, # Whether the DeezyMatch model we're loading was a test, or not.
    },
)
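
To make the ranking parameters more concrete, here is a toy sketch of how a distance threshold and a candidate limit interact. It assumes that, for a distance-based metric such as `faiss`, lower scores mean closer matches and `selection_threshold` is an upper bound. The function `select_candidates` and the 2-D vectors are invented for illustration; DeezyMatch works on learned vector representations and this is not its implementation.

```python
import math


def select_candidates(query_vec, candidate_vecs, selection_threshold, num_candidates):
    """Return up to num_candidates (name, distance) pairs within the threshold, closest first."""
    scored = [
        (name, math.dist(query_vec, vec)) for name, vec in candidate_vecs.items()
    ]
    # Drop anything farther away than the threshold.
    kept = [(name, dist) for name, dist in scored if dist <= selection_threshold]
    # Smallest distance first, then cut to the requested number of candidates.
    kept.sort(key=lambda pair: pair[1])
    return kept[:num_candidates]


# Made-up vectors (assumption): real vectors come from the trained model.
toy_vectors = {
    "London": (1.0, 0.0),
    "Londra": (0.0, 2.0),
    "Londres": (3.0, 0.0),
    "Loudon": (0.0, 4.0),
    "Paris": (30.0, 40.0),
}

selected = select_candidates(
    (0.0, 0.0), toy_vectors, selection_threshold=25, num_candidates=3
)
print(selected)
```

In this toy setting, "Paris" is excluded by the threshold and "Loudon" by the candidate limit, so only the three closest names are returned.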

Load the resources (i.e. the `mentions-to-wikidata` and `wikidata-to-mentions` mappers) that will be used by the ranker:

In [None]:
# Load the resources:
myranker.mentions_to_wikidata = myranker.load_resources()

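Call `train()`. It trains a DeezyMatch model only if needed; here `overwrite_training` and `do_test` are both `False`, so the existing model is used rather than a new `test` model being trained: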

In [None]:
# Train a DeezyMatch model if needed:
myranker.train()

Given the DeezyMatch model that has been loaded, find candidates on Wikidata:

In [None]:
# Find candidates given a toponym:
toponym = "Manchefter"
print(myranker.find_candidates([{"mention": toponym}])[0][toponym])

### Option 2: Train and use a new DeezyMatch model

Create a `myranker` object of the `Ranker` class.

In [None]:
myranker = ranking.Ranker(
    method="deezymatch", # Here we're telling the ranker to use DeezyMatch.
    resources_path="../resources/wikidata/", # Here, the path to the Wikidata resources.
    mentions_to_wikidata=dict(), # We'll store the mentions-to-wikidata model here, leave it like this.
    wikidata_to_mentions=dict(), # We'll store the wikidata-to-mentions model here, leave it like this.
    strvar_parameters={
        # Parameters to create the string pair dataset:
        "ocr_threshold": 60,
        "top_threshold": 85,
        "min_len": 5,
        "max_len": 15,
    },
    deezy_parameters={
        # Paths and filenames of DeezyMatch models and data:
        "dm_path": str(Path("../resources/deezymatch/").resolve()),
        "dm_cands": "wkdtalts",
        "dm_model": "w2v_ocr",
        "dm_output": "deezymatch_on_the_fly",
        # Ranking measures:
        "ranking_metric": "faiss",
        "selection_threshold": 25,
        "num_candidates": 3,
        "search_size": 3,
        "verbose": False,
        # DeezyMatch training:
        "overwrite_training": True,
        "w2v_ocr_path": str(Path("../resources/models/w2v/").resolve()),
        "w2v_ocr_model": "w2v_*_news",
        "do_test": True,
    },
)
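
The `deezy_parameters` for Option 1 and Option 2 differ only in the training keys, so one way to avoid copying the whole dictionary is a small helper that starts from the shared values. The function `build_deezy_parameters` below is a hypothetical convenience, not part of `geoparser`; the values are the ones used in this notebook.

```python
from pathlib import Path


def build_deezy_parameters(train_new_model):
    """Hypothetical helper: return the deezy_parameters dict for Option 1 or Option 2."""
    # Values shared by both options, with the Option 1 (load existing) training keys.
    params = {
        "dm_path": str(Path("../resources/deezymatch/").resolve()),
        "dm_cands": "wkdtalts",
        "dm_model": "w2v_ocr",
        "dm_output": "deezymatch_on_the_fly",
        "ranking_metric": "faiss",
        "selection_threshold": 25,
        "num_candidates": 3,
        "search_size": 3,
        "verbose": False,
        "overwrite_training": False,
        "w2v_ocr_path": "",
        "w2v_ocr_model": "",
        "do_test": False,
    }
    if train_new_model:
        # Option 2 overrides: retrain, and point to the w2v model used for the pairs.
        params.update(
            {
                "overwrite_training": True,
                "w2v_ocr_path": str(Path("../resources/models/w2v/").resolve()),
                "w2v_ocr_model": "w2v_*_news",
                "do_test": True,
            }
        )
    return params


existing = build_deezy_parameters(train_new_model=False)
new = build_deezy_parameters(train_new_model=True)
changed_keys = sorted(key for key in existing if existing[key] != new[key])
print(changed_keys)
```

The returned dictionary can be passed as `deezy_parameters=` when creating the `Ranker`; Option 2 additionally needs the non-empty `strvar_parameters` shown above.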

Load the resources (i.e. the `mentions-to-wikidata` and `wikidata-to-mentions` mappers) that will be used by the ranker:

In [None]:
# Load the resources:
myranker.mentions_to_wikidata = myranker.load_resources()

Train a DeezyMatch model (notice we will be training a `test` model):

In [None]:
# Train a DeezyMatch model if needed:
myranker.train()

Given the DeezyMatch model that has just been trained, find candidates on Wikidata:

In [None]:
# Find candidates given a toponym:
toponym = "Manchefter"
print(myranker.find_candidates([{"mention": toponym}])[0][toponym])
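
To look up several toponyms, the call pattern from the cell above can be wrapped in a loop. The helper `find_candidates_for` is hypothetical, and `StubRanker` is only a stand-in so that the sketch runs without a trained model; the values it returns are invented and do not reflect the real output format.

```python
def find_candidates_for(ranker, toponyms):
    """Hypothetical helper: query the ranker once per toponym, as in the cell above."""
    results = {}
    for toponym in toponyms:
        # Same indexing as in the notebook: first returned element, keyed by the mention.
        results[toponym] = ranker.find_candidates([{"mention": toponym}])[0][toponym]
    return results


class StubRanker:
    """Stand-in so the sketch runs without DeezyMatch; its return values are invented."""

    def find_candidates(self, mentions):
        found = {m["mention"]: {"placeholder": m["mention"].lower()} for m in mentions}
        return found, {}


stub_results = find_candidates_for(StubRanker(), ["Manchefter", "Londcn"])
print(stub_results)
```

With the real ranker, `find_candidates_for(myranker, ["Manchefter", "Londcn"])` would replace the stub call.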
