# Training and using a DeezyMatch model (option 1)

This notebook shows how to generate a dataset of string pairs, use it to train a new DeezyMatch model, and query that model for candidates.

To do so, the `resources/` folder must contain at least the following files, in these locations:
```
toponym-resolution/
   ├── ...
   ├── resources/
   │   ├── deezymatch/
   │   │   ├── data/
   │   │   └── inputs/
   │   │       ├── characters_v001.vocab
   │   │       └── input_dfm.yaml
   │   ├── models/
   │   │   └── w2v/
   │   │       ├── w2v_[XXX]_news/
   │   │       │   ├── w2v.model
   │   │       │   ├── w2v.model.syn1neg.npy
   │   │       │   └── w2v.model.wv.vectors.npy
   │   │       └── ...
   │   ├── news_datasets/
   │   ├── wikidata/
   │   │   └── mentions_to_wikidata.json
   │   └── wikipedia/
   └── ...
```
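
Before running the cells below, it can help to confirm that the expected files are in place. The following is a minimal sketch and not part of the `geoparser` package: the helper name `missing_resources` and the list `REQUIRED_FILES` are hypothetical, and the relative paths are copied from the tree above. The word2vec models are left out because their folder names contain a variable part.

```python
from pathlib import Path

# Files named explicitly in the tree above, relative to the resources/ folder.
# Hypothetical list for illustration; the w2v models are omitted on purpose
# because their folder names vary.
REQUIRED_FILES = [
    "deezymatch/inputs/characters_v001.vocab",
    "deezymatch/inputs/input_dfm.yaml",
    "wikidata/mentions_to_wikidata.json",
]


def missing_resources(resources_dir, required=REQUIRED_FILES):
    """Return the required relative paths that are not files under resources_dir."""
    root = Path(resources_dir)
    return [rel for rel in required if not (root / rel).is_file()]
```

Calling `missing_resources("../resources")` from this notebook's folder should return an empty list when every listed file is present; otherwise it returns the relative paths that still need to be added.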

We start by importing some libraries and the `ranking` module from the `geoparser` folder:

In [None]:
import os
import sys
from pathlib import Path

sys.path.insert(0, os.path.abspath(os.path.pardir))
from geoparser import ranking

Create a `myranker` object of the `Ranker` class.

In [None]:
myranker = ranking.Ranker(
    method="deezymatch", # Here we're telling the ranker to use DeezyMatch.
    resources_path="../resources/wikidata/", # Here, the path to the Wikidata resources.
    # Parameters to create the string pair dataset:
    strvar_parameters={
        "ocr_threshold": 60,
        "top_threshold": 85,
        "min_len": 5,
        "max_len": 15,
        "w2v_ocr_path": str(Path("../resources/models/w2v/").resolve()),
        "w2v_ocr_model": "w2v_*_news",
        "overwrite_dataset": True,
    },
    # Parameters to train, load and use a DeezyMatch model:
    deezy_parameters={
        # Paths and filenames of DeezyMatch models and data:
        "dm_path": str(Path("../resources/deezymatch/").resolve()), # Path to the DeezyMatch directory where the model is saved.
        "dm_cands": "wkdtalts", # Name we'll give to the folder that will contain the wikidata candidate vectors.
        "dm_model": "w2v_ocr", # Name of the DeezyMatch model.
        "dm_output": "deezymatch_on_the_fly", # Name of the file where the output of DeezyMatch will be stored. Feel free to change that.
        # Ranking measures:
        "ranking_metric": "faiss", # Metric used by DeezyMatch to rank the candidates.
        "selection_threshold": 50, # Threshold for that metric.
        "num_candidates": 1, # Number of name variations to retrieve per string (e.g. for "Londcn", the gazetteer variations "London", "Londra", and "Londres" count as three).
        "verbose": False, # Whether to see the DeezyMatch progress or not.
        # DeezyMatch training:
        "overwrite_training": True, # Whether to overwrite the model if it already exists: set to True here, so a model is trained even if one is already saved.
        "do_test": True, # Whether the DeezyMatch model we're training or loading is a test model.
    },
)
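
The `w2v_ocr_model` value above contains a wildcard. Assuming it is treated as a glob pattern relative to `w2v_ocr_path` (an assumption about the ranker's behaviour that this notebook does not confirm), the sketch below lists the model folders such a pattern would match. The helper name `matching_w2v_models` is hypothetical.

```python
from pathlib import Path


def matching_w2v_models(w2v_dir, pattern="w2v_*_news"):
    """Return sorted names of folders under w2v_dir that match pattern
    and contain a w2v.model file."""
    root = Path(w2v_dir)
    return sorted(
        p.name for p in root.glob(pattern) if (p / "w2v.model").is_file()
    )
```

If `matching_w2v_models("../resources/models/w2v/")` returns an empty list, it is worth checking the folder layout against the tree at the top of this notebook before building the string pair dataset.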

Load the resources (i.e. the `mentions-to-wikidata` and `wikidata-to-mentions` mappers) that will be used by the ranker:

In [None]:
# Load the resources:
myranker.mentions_to_wikidata = myranker.load_resources()

Train a DeezyMatch model (note that, because `do_test` is `True`, this will be a `test` model):

In [None]:
# Train a DeezyMatch model (retrained here even if one exists, because overwrite_training is True):
myranker.train()

Use the trained DeezyMatch model to find Wikidata candidates for a toponym:

In [None]:
# Find candidates given a toponym:
toponym = "Manchefter"
print(myranker.find_candidates([{"mention": toponym}])[0][toponym])