# Learned Metric Index demo notebook
This notebook walks you through the whole process of creating and using a Learned Metric Index (LMI).

## Steps
1. Load the dataset
2. Build the LMI
3. Run a query
4. Find out its k-NN performance

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

## (optional) Download example CoPhIR dataset

In [4]:
!./download_data.sh

--2021-06-21 10:11:47--  https://www.fi.muni.cz/~xslanin/lmi/knn_gt.json
Resolving www.fi.muni.cz (www.fi.muni.cz)... 2001:718:801:230::1, 147.251.48.1
Connecting to www.fi.muni.cz (www.fi.muni.cz)|2001:718:801:230::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69449248 (66M) [application/json]
Saving to: 'knn_gt.json'


2021-06-21 10:11:48 (109 MB/s) - 'knn_gt.json' saved [69449248/69449248]

--2021-06-21 10:11:48--  https://www.fi.muni.cz/~xslanin/lmi/level-1.txt
Resolving www.fi.muni.cz (www.fi.muni.cz)... 2001:718:801:230::1, 147.251.48.1
Connecting to www.fi.muni.cz (www.fi.muni.cz)|2001:718:801:230::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12387387 (12M) [text/plain]
Saving to: 'level-1.txt'


2021-06-21 10:11:48 (38.0 MB/s) - 'level-1.txt' saved [12387387/12387387]

--2021-06-21 10:11:48--  https://www.fi.muni.cz/~xslanin/lmi/level-2.txt
Resolving www.fi.muni.cz (www.fi.muni.cz)... 2001:718:801:230::1, 147.251.48.1

### Creating an LMI instance
`LMI` is the basic object to interact with when working with learned indexes. It provides:
- dataset loading
- an interface for training with various classifiers
- an interface for searching

In [2]:
from LMI import LMI
# specify the path with the Mtree data.
li = LMI("./Mtree-Cophir-100k")
df = li.get_dataset()
df.head(2)

Loading CoPhIR dataset.


21-06-21 09:27 INFO: Loaded dataset of shape: (100000, 285)


Unnamed: 0,L1,L2,object_id,0,1,2,3,4,5,6,...,272,273,274,275,276,277,278,279,280,281
0,8,31,1264121,-1.242989,0.183268,0.226676,-0.915374,0.252619,-1.130569,-1.174948,...,0.376475,0.246309,-1.161265,0.238361,0.191588,0.133651,0.191612,0.181059,0.071334,0.292033
1,8,31,1269339,-1.499727,-0.376083,-0.169159,-0.178085,-1.059864,1.100678,-0.675192,...,0.376475,0.246309,-0.91233,0.648106,0.191588,0.133651,0.191612,0.181059,0.071334,-0.206513


The dataset is composed of labels (`L1`, `L2`), identifiers (`object_id`), and numerical data. The numerical data are the normalized descriptors of the M-tree CoPhIR dataset. The labels describe each object's location within the M-tree: the `L1`-th node in the first level and the `L2`-th node in the second level.

### Build the LMI (Training phase)
Training is governed by the `train()` method of `LMI`. To specify the classifiers to use and their basic hyperparameters, pass it a `training_specs` dictionary. The currently supported classifiers and their parameters, together with explanations, are listed in the following tables:

| classifier | Hyp. 1 | Hyp. 2 |
|------------|--------|--------|
| RF         | depth  | n_est  |
| LogReg     | ep     |        |
| NN         | model  | opt    |
| NNMulti    | model  | opt    |

| classifier                 | Hyperparameter 1                                       | Hyperparameter 2                                |
|----------------------------|----------------------------------------------|---------------------------------------|
| RandomForestClassifier     | max_depth of the trees                       | number of trees                       |
| Logistic Regression        | number of epochs                             |                                       |
| Neural networks            | a classifier function (one of networks.py) | optimizer (one of keras.optimizers) |
| Multilabel neural networks | a classifier function (one of networks.py) | optimizer (one of keras.optimizers) |
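
To make the expected shape of `training_specs` concrete, the following sketch checks a dictionary against the short-name table above. The helper `check_training_specs` is hypothetical and not part of `LMI`. Two things are assumptions inferred from the example cell in this notebook rather than documented behavior: that the list holds one hyperparameter dictionary per index level (two here), and that `ep` is also accepted for `NN` and `NNMulti`.

```python
# Hyperparameter keys per classifier, taken from the short-name table.
# "ep" for NN/NNMulti is an assumption based on the notebook's example cell.
ALLOWED_KEYS = {
    "RF": {"depth", "n_est"},
    "LogReg": {"ep"},
    "NN": {"model", "opt", "ep"},
    "NNMulti": {"model", "opt", "ep"},
}


def check_training_specs(specs, n_levels=2):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for classifier, levels in specs.items():
        if classifier not in ALLOWED_KEYS:
            problems.append(f"unknown classifier: {classifier}")
            continue
        if len(levels) != n_levels:
            problems.append(
                f"{classifier}: expected {n_levels} level specs, got {len(levels)}"
            )
        for i, params in enumerate(levels, start=1):
            unknown = set(params) - ALLOWED_KEYS[classifier]
            if unknown:
                problems.append(
                    f"{classifier}, level {i}: unknown keys {sorted(unknown)}"
                )
    return problems


rf_specs = {"RF": [{"n_est": 100, "depth": 30}, {"n_est": 100, "depth": 30}]}
print(check_training_specs(rf_specs))                  # prints []
print(check_training_specs({"RF": [{"trees": 100}]}))  # reports two problems
```

A check like this is only a convenience for catching misspelled keys before a long training run; `train()` itself remains the authority on what it accepts.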

In [3]:
from networks import Adam, construct_fully_connected_model_282_128, construct_mlp
#training_specs = {"RF": [{"n_est": 100, "depth": 30}, {"n_est": 100, "depth": 30}]}
#training_specs = {"LogReg": [{"ep": 10}, {"ep": 10}]}
# One hyperparameter dictionary per level of the index.
training_specs = {"NN": [
    {"model": construct_fully_connected_model_282_128, "opt": Adam(learning_rate=0.0001), "ep": 1},
    {"model": construct_mlp, "opt": Adam(learning_rate=0.001), "ep": 5},
]}

df_result = li.train(df, training_specs)

Using TensorFlow backend.
21-06-21 09:28 INFO: Training NN with model: <function construct_fully_connected_model_282_128 at 0x14bb6ddd8e18>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30da0> and epochs: 1
21-06-21 09:28 INFO: [282]-[128] model


Epoch 1/1


21-06-21 09:29 INFO: Training level 1
21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


21-06-21 09:29 INFO: Training NN with model: <function construct_mlp at 0x14bb6ddd8730>, optimizer: <keras.optimizers.Adam object at 0x14bbabb30e10> and epochs: 5


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The training logs tell you which level/node is being trained and, in the case of NNs, their accuracy as they are trained. Note that since we train on the whole dataset, we do not use a validation set.

### Searching

Once we've trained the LMI, we can search for specific objects within the LMI.

In [5]:
df_result.head(2)

Unnamed: 0,L1_pred,L2_pred,L1,L2,object_id,0,1,2,3,4,...,272,273,274,275,276,277,278,279,280,281
0,2,5,4,17,33010998,0.383016,-0.003183,-0.037214,-0.362407,-0.009878,...,-2.944586,-1.868269,0.581277,-0.581129,-3.545258,0.133651,0.191612,-2.053544,-11.183034,0.292033
1,9,52,9,40,12893856,0.639753,-0.562534,-0.301103,0.19056,-0.009878,...,-0.038657,0.850474,-1.161265,-0.990875,0.191588,0.133651,0.191612,0.181059,0.071334,-3.696336


In [7]:
knn_file = "./Mtree-Cophir-100k/knn_gt.json"
knns = li.get_knn_ground_truth(filename=knn_file)
# The 1000 randomly selected queries used in the experiments
searchable_objects = list(knns.keys())

In [8]:
result = li.search(df_result, int(searchable_objects[0]), stop_cond_objects=[500, 1000], debug=True)
result

21-06-21 09:40 INFO: Step 1: L1 added - PQ: [{'M.1.11': 0.5279014}, {'M.1.10': 0.1612377}, {'M.1.3': 0.15023968}, {'M.1.1': 0.038018044}, {'M.1.6': 0.03512803}, {'M.1.2': 0.022543093}, {'M.1.4': 0.020606503}, {'M.1.14': 0.019741502}, {'M.1.13': 0.0094071}, {'M.1.7': 0.00542337}, {'M.1.9': 0.003200936}, {'M.1.8': 0.002571476}, {'M.1.12': 0.0016359027}, {'M.1.5': 0.0015120244}, {'M.1.15': 0.00079491455}, {'M.1.15': 3.8224003e-05}]

21-06-21 09:40 INFO: Popped M.1.11
21-06-21 09:40 INFO: L2 added - PQ (Top 5): [{'M.1.10': 0.1612377}, {'M.1.3': 0.15023968}, {'C.1.11.54': 0.07336822}, {'C.1.11.46': 0.053007253}, {'C.1.11.27': 0.042159732}]

21-06-21 09:40 INFO: Popped M.1.10
21-06-21 09:40 INFO: L2 added - PQ (Top 5): [{'C.1.10.38': 0.17170843}, {'M.1.3': 0.15023968}, {'C.1.10.64': 0.13056695}, {'C.1.10.45': 0.12038053}, {'C.1.11.54': 0.07336822}]

21-06-21 09:40 INFO: L2 found bucket C.1.10.38
21-06-21 09:40 INFO: Popped M.1.3
21-06-21 09:40 INFO: L2 added - PQ (Top 5): [{'C.1.10.64': 0.13

{'id': 79691776,
 'time_checkpoints': [0.110809326171875, 0.11352705955505371],
 'popped_nodes_checkpoints': [['M.1.11',
   'M.1.10',
   'C.1.10.38',
   'M.1.3',
   'C.1.10.64',
   'C.1.10.45',
   'C.1.11.54',
   'C.1.10.27',
   'C.1.11.46'],
  ['M.1.11',
   'M.1.10',
   'C.1.10.38',
   'M.1.3',
   'C.1.10.64',
   'C.1.10.45',
   'C.1.11.54',
   'C.1.10.27',
   'C.1.11.46',
   'C.1.10.41',
   'C.1.10.4',
   'C.1.11.27',
   'C.1.3.67',
   'C.1.11.88',
   'C.1.3.30']],
 'objects_checkpoints': [575, 1004]}

If `debug=True` is specified when searching, the logging guides us through the whole search process. It begins with the default step of popping the root node and collecting probabilities for the nodes in the first level (`Step 1: L1 added`), continues with popping the first-level nodes and collecting the probabilities of their children, and ends with popping the buckets themselves.
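
The traversal in the debug log can be pictured as a best-first search over a priority queue. The sketch below is an illustration under assumptions, not the code of `LMI.search`: the function `best_first_search`, the toy probabilities, and the bucket sizes are all made up, and it handles a single stop condition only.

```python
import heapq


def best_first_search(root_probs, child_probs, bucket_sizes, stop_cond_objects):
    """Pop the most probable node first until enough objects are collected."""
    # heapq is a min-heap, so probabilities are stored negated.
    pq = [(-prob, node) for node, prob in root_probs.items()]
    heapq.heapify(pq)
    popped, n_objects = [], 0
    while pq and n_objects < stop_cond_objects:
        _, node = heapq.heappop(pq)
        popped.append(node)
        if node in child_probs:
            # Inner node: enqueue its children with their own probabilities.
            for child, prob in child_probs[node].items():
                heapq.heappush(pq, (-prob, child))
        else:
            # Bucket: all of its objects become candidates at once.
            n_objects += bucket_sizes.get(node, 0)
    return popped, n_objects


# Toy data (hypothetical values, loosely shaped like the debug log).
root_probs = {"M.1.11": 0.53, "M.1.10": 0.16, "M.1.3": 0.15}
child_probs = {
    "M.1.11": {"C.1.11.54": 0.07, "C.1.11.46": 0.05},
    "M.1.10": {"C.1.10.38": 0.17, "C.1.10.64": 0.13},
    "M.1.3": {"C.1.3.67": 0.02},
}
bucket_sizes = {"C.1.11.54": 60, "C.1.11.46": 40, "C.1.10.38": 70,
                "C.1.10.64": 50, "C.1.3.67": 30}

popped, n_objects = best_first_search(root_probs, child_probs, bucket_sizes, 100)
print(popped)
print(n_objects)
```

Because a bucket is added whole, the loop stops only after the threshold has been reached, so the collected count (120 here) ends up at or above the requested 100.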

The `search` operation returns a dictionary with the following keys:
- `id`: the ID of the query object (= `object_id`)
- `time_checkpoints`: the time (in seconds) it took to reach each checkpoint
- `popped_nodes_checkpoints`: the nodes that were popped until the cumulative number of objects in the visited buckets reached the corresponding `stop_cond_objects` threshold
- `objects_checkpoints`: the actual number of objects found at each checkpoint; it is slightly higher than the corresponding `stop_cond_objects` value
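
The checkpoint lists are aligned by position, so they can be read together with the thresholds passed as `stop_cond_objects`. The following sketch uses values copied from the search output in this notebook (times rounded). Treating node names that start with `C.` as buckets is an assumption based on the debug log.

```python
stop_cond_objects = [500, 1000]

first = ["M.1.11", "M.1.10", "C.1.10.38", "M.1.3", "C.1.10.64",
         "C.1.10.45", "C.1.11.54", "C.1.10.27", "C.1.11.46"]
second = first + ["C.1.10.41", "C.1.10.4", "C.1.11.27",
                  "C.1.3.67", "C.1.11.88", "C.1.3.30"]
result = {
    "id": 79691776,
    "time_checkpoints": [0.1108, 0.1135],
    "popped_nodes_checkpoints": [first, second],
    "objects_checkpoints": [575, 1004],
}

summary = []
for threshold, seconds, nodes, n_objects in zip(
    stop_cond_objects,
    result["time_checkpoints"],
    result["popped_nodes_checkpoints"],
    result["objects_checkpoints"],
):
    # Assumption: names starting with "C." denote buckets (leaf nodes).
    n_buckets = sum(node.startswith("C.") for node in nodes)
    summary.append((threshold, n_objects, len(nodes), n_buckets))
    print(f"threshold {threshold}: {n_objects} objects from {n_buckets} buckets "
          f"({len(nodes)} nodes popped) in {seconds:.3f} s")
```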

### k-NN ground truth

The following output shows the ground-truth buckets for every nearest neighbor of our query. The k-NN recall is computed as the number of the query's 30 nearest neighbors that lie in the visited buckets, divided by 30.

In [10]:
from knn_search import get_knn_buckets_for_query, evaluate_knn_per_query
get_knn_buckets_for_query(df_result, result['id'], knns)

{'C.1.11.54': ['79691776', '13124750', '38444959', '45290554', '25651444'],
 'C.1.10.42': ['13489284', '30008633'],
 'C.1.3.55': ['32097677'],
 'C.1.3.34': ['49155309', '99892584'],
 'C.1.10.4': ['53819024'],
 'C.1.10.41': ['49154712'],
 'C.1.3.46': ['37800985'],
 'C.1.3.16': ['45045161'],
 'C.1.3.33': ['31048238', '73705556'],
 'C.1.11.27': ['47844531'],
 'C.1.11.9': ['99799732'],
 'C.1.11.31': ['76337079'],
 'C.1.11.46': ['40776009', '6648570'],
 'C.1.10.15': ['20894414'],
 'C.1.11.23': ['62673487'],
 'C.1.3.60': ['7066894'],
 'C.1.11.3': ['66020878', '23719144'],
 'C.1.11.67': ['92782303'],
 'C.1.3.61': ['3767533'],
 'C.1.10.64': ['100591208'],
 'C.1.11.53': ['39855095']}

In [11]:
evaluate_knn_per_query(result, df_result, knns)

Evaluating k-NN performance on 2 checkpoints: [575, 1004]
C.1.10.64
C.1.11.54
C.1.11.46
N. of knns found: 8 in 9 buckets.
C.1.10.64
C.1.11.54
C.1.11.46
C.1.10.41
C.1.10.4
C.1.11.27
N. of knns found: 11 in 15 buckets.


[0.26666666666666666, 0.36666666666666664]
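
The two recall values can be reproduced by hand from the outputs above. The sketch below is not the implementation of `evaluate_knn_per_query`; `recall_at_checkpoint` is a hypothetical helper, and the per-bucket neighbor counts are counted from the ground-truth output.

```python
# Number of the query's 30 nearest neighbors stored in each bucket
# (counted from the get_knn_buckets_for_query output).
knn_counts = {
    "C.1.11.54": 5, "C.1.10.42": 2, "C.1.3.55": 1, "C.1.3.34": 2,
    "C.1.10.4": 1, "C.1.10.41": 1, "C.1.3.46": 1, "C.1.3.16": 1,
    "C.1.3.33": 2, "C.1.11.27": 1, "C.1.11.9": 1, "C.1.11.31": 1,
    "C.1.11.46": 2, "C.1.10.15": 1, "C.1.11.23": 1, "C.1.3.60": 1,
    "C.1.11.3": 2, "C.1.11.67": 1, "C.1.3.61": 1, "C.1.10.64": 1,
    "C.1.11.53": 1,
}

# Nodes popped at each checkpoint (from result['popped_nodes_checkpoints']).
first = ["M.1.11", "M.1.10", "C.1.10.38", "M.1.3", "C.1.10.64",
         "C.1.10.45", "C.1.11.54", "C.1.10.27", "C.1.11.46"]
second = first + ["C.1.10.41", "C.1.10.4", "C.1.11.27",
                  "C.1.3.67", "C.1.11.88", "C.1.3.30"]


def recall_at_checkpoint(popped_nodes, knn_counts):
    """Fraction of the ground-truth neighbors that lie in popped buckets."""
    found = sum(knn_counts.get(node, 0) for node in popped_nodes)
    return found / sum(knn_counts.values())


recalls = [recall_at_checkpoint(nodes, knn_counts) for nodes in (first, second)]
print(recalls)
```

The first checkpoint covers 8 of the 30 neighbors (buckets `C.1.10.64`, `C.1.11.54`, `C.1.11.46`), and the second adds three more from `C.1.10.41`, `C.1.10.4`, and `C.1.11.27`, giving 8/30 and 11/30.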