In [1]:
%reload_ext autoreload
%autoreload 2

# Using LMI with a custom dataset

## Use case 3 -- running the search & evaluating with M-tree / M-index from [Learned metric index - proposition of learned indexing for unstructured data](https://www.sciencedirect.com/science/article/pii/S0306437921000326).

### Necessary components:
1. the `descriptors` file -- vector representation 
2. the `labels` file -- file associating every object from `descriptors` with a node in the template index
3. the `pivots` file -- representing each node in the template index with an object being its pivot

This use case assumes that you have a template indexing or clustering method that can provide the labels necessary for training.

To learn how to **Create the ground-truth and queries files**, visit the `use-case-1.ipynb` notebook.

# Example with a custom dataset

Within the data folder, we have stored a `test/simple-data.csv` file, which contains a tiny example descriptor dataset on which we'll demonstrate the use with any custom dataset.

## Steps:
1. Load the configuration file
2. Load the dataset using `SimpleDataLoader`
3. Load the labels using `SimpleDataLoader`
4. Load the pivots file(s)
5. Search in M-tree / M-index
6. Evaluate the performance

In [2]:
import os
from lmi.utils import load_yaml, load_model_config
from lmi.data.SimpleDataLoader import SimpleDataLoader
from lmi.indexes.LearnedMetricIndex import LMI

#### 1. Load the configuration file

In [3]:
config = load_yaml('./supplementary-experiment-setups/dummy-data-config.yml')
config

{'setup': 'lmi-test',
 'data': {'data-dir': '/storage/brno12-cerit/home/tslaninakova/data/test',
  'dataset-file': 'simple-data.csv',
  'queries': 'simple-queries.txt',
  'knn-gt': 'simple-knn.json',
  'labels-dir': 'labels/',
  'pivots-filename': 'pivots/M-tree.struct',
  'normalize': False,
  'shuffle': True},
 'LMI': {'model-config': './supplementary-experiment-setups/data-driven/models/model-kmeans.yml',
  'n_levels': 2,
  'training-dataset-percentage': 1},
 'experiment': {'output-dir': 'outputs',
  'search-stop-conditions': [0.0005,
   0.001,
   0.003,
   0.005,
   0.01,
   0.05,
   0.1,
   0.2,
   0.3,
   0.5],
  'knns': 30}}

#### 2. Load the dataset using `SimpleDataLoader`

Note that if the loading method of `SimpleDataLoader` does not work with your dataset, you can easily modify it -- we use the Pandas API.
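As a rough illustration of what such a modification might look like, the sketch below reads a descriptor CSV with plain pandas. The helper `load_descriptors_csv` and its parameters are hypothetical and are not part of `SimpleDataLoader`. It assumes that the first CSV column holds the object ID; the sample rows are shortened to three descriptor columns.

```python
import io

import pandas as pd


def load_descriptors_csv(source, shuffle=False, seed=0):
    """Hypothetical helper: read descriptors, using the first column as the object ID."""
    df = pd.read_csv(source, index_col=0)
    if shuffle:
        # Shuffle all rows reproducibly, mirroring the 'shuffle' option in the config.
        df = df.sample(frac=1, random_state=seed)
    return df


# Tiny in-memory stand-in for a descriptor file (assumed layout).
csv_text = (
    ",Sample,ecc,N\n"
    "559,560.0,21.13903,3896.865479\n"
    "130,131.0,17.52186,4584.50293\n"
)
descriptors = load_descriptors_csv(io.StringIO(csv_text))
print(descriptors)
```

If your file uses a different separator or has no header row, the `sep` and `header` arguments of `pd.read_csv` are the usual places to adapt.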

In [4]:
loader = SimpleDataLoader(config['data'])
dataset = loader.load_descriptors()
dataset.head(2)

INFO:lmi.data.SimpleDataLoader:Loading dataset from /storage/brno12-cerit/home/tslaninakova/data/test/simple-data.csv.


Unnamed: 0,Sample,ecc,N,gammaG,Esoil,Econc,Dbot,H1,H2,H3,Mr_t,Mt_t,Mr_c,Mt_c
559,560.0,21.13903,3896.865479,0.92019,117.104591,31646.613281,19.75651,1.34821,1.69134,1.25041,-3.211984,-1.569884,3.6491,2.729958
130,131.0,17.52186,4584.50293,0.93011,90.690208,31934.832031,20.71199,1.26115,1.39311,1.36789,-3.273929,-1.46063,3.893511,2.658508


#### 3. Load the labels using `SimpleDataLoader`

The label files, as specified in `config['data']`, are located in the `labels/` subfolder. The number of files corresponds to the depth of the template index:

In [5]:
DIR=f"{config['data']['data-dir']}/{config['data']['labels-dir']}"
%ls $DIR

level-1.txt  level-2.txt


In [6]:
FILE=DIR+'/level-1.txt'
!head -n 5 $FILE

9	735
3	700
4	412
9	615
8	806


In [7]:
FILE=DIR+'/level-2.txt'
!head -n 5 $FILE

9.7	735
3.1	700
4.5	412
9.5	615
8.8	806


This template index has 2 levels. The first is captured in `level-1.txt`, where every object (second column) is assigned a node label (first column). In the second file, the location is specified as `"first-node"."second-node"`. Note that these labels represent a balanced tree, i.e., every object is present in every `level-*.txt` file. However, the tree can be unbalanced as well.

The `load_labels` function expects the label files in this format and with these filenames.
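To make the file format concrete, here is a minimal sketch that parses the deepest label file into one column per level. The function `parse_label_paths` is hypothetical and is not how `load_labels` is implemented; it assumes tab-separated rows and a balanced tree, so every path has the same number of parts.

```python
import io

import pandas as pd


def parse_label_paths(source):
    """Hypothetical parser: split 'first-node.second-node' paths into L1, L2, ... columns."""
    raw = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["path", "object_id"],
        dtype={"path": str},  # keep '9.7' as text so it is not read as a float
    )
    parts = raw["path"].str.split(".", expand=True).astype(int)
    parts.columns = [f"L{i + 1}" for i in range(parts.shape[1])]
    parts.index = pd.Index(raw["object_id"], name="object_id")
    return parts


# First three rows of the level-2 file shown above.
level_2_text = "9.7\t735\n3.1\t700\n4.5\t412\n"
parsed_labels = parse_label_paths(io.StringIO(level_2_text))
print(parsed_labels)
```

For an unbalanced tree, objects that stop at a shallower level would need their missing deeper labels handled explicitly, which this sketch does not attempt.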

In [38]:
labels = loader.load_labels()

In [31]:
labels['L1'].value_counts()

4    127
5    120
7    118
2    114
1    113
6    107
9    105
3     99
8     97
Name: L1, dtype: int64

In [9]:
labels[735:].head()

object_id,L1,L2
735,9,7
736,1,9
737,2,3
738,9,5
739,1,8


#### 4. Load the pivots file(s)

##### Format of the M-Tree pivots file:
```txt
125 1 7.326086
422 2 4.645504
642 3 2.3140666
933 4 2.2281752
...
```
The first column is the pivot object ID, the second column is the node identifier (similar to the notation in the `level-*` files), and the last column is the radius of the ball region that the given pivot/node covers.
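The three columns can be read with pandas as in the sketch below. The function `parse_mtree_pivots` is hypothetical and is not the implementation of `load_mtree_pivots`; it assumes space-separated rows and that nodes below the first level are written with dots (for example `8.9`), so the level can be derived by counting dots.

```python
import io

import pandas as pd


def parse_mtree_pivots(source):
    """Hypothetical parser: pivot object ID, node identifier, covering radius."""
    df = pd.read_csv(
        source,
        sep=" ",
        header=None,
        names=["pivot_id", "node", "radius"],
        dtype={"node": str, "radius": float},
        index_col="pivot_id",
    )
    # Assumption: a node written as 'a.b' sits on level 2, 'a' on level 1.
    df["level"] = df["node"].str.count(r"\.") + 1
    return df


# The four sample rows from the format description above.
pivots_text = "125 1 7.326086\n422 2 4.645504\n642 3 2.3140666\n933 4 2.2281752\n"
parsed_pivots = parse_mtree_pivots(io.StringIO(pivots_text))
print(parsed_pivots)
```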

##### Format of the M-index pivots file:
The M-index pivots file has the same form as the original data descriptors, but includes only a subset of the data objects. This subset is the collection of pivots (here, the information about which node is represented by which pivot is not important).

In [20]:
config['data']['pivots-filename'] = 'pivots/M-tree.struct'
config['data']['normalize'] = False

In [39]:
loader = SimpleDataLoader(config['data'])
pivots = loader.load_mtree_pivots()
pivots.tail(2)



Unnamed: 0,node,radius,level
189,"(8, 9)",1.228175,2
179,"(9, 9)",1.314067,2


In [10]:
config['data']['pivots-filename'] = 'pivots/M-index.struct'
config['data']['normalize'] = False

In [11]:
loader = SimpleDataLoader(config['data'])
mindex_pivots = loader.load_mindex_pivots()
mindex_pivots.head(2)

INFO:lmi.data.SimpleDataLoader:Loading dataset from /storage/brno12-cerit/home/tslaninakova/data/test/pivots/M-index.struct.


Unnamed: 0,Sample,ecc,N,gammaG,Esoil,Econc,Dbot,H1,H2,H3,Mr_t,Mt_t,Mr_c,Mt_c
7,8.0,22.674749,2464.371582,1.03556,93.100151,35409.90625,22.27972,1.13818,1.01188,0.85739,-2.29543,-0.99097,2.915441,1.614969
5,6.0,10.51368,2464.371582,1.03556,93.100151,35409.90625,22.27972,1.13818,1.01188,0.85739,-0.764885,-0.271209,1.674093,0.919407


#### 5. Search in M-tree / M-index

In [40]:
from lmi.indexes.Mindex import Mindex
from lmi.indexes.Mtree import Mtree

mindex = Mindex(dataset, labels, mindex_pivots, config['data']['dataset-file'])
mtree = Mtree(dataset, labels, pivots, config['data']['dataset-file'])

In [14]:
ground_truths = loader.load_knn_ground_truth()
queries = loader.load_queries()

The default distance metric for searching in M-index and M-tree is the Euclidean distance. To use a different metric, incorporate it into the `Mindex.get_distances()` or `Mtree.search_node()` function.
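As an illustration of what a replacement metric could look like, the sketch below implements the Euclidean distance next to a Manhattan (L1) alternative using NumPy. Both function names are hypothetical, and how such a function would be wired into `Mindex.get_distances()` or `Mtree.search_node()` depends on the internals of those methods, which are not shown here.

```python
import numpy as np


def euclidean_distances(query, objects):
    """Euclidean (L2) distance from one query vector to every row of `objects`."""
    return np.sqrt(((objects - query) ** 2).sum(axis=1))


def manhattan_distances(query, objects):
    """Manhattan (L1) distance, shown as an example of an alternative metric."""
    return np.abs(objects - query).sum(axis=1)


# Toy data, not taken from the notebook's dataset.
objects = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
query = np.array([0.0, 0.0])

l2 = euclidean_distances(query, objects)
l1 = manhattan_distances(query, objects)
print(l2)
print(l1)
```

Keep in mind that pivot-based pruning in metric indexes relies on the triangle inequality, so a replacement should be a proper metric.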

In [15]:
mindex.search(queries[0], [20])

([[(7, 4), (4, 7)]], [0.006585359573364258], [27])

In [43]:
mtree.search(queries[0], [20])

([[(8, 7), (8, 6), (8, 5)]], [0.043366432189941406], [26])

#### 6. Evaluate the performance

In [16]:
from lmi.Experiment import Evaluator
import pandas as pd

queries_df = pd.DataFrame(queries)
queries_df = queries_df.set_index([0])

e = Evaluator(mindex, ground_truths, queries_df, config)
e.run_evaluate()
e.generate_summary()

INFO:lmi.Experiment:Starting the search for 500 queries.
INFO:lmi.Experiment:Evaluated 100/500 queries.
INFO:lmi.Experiment:Evaluated 200/500 queries.
INFO:lmi.Experiment:Evaluated 300/500 queries.
INFO:lmi.Experiment:Evaluated 400/500 queries.
INFO:lmi.Experiment:Search is finished, results are stored in: 'outputs/2022-03-24--10-26-34/search.csv'
INFO:lmi.Experiment:Consumed memory by evaluating (MB): None


In [17]:
!cat outputs/2022-03-24--10-26-34/summary.json

{
    "model": "Mindex",
    "experiment": "outputs/2022-03-24--10-26-34",
    "stop_conditions_perc": [
        0.0005,
        0.001,
        0.003,
        0.005,
        0.01,
        0.05,
        0.1,
        0.2,
        0.3,
        0.5
    ],
    "results": {
        "0": {
            "time": 0.0017857809066772461,
            "score": 0.007,
            "visited_objects": 13
        },
        "1": {
            "time": 0.0018117780685424805,
            "score": 0.007,
            "visited_objects": 13
        },
        "3": {
            "time": 0.0018189749717712402,
            "score": 0.007,
            "visited_objects": 13
        },
        "5": {
            "time": 0.0018249588012695312,
            "score": 0.007,
            "visited_objects": 13
        },
        "10": {
            "time": 0.0018406529426574707,
            "score": 0.007,
            "visited_objects": 13
        },
        "50": {
            "time": 0.0027525205612182616,
            "sco
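
The keys under `"results"` are consistent with the stop conditions expressed as object counts: each fraction in `stop_conditions_perc` multiplied by the dataset size and truncated to an integer. This is an inference from the visible output rather than documented behavior of `Evaluator`; the dataset size of 1000 is the sum of the `labels['L1'].value_counts()` output above, and the helper `to_object_count` is hypothetical.

```python
from decimal import Decimal

stop_conditions_perc = [0.0005, 0.001, 0.003, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5]
n_objects = 1000  # assumption: sum of the L1 value counts shown earlier


def to_object_count(fraction, n):
    """Hypothetical helper: fraction of the dataset as a truncated object count."""
    # Decimal avoids binary floating-point surprises such as 0.3 * 1000 != 300.
    return int(Decimal(str(fraction)) * n)


result_keys = [str(to_object_count(p, n_objects)) for p in stop_conditions_perc]
print(result_keys)
```

Under this reading, the smallest stop condition (0.0005) truncates to 0 objects, which would explain why the first few entries report identical scores and visited-object counts.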

In [45]:
e = Evaluator(mtree, ground_truths, queries_df, config)
e.run_evaluate()
e.generate_summary()

INFO:lmi.Experiment:Starting the search for 500 queries.
INFO:lmi.Experiment:Evaluated 100/500 queries.
INFO:lmi.Experiment:Evaluated 200/500 queries.
INFO:lmi.Experiment:Evaluated 300/500 queries.
INFO:lmi.Experiment:Evaluated 400/500 queries.
INFO:lmi.Experiment:Search is finished, results are stored in: 'outputs/2022-03-24--12-16-47/search.csv'
INFO:lmi.Experiment:Consumed memory by evaluating (MB): None


In [46]:
!cat outputs/2022-03-24--12-16-47/summary.json

{
    "model": "Mtree",
    "experiment": "outputs/2022-03-24--12-16-47",
    "stop_conditions_perc": [
        0.0005,
        0.001,
        0.003,
        0.005,
        0.01,
        0.05,
        0.1,
        0.2,
        0.3,
        0.5
    ],
    "results": {
        "0": {
            "time": 0.02144307565689087,
            "score": 0.009,
            "visited_objects": 10
        },
        "1": {
            "time": 0.02147545289993286,
            "score": 0.009,
            "visited_objects": 10
        },
        "3": {
            "time": 0.021483206272125244,
            "score": 0.009,
            "visited_objects": 10
        },
        "5": {
            "time": 0.021489752292633058,
            "score": 0.009,
            "visited_objects": 10
        },
        "10": {
            "time": 0.023204325675964355,
            "score": 0.015,
            "visited_objects": 17
        },
        "50": {
            "time": 0.03876618576049805,
            "score": 0.047