In [1]:
%reload_ext autoreload
%autoreload 2

# Using LMI with a custom dataset

## Use case 1 -- run and evaluate LMI with algorithms from [Data-driven LMI [2]](https://link.springer.com/chapter/10.1007/978-3-030-89657-7_7).

### Necessary components:
1. the `descriptors` file -- vector representation 
    - We assume that you have these descriptors ready
    
#### Components that are also used but can be derived from the necessary ones:
2. the `ground-truth` file computed on the dataset
    - see step 3.2 below
3. the `queries` file
    - see step 3.1 below

# Example with a custom dataset

Within the data folder, we have stored a `test/simple-data.csv` file containing a tiny example descriptor dataset. We use it to demonstrate how LMI can be applied to any custom dataset.

## Steps:
1. Load the configuration file
2. Load the dataset using `SimpleDataLoader`
3. Create the ground-truth and queries files
4. Train and search in the LMI
5. Evaluate the results

In [9]:
import os
from lmi.utils import load_yaml, load_model_config
from lmi.data.SimpleDataLoader import SimpleDataLoader
from lmi.indexes.LearnedMetricIndex import LMI

#### 1. Load the configuration file

In [64]:
config = load_yaml('./supplementary-experiment-setups/dummy-data-config.yml')
config

{'setup': 'lmi-test',
 'data': {'data-dir': '/storage/brno12-cerit/home/tslaninakova/data/test',
  'dataset-file': 'simple-data.csv',
  'queries': 'simple-queries.txt',
  'knn-gt': 'simple-knn.json',
  'normalize': False,
  'shuffle': True},
 'LMI': {'model-config': './supplementary-experiment-setups/data-driven/models/model-kmeans.yml',
  'n_levels': 2,
  'training-dataset-percentage': 1},
 'experiment': {'output-dir': 'outputs',
  'search-stop-conditions': [0.0005,
   0.001,
   0.003,
   0.005,
   0.01,
   0.05,
   0.1,
   0.2,
   0.3,
   0.5],
  'knns': 30}}
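
The `search-stop-conditions` values look like fractions of the dataset size rather than absolute object counts. As a rough illustration (an assumption drawn from the configuration, not a documented rule of `lmi`), the sketch below converts them into object counts for a dataset of 1,000 objects. `to_object_counts` is a hypothetical helper, not part of the library.

```python
# Hypothetical helper: turn fractional stop conditions into object counts.
# Assumption: each stop condition is a fraction of the dataset size and is
# truncated to an integer; the library itself may round differently.
def to_object_counts(stop_conditions, dataset_size):
    return [int(fraction * dataset_size) for fraction in stop_conditions]


stop_conditions = [0.0005, 0.001, 0.003, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5]
counts = to_object_counts(stop_conditions, dataset_size=1000)
print(counts)
```

On such a small dataset the smallest fractions collapse to zero or one object, so for tiny datasets the larger stop conditions are the more informative ones.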

#### 2. Load the dataset using `SimpleDataLoader`

Note that if the loading method of `SimpleDataLoader` does not work with your dataset, you can easily modify it -- we use the Pandas API.

In [17]:
loader = SimpleDataLoader(config['data'])
dataset = loader.load_descriptors()
dataset.head(2)

INFO:lmi.data.SimpleDataLoader:Loading dataset from /storage/brno12-cerit/home/tslaninakova/data/test/simple-data.csv.


| Unnamed: 0 | Sample | ecc | N | gammaG | Esoil | Econc | Dbot | H1 | H2 | H3 | Mr_t | Mt_t | Mr_c | Mt_c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 730 | 731.0 | 18.366831 | 3948.023438 | 0.95699 | 114.363274 | 32521.521484 | 19.261681 | 1.26294 | 1.20946 | 1.41062 | -2.592027 | -1.281848 | 3.091215 | 2.218345 |
| 551 | 552.0 | 24.819481 | 4308.60791 | 0.91736 | 74.213387 | 32234.570312 | 19.815281 | 1.04914 | 1.28868 | 1.56531 | -3.263736 | -2.073766 | 5.87421 | 3.043677 |


#### 3. Create the ground-truth and queries files

##### 3.1. **Queries** are a newline-separated list of object ids
This list can contain all the objects of `dataset` or just a subset. In our experiments, we used 1k queries for a dataset of 1M descriptors. In the example below, a random half of the objects in `dataset` is selected as queries.

Note that if you want to use queries that are not in `dataset`, you need to store their vector representations in the `queries` file. This change also requires modifying the `compute_ground_truths` function introduced below.

In [12]:
dataset.sample(frac=0.5).to_csv(
    os.path.join(config['data']['data-dir'], 'simple-queries.txt'), columns=[], header=False
)
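
To make the file format concrete, here is a minimal sketch that writes a few object ids one per line and reads them back using only the standard library. The ids and the temporary path are illustrative, and `loader.load_queries()` may parse the file differently.

```python
import os
import tempfile

# Illustrative object ids; in the notebook they come from the dataset index.
query_ids = [730, 551, 12]

with tempfile.TemporaryDirectory() as tmp_dir:
    queries_path = os.path.join(tmp_dir, "simple-queries.txt")

    # Write one object id per line.
    with open(queries_path, "w") as f:
        f.write("\n".join(str(query_id) for query_id in query_ids) + "\n")

    # Read the ids back, skipping any blank lines.
    with open(queries_path) as f:
        loaded_ids = [int(line) for line in f if line.strip()]

print(loaded_ids)
```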

##### 3.2. **Ground truths** are the true `k` nearest neighbors of every query object
The value of `k` is configurable. Computing the ground truths requires the distances between objects. Different metrics suit different datasets (see e.g. [spatial.distance](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html)), and we encourage readers to implement their own. In our datasets, we used either the Euclidean (L2) metric or a metric specific to the CoPhIR dataset. Both are available in `lmi.distances`.

Note that, depending on how many queries you have, this operation can be time-consuming.
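
As an illustration of a custom metric, the sketch below implements the Manhattan (L1) distance. It assumes, based on how `compute_ground_truths` calls its `metric` argument in this notebook, that a metric is any callable taking two equally long numeric sequences and returning a number. `manhattan_distance` is a hypothetical name and is not part of `lmi.distances`.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    return float(sum(abs(x - y) for x, y in zip(a, b)))


print(manhattan_distance([1.0, 2.0, 3.0], [2.0, 0.0, 3.5]))
```

A function with this shape could then be passed as `metric=manhattan_distance` in place of the Euclidean distance.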

In [18]:
# make sure that the config['data']['queries'] value corresponds to the path we used to store 
# the queries in the cell above
queries = loader.load_queries()

In [26]:
from lmi.distances.euclidean import get_euclidean_distance

In [52]:
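def compute_ground_truths(query_objects, dataset, k, metric):
    """Return {query id: {neighbor id: distance}} with the k nearest neighbors of each query."""
    ground_truths = {}
    for query_object in query_objects:
        # Look up the query vector once instead of once per data object.
        query_vector = dataset.loc[query_object]
        distances_to_query_object = {
            str(i): float(metric(query_vector, data_object))
            for i, data_object in dataset.iterrows()
        }
        # Keep the k smallest distances; ids are stored as strings.
        ground_truths[str(query_object)] = dict(
            sorted(distances_to_query_object.items(), key=lambda item: item[1])[:k]
        )
    return ground_truths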

%time ground_truths = compute_ground_truths(queries, dataset, k=2, metric=get_euclidean_distance)

CPU times: user 2min 13s, sys: 148 ms, total: 2min 13s
Wall time: 2min 17s


The ground-truth file should be a dict of dicts: each outer key is a query id, and its value maps the ids of the `k` nearest objects to their distances from the query. Since `compute_ground_truths` converts the ids with `str`, all keys are strings.

E.g. for `k = 2` and two queries:
```
{"361": {"361": 0.0, "362": 7.1938624},
 "136": {"136": 0.0, "137": 14.150513}
}
```
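
For larger datasets, the nested Python loop in `compute_ground_truths` becomes the bottleneck. The following is a minimal NumPy sketch, restricted to the Euclidean metric, that produces the same dict-of-dicts structure. `knn_ground_truths`, `vectors`, and `ids` are hypothetical names; with a pandas `DataFrame`, one could pass `dataset.to_numpy()` and `dataset.index`.

```python
import numpy as np


def knn_ground_truths(vectors, ids, query_ids, k):
    """Exact Euclidean k-NN for each query id, as {query: {neighbor: distance}}."""
    vectors = np.asarray(vectors, dtype=float)
    # Map every object id to its row position in `vectors`.
    position_of = {object_id: pos for pos, object_id in enumerate(ids)}
    ground_truths = {}
    for query_id in query_ids:
        query_vector = vectors[position_of[query_id]]
        # Distances from the query to every object, computed in one call.
        distances = np.linalg.norm(vectors - query_vector, axis=1)
        # A stable sort keeps tied objects in dataset order.
        nearest = np.argsort(distances, kind="stable")[:k]
        ground_truths[str(query_id)] = {
            str(ids[pos]): float(distances[pos]) for pos in nearest
        }
    return ground_truths


vectors = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
ids = [10, 20, 30]
result = knn_ground_truths(vectors, ids, query_ids=[10], k=2)
print(result)
```

This still sorts all distances for every query; for very large datasets, `np.argpartition` or a dedicated nearest-neighbor library would likely be a better fit.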

In [28]:
# Alternative use with the metrics in `scipy.spatial.distance`
# from scipy.spatial.distance import cosine
# %time ground_truths = compute_ground_truths(queries, dataset, k=2, metric=cosine)

CPU times: user 1min 53s, sys: 99 ms, total: 1min 53s
Wall time: 1min 56s


You can save the file for later use:

In [53]:
from lmi.utils import save_json
save_json(ground_truths, os.path.join(config['data']['data-dir'], 'simple-knn.json'))
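
One detail worth keeping in mind when reloading the file: JSON object keys are always strings, so integer ids come back as `str`. The sketch below uses the standard `json` module directly (that `save_json` behaves the same way is an assumption) to show the round trip and one way to restore integer keys.

```python
import json

# Illustrative ground truths with integer ids.
ground_truths_example = {361: {361: 0.0, 362: 7.1938624}}

# Serializing converts the integer keys to strings.
serialized = json.dumps(ground_truths_example)
reloaded = json.loads(serialized)
print(reloaded)

# Convert keys back to integers if the rest of the pipeline expects them.
restored = {
    int(query): {int(neighbor): dist for neighbor, dist in neighbors.items()}
    for query, neighbors in reloaded.items()
}
print(restored)
```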

#### 4. Train and search in the LMI

In [58]:
lmi = LMI(config['LMI'], dataset)

In [66]:
model_config = load_model_config(config['LMI']['model-config'], lmi.n_levels)
model_config

{'level-0': {'model': 'KMeans',
  'n_clusters': 100,
  'n_init': 5,
  'max_iter': 10},
 'level-1': {'model': 'KMeans',
  'n_clusters': 100,
  'n_init': 5,
  'max_iter': 10}}

In [67]:
%time lmi.train(model_config, rebuild=True)

INFO:lmi.indexes.BaseIndex:Training model M.0 (root) on dataset(1000, 14) with {'model': 'KMeans', 'n_clusters': 100, 'n_init': 5, 'max_iter': 10}.
INFO:lmi.indexes.BaseIndex:Training level 1 with {'model': 'KMeans', 'n_clusters': 100, 'n_init': 5, 'max_iter': 10}.
INFO:lmi.indexes.BaseIndex:Finished training the LMI.


CPU times: user 1.05 s, sys: 12.1 ms, total: 1.06 s
Wall time: 1.6 s


In [68]:
%time search_results, times, visited_objects_all = lmi.search(queries[0], [50])
search_results, times, visited_objects_all

CPU times: user 2.77 ms, sys: 3.98 ms, total: 6.75 ms
Wall time: 6.77 ms


([[(45, 0),
   (45, 3),
   (45, 5),
   (45, 2),
   (71, 2),
   (71, 3),
   (71, 0),
   (71, 4),
   (9, 2),
   (9, 0),
   (18, 3),
   (18, 0),
   (69, 3),
   (69, 2),
   (69, 5),
   (69, 0),
   (41, 3),
   (41, 1),
   (48, 0),
   (96, 0),
   (81, 2),
   (81, 3),
   (81, 5),
   (81, 0)]],
 [0.006704092025756836],
 [50])

#### 5. Evaluate LMI's performance

In [86]:
from lmi.Experiment import Evaluator
import pandas as pd

queries_df = pd.DataFrame(queries)
queries_df = queries_df.set_index([0])

e = Evaluator(lmi, ground_truths, queries_df, config)
e.run_evaluate()
e.generate_summary()

INFO:lmi.Experiment:Starting the search for 500 queries.
INFO:lmi.Experiment:Evaluated 100/500 queries.
INFO:lmi.Experiment:Evaluated 200/500 queries.
INFO:lmi.Experiment:Evaluated 300/500 queries.
INFO:lmi.Experiment:Evaluated 400/500 queries.
INFO:lmi.Experiment:Search is finished, results are stored in: 'outputs/2022-03-16--10-20-05/search.csv'
INFO:lmi.Experiment:Consumed memory by evaluating (MB): None
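
A common way to judge approximate nearest-neighbor search is recall: the fraction of the true nearest neighbors that the search actually returned. Whether `Evaluator` computes its score exactly this way is an assumption, so the sketch below only illustrates the idea; `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(true_neighbors, found_objects):
    """Fraction of the true nearest neighbors present among the found objects."""
    true_ids = set(true_neighbors)
    if not true_ids:
        return 0.0
    return len(true_ids & set(found_objects)) / len(true_ids)


# Illustrative ground truth for one query (neighbor id -> distance)
# and an illustrative list of object ids returned by a search.
ground_truth = {"361": 0.0, "362": 7.1938624}
found = ["361", "400", "17"]
print(recall_at_k(ground_truth, found))
```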


In [87]:
!cat outputs/2022-03-16--10-20-05/summary.json

{
    "model": "LMI",
    "experiment": "outputs/2022-03-16--10-20-05",
    "stop_conditions_perc": [
        0.0005,
        0.001,
        0.003,
        0.005,
        0.01,
        0.05,
        0.1,
        0.2,
        0.3,
        0.5
    ],
    "results": {
        "0": {
            "time": 0.002288386344909668,
            "score": 0.851,
            "visited_objects": 2
        },
        "1": {
            "time": 0.002305778980255127,
            "score": 0.851,
            "visited_objects": 2
        },
        "3": {
            "time": 0.002391507148742676,
            "score": 0.991,
            "visited_objects": 3
        },
        "5": {
            "time": 0.002812516689300537,
            "score": 0.991,
            "visited_objects": 6
        },
        "10": {
            "time": 0.003476684093475342,
            "score": 0.991,
            "visited_objects": 10
        },
        "50": {
            "time": 0.009042134284973144,
            "score": 0.991,
 