In [1]:
%reload_ext autoreload
%autoreload 2

# Using LMI with a custom dataset

## Use case 2 -- run and evaluate LMI with algorithms from [Learned metric index - proposition of learned indexing for unstructured data](https://www.sciencedirect.com/science/article/pii/S0306437921000326).

### Necessary components:
1. the `descriptors` file -- the vector representation of the dataset objects
    - We assume that you have these descriptors ready
2. the `labels` file -- a file associating every object from `descriptors` with a node in the template index

The supervised version of LMI assumes that you have a template indexing or clustering method that provides the labels needed for training.
    
#### Components that will be used but can be extracted from necessary components:
3. the `knn-labels` file

To learn how to **Create the ground-truth and queries files**, visit the `use-case-1.ipynb` notebook.

# Example with a custom dataset

Within the data folder, we have stored a `test/simple-data.csv` file, which contains a tiny example descriptor dataset. We use it to demonstrate how LMI works with a custom dataset.

## Steps:
1. Load the configuration file
2. Load the dataset using `SimpleDataLoader`
3. Load the labels using `SimpleDataLoader`
    - Create k-nn labels
4. Train and search in the LMI
5. Evaluate the results
6. Train and search in the LMI using a Multilabel NN

In [14]:
import os
import pandas as pd
from lmi.utils import load_yaml, load_model_config
from lmi.data.SimpleDataLoader import SimpleDataLoader
from lmi.indexes.LearnedMetricIndex import LMI

#### 1. Load the configuration file

In [15]:
config = load_yaml('./supplementary-experiment-setups/dummy-data-config.yml')
config

{'setup': 'lmi-test',
 'data': {'data-dir': '/storage/brno12-cerit/home/tslaninakova/data/test',
  'dataset-file': 'simple-data.csv',
  'queries': 'simple-queries.txt',
  'knn-gt': 'simple-knn.json',
  'labels-dir': 'labels/',
  'pivots-filename': 'pivots/M-tree.struct',
  'normalize': False,
  'shuffle': True},
 'LMI': {'model-config': './supplementary-experiment-setups/data-driven/models/model-kmeans.yml',
  'n_levels': 2,
  'training-dataset-percentage': 1},
 'experiment': {'output-dir': 'outputs',
  'search-stop-conditions': [0.0005,
   0.001,
   0.003,
   0.005,
   0.01,
   0.05,
   0.1,
   0.2,
   0.3,
   0.5],
  'knns': 30}}
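
The `search-stop-conditions` values read as fractions of the dataset size. One plausible reading, consistent with the keys `0`, `1`, `3`, `5`, `10`, `50` that appear in `summary.json` for this 1,000-object dataset, is that each fraction is truncated to a number of objects to visit. The sketch below illustrates that assumption; `to_object_counts` is a hypothetical helper and not part of the `lmi` package.

```python
# Fractions copied from config['experiment']['search-stop-conditions'].
stop_conditions = [0.0005, 0.001, 0.003, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5]
dataset_size = 1000  # number of rows in simple-data.csv, as reported by the training log


def to_object_counts(fractions, n_objects):
    """Assumed conversion: truncate each fraction of the dataset to an object count."""
    return [int(fraction * n_objects) for fraction in fractions]


object_counts = to_object_counts(stop_conditions, dataset_size)
print(object_counts)  # [0, 1, 3, 5, 10, 50, 100, 200, 300, 500]
```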

#### 2. Load the dataset using `SimpleDataLoader`

Note that if the loading method of `SimpleDataLoader` does not work with your dataset, you can easily modify it -- we use the Pandas API.
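
If you need to adapt the loading, a minimal sketch with the Pandas API might look like the following. It assumes a CSV file with a header row and the object ID in the first column. The `load_descriptors_csv` helper, the inline sample data, and the min-max interpretation of `normalize` are illustrative assumptions and do not reproduce the actual `SimpleDataLoader` code.

```python
import io

import pandas as pd


def load_descriptors_csv(source, normalize=False, shuffle=False, seed=0):
    """Hypothetical loader: read descriptors, optionally normalize and shuffle them."""
    descriptors = pd.read_csv(source, index_col=0)  # first column holds the object IDs
    if normalize:
        # Assumed min-max scaling per column; a constant column would yield NaN.
        descriptors = (descriptors - descriptors.min()) / (descriptors.max() - descriptors.min())
    if shuffle:
        # Shuffle the rows while keeping the original object IDs in the index.
        descriptors = descriptors.sample(frac=1, random_state=seed)
    return descriptors


# Tiny inline stand-in for a descriptors file (made-up values).
sample_csv = io.StringIO(
    ",ecc,N\n"
    "0,19.0,2575.7\n"
    "1,14.2,3356.5\n"
    "2,16.5,3000.0\n"
)
descriptors = load_descriptors_csv(sample_csv, normalize=True, shuffle=True)
print(descriptors)
```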

In [16]:
loader = SimpleDataLoader(config['data'])
dataset = loader.load_descriptors()
dataset.head(2)

INFO:lmi.data.SimpleDataLoader:Loading dataset from /storage/brno12-cerit/home/tslaninakova/data/test/simple-data.csv.


Unnamed: 0,Sample,ecc,N,gammaG,Esoil,Econc,Dbot,H1,H2,H3,Mr_t,Mt_t,Mr_c,Mt_c
118,119.0,19.01165,2575.731201,0.93529,49.808781,30498.150391,21.289761,1.40301,1.08915,1.27428,-2.112034,-0.870522,2.437778,1.518358
873,874.0,14.27874,3356.581055,0.97378,106.582359,30839.126953,21.234859,1.43009,1.22072,0.97214,-1.536059,-0.744451,2.445121,1.569207


#### 3. Load the labels using `SimpleDataLoader`

The labels files are located in the `labels/` subfolder, as specified in `config['data']`. The number of files corresponds to the depth of the template index:

In [17]:
DIR=f"{config['data']['data-dir']}/{config['data']['labels-dir']}"
%ls $DIR

level-1.txt  level-2.txt


In [18]:
FILE=DIR+'/level-1.txt'
!head -n 5 $FILE

9	735
3	700
4	412
9	615
8	806


In [19]:
FILE=DIR+'/level-2.txt'
!head -n 5 $FILE

9.7	735
3.1	700
4.5	412
9.5	615
8.8	806


This template index has 2 levels. The first is captured in `level-1.txt`, where every object (second column) is associated with a node label (first column). In the second file, the location is specified as `"first-node"."second-node"`. Note that these labels represent a balanced tree, i.e., every object is present in every `level-*.txt` file. However, the tree can be unbalanced as well.

The `load_labels` function expects the labels files in these forms and with these filenames.
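
To make the file format concrete, here is a minimal sketch of how such files could be parsed with Pandas. It assumes tab-separated files with the node label in the first column and the object ID in the second, as shown above. `read_level_file` is a hypothetical helper and may differ from what `load_labels` does internally.

```python
import io

import pandas as pd


def read_level_file(source, level):
    """Hypothetical parser for one `level-*.txt` file (label <TAB> object_id)."""
    frame = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["label", "object_id"],
        dtype={"label": str},  # keep "9.7" as text instead of parsing it as a float
    )
    # The last dot-separated component is the node position on this level.
    frame[f"L{level}"] = frame["label"].str.split(".").str[-1].astype(int)
    return frame.set_index("object_id")[[f"L{level}"]]


# Inline stand-ins for the first rows of level-1.txt and level-2.txt.
level_1 = io.StringIO("9\t735\n3\t700\n4\t412\n")
level_2 = io.StringIO("9.7\t735\n3.1\t700\n4.5\t412\n")

# Join the per-level columns on the object ID.
labels_sketch = read_level_file(level_1, 1).join(read_level_file(level_2, 2))
print(labels_sketch)
```

With an unbalanced tree, an object missing from a deeper level would end up with a missing value in that column after the join.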

In [20]:
labels = loader.load_labels()

In [21]:
labels[735:].head()

object_id,L1,L2
735,9,7
736,1,9
737,2,3
738,9,5
739,1,8


#### 3.1 Create the k-NN labels

If you're **interested in using the Multilabel-NN algorithm (if not, you can skip to 4.)** to train the LMI, you'll need to adjust the labels so that each object is associated with a vector of `k` node identifiers as labels. The configuration sets `knns` to 30; the cells below use `k=2`.

As a precursor, we need to know the ground truth of every object -- the `k` closest objects to every object (as opposed to just a subset, which is what we did in `use-case-1.ipynb`).

In [22]:
from lmi.distances.euclidean import get_euclidean_distance
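def compute_ground_truths(query_objects, dataset, k, metric):
    """Brute-force k-NN: for each query object, find its `k` closest objects in `dataset`."""
    ground_truths = {}
    for query_object in query_objects:
        query_vector = dataset.loc[query_object]  # look up the query vector once per query
        distances_to_query_object = {}
        for i, data_object in dataset.iterrows():
            distances_to_query_object[str(i)] = float(metric(query_vector, data_object))
        # keep the k smallest distances, stored as {object_id: distance}
        ground_truths[str(query_object)] = dict(
            sorted(distances_to_query_object.items(), key=lambda item: item[1])[:k]
        )
    return ground_truths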

%time ground_truths = compute_ground_truths(dataset.index, dataset, k=2, metric=get_euclidean_distance)

CPU times: user 3min 48s, sys: 262 ms, total: 3min 48s
Wall time: 3min 53s
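
The nested loop above takes almost four minutes for 1,000 objects. If that is too slow for your dataset, a vectorized variant is one option. The sketch below assumes that the metric is the plain Euclidean distance over all descriptor columns and that the full pairwise distance matrix fits into memory. `compute_ground_truths_vectorized` is a hypothetical helper, not part of the `lmi` package.

```python
import numpy as np
import pandas as pd


def compute_ground_truths_vectorized(dataset, k):
    """Return {query_id: {neighbor_id: distance}} using one pairwise distance matrix."""
    values = dataset.to_numpy(dtype=float)
    ids = dataset.index.to_numpy()
    # Pairwise differences via broadcasting: shape (n, n, d), so memory grows as n^2 * d.
    differences = values[:, None, :] - values[None, :, :]
    distances = np.sqrt((differences ** 2).sum(axis=2))
    # Column positions of the k nearest objects for every row (stable order on ties).
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    return {
        str(ids[row]): {str(ids[col]): float(distances[row, col]) for col in nearest[row]}
        for row in range(len(ids))
    }


# Tiny made-up dataset with non-contiguous object IDs.
toy = pd.DataFrame({"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 1.0]}, index=[10, 20, 30])
toy_ground_truths = compute_ground_truths_vectorized(toy, k=2)
print(toy_ground_truths["10"])  # {'10': 0.0, '30': 1.0}
```

The returned dictionary has the same shape as the output of `compute_ground_truths`, so it can be passed to the cells that follow.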


In [23]:
# save the ground truths for later use
from lmi.utils import save_json
save_json(ground_truths, os.path.join(config['data']['data-dir'], 'simple-knn-all-objects.json'))

In [24]:
labels.columns

Index(['L1', 'L2'], dtype='object')

In [25]:
def get_knn_labels_from_ground_truth(ground_truths, labels):
    all_vectors = {}
    for obj, neighbors in ground_truths.items():
        vector_per_object = []
        for label_col in labels.columns:
            vector_per_object_per_label = []
            for neighbor,_ in neighbors.items():
                vector_per_object_per_label.append(labels.loc[int(neighbor)][label_col])
            vector_per_object.append(vector_per_object_per_label)
        all_vectors[int(obj)] = vector_per_object
        
    knn_labels_df = pd.DataFrame(all_vectors.values(), columns=labels.columns)
    knn_labels_df.index = list(all_vectors.keys())
    return knn_labels_df

%time knn_labels = get_knn_labels_from_ground_truth(ground_truths, labels)

CPU times: user 365 ms, sys: 0 ns, total: 365 ms
Wall time: 383 ms


In [26]:
knn_labels.head()

Unnamed: 0,L1,L2
118,"[3, 8]","[7, 3]"
873,"[8, 2]","[1, 8]"
807,"[2, 8]","[4, 8]"
244,"[5, 6]","[4, 5]"
890,"[7, 1]","[9, 7]"
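
Multilabel classifiers are commonly trained on multi-hot target vectors, with one position per node and a 1 for every node that appears among an object's k-NN labels. The sketch below shows that encoding for the `L1` vectors printed above. Whether LMI encodes `knn_labels` this way internally is an assumption, and `to_multi_hot` is a hypothetical helper.

```python
import numpy as np


def to_multi_hot(label_vectors, classes):
    """Encode each list of node labels as a 0/1 vector over `classes`."""
    class_to_column = {label: column for column, label in enumerate(classes)}
    targets = np.zeros((len(label_vectors), len(classes)), dtype=np.float32)
    for row, vector in enumerate(label_vectors):
        for label in vector:
            targets[row, class_to_column[label]] = 1.0
    return targets


# L1 k-NN label vectors of the first five objects shown above.
l1_vectors = [[3, 8], [8, 2], [2, 8], [5, 6], [7, 1]]
# Node labels that occur in this small sample, in sorted order.
classes = sorted({label for vector in l1_vectors for label in vector})
targets = to_multi_hot(l1_vectors, classes)
print(classes)
print(targets)
```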


In [27]:
# make sure that config['data']['queries'] points to the queries file
# created in the use-case-1.ipynb notebook
queries = loader.load_queries()

#### 4. Train and search in the LMI

In [28]:
lmi = LMI(config['LMI'], dataset, labels)

In [29]:
model_config = load_model_config(
    config['LMI']['model-config'].replace(
        'data-driven/models/model-kmeans.yml',
        '100k/models/CoPhIR-100k-Mtree-200-LR-model.yml'
    ), lmi.n_levels)
model_config

{'level-0': {'max_iter': 10, 'C': 10000, 'model': 'LogReg'},
 'level-1': {'max_iter': 5, 'C': 10000, 'model': 'LogReg'},
 'level-2': {'max_iter': 5, 'C': 10000, 'model': 'LogReg'},
 'level-3': {'max_iter': 5, 'C': 10000, 'model': 'LogReg'},
 'level-4': {'max_iter': 5, 'C': 10000, 'model': 'LogReg'},
 'level-5': {'max_iter': 5, 'C': 10000, 'model': 'LogReg'},
 'level-6': {'max_iter': 5, 'C': 10000, 'model': 'LogReg'},
 'level-7': {'max_iter': 5, 'C': 10000, 'model': 'LogReg'},
 'level-8': {'max_iter': 5, 'C': 10000, 'model': 'LogReg'}}

In [30]:
%time lmi.train(model_config, rebuild=True)

INFO:lmi.indexes.BaseIndex:Training model M.0 (root) on dataset(1000, 14) with {'max_iter': 10, 'C': 10000, 'model': 'LogReg'}.
INFO:lmi.indexes.BaseIndex:Training level 1 with {'max_iter': 5, 'C': 10000, 'model': 'LogReg'}.
INFO:lmi.indexes.BaseIndex:Finished training the LMI.


CPU times: user 147 ms, sys: 3.99 ms, total: 151 ms
Wall time: 156 ms


In [31]:
%time search_results, times, visited_objects_all = lmi.search(queries[0], [50])
search_results, times, visited_objects_all

CPU times: user 1.87 ms, sys: 7 µs, total: 1.88 ms
Wall time: 1.88 ms


([[(3, 7)]], [0.001852273941040039], [62])

#### 5. Evaluate LMI's performance

In [32]:
from lmi.Experiment import Evaluator
import pandas as pd

queries_df = pd.DataFrame(queries)
queries_df = queries_df.set_index([0])

e = Evaluator(lmi, ground_truths, queries_df, config)
e.run_evaluate()
e.generate_summary()

INFO:lmi.Experiment:Starting the search for 500 queries.
INFO:lmi.Experiment:Evaluated 100/500 queries.
INFO:lmi.Experiment:Evaluated 200/500 queries.
INFO:lmi.Experiment:Evaluated 300/500 queries.
INFO:lmi.Experiment:Evaluated 400/500 queries.
INFO:lmi.Experiment:Search is finished, results are stored in: 'outputs/2022-06-23--14-38-09/search.csv'
INFO:lmi.Experiment:Consumed memory by evaluating (MB): None


In [34]:
!cat outputs/2022-06-23--14-38-09/summary.json

{
    "model": "LMI",
    "experiment": "outputs/2022-06-23--14-38-09",
    "stop_conditions_perc": [
        0.0005,
        0.001,
        0.003,
        0.005,
        0.01,
        0.05,
        0.1,
        0.2,
        0.3,
        0.5
    ],
    "results": {
        "0": {
            "time": 0.0009971680641174317,
            "score": 0.928,
            "visited_objects": 56
        },
        "1": {
            "time": 0.001017770767211914,
            "score": 0.928,
            "visited_objects": 56
        },
        "3": {
            "time": 0.0010340828895568847,
            "score": 0.932,
            "visited_objects": 56
        },
        "5": {
            "time": 0.0010489230155944825,
            "score": 0.932,
            "visited_objects": 56
        },
        "10": {
            "time": 0.001065969467163086,
            "score": 0.934,
            "visited_objects": 57
        },
        "50": {
            "time": 0.0012597012519836425,
            "score": 

#### 6. Train and search in the LMI using a Multilabel NN

In [35]:
lmi = LMI(config['LMI'], dataset, knn_labels)

In [36]:
model_config = load_model_config(
    config['LMI']['model-config'].replace(
        'data-driven/models/model-kmeans.yml',
        '100k/models/CoPhIR-100k-Mtree-200-multilabel-NN-model.yml'
    ), lmi.n_levels)
model_config

{'level-0': {'model': 'MultilabelNN',
  'epochs': 10,
  'learning_rate': 0.0001,
  'optimizer': 'adam',
  'loss': 'categorical_crossentropy',
  'hidden_layers': {'dense': [{'units': 282,
     'activation': 'relu',
     'dropout': None}]}},
 'level-1': {'model': 'MultilabelNN',
  'epochs': 10,
  'learning_rate': 0.0001,
  'optimizer': 'adam',
  'loss': 'categorical_crossentropy',
  'hidden_layers': {'dense': [{'units': 282,
     'activation': 'relu',
     'dropout': None}]}},
 'level-2': {'model': 'MultilabelNN',
  'epochs': 10,
  'learning_rate': 0.01,
  'optimizer': 'adam',
  'loss': 'categorical_crossentropy',
  'hidden_layers': {'dense': [{'units': 282,
     'activation': 'relu',
     'dropout': None},
    {'units': 1024, 'activation': 'relu', 'dropout': None},
    {'units': 256, 'activation': 'relu', 'dropout': None}]}},
 'level-3': {'model': 'MultilabelNN',
  'epochs': 10,
  'learning_rate': 0.001,
  'optimizer': 'adam',
  'loss': 'categorical_crossentropy',
  'hidden_layers': {'d

In [37]:
%time lmi.train(model_config, rebuild=True)

INFO:lmi.indexes.BaseIndex:Training model M.0 (root) on dataset(1000, 14) with {'model': 'MultilabelNN', 'epochs': 10, 'learning_rate': 0.0001, 'optimizer': 'adam', 'loss': 'categorical_crossentropy', 'hidden_layers': {'dense': [{'units': 282, 'activation': 'relu', 'dropout': None}]}}.
INFO:lmi.indexes.BaseIndex:Training level 1 with {'model': 'MultilabelNN', 'epochs': 10, 'learning_rate': 0.0001, 'optimizer': 'adam', 'loss': 'categorical_crossentropy', 'hidden_layers': {'dense': [{'units': 282, 'activation': 'relu', 'dropout': None}]}}.
INFO:lmi.indexes.BaseIndex:Finished training the LMI.


CPU times: user 1.15 s, sys: 60.1 ms, total: 1.21 s
Wall time: 1.3 s


In [38]:
%time search_results, times, visited_objects_all = lmi.search(queries[0], [50])
search_results, times, visited_objects_all

CPU times: user 27.9 ms, sys: 4 ms, total: 31.9 ms
Wall time: 33.7 ms


([[(1, 1)]], [0.033634185791015625], [1000])