A script that downloads **nmslib** into the current directory, builds it, and installs the Python bindings for further work.

The script does not delete the cloned repository.


---
#!/bin/bash 

git clone https://github.com/searchivarius/nmslib.git

apt-get install libboost-all-dev libgsl0-dev libeigen3-dev

cd nmslib/similarity_search && cmake . -DWITH_EXTRAS=1 && make && cd ..

cd python_bindings && python setup.py build && sudo python setup.py install && cd ../.. 

---


Copy the commands above into a file named `nmslib-install.sh`, then run:

`sudo bash nmslib-install.sh`

and everything should be set up, including the Python bindings.

# NMSLIB summary

In [1]:
import sys
import os
import time

import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import distance
import nmslib
import pandas as pd

from common import *

In [2]:
np.random.seed(17)

**NMSLIB** is a library for indexing and fast searching of vector data. It also includes benchmark tools, which we will refer to as needed.

### Package constants/enumerations:

* Vector types: `nmslib.DataType`
    * `nmslib.DataType.DENSE_VECTOR` ~ this one will be the most useful

    * `nmslib.DataType.SPARSE_VECTOR` ~ nmslib needs the index of each value along with the value itself; not suitable for our task

    * `nmslib.DataType.OBJECT_AS_STRING` ~ works on strings; again, `DENSE_VECTOR` is the way to go
  
___
  
* Distance return types: `nmslib.DistType`
    * `nmslib.DistType.FLOAT` ~ the default for our task
    
    * `nmslib.DistType.INT`
  

### Methods available through Python:

***Note:*** The `index` variable is a pointer to a C++ object and shows up as such in Python code. `vector` variables are Python lists or numpy objects.

 ___
 
* Creating a new index configuration:
    * `nmslib.init(space_type:str, space_param:list(), method_name:str, dataType:int, distType:int)`
 
 ___
 
 
* Loading an existing index from a file. The index passed in must be built with `nmslib.init` in the same way as the index that was saved.
    * `nmslib.loadIndex(index:int, index_name:str)`
    * ***NOTE*** The index passed to the loader must be created with the same parameters, and every element that was added when the saved index was built must also be added to it, otherwise this method will not work.
___


* Adding a new vector to the index
    * `nmslib.addDataPoint(index:int, id:int, data:list() )`
    * ***NOTE*** Here `data` must be a Python list! Numpy objects raise an error.
___


* Adding a batch of vectors to the index
    * `nmslib.addDataPointBatch(index:int, id_list:numpy(int), data_list: numpy(vector) )`

___


* Building the object against which queries can be run
    * `nmslib.createIndex(index:int, index_param:list(str))`

___


* Number of vectors over which the index will be built
    * `nmslib.getDataPointQty(index:int)`
    
___


* Retrieving an already stored vector by its id
    * `nmslib.getDataPoint(index:int,id:int)`

___


* For already stored vectors, computing the distance between two ids
    * `nmslib.getDistance(index:int, id1:int, id2:int)`
___

* Setting configuration parameters for query execution.
    * `nmslib.setQueryTimeParams(index:int,query_time_param:list(str))`
    
___

* Submitting a query vector and returning the k most similar vectors (requires `createIndex`)
    * `nmslib.knnQuery(index:int,k:int,data:list(float))`

___

* Submitting a query for the k most similar vectors for each element of a batch (requires `createIndex`)
    * `nmslib.knnQueryBatch(index:int,num_threads:int,k:int,query:numpy(vector))`

___

* Freeing the memory of a built index
    * `nmslib.freeIndex(index:int)`

List of all `space_type` values.
Parameters of the individual spaces are given in the `space_param` list in the form `'key=val'`.

* **Metric**
    * ~~Hamming~~
        - ~~bit hamming~~
    * L1
        - l1
    * L2
        - l2
    * Linf
        - linf
    * Lp (p>=1)
        - lp:p=...        
    * Angular distance
        - angulardist
    * Jensen-Shannon metric
        - jsmetrslow
        - jsmetrfast
        - jsmetrfastapprox
    * Levenshtein: ***NOTE*** used for strings
        - ~~leven~~
    * SQFD: ***NOTE*** used for images 
        - ~~sqfd_minus_func~~
        - ~~sqfd_heuristic_func:alpha=...,~~
        - ~~sqfd_gaussian_func:alpha=...~~

In [3]:
# List of metric space type
lp_space_p=3
metric_space_type_opts=['l1',
                        'l2',
                        'linf',
                        'lp',
                        'angulardist',
                        'jsmetrslow',
                        'jsmetrfast',
                        'jsmetrfastapprox']
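
To make the Minkowski-family names above concrete, here is a minimal pure-Python sketch of the textbook definitions behind `l1`, `l2`, `linf` and `lp`. The helper names `lp_distance` and `linf_distance` are made up for this illustration, and the values nmslib itself returns are assumed, not verified, to follow the same definitions.

```python
def lp_distance(x, y, p):
    """Minkowski (Lp) distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def linf_distance(x, y):
    """Chebyshev (L-infinity) distance: the largest coordinate difference."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return max(abs(a - b) for a, b in zip(x, y))


x = [0.0, 0.0, 0.0]
y = [3.0, 4.0, 0.0]

d_l1 = lp_distance(x, y, 1)    # 3 + 4 = 7.0
d_l2 = lp_distance(x, y, 2)    # sqrt(9 + 16) = 5.0
d_l3 = lp_distance(x, y, 3)    # cube root of 27 + 64, roughly 4.5
d_linf = linf_distance(x, y)   # max(3, 4, 0) = 4.0
```

For a fixed pair of vectors the distance shrinks as p grows and approaches the `linf` value, which is why `linf` is listed next to the `lp` family.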


* ***Non-metric spaces (symmetric distances)***

    * Lp (generic p < 1)
        - lp:p=...         
    * Jensen-Shannon divergence
        - jsdivslow
        - jsdivfast
        - jsdivfastapprox
    * Cosine distance
        - cosinesimil
        - ~~cosinesimil_sparse~~
        - ~~cosinesimil_sparse_fast~~
    * Norm. Levenshtein: ***NOTE*** used for strings
        - ~~normleven~~

In [4]:
# List of nonmetric space type with symmetric distances
lp_space_p=0.5
nonmetric_sym_space_type_opts=['lp',
                        'jsdivslow',
                        'jsdivfast',
                        'cosinesimil',
                        'jsdivfastapprox']
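
Why `lp` with p < 1 sits in the non-metric group can be checked with three points: the triangle inequality fails. The sketch below uses the textbook Lp formula in plain Python; `lp_distance` and `triangle_holds` are hypothetical helper names, not part of nmslib.

```python
def lp_distance(x, y, p):
    """Textbook Lp formula; for p < 1 it is no longer a metric."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def triangle_holds(x, y, z, p, eps=1e-12):
    """True if d(x, z) <= d(x, y) + d(y, z) for the given p."""
    return lp_distance(x, z, p) <= lp_distance(x, y, p) + lp_distance(y, z, p) + eps


x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

direct = lp_distance(x, z, 0.5)                            # (1 + 1) ** 2 = 4.0
detour = lp_distance(x, y, 0.5) + lp_distance(y, z, 0.5)   # 1.0 + 1.0 = 2.0

violated_for_half = not triangle_holds(x, y, z, 0.5)       # True: 4.0 > 2.0
holds_for_one = triangle_holds(x, y, z, 1)                 # True: 2.0 <= 2.0
```

Going through `y` is shorter than going directly from `x` to `z`, which is exactly what a metric forbids. Index structures that rely on the triangle inequality for pruning cannot assume it in such spaces.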

* ***Non-metric spaces (non-symmetric distances)***
    * Regular KL-div.
        - kldivfast
        - kldivfastrq
    * Generalized KL-div.   
        - left queries:
            - kldivgenslow
            - kldivgenfast
        - right queries:
            - kldivgenfastrq 
    * Itakura-Saito
        - left queries: 
            - ~~itakurasaitoslow~~
            - itakurasaitofast
        - right queries
            - ~~itakurasaitofastrq~~

In [5]:
# List of nonmetric space type with nonsymmetric distances
nonmetric_nonsym_space_type_opts= \
                        ['kldivfast',\
                         'kldivfastrq',\
                         'kldivgenslow',\
                         'kldivgenfast',\
                         'kldivgenfastrq',\
                         'itakurasaitofast']
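
The spaces in this group are non-symmetric: swapping the two arguments changes the result. A small sketch with the textbook KL divergence shows the effect. `kl_divergence` is a hypothetical helper that assumes strictly positive, normalized inputs; which argument order nmslib's left and right query variants use is not asserted here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl_divergence(p, q)    # roughly 0.37
backward = kl_divergence(q, p)   # roughly 0.51
same = kl_divergence(p, p)       # 0.0: a distribution has zero divergence from itself
```

Because `forward` and `backward` differ, it matters whether the query is passed as the first or the second argument, which is presumably why separate left-query and right-query variants are listed.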


Different options for the `method_name` variable, with their parameters:

#TODO


In [6]:
# TODO Scan through all search methods and extract the most interesting

In [7]:
# TODO Find out what are index_param and query_time_param

Utility functions:

In [8]:
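def read_data_fast_batch(fn, batch_size, sep=','):
    # Stream the CSV in chunks of batch_size rows, each as a contiguous float32 array.
    for df in pd.read_csv(fn, sep=sep, header=None, chunksize=batch_size):
        # df.values replaces DataFrame.as_matrix(), which newer pandas no longer provides.
        yield np.ascontiguousarray(df.values, dtype=np.float32)

def add_data_to_index(index, file_name, batch_size, sep=','):
    # Add the file to the index batch by batch; ids continue across batches.
    offset = 0
    for data_batch in read_data_fast_batch(file_name, batch_size, sep):
        indices = np.arange(len(data_batch), dtype=np.int32) + offset
        nmslib.addDataPointBatch(index, indices, data_batch)
        offset += data_batch.shape[0]
    # Total number of vectors added.
    return offset

def read_data_fast(fn, sep=','):
    # Load the whole CSV at once as a contiguous float32 array.
    df = pd.read_csv(fn, sep=sep, header=None)
    return np.ascontiguousarray(df.values, dtype=np.float32)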
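
The offset bookkeeping in `add_data_to_index` can be looked at in isolation. The sketch below reproduces only the id assignment with plain Python lists, without pandas, numpy or nmslib; `batches_with_ids` is a hypothetical helper written for this illustration.

```python
def batches_with_ids(rows, batch_size):
    """Yield (ids, batch) pairs; ids keep counting up from one batch to the next."""
    offset = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        ids = list(range(offset, offset + len(batch)))
        yield ids, batch
        offset += len(batch)


rows = [[float(i), float(i) * 2] for i in range(5)]
chunks = list(batches_with_ids(rows, batch_size=2))

id_lists = [ids for ids, _ in chunks]                   # [[0, 1], [2, 3], [4]]
total_rows = sum(len(batch) for _, batch in chunks)     # 5
```

Each vector gets a unique id that equals its row number in the file, so a result id can be mapped straight back to a CSV row even though the last batch is shorter than the others.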

Usage example:

In [9]:
# # Test dataset

# observations=2000
# queryset=100
# dim=3

# pd.DataFrame(data=np.random.rand(observations,dim)*100).to_csv('data.csv',sep=',',index=False,header=False)
# pd.DataFrame(data=np.random.rand(queryset,dim)*100).to_csv('queries.csv',sep=',',index=False,header=False)

# data=pd.read_csv('data.csv',header=None)

In [10]:
# one full run

def test_config(space_type='lp',
                space_param=['p=1'],
                method_name='small_world_rand',
                index_param=['NN=17', 'initIndexAttempts=3', 'indexThreadQty=4'],
                query_time_param=['initSearchAttempts=3']):
    # Configure an index over dense float vectors.
    index = nmslib.init(space_type,
                        space_param,
                        method_name,
                        nmslib.DataType.DENSE_VECTOR,
                        nmslib.DistType.FLOAT)

    # Load the data in batches of 100 rows, then build the index.
    add_data_to_index(index, 'data.csv', 100)
    nmslib.createIndex(index, index_param)

    nmslib.setQueryTimeParams(index, query_time_param)

    # Query the k nearest neighbours for every row of the query file.
    k = 3
    query = read_data_fast('queries.csv')
    num_threads = 10
    res = nmslib.knnQueryBatch(index, num_threads, k, query)
    nmslib.freeIndex(index)
    return res
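
The `index_param` and `query_time_param` arguments of `test_config` are lists of `'key=val'` strings, the same form used for `space_param`. Writing them by hand is error-prone, so one option is to keep the settings in a dict and format them on demand. `to_param_list` below is a hypothetical helper, and it assumes that the order of entries does not matter to nmslib.

```python
def to_param_list(params):
    """Turn {'NN': 17} into ['NN=17']; keys are sorted for a reproducible order."""
    return ['{}={}'.format(key, value) for key, value in sorted(params.items())]


index_param = to_param_list({'NN': 17,
                             'initIndexAttempts': 3,
                             'indexThreadQty': 4})
query_time_param = to_param_list({'initSearchAttempts': 3})
space_param = to_param_list({'p': 1})
```

The resulting lists can be passed wherever the notebook currently passes literal lists, for example `test_config(index_param=index_param)`.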

In [11]:
# Test for runnable space types
# Note: 'lp' appears twice; both runs use the default space_param ['p=1'] from test_config.
space_types=metric_space_type_opts+nonmetric_sym_space_type_opts+nonmetric_nonsym_space_type_opts
for space_type in space_types:
    print(space_type)
    test_config(space_type=space_type)
    print('OK')
    print('')

l1
OK

l2
OK

linf
OK

lp




OK

angulardist
OK

jsmetrslow
OK

jsmetrfast
OK

jsmetrfastapprox
OK

lp
OK

jsdivslow
OK

jsdivfast
OK

cosinesimil
OK

jsdivfastapprox
OK

kldivfast
OK

kldivfastrq
OK

kldivgenslow
OK

kldivgenfast
OK

kldivgenfastrq
OK

itakurasaitofast
OK
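
The run above only confirms that each space type executes without an error; it says nothing about the quality of the returned neighbours. One way to check that, sketched here under stated assumptions, is to compare against an exact brute-force search on a small sample. `brute_force_knn` and `recall` are hypothetical helpers; the format of the `knnQueryBatch` result is not shown in this notebook, so the sketch works on plain lists of ids.

```python
import heapq
import math


def l2_distance(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def brute_force_knn(data, query, k, distance=l2_distance):
    """Return the k (id, distance) pairs closest to query, nearest first."""
    scored = ((distance(point, query), idx) for idx, point in enumerate(data))
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result recovered."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / float(len(exact))


data = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
exact = brute_force_knn(data, [0.1, 0.0], k=2)
exact_ids = [idx for idx, _ in exact]        # [0, 1]

# A made-up approximate answer that found one of the two true neighbours.
partial_recall = recall([0, 2], exact_ids)   # 0.5
```

Brute force is linear in the number of stored vectors per query, so it is only practical on a sample, but it gives a ground truth against which different `method_name` and parameter choices can be compared.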



In [12]:
# # Test for runnable method names
# method_names=[
#     'vptree',
#     'mvptree',
#     'ghtree',
#     'list_clusters',
#     'satree',
#     'bbtree'
# ]
# for method_name in method_names:
#     print method_name
#     test_config(method_name=method_name)
#     print 'OK'
#     print 