# Vector Similarity for RediSearch

This document describes the vector similarity capabilities that RediSearch offers in the private preview build.
RediSearch vector similarity capabilities are:
1. Realtime vector indexing
2. Realtime vector update
3. Realtime vector deletion
4. TOP-K query

## Indexing capabilities
The private preview build supports two index algorithms and three distance metrics:

### Index algorithms
1. Brute force (Flat Index)
2. HNSW (Hierarchical Navigable Small World)

### Distance metrics
1. L2 - Euclidean distance between two vectors
2. IP - Inner product of two vectors
3. COSINE - Cosine similarity of two vectors
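
For reference, the three metrics can be written out in a few lines of plain Python. This is a minimal sketch of the textbook definitions only; the function names are illustrative, and this document does not state whether the score RediSearch returns is exactly one of these values (for example, a squared distance or a converted similarity).

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product divided by the product of the vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

a = [3.0, 4.0]
b = [0.0, 8.0]
print(l2_distance(a, b))        # 5.0
print(inner_product(a, b))      # 32.0
print(cosine_similarity(a, b))  # 0.8
```

Once both vectors are scaled to unit length, the inner product and the cosine similarity coincide, which is why normalizing vectors is a common way to serve cosine queries.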

## Creating an index
To create a vector index, invoke the index creation command `FT.CREATE` with the new reserved word `VECTOR` after the vector field name.

Command format:
```
FT.CREATE <index_name> SCHEMA <vector field name> VECTOR <type> <dimension> <distance metric> <index algorithm> <algorithm parameters>
```
### General indexing mandatory parameters
Parameters must be given to the index creation command in the following order:
* type - vector data type. Currently only `FLOAT32` is supported.
* dimension - vector dimension.
* distance metric - one of `L2` for Euclidean distance, `IP` for inner product, or `COSINE` for cosine similarity. Note that when `COSINE` is selected, the indexed vectors are normalized upon indexing and the query vector is normalized upon query.
* index algorithm - either `BF` for brute force or `HNSW` for the HNSW indexing algorithm.

### Brute force (Flat index)
This index compares the query vector against every indexed vector and returns the top-k most similar vectors, according to the given distance metric.
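
The idea behind a flat index can be sketched in plain Python: score every stored vector against the query and keep the `k` best. This is an illustration of the general technique under the L2 metric, not RediSearch's implementation; `brute_force_top_k` and the sample data are hypothetical.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_top_k(vectors, query, k):
    """Scan all vectors and return the k closest as (id, distance), nearest first.

    `vectors` maps a document id to its vector.
    """
    scored = ((l2_distance(vec, query), doc_id) for doc_id, vec in vectors.items())
    return [(doc_id, dist) for dist, doc_id in heapq.nsmallest(k, scored)]

vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [5.0, 5.0], 3: [0.0, 2.0]}
print(brute_force_top_k(vectors, [0.0, 0.0], 2))  # [(0, 0.0), (1, 1.0)]
```

Every query touches every vector, so the cost grows linearly with the number of indexed vectors, but the result is exact.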

#### Index specific parameters
* `INITIAL_CAP` - initial index capacity (number of vectors). The index pre-allocates space for this many vectors, so no additional allocations are needed while indexing (an optimization).

An example of creating a brute force index with an initial capacity of 1 million vectors of 128 floats, using the L2 distance metric:

```
FT.CREATE my_flat_index SCHEMA my_vector_field VECTOR FLOAT32 128 L2 BF INITIAL_CAP 1000000
```
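
As a rough sizing aid for `INITIAL_CAP`, the raw vector payload can be estimated from the capacity, the dimension, and the 4-byte size of a `FLOAT32` component. The sketch below is only arithmetic on the example above; `raw_vector_bytes` is a hypothetical helper, and the figure excludes any index or Redis overhead, which this document does not specify.

```python
FLOAT32_BYTES = 4  # size of one FLOAT32 component

def raw_vector_bytes(capacity, dim):
    """Bytes needed for the vector payload alone (no index or key overhead)."""
    return capacity * dim * FLOAT32_BYTES

size = raw_vector_bytes(1_000_000, 128)
print(size)                    # 512000000
print(round(size / 2**20, 1))  # 488.3 (MiB)
```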

### HNSW
This index algorithm is a modified version of [nmslib/hnswlib](https://github.com/nmslib/hnswlib), which is the author's implementation of [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/ftp/arxiv/papers/1603/1603.09320.pdf).

#### Index specific parameters
* `INITIAL_CAP` - initial index capacity (number of vectors). The index pre-allocates space for this many vectors, so no additional allocations are needed while indexing (an optimization).
* `M` - maximum number of outbound connections per node in the graph.
* `EF` - maximum number of potential candidates to consider for connection while building the graph.

An example of creating an HNSW index with an initial capacity of 1 million vectors of 128 floats, using the L2 distance metric, with `EF` set to 200 and `M` set to 40:

```
FT.CREATE my_hnsw_index SCHEMA my_vector_field VECTOR FLOAT32 128 L2 HNSW INITIAL_CAP 1000000 M 40 EF 200
```
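
Both creation commands above follow the same argument order, so a small helper can assemble and sanity-check the argument list before it is passed to a client's `execute_command`. This is a sketch: `build_create_args` is a hypothetical helper, and treating the algorithm-specific parameters as optional is an assumption not confirmed by this document.

```python
VALID_METRICS = {"L2", "IP", "COSINE"}
# Algorithm-specific parameters, in the order used by the examples above.
ALGORITHM_PARAMS = {"BF": ("INITIAL_CAP",), "HNSW": ("INITIAL_CAP", "M", "EF")}

def build_create_args(index_name, field, dim, metric, algorithm, **params):
    """Return the FT.CREATE argument list in the documented order."""
    if metric not in VALID_METRICS:
        raise ValueError(f"unsupported distance metric: {metric}")
    if algorithm not in ALGORITHM_PARAMS:
        raise ValueError(f"unsupported index algorithm: {algorithm}")
    args = ["FT.CREATE", index_name, "SCHEMA", field,
            "VECTOR", "FLOAT32", dim, metric, algorithm]
    for name in ALGORITHM_PARAMS[algorithm]:
        if name in params:  # assumption: algorithm parameters are optional
            args += [name, params.pop(name)]
    if params:
        raise ValueError(f"unknown parameters for {algorithm}: {sorted(params)}")
    return args

args = build_create_args("my_hnsw_index", "my_vector_field", 128, "L2", "HNSW",
                         INITIAL_CAP=1000000, M=40, EF=200)
print(" ".join(str(a) for a in args))
```

With a `redis.Redis` connection, the list could then be sent with `redis_conn.execute_command(*args)`.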

## Query
To execute a top-k vector query, invoke the search command `FT.SEARCH` with the vector blob passed as a parameter.

Command format:
```
FT.SEARCH <index name> "@<vector field name>:[$<vector blob parameter name> TOPK <k>]" RETURN 1 <vector field name>_score SORTBY <vector field name>_score LIMIT 0 <k> PARAMS 2 <vector blob parameter name> <vector blob>
```
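
The `<vector blob>` is the raw bytes of the vector's `FLOAT32` components, 4 bytes each. The sketch below packs a list of floats with the standard `struct` module in native byte order, which should produce the same bytes as the `astype(np.float32).tobytes()` call used in the Python examples later in this document. `to_float32_blob` and `from_float32_blob` are hypothetical helpers, and the sketch assumes client and server share the same byte order.

```python
import struct

def to_float32_blob(values, dim):
    """Pack `values` as `dim` consecutive 4-byte floats (native byte order)."""
    if len(values) != dim:
        raise ValueError(f"expected {dim} components, got {len(values)}")
    return struct.pack(f"={dim}f", *values)

def from_float32_blob(blob):
    """Inverse of to_float32_blob, useful for checking a blob."""
    count = len(blob) // 4
    return list(struct.unpack(f"={count}f", blob))

blob = to_float32_blob([1.0, 0.5, -2.0, 0.0], dim=4)
print(len(blob))                # 16
print(from_float32_blob(blob))  # [1.0, 0.5, -2.0, 0.0]
```

Checking the length against the index dimension before sending the query catches a mismatched vector on the client side.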

### Query tuning parameters
#### HNSW
* `EFRUNTIME` - maximum number of potential top-k candidates to collect while querying the graph. `EFRUNTIME` should be greater than or equal to `k`.

An example of a top-10 query over an HNSW-indexed dataset with `EFRUNTIME` set to 150:

```
FT.SEARCH my_hnsw_index "@my_vector_field:[$vec TOPK 10] => {$EFRUNTIME:150}" RETURN 1 my_vector_field_score SORTBY my_vector_field_score LIMIT 0 10 PARAMS 2 vec <vector blob>
```
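
The query string and its surrounding arguments can also be assembled programmatically, which makes it easy to enforce the `EFRUNTIME >= k` guideline before the command is sent. This is a sketch following the command format above; `build_topk_args` is a hypothetical helper, and the blob is a placeholder.

```python
def build_topk_args(index_name, field, k, blob, ef_runtime=None, param_name="vec"):
    """Return the FT.SEARCH argument list for a top-k vector query."""
    if k <= 0:
        raise ValueError("k must be positive")
    query = f"@{field}:[${param_name} TOPK {k}]"
    if ef_runtime is not None:
        if ef_runtime < k:
            raise ValueError("EFRUNTIME should be greater than or equal to k")
        query += f" => {{$EFRUNTIME:{ef_runtime}}}"
    score = f"{field}_score"
    return ["FT.SEARCH", index_name, query,
            "RETURN", 1, score, "SORTBY", score,
            "LIMIT", 0, k, "PARAMS", 2, param_name, blob]

args = build_topk_args("my_hnsw_index", "my_vector_field", 10,
                       b"<vector blob>", ef_runtime=150)
print(args[2])  # @my_vector_field:[$vec TOPK 10] => {$EFRUNTIME:150}
```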

## Python examples

### Packages

```
pip install git+https://github.com/RediSearch/redisearch-py.git@params
pip install numpy
```

```python
import numpy as np
from redis import Redis
import redisearch
```

Create the Redis client:

```python
host = "localhost"
port = 6379

redis_conn = Redis(host=host, port=port)
```

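Define the parameters used by the examples:

```python
n_vec = 10000              # number of vectors to index
dim = 128                  # vector dimension
M = 40                     # HNSW: maximum outbound connections
EF = 200                   # HNSW: candidates considered while building the graph
vector_field_name = "vector"
k = 10                     # number of results for the top-k query
hnsw_EFRUNTIME = 10        # HNSW: candidates collected at query time (>= k)
```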

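```python
def load_vectors(client: Redis, n, d, field_name):
    """Store n random float32 vectors of dimension d, one hash per vector."""
    for i in range(n):
        np_vector = np.random.rand(1, d).astype(np.float32)
        client.hset(i, field_name, np_vector.tobytes())


def delete_data(client: Redis):
    """Remove everything from the server. FLUSHALL deletes all keys in all databases."""
    client.flushall()
```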

### Brute Force

```python
# Build the index
bf_index = redisearch.Client("my_flat_index", conn=redis_conn)
bf_index.redis.execute_command(
    "FT.CREATE", "my_flat_index", "SCHEMA", vector_field_name,
    "VECTOR", "FLOAT32", dim, "L2", "BF", "INITIAL_CAP", n_vec)

# Load vectors
load_vectors(bf_index.redis, n_vec, dim, vector_field_name)

# Query
score_field = f'{vector_field_name}_score'
query_vector = np.random.rand(1, dim).astype(np.float32)
q = (redisearch.Query(f'@{vector_field_name}:[$vec_param TOPK {k}]')
     .sort_by(score_field)
     .paging(0, k)
     .return_field(score_field))
res = bf_index.search(q, query_params={'vec_param': query_vector.tobytes()})
docs = [int(doc.id) for doc in res.docs]
# Read the score by field name instead of hard-coding `vector_score`
rs_dists = [float(getattr(doc, score_field)) for doc in res.docs]
print(docs)
print(rs_dists)

# Cleanup
delete_data(bf_index.redis)
```

Output:

```
[6278, 8574, 5487, 1069, 4578, 3150, 2870, 9628, 5718, 6263]
[14.2357501984, 14.7951850891, 15.0644397736, 15.0954351425, 15.1015148163, 15.1202888489, 15.1720476151, 15.2158946991, 15.23898983, 15.4573745728]
```


### HNSW

```python
# Build the index
hnsw_index = redisearch.Client("my_hnsw_index", conn=redis_conn)
hnsw_index.redis.execute_command(
    "FT.CREATE", "my_hnsw_index", "SCHEMA", vector_field_name,
    "VECTOR", "FLOAT32", dim, "L2", "HNSW",
    "INITIAL_CAP", n_vec, "M", M, "EF", EF)

# Load vectors
load_vectors(hnsw_index.redis, n_vec, dim, vector_field_name)

# Query
score_field = f'{vector_field_name}_score'
query_vector = np.random.rand(1, dim).astype(np.float32)
q = (redisearch.Query(
        f'@{vector_field_name}:[$vec_param TOPK {k}] => {{$EFRUNTIME:{hnsw_EFRUNTIME}}}')
     .sort_by(score_field)
     .paging(0, k)
     .return_field(score_field))
res = hnsw_index.search(q, query_params={'vec_param': query_vector.tobytes()})
docs = [int(doc.id) for doc in res.docs]
# Read the score by field name instead of hard-coding `vector_score`
rs_dists = [float(getattr(doc, score_field)) for doc in res.docs]
print(docs)
print(rs_dists)

# Cleanup
delete_data(hnsw_index.redis)
```

Output:

```
[9477, 7006, 7913, 9990, 6810, 6923, 1047, 6816, 3644, 164]
[13.624666214, 13.6373023987, 14.4420623779, 14.5344333649, 14.5973854065, 14.7105751038, 14.8035182953, 14.9341831207, 15.0296001434, 15.0314159393]
```
