# Similarity Search with Redis
### Redis as a Vector Database

with Brian Sam-Bodden

## The "Unstructured Data" Problem

- The **balance** of data has changed radically...
- **~80%** of the data generated by organizations is **Unstructured**<sup>(IDC report, 2020)</sup>
- This percentage is estimated to keep growing <sup>(with CAGR of 36.5% between 2020 and 2025)</sup>




## But what is "Unstructured" Data?

- Data that does not conform to a **pre-defined** data model
- Data that cannot be easily **"indexed"** by a search engine
- Data is typically **high-dimensional** and **semantically** rich
- Examples include **images**, **videos**, **free-form text**, and **audio**


![data pyramid](./images/data-balance.png)

## Dealing with Unstructured Data

- Unstructured data must be **transformed**
- To deal with the **high-dimensional** nature we extract **"features"**
- Traditional extraction techniques included **labelling**, **tagging**, and **1-hot encoding** 
- The extracted features are commonly encoded as **vectors** 
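
As a rough illustration of one of these traditional techniques, here is a minimal sketch of one-hot encoding. The category list is hypothetical and not taken from the demo dataset:

```python
# Hypothetical bike categories, invented for this illustration.
CATEGORIES = ['mountain', 'road', 'commuter', 'kids']

def one_hot(category, categories=CATEGORIES):
    """Return a vector with a 1 at the category's position and 0 elsewhere."""
    if category not in categories:
        raise ValueError(f'unknown category: {category!r}')
    return [1 if c == category else 0 for c in categories]

road_vector = one_hot('road')
print(road_vector)  # [0, 1, 0, 0]
```

Each category adds one dimension, which is part of why hand-built feature vectors can grow large quickly.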


## Manual Image Feature Extraction

![manual image feature extraction](./images/image-manual-feature-extraction.png)

## Manual Text Feature Extraction

![manual text feature extraction](./images/text-manual-feature-extraction.png)

## Vectors

- They are a **Numeric representation** of something in **N-dimensional** space
- Can represent **anything**... entire documents, images, video, audio 
- Quantifies **features** or **characteristics** of the item
- More importantly... they are **comparable**

## Vectors

- A Vector is a tuple of one or more **values** called **scalars**
- Each **scalar** represents the measure of a **feature**
- Different frameworks use different data types to represent them:
  - In **NumPy** they are **NumPy arrays** (`np.ndarray`)
  - In **TensorFlow** they are **Tensors** (`tf.Tensor`)
  - In **PyTorch** they are also **Tensors** (`torch.Tensor`)

## 3 "Bicycle Reviews" Features as a Vector

![representation of a vector](./images/bicycle_vector.png)

## 🧨 Issues with Feature Engineering

- **Time-consuming**: Might require domain knowledge and expertise.
- **High dimensionality**: Can lead to a high-dimensional feature space.
- **Lack of scalability**: Not easily scalable, more data **==** more people.

## Enter "Vector Embeddings"

- **Machine Learning** / **Deep Learning** have leaped forward in the last decade
- ML models **outperform** humans in many tasks nowadays
  - 🔥 **CV** (Computer Vision) models excel at detection/classification
  - 🔥 **LLMs** (Large Language Models) have advanced exponentially
- Today, most vectors are **generated** using pre-trained **ML Models**

## Enter "Vector Embeddings"

- ML models can **extract contextual meaning** from unstructured data
- Reduce semantically-rich high-dimensional inputs and **"flatten"** them 
- The flattened representations retain the semantic information and make for ideal vectors
- Once in vector form, the world of **linear algebra** lets us operate on them

## Vector Embeddings from a CV Model

![vector embedding extraction](./images/embedding-extraction.png)

## Enter "Vector Databases"

- Pure Vector Databases **efficiently store** Vectors (along with **metadata**)
- Enable **searching** for vectors using **"similarity"** and **"distance"** metrics
- Enable **hybrid searches** combining vectors and metadata

## Redis as a Vector Database

- Redis provides **Search Capabilities** for structured/semi-structured data
- Redis supports `TEXT`, `NUMERIC`, `TAG`, `GEO` and `GEOSHAPE` fields
- Redis introduces the **`VECTOR`** schema field type for vector support 
- The **`VECTOR`** field allows **indexing** and **querying** vectors stored in **Hashes** or **JSON**
- Redis' **in-memory** approach provides **fast** and **efficient** vector searches





## Redis as a Vector Database

- Capabilities:
  - **3** distance metrics: **Euclidean**, **Inner Product** and **Cosine**
  - **2** indexing methods: **HNSW** and **Flat**
  - **Hybrid queries** combined with `GEO`, `TAG`, `TEXT` or `NUMERIC`
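
To make the three metrics concrete, here is a small pure-Python sketch of their textbook definitions. The sample vectors are made up, and the exact score Redis reports for each metric may be scaled or offset differently, so treat this as an illustration of the math rather than of Redis internals:

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product of the two vectors after normalizing their lengths."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

# Made-up 2-D vectors: c points the same way as a, b is perpendicular to a.
a = [1.0, 0.0]
b = [0.0, 1.0]
c = [2.0, 0.0]

print(euclidean_distance(a, c))  # 1.0
print(inner_product(a, c))       # 2.0
print(cosine_similarity(a, c))   # 1.0
print(cosine_similarity(a, b))   # 0.0
```

Note that cosine similarity ignores vector length: `a` and `c` differ in magnitude but still score `1.0`.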

## 🛠️ Demo
### Adding Similarity Search to the **Redis Bike Company**

![bikeshop](./images/bike_shop.png)

## Connecting to Redis Stack

* **Redis Stack** instance running locally
* Import `redis-py` client library
* Create a **client connection**

In [None]:
import redis
client = redis.Redis(host = 'localhost', port=6379, decode_responses=True)

* Use the `PING` command to check that Redis is up and running:

In [None]:
client.ping()

## Inspect the Bikes

* Use the `JSON.GET` command to retrieve the bike with key `redisbikeco:bike:rbc00067`:

In [None]:
bike067 = client.json().get('redisbikeco:bike:rbc00067')
bike067

## Generating Embeddings with ML

![ML Models for embeddings](./images/target-model-embeddings-redis.png)

## Where to find pre-trained models?

![Model Zoos](./images/model-zoos.png)

## Sentence Transformers

![SBERT](./images/sbert-net.png)

- We use **SentenceTransformers** to **generate embeddings** for the bikes' **descriptions**
- **Sentence-BERT** (**SBERT**) produces **contextually rich** sentence embeddings
- Embeddings provide **efficient sentence-level** semantic similarity
- Improves tasks like **semantic search** and **text grouping**

## Selecting a suitable pre-trained Model

- We must pick a **suitable model** for **generating embeddings**
- We want to query for bicycles using **short queries** against the **longer** bicycle **descriptions**
- This is referred to as **"Asymmetric Semantic Search"** 
- Used when the **search query** and the **documents** being searched **differ in nature or structure**

## Selecting a suitable pre-trained Model

- For **asymmetric semantic search** suitable models include pre-trained **MS MARCO** Models
- Optimized for understanding **real-world queries** and producing **relevant responses**
- **Highest performing** MS MARCO model is **`msmarco-distilbert-base-v4`**
  - which is tuned for **cosine-similarity** 

In [None]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('msmarco-distilbert-base-v4')

## Extract the Bike's Description

- Let's extract the `description` into the `sample_description` var:

In [None]:
sample_description = bike067['description']
sample_description

## Generating an Embedding Vector

- To generate the vector embeddings, we use the `encode` function:

In [None]:
embedding = embedder.encode(sample_description)
VECTOR_DIMENSION = len(embedding)
VECTOR_DIMENSION

- Let's take a peek at the first **5** elements of the generated vector:

In [None]:
print(embedding.tolist()[:5])

## Generate Embeddings for the Bikes' Description

* To vectorize all the descriptions in the database, we will first collect all the Redis keys for the bikes:



In [None]:
keys = sorted(client.keys('redisbikeco:bike:*'))
len(keys)

In [None]:
print(keys[:3])

## Generate Embeddings for the Bikes' Description

* With the keys in `keys` we can use the Redis `JSON.MGET` command to retrieve just the `description` field
* We'll store all the descriptions in the `descriptions` variable
* The `encode` method can take a List of text passages to encode

In [None]:
import numpy as np

descriptions = client.json().mget(keys, '$.description')
descriptions = [item for sublist in descriptions for item in sublist]
embeddings = embedder.encode(descriptions).astype(np.float32).tolist()
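
The flattening comprehension in the cell above is needed because `JSON.MGET` with a JSONPath returns one list of matches per key. A minimal sketch of that step, using made-up descriptions in place of real Redis output:

```python
# Made-up stand-in for the shape returned by JSON.MGET with a JSONPath:
# one list of matches per key.
mget_result = [
    ['A light road bike for long rides.'],
    ['A sturdy mountain bike for rough trails.'],
]

# Same flattening comprehension as in the cell above.
flat = [item for sublist in mget_result for item in sublist]
print(flat)
```

The result is a flat list of strings, which is the input shape `encode` expects for batch encoding.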

* Let's check that we've generated the correct number of embedding vectors:

In [None]:
len(embeddings)

## Add the embeddings to the JSON documents

- Now we can add the vectorized descriptions to the JSON documents in Redis
- Use the `JSON.SET` command to insert a new field in each of the documents at `$.description_embeddings`
- Use Redis' **pipeline** mode to minimize the round-trip times:

In [None]:
pipeline = client.pipeline()

for key, embedding in zip(keys, embeddings):
    pipeline.json().set(key, '$.description_embeddings', embedding)

pipeline.execute()
print('Vector Embeddings Saved!')

## Inspect the Bikes' Documents

- Let's inspect one of the vectorized bike documents using the `JSON.GET` command:

In [None]:
import json

print(json.dumps(client.json().get('redisbikeco:bike:rbc00001'), indent=2)) 

## Create Search Index for the Bikes Collection

- To define the index we'll import `IndexDefinition` and `IndexType`
- To define the schema fields we'll use the classes `TagField`, `TextField`, `NumericField`, and **`VectorField`**
- We'll create an index named **`idx:bikes_vss`**

In [None]:
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.field import TagField, TextField, NumericField, VectorField
from redis.commands.search.query import Query

INDEX_NAME = 'idx:bikes_vss'
DOC_PREFIX = 'redisbikeco:bike:'

## The Search Index Schema

In [None]:
try:
    client.ft(INDEX_NAME).info()
    print('Index already exists!')
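except redis.exceptions.ResponseError:
    # FT.INFO fails when the index does not exist yet, so we create it here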
    schema = (
        TextField('$.model', no_stem=True, as_name='model'),  
        TextField('$.brand', no_stem=True, as_name='brand'),
        NumericField('$.price', as_name='price'),
        TagField('$.type', as_name='type'),
        TextField('$.description', as_name='description'),
        VectorField('$.description_embeddings', 'FLAT', {
          'TYPE': 'FLOAT32',
          'DIM': VECTOR_DIMENSION,
          'DISTANCE_METRIC': 'COSINE',
        },  as_name='vector'),
    )

    # index Definition
    definition = IndexDefinition(prefix=[DOC_PREFIX], index_type=IndexType.JSON)

    # create Index
    client.ft(INDEX_NAME).create_index(fields=schema, definition=definition)

## `VECTOR` Schema Field Definition

* **Indexing method**: `FLAT` **(brute-force indexing)** or `HNSW` **(Hierarchical Navigable Small World)**
* **Vector Type**: `FLOAT32` or `FLOAT64`.
* **Vector Dimension**: The length or dimension of our embeddings (`768`).
* **Distance Metric**: `L2` **(Euclidean distance)**, `IP` **(Inner Product)**, or `COSINE` **(Cosine distance)**
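
The vector type and dimension together fix the size of a vector in binary form: `FLOAT32` uses 4 bytes per element, so a 768-dimensional vector takes 3072 bytes. The sketch below shows this with the standard library's `struct` module and placeholder values. On little-endian machines the result should match what `np.array(..., dtype=np.float32).tobytes()` produces, though that equivalence is an assumption worth checking on your platform:

```python
import struct

DIM = 768  # dimension of the embeddings used in this demo

# Placeholder values; a real vector would come from the embedding model.
vector = [0.0] * DIM

# Pack as little-endian 32-bit floats: 4 bytes per element.
blob = struct.pack(f'<{DIM}f', *vector)
print(len(blob))  # 3072
```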

## Check the state of the Index

- `FT.CREATE` creates the index
- The **indexing process** is automatically started in the **background**
- In the blink of an eye, our JSON documents are indexed and ready to be searched
- To confirm that, we use the **`FT.INFO`** command:

In [None]:
info = client.ft(INDEX_NAME).info()

num_docs = info['num_docs']
indexing_failures = info['hash_indexing_failures']
total_indexing_time = info['total_indexing_time']
percent_indexed = float(info['percent_indexed']) * 100


print(f"{num_docs} docs ({percent_indexed}%) indexed w/ {indexing_failures} failures in {float(total_indexing_time):.2f} msecs")

## Structured Data Searches with Redis

- Let's test the non-vector part of the index first:

- Retrieve all bikes where the `brand` is `Peaknetic`

In [None]:
query = (
    Query('@brand:Peaknetic').return_fields('id', 'brand', 'model', 'price')
)
client.ft(INDEX_NAME).search(query).docs

- Find all `Peaknetic` bikes priced at or below `10000`

In [None]:
query = (
    Query('@brand:Peaknetic @price:[0 10000]').return_fields('id', 'brand', 'model', 'price')
)
client.ft(INDEX_NAME).search(query).docs

## Semantic Queries

- We want to query for bikes using short query prompts
- Let's put our queries in a list so we can vectorize them and execute them in bulk:

In [None]:
queries = [
    'Bike for small kids',
    'Best Mountain bikes for kids',
    'Cheap Mountain bike for kids',
    'Female specific mountain bike',
    'Road bike for beginners',
    'Commuter bike for people over 60',
    'Comfortable commuter bike',
    'Good bike for college students',
    'Mountain bike for beginners',
    'Vintage bike',
    'Comfortable city bike'
]

In [None]:
encoded_queries = embedder.encode(queries)
len(encoded_queries)

## Visualizing Embeddings

- The image below was generated using **t-distributed stochastic neighbor embedding** (**t-SNE**) and a small subset of the embeddings
- **t-SNE** is a dimensionality reduction technique that maps the higher-dimensional embeddings to a 2-D or 3-D space

![TSNE Visualization](./images/embeddings-tsne.png)

## Constructing a "Pure KNN" VSS Query

- We'll start with a **K-nearest neighbors** (KNN) query 
- The goal of KNN is to find the **most similar** items to a given query item
- KNN calculates the **distance** between the query vector and each vector in the database
- It returns the **K** items with the **smallest** distances
- These are considered to be the most similar items
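
The steps above can be sketched as a brute-force search in plain Python. This is only an illustration of the idea, not how Redis implements it. The document ids and 2-D vectors are made up:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn(query, vectors, k):
    """Score every stored vector against the query and keep the k nearest."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical 2-D vectors keyed by made-up document ids.
vectors = {
    'bike:1': [1.0, 0.0],
    'bike:2': [0.0, 1.0],
    'bike:3': [1.0, 1.0],
}

nearest = knn([1.0, 0.1], vectors, k=2)
print([doc_id for doc_id, _ in nearest])  # ['bike:1', 'bike:3']
```

Because every stored vector is scored, the cost grows linearly with the size of the collection.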

## Constructing a "Pure KNN" VSS Query

In [None]:
query = (
    Query('(*)=>[KNN 3 @vector $query_vector AS vector_score]')
     .sort_by('vector_score')
     .return_fields('vector_score', 'id', 'brand', 'model', 'description')
     .dialect(2)
)

- The syntax for KNN queries is `(*)=>[<vector_similarity_query>]`
  - where `(*)` (the `*` meaning all) is the filter query for the search engine.
  - `$query_vector` is the query parameter we'll use to pass the vectorized query prompt.
  - Results are sorted by `vector_score`
  - Query returns the `vector_score`, the `id` of the matched documents, the `$.brand`, `$.model`, and `$.description`
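
To see how the pieces of the query string fit together, here is a small hypothetical helper (`knn_query_string` is not part of `redis-py`) that assembles it from its parts:

```python
def knn_query_string(k, prefilter='*', field='vector',
                     param='query_vector', score_alias='vector_score'):
    """Build '(<filter>)=>[KNN <k> @<field> $<param> AS <alias>]'."""
    return f'({prefilter})=>[KNN {k} @{field} ${param} AS {score_alias}]'

print(knn_query_string(3))
# (*)=>[KNN 3 @vector $query_vector AS vector_score]

print(knn_query_string(3, prefilter='@brand:Peaknetic'))
# (@brand:Peaknetic)=>[KNN 3 @vector $query_vector AS vector_score]
```

Only the part inside the parentheses changes when a filter is applied; the KNN clause stays the same.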

## 🩼 Pretty-printing Query Results

- We want to run the queries in bulk 
- Visualize the results in a nice table
- We've added a utility function `create_query_table`

In [None]:
import pandas as pd
from IPython.display import display, HTML

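def create_query_table(query, queries, encoded_queries, extra_params=None):
    # Avoid a mutable default argument; fall back to an empty dict
    extra_params = extra_params or {}
    results_list = []
    for i, encoded_query in enumerate(encoded_queries):
        # The query vector must be passed to Redis as raw FLOAT32 bytes
        query_params = {'query_vector': np.array(encoded_query, dtype=np.float32).tobytes()} | extra_params
        result_docs = client.ft(INDEX_NAME).search(query, query_params).docs
        for doc in result_docs:
            # Redis returns the cosine distance; convert it to a similarity score:
            # cosine similarity = 1 - cosine distance
            vector_score = round(1 - float(doc.vector_score), 2)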
            results_list.append({
                'query': queries[i], 
                'score': vector_score, 
                'id': doc.id,
                'brand': doc.brand,
                'model': doc.model,
                'description': doc.description
            })

    # Pretty-print the table
    queries_table = pd.DataFrame(results_list)
    queries_table.sort_values(by=['query', 'score'], ascending=[True, False], inplace=True)
    queries_table['query'] = queries_table.groupby('query')['query'].transform(lambda x: [x.iloc[0]] + ['']*(len(x)-1))
    queries_table['description'] = queries_table['description'].apply(lambda x: (x[:497] + '...') if len(x) > 500 else x)
    html = queries_table.to_html(index=False, classes='striped_table')  
    display(HTML(html))

## 🏃🏾‍♀️Running the Query

- With the Query prepared in `query`
- and the query prompts in `queries` 
- and the encoded queries in `encoded_queries`
- we can use the `create_query_table` function to generate a table of results:

## 🏃🏾‍♀️Running the Query

In [None]:
create_query_table(query, queries, encoded_queries)

## Hybrid Queries

- "Pure KNN" queries evaluate a query against the **whole space of vectors**
- The larger the collection, the more **computationally expensive**
- Unstructured data does not live in isolation
- Rich search experiences must allow searching all data (structured and unstructured) 
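
A minimal sketch of the idea, assuming a metadata filter is applied first and the nearest-neighbor ranking then runs only on what is left. The documents, the second brand name, and the 2-D vectors are made up, and this is not a description of how Redis executes such queries internally:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical documents combining structured metadata and a vector.
bikes = [
    {'id': 'bike:1', 'brand': 'Peaknetic', 'vector': [1.0, 0.0]},
    {'id': 'bike:2', 'brand': 'OtherBrand', 'vector': [0.9, 0.1]},
    {'id': 'bike:3', 'brand': 'Peaknetic', 'vector': [0.0, 1.0]},
]

def hybrid_knn(query, docs, brand, k):
    """Filter on structured metadata first, then rank the survivors by distance."""
    candidates = [d for d in docs if d['brand'] == brand]
    candidates.sort(key=lambda d: cosine_distance(query, d['vector']))
    return [d['id'] for d in candidates[:k]]

result = hybrid_knn([1.0, 0.0], bikes, 'Peaknetic', k=2)
print(result)  # ['bike:1', 'bike:3']
```

Here `bike:2` is close to the query but never scored, because the brand filter removes it before any distance is computed.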

## Hybrid Queries

- For example, users might arrive at your search interface with a brand preference in mind
- Redis can use this information to pre-filter the search space
- In the hybrid query definition below:
  - we pre-filter using the `brand` to consider only `Peaknetic` brand bikes 
  - previously, our primary filter query was `(*)`, i.e. everything
  - we narrow the search space using `(@brand:Peaknetic)` before the KNN query

In [None]:
hybrid_query = (
    Query('(@brand:Peaknetic)=>[KNN 3 @vector $query_vector AS vector_score]')
     .sort_by('vector_score')
     .return_fields('vector_score', 'id', 'brand', 'model', 'description')
     .dialect(2)
)

## 🏃🏾‍♀️Running the Query

In [None]:
create_query_table(hybrid_query, queries, encoded_queries)

## Range Queries

- Range queries retrieve items within a specific **distance** from a query vector
- We consider **"distance"** to be the **measure of similarity** 
- The smaller the distance, the more similar the items
- For example, to return the top `4` bikes within a `0.55` radius of the query vector:

```
FT.SEARCH idx:bikes_vss
  @vector:[VECTOR_RANGE $range $query_vector]=>{$YIELD_DISTANCE_AS: vector_score}
  SORTBY vector_score ASC
  LIMIT 0 4
  DIALECT 2
  PARAMS 4 range 0.55 query_vector "\x9d|\x99>bV#\xbfm\x86\x8a\xbd\xa7~$?*...."
```
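
Conceptually, the command above keeps only the vectors whose distance to the query is within the radius, sorts them by distance, and returns the first few. A plain-Python sketch of that logic follows. The ids and 2-D vectors are made up, and treating the radius as an inclusive bound is an assumption of this sketch:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def range_search(query, vectors, radius, limit):
    """Keep vectors within `radius` of the query, nearest first, up to `limit`."""
    hits = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in vectors.items()]
    hits = [hit for hit in hits if hit[1] <= radius]
    hits.sort(key=lambda hit: hit[1])
    return hits[:limit]

# Hypothetical 2-D vectors keyed by made-up document ids.
vectors = {
    'bike:1': [1.0, 0.0],
    'bike:2': [0.0, 1.0],
    'bike:3': [1.0, 1.0],
}

hits = range_search([1.0, 0.0], vectors, radius=0.55, limit=4)
print([doc_id for doc_id, _ in hits])  # ['bike:1', 'bike:3']
```

Unlike KNN, the number of results is not fixed: it depends on how many vectors fall inside the radius, capped by the limit.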

## Range Queries

- In Python:

In [None]:
range_query = (
    Query('@vector:[VECTOR_RANGE $range $query_vector]=>{$YIELD_DISTANCE_AS: vector_score}') 
    .sort_by('vector_score')
    .return_fields('vector_score', 'id', 'brand', 'model', 'description')
    .paging(0, 4)
    .dialect(2)
)

## 🏃🏾‍♀️Running the Query

In [None]:
create_query_table(range_query, queries[:1], encoded_queries[:1], {'range': 0.55})

## Visualizing High-dimensional vectors with dimensionality reduction

In [None]:
%%html
<iframe src="https://projector.tensorflow.org/" width="1920" height="540"></iframe>

## Recap

- The tools and techniques to unlock the value in **Unstructured Data** have evolved greatly...
- Redis' **in-memory first** approach makes it a strong fit for vector similarity searches
- Redis natively supports vector searches over **Hashes** and **JSON**
- Redis combines the power of searching over semi-structured and unstructured data
  - with the performance you've come to expect from Redis 



## https://github.com/redis-developer/redis-bike-co

## Learn more at Redis University

## `https://university.redis.com`

![Redis U](./images/redis_university.png)

## Thank You!

![Simon and BSB](./images/simon_and_bsb.png)