<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/Redis-Workshops/blob/main/02-Vector_Similarity_Search/02.01_RedisVL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Search with RedisVL

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)

![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pre-trained `sentence-transformers/all-MiniLM-L6-v2` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database using the Redis Vector Library ([RedisVL](https://www.redisvl.com/)) for Python.



In [1]:
# Install Redis client and Hugging Face sentence transformers
!pip install -q sentence_transformers redisvl


### Install Redis Stack locally

This cell installs Redis Stack for local use, in case you cannot or do not want to create a Redis Cloud database. It also installs `redis-cli`, which is used later to check database connectivity.

In [2]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install -y redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes


deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


### Setup Redis connection
***Note - Replace Redis connection values if using a Redis Cloud instance!***

In [3]:
import os
import redis
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
# Replace values above with your own if using Redis Cloud instance
# REDIS_HOST="redis-12110.c82.us-east-1-2.ec2.cloud.redislabs.com"
# REDIS_PORT=12110
# REDIS_PASSWORD="pobhBJP7Psicp2gV0iqa2ZOc1WdXXXXX"

# Shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"

# Test Redis connection
!redis-cli $REDIS_CONN PING

PONG
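
The connection cell builds `REDIS_URL` by string interpolation and notes that TLS endpoints need the `rediss://` prefix. A minimal sketch of a helper that covers both cases and percent-encodes the password, since characters such as `@` or `:` in a password would otherwise make the URL ambiguous. `build_redis_url` and `use_tls` are hypothetical names for illustration, not part of redis-py or RedisVL, and the host name is a placeholder.

```python
from urllib.parse import quote


def build_redis_url(host, port, password="", use_tls=False):
    """Return a redis:// or rediss:// URL (hypothetical helper, not a library API)."""
    scheme = "rediss" if use_tls else "redis"
    # Percent-encode the password so characters such as '@' or ':' stay unambiguous.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


# Local Redis Stack without a password.
local_url = build_redis_url("localhost", 6379)
# Placeholder cloud endpoint with TLS and a password containing reserved characters.
cloud_url = build_redis_url("example.invalid", 12110, password="p@ss:word", use_tls=True)
print(local_url)
print(cloud_url)
```

In the notebook, the result would take the place of the hand-built `REDIS_URL` string; whether TLS is needed depends on how the endpoint is configured.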


### Initialize Embedding Model
Import a [RedisVL vectorizer](https://www.redisvl.com/user_guide/vectorizers_04.html#vectorizers) to generate text embeddings. For this lab, we are using the Hugging Face Text Vectorizer and selecting the `sentence-transformers/all-MiniLM-L6-v2` [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [4]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer

import pandas as pd
import numpy as np
from tqdm.auto import tqdm

tqdm.pandas()
# Create a vectorizer
# Choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-MiniLM-L6-v2")

20:17:54 numexpr.utils INFO   NumExpr defaulting to 2 threads.
20:18:16 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: cuda
20:18:16 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [5]:
# Embed a sentence
test = hf.embed("This is a test sentence.")

# Uncomment to see vector embedding output
# test[:10]


### Embedding generation model

This lab uses the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model from Hugging Face, which produces 384-dimensional embeddings.

***Note - the following block is not needed for this lab, as we already specified our embedding model above with RedisVL.***

In [6]:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# test=model.encode("This is a test sentence.")
# test[:]

### Download data to vectorize and store in Redis
Download a CSV file with 12k+ labelled tweets.

In [7]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2024-10-04 20:18:24--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv’


2024-10-04 20:18:24 (84.4 MB/s) - ‘Labelled_Tweets.csv’ saved [2486081/2486081]



***Note - make sure to uncomment the `df=df.head(100)` line below if you are using the Free 30MB Redis Cloud database!***

In [8]:
df = pd.read_csv('Labelled_Tweets.csv').drop(columns=['created_at','score'])
# df=df.head(100) #trim dataframe to fit results into 30MB Redis database
df


| | id | full_text |
|---|---|---|
| 0 | 1 | @KennyDegu very very little volume. With $10T ... |
| 1 | 2 | #ES_F achieved Target 2780 closing above 50% #... |
| 2 | 3 | RT @KimbleCharting: Silver/Gold indicator crea... |
| 3 | 4 | @Issaquahfunds Hedged our $MSFT position into ... |
| 4 | 5 | RT @zipillinois: 3 Surprisingly Controversial ... |
| ... | ... | ... |
| 12415 | 12587 | RT @PeterLBrandt: $SPX $ES_F \r\nFollowing thi... |
| 12416 | 12588 | RT @vieiraUAE: Fearless Alex Vieira Calls Best... |
| 12417 | 12589 | $spy $spx $qqq $ndx #nyse going from poking th... |
| 12418 | 12590 | RT @DavidScottAdams: On watch tomorrow // Pt. ... |


### Generate Embeddings

Generate a vector embedding for every tweet and store it in a new dataframe column as raw `float32` bytes. This step can take 2-3 minutes on a GPU runtime for all 12k records.

In [9]:
def texts_to_embeddings(texts):
  return [np.array(embedding, dtype=np.float32).tobytes() for embedding in hf.embed_many(texts)]

# Generate vector embeddings
df["text_embedding"] = texts_to_embeddings(df["full_text"].tolist())
df.head()


Unnamed: 0,id,full_text,text_embedding
0,1,@KennyDegu very very little volume. With $10T ...,b')\x92\x81\xbd\x1fh\x8b\xbdl\xdf\xe4\xbc\xc2\...
1,2,#ES_F achieved Target 2780 closing above 50% #...,b']\x1b\x02\xbd\x16~/\xbd\x7fz\xb1\xbcP\x99\xd...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...,b'\x0e\xaa\xa3\xbdl}\x10\xbd\xc4\xe8\xb9= \x08...
3,4,@Issaquahfunds Hedged our $MSFT position into ...,"b""\xde\x7f\xd1\xbc\xc5\n`\xbd>9 =\xe6\xc0\xef=..."
4,5,RT @zipillinois: 3 Surprisingly Controversial ...,b'\xd8\r\x1e\xbd:\\\xd4\xbcf/\xa1\xbc\xe7q7=\x...
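
`texts_to_embeddings` stores each embedding as raw `float32` bytes, which is why the `text_embedding` column shows byte strings. The sketch below reproduces that round trip with only the standard library's `struct` module, so it runs without NumPy. It assumes a little-endian layout, which is what `tobytes()` on a `float32` array yields on typical x86/ARM hardware. `to_float32_bytes` and `from_float32_bytes` are hypothetical helper names, and the vector values are made up.

```python
import struct

DIMS = 384  # embedding size used in this notebook's index schema


def to_float32_bytes(vector):
    """Pack a sequence of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob):
    """Unpack little-endian float32 bytes back into a list of Python floats."""
    count = len(blob) // 4  # 4 bytes per float32 value
    return list(struct.unpack(f"<{count}f", blob))


# Made-up vector; the first three values are exactly representable in float32.
vector = [0.5, -0.25, 0.125] + [0.0] * (DIMS - 3)
blob = to_float32_bytes(vector)
restored = from_float32_bytes(blob)

print(len(blob))      # 384 values * 4 bytes each
print(restored[:3])
```

Each stored embedding therefore occupies 1536 bytes, which helps when estimating whether the full dataset fits into a small database.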


### Define RediSearch Schema
[Define the RediSearch schema](https://redis.io/docs/latest/develop/interact/search-and-query/basic-constructs/schema-definition/) using the [RedisVL syntax](https://redis.io/docs/latest/integrate/redisvl/api/schema/) with the following details:
- Key Type: `Hash`
- Key Prefix: `tweet:`
- Indexed Fields:
  - `full_text`: Full Text
  - `text_embedding`: Vector

In [10]:
schema = {
    "index": {
        "name": "tweet:idx",
        "prefix": "tweet:",
        "storage_type": "hash", # default setting -- HASH
    },
    "fields": [
        {   "name": "full_text",
            "type": "text",
            "attrs": {
                "no_stem": False,
                "sortable": False
            }
        },
        {
            "name": "text_embedding",
            "type": "vector",
            "attrs": {
                "dims": 384,
                "distance_metric": "cosine",
                "algorithm": "HNSW",
                "datatype": "float32"
            }

        }
    ],
}
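
The `dims` and `datatype` attributes have to agree with the byte strings produced in the embedding step, otherwise the vectors may not be indexed as expected. A minimal sketch of a sanity check, assuming the schema layout shown above; `expected_vector_bytes` and the `BYTES_PER_ELEMENT` mapping are hypothetical and not part of RedisVL.

```python
# Bytes per element for two common vector datatypes (mapping assumed for illustration).
BYTES_PER_ELEMENT = {"float32": 4, "float64": 8}


def expected_vector_bytes(schema, field_name):
    """Return the byte length a stored vector should have for a schema field."""
    for field in schema["fields"]:
        if field["name"] == field_name and field["type"] == "vector":
            attrs = field["attrs"]
            return attrs["dims"] * BYTES_PER_ELEMENT[attrs["datatype"]]
    raise KeyError(f"no vector field named {field_name!r}")


# Trimmed copy of the vector field definition so the snippet runs on its own.
example_schema = {
    "fields": [
        {
            "name": "text_embedding",
            "type": "vector",
            "attrs": {"dims": 384, "datatype": "float32"},
        },
    ]
}

size = expected_vector_bytes(example_schema, "text_embedding")
print(size)
```

In the notebook, comparing this value with `len(df["text_embedding"].iloc[0])` before loading is one way to catch a mismatch between the model and the schema early.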

### Create index and load data to Redis

In [11]:
# Clear Redis database (optional)
# redis.Redis.from_url(REDIS_URL).flushdb()

from redisvl.index import SearchIndex

index = SearchIndex.from_dict(schema)

index.connect(REDIS_URL)

index.create(overwrite=True)

keys = index.load(df.to_dict('records'))


In [12]:
# Check how the data is stored in Redis
# !redis-cli $REDIS_CONN keys "*"
# !redis-cli $REDIS_CONN hgetall "tweet::c4b63c9494194d4ea65b7e265bcd2d5f"

## Query the database

[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)


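Try queries like “Oil”, “Oil Reserves”, and “Fossil fuels”.

Lexical full-text search quickly runs out of matches because it only finds tweets that contain the query terms.

Vector search continues to discover relevant tweets because it compares meaning rather than exact wording.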

In [13]:
user_query="oil price"
# Queries to try: "oil reserve", "fossil fuels"

In [14]:
# Using Full Text Index
from redisvl.query import FilterQuery
from redisvl.query.filter import Text

# Exact match filter -- document must contain the exact phrase in user_query
text_filter = Text("full_text") == user_query

filter_query = FilterQuery(
    return_fields=["full_text"],
    filter_expression=text_filter
)

results = index.query(filter_query)

pd.DataFrame(results)

| | id | full_text |
|---|---|---|
| 0 | tweet::3d5f8f54bd42426b9b30248407113874 | Today's book recommendation goes for the winne... |
| 1 | tweet::87ff3ee918d047529932ae4103490643 | RT @SeekingAlpha: $XOM - Despite The Oil Price... |
| 2 | tweet::7f4cb487111a460899cbfbe2906a3072 | RT @SeekingAlpha: $XOM - Despite The Oil Price... |
| 3 | tweet::8a2c28e91ee749d38469c6606a519eff | RT @SeekingAlpha: $XOM - Despite The Oil Price... |
| 4 | tweet::ad854e5471be4b43935cd7914c6e2d9e | RT @SeekingAlpha: $XOM - Despite The Oil Price... |
| 5 | tweet::7a26056c19144cb5b1d6e7e0aef9f626 | RT @CharlesSizemore: Investors believe that Sa... |
| 6 | tweet::8cc1095c79fa4b658a5ff47a0f93e33a | RT @SeekingAlpha: $EURN - Euronav's CEO On Sur... |
| 7 | tweet::bb56c55e01114606b273e4c9a90640fe | $EURN - Euronav's CEO On Surging Rates Amid Th... |
| 8 | tweet::342625de42bb4916948d243934bb991b | Investors believe that Saudi Arabia is winning... |
| 9 | tweet::12f0bb1c0ea9471db999c58cd0f6c814 | Despite The Oil Price Carnage Exxon Will Grow ... |


In [15]:
# Using Vector Index
from redisvl.query import VectorQuery
# from jupyterutils import result_print

query = VectorQuery(
    vector=hf.embed(user_query),
    vector_field_name="text_embedding",
    return_fields=["full_text", "vector_distance"],
    num_results=10
)
results = index.query(query)
pd.DataFrame(results)


| | id | vector_distance | full_text |
|---|---|---|---|
| 0 | tweet::2f46ec2dd3d3418f990132a83464e4bd | 0.369450867176 | Would you spend $2 more a gallon of gasoline i... |
| 1 | tweet::2c4171ad9b384aefb792a2bd42643130 | 0.371095478535 | RT @tradingcrudeoil: Crude oil closed up $0.48... |
| 2 | tweet::e71b573460c84db595d5d5e19a0a7727 | 0.381934165955 | ..and oil still 25.74 LMAO >>>NO DEM... |
| 3 | tweet::0b34e6c5b7a14eca95c4d3756d0a0a0b | 0.396132171154 | Bad news for #oil. It’s going to between $10 ... |
| 4 | tweet::45b98d3024a54e2c9f0030ab9ef86ebc | 0.409308493137 | Oil erases gains for the day in fall to $25 ht... |
| 5 | tweet::1adf57958eb140eb8810ee5a6320d2f2 | 0.42981672287 | Do higher oil prices help the consumer and sma... |
| 6 | tweet::24994d0195ab44f9ae45d02976f072ed | 0.430081188679 | The price of Texas intermediate oil (WTI) slum... |
| 7 | tweet::8a4923a3f74d4c48b67814023b35692e | 0.431391477585 | #OIL Sentiment ($22.50)\r\n\r\nWhat’s next for... |
| 8 | tweet::f48564756cd04ba697f61df67932e721 | 0.441061675549 | RT @Benzinga: Oil Prices Rise, Fall As Russia,... |
| 9 | tweet::7a6161f140dd4ffba08c89393086ac60 | 0.441061675549 | RT @Benzinga: Oil Prices Rise, Fall As Russia,... |
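
The `vector_distance` values follow from the `cosine` metric set in the schema: the score is a cosine distance, 1 minus the cosine similarity, so 0 means the vectors point in the same direction and smaller values mean closer matches. The sketch below computes that distance by brute force in plain Python over toy 2-D vectors. It is only an illustration of the metric, not how Redis evaluates the query (the HNSW index performs an approximate search); `cosine_distance`, `top_k`, and the vectors are made up for this example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, docs, k):
    """Rank documents by ascending cosine distance to the query (exact, brute force)."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-D vectors for illustration; real embeddings in this notebook have 384 dimensions.
docs = {
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
    "same_direction": [2.0, 0.0],
}

ranking = top_k([1.0, 0.0], docs, k=3)
for doc_id, distance in ranking:
    print(doc_id, distance)
```

Vector length does not affect the score, only direction: `same_direction` ranks first even though it is twice as long as the query.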


#### Cleanup
(Optional) Delete all keys and indexes from the database.

In [16]:
###-- FLUSHDB will wipe out the entire database!!! Use with caution --###
# !redis-cli $REDIS_CONN flushdb