<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/Redis-Workshops/blob/main/02-Vector_Similarity_Search/02.02_Redis_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search with Redis

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)

![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pre-trained `sentence-transformers/all-MiniLM-L6-v2` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database.

In [11]:
#install Redis client and Hugging Face sentence transformers
!pip install -q redis sentence_transformers

Install Redis Stack locally. The commands below use `apt-get`, so they target Debian/Ubuntu environments such as Colab.

In [12]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes


sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required
sh: line 2: lsb_release: command not found
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required


Starting redis-stack-server, database path /opt/homebrew/var/db/redis-stack


### Connect to the Redis server

In [13]:
import os

REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#Replace values above with your own if using Redis Cloud instance
#REDIS_HOST="redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
#REDIS_PORT=18374
#REDIS_PASSWORD="1TNxTEdYRDgIDKM2gDfasupCADXXXX"

#shortcut for redis-cli $REDIS_CONN command
# If SSL is enabled on the endpoint add --tls
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
INDEX_NAME = "tweet:idx"  # same index name the queries below use

# Test Redis connection
!redis-cli $REDIS_CONN PING

Unrecognized option or bad number of args for: '-h localhost -p 6379'
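
The `Unrecognized option` output above is consistent with the shell handing the whole `$REDIS_CONN` string to `redis-cli` as a single argument (zsh, for example, does not split unquoted variables into words). One way around this is to build the argument list in Python. The sketch below is only an illustration: `redis_cli_args` and `redis_url` are hypothetical helper names, not part of the notebook or of redis-py, and the `tls` flag simply mirrors the `--tls` and `rediss://` hints in the comments of the cell above.

```python
import shlex
from urllib.parse import quote


def redis_cli_args(host, port, password="", tls=False):
    """Return redis-cli connection flags as a list, one element per argument."""
    args = ["-h", str(host), "-p", str(port)]
    if password:
        args += ["-a", password, "--no-auth-warning"]
    if tls:
        args.append("--tls")
    return args


def redis_url(host, port, password="", tls=False):
    """Build a redis:// URL, or rediss:// when TLS is enabled on the endpoint."""
    scheme = "rediss" if tls else "redis"
    # Percent-encode the password so characters like '@' do not break the URL.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


args = redis_cli_args("localhost", 6379)
# shlex.join shows the command as the shell would need to see it.
print(shlex.join(["redis-cli", *args, "PING"]))
print(redis_url("localhost", 6379))
```

Passing the list to `subprocess.run(["redis-cli", *args, "PING"])` avoids shell word splitting entirely, because each flag reaches `redis-cli` as its own argument.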


In [14]:
import redis
redis = redis.Redis(
  host=REDIS_HOST,
  port=REDIS_PORT,
  password=REDIS_PASSWORD)
redis.ping()

True

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from redis import Redis
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


tqdm.pandas()



### Embedding generation model

This notebook uses `sentence-transformers/all-MiniLM-L6-v2` from Hugging Face (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which maps each text to a 384-dimensional vector.



In [16]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Download 12k+ tweets

In [17]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2025-06-02 20:31:51--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv’


2025-06-02 20:31:52 (13.3 MB/s) - ‘Labelled_Tweets.csv’ saved [2486081/2486081]



In [18]:
df = pd.read_csv('Labelled_Tweets.csv').drop(columns=['created_at','score'])
#df=df.head(3000) #trim dataframe to fit results into 30MB Redis database
df


Unnamed: 0,id,full_text
0,1,@KennyDegu very very little volume. With $10T ...
1,2,#ES_F achieved Target 2780 closing above 50% #...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...
3,4,@Issaquahfunds Hedged our $MSFT position into ...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...
...,...,...
12415,12587,RT @PeterLBrandt: $SPX $ES_F \r\nFollowing thi...
12416,12588,RT @vieiraUAE: Fearless Alex Vieira Calls Best...
12417,12589,$spy $spx $qqq $ndx #nyse going from poking th...
12418,12590,RT @DavidScottAdams: On watch tomorrow // Pt. ...


### Generate Embeddings

Generate vector embeddings for every row of the dataframe. For all 12k records on a GPU runtime, the unoptimized version (one text at a time) can take 2-3 minutes, while the optimized version (batched encoding) takes about 10 seconds.

In [19]:
# Unoptimized version, processing one embedding at a time
#def text_to_embedding(text):
#  return model.encode(text).astype(np.float32).tobytes()

# Optimized version: encode all texts in batches
def texts_to_embeddings(texts):
  return [np.array(embedding, dtype=np.float32).tobytes() for embedding in model.encode(texts, show_progress_bar=True)]


#generate vector embeddings
#df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df["text_embedding"] = texts_to_embeddings(df["full_text"].tolist())
df

Batches: 100%|██████████| 389/389 [00:41<00:00,  9.44it/s]


Unnamed: 0,id,full_text,text_embedding
0,1,@KennyDegu very very little volume. With $10T ...,b'\'\x92\x81\xbd\x16h\x8b\xbdi\xdf\xe4\xbc\xd0...
1,2,#ES_F achieved Target 2780 closing above 50% #...,b']\x1b\x02\xbd\x1c~/\xbd|z\xb1\xbcD\x99\xd1\x...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...,b'\x10\xaa\xa3\xbdg}\x10\xbd\xc6\xe8\xb9=%\x08...
3,4,@Issaquahfunds Hedged our $MSFT position into ...,b'\xd8\x7f\xd1\xbc\xc3\n`\xbd79 =\xe2\xc0\xef=...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...,b'\xcc\r\x1e\xbdH\\\xd4\xbcL/\xa1\xbc\xe4q7=\x...
...,...,...,...
12415,12587,RT @PeterLBrandt: $SPX $ES_F \r\nFollowing thi...,b'\xa9x\x87\xbb\x03Y\x802\xf8\x01*\xbd\xf6\x9e...
12416,12588,RT @vieiraUAE: Fearless Alex Vieira Calls Best...,b'K\xeb><\xaf\xf7\x97=\xdb\xe3\x12\xbd\x83\xac...
12417,12589,$spy $spx $qqq $ndx #nyse going from poking th...,b'\x19\xef\xa1\xbd!\xf8\x8e\xbdU_\xa6=\x1a\x0c...
12418,12590,RT @DavidScottAdams: On watch tomorrow // Pt. ...,b'\xb4\xba\xb1\xbd_-)=\xc7\xb5}\xbc\x19\x1eK=?...
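
The `text_embedding` column above holds raw bytes rather than lists of numbers. As a rough illustration of that layout, the sketch below packs and unpacks 32-bit floats with the standard library only. It assumes little-endian byte order, which is what `np.float32(...).tobytes()` produces on common x86 and ARM machines; `floats_to_blob` and `blob_to_floats` are hypothetical names, and the sample vector is made up.

```python
import struct

DIM = 384  # output size of all-MiniLM-L6-v2


def floats_to_blob(values):
    """Pack floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(values)}f", *values)


def blob_to_floats(blob):
    """Inverse of floats_to_blob: 4 bytes per float32."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Made-up vector; the chosen values are exactly representable in float32.
vector = [0.5, -0.25, 1.0] + [0.0] * (DIM - 3)
blob = floats_to_blob(vector)

print(len(blob))  # 384 floats * 4 bytes each
print(blob_to_floats(blob)[:3])
```

With NumPy available, `np.frombuffer(blob, dtype=np.float32)` should give the same values back, which is handy for inspecting a vector read from a Redis hash.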


### Create Helper Functions

- Save dataframe to Redis HASH
- Create RediSearch Index

In [None]:
def load_dataframe(redis, df, key_prefix="tweet", id_column="id", pipe_size=100):
    records = df.to_dict(orient="records")
    pipe = redis.pipeline(transaction=False)
    total = len(records)
    
    print(f"Loading {total} records into Redis...")
    
    for i, record in enumerate(records):
        key = f"{key_prefix}:{record[id_column]}"
        pipe.hset(key, mapping=record)
        
        if (i + 1) % pipe_size == 0:
            pipe.execute()
            if (i+1) % (pipe_size*10) == 0:  # Print less frequently
                print(f"Processed {i+1}/{total} records ({(i+1)/total*100:.1f}%)")
    
    # Execute any remaining commands
    pipe.execute()
    print(f"Loaded {total} records into Redis")
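
The next cell calls `create_redis_index`, but its definition does not appear among the helper functions above. The following is a minimal sketch of what such a helper might look like, not the original implementation. Every parameter here is an assumption: the index name `tweet:idx` (the name the later queries use), the key prefix `tweet:` (the `key_prefix` default of `load_dataframe`), an HNSW vector field with `COSINE` distance, and `DIM` 384 to match the embedding model. It sends the raw `FT.CREATE` command through `execute_command` so the schema is visible in one place; the field classes imported earlier could likely express the same schema through `create_index`.

```python
def ft_create_args(index_name="tweet:idx", prefix="tweet:", dim=384):
    """Arguments for FT.CREATE: one full-text field plus one vector field."""
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", prefix,
        "SCHEMA",
        "full_text", "TEXT",
        # "6" is the number of attribute tokens that follow for the vector field.
        "text_embedding", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]


def create_redis_index(client, index_name="tweet:idx"):
    """Drop the index if it exists, then create it again (assumed behavior)."""
    try:
        # Without the DD option, FT.DROPINDEX keeps the indexed hashes.
        client.execute_command("FT.DROPINDEX", index_name)
    except Exception:  # redis-py raises ResponseError when the index is missing
        print("no index found")
    client.execute_command(*ft_create_args(index_name))
```

Here `client` is expected to be a connected `redis.Redis` instance, so the call `create_redis_index(redis)` in the next cell would work with this definition.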

### Create index and load data to Redis

In [21]:
# clear Redis database (optional)
redis.flushdb()

# create Index
create_redis_index(redis)

# load data from Dataframe to Redis HASH
load_dataframe(redis,df,key_prefix="tweet", pipe_size=100)


no index found


ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

In [None]:
#Check how the data is stored in Redis
!redis-cli $REDIS_CONN hgetall "tweet:1001"

1) "full_text"
2) "1900 PIPS Profit so far\xf0\x9f\x94\xa5\xf0\x9f\x94\xa5\r\n\r\nFor free signals, Join https://t.co/eRcU1NYwV1\r\n\r\n#bitcoin #Forex #amzn #fx $BTC #MT5 #MT4 #xagusd #GBPAUD #ForexAnalysis #nflx #ForexGroup #gbpjpy #forexsignals #stocks #METATRADER #usoil #forexnews 68016 https://t.co/WUP4laPFlT"
3) "id"
4) "1001"
5) "text_embedding"
6) "G\x1e\x03:\xc9b\xbd\xbc5\xb0\x94\xbc\x1cK)< U\x94;\x19>\x11\xbd\xae3\r=\x03\x98\x7f=\x9eX\x88<tUR\xbd\xc8\xf6h\xbcH\xb5\xd2\xbb\x0e;\t\xbd\xc1m#<\xdf\xd4C=\x98\xc8\"\xbd\x1f\xa3-\xbd1^\t\xbdb\x00\xfd\xbc\xb2GE\xbd@\xadZ\xbc]\xae=\xbd;<\x99<\xf7b\xce\xbc\xff_\xab<\"DT\xbcpM\xd7\xbc\x8e\xf0\x82\xbb\x8d\xb0(=5v\x8a\xbc\xc7\xa8M\xbd],P=&S\x86<\xa4j\xfa<>?\xb5<\x93[\xee\xbb\xcd\x8d\x90=\bW8=Q\x81\x82\xbd\xab\xfb&<\xf7-:=S\xe6\xe4\xbd\xfd\x1d\x11\xbc\xce\xb7\xac\xbd>Z<\xbbp\xc9\xcc\xbb\xd39b\xbcF\xe3\x90=*\x9f\xa7=\xf57\x0f=x\xbb\xbb\xbbS\xad\x17=\xb6\xaeE\xbd\xdc\x8e\x90=v\xd7\x02\xbd\xd8\x01\x9f\xbc\xe1\xc3e;\n\xa5\xd8\xbcg{f<\x9ax\xaf\x

## Query the database

[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)


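Try queries like “Oil”, “Oil Reserves”, or “Fossil fuels”.

Lexical full-text search quickly runs out of matches because it needs the query terms to appear in the tweet.

Vector search continues to discover relevant tweets because it compares meaning rather than exact words.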

In [None]:
user_query="oil price"
# queries to try "oil reserve", "fossil fuels"

In [None]:
#using Full Text Index
q = Query(user_query)\
  .return_fields("full_text")
res = redis.ft("tweet:idx").search(q)
if res.total==0:
  print("No matches found")
else:
  res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
  display(res_df)

Unnamed: 0,id,full_text
0,tweet:3220,The relative performance of TIPS has historica...
1,tweet:311,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
2,tweet:1490,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
3,tweet:1585,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
4,tweet:1610,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
5,tweet:7189,Do higher oil prices help the consumer and sma...
6,tweet:636,Told you Saudi Arabia will bend the knee @jimc...
7,tweet:5405,https://t.co/3IJBXa5wuf Historic oil price plu...
8,tweet:5406,Historic oil price plunge trashes sector's pro...
9,tweet:3865,Today's book recommendation goes for the winne...


In [None]:
# Using Vector Similarity Index
#query_vector=text_to_embedding(user_query)
query_vector=texts_to_embeddings([user_query])[0]
q = Query("*=>[KNN 10 @text_embedding $vector AS result_score]")\
                .return_fields("result_score","full_text")\
                .dialect(2)\
                .sort_by("result_score", True)
res = redis.ft("tweet:idx").search(q, query_params={"vector": query_vector})
#print(res)
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Unnamed: 0,id,result_score,full_text
0,tweet:444,0.369450807571,Would you spend $2 more a gallon of gasoline i...
1,tweet:11529,0.371095538139,RT @tradingcrudeoil: Crude oil closed up $0.48...
2,tweet:5654,0.381934285164,..and oil still 25.74 LMAO &gt;&gt;&gt;NO DEM...
3,tweet:204,0.396132171154,Bad news for #oil. It’s going to between $10 ...
4,tweet:9189,0.409308671951,Oil erases gains for the day in fall to $25 ht...
5,tweet:7189,0.42981672287,Do higher oil prices help the consumer and sma...
6,tweet:9330,0.430081129074,The price of Texas intermediate oil (WTI) slum...
7,tweet:531,0.431391477585,#OIL Sentiment ($22.50)\r\n\r\nWhat’s next for...
8,tweet:1490,0.441061556339,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
9,tweet:1585,0.441061556339,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
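
The `result_score` values above increase down the table, which fits a distance where smaller means closer. Assuming the index uses the `COSINE` metric (the index definition is not shown here, so this is an assumption), the score would be the cosine distance, that is, 1 minus the cosine similarity. The sketch below shows the exact brute-force version of that ranking on made-up 2-D vectors; an HNSW index returns an approximation of the same result. The names `cosine_distance` and `knn` and the `doc:*` keys are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query, docs, k):
    """Return the k (key, distance) pairs closest to query, nearest first."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy data, not taken from the tweet dataset.
docs = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.0, 1.0],
    "doc:c": [1.0, 1.0],
}

for key, distance in knn([1.0, 0.1], docs, k=2):
    print(key, round(distance, 3))
```

Sorting ascending by distance plays the same role here as `.sort_by("result_score", True)` in the query above.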
