<a href="https://colab.research.google.com/github/antonum/Redis-Workshops/blob/main/02-Vector_Similarity_Search/02-Redis_VSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search with Redis

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlitapp.com/)

[GitHub repo](https://github.com/antonum/Redis-VSS-Streamlit)

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pretrained `sentence-transformers/all-MiniLM-L6-v2` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database.

In [16]:
#install Redis client and Hugging Face sentence transformers
!pip install -q redis sentence_transformers

In [17]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg 
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list 
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes 


deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb focal main
Starting redis-stack-server, database path /var/lib/redis-stack


gpg: cannot open '/dev/tty': No such device or address
(23) Failed writing body


In [18]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
from redis import Redis
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


tqdm.pandas()

#load pre-trained model from HuggingFace
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Download 12k+ tweets

In [19]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2023-05-25 20:23:07--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv.1’


2023-05-25 20:23:07 (36.2 MB/s) - ‘Labelled_Tweets.csv.1’ saved [2486081/2486081]



In [20]:
df = pd.read_csv('Labelled_Tweets.csv').drop(columns=['created_at','score'])
df=df.head(3000) #trim dataframe to fit results into 30MB Redis database
df


Unnamed: 0,id,full_text
0,1,@KennyDegu very very little volume. With $10T ...
1,2,#ES_F achieved Target 2780 closing above 50% #...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...
3,4,@Issaquahfunds Hedged our $MSFT position into ...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...
...,...,...
2995,3039,Next trgts: 2746.50 and 2727.50\r\n\r\n$ES_F #...
2996,3040,bring this down!!!! $SPX
2997,3041,RT @DKellerCMT: When is the bear market rally ...
2998,3042,It’s all fun &amp; games until the big boys st...


Generate vector embeddings within the dataframe

In [21]:
def text_to_embedding(text):
  return model.encode(text).astype(np.float32).tobytes()

#generate vector embeddings
df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df.head()

  0%|          | 0/3000 [00:00<?, ?it/s]

Unnamed: 0,id,full_text,text_embedding
0,1,@KennyDegu very very little volume. With $10T ...,b')\x92\x81\xbd\x12h\x8b\xbd}\xdf\xe4\xbc\xbf\...
1,2,#ES_F achieved Target 2780 closing above 50% #...,b'S\x1b\x02\xbd\x16~/\xbd\x9bz\xb1\xbc`\x99\xd...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...,b'\x0f\xaa\xa3\xbdc}\x10\xbd\xcc\xe8\xb9=(\x08...
3,4,@Issaquahfunds Hedged our $MSFT position into ...,b'\xc1\x7f\xd1\xbc\xbc\n`\xbd79 =\xe4\xc0\xef=...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...,b'\xca\r\x1e\xbdV\\\xd4\xbcL/\xa1\xbc\xe8q7=\x...
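
The `text_embedding` column holds raw bytes rather than arrays, because that is the form the Redis vector field stores. The sketch below illustrates the same conversion as `text_to_embedding()` and how to reverse it. It uses a random vector as a stand-in for `model.encode(text)` (an assumption made so the example runs without downloading the model); the names `DIM`, `vector`, `blob`, and `restored` are illustrative only.

```python
import numpy as np

DIM = 384  # vector size of all-MiniLM-L6-v2, the same value used for "DIM" in the index

# Stand-in for model.encode(text): a random vector, not a real embedding.
rng = np.random.default_rng(0)
vector = rng.standard_normal(DIM)  # float64 by default

# Same conversion as text_to_embedding(): cast to float32, then dump the raw bytes.
blob = vector.astype(np.float32).tobytes()
print(len(blob))  # 384 values * 4 bytes each

# The blob carries no metadata, so the dtype must be supplied when reading it back.
restored = np.frombuffer(blob, dtype=np.float32)
print(restored.shape)
print(np.array_equal(restored, vector.astype(np.float32)))
```

If the dtype or dimension used here differs from the `TYPE` and `DIM` declared for the vector field, the byte length no longer matches what the index expects, so it is worth checking `len(blob)` before loading.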


In [22]:
#df.to_dict(orient="records")

Helper functions to:
- save the dataframe to Redis hashes
- create the RediSearch index

In [23]:
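def df_to_redis_hash(redis,df,key="tweet",pipesize=100):
  # Field names must match the index schema ("full_text", "text_embedding"),
  # otherwise the hashes are stored but never indexed.
  pipe = redis.pipeline(transaction=False)
  for i in tqdm(range(len(df["id"]))):
    keyname = "{}:{}".format(key,df["id"][i])
    tweethash = {
      "full_text": df["full_text"][i],
      "text_embedding": df["text_embedding"][i],
    }
    pipe.hset(keyname, mapping=tweethash)
    # send a batch once pipesize commands are queued
    if ((i + 1) % pipesize == 0):
      pipe.execute()
      pipe = redis.pipeline(transaction=False)
  # send whatever is left in the last, partial batch
  pipe.execute()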

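def load_dataframeX(redis, df, key_prefix="tweet", id_column="id", pipe_size=100):
  # Unbatched variant: queue every row, then send everything in one round trip.
  # pipe_size is unused here and kept only for signature compatibility.
  pipe = redis.pipeline(transaction=False)
  for row in tqdm(df.to_dict(orient="records")):
    pipe.hset(f"{key_prefix}:{row[id_column]}", mapping=row)

  res=pipe.execute()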

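def load_dataframe(redis, df, key_prefix="tweet", id_column="id", pipe_size=100):
    # one dict per row: {"id": ..., "full_text": ..., "text_embedding": ...}
    records = df.to_dict(orient="records")
    pipe = redis.pipeline()
    # wrap the list (not the enumerate object) so tqdm knows the total
    for i, record in enumerate(tqdm(records)):
        key = f"{key_prefix}:{record[id_column]}"
        pipe.hset(key, mapping=record)
        # send a batch every pipe_size rows
        if (i+1) % pipe_size == 0:
            pipe.execute()
    # send the remaining rows
    pipe.execute()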

def create_redis_index(redis, idxname="tweet:idx"):
  # drop a previous index with the same name, if there is one
  try:
    redis.ft(idxname).dropindex()
  except Exception:
    print("no index found")

  # Create an index over all hashes whose key starts with "tweet:"
  indexDefinition = IndexDefinition(
      prefix=["tweet:"],
      index_type=IndexType.HASH,
  )

  redis.ft(idxname).create_index(
      (
          TextField("full_text", no_stem=False, sortable=False),
          # DIM must equal the embedding size of the model (384 for all-MiniLM-L6-v2)
          VectorField("text_embedding", "HNSW", {  "TYPE": "FLOAT32", 
                                                    "DIM": 384, 
                                                    "DISTANCE_METRIC": "COSINE",
                                                  })
      ),
      definition=indexDefinition
  )
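
The batching logic in `load_dataframe` can be checked without a Redis server. The sketch below is not part of the original notebook: `RecordingPipeline` is a hypothetical stand-in that only counts how many commands each `execute()` call would send, and `load_records` repeats the loop from `load_dataframe` on plain dictionaries.

```python
class RecordingPipeline:
    """Hypothetical stand-in for a Redis pipeline: buffers commands, records flush sizes."""

    def __init__(self):
        self.buffer = []
        self.flush_sizes = []

    def hset(self, key, mapping):
        # queue the command instead of sending it
        self.buffer.append((key, mapping))

    def execute(self):
        # record how many queued commands this call would send, then reset
        self.flush_sizes.append(len(self.buffer))
        self.buffer = []


def load_records(pipe, records, key_prefix="tweet", id_column="id", pipe_size=100):
    # same loop shape as load_dataframe, minus pandas and tqdm
    for i, record in enumerate(records):
        pipe.hset(f"{key_prefix}:{record[id_column]}", mapping=record)
        if (i + 1) % pipe_size == 0:
            pipe.execute()
    pipe.execute()


# 250 made-up rows, so the last batch is a partial one
records = [{"id": n, "full_text": f"tweet {n}"} for n in range(1, 251)]
pipe = RecordingPipeline()
load_records(pipe, records, pipe_size=100)
print(pipe.flush_sizes)  # [100, 100, 50]
```

The final `execute()` outside the loop is what sends the last partial batch; when the row count is an exact multiple of `pipe_size`, that call sends nothing.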



In [24]:
!redis-cli hgetall "tweet:9293"

(empty array)


In [25]:
import os
from redis import Redis
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = int(os.getenv("REDIS_PORT", "6379"))
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#Replace values above with your own if using Redis Cloud instance
#REDIS_HOST="redis-12110.c82.us-east-1-2.ec2.cloud.redislabs.com"
#REDIS_PORT=12110
#REDIS_PASSWORD="pobhBJP7Psicp2gV0iqa2ZOc1WdXXXXX"

#shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"
# "redis" names the client instance; the helper functions and queries use it
redis = Redis(
  host=REDIS_HOST,
  port=REDIS_PORT,
  password=REDIS_PASSWORD)
redis.ping()
# clear Redis database (optional)
redis.flushdb()

# create Index
create_redis_index(redis)

# load data from Dataframe to Redis HASH
#df_to_redis_hash(redis,df,key="tweet", pipesize=100)
load_dataframe(redis,df,key_prefix="tweet", pipe_size=100)


no index found


0it [00:00, ?it/s]

## Query the database

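[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlitapp.com/)


Try queries such as “Oil”, “Oil Reserves”, and “Fossil fuels”:

- Lexical full-text search quickly runs out of matches, because it only returns tweets that contain the query terms (or their stems).
- Vector search keeps finding relevant tweets, because it ranks them by how close their embeddings are to the query embedding.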
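
To see why lexical matching runs dry, here is a deliberately simplified sketch. It is not how RediSearch is implemented: it ignores stemming, stop words, and scoring, and only requires every query term to appear as a token. The tweets are made up for illustration and are not taken from the dataset; `tokenize` and `lexical_search` are hypothetical helper names.

```python
import re

# Made-up example tweets, not taken from Labelled_Tweets.csv.
tweets = [
    "Oil prices rise as OPEC agrees to cut production",
    "Crude futures slump after inventory report",
    "Gasoline demand falls while oil storage fills up",
    "Energy stocks rally on pipeline news",
]

def tokenize(text):
    # lowercase and split into alphabetic tokens
    return set(re.findall(r"[a-z]+", text.lower()))

def lexical_search(query, docs):
    # a document matches only if it contains every query term
    terms = tokenize(query)
    return [doc for doc in docs if terms <= tokenize(doc)]

for query in ("oil", "oil reserves", "fossil fuels"):
    print(query, "->", len(lexical_search(query, tweets)), "matches")
```

All four tweets are about energy markets, yet only the literal word "oil" produces matches here. A vector index sidesteps this because it compares embeddings rather than tokens.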

In [26]:
user_query="oil price"
# queries to try "oil reserve", "fossil fuels"

In [27]:
#using Full Text Index
q = Query(user_query)\
  .return_fields("full_text")
res = redis.ft("tweet:idx").search(q)
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Unnamed: 0,id,full_text
0,tweet:311,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
1,tweet:1490,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
2,tweet:1585,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
3,tweet:1610,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
4,tweet:636,Told you Saudi Arabia will bend the knee @jimc...
5,tweet:927,RT @SeekingAlpha: $XOM - Despite The Oil Price...
6,tweet:1364,RT @SeekingAlpha: $XOM - Despite The Oil Price...
7,tweet:1606,RT @SeekingAlpha: $XOM - Despite The Oil Price...
8,tweet:1639,RT @SeekingAlpha: $XOM - Despite The Oil Price...
9,tweet:1391,Despite The Oil Price Carnage Exxon Will Grow ...


In [30]:
#using Vector Similarity Index
query_vector=text_to_embedding(user_query)
# KNN 10: the ten nearest neighbours of $vector in the text_embedding field.
# result_score is the cosine distance to the query vector: lower means more similar.
q = Query("*=>[KNN 10 @text_embedding $vector AS result_score]")\
  .return_fields("result_score","full_text")\
  .dialect(2)\
  .sort_by("result_score", asc=True)
res = redis.ft("tweet:idx").search(q, query_params={"vector": query_vector})
print(res)
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Result{10 total, docs: [Document {'id': 'tweet:444', 'payload': None, 'result_score': '0.369450867176', 'full_text': 'Would you spend $2 more a gallon of gasoline if you knew it was Oil only from the USA? $wtic $brent $spx $qqq $djia $uso $opec'}, Document {'id': 'tweet:204', 'payload': None, 'result_score': '0.396132230759', 'full_text': 'Bad news for #oil.  It’s going to between $10 to $15 dollars a barrel. They would of needed to cut double that amount just to keep the price stable $XOM $SU $VLO\r\n\r\nOPEC and allies agree to historic 10 million barrel per day production cut from @CNBC  https://t.co/1c8U7rrx6Z'}, Document {'id': 'tweet:531', 'payload': None, 'result_score': '0.431391537189', 'full_text': '#OIL Sentiment ($22.50)\r\n\r\nWhat’s next for the OIL?\r\n\r\n$DIA #DJIA $SPY $AAPL $AMZN $FB $NFLX $GOOG $NVDA $TSLA $AVGO $IWM $SOXX $USO $GLD $XLF $ETH $XRP $LINK $BA $AAL $MGM $CCL $LTC $XOM $MRO $ET $OXY $BP $HAL $APA $DVN $COG $PX $JAG $TDOC #OPEC #OOTT #BTC #Bitcoin'}, Doc

Unnamed: 0,id,result_score,full_text
0,tweet:444,0.369450867176,Would you spend $2 more a gallon of gasoline i...
1,tweet:204,0.396132230759,Bad news for #oil. It’s going to between $10 ...
2,tweet:531,0.431391537189,#OIL Sentiment ($22.50)\r\n\r\nWhat’s next for...
3,tweet:1490,0.441061496735,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
4,tweet:1585,0.441061496735,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
5,tweet:178,0.441844820976,OH how bullish for #oil LOL\r\n\r\n#OOTT #Oi...
6,tweet:2443,0.450135827065,$SPY $SPX $USO OPEC deal is to cut only 10 mil...
7,tweet:1610,0.454426407814,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
8,tweet:1804,0.455194354057,Equities give back part of earlier gains. Oil ...
9,tweet:311,0.463826060295,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
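
The `result_score` column is a distance, so smaller values mean closer matches. As a rough illustration of what a KNN query with the `COSINE` metric computes, the sketch below ranks a few toy vectors by cosine distance (1 minus cosine similarity) using brute force. The three-dimensional vectors and the keys `tweet:a`, `tweet:b`, `tweet:c` are hypothetical. The HNSW index in the notebook is an approximate method, so treat this as the exact baseline it approximates, not as what Redis runs internally.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical direction, 1 for orthogonal vectors
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (hypothetical values; the real ones have 384 dimensions).
docs = {
    "tweet:a": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "tweet:b": np.array([1.0, 1.0, 0.0], dtype=np.float32),
    "tweet:c": np.array([0.0, 0.0, 1.0], dtype=np.float32),
}
query = np.array([1.0, 0.0, 0.0], dtype=np.float32)

# Brute-force KNN: score every document, sort ascending, keep the k best.
k = 2
ranked = sorted((cosine_distance(query, vec), key) for key, vec in docs.items())
for score, key in ranked[:k]:
    print(key, round(score, 3))
```

Here `tweet:a` points in the same direction as the query and comes first with distance 0, followed by `tweet:b`; `tweet:c` is orthogonal to the query and falls outside the top two.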


In [31]:
!redis-cli hgetall "tweet:9293"

(empty array)
