<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/Redis-Workshops/blob/main/02-Vector_Similarity_Search/02.01-RedisVL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Search with RedisVL

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)

![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pre-trained `sentence-transformers/all-MiniLM-L6-v2` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database.

In [1]:
#install Redis client and Hugging Face sentence transformers
!pip install -q sentence_transformers redisvl

Install Redis Stack locally

In [2]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install -y redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes


deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


### Connect to the Redis server

In [3]:
import os
import redis
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#Replace values above with your own if using Redis Cloud instance
#REDIS_HOST="redis-12110.c82.us-east-1-2.ec2.cloud.redislabs.com"
#REDIS_PORT=12110
#REDIS_PASSWORD="pobhBJP7Psicp2gV0iqa2ZOc1WdXXXXX"

#shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"

# Test Redis connection
!redis-cli $REDIS_CONN PING

PONG
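
The connection cell notes that SSL-enabled endpoints need the `rediss://` prefix. A minimal sketch of a helper that builds either form of the URL follows. `build_redis_url` is a hypothetical name, not part of redis-py or RedisVL, and percent-encoding the password is an added assumption for passwords that contain characters such as `@` or `:`.

```python
from urllib.parse import quote

def build_redis_url(host, port, password="", use_ssl=False):
    """Build a Redis connection URL (hypothetical helper, not a library API)."""
    scheme = "rediss" if use_ssl else "redis"
    # Percent-encode the password so characters like '@' or ':' do not break URL parsing
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"

print(build_redis_url("localhost", 6379))
print(build_redis_url("localhost", "6379", "p@ss:word", use_ssl=True))
```

The first call prints `redis://localhost:6379`; the second prints `rediss://:p%40ss%3Aword@localhost:6379`.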


In [4]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer

import pandas as pd
import numpy as np
from tqdm.auto import tqdm

tqdm.pandas()
# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-MiniLM-L6-v2")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [5]:
# embed a sentence
test = hf.embed("This is a test sentence.")
#test[:10]

### Embedding generation model

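This notebook uses `sentence-transformers/all-MiniLM-L6-v2` from Hugging Face (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). The model maps each text to a 384-dimensional vector, so the `dims` value in the index schema must be 384.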



In [6]:
#from sentence_transformers import SentenceTransformer
#model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
#test=model.encode("This is a test sentence.")
#test[:]

Download 12k+ tweets

In [7]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2024-06-20 17:40:08--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv’


2024-06-20 17:40:09 (240 MB/s) - ‘Labelled_Tweets.csv’ saved [2486081/2486081]



In [8]:
df = pd.read_csv('Labelled_Tweets.csv').drop(columns=['created_at','score'])
#df=df.head(100) #trim dataframe to fit results into 30MB Redis database
df


Unnamed: 0,id,full_text
0,1,@KennyDegu very very little volume. With $10T ...
1,2,#ES_F achieved Target 2780 closing above 50% #...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...
3,4,@Issaquahfunds Hedged our $MSFT position into ...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...
...,...,...
12415,12587,RT @PeterLBrandt: $SPX $ES_F \r\nFollowing thi...
12416,12588,RT @vieiraUAE: Fearless Alex Vieira Calls Best...
12417,12589,$spy $spx $qqq $ndx #nyse going from poking th...
12418,12590,RT @DavidScottAdams: On watch tomorrow // Pt. ...


### Generate Embeddings

Generate a vector embedding for each tweet and store it in a new dataframe column. On a GPU runtime this step can take 2-3 minutes for all 12k records.

In [9]:
def text_to_embedding(text):
  return np.array(hf.embed(text), dtype=np.float32).tobytes()

#generate vector embeddings
df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df.head()


Unnamed: 0,id,full_text,text_embedding
0,1,@KennyDegu very very little volume. With $10T ...,b'\'\x92\x81\xbd\rh\x8b\xbd|\xdf\xe4\xbc\xbb\x...
1,2,#ES_F achieved Target 2780 closing above 50% #...,b'O\x1b\x02\xbd\x15~/\xbd\x8ez\xb1\xbcY\x99\xd...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...,b'\x0e\xaa\xa3\xbdi}\x10\xbd\xc7\xe8\xb9=)\x08...
3,4,@Issaquahfunds Hedged our $MSFT position into ...,b'\xbd\x7f\xd1\xbc\xc1\n`\xbd59 =\xe1\xc0\xef=...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...,b'\xc7\r\x1e\xbdZ\\\xd4\xbcS/\xa1\xbc\xe7q7=\x...


### Define RediSearch Schema


In [10]:
schema = {
    "index": {
        "name": "tweet:idx",
        "prefix": "tweet:",
        "storage_type": "hash", # default setting -- HASH
    },
    "fields": [
        {   "name": "full_text",
            "type": "text",
            "attrs": {
                "no_stem": False,
                "sortable": False
            }
        },
        {
            "name": "text_embedding",
            "type": "vector",
            "attrs": {
                "dims": 384,
                "distance_metric": "cosine",
                "algorithm": "HNSW",
                "datatype": "float32"
            }

        }
    ],
}
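
In plain Redis terms, an index like this is created with the `FT.CREATE` command. The sketch below translates a schema dictionary of the same shape into an argument list for that command, to show what each setting maps to. `schema_to_ft_create` is a hypothetical helper written for this illustration. It handles only the options used here, and RedisVL may emit additional or differently ordered arguments.

```python
example_schema = {
    "index": {"name": "tweet:idx", "prefix": "tweet:", "storage_type": "hash"},
    "fields": [
        {"name": "full_text", "type": "text",
         "attrs": {"no_stem": False, "sortable": False}},
        {"name": "text_embedding", "type": "vector",
         "attrs": {"dims": 384, "distance_metric": "cosine",
                   "algorithm": "HNSW", "datatype": "float32"}},
    ],
}

def schema_to_ft_create(schema):
    """Translate the schema dict into FT.CREATE arguments (illustrative sketch only)."""
    idx = schema["index"]
    args = ["FT.CREATE", idx["name"], "ON", idx["storage_type"].upper(),
            "PREFIX", "1", idx["prefix"], "SCHEMA"]
    for field in schema["fields"]:
        attrs = field.get("attrs", {})
        if field["type"] == "text":
            args += [field["name"], "TEXT"]
            if attrs.get("no_stem"):
                args.append("NOSTEM")
            if attrs.get("sortable"):
                args.append("SORTABLE")
        elif field["type"] == "vector":
            # Vector fields take a count followed by name/value parameter pairs
            params = ["TYPE", attrs["datatype"].upper(),
                      "DIM", str(attrs["dims"]),
                      "DISTANCE_METRIC", attrs["distance_metric"].upper()]
            args += [field["name"], "VECTOR", attrs["algorithm"].upper(),
                     str(len(params))] + params
    return args

command = " ".join(schema_to_ft_create(example_schema))
print(command)
```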

### Create index and load data to Redis

In [11]:
# clear Redis database (optional)
#redis.Redis.from_url(REDIS_URL).flushdb()

from redisvl.index import SearchIndex

index = SearchIndex.from_dict(schema)

index.connect(REDIS_URL)

index.create(overwrite=True)

keys = index.load(df.to_dict('records'))


In [12]:
#Check how the data is stored in Redis
#!redis-cli $REDIS_CONN keys "*"
#!redis-cli $REDIS_CONN hgetall "tweet::c4b63c9494194d4ea65b7e265bcd2d5f"
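
Keys in this notebook have the form `tweet::<32 hex characters>`. The double colon is consistent with the prefix `tweet:` (which already ends in a colon) being joined to a generated ID with a `:` separator. The sketch below imitates that naming. It is an assumption inferred from the key shown above, not RedisVL's actual code, and `make_key` is a hypothetical helper.

```python
import uuid

def make_key(prefix, separator=":"):
    """Compose a key as <prefix><separator><random hex id> (assumed convention)."""
    return f"{prefix}{separator}{uuid.uuid4().hex}"

key = make_key("tweet:")
print(key)  # e.g. tweet::<32 hex characters>
```

Under the same assumption, a prefix of `tweet` (without the trailing colon) would produce single-colon keys.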

## Query the database

[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)


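Try queries like “Oil”, “Oil Reserves”, and “Fossil fuels”.

Lexical full-text search quickly runs out of matches, because it only returns tweets that contain the query terms.

Vector search continues to discover relevant tweets, because it compares meaning rather than exact words.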
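
To see why lexical search runs out of matches, here is a simplified stand-in for phrase matching, applied to a few truncated tweet snippets taken from this notebook's result tables. It is only an illustration: RediSearch's full-text engine also applies stemming and other processing, and `tokenize` and `contains_phrase` are hypothetical helpers.

```python
import re

# Snippets copied from the (truncated) result tables in this notebook
tweets = [
    "Despite The Oil Price Carnage Exxon Will Grow ...",
    "Oil erases gains for the day in fall to $25 ht...",
    "RT @tradingcrudeoil: Crude oil closed up $0.48...",
    "Would you spend $2 more a gallon of gasoline i...",
]

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text, phrase):
    """True if the tokens of `phrase` appear consecutively in `text`."""
    words, target = tokenize(text), tokenize(phrase)
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

matches = [t for t in tweets if contains_phrase(t, "oil price")]
print(matches)
```

Only the first snippet contains the phrase, even though all four are about oil or fuel prices. Those near misses are what vector search can still surface.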

In [13]:
user_query="oil price"
# queries to try "oil reserve", "fossil fuels"

In [14]:
# using Full Text Index
from redisvl.query import FilterQuery
from redisvl.query.filter import Text

# exact match filter -- document must contain the exact phrase in user_query
text_filter = Text("full_text") == user_query

filter_query = FilterQuery(
    return_fields=["full_text"],
    filter_expression=text_filter
)

results = index.query(filter_query)

pd.DataFrame(results)
#results

Unnamed: 0,id,full_text
0,tweet::b6f2ab2a71314a34807da705dc38d981,Today's book recommendation goes for the winne...
1,tweet::73d8b877704443f8af6d623fd535d3a0,RT @SeekingAlpha: $XOM - Despite The Oil Price...
2,tweet::4b16ad2121804e26841b401bff65312e,RT @SeekingAlpha: $XOM - Despite The Oil Price...
3,tweet::423bf0e8f2bf4c91b5f93fcf7b079b80,RT @SeekingAlpha: $XOM - Despite The Oil Price...
4,tweet::e7919524959b451b96b1a5f88f0ef610,RT @SeekingAlpha: $XOM - Despite The Oil Price...
5,tweet::b60765b19b17477290323b000decee5c,RT @CharlesSizemore: Investors believe that Sa...
6,tweet::9ff2a1f1f5ec46c4880dd6a567f1cfdf,RT @SeekingAlpha: $EURN - Euronav's CEO On Sur...
7,tweet::aff5fc910ea84a59b458a658ee7bffd5,$EURN - Euronav's CEO On Surging Rates Amid Th...
8,tweet::90f0cb99ccb34581bf9192803c1c9cd5,Investors believe that Saudi Arabia is winning...
9,tweet::135c25b5e294410abdf2f7c3cee0ae77,Despite The Oil Price Carnage Exxon Will Grow ...


In [15]:
from redisvl.query import VectorQuery
#from jupyterutils import result_print

query = VectorQuery(
    vector=hf.embed(user_query),
    vector_field_name="text_embedding",
    return_fields=["full_text", "vector_distance"],
    num_results=10
)
results = index.query(query)
pd.DataFrame(results)


Unnamed: 0,id,vector_distance,full_text
0,tweet::fd4d649a3d574e63989eb6addc461317,0.369450807571,Would you spend $2 more a gallon of gasoline i...
1,tweet::c7314f80b71b4032bc22ffc404f3ab99,0.371095478535,RT @tradingcrudeoil: Crude oil closed up $0.48...
2,tweet::a95834ea261647a780995a8360d79131,0.381934285164,..and oil still 25.74 LMAO &gt;&gt;&gt;NO DEM...
3,tweet::1ace71418673498aa6951bc0cb969346,0.396132230759,Bad news for #oil. It’s going to between $10 ...
4,tweet::085face38f4f41079d40fed57897e60f,0.409308552742,Oil erases gains for the day in fall to $25 ht...
5,tweet::a745945c0f1f488aad6235cc31484542,0.42981672287,Do higher oil prices help the consumer and sma...
6,tweet::19ba598b48d94b7b88c10361148a9bf2,0.430081129074,The price of Texas intermediate oil (WTI) slum...
7,tweet::3ec7ba270f2a4020b85f6d8c03ea29b0,0.431391477585,#OIL Sentiment ($22.50)\r\n\r\nWhat’s next for...
8,tweet::5a49ea3e6bc4417a91f252a33a9612f9,0.441844880581,OH how bullish for #oil LOL\r\n\r\n#OOTT #Oi...
9,tweet::908ef37d79ae4844bc1449de8ed95f80,0.442979931831,$DXY 99.55-0.57%&lt;==US Dollar lower #Fed $2....
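
With the `cosine` metric, `vector_distance` is the cosine distance, that is, 1 minus the cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction) and smaller values mean closer matches. The sketch below computes it by hand and runs an exhaustive top-k search over toy vectors. The vectors and helper names are made up for illustration, and the exhaustive scan is a swapped-in technique: the HNSW index used in this notebook is approximate and avoids scoring every document.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, docs, k=2):
    """Exhaustive nearest-neighbour search: score every document, keep the k closest."""
    scored = [(cosine_distance(query, vec), name) for name, vec in docs.items()]
    return sorted(scored)[:k]

# Toy 3-dimensional vectors (made up; real embeddings here have 384 dimensions)
docs = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], docs, k=2)
print(results)  # [(0.0, 'same direction'), (1.0, 'orthogonal')]
```

Vector length does not affect the result: `[2.0, 0.0, 0.0]` is at distance 0 from the query `[1.0, 0.0, 0.0]` because only direction matters.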
