<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/redis-py-01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Search with redis-py
![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook uses [redis-py](https://redis-py.readthedocs.io/en/stable/), the standard Redis Python client library, to index documents with their embeddings and to run semantic search queries.

## Setup and Data Prep

### Pull GitHub Materials
We need to clone the supporting materials from GitHub.

In [1]:
# This clones your git repository into a directory named 'temp_repo'.
!git clone https://github.com/Redislabs-Solution-Architects/financial-vss.git temp_repo

# This command moves the 'resources' directory from 'temp_repo' to your current directory.
!mv temp_repo/resources .

# This deletes the 'temp_repo' directory, cleaning up the unwanted files.
!rm -rf temp_repo


Cloning into 'temp_repo'...
remote: Enumerating objects: 96, done.
remote: Counting objects: 100% (96/96), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 96 (delta 43), reused 75 (delta 27), pack-reused 0
Receiving objects: 100% (96/96), 7.06 MiB | 13.58 MiB/s, done.
Resolving deltas: 100% (43/43), done.


### Install Python Dependencies

In [None]:
!pip install -q redis langchain pdf2image "unstructured[all-docs]" sentence-transformers

### Preprocess PDF Doc(s)

Next, we load a single financial document (a 10-K filing) and preprocess it using some LangChain helpers.

In [3]:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# Load list of pdfs
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

# For simplicity, we work with just one of the 10-K files. This still takes some time.
# Note: UnstructuredFileLoader is not the only document loader type that LangChain provides.
# Note: RecursiveCharacterTextSplitter is what we use to create smaller chunks of text from the doc.
# Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
# Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
doc = [doc for doc in docs if "nke" in doc][0]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100, add_start_index=True)
loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Listing available documents ... ['resources/aapl-10k-2023.pdf', 'resources/nke-10k-2023.pdf', 'resources/jnj-10k-2023.pdf', 'resources/msft-10k-2023.pdf', 'resources/nvd-10k-2023.pdf', 'resources/amzn-10k-2023.pdf']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Done preprocessing. Created 323 chunks of the original pdf resources/nke-10k-2023.pdf
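
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size splitting with overlap. It is a simplification and not LangChain's implementation: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch ignores. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1500, chunk_overlap=100):
    """Return (start_index, chunk) pairs; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return pieces


# Synthetic 3000-character text, used only for illustration
sample = "".join(chr(ord("a") + i % 26) for i in range(3000))
pieces = split_with_overlap(sample)
print([start for start, _ in pieces])  # [0, 1400, 2800]
```

Each chunk starts 1400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. The start offsets play the same role as the metadata requested with `add_start_index=True`.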


In [4]:
# Take a look at one item
print(chunks[2])

page_content="NIKE, Inc.(Exact name of Registrant as specified in its charter)Oregon93-0584541(State or other jurisdiction of incorporation)(IRS Employer Identification No.)One Bowerman Drive, Beaverton, Oregon 97005-6453(Address of principal executive offices and zip code)(503) 671-6453(Registrant's telephone number, including area code)SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:Class B Common StockNKENew York Stock Exchange(Title of each class)(Trading symbol)(Name of each exchange on which registered)SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:NONE\n\nAs of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:Class A$7,831,564,572 Class B136,467,702,472 $144,299,267,044\n\nTable of ContentsUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934FOR THE FISCAL YEAR ENDED MAY 31, 2023

### Create document chunk embeddings

In [5]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [6]:
%%time
chunk_embeddings = model.encode([chunk.page_content for chunk in chunks])
len(chunk_embeddings) == len(chunks)

CPU times: user 3.3 s, sys: 1.23 s, total: 4.53 s
Wall time: 11.5 s


True

The helper functions below encode a single query string into a vector and display Redis search results.

In [7]:
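import numpy as np
import pandas as pd

def encode_one(text):
  # Embed one query string and serialize it as float32 bytes for Redis
  return model.encode(text).astype(np.float32).tobytes()

def redis_search_display(res):
  # Show the search results as a DataFrame, dropping the unused payload column
  res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
  display(res_df)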
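
The query vector is passed to Redis as a raw byte string, which is why `encode_one` ends with `.tobytes()`. The sketch below reproduces that byte layout with the standard-library `struct` module so the size is easy to check. It assumes little-endian float32, which matches NumPy's output on common x86 and ARM machines; the helper names are hypothetical.

```python
import struct

DIM = 384  # embedding size of all-MiniLM-L6-v2, matching "DIM" in the index schema


def to_float32_bytes(vector):
    """Pack a sequence of floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob):
    """Inverse of to_float32_bytes: unpack the bytes back into a list of floats."""
    count = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", blob))


vector = [0.5] * DIM            # made-up values, exactly representable in float32
blob = to_float32_bytes(vector)
print(len(blob))                # 1536 bytes = 384 dimensions * 4 bytes
```

If the byte length is not `DIM * 4`, the vector does not match the index definition, which is a common cause of indexing or query failures.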


### Install Redis Stack (OPTIONAL)

Redis Search (part of Redis Stack) is the vector search engine used in this notebook.

Instead of using the in-notebook Redis Stack (https://redis.io/docs/getting-started/install-stack/), you can provision your own free Redis Cloud instance at https://redis.com/try-free/

In [8]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


### Connect to Redis

By default, this notebook connects to the local instance of Redis Stack. If you have your own Redis Cloud instance, replace the REDIS_PASSWORD, REDIS_HOST and REDIS_PORT values with your own.

In [9]:
# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#REDIS_HOST="redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
#REDIS_PORT=18374
#REDIS_PASSWORD="1TNxTEdYRDgIDKM2gDfasupCADXXXX"

#shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"

REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
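
The f-string above places the password into the URL verbatim. If a password contains characters such as `@`, `:` or `/`, percent-encoding it keeps the URL parseable. The following is a minimal sketch with a hypothetical helper, `build_redis_url`; redis-py is expected to decode the credentials when it parses the URL, but that is worth verifying against the version you use.

```python
from urllib.parse import quote, unquote, urlsplit


def build_redis_url(host="localhost", port=6379, password=""):
    """Build a redis:// URL, percent-encoding the password if one is given."""
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"redis://{auth}{host}:{port}"


# Hypothetical host and password, used only to show the encoding round trip
url = build_redis_url("example.com", 18374, "p@ss:w/rd")
parts = urlsplit(url)
print(url)
print(parts.hostname, parts.port, unquote(parts.password))
```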


## Vector Search with redis-py

### Create the HASH index from schema
Below we connect to Redis and create an index for vector search that contains two text fields (`id` and `content`) and one vector field.

In [10]:
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


r = redis.Redis.from_url(REDIS_URL)

index_name = "redispy"
key_prefix = "doc:redispy"

def create_index(index_type: str = "FLAT"):       # Creates a FLAT index by default
    try:
        # check to see if index exists
        r.ft(index_name).info()
        print("Index already exists!")
    except redis.exceptions.ResponseError:         # raised when the index does not exist yet
        # schema
        schema = (
            TextField("id"),                       # Text Field Name - synthetic ID
            TextField("content"),                  # Text Field Name
            VectorField("chunk_vector",            # Vector Field Name
                index_type, {                      # Vector Index Type: FLAT or HNSW
                    "TYPE": "FLOAT32",
                    "DIM": 384,                    # Number of Vector Dimensions
                    "DISTANCE_METRIC": "COSINE",   # Vector Search Distance Metric
                }
            ),
        )

        # index Definition
        definition = IndexDefinition(prefix=[key_prefix], index_type=IndexType.HASH)   # Uses HASH by default

        # create Index
        r.ft(index_name).create_index(fields=schema, definition=definition)

In [11]:
# Create the index
create_index()
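
With `DISTANCE_METRIC` set to `COSINE`, the score returned by a vector query is a cosine distance, that is, 1 minus the cosine similarity, so smaller values mean closer matches. A plain-Python sketch of that calculation follows; the function name is illustrative and not part of redis-py.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for identical direction, up to 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because only the direction of the vectors matters, scaling an embedding does not change its cosine distance to the query.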

In [12]:
# Check the info related to the newly created index
r.ft(index_name).info()

{'index_name': 'redispy',
 'index_options': [],
 'index_definition': [b'key_type',
  b'HASH',
  b'prefixes',
  [b'doc:redispy'],
  b'default_score',
  b'1'],
 'attributes': [[b'identifier',
   b'id',
   b'attribute',
   b'id',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1'],
  [b'identifier',
   b'content',
   b'attribute',
   b'content',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1'],
  [b'identifier',
   b'chunk_vector',
   b'attribute',
   b'chunk_vector',
   b'type',
   b'VECTOR']],
 'num_docs': '0',
 'max_doc_id': '0',
 'num_terms': '0',
 'num_records': '0',
 'inverted_sz_mb': '0',
 'vector_index_sz_mb': '0.00818634033203125',
 'total_inverted_index_blocks': '0',
 'offset_vectors_sz_mb': '0',
 'doc_table_size_mb': '0',
 'sortable_values_size_mb': '0',
 'key_table_size_mb': '0',
 'geoshapes_sz_mb': '0',
 'records_per_doc_avg': 'nan',
 'bytes_per_record_avg': 'nan',
 'offsets_per_term_avg': 'nan',
 'offset_bits_per_record_avg': 'nan',
 'hash_indexing_failures': '0',
 'total_indexing_

### Process and load data using Redis
Below we use a Redis pipeline (not a transaction) to send writes to Redis in batches, which improves throughput significantly. The `batch_size` parameter can be tuned and benchmarked on your hardware and with your data. We typically recommend starting small (100-200) and increasing as needed.

In [13]:
# Write each chunk as a Redis HASH holding its id, content, and embedding
import numpy as np

batch_size = 200
pipe = r.pipeline(transaction=False)
for i, chunk in enumerate(chunks):
    data = {
        'id': f"ID-{i}",
        'content': chunk.page_content,
        # For HASH -- must convert embeddings to bytes
        'chunk_vector': np.array(chunk_embeddings[i]).astype(np.float32).tobytes()
    }
    pipe.hset(f"{key_prefix}:{i}", mapping=data)
    # execute in "mini batches" of batch_size commands
    if (i + 1) % batch_size == 0:
        res = pipe.execute()

# cleanup final batch execution
res = pipe.execute()
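
The mini-batch pattern can be checked without a Redis server. The sketch below swaps the real pipeline for a stand-in class, `FakePipeline` (hypothetical, written only for this illustration), that records how many commands each `execute()` call would send.

```python
class FakePipeline:
    """Stand-in for a redis-py pipeline: queues commands and counts flush sizes."""

    def __init__(self):
        self.queued = []
        self.flush_sizes = []

    def hset(self, key, mapping):
        self.queued.append((key, mapping))
        return self

    def execute(self):
        self.flush_sizes.append(len(self.queued))
        self.queued = []
        return []


def load_in_batches(pipe, records, key_prefix, batch_size=200):
    """Queue one HSET per record and flush every batch_size records."""
    for i, record in enumerate(records):
        pipe.hset(f"{key_prefix}:{i}", mapping=record)
        if (i + 1) % batch_size == 0:
            pipe.execute()
    pipe.execute()  # flush whatever is left in the final partial batch


pipe = FakePipeline()
records = [{"id": f"ID-{i}"} for i in range(323)]  # same count as the chunks above
load_in_batches(pipe, records, key_prefix="doc:demo")
print(pipe.flush_sizes)  # [200, 123]
```

Testing `(i + 1) % batch_size` makes every full batch contain exactly `batch_size` commands; testing `i % batch_size` would also flush after the very first record.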

In [14]:
# check the data size in Redis
len(chunks) == r.dbsize()

True

In [15]:
# do NOT run this command in production
keys = r.keys()

r.hgetall(keys[0])

{b'chunk_vector': b'\x99\x1b\xd1\xbd\xc6\xd1\x9d=4\x8a_\xbc\x9d\x10\x0e\xbdK\x14\x85\xbd\x97R\x94=*\xf0\xd1<\x95$>=H\x1a\xa4=\xc6\x87\x89<\xe43\x91<\xde\xcc{<!ck=\x16i\xba\xbd\xa3WY\xbc\x01\xc4w\xbd\xbf\xd78=\x1e\xb4(\xbd\x9f\xb31<\xaf\x98\xc7<\xa3\xc9\x06=\x86g=\xbbs6\xb4\xbd\xe2\xbf <\xb2\xb2\xf2<\x0c\xad\xcd\xbc`\xbd\x86\xbd\xe6\xf8\x02=\xce\x04\x15<\xa6\x02[\xbch\xad\xd1\xbcq;\xaa\xbcC\xdc\xa7=\r\x16O\xbdB\t\xc1;wU\xcf<f\xf8\xc8<m\x97\x89\xbd\xa8g\x8d<5-\'=\xd8\xf6\x16\xbc\xe9e\x93\xbc\xcaR\x8d\xbd\xca\xa1\x1f=\xe4}\x1d=NH\xb5\xbc\x00\xa7K;\x10y\xdd<\x90\xfa\x8d\xbd\x9b\xd0\xe2<\xd2\xa20\xbc\xaf!\x91<\xe1\x1cA\xbd\xdaDs=\xeeU\x8b\xbd\xae\xea\x91:-\xfc\xbf\xbc\x13\xa5\x9e\xbdW\xb9\xa3\xbb\xa6\xe4\xe3\xbc\xf3\xea\x10\xbd\x11\xc9\x85\xbd\t\x1dD\xbd\xc7C\xcf<\xb0\x1bC<~\xb6\xec\xbcC\xd8Z<Z\xc6\x03=\x96\xbe\xd5\xbd\xd0\x1c ;N\xa8\xc1\xbc\x847\x1c=\xf1MS\xbd\xfc%o<\x8fs\xab\xbc\x90\x84\xe2;\xe2\n\x9f=\xb3$D\xbdd\xc2\xac\xbb\xe6\xac\xf4\xbd\xe5\x17\xa2=\xde\xdf\xea=\x11\x88\x9a\xbdy\t"\xb

### Query the database
Now we can use the Redis search index to perform similarity search operations. This query takes a user input, converts it to an embedding, and fetches the 2 most semantically similar chunks from Redis.

In [16]:
# Grab user input
_input = "Nike profit margins and company performance"

query = (
    Query("*=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": encode_one(_input)
}

res=r.ft(index_name).search(query, query_params)
redis_search_display(res)

Unnamed: 0,id,score,content
0,doc:redispy:150,0.354781925678,2023 FORM 10-K 35\n\nTable of Contents\n\nOPER...
1,doc:redispy:151,0.360561609268,Financial Statements contained in Item 8 of th...
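
Conceptually, a `FLAT` index answers a KNN query by comparing the query vector with every stored vector and keeping the `k` smallest distances (an `HNSW` index approximates this with a graph search instead). The sketch below shows that brute-force idea in plain Python; the keys and two-dimensional vectors are made up, and none of this is redis-py code.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the score used with DISTANCE_METRIC COSINE."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def knn(query, vectors, k):
    """Return the k (distance, key) pairs closest to the query, nearest first."""
    scored = ((cosine_distance(query, vec), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)


# Made-up 2-dimensional vectors standing in for the 384-dimensional embeddings
vectors = {
    "doc:0": [1.0, 0.0],
    "doc:1": [0.9, 0.1],
    "doc:2": [0.0, 1.0],
    "doc:3": [-1.0, 0.0],
}
top2 = knn([1.0, 0.0], vectors, k=2)
print([key for _, key in top2])  # ['doc:0', 'doc:1']
```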


In [17]:
# Example of sorting by a field other than score
query = (
    Query("*=>[KNN 4 @chunk_vector $vec as score]")
     .sort_by("id")
     .return_fields("id", "content", "score")
     .paging(0, 4)
     .dialect(2)
)

query_params = {
    "vec": encode_one(_input)
}

res=r.ft(index_name).search(query, query_params)
redis_search_display(res)

Unnamed: 0,id,score,content
0,doc:redispy:145,0.362202942371,NIKE Brand apparel revenues increased 8% on a ...
1,doc:redispy:150,0.354781925678,2023 FORM 10-K 35\n\nTable of Contents\n\nOPER...
2,doc:redispy:151,0.360561609268,Financial Statements contained in Item 8 of th...
3,doc:redispy:219,0.368923187256,BASIS OF CONSOLIDATION\n\nThe Consolidated Fin...
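
One caveat when sorting by `id`: it is indexed as a text field, so the ordering is expected to compare strings rather than numbers. The result above looks numeric only because all four IDs have the same number of digits. Plain Python string sorting shows the effect; that Redis orders text values the same way is an assumption to verify.

```python
# Made-up IDs in the same "ID-<n>" format used when loading the chunks
ids = [f"ID-{n}" for n in (2, 10, 145, 1000)]

lexicographic = sorted(ids)
numeric = sorted(ids, key=lambda value: int(value.split("-")[1]))
zero_padded = sorted(f"ID-{n:04d}" for n in (2, 10, 145, 1000))

print(lexicographic)  # ['ID-10', 'ID-1000', 'ID-145', 'ID-2']
print(numeric)        # ['ID-2', 'ID-10', 'ID-145', 'ID-1000']
print(zero_padded)    # ['ID-0002', 'ID-0010', 'ID-0145', 'ID-1000']
```

If numeric ordering matters, zero-padding the IDs or storing the number in a separate numeric field are two possible options.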


### Range Queries
Range queries let you set a predefined distance threshold (the radius) and return every document whose vector falls within it.

In [18]:
query = (
    Query("@chunk_vector:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: score}")
     .sort_by("score")
     .return_fields("content", "score")
     .dialect(2)
)

# Find all vectors within a cosine distance of 0.8 from the query vector
query_params = {
    "radius": 0.8,
    "vec":encode_one(_input)
}
res=r.ft(index_name).search(query, query_params)
redis_search_display(res)

Unnamed: 0,id,score,content
0,doc:redispy:150,0.354781925678,2023 FORM 10-K 35\n\nTable of Contents\n\nOPER...
1,doc:redispy:151,0.360561609268,Financial Statements contained in Item 8 of th...
2,doc:redispy:145,0.362202942371,NIKE Brand apparel revenues increased 8% on a ...
3,doc:redispy:219,0.368923187256,BASIS OF CONSOLIDATION\n\nThe Consolidated Fin...
4,doc:redispy:142,0.372902095318,1 %\n\nSales through NIKE Direct Global Brand ...
5,doc:redispy:289,0.376974523067,Asia Pacific & Latin America Global Brand Divi...
6,doc:redispy:204,0.38412541151,To the Board of Directors and Shareholders of ...
7,doc:redispy:287,0.384333252907,Global Brand Divisions is included within the ...
8,doc:redispy:154,0.38654500246,"mix, lower margins in NIKE Direct due to highe..."
9,doc:redispy:153,0.387761950493,"increased 18%, driven by strong digital sales ..."
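
Unlike KNN, a range query has no fixed result count: it keeps everything inside the radius. Note that `FT.SEARCH` returns 10 results per page by default, which is presumably why the output above stops at 10 rows even though more chunks may lie within 0.8. The plain-Python sketch below illustrates the idea with made-up vectors; treating the boundary as inclusive is an assumption.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the score used with DISTANCE_METRIC COSINE."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def range_search(query, vectors, radius):
    """Return all (distance, key) pairs with distance <= radius, nearest first."""
    hits = ((cosine_distance(query, vec), key) for key, vec in vectors.items())
    return sorted(hit for hit in hits if hit[0] <= radius)


# Made-up 2-dimensional vectors standing in for real embeddings
vectors = {
    "doc:a": [1.0, 0.0],   # distance 0.0 from the query
    "doc:b": [1.0, 1.0],   # distance ~0.29
    "doc:c": [0.0, 1.0],   # distance 1.0
    "doc:d": [-1.0, 0.0],  # distance 2.0
}
within = range_search([1.0, 0.0], vectors, radius=0.8)
print([key for _, key in within])  # ['doc:a', 'doc:b']
```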


### Hybrid Queries
Hybrid queries combine traditional filters (numeric, tag, text) with vector search in a single Redis command.

In [19]:
query = (
    Query("@content:profit=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": encode_one(_input)
}
res=r.ft(index_name).search(query, query_params)
redis_search_display(res)

Unnamed: 0,id,score,content
0,doc:redispy:145,0.362202942371,NIKE Brand apparel revenues increased 8% on a ...
1,doc:redispy:140,0.389455199242,COMPARABLE STORE SALES Comparable store sales:...
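
The effect of the `@content:profit` filter can be sketched as "filter first, then rank": only documents that match the text condition are candidates for the KNN step. The plain-Python version below uses a simple whole-word match, which is cruder than Redis full-text search (no stemming or tokenization rules), and the documents are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the score used with DISTANCE_METRIC COSINE."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def hybrid_search(query_vec, docs, term, k):
    """Keep docs whose content contains `term`, then return the k nearest by distance."""
    candidates = {
        key: doc for key, doc in docs.items()
        if term.lower() in doc["content"].lower().split()
    }
    scored = sorted(
        (cosine_distance(query_vec, doc["vector"]), key)
        for key, doc in candidates.items()
    )
    return scored[:k]


# Made-up documents with 2-dimensional vectors
docs = {
    "doc:0": {"content": "Gross profit rose", "vector": [1.0, 0.0]},
    "doc:1": {"content": "Revenue increased", "vector": [1.0, 0.05]},
    "doc:2": {"content": "Profit margins fell", "vector": [0.5, 0.5]},
    "doc:3": {"content": "Net profit declined", "vector": [0.0, 1.0]},
}
hits = hybrid_search([1.0, 0.0], docs, term="profit", k=2)
print([key for _, key in hits])  # ['doc:0', 'doc:2']
```

`doc:1` is closer to the query than `doc:2`, but it never reaches the ranking step because it fails the text filter. This mirrors how the hybrid result above differs from the unfiltered KNN result.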


In [20]:
#r.ft(index_name).dropindex(True)

### What about JSON Support?

Redis also allows you to store data in JSON objects. The JSON fields can contain metadata and vectors. Below is a simple example of indexing JSON data.

**For now** -- JSON support is only enabled in the base redis-py library. It is coming soon to LangChain and RedisVL.

In [21]:
index_name = "redispy:json"
key_prefix = "doc:redispy:json"

# schema
schema = (
    TextField("$.content",                     # Text Field Name (JSON path)
        as_name="content"                      # Text Field Alias -- required for JSON
    ),
    VectorField("$.chunk_vector",              # Vector Field Name (JSON path)
        "FLAT", {                              # Vector Index Type: FLAT or HNSW
            "TYPE": "FLOAT32",
            "DIM": 384,                        # Number of Vector Dimensions
            "DISTANCE_METRIC": "COSINE",       # Vector Search Distance Metric
        },
        as_name="chunk_vector"                 # Vector Field Alias -- required for JSON
    ),
)

# index Definition
definition = IndexDefinition(prefix=[key_prefix], index_type=IndexType.JSON) # select JSON here

# create Index
r.ft(index_name).create_index(fields=schema, definition=definition)

b'OK'

In [22]:
r.ft(index_name).info()

{'index_name': 'redispy:json',
 'index_options': [],
 'index_definition': [b'key_type',
  b'JSON',
  b'prefixes',
  [b'doc:redispy:json'],
  b'default_score',
  b'1'],
 'attributes': [[b'identifier',
   b'$.content',
   b'attribute',
   b'content',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1'],
  [b'identifier',
   b'$.chunk_vector',
   b'attribute',
   b'chunk_vector',
   b'type',
   b'VECTOR']],
 'num_docs': '0',
 'max_doc_id': '0',
 'num_terms': '0',
 'num_records': '0',
 'inverted_sz_mb': '0',
 'vector_index_sz_mb': '0.00818634033203125',
 'total_inverted_index_blocks': '7463',
 'offset_vectors_sz_mb': '0',
 'doc_table_size_mb': '0',
 'sortable_values_size_mb': '0',
 'key_table_size_mb': '0',
 'geoshapes_sz_mb': '0',
 'records_per_doc_avg': 'nan',
 'bytes_per_record_avg': 'nan',
 'offsets_per_term_avg': 'nan',
 'offset_bits_per_record_avg': 'nan',
 'hash_indexing_failures': '0',
 'total_indexing_time': '0',
 'indexing': '0',
 'percent_indexed': '1',
 'number_of_uses': 1,
 'cleanin

In [23]:
# Write JSON data to the index

batch_size = 200
pipe = r.pipeline(transaction=False)

for i, chunk in enumerate(chunks):
    redis_key = f"{key_prefix}:{i}"
    data = {
        'content': chunk.page_content,
        'chunk_vector': chunk_embeddings[i].tolist() # notice that we don't need to convert JSON embeddings to bytes
    }
    #print(data)
    pipe.json().set(redis_key, "$", data)
    # execute in "mini batches" of batch_size commands
    if (i + 1) % batch_size == 0:
        res = pipe.execute()

# flush the final partial batch; mini batches matter most when working with larger datasets
res = pipe.execute()
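
The difference between the two storage formats can be seen without a server: a HASH field holds the vector as packed float32 bytes, while a JSON document holds it as an array of numbers. The sketch below uses made-up values and standard-library modules only; it mirrors the shape of the `data` dictionaries in this notebook but is not redis-py code.

```python
import json
import struct

vector = [0.25, -0.5, 0.125]  # made-up values, exactly representable in float32

# HASH style: the vector is serialized to raw bytes (4 bytes per float32)
hash_fields = {
    "content": "example text",
    "chunk_vector": struct.pack(f"<{len(vector)}f", *vector),
}

# JSON style: the vector stays a plain list of numbers
json_doc = {"content": "example text", "chunk_vector": vector}
payload = json.dumps(json_doc)

print(len(hash_fields["chunk_vector"]))        # 12 bytes for 3 dimensions
print(json.loads(payload)["chunk_vector"])     # [0.25, -0.5, 0.125]

# Raw bytes are not valid JSON values, so the HASH-style dict cannot be dumped
try:
    json.dumps(hash_fields)
    bytes_are_json_serializable = True
except TypeError:
    bytes_are_json_serializable = False
print(bytes_are_json_serializable)             # False
```

The query side is unchanged: the query vector is still sent as bytes by `encode_one`, even when the stored documents are JSON.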

In [24]:
# Fetch the JSON doc
r.json().get(f"{key_prefix}:0", "$")

[{'content': 'Indicate by check mark:YESNO•if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.þ¨•if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act.¨þ•whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for thepast 90 days.þ¨•whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T(§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files).þ¨•whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company or an emerging gr

In [25]:
# And now you can perform the same kinds of queries

query = (
    Query("@content:profit=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": encode_one(_input)

}
res=r.ft(index_name).search(query, query_params)
redis_search_display(res)

Unnamed: 0,id,score,content
0,doc:redispy:json:145,0.362202942371,NIKE Brand apparel revenues increased 8% on a ...
1,doc:redispy:json:140,0.389455199242,COMPARABLE STORE SALES Comparable store sales:...


## Cleanup
Clean up the index and data. Uncomment the line below to drop the index together with its documents.

In [26]:
#r.ft(index_name).dropindex(True)

## What's Next?

Now that you have the basics down with the baseline Redis Python client, continue with the next notebooks:

| Notebook | Description | Documentation |
|----------|-------------|---------------|
| [**redisvl-02**](https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/redisvl-02.ipynb) | Dive deeper using a dedicated VSS client library. | View Docs |
| [**langchain-03**](https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/langchain-03.ipynb) | Wrap up with an integrated approach via LangChain and see how to build complex LLM chains. | View Docs |
