<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/redisvl-02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Search with RedisVL

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook uses [redisvl](https://redisvl.com), a dedicated Python client library for using Redis as a vector database, to index documents together with their embeddings and run semantic search over them.

## Setup and Data Prep

### Pull GitHub Materials
We need to clone the supporting materials from GitHub.

In [1]:
# This clones your git repository into a directory named 'temp_repo'.
!git clone https://github.com/Redislabs-Solution-Architects/financial-vss.git temp_repo

# This command moves the 'resources' directory from 'temp_repo' to your current directory.
!mv temp_repo/resources .

# This deletes the 'temp_repo' directory, cleaning up the unwanted files.
!rm -rf temp_repo


Cloning into 'temp_repo'...
remote: Enumerating objects: 96, done.
remote: Counting objects: 100% (96/96), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 96 (delta 43), reused 75 (delta 27), pack-reused 0
Receiving objects: 100% (96/96), 7.06 MiB | 6.55 MiB/s, done.
Resolving deltas: 100% (43/43), done.
mv: cannot move 'temp_repo/resources' to './resources': Directory not empty


### Install Python Dependencies

In [2]:
!pip install -q redis "redisvl>=0.0.4" langchain pdf2image "unstructured[all-docs]" sentence-transformers

### Preprocess PDF Doc(s)

Next, we load a single financial document (a 10-K filing) and preprocess it with LangChain helpers.

In [3]:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# Load list of pdfs
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

# For simplicity, we work with only one of the 10-K files. This still takes some time.
# Note: UnstructuredFileLoader is only one of the document loaders that LangChain provides.
# Note: RecursiveCharacterTextSplitter splits the document into smaller chunks of text.
# Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
# Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
doc = [doc for doc in docs if "nke" in doc][0]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100, add_start_index=True)
loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Listing available documents ... ['resources/aapl-10k-2023.pdf', 'resources/nke-10k-2023.pdf', 'resources/jnj-10k-2023.pdf', 'resources/msft-10k-2023.pdf', 'resources/nvd-10k-2023.pdf', 'resources/amzn-10k-2023.pdf']
Done preprocessing. Created 323 chunks of the original pdf resources/nke-10k-2023.pdf
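
To build intuition for `chunk_size`, `chunk_overlap`, and `add_start_index`, here is a minimal fixed-window sketch. It is a simplified stand-in, not LangChain's implementation: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraphs and lines first, so its chunk boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Fixed-window splitter (illustrative only, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    pieces = []
    for start in range(0, len(text), step):
        # Record the start offset, similar in spirit to add_start_index=True
        pieces.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
windows = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for w in windows:
    print(w["start_index"], w["text"])
```

Each window repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary searchable from either chunk.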


In [4]:
# Take a look at one item
print(chunks[2])

page_content="NIKE, Inc.(Exact name of Registrant as specified in its charter)Oregon93-0584541(State or other jurisdiction of incorporation)(IRS Employer Identification No.)One Bowerman Drive, Beaverton, Oregon 97005-6453(Address of principal executive offices and zip code)(503) 671-6453(Registrant's telephone number, including area code)SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:Class B Common StockNKENew York Stock Exchange(Title of each class)(Trading symbol)(Name of each exchange on which registered)SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:NONE\n\nAs of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:Class A$7,831,564,572 Class B136,467,702,472 $144,299,267,044\n\nTable of ContentsUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934FOR THE FISCAL YEAR ENDED MAY 31, 2023

### Create document chunk embeddings

In [5]:
from redisvl.vectorize.text import HFTextVectorizer

hf = HFTextVectorizer("sentence-transformers/all-MiniLM-L6-v2")

# Embed each page_content from the document chunks
chunk_embeddings = hf.embed_many([chunk.page_content for chunk in chunks])

# Check to make sure we've created enough embeddings, 1 per document chunk
len(chunk_embeddings) == len(chunks)

True

### Install Redis Stack (OPTIONAL)

Redis Search serves as the vector search engine for this notebook.

Instead of running Redis Stack inside the notebook (https://redis.io/docs/getting-started/install-stack/), you can provision your own free Redis Cloud instance at https://redis.com/try-free/

In [6]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


gpg: cannot open '/dev/tty': No such device or address
curl: (23) Failed writing body


### Connect to Redis

By default, this notebook connects to the local instance of Redis Stack. If you have your own Redis Cloud instance, replace the REDIS_PASSWORD, REDIS_HOST, and REDIS_PORT values with your own.

In [7]:
# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#REDIS_HOST="redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
#REDIS_PORT=18374
#REDIS_PASSWORD="1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# Shortcut: connection flags for redis-cli, used as `redis-cli $REDIS_CONN <command>`
if REDIS_PASSWORD != "":
  os.environ["REDIS_CONN"] = f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"] = f"-h {REDIS_HOST} -p {REDIS_PORT}"

REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
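
The connection URL above is assembled with a plain f-string. That works for the defaults, but a password containing characters such as `@` or `/` would make the URL ambiguous. A minimal sketch of a more defensive builder follows. `build_redis_url` is a hypothetical helper, not part of redis-py or RedisVL, and it assumes the client percent-decodes the password when parsing the URL.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: str, password: str = "") -> str:
    """Build a redis:// URL, percent-encoding the password if one is set."""
    if password:
        # safe="" forces reserved characters like '@' and '/' to be encoded too
        return f"redis://:{quote(password, safe='')}@{host}:{port}"
    return f"redis://{host}:{port}"


local_url = build_redis_url("localhost", "6379")
cloud_url = build_redis_url("example.invalid", "18374", "p@ss/word")
print(local_url)
print(cloud_url)
```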


## Vector Search with RedisVL

### Create the index from schema
Below we connect to Redis and create an index for vector search that contains two text fields (`label` and `content`) and one vector field (`chunk_vector`).

In [17]:
from redisvl.index import SearchIndex # can also use AsyncSearchIndex for prod use cases

index_name = "redisvl"

schema = {
  "index": {
    "name": index_name,
    "prefix": f"doc:{index_name}"
  },
  "fields": {
    "text": [{"name": "label"},
             {"name": "content"}],
    "vector": [{
                "name": "chunk_vector",
                "dims": 384,
                "distance_metric": "cosine",
                "algorithm": "hnsw",
                "datatype": "float32"}
        ]
  },
}

# construct a search index from the schema
index = SearchIndex.from_dict(schema) # or SearchIndex.from_yaml("schema.yaml") for yaml files

# connect to the Redis instance at REDIS_URL (local by default)
index.connect(REDIS_URL)

# create the index (no data yet)
index.create(overwrite=True)

In [18]:
# use the CLI to see the created index
!rvl index listall

22:19:42 [RedisVL] INFO   Indices:
22:19:42 [RedisVL] INFO   1. redisvl


In [19]:
!rvl index info -i redisvl



Index Information:
╭──────────────┬────────────────┬─────────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes        │ Index Options   │   Indexing │
├──────────────┼────────────────┼─────────────────┼─────────────────┼────────────┤
│ redisvl      │ HASH           │ ['doc:redisvl'] │ []              │          0 │
╰──────────────┴────────────────┴─────────────────┴─────────────────┴────────────╯
Index Fields:
╭──────────────┬──────────────┬────────┬────────────────┬────────────────╮
│ Name         │ Attribute    │ Type   │ Field Option   │   Option Value │
├──────────────┼──────────────┼────────┼────────────────┼────────────────┤
│ label        │ label        │ TEXT   │ WEIGHT         │              1 │
│ content      │ content      │ TEXT   │ WEIGHT         │              1 │
│ chunk_vector │ chunk_vector │ VECTOR │                │                │
╰──────────────┴──────────────┴────────┴────────────────┴────────────────╯


### Process and load data using RedisVL
Below we use the RedisVL index to load the list of document chunks into Redis.

In [20]:
# load expects an iterable of dictionaries
import numpy as np

new_chunks = [
    {
        'label': f'ID-{i}',
        'content': chunk.page_content,
        # For HASH -- must convert embeddings to bytes
        'chunk_vector': np.array(chunk_embeddings[i]).astype(np.float32).tobytes()
    } for i, chunk in enumerate(chunks)
]

# RedisVL handles batching automatically
index.load(new_chunks)

In [21]:
# access the underlying client to check the number of keys in Redis
# (this matches len(chunks) only if the database held no other keys)
len(chunks) == index.client.dbsize()

True

In [22]:
# KEYS walks the entire keyspace and blocks the server while it runs: do NOT use it in production
keys = index.client.keys()

index.client.hgetall(keys[0])

{b'chunk_vector': b'eG\xaa\xbd\x03@6\xbd(\xe3\xd3<5\xd6\x9b:Dj\x86\xbcd\x83\x8b=\xf1\xf6;\xbd_l\xba=lu\x1a\xbdq\xff\x99\xbb\x95\xa8?=\xfb\x1d\xb6=\xe0\xab\xea\xbc\xdemj\xbcO\xa7$\xbdV\x93\x8a<\n5\xb9\xbb\x96\xb9\x11=EAv\xbd\xe9\x12\xb4\xbcv:\xd2=\xf6\xc1*\xbd3\xf3\xb1<\x9e7\x00=\x99\xdd\xeb\xbcp\x82H\xbb\xafM3\xbdC\x01\x8a=\x98|\xa7\xbdD\xf2/\xbd\xe7\xee\x83\xbdm\xa7\xc9\xbc7\xf7\xd1=\xb8YF\xbc\x04^\xc8=\xee\x96B\xbd\x8d\xc7I=U\x83\xab<\x1a#\x1f\xbd`S\xab\xbb7\xcb\xb6;\x8b\xe7#\xbe^\xb1o\xbc\x17\xc9\x02=\xd0#\xd4\xbcH\x80\xb0<\x98k\x01\xbd\xfb\x9e\x13=\xd6\xfc\x02\xbd\xac\x07\xec=\xde/\xb2\xbcD\xd1\xbd;\xe4\x98q=_M\xdf\xbc\xb6\x9e\x06=-m*<\x04\xa1?\xbd\xcc\x97-\xbd\x11(\x16\xbbW2\xc2;\x89aC<o\xa98\xbd\x8d^\xa8\xbd\x9b\xd1\xae<4\xe5\xd7=\xd4\xaaK\xbd\xc9\xa3\x0f=o\xae\xa3\xbd}\xec>\xbd\xae\x1e\x85:\x04\xc9\xa0=r\xf4\xc3\xbd\x88#\x88\xbd\xe6\x1a$\xbdD\xf4\x95\xbd\xd3{C=\xdex\x87<\x0e)\xba=\xe69\x81=Yo\xee\xbd`\xdb\x0c=P\xfe$=\xa4\x1a\x0c<[^s\xbd\xaf+\x92\xbc\x87\xfa\xc4\xbc\x0bD\xdc<\xec
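
The `chunk_vector` value looks opaque because each embedding was stored as raw float32 bytes: 384 dimensions at 4 bytes each gives 1536 bytes per vector. The sketch below shows the round trip using only the standard library's `struct` module as a stand-in for the NumPy calls in the notebook. It assumes little-endian byte order, which is NumPy's native order on common x86 and ARM machines. The helper names are hypothetical.

```python
import struct


def floats_to_bytes(values):
    """Pack floats as little-endian float32, like np.float32 .tobytes() on most machines."""
    return struct.pack(f"<{len(values)}f", *values)


def bytes_to_floats(raw):
    """Unpack a float32 byte string back into a list of Python floats."""
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw))


vector = [0.5, -0.25, 1.0]  # values chosen to be exactly representable in float32
raw = floats_to_bytes(vector)
restored = bytes_to_floats(raw)
print(len(raw), restored)

# The first four bytes of the stored vector shown above decode to one small float
first_component = struct.unpack("<f", b"eG\xaa\xbd")[0]
print(first_component)
```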

### Query the database
Now we can use the RedisVL index to run similarity searches against Redis.

In [23]:
from redisvl.query import (
    VectorQuery,
    RangeQuery,
    FilterQuery
)
from redisvl.query.filter import Text


query = "Nike profit margins and company performance"

v = VectorQuery(
    vector=hf.embed(query),
    vector_field_name="chunk_vector",
    num_results=4,
    return_fields=["label", "content"],
    return_score=True
)

# show the raw redis query
str(v)

'*=>[KNN 4 @chunk_vector $vector AS vector_distance] RETURN 3 label content vector_distance SORTBY vector_distance ASC DIALECT 2 LIMIT 0 4'

In [24]:
# execute the query with RedisVL
index.query(v)

[{'id': 'doc:redisvl:faf74f986a86418fb1be5d46dd5d3707',
  'vector_distance': '0.354781925678',
  'label': 'ID-150',
  'content': '2023 FORM 10-K 35\n\nTable of Contents\n\nOPERATING SEGMENTS\n\nAs discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n
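
The `vector_distance` in the results is a cosine distance, that is, 1 minus the cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction) and smaller means closer. Note that it comes back as a string. The following pure-Python sketch illustrates the metric itself. It is not how Redis computes it internally, and `cosine_distance` is a hypothetical helper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [1.0, 0.0])       # identical direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)

# Scores arrive as strings, so convert before comparing numerically
score = float("0.354781925678")
print(score < 0.5)
```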

In [25]:
# or use the search API directly from Redis
# NOTE: the .query() syntax handles results parsing on your behalf

result = index.search(v.query, v.params)

[doc.__dict__ for doc in result.docs]

[{'id': 'doc:redisvl:faf74f986a86418fb1be5d46dd5d3707',
  'payload': None,
  'vector_distance': '0.354781925678',
  'label': 'ID-150',
  'content': '2023 FORM 10-K 35\n\nTable of Contents\n\nOPERATING SEGMENTS\n\nAs discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,43

In [27]:
# Sort by field other than score

# Manipulate the Redis-py Query object
redis_py_query = v.query

# choose to sort by label instead of vector distance
redis_py_query.sort_by("label", asc=True)

# run the query with the ``SearchIndex.search`` method
result = index.search(redis_py_query, v.params)
[doc.__dict__ for doc in result.docs]

[{'id': 'doc:redisvl:9f2f26f97d674466bf32c058973adc96',
  'payload': None,
  'vector_distance': '0.362202882767',
  'label': 'ID-145',
  'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". N
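
One caveat when sorting by `label`: the labels are strings such as `ID-145`, so, assuming they are compared as text, the order is lexicographic rather than numeric. The sketch below shows the effect in plain Python and one possible workaround, zero-padding the number when the label is built. The padding width of 4 is an arbitrary choice for illustration.

```python
ids = (150, 20, 145, 3)

# Labels as built in this notebook: string order puts ID-145 before ID-20
plain = sorted(f"ID-{i}" for i in ids)
print(plain)

# Zero-padded labels sort in the same order as the underlying numbers
padded = sorted(f"ID-{i:04d}" for i in ids)
print(padded)
```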

In [28]:
# vector search combined with a full-text filter on the content field

f = Text("content") % "profit"
v.set_filter(f)

index.query(v)

[{'id': 'doc:redisvl:9f2f26f97d674466bf32c058973adc96',
  'vector_distance': '0.362202882767',
  'label': 'ID-145',
  'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital s

In [29]:
# Perform a standard text (lexical) search

fq = FilterQuery(return_fields=["content"], filter_expression=f, num_results=4)

# inspect raw redis query
str(fq)

'@content:(profit) RETURN 1 content DIALECT 2 LIMIT 0 4'

In [30]:
index.query(fq)

[{'id': 'doc:redisvl:f0fae0b64b964d7383b53936b34502f2',
  'content': 'Proposals to reform U.S. and foreign tax laws could significantly impact how U.S. multinational corporations are taxed on global earnings and could increase the U.S. corporate tax rate. For example, the Organization for Economic Co-operation and Development (OECD) and the G20 Inclusive Framework on Base Erosion and Profit Shifting (the "Inclusive Framework") has put forth two proposals—Pillar One and Pillar Two—that revise the existing profit allocation and nexus rules and ensure a minimal level of taxation, respectively. On December 12, 2022, the European Union member states agreed to implement the Inclusive Framework\'s global corporate minimum tax rate of 15%. Other countries are also actively considering changes to their tax laws to adopt certain parts of the Inclusive Framework\'s proposals. Although we cannot predict whether or in what form these proposals will be enacted into law, these changes, if enacted int

In [31]:
# Perform a Range Query!

rq = RangeQuery(
    vector=hf.embed(query),
    vector_field_name="chunk_vector",
    num_results=4,
    return_fields=["content"],
    return_score=True,
    distance_threshold=0.5  # find all items with a semantic distance of less than 0.5
)


# inspect query
str(rq)

'@chunk_vector:[VECTOR_RANGE $distance_threshold $vector]=>{$yield_distance_as: vector_distance} RETURN 2 content vector_distance SORTBY vector_distance ASC DIALECT 2 LIMIT 0 4'

In [32]:
index.query(rq)

[{'id': 'doc:redisvl:faf74f986a86418fb1be5d46dd5d3707',
  'vector_distance': '0.354781925678',
  'content': '2023 FORM 10-K 35\n\nTable of Contents\n\nOPERATING SEGMENTS\n\nAs discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n\n8 % -43 %\n\n17 % -
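
The difference between `VectorQuery` (KNN) and `RangeQuery` can be sketched in plain Python: KNN always returns the k nearest items however far away they are, while a range query returns only items inside the distance threshold, capped at `num_results`. The distances below are illustrative values, not query output, and the strict `<` comparison follows the notebook's comment. Exact boundary handling in Redis is an assumption here.

```python
def knn(distances, k):
    """Return the k closest (label, distance) pairs, regardless of distance."""
    return sorted(distances.items(), key=lambda item: item[1])[:k]


def within_range(distances, threshold, limit):
    """Return up to `limit` pairs whose distance is below `threshold`."""
    hits = [(label, d) for label, d in distances.items() if d < threshold]
    return sorted(hits, key=lambda item: item[1])[:limit]


# Hypothetical distances for four chunks
distances = {"ID-150": 0.35, "ID-145": 0.36, "ID-7": 0.62, "ID-9": 0.81}

knn_hits = knn(distances, k=3)
range_hits = within_range(distances, threshold=0.5, limit=4)
print(knn_hits)
print(range_hits)
```

With a threshold of 0.5, the range query drops the third neighbor that KNN would still return, which is useful when weak matches should be excluded instead of padded in.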

In [33]:
# Add filter to range query

rq.set_filter(f)

index.query(rq)

[{'id': 'doc:redisvl:9f2f26f97d674466bf32c058973adc96',
  'vector_distance': '0.362202882767',
  'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital sales were $12.6 billi

## Cleanup

Uncomment the line below to delete the index. With `drop=True`, the indexed documents are removed as well.

In [None]:
#index.delete(drop=True)

## What's Next?

Now that you have tried the easy-to-use RedisVL client, try your hand with LangChain -- the highest level of abstraction for using and integrating Redis as a vector database.


<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/langchain-03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>