![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

# Introduction to Redis Python

This notebook introduces [Redis](https://redis.io) and the standard Python client, [redis-py](https://redis-py.readthedocs.io/en/stable/), for interacting with the database. We will explore the basics of Redis setup, data structures, and capabilities like vector search!

## Environment Setup

### Pull GitHub Materials
Because you are likely running this notebook in **Jupyter Notebook**, we need to first
pull the necessary dataset and materials directly from GitHub.

We will be installing the following packages:
- `openai` - the official Python client for the OpenAI API, which covers tasks such as language generation, text completion, and embeddings.
- `tiktoken` - a fast byte pair encoding (BPE) tokenizer from OpenAI, used to count and split text into the tokens its models consume.
- `langchain` - a framework for building LLM applications; in this notebook it provides document loading, text splitting, and other text preprocessing helpers.
- `unstructured[pdf]` - extracts text from unstructured documents; the `pdf` extra adds PDF support.
- `sentence-transformers` - a Python library for generating sentence embeddings using pre-trained transformer-based models.
- `pandas` - a powerful data analysis and manipulation library for Python.
- `pdf2image` - a Python library for converting PDF documents into images.
- `redisvl>=0.1.0` - [RedisVL](https://www.redisvl.com/) provides a powerful, dedicated Python client library for using Redis as a Vector Database. 

**If you are running this notebook locally** from a clone of the repository, you may
not need to perform this step at all.

In [23]:
# This clones the supporting git repository into a directory named 'temp_repo'.
!git clone https://github.com/redis-developer/financial-vss.git temp_repo

# This command moves the 'resources' directory from 'temp_repo' to your current directory.
!mv temp_repo/resources .
!mv temp_repo/requirements.txt .

# This deletes the 'temp_repo' directory, cleaning up the unwanted files.
!rm -rf temp_repo

Cloning into 'temp_repo'...
remote: Enumerating objects: 134, done.
remote: Counting objects: 100% (134/134), done.
remote: Compressing objects: 100% (94/94), done.
remote: Total 134 (delta 66), reused 92 (delta 35), pack-reused 0
Receiving objects: 100% (134/134), 7.19 MiB | 7.94 MiB/s, done.
Resolving deltas: 100% (66/66), done.


### Install Python Dependencies

This may take up to 4 minutes.

In [24]:
!pip install -r requirements.txt

Collecting openai (from -r requirements.txt (line 1))
  Using cached openai-1.14.3-py3-none-any.whl.metadata (20 kB)
Collecting tiktoken (from -r requirements.txt (line 2))
  Using cached tiktoken-0.6.0-cp39-cp39-macosx_10_9_x86_64.whl.metadata (6.6 kB)
Collecting langchain (from -r requirements.txt (line 3))
  Using cached langchain-0.1.13-py3-none-any.whl.metadata (13 kB)
Collecting sentence-transformers (from -r requirements.txt (line 5))
  Using cached sentence_transformers-2.6.1-py3-none-any.whl.metadata (11 kB)
Collecting pandas (from -r requirements.txt (line 6))
  Using cached pandas-2.2.1-cp39-cp39-macosx_10_9_x86_64.whl.metadata (19 kB)
Collecting pdf2image (from -r requirements.txt (line 7))
  Using cached pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Collecting redisvl>=0.1.0 (from -r requirements.txt (line 8))
  Using cached redisvl-0.1.2-py3-none-any.whl.metadata (14 kB)
Collecting unstructured[pdf] (from -r requirements.txt (line 4))
  Using cached unstructured-0.1

In [25]:
import warnings

warnings.filterwarnings("ignore")

### Install Redis Stack

Later in this tutorial, Redis will be used to store, index, and query vector
embeddings created from PDF document chunks. **We need to make sure we have a Redis
instance available.**

#### Method 1: Local Redis Stack
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
## Uncomment if Redis Cloud setup is not working for you
#%%sh
#curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
#echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
#sudo apt-get update  > /dev/null 2>&1
#sudo apt-get install redis-stack-server  > /dev/null 2>&1
#redis-stack-server --daemonize yes

#### Method 2: Redis Cloud
[Set up your Redis environment](/00-setup/README.md)

### Define the Redis Connection URL

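The cell below connects to a Redis Cloud instance by default. **If you have your own Redis Cloud or Redis Enterprise instance**, replace the REDIS_HOST, REDIS_PORT, and REDIS_PASSWORD values with your own. To connect to the local Redis Stack instance instead, switch to the commented-out `os.getenv` settings in the same cell.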
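
As a rough sketch of how such a connection URL is assembled, the helper below builds it from its parts using only the standard library. The function name `build_redis_url` and its parameters are hypothetical and not part of redis-py; the sketch assumes the password should be percent-encoded so that characters such as `@` or `:` do not confuse URL parsing.

```python
from urllib.parse import quote


def build_redis_url(host, port, password="", use_ssl=False):
    """Assemble a Redis connection URL (hypothetical helper, not part of redis-py)."""
    # rediss:// (double s) selects a TLS connection, redis:// a plain one.
    scheme = "rediss" if use_ssl else "redis"
    # Percent-encode the password so reserved characters survive URL parsing.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


print(build_redis_url("localhost", 6379))
print(build_redis_url("example-host", 6380, password="p@ss:word", use_ssl=True))
```

Leaving the password empty drops the credentials part entirely, which is what a local instance without authentication needs.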

In [17]:
import os

## To use a local Redis Stack instance (or values from environment variables) when
## Redis Cloud is not available, uncomment the three lines below and comment out
## the Redis Cloud values further down.
#REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
#REDIS_PORT = os.getenv("REDIS_PORT", "6379")
#REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

# Replace values below with your own if you are using a Redis Cloud instance.
REDIS_HOST = "redis-17834.c322.us-east-1-2.ec2.cloud.redislabs.com"
REDIS_PORT = 17834
REDIS_PASSWORD = "XzJR9ewHpp9hdMVNERloYCMFoavPkgH8"
REDIS_REQUIRE_PASS = True

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"

## Hello World Redis

Now let's connect to the Redis db and get a basic feel for the most common
commands and data structures.

In [5]:
import redis
import json
import numpy as np

from time import sleep

# Connect with the Redis Python Client
client = redis.Redis.from_url(REDIS_URL)

client.ping()

True

In [6]:
client.dbsize() # should be empty

0

Redis, at its core, is a simple key/value store. It supports a number of interesting
and flexible data structures that can solve a variety of business and operational
problems.

### [Redis Strings](https://redis.io/docs/data-types/strings/)

The basic string data type can be accessed using set/get methods. You can also place a
TTL policy (expiration) on any key in Redis.

In [None]:
client.set("hello", "world")

In [None]:
client.get("hello")

In [None]:
client.delete("hello")

In [None]:
client.set("hello", "world")
client.expire("hello", time=3)

sleep(4)

# should be EMPTY (returns None because the key has expired)
client.get("hello")

### [Redis Hashes](https://redis.io/docs/data-types/hashes/)
Hashes are collections of field/value pairs grouped together under a single key. Each
field value is stored as a string in Redis, but it can hold a variety of data once serialized.

You can think of a Hash as a one-level deep Python dictionary.
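
Because a Hash is only one level deep, nested values have to be serialized before they can go into a field. The sketch below shows one way to do that with `json.dumps`; the helper `to_hash_mapping` is hypothetical and not part of redis-py, and it assumes that only bytes, strings, and numbers can be passed as field values directly.

```python
import json


def to_hash_mapping(record):
    """Prepare a dict for HSET by serializing values a hash field cannot hold directly.

    Hypothetical helper for illustration; it is not part of redis-py.
    """
    mapping = {}
    for field, value in record.items():
        if isinstance(value, (bytes, str, int, float)) and not isinstance(value, bool):
            mapping[field] = value              # already a flat scalar
        else:
            mapping[field] = json.dumps(value)  # nested dict, list, bool, or None
    return mapping


nested = {"user": "john", "metadata": {"age": 45, "job": "dentist"}, "tags": ["a", "b"]}
flat = to_hash_mapping(nested)
print(flat)
```

The resulting `flat` dictionary could then be passed as the `mapping` argument of `hset`, with `json.loads` applied to the serialized fields when reading them back.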


In [None]:
obj = {
    "user": "john",
    "age": 45,
    "job": "dentist",
    "bio": "long form text of john's bio",
    "user_embedding": np.array([0.3, 0.4, -0.8], dtype=np.float32).tobytes() # cast vectors to bytes string
}

In [None]:
client.hset("user:john", mapping=obj)

In [None]:
client.hgetall("user:john")

In [None]:
client.delete("user:john")

### JSON
With the JSON capability enabled, Redis can serve as a fast alternative to MongoDB
or other document databases. You can store nested and structured JSON data
directly in Redis.

In [None]:
# set a JSON obj
obj = {
    "user": "john",
    "metadata": {
        "age": 45,
        "job": "dentist",
    },
    "user_embedding": [0.3, 0.4, -0.8]
}

client.json().set("user:john", "$", obj)

In [None]:
# get user JSON obj
client.json().get("user:john")

In [None]:
# grab array length for embedding field
client.json().arrlen("user:john", "$.user_embedding")

In [None]:
# grab obj keys
client.json().objkeys("user:john", "$")

In [None]:
# delete user JSON
client.delete("user:john")

### Lists
Lists store ordered sequences of items, such as the messages in an LLM
conversation flow or the items in a queue.

In [None]:
# add items to a list
client.rpush("messages:john", *[
    json.dumps({"role": "user", "content": "Hello what can you do for me?"}),
    json.dumps({"role": "assistant", "content": "Hi, I am a helpful virtual assistant."})
])

In [None]:
# list all items in the list using indices
[json.loads(msg) for msg in client.lrange("messages:john", 0, -1)]

In [None]:
# count items in the list
client.llen("messages:john")

In [None]:
# pop the last item from the list and push it onto the head of another list
client.rpoplpush("messages:john", "read_messages:john")

In [None]:
client.lrange("read_messages:john", 0, -1)

In [None]:
# list cleanup
client.delete("messages:john", "read_messages:john")

### [Redis Pipelining](https://redis.io/docs/manual/pipelining/)
Optimize round-trip time by batching Redis commands. 

Redis pipelining is a technique for improving performance by issuing multiple commands at once without waiting for the response to each individual command.
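
To make the idea concrete without a server, here is a toy stand-in that buffers commands locally and counts how many simulated round trips are needed. The class `BufferedCommands` is hypothetical, exists only for illustration, and does not talk to Redis.

```python
class BufferedCommands:
    """Toy stand-in for a pipeline: queue commands locally, send them in one batch.

    Hypothetical class for illustration only; it does not talk to Redis.
    """

    def __init__(self):
        self.queue = []        # commands waiting to be sent
        self.round_trips = 0   # number of simulated network round trips
        self.store = {}        # simulated server-side key space

    def set(self, key, value):
        self.queue.append((key, value))   # nothing is sent yet
        return self

    def execute(self):
        if not self.queue:
            return []
        self.round_trips += 1             # the whole queue travels together
        for key, value in self.queue:
            self.store[key] = value
        replies = [True] * len(self.queue)
        self.queue.clear()
        return replies


pipe = BufferedCommands()
for i in range(50):
    pipe.set(f"user:{i}", "payload")
replies = pipe.execute()
print(len(replies), "commands sent in", pipe.round_trips, "round trip")
```

Sent one by one, the same 50 commands would each wait for their own reply, so the saving grows with network latency.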

In [None]:
with client.pipeline(transaction=False) as pipe:
    for i in range(50):
        pipe.json().set(f"user:{i}", "$", obj)
    # execute batch
    pipe.execute()

In [None]:
client.dbsize()

In [None]:
# clean up!
client.flushall()

## Intro to Vector Search in Redis
Now that we've covered the basics, let's explore the fundamental principles of vector search in Redis using the standard Python client.

### Dataset Preparation (PDF Documents)

To showcase Redis as a vector database, we'll load a single financial document (a 10-K filing) and preprocess it using LangChain's helpers:

- LangChain offers various document loader types, not just `UnstructuredFileLoader`. [Docs](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file)
- We use `RecursiveCharacterTextSplitter` to divide the document into smaller text chunks. More details can be found here: [Docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
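
To give a feel for what chunking with a size and an overlap means, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraphs, lines, and words, whereas the hypothetical `split_text` below ignores text structure entirely.

```python
def split_text(text, chunk_size, chunk_overlap=0):
    """Split text into fixed-size character windows (simplified illustration)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
print(split_text(sample, chunk_size=4))
print(split_text(sample, chunk_size=4, chunk_overlap=2))
```

With an overlap of zero, as used later in this notebook, every character lands in exactly one chunk; a positive overlap repeats the tail of each chunk at the start of the next.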

In [26]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# Load list of pdfs from a folder
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents:")
for d in docs:
    print(d)

Listing available documents:
resources/nke-10k-2023.pdf
resources/amzn-10k-2023.pdf
resources/jnj-10k-2023.pdf
resources/aapl-10k-2023.pdf
resources/nvd-10k-2023.pdf
resources/msft-10k-2023.pdf


We will use Nike's 10-K filing, a 2.3 MB PDF with 107 pages. Chunking this file may take about 2 minutes.

In [27]:
# fetch the Nike PDF
doc = [doc for doc in docs if "nke" in doc][0]

# set up the file loader/extractor and text splitter to create chunks
loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2500, chunk_overlap=0)

# extract, load, and make chunks
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Done preprocessing. Created 179 chunks of the original pdf resources/nke-10k-2023.pdf


In [28]:
# Take a look at content from a chunk
print(chunks[25].page_content)

Economic factors beyond our control, and changes in the global economic environment, including fluctuations in inflation and currency exchange rates, could result in lower revenues, higher costs and decreased margins and earnings.

A majority of our products are manufactured and sold outside of the United States, and we conduct purchase and sale transactions in various currencies, which creates exposure to the volatility of global economic conditions, including fluctuations in inflation and foreign currency exchange rates. Central banks may deploy various strategies to combat inflation, including increasing interest rates, which may impact our borrowing costs. Additionally, there has been, and may continue to be, volatility in currency exchange rates that impact the U.S. Dollar value relative to other international currencies. Our international revenues and expenses generally are derived from sales and operations in foreign currencies, and these revenues and expenses are affected by cu

### Text embedding generation with SentenceTransformers

How to choose an embedding model? 

To compare candidates, you can review the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) leaderboard, which evaluates text embedding models across a diverse range of tasks and datasets. It is particularly useful for comparing a model's performance with other models in the field.

SentenceTransformers is a Python framework for sentence, text, and image embeddings. 

`sentence-transformers/all-MiniLM-L6-v2` is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. It is designed for semantic search and clustering tasks.

This model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.

By default, input text longer than 256 word pieces is truncated.

In [29]:
from sentence_transformers import SentenceTransformer

# load model - may take a minute or two to download the first time
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


In [None]:
%%time

# create embeddings
chunk_embeddings = model.encode([chunk.page_content for chunk in chunks])
len(chunk_embeddings) == len(chunks)

### Set up some helper functions

The helper functions below encode a single query into a vector and display Redis search results as a table.
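
The float32-to-bytes conversion used for vectors in this notebook can be illustrated with the standard library alone. The sketch below assumes a 384-dimensional vector and uses the `array` module in place of NumPy; it relies on the `'f'` type code being a 4-byte float, which holds on common platforms.

```python
from array import array

DIM = 384  # dimensionality of the embedding model used in this notebook

vector = [0.0] * DIM
vector[:3] = [0.3, 0.4, -0.8]

# 'f' is a C float, which is 4 bytes (float32) on common platforms.
blob = array("f", vector).tobytes()
print(len(blob))  # number of bytes in the serialized vector

# Decode the bytes back into floats to confirm nothing was lost in transit.
restored = array("f")
restored.frombytes(blob)
print(list(restored[:3]))
```

The byte string is four bytes per dimension, so its length is a quick sanity check against the `DIM` declared in an index schema. With NumPy, `np.frombuffer(blob, dtype=np.float32)` performs the same decoding.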

In [None]:
import pandas as pd

# Encode a text input as a dense vector (an embedding) and return it as bytes.
#
# The input can be a single sentence, a paragraph, or even a whole document.
# The vector is cast to a NumPy array of type `np.float32` and then serialized
# with `tobytes()`, which is the binary form used for vectors stored in hashes
# and for vectors passed as query parameters.
def encode_one(text):
    return model.encode(text).astype(np.float32).tobytes()

def table_view(res):
    res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
    return res_df

### Define a schema and create an index
Below we use the Redis connection to create an index for vector search that contains a tag field, a text field, and a vector field.

In [None]:
from redis.commands.search.field import TagField, TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

index_name = "redispy"
key_prefix = f"doc:{index_name}"

def create_index(index_type: str = "FLAT"):       # Creates a FLAT index by default
    try:
        # check to see if index exists
        client.ft(index_name).info()
        print("Index already exists!")
    except redis.exceptions.ResponseError:  # raised when the index does not exist yet
        # define schema
        schema = (
            TagField("doc_id"),                    # Tag Field - synthetic ID
            TextField("content"),                  # Text Field
            VectorField("chunk_vector",            # Vector Field
                index_type, {                      # Vector Index Type: FLAT or HNSW
                    "TYPE": "FLOAT32",
                    "DIM": 384,                    # Number of Vector Dimensions
                    "DISTANCE_METRIC": "COSINE",   # Vector Search Distance Metric
                }
            ),
        )

        # index Definition
        definition = IndexDefinition(prefix=[key_prefix], index_type=IndexType.HASH)

        # create Index
        client.ft(index_name).create_index(fields=schema, definition=definition)

In [None]:
# Create the index
create_index()

In [None]:
# Check the info related to the newly created index
client.ft(index_name).info()

### Process and load data using Redis
Below we use a Redis pipeline (not a transaction) to batch send writes to Redis. This method helps with throughput significantly. The batch_size param can be customized and benchmarked on your hardware and with your data. We typically recommend starting small (100-200) and increasing as needed.
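
The mini-batch pattern described above boils down to grouping items into lists of at most `batch_size` and flushing once per group. A minimal sketch of that grouping, using a hypothetical `batches` helper and no Redis connection:

```python
def batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items (hypothetical helper)."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:               # final partial batch
        yield batch


sizes = [len(b) for b in batches(range(450), 200)]
print(sizes)
```

Each yielded list corresponds to one `pipe.execute()` call, so 450 writes with a batch size of 200 would need three round trips.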

In [None]:
# write each chunk to Redis as a hash, flushing the pipeline in mini batches

batch_size = 200

with client.pipeline(transaction=False) as pipe:
    for i, chunk in enumerate(chunks):
        data = {
            'doc_id': f"{i}",
            'content': chunk.page_content,
            # For HASH -- must convert embeddings to bytes
            'chunk_vector': np.array(chunk_embeddings[i]).astype(np.float32).tobytes()
        }
        pipe.hset(f"{key_prefix}:{i}", mapping=data)
        # execute in "mini batches" of batch_size commands
        if (i + 1) % batch_size == 0:
            res = pipe.execute()

    # cleanup final batch execution
    res = pipe.execute()

In [None]:
# check the data size in Redis
len(chunks) == client.dbsize()

In [None]:
client.hgetall(f"{key_prefix}:0")

### Query the database
Now we can use the Redis search index to perform vector similarity search operations.

The code below takes a user input, converts it to an embedding, and fetches the top 3 most semantically similar chunks from Redis.
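
Conceptually, a KNN search over a FLAT index compares the query vector with every stored vector and keeps the closest ones. The pure-Python sketch below mimics that with cosine distance (1 minus cosine similarity); the toy vectors and the `knn` helper are made up for illustration and say nothing about how Redis implements the search internally.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Return the k (key, distance) pairs closest to query, nearest first."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


toy_vectors = {           # made-up 3-dimensional vectors, for illustration only
    "doc:0": [1.0, 0.0, 0.0],
    "doc:1": [0.9, 0.1, 0.0],
    "doc:2": [0.0, 1.0, 0.0],
    "doc:3": [-1.0, 0.0, 0.0],
}
top = knn([1.0, 0.0, 0.0], toy_vectors, k=2)
print(top)
```

A smaller distance means a closer match, which is why the queries in this notebook sort by `vector_distance` in ascending order.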

The syntax for vector similarity [KNN queries](https://redis.io/docs/interact/search-and-query/advanced-concepts/vectors/#knn-search) is `*=>[<vector_similarity_query>]` for running the query on an entire vector field, or `<primary_filter_query>=>[<vector_similarity_query>]` for running a similarity query on the result of the primary filter query.

Valid example: `"(@title:Matrix @year:[2020 2022])=>[KNN 10 @v $B]"`
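
The two query shapes can be generated from the same template. The sketch below is a hypothetical string-building helper (`knn_query_string` is not part of redis-py) that wraps a primary filter in parentheses, as in the valid example, and falls back to `*` when no filter is given.

```python
def knn_query_string(k, vector_field, filter_expr="*", param="vector", alias="vector_distance"):
    """Build a '<filter>=>[KNN ...]' query string (hypothetical helper)."""
    if filter_expr != "*":
        filter_expr = f"({filter_expr})"   # parenthesize the primary filter
    return f"{filter_expr}=>[KNN {k} @{vector_field} ${param} as {alias}]"


print(knn_query_string(3, "chunk_vector"))
print(knn_query_string(2, "chunk_vector", filter_expr="@content:profit"))
```

The returned string is what would be passed to `Query(...)`, with the actual vector bytes supplied separately through the query parameters under the name given by `param`.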

In [None]:
# Grab user input
_input = "Nike profit margins and company performance"

top_k = 3

query = (
    Query(f"*=>[KNN {top_k} @chunk_vector $vector as vector_distance]")
     .sort_by("vector_distance")
     .return_fields("content", "vector_distance")
     .paging(0, top_k)
     .dialect(2)
)

query_params = {
    "vector": encode_one(_input)
}

res = client.ft(index_name).search(query, query_params)

table_view(res)

In [None]:
# Example of sorting by a field other than vector_distance
top_k = 4
query = (
    Query(f"*=>[KNN {top_k} @chunk_vector $vector as vector_distance]")
     .sort_by("doc_id")
     .return_fields("doc_id", "content", "vector_distance")
     .paging(0, top_k)
     .dialect(2)
)

query_params = {
    "vector": encode_one(_input)
}

res = client.ft(index_name).search(query, query_params)

table_view(res)

### Range Queries
Range queries return every document whose vector lies within a given distance (the radius) of the query vector, rather than a fixed number of nearest neighbors. This makes them useful when you care about a similarity threshold instead of a top-k cutoff.
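
As a sketch of the idea, the snippet below filters a handful of made-up 2-dimensional vectors by cosine distance. The `within_radius` helper is hypothetical and treats the radius as inclusive, which is an assumption of this sketch rather than a statement about Redis.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, so it ranges from 0 to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def within_radius(query, vectors, radius):
    """Return every (key, distance) pair whose distance is at most radius."""
    hits = [(key, cosine_distance(query, vec)) for key, vec in vectors.items()]
    return sorted((h for h in hits if h[1] <= radius), key=lambda h: h[1])


toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
hits = within_radius([1.0, 0.0], toy_vectors, radius=0.8)
print([key for key, _ in hits])
```

Unlike a KNN query, the number of results is not fixed: a tighter radius can return nothing, and a looser one can return every vector.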

In [None]:
query = (
    Query("@chunk_vector:[VECTOR_RANGE $radius $vector]=>{$YIELD_DISTANCE_AS: vector_distance}")
     .sort_by("vector_distance")
     .return_fields("content", "vector_distance")
     .dialect(2)
)

# Find all vectors within a cosine distance of 0.8 from the query vector
query_params = {
    "radius": 0.8,
    "vector": encode_one(_input)
}

res = client.ft(index_name).search(query, query_params)
table_view(res)

### Add filter statements
Redis queries can contain both vector search and traditional filters (numeric, tags, text, geo) in one single command.

In [None]:
# filter for docs that contain "profit" in the content field and do KNN vector search
query = (
    Query("@content:profit=>[KNN 2 @chunk_vector $vector as vector_distance]")
     .sort_by("vector_distance")
     .return_fields("content", "vector_distance")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vector": encode_one(_input)
}

res = client.ft(index_name).search(query, query_params)
table_view(res)

In [None]:
# let's clean up our index (passing True also deletes the indexed documents)
client.ft(index_name).dropindex(True)

### What about JSON Support?

Redis also allows you to store data in JSON objects. The JSON fields can contain metadata and vectors. Below is a simple example of indexing JSON data.

In [None]:
# schema
schema = (
    TextField("$.content",                     # Text Field (JSON path)
        as_name="content"                      # Text Field Alias -- required for JSON
    ),
    VectorField("$.chunk_vector",              # Vector Field (JSON path)
        "FLAT", {                              # Vector Index Type: FLAT or HNSW
            "TYPE": "FLOAT32",
            "DIM": 384,                        # Number of Vector Dimensions
            "DISTANCE_METRIC": "COSINE",       # Vector Search Distance Metric
        },
        as_name="chunk_vector"                 # Vector Field Alias -- required for JSON
    ),
)

# index Definition
definition = IndexDefinition(prefix=[key_prefix], index_type=IndexType.JSON) # select JSON here

# create Index
client.ft(index_name).create_index(fields=schema, definition=definition)

In [None]:
client.ft(index_name).info()

In [None]:
# Write JSON data to the index

batch_size = 200

with client.pipeline(transaction=False) as pipe:
    for i, chunk in enumerate(chunks):
        redis_key = f"{key_prefix}:{i}"
        data = {
            'content': chunk.page_content,
            'chunk_vector': chunk_embeddings[i].tolist() # notice that we don't need to convert JSON embeddings to bytes
        }
        pipe.json().set(redis_key, "$", data)
        # execute in mini batches of batch_size commands
        if (i + 1) % batch_size == 0:
            res = pipe.execute()

    # flush the final partial batch; mini batches matter most with larger datasets
    res = pipe.execute()

In [None]:
# Fetch the JSON doc
client.json().get(f"{key_prefix}:0", "$")

In [None]:
# And now you can perform the same kinds of queries
query = (
    Query("@content:profit=>[KNN 2 @chunk_vector $vector as vector_distance]")
     .sort_by("vector_distance")
     .return_fields("content", "vector_distance")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vector": encode_one(_input)
}
res = client.ft(index_name).search(query, query_params)

table_view(res)

## Cleanup
Clean up the index and data.

In [None]:
client.ft(index_name).dropindex(True)

## What's Next?

Now that you have the basics down with the baseline Redis Python client, continue with the following notebooks:

| Notebook | Description | Documentation |
|----------|-------------|---------------|
| [**redisvl-02**](https://colab.research.google.com/github/redis-developer/financial-vss/blob/main/redisvl-02.ipynb) | Dive deeper using a dedicated VSS client library to build a RAG application from scratch including some advanced techniques using an OpenAI LLM. | [View Docs](https://redisvl.com) |
| [**langchain-03**](https://colab.research.google.com/github/redis-developer/financial-vss/blob/main/langchain-03.ipynb) | Wrap up with an integrated approach via LangChain and see how to build complex LLM chains. | [View Docs](https://python.langchain.com/docs/get_started/introduction) |
