### Install dependencies
Install the Python dependencies required for this application. Using a Python virtual environment is recommended so that these packages don't conflict with other projects.

In [None]:
%pip install "openai<1" num2words matplotlib plotly scipy scikit-learn pandas tiktoken redis

### Import libraries and set up Azure OpenAI service connection
Fill in the Azure OpenAI information below to connect to your Azure OpenAI model. For simplicity, this example stores these values in application variables. Outside of tutorials, it's strongly recommended to store them in environment variables or in a secrets manager such as Azure Key Vault.
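
As a minimal sketch of the environment-variable approach, the helper below reads a setting from the environment and fails loudly if it is missing. The helper name `require_env` and the variable names `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` are illustrative assumptions, not names this notebook requires.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable names; set them in your shell before starting Jupyter, e.g.
#   export AZURE_OPENAI_API_KEY="<azure-openai-key>"
#   export AZURE_OPENAI_ENDPOINT="<azure-openai-endpoint>"
# Then, instead of hard-coding the values:
#   API_KEY = require_env("AZURE_OPENAI_API_KEY")
#   RESOURCE_ENDPOINT = require_env("AZURE_OPENAI_ENDPOINT")
```

Raising on a missing variable surfaces a configuration problem right away, rather than later as an authentication error from the service.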

Note that there are differences between the `OpenAI` and `Azure OpenAI` endpoints. This example uses the configuration for `Azure OpenAI`. See [How to switch between OpenAI and Azure OpenAI endpoints with Python](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/switching-endpoints) for more details.

In [None]:
import openai
import re
import requests
import sys
from num2words import num2words
import os
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding
import tiktoken
from typing import List

API_KEY = "<azure-openai-key>"
RESOURCE_ENDPOINT = "<azure-openai-endpoint>" # e.g. https://openaiexample.openai.azure.com/
DEPLOYMENT_NAME = "my-embedding-model" # this is the name you selected for your deployment, not the name of the OpenAI model. 

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2023-05-15"

# rstrip("/") lets the endpoint be given with or without a trailing slash
url = openai.api_base.rstrip("/") + "/openai/models/" + DEPLOYMENT_NAME + "?api-version=2023-05-15"

r = requests.get(url, headers={"api-key": API_KEY})

print(r.text)

### Import dataset

This example uses the [Wikipedia Movie Plots](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots) dataset from Kaggle. Download the file and place it in the same directory as this Jupyter notebook.

In [None]:
df=pd.read_csv(os.path.join(os.getcwd(),'wiki_movie_plots_deduped.csv'))
df

Process the dataset to rename the columns whose titles contain spaces or special characters, and filter it to reduce its size. This isn't required, but it shortens the time needed to generate embeddings and load the index into Redis. Feel free to experiment with the filters or add your own.

In [None]:
df.insert(0, 'id', range(0, len(df)))
df['year'] = df['Release Year'].astype(int)
df['origin'] = df['Origin/Ethnicity'].astype(str)
del df['Release Year']
del df['Origin/Ethnicity']
df = df[df.year > 1970] # only movies made after 1970
df = df[df.origin.isin(['American','British','Canadian'])] # only movies from English-speaking cinema
df

Normalize whitespace and stray punctuation in the `Plot` column so the text is cleaner before generating embeddings.

In [None]:
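pd.options.mode.chained_assignment = None  # silence SettingWithCopyWarning on the filtered dataframe

# s is the input text; sep_token is unused and kept only for compatibility
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,", "", s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.strip()
    
    return s

df['Plot'] = df['Plot'].apply(normalize_text)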
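
To make the effect of the whitespace step concrete, here is a small standalone sketch of the first operation in `normalize_text`: collapsing every run of whitespace (spaces, tabs, newlines) into a single space. The helper name `collapse_whitespace` and the sample string are made up for illustration.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


sample = "  A pilot \n\n restores   an old\tbiplane.  "
print(collapse_whitespace(sample))  # A pilot restores an old biplane.
```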

Calculate the number of tokens required to generate the embeddings for this dataset. You may want to filter the dataset more stringently in order to limit the tokens required. 

In [None]:
tokenizer = tiktoken.get_encoding("cl100k_base")
df['n_tokens'] = df["Plot"].apply(lambda x: len(tokenizer.encode(x)))
df = df[df.n_tokens<8192]
print('Number of movies: ' + str(len(df))) # print number of movies remaining in dataset
print('Number of tokens required: ' + str(df['n_tokens'].sum())) # print number of tokens

### Generate embeddings

This step calls the Azure OpenAI service to generate the embeddings and adds them to the dataframe in a column named `embeddings`.

In [None]:
df['embeddings'] = df["Plot"].apply(lambda x : get_embedding(x, engine = DEPLOYMENT_NAME))
df

### Set up Redis
This example uses [Azure Cache for Redis](https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-overview), a fully managed Redis service on Azure. The [Enterprise tier](https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/quickstart-create-redis-enterprise) of Azure Cache for Redis features the [RediSearch](https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-redis-modules#redisearch) module, which includes vector search capability. For simplicity, this example stores the Redis connection values in application variables. Outside of tutorials, it's strongly recommended to store them in environment variables or in a secrets manager such as Azure Key Vault.

In [None]:
import redis
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField
)

REDIS_HOST =  "<redis-endpoint>" # e.g. redisdemo.southcentralus.redisenterprise.cache.azure.net. If you're copying and pasting from the Azure portal, you do not need the ":10000" at the end of the hostname. 
REDIS_PORT = 10000  # default for Azure Cache for Redis Enterprise
REDIS_PASSWORD = "<redis-access-key>"

# Connect to Redis
redis_client = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    password=REDIS_PASSWORD
)
redis_client.ping()

Define the fields for the vector index of the dataset.

In [None]:
# Constants
VECTOR_DIM = 1536                               # length of the vectors for the text-embedding-ada-002 (Version 2) model
VECTOR_NUMBER = len(df)                         # initial number of vectors
INDEX_NAME = "embeddings-index"                 # name of the search index
PREFIX = "movie"                                # prefix for the document keys
DISTANCE_METRIC = "COSINE"                      # distance metric for the vectors (ex. COSINE, IP, L2)

# Define RediSearch fields for each of the columns in the dataset
title = TextField(name="Title")
director = TextField(name="Director")
cast = TextField(name="Cast")
genre = TextField(name="Genre")
embeddings_vectors = VectorField("embeddings",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
    }
)
fields = [title, director, cast, genre, embeddings_vectors]

# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    # print(redis_client.ft(INDEX_NAME).info())
    # print("Index already exists")
except redis.exceptions.ResponseError:
    # info() raises ResponseError when the index does not exist, so create it
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )
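
The `TYPE` and `DIM` settings determine how many bytes each stored vector occupies: a `FLOAT32` value takes 4 bytes, so a 1536-dimensional vector is 6,144 bytes. The sketch below checks that arithmetic using only the standard library. `expected_blob_size` is a hypothetical helper, and the `array` type code `"f"` is assumed to be a 4-byte float, which holds on common platforms.

```python
from array import array

VECTOR_DIM = 1536  # same dimension as in the index definition


def expected_blob_size(dim: int, bytes_per_value: int = 4) -> int:
    """Number of bytes a FLOAT32 vector of length `dim` occupies."""
    return dim * bytes_per_value


# Pack a dummy vector as 32-bit floats, similar in spirit to
# np.array(values, dtype=np.float32).tobytes()
dummy_vector = array("f", [0.0] * VECTOR_DIM)
blob = dummy_vector.tobytes()

print(expected_blob_size(VECTOR_DIM))  # 6144
print(len(blob) == expected_blob_size(VECTOR_DIM))
```

Comparing `len(blob)` with the expected size before writing to Redis is a cheap way to catch embeddings produced by a model with a different dimension.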

Load the index documents into Redis. This can take 10+ minutes, depending on the dataset size used.

In [None]:
def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    records = documents.to_dict("records")
    for doc in records:
        key = f"{prefix}:{str(doc['id'])}"

        # convert the plot embedding (a list of floats) to a float32 byte string
        plot_embeddings = np.array(doc["embeddings"], dtype=np.float32).tobytes()

        # replace the list of floats with the byte string
        doc["embeddings"] = plot_embeddings

        client.hset(key, mapping = doc)

index_documents(redis_client, PREFIX, df)
print(f"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}")
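
Because the loop above sends one `HSET` per movie, each document costs a network round trip. One possible way to reduce load time, offered here as an assumption rather than as part of the original example, is to group writes with a redis-py pipeline. `index_documents_batched`, `batch_size`, and the `_FakeRedis`/`_FakePipeline` stand-ins are hypothetical; the stand-ins exist only so the sketch runs without a server.

```python
from typing import Dict, Iterable, List


def index_documents_batched(client, prefix: str, records: Iterable[dict],
                            batch_size: int = 500) -> int:
    """Write records as hashes, flushing the pipeline every `batch_size` docs."""
    pipe = client.pipeline(transaction=False)
    pending = 0
    total = 0
    for doc in records:
        pipe.hset(f"{prefix}:{doc['id']}", mapping=doc)
        pending += 1
        total += 1
        if pending >= batch_size:
            pipe.execute()
            pending = 0
    if pending:
        pipe.execute()  # flush the final partial batch
    return total


class _FakePipeline:
    """In-memory stand-in that mimics only pipeline.hset() and pipeline.execute()."""

    def __init__(self, store: Dict[str, dict], flushes: List[int]):
        self._store, self._flushes, self._queue = store, flushes, []

    def hset(self, key, mapping):
        self._queue.append((key, dict(mapping)))

    def execute(self):
        self._flushes.append(len(self._queue))
        self._store.update(self._queue)
        self._queue = []


class _FakeRedis:
    """In-memory stand-in that mimics only Redis.pipeline()."""

    def __init__(self):
        self.store, self.flushes = {}, []

    def pipeline(self, transaction=False):
        return _FakePipeline(self.store, self.flushes)


fake = _FakeRedis()
docs = [{"id": i, "Title": f"Movie {i}"} for i in range(5)]
count = index_documents_batched(fake, "movie", docs, batch_size=2)
print(count, fake.flushes)  # 5 [2, 2, 1]
```

In the notebook you would pass `redis_client` in place of the stand-in and still convert each `embeddings` value to float32 bytes first, as `index_documents` does.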

### Run search queries

First, we set up the search function, defining the index and vector field we're using.

In [None]:
def search_redis(
    redis_client: redis.Redis,
    user_query: str,
    index_name: str = "embeddings-index",
    vector_field: str = "embeddings",
    return_fields: list = ["Title", "Director", "Cast", "Genre", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
) -> List[dict]:

    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(deployment_id=DEPLOYMENT_NAME, input=user_query,
                                            model=DEPLOYMENT_NAME,
                                            )["data"][0]['embedding']

    # Prepare the Query
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
         .return_fields(*return_fields)
         .sort_by("vector_score")
         .paging(0, k)
         .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}

    # perform vector search
    results = redis_client.ft(index_name).search(query, params_dict)
    for i, article in enumerate(results.docs):
        score = 1 - float(article.vector_score)
        print(f"{i}.{article.Title} (Score: {round(score ,3) })")
    return results.docs
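
The `score = 1 - float(article.vector_score)` line converts the distance returned under the `COSINE` metric into a similarity, so higher printed scores mean closer matches. The following standalone sketch illustrates that relationship on toy vectors, assuming the distance is defined as 1 minus the cosine similarity. `cosine_similarity` and the vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

for name, vec in [("same direction", same_direction), ("orthogonal", orthogonal)]:
    similarity = cosine_similarity(query, vec)
    distance = 1 - similarity   # what a cosine-distance metric would report
    score = 1 - distance        # what search_redis prints
    print(f"{name}: distance={distance:.1f}, score={score:.1f}")
# same direction: distance=0.0, score=1.0
# orthogonal: distance=1.0, score=0.0
```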

Then, we can use the function we just defined to search for any plain-text query:

In [None]:
results = search_redis(redis_client, "A movie with old airplanes in it ", k=10)

0.The Flying Machine (Score: 0.861)
1.Airplane! (Score: 0.857)
2.Invasion from Inner Earth (Score: 0.843)
3.Every Time We Say Goodbye (Score: 0.842)
4.Wings of Courage (Score: 0.839)
5.A Dark Reflection (Score: 0.838)
6.The Great Waldo Pepper (Score: 0.837)
7.Ace Eli and Rodger of the Skies (Score: 0.831)
8.Skyjacked (Score: 0.829)
9.The Pilot (Score: 0.829)


### Run hybrid queries

You can also run hybrid queries, that is, queries that use both vector search and filters based on other parameters in the dataset. In this case, we filter our query results to only movies tagged with the `drama` genre.

In [None]:
def create_hybrid_field(field_name: str, value: str) -> str:
    return f'@{field_name}:"{value}"'

# search the content vector for movies with a specific genre
results = search_redis(redis_client,
                       "Basketball comeback story with NBA players",
                       vector_field="embeddings",
                       k=10,
                       hybrid_fields=create_hybrid_field("Genre", "drama")
                       )

0.Inside Moves (Score: 0.826)
1.That Championship Season (Score: 0.823)
2.Hurricane Season (Score: 0.819)
3.Coach Carter (Score: 0.818)
4.Blue Chips (Score: 0.816)
5.Sunset Park (Score: 0.814)
6.Home Run (Score: 0.814)
7.One on One (Score: 0.813)
8.That Championship Season (Score: 0.813)
9.He Got Game (Score: 0.811)
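
To see what the hybrid query looks like before it reaches Redis, the sketch below rebuilds the query string the same way `search_redis` does. `build_knn_query` and `combine_filters` are hypothetical helpers. The second one relies on the RediSearch query syntax in which space-separated clauses inside parentheses must all match, so treat the multi-filter form as an assumption to verify against your Redis version. The director value is a placeholder.

```python
from typing import List


def create_hybrid_field(field_name: str, value: str) -> str:
    return f'@{field_name}:"{value}"'


def combine_filters(filters: List[str]) -> str:
    """Join filter clauses so that all of them must match (hypothetical helper)."""
    return "(" + " ".join(filters) + ")"


def build_knn_query(hybrid_fields: str = "*", k: int = 20,
                    vector_field: str = "embeddings") -> str:
    """Mirror the f-string used for base_query in search_redis."""
    return f"{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]"


print(build_knn_query(create_hybrid_field("Genre", "drama"), k=10))
# @Genre:"drama"=>[KNN 10 @embeddings $vector AS vector_score]

both = combine_filters([
    create_hybrid_field("Genre", "drama"),
    create_hybrid_field("Director", "<director-name>"),
])
print(build_knn_query(both, k=10))
# (@Genre:"drama" @Director:"<director-name>")=>[KNN 10 @embeddings $vector AS vector_score]
```

The resulting string can be passed to `search_redis` through its `hybrid_fields` argument, since that function interpolates the filter in front of the `=>[KNN ...]` clause.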
