# Vector Search with IRIS SQL
This tutorial covers how to use IRIS as a vector database. 

For this tutorial, we will use a dataset of 2.2k online reviews of scotch ([dataset from Kaggle](https://www.kaggle.com/datasets/koki25ando/22000-scotch-whisky-reviews)). With the vector search functionality in IRIS, we can use an embedding model to run semantic search over the text of those reviews. In addition, we can apply filters on columns with structured data. For example, we will be able to search for whiskeys that are priced under $100 and are 'earthy, smooth, and easy to drink'. Let's find our perfect whiskey!

The first step is to import the libraries we need and establish a connection to InterSystems IRIS.

In [11]:
import os, pandas as pd
from sentence_transformers import SentenceTransformer
from sqlalchemy import create_engine, text

username = 'superuser'
password = 'SYS'
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '51787' 
namespace = 'USER'

CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

engine = create_engine(CONNECTION_STRING)
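
The credentials above are hardcoded for convenience. As a minimal sketch of an alternative, the connection string could be assembled from environment variables, with the username and password URL-escaped so that characters such as `@` or `/` do not break the URL. The helper name `build_connection_string` and the variables `IRIS_USERNAME`, `IRIS_PASSWORD`, `IRIS_PORT` and `IRIS_NAMESPACE` are assumptions made for this example; only `IRIS_HOSTNAME` is used by the notebook itself.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(username, password, hostname, port, namespace):
    """Assemble an iris:// URL, escaping special characters in the credentials.

    Hypothetical helper for illustration; not part of any IRIS library.
    """
    return (
        f"iris://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}:{port}/{namespace}"
    )


# Environment variable names other than IRIS_HOSTNAME are assumptions.
connection_string = build_connection_string(
    username=os.getenv("IRIS_USERNAME", "superuser"),
    password=os.getenv("IRIS_PASSWORD", "SYS"),
    hostname=os.getenv("IRIS_HOSTNAME", "localhost"),
    port=os.getenv("IRIS_PORT", "51787"),
    namespace=os.getenv("IRIS_NAMESPACE", "USER"),
)

# A password containing '@' and '/' is escaped rather than misread as URL syntax.
print(build_connection_string("superuser", "p@ss/word", "localhost", "51787", "USER"))
# iris://superuser:p%40ss%2Fword@localhost:51787/USER
```

The resulting string can be passed to `create_engine()` in the same way as `CONNECTION_STRING`.
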

## Exploring the dataset

Let's take a look at the data in our CSV file with whiskey reviews.

In [None]:
# Load the CSV file
df = pd.read_csv('c:\\AIWebinar\\scotch_review.csv')
df.head()

Now we'll reorganize the data a little with pandas functions to make it more practical to store in a table.

In [None]:
# Clean data
# Remove the 'currency' column
df.drop(['currency'], axis=1, inplace=True)

# Drop the first column
df.drop(columns=df.columns[0], inplace=True)

# Remove rows without a price
df.dropna(subset=['price'], inplace=True)

# Convert 'price' to numbers and drop rows where the conversion fails
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df.dropna(subset=['price'], inplace=True)

# Replace NaN values in other columns with an empty string
df.fillna('', inplace=True)

df.head()

## Creating the table in IRIS SQL

InterSystems IRIS now supports vectors as a datatype in tables! Here, we create a table with a few different columns. The last column, `description_vector` of type `VECTOR(FLOAT, 384)`, stores the vectors generated by passing the `description` of each review through an embedding model. The `FLOAT` option is new in 2024.3, and 384 is the number of dimensions produced by the embedding model we use below.

In [14]:
with engine.connect() as conn:
    with conn.begin():
        sql = """
                CREATE TABLE IF NOT EXISTS scotch_reviews (
                    name VARCHAR(255),
                    category VARCHAR(255),
                    review_point INT,
                    price DOUBLE,
                    description VARCHAR(2000),
                    description_vector VECTOR(FLOAT, 384)
                )
                """
        result = conn.execute(text(sql))

## Creating the embeddings

Next, we'll create the embeddings for the `description` column. In IRIS 2024.3, you can leave this work to IRIS by using the new [`EMBEDDING` datatype](https://docs.intersystems.com/iris20243/csp/docbook/DocBook.UI.Page.cls?KEY=GSQL_vecsearch#GSQL_vecsearch_insembed), but for now we'll go with classic Pythonic ways of creating them, based on a common Sentence Transformer model.

In [15]:
# Load a pre-trained sentence transformer model. This model's output vectors are of size 384
model = SentenceTransformer('all-MiniLM-L6-v2') 

In [None]:
# Generate embeddings for all descriptions at once.
# Batch processing before inserting into the table makes it faster, but this step may still take a minute
embeddings = model.encode(df['description'].tolist(), normalize_embeddings=True)

# Add the embeddings to the DataFrame
df['description_vector'] = embeddings.tolist()

df.head()
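
The `normalize_embeddings=True` argument matters for the queries later in this tutorial: it scales every embedding to unit length, and for unit-length vectors the dot product equals the cosine similarity. The sketch below checks this with plain Python on two made-up 3-dimensional vectors that stand in for real 384-dimensional embeddings; the helper names are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity, computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors for illustration, not real embeddings
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization the two measures agree (up to floating-point rounding)
print(math.isclose(dot(a_unit, b_unit), cosine(a, b)))  # True
```
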

Now we'll load the data into our table. Note the `str()` call: we pass each vector as a comma-separated list of values in string format, because the DB-API driver standard has no specific vector datatype. `TO_VECTOR()` then converts that string into a vector on the server side.

In [17]:
with engine.connect() as conn:
    with conn.begin():
        for index, row in df.iterrows():
            sql = text("""
                INSERT INTO scotch_reviews 
                (name, category, review_point, price, description, description_vector) 
                VALUES (:name, :category, :review_point, :price, :description, TO_VECTOR(:description_vector))
            """)
            conn.execute(sql, {
                'name': row['name'], 
                'category': row['category'], 
                'review_point': row['review.point'], 
                'price': row['price'], 
                'description': row['description'], 
                'description_vector': str(row['description_vector'])
            })
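
To make the string format concrete, here is a small sketch of what gets sent as the `:description_vector` parameter. `str()` on a Python list yields a bracketed, comma-separated string, which is what the insert above relies on. The helper `to_vector_param` is a hypothetical alternative that produces the same values without brackets or spaces; it is an assumption of this example, not part of any driver API.

```python
def to_vector_param(values):
    """Format a sequence of numbers as a plain comma-separated string.

    Hypothetical helper for illustration only.
    """
    return ",".join(repr(float(v)) for v in values)


# A made-up 3-dimensional vector standing in for a 384-dimensional embedding
sample = [0.25, -0.5, 1.0]

print(str(sample))              # [0.25, -0.5, 1.0]
print(to_vector_param(sample))  # 0.25,-0.5,1.0
```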


## Running a few queries

Let's look for a scotch that costs less than $100 and has an earthy and creamy taste.

In [18]:
description_search = "earthy and creamy taste"
search_vector = model.encode(description_search, normalize_embeddings=True).tolist() # Convert search phrase into a vector

In [None]:
with engine.connect() as conn:
    with conn.begin():
        sql = text("""
            SELECT TOP 3 * FROM scotch_reviews 
            WHERE price < 100 
            ORDER BY VECTOR_DOT_PRODUCT(description_vector, TO_VECTOR(:search_vector)) DESC
        """)

        results = conn.execute(sql, {'search_vector': str(search_vector)}).fetchall()

print(results)

Let's print that result a little more nicely!

In [None]:
results_df = pd.DataFrame(results, columns=df.columns).iloc[:, :-1] # Remove vector
pd.set_option('display.max_colwidth', None)  # Easier to read description
results_df.head()
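
Conceptually, the query in this section combines a structured filter with a similarity ranking: keep the rows with `price < 100`, score each remaining row by its dot product with the search vector, and return the top few. The sketch below mimics that logic in plain Python on made-up rows with 2-dimensional unit vectors. It only illustrates the semantics; the plan IRIS actually executes may differ, and the names `rows` and `top_k` are assumptions of this example.

```python
import heapq

# Made-up rows standing in for scotch_reviews: (name, price, unit-length vector)
rows = [
    ("Whisky A", 60.0, [1.0, 0.0]),
    ("Whisky B", 150.0, [0.8, 0.6]),
    ("Whisky C", 85.0, [0.6, 0.8]),
    ("Whisky D", 40.0, [0.0, 1.0]),
]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(rows, query, max_price, k):
    """Filter on price, then keep the k rows with the highest dot product."""
    candidates = (row for row in rows if row[1] < max_price)
    return heapq.nlargest(k, candidates, key=lambda row: dot(row[2], query))


query = [0.8, 0.6]
for name, price, _ in top_k(rows, query, max_price=100, k=2):
    print(name, price)
# Whisky C 85.0
# Whisky A 60.0
```

"Whisky B" matches the query vector exactly but is excluded by the price filter, which is the kind of combined search the SQL query expresses in a single statement.
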

## Indexing vector data

The latest version of IRIS 2025.1, released through the [Early Access Program](https://www.intersystems.com/early-access-program), includes bug fixes and performance enhancements, as well as a new disk-based Approximate Nearest Neighbors (ANN) index that speeds up vector search over large collections of vectors (typically more than 100K). See [the docs](https://docs.intersystems.com/iris20251/csp/docbook/DocBook.UI.Page.cls?KEY=GSQL_vecsearch#GSQL_vecsearch_index) for more information on how to define and use the index.

```SQL
CREATE INDEX HNSWIndex ON TABLE scotch_reviews (description_vector) AS HNSW(M=80, Distance='DotProduct');
```

The index will automatically get used if you issue a query that uses a `TOP` clause and an `ORDER BY` to sort by the distance function for which the index was created. You can verify its use in the query plan, by using the `EXPLAIN` command or checking the plan through the System Management Portal UI.

```SQL
SELECT TOP 10 * FROM scotch_reviews ORDER BY VECTOR_DOT_PRODUCT(description_vector, TO_VECTOR(:search_vector)) DESC;
```

Since this notebook works with a dataset much smaller than 100K rows, the index won't bring a measurable performance benefit here; the statements above are provided as an example you can adapt.

NOTE: This feature is currently targeted at 2025.1. Join the [Early Access Program](https://live.evaluation.iscinternal.com/download/adminearlyaccess.csp?earlyAccessProgram=Vector_Search) if you'd like to work with a preview kit.