# Embeddings and Similarity Search with Cortex AISQL

This notebook demonstrates:
- **AI_EMBED**: Generate vector embeddings for text and images
- **AI_SIMILARITY**: Score how similar two inputs (text or images) are, based on their embeddings
- Semantic search and clustering use cases


In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
from snowflake.snowpark import Session
from IPython.display import display, Markdown, HTML

# Try to get active session (for Snowflake Notebooks)
# Otherwise, connect using ~/.snowflake/connections.toml
try:
    from snowflake.snowpark.context import get_active_session
    session = get_active_session()
    print("✅ Connected using Snowflake Notebooks session")
except Exception:
    # Fallback for local runs: connect using ~/.snowflake/connections.toml
    import toml  # third-party package (pip install toml)
    from pathlib import Path
    
    toml_path = Path.home() / ".snowflake" / "connections.toml"
    if toml_path.exists():
        connections = toml.load(toml_path)
        connection_name = "default"
        if connection_name in connections:
            conn_params = connections[connection_name]
            session = Session.builder.configs(conn_params).create()
            print(f"✅ Connected using connection profile: {connection_name}")
        else:
            raise ValueError(f"❌ Connection '{connection_name}' not found in {toml_path}")
    else:
        raise FileNotFoundError(f"❌ Connection file not found at {toml_path}")

session.sql("USE DATABASE AISQL_DB").collect()
session.sql("USE SCHEMA AISQL_SCHEMA").collect()
session.sql("USE WAREHOUSE AISQL_WH").collect()


✅ Connected using connection profile: vinodshiv


[Row(status='Statement executed successfully.')]

## 1. AI_EMBED: Generate Embeddings

Create a vector embedding for each email and store it in a table for later use in semantic search.


In [7]:
# Create embeddings for emails
# AI_EMBED returns a VECTOR type (not an array)
sql = """
CREATE OR REPLACE TABLE email_embeddings AS
SELECT 
    ticket_id,
    user_id,
    content,
    AI_EMBED('snowflake-arctic-embed-m-v1.5', content) as embedding
FROM emails
LIMIT 100  -- no ORDER BY, so this is an arbitrary 100 rows
"""
session.sql(sql).collect()

# Preview the embedded rows (the embedding column itself is omitted for readability)
sql_view = """
SELECT 
    ticket_id,
    SUBSTR(content, 1, 100) as content_preview
FROM email_embeddings
LIMIT 10
"""
df = session.sql(sql_view).to_pandas()
display(Markdown("### Email Embeddings Created"))
display(df)
display(Markdown("**Model:** `snowflake-arctic-embed-m-v1.5` (768-dimensional embeddings)"))


### Email Embeddings Created

| | TICKET_ID | CONTENT_PREVIEW |
|---|---|---|
| 0 | 1830 | I'd appreciate a refund for the unused Saturda... |
| 1 | 1462 | Also, quick heads up - there seems to be a gli... |
| 2 | 177 | While I have you, I wanted to share some feedb... |
| 3 | 632 | Also, the new paperless ticket system is gener... |
| 4 | 1813 | Also, I purchased tickets for the Summer Elect... |
| 5 | 1320 | Additionally, I wanted to provide some feedbac... |
| 6 | 466 | I'm also wondering if you could clarify your r... |
| 7 | 603 | One suggestion - it would be amazing if you co... |
| 8 | 1033 | I urgently need assistance with transferring m... |
| 9 | 681 | Also, I purchased VIP passes for the summer mu... |


**Model:** `snowflake-arctic-embed-m-v1.5` (768-dimensional embeddings)
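
To make concrete what a similarity score over embeddings measures, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are made-up toy values (real `snowflake-arctic-embed-m-v1.5` vectors have 768 dimensions), and `cosine_similarity` is an illustrative helper, not a Snowflake API.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings" (hypothetical values, 3 dimensions instead of 768)
refund_request = [0.9, 0.1, 0.0]
refund_followup = [0.8, 0.2, 0.1]
sound_complaint = [0.0, 0.2, 0.9]

print(cosine_similarity(refund_request, refund_followup))  # close to 1
print(cosine_similarity(refund_request, sound_complaint))  # close to 0
```

Vectors that point in nearly the same direction score near 1, which is the intuition behind ranking tickets by similarity in the following sections.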

## 2. AI_SIMILARITY: Find Similar Tickets

Score pairs of support tickets with AI_SIMILARITY and surface the most closely related ones.


In [6]:
# Find similar tickets with AI_SIMILARITY, which takes the raw text of both tickets
# and embeds it internally (the stored embedding column is not used here)
sql = """
WITH ticket_pairs AS (
    SELECT 
        a.ticket_id as ticket_a,
        b.ticket_id as ticket_b,
        SUBSTR(a.content, 1, 80) as content_a,
        SUBSTR(b.content, 1, 80) as content_b,
        AI_SIMILARITY(a.content, b.content) as similarity_score
    FROM email_embeddings a
    CROSS JOIN email_embeddings b
    WHERE a.ticket_id < b.ticket_id
    LIMIT 1000  -- Arbitrary 1,000 pairs (no ORDER BY), taken before ranking, to bound cost
)
SELECT * FROM ticket_pairs
ORDER BY similarity_score DESC
LIMIT 20
"""

df_similar = session.sql(sql).to_pandas()
display(Markdown("### Most Similar Ticket Pairs"))
display(df_similar)

# Histogram of the scores in df_similar, i.e. only the top 20 pairs returned above
display(Markdown("### Similarity Score Distribution"))
fig = px.histogram(df_similar, 
                   x='SIMILARITY_SCORE', 
                   nbins=20,
                   title='Distribution of Top-20 Similarity Scores',
                   labels={'SIMILARITY_SCORE': 'Similarity Score', 'count': 'Count'})
fig.update_layout(height=400)
fig.show()


### Most Similar Ticket Pairs

| | TICKET_A | TICKET_B | CONTENT_A | CONTENT_B | SIMILARITY_SCORE |
|---|---|---|---|---|---|
| 0 | 88 | 1088 | There was also a small issue with the sound sy... | During last week's show, we encountered a mino... | 0.909596 |
| 1 | 177 | 1177 | While I have you, I wanted to share some feedb... | I've been enjoying the updates to your mobile ... | 0.884412 |
| 2 | 146 | 1146 | I purchased 2 VIP tickets ($189 each) under bo... | I'm reaching out about my VIP ticket purchase ... | 0.87108 |
| 3 | 1416 | 1813 | On a different note, I accidentally purchased ... | Also, I purchased tickets for the Summer Elect... | 0.817165 |
| 4 | 35 | 1035 | The ticket reference number is TIX-2023-89445,... | Here is the rewritten version:\n\nI am writing... | 0.806757 |
| 5 | 743 | 890 | I'm having a really frustrating experience wit... | I attended the Summer Vibes Festival last week... | 0.780119 |
| 6 | 603 | 1320 | One suggestion - it would be amazing if you co... | Additionally, I wanted to provide some feedbac... | 0.777745 |
| 7 | 610 | 1182 | I attended the electronic music festival last ... | I recently attended the electronic music festi... | 0.77511 |
| 8 | 92 | 632 | While I have you, I just wanted to say that th... | Also, the new paperless ticket system is gener... | 0.765479 |
| 9 | 1182 | 1310 | I recently attended the electronic music festi... | I attended the electronic music festival at Mo... | 0.765154 |


### Similarity Score Distribution

[Figure: histogram of the similarity scores listed above]
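
The pairwise query in this section re-embeds the text of both tickets for every pair, and its inner `LIMIT 1000` scores only an arbitrary subset of the candidate pairs. A possible alternative, sketched below, compares the vectors already stored in `email_embeddings.embedding` with Snowflake's `VECTOR_COSINE_SIMILARITY` function and ranks every pair. The names `count_pairs` and `PAIRWISE_SQL` are illustrative, the query assumes the `session` from the setup cell, and its scores are cosine similarities of the stored vectors, which are not guaranteed to match `AI_SIMILARITY` output.

```python
from math import comb


def count_pairs(n_rows: int) -> int:
    """Number of unordered pairs (a.ticket_id < b.ticket_id) among n_rows tickets."""
    return comb(n_rows, 2)


# With 100 embedded emails there are 4,950 candidate pairs,
# so an inner LIMIT 1000 covers roughly 20% of them.
print(count_pairs(100))  # 4950

# Sketch: rank every pair using the stored VECTOR column (no re-embedding)
PAIRWISE_SQL = """
SELECT
    a.ticket_id AS ticket_a,
    b.ticket_id AS ticket_b,
    VECTOR_COSINE_SIMILARITY(a.embedding, b.embedding) AS similarity_score
FROM email_embeddings a
JOIN email_embeddings b
    ON a.ticket_id < b.ticket_id
ORDER BY similarity_score DESC
LIMIT 20
"""

# With an active Snowpark session (as created in the setup cell):
# df_pairs = session.sql(PAIRWISE_SQL).to_pandas()
```

Because the ordering happens over all pairs rather than a pre-limited sample, the top 20 here would be the most similar pairs in the whole table, at the cost of a full self-join.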

## 3. Semantic Search

Rank tickets by their semantic similarity to a free-text query.


In [8]:
# Semantic search with AI_SIMILARITY: each email's text is compared with the query text
# (AI_SIMILARITY embeds both inputs internally on every run)
search_query = "I need a refund for my ticket"

display(Markdown(f"### Semantic Search: '{search_query}'"))

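# Pass the query as a bind parameter (?) instead of interpolating it into the SQL,
# so a query containing an apostrophe (e.g. "I'd like a refund") cannot break the statement
sql = """
SELECT 
    ticket_id,
    SUBSTR(content, 1, 150) as content_preview,
    AI_SIMILARITY(content, ?) as relevance_score
FROM email_embeddings
ORDER BY relevance_score DESC
LIMIT 10
"""

df_search = session.sql(sql, params=[search_query]).to_pandas()
display(Markdown("**Top 10 Results:**"))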
display(df_search)


### Semantic Search: 'I need a refund for my ticket'

**Top 10 Results:**

| | TICKET_ID | CONTENT_PREVIEW | RELEVANCE_SCORE |
|---|---|---|---|
| 0 | 1416 | On a different note, I accidentally purchased ... | 0.637683 |
| 1 | 1830 | I'd appreciate a refund for the unused Saturda... | 0.607717 |
| 2 | 1033 | I urgently need assistance with transferring m... | 0.606089 |
| 3 | 325 | Lastly, I attempted to get a refund for the "U... | 0.593162 |
| 4 | 683 | I attended your indie rock festival last weeke... | 0.580258 |
| 5 | 1228 | I'd greatly appreciate it if you could assist ... | 0.57122 |
| 6 | 1813 | Also, I purchased tickets for the Summer Elect... | 0.556001 |
| 7 | 466 | I'm also wondering if you could clarify your r... | 0.552956 |
| 8 | 940 | I spent over $450 on these tickets, and I'd re... | 0.508404 |
| 9 | 1399 | On a separate note, I processed a refund for t... | 0.491293 |
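
For repeated searches, one option is to embed only the query text at search time and compare it with the vectors already stored in `email_embeddings`, rather than passing every email's text through `AI_SIMILARITY` again. The sketch below wraps that idea in a hypothetical helper, `search_tickets`. It assumes the stored embeddings were produced with the same model that is passed to `AI_EMBED`, and that a bind parameter is accepted as the `AI_EMBED` input. `StubSession` is a stand-in so the example runs without a Snowflake connection.

```python
SEARCH_SQL_TEMPLATE = """
SELECT
    ticket_id,
    SUBSTR(content, 1, 150) AS content_preview,
    VECTOR_COSINE_SIMILARITY(
        embedding,
        AI_EMBED('snowflake-arctic-embed-m-v1.5', ?)
    ) AS relevance_score
FROM email_embeddings
ORDER BY relevance_score DESC
LIMIT {top_k}
"""


def search_tickets(session, query_text, top_k=10):
    """Hypothetical helper: rank stored emails against a free-text query."""
    top_k = int(top_k)  # formatted into the SQL text, so force it to be an integer
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    sql = SEARCH_SQL_TEMPLATE.format(top_k=top_k)
    # The query text travels as a bind parameter, never as part of the SQL string
    return session.sql(sql, params=[query_text])


class StubSession:
    """Stand-in for snowflake.snowpark.Session that only records the call."""

    def sql(self, query, params=None):
        self.last_query = query
        self.last_params = params
        return self


stub = StubSession()
search_tickets(stub, "I'd like a refund for my ticket", top_k=5)
print(stub.last_params)  # ["I'd like a refund for my ticket"]
```

With a real session, `search_tickets(session, search_query).to_pandas()` would return the ranked rows as a DataFrame. The scores are cosine similarities of the stored vectors and may differ from the `AI_SIMILARITY` values shown above.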
