## Vector Search on PostgreSQL


### Prerequisites
  
- Generate embeddings - [generate_embeddings.ipynb](../common/generate_embeddings.ipynb) 
- Create table and ingest embeddings - [postgree_ingestion.ipynb](.../postgree_ingestion.ipynb)

### Set environment variables

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

pg_host  = os.getenv("POSTGRESQL_HOST")
if pg_host is None or pg_host == "":
    print("POSTGRESQL_HOST environment variable not set.")
    exit()

pg_user  = os.getenv("POSTGRESQL_USERNAME")
if pg_user is None or pg_user == "":
    print("POSTGRESQL_USERNAME environment variable not set.")
    exit()

pg_password  = os.getenv("POSTGRESQL_PASSWORD")
if pg_password is None or pg_password == "":
    print("POSTGRESQL_PASSWORD environment variable not set.")
    exit()

db_name  = os.getenv("POSTGRESQL_DATABASE")
if db_name is None or db_name == "":
    print("POSTGRESQL_DATABASE environment variable not set.")
    exit()

aoai_key  = os.getenv("AZURE_OPENAI_KEY")
if aoai_key is None or aoai_key == "":
    print("AZURE_OPENAI_KEY environment variable not set.")
    exit()

aoai_endpoint = 'https://azure-openai-dnai.openai.azure.com'
aoai_api_version = '2023-08-01-preview'
aoai_embedding_deployed_model = 'embedding-ada'

text_table_name = 'text_sample'
doc_table_name = 'doc_sample'
image_table_name = 'image_sample'

postgresql_params = {
    "host": pg_host,
    "port": "5432", 
    "dbname": db_name,
    "user": pg_user,
    "password": pg_password
}
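
The repeated checks above could be folded into a single helper. The following is a minimal sketch, assuming a hypothetical `require_env` function (not part of the original notebook) that raises an exception instead of calling `exit()`, so that a missing variable shows up as an ordinary error in the notebook:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} environment variable not set.")
    return value
```

With such a helper, each setting would reduce to one line, for example `pg_host = require_env("POSTGRESQL_HOST")`.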




#### Simple vector search

In [None]:
import psycopg2 
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

query = 'web hosting services'

openai.api_type = "azure"
openai.api_key = aoai_key
openai.api_base = aoai_endpoint
openai.api_version = aoai_api_version

query_vector = get_embedding(query, engine = aoai_embedding_deployed_model)

connection = psycopg2.connect(**postgresql_params)
print("Connection established.")

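# Bind the embedding as a query parameter and cast it to pgvector's vector type.
# <=> is the cosine distance operator, so ascending order returns the closest rows first.
query_sql = f"SELECT title FROM {text_table_name} ORDER BY content_vector <=> %s::vector LIMIT 5;"

cursor = connection.cursor()
cursor.execute(query_sql, (str(query_vector),))

records = cursor.fetchall()

for row in records:
    print(row[0])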

cursor.close()
connection.close()
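
pgvector's `<=>` operator returns cosine distance, which is one minus cosine similarity, so sorting by it in ascending order puts the most similar rows first. A small pure-Python sketch of that relationship, using made-up toy vectors (the name `cosine_distance` is illustrative only):

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Ascending distance mirrors ORDER BY content_vector <=> query
ranked = sorted(rows, key=lambda name: cosine_distance(rows[name], query))
print(ranked)
```

The distance ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction); vector length does not affect it.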

### Helper function: converting stored vectors for the search

In [18]:

# Parse a vector returned as text (e.g. '[0.1, 0.2]') into a list of Python floats
def to_double_precision_array(value):
    if isinstance(value, str):
        # Remove brackets and split by comma, then convert to float
        values = [float(x.strip()) for x in value.strip('[]').split(',')]
        return values
    return []
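
The helper above expects the vector as text and would raise a `ValueError` on an empty vector such as `'[]'`. Below is a sketch of a slightly more forgiving variant, assuming a hypothetical `parse_vector` name; it also accepts values that are already sequences of numbers:

```python
def parse_vector(value):
    """Parse a vector given as text like '[0.1, -0.2]' or as a sequence of numbers.

    Hypothetical variant of to_double_precision_array, for illustration only.
    """
    if isinstance(value, str):
        inner = value.strip().strip("[]").strip()
        if not inner:
            return []  # an empty vector such as '[]'
        return [float(x) for x in inner.split(",")]
    if value is None:
        return []
    return [float(x) for x in value]


print(parse_vector("[0.1, -0.2, 3]"))
print(parse_vector([1, 2]))
```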


### Cross column vector similarity search

In [68]:
import pandas as pd
import psycopg2
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

# Authenticate against Azure OpenAI
openai.api_type = "azure"
openai.api_key = aoai_key
openai.api_base = aoai_endpoint
openai.api_version = aoai_api_version


query = 'tools for software development'

query_sql = f'''
    SELECT  title_vector,
            title,
            content_vector,
            content
    FROM {text_table_name} ;
'''
print("Query table")



# Fetch and process the results
connection = psycopg2.connect(**postgresql_params)
cursor = connection.cursor()
cursor.execute(query_sql)

records = cursor.fetchall()

# Create a DataFrame from the query results
column_names = [desc[0] for desc in cursor.description]
df_query_results = pd.DataFrame(records, columns=column_names)

# Release the database resources now that the rows are in memory
cursor.close()
connection.close()

# Create the query embedding
query_vector = get_embedding(query, engine=aoai_embedding_deployed_model)

# Convert the vector columns from text to lists of floats
df_query_results['content_vector_array'] = df_query_results['content_vector'].apply(to_double_precision_array)
df_query_results['title_vector_array'] = df_query_results['title_vector'].apply(to_double_precision_array)


# Compute the cosine similarity of the query against both the content and the title vectors
df_query_results["similarities_content"] = df_query_results['content_vector_array'].apply(lambda x: cosine_similarity(x, query_vector))
df_query_results["similarities_title"] = df_query_results['title_vector_array'].apply(lambda x: cosine_similarity(x, query_vector))

# Display the results and similarities
for index, row in df_query_results.iterrows():
    print("Content :", row["content"])
    print("Content Vector:", row["content_vector_array"])
    print("Title :", row["title"])
    print("Title Vector:", row["title_vector_array"])
    print("similarities Content:", row["similarities_content"])
    print("similarities Title:", row["similarities_title"])
    print("\n")






Query table
Content : Azure App Service is a fully managed platform for building, deploying, and scaling web apps. You can host web apps, mobile app backends, and RESTful APIs. It supports a variety of programming languages and frameworks, such as .NET, Java, Node.js, Python, and PHP. The service offers built-in auto-scaling and load balancing capabilities. It also provides integration with other Azure services, such as Azure DevOps, GitHub, and Bitbucket.
Content Vector: [0.0076504378, -0.023626352, 0.012058043, -0.020860016, -0.024685236, 0.007041579, -0.028695758, -0.004930429, 0.010774146, -0.026458867, 0.007253356, -0.0060488754, -0.0061216736, -0.0029020042, -0.009702026, 0.002081369, 0.01679655, 0.0045101843, -0.011065339, -0.006220944, -0.030707639, -0.007901923, 0.012441888, 0.01782896, -0.013553716, 0.026630934, 0.011972008, -0.028404566, 0.03274599, -0.014400824, 0.012422034, -0.01712745, -0.0073327725, 0.010271176, 4.38186e-05, 0.016611245, -0.0030426371, -0.016809786, -0.0
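
The cell above computes one similarity per column but does not combine or rank them. One possible way to obtain a single cross-column ranking is a weighted average of the two scores. The sketch below uses made-up similarity values, and the 0.3 title weight is an arbitrary assumption rather than a recommended setting:

```python
def combined_score(sim_title, sim_content, title_weight=0.3):
    """Weighted average of the two similarities; the default weight is an arbitrary assumption."""
    return title_weight * sim_title + (1.0 - title_weight) * sim_content


# Toy similarity values, not taken from the sample data
rows = [
    {"title": "A", "similarities_title": 0.90, "similarities_content": 0.70},
    {"title": "B", "similarities_title": 0.60, "similarities_content": 0.85},
    {"title": "C", "similarities_title": 0.50, "similarities_content": 0.50},
]

ranked = sorted(
    rows,
    key=lambda r: combined_score(r["similarities_title"], r["similarities_content"]),
    reverse=True,
)
for r in ranked:
    print(r["title"])
```

On the DataFrame from the cell above, the same idea could be expressed by adding a score column computed from `similarities_title` and `similarities_content` and then calling `sort_values` on it.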

### Hybrid search

In [67]:
import openai
import psycopg2
from openai.embeddings_utils import get_embedding

# Define your search query
search_query = "software development best practices"
# Embed the same text that is used for the keyword search
query_vector = get_embedding(search_query, engine=aoai_embedding_deployed_model)

# Connect to the PostgreSQL database
connection = psycopg2.connect(**postgresql_params)
cursor = connection.cursor()

# Perform Full-Text Search (Keyword Search) using PostgreSQL's FTS
cursor.execute("""
    SELECT id, title, content
    FROM text_sample
    WHERE to_tsvector('english', title || ' ' || content) @@ plainto_tsquery('english', %s)
""", (search_query,))
fts_results = cursor.fetchall()

# Perform Similarity Search using pgvector (<=> is cosine distance; smaller is closer)
query_sql = f"SELECT id, title, content FROM {text_table_name} ORDER BY content_vector <=> %s::vector LIMIT 5;"
cursor.execute(query_sql, (str(query_vector),))
results = cursor.fetchall()

#  Combine and present the results
combined_results = []

# Add FTS results
for row in fts_results:
    combined_results.append({
        "id": row[0],
        "title": row[1],
        "content": row[2],
        "source": "Full-Text Search"
    })

# Add Similarity Search results
for row in results:
    combined_results.append({
        "id": row[0],
        "title": row[1],
        "content": row[2],
        "source": "Similarity Search"
    })

# Sort the combined results by id, so rows returned by both searches appear next to each other
combined_results = sorted(combined_results, key=lambda x: x["id"])

for result in combined_results:
    print("Source:", result["source"])
    print("Title:", result["title"])
    print("Content:", result["content"])
    print("\n")


# Close the database connection
connection.close()


Source: Similarity Search
Title: Azure App Service
Content: Azure App Service is a fully managed platform for building, deploying, and scaling web apps. You can host web apps, mobile app backends, and RESTful APIs. It supports a variety of programming languages and frameworks, such as .NET, Java, Node.js, Python, and PHP. The service offers built-in auto-scaling and load balancing capabilities. It also provides integration with other Azure services, such as Azure DevOps, GitHub, and Bitbucket.


Source: Similarity Search
Title: Azure DevOps
Content: Azure DevOps is a suite of services that help you plan, build, and deploy applications. It includes Azure Boards for work item tracking, Azure Repos for source code management, Azure Pipelines for continuous integration and continuous deployment, Azure Test Plans for manual and automated testing, and Azure Artifacts for package management. DevOps supports a wide range of programming languages, frameworks, and platforms, making it easy to in
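
The cell above concatenates the two result sets and sorts them by `id`, so a row found by both searches appears twice and the order does not reflect relevance. One common alternative is reciprocal rank fusion (RRF), sketched here on hypothetical id lists. It assumes both lists are already ordered by relevance, which for the full-text query would additionally require an `ORDER BY` on a ranking function such as PostgreSQL's `ts_rank`. The constant `k=60` is a commonly used default, not a tuned value:

```python
def reciprocal_rank_fusion(ranked_id_lists, k=60):
    """Fuse several ranked lists of ids; each id scores sum(1 / (k + rank)), rank starting at 1."""
    scores = {}
    for ids in ranked_id_lists:
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fts_ids = [3, 1, 7]      # hypothetical ids in full-text rank order
vector_ids = [1, 5, 3]   # hypothetical ids in vector-distance order
print(reciprocal_rank_fusion([fts_ids, vector_ids]))
```

Ids that appear in both lists accumulate two contributions and therefore rise to the top, and each id appears only once in the fused result.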

#### Document search example

In [66]:
import pandas as pd
import psycopg2
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

# Authenticate against Azure OpenAI
openai.api_type = "azure"
openai.api_key = aoai_key
openai.api_base = aoai_endpoint
openai.api_version = aoai_api_version



query = 'tools for software development'

query_sql = f'''
    SELECT t1.chunk_content_vector as chunk_content_vector,
           t1.chunk_content as chunk_content
    FROM {doc_table_name} t1
    LIMIT 100;
'''
print("Query table")


# Fetch and process the results
connection = psycopg2.connect(**postgresql_params)
cursor = connection.cursor()
cursor.execute(query_sql)

records = cursor.fetchall()

# Create a DataFrame from the query results
column_names = [desc[0] for desc in cursor.description]
df_query_results = pd.DataFrame(records, columns=column_names)

# Create the query embedding
query_vector = get_embedding(query, engine=aoai_embedding_deployed_model)

# Convert the vector column from text to lists of floats
df_query_results['chunk_content_vector_array'] = df_query_results['chunk_content_vector'].apply(to_double_precision_array)


##listing datatype to cross check
#df_query_results['chunk_content_vector_array'].apply(lambda x: print(type(x), x))
#df_query_results['chunk_content_vector'].apply(lambda x: print(type(x), x))


# Compute the cosine similarity between each chunk vector and the query vector
df_query_results["similarities"] = df_query_results['chunk_content_vector_array'].apply(lambda x: cosine_similarity(x, query_vector))


# Sort the DataFrame by similarities in descending order
df_query_results = df_query_results.sort_values(by="similarities", ascending=False)

# Display the results and similarities
for index, row in df_query_results.iterrows():
    print("Content:", row["chunk_content"])
    print("Content Vector:", row["chunk_content_vector_array"])
    print("Similarity:", row["similarities"])
    print("\n")


# Close the database connection
connection.close()



Query table
Content: Contoso Electronics 
Employee Handbook  
 
 
 
 
 
 
  
 This document contains information generated using a language model (Azure OpenAI). The 
information contained in this document is only for demonstration purposes and does not 
reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 
warranties of any kind, express or implied, about the completeness, accuracy, reliability, 
suitability or availability with respect to the information contained in this document.  
All rights reserved to Microsoft  
   Contoso Electronics Employee Handbook  
Last Updated: 2023 -03-05 
 
Contoso Electronics is a leader in the aerospace industry, providing advanced electronic 
components for both commercial and military aircraft. We specialize in creating cutting -
edge systems that are both reliable and efficient. Our mission is to provide the highest 
quality aircraft components to our customers, while maintaining a commitment to safety
Content Vector
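
The cell above fetches up to 100 rows in no particular order and ranks them in pandas, so chunks outside that sample are never scored. As an alternative, the ranking could be done inside PostgreSQL. The sketch below assumes a hypothetical `build_similarity_query` helper that only builds the SQL text and parameters; the table and column names in the example call are the ones used in this notebook:

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_similarity_query(table, text_column, vector_column, query_vector, limit=5):
    """Return (sql, params) ranking rows by cosine similarity inside PostgreSQL.

    Hypothetical helper; identifiers are checked because they cannot be bound as parameters.
    """
    for name in (table, text_column, vector_column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    # pgvector accepts vectors written as text in the form [x1,x2,...]
    vector_text = "[" + ",".join(str(float(x)) for x in query_vector) + "]"
    sql = (
        f"SELECT {text_column}, 1 - ({vector_column} <=> %s::vector) AS similarity "
        f"FROM {table} ORDER BY {vector_column} <=> %s::vector LIMIT %s;"
    )
    return sql, (vector_text, vector_text, limit)


sql, params = build_similarity_query(
    "doc_sample", "chunk_content", "chunk_content_vector", [0.1, 0.2]
)
print(sql)
print(params)
```

The returned pair would be passed to `cursor.execute(sql, params)`. Since `<=>` is cosine distance, `1 - distance` gives a similarity comparable to the `similarities` column computed above.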

#### Image search example

In [None]:
## TODO