# Semantic Search with Amazon OpenSearch Service 
To build semantic search, we add a vector representation of the metadata to our dataset in OpenSearch, then embed our sample query, "Wildfires in Canada", in the same way. In OpenSearch, we use a KNN search to find matches ranked by the cosine similarity between vectors.
We will:
1. Use a HuggingFace sentence-transformer BERT model to generate sentence embeddings for the geo.ca metadata dataset.
2. Upload the dataset to OpenSearch, combining the original metadata text with its vector representation.
3. Translate the query to a vector.
4. Run a KNN search in OpenSearch to perform the semantic search.
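
To make the ranking idea concrete, here is a minimal, dependency-free sketch of cosine similarity and a brute-force nearest-neighbour lookup. The document ids and the 3-dimensional vectors are made up for illustration. OpenSearch uses approximate KNN over much larger vectors, so this only illustrates the scoring, not how the service is implemented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

def knn(query, docs, k=2):
    """Return the ids of the k documents most similar to the query (brute force)."""
    ranked = sorted(docs, key=lambda doc_id: cosine_similarity(query, docs[doc_id]), reverse=True)
    return ranked[:k]

# Hypothetical 3-dimensional embeddings (real ones have hundreds of dimensions)
docs = {
    "wildfire-perimeters": [0.9, 0.1, 0.0],
    "forest-fire-risk": [0.8, 0.3, 0.1],
    "ocean-bathymetry": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
print(knn(query, docs, k=2))
```

The two fire-related vectors point in nearly the same direction as the query, so they rank ahead of the unrelated one.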

### 1. Check PyTorch Version


As in the previous modules, let's import PyTorch and confirm that we have a recent version. The version should already be 2.0.1 or higher. If it is not, please run the lab modules in order so that everything is set up.
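
If you prefer an explicit check over reading the printed version, the comparison can be sketched as below. The helper names `version_tuple` and `meets_minimum` are hypothetical, and the parsing only looks at the leading `major.minor.patch` part, so local build suffixes such as `+cu118` are ignored.

```python
import re

def version_tuple(version):
    """Parse the leading 'major.minor[.patch]' of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component is treated as 0
    return tuple(int(part or 0) for part in match.groups())

def meets_minimum(version, minimum="2.0.1"):
    """True if `version` is at least `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)

print(meets_minimum("2.0.1+cu118"))
print(meets_minimum("1.13.1"))
```

In the notebook, you would pass `torch.__version__` as the first argument.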

In [None]:
# Only run the line below to install the latest torch (a kernel restart is needed afterwards)
#!pip install --upgrade torch torchvision torchaudio

import torch 
print(torch.__version__)

### 2. Retrieve notebook variables

The line below will retrieve your shared variables from the previous notebook.

In [None]:
%store -r

### 3. Import libraries

In [None]:
# Already installed in the previous notebook; re-run only if needed
!pip install -q boto3
!pip install -q requests
!pip install -q requests-aws4auth
!pip install -q opensearch-py
!pip install -q tqdm
!pip install -q "transformers[torch]"
!pip install -q sentence-transformers rank_bm25
!pip install -q nltk

In [None]:
import boto3
import re
import time
import sagemaker

### 4. Prepare BERT Model 
#### Option 1: DistilBert model
For this module, we use a HuggingFace BERT model to generate the embeddings; the DistilBert variant maps every sentence to a 768-dimensional vector. Let's create some helper functions we'll use later on.
![BERT](image/nlp_bert.png)

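We are creating 2 functions:
1. mean_pooling - averages the token embeddings into a single sentence vector, using the attention mask so that padding tokens are ignored.
2. sentence_to_vector - this is the key function we'll use to generate our vector embedding for the metadata dataset.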
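
To show what masked mean pooling computes, here is a small pure-Python sketch on plain lists. It mirrors the logic of the PyTorch helper for a single sentence, but the toy values and the list-based signature are illustrative assumptions, not the notebook's implementation.

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, mask in zip(token_embeddings, attention_mask):
        if mask:  # skip padding positions
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against division by zero when every position is masked out
    count = max(count, 1e-9)
    return [value / count for value in summed]

# Two real tokens followed by one padding token (toy 2-dimensional embeddings)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pooling(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the average far away from the real tokens.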

Why we do not use DistilBert:
 Transformer models like DistilBert have a fixed maximum input length (512 tokens), and any input longer than this limit can cause errors during processing. Our input sequence (1086 tokens) exceeds that limit.
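
This notebook sidesteps the limit by switching models. For reference, another common workaround is to split a long token sequence into overlapping windows that each fit the limit. The sketch below shows that idea on a plain list of token ids. The function name, the `stride` value, and the sample ids are assumptions for illustration, and a real tokenizer also reserves a few positions for special tokens.

```python
def split_into_windows(token_ids, max_length=512, stride=384):
    """Split a token sequence into overlapping windows no longer than max_length."""
    if not 0 < stride <= max_length:
        raise ValueError("stride must be between 1 and max_length")
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the current window already reaches the end
        start += stride
    return windows

# A stand-in for the 1086-token input mentioned above
token_ids = list(range(1086))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])  # [512, 512, 318]
```

If losing the tail of the text is acceptable, the HuggingFace tokenizer call also accepts `truncation=True` together with `max_length`, which cuts the input instead of splitting it.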

In [None]:
# import torch
# from transformers import AutoTokenizer, AutoModel
# from transformers import DistilBertTokenizer, DistilBertModel

# #model_name = "distilbert-base-uncased"
# #model_name = "sentence-transformers/msmarco-distilbert-base-dot-prod-v3"
# model_name = "sentence-transformers/distilbert-base-nli-stsb-mean-tokens" #https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens


# #Mean Pooling - Take attention mask into account for correct averaging
# def mean_pooling(model_output, attention_mask):
#     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
#     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#     sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
#     sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
#     return sum_embeddings / sum_mask


# def sentence_to_vector(raw_inputs):
#     tokenizer = DistilBertTokenizer.from_pretrained(model_name)
#     model = DistilBertModel.from_pretrained(model_name)
#     inputs_tokens = tokenizer(raw_inputs, padding=True, return_tensors="pt")
    
#     with torch.no_grad():
#         outputs = model(**inputs_tokens)

#     sentence_embeddings = mean_pooling(outputs, inputs_tokens['attention_mask'])
#     return sentence_embeddings


#### Option 2: all-MiniLM-L6-v2
We can instead use a sentence-transformers model such as ['all-MiniLM-L6-v2'](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which simplifies the process of obtaining sentence embeddings. It produces 384-dimensional vectors and is designed to generate sentence embeddings directly, so the Sentence Transformers library handles both tokenization and embedding. This is more streamlined than working with DistilBertModel and DistilBertTokenizer by hand. Note that the cell below loads the multilingual variant `paraphrase-multilingual-MiniLM-L12-v2`, which also produces 384-dimensional embeddings.

The SentenceTransformer class's encode method directly handles the text input, tokenization, and conversion to sentence embeddings, eliminating the need for manual mean pooling. It returns one embedding per input sentence.



In [None]:
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import numpy as np 

# Load the Sentence Transformer model on the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "paraphrase-multilingual-MiniLM-L12-v2"
model = SentenceTransformer(model_name)
model.to(device)  # move model to GPU or CPU

def sentence_to_vector(raw_inputs):
    """
    Encode text into a sentence embedding and return it as a list of floats.

    Vectors must be plain lists of floats (not PyTorch tensors) before the
    document can be indexed in Elasticsearch or OpenSearch.
    """
    # convert_to_numpy=True returns a NumPy array on the CPU, so the
    # conversion below also works when the model runs on a GPU
    sentence_embeddings = model.encode(raw_inputs, convert_to_numpy=True)
    return sentence_embeddings.tolist()

### 5. Preprocess and embed the text


In [None]:
import pandas as pd 
import boto3
import json
from requests_aws4auth import AWS4Auth
import io
import os 
from tqdm import tqdm

In [None]:
# Load metadata
def read_parquet_from_s3_as_df(region, s3_bucket, s3_key):
    """
    Load a Parquet file from an S3 bucket into a pandas DataFrame.

    Parameters:
    - region: AWS region where the S3 bucket is located.
    - s3_bucket: Name of the S3 bucket.
    - s3_key: Key (path) to the Parquet file within the S3 bucket.

    Returns:
    - df: pandas DataFrame containing the data from the Parquet file.
    """

    # Setup AWS session and clients
    session = boto3.Session(region_name=region)
    s3 = session.resource('s3')

    # Load the Parquet file as a pandas DataFrame
    # (named s3_object to avoid shadowing the built-in `object`)
    s3_object = s3.Object(s3_bucket, s3_key)
    body = s3_object.get()['Body'].read()
    df = pd.read_parquet(io.BytesIO(body))
    return df

# Upload a DataFrame to S3 as a Parquet file
def upload_df_to_s3_as_parquet(df, bucket_name, file_key):
    # Save DataFrame as a Parquet file locally
    parquet_file_path = 'temp.parquet'
    df.to_parquet(parquet_file_path)

    # Create an S3 client
    s3_client = boto3.client('s3')

    # Upload the Parquet file to the S3 bucket
    try:
        s3_client.upload_file(parquet_file_path, bucket_name, file_key)
        print(f'Uploaded {file_key} to {bucket_name} as a Parquet file')
        return True
    except Exception as e:
        print(e)
        return False
    finally:
        # Delete the local Parquet file whether or not the upload succeeded
        if os.path.exists(parquet_file_path):
            os.remove(parquet_file_path)

# Extract the English organisation name, used for the 'organisation_en' column required by the API JSON response
def extract_organisation_en(contact_str):
    try:
        # Parse the stringified JSON into Python objects
        contact_data = json.loads(contact_str)
        # If the parsed data is a list, iterate through it
        if isinstance(contact_data, list):
            for item in contact_data:
                # Check if 'organisation' and 'en' keys exist
                if 'organisation' in item and 'en' in item['organisation']:
                    return item['organisation']['en']
        elif isinstance(contact_data, dict):
            # If the data is a dictionary, extract 'organisation' in 'en' directly
            return contact_data.get('organisation', {}).get('en', None)
    except json.JSONDecodeError:
        # Handle cases where the contact string is not valid JSON
        return None
    except Exception as e:
        # Catch-all for any other unexpected errors
        return f"Error: {str(e)}"

# Text preprocess
def preprocess_records_into_text(df):
    selected_columns = ['features_properties_title_en','features_properties_description_en','features_properties_keywords_en']
    df = df[selected_columns]
    return df.apply(lambda x: f"{x['features_properties_title_en']}\n{x['features_properties_description_en']}\nkeywords:{x['features_properties_keywords_en']}",axis=1 )


# Text preprocess (English and French fields combined)
def preprocess_records_into_text_multi(df):
    selected_columns = ['features_properties_title_en','features_properties_description_en','features_properties_keywords_en','features_properties_title_fr','features_properties_description_fr','features_properties_keywords_fr']
    df = df[selected_columns]
    return df.apply(lambda x: f"{x['features_properties_title_en']}\n{x['features_properties_title_fr']}\n{x['features_properties_description_en']}\n{x['features_properties_description_fr']}\nkeywords:{x['features_properties_keywords_en']}\nkeywords:{x['features_properties_keywords_fr']}",axis=1 )

In [None]:
#1) Step1: Load the data 
df_parquet = read_parquet_from_s3_as_df('ca-central-1', 'webpresence-geocore-geojson-to-parquet-prod', 'records.parquet')
df_sentinel1_1 = read_parquet_from_s3_as_df('ca-central-1', 'webpresence-geocore-geojson-to-parquet-prod', '1-sentinel1.parquet')
df_sentinel1_2 = read_parquet_from_s3_as_df('ca-central-1', 'webpresence-geocore-geojson-to-parquet-prod', '2-sentinel1.parquet')
df_rcm = read_parquet_from_s3_as_df('ca-central-1', 'webpresence-geocore-geojson-to-parquet-prod', 'rcm-ard.parquet')
df = pd.concat([df_parquet, df_sentinel1_1, df_sentinel1_2, df_rcm], ignore_index=True)
#df = pd.concat([df_parquet, df_rcm], ignore_index=True)
df.head()
#df.columns
df.shape

Subset to the columns that are required in the app.geo.ca [API response](https://geocore.api.geo.ca/geo?north=81.77364370720657&east=360&south=-8.407168163601076&west=-359.6484375&keyword=&lang=en&min=1&max=10&sort=popularity-desc).
##### Note: we are focusing on English search at the moment.

In [None]:
#2) Step2: Clean the data  
col_names_list = [
    'features_properties_id','features_geometry_coordinates','features_properties_title_en',
    'features_properties_description_en','features_properties_date_published_date',
    'features_properties_keywords_en','features_properties_options','features_properties_contact',
    'features_properties_topicCategory','features_properties_date_created_date',
    'features_properties_spatialRepresentation','features_properties_type',
    'features_properties_temporalExtent_begin','features_properties_temporalExtent_end',
    'features_properties_graphicOverview','features_properties_language','features_popularity',
    'features_properties_sourceSystemName','features_properties_eoCollection',
    'features_properties_eoFilters'
]
col_names_list_multi = [
    'features_properties_id','features_geometry_coordinates','features_properties_title_en','features_properties_title_fr',
    'features_properties_description_en','features_properties_description_fr','features_properties_date_published_date',
    'features_properties_keywords_en','features_properties_keywords_fr','features_properties_options','features_properties_contact','features_properties_cited',
    'features_properties_topicCategory','features_properties_date_created_date',
    'features_properties_spatialRepresentation','features_properties_type',
    'features_properties_temporalExtent_begin','features_properties_temporalExtent_end',
    'features_properties_graphicOverview','features_properties_language','features_popularity',
    'features_properties_sourceSystemName','features_properties_eoCollection',
    'features_properties_eoFilters'
]
df_en = df[col_names_list_multi].copy()  # copy so the column assignments below do not trigger SettingWithCopyWarning
df_en['organisation_en'] = df_en['features_properties_contact'].apply(extract_organisation_en)
df_en['organisation_en_cited'] = df_en['features_properties_cited'].apply(extract_organisation_en)

# Create a new column 'temporalExtent' as a dictionary of {'begin': ..., 'end': ...}
values_to_replace = {'Present': None, 'Not Available; Indisponible': None}
columns_to_replace = ['features_properties_temporalExtent_begin', 'features_properties_temporalExtent_end']
df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)

df_en['temporalExtent'] = df_en.apply(lambda row: {'begin': row['features_properties_temporalExtent_begin'], 'end': row['features_properties_temporalExtent_end']}, axis=1)
df_en = df_en.drop(columns =['features_properties_temporalExtent_begin', 'features_properties_temporalExtent_end'])

values_to_replace = {'Not Available; Indisponible': None} # modifies dates to acceptable values
columns_to_replace = ['features_properties_date_published_date', 'features_properties_date_created_date']
df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)

In [None]:
#3) Step 3: Preprocess text 
#df_en['text'] = preprocess_records_into_text(df_en)
df_en['text'] = preprocess_records_into_text_multi(df_en)

pd.set_option('display.max_columns', None)
df_en.head(5)

### (optional) Text Preprocess using NLTK  

Create a new column that concatenates the selected columns (features_properties_title_en, features_properties_description_en, features_properties_keywords_en, and their French counterparts), and apply the following preprocessing before tokenization:
- convert to lower case
- remove stopwords and punctuation
- remove apostrophes
- stemming
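
As a quick illustration of these steps, here is a dependency-free sketch that covers lowercasing, punctuation and apostrophe removal, and stop-word filtering. The tiny `STOP_WORDS` set and the function name `clean_text_simple` are assumptions for illustration. Stemming is left out because it needs NLTK's PorterStemmer, which the cell below uses.

```python
import string

# Tiny illustrative stop-word set (assumption); NLTK's English list is far larger
STOP_WORDS = {"the", "of", "in", "and", "a", "for", "is"}

def clean_text_simple(text):
    """Lowercase, strip punctuation (including apostrophes), drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)

print(clean_text_simple("The Wildfires of Canada's Boreal Forest, 2023!"))
# wildfires canadas boreal forest 2023
```

Note that aggressive cleaning like this suits keyword-style matching. Sentence-embedding models are generally trained on natural text, which is one reason this step is marked optional.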


In [None]:
import nltk
from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import word_tokenize   # module for tokenizing strings 
import string
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')


df_en['metadata_en'] = df_en['features_properties_title_en'] + ' ' + df_en['features_properties_description_en'] + ' ' + df_en['features_properties_keywords_en'] + ' ' + df_en['features_properties_title_fr'] + ' ' + df_en['features_properties_description_fr'] + ' ' + df_en['features_properties_keywords_fr']
if df_en['metadata_en'].isnull().any():
    df_en['metadata_en'] = df_en['metadata_en'].fillna('')

# Function to clean text
def clean_text(text):
    """
    text: raw text input, a string
    output: preprocessed text in string format
    """
    # Set of stopwords
    stop_words = set(stopwords.words('english'))
    # Initialize the Porter Stemmer
    stemmer = PorterStemmer()

    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation, keeping apostrophes for now
    text = text.translate(str.maketrans('', '', string.punctuation.replace("'", "")))
    # Remove apostrophes
    text = text.replace("'", "")
    # Tokenize text
    word_tokens = word_tokenize(text)
    # Remove stopwords and stem
    filtered_text = [stemmer.stem(word) for word in word_tokens if word not in stop_words]
    return " ".join(filtered_text)

df_en['processed_metadata_en'] = df_en['metadata_en'].apply(clean_text)
# Show the processed column
print(df_en.loc[1:3, ['metadata_en', 'processed_metadata_en']])
df_en.head(4)

tqdm.pandas()
df_en['vector'] = df_en["processed_metadata_en"].progress_apply(sentence_to_vector)

In [None]:
# Step 4: Embedding text 
from tqdm.auto import tqdm
tqdm.pandas()
df_en['vector'] = df_en["text"].progress_apply(sentence_to_vector)


In [None]:
print(f'The dimension of the vector is {len(df_en.loc[1, "vector"])}')
df_en.head(4)

In [None]:
# Step 5 Upload the embeddings as a parquet file to S3 bucket 
upload_df_to_s3_as_parquet(df=df_en, bucket_name='webpresence-nlp-data-preprocessing-prod', file_key='semantic_search_embeddings-minilm-multilingual.parquet') 

### 6. Create an OpenSearch cluster connection.
Next, we'll use the Python API to set up a connection to the OpenSearch cluster.
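
For orientation, the request bodies involved in a vector index and a KNN query can be sketched as plain dictionaries. The field name `vector` matches the DataFrame column created earlier, and the dimension of 384 matches the MiniLM embeddings. The builder function names and the `text` field are assumptions, and the supported engine and space type depend on your OpenSearch version. The notebook's own helper module may build these differently.

```python
def build_knn_index_body(dimension=384):
    """Index settings and mapping for a cosine-similarity k-NN vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "vector": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",  # assumption: check your cluster version
                    },
                },
                "text": {"type": "text"},
            }
        },
    }

def build_knn_query(query_vector, k=10):
    """Search body returning the k nearest neighbours of query_vector."""
    return {
        "size": k,
        "query": {"knn": {"vector": {"vector": query_vector, "k": k}}},
    }

index_body = build_knn_index_body()
query_body = build_knn_query([0.0] * 384, k=5)
print(index_body["mappings"]["properties"]["vector"]["dimension"], query_body["size"])
```

With the opensearch-py client, bodies like these are typically passed to `client.indices.create(index=..., body=...)` and `client.search(index=..., body=...)`.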


In [None]:
from opensearch import get_awsauth_from_secret, create_opensearch_connection, delete_aos_index_if_exists, load_data_to_opensearch_index

In [None]:
import boto3
def read_parquet_from_s3_as_df(region, s3_bucket, s3_key):
    """
    Load a Parquet file from an S3 bucket into a pandas DataFrame.

    Parameters:
    - region: AWS region where the S3 bucket is located.
    - s3_bucket: Name of the S3 bucket.
    - s3_key: Key (path) to the Parquet file within the S3 bucket.

    Returns:
    - df: pandas DataFrame containing the data from the Parquet file.
    """

    # Setup AWS session and clients
    session = boto3.Session(region_name=region)
    s3 = session.resource('s3')

    # Load the Parquet file as a pandas DataFrame
    # (named s3_object to avoid shadowing the built-in `object`)
    s3_object = s3.Object(s3_bucket, s3_key)
    body = s3_object.get()['Body'].read()
    df = pd.read_parquet(io.BytesIO(body))
    return df

#Optional: read the embedding data from the S3 bucket 
import pandas as pd 
import io
df_en = read_parquet_from_s3_as_df('ca-central-1', 'webpresence-nlp-data-preprocessing-prod', 'semantic_search_embeddings-minilm-multilingual.parquet')

In [None]:
row = df_en[df_en['features_properties_id'] == '0a2e22a2-a7a1-47d4-9249-3eaafe1c815b']
row

pd.set_option('display.max_columns', None)
df_en.head(5)

In [None]:
# Data normalization

# 1. Change cgp to geo-ca for source-system
df_en.loc[
    (df_en['features_properties_sourceSystemName'] == 'cgp') |
    (df_en['features_properties_sourceSystemName'].isna()),
    'features_properties_sourceSystemName'
] = 'geo-ca'

# 2. Create a new column called features_properties_mappable and set it to true if any protocol in the options section contains 'esri' or 'ogc'
def is_mappable_from_str(options_str):
    try:
        import json
        options = json.loads(options_str) if isinstance(options_str, str) else options_str
        
        protocols = []
        if isinstance(options, list):
            protocols = [opt.get("protocol", "").strip().lower() for opt in options if isinstance(opt, dict)]
        elif isinstance(options, dict):
            protocol = options.get("protocol", "").strip().lower()
            protocols = [protocol]
                
        return any("esri" in p or "ogc" in p for p in protocols)
    except Exception as e:
        print(f"Error parsing options: {e}")
        return False

df_en["features_properties_mappable"] = df_en["features_properties_options"].apply(is_mappable_from_str)

df_en["features_properties_mappable"] = df_en["features_properties_mappable"].astype(str).str.lower()


# 3. Create a new column called features_properties_geo_theme and bin ISO topic categories

theme_bins = {
    'boundaries': 'administration',
    'planningcadastre': 'administration',
    'location': 'administration',
    'transportation': 'administration',

    'economy': 'economy',
    'farming': 'economy',

    'biota': 'environment',
    'environment': 'environment',
    'elevation': 'environment',
    'inlandwaters': 'environment',
    'oceans': 'environment',
    'climatologymeteorologyatmosphere': 'environment',  # lowercase key

    'imagerybasemapsearthcover': 'imagery',
    'earthobservation;syntheticaperatureradar': 'imagery',

    'structure': 'infrastructure',
    'transport': 'infrastructure',
    'utilitiescommunication': 'infrastructure',

    'geoscientificinformation': 'science',  # lowercase key

    'health': 'society',
    'society': 'society',
    'intelligencemilitary': 'society',
}

def map_topics_to_themes(topic_str):
    if not isinstance(topic_str, str):
        return []

    topics = [t.strip().lower() for t in topic_str.split(",")]
    themes = {theme_bins.get(topic) for topic in topics if theme_bins.get(topic)}

    # Convert everything to strings explicitly (just in case)
    return sorted([str(theme) for theme in themes])

df_en["features_properties_geo_theme"] = df_en["features_properties_topicCategory"].apply(map_topics_to_themes)

#print(df_en["features_properties_geo_theme"].iloc[0])
# should output: ["administration", "society"]

#print(type(df_en["features_properties_geo_theme"].iloc[0][0]))
# should output: <class 'str'>


# 4. Organizations should not have divisions
def get_second_segment(s):
    """Extract second segment (index 1) from semicolon-separated string."""
    if isinstance(s, str):
        parts = [p.strip() for p in s.split(";")]
        if len(parts) >= 2:
            return parts[1]
    return None

def extract_org(contact_list):
    """Extract org dict with 'en' and 'fr' second segments."""
    if isinstance(contact_list, str):
        try:
            contact_list = json.loads(contact_list)
        except Exception:
            return {"en": None, "fr": None}

    if not isinstance(contact_list, list) or not contact_list:
        return {"en": None, "fr": None}

    # filter out empty/nulls
    contact_list = [c for c in contact_list if isinstance(c, dict) and c]
    if not contact_list:
        return {"en": None, "fr": None}

    org = contact_list[0].get("organisation", {})
    if not isinstance(org, dict):
        return {"en": None, "fr": None}

    return {
        "en": get_second_segment(org.get("en")),
        "fr": get_second_segment(org.get("fr"))
    }

def choose_org(cited, contact):
    """Use cited if valid, else fallback to contact."""
    if cited in [None, [], {}, [None], "[]", "[null]"] or (
        isinstance(cited, float) and pd.isna(cited)
    ):
        return extract_org(contact)
    return extract_org(cited)

# Faster: build the column in a single pass over both source columns
df_en["features_properties_org"] = [
    choose_org(cited, contact)
    for cited, contact in zip(df_en["features_properties_cited"], df_en["features_properties_contact"])
]



# 5. ISO themes should be an array
df_en['features_properties_topicCategory'] = df_en['features_properties_topicCategory'].apply(
    lambda x: [s.strip() for s in x.split(',')] if isinstance(x, str) else x if isinstance(x, list) else []
)

df_en["features_properties_keywords_en"] = df_en["features_properties_keywords_en"].apply(
    lambda x: [s.strip() for s in x.split(',')] if isinstance(x, str) else x if isinstance(x, list) else []
)

df_en["features_properties_keywords_fr"] = df_en["features_properties_keywords_fr"].apply(
    lambda x: [s.strip() for s in x.split(',')] if isinstance(x, str) else x if isinstance(x, list) else []
)





# 6. Extract unique English descriptions (before first semicolon) across all options

def extract_unique_eng_desc(options_str):
    try:
        import json
        options = json.loads(options_str) if isinstance(options_str, str) else options_str

        descs = []
        if isinstance(options, list):
            for opt in options:
                if isinstance(opt, dict):
                    desc = opt.get("description", {}).get("en")
                    if isinstance(desc, str):
                        # keep only text before first semicolon
                        descs.append(desc.split(";")[0].strip())
        elif isinstance(options, dict):
            desc = options.get("description", {}).get("en")
            if isinstance(desc, str):
                descs.append(desc.split(";")[0].strip())

        return list(set(descs))  # unique
    except Exception as e:
        print(f"Error parsing options: {e}")
        return []


df_en["features_properties_type"] = df_en["features_properties_options"].apply(extract_unique_eng_desc)


# Find rows where the theme column is an empty list ([])
#empty_theme_df = df_en[df_en["features_properties_geo_theme"].apply(lambda x: isinstance(x, list) and len(x) == 0)]
# Display the filtered DataFrame
#empty_theme_df

row = df_en[df_en['features_properties_id'] == '0a2e22a2-a7a1-47d4-9249-3eaafe1c815b']
row
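
To see what normalization steps 3 and 4 produce, here is a self-contained sketch with a reduced lookup table and hypothetical input strings. It restates the two small helpers so that it can run on its own. `THEME_BINS` here is a three-entry excerpt, not the full mapping used above.

```python
# Reduced copy of the theme lookup, for illustration only
THEME_BINS = {
    "boundaries": "administration",
    "health": "society",
    "oceans": "environment",
}

def map_topics_to_themes(topic_str):
    """Map a comma-separated ISO topic string to a sorted list of unique themes."""
    if not isinstance(topic_str, str):
        return []
    topics = [t.strip().lower() for t in topic_str.split(",")]
    return sorted({THEME_BINS[t] for t in topics if t in THEME_BINS})

def get_second_segment(s):
    """Return the second semicolon-separated segment, or None if there is none."""
    if isinstance(s, str):
        parts = [p.strip() for p in s.split(";")]
        if len(parts) >= 2:
            return parts[1]
    return None

print(map_topics_to_themes("boundaries, Health, unknownTopic"))
# ['administration', 'society']
print(get_second_segment("Government of Canada; Health Canada; Some Branch"))
# Health Canada
print(get_second_segment("Health Canada"))
# None
```

Unknown topics are dropped silently, and an organisation string without a semicolon yields `None`, so records of either kind are worth spot-checking after the cell runs.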




In [None]:
"""
import json

pd.set_option('display.max_columns', None)

# Data normalization

#1 Change cgp to geo-ca for source-system
df_en.loc[
    (df_en['features_properties_sourceSystemName'] == 'cgp') |
    (df_en['features_properties_sourceSystemName'].isna()),
    'features_properties_sourceSystemName'
] = 'geo-ca'

#2 Create a new column called features_properties_mappable and set it to true if any protocol in the options section contains 'esri' or 'ogc'
def is_mappable_from_str(options_str):
    try:
        import json
        options = json.loads(options_str) if isinstance(options_str, str) else options_str
        
        protocols = []
        if isinstance(options, list):
            protocols = [opt.get("protocol", "").strip().lower() for opt in options if isinstance(opt, dict)]
        elif isinstance(options, dict):
            protocol = options.get("protocol", "").strip().lower()
            protocols = [protocol]       
        return any("esri" in p or "ogc" in p for p in protocols)
    except Exception as e:
        print(f"Error parsing options: {e}")
        return False

df_en["features_properties_mappable"] = df_en["features_properties_options"].apply(is_mappable_from_str)
df_en["features_properties_mappable"] = df_en["features_properties_mappable"].astype(str).str.lower()


#4

def map_special_topic_category(value):
    if isinstance(value, str):
        cleaned = value.replace(" ", "").lower()
        if cleaned == "earthobservation;syntheticaperatureradar":
            return "imageryBaseMapsEarthCover"
    return value

df_en["features_properties_topicCategory"] = df_en["features_properties_topicCategory"].apply(map_special_topic_category)

# Create a new column called features_properties_geo_theme and bin ISO topic categories
theme_bins = {
    'boundaries': 'administration',
    'planning_cadastre': 'administration',
    'location': 'administration',
    'transportation': 'administration',
    'planningcadastre': 'administration', #missing under old implementation

    'economy': 'economy',
    'farming': 'economy',

    'biota': 'environment',
    'environment': 'environment',
    'elevation': 'environment',
    'inlandwaters': 'environment',
    'oceans': 'environment',
    'climatologymeteorologyatmosphere': 'environment',  # lowercase key

    'imagerybasemapsearthcover': 'imagery',

    'structure': 'infrastructure',
    'transport': 'infrastructure',
    'utilitiescommunication': 'infrastructure',

    'geoscientificinformation': 'science',  # lowercase key

    'health': 'society',
    'society': 'society',
    'intelligencemilitary': 'society',
}

def map_topics_to_themes(topic_str):
    if not isinstance(topic_str, str):
        return []

    topics = [t.strip().lower() for t in topic_str.split(",")]
    themes = {theme_bins.get(topic) for topic in topics if theme_bins.get(topic)}

    # Convert everything to strings explicitly (just in case)
    return sorted([str(theme) for theme in themes])

df_en["features_properties_geo_theme"] = df_en["features_properties_topicCategory"].apply(map_topics_to_themes)

# 5 Organizations should not have divisions

# List of 8 UUIDs to overwrite
health_canada_ids = [
    "d256b422-2834-40a2-9f0c-bd5fc32781b2",
    "12acd145-626a-49eb-b850-0a59c9bc7506",
    "685257af-5592-49de-8726-90090c6c3061",
    "16348a20-02f4-4557-b68c-d1754e3026a4",
    "0f45b62a-b4f6-4eaa-ad84-2fca80108d0b",
    "67bedee8-beb0-4b3a-a1c6-24a4cda08afe",
    "f0d1c3a9-cf78-4b07-af55-9e8d4303449e",
    "21b821cf-0f1c-40ee-8925-eab12d357668",
]

hc_org = {"en": "Government of Canada;Health Canada", "fr": "Gouvernement du Canada;Santé Canada"}

for uid in health_canada_ids:
    idx = df_en.index[df_en["features_properties_id"] == uid]
    if not idx.empty:
        i = idx[0]

        contact_list = df_en.at[i, "features_properties_contact"]

        # Parse stringified JSON if needed
        if isinstance(contact_list, str):
            try:
                contact_list = json.loads(contact_list)
            except json.JSONDecodeError:
                continue  # Skip if malformed

        # Modify organisation field
        if isinstance(contact_list, list) and len(contact_list) > 0:
            contact_list[0]["organisation"] = hc_org
            # Re-serialize to JSON string before assigning
            df_en.at[i, "features_properties_contact"] = json.dumps(contact_list, ensure_ascii=False)
            
def extract_org_second_segment(contact_list):
    try:
        # Step 1: If JSON string, parse
        if isinstance(contact_list, str):
            import json
            contact_list = json.loads(contact_list)

        # Step 2: Ensure list with at least one contact
        if not isinstance(contact_list, list) or not contact_list:
            return {"en": None, "fr": None}

        contact = contact_list[0]
        org = contact.get("organisation", {})

        # Step 3: Extract second segment (index 1) if it exists
        def get_second_segment(s):
            if isinstance(s, str):
                parts = [p.strip() for p in s.split(";")]
                #print("Parts:", parts)  # Debug print
                if len(parts) >= 2:
                    return parts[1]
            return None

        return {
            "en": get_second_segment(org.get("en")),
            "fr": get_second_segment(org.get("fr"))
        }

    except Exception as e:
        print("Error:", e)
        return {"en": None, "fr": None}

df_en["features_properties_org"] = df_en["features_properties_contact"].apply(extract_org_second_segment)

#hc_org = {"en": "Health Canada", "fr": "Santé Canada"}

#for uid in health_canada_ids:
#    df_en.loc[df_en["features_properties_id"] == uid, "features_properties_org"] = [hc_org]

#6. Themes from dynamodb theme

# 6.1. Initialize DynamoDB client
dynamodb = boto3.client('dynamodb')
table = 'theme'

# 6.2. Batch read from DynamoDB (since it is not too large)
def scan_table(table_name):
    paginator = dynamodb.get_paginator('scan')
    items = []
    for page in paginator.paginate(TableName=table_name):
        items.extend(page['Items'])
    return items

all_items = scan_table(table)
cleaned = [{'uuid': item['uuid']['S'], 'tag': item['tag']['S']} for item in all_items]
df_theme = pd.DataFrame(cleaned)

# 6.3. Merge with df_en using features_properties_id
df_en = df_en.merge(df_theme, how='left', left_on='features_properties_id', right_on='uuid')

# 6.4. Append the tag to the geo_theme list if not already present
def append_theme(existing, tag):
    try:
        if not isinstance(existing, list):
            existing = []
        if isinstance(tag, str) and tag and tag not in existing:
            existing.append(tag)
        return existing
    except Exception:
        return existing

df_en["features_properties_geo_theme"] = df_en.apply(
    lambda row: append_theme(row["features_properties_geo_theme"], row["tag"]),
    axis=1
)

# 6.5. Drop helper columns from step 2
df_en.drop(columns=['uuid', 'tag'], inplace=True)

# Test
#row = df_en[df_en['features_properties_id'] == '0a2e22a2-a7a1-47d4-9249-3eaafe1c815b']
#row

#7. Themes from dynamodb foundational

# 7.1. Initialize DynamoDB client
dynamodb = boto3.client('dynamodb')
table = 'foundational'

# 7.2. Batch read from DynamoDB (since it is not too large)
def scan_table(table_name):
    paginator = dynamodb.get_paginator('scan')
    items = []
    for page in paginator.paginate(TableName=table_name):
        items.extend(page['Items'])
    return items

all_items = scan_table(table)
cleaned = [{'uuid': item['uuid']['S'], 'tag': item['loc']['S']} for item in all_items]
df_theme = pd.DataFrame(cleaned)

# 7.3. Merge with df_en using features_properties_id
df_en = df_en.merge(df_theme, how='left', left_on='features_properties_id', right_on='uuid')

# 7.4. Append the tag to the geo_theme list if not already present
def append_theme(existing, tag):
    try:
        if not isinstance(existing, list):
            existing = []
        if isinstance(tag, str) and tag and tag not in existing:
            existing.append(tag)
        return existing
    except Exception:
        return existing

df_en["features_properties_geo_theme"] = df_en.apply(
    lambda row: append_theme(row["features_properties_geo_theme"], row["tag"]),
    axis=1
)

# 7.5. Drop the helper columns added by the merge in step 7.3
df_en.drop(columns=['uuid', 'tag'], inplace=True)

# 8. Prints for debugging

#df_fitered = df_en[df_en['features_properties_id'] == '0a2e22a2-a7a1-47d4-9249-3eaafe1c815b']
#df_fitered

#check if there are any records not mapped to a geo-ca theme
#filtered = df_en[
#    df_en["features_properties_geo_theme"].isnull() | 
#    (df_en["features_properties_geo_theme"].apply(lambda x: x == [] or x == ''))
#]

# Print all columns of those rows
#filtered
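
The scans above return items in DynamoDB's typed attribute format (for example `{'uuid': {'S': '...'}}`), and the list comprehensions raise a `KeyError` if any item lacks the expected attribute. The following is a minimal sketch of a more tolerant flattening step. `flatten_items` and its `tag_attr` parameter are hypothetical names introduced for illustration, and the sample items are made up.

```python
def flatten_items(items, tag_attr):
    """Convert DynamoDB typed items into plain {'uuid', 'tag'} dicts.

    Items missing a string 'uuid' or the requested tag attribute are skipped
    instead of raising KeyError. (Hypothetical helper, for illustration only.)
    """
    rows = []
    for item in items:
        uuid = item.get('uuid', {}).get('S')
        tag = item.get(tag_attr, {}).get('S')
        if uuid and tag:
            rows.append({'uuid': uuid, 'tag': tag})
    return rows


# Sample data shaped like a low-level DynamoDB scan response (made-up values).
sample_items = [
    {'uuid': {'S': 'a-1'}, 'loc': {'S': 'foundational'}},
    {'uuid': {'S': 'a-2'}},              # no 'loc' attribute: skipped
    {'loc': {'S': 'foundational'}},      # no 'uuid' attribute: skipped
]
cleaned_sample = flatten_items(sample_items, tag_attr='loc')
print(cleaned_sample)
```

In the notebook, the result could be passed to `pd.DataFrame(...)` in place of the `cleaned` list, with `tag_attr='tag'` for the 'theme' table and `tag_attr='loc'` for the 'foundational' table.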


In [None]:
# 5. Use additional themes from the DynamoDB table called 'theme'

# 1.1. Initialize DynamoDB client
dynamodb = boto3.client('dynamodb')
table = 'theme'

# 1.2. Batch read from DynamoDB (since it is not too large)
def scan_table(table_name):
    paginator = dynamodb.get_paginator('scan')
    items = []
    for page in paginator.paginate(TableName=table_name):
        items.extend(page['Items'])
    return items

all_items = scan_table(table)
cleaned = [{'uuid': item['uuid']['S'], 'tag': item['tag']['S']} for item in all_items]
df_theme = pd.DataFrame(cleaned)
print(df_theme)

# 1.3. Merge with df_en using features_properties_id
df_en = df_en.merge(df_theme, how='left', left_on='features_properties_id', right_on='uuid')

# 1.4. Append the tag to the geo_theme list if not already present
def append_theme(existing, tag):
    try:
        if not isinstance(existing, list):
            existing = []
        if tag in ['emergency', 'legal'] and tag not in existing:
            existing.append(tag)
        return existing
    except Exception:
        return existing

df_en["features_properties_geo_theme"] = df_en.apply(
    lambda row: append_theme(row["features_properties_geo_theme"], row["tag"]),
    axis=1
)

# 1.5. Drop the helper columns added by the merge in step 1.3
df_en.drop(columns=['uuid', 'tag'], inplace=True)

# 2.1 Foundational
dynamodb = boto3.client('dynamodb')
table = 'foundational'

# 2.2. Batch read from DynamoDB (since it is not too large)
def scan_table(table_name):
    paginator = dynamodb.get_paginator('scan')
    items = []
    for page in paginator.paginate(TableName=table_name):
        items.extend(page['Items'])
    return items

all_items = scan_table(table)
cleaned = [{'uuid': item['uuid']['S'], 'tag': item['loc']['S']} for item in all_items]
df_theme = pd.DataFrame(cleaned)

# 2.3. Merge with df_en using features_properties_id
df_en = df_en.merge(df_theme, how='left', left_on='features_properties_id', right_on='uuid')

# 2.4. Append the tag to the geo_theme list if not already present
def append_theme(existing, tag):
    try:
        if not isinstance(existing, list):
            existing = []
        if isinstance(tag, str) and tag and tag not in existing:
            existing.append(tag)
        return existing
    except Exception:
        return existing

df_en["features_properties_geo_theme"] = df_en.apply(
    lambda row: append_theme(row["features_properties_geo_theme"], row["tag"]),
    axis=1
)
df_en.drop(columns=['uuid', 'tag'], inplace=True)

# 3. Create a new column called features_properties_foundational
def is_foundational(theme):
    try:
        import json
        
        # If it's a JSON string, parse it
        if isinstance(theme, str):
            try:
                theme_parsed = json.loads(theme)
            except Exception:
                # fallback: treat as plain string
                theme_parsed = theme
        else:
            theme_parsed = theme

        # Normalize to a list
        if isinstance(theme_parsed, list):
            themes = [str(t).strip().lower() for t in theme_parsed if t is not None]
        elif isinstance(theme_parsed, str):
            themes = [theme_parsed.strip().lower()]
        else:
            themes = []

        return "foundational" in themes
    except Exception as e:
        print(f"Error parsing theme: {e}")
        return False

df_en["features_properties_foundational"] = df_en["features_properties_geo_theme"].apply(is_foundational)

# Normalize boolean to lowercase string "true"/"false" if you want consistency with your mappable column
df_en["features_properties_foundational"] = df_en["features_properties_foundational"].astype(str).str.lower()

# Test
#row = df_en[df_en['features_properties_id'] == '0a2e22a2-a7a1-47d4-9249-3eaafe1c815b']
#row


In [None]:
# 7. Convert the ISO topic category and keyword columns to lists so OpenSearch can index them for aggregations

df_en['features_properties_topicCategory'] = df_en['features_properties_topicCategory'].apply(
    lambda x: [s.strip() for s in x.split(',')] if isinstance(x, str) else x if isinstance(x, list) else []
)

df_en["features_properties_keywords_en"] = df_en["features_properties_keywords_en"].apply(
    lambda x: [s.strip() for s in x.split(',')] if isinstance(x, str) else x if isinstance(x, list) else []
)

df_en["features_properties_keywords_fr"] = df_en["features_properties_keywords_fr"].apply(
    lambda x: [s.strip() for s in x.split(',')] if isinstance(x, str) else x if isinstance(x, list) else []
)

df_en

df_fitered = df_en[df_en['features_properties_id'] == 'd256b422-2834-40a2-9f0c-bd5fc32781b2']
#df_fitered[features_properties_contact]
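
The three `apply` calls above share the same normalization rule: split a comma-separated string into trimmed parts, pass an existing list through, and map anything else to an empty list. The following is a minimal sketch of that rule as a named function. `to_list` is a hypothetical name introduced here, not part of the original notebook.

```python
def to_list(value):
    """Normalize a field to a list of strings.

    - comma-separated string -> list of trimmed parts
    - list                   -> returned unchanged
    - anything else (None, NaN, numbers) -> []
    """
    if isinstance(value, str):
        return [part.strip() for part in value.split(',')]
    if isinstance(value, list):
        return value
    return []


print(to_list("oceans, climate ,health"))
print(to_list(["a", "b"]))
print(to_list(None))
```

A named helper like this could replace each lambda, for example `df_en[col].apply(to_list)`. An empty string yields `['']` with this logic, exactly as the lambdas do, so empty parts would need to be filtered out separately if that matters for aggregations.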


In [None]:
pd.set_option('display.max_colwidth', None)  # Show full content in columns
print(df_fitered['features_properties_contact'])

In [None]:
vector = df_en['vector']
print(type(vector))
# Check for null values
has_null = vector.isnull().any()
print(f"Series has null values: {has_null}")

json_en = df_en.to_dict("records")
print(type(json_en))
#print(json_en[0])

# Extract the 'vector' values
vectors = [item['vector'] for item in json_en]
print(type(vectors))
#print(vectors[0])

# Check if there are null values
import numpy as np 
array = np.array(vectors, dtype=object)
has_null = np.any(array == None)
print(f"List of lists has null values: {has_null}")
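
The null check above does not confirm that every vector has the length the index expects (the index mapping later in this notebook declares a 384-dimensional `knn_vector` field). The following is a minimal sketch of a stricter check, written in plain Python. `find_bad_vectors` is a hypothetical helper introduced for illustration.

```python
def find_bad_vectors(vectors, expected_dim):
    """Return the indices of vectors that are missing, are not sequences,
    have the wrong length, or contain NaN values."""
    bad = []
    for i, vec in enumerate(vectors):
        if vec is None or not hasattr(vec, "__len__") or len(vec) != expected_dim:
            bad.append(i)
        elif any(v != v for v in vec):  # NaN is the only value not equal to itself
            bad.append(i)
    return bad


# Small made-up example with 3-dimensional vectors.
sample = [[0.1, 0.2, 0.3], None, [0.1, 0.2], [0.1, float('nan'), 0.3]]
bad_indices = find_bad_vectors(sample, expected_dim=3)
print(bad_indices)
```

In the notebook this could be called as `find_bad_vectors(vectors, expected_dim=384)` before indexing. An empty result means every vector is usable.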

In the AWS CloudFormation console, open the stack 'geocore-semantic-search-with-opensearch-stage' and, on its Outputs tab, find the values for region, aos_host, and os_secret_id.
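
As an alternative to copying the values by hand, the stack outputs can also be read with boto3. The following is a minimal sketch, assuming the output keys are named exactly as in the text above; the real key names in your stack may differ. `outputs_to_dict` and `get_stack_outputs` are hypothetical helpers, and the sample response fragment is made up.

```python
def outputs_to_dict(stack):
    """Map one entry of describe_stacks()['Stacks'] to {OutputKey: OutputValue}."""
    return {o['OutputKey']: o['OutputValue'] for o in stack.get('Outputs', [])}


def get_stack_outputs(stack_name, region):
    """Fetch the outputs of a CloudFormation stack (requires AWS credentials)."""
    import boto3  # imported lazily so the parsing helper works without AWS access
    cfn = boto3.client('cloudformation', region_name=region)
    stack = cfn.describe_stacks(StackName=stack_name)['Stacks'][0]
    return outputs_to_dict(stack)


# Offline illustration with a made-up response fragment; the output key names
# are assumptions and may differ in your stack.
sample_stack = {'Outputs': [
    {'OutputKey': 'region', 'OutputValue': 'ca-central-1'},
    {'OutputKey': 'os_secret_id', 'OutputValue': 'example-secret-id'},
]}
sample_outputs = outputs_to_dict(sample_stack)
print(sample_outputs)
```

With credentials available, `get_stack_outputs('geocore-semantic-search-with-opensearch-stage', 'ca-central-1')` would return the same kind of dictionary for the real stack.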

In [None]:
!pip install -q opensearch-py

In [None]:
import json
import time
import boto3
from tqdm import tqdm
from urllib.parse import urlparse
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

from opensearchpy.helpers import bulk
import numpy as np


def get_awsauth_from_secret(region, secret_id):
    """
    Retrieves the Amazon OpenSearch Service credentials stored in AWS Secrets Manager.
    """

    client = boto3.client('secretsmanager', region_name=region)
    
    try:
        response = client.get_secret_value(SecretId=secret_id)
        secret = json.loads(response['SecretString'])
        
        master_username = secret['username']
        master_password = secret['password']

        return (master_username, master_password)
    except Exception as e:
        print(f"Error retrieving secret: {e}")
        return None
        

def create_opensearch_connection(aos_host, awsauth):
    try:
        # Create the OpenSearch client
        aos_client = OpenSearch(
            hosts=[{'host': aos_host, 'port': 443}],
            http_auth=awsauth,
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection
            # timeout=60,  # Set a higher timeout value
            # max_retries=10,  # Increase the number of retries
            # retry_on_timeout=True
        )
        # Print the client to confirm the connection
        print("Connection to OpenSearch established:", aos_client)
        return aos_client
    except Exception as e:
        print("Failed to connect to OpenSearch:", e)
        return None

def delete_aos_index_if_exists(aos_client, index_to_delete):
    """
    Deletes the specified index if it exists.

    :param aos_client: An instance of OpenSearch client.
    :param index_to_delete: The name of the index to delete.
    """
    # List all indexes and check if the specified index exists
    all_indices = aos_client.cat.indices(format='json')
    existing_indices = [index['index'] for index in all_indices]
    print("Current indexes:", existing_indices)

    if index_to_delete in existing_indices:
        # Delete the specified index
        try:
            response = aos_client.indices.delete(index=index_to_delete)
            print(f"Deleted index: {index_to_delete}")
            print("Response:", response)
        except Exception as e:
            print(f"Error deleting index {index_to_delete}:", e)
    else:
        print(f"Index {index_to_delete} does not exist.")

    # List all indexes again to confirm deletion
    all_indices_after_deletion = aos_client.cat.indices(format='json')
    existing_indices_after_deletion = [index['index'] for index in all_indices_after_deletion]
    print("Indexes after deletion attempt:", existing_indices_after_deletion)




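def safe_json(raw, default):
    """Safely JSON-decode a field, returning `default` if it is missing or invalid."""
    try:
        if raw is None:
            return default
        return json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: raw is not str/bytes; ValueError: malformed JSON
        return default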


def load_data_to_opensearch_index_v2(df_en, aos_client, index_name, chunk_size=100):

    start = time.time()
    json_en = df_en.to_dict("records")

    # Check vectors
    vectors = [item.get("vector") for item in json_en]
    array = np.array(vectors, dtype=object)
    has_null = np.any(array == None)
    print(f"vector has null values: {has_null}")

    print(f"Preparing {len(json_en)} documents...")


    def generate_actions():
        """Generate bulk actions document-by-document."""
        for x in json_en:

            # Geometry
            bounding_box = safe_json(x.get('features_geometry_coordinates'), [])
            coordinates = {"type": "Polygon", "coordinates": bounding_box}

            # Build source doc (same fields as your original)
            src = {
                'id': x.get('features_properties_id', ''),
                'coordinates': coordinates,
                'title_en': x.get('features_properties_title_en', ''),
                'title_fr': x.get('features_properties_title_fr', ''),
                'description_en': x.get('features_properties_description_en', ''),
                'description_fr': x.get('features_properties_description_fr', ''),
                'published': x.get('features_properties_date_published_date', ''),
                'keywords_en': x.get('features_properties_keywords_en', ''),
                'keywords_fr': x.get('features_properties_keywords_fr', ''),
                'options': safe_json(x.get('features_properties_options'), []),
                'contact': safe_json(x.get('features_properties_contact'), []),
                'cited': safe_json(x.get('features_properties_cited'), []),
                'topicCategory': x.get('features_properties_topicCategory', ''),
                'created': x.get('features_properties_date_created_date', ''),
                'spatialRepresentation': x.get('features_properties_spatialRepresentation', ''),
                'type': x.get('features_properties_type', ''),
                'temporalExtent': x.get('temporalExtent', ''),
                'graphicOverview': safe_json(x.get('features_properties_graphicOverview'), []),
                'language': x.get('features_properties_language', ''),
                'organisation': x.get('features_properties_org', ''),
                'theme': x.get('features_properties_geo_theme', ''),
                'mappable': x.get('features_properties_mappable', ''),
                'foundational': x.get('features_properties_foundational', ''),
                'popularity': int(x.get('features_popularity', 0)),
                'systemName': x.get('features_properties_sourceSystemName', ''),
                'eoCollection': x.get('features_properties_eoCollection', ''),
                'eoFilters': safe_json(x.get('features_properties_eoFilters'), []),
                "vector": x.get("vector", "")
            }

            yield {
                "_index": index_name,
                "_id": x.get('features_properties_id'),
                "_source": src
            }


    print(f"Starting bulk indexing with chunk_size={chunk_size}...")

    try:
        success, errors = bulk(
            aos_client,
            generate_actions(),
            chunk_size=chunk_size,
            max_retries=3,
            request_timeout=120,
        )
    except Exception as e:
        print("\n❌ Bulk indexing failed:")
        print(str(e))
        return

    print(f"✔ Bulk indexing done. Successful: {success}")
    print(f"❗ Errors returned by bulk(): {errors}")
    print(f"⏱ Total time: {time.time() - start:.2f} seconds")

def load_data_to_opensearch_index(df_en, aos_client, index_name, log_level="INFO"):
    """
    Index data from a pandas DataFrame to an OpenSearch index.

    Parameters:
    - df_en: DataFrame containing the data to index.
    - aos_client: OpenSearch client.
    - index_name: Name of the OpenSearch index to which the data will be indexed.
    - log_level: Logging level, defaults to "INFO". Set to "DEBUG" for detailed logs.
    """
    start_time = time.time()

    # Convert DataFrame to a list of dictionaries (JSON)
    json_en = df_en.to_dict("records")
    
    # check if vector has null values 
    vectors = [item['vector'] for item in json_en]
    import numpy as np 
    array = np.array(vectors, dtype=object)
    has_null = np.any(array == None)
    print(f"vector has null values: {has_null}")

    # Index the data
    for x in tqdm(json_en, desc="Indexing Records"):
        try:
            bounding_box = json.loads(x.get('features_geometry_coordinates', '[]'))
            coordinates = {
                "type": "Polygon",
                "coordinates": bounding_box
            }

            document = {
                'id': x.get('features_properties_id', ''),
                'coordinates': coordinates,
                'title_en': x.get('features_properties_title_en', ''),
                'title_fr': x.get('features_properties_title_fr', ''),
                'description_en': x.get('features_properties_description_en', ''),
                'description_fr': x.get('features_properties_description_fr', ''),
                'published': x.get('features_properties_date_published_date', ''),
                'keywords_en': x.get('features_properties_keywords_en', ''),
                'keywords_fr': x.get('features_properties_keywords_fr', ''),
                'options': json.loads(x.get('features_properties_options', '[]')),
                'contact': json.loads(x.get('features_properties_contact', '[]')),
                'cited': json.loads(x.get('features_properties_cited', '[]')),
                'topicCategory': x.get('features_properties_topicCategory', ''),
                'created': x.get('features_properties_date_created_date', ''),
                'spatialRepresentation': x.get('features_properties_spatialRepresentation', ''),
                'type': x.get('features_properties_type', ''),
                'temporalExtent': x.get('temporalExtent', ''),
                'graphicOverview': json.loads(x.get('features_properties_graphicOverview', '[]')),
                'language': x.get('features_properties_language', ''),
                'organisation': x.get('features_properties_org', ''),
                'theme': x.get('features_properties_geo_theme', ''),
                'mappable': x.get('features_properties_mappable', ''),
                'foundational': x.get('features_properties_foundational', ''),
                'popularity': int(x.get('features_popularity', '0')),
                'systemName': x.get('features_properties_sourceSystemName', ''),
                'eoCollection': x.get('features_properties_eoCollection', ''),
                'eoFilters': json.loads(x.get('features_properties_eoFilters', '[]')),
                "vector":x.get("vector", "")
            }

            if log_level == "DEBUG":
                print((json.dumps(document, indent=4)))

            aos_client.index(index=index_name, body=document)

        except Exception as e:
            print(e)
    # Final record count check
    try:
        # track_total_hits returns the exact count instead of capping it at 10,000
        res = aos_client.search(index=index_name, body={"query": {"match_all": {}}, "track_total_hits": True})
        print(f"Total documents in index: {res['hits']['total']['value']}")
    except Exception as e:
        print(f"Error retrieving document count: {e}")


In [None]:
import boto3
from botocore.exceptions import NoCredentialsError
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

REGION = "ca-central-1"
AOS_HOST = "REDACTED.ca-central-1.es.amazonaws.com"
os_secret_id = "OpenSearchSecret-geocore-semantic-search-with-opensearch-prod"


def connect_to_opensearch(REGION, AOS_HOST):
    try:        
        credentials = boto3.Session().get_credentials()
        if not credentials:
            raise NoCredentialsError()
        awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, REGION, 'es', session_token=credentials.token)

        os_client = OpenSearch(
            hosts=[{'host': AOS_HOST, 'port': 443}],
            http_auth=awsauth,
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection
        )
        return os_client
    except NoCredentialsError:
        print("Missing AWS credentials.")
        return None
    except ConnectionError as e:
        print(f"Failed to connect to OpenSearch: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

aos_client = connect_to_opensearch(REGION, AOS_HOST)

### 7. Create an index in Amazon OpenSearch Service
The following index settings configure an OpenSearch index for k-nearest neighbor (KNN) search. KNN search finds the documents whose vectors are closest to a query vector in a high-dimensional space, which is what makes vector-based semantic search possible. The mapping declares the `vector` field as a 384-dimensional `knn_vector` and the `coordinates` field as a `geo_shape`.

How k-NN search works: k-NN search calculates the distance between a query vector and the vectors in the dataset to find the closest matches. OpenSearch stores these vectors in an index and uses specialized algorithms, such as HNSW (Hierarchical Navigable Small World graphs), to perform efficient approximate similarity search at scale.
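
To make the distance calculation concrete, here is a minimal sketch of an exact (brute-force) k-NN search with cosine similarity in plain Python. It is for illustration only: OpenSearch does not scan every vector like this but uses an approximate index structure such as HNSW. The function names and the toy 2-dimensional vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query, vectors, k):
    """Return the k (index, similarity) pairs most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "documents" in a 2-dimensional embedding space.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
neighbors = brute_force_knn([1.0, 0.1], docs, k=2)
print(neighbors)
```

The query points almost along the first axis, so document 0 ranks first and the diagonal document 2 ranks second. The cost of this exact approach grows linearly with the number of documents, which is why an approximate index is used at scale.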



In [None]:
index_name = "minilm-knn-multilingual"
knn_index = {
    "settings": {
        "index.knn": True,  # Enable k-nearest neighbor (KNN) search on this index
        "index.knn.space_type": "cosinesimil",  # Use cosine similarity as the distance measure
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": 384,
                "store": True
            },
            "coordinates":{
              "type": "geo_shape", 
              "store": True 
            }  
        }
    }
}


In [None]:
# Delete the index if it exists
delete_aos_index_if_exists(aos_client, index_to_delete=index_name)

In [None]:
# Create an index
aos_client.indices.create(index=index_name,body=knn_index,ignore=400)

In [None]:
# Load data into the OpenSearch index
load_data_to_opensearch_index(df_en, aos_client, index_name)

In [None]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print(f"Number of records loaded into index {index_name}: {res['hits']['total']['value']}")

### 8. Test "Semantic Search" 

Now that the document vectors are stored in OpenSearch and we can turn a query into a vector, let's perform a KNN search in OpenSearch.
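
The query body and the result handling used in this section can be sketched as two small helpers. This is a minimal illustration rather than the notebook's own code: `build_knn_query` and `hits_to_rows` are hypothetical names, and the response shown is a made-up fragment with the same shape as an OpenSearch search response.

```python
def build_knn_query(query_vector, k, field="vector"):
    """Build an OpenSearch k-NN query body for the given knn_vector field."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(query_vector), "k": k}}},
    }


def hits_to_rows(response):
    """Flatten a search response into (_id, score, title_en) tuples."""
    return [
        (hit["_id"], hit["_score"], hit["_source"].get("title_en", ""))
        for hit in response["hits"]["hits"]
    ]


query_body = build_knn_query([0.1, 0.2, 0.3], k=5)

# Made-up response fragment, used here so the example runs without a cluster.
fake_response = {"hits": {"hits": [
    {"_id": "1", "_score": 0.9, "_source": {"title_en": "Example title"}},
]}}
rows = hits_to_rows(fake_response)
print(query_body["query"]["knn"]["vector"]["k"], rows)
```

Keeping `size` and `k` tied to one parameter avoids asking the k-NN query for fewer neighbors than the number of results requested.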

In [None]:
index_name="minilm-knn-multilingual"
INPUT = "protocole de mesure de la pollution de l’air"
search_vector = sentence_to_vector(INPUT)
print(index_name)

query={
    "size": 20,
    "query": {
        "knn": {
            "vector":{
                "vector":search_vector,
                "k":20
            }
        }
    }
}

res = aos_client.search(index=index_name, size=20,body=query,request_timeout=55)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['title_en'],hit['_source']['id']]
    query_result.append(row)
query_result_df = pd.DataFrame(data=query_result,columns=["_id","relevancy_score","title",'uuid'])
display(query_result_df)

### 9. Test "Semantic Search" using the model endpoints

Instead of encoding the query locally, we can send it to a deployed SageMaker model endpoint to get the query vector, then run the same KNN search in OpenSearch.
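
The two helper functions in the next cell differ mainly in how they serialize the request body: one sends JSON (`application/json`), the other sends raw text (`text/plain`). The following minimal sketch shows what each body looks like for the same payload. Which content type a given endpoint accepts depends on how its model was deployed, which is not shown here.

```python
import json

payload = "inondation"

# Body sent with ContentType='application/json': a JSON string literal (quoted).
json_body = json.dumps(payload)

# Body sent with ContentType='text/plain': the raw text, unchanged.
text_body = str(payload)

print(json_body)
print(text_body)
```

The JSON body includes the surrounding double quotes, so an endpoint that expects plain text would receive them as part of the sentence if the wrong helper were used.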

In [None]:
# Imports for working with SageMaker endpoints
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface.model import HuggingFaceModel

# Initialize a boto3 client for SageMaker
sagemaker_client = boto3.client('sagemaker', region_name='ca-central-1')  # Specify the AWS region

def list_sagemaker_endpoints():
    """List all SageMaker endpoints"""
    try:
        # Get the list of all SageMaker endpoints
        response = sagemaker_client.list_endpoints(SortBy='Name')
        print("Listing SageMaker Endpoints:")
        for endpoint in response['Endpoints']:
            print(f"Endpoint Name: {endpoint['EndpointName']}, Status: {endpoint['EndpointStatus']}")
    except Exception as e:
        print(f"Error listing SageMaker endpoints: {e}")

def invoke_sagemaker_endpoint_ft(endpoint_name, payload):
    """Invoke a SageMaker endpoint to get predictions with ContentType='application/json'."""
    # Initialize the runtime SageMaker client
    runtime_client = boto3.client('runtime.sagemaker', region_name='ca-central-1')  
    try:
        """
        if not isinstance(payload, str):
            payload = str(payload)
        """
        # Invoke the SageMaker endpoint
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='application/json',
            Body=json.dumps(payload)
        )
        # Decode the response
        result = json.loads(response['Body'].read().decode())
        return (result)
        #print(f"Prediction from {endpoint_name}: {result}")
    except Exception as e:
        print(f"Error invoking SageMaker endpoint {endpoint_name}: {e}")

def invoke_sagemaker_endpoint_pretrain(endpoint_name, payload):
    """Invoke a SageMaker endpoint to get predictions with ContentType='text/plain'."""
    # Initialize the runtime SageMaker client
    runtime_client = boto3.client('runtime.sagemaker', region_name='ca-central-1')  

    try:
        # Ensure payload is a string, since ContentType is 'text/plain'
        if not isinstance(payload, str):
            payload = str(payload)
        
        # Invoke the SageMaker endpoint
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='text/plain',
            Body=payload
        )
        
        # Decode the response
        result = json.loads(response['Body'].read().decode())
        return (result)
        #print(f"Prediction from {endpoint_name}: {result}")
    except Exception as e:
        print(f"Error invoking SageMaker endpoint {endpoint_name}: {e}")
    


In [None]:
list_sagemaker_endpoints()

In [None]:
endpoint_name = 'semantic-search-pretrain-all-MiniLM-L6-v2-1743177448'
payload = "inondation"
vector = invoke_sagemaker_endpoint_pretrain(endpoint_name, payload)

In [None]:
index_name="minilm-knn-2"

query={
    "size": 10,
    "query": {
        "knn": {
            "vector":{
                "vector":vector,
                "k":10
            }
        }
    }
}

res = aos_client.search(index=index_name, body=query, request_timeout=55)  # result size (10) comes from the query body
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['title'],hit['_source']['id']]
    query_result.append(row)
query_result_df = pd.DataFrame(data=query_result,columns=["_id","relevancy_score","title",'uuid'])
display(query_result_df)