## Vector Search on Document: Azure Cognitive Search via REST Endpoint

This notebook demonstrates how to use the Azure Cognitive Search (ACS) REST endpoint together with Azure OpenAI to chunk a document, generate embeddings for the chunks, and run a vector search over them. It uses [employee_handbook.pdf](../../data/pdf/employee_handbook.pdf) as the source document.

Key steps in the notebook:

- Create the ACS index from an index definition via the REST endpoint
- Load the source document, chunk it, and generate embeddings
- Ingest the embeddings into the ACS index via the REST endpoint
- Run vector similarity search queries

### Prerequisites

- Create a conda environment from the [cognitive_search_rest_conda.yml](/cognitive_search_rest_conda.yml) file, which lists all the Python dependencies.
- Create a *.env* file from the *.env-template* and populate it with the necessary endpoint URLs and keys.

### Load environment variables

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

acs_key = os.getenv("COGNITIVE_SEARCH_KEY")
if not acs_key:
    # Raise so the cell fails with a clear error when the key is missing or empty.
    raise ValueError("COGNITIVE_SEARCH_KEY environment variable not set.")

aoai_key = os.getenv("AZURE_OPENAI_KEY")
if not aoai_key:
    raise ValueError("AZURE_OPENAI_KEY environment variable not set.")

acs_endpoint = 'https://cogsearch02.search.windows.net'
acs_index_definition = 'index_definition/index_definition_doc.json'
acs_api_version = '2023-07-01-Preview'
aoai_endpoint = 'https://azure-openai-dnai.openai.azure.com'
aoai_api_version = '2023-08-01-preview'
aoai_embedding_deployed_model = 'embedding-ada'

### Helper Methods

In [None]:
import requests
import json

def insert_record(acs_endpoint, acs_index, data, acs_key, acs_api_version):
    url = f"{acs_endpoint}/indexes/{acs_index}/docs/index?api-version={acs_api_version}"
    headers = {
        "Content-Type": "application/json",
        "api-key": acs_key
    }    
    response = requests.post(url, data=data, headers=headers)
    print(response.status_code)
    print(response.content)

def create_index(acs_endpoint, json_content, index_name, api_key, acs_api_version):
    url = f"{acs_endpoint}/indexes/{index_name}?api-version={acs_api_version}"
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key
    }
    response = requests.request('PUT', url, headers=headers, data=json_content)
    print(response.status_code)
    print(response.content)

def search_vector_similarity(query_vector, top_doc_count, acs_endpoint, acs_index, acs_key, acs_api_version):
    url = f"{acs_endpoint}/indexes/{acs_index}/docs/search?api-version={acs_api_version}"

    headers = {
        "Content-Type": "application/json",
        "api-key": acs_key
    }

    # Ask for the k nearest neighbours of query_vector in the vector field
    # and return only the chunk text of each match.
    request_body = {
        "vectors": [{
            "value": query_vector,
            "fields": "chunk_content_vector",
            "k": top_doc_count
        }],
        "select": "chunk_content"
    }
    request_body = json.dumps(request_body)
    response = requests.request('POST', url, headers=headers, data=request_body)
    # Fail on HTTP errors here rather than with a KeyError on the missing 'value' key.
    response.raise_for_status()

    docs = [item['chunk_content'] for item in response.json()['value']]

    return docs

def get_acs_index_name(acs_index_definition):
    index_json_content = read_json_file(acs_index_definition)
    index_json = json.loads(index_json_content)
    index_name = index_json['name']

    return index_name

def read_json_file(file_path):
    with open(file_path, "r") as file:
        return file.read()
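
`insert_record` posts a JSON body of the form `{"value": [...]}` to the `docs/index` endpoint. As a minimal sketch of that envelope, the hypothetical helper `build_index_payload` below (not part of the notebook) wraps documents and sets an explicit `@search.action` on each one; when the action is omitted, as in this notebook, the service treats the document as an `upload`. The two-dimensional vectors are placeholders for illustration only.

```python
import json

def build_index_payload(documents, action="mergeOrUpload"):
    """Wrap documents in the {"value": [...]} envelope (hypothetical helper)."""
    allowed = {"upload", "merge", "mergeOrUpload", "delete"}
    if action not in allowed:
        raise ValueError(f"Unsupported action: {action}")
    # Each document carries its own action alongside its fields.
    return json.dumps(
        {"value": [{"@search.action": action, **doc} for doc in documents]}
    )

# Placeholder documents; real vectors have many more dimensions.
docs = [
    {"id": "0", "chunk_content": "first chunk", "chunk_content_vector": [0.1, 0.2]},
    {"id": "1", "chunk_content": "second chunk", "chunk_content_vector": [0.3, 0.4]},
]
payload = build_index_payload(docs)
print(payload)
```

The resulting string could be passed as the `data` argument of `insert_record`.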

### Create ACS Index

In [None]:
index_definition = read_json_file(acs_index_definition)
index_name = get_acs_index_name(acs_index_definition)

create_index(acs_endpoint, index_definition, index_name, acs_key, acs_api_version)
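
The contents of `index_definition_doc.json` are not shown in this notebook. As a rough sketch only, a definition compatible with the fields used here (`id`, `chunk_content`, `chunk_content_vector`) might look like the dictionary below. The index name, the configuration name `vector-config`, and the dimension of 1536 (which assumes the deployment serves `text-embedding-ada-002`) are all assumptions, and the vector-related property names follow our reading of the `2023-07-01-Preview` schema; check them against the actual file and the API reference.

```python
import json

# Hypothetical index definition; every name and value here is an assumption.
index_definition_sketch = {
    "name": "employee-handbook-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "chunk_content", "type": "Edm.String", "searchable": True},
        {
            "name": "chunk_content_vector",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": 1536,  # assumes text-embedding-ada-002 embeddings
            "vectorSearchConfiguration": "vector-config",
        },
    ],
    "vectorSearch": {
        "algorithmConfigurations": [
            {"name": "vector-config", "kind": "hnsw"}
        ]
    },
}

print(json.dumps(index_definition_sketch, indent=2))
```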

### Chunk Document

In [None]:
from PyPDF2 import PdfReader
import pandas as pd
from langchain.text_splitter import CharacterTextSplitter

pdf_reader = PdfReader('../../data/docs/employee_handbook.pdf')
pages = [page.extract_text() for page in pdf_reader.pages]
text = " ".join(pages)

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_text(text)

df = pd.DataFrame(chunks, columns=["chunk_content"])

print(df.head())
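
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker. It is not the algorithm `CharacterTextSplitter` uses (which splits on the separator first and then merges pieces); it only illustrates how consecutive chunks share `chunk_overlap` characters. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in sample_chunks:
    print(chunk)
```

Each printed chunk starts with the last four characters of the previous one, which is the overlap that helps preserve context across chunk boundaries.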

### Create embeddings

In [None]:
import openai
# Note: openai.embeddings_utils and the module-level openai.api_* settings
# are only available in the pre-1.0 openai package.
from openai.embeddings_utils import get_embedding

openai.api_type = "azure"
openai.api_key = aoai_key
openai.api_base = aoai_endpoint
openai.api_version = aoai_api_version

# Embed every chunk with the deployed Azure OpenAI embedding model.
df['chunk_content_vector'] = df['chunk_content'].apply(
    lambda x: get_embedding(x, engine=aoai_embedding_deployed_model)
)

df['id'] = df.index

print(df.head())

### Ingest to Azure Cognitive Search

This cell works because the dataframe and the ACS index have the same columns. If the dataframe's columns differ from the index fields (in name or number), add a preprocessing step that restructures the dataframe to match the index fields.
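
One possible shape for such a preprocessing step is sketched below on plain dictionaries, such as those produced by `df.to_dict("records")`. The helper `align_records`, the source column names (`doc_id`, `text`, `embedding`, `page`), and the mapping are hypothetical; only the target field names come from this notebook.

```python
def align_records(records, column_to_field):
    """Keep only the mapped columns and rename them to the index field names."""
    aligned = []
    for record in records:
        missing = [col for col in column_to_field if col not in record]
        if missing:
            raise KeyError(f"Record is missing columns: {missing}")
        aligned.append(
            {field: record[col] for col, field in column_to_field.items()}
        )
    return aligned

# Hypothetical rows whose column names differ from the index fields.
rows = [
    {"doc_id": "0", "text": "first chunk", "embedding": [0.1, 0.2], "page": 3},
]
mapping = {
    "doc_id": "id",
    "text": "chunk_content",
    "embedding": "chunk_content_vector",
}
aligned_rows = align_records(rows, mapping)
print(aligned_rows)
```

Columns without a mapping entry (here `page`) are dropped, so the resulting records contain only fields the index knows about.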

In [None]:
import requests
import json

batch_size = 10
total_records = df.shape[0]
fields = df.columns.to_numpy()
df['id'] = df['id'].astype(str)

records = {
    'value': []
}

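for index, row in df.iterrows():
    # Build one document per row, keyed by the dataframe's column names.
    record = {}
    for field in fields:
        record[field] = row[field]

    records['value'].append(record)

    # Send the accumulated records once a full batch is collected or the last row is reached.
    if (index + 1) % batch_size == 0 or (index + 1 == total_records):
        json_data = json.dumps(records)
        insert_record(acs_endpoint, index_name, json_data, acs_key, acs_api_version)
        records['value'] = []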
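
The batching logic can also be written as a small generator that does not depend on the dataframe index. This is a sketch; `iter_batches` is a hypothetical helper and the records are placeholders.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 23 placeholder records split into batches of 10.
sample_records = [{"id": str(i)} for i in range(23)]
batch_sizes = [len(batch) for batch in iter_batches(sample_records, 10)]
print(batch_sizes)  # [10, 10, 3]
```

Each yielded batch could then be wrapped as `{"value": batch}`, serialized with `json.dumps`, and passed to `insert_record`.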

### Perform a vector similarity search

In [None]:
query = 'when are performance reviews announced?'

query_vector = get_embedding(query, engine = aoai_embedding_deployed_model)

search_results = search_vector_similarity(query_vector, 5, acs_endpoint, index_name, acs_key, acs_api_version)

print(search_results)
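
Conceptually, a vector similarity search ranks stored chunks by how close their embeddings are to the query embedding. The sketch below does this by brute force with cosine similarity on toy two-dimensional vectors. It is only an illustration: the metric and the search algorithm the service actually uses are determined by the index definition, and the texts and vectors here are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, documents, k):
    """Return the texts of the k documents most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy (text, vector) pairs standing in for chunks and their embeddings.
toy_docs = [
    ("reviews", [1.0, 0.0]),
    ("holidays", [0.0, 1.0]),
    ("appraisals", [0.9, 0.1]),
]
toy_results = top_k([1.0, 0.0], toy_docs, 2)
print(toy_results)  # ['reviews', 'appraisals']
```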