# Indexing Delimited Files on Azure AI Search

## Overview
LLMs give more reliable answers about your own data when they are grounded on documents retrieved from a vector database (DB). In a few of our tutorials in this repo, we have created vector DBs from unstructured data like PDF documents. Here, we create a vector DB from structured data, which requires some additional preparation steps. We will optionally vectorize (embed) a CSV file, index it using Azure AI Search, and then query the index using a GPT model deployed within Azure AI Studio.

## Prerequisites
We assume you have access to Azure AI Studio and Azure AI Search Service and have already deployed an LLM.

## Learning objectives

This tutorial will cover the following topics:
+ Introduce embeddings from structured data
+ Create Azure AI Search index from command line
+ Query Azure AI Search index from command line using LLMs


## Get started

### Install packages

In [None]:
pip install -U "langchain" "openai" "langchain-openai" "langchain-community"

In [None]:
pip install azure-search-documents --pre --upgrade

### Import CSV data

For this tutorial we are using a Kaggle dataset about data scientist salaries from 2023. This dataset can be downloaded from [here](https://www.kaggle.com/datasets/henryshan/2023-data-scientists-salary).

In [None]:
import pandas as pd
import numpy as np 
# reading the csv file using read_csv
# storing the data frame in variable called df
df = pd.read_csv('ds_salaries.csv')
 
df.head()

Add an ID to each row of your data; this will be the key in our index. If you choose to use your own data, make sure to clean up any trailing whitespace or punctuation. Your headers should not have any spaces between the words.
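The header cleanup described above can be sketched as a small helper. This is a minimal sketch and not part of the original notebook: `clean_header` is a hypothetical name, and replacing inner spaces with underscores is just one possible convention.

```python
import re

def clean_header(name: str) -> str:
    """Return a header with no surrounding whitespace/punctuation and no inner spaces."""
    # Remove leading and trailing characters that are not letters, digits, or underscores
    name = re.sub(r"^\W+|\W+$", "", name)
    # Collapse any run of inner whitespace into a single underscore
    return re.sub(r"\s+", "_", name)

print(clean_header("  job title. "))   # job_title
print(clean_header("salary_in_usd"))   # salary_in_usd
```

If you use your own data, you could apply it with `df.columns = [clean_header(c) for c in df.columns]` before adding the `ID` column.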

In [None]:
df['ID'] = np.arange(df.shape[0]).astype(str)

#making the entire dataset into strings
df= df.astype(str)
df.head()

#### Optional: Adding embeddings to our data

If you want to add embeddings to your data, you can run the code below! Embeddings help our vector store (Azure AI Search) retrieve relevant information based on the query or question you supply to the model. Here we use the embedding model **text-embedding-ada-002** to convert our data into numerical vectors; texts with similar meanings get vectors that are close to each other. Embeddings are usually used for dense text, so if any columns in your dataset contain sentences of text, it is recommended to add embeddings for them. Although the dataset we are using doesn't have such columns, for this example we will create embeddings for the `job_title` column and add them to a new column called `job_title_vector`.
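To make "vectors that are close to each other" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are invented for illustration only (real **text-embedding-ada-002** vectors have 1536 dimensions), and `cosine_similarity` is a hypothetical helper, not something the notebook or Azure AI Search requires you to write.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example; they are NOT real embeddings
toy_vectors = {
    "Data Scientist": [0.9, 0.1, 0.0],
    "ML Engineer": [0.8, 0.3, 0.1],
    "Accountant": [0.0, 0.2, 0.9],
}

reference = toy_vectors["Data Scientist"]
for title, vector in toy_vectors.items():
    print(title, round(cosine_similarity(reference, vector), 3))
```

In this toy data, "ML Engineer" scores much closer to "Data Scientist" than "Accountant" does, which is the kind of signal a vector store uses to rank results.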

**If you don't want to add embeddings you can skip this code cell and run the [next one](#csv2json).**

In [None]:
import os

os.environ["AZURE_OPENAI_ENDPOINT"] = "<Your Azure Endpoint>"
os.environ["AZURE_OPENAI_KEY"] = "<Your Azure AI Key>"

#create embeddings functions to apply to a given column
from openai import AzureOpenAI
    
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),  
    api_version="2023-05-15",
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    )

def generate_embeddings(text, model="text-embedding-ada-002"):
    return client.embeddings.create(input = [text], model=model).data[0].embedding

#adding embeddings for job title to get more accurate search results
#note: `model` in generate_embeddings should be set to the deployment name you chose
#when you deployed the text-embedding-ada-002 (Version 2) model
df['job_title_vector'] = df['job_title'].apply(generate_embeddings)

<a id='csv2json'> Now we will convert our dataframe into JSON format. </a>

In [None]:
df_json = df.to_json(orient="records")

### Connect to our Azure Open AI Models

Here we set the key and endpoint for our Azure OpenAI resource as environment variables. These let us connect to our LLM, which in this case is **gpt-4**.

In [None]:
import os

os.environ["AZURE_OPENAI_ENDPOINT"] = "<Your Azure Endpoint>"
os.environ["AZURE_OPENAI_KEY"] = "<Your Azure AI Key>"

In [None]:
import os
from openai import AzureOpenAI
    
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),  
    api_version="2023-05-15",
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    )

### Create Azure AI Search Service

Enter the names you would like for your AI Search service and index, along with the name of your resource group and the location where you would like your service to be hosted.

In [None]:
service_name='<Your Service Name>'
index_name = '<Your Index Name>'
location = 'eastus2'
resource_group = '<Your Resource Group>'

Authenticate with the Azure CLI and follow the instructions in the output.

In [None]:
! az login

Create your Azure AI Search service. We will be using the free tier, which provides 50 MB of storage and allows you to create up to 3 indexes.

In [None]:
! az search service create --name {service_name} --sku free --location {location} --resource-group {resource_group} --partition-count 1 --replica-count 1

Save the key to a JSON file and then we will save the value to our **search_key** variable.

In [None]:
! az search admin-key show --resource-group {resource_group} --service-name {service_name} > keys.json

In [None]:
import json
with open('keys.json', mode='r') as f:
    data = json.load(f)
search_key = data["primaryKey"]

### Create Azure AI Index

Import the necessary tools to create our index and its fields. This index will be our **vector store**.

In [None]:
import os

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchField,
    SearchableField,
    SearchFieldDataType,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SearchIndex,
    SearchIndexer,
    TextWeights,
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
    ComplexField
)

Create your index client, which we will use to pass information about our index to the search service.

In [None]:
endpoint = "https://{}.search.windows.net/".format(service_name)
#search_key is the admin key we loaded from keys.json above
index_client = SearchIndexClient(endpoint, AzureKeyCredential(search_key))

Next, add the fields to the index; their names are based on the names of your columns. Notice that the **key** is our 'ID' column and that it is a string. Columns that hold integers are also defined as strings, because only string fields can be made full-text searchable with `SearchableField`.

If you **added embeddings** to your data skip to the next section [Adding Embeddings to Vector Store](#Embeddings-to-Vector-Store).

In [None]:
fields = [
    SimpleField(
        name="ID",
        type=SearchFieldDataType.String,
        key=True,
    ),
    SearchableField(
        name="work_year",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="experience_level",
        type=SearchFieldDataType.String,
        searchable=True,
    ),    
    SearchableField(
        name="employment_type",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="job_title",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="salary",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="salary_currency",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="salary_in_usd",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="employee_residence",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="remote_ratio",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="company_location",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="company_size",
        type=SearchFieldDataType.String,
        searchable=True,
    )
]
    
#set our index values
index = SearchIndex(name=index_name, fields=fields)
#create our index
index_client.create_index(index)


<a id='Embeddings-to-Vector-Store'> <h4>Optional: Adding Embeddings to Vector Store</h4> </a> 

If you are working with embeddings, you need to add a **SearchField** that holds a collection, which is your array of numerical values. The field has the same name as the column in our dataset, **job_title_vector**. We also need to set a **vector profile**, which dictates the algorithm our vector store uses to find vectors that are similar to each other (the nearest neighbors). For this profile we use the **Hierarchical Navigable Small World (HNSW) algorithm**. We have named the profile **my-vector-config** and stored the configuration in the **vector_search** variable.

In [None]:
fields = [
    SimpleField(
        name="ID",
        type=SearchFieldDataType.String,
        key=True,
    ),
    SearchableField(
        name="work_year",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="experience_level",
        type=SearchFieldDataType.String,
        searchable=True,
    ),    
    SearchableField(
        name="employment_type",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="job_title",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="job_title_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(generate_embeddings("Text")),
        vector_search_profile_name="my-vector-config"
    ),
    SearchableField(
        name="salary",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="salary_currency",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="salary_in_usd",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="employee_residence",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="remote_ratio",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="company_location",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="company_size",
        type=SearchFieldDataType.String,
        searchable=True,
    )
]

vector_search = VectorSearch(
    profiles=[VectorSearchProfile(name="my-vector-config", algorithm_configuration_name="my-algorithms-config")],
    algorithms=[HnswAlgorithmConfiguration(name="my-algorithms-config")],
)
   
#set our index values
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)
#create our index
index_client.create_index(index)


### Upload Data to our Index

Here we are creating a **search client** that will allow us to upload our data to our index and query our index.

In [None]:
from azure.search.documents import SearchClient
#search_key is the admin key we loaded from keys.json above
search_client = SearchClient(endpoint, index_name, AzureKeyCredential(search_key))

Next we parse our JSON string back into a Python object (a list of dictionaries, one per row), because `upload_documents` expects Python dictionaries rather than a raw JSON string. We then upload each row of our data as a separate document. This is essentially **chunking** our data: the index can retrieve only the rows whose text is similar to our query, which keeps the context passed to the model small and relevant and helps reduce hallucinations.

In [None]:
import json
 
# Parse the JSON string into a Python list of dicts (one dict per row)
data = json.loads(df_json)

# Upload each row as its own document and count any failures
failed = 0
for item in data:
    result = search_client.upload_documents(documents=[item])
    if not result[0].succeeded:
        failed += 1

print("Uploaded {} documents, {} failed".format(len(data) - failed, failed))
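The loop above sends one request per row, which can be slow for larger files. Since `upload_documents` accepts a list of documents, one option is to send rows in batches. The sketch below shows only the batching logic on toy data; `chunked` is a hypothetical helper, and the batch size of 1000 is an assumption that you should check against the current Azure AI Search service limits.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy rows standing in for the parsed `data` list from the notebook
rows = [{"ID": str(i)} for i in range(2500)]
batches = list(chunked(rows, 1000))
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```

In the notebook, the upload loop could then be written as `for batch in chunked(data, 1000): search_client.upload_documents(documents=batch)`.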

### Interacting with our Model

First, we will write our query. You can run any of the ones below or make your own. That query will be passed to our index which will then give us results of documents that held similar text to our query.

In [None]:
query = "Please count how many ML Engineers are there."

In [None]:
query = "Please list the unique job titles."

In [None]:
query = "Please count how many employees worked in 2020."

Here we run our query against the index and then format the results in a way that our model can use. This means first gathering our results in a list, removing any unnecessary keys to lessen the token count, converting that list into a JSON string, and then wrapping each space-separated token in quotes to help the model decipher our query results.

In [None]:
#gathering our query results
search_results = list(search_client.search(query))

#removing any unnecessary keys to lessen the token count (some of these are added by the search service)
#job_title_vector only exists for users who added embeddings to their data
remove_keys = ['job_title_vector', '@search.reranker_score', '@search.highlights', '@search.captions', '@search.score']
for l in search_results:
    for i in remove_keys:
        l.pop(i, None)

In [None]:
#converting that list into JSON format
search_results = json.dumps(search_results)
#wrapping each space-separated token in quotes
context=' '.join('"{}"'.format(word) for word in search_results.split(' '))
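To see what this formatting produces without calling the search service, here is a self-contained sketch of the same steps applied to made-up results. `build_context` and `sample_results` are hypothetical names and values introduced only for illustration.

```python
import json

def build_context(results, remove_keys):
    """Drop unwanted keys from each result, then return a quoted, space-joined JSON string."""
    trimmed = [
        {key: value for key, value in doc.items() if key not in remove_keys}
        for doc in results
    ]
    as_json = json.dumps(trimmed)
    # Wrap every space-separated token in double quotes, as in the cell above
    return " ".join('"{}"'.format(token) for token in as_json.split(" "))

# Made-up search results; real ones come from search_client.search(query)
sample_results = [
    {"ID": "0", "job_title": "ML Engineer", "@search.score": 1.7},
    {"ID": "1", "job_title": "Data Scientist", "@search.score": 1.2},
]

context_preview = build_context(sample_results, ["@search.score", "job_title_vector"])
print(context_preview)
```

Printing the context like this before sending it to the model is a quick way to check which fields survive and roughly how long the prompt will be.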

We will then pass our context and query to our model via a message.

In [None]:
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant who answers only from the given Context and answers the question from the given Query. If you are asked to count then you must count all of the occurrences mentioned."},
        {"role": "user", "content": "Context: "+ context + "\n\n Query: " + query}
    ],
    #max_tokens=100,
    temperature=0,
)

Now we can see our results!

In [None]:
response.choices[0].message.content

## Conclusion
Here we indexed structured data (optionally with embeddings) in Azure AI Search and fed the retrieved results to our LLM. Key skills you learned were to:
+ Create embeddings and a vector store using Azure AI Search
+ Send prompts to the LLM grounded on your structured data

## Clean up

**Warning:** Don't forget to delete the resources we just made to avoid accruing additional costs. This includes shutting down your Azure ML compute, deleting your AI Search resource, and optionally deleting your deployed models in AI Studio.

In [None]:
#delete the search service; this will also delete any indexes it contains
! az search service delete --name {service_name} --resource-group {resource_group} -y