# Indexing Delimited Files on Azure AI Search

### Install Necessary Tools

In [None]:
pip install -U "langchain" "openai"

In [None]:
pip install azure-search-documents --pre --upgrade

### Import Kaggle Dataset

In [None]:
import pandas as pd
import numpy as np 
# reading the csv file using read_csv
# storing the data frame in variable called df
df = pd.read_csv('ds_salaries.csv')
 
df.head()

Add an ID to each row of your data; this will be the key in our index.

In [None]:
df['ID'] = np.arange(df.shape[0]).astype(str)
df.head()
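
Document keys in Azure AI Search may only contain letters, digits, underscores, dashes, and equal signs, so stringified row numbers are a safe choice. The sketch below shows one way to check candidate keys before indexing. `is_valid_key` and `check_keys` are hypothetical helpers, not part of any SDK, and the pattern conservatively allows ASCII letters only.

```python
import re

# Allowed key characters: ASCII letters, digits, underscore, dash, equal sign
_KEY_PATTERN = re.compile(r"[A-Za-z0-9_\-=]+")

def is_valid_key(key):
    """Return True if `key` is a non-empty string of allowed characters."""
    return isinstance(key, str) and _KEY_PATTERN.fullmatch(key) is not None

def check_keys(keys):
    """Raise ValueError if any key is malformed or appears more than once."""
    bad = [k for k in keys if not is_valid_key(k)]
    if bad:
        raise ValueError("invalid keys: {}".format(bad))
    if len(set(keys)) != len(keys):
        raise ValueError("duplicate keys found")

# Row numbers as strings, mirroring np.arange(n).astype(str)
check_keys([str(i) for i in range(5)])
```

With the dataframe above, the same check could be run as `check_keys(df['ID'].tolist())`.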

Now we will cast every column to a string and convert our dataframe to JSON format.

In [None]:
df = df.astype(str)
df_json = df.to_json(orient="records")
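
To make the shape of the `orient="records"` output concrete, the sketch below builds the same kind of structure using only the standard library. It works on a made-up two-row sample rather than the real dataset: each row becomes one JSON object, every value is a string, and the `ID` is the row's position.

```python
import csv
import io
import json

# Made-up sample standing in for ds_salaries.csv
sample_csv = (
    "work_year,job_title,salary\n"
    "2020,ML Engineer,100000\n"
    "2021,Data Scientist,90000\n"
)

# csv.DictReader yields every value as a string, similar to df.astype(str)
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Add a string ID per row, mirroring the df['ID'] step
for i, row in enumerate(rows):
    row["ID"] = str(i)

# One JSON array with one object per row, like to_json(orient="records")
records_json = json.dumps(rows)
print(records_json)
```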

### Connect to our Azure Open AI Models

Here we set the key and endpoint for our Azure OpenAI resource as environment variables, which lets us connect to our LLM, in this case **gpt-35-turbo**.

In [None]:
import os

os.environ["AZURE_OPENAI_ENDPOINT"] = "<Your Azure Endpoint>"
os.environ["AZURE_OPENAI_KEY"] = "<Your Azure AI Key>"

In [None]:
import os
from openai import AzureOpenAI
    
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2023-05-15",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)
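
Because `os.getenv` returns `None` for an unset variable, it can help to fail with a clear message as soon as a variable is missing. One way to do that is a small guard like the hypothetical `require_env` helper sketched here. It is not part of the `openai` package, and the value set below is a placeholder.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError("Environment variable {} is not set".format(name))
    return value

# Demonstration with a placeholder value
os.environ["AZURE_OPENAI_ENDPOINT"] = "<Your Azure Endpoint>"
endpoint_value = require_env("AZURE_OPENAI_ENDPOINT")
```

The client could then be built with `api_key=require_env("AZURE_OPENAI_KEY")` in place of `os.getenv(...)`.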

### Create Azure AI Search Service

Enter the names you would like for your AI Search service and index, along with the name of your resource group and the location where the service should be hosted.

In [None]:
service_name='<Your Service Name>'
index_name = '<Your Index Name>'
location = 'eastus2'
resource_group = '<Your Resource Group>'

Create your Azure AI Search service.

In [None]:
! az search service create --name {service_name} --sku free --location {location} --resource-group {resource_group} --partition-count 1 --replica-count 1

Save the admin keys to a JSON file, then load the primary key into our **search_key** variable.

In [None]:
! az search admin-key show --resource-group {resource_group} --service-name {service_name} > keys.json

In [None]:
import json
with open('keys.json', mode='r') as f:
    data = json.load(f)
search_key = data["primaryKey"]
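
If the `az` command fails (for example, because you are not logged in), `keys.json` may be empty or lack the expected field. A minimal sketch of a more defensive loader follows. `load_primary_key` is a hypothetical helper, and the sample file written here contains placeholder values only.

```python
import json
import os
import tempfile

def load_primary_key(path):
    """Read the admin-key JSON at `path` and return its primaryKey value."""
    with open(path, mode="r") as f:
        try:
            data = json.load(f)
        except json.JSONDecodeError as exc:
            raise RuntimeError("{} does not contain valid JSON".format(path)) from exc
    if "primaryKey" not in data:
        raise RuntimeError("{} has no 'primaryKey' field".format(path))
    return data["primaryKey"]

# Demonstration with a placeholder file shaped like the CLI output
sample_path = os.path.join(tempfile.mkdtemp(), "keys.json")
with open(sample_path, mode="w") as f:
    json.dump({"primaryKey": "<primary>", "secondaryKey": "<secondary>"}, f)

sample_key = load_primary_key(sample_path)
```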

### Create Azure AI Index

Import the necessary tools to create our index and the fields.

In [None]:
import os

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
#from azure.search.documents.models import VectorizedQuery
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchField,
    SearchableField,
    SearchFieldDataType,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SearchIndex,
    SearchIndexer,
    TextWeights,
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
    ComplexField
    #IndexingParametersConfiguration
)

Create your index client, which we will use to pass information about our index to the service.

In [None]:
endpoint = "https://{}.search.windows.net/".format(service_name)
index_client = SearchIndexClient(endpoint, AzureKeyCredential(search_key))

Next, add the fields for your index, which are named after the columns of your data. Notice that the key is our 'ID' column and is a string. Even columns that hold integers are defined as strings, because full-text search only applies to string fields and we want to be able to search and retrieve every column.

In [None]:
fields = [
    SimpleField(
        name="ID",
        type=SearchFieldDataType.String,
        key=True,
    ),
    SearchField(
        name="work_year",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="experience_level",
        type=SearchFieldDataType.String,
        searchable=True,
    ),    
    SearchField(
        name="employment_type",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="job_title",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="salary",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="salary_currency",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="salary_in_usd",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="employee_residence",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="remote_ratio",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="company_location",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="company_size",
        type=SearchFieldDataType.String,
        searchable=True,
    )
]
    
#set our index values
index = SearchIndex(name=index_name, fields=fields)
#create our index
index_client.create_index(index)
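
Since every column is mapped the same way, the field list can also be generated from the column names. The sketch below does this with plain dictionaries shaped like the field definitions in the service's REST API, so it runs without the SDK. `build_field_definitions` is a hypothetical helper, and the column list is a shortened example.

```python
def build_field_definitions(columns, key_column="ID"):
    """Describe each column as a string field; the key column is marked as the key."""
    definitions = []
    for name in columns:
        if name == key_column:
            definitions.append({"name": name, "type": "Edm.String", "key": True})
        else:
            definitions.append({"name": name, "type": "Edm.String", "searchable": True})
    return definitions

# Shortened example; the real list would come from the dataframe's columns
example_columns = ["ID", "work_year", "job_title", "salary"]
for definition in build_field_definitions(example_columns):
    print(definition)
```

With the SDK, the same loop could build `SimpleField` and `SearchField` objects instead of dictionaries.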


### Upload Data to our Index

Here we are creating a search client that will allow us to upload our data to our index and query our index.

In [None]:
from azure.search.documents import SearchClient
search_client = SearchClient(endpoint, index_name, AzureKeyCredential(search_key))

Next we will parse our JSON string into Python objects: `df_json` looks like JSON but is a single Python string, and `json.loads` turns it into a list of dictionaries, one per row. After that we will upload each row as a separate document. This lets the index match our query against individual rows and retrieve only the documents whose text is similar to our query.

In [None]:
import json
 
# Parse the JSON string into a list of Python dicts, one per row
data = json.loads(df_json)

# Upload each row as its own document and count the successful uploads
succeeded = 0
for item in data:
    result = search_client.upload_documents(documents=[item])
    succeeded += result[0].succeeded

print("Uploaded {} of {} documents".format(succeeded, len(data)))
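
Uploading one document per request is simple but slow for larger datasets. `upload_documents` accepts a list, so rows can be sent in batches; the service accepts up to 1,000 documents per request. The sketch below shows only the chunking logic, with a hypothetical `batched` helper and a made-up list of rows. The upload call itself appears as a comment because it needs a live service.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up rows standing in for the parsed dataset
rows = [{"ID": str(i)} for i in range(2500)]

batch_sizes = []
for batch in batched(rows, 1000):
    # With a live service: search_client.upload_documents(documents=batch)
    batch_sizes.append(len(batch))

print(batch_sizes)
```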

### Interacting with our Model

First we will write our query: run any of the ones below or make your own. The query will be passed to our index, which returns the documents containing text similar to the query.

In [None]:
query = "Count how many ML Engineers are there?"

In [None]:
query = "please count how many rows are in my data?"

In [None]:
query = "Count how many employees worked in 2020?"

Here we run our query against the index and then format the results in a way that our model can understand. This means first gathering the results in a list, converting that list to a JSON string, and then wrapping each space-separated token in double quotes to help the model parse the results.

In [None]:
search_results = list(search_client.search(query))
search_results = json.dumps(search_results)
context = ' '.join('"{}"'.format(word) for word in search_results.split(' '))
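
To see what this formatting produces, the sketch below applies the same two steps to a made-up result list instead of a live search response. `build_context` is a hypothetical wrapper around the code above.

```python
import json

def build_context(results):
    """Serialize results to JSON, then wrap each space-separated token in double quotes."""
    serialized = json.dumps(results)
    return ' '.join('"{}"'.format(word) for word in serialized.split(' '))

# Made-up search result standing in for search_client.search(query)
sample_results = [{"ID": "0", "job_title": "ML Engineer"}]
sample_context = build_context(sample_results)
print(sample_context)
```

Note that multi-word values such as job titles are split into separately quoted tokens.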

We will then pass our context and query to our model via a message.

In [None]:
response = client.chat.completions.create(
    model="gpt-35-turbo",  # must match the name of your Azure OpenAI deployment
    messages=[
        {"role": "system", "content": "You are a helpful assistant who answers only from the given Context and answers the question from the given Query. If you are asked to count then you must count all of the occurrences mentioned."},
        {"role": "user", "content": "Context: " + context + "\n\n Query: " + query}
    ],
    #max_tokens=100,
    temperature=1,
    top_p=1,
    n=1
)
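
The message layout used above can be factored into a small function, which makes it easier to inspect the prompt before sending it. This is a sketch only: `build_messages` is a hypothetical helper, and the system prompt is shortened for illustration.

```python
def build_messages(context, query):
    """Return chat messages that pair retrieved context with the user's query."""
    system_prompt = (
        "You are a helpful assistant who answers only from the given Context "
        "and answers the question from the given Query."
    )
    user_prompt = "Context: " + context + "\n\n Query: " + query
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# Placeholder context standing in for real search results
example_messages = build_messages('"[]"', "Count how many employees worked in 2020?")
for message in example_messages:
    print(message["role"], "->", message["content"])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`.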

Now we can see our results!

In [None]:
response.choices[0].message.content