# Introduction

Welcome to this repository. We will walk you through a series of notebooks that explain how RAG (Retrieval Augmented Generation) works. RAG is a technique that combines search with AI text generation to answer user queries. We will work with different sources (Azure Cognitive Search, files, SQL Server, websites, etc.), and by the end of the notebooks you will understand why the magic happens through the combination of:

1. Multi-Agents: Agents talking to each other
2. GPT-4-32k: The best model available
3. Very detailed prompts

But we need to start from the basics, so let's begin with Azure Cognitive Search and how it works:


# Load and Enrich Multiple File Types with Azure Cognitive Search

In this Jupyter Notebook, we create and run enrichment steps to unlock searchable content in the specified Azure blob. It performs operations over mixed content in Azure Storage, such as images and application files, using a skillset that analyzes and extracts text information that becomes searchable in Azure Cognitive Search.
The reference sample can be found at [Tutorial: Use Python and AI to generate searchable content from Azure blobs](https://docs.microsoft.com/azure/search/cognitive-search-tutorial-blob-python).

In this notebook we create our data sources and indexes.

Although only PDF files are used here, the same approach works at a much larger scale, and Azure Cognitive Search supports a range of other file formats, including Microsoft Office (DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).

This notebook creates the following objects on your search service:

- data source
- skillset
- search index
- indexer

This notebook calls the [Search REST APIs](https://docs.microsoft.com/rest/api/searchservice/), but you can also perform the same steps with the `azure-search-documents` client library from the Azure SDK for Python. See this [Python quickstart](https://docs.microsoft.com/azure/search/search-get-started-python) for details.

To run this notebook, you should have already created the Azure services described in the README. Once you have done this, you can run all cells, but the query won't return results until the indexer has finished and the search index is loaded.

We recommend running each step and making sure it completes before moving on.
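
Before running the cells, it can help to confirm that every setting the notebook reads is actually defined. The sketch below is a minimal check, assuming the values come from `credentials.env`; the setting names are the ones used by the cells in this notebook, while the helper `missing_settings` is a hypothetical name introduced only for illustration.

```python
import os

# Setting names read by the cells in this notebook.
REQUIRED_SETTINGS = [
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_API_VERSION",
    "ADLS_CONNECTION_STRING",
    "COG_SERVICES_NAME",
    "COG_SERVICES_KEY",
]


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Run this after load_dotenv("credentials.env") has populated os.environ.
missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Passing the environment as an argument keeps the helper easy to try out with a plain dictionary.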


![cog-search](./images/Cog-Search-Enrich.png)


In [1]:
import os
import json
import requests
from dotenv import load_dotenv

load_dotenv("credentials.env")

# Name of the container in your Blob Storage data source (the connection string is read from credentials.env)
BLOB_CONTAINER_NAME = "hack"

In [None]:
# Setup the Payloads header
headers = {
    "Content-Type": "application/json",
    "api-key": os.environ["AZURE_SEARCH_KEY"],
}
params = {"api-version": os.environ["AZURE_SEARCH_API_VERSION"]}

## Create Index for VBD File


In [2]:
# Define the names for the data source, skillset, index and indexer for the Azure Search service
datasource_name = "ds-vbd"
skillset_name = "adlsgen2-skillset"
index_name = "adlsgen2-index"
indexer_name = "adlsgen2-indexer"

### Create Data Source (Blob container with the Arxiv CS pdfs)


In [4]:
# The following code sends the JSON payload to the Azure Search engine to create the data source

datasource_payload = {
    "name": datasource_name,
    "description": "VBD File cognitive search capabilities.",
    "type": "adlsgen2",
    "credentials": {"connectionString": os.environ["ADLS_CONNECTION_STRING"]},
    "container": {"name": BLOB_CONTAINER_NAME, "query": "vbd"},
}
r = requests.put(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/datasources/" + datasource_name,
    data=json.dumps(datasource_payload),
    headers=headers,
    params=params,
)
print(r.status_code)
print(r.ok)

201
True


In [None]:
# The following code sends the JSON payload to the Azure Search engine to create a second data source
datasource_name = "cogsrch-datasource-esxp"
datasource_payload = {
    "name": datasource_name,
    "description": "Index Data Lake Gen 2",
    "type": "adlsgen2",
    "credentials": {"connectionString": os.environ["ADLS_CONNECTION_STRING"]},
    "container": {"name": BLOB_CONTAINER_NAME, "query": "esxp"},
}
r = requests.put(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/datasources/" + datasource_name,
    data=json.dumps(datasource_payload),
    headers=headers,
    params=params,
)
print(r.status_code)
print(r.ok)

201
True


- 201 - Successfully created
- 204 - Successfully overwritten
- 40X - Client error, most often an authentication problem (wrong endpoint or key)

For information on detecting changed and deleted files, see [the documentation](https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs?tabs=rest-api).
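
The status codes listed above can be turned into a small helper so that each cell prints a readable message instead of a bare number. This is a minimal sketch: `describe_status` is a hypothetical name, and the messages are illustrative rather than text returned by the service.

```python
def describe_status(status_code):
    """Map the HTTP status code of a create-or-update (PUT) call to a short message."""
    if status_code == 201:
        return "created"
    if status_code == 204:
        return "overwritten"
    if 200 <= status_code < 300:
        return "success"
    if status_code in (401, 403):
        return "authentication error: check the endpoint and API key"
    if 400 <= status_code < 500:
        return "client error: inspect r.text for details"
    if status_code >= 500:
        return "server error: retry later"
    return "unexpected status"


for code in (201, 204, 403):
    print(code, "-", describe_status(code))
```

In the notebook this would be called as `describe_status(r.status_code)` right after each `requests.put` call.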


In [5]:
# If you get a 403 status code, the endpoint or key is probably wrong. Uncomment the next line to debug.
# r.text

### 02- Create Skillset - OCR, Text Splitter, Language Detection, KeyPhrase extraction, Entity Recognition


In [6]:
# Create a skillset
skillset_payload = {
    "name": skillset_name,
    "description": "Extract entities, detect language and extract key-phrases",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "description": "Extract text (plain and structured) from image.",
            "context": "/document/normalized_images/*",
            "defaultLanguageCode": "en",
            "detectOrientation": True,
            "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
            "outputs": [{"name": "text", "targetName": "images_text"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
            "description": "Create merged_text, which includes the textual representation of each image inserted at the right location in the content field. This is useful for PDF and other file formats that support embedded images.",
            "context": "/document",
            "insertPreTag": " ",
            "insertPostTag": " ",
            "inputs": [
                {"name": "text", "source": "/document/content"},
                {
                    "name": "itemsToInsert",
                    "source": "/document/normalized_images/*/images_text",
                },
                {
                    "name": "offsets",
                    "source": "/document/normalized_images/*/contentOffset",
                },
            ],
            "outputs": [{"name": "mergedText", "targetName": "merged_text"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "context": "/document",
            "textSplitMode": "pages",
            "maximumPageLength": 5000,  # 5000 is default
            "defaultLanguageCode": "en",
            "inputs": [{"name": "text", "source": "/document/merged_text"}],
            "outputs": [{"name": "textItems", "targetName": "pages"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document/pages/*",
            "maxKeyPhraseCount": 2,
            "defaultLanguageCode": "en",
            "inputs": [{"name": "text", "source": "/document/pages/*"}],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
            "context": "/document/pages/*",
            "categories": ["Person", "URL", "Email"],
            "minimumPrecision": 0.5,
            "defaultLanguageCode": "en",
            "inputs": [{"name": "text", "source": "/document/pages/*"}],
            "outputs": [
                {"name": "persons", "targetName": "persons"},
                {"name": "urls", "targetName": "urls"},
                {"name": "emails", "targetName": "emails"},
            ],
        },
    ],
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
        "description": os.environ["COG_SERVICES_NAME"],
        "key": os.environ["COG_SERVICES_KEY"],
    },
}

r = requests.put(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/skillsets/" + skillset_name,
    data=json.dumps(skillset_payload),
    headers=headers,
    params=params,
)
print(r.status_code)
print(r.ok)

201
True


## Create Index


Create the indexes `cogsrch-index-sales-cs` and `adlsgen2-index`.
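
The field definitions in the index payloads repeat the same six attributes for every field. As a minimal sketch, a small helper can build each definition from defaults and only spell out what differs. `make_field` is a hypothetical helper, not part of any Azure library, and it emits Python booleans (serialized as JSON `true`/`false`) rather than the quoted strings used in the cells of this notebook.

```python
import json


def make_field(name, edm_type="Edm.String", *, searchable=False, filterable=False,
               retrievable=True, sortable=False, facetable=False, key=False, **extra):
    """Build one index field definition with explicit boolean attributes."""
    field = {
        "name": name,
        "type": edm_type,
        "searchable": searchable,
        "filterable": filterable,
        "retrievable": retrievable,
        "sortable": sortable,
        "facetable": facetable,
        "key": key,
    }
    # Extra attributes such as analyzer="standard.lucene" are passed through unchanged.
    field.update(extra)
    return field


fields = [
    make_field("id", key=True),
    make_field("ServiceName", searchable=True, filterable=True),
    make_field("Description", searchable=True),
]
print(json.dumps(fields[0], indent=2))
```

Keeping the defaults in one place also makes a typo in a single attribute value much easier to spot.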


In [None]:
# Create an index cogsrch-index-sales-cs
index_name = "cogsrch-index-sales-cs"

# Queries operate over the searchable fields and filterable fields in the index

index_payload = {
    "name": index_name,
    "fields": [
        {
            "name": "id",
            "type": "Edm.String",
            "searchable": "false",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "true",
        },
        {
            "name": "ServiceName",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "true",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "ServiceType",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "ServiceCategory",
            "type": "Edm.String",
            "searchable": "false",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "ServiceFamily",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "Price",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "PriceUSD",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "Currency",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "Description",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "url",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "InternalComments",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "Duration",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "FocusArea",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "ActionArea",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "PrimaryTechnology",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "TargetLevel",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "MaturityLevel",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "DeliveryDomain",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "ServiceDivision",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "ContentLanguage",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "vectorized",
            "type": "Edm.Boolean",
            "searchable": "false",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "metadata_storage_name",
            "type": "Edm.String",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "metadata_storage_path",
            "type": "Edm.String",
            "searchable": "false",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "true",
            "key": "false",
        },
        {
            "name": "metadata_storage_last_modified",
            "type": "Edm.DateTimeOffset",
            "searchable": "false",
            "filterable": "false",
            "retrievable": "false",
            "sortable": "false",
            "facetable": "true",
            "key": "false",
        },
        {
            "name": "chunks",
            "type": "Collection(Edm.String)",
            "searchable": "false",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
        },
        {
            "name": "Vector",
            "type": "Collection(Edm.Single)",
            "searchable": "true",
            "filterable": "false",
            "retrievable": "true",
            "sortable": "false",
            "facetable": "false",
            "key": "false",
            "dimensions": 1536,
            "vectorSearchProfile": "vectorConfig-profile",
            "synonymMaps": [],
        },
    ],
    "scoringProfiles": [],
    "suggesters": [],
    "analyzers": [],
    "normalizers": [],
    "tokenizers": [],
    "tokenFilters": [],
    "charFilters": [],
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": {"fieldName": "ServiceName"},
                    "prioritizedContentFields": [{"fieldName": "Description"}],
                    "prioritizedKeywordsFields": [],
                },
            }
        ],
    },
    "vectorSearch": {
        "algorithms": [
            {
                "name": "vectorConfig",
                "kind": "hnsw",
                "hnswParameters": {
                    "metric": "cosine",
                    "m": 4,
                    "efConstruction": 400,
                    "efSearch": 500,
                },
            }
        ],
        "profiles": [
            {
                "name": "vectorConfig-profile",
                "algorithm": "vectorConfig",
                "vectorizer": "vectorizer-1700846291065",
            }
        ],
        "vectorizers": [
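            {
                "name": "vectorizer-1700846291065",
                "kind": "azureOpenAI",
                "azureOpenAIParameters": {
                    "resourceUri": "https://openai-uhgh5rtmeij4u.openai.azure.com",
                    "deploymentId": "text-embedding-ada-002",
                    # Read the key from the environment rather than hardcoding a secret.
                    # AZURE_OPENAI_API_KEY is an assumed variable name; use the one defined in your credentials.env.
                    "apiKey": os.environ["AZURE_OPENAI_API_KEY"],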
                },
            }
        ],
    },
}


r = requests.put(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexes/" + index_name,
    data=json.dumps(index_payload),
    headers=headers,
    params=params,
)

print(r.status_code)
print(r.ok)

201
True


In [7]:
# Create an index adlsgen2-index
index_name = "adlsgen2-index"
index_payload = (
    {
        "name": index_name,

        "fields": [
            {
                "name": "content",
                "type": "Edm.String",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_storage_content_type",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_storage_size",
                "type": "Edm.Int64",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_storage_last_modified",
                "type": "Edm.DateTimeOffset",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_storage_content_md5",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_storage_name",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_storage_path",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "true",
            },
            {
                "name": "metadata_storage_file_extension",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_content_type",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_author",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_creation_date",
                "type": "Edm.DateTimeOffset",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_last_modified",
                "type": "Edm.DateTimeOffset",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_slide_count",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "metadata_title",
                "type": "Edm.String",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "people",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "organizations",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "translated_text",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "fr.lucene",
            },
            {
                "name": "language",
                "type": "Edm.String",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "keyphrases",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "merged_content",
                "type": "Edm.String",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "text",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "layoutText",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
                "synonymMaps": [],
            },
            {
                "name": "imageTags",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "imageCaption",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "vectorized",
                "type": "Edm.Boolean",
                "searchable": "false",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "true",
                "facetable": "false",
                "key": "false",
            },
            {
                "name": "chunks",
                "type": "Collection(Edm.String)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "true",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "analyzer": "standard.lucene",
            },
            {
                "name": "vector",
                "type": "Collection(Edm.Single)",
                "searchable": "true",
                "filterable": "false",
                "retrievable": "false",
                "sortable": "false",
                "facetable": "false",
                "key": "false",
                "dimensions": 1536,
                "vectorSearchProfile": "vector-profile-1701115615080",
                "synonymMaps": [],
            },
        ],
        "scoringProfiles": [],
        "suggesters": [],
        "analyzers": [],
        "normalizers": [],
        "tokenizers": [],
        "tokenFilters": [],
        "charFilters": [],
        "semantic": {
            "configurations": [
                {
                    "name": "my-semantic-config",
                    "prioritizedFields": {
                        "titleField": {"fieldName": "metadata_title"},
                        "prioritizedContentFields": [
                            {"fieldName": "content"},
                            {"fieldName": "keyphrases"},
                            {"fieldName": "translated_text"},
                            {"fieldName": "merged_content"},
                        ],
                        "prioritizedKeywordsFields": [
                            {"fieldName": "keyphrases"},
                            {"fieldName": "imageTags"},
                        ],
                    },
                }
            ]
        },
        "vectorSearch": {
            "algorithms": [
                {
                    "name": "vector-config-1701115557885",
                    "kind": "hnsw",
                    "hnswParameters": {
                        "metric": "cosine",
                        "m": 4,
                        "efConstruction": 400,
                        "efSearch": 500,
                    },
                }
            ],
            "profiles": [
                {
                    "name": "vector-profile-1701115615080",
                    "algorithm": "vector-config-1701115557885",
                    "vectorizer": "vectorizer-1701115625551",
                }
            ],
            "vectorizers": [
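                {
                    "name": "vectorizer-1701115625551",
                    "kind": "azureOpenAI",
                    "azureOpenAIParameters": {
                        "resourceUri": "https://openai-uhgh5rtmeij4u.openai.azure.com",
                        "deploymentId": "text-embedding-ada-002",
                        # Read the key from the environment rather than hardcoding a secret.
                        # AZURE_OPENAI_API_KEY is an assumed variable name; use the one defined in your credentials.env.
                        "apiKey": os.environ["AZURE_OPENAI_API_KEY"],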
                    },
                }
            ],
        },
    }  # no trailing comma here: a comma would turn index_payload into a one-element tuple
)



r = requests.put(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexes/" + index_name,
    data=json.dumps(index_payload),
    headers=headers,
    params=params,
)
print(r.status_code)
print(r.ok)

201
True


In [8]:
# print(r.text)

### Semantic Search capabilities

As you can see above in the index payload, there is a `semantic configuration`. What is that?

Azure Search has a feature called Semantic Search. It is a deep neural network running inside the search engine that ranks results by the semantic meaning of the query and the content, not by keyword matching or counting.
From the [official documentation](https://learn.microsoft.com/en-us/azure/search/semantic-search-overview):

Semantic search is a collection of features that improve the quality of initial search results for text-based queries. When you enable it on your search service, semantic search extends the query execution pipeline in two ways:

- First, it adds secondary ranking over an initial result set, promoting the most semantically relevant results to the top of the list.

- Second, it extracts and returns captions and answers in the response, which you can render on a search page to improve the user's search experience.

For a deeper explanation and the limitations, see [the semantic ranking documentation](https://learn.microsoft.com/en-us/azure/search/semantic-ranking).
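
To make the two extensions described above concrete, here is a minimal sketch of a query body that asks for semantic ranking plus captions and answers. It assumes an API version that accepts `semanticConfiguration` in the request body (some older preview versions may also expect a `queryLanguage` property). `build_semantic_query` is a hypothetical helper, and the query text is only a placeholder.

```python
import json


def build_semantic_query(text, config_name="my-semantic-config", top=5):
    """Build the body of a semantic query for the Search Documents REST call."""
    return {
        "search": text,
        "queryType": "semantic",  # enables the secondary semantic ranking
        "semanticConfiguration": config_name,  # name defined in the index payload
        "captions": "extractive",  # return captions with each result
        "answers": "extractive",  # return answers when the service finds one
        "top": top,
        "count": True,
    }


query_payload = build_semantic_query("example question about the indexed files")
print(json.dumps(query_payload, indent=2))

# To send it, POST the body to the index's docs/search endpoint, reusing the
# `headers` and `params` defined earlier in this notebook:
# r = requests.post(
#     os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexes/" + index_name + "/docs/search",
#     data=json.dumps(query_payload),
#     headers=headers,
#     params=params,
# )
```

The query only returns results once the indexer has finished loading the index.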


## Create and Run the Indexer - (runs the pipeline)


The three components you have created thus far (data source, skillset, index) are inputs to an indexer. Creating the indexer on Azure Cognitive Search is the event that puts the entire pipeline into motion.
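
Because the indexer runs asynchronously, it is useful to check its progress once it has been created. The sketch below summarizes the JSON body returned by the Get Indexer Status REST call (`GET /indexers/{name}/status`). `summarize_indexer_status` is a hypothetical helper, and the sample response is invented for illustration, not real output.

```python
def summarize_indexer_status(status):
    """Summarize the JSON body returned by GET /indexers/{name}/status."""
    last = status.get("lastResult") or {}
    return {
        "indexer_state": status.get("status"),
        "last_run": last.get("status"),
        "processed": last.get("itemsProcessed", 0),
        "failed": last.get("itemsFailed", 0),
        "succeeded": last.get("status") == "success",
    }


# Hypothetical response body, for illustration only.
sample = {
    "status": "running",
    "lastResult": {"status": "inProgress", "itemsProcessed": 3, "itemsFailed": 0},
}
print(summarize_indexer_status(sample))

# In the notebook, the real body could be fetched with the `headers` and `params`
# defined earlier:
# r = requests.get(
#     os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexers/" + indexer_name + "/status",
#     headers=headers,
#     params=params,
# )
# print(summarize_indexer_status(r.json()))
```

Re-running the status call until `succeeded` is true is a simple way to know when queries will start returning results.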


In [None]:
# Create an indexer adlsgen2-indexer
indexer_payload = {
    "name": indexer_name,
    "dataSourceName": datasource_name,
    "targetIndexName": index_name,
    "skillsetName": skillset_name,
    "schedule": {
        "interval": "PT2H"
    },  # How often do you want to check for new content in the data source
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "metadata_storage_path",
            # Use None (serialized as JSON null), not the string "null"
            "mappingFunction": {"name": "base64Encode", "parameters": None},
        }
    ],
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/merged_content/pages/*/people/*",
            "targetFieldName": "people",
        },
        {
            "sourceFieldName": "/document/merged_content/pages/*/organizations/*",
            "targetFieldName": "organizations",
        },
        {
            "sourceFieldName": "/document/merged_content/pages/*/translated_text",
            "targetFieldName": "translated_text",
        },
        {"sourceFieldName": "/document/language", "targetFieldName": "language"},
        {
            "sourceFieldName": "/document/merged_content/pages/*/keyphrases/*",
            "targetFieldName": "keyphrases",
        },
        {
            "sourceFieldName": "/document/merged_content",
            "targetFieldName": "merged_content",
        },
        {
            "sourceFieldName": "/document/merged_content/pages/*",
            "targetFieldName": "chunks",
        },
        {
            "sourceFieldName": "/document/normalized_images/*/text",
            "targetFieldName": "text",
        },
        {
            "sourceFieldName": "/document/normalized_images/*/layoutText",
            "targetFieldName": "layoutText",
        },
        {
            "sourceFieldName": "/document/normalized_images/*/imageTags/*/name",
            "targetFieldName": "imageTags",
        },
        {
            "sourceFieldName": "/document/normalized_images/*/imageCaption",
            "targetFieldName": "imageCaption",
        },
    ],
    "parameters": {
        "maxFailedItems": -1,
        "maxFailedItemsPerBatch": -1,
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",
        },
    },
}

r = requests.put(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexers/" + indexer_name,
    data=json.dumps(indexer_payload),
    headers=headers,
    params=params,
)
print(r.status_code)
print(r.ok)

201
True
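
The `schedule.interval` value `PT2H` in the payload above is an ISO 8601 duration meaning "every two hours". The sketch below shows how such a string maps to a `timedelta`. The helper names are hypothetical, and the 5-minute and 1-day bounds reflect the documented scheduling limits as I understand them, so verify them against the current service documentation.

```python
import re
from datetime import timedelta

# Day/time subset of ISO 8601 durations, e.g. "PT2H", "PT30M", "P1D"
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_interval(value):
    """Convert a duration string such as 'PT2H' into a timedelta."""
    match = _DURATION.match(value)
    parts = {k: int(v) for k, v in (match.groupdict() if match else {}).items() if v}
    if not parts:
        raise ValueError(f"Not a supported duration: {value!r}")
    return timedelta(**parts)


def is_valid_schedule(value):
    """Check the interval against the assumed 5 minute to 1 day range."""
    return timedelta(minutes=5) <= parse_interval(value) <= timedelta(days=1)


print(parse_interval("PT2H"))      # 2:00:00
print(is_valid_schedule("PT2H"))   # True
print(is_valid_schedule("PT1M"))   # False
```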


In [9]:
# Create an indexer cogsrch-index-sales-esxp
indexer_payload = {
    "name": "cogsrch-index-sales-esxp",
    "dataSourceName": "cogsrch-datasource-esxp",
    "targetIndexName": "cogsrch-index-sales-cs",
    "skillsetName": "cogsrch-skillset-esxp",
    "schedule": {
        "interval": "PT2H"
    },  # How often do you want to check for new content in the data source
    "fieldMappings": [
        {
            "sourceFieldName": "ItemCode",
            "targetFieldName": "id",
        },
        {
            "sourceFieldName": "ServiceName",
            "targetFieldName": "ServiceName",
        },
        {
            "sourceFieldName": "ServiceType",
            "targetFieldName": "ServiceType",
        },
        {
            "sourceFieldName": "ServiceCategory",
            "targetFieldName": "ServiceCategory",
        },
        {
            "sourceFieldName": "ServiceFamily",
            "targetFieldName": "ServiceFamily",
        },
        {
            "sourceFieldName": "Price",
            "targetFieldName": "Price",
        },
        {
            "sourceFieldName": "PriceUSD",
            "targetFieldName": "PriceUSD",
        },
        {
            "sourceFieldName": "Currency",
            "targetFieldName": "Currency",
        },
        {
            "sourceFieldName": "FocusArea",
            "targetFieldName": "FocusArea",
        },
        {
            "sourceFieldName": "ActionArea",
            "targetFieldName": "ActionArea",
        },
        {
            "sourceFieldName": "PrimaryTechnology",
            "targetFieldName": "PrimaryTechnology",
        },
        {
            "sourceFieldName": "TargetLevel",
            "targetFieldName": "TargetLevel",
        },
        {
            "sourceFieldName": "MaturityLevel",
            "targetFieldName": "MaturityLevel",
        },
        {
            "sourceFieldName": "DeliveryDomain",
            "targetFieldName": "DeliveryDomain",
        },
        {
            "sourceFieldName": "ServiceDivision",
            "targetFieldName": "ServiceDivision",
        },
        {
            "sourceFieldName": "ContentLanguage",
            "targetFieldName": "ContentLanguage",
        },
        {
            "sourceFieldName": "Description",
            "targetFieldName": "Description",
        },
        {
            "sourceFieldName": "DatasheetLink",
            "targetFieldName": "url",
        },
        {
            "sourceFieldName": "InternalComments",
            "targetFieldName": "InternalComments",
        },
        {
            "sourceFieldName": "Duration",
            "targetFieldName": "Duration",
        },
        {
            "sourceFieldName": "DatasheetLink",
            "targetFieldName": "metadata_storage_path",
        },
    ],
    "outputFieldMappings": [
        {"sourceFieldName": "/document/pages/*", "targetFieldName": "chunks"}
    ],
    "parameters": {
        "maxFailedItems": -1,
        "maxFailedItemsPerBatch": -1,
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",
        },
    },
}

r = requests.put(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexers/" + indexer_name,
    data=json.dumps(indexer_payload),
    headers=headers,
    params=params,
)
print(r.status_code)
print(r.ok)

201
True


In [10]:
# Uncomment if you find an error
# r.text

Note: If you get an authorization error (HTTP 401 or 403), make sure that you are using the Azure Search admin (management) key, not a query key.


In [15]:
# Optionally, get indexer status to confirm that it's running
try:
    r = requests.get(
        os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexers/" + indexer_name + "/status",
        headers=headers,
        params=params,
    )
    # pprint(json.dumps(r.json(), indent=1))
    print(r.status_code)
    # Parse the response once; lastResult may be missing before the first run starts
    last_result = r.json().get("lastResult")
    print("Status:", last_result.get("status"))
    print("Items Processed:", last_result.get("itemsProcessed"))
    print(r.ok)

except Exception:
    # Most likely the indexer has not reported a run yet
    print("Wait a few seconds until the process starts and run this cell again.")

200
Status: inProgress
Items Processed: 400
True
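
Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. This is a sketch under assumptions: `wait_for_indexer` is a hypothetical helper, `get_status` stands in for the `requests.get(...).json()` call shown above, and the stub responses below are made-up data used only so the example runs offline.

```python
import time


def wait_for_indexer(get_status, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll until the last indexer run leaves the 'inProgress' state.

    get_status: callable returning the parsed JSON of the /status endpoint.
    Returns (status, items_processed) of the finished run.
    """
    for _ in range(max_polls):
        last_result = get_status().get("lastResult") or {}
        status = last_result.get("status")
        if status and status != "inProgress":
            return status, last_result.get("itemsProcessed")
        sleep(poll_seconds)
    raise TimeoutError("Indexer did not finish within the polling budget")


# Offline demonstration with fake responses (not real service output)
fake_responses = iter(
    [
        {"lastResult": None},
        {"lastResult": {"status": "inProgress", "itemsProcessed": 1}},
        {"lastResult": {"status": "success", "itemsProcessed": 3}},
    ]
)
final = wait_for_indexer(lambda: next(fake_responses), sleep=lambda seconds: None)
print(final)  # ('success', 3)
```

In the notebook, `get_status` could be `lambda: requests.get(status_url, headers=headers, params=params).json()`, where `status_url` is the same URL built in the cell above.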


**When the indexer finishes running, all 9.8k documents will be indexed in your search engine.**


## Creation of its corresponding vector-based index


**Azure Cognitive Search now has vector search capabilities** ([watch this video](https://aka.ms/Vector_SearchSnackableVideo)). Its main advantages are the integration with the other capabilities of Azure Cognitive Search, the ability to use any type of data (text, image, audio, video, etc.) from diverse Azure datastores to inform a single generative AI-powered application, and support for vector fields in search indexes. It also offers pure vector search, hybrid retrieval, and a sophisticated re-ranking system powered by Bing in a single integrated solution (see the release [blog post](https://techcommunity.microsoft.com/t5/azure-ai-services-blog/announcing-vector-search-in-azure-cognitive-search-public/ba-p/3872868)).

![vector-search](https://techcommunity.microsoft.com/t5/image/serverpage/image-id/489211i001E2B9B34F483C2/image-dimensions/876x416?v=v2)

**The main limitations (for now) of vector search in Azure Cognitive Search are:**

- It does not generate vector embeddings for the content. Users need to provide the embeddings themselves by using a service such as Azure OpenAI.
- There is no field type for a collection of vectors, so each document in the vector-based index must be either a small document or a chunk of a larger document.
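
Taken together, the two limitations mean that text has to be split into chunks and embedded before it is pushed to the vector index. The sketch below illustrates the shape of that preprocessing only. It is not the approach used in the later notebooks. `split_into_chunks`, `build_chunk_documents` and `fake_embed` are hypothetical helpers, the chunk size and overlap are arbitrary, and `fake_embed` returns zeros in place of a real embedding call (for example to Azure OpenAI).

```python
def split_into_chunks(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character chunks that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def fake_embed(text, dimensions=1536):
    """Placeholder for a real embedding model; returns a zero vector."""
    return [0.0] * dimensions


def build_chunk_documents(doc_id, title, text, name, location, embed=fake_embed):
    """Create one search document per chunk, matching the vector index fields."""
    return [
        {
            "id": f"{doc_id}_{i}",  # one key per chunk
            "title": title,
            "chunk": chunk,
            "chunkVector": embed(chunk),  # a single vector per document
            "name": name,
            "location": location,
        }
        for i, chunk in enumerate(split_into_chunks(text))
    ]


docs = build_chunk_documents(
    "doc1", "Sample title", "x" * 2500, "sample.pdf", "https://example.com/sample.pdf"
)
print(len(docs), [len(d["chunk"]) for d in docs])  # 3 [1000, 1000, 700]
```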

We will come back to these limitations and solve them in the next notebooks, but for now let's just create the corresponding vector-based index.


In [16]:
index_payload = {
    "name": index_name + "-vector",
    "fields": [
        # Boolean attributes use Python True/False, which json.dumps emits as JSON booleans
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {
            "name": "title",
            "type": "Edm.String",
            "searchable": True,
            "retrievable": True,
        },
        {
            "name": "chunk",
            "type": "Edm.String",
            "searchable": True,
            "retrievable": True,
        },
        {
            # Vector field: 1536 dimensions matches text-embedding-ada-002
            "name": "chunkVector",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "retrievable": True,
            "dimensions": 1536,
            "vectorSearchConfiguration": "vectorConfig",
        },
        {
            "name": "name",
            "type": "Edm.String",
            "searchable": True,
            "retrievable": True,
            "sortable": False,
            "filterable": False,
            "facetable": False,
        },
        {
            "name": "location",
            "type": "Edm.String",
            "searchable": False,
            "retrievable": True,
            "sortable": False,
            "filterable": False,
            "facetable": False,
        },
    ],
    "vectorSearch": {
        "algorithmConfigurations": [{"name": "vectorConfig", "kind": "hnsw"}]
    },
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": {"fieldName": "title"},
                    "prioritizedContentFields": [{"fieldName": "chunk"}],
                    "prioritizedKeywordsFields": [],
                },
            }
        ]
    },
}

r = requests.put(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexes/" + index_name + "-vector",
    data=json.dumps(index_payload),
    headers=headers,
    params=params,
)
print(r.status_code)
print(r.ok)

201
True
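
Once documents with embeddings are in this index, it can be queried by vector. The sketch below builds a request body under the assumption that the service is called with the `2023-07-01-Preview` request shape, which matches the `vectorSearchConfiguration` and `algorithmConfigurations` names used in the payload above. Later API versions use a different property layout, so check the version you target. `build_vector_query` is a hypothetical helper, and the zero vector stands in for a real query embedding.

```python
def build_vector_query(query_vector, k=5, search_text=None):
    """Build a vector (or hybrid) query body for the '-vector' index (sketch only)."""
    body = {
        "vector": {
            "value": query_vector,  # embedding of the user question
            "fields": "chunkVector",  # vector field defined in the index above
            "k": k,  # number of nearest neighbours to return
        },
        "select": "id, title, chunk, name, location",
    }
    if search_text is not None:
        # Adding keyword text turns this into a hybrid (text + vector) query
        body["search"] = search_text
    return body


query_vector = [0.0] * 1536  # placeholder; use a real embedding in practice
vector_only = build_vector_query(query_vector)
hybrid = build_vector_query(query_vector, search_text="reinforcement learning")
print(sorted(vector_only), sorted(hybrid))
```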


# References

- https://learn.microsoft.com/en-us/azure/search/cognitive-search-tutorial-blob
- https://github.com/Azure-Samples/azure-search-python-samples/blob/main/Tutorial-AI-Enrichment/PythonTutorial-AzureSearch-AIEnrichment.ipynb
- https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/search/azure-search-documents/samples
- https://learn.microsoft.com/en-us/azure/search/search-get-started-python


# NEXT

In the next notebook (02), we will implement another type of indexing called One-to-Many, in which a single CSV or JSON file is converted into multiple individual searchable documents in Azure Search.
