# Create a custom skill for Azure AI Search

### Introduction                 
` AI Search` module provides a cloud-based solution for indexing and querying a wide range of data sources, and creating comprehensive and high-scale search solutions. 

Here we use the case of Margie's Travel, in order to help customers find valuable insights from the accumulated hotel reviews when they are making travel plans, we need to follow the following instructions to upload the review data from `. /assets/data/` comment data into **Blob Containers** and create a `margies-index` (parsed unstructured data) by default **Skills** (field parsing function).

## Upload Documents to Azure Storage

1. Create a **Blob Containers** in Azure Storage Accounts (Storage Accounts -> [your-account] -> Data Storage -> Containers -> +Container)
   - **Name**: margies
   - **Access level**: container
2. Upload both `collateral/` and `reviews/` in `./assets/data/` to **Blob Containers** for  (Storage Accounts -> [your-account] -> Storage Browser -> blob containers -> 
margie -> + Add directory/Uploads)

## Generate Index of These Files using Skills

**【Note】** In order to use multiple AI services at the same time, please create a **multi-service account** in Azure AI Services first. and get the key of the resource.

Now that you have the documents in place, you can create a search solution by indexing them.

1. Browse to your `Azure AI Search`. Then, select **Import data** on its **Overview** page.
2. On the **Connect to your data** page, in the **Data Source** list, select **Azure Blob Storage**. Then complete the data store details with the following values:
    - **Data Source**: Azure Blob Storage
    - **Data source name**: margies-data
    - **Data to extract**: Content and metadata
    - **Parsing mode**: Default
    - **Connection string**: *Select **Choose an existing connection**. Then select your storage account, and finally select the **margies** container that was created by the UploadDocs.cmd script.*
    - **Managed identity authentication**: None
    - **Container name**: margies
    - **Blob folder**: *Leave this blank*
    - **Description**: Brochures and reviews in Margie's Travel web site.
3. Proceed to the next step (*Add cognitive skills*).
4. in the **Attach Azure AI Services** section, select your Azure AI Services (Multi-Service Account) resource.
5. In the **Add enrichments** section:
    - Change the **Skillset name** to **margies-skillset**.
    - Select the option **Enable OCR and merge all text into merged_content field**.
    - Ensure that the **Source data field** is set to **merged_content**.
    - Leave the **Enrichment granularity level** as **Source field**, which is set the entire contents of the document being indexed; but note that you can change this to extract information at more granular levels, like pages or sentences.
    - Select the following enriched fields:

        | Cognitive Skill | Parameter | Field name |
        | --------------- | ---------- | ---------- |
        | Extract location names | | locations |
        | Extract key phrases | | keyphrases |
        | Detect language | | language |
        | Generate tags from images | | imageTags |
        | Generate captions from images | | imageCaption |
      
6. Change the **Index name** to **margies-index**.
7. Ensure that the **Key** is set to **metadata_storage_path** and leave the **Suggester name** blank and **Search mode** at its default.
8. Make the following changes to the index fields, leaving all other fields with their default settings (**IMPORTANT**: you may need to scroll to the right to see the entire table):

    | Field name | Retrievable | Filterable | Sortable | Facetable | Searchable |
    | ---------- | ----------- | ---------- | -------- | --------- | ---------- |
    | metadata_storage_size | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | |
    | metadata_storage_last_modified | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | |
    | metadata_storage_name | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; |
    | metadata_author | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; |
    | locations | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; |
    | keyphrases | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; |
    | language | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | | |

9. Change the **Indexer name** to **margies-indexer**.
10. Leave the **Schedule** set to **Once**.
11. Expand the **Advanced** options, and ensure that the **Base-64 encode keys** option is selected (generally encoding keys make the index more efficient).
12. Select **Submit** to create the **Indexers**

### You can verify this index already build successed 
1. At the top of the **Overview** page for your Azure AI Search resource, select **Search explorer**.
2. In the View menu, select JSON view and note that the JSON request for the search is shown, like this:
   ```
   {     "search": "*"   
}
   `
then it will then respond to travel agent related information``

# Add Custom Skills and Update Your Indexes

While you can use the predefined skills in Azure AI Search to extract additional information from the source data to enrich the index. However, sometimes you may have specific data retrieval needs that may not be met by the predefined skills and require some customization. For example: integrating the Form Recognizer service to extract data from forms, using Azure Machine Learning models to integrate predicted values into the index. To support these scenarios, you can implement custom skills as web-hosted services, such as Azure Functions, and use them as part of the index generation process.illset.

<img src="./assets/custom-skllset.png" width="320"/>

## Create a Custom Word Count Function (Example).

1. In the Azure Portal, Create a new **Function App** resource with the following settings:
    - **Subscription**: *Your subscription*
    - **Resource Group**: *The same resource group as your Azure AI Search resource*
    - **Function App name**: *A unique name*
    - **Publish**: Code
    - **Runtime stack**: Node.js
    - **Version**: 18 LTS
    - **Region**: *The same region as your Azure AI Search resource*

2. Wait for deployment to complete, and then go to the deployed Function App resource.
3. In the Overview page for your Function App, in the section down the page, select the **Functions** tab. Then create a new function in the portal with the following settings:
    - **Setup a development environment**"
        - **Development environment**: Develop in portal
    - **Select a template**"
        - **Template**: HTTP Trigger
    - **Template details**:
        - **New Function**: wordcount
        - **Authorization level**: Function
4. Wait for the *wordcount* function to be created. Then in its page, select the **Code + Test** tab.
5. Replace the default function code with the following code:

```javascript
module.exports = async function (context, req) {
    context.log('JavaScript HTTP trigger function processed a request.');

    if (req.body && req.body.values) {

        vals = req.body.values;

        // Array of stop words to be ignored
        var stopwords = ['', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
        "youre", "youve", "youll", "youd", 'your', 'yours', 'yourself', 
        'yourselves', 'he', 'him', 'his', 'himself', 'she', "shes", 'her', 
        'hers', 'herself', 'it', "its", 'itself', 'they', 'them', 
        'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 
        'this', 'that', "thatll", 'these', 'those', 'am', 'is', 'are', 'was',
        'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 
        'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 
        'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 
        'about', 'against', 'between', 'into', 'through', 'during', 'before', 
        'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 
        'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 
        'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 
        'only', 'own', 'same', 'so', 'than', 'too', 'very', 'can', 'will',
        'just', "dont", 'should', "shouldve", 'now', "arent", "couldnt", 
        "didnt", "doesnt", "hadnt", "hasnt", "havent", "isnt", "mightnt", "mustnt",
        "neednt", "shant", "shouldnt", "wasnt", "werent", "wont", "wouldnt"];

        res = {"values":[]};

        for (rec in vals)
        {
            // Get the record ID and text for this input
            resVal = {recordId:vals[rec].recordId, data:{}};
            txt = vals[rec].data.text;

            // remove punctuation and numerals
            txt = txt.replace(/[^ A-Za-z_]/g,"").toLowerCase();

            // Get an array of words
            words = txt.split(" ")

            // count instances of non-stopwords
            wordCounts = {}
            for(var i = 0; i < words.length; ++i) {
                word = words[i];
                if (stopwords.includes(word) == false )
                {
                    if (wordCounts[word])
                    {
                        wordCounts[word] ++;
                    }
                    else
                    {
                        wordCounts[word] = 1;
                    }
                }
            }

            // Convert wordcounts to an array
            var topWords = [];
            for (var word in wordCounts) {
                topWords.push([word, wordCounts[word]]);
            }

            // Sort in descending order of count
            topWords.sort(function(a, b) {
                return b[1] - a[1];
            });

            // Get the first ten words from the first array dimension
            resVal.data.text = topWords.slice(0,9)
              .map(function(value,index) { return value[0]; });

            res.values[rec] = resVal;
        };

        context.res = {
            body: JSON.stringify(res),
            headers: {
            'Content-Type': 'application/json'
        }

        };
    }
    else {
        context.res = {
            status: 400,
            body: {"errors":[{"message": "Invalid input"}]},
            headers: {
            'Content-Type': 'application/json'
        }

        };
    }
};
```

6. Save the function and then open the **Test/Run** pane.
7. In the **Test/Run** pane, replace the existing **Body** with the following JSON, which reflects the schema expected by an Azure AI Search skill in which records containing data for one or more documents are submitted for processing:

    ```
    {
        "values": [
            {
                "recordId": "a1",
                "data":
                {
                "text":  "Tiger, tiger burning bright in the darkness of the night.",
                "language": "en"
                }
            },
            {
                "recordId": "a2",
                "data":
                {
                "text":  "The rain in spain stays mainly in the plains! That's where you'll find the rain!",
                "language": "en"
                }
            }
        ]
    }
    ```
    
8. Click **Run** and view the HTTP response content that is returned by your function. This reflects the schema expected by Azure AI Search when consuming a skill, in which a response for each document is returned. In this case, the response consists of up to 10 terms in each document in descending order of how frequently they appear:

<img src="./assets/test-word-count.png" width="900"/>

9. Close the **Test/Run** pane and in the **wordcount** function blade, click **Get function URL**. Then copy the URL for the default key to the clipboard. You'll need this in the next procedure.

## Update Skillset, Index and Reload Indexer

1. Open your  **Skillset** in Azure AI Search (Azure AI Search -> [your-ai-search-service] -> Search management -> Skillsets -> margies-skillset)
2. Configurate the definition of margies-skillset (You can refer to the example in `./assets/skillset-definition-example.json`).
   * change the vaule of `cognitiveServices` from Null to the following configuration 
        ```
            ...  ...
        "cognitiveServices": {
                "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
                "description": "Azure AI services",
                "key": "<YOUR_MULTI_SERVICE_ACCOUNT_KEY>"
              },
            ...  ...
        ```
   * Add `get-top-words` and `get-sentiment` into `skills`
        ```
        {
          "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
          "name": "get-top-words",
          "description": "custom skill to get top 10 most frequent words",
          "context": "/document",
          "uri": "<YOUR_FUNCTION_APP_URL>",
          "httpMethod": "POST",
          "timeout": "PT30S",
          "batchSize": 1,
          "degreeOfParallelism": null,
          "authResourceId": null,
          "inputs": [
            {
              "name": "text",
              "source": "/document/merged_content"
            },
            {
              "name": "language",
              "source": "/document/language"
            }
          ],
          "outputs": [
            {
              "name": "text",
              "targetName": "topWords"
            }
          ],
          "httpHeaders": {},
          "authIdentity": null
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.V3.SentimentSkill",
          "name": "get-sentiment",
          "description": "Evaluate sentiment",
          "context": "/document",
          "defaultLanguageCode": "en",
          "modelVersion": null,
          "includeOpinionMining": false,
          "inputs": [
            {
              "name": "text",
              "source": "/document/merged_content"
            },
            {
              "name": "languageCode",
              "source": "/document/language"
            }
          ],
          "outputs": [
            {
              "name": "sentiment",
              "targetName": "sentimentLabel"
            }
          ]
        },
        ```

3. **Save** and navigate to index (Azure AI Search -> [your-ai-search-service] -> Search management -> indexes -> margies-index)
4. Select `fields` and Add fields below
    | Field name | Retrievable | Filterable | Sortable | Facetable | Searchable |
    | ---------- | ----------- | ---------- | -------- | --------- | ---------- |
    | sentiment | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004;|
    | url | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004;|
    | top_words | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004; | | | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#10004;|
5. **Save** and navigate to indexer (Azure AI Search -> [your-ai-search-service] -> Search management -> indexer -> margies-indexer)
6. Press **Reset** and then press **Run** to regenerate indexes.

### Demonstration

#### Step1. Build a Search Client

In [4]:
!pip install azure-search-documents
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

SEARCH_ENDPOINT = "https://mslearn-ai-searchs-01.search.windows.net"
INDEX_NAME = "margies-indexer"
ADMIN_KEY = "<YOUR-AI-SEARCH-KEY>"
search_client = SearchClient(endpoint=SEARCH_ENDPOINT, index_name=INDEX_NAME, credential=AzureKeyCredential(ADMIN_KEY))

Defaulting to user installation because normal site-packages is not writeable


In [10]:
query = {"search": "Las Vegas", "select": "url,top_words"}
results = search_client.search(search_text=query["search"], select=query["select"].split(","))

for result in results:
    print(result)
    print(f"URL: {result['url']}, Top Words: {result['top_words']}")

{'url': None, 'top_words': ['amazing', 'experience', 'canal', 'hotel', 'las', 'vegas', 'usa', 'expected', 'something'], '@search.score': 21.513073, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}
URL: None, Top Words: ['amazing', 'experience', 'canal', 'hotel', 'las', 'vegas', 'usa', 'expected', 'something']
{'url': None, 'top_words': ['hotel', 'las', 'vegas', 'may', 'icon', 'fountain', 'usa', 'reason', 'leave'], '@search.score': 21.42015, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}
URL: None, Top Words: ['hotel', 'las', 'vegas', 'may', 'icon', 'fountain', 'usa', 'reason', 'leave']
{'url': None, 'top_words': ['vegas', 'las', 'city', 'hotel', 'margies', 'travel', 'known', 'populated', 'nevada'], '@search.score': 19.18885, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}
URL: None, Top Words: ['vegas', 'las', 'city', 'hotel', 'margies', 'travel', 'known', 'populated', 'nev