# Searching document text at scale using Azure Cognitive Search

<ul>
    <li>Creating an Azure Storage Account</li>
    <li>Uploading documents to our Azure Storage Account</li>
    <li>Creating an Azure Cognitive Search instance</li>
    <li>Connecting the Azure Cognitive Search instance to our data source</li>
    <li>Connecting the Azure Cognitive Search instance to Cognitive Services</li>
    <li>Defining a skillset</li>
    <li>Indexing our documents</li>
    <li>Querying Azure Search</li>
    <li>Formatting search highlights</li>
</ul>

<img src='img/azure-search-diagram.png'>

<h2>Creating a blob storage account</h2>

### Using the Azure CLI from the VS Code terminal

<p>
az login <br>
Select the subscription: <br>
az account set --subscription 4495aea3-1236-4df7-bfe7-ffa0902059ab <br>
az group create --name azure-search-nar-demo --location eastus2 <br>
az storage account create --name narsearchdemostorage --resource-group azure-search-nar-demo --location eastus2 <br>
az storage account show-connection-string --name narsearchdemostorage <br>
<br>
"connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=narsearchdemostorage;AccountKey=kCgQIYa1avf5cYm2rNkHP4PpkM+MrrkJZRF3is6lfZZgeCtnRWEBuEqGB+OfdCaA7yfKP/AwwihMOjypVsQ81A=="
<br>
<b>TODO: update this to use Azure Key Vault instead of a clear-text connection string.</b>
<br><br>
<b>The pip installs below are not required if you use the environment.yml, which already lists these packages.</b><br>
    pip install azure-storage-blob<br>
    pip install requests<br>

</p>
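Until the Key Vault change noted above is made, one way to keep the connection string out of the notebook is to read it from an environment variable. The sketch below is only an illustration: the variable name AZURE_STORAGE_CONNECTION_STRING and both helper functions are assumptions of mine, not part of the original notebook.

```python
import os


def load_connection_string(env_var="AZURE_STORAGE_CONNECTION_STRING"):
    """Return the storage connection string held in an environment variable."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"Set {env_var} before running the notebook")
    return value


def parse_connection_string(connection_string):
    """Split 'Key=Value;Key=Value' pairs into a dict.

    Only the first '=' separates key from value, because account keys
    are Base64 strings that usually end in '=' padding.
    """
    pairs = (part.split("=", 1) for part in connection_string.split(";") if part)
    return {key: value for key, value in pairs}
```

With this in place, the first cell could call `connection_string = load_connection_string()` instead of pasting the key, and `parse_connection_string(connection_string)["AccountName"]` gives a quick check that the right account is being used.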

In [1]:
import os

from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

connection_string = "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=narsearchdemostorage;AccountKey=kCgQIYa1avf5cYm2rNkHP4PpkM+MrrkJZRF3is6lfZZgeCtnRWEBuEqGB+OfdCaA7yfKP/AwwihMOjypVsQ81A=="

blob_service_client = BlobServiceClient.from_connection_string(connection_string)

container_name = "nar-pdf-container"

container_client = blob_service_client.create_container(container_name)

In [2]:
os.listdir('./pdfs')

['nar00418-0066.pdf',
 'nar00418-0076.pdf',
 'nar00418-0094.pdf',
 'nar00418-0107.pdf',
 'nar00418-0118.pdf',
 'nar00418-0133.pdf',
 'nar00418-0158.pdf',
 'nar00418-0174.pdf',
 'nar00418-0187.pdf',
 'nar00432-0036.pdf']

### Copy PDFs into a blob container

We can now iterate through these files and upload each of them to our storage account. For larger numbers of files, it may be easier to use the AzCopy tool, Azure Storage Explorer, or Azure Data Factory (ADF).

In [3]:
for pdf in os.listdir('./pdfs'):
    blob_client = blob_service_client.get_blob_client(
        container=container_name,
        blob=pdf
    )
    with open(os.path.join('.', 'pdfs', pdf), "rb") as data:
        blob_client.upload_blob(data)
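If you expect to re-run the notebook, the loop above will fail on the second run because the blobs already exist. A possible variation is sketched below; the function name `upload_pdfs` is hypothetical, and it assumes a client object shaped like the `blob_service_client` created earlier. It skips anything that is not a PDF and passes `overwrite=True` so that existing blobs are replaced.

```python
import os


def upload_pdfs(blob_service_client, container_name, folder):
    """Upload every .pdf file in folder and return the blob names uploaded."""
    uploaded = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue  # ignore stray files that are not PDFs
        blob_client = blob_service_client.get_blob_client(
            container=container_name, blob=name
        )
        with open(os.path.join(folder, name), "rb") as data:
            # overwrite=True replaces an existing blob instead of raising
            blob_client.upload_blob(data, overwrite=True)
        uploaded.append(name)
    return uploaded
```

It would be called as `upload_pdfs(blob_service_client, container_name, './pdfs')`.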

To check our files have uploaded, we can list the files in the container:

In [4]:
blob_list = container_client.list_blobs()
for blob in blob_list:
    print(blob.name)

nar00418-0066.pdf
nar00418-0076.pdf
nar00418-0094.pdf
nar00418-0107.pdf
nar00418-0118.pdf
nar00418-0133.pdf
nar00418-0158.pdf
nar00418-0174.pdf
nar00418-0187.pdf
nar00432-0036.pdf


## Create Azure Cognitive Search Service

Using the Azure CLI: <br>
az search service create --name nar-demo-search --resource-group azure-search-nar-demo --location eastus2 --sku standard

## Connect Azure Cognitive Search Service to BLOB storage

In the cell below, we’ll import the modules and define the variables that we’ll need for each of the Azure Cognitive Search REST requests in the rest of this blog post.
<br>
Each request to Azure Cognitive Search must provide an API version. In this post we’re using "2019-05-06", the latest stable version at the time of writing.

In [5]:
import requests
import json

service_name = "nar-demo-search"
api_version = "2019-05-06"
api_key = "F7DD58AACD0E50237B2A4B3828998D4E"

headers = {
    'Content-Type': 'application/json',
    'api-key': api_key
}

datasource_name = "blob-datasource"
uri = f"https://{service_name}.search.windows.net/datasources?api-version={api_version}"

body = {
    "name": datasource_name,
    "type": "azureblob",
    "credentials": {"connectionString": connection_string},
    "container": {"name": container_name}
}

resp = requests.post(uri, headers=headers, data=json.dumps(body))
print(resp.status_code)
print(resp.ok)

201
True
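Every REST call in this post follows the same URL pattern: the service host, a resource collection, an optional resource name, and the api-version parameter. As a small sketch, the helper below (`search_uri` is a name I made up for illustration) builds those URIs in one place so the version string cannot be forgotten.

```python
def search_uri(service_name, collection, api_version, name=None):
    """Build a URI such as .../datasources or .../indexes/<name>."""
    path = collection if name is None else f"{collection}/{name}"
    return (
        f"https://{service_name}.search.windows.net/{path}"
        f"?api-version={api_version}"
    )
```

For example, `search_uri(service_name, "datasources", api_version)` produces the same URI as the cell above.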


## Connect Azure Cognitive Search Service to Cognitive Services

Next, we’re going to define the AI skillset that will be used to enrich our search index.
<br>
Before doing so, it’s advisable to create an Azure Cognitive Services instance; without one, your AI enrichment capabilities will be severely limited in scope. This Azure Cognitive Services instance must be in the same region as your Azure Cognitive Search instance and can be created from the CLI. I’ve named mine nar-demo-cognitive-services:

az cognitiveservices account create --name nar-demo-cognitive-services --kind CognitiveServices --sku S0 --resource-group azure-search-nar-demo --location eastus2

We’ll need the account key for this Azure Cognitive Services instance in order to connect our Azure Cognitive Search instance to it. We can retrieve the key using the Azure CLI:

az cognitiveservices account keys list --name nar-demo-cognitive-services --resource-group azure-search-nar-demo

In [6]:
cognitive_service_key = "bc51192ebffa4abf96a94d76ff11e3a5"

## Create Azure Cognitive Search Skillset

To enrich our Azure Cognitive Search Index with an AI skillset, we’ll need to define a skillset.
<br>
We’ll be using a combination of skills that utilise Azure Cognitive Services. The skills we’ll be using are:
<br>
<ul>
<li>OCR to extract text from image</li>
<li>Merge text extracted from OCR into the correct place in documents</li>
<li>Detect language</li>
<li>Split text into pages if not already done so</li>
<li>Key phrase extraction (has a maximum character limit so requires text to be split into pages)</li>
</ul>
The skillset definition is a JSON object, and each of the skills it defines takes one or more fields as input and provides one or more fields as output.
<br>
<br>
We’ll need to provide the Cognitive Search API with the key to our Cognitive Services account in order to use it, so make sure you put this in your skillset definition.


In [8]:
skillset = {
  "description": "Extract text from images and merge with content text to produce merged_text. Also extract key phrases from pages",
  "skills":
  [
    {
      "description": "Extract text (plain and structured) from image.",
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document/normalized_images/*",
      "defaultLanguageCode": "en",
      "detectOrientation": True,
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name":"text", "source": "/document/content"
        },
        {
          "name": "itemsToInsert", "source": "/document/normalized_images/*/text"
        },
        {
          "name":"offsets", "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText", "targetName" : "merged_text"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
      "inputs": [
        { "name": "text", "source": "/document/content" }
      ],
      "outputs": [
        { "name": "languageCode", "targetName": "languageCode" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "textSplitMode" : "pages",
      "maximumPageLength": 4000,
      "inputs": [
        { "name": "text", "source": "/document/content" },
        { "name": "languageCode", "source": "/document/languageCode" }
      ],
      "outputs": [
        { "name": "textItems", "targetName": "pages" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
      "context": "/document/pages/*",
      "inputs": [
        { "name": "text", "source": "/document/pages/*" },
        { "name":"languageCode", "source": "/document/languageCode" }
      ],
      "outputs": [
        { "name": "keyPhrases", "targetName": "keyPhrases" }
      ]
    },
  ],
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
        "description": "NAR Demo Cognitive Services",
        "key": cognitive_service_key
    }
}
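Because each skill reads some fields and writes others, it can help to print a quick summary of the definition before sending it to the service. The following is a rough sketch; `describe_skills` is a hypothetical helper, and it assumes the dictionary layout used in the skillset above.

```python
def describe_skills(skillset):
    """Return (skill type, input names, output names) for each skill."""
    summary = []
    for skill in skillset["skills"]:
        # "#Microsoft.Skills.Text.MergeSkill" -> "MergeSkill"
        skill_type = skill["@odata.type"].rsplit(".", 1)[-1]
        inputs = [item["name"] for item in skill.get("inputs", [])]
        # An output is exposed under targetName when given, otherwise under name
        outputs = [
            item.get("targetName", item["name"])
            for item in skill.get("outputs", [])
        ]
        summary.append((skill_type, inputs, outputs))
    return summary
```

Running `for row in describe_skills(skillset): print(row)` lists the five skills in order, which makes it easier to spot a misspelt output name before the indexer runs.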

The skillset is created through a PUT request to our Azure Cognitive Search skillsets REST endpoint.

The data above is serialised to a JSON string before being provided as the body of the request:

In [9]:
skillset_name = 'nar-demo-skillset'
uri = f"https://{service_name}.search.windows.net/skillsets/{skillset_name}?api-version={api_version}"

resp = requests.put(uri, headers=headers, data=json.dumps(skillset))
print(resp.status_code)
print(resp.ok)

201
True


## Create Azure Cognitive Search Index

The index is the definition of fields that will be returned, as well as metadata such as the data type of this field, whether it is a key or not, and whether it is searchable, filterable, facetable or sortable.

The index will hold the id, metadata_storage_name, content, languageCode, keyPhrases and merged_text fields. All of them are searchable except id, which acts as the key.

In [10]:
index = {
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": True,
      "searchable": False,
      "filterable": False,
      "facetable": False,
      "sortable": True
    },
    {
      "name": "metadata_storage_name",
      "type": "Edm.String",
      "searchable": True,
      "filterable": False,
      "facetable": False,
      "sortable": True
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": True,
      "filterable": False,
      "facetable": False,
      "sortable": False
    },
    {
      "name": "languageCode",
      "type": "Edm.String",
      "searchable": True,
      "filterable": False,
      "facetable": False,
      "sortable": False
    },
    {
      "name": "keyPhrases",
      "type": "Collection(Edm.String)",
      "searchable": True,
      "filterable": False,
      "facetable": False,
      "sortable": False
    },
    {
      "name": "merged_text",
      "type": "Edm.String",
      "searchable": True,
      "filterable": False,
      "facetable": False,
      "sortable": False
    }
  ]
}
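The field definitions above repeat the same four attributes many times. As a sketch of one way to cut the repetition, the hypothetical helper `make_field` below fills in the defaults used for most fields in this post, so only the exceptions need spelling out.

```python
def make_field(name, edm_type="Edm.String", key=False, searchable=True,
               filterable=False, facetable=False, sortable=False):
    """Build one index field definition with this post's usual defaults."""
    field = {
        "name": name,
        "type": edm_type,
        "searchable": searchable,
        "filterable": filterable,
        "facetable": facetable,
        "sortable": sortable,
    }
    if key:
        field["key"] = True  # only the key field carries this attribute
    return field
```

The same index could then be written as a list such as `[make_field("id", key=True, searchable=False, sortable=True), make_field("content"), make_field("keyPhrases", "Collection(Edm.String)")]`, and so on for the remaining fields.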

Just as with our skillset, the index is created through a PUT request to our Azure Cognitive Search indexes REST endpoint.

The data above is serialised to a JSON string before being provided as the body of the request:

In [11]:
index_name = 'nar-demo-index'
uri = f"https://{service_name}.search.windows.net/indexes/{index_name}?api-version={api_version}"

resp = requests.put(uri, headers=headers, data=json.dumps(index))
print(resp.status_code)
print(resp.ok)

201
True


## Create Azure Cognitive Search Indexer

We’ll now create our indexer, which runs through our data and indexes it as we’ve defined above using the data source, index and skillset given.

We can set maxFailedItems and maxFailedItemsPerBatch to -1 if we want the indexer to continue until it reaches the end of the documents, regardless of how many failures it encounters. In the definition below, both are set to 1.

The storage path of each file within the storage container is Base64-encoded and used as the document key, because the path is unique for each document even when two files share the same filename.
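To see why an encoded path makes a workable key, here is a minimal sketch using Python's standard URL-safe Base64. It is an illustration only: the two function names and the example paths are mine, and the service's own encoding may treat padding differently, so keys produced here are not guaranteed to match the ids stored in the index.

```python
import base64


def path_to_key(storage_path):
    """Encode a blob path using only letters, digits, '-', '_' and '='."""
    return base64.urlsafe_b64encode(storage_path.encode("utf-8")).decode("ascii")


def key_to_path(key):
    """Reverse path_to_key to recover the original blob path."""
    return base64.urlsafe_b64decode(key.encode("ascii")).decode("utf-8")


# Hypothetical paths: same filename, different folders
first = path_to_key("https://example.blob.core.windows.net/docs/a/report.pdf")
second = path_to_key("https://example.blob.core.windows.net/docs/b/report.pdf")
print(first != second)  # the keys differ because the full paths differ
```

The raw path contains characters such as `/` and `:`, whereas the encoded form sticks to a small, safe alphabet and still maps back to exactly one file.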

In [12]:
indexer_name = "nar-demo-indexer"

indexer = {
  "name": indexer_name,
  "dataSourceName" : datasource_name,
  "targetIndexName" : index_name,
  "skillsetName" : skillset_name,
  "fieldMappings" : [
    {
      "sourceFieldName" : "metadata_storage_path",
      "targetFieldName" : "id",
      "mappingFunction" : {"name": "base64Encode"}
    },
    {
      "sourceFieldName" : "metadata_storage_name",
      "targetFieldName" : "metadata_storage_name",
    },
    {
      "sourceFieldName" : "content",
      "targetFieldName" : "content"
    }
  ],
  "outputFieldMappings" :
  [
    {
      "sourceFieldName" : "/document/merged_text",
      "targetFieldName" : "merged_text"
    },
    {
      "sourceFieldName" : "/document/pages/*/keyPhrases/*",
      "targetFieldName" : "keyPhrases"
    },
    {
      "sourceFieldName": "/document/languageCode",
      "targetFieldName": "languageCode"
    }
  ],
  "parameters":
  {
    "maxFailedItems": 1,
    "maxFailedItemsPerBatch": 1,
    "configuration":
    {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "firstLineContainsHeaders": False,
      "delimitedTextDelimiter": ",",
      "imageAction": "generateNormalizedImages"
    }
  }
}

You may be sensing a theme now to creating associated resources for our Azure Cognitive Search instance. Just as with our skillset and index, the indexer is created through a PUT request to our Azure Cognitive Search indexers REST endpoint.

The data above is serialised to a JSON string before being provided as the body of the request. This will kick off the indexing on our PDF files.

We have not defined a schedule in our indexer, so it will run just once. If you want the indexer to re-run periodically and pick up changes, you can add a schedule to the definition. For example, every 2 hours would be:
"schedule" : { "interval" : "PT2H" }
And every 5 minutes, the shortest interval, would be:
"schedule" : { "interval" : "PT5M" }
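The interval strings are ISO 8601 durations. As a sketch, the hypothetical helpers below build one from a number of minutes and attach it to a copy of the indexer definition. The 5-minute lower bound comes from the text above; the one-day upper bound is my assumption and should be checked against the service documentation.

```python
def schedule_interval(minutes):
    """Format a whole number of minutes as a duration such as 'PT2H'."""
    if not 5 <= minutes <= 24 * 60:  # upper bound is an assumption
        raise ValueError("interval must be between 5 minutes and 1 day")
    hours, mins = divmod(minutes, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if mins:
        text += f"{mins}M"
    return text


def with_schedule(indexer, minutes):
    """Return a copy of an indexer definition with a schedule added."""
    return {**indexer, "schedule": {"interval": schedule_interval(minutes)}}
```

For example, `with_schedule(indexer, 120)` leaves the original dictionary untouched and returns one with `"schedule": {"interval": "PT2H"}` added.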

In [13]:
indexer_name = 'nar-demo-indexer'
uri = f"https://{service_name}.search.windows.net/indexers/{indexer_name}?api-version={api_version}"

resp = requests.put(uri, headers=headers, data=json.dumps(indexer))
print(resp.status_code)
print(resp.ok)

201
True


To check on the status of our indexer, you can run the following cell.

The JSON response contains more information, for example if you want to know how many files the indexer has completed and how many are left to go.

In [19]:
uri = f"https://{service_name}.search.windows.net/indexers/{indexer_name}/status?api-version={api_version}"

resp = requests.get(uri, headers=headers)
print(resp.status_code)
# lastResult can be empty until the indexer has completed a run
last_result = resp.json().get('lastResult') or {}
print(last_result.get('status'))
print(resp.ok)

200
success
True
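Rather than re-running the status cell by hand, we could poll until the run finishes. The sketch below is an assumption-laden illustration: `wait_for_indexer` is a hypothetical helper, it takes any function that returns the status JSON as a dict, and it treats "inProgress" (or a missing lastResult) as the signal to keep waiting.

```python
import time


def wait_for_indexer(fetch_status, poll_seconds=10, max_polls=60,
                     sleep=time.sleep):
    """Poll until lastResult reports something other than inProgress."""
    for _ in range(max_polls):
        last_result = fetch_status().get("lastResult") or {}
        status = last_result.get("status")
        if status and status != "inProgress":
            return last_result  # finished, successfully or not
        sleep(poll_seconds)
    raise TimeoutError("indexer did not finish within the polling budget")
```

In the notebook this could be called as `wait_for_indexer(lambda: requests.get(uri, headers=headers).json())`, and the returned dictionary inspected for its status.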


## Querying the Azure Cognitive Search Index

In [20]:
# Base URL
url = 'https://{}.search.windows.net/indexes/{}/docs'.format(service_name, index_name)
# API version is required
url += '?api-version={}'.format(api_version)
# Search query of "RNA"
url += '&search=RNA'
# Return the count of the results
url += '&$count=true'
print(url)

resp = requests.get(url, headers=headers)
print(resp.status_code)

https://nar-demo-search.search.windows.net/indexes/nar-demo-index/docs?api-version=2019-05-06&search=RNA&$count=true
200
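The query URL above is built up by string concatenation, which we will repeat for every query in this post. As a sketch of an alternative, the hypothetical helper `build_query_url` below assembles the same URL from a dictionary of options using the standard library, percent-encoding values where needed while leaving the `$` prefix of OData parameters readable.

```python
from urllib.parse import quote, urlencode


def build_query_url(service_name, index_name, api_version, search, options=None):
    """Build a docs query URL; options maps parameter names to string values."""
    params = {"api-version": api_version, "search": search}
    params.update(options or {})
    # quote (not quote_plus) encodes spaces as %20; '$' is left unescaped
    query = urlencode(params, quote_via=quote, safe="$")
    return (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs?{query}"
    )
```

Calling `build_query_url(service_name, index_name, api_version, "RNA", {"$count": "true"})` gives the same URL as printed above. Note that option values are passed as strings, so `"true"` rather than `True`.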


The 200 status code is a good start. To get the search results from our response, we must call the json() method of our response object to de-serialise the JSON string object that is returned in the body of our response.

In [21]:
search_results = resp.json()

In [22]:
search_results.keys()

dict_keys(['@odata.context', '@odata.count', 'value'])

This object is a dict with 3 keys:
<ul>
<li>@odata.context – The index and documents that were searched</li>
<li>@odata.count – The count of documents returned from our index search query</li>
<li>value – The output fields of the index search query</li>
</ul>

In [24]:
search_results['@odata.context']

"https://nar-demo-search.search.windows.net/indexes('nar-demo-index')/$metadata#docs(*)"

In [25]:
search_results['@odata.count']

10

In [26]:
len(search_results['value'])

10

For each item in value, we are returned the fields that we defined in our index, as well as a search score:

In [27]:
search_results['value'][0].keys()

dict_keys(['@search.score', 'id', 'metadata_storage_name', 'content', 'languageCode', 'keyPhrases', 'merged_text'])

We can take a look at the results of our search by looking at the PDF name and search score.

By default, the results are ordered by search score. We have not defined any field weightings, but these can be configured as well:

In [28]:
for result in search_results['value']:
    print('PDF Name: {}, Search Score {}'.format(result['metadata_storage_name'], result['@search.score']))

PDF Name: nar00418-0158.pdf, Search Score 0.05012928
PDF Name: nar00418-0066.pdf, Search Score 0.024230875
PDF Name: nar00418-0118.pdf, Search Score 0.019516218
PDF Name: nar00418-0174.pdf, Search Score 0.018859072
PDF Name: nar00418-0076.pdf, Search Score 0.018488316
PDF Name: nar00418-0133.pdf, Search Score 0.017665934
PDF Name: nar00418-0187.pdf, Search Score 0.0138388
PDF Name: nar00418-0107.pdf, Search Score 0.00976345
PDF Name: nar00418-0094.pdf, Search Score 0.0070217247
PDF Name: nar00432-0036.pdf, Search Score 0.003417009


In the cell below, we take a look at the beginning of the merged text of the highest scoring search result. Given that the name of this article is “Evidence for the role of double-helical structures in the maturation of Simian Virus-40 messenger RNA” (the extracted text below renders “Simian” as “Sinian”), it’s unsurprising that this is the highest scoring search result when searching for “RNA”.

In [29]:
print(search_results['value'][0]['merged_text'][:350])


Volume 8 Number 1 1980 Nucleic Acids Research

Evidence for the role of double-helical structures in the maturation of Sinian Virus-40 messenger
RNA

Nancy H.Chiu, Walter B.Bruszewski and Norman P.Salzman

Laboratory of Biology of Viruses, National Institute of Allergy and Infectious Diseases, National
Institutes of Health, Bethesda, MD 20205, USA


Let’s now take a look at a query that won’t just return every document that we uploaded.

Transcription is the mechanism by which DNA is copied into RNA as the initial step in gene expression, ready to be translated into proteins.

Let’s do a search for “transcription” and see how many results our query returns.

In [30]:
url = 'https://{}.search.windows.net/indexes/{}/docs'.format(service_name, index_name)
url += '?api-version={}'.format(api_version)
url += '&search=transcription'
url += '&$count=true'
print(url)

resp = requests.get(url, headers=headers)
print(resp.status_code)

https://nar-demo-search.search.windows.net/indexes/nar-demo-index/docs?api-version=2019-05-06&search=transcription&$count=true
200


In the following cell, we find that the term transcription is found in 4 out of our 10 documents.

In [31]:
search_results = resp.json()
search_results['@odata.count']

4

In [32]:
for result in search_results['value']:
    print('PDF Name: {}, Search Score {}'.format(result['metadata_storage_name'], result['@search.score']))

PDF Name: nar00418-0133.pdf, Search Score 0.061607756
PDF Name: nar00418-0066.pdf, Search Score 0.04978199
PDF Name: nar00418-0076.pdf, Search Score 0.007796707
PDF Name: nar00418-0118.pdf, Search Score 0.0067256093


Looking at our highest scoring result, we can be buoyed once more: it is very likely to be closely related to transcription. This journal article is about the regions of the ovalbumin gene that are responsible for controlling gene expression, and is led by the noted molecular geneticist Pierre Chambon, whose work has focused on gene transcription.

In [33]:
print(search_results['value'][0]['merged_text'][:715])


Volume8Numberl 1980 Nucleic Acids Research

The ovalbumin gene - sequence of putative control regions

C.Benoist, K.O'Hare, R.Breathnach and P.Chambon

Laboratoire de Gine'tique Moleculaire des Eucaryotes du CNRS, Unite 184 de Biologie Moleculaire
et de Genie Genetique de l'INSERM, Institut de Chimie Biologique, Faculte de Medecine,
Strasbourg, France

Received 8 November 1979

ABSTRACT

We present the sequence of regions of the chicken ovalbumin
gene believed to be important in the control of initiation of
transcription, splicing, and transcription termination or
polyadenylation. Comparison with corresponding areas of other
genes reveals some homologous regions which might play a role in
these processes.


## Pagination of Search Results

We might not want every result every time we search, and by default only the first 50 results are returned anyway. Paginating the results lets users explore them in batches.

Let’s say we want to search for RNA, which we already know will find 10 results, but we only want to display 5 results on a page. We can do this by adding a $top parameter to our URL.

In [34]:
url = 'https://{}.search.windows.net/indexes/{}/docs'.format(service_name, index_name)
url += '?api-version={}'.format(api_version)
url += '&search=RNA'
url += '&$count=true'
url += '&$top=5'
print(url)

resp = requests.get(url, headers=headers)
print(resp.status_code)

search_results = resp.json()

print("Results Found: {}, Results Returned: {}".format(search_results['@odata.count'], len(search_results['value'])))
print("Highest Search Score: {}".format(search_results['value'][0]['@search.score']))

https://nar-demo-search.search.windows.net/indexes/nar-demo-index/docs?api-version=2019-05-06&search=RNA&$count=true&$top=5
200
Results Found: 10, Results Returned: 5
Highest Search Score: 0.05012928


That’s perfect for our first page, but what about the second page? We need a way of skipping the first 5 results and showing the next 5 highest scoring results.

We can do this by supplying both $top and $skip parameters to our URL.

In [35]:
url = 'https://{}.search.windows.net/indexes/{}/docs'.format(service_name, index_name)
url += '?api-version={}'.format(api_version)
url += '&search=RNA'
url += '&$count=true'
url += '&$top=5'
url += '&$skip=5'
print(url)

resp = requests.get(url, headers=headers)
print(resp.status_code)

search_results = resp.json()

print("Results Found: {}, Results Returned: {}".format(search_results['@odata.count'], len(search_results['value'])))
print("Highest Search Score: {}".format(search_results['value'][0]['@search.score']))

https://nar-demo-search.search.windows.net/indexes/nar-demo-index/docs?api-version=2019-05-06&search=RNA&$count=true&$top=5&$skip=5
200
Results Found: 10, Results Returned: 5
Highest Search Score: 0.017665934
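Generalising the two requests above, each page is just a ($top, $skip) pair. The short sketch below (with a hypothetical helper name, `page_windows`) works out the pairs needed to cover a given result count.

```python
def page_windows(total_count, page_size):
    """Yield ($top, $skip) pairs that together cover total_count results."""
    for skip in range(0, total_count, page_size):
        yield min(page_size, total_count - skip), skip


for top, skip in page_windows(10, 5):
    print(f"&$top={top}&$skip={skip}")
```

For our 10 results and a page size of 5, this yields the two parameter combinations used above, with `$skip=0` on the first page being equivalent to omitting it.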


## Article Highlights from Search Results

One of my favourite features of Azure Cognitive Search is the ability to highlight parts of the document that are relevant to our search results. Combined with the above, this really helps turn our API into a proper search engine.

We can request highlights on a particular field by adding the highlight parameter to our URL as below. Let’s highlight the results we had from our search for “transcription”:

In [36]:
url = 'https://{}.search.windows.net/indexes/{}/docs'.format(service_name, index_name)
url += '?api-version={}'.format(api_version)
url += '&search=transcription'
url += '&$count=true'
url += '&highlight=merged_text'

resp = requests.get(url, headers=headers)
print(resp.status_code)

search_results = resp.json()

200


This will extract relevant parts of the journal article regarding “transcription”.

We can see in the cell below, which shows the highlights from the search result, that this will return a list of highlights from the article and even provides HTML em tags for the word transcription.

In [37]:
search_results['value'][0]['@search.highlights']['merged_text']

['The roles that these sequences might play in the\ncontrol of initiation of <em>transcription</em>, splicing of the primary\ntranscript, and termination of <em>transcription</em> and polyadenylation\nare discussed.',
 'The roles that these sequences might play in the control of initiation of <em>transcription</em>, splicing of the primary transcript, and termination of <em>transcription</em> and polyadenylation are discussed.',
 'Studies on the major late adenovirus 2 <em>transcription</em> unit\n\nhave led to the conclusion that in this case the start of\n<em>transcription</em> most probably corresponds to the first, capped\nnucleotide of the mature messenger (7, 8).',
 'Studies on the major late adenovirus 2 <em>transcription</em> unit have led to the conclusion that in this case the start of <em>transcription</em> most probably corresponds to the first, capped nucleotide of the mature messenger (7, 8) .',
 'The availability of in vitro <em>transcription</em> systems where specific 

If we’re using a Jupyter notebook, we can display these highlights with the word transcription emphasised through italicisation.

In [38]:
from IPython.display import display, HTML

for highlight in search_results['value'][0]['@search.highlights']['merged_text']:
    display(HTML(highlight))
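The raw list of highlights shown earlier contains some fragments that differ only in their line breaks. If that is a nuisance when displaying them, one option is to collapse whitespace and drop the repeats. This is a sketch with a hypothetical helper name; fragments that differ in any other way, such as a stray space before a full stop, are deliberately kept.

```python
def unique_highlights(highlights):
    """Collapse whitespace in each fragment and drop repeats, keeping order."""
    seen = set()
    unique = []
    for fragment in highlights:
        normalised = " ".join(fragment.split())
        if normalised not in seen:
            seen.add(normalised)
            unique.append(normalised)
    return unique
```

The display loop could then iterate over `unique_highlights(search_results['value'][0]['@search.highlights']['merged_text'])` instead of the raw list.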

But that’s not all. If we want to, for example, give these matches a background colour instead of italicising them, we can supply our own HTML tags to surround the query phrase.

In this case we’re providing span tags with some inline CSS.

In [39]:
import urllib.parse

url = 'https://{}.search.windows.net/indexes/{}/docs'.format(service_name, index_name)
url += '?api-version={}'.format(api_version)
url += '&search=transcription'
url += '&$count=true'
url += '&highlight=merged_text'
url += '&highlightPreTag=' + urllib.parse.quote('<span style="background-color: #f5e8a3">', safe='')
url += '&highlightPostTag=' + urllib.parse.quote('</span>', safe='')

resp = requests.get(url, headers=headers)
print(url)
print(resp.status_code)

search_results = resp.json()

https://nar-demo-search.search.windows.net/indexes/nar-demo-index/docs?api-version=2019-05-06&search=transcription&$count=true&highlight=merged_text&highlightPreTag=%3Cspan%20style%3D%22background-color%3A%20%23f5e8a3%22%3E&highlightPostTag=%3C%2Fspan%3E
200


Now when we display our search results, the output looks a little bit more like that of a regular search engine.

Feel free to explore your Azure Cognitive Search endpoints and don’t forget to delete your resource group when you’re finished exploring.

In [40]:
for result in search_results['value']:
    display(HTML('<h4>' + result['metadata_storage_name'] + '</h4>'))
    for highlight in result['@search.highlights']['merged_text']:
        display(HTML(highlight))

## Searching for text extracted by OCR

<img src='img/Origin.PNG'>

In [41]:
import urllib.parse

url = 'https://{}.search.windows.net/indexes/{}/docs'.format(service_name, index_name)
url += '?api-version={}'.format(api_version)
url += '&search=Origin'
url += '&$count=true'
url += '&highlight=merged_text'
url += '&highlightPreTag=' + urllib.parse.quote('<span style="background-color: #f5e8a3">', safe='')
url += '&highlightPostTag=' + urllib.parse.quote('</span>', safe='')

resp = requests.get(url, headers=headers)
print(url)
print(resp.status_code)

search_results = resp.json()

https://nar-demo-search.search.windows.net/indexes/nar-demo-index/docs?api-version=2019-05-06&search=Origin&$count=true&highlight=merged_text&highlightPreTag=%3Cspan%20style%3D%22background-color%3A%20%23f5e8a3%22%3E&highlightPostTag=%3C%2Fspan%3E
200


In [42]:
for result in search_results['value']:
    display(HTML('<h4>' + result['metadata_storage_name'] + '</h4>'))
    for highlight in result['@search.highlights']['merged_text']:
        display(HTML(highlight))