This sample demonstrates how to use multiple indexers in Azure Cognitive Search to create a single search index from files in Blob storage, combined with additional file metadata in Table storage. The scenario is described in detail in the article "Index file content and metadata by using Azure Cognitive Search" in the Azure Architecture Center.
By following this sample, you will create an Azure Storage account and upload a number of files to Blob storage. You will then create an Azure Cognitive Search service that indexes these files so that you can search the information they contain. By combining a search indexer for Azure Blob Storage with a separate indexer for Azure Table Storage, each document in the search index will contain the joint metadata from both places.
In this sample, the search index contains `author`, `document_type` and `business_impact` metadata fields. The `author` value is retrieved directly from the files in Blob storage, whereas the `document_type` and `business_impact` values come from a row in Table storage for each associated blob.
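For example, the row in Table storage that supplies the extra metadata for one blob could look like the sketch below. The values and the storage account placeholder are made up, but the `metadata_storage_path`, `document_type` and `business_impact` column names match the deployment templates:

```shell
# Illustrative Table storage entity for a single blob (made-up values).
# PartitionKey and RowKey are required by Table storage; the
# metadata_storage_path column links the row to the blob it describes.
entity='{
  "PartitionKey": "metadata",
  "RowKey": "paper-example-txt",
  "metadata_storage_path": "https://<account>.blob.core.windows.net/files/paper/example.txt",
  "document_type": "paper",
  "business_impact": "high"
}'
echo "$entity"
```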
By default, a few basic text files are uploaded to Blob storage from the `samplefiles` directory in this repository. You can put additional files in the subdirectories, as long as they use one of the supported document formats. The directory name of the file (for example, `paper` or `report`) is used as the `document_type` metadata value, so you can also create additional subdirectories to get more document types to search for. If you'd like to index more interesting files, you can find sample data sets in the Azure Cognitive Search Sample Data repository.
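For instance, a hypothetical `memo` document type could be added like this (the directory and file names here are made up):

```shell
# Create a new document type by adding a subdirectory under samplefiles;
# its name ("memo") becomes the document_type value for files inside it.
mkdir -p samplefiles/memo
cat > samplefiles/memo/q3-update.txt <<'EOF'
A short status memo used to demonstrate a custom document type.
EOF
ls samplefiles/memo
```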
- An active Azure subscription (you can get a free Azure account if you don't have one).
- The Azure CLI installed to create the Azure resources.
- A Bash shell to run the deployment commands. Note that you can use the Azure Cloud Shell for this as well; it even has the Azure CLI pre-installed.
- Optional: if you're using Visual Studio Code, you can use the Azure CLI Tools extension to light up additional features for `.azcli` files.
In your bash terminal, run the following commands to clone this repository:
```shell
git clone https://github.com/Azure-Samples/azure-cognitive-search-blob-metadata.git
cd azure-cognitive-search-blob-metadata
```
Edit the first few lines in `deploy.azcli` to set the Azure region and resource group name. Then, using bash, run the lines from `deploy.azcli` in order. You can copy and paste groups of commands into your bash terminal, or, if you're using Visual Studio Code with the Azure CLI Tools extension, highlight each section and use the "Run Line in Terminal" function.
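The lines you edit typically look something like the sketch below; this is a hedged approximation, and the actual variable names and defaults in `deploy.azcli` may differ:

```shell
# Hypothetical shape of the first lines of deploy.azcli; adjust the region
# and resource group name to your environment before running the rest.
LOCATION="westeurope"                     # Azure region to deploy to
RESOURCE_GROUP="search-metadata-sample"   # resource group for all resources
echo "Deploying to resource group $RESOURCE_GROUP in $LOCATION"
```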
In order to create the index, data sources and indexers in Azure Cognitive Search, this sample uses a few template files. The most relevant files for this sample are:
- `deploy-index.json` defines the structure of the search index.
  - It includes the typical information from Blob storage (the content as well as file name, full path, file size, etc.).
  - It also defines the additional fields for the metadata values (`author`, `document_type` and `business_impact`).
- `deploy-blob-indexer.json` defines the indexer for Blob storage.
  - The `dataToExtract` setting is set to `contentAndMetadata` so that metadata from blobs is included in the search index.
  - The `metadata_storage_path` (which is used as the document key by default) is base64-encoded to ensure legal key names.
- `deploy-table-indexer.json` defines the indexer for Table storage.
  - This uses the same `targetIndexName` as the Blob indexer so that both indexers write to the same destination index.
  - It maps the `metadata_storage_path` column in Table storage to the `metadata_storage_path` field in the index; the value is also base64-encoded to make sure it refers to the same document key as the Blob indexer.
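Both indexers depend on the base64 encoding of `metadata_storage_path` producing identical document keys. The sketch below approximates that encoding with URL-safe base64; this assumes Azure Cognitive Search's `base64Encode` mapping function uses a URL-safe alphabet without padding, so treat it as illustrative rather than byte-for-byte identical:

```shell
# Sketch: derive a URL-safe document key from a blob path (hypothetical
# storage account name; encoding alphabet is an assumption, see above).
path="https://myaccount.blob.core.windows.net/files/paper/example.txt"
key=$(printf '%s' "$path" | base64 | tr -d '\n' | tr '+/' '-_' | tr -d '=')
echo "$key"
```

Because both indexers apply the same function to the same path, they address the same document in the index.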
The last `curl` command in the `deploy.azcli` file performs a search against the index. After a few seconds, this should start returning a JSON structure with the most interesting information for a single document: the search uses `$top=1` to limit the results and `$select` to return only the file name and the relevant metadata values.
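Assembled by hand, that request might look like the sketch below. The service name, index name and API version are assumptions for illustration; the real values are set in `deploy.azcli`:

```shell
# Sketch of the final search request. Only the URL is built here; run the
# commented curl line with a valid query key to execute it for real.
SERVICE="my-search-service"   # assumption: your search service name
INDEX="files-index"           # assumption: the index name from the templates
URL="https://${SERVICE}.search.windows.net/indexes/${INDEX}/docs"
QUERY='api-version=2021-04-30-Preview&search=*&$top=1&$select=metadata_storage_name,author,document_type,business_impact'
echo "${URL}?${QUERY}"
# curl -H "api-key: $QUERY_KEY" "${URL}?${QUERY}"
```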
The important outcome here is that a single document contains both the `author` metadata (coming from Blob storage) and the `document_type` and `business_impact` values (coming from Table storage).
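Put together, a single merged result could look like this (illustrative values only):

```shell
# Illustrative merged search document: author comes from the blob's own
# metadata; document_type and business_impact from the Table storage row.
doc='{
  "metadata_storage_name": "example.txt",
  "author": "Jane Doe",
  "document_type": "paper",
  "business_impact": "high"
}'
echo "$doc"
```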
You can also interactively search for documents in the Azure Portal by using the Search explorer.
To clean up the resources created for this demo, run the last command in the `deploy.azcli` file, or manually delete the resource group that contains all your resources.