This sample demonstrates how to use multiple indexers in Azure Cognitive Search to create a single search index from files in Blob storage, combined with additional file metadata in Table storage. The scenario is described in detail in the article "Index file content and metadata by using Azure Cognitive Search" in the Azure Architecture Center.
By following this sample, you will create an Azure Storage account and upload a number of files to Blob storage. You will then create an Azure Cognitive Search service that indexes these files so that you can search the information they contain. By combining a search indexer for Azure Blob Storage with a separate indexer for Azure Table Storage, each document in the search index will contain the joint metadata from both places.
In this sample, the search index contains `author`, `document_type` and `business_impact` metadata fields. The `author` value is retrieved directly from the files in Blob storage, whereas the `document_type` and `business_impact` values come from a row in Table storage for each associated blob.
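For example, the row in Table storage that supplies the extra metadata for one blob could look like the sketch below. The values and the storage account placeholder are made up, but the `metadata_storage_path`, `document_type` and `business_impact` column names match the deployment templates:

```shell
# Illustrative Table storage entity for a single blob (made-up values).
# PartitionKey and RowKey are required by Table storage; the
# metadata_storage_path column links the row to the blob it describes.
entity='{
  "PartitionKey": "metadata",
  "RowKey": "paper-example-txt",
  "metadata_storage_path": "https://<account>.blob.core.windows.net/files/paper/example.txt",
  "document_type": "paper",
  "business_impact": "high"
}'
echo "$entity"
```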
By default, a few basic text files are uploaded to Blob storage from the `samplefiles` directory in this repository. You can put additional files in the subdirectories, as long as they use one of the supported document formats. The directory name of the file (for example, `paper` or `report`) is used as the `document_type` metadata value, so you can also create additional subdirectories to get more document types to search for. If you'd like to index more interesting files, you can find sample data sets in the Azure Cognitive Search Sample Data repository.
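For instance, a hypothetical `memo` document type could be added like this (the directory and file names here are made up):

```shell
# Create a new document type by adding a subdirectory under samplefiles;
# its name ("memo") becomes the document_type value for files inside it.
mkdir -p samplefiles/memo
cat > samplefiles/memo/q3-update.txt <<'EOF'
A short status memo used to demonstrate a custom document type.
EOF
ls samplefiles/memo
```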
- An active Azure subscription (you can get a free Azure account if you don't have one).
- The Azure CLI installed to create the Azure resources.
- A Bash shell to run the deployment commands. Note that you can use the Azure Cloud Shell for this as well; it even has the Azure CLI pre-installed.
- Optional: if you're using Visual Studio Code, you can use the Azure CLI Tools extension to light up additional features for `.azcli` files.
In your bash terminal, run the following commands to clone this repository:
```shell
git clone https://github.com/Azure-Samples/azure-cognitive-search-blob-metadata.git
cd azure-cognitive-search-blob-metadata
```
Edit the first few lines in `deploy.azcli` to set the Azure region and resource group name. Then, using bash, run the lines from `deploy.azcli` in order. You can copy and paste groups of commands into your bash terminal, or, if you're using Visual Studio Code with the Azure CLI Tools extension, highlight each section and use the "Run Line in Terminal" function.
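The lines you edit typically look something like the sketch below; this is a hedged approximation, and the actual variable names and defaults in `deploy.azcli` may differ:

```shell
# Hypothetical shape of the first lines of deploy.azcli; adjust the region
# and resource group name to your environment before running the rest.
LOCATION="westeurope"                     # Azure region to deploy to
RESOURCE_GROUP="search-metadata-sample"   # resource group for all resources
echo "Deploying to resource group $RESOURCE_GROUP in $LOCATION"
```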
In order to create the index, data sources and indexers in Azure Cognitive Search, this sample uses a few template files. The most relevant files for this sample are:
- `deploy-index.json` defines the structure of the search index.
  - It includes the typical information from Blob storage (the content as well as file name, full path, file size, etc.).
  - It also defines the additional fields for the metadata values (`author`, `document_type` and `business_impact`).
- `deploy-blob-indexer.json` defines the indexer for Blob storage.
  - The `dataToExtract` setting is set to `contentAndMetadata` so that metadata from blobs is included in the search index.
  - The `metadata_storage_path` (which is used as the document key by default) is base64-encoded to ensure legal key names.
- `deploy-table-indexer.json` defines the indexer for Table storage.
  - This uses the same `targetIndexName` as the Blob indexer so that both indexers write to the same destination index.
  - It maps the `metadata_storage_path` column in Table storage to the `metadata_storage_path` field in the index; the value is also base64-encoded to make sure it refers to the same document key as the Blob indexer.
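Both indexers depend on the base64 encoding of `metadata_storage_path` producing identical document keys. The sketch below approximates that encoding with URL-safe base64; this assumes Azure Cognitive Search's `base64Encode` mapping function uses a URL-safe alphabet without padding, so treat it as illustrative rather than byte-for-byte identical:

```shell
# Sketch: derive a URL-safe document key from a blob path (hypothetical
# storage account name; encoding alphabet is an assumption, see above).
path="https://myaccount.blob.core.windows.net/files/paper/example.txt"
key=$(printf '%s' "$path" | base64 | tr -d '\n' | tr '+/' '-_' | tr -d '=')
echo "$key"
```

Because both indexers apply the same function to the same path, they address the same document in the index.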
The last `curl` command in the `deploy.azcli` file performs a search against the index. After a few seconds, this should start returning a JSON structure with the most interesting information for a single document: the search uses `$top=1` to limit the results and `$select` to return only the file name and the relevant metadata values.
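Assembled by hand, that request might look like the sketch below. The service name, index name and API version are assumptions for illustration; the real values are set in `deploy.azcli`:

```shell
# Sketch of the final search request. Only the URL is built here; run the
# commented curl line with a valid query key to execute it for real.
SERVICE="my-search-service"   # assumption: your search service name
INDEX="files-index"           # assumption: the index name from the templates
URL="https://${SERVICE}.search.windows.net/indexes/${INDEX}/docs"
QUERY='api-version=2021-04-30-Preview&search=*&$top=1&$select=metadata_storage_name,author,document_type,business_impact'
echo "${URL}?${QUERY}"
# curl -H "api-key: $QUERY_KEY" "${URL}?${QUERY}"
```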
The important outcome here is that a single document contains both the `author` metadata (coming from Blob storage) and the `document_type` and `business_impact` values (coming from Table storage).
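Put together, a single merged result could look like this (illustrative values only):

```shell
# Illustrative merged search document: author comes from the blob's own
# metadata; document_type and business_impact from the Table storage row.
doc='{
  "metadata_storage_name": "example.txt",
  "author": "Jane Doe",
  "document_type": "paper",
  "business_impact": "high"
}'
echo "$doc"
```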
You can also interactively search for documents in the Azure Portal by using the Search explorer.
To clean up the resources created for this demo, run the last command in the `deploy.azcli` file, or manually delete the resource group that contains all your resources.