<h1>Creating a searchable Art Database with The MET's open-access collection</h1>

This example shows how to enrich data with Cognitive Skills and write the results to an Azure Search index using MMLSpark. We take a subset of The MET's open-access collection and enrich it with an image description skill (the `AnalyzeImage` transformer in the code below) and a custom 'Image Similarity' skill. The enriched rows are then written to a searchable index.

In [1]:
import os, sys, time, json, requests
from pyspark.ml import Transformer, Estimator, Pipeline
from pyspark.ml.feature import SQLTransformer
from pyspark.sql.functions import lit, udf, col, split
from notebookutils import mssparkutils

In [2]:
# Get the API keys from an Azure Key Vault linked service:
# https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python
# The two keys are separate secrets; fill in the placeholders for each call accordingly.
VISION_API_KEY = mssparkutils.credentials.getSecret("<akv-service-name>", "<akv-secret-name>", "<linked-service-name>")
AZURE_SEARCH_KEY = mssparkutils.credentials.getSecret("<akv-service-name>", "<akv-secret-name>", "<linked-service-name>")

VISION_API_LOCATION = "<vision-api-location>"
search_service = "<search-service-name>"
search_index = "test"

In [3]:
data = spark.read\
  .format("csv")\
  .option("header", True)\
  .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/metartworks_sample.csv")\
  .withColumn("searchAction", lit("upload"))\
  .withColumn("Neighbors", split(col("Neighbors"), ",").cast("array<string>"))\
  .withColumn("Tags", split(col("Tags"), ",").cast("array<string>"))\
  .limit(25)

<img src="https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworkSamples.png" width="800" style="float: center;"/>

In [4]:
from mmlspark.cognitive import AnalyzeImage
from mmlspark.stages import SelectColumns

# Configure the image-analysis transformer; per-row failures go to the "Errors" column
describeImage = (AnalyzeImage()
  .setSubscriptionKey(VISION_API_KEY)
  .setLocation(VISION_API_LOCATION)
  .setImageUrlCol("PrimaryImageUrl")
  .setOutputCol("RawImageDescription")
  .setErrorCol("Errors")
  .setVisualFeatures(["Categories", "Tags", "Description", "Faces", "ImageType", "Color", "Adult"])
  .setConcurrency(5))

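# Run the analysis, expand the nested result struct into top-level columns,
# then drop the error column and the original nested column
df2 = describeImage.transform(data)\
  .select("*", "RawImageDescription.*").drop("Errors", "RawImageDescription")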

<img src="https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworksProcessed.png" width="800" style="float: center;"/>

Before writing the results to a Search index, you must define a schema that specifies the name, type, and attributes of each field in the index. Refer to [Create a basic index in Azure Search](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index) for more information.
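To make the idea of a schema concrete, here is a minimal sketch of what an index definition for this data could look like. The field names are taken from the query output shown later in this notebook; the helper names `make_field` and `build_index_schema`, the choice of which fields are filterable, and the use of `Edm.String` for every scalar field are assumptions for illustration, not the schema this notebook actually uses.

```python
import json


def make_field(name, edm_type, key=False, searchable=True, filterable=False):
    """Build one field definition (hypothetical helper, not part of MMLSpark)."""
    return {
        "name": name,
        "type": edm_type,
        "key": key,
        "searchable": searchable,
        "filterable": filterable,
    }


def build_index_schema(index_name):
    """Assemble an index definition with a name and a list of fields."""
    fields = [
        # Exactly one field must be the document key; here it is ObjectID.
        make_field("ObjectID", "Edm.String", key=True, searchable=False),
        make_field("Department", "Edm.String", filterable=True),
        make_field("Culture", "Edm.String", filterable=True),
        make_field("Medium", "Edm.String"),
        make_field("Classification", "Edm.String", filterable=True),
        make_field("PrimaryImageUrl", "Edm.String", searchable=False),
        # Array columns map to collection types.
        make_field("Tags", "Collection(Edm.String)", filterable=True),
        make_field("Neighbors", "Collection(Edm.String)", searchable=False),
    ]
    return {"name": index_name, "fields": fields}


# Serialize to JSON, the format in which index definitions are exchanged.
index_json = json.dumps(build_index_schema("test"), indent=2)
print(index_json)
```

Whether and how such a JSON string is handed to the writer depends on the MMLSpark version, so check the documentation of the release you are using before relying on this shape.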

In [5]:
# Importing mmlspark.cognitive makes writeToAzureSearch available on Spark DataFrames
from mmlspark.cognitive import *
df2.writeToAzureSearch(
  subscriptionKey=AZURE_SEARCH_KEY,
  actionCol="searchAction",
  serviceName=search_service,
  indexName=search_index,
  keyCol="ObjectID"
)

The Search index can be queried using the [Azure Search REST API](https://docs.microsoft.com/rest/api/searchservice/) by sending GET or POST requests and specifying query parameters that give the criteria for selecting matching documents. For more information on querying, refer to [Query your Azure Search index using the REST API](https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents).

In [6]:
url = 'https://{}.search.windows.net/indexes/{}/docs/search?api-version=2019-05-06'.format(search_service, search_index)
requests.post(url, json={"search": "Glass"}, headers={"api-key": AZURE_SEARCH_KEY}).json()

{'@odata.context': "https://azuresearchforsynapse.search.windows.net/indexes('test')/$metadata#docs(*)", 'value': [{'@search.score': 2.3855119, 'ObjectID': '823', 'Department': 'American Decorative Arts', 'Culture': 'American', 'Medium': 'Favrile glass', 'Classification': 'Glass', 'PrimaryImageUrl': 'https://images.metmuseum.org/CRDImages/ad/original/34602.jpg', 'Tags': ['Bowls'], 'Neighbors': ['https://images.metmuseum.org/CRDImages/as/web-large/54933.jpg', 'https://images.metmuseum.org/CRDImages/ph/web-large/DP136836.jpg', 'https://images.metmuseum.org/CRDImages/as/web-large/30_76_258_F.jpg', 'https://images.metmuseum.org/CRDImages/ph/web-large/DP136791.jpg', 'https://images.metmuseum.org/CRDImages/as/web-large/39081.jpg', 'https://images.metmuseum.org/CRDImages/an/web-large/vsz1999_325_117.jpg', 'https://images.metmuseum.org/CRDImages/as/web-large/862.jpg', 'https://images.metmuseum.org/CRDImages/ph/web-large/DP136819.jpg', 'https://images.metmuseum.org/CRDImages/ph/web-large/DP1368
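The query above uses only the `search` parameter. As a rough sketch of how other criteria could be combined in a POST body, the helper below assembles a request with the `filter`, `select`, and `top` parameters of the Search Documents API. The function name `build_search_request` is hypothetical, and the filter in the example assumes the `Department` field was marked filterable when the index was created, which this notebook does not confirm. The block only builds the request; it does not send anything.

```python
def build_search_request(service, index, api_key, text,
                         odata_filter=None, select=None, top=None):
    """Return (url, headers, body) for a Search Documents POST request.

    Hypothetical helper for illustration; it performs no network calls.
    """
    url = "https://{}.search.windows.net/indexes/{}/docs/search?api-version=2019-05-06".format(
        service, index)
    headers = {"api-key": api_key, "Content-Type": "application/json"}

    body = {"search": text}
    if odata_filter:
        # OData filter expression, e.g. "Department eq 'American Decorative Arts'"
        body["filter"] = odata_filter
    if select:
        # In a POST body, select is a comma-separated list of field names
        body["select"] = ",".join(select)
    if top is not None:
        # Maximum number of results to return
        body["top"] = top
    return url, headers, body


url, headers, body = build_search_request(
    "<search-service-name>", "test", "<search-api-key>", "Glass",
    odata_filter="Department eq 'American Decorative Arts'",
    select=["ObjectID", "Medium", "PrimaryImageUrl"],
    top=5,
)
print(body)
```

The result can be sent the same way as in the cell above, for example with `requests.post(url, json=body, headers=headers).json()`.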

In [7]:
spark.stop()