# Image embeddings in BigQuery for image similarity and clustering tasks

This notebook shows how to use a pre-trained embedding model to compute a vector representation of an image stored in Google Cloud Storage.
We then load the embedding model into BigQuery as a BigQuery ML model, so that the embeddings can be used for image similarity or clustering.
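
As a rough illustration of what "similarity" means once images are represented as vectors, here is a minimal sketch of cosine similarity in plain Python. The function name and the tiny 4-dimensional vectors are made up for illustration; the embeddings produced later in this notebook have 1,792 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up stand-ins for real image embeddings (assumption: illustration only).
emb_a = [0.1, 0.9, 0.0, 0.3]
emb_b = [0.2, 0.8, 0.1, 0.4]  # points in nearly the same direction as emb_a
emb_c = [0.9, 0.0, 0.7, 0.0]  # points in a very different direction

print(cosine_similarity(emb_a, emb_b))  # close to 1: similar
print(cosine_similarity(emb_a, emb_c))  # close to 0: dissimilar
```

Images whose embeddings score close to 1 can be treated as similar; the same distance could also drive a clustering algorithm.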

This notebook accompanies the following Medium blog post:


In [1]:
BUCKET='ai-analytics-solutions-kfpdemo'  # CHANGE to a bucket you own

## Embedding model for images

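We'll use the EfficientNet-B4 feature-vector model from TensorFlow Hub, trained on ImageNet. It is compact (about 17.7 million parameters), was trained on a wide variety of real-world images, and produces a 1,792-dimensional embedding.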

In [2]:
import tensorflow as tf
import tensorflow_hub as tfhub
import os

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=[None,None,3]))
model.add(tfhub.KerasLayer("https://tfhub.dev/google/efficientnet/b4/feature-vector/1", name='image_embeddings'))
model.summary()

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
image_embeddings (KerasLayer (None, 1792)              17673816  
Total params: 17,673,816
Trainable params: 0
Non-trainable params: 17,673,816
_________________________________________________________________


The model on TensorFlow Hub expects a batch of images of a fixed size, supplied as float arrays normalized to the range [0, 1].
So we'll define a serving function that takes a filename, reads the image, and carries out the necessary preprocessing before calling the model.
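
To make the input contract concrete, here is a small plain-Python sketch (no TensorFlow) of two of the preprocessing steps: scaling 8-bit pixel values into [0, 1] and wrapping a single image as a batch of one. The helper names and the 2x2 image are hypothetical; in the notebook these steps are carried out with TensorFlow ops inside the serving function.

```python
def normalize_pixels(image):
    """Scale 8-bit pixel values (0..255) to floats in [0, 1]."""
    return [[[channel / 255.0 for channel in pixel] for pixel in row]
            for row in image]

def add_batch_dimension(image):
    """Wrap one [height, width, 3] image as a batch: [1, height, width, 3]."""
    return [image]

def shape(nested):
    """Return the dimensions of a regularly nested, non-empty list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

# A made-up 2x2 RGB image (assumption: illustration only).
tiny_image = [
    [[0, 128, 255], [255, 255, 255]],
    [[64, 64, 64], [0, 0, 0]],
]

batch = add_batch_dimension(normalize_pixels(tiny_image))
print(shape(batch))  # [1, 2, 2, 3]
```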

In [3]:
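@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.string)])
def serve(filename):
    # Note: only the first filename in the input batch is read.
    img = tf.io.read_file(filename[0])
    img = tf.io.decode_image(img, channels=3)
    # Scale 8-bit pixel values to floats in [0, 1].
    img = tf.cast(img, tf.float32) / 255.0
    #img = tf.image.resize(img, [380, 380])
    # The model expects a batch dimension: [height, width, 3] -> [1, height, width, 3].
    img = tf.expand_dims(img, axis=0)
    return model(img)

# Export to Cloud Storage with the function above as the default serving signature.
path='gs://{}/effnet_image_embedding'.format(BUCKET)
tf.saved_model.save(model, path, signatures={'serving_default': serve})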

INFO:tensorflow:Assets written to: gs://ai-analytics-solutions-kfpdemo/effnet_image_embedding/assets



In [12]:
!saved_model_cli show --all --dir gs://$BUCKET/effnet_image_embedding


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['filename'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: serving_default_filename:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['output_0'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1792)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict

## Loading model into BigQuery

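Since we saved the model to Cloud Storage in SavedModel format, loading it into BigQuery is straightforward.

Let's load the model into a BigQuery dataset named advdata (create the dataset first if it does not exist). If you changed BUCKET above, change the bucket name in model_path to match.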

In [13]:
%%bigquery
CREATE OR REPLACE MODEL advdata.effnet_image_embed
OPTIONS(model_type='tensorflow', model_path='gs://ai-analytics-solutions-kfpdemo/effnet_image_embedding/*')
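
The query above hard-codes the bucket name. As a minimal sketch of one way to keep it in sync with the BUCKET variable, the statement could be built in Python instead. The helper name `create_model_sql` is hypothetical, and the commented-out lines at the end assume the google-cloud-bigquery package is installed and credentials are configured.

```python
BUCKET = 'ai-analytics-solutions-kfpdemo'  # CHANGE to a bucket you own

def create_model_sql(bucket, dataset='advdata', model_name='effnet_image_embed'):
    """Build a CREATE OR REPLACE MODEL statement for the SavedModel in `bucket`."""
    model_path = 'gs://{}/effnet_image_embedding/*'.format(bucket)
    return (
        "CREATE OR REPLACE MODEL {}.{}\n"
        "OPTIONS(model_type='tensorflow', model_path='{}')"
    ).format(dataset, model_name, model_path)

sql = create_model_sql(BUCKET)
print(sql)

# One possible way to run the statement outside the %%bigquery magic
# (assumption: google-cloud-bigquery is installed and authenticated):
#   from google.cloud import bigquery
#   bigquery.Client().query(sql).result()
```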

In the BigQuery web console, click the "Schema" tab of the newly loaded model. You will see that the input is a string called filename and the output is called output_0. With about 17.7 million parameters, the model is computationally expensive to run.

In [15]:
%%bigquery
SELECT output_0 FROM
ML.PREDICT(MODEL advdata.effnet_image_embed,(
SELECT 'gs://gcs-public-data--met/634108/0.jpg' AS filename))

Executing query with job ID: 51ae7b70-f2a0-4269-b356-d55944064f76
Query executing: 36.25s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/ai-analytics-solutions/queries/51ae7b70-f2a0-4269-b356-d55944064f76?maxResults=0&timeoutMs=400&location=US: Error when running TensorFlow SavedModel: File system scheme 'gs' not implemented (file: 'gs://gcs-public-data--met/634108/0.jpg')
	 [[{{node ReadFile}}]]

(job ID: 51ae7b70-f2a0-4269-b356-d55944064f76)

                 -----Query Job SQL Follows-----                  

    |    .    |    .    |    .    |    .    |    .    |    .    |
   1:SELECT output_0 FROM
   2:ML.PREDICT(MODEL advdata.effnet_image_embed,(
   3:SELECT 'gs://gcs-public-data--met/634108/0.jpg' AS filename))
    |    .    |    .    |    .    |    .    |    .    |    .    |

The query fails. The error shows that the TensorFlow runtime BigQuery uses to execute the SavedModel does not implement the 'gs' file system scheme, so the ReadFile op in the serving function cannot read the image directly from Cloud Storage.


Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License

