# Document embeddings in BigQuery

This notebook shows how to use a pre-trained embedding as a vector representation of a natural language text column.
We can then use that representation as an input to machine learning models.

## Embedding model for documents

We're going to use a Swivel model that has been pre-trained on Google News. Here's an example of how it works in Python, although we will use it directly in BigQuery.

In [10]:
import tensorflow as tf
import tensorflow_hub as tfhub

model = tf.keras.Sequential()
model.add(tfhub.KerasLayer("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
                           output_shape=[20], input_shape=[], dtype=tf.string))
model.summary()
model.predict(["""
Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially. At the stroke of the midnight hour, when the world sleeps, India will awake to life and freedom.
A moment comes, which comes but rarely in history, when we step out from the old to the new -- when an age ends, and when the soul of a nation, long suppressed, finds utterance. 
"""])

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_3 (KerasLayer)   (None, 20)                400020    
Total params: 400,020
Trainable params: 0
Non-trainable params: 400,020
_________________________________________________________________


array([[ 0.52828205, -0.814417  ,  2.7678437 , -0.70152074, -0.99541044,
        -2.9311025 , -1.3798233 ,  0.10915907,  0.8491049 , -1.6155498 ,
        -1.1453229 ,  1.2871503 , -1.0593784 ,  0.32060066, -3.060015  ,
         2.4751766 ,  2.9106884 , -2.6531873 , -2.379123  , -0.58328384]],
      dtype=float32)
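To build some intuition for why a passage of any length maps to a fixed-size vector, here is a minimal sketch in plain Python. It assumes the model behaves roughly like a token lookup followed by a combiner (sum divided by the square root of the token count); that internal detail is an assumption and is not confirmed by this notebook. The toy vocabulary, the `OOV_VECTOR` stand-in, and the `embed` helper are all hypothetical, and the vectors are 3-dimensional rather than 20-dimensional. The 400,020 parameters in the summary above are at least consistent with a lookup table of 20,001 rows of 20 values each (20,001 × 20 = 400,020).

```python
import math

# Hypothetical toy vocabulary with 3-dimensional vectors (the real model uses 20).
TOY_VOCAB = {
    "india": [1.0, 0.0, 2.0],
    "will": [0.0, 1.0, 0.0],
    "awake": [3.0, 1.0, 0.0],
}
# Hypothetical stand-in vector for tokens that are not in the vocabulary.
OOV_VECTOR = [0.5, 0.5, 0.5]


def embed(text):
    """Split on whitespace, look up each token, and combine with sum / sqrt(n)."""
    tokens = text.split()
    dim = len(OOV_VECTOR)
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for token in tokens:
        vector = TOY_VOCAB.get(token, OOV_VECTOR)
        for i, value in enumerate(vector):
            total[i] += value
    scale = math.sqrt(len(tokens))
    return [value / scale for value in total]


short_text = embed("india will awake")
long_text = embed("india will awake to life and freedom")
# Both results have the same length, regardless of how many tokens went in.
print(len(short_text), len(long_text))
```

Because the output size depends only on the vector dimension, the embedding can be stored in a fixed-width column and fed to downstream models.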

## Loading model into BigQuery

The Swivel model above is already available in SavedModel format, but it needs to be on Google Cloud Storage before we can load it into BigQuery.

In [18]:
%%bash
BUCKET=ai-analytics-solutions-kfpdemo   # CHANGE AS NEEDED

rm -rf tmp
mkdir tmp
FILE=swivel.tar.gz
# Quote the URL so the shell does not treat "?" as a glob character
wget --quiet -O "tmp/${FILE}" "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1?tf-hub-format=compressed"
cd tmp
tar xvfz "${FILE}"
cd ..
mv tmp swivel
gsutil -m cp -R swivel gs://${BUCKET}/swivel
rm -rf swivel

echo "Model artifacts are now at gs://${BUCKET}/swivel/*"

assets/
assets/tokens.txt
saved_model.pb
variables/
variables/variables.data-00000-of-00001
variables/variables.index
Model artifacts are now at gs://ai-analytics-solutions-kfpdemo/swivel/*


Copying file://swivel/swivel.tar.gz [Content-Type=application/x-tar]...
Copying file://swivel/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://swivel/assets/tokens.txt [Content-Type=text/plain]...
Copying file://swivel/variables/variables.index [Content-Type=application/octet-stream]...
Copying file://swivel/variables/variables.data-00000-of-00001 [Content-Type=application/octet-stream]...
/ [5/5 files][  3.2 MiB/  3.2 MiB] 100% Done                                    
Operation completed over 5 objects/3.2 MiB.                                      


Let's load the model into a BigQuery dataset named advdata (create the dataset first if it does not already exist).

In [19]:
%%bigquery
CREATE OR REPLACE MODEL advdata.swivel_text_embed
OPTIONS(model_type='tensorflow', model_path='gs://ai-analytics-solutions-kfpdemo/swivel/*')
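The bucket name is a variable in the bash cell but hardcoded in the SQL cell, so the two can drift apart if you change one and forget the other. A minimal sketch of one way to keep them in sync is to generate the statement from the bucket name. The helper name `create_model_sql` and its default arguments are hypothetical and are not part of any BigQuery API; the function only builds a string.

```python
def create_model_sql(bucket, dataset="advdata", model="swivel_text_embed"):
    """Return a CREATE MODEL statement that points at gs://<bucket>/swivel/*."""
    model_path = f"gs://{bucket}/swivel/*"
    return (
        f"CREATE OR REPLACE MODEL {dataset}.{model}\n"
        f"OPTIONS(model_type='tensorflow', model_path='{model_path}')"
    )


# Reproduces the statement used in this notebook for the default bucket.
print(create_model_sql("ai-analytics-solutions-kfpdemo"))
```

The resulting string can be pasted into a `%%bigquery` cell or submitted with whichever BigQuery client you already use.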

From the BigQuery web console, click on the "Schema" tab for the newly loaded model. The input is called sentences and the output is called output_0:
<img src="swivel_schema.png" />

In [20]:
%%bigquery
SELECT output_0 FROM
ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT "Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially." AS sentences))

Unnamed: 0,output_0
0,"[-0.09961678087711334, -1.1282159090042114, 2...."


## Create lookup table

Let's create a lookup table of embeddings. We'll use the comments field of a storm reports table from NOAA.
Computing the embeddings once and storing them for reuse is an example of the Feature Store design pattern.

In [3]:
%%bigquery
CREATE OR REPLACE TABLE advdata.comments_embedding AS
SELECT
  output_0 AS comments_embedding,
  comments
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
  SELECT comments, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
))

For an example of using these embeddings in text similarity or document clustering, please see the following Medium blog post: https://medium.com/@lakshmanok/how-to-do-text-similarity-search-and-document-clustering-in-bigquery-75eb8f45ab65
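To give a flavor of what a similarity search over the lookup table involves, here is a minimal sketch in plain Python using cosine similarity. It is not the method from the blog post. The rows in `LOOKUP` are made-up comments with made-up 3-dimensional vectors that only mimic the shape of `advdata.comments_embedding`, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up (comments, comments_embedding) rows; real embeddings have 20 dimensions.
LOOKUP = [
    ("large tree down on power lines", [0.9, 0.1, 0.0]),
    ("roof blown off barn", [0.1, 0.8, 0.3]),
    ("trees and power lines down", [0.8, 0.2, 0.1]),
]


def most_similar(query_embedding, rows, top_k=2):
    """Return the top_k comments ranked by cosine similarity to the query embedding."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in rows]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]


print(most_similar([1.0, 0.0, 0.0], LOOKUP))
# prints ['large tree down on power lines', 'trees and power lines down']
```

In practice the query embedding would come from running ML.PREDICT on the query text, and the ranking would run over the stored embeddings rather than an in-memory list.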

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

