# Document embeddings in BigQuery for document similarity and clustering tasks

This notebook shows how to use a pre-trained embedding as a vector representation of a natural language text column.
Given this embedding, we can load it as a BQ-ML model and then carry out document similarity or clustering.

This notebook accompanies the following Medium blog post:
https://medium.com/@lakshmanok/how-to-do-text-similarity-search-and-document-clustering-in-bigquery-75eb8f45ab65

## Embedding model for documents

We're going to use a model that has been pretrained on Google News. Here's an example of how it works in Python. We will use it directly in BigQuery, however.

In [10]:
import tensorflow as tf
import tensorflow_hub as tfhub

model = tf.keras.Sequential()
model.add(tfhub.KerasLayer("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
                           output_shape=[20], input_shape=[], dtype=tf.string))
model.summary()
model.predict(["""
Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially. At the stroke of the midnight hour, when the world sleeps, India will awake to life and freedom.
A moment comes, which comes but rarely in history, when we step out from the old to the new -- when an age ends, and when the soul of a nation, long suppressed, finds utterance. 
"""])

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_3 (KerasLayer)   (None, 20)                400020    
Total params: 400,020
Trainable params: 0
Non-trainable params: 400,020
_________________________________________________________________


array([[ 0.52828205, -0.814417  ,  2.7678437 , -0.70152074, -0.99541044,
        -2.9311025 , -1.3798233 ,  0.10915907,  0.8491049 , -1.6155498 ,
        -1.1453229 ,  1.2871503 , -1.0593784 ,  0.32060066, -3.060015  ,
         2.4751766 ,  2.9106884 , -2.6531873 , -2.379123  , -0.58328384]],
      dtype=float32)

## Loading model into BigQuery

The Swivel model above is already available in SavedModel format. But we need it on Google Cloud Storage before we can load it into BigQuery.

In [18]:
%%bash
BUCKET=ai-analytics-solutions-kfpdemo   # CHANGE AS NEEDED

rm -rf tmp
mkdir tmp
FILE=swivel.tar.gz
# Quote the URL so the shell does not treat the "?" as a glob character
wget --quiet -O tmp/${FILE} "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1?tf-hub-format=compressed"
cd tmp
tar xvfz ${FILE}
cd ..
mv tmp swivel
gsutil -m cp -R swivel gs://${BUCKET}/swivel
rm -rf swivel

echo "Model artifacts are now at gs://${BUCKET}/swivel/*"

assets/
assets/tokens.txt
saved_model.pb
variables/
variables/variables.data-00000-of-00001
variables/variables.index
Model artifacts are now at gs://ai-analytics-solutions-kfpdemo/swivel/*


Copying file://swivel/swivel.tar.gz [Content-Type=application/x-tar]...
Copying file://swivel/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://swivel/assets/tokens.txt [Content-Type=text/plain]...
Copying file://swivel/variables/variables.index [Content-Type=application/octet-stream]...
Copying file://swivel/variables/variables.data-00000-of-00001 [Content-Type=application/octet-stream]...
/ [5/5 files][  3.2 MiB/  3.2 MiB] 100% Done                                    
Operation completed over 5 objects/3.2 MiB.                                      


Let's load the model into a BigQuery dataset named advdata (create the dataset first if necessary). If you changed BUCKET above, change model_path to match.

In [19]:
%%bigquery
CREATE OR REPLACE MODEL advdata.swivel_text_embed
OPTIONS(model_type='tensorflow', model_path='gs://ai-analytics-solutions-kfpdemo/swivel/*')

In the BigQuery web console, click on the "Schema" tab for the newly loaded model. The input is called sentences and the output is called output_0:
<img src="swivel_schema.png" />

In [20]:
%%bigquery
SELECT output_0 FROM
ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT "Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially." AS sentences))

Unnamed: 0,output_0
0,"[-0.09961678087711334, -1.1282159090042114, 2...."
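
The same lookup can also be issued from plain Python instead of the `%%bigquery` magic. The following is a minimal sketch, assuming the `google-cloud-bigquery` client library is installed and default credentials are configured. The helper names `embedding_query` and `embed_text` are hypothetical and not part of the original notebook. Binding the text as a query parameter avoids quoting problems when the text itself contains quotes.

```python
def embedding_query(model="advdata.swivel_text_embed"):
    """Build the ML.PREDICT statement; the text is bound as the @text parameter."""
    return (
        "SELECT output_0 FROM ML.PREDICT("
        f"MODEL {model}, (SELECT @text AS sentences))"
    )


def embed_text(text, model="advdata.swivel_text_embed"):
    """Return the embedding of `text` as a list of floats (needs BigQuery access)."""
    # Imported here so that building the query does not require the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("text", "STRING", text)]
    )
    rows = client.query(embedding_query(model), job_config=job_config).result()
    return list(next(iter(rows))["output_0"])
```

Calling `embed_text("power line down on a home")` should then return the 20 numbers that the `output_0` column holds for that sentence.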


## Document search

Let's use the embeddings to return similar strings. We'll use the comments field of a storm reports table from NOAA.

In [25]:
%%bigquery
SELECT 
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  ST_GeogPoint(longitude, latitude) AS location,
  comments
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
LIMIT 10

Unnamed: 0,julian_day,location,comments
0,19,POINT(-85.18 32.49),TREES DOWN NEAR THE INTERSECTION OF LEE RD 440...
1,43,POINT(-85.13 32.49),REPORTS OF TREES DOWN IN VARIOUS LOCATIONS IN ...
2,62,POINT(-85.24 32.6),CORRECTS PREVIOUS TORNADO REPORT FROM SALEM. U...
3,85,POINT(-85.1 32.55),A TREE WAS DOWNED ONTO A HOME. (BMX)
4,158,POINT(-85.42 32.6),TREE DOWN ON A HOME. TIME ESTIMATED FROM RADAR...
5,158,POINT(-85.42 32.51),MULTIPLE TREES DOWN ON LEE ROAD 29. TIME ESTIM...
6,158,POINT(-85.18 32.71),TREES DOWN IN BEULAH. TIME ESTIMATED FROM RADA...
7,158,POINT(-85.09 32.54),MULTIPLE TREES DOWN IN SMITHS STATION. TIME ES...
8,190,POINT(-85.49 32.67),TREES DOWN NEAR THE INTERSECTION OF LEE RD 147...
9,217,POINT(-85.07 32.52),CORRECTS PREVIOUS TSTM WND DMG REPORT FROM 1 E...


Let's define a distance function and then search for documents that match the search string "power line down on a home". Note that the top matches include "house" as a synonym for "home". The remaining matches are not as good, but they are close, and they all include "power line", which is the more distinctive term.

In [2]:
%%bigquery
CREATE TEMPORARY FUNCTION td(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>, idx INT64) AS (
   (a[OFFSET(idx)] - b[OFFSET(idx)]) * (a[OFFSET(idx)] - b[OFFSET(idx)])
);

CREATE TEMPORARY FUNCTION term_distance(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>) AS ((
   SELECT SQRT(SUM( td(a, b, idx))) FROM UNNEST(GENERATE_ARRAY(0, 19)) idx
));

WITH search_term AS (
  SELECT output_0 AS term_embedding FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(SELECT "power line down on a home" AS sentences))
)

SELECT
  term_distance(term_embedding, output_0) AS termdist,
  comments
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
  SELECT comments, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
)), search_term
ORDER BY termdist ASC
LIMIT 10

Unnamed: 0,termdist,comments
0,0.959289,POWER LINE DOWN ON HOUSE (ILX)
1,0.959289,POWER LINE DOWN ON HOUSE. (BGM)
2,1.218646,TREE DOWN ON A POWER LINE ON TATE RD. (RNK)
3,1.277504,TREE DOWN ON A POWER LINE. (TAE)
4,1.277504,TREE DOWN ON A POWER LINE. (DVN)
5,1.332273,TREE DOWN ON CR450 AND POWER LINE DOWN ON CR 4...
6,1.370752,POWER LINE DOWN ON BOWEN ROAD IN KING. (RNK)
7,1.370996,TREE DOWN... TAKING DOWN A POWER LINE... AT IN...
8,1.412945,POWER LINES KNOCKED DOWN ON SIERRA VISTA DR NE...
9,1.415289,TREE DOWN ON POWER LINE. (RAH)
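
The two SQL functions above compute a plain Euclidean distance over the 20 embedding dimensions. For readers who want to check the arithmetic outside BigQuery, here is a small Python sketch of the same calculation. The Python function reuses the name `term_distance` for illustration only, and the vectors in the usage line are made-up toy values, not real embeddings.

```python
import math


def term_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Same as SQRT(SUM(td(a, b, idx))) in the SQL version.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(term_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

A smaller distance means a closer match, which is why the query sorts by `termdist` in ascending order.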


## Document clustering

We can use the embeddings as input to a K-Means clustering model. To make things interesting, let's also include the day and location.
K-Means at present doesn't accept arrays as input, so I'm defining a function that converts the 20-element array into a struct with named fields (p1 through p20).

In [None]:
%%bigquery
CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
RETURNS 
STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
       p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64, 
       p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64, 
       p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
       p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64>

AS (
STRUCT(
    arr[OFFSET(0)]
    , arr[OFFSET(1)]
    , arr[OFFSET(2)]
    , arr[OFFSET(3)]
    , arr[OFFSET(4)]
    , arr[OFFSET(5)]
    , arr[OFFSET(6)]
    , arr[OFFSET(7)]
    , arr[OFFSET(8)]
    , arr[OFFSET(9)]
    , arr[OFFSET(10)]
    , arr[OFFSET(11)]
    , arr[OFFSET(12)]
    , arr[OFFSET(13)]
    , arr[OFFSET(14)]
    , arr[OFFSET(15)]
    , arr[OFFSET(16)]
    , arr[OFFSET(17)]
    , arr[OFFSET(18)]
    , arr[OFFSET(19)]    
));


CREATE OR REPLACE MODEL advdata.storm_reports_clustering
OPTIONS(model_type='kmeans', NUM_CLUSTERS=10) AS

SELECT
  arr_to_input_20(output_0) AS comments_embed,
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
  SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
))
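
Writing out 20 struct fields by hand is tedious and would have to be redone for an embedding of a different size. As a sketch, the text of the UDF can be generated in Python instead. `make_arr_to_struct_udf` is a hypothetical helper, and it assumes the same `p1 ... pN` field naming as the function above.

```python
def make_arr_to_struct_udf(n, name=None):
    """Return SQL for a temporary UDF that turns an n-element array into a struct."""
    name = name or f"arr_to_input_{n}"
    # Field names are 1-based (p1..pn); array offsets are 0-based.
    fields = ", ".join(f"p{i + 1} FLOAT64" for i in range(n))
    values = ", ".join(f"arr[OFFSET({i})]" for i in range(n))
    return (
        f"CREATE TEMPORARY FUNCTION {name}(arr ARRAY<FLOAT64>)\n"
        f"RETURNS STRUCT<{fields}>\n"
        f"AS (STRUCT({values}));"
    )


print(make_arr_to_struct_udf(3))
```

The generated statement can be prepended to a query string in the same way the hand-written function is prepended in the cells of this notebook.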

The resulting clusters look like this:
<img src="storm_reports_clusters.png"/>

Let's show a few of the comments from cluster #1.

In [27]:
%%bigquery
CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
RETURNS 
STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
       p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64, 
       p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64, 
       p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
       p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64>

AS (
STRUCT(
    arr[OFFSET(0)]
    , arr[OFFSET(1)]
    , arr[OFFSET(2)]
    , arr[OFFSET(3)]
    , arr[OFFSET(4)]
    , arr[OFFSET(5)]
    , arr[OFFSET(6)]
    , arr[OFFSET(7)]
    , arr[OFFSET(8)]
    , arr[OFFSET(9)]
    , arr[OFFSET(10)]
    , arr[OFFSET(11)]
    , arr[OFFSET(12)]
    , arr[OFFSET(13)]
    , arr[OFFSET(14)]
    , arr[OFFSET(15)]
    , arr[OFFSET(16)]
    , arr[OFFSET(17)]
    , arr[OFFSET(18)]
    , arr[OFFSET(19)]    
));

SELECT sentences 
FROM ML.PREDICT(MODEL advdata.storm_reports_clustering,
(
SELECT
  sentences,
  arr_to_input_20(output_0) AS comments_embed,
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
  SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
))))
WHERE centroid_id = 1
LIMIT 10

Unnamed: 0,sentences
0,(iln)
1,(iln)
2,(cle)
3,(iln)
4,asos station ktol toledo. (cle)
5,(cle)
6,pea size hail also reported. (cle)
7,(pub)
8,corrects previous tstm wnd gst report from 9 s...
9,(rlx)


As you can see, these are basically uninformative comments. How about cluster #3?

In [28]:
%%bigquery
CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
RETURNS 
STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
       p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64, 
       p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64, 
       p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
       p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64>

AS (
STRUCT(
    arr[OFFSET(0)]
    , arr[OFFSET(1)]
    , arr[OFFSET(2)]
    , arr[OFFSET(3)]
    , arr[OFFSET(4)]
    , arr[OFFSET(5)]
    , arr[OFFSET(6)]
    , arr[OFFSET(7)]
    , arr[OFFSET(8)]
    , arr[OFFSET(9)]
    , arr[OFFSET(10)]
    , arr[OFFSET(11)]
    , arr[OFFSET(12)]
    , arr[OFFSET(13)]
    , arr[OFFSET(14)]
    , arr[OFFSET(15)]
    , arr[OFFSET(16)]
    , arr[OFFSET(17)]
    , arr[OFFSET(18)]
    , arr[OFFSET(19)]    
));

SELECT sentences 
FROM ML.PREDICT(MODEL advdata.storm_reports_clustering,
(
SELECT
  sentences,
  arr_to_input_20(output_0) AS comments_embed,
  EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
  longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
  SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
  FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
  WHERE EXTRACT(YEAR from timestamp) = 2019
))))
WHERE centroid_id = 3

Unnamed: 0,sentences
0,tree downed along brier ridge road. time estim...
1,barn and wires down near us 62. time estimated...
2,tree downed along yockey road. time estimated ...
3,multiple power poles down across road. time es...
4,mobile home and outbuilding destroyed. signifi...
...,...
1634,three trees were knocked down outside of waver...
1635,tree down across vigo road. time estimated. (iln)
1636,trees down on bluelick road. time estimated fr...
1637,numerous trees reported down in huntington tow...


These are mostly reports whose time was estimated, in some cases from radar.
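
An impression like this is easy to check by counting how many comments in a cluster contain a given phrase. Below is a minimal sketch, assuming the cluster's comments have already been pulled into a Python list. `fraction_containing` is a hypothetical helper, and the four sample strings are the untruncated comments visible in the outputs above, mixed from both clusters purely to exercise the function.

```python
def fraction_containing(comments, phrase):
    """Fraction of comments that contain `phrase` (case-insensitive)."""
    if not comments:
        return 0.0
    phrase = phrase.lower()
    hits = sum(1 for c in comments if phrase in c.lower())
    return hits / len(comments)


sample = [
    "tree down across vigo road. time estimated. (iln)",
    "asos station ktol toledo. (cle)",
    "pea size hail also reported. (cle)",
    "(iln)",
]
print(fraction_containing(sample, "time estimated"))  # 0.25
```

Running the same count separately on each cluster's full set of comments would show how strongly a phrase such as "time estimated" is concentrated in one cluster.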

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License

