# Part 2: Exporting the Item embeddings from BigQuery to Cloud Storage using Dataflow

This multi-part tutorial shows how to use the Matrix Factorization algorithm in BigQuery ML to generate embeddings for items based on their co-occurrence statistics. The generated item embeddings can then be used to find similar items.

Part 2 covers exporting the trained embeddings from the Matrix Factorization BigQuery ML Model to Cloud Storage.
The embedding export logic is implemented in a Beam pipeline, which performs the following steps:
1. Read the embedding records from BigQuery.
2. Combine the two embedding vectors for each item generated by the Matrix Factorization model.
3. Format the output as CSV.
4. Write the output to Cloud Storage.
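
The four steps above can be sketched in plain Python, without Beam. This is only an illustration and not the code in the pipeline module: the record layout, the axis labels, the helper names `combine_vectors` and `to_csv_line`, and the choice of element-wise addition for combining the two vectors are all assumptions made for this sketch. The actual combination rule is whatever the pipeline module defines.

```python
from collections import defaultdict

# Hypothetical records standing in for rows read from BigQuery (step 1).
# An item can appear once per axis of the factorization.
records = [
    {"item_Id": "A", "axis": "axis_1", "vector": [0.5, 1.0]},
    {"item_Id": "A", "axis": "axis_2", "vector": [0.25, 2.0]},
    {"item_Id": "B", "axis": "axis_1", "vector": [1.0, 2.0]},
]


def combine_vectors(vectors):
    """Step 2: merge all vectors of one item into a single vector.

    Element-wise addition is an assumption made for this sketch.
    """
    return [sum(values) for values in zip(*vectors)]


def to_csv_line(item_id, vector):
    """Step 3: one CSV line per item: the ID followed by the components."""
    return ",".join([item_id] + [str(v) for v in vector])


# Group the vectors by item ID, as a GroupByKey step would in Beam.
grouped = defaultdict(list)
for record in records:
    grouped[record["item_Id"]].append(record["vector"])

csv_lines = [
    to_csv_line(item_id, combine_vectors(vectors))
    for item_id, vectors in sorted(grouped.items())
]

# Step 4 would write these lines to Cloud Storage; here they are printed.
for line in csv_lines:
    print(line)
```

In a Beam pipeline, each of these stages would typically become its own transform, so that reading, grouping, and writing can run in parallel on Dataflow.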


## Setup

In [None]:
!pip install -U -q apache-beam[gcp]

### Import libraries

In [None]:
import os
import numpy as np
import tensorflow.io as tf_io
import apache_beam as beam
from datetime import datetime

### Configure GCP environment settings

In [None]:
PROJECT_ID = 'ksalama-cloudml' # Change to your project.
BUCKET = 'ksalama-cloudml' # Change to your bucket.
REGION = 'europe-west2' # Change to your Dataflow region.
BQ_DATASET_NAME = 'recommendations'

!gcloud config set project $PROJECT_ID

### Authenticate your GCP account
This is required only if you run the notebook in Colab.

In [None]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  # Not running in Colab, so there is nothing to authenticate here.
  pass

## Export Trained Embeddings from BigQuery ML to Cloud Storage

### Extract the embeddings from the Matrix Factorization model to a BigQuery table

In [None]:
%%bigquery --project $PROJECT_ID

CREATE OR REPLACE PROCEDURE recommendations.sp_ExractEmbeddings() 
BEGIN
CREATE OR REPLACE TABLE recommendations.item_embeddings AS
SELECT 
  feature AS item_Id,
  processed_input AS axis,
  factor_weights,
  intercept
FROM
  ML.WEIGHTS(MODEL `recommendations.item_matching_model`)
WHERE feature != 'global__INTERCEPT__';
END

In [None]:
%%bigquery --project $PROJECT_ID

CALL recommendations.sp_ExractEmbeddings() 

In [None]:
%%bigquery --project $PROJECT_ID

SELECT axis, COUNT(*) embedding_count
FROM recommendations.item_embeddings
GROUP BY axis;

In [None]:
%%bigquery --project $PROJECT_ID

SELECT *
FROM recommendations.item_embeddings
LIMIT 5;
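
For a matrix factorization model, the `factor_weights` column returned by `ML.WEIGHTS` is a repeated field of (`factor`, `weight`) pairs rather than a plain array of numbers. The sketch below shows one way to turn such a row into a dense vector ordered by factor index. The sample row, its values, and the helper name `to_dense_vector` are made up for illustration, and the row is written as the kind of Python dict a BigQuery client would typically hand back.

```python
def to_dense_vector(factor_weights):
    """Order the (factor, weight) pairs by factor index and keep the weights."""
    ordered = sorted(factor_weights, key=lambda fw: fw["factor"])
    return [fw["weight"] for fw in ordered]


# Hypothetical row; a real row has as many factors as the model was trained with.
row = {
    "item_Id": "A",
    "axis": "axis_1",
    "factor_weights": [
        {"factor": 2, "weight": -0.5},
        {"factor": 1, "weight": 0.25},
        {"factor": 3, "weight": 1.5},
    ],
    "intercept": 0.1,
}

vector = to_dense_vector(row["factor_weights"])
print(row["item_Id"], vector)
```

Sorting by `factor` matters because the order of elements in a repeated field is not guaranteed to follow the factor index.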

### Run the embedding extraction pipeline
The Beam pipeline is implemented in the [embeddings_exporter/pipeline.py](embeddings_exporter/pipeline.py) module.

In [None]:
runner = 'DataflowRunner'
timestamp = datetime.utcnow().strftime('%y%m%d%H%M%S')
job_name = f'ks-bqml-export-embeddings-{timestamp}'
bq_dataset_name = BQ_DATASET_NAME
embeddings_table_name = 'item_embeddings'
output_dir = f'gs://{BUCKET}/bqml/item_embeddings'
project = PROJECT_ID
temp_location = os.path.join(output_dir, 'tmp')
region = REGION

print(f'runner: {runner}')
print(f'job_name: {job_name}')
print(f'bq_dataset_name: {bq_dataset_name}')
print(f'embeddings_table_name: {embeddings_table_name}')
print(f'output_dir: {output_dir}')
print(f'project: {project}')
print(f'temp_location: {temp_location}')
print(f'region: {region}')

In [None]:
try:
  os.chdir(os.path.join(os.getcwd(), 'embeddings_exporter'))
except OSError:
  # The directory is missing, or this cell has already been run.
  pass

Execute the pipeline:

In [None]:
if tf_io.gfile.exists(output_dir):
  print("Removing {} contents...".format(output_dir))
  tf_io.gfile.rmtree(output_dir)

print("Creating output: {}".format(output_dir))
tf_io.gfile.makedirs(output_dir)

!python runner.py \
  --runner={runner} \
  --job_name={job_name} \
  --bq_dataset_name={bq_dataset_name} \
  --embeddings_table_name={embeddings_table_name} \
  --output_dir={output_dir} \
  --project={project} \
  --temp_location={temp_location} \
  --region={region}
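
The command above mixes flags specific to this export (`--bq_dataset_name`, `--embeddings_table_name`, `--output_dir`) with standard Beam and Dataflow options (`--runner`, `--job_name`, `--project`, `--temp_location`, `--region`). A common way for a launcher script to separate the two groups is `argparse`'s `parse_known_args`. The sketch below illustrates that pattern only; it is an assumption about how such a script could be written, not the contents of `runner.py`, and the sample values are placeholders.

```python
import argparse


def parse_args(argv):
    """Split the export-specific flags from the remaining pipeline options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bq_dataset_name", required=True)
    parser.add_argument("--embeddings_table_name", required=True)
    parser.add_argument("--output_dir", required=True)
    # Flags the parser does not know are returned untouched in a second list,
    # which can then be handed to Beam as pipeline options.
    return parser.parse_known_args(argv)


args, beam_args = parse_args([
    "--runner=DirectRunner",
    "--bq_dataset_name=recommendations",
    "--embeddings_table_name=item_embeddings",
    "--output_dir=/tmp/item_embeddings",
    "--project=my-project",
])

print(args.bq_dataset_name, args.embeddings_table_name, args.output_dir)
print(beam_args)
```

The leftover list can be passed to `apache_beam.options.pipeline_options.PipelineOptions`, which accepts a list of command-line flags.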

In [None]:
!gsutil ls {output_dir}/embeddings-*.csv
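
Once the files are listed, a quick sanity check is to parse a few lines and confirm that every embedding has the same length. The sketch below assumes each CSV line holds the item ID followed by the vector components; that layout and the helper name `parse_embedding_lines` are assumptions, so adjust the parsing if the exported files differ. An in-memory sample stands in for a downloaded file.

```python
import csv
import io


def parse_embedding_lines(lines):
    """Parse 'item_id,v1,v2,...' lines into a dict of item_id -> list of floats."""
    embeddings = {}
    for fields in csv.reader(lines):
        if not fields:
            continue  # Skip blank lines.
        embeddings[fields[0]] = [float(value) for value in fields[1:]]
    return embeddings


# Hypothetical file contents; replace with an opened embeddings-*.csv file.
sample = io.StringIO("A,0.75,3.0\nB,1.0,2.0\n")

embeddings = parse_embedding_lines(sample)
dimensions = {len(vector) for vector in embeddings.values()}

if len(dimensions) == 1:
    print(f"{len(embeddings)} items, embedding dimension {next(iter(dimensions))}")
else:
    print(f"Inconsistent embedding lengths: {sorted(dimensions)}")
```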

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product, but sample code provided for educational purposes.**