# TFX - Create BigQuery Stored Procedures

This multi-part tutorial shows how to use the Matrix Factorization algorithm in BigQuery ML to generate embeddings for items based on their co-occurrence statistics. The generated item embeddings can then be used to find similar items.

This notebook covers creating the BigQuery stored procedures that are executed by the TFX pipeline that automates the solution:
1. [sp_ComputePMI](sql_scripts/sp_ComputePMI.sql) - This computes the pointwise mutual information (PMI) and stores the results in the `item_cooc` table.
2. [sp_TrainItemMatchingModel](sql_scripts/sp_TrainItemMatchingModel.sql) - This creates the `item_embedding_model` Matrix Factorization model using the data in the `item_cooc` table.
3. [sp_ExractEmbeddings](sql_scripts/sp_ExractEmbeddings.sql) - This extracts the item embedding values from the `item_embedding_model`, aggregates the two embedding vectors produced for each item, and stores them in the `item_embeddings` table.
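
To make the first step more concrete, the sketch below computes pointwise mutual information for item pairs in plain Python. It is only an illustration of the general PMI formula, `log2(p(a, b) / (p(a) * p(b)))`, and not the logic of `sp_ComputePMI` itself, whose SQL may differ in its log base, filtering, or smoothing. The `compute_pmi` function and the toy `groups` data are hypothetical names introduced here.

```python
import math
from collections import Counter
from itertools import combinations


def compute_pmi(groups):
  """Returns {(item_a, item_b): pmi} for item pairs that share a group."""
  item_counts = Counter()
  pair_counts = Counter()
  for group in groups:
    items = sorted(set(group))  # Count each item at most once per group.
    item_counts.update(items)
    pair_counts.update(combinations(items, 2))

  n = len(groups)
  pmi = {}
  for (a, b), n_ab in pair_counts.items():
    # Probabilities are estimated as fractions of groups.
    p_ab = n_ab / n
    p_a = item_counts[a] / n
    p_b = item_counts[b] / n
    pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
  return pmi


# Toy data (an assumption): each inner list is one group of items.
groups = [['a', 'b'], ['a', 'b', 'c'], ['c', 'd'], ['a', 'd']]
pmi = compute_pmi(groups)
for pair, value in sorted(pmi.items()):
  print(pair, round(value, 3))
```

A positive value means a pair co-occurs more often than it would if the two items were independent, a value near zero means roughly independent, and a negative value means the pair co-occurs less often than expected.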

The notebook assumes that the `vw_item_groups` view, which is created in the [00_prep_bq_and_datastore.ipynb](00_import_bq_to_datastore.ipynb) notebook, already exists.

## Setup

In [None]:
!pip install -q -U google-cloud-bigquery

### Import libraries

In [None]:
import os
from google.cloud import bigquery

### Configure GCP environment settings

In [None]:
PROJECT_ID = 'ksalama-cloudml' # Change to your project.
BUCKET = 'ksalama-cloudml' # Change to your bucket.
SQL_SCRIPTS_DIR = 'sql_scripts'
BQ_DATASET_NAME = 'recommendations'

!gcloud config set project $PROJECT_ID

### Authenticate your GCP account
This is required only if you run the notebook in Colab.

In [None]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  # Not running in Colab; rely on the environment's default credentials.
  pass

## Execute the BigQuery Scripts to Create the Procedures

In [None]:
client = bigquery.Client(project=PROJECT_ID)

In [None]:
sql_scripts = {}

# Load every .sql file (in a deterministic order) and substitute the
# dataset-name placeholder.
for script_file in sorted(os.listdir(SQL_SCRIPTS_DIR)):
  if not script_file.endswith('.sql'):
    continue
  script_file_path = os.path.join(SQL_SCRIPTS_DIR, script_file)
  with open(script_file_path, 'r') as f:
    sql_script = f.read()
  sql_scripts[script_file] = sql_script.replace('@DATASET_NAME', BQ_DATASET_NAME)
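
Because the `@DATASET_NAME` placeholder is filled in by plain string replacement, it can help to validate the dataset name before substituting it. The following is a minimal sketch of that idea. The `render_sql` helper and the `sp_Example` template are hypothetical and only stand in for the real scripts in `sql_scripts`. The name check assumes that BigQuery dataset names contain only letters, digits, and underscores.

```python
import re

# Hypothetical template for illustration; the real scripts are in sql_scripts/.
EXAMPLE_TEMPLATE = """
CREATE OR REPLACE PROCEDURE @DATASET_NAME.sp_Example()
BEGIN
  SELECT COUNT(*) FROM @DATASET_NAME.item_cooc;
END
"""


def render_sql(template, dataset_name):
  """Substitutes @DATASET_NAME after checking that the name looks safe."""
  # Assumption: dataset names use only letters, digits, and underscores.
  if not re.fullmatch(r'[A-Za-z0-9_]{1,1024}', dataset_name):
    raise ValueError(f'Unexpected dataset name: {dataset_name!r}')
  return template.replace('@DATASET_NAME', dataset_name)


rendered = render_sql(EXAMPLE_TEMPLATE, 'recommendations')
print(rendered)
```

Rejecting unexpected names up front keeps a mistyped or malformed dataset value from producing a script that fails, or does something unintended, only when it reaches BigQuery.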

In [None]:
for script_file, query in sql_scripts.items():
  print(f'Executing {script_file} script...')
  query_job = client.query(query)
  # Block until the script finishes; this raises if the job failed.
  query_job.result()

print('Done.')

## List the Created Procedures

In [None]:
query = f'SELECT * FROM {BQ_DATASET_NAME}.INFORMATION_SCHEMA.ROUTINES;'
query_job = client.query(query)
query_job.result().to_dataframe()
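
The `ROUTINES` view also lists any functions in the dataset, so it can be useful to narrow the result to procedures and check that the expected ones exist. The sketch below is one way to do this. The helper names are hypothetical, and it assumes that each procedure is named after its script file, which the scripts themselves would need to confirm.

```python
# Assumption: procedure names match the script file names listed above.
EXPECTED_PROCEDURES = {
    'sp_ComputePMI',
    'sp_TrainItemMatchingModel',
    'sp_ExractEmbeddings',
}


def build_procedures_query(dataset_name):
  """Builds a query that lists only the stored procedures in a dataset."""
  return (
      'SELECT routine_name, routine_type, created '
      f'FROM {dataset_name}.INFORMATION_SCHEMA.ROUTINES '
      "WHERE routine_type = 'PROCEDURE' "
      'ORDER BY routine_name'
  )


def missing_procedures(found_names, expected=EXPECTED_PROCEDURES):
  """Returns the expected procedure names that were not found, sorted."""
  return sorted(set(expected) - set(found_names))


# Usage with the notebook's client (requires BigQuery access):
# rows = client.query(build_procedures_query(BQ_DATASET_NAME)).result()
# print(missing_procedures(row.routine_name for row in rows))
```

An empty list from `missing_procedures` would indicate that all three procedures were found under the assumed names.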

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product but sample code provided for an educational purpose**