# Part 1: Learn item embeddings based on song co-occurrence

This notebook is the first of five notebooks that guide you through running the [Real-time Item-to-item Recommendation with BigQuery ML Matrix Factorization and ScaNN](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-scann) solution.

Use this notebook to complete the following tasks:

1. Explore the sample playlist data.
2. Compute [Pointwise mutual information (PMI)](https://en.wikipedia.org/wiki/Pointwise_mutual_information) that represents the co-occurrence of songs on playlists.
3. Train a [matrix factorization](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)) model using BigQuery ML to learn item embeddings based on the PMI data.
4. Explore the learned embeddings.

Before starting this notebook, you must run the [00_prep_bq_procedures](00_prep_bq_procedures.ipynb) notebook to complete the solution prerequisites.

After completing this notebook, run the [02_export_bqml_mf_embeddings](02_export_bqml_mf_embeddings.ipynb) notebook to process the item embedding data.


## Setup

Import the required libraries, configure the environment variables, and authenticate your GCP account.

### Import libraries

In [None]:
from google.cloud import bigquery
from datetime import datetime
import matplotlib.pyplot as plt, seaborn as sns

### Configure GCP environment settings

Update the `PROJECT_ID` variable to reflect the ID of the Google Cloud project you are using to implement this solution.

In [None]:
PROJECT_ID = 'yourProject' # Change to your project.

!gcloud config set project $PROJECT_ID

### Authenticate your GCP account
This is required if you run the notebook in Colab. If you use an AI Platform notebook, you should already be authenticated.

In [None]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  # Not running in Colab, so the google.colab module is unavailable.
  pass

## Explore the sample data

Use visualizations to explore the data in the `vw_item_groups` view that you created in the `00_prep_bq_and_datastore.ipynb` notebook.

Import libraries for data visualization:

In [None]:
import matplotlib.pyplot as plt, seaborn as sns

Count the number of songs that occur in at least 15 groups:

In [None]:
%%bigquery  --project $PROJECT_ID

CREATE OR REPLACE TABLE recommendations.valid_items
AS
SELECT 
  item_Id, 
  COUNT(group_Id) AS item_frequency
FROM recommendations.vw_item_groups
GROUP BY item_Id
HAVING item_frequency >= 15;

SELECT COUNT(*) item_count FROM recommendations.valid_items;

Count the number of playlists that have between 2 and 100 items:

In [None]:
%%bigquery  --project $PROJECT_ID

CREATE OR REPLACE TABLE recommendations.valid_groups
AS
SELECT 
  group_Id, 
  COUNT(item_Id) AS group_size
FROM recommendations.vw_item_groups
WHERE item_Id IN (SELECT item_Id FROM recommendations.valid_items)
GROUP BY group_Id
HAVING group_size BETWEEN 2 AND 100;

SELECT COUNT(*) group_count FROM recommendations.valid_groups;

Count the number of records with valid songs and playlists:

In [None]:
%%bigquery  --project $PROJECT_ID

SELECT COUNT(*) record_count
FROM `recommendations.vw_item_groups`
WHERE item_Id IN (SELECT item_Id FROM recommendations.valid_items)
AND group_Id IN (SELECT group_Id FROM recommendations.valid_groups);

Show the playlist size distribution:

In [None]:
%%bigquery size_distribution --project $PROJECT_ID

WITH group_sizes
AS
(
  SELECT 
    group_Id, 
    ML.BUCKETIZE(
      COUNT(item_Id), [10, 20, 30, 40, 50, 101])
     AS group_size
  FROM `recommendations.vw_item_groups`
  WHERE item_Id IN (SELECT item_Id FROM recommendations.valid_items)
  AND group_Id IN (SELECT group_Id FROM recommendations.valid_groups)
  GROUP BY group_Id
)

SELECT 
  CASE 
    WHEN group_size = 'bin_1' THEN '[1 - 10]'
    WHEN group_size = 'bin_2' THEN '[10 - 20]'
    WHEN group_size = 'bin_3' THEN '[20 - 30]'
    WHEN group_size = 'bin_4' THEN '[30 - 40]'
    WHEN group_size = 'bin_5' THEN '[40 - 50]'
    ELSE '[50 - 100]'
  END AS group_size,
  CASE 
    WHEN group_size = 'bin_1' THEN 1
    WHEN group_size = 'bin_2' THEN 2
    WHEN group_size = 'bin_3' THEN 3
    WHEN group_size = 'bin_4' THEN 4
    WHEN group_size = 'bin_5' THEN 5
    ELSE 6
  END AS bucket_Id,
  COUNT(group_Id) group_count
FROM group_sizes
GROUP BY group_size, bucket_Id
ORDER BY bucket_Id 

In [None]:
plt.figure(figsize=(20,5))
q = sns.barplot(x='group_size', y='group_count', data=size_distribution)

Show the song occurrence distribution:

In [None]:
%%bigquery occurrence_distribution --project $PROJECT_ID

WITH item_frequency
AS
(
  SELECT 
    Item_Id, 
    ML.BUCKETIZE(
      COUNT(group_Id)
      , [15, 30, 50, 100, 200, 300, 400]) AS group_count
  FROM `recommendations.vw_item_groups`
  WHERE item_Id IN (SELECT item_Id FROM recommendations.valid_items)
  AND group_Id IN (SELECT group_Id FROM recommendations.valid_groups)
  GROUP BY Item_Id
)


SELECT 
  CASE 
    WHEN group_count = 'bin_1' THEN '[1 - 15]'
    WHEN group_count = 'bin_2' THEN '[15 - 30]'
    WHEN group_count = 'bin_3' THEN '[30 - 50]'
    WHEN group_count = 'bin_4' THEN '[50 - 100]'
    WHEN group_count = 'bin_5' THEN '[100 - 200]'
    WHEN group_count = 'bin_6' THEN '[200 - 300]'
    WHEN group_count = 'bin_7' THEN '[300 - 400]'
    ELSE '[400+]'
  END AS group_count,
  CASE 
    WHEN group_count = 'bin_1' THEN 1
    WHEN group_count = 'bin_2' THEN 2
    WHEN group_count = 'bin_3' THEN 3
    WHEN group_count = 'bin_4' THEN 4
    WHEN group_count = 'bin_5' THEN 5
    WHEN group_count = 'bin_6' THEN 6
    WHEN group_count = 'bin_7' THEN 7
    ELSE 8
  END AS bucket_Id,
  COUNT(Item_Id) item_count
FROM item_frequency
GROUP BY group_count, bucket_Id
ORDER BY bucket_Id 

In [None]:
plt.figure(figsize=(20, 5))
q = sns.barplot(x='group_count', y='item_count', data=occurrence_distribution)

In [None]:
%%bigquery --project $PROJECT_ID

DROP TABLE IF EXISTS recommendations.valid_items;

In [None]:
%%bigquery --project $PROJECT_ID

DROP TABLE IF EXISTS recommendations.valid_groups;

## Compute song PMI data

You run the [sp_ComputePMI](sql_scripts/sp_ComputePMI.sql) stored procedure to compute song PMI data. This PMI data is what you'll use to train the matrix factorization model in the next section.

This stored procedure accepts the following parameters:

+ `min_item_frequency` — Sets the minimum number of times that a song must appear on playlists.
+ `max_group_size` — Sets the maximum number of songs that a playlist can contain.

These parameters are used together to select records where the song occurs on a number of playlists equal to or greater than the `min_item_frequency` value and the playlist contains a number of songs between 2 and the `max_group_size` value. These are the records that get processed to make the training dataset.
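The selection rule described above can be sketched in plain Python. This is only an illustration on made-up records, not the stored procedure itself: the `records` list and the `select_valid_records` helper are hypothetical, and the order of the two filters is an assumption that mirrors the exploration queries earlier in this notebook (songs are filtered first, then playlist sizes are counted over the remaining songs).

```python
from collections import Counter

# Hypothetical (group_Id, item_Id) records standing in for vw_item_groups.
records = [
    ("p1", "a"), ("p1", "b"), ("p1", "c"),
    ("p2", "a"), ("p2", "b"),
    ("p3", "a"),
    ("p4", "b"), ("p4", "d"),
]

def select_valid_records(records, min_item_frequency, max_group_size):
    """Keep records whose song is frequent enough and whose playlist is small enough."""
    # Number of records (playlist appearances) per song.
    item_frequency = Counter(item for _, item in records)
    valid_items = {item for item, n in item_frequency.items()
                   if n >= min_item_frequency}
    # Playlist sizes are counted over valid songs only (assumption, see lead-in).
    group_size = Counter(group for group, item in records if item in valid_items)
    valid_groups = {group for group, n in group_size.items()
                    if 2 <= n <= max_group_size}
    return [(group, item) for group, item in records
            if item in valid_items and group in valid_groups]

valid_records = select_valid_records(records, min_item_frequency=2, max_group_size=2)
print(valid_records)
```

With these toy values, songs `c` and `d` are dropped because they appear only once, and playlists `p3` and `p4` are dropped because each keeps fewer than 2 valid songs.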

The stored procedure works as follows:

1. Creates the `valid_item_groups` table and populates it with records from the
   `vw_item_groups` view that meet the following criteria:

    + The song occurs on a number of playlists equal to or greater than the
      `min_item_frequency` value.
    + The playlist contains a number of songs between 2 and the `max_group_size`
      value.

1. Creates the `item_cooc` table and populates it with co-occurrence data that
   identifies pairs of songs that occur on the same playlist. It does this by:

    1. Self-joining the `valid_item_groups` table on the `group_id` column.
    1. Setting the `cooc` column to 1.
    1. Summing the `cooc` column for the `item1_Id` and `item2_Id` columns.

1. Creates an `item_frequency` table and populates it with data that identifies
   how many playlists each song occurs in.
1. Recreates the `item_cooc` table to include the following record sets:

    + The `item1_Id`, `item2_Id`, and `cooc` data from the original `item_cooc`
      table. The PMI values calculated from these song pairs let the solution
      calculate the embeddings for the rows in the feedback matrix.

     <img src="figures/feedback-matrix-rows.png" alt="Embedding matrix that shows the matrix rows calculated by this step." style="width: 400px;"/>

    + The same data as in the previous bullet, but with the `item1_Id` data
      written to the `item2_Id` column and the `item2_Id` data written to the
      `item1_Id` column. This data provides the mirror values of the initial
      entries in the feedback matrix. The PMI values calculated from these
      song pairs let the solution calculate the embeddings for the columns in
      the feedback matrix.

     <img src="figures/feedback-matrix-columns.png" alt="Embedding matrix that shows the matrix columns calculated by this step." style="width: 400px;"/>

    + The data from the `item_frequency` table. The `item_Id` data is written
      to both the `item1_Id` and `item2_Id` columns and the `frequency` data is
      written to the `cooc` column. This data provides the diagonal entries of
      the feedback matrix. The PMI values calculated from these song pairs let
      the solution calculate the embeddings for the diagonals in the feedback
      matrix.

     <img src="figures/feedback-matrix-diagonals.png" alt="Embedding matrix that shows the matrix diagonals calculated by this step." style="width: 400px;"/>

1. Computes the PMI for item pairs in the `item_cooc` table, then recreates the
   `item_cooc` table to include this data in the `pmi` column.
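The co-occurrence and PMI steps above can be sketched on a few made-up playlists. This is a minimal sketch under stated assumptions: the `playlists` dictionary and the `compute_pmi` helper are hypothetical, the diagonal entries are left out for brevity, and the formula is the standard definition $\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}$ with probabilities estimated as fractions of playlists. The exact expression used inside `sp_ComputePMI` may differ.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy playlists (group_Id -> item_Ids); not from the sample dataset.
playlists = {
    "p1": ["a", "b", "c"],
    "p2": ["a", "b"],
    "p3": ["b", "c"],
    "p4": ["a", "b", "d"],
}

def compute_pmi(playlists):
    """Return {(item1, item2): (cooc, pmi)}, including the mirrored pairs."""
    total = len(playlists)
    # Number of playlists each song occurs on (the item_frequency step).
    frequency = Counter(item for items in playlists.values() for item in set(items))
    # Number of playlists that each unordered song pair shares (the item_cooc step).
    cooc = Counter()
    for items in playlists.values():
        for item1, item2 in combinations(sorted(set(items)), 2):
            cooc[(item1, item2)] += 1

    result = {}
    for (item1, item2), count in cooc.items():
        p_pair = count / total
        p_item1 = frequency[item1] / total
        p_item2 = frequency[item2] / total
        pmi = math.log2(p_pair / (p_item1 * p_item2))
        result[(item1, item2)] = (count, pmi)
        result[(item2, item1)] = (count, pmi)  # Mirrored entry.
    return result

pmi_table = compute_pmi(playlists)
for pair, (count, pmi) in sorted(pmi_table.items()):
    print(pair, count, round(pmi, 3))
```

In this toy data, song `b` is on every playlist, so its pairing with `a` carries no information and the PMI is 0, whereas `a` and `d` co-occur more often than their individual frequencies would suggest and get a positive PMI.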

### Run the `sp_ComputePMI` stored procedure

In [None]:
%%bigquery --project $PROJECT_ID

DECLARE min_item_frequency INT64;
DECLARE max_group_size INT64;

SET min_item_frequency = 15;
SET max_group_size = 100;

CALL recommendations.sp_ComputePMI(min_item_frequency, max_group_size);

### View the song PMI data

In [None]:
%%bigquery --project $PROJECT_ID

SELECT 
  a.item1_Id, 
  a.item2_Id, 
  b.frequency AS freq1,
  c.frequency AS freq2,
  a.cooc,
  a.pmi,
  a.cooc * a.pmi AS score
FROM recommendations.item_cooc a
JOIN recommendations.item_frequency b
ON a.item1_Id = b.item_Id
JOIN recommendations.item_frequency c 
ON a.item2_Id = c.item_Id
WHERE a.item1_Id != a.item2_Id
ORDER BY score DESC
LIMIT 10;

In [None]:
%%bigquery --project $PROJECT_ID

SELECT COUNT(*) records_count 
FROM recommendations.item_cooc

## Train the BigQuery ML matrix factorization model

You run the [sp_TrainItemMatchingModel](sql_scripts/sp_TrainItemMatchingModel.sql) stored procedure to train the `item_matching_model` matrix factorization model on the song PMI data. The model builds a feedback matrix, which in turn is used to calculate item embeddings for the songs. For more information about how this process works, see [Understanding item embeddings](https://cloud.google.com/solutions/real-time-item-matching#understanding_item_embeddings).

This stored procedure accepts the `dimensions` parameter, which provides the value for the [NUM_FACTORS](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-matrix-factorization#num_factors) parameter of the [CREATE MODEL](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-matrix-factorization) statement. The `NUM_FACTORS` parameter lets you set the number of latent factors to use in the model. Higher values for this parameter can increase model performance, but will also increase the time needed to train the model. Using the default `dimensions` value of 50, the model takes around 120 minutes to train.
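To make the role of the latent factors more concrete, here is a toy matrix factorization in plain Python. It is a sketch only: it uses simple stochastic gradient descent rather than whatever training algorithm BigQuery ML applies, the `feedback` values are invented, and the `factorize`, `dot`, and `mean_squared_error` helpers and their default hyperparameters are hypothetical. The `num_factors` argument plays the role that `NUM_FACTORS` plays in the real model.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_squared_error(feedback, row_factors, col_factors):
    """Average squared gap between each feedback value and its reconstruction."""
    return sum((target - dot(row_factors[i], col_factors[j])) ** 2
               for (i, j), target in feedback.items()) / len(feedback)

def factorize(feedback, num_items, num_factors, epochs=200,
              learning_rate=0.05, l2_reg=0.01, seed=0):
    """Learn row and column embeddings whose dot products approximate feedback."""
    rng = random.Random(seed)
    row_factors = [[rng.uniform(-0.1, 0.1) for _ in range(num_factors)]
                   for _ in range(num_items)]
    col_factors = [[rng.uniform(-0.1, 0.1) for _ in range(num_factors)]
                   for _ in range(num_items)]
    for _ in range(epochs):
        for (i, j), target in feedback.items():
            error = target - dot(row_factors[i], col_factors[j])
            for k in range(num_factors):
                r, c = row_factors[i][k], col_factors[j][k]
                # Gradient step with L2 regularization on both embeddings.
                row_factors[i][k] += learning_rate * (error * c - l2_reg * r)
                col_factors[j][k] += learning_rate * (error * r - l2_reg * c)
    return row_factors, col_factors

# Hypothetical symmetric feedback matrix for 3 items: (row, column) -> value.
feedback = {
    (0, 0): 1.5, (1, 1): 1.5, (2, 2): 1.5,
    (0, 1): 1.0, (1, 0): 1.0,
    (0, 2): -0.5, (2, 0): -0.5,
    (1, 2): 0.2, (2, 1): 0.2,
}

zero_factors = [[0.0, 0.0] for _ in range(3)]
baseline_error = mean_squared_error(feedback, zero_factors, zero_factors)
row_factors, col_factors = factorize(feedback, num_items=3, num_factors=2)
trained_error = mean_squared_error(feedback, row_factors, col_factors)
print(baseline_error, trained_error)
```

Each item ends up with one row embedding and one column embedding of length `num_factors`. More factors give the model more capacity to reproduce the feedback values, at the cost of more parameters to train.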


### Run the `sp_TrainItemMatchingModel` stored procedure

After the `item_matching_model` model is created successfully, you can use the [BigQuery console](https://console.cloud.google.com/bigquery) to inspect the loss across the training iterations and to see the final evaluation metrics.

In [None]:
%%bigquery --project $PROJECT_ID

DECLARE dimensions INT64 DEFAULT 50;
CALL recommendations.sp_TrainItemMatchingModel(dimensions)

## Explore the trained embeddings

In [None]:
%%bigquery song_embeddings --project $PROJECT_ID

SELECT 
  feature,
  processed_input,
  factor_weights,
  intercept
FROM
  ML.WEIGHTS(MODEL recommendations.item_matching_model)
WHERE 
  feature IN ('2114406',
              '2114402',
              '2120788',
              '2120786',
              '1086322',
              '3129954',
              '53448',
              '887688',
              '562487',
              '833391',
              '1098069',
              '910683',
              '1579481',
              '2675403',
              '2954929',
              '625169')

In [None]:
songs = {
    '2114406': 'Metallica: Nothing Else Matters',
    '2114402': 'Metallica: The Unforgiven',
    '2120788': 'Limp Bizkit: My Way',
    '2120786': 'Limp Bizkit: My Generation',
    '1086322': 'Jacques Brel: Ne Me Quitte Pas',
    '3129954': 'Édith Piaf: Non, Je Ne Regrette Rien',
    '53448': 'France Gall: Ella, Elle l\'a',
    '887688': 'Enrique Iglesias: Tired Of Being Sorry',
    '562487': 'Shakira: Hips Don\'t Lie',
    '833391': 'Ricky Martin: Livin\' la Vida Loca',
    '1098069': 'Snoop Dogg: Drop It Like It\'s Hot',
    '910683': '2Pac: California Love',
    '1579481': 'Dr. Dre: The Next Episode',
    '2675403': 'Eminem: Lose Yourself',
    '2954929': 'Black Sabbath: Iron Man',
    '625169': 'Black Sabbath: Paranoid',
}

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

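def process_results(results):
  """Builds one embedding vector per item from the ML.WEIGHTS output."""
  items = list(results['feature'].unique())
  item_embeddings = dict()
  for item in items:
    # 100 is an upper bound on the number of factors; unused trailing zeros
    # do not change the cosine similarity.
    embedding = [0.0] * 100
    # Sum the weights across all rows returned for this item.
    embedding_pair = results[results['feature'] == item]

    for _, row in embedding_pair.iterrows():
      factor_weights = list(row['factor_weights'])
      for element in factor_weights:
        # Shift the factor number to a zero-based list index.
        embedding[element['factor'] - 1] += element['weight']

    item_embeddings[item] = embedding

  return item_embeddings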

item_embeddings = process_results(song_embeddings)

In [None]:
item_ids = list(item_embeddings.keys())
# For every song, print the other songs ranked by cosine similarity.
for idx1 in range(len(item_ids)):
  item1_Id = item_ids[idx1]
  title1 = songs[item1_Id]
  print(title1)
  print("==================")
  embedding1 = np.array(item_embeddings[item1_Id])
  similar_items = []
  for idx2 in range(len(item_ids)):
    item2_Id = item_ids[idx2]
    title2 = songs[item2_Id]
    embedding2 = np.array(item_embeddings[item2_Id])
    similarity = round(cosine_similarity([embedding1], [embedding2])[0][0], 5)
    similar_items.append((title2, similarity))
  
  similar_items = sorted(similar_items, key=lambda item: item[1], reverse=True)
  # Skip the top entry, which is normally the song itself (similarity 1.0).
  for element in similar_items[1:]:
    print(f"- {element[0]} = {element[1]}")
  print()
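The ranking above relies on cosine similarity, which measures the angle between two embedding vectors and ignores their lengths. A dependency-free sketch of the same measure follows. The `cosine` helper and the example vectors are hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Sketch-specific choice for all-zero vectors.
    return dot_product / (norm_u * norm_v)

same_direction = cosine([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])
unpadded = cosine([1.0, 2.0], [2.0, 1.0])
padded = cosine([1.0, 2.0, 0.0, 0.0], [2.0, 1.0, 0.0, 0.0])
print(same_direction, orthogonal, unpadded, padded)
```

Vectors that point the same way score 1 regardless of magnitude, and padding both vectors with zeros leaves the score unchanged.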

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product but sample code provided for an educational purpose**