In [1]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Collaborative Filtering using AutoML Tables

## Overview
In this notebook we will see how [AutoML Tables](https://cloud.google.com/automl-tables/) can be used to make music recommendations to users using collaborative filtering. AutoML Tables is a supervised learning service for structured data that can vastly simplify the model building process.

### Dataset
AutoML Tables allows data to be imported from either GCS or BigQuery. This tutorial uses the [ListenBrainz](https://console.cloud.google.com/marketplace/details/metabrainz/listenbrainz) dataset from [Cloud Marketplace](https://console.cloud.google.com/marketplace), hosted in BigQuery.

The ListenBrainz dataset is a log of songs played by users. Notable fields in the schema include:
  - **user_name:** a user id.
  - **track_name:** a song id.
  - **artist_name:** the artist of the song.
  - **release_name:** the album of the song.
  - **tags:** the genres of the song.

### Objective
The goal of this notebook is to demonstrate how to create a lookup table in BigQuery of songs to recommend to users using a log of user-song listens and AutoML Tables. This will be done by training a regression model to predict how similar a given `user` and `song` are on a 0 to 1 scale, and using predictions for every `(user, song)` pair to generate a ranking of the most similar songs for each user.

As the number of `(user, song)` pairs grows as the product of the number of unique users and the number of unique songs, this approach may not be optimal for extremely large datasets. One workaround would be to train a model that learns to embed users and songs in the same embedding space, and use a nearest-neighbors algorithm to get recommendations for users. Unfortunately, AutoML Tables does not expose any feature for training and using embeddings, so a [custom ML model](https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-collaborative-filtering) would need to be used instead.
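
To make the scaling concern concrete, the sketch below counts the rows a full lookup table would need and shows the embedding alternative in miniature. The catalogue sizes and the 2-d vectors are made-up numbers, not learned embeddings, and `top_songs` is a hypothetical helper; a real system would learn the vectors and would likely use an approximate nearest-neighbor index rather than scoring every song.

```python
# Hypothetical catalogue sizes, for illustration only.
num_users = 1000
num_songs = 10000
num_pairs = num_users * num_songs  # one prediction row per (user, song) pair

# Toy 2-d embeddings (made-up values, not learned by any model).
user_vecs = {"alice": (1.0, 0.0), "bob": (0.0, 1.0)}
song_vecs = {
    "song_a": (0.9, 0.1),
    "song_b": (0.2, 0.8),
    "song_c": (0.6, 0.6),
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_songs(user, k):
    """Rank songs by dot-product similarity to the user's embedding."""
    scores = {song: dot(user_vecs[user], vec) for song, vec in song_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(num_pairs)
print(top_songs("alice", 2))
print(top_songs("bob", 2))
```

With embeddings, only one vector per user and one per song has to be stored, instead of one row per pair.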

Another recommendation approach worth mentioning is [extreme multiclass classification](https://ai.google/research/pubs/pub45530), which also avoids storing every possible pair of users and songs. Unfortunately, AutoML Tables does not support multiclass classification with more than [100 classes](https://cloud.google.com/automl-tables/docs/prepare#target-requirements).

### Costs
This tutorial uses billable components of Google Cloud Platform (GCP):
- Cloud AutoML Tables

Learn about [AutoML Tables pricing](https://cloud.google.com/automl-tables/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## 1. Setup

Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to
* [Enable billing](https://cloud.google.com/billing/docs/how-to/modify-project).
* [Enable AutoML API](https://console.cloud.google.com/apis/library/automl.googleapis.com?q=automl&project=automl-music-recommendation&folder&organizationId=433637338589)

### 1.1 PIP Install Packages and dependencies
Install additional dependencies that are not preinstalled in the notebook environment.

In [4]:
! pip install --quiet google-cloud-automl google-cloud-bigquery

Restart the kernel to allow `automl_v1beta1` to be imported.

In [5]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

### 1.2 Import libraries and define constants

Populate the following cell with the necessary constants.

In [9]:
# The GCP project id.
PROJECT_ID = ""
# The region to use for compute resources.
LOCATION = ""
# A name for the AutoML Tables dataset to create.
DATASET_DISPLAY_NAME = ""
# The BigQuery dataset to import data from (doesn't need to exist).
INPUT_BQ_DATASET = ""
# The BigQuery table to import data from (doesn't need to exist).
INPUT_BQ_TABLE = ""
# A name for the AutoML Tables model to create.
MODEL_DISPLAY_NAME = ""
# The number of hours to train the model.
MODEL_TRAIN_HOURS = 0

assert all([
    PROJECT_ID,
    LOCATION,
    DATASET_DISPLAY_NAME,
    INPUT_BQ_DATASET,
    INPUT_BQ_TABLE,
    MODEL_DISPLAY_NAME,
    MODEL_TRAIN_HOURS,
])
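
The bare `assert` above fails without saying which constant is still empty. A minimal sketch of a more informative check follows; `missing_constants` is a hypothetical helper, not part of any library, and the values passed to it here are placeholders.

```python
def missing_constants(constants):
    """Return the names of constants that are empty, zero, or None."""
    return [name for name, value in constants.items() if not value]


# Placeholder values for illustration; use the notebook's real constants.
example = {
    "PROJECT_ID": "my-project",
    "LOCATION": "",
    "MODEL_TRAIN_HOURS": 0,
}
print(missing_constants(example))
```

Note that `MODEL_TRAIN_HOURS = 0` counts as missing, which matches the behavior of the original `assert all([...])`.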

Import relevant packages and initialize clients for BigQuery and AutoML Tables.

In [8]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from google.cloud import automl_v1beta1
from google.cloud import bigquery
from google.cloud import exceptions

aml_client = automl_v1beta1.AutoMlClient()
prediction_client = automl_v1beta1.PredictionServiceClient()
bq_client = bigquery.Client()

location_path = aml_client.location_path(PROJECT_ID, LOCATION)
location_path

'projects/automl-music-recommendation/locations/us-central1'

## 2. Create a Dataset

In order to train a model, a structured dataset must be ingested into AutoML Tables from either BigQuery or Google Cloud Storage. Once the data is ingested, you can choose which columns to use as features, labels, or weights, and configure the loss function.

### 2.1 Create BigQuery table

First, do some feature engineering on the original ListenBrainz dataset to turn it into a training dataset, and export it into a separate BigQuery table:

    1. Make each sample a unique `(user, song)` pair.
    2. For features, use the user's 10 most-played songs along with the song's artist, genre tags, and number of albums.
    3. For a label, use the number of times the user has listened to the song, normalized by dividing by the maximum number of times that user has listened to any song. Normalizing the listen counts ensures active users don't have a disproportionate effect on the model error.
    4. Add a weight equal to the square root of the label, so that songs the user plays more often carry higher weights. This helps account for the skew in the label distribution.
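
The label and weight steps can be sketched in plain Python on a toy listen log. This mirrors the `count_norm` and `weight` columns of the SQL query in this section (where the weight is the square root of the normalized count); the listen events and the `build_rows` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy listen log of (user, song) events; the values are made up.
listens = [
    ("alice", "song_a"), ("alice", "song_a"), ("alice", "song_a"),
    ("alice", "song_a"), ("alice", "song_b"),
    ("bob", "song_b"), ("bob", "song_c"), ("bob", "song_c"),
]


def build_rows(listens):
    """Compute a normalized listen count and a weight per (user, song) pair."""
    counts = Counter(listens)

    # Largest listen count each user has for any single song.
    user_max = {}
    for (user, _song), n in counts.items():
        user_max[user] = max(user_max.get(user, 0), n)

    rows = {}
    for (user, song), n in counts.items():
        count_norm = n / user_max[user]  # label in (0, 1]
        rows[(user, song)] = {
            "count_norm": count_norm,
            "weight": math.sqrt(count_norm),
        }
    return rows


rows = build_rows(listens)
for pair in sorted(rows):
    print(pair, rows[pair])
```

Each user's most-played song gets a label of exactly 1.0, regardless of how many listens that user has in total.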

In [10]:
query = """
  WITH
    songs AS (
      SELECT CONCAT(track_name, " by ", artist_name) AS song
      FROM `listenbrainz.listenbrainz.listen`
      GROUP BY song
      ORDER BY COUNT(*) DESC
      LIMIT 10000
    ),
    user_songs AS (
      SELECT user_name AS user, ANY_VALUE(artist_name) AS artist,
        CONCAT(track_name, " by ", artist_name) AS song,
        COUNT(*) AS user_song_listens
      FROM `listenbrainz.listenbrainz.listen`
      JOIN songs ON songs.song = CONCAT(track_name, " by ", artist_name)
      WHERE track_name != ""
      GROUP BY user_name, song
    ),
    user_song_ranks AS (
      SELECT user, song, user_song_listens,
        ROW_NUMBER() OVER (PARTITION BY user ORDER BY user_song_listens DESC)
          AS rank
      FROM user_songs
    ),
    user_features AS (
      SELECT user, ARRAY_AGG(song) AS top_10,
        MAX(user_song_listens) AS user_max_listen
      FROM user_song_ranks
      WHERE rank <= 10
      GROUP BY user
    ),
    item_features AS (
      SELECT CONCAT(track_name, " by ", artist_name) AS song,
        SPLIT(ANY_VALUE(tags), ",") AS tags,
        COUNT(DISTINCT(release_name)) AS albums
      FROM `listenbrainz.listenbrainz.listen`
      WHERE track_name != ""
      GROUP BY song
    )
  SELECT user, song, artist, tags, albums, top_10,
    user_song_listens/user_max_listen AS count_norm,
    SQRT(user_song_listens/user_max_listen) AS weight
  FROM user_songs
  JOIN user_features USING(user)
  JOIN item_features USING(song)
"""

In [11]:
def create_table_from_query(query, table):
    """Creates a new table using the results from the given query.
    
    Args:
        query: a query string.
        table: a name to give the new table.
    """
    job_config = bigquery.QueryJobConfig()
    bq_dataset = bigquery.Dataset("{0}.{1}".format(PROJECT_ID, INPUT_BQ_DATASET))
    bq_dataset.location = "US"

    try:
        bq_dataset = bq_client.create_dataset(bq_dataset)
    except exceptions.Conflict:
        pass

    table_ref = bq_client.dataset(INPUT_BQ_DATASET).table(table)
    job_config.destination = table_ref

    query_job = bq_client.query(query,
                             location=bq_dataset.location,
                             job_config=job_config)

    query_job.result()
    print('Query results loaded to table {}'.format(table_ref.path))

In [12]:
create_table_from_query(query, INPUT_BQ_TABLE)

Query results loaded to table /projects/automl-music-recommendation/datasets/notebook_dataset/tables/training_data


### 2.2 Create AutoML Dataset

Create a Dataset by importing the BigQuery table that was just created. Importing data may take a few minutes or hours depending on the size of your data.

In [13]:
dataset_config = {
    "display_name": DATASET_DISPLAY_NAME,
    "tables_dataset_metadata": {},
}
dataset = aml_client.create_dataset(location_path, dataset_config)

dataset_bq_input_uri = 'bq://{0}.{1}.{2}'.format(PROJECT_ID, INPUT_BQ_DATASET, INPUT_BQ_TABLE)
input_config = {
    'bigquery_source': {
        'input_uri': dataset_bq_input_uri
    }
}
import_data_response = aml_client.import_data(dataset.name, input_config)
import_data_result = import_data_response.result()
import_data_result



Inspect the datatypes assigned to each column. In this case, the `song` and `artist` should be categorical, not textual.

In [14]:
list_table_specs_response = aml_client.list_table_specs(dataset.name)
table_specs = [s for s in list_table_specs_response]
table_spec_name = table_specs[0].name
list_column_specs_response = aml_client.list_column_specs(table_spec_name)
column_specs = {s.display_name: s for s in list_column_specs_response}

def print_column_specs(column_specs):
    """Returns a list of (column name, type name) pairs for the given specs."""
    data_types = automl_v1beta1.proto.data_types_pb2
    return [(x, data_types.TypeCode.Name(
        column_specs[x].data_type.type_code)) for x in column_specs.keys()]

print_column_specs(column_specs)

[('user', 'CATEGORY'),
 ('weight', 'FLOAT64'),
 ('albums', 'FLOAT64'),
 ('tags', 'ARRAY'),
 ('count_norm', 'FLOAT64'),
 ('song', 'STRING'),
 ('artist', 'STRING'),
 ('top_10', 'ARRAY')]

### 2.3 Update Dataset params

Sometimes the types that AutoML Tables automatically assigns to columns differ from what was intended. When that happens, we need to update the affected columns with the correct types.

In this case, set the `song` and `artist` column types to `CATEGORY`.

In [15]:
for col in ["song", "artist"]:
    update_column_spec_dict = {
        "name": column_specs[col].name,
        "data_type": {
            "type_code": "CATEGORY"
        }
    }
    aml_client.update_column_spec(update_column_spec_dict)

list_column_specs_response = aml_client.list_column_specs(table_spec_name)
column_specs = {s.display_name: s for s in list_column_specs_response}
print_column_specs(column_specs)

[('user', 'CATEGORY'),
 ('weight', 'FLOAT64'),
 ('albums', 'FLOAT64'),
 ('tags', 'ARRAY'),
 ('count_norm', 'FLOAT64'),
 ('song', 'CATEGORY'),
 ('artist', 'CATEGORY'),
 ('top_10', 'ARRAY')]

Not all columns are feature columns. In order to train a model, we need to tell Tables which column should be used as the target variable and, optionally, which column should be used as sample weights.

In [16]:
label_column_name = "count_norm"
label_column_spec = column_specs[label_column_name]
label_column_id = label_column_spec.name.rsplit('/', 1)[-1]

weight_column_name = "weight"
weight_column_spec = column_specs[weight_column_name]
weight_column_id = weight_column_spec.name.rsplit('/', 1)[-1]

update_dataset_dict = {
    'name': dataset.name,
    'tables_dataset_metadata': {
        'target_column_spec_id': label_column_id,
        'weight_column_spec_id': weight_column_id,
    }
}
update_dataset_response = aml_client.update_dataset(update_dataset_dict)
update_dataset_response

name: "projects/120451501752/locations/us-central1/datasets/TBL4001333919709396992"
display_name: "listenbrainz"
create_time {
  seconds: 1562105555
  nanos: 279259000
}
etag: "AB3BwFquIN-yc6mMK2rEEGAcNy-Ctwb-SplDYyvFHJm798sET7xJHC0FwK4zrg7LLLc="
example_count: 2467282
tables_dataset_metadata {
  primary_table_spec_id: "6310827307527307264"
  target_column_spec_id: "6309217622504243200"
  weight_column_spec_id: "7462139127111090176"
  stats_update_time {
    seconds: 1562105803
    nanos: 263000000
  }
}
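
The column spec IDs passed to `update_dataset` are taken from the last segment of each column spec's resource name. A small sketch of that parsing step, using a made-up resource name shaped like the ones in this notebook (`resource_id` is a hypothetical helper):

```python
def resource_id(resource_name):
    """Return the last path segment of a slash-delimited resource name."""
    return resource_name.rsplit("/", 1)[-1]


# Made-up resource name; real names come from the API responses.
example_name = (
    "projects/my-project/locations/us-central1/"
    "datasets/TBL123/tableSpecs/456/columnSpecs/789"
)
print(resource_id(example_name))
```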

## 3. Create a Model

Once the Dataset has been configured correctly, we can tell AutoML Tables to train a new model. The amount of resources spent to train this model can be adjusted using a parameter called `train_budget_milli_node_hours`. As the name implies, this puts a maximum budget on how many resources a training job can use up before exporting a servable model.

In [17]:
feat_list = list(column_specs.keys())
feat_list.remove(label_column_name)
feat_list.remove(weight_column_name)

model_dict = {
    'display_name': MODEL_DISPLAY_NAME,
    'dataset_id': dataset.name.rsplit('/', 1)[-1],
    'tables_model_metadata': {
        'train_budget_milli_node_hours': MODEL_TRAIN_HOURS * 1000,
        'target_column_spec': column_specs[label_column_name],
        'input_feature_column_specs': [column_specs[x] for x in feat_list]
    },
}
    
create_model_response = aml_client.create_model(location_path, model_dict)
create_model_result = create_model_response.result()
create_model_result

name: "projects/120451501752/locations/us-central1/models/TBL5087774553254395904"
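
The budget passed in the cell above is `MODEL_TRAIN_HOURS * 1000`, since `train_budget_milli_node_hours` is expressed in thousandths of a node hour. A tiny sketch of the conversion (the helper name is hypothetical):

```python
def hours_to_milli_node_hours(hours):
    """Convert a training budget in node hours to milli node hours."""
    return int(hours * 1000)


print(hours_to_milli_node_hours(1))
print(hours_to_milli_node_hours(2.5))
```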

## 4. Model Evaluation

Because we are optimizing a surrogate problem (predicting the similarity between `(user, song)` pairs) in order to achieve our final objective of producing a list of recommended songs for a user, it's difficult to tell how well the model performs by looking only at the final loss function. Instead, an evaluation metric we can use for our model is `recall@n` for the top `m` most-listened-to songs for each user. This metric gives the fraction of a user's top `m` most-listened-to songs that appear in the top `n` recommendations we make.

In order to get the top recommendations for each user, we need to create a batch job to predict similarity scores between each user and item pair. These similarity scores would then be sorted per user to produce an ordered list of recommended songs.
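
As a rough illustration of the metric, `recall@n` over a user's top `m` songs can be computed in plain Python once predicted and true similarity scores are available. The scores below are made up, and `recall_at_n` is a hypothetical helper; like the SQL used later in this notebook, it divides by `m` even if the user has listened to fewer than `m` songs.

```python
def recall_at_n(predicted_scores, true_scores, n, m):
    """Fraction of the user's top-m songs found among the top-n predictions.

    predicted_scores: dict mapping song -> predicted similarity.
    true_scores: dict mapping song -> normalized listen count, containing
        only songs the user has actually listened to.
    """
    top_n = set(
        sorted(predicted_scores, key=predicted_scores.get, reverse=True)[:n])
    top_m = sorted(true_scores, key=true_scores.get, reverse=True)[:m]
    hits = sum(1 for song in top_m if song in top_n)
    return hits / m


# Made-up scores for a single user.
predicted = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7, "e": 0.2}
truth = {"a": 1.0, "c": 0.5, "d": 0.4}

print(recall_at_n(predicted, truth, n=3, m=2))
```

Here the top 3 predictions are `a`, `b`, and `d`, while the user's top 2 songs are `a` and `c`, so one of the two is recovered.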

### 4.1 Create an evaluation table

Instead of creating a lookup table for all users, let's focus on the performance for a single user in this tutorial. We will make recommendations specifically for the user `rob`. We start by creating a prediction dataset to feed into the trained model; this is a table of every possible `(user, song)` pair containing `rob`, along with the corresponding features.

In [18]:
user = "rob"
training_table = "{}.{}.{}".format(PROJECT_ID, INPUT_BQ_DATASET, INPUT_BQ_TABLE)
query = """
    WITH
      pairs AS (
        SELECT "{0}" AS user, song, ANY_VALUE(artist) as artist,
          ANY_VALUE(tags) as tags, ANY_VALUE(albums) as albums
        FROM `{1}`
        GROUP BY song
      ),
      user_features AS (
        SELECT user, ANY_VALUE(top_10) as top_10
        FROM `{1}`
        GROUP BY user
      )
    SELECT * FROM pairs
    JOIN user_features USING(user)
""".format(user, training_table)

In [19]:
eval_table = "{}_example".format(INPUT_BQ_TABLE)
create_table_from_query(query, eval_table)

Query results loaded to table /projects/automl-music-recommendation/datasets/notebook_dataset/tables/training_data_example
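
The shape of the evaluation table can be sketched in plain Python: one row per known song, each carrying the fixed user's features together with that song's features. The feature values and the `prediction_rows` helper are made up for illustration; the notebook itself builds this table with the SQL query above.

```python
# Toy per-song and per-user features (made-up values).
song_features = {
    "song_a": {"artist": "artist_1", "albums": 2},
    "song_b": {"artist": "artist_2", "albums": 1},
}
user_top_10 = {"rob": ["song_a"], "amy": ["song_b"]}


def prediction_rows(user, song_features, user_top_10):
    """Pair one user with every known song, attaching both feature sets."""
    rows = []
    for song in sorted(song_features):
        row = {"user": user, "song": song, "top_10": user_top_10[user]}
        row.update(song_features[song])
        rows.append(row)
    return rows


rows = prediction_rows("rob", song_features, user_top_10)
for row in rows:
    print(row)
```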


### 4.2 Make predictions

Once the prediction table is created, start a batch prediction job. This may take a few minutes.

In [20]:
# create_model_result.name is already the model's full resource path.
model_name = create_model_result.name
preds_bq_input_uri = "bq://{}.{}.{}".format(PROJECT_ID, INPUT_BQ_DATASET, eval_table)
preds_bq_output_uri = "bq://{}".format(PROJECT_ID)

input_config = {"bigquery_source": {"input_uri": preds_bq_input_uri}}
output_config = {"bigquery_destination": {"output_uri": preds_bq_output_uri}}
response = prediction_client.batch_predict(model_name, input_config, output_config)
response.result()
output_uri = response.metadata.batch_predict_details.output_info.bigquery_output_dataset

With the similarity predictions for `rob`, we can order by the predictions to get a ranked list of songs to recommend to `rob`.

In [21]:
n = 10
query = """
    SELECT user, song, tables.value as pred, count_norm as label
    FROM `{}.predictions` a, UNNEST(predicted_count_norm)
    LEFT JOIN `{}` USING(user, song)
    WHERE user = "{}"
    ORDER BY pred DESC
    LIMIT {}
""".format(output_uri[5:].replace(":", "."), training_table, user, n)
query_job = bq_client.query(query)

print("Top {} songs recommended for {}:".format(n, user))
for idx, row in enumerate(query_job):
    print("{}.".format(idx + 1), row["song"])

Top 10 songs recommended for rob:
1. Battle Against Time by Wintersun
2. Shelter by Porter Robinson
3. Cirice by Ghost
4. God's Plan by Drake
5. Schism by Tool
6. Intro by alt-J
7. Vicarious by Tool
8. The Grudge by Tool
9. IV. Sweatpants by Childish Gambino
10. Down With the Sun by Insomnium
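
The expression `output_uri[5:].replace(":", ".")` in the query above turns the batch prediction output URI into a dataset reference usable in BigQuery SQL. A sketch of the same conversion with an explicit prefix check follows; the helper name and the example URI are hypothetical, and the `bq://project:dataset` shape is assumed from how the notebook slices the string.

```python
def bq_uri_to_dataset_id(uri):
    """Convert 'bq://project:dataset' into 'project.dataset'."""
    prefix = "bq://"
    if not uri.startswith(prefix):
        raise ValueError("Expected a URI starting with {!r}: {!r}".format(prefix, uri))
    return uri[len(prefix):].replace(":", ".")


# Made-up URI for illustration.
print(bq_uri_to_dataset_id("bq://my-project:prediction_dataset"))
```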


### 4.3 Evaluate predictions

In order to calculate `recall@n`, we need to join in the ground truth similarities for all songs `rob` has already listened to. With this additional data, we can find the top `n` songs that would be recommended and see how many of `rob`'s top `m` songs appear in that list.

In [24]:
user_top_n = 100
recall_n = 1000
query = """
    WITH
      top_k AS (
        SELECT user, song, count_norm,
          ROW_NUMBER() OVER (PARTITION BY user ORDER BY count_norm DESC) as user_rank
        FROM `{0}`
      ),
      preds AS (
        SELECT user, song, tables.value as pred, count_norm as label,
          ROW_NUMBER() OVER (ORDER BY tables.value DESC) as rank, user_rank
        FROM `{1}.predictions` a, UNNEST(predicted_count_norm)
        LEFT JOIN top_k USING(user, song)
        ORDER BY pred DESC
      )
    SELECT COUNT(label)/{2} as recall_{3}_top_{2}
    FROM preds
    WHERE rank <= {3} AND user_rank <= {2}
""".format(training_table, output_uri[5:].replace(":", "."), user_top_n, recall_n)
query_job = bq_client.query(query)

for row in query_job:
    print("Recall of user's top {} songs in recommended top {}: {}".format(
        user_top_n, recall_n, row["recall_{}_top_{}".format(recall_n, user_top_n)]))

Recall of user's top 100 songs in recommended top 1000: 0.3


## 5. Cleanup

Uncomment the following cells to clean up the BigQuery tables and AutoML Tables datasets that were created with this notebook, to avoid additional storage charges.

### 5.1 Delete the Dataset

In [25]:
# aml_client.delete_dataset(dataset.name)

<google.api_core.operation.Operation at 0x7f2ab5ce2c88>

### 5.2 Delete BigQuery datasets

In order to delete BigQuery tables, make sure the service account linked to this notebook has a role with the `bigquery.tables.delete` permission, such as `BigQuery Data Owner`. The following command displays the current service account.

IAM permissions can be adjusted [here](https://console.cloud.google.com/iam-admin/iam).

In [28]:
!gcloud config list account --format "value(core.account)"

120451501752-compute@developer.gserviceaccount.com


Clean up the BigQuery tables created by this notebook.

In [29]:
# # Delete the prediction dataset.
# dataset_id = str(output_uri[5:].replace(":", "."))
# bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)

# # Delete the training dataset.
# dataset_id = "{0}.{1}".format(PROJECT_ID, INPUT_BQ_DATASET)
# bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)