# Collaborative filtering on Google Analytics data

This notebook demonstrates how to implement collaborative filtering using WALS matrix factorization.

In [83]:
import os
PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION

In [84]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


## Create dataset
<p>
For collaborative filtering, we don't need to know anything about either the users or the content. Essentially, all we need to know is userId, itemId, and rating that the particular user gave the particular item.
<p>
In this case, we are working with newspaper articles. The company doesn't ask its users to rate the articles. However, we can use the time spent on the page as a proxy for rating.
<p>
Normally, we would also add a time filter to this ("latest 7 days"), but our dataset is itself limited to a few days.
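
To make the proxy concrete, here is a minimal pure-Python sketch of what the query in the next cell appears to compute: for each visitor, the time on a page is the gap between a page hit and that visitor's next page hit, and the gaps are summed per (visitor, content) pair. The function name `time_on_page` and the toy hits are hypothetical and only illustrate the logic; BigQuery does the real work.

```python
from collections import defaultdict

def time_on_page(hits):
    """hits: iterable of (visitor_id, content_id, hit_time_ms) page hits.

    Returns {(visitor_id, content_id): total_ms}, mimicking
    LEAD(hits.time) - hits.time, then SUM(...) with HAVING > 0.
    """
    by_visitor = defaultdict(list)
    for visitor_id, content_id, hit_time in hits:
        by_visitor[visitor_id].append((hit_time, content_id))

    totals = defaultdict(int)
    for visitor_id, visitor_hits in by_visitor.items():
        # Order by hit time, as the window's ORDER BY does.
        visitor_hits.sort(key=lambda hit: hit[0])
        # Pair each hit with the next one. The last hit has no successor,
        # so (like a NULL from LEAD) it contributes nothing.
        for (t, content_id), (t_next, _) in zip(visitor_hits, visitor_hits[1:]):
            if content_id is not None:
                totals[(visitor_id, content_id)] += t_next - t
    # Keep only strictly positive totals, as the HAVING clause does.
    return {key: ms for key, ms in totals.items() if ms > 0}

# Hypothetical toy data: visitor v1 reads a, b, a, c; visitor v2 has one hit.
toy_hits = [
    ('v1', 'a', 0), ('v1', 'b', 30000), ('v1', 'a', 45000), ('v1', 'c', 50000),
    ('v2', 'a', 0),
]
durations = time_on_page(toy_hits)
print(durations)
```

Note that a visitor's final page hit never receives a duration, so a visitor with a single hit produces no rows at all.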

In [85]:
import google.datalab.bigquery as bq

sql="""
#standardSQL
WITH visitor_page_content AS (

   SELECT  
     fullVisitorID,
     (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS latestContentId,  
     (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) - hits.time) AS session_duration 
   FROM `GA360_test.ga_sessions_sample`,   
     UNNEST(hits) AS hits
   WHERE 
     # only include hits on pages
      hits.type = "PAGE"

   GROUP BY   
     fullVisitorId, latestContentId, hits.time
     )

# aggregate web stats
SELECT   
  fullVisitorID,
  latestContentId,
  SUM(session_duration) AS session_duration 
 
FROM visitor_page_content
  WHERE latestContentId IS NOT NULL 
  GROUP BY fullVisitorID, latestContentId
  HAVING session_duration > 0
  ORDER BY latestContentId 
"""

df = bq.Query(sql).execute().result().to_dataframe()
df.head()

| | fullVisitorID | latestContentId | session_duration |
|---|---|---|---|
| 0 | 7337153711992174438 | 100074831 | 44652 |
| 1 | 5190801220865459604 | 100170790 | 1214205 |
| 2 | 5874973374932455844 | 100510126 | 32109 |
| 3 | 2293633612703952721 | 100510126 | 47744 |
| 4 | 1173698801255170595 | 100676857 | 10512 |


In [86]:
stats = df.describe()
stats

| | session_duration |
|---|---|
| count | 278913.0 |
| mean | 127218.8 |
| std | 234643.9 |
| min | 1.0 |
| 25% | 17095.0 |
| 50% | 57938.0 |
| 75% | 129393.0 |
| max | 7690598.0 |


In [87]:
# The rating is the session_duration rescaled so that the median maps to 0.3 and then capped at 1,
# which keeps every rating in the range 0-1. This will help with training.
df['rating'] = 0.3 * (1 + (df['session_duration'] - stats.loc['50%', 'session_duration'])/stats.loc['50%', 'session_duration'])
df.loc[df['rating'] > 1, 'rating'] = 1
df.describe()

| | session_duration | rating |
|---|---|---|
| count | 278913.0 | 278913.0 |
| mean | 127218.8 | 0.402427 |
| std | 234643.9 | 0.349947 |
| min | 1.0 | 5e-06 |
| 25% | 17095.0 | 0.088517 |
| 50% | 57938.0 | 0.3 |
| 75% | 129393.0 | 0.66999 |
| max | 7690598.0 | 1.0 |
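
The scaling expression is easier to read once simplified: `0.3 * (1 + (x - m) / m)` is algebraically the same as `0.3 * x / m`, where `m` is the median duration. The sketch below applies that simplified form to a few durations from the tables above. The helper name `to_rating` is hypothetical; the median value is the 50% row of `df.describe()`.

```python
def to_rating(session_duration, median_duration):
    """Map a duration to a rating: the median maps to 0.3, capped at 1."""
    # 0.3 * (1 + (x - m) / m) simplifies to 0.3 * x / m.
    return min(1.0, 0.3 * session_duration / median_duration)

MEDIAN = 57938.0  # 50% row of the describe() output above

for duration in (44652, 57938, 1214205):
    print(duration, round(to_rating(duration, MEDIAN), 6))
```

Under this formula, any duration longer than roughly 3.33 times the median (1 / 0.3) saturates at a rating of 1.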


In [88]:
del df['session_duration']

## Enumerate mapping
<p>
WALS requires the userId and itemId to be consecutive integers starting at 0 (0, 1, 2, ...), so we create such a mapping. We save each mapping to a file because, at prediction time, we will need to translate between the contentId in the table above and the itemId.

In [89]:
%bash
rm -rf data
mkdir data

In [121]:
def create_mapping(values, filename):
  with open(filename, 'w') as ofp:
    value_to_id = {value:idx for idx, value in enumerate(values.unique())}
    for value, idx in value_to_id.items():
      ofp.write('{},{}\n'.format(value, idx))
  return value_to_id

df.to_csv('data/collab_raw.csv', index=False, header=False)
user_mapping = create_mapping(df['fullVisitorID'], 'data/users.csv')
item_mapping = create_mapping(df['latestContentId'], 'data/items.csv')
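
At prediction time the mapping has to be read back in the opposite direction (from integer id to original value). The following is a minimal sketch of such a loader, assuming the `value,idx` line format written by `create_mapping`. The name `load_id_to_value` is hypothetical, and the toy file is written to a temporary directory rather than `data/` so that nothing is overwritten.

```python
import csv
import os
import tempfile

def load_id_to_value(filename):
    """Read a 'value,idx' file and return {idx: value}, with idx as int."""
    id_to_value = {}
    with open(filename) as ifp:
        for value, idx in csv.reader(ifp):
            id_to_value[int(idx)] = value  # values come back as strings
    return id_to_value

# Toy round trip using three content ids from the table above,
# written in the same 'value,idx' format that create_mapping uses.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'items.csv')
with open(path, 'w') as ofp:
    for idx, value in enumerate(['100074831', '100170790', '100510126']):
        ofp.write('{},{}\n'.format(value, idx))

item_lookup = load_id_to_value(path)
print(item_lookup[2])
```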

In [122]:
df['userId'] = df['fullVisitorID'].map(user_mapping.get)
df['itemId'] = df['latestContentId'].map(item_mapping.get)

In [123]:
outdf = df[['userId', 'itemId', 'rating']]
outdf.to_csv('data/collab_mapped.csv', index=False, header=False)
outdf.head()

| | userId | itemId | rating |
|---|---|---|---|
| 0 | 0 | 0 | 0.231206 |
| 1 | 1 | 1 | 1.0 |
| 2 | 2 | 2 | 0.166259 |
| 3 | 3 | 2 | 0.247216 |
| 4 | 4 | 3 | 0.054431 |


In [124]:
print('{} items, {} users, {} interactions'.format(len(item_mapping), len(user_mapping), len(outdf)))

5668 items, 82802 users, 278913 interactions


## Train with WALS

Once you have the dataset, do matrix factorization with WALS using the [WALSMatrixFactorization](https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/factorization/WALSMatrixFactorization) in the contrib directory.
This is an estimator model, so it should be relatively familiar.
<p>
As usual, we write an input_fn to provide the data to the model, and then create the Estimator to do train_and_evaluate.
Because it is in contrib and hasn't moved over to tf.estimator yet, we use tf.contrib.learn.Experiment to handle the training loop.
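
For intuition about what the estimator is fitting, here is a deliberately tiny sketch of alternating least squares, the idea behind WALS (weighted alternating least squares). It is not the TensorFlow implementation. It assumes an embedding dimension of 1, a weight of 1 on observed ratings and 0 on unobserved ones, and L2 regularization; the function name `als_rank1` and the toy ratings are hypothetical.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, n_iters=20):
    """Rank-1 ALS over observed (user, item, rating) triples.

    Returns (u, v) such that u[user] * v[item] approximates the rating.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(n_iters):
        # Fix the item factors; each user factor has a closed-form solution.
        num = [0.0] * n_users
        den = [reg] * n_users
        for user, item, r in ratings:
            num[user] += r * v[item]
            den[user] += v[item] ** 2
        u = [n / d for n, d in zip(num, den)]

        # Fix the user factors; solve each item factor the same way.
        num = [0.0] * n_items
        den = [reg] * n_items
        for user, item, r in ratings:
            num[item] += r * u[user]
            den[item] += u[user] ** 2
        v = [n / d for n, d in zip(num, den)]
    return u, v

# Toy ratings generated from user factors [1, 2, 0.5] and item factors
# [0.2, 0.4]; the (user 2, item 1) entry is left unobserved on purpose.
toy = [(0, 0, 0.2), (0, 1, 0.4), (1, 0, 0.4), (1, 1, 0.8), (2, 0, 0.1)]
u, v = als_rank1(toy, n_users=3, n_items=2, reg=1e-6, n_iters=50)
print('predicted rating for user 2, item 1:', u[2] * v[1])
```

Each half-step is an ordinary least-squares problem because one side of the factorization is held fixed, which is what makes the alternation cheap. The prediction for the held-out entry comes purely from the learned factors.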

In [40]:
import tensorflow as tf
from tensorflow.contrib.factorization import WALSMatrixFactorization
CSV_COLUMNS = 'userId,itemId,rating'.split(',')
DEFAULTS = [[0L], [0L], [0.0]]

def read_dataset(filename, mode, args):
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # indefinitely
    else:
        num_epochs = 1 # end-of-input after this

    # the actual input function passed to TensorFlow
    def _input_fn():
        # could be a path to one file or a file pattern.
        input_file_names = tf.train.match_filenames_once(filename)
        filename_queue = tf.train.string_input_producer(
            input_file_names, shuffle=True, num_epochs=num_epochs)

        # read CSV
        reader = tf.TextLineReader()
        _, value = reader.read_up_to(filename_queue, num_records=args['batch_size'])
        #value_column = tf.expand_dims(value, -1)
        columns = tf.decode_csv(value, record_defaults=DEFAULTS)
        columns = dict(zip(CSV_COLUMNS, columns))

        # in the format required by WALS
        columns['userId'] = tf.cast(columns['userId'], dtype=tf.int64)
        columns['itemId'] = tf.cast(columns['itemId'], dtype=tf.int64)
        input_rows = tf.stack( [columns['userId'], columns['itemId']], axis=1 )
        input_cols = tf.stack( [columns['itemId'], columns['userId']], axis=1 )
        features = {
                     WALSMatrixFactorization.INPUT_ROWS:
                         tf.SparseTensor(input_rows,
                                         columns['rating'],
                                         (args['n_users'], args['n_items'])),
                     WALSMatrixFactorization.INPUT_COLS:
                         tf.SparseTensor(input_cols,
                                         columns['rating'],
                                         (args['n_items'], args['n_users'])),
                     WALSMatrixFactorization.PROJECT_ROW: tf.constant(True)
                   }
        return features, None

    return _input_fn
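
The two sparse tensors built in `_input_fn` hold the same ratings indexed two ways: `INPUT_ROWS` uses (userId, itemId) pairs with shape (n_users, n_items), and `INPUT_COLS` uses (itemId, userId) pairs with the transposed shape. The sketch below illustrates that layout with plain dictionaries standing in for `tf.SparseTensor`. The name `wals_index_pairs` is hypothetical; the sample rows are the first five rows of the mapped output above.

```python
def wals_index_pairs(user_ids, item_ids, ratings):
    """Return ({(user, item): rating}, {(item, user): rating})."""
    input_rows = {}
    input_cols = {}
    for user, item, rating in zip(user_ids, item_ids, ratings):
        input_rows[(user, item)] = rating  # entry of the n_users x n_items matrix
        input_cols[(item, user)] = rating  # same entry in the transposed matrix
    return input_rows, input_cols

user_ids = [0, 1, 2, 3, 4]
item_ids = [0, 1, 2, 2, 3]
ratings = [0.231206, 1.0, 0.166259, 0.247216, 0.054431]

rows, cols = wals_index_pairs(user_ids, item_ids, ratings)
print(rows[(3, 2)], cols[(2, 3)])
```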

In [None]:
def serving_input_fn():
    feature_ph = {
        'userId': tf.placeholder(tf.int64, [None])
    }
    features = {
        WALSMatrixFactorization.INPUT_ROWS: feature_ph['userId'],
        WALSMatrixFactorization.PROJECT_ROW: tf.constant(True)  # get items for userId
    }
    return tf.estimator.export.ServingInputReceiver(features, feature_ph)

from tensorflow.contrib.learn.python.learn.utils import saved_model_export_utils
def train_and_evaluate(args):
    train_steps = int(0.5 + (1.0 * args['num_epochs'] * args['n_interactions']) / args['batch_size'])
    print('Will train for {} steps'.format(train_steps))
    def experiment_fn(output_dir):
        return tf.contrib.learn.Experiment(
            tf.contrib.factorization.WALSMatrixFactorization(
                num_rows=args['n_users'],
                num_cols=args['n_items'],
                embedding_dimension=args['n_embeds'],
                model_dir=args['output_dir']),
            train_input_fn=read_dataset(args['train_path'], tf.estimator.ModeKeys.TRAIN, args),
            eval_input_fn=read_dataset(args['train_path'], tf.estimator.ModeKeys.EVAL, args),
            train_steps=train_steps,
            eval_steps=None
        )

    from tensorflow.contrib.learn.python.learn import learn_runner
    learn_runner.run(experiment_fn, args['output_dir'])
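
The `train_steps` expression converts a (possibly fractional) number of epochs into a whole number of batches, rounding to the nearest step. As a worked example, the sketch below evaluates it for the interaction count printed earlier. The batch size of 512 is an assumption made only for illustration, since the notebook does not show the value used, and `compute_train_steps` is a hypothetical helper name.

```python
def compute_train_steps(num_epochs, n_interactions, batch_size):
    """Same arithmetic as in train_and_evaluate: round epochs to whole steps."""
    return int(0.5 + (1.0 * num_epochs * n_interactions) / batch_size)

BATCH_SIZE = 512  # assumed for illustration; not taken from the notebook

print(compute_train_steps(0.01, 278913, BATCH_SIZE))  # a short smoke test: 5 steps
print(compute_train_steps(10, 278913, BATCH_SIZE))    # ten full epochs: 5448 steps
```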

## Run as a Python module

Let's run it as a Python module for just a few steps.

In [None]:
%bash
rm -rf wals.tar.gz wals_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/wals
python -m trainer.task \
   --output_dir=${PWD}/wals_trained \
   --train_path=${PWD}/data/collab_mapped.csv \
   --num_epochs=0.01 --n_items=5668 --n_users=82802 --n_interactions=278913 \
   --job-dir=./tmp
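
The `trainer/task.py` module invoked above is not shown in this notebook. As a rough guide only, a module accepting these flags might parse them along the following lines. This is a sketch, not the author's code: the flag names come from the command above, but the defaults for `--batch_size` and `--n_embeds` (keys that `train_and_evaluate` reads from `args`) are assumptions, as is the choice to discard `--job-dir`.

```python
import argparse

def parse_args(argv=None):
    """Parse the flags used in the command above into a plain dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_dir', required=True)
    parser.add_argument('--train_path', required=True)
    parser.add_argument('--num_epochs', type=float, default=1.0)
    parser.add_argument('--n_items', type=int, required=True)
    parser.add_argument('--n_users', type=int, required=True)
    parser.add_argument('--n_interactions', type=int, required=True)
    parser.add_argument('--batch_size', type=int, default=512)  # assumed default
    parser.add_argument('--n_embeds', type=int, default=10)     # assumed default
    parser.add_argument('--job-dir', default=None)  # accepted, then discarded
    args = vars(parser.parse_args(argv))
    args.pop('job_dir', None)  # argparse stores --job-dir under 'job_dir'
    return args

example_args = parse_args([
    '--output_dir=wals_trained',
    '--train_path=data/collab_mapped.csv',
    '--num_epochs=0.01', '--n_items=5668', '--n_users=82802',
    '--n_interactions=278913', '--job-dir=./tmp',
])
print(example_args)
```

The resulting dict has the keys that `train_and_evaluate(args)` looks up, so a real `task.py` could pass it straight through.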

## Run on Cloud

In [None]:
%bash
gsutil cp data/collab_mapped.csv gs://${BUCKET}/wals/data/collab_mapped.csv

In [None]:
%bash
OUTDIR=gs://${BUCKET}/wals/model_trained
JOBNAME=wals_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/wals/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --runtime-version=1.4 \
   -- \
   --output_dir=$OUTDIR \
   --train_path=gs://${BUCKET}/wals/data/collab_mapped.csv \
   --num_epochs=10 --n_items=5668 --n_users=82802 --n_interactions=278913 

This took <b>10 minutes</b> and finished with a loss of 23418.4. (FIXME: what does the loss represent?)

## Deploy and predict

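This part is a work in progress: the local prediction attempt below currently fails because `gcloud ml-engine local predict` rejects the `--runtime_version` flag.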

In [136]:
%writefile data/input.json
{"userId": [4]}

Writing data/input.json


In [141]:
%bash
gcloud ml-engine local predict --model-dir=wals_trained --runtime_version=1.4 --json-instances=data/input.json 

ERROR: (gcloud.ml-engine.local.predict) unrecognized arguments: --runtime_version=1.4
Usage: gcloud ml-engine local predict --model-dir=MODEL_DIR (--json-instances=JSON_INSTANCES | --text-instances=TEXT_INSTANCES) [optional flags]
  optional flags may be  --help | --json-instances | --text-instances

For detailed information on this command and its flags, run:
  gcloud ml-engine local predict --help


<pre>
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
</pre>