# Collaborative filtering on Google Analytics data

This notebook demonstrates how to implement a WALS matrix factorization approach to collaborative filtering.

In [1]:
import os
PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION

In [2]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


## Create dataset
<p>
For collaborative filtering, we don't need to know anything about either the users or the content. Essentially, all we need to know is the userId, the itemId, and the rating that the particular user gave the particular item.
<p>
In this case, we are working with newspaper articles. The company doesn't ask its users to rate the articles. However, we can use the time spent on the page as a proxy for rating.
<p>
Normally, we would also add a time filter to this ("latest 7 days"), but our dataset is itself limited to a few days.

In [3]:
import google.datalab.bigquery as bq

sql="""
#standardSQL
WITH visitor_page_content AS (

   SELECT  
     fullVisitorID,
     (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS latestContentId,  
     (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) - hits.time) AS session_duration 
   FROM `GA360_test.ga_sessions_sample`,   
     UNNEST(hits) AS hits
   WHERE 
     # only include hits on pages
      hits.type = "PAGE"

   GROUP BY   
     fullVisitorId, latestContentId, hits.time
     )

# aggregate web stats
SELECT   
  fullVisitorID,
  latestContentId,
  SUM(session_duration) AS session_duration 
 
FROM visitor_page_content
  WHERE latestContentId IS NOT NULL 
  GROUP BY fullVisitorID, latestContentId
  HAVING session_duration > 0
  ORDER BY latestContentId 
"""

df = bq.Query(sql).execute().result().to_dataframe()
df.head()

| | fullVisitorID | latestContentId | session_duration |
|---|---|---|---|
| 0 | 7337153711992174438 | 100074831 | 44652 |
| 1 | 5190801220865459604 | 100170790 | 1214205 |
| 2 | 5874973374932455844 | 100510126 | 32109 |
| 3 | 2293633612703952721 | 100510126 | 47744 |
| 4 | 1173698801255170595 | 100676857 | 10512 |


In [4]:
stats = df.describe()
stats

| | session_duration |
|---|---|
| count | 278913.0 |
| mean | 127218.8 |
| std | 234643.9 |
| min | 1.0 |
| 25% | 17095.0 |
| 50% | 57938.0 |
| 75% | 129393.0 |
| max | 7690598.0 |


In [5]:
# The rating is the session_duration scaled so that the median maps to 0.3, then capped at 1.  This will help with training.
df['rating'] = 0.3 * (1 + (df['session_duration'] - stats.loc['50%', 'session_duration'])/stats.loc['50%', 'session_duration'])
df.loc[df['rating'] > 1, 'rating'] = 1
df.describe()

| | session_duration | rating |
|---|---|---|
| count | 278913.0 | 278913.0 |
| mean | 127218.8 | 0.402427 |
| std | 234643.9 | 0.349947 |
| min | 1.0 | 5e-06 |
| 25% | 17095.0 | 0.088517 |
| 50% | 57938.0 | 0.3 |
| 75% | 129393.0 | 0.66999 |
| max | 7690598.0 | 1.0 |


In [6]:
del df['session_duration']
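
The scaling used for the rating simplifies algebraically: `0.3 * (1 + (d - m) / m)` equals `0.3 * d / m`, where `d` is a session duration and `m` is the median duration. The median therefore maps to 0.3, and any duration above roughly 3.3 times the median is capped at 1. A minimal sketch of the same arithmetic on made-up durations (the helper name `scale_rating` and the sample values are illustrative assumptions, not part of the notebook):

```python
from statistics import median

def scale_rating(duration, median_duration, median_rating=0.3):
    """Scale a duration so the median maps to median_rating, capped at 1."""
    return min(1.0, median_rating * duration / median_duration)

# Hypothetical session durations (not taken from the dataset).
durations = [1000.0, 20000.0, 60000.0, 130000.0, 7000000.0]
m = median(durations)  # 60000.0 for this sample
ratings = [scale_rating(d, m) for d in durations]
print(ratings)
```

The median duration lands at a rating of about 0.3, and the very long final session is clipped to 1.0.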

## Enumerate mapping
<p>
For WALS, the userId and itemId have to be consecutive integers (0, 1, 2, ...), so we create such a mapping.  We save the mapping to a file because, at prediction time, we'll need to know how to map the contentId in the table above to the itemId.

In [7]:
%bash
rm -rf data
mkdir data

In [8]:
def create_mapping(values, filename):
  with open(filename, 'w') as ofp:
    id_for_value = dict((value, idx) for idx, value in enumerate(values))
    for idx, value in enumerate(values):
      ofp.write('{},{}\n'.format(idx, value))
  return id_for_value

df.to_csv('data/collab_raw.csv', index=False, header=False)
user_mapping = create_mapping(df['fullVisitorID'], 'data/users.csv')
item_mapping = create_mapping(df['latestContentId'], 'data/items.csv')
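
Note that `create_mapping` enumerates every row, so a value that occurs in several rows ends up with the index of its last occurrence, and the resulting ids can have gaps (the output further down shows a userId of 278820 although there are only 82802 users). If consecutive ids are needed, one option is to number the distinct values instead. A minimal sketch, where `create_dense_mapping` and the sample ids are hypothetical and not part of the notebook:

```python
def create_dense_mapping(values):
    """Map each distinct value to a consecutive id in first-seen order."""
    id_for_value = {}
    for value in values:
        if value not in id_for_value:
            # The next free id is the number of distinct values seen so far.
            id_for_value[value] = len(id_for_value)
    return id_for_value

# Hypothetical visitor ids with repeats (not taken from the dataset).
visitors = ['a', 'b', 'a', 'c', 'b']
mapping = create_dense_mapping(visitors)
print([mapping[v] for v in visitors])  # prints [0, 1, 0, 2, 1]
```

With pandas, passing `df['fullVisitorID'].unique()` instead of the full column to an enumerating function would give the same kind of gap-free numbering.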

In [11]:
df['userId'] = df['fullVisitorID'].map(user_mapping.get)
df['itemId'] = df['latestContentId'].map(item_mapping.get)

In [15]:
outdf = df[['userId', 'itemId', 'rating']]
outdf.to_csv('data/collab_mapped.csv', index=False, header=False)
outdf.head()

| | userId | itemId | rating |
|---|---|---|---|
| 0 | 14743 | 0 | 0.231206 |
| 1 | 278820 | 1 | 1.0 |
| 2 | 7912 | 3 | 0.166259 |
| 3 | 3 | 3 | 0.247216 |
| 4 | 4 | 4 | 0.054431 |


In [17]:
print('{} items, {} users, {} interactions'.format(len(item_mapping), len(user_mapping), len(outdf)))

5668 items, 82802 users, 278913 interactions
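
These counts imply a very sparse user-item matrix, which is the setting that matrix factorization is meant for. A quick check of the arithmetic, using only the numbers printed above:

```python
n_items, n_users, n_interactions = 5668, 82802, 278913

# Fraction of all possible (user, item) pairs that have an observed rating.
density = n_interactions / float(n_items * n_users)
print('{:.4%} of the user-item matrix is observed'.format(density))
# prints "0.0594% of the user-item matrix is observed"
```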


## Train with WALS

Once you have the dataset, do matrix factorization with WALS.
<p>
See [WALSMatrixFactorization](https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/factorization/WALSMatrixFactorization) in the TensorFlow API documentation.
It is an Estimator, so the training code should look relatively familiar.
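
To give a feel for what weighted alternating least squares does, here is a minimal NumPy sketch on a tiny dense matrix. It is not the `tf.contrib.factorization` implementation: the function names (`wals`, `weighted_loss`), the toy ratings, and the hyperparameters are all assumptions made for illustration. Unobserved entries get weight 0, and each half-step solves a small regularized least-squares problem per row while the other factor is held fixed.

```python
import numpy as np

def weighted_loss(ratings, weights, U, V, reg):
    """Weighted squared error plus L2 regularization on both factors."""
    err = weights * (ratings - U.dot(V.T)) ** 2
    return err.sum() + reg * ((U ** 2).sum() + (V ** 2).sum())

def wals(ratings, weights, rank=2, reg=0.1, n_iters=20, seed=0):
    """Factor ratings ~= U.dot(V.T) by alternating weighted least squares."""
    rng = np.random.RandomState(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Hold V fixed: each user row has a closed-form weighted ridge solution.
        for u in range(n_users):
            Wu = np.diag(weights[u])
            U[u] = np.linalg.solve(V.T.dot(Wu).dot(V) + ridge,
                                   V.T.dot(Wu).dot(ratings[u]))
        # Hold U fixed: solve for each item row in the same way.
        for i in range(n_items):
            Wi = np.diag(weights[:, i])
            V[i] = np.linalg.solve(U.T.dot(Wi).dot(U) + ridge,
                                   U.T.dot(Wi).dot(ratings[:, i]))
    return U, V

# Toy ratings (4 users x 3 items); 0 marks an unobserved pair.
R = np.array([[1.0, 0.3, 0.0],
              [0.9, 0.0, 0.2],
              [0.0, 0.3, 0.1],
              [1.0, 0.4, 0.0]])
observed = (R > 0).astype(float)

U, V = wals(R, observed)
print(np.round(U.dot(V.T), 2))  # predicted ratings, including unobserved pairs
```

Because each half-step minimizes the regularized weighted loss exactly for one factor, the loss cannot increase from one iteration to the next. The entries of `U.dot(V.T)` at unobserved positions are the model's predictions for those user-item pairs.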


## Run as a Python module

Let's run it as a Python module for just a few steps.

In [None]:
%bash
rm -rf wals.tar.gz wals_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/wals
python -m trainer.task \
   --output_dir=${PWD}/wals_trained \
   --train_steps=10 --job-dir=./tmp

Now, let's do it on ML Engine.

In [None]:
%bash
OUTDIR=gs://${BUCKET}/wals/trained
JOBNAME=wals_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/wals/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --runtime-version=1.4 \
   -- \
   --output_dir=$OUTDIR \
   --train_steps=1000 --train_batch_size=512

<pre>
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
</pre>