# Neural network hybrid recommendation system on Google Analytics data preprocessing

This notebook demonstrates how to implement a hybrid recommendation system on Google Analytics data, using a neural network to combine content-based and collaborative filtering recommendation models. We will combine the learned user embeddings from [wals.ipynb](../wals.ipynb) with our previous content-based features from [content_based_using_neural_networks.ipynb](../content_based_using_neural_networks.ipynb).

First, we will preprocess our data using BigQuery and Cloud Dataflow so that it can be used later in our neural network hybrid recommendation model.

Apache Beam only works in Python 2 at the moment, so we're going to switch to the Python 2 kernel. In the above menu, click the dropdown arrow and select `python2`.

In [1]:
# Import helpful libraries and set up our project, bucket, and region
import os

PROJECT = "cloud-training-demos" # REPLACE WITH YOUR PROJECT ID
BUCKET = "cloud-training-demos-ml" # REPLACE WITH YOUR BUCKET NAME
REGION = "us-central1" # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# Do not change these
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "1.13"

In [2]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


<h2> Create ML dataset using Dataflow </h2>
Let's use Cloud Dataflow to read in the BigQuery data, do some preprocessing, and write it out as CSV files.

First, let's create the hybrid dataset query that we will use in our Cloud Dataflow pipeline. It combines some content-based features with the user and item embeddings learned in our WALS matrix factorization collaborative filtering lab. We extracted those embeddings from our trained WALSMatrixFactorization Estimator and uploaded them to BigQuery.

In [3]:
query_hybrid_dataset = """
WITH CTE_site_history AS (
    SELECT
        fullVisitorId as visitor_id,
        (SELECT MAX(IF(index = 10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,
        (SELECT MAX(IF(index = 7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, 
        (SELECT MAX(IF(index = 6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,
        (SELECT MAX(IF(index = 2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,
        SPLIT(RPAD((SELECT MAX(IF(index = 4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') AS year_month_array,
        LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) AS nextCustomDimensions
    FROM 
        `cloud-training-demos.GA360_test.ga_sessions_sample`,   
         UNNEST(hits) AS hits
    WHERE 
        # only include hits on pages
        hits.type = "PAGE"
        AND
        fullVisitorId IS NOT NULL
        AND
        hits.time != 0
        AND
        hits.time IS NOT NULL
        AND
        (SELECT MAX(IF(index = 10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
),
CTE_training_dataset AS (
    SELECT
        (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) AS next_content_id,

        visitor_id,
        content_id,
        category,
        REGEXP_REPLACE(title, r",", "") AS title,
        REGEXP_EXTRACT(author_list, r"^[^,]+") AS author,
        DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970, 1, 1), MONTH) AS months_since_epoch
    FROM
        CTE_site_history
    WHERE
        (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL)

SELECT
    CAST(next_content_id AS STRING) AS next_content_id,

    CAST(training_dataset.visitor_id AS STRING) AS visitor_id,
    CAST(training_dataset.content_id AS STRING) AS content_id,
    CAST(IFNULL(category, 'None') AS STRING) AS category,
    CONCAT("\\"", REPLACE(TRIM(CAST(IFNULL(title, 'None') AS STRING)), "\\"",""), "\\"") AS title,
    CAST(IFNULL(author, 'None') AS STRING) AS author,
    CAST(months_since_epoch AS STRING) AS months_since_epoch,

    IFNULL(user_factors._0, 0.0) AS user_factor_0,
    IFNULL(user_factors._1, 0.0) AS user_factor_1,
    IFNULL(user_factors._2, 0.0) AS user_factor_2,
    IFNULL(user_factors._3, 0.0) AS user_factor_3,
    IFNULL(user_factors._4, 0.0) AS user_factor_4,
    IFNULL(user_factors._5, 0.0) AS user_factor_5,
    IFNULL(user_factors._6, 0.0) AS user_factor_6,
    IFNULL(user_factors._7, 0.0) AS user_factor_7,
    IFNULL(user_factors._8, 0.0) AS user_factor_8,
    IFNULL(user_factors._9, 0.0) AS user_factor_9,

    IFNULL(item_factors._0, 0.0) AS item_factor_0,
    IFNULL(item_factors._1, 0.0) AS item_factor_1,
    IFNULL(item_factors._2, 0.0) AS item_factor_2,
    IFNULL(item_factors._3, 0.0) AS item_factor_3,
    IFNULL(item_factors._4, 0.0) AS item_factor_4,
    IFNULL(item_factors._5, 0.0) AS item_factor_5,
    IFNULL(item_factors._6, 0.0) AS item_factor_6,
    IFNULL(item_factors._7, 0.0) AS item_factor_7,
    IFNULL(item_factors._8, 0.0) AS item_factor_8,
    IFNULL(item_factors._9, 0.0) AS item_factor_9,

    FARM_FINGERPRINT(CONCAT(CAST(visitor_id AS STRING), CAST(content_id AS STRING))) AS hash_id
FROM
    CTE_training_dataset AS training_dataset
LEFT JOIN
    `cloud-training-demos.GA360_test.user_factors` AS user_factors
        ON CAST(training_dataset.visitor_id AS FLOAT64) = CAST(user_factors.user_id AS FLOAT64)
LEFT JOIN
    `cloud-training-demos.GA360_test.item_factors` AS item_factors
        ON CAST(training_dataset.content_id AS STRING) = CAST(item_factors.item_id AS STRING)
"""
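The `months_since_epoch` column is the least obvious feature in the query above, so here is a minimal Python sketch of the same arithmetic. It assumes that custom dimension 4 holds a `YYYY.MM` string (inferred from the `RPAD`/`SPLIT` calls, not confirmed by the data); the helper name is hypothetical.

```python
def months_since_epoch(year_month):
    """Mirror DATE_DIFF(DATE(year, month, 1), DATE(1970, 1, 1), MONTH) for a 'YYYY.MM' string."""
    # Like RPAD(..., 7) followed by SPLIT(..., '.') in the query:
    # pad or truncate to 7 characters, then split into year and month.
    year_text, month_text = year_month.ljust(7)[:7].split(".")
    year, month = int(year_text), int(month_text)
    # Both dates fall on the first of a month, so the difference is a whole number of months.
    return (year - 1970) * 12 + (month - 1)


print(months_since_epoch("2017.11"))  # 574
```

Under that assumption, the value 574 seen in the sample rows corresponds to November 2017.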

Let's pull a sample of our data into a dataframe to see what it looks like.

In [7]:
from google.cloud import bigquery
bq = bigquery.Client(project = PROJECT)
df_hybrid_dataset = bq.query(query_hybrid_dataset + "LIMIT 100").to_dataframe()
df_hybrid_dataset.head()

Unnamed: 0,next_content_id,visitor_id,content_id,category,title,author,months_since_epoch,user_factor_0,user_factor_1,user_factor_2,...,item_factor_1,item_factor_2,item_factor_3,item_factor_4,item_factor_5,item_factor_6,item_factor_7,item_factor_8,item_factor_9,hash_id
0,299954138,1000025265994336570,299816215,News,"""Fahnenskandal von Mailand: Die Austria zeigt ...",Alexander Strecha,574,0.000712,0.001678,-0.001852,...,-2.434206e-24,-3.109219e-24,-7.968385e-24,-1.769454e-24,-1.478184e-24,-1.331895e-24,4.589061e-24,-2.270997e-24,3.567726e-24,-5768114586991797349
1,299826775,1000163602560555666,299933565,News,"""Koalitionsverhandler vor Konsens bei Krankenk...",Peter Temel,574,-2.3e-05,1.9e-05,-4.2e-05,...,-1.474965e-16,6.380229e-16,1.089593e-15,2.664141e-16,-6.541727e-16,5.789462e-16,-1.160644e-15,3.606335e-16,-3.140273e-16,7572436456843040598
2,299918278,1000163602560555666,299826775,Lifestyle,"""Auf Bank ausgeruht: Pensionist muss Strafe za...",Marlene Patsalidis,574,-2.3e-05,1.9e-05,-4.2e-05,...,-0.0004174247,0.0005184471,0.001980996,-0.001876339,-0.002035266,-0.0009295577,-0.002669745,0.001398294,8.31834e-06,4505774231869461664
3,299853016,1000163602560555666,299918278,News,"""Skipässe in Wintersport-Hochburgen massiv teu...",Stefan Hofer,574,-2.3e-05,1.9e-05,-4.2e-05,...,0.001557606,0.02223454,-0.0203694,0.01540217,0.02021701,0.02157036,0.0004977403,-0.001841626,-0.00347671,-8810985193337367461
4,298888038,1000163602560555666,299853016,News,"""Schröcksnadel gegen Werdenigg: Keine Aussprache""",,574,-2.3e-05,1.9e-05,-4.2e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5575481432772243221


In [8]:
df_hybrid_dataset.describe()

Unnamed: 0,user_factor_0,user_factor_1,user_factor_2,user_factor_3,user_factor_4,user_factor_5,user_factor_6,user_factor_7,user_factor_8,user_factor_9,...,item_factor_1,item_factor_2,item_factor_3,item_factor_4,item_factor_5,item_factor_6,item_factor_7,item_factor_8,item_factor_9,hash_id
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,3e-06,0.001737,-0.000583,-0.000835,0.000241,0.000465,0.00124061,-0.000273,0.000269,0.000153,...,0.6560215,-0.4308458,0.00380716,-0.2353299,0.3869609,-0.6329494,-0.1234204,0.5202034,-0.2226345,-8.484932e+17
std,0.001702,0.002274,0.001475,0.001685,0.00181,0.001654,0.002443893,0.001819,0.001391,0.001901,...,4.542447,3.203811,0.1326512,1.45665,2.643587,4.557982,0.7201253,3.763805,1.570324,5.309422e+18
min,-0.005761,-0.000195,-0.003718,-0.005138,-0.009186,-0.003106,-0.001258821,-0.005378,-0.002282,-0.006065,...,-0.009316617,-22.70137,-0.6738291,-10.19035,-0.2655937,-32.37039,-4.917779,-1.516911,-11.15637,-8.901887e+18
25%,-0.000588,2e-05,-0.001785,-0.000553,-0.000412,-0.000133,-0.0001164185,-0.000759,-0.000327,-8e-06,...,-1.474965e-16,-3.109219e-24,-1.572803e-15,-5.862354e-06,-7.421535e-07,-1.280417e-14,-0.0006903299,-1.320764e-19,-3.410457e-16,-5.638007e+18
50%,0.0,0.001079,-0.000149,-0.000219,0.000105,9.6e-05,-3.539207e-07,-0.000214,4.7e-05,1.1e-05,...,-3.82171e-36,2.0893850000000003e-23,0.0,-1.769454e-24,-1.478184e-24,-4.871939e-37,-1.8015670000000002e-22,2.222428e-24,3.567726e-24,-1.995584e+18
75%,0.00042,0.002404,4e-06,-6.8e-05,0.001072,0.001426,0.002663703,0.000534,0.000654,0.001292,...,5.076126e-06,6.361844e-06,1.089593e-15,1.561346e-20,1.10126e-15,1.87998e-13,4.589061e-24,1.807638e-06,4.006397e-06,3.609213e+18
max,0.005343,0.008446,0.005723,0.001581,0.003355,0.007909,0.01273011,0.002683,0.004856,0.00678,...,32.29185,1.982741,0.7682434,0.05441631,18.77586,1.030793,0.03816447,26.71273,0.2130968,8.981579e+18


In [9]:
import apache_beam as beam
import datetime, os

def to_csv(rowdict):
    # Convert one BigQuery row (a dict keyed by column name) into a single CSV line
    CSV_COLUMNS = "next_content_id,visitor_id,content_id,category,title,author,months_since_epoch".split(",")
    FACTOR_COLUMNS = ["user_factor_{}".format(i) for i in range(10)] + ["item_factor_{}".format(i) for i in range(10)]

    # String columns first: a missing or NULL value is written as the literal "None"
    data = ",".join(["None" if k not in rowdict else (rowdict[k].encode("utf-8") if rowdict[k] is not None else "None") for k in CSV_COLUMNS])
    data += ","
    # Then the 10 user factors and 10 item factors, in a fixed order
    data += ",".join([str(rowdict[k]) if k in rowdict else "None" for k in FACTOR_COLUMNS])
    yield ("{}".format(data))
  
def preprocess(in_test_mode):
    import shutil, os, subprocess
    job_name = "preprocess-hybrid-recommendation-features" + "-" + datetime.datetime.now().strftime("%y%m%d-%H%M%S")

    if in_test_mode:
        print("Launching local job ... hang on")
        OUTPUT_DIR = "./preproc/features"
        shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
        os.makedirs(OUTPUT_DIR)
    else:
        print("Launching Dataflow job {} ... hang on".format(job_name))
        OUTPUT_DIR = "gs://{0}/hybrid_recommendation/preproc/features/".format(BUCKET)
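        try:
            subprocess.check_call("gcloud storage rm --recursive {}".format(OUTPUT_DIR).split())
        except (subprocess.CalledProcessError, OSError):
            # Nothing to delete yet (or gcloud is unavailable); continue with the existing location
            pass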

    options = {
        "staging_location": os.path.join(OUTPUT_DIR, "tmp", "staging"),
        "temp_location": os.path.join(OUTPUT_DIR, "tmp"),
        "job_name": job_name,
        "project": PROJECT,
        "teardown_policy": "TEARDOWN_ALWAYS",
        "no_save_main_session": True
    }
    opts = beam.pipeline.PipelineOptions(flags = [], **options)
    if in_test_mode:
        RUNNER = "DirectRunner"
    else:
        RUNNER = "DataflowRunner"
    p = beam.Pipeline(RUNNER, options = opts)
  
    query = query_hybrid_dataset

    if in_test_mode:
        query = query + " LIMIT 100" 

    for step in ["train", "eval"]:
        if step == "train":
            selquery = "SELECT * FROM ({}) WHERE ABS(MOD(hash_id, 10)) < 9".format(query)
        else:
            selquery = "SELECT * FROM ({}) WHERE ABS(MOD(hash_id, 10)) = 9".format(query)

        (p 
         | "{}_read".format(step) >> beam.io.Read(beam.io.BigQuerySource(query = selquery, use_standard_sql = True))
         | "{}_csv".format(step) >> beam.FlatMap(to_csv)
         | "{}_out".format(step) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, "{}.csv".format(step))))
        )

    job = p.run()
    if in_test_mode:
        job.wait_until_finish()
        print("Done!")
    
preprocess(in_test_mode = False)

Launching Dataflow job preprocess-hybrid-recommendation-features-190412-184419 ... hang on
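The train/eval split in the pipeline above depends only on `hash_id`, so the same row always lands in the same split. A minimal sketch of that rule in plain Python follows; the function name is hypothetical, and the hash values are copied from the sample dataframe shown earlier.

```python
def split_for(hash_id):
    """Return 'train' or 'eval' for a signed 64-bit hash_id, following the WHERE clauses."""
    # BigQuery's MOD keeps the sign of the dividend, so ABS(MOD(h, 10)) equals abs(h) % 10.
    bucket = abs(hash_id) % 10
    # Buckets 0-8 go to training (about 90%), bucket 9 goes to evaluation (about 10%).
    return "train" if bucket < 9 else "eval"


sample_hash_ids = [-5768114586991797349, 7572436456843040598, 4505774231869461664]
for hash_id in sample_hash_ids:
    print(hash_id, split_for(hash_id))
```

Because the hash is computed from `visitor_id` and `content_id`, repeated visits by the same user to the same article stay together in one split.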


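The call above launches the Dataflow job without waiting for it to complete. Once the job has finished, let's check our files to make sure everything went as expected.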

In [None]:
%%bash
rm -rf features
mkdir features

In [None]:
!gcloud storage cp --recursive gs://{BUCKET}/hybrid_recommendation/preproc/features/*.csv* features/

In [38]:
!head -3 features/*

==> features/eval.csv-00000-of-00001 <==
710535,951784927766849126,710535,News,"Haus aus Marmor und Grabsteinen",None,503,-0.00170100899413,0.00496714003384,0.0040482301265,0.000690933316946,3.52509268851e-05,-0.00172890012618,0.00153049221262,0.00100265210494,0.00228979066014,-0.00201142113656,-5.59943889043e-19,7.42678684608e-19,-1.3985523895e-19,3.42277049416e-19,1.11620765154e-18,2.17990091471e-18,-2.42801472173e-19,1.5953545546e-19,-1.10792405809e-18,-4.38625547901e-19
299818044,6813364694829221327,711895,None,"Impressum KURIER.at",None,553,0.000617411220446,-0.000148811683175,-0.000547810224816,-0.000194783264305,-0.000416739669163,-1.85458611668e-05,-0.000259642780293,-0.000104108628875,-0.000167975216755,0.000182291900273,-6.17342042923,14.5652112961,17.5528583527,3.3229033947,-44.9284629822,29.9998893738,18.7066059113,-14.6920909882,-20.3173618317,-3.7755317688
714241,8640555275627058154,711895,None,"Impressum KURIER.at",None,553,1.48057097249e-05,-1.99350433832e-05,-1.6683281
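To sanity-check a line like the ones above, it can be parsed back into named columns. This is a minimal sketch that reuses the column order from `to_csv`; the helper name is hypothetical, and the factor values in the example line are placeholders rather than real output.

```python
import csv

CSV_COLUMNS = "next_content_id,visitor_id,content_id,category,title,author,months_since_epoch".split(",")
FACTOR_COLUMNS = ["user_factor_{}".format(i) for i in range(10)] + ["item_factor_{}".format(i) for i in range(10)]


def parse_feature_line(line):
    """Parse one line of train.csv / eval.csv into a dict keyed by column name."""
    header = CSV_COLUMNS + FACTOR_COLUMNS
    # csv.reader handles the double-quoted title column.
    values = next(csv.reader([line]))
    if len(values) != len(header):
        raise ValueError("expected {} fields, got {}".format(len(header), len(values)))
    row = dict(zip(header, values))
    # Factor columns are numeric; the remaining columns stay as strings.
    for name in FACTOR_COLUMNS:
        row[name] = float(row[name])
    return row


# Placeholder factor values (0.0) stand in for the real embeddings.
example_line = '710535,951784927766849126,710535,News,"Haus aus Marmor und Grabsteinen",None,503,' + ",".join(["0.0"] * 20)
row = parse_feature_line(example_line)
print(row["title"], row["months_since_epoch"], len(row))
```

A field-count mismatch here would usually point to an unquoted comma in one of the string columns.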



<h2> Create vocabularies using Dataflow </h2>

Now we'll create our vocabulary files for our categorical features. We'll again use Cloud Dataflow to read in the BigQuery data, this time writing out one text file per feature.

In [5]:
query_vocabularies = """
SELECT
    CAST((SELECT MAX(IF(index = index_value, value, NULL)) FROM UNNEST(hits.customDimensions)) AS STRING) AS grouped_by
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,
    UNNEST(hits) AS hits
WHERE
    # only include hits on pages
    hits.type = "PAGE"
    AND (SELECT MAX(IF(index = index_value, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
GROUP BY
    grouped_by
"""
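The query above is a template: `index_value` is not a SQL parameter but a placeholder that gets replaced by plain string substitution before the query is sent to BigQuery. A minimal sketch of that substitution, using a shortened stand-in template and a hypothetical helper name:

```python
# Shortened stand-in for query_vocabularies, keeping only the placeholder expression.
template = "SELECT MAX(IF(index = index_value, value, NULL)) FROM UNNEST(hits.customDimensions)"


def fill_index(query_template, index):
    """Replace every occurrence of the index_value placeholder with a concrete index."""
    return query_template.replace("index_value", "{}".format(index))


# Custom dimension 10 is content_id in the hybrid dataset query.
print(fill_index(template, 10))
```

Since this is textual substitution, it should only be used with trusted, hard-coded index values.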

In [None]:
import apache_beam as beam
import datetime, os

def to_txt(rowdict):
    # Pull columns from BQ and create a line

    # Write out rows for each input row for grouped by column in rowdict
    return "{}".format(rowdict["grouped_by"].encode("utf-8"))
  
def preprocess(in_test_mode):
    import shutil, os, subprocess
    job_name = "preprocess-hybrid-recommendation-vocab-lists" + "-" + datetime.datetime.now().strftime("%y%m%d-%H%M%S")

    if in_test_mode:
        print("Launching local job ... hang on")
        OUTPUT_DIR = "./preproc/vocabs"
        shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
        os.makedirs(OUTPUT_DIR)
    else:
        print("Launching Dataflow job {} ... hang on".format(job_name))
        OUTPUT_DIR = "gs://{0}/hybrid_recommendation/preproc/vocabs/".format(BUCKET)
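        try:
            subprocess.check_call("gcloud storage rm --recursive {}".format(OUTPUT_DIR).split())
        except (subprocess.CalledProcessError, OSError):
            # Nothing to delete yet (or gcloud is unavailable); continue with the existing location
            pass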

    options = {
        "staging_location": os.path.join(OUTPUT_DIR, "tmp", "staging"),
        "temp_location": os.path.join(OUTPUT_DIR, "tmp"),
        "job_name": job_name,
        "project": PROJECT,
        "teardown_policy": "TEARDOWN_ALWAYS",
        "no_save_main_session": True
    }
    opts = beam.pipeline.PipelineOptions(flags = [], **options)
    if in_test_mode:
        RUNNER = "DirectRunner"
    else:
        RUNNER = "DataflowRunner"

    p = beam.Pipeline(RUNNER, options = opts)
  
    def vocab_list(index, name):
        query = query_vocabularies.replace("index_value", "{}".format(index))

        (p 
         | "{}_read".format(name) >> beam.io.Read(beam.io.BigQuerySource(query = query, use_standard_sql = True))
         | "{}_txt".format(name) >> beam.Map(to_txt)
         | "{}_out".format(name) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, "{0}_vocab.txt".format(name))))
        )

    # Call vocab_list function for each
    vocab_list(10, "content_id") # content_id
    vocab_list(7, "category") # category
    vocab_list(2, "author") # author
  
    job = p.run()
    if in_test_mode:
        job.wait_until_finish()
        print("Done!")
    
preprocess(in_test_mode = False)

Next, we count the number of entries in each vocabulary and compute the global mean of `months_since_epoch`.

In [None]:
import apache_beam as beam
import datetime, os

def count_to_txt(rowdict):
    # Pull columns from BQ and create a line

    # Write out count
    return "{}".format(rowdict["count_number"])
  
def mean_to_txt(rowdict):
    # Pull columns from BQ and create a line

    # Write out mean
    return "{}".format(rowdict["mean_value"])
  
def preprocess(in_test_mode):
    import shutil, os, subprocess
    job_name = "preprocess-hybrid-recommendation-vocab-counts" + "-" + datetime.datetime.now().strftime("%y%m%d-%H%M%S")

    if in_test_mode:
        print("Launching local job ... hang on")
        OUTPUT_DIR = "./preproc/vocab_counts"
        shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
        os.makedirs(OUTPUT_DIR)
    else:
        print("Launching Dataflow job {} ... hang on".format(job_name))
        OUTPUT_DIR = "gs://{0}/hybrid_recommendation/preproc/vocab_counts/".format(BUCKET)
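        try:
            subprocess.check_call("gcloud storage rm --recursive {}".format(OUTPUT_DIR).split())
        except (subprocess.CalledProcessError, OSError):
            # Nothing to delete yet (or gcloud is unavailable); continue with the existing location
            pass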

    options = {
        "staging_location": os.path.join(OUTPUT_DIR, "tmp", "staging"),
        "temp_location": os.path.join(OUTPUT_DIR, "tmp"),
        "job_name": job_name,
        "project": PROJECT,
        "teardown_policy": "TEARDOWN_ALWAYS",
        "no_save_main_session": True
    }
    opts = beam.pipeline.PipelineOptions(flags = [], **options)
    if in_test_mode:
        RUNNER = "DirectRunner"
    else:
        RUNNER = "DataflowRunner"

    p = beam.Pipeline(RUNNER, options = opts)
  
    def vocab_count(index, column_name):
        query = """
        SELECT
          COUNT(*) AS count_number
        FROM ({})
        """.format(query_vocabularies.replace("index_value", "{}".format(index)))

        (p 
         | "{}_read".format(column_name) >> beam.io.Read(beam.io.BigQuerySource(query = query, use_standard_sql = True))
         | "{}_txt".format(column_name) >> beam.Map(count_to_txt)
         | "{}_out".format(column_name) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, "{0}_vocab_count.txt".format(column_name))))
        )
    
    def global_column_mean(column_name):
        query = """
        SELECT
          AVG(CAST({1} AS FLOAT64)) AS mean_value
        FROM ({0})
        """.format(query_hybrid_dataset, column_name)

        (p 
         | "{}_read".format(column_name) >> beam.io.Read(beam.io.BigQuerySource(query = query, use_standard_sql = True))
         | "{}_txt".format(column_name) >> beam.Map(mean_to_txt)
         | "{}_out".format(column_name) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, "{0}_mean.txt".format(column_name))))
        )
    
    # Call vocab_count function for each column we want the vocabulary count for
    vocab_count(10, "content_id") # content_id
    vocab_count(7, "category") # category
    vocab_count(2, "author") # author

    # Call global_column_mean function for each column we want the mean for
    global_column_mean("months_since_epoch") # months_since_epoch
  
    job = p.run()
    if in_test_mode:
        job.wait_until_finish()
        print("Done!")
    
preprocess(in_test_mode = False)

Once both vocabulary jobs have finished, let's check our files to make sure everything went as expected.

In [None]:
%%bash
rm -rf vocabs
mkdir vocabs

In [None]:
!gcloud storage cp --recursive gs://{BUCKET}/hybrid_recommendation/preproc/vocabs/*.txt* vocabs/

In [1]:
!head -3 vocabs/*

==> vocabs/author_vocab.txt-00000-of-00001 <==
Wolfgang Atzenhofer
Stefan Hofer
Bernhard Gaul, Christian Böhmer

==> vocabs/category_vocab.txt-00000-of-00001 <==
News
Lifestyle
Stars & Kultur

==> vocabs/content_id_vocab.txt-00000-of-00001 <==
299792293
299965853
299800661


In [None]:
%%bash
rm -rf vocab_counts
mkdir vocab_counts

In [None]:
!gcloud storage cp --recursive gs://{BUCKET}/hybrid_recommendation/preproc/vocab_counts/*.txt* vocab_counts/

In [70]:
!head -3 vocab_counts/*

==> vocab_counts/author_vocab_count.txt-00000-of-00001 <==
1103

==> vocab_counts/category_vocab_count.txt-00000-of-00001 <==
3

==> vocab_counts/content_id_vocab_count.txt-00000-of-00001 <==
15634
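A quick consistency check is to confirm that each count file agrees with the number of lines in the matching vocabulary file. The sketch below is one way to do that; the helper name is hypothetical, and it writes small demo files to a temporary directory (using the three category values shown above) instead of reading the real downloads.

```python
import glob
import io
import os
import tempfile


def read_sharded_lines(pattern):
    """Read all non-empty lines from every output shard matching the glob pattern."""
    lines = []
    for path in sorted(glob.glob(pattern)):
        with io.open(path, encoding="utf-8") as f:
            lines.extend(line.rstrip("\n") for line in f if line.strip())
    return lines


# Demo files that mimic the shard naming of the Dataflow output.
demo_dir = tempfile.mkdtemp()
with io.open(os.path.join(demo_dir, "category_vocab.txt-00000-of-00001"), "w", encoding="utf-8") as f:
    f.write(u"News\nLifestyle\nStars & Kultur\n")
with io.open(os.path.join(demo_dir, "category_vocab_count.txt-00000-of-00001"), "w", encoding="utf-8") as f:
    f.write(u"3\n")

vocab = read_sharded_lines(os.path.join(demo_dir, "category_vocab.txt-*"))
count = int(read_sharded_lines(os.path.join(demo_dir, "category_vocab_count.txt-*"))[0])
print(len(vocab) == count)
```

To run the check against the real files, point the two patterns at `vocabs/` and `vocab_counts/`.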


