<h1>Creating Dataset For Content Based Filtering</h1>
<b>Gentle Reminder:</b> <p>This is going to be a bit lengthy and hectic.</p>

This notebook builds the data we will use for creating our content-based model. We'll collect the data with a set of SQL queries against the publicly available Kurier.at dataset in BigQuery.
Kurier.at is an Austrian news site. The goal of these labs is to recommend an article for a visitor to the site. In this lab we collect the data for training; in the subsequent notebook we train the recommender model.

This notebook illustrates
* how to pull data from a BigQuery table and write it to local files
* how to make reproducible train and test splits 

In [1]:
import os
import tensorflow as tf
import numpy as np
from google.cloud import bigquery


PROJECT = "qwiklabs-gcp-03-c4d7f33650cb"   #YOUR PROJECT ID
BUCKET = "qwiklabs-gcp-03-c4d7f33650cb" #BUCKET ID
REGION = "us-central1" 

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "1.8"  # TensorFlow version used for this lab

In [2]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


We will use this helper function to write the lists of article ids, categories, and authors in our database to local files.

In [3]:
def write_list_to_disk(my_list, filename):
  with open(filename, 'wb') as f:
    for item in my_list:
        line = "%s\n" % item
        f.write(line.encode('utf8'))

<h1>BigQuery</h1>
<h3>Pull data from BigQuery</h3>
Our data resides in BigQuery. First, let's take a look at it by pulling the distinct article ids.

In [4]:
sql = """
#standardSQL

SELECT  
  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id 
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
GROUP BY
  content_id
"""


content_ids_list = bigquery.Client().query(sql).to_dataframe()["content_id"].tolist()
write_list_to_disk(content_ids_list, "content_ids.txt")
print("content id samples {}".format(content_ids_list[:3]))
print("total number of article {}".format(len(content_ids_list)))

content id samples ['299922662', '299826775', '299437612']
total number of article 15634


Note: `write_list_to_disk` opens the file in binary mode ("wb") rather than text mode ("w") because it writes UTF-8 encoded bytes.
<p>Next, we'll create local files containing the list of article categories and the list of article authors.

Note the change in the index when pulling the article category or author information. Also, we are using the first author of the article to create our author list.  
Refer back to the original dataset and use the `hits.customDimensions.index` field to verify the correct index.</p>

In [5]:
sql ="""
#standardSQL
SELECT  
  (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category  
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
GROUP BY   
  category
"""

category_list = bigquery.Client().query(sql).to_dataframe()["category"].tolist()
write_list_to_disk(category_list, "category.txt")
print(category_list)

['Lifestyle', 'News', 'Stars & Kultur']


The categories are 'News', 'Stars & Kultur', and 'Lifestyle'.  
When creating the author list, we'll only use the first author information for each article. 
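
To illustrate what "first author" means here, the sketch below applies the same pattern the query uses, `^[^,]+`, with Python's `re` module. The helper name `first_author` and the two-author sample string are made up for illustration; only the pattern comes from the query.

```python
import re

def first_author(author_list):
    """Return the text before the first comma, or None when nothing matches."""
    match = re.search(r"^[^,]+", author_list)
    return match.group(0) if match else None

# Hypothetical comma-separated author string.
print(first_author("Stefan Hofer, Bernhard Gaul"))  # Stefan Hofer
print(first_author("Yvonne Widler"))                # Yvonne Widler
print(first_author(""))                             # None
```

An empty author string yields no match, which corresponds to a NULL author in the query result.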

In [6]:
sql="""
SELECT
  REGEXP_EXTRACT((SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)), r"^[^,]+")  AS first_author  
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
GROUP BY   
  first_author
"""

authors_list = bigquery.Client().query(sql).to_dataframe()['first_author'].tolist()
write_list_to_disk(authors_list, "authors.txt")
print("Some sample authors {}".format(authors_list[:10]))
print("The total number of authors is {}".format(len(authors_list)))

Some sample authors ['Stefan Berndl', 'Bernhard Gaul', 'Thomas  Trescher', 'Elisabeth Spitzer', 'Marlene Patsalidis', 'Yvonne Widler', 'Hermann Sileitsch-Parzer', 'Maria Zelenko', 'Daniela Davidovits', 'Christina Michlits']
The total number of authors is 385


### Create train and test sets.

In this section, we will create the train/test split of our data for training our model. We hash the concatenated visitor id and content id with `FARM_FINGERPRINT` and take the result modulo 10: rows with a remainder below 9 go to the training set (approximately 90% of the data) and the rest go to the test set (approximately 10%). Because the hash depends only on the row's ids, the split is reproducible.

The query below combines the previous queries into one.
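
The idea behind the hash-based split can be sketched in plain Python. This is only an illustration: `hashlib.md5` stands in for BigQuery's `FARM_FINGERPRINT` (a different hash, so the bucket assignments will not match BigQuery's), and the function names `split_bucket` and `assign_split` are hypothetical.

```python
import hashlib

def split_bucket(visitor_id, content_id, num_buckets=10):
    """Map a (visitor_id, content_id) pair to a stable bucket in [0, num_buckets)."""
    key = (visitor_id + content_id).encode("utf-8")
    digest = hashlib.md5(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, so no ABS() is needed.
    return int.from_bytes(digest[:8], "big") % num_buckets

def assign_split(visitor_id, content_id):
    """Buckets 0-8 go to training (about 90%), bucket 9 to test (about 10%)."""
    return "train" if split_bucket(visitor_id, content_id) < 9 else "test"

print(assign_split("1000148716229112932", "299913368"))
```

The same pair always lands in the same set, no matter how often or in what order the rows are processed, which is what makes the split reproducible.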

In [7]:
sql="""
WITH site_history as (
  SELECT
      fullVisitorId as visitor_id,
      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,
      (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, 
      (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,
      (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,
      SPLIT(RPAD((SELECT MAX(IF(index=4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') as year_month_array,
      LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) as nextCustomDimensions
  FROM 
    `cloud-training-demos.GA360_test.ga_sessions_sample`,   
     UNNEST(hits) AS hits
   WHERE 
     # only include hits on pages
      hits.type = "PAGE"
      AND
      fullVisitorId IS NOT NULL
      AND
      hits.time != 0
      AND
      hits.time IS NOT NULL
      AND
      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
) #same as previous
SELECT
  visitor_id,
  content_id,
  category,
  REGEXP_REPLACE(title, r",", "") as title,
  REGEXP_EXTRACT(author_list, r"^[^,]+") as author, #first one as previous
  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970,1,1), MONTH) as months_since_epoch,
  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) as next_content_id
FROM
  site_history
WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL
      AND ABS(MOD(FARM_FINGERPRINT(CONCAT(visitor_id, content_id)), 10)) < 9
"""

training_set_df = bigquery.Client().query(sql).to_dataframe()
training_set_df.to_csv("training_set.csv", header=False, index=False, encoding = "utf-8")
training_set_df.head()

Unnamed: 0,visitor_id,content_id,category,title,author,months_since_epoch,next_content_id
0,1000148716229112932,299913368,News,U4-Störung legt Wiener Frühverkehr lahm,Yvonne Widler,574,299931241
1,1000148716229112932,299931241,Stars & Kultur,Regisseur Michael Haneke kritisiert Flüchtling...,,574,299913879
2,1000360453832106474,299925700,Lifestyle,Nach Tod von Vater: Tochter bekommt jedes Jahr...,Marlene Patsalidis,574,299922662
3,1000360453832106474,299922662,Lifestyle,Australischer Fernsehstar rechtfertigt Sexismu...,Marlene Patsalidis,574,299826775
4,1001846185946874596,299930679,News,Wintereinbruch naht: Erster Schnee im Osten mö...,Daniela Wahl,574,299930679
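
The `months_since_epoch` value of 574 in these rows can be reproduced by hand. The sketch below assumes the custom dimension at index 4 holds a date string that starts with `YYYY.MM`, as the `RPAD(..., 7)` and `SPLIT(..., '.')` calls in the query suggest; the input `"2017.11"` is an assumed example, not a value read from the dataset.

```python
def months_since_epoch(year_month):
    """Whole months between 1970-01 and a string that starts with 'YYYY.MM'."""
    year, month = (int(part) for part in year_month[:7].split("."))
    return (year - 1970) * 12 + (month - 1)

print(months_since_epoch("2017.11"))  # 574
```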


In [8]:
sql="""
WITH site_history as (
  SELECT
      fullVisitorId as visitor_id,
      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,
      (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, 
      (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,
      (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,
      SPLIT(RPAD((SELECT MAX(IF(index=4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') as year_month_array,
      LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) as nextCustomDimensions
  FROM 
    `cloud-training-demos.GA360_test.ga_sessions_sample`,   
     UNNEST(hits) AS hits
   WHERE 
     # only include hits on pages
      hits.type = "PAGE"
      AND
      fullVisitorId IS NOT NULL
      AND
      hits.time != 0
      AND
      hits.time IS NOT NULL
      AND
      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
) #same as previous
SELECT
  visitor_id,
  content_id,
  category,
  REGEXP_REPLACE(title, r",", "") as title,
  REGEXP_EXTRACT(author_list, r"^[^,]+") as author, #first one as previous
  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970,1,1), MONTH) as months_since_epoch,
  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) as next_content_id
FROM
  site_history
WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL
      AND ABS(MOD(FARM_FINGERPRINT(CONCAT(visitor_id, content_id)), 10)) >= 9
"""

test_set_df = bigquery.Client().query(sql).to_dataframe()
test_set_df.to_csv("test_set.csv", header=False, index=False, encoding = "utf-8")
test_set_df.head()

Unnamed: 0,visitor_id,content_id,category,title,author,months_since_epoch,next_content_id
0,1004555043399129313,299906166,Lifestyle,"Foodwatch vergibt ""Goldenen Windbeutel"" an Ale...",Anita Kattinger,574,299788195
1,1027916139247221006,299865757,News,ÖVP und FPÖ wollen zurück zu alten Noten,Bernhard Gaul,574,299781837
2,1029190018075790956,299931241,Stars & Kultur,Regisseur Michael Haneke kritisiert Flüchtling...,,574,299953030
3,1029245324364976482,299902870,News,RAF-Terroristin bittet Schleyer-Familie um Ver...,Stefan Hofer,574,299902870
4,1029245324364976482,299902870,News,RAF-Terroristin bittet Schleyer-Familie um Ver...,Stefan Hofer,574,299898026


Let's have a look at the two csv files we just created containing the training and test set. We'll also do a line count of both files to confirm that we have achieved an approximate 90/10 train/test split.  
In the next notebook, **Content Based Filtering** we will build a model to recommend an article given information about the current article being read, such as the category, title, author, and publish date.

In [14]:
%%bash
wc -l *_set.csv

   25599 test_set.csv
  232308 training_set.csv
  257907 total
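
As a quick check of the split, the line counts reported by `wc -l` can be turned into fractions:

```python
test_rows = 25599
train_rows = 232308
total = test_rows + train_rows

print(total)                          # 257907
print(round(test_rows / total, 4))    # 0.0993
print(round(train_rows / total, 4))   # 0.9007
```

This is close to the intended 90/10 split. It is not exact because the rows are assigned by a hash rather than counted off.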


In [15]:
!head *_set.csv

==> test_set.csv <==
1004555043399129313,299906166,Lifestyle,"Foodwatch vergibt ""Goldenen Windbeutel"" an Alete-Kinderkekse",Anita Kattinger,574,299788195
1027916139247221006,299865757,News,ÖVP und FPÖ wollen zurück zu alten Noten,Bernhard Gaul,574,299781837
1029190018075790956,299931241,Stars & Kultur,Regisseur Michael Haneke kritisiert Flüchtlingspolitik,,574,299953030
1029245324364976482,299902870,News,RAF-Terroristin bittet Schleyer-Familie um Verzeihung,Stefan Hofer,574,299902870
1029245324364976482,299902870,News,RAF-Terroristin bittet Schleyer-Familie um Verzeihung,Stefan Hofer,574,299898026
1029245324364976482,299898026,News,"Rechte Aktivisten wollten ""Washington Post"" in Falle locken",Stefan Hofer,574,299793275
1030167934885488168,299826775,Lifestyle,Auf Bank ausgeruht: Pensionist muss Strafe zahlen,Marlene Patsalidis,574,299957318
103114529785595991,299972194,News,LIVE: Spielstand bei Sturm - Admira,Mathias Kainz,574,299950903
1031247968806695080,299925700,Lifestyle,Nach T

## Content-Based Filtering Using Neural Networks

Now we will explore 
1. how to build feature columns for a model using tf.feature_column
2. how to create custom evaluation metrics and add them to Tensorboard
3. how to train a model and make predictions with the saved model

We need TensorFlow Hub for this notebook.
TensorFlow Hub should already be installed. You can check that it is by using "pip freeze".

In [11]:
%%bash
pip freeze | grep tensor

tensorboard==1.15.0
tensorflow==1.15.2
tensorflow-datasets==1.2.0
tensorflow-estimator==1.15.1
tensorflow-hub==0.6.0
tensorflow-io==0.8.1
tensorflow-metadata==0.21.1
tensorflow-probability==0.8.0
tensorflow-serving-api==1.15.0


Let's make sure we install the necessary version of tensorflow-hub. After doing the pip install below, click **"Reset Session"** on the notebook so that the Python environment picks up the new packages.

In [None]:
!pip3 install tensorflow-hub==0.4.0
!pip3 install --upgrade tensorflow==1.13.1

Collecting tensorflow-hub==0.4.0
  Downloading tensorflow_hub-0.4.0-py2.py3-none-any.whl (75 kB)
Installing collected packages: tensorflow-hub
  Attempting uninstall: tensorflow-hub
    Found existing installation: tensorflow-hub 0.6.0
    Uninstalling tensorflow-hub-0.6.0:
      Successfully uninstalled tensorflow-hub-0.6.0
Successfully installed tensorflow-hub-0.4.0
Collecting tensorflow==1.13.1
  Downloading tensorflow-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (92.6 MB)
Collecting tensorboard<1.14.0,>=1.13.0
  Downloading tensorboard-1.13.1-py3-none-any.whl (3.2 MB)
Collecting tensorflow-estimator<1.14.0rc0,>=1.13.0
  Downloading tensorflow_estimator-1.13.0-py2.py3-none-any.whl (367 kB)
ERROR: ten

### Import Required Modules

In [5]:
import os
import tensorflow as tf
import numpy as np
from google.cloud import bigquery
import tensorflow_hub as hub
import shutil

PROJECT = "qwiklabs-gcp-03-7b7fb0d6a41d"   #YOUR PROJECT ID
BUCKET = "qwiklabs-gcp-03-7b7fb0d6a41d" #BUCKET ID
REGION = "us-central1" 

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "1.13.1" 



In [2]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


### Build the feature columns for the model.
To start, we'll load the list of categories, authors and article ids we created in the previous Create Datasets notebook.

In [3]:
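# The lists were written as UTF-8 by the dataset notebook, so read them the same way.
categories_list = open("category.txt", encoding="utf-8").read().splitlines()
authors_list = open("authors.txt", encoding="utf-8").read().splitlines()
content_ids_list = open("content_ids.txt", encoding="utf-8").read().splitlines()
# Used as the default for missing months_since_epoch values in the CSV input.
mean_months_since_epoch = 523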

In the cell below we'll define the feature columns to use in our model.

In [6]:
# Title: pre-trained text embedding from TensorFlow Hub, kept frozen during training.
embedded_title_column = hub.text_embedding_column(
    key="title",
    module_spec="https://tfhub.dev/google/nnlm-de-dim50/1",
    trainable=False)

# Content id: hash into buckets, then learn a 10-dimensional embedding.
content_id_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="content_id",
    hash_bucket_size=len(content_ids_list) + 1)
embedded_content_column = tf.feature_column.embedding_column(
    categorical_column=content_id_column,
    dimension=10)

# Author: hash into buckets, then learn a 3-dimensional embedding.
author_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="author",
    hash_bucket_size=len(authors_list) + 1)
embedded_author_column = tf.feature_column.embedding_column(
    categorical_column=author_column,
    dimension=3)

# Category: one-hot encode over the known vocabulary, plus one out-of-vocabulary bucket.
category_column_categorical = tf.feature_column.categorical_column_with_vocabulary_list(
    key="category",
    vocabulary_list=categories_list,
    num_oov_buckets=1)
category_column = tf.feature_column.indicator_column(category_column_categorical)

# Publish date: bucketize months_since_epoch, then cross it with the category.
months_since_epoch_boundaries = list(range(400, 700, 20))
months_since_epoch_column = tf.feature_column.numeric_column(
    key="months_since_epoch")
months_since_epoch_bucketized = tf.feature_column.bucketized_column(
    source_column=months_since_epoch_column,
    boundaries=months_since_epoch_boundaries)
crossed_months_since_category_column = tf.feature_column.indicator_column(
    tf.feature_column.crossed_column(
        keys=[category_column_categorical, months_since_epoch_bucketized],
        hash_bucket_size=len(months_since_epoch_boundaries) * (len(categories_list) + 1)))

# All feature columns fed to the model.
feature_columns = [
    embedded_content_column,
    embedded_author_column,
    category_column,
    embedded_title_column,
    crossed_months_since_category_column,
]
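
To make the bucketized `months_since_epoch` column more concrete, here is a small sketch of how a value maps to a bucket index for the boundaries `list(range(400, 700, 20))`. It uses the standard-library `bisect` module rather than TensorFlow, and the helper name `bucket_index` is hypothetical. It relies on buckets that include their left boundary and exclude their right one, which is how `bucketized_column` is documented to behave.

```python
import bisect

boundaries = list(range(400, 700, 20))  # 400, 420, ..., 680 (15 boundaries)

def bucket_index(months_since_epoch):
    """Index of the bucket a value falls into; 15 boundaries give 16 buckets."""
    return bisect.bisect_right(boundaries, months_since_epoch)

print(bucket_index(399))  # 0  (below the first boundary)
print(bucket_index(400))  # 1  (left boundary is inclusive)
print(bucket_index(574))  # 9  (falls in [560, 580))
print(bucket_index(700))  # 15 (at or above the last boundary)
```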





### Create Input Function
Next we'll create the input function for our model. This input function reads the data from the CSV files we created in the previous notebook.

In [10]:
column_keys = ["visitor_id", "content_id", "category", "title", "author", "months_since_epoch", "next_content_id"]
record_defaults = [["Unknown"], ["Unknown"],["Unknown"],["Unknown"],["Unknown"],[mean_months_since_epoch],["Unknown"]]

label_key = "next_content_id"


def read_dataset(filename, mode, batch_size=512):
    def _input_fn():
        def decode_csv(value_column):
            columns = tf.decode_csv(value_column,record_defaults=record_defaults)
            features = dict(zip(column_keys,columns))
            label = features.pop(label_key)
            
            return features,label
        
        #create list of files that matches pattern
        file_list = tf.gfile.Glob(filename)
        
        #Create dataset from filename
        dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
        
        
        if (mode==tf.estimator.ModeKeys.TRAIN):
            num_epochs = None
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
        else:
            num_epochs = 1
        
        dataset = dataset.repeat(num_epochs).batch(batch_size)
        return dataset.make_one_shot_iterator().get_next()
    return _input_fn
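
To show what `decode_csv` produces for a single row, here is a plain-Python sketch using the standard `csv` module instead of TensorFlow ops. The names `parse_line`, `COLUMN_KEYS`, `DEFAULTS` and `LABEL_KEY` are illustrative. The defaults mirror `record_defaults` above, and an empty field is replaced by its default.

```python
import csv

COLUMN_KEYS = ["visitor_id", "content_id", "category", "title",
               "author", "months_since_epoch", "next_content_id"]
DEFAULTS = ["Unknown", "Unknown", "Unknown", "Unknown", "Unknown", 523, "Unknown"]
LABEL_KEY = "next_content_id"

def parse_line(line):
    """Split one CSV line into a feature dict and a label."""
    values = next(csv.reader([line]))
    features = {}
    for key, raw, default in zip(COLUMN_KEYS, values, DEFAULTS):
        if raw == "":
            features[key] = default   # missing field: fall back to the default
        elif isinstance(default, int):
            features[key] = int(raw)  # numeric column
        else:
            features[key] = raw
    label = features.pop(LABEL_KEY)
    return features, label

# A row from test_set.csv with an empty author field.
row = ("1029190018075790956,299931241,Stars & Kultur,"
       "Regisseur Michael Haneke kritisiert Flüchtlingspolitik,,574,299953030")
features, label = parse_line(row)
print(features["author"], features["months_since_epoch"], label)
```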
            
            
        

### Create the model and train/evaluate


Next, we'll build our model, which recommends an article for a visitor to the Kurier.at website. Look through the code below. We use `tf.feature_column.input_layer` to turn the feature columns into the dense input layer of our network. The network itself is a simple feed-forward model whose hidden layer sizes are set through the `hidden_units` parameter.

Currently, we compute the accuracy between our predicted 'next article' and the actual 'next article' read by the visitor. We'll also add top 10 accuracy as an additional performance metric: we compute it, add it to the metrics dictionary below, and add it to tf.summary so that the value is reported to TensorBoard as well.
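
For intuition, top-k accuracy can be written in a few lines of plain Python. This is a sketch of the metric's definition, not the TensorFlow implementation: the function name `top_k_accuracy` and the toy scores are made up, and ties may be handled differently from `tf.nn.in_top_k`.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Toy example with three classes.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]

print(top_k_accuracy(scores, labels, k=1))  # plain accuracy: 1 of 3 correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 labels are in the top 2
```

With thousands of candidate articles, plain accuracy is a strict measure, so top 10 accuracy gives a more forgiving view of whether the true next article is ranked highly.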

In [8]:
def model_fn(features, labels, mode, params):
    """Builds the model graph for training, evaluation and prediction."""
    # Dense input layer built from the feature columns.
    net = tf.feature_column.input_layer(features, params['feature_columns'])
    for units in params['hidden_units']:
        net = tf.layers.dense(net, units=units, activation=tf.nn.relu)
    # Compute logits (1 per class).
    logits = tf.layers.dense(net, params['n_classes'], activation=None)

    predicted_classes = tf.argmax(logits, 1)

    # Map predicted class indices back to article ids.
    from tensorflow.python.lib.io import file_io
    with file_io.FileIO('content_ids.txt', mode='r') as ifp:
        content = tf.constant([x.rstrip() for x in ifp])
    predicted_class_names = tf.gather(content, predicted_classes)

    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            'class_ids': predicted_classes[:, tf.newaxis],
            'class_names': predicted_class_names[:, tf.newaxis],
            'probabilities': tf.nn.softmax(logits),
            'logits': logits,
        }
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    # Convert each label (an article id string) to its line index in content_ids.txt.
    table = tf.contrib.lookup.index_table_from_file(vocabulary_file="content_ids.txt")
    labels = table.lookup(labels)

    # Compute loss.
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Compute evaluation metrics.
    accuracy = tf.metrics.accuracy(labels=labels,
                                   predictions=predicted_classes,
                                   name='acc_op')
    top_10_accuracy = tf.metrics.mean(tf.nn.in_top_k(predictions=logits,
                                                     targets=labels,
                                                     k=10))
    metrics = {
        'accuracy': accuracy,
        'top_10_accuracy': top_10_accuracy}

    # Report both metrics to TensorBoard.
    tf.summary.scalar('accuracy', accuracy[1])
    tf.summary.scalar('top_10_accuracy', top_10_accuracy[1])

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(
            mode, loss=loss, eval_metric_ops=metrics)

    # Create training op.
    assert mode == tf.estimator.ModeKeys.TRAIN

    optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

### Train and Evaluate

In [11]:
# Standard tf.estimator train_and_evaluate setup
outdir = 'content_based_model_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time
tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir = outdir,
    params={
     'feature_columns': feature_columns,
      'hidden_units': [200, 100, 50],
      'n_classes': len(content_ids_list)
    })

train_spec = tf.estimator.TrainSpec(
    input_fn = read_dataset("training_set.csv", tf.estimator.ModeKeys.TRAIN),
    max_steps = 2000)

eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset("test_set.csv", tf.estimator.ModeKeys.EVAL),
    steps = None,
    start_delay_secs = 30,
    throttle_secs = 60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'content_based_model_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9460eaf810>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Calling model_fn.


Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

Instructions for updating:
Use tf.cast instead.

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Instructions for updating:
Use keras.layers.dense instead.

For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into content_based_model_trained/model.ckpt.
INFO:tensorflow:loss = 9.657599, step = 1
INFO:tensorflow:global_step/sec: 6.70639
INFO:tensorflow:loss = 5.46904, step = 101 (14.913 sec)
INFO:tensorflow:global_step/sec: 6.80925
INFO:tensorflow:loss = 5.3465405, step = 201 (14.686 sec)
INFO:tensorflow:global_step/sec: 6.76986
INFO:tensorflow:loss = 5.1290684, step = 301 (14.771 sec)
INFO:tensorflow:global_step/sec: 6.81488
INFO:tensorflow:loss = 5.115883, step = 401 (14.674 sec)
INFO:tensorflow:global_step/sec: 6.77229
INFO:tensorflow:loss = 5.0935826, step = 501 (14.767 sec)
INFO:tensorflow:global_step/sec: 6.82503
INFO:tensorflow:loss = 5.090375, step = 601 (14.651 sec)
INFO:tensorflow:global_step/sec: 6.80998
INFO:tensorflow:loss = 5.1792736, step = 701 (14.684 sec)
INFO:tensorflow:global_step/sec: 6.80427
INFO:tensorflow:loss = 4.879672, step = 801 (14.697 sec)
INFO:tensorflow:global_step/sec: 6.70328
INFO:tensorflow:loss = 5.023737, step = 901 (14.918 sec)
INFO:tensorflow:global_step/sec: 6.65967
INFO:tensorflow:loss = 5.034051, step = 1001 (15.016 sec)
INFO:tensorflow:global_step/sec: 6.65871
INFO:tensorflow:loss = 5.0140915, step = 1101 (15.017 sec)
INFO:tensorflow:global_step/sec: 6.7991
INFO:tensorflow:loss = 4.9860573, step = 1201 (14.708 sec)
INFO:tensorflow:global_step/sec: 6.77146
INFO:tensorflow:loss = 4.888763, step = 1301 (14.768 sec)
INFO:tensorflow:global_step/sec: 6.82351
INFO:tensorflow:loss = 4.9365063, step = 1401 (14.655 sec)
INFO:tensorflow:global_step/sec: 6.83947
INFO:tensorflow:loss = 4.931014, step = 1501 (14.621 sec)
INFO:tensorflow:global_step/sec: 6.81101


INFO:tensorflow:loss = 4.95496, step = 1601 (14.682 sec)


INFO:tensorflow:loss = 4.95496, step = 1601 (14.682 sec)


INFO:tensorflow:global_step/sec: 6.73058


INFO:tensorflow:global_step/sec: 6.73058


INFO:tensorflow:loss = 4.782462, step = 1701 (14.857 sec)


INFO:tensorflow:loss = 4.782462, step = 1701 (14.857 sec)


INFO:tensorflow:global_step/sec: 6.84038


INFO:tensorflow:global_step/sec: 6.84038


INFO:tensorflow:loss = 4.7382765, step = 1801 (14.619 sec)


INFO:tensorflow:loss = 4.7382765, step = 1801 (14.619 sec)


INFO:tensorflow:global_step/sec: 6.87667


INFO:tensorflow:global_step/sec: 6.87667


INFO:tensorflow:loss = 5.02299, step = 1901 (14.542 sec)


INFO:tensorflow:loss = 5.02299, step = 1901 (14.542 sec)


INFO:tensorflow:Saving checkpoints for 2000 into content_based_model_trained/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-04-13T22:26:50Z
INFO:tensorflow:Graph was finalized.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2020-04-13-22:26:56
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.055236533, global_step = 2000, loss = 4.892137, top_10_accuracy = 0.32294232
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Loss for final step: 4.9411874.


({'accuracy': 0.055236533,
  'loss': 4.892137,
  'top_10_accuracy': 0.32294232,
  'global_step': 2000},
 [])

Training takes a while: at roughly 6.8 steps per second, the 2,000 steps above need about five minutes.
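
The evaluation result above reports both 'accuracy' (about 0.055) and 'top_10_accuracy' (about 0.32). The model function that defines these metrics is not shown in this part of the notebook, so the following is only a sketch of the usual meaning of top-k accuracy: a prediction counts as correct when the true class is among the k highest-scoring classes. The function name `top_k_accuracy` and its arguments are made up for illustration.

```python
import numpy as np


def top_k_accuracy(logits, labels, k=10):
    """Fraction of rows whose true label is among the k highest-scoring classes.

    Hypothetical helper for illustration; not the notebook's own metric code.
    logits: array-like of shape (n_examples, n_classes)
    labels: array-like of shape (n_examples,) holding integer class indices
    """
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (order inside the top k is irrelevant).
    top_k = np.argpartition(logits, -k, axis=1)[:, -k:]
    # A row is a hit when its true label appears anywhere in its top-k indices.
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()
```

With k=1 this reduces to plain accuracy, which is why 'top_10_accuracy' is always at least as large as 'accuracy'.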

### Make predictions with the trained model
With the model now trained, we can make predictions by calling the predict method on the estimator. Let's look at what the model predicts for the first five examples of the training set.
To start, we'll create a new file 'first_5.csv' containing the first five rows of the training set. We'll also save the second column of each row, the content ID of the article currently being read, to a file 'first_5_content_ids' so we can look up its title and compare it with the recommendation.

In [17]:
%%bash
head -5 training_set.csv > first_5.csv
head first_5.csv
awk -F "\"*,\"*" '{print $2}' first_5.csv > first_5_content_ids

1000148716229112932,299913368,News,U4-Störung legt Wiener Frühverkehr lahm ,Yvonne Widler,574,299931241
1000148716229112932,299931241,Stars & Kultur,Regisseur Michael Haneke kritisiert Flüchtlingspolitik,,574,299913879
1000360453832106474,299925700,Lifestyle,Nach Tod von Vater: Tochter bekommt jedes Jahr Blumen,Marlene Patsalidis,574,299922662
1000360453832106474,299922662,Lifestyle,Australischer Fernsehstar rechtfertigt Sexismus mit Asperger-Syndrom,Marlene Patsalidis,574,299826775
1001846185946874596,299930679,News,Wintereinbruch naht: Erster Schnee im Osten möglich,Daniela Wahl,574,299930679
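
The awk command above splits on commas, which is fine for the second column but would misalign later columns if a title contained a comma. As a minimal sketch of the same extraction in Python, the csv module can be used instead. The helper name `first_n_column` is hypothetical, and the assumption that column index 1 holds the content ID is read off the rows printed above.

```python
import csv


def first_n_column(src_path, dst_path, column=1, n=5):
    """Write one column of the first n CSV rows to dst_path, one value per line.

    Hypothetical helper; column=1 is assumed to be the content ID.
    """
    values = []
    with open(src_path, newline="", encoding="utf-8") as src:
        for i, row in enumerate(csv.reader(src)):
            if i >= n:
                break
            values.append(row[column])
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(values) + "\n")
    return values
```

Calling `first_n_column("training_set.csv", "first_5_content_ids")` would then produce the same file as the awk line.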


In [19]:
import numpy as np
from google.cloud import bigquery

output = list(estimator.predict(input_fn=read_dataset("first_5.csv", tf.estimator.ModeKeys.PREDICT)))
# ndarray.item() replaces the deprecated np.asscalar
recommended_content_ids = [d["class_names"].item().decode('UTF-8') for d in output]
with open("first_5_content_ids") as f:
    content_ids = f.read().splitlines()

# Look up an article's title (custom dimension 6) by its content ID (custom dimension 10)
title_sql = """
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,
  UNNEST(hits) AS hits
WHERE
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1"""

client = bigquery.Client()

def lookup_title(content_id):
    # Titles come back as str; encoding them to bytes would print as b'...' in Python 3
    return client.query(title_sql.format(content_id)).to_dataframe()['title'].tolist()[0].strip()

recommended_title = lookup_title(recommended_content_ids[0])
current_title = lookup_title(content_ids[0])
print("Current title: {} ".format(current_title))
print("Recommended title: {}".format(recommended_title))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

Current title: U4-Störung legt Wiener Frühverkehr lahm 
Recommended title: Auf Bank ausgeruht: Pensionist muss Strafe zahlen
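
The cell above only looks at the first example. To check all five at once, the recommended IDs can be compared with the article each visitor actually read next. This is a rough sketch that assumes the last column of 'first_5.csv' is that next content ID (it appears to be, judging from the rows printed earlier); the helper name `compare_with_targets` is hypothetical.

```python
import csv


def compare_with_targets(csv_path, recommended_ids, target_column=-1):
    """Pair each recommended ID with the row's target ID and flag exact matches.

    Hypothetical helper; target_column=-1 assumes the target is the last column.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        targets = [row[target_column] for row in csv.reader(f)]
    return [(rec, tgt, rec == tgt) for rec, tgt in zip(recommended_ids, targets)]
```

For example, `compare_with_targets("first_5.csv", recommended_content_ids)` returns one `(recommended, target, matched)` tuple per row. Given the evaluation accuracy of about 5.5%, most of the flags should be expected to be False.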


### Congratulations, you made it.
### I am tired. Bye.