In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# On-device recommendations with Firebase ML and TensorFlow Lite

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/FirebaseExtended/codelab-contentrecommendation-android/blob/master/Firebase_ML_on_device_recommentations.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/FirebaseExtended/codelab-contentrecommendation-android/blob/master/Firebase_ML_on_device_recommentations.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Overview

This is the notebook for step 11 of the codelab [**Add recommendations to your app with TensorFlow Lite and Firebase**](https://codelabs.developers.google.com/codelabs/contentrecommendation-android). Before running the code in this notebook, complete steps 1-10 of the codelab to get your app and console projects set up.

This code base provides a toolkit for training an on-device TensorFlow
recommendation model on user data collected in your app with Firebase Analytics. The model is then deployed with Firebase ML to serve movie recommendations in the sample app FireFlix.

This notebook shows an end-to-end example that (1) imports Firebase Analytics data from BigQuery, (2) preprocesses that data for training, (3) trains a recommendations model on it, and (4) exports the model in tflite format, ready for apps to run inference and serve recommendations.

Because the app used in the codelab is only a sample, it doesn't see enough usage to generate a significant number of analytics events, and training accurate models requires a large amount of data. For this codelab and notebook, we therefore simulate a larger analytics event store with the public [MovieLens](https://grouplens.org/datasets/movielens/) dataset. You could
adapt the data processing script to your own dataset and train your own
recommendation model.

## Prerequisites

Run the cell below to clone the TensorFlow recommendations model sample from GitHub. This is the model we will use, together with our analytics training data, to create the recommendations model.

The model uses a convolutional neural network (CNN) encoder, which applies multiple convolutional layers to generate an encoding of the user history analytics data. For more details, refer to the documentation for the underlying TensorFlow model.

In [None]:
!git clone https://github.com/tensorflow/examples
%cd /content/examples/lite/examples/recommendation/ml/
!pip install -r requirements.txt
!pip install --upgrade google-cloud-storage google-cloud-bigquery[bqstorage]

fatal: destination path 'examples' already exists and is not an empty directory.
/content/examples/lite/examples/recommendation/ml


## Set up authentication

In this notebook, we use analytics data from BigQuery to generate training data for our recommendations model. To access BigQuery data from the Colab notebook, you need to upload the service account file that you downloaded in step 10 of the codelab.

Note: If this step throws an error, you can either:
1. Manually upload the JSON file to the `/content` folder using the Folder icon in the left menu, then set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the file path. For example, if the file was uploaded to `/content`, run:
   `os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = '/content/<your_service_acct_file_name>'`
2. Try disabling third-party cookies in your browser, as [suggested here](https://stackoverflow.com/a/61494336).

In [None]:
import os
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  with open('/content/' + fn, 'wb') as f:
    f.write(uploaded[fn])
  os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='/content/' + fn
  projectID = fn.rsplit("-", 1)[0]

Saving comp6442groupassignment-11fddf16223d.json to comp6442groupassignment-11fddf16223d.json
User uploaded file "comp6442groupassignment-11fddf16223d.json" with length 2358 bytes
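The upload cell derives `projectID` by cutting the file name at its last hyphen. The sketch below isolates that step. It assumes the key file still has its default `<project-id>-<key-id>.json` name; the helper name and the sample file names are hypothetical, and a renamed file would produce a misleading value.

```python
def project_id_from_key_file(filename: str) -> str:
    """Return everything before the last hyphen of a key file name."""
    # rsplit with maxsplit=1 splits only at the LAST hyphen, so hyphens
    # inside the project ID itself are preserved.
    return filename.rsplit("-", 1)[0]


# Hypothetical file names, for illustration only.
default_name = project_id_from_key_file("my-sample-project-0123456789ab.json")
renamed = project_id_from_key_file("credentials.json")

print(default_name)  # the part before the last hyphen
print(renamed)       # no hyphen: the whole name comes back unchanged
```

If the file has been renamed, it is safer to set the project ID by hand than to rely on this parsing.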


# Import app analytics data from BigQuery

In this step, we load the analytics data that we collected in the app with Firebase Analytics and sent to BigQuery. We load the data into a pandas DataFrame and then preprocess it into the appropriate input format for the model training step.

## Enable BigQuery IPython magics

BigQuery provides several convenience IPython magics for fetching data. We load them with the `%reload_ext` magic below.

In [None]:
%reload_ext google.cloud.bigquery

## Import data

We use the following SQL statement to get items from the table we created in BigQuery. Firebase Analytics exports a lot of additional information, such as device type and platform version, that we don't need for training this model. Initially, we fetch only a limited number of rows to briefly explore the shape of the data and decide which fields are important.

Notice that a row in the dataframe is created for each analytics event logged in the app. This row has many properties, but the ones that are of importance for this notebook are the fields:
* event_name
* event_timestamp
* items
* user_pseudo_id

Notice that some fields, such as the **items** field, are actually objects. We will extract the subfield of interest below.

In [None]:
%%bigquery analytics_test_import
SELECT
    *
FROM `comp6442groupassignment.userData.userLikes`
LIMIT 10

Query complete after 0.02s: 100%|██████████| 1/1 [00:00<00:00, 91.94query/s] 
Downloading: 100%|██████████| 10/10 [00:00<00:00, 10.00rows/s]


In [None]:
analytics_test_import

Unnamed: 0,posts,id,__key__,__error__,__has_error__
0,"[-MmNeGCDAQRLYrlsiURu, -MmNeGY7-BmYQbIHIf86, -...",3l0665CqVnhdHyxvzFINhFgdwbx1,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
1,"[-MmNeG6EQhXtFlQ7U5NZ, -MmNeGXnsUqpmwt0DVMf, -...",3lx3geYNoDNBcQKLFGlMXVH3wTv1,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
2,"[-MmNeGH8WQVthx4ho7-z, -MmNeGYzAZsfaVxLwB8d, -...",3nw7vLbdfubFjI1m1NYCTDvAWZ22,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
3,"[-MmNeG_nnZBfiL1vo6G2, -MmNeGb8kJMHjrSoo-md, -...",3tjhWozCKzLJEXXNEehlYGWVKk42,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
4,"[-MmNeGYP7goLR0iK7OBE, -MmNeGEGYGxbWE5BMkn8, -...",3wNpgvQrsyRm3TtloklIV6UkHHR2,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
5,"[-MmNeGGgK6JQ4AdqwIUD, -MmNeGYfS50GiAdn3mUY, -...",4S9punuX7HScofagiBBjGij62fv1,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
6,"[-MmNeG_V_tifosfABrkk, -MmNeGH4F6zzhVbKqS2U, -...",4VtmDdM873XcY6IKBkdBKBx80i72,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
7,"[-MmNeGFr2hXeKIQugSP2, -MmNeGXnsUqpmwt0DVMj, -...",4XN9TOni20ZCDvlImLKFiImh2ri1,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
8,"[-MmNeG_rxoAJ0je4Jiwd, -MmNeGW8kScYJE3EmAWe, -...",4gptZIBqB4VOiZSyArqWfCFgjR93,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False
9,"[-MmNeGHYe3eRDNUYp_YZ, -MmNeGGuqBS4JeDfPT0E, -...",55ejQMRINkXm0ZIicMMeyU00lc43,"{'namespace': '', 'app': 's~comp6442groupassig...",[],False


The next cell lists all of the columns included in each analytics event entry.

In [None]:
analytics_test_import.columns

Index(['posts', 'id', '__key__', '__error__', '__has_error__'], dtype='object')

Of the information logged under 'items', we are only interested in 'item_id', which corresponds to the ID of the movie the user interacted with.

Now we run the following command to import the whole dataset into a variable. Note that we import only the fields we need for training.

In [None]:
%%bigquery analytics_data_real
SELECT
    id, posts
FROM `comp6442groupassignment.userData.userLikes`

Query complete after 0.01s: 100%|██████████| 1/1 [00:00<00:00, 185.42query/s]
Downloading: 100%|██████████| 414/414 [00:00<00:00, 416.60rows/s]
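The `%%bigquery` magic stores the query result in a DataFrame. Outside a notebook, the same query could be assembled as a string and run through the BigQuery client library. The sketch below only builds the SQL text; `build_select_query` is a hypothetical helper, and the commented lines at the end assume that credentials are already configured.

```python
from typing import Optional, Sequence


def build_select_query(table: str, columns: Sequence[str],
                       limit: Optional[int] = None) -> str:
    """Build a simple SELECT statement for a fully qualified table name."""
    # An empty column list falls back to selecting every column.
    column_list = ", ".join(columns) if columns else "*"
    query = f"SELECT {column_list} FROM `{table}`"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


sql = build_select_query("comp6442groupassignment.userData.userLikes",
                         ["id", "posts"])
print(sql)

# With credentials configured, the query could then be run like this:
# from google.cloud import bigquery
# analytics_data_real = bigquery.Client().query(sql).to_dataframe()
```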


In [None]:
analytics_data_real.head()

Unnamed: 0,id,posts
0,3l0665CqVnhdHyxvzFINhFgdwbx1,"[-MmNeGCDAQRLYrlsiURu, -MmNeGY7-BmYQbIHIf86, -..."
1,3lx3geYNoDNBcQKLFGlMXVH3wTv1,"[-MmNeG6EQhXtFlQ7U5NZ, -MmNeGXnsUqpmwt0DVMf, -..."
2,3nw7vLbdfubFjI1m1NYCTDvAWZ22,"[-MmNeGH8WQVthx4ho7-z, -MmNeGYzAZsfaVxLwB8d, -..."
3,3tjhWozCKzLJEXXNEehlYGWVKk42,"[-MmNeG_nnZBfiL1vo6G2, -MmNeGb8kJMHjrSoo-md, -..."
4,3wNpgvQrsyRm3TtloklIV6UkHHR2,"[-MmNeGYP7goLR0iK7OBE, -MmNeGEGYGxbWE5BMkn8, -..."


# Preprocess the dataset

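In this step, we flatten each user's `posts` list, count how often every post ID appears across all users, and keep the most frequent posts. We then encode each user as a binary vector over those posts, which is the input to the clustering step at the end of the notebook.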

In [None]:
import pandas as pd

In [None]:
# Flatten every user's list of liked posts into a single list.
all_posts = []
for posts in analytics_data_real['posts']:
  all_posts.extend(posts)
# Count how many times each post ID appears, most frequent first.
p_df = pd.Series(all_posts).value_counts().reset_index(drop=False)
p_df.head()

Unnamed: 0,index,0
0,-MmNeGFG2-aqwf8yjpQJ,5
1,-MmNeG8FCeT-cNNbHV2J,4
2,-MmNeGGtjnTY9RMTL3zN,4
3,-MmNeGEAkksim_aImKPY,4
4,-MmNeGYruc-l_2NCuJ3K,4
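The counting step, and the top-N selection that follows it, can also be sketched without pandas. The example below is a minimal illustration on made-up post IDs; `user_posts` and `top_n` are hypothetical names standing in for the `posts` column and the cutoff used in the notebook.

```python
from collections import Counter

# Toy stand-in for the `posts` column: one list of liked post IDs per user.
user_posts = [
    ["p1", "p2", "p3"],
    ["p2", "p3"],
    ["p3", "p4"],
]

# Count every occurrence of every post ID across all users.
counts = Counter(pid for posts in user_posts for pid in posts)

# Keep only the IDs of the top_n most frequent posts.
top_n = 2
hot = [pid for pid, _ in counts.most_common(top_n)]

print(counts["p3"], hot)
```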


In [None]:
p_df.columns = ['pid', 'count']
p_df = p_df.sort_values('count', ascending=False).reset_index(drop=True)

In [None]:
hot_posts = p_df[:1000]['pid'].values

In [None]:
import numpy as np
w = np.where(hot_posts=="-MmNeGFG2-aqwf8yjpQJ")
hot_posts[w[0][0]]

'-MmNeGFG2-aqwf8yjpQJ'

In [None]:
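# Build a binary user-by-post matrix: res[i, j] is 1 if user i liked hot post j.
post_index = {pid: j for j, pid in enumerate(hot_posts)}
res = np.zeros([len(analytics_data_real), len(hot_posts)])
for i, posts in enumerate(analytics_data_real['posts']):
  for pid in posts:
    j = post_index.get(pid)  # None if the post is not among the hot posts
    if j is not None:
      res[i, j] = 1
res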

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])
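The matrix above is a multi-hot encoding: one row per user, one column per hot post, and a 1 wherever the user liked that post. A dependency-free sketch of the same idea on toy data follows; `multi_hot`, `vocabulary`, and the post IDs are hypothetical names used only for illustration.

```python
def multi_hot(user_posts, vocabulary):
    """Encode each user's posts as a 0/1 row over the given vocabulary."""
    # Map each post ID to its column so lookups are constant time.
    index = {pid: col for col, pid in enumerate(vocabulary)}
    matrix = []
    for posts in user_posts:
        row = [0.0] * len(vocabulary)
        for pid in posts:
            col = index.get(pid)
            if col is not None:  # posts outside the vocabulary are ignored
                row[col] = 1.0
        matrix.append(row)
    return matrix


vocabulary = ["p3", "p2"]
user_posts = [["p1", "p2", "p3"], ["p4"], ["p3"]]
rows = multi_hot(user_posts, vocabulary)
print(rows)
```

A user whose liked posts all fall outside the vocabulary ends up with an all-zero row, which is why many rows in the printed matrix above contain only zeros.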

In [56]:
np.save("/content/training.npy", res)

In [57]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=20, random_state=0).fit(res)
np.save("/content/cluster_centers_.npy", kmeans.cluster_centers_)

In [62]:
# Write one cluster center per line as comma-separated values.
with open('/content/cluster_centers_.txt', 'w') as file:
  for center in kmeans.cluster_centers_:
    file.write(','.join(str(v) for v in center))
    file.write('\n')
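The text file holds one cluster center per line, with values separated by commas. The sketch below writes toy rows in that layout and parses them back, as a quick check that the format round-trips. `write_rows` and `read_rows` are hypothetical helpers, and the values are made up.

```python
import os
import tempfile


def write_rows(path, rows):
    """Write each row as one line of comma-separated values."""
    with open(path, "w") as f:
        for row in rows:
            f.write(",".join(str(v) for v in row))
            f.write("\n")


def read_rows(path):
    """Parse the file back into a list of float lists, skipping blank lines."""
    with open(path) as f:
        return [[float(v) for v in line.split(",")] for line in f if line.strip()]


centers = [[0.0, 0.25, 1.0], [0.5, 0.0, 0.125]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "centers.txt")
    write_rows(path, centers)
    restored = read_rows(path)

print(restored == centers)
```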

In [64]:
with open('/content/hot_posts.txt', 'w') as file:
  for s in hot_posts:
    file.write(s)
    file.write('\n')
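The notebook does not show how the two exported files are consumed. One plausible use, offered here purely as an assumption, is nearest-centroid recommendation: encode the user as a 0/1 vector over the hot posts, find the closest cluster center, and suggest the posts with the highest weights in that center that the user has not liked yet. All names and values below are hypothetical.

```python
def nearest_center(user_vector, centers):
    """Return the index of the center with the smallest squared distance."""
    def sq_dist(center):
        return sum((u - c) ** 2 for u, c in zip(user_vector, center))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))


def recommend(user_vector, centers, post_ids, k=2):
    """Suggest up to k posts the user has not liked, ranked by center weight."""
    center = centers[nearest_center(user_vector, centers)]
    candidates = [
        (weight, pid)
        for weight, pid, liked in zip(center, post_ids, user_vector)
        if not liked
    ]
    # Highest center weight first; the sort is stable for equal weights.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in candidates[:k]]


# Toy data standing in for hot_posts.txt and cluster_centers_.txt.
post_ids = ["p1", "p2", "p3", "p4"]
centers = [
    [0.9, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.7, 0.9],
]
user = [1.0, 0.0, 0.0, 0.0]  # this user has liked only p1

print(recommend(user, centers, post_ids))
```

Each value in a center is the fraction of users in that cluster who liked the corresponding post, so ranking by it favors posts that are popular among similar users.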