# On-device recommendations with Firebase ML and TensorFlow Lite

## Overview

To follow along, you need to:
* Connect Firebase Analytics
* Create and populate a table in BigQuery

## Prerequisites

We start with a simple k-nearest neighbors (kNN) model.

## Set up authentication

In this notebook, we use analytics data from BigQuery to generate training data for our recommendations model. To access BigQuery data from the Colab notebook, you need to upload the service account file that you downloaded in step 10 of the codelab.

Note: If this step throws an error, you can either:
1. Manually upload the JSON file to the /content folder using the Folder icon in the left menu, then set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the file path. For example, if the file was uploaded to /content, run:
`os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='/content/<your_service_acct_file_name>'`
OR,
2. Try disabling third-party cookies in your browser, as [suggested here](https://stackoverflow.com/a/61494336).
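
If you set the variable manually, it can help to confirm that the key file exists before any BigQuery call is made. The following is a minimal sketch, assuming a hypothetical helper name `use_service_account` and a key file path of your choosing; neither is part of the codelab.

```python
import os


def use_service_account(key_path):
    """Point Google client libraries at a service account key file.

    `use_service_account` is a hypothetical helper, not part of any Google
    library. It only sets the standard GOOGLE_APPLICATION_CREDENTIALS variable.
    """
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"Service account file not found: {key_path}")
    absolute_path = os.path.abspath(key_path)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = absolute_path
    return absolute_path
```

For example, `use_service_account('/content/<your_service_acct_file_name>')` fails early with a clear message if the upload did not succeed.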

In [1]:
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='donapp-d2378-firebase-adminsdk-zxd1d-2147e3a97f.json'

# Import app analytics data from BigQuery

In this step, we load the analytics data we collected in the app with Firebase Analytics and sent to BigQuery. We load the data into a pandas dataframe and then preprocess it into the format the model training step expects.

## Enable BigQuery IPython magics

The BigQuery client library provides several convenience IPython magics that we use to fetch data. Load them with the %reload_ext magic below (%load_ext also works the first time the extension is loaded).

In [2]:
%reload_ext google.cloud.bigquery

## Import data

We use the following SQL statement to get rows from the table we created in BigQuery. Initially, we fetch only a limited number of rows to briefly explore the form of this data and select which fields are important.

Notice that the dataframe contains one row for each event logged in the app. The columns in this table are:
* userID
* charityID
* amount
* timestamp

A raw Firebase Analytics export contains a lot of additional information, such as device type and platform version, and stores some fields (such as **items**) as nested objects. The donations table used here is already flat, so no subfield extraction is needed.

In [3]:
%%bigquery analytics_test_import
SELECT
    *
FROM `firebase_recommendations_dataset.donations_table`
LIMIT 10


In [4]:
analytics_test_import

Unnamed: 0,userID,charityID,amount,timestamp
0,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,3MWg9xpBnDeB1GPauOf57hl90SIy,,2023-03-08 15:13:33+00:00
1,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,JwY0MIYrcifUDJj35tO5JKedB8Nt,400.0,2023-03-08 15:13:34+00:00
2,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,tjKvwntCUTmcxOsE3hAYsOc4pxMk,4000.0,2023-03-08 15:13:36+00:00
3,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,3MWg9xpBnDeB1GPauOf57hl90SIy,,2023-03-08 15:13:37+00:00
4,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,JwY0MIYrcifUDJj35tO5JKedB8Nt,,2023-03-08 15:13:37+00:00
5,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,tjKvwntCUTmcxOsE3hAYsOc4pxMk,,2023-03-08 15:13:37+00:00
6,a1aX9SLLbe3Fksvqtv7YhRoqtaw9,tjKvwntCUTmcxOsE3hAYsOc4pxMk,,2023-03-08 15:13:38+00:00
7,unFb4dqPjHNnUoGJhC9u7gBdb9YW,9TPmjJvlASIKCgZ9NHBZP1jZEP3S,,2023-03-08 15:13:39+00:00
8,unFb4dqPjHNnUoGJhC9u7gBdb9YW,9TPmjJvlASIKCgZ9NHBZP1jZEP3S,,2023-03-08 15:13:40+00:00
9,7dlnsoBWpyftpEJ6gQBkm86oxmI7,b79lWABizCzlu2gUYvVsUCSApCD4,,2023-03-08 15:13:42+00:00


The columns included in each row:

In [5]:
analytics_test_import.columns

Index(['userID', 'charityID', 'amount', 'timestamp'], dtype='object')

The table has no nested 'items' field, so the IDs can be read directly. For example, the 'userID' of the first row:

In [6]:
analytics_test_import['userID'][0]

'qk0Q5ZmS3au5RkcPuyotTjtg3G0b'

Now we run the following command to import the whole dataset into a variable. Note how we only import the fields which we are interested in for training purposes.

In [7]:
%%bigquery analytics_data_real
SELECT
    charityID,userID,timestamp
FROM `firebase_recommendations_dataset.donations_table`


In [8]:
analytics_data_real.head()

Unnamed: 0,charityID,userID,timestamp
0,3MWg9xpBnDeB1GPauOf57hl90SIy,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:33+00:00
1,JwY0MIYrcifUDJj35tO5JKedB8Nt,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:34+00:00
2,tjKvwntCUTmcxOsE3hAYsOc4pxMk,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:36+00:00
3,3MWg9xpBnDeB1GPauOf57hl90SIy,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:37+00:00
4,JwY0MIYrcifUDJj35tO5JKedB8Nt,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:37+00:00


# Preprocess the dataset

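The original codelab uses a lambda function at this point to extract the 'item_id' subfield from the items object. The donations table already stores charityID and userID as plain columns, so that extraction is left commented out and we use the imported dataframe as is.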

In [9]:
analytics = analytics_data_real
#def getMovieID(row):
#  items_obj = row['items'][0]
#  return items_obj['item_id']
#analytics['movieId'] = analytics.apply(lambda row: getMovieID(row), axis=1)
#analytics

Here is our processed dataframe containing only the data we want to use in training.

The data has the following properties:
*   charityID: string
*   userID: string
*   timestamp: Timestamp

In [10]:
analytics.values

array([['3MWg9xpBnDeB1GPauOf57hl90SIy', 'qk0Q5ZmS3au5RkcPuyotTjtg3G0b',
        Timestamp('2023-03-08 15:13:33+0000', tz='UTC')],
       ['JwY0MIYrcifUDJj35tO5JKedB8Nt', 'qk0Q5ZmS3au5RkcPuyotTjtg3G0b',
        Timestamp('2023-03-08 15:13:34+0000', tz='UTC')],
       ['tjKvwntCUTmcxOsE3hAYsOc4pxMk', 'qk0Q5ZmS3au5RkcPuyotTjtg3G0b',
        Timestamp('2023-03-08 15:13:36+0000', tz='UTC')],
       ['3MWg9xpBnDeB1GPauOf57hl90SIy', 'qk0Q5ZmS3au5RkcPuyotTjtg3G0b',
        Timestamp('2023-03-08 15:13:37+0000', tz='UTC')],
       ['JwY0MIYrcifUDJj35tO5JKedB8Nt', 'qk0Q5ZmS3au5RkcPuyotTjtg3G0b',
        Timestamp('2023-03-08 15:13:37+0000', tz='UTC')],
       ['tjKvwntCUTmcxOsE3hAYsOc4pxMk', 'qk0Q5ZmS3au5RkcPuyotTjtg3G0b',
        Timestamp('2023-03-08 15:13:37+0000', tz='UTC')],
       ['tjKvwntCUTmcxOsE3hAYsOc4pxMk', 'a1aX9SLLbe3Fksvqtv7YhRoqtaw9',
        Timestamp('2023-03-08 15:13:38+0000', tz='UTC')],
       ['9TPmjJvlASIKCgZ9NHBZP1jZEP3S', 'unFb4dqPjHNnUoGJhC9u7gBdb9YW',
        Timestamp(

In [11]:
analytics.sort_values(by=['timestamp'])

Unnamed: 0,charityID,userID,timestamp
0,3MWg9xpBnDeB1GPauOf57hl90SIy,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:33+00:00
1,JwY0MIYrcifUDJj35tO5JKedB8Nt,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:34+00:00
2,tjKvwntCUTmcxOsE3hAYsOc4pxMk,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:36+00:00
3,3MWg9xpBnDeB1GPauOf57hl90SIy,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:37+00:00
4,JwY0MIYrcifUDJj35tO5JKedB8Nt,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:37+00:00
5,tjKvwntCUTmcxOsE3hAYsOc4pxMk,qk0Q5ZmS3au5RkcPuyotTjtg3G0b,2023-03-08 15:13:37+00:00
6,tjKvwntCUTmcxOsE3hAYsOc4pxMk,a1aX9SLLbe3Fksvqtv7YhRoqtaw9,2023-03-08 15:13:38+00:00
7,9TPmjJvlASIKCgZ9NHBZP1jZEP3S,unFb4dqPjHNnUoGJhC9u7gBdb9YW,2023-03-08 15:13:39+00:00
8,9TPmjJvlASIKCgZ9NHBZP1jZEP3S,unFb4dqPjHNnUoGJhC9u7gBdb9YW,2023-03-08 15:13:40+00:00
9,b79lWABizCzlu2gUYvVsUCSApCD4,7dlnsoBWpyftpEJ6gQBkm86oxmI7,2023-03-08 15:13:42+00:00


## Encode user, charity IDs

Rows (users) and columns (charities) are ordered according to timestamp.

In [12]:
from sklearn.preprocessing import LabelEncoder
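
`LabelEncoder` maps each distinct ID string to an integer in sorted order, which is what the later cells rely on when they turn user and charity IDs into matrix indices. The following pure-Python sketch mimics that behavior for illustration only; `encode_ids` and `decode_ids` are hypothetical helper names, not scikit-learn functions, and the short IDs are made up.

```python
def encode_ids(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    codes = [code_of[value] for value in values]
    return codes, classes


def decode_ids(codes, classes):
    """Invert encode_ids: turn integer codes back into the original values."""
    return [classes[code] for code in codes]


# Shortened, made-up IDs for illustration only.
charity_ids = ["tjKv", "3MWg", "JwY0", "3MWg"]
codes, classes = encode_ids(charity_ids)
print(codes)    # [2, 0, 1, 0]
print(classes)  # ['3MWg', 'JwY0', 'tjKv']
```

Keeping `classes` around is what makes it possible to translate model output (integer indices) back into the original IDs.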


## Train a model

In [13]:
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)

In [19]:
import tensorflow as tf
import numpy as np
import keras
from keras import Input
from keras import Model
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import Concatenate
import torch
import tensorflow_recommenders as tfrs
from typing import Dict, Text
import pandas as pd

In [15]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import LabelEncoder

In [30]:
data = analytics.drop(columns=['timestamp']).drop_duplicates()
data

Unnamed: 0,charityID,userID
0,3MWg9xpBnDeB1GPauOf57hl90SIy,qk0Q5ZmS3au5RkcPuyotTjtg3G0b
1,JwY0MIYrcifUDJj35tO5JKedB8Nt,qk0Q5ZmS3au5RkcPuyotTjtg3G0b
2,tjKvwntCUTmcxOsE3hAYsOc4pxMk,qk0Q5ZmS3au5RkcPuyotTjtg3G0b
6,tjKvwntCUTmcxOsE3hAYsOc4pxMk,a1aX9SLLbe3Fksvqtv7YhRoqtaw9
7,9TPmjJvlASIKCgZ9NHBZP1jZEP3S,unFb4dqPjHNnUoGJhC9u7gBdb9YW
9,b79lWABizCzlu2gUYvVsUCSApCD4,7dlnsoBWpyftpEJ6gQBkm86oxmI7
12,9TPmjJvlASIKCgZ9NHBZP1jZEP3S,pL9XlHZqQpBNF1BrLfT7SAfBchQm
13,bWmuiL1Np1nat3k3BRFsrcRfBxIx,pL9XlHZqQpBNF1BrLfT7SAfBchQm
14,5gNFY4JG86pVLGu6X0vAQJUuJYHc,pL9XlHZqQpBNF1BrLfT7SAfBchQm
18,P0UKy85iinfvfJZJPMB4R5G024ID,hAdXSNr0fVNOavrzh2SKW7geqxGT


In [25]:
uniq = data.charityID.unique()
uniq = pd.DataFrame(uniq)

uniq.columns = ['charityID']
uniq

rat = data[['userID', 'charityID']]

dataset = tf.data.Dataset.from_tensor_slices(dict(data))
ratings = tf.data.Dataset.from_tensor_slices(dict(rat))

Charity = tf.data.Dataset.from_tensor_slices(dict(uniq))

ratings = ratings.map(lambda x: {
"userID": x["userID"],
"charityID": x["charityID"]
})

Charity = Charity.map(lambda x: x["charityID"])
ratings.take(1)

UserID_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(mask_token=None)
UserID_vocabulary.adapt(ratings.map(lambda x: x["userID"]))

Charity_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(mask_token=None)
Charity_vocabulary.adapt(Charity)

#Define a model
#We can define a TFRS model by inheriting from tfrs.Model and implementing the compute_loss method:
class CharityRecModel(tfrs.Model):
    def __init__(self, UserModel: tf.keras.Model, CharityModel: tf.keras.Model, task: tfrs.tasks.Retrieval):
        super().__init__()

        # Set up user and charity representations.
        self.UserModel = UserModel
        self.CharityModel = CharityModel

        # Set up a retrieval task.
        self.task = task
    
    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # Define how the loss is computed.
        UserEmbeddings = self.UserModel(features["userID"])
        CharityEmbeddings = self.CharityModel(features["charityID"])
        
        return self.task(UserEmbeddings, CharityEmbeddings)

In [26]:
UserModel = tf.keras.Sequential([
    UserID_vocabulary,
    tf.keras.layers.Embedding(UserID_vocabulary.vocabulary_size(), 64)
])

CharityModel = tf.keras.Sequential([
    Charity_vocabulary,
    tf.keras.layers.Embedding(Charity_vocabulary.vocabulary_size(), 64)
])

task = tfrs.tasks.Retrieval(metrics=tfrs.metrics.FactorizedTopK(
    Charity.batch(4).map(CharityModel))
)

In [27]:
model = CharityRecModel(UserModel, CharityModel, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))

# Train for 3 epochs.
model.fit(ratings.batch(4), epochs=3)
# Use brute-force search to set up retrieval using the trained representations.
index = tfrs.layers.factorized_top_k.BruteForce(model.UserModel)

index.index_from_dataset(Charity.batch(4).map(model.CharityModel))
users = data.userID.unique().tolist()

fcst = pd.DataFrame()

for x in users:
    # Use a separate name so the `Charity` dataset is not overwritten.
    _, top_ids = index(np.array([x]))
    fcst = pd.concat((fcst, pd.DataFrame(top_ids[0, :10].numpy()).transpose()))
    
fcst['User'] = users

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [28]:
fcst

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,User
0,0,1,13,12,4,7,14,5,6,11,qk0Q5ZmS3au5RkcPuyotTjtg3G0b
0,2,3,10,6,7,15,8,9,14,5,a1aX9SLLbe3Fksvqtv7YhRoqtaw9
0,3,6,14,12,2,5,15,0,8,1,unFb4dqPjHNnUoGJhC9u7gBdb9YW
0,4,7,8,11,9,15,10,1,13,0,7dlnsoBWpyftpEJ6gQBkm86oxmI7
0,6,5,12,3,14,13,2,0,1,9,pL9XlHZqQpBNF1BrLfT7SAfBchQm
0,7,12,4,9,0,1,2,3,15,14,hAdXSNr0fVNOavrzh2SKW7geqxGT
0,11,10,8,15,13,4,3,1,0,2,2jdmhYp8SmSIOpAhniXIzuYQ6uwZ
0,12,7,6,5,3,14,0,1,2,9,rDlVTz5Ec0tXI4WlNVz6Nh94jmvh
0,4,9,10,11,8,13,15,2,1,0,0PyuWHnQPqkbVPi5oRUnuRQQ8MXY
0,13,11,10,8,6,5,0,1,9,3,GoOo3nSmuBiT3hiPoBANg569IVTe


In [29]:
inputs = ['3MWg9xpBnDeB1GPauOf57hl90SIy', 'JwY0MIYrcifUDJj35tO5JKedB8Nt']

In [33]:
import tempfile

In [36]:
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model")
    tf.saved_model.save(index, path)
    loaded = tf.saved_model.load(path)
    scores, titles = loaded(['3MWg9xpBnDeB1GPauOf57hl90SIy', 'JwY0MIYrcifUDJj35tO5JKedB8Nt'])
    print(f"Recommendations: {titles[0][:3]}")
    print(loaded)



INFO:tensorflow:Assets written to: C:\Users\visio\AppData\Local\Temp\tmpvxwtzc7i\model\assets


INFO:tensorflow:Assets written to: C:\Users\visio\AppData\Local\Temp\tmpvxwtzc7i\model\assets


Recommendations: [15 11 13]
<tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x000002655FBBB520>


In [37]:
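# ScaNN variant of the index (requires the optional `scann` package).
# An earlier version of this cell used movie-based names (`movies`,
# `movie_model`, `user_model`) and failed with the AttributeError recorded below.
charity_ds = tf.data.Dataset.from_tensor_slices(uniq["charityID"].values)
scann_index = tfrs.layers.factorized_top_k.ScaNN(model.UserModel)
scann_index.index_from_dataset(
  tf.data.Dataset.zip((charity_ds.batch(100), charity_ds.batch(100).map(model.CharityModel)))
)

user_id = "qk0Q5ZmS3au5RkcPuyotTjtg3G0b"
_, titles = scann_index(tf.constant([user_id]))
print(f"Recommendations for user {user_id}: {titles[0, :3]}")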

AttributeError: 'CharityRecModel' object has no attribute 'user_model'

In [13]:
from scipy.sparse import csr_matrix

def GetItemItemSim(user_ids, charity_ids):
    CharityUserMatrix = csr_matrix(([1]*len(user_ids), (charity_ids, user_ids)))
    similarity = cosine_similarity(CharityUserMatrix)
    return similarity, CharityUserMatrix

In [14]:
def get_recommendations_from_similarity(similarity_matrix, CharityUserMatrix, top_n=5):
    CharityUserMatrix = csr_matrix(CharityUserMatrix.T)
    UserCharityScores = CharityUserMatrix.dot(similarity_matrix) # sum of similarities to all donated charities
    
    RecForUser = []
    
    for user_id in range(UserCharityScores.shape[0]):
        scores = UserCharityScores[user_id, :]
        
        donated_charities = CharityUserMatrix.indices[CharityUserMatrix.indptr[user_id]:
        CharityUserMatrix.indptr[user_id+1]]
        
        scores[donated_charities] = -1 # do not recommend already donated charities (or do?)
        
        top_charities_ids = np.argsort(scores)[-top_n:][::-1]
        recommendations = pd.DataFrame(
            top_charities_ids.reshape(1, -1),
            index=[user_id],
            columns=['Top%s' % (i+1) for i in range(top_n)])
        RecForUser.append(recommendations)
        
    return pd.concat(RecForUser)

In [15]:
def get_recommendations(donation_data):
    user_label_encoder = LabelEncoder()
    user_ids = user_label_encoder.fit_transform(donation_data.userID)
    
    charity_label_encoder = LabelEncoder()
    charity_ids = charity_label_encoder.fit_transform(donation_data.charityID)
    # compute recommendations
    similarity_matrix, CharityUserMatrix = GetItemItemSim(user_ids, charity_ids)
    recommendations = get_recommendations_from_similarity(similarity_matrix, CharityUserMatrix)
    recommendations.index = user_label_encoder.inverse_transform(recommendations.index)
    for i in range(recommendations.shape[1]):
        recommendations.iloc[:, i] = charity_label_encoder.inverse_transform(recommendations.iloc[:, i])
    return recommendations
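
To make the scoring in the three functions above concrete, here is a small pure-Python sketch of the same idea on a toy interaction set: compute the cosine similarity between charities from the users they share, then score each charity a user has not interacted with by summing its similarity to the charities the user did interact with. The data and helper names are made up for illustration, and the sketch uses sets and excludes known charities outright, whereas the notebook code uses sparse matrices and masks known charities with a score of -1.

```python
import math

# Toy data (hypothetical): charity -> set of users who donated to it.
donors = {
    "A": {"u1", "u2"},
    "B": {"u1", "u2", "u3"},
    "C": {"u3"},
}


def cosine(item_a, item_b):
    """Cosine similarity of two binary donor vectors."""
    shared = len(donors[item_a] & donors[item_b])
    return shared / math.sqrt(len(donors[item_a]) * len(donors[item_b]))


def recommend(user, top_n=2):
    """Rank charities the user has not donated to by summed similarity."""
    donated = {c for c, users in donors.items() if user in users}
    scores = {
        candidate: sum(cosine(candidate, c) for c in donated)
        for candidate in donors
        if candidate not in donated
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("u3"))  # ['A']
```

User `u3` donated to B and C, so the only candidate is A, whose score comes entirely from its overlap with B.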


In [22]:
recommendations = get_recommendations(analytics)


In [23]:
recommendations

Unnamed: 0,Top1,Top2,Top3,Top4,Top5
0PyuWHnQPqkbVPi5oRUnuRQQ8MXY,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,dnQuAUt0lpGpxZayXlwxj2Vw0lF2,EBMREkEGMiovkibTjYMHAjO2Mcny,tjKvwntCUTmcxOsE3hAYsOc4pxMk,i9F9dIW8V0suu1JN9w2xVOat04yn
2jdmhYp8SmSIOpAhniXIzuYQ6uwZ,tjKvwntCUTmcxOsE3hAYsOc4pxMk,i9F9dIW8V0suu1JN9w2xVOat04yn,bWmuiL1Np1nat3k3BRFsrcRfBxIx,b79lWABizCzlu2gUYvVsUCSApCD4,SrYqL0G0KEht16Q6iUqicpy2oLWI
7dlnsoBWpyftpEJ6gQBkm86oxmI7,tjKvwntCUTmcxOsE3hAYsOc4pxMk,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,i9F9dIW8V0suu1JN9w2xVOat04yn,dnQuAUt0lpGpxZayXlwxj2Vw0lF2,bWmuiL1Np1nat3k3BRFsrcRfBxIx
GoOo3nSmuBiT3hiPoBANg569IVTe,tjKvwntCUTmcxOsE3hAYsOc4pxMk,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,i9F9dIW8V0suu1JN9w2xVOat04yn,dnQuAUt0lpGpxZayXlwxj2Vw0lF2,bWmuiL1Np1nat3k3BRFsrcRfBxIx
IgTNwJa4DlZJdlTRauS8OLiPTdhy,tjKvwntCUTmcxOsE3hAYsOc4pxMk,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,i9F9dIW8V0suu1JN9w2xVOat04yn,dnQuAUt0lpGpxZayXlwxj2Vw0lF2,bWmuiL1Np1nat3k3BRFsrcRfBxIx
KQFfe5lN41XR3ltWHZWYbweXOiQs,tjKvwntCUTmcxOsE3hAYsOc4pxMk,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,dnQuAUt0lpGpxZayXlwxj2Vw0lF2,bWmuiL1Np1nat3k3BRFsrcRfBxIx,b79lWABizCzlu2gUYvVsUCSApCD4
a1aX9SLLbe3Fksvqtv7YhRoqtaw9,JwY0MIYrcifUDJj35tO5JKedB8Nt,3MWg9xpBnDeB1GPauOf57hl90SIy,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,i9F9dIW8V0suu1JN9w2xVOat04yn,dnQuAUt0lpGpxZayXlwxj2Vw0lF2
hAdXSNr0fVNOavrzh2SKW7geqxGT,tjKvwntCUTmcxOsE3hAYsOc4pxMk,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,i9F9dIW8V0suu1JN9w2xVOat04yn,dnQuAUt0lpGpxZayXlwxj2Vw0lF2,bWmuiL1Np1nat3k3BRFsrcRfBxIx
pL9XlHZqQpBNF1BrLfT7SAfBchQm,tjKvwntCUTmcxOsE3hAYsOc4pxMk,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,i9F9dIW8V0suu1JN9w2xVOat04yn,dnQuAUt0lpGpxZayXlwxj2Vw0lF2,b79lWABizCzlu2gUYvVsUCSApCD4
qk0Q5ZmS3au5RkcPuyotTjtg3G0b,nyhK1EZt98jO1Adr7Pb7ptEWrBRK,i9F9dIW8V0suu1JN9w2xVOat04yn,dnQuAUt0lpGpxZayXlwxj2Vw0lF2,bWmuiL1Np1nat3k3BRFsrcRfBxIx,b79lWABizCzlu2gUYvVsUCSApCD4


In [16]:
def ids_encoder(donations):
    users = sorted(donations['userID'].unique())
    items = sorted(donations['charityID'].unique())

    # create users and items encoders
    uencoder = LabelEncoder()
    iencoder = LabelEncoder()

    # fit users and items ids to the corresponding encoder
    uencoder.fit(users)
    iencoder.fit(items)

    # encode userids and itemids
    donations.userID = uencoder.transform(donations.userID.tolist())
    donations.charityID = iencoder.transform(donations.charityID.tolist())

    return donations, uencoder, iencoder

donations, uencoder, iencoder = ids_encoder(analytics)

In [17]:
donations = donations.drop(columns=['timestamp'])

In [18]:
donations = donations.drop_duplicates()
donations = donations.reset_index()
donations = donations.drop(columns=['index'])

In [20]:
donations = pd.concat([donations, pd.DataFrame([1] * donations.shape[0])], axis=1)

In [33]:
donations

Unnamed: 0,charityID,userID,0
0,0,9,1
1,5,9,1
2,15,9,1
3,15,6,1
4,2,11,1
5,10,2,1
6,2,8,1
7,11,8,1
8,1,8,1
9,6,7,1


In [17]:
from scipy.sparse import csr_matrix

In [23]:
donations_table = donations.pivot(index='charityID', columns='userID', values=0).fillna(0)
donations_table_sp = csr_matrix(donations_table.values)
donations_table

charityID \ userID,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
num_charities, num_users = donations_table.shape
print(num_users, num_charities)
X_train, y_train = [], []
for user in range(num_users):
    for charity in range(num_charities):
        if donations_table[user][charity] == 1:
            X_train.append([user, charity])
            y_train.append(1)

# Create model
embedding_size = 10
model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
# Train model
model.fit(np.array(X_train), np.array(y_train), epochs=10)

# Export model to TFLite format
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
open("model.tflite", "wb").write(tflite_model)

13 16
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: C:\Users\visio\AppData\Local\Temp\tmp1ln6jjol\assets


1692

In [26]:
charity_input = Input(shape=[1], name="Charities-Input")

charity_vec = Flatten(name="Flatten-Charities")(charity_input)

user_input = Input(shape=[1], name="User-Input")
user_vec = Flatten(name="Flatten-Users")(user_input)
# concatenate features
conc = Concatenate()([charity_vec, user_vec])
# add fully-connected-layers
fc1 = Dense(128, activation='relu')(conc)
fc2 = Dense(32, activation='relu')(fc1)
out = Dense(1)(fc2)
# Create model and compile it
model2 = Model([user_input, charity_input], out)
model2.compile('adam', 'mean_squared_error')
history = model2.fit([donations.userID, donations.charityID], donations[0], epochs=5, verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [27]:
#Make predictions
predictions = model2.predict([pd.DataFrame([6, 6]), pd.DataFrame([5, 4])])
[print(predictions[i], donations[0].iloc[i]) for i in range(0,len(predictions))]

[0.77993804] 1
[0.67318016] 1


[None, None]

In [29]:
brute_force = tfrs.layers.factorized_top_k.BruteForce(k=5)

In [37]:
donations = donations.drop(columns=[0])

In [41]:
print(donations)

    charityID  userID
0           0       9
1           5       9
2          15       9
3          15       6
4           2      11
5          10       2
6           2       8
7          11       8
8           1       8
9           6       7
10         14       1
11          9       1
12         12       1
13          4       1
14          3      10
15          9       0
16          8       3
17          7      12
18         13       5
19          6       4


## Sample model output

In [27]:
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Prepare input data
input_data = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # Example input data

# Set input tensor
input_details = interpreter.get_input_details()
input_data = np.array([input_data], dtype=np.int32)
interpreter.set_tensor(input_details[0]['index'], input_data)

# Run inference
interpreter.invoke()

# Get output tensor
output_details = interpreter.get_output_details()
output = interpreter.get_tensor(output_details[0]['index'])

# Print output
print(output)

ValueError: Cannot set tensor: Dimension mismatch. Got 16 but expected 2 for dimension 1 of input 0.
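
The error above suggests that the exported model expects one `[user, charity]` pair per row (shape `[1, 2]`), matching how `X_train` was built, rather than a 16-element vector. The sketch below shows a guard that shapes and casts the input from the interpreter's own input details before calling `set_tensor`. `prepare_input` is a hypothetical helper, and the `fake_detail` dictionary only imitates the `shape` and `dtype` entries of `interpreter.get_input_details()[0]`; the real dtype of this model is an assumption.

```python
import numpy as np


def prepare_input(values, input_detail):
    """Reshape and cast `values` to what a TFLite input tensor expects.

    `input_detail` is one entry of interpreter.get_input_details(); only its
    'shape' and 'dtype' keys are used here.
    """
    expected_shape = tuple(int(d) for d in input_detail["shape"])
    array = np.asarray(values, dtype=input_detail["dtype"])
    if array.size != int(np.prod(expected_shape)):
        raise ValueError(
            f"Got {array.size} values but the model expects shape {expected_shape}"
        )
    return array.reshape(expected_shape)


# Stand-in for interpreter.get_input_details()[0]; the dtype is assumed.
fake_detail = {"shape": np.array([1, 2]), "dtype": np.float32}
pair = prepare_input([9, 0], fake_detail)  # user 9, charity 0
print(pair.shape)  # (1, 2)
```

With a real interpreter, the call would look like `interpreter.set_tensor(input_details[0]['index'], prepare_input([user, charity], input_details[0]))`.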

In [None]:
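# Unused models, kept for reference. `fuzzy_matching` and `mapper` are not defined
# in this notebook, so `make_recommendation` cannot run as written.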
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=3, n_jobs=-1)
# fit the dataset
model_knn.fit(donations_table_sp)

def make_recommendation(model_knn, data, mapper, fav_movie, n_recommendations):
    
    # fit
    model_knn.fit(data)
    # get input movie index
    print('You have input movie:', fav_movie)
    idx = fuzzy_matching(mapper, fav_movie, verbose=True)
    
    print('Recommendation system start to make inference')
    print('......\n')
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations+1)
    
    raw_recommends = \
        sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
    # get reverse mapper
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations
    print('Recommendations for {}:'.format(fav_movie))
    for i, (idx, dist) in enumerate(raw_recommends):
        print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[idx], dist))
        
        
# make_recommendation(model_knn)  # would fail: data, mapper, fav_movie and n_recommendations are missing

## Sort and group training data to create training examples

Our analytics events need to be reorganized into the format required for the model training step. We create an object that maps each user_id to a list of charities that user has interacted with. We use the timestamp data to create the sequential context.

In [None]:
import collections

def convert_to_timelines(df):
  """Convert donation rows to per-user, time-ordered lists of charity IDs."""
  timelines = collections.defaultdict(list)
  charity_counts = collections.Counter()
  for charityID, userID, timestamp in df.values:
    # pandas Timestamps do not support int(); .value is nanoseconds since the epoch.
    timelines[userID].append([charityID, pd.Timestamp(timestamp).value])
    charity_counts[charityID] += 1
  # Sort per-user timeline by timestamp
  for (user_id, timeline) in timelines.items():
    timeline.sort(key=lambda x: x[1])
    timelines[user_id] = [charity_id for charity_id, _ in timeline]
  return timelines, charity_counts
timelines, counts = convert_to_timelines(analytics)

The timelines object contains a list of charity IDs keyed on user_id, indicating the sequence of charities that user has interacted with.
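
As a small illustration of that structure, the sketch below builds timelines from a few hand-written rows using only the standard library. The rows and the short IDs are made up, and timestamps are plain integers here rather than pandas `Timestamp` objects.

```python
import collections

# (charityID, userID, timestamp) rows; values are invented for illustration.
rows = [
    ("c2", "u1", 30),
    ("c1", "u1", 10),
    ("c3", "u2", 20),
    ("c1", "u1", 40),
]

events_by_user = collections.defaultdict(list)
for charity_id, user_id, timestamp in rows:
    events_by_user[user_id].append((timestamp, charity_id))

# Sort each user's events by time, then keep only the charity IDs.
toy_timelines = {
    user_id: [charity_id for _, charity_id in sorted(events)]
    for user_id, events in events_by_user.items()
}
print(toy_timelines)  # {'u1': ['c1', 'c2', 'c1'], 'u2': ['c3']}
```

A charity can appear more than once in a timeline, as `c1` does for `u1`, because each logged event is kept.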

In [None]:
import itertools

for key, val in sorted(timelines.items())[0:10]:
  print(key, val)

1 [3826, 307, 1590, 2478, 3698, 3020, 1449, 3424, 481, 1257, 2134, 1091, 1591, 3893, 2986, 2840]
2 [1962, 849, 2108, 2746, 1244, 2915, 1663, 2352, 170, 1235, 3363, 2707, 2243, 1296, 1186]
3 [1321, 3171, 1221, 960, 640, 828, 2028, 1645, 1985, 2024, 1825]
4 [2683, 2997, 786, 1527, 1923, 1584, 1610, 368, 2987, 1208, 1517, 104, 4306, 3793, 2174, 16, 1080, 8360, 6377, 5418, 7143, 7649, 150, 6679, 161, 5064, 4896, 5816, 6947, 5152, 6378, 6879, 5218, 1207, 1625, 2770, 2791, 3578, 380, 2918, 2763, 1, 4022, 367, 1356, 1291, 349, 163, 6807, 6537, 4344, 4343, 10, 3082, 3638, 3617, 1769, 2273, 1748, 1586, 327, 76, 193, 5954, 6264, 4052, 3388, 3190, 2353, 780, 590, 1270, 494, 4223, 647, 6539, 329, 4270, 2803, 3255, 3755, 3717, 2161, 2405, 1037, 361, 5944, 4958, 4621, 3997, 1606, 8810, 2959, 3977, 318, 593, 2115, 3948, 2762, 1682, 364, 4246, 7841, 1059, 5299, 4308, 5069, 7810, 941, 653, 1876, 3751, 1747, 2890, 2571, 3994, 186, 4890, 4975, 4701, 2993, 3113, 3825, 3298, 3300, 2723, 1805, 1687, 1385, 6

## Generate training examples

We use the timelines data to generate TensorFlow training examples. We discard any timeline with fewer than 3 items, and we use a context length of up to 100 items. We perform the following steps:

* Group item records by user, and order each user's records by timestamp.
* Generate TensorFlow examples with two features: "context", the time-ordered sequence of item IDs, and "label", the ID of the next item the user interacted with. The "max_history_length" parameter defines the shape of the "context" feature; if a user does not have enough history, the context is right-padded with the out-of-vocabulary ID 0.
* Partition the available data into a training set and a test set.

Sample generated training example with max user history as 10:
```
0 : {   # (tensorflow.Example)
  features: {   # (tensorflow.Features)
    feature: {
      key  : "context"
      value: {
        int64_list: {
          value: [ 595, 2687, 745, 588, 1, 2355, 2294, 783, 1566, 1907 ]
        }
      }
    }
    feature: {
      key  : "label"
      value: {
        int64_list: {
          value: [ 48 ]
        }
      }
    }
  }
}
```
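
The context/label construction described above can be sketched without TensorFlow. This is a simplified illustration, not the notebook's function: `context_label_pairs` is a hypothetical name, it returns plain Python lists instead of serialized `tf.train.Example` protos, and it does not split the result into training and test sets.

```python
def context_label_pairs(timeline, max_context_len, pad_id=0):
    """Return (context, label) pairs for next-item prediction.

    Each context holds up to `max_context_len` items preceding the label and
    is right-padded with `pad_id` when the history is shorter.
    """
    pairs = []
    for label_idx in range(1, len(timeline)):
        start_idx = max(0, label_idx - max_context_len)
        context = list(timeline[start_idx:label_idx])
        context += [pad_id] * (max_context_len - len(context))
        pairs.append((context, timeline[label_idx]))
    return pairs


for context, label in context_label_pairs([5, 7, 9, 4], max_context_len=3):
    print(context, "->", label)
# [5, 0, 0] -> 7
# [5, 7, 0] -> 9
# [5, 7, 9] -> 4
```

Note that with 0 reserved for padding, an item whose encoded ID is also 0 would be indistinguishable from padding.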

In [None]:
import tensorflow as tf

# used to pad when user doesn't have enough context
OOV_MOVIE_ID = 0

def generate_examples_from_timelines(timelines,
                                     min_timeline_len=3,
                                     max_context_len=100):
  """Convert user timelines to tf examples.

  Convert user timelines to tf examples by adding all possible context-label
  pairs in the examples pool.

  Args:
    timelines: the user timelines to process.
    min_timeline_len: minimum length of the user timeline.
    max_context_len: maximum length of context signals.

  Returns:
    train_examples: tf example list for training.
    test_examples: tf example list for testing.
  """
  train_examples = []
  test_examples = []
  for timeline in timelines.values():
    # Skip if timeline is shorter than min_timeline_len.
    if len(timeline) < min_timeline_len:
      continue
    for label_idx in range(1, len(timeline)):
      start_idx = max(0, label_idx - max_context_len)
      context = timeline[start_idx:label_idx]
      # Pad context with out-of-vocab movie id 0.
      while len(context) < max_context_len:
        context.append(OOV_MOVIE_ID)
      label = timeline[label_idx]
      feature = {
          "context":
              tf.train.Feature(int64_list=tf.train.Int64List(value=context)),
          "label":
              tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
      }
      tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
      if label_idx == len(timeline) - 1:
        test_examples.append(tf_example.SerializeToString())
      else:
        train_examples.append(tf_example.SerializeToString())
  return train_examples, test_examples



In [None]:
train_examples, test_examples = generate_examples_from_timelines(timelines)

Write examples to tfrecords, to be loaded in the model training step.

In [None]:
def write_tfrecords(tf_examples, filename):
  """Write tf examples to tfrecord file."""
  with tf.io.TFRecordWriter(filename) as file_writer:
    for example in tf_examples:
      file_writer.write(example)

output_dir = ''
OUTPUT_TRAINING_DATA_FILENAME = "train_movielens_1m.tfrecord"
OUTPUT_TESTING_DATA_FILENAME = "test_movielens_1m.tfrecord"
print(test_examples)
if output_dir and not tf.io.gfile.exists(output_dir):
  tf.io.gfile.makedirs(output_dir)
write_tfrecords(
    tf_examples=train_examples,
    filename=os.path.join(output_dir, OUTPUT_TRAINING_DATA_FILENAME))
write_tfrecords(
    tf_examples=test_examples,
    filename=os.path.join(output_dir, OUTPUT_TESTING_DATA_FILENAME))




In [None]:
!pwd

# Train model

The training launcher script uses the TensorFlow Keras compile/fit APIs and performs
the following steps to start the training and evaluation process:

*   Set up the train and eval dataset input functions.
*   Construct the Keras model according to the provided config. Refer to the sample.config file in the source code to configure your model architecture, such as embedding dimension, convolutional neural network params, LSTM units, etc.
*   Set up the loss function. This code base uses a customized batch softmax loss function.
*   Set up the optimizer, with a flag-specified learning rate and gradient clipping if needed.
*   Set up evaluation metrics. Recall@k metrics are provided by default.
*   Compile the model with the loss function, optimizer, and defined metrics.
*   Set up callbacks for TensorBoard and the checkpoint manager.
*   Run model.fit with the compiled model, where you can specify the number of epochs to train, the number of train steps in each epoch, and the number of eval steps in each epoch.
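
For reference, recall@k for next-item prediction is commonly computed as the fraction of test examples whose true label appears among the model's top k predictions. The sketch below shows that definition in plain Python. It is an illustration under that assumption, not the metric implementation used by the training script, and `recall_at_k` is a hypothetical name.

```python
def recall_at_k(ranked_predictions, labels, k):
    """Fraction of examples whose label is among the first k predictions."""
    if len(ranked_predictions) != len(labels):
        raise ValueError("Need one ranked prediction list per label")
    if not labels:
        return 0.0
    hits = sum(
        1 for ranked, label in zip(ranked_predictions, labels)
        if label in ranked[:k]
    )
    return hits / len(labels)


ranked = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
labels = [1, 1, 3]
print(recall_at_k(ranked, labels, k=1))  # 0.0
print(recall_at_k(ranked, labels, k=2))  # 0.3333333333333333
```

Recall@k can only grow as k grows, which is why it is usually reported at several cutoffs.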

## Model training parameters

### Encoder type

You can train the model using three different encoder types: a convolutional neural net (cnn), a recurrent neural net (rnn), or a bag of words (bow). You can select between them by passing **cnn**, **rnn**, or **bow** to the **--encoder_type** parameter. Different encoders have strengths and weaknesses depending on the input/output characteristics of your dataset.

For example: If the input context (here, the user history length) is long, cnn and rnn would be more suitable as they have better summarization ability with longer user histories.

### Training time / size

Another consideration is training time. RNN generally requires the longest training time, followed by CNN, and finally bag of words with the shortest. Bag of words also produces a smaller model if size is a consideration.

To start training, execute the following command. Please note that we are using a very small number of epochs (**num_epochs** parameter below) of 10 to speed up training time at the expense of model quality. Generating a high quality model often requires a much higher number. For this model, setting num_epochs to at least 100 should provide a model of sufficient quality. 


In [None]:
tf.__version__

In [None]:
!git clone https://github.com/niap123/ml

In [None]:
!cd ml/lite/examples/recommendation/ml && pip install -r requirements.txt

In [None]:
!pwd

In [None]:
%cd ml

In [None]:
!python -m model.recommendation_model_launcher_keras \
  --run_mode "train_and_eval" \
  --encoder_type "cnn" \
  --training_data_filepattern "data/examples/train_movielens_1m.tfrecord" \
  --testing_data_filepattern "data/examples/test_movielens_1m.tfrecord" \
  --model_dir "model/model_dir" \
  --params_path "model/sample_config.json" \
  --batch_size 64 \
  --learning_rate 0.1 \
  --steps_per_epoch 1000 \
  --num_epochs 10 \
  --num_eval_steps 1000 \
  --gradient_clip_norm 1.0 \
  --max_history_length 10
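
When experimenting with encoder types or epoch counts, it can help to keep the flags in one place. The sketch below assembles the training command from a dictionary and rejects an unknown encoder type before anything runs. The helper `build_train_command` and the `TRAIN_FLAGS` dictionary are hypothetical names introduced for illustration; the flag names and values are copied from the command above.

```python
import shlex

# Encoder types named in this notebook.
VALID_ENCODERS = ("cnn", "rnn", "bow")

# Flag values copied from the training command above.
TRAIN_FLAGS = {
    "run_mode": "train_and_eval",
    "encoder_type": "cnn",
    "training_data_filepattern": "data/examples/train_movielens_1m.tfrecord",
    "testing_data_filepattern": "data/examples/test_movielens_1m.tfrecord",
    "model_dir": "model/model_dir",
    "params_path": "model/sample_config.json",
    "batch_size": 64,
    "learning_rate": 0.1,
    "steps_per_epoch": 1000,
    "num_epochs": 10,
    "num_eval_steps": 1000,
    "gradient_clip_norm": 1.0,
    "max_history_length": 10,
}


def build_train_command(flags):
    """Hypothetical helper: turn a flag dict into a shell command string."""
    if flags["encoder_type"] not in VALID_ENCODERS:
        raise ValueError(f"encoder_type must be one of {VALID_ENCODERS}")
    parts = ["python", "-m", "model.recommendation_model_launcher_keras"]
    for name, value in flags.items():
        parts += [f"--{name}", str(value)]
    return shlex.join(parts)


# Override only the flags that change between experiments.
command = build_train_command({**TRAIN_FLAGS, "encoder_type": "bow", "num_epochs": 100})
print(command)
```

In a notebook cell, the resulting string can be run with `!{command}`. This is one possible way to organize the flags, not how the launcher itself expects to be called.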

# Export model

Now we export the trained model to a TensorFlow Lite (.tflite) file suitable for on-device inference on mobile devices.
Note that we use the latest checkpoint, number 10000, in **checkpoint_path**. This number is num_epochs (10) x steps_per_epoch (1000). If you change either parameter in the previous training step, update checkpoint_path so that it still points to the latest checkpoint.

In [None]:
!python -m model.recommendation_model_launcher_keras \
  --run_mode "export" \
  --encoder_type "cnn" \
  --params_path "model/sample_config.json" \
  --model_dir "model/model_dir" \
  --checkpoint_path "model/model_dir/ckpt-10000" \
  --num_predictions 100
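
The checkpoint number in the export command follows from the training parameters, so it can be computed rather than typed by hand. The sketch below is a minimal illustration of that arithmetic. The helper name `latest_checkpoint_path` is hypothetical, and it assumes the final checkpoint is named `ckpt-<total steps>`, as in the command above.

```python
def latest_checkpoint_path(model_dir, num_epochs, steps_per_epoch):
    """Hypothetical helper: path of the final checkpoint, ckpt-<total steps>."""
    # Assumption: one checkpoint is saved at the last training step.
    total_steps = num_epochs * steps_per_epoch
    return f"{model_dir}/ckpt-{total_steps}"


# Values used in this notebook: 10 epochs x 1000 steps per epoch.
print(latest_checkpoint_path("model/model_dir", 10, 1000))

# The longer run suggested in the training step: 100 epochs x 1000 steps per epoch.
print(latest_checkpoint_path("model/model_dir", 100, 1000))
```

If the launcher writes a standard TensorFlow checkpoint state file, `tf.train.latest_checkpoint("model/model_dir")` may return the same path without any arithmetic; that is an assumption about the launcher, so check the contents of the model directory first.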

# Model inference (Optional)

You can check the exported model by running inference on example input.

In [None]:
import tensorflow as tf
import os

# Example input representing 10 movies that the user interacted with.
context = tf.constant([1196, 1210, 2628, 260, 480, 2571, 589, 1240, 1, 10])

# Path to the exported TensorFlow Lite model.
export_dir = "model/model_dir/export"
tflite_model_path = os.path.join(export_dir, 'model.tflite')

# Load the model and allocate its tensors.
with open(tflite_model_path, 'rb') as f:
    interpreter = tf.lite.Interpreter(model_content=f.read())
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(input_details)
print(output_details)

# Run inference, then read the top predicted IDs and their scores.
interpreter.set_tensor(input_details[0]['index'], context)
interpreter.invoke()
tflite_top_predictions_ids = interpreter.get_tensor(
    output_details[0]['index'])
tflite_top_prediction_scores = interpreter.get_tensor(
    output_details[1]['index'])
print("results >>>>>")
print("input >>>>>")
print(input_details[0])
print("output >>>>>")
print(tflite_top_predictions_ids)
print(tflite_top_prediction_scores)
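
The inference cell reads both the predicted IDs and their scores, and it is often easier to inspect them side by side. The sketch below pairs each ID with its score and keeps the highest-scoring entries. The helper `pair_predictions` is a hypothetical name, and the example values are stand-ins for illustration only, not real model output.

```python
def pair_predictions(ids, scores, k=5):
    """Hypothetical helper: return the k highest-scoring (id, score) pairs."""
    # Sort explicitly by score in case the inputs are not already ordered.
    pairs = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# Stand-in values for illustration only; not real model output.
example_ids = [260, 1196, 1210, 2571]
example_scores = [0.4, 0.9, 0.1, 0.7]
print(pair_predictions(example_ids, example_scores, k=2))
```

With the interpreter outputs from the cell above, one option is to pass `tflite_top_predictions_ids.flatten().tolist()` and `tflite_top_prediction_scores.flatten().tolist()`. This assumes that output 0 holds the IDs and output 1 holds the scores, as that cell does.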

# Deploy model to the Firebase Console

We now deploy the model to the Firebase Console. From there, it can be automatically downloaded to your users' devices with Firebase ML.

Step 1. Initialize Firebase App Instance

In [None]:
import firebase_admin

firebase_admin.initialize_app(options={'projectId': projectID, 
             'storageBucket': projectID + '.appspot.com' })

Step 2. Upload the model file to Cloud Storage

In [None]:
from firebase_admin import ml

# This converts the SavedModel in export_dir to TensorFlow Lite
# and uploads it to your bucket as model.tflite
source = ml.TFLiteGCSModelSource.from_saved_model(export_dir, 'model.tflite')
print(source.gcs_tflite_uri)

Step 3. Deploy the model to Firebase

In [None]:
# Create a Model Format
model_format = ml.TFLiteFormat(model_source=source)

# Create a Model object
sdk_model_1 = ml.Model(display_name="recommendations", model_format=model_format)

# Make the Create API call to create the model in Firebase
firebase_model_1 = ml.create_model(sdk_model_1)
print(firebase_model_1.as_dict())

# Publish the model
model_id = firebase_model_1.model_id
firebase_model_1 = ml.publish_model(model_id)

# Return to the Firebase Console
At this point, we have deployed the trained model to the Firebase console. You can go to Develop > Machine Learning > Custom to check it out!

Note that for the purposes of this codelab, to keep training time short, we intentionally chose suboptimal training parameters (as described in the model training step above) that sacrifice model quality. To get better results, please use the pre-trained model included in the GitHub code repo [here](https://github.com/FirebaseExtended/codelab-contentrecommendation-android/blob/master/recommendation_cnn_i10o100.tflite).
To replace the model we just published:
1. In the Firebase console, go to Develop > Machine Learning > Custom
1. Select the settings dropdown under the model named "recommendations"
1. Choose "Replace model" and upload the model file from the GitHub repo.

Finally, please return to the [codelab](https://codelabs.developers.google.com/codelabs/contentrecommendation-android) and complete the last steps to see the app in action!