In [1]:
import tensorflow as tf
import numpy as np
import timeit as t

# Matrix Factorization

# Theory

## Mathematical Problem

We have a sparse matrix $A$ of size $m\times n$: a table of **interactions** between $m$ **users** and $n$ **items**. We describe both the **users** and the **items** in terms of the same $k$ **latent features**, so that each entry of $A$ is explained by how well a user's features match an item's features.

Let **users** be described by an $m \times k$ matrix $U$. Let **items** be described by an $n \times k$ matrix $V$. We call $U$ and $V$ the user/item **embeddings**. It is our job to find $U$, $V$ such that:

$$\left|\left(A - U\cdot V^{T}\right)_{ij}\right| < \varepsilon \quad \forall i,j : A_{ij} \neq 0$$

where $\varepsilon$ is a small positive tolerance.

Thus, our **goal** is to **train these embeddings**, $U$, $V$, in order to reliably predict the *unobserved interactions* between **user** and **item**.
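As a quick illustration of the shapes involved, here is a minimal NumPy sketch (NumPy is used instead of TensorFlow only to keep the example light; the toy sizes and the random seed are assumptions, not values from this notebook). It shows that entry $(i,j)$ of $U\cdot V^{T}$ is the dot product of user $i$'s and item $j$'s feature vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, k = 4, 3, 2                 # users, items, latent features (toy sizes)
U = rng.uniform(size=(m, k))      # user embeddings, one row per user
V = rng.uniform(size=(n, k))      # item embeddings, one row per item

A_hat = U @ V.T                   # predicted interaction table, shape (m, n)

# Entry (i, j) is the dot product of user i's and item j's feature vectors.
i, j = 2, 1
assert np.isclose(A_hat[i, j], U[i] @ V[j])
print(A_hat.shape)
```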

## Practical/Business Problem

The practical problem is this: given some interaction history between users and items, we want to predict the interactions that have not been observed yet, using the learned user and item embeddings.
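One way such predictions might be used is to rank, for a given user, the items they have not interacted with yet. The sketch below is only an illustration: the function name `recommend`, its `top_n` parameter, and the hand-picked toy embeddings are all hypothetical and do not come from this notebook.

```python
import numpy as np

def recommend(A, U, V, user, top_n=2):
    """Rank the items a user has not interacted with by predicted score."""
    scores = U[user] @ V.T                             # predicted score for every item
    scores = np.where(A[user] != 0, -np.inf, scores)   # hide already-seen items
    order = np.argsort(-scores)                        # best score first
    n_unseen = int(np.sum(A[user] == 0))
    return order[:min(top_n, n_unseen)].tolist()

# Toy data: 2 users, 4 items, 2 latent features (hand-picked, not trained).
A = np.array([[5, 0, 2, 0],
              [0, 3, 0, 1]], dtype=float)
U = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V = np.array([[2.0, 0.1],
              [0.5, 1.5],
              [1.0, 0.2],
              [0.9, 0.4]])

print(recommend(A, U, V, user=0))
```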

# Implementation

## Choice of Loss Function

We must compute the loss here in an appropriate manner so that we may quantify the extent to which our prediction and real interactions disagree.

An obvious choice is the MSE loss over all of our nonzero entries.

Above, we stated that our objective is:

$$\left|\left(A - UV^{T}\right)_{ij}\right| < \varepsilon \quad \forall i,j : A_{ij} \neq 0$$

### 1. Worst Implementation

Therefore, in **this naive implementation, our loss would be**:

$$L = \left(A - UV^{T}\right)_{ij} \quad \forall ij : A_{ij} \neq 0$$

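However, this signed residual has no lower bound: it can be made arbitrarily negative simply by increasing $UV^{T}$, so minimizing it does not drive the prediction towards $A$. At best, it behaves sensibly only under restrictive conditions such as **positive interactions** $A$ and **positive initializations** $U$ and $V$.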
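A small numerical check makes the problem with the signed residual concrete. This is a sketch on a made-up $2 \times 2$ table (the values and the helper name `signed_loss` are assumptions for illustration): scaling the embeddings up keeps lowering the summed residual even though the predictions move further from $A$.

```python
import numpy as np

A = np.array([[5.0, 0.0],
              [0.0, 3.0]])
mask = A != 0

def signed_loss(A, U, V):
    """Sum of raw (unsquared) residuals over the observed entries."""
    return np.sum(np.where(mask, A - U @ V.T, 0.0))

U = np.ones((2, 1))
V = np.ones((2, 1))

# Larger embeddings give a lower "loss" without limit.
for scale in (1.0, 10.0, 100.0):
    print(scale, signed_loss(A, scale * U, scale * V))
```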

### 2. Naive Implementation

If we want to guarantee convergence while retaining the differentiability of our loss function, we should probably do this:

$$L_{ij} = \left(A - UV^{T}\right)^{2}_{ij} \quad \forall ij : A_{ij} \neq 0$$

We can easily implement this and derive the update equations to train it. The issue is that it is **prone to overfitting**: in the presence of extreme values, the predictions for entries neighboring an extreme interaction tend to drift towards that extreme value rather than capturing the true nature of the interaction based on the features.

### 3. Better Implementation

Therefore, we should weight this in order to control this contribution!

$$L_{ij} = \lambda \left(A - U\cdot V^{T}\right)^{2}_{ij} \quad \forall ij : A_{ij} \neq 0$$

However, there is one improvement we could make here. This is still not the most accurate loss possible because **we are assuming that the observed interactions completely determine the unobserved interactions**. In order to control for this, what do we do?

### 4. Best Implementation

We can **weight the unobserved entries**.

$$L = \begin{cases} 
\lambda_{1}\left(A - U\cdot V^{T}\right)^{2}_{ij} & \forall ij : A_{ij} \neq 0\\[5pt]
\lambda_{2}\left(U\cdot V^{T}\right)^{2}_{ij} & \forall ij : A_{ij} = 0
\end{cases}$$

This choice of $L$ takes contributions of both our observed interactions and unobserved interactions, while having the hyperparameters to control which contribution is more impactful.
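To make the two cases concrete, here is a minimal NumPy sketch of this loss on a tiny hand-made example (the function name `weighted_loss` and the toy values are assumptions for illustration; the TensorFlow version used for training appears later in the notebook).

```python
import numpy as np

def weighted_loss(A, U, V, lam1, lam2):
    """lam1 weights the observed entries, lam2 weights the unobserved ones."""
    pred = U @ V.T
    observed = A != 0
    per_entry = np.where(observed, lam1 * (A - pred) ** 2, lam2 * pred ** 2)
    return per_entry.sum()

A = np.array([[5.0, 0.0],
              [0.0, 1.0]])
U = np.array([[1.0], [1.0]])
V = np.array([[2.0], [1.0]])    # predictions U V^T = [[2, 1], [2, 1]]

# observed part:   (5 - 2)^2 + (1 - 1)^2 = 9
# unobserved part: 1^2 + 2^2 = 5
print(weighted_loss(A, U, V, lam1=0.01, lam2=0.001))
```

With `lam2=0.0` the unobserved part drops out and only the observed squared errors of method 3 remain.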

In practice, we will implement the 3rd method first, and then move on to the 4th (best) one.

## Initialize Embeddings

We will initialize our user and item embeddings here.

In [2]:
# our original matrix of observed interactions
A = tf.constant([[5,0,2,0,0], [0,0,3,1,0], [0,0,0,1,5], [2,0,0,0,1], [0,0,0,0,5]], dtype=tf.float32)

# our embeddings
U, V = tf.Variable(tf.random.uniform(shape=(5,3))), tf.Variable(tf.random.uniform(shape=(5,3)))

## Implement Automatic Differentiation for our Loss

In [3]:
# lambda_1 = tf.constant(0.01)
# with tf.GradientTape(persistent=True) as tape:
#     # initialize our loss in the scope of the gradient tape
#     loss_fn = tf.multiply(lambda_1, tf.pow(tf.where(A != 0, A - tf.matmul(U, tf.transpose(V)), 0.0), 2))

# # backpropagate to compute gradient
# dloss_dU = tape.gradient(loss_fn, U)
# dloss_dV = tape.gradient(loss_fn, V)

# dloss_dU.numpy(), dloss_dV.numpy()
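
For reference, the gradients that the tape returns for this loss can also be written in closed form. The notation here is an assumption added for illustration: $M$ is the 0/1 mask of observed entries ($M_{ij} = 1$ when $A_{ij} \neq 0$) and $\odot$ is the elementwise product. Because `tape.gradient` applied to a non-scalar target differentiates the sum of its elements, the quantity being differentiated is $L = \lambda_{1}\sum_{ij} M_{ij}\left(A - U\cdot V^{T}\right)^{2}_{ij}$, which gives:

$$R = M \odot \left(A - U\cdot V^{T}\right), \qquad \frac{\partial L}{\partial U} = -2\lambda_{1}\, R\, V, \qquad \frac{\partial L}{\partial V} = -2\lambda_{1}\, R^{T} U$$

Row $i$ of $\partial L / \partial U$ depends only on the residuals in row $i$ of $A$ (and row $j$ of $\partial L / \partial V$ only on column $j$), so a user or item with no observed interactions receives a zero gradient under this loss.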

## Implement Gradient Descent

We will use an existing optimizer from Keras, namely Adam (a gradient descent variant with adaptive step sizes).

In [4]:
learning_rate=1e-1

optimizer = tf.keras.optimizers.Adam(learning_rate)

In [5]:
U.numpy(), V.numpy()

(array([[0.08122253, 0.757429  , 0.25588763],
        [0.84419334, 0.01586115, 0.7617438 ],
        [0.20530128, 0.848701  , 0.8023819 ],
        [0.21759665, 0.18440318, 0.46592808],
        [0.31103742, 0.07522655, 0.5374615 ]], dtype=float32),
 array([[0.22551465, 0.22709131, 0.7178328 ],
        [0.68392634, 0.35441232, 0.99062216],
        [0.82800114, 0.6106597 , 0.10556054],
        [0.88696134, 0.5254947 , 0.3948456 ],
        [0.05091584, 0.41112316, 0.65729904]], dtype=float32))

In [6]:
tf.matmul(U, tf.transpose(V))

<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[0.37400696, 0.5774803 , 0.55679536, 0.57110226, 0.4837268 ],
       [0.7407846 , 1.3375877 , 0.78908885, 1.057873  , 0.5501972 ],
       [0.8150071 , 1.2360584 , 0.7729571 , 0.94489914, 0.88677853],
       [0.42540607, 0.6757335 , 0.34196147, 0.47387233, 0.39314562],
       [0.4730343 , 0.7718092 , 0.3602119 , 0.5276236 , 0.40003705]],
      dtype=float32)>

In [7]:
lambda_1 = tf.constant(0.01)
with tf.GradientTape(persistent=True) as tape:
    # initialize our loss in the scope of the gradient tape
    loss_fn = tf.multiply(lambda_1, tf.pow(tf.where(A != 0, A - tf.matmul(U, tf.transpose(V)), 0.0), 2))

# backpropagate to compute gradient
dloss_dU, dloss_dV = tape.gradient(loss_fn, U), tape.gradient(loss_fn, V)

li_grads = [dloss_dU, dloss_dV]
li_vars = [U, V]

# now do the optimization
optimizer.apply_gradients(zip(li_grads, li_vars))

U.numpy(), V.numpy()

(array([[0.18121548, 0.8574208 , 0.3558831 ],
        [0.9441845 , 0.11584917, 0.86166877],
        [0.3052401 , 0.9486918 , 0.90237606],
        [0.31755573, 0.28437713, 0.56591773],
        [0.41096997, 0.1752182 , 0.6374563 ]], dtype=float32),
 array([[0.32549265, 0.32708716, 0.81782454],
        [0.68392634, 0.35441232, 0.99062216],
        [0.9279932 , 0.7106457 , 0.20555285],
        [0.7873807 , 0.625151  , 0.43949082],
        [0.15090927, 0.5111191 , 0.75729644]], dtype=float32))

In [8]:
tf.matmul(U, tf.transpose(V))

<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[0.63048553, 0.7803642 , 0.85064185, 0.8351104 , 0.73510027],
       [1.0499117 , 1.5403992 , 1.1356429 , 1.1945513 , 0.8542376 ],
       [1.1476436 , 1.4389035 , 1.1429304 , 1.2300017 , 1.2143242 ],
       [0.6591996 , 0.8785821 , 0.6131069 , 0.67653155, 0.6218402 ],
       [0.71240675, 0.974651  , 0.63692635, 0.71328384, 0.63431996]],
      dtype=float32)>

In [9]:
A

<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[5., 0., 2., 0., 0.],
       [0., 0., 3., 1., 0.],
       [0., 0., 0., 1., 5.],
       [2., 0., 0., 0., 1.],
       [0., 0., 0., 0., 5.]], dtype=float32)>

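So our optimizer updated the embeddings. This was a single step, so the predictions are still far from $A$, but every entry with a nonzero gradient moved by about $0.1$ (the learning rate). The second row of $V$ did not change: item 2 has no observed interactions, so its gradient under this loss is zero.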
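The size of this first update can be checked against the textbook Adam rule. The sketch below is an assumption-laden illustration, not the Keras source: it uses the standard bias-corrected formulas with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-7}$, and the gradient values are made up. On the very first step the update reduces to roughly the learning rate times the sign of the gradient, whatever the gradient's magnitude.

```python
import numpy as np

def adam_first_step(grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7):
    """Parameter change produced by textbook Adam on its first step (t = 1)."""
    m = (1 - beta1) * grad            # first-moment estimate
    v = (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1)           # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

grad = np.array([-3.2, 0.004, 0.0])   # large, tiny, and zero gradient entries
step = adam_first_step(grad)
print(step)
```

Entries with a negative gradient move up by about `lr`, entries with a positive gradient move down by about `lr`, and entries with zero gradient stay put.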

## Implement Training Loop (Loss 3)

We will now implement the training loop (i.e. the gradient descent for multiple epochs).

In [10]:
# our original matrix of observed interactions
A = tf.constant([[5,0,2,0,0], [0,0,3,1,0], [0,0,0,1,5], [2,0,0,0,1], [0,0,0,0,5]], dtype=tf.float32)

# our embeddings
U_1, V_1 = tf.Variable(tf.random.uniform(shape=(5,3))), tf.Variable(tf.random.uniform(shape=(5,3)))

lambda_1 = tf.constant(0.01)
optimizer_1 = tf.keras.optimizers.Adam(1e-3)

In [11]:
U_1, V_1

(<tf.Variable 'Variable:0' shape=(5, 3) dtype=float32, numpy=
 array([[0.07933831, 0.59718215, 0.26859355],
        [0.7451229 , 0.8870889 , 0.61328113],
        [0.1465751 , 0.05487955, 0.45774043],
        [0.94897795, 0.05587411, 0.11407959],
        [0.3906908 , 0.8970146 , 0.7635932 ]], dtype=float32)>,
 <tf.Variable 'Variable:0' shape=(5, 3) dtype=float32, numpy=
 array([[0.47210371, 0.05594623, 0.10557187],
        [0.82508147, 0.70971763, 0.92652845],
        [0.507422  , 0.2751348 , 0.01749194],
        [0.6188357 , 0.08820915, 0.30138838],
        [0.5684279 , 0.8174298 , 0.4409219 ]], dtype=float32)>)

In [12]:
# function to implement the optimization of loss 3

def train_loss3(U, V, A, epochs, optimizer, lambda_1):
    total_time = 0
    # run training loop for multiple epochs
    for epoch in range(epochs):
        start = t.default_timer()
        with tf.GradientTape(persistent=True) as tape:
            # initialize our loss in the scope of the gradient tape
            loss_fn = tf.multiply(lambda_1, tf.pow(tf.where(A != 0, A - tf.matmul(U, tf.transpose(V)), 0.0), 2))

        # backpropagate to compute gradient
        dloss_dU, dloss_dV = tape.gradient(loss_fn, U), tape.gradient(loss_fn, V)

        # get arrays of variables and gradients
        li_grads = [dloss_dU, dloss_dV]
        li_vars = [U, V]

        # now do the optimization
        optimizer.apply_gradients(zip(li_grads, li_vars))
        
        # track runtime and total loss
        end = t.default_timer()
        
        total_time += (end - start)
        # print an update every 50 epochs
        if (epoch + 1) % 50 == 0:
            print(f'Loss {tf.reduce_sum(loss_fn)} | Total Runtime {total_time} sec.')

In [13]:
epochs = 250

train_loss3(U_1, V_1, A, epochs, optimizer_1, lambda_1)

Loss 0.6647934913635254 | Total Runtime 0.3557698000000009 sec.
Loss 0.6099028587341309 | Total Runtime 0.6507769000000039 sec.
Loss 0.552852213382721 | Total Runtime 0.9248403999999972 sec.
Loss 0.4948466122150421 | Total Runtime 1.2308782999999952 sec.
Loss 0.43711239099502563 | Total Runtime 1.5156865999999933 sec.
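The same kind of loop can be sketched without TensorFlow by using the closed-form gradient of loss 3: with $R$ the residual $A - U\cdot V^{T}$ masked to the observed entries, $\partial L/\partial U = -2\lambda_1 R V$ and $\partial L/\partial V = -2\lambda_1 R^{T} U$. Note the assumptions: this version uses plain gradient descent rather than Adam, and the step size `lr` and the random seed are hypothetical choices, so its numbers will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[5, 0, 2, 0, 0],
              [0, 0, 3, 1, 0],
              [0, 0, 0, 1, 5],
              [2, 0, 0, 0, 1],
              [0, 0, 0, 0, 5]], dtype=float)
mask = A != 0
U = rng.uniform(size=(5, 3))
V = rng.uniform(size=(5, 3))
lam, lr = 0.01, 0.1     # lr: hypothetical step size for plain gradient descent

def loss3(U, V):
    """Weighted squared error summed over the observed entries."""
    return lam * np.sum(np.where(mask, A - U @ V.T, 0.0) ** 2)

history = [loss3(U, V)]
for epoch in range(250):
    R = np.where(mask, A - U @ V.T, 0.0)       # residuals on observed entries
    grad_U = -2 * lam * R @ V
    grad_V = -2 * lam * R.T @ U
    U, V = U - lr * grad_U, V - lr * grad_V    # simultaneous update
    history.append(loss3(U, V))

print(history[0], history[-1])
```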


## Implement Training Loops (Loss 4)

Loss 4 is method 4 from above: it adds an extra term that penalizes nonzero predictions on the unobserved entries.

In [14]:
# our original matrix of observed interactions
A = tf.constant([[5,0,2,0,0], [0,0,3,1,0], [0,0,0,1,5], [2,0,0,0,1], [0,0,0,0,5]], dtype=tf.float32)

# our embeddings
U_2, V_2 = tf.Variable(tf.random.uniform(shape=(5,3))), tf.Variable(tf.random.uniform(shape=(5,3)))

lambda_1, lambda_2 = tf.constant(0.01), tf.constant(0.001) 
optimizer_1 = tf.keras.optimizers.Adam(1e-3)

In [15]:
U_2, V_2

(<tf.Variable 'Variable:0' shape=(5, 3) dtype=float32, numpy=
 array([[0.45406008, 0.6125567 , 0.6828482 ],
        [0.3958645 , 0.27806878, 0.64831066],
        [0.9756948 , 0.8801267 , 0.14891171],
        [0.7785053 , 0.771783  , 0.1295476 ],
        [0.48571408, 0.40935826, 0.9892758 ]], dtype=float32)>,
 <tf.Variable 'Variable:0' shape=(5, 3) dtype=float32, numpy=
 array([[0.5707859 , 0.11755216, 0.2963915 ],
        [0.22033596, 0.09281969, 0.48017776],
        [0.25119758, 0.78879523, 0.3585558 ],
        [0.91686547, 0.99404526, 0.519649  ],
        [0.69685435, 0.7268249 , 0.76492393]], dtype=float32)>)

In [16]:
# function to implement the optimization of loss 4

def train_loss4(U, V, A, epochs, optimizer, lambda_1, lambda_2):
    total_time = 0
    # run training loop for multiple epochs
    for epoch in range(epochs):
        start = t.default_timer()
        with tf.GradientTape(persistent=True) as tape:
            # initialize our loss in the scope of the gradient tape
            loss_fn = tf.where(A != 0, tf.multiply(lambda_1, tf.pow(A - tf.matmul(U, tf.transpose(V)), 2)), tf.multiply(lambda_2, tf.pow(-tf.matmul(U, tf.transpose(V)), 2)))

        # backpropagate to compute gradient
        dloss_dU, dloss_dV = tape.gradient(loss_fn, U), tape.gradient(loss_fn, V)

        # get arrays of variables and gradients
        li_grads = [dloss_dU, dloss_dV]
        li_vars = [U, V]

        # now do the optimization
        optimizer.apply_gradients(zip(li_grads, li_vars))
        
        # track runtime and total loss
        end = t.default_timer()
        
        total_time += (end - start)
        # print an update every 50 epochs
        if (epoch + 1) % 50 == 0:
            print(f'Loss {tf.reduce_sum(loss_fn)} | Total Runtime {total_time} sec.')

In [17]:
epochs = 250

train_loss4(U_2, V_2, A, epochs, optimizer_1, lambda_1, lambda_2)

Loss 0.5190655589103699 | Total Runtime 0.4495614000000039 sec.
Loss 0.46565189957618713 | Total Runtime 0.8992254000000042 sec.
Loss 0.41350579261779785 | Total Runtime 1.3589951000000067 sec.
Loss 0.3640349507331848 | Total Runtime 1.7045038000000172 sec.
Loss 0.3185979127883911 | Total Runtime 2.305753300000033 sec.


In [18]:
tf.matmul(U_2, tf.transpose(V_2)).numpy()

array([[ 1.4111626 ,  0.16100453,  1.7719631 ,  1.3313775 ,  2.453659  ],
       [ 1.205685  ,  0.18614072,  1.3572495 ,  1.0194348 ,  2.0053127 ],
       [ 1.629098  , -0.0031546 ,  1.9628751 ,  1.731515  ,  2.6354938 ],
       [ 1.0395435 , -0.0117838 ,  1.0327182 ,  1.0324891 ,  1.4998994 ],
       [ 1.5107505 ,  0.26055306,  1.7363986 ,  1.2583627 ,  2.5601194 ]],
      dtype=float32)

In [19]:
A.numpy()

array([[5., 0., 2., 0., 0.],
       [0., 0., 3., 1., 0.],
       [0., 0., 0., 1., 5.],
       [2., 0., 0., 0., 1.],
       [0., 0., 0., 0., 5.]], dtype=float32)
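
To put a number on how far these predictions still are from $A$, one option (not used in the original notebook) is the root-mean-square error over the observed entries only. A minimal NumPy sketch, with both arrays copied from the outputs above:

```python
import numpy as np

A = np.array([[5, 0, 2, 0, 0],
              [0, 0, 3, 1, 0],
              [0, 0, 0, 1, 5],
              [2, 0, 0, 0, 1],
              [0, 0, 0, 0, 5]], dtype=float)

# Predictions U_2 V_2^T after 250 epochs, copied from the cell output above.
pred = np.array([
    [1.4111626,  0.16100453, 1.7719631, 1.3313775, 2.453659 ],
    [1.205685,   0.18614072, 1.3572495, 1.0194348, 2.0053127],
    [1.629098,  -0.0031546,  1.9628751, 1.731515,  2.6354938],
    [1.0395435, -0.0117838,  1.0327182, 1.0324891, 1.4998994],
    [1.5107505,  0.26055306, 1.7363986, 1.2583627, 2.5601194],
])

observed = A != 0
rmse = np.sqrt(np.mean((A[observed] - pred[observed]) ** 2))
print(float(rmse))
```

An error of this size on a 1 to 5 scale suggests that 250 epochs at this learning rate is not enough for the embeddings to fit the observed entries.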

## Method 3 vs Method 4

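Comparing method 3 and method 4, method 4 reached a lower loss after 250 epochs (about 0.32 versus 0.44) at the cost of a somewhat longer runtime (about 2.3 seconds versus 1.5 seconds). The two runs start from different random initializations and $L_{4}$ includes an extra term, so these numbers are only a rough comparison. Maybe we can uncover better schemes to make our loss more accurate, but we must ensure that it does not overfit or improperly represent our observed interactions. This is essentially what it boils down to: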

Assume $\mathrm{Observed} + \mathrm{Unobserved} = \mathrm{All\;Interactions}$ (where $+$ here means disjoint union of sets).

Here are the two losses we used:

$$L_{3} = \sum_{(i,j) \in \mathrm{Observed}} \lambda_{1}\left(A - U\cdot V^{T}\right)^{2}_{ij}$$

$$L_{4} = \sum_{(i,j) \in \mathrm{Observed}} \lambda_{1}\left(A - U\cdot V^{T}\right)^{2}_{ij} + \sum_{(i,j) \in \mathrm{Unobserved}} \lambda_{2}\left(U \cdot V^{T}\right)^{2}_{ij}$$

We see that $L_{3}$ only trains on the **observed interactions** (i.e. the entries $(i,j)$ such that $A_{ij} \neq 0$). This is fine in general. However, when certain observations are extreme (some user-item pairs have far more interaction than others), the predictions for nearby entries will likely **overfit** to those values.

To counteract this, we train the **unobserved interactions** as well (i.e. the entries corresponding to the i,j such that $A_{ij} = 0$). The purpose of doing this is to counteract the **overfitting** that comes with only training the **observed interactions**. So in some sense, the extra term in $L_{4}$ (the term with the $\lambda_{2}$ on it) is a **regularization** term.

**Note:** If we set $\lambda_{2} = 0$ in $L_{4}$, then we recover $L_{3}$, so $L_{4}$ is the more general loss and the one worth implementing.

## Create a Class

We will now use object-oriented programming to consolidate the basic aspects of this model. This will be the basis of the library we are building.

In [20]:
class MatrixFactorization:
    
    """Class for a Matrix Factorization Model. A collaborative filtering recommender model 
    that learns user embeddings and item embeddings to predict unobserved interactions in a given interaction table"""
    
    def __init__(self, num_features, shape):
        
        """
        Arguments:
        - num_features: a python int, number of latent features for the user and item embeddings
        - shape: a python tuple of the form (n,m) where (n,m) is the shape of the interaction table we are approximating
        
        Purpose:
        Initializes the following instance attributes:
        - user embeddings: U ---> of shape (n, num_features)
        - item embeddings: V ---> of shape (m, num_features)
        """
        n, m = shape
        user_shape = (n, num_features)
        item_shape = (m, num_features)
        
        self.U = tf.Variable(tf.random.uniform(shape=user_shape), trainable=True, dtype=tf.float32)
        self.V = tf.Variable(tf.random.uniform(shape=item_shape), trainable=True, dtype=tf.float32)
    
    @tf.function
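    def loss(self, A, U, V, lambda_1, lambda_2):
        """
        Arguments:
        - A: a tensor with 2 axes, the interaction table between users and items
        - U: the user embeddings, of shape (n, num_features)
        - V: the item embeddings, of shape (m, num_features)
        - lambda_1: weight on the squared error of the observed (nonzero) entries
        - lambda_2: weight on the squared prediction of the unobserved (zero) entries
        
        Purpose:
        Outputs the per-entry loss for both observed and unobserved interactions. In particular, it creates a loss graph for computation
        in tensorflow's autograd
        
        2022-06-03 - The only loss available is the MSE loss, this will be changed as necessary as we discover more losses to use
        """
        
        # weighted squared error per entry; the unobserved entries act as regularization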
        return tf.where(A != 0, tf.multiply(lambda_1, tf.pow(A - tf.matmul(U, tf.transpose(V)), 2)), tf.multiply(lambda_2, tf.pow(-tf.matmul(U, tf.transpose(V)), 2)))
        
    
    def fit(self, A, epochs, optimizer, lambda_1=0.01, lambda_2=0.001, verbose=1):
        """
        Arguments:
        - A: an array with 2 axes, an interaction table between users and items
        - epochs: a python int, the number of times the model will train
        - optimizer: a tf.keras.optimizers object, the optimization algorithm used in the minimization of the loss
        - lambda_1: a python float, the hyperparameter that weights the contribution of observed entries in the training
        - lambda_2: a python float, the hyperparameter that weights the contribution of unobserved entries in the training
        - verbose: a python int, 
        --- 0: no information about training process printed
        --- 1: epoch number, loss, cumulative training runtime printed every 50 epochs
        --- 2: epoch number and cumulative training runtime printed every 50 epochs
        
        Purpose:
        Runs a training loop for a specified loss function using an optimization algorithm of choice.
        """
        
        # get history of losses and training parameters
        train_history, li_loss = {}, []
        
        # recast hyperparameters as tf.constant objects
        lambda_1, lambda_2 = tf.constant(lambda_1), tf.constant(lambda_2)
        
        total_time = 0
        for epoch in range(epochs):
            
            # start timer
            start=t.default_timer()
            
            with tf.GradientTape(persistent=True) as tape:
                
                # initialize loss to take gradients
                loss_fn = self.loss(A, self.U, self.V, lambda_1, lambda_2)
                
            # backpropagate to compute gradient
            dloss_dU, dloss_dV = tape.gradient(loss_fn, self.U), tape.gradient(loss_fn, self.V)

            # get arrays of variables and gradients
            li_grads = [dloss_dU, dloss_dV]
            li_vars = [self.U, self.V]

            # now do the optimization
            optimizer.apply_gradients(zip(li_grads, li_vars))

            # end timer
            end = t.default_timer()

            # put in key for train history, get loss from every epoch
            li_loss.append(tf.reduce_sum(loss_fn).numpy())
            train_history['Loss'] = li_loss

            # accumulate the runtime regardless of verbosity
            total_time += (end - start)

            # verbose outputs: print an update every 50 epochs
            if verbose in (1, 2) and (epoch + 1) % 50 == 0:
                if verbose == 1:
                    print(f'Epoch {epoch + 1}/{epochs} | Loss {tf.reduce_sum(loss_fn)} | Total Runtime {total_time} sec.')
                else:
                    print(f'Epoch {epoch + 1}/{epochs} | Total Runtime {total_time} sec.')
                    
                    
        return train_history

In [21]:
# test the class

### initialize model
model1 = MatrixFactorization(3, (5,5))

model1.U, model1.V

(<tf.Variable 'Variable:0' shape=(5, 3) dtype=float32, numpy=
 array([[0.687345  , 0.89686644, 0.2590859 ],
        [0.20819509, 0.2115885 , 0.31697786],
        [0.7831768 , 0.91765773, 0.8220005 ],
        [0.9278258 , 0.7236351 , 0.9515606 ],
        [0.74774766, 0.33941114, 0.49397886]], dtype=float32)>,
 <tf.Variable 'Variable:0' shape=(5, 3) dtype=float32, numpy=
 array([[0.7801782 , 0.5865929 , 0.1732508 ],
        [0.8587339 , 0.295681  , 0.7646496 ],
        [0.17215383, 0.82947946, 0.32177687],
        [0.14014292, 0.8995613 , 0.34337568],
        [0.53720796, 0.33877194, 0.6331147 ]], dtype=float32)>)

In [22]:
A = tf.constant([[5,0,2,0,0], [0,0,3,1,0], [0,0,0,1,5], [2,0,0,0,1], [0,0,0,0,5]], dtype=tf.float32)

In [23]:
# test the fitting

optimizer_adam=tf.keras.optimizers.Adam(1e-3)

history_model1 = model1.fit(A, 250, optimizer_adam)

Epoch 50/250 | Loss 0.5270388722419739 | Total Runtime 0.8224340000000012 sec.
Epoch 100/250 | Loss 0.4757670760154724 | Total Runtime 1.1190998000000114 sec.
Epoch 150/250 | Loss 0.42503368854522705 | Total Runtime 1.5748088000000102 sec.
Epoch 200/250 | Loss 0.37621575593948364 | Total Runtime 2.1004636000000225 sec.
Epoch 250/250 | Loss 0.3307192921638489 | Total Runtime 2.5274242000000235 sec.


In [24]:
history_model1

{'Loss': [0.5769777,
  0.57596195,
  0.5749459,
  0.5739299,
  0.5729136,
  0.57189727,
  0.57088083,
  0.5698642,
  0.56884754,
  0.56783074,
  0.5668139,
  0.56579685,
  0.5647798,
  0.56376266,
  0.5627454,
  0.561728,
  0.56071055,
  0.559693,
  0.5586753,
  0.55765754,
  0.5566397,
  0.5556216,
  0.5546034,
  0.5535852,
  0.5525667,
  0.5515481,
  0.5505293,
  0.54951024,
  0.5484911,
  0.54747164,
  0.54645205,
  0.54543227,
  0.54441226,
  0.543392,
  0.5423716,
  0.54135084,
  0.54032993,
  0.5393088,
  0.53828734,
  0.53726584,
  0.53624403,
  0.53522205,
  0.5341999,
  0.53317744,
  0.5321548,
  0.531132,
  0.53010905,
  0.5290859,
  0.52806246,
  0.5270389,
  0.52601516,
  0.5249913,
  0.52396727,
  0.522943,
  0.52191865,
  0.52089405,
  0.51986945,
  0.51884466,
  0.51781964,
  0.5167947,
  0.5157694,
  0.51474416,
  0.5137188,
  0.5126933,
  0.51166767,
  0.51064193,
  0.50961626,
  0.50859034,
  0.5075644,
  0.50653845,
  0.50551236,
  0.5044863,
  0.5034602,
  0.5024341

## Test Scalability

Let us test how scalable this training is. Let us try it for large interaction tables.

In [25]:
model2 = MatrixFactorization(3, (1000,1000))

A_2 = tf.random.uniform(shape=(1000,1000), maxval=5)

# test the fitting

optimizer_adam=tf.keras.optimizers.Adam(1e-3)

history_model2 = model2.fit(A_2, 250, optimizer_adam)

Epoch 50/250 | Loss 48255.1953125 | Total Runtime 4.8321564999999715 sec.
Epoch 100/250 | Loss 43383.28515625 | Total Runtime 9.029244199999983 sec.
Epoch 150/250 | Loss 38848.359375 | Total Runtime 13.226628699999981 sec.
Epoch 200/250 | Loss 34834.125 | Total Runtime 17.274489399999947 sec.
Epoch 250/250 | Loss 31454.318359375 | Total Runtime 21.382511099999927 sec.


For a $1000 \times 1000$ matrix, the speed is acceptable (about 21 seconds for 250 epochs, or roughly 0.09 seconds per epoch), although we did not train it for long. Let us test the limits.
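Before scaling up, it helps to estimate what one dense evaluation of $U\cdot V^{T}$ costs. The helper below is a back-of-the-envelope sketch (the function name `dense_cost` is hypothetical, and it counts only the matrix product itself, ignoring the extra intermediates that the loss and the gradient tape keep around).

```python
def dense_cost(m, n, k, bytes_per_float=4):
    """Rough cost of forming the dense m x n prediction U V^T in float32."""
    entries = m * n
    return {
        "entries": entries,                            # predicted interactions
        "multiply_adds": entries * k,                  # one length-k dot product each
        "megabytes": entries * bytes_per_float / 1e6,  # memory for one dense matrix
    }

print(dense_cost(1000, 1000, 3))
print(dense_cost(10000, 10000, 3))
```

Going from $1000 \times 1000$ to $10000 \times 10000$ multiplies every one of these quantities by 100, and each dense intermediate grows from 4 MB to 400 MB.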

In [27]:
model3 = MatrixFactorization(3, (10000,10000))

A_3 = tf.random.uniform(shape=(10000,10000), maxval=5)

# test the fitting

optimizer_adam=tf.keras.optimizers.Adam(1e-3)

history_model3 = model3.fit(A_3, 250, optimizer_adam)

Epoch 50/250 | Loss 4806552.0 | Total Runtime 708.1014444999997 sec.
Epoch 100/250 | Loss 4318902.0 | Total Runtime 1418.5567218000006 sec.


KeyboardInterrupt: 

### Notes on Performance

I did not finish training the embeddings on the $10000 \times 10000$ interaction matrix, for the safety of my computer (it was overheating like crazy). However, here are some observations from the limited run above:

- the output here tells us that, on a CPU alone, the training is manageable but slow at this size: the first 50 epochs took about 708 seconds, i.e. about 14 seconds per epoch

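- furthermore, we tested this on a fully dense matrix (this interaction table had 100 million nonzero entries!). In practice, we are far more likely to have more users than items, and most of the interactions are likely to be zero. Note, however, that the implementation above always forms the full dense product $U\cdot V^{T}$, so it would have to exploit that sparsity (by evaluating only the observed entries) before the computation time actually drops for these more likely cases.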

- the loss is still decreasing steadily, which is encouraging: between epochs 50 and 100, the summed loss fell by about 490,000 (from 4,806,552 to 4,318,902, roughly 10%).
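
One way a sparse table could be exploited is to evaluate predictions only at the observed coordinates instead of forming the full $U\cdot V^{T}$. The NumPy sketch below is an illustration under stated assumptions (toy sizes, random data, and variable names that are not part of the class above). It checks that the observed term matches the dense computation, and that the unobserved term can be recovered from the identity $\sum_{ij}\left(U\cdot V^{T}\right)^{2}_{ij} = \sum \left(U^{T}U\right) \odot \left(V^{T}V\right)$ without ever building the dense matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 5, 3
A = rng.integers(0, 6, size=(m, n)).astype(float)
A[rng.uniform(size=(m, n)) < 0.7] = 0.0       # make most entries unobserved

U = rng.uniform(size=(m, k))
V = rng.uniform(size=(n, k))

rows, cols = np.nonzero(A)                    # coordinates of observed entries
values = A[rows, cols]

# One dot product per observed entry: O(nnz * k) instead of O(m * n * k).
pred_obs = np.einsum("ij,ij->i", U[rows], V[cols])
obs_term = np.sum((values - pred_obs) ** 2)

# Sum of all squared predictions from two small k x k Gram matrices,
# minus the observed part, gives the unobserved term.
all_sq = np.sum((U.T @ U) * (V.T @ V))
unobs_term = all_sq - np.sum(pred_obs ** 2)

# Dense reference values, for checking only.
dense_pred = U @ V.T
dense_obs = np.sum(np.where(A != 0, A - dense_pred, 0.0) ** 2)
dense_unobs = np.sum(np.where(A == 0, dense_pred, 0.0) ** 2)
print(np.isclose(obs_term, dense_obs), np.isclose(unobs_term, dense_unobs))
```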