In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd "/content/drive/My Drive/Colab Notebooks/CoE202/Collaborative Filtering"

## Collaborative Filtering (CF)

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=185fQI_jd3DewJRSKO7TEIZH6Sjr15UGc" width="50%" height="50%" title="recommender system" alt="recommender system"></img>
</figure>

- The most prominent approach to generate recommendations
    - Used by large, commercial e-commerce sites
    - Well-understood, various algorithms and variations exist
    - Applicable in many domains
- Use the **wisdom of the crowd** to recommend items
- Basic assumption and idea
    - Users give ratings to items (implicitly or explicitly)
    - <span style="color:red">**Customers who had similar tastes in the past will have similar tastes in the future**</span>

### Model-based CF
#### Matrix Factorization
We can factorize the **ratings matrix** into a **user matrix** and an **item matrix**.

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=1880BHOvpFW66QjjjnnN-exW_HOyX9EkU" width="50%" height="50%" title="recommender system" alt="recommender system"></img>
</figure>

We can fit the model directly on the observed **ratings** only, while avoiding overfitting through a regularized objective such as:  
$\min \frac{1}{2} \sum_{(u, i)\in R}{(r_{ui} -\mu -b_{u}^{user} -b_{i}^{item} -p_{u}q_{i}^{T})^{2}} + \lambda(\|p_{u}\|^{2} + \|q_{i}\|^{2} + {b_{u}^{user}}^{2} + {b_{i}^{item}}^{2})$  
where $p_{u}$ and $q_{i}$ are the latent factor vectors of user $u$ and item $i$, respectively,  
$b_{u}^{user}$ and $b_{i}^{item}$ are the bias terms of user $u$ and item $i$, respectively, and  
$\mu$ is the mean of the observed ratings.  
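
The objective above can be minimized with stochastic gradient descent over the observed ratings. The following is only a minimal sketch under stated assumptions: the regularizer is applied once per observed rating, and the names `mf_objective`, `sgd_epoch`, `lr`, and the toy ratings are illustrative, not part of the course code.

```python
import numpy as np

def mf_objective(ratings, mu, b_user, b_item, P, Q, lam):
    """Regularized squared error summed over observed (u, i, r) triples."""
    total = 0.0
    for u, i, r in ratings:
        err = r - mu - b_user[u] - b_item[i] - P[u] @ Q[i]
        # assumption: the regularizer is counted once per observed rating
        reg = P[u] @ P[u] + Q[i] @ Q[i] + b_user[u] ** 2 + b_item[i] ** 2
        total += 0.5 * err ** 2 + lam * reg
    return total

def sgd_epoch(ratings, mu, b_user, b_item, P, Q, lam, lr):
    """One pass of SGD; updates the biases and factors in place."""
    for u, i, r in ratings:
        err = r - mu - b_user[u] - b_item[i] - P[u] @ Q[i]
        b_user[u] += lr * (err - 2 * lam * b_user[u])
        b_item[i] += lr * (err - 2 * lam * b_item[i])
        p_old = P[u].copy()  # keep the old p_u for the q_i update
        P[u] += lr * (err * Q[i] - 2 * lam * P[u])
        Q[i] += lr * (err * p_old - 2 * lam * Q[i])

# toy data (made up): (user, item, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 1.0)]
mu = np.mean([r for _, _, r in ratings])
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((2, 2))
Q = 0.1 * rng.standard_normal((2, 2))
b_user = np.zeros(2)
b_item = np.zeros(2)

before = mf_objective(ratings, mu, b_user, b_item, P, Q, lam=0.01)
for _ in range(200):
    sgd_epoch(ratings, mu, b_user, b_item, P, Q, lam=0.01, lr=0.05)
after = mf_objective(ratings, mu, b_user, b_item, P, Q, lam=0.01)
print(before, after)
```

The old value of $p_{u}$ is copied before it is overwritten so that the $q_{i}$ update uses the same point as the $p_{u}$ update.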

#### Implicit Feedback
Recommender systems rely on many different types of feedback.  
Netflix collects star ratings for movies, and TiVo users indicate their preferences for TV shows by hitting thumbs-up/down buttons.  
However, **explicit feedback is not always available**.
Therefore, recommenders can infer user preferences from the more abundant **implicit feedback**,  
which indirectly reflects opinion through observed user behavior.

Implicit feedback can be collected continuously and does not require additional effort from the user.

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=1WXWuwp2dJp6kqUEkQibrvRpiRktl4am7" width="50%" height="50%" title="recommender system" alt="recommender system"></img>
</figure>

The input data associate users and items through $r_{ui}$ values, which we henceforth call *observations*.  
For implicit feedback datasets, those values would indicate observations for user actions.  
For example, $r_{ui}$ can indicate the number of times $u$ purchased item $i$ or the time $u$ spent on webpage $i$.

Prime characteristics of implicit feedback:
- No negative feedback
- Implicit feedback is inherently noisy
- The numerical value of implicit feedback indicates confidence
- Evaluation of an implicit-feedback recommender requires appropriate measures

#### OCCF (One Class Collaborative Filtering)
OCCF introduces confidence levels on both observed and unobserved interactions.
- If user $u$ consumed item $i$ ($r_{ui} > 0$), then we have an indication that $u$ likes $i$.
- If user $u$ never consumed item $i$ ($r_{ui} = 0$), we assume no preference.
- i.e., $p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}$
- However, these preferences come with **varying confidence levels**: not all consumptions are alike, and not all non-consumptions are alike.
- i.e., $c_{ui} = 1 + \alpha r_{ui}$
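
As a small illustration of the two definitions above, the sketch below derives the preference matrix and the confidence matrix from a toy observation matrix. The observation values are made up, and `alpha = 40` is just an example setting.

```python
import numpy as np

# toy observation matrix r_ui (rows: users, columns: items); values are made up
R = np.array([[3, 0, 1],
              [0, 0, 5]], dtype=np.float64)
alpha = 40.0

P = (R > 0).astype(np.float64)  # p_ui = 1 if r_ui > 0, else 0
C = 1.0 + alpha * R             # c_ui = 1 + alpha * r_ui

print(P)
print(C)
```

Unobserved pairs keep the minimal confidence of 1, so they still contribute to the objective, only with less weight than observed pairs.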

So we aim to minimize the following objective:  
$\min \sum_{(u, i)}{c_{ui}(p_{ui} -x_{u}^{T}y_{i})^{2}} + \lambda(\sum_{u}{||x_{u}||}^{2} + \sum_{i}{||y_{i}||}^{2})$  
where $x_{u}$ and $y_{i}$ are the latent factors of user $u$ and item $i$, respectively.

Because $p_{ui}$ is meaningful even when it equals zero, the objective must account for all possible ($u, i$) pairs, rather than only those corresponding to observed data.  
With that many pairs, applying stochastic gradient descent is not feasible.  
Instead, we solve the problem with the **Alternating Least Squares (ALS)** algorithm.

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=1ihADiG1xa-Xspb661zoBDriWRQI-sdbc" width="80%" height="80%" title="recommender system" alt="recommender system"></img>
</figure>

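OCCF iteratively updates the user and item latent factors with the equations below.  
$x_{u} = (Y^{T}C^{u}Y + \lambda I)^{-1}Y^{T}C^{u}p(u)$  
$y_{i} = (X^{T}C^{i}X + \lambda I)^{-1}X^{T}C^{i}p(i)$  
Here $C^{u}$ is the diagonal matrix holding the confidences $c_{ui}$ of user $u$, and $p(u)$ is the vector of that user's preferences $p_{ui}$; $C^{i}$ and $p(i)$ are defined analogously for item $i$.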
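
The user update can be computed without forming the dense diagonal matrix $C^{u}$, using $Y^{T}C^{u}Y = Y^{T}Y + Y^{T}(C^{u} - I)Y$, where $C^{u} - I$ is non-zero only for items the user consumed. The sketch below is one possible way to write this; the function name `als_user_update` and the toy inputs are hypothetical.

```python
import numpy as np

def als_user_update(Y, c_u, p_u, reg):
    """Solve (Y^T C^u Y + reg * I) x_u = Y^T C^u p(u) for a single user."""
    f = Y.shape[1]
    YtY = Y.T @ Y                    # identical for every user, could be cached
    nz = np.flatnonzero(c_u != 1.0)  # items where c_ui - 1 is non-zero
    Y_nz = Y[nz]
    # Y^T (C^u - I) Y restricted to the consumed items
    A = YtY + (Y_nz.T * (c_u[nz] - 1.0)) @ Y_nz + reg * np.eye(f)
    b = Y.T @ (c_u * p_u)            # Y^T C^u p(u), with C^u diagonal
    return np.linalg.solve(A, b)

# toy inputs (made up): 5 items, 3 latent features
rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 3))
r_u = np.array([2.0, 0.0, 0.0, 1.0, 0.0])
c_u = 1.0 + 40.0 * r_u
p_u = (r_u > 0).astype(np.float64)
reg = 0.002

x_u = als_user_update(Y, c_u, p_u, reg)
print(x_u)
```

The item update is symmetric: swap the roles of $X$ and $Y$ and use the column of confidences and preferences for item $i$.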

In [None]:
import numpy as np
import data
from timeit import default_timer as timer

In [None]:
class OCCF():
    
    def __init__(self, train, test, f, epsilon):
        """
        param train : implicit feedback (observation) matrix for training, size (m, n)
        param test : implicit feedback (observation) matrix for test, size (m, n)
        param f : number of latent features
        param epsilon : convergence threshold on the decrease in cost between epochs
        """
        
        self.R = train # observation matrix for training, size (m, n)
        self.R_test = test # observation matrix for test, size (m, n)
        self.P = (np.asarray(train) != 0).astype(np.float64) # preference matrix for training
        self.P_test = (np.asarray(test) != 0).astype(np.float64) # preference matrix for test
        self.n_user_rated = np.sum(self.P, axis = 1)
        self.n_item_rated = np.sum(self.P, axis = 0)
        self.num_users, self.num_items = train.shape
        self.alpha = 40
        self.reg = 0.002
        self.C = 1 + self.alpha * self.R # confidence matrix, size (m, n)
        self.f = f
        self.epsilon = epsilon
        
        
    def fit(self):
        """
        Train the factorization: alternate ALS updates of the user factors X
        and the item factors Y until the cost decreases by at most epsilon.
        """
        # init latent features
        self.X = np.random.normal(scale = 1.0/self.f, size=(self.num_users, self.f))
        self.Y = np.random.normal(scale = 1.0/self.f, size=(self.num_items, self.f))
        
        count = 0
        cost_diff = 1000000
        self.training_process = []
        # cost per epoch; kept under a separate name so it does not shadow the cost() method
        self.cost_history = [0]
        # repeat ALS until convergence
        while cost_diff > self.epsilon :
            
            start = timer()
            count += 1
            # update all user factors with the item factors fixed
            self.yTy = self.Y.T.dot(self.Y)
            for u in range(self.num_users):
                self.optimize_x(u)
            
            # update all item factors with the user factors fixed
            self.xTx = self.X.T.dot(self.X)
            for i in range(self.num_items):
                self.optimize_y(i)
            epoch_time = timer() - start
            
            cost = self.cost()
            self.cost_history.append(cost)
            if count > 1 :
                cost_diff = self.cost_history[count - 1] - self.cost_history[count]
            start_rank = timer()
            rank = self.compute_rank()
            print("time to compute rank : %.4f" % ( timer() - start_rank ))
            self.training_process.append([count, cost_diff, rank])
            print("count: %d, cost_difference : %.4f, rank : %.4f, time for an epoch : %.4f"% (count, cost_diff, rank, epoch_time))

                
    def optimize_x(self, u):
        """
        Optimize X given user u
        """
        C_u = np.diag(self.C[u, :]) # create diagonal matrix size (n, n)
        
        # (f,f) matrix
        temp1 = self.yTy + self.Y.T.dot(C_u - np.identity(self.num_items)).dot(self.Y) + self.reg * np.identity(self.f)
        # (f,1) matrix
        temp2 = self.Y.T.dot(C_u).dot(self.P[u])
        
        self.X[u, :] = np.linalg.solve(temp1, temp2)
        
    
    def optimize_y(self, i):
        """
        Optimize Y given item i
        """
        C_i = np.diag(self.C[:, i]) # create diagonal matrix size (m, m)
        
        # (f,f) matrix
        temp1 = self.xTx + self.X.T.dot(C_i - np.identity(self.num_users)).dot(self.X) + self.reg * np.identity(self.f)
        # (f,1) matrix
        temp2 = self.X.T.dot(C_i).dot(self.P[:, i])
        
        self.Y[i, :] = np.linalg.solve(temp1, temp2)
        
  
    def cost(self):
        """
        compute Loss function
        """
        # the regularizer uses squared norms, matching the objective being minimized
        loss = np.sum(self.C * np.square(self.P - self.X.dot(self.Y.T))) + self.reg * (np.sum(np.square(self.X)) + np.sum(np.square(self.Y)))
        
        return loss
    

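    def compute_rank(self):
        """
        compute the expected percentile rank of the test observations
        (lower is better; a random ordering gives about 0.5 on average)
        """
        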
        prediction = self.X.dot(self.Y.T)
        temp_1 = 0
        temp_2 = 0
        
        for x in range(self.num_users) :
            inv_pre = -1 * prediction[x, :]
            sort_x = inv_pre.argsort() # index starts with 0
            sort_x = sort_x.argsort()
            rank_x = sort_x / len(sort_x)
            
            temp_1 += (self.R_test[x, :] * rank_x).sum()
            temp_2 += self.R_test[x, :].sum()
        
        rank = temp_1 / temp_2
            
        return rank
    
    
    def print_results(self):
        """
        print fit results
        """

        print("Final P hat matrix:")
        print(self.X.dot(self.Y.T))


In [None]:
np.random.seed(7)    
np.seterr(all="warn")

train = data.train
test = data.test

factorizer = OCCF(train, test, f=40, epsilon = 100)
factorizer.fit()

#### BPR : Bayesian Personalized Ranking from Implicit Feedback
The collaborative filtering models we learned in class are used for item prediction, which is a **personalized ranking** task.  
However, none of them is directly optimized for ranking.  

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=1zoqtL1RQZaHLDLn9aCR4nuZ2ITxtHHbg" width="70%" height="70%" title="recommender system" alt="recommender system"></img>
</figure>

BPR presents a generic optimization criterion, *BPR-OPT*, for personalized ranking. It is the maximum posterior estimator derived from a Bayesian analysis of the problem.  

$p(\theta|>_{u}) \propto p(>_{u}|\theta)p(\theta)$  

$\prod_{u \in U}{p(>_{u}|\theta)} = \prod_{(u, i, j) \in U \times I \times I}{p(i >_{u} j|\theta)^{\delta((u, i, j)\in D_{s})}{(1 - p(i >_{u} j|\theta))^{\delta((u, j, i)\notin D_{s})}}}$  

where $\delta$ is the indicator function:  

$\delta(b) := \begin{cases} 1 & \text{if}~b~\text{is true} \\ 0 & \text{else} \end{cases}$

The above formula can be simplified to 

$\prod_{u \in U}{p(>_{u}|\theta)} = \prod_{(u, i, j) \in D_{s}}{p(i >_{u} j|\theta)}$

So we formulate the maximum posterior estimator to derive our generic optimization criterion for personalized ranking BPR-OPT:  

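$\ln p(\theta | >_{u})$  

$ = \ln p(>_{u}| \theta)p(\theta)$  

$ = \ln \prod_{(u, i, j) \in D_{s}}{\sigma(\hat{x}_{uij})}p(\theta)$  

$ = \sum_{(u, i, j) \in D_{s}}{\ln\sigma(\hat{x}_{uij})} + \ln p(\theta)$  

$ = \sum_{(u, i, j) \in D_{s}}{\ln\sigma(\hat{x}_{uij})} - \lambda_{\theta} {\|\theta\|}^{2}$

Here $p(i >_{u} j|\theta) := \sigma(\hat{x}_{uij})$, where $\sigma$ is the logistic sigmoid and $\hat{x}_{uij}$ is the model's estimate of how strongly user $u$ prefers item $i$ over item $j$. The equalities hold up to additive constants, and the regularization term comes from a zero-mean Gaussian prior on $\theta$.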
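
To make the criterion concrete, the sketch below evaluates BPR-OPT for a given set of triples. It assumes a matrix factorization scorer, $\hat{x}_{uij} = \hat{x}_{ui} - \hat{x}_{uj}$ with $\hat{x}_{ui} = w_{u} \cdot h_{i}$, which is one common choice and not the only one; `W`, `H`, and `bpr_opt` are hypothetical names.

```python
import numpy as np

def bpr_opt(W, H, triples, lam):
    """BPR-OPT value: sum of ln sigma(x_uij) minus lam * ||theta||^2."""
    u, i, j = np.asarray(triples).T
    # assumption: x_uij = w_u . h_i - w_u . h_j (matrix factorization scorer)
    x_uij = np.sum(W[u] * (H[i] - H[j]), axis=1)
    log_sigmoid = -np.logaddexp(0.0, -x_uij)  # ln sigma(x) = -ln(1 + exp(-x))
    reg = lam * (np.sum(W ** 2) + np.sum(H ** 2))
    return np.sum(log_sigmoid) - reg

# toy setup (made up): 2 users, 4 items, 3 latent features, all-zero factors
W = np.zeros((2, 3))
H = np.zeros((4, 3))
triples = [(0, 0, 2), (0, 1, 3), (1, 2, 0)]  # (u, i, j): u prefers i over j
value = bpr_opt(W, H, triples, lam=0.1)
print(value)
```

With all-zero factors every $\hat{x}_{uij}$ is 0, so each triple contributes $\ln\sigma(0) = -\ln 2$; training aims to raise the value above this baseline.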

We optimize this criterion using **stochastic gradient descent** with **bootstrap sampling** of the training triples, i.e., triples $(u, i, j)$ are drawn at random with replacement.
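
A minimal sketch of this training loop is given below. It again assumes a matrix factorization scorer ($\hat{x}_{uij} = w_{u} \cdot (h_{i} - h_{j})$), and it samples a user first and then an item pair, which only approximates uniform sampling over $D_{s}$. The names `sample_triple`, `bpr_sgd_step`, and the toy data are hypothetical; every user is assumed to have at least one consumed and one unconsumed item.

```python
import numpy as np

def sample_triple(rng, positives, num_items):
    """Draw (u, i, j) where i is an observed item of u and j is not."""
    u = rng.integers(len(positives))
    i = rng.choice(positives[u])
    j = rng.integers(num_items)
    while j in positives[u]:  # resample until j is unobserved for u
        j = rng.integers(num_items)
    return u, i, j

def bpr_sgd_step(W, H, u, i, j, lr, lam):
    """One gradient ascent step on ln sigma(x_uij) - lam * ||theta||^2."""
    w_u, h_i, h_j = W[u].copy(), H[i].copy(), H[j].copy()
    x_uij = w_u @ (h_i - h_j)
    g = 1.0 / (1.0 + np.exp(x_uij))  # sigma(-x_uij), derivative of ln sigma
    W[u] += lr * (g * (h_i - h_j) - 2 * lam * w_u)
    H[i] += lr * (g * w_u - 2 * lam * h_i)
    H[j] += lr * (-g * w_u - 2 * lam * h_j)

# toy data (made up): observed items per user, 4 items in total
positives = [[0, 1], [2]]
num_items = 4
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((len(positives), 4))
H = 0.1 * rng.standard_normal((num_items, 4))

for _ in range(5000):
    u, i, j = sample_triple(rng, positives, num_items)
    bpr_sgd_step(W, H, u, i, j, lr=0.05, lam=0.01)

print(W @ H.T)  # predicted scores; higher means ranked earlier
```

Because triples are drawn independently with replacement, consecutive updates rarely hit the same user-item pair, which is the motivation for bootstrap sampling over a fixed sweep through the data.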