## `Collaborative Filtering`

Collaborative filtering is an umbrella term for recommendation techniques that use data about users (such as their ratings) to provide personalized recommendations.

These techniques are based on the idea "apes together strong!" xD. Simply put, recommendations rest on the premise that people largely keep their tastes over time, so if they agreed with somebody in the past, they are likely to agree with them in the future as well.

One type of collaborative filtering is `Neighbourhood based filtering`. There are 2 techniques of this type.

1. User based filtering - find similar users and recommend items liked by them.
2. Item based filtering - find items whose ratings are similar to those of items the user liked, and recommend them.

To do these tasks we use a data structure (a matrix) called the `ratings matrix`. Each row is a user, each column is an item, and each cell holds the rating that user gave to that item. Our goal is to predict the missing values in this ratings matrix so that we can recommend the best-scoring unrated items to each user.
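
As a minimal sketch of how such a matrix can be built, the snippet below pivots a long-form log of ratings into a user-by-item table. The `ratings_log` data and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical long-form ratings log: one row per (user, item, rating) event.
ratings_log = pd.DataFrame({
    "user":   ["user1", "user1", "user2", "user3", "user3"],
    "item":   ["item1", "item2", "item1", "item2", "item3"],
    "rating": [5, 3, 4, 2, 5],
})

# Rows are users, columns are items; pairs without a rating become NaN.
ratings_matrix = ratings_log.pivot(index="user", columns="item", values="rating")

# The NaN cells are exactly the (user, item) pairs we would like to predict.
to_predict = [(user, item)
              for user in ratings_matrix.index
              for item in ratings_matrix.columns
              if pd.isna(ratings_matrix.loc[user, item])]

print(ratings_matrix)
print(to_predict)
```

Every pair in `to_predict` is a candidate recommendation once a rating has been estimated for it.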


<center><image src="./images/Recommendation pipeline.jpg" width="500px" /></center>


In practice, one of the main concerns in recommendation engines is similarity checking. Since this is a time-consuming task when the number of items to consider is high, it is essential to precalculate the similarities beforehand. In fact, Amazon reportedly used such an offline similarity calculation for their recommender systems.

<pre style="color:yellow;">
    For each item in product catalog, I1
        For each customer C who purchased I1
            For each item I2 purchased by customer C
                Record that a customer purchased I1 and I2
        For each item I2
            Compute the similarity between I1 and I2
</pre>
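
The pseudocode above can be sketched in Python roughly as follows. The `purchases` data is made up. The pseudocode does not say which similarity measure to use, so cosine similarity over binary purchase vectors is an assumption here.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase history: customer -> set of purchased items.
purchases = {
    "c1": {"A", "B"},
    "c2": {"A", "B", "C"},
    "c3": {"B", "C"},
    "c4": {"A"},
}

# Invert the history once so we can look up the customers of each item.
buyers = defaultdict(set)
for customer, items in purchases.items():
    for item in items:
        buyers[item].add(customer)

def item_similarities(buyers, purchases):
    """Cosine similarity over binary purchase vectors, for co-purchased pairs only."""
    similarities = {}
    for i1, customers in buyers.items():
        co_counts = defaultdict(int)
        for customer in customers:
            for i2 in purchases[customer]:
                if i2 != i1:
                    co_counts[i2] += 1  # a customer purchased both i1 and i2
        for i2, both in co_counts.items():
            similarities[(i1, i2)] = both / sqrt(len(customers) * len(buyers[i2]))
    return similarities

similarities = item_similarities(buyers, purchases)
for pair, value in sorted(similarities.items()):
    print(pair, round(value, 2))
```

Only item pairs that share at least one customer are visited, which is what keeps the offline step affordable when the catalog is large but individual baskets are small.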

The following is a simple implementation that shows the process of item-item recommendation.

In [75]:
import pandas as pd
import numpy as np


rating_matrix = pd.DataFrame(data=np.array([ [5,4,5,3,3,2],[3,3,2,5,3,3],[None,4,5,3,3,2],\
                                    [2,None,2,None,2,3],[2,3,1,1,4,5],[2,3,1,1,5,5]]).T,\
                        columns=['item1','item2','item3','item4','item5','item6'],\
                        index=['user1','user2','user3','user4','user5','user6'])

print(rating_matrix)

      item1 item2 item3 item4 item5 item6
user1     5     3  None     2     2     2
user2     4     3     4  None     3     3
user3     5     2     5     2     1     1
user4     3     5     3  None     1     1
user5     3     3     3     2     4     5
user6     2     3     2     3     5     5


In [76]:
pd.set_option('display.precision', 2)

In [77]:
# %%timeit -n 100
def get_adjusted_ratings(rating_matrix):
    adjusted_matrix = rating_matrix.sub(rating_matrix.mean(axis=1), axis=0)
    return adjusted_matrix

get_adjusted_ratings(rating_matrix)

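|       | item1 | item2 | item3 | item4 | item5 | item6 |
|-------|-------|-------|-------|-------|-------|-------|
| user1 | 2.2   | 0.2   | NaN   | -0.8  | -0.8  | -0.8  |
| user2 | 0.6   | -0.4  | 0.6   | NaN   | -0.4  | -0.4  |
| user3 | 2.33  | -0.67 | 2.33  | -0.67 | -1.67 | -1.67 |
| user4 | 0.4   | 2.4   | 0.4   | NaN   | -1.6  | -1.6  |
| user5 | -0.33 | -0.33 | -0.33 | -1.33 | 0.67  | 1.67  |
| user6 | -1.33 | -0.33 | -1.33 | -0.33 | 1.67  | 1.67  |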


In [145]:
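from math import sqrt

def adjusted_cosine_similarity(rating1, rating2):
    """Cosine similarity of two mean-centred rating columns, using only co-rated users."""
    totMulSum = 0
    r1sqr = 0
    r2sqr = 0

    for i in range(len(rating1)):
        # Skip users who have not rated both items
        if pd.isnull(rating1.iloc[i]) or pd.isnull(rating2.iloc[i]):
            continue

        # Values are rounded to 2 decimals, matching the displayed precision
        i1 = round(rating1.iloc[i], 2)
        i2 = round(rating2.iloc[i], 2)
        totMulSum += i1 * i2
        r1sqr += i1**2
        r2sqr += i2**2

    denominator = round(sqrt(r1sqr), 2) * round(sqrt(r2sqr), 2)
    # No co-rated users (or only zero deviations): treat the items as unrelated
    if denominator == 0:
        return 0.0

    return totMulSum / denominator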
    

In [151]:
# %%timeit -n 100
def get_similarity_matrix(rating_matrix):

    adjusted_ratings = get_adjusted_ratings(rating_matrix)

    # The similarity matrix has one row and one column per item (not per user)
    n_items = adjusted_ratings.shape[1]
    similarity_matrix = pd.DataFrame(np.ones((n_items, n_items)),
                                     index=rating_matrix.columns,
                                     columns=rating_matrix.columns)

    for i in range(n_items):
        for j in range(n_items):
            # .iloc avoids chained assignment, which may silently write to a copy
            similarity_matrix.iloc[i, j] = adjusted_cosine_similarity(adjusted_ratings.iloc[:, i],
                                                                      adjusted_ratings.iloc[:, j])

    return similarity_matrix

get_similarity_matrix(rating_matrix)

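|       | item1 | item2 | item3 | item4 | item5 | item6 |
|-------|-------|-------|-------|-------|-------|-------|
| item1 | 1.0   | 0.02  | 1.0   | -0.41 | -0.82 | -0.76 |
| item2 | 0.02  | 1.0   | -0.04 | 0.58  | -0.44 | -0.43 |
| item3 | 1.0   | -0.04 | 1.0   | -0.17 | -0.87 | -0.81 |
| item4 | -0.41 | 0.58  | -0.17 | 1.0   | 0.07  | -0.2  |
| item5 | -0.82 | -0.44 | -0.87 | 0.07  | 1.0   | 0.96  |
| item6 | -0.76 | -0.43 | -0.81 | -0.2  | 0.96  | 1.0   |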


The above similarity matrix indicates how similar the items are to each other, based on the ratings they received from the users.

* According to the table, item1 and item3 seem to be very similar in terms of user ratings.
* Similarly, item3 seems to be extremely different from item5.

A similarity close to zero means the items are neither similar nor different.

This method has a few problems. Items that were rated only a few times may cause issues, because their similarity rests on very few co-rating users. Having many items to compare may also make the similarity values less meaningful when normalizing.

To reduce the computational complexity of having a large number of items and to improve the overall recommendations, we can use `Neighbourhood Selection` techniques.

1. __Clustering__ - Use clustering algorithms to identify the neighbourhoods.
2. __Top N__ - Define a parameter for the number of neighbours to consider. (Not a good method on its own, since the N closest items are kept even when their similarity is low.)
3. __Threshold__ - Instead of just defining the number of neighbours, additionally define a minimum similarity that a neighbour must reach.

>In the above similarity matrix, if we take the top 2 values as the neighbourhood for item1, the output would be (item2, item3) as the recommendation. But if we add a threshold of 0.5 to keep the recommendation quality, we only get item3 as the recommendation.
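
The Top N and Threshold rules can be sketched as below, using item1's row from the similarity matrix above. The `select_neighbours` helper and its parameters are illustrative names, not part of any library.

```python
import pandas as pd

# item1's row of the similarity matrix above, excluding item1 itself.
item1_similarities = pd.Series(
    {"item2": 0.02, "item3": 1.00, "item4": -0.41, "item5": -0.82, "item6": -0.76}
)

def select_neighbours(similarities, top_n=None, threshold=None):
    """Rank items by similarity, optionally filtered by threshold and capped at top_n."""
    ranked = similarities.sort_values(ascending=False)
    if threshold is not None:
        ranked = ranked[ranked >= threshold]
    if top_n is not None:
        ranked = ranked.head(top_n)
    return list(ranked.index)

print(select_neighbours(item1_similarities, top_n=2))                 # ['item3', 'item2']
print(select_neighbours(item1_similarities, top_n=2, threshold=0.5))  # ['item3']
```

With Top N alone, item2 gets in with a similarity of only 0.02, which is why combining it with a threshold tends to give cleaner neighbourhoods.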

Once we find similar items, we can calculate the rating predictions from them using 2 main methods.
- Regression --> taking the average of the ratings across similar items.
- Classification --> taking the most frequent rating value among similar items.

<center><image src="./images/Rating prediction.jpg" width="500px" /></center>

Using the above formula, we can predict the rating a user would give to an item based on the neighbouring items that the user has also rated.
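
As a rough sketch, one common form of this prediction is the user's mean rating plus a similarity-weighted average of the user's deviations on the neighbouring items. This is the form the online code later in this notebook follows, and the figure may differ in details. The function name and the sample values are hypothetical.

```python
def predict_rating(user_ratings, neighbour_similarities):
    """
    user_ratings: dict of item_id -> rating given by the active user
    neighbour_similarities: dict of item_id -> similarity to the target item
    """
    user_mean = sum(user_ratings.values()) / len(user_ratings)
    weighted_sum = 0.0
    similarity_sum = 0.0
    for item_id, similarity in neighbour_similarities.items():
        if item_id not in user_ratings:
            continue  # only neighbours the user has rated can contribute
        weighted_sum += similarity * (user_ratings[item_id] - user_mean)
        similarity_sum += similarity
    if similarity_sum <= 0:
        return None  # no usable neighbours, so no prediction
    return user_mean + weighted_sum / similarity_sum

# Hypothetical values: the user's mean rating is 3.
user_ratings = {"item2": 4, "item4": 2, "item5": 3}
neighbours = {"item2": 0.8, "item4": 0.2}
prediction = predict_rating(user_ratings, neighbours)
print(round(prediction, 2))  # 3.6
```

The more similar neighbour (item2, rated above the user's mean) dominates, so the prediction lands above the mean.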

In practice the above code would be really slow. Therefore we can use a matrix-based implementation for the initial similarity matrix calculation.
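
As a small-scale illustration of the idea, the loop-based similarity above can be rewritten with matrix products. This is only a sketch on the toy ratings matrix from earlier. It skips the intermediate rounding of the loop version, so values can differ slightly in the last displayed digit. `adjusted_cosine_matrix` is an illustrative name.

```python
import numpy as np
import pandas as pd

# The toy ratings matrix from earlier (rows are users, columns are items).
ratings = pd.DataFrame(
    [[5, 3, np.nan, 2, 2, 2],
     [4, 3, 4, np.nan, 3, 3],
     [5, 2, 5, 2, 1, 1],
     [3, 5, 3, np.nan, 1, 1],
     [3, 3, 3, 2, 4, 5],
     [2, 3, 2, 3, 5, 5]],
    index=[f"user{i}" for i in range(1, 7)],
    columns=[f"item{i}" for i in range(1, 7)],
)

def adjusted_cosine_matrix(ratings):
    """Item-item adjusted cosine, normalised over co-rating users only."""
    centered = ratings.sub(ratings.mean(axis=1), axis=0)
    rated = centered.notna().to_numpy(dtype=float)  # 1.0 where a rating exists
    values = centered.fillna(0.0).to_numpy()        # missing ratings contribute 0
    dot = values.T @ values                         # sums over co-rating users
    squares = values ** 2
    norms_i = np.sqrt(squares.T @ rated)  # [i, j]: norm of item i over users who rated j
    norms_j = np.sqrt(rated.T @ squares)  # [i, j]: norm of item j over users who rated i
    denom = norms_i * norms_j
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.where(denom > 0, dot / denom, 0.0)
    return pd.DataFrame(sim, index=ratings.columns, columns=ratings.columns)

print(adjusted_cosine_matrix(ratings).round(2))
```

The two norm matrices reproduce the "skip users who did not rate both items" rule of the loop version without iterating over item pairs in Python.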

>## Offline Data Processing --> Similarity Matrix calculation

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import coo_matrix

In [2]:
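# Function to normalize the ratings per user (x holds one user's ratings from a groupby)
def normalize(x):
    x = x.astype(float)
    x_sum = x.sum()
    # Counts only non-zero entries, i.e. a rating of 0 is treated as "not rated"
    x_num = x.astype(bool).sum()
    x_mean = x_sum / x_num
    x_range = x.max() - x.min()
    # All ratings equal (or only a single rating): avoid dividing by zero
    if x_range == 0:
        return 0.0
    return (x - x_mean) / x_range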

In [24]:
# Assume that ratings is a dataframe with columns (user_id, movie_id, rating, avg), where
# user_id and movie_id are categorical and avg is assumed to hold the normalized rating
def calculateSimilarityMatrix(ratings, min_similarity=0.5, min_overlap=0.2):

    # Building the sparse matrix with row->movie_id, column->user_id and values->normalized rating
    sparse_ratings = coo_matrix((ratings['avg'].astype(float),
                    (ratings['movie_id'].cat.codes, ratings['user_id'].cat.codes)))

    # Overlap matrix: entry (i, j) counts the users who rated both movie i and movie j
    overlap_matrix = sparse_ratings.astype(bool).astype(int)\
                        .dot(sparse_ratings.transpose().astype(bool).astype(int))

    similarity_matrix = cosine_similarity(sparse_ratings, dense_output=False)

    # Elementwise multiplication with a boolean matrix for filtering results
    similarity_matrix = similarity_matrix.multiply(similarity_matrix > min_similarity) 
    similarity_matrix = similarity_matrix.multiply(overlap_matrix > min_overlap)

    return similarity_matrix


The confusing part is the overlap matrix. For each pair of items, it holds the number of users who rated both. Converting the ratings to booleans turns every stored rating into a 1, so the matrix product counts the co-rating users.

The following rough code segments show the inner workings of the above function.

In [25]:
#                        Movie Rating       movieid        userid
temp = coo_matrix(([3.0,2.0,1.0,4.0,5.0], ([1,1,2,2,5], [6,7,6,9,6])))
print(temp)
print()
cor = cosine_similarity(temp, dense_output=False)
print(cor)

  (1, 6)	3.0
  (1, 7)	2.0
  (2, 6)	1.0
  (2, 9)	4.0
  (5, 6)	5.0

  (1, 5)	0.8320502943378437
  (1, 2)	0.20180183819889375
  (1, 1)	1.0
  (2, 5)	0.24253562503633297
  (2, 2)	1.0
  (2, 1)	0.20180183819889375
  (5, 5)	1.0
  (5, 2)	0.24253562503633297
  (5, 1)	0.8320502943378437


In [26]:
overlap_matrix = temp.astype(bool).astype(int)\
                    .dot(temp.transpose().astype(bool).astype(int))
print(overlap_matrix)

  (1, 5)	1
  (1, 2)	1
  (1, 1)	2
  (2, 5)	1
  (2, 2)	2
  (2, 1)	1
  (5, 5)	1
  (5, 2)	1
  (5, 1)	1


>## Online Data Processing --> For users, predict movies based on ratings

In this part, recommendations are served to users (the inference step) based on the previously calculated similarity matrix.

In [27]:
def recommend_items(active_user_items):
    '''
    active_user_items -> is a dictionary with item_id to user rating mapping
    '''

    mean_user_item_rating = sum(active_user_items.values()) / len(active_user_items)

    # The part below is pseudocode. Ideally we should connect to a database and
    #    get the saved similarity matrix data from there.
    #
    #  Table schema  userid, target, source, rating
    #  (Here target and source are the corresponding row and column items in the similarity matrix.
    #   The loop below reads each row via .target, .source and .similarity.)
    #
    # candidate_items = Similarity_Matrix_DB_Table(('source' is in user_rated_items) \
    #                             & not ('target' is in user_rated_items))
    # candidate_items = candidate_items.order_by('-similarity') # Can limit this to a predefined number
    candidate_items = [] # Defining empty list to make the code runnable

    recommendations = {}

    for item in candidate_items:
        item_id = item.target

        related_items = [i for i in candidate_items if i.target == item_id]

        # Consider only if the number of related items is greater than 1
        if len(related_items) > 1:
            sim_sum = 0
            pre = 0
            for sim_item in related_items:
                # Deviation of the user's rating for the already-rated (source) item
                r = active_user_items[sim_item.source] - mean_user_item_rating
                pre += sim_item.similarity * r
                sim_sum += sim_item.similarity

            # Predict only after all related items have been accumulated
            if sim_sum > 0:
                recommendations[item_id] = {
                    'prediction': mean_user_item_rating + (pre / sim_sum),
                    'sim_items': [rel.source for rel in related_items]}

    sorted_items = sorted(recommendations.items(), key=lambda item: -float(item[1]['prediction']))

    return sorted_items

Note that candidate_items above is a list of similarity rows filtered by the items the active user has already rated. This means it contains the items most similar to the ones the active user rated, according to the previously calculated similarity matrix.
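
If the similarity matrix is held in memory rather than in a database, the candidate list could be built along these lines. This is only a sketch: `SimilarityRow` and `get_candidate_items` are hypothetical names chosen to mirror the `.target`, `.source` and `.similarity` fields used above, and the matrix is a cut-down copy of the earlier example.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical row type mirroring the fields that recommend_items reads.
SimilarityRow = namedtuple("SimilarityRow", ["target", "source", "similarity"])

def get_candidate_items(similarity_matrix, active_user_items, min_similarity=0.0):
    """Return rows whose source was rated by the user and whose target was not."""
    rated = set(active_user_items)
    rows = []
    for source in rated & set(similarity_matrix.columns):
        # Rows of the matrix are targets, columns are sources.
        for target, similarity in similarity_matrix[source].items():
            if target not in rated and similarity > min_similarity:
                rows.append(SimilarityRow(target, source, float(similarity)))
    # Most similar candidates first, like order_by('-similarity').
    return sorted(rows, key=lambda row: -row.similarity)

similarity_matrix = pd.DataFrame(
    [[1.00, 0.02, 1.00],
     [0.02, 1.00, -0.04],
     [1.00, -0.04, 1.00]],
    index=["item1", "item2", "item3"],
    columns=["item1", "item2", "item3"],
)

candidates = get_candidate_items(similarity_matrix, {"item1": 5, "item2": 3})
print(candidates)
```

Here item3 is the only unrated item, and it survives only through item1, because its similarity to item2 is negative.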

>## Pros and Cons of Collaborative Filtering

### Cons
- Sparsity - one of the main problems
- Gray Sheep problem - Users with bizarre tastes
- Number of ratings - a considerable number of ratings is needed beforehand, otherwise we hit the Cold start problem
- Similarity - Bias towards popular items due to their massive number of ratings.
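
To make the sparsity and number-of-ratings points concrete, here is a quick sketch that measures both on a small, made-up ratings matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings matrix (rows are users, columns are items, NaN is unrated).
ratings = pd.DataFrame(
    [[5, np.nan, np.nan, 2],
     [np.nan, 3, np.nan, 4],
     [4, np.nan, np.nan, 5]],
    index=["user1", "user2", "user3"],
    columns=["item1", "item2", "item3", "item4"],
)

# Fraction of (user, item) cells without a rating.
sparsity = ratings.isna().to_numpy().mean()

# Items with very few ratings are hard to compare (cold start); popular ones dominate.
ratings_per_item = ratings.notna().sum(axis=0)

print(f"sparsity: {sparsity:.2f}")
print(ratings_per_item)
```

In this toy case item3 has no ratings at all, so no similarity can be computed for it, while item4 is rated by everyone.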

### Pros
- Content agnostic - No need for extra metadata about the items to get recommendations.
- More user centric recommendations