## __Collaborative Filtering__

A recommender system seeks to predict or filter items according to a user’s preferences. Recommender systems are used in a variety of areas, including movies, music, news, books, research articles, search queries, social tags, and products in general.

Like many machine learning techniques, a recommender system makes predictions based on users’ historical behavior. Specifically, it predicts a user’s preference for a set of items based on past experience. The two most popular approaches to building a recommender system are content-based filtering and collaborative filtering.

Recommender systems produce a list of recommendations in one of two ways:

<img src = '0_img.png'>

1.    **Collaborative filtering**

Collaborative filtering approaches build a model from a user’s past behavior (i.e. items purchased or searched for by the user) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may be interested in.

Unlike the content-based approach, collaborative filtering needs nothing except users’ historical preferences on a set of items. Because it is based on historical data, the core assumption is that users who have agreed in the past tend to agree in the future.

2.    **Content-based filtering**

Content-based filtering approaches use a series of discrete characteristics of an item to recommend additional items with similar properties. These methods rely entirely on a description of the item and a profile of the user’s preferences, and recommend items based on the user’s past preferences.

The content-based approach requires a good amount of information about the items’ own features, rather than users’ interactions and feedback. For example, this can be movie attributes such as genre, year, director, and actors, or the textual content of articles, which can be extracted by applying Natural Language Processing.
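As a rough illustration of the content-based idea, the sketch below scores unseen items by how close their features are to a profile built from items the user liked. The genre columns, the item vectors, and the `liked` list are all made-up assumptions for this example, not data from this notebook.

```python
import numpy as np

# Hypothetical one-hot genre features: columns = [action, comedy, drama]
item_features = np.array([
    [1.0, 0.0, 0.0],  # item 0: action
    [1.0, 0.0, 1.0],  # item 1: action + drama
    [0.0, 1.0, 0.0],  # item 2: comedy
    [0.0, 0.0, 1.0],  # item 3: drama
])

liked = [0]  # assumed: the user liked item 0

# User profile: average feature vector of the liked items
profile = item_features[liked].mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every item the user has not consumed yet
scores = {
    i: cosine(profile, item_features[i])
    for i in range(len(item_features))
    if i not in liked
}
best = max(scores, key=scores.get)
print(best, scores)
```

Note that no other user’s ratings appear anywhere in this sketch, which is exactly what separates it from collaborative filtering.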

<hr>

### __Disadvantages of Content-Based Filtering__

A few of them are:

-    Content-based recommender systems tend to over-specialize. They recommend items similar to those already consumed and tend to create a “filter bubble”, leaving little room for expanding a user’s interests.
-    Limited content analysis: if the content doesn’t contain enough information to discriminate between items precisely, the recommendations will be poor, so hand-engineered features or manually assigned tags are required.

<hr>

### __What is Collaborative Filtering?__

Collaborative filtering (CF) systems work by collecting user feedback in the form of ratings for items in a given domain and exploiting similarities in rating behavior among several users in determining how to recommend an item.

CF accumulates customer product ratings, identifies customers with common ratings, and offers recommendations based on inter-customer comparisons. It’s based on the idea that people who agreed in their evaluations of certain items in the past are likely to agree again in the future. For example, most people ask their trusted friends for restaurant or movie suggestions.

Collaborative filtering models are based on an assumption that people like things similar to other things they like, and things that are liked by other people with similar taste.

<hr>

### __Two Major Collaborative Filtering Techniques__

### __1. Memory-based approach:__

This approach takes a matrix of users’ preferences for items and uses it to predict missing preferences and recommend the items with the highest predictions. Simply stated:

``Item-Item Collaborative Filtering: “Users who liked this item also liked …”``

``User-User Collaborative Filtering: “Users who are similar to you also liked …”``

<img src='https://miro.medium.com/max/700/1*7bFc9R97Z4jKK6J2jaSUlw.jpeg'>
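To make the “matrix of preferences” concrete, here is a minimal sketch that pivots long-format ratings into a user-item matrix with pandas. The `ratings` frame and its values are hypothetical; only the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical long-format ratings, one row per (user, item, rating)
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 20, 10, 30, 20],
    "rating":  [5, 3, 4, 2, 1],
})

# Rows are users, columns are items; NaN marks a missing preference
user_item = ratings.pivot_table(index="user_id", columns="item_id", values="rating")
print(user_item)
```

The NaN cells are the missing preferences that a memory-based method tries to fill in.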

### __2. Model-based approach__

In this approach, CF models are developed using machine learning algorithms to predict a user’s rating of unrated items. Such models and techniques include k-nearest neighbors, clustering, matrix factorization, and deep learning models such as autoencoders, as well as embeddings that serve as low-dimensional hidden factors for items and users.

<img src='https://miro.medium.com/max/700/1*K5BOY3B93MLn173VVzOW0Q.png'>
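One of the model-based techniques listed above, matrix factorization, can be sketched with a truncated SVD. This is a simplified illustration under strong assumptions: the toy matrix `R` is made up, unrated cells are treated as 0 (real systems usually fit only the observed ratings), and the choice of `k = 2` latent factors is arbitrary.

```python
import numpy as np

# Hypothetical 4 users x 4 items rating matrix; 0 marks "not rated"
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

k = 2  # number of latent factors (arbitrary for this toy example)

# Truncated SVD: keep only the k largest singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # shape (n_users, k)
item_factors = Vt[:k, :]          # shape (k, n_items)

# The low-rank product gives a score for every (user, item) pair
R_hat = user_factors @ item_factors
print(np.round(R_hat, 2))
```

Each user and each item is now described by `k` hidden factors, and a predicted score is the dot product of the two.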

<hr>

## **User-Based Collaborative Filtering (UB-CF)**

Imagine that we want to recommend a movie to our friend Stanley. We could assume that similar people will have similar taste. Suppose that Stanley and I have seen the same movies and rated them all almost identically. But Stanley hasn’t seen ‘The Godfather: Part II’ and I have. If I love that movie, it seems logical to think that he will too. With that, we have created an artificial rating based on our similarity.

UB-CF uses that logic and recommends items by finding users similar to the active user (the one to whom we are trying to recommend a movie). A specific application of this is the user-based Nearest Neighbor algorithm, which involves two steps:



1. Find the K-nearest neighbors (KNN) to the user a, using a similarity function w to measure the distance between each pair of users:

<img src='a_img.png' width=550 height=300>

2. Predict the rating that user a will give to all items the k neighbors have consumed but a has not. We then look for the item j with the best predicted rating.

In other words, we create a User-Item Matrix and predict the ratings on items the active user has not seen, based on the ratings of similar users. This technique is **memory-based**.

<img src= 'b_img.png' width=400 height=400>
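The two steps above can be sketched in a few lines of NumPy. This is a toy version under stated assumptions: the matrix `R` is invented, 0 means “not rated”, cosine similarity over co-rated items stands in for the similarity function w, and the helper names `cosine_sim` and `predict` are hypothetical.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 = not rated
R = np.array([
    [5.0, 4.0, 1.0, 0.0],   # active user a (item 3 is unseen)
    [5.0, 5.0, 1.0, 5.0],   # similar taste
    [4.0, 4.0, 2.0, 4.0],   # similar taste
    [1.0, 1.0, 5.0, 1.0],   # different taste
])

def cosine_sim(u, v):
    """Cosine similarity over the items both users have rated."""
    both = (u > 0) & (v > 0)
    if not both.any():
        return 0.0
    return float(u[both] @ v[both] / (np.linalg.norm(u[both]) * np.linalg.norm(v[both])))

def predict(R, a, j, k=2):
    """Predict user a's rating of item j from the k most similar users who rated j."""
    # Step 1: similarity between a and every other user
    sims = np.array([cosine_sim(R[a], R[u]) if u != a else -np.inf for u in range(len(R))])
    # Keep the k most similar users that actually rated item j
    neighbors = [u for u in np.argsort(-sims) if u != a and R[u, j] > 0][:k]
    if not neighbors:
        return float("nan")
    # Step 2: similarity-weighted average of the neighbors' ratings
    weights = sims[neighbors]
    return float(weights @ R[neighbors, j] / np.abs(weights).sum())

pred = predict(R, a=0, j=3)
print(round(pred, 2))
```

With these made-up numbers, the two like-minded users dominate the prediction, so the estimate lands between their ratings of 4 and 5 rather than near the dissimilar user’s rating of 1.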

### **PROS:**

-    Easy to implement.
-    Context independent.
-    Often more accurate than other techniques, such as content-based filtering.

### **CONS:**

-    Sparsity: only a small percentage of users rate items, so most entries of the user-item matrix are empty.
-    Scalability: up to a certain threshold, the more neighbors K we consider, the better the predictions should be. However, the more users there are in the system, the greater the cost of finding the K nearest neighbors.
-    Cold-start: new users have little or no information about them to compare with other users.
-    New item: similarly, new items lack the ratings needed to create a solid ranking (more on this in ‘How to sort and rank items’).
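The sparsity problem is easy to quantify. The snippet below is a small sketch on a made-up matrix (with `np.nan` marking missing ratings) that measures the fraction of empty cells.

```python
import numpy as np

# Hypothetical ratings matrix; np.nan marks a missing rating
R = np.array([
    [5.0, np.nan, np.nan, 1.0],
    [np.nan, np.nan, 4.0, np.nan],
    [np.nan, 2.0, np.nan, np.nan],
])

n_rated = np.count_nonzero(~np.isnan(R))   # number of observed ratings
sparsity = 1.0 - n_rated / R.size          # fraction of empty cells
print(f"{n_rated} of {R.size} cells rated, sparsity = {sparsity:.2f}")
```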

<hr>

### __K-Nearest Neighbors__

The k-nearest neighbors (KNN) algorithm doesn’t make any assumptions about the underlying data distribution; instead, it relies on item feature similarity. When KNN makes a prediction about a movie, it calculates the “distance” (distance metrics are discussed later) between the target movie and every other movie in its database. It then ranks the distances and returns the top k nearest neighbor movies as the most similar movie recommendations.
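The ranking procedure just described can be written as a brute-force search. In this sketch the movie names, the rating vectors in `X`, the use of cosine distance, and the helper `top_k_similar` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical item vectors: each row is a movie, each column one user's rating (0 = not rated)
movies = ["A", "B", "C", "D"]
X = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [5.0, 5.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0],
    [4.0, 4.0, 1.0, 0.0],
])

def top_k_similar(X, target, k=2):
    """Return indices of the k rows closest to row `target` by cosine distance."""
    norms = np.linalg.norm(X, axis=1)
    cos_sim = (X @ X[target]) / (norms * norms[target])
    dist = 1.0 - cos_sim
    dist[target] = np.inf          # exclude the movie itself
    return np.argsort(dist)[:k]    # smallest distances first

neighbors = top_k_similar(X, target=0, k=2)
print([movies[i] for i in neighbors])
```

Because every query compares the target against all other rows, the cost grows with the size of the catalogue, which is the scalability concern raised earlier.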

<hr>

The standard method of Collaborative Filtering is known as the **Nearest Neighborhood** algorithm. There are two variants: user-based CF and item-based CF. Let’s first look at **User-based CF**. We have an n × m matrix of ratings, with users uᵢ, i = 1, …, n and items pⱼ, j = 1, …, m. We want to predict the rating rᵢⱼ when target user i has not watched or rated item j. The process is to calculate the similarities between target user i and all other users, select the top X most similar users, and take the weighted average of the ratings from these X users, with the similarities as weights.

<img src = 'c_img.png'>

<img src = 'd_img.png'  width=600 height=300>

Different people may have different baselines when giving ratings: some tend to give high scores in general, while others are strict even when they are satisfied with an item. To avoid this bias, we can subtract each user’s average rating over all items when computing the weighted average, and then add the target user’s average back, as shown below.

<img src = 'e_img.png'  width=600 height=300>
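Here is one way the mean-centering step might look in code. It is a sketch under assumptions: the ratings in `R` and the similarity values in `sims` are invented, the deviations are normalized by the sum of absolute similarities (a common convention that may differ from the figure), and `predict_centered` is a hypothetical helper.

```python
import numpy as np

# Hypothetical ratings; np.nan marks items a user has not rated
R = np.array([
    [4.0, 5.0, np.nan],   # target user (generous rater), item 2 is unseen
    [2.0, 3.0, 3.0],      # strict rater
    [5.0, 4.0, 2.0],
])
sims = np.array([0.9, 0.3])   # assumed similarities of users 1 and 2 to user 0

def predict_centered(R, sims, target, others, item):
    """Target user's mean plus the similarity-weighted average of neighbors' deviations."""
    target_mean = np.nanmean(R[target])
    other_means = np.nanmean(R[others], axis=1)
    deviations = R[others, item] - other_means   # how far each neighbor strays from their own mean
    return float(target_mean + sims @ deviations / np.abs(sims).sum())

pred = predict_centered(R, sims, target=0, others=[1, 2], item=2)
print(round(pred, 2))
```

The strict rater’s 3 counts as “slightly above their average” rather than as a mediocre score, which is the point of subtracting each user’s mean.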

<hr>

### **Two ways to calculate similarity: Pearson Correlation and Cosine Similarity**

<img src = 'f_img.png' width=600 height=300>

<img src = 'g_img.png' width=600 height=300>
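Both measures are short to implement. The sketch below uses two made-up rating vectors, `a` and `b`, where `b` follows the same pattern as `a` but sits one point lower, to show how the two measures treat a constant offset differently.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_correlation(u, v):
    """Cosine similarity of the mean-centered vectors."""
    return cosine_similarity(u - u.mean(), v - v.mean())

a = np.array([5.0, 4.0, 2.0, 3.0])
b = np.array([4.0, 3.0, 1.0, 2.0])  # same pattern as a, shifted down by 1

cos_ab = cosine_similarity(a, b)
pearson_ab = pearson_correlation(a, b)
print(cos_ab, pearson_ab)
```

Pearson correlation ignores the offset between a generous and a strict rater, while cosine similarity on raw ratings does not.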

Basically, the idea is to find the users most similar to your target user (the nearest neighbors) and use their weighted ratings of an item as the predicted rating of that item for the target user.

Without knowing anything about the items and users themselves, we consider two users similar when they give the same items similar ratings. Analogously, for Item-based CF, we say two items are similar when they receive similar ratings from the same user. We then make a prediction for a target user on an item by calculating the weighted average of that user’s ratings on the X most similar items. One key advantage of Item-based CF is stability: the ratings on a given item do not change significantly over time, unlike the tastes of human beings.

<img src = 'h_img.png'>
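The item-based variant can be sketched the same way, with similarities computed between item columns instead of user rows. The matrix `R`, the choice of cosine similarity over co-rating users, and the helper names are assumptions for illustration; the sketch also assumes every pair of items shares at least one rater.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 = not rated
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [5.0, 0.0, 1.0],   # target user has not rated item 1
])

def item_similarity(R, i, j):
    """Cosine similarity between item columns i and j over users who rated both."""
    both = (R[:, i] > 0) & (R[:, j] > 0)
    a, b = R[both, i], R[both, j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_item_based(R, user, item, x=2):
    """Weighted average of the user's own ratings on the x items most similar to `item`."""
    rated = np.array([j for j in range(R.shape[1]) if j != item and R[user, j] > 0])
    sims = np.array([item_similarity(R, item, j) for j in rated])
    top = np.argsort(-sims)[:x]    # positions of the x most similar rated items
    return float(sims[top] @ R[user, rated[top]] / np.abs(sims[top]).sum())

pred = predict_item_based(R, user=3, item=1)
print(round(pred, 2))
```

Here the prediction is pulled toward the target user’s rating of item 0, because item 0 is rated much more like item 1 than item 2 is.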

This method has quite a few limitations. It doesn’t handle sparsity well: when no one in the neighborhood has rated the item you are trying to predict for the target user, no prediction can be made. It is also not computationally efficient as the number of users and products grows.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors  # unsupervised nearest-neighbor search

In [4]:
df = pd.read_csv('file.tsv', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
df

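|  | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 |
| 1 | 0 | 172 | 5 | 881250949 |
| 2 | 0 | 133 | 1 | 881250949 |
| 3 | 196 | 242 | 3 | 881250949 |
| 4 | 186 | 302 | 3 | 891717742 |
| ... | ... | ... | ... | ... |
| 99998 | 880 | 476 | 3 | 880175444 |
| 99999 | 716 | 204 | 5 | 879795543 |
| 100000 | 276 | 1090 | 1 | 874795795 |
| 100001 | 13 | 225 | 2 | 882399156 |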


In [5]:
movie = pd.read_csv('Movie_Id_Titles.csv')
movie.head()

|  | item_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


In [6]:
movie[movie['item_id'] == 50]

|  | item_id | title |
|---|---|---|
| 49 | 50 | Star Wars (1977) |


In [7]:
data = pd.merge(df, movie, on='item_id')
data.head()

|  | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 | Star Wars (1977) |
| 1 | 290 | 50 | 5 | 880473582 | Star Wars (1977) |
| 2 | 79 | 50 | 4 | 891271545 | Star Wars (1977) |
| 3 | 2 | 50 | 5 | 888552084 | Star Wars (1977) |
| 4 | 8 | 50 | 5 | 879362124 | Star Wars (1977) |
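Given the `csr_matrix` and `NearestNeighbors` imports at the top of the notebook, a plausible next step is to pivot the merged ratings into a title-by-user matrix and query it for similar movies. The sketch below is only an assumption about where this could go: it uses a tiny made-up frame named `toy` with the same column names as `data`, fills unrated cells with 0, and picks cosine distance with brute-force search.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Tiny stand-in with the same columns as the merged `data` frame (values are made up)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "title":   ["Movie A", "Movie B", "Movie C"] * 3,
    "rating":  [5, 5, 1, 4, 5, 2, 1, 2, 5],
})

# Rows are titles, columns are users; unrated cells become 0
item_user = toy.pivot_table(index="title", columns="user_id", values="rating").fillna(0)
sparse = csr_matrix(item_user.values)

# Brute-force cosine KNN over the title vectors
knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
knn.fit(sparse)

# Query the neighbors of "Movie A"; the closest hit is the movie itself
row = item_user.index.get_loc("Movie A")
distances, indices = knn.kneighbors(sparse[row], n_neighbors=2)
similar = [item_user.index[i] for i in indices[0] if i != row]
print(similar)
```

On the real data, the same pattern would apply to `data` in place of `toy`, with a larger `n_neighbors` and a real title such as the one looked up earlier.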
