# Collaborative filtering practice

In this homework you will test different collaborative filtering (CF) approaches on the well-known MovieLens dataset.

In class we implemented item2item CF, so this time let's use the **user2user** approach.

In [6]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split

## Task 0: Dataset (5 points)

Load the [movielens](https://grouplens.org/datasets/movielens/) dataset with your preferred provider (e.g. recball, implicit, [scikit surprise](https://surprise.readthedocs.io/en/stable/dataset.html), etc.)

Split the dataset into train and validation parts (ideally with respect to timestamps).

Don't forget to encode users and items from 0 to maximum!
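
One possible way to follow both hints, shown on a tiny hand-made frame rather than the real `ratings.csv`: `pd.factorize` maps arbitrary ids to consecutive integers starting at 0, and sorting by `timestamp` sends the most recent interactions to validation. The helper name `encode_and_split`, the new `user`/`item` columns and the 80/20 ratio are assumptions made for illustration only.

```python
import pandas as pd

def encode_and_split(ratings: pd.DataFrame, val_fraction: float = 0.2):
    """Encode ids to 0..N-1 and split chronologically (hypothetical helper)."""
    ratings = ratings.copy()
    # factorize returns (codes, uniques); codes are consecutive ints from 0
    ratings["user"], _ = pd.factorize(ratings["userId"])
    ratings["item"], _ = pd.factorize(ratings["movieId"])
    # Oldest interactions go to train, the most recent ones to validation
    ratings = ratings.sort_values("timestamp", kind="stable")
    n_val = max(1, int(len(ratings) * val_fraction))
    return ratings.iloc[:-n_val], ratings.iloc[-n_val:]

# Toy data, not taken from MovieLens
toy = pd.DataFrame({
    "userId": [10, 10, 42, 42, 7],
    "movieId": [500, 900, 500, 31, 900],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
    "timestamp": [100, 300, 200, 500, 400],
})
train, val = encode_and_split(toy)
```

With a chronological split some users or items may appear only in the validation part, so the evaluation code has to decide how to treat them.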

In [None]:
# Downloading and unpacking movie dataset 
!curl -o ml-latest-small.zip https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!tar -xf ml-latest-small.zip


The original dataset consists of 4 files: `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`

In [None]:
pd.set_option('display.max_rows', 200)

ratings = pd.read_csv('ml-latest-small/ratings.csv').set_index(['userId', 'movieId'])
left = ratings.loc[1].rating
right = ratings.loc[2].rating
a = left*right #np.logical_and((left!=0), (right!=0))
b = a[~ a.isna()]
sim_pearson(left, right)

In [None]:
indexes = np.where(np.logical_or(left, right))
x, y = left[indexes[0]], right[indexes[0]]

x = (x-np.mean(x)) / np.std(x)
y = (y-np.mean(y)) / np.std(y)
sim = np.mean(x*y) 

In [109]:
df = pd.read_csv('ml-latest-small/ratings.csv').set_index(['userId', 'movieId'])
X_sample, y_sample = df.drop(columns=['rating']), df[['rating']]

In [None]:
# Split for test and train
X_train, X_val, y_train, y_val = train_test_split(
    X_sample, y_sample,
    test_size=0.2, 
    random_state=42,
    shuffle=True,
    stratify=y_sample
)

Let's look at the train part of the dataset; it is indexed by a (`userId`, `movieId`) multi-index.

In [116]:
display(X_train.head())
display(y_train.head())

| userId | movieId | timestamp |
|---|---|---|
| 599 | 719 | 1498525383 |
| 132 | 4018 | 1157978825 |
| 475 | 1676 | 1498031862 |
| 462 | 8848 | 1174690660 |
| 23 | 3000 | 1107164425 |


| userId | movieId | rating |
|---|---|---|
| 599 | 719 | 3.0 |
| 132 | 4018 | 3.0 |
| 475 | 1676 | 4.5 |
| 462 | 8848 | 3.0 |
| 23 | 3000 | 3.5 |


In [None]:
import seaborn as sns
# Pass the ratings by keyword so they are used as the x variable
sns.countplot(x=y_sample.rating)

## Task 1: Similarities (5 points each)

You need to implement 4 similarity functions:
1. Dot product (intersection)
$$s(u, v)=\sum \limits_{i\in I_u\cup I_v}r_{ui}r_{vi}$$
2. Jaccard index (intersection over union)
$$s(u, v)=\frac{|I_u\cap I_v|}{|I_u\cup I_v|}$$
3. Pearson correlation between vectors of common ratings
$$s(u, v)=\frac{\sum \limits_{i\in I_u\cup I_v}(r_{ui}-\bar{r}_u)(r_{vi}-\bar{r}_v)}{\sqrt{\sum \limits_{i\in I_u\cup I_v}(r_{ui}-\bar{r}_u)^2}\sqrt{\sum \limits_{i\in I_u\cup I_v}(r_{vi}-\bar{r}_v)^2}}$$
4. Pearson correlation with decreasing coefficient
$$s(u, v)=\min\left(\frac{|I_u\cap I_v|}{50}, 1\right)\frac{\sum \limits_{i\in I_u\cup I_v}(r_{ui}-\bar{r}_u)(r_{vi}-\bar{r}_v)}{\sqrt{\sum \limits_{i\in I_u\cup I_v}(r_{ui}-\bar{r}_u)^2}\sqrt{\sum \limits_{i\in I_u\cup I_v}(r_{vi}-\bar{r}_v)^2}}$$

There are two ways to find similarities: `User2User` and `Item2Item`

In [1]:
def sim_dot(left, right) -> float:
    '''Dot product similarity

    Args:
        left: first user ratings
        right: second user ratings

    Returns:
        Similarity score for this pair
    '''
    sim = np.sum(left*right)
    return sim

In [None]:
def sim_jacc(left, right) -> float:
    '''Jaccard index similarity

    Args:
        left: first user ratings as a 1-D numpy array (0 = not rated)
        right: second user ratings in the same format

    Returns:
        Similarity score for this pair
    '''
    intersection = np.logical_and((left!=0), (right!=0))
    union = np.logical_or((left!=0), (right!=0))
    # Two users with no ratings at all share nothing; avoid dividing by zero
    if not np.any(union):
        return 0.0
    sim = np.sum(intersection) / np.sum(union)
    return sim

In [None]:
def sim_pearson(left, right) -> float:
    '''Pearson correlation similarity

    Args:
        left: first user ratings as a 1-D numpy array (0 = not rated)
        right: second user ratings in the same format

    Returns:
        Similarity score for this pair
    '''
    # Keep items rated by at least one of the two users
    indexes = np.where(np.logical_or(left, right))
    x, y = left[indexes], right[indexes]

    # Standardize both vectors; the mean of their product is the Pearson correlation
    x = (x-np.mean(x)) / np.std(x)
    y = (y-np.mean(y)) / np.std(y)
    sim = np.mean(x*y)
    return sim

In [None]:
def sim_pearson_decreasing(left, right) -> float:
    '''Pearson correlation similarity which decreases on small intersection

    Args:
        left: first user ratings as a 1-D numpy array (0 = not rated)
        right: second user ratings in the same format

    Returns:
        Similarity score for this pair
    '''
    sim = sim_pearson(left, right)
    intersection = np.logical_and((left!=0), (right!=0))
    coef = min(np.sum(intersection)/50, 1)

    return coef*sim

In [21]:
sim_pearson_decreasing(np.array([1, 0, 2]), np.array([0, 2, 10]))

np.float64(0.015118578920369085)
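
Task 1 describes Pearson correlation over common ratings, whereas `sim_pearson` above builds its mask with `np.logical_or`, so items rated by only one of the users enter the computation as zeros. For comparison, here is a minimal sketch that restricts the computation to co-rated items. The name `sim_pearson_common` and the choice to return 0.0 when the correlation is undefined are assumptions, not part of the assignment.

```python
import numpy as np

def sim_pearson_common(left: np.ndarray, right: np.ndarray) -> float:
    """Pearson correlation over items both users rated (0 means 'not rated')."""
    common = (left != 0) & (right != 0)
    if common.sum() < 2:
        return 0.0  # assumption: an undefined correlation counts as no similarity
    x = left[common].astype(float)
    y = right[common].astype(float)
    x -= x.mean()
    y -= y.mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    if denom == 0:
        return 0.0  # one user gave the same rating to every common item
    return float((x * y).sum() / denom)

# Toy vectors: items 0, 1 and 4 are rated by both users
u = np.array([5, 3, 0, 1, 4])
v = np.array([4, 2, 5, 0, 5])
score = sim_pearson_common(u, v)
```

On the two vectors from the cell above (`[1, 0, 2]` and `[0, 2, 10]`) only one item is co-rated, so this variant falls back to 0.0 instead of producing a correlation.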

## Task 2: Collaborative filtering algorithm (5 points each)

Now you have several options to use similarities for ratings prediction:
1. Simple averaging
$$\hat{r}_{ui}=\frac{\Sigma_{v\in N(u)}s(u,v)r_{vi}}{\Sigma_{v\in N(u)}|s(u,v)|}$$
2. Mean corrected averaging
$$\hat{r}_{ui}=\bar{r}_u + \frac{\Sigma_{v\in N(u)}s(u,v)(r_{vi}-\bar{r}_{v})}{\Sigma_{v\in N(u)}|s(u,v)|}$$

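A variant of the mean-corrected form additionally scales the deviations by each user's rating standard deviation $\sigma$:
$$\hat{r}_{ui}=\bar{r}_u + \sigma_u\frac{\Sigma_{v\in N(u)}s(u,v)(r_{vi}-\bar{r}_{v})/\sigma_v}{\Sigma_{v\in N(u)}|s(u,v)|}$$

Implement both options.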
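
As a rough illustration of options 1 and 2, the sketch below predicts a single rating from a dense user x item matrix (0 = not rated) and a precomputed user x user similarity matrix. Several things here are assumptions rather than part of the assignment: the helper name `predict_rating`, taking $N(u)$ to be all other users who rated the item, computing $\bar{r}$ over rated items only, and falling back to the user's mean when no neighbour has a non-zero similarity.

```python
import numpy as np

def predict_rating(ratings, sims, user, item, mean_correct=False):
    """Predict r_ui from the neighbours' ratings (hypothetical helper)."""
    rated = ratings[:, item] != 0
    rated[user] = False                      # a user is not their own neighbour
    weights = sims[user, rated]
    norm = np.abs(weights).sum()
    # Mean over rated items only, so zeros (unrated) do not drag it down
    counts = np.maximum((ratings != 0).sum(axis=1), 1)
    means = ratings.sum(axis=1) / counts
    if norm == 0:
        return float(means[user])            # assumption: fall back to the user's mean
    neighbour_ratings = ratings[rated, item]
    if mean_correct:
        return float(means[user] + weights @ (neighbour_ratings - means[rated]) / norm)
    return float(weights @ neighbour_ratings / norm)

# Toy data: 3 users x 3 items and a made-up similarity matrix
R = np.array([[5.0, 0.0, 3.0],
              [4.0, 2.0, 0.0],
              [1.0, 5.0, 4.0]])
S = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.1],
              [0.25, 0.1, 1.0]])
simple = predict_rating(R, S, user=0, item=1)
corrected = predict_rating(R, S, user=0, item=1, mean_correct=True)
```

For user 0 and item 1 the simple average is $(0.5\cdot 2 + 0.25\cdot 5)/0.75 = 3$, while the mean-corrected version starts from user 0's own mean of 4 and shifts it by the neighbours' weighted deviations.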

In [None]:
class UserBasedCf:
    '''User2user collaborative filtering algorithm'''
    def __init__(self, sim_fn,  mean_correct: bool = False):
        self.sim_fn = sim_fn
        self.mean_correct = mean_correct

    def calc_sim_matrix(self, feedbacks):
        '''Fills matrix of user similarities

        Args:
            feedbacks: numpy array with ratings
        '''
        self.feedbacks = feedbacks

        self.sim_matrix = ... # your code here

    def recommend(self, user: int, n: int):
        '''Computes most relevant unseen items for the user

        Args:
            user: user_id for which to provide recommendations
            n: how many items to return
        '''
        recommended = ...  # your code here
        return recommended[:n]
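
One straightforward way to fill a user x user similarity matrix is a pairwise loop over the rows of the feedback matrix. The sketch below is only an illustration: `pairwise_similarity` is a hypothetical helper, the local `jaccard` function is a stand-in so the example runs on its own, and leaving the diagonal at zero (so that a user never counts as their own neighbour) is an assumption.

```python
import numpy as np

def pairwise_similarity(feedbacks: np.ndarray, sim_fn) -> np.ndarray:
    """Symmetric user x user similarity matrix built with a pairwise loop."""
    n_users = feedbacks.shape[0]
    sim_matrix = np.zeros((n_users, n_users))
    for u in range(n_users):
        for v in range(u + 1, n_users):
            # Assumes sim_fn is symmetric, so both halves get the same value
            sim_matrix[u, v] = sim_matrix[v, u] = sim_fn(feedbacks[u], feedbacks[v])
    return sim_matrix

def jaccard(left, right):
    # Stand-in similarity: share of items rated by both among items rated by either
    union = np.sum((left != 0) | (right != 0))
    return np.sum((left != 0) & (right != 0)) / union if union else 0.0

toy = np.array([[5, 0, 3],
                [4, 2, 0],
                [0, 5, 4]])
S = pairwise_similarity(toy, jaccard)
```

The loop calls `sim_fn` once per user pair, which is simple but slow in pure Python; similarities such as the dot product can instead be computed for all pairs at once with a matrix product.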

This gives you 8 different recommendation methods (each of the two CF prediction modes can be combined with every similarity score).

## Task 3: Apply models

1. For every algorithm variation (similarity + prediction), train the model and compute recommendations for the validation part. (10 points)

In [None]:
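from functools import partial

similiarities = [
    sim_dot,
    sim_jacc,
    sim_pearson,
    sim_pearson_decreasing
]

# Model factories: each takes a similarity function and returns a UserBasedCf
simple_averaging = partial(UserBasedCf, mean_correct=False)
mean_corrected_averaging = partial(UserBasedCf, mean_correct=True)

predictions = [
    simple_averaging,
    mean_corrected_averaging
]

# Dense user x item matrix of training ratings (0 = not rated)
train_matrix = y_train.rating.unstack(fill_value=0).to_numpy()

for sim in similiarities:
    for pred in predictions:
        model = pred(sim)
        model.calc_sim_matrix(train_matrix)

        # Top-5 recommendations for every user (row position in train_matrix)
        y_pred = [model.recommend(user, n=5) for user in range(len(train_matrix))]
        print(y_pred)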

2. Which metrics do you want to use? Why? (5 points)

`precision@k` = (number of relevant items in top K) / K
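
A minimal sketch of this metric for a single user follows. The evaluation cell further down refers to a function called `precision_at_k` without defining it; the signature used here (relevant items first, ranked recommendations second, `k` defaulting to 5) is an assumption, and the item ids are made up.

```python
def precision_at_k(relevant, recommended, k=5):
    """Share of the first k recommended items that are relevant."""
    relevant = set(relevant)
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Divide by k, as in the definition above, even if fewer than k items were returned
    return hits / k

relevant_items = {1, 3, 7}           # items the user actually liked (toy data)
ranked_items = [3, 9, 1, 4, 8]       # model output, best first (toy data)
p_at_5 = precision_at_k(relevant_items, ranked_items, k=5)
p_at_2 = precision_at_k(relevant_items, ranked_items, k=2)
```

To score a whole model, this per-user value would typically be averaged over all validation users.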

3. Show that your implementation is relevant by computing metrics. Compare algorithms by creating a table with metrics. (5 points)

In [None]:
from itertools import product

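CLASSIFICATION_METRICS = {
    "precision@k": precision_at_k,
}
metrics_df = pd.DataFrame()

def evaluate_classification(model_name, prefix, y_true, y_pred):
    # One row per model, one column per prefixed metric
    for metric_name, metric_function in CLASSIFICATION_METRICS.items():
        metrics_df.loc[model_name, prefix + metric_name] = metric_function(y_true, y_pred)

for sim, pred in product(similiarities, predictions):
    # Row label built from the two components; "val_" marks the validation split
    model_name = f"{getattr(sim, '__name__', sim)} + {getattr(pred, '__name__', pred)}"
    # NOTE: y_pred must hold the predictions of this particular (sim, pred) model
    evaluate_classification(model_name, "val_", y_val, y_pred)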

4. Predict top-5 recommendations for each user. Show the distribution of items by how many times each item is recommended in the top-5.\
The axes are: X - how many times an item appears in the top-5 recommendations across all users, Y - the number of such items. (10 points)

In [None]:
import matplotlib.pyplot as plt
plt.plot()
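
The counting behind this plot can be sketched with two nested `Counter` objects: the first counts how often each item is recommended, the second counts how many items share each frequency. The names `top_n_distribution` and `plot_distribution`, and the toy recommendation lists, are assumptions for illustration only.

```python
from collections import Counter

def top_n_distribution(recommendations):
    """Map 'times recommended' -> 'number of items recommended that many times'."""
    item_counts = Counter(item for recs in recommendations.values() for item in recs)
    return Counter(item_counts.values())

def plot_distribution(dist):
    import matplotlib.pyplot as plt  # imported lazily; only needed for the plot
    xs = sorted(dist)
    plt.bar(xs, [dist[x] for x in xs])
    plt.xlabel("times an item appears in top-5 recommendations")
    plt.ylabel("number of items")
    plt.show()

# Toy recommendations: user -> recommended item ids
toy_recs = {0: [10, 11, 12], 1: [10, 12, 13], 2: [10, 14, 15]}
dist = top_n_distribution(toy_recs)
# plot_distribution(dist)  # uncomment to draw the bar chart
```

In the toy data item 10 is recommended three times, item 12 twice and the remaining four items once each, so the bars would have heights 4, 1 and 1 at x = 1, 2 and 3.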

## Task 4: Your favorite films

1. Choose 10 to 50 films that you have rated (you can export them from IMDB or kinopoisk) and that are present in the MovieLens dataset. Print them in human-readable form (5 points)

In [76]:
my_movies = '''
1 + 1
The Shawshank Redemption
Inception
Back to the Future
Apocalypto
Dogville
Harry Potter and the Chamber
Avengers: Age of Ultron
jouet
Pirates of the Caribbean: At World's
Amelie 
Budapest Hotel
'''
my_rating = '''
5
5
5
5
2
4
5
3
2
4.5
5
2.5
'''
my_movies = my_movies.split('\n')[1:-1]
my_rating = list(map(float, my_rating.split('\n')[1:-1]))
MOVIES = pd.read_csv('ml-latest-small/movies.csv').title

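# Substring match against catalogue titles: if several titles contain x, the last one wins;
# titles with no match are silently dropped
mapping_1 = {x: y for x in my_movies for y in MOVIES if x in y}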
mapping_2 = dict(zip(my_movies, my_rating))
data = {mapping_1[key] : mapping_2[key] for key in mapping_1}

In [83]:
print('My movies rating:')
for i, (key, value) in enumerate(data.items()):
    print(f'{i+1}. {key} '.ljust(90, '-') + f'{value}')

My movies rating:
1. Inception (2010) ----------------------------------------------------------------------5.0
2. Ivan Vasilievich: Back to the Future (Ivan Vasilievich menyaet professiyu) (1973) -----5.0
3. Apocalypto (2006) ---------------------------------------------------------------------2.0
4. Dogville (2003) -----------------------------------------------------------------------4.0
5. Harry Potter and the Chamber of Secrets (2002) ----------------------------------------5.0
6. Avengers: Age of Ultron (2015) --------------------------------------------------------3.0
7. Pirates of the Caribbean: At World's End (2007) ---------------------------------------4.5
8. Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) ----------------------------------5.0
9. Grand Budapest Hotel, The (2014) ------------------------------------------------------2.5
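
In the listing above, only 9 of the 12 entered titles were matched, and "Back to the Future" resolved to "Ivan Vasilievich: Back to the Future", because the substring lookup keeps the last matching catalogue title. A hedged sketch of a slightly stricter lookup is shown below; `match_title` is a hypothetical helper, the preference order (titles starting with the query first, then the shortest title) is an assumption, and the small `catalogue` list is toy data rather than the full `movies.csv`.

```python
def match_title(query, titles):
    """Pick one catalogue title for a query, or None if nothing matches."""
    q = query.strip().lower()
    candidates = [t for t in titles if q in t.lower()]
    if not candidates:
        return None
    # Prefer titles that start with the query, then the shortest one
    return min(candidates, key=lambda t: (not t.lower().startswith(q), len(t)))

catalogue = [
    "Back to the Future (1985)",
    "Back to the Future Part II (1989)",
    "Ivan Vasilievich: Back to the Future (Ivan Vasilievich menyaet professiyu) (1973)",
    "Grand Budapest Hotel, The (2014)",
]
best = match_title("Back to the Future", catalogue)
partial_hit = match_title("Budapest Hotel", catalogue)
missing = match_title("jouet", catalogue)
```

Returning `None` for unmatched queries makes it easy to list the titles that still need to be looked up by hand.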


2. Compute top-10 recommendations based on these films for each of the 8 methods implemented. Print them in **human-readable form** (5 points)

In [None]:
# your code here

3. Rate the films that were recommended in the previous step (by title, description, trailer). For each algorithm compute metrics based on the ratings you gave.

_Your ratings_

## Task 5: Conclusion (10 points)

Compare all methods based on both the dataset metrics and your personal recommendations.

Which algorithm is the best? Why?

Were the recommendations different? Which set of recommendations do you like the most?

What differences between the algorithms have you noticed?

In [101]:
ratings.pivot_table(index = 'userId', columns = 'movieId', values = 'rating')

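| userId / movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 2.5 | NaN | NaN | NaN | NaN | NaN | 2.5 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 607 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 608 | 2.5 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 609 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |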


_Your conclusion_