In [1]:
import numpy as np
import pandas as pd

from similarity import pearson_similarity
from predict import prediction_function
from group import average_aggregation, least_misery_aggregation, weighted_disagreements_aggregation, group_recommendation
from evaluate import kendalltau_distance

In [2]:
# Constants
MAX_NEIGHBORS = 50          # ~ 2*np.sqrt(num_users)

# Preprocessing
Creation of the "User-Item" Matrix and the "Movie_Map" DataFrame

In [3]:
ratings, movies = pd.read_csv('./datasets/ratings.csv'), pd.read_csv('./datasets/movies.csv')

In [4]:
user_ids = ratings['userId'].unique().tolist()
movie_ids = movies['movieId'].unique().tolist()

matrix = pd.DataFrame(index=user_ids, columns=movie_ids, dtype=np.float32)

# Fill the matrix one rating at a time; itertuples avoids three iloc lookups per row.
for row in ratings.itertuples(index=False):
    matrix.at[row.userId, row.movieId] = row.rating

print("Matrix Shape:", matrix.shape)

Matrix Shape: (610, 9742)
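
The same "User-Item" Matrix can probably be built without an explicit loop. The sketch below is only an illustration on toy data: `toy_ratings`, `toy_movie_ids` and `toy_matrix` are hypothetical names, and it assumes that each (user, movie) pair appears at most once, since `pivot` raises an error on duplicates.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and the list of movie ids (hypothetical values).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 30, 10],
    "rating": [4.0, 3.5, 5.0],
})
toy_movie_ids = [10, 20, 30]  # movie 20 has no ratings at all

# One row per user, one column per rated movie; unrated cells become NaN.
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Add columns for movies that nobody rated, so every known movie has a column.
toy_matrix = toy_matrix.reindex(columns=toy_movie_ids).astype("float32")

print(toy_matrix.shape)
```

Note that `pivot` sorts the user ids, whereas the loop above keeps them in order of first appearance; the two agree only when the ratings file is already sorted by user.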


In [5]:
matrix.head()

userId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [6]:
movie_map = pd.DataFrame(data=movies['title'].values, index=movies['movieId'].values, columns=['title'])
movie_map.head()

movieId,title
1,Toy Story (1995)
2,Jumanji (1995)
3,Grumpier Old Men (1995)
4,Waiting to Exhale (1995)
5,Father of the Bride Part II (1995)


# Experiment
The following code blocks implement $\textit{User-Based Collaborative Filtering}$ to obtain movie recommendations for a group of 3 Users.

In [7]:
group = [11, 23, 249]

## Similarities Computation
For each User in the group, we first compute the similarity between that User and every other User in the dataset.

The similarity between the Input User $i$ and another User $x$ is computed with the Pearson Similarity. The set $P$ contains the movies that both Users ($i$ and $x$) have already rated, and $\overline{r}_i$ is the mean rating of User $i$.
$$\text{pearson-sim}(i,x) = \displaystyle\frac{\sum_{p\in P}[(r_{i,p}-\overline{r}_i)\cdot(r_{x,p}-\overline{r}_x)]}{\sqrt{\sum_{p\in P}(r_{i,p}-\overline{r}_i)^2}\cdot\sqrt{\sum_{p\in P}(r_{x,p}-\overline{r}_x)^2}}$$
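
The imported `pearson_similarity` is not shown in this notebook, so the following is only a minimal sketch of the formula above. The name `pearson_sketch` is hypothetical, and two details are assumptions: each mean is taken over all the items the User rated, and the similarity is set to 0 when there is no overlap or the denominator vanishes.

```python
import numpy as np
import pandas as pd

def pearson_sketch(matrix: pd.DataFrame, i, x) -> float:
    """Pearson similarity between users i and x over their co-rated items."""
    r_i, r_x = matrix.loc[i], matrix.loc[x]
    # P: items rated by both users.
    common = r_i.notna() & r_x.notna()
    if not common.any():
        return 0.0
    # Assumption: each mean is computed over all items the user rated.
    d_i = r_i[common] - r_i.mean()
    d_x = r_x[common] - r_x.mean()
    denom = np.sqrt((d_i ** 2).sum()) * np.sqrt((d_x ** 2).sum())
    # Assumption: return 0 when the denominator is zero.
    return float((d_i * d_x).sum() / denom) if denom > 0 else 0.0

# Toy ratings: users "a" and "c" have opposite tastes on items 1-3.
toy = pd.DataFrame(
    [[5.0, 3.0, 1.0, np.nan],
     [4.0, 2.0, np.nan, 1.0],
     [1.0, 3.0, 5.0, np.nan]],
    index=["a", "b", "c"], columns=[1, 2, 3, 4],
)
print(pearson_sketch(toy, "a", "c"))
```

On this toy matrix, Users "a" and "c" rate the shared items in exactly opposite order, so their similarity is at the negative extreme, while "a" and "b" agree on the two items they share.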

In [9]:
list_of_similarities = list()

for user in group:
    dictionary = dict()

    other_users = [u for u in user_ids if u != user]
    for u in other_users:
        dictionary[u] = pearson_similarity(matrix, user, u)

    list_of_similarities.append(dictionary)

To compute recommendations for each User, we take into account a neighborhood of 50 Users (`MAX_NEIGHBORS`): the Users with the highest similarity to the current Input User.

In [10]:
# Keep, for each group member, only the MAX_NEIGHBORS most similar users.
list_of_similarities = [
    dict(sorted(sims.items(), key=lambda item: item[1], reverse=True)[:MAX_NEIGHBORS])
    for sims in list_of_similarities
]

## Score Computation
It's now time to compute the recommendations for each User of the group. In this first step of computation, each User is considered as isolated from the group.

In [11]:
list_of_scores = list()

In [12]:
for i in range(0, len(group)):
    scores = prediction_function(matrix, group[i], list_of_similarities[i], matrix.shape[1])
    list_of_scores.append(scores)
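
The notebook does not show how `prediction_function` works. One common formulation in User-Based Collaborative Filtering is the mean-centered weighted average over the neighborhood, sketched below. This is an assumption, not necessarily what the imported function does; `predict_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def predict_sketch(matrix: pd.DataFrame, user, neighbors: dict) -> pd.Series:
    """Mean-centered weighted prediction for the items `user` has not rated."""
    sims = pd.Series(neighbors, dtype="float64")
    neigh = matrix.loc[sims.index]
    # Deviation of each neighbor rating from that neighbor's own mean.
    centered = neigh.sub(neigh.mean(axis=1), axis=0)
    # Weighted sum of deviations; unrated (NaN) entries are skipped by sum().
    num = centered.mul(sims, axis=0).sum(axis=0)
    # Normalize only by the neighbors who actually rated the item.
    den = neigh.notna().astype(float).mul(sims.abs(), axis=0).sum(axis=0)
    pred = matrix.loc[user].mean() + num / den
    # Keep unseen items that at least one neighbor has rated.
    unseen = matrix.loc[user].isna()
    return pred[unseen & (den > 0)]

toy = pd.DataFrame(
    [[4.0, 2.0, np.nan],
     [5.0, 3.0, 4.0],
     [1.0, np.nan, 3.0]],
    index=["u", "a", "b"], columns=[1, 2, 3],
)
toy_pred = predict_sketch(toy, "u", {"a": 1.0, "b": 0.5})
print(toy_pred)
```

Here only item 3 is unseen by "u"; its prediction is the mean rating of "u" shifted by the similarity-weighted deviations of "a" and "b" on that item.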

### Scores Aggregation
We have predicted, for each User, the Scores of their "not-yet-seen" movies. It's now time to aggregate these predicted Scores to obtain the predicted Scores for the group.

#### First Aggregation Function: Average
The "group Score" for a specific Item is computed as the average of the Scores predicted for that Item across the Users of the group $G$.

$$avg(i) = \displaystyle\frac{\sum_{u\in G}[score(u,i)]}{|G|}$$
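
As a rough illustration of the formula, the sketch below averages toy scores item by item. `average_sketch` is a hypothetical stand-in for the imported `average_aggregation`, whose internals are not shown; representing each User's scores as a `pandas.Series` and keeping only the Items scored for every member are both assumptions.

```python
import pandas as pd

def average_sketch(scores_per_user):
    """Average of the members' predicted scores, item by item."""
    # Rows = items, columns = group members.
    # Assumption: keep only the items that have a score for every member.
    table = pd.concat(scores_per_user, axis=1).dropna()
    return table.mean(axis=1)

toy_scores = [
    pd.Series({"A": 4.0, "B": 2.0, "C": 5.0}),
    pd.Series({"A": 3.0, "B": 4.0}),
    pd.Series({"A": 5.0, "B": 3.0, "C": 1.0}),
]
toy_avg = average_sketch(toy_scores)
print(toy_avg)
```

Item "C" is dropped because the second User has no predicted score for it.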

In [13]:
avg_matrix = average_aggregation(list_of_scores)

In [25]:
recs_avg = group_recommendation(avg_matrix, movie_map, matrix.shape[1])

print("TOP 10 RECOMMENDED MOVIES FOR THE GROUP WITH 'AVERAGE' AGGREGATION OF PREFERENCES:\n")
# Print only the first 10 (movie, score) pairs.
for k, v in list(recs_avg.items())[:10]:
    print(f"Movie: {k} -> Score: {v:.5f}")

TOP 10 RECOMMENDED MOVIES FOR THE GROUP WITH 'AVERAGE' AGGREGATION OF PREFERENCES:

Movie: Jaws (1975) -> Score: 5.00259
Movie: Singin' in the Rain (1952) -> Score: 4.98816
Movie: Traffic (2000) -> Score: 4.97833
Movie: Tangled (2010) -> Score: 4.86329
Movie: Raising Arizona (1987) -> Score: 4.85759
Movie: Pinocchio (1940) -> Score: 4.83803
Movie: Young Frankenstein (1974) -> Score: 4.79752
Movie: Dead Poets Society (1989) -> Score: 4.77201
Movie: Bridge on the River Kwai, The (1957) -> Score: 4.76807
Movie: Close Encounters of the Third Kind (1977) -> Score: 4.75357


#### Second Aggregation Function: Least Misery
The "group Score" for a specific Item is the minimum of the Scores predicted for that Item across the Users of the group.
$$\text{least-misery}(i)=\min_{u\in G}[score(u,i)]$$
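
A minimal sketch of this rule on toy data follows. `least_misery_sketch` is a hypothetical stand-in for the imported `least_misery_aggregation`, and it assumes that each User's scores are a `pandas.Series` and that only Items scored for every member are kept.

```python
import pandas as pd

def least_misery_sketch(scores_per_user):
    """Minimum of the members' predicted scores, item by item."""
    # Rows = items, columns = group members.
    # Assumption: keep only the items that have a score for every member.
    table = pd.concat(scores_per_user, axis=1).dropna()
    return table.min(axis=1)

toy_scores = [
    pd.Series({"X": 5.0, "Y": 3.0}),
    pd.Series({"X": 5.0, "Y": 3.0}),
    pd.Series({"X": 1.0, "Y": 3.0}),
]
toy_lm = least_misery_sketch(toy_scores)
print(toy_lm)
```

Two Users love item "X" but the third strongly dislikes it, so Least Misery ranks "Y" first even though "X" has the higher plain average.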

In [17]:
lm_matrix = least_misery_aggregation(list_of_scores)

In [26]:
recs_lm = group_recommendation(lm_matrix, movie_map, matrix.shape[1])

print("TOP 10 RECOMMENDED MOVIES FOR THE GROUP WITH 'LEAST MISERY' AGGREGATION OF PREFERENCES:\n")
# Print only the first 10 (movie, score) pairs.
for k, v in list(recs_lm.items())[:10]:
    print(f"Movie: {k} -> Score: {v:.5f}")

TOP 10 RECOMMENDED MOVIES FOR THE GROUP WITH 'LEAST MISERY' AGGREGATION OF PREFERENCES:

Movie: Traffic (2000) -> Score: 4.85268
Movie: Singin' in the Rain (1952) -> Score: 4.68310
Movie: Tangled (2010) -> Score: 4.57717
Movie: Wallace & Gromit: The Wrong Trousers (1993) -> Score: 4.55361
Movie: True Romance (1993) -> Score: 4.52735
Movie: 50 First Dates (2004) -> Score: 4.52219
Movie: Insider, The (1999) -> Score: 4.49932
Movie: Dead Poets Society (1989) -> Score: 4.47424
Movie: Dead Alive (Braindead) (1992) -> Score: 4.47134
Movie: Day of the Dead (1985) -> Score: 4.47134


#### Third Aggregation Function: Weighted Disagreements Sum
The "group Score" for a specific Item takes into account all the pairwise disagreements between the Group members for that Item. In the formulas, $C^G_2$ denotes the set of all pairs of Group members.

$$\text{pairwise-dis}(a,b,i) = 1 + |score(a,i)-score(b,i)|$$

$$\text{weighted-disagreements}(i) = \displaystyle\frac{\sum_{(a,b)\in C^G_2}[\text{pairwise-dis}(a,b,i)\cdot(\displaystyle\frac{score(a,i)+score(b,i)}{2})]}{\sum_{(a,b)\in C^G_2}[\text{pairwise-dis}(a,b,i)]}$$
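
The two formulas can be sketched as follows. `weighted_disagreements_sketch` is a hypothetical stand-in for the imported `weighted_disagreements_aggregation`; as before, `pandas.Series` inputs and the restriction to Items scored for every member are assumptions.

```python
from itertools import combinations

import pandas as pd

def weighted_disagreements_sketch(scores_per_user):
    """Disagreement-weighted mean of the pairwise average scores, per item."""
    # Rows = items, columns = group members.
    # Assumption: keep only the items that have a score for every member.
    table = pd.concat(scores_per_user, axis=1).dropna()
    num = pd.Series(0.0, index=table.index)
    den = pd.Series(0.0, index=table.index)
    # Iterate over every unordered pair (a, b) of group members.
    for a, b in combinations(table.columns, 2):
        dis = 1 + (table[a] - table[b]).abs()       # pairwise-dis(a, b, i)
        num += dis * (table[a] + table[b]) / 2      # weight * pair average
        den += dis
    return num / den

toy_scores = [
    pd.Series({"X": 5.0, "Y": 3.0}),
    pd.Series({"X": 5.0, "Y": 3.0}),
    pd.Series({"X": 1.0, "Y": 3.0}),
]
toy_wd = weighted_disagreements_sketch(toy_scores)
print(toy_wd)
```

For item "X", the two pairs that disagree carry weight 5 each and the agreeing pair carries weight 1, so the result (35/11) lies below the plain average of 11/3.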

In [20]:
wd_matrix = weighted_disagreements_aggregation(list_of_scores)

In [27]:
recs_wd = group_recommendation(wd_matrix, movie_map, matrix.shape[1])

print("TOP 10 RECOMMENDED MOVIES FOR THE GROUP WITH 'WEIGHTED DISAGREEMENTS SUM' AGGREGATION OF PREFERENCES:\n")
# Print only the first 10 (movie, score) pairs.
for k, v in list(recs_wd.items())[:10]:
    print(f"Movie: {k} -> Score: {v:.5f}")

TOP 10 RECOMMENDED MOVIES FOR THE GROUP WITH 'WEIGHTED DISAGREEMENTS SUM' AGGREGATION OF PREFERENCES:

Movie: Traffic (2000) -> Score: 4.98354
Movie: Singin' in the Rain (1952) -> Score: 4.97253
Movie: Jaws (1975) -> Score: 4.91389
Movie: Tangled (2010) -> Score: 4.89246
Movie: Raising Arizona (1987) -> Score: 4.87993
Movie: Pinocchio (1940) -> Score: 4.83456
Movie: I Am Sam (2001) -> Score: 4.81579
Movie: Young Frankenstein (1974) -> Score: 4.77009
Movie: Dead Poets Society (1989) -> Score: 4.76148
Movie: True Romance (1993) -> Score: 4.75097


# Evaluation
The Evaluation of the proposed experiments takes into account the differences between the Group Recommendations and the Recommendations computed for each User when considered as isolated from the Group.

A good Group Recommendation approach minimizes the sum of $\textit{Kendall-Tau Distances}$ between the Group Ranking and the Rankings of each member: given two rankings, the $\textit{Kendall-Tau Distance}$ counts the number of $\textit{pairwise disagreements}$ between the two rankings.

For example, suppose that in the First Ranking "Item1" is ranked better than "Item2": if in the Second Ranking "Item2" is ranked better than "Item1", there is a $\textit{pairwise disagreement}$ between Ranking 1 and Ranking 2.
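
The counting rule can be sketched in a few lines. `kendall_tau_sketch` is a hypothetical stand-in for the imported `kendalltau_distance`; it assumes that rankings are given as item-to-score mappings, that only Items present in both are compared, and that ties do not count as disagreements.

```python
from itertools import combinations

def kendall_tau_sketch(scores_a: dict, scores_b: dict) -> int:
    """Count item pairs that the two score mappings order in opposite ways."""
    # Assumption: only items present in both mappings are compared.
    items = sorted(set(scores_a) & set(scores_b))
    disagreements = 0
    for p, q in combinations(items, 2):
        # A negative product means the pair is ordered differently
        # in the two rankings; ties (product == 0) are not counted.
        if (scores_a[p] - scores_a[q]) * (scores_b[p] - scores_b[q]) < 0:
            disagreements += 1
    return disagreements

user_scores = {"Item1": 4.5, "Item2": 3.0, "Item3": 2.0}
group_scores = {"Item1": 3.0, "Item2": 4.0, "Item3": 1.0}
print(kendall_tau_sketch(user_scores, group_scores))
```

Only the pair ("Item1", "Item2") is ordered differently in the two toy rankings. This direct version checks every pair, so its cost grows quadratically with the number of Items.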

In [28]:
columns = ['KendallTau_AVG', 'KendallTau_AVG%', 'KendallTau_LM', 'KendallTau_LM%', 'KendallTau_WD', 'KendallTau_WD%']
index = [u for u in group]

evals = pd.DataFrame(index=index, columns=columns)

In [29]:
for i in range(0, len(group)):
    user = group[i]

    kt_avg = kendalltau_distance(list_of_scores[i], avg_matrix)
    evals.at[user, 'KendallTau_AVG'] = kt_avg

    kt_lm = kendalltau_distance(list_of_scores[i], lm_matrix)
    evals.at[user, 'KendallTau_LM'] = kt_lm

    kt_wd = kendalltau_distance(list_of_scores[i], wd_matrix)
    evals.at[user, 'KendallTau_WD'] = kt_wd

    evals.at[user, 'KendallTau_AVG%'] = kt_avg / (len(avg_matrix)*(len(avg_matrix)-1)/2)
    evals.at[user, 'KendallTau_LM%'] = kt_lm / (len(lm_matrix)*(len(lm_matrix)-1)/2)
    evals.at[user, 'KendallTau_WD%'] = kt_wd / (len(wd_matrix)*(len(wd_matrix)-1)/2)

In [30]:
evals

userId,KendallTau_AVG,KendallTau_AVG%,KendallTau_LM,KendallTau_LM%,KendallTau_WD,KendallTau_WD%
11,6453145,0.175379,7320854,0.198961,6442313,0.175085
23,6999460,0.190227,6792105,0.184591,7035990,0.19122
249,7242318,0.196827,6692215,0.181877,7267342,0.197507


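The values in the `%` columns express each $\textit{Kendall-Tau Distance}$ as a fraction of the total number of Item pairs, where $n$ is the number of ranked Items:
$$KendallTau\%=\displaystyle\frac{KendallTauDistance}{|\text{Pairs}|}, \qquad |\text{Pairs}|=\displaystyle\frac{n(n-1)}{2}$$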
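
As a small worked example of this normalization, the helper below divides a distance by the number of Item pairs. `normalized_kendall_tau` is a hypothetical name used only for illustration.

```python
def normalized_kendall_tau(distance: int, n_items: int) -> float:
    """Kendall-Tau distance divided by the number of item pairs."""
    n_pairs = n_items * (n_items - 1) // 2
    return distance / n_pairs

# With 3 items there are 3 pairs; 1 disagreement gives one third.
print(normalized_kendall_tau(1, 3))
```

A value of 0 means the two rankings order every pair the same way, and a value of 1 means one ranking is the exact reverse of the other. The values in the table above, between roughly 0.17 and 0.20, therefore indicate that about 17% to 20% of the Item pairs are ordered differently by the Group Ranking and by the individual Ranking.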