The purpose of the exercise is to implement a recommendation system for a movie search engine.

When choosing a movie that our user will like, let's first consider what data we have available. First of all, the database stores the ratings our user gave to the movies they have watched. Note that this covers far from all of the movies in our database; most often it is a small subset of a huge catalogue. So we can find out which movies our user liked and which ones they did not.

Is this all the data available? Well, no! We also have the preferences of other users, so we can find users whose taste in movies is similar to our user's. Note that virtually every such user has watched some movies that our user has not. The idea behind collaborative filtering is very simple: if another user with similar tastes rated a movie highly, our user will probably rate it highly too. So let's recommend movies that users with similar tastes have rated highly!


Let's formalize some ideas:
 - How do we measure the similarity between two users' tastes?
 
 We calculate the correlation between their ratings of the movies both of them have rated. Users with a strongly positive correlation have similar tastes, and those with a strongly negative correlation have opposite tastes ;)
 
 - Having found similar users, how do we compute the predicted rating of a movie for our user?
 
 We take the weighted average of the ratings given by users with similar tastes, where the weight is the similarity measure (correlation). The closer a user's tastes are to ours, the more weight their rating carries. (slide 27, http://www.mmds.org/mmds/v2.1/ch09-recsys1.pdf)
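
To make the similarity measure concrete, here is a minimal plain-Python sketch of Pearson correlation restricted to the movies two users have both rated. The helper name `pearson_on_common` and the toy ratings are hypothetical and only illustrate the idea; the notebook itself uses `scipy.stats.pearsonr` on the pivoted ratings matrix.

```python
import math

def pearson_on_common(a, b):
    """Pearson correlation over the movies rated by both users.

    `a` and `b` map movie ids to ratings. Returns None when fewer than two
    movies are shared or when either user's shared ratings are constant
    (the correlation is undefined in both cases).
    """
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return None
    xs = [a[m] for m in common]
    ys = [b[m] for m in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy ratings (movie id -> rating), not taken from the dataset
ours = {"A": 5.0, "B": 3.0, "C": 1.0, "D": 4.0}
similar = {"A": 4.5, "B": 3.5, "C": 1.5, "E": 5.0}
opposite = {"A": 1.0, "B": 3.0, "C": 5.0, "D": 2.0}

sim_pos = pearson_on_common(ours, similar)   # strongly positive: similar tastes
sim_neg = pearson_on_common(ours, opposite)  # strongly negative: opposite tastes
print(sim_pos, sim_neg)
```

Movie "E" is ignored when comparing `ours` with `similar` because only one of the two users rated it.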


In [1]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr

df = pd.read_csv('./ratings.csv')
df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


### Task
Modify the dataframe to have movieId as the index, userId as the columns, and rating as the values.

In [2]:
ratings_matrix = df.pivot(index='movieId', columns='userId', values='rating')

ratings_matrix = ratings_matrix.sort_index().sort_index(axis=1)
ratings_matrix.head()

movieId\userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,,,,4.0,,4.5,,,,...,4.0,,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,,,,,,4.0,,4.0,,,...,,4.0,,5.0,3.5,,,2.0,,
3,4.0,,,,,5.0,,,,,...,,,,,,,,2.0,,
4,,,,,,3.0,,,,,...,,,,,,,,,,
5,,,,,,5.0,,,,,...,,,,3.0,,,,,,


### Task
Let's try to recommend movies for user 610. Calculate the correlation between this user and the remaining ones.

In [24]:
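user = 610
user_original = user  # kept separately because later cells reuse `user` as a loop variable

seen_movies = ratings_matrix[user].dropna()
print(f'User {user} has seen {seen_movies.shape[0]} movies')

correlations = dict()

# Pearson correlation with every other user, computed only on movies both have rated
for other in ratings_matrix.columns:
    if other != user_original:
        common_ratings = ratings_matrix[[user_original, other]].dropna()
        # pearsonr needs at least two points; it returns NaN (with a warning)
        # when either user's common ratings are constant
        if len(common_ratings) > 1:
            corr, _ = pearsonr(common_ratings[user_original], common_ratings[other])
            correlations[other] = corr

print(correlations)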

User 610 has seen 1302 movies



{1: np.float64(-0.032086493353821015), 2: np.float64(0.6232876214384132), 3: np.float64(0.5695622052516349), 4: np.float64(-0.04378585086042065), 5: np.float64(0.040582062796472834), 6: np.float64(0.11558016858860558), 7: np.float64(0.3412327151163502), 8: np.float64(0.16793105512607803), 9: np.float64(0.6156384451828573), 10: np.float64(-0.20508117581889784), 11: np.float64(0.1370537580698882), 12: np.float64(-0.6729773775727069), 13: np.float64(0.7969688608275001), 14: np.float64(0.14048224445673518), 15: np.float64(-0.09990749158102), 16: np.float64(-0.13414974132046256), 17: np.float64(-0.03465314687603261), 18: np.float64(0.32739704559848504), 19: np.float64(0.39921282593907825), 20: np.float64(0.41336829341554443), 21: np.float64(-0.0630440879661395), 22: np.float64(0.20851711128146594), 23: np.float64(-0.07442430015832184), 24: np.float64(0.12154761177602688), 25: np.float64(0.371303095438256), 26: np.float64(0.7016464154456233), 27: np.float64(-0.19944204623675554), 28: np.floa


### Task
A few users have a perfect correlation with user 610. Isn't that suspicious? Check it.

In [4]:
s=list()
for i in correlations:
    if correlations[i] == 1:
        print(f'User {i} has a perfect correlation with user 610')
        s.append(i)

for user2 in s:
    user_ratings = ratings_matrix[user2].dropna()
    print(f'User {user2} rated {len(user_ratings)} movies:')
    print(list(user_ratings.index))
    print(list(user_ratings))

    same_movies = list(set(user_ratings.index) & set(ratings_matrix[610].dropna().index))
    print(same_movies)
    print(user_ratings[same_movies])
    print(ratings_matrix[610][same_movies])
    print('-' * 40)
    

User 442 has a perfect correlation with user 610
User 545 has a perfect correlation with user 610
User 576 has a perfect correlation with user 610
User 442 rated 20 movies:
[362, 468, 524, 610, 616, 1186, 1231, 1272, 1644, 2020, 2145, 2881, 2908, 3107, 3363, 3386, 3510, 3752, 3863, 4361]
[2.5, 1.5, 2.0, 1.0, 1.5, 1.0, 1.0, 0.5, 0.5, 1.0, 2.0, 0.5, 2.0, 0.5, 1.5, 0.5, 1.0, 2.0, 0.5, 2.5]
[3752, 3863]
movieId
3752    2.0
3863    0.5
Name: 442, dtype: float64
movieId
3752    3.5
3863    3.0
Name: 610, dtype: float64
----------------------------------------
User 545 rated 23 movies:
[44, 748, 1267, 1367, 1438, 1779, 1876, 1909, 1911, 1960, 2300, 2427, 2605, 2641, 2671, 2763, 3273, 3991, 33794, 49284, 58293, 63540, 63853]
[2.5, 3.0, 5.0, 3.5, 4.0, 3.0, 4.5, 4.0, 3.0, 3.0, 2.5, 4.0, 5.0, 3.5, 4.0, 4.0, 3.0, 3.5, 3.5, 1.5, 3.0, 2.0, 2.5]
[3273, 33794]
movieId
3273     3.0
33794    3.5
Name: 545, dtype: float64
movieId
3273     3.0
33794    4.0
Name: 610, dtype: float64
-----------------------

### Explanation
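The correlation is exactly 1.0 because each of these users shares only two movies with user 610. Two points always lie on a straight line, so whenever the Pearson correlation is defined for two points it is either +1 or -1, regardless of how similar the tastes actually are.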
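
The effect can be reproduced with the two pairs of ratings printed above. The sketch below uses a small hand-written `pearson` helper (a hypothetical name, written in plain Python so that it runs without SciPy); the last pair of ratings is made up to show the opposite sign.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equally long rating lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Ratings copied from the output above: (other user, user 610) on the two shared movies
r_442 = pearson([2.0, 0.5], [3.5, 3.0])  # movies 3752 and 3863
r_545 = pearson([3.0, 3.5], [3.0, 4.0])  # movies 3273 and 33794

# Made-up pair where the two users disagree on the ordering of the two movies
r_flipped = pearson([1.0, 5.0], [4.0, 3.5])

print(r_442, r_545, r_flipped)
```

User 442 rated both movies far lower than user 610 did, yet the correlation is still perfect, which is why a minimum number of common movies is required in the next task.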

### Task
Find the 5 users who have at least 5 movies in common with user 610 and the highest correlation with that user.

In [10]:
users_sorted=dict()
for user2 in ratings_matrix.columns:
    if user2 == 610:
        continue
    if len(set(ratings_matrix[user2].dropna().index) & set(ratings_matrix[610].dropna().index)) >= 5:
        users_sorted[user2] = correlations[user2]

users_sorted_trunc = dict(sorted(users_sorted.items(), key=lambda item: item[1], reverse=True)[:5])
print(f'Users with at least 5 movies in common with user {user_original}: {users_sorted_trunc.keys()}')

Users with at least 5 movies in common with user 610: dict_keys([92, 120, 463, 138, 494])


### Task
Predict scores for each movie based on the most correlated users. Use a weighted average with the correlation coefficients as weights.
$$\hat{y}_j = \frac{\sum_{i \in U} w_i y_{ij}}{\sum_{i \in U} w_i}$$

$U$ is the set of considered users who also watched the $j$th movie, $w_i$ denotes the correlation between our user and the $i$th user, and $y_{ij}$ is the score given by the $i$th user to the $j$th movie.
Use only movies watched by at least two users from the considered set.
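
As a quick sanity check of the formula, here is a minimal sketch for a single movie. The function name `predict_rating`, the user ids, and the numbers are hypothetical; it assumes the weights of the raters do not sum to zero and otherwise returns `None`.

```python
def predict_rating(neighbour_ratings, weights, min_raters=2):
    """Correlation-weighted average of the neighbours' ratings for one movie.

    neighbour_ratings: user id -> rating given to the movie
    weights:           user id -> correlation with our user
    Returns None if fewer than `min_raters` considered users rated the movie
    or if the weights of the raters sum to zero.
    """
    raters = [u for u in neighbour_ratings if u in weights]
    if len(raters) < min_raters:
        return None
    weight_sum = sum(weights[u] for u in raters)
    if weight_sum == 0:
        return None
    return sum(weights[u] * neighbour_ratings[u] for u in raters) / weight_sum

# Hypothetical correlations of three similar users with our user
weights = {"u1": 0.9, "u2": 0.6, "u3": 0.5}

# Movie j was rated by u1 and u2 only
pred = predict_rating({"u1": 5.0, "u2": 4.0}, weights)
print(pred)  # (0.9 * 5.0 + 0.6 * 4.0) / (0.9 + 0.6), i.e. about 4.6

# A movie rated by a single considered user is skipped
single = predict_rating({"u3": 5.0}, weights)
print(single)
```

The prediction lands closer to 5.0 than to 4.0 because the more similar user `u1` carries more weight.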

In [88]:
# Movies with more than two ratings overall (note: `> 2` means at least three raters)
movie_watchers = dict()
for movie in ratings_matrix.index:
    movie_ratings = ratings_matrix.loc[movie].dropna()
    if len(movie_ratings) > 2:
        movie_watchers[movie] = list(movie_ratings.keys())

# Keep only the raters that belong to the 5 most correlated users
for movie in movie_watchers:
    movie_watchers[movie] = [user for user in movie_watchers[movie] if user in users_sorted_trunc]

# Weighted average of the neighbours' ratings, weighted by correlation
movie_ratings = dict()
for movie in movie_watchers:
    if len(movie_watchers[movie]) >= 2:
        suma = 0
        for user in movie_watchers[movie]:
            x = ratings_matrix.loc[movie, user]  # neighbour's rating of the movie
            y = users_sorted[user]               # neighbour's correlation with user 610
            suma += x * y
        movie_ratings[movie] = suma / sum([users_sorted[user] for user in movie_watchers[movie]])


seen_by_user = set(ratings_matrix[user_original].dropna().index)
common_movies = [m for m in movie_ratings.keys() if m in seen_by_user]
print(common_movies)

# To predict only unseen movies, remove the already-seen ones (common_movies) from movie_ratings:
# for m in common_movies:
#     del movie_ratings[m]
MAE = np.mean([abs(movie_ratings[movie] - ratings_matrix.loc[movie, user_original]) for movie in common_movies])
print(f'Mean Absolute Error (MAE): {MAE}')
print("Predicted ratings for all movies based on highly correlated users:")
for movie in movie_ratings:
    print(movie, movie_ratings[movie], ratings_matrix.loc[movie, user_original])


[110, 260, 780, 858, 1210, 1221, 2028]
Mean Absolute Error (MAE): 0.2866475585123135
Predicted ratings for all movies based on highly correlated users:
110 4.748253810609682 4.5
260 5.0 5.0
780 2.725176599344724 3.5
858 5.0 5.0
1210 4.5200366804594 5.0
1221 4.748253810609682 5.0
1367 3.4932259666617975 nan
2028 4.748253810609682 5.0


### Task
How can we check the quality of our recommendations?

We have to hide a few known scores from the dataset and then compare the predictions with the real ones.
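
A minimal sketch of this evaluation idea follows. The helpers `hold_out` and `mean_absolute_error` are hypothetical names; the rating values are copied from the prediction output above, so the MAE should reproduce the value printed there. Note that in the notebook the compared movies were not hidden when the correlations were computed, so a stricter evaluation would hide them first, as `hold_out` suggests.

```python
import random

def hold_out(ratings, fraction, seed=0):
    """Split one user's ratings (movie id -> rating) into (kept, hidden) parts."""
    rng = random.Random(seed)
    movies = sorted(ratings)
    n_hidden = max(1, int(len(movies) * fraction))
    hidden_ids = set(rng.sample(movies, n_hidden))
    kept = {m: r for m, r in ratings.items() if m not in hidden_ids}
    hidden = {m: r for m, r in ratings.items() if m in hidden_ids}
    return kept, hidden

def mean_absolute_error(actual, predicted):
    """MAE over the movies present in both dictionaries (None if there are none)."""
    common = [m for m in actual if m in predicted]
    if not common:
        return None
    return sum(abs(actual[m] - predicted[m]) for m in common) / len(common)

# Real and predicted ratings of user 610, copied from the output above
actual = {110: 4.5, 260: 5.0, 780: 3.5, 858: 5.0, 1210: 5.0, 1221: 5.0, 2028: 5.0}
predicted = {
    110: 4.748253810609682, 260: 5.0, 780: 2.725176599344724, 858: 5.0,
    1210: 4.5200366804594, 1221: 4.748253810609682, 2028: 4.748253810609682,
    1367: 3.4932259666617975,  # no real rating available, so it is ignored
}

mae = mean_absolute_error(actual, predicted)
print(f"MAE: {mae:.4f}")

# Hiding roughly 30% of the known ratings before computing similarities
kept, hidden = hold_out(actual, fraction=0.3)
print(len(kept), len(hidden))
```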

In [78]:
#it's done above

### Improved system

In [None]:
correlation_treshold = 0.8
movies_watched_together = 3

def testing_framework(correlation_treshold, movies_watched_together, output=False, abs_corr=True):
    users_sorted=dict()
    for user2 in ratings_matrix.columns:
        if user2 == 610:
            continue
        if len(set(ratings_matrix[user2].dropna().index) & set(ratings_matrix[610].dropna().index)) >= movies_watched_together:
            users_sorted[user2] = correlations[user2]

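    users_highly_correlated = dict(sorted(users_sorted.items(), key=lambda item: item[1], reverse=True))
    if abs_corr:
        # Keep strongly negative correlations too. The weights used below stay signed,
        # so the weight sum in the denominator can get close to (or equal to) zero.
        users_highly_correlated = {k: v for k, v in users_highly_correlated.items() if abs(v) >= correlation_treshold}
    else:
        # Keep only strongly positive correlations
        users_highly_correlated = {k: v for k, v in users_highly_correlated.items() if v >= correlation_treshold}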

    movie_watchers = dict()
    for movie in ratings_matrix.index:
        movie_ratings = ratings_matrix.loc[movie].dropna()
        if len(movie_ratings) > 2:
            movie_watchers[movie] = list(movie_ratings.keys())
            
    for movie in movie_watchers:
        movie_watchers[movie] = [user for user in movie_watchers[movie] if user in users_highly_correlated]

    movie_ratings = dict()
    for movie in movie_watchers:
        if len(movie_watchers[movie]) >= 2:
            suma=0
            for user in movie_watchers[movie]:
                x = ratings_matrix.loc[movie, user]
                y = users_sorted[user]
                suma += x * y
            movie_ratings[movie] = suma / sum([users_sorted[user] for user in movie_watchers[movie]])


    # Evaluate only on the movies from the earlier cell (global `common_movies`),
    # i.e. movies that user_original has actually rated
    for mid in list(movie_ratings.keys()):
        if mid not in common_movies:
            movie_ratings.pop(mid, None)
            
    MAE = np.mean([abs(movie_ratings[movie] - ratings_matrix.loc[movie, user_original]) for movie in movie_ratings if not pd.isna(ratings_matrix.loc[movie, user_original])])
    if output:
        print(f'Mean Absolute Error (MAE): {MAE}')
    return MAE, movie_ratings

In [142]:
minimum_mae = float('inf')
best_params = (0, 0)
for i in [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8]:
    for j in [2, 3, 4, 5, 6]:
        print(f'Testing with correlation threshold {i} and movies watched together {j}:', end='\r')
        MAE, ratings = testing_framework(i, j, False)  # abs_corr is left at its default (True)
        
        if(len(ratings)<5):
            continue
        else:
            if MAE < minimum_mae:
                minimum_mae = MAE
                best_params = (i, j)
        

print(f'Best parameters: correlation threshold {best_params[0]}, movies watched together {best_params[1]} with MAE {minimum_mae}')

res_matrix = pd.DataFrame([[testing_framework(i,j,False)[0] for j in range(2,7)] for i in [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8]], index=[0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8], columns=list(range(2,7)))
print(res_matrix)

Testing with correlation threshold 0.7 and movies watched together 2::


Best parameters: correlation threshold 0.6, movies watched together 2 with MAE 0.22212262578219763
             2         3         4         5         6
0.50  0.310614  0.310614  0.310614  0.318411  0.303121
0.55  0.333184  0.333184  0.333184  0.356072  0.322197
0.60  0.222123  0.222123  0.222123  0.258536  0.265209
0.65  0.344719  0.344719  0.344719  0.381132  0.387805
0.70  0.434326  0.434326  0.434326  0.479810  0.491798
0.75  0.429078  0.429078  0.429078  0.491798  0.491798
0.80  0.223928  0.223928  0.223928  0.286648  0.286648


In [141]:
minimum_mae = float('inf')
best_params = (0, 0)
for i in [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8]:
    for j in [2, 3, 4, 5, 6]:
        print(f'Testing with correlation threshold {i} and movies watched together {j}:', end='\r')
        MAE, ratings = testing_framework(i, j, False,True)
        
        if(len(ratings)<5):
            continue
        else:
            if MAE < minimum_mae:
                minimum_mae = MAE
                best_params = (i, j)
        

print(f'Best parameters: correlation threshold {best_params[0]}, movies watched together {best_params[1]} with MAE {minimum_mae}')

res_matrix = pd.DataFrame([[testing_framework(i,j,False,True)[0] for j in range(2,7)] for i in [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8]], index=[0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8], columns=list(range(2,7)))
print(res_matrix)

Testing with correlation threshold 0.7 and movies watched together 2::


Best parameters: correlation threshold 0.6, movies watched together 2 with MAE 0.22212262578219763
             2         3         4         5         6
0.50  0.310614  0.310614  0.310614  0.318411  0.303121
0.55  0.333184  0.333184  0.333184  0.356072  0.322197
0.60  0.222123  0.222123  0.222123  0.258536  0.265209
0.65  0.344719  0.344719  0.344719  0.381132  0.387805
0.70  0.434326  0.434326  0.434326  0.479810  0.491798
0.75  0.429078  0.429078  0.429078  0.491798  0.491798
0.80  0.223928  0.223928  0.223928  0.286648  0.286648


# Results

I tried variants that use the absolute value of the correlation (to include strong negative correlations as well), but this made the recommendation system worse. The best MAE in the grid below is about 0.222, which is pretty good for movie ratings: if we simply recommend movies that similar users rated very highly, it will probably yield good results. As for the question of which difference matters more, I think the most important ratings are the top few percent (grades 5.0 and 4.5). Still, for a user who has watched hundreds of movies, a system that can tell decent movies from mediocre ones is valuable, because the set of movies left to recommend is not infinite.
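
One possible reason why signed weights hurt is that negative correlations shrink the denominator of the weighted average. A common variant in the collaborative-filtering literature, which is not what this notebook implements, predicts a deviation from each user's own mean rating and normalises by the sum of absolute weights. The sketch below assumes hypothetical names (`predict_mean_centred`) and made-up numbers.

```python
def predict_mean_centred(target_mean, neighbours):
    """Mean-centred weighted prediction that tolerates negative weights.

    target_mean: mean rating of the user we predict for
    neighbours:  list of (weight, rating, neighbour_mean) tuples for one movie
    Returns None when all weights are zero.
    """
    denom = sum(abs(w) for w, _, _ in neighbours)
    if denom == 0:
        return None
    # A negatively correlated user who liked the movie pushes the prediction down,
    # and one who disliked it pushes the prediction up.
    offset = sum(w * (r - mean) for w, r, mean in neighbours) / denom
    return target_mean + offset

# Made-up example: one similar user liked the movie (4.5 vs. their mean of 3.5),
# one opposite user disliked it (2.0 vs. their mean of 3.0)
neighbours = [(0.8, 4.5, 3.5), (-0.6, 2.0, 3.0)]
pred = predict_mean_centred(3.8, neighbours)
print(pred)  # 3.8 + (0.8 * 1.0 + (-0.6) * (-1.0)) / 1.4, i.e. about 4.8
```

Both neighbours end up supporting a higher prediction here, whereas the plain weighted average would divide by 0.8 - 0.6 = 0.2 and produce a value far outside the rating scale.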

The result matrix below shows the MAE for each correlation threshold (rows) and each minimum number of movies in common, j (columns):


| Corr. threshold | j=2    | j=3    | j=4    | j=5    | j=6    |
|----------------:|-------:|-------:|-------:|-------:|-------:|
| 0.50            | 0.3106 | 0.3106 | 0.3106 | 0.3184 | 0.3031 |
| 0.55            | 0.3332 | 0.3332 | 0.3332 | 0.3561 | 0.3222 |
| 0.60            | 0.2221 | 0.2221 | 0.2221 | 0.2585 | 0.2652 |
| 0.65            | 0.3447 | 0.3447 | 0.3447 | 0.3811 | 0.3878 |
| 0.70            | 0.4343 | 0.4343 | 0.4343 | 0.4798 | 0.4918 |
| 0.75            | 0.4291 | 0.4291 | 0.4291 | 0.4918 | 0.4918 |
| 0.80            | 0.2239 | 0.2239 | 0.2239 | 0.2866 | 0.2866 |

Best parameters: (0.6, 2) — MAE: 0.2221

I think that there are several important observations:

- Movies rated by more of the similar users get more accurate predicted ratings.
- Personal taste means that the same movie might be merely good (~4) for some users but a 5.0 for others.
- The set of users with very high correlations is very small, and the set of movies watched by at least two of these users is even smaller.
- In the matrix, every row has identical values for j in [2, 3, 4]. This suggests that users who share only two or three movies with user 610 (mostly) do not contribute to the evaluated predictions.

The results look the way they do because of a tension: we want more ratings so that the average is more accurate, but we also want fewer ratings so that they come only from the most similar users. There are probably many contributing factors, but ultimately the chosen correlation threshold and minimum number of movies in common seem to sit in the sweet spot where we capture both personal taste and the ground truth.


# ---------------------------------------------------------------------------------

Try to improve the system. You can use the following ideas:
 - Can we use more users (e.g. those with a negative correlation)?
 - Which difference is more important: predicting 5 when the real score is 4, or predicting 3 instead of 2?
 - Did we use the best value for the minimal number of common movies?
 - Is a prediction for a movie seen by just one user trustworthy?
 
 
Describe your approach, its strengths and weaknesses, and analyze the results. Send the report (notebook with comments/markdown) within 144 hours after the class to gmiebs@cs.put.poznan.pl, start the subject with [IR]

Credits to F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872 and Mateusz Lango