### In this notebook, we make recommendations and compute the catalog coverage of the predictions made by our item-based KNN recommender system. To do this, we need to fill in the entire user-item matrix so that we can find the top-rated unseen movies for each user. The dataset has only 100 users and 848 movies, with 25462 ratings in the ratings dataset. We fill in this matrix, make 10 recommendations to each user, and compute the catalog coverage. We choose k = 48 for KNN because it gave nearly the best results while taking significantly less time than k = 64 (which gave the best results).

In [1]:
import pandas as pd
import numpy as np
import math
from math import sqrt
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.neighbors import NearestNeighbors

# Recommendations

In [2]:
pred_rating = pd.read_csv("data/reduced_final_sr.csv") 

In [3]:
pr = pred_rating.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)

In [4]:
sr_test_list = pr.unstack().reset_index(name="rating").set_index("movieId")
sr_test_list = sr_test_list[sr_test_list.rating==0]

In [5]:
sr_test_list.shape

(58490, 2)

In [6]:
sr_test_rem = sr_test_list

In [9]:
def fit_model(k):
    model_knn = NearestNeighbors(metric='cosine',algorithm='brute', n_neighbors=k, n_jobs=-1)
    model_knn.fit(pr)
    return model_knn
    
def recommendation(model, mid, k):
    # Return the k nearest movies to `mid` together with their cosine similarities.
    if pr.loc[mid].sum() == 0:
        # Movie with no ratings: return all-zero similarities so that
        # predict() falls back to the user's mean rating.
        results = pd.DataFrame({'sim': [0]*k, 'mov_id': [0]*k})
    else:
        distances, indices = model.kneighbors(pr.loc[[mid]].values)
        # Cosine similarity = 1 - cosine distance.
        sim = 1 - distances.flatten()
        # Map the positional neighbour indices back to movie ids.
        mov_ids = pr.index[indices.flatten()]
        results = pd.DataFrame({'sim': sim, 'mov_id': mov_ids})
        results = results.set_index("mov_id")
    return results
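
As a minimal illustration of the neighbour lookup above, the sketch below runs the same `NearestNeighbors` configuration on a tiny item-user matrix. The matrix `toy` and its ids are hypothetical and are not taken from the ratings file; the point is only to show how cosine distances are turned into similarities and how positional indices map back to movie ids.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy item-user matrix: rows are movies, columns are users, 0 = unrated.
toy = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 3.0]],
    index=pd.Index([10, 20, 30], name="movieId"),
    columns=pd.Index([1, 2, 3], name="userId"),
)

knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
knn.fit(toy.values)

# Query with movie 10's rating vector; the movie itself is returned first.
distances, indices = knn.kneighbors(toy.loc[[10]].values)

# Cosine similarity = 1 - cosine distance, indexed by movie id.
neighbours = pd.Series(
    1.0 - distances.ravel(),
    index=toy.index[indices.ravel()],
    name="sim",
)
print(neighbours)
```

Movie 20 has almost the same rating pattern as movie 10, so it is the nearest neighbour other than movie 10 itself, while movie 30 shares no raters with it and is left out.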

In [10]:
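def cal_mean(s):
    # Mean of the non-zero ratings in s. The +0.001 avoids division by zero
    # for a user with no ratings (and shrinks the mean very slightly otherwise).
    count = 0
    add = 0
    for x in s:
        if x != 0:
            add += x
            count += 1
    return float(add)/(count+0.001)

def predict(mid, uid, r):
    # Similarity-weighted average of the ratings user `uid` gave to the
    # neighbouring movies listed in r.
    if r['sim'].max() == 0:
        # No usable neighbours: fall back to the user's mean rating.
        return cal_mean(list(pr[uid]))
    w_sum = 0
    sim_sum = 0
    for m in r.index:
        rating = float(pr.at[m, uid])
        if rating != 0.0:
            sim = float(r.at[m, 'sim'])
            sim_sum += sim
            w_sum += rating * sim
    if sim_sum == 0:
        # The user has rated none of the neighbours (or all similarities are 0).
        return cal_mean(list(pr[uid]))
    return w_sum / sim_sum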
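
To make the weighting concrete, here is a small worked example of the same idea: the predicted rating is the similarity-weighted mean of the user's ratings over the neighbours they have rated. The function `weighted_prediction` and the numbers are hypothetical, and the fallback uses an exact mean rather than the `+0.001` smoothing in `cal_mean`.

```python
import pandas as pd

def weighted_prediction(user_ratings: pd.Series, sims: pd.Series) -> float:
    """Similarity-weighted mean of a user's ratings over the rated neighbours."""
    rated = user_ratings.reindex(sims.index).fillna(0)
    mask = rated != 0
    if not mask.any():
        # Fallback: plain mean of the user's existing (non-zero) ratings.
        nonzero = user_ratings[user_ratings != 0]
        return float(nonzero.mean()) if len(nonzero) else 0.0
    return float((rated[mask] * sims[mask]).sum() / sims[mask].sum())

# Hypothetical user: has rated movies 20, 30 and 40, but not movie 10.
user_ratings = pd.Series({10: 0.0, 20: 4.0, 30: 2.0, 40: 5.0})

# Hypothetical neighbours of movie 10 (the movie itself has similarity 1).
sims = pd.Series({10: 1.0, 20: 0.8, 30: 0.2})

# (4 * 0.8 + 2 * 0.2) / (0.8 + 0.2) = 3.6
pred = weighted_prediction(user_ratings, sims)

# If the user has rated none of the neighbours, the user's mean rating is used.
fallback = weighted_prediction(user_ratings, pd.Series({10: 1.0}))
```

Movie 10 itself is among its own neighbours, but because the user has not rated it, it contributes nothing to either sum.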

In [11]:
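def multi_k(k):
    # Fill every unrated (movie, user) cell of pr with a KNN prediction.
    # Note: pr is updated in place, so later predictions can draw on
    # values that were predicted earlier in the loop.
    model = fit_model(k)
    total = len(sr_test_rem)
    m_list = list(set(sr_test_rem.index))
    i = 0

    for m in m_list:
        # Users who have not rated movie m.
        u_list = list(sr_test_rem[sr_test_rem.index==m]['userId'])
        print(i, "of", total, "completed")
        i = i + len(u_list)
        # The neighbours are computed once per movie and reused for all its users.
        r = recommendation(model, m, k)
        for u in u_list:
            pr.at[m,u] = predict(m, u, r)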


In [12]:
multi_k(48)

0 of 58490 completed
95 of 58490 completed
190 of 58490 completed
230 of 58490 completed
316 of 58490 completed
408 of 58490 completed
443 of 58490 completed
537 of 58490 completed
615 of 58490 completed
646 of 58490 completed
717 of 58490 completed
786 of 58490 completed
799 of 58490 completed
880 of 58490 completed
971 of 58490 completed
1069 of 58490 completed
1153 of 58490 completed
1192 of 58490 completed
1217 of 58490 completed
1314 of 58490 completed
1364 of 58490 completed
1438 of 58490 completed
1528 of 58490 completed
1609 of 58490 completed
1668 of 58490 completed
1766 of 58490 completed
1864 of 58490 completed
1924 of 58490 completed
1957 of 58490 completed
1980 of 58490 completed
2076 of 58490 completed
2106 of 58490 completed
2204 of 58490 completed
2231 of 58490 completed
2325 of 58490 completed
2353 of 58490 completed
2410 of 58490 completed
2504 of 58490 completed
2580 of 58490 completed
2632 of 58490 completed
2653 of 58490 completed
2699 of 58490 completed
2759 of 58

22688 of 58490 completed
22700 of 58490 completed
22798 of 58490 completed
22838 of 58490 completed
22901 of 58490 completed
22998 of 58490 completed
23084 of 58490 completed
23178 of 58490 completed
23264 of 58490 completed
23362 of 58490 completed
23400 of 58490 completed
23448 of 58490 completed
23481 of 58490 completed
23579 of 58490 completed
23597 of 58490 completed
23693 of 58490 completed
23789 of 58490 completed
23818 of 58490 completed
23914 of 58490 completed
23933 of 58490 completed
24026 of 58490 completed
24119 of 58490 completed
24157 of 58490 completed
24179 of 58490 completed
24250 of 58490 completed
24277 of 58490 completed
24350 of 58490 completed
24431 of 58490 completed
24526 of 58490 completed
24607 of 58490 completed
24677 of 58490 completed
24773 of 58490 completed
24859 of 58490 completed
24927 of 58490 completed
25016 of 58490 completed
25100 of 58490 completed
25168 of 58490 completed
25265 of 58490 completed
25351 of 58490 completed
25447 of 58490 completed


45123 of 58490 completed
45207 of 58490 completed
45302 of 58490 completed
45392 of 58490 completed
45490 of 58490 completed
45576 of 58490 completed
45674 of 58490 completed
45761 of 58490 completed
45811 of 58490 completed
45902 of 58490 completed
45952 of 58490 completed
46050 of 58490 completed
46125 of 58490 completed
46220 of 58490 completed
46249 of 58490 completed
46334 of 58490 completed
46358 of 58490 completed
46435 of 58490 completed
46530 of 58490 completed
46560 of 58490 completed
46599 of 58490 completed
46640 of 58490 completed
46680 of 58490 completed
46765 of 58490 completed
46815 of 58490 completed
46913 of 58490 completed
47010 of 58490 completed
47083 of 58490 completed
47151 of 58490 completed
47235 of 58490 completed
47247 of 58490 completed
47345 of 58490 completed
47443 of 58490 completed
47454 of 58490 completed
47530 of 58490 completed
47617 of 58490 completed
47628 of 58490 completed
47651 of 58490 completed
47745 of 58490 completed
47777 of 58490 completed


In [13]:
act = pred_rating.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)

In [14]:
sr_test_list =act.unstack().reset_index(name="rating").set_index("movieId")
sr_test_list = sr_test_list[sr_test_list.rating>0]

In [15]:
sr_test_rem = sr_test_list

In [16]:
# Set the cells that users have actually rated to 0 in the prediction matrix
# so that movies they have already rated are not recommended to them again.
a = list(sr_test_rem.index)
b = list(sr_test_rem.loc[:, "userId"])

for i in range(len(a)):
    pr.at[a[i],b[i]]=0
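
The same masking step can also be written without an explicit loop. The sketch below uses `DataFrame.mask` on two small hypothetical matrices (`pred` and `actual` stand in for `pr` and `act`); it assumes both frames share the same index and columns.

```python
import pandas as pd

# Hypothetical predicted and actual item-user matrices with identical labels.
pred = pd.DataFrame([[3.5, 4.0], [2.0, 4.5]], index=[10, 20], columns=[1, 2])
actual = pd.DataFrame([[0.0, 4.0], [2.0, 0.0]], index=[10, 20], columns=[1, 2])

# Wherever the user has an actual rating, replace the value with 0
# so the movie cannot rank among that user's recommendations.
masked = pred.mask(actual > 0, 0.0)
print(masked)
```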

In [17]:
pr.to_csv("predictions.csv", sep=',')

In [18]:
act.to_csv("actual.csv")

In [19]:
rec = pr.unstack().reset_index(name="rating")

To make recommendations, we sort the predicted ratings for every user, pick the 10 movies with the highest predicted ratings for each user, and then compute the coverage.

In [20]:
rec_sorted = rec.sort_values(by=['userId','rating'],ascending = [True,False])
rec_sorted_top_10 = rec_sorted.groupby('userId').head(10)
rec_sorted_top_10.head()

     userId  movieId    rating
574     903    48127  4.000000
74      903      746  3.821539
338     903     4338  3.682377
573     903    47972  3.507663
774     903    97005  3.498344


We compute catalog coverage as the percentage of movies in the sample that appear in the top-10 recommendations made by our model.

In [21]:
(len(rec_sorted_top_10['movieId'].drop_duplicates())/len(rec_sorted['movieId'].drop_duplicates()))*100

24.882075471698112
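
As a small sanity check of this metric, the sketch below wraps the same computation in a hypothetical helper, `catalog_coverage`, and runs it on made-up scores. It assumes a long table with `userId`, `movieId` and `rating` columns, like `rec` above.

```python
import pandas as pd

def catalog_coverage(scores: pd.DataFrame, n: int) -> float:
    """Percentage of distinct movies that appear in any user's top-n list."""
    top_n = (
        scores.sort_values(["userId", "rating"], ascending=[True, False])
              .groupby("userId")
              .head(n)
    )
    return 100.0 * top_n["movieId"].nunique() / scores["movieId"].nunique()

# Hypothetical predicted scores for 2 users and 3 movies.
toy_scores = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 30, 10, 20, 30],
    "rating":  [4.5, 3.0, 1.0, 4.0, 2.0, 3.5],
})

# Top-1: both users get movie 10, so 1 of 3 movies is covered.
coverage_top1 = catalog_coverage(toy_scores, n=1)

# Top-2: user 1 gets {10, 20}, user 2 gets {10, 30}, so all 3 movies are covered.
coverage_top2 = catalog_coverage(toy_scores, n=2)
```

Coverage grows with the list length, so a coverage figure is only meaningful together with the number of recommendations per user.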

Our catalog coverage on the movie set is nearly 25%. We expect that this can be significantly improved by adding the same business rule we added to our ALS model. We now look at the distribution of the number of ratings of the movies recommended by our model.

In [24]:
movie_dist = pred_rating.groupby('movieId').size().reset_index(name='counts')
movie_dist = movie_dist.merge(rec_sorted_top_10,on="movieId")
movie_dist['counts'].describe()

count    990.000000
mean       7.242424
std        8.412628
min        1.000000
25%        2.000000
50%        4.000000
75%        9.000000
max       86.000000
Name: counts, dtype: float64

From the distribution, we see that 75% of the recommendations are for movies that have been rated by at most 9 users (about 9% of the users in the sampled data), while some recommended movies have a high number of ratings (up to 86 of the 100 users in the sample). Thus, we are able to recommend movies with both low and high numbers of ratings, accomplishing our business objective of recommending even movies with few ratings to improve discovery.

We assume that these results are reproducible on a larger dataset, which we were unable to test because filling out the entire matrix for a larger sample was infeasible (it took 1 hour to run the code for just this very small sample). We also assume that building this model on a larger dataset and adding our business rule to it will produce recommendations with high coverage and discoverability.
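
One possible way to reduce the run time is to replace the per-cell loop with matrix operations. The sketch below is an assumption about how that could look, not the procedure used in this notebook: the function `fill_matrix` is hypothetical, it predicts every cell from the original ratings only (the loop above updates the matrix in place), and it falls back to the exact user mean rather than `cal_mean`.

```python
import numpy as np

def fill_matrix(R, k):
    """Fill unrated cells (0) of an items x users matrix with item-KNN predictions."""
    R = np.asarray(R, dtype=float)
    n_items = R.shape[0]

    # Cosine similarity between item rating vectors.
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for unrated items
    unit = R / norms
    S = unit @ unit.T

    # Keep only the k largest similarities in each row (the item itself included).
    k = min(k, n_items)
    if k < n_items:
        drop = np.argpartition(S, -k, axis=1)[:, :-k]
        np.put_along_axis(S, drop, 0.0, axis=1)

    # Weighted sums over the neighbours each user has actually rated.
    rated = (R != 0).astype(float)
    num = S @ R
    den = S @ rated

    # Fallback: mean of each user's existing ratings.
    user_mean = R.sum(axis=0) / np.maximum(rated.sum(axis=0), 1.0)

    safe_den = np.where(den > 0, den, 1.0)
    pred = np.where(den > 0, num / safe_den, user_mean[np.newaxis, :])

    # Existing ratings are kept; only the empty cells are filled.
    return np.where(R != 0, R, pred)

# Hypothetical 2-movie, 2-user example: user 1 has not rated movie 0.
R_toy = np.array([[5.0, 0.0],
                  [4.0, 2.0]])
filled = fill_matrix(R_toy, k=2)
print(filled)
```

In the toy example, the only rated neighbour of movie 0 for the second user is movie 1, so the empty cell receives that user's rating of 2 for movie 1. Whether this approach would scale to the full dataset was not tested here.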