# Nearest Neighbour and Rating Prediction

In this section we use item-based collaborative filtering to group items with similar rating patterns and to predict the rating a given user might give to an item they have not yet rated.

## Prepare dataset

In [1]:
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', encoding='latin-1')

In [2]:
df = df_books_ratings

df = df[df['Book-Rating'] != 0]

# subset users (at least 20 reviews)
x = df['User-ID'].value_counts() >= 20

users = x[x].index 

df = df[df['User-ID'].isin(users)]

# subset books (at least 20 reviews)
x = df['ISBN'].value_counts() >= 20

isbns = x[x].index 

df = df[df['ISBN'].isin(isbns)]

df.to_csv('data/BX-Book-Ratings-Subset.csv', index=False, sep=';')
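
The `x[x].index` idiom used above can look cryptic at first. The following is a minimal sketch on a made-up toy frame (the data and the threshold of 2 are assumptions for illustration only) showing how a boolean `value_counts()` result is turned into the list of IDs to keep.

```python
import pandas as pd

# Hypothetical toy ratings: user 1 rated three books, user 2 rated one.
toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2],
    "ISBN": ["A", "B", "C", "A"],
    "Book-Rating": [5, 7, 9, 8],
})

# Boolean Series indexed by user: True where the user has at least 2 ratings.
enough = toy["User-ID"].value_counts() >= 2

# Indexing a boolean Series by itself keeps only the True entries,
# so .index holds the users that pass the threshold.
active_users = enough[enough].index

# Keep only the rows written by those users.
toy_subset = toy[toy["User-ID"].isin(active_users)]
print(toy_subset)
```

The same pattern is applied twice in the cell above, first on `User-ID` and then on `ISBN`.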

## Creating the rating matrix

As covered in week 02, we need to construct a rating matrix from the ratings dataset. Each row of the matrix holds the ratings that all users gave to one book; a 0 means the user has not rated that book.

In [3]:
df_books_ratings = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')

In [4]:
df = df_books_ratings.pivot(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
df

ISBN / User-ID,242,254,507,638,643,709,805,882,929,1025,...,278026,278137,278188,278202,278221,278356,278418,278582,278633,278843
002542730X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
006000438X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0060096195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
006016848X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0060173289,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1573229571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1573229725,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1576737330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1592400876,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
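
To make the shape of the rating matrix easier to see, here is a small sketch of the same `pivot(...).fillna(0)` call on a made-up toy dataset (the values are assumptions for illustration, not taken from the Book-Crossing data).

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, book) pair.
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "B"],
    "Book-Rating": [5, 7, 8, 10],
})

# Rows become books, columns become users; missing pairs are filled with 0.
matrix = toy.pivot(index="ISBN", columns="User-ID", values="Book-Rating").fillna(0)
print(matrix)
```

Book `A` was rated by users 1 and 2 only, so its entry for user 3 is 0. Note that `pivot` expects each (ISBN, User-ID) pair to appear at most once in the input.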


## Nearest Neighbors

Now that we have a ratings matrix, we can compute similarities between books based on their respective ratings. Recall the cosine similarity between the rating vectors $r_{i}$ and $r_{j}$ of two books:

$$
sim(i, j) = \frac{r_{i} \cdot r_{j}} {||r_{i}||_{2}||r_{j}||_{2}}
$$

The cosine distance is $1 - sim(i, j)$, and it is the metric we pass to sklearn's NearestNeighbors algorithm.
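
As a quick refresher on how cosine similarity and cosine distance relate, the following pure-Python sketch computes both for a few made-up rating vectors. The helper name `cosine_similarity` and the vectors are assumptions for illustration; sklearn performs the equivalent computation internally when `metric='cosine'` is used.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical rating vectors (one entry per user).
r_i = [5.0, 0.0, 8.0]
r_j = [10.0, 0.0, 16.0]  # same direction as r_i, only scaled
r_k = [0.0, 7.0, 0.0]    # shares no rater with r_i

sim_ij = cosine_similarity(r_i, r_j)
sim_ik = cosine_similarity(r_i, r_k)

# Cosine distance is one minus cosine similarity.
dist_ij = 1 - sim_ij
dist_ik = 1 - sim_ik
print(sim_ij, dist_ij, sim_ik, dist_ik)
```

Vectors pointing in the same direction have a distance close to 0, while books with no rater in common have a distance of 1.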

In [5]:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric='cosine', algorithm='brute')

knn.fit(df.values)

distances, indices = knn.kneighbors(df.values, n_neighbors=5)

Let's explore the NearestNeighbors outputs and construct a data structure to hold the computed neighbourhoods. Note that the first element of each array in `indices` is the base element itself, from which the distances to the other elements are computed. This is why the first value in each row of `distances` is 0 (or a floating-point value very close to 0, such as 1.11e-16).

In [7]:
# indices
distances

array([[1.11022302e-16, 8.72045096e-01, 8.72236524e-01, 8.89511956e-01,
        8.90890060e-01],
       [0.00000000e+00, 8.22197723e-01, 8.45799648e-01, 8.45966882e-01,
        8.49323882e-01],
       [0.00000000e+00, 8.16215804e-01, 8.54572906e-01, 8.63366182e-01,
        8.73966190e-01],
       ...,
       [0.00000000e+00, 8.06007243e-01, 8.44110619e-01, 8.50576618e-01,
        8.65538918e-01],
       [1.11022302e-16, 8.51059461e-01, 8.59025728e-01, 8.64215083e-01,
        8.76340329e-01],
       [1.11022302e-16, 8.22145424e-01, 8.72410494e-01, 8.74373467e-01,
        8.93422831e-01]])
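
To see why the first column is always the item itself, here is a NumPy sketch of what a brute-force cosine neighbour search computes. It is a stand-in for illustration on a made-up 3x3 matrix, not sklearn's actual implementation, and the names `toy_indices` and `toy_distances` are assumptions.

```python
import numpy as np

# Hypothetical rating matrix: 3 books (rows) rated by 3 users (columns).
X = np.array([
    [5.0, 0.0, 8.0],
    [4.0, 0.0, 9.0],
    [0.0, 7.0, 1.0],
])

# Normalise each row to unit length, so dot products are cosine similarities.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
dist = 1.0 - unit @ unit.T  # pairwise cosine distances

# For each row, sort the other rows by distance and keep the 2 closest.
toy_indices = np.argsort(dist, axis=1, kind="stable")[:, :2]
toy_distances = np.take_along_axis(dist, toy_indices, axis=1)

print(toy_indices)
print(toy_distances)
```

Each book has a distance of (numerically) zero to itself, so it is always ranked first; the real neighbours start at the second column.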

In [10]:
neighbours = {}

for nn, dist in zip(indices, distances):
    # the first neighbour is the book itself; use its ISBN as the key
    e_isbn = df.index[nn[0]]
    neighbours[e_isbn] = {"nn": [df.index[n] for n in nn[1:]], "dist": list(dist[1:])}

neighbours
    

{'002542730X': {'nn': ['0786889020', '0316666009', '080411109X', '0064407675'],
  'dist': [0.8720450960582269,
   0.8722365237597214,
   0.8895119559613895,
   0.8908900595880038]},
 '006000438X': {'nn': ['0060987103', '0679734775', '0375760911', '0452269571'],
  'dist': [0.822197723487859,
   0.8457996480341703,
   0.8459668818652946,
   0.849323882047947]},
 '0060096195': {'nn': ['0743418204', '0440241162', '034541389X', '0375760911'],
  'dist': [0.8162158040738834,
   0.8545729063965468,
   0.8633661819173695,
   0.873966190310399]},
 '006016848X': {'nn': ['0894805770', '0515122734', '1558531025', '0842329250'],
  'dist': [0.8257707260450212,
   0.8545390907659136,
   0.8580511173483932,
   0.8686965848263328]},
 '0060173289': {'nn': ['0399146431', '0316789089', '0452260116', '0385484518'],
  'dist': [0.7902272750421573,
   0.8412658418705504,
   0.8664421459773384,
   0.8742139608011895]},
 '0060199652': {'nn': ['0399146431', '0743412028', '0374199698', '0553277472'],
  'dist': [0.

## Predict rating (based on neighbours)

Now that we have each book's neighbourhood, we can predict the rating a user might give to an item from that item's neighbours and the ratings, if any, that the user gave them. To compute a prediction we use the following:

$$
Pred(u, i) = \frac{\sum_{j} sim(i, j) * r_{u, j}} {\sum_{j} sim(i, j)}
$$

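Where $sim(i, j)$ is the similarity between items i and j, $r_{u, j}$ is the rating that user u gave to item j, and the sums run over the neighbours j of item i. Note that `kneighbors` returned cosine distances, so $sim(i, j) = 1 - dist(i, j)$. The function below uses the stored distances directly as weights and counts an unrated neighbour as a rating of 0, so it only approximates the formula.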


In [11]:
def predict_rating(user_id, ISBN, neighbours):
    # df is the global rating matrix (rows: ISBN, columns: User-ID)
    if ISBN not in neighbours:
        print("no data for ISBN")
        return 0

    nn = neighbours[ISBN]['nn']
    dist = neighbours[ISBN]['dist']

    numerator = 0
    denominator = 0

    for i in range(len(nn)):
        # rating the user gave to this neighbour (0 if unrated)
        user_rating = df.loc[nn[i], user_id]

        # weighted by the stored cosine distance
        numerator += user_rating * dist[i]
        denominator += dist[i]

    if denominator > 0:
        return numerator / denominator

    return 0
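
The formula above weights each neighbour by its similarity, whereas `predict_rating` weights by the stored cosine distance. As a hedged alternative, the sketch below converts each distance to a similarity with `1 - d` and skips neighbours the user has not rated. The function name `predict_rating_sim`, its signature, and the toy values are assumptions for illustration, not part of the original notebook.

```python
def predict_rating_sim(user_ratings, nn, dist):
    """Similarity-weighted mean of one user's ratings over rated neighbours.

    user_ratings: dict mapping ISBN -> rating given by the user
    nn, dist:     neighbour ISBNs and their cosine distances to the target book
    """
    numerator = 0.0
    denominator = 0.0
    for isbn, d in zip(nn, dist):
        rating = user_ratings.get(isbn, 0)
        if rating == 0:
            continue  # neighbour not rated by this user: leave it out
        sim = 1.0 - d  # cosine similarity from cosine distance
        numerator += sim * rating
        denominator += sim
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical user who rated neighbours A and B but not C.
user_ratings = {"A": 8, "B": 6}
pred = predict_rating_sim(user_ratings, nn=["A", "B", "C"], dist=[0.8, 0.9, 0.85])
print(round(pred, 3))
```

With the notebook's data structures, a call could look like `predict_rating_sim(df[user_id].to_dict(), **neighbours[ISBN])`. Because unrated neighbours are skipped, the prediction stays on the same scale as the user's actual ratings instead of being pulled towards 0.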


Can you think of a way to use predictions to recommend items to a given user? If so, how would you rank the recommendations?
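
One possible answer, sketched under assumptions: predict a score for every book the user has not rated yet, drop the books without a usable prediction, and rank the rest by predicted rating. The helper `recommend` and the toy scores below are hypothetical names and values for illustration.

```python
def recommend(predict, user_rated, candidates, top_n=3):
    """Return the top_n (isbn, score) pairs among books the user has not rated.

    predict:    callable mapping an ISBN to a predicted rating
    user_rated: set of ISBNs the user has already rated
    candidates: iterable of ISBNs to consider
    """
    scored = [(isbn, predict(isbn)) for isbn in candidates if isbn not in user_rated]
    # A prediction of 0 means no rated neighbour was available.
    scored = [(isbn, score) for isbn, score in scored if score > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical predicted ratings for one user.
toy_predictions = {"A": 6.8, "B": 0.0, "C": 2.3, "D": 9.1}
top = recommend(toy_predictions.get, user_rated={"D"},
                candidates=list(toy_predictions), top_n=2)
print(top)
```

In the notebook, `predict` could be something like `lambda isbn: predict_rating(u, isbn, neighbours)` for a fixed user `u`.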

In [12]:
all_books = df.index.tolist()
all_users = df.columns.tolist()

for b in all_books:
    for u in all_users:
        pr = predict_rating(u, b, neighbours)
        if pr > 0:
            print(f"{b} - {u}: prediction - {pr}")


002542730X - 3923: prediction - 1.2637872668182821
002542730X - 7958: prediction - 2.2712982020067636
002542730X - 8253: prediction - 2.2712982020067636
002542730X - 11676: prediction - 6.778594297051076
002542730X - 13273: prediction - 2.0220596269092517
002542730X - 16795: prediction - 3.0063565963992684
002542730X - 17003: prediction - 2.0189317351171234
002542730X - 17950: prediction - 4.2693618902695185
002542730X - 21356: prediction - 1.5141988013378422
002542730X - 21540: prediction - 1.9792870765658777
002542730X - 28523: prediction - 1.9792870765658777
002542730X - 30445: prediction - 2.5275745336365643
002542730X - 30735: prediction - 2.2712982020067636
002542730X - 33124: prediction - 2.2266979611366122
002542730X - 36807: prediction - 2.5275745336365643
002542730X - 36907: prediction - 2.2712982020067636
002542730X - 43006: prediction - 1.2373259758798425
002542730X - 45557: prediction - 2.0189317351171234
002542730X - 50129: prediction - 1.261832334448202
002542730X - 5258