# Nearest Neighbour and Rating Prediction

In this section we use item-based collaborative filtering to group items with similar ratings and to predict the rating a given user might give to an item they have not yet rated.

## Prepare dataset

In [None]:
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', encoding='latin-1')

In [None]:
df = df_books_ratings

df = df[df['Book-Rating'] != 0]

# subset users (at least 20 reviews)
x = df['User-ID'].value_counts() >= 20

users = x[x].index 

df = df[df['User-ID'].isin(users)]

# subset books (at least 20 reviews)
x = df['ISBN'].value_counts() >= 20

isbns = x[x].index 

df = df[df['ISBN'].isin(isbns)]

df.to_csv('data/BX-Book-Ratings-Subset.csv', index=False, sep=';')

## Creating the rating matrix

As covered in week 02, we need to construct a rating matrix from the ratings dataset. Each row of the matrix holds the ratings that the users gave to one book, with one column per user; books a user has not rated are filled with 0.

In [None]:
df_books_ratings = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')

In [None]:
df = df_books_ratings.pivot(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
df
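
To make the shape of the rating matrix concrete, here is a minimal sketch on a made-up dataset of three users and three books. The `toy` frame and its values are illustrative assumptions, not part of the Book-Crossing data; the `pivot` call is the same as above.

```python
import pandas as pd

# Hypothetical ratings: one row per (user, book) pair
toy = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'ISBN': ['A', 'B', 'A', 'C'],
    'Book-Rating': [8, 5, 10, 7],
})

# Books become rows, users become columns, missing ratings become 0
toy_matrix = toy.pivot(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
print(toy_matrix)
```

Note that `pivot` raises a `ValueError` if the same (ISBN, User-ID) pair appears more than once; in that case `pivot_table` with an aggregation function is one possible alternative.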

## Nearest Neighbors


$$
sim(i, j) = \frac{r_{i} \cdot r_{j}} {||r_{i}||_{2}||r_{j}||_{2}}
$$

Now that we have a ratings matrix, we can compute similarities between books based on their respective ratings. Remember cosine similarity, shown above? sklearn's NearestNeighbors algorithm with `metric='cosine'` works with the cosine distance, which is 1 minus the cosine similarity.
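
As a quick check of the formula, the following sketch computes the cosine similarity by hand with NumPy. The helper `cosine_similarity` and the three toy rating vectors are assumptions made for illustration; they are not part of the notebook's dataset.

```python
import numpy as np

def cosine_similarity(r_i, r_j):
    """Cosine similarity between two rating vectors (0.0 if either is all zeros)."""
    norm_product = np.linalg.norm(r_i) * np.linalg.norm(r_j)
    if norm_product == 0:
        return 0.0
    return float(np.dot(r_i, r_j) / norm_product)

# Toy rating vectors for three hypothetical books over four users
book_a = np.array([5.0, 0.0, 3.0, 0.0])
book_b = np.array([10.0, 0.0, 6.0, 0.0])  # same direction as book_a
book_c = np.array([0.0, 4.0, 0.0, 2.0])   # no raters in common with book_a

sim_ab = cosine_similarity(book_a, book_b)  # close to 1: same rating pattern
sim_ac = cosine_similarity(book_a, book_c)  # 0: no overlap
dist_ab = 1 - sim_ab                        # the corresponding cosine distance
```

Because the measure depends only on the direction of the vectors, `book_a` and `book_b` count as nearly identical even though their rating scales differ.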

In [None]:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric='cosine', algorithm='brute')

knn.fit(df.values)

distances, indices = knn.kneighbors(df.values, n_neighbors=5)

Let's explore the NearestNeighbors outputs and construct a data structure to hold the computed neighbourhoods! Please note that the first element of each array in `indices` is normally the base element itself, that is, the book whose distances to the other elements are computed (hence the distance of 0 in the `distances` array). If several books have identical rating vectors, one of the duplicates may be listed first instead.

In [None]:
indices
# distances

In [None]:
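neighbours = {}

for i in range(len(indices)):
    isbn = df.index[i]
    # Pair each neighbour with its similarity (1 - cosine distance), skipping the
    # query book itself; with duplicate rating vectors it is not always listed first.
    pairs = [(df.index[n], 1 - d) for n, d in zip(indices[i], distances[i]) if n != i]
    neighbours[isbn] = {"nn": [p[0] for p in pairs], "dist": [p[1] for p in pairs]}

neighbours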

## Predict rating (based on neighbours)

Now that we have a neighbourhood for each book, we can predict the rating a user might give to an item from the ratings that user gave to the item's neighbours. To compute a prediction we use the following:

$$
Pred(u, i) = \frac{\sum_{j} sim(i, j) \cdot r_{u, j}} {\sum_{j} sim(i, j)}
$$

Where $sim(i, j)$ is the similarity between items $i$ and $j$ (1 minus the cosine distance computed above), $r_{u, j}$ is the rating that user $u$ gave to item $j$, and the sums run over the neighbours $j$ of item $i$.
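
A small worked example may help here. The similarities and ratings below are made-up numbers for three hypothetical neighbours of one item, chosen only to illustrate the arithmetic of the formula.

```python
# Hypothetical similarities sim(i, j) of three neighbours j to item i
sims = [0.9, 0.6, 0.5]
# Hypothetical ratings r_{u, j} the user gave to those neighbours
ratings = [8, 10, 6]

numerator = sum(s * r for s, r in zip(sims, ratings))  # 7.2 + 6.0 + 3.0
denominator = sum(sims)                                # 0.9 + 0.6 + 0.5
pred = numerator / denominator                         # roughly 8.1
```

The prediction is a weighted average: it lies between the lowest and highest neighbour rating and is pulled towards the ratings of the most similar neighbours.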


In [None]:
def predict_rating(user_id, ISBN, neighbours):
    """Predict the rating user_id would give to ISBN from its neighbours' ratings."""

    if ISBN not in neighbours:
        print(f"no data for ISBN {ISBN}")
        return 0

    nn = neighbours[ISBN]['nn']
    sims = neighbours[ISBN]['dist']  # stored as similarities (1 - cosine distance)

    numerator = 0
    denominator = 0

    for i in range(len(nn)):
        # rating the user gave to this neighbour (0 if unrated)
        user_rating = df.loc[nn[i], user_id]

        numerator += user_rating * sims[i]
        denominator += sims[i]

    if denominator > 0:
        return numerator / denominator

    return 0
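
Because unrated books are stored as 0 in the matrix, the function above lets every unrated neighbour pull the prediction towards 0. One possible variant, sketched below as an assumption rather than as the notebook's method, sums only over the neighbours the user has actually rated. The function name `predict_rating_rated_only`, the extra `ratings` parameter, and the toy data are all hypothetical.

```python
import pandas as pd

def predict_rating_rated_only(user_id, isbn, neighbours, ratings):
    """Weighted average over the neighbours of `isbn` that `user_id` has rated."""
    if isbn not in neighbours:
        return 0
    numerator = 0.0
    denominator = 0.0
    for nn_isbn, sim in zip(neighbours[isbn]['nn'], neighbours[isbn]['dist']):
        rating = ratings.loc[nn_isbn, user_id]
        if rating == 0:  # 0 marks "not rated" in the filled matrix
            continue
        numerator += sim * rating
        denominator += sim
    return numerator / denominator if denominator > 0 else 0

# Toy matrix: rows are books, columns are users (values are made up)
toy_ratings = pd.DataFrame(
    [[0, 8], [10, 0], [0, 7]], index=['A', 'B', 'C'], columns=[1, 2]
)
# Toy neighbourhood in the same layout as `neighbours` above
toy_neighbours = {'A': {'nn': ['B', 'C'], 'dist': [0.5, 0.5]}}

pred = predict_rating_rated_only(1, 'A', toy_neighbours, toy_ratings)
```

In this toy case user 1 rated only book B, so the variant predicts B's rating for book A, whereas averaging over both neighbours would halve it.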


Can you think of a way to use predictions to recommend items to a given user? If so, how would you rank the recommendations?

In [None]:
all_books = df.index.tolist()
all_users = df.columns.tolist()

for b in all_books:
    for u in all_users:
        pr = predict_rating(u, b, neighbours)
        if pr > 0:
            print(f"{b} - {u}: prediction - {pr}")
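
One possible answer to the question above, offered as a sketch rather than the definitive approach: predict a rating for every book the user has not rated yet, then rank those books by predicted rating and keep the top few. The helpers `predict` and `recommend`, the `top_n` parameter, and the toy data are assumptions introduced for this example.

```python
import pandas as pd

def predict(user_id, isbn, neighbours, ratings):
    """Similarity-weighted average of the user's ratings over the item's neighbours."""
    entry = neighbours.get(isbn)
    if entry is None:
        return 0
    num = sum(s * ratings.loc[n, user_id] for n, s in zip(entry['nn'], entry['dist']))
    den = sum(entry['dist'])
    return num / den if den > 0 else 0

def recommend(user_id, neighbours, ratings, top_n=3):
    """Rank the books the user has not rated by predicted rating, highest first."""
    unrated = ratings[ratings[user_id] == 0].index
    scored = [(isbn, predict(user_id, isbn, neighbours, ratings)) for isbn in unrated]
    scored = [(isbn, p) for isbn, p in scored if p > 0]  # drop books with no signal
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy data: user 1 rated books A and B, but not C and D (values are made up)
toy_ratings = pd.DataFrame({1: [10, 4, 0, 0]}, index=['A', 'B', 'C', 'D'])
toy_neighbours = {
    'C': {'nn': ['A', 'B'], 'dist': [0.8, 0.2]},  # C is most similar to A
    'D': {'nn': ['A', 'B'], 'dist': [0.2, 0.8]},  # D is most similar to B
}

recs = recommend(1, toy_neighbours, toy_ratings)
```

Here book C ranks above book D because it is closer to the book the user rated highly. Ranking by predicted rating is only one choice; ties and books with very few rated neighbours may deserve separate handling.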
