# Cosine Similarity

In this section you will construct another similarity metric, this time based on the cosine.

Remember trigonometry (or better, linear algebra!) from your mathematics class? This metric is based on the cosine of the angle between two vectors. It might look difficult, but it is rather simple.

## 1. Load the dataset

In [1]:
import pandas as pd
df = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')

## 2. Explore the dimensions (shape) of users' rating data

Here are the IDs of two users in our ratings dataset. What are their respective ratings' dimensions (shape)? How many books did these users rate respectively?

In [13]:
user_id_a = 277427
user_id_b = 277203

ratings_a = df.groupby('User-ID').get_group(user_id_a)
ratings_b = df.groupby('User-ID').get_group(user_id_b)

ratings_a.shape, ratings_b.shape

((17, 3), (1, 3))

If we are to produce vectors from the users' ratings and apply trigonometric operations to them, can you see a problem here? Do the vectors have the same dimension? If not, why is this a 'problem'?

The vectors do not have the same dimension: user A rated 17 books and user B rated only 1. Operations such as the dot product require vectors of equal length, so we cannot apply them directly.
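To make the problem concrete, here is a minimal sketch with made-up rating vectors (the values are illustrative and not taken from the dataset). NumPy refuses to compute a dot product of two 1-D arrays with different lengths.

```python
import numpy as np

# Hypothetical rating vectors of different lengths (illustrative values only).
long_vector = np.array([5, 7, 9])
short_vector = np.array([8, 0])

try:
    np.dot(long_vector, short_vector)
    dot_failed = False
except ValueError as err:
    # NumPy raises a ValueError because the shapes are not aligned.
    dot_failed = True
    print("dot product failed:", err)
```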

## 3. Vectorize ratings

Can you vectorize the above users' ratings so that they have the same dimension? To help you do this, here is a sorted list of all the ISBNs in our dataset. How can you use this list to create a (large!) vector for user_id_a?

In [33]:
def vectorize_ratings(ratings, full_vector):
    return [
        ratings[ratings['ISBN'] == x]['Book-Rating'].item()
        if x in ratings['ISBN'].values
        else 0
        for x in full_vector
    ]

In [35]:
import numpy as np

ISBNS_array = df['ISBN'].unique()
ISBNS_array = np.sort(ISBNS_array).tolist()

vector_a = vectorize_ratings(ratings_a, ISBNS_array)
vector_b = vectorize_ratings(ratings_b, ISBNS_array)

len(vector_a), len(vector_b)


(832, 832)

## 4. Helper functions

Below are two functions that (1) retrieve user ratings from a given dataset and (2) vectorize these ratings according to a certain dimension (all ISBNs).

In [36]:
def get_user_ratings(user_id, df_subset):
    
    df_user = df_subset[df_subset['User-ID'] == user_id]
        
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

In [40]:
def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    
    user_ISBNS = user_ISBN_rating_dict.keys()
    
    return [0 if v not in user_ISBNS else user_ISBN_rating_dict[v] for v in all_ISBNS_array]    
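As a quick check of how the two helpers fit together, the sketch below runs them on a tiny made-up DataFrame. The names `toy_df`, `toy_isbns`, `toy_vector_1` and `toy_vector_2` and all the ratings are hypothetical; only the two functions come from the cells above.

```python
import pandas as pd

def get_user_ratings(user_id, df_subset):
    df_user = df_subset[df_subset['User-ID'] == user_id]
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    user_ISBNS = user_ISBN_rating_dict.keys()
    return [0 if v not in user_ISBNS else user_ISBN_rating_dict[v] for v in all_ISBNS_array]

# Hypothetical toy data: user 1 rated books B and C, user 2 rated book A.
toy_df = pd.DataFrame({
    'User-ID': [1, 1, 2],
    'ISBN': ['B', 'C', 'A'],
    'Book-Rating': [7, 9, 5],
})

# The sorted list of all ISBNs fixes the position of each book in every vector.
toy_isbns = sorted(toy_df['ISBN'].unique())

toy_vector_1 = create_ratings_vector(get_user_ratings(1, toy_df), toy_isbns)
toy_vector_2 = create_ratings_vector(get_user_ratings(2, toy_df), toy_isbns)
```

Both vectors have one entry per ISBN, with 0 wherever the user has no rating for that book.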

## 5. Cosine distance function

Can you finish writing the function below, which calculates the cosine distance between two vectors of the same dimension? As you can see, use the numpy `dot` and `norm` functions to translate the given formula into code.
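For reference, the formula is presumably the standard cosine similarity, which matches the comments in the code cell. The distance is obtained by subtracting it from 1:

$$
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}, \qquad d(a, b) = 1 - \cos(\theta)
$$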

In [60]:
from numpy import dot
from numpy.linalg import norm

def cosine_distance(ratings_vector_user_a, ratings_vector_user_b):
    #     a . b  -> dot(a, b)
    #     -----
    #     |a||b| -> norm(a) * norm(b)
    p1 = dot(ratings_vector_user_a, ratings_vector_user_b)
    p2 = norm(ratings_vector_user_a) * norm(ratings_vector_user_b)
    # The cosine distance is 1 minus the cosine similarity.
    return 1 - (p1 / p2)
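A small sanity check can help build intuition for the values this function returns. The sketch below uses made-up vectors: two that point in the same direction (distance close to 0) and two that share no rated book (distance 1). The `safe_cosine_distance` variant is hypothetical and not part of the exercise. It assumes that a vector containing only zeros should simply get distance 1, since dividing by a zero norm would otherwise produce `nan`.

```python
from numpy import dot
from numpy.linalg import norm

def cosine_distance(vector_a, vector_b):
    """Return 1 minus the cosine of the angle between two vectors."""
    return 1 - dot(vector_a, vector_b) / (norm(vector_a) * norm(vector_b))

def safe_cosine_distance(vector_a, vector_b):
    """Hypothetical variant that guards against all-zero vectors."""
    denominator = norm(vector_a) * norm(vector_b)
    if denominator == 0:
        # Assumed convention: no ratings means no measurable similarity.
        return 1.0
    return 1 - dot(vector_a, vector_b) / denominator

# Same direction: the second vector is the first one scaled by 2.
same_direction = cosine_distance([1, 2, 3], [2, 4, 6])

# No book in common: the dot product is 0, so the distance is 1.
orthogonal = cosine_distance([5, 0], [0, 8])

# All-zero vector handled by the guarded variant.
zero_case = safe_cosine_distance([0, 0], [5, 8])
```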

## 6. Calculate distances

Here is the ID of a user in our dataset (you can of course choose another one!).

Can you calculate this user's cosine distance from all the other users in the dataset?

In [None]:
a_user_id = 277427
users = df['User-ID'].unique()

a_ratings = get_user_ratings(a_user_id, df)
a_vector = create_ratings_vector(a_ratings, ISBNS_array)


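# Note: `users` also contains a_user_id, so the list includes the user's distance to itself.
cosine_distances = [
    cosine_distance(a_vector, create_ratings_vector(get_user_ratings(other_id, df), ISBNS_array))
    for other_id in users
]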

## 7. Function calculating distances

Considering the code above, can you make a function that will take as input a given user's ID and calculate its distance from all other users in our dataset?

In [None]:
users = df['User-ID'].unique()

def cosine_distances(user_id):
    # Build the ratings vector for the requested user.
    a_ratings = get_user_ratings(user_id, df)
    a_vector = create_ratings_vector(a_ratings, ISBNS_array)
    return [
        cosine_distance(a_vector, create_ratings_vector(get_user_ratings(other_id, df), ISBNS_array))
        for other_id in users
    ]

cosine_distances(277427)
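The returned list is in the same order as `users`, so one possible next step is to pair each distance with its user ID and sort. The sketch below does this with made-up IDs and distances (illustrative values, not real results). The names `toy_users`, `toy_distances`, `query_user` and `ranked` are hypothetical.

```python
import pandas as pd

# Hypothetical user IDs and distances (illustrative values, not real results).
toy_users = [101, 102, 103, 104]
toy_distances = [0.0, 0.83, 0.12, 1.0]
query_user = 101

ranked = (
    pd.Series(toy_distances, index=toy_users, name='cosine_distance')
    .drop(query_user)   # remove the user's distance to itself
    .sort_values()      # smallest distance (most similar) first
)

closest_user = ranked.index[0]
```

Dropping the query user first avoids reporting the user as their own nearest neighbour.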