# Cosine Similarity

In this section you will construct another similarity metric, this time based on the cosine.

Remember trigonometry (or better, linear algebra!) from your mathematics class? This metric is based on the cosine of the angle between two vectors: the smaller the angle, the closer the cosine is to 1 and the more similar the vectors are. It might look difficult, but it is rather simple.
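As a quick illustration of the idea, here is a minimal sketch that computes the cosine of the angle between two small vectors. The vectors `a` and `b` are made up for the example and are not taken from the ratings dataset.

```python
import numpy as np

# Two made-up 2-D vectors (illustrative only, not from the ratings data)
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# cos(theta) = (a . b) / (|a| * |b|)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Recover the angle itself, in degrees
angle_deg = np.degrees(np.arccos(cos_theta))

print(round(cos_theta, 4))  # 0.7071
print(round(angle_deg, 1))  # 45.0
```

Vectors pointing in the same direction give a cosine of 1, and perpendicular vectors give 0.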

In [2]:
import pandas as pd
import numpy as np
from collections import defaultdict

## 1. Load the dataset

In [38]:
df = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';', encoding='latin-1')
df

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276727 | 0446520802 | 0 |
| 2 | 276744 | 038550120X | 7 |
| 3 | 276746 | 0425115801 | 0 |
| 4 | 276746 | 0449006522 | 0 |
| ... | ... | ... | ... |
| 393949 | 276704 | 0446605409 | 0 |
| 393950 | 276704 | 0743211383 | 7 |
| 393951 | 276704 | 080410526X | 0 |
| 393952 | 276706 | 0679447156 | 0 |


## 2. Explore the dimensions (shape) of users' rating data

Here are the IDs of two users in our ratings dataset. What are their respective ratings' dimensions (shape)? How many books did these users rate respectively?

In [39]:
user_id_a = 277427
user_id_b = 277203

df_user_a = df[df["User-ID"] == user_id_a]
print(f"The shape for User-ID a is : {df_user_a.shape}")
df_user_b = df[df["User-ID"] == user_id_b]
print(f"The shape for User-ID b is : {df_user_b.shape}")


The shape for User-ID a is : (235, 3)
The shape for User-ID b is : (9, 3)


In [40]:
len(df_user_a["ISBN"].tolist())

235

If we are to produce vectors from the users' ratings and apply trigonometric operations to them, can you see a problem here? Are the vectors of the same dimension? If not, why is this a 'problem'?
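One way to see the problem is to try a dot product on two rating lists of different lengths. The sketch below uses made-up lists (`ratings_a` and `ratings_b` are hypothetical, not real users) and only shows that numpy refuses to combine them.

```python
import numpy as np

# Hypothetical rating lists of different lengths (values are made up)
ratings_a = [7, 0, 9]   # a user who rated three books
ratings_b = [8, 5]      # a user who rated two books

try:
    np.dot(ratings_a, ratings_b)
    failed = False
except ValueError as err:
    # numpy cannot pair up the entries when the lengths differ
    failed = True
    print("dot product failed:", err)
```

Even if the lengths happened to match, position `i` in one list would not necessarily refer to the same book as position `i` in the other, so the result would be meaningless.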

## 3. Vectorize ratings

Can you vectorize the above users' ratings so that they have the same dimension? To help you do this, here is a sorted list of all the ISBNs in our dataset. How can you use this list to create a (large!) vector for user_id_a?

In [29]:
import numpy as np

ISBNS_array = df['ISBN'].unique()
ISBNS_array = np.sort(ISBNS_array).tolist()

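# ISBNs rated by user a
user_reviews = df_user_a["ISBN"].tolist()
# One entry per ISBN in the dataset, initialised to 0
book_vector = dict.fromkeys(ISBNS_array, 0)

# Mark each book the user rated (this records presence, not the rating value)
for book in user_reviews:
    book_vector[book] += 1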
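The same idea can be sketched on a toy example that stores the rating itself rather than a count. All ISBNs and ratings below are invented for illustration. Every user gets one slot per ISBN, in the same order, so the resulting vectors always have the same dimension.

```python
# Made-up ISBN universe and user ratings (illustrative values only)
all_isbns = sorted(["0001", "0002", "0003", "0004"])
ratings_a = {"0002": 7, "0004": 10}
ratings_b = {"0001": 5}

# One slot per ISBN; books the user did not rate get 0
vector_a = [ratings_a.get(isbn, 0) for isbn in all_isbns]
vector_b = [ratings_b.get(isbn, 0) for isbn in all_isbns]

print(vector_a)  # [0, 7, 0, 10]
print(vector_b)  # [5, 0, 0, 0]
```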
    

## 4. Helper functions

Below are two functions that (1) retrieve user ratings from a given dataset and (2) vectorize these ratings according to a certain dimension (all ISBNs).

In [32]:
def get_user_ratings(user_id, df_subset):
    
    df_user = df_subset[df_subset['User-ID'] == user_id]
        
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

dictx = get_user_ratings(277427,df)

In [33]:
def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    # One entry per ISBN in the dataset; books the user did not rate get 0
    return [user_ISBN_rating_dict.get(isbn, 0) for isbn in all_ISBNS_array]

create_ratings_vector(dictx, ISBNS_array)

[10,
 0,
 0,
 7,
 0,
 0,
 0,
 0,
 0,
 9,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 10,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 8,
 0,
 8,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 10,
 0,
 0,
 0,
 0,
 0,
 9,
 0,
 8,
 0,
 0,
 0,
 9,
 9,
 9,
 0,
 0,
 9,
 0,
 9,
 8,
 0,
 8,
 0,
 0,
 0,
 7,
 9,
 0,
 9,
 0,
 0,
 0,
 8,
 0,
 0,
 0,
 9,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 7,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 5,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 7,
 0,
 0,
 8,
 0,
 9,
 0,
 0,
 8,
 0,
 10,
 9,
 10,
 0,
 0,
 0,
 0,
 0,
 10,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 8,
 0,
 10,
 0,
 0,
 6,
 0,
 0,
 0,
 0,
 0,
 8,
 0,
 0,
 0,
 0,
 0,
 0,
 9,
 0,
 0,
 10,
 0,
 0,
 10,
 9,
 0,
 0,
 0,
 9,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 8,
 0,
 0,
 8,
 0,
 0,
 0,
 0,
 0,
 9,
 0,
 5,
 0,
 0,
 0,
 0,
 9,
 8,
 9,
 9,
 0,
 8,
 0,
 10,
 10,
 0,
 0,
 0,
 0,
 10,
 10,
 0,
 10,
 10,
 0,
 10,
 0,
 0,
 10,
 0,
 10,
 8,
 0,
 10,
 0,
 7,
 0,
 10,
 0,
 0,
 0,
 10,
 8,
 0,
 0,
 0,
 0,
 10]

## 5. Cosine distance function

Can you finish writing the function below, which calculates the cosine of the angle between two vectors that have the same dimension? Use the numpy `dot` and `norm` functions to translate the given formula into code.

In [58]:
from numpy import dot
from numpy.linalg import norm

def cosine_distance(ratings_vector_user_a, ratings_vector_user_b):
    # Despite the name, this returns the cosine *similarity*:
    #
    #     a . b   -> dot(a, b)
    #     -----
    #     |a||b|  -> norm(a) * norm(b)
    dot_product = dot(ratings_vector_user_a, ratings_vector_user_b)
    norm_product = norm(ratings_vector_user_a) * norm(ratings_vector_user_b)
    # A user with no non-zero ratings has norm 0; avoid dividing by zero
    if norm_product == 0:
        return 0.0
    return dot_product / norm_product
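A note on naming: the formula above yields a similarity (1 means the same direction, 0 means nothing in common), while the term "cosine distance" usually refers to one minus that value. The sketch below illustrates the difference on made-up vectors. The helper name `cosine_similarity` is chosen for this example only and is not part of the notebook.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    denom = norm(u) * norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

u = [10, 0, 8]
v = [10, 0, 8]   # identical to u
w = [0, 5, 0]    # no rated book in common with u

sim_same = cosine_similarity(u, v)
sim_disjoint = cosine_similarity(u, w)

# similarity, followed by the corresponding distance (1 - similarity)
print(sim_same, 1 - sim_same)
print(sim_disjoint, 1 - sim_disjoint)
```

Because of floating-point rounding, the similarity of a vector with itself may come out very slightly below 1.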

## 6. Calculate distances

Here is the ID of a user in our dataset (you can of course choose another one!).

Can you calculate this user's cosine distance from all the other users in the dataset?

In [59]:
a_user_id = 277427

user_list = df["User-ID"].unique()
user_a_vector = create_ratings_vector(get_user_ratings(a_user_id,df), ISBNS_array)


for user in user_list:
    user_vector = create_ratings_vector(get_user_ratings(user,df), ISBNS_array)
    cos_dis = cosine_distance(user_a_vector, user_vector)
    print(cos_dis)
    

0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.17942892439213867
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.14011045056973284
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.9999999999999998
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.07425839028532635
0.06725301627347177
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0

0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.07498315997030547
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0860194959826337
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.14011045056973284
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0891657447416646
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.14011045056973284
0.138337506903075
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.14011045056973284
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.10721923238292456
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0


0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11443956942649351
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.14011045056973284
0.0
0.0
0.056332327012722325
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0594438298277764
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.09807731539881298
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0


0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.08406627034183971
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.062175435539388854
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.16573501718321765
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.038089718413029584
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.09807731539881298
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11208836045578628
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

0.0
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.14011045056973284
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.1440440450044374
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.07925843977036852
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.0
0.0
0.12609940551275955
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11208836045578627
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0

KeyboardInterrupt: 
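The loop is slow because it rebuilds a full-length Python list for every user. One possible alternative, sketched here on a tiny made-up table, is to build a single user-by-ISBN matrix and compute all the cosines at once. The `toy` DataFrame and its values are invented. Note that `pivot_table` averages duplicate (user, ISBN) pairs by default, and that a dense matrix for the full dataset may need a lot of memory.

```python
import numpy as np
import pandas as pd

# Tiny made-up ratings table with the same column names as the dataset
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "ISBN": ["A", "B", "A", "B", "C"],
    "Book-Rating": [10, 8, 10, 8, 7],
})

# Rows = users, columns = ISBNs, missing ratings filled with 0
matrix = toy.pivot_table(index="User-ID", columns="ISBN",
                         values="Book-Rating", fill_value=0)

target = matrix.loc[1].to_numpy(dtype=float)
values = matrix.to_numpy(dtype=float)

# Dot product of the target with every row, divided by the product of norms;
# rows with a zero norm keep the default value 0
norms = np.linalg.norm(values, axis=1) * np.linalg.norm(target)
sims = np.divide(values @ target, norms,
                 out=np.zeros(len(norms)), where=norms != 0)

similarities = pd.Series(sims, index=matrix.index)
print(similarities)
```

In this toy table, users 1 and 2 have identical ratings, while user 3 shares no book with user 1.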

## 7. Function calculating distances

Considering the code above, can you make a function that will take as input a given user's ID and calculate its distance from all other users in our dataset?

In [61]:
def calculate_cosine_distance(user_id):
    comparison_user_vector = create_ratings_vector(get_user_ratings(user_id, df), ISBNS_array)
    distances = {}
    for user in user_list:
        user_vector = create_ratings_vector(get_user_ratings(user, df), ISBNS_array)
        # Compare against the requested user, not the global user_a_vector
        distances[user] = cosine_distance(comparison_user_vector, user_vector)
    return distances

distances = calculate_cosine_distance(277427)
    

KeyboardInterrupt: 
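Once the scores for all users are available, a natural next step is to rank them. The sketch below assumes the scores are held in a dictionary keyed by user ID. The IDs and values in `scores` are made up for illustration.

```python
# Made-up similarity scores keyed by user ID (illustrative values only)
scores = {101: 1.0, 102: 0.18, 103: 0.0, 104: 0.14}
target_user = 101

# Drop the target user (whose score with themselves is about 1.0),
# then sort by score, highest first
ranked = sorted(
    ((uid, s) for uid, s in scores.items() if uid != target_user),
    key=lambda pair: pair[1],
    reverse=True,
)
top_2 = ranked[:2]
print(top_2)  # [(102, 0.18), (104, 0.14)]
```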