# Cosine Similarity

In this section you will construct another similarity metric, this time based on the cosine.

Remember trigonometry (or better, linear algebra!) from your mathematics class? This metric is based on the cosine of the angle between two vectors: the smaller the angle, the closer the cosine is to 1 and the more similar the vectors are. It might look difficult, but it is rather simple.
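
As a small illustration of the idea, the sketch below computes the cosine of the angle between two made-up 2-D vectors with numpy and then converts it back to an angle. The vectors `a` and `b` and their values are assumptions chosen for the example, not data from the ratings set.

```python
import numpy as np

# Two toy 2-D vectors (made-up values for illustration only).
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# cos(theta) = (a . b) / (|a| * |b|)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Recover the angle itself, in degrees, from its cosine.
theta_deg = np.degrees(np.arccos(cos_theta))

print(cos_theta)
print(theta_deg)
```

With these values the cosine comes out at roughly 0.707, which corresponds to an angle of about 45 degrees.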

## 1. Load the dataset

In [1]:
# code goes here
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')
df_books_ratings


Unnamed: 0,User-ID,ISBN,Book-Rating
0,277203,039480001X,9
1,277427,002542730X,10
2,277427,0061009059,9
3,277427,0316776963,8
4,277427,0345413903,10
...,...,...,...
29982,276680,0375727132,8
29983,276680,0375727345,8
29984,276680,0385504209,8
29985,276680,0440221595,8


## 2. Explore the dimensions (shape) of users' rating data

Here are the IDs of two users in our ratings dataset. What is the shape of each user's ratings? How many books did each of them rate?

In [2]:
user_id_a = 277427
user_id_b = 277203

# code goes here
df = df_books_ratings

a = df[df['User-ID'] == user_id_a]
b = df[df['User-ID'] == user_id_b]

print(a.shape)
print(b.shape)

(17, 3)
(1, 3)


If we were to build vectors directly from these users' ratings and apply trigonometric operations to them, can you see a problem? Do the vectors have the same dimension? If not, why is this a problem?
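
To make the question concrete, here is a minimal sketch of what happens when numpy is asked for the dot product of two rating lists of different lengths. The arrays `ratings_a` and `ratings_b` hold made-up values and are assumptions for illustration only. Note that even with equal lengths there would be a second issue: each position in both vectors has to refer to the same book for the comparison to mean anything.

```python
import numpy as np

# Toy rating lists of different lengths (made-up values, not from the dataset).
ratings_a = np.array([10, 9, 8])   # a user who rated three books
ratings_b = np.array([9])          # a user who rated one book

try:
    np.dot(ratings_a, ratings_b)
    mismatch_detected = False
except ValueError as err:
    # NumPy refuses to take the dot product of 1-D arrays of different lengths.
    mismatch_detected = True
    print(f"dot product failed: {err}")
```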

## 3. Vectorize ratings

Can you vectorize the above users' ratings so that they have the same dimension? To help you do this, here is a sorted list of all the ISBNs in our dataset. How can you use this list to create a (large!) vector for `user_id_a`?

In [3]:
import numpy as np

ISBNS_array = df['ISBN'].unique()
ISBNS_array = np.sort(ISBNS_array).tolist()


## 4. Helper functions

Below are two functions that (1) retrieve user ratings from a given dataset and (2) vectorize these ratings according to a certain dimension (all ISBNs).

In [4]:
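def get_user_ratings(user_id, df_subset):
    """Return a dict mapping ISBN -> rating for the given user."""
    # Keep only the rows that belong to this user
    df_user = df_subset[df_subset['User-ID'] == user_id]

    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))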

In [5]:
def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    """Return one rating per ISBN in all_ISBNS_array, using 0 for unrated books."""
    # dict.get falls back to 0 when the user has not rated that ISBN
    return [user_ISBN_rating_dict.get(isbn, 0) for isbn in all_ISBNS_array]
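
The following sketch shows how the two helpers could be combined on a tiny, made-up ratings table. The names `toy_df` and `toy_isbns` and all values in them are assumptions for illustration; the helpers are repeated in compact form only so that the block runs on its own.

```python
import pandas as pd

# Toy ratings table with the same column names as the dataset (values are made up).
toy_df = pd.DataFrame({
    'User-ID': [1, 1, 2],
    'ISBN': ['A', 'C', 'B'],
    'Book-Rating': [10, 8, 9],
})
toy_isbns = sorted(toy_df['ISBN'].unique())  # shared, sorted list of all ISBNs

def get_user_ratings(user_id, df_subset):
    df_user = df_subset[df_subset['User-ID'] == user_id]
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    return [user_ISBN_rating_dict.get(isbn, 0) for isbn in all_ISBNS_array]

# Both vectors get one entry per ISBN, so they have the same length.
vector_1 = create_ratings_vector(get_user_ratings(1, toy_df), toy_isbns)
vector_2 = create_ratings_vector(get_user_ratings(2, toy_df), toy_isbns)

print(vector_1)
print(vector_2)
```

Both vectors have three entries, one per ISBN, with a 0 wherever a user has not rated the book.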

## 5. Cosine distance function

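Can you finish writing the function below, which calculates the cosine of the angle between two vectors that have the same dimension? Use the numpy `dot` and `norm` functions to translate the given formula into code. Note that, despite its name, `cosine_distance` returns the cosine itself, so a higher value means the two users are more similar.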

In [6]:
from numpy import dot
from numpy.linalg import norm

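def cosine_distance(ratings_vector_user_a, ratings_vector_user_b):
    """Return the cosine of the angle between two equal-length rating vectors."""
    #     a . b   -> dot(a, b)
    #    -------
    #    |a||b|   -> norm(a) * norm(b)
    numerator = dot(ratings_vector_user_a, ratings_vector_user_b)
    denominator = norm(ratings_vector_user_a) * norm(ratings_vector_user_b)

    return numerator / denominator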
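
A few sanity checks can help confirm that the formula behaves as expected. The sketch below uses a hypothetical variant, `cosine_similarity_safe`, which is not part of the notebook: it additionally returns 0.0 when one of the vectors is all zeros, an assumption made here to avoid dividing by zero. All input vectors are made-up values.

```python
from numpy import dot
from numpy.linalg import norm

def cosine_similarity_safe(vector_a, vector_b):
    """Hypothetical variant: return 0.0 when either vector has zero length."""
    denominator = norm(vector_a) * norm(vector_b)
    if denominator == 0:
        return 0.0  # assumption: treat "no ratings" as "no similarity"
    return float(dot(vector_a, vector_b) / denominator)

same_direction = cosine_similarity_safe([10, 0, 8], [5, 0, 4])  # parallel vectors
no_overlap = cosine_similarity_safe([10, 0, 8], [0, 9, 0])      # no book in common
empty_user = cosine_similarity_safe([10, 0, 8], [0, 0, 0])      # all-zero vector

print(same_direction, no_overlap, empty_user)
```

Parallel vectors give a value of about 1, while vectors with no rated book in common give 0. Some libraries define cosine distance as 1 minus this value (SciPy's `scipy.spatial.distance.cosine` does, for example), so it is worth checking which convention is in use before comparing numbers.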
    
    

## 6. Calculate distances

Here is the ID of a user in our dataset (you can of course choose another one!).

Can you calculate this user's cosine distance from all the other users in the dataset?

In [7]:
a_user_id = 277427

all_user_ids = df['User-ID'].unique().tolist()

a_user_ISBN_ratings = get_user_ratings(a_user_id, df)

a_user_ratings_vector = create_ratings_vector(a_user_ISBN_ratings, ISBNS_array)

# print(a_user_ratings_vector)

for u_id in all_user_ids:
    
    if u_id == a_user_id:
        continue
    
    user_ISBN_ratings = get_user_ratings(u_id, df)
    
    user_ratings_vector = create_ratings_vector(user_ISBN_ratings, ISBNS_array)
    
    d = cosine_distance(a_user_ratings_vector, user_ratings_vector)
    
    if d > 0.0:
        print(f'{a_user_id} - {u_id} : d={d}')

277427 - 278026 : d=0.11940052227892682
277427 - 638 : d=0.11654244369375978
277427 - 643 : d=0.16647034525433485
277427 - 882 : d=0.08571345482257746
277427 - 1211 : d=0.13242634874528172
277427 - 2179 : d=0.09256953046211239
277427 - 3556 : d=0.10085874893487139
277427 - 4017 : d=0.04443051815726318
277427 - 5476 : d=0.10044197546222991
277427 - 6251 : d=0.0704143162355698
277427 - 6532 : d=0.09804540026175507
277427 - 6543 : d=0.04494476592262712
277427 - 6575 : d=0.09497624559731205
277427 - 7082 : d=0.09807990046130236
277427 - 7283 : d=0.06298626852192385
277427 - 7286 : d=0.15770264009396068
277427 - 7841 : d=0.1023324542959714
277427 - 7915 : d=0.1177022116929194
277427 - 8019 : d=0.14111021154444514
277427 - 8067 : d=0.06028424722831183
277427 - 8245 : d=0.06500786938340393
277427 - 8454 : d=0.058076749190585225
277427 - 8734 : d=0.08747093011090237
277427 - 9177 : d=0.12394207749905321
277427 - 9856 : d=0.05782372105304827
277427 - 10030 : d=0.09134210115426093
277427 - 10314

## 7. Function calculating distances

Considering the code above, can you write a function that takes a given user's ID as input and calculates that user's distance from all other users in our dataset?

In [None]:
# code goes here
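
One possible way to approach this, offered as a sketch rather than a reference solution, is to wrap the loop from section 6 in a function that returns a dictionary. The function name `calculate_distances`, the toy table `toy_df` and its values are all assumptions for illustration; the helpers are repeated in compact form so the block runs on its own.

```python
import pandas as pd
from numpy import dot
from numpy.linalg import norm

def get_user_ratings(user_id, df_subset):
    df_user = df_subset[df_subset['User-ID'] == user_id]
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    return [user_ISBN_rating_dict.get(isbn, 0) for isbn in all_ISBNS_array]

def cosine_distance(vector_a, vector_b):
    return dot(vector_a, vector_b) / (norm(vector_a) * norm(vector_b))

def calculate_distances(user_id, df_ratings):
    """Hypothetical helper: map each other user ID to its cosine value with user_id."""
    all_isbns = sorted(df_ratings['ISBN'].unique())
    target_vector = create_ratings_vector(get_user_ratings(user_id, df_ratings), all_isbns)

    distances = {}
    for other_id in df_ratings['User-ID'].unique().tolist():
        if other_id == user_id:
            continue  # skip the user itself
        other_vector = create_ratings_vector(get_user_ratings(other_id, df_ratings), all_isbns)
        d = float(cosine_distance(target_vector, other_vector))
        if d > 0.0:
            distances[other_id] = d  # keep only users with at least one shared book
    return distances

# Toy ratings table (made-up values): user 2 shares book 'A' with user 1, user 3 shares none.
toy_df = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'ISBN': ['A', 'C', 'A', 'B'],
    'Book-Rating': [10, 8, 9, 7],
})
toy_distances = calculate_distances(1, toy_df)
print(toy_distances)
```

Returning a dictionary instead of printing makes it easy to sort the result afterwards, for example to find the users with the highest values.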