# Jaccard Similarity 

In this section you will construct a similarity metric based on the Jaccard similarity coefficient. 

Remember sets from your mathematics class? The coefficient is simple because it is built from two set operations: intersection and union. For two sets A and B, it is the size of their intersection divided by the size of their union.
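As a small worked example of the two set operations, the sketch below computes the coefficient by hand. The names `readers_a` and `readers_b` and the IDs in them are made up for illustration and do not come from the dataset.

```python
# Hypothetical user IDs, made up for illustration only.
readers_a = {1, 2, 3, 4}
readers_b = {3, 4, 5}

intersection = readers_a & readers_b  # users present in both sets
union = readers_a | readers_b         # users present in at least one set

# Jaccard coefficient: size of the intersection over size of the union.
jaccard = len(intersection) / len(union)
print(intersection, union, jaccard)
```

Here two users appear in both sets and five distinct users appear overall, so the coefficient is 2/5.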

## 1. Load the dataset

In [2]:
# code goes here
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')

## 2. Make a set for each book entry

Now we need to construct a set for each book in our dataset. This set will be composed of User-IDs who rated the book. 

From our data frame, can you construct a Python dictionary with ISBNs as keys and a list of User-IDs as values?

In [3]:
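# Group the ratings by ISBN and collect the User-IDs that rated each book
df = df_books_ratings

# .to_dict() turns the grouped Series into a plain {ISBN: [User-ID, ...]} dict
dict_isbn_groups = df.groupby('ISBN')['User-ID'].agg(list).to_dict()

print(len(dict_isbn_groups))

dict_isbn_groups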

831


{'002542730X': [277427,
  8019,
  10030,
  11676,
  12538,
  16996,
  41084,
  52584,
  71712,
  80538,
  86243,
  89551,
  104058,
  105108,
  110934,
  113270,
  113752,
  119725,
  128835,
  150979,
  152946,
  164096,
  171602,
  173291,
  174216,
  179734,
  183995,
  202963,
  208671,
  209516,
  213760,
  225763,
  229741,
  234597,
  235498,
  243720,
  244277,
  259264,
  262902,
  269566],
 '006000438X': [4017,
  6242,
  6575,
  8454,
  10560,
  11676,
  17003,
  89602,
  104113,
  115435,
  124747,
  143011,
  143715,
  147678,
  148258,
  149908,
  201042,
  203044,
  224349,
  259626,
  263877],
 '0060096195': [7125,
  7346,
  8067,
  13552,
  15819,
  25409,
  28204,
  35320,
  51350,
  71490,
  87938,
  93426,
  95173,
  95359,
  125692,
  127359,
  138198,
  155141,
  164027,
  181687,
  228868,
  233711,
  246507,
  249407,
  251422,
  269738],
 '006016848X': [11676,
  13850,
  15049,
  16634,
  31556,
  32802,
  37974,
  39281,
  59150,
  86835,
  96440,
  100088,
  1

## 3. Jaccard distance function

Here is the `jaccard_distance` function we provide for the exercise. Despite its name, it returns the Jaccard similarity coefficient of two books, based on who rated them: the more raters two books share, the higher the value and the closer the books are. The Jaccard distance in the strict sense is one minus this value.

Take a closer look at the function. It uses Python sets and expects two lists of User-IDs.

In [4]:
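def jaccard_distance(user_ids_isbn_a, user_ids_isbn_b):
    """Return the share of raters the two books have in common."""
    # Convert to sets so each User-ID is counted once
    set_isbn_a = set(user_ids_isbn_a)
    set_isbn_b = set(user_ids_isbn_b)

    union = set_isbn_a.union(set_isbn_b)
    intersection = set_isbn_a.intersection(set_isbn_b)

    # Two empty inputs share nothing; avoid dividing by zero
    if not union:
        return 0.0

    return len(intersection) / float(len(union))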
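A quick sanity check on toy input can make the function's behaviour concrete. The sketch below re-declares a compact stand-in, under the hypothetical name `jaccard_coefficient`, so that it runs on its own. The user-ID lists are invented, and the last line shows how a distance in the conventional sense could be derived from the coefficient.

```python
def jaccard_coefficient(ids_a, ids_b):
    """Compact stand-in: size of the intersection over size of the union."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b)

# Hypothetical user-ID lists, not taken from the dataset.
book_x = [10, 20, 30]
book_y = [20, 30, 40, 50]
book_z = [60, 70]

same = jaccard_coefficient(book_x, book_x)     # identical raters
partial = jaccard_coefficient(book_x, book_y)  # 2 shared out of 5 distinct raters
none = jaccard_coefficient(book_x, book_z)     # no shared raters
print(same, partial, none)

# Conventional Jaccard distance: one minus the coefficient.
distance_xy = 1 - partial
print(distance_xy)
```

Identical rater lists score 1.0 and disjoint lists score 0.0, so larger values mean more similar books.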

## 4. Calculate distances 

Here is the ISBN of a book in our dataset (you can of course choose another one!). 

Can you use `jaccard_distance` to compare this book with every other book in the dataset?

In [5]:
a_book_isbn = '002542730X'

# Compare the chosen book with every other book and print the non-zero scores
users_a = dict_isbn_groups[a_book_isbn]
for isbn, users in dict_isbn_groups.items():
    if isbn != a_book_isbn:
        d = jaccard_distance(users_a, users)
        if d > 0.0:
            print(a_book_isbn + ' - ' + isbn + ' : d=' + str(d))

002542730X - 006000438X : d=0.016666666666666666
002542730X - 006016848X : d=0.014925373134328358
002542730X - 0060199652 : d=0.016666666666666666
002542730X - 0060248025 : d=0.016666666666666666
002542730X - 0060391626 : d=0.013888888888888888
002542730X - 0060392452 : d=0.026785714285714284
002542730X - 0060502258 : d=0.017699115044247787
002542730X - 0060915544 : d=0.03225806451612903
002542730X - 0060916508 : d=0.014084507042253521
002542730X - 0060922532 : d=0.016666666666666666
002542730X - 0060928336 : d=0.005988023952095809
002542730X - 0060929790 : d=0.015873015873015872
002542730X - 006092988X : d=0.01694915254237288
002542730X - 0060930535 : d=0.017543859649122806
002542730X - 0060934417 : d=0.028846153846153848
002542730X - 0060938455 : d=0.00909090909090909
002542730X - 0060958022 : d=0.057971014492753624
002542730X - 0060964049 : d=0.01694915254237288
002542730X - 0060976845 : d=0.01652892561983471
002542730X - 0060987103 : d=0.0423728813559322
002542730X - 0060987529 : d

## 5. Function calculating distances 

Considering the code above, can you make a function that will take as input a given book's ISBN and calculate its distance from all other books in our dataset? 

In [None]:
# code goes here
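One possible shape for such a function is sketched below. It is only an illustration, not the reference solution: the name `jaccard_scores_for`, its signature, and the toy dictionary are assumptions, and the score is computed inline so the block runs on its own. It assumes the second argument maps each ISBN to an iterable of User-IDs, like the dictionary built in step 2.

```python
def jaccard_scores_for(isbn, isbn_to_users):
    """Return (other_isbn, score) pairs for every other book, best match first.

    `isbn_to_users` is assumed to map each ISBN to an iterable of User-IDs.
    """
    target = set(isbn_to_users[isbn])
    scores = []
    for other_isbn, users in isbn_to_users.items():
        if other_isbn == isbn:
            continue  # skip the book itself
        other = set(users)
        union = target | other
        # Guard against two empty rater lists to avoid dividing by zero.
        score = len(target & other) / len(union) if union else 0.0
        scores.append((other_isbn, score))
    # Highest overlap first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data with made-up ISBN labels and user IDs.
toy = {
    'A': [1, 2, 3],
    'B': [2, 3, 4],
    'C': [9],
}
print(jaccard_scores_for('A', toy))  # [('B', 0.5), ('C', 0.0)]
```

Returning a sorted list rather than printing inside the loop makes it easy to take the top few matches afterwards, for example with a slice such as `[:5]`.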