# Jaccard Similarity 

In this section you will construct a similarity metric based on the Jaccard similarity coefficient. 

Remember sets from your mathematics class? The coefficient is rather simple, as it is based on two set operations: intersection and union.
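To make the two set operations concrete, here is a minimal sketch on two made-up sets of user IDs. The values and the variable names `readers_a` and `readers_b` are illustrative assumptions, not taken from the dataset. It divides the size of the intersection by the size of the union, which is the Jaccard coefficient $J(A, B) = |A \cap B| / |A \cup B|$.

```python
# Two made-up sets of user IDs (illustrative values only)
readers_a = {1, 2, 3, 4}
readers_b = {3, 4, 5}

intersection = readers_a & readers_b  # users present in both sets: {3, 4}
union = readers_a | readers_b         # users present in either set: {1, 2, 3, 4, 5}

# Jaccard coefficient: shared users divided by all distinct users
coefficient = len(intersection) / len(union)
print(coefficient)  # 2 / 5 = 0.4
```

A coefficient of 1 would mean the two sets are identical, and 0 would mean they have no element in common.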

## 1. Load the dataset

In [2]:
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';', encoding='latin-1')

## 2. Make a set for each book entry

Now we need to construct a set for each book in our dataset. This set will be composed of the User-IDs of the users who rated the book.

From our data frame, can you construct a Python dictionary containing ISBNs as keys and a list of User-IDs as values?

In [3]:
df = df_books_ratings

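# Group the ratings by ISBN and collect each book's User-IDs into a list
isbn_groups = df.groupby('ISBN')['User-ID'].aggregate(list)
print(isbn_groups)

# to_dict() returns a new dict rather than converting in place, so keep the result
dict_isbn_groups = isbn_groups.to_dict()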

ISBN
002542730X    [277427, 8019, 10030, 11676, 12538, 16996, 410...
006000438X    [4017, 6242, 6575, 8454, 10560, 11676, 17003, ...
0060096195    [7125, 7346, 8067, 13552, 15819, 25409, 28204,...
006016848X    [11676, 13850, 15049, 16634, 31556, 32802, 379...
0060173289    [278843, 2313, 6772, 8362, 29526, 76483, 78448...
                                    ...                        
1573229571    [11676, 21014, 23699, 37712, 53614, 67270, 730...
1573229725    [6575, 26621, 27647, 43910, 73330, 76626, 8862...
1576737330    [12154, 14521, 15135, 44852, 55187, 60244, 642...
1592400876    [11676, 21576, 26583, 38464, 43246, 49889, 499...
1878424319    [10118, 14638, 36907, 39281, 46398, 62862, 977...
Name: User-ID, Length: 831, dtype: object
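To show the shape of the structure this step is after, here is a pandas-free sketch of the same grouping on a handful of made-up rows. The tuples in `toy_ratings`, the placeholder ISBNs, and the name `toy_isbn_groups` are assumptions for illustration only; the real data comes from the CSV file loaded earlier.

```python
from collections import defaultdict

# Made-up (User-ID, ISBN) pairs standing in for rows of the ratings file
toy_ratings = [
    (101, 'book-a'),
    (102, 'book-a'),
    (101, 'book-b'),
    (103, 'book-b'),
    (104, 'book-b'),
]

# Map each ISBN to the list of users who rated it
toy_isbn_groups = defaultdict(list)
for user_id, isbn in toy_ratings:
    toy_isbn_groups[isbn].append(user_id)

print(dict(toy_isbn_groups))
# {'book-a': [101, 102], 'book-b': [101, 103, 104]}
```

Each key is a book and each value lists the users who rated it, which is the input the distance function expects.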


## 3. Jaccard distance function

Here is the `jaccard_distance` function we provide for the exercise. Despite its name, it returns the Jaccard similarity coefficient of two books (the Jaccard distance would be 1 minus this value): the more users two books have in common, relative to all the users who rated either of them, the higher the score and the closer the books.

Please have a closer look at the function. As you can see, it uses Python sets and expects two lists of User-IDs.

In [4]:
def jaccard_distance(user_ids_isbn_a, user_ids_isbn_b):
    set_isbn_a = set(user_ids_isbn_a)
    set_isbn_b = set(user_ids_isbn_b)

    union = set_isbn_a.union(set_isbn_b)
    intersection = set_isbn_a.intersection(set_isbn_b)
        
    return len(intersection) / float(len(union))
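The ratio returned above is a similarity: it grows as two books share more raters. The sketch below uses two hypothetical helpers, `jaccard_similarity` and `jaccard_dissimilarity`, which are not part of the exercise code. It shows how the complementary distance (1 minus the ratio) could be derived, and how the empty-input case, where the union has size 0, might be guarded. Returning 0.0 for two empty inputs is an assumption made here, not a convention stated in the exercise.

```python
def jaccard_similarity(users_a, users_b):
    """Hypothetical helper: |A ∩ B| / |A ∪ B|, or 0.0 when both inputs are empty."""
    set_a, set_b = set(users_a), set(users_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets are treated as having no similarity
    return len(set_a & set_b) / len(union)


def jaccard_dissimilarity(users_a, users_b):
    """Hypothetical helper: the complement of the similarity, 1 - similarity."""
    return 1.0 - jaccard_similarity(users_a, users_b)


print(jaccard_similarity([1, 2, 3, 4], [3, 4, 5]))     # 0.4
print(jaccard_dissimilarity([1, 2, 3, 4], [3, 4, 5]))  # 0.6
print(jaccard_similarity([], []))                      # 0.0
```

With the similarity form, higher values mean closer books; with the complementary form, lower values do.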

## 4. Calculate distances 

Here is the ISBN of a book in our dataset (you can of course choose another one!). 

Can you use `jaccard_distance` to compare this book with all the other books in the dataset?

In [5]:
a_book_isbn = '002542730X'

for isbn, users in dict_isbn_groups.items():
    if isbn != a_book_isbn:
        d = jaccard_distance(dict_isbn_groups[a_book_isbn], users)
        if d > 0.0:
            print(a_book_isbn + ' - ' + isbn + ' : d=' + str(d))

002542730X - 006000438X : d=0.016666666666666666
002542730X - 006016848X : d=0.014925373134328358
002542730X - 0060199652 : d=0.016666666666666666
002542730X - 0060248025 : d=0.016666666666666666
002542730X - 0060391626 : d=0.013888888888888888
002542730X - 0060392452 : d=0.026785714285714284
002542730X - 0060502258 : d=0.017699115044247787
002542730X - 0060915544 : d=0.03225806451612903
002542730X - 0060916508 : d=0.014084507042253521
002542730X - 0060922532 : d=0.016666666666666666
002542730X - 0060928336 : d=0.005988023952095809
002542730X - 0060929790 : d=0.015873015873015872
002542730X - 006092988X : d=0.01694915254237288
002542730X - 0060930535 : d=0.017543859649122806
002542730X - 0060934417 : d=0.028846153846153848
002542730X - 0060938455 : d=0.00909090909090909
002542730X - 0060958022 : d=0.057971014492753624
002542730X - 0060964049 : d=0.01694915254237288
002542730X - 0060976845 : d=0.01652892561983471
002542730X - 0060987103 : d=0.0423728813559322
002542730X - 0060987529 : d

## 5. Function calculating distances 

Considering the code above, can you write a function that takes a given book's ISBN as input and calculates its `jaccard_distance` score against all the other books in our dataset?

In [6]:
def calculate_jaccard_distances(book_isbn):
    # first check whether the book is in our data
    if book_isbn in dict_isbn_groups:
        # fetch the list of users who rated this book
        book_isbn_users = dict_isbn_groups[book_isbn]
        # create a working dict where we record the scores
        dict_distances = {}
        # iterate over every other book and its users
        for isbn, users in dict_isbn_groups.items():
            if isbn != book_isbn:
                # compare this book's users with our book's users
                d = jaccard_distance(book_isbn_users, users)
                # a score of 0 means the two books share no users, so skip it
                if d == 0:
                    continue
                # record the score in our working dict
                dict_distances[isbn] = d
        # return the results as a dict ordered from the highest score to the lowest
        ordered_dict_distances = dict(sorted(dict_distances.items(), key=lambda item: item[1], reverse=True))
        return ordered_dict_distances
    else:
        return None

# try it out with 'a_book_isbn'
calculate_jaccard_distances(a_book_isbn)

{'0894808249': 0.06557377049180328,
 '0345438329': 0.06493506493506493,
 '0312274920': 0.06451612903225806,
 '0060958022': 0.057971014492753624,
 '0064407675': 0.05555555555555555,
 '055321313X': 0.05405405405405406,
 '080411109X': 0.05333333333333334,
 '0099747200': 0.05263157894736842,
 '0446527785': 0.05263157894736842,
 '0525946829': 0.05263157894736842,
 '0786889020': 0.05263157894736842,
 '0064400557': 0.05194805194805195,
 '0375504613': 0.05172413793103448,
 '0375756981': 0.05172413793103448,
 '0142001430': 0.05063291139240506,
 '0679745203': 0.05063291139240506,
 '1573229326': 0.05063291139240506,
 '0064407667': 0.05,
 '0064407683': 0.05,
 '0316666009': 0.05,
 '067102423X': 0.05,
 '0804108749': 0.05,
 '0671867172': 0.04918032786885246,
 '0894805770': 0.04918032786885246,
 '0446602620': 0.04838709677419355,
 '0312983824': 0.047619047619047616,
 '0385492081': 0.047058823529411764,
 '0399501487': 0.04672897196261682,
 '0440224675': 0.046511627906976744,
 '0316096199': 0.0462962962