# Jaccard Similarity 

In this section you will construct a similarity metric based on the Jaccard similarity coefficient. 

Remember sets from your mathematics class? The coefficient is simple because it is based on two set operations: intersection and union.
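As a quick illustration, the coefficient for two sets is the size of their intersection divided by the size of their union. The sketch below uses two made-up sets of user IDs (the names `readers_a` and `readers_b` and their values are illustrative, not taken from the dataset).

```python
# Two hypothetical sets of user IDs (illustrative values, not from the dataset)
readers_a = {1, 2, 3}
readers_b = {2, 3, 4}

intersection = readers_a & readers_b   # users in both sets: {2, 3}
union = readers_a | readers_b          # users in either set: {1, 2, 3, 4}

coefficient = len(intersection) / len(union)
print(coefficient)  # 2 / 4 = 0.5
```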

## 1. Load the dataset

In [5]:
import pandas as pd
df = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')


## 2. Make a set for each book entry

Now we need to construct a set for each book in our dataset. Each set will contain the User-IDs of the users who rated that book.

From our data frame, can you construct a Python dictionary with ISBNs as keys and a list of User-IDs as values?

In [21]:
grouped_df = df.groupby('ISBN')['User-ID'].apply(list)
users_by_book = dict(grouped_df)
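To see what shape this dictionary has, here is a plain-Python sketch of the same grouping on a few made-up rating pairs. The names `toy_ratings` and `toy_users_by_book` are hypothetical and the values are illustrative only. The sketch mirrors the idea of the `groupby` step rather than reproducing how pandas implements it.

```python
from collections import defaultdict

# Hypothetical (User-ID, ISBN) rating pairs, illustrative only
toy_ratings = [(1, 'A'), (2, 'A'), (2, 'B'), (3, 'B')]

# Collect the User-IDs seen for each ISBN, one list per book
toy_users_by_book = defaultdict(list)
for user_id, isbn in toy_ratings:
    toy_users_by_book[isbn].append(user_id)

print(dict(toy_users_by_book))  # {'A': [1, 2], 'B': [2, 3]}
```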

## 3. Jaccard distance function

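Here is the `jaccard_distance` function we provide for the exercise. It compares two books by the users who rated them: the more raters the two books share, relative to everyone who rated either book, the closer the books are. Note that, despite its name, the function returns the Jaccard similarity coefficient (intersection size divided by union size), so higher values mean more similar books. The Jaccard distance in the strict sense is 1 minus this value.

Please have a closer look at the function. It expects two lists of User-IDs and converts them to Python sets.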

In [None]:
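def jaccard_distance(user_ids_isbn_a, user_ids_isbn_b):
    """Return |A ∩ B| / |A ∪ B| for the raters of two books."""
    set_isbn_a = set(user_ids_isbn_a)
    set_isbn_b = set(user_ids_isbn_b)

    # Users who rated either book, and users who rated both
    union = set_isbn_a.union(set_isbn_b)
    intersection = set_isbn_a.intersection(set_isbn_b)

    # In Python 3, `/` is already true division, so no float() cast is needed
    return len(intersection) / len(union)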
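
A few boundary cases help to check the behaviour: identical rater lists should score 1.0 and lists with no common rater should score 0.0. The sketch below uses a hypothetical stand-alone helper, `overlap_ratio`, so that it runs on its own. The guard for two empty inputs is an assumption added here for illustration. It is not part of the function above, which would raise a `ZeroDivisionError` in that case.

```python
def overlap_ratio(ids_a, ids_b):
    """Hypothetical helper: intersection size over union size."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    if not union:  # assumption: treat two empty inputs as having no overlap
        return 0.0
    return len(set_a & set_b) / len(union)

print(overlap_ratio([1, 2], [1, 2]))        # 1.0, identical raters
print(overlap_ratio([1, 2], [3, 4]))        # 0.0, no rater in common
print(overlap_ratio([1, 2, 3], [2, 3, 4]))  # 0.5, two shared out of four
print(overlap_ratio([], []))                # 0.0, by the assumption above
```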
    

## 4. Calculate distances 

Here is the ISBN of a book in our dataset (you can of course choose another one!). 

Can you calculate this book's Jaccard distance from all the other books in the dataset?

In [None]:
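a_book_isbn = '002542730X'
a_users = users_by_book[a_book_isbn]

# Compare the chosen book with every other book (skip the book itself)
distances = [
    jaccard_distance(a_users, b_users)
    for isbn, b_users in users_by_book.items()
    if isbn != a_book_isbn
]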

distances
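
A bare list of scores does not say which book each score belongs to. One possible way to keep that link is to store the scores in a dictionary keyed by ISBN and sort it, as in the sketch below. The toy data, the helper `overlap_score`, and the variable names are all hypothetical and only stand in for `users_by_book` and `jaccard_distance`.

```python
# Hypothetical toy data standing in for users_by_book (illustrative values)
toy_users_by_book = {
    'A': [1, 2, 3],
    'B': [2, 3, 4],
    'C': [5, 6],
    'D': [1, 2, 3, 4],
}

def overlap_score(ids_a, ids_b):
    """Hypothetical stand-in: intersection size over union size."""
    set_a, set_b = set(ids_a), set(ids_b)
    return len(set_a & set_b) / len(set_a | set_b)

target = 'A'

# Score every other book against the target, keyed by ISBN
scores = {
    isbn: overlap_score(toy_users_by_book[target], users)
    for isbn, users in toy_users_by_book.items()
    if isbn != target
}

# Highest score first, i.e. the books sharing the most raters with the target
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('D', 0.75), ('B', 0.5), ('C', 0.0)]
```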

## 5. Function calculating distances 

Considering the code above, can you make a function that will take as input a given book's ISBN and calculate its distance from all other books in our dataset? 

In [40]:
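from typing import Dict, List

def calculate_jaccard_distances(isbn: str, dataset: Dict[str, List[str]]) -> List[float]:
    # Look up the book's raters without removing the book from the dataset
    a_users = dataset[isbn]
    # Compare against every other book in the dataset that was passed in
    return [jaccard_distance(a_users, b_users)
            for other_isbn, b_users in dataset.items()
            if other_isbn != isbn]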

calculate_jaccard_distances('002542730X', users_by_book)

[0.016666666666666666,
 0.0,
 0.014925373134328358,
 0.0,
 0.016666666666666666,
 0.016666666666666666,
 0.0136986301369863,
 0.026785714285714284,
 0.017699115044247787,
 0.03225806451612903,
 0.014084507042253521,
 0.0,
 0.016666666666666666,
 0.005988023952095809,
 0.015873015873015872,
 0.0,
 0.01694915254237288,
 0.0,
 0.017391304347826087,
 0.028846153846153848,
 0.00909090909090909,
 0.057971014492753624,
 0.0,
 0.01694915254237288,
 0.01652892561983471,
 0.0,
 0.0423728813559322,
 0.021739130434782608,
 0.033707865168539325,
 0.014925373134328358,
 0.018867924528301886,
 0.011904761904761904,
 0.014705882352941176,
 0.0,
 0.0,
 0.0,
 0.011764705882352941,
 0.05194805194805195,
 0.05,
 0.05555555555555555,
 0.05,
 0.034482758620689655,
 0.014084507042253521,
 0.01639344262295082,
 0.03389830508474576,
 0.029411764705882353,
 0.02564102564102564,
 0.01098901098901099,
 0.05263157894736842,
 0.0,
 0.03389830508474576,
 0.034482758620689655,
 0.014705882352941176,
 0.01612903225806