# Jaccard Similarity 

In this section you will construct a similarity metric based on the Jaccard similarity coefficient. 

Remember sets from your mathematics class? The coefficient is simple because it is built from two set operations: intersection and union.
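As a quick refresher, the coefficient of two sets is the size of their intersection divided by the size of their union. A minimal sketch with two made-up sets of user IDs (the values are illustrative and not taken from the dataset):

```python
# Two hypothetical sets of user IDs (illustrative values only)
raters_a = {1, 2, 3}
raters_b = {2, 3, 4}

intersection = raters_a & raters_b   # users in both sets: {2, 3}
union = raters_a | raters_b          # users in either set: {1, 2, 3, 4}

jaccard = len(intersection) / len(union)
print(jaccard)  # 0.5
```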

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict

## 1. Load the dataset

In [2]:
df = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';', encoding='latin-1')
df

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276727,0446520802,0
2,276744,038550120X,7
3,276746,0425115801,0
4,276746,0449006522,0
...,...,...,...
393949,276704,0446605409,0
393950,276704,0743211383,7
393951,276704,080410526X,0
393952,276706,0679447156,0


## 2. Make a set for each book entry

Now we need to construct a set for each book in our dataset. Each set will contain the User-IDs of everyone who rated that book.

From our data frame, can you construct a Python dictionary with ISBNs as keys and a list of User-IDs as values?
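One possible way to do this grouping, sketched here on a handful of made-up rows rather than the real data frame, is `collections.defaultdict`. The `rows` list and the name `toy_book_set` are assumptions for illustration only:

```python
from collections import defaultdict

# Hypothetical (User-ID, ISBN) pairs standing in for rows of the data frame
rows = [
    (10, "A"),
    (11, "A"),
    (10, "B"),
    (12, "B"),
    (13, "C"),
]

# defaultdict(list) creates an empty list the first time a key is seen,
# so every rating, including the first one for a book, gets appended
toy_book_set = defaultdict(list)
for user_id, isbn in rows:
    toy_book_set[isbn].append(user_id)

print(dict(toy_book_set))  # {'A': [10, 11], 'B': [10, 12], 'C': [13]}
```

With the real data frame, the same loop could iterate over `zip(df["User-ID"], df["ISBN"])`, which is usually much faster than `iterrows()`.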

In [3]:
book_set = {}

for index, row in df.iterrows():
    ISBN = row["ISBN"]
    user_review = row["User-ID"]
    if ISBN in book_set:
        book_set[ISBN].append(user_review)
    else:
        # First rating for this book: start the list with this user
        # (an empty list here would silently drop the first rater)
        book_set[ISBN] = [user_review]

In [4]:
book_set

{'034545104X': [276725,
  2313,
  6543,
  8680,
  10314,
  23768,
  28266,
  28523,
  39002,
  50403,
  56157,
  59102,
  59287,
  63970,
  77480,
  77940,
  81977,
  94362,
  98391,
  112199,
  115435,
  123981,
  128045,
  128276,
  129311,
  132081,
  133235,
  135045,
  138543,
  144315,
  144486,
  145451,
  147799,
  158506,
  163202,
  168816,
  171912,
  173743,
  174834,
  188513,
  195374,
  208019,
  208406,
  210597,
  214011,
  220502,
  227106,
  227447,
  227520,
  227886,
  240144,
  243142,
  244420,
  255502,
  257492,
  260897,
  263003,
  268191,
  271448,
  274925],
 '0446520802': [276727,
  278418,
  638,
  3363,
  7158,
  8253,
  9939,
  11676,
  12589,
  13279,
  16046,
  19371,
  23768,
  24878,
  26525,
  27617,
  28204,
  29855,
  30261,
  30711,
  32440,
  33974,
  35845,
  37556,
  39396,
  41237,
  43146,
  44925,
  54823,
  63507,
  64544,
  64679,
  79977,
  85993,
  91761,
  98153,
  99085,
  99720,
  102967,
  103190,
  105979,
  113340,
  115473,
  115490,
  116577,
  13

## 3. Jaccard distance function

Here is the `jaccard_distance` function we provide for the exercise. Despite its name, it returns the Jaccard similarity coefficient (intersection over union) of two books based on who rated them, so higher values mean closer books: the more raters two books have in common, the higher the score.

Please have a closer look at the function. As you can see, it uses Python sets and expects two lists of User-IDs.

In [5]:
def jaccard_distance(user_ids_isbn_a, user_ids_isbn_b):
    # Work on sets so duplicate User-IDs are counted once
    set_a, set_b = set(user_ids_isbn_a), set(user_ids_isbn_b)
    union = set_a | set_b
    if not union:  # no raters on either side: avoid dividing by zero
        return 0.0
    return len(set_a & set_b) / len(union)

listx = [1,2,3,4]
listy = [1,2,3,4]

jaccard_distance(listx,listy)

1.0
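If a distance in the strict sense is needed, where 0 means identical and 1 means nothing in common, it is the complement of the intersection-over-union ratio computed above. A minimal sketch, using hypothetical function names that are not part of the exercise:

```python
def jaccard_similarity(ids_a, ids_b):
    """Share of users who rated both books among users who rated either."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    if not union:          # neither book has any raters
        return 0.0
    return len(set_a & set_b) / len(union)


def jaccard_dissimilarity(ids_a, ids_b):
    """Jaccard distance in the usual sense: 1 minus the similarity."""
    return 1.0 - jaccard_similarity(ids_a, ids_b)


print(jaccard_similarity([1, 2, 3, 4], [1, 2, 3, 4]))     # 1.0
print(jaccard_dissimilarity([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0
print(jaccard_similarity([1, 2, 3], [2, 3, 4]))           # 0.5
```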

## 4. Calculate distances 

Here is the ISBN of a book in our dataset (you can of course choose another one!). 

Can you calculate this book's Jaccard distance from every other book in the dataset?

In [51]:
a_book_isbn = '002542730X'
a_book_similarities = {}

# Compare the lists of User-IDs, not the ISBN strings themselves:
# passing the strings would compare the characters of the two ISBNs
a_book_reviews = book_set[a_book_isbn]

for book, reviews in book_set.items():
    jd = jaccard_distance(a_book_reviews, reviews)
    a_book_similarities[book] = jd

a_book_similarities

## 5. Function calculating distances 

Building on the code above, can you write a function that takes a book's ISBN as input and calculates its distance from every other book in our dataset?
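As a rough illustration of the shape such a function could take, here is a self-contained sketch on a made-up dictionary. The names `jaccard`, `rank_similar_books`, `toy_book_set` and the `top_n` parameter are assumptions for this example, and sorting the scores is an optional extra rather than part of the exercise:

```python
def jaccard(ids_a, ids_b):
    """Intersection over union of two collections of user IDs."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_similar_books(isbn, book_sets, top_n=3):
    """Score `isbn` against every other book and return the best matches."""
    target = book_sets[isbn]
    scores = {
        other: jaccard(target, raters)
        for other, raters in book_sets.items()
        if other != isbn   # skip the book itself
    }
    # Highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical ISBN -> user IDs mapping (illustrative values only)
toy_book_set = {
    "A": [1, 2, 3],
    "B": [2, 3, 4],
    "C": [3],
    "D": [7, 8],
}

print(rank_similar_books("A", toy_book_set))
```

Here book "B" shares two of four distinct raters with "A" and ranks first, while "D" shares none and ranks last.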

In [54]:
def calculate_jaccard(input_book):
    # Look up the raters of the input book once, then compare them
    # with the raters of every book in the dataset
    input_book_reviews = book_set[input_book]
    input_book_similarities = {}
    for book, reviews in book_set.items():
        jd = jaccard_distance(input_book_reviews, reviews)
        input_book_similarities[book] = jd
    return input_book_similarities

calculate_jaccard("0553802542")