# Jaccard Index

- The __Jaccard index__, also known as __Intersection over Union__ (IoU) or the __Jaccard similarity coefficient__ (originally given the French name *coefficient de communauté* by Paul Jaccard), is a statistic used for gauging the __similarity__ and __diversity__ of sample sets.

- The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

$$ J(A, B) = \frac {|A \cap B|} {|A \cup B|} = \frac {|A \cap B|} {|A| + |B| - |A \cap B|} $$
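
The definition can be checked directly on Python sets. This is a minimal sketch: the helper name `jaccard` and the example sets are made up for illustration, and it assumes at least one of the two sets is non-empty.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Illustrative sets (not from the notebook data)
A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "durian", "elderberry"}

score = jaccard(A, B)

# The second form of the formula: |A| + |B| - |A ∩ B| equals |A ∪ B|
union_size = len(A) + len(B) - len(A & B)

print(score)       # 0.4
print(union_size)  # 5
```

Both forms of the denominator agree because elements shared by `A` and `B` would otherwise be counted twice.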

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Suppose we have the actual labels (y) and the labels predicted by a model (yp)
y = np.array([1, 1, 1, 0, 1])
yp = np.array([0, 1, 1, 1, 1])

<hr>

### Jaccard Manual calculation

$\displaystyle J(y, yp) = \frac {|y \cap yp|} {|y \cup yp|} = \frac {|y \cap yp|} {|y| + |yp| - |y \cap yp|} $

- $\displaystyle |y \cap yp| = $ number of positions where both y and yp are 1 (here 3)

- $\displaystyle |y \cup yp| = $ number of positions where y or yp is 1 (here 5); positions where both are 0 are not counted

In [3]:
j = 3 / 5
j

0.6
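
The count behind `3 / 5` can be reproduced in plain Python. This is a rough sketch, with the helper names `binary_jaccard` and `accuracy` and the second pair of vectors invented for illustration. It also shows why counting matches is not the same thing: for the notebook's data the two numbers coincide only because there is no position where both vectors are 0.

```python
def binary_jaccard(y_true, y_pred):
    """Positions where both are 1, divided by positions where either is 1."""
    pairs = list(zip(y_true, y_pred))
    both = sum(1 for t, p in pairs if t == 1 and p == 1)
    either = sum(1 for t, p in pairs if t == 1 or p == 1)
    return both / either


def accuracy(y_true, y_pred):
    """Fraction of positions where the two vectors agree (0s included)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Same values as y and yp in the notebook
y_list = [1, 1, 1, 0, 1]
yp_list = [0, 1, 1, 1, 1]
print(binary_jaccard(y_list, yp_list))  # 0.6
print(accuracy(y_list, yp_list))        # 0.6

# Illustrative vectors with shared zeros: the two metrics now differ
y2 = [1, 0, 0, 0, 1]
yp2 = [1, 0, 0, 1, 0]
print(binary_jaccard(y2, yp2))  # 0.3333333333333333
print(accuracy(y2, yp2))        # 0.6
```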

<hr>

### Jaccard Index Using Sklearn

- `jaccard_score` may be a poor metric if there are no positives for some samples or classes. The Jaccard index is undefined if there are no true or predicted labels, and scikit-learn's implementation will return a score of 0 with a warning in that case.
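
The undefined case is a division by zero: with no true and no predicted positives, both the intersection and the union are empty. The sketch below imitates the behaviour described above in plain Python. The helper `jaccard_or_zero` is hypothetical and is not scikit-learn's code; it only illustrates returning 0 with a warning.

```python
import warnings


def jaccard_or_zero(y_true, y_pred):
    """Binary Jaccard that returns 0.0 with a warning when the union is empty."""
    pairs = list(zip(y_true, y_pred))
    both = sum(1 for t, p in pairs if t == 1 and p == 1)
    either = sum(1 for t, p in pairs if t == 1 or p == 1)
    if either == 0:
        warnings.warn("Jaccard is undefined: no true or predicted positives.")
        return 0.0
    return both / either


# Record warnings so the example can inspect them instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty_score = jaccard_or_zero([0, 0, 0], [0, 0, 0])

print(empty_score)  # 0.0
print(len(caught))  # 1
```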

In [4]:
from sklearn.metrics import jaccard_score

jaccard_score(y, yp)

0.6

In [7]:
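# Note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23,
# so this cell only runs on older versions. For binary and multiclass input it computed
# plain accuracy, which happens to equal the Jaccard index (0.6) for this data.
from sklearn.metrics import jaccard_similarity_score

jaccard_similarity_score(y, yp)
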
0.6

In [5]:
y_true = np.array([[0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 1, 1], [1, 0, 0]])

# binary case: the target takes only 2 values: 0/1, False/True, No/Yes
print(jaccard_score(y_true[0], y_pred[0]))

# multilabel case: the target has more than one dimension (one column per label)
print(jaccard_score(y_true, y_pred, average=None))

0.6666666666666666
[0.5 0.5 1. ]
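
The multilabel result with `average=None` can be reproduced by hand by treating each column as its own binary problem. This is a sketch in plain Python, with `rows_true`, `rows_pred`, and `column_jaccard` as illustrative names, and it assumes every column has at least one positive.

```python
# Same values as y_true and y_pred in the cell above (rows = samples, columns = labels)
rows_true = [[0, 1, 1], [1, 1, 0]]
rows_pred = [[1, 1, 1], [1, 0, 0]]


def column_jaccard(true_col, pred_col):
    """Binary Jaccard for one label column."""
    pairs = list(zip(true_col, pred_col))
    both = sum(1 for t, p in pairs if t == 1 and p == 1)
    either = sum(1 for t, p in pairs if t == 1 or p == 1)
    return both / either


# zip(*rows) transposes the row lists into label columns
per_label = [
    column_jaccard(t_col, p_col)
    for t_col, p_col in zip(zip(*rows_true), zip(*rows_pred))
]
print(per_label)  # [0.5, 0.5, 1.0]
```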


In [6]:
# multiclass case: the target takes more than 2 values/categories
y_pred = [0, 2, 1, 2]
y_true = [0, 1, 2, 2]
jaccard_score(y_true, y_pred, average=None)

array([1.        , 0.        , 0.33333333])

<hr>

### Jaccard Index for Multi-class case

In [17]:
aa = np.array([1,2,3,1,2,3])
ap = np.array([1,1,2,1,3,3])

In [21]:
# Manual calculation: the formula is applied per class (one class vs. the rest)

a1 = 2/3   # positions where aa == ap == 1, divided by positions where aa == 1 or ap == 1
a2 = 0/3   # same for class 2: no position matches, 3 positions where aa == 2 or ap == 2
a3 = 1/3   # same for class 3: 1 match, 3 positions where aa == 3 or ap == 3
a1, a2, a3

(0.6666666666666666, 0.0, 0.3333333333333333)
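
The three fractions above can be derived rather than counted by eye. In this sketch, which uses the hypothetical helper `per_class_jaccard` and assumes every class appears in at least one of the two vectors, each class is compared against the rest in turn.

```python
# Same values as aa and ap above
actual = [1, 2, 3, 1, 2, 3]
predicted = [1, 1, 2, 1, 3, 3]


def per_class_jaccard(y_true, y_pred):
    """Return {class: Jaccard index}, treating each class as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        both = sum(1 for t, p in pairs if t == c and p == c)
        either = sum(1 for t, p in pairs if t == c or p == c)
        scores[c] = both / either
    return scores


scores = per_class_jaccard(actual, predicted)
print(scores)
```

For this data every class has a union of 3 positions, which is why all three denominators are 3. That is a property of the data and has nothing to do with the number of classes.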

In [22]:
# Compute with Sklearn

print(jaccard_score(aa, ap, average=None))
print(jaccard_score(aa, ap, average='macro'))
print(jaccard_score(aa, ap, average='micro'))
print(jaccard_score(aa, ap, average='weighted'))

[0.66666667 0.         0.33333333]
0.3333333333333333
0.3333333333333333
0.3333333333333333
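
The three averages are computed differently even though they coincide here. The sketch below works them out by hand under the usual definitions: macro is the unweighted mean of the per-class scores, micro pools the intersection and union counts over all classes, and weighted is the mean of the per-class scores weighted by how often each class occurs in the actual labels. Variable names are illustrative.

```python
from collections import Counter

# Same values as aa and ap above
true_labels = [1, 2, 3, 1, 2, 3]
pred_labels = [1, 1, 2, 1, 3, 3]
pairs = list(zip(true_labels, pred_labels))
classes = sorted(set(true_labels) | set(pred_labels))

# Per-class intersection and union counts (one class vs. the rest)
inter = {c: sum(1 for t, p in pairs if t == c and p == c) for c in classes}
union = {c: sum(1 for t, p in pairs if t == c or p == c) for c in classes}
support = Counter(true_labels)  # occurrences of each class in the actual labels

per_class = {c: inter[c] / union[c] for c in classes}

macro = sum(per_class.values()) / len(classes)
micro = sum(inter.values()) / sum(union.values())
weighted = sum(per_class[c] * support[c] for c in classes) / len(true_labels)

print(macro, micro, weighted)
```

All three come out at about 0.333 for this data because every class has the same support (2) and the same union size (3); with imbalanced classes the values would generally differ.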
