# Shingling with Jaccard

Shingling compares documents by treating each one as a set of overlapping word or character n-grams (shingles) taken with a sliding window over the text. The Jaccard similarity of two documents' shingle sets, the size of their intersection divided by the size of their union, then measures how similar the pair is.
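As a small worked illustration of that definition, the sketch below computes the Jaccard similarity of two toy shingle sets. The strings and the shingle size of 2 are made up for the example and are not part of the notebook's data.

```python
# Toy example (hypothetical strings, not the notebook's documents).
a = "abcd"
b = "bcde"
size = 2

# Character shingles of length `size`, collected as sets.
set_a = {a[i:i + size] for i in range(len(a) - size + 1)}  # {'ab', 'bc', 'cd'}
set_b = {b[i:i + size] for i in range(len(b) - size + 1)}  # {'bc', 'cd', 'de'}

# Two shared shingles ('bc', 'cd') out of four distinct shingles overall.
similarity = len(set_a & set_b) / len(set_a | set_b)
print(similarity)  # 0.5
```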

In [None]:
shingle_size = 10

In [50]:
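def shingler(doc, size):
    # Slide a `size`-character window over doc; [:-size] drops the truncated tail
    # windows (and, as written, the last full-length shingle as well).
    return [doc[i:i+size] for i in range(len(doc))][:-size]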

In [51]:
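def jaccard_dist(shingle1, shingle2):
    # Despite the name, this returns the Jaccard similarity: |A & B| / |A | B|
    # (1.0 for identical sets). The Jaccard distance would be 1 minus this value.
    return len(set(shingle1) & set(shingle2)) / len(set(shingle1) | set(shingle2))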
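
Jaccard similarity and Jaccard distance are complements of each other. The sketch below is one way to keep the two apart; `jaccard_similarity` and `jaccard_distance` are hypothetical helper names rather than functions from this notebook, and returning 1.0 for two empty inputs is an assumed convention.

```python
def jaccard_similarity(shingles_a, shingles_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(shingles_a), set(shingles_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty shingle sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def jaccard_distance(shingles_a, shingles_b):
    """Complement of the similarity: 0.0 for identical sets, 1.0 for disjoint ones."""
    return 1.0 - jaccard_similarity(shingles_a, shingles_b)


print(jaccard_similarity(["ab", "bc"], ["bc", "cd"]))  # 1 shared shingle out of 3
print(jaccard_distance(["ab", "bc"], ["ab", "bc"]))    # 0.0
```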

In [52]:
document1 = """An elephant slept in his bunk,
              And in slumber his chest rose and sunk.
              But he snored how he snored!
              All the other beasts roared,
              So his wife tied a knot in his trunk."""

document2 = """A large red cow
               Tried to make a bow,
               But did not know how,
               They say.
               For her legs got mixed,
               And her horns got fixed,
               And her tail would get
               In her way."""

document3 = """An walrus slept in his bunk,
              And in slumber his chest rose and sunk.
              But he snored how he snored!
              All the other beasts roared,
              So his wife tied a knot in his whiskers."""

In [53]:
# Shingle the document; the shingler drops the trailing windows shorter than shingle_size
shingle1 = shingler(document1, shingle_size)
shingle1[0:10]

['An elephan',
 'n elephant',
 ' elephant ',
 'elephant s',
 'lephant sl',
 'ephant sle',
 'phant slep',
 'hant slept',
 'ant slept ',
 'nt slept i']
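
The shingles above are character n-grams. Word n-grams can be built the same way; the following is a minimal sketch, assuming plain whitespace tokenization is good enough. `word_shingler` is a hypothetical helper, not part of the notebook.

```python
def word_shingler(doc, size):
    """Return every run of `size` consecutive words in `doc` as a tuple."""
    words = doc.split()  # assumption: split on any whitespace, keep punctuation
    # Stop at len(words) - size so that every shingle has exactly `size` words.
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


print(word_shingler("An elephant slept in his bunk,", 3))
# [('An', 'elephant', 'slept'), ('elephant', 'slept', 'in'),
#  ('slept', 'in', 'his'), ('in', 'his', 'bunk,')]
```

Tuples are hashable, so the result can be passed to `set()` and compared in the same way as the character shingles.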

In [54]:
# Shingle the document; the shingler drops the trailing windows shorter than shingle_size
shingle2 = shingler(document2, shingle_size)
shingle2[0:10]

['A large re',
 ' large red',
 'large red ',
 'arge red c',
 'rge red co',
 'ge red cow',
 'e red cow\n',
 ' red cow\n ',
 'red cow\n  ',
 'ed cow\n   ']

In [55]:
# Shingle the document; the shingler drops the trailing windows shorter than shingle_size
shingle3 = shingler(document3, shingle_size)
shingle3[0:10]

['An walrus ',
 'n walrus s',
 ' walrus sl',
 'walrus sle',
 'alrus slep',
 'lrus slept',
 'rus slept ',
 'us slept i',
 's slept in',
 ' slept in ']

In [56]:
# Jaccard similarity is the size of the set intersection divided by the size of the set union
print(f"Document 1 and Document 2 Jaccard Similarity: {jaccard_dist(shingle1, shingle2)}")

Document 1 and Document 2 Jaccard Similarity: 0.03943661971830986


In [57]:
# Jaccard similarity is the size of the set intersection divided by the size of the set union
print(f"Document 1 and Document 3 Jaccard Similarity: {jaccard_dist(shingle1, shingle3)}")

Document 1 and Document 3 Jaccard Similarity: 0.8382352941176471


In [58]:
# Jaccard similarity is the size of the set intersection divided by the size of the set union
print(f"Document 2 and Document 3 Jaccard Similarity: {jaccard_dist(shingle2, shingle3)}")

Document 2 and Document 3 Jaccard Similarity: 0.03932584269662921
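
The three comparisons above can also be written as one loop over all document pairs. This is a sketch rather than the notebook's code: the short strings, `char_shingles`, and `similarity` are stand-ins defined here so the block runs on its own.

```python
from itertools import combinations


def char_shingles(doc, size):
    # Every full-length character window, collected as a set.
    return {doc[i:i + size] for i in range(len(doc) - size + 1)}


def similarity(set_a, set_b):
    # Jaccard similarity: intersection size over union size.
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical stand-in documents, keyed by a label.
docs = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the cat sat on the hat",
    "doc_c": "a large red cow",
}
shingle_sets = {name: char_shingles(text, 5) for name, text in docs.items()}

# Score each unordered pair of documents exactly once.
scores = {
    (left, right): similarity(shingle_sets[left], shingle_sets[right])
    for left, right in combinations(shingle_sets, 2)
}
for (left, right), score in scores.items():
    print(f"{left} vs {right}: {score:.3f}")
```

With n documents this performs n(n-1)/2 comparisons, which is fine for a handful of texts like the ones in this notebook.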
