# How to Block with MinHash LSH

Let's look at movies from IMDb.
Let's try to find all "related" movies, e.g. the sequels, prequels, and remakes.

They should have similar names, but they won't match exactly.
If we treat each title as a set of terms, e.g. the title `Batman Begins`
is `{"BATMAN", "BEGINS"}`, then we want to find other records that
have a high [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index)
to this record.
This is what the MinHashLSH algorithm does:
it lets us find record pairs with high Jaccard similarity in O(N) time.
This speedup isn't free, though. The downside we have to accept is
that MinHashLSH is probabilistic. Pairs with higher Jaccard similarity
are more likely to be blocked together, but this is not guaranteed.
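
To make the similarity measure concrete, here is a minimal pure-Python sketch of Jaccard similarity on title terms. The helper names `title_terms` and `jaccard` are made up for this illustration and are not part of mismo.

```python
def title_terms(title: str) -> set[str]:
    """Split a title into a set of uppercase terms (hypothetical helper)."""
    return set(title.upper().split(" "))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, taken to be 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


begins = title_terms("Batman Begins")    # {"BATMAN", "BEGINS"}
returns = title_terms("Batman Returns")  # {"BATMAN", "RETURNS"}

# One shared term out of three distinct terms.
print(jaccard(begins, returns))
```

With sets this small, a single differing word moves the similarity a lot, which is part of why short titles are a hard case.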

Note that the data in this example isn't ideal for MinHashLSH.
It would be better to use longer-form text, like a synopsis or description,
because then the size of our sets would be larger and more meaningful.

In [6]:
from __future__ import annotations

import ibis
from ibis import _

import mismo

ibis.options.interactive = True

In [7]:
t = ibis.examples.imdb_title_basics.fetch()
print(t.count())
t = t.head(1_000_000)
t

10814709


We create our terms from the tokens in the `primaryTitle` column.
As an additional step, we drop common stopwords such as "The".

In [3]:
t = t.select(
    record_id=ibis.row_number(),
    primaryTitle=_.primaryTitle,
    terms=_.primaryTitle.upper().split(" "),
)
t = mismo.arrays.array_filter_isin_other(
    t, "terms", mismo.sets.rare_terms(t.terms, max_records_frac=0.05)
)
t = t.cache()
t
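
As a rough illustration of the stopword step, the sketch below drops terms that appear in too large a fraction of records. It is a pure-Python approximation written for this explanation, not mismo's implementation. It assumes `max_records_frac` means "keep terms that occur in at most this fraction of records", and the function names and sample data are hypothetical.

```python
from collections import Counter


def rare_terms_sketch(records: list[list[str]], max_records_frac: float) -> set[str]:
    """Return terms that occur in at most `max_records_frac` of the records (assumed semantics)."""
    n_records = len(records)
    # Count each term once per record, so repeated words in one title don't inflate it.
    counts = Counter(term for terms in records for term in set(terms))
    return {term for term, n in counts.items() if n / n_records <= max_records_frac}


def filter_terms(terms: list[str], keep: set[str]) -> list[str]:
    """Keep only the terms that are in `keep`, preserving order."""
    return [term for term in terms if term in keep]


records = [
    ["THE", "DARK", "KNIGHT"],
    ["THE", "BATMAN"],
    ["BATMAN", "BEGINS"],
    ["THE", "LEGO", "MOVIE"],
]
# "THE" is in 3 of 4 records (0.75), so a 0.6 threshold removes it.
keep = rare_terms_sketch(records, max_records_frac=0.6)
filtered = [filter_terms(terms, keep) for terms in records]
print(filtered)
```

A much lower threshold such as 0.05 only makes sense on a large table; on four toy records it would remove every term.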

In MinHashLSH, the probability that a pair is blocked together is a function
of their Jaccard similarity. We can tune this probability using the
`band_size` and `n_bands` params of `MinhashLshBlocker`:

In [4]:
mismo.block.plot_lsh_curves()
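
For intuition about how these two parameters shape the curve, the textbook banded MinHash LSH scheme gives a closed-form estimate: a pair with Jaccard similarity `s` shares at least one band with probability `1 - (1 - s**band_size)**n_bands`. The sketch below evaluates that formula. It assumes `MinhashLshBlocker` follows the standard scheme, which this notebook does not spell out, and the function name `block_probability` is made up for illustration.

```python
def block_probability(jaccard: float, band_size: int, n_bands: int) -> float:
    """Textbook banded-LSH estimate of the chance that a pair shares at least one band."""
    # A single band matches only if all `band_size` min-hashes agree.
    p_band = jaccard**band_size
    # The pair is blocked if at least one of the `n_bands` bands matches.
    return 1.0 - (1.0 - p_band) ** n_bands


for s in (0.2, 0.5, 0.8, 1.0):
    p = block_probability(s, band_size=5, n_bands=10)
    print(f"jaccard={s:.1f} -> P(blocked)={p:.3f}")
```

Under this formula, a larger `band_size` makes bands harder to match and pushes the curve toward higher similarities, while more bands give each pair more chances to collide and push it back toward lower similarities.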

In [5]:
# Performance is still very bad. For now filter to a smaller dataset.
# Hopefully we can keep the API something like this,
# but it may have to change to get the performance we need.
k = 5_000
batmans = t.filter(_.primaryTitle.upper().contains("BATMAN"))
to_block = ibis.union(batmans, t.head(k))

# The time required is related to the band size and number of bands,
# so keep them as small as possible.
blocker = mismo.block.MinhashLshBlocker(
    terms_column="terms_filtered", band_size=5, n_bands=10
)
blocked = blocker(to_block, to_block)
# Drop pairs whose titles are identical ignoring case, so only near matches remain.
blocked2 = blocked.filter(_.primaryTitle_l.upper() != _.primaryTitle_r.upper())
blocked2 = blocked2.cache()
blocked2