In [1]:
from __future__ import annotations

import ibis
from ibis import _
from ibis.expr.types import (
    ArrayValue,
    FloatingValue,
    IntegerValue,
    StringValue,
    StructValue,
)

ibis.options.interactive = True

We are going to deduplicate the PATSTAT dataset. Every record represents a filed patent,
and our task is to determine which patents came from the same inventor.
The end goal is to add a column called `inventor_id` to each patent.
This dataset contains a ground truth label, so we can evaluate how well we did.

Mismo includes this dataset, so it is easy to get started.
The returned dataset is an Ibis table, which is a lazy representation of a SQL table.
It is similar to a pandas DataFrame, but has a few properties that make it much
better suited to record linkage:

- Since it is SQL-backed, it can handle datasets that are larger than memory, with
  many millions of rows.
- Computation is performed by the SQL backend of your choice: Google BigQuery,
  Apache Spark, Snowflake, etc. For this demo, we use DuckDB, a
  state-of-the-art SQL engine built around a columnar data model
  (i.e. oriented towards the bulk operations of record linkage).
- Ibis is strongly typed, has a full API, is well-documented, and integrates well
  with the rest of the Python data science ecosystem.


In [2]:
from mismo.datasets import load_patents  # noqa: E402

patents = load_patents()
print(patents.count())
patents

2379



In [3]:
from mismo.plot import plot_distributions  # noqa: E402

plot_distributions(patents)

Let's clean this up a bit:
- clean up whitespace
- convert the `coauthors` and `classes` columns to actual arrays (they really represent sets)

Each element in `classes` is a 4-character IPC technical code that is like a tag
for the patent. Similar patents will have similar tags.


In [4]:
from mismo.clean.strings import norm_whitespace  # noqa: E402


def clean_names(names: StringValue) -> StringValue:
    names = norm_whitespace(names)
    names = names.upper()
    # Only want to keep letters, numbers, and spaces
    names = names.re_replace("[^0-9A-Z ]", "")
    # Now have to do whitespace fixup again
    names = norm_whitespace(names)
    return names


def parse_list(s: StringValue) -> ArrayValue:
    return s.upper().split("**").map(norm_whitespace).sort()


cleaned = patents.select(
    "record_id",
    "label_true",
    "name_true",
    "name",
    name_cleaned=clean_names(_.name),
    latitude=_.latitude.nullif(0),
    longitude=_.longitude.nullif(0),
    coauthors=parse_list(_.coauthors.nullif("NONE")),
    classes=parse_list(_.classes),
)
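
To make the cleaning rules concrete, here is a plain-Python sketch of the same logic applied to single strings. It is only an illustration: `norm_ws` is a hypothetical stand-in that assumes `norm_whitespace` collapses runs of whitespace and strips the ends, and the sample inputs are made up rather than taken from the dataset.

```python
import re
from typing import List, Optional


def norm_ws(s: str) -> str:
    """Collapse runs of whitespace to one space and strip both ends."""
    return " ".join(s.split())


def clean_name(name: str) -> str:
    """Uppercase, drop everything except letters, digits, and spaces."""
    name = norm_ws(name).upper()
    name = re.sub(r"[^0-9A-Z ]", "", name)
    # Removing punctuation can leave double spaces, so normalize again.
    return norm_ws(name)


def parse_list(s: Optional[str]) -> Optional[List[str]]:
    """Split a '**'-delimited string into a sorted list of trimmed items."""
    if s is None:
        return None
    return sorted(norm_ws(part) for part in s.upper().split("**"))


print(clean_name("  Agilent  Technologies, Inc. "))
print(parse_list("a61k ** A61P**c07d"))
```

Sorting the parsed items means two records listing the same elements in a different order end up with identical arrays, which matters later when arrays are compared for exact equality.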

Then, let's add some features. The binned coordinates will be used in the blocking step,
so that locations in the same lat/lng bin will be compared to each other.

We also generate some features based on the `name` column.

In [5]:
def bin_lat_lon(lat: FloatingValue, lon: FloatingValue) -> StructValue:
    """Bin a latitude and longitude to 0.1 degree precision.

    0.1 degrees of latitude is roughly 11 km (about 7 miles).
    If both are null, return null.

    (52.35, 4.916667) -> (524, 49)
    """

    def _bin_coord(coord: FloatingValue) -> IntegerValue:
        return (coord.round(1) * 10).cast("int16").fillna(0)

    result = ibis.struct(
        {
            "lat_hash": _bin_coord(lat),
            "lon_hash": _bin_coord(lon),
        }
    )
    both_null = lat.isnull() & lon.isnull()
    return both_null.ifelse(ibis.null(), result)


featured = cleaned.mutate(
    name_tokens=_.name_cleaned.split(" ").map(norm_whitespace).sort(),
    name_first3=_.name_cleaned[0:3],
    coords_hashed=bin_lat_lon(_.latitude, _.longitude),
)
featured
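
The binning idea can be sketched in plain Python. This is a simplified illustration, not the Ibis expression above: it rounds `coord * 10` in one step, and the coordinates are approximate, made-up values for two points in one city and one point in another.

```python
from typing import Optional, Tuple


def bin_coord(coord: Optional[float]) -> int:
    """Map a coordinate to its 0.1-degree bin index; a missing value goes to bin 0."""
    if coord is None:
        return 0
    return round(coord * 10)


def bin_lat_lon(lat: Optional[float], lon: Optional[float]) -> Optional[Tuple[int, int]]:
    """Return (lat_bin, lon_bin), or None when both coordinates are missing."""
    if lat is None and lon is None:
        return None
    return (bin_coord(lat), bin_coord(lon))


point_a = bin_lat_lon(52.37, 4.916667)  # illustrative point
point_b = bin_lat_lon(52.41, 4.89)      # a few km away from point_a
point_c = bin_lat_lon(51.92, 4.48)      # a different city
print(point_a, point_b, point_c)
```

Nearby points usually share a bin, but two points on opposite sides of a bin edge do not, which is one reason to combine this with other blocking rules.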

OK, now it's time to block! This is where we generate comparisons between records.
If we were naive and generated all possible comparisons from N records,
we would end up with N(N-1)/2 comparisons, which grows like N^2. For our small dataset
of ~2000 records we would be able to get away with this, but for much larger datasets
it would be infeasible.

In [6]:
from mismo.block import BlockingRule, BlockingRules  # noqa: E402

rules = BlockingRules(
    BlockingRule("Coordinates Close", "coords_hashed"),
    BlockingRule("Name First 3", "name_first3"),
    BlockingRule("Coauthors Exact", "coauthors"),
    BlockingRule("Classes Exact", "classes"),
)

featured = featured.cache()
blocked = rules.block(featured, featured, labels=True)
blocked = blocked.cache()
blocked

The result is the two tables joined together, with a `_l` suffix added
to all the columns from the left table, and a `_r` suffix added to all the columns
from the right table. In addition, there is a column `blocking_rules` that
tells us which blocking rules generated the pair.

By blocking, we reduced the number of needed pairs by a large factor.
In larger datasets, and with better blocking rules, the reduction would be even larger!
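
As a rough mental model, blocking groups records by a key and only pairs up records that share it. The sketch below does this in plain Python on four made-up records. It is not how Mismo implements blocking (that happens as joins in the SQL backend), and it assumes each unordered pair is kept once and that a missing key never matches.

```python
from collections import defaultdict
from itertools import combinations

# Made-up records carrying two of the blocking keys used above.
records = [
    {"record_id": 1, "name_first3": "SMI", "coords_hashed": (524, 49)},
    {"record_id": 2, "name_first3": "SMI", "coords_hashed": (524, 49)},
    {"record_id": 3, "name_first3": "SMY", "coords_hashed": (524, 49)},
    {"record_id": 4, "name_first3": "JON", "coords_hashed": None},
]
rules = {"Coordinates Close": "coords_hashed", "Name First 3": "name_first3"}


def block(records, rules):
    """Return {(id_l, id_r): set of rule names that generated the pair}."""
    pairs = defaultdict(set)
    for rule_name, column in rules.items():
        buckets = defaultdict(list)
        for rec in records:
            key = rec[column]
            if key is not None:  # assumption: missing keys never match
                buckets[key].append(rec["record_id"])
        for ids in buckets.values():
            for id_l, id_r in combinations(sorted(ids), 2):
                pairs[(id_l, id_r)].add(rule_name)
    return dict(pairs)


blocked_pairs = block(records, rules)
for pair, rule_names in sorted(blocked_pairs.items()):
    print(pair, sorted(rule_names))
```

Here only 3 of the 6 possible pairs survive, and the pair (1, 2) is reached by both rules, which is the kind of information the `blocking_rules` column records.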


In [7]:
from mismo import metrics  # noqa: E402

n_comparisons = blocked.count().execute()
n_naive = metrics.n_naive_comparisons(featured)
reduction_ratio = n_comparisons / n_naive
n_naive, n_comparisons, reduction_ratio

(2828631, 631761, 0.2233451447007404)
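
The numbers above can be checked by hand. The sketch assumes that `n_naive_comparisons` counts unordered pairs, N(N-1)/2, which is consistent with the output shown.

```python
n_records = 2379
# Every unordered pair of distinct records: N * (N - 1) / 2
n_naive = n_records * (n_records - 1) // 2
n_blocked = 631_761  # pair count reported above
reduction_ratio = n_blocked / n_naive

print(n_naive)
print(round(reduction_ratio, 4))
```

In other words, blocking kept about 22% of all possible pairs.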

We can also inspect which blocking rules are responsible for most of the generated
pairs. If some rules generate a huge number of comparisons, it might be worth
trying to make them more restrictive so we get better performance. Or, if some
blocking rules aren't generating any comparisons, that might be an indication
that we have a bug in there somewhere.

In [8]:
from mismo.block import upset_plot  # noqa: E402

upset_plot(blocked)

OK, now that we have our candidate pairs, let's actually compare them.
There are many ways to do this, but one of the most common
is to generate a set of Comparison objects, each of which represents a
measurement of similarity along some dimension (e.g. "location"). Each Comparison is
composed of Levels, which represent discrete levels of agreement
(e.g. "exactly", "within 100km", "one or both values null").

In [11]:
from mismo.compare import (  # noqa: E402
    Comparison,
    ComparisonLevel,
    Comparisons,
    distance_km,
)

name_comparison = Comparison(
    name="Name",
    levels=[
        ComparisonLevel("exact", _.name_cleaned_l == _.name_cleaned_r),
        ComparisonLevel(
            "Share 1 token",
            condition=_.name_tokens_l.intersect(_.name_tokens_r).length() == 1,
        ),
        ComparisonLevel(
            "Share 2 or more tokens",
            condition=_.name_tokens_l.intersect(_.name_tokens_r).length() >= 2,
        ),
    ],
)

classes_comparison = Comparison(
    name="Classes",
    levels=[
        ComparisonLevel("exact", _.classes_l == _.classes_r),
        ComparisonLevel(
            name="Share 1 class",
            condition=_.classes_l.intersect(_.classes_r).length() == 1,
        ),
        ComparisonLevel(
            name="Share 2 or more classes",
            condition=_.classes_l.intersect(_.classes_r).length() >= 2,
        ),
    ],
)

coauthors_comparison = Comparison(
    name="Coauthors",
    levels=[
        ComparisonLevel("exact", _.coauthors_l == _.coauthors_r),
        ComparisonLevel(
            name="Share one coauthor",
            condition=_.coauthors_l.intersect(_.coauthors_r).length() >= 1,
        ),
    ],
)

coords_comparison = Comparison(
    name="Coords",
    levels=[
        ComparisonLevel(
            name="Coords match",
            condition=(_.latitude_l == _.latitude_r) & (_.longitude_l == _.longitude_r),
        ),
        ComparisonLevel(
            name="Coords within 10km",
            condition=distance_km(
                lat1=_.latitude_l,
                lon1=_.longitude_l,
                lat2=_.latitude_r,
                lon2=_.longitude_r,
            )
            <= 10,
        ),
        ComparisonLevel(
            name="Coords within 100km",
            condition=distance_km(
                lat1=_.latitude_l,
                lon1=_.longitude_l,
                lat2=_.latitude_r,
                lon2=_.longitude_r,
            )
            <= 100,
        ),
        ComparisonLevel(
            name="One or both coord missing",
            condition=_.coords_hashed_l.isnull() | _.coords_hashed_r.isnull(),
        ),
    ],
)

comparisons = Comparisons(
    name_comparison,
    classes_comparison,
    coauthors_comparison,
    coords_comparison,
)
compared = comparisons.label_pairs(blocked, how="name")
compared = compared.cache()
compared


The result above is the blocked table, with a column added for every `Comparison`.
The value of each column is the name of the level that the record pair matched at.
For example, there is now a "Name" column,
filled with values like "exact", "Share 1 token", or "else" when no level matched.
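
One way to picture how a pair gets its level is as an ordered list of conditions. The plain-Python sketch below mirrors the name comparison, under the assumption that levels are checked in order, the first true condition wins, and a pair matching nothing falls into an "else" level. The names are made up.

```python
def name_level(name_l: str, name_r: str) -> str:
    """Return the first level whose condition holds, or 'else' if none do."""
    shared = len(set(name_l.split()) & set(name_r.split()))
    levels = [
        ("exact", name_l == name_r),
        ("Share 1 token", shared == 1),
        ("Share 2 or more tokens", shared >= 2),
    ]
    for label, condition in levels:
        if condition:
            return label
    return "else"


print(name_level("JOHN SMITH", "JOHN SMITH"))
print(name_level("JOHN SMITH", "J SMITH"))
print(name_level("JOHN A SMITH", "JOHN SMITH"))
print(name_level("JOHN SMITH", "MARY JONES"))
```

Because order matters under this assumption, the strictest level goes first: an exact match also shares tokens, but it should be reported as "exact".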

Now that we have our features, we can use the Fellegi-Sunter model to train weights
for each of these features. This is a probabilistic model that is based on the concept
of odds. When you see an exact match on name, that multiplies the odds of a match
by some amount, maybe 50x. When you see a non-match on name, that multiplies the odds
of a match by some smaller amount, maybe 0.1x. We can either train this from labeled data,
or we can learn from unlabeled data with an algorithm called "Expectation Maximization".
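
For the labeled-data case, the textbook Fellegi-Sunter estimate of a level's odds is m / u, where m is the share of true matches that land in that level and u is the share of true non-matches that do. The sketch below computes this on made-up counts. It is not necessarily what `train_comparisons` does internally (for example, it applies no smoothing).

```python
from collections import Counter


def level_odds(labeled_pairs):
    """labeled_pairs: list of (level, is_match). Returns {level: m / u}."""
    match_counts = Counter(lvl for lvl, is_match in labeled_pairs if is_match)
    nonmatch_counts = Counter(lvl for lvl, is_match in labeled_pairs if not is_match)
    n_match = sum(match_counts.values())
    n_nonmatch = sum(nonmatch_counts.values())
    odds = {}
    for level in set(match_counts) | set(nonmatch_counts):
        m = match_counts[level] / n_match        # P(level | match)
        u = nonmatch_counts[level] / n_nonmatch  # P(level | non-match)
        odds[level] = m / u if u > 0 else float("inf")
    return odds


# Made-up labeled pairs: 10 true matches and 100 true non-matches.
pairs = (
    [("exact", True)] * 8
    + [("else", True)] * 2
    + [("exact", False)] * 2
    + [("else", False)] * 98
)
odds = level_odds(pairs)
print({level: round(value, 3) for level, value in sorted(odds.items())})
```

With these counts, an exact match is 40 times more common among matches than non-matches, while "else" is about 5 times less common.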

In [None]:
from mismo.fs import train_comparisons  # noqa: E402

weights = train_comparisons(comparisons, featured, featured, max_pairs=10_000, seed=42)
# Can save and load weights
# weights.to_json("weights.json")
# weights = ComparisonWeights.from_json("weights.json")
weights.plot()

In the above plot, you can see that nearly all record pairs fall into the
"else" level for the Coauthors Comparison. This indicates that we could improve
the model by making the other levels for that comparison less strict, so record pairs
are more evenly distributed between the levels, which would give our
model more discriminating power.

Use the weights to score the record pairs, finding the odds for each
Comparison, and then combining them into overall odds for the record pair.
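
A minimal sketch of that combination step: under the usual Fellegi-Sunter assumption that comparisons are conditionally independent, the per-comparison odds simply multiply. The weight table is hypothetical (the name values echo the "maybe 50x" and "maybe 0.1x" figures, the rest are made up), and the prior odds of 1.0 are an assumption, not Mismo's default.

```python
# Hypothetical weights: {comparison name: {level name: odds multiplier}}
example_weights = {
    "Name": {"exact": 50.0, "else": 0.1},
    "Coords": {"Coords within 10km": 8.0, "else": 0.5},
}


def score_pair(levels, weights, prior_odds=1.0):
    """Multiply the prior odds by the odds of each matched level."""
    odds = prior_odds
    for comparison, level in levels.items():
        odds *= weights[comparison][level]
    return odds


def odds_to_probability(odds):
    """Convert odds in favor of a match to a probability."""
    return odds / (1.0 + odds)


strong = score_pair({"Name": "exact", "Coords": "Coords within 10km"}, example_weights)
weak = score_pair({"Name": "else", "Coords": "else"}, example_weights)
print(strong, round(odds_to_probability(strong), 4))
print(weak, round(odds_to_probability(weak), 4))
```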

In [None]:
scored = weights.score(compared)
scored = scored.cache()
scored

We can plot these compared pairs.
We can see which comparison levels are most common,
which occur together,
which lead to matches, and which lead to non-matches.

Unsurprisingly, most record pairs match against the "else" levels.

The exact match levels have the highest odds, and the
"else" levels have the lowest. The other levels are somewhere in between.

In [None]:
from mismo.compare import plot_compared  # noqa: E402

plot_compared(compared, comparisons=comparisons, weights=weights)

It looks like odds of about 50 separate the non-matches from the matches.
If I hover over the above chart, I can see that pretty much all the ELSE comparisons
are in the low cluster, and all the SAME comparisons are in the high cluster.

In [None]:
odds_threshold = 50
(scored.odds >= odds_threshold).value_counts()

Let's be really picky and only take the most likely matches as true matches, and
then perform connected components to label each patent with its inventor:

In [None]:
from mismo.cluster import connected_components  # noqa: E402

links = scored[_.odds >= odds_threshold]
links = links.cache()
print(links.count().execute())
labels = connected_components(links, nodes=featured.record_id)
print(labels.count().execute())
labels

63671
2379
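
Connected components treats each accepted link as an edge and gives every record in the same connected group the same label. The union-find sketch below shows the idea on a handful of made-up record ids. It is an illustration only, not Mismo's implementation, and `components_sketch` is a hypothetical name.

```python
def components_sketch(links, nodes):
    """Label each node with the smallest node id in its connected component."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {node: find(node) for node in nodes}


example_labels = components_sketch(
    links=[(1, 2), (2, 3), (5, 6)],
    nodes=[1, 2, 3, 4, 5, 6],
)
print(example_labels)
```

Record 4 has no links, so it stays in a component of its own. Passing all record ids as nodes is why the label table above has one row per patent (2379) even though only some records appear in a link.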


Now let's evaluate how good our labeling is. Mismo wraps all of the evaluation
metrics from sklearn, so we can use them with Ibis Tables.

In [None]:
labels_true = patents.select("record_id", label=_.label_true)
labels_pred = labels.select("record_id", label=_.component)
metrics.adjusted_rand_score(labels_true, labels_pred)

0.7516296229573269

In [None]:
metrics.homogeneity_score(labels_true, labels_pred)

0.9553582188036884

The high homogeneity means each predicted cluster contains mostly patents from a single true inventor, so we have high precision and few false links.

In [None]:
metrics.completeness_score(labels_true, labels_pred)

0.7561066128881119

The lower completeness score means patents from the same true inventor are often split across several predicted clusters, so we have lower recall and are missing a lot of true links.
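
To make the precision and recall analogy concrete, the sketch below computes pairwise precision and recall on a tiny made-up labeling. Note that these are different metrics from the homogeneity and completeness scores above. They are used here only because they are easy to compute by hand and show the same pattern when clusters are split too finely.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """labels: {record_id: cluster_label}. Return the set of co-clustered pairs."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_precision_recall(labels_true, labels_pred):
    true_pairs = same_cluster_pairs(labels_true)
    pred_pairs = same_cluster_pairs(labels_pred)
    hits = len(true_pairs & pred_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 1.0
    recall = hits / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Made-up example: record 3 is wrongly split off from inventor "A".
truth = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
pred = {1: "x", 2: "x", 3: "y", 4: "z", 5: "z"}
precision, recall = pairwise_precision_recall(truth, pred)
print(precision, recall)
```

Every predicted link is correct, so precision is perfect, but half of the true links are missing, so recall suffers. Lowering the odds threshold would trade some of that precision for more recall.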