This report was automatically generated for the following data source(s) at 09:55PM on November 30, 2022:



For more information, please refer to the [PatentsView-Evaluation project homepage](https://github.com/PatentsView/PatentsView-Evaluation/) or to the report [source code](index.qmd).


In [None]:
import os
import pandas as pd
import plotly.io as pio
pio.templates.default = "plotly_white" # Set plotly theme

from er_evaluation.summary import (
    homonimy_rate,
    name_variation_rate,
)
from er_evaluation.plots import (
    compare_plots, 
    plot_cluster_sizes_distribution,
    plot_entropy_curve,
)
from pv_evaluation.benchmark import (
    inventor_benchmark_plot,
    inventor_estimates_plot,
    plot_entropy_curves,
    top_inventors,
    plot_cluster_sizes,
    plot_homonimy_rates,
    plot_name_variation_rates,
)

# Summary Statistics


In [None]:
def read_auto(datapath, dtype):
    _, ext = os.path.splitext(datapath)

    if ext == ".csv":
        return pd.read_csv(datapath, dtype=dtype)
    elif ext == ".tsv":
        return pd.read_csv(datapath, sep="\t", dtype=dtype)
    elif ext in [".parquet", ".pq", ".parq"]:
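        # read_parquet does not accept a dtype argument; cast after loading instead.
        data = pd.read_parquet(datapath)
        return data if dtype is None else data.astype(dtype)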
    elif ext in [".xlsx", ".xls"]:
        return pd.read_excel(datapath, dtype=dtype)
    else:
        raise Exception("Unsupported file type. Should be one of csv, tsv, parquet, or xlsx.")

def load_disambiguation(filename):
    data = read_auto(filename, dtype=str)
    data.set_index("mention_id", inplace=True)
    disambiguation = data.iloc[:, 0]
    disambiguation.rename(filename, inplace=True)

    return disambiguation

disambiguations = {filename: load_disambiguation(filename) for filename in ['disambiguation_20211230.tsv', 'disambiguation_20220630.tsv']}

inventor_not_disambiguated = read_auto("g_inventor_not_disambiguated.tsv", dtype=str)
inventor_not_disambiguated["mention_id"] = "US" + inventor_not_disambiguated.patent_id + "-" + inventor_not_disambiguated.inventor_sequence
inventor_not_disambiguated.set_index("mention_id", inplace=True)

full_names = inventor_not_disambiguated.raw_inventor_name_first + " " + inventor_not_disambiguated.raw_inventor_name_last
full_names = full_names.rename("full_name")

::: {.panel-tabset}

## Cluster Sizes

:::{.callout-tip collapse="true"}

## Detailed Explanation (click to expand)

The plot below provides the number of inventors with a given number of (co-)authored patents. 

We can read from the plot the number of inventors with a single authored patent, with exactly two authored patents, and so forth. This distribution of the number of authored patents per inventor is called the **cluster sizes distribution** of the disambiguation.

When comparing disambiguation results, look for shifts in the shape of the cluster sizes distribution. Is one of the distributions more concentrated on the left (small cluster sizes) than another? This could indicate that one of the disambiguations favors smaller clusters, possibly resulting in higher precision but lower recall.
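
As a minimal sketch of how such a distribution can be tabulated, the following counts the number of clusters of each size using only the standard library. The function name `cluster_size_distribution` and the toy ids are illustrative assumptions; this is not the implementation behind `plot_cluster_sizes`.

```python
from collections import Counter


def cluster_size_distribution(membership):
    """Map each cluster size to the number of clusters (inventors) of that size.

    `membership` maps mention ids to cluster (inventor) ids. A plain dict works,
    and so does a pandas Series indexed by mention id, since both expose `.items()`.
    """
    # Number of mentions (authored patents) per inventor cluster.
    sizes = Counter(cluster_id for _, cluster_id in membership.items())
    # Number of inventor clusters for each cluster size, sorted by size.
    return dict(sorted(Counter(sizes.values()).items()))


# Toy data (hypothetical mention and inventor ids).
toy = {"US1-0": "a", "US2-0": "a", "US3-0": "b", "US4-0": "c", "US4-1": "d", "US5-0": "d"}
distribution = cluster_size_distribution(toy)
print(distribution)  # {1: 2, 2: 2}
```

Here two inventors have a single patent and two inventors have exactly two patents.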

:::

:::{.column-body-outset}

In [None]:
plot_cluster_sizes(disambiguations)

:::

## Top Inventors

:::{.callout-tip collapse="true"}

## Detailed Explanation (click to expand)

The table below provides the inventors with the largest number of authored patents. 

Make sure to sort the table by number of patents.

When comparing disambiguation results, look for large changes in the ranking of inventors. Large changes in the estimated number of authored patents may also warrant investigating the behavior of the disambiguation for these prolific inventors.

:::

::: {.panel-tabset}



## disambiguation_20211230.tsv


In [None]:
from IPython.display import display, HTML

display(HTML(top_inventors(disambiguations["disambiguation_20211230.tsv"], full_names).to_html()))

## disambiguation_20220630.tsv


In [None]:
from IPython.display import display, HTML

display(HTML(top_inventors(disambiguations["disambiguation_20220630.tsv"], full_names).to_html()))

:::

## Entropy Curve

:::{.callout-tip collapse="true"}

## Detailed Explanation (click to expand)

The Hill Numbers entropy curve is a characterization of the cluster sizes distribution.

It is based on [Hill Numbers](https://en.wikipedia.org/wiki/Diversity_index) of order $q$, which are exponentiated Rényi entropies of order $q$. That is, for a given $q > 0$ with $q \neq 1$, and for $p_i$ the proportion of inventors with $i$ authored patents, the corresponding Hill Number is defined as
$$
    H_q = \left ( \sum_{i=1}^{\infty} p_i^{q}\right )^{1/(1-q)}
$$
The Hill Numbers entropy curve is the plot of Hill Numbers as a function of $q > 0$.

For $q=0$, the Hill Number is defined as the number of indices $i$ such that $p_i > 0$. This is the size of the support of the cluster sizes distribution
$$
    H_0 = \# \left\{ i > 0 \,:\, p_i > 0  \right\}.
$$

For $q=1$, the Hill Number is defined as the exponentiated Shannon entropy
$$
    H_1 = \exp \left ( - \sum_{i=1}^{\infty} p_i \log p_i \right ).
$$

For $q=2$, the Hill number is the inverse of the probability that two randomly sampled inventors have the same number of authored patents:
$$
    H_2 = \left ( \sum_{i=1}^{\infty} p_i^2 \right )^{-1}.
$$

When comparing disambiguation results, look for major relative differences between entropy curves. These represent differences in the cluster sizes distribution which can be further investigated using the cluster sizes distribution plot.

Higher Hill Numbers represent a more spread-out cluster sizes distribution, while lower values represent more peaked distributions. The order $q$ of the Hill Numbers represents how the cluster size proportions $p_i > 0$ are weighted: larger values of $q$ put more weight on the most common cluster sizes. At $q = 0$, we have the number of distinct cluster sizes in the data. As $q \rightarrow \infty$, the Hill Number tends to the inverse of the proportion of inventors with the most common cluster size, $1 / \max_i p_i$.
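
As a rough illustration of these definitions, the sketch below computes Hill Numbers from a list of proportions using only the standard library. The function name `hill_number` and the toy proportions are illustrative assumptions; this is not the code used by `plot_entropy_curves`.

```python
import math


def hill_number(proportions, q):
    """Hill Number of order q >= 0 for proportions p_i that sum to 1."""
    p = [x for x in proportions if x > 0]
    if q == 0:
        # Size of the support of the distribution.
        return float(len(p))
    if q == 1:
        # Exponentiated Shannon entropy (the limit of the general formula as q -> 1).
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x**q for x in p) ** (1.0 / (1.0 - q))


# Toy cluster sizes distribution: half of the inventors have one patent,
# a quarter have two patents, and a quarter have three patents.
p = [0.5, 0.25, 0.25]
curve = {q: hill_number(p, q) for q in (0, 0.5, 1, 2, 10)}

# For large q the value approaches 1 / max(p), which is 2 here.
large_q = hill_number(p, 200)
```

On this toy distribution, the values decrease from 3 at $q = 0$ toward 2 as $q$ grows, which matches the reading of the curve as moving from the support size to the inverse of the largest proportion.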

:::

:::{.column-body-outset}

In [None]:
plot_entropy_curves(disambiguations)

:::

## Cluster Homogeneity

:::{.callout-tip collapse="true"}

## Detailed Explanation (click to expand)

Cluster homogeneity is the level of similarity among a cluster's elements. For inventor disambiguation, this is about the variation in the way that an inventor is represented on their patents (e.g., different name spellings).

Here we look at cluster homogeneity from a binary perspective -- whether or not there is variation, within a cluster, in how an inventor's name is spelled. The proportion of inventors with a unique name (no name variation within its cluster) is our metric of cluster homogeneity called the **homogeneity rate**.

In the plot below, the homogeneity rate (i.e., clusters with no name variation) is plotted as a function of cluster size. For inventors with a single patent, the proportion of homogeneous clusters is trivially 1. For inventors with two patents, we can read off the proportion of them with no name variation, and so forth.

When comparing two disambiguation results, look for changes in the homogeneity rate across cluster sizes. A higher homogeneity rate means possibly smaller, more robust clusters. On the other hand, lower homogeneity rates may be associated with an increased error probability.
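
A minimal sketch of the homogeneity rate by cluster size, following the definition above, is given below. The function name `homogeneity_rate_by_size` and the toy data are illustrative assumptions, and the sketch assumes that every mention id has an entry in `names`; it is not the implementation used in `er_evaluation`.

```python
from collections import defaultdict


def homogeneity_rate_by_size(membership, names):
    """Share of clusters with a single distinct name, for each cluster size.

    `membership` maps mention ids to cluster ids and `names` maps mention ids
    to name strings (dicts or pandas Series indexed by mention id).
    """
    cluster_names = defaultdict(list)
    for mention_id, cluster_id in membership.items():
        cluster_names[cluster_id].append(names[mention_id])

    totals = defaultdict(int)
    homogeneous = defaultdict(int)
    for name_list in cluster_names.values():
        size = len(name_list)
        totals[size] += 1
        # A cluster is homogeneous when all of its mentions carry the same name.
        homogeneous[size] += len(set(name_list)) == 1
    return {size: homogeneous[size] / totals[size] for size in sorted(totals)}


# Toy data (hypothetical ids and names).
toy_membership = {"m1": "a", "m2": "a", "m3": "b", "m4": "b", "m5": "c"}
toy_names = {"m1": "Ann Lee", "m2": "Ann Lee", "m3": "Bo Chan", "m4": "B. Chan", "m5": "Cy Diaz"}
rates = homogeneity_rate_by_size(toy_membership, toy_names)
print(rates)  # {1: 1.0, 2: 0.5}
```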

:::

:::{.column-body-outset}

In [None]:
plot_name_variation_rates(disambiguations, full_names)

:::

## Between-cluster similarity

:::{.callout-tip collapse="true"}

## Detailed Explanation (click to expand)

Between-cluster similarity is the level of similarity between different clusters. For inventor disambiguation, this is about different inventors that have similar representations on some patents, such as having the same names.

Here we look at between-cluster similarity from a binary perspective -- whether or not an inventor's name is shared with another inventor. The proportion of inventors sharing their name with someone else is our metric of between-cluster similarity which we call the **homonymy rate**.

In the plot below, the homonymy rate is plotted as a function of cluster sizes.
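
The sketch below shows one way to compute a homonymy rate by cluster size, counting a cluster as sharing its name when any of its names also appears in another cluster. This reading of the definition, the function name `homonymy_rate_by_size`, and the toy data are assumptions made for illustration; the `er_evaluation` implementation may differ.

```python
from collections import defaultdict


def homonymy_rate_by_size(membership, names):
    """Share of clusters sharing a name with another cluster, for each cluster size."""
    clusters_by_name = defaultdict(set)
    names_by_cluster = defaultdict(set)
    size_of = defaultdict(int)
    for mention_id, cluster_id in membership.items():
        name = names[mention_id]
        clusters_by_name[name].add(cluster_id)
        names_by_cluster[cluster_id].add(name)
        size_of[cluster_id] += 1

    totals = defaultdict(int)
    shared = defaultdict(int)
    for cluster_id, cluster_names in names_by_cluster.items():
        size = size_of[cluster_id]
        totals[size] += 1
        # The cluster shares a name if one of its names occurs in more than one cluster.
        shared[size] += any(len(clusters_by_name[n]) > 1 for n in cluster_names)
    return {size: shared[size] / totals[size] for size in sorted(totals)}


# Toy data (hypothetical ids and names): clusters "a" and "b" share a name.
toy_membership = {"m1": "a", "m2": "a", "m3": "b", "m4": "c"}
toy_names = {"m1": "John Smith", "m2": "John Smith", "m3": "John Smith", "m4": "Mei Wong"}
homonymy = homonymy_rate_by_size(toy_membership, toy_names)
print(homonymy)  # {1: 0.5, 2: 1.0}
```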

:::

:::{.column-body-outset}

In [None]:
plot_homonimy_rates(disambiguations, full_names)

:::

:::


# Benchmark Evaluation

::: {.panel-tabset}

## Metrics for Benchmark Datasets (Non-Representative)

:::{.callout-tip collapse="true"}

## Detailed Explanation (click to expand)

Performance evaluation metrics compare predicted clusters to a reference clustering from benchmark datasets.

Note that to compute performance evaluation metrics, predicted clusters are first restricted to mentions which appear in the reference data. Performance evaluation metrics on benchmark datasets are not at all representative of performance on the full data.

Commonly used metrics are defined below.

- **Pairwise precision:** Proportion of predicted links (pairs of mentions in the same predicted cluster) which are also linked under the reference clustering.
- **Pairwise recall:** Proportion of links in the reference clustering (pairs of mentions in the same reference cluster) which are also linked under the predicted clustering.
- **Pairwise f-score:** Harmonic mean between pairwise precision and pairwise recall.
- **Cluster precision:** Proportion of predicted clusters which are entirely contained within a single reference cluster. That is, cluster precision is the proportion of predicted clusters which are not split up in the reference clustering.
- **Cluster recall:** Proportion of reference clusters which are entirely contained within a single predicted cluster. That is, cluster recall is the proportion of reference clusters which are not split up in the predicted clusters.
- **Cluster f-score:** Harmonic mean between cluster precision and cluster recall.
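
The definitions listed above can be sketched on toy data as follows. This is a simplified illustration that follows the wording of the list, not necessarily how `er_evaluation` computes the metrics. All function names and ids are hypothetical, and the sketch assumes that every reference mention also has a prediction and that both clusterings contain at least one link.

```python
from collections import defaultdict
from itertools import combinations


def clusters(membership):
    """Group mention ids into sets, one set per cluster id."""
    groups = defaultdict(set)
    for mention_id, cluster_id in membership.items():
        groups[cluster_id].add(mention_id)
    return list(groups.values())


def links(membership):
    """All unordered pairs of mentions that share a cluster."""
    return {
        frozenset(pair)
        for group in clusters(membership)
        for pair in combinations(sorted(group), 2)
    }


def contained_rate(inner, outer):
    """Proportion of `inner` clusters entirely contained in one `outer` cluster."""
    inner_groups, outer_groups = clusters(inner), clusters(outer)
    contained = sum(any(g <= o for o in outer_groups) for g in inner_groups)
    return contained / len(inner_groups)


def f_score(precision, recall):
    """Harmonic mean, taken to be 0 when both inputs are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(prediction, reference):
    # Restrict the prediction to mentions that appear in the reference data.
    prediction = {m: c for m, c in prediction.items() if m in reference}
    predicted_links, reference_links = links(prediction), links(reference)
    common = predicted_links & reference_links
    result = {
        "pairwise_precision": len(common) / len(predicted_links),
        "pairwise_recall": len(common) / len(reference_links),
        "cluster_precision": contained_rate(prediction, reference),
        "cluster_recall": contained_rate(reference, prediction),
    }
    result["pairwise_f_score"] = f_score(result["pairwise_precision"], result["pairwise_recall"])
    result["cluster_f_score"] = f_score(result["cluster_precision"], result["cluster_recall"])
    return result


# Toy data: mention "m6" is not in the reference and is dropped before scoring.
reference = {"m1": "r1", "m2": "r1", "m3": "r1", "m4": "r2", "m5": "r2"}
prediction = {"m1": "p1", "m2": "p1", "m3": "p2", "m4": "p2", "m5": "p3", "m6": "p4"}
metrics = evaluate(prediction, reference)
```

In this example, one of the two predicted links is correct (pairwise precision 0.5) and one of the four reference links is recovered (pairwise recall 0.25). Two of the three predicted clusters sit inside a single reference cluster, while no reference cluster sits inside a single predicted cluster.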

:::

:::{.column-body-outset}

In [None]:
fig = inventor_benchmark_plot(disambiguations)
fig.update_layout(autosize=False, width=800)
fig.show()

:::

## Predicted Cluster Errors

:::{.callout-tip collapse="true"}

## Detailed Explanation (click to expand)

Cluster error tabs (predicted cluster errors and reference cluster errors) allow you to investigate clusters which contain errors.

**Predicted cluster errors** are about predicted clusters which should have been split up in two or more parts, according to the reference data.

To investigate predicted cluster errors, sort the table by `prediction` id and look for mismatching `reference` ids within predicted clusters.

**Reference cluster errors** are about predicted clusters which should be merged together, according to the reference data.

To investigate reference cluster errors, sort the table by `reference` id and look for mismatching `prediction` ids within reference clusters.
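
As a rough sketch of the idea behind both tabs, the following finds predicted clusters that span several reference ids (candidates to split) and reference clusters that span several predicted ids (candidates to merge). The helper `inconsistent_groups` and the toy ids are hypothetical; this is not the implementation of `inspect_clusters_to_split` or `inspect_clusters_to_merge`.

```python
from collections import defaultdict


def inconsistent_groups(group_by, other):
    """Groups of `group_by` whose mentions carry more than one id in `other`.

    Both arguments map mention ids to cluster ids. Mentions missing from
    `other` are ignored.
    """
    ids_seen = defaultdict(set)
    for mention_id, group_id in group_by.items():
        if mention_id in other:
            ids_seen[group_id].add(other[mention_id])
    return {g: sorted(ids) for g, ids in ids_seen.items() if len(ids) > 1}


# Toy data (hypothetical ids).
reference = {"m1": "r1", "m2": "r1", "m3": "r1", "m4": "r2", "m5": "r2"}
prediction = {"m1": "p1", "m2": "p1", "m3": "p2", "m4": "p2", "m5": "p3", "m6": "p4"}

# Predicted clusters mixing several reference ids: candidates to split.
to_split = inconsistent_groups(prediction, reference)
# Reference clusters spread over several predicted ids: candidates to merge.
to_merge = inconsistent_groups(reference, prediction)

print(to_split)  # {'p2': ['r1', 'r2']}
print(to_merge)  # {'r1': ['p1', 'p2'], 'r2': ['p2', 'p3']}
```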

:::


In [None]:
from IPython.display import display, HTML

from pv_evaluation.benchmark import (
    load_israeli_inventors_benchmark,
    load_patentsview_inventors_benchmark,
    load_lai_2011_inventors_benchmark,
    load_als_inventors_benchmark,
    load_ens_inventors_benchmark,
    inspect_clusters_to_split,
    inspect_clusters_to_merge,
    style_cluster_inspection,
)

def style_and_display(table, by="prediction"):
    # Styler.hide(axis="index") replaces hide_index(), which was removed in pandas 2.0.
    return display(HTML(style_cluster_inspection(table.reset_index(), by=by).hide(axis="index").to_html()))

::: {.panel-tabset}



### disambiguation_20211230.tsv

::: {.panel-tabset}

#### Israeli Inventors


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20211230.tsv"], 
        load_israeli_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
    by="prediction")

#### PatentsView Inventors


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20211230.tsv"], 
        load_patentsview_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="prediction")

#### Lai's Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20211230.tsv"], 
        load_lai_2011_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="prediction")

#### ALS Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20211230.tsv"], 
        load_als_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="prediction")

#### ENS Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20211230.tsv"], 
        load_ens_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="prediction")

:::



### disambiguation_20220630.tsv

::: {.panel-tabset}

#### Israeli Inventors


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20220630.tsv"], 
        load_israeli_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
    by="prediction")

#### PatentsView Inventors


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20220630.tsv"], 
        load_patentsview_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="prediction")

#### Lai's Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20220630.tsv"], 
        load_lai_2011_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="prediction")

#### ALS Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20220630.tsv"], 
        load_als_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="prediction")

#### ENS Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_split(
        disambiguations["disambiguation_20220630.tsv"], 
        load_ens_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="prediction")

:::



:::

## Reference Cluster Errors

:::{.callout-tip collapse="true"}

## Detailed Explanation (click to expand)

Cluster error tabs (predicted cluster errors and reference cluster errors) allow you to investigate clusters which contain errors.

**Predicted cluster errors** are about predicted clusters which should have been split up in two or more parts, according to the reference data.

To investigate predicted cluster errors, sort the table by `prediction` id and look for mismatching `reference` ids within predicted clusters.

**Reference cluster errors** are about predicted clusters which should be merged together, according to the reference data.

To investigate reference cluster errors, sort the table by `reference` id and look for mismatching `prediction` ids within reference clusters.

:::

::: {.panel-tabset}



### disambiguation_20211230.tsv

::: {.panel-tabset}

#### Israeli Inventors


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20211230.tsv"], 
        load_israeli_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

#### PatentsView Inventors


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20211230.tsv"], 
        load_patentsview_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

#### Lai's Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20211230.tsv"], 
        load_lai_2011_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

#### ALS Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20211230.tsv"], 
        load_als_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

#### ENS Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20211230.tsv"], 
        load_ens_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

:::



### disambiguation_20220630.tsv

::: {.panel-tabset}

#### Israeli Inventors


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20220630.tsv"], 
        load_israeli_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

#### PatentsView Inventors


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20220630.tsv"], 
        load_patentsview_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

#### Lai's Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20220630.tsv"], 
        load_lai_2011_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

#### ALS Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20220630.tsv"], 
        load_als_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

#### ENS Benchmark


In [None]:
#| column: page
style_and_display(
    inspect_clusters_to_merge(
        disambiguations["disambiguation_20220630.tsv"], 
        load_ens_inventors_benchmark(),
        join_with=inventor_not_disambiguated,
        links=True), 
        by="reference")

:::



:::

:::